1. 07 Dec, 2022 1 commit
  2. 06 Dec, 2022 1 commit
  3. 02 Dec, 2022 1 commit
  4. 30 Nov, 2022 2 commits
    • nvme: fix SRCU protection of nvme_ns_head list · 899d2a05
      Caleb Sander authored
      Walking the nvme_ns_head siblings list is protected by the head's srcu
      in nvme_ns_head_submit_bio() but not nvme_mpath_revalidate_paths().
      Removing namespaces from the list also fails to synchronize the srcu.
      Concurrent scan work can therefore cause use-after-frees.
      
      Hold the head's srcu lock in nvme_mpath_revalidate_paths() and
      synchronize with the srcu, not the global RCU, in nvme_ns_remove().
      
      Observed the following panic when making NVMe/RDMA connections
      with native multipath on the Rocky Linux 8.6 kernel
      (it seems the upstream kernel has the same race condition).
      Disassembly shows the faulting instruction is cmp 0x50(%rdx),%rcx;
      computing capacity != get_capacity(ns->disk).
      Address 0x50 is dereferenced because ns->disk is NULL.
      The NULL disk appears to be the result of concurrent scan work
      freeing the namespace (note the log line in the middle of the panic).
      
      [37314.206036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
      [37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
      [37314.299753] PGD 0 P4D 0
      [37314.299756] Oops: 0000 [#1] SMP PTI
      [37314.299759] CPU: 29 PID: 322046 Comm: kworker/u98:3 Kdump: loaded Tainted: G        W      X --------- -  - 4.18.0-372.32.1.el8test86.x86_64 #1
      [37314.299762] Hardware name: Dell Inc. PowerEdge R720/0JP31P, BIOS 2.7.0 05/23/2018
      [37314.299763] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [37314.299783] RIP: 0010:nvme_mpath_revalidate_paths+0x26/0xb0 [nvme_core]
      [37314.299790] Code: 1f 44 00 00 66 66 66 66 90 55 53 48 8b 5f 50 48 8b 83 c8 c9 00 00 48 8b 13 48 8b 48 50 48 39 d3 74 20 48 8d 42 d0 48 8b 50 20 <48> 3b 4a 50 74 05 f0 80 60 70 ef 48 8b 50 30 48 8d 42 d0 48 39 d3
      [37315.058803] RSP: 0018:ffffabe28f913d10 EFLAGS: 00010202
      [37315.121316] RAX: ffff927a077da800 RBX: ffff92991dd70000 RCX: 0000000001600000
      [37315.206704] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff92991b719800
      [37315.292106] RBP: ffff929a6b70c000 R08: 000000010234cd4a R09: c0000000ffff7fff
      [37315.377501] R10: 0000000000000001 R11: ffffabe28f913a30 R12: 0000000000000000
      [37315.462889] R13: ffff92992716600c R14: ffff929964e6e030 R15: ffff92991dd70000
      [37315.548286] FS:  0000000000000000(0000) GS:ffff92b87fb80000(0000) knlGS:0000000000000000
      [37315.645111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [37315.713871] CR2: 0000000000000050 CR3: 0000002208810006 CR4: 00000000000606e0
      [37315.799267] Call Trace:
      [37315.828515]  nvme_update_ns_info+0x1ac/0x250 [nvme_core]
      [37315.892075]  nvme_validate_or_alloc_ns+0x2ff/0xa00 [nvme_core]
      [37315.961871]  ? __blk_mq_free_request+0x6b/0x90
      [37316.015021]  nvme_scan_work+0x151/0x240 [nvme_core]
      [37316.073371]  process_one_work+0x1a7/0x360
      [37316.121318]  ? create_worker+0x1a0/0x1a0
      [37316.168227]  worker_thread+0x30/0x390
      [37316.212024]  ? create_worker+0x1a0/0x1a0
      [37316.258939]  kthread+0x10a/0x120
      [37316.297557]  ? set_kthread_struct+0x50/0x50
      [37316.347590]  ret_from_fork+0x35/0x40
      [37316.390360] Modules linked in: nvme_rdma nvme_tcp(X) nvme_fabrics nvme_core netconsole iscsi_tcp libiscsi_tcp dm_queue_length dm_service_time nf_conntrack_netlink br_netfilter bridge stp llc overlay nft_chain_nat ipt_MASQUERADE nf_nat xt_addrtype xt_CT nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_multiport nft_compat nf_tables libcrc32c nfnetlink dm_multipath tg3 rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel ib_uverbs rapl intel_cstate intel_uncore ib_core ipmi_si joydev mei_me pcspkr ipmi_devintf mei lpc_ich wmi ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 mlx5_core drm_kms_helper syscopyarea
      [37316.390419]  sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel libata mlxfw pci_hyperv_intf tls i2c_algo_bit psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nvme_core]
      [37317.645908] CR2: 0000000000000050
      
      Fixes: e7d65803 ("nvme-multipath: revalidate paths during rescan")
      Signed-off-by: Caleb Sander <csander@purestorage.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: clear the prp2 field when not used · a56ea614
      Lei Rao authored
      If nvme_setup_prp_simple() does not fill in the prp2 field, the field
      contains garbage data. According to the NVMe spec, prp2 is reserved if
      the data transfer does not cross a memory page boundary, so clear it to
      zero when it is not used.
      Signed-off-by: Lei Rao <lei.rao@intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  5. 24 Nov, 2022 1 commit
  6. 23 Nov, 2022 4 commits
  7. 22 Nov, 2022 1 commit
  8. 18 Nov, 2022 1 commit
  9. 16 Nov, 2022 7 commits
  10. 15 Nov, 2022 2 commits
  11. 14 Nov, 2022 1 commit
  12. 10 Nov, 2022 1 commit
  13. 09 Nov, 2022 3 commits
  14. 08 Nov, 2022 2 commits
    • block: sed-opal: kmalloc the cmd/resp buffers · f829230d
      Serge Semin authored
      In accordance with [1], DMA-able memory buffers must be
      cacheline-aligned; otherwise the cache write-back and invalidation
      performed during the mapping may cause adjacent data to be lost. This is
      specifically required on DMA-noncoherent platforms [2]. Since the
      opal_dev.{cmd,resp} buffers are implicitly used for DMA in the NVMe and
      SCSI/SD drivers, via the nvme_sec_submit() and sd_sec_submit() methods
      respectively, they must be cacheline-aligned to prevent this problem.
      One option to guarantee that is to kmalloc the buffers [2]. Allocate
      them explicitly instead of embedding them in the opal_dev structure
      instance.
      
      Note this fix was inspired by the commit c94b7f9b ("nvme-hwmon:
      kmalloc the NVME SMART log buffer").
      
      [1] Documentation/core-api/dma-api.rst
      [2] Documentation/core-api/dma-api-howto.rst
      
      Fixes: 455a7b23 ("block: Add Sed-opal library")
      Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221107203944.31686-1-Sergey.Semin@baikalelectronics.ru
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: fix null pointer dereference in bfq_bio_bfqg() · f02be900
      Yu Kuai authored
      Our test found the following problem in kernel 5.10, and the same problem
      should exist in mainline:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000094
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP
      CPU: 7 PID: 155 Comm: kworker/7:1 Not tainted 5.10.0-01932-g19e0ace2ca1d-dirty 4
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4
      Workqueue: kthrotld blk_throtl_dispatch_work_fn
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       bfq_bic_update_cgroup+0x3c/0x350
       ? ioc_create_icq+0x42/0x270
       bfq_init_rq+0xfd/0x1060
       bfq_insert_requests+0x20f/0x1cc0
       ? ioc_create_icq+0x122/0x270
       blk_mq_sched_insert_requests+0x86/0x1d0
       blk_mq_flush_plug_list+0x193/0x2a0
       blk_flush_plug_list+0x127/0x170
       blk_finish_plug+0x31/0x50
       blk_throtl_dispatch_work_fn+0x151/0x190
       process_one_work+0x27c/0x5f0
       worker_thread+0x28b/0x6b0
       ? rescuer_thread+0x590/0x590
       kthread+0x153/0x1b0
       ? kthread_flush_work+0x170/0x170
       ret_from_fork+0x1f/0x30
      Modules linked in:
      CR2: 0000000000000094
      ---[ end trace e2e59ac014314547 ]---
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Root cause is quite complex:
      
      1) use bfq elevator for the test device.
      2) create a cgroup CG
      3) config blk throtl in CG
      
         blkg_conf_prep
          blkg_create
      
      4) create a thread T1 and issue async io in CG:
      
         bio_init
          bio_associate_blkg
         ...
         submit_bio
          submit_bio_noacct
           blk_throtl_bio -> io is throttled
           // io submit is done
      
      5) switch elevator:
      
         bfq_exit_queue
          blkcg_deactivate_policy
           list_for_each_entry(blkg, &q->blkg_list, q_node)
            blkg->pd[] = NULL
            // bfq policy is removed
      
      6) thread T1 exits, then remove the cgroup CG:

         blkcg_unpin_online
          blkcg_destroy_blkgs
           blkg_destroy
            list_del_init(&blkg->q_node)
            // blkg is removed from queue list

      7) switch elevator back to bfq

       bfq_init_queue
        bfq_create_group_hierarchy
         blkcg_activate_policy
          list_for_each_entry_reverse(blkg, &q->blkg_list)
           // blkg is removed from list, hence bfq policy is still NULL

      8) throttled io is dispatched to bfq:
      
       bfq_insert_requests
        bfq_init_rq
         bfq_bic_update_cgroup
          bfq_bio_bfqg
           bfqg = blkg_to_bfqg(blkg)
           // bfqg is NULL because bfq policy is NULL
      
      The problem is only possible in bfq because only bfq can be deactivated and
      activated while the queue is online; other policies can only be deactivated
      when the device is removed.

      Fix the problem in bfq by checking if blkg is online before calling
      blkg_to_bfqg().
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221108103434.2853269-1-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 01 Nov, 2022 1 commit
  16. 31 Oct, 2022 6 commits
  17. 28 Oct, 2022 1 commit
  18. 27 Oct, 2022 3 commits
    • Merge tag 'nvme-6.1-2022-10-27' of git://git.infradead.org/nvme into block-6.1 · dea31328
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.1
      
       - make the multipath dma alignment match the non-multipath one
         (Keith Busch)
       - fix a bogus use of sg_init_marker() (Nam Cao)
       - fix circular locking in nvme-tcp (Sagi Grimberg)"
      
      * tag 'nvme-6.1-2022-10-27' of git://git.infradead.org/nvme:
        nvme-multipath: set queue dma alignment to 3
        nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
        nvme-tcp: replace sg_init_marker() with sg_init_table()
    • blk-mq: don't add non-pt request with ->end_io to batch · 2d87d455
      Ming Lei authored
      dm-rq implements the ->end_io callback for requests issued to the
      underlying queue, and these are not passthrough requests.

      Commit ab3e1d3b ("block: allow end_io based requests in the completion
      batch handling") doesn't clear rq->bio and rq->__data_len for requests
      with ->end_io in blk_mq_end_request_batch(). This is actually
      dangerous, but so far it has only affected nvme passthrough requests.

      dm-rq needs to clean up the remaining bios in case of partial completion,
      and req->bio is required for that, so a use-after-free is triggered. The
      underlying clone request therefore can't be completed in
      blk_mq_end_request_batch().

      Fix the panic by not adding such requests to the batch list. The issue
      can be triggered simply by exposing nvme pci to dm-mpath.
      
      Fixes: ab3e1d3b ("block: allow end_io based requests in the completion batch handling")
      Cc: dm-devel@redhat.com
      Cc: Mike Snitzer <snitzer@kernel.org>
      Reported-by: Changhui Zhong <czhong@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20221027085709.513175-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • rbd: fix possible memory leak in rbd_sysfs_init() · 7f21735f
      Yang Yingliang authored
      If device_register() returns an error in rbd_sysfs_init(), the kobject
      name allocated by dev_set_name(), which is called from device_add(), is
      leaked.

      As the comment on device_add() says, the caller should call put_device()
      to drop the reference count that was set in device_initialize() when it
      fails, so that the name can be freed in kobject_cleanup().

      A fault injection test can trigger this problem:
      
      unreferenced object 0xffff88810173aa78 (size 8):
        comm "modprobe", pid 247, jiffies 4294714278 (age 31.789s)
        hex dump (first 8 bytes):
          72 62 64 00 81 88 ff ff                          rbd.....
        backtrace:
          [<00000000f58fae56>] __kmalloc_node_track_caller+0x44/0x1b0
          [<00000000bdd44fe7>] kstrdup+0x3a/0x70
          [<00000000f7844d0b>] kstrdup_const+0x63/0x80
          [<000000001b0a0eeb>] kvasprintf_const+0x10b/0x190
          [<00000000a47bd894>] kobject_set_name_vargs+0x56/0x150
          [<00000000d5edbf18>] dev_set_name+0xab/0xe0
          [<00000000f5153e80>] device_add+0x106/0x1f20
      
      Fixes: dfc5606d ("rbd: replace the rbd sysfs interface")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
      Link: https://lore.kernel.org/r/20221027091918.2294132-1-yangyingliang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 25 Oct, 2022 1 commit