1. 26 May, 2021 2 commits
  2. 25 May, 2021 3 commits
  3. 20 May, 2021 3 commits
  4. 19 May, 2021 5 commits
    • James Smart's avatar
      nvme-fc: clear q_live at beginning of association teardown · a7d13914
      James Smart authored
      The __nvmf_check_ready() routine used to bounce all filesystem io if the
      controller state isn't LIVE.  However, a later patch changed the logic so
      that it rejection ends up being based on the Q live check.  The FC
      transport has a slightly different sequence from rdma and tcp for
      shutting down queues/marking them non-live.  FC marks its queue non-live
      after aborting all ios and waiting for their termination, leaving a
      rather large window for filesystem io to continue to hit the transport.
      Unfortunately this resulted in filesystem I/O or applications seeing I/O
      errors.
      
      Change the FC transport to mark the queues non-live at the first sign of
      teardown for the association (when I/O is initially terminated).
      
      Fixes: 73a53799 ("nvme-fabrics: allow to queue requests for live queues")
      Signed-off-by: default avatarJames Smart <jsmart2021@gmail.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarHimanshu Madhani <himanshu.madhani@oracle.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      a7d13914
    • Keith Busch's avatar
      nvme-tcp: rerun io_work if req_list is not empty · a0fdd141
      Keith Busch authored
      A possible race condition exists where the request to send data is
      enqueued from nvme_tcp_handle_r2t()'s will not be observed by
      nvme_tcp_send_all() if it happens to be running. The driver relies on
      io_work to send the enqueued request when it is runs again, but the
      concurrently running nvme_tcp_send_all() may not have released the
      send_mutex at that time. If no future commands are enqueued to re-kick
      the io_work, the request will timeout in the SEND_H2C state, resulting
      in a timeout error like:
      
        nvme nvme0: queue 1: timeout request 0x3 type 6
      
      Ensure the io_work continues to run as long as the req_list is not empty.
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      a0fdd141
    • Sagi Grimberg's avatar
      nvme-tcp: fix possible use-after-completion · 825619b0
      Sagi Grimberg authored
      Commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq
      context") added a second context that may perform a network send.
      This means that now RX and TX are not serialized in nvme_tcp_io_work
      and can run concurrently.
      
      While there is correct mutual exclusion in the TX path (where
      the send_mutex protect the queue socket send activity) RX activity,
      and more specifically request completion may run concurrently.
      
      This means we must guarantee that any mutation of the request state
      related to its lifetime, bytes sent must not be accessed when a completion
      may have possibly arrived back (and processed).
      
      The race may trigger when a request completion arrives, processed
      _and_ reused as a fresh new request, exactly in the (relatively short)
      window between the last data payload sent and before the request iov_iter
      is advanced.
      
      Consider the following race:
      1. 16K write request is queued
      2. The nvme command and the data is sent to the controller (in-capsule
         or solicited by r2t)
      3. After the last payload is sent but before the req.iter is advanced,
         the controller sends back a completion.
      4. The completion is processed, the request is completed, and reused
         to transfer a new request (write or read)
      5. The new request is queued, and the driver reset the request parameters
         (nvme_tcp_setup_cmd_pdu).
      6. Now context in (2) resumes execution and advances the req.iter
      
      ==> use-after-completion as this is already a new request.
      
      Fix this by making sure the request is not advanced after the last
      data payload send, knowing that a completion may have arrived already.
      
      An alternative solution would have been to delay the request completion
      or state change waiting for reference counting on the TX path, but besides
      adding atomic operations to the hot-path, it may present challenges in
      multi-stage R2T scenarios where a r2t handler needs to be deferred to
      an async execution.
      Reported-by: default avatarNarayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
      Tested-by: default avatarAnil Mishra <anil.mishra@wdc.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      825619b0
    • Wu Bo's avatar
      nvme-loop: fix memory leak in nvme_loop_create_ctrl() · 03504e3b
      Wu Bo authored
      When creating loop ctrl in nvme_loop_create_ctrl(), if nvme_init_ctrl()
      fails, the loop ctrl should be freed before jumping to the "out" label.
      
      Fixes: 3a85a5de ("nvme-loop: add a NVMe loopback host driver")
      Signed-off-by: default avatarWu Bo <wubo40@huawei.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      03504e3b
    • Wu Bo's avatar
      nvmet: fix memory leak in nvmet_alloc_ctrl() · fec356a6
      Wu Bo authored
      When creating ctrl in nvmet_alloc_ctrl(), if the cntlid_min is larger
      than cntlid_max of the subsystem, and jumps to the
      "out_free_changed_ns_list" label, but the ctrl->sqs lack of be freed.
      Fix this by jumping to the "out_free_sqs" label.
      
      Fixes: 94a39d61 ("nvmet: make ctrl-id configurable")
      Signed-off-by: default avatarWu Bo <wubo40@huawei.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      fec356a6
  5. 14 May, 2021 3 commits
  6. 13 May, 2021 2 commits
    • Jens Axboe's avatar
      Merge tag 'nvme-5.13-2021-05-13' of git://git.infradead.org/nvme into block-5.13 · 6bdf2fbc
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fix for Linux 5.13
      
       - correct the check for using the inline bio in nvmet
         (Chaitanya Kulkarni)
       - demote unsupported command warnings (Chaitanya Kulkarni)
       - fix corruption due to double initializing ANA state (me, Hou Pu)
       - reset ns->file when open fails (Daniel Wagner)
       - fix a NULL deref when SEND is completed with error in nvmet-rdma
         (Michal Kalderon)"
      
      * tag 'nvme-5.13-2021-05-13' of git://git.infradead.org/nvme:
        nvmet: use new ana_log_size instead the old one
        nvmet: seset ns->file when open fails
        nvmet: demote fabrics cmd parse err msg to debug
        nvmet: use helper to remove the duplicate code
        nvmet: demote discovery cmd parse err msg to debug
        nvmet-rdma: Fix NULL deref when SEND is completed with error
        nvmet: fix inline bio check for passthru
        nvmet: fix inline bio check for bdev-ns
        nvme-multipath: fix double initialization of ANA state
      6bdf2fbc
    • Hou Pu's avatar
      nvmet: use new ana_log_size instead the old one · e181811b
      Hou Pu authored
      The new ana_log_size should be used instead of the old one.
      Or kernel NULL pointer dereference will happen like below:
      
      [   38.957849][   T69] BUG: kernel NULL pointer dereference, address: 000000000000003c
      [   38.975550][   T69] #PF: supervisor write access in kernel mode
      [   38.975955][   T69] #PF: error_code(0x0002) - not-present page
      [   38.976905][   T69] PGD 0 P4D 0
      [   38.979388][   T69] Oops: 0002 [#1] SMP NOPTI
      [   38.980488][   T69] CPU: 0 PID: 69 Comm: kworker/0:2 Not tainted 5.12.0+ #54
      [   38.981254][   T69] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   38.982502][   T69] Workqueue: events nvme_loop_execute_work
      [   38.985219][   T69] RIP: 0010:memcpy_orig+0x68/0x10f
      [   38.986203][   T69] Code: 83 c2 20 eb 44 48 01 d6 48 01 d7 48 83 ea 20 0f 1f 00 48 83 ea 20 4c 8b 46 f8 4c 8b 4e f0 4c 8b 56 e8 4c 8b 5e e0 48 8d 76 e0 <4c> 89 47 f8 4c 89 4f f0 4c 89 57 e8 4c 89 5f e0 48 8d 7f e0 73 d2
      [   38.987677][   T69] RSP: 0018:ffffc900001b7d48 EFLAGS: 00000287
      [   38.987996][   T69] RAX: 0000000000000020 RBX: 0000000000000024 RCX: 0000000000000010
      [   38.988327][   T69] RDX: ffffffffffffffe4 RSI: ffff8881084bc004 RDI: 0000000000000044
      [   38.988620][   T69] RBP: 0000000000000024 R08: 0000000100000000 R09: 0000000000000000
      [   38.988991][   T69] R10: 0000000100000000 R11: 0000000000000001 R12: 0000000000000024
      [   38.989289][   T69] R13: ffff8881084bc000 R14: 0000000000000000 R15: 0000000000000024
      [   38.989845][   T69] FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
      [   38.990234][   T69] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   38.990490][   T69] CR2: 000000000000003c CR3: 00000001085b2000 CR4: 00000000000006f0
      [   38.991105][   T69] Call Trace:
      [   38.994157][   T69]  sg_copy_buffer+0xb8/0xf0
      [   38.995357][   T69]  nvmet_copy_to_sgl+0x48/0x6d
      [   38.995565][   T69]  nvmet_execute_get_log_page_ana+0xd4/0x1cb
      [   38.995792][   T69]  nvmet_execute_get_log_page+0xc9/0x146
      [   38.995992][   T69]  nvme_loop_execute_work+0x3e/0x44
      [   38.996181][   T69]  process_one_work+0x1c3/0x3c0
      [   38.996393][   T69]  worker_thread+0x44/0x3d0
      [   38.996600][   T69]  ? cancel_delayed_work+0x90/0x90
      [   38.996804][   T69]  kthread+0xf7/0x130
      [   38.996961][   T69]  ? kthread_create_worker_on_cpu+0x70/0x70
      [   38.997171][   T69]  ret_from_fork+0x22/0x30
      [   38.997705][   T69] Modules linked in:
      [   38.998741][   T69] CR2: 000000000000003c
      [   39.000104][   T69] ---[ end trace e719927b609d0fa0 ]---
      
      Fixes: 5e1f6899 ("nvme-multipath: fix double initialization of ANA state")
      Signed-off-by: default avatarHou Pu <houpu.main@gmail.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      e181811b
  7. 12 May, 2021 6 commits
    • Daniel Wagner's avatar
      nvmet: seset ns->file when open fails · 85428bea
      Daniel Wagner authored
      Reset the ns->file value to NULL also in the error case in
      nvmet_file_ns_enable().
      
      The ns->file variable points either to file object or contains the
      error code after the filp_open() call. This can lead to following
      problem:
      
      When the user first setups an invalid file backend and tries to enable
      the ns, it will fail. Then the user switches over to a bdev backend
      and enables successfully the ns. The first received I/O will crash the
      system because the IO backend is chosen based on the ns->file value:
      
      static u16 nvmet_parse_io_cmd(struct nvmet_req *req)
      {
      	[...]
      
      	if (req->ns->file)
      		return nvmet_file_parse_io_cmd(req);
      
      	return nvmet_bdev_parse_io_cmd(req);
      }
      Reported-by: default avatarEnzo Matsumiya <ematsumiya@suse.com>
      Signed-off-by: default avatarDaniel Wagner <dwagner@suse.de>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      85428bea
    • Sun Ke's avatar
      nbd: share nbd_put and return by goto put_nbd · bedf78c4
      Sun Ke authored
      Replace the following two statements by the statement “goto put_nbd;”
      
      	nbd_put(nbd);
      	return 0;
      Signed-off-by: default avatarSun Ke <sunke32@huawei.com>
      Suggested-by: default avatarMarkus Elfring <Markus.Elfring@web.de>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20210512114331.1233964-3-sunke32@huawei.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bedf78c4
    • Sun Ke's avatar
      nbd: Fix NULL pointer in flush_workqueue · 79ebe911
      Sun Ke authored
      Open /dev/nbdX first, the config_refs will be 1 and
      the pointers in nbd_device are still null. Disconnect
      /dev/nbdX, then reference a null recv_workq. The
      protection by config_refs in nbd_genl_disconnect is useless.
      
      [  656.366194] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [  656.368943] #PF: supervisor write access in kernel mode
      [  656.369844] #PF: error_code(0x0002) - not-present page
      [  656.370717] PGD 10cc87067 P4D 10cc87067 PUD 1074b4067 PMD 0
      [  656.371693] Oops: 0002 [#1] SMP
      [  656.372242] CPU: 5 PID: 7977 Comm: nbd-client Not tainted 5.11.0-rc5-00040-g76c057c8 #1
      [  656.373661] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      [  656.375904] RIP: 0010:mutex_lock+0x29/0x60
      [  656.376627] Code: 00 0f 1f 44 00 00 55 48 89 fd 48 83 05 6f d7 fe 08 01 e8 7a c3 ff ff 48 83 05 6a d7 fe 08 01 31 c0 65 48 8b 14 25 00 6d 01 00 <f0> 48 0f b1 55 d
      [  656.378934] RSP: 0018:ffffc900005eb9b0 EFLAGS: 00010246
      [  656.379350] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  656.379915] RDX: ffff888104cf2600 RSI: ffffffffaae8f452 RDI: 0000000000000020
      [  656.380473] RBP: 0000000000000020 R08: 0000000000000000 R09: ffff88813bd6b318
      [  656.381039] R10: 00000000000000c7 R11: fefefefefefefeff R12: ffff888102710b40
      [  656.381599] R13: ffffc900005eb9e0 R14: ffffffffb2930680 R15: ffff88810770ef00
      [  656.382166] FS:  00007fdf117ebb40(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000
      [  656.382806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  656.383261] CR2: 0000000000000020 CR3: 0000000100c84000 CR4: 00000000000006e0
      [  656.383819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  656.384370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  656.384927] Call Trace:
      [  656.385111]  flush_workqueue+0x92/0x6c0
      [  656.385395]  nbd_disconnect_and_put+0x81/0xd0
      [  656.385716]  nbd_genl_disconnect+0x125/0x2a0
      [  656.386034]  genl_family_rcv_msg_doit.isra.0+0x102/0x1b0
      [  656.386422]  genl_rcv_msg+0xfc/0x2b0
      [  656.386685]  ? nbd_ioctl+0x490/0x490
      [  656.386954]  ? genl_family_rcv_msg_doit.isra.0+0x1b0/0x1b0
      [  656.387354]  netlink_rcv_skb+0x62/0x180
      [  656.387638]  genl_rcv+0x34/0x60
      [  656.387874]  netlink_unicast+0x26d/0x590
      [  656.388162]  netlink_sendmsg+0x398/0x6c0
      [  656.388451]  ? netlink_rcv_skb+0x180/0x180
      [  656.388750]  ____sys_sendmsg+0x1da/0x320
      [  656.389038]  ? ____sys_recvmsg+0x130/0x220
      [  656.389334]  ___sys_sendmsg+0x8e/0xf0
      [  656.389605]  ? ___sys_recvmsg+0xa2/0xf0
      [  656.389889]  ? handle_mm_fault+0x1671/0x21d0
      [  656.390201]  __sys_sendmsg+0x6d/0xe0
      [  656.390464]  __x64_sys_sendmsg+0x23/0x30
      [  656.390751]  do_syscall_64+0x45/0x70
      [  656.391017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      To fix it, just add if (nbd->recv_workq) to nbd_disconnect_and_put().
      
      Fixes: e9e006f5 ("nbd: fix max number of supported devs")
      Signed-off-by: default avatarSun Ke <sunke32@huawei.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20210512114331.1233964-2-sunke32@huawei.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      79ebe911
    • Lin Feng's avatar
      blkdev.h: remove unused codes blk_account_rq · 190515f6
      Lin Feng authored
      Last users of blk_account_rq gone with patch commit a1ce35fa
      ("block: remove dead elevator code") and now it gets no caller, it can
      be safely removed.
      Signed-off-by: default avatarLin Feng <linf@wangsu.com>
      Link: https://lore.kernel.org/r/20210512100124.173769-1-linf@wangsu.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      190515f6
    • Paolo Valente's avatar
      block, bfq: avoid circular stable merges · 7ea96eef
      Paolo Valente authored
      BFQ may merge a new bfq_queue, stably, with the last bfq_queue
      created. In particular, BFQ first waits a little bit for some I/O to
      flow inside the new queue, say Q2, if this is needed to understand
      whether it is better or worse to merge Q2 with the last queue created,
      say Q1. This delayed stable merge is performed by assigning
      bic->stable_merge_bfqq = Q1, for the bic associated with Q1.
      
      Yet, while waiting for some I/O to flow in Q2, a non-stable queue
      merge of Q2 with Q1 may happen, causing the bic previously associated
      with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
      Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
      be recycled as a non-shared bfq_queue. In that case, Q1 may then
      happen to undergo a stable merge with the bfq_queue pointed by
      bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
      Q1. So Q1 would be merged with itself.
      
      This commit fixes this error by intercepting this situation, and
      canceling the schedule of the stable merge.
      
      Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
      Signed-off-by: default avatarPietro Pedroni <pedroni.pietro.96@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7ea96eef
    • Tejun Heo's avatar
      blk-iocost: fix weight updates of inner active iocgs · e9f4eee9
      Tejun Heo authored
      When the weight of an active iocg is updated, weight_updated() is called
      which in turn calls __propagate_weights() to update the active and inuse
      weights so that the effective hierarchical weights are update accordingly.
      
      The current implementation is incorrect for inner active nodes. For an
      active leaf iocg, inuse can be any value between 1 and active and the
      difference represents how much the iocg is donating. When weight is updated,
      as long as inuse is clamped between 1 and the new weight, we're alright and
      this is what __propagate_weights() currently implements.
      
      However, that's not how an active inner node's inuse is set. An inner node's
      inuse is solely determined by the ratio between the sums of inuse's and
      active's of its children - ie. they're results of propagating the leaves'
      active and inuse weights upwards. __propagate_weights() incorrectly applies
      the same clamping as for a leaf when an active inner node's weight is
      updated. Consider a hierarchy which looks like the following with saturating
      workloads in AA and BB.
      
           R
         /   \
        A     B
        |     |
       AA     BB
      
      1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
      
      2. echo 200 > A/io.weight
      
      3. __propagate_weights() update A's active to 200 and leave inuse at 100 as
         it's already between 1 and the new active, making A:active=200,
         A:inuse=100. As R's active_sum is updated along with A's active,
         A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
         hwi's remain unchanged at 0.5.
      
      4. The weight of A is now twice that of B but AA and BB still have the same
         hwi of 0.5 and thus are doing the same amount of IOs.
      
      Fix it by making __propgate_weights() always calculate the inuse of an
      active inner iocg based on the ratio of child_inuse_sum to child_active_sum.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarDan Schatzberg <dschatzberg@fb.com>
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Cc: stable@vger.kernel.org # v5.4+
      Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.orgSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e9f4eee9
  8. 11 May, 2021 8 commits
    • Chaitanya Kulkarni's avatar
      nvmet: demote fabrics cmd parse err msg to debug · 7a4ffd20
      Chaitanya Kulkarni authored
      Host can send invalid commands and flood the target with error messages.
      Demote the error message from pr_err() to pr_debug() in
      nvmet_parse_fabrics_cmd() and nvmet_parse_connect_cmd().
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      7a4ffd20
    • Chaitanya Kulkarni's avatar
      nvmet: use helper to remove the duplicate code · 4c2dab2b
      Chaitanya Kulkarni authored
      Use the helper nvmet_report_invalid_opcode() to report invalid opcode
      so we can remove the duplicate code.
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      4c2dab2b
    • Chaitanya Kulkarni's avatar
      nvmet: demote discovery cmd parse err msg to debug · 3651aaac
      Chaitanya Kulkarni authored
      Host can send invalid commands and flood the target with error messages
      for the discovery controller. Demote the error message from pr_err() to
      pr_debug( in nvmet_parse_discovery_cmd(). 
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      3651aaac
    • Michal Kalderon's avatar
      nvmet-rdma: Fix NULL deref when SEND is completed with error · 8cc365f9
      Michal Kalderon authored
      When running some traffic and taking down the link on peer, a
      retry counter exceeded error is received. This leads to
      nvmet_rdma_error_comp which tried accessing the cq_context to
      obtain the queue. The cq_context is no longer valid after the
      fix to use shared CQ mechanism and should be obtained similar
      to how it is obtained in other functions from the wc->qp.
      
      [ 905.786331] nvmet_rdma: SEND for CQE 0x00000000e3337f90 failed with status transport retry counter exceeded (12).
      [ 905.832048] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
      [ 905.839919] PGD 0 P4D 0
      [ 905.842464] Oops: 0000 1 SMP NOPTI
      [ 905.846144] CPU: 13 PID: 1557 Comm: kworker/13:1H Kdump: loaded Tainted: G OE --------- - - 4.18.0-304.el8.x86_64 #1
      [ 905.872135] RIP: 0010:nvmet_rdma_error_comp+0x5/0x1b [nvmet_rdma]
      [ 905.878259] Code: 19 4f c0 e8 89 b3 a5 f6 e9 5b e0 ff ff 0f b7 75 14 4c 89 ea 48 c7 c7 08 1a 4f c0 e8 71 b3 a5 f6 e9 4b e0 ff ff 0f 1f 44 00 00 <48> 8b 47 48 48 85 c0 74 08 48 89 c7 e9 98 bf 49 00 e9 c3 e3 ff ff
      [ 905.897135] RSP: 0018:ffffab601c45fe28 EFLAGS: 00010246
      [ 905.902387] RAX: 0000000000000065 RBX: ffff9e729ea2f800 RCX: 0000000000000000
      [ 905.909558] RDX: 0000000000000000 RSI: ffff9e72df9567c8 RDI: 0000000000000000
      [ 905.916731] RBP: ffff9e729ea2b400 R08: 000000000000074d R09: 0000000000000074
      [ 905.923903] R10: 0000000000000000 R11: ffffab601c45fcc0 R12: 0000000000000010
      [ 905.931074] R13: 0000000000000000 R14: 0000000000000010 R15: ffff9e729ea2f400
      [ 905.938247] FS: 0000000000000000(0000) GS:ffff9e72df940000(0000) knlGS:0000000000000000
      [ 905.938249] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 905.950067] nvmet_rdma: SEND for CQE 0x00000000c7356cca failed with status transport retry counter exceeded (12).
      [ 905.961855] CR2: 0000000000000048 CR3: 000000678d010004 CR4: 00000000007706e0
      [ 905.961855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 905.961856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 905.961857] PKRU: 55555554
      [ 906.010315] Call Trace:
      [ 906.012778] __ib_process_cq+0x89/0x170 [ib_core]
      [ 906.017509] ib_cq_poll_work+0x26/0x80 [ib_core]
      [ 906.022152] process_one_work+0x1a7/0x360
      [ 906.026182] ? create_worker+0x1a0/0x1a0
      [ 906.030123] worker_thread+0x30/0x390
      [ 906.033802] ? create_worker+0x1a0/0x1a0
      [ 906.037744] kthread+0x116/0x130
      [ 906.040988] ? kthread_flush_work_fn+0x10/0x10
      [ 906.045456] ret_from_fork+0x1f/0x40
      
      Fixes: ca0f1a80 ("nvmet-rdma: use new shared CQ mechanism")
      Signed-off-by: default avatarShai Malin <smalin@marvell.com>
      Signed-off-by: default avatarMichal Kalderon <michal.kalderon@marvell.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      8cc365f9
    • Chaitanya Kulkarni's avatar
      nvmet: fix inline bio check for passthru · ab96de5d
      Chaitanya Kulkarni authored
      When handling passthru commands, for inline bio allocation we only
      consider the transfer size. This works well when req->sg_cnt fits into
      the req->inline_bvec, but it will result in the early return from
      bio_add_hw_page() when req->sg_cnt > NVMET_MAX_INLINE_BVEC.
      
      Consider an I/O of size 32768 and first buffer is not aligned to the
      page boundary, then I/O is split in following manner :-
      
      [ 2206.256140] nvmet: sg->length 3440 sg->offset 656
      [ 2206.256144] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256148] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256152] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256155] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256159] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256163] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256166] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256170] nvmet: sg->length 656 sg->offset 0
      
      Now the req->transfer_size == NVMET_MAX_INLINE_DATA_LEN i.e. 32768, but
      the req->sg_cnt is (9) > NVMET_MAX_INLINE_BIOVEC which is (8).
      This will result in early return in the following code path :-
      
      nvmet_bdev_execute_rw()
      	bio_add_pc_page()
      		bio_add_hw_page()
      			if (bio_full(bio, len))
      				return 0;
      
      Use previously introduced helper nvmet_use_inline_bvec() to consider
      req->sg_cnt when using inline bio. This only affects nvme-loop
      transport.
      
      Fixes: dab3902b ("nvmet: use inline bio for passthru fast path")
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      ab96de5d
    • Chaitanya Kulkarni's avatar
      nvmet: fix inline bio check for bdev-ns · 608a9690
      Chaitanya Kulkarni authored
      When handling rw commands, for inline bio case we only consider
      transfer size. This works well when req->sg_cnt fits into the
      req->inline_bvec, but it will result in the warning in
      __bio_add_page() when req->sg_cnt > NVMET_MAX_INLINE_BVEC.
      
      Consider an I/O size 32768 and first page is not aligned to the page
      boundary, then I/O is split in following manner :-
      
      [ 2206.256140] nvmet: sg->length 3440 sg->offset 656
      [ 2206.256144] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256148] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256152] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256155] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256159] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256163] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256166] nvmet: sg->length 4096 sg->offset 0
      [ 2206.256170] nvmet: sg->length 656 sg->offset 0
      
      Now the req->transfer_size == NVMET_MAX_INLINE_DATA_LEN i.e. 32768, but
      the req->sg_cnt is (9) > NVMET_MAX_INLINE_BIOVEC which is (8).
      This will result in the following warning message :-
      
      nvmet_bdev_execute_rw()
      	bio_add_page()
      		__bio_add_page()
      			WARN_ON_ONCE(bio_full(bio, len));
      
      This scenario is very hard to reproduce on the nvme-loop transport only
      with rw commands issued with the passthru IOCTL interface from the host
      application and the data buffer is allocated with the malloc() and not
      the posix_memalign().
      
      Fixes: 73383adf ("nvmet: don't split large I/Os unconditionally")
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      608a9690
    • Christoph Hellwig's avatar
      nvme-multipath: fix double initialization of ANA state · 5e1f6899
      Christoph Hellwig authored
      nvme_init_identify and thus nvme_mpath_init can be called multiple
      times and thus must not overwrite potentially initialized or in-use
      fields.  Split out a helper for the basic initialization when the
      controller is initialized and make sure the init_identify path does
      not blindly change in-use data structures.
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: default avatarMartin Wilck <mwilck@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      5e1f6899
    • Omar Sandoval's avatar
      kyber: fix out of bounds access when preempted · efed9a33
      Omar Sandoval authored
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctx's than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch.
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      efed9a33
  9. 10 May, 2021 1 commit
  10. 09 May, 2021 1 commit
  11. 06 May, 2021 1 commit
    • yangerkun's avatar
      block: reexpand iov_iter after read/write · cf7b39a0
      yangerkun authored
      We get a bug:
      
      BUG: KASAN: slab-out-of-bounds in iov_iter_revert+0x11c/0x404
      lib/iov_iter.c:1139
      Read of size 8 at addr ffff0000d3fb11f8 by task
      
      CPU: 0 PID: 12582 Comm: syz-executor.2 Not tainted
      5.10.0-00843-g352c8610ccd2 #2
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
       show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x164 lib/dump_stack.c:118
       print_address_description+0x78/0x5c8 mm/kasan/report.c:385
       __kasan_report mm/kasan/report.c:545 [inline]
       kasan_report+0x148/0x1e4 mm/kasan/report.c:562
       check_memory_region_inline mm/kasan/generic.c:183 [inline]
       __asan_load8+0xb4/0xbc mm/kasan/generic.c:252
       iov_iter_revert+0x11c/0x404 lib/iov_iter.c:1139
       io_read fs/io_uring.c:3421 [inline]
       io_issue_sqe+0x2344/0x2d64 fs/io_uring.c:5943
       __io_queue_sqe+0x19c/0x520 fs/io_uring.c:6260
       io_queue_sqe+0x2a4/0x590 fs/io_uring.c:6326
       io_submit_sqe fs/io_uring.c:6395 [inline]
       io_submit_sqes+0x4c0/0xa04 fs/io_uring.c:6624
       __do_sys_io_uring_enter fs/io_uring.c:9013 [inline]
       __se_sys_io_uring_enter fs/io_uring.c:8960 [inline]
       __arm64_sys_io_uring_enter+0x190/0x708 fs/io_uring.c:8960
       __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
       el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
       do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:227
       el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
       el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
       el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
      
      Allocated by task 12570:
       stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
       kasan_save_stack mm/kasan/common.c:48 [inline]
       kasan_set_track mm/kasan/common.c:56 [inline]
       __kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
       kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
       __kmalloc+0x23c/0x334 mm/slub.c:3970
       kmalloc include/linux/slab.h:557 [inline]
       __io_alloc_async_data+0x68/0x9c fs/io_uring.c:3210
       io_setup_async_rw fs/io_uring.c:3229 [inline]
       io_read fs/io_uring.c:3436 [inline]
       io_issue_sqe+0x2954/0x2d64 fs/io_uring.c:5943
       __io_queue_sqe+0x19c/0x520 fs/io_uring.c:6260
       io_queue_sqe+0x2a4/0x590 fs/io_uring.c:6326
       io_submit_sqe fs/io_uring.c:6395 [inline]
       io_submit_sqes+0x4c0/0xa04 fs/io_uring.c:6624
       __do_sys_io_uring_enter fs/io_uring.c:9013 [inline]
       __se_sys_io_uring_enter fs/io_uring.c:8960 [inline]
       __arm64_sys_io_uring_enter+0x190/0x708 fs/io_uring.c:8960
       __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
       el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
       do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:227
       el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
       el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
       el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
      
      Freed by task 12570:
       stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
       kasan_save_stack mm/kasan/common.c:48 [inline]
       kasan_set_track+0x38/0x6c mm/kasan/common.c:56
       kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
       __kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
       kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
       slab_free_hook mm/slub.c:1544 [inline]
       slab_free_freelist_hook mm/slub.c:1577 [inline]
       slab_free mm/slub.c:3142 [inline]
       kfree+0x104/0x38c mm/slub.c:4124
       io_dismantle_req fs/io_uring.c:1855 [inline]
       __io_free_req+0x70/0x254 fs/io_uring.c:1867
       io_put_req_find_next fs/io_uring.c:2173 [inline]
       __io_queue_sqe+0x1fc/0x520 fs/io_uring.c:6279
       __io_req_task_submit+0x154/0x21c fs/io_uring.c:2051
       io_req_task_submit+0x2c/0x44 fs/io_uring.c:2063
       task_work_run+0xdc/0x128 kernel/task_work.c:151
       get_signal+0x6f8/0x980 kernel/signal.c:2562
       do_signal+0x108/0x3a4 arch/arm64/kernel/signal.c:658
       do_notify_resume+0xbc/0x25c arch/arm64/kernel/signal.c:722
       work_pending+0xc/0x180
      
      blkdev_read_iter can truncate iov_iter's count since the count + pos may
      exceed the size of the blkdev. This will confuse io_read that we have
      consume the iovec. And once we do the iov_iter_revert in io_read, we
      will trigger the slab-out-of-bounds. Fix it by reexpand the count with
      size has been truncated.
      
      blkdev_write_iter can trigger the problem too.
      Signed-off-by: default avataryangerkun <yangerkun@huawei.com>
      Acked-by: default avatarPavel Begunkov <asml.silencec@gmail.com>
      Link: https://lore.kernel.org/r/20210401071807.3328235-1-yangerkun@huawei.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cf7b39a0
  12. 05 May, 2021 1 commit
    • Jens Axboe's avatar
      Merge tag 'nvme-5.13-2021-05-05' of git://git.infradead.org/nvme into block-5.13 · 9c38475c
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme updates for Linux 5.13
      
       - reset the bdev to ns head when failover (Daniel Wagner)
       - remove unsupported command noise (Keith Busch)
       - misc passthrough improvements (Kanchan Joshi)
       - fix controller ioctl through ns_head (Minwoo Im)
       - fix controller timeouts during reset (Tao Chiu)"
      
      * tag 'nvme-5.13-2021-05-05' of git://git.infradead.org/nvme:
        nvmet: remove unsupported command noise
        nvme-multipath: reset bdev to ns head when failover
        nvme-pci: fix controller reset hang when racing with nvme_timeout
        nvme: move the fabrics queue ready check routines to core
        nvme: avoid memset for passthrough requests
        nvme: add nvme_get_ns helper
        nvme: fix controller ioctl through ns_head
      9c38475c
  13. 04 May, 2021 4 commits
    • Keith Busch's avatar
      nvmet: remove unsupported command noise · 4a203425
      Keith Busch authored
      Nothing can stop a host from submitting invalid commands. The target
      just needs to respond with an appropriate status, but that's not a
      target error. Demote invalid command messages to the debug level so
      these events don't spam the kernel logs.
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Reviewed-by: default avatarKlaus Jensen <k.jensen@samsung.com>
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      4a203425
    • Daniel Wagner's avatar
      nvme-multipath: reset bdev to ns head when failover · ce86dad2
      Daniel Wagner authored
      When a request finally completes in end_io() after it has failed over,
      the bdev pointer can be stale and thus the system can crash. Set the
      bdev back to ns head, so the request is map to an active path when
      resubmitted.
      Signed-off-by: default avatarDaniel Wagner <dwagner@suse.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      ce86dad2
    • Tao Chiu's avatar
      nvme-pci: fix controller reset hang when racing with nvme_timeout · d4060d2b
      Tao Chiu authored
      reset_work() in nvme-pci may hang forever in the following scenario:
      1) A reset caused by a command timeout occurs due to a controller being
         temporarily irresponsive.
      2) nvme_reset_work() restarts admin queue at nvme_alloc_admin_tags(). At
         the same time, a user-submitted admin command is queued and waiting
         for completion. Then, reset_work() changes its state to CONNECTING,
         and submits an identify command.
      3) However, the controller does still not respond to any command,
         causing a timeout being fired at the user-submitted command.
         Unfortunately, nvme_timeout() does not see the completion on cq, and
         any timeout that takes place under CONNECTING state causes a
         controller shutdown.
      4) Normally, the identify command in reset_work() would be canceled with
         SC_HOST_ABORTED by nvme_dev_disable(), then reset_work can tear down
         the controller accordingly. But the controller happens to return
         online and respond the identify command before nvme_dev_disable()
         should have been reaped it off.
      5) reset_work() continues to setup_io_queues() as it observes no error
         in init_identify(). However, the admin queue has already been
         quiesced in dev_disable(). Thus, any following commands would be
         blocked forever in blk_execute_rq().
      
      This can be fixed by restricting usercmd commands when controller is not
      in a LIVE state in nvme_queue_rq(), as what has been done previously in
      fabrics.
      
      ```
      nvme_reset_work():                     |
          nvme_alloc_admin_tags()            |
                                             | nvme_submit_user_cmd():
          nvme_init_identify():              |     ...
              __nvme_submit_sync_cmd():      |
                  ...                        |     ...
      ---------------------------------------> nvme_timeout():
      (Controller starts reponding commands) |     nvme_dev_disable(, true):
          nvme_setup_io_queues():            |
              __nvme_submit_sync_cmd():      |
                  (hung in blk_execute_rq    |
                   since run_hw_queue sees   |
                   queue quiesced)           |
      
      ```
      Signed-off-by: default avatarTao Chiu <taochiu@synology.com>
      Signed-off-by: default avatarCody Wong <codywong@synology.com>
      Reviewed-by: default avatarLeon Chien <leonchien@synology.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      d4060d2b
    • Tao Chiu's avatar
      nvme: move the fabrics queue ready check routines to core · a9715744
      Tao Chiu authored
      queue_rq() in pci only checks if the dispatched queue (nvmeq) is ready,
      e.g. not being suspended. Since nvme_alloc_admin_tags() in reset flow
      restarts the admin queue, users are able to submit admin commands to a
      controller before reset_work() completes. Commands submitted under this
      condition may interfere with commands that performs identify, IO queue
      setup in reset_work(), and may result in a hang described in the
      following patch.
      
      As seen in the fabrics, user commands are prevented from being executed
      under inproper controller states. We may reuse this logic to maintain a
      clear admin queue during reset_work().
      Signed-off-by: default avatarTao Chiu <taochiu@synology.com>
      Signed-off-by: default avatarCody Wong <codywong@synology.com>
      Reviewed-by: default avatarLeon Chien <leonchien@synology.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      a9715744