1. 28 Aug, 2020 8 commits
    • Sagi Grimberg's avatar
      nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f
      Sagi Grimberg authored
      If the controller becomes unresponsive in the middle of a reset, we will
      hang because we are waiting for the freeze to complete, but that cannot
      happen since we have commands that are inflight holding the
      q_usage_counter, and we can't blindly fail requests that times out.
      
      So give a timeout and if we cannot wait for queue freeze before
      unfreezing, fail and have the error handling take care how to proceed
      (either schedule a reconnect of remove the controller).
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      e5c01f4f
    • Sagi Grimberg's avatar
      nvme-tcp: fix timeout handler · 236187c4
      Sagi Grimberg authored
      When a request times out in a LIVE state, we simply trigger error
      recovery and let the error recovery handle the request cancellation,
      however when a request times out in a non LIVE state, we make sure to
      complete it immediately as it might block controller setup or teardown
      and prevent forward progress.
      
      However tearing down the entire set of I/O and admin queues causes
      freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
      an overkill to what we actually need, which is to just fence controller
      teardown that may be running, stop the queue, and cancel the request if
      it is not already completed.
      
      Now that we have the controller teardown_lock, we can safely serialize
      request cancellation. This addresses a hang caused by calling extra
      queue freeze on controller namespaces, causing unfreeze to not complete
      correctly.
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      236187c4
    • Sagi Grimberg's avatar
      nvme-tcp: serialize controller teardown sequences · d4d61470
      Sagi Grimberg authored
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      In this case, we could have a potential double completion in case
      a hard-irq or a different competing context triggered error recovery
      and is running inflight request cancellation concurrently with the
      timeout handler.
      
      Protect using a ctrl teardown_lock to serialize contexts that may
      complete a cancelled request due to error recovery or a reset.
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      d4d61470
    • Sagi Grimberg's avatar
      nvme: have nvme_wait_freeze_timeout return if it timed out · 7cf0d7c0
      Sagi Grimberg authored
      Users can detect if the wait has completed or not and take appropriate
      actions based on this information (e.g. weather to continue
      initialization or rather fail and schedule another initialization
      attempt).
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      7cf0d7c0
    • Sagi Grimberg's avatar
      nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance · d7144f5c
      Sagi Grimberg authored
      NVME_CTRL_NEW should never see any I/O, because in order to start
      initialization it has to transition to NVME_CTRL_CONNECTING and from
      there it will never return to this state.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      d7144f5c
    • Ziye Yang's avatar
      nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu · a6ce7d7b
      Ziye Yang authored
      When handling commands without in-capsule data, we assign the ttag
      assuming we already have the queue commands array allocated (based
      on the queue size information in the connect data payload). However
      if the connect itself did not send the connect data in-capsule we
      have yet to allocate the queue commands,and we will assign a bogus
      ttag and suffer a NULL dereference when we receive the corresponding
      h2cdata pdu.
      
      Fix this by checking if we already allocated commands before
      dereferencing it when handling h2cdata, if we didn't, its for sure a
      connect and we should use the preallocated connect command.
      Signed-off-by: default avatarZiye Yang <ziye.yang@intel.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      a6ce7d7b
    • Jens Axboe's avatar
      Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.9 · a433d721
      Jens Axboe authored
      Pull MD fix from Song.
      
      * 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid5: make sure stripe_size as power of two
      a433d721
    • Yufen Yu's avatar
      md/raid5: make sure stripe_size as power of two · 6af10a33
      Yufen Yu authored
      Commit 3b5408b9 ("md/raid5: support config stripe_size by sysfs
      entry") make stripe_size as a configurable value. It just requires
      stripe_size as multiple of 4KB.
      
      In fact, we should make sure stripe_size as power of two. Otherwise,
      stripe_shift which is the result of ilog2 can not represent the real
      stripe_size. Then, stripe_hash() and stripe_hash_locks_hash() may
      get unexpected value.
      
      Fixes: 3b5408b9 ("md/raid5: support config stripe_size by sysfs entry")
      Signed-off-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      6af10a33
  2. 26 Aug, 2020 2 commits
  3. 21 Aug, 2020 23 commits
  4. 18 Aug, 2020 2 commits
    • Dmitry Monakhov's avatar
      bfq: fix blkio cgroup leakage v4 · 2de791ab
      Dmitry Monakhov authored
      Changes from v1:
          - update commit description with proper ref-accounting justification
      
      commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      introduce leak forbfq_group and blkcg_gq objects because of get/put
      imbalance.
      In fact whole idea of original commit is wrong because bfq_group entity
      can not dissapear under us because it is referenced by child bfq_queue's
      entities from here:
       -> bfq_init_entity()
          ->bfqg_and_blkg_get(bfqg);
          ->entity->parent = bfqg->my_entity
      
       -> bfq_put_queue(bfqq)
          FINAL_PUT
          ->bfqg_and_blkg_put(bfqq_group(bfqq))
          ->kmem_cache_free(bfq_pool, bfqq);
      
      So parent entity can not disappear while child entity is in tree,
      and child entities already has proper protection.
      This patch revert commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      
      bfq_group leak trace caused by bad commit:
      -> blkg_alloc
         -> bfq_pq_alloc
           -> bfqg_get (+1)
      ->bfq_activate_bfqq
        ->bfq_activate_requeue_entity
          -> __bfq_activate_entity
             ->bfq_get_entity
               ->bfqg_and_blkg_get (+1)  <==== : Note1
      ->bfq_del_bfqq_busy
        ->bfq_deactivate_entity+0x53/0xc0 [bfq]
          ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
            -> bfq_forget_entity(is_in_service = true)
      	 entity->on_st_or_in_serv = false   <=== :Note2
      	 if (is_in_service)
      	     return;  ==> do not touch reference
      -> blkcg_css_offline
       -> blkcg_destroy_blkgs
        -> blkg_destroy
         -> bfq_pd_offline
          -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because (Note2)
      		return false;
       -> bfq_pd_free
          -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
      So bfq_group and blkcg_gq  will leak forever, see test-case below.
      
      ##TESTCASE_BEGIN:
      #!/bin/bash
      
      max_iters=${1:-100}
      #prep cgroup mounts
      mount -t tmpfs cgroup_root /sys/fs/cgroup
      mkdir /sys/fs/cgroup/blkio
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      
      # Prepare blkdev
      grep blkio /proc/cgroups
      truncate -s 1M img
      losetup /dev/loop0 img
      echo bfq > /sys/block/loop0/queue/scheduler
      
      grep blkio /proc/cgroups
      for ((i=0;i<max_iters;i++))
      do
          mkdir -p /sys/fs/cgroup/blkio/a
          echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
          dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
          echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
          rmdir /sys/fs/cgroup/blkio/a
          grep blkio /proc/cgroups
      done
      ##TESTCASE_END:
      
      Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarDmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2de791ab
    • Matthew Wilcox (Oracle)'s avatar
      block: Fix page_is_mergeable() for compound pages · d8166519
      Matthew Wilcox (Oracle) authored
      If we pass in an offset which is larger than PAGE_SIZE, then
      page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
      leading to a large number of bio_vecs being used.  Use a slightly more
      obvious test that the two pages are compatible with each other.
      
      Fixes: 52d52d1c ("block: only allow contiguous page structs in a bio_vec")
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d8166519
  5. 17 Aug, 2020 5 commits
    • Ming Lei's avatar
      block: virtio_blk: fix handling single range discard request · af822aa6
      Ming Lei authored
      1f23816b ("virtio_blk: add discard and write zeroes support") starts
      to support multi-range discard for virtio-blk. However, the virtio-blk
      disk may report max discard segment as 1, at least that is exactly what
      qemu is doing.
      
      So far, block layer switches to normal request merge if max discard segment
      limit is 1, and multiple bios can be merged to single segment. This way may
      cause memory corruption in virtblk_setup_discard_write_zeroes().
      
      Fix the issue by handling single max discard segment in straightforward
      way.
      
      Fixes: 1f23816b ("virtio_blk: add discard and write zeroes support")
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Changpeng Liu <changpeng.liu@intel.com>
      Cc: Daniel Verkamp <dverkamp@chromium.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      af822aa6
    • Ming Lei's avatar
      block: respect queue limit of max discard segment · 943b40c8
      Ming Lei authored
      When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
      return false for discard request, then normal request merge is applied.
      However, only queue_max_segments() is checked, so max discard segment
      limit isn't respected.
      
      Check max discard segment limit in the request merge code for fixing
      the issue.
      
      Discard request failure of virtio_blk is fixed.
      
      Fixes: 69840466 ("block: fix the DISCARD request merge")
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      943b40c8
    • Ming Lei's avatar
      block: loop: set discard granularity and alignment for block device backed loop · bcb21c8c
      Ming Lei authored
      In case of block device backend, if the backend supports write zeros, the
      loop device will set queue flag of QUEUE_FLAG_DISCARD. However,
      limits.discard_granularity isn't setup, and this way is wrong,
      see the following description in Documentation/ABI/testing/sysfs-block:
      
      	A discard_granularity of 0 means that the device does not support
      	discard functionality.
      
      Especially 9b15d109 ("block: improve discard bio alignment in
      __blkdev_issue_discard()") starts to take q->limits.discard_granularity
      for computing max discard sectors. And zero discard granularity may cause
      kernel oops, or fail discard request even though the loop queue claims
      discard support via QUEUE_FLAG_DISCARD.
      
      Fix the issue by setup discard granularity and alignment.
      
      Fixes: c52abf56 ("loop: Better discard support for block devices")
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Xiao Ni <xni@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Gwendal Grignou <gwendal@chromium.org>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bcb21c8c
    • Ming Lei's avatar
      blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART · d7d8535f
      Ming Lei authored
      SCHED_RESTART code path is relied to re-run queue for dispatch requests
      in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
      requests to hctx->dispatch.
      
      memory barriers have to be used for ordering the following two pair of OPs:
      
      1) adding requests to hctx->dispatch and checking SCHED_RESTART in
      blk_mq_dispatch_rq_list()
      
      2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
      in blk_mq_sched_restart().
      
      Without the added memory barrier, either:
      
      1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
      blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
      dispatch side
      
      or
      
      2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
      in dispatch side, meantime checking if there is request in
      hctx->dispatch from blk_mq_sched_restart() is missed.
      
      IO hang in ltp/fs_fill test is reported by kernel test robot:
      
      	https://lkml.org/lkml/2020/7/26/77
      
      Turns out it is caused by the above out-of-order OPs. And the IO hang
      can't be observed any more after applying this patch.
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d7d8535f
    • Xu Wang's avatar
      bsg-lib: convert comma to semicolon · 03ef5941
      Xu Wang authored
      Replace a comma between expression statements by a semicolon.
      Signed-off-by: default avatarXu Wang <vulab@iscas.ac.cn>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      03ef5941