  1. 08 Dec, 2020 7 commits
    • blk-mq: add new API of blk_mq_hctx_set_fq_lock_class · fb01a293
      Ming Lei authored
      flush_end_io() may be called recursively from some drivers, such as
      nvme-loop, so lockdep may complain about 'possible recursive locking'.
      Commit b3c6a599 ("block: Fix a lockdep complaint triggered by
      request queue flushing") tried to address this issue by assigning a
      dynamically allocated per-flush-queue lock class. That solution adds a
      synchronize_rcu() to each hctx's release handler, causing a horrible
      SCSI MQ probe delay (more than half an hour on megaraid_sas).
      
      Add a new API, blk_mq_hctx_set_fq_lock_class(), for these drivers, so
      they only need to use a driver-specific lock class to avoid the
      'possible recursive locking' lockdep warning (a usage sketch follows
      this entry).
      Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Reported-by: Qian Cai <cai@redhat.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: John Garry <john.garry@huawei.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
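
      A rough usage sketch, not verbatim from the patch: assuming the helper
      takes a hardware context and a lock_class_key, a driver such as
      nvme-loop could hand every flush queue the same static, driver-owned
      key from its ->init_hctx() callback (driver names below are
      illustrative):

          /* hypothetical driver code; only the new API name is from the patch */
          static struct lock_class_key drv_hctx_fq_lock_key;

          static int drv_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                                   unsigned int hctx_idx)
          {
                  /* Give the flush queue a driver-specific lock class so
                   * lockdep can tell nested flush locks apart.
                   */
                  blk_mq_hctx_set_fq_lock_class(hctx, &drv_hctx_fq_lock_key);
                  return 0;
          }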
    • block: disable iopoll for split bio · cc29e1bf
      Jeffle Xu authored
      iopoll is primarily meant for small, latency-sensitive IO. It doesn't
      work well for big IO, especially when the bio needs to be split into
      multiple bios. In that case, the cookie returned by
      __submit_bio_noacct_mq() is actually the cookie of the last split bio,
      and completing *that* last split bio via iopoll doesn't mean the whole
      original bio has completed. Callers of iopoll still need to wait for
      the completion of the other split bios.
      
      Besides, bio splitting may cause more trouble for iopoll, which isn't
      supposed to be used for big IO in the first place.
      
      iopoll for a split bio may also race if CPU migration happens during
      bio submission. Since the returned cookie is that of the last split
      bio, polling on the corresponding hardware queue doesn't help complete
      the other split bios if they were enqueued into different hardware
      queues. Since interrupts are disabled for polling queues, the
      completion of those other split bios then depends on the timeout
      mechanism, thus causing a potential hang.
      
      iopoll for a split bio may also cause a hang with sync polling.
      Currently both blkdev and iomap-based filesystems (ext4/xfs, etc.)
      support sync polling in their direct IO routines. These routines
      submit bios without the REQ_NOWAIT flag set and then start sync
      polling in the current process context. The process may hang in
      blk_mq_get_tag() if the submitted bio has to be split into multiple
      bios that rapidly exhaust the queue depth: the process waits for the
      completion of the previously allocated requests, which should be
      reaped by the polling that follows submission but can never start,
      thus causing a deadlock.
      
      To avoid the subtle troubles described above, just disable iopoll for
      a split bio and return BLK_QC_T_NONE in that case (see the sketch
      after this entry). The side effect is that non-HIPRI IO also returns
      BLK_QC_T_NONE now; that should be acceptable since the returned cookie
      is never used for non-HIPRI IO.
      Suggested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
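
      A minimal sketch of the idea, not the exact upstream diff: once the
      submit path has to split a bio, the polling hint can be dropped so the
      caller gets BLK_QC_T_NONE instead of a cookie that only covers the
      last fragment (the helper name below is illustrative):

          /* illustrative only: clear the polling hint when a bio is split */
          static void drop_iopoll_on_split(struct bio *bio, bool was_split)
          {
                  if (was_split) {
                          /* Polling one fragment can't complete the whole
                           * original bio, so stop advertising HIPRI.
                           */
                          bio->bi_opf &= ~REQ_HIPRI;
                  }
          }

      With REQ_HIPRI cleared, the submission path simply returns
      BLK_QC_T_NONE for that bio, so callers never poll on a cookie that
      only represents the last split fragment.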
    • block: Improve blk_revalidate_disk_zones() checks · 2afdeb23
      Damien Le Moal authored
      Improve the checks on the zones of a zoned block device done in
      blk_revalidate_disk_zones() by making sure that the device's
      report_zones method reported at least one zone and that the reported
      zones exactly cover the entire disk capacity, that is, that there are
      no missing zones at the end of the disk sector range (the added checks
      are sketched after this entry).
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
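
      Roughly, the extra validation amounts to two conditions checked after
      the report_zones walk. This is a sketch under the assumption that the
      revalidation loop tracks the number of zones seen and the next
      expected start sector; the exact variable names in the kernel may
      differ:

          /* illustrative check after iterating all reported zones */
          if (nr_zones == 0) {
                  pr_warn("%s: no zones reported\n", disk->disk_name);
                  ret = -ENODEV;
          } else if (next_sector != get_capacity(disk)) {
                  /* Zones must cover the whole capacity, with no hole
                   * at the end of the sector range.
                   */
                  pr_warn("%s: zones do not cover the full capacity\n",
                          disk->disk_name);
                  ret = -ENODEV;
          }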
    • sbitmap: simplify wrap check · 0eff1f1a
      Pavel Begunkov authored
      __sbitmap_get_word() doesn't wrap if it starts from the beginning
      (i.e. the initial hint is 0). Instead of stashing the original hint,
      just set @wrap accordingly (sketched below).
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
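
      The gist, as a rough sketch rather than the exact patch: fold "did we
      start from a non-zero offset" into the @wrap flag once, instead of
      keeping the original hint around just to test it later:

          /* illustrative simplification of the wrap handling */
          wrap = wrap && hint;    /* wrapping is only useful if hint != 0 */

          while (1) {
                  nr = find_next_zero_bit(word, depth, hint);
                  if (unlikely(nr >= depth)) {
                          if (hint && wrap) {
                                  hint = 0;       /* retry once from bit 0 */
                                  continue;
                          }
                          return -1;
                  }
                  /* ... try to claim bit nr, or update hint and loop ... */
          }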
    • sbitmap: replace CAS with atomic and · c3250c8d
      Pavel Begunkov authored
      sbitmap_deferred_clear() does a CAS loop to propagate cleared bits;
      replace it with an equivalent atomic bitwise AND. That's slightly
      faster and makes the operation wait-free instead of lock-free as
      before (a before/after sketch follows this entry).
      
      The atomic can be relaxed (i.e. barrier-less) because the following
      sbitmap_get*() calls deal with synchronisation, see the comments in
      sbitmap_queue_clear().
      
      It's OK to cast to atomic_long_t; that's what bitops/lock.h does.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
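
      A simplified sketch of the change inside sbitmap_deferred_clear(): the
      CAS loop that cleared the drained bits becomes a single atomic AND-NOT
      (surrounding code omitted, details approximate):

          /* before (simplified): CAS loop, lock-free */
          mask = xchg(&map->cleared, 0);
          do {
                  val = map->word;
          } while (cmpxchg(&map->word, val, val & ~mask) != val);

          /* after (simplified): one relaxed atomic op, wait-free */
          mask = xchg(&map->cleared, 0);
          atomic_long_andnot(mask, (atomic_long_t *)&map->word);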
    • sbitmap: remove swap_lock · 661d4f55
      Pavel Begunkov authored
      map->swap_lock protects map->cleared from concurrent modification;
      however, sbitmap_deferred_clear() already drains it atomically, so
      it's guaranteed not to lose bits on concurrent
      sbitmap_deferred_clear() calls (see the sketch below).
      
      A single-threaded, tag-heavy test on top of null_blk showed a ~1.5%
      throughput increase and a 3% -> 1% reduction in sbitmap_get() cycles
      according to perf.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
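
      Why the lock can go, as a sketch (the struct layout below is
      simplified and its name is illustrative): the xchg() hands each
      concurrent caller a disjoint snapshot of ->cleared, so no bit can be
      cleared twice or dropped, and the spinlock adds nothing:

          struct sbitmap_word_sketch {
                  unsigned long word;      /* allocated bits */
                  unsigned long cleared;   /* deferred-cleared bits */
                  /* spinlock_t swap_lock; -- no longer needed */
          };

          /* Each caller atomically takes ownership of whatever bits were
           * in ->cleared at that instant; other callers see 0 and return.
           */
          mask = xchg(&map->cleared, 0);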
    • sbitmap: optimise sbitmap_deferred_clear() · b78beea0
      Pavel Begunkov authored
      Because of spinlocks and atomics, sbitmap_deferred_clear() has to
      reload &sb->map[index] on each access even though the map address
      won't change. Pass in the sbitmap_word instead of {sb, index}, so it's
      cached in a variable (the signature change is sketched after this
      entry). It also improves code generation of
      sbitmap_find_bit_in_index().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
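
      The shape of the change, roughly (callers and body trimmed; treat this
      as an approximation):

          /* before: every access recomputes sb->map[index] */
          static bool sbitmap_deferred_clear(struct sbitmap *sb, int index);

          /* after: the word pointer is computed once by the caller and
           * kept in a local variable inside the function
           */
          static bool sbitmap_deferred_clear(struct sbitmap_word *map);

          /* caller side, illustrative */
          if (!sbitmap_deferred_clear(&sb->map[index]))
                  return -1;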
  2. 07 Dec, 2020 7 commits
  3. 04 Dec, 2020 5 commits
  4. 02 Dec, 2020 4 commits
    • blk-throttle: don't check whether or not lower limit is valid if... · acaf523a
      Yu Kuai authored
      blk-throttle: don't check whether or not lower limit is valid if CONFIG_BLK_DEV_THROTTLING_LOW is off
      
      blk_throtl_update_limit_valid() searches the descendants to see if the
      'LIMIT_LOW' bps/iops limits for READ/WRITE are nonzero. However, these
      limits are always zero if CONFIG_BLK_DEV_THROTTLING_LOW is not set, so
      a lot of time is wasted iterating over the descendants.
      
      Thus, do nothing in blk_throtl_update_limit_valid() in that case (a
      sketch follows).
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
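
      One way to express the short-circuit, roughly along the lines of the
      patch (the stub form is an assumption, not a quote of the diff):

          #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
          static void blk_throtl_update_limit_valid(struct throtl_data *td)
          {
                  /* walk descendants and check LIMIT_LOW bps/iops ... */
          }
          #else
          /* LIMIT_LOW is always zero without THROTTLING_LOW, so the walk
           * would find nothing; make it a no-op.
           */
          static inline void blk_throtl_update_limit_valid(struct throtl_data *td)
          {
          }
          #endif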
    • block: fix inflight statistics of part0 · b0d97557
      Jeffle Xu authored
      The inflight count of partition 0 doesn't include inflight IOs to its
      sub-partitions, since mq currently calculates the inflight count of a
      specific partition by simply comparing the partition pointer of each
      request.
      
      Thus the following case is possible:
      
      $ cat /sys/block/vda/inflight
             0        0
      $ cat /sys/block/vda/vda1/inflight
             0      128
      
      A single-queue device (on an older kernel, e.g. v3.10) doesn't have
      this issue:
      
      $cat /sys/block/sda/sda3/inflight
             0       33
      $cat /sys/block/sda/inflight
             0       33
      
      Partition 0 should be handled specially since it represents the whole
      disk (the gist is sketched below). This issue was introduced by commit
      bf0ddaba ("blk-mq: fix sysfs inflight counter").
      
      Besides, this patch also fixes the inflight statistics of part 0 in
      /proc/diskstats. Before this patch, the inflight statistics of part 0
      didn't include those of the sub-partitions. (The 'inflight' field is
      marked with asterisks.)
      
      $cat /proc/diskstats
       259       0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
       259       2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0
      
      That was introduced by commit f299b7c7 ("blk-mq: provide internal
      in-flight variant").
      
      Fixes: bf0ddaba ("blk-mq: fix sysfs inflight counter")
      Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [axboe: adapt for 5.11 partition change]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
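
      The gist of the fix, as a sketch of the per-request check (the helper
      name is hypothetical and field names are approximate): when the
      partition being queried is the whole disk, count every in-flight
      request instead of only those whose rq->part pointer matches:

          /* illustrative predicate used while iterating in-flight requests */
          static bool rq_counts_for_part(struct request *rq,
                                         struct block_device *part)
          {
                  /* part0 represents the whole disk: match everything */
                  if (!part->bd_partno)
                          return true;
                  return rq->part == part;
          }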
    • bio: optimise bvec iteration · 22b56c29
      Pavel Begunkov authored
      __bio_for_each_bvec(), __bio_for_each_segment() and bio_copy_data_iter()
      all satisfy the conditions of bvec_iter_advance_single(), which is a
      faster and slimmer version of bvec_iter_advance(). Add
      bio_advance_iter_single() and convert them (an approximation of the
      helper follows this entry).
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
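
      A sketch of what the new helper plausibly looks like, mirroring
      bio_advance_iter() but calling the single-step bvec helper (treat this
      as an approximation, not the literal patch):

          static inline void bio_advance_iter_single(const struct bio *bio,
                                                     struct bvec_iter *iter,
                                                     unsigned int bytes)
          {
                  iter->bi_sector += bytes >> 9;

                  if (bio_no_advance_iter(bio))
                          iter->bi_size -= bytes;
                  else
                          bvec_iter_advance_single(bio->bi_io_vec, iter, bytes);
          }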
    • block: optimise for_each_bvec() advance · 6b6667aa
      Pavel Begunkov authored
      Because of how for_each_bvec() works, it never advances across multiple
      entries at a time, so bvec_iter_advance() is overkill. Add a
      specialised bvec_iter_advance_single() that is faster (sketched after
      this entry). It also handles zero-length bvecs, so
      bvec_iter_skip_zero_bvec() can be killed.
      
         text    data     bss     dec     hex filename
      before:
        23977     805       0   24782    60ce lib/iov_iter.o
      before, bvec_iter_advance() w/o WARN_ONCE()
        22886     600       0   23486    5bbe ./lib/iov_iter.o
      after:
        21862     600       0   22462    57be lib/iov_iter.o
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
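
      A plausible shape for the single-step helper (an approximation; the
      upstream version may differ in detail): since for_each_bvec() consumes
      at most one bvec entry per step, the advance only ever needs to bump
      bi_bvec_done and, when the current entry is exhausted, move to the
      next index:

          static inline void bvec_iter_advance_single(const struct bio_vec *bv,
                                                      struct bvec_iter *iter,
                                                      unsigned int bytes)
          {
                  unsigned int done = iter->bi_bvec_done + bytes;

                  /* Finished the current bvec (also true for a zero-length
                   * bvec): step to the next one.
                   */
                  if (done == bv[iter->bi_idx].bv_len) {
                          done = 0;
                          iter->bi_idx++;
                  }
                  iter->bi_bvec_done = done;
                  iter->bi_size -= bytes;
          }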
  5. 01 Dec, 2020 17 commits