Commits · 2971c058746319e9853919553259cef7fe280c94 · Kirill Smelkov / linux

22 Jun, 2023 5 commits

Documentation: dm-integrity: Document an example of how the tunables relate. · 2971c058

Russell Harmon authored Jun 04, 2023

Signed-off-by: Russell Harmon <eatnumber1@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

2971c058

Documentation: dm-integrity: Document default values. · 52145f28

Russell Harmon authored Jun 04, 2023

Signed-off-by: Russell Harmon <eatnumber1@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

52145f28

Documentation: dm-integrity: Document the meaning of "buffer". · 3b671459

Russell Harmon authored Jun 04, 2023

"Buffers" are buffers of the metadata/checksum area of dm-integrity.
They are always at most as large as a single metadata area on-disk, but
may be smaller.
Signed-off-by: Russell Harmon <eatnumber1@gmail.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

3b671459

Documentation: dm-integrity: Fix minor grammatical error. · c3ba5aa6

Russell Harmon authored Jun 04, 2023

"where dm-integrity uses bitmap" becomes "where dm-integrity uses a
bitmap"
Signed-off-by: Russell Harmon <eatnumber1@gmail.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

c3ba5aa6

dm integrity: Use %*ph for printing hexdump of a small buffer · 25c9a4ab

Andy Shevchenko authored Jun 13, 2023

The kernel already has a helper to print a hexdump of a small
buffer via a pointer extension. Use that instead of the open-coded
variant.

In the long term, this helps to kill pr_cont() or at least narrow down
its use.

Note that the format changes slightly: the trailing space is
always printed. Also, the IV dump is limited to 64 bytes, which seems
fine.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

25c9a4ab
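
The `%*ph` conversion referred to in the commit above prints a small buffer as space-separated hex bytes. The following is a minimal userspace sketch of that output shape, not the kernel implementation: the function name `format_small_hexdump` and the constant `HEX_DUMP_MAX` are hypothetical, and the 64-byte cap mirrors the limit mentioned in the commit message.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical cap, mirroring the 64-byte limit noted in the commit message. */
#define HEX_DUMP_MAX 64

/*
 * Format up to HEX_DUMP_MAX bytes of buf into out as two-digit lowercase
 * hex values separated by single spaces, e.g. "00 01 ab".
 * Returns the number of characters written, excluding the terminating NUL.
 * Output is truncated at a byte boundary if out is too small.
 */
size_t format_small_hexdump(char *out, size_t out_size,
                            const unsigned char *buf, size_t len)
{
    size_t i, pos = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';
    if (len > HEX_DUMP_MAX)
        len = HEX_DUMP_MAX;

    for (i = 0; i < len; i++) {
        /* Room needed: optional separator, two hex digits, and the NUL. */
        size_t need = (i ? 1 : 0) + 2 + 1;

        if (out_size - pos < need)
            break;
        if (i)
            out[pos++] = ' ';
        snprintf(out + pos, 3, "%02x", (unsigned int)buf[i]);
        pos += 2;
    }
    return pos;
}
```

A caller would pass a destination of at least `3 * len` bytes to avoid truncation.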

16 Jun, 2023 20 commits

dm thin: disable discards for thin-pool if no_discard_passdown · fa375646

Mike Snitzer authored Jun 16, 2023

Also rename disable_passdown_if_not_supported to
disable_discard_passdown_if_not_supported.

And fold passdown_enabled() into its only caller.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

fa375646

dm: remove stale/redundant dm_internal_{suspend,resume} prototypes in dm.h · 862c6663

Mike Snitzer authored Jun 15, 2023

dm_internal_suspend() no longer exists.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

862c6663

dm: skip dm-stats work in alloc_io() unless needed · c4f512d2

Mike Snitzer authored Jun 13, 2023

Don't call dm_stats_record_start() if dm_stats_used() is false.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

c4f512d2

dm: avoid needless dm_io access if all IO accounting is disabled · 06eed768

Mike Snitzer authored Jun 13, 2023

Update dm_io_acct() to eliminate most dm_io struct accesses if both
block core's IO stats and dm-stats are disabled.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

06eed768

dm: support turning off block-core's io stats accounting · 526d1006

Li Nan authored Jun 13, 2023

Commit bc58ba94 ("block: add sysfs file for controlling io stats
accounting") allowed users to turn off disk stat accounting completely
by checking if queue flag QUEUE_FLAG_IO_STAT is set. In dm, this flag
is neither set nor checked: so block-core's io stats are continuously
counted and cannot be turned off.

Add support for turning off block-core's io stats accounting for dm.
Set QUEUE_FLAG_IO_STAT for dm's request_queue. If QUEUE_FLAG_IO_STAT
is set when an io starts, record the need for block core's io stats by
setting the DM_IO_BLK_STAT dm_io flag to avoid io stats being disabled
in the middle of the io.

DM statistics (dm-stats) is independent of block-core's io stats and
remains unchanged.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

526d1006
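
The idea in the commit above, recording at io start whether block-core accounting is needed so that a later toggle cannot unbalance the counters, can be sketched in userspace as follows. All names here (`sketch_queue`, `sketch_io`, `sketch_io_start`, `sketch_io_end`) are hypothetical stand-ins, not the dm structures or functions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a request_queue with an io-stat switch. */
struct sketch_queue {
    bool io_stat_enabled;      /* plays the role of QUEUE_FLAG_IO_STAT */
    unsigned long started;     /* ios accounted at start */
    unsigned long completed;   /* ios accounted at completion */
};

/* Hypothetical stand-in for a dm_io with a per-io flag. */
struct sketch_io {
    bool blk_stat;             /* plays the role of DM_IO_BLK_STAT */
};

/* Snapshot the queue flag once, when the io starts. */
void sketch_io_start(struct sketch_queue *q, struct sketch_io *io)
{
    io->blk_stat = q->io_stat_enabled;
    if (io->blk_stat)
        q->started++;
}

/* Use the snapshot, not the live queue flag, when the io ends. */
void sketch_io_end(struct sketch_queue *q, struct sketch_io *io)
{
    if (io->blk_stat)
        q->completed++;
}
```

Because the end path consults the per-io snapshot, disabling accounting in the middle of an io still leaves the start and completion counts balanced.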

dm zone: Use the bitmap API to allocate bitmaps · e118029c

Christophe JAILLET authored Jun 07, 2023

Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.
It is less verbose and it improves the semantics.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

e118029c
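
For readers unfamiliar with the bitmap API named above, here is a rough userspace sketch of what a zeroed bitmap allocation amounts to: rounding the bit count up to whole `unsigned long` words and zero-filling them. The `_sketch` names are hypothetical; the kernel's own helpers also take GFP flags, which are omitted here.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Number of unsigned longs needed to hold nbits bits, rounding up. */
#define SKETCH_BITS_TO_LONGS(nbits) \
    (((nbits) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Allocate a bitmap of nbits bits with every bit cleared. */
unsigned long *bitmap_zalloc_sketch(size_t nbits)
{
    return calloc(SKETCH_BITS_TO_LONGS(nbits), sizeof(unsigned long));
}

/* Release a bitmap obtained from bitmap_zalloc_sketch(). */
void bitmap_free_sketch(unsigned long *bitmap)
{
    free(bitmap);
}

/* Set one bit; the caller guarantees bit < nbits. */
void bitmap_set_bit_sketch(unsigned long *bitmap, size_t bit)
{
    bitmap[bit / SKETCH_BITS_PER_LONG] |= 1UL << (bit % SKETCH_BITS_PER_LONG);
}

/* Return 1 if the bit is set, 0 otherwise. */
int bitmap_test_bit_sketch(const unsigned long *bitmap, size_t bit)
{
    return (bitmap[bit / SKETCH_BITS_PER_LONG] >>
            (bit % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

The point of the dedicated helpers is that the caller states the size in bits and no longer repeats the rounding arithmetic at each allocation site.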

dm thin metadata: Fix ABBA deadlock by resetting dm_bufio_client · d4830012

Li Lingfeng authored Jun 05, 2023

As described in commit 8111964f ("dm thin: Fix ABBA deadlock between
shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be
triggered because shrinker_rwsem currently needs to be held by
dm_pool_abort_metadata() as a side-effect of thin-pool metadata
operation failure.

The following three problem scenarios have been noticed:

1) Described by commit 8111964f ("dm thin: Fix ABBA deadlock between
   shrink_slab and dm_pool_abort_metadata")

2) shrinker_rwsem and throttle->lock
          P1(drop cache)                        P2(kworker)
drop_caches_sysctl_handler
 drop_slab
  shrink_slab
   down_read(&shrinker_rwsem)  - LOCK A
   do_shrink_slab
    super_cache_scan
     prune_icache_sb
      dispose_list
       evict
        ext4_evict_inode
         ext4_clear_inode
          ext4_discard_preallocations
           ext4_mb_load_buddy_gfp
            ext4_mb_init_cache
             ext4_wait_block_bitmap
              __ext4_error
               ext4_handle_error
                ext4_commit_super
                 ...
                 dm_submit_bio
                                     do_worker
                                      throttle_work_update
                                       down_write(&t->lock) -- LOCK B
                                      process_deferred_bios
                                       commit
                                        metadata_operation_failed
                                         dm_pool_abort_metadata
                                          dm_block_manager_create
                                           dm_bufio_client_create
                                            register_shrinker
                                             down_write(&shrinker_rwsem)
                                             -- LOCK A
                 thin_map
                  thin_bio_map
                   thin_defer_bio_with_throttle
                    throttle_lock
                     down_read(&t->lock)  - LOCK B

3) shrinker_rwsem and wait_on_buffer
          P1(drop cache)                            P2(kworker)
drop_caches_sysctl_handler
 drop_slab
  shrink_slab
   down_read(&shrinker_rwsem)  - LOCK A
   do_shrink_slab
   ...
    ext4_wait_block_bitmap
     __ext4_error
      ext4_handle_error
       jbd2_journal_abort
        jbd2_journal_update_sb_errno
         jbd2_write_superblock
          submit_bh
           // LOCK B
           // RELEASE B
                             do_worker
                              throttle_work_update
                               down_write(&t->lock) - LOCK B
                              process_deferred_bios
                               process_bio
                               commit
                                metadata_operation_failed
                                 dm_pool_abort_metadata
                                  dm_block_manager_create
                                   dm_bufio_client_create
                                    register_shrinker
                                     register_shrinker_prepared
                                      down_write(&shrinker_rwsem)  - LOCK A
                               bio_endio
      wait_on_buffer
       __wait_on_buffer

Fix these by resetting dm_bufio_client without holding shrinker_rwsem.

Fixes: 8111964f ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

d4830012

dm crypt: fix crypt_ctr_cipher_new return value on invalid AEAD cipher · 2a32897c

Mikulas Patocka authored May 24, 2023

If the user specifies an invalid AEAD cipher, dm-crypt should return the
error returned from crypt_ctr_auth_spec, not -ENOMEM.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

2a32897c
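
The pattern behind the fix above is simply to propagate a helper's error code instead of replacing it with an unrelated one. A minimal sketch, assuming a hypothetical parser `sketch_parse_auth_spec()` in place of crypt_ctr_auth_spec (whose real validation logic is not reproduced here):

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the AEAD spec parser: accepts only strings
 * starting with "authenc(" and rejects everything else with -EINVAL.
 */
int sketch_parse_auth_spec(const char *spec)
{
    if (spec == NULL || strncmp(spec, "authenc(", 8) != 0)
        return -EINVAL;
    return 0;
}

/* Constructor-style caller that passes the helper's error through. */
int sketch_ctr_cipher(const char *spec)
{
    int ret = sketch_parse_auth_spec(spec);

    if (ret < 0)
        return ret;   /* not a hard-coded -ENOMEM */
    return 0;
}
```

With this shape, a user who supplies a bad cipher string sees an "invalid argument" style error rather than a misleading out-of-memory report.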

dm thin: update .io_hints methods to not require handling discards last · ef6953fb

Mike Snitzer authored May 31, 2023

Remove assumptions about what might follow the discard setup code
(previously the code would return early if discards were not enabled).

This makes it possible to add more capabilities to the end of each .io_hints
method (which is the natural thing to do when adding new features).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

ef6953fb

dm thin: remove return code variable in pool_map · c0a7a0ac

Mike Snitzer authored May 15, 2023

Always returns DM_MAPIO_REMAPPED so no need for variable.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

c0a7a0ac

dm flakey: introduce random_read_corrupt and random_write_corrupt options · 4c2c845b

Mikulas Patocka authored May 01, 2023

The random_read_corrupt and random_write_corrupt options corrupt a
random byte in a bio with the provided probability. The corruption
only happens in the "down" interval.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

4c2c845b
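
As an illustration of "corrupt a random byte with the provided probability", the sketch below uses a small xorshift generator and an integer probability. The scale of 1000000000 for 100% is an assumption made for this sketch, and every name in it is hypothetical; it does not reproduce dm-flakey's code or its handling of the "down" interval.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed scale for this sketch: probability is out of 1,000,000,000. */
#define SKETCH_PROBABILITY_BASE 1000000000u

static uint32_t sketch_rng_state = 2463534242u;

/* xorshift32 pseudo-random generator, used so results do not depend on RAND_MAX. */
uint32_t sketch_random_u32(void)
{
    uint32_t x = sketch_rng_state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    sketch_rng_state = x;
    return x;
}

/*
 * With the given probability, overwrite one randomly chosen byte of data
 * with a random value. Returns 1 if a byte was overwritten, 0 otherwise.
 * The new value may happen to equal the old one.
 */
int sketch_random_corrupt(unsigned char *data, size_t len,
                          uint32_t probability)
{
    if (len == 0)
        return 0;
    if (sketch_random_u32() % SKETCH_PROBABILITY_BASE >= probability)
        return 0;
    data[sketch_random_u32() % len] = (unsigned char)sketch_random_u32();
    return 1;
}
```

A probability of 0 never touches the buffer, and a probability equal to the base always overwrites one byte.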

dm flakey: clone pages on write bio before corrupting them · 1d9a9438

Mikulas Patocka authored May 01, 2023

dm-flakey has an option to corrupt write bios. It corrupts the memory that
is being written. This can cause system crashes or security bugs. For
example, if the user writes shared library code with the O_DIRECT flag to a
dm-flakey device, it corrupts the memory for all users that have the
shared library mapped.

Fix this bug by cloning the bio and corrupting the clone rather than
the original.

Also drop the test for ZERO_PAGE(0) - it can't happen because we write
the cloned pages.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

1d9a9438
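
The "clone, then corrupt the clone" approach described above can be shown with plain buffers. This is a sketch under the assumption that a heap copy stands in for the cloned bio pages; `sketch_corrupt_clone` is a hypothetical name.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a heap-allocated copy of orig with one byte replaced.
 * The caller's buffer is never modified. Returns NULL if byte_index is
 * out of range or the allocation fails. The caller frees the result.
 */
unsigned char *sketch_corrupt_clone(const unsigned char *orig, size_t len,
                                    size_t byte_index, unsigned char value)
{
    unsigned char *clone;

    if (byte_index >= len)
        return NULL;
    clone = malloc(len);
    if (clone == NULL)
        return NULL;
    memcpy(clone, orig, len);
    clone[byte_index] = value;   /* only the copy is damaged */
    return clone;
}
```

The corrupted copy is what would be written to the device, while any other mapping of the original memory keeps seeing intact data.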

dm crypt: allocate compound pages if possible · 5054e778

Mikulas Patocka authored May 01, 2023

It was reported that allocating pages for the write buffer in dm-crypt
causes measurable overhead [1].

Change dm-crypt to allocate compound pages if they are available. If
not, fall back to the mempool.

[1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

5054e778

blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none · 24516565

Ming Lei authored Jun 16, 2023

By the time q->sysfs_lock has been grabbed, q->elevator may have become
NULL because of an elevator switch.

Fix the NULL dereference on q->elevator by checking it with the lock held.
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

24516565

iov_iter: remove iov_iter_get_pages and iov_iter_get_pages_alloc · 84bd06c6

Christoph Hellwig authored Jun 14, 2023

Now that the direct I/O helpers have switched to use
iov_iter_extract_pages, these helpers are unused.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

84bd06c6

block: remove BIO_PAGE_REFFED · e4cc6465

Christoph Hellwig authored Jun 14, 2023

Now that all block direct I/O helpers use page pinning, this flag is
unused.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e4cc6465

splice: simplify a conditional in copy_splice_read · 2e82f6c3

Christoph Hellwig authored Jun 14, 2023

Check for -EFAULT instead of wrapping the check in a ret < 0 block.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e82f6c3

splice: don't call file_accessed in copy_splice_read · 0b24be46

Christoph Hellwig authored Jun 14, 2023

copy_splice_read calls into ->read_iter to read the data, which already
calls file_accessed.

Fixes: 33b3b041 ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b24be46

Merge tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme into for-6.5/block · 236f2552

Jens Axboe authored Jun 16, 2023

Pull NVMe updates from Keith:

"nvme updates for Linux 6.5

 - Various cleanups all around (Irvin, Chaitanya, Christophe)
 - Better struct packing (Christophe JAILLET)
 - Reduce controller error logs for optional commands (Keith)
 - Support for >=64KiB block sizes (Daniel Gomez)
 - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner)"

* tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme: (27 commits)
  nvme: forward port sysfs delete fix
  nvme: skip optional id ctrl csi if it failed
  nvme-core: use nvme_ns_head_multipath instead of ns->head->disk
  nvmet-fcloop: Do not wait on completion when unregister fails
  nvme-fabrics: open code __nvmf_host_find()
  nvme-fabrics: error out to unlock the mutex
  nvme: Increase block size variable size to 32-bit
  nvme-fcloop: no need to return from void function
  nvmet-auth: remove unnecessary break after goto
  nvmet-auth: remove some dead code
  nvme-core: remove redundant check from nvme_init_ns_head
  nvme: move sysfs code to a dedicated sysfs.c file
  nvme-fabrics: prevent overriding of existing host
  nvme-fabrics: check hostid using uuid_equal
  nvme-fabrics: unify common code in admin and io queue connect
  nvmet: reorder fields in 'struct nvmefc_fcp_req'
  nvmet: reorder fields in 'struct nvme_dhchap_queue_context'
  nvmet: reorder fields in 'struct nvmf_ctrl_options'
  nvme: reorder fields in 'struct nvme_ctrl'
  nvmet: reorder fields in 'struct nvmet_sq'
  ...

236f2552

nvme: forward port sysfs delete fix · 1c606f7f

Keith Busch authored Jun 15, 2023

We had a late fix that modified nvme_sysfs_delete() after the staging
branch for the next merge window relocated the function to a new file.
Port commit 2eb94dd5 ("nvme: do not let the user delete a ctrl
before a complete") to the latest to avoid a potentially confusing merge
conflict.

Cc: Maurizio Lombardi <mlombard@redhat.com>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

1c606f7f

15 Jun, 2023 9 commits

bcache: fixup btree_cache_wait list damage · f0854489

Mingzhe Zou authored Jun 15, 2023

We get a kernel crash about "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."

crash> struct list_head 0xffff9c801bc01210
struct list_head {
  next = 0xffffae586d8afe68,
  prev = 0xffffae586d8afe68
}
crash> struct list_head 0xffff9c77b688237c
struct list_head {
  next = 0x0,
  prev = 0x0
}
crash> struct list_head 0xffffae586d8afe68
struct list_head struct: invalid kernel virtual address: ffffae586d8afe68  type: "gdb_readmem_callback"
Cannot access memory at address 0xffffae586d8afe68

[230469.019492] Call Trace:
[230469.032041]  prepare_to_wait+0x8a/0xb0
[230469.044363]  ? bch_btree_keys_free+0x6c/0xc0 [escache]
[230469.056533]  mca_cannibalize_lock+0x72/0x90 [escache]
[230469.068788]  mca_alloc+0x2ae/0x450 [escache]
[230469.080790]  bch_btree_node_get+0x136/0x2d0 [escache]
[230469.092681]  bch_btree_check_thread+0x1e1/0x260 [escache]
[230469.104382]  ? finish_wait+0x80/0x80
[230469.115884]  ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
[230469.127259]  kthread+0x112/0x130
[230469.138448]  ? kthread_flush_work_fn+0x10/0x10
[230469.149477]  ret_from_fork+0x35/0x40

bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only one thread
can do it at a time, so the op of other threads will be added to the
btree_cache_wait list.

We must call finish_wait() to remove op from btree_cache_wait before freeing
its memory. Otherwise, the list will be damaged. We should also call
bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake up
other waiters.

Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Cc: stable@vger.kernel.org
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f0854489

bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent · 80fca8a1

Zheng Wang authored Jun 15, 2023

In some specific situations, the return value of __bch_btree_node_alloc
may be NULL. This may lead to a NULL pointer dereference in a
caller, for example along the call chain
btree_split->bch_btree_node_alloc->__bch_btree_node_alloc.

Fix it by initializing the return value in __bch_btree_node_alloc.

Fixes: cafe5635 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

80fca8a1

bcache: Remove unnecessary NULL point check in node allocations · 028ddcac

Zheng Wang authored Jun 15, 2023

Due to the previous fix of __bch_btree_node_alloc, the return value will
never be a NULL pointer, so IS_ERR is enough to handle the failure
case. Replace the IS_ERR_OR_NULL check with an IS_ERR check.

Fixes: cafe5635 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

028ddcac
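
The two bcache allocation commits above depend on the kernel's error-pointer convention, where a failing function returns an errno encoded in the pointer value rather than NULL. The following userspace sketch imitates that convention; the `sketch_` helpers are hypothetical re-creations of ERR_PTR/PTR_ERR/IS_ERR, and the choice of -EAGAIN for the "no bucket" case is an assumption for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Errno values are assumed to fit in the top 4095 addresses. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno from an error pointer. */
long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for encoded errors. Note that NULL is not an error pointer. */
int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_node {
    int level;
};

/*
 * Allocation that never returns NULL: every failure path yields an
 * error pointer, so callers can check with sketch_is_err() alone.
 */
struct sketch_node *sketch_node_alloc(int have_bucket)
{
    struct sketch_node *n;

    if (!have_bucket)
        return sketch_err_ptr(-EAGAIN);
    n = calloc(1, sizeof(*n));
    if (n == NULL)
        return sketch_err_ptr(-ENOMEM);
    return n;
}
```

Since an IS_ERR-style test treats NULL as a valid pointer, a function that could return NULL forced callers to use the broader IS_ERR_OR_NULL check. Once every failure path returns an encoded error, the narrower check suffices.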

bcache: Remove dead references to cache_readaheads · ccb8c3bd

Andrea Tomassetti authored Jun 15, 2023

The cache_readaheads stat counter is not used anymore and should be
removed.
Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ccb8c3bd

bcache: make kobj_type structures constant · b98dd0b0

Thomas Weißschuh authored Jun 15, 2023

Since commit ee6d3dd4 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b98dd0b0

bcache: Convert to use sysfs_emit()/sysfs_emit_at() APIs · a301b2de

ye xingchen authored Jun 15, 2023

Follow the advice of Documentation/filesystems/sysfs.rst: show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a301b2de
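
The motivation for sysfs_emit() is that a show() buffer is one page long, so formatting into it should be bounded by that size. A hedged userspace approximation follows; `sketch_sysfs_emit` and `SKETCH_PAGE_SIZE` are hypothetical names, the 4096-byte page is an assumption, and the kernel helper's extra sanity checks are not modeled.

```c
#include <stdarg.h>
#include <stdio.h>

/* Assumed page size for this sketch. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Format into buf, which must be at least SKETCH_PAGE_SIZE bytes long.
 * Returns the number of characters actually stored, excluding the NUL,
 * so the result never exceeds SKETCH_PAGE_SIZE - 1.
 */
int sketch_sysfs_emit(char *buf, const char *fmt, ...)
{
    va_list args;
    int len;

    va_start(args, fmt);
    len = vsnprintf(buf, SKETCH_PAGE_SIZE, fmt, args);
    va_end(args);

    if (len < 0)
        return 0;
    if (len >= SKETCH_PAGE_SIZE)
        len = SKETCH_PAGE_SIZE - 1;   /* output was truncated */
    return len;
}
```

Compared with an unbounded sprintf() into the page, the return value here is always a length that the caller can safely hand back to user space.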

block: fix blktrace debugfs entries leakage · dd7de370

Yu Kuai authored Jun 10, 2023

Commit 99d055b4 ("block: remove per-disk debugfs files in
blk_unregister_queue") moved blk_trace_shutdown() from
blk_release_queue() to blk_unregister_queue(). This is safe if blktrace
is created through sysfs; however, there is a regression in a corner
case.

blktrace can still be enabled after del_gendisk() through ioctl if
the disk is opened before del_gendisk(), and if blktrace is not shut down
through ioctl before closing the disk, debugfs entries will be leaked.

Fix this problem by shutting down blktrace in disk_release(). This is safe
because blk_trace_remove() is reentrant.

Fixes: 99d055b4 ("block: remove per-disk debugfs files in blk_unregister_queue")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dd7de370

scsi: sg: fix blktrace debugfs entries leakage · db59133e

Yu Kuai authored Jun 10, 2023

sg_ioctl() supports enabling blktrace, which creates debugfs entries under
"/sys/kernel/debug/block/sgx/". However, there is no guarantee that the user
will remove these entries through ioctl, and deleting the sg device doesn't
clean up these blktrace entries.

This problem could be fixed by cleaning up blktrace while releasing the
request_queue. However, it's not a good idea to do this special handling
in the common layer just for the sg device.

Fix this problem by shutting down blktrace in sg_device_destroy(), where the
device is deleted and all the users have closed the device. Also grab a
scsi_device reference in sg_add_device() to prevent the scsi_device from being
freed before sg_device_destroy().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230610022003.2557284-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

db59133e

blktrace: use inline function for blk_trace_remove() while blktrace is disabled · cbe7cff4

Yu Kuai authored Jun 10, 2023

If the config option is disabled, calling blk_trace_remove() directly triggers
a build warning. Use an inline function instead, in preparation for fixing the
blktrace debugfs entries leakage.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230610022003.2557284-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cbe7cff4
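
The general technique in the commit above is to provide a typed `static inline` stub when a feature is compiled out, so that callers can invoke the function unconditionally as a statement. A minimal sketch, with hypothetical names (`SKETCH_CONFIG_IO_TRACE`, `sketch_queue`, `sketch_blk_trace_remove`) and an assumed -ENOTTY return value for the disabled case:

```c
#include <errno.h>
#include <stddef.h>

struct sketch_queue;   /* opaque, only used through pointers here */

#ifdef SKETCH_CONFIG_IO_TRACE
/* Feature enabled: the real implementation lives in another file. */
int sketch_blk_trace_remove(struct sketch_queue *q);
#else
/*
 * Feature disabled: a stub with the same signature. Unlike an
 * expression macro, calling it as a bare statement is a normal
 * function call and still type-checks the argument.
 */
static inline int sketch_blk_trace_remove(struct sketch_queue *q)
{
    (void)q;
    return -ENOTTY;
}
#endif
```

Callers then need no `#ifdef` of their own: the same call compiles in both configurations.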

14 Jun, 2023 4 commits

brd: use cond_resched instead of cond_resched_rcu · 6dd4423f

Pankaj Raghav authored Jun 14, 2023

The body of the loop is run without RCU lock held. Use the regular
cond_resched() instead of cond_resched_rcu().

Fixes: 786bb024 ("brd: use XArray instead of radix-tree to index backing pages")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6dd4423f

blk-mq: check on cpu id when there is only one ctx mapping · 30654614

Ed Tsai authored Jun 14, 2023

Commit f168420c ("blk-mq: don't redirect completion for hctx withs
only one ctx mapping") relies on the fact that when nvme applies a 1:1
mapping of hctx and ctx, there will be no remote request.

But for ufs, the submission and completion queues can be asymmetric
(e.g. multiple SQs share one CQ). Therefore, a 1:1 mapping of hctx and
ctx does not mean the request completes on the submission CPU. In this
situation, the nr_ctx check could violate QUEUE_FLAG_SAME_FORCE. As a
result, also check the CPU id when there is only one ctx mapping.
Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com>
Suggested-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com
[axboe: fixed up indentation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

30654614

Merge tag 'md-next-20230613' of... · 60701311

Jens Axboe authored Jun 14, 2023

Merge tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.5/block

Pull MD updates from Song:

"The major changes are:

 1. Protect md_thread with rcu, by Yu Kuai;
 2. Various non-urgent raid5 and raid1/10 fixes, by Yu Kuai;
 3. Non-urgent raid10 fixes, by Li Nan."

* tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (29 commits)
  md/raid1-10: limit the number of plugged bio
  md/raid1-10: don't handle pluged bio by daemon thread
  md/md-bitmap: add a new helper to unplug bitmap asynchrously
  md/raid1-10: submit write io directly if bitmap is not enabled
  md/raid1-10: factor out a helper to submit normal write
  md/raid1-10: factor out a helper to add bio to plug
  md/raid10: prevent soft lockup while flush writes
  md/raid10: fix io loss while replacement replace rdev
  md/raid10: Do not add spare disk when recovery fails
  md/raid10: clean up md_add_new_disk()
  md/raid10: prioritize adding disk to 'removed' mirror
  md/raid10: improve code of mrdev in raid10_sync_request
  md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
  md/raid5: don't start reshape when recovery or replace is in progress
  md: protect md_thread with rcu
  md/bitmap: factor out a helper to set timeout
  md/bitmap: always wake up md_thread in timeout_store
  dm-raid: remove useless checking in raid_message()
  md: factor out a helper to wake up md_thread directly
  md: fix duplicate filename for rdev
  ...

60701311

block: Fix dio_cleanup() to advance the head index · d44c4042

David Howells authored Jun 13, 2023

Fix dio_cleanup() to advance the head index into the list of pages past
the pages it has released, as __blockdev_direct_IO() will call it twice if
do_direct_IO() fails.

The issue was causing:

        WARNING: CPU: 6 PID: 2220 at mm/gup.c:76 try_get_folio

This can be triggered by setting up a clean pair of UDF filesystems on
loopback devices and running the generic/451 xfstest with them as the
scratch and test partitions.  Something like the following:

    fallocate /mnt2/udf_scratch -l 1G
    fallocate /mnt2/udf_test -l 1G
    mknod /dev/lo0 b 7 0
    mknod /dev/lo1 b 7 1
    losetup lo0 /mnt2/udf_scratch
    losetup lo1 /mnt2/udf_test
    mkfs -t udf /dev/lo0
    mkfs -t udf /dev/lo1
    cd xfstests
    ./check generic/451

with xfstests configured by putting the following into local.config:

    export FSTYP=udf
    export DISABLE_UDF_TEST=1
    export TEST_DEV=/dev/lo1
    export TEST_DIR=/xfstest.test
    export SCRATCH_DEV=/dev/lo0
    export SCRATCH_MNT=/xfstest.scratch

Fixes: 1ccf164e ("block: Use iov_iter_extract_pages() and page pinning in direct-io.c")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202306120931.a9606b88-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1193485.1686693279@warthog.procyon.org.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d44c4042
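
Why advancing the head index matters when the cleanup routine can run twice is easiest to see in a toy model. In the sketch below, `sketch_dio` and `sketch_dio_cleanup` are hypothetical, and an integer pin count per slot stands in for real page pinning.

```c
#include <stddef.h>

#define SKETCH_DIO_PAGES 64

/* Hypothetical model of the page list kept during direct I/O. */
struct sketch_dio {
    int pinned[SKETCH_DIO_PAGES];   /* pin count per page slot */
    size_t head;                    /* first page not yet released */
    size_t tail;                    /* one past the last pinned page */
};

/*
 * Release every page in [head, tail) and advance head past them.
 * Returns the number of pages released by this call. Because head is
 * advanced, a second call finds nothing left and releases nothing.
 */
size_t sketch_dio_cleanup(struct sketch_dio *dio)
{
    size_t released = 0;

    while (dio->head < dio->tail) {
        dio->pinned[dio->head]--;   /* drop the pin on this page */
        dio->head++;
        released++;
    }
    return released;
}
```

If head were left unchanged, the second call would drop each pin again and drive the counts negative, which is the kind of imbalance the warning in the commit message points at.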

13 Jun, 2023 2 commits

md/raid1-10: limit the number of plugged bio · 460af1f9

Yu Kuai authored May 29, 2023

Bios can be added to the plug without limit, and the following writeback test
can trigger a huge number of plugged bios:

Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test

Test result:
Monitoring /sys/block/md0/inflight shows that inflight keeps increasing
until fio finishes writing. After running for about 2 minutes:

[root@fedora ~]# cat /sys/block/md0/inflight
       0  4474191

Fix the problem by limiting the number of plugged bios based on the number
of copies of the original bio.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com

460af1f9
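
One way to picture "limit the number of plugged bios based on the number of copies" is a counter that triggers a flush once it reaches a per-copy threshold. This is a sketch only: the threshold of 32 per copy and all names (`SKETCH_MAX_PLUG_BIO`, `sketch_plug`, `sketch_add_bio_to_plug`) are assumptions for illustration, not values confirmed by the commit message.

```c
#include <stdbool.h>

/* Assumed per-copy threshold for this sketch. */
#define SKETCH_MAX_PLUG_BIO 32

/* Hypothetical plug state: only the pending count is modeled. */
struct sketch_plug {
    unsigned int count;
};

/*
 * Account one more plugged bio. Returns true when the caller should
 * flush the plug now, in which case the count is reset.
 * copies is the number of member-device bios per original bio.
 */
bool sketch_add_bio_to_plug(struct sketch_plug *plug, unsigned int copies)
{
    plug->count++;
    if (plug->count / SKETCH_MAX_PLUG_BIO >= copies) {
        plug->count = 0;
        return true;
    }
    return false;
}
```

Scaling the limit by the number of copies keeps the bound proportional to the number of original bios, since each original write fans out into one bio per copy.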

md/raid1-10: don't handle pluged bio by daemon thread · 9efcc2c3

Yu Kuai authored May 29, 2023

current->bio_list will be set under submit_bio() context. In this case,
bitmap io will be added to the list and wait for the current io submission to
finish, while the current io submission must wait for bitmap io to be done.
Commit 874807a8 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fixed
the deadlock by handling plugged bios in the daemon thread.

On the one hand, the deadlock no longer exists after commit a214b949
("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On
the other hand, the current solution makes it impossible to flush plugged bios
in raid1/10_make_request(), because this would cause all the writes
to go to the daemon thread.

In order to limit the number of plugged bios, commit 874807a8
("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the
deadlock is fixed by handling bitmap io asynchronously.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com

9efcc2c3