1. 02 Jun, 2022 2 commits
  2. 31 May, 2022 3 commits
  3. 28 May, 2022 7 commits
    • bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() · a1a2d8f0
      Coly Li authored
      The kworker routine update_writeback_rate() is scheduled to update the
      writeback rate every 5 seconds by default. Before calling
      __update_writeback_rate() to do the real work, the kworker routine must
      hold the semaphore dc->writeback_lock.
      
      At the same time, the bcache writeback thread routine
      bch_writeback_thread() also needs to hold dc->writeback_lock before
      flushing dirty data back into the backing device. If the dirty data set
      is large, bch_writeback_thread() might take a very long time to scan all
      dirty buckets and release dc->writeback_lock. In that case
      update_writeback_rate() can be starved long enough for the kernel to
      report a soft lockup warning that starts like:
        watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713]
      
      Such a soft lockup condition is unnecessary, because after the writeback
      thread finishes its job and releases dc->writeback_lock, the kworker
      update_writeback_rate() can continue its work and everything is fine.
      
      This patch avoids the unnecessary soft lockup as follows:
      - Add a new member to struct cached_dev
        - dc->rate_update_retry (0 by default)
      - In update_writeback_rate(), call down_read_trylock(&dc->writeback_lock)
        first; if it fails, the lock is contended.
      - If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), don't
        acquire the lock and reschedule the kworker for the next try.
      - If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, stop retrying
        and call down_read(&dc->writeback_lock) to wait for the lock.
      
      With this method, in the worst case update_writeback_rate() may retry
      for 1+ minutes before blocking on dc->writeback_lock by calling
      down_read(). For a 4TB cache device with 1TB dirty data, 90%+ of the
      unnecessary soft lockup warning messages can be avoided.
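
      The retry scheme above can be modelled outside the kernel. The following
      is a minimal user-space sketch, not the patch itself: struct dev_model,
      rate_update_tick() and the lock_is_free argument are hypothetical
      stand-ins for struct cached_dev, update_writeback_rate() and the result
      of down_read_trylock(). Only the limit of 15 skips comes from the
      message. The sketch also assumes the retry counter is left untouched
      when the trylock succeeds, which the message does not state.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SKIPS 15	/* mirrors BCH_WBRATE_UPDATE_MAX_SKIPS (15) */

/* Hypothetical stand-in for struct cached_dev; only the counter is modelled. */
struct dev_model {
	unsigned int rate_update_retry;
};

enum tick_action { UPDATED, SKIPPED, BLOCKED_THEN_UPDATED };

/*
 * One periodic tick of the rate-update worker. lock_is_free models whether
 * a non-blocking read-lock attempt would succeed at this moment.
 */
static enum tick_action rate_update_tick(struct dev_model *dc, bool lock_is_free)
{
	if (lock_is_free)
		return UPDATED;		/* trylock succeeded: update the rate */

	dc->rate_update_retry++;
	if (dc->rate_update_retry <= MAX_SKIPS)
		return SKIPPED;		/* give up this round, retry next tick */

	dc->rate_update_retry = 0;	/* models falling back to a blocking lock */
	return BLOCKED_THEN_UPDATED;
}

/* Count how many contended ticks are skipped before the worker blocks. */
static unsigned int skips_before_blocking(void)
{
	struct dev_model dc = { 0 };
	unsigned int skipped = 0;

	while (rate_update_tick(&dc, false) == SKIPPED)
		skipped++;
	return skipped;
}
```

      With the default 5 second interval, the skipped ticks counted by
      skips_before_blocking() add up to more than a minute of retries, which
      is where the "1+ minutes" worst case comes from.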
      
      While update_writeback_rate() is retrying to acquire dc->writeback_lock,
      the writeback rate of course cannot be updated. That is fair, because
      the writeback rate cannot be updated either when the kworker is blocked
      on dc->writeback_lock.

      This change follows Jens Axboe's suggestion for a clearer and simpler
      version.
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: use pr_err to output error message · 1243172d
      Yu Kuai authored
      Instead of using the long printk(KERN_ERR "nbd: ...") to output error
      messages, define pr_fmt and use the short pr_err("") to do that. The
      replacement is done by using the following command:
      
        sed -i 's/printk(KERN_ERR "nbd: /pr_err("/g' \
      		  drivers/block/nbd.c
      
      This patch also rewraps to 80 columns where possible.
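
      The effect of defining pr_fmt can be illustrated in user space. This is
      a rough sketch under stated assumptions: pr_err_to_buf() and
      old_err_to_buf() are hypothetical helpers that format into a buffer
      instead of the kernel log, and the message text is made up. Only the
      "nbd: " prefix and the pr_fmt/pr_err pairing come from the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prefix every message once, the way a driver-level pr_fmt definition does. */
#define pr_fmt(fmt) "nbd: " fmt

/*
 * Hypothetical user-space stand-in for pr_err(): formats into a buffer
 * instead of writing to the kernel log. ##__VA_ARGS__ is a GNU extension.
 */
#define pr_err_to_buf(buf, size, fmt, ...) \
	snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

/* Old style for comparison: the prefix is repeated in each format string. */
#define old_err_to_buf(buf, size, fmt, ...) \
	snprintf((buf), (size), fmt, ##__VA_ARGS__)
```

      Both helpers produce the same text when the old call site spells out
      the prefix by hand, which is why the sed replacement does not change
      the logged output.
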
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20220521073749.3146892-7-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: fix possible overflow on 'first_minor' in nbd_dev_add() · 858f1bf6
      Zhang Wensheng authored
      When 'index' is a big number, it may become negative when forced
      to 'int'. Then 'index << part_shift' might overflow to a positive
      value that is not greater than '0xfffff', and sysfs might complain
      about duplicate creation. Moving the 'index' check to the front
      fixes this.
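
      The difference between checking after the shift and checking before it
      can be sketched as follows. This is an illustrative model, not the
      driver code: late_check_accepts() and early_check_accepts() are
      hypothetical names, and unsigned 32-bit arithmetic is used to make the
      wraparound well defined. Only the '0xfffff' limit comes from the
      message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MINOR_MAX 0xfffffu	/* the '0xfffff' limit quoted in the message */

/* Late check: shift first, then compare. The shift may wrap to a small value. */
static bool late_check_accepts(uint32_t index, unsigned int part_shift)
{
	uint32_t first_minor = index << part_shift;

	return first_minor <= MINOR_MAX;
}

/* Early check: validate index before any shift happens. */
static bool early_check_accepts(uint32_t index, unsigned int part_shift)
{
	return index <= (MINOR_MAX >> part_shift);
}
```

      For example, with index 0x08000001 and a part_shift of 5 the shifted
      value wraps to 0x20, so the late check accepts it while the early check
      rejects it.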
      
      Fixes: b0d9111a ("nbd: use an idr to keep track of nbd devices")
      Fixes: 940c2649 ("nbd: fix possible overflow for 'first_minor' in nbd_dev_add()")
      Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20220521073749.3146892-6-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: fix io hung while disconnecting device · 09dadb59
      Yu Kuai authored
      In our tests, "qemu-nbd" triggers an I/O hang:
      
      INFO: task qemu-nbd:11445 blocked for more than 368 seconds.
            Not tainted 5.18.0-rc3-next-20220422-00003-g2176915513ca #884
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:qemu-nbd        state:D stack:    0 pid:11445 ppid:     1 flags:0x00000000
      Call Trace:
       <TASK>
       __schedule+0x480/0x1050
       ? _raw_spin_lock_irqsave+0x3e/0xb0
       schedule+0x9c/0x1b0
       blk_mq_freeze_queue_wait+0x9d/0xf0
       ? ipi_rseq+0x70/0x70
       blk_mq_freeze_queue+0x2b/0x40
       nbd_add_socket+0x6b/0x270 [nbd]
       nbd_ioctl+0x383/0x510 [nbd]
       blkdev_ioctl+0x18e/0x3e0
       __x64_sys_ioctl+0xac/0x120
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7fd8ff706577
      RSP: 002b:00007fd8fcdfebf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 0000000040000000 RCX: 00007fd8ff706577
      RDX: 000000000000000d RSI: 000000000000ab00 RDI: 000000000000000f
      RBP: 000000000000000f R08: 000000000000fbe8 R09: 000055fe497c62b0
      R10: 00000002aff20000 R11: 0000000000000246 R12: 000000000000006d
      R13: 0000000000000000 R14: 00007ffe82dc5e70 R15: 00007fd8fcdff9c0
      
      "qemu-nbd -d" will call ioctl 'NBD_DISCONNECT' first; however, the
      following message was found:
      
      block nbd0: Send disconnect failed -32
      
      This indicates that something is wrong with the server. Then,
      "qemu-nbd -d" will call ioctl 'NBD_CLEAR_SOCK'; however, the ioctl can't
      clear requests after commit 2516ab15 ("nbd: only clear the queue on
      device teardown"). In the meantime, requests can't complete through
      timeout because nbd_xmit_timeout() will always return
      'BLK_EH_RESET_TIMER', which means such requests will never be completed
      in this situation.
      
      Now that the flag 'NBD_CMD_INFLIGHT' makes sure requests won't
      complete multiple times, switch back to calling nbd_clear_sock() in
      nbd_clear_sock_ioctl(), so that inflight requests can be cleared.
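
      The "complete at most once" guarantee that an in-flight flag provides
      can be sketched with C11 atomics. This is a user-space model under
      stated assumptions: struct cmd_model, cmd_send() and
      cmd_complete_once() are hypothetical, and CMD_INFLIGHT is an arbitrary
      bit standing in for 'NBD_CMD_INFLIGHT'. It does not reproduce the
      driver's actual locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CMD_INFLIGHT 0x1u	/* hypothetical bit for the in-flight flag */

struct cmd_model {
	atomic_uint flags;
	int completions;	/* how many times the request was completed */
};

static void cmd_init(struct cmd_model *cmd)
{
	atomic_init(&cmd->flags, 0u);
	cmd->completions = 0;
}

/* Mark the command as in flight when it is sent. */
static void cmd_send(struct cmd_model *cmd)
{
	atomic_fetch_or(&cmd->flags, CMD_INFLIGHT);
}

/*
 * Complete the request only if this caller is the one that clears the
 * in-flight bit; a later caller sees the bit already clear and backs off.
 */
static bool cmd_complete_once(struct cmd_model *cmd)
{
	unsigned int old = atomic_fetch_and(&cmd->flags, ~CMD_INFLIGHT);

	if (!(old & CMD_INFLIGHT))
		return false;
	cmd->completions++;
	return true;
}
```

      In this model a clear-socket path and a reply path could both call
      cmd_complete_once() on the same command, and only the first call would
      complete it.
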
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20220521073749.3146892-5-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed · 2895f183
      Yu Kuai authored
      Otherwise I/O will hang, because a request will only be completed if
      the cmd has the flag 'NBD_CMD_INFLIGHT'.
      
      Fixes: 07175cb1 ("nbd: make sure request completion won't concurrent")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20220521073749.3146892-4-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: fix race between nbd_alloc_config() and module removal · c55b2b98
      Yu Kuai authored
      When the nbd module is being removed, nbd_alloc_config() may be
      called concurrently by nbd_genl_connect(). try_module_get() will
      return false in that case, but nbd_alloc_config() doesn't handle it.
      
      The race may lead to a leak of nbd_config and its related
      resources (e.g., recv_workq) and an oops in nbd_read_stat() due
      to the unload of the nbd module, as shown below:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000040
        Oops: 0000 [#1] SMP PTI
        CPU: 5 PID: 13840 Comm: kworker/u17:33 Not tainted 5.14.0+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
        Workqueue: knbd16-recv recv_work [nbd]
        RIP: 0010:nbd_read_stat.cold+0x130/0x1a4 [nbd]
        Call Trace:
         recv_work+0x3b/0xb0 [nbd]
         process_one_work+0x1ed/0x390
         worker_thread+0x4a/0x3d0
         kthread+0x12a/0x150
         ret_from_fork+0x22/0x30
      
      Fix it by checking the return value of try_module_get()
      in nbd_alloc_config(). As nbd_alloc_config() may return ERR_PTR(-ENODEV),
      assign nbd->config only when nbd_alloc_config() succeeds to ensure
      the value of nbd->config is binary (valid or NULL).
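
      The "assign only on success" pattern can be sketched in user space as
      follows. Everything here is a hypothetical model: err_ptr(), is_err()
      and ptr_err() imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), the
      module_alive argument stands in for the result of try_module_get(),
      and the struct names are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space stand-ins for ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define MAX_ERRNO 4095
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static bool is_err(const void *p) { return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }

struct config_model { int refs; };
struct dev_model { struct config_model *config; };

/* module_alive models whether taking a module reference succeeded. */
static struct config_model *alloc_config(bool module_alive)
{
	struct config_model *cfg;

	if (!module_alive)
		return err_ptr(-ENODEV);
	cfg = calloc(1, sizeof(*cfg));
	if (!cfg)
		return err_ptr(-ENOMEM);
	cfg->refs = 1;
	return cfg;
}

/* Assign dev->config only on success so it stays either valid or NULL. */
static int dev_open(struct dev_model *dev, bool module_alive)
{
	struct config_model *cfg = alloc_config(module_alive);

	if (is_err(cfg))
		return (int)ptr_err(cfg);
	dev->config = cfg;
	return 0;
}
```

      Storing the return value directly in dev->config would leave an
      encoded error value there, which later NULL checks would mistake for a
      valid pointer.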
      
      Also add a debug message to check the reference counter
      of nbd_config during module removal.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20220521073749.3146892-3-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: call genl_unregister_family() first in nbd_cleanup() · 06c4da89
      Yu Kuai authored
      Otherwise there may be a race between module removal and the handling
      of a netlink command, which can lead to an oops as shown below:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000098
        Oops: 0002 [#1] SMP PTI
        CPU: 1 PID: 31299 Comm: nbd-client Tainted: G            E     5.14.0-rc4
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
        RIP: 0010:down_write+0x1a/0x50
        Call Trace:
         start_creating+0x89/0x130
         debugfs_create_dir+0x1b/0x130
         nbd_start_device+0x13d/0x390 [nbd]
         nbd_genl_connect+0x42f/0x748 [nbd]
         genl_family_rcv_msg_doit.isra.0+0xec/0x150
         genl_rcv_msg+0xe5/0x1e0
         netlink_rcv_skb+0x55/0x100
         genl_rcv+0x29/0x40
         netlink_unicast+0x1a8/0x250
         netlink_sendmsg+0x21b/0x430
         ____sys_sendmsg+0x2a4/0x2d0
         ___sys_sendmsg+0x81/0xc0
         __sys_sendmsg+0x62/0xb0
         __x64_sys_sendmsg+0x1f/0x30
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        Modules linked in: nbd(E-)
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20220521073749.3146892-2-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 27 May, 2022 3 commits
  5. 24 May, 2022 4 commits
    • bcache: avoid journal no-space deadlock by reserving 1 journal bucket · 32feee36
      Coly Li authored
      The journal no-space deadlock has been reported from time to time. Such
      a deadlock can happen in the following situation.
      
      When all journal buckets are fully filled by active jsets under heavy
      write I/O load, the cache set registration (after a reboot) will load
      all active jsets and insert them into the btree again (which is
      called journal replay). If a journaled bkey is inserted into a btree
      node and results in a btree node split, a new journal request might be
      triggered. For example, if the btree grows one more level after the
      node split, the root node record in the cache device super block will
      be updated by bch_journal_meta() from bch_btree_set_root(). But there
      is no space in the journal buckets, so the journal replay has to wait
      for a new journal bucket to be reclaimed after at least one journal
      bucket is replayed. This is one example of how the journal no-space
      deadlock happens.
      
      The solution to avoid the deadlock is to reserve 1 journal bucket at
      run time, and only permit the reserved journal bucket to be used during
      the cache set registration procedure for things like journal replay.
      Then the journal space will never be fully filled, and there is no
      chance for the journal no-space deadlock to happen anymore.
      
      This patch adds a new member "bool do_reserve" in struct journal. It is
      initialized to 0 (false) when struct journal is allocated, and set to
      1 (true) by bch_journal_space_reserve() when all initialization is done
      in run_cache_set(). At run time, when journal_reclaim() tries to
      allocate a new journal bucket, free_journal_buckets() is called to check
      whether there are enough free journal buckets to use. If there is only
      1 free journal bucket and journal->do_reserve is 1 (true), the last
      bucket is reserved and free_journal_buckets() will return 0 to indicate
      no free journal bucket. Then journal_reclaim() will give up, and try
      next time to see whether there is a free journal bucket to allocate. By
      this method, there is always 1 journal bucket reserved at run time.
      
      During the cache set registration, journal->do_reserve is 0 (false), so
      the reserved journal bucket can be used to avoid the no-space deadlock.
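
      The reservation rule described above can be sketched as a small model.
      This is a simplification under stated assumptions, not the bcache
      code: struct journal_model, usable_free_buckets() and take_bucket()
      are hypothetical, and the free-bucket count is kept as a plain number
      rather than derived from journal indices.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct journal. */
struct journal_model {
	bool do_reserve;	/* false during registration, true at run time */
	unsigned int n_free;	/* free journal buckets right now */
};

/* Buckets the allocator may hand out: hold one back once do_reserve is set. */
static unsigned int usable_free_buckets(const struct journal_model *j)
{
	unsigned int reserved = j->do_reserve ? 1 : 0;

	return j->n_free > reserved ? j->n_free - reserved : 0;
}

/* Try to take one bucket; returns false when the caller has to retry later. */
static bool take_bucket(struct journal_model *j)
{
	if (usable_free_buckets(j) == 0)
		return false;
	j->n_free--;
	return true;
}
```

      With do_reserve set, take_bucket() stops one bucket short of empty;
      with it cleared, as during registration in this model, the last bucket
      becomes available for journal replay.
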
      Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() · 80db4e47
      Coly Li authored
      After making bch_sectors_dirty_init() multithreaded, the existing
      incremental dirty sector counting in bch_root_node_dirty_init() doesn't
      release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME)
      bkeys. Because a read lock is added on the btree root node to prevent
      the btree from being split during the dirty sectors counting, other I/O
      requesters have no chance to gain the write lock even if
      bcache_btree() is restarted.
      
      That is to say, the incremental dirty sectors counting is incompatible
      with the multithreaded bch_sectors_dirty_init(). We have to choose one
      and drop the other.
      
      In my testing, with 512-byte random writes, I generated 1.2T dirty data
      and a btree with 400K nodes. With a single thread and incremental dirty
      sectors counting, it takes 30+ minutes to register the backing device.
      With multithreaded dirty sectors counting, the backing device
      registration can be accomplished within 2 minutes.
      
      The 30+ minutes vs. 2- minutes difference made me decide to keep the
      multithreaded bch_sectors_dirty_init() and drop the incremental dirty
      sectors counting. This is what this patch does.
      
      But INIT_KEYS_EACH_TIME is kept: in sectors_dirty_init_fn() the CPU
      will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys
      are iterated. This prevents the watchdog from reporting a bogus soft
      lockup warning.
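
      The periodic yield can be illustrated with a user-space loop. This is
      a sketch only: count_keys_with_yield() is a hypothetical function, and
      sched_yield() is used as a stand-in for the kernel's cond_resched().
      The batch size of 500000 is the INIT_KEYS_EACH_TIME value quoted above.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

#define KEYS_EACH_TIME 500000u	/* value of INIT_KEYS_EACH_TIME quoted above */

/*
 * Walk n_keys keys and yield the CPU after every KEYS_EACH_TIME keys.
 * Returns how many times the CPU was yielded.
 */
static unsigned int count_keys_with_yield(size_t n_keys)
{
	unsigned int yields = 0;
	size_t since_yield = 0;

	for (size_t i = 0; i < n_keys; i++) {
		/* per-key work (dirty sector accounting) would go here */
		if (++since_yield >= KEYS_EACH_TIME) {
			since_yield = 0;
			sched_yield();
			yields++;
		}
	}
	return yields;
}
```

      The loop never drops a lock here; it only gives the scheduler a
      regular chance to run something else, which is the property the
      message relies on to keep the watchdog quiet.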
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: improve multithreaded bch_sectors_dirty_init() · 4dc34ae1
      Coly Li authored
      Commit b144e45f ("bcache: make bch_sectors_dirty_init() to be
      multithreaded") makes bch_sectors_dirty_init() much faster when
      counting dirty sectors by iterating all dirty keys in the btree.
      But it isn't in ideal shape yet and can still be improved.
      
      This patch makes the following changes to improve the current parallel
      dirty keys iteration on the btree:
      - Add a read lock to the root node while multiple threads iterate the
        btree, to prevent the root node from being split by I/Os from other
        registered bcache devices.
      - Remove local variable "char name[32]" and generate the kernel thread
        name string directly when calling kthread_run().
      - Allocate "struct bch_dirty_init_state state" directly on the stack and
        avoid the unnecessary dynamic memory allocation for it.
      - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12, which is enough.
      - Increase &state->started to count a created kernel thread after it
        is created successfully.
      - When waiting for all dirty key counting threads to finish, use
        wait_event() instead of wait_event_interruptible().
      
      With the above changes, the code is clearer, and some potential error
      conditions are avoided.
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: improve multithreaded bch_btree_check() · 62253644
      Coly Li authored
      Commit 8e710227 ("bcache: make bch_btree_check() to be
      multithreaded") makes bch_btree_check() much faster when checking
      all btree nodes during cache device registration. But it isn't in ideal
      shape yet and can still be improved.
      
      This patch does the following things to improve the current parallel
      btree nodes check by multiple threads in bch_btree_check():
      - Add a read lock to the root node while checking all the btree nodes
        with multiple threads. Although it is not mandatory currently, it is
        good to have a read lock in the code logic.
      - Remove local variable 'char name[32]', and generate the kernel thread
        name string directly when calling kthread_run().
      - Allocate local variable "struct btree_check_state check_state" on the
        stack and avoid unnecessary dynamic memory allocation for it.
      - Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12, which is enough.
      - Increase check_state->started to count a created kernel thread after
        it is created successfully.
      - When waiting for all checking kernel threads to finish, use
        wait_event() instead of wait_event_interruptible().
      
      With this change, the code is clearer, and some potential error
      conditions are avoided.
      
      Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 23 May, 2022 6 commits
    • Merge branch 'md-next' of... · df7e7f2b
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.19/drivers
      
      Pull MD updates from Song:
      
      "- Remove uses of bdevname, by Christoph Hellwig;
       - Bug fixes by Guoqing Jiang, and Xiao Ni."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: fix double free of io_acct_set bioset
        md: Don't set mddev private to NULL in raid0 pers->free
        md: remove most calls to bdevname
        md: protect md_unregister_thread from reentrancy
        md: don't unregister sync_thread with reconfig_mutex held
    • md: fix double free of io_acct_set bioset · 42b805af
      Xiao Ni authored
      Now io_acct_set is allocated and freed in the personality. Remove the
      code that frees io_acct_set in md_free and md_stop.
      
      Fixes: 0c031fd3 (md: Move alloc/free acct bioset in to personality)
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: Don't set mddev private to NULL in raid0 pers->free · 0f2571ad
      Xiao Ni authored
      In the normal stop process, it works like this:
         do_md_stop
            |
         __md_stop (pers->free(); mddev->private=NULL)
            |
         md_free (free mddev)
      __md_stop sets mddev->private to NULL after pers->free. The raid device
      will be stopped and the mddev memory is freed. But in reshape, the
      mddev is not freed and will still be used by the new raid.
      
      In reshape, it first sets mddev->private to new_pers and then runs
      old_pers->free(). Now raid0 sets mddev->private to NULL in raid0_free,
      so the new raid can't work anymore. It will panic when dereferencing
      mddev->private because of a NULL pointer dereference.
      
      It can panic like this:
      [63010.814972] kernel BUG at drivers/md/raid10.c:928!
      [63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      [63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1
      [63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020
      [63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10]
      [63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246
      [63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000
      [63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800
      [63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000
      [63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200
      [63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003
      [63010.906399] FS:  0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000
      [63010.914485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0
      [63010.927363] Call Trace:
      [63010.929822]  ? bio_reset+0xe/0x40
      [63010.933144]  ? raid10_alloc_init_r10buf+0x60/0xa0 [raid10]
      [63010.938629]  raid10_sync_request+0x756/0x1610 [raid10]
      [63010.943770]  md_do_sync.cold+0x3e4/0x94c
      [63010.947698]  md_thread+0xab/0x160
      [63010.951024]  ? md_write_inc+0x50/0x50
      [63010.954688]  kthread+0x149/0x170
      [63010.957923]  ? set_kthread_struct+0x40/0x40
      [63010.962107]  ret_from_fork+0x22/0x30
      
      Removing the code that sets mddev->private to NULL in raid0 fixes the
      problem.
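
      The ordering problem in reshape can be sketched with function
      pointers. This is a toy model under stated assumptions: struct
      pers_model, struct mddev_model, free_and_clear(), free_only() and
      reshape() are hypothetical names, and the private data is reduced to
      an opaque pointer. It only illustrates why clearing the pointer in the
      old personality's free hook breaks the new one.

```c
#include <assert.h>
#include <stddef.h>

struct mddev_model;

/* Hypothetical stand-in for a personality: only the free hook is modelled. */
struct pers_model {
	void (*free)(struct mddev_model *mddev, void *priv);
};

struct mddev_model {
	void *private;
};

/* Buggy variant: the old personality clears a pointer it no longer owns. */
static void free_and_clear(struct mddev_model *mddev, void *priv)
{
	(void)priv;
	mddev->private = NULL;
}

/* Fixed variant: release only the old data, leave mddev->private alone. */
static void free_only(struct mddev_model *mddev, void *priv)
{
	(void)mddev;
	(void)priv;
}

/* Reshape order from the message: install the new data, then free the old. */
static void reshape(struct mddev_model *mddev, const struct pers_model *old_pers,
		    void *new_priv)
{
	void *old_priv = mddev->private;

	mddev->private = new_priv;
	old_pers->free(mddev, old_priv);
}
```

      With free_and_clear() as the old hook, reshape() leaves
      mddev->private NULL for the new personality; with free_only() the new
      pointer survives.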
      
      Fixes: 0c031fd3 (md: Move alloc/free acct bioset in to personality)
      Reported-by: Fine Fan <ffan@redhat.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: remove most calls to bdevname · 913cce5a
      Christoph Hellwig authored
      Use the %pg format specifier to save on stack consumption and code size.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: protect md_unregister_thread from reentrancy · 1e267742
      Guoqing Jiang authored
      Generally, md_unregister_thread is called with reconfig_mutex held, but
      raid_message in dm-raid doesn't hold reconfig_mutex to unregister the
      thread, so md_unregister_thread can in theory be called simultaneously
      from two call sites.
      
      Then, after the previous commit, which removes the protection of
      reconfig_mutex for md_unregister_thread completely, the potential issue
      could be worse than before.
      
      Let's take pers_lock at the beginning of the function to protect
      against reentrancy.
      Reported-by: Donald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: don't unregister sync_thread with reconfig_mutex held · 8b48ec23
      Guoqing Jiang authored
      Unregistering sync_thread doesn't need to hold reconfig_mutex since it
      doesn't reconfigure the array.
      
      And it could cause a deadlock problem for raid5 as follows:
      
      1. process A tried to reap the sync thread with reconfig_mutex held
         after echoing idle to sync_action.
      2. the raid5 sync thread was blocked if there were too many active
         stripes.
      3. SB_CHANGE_PENDING was set (because write IO came from the upper
         layer), which prevents the number of active stripes from decreasing.
      4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not
         able to hold reconfig_mutex.
      
      More details in the link:
      https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
      
      Also add one parameter to md_reap_sync_thread since it could be called
      by dm-raid, which doesn't hold reconfig_mutex.
      Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <song@kernel.org>
  7. 21 May, 2022 1 commit
  8. 20 May, 2022 1 commit
  9. 19 May, 2022 1 commit
    • nvme: set non-mdts limits in nvme_scan_work · 78288665
      Chaitanya Kulkarni authored
      In the current implementation we set the non-mdts limits by calling
      nvme_init_non_mdts_limits() from nvme_init_ctrl_finish().
      This also tries to set the limits for the discovery controller, which
      has no I/O queues, resulting in the warning message reported by
      nvme_log_error() when running blktests nvme/002:
      
      [ 2005.155946] run blktests nvme/002 at 2022-04-09 16:57:47
      [ 2005.192223] loop: module loaded
      [ 2005.196429] nvmet: adding nsid 1 to subsystem blktests-subsystem-0
      [ 2005.200334] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
      
      <------------------------------SNIP---------------------------------->
      
      [ 2008.958108] nvmet: adding nsid 1 to subsystem blktests-subsystem-997
      [ 2008.962082] nvmet: adding nsid 1 to subsystem blktests-subsystem-998
      [ 2008.966102] nvmet: adding nsid 1 to subsystem blktests-subsystem-999
      [ 2008.973132] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN testhostnqn.
      *[ 2008.973196] nvme1: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE DNR*
      [ 2008.974595] nvme nvme1: new ctrl: "nqn.2014-08.org.nvmexpress.discovery"
      [ 2009.103248] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
      
      Move the call of nvme_init_non_mdts_limits() to nvme_scan_work() after
      we verify that I/O queues are created since that is a converging point
      for each transport where these limits are actually used.
      
      1. FC :
      nvme_fc_create_association()
       ...
       nvme_fc_create_io_queues(ctrl);
       ...
       nvme_start_ctrl()
        nvme_scan_queue()
         nvme_scan_work()
      
      2. PCIe:-
      nvme_reset_work()
       ...
       nvme_setup_io_queues()
        nvme_create_io_queues()
         nvme_alloc_queue()
       ...
       nvme_start_ctrl()
        nvme_scan_queue()
         nvme_scan_work()
      
      3. RDMA :-
      nvme_rdma_setup_ctrl
       ...
        nvme_rdma_configure_io_queues
        ...
        nvme_start_ctrl()
         nvme_scan_queue()
          nvme_scan_work()
      
      4. TCP :-
      nvme_tcp_setup_ctrl
       ...
        nvme_tcp_configure_io_queues
        ...
        nvme_start_ctrl()
         nvme_scan_queue()
          nvme_scan_work()
      
      * nvme_scan_work()
      ...
      nvme_validate_or_alloc_ns()
        nvme_alloc_ns()
         nvme_update_ns_info()
          nvme_update_disk_info()
           nvme_config_discard() <---
           blk_queue_max_write_zeroes_sectors() <---
      Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  10. 18 May, 2022 2 commits
    • nvme: add support for TP4084 - Time-to-Ready Enhancements · 354201c5
      Christoph Hellwig authored
      Add support for using longer timeouts during controller initialization
      and letting the controller come up with namespaces that are not ready
      for I/O yet. We skip these not-ready namespaces during scanning and
      only bring them online once another scan is kicked off by the AEN that
      is set when the NRDY bit gets set in the I/O Command Set Independent
      Identify Namespace Data Structure. This asynchronous probing avoids
      blocking the kernel boot when controllers take a very long time to
      recover after unclean shutdowns (up to minutes).
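
      The skip-and-rescan behaviour can be sketched as a small model. This
      is only an illustration under stated assumptions: struct ns_model and
      scan_pass() are hypothetical, and the "ready" field is a simplified
      stand-in for the readiness state the controller reports. No driver
      API is reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-namespace state. */
struct ns_model {
	bool ready;	/* models the device-reported readiness */
	bool online;	/* set once the namespace has been brought up */
};

/* One scan pass: bring ready namespaces online, skip the rest for later. */
static unsigned int scan_pass(struct ns_model *ns, size_t count)
{
	unsigned int brought_online = 0;

	for (size_t i = 0; i < count; i++) {
		if (ns[i].online || !ns[i].ready)
			continue;
		ns[i].online = true;
		brought_online++;
	}
	return brought_online;
}
```

      A first pass brings up whatever is ready without waiting; a later pass,
      triggered in the real driver by the event described above, picks up
      the namespaces that were skipped.
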
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
    • Merge tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme into for-5.19/drivers · da14f237
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "nvme updates for Linux 5.19
      
       - tighten the PCI presence check (Stefan Roese):
       - fix a potential NULL pointer dereference in an error path
         (Kyle Miller Smith)
       - fix interpretation of the DMRSL field (Tom Yan)
       - relax the data transfer alignment (Keith Busch)
       - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni)
       - misc cleanups (Chaitanya Kulkarni, me)"
      
      * tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme:
        nvme: split the enum used for various register constants
        nvme-fabrics: add a request timeout helper
        nvme-pci: harden drive presence detect in nvme_dev_disable()
        nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
        nvme: mark internal passthru request RQF_QUIET
        nvme: remove unneeded include from constants file
        nvme: add missing status values to verbose logging
        nvme: set dma alignment to dword
        nvme: fix interpretation of DMRSL
  11. 17 May, 2022 1 commit
  12. 16 May, 2022 9 commits