1. 27 Sep, 2022 12 commits
  2. 24 Sep, 2022 9 commits
  3. 23 Sep, 2022 1 commit
    • Jens Axboe's avatar
      Merge branch 'md-next' of... · 4324796e
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.1/block
      
      Pull MD updates and fixes from Song:
      
      "1. Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan.
       2. Raid10 performance optimization, by Yu Kuai."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: Fix spelling mistake in comments of r5l_log
        md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
        md/raid10: convert resync_lock to use seqlock
        md/raid10: fix improper BUG_ON() in raise_barrier()
        md/raid10: prevent unnecessary calls to wake_up() in fast path
        md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait
        md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()
        md: Remove extra mddev_get() in md_seq_start()
        md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()
        md/raid5: Ensure stripe_fill happens on non-read IO with journal
        md/raid5: Don't read ->active_stripes if it's not needed
        md/raid5: Cleanup prototype of raid5_get_active_stripe()
        md/raid5: Drop extern on function declarations in raid5.h
        md/raid5: Refactor raid5_get_active_stripe()
        md: Replace snprintf with scnprintf
        md/raid10: fix compile warning
        md/raid5: Fix spelling mistakes in comments
      4324796e
  4. 22 Sep, 2022 18 commits
    • Zhou nan's avatar
      md: Fix spelling mistake in comments of r5l_log · 65b94b52
      Zhou nan authored
      Fix spelling of dones't in comments.
      Signed-off-by: default avatarZhou nan <zhounan@nfschina.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      65b94b52
    • Logan Gunthorpe's avatar
      md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d · 5e2cf333
      Logan Gunthorpe authored
      A complicated deadlock exists when using the journal and an elevated
      group_thrtead_cnt. It was found with loop devices, but its not clear
      whether it can be seen with real disks. The deadlock can occur simply
      by writing data with an fio script.
      
      When the deadlock occurs, multiple threads will hang in different ways:
      
       1) The group threads will hang in the blk-wbt code with bios waiting to
          be submitted to the block layer:
      
              io_schedule+0x70/0xb0
              rq_qos_wait+0x153/0x210
              wbt_wait+0x115/0x1b0
              io_schedule+0x70/0xb0
              rq_qos_wait+0x153/0x210
              wbt_wait+0x115/0x1b0
              __rq_qos_throttle+0x38/0x60
              blk_mq_submit_bio+0x589/0xcd0
              wbt_wait+0x115/0x1b0
              __rq_qos_throttle+0x38/0x60
              blk_mq_submit_bio+0x589/0xcd0
              __submit_bio+0xe6/0x100
              submit_bio_noacct_nocheck+0x42e/0x470
              submit_bio_noacct+0x4c2/0xbb0
              ops_run_io+0x46b/0x1a30
              handle_stripe+0xcd3/0x36b0
              handle_active_stripes.constprop.0+0x6f6/0xa60
              raid5_do_work+0x177/0x330
      
          Or:
              io_schedule+0x70/0xb0
              rq_qos_wait+0x153/0x210
              wbt_wait+0x115/0x1b0
              __rq_qos_throttle+0x38/0x60
              blk_mq_submit_bio+0x589/0xcd0
              __submit_bio+0xe6/0x100
              submit_bio_noacct_nocheck+0x42e/0x470
              submit_bio_noacct+0x4c2/0xbb0
              flush_deferred_bios+0x136/0x170
              raid5_do_work+0x262/0x330
      
       2) The r5l_reclaim thread will hang in the same way, submitting a
          bio to the block layer:
      
              io_schedule+0x70/0xb0
              rq_qos_wait+0x153/0x210
              wbt_wait+0x115/0x1b0
              __rq_qos_throttle+0x38/0x60
              blk_mq_submit_bio+0x589/0xcd0
              __submit_bio+0xe6/0x100
              submit_bio_noacct_nocheck+0x42e/0x470
              submit_bio_noacct+0x4c2/0xbb0
              submit_bio+0x3f/0xf0
              md_super_write+0x12f/0x1b0
              md_update_sb.part.0+0x7c6/0xff0
              md_update_sb+0x30/0x60
              r5l_do_reclaim+0x4f9/0x5e0
              r5l_reclaim_thread+0x69/0x30b
      
          However, before hanging, the MD_SB_CHANGE_PENDING flag will be
          set for sb_flags in r5l_write_super_and_discard_space(). This
          flag will never be cleared because the submit_bio() call never
          returns.
      
       3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
          will do no processing on any pending stripes and re-set
          STRIPE_HANDLE. This will cause the raid5d thread to enter an
          infinite loop, constantly trying to handle the same stripes
          stuck in the queue.
      
          The raid5d thread has a blk_plug that holds a number of bios
          that are also stuck waiting seeing the thread is in a loop
          that never schedules. These bios have been accounted for by
          blk-wbt thus preventing the other threads above from
          continuing when they try to submit bios. --Deadlock.
      
      To fix this, add the same wait_event() that is used in raid5_do_work()
      to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
      schedule and wait until the flag is cleared. The schedule action will
      flush the plug which will allow the r5l_reclaim thread to continue,
      thus preventing the deadlock.
      
      However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
      from the same thread and can thus deadlock if the thread is put to
      sleep. So avoid waiting if md_check_recovery() is being called in the
      loop.
      
      It's not clear when the deadlock was introduced, but the similar
      wait_event() call in raid5_do_work() was added in 2017 by this
      commit:
      
          16d997b7 ("md/raid5: simplfy delaying of writes while metadata
                         is updated.")
      
      Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.comSigned-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      5e2cf333
    • Song Liu's avatar
      Merge branch 'md-next-raid10-optimize' into md-next · 74173ff4
      Song Liu authored
      This patchset tries to avoid that two locks are held unconditionally
      in hot path.
      
      Test environment:
      
      Architecture:
      aarch64 Huawei KUNPENG 920
      x86 Intel(R) Xeon(R) Platinum 8380
      
      Raid10 initialize:
      mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 \
          /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
      
      Test cmd:
      (task set -c 0-15) fio -name=0 -ioengine=libaio -direct=1 -\
          group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers \
          -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 \
          -rw=randread
      
      Test result:
      
      aarch64:
      before this patchset:           3.2 GiB/s
      bind node before this patchset: 6.9 Gib/s
      after this patchset:            7.9 Gib/s
      bind node after this patchset:  8.0 Gib/s
      
      x86:(bind node is not tested yet)
      before this patchset: 7.0 GiB/s
      after this patchset : 9.3 GiB/s
      
      Please noted that in the test machine, memory access latency is very bad
      across nodes compare to local node in aarch64, which is why bandwidth
      while bind node is much better.
      74173ff4
    • Yu Kuai's avatar
      md/raid10: convert resync_lock to use seqlock · b9b083f9
      Yu Kuai authored
      Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
      and io can't be dispatched until 'barrier' is dropped.
      
      Since holding the 'barrier' is not common, convert 'resync_lock' to use
      seqlock so that holding lock can be avoided in fast path.
      Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
      Reviewed-and-Tested-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      b9b083f9
    • Yu Kuai's avatar
      md/raid10: fix improper BUG_ON() in raise_barrier() · 4f350284
      Yu Kuai authored
      'conf->barrier' is protected by 'conf->resync_lock', reading
      'conf->barrier' without holding the lock is wrong.
      Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Acked-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      4f350284
    • Yu Kuai's avatar
      md/raid10: prevent unnecessary calls to wake_up() in fast path · 0c0be98b
      Yu Kuai authored
      Currently, wake_up() is called unconditionally in fast path such as
      raid10_make_request(), which will cause lock contention under high
      concurrency:
      
      raid10_make_request
       wake_up
        __wake_up_common_lock
         spin_lock_irqsave
      
      Improve performance by only call wake_up() if waitqueue is not empty
      in allow_barrier() and raid10_make_request().
      Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Acked-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      0c0be98b
    • Yu Kuai's avatar
      md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait · 0de57e54
      Yu Kuai authored
      For the case nowait in wait_barrier(), there is no point to increase
      nr_waiting and then decrease it.
      Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Acked-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      0de57e54
    • Yu Kuai's avatar
      md/raid10: factor out code from wait_barrier() to stop_waiting_barrier() · ed2e063f
      Yu Kuai authored
      Currently the nasty condition in wait_barrier() is hard to read. This
      patch factors out the condition into a function.
      
      There are no functional changes.
      Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
      Acked-by: default avatarPaul Menzel <pmenzel@molgen.mpg.de>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Acked-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      ed2e063f
    • Logan Gunthorpe's avatar
      md: Remove extra mddev_get() in md_seq_start() · 3bfc3bcd
      Logan Gunthorpe authored
      A regression is seen where mddev devices stay permanently after they
      are stopped due to an elevated reference count.
      
      This was tracked down to an extra mddev_get() in md_seq_start().
      
      It only happened rarely because most of the time the md_seq_start()
      is called with a zero offset. The path with an extra mddev_get() only
      happens when it starts with a non-zero offset.
      
      The commit noted below changed an mddev_get() to check its success
      but inadvertently left the original call in. Remove the extra call.
      
      Fixes: 12a6caf2 ("md: only delete entries from all_mddevs when the disk is freed")
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarGuoqing Jiang <Guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      3bfc3bcd
    • David Sloan's avatar
      md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() · c66a6f41
      David Sloan authored
      When running chunk-sized reads on disks with badblocks duplicate bio
      free/puts are observed:
      
         =============================================================================
         BUG bio-200 (Not tainted): Object already free
         -----------------------------------------------------------------------------
         Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504
          __slab_alloc.constprop.0+0x5a/0xb0
          kmem_cache_alloc+0x31e/0x330
          mempool_alloc_slab+0x17/0x20
          mempool_alloc+0x100/0x2b0
          bio_alloc_bioset+0x181/0x460
          do_mpage_readpage+0x776/0xd00
          mpage_readahead+0x166/0x320
          blkdev_readahead+0x15/0x20
          read_pages+0x13f/0x5f0
          page_cache_ra_unbounded+0x18d/0x220
          force_page_cache_ra+0x181/0x1c0
          page_cache_sync_ra+0x65/0xb0
          filemap_get_pages+0x1df/0xaf0
          filemap_read+0x1e1/0x700
          blkdev_read_iter+0x1e5/0x330
          vfs_read+0x42a/0x570
         Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504
          kmem_cache_free+0x46d/0x490
          mempool_free_slab+0x17/0x20
          mempool_free+0x66/0x190
          bio_free+0x78/0x90
          bio_put+0x100/0x1a0
          raid5_make_request+0x2259/0x2450
          md_handle_request+0x402/0x600
          md_submit_bio+0xd9/0x120
          __submit_bio+0x11f/0x1b0
          submit_bio_noacct_nocheck+0x204/0x480
          submit_bio_noacct+0x32e/0xc70
          submit_bio+0x98/0x1a0
          mpage_readahead+0x250/0x320
          blkdev_readahead+0x15/0x20
          read_pages+0x13f/0x5f0
          page_cache_ra_unbounded+0x18d/0x220
         Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
         CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
         Workqueue: raid5wq raid5_do_work
         Call Trace:
          <TASK>
          dump_stack_lvl+0x5a/0x78
          dump_stack+0x10/0x16
          print_trailer+0x158/0x165
          object_err+0x35/0x50
          free_debug_processing.cold+0xb7/0xbe
          __slab_free+0x1ae/0x330
          kmem_cache_free+0x46d/0x490
          mempool_free_slab+0x17/0x20
          mempool_free+0x66/0x190
          bio_free+0x78/0x90
          bio_put+0x100/0x1a0
          mpage_end_io+0x36/0x150
          bio_endio+0x2fd/0x360
          md_end_io_acct+0x7e/0x90
          bio_endio+0x2fd/0x360
          handle_failed_stripe+0x960/0xb80
          handle_stripe+0x1348/0x3760
          handle_active_stripes.constprop.0+0x72a/0xaf0
          raid5_do_work+0x177/0x330
          process_one_work+0x616/0xb20
          worker_thread+0x2bd/0x6f0
          kthread+0x179/0x1b0
          ret_from_fork+0x22/0x30
          </TASK>
      
      The double free is caused by an unnecessary bio_put() in the
      if(is_badblock(...)) error path in raid5_read_one_chunk().
      
      The error path was moved ahead of bio_alloc_clone() in c82aa1b7
      ("md/raid5: move checking badblock before clone bio in
      raid5_read_one_chunk"). The previous code checked and freed align_bio
      which required a bio_put. After the move that is no longer needed as
      raid_bio is returned to the control of the common io path which
      performs its own endio resulting in a double free on bad device blocks.
      
      Fixes: c82aa1b7 ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk")
      Signed-off-by: default avatarDavid Sloan <david.sloan@eideticom.com>
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarGuoqing Jiang <Guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      c66a6f41
    • Logan Gunthorpe's avatar
      md/raid5: Ensure stripe_fill happens on non-read IO with journal · e2eed85b
      Logan Gunthorpe authored
      When doing degrade/recover tests using the journal a kernel BUG
      is hit at drivers/md/raid5.c:4381 in handle_parity_checks5():
      
        BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
      
      This was found to occur because handle_stripe_fill() was skipped
      for stripes in the journal due to a condition in that function.
      Thus blocks were not fetched and R5_UPTODATE was not set when
      the code reached handle_parity_checks5().
      
      To fix this, don't skip handle_stripe_fill() unless the stripe is
      for read.
      
      Fixes: 07e83364 ("md/r5cache: shift complex rmw from read path to write path")
      Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/Suggested-by: default avatarSong Liu <song@kernel.org>
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      e2eed85b
    • Logan Gunthorpe's avatar
      md/raid5: Don't read ->active_stripes if it's not needed · f9287c3e
      Logan Gunthorpe authored
      The atomic_read() is not needed in many cases so only do
      the read after the first checks are done.
      Suggested-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      f9287c3e
    • Logan Gunthorpe's avatar
      md/raid5: Cleanup prototype of raid5_get_active_stripe() · 2f2d51ef
      Logan Gunthorpe authored
      Drop the three bools in the prototype of raid5_get_active_stripe()
      and replace them with a flags parameter.
      
      At the same time, drop the distinction with __raid5_get_active_stripe().
      Suggested-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      2f2d51ef
    • Logan Gunthorpe's avatar
      md/raid5: Drop extern on function declarations in raid5.h · 9892fa99
      Logan Gunthorpe authored
      externs should not be used in function declarations, so clean those
      up.
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      9892fa99
    • Logan Gunthorpe's avatar
      md/raid5: Refactor raid5_get_active_stripe() · b6d56144
      Logan Gunthorpe authored
      Refactor raid5_get_active_stripe() without the gotos with an
      explicit infinite loop and some additional nesting.
      Suggested-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      b6d56144
    • Saurabh Sengar's avatar
      md: Replace snprintf with scnprintf · 1727fd50
      Saurabh Sengar authored
      Current code produces a warning as shown below when total characters
      in the constituent block device names plus the slashes exceeds 200.
      snprintf() returns the number of characters generated from the given
      input, which could cause the expression “200 – len” to wrap around
      to a large positive number. Fix this by using scnprintf() instead,
      which returns the actual number of characters written into the buffer.
      
      [ 1513.267938] ------------[ cut here ]------------
      [ 1513.267943] WARNING: CPU: 15 PID: 37247 at <snip>/lib/vsprintf.c:2509 vsnprintf+0x2c8/0x510
      [ 1513.267944] Modules linked in:  <snip>
      [ 1513.267969] CPU: 15 PID: 37247 Comm: mdadm Not tainted 5.4.0-1085-azure #90~18.04.1-Ubuntu
      [ 1513.267969] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
      [ 1513.267971] RIP: 0010:vsnprintf+0x2c8/0x510
      <-snip->
      [ 1513.267982] Call Trace:
      [ 1513.267986]  snprintf+0x45/0x70
      [ 1513.267990]  ? disk_name+0x71/0xa0
      [ 1513.267993]  dump_zones+0x114/0x240 [raid0]
      [ 1513.267996]  ? _cond_resched+0x19/0x40
      [ 1513.267998]  raid0_run+0x19e/0x270 [raid0]
      [ 1513.268000]  md_run+0x5e0/0xc50
      [ 1513.268003]  ? security_capable+0x3f/0x60
      [ 1513.268005]  do_md_run+0x19/0x110
      [ 1513.268006]  md_ioctl+0x195e/0x1f90
      [ 1513.268007]  blkdev_ioctl+0x91f/0x9f0
      [ 1513.268010]  block_ioctl+0x3d/0x50
      [ 1513.268012]  do_vfs_ioctl+0xa9/0x640
      [ 1513.268014]  ? __fput+0x162/0x260
      [ 1513.268016]  ksys_ioctl+0x75/0x80
      [ 1513.268017]  __x64_sys_ioctl+0x1a/0x20
      [ 1513.268019]  do_syscall_64+0x5e/0x200
      [ 1513.268021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 76603884 ("md/raid0: replace printk() with pr_*()")
      Reviewed-by: default avatarMichael Kelley <mikelley@microsoft.com>
      Acked-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSaurabh Sengar <ssengar@linux.microsoft.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      1727fd50
    • Guoqing Jiang's avatar
      md/raid10: fix compile warning · 62bca04b
      Guoqing Jiang authored
      With W=1, compiler complains.
      
      drivers/md/raid10.c:1983: warning: bad line:
      Signed-off-by: default avatarGuoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      62bca04b
    • XU pengfei's avatar
      md/raid5: Fix spelling mistakes in comments · 12ba6676
      XU pengfei authored
      Fix spelling of 'waitting' in comments.
      Signed-off-by: default avatarXU pengfei <xupengfei@nfschina.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      12ba6676