1. 11 Apr, 2017 5 commits
    • NeilBrown's avatar
      md/raid1: factor out flush_bio_list() · 673ca68d
      NeilBrown authored
      flush_pending_writes() and raid1_unplug() each contain identical
      copies of a fairly large slab of code.  So factor that out into
      new flush_bio_list() to simplify maintenance.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      673ca68d
    • NeilBrown's avatar
      md/raid1: simplify handle_read_error(). · 689389a0
      NeilBrown authored
      handle_read_error() duplicates a lot of the work that raid1_read_request()
      does, so it makes sense to just use that function.
      This doesn't quite work as handle_read_error() relies on the same r1bio
      being re-used so that, in the case of a read-only array, setting
      IO_BLOCKED in r1bio->bios[] ensures read_balance() won't re-use
      that device.
      So we need to allow a r1bio to be passed to raid1_read_request(), and to
      have that function mostly initialise the r1bio, but leave the bios[]
      array untouched.
      
      Two parts of handle_read_error() that need to be preserved are the warning
      message it prints, so they are conditionally added to raid1_read_request().
      
      Note that this highlights a minor bug on alloc_r1bio().  It doesn't
      initalise the bios[] array, so it is possible that old content is there,
      which might cause read_balance() to ignore some devices with no good reason.
      
      With this change, we no longer need inc_pending(), or the sectors_handled
      arg to alloc_r1bio().
      
      As handle_read_error() is called from raid1d() and allocates memory,
      there is tiny chance of a deadlock.  All element of various pools
      could be queued waiting for raid1 to handle them, and there may be no
      extra memory free.
      Achieving guaranteed forward progress would probably require a second
      thread and another mempool.  Instead of that complexity, add
      __GFP_HIGH to any allocations when read1_read_request() is called
      from raid1d.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      689389a0
    • NeilBrown's avatar
      Revert "block: introduce bio_copy_data_partial" · 50512625
      NeilBrown authored
      This reverts commit 6f880285.
      bio_copy_data_partial() is no longer needed.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      50512625
    • NeilBrown's avatar
      md/raid1: simplify alloc_behind_master_bio() · cb83efcf
      NeilBrown authored
      Now that we always always pass an offset of 0 and a size
      that matches the bio to alloc_behind_master_bio(),
      we can remove the offset/size args and simplify the code.
      
      We could probably remove bio_copy_data_partial() too.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      cb83efcf
    • NeilBrown's avatar
      md/raid1: simplify the splitting of requests. · c230e7e5
      NeilBrown authored
      raid1 currently splits requests in two different ways for
      two different reasons.
      
      First, bio_split() is used to ensure the bio fits within a
      resync accounting region.
      Second, multiple r1bios are allocated for each bio to handle
      the possiblity of known bad blocks on some devices.
      
      This can be simplified to just use bio_split() once, and not
      use multiple r1bios.
      We delay the split until we know a maximum bio size that can
      be handled with a single r1bio, and then split the bio and
      queue the remainder for later handling.
      
      This avoids all loops inside raid1.c request handling.  Just
      a single read, or a single set of writes, is submitted to
      lower-level devices for each bio that comes from
      generic_make_request().
      
      When the bio needs to be split, generic_make_request() will
      do the necessary looping and call md_make_request() multiple
      times.
      
      raid1_make_request() no longer queues request for raid1 to handle,
      so we can remove that branch from the 'if'.
      
      This patch also creates a new private bio_set
      (conf->bio_split) for splitting bios.  Using fs_bio_set
      is wrong, as it is meant to be used by filesystems, not
      block devices.  Using it inside md can lead to deadlocks
      under high memory pressure.
      
      Delete unused variable in raid1_write_request() (Shaohua)
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      c230e7e5
  2. 10 Apr, 2017 9 commits
    • Artur Paszkiewicz's avatar
      raid5-ppl: partial parity calculation optimization · ae1713e2
      Artur Paszkiewicz authored
      In case of read-modify-write, partial partity is the same as the result
      of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into
      sh->ppl_page instead of calculating it again.
      Signed-off-by: default avatarArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      ae1713e2
    • Artur Paszkiewicz's avatar
      raid5-ppl: use resize_stripes() when enabling or disabling ppl · 845b9e22
      Artur Paszkiewicz authored
      Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate
      or free sh->ppl_page at runtime for all stripes in the stripe cache.
      raid5_reset_stripe_cache() required suspending the mddev and could
      deadlock because of GFP_KERNEL allocations.
      
      Move the 'newsize' check to check_reshape() to allow reallocating the
      stripes with the same number of disks. Allocate sh->ppl_page in
      alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as
      a parameter to alloc_stripe() because it is needed to check whether to
      allocate ppl_page. Add free_stripe() and use it to free stripes rather
      than directly call kmem_cache_free(). Also free sh->ppl_page in
      free_stripe().
      
      Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly
      setting it in advance and add another parameter to log_init() to allow
      calling ppl_init_log() without the bit set. Don't try to calculate
      partial parity or add a stripe to log if it does not have ppl_page set.
      
      Enabling ppl can now be performed without suspending the mddev, because
      the log won't be used until new stripes are allocated with ppl_page.
      Calling mddev_suspend/resume is still necessary when disabling ppl,
      because we want all stripes to finish before stopping the log, but
      resize_stripes() can be called after mddev_resume() when ppl is no
      longer active.
      Suggested-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      845b9e22
    • Artur Paszkiewicz's avatar
      raid5-ppl: move no_mem_stripes to struct ppl_conf · 94568f64
      Artur Paszkiewicz authored
      Use a single no_mem_stripes list instead of per member device lists for
      handling stripes that need retrying in case of failed io_unit
      allocation. Because io_units are allocated from a memory pool shared
      between all member disks, the no_mem_stripes list should be checked when
      an io_unit for any member is freed. This fixes a deadlock that could
      happen if there are stripes in more than one no_mem_stripes list.
      Signed-off-by: default avatarArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      94568f64
    • NeilBrown's avatar
      md/raid1: avoid reusing a resync bio after error handling. · 0c9d5b12
      NeilBrown authored
      fix_sync_read_error() modifies a bio on a newly faulty
      device by setting bi_end_io to end_sync_write.
      This ensure that put_buf() will still call rdev_dec_pending()
      as required, but makes sure that subsequent code in
      fix_sync_read_error() doesn't try to read from the device.
      
      Unfortunately this interacts badly with sync_request_write()
      which assumes that any bio with bi_end_io set to non-NULL
      other than end_sync_read is safe to write to.
      
      As the device is now faulty it doesn't make sense to write.
      As the bio was recently used for a read, it is "dirty"
      and not suitable for immediate submission.
      In particular, ->bi_next might be non-NULL, which will cause
      generic_make_request() to complain.
      
      Break this interaction by refusing to write to devices
      which are marked as Faulty.
      Reported-and-tested-by: default avatarMichael Wang <yun.wang@profitbricks.com>
      Fixes: 2e52d449 ("md/raid1: add failfast handling for reads.")
      Cc: stable@vger.kernel.org (v4.10+)
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      0c9d5b12
    • Zhilong Liu's avatar
      md.c:didn't unlock the mddev before return EINVAL in array_size_store · b670883b
      Zhilong Liu authored
      md.c: it needs to release the mddev lock before
      the array_size_store() returns.
      
      Fixes: ab5a98b1 ("md-cluster: change array_sectors and update size are not supported")
      Signed-off-by: default avatarZhilong Liu <zlliu@suse.com>
      Reviewed-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      b670883b
    • NeilBrown's avatar
      md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop · 065e519e
      NeilBrown authored
      if called md_set_readonly and set MD_CLOSING bit, the mddev cannot
      be opened any more due to the MD_CLOING bit wasn't cleared. Thus it
      needs to be cleared in md_ioctl after any call to md_set_readonly()
      or do_md_stop().
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Fixes: af8d8e6f ("md: changes for MD_STILL_CLOSED flag")
      Cc: stable@vger.kernel.org (v4.9+)
      Signed-off-by: default avatarZhilong Liu <zlliu@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      065e519e
    • Guoqing Jiang's avatar
      md/raid10: reset the 'first' at the end of loop · 6f287ca6
      Guoqing Jiang authored
      We need to set "first = 0' at the end of rdev_for_each
      loop, so we can get the array's min_offset_diff correctly
      otherwise min_offset_diff just means the last rdev's
      offset diff.
      Suggested-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Reviewed-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      6f287ca6
    • NeilBrown's avatar
      md/raid6: Fix anomily when recovering a single device in RAID6. · 7471fb77
      NeilBrown authored
      When recoverying a single missing/failed device in a RAID6,
      those stripes where the Q block is on the missing device are
      handled a bit differently.  In these cases it is easy to
      check that the P block is correct, so we do.  This results
      in the P block be destroy.  Consequently the P block needs
      to be read a second time in order to compute Q.  This causes
      lots of seeks and hurts performance.
      
      It shouldn't be necessary to re-read P as it can be computed
      from the DATA.  But we only compute blocks on missing
      devices, since c337869d ("md: do not compute parity
      unless it is on a failed drive").
      
      So relax the change made in that commit to allow computing
      of the P block in a RAID6 which it is the only missing that
      block.
      
      This makes RAID6 recovery run much faster as the disk just
      "before" the recovering device is no longer seeking
      back-and-forth.
      Reported-by-tested-by: default avatarBrad Campbell <lists2009@fnarfbargle.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      7471fb77
    • Dennis Yang's avatar
      md: update slab_cache before releasing new stripes when stripes resizing · 583da48e
      Dennis Yang authored
      When growing raid5 device on machine with small memory, there is chance that
      mdadm will be killed and the following bug report can be observed. The same
      bug could also be reproduced in linux-4.10.6.
      
      [57600.075774] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
      [57600.114678] Oops: 0002 [#1] SMP
      [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P           O    4.2.8 #1
      [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
      [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
      [57600.204963] RIP: 0010:[<ffffffff81a6aa87>]  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.213057] RSP: 0018:ffff880043073810  EFLAGS: 00010046
      [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
      [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
      [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
      [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
      [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
      [57600.253999] FS:  00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
      [57600.262078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
      [57600.274942] Stack:
      [57600.276949]  ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
      [57600.284383]  ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
      [57600.291820]  ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
      [57600.299254] Call Trace:
      [57600.301698]  [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
      [57600.307523]  [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
      [57600.313779]  [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
      [57600.319344]  [<ffffffff81579942>] drop_one_stripe+0x62/0x90
      [57600.324915]  [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
      [57600.330563]  [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
      [57600.336650]  [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
      [57600.342039]  [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
      [57600.348210]  [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
      [57600.353959]  [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
      [57600.360303]  [<ffffffff8157a30b>] check_reshape+0x62b/0x770
      [57600.365866]  [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
      [57600.371778]  [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
      [57600.377604]  [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
      [57600.382827]  [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
      [57600.388307]  [<ffffffff81195238>] block_ioctl+0x38/0x40
      [57600.393525]  [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
      [57600.399010]  [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
      [57600.404400]  [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
      [57600.409447]  [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
      [57600.435460] RIP  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.441208]  RSP <ffff880043073810>
      [57600.444690] CR2: 0000000000000000
      [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---
      
      The problem is that resize_stripes() releases new stripe_heads before assigning new
      slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
      after resize_stripes() starting releasing new stripes but right before new slab cache
      being assigned, it is possible that these new stripe_heads will be freed with the old
      slab_cache which was already been destoryed and that triggers this bug.
      Signed-off-by: default avatarDennis Yang <dennisyang@qnap.com>
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1+)
      Reviewed-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      583da48e
  3. 28 Mar, 2017 1 commit
    • Ming Lei's avatar
      md: raid1: kill warning on powerpc_pseries · 8fc04e6e
      Ming Lei authored
      This patch kills the warning reported on powerpc_pseries,
      and actually we don't need the initialization.
      
      	After merging the md tree, today's linux-next build (powerpc
      	pseries_le_defconfig) produced this warning:
      
      	drivers/md/raid1.c: In function 'raid1d':
      	drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
      	     if (memcmp(page_address(ppages[j]),
      	         ^
      	drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here
      	   int page_len[RESYNC_PAGES];
             ^
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      8fc04e6e
  4. 27 Mar, 2017 1 commit
    • Song Liu's avatar
      md/raid5: use consistency_policy to remove journal feature · 0bb0c105
      Song Liu authored
      When journal device of an array fails, the array is forced into read-only
      mode. To make the array normal without adding another journal device, we
      need to remove journal _feature_ from the array.
      
      This patch allows remove journal _feature_ from an array, For journal
      existing journal should be either missing or faulty.
      
      To remove journal feature, it is necessary to remove the journal device
      first:
      
        mdadm --fail /dev/md0 /dev/sdb
        mdadm: set /dev/sdb faulty in /dev/md0
        mdadm --remove /dev/md0 /dev/sdb
        mdadm: hot removed /dev/sdb from /dev/md0
      
      Then the journal feature can be removed by echoing into the sysfs file:
      
       cat /sys/block/md0/md/consistency_policy
       journal
      
       echo resync > /sys/block/md0/md/consistency_policy
       cat /sys/block/md0/md/consistency_policy
       resync
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      0bb0c105
  5. 25 Mar, 2017 3 commits
  6. 24 Mar, 2017 17 commits
  7. 23 Mar, 2017 4 commits
    • NeilBrown's avatar
      MD: use per-cpu counter for writes_pending · 4ad23a97
      NeilBrown authored
      The 'writes_pending' counter is used to determine when the
      array is stable so that it can be marked in the superblock
      as "Clean".  Consequently it needs to be updated frequently
      but only checked for zero occasionally.  Recent changes to
      raid5 cause the count to be updated even more often - once
      per 4K rather than once per bio.  This provided
      justification for making the updates more efficient.
      
      So we replace the atomic counter a percpu-refcount.
      This can be incremented and decremented cheaply most of the
      time, and can be switched to "atomic" mode when more
      precise counting is needed.  As it is possible for multiple
      threads to want a precise count, we introduce a
      "sync_checker" counter to count the number of threads
      in "set_in_sync()", and only switch the refcount back
      to percpu mode when that is zero.
      
      We need to be careful about races between set_in_sync()
      setting ->in_sync to 1, and md_write_start() setting it
      to zero.  md_write_start() holds the rcu_read_lock()
      while checking if the refcount is in percpu mode.  If
      it is, then we know a switch to 'atomic' will not happen until
      after we call rcu_read_unlock(), in which case set_in_sync()
      will see the elevated count, and not set in_sync to 1.
      If it is not in percpu mode, we take the mddev->lock to
      ensure proper synchronization.
      
      It is no longer possible to quickly check if the count is zero, which
      we previously did to update a timer or to schedule the md_thread.
      So now we do these every time we decrement that counter, but make
      sure they are fast.
      
      mod_timer() already optimizes the case where the timeout value doesn't
      actually change.  We leverage that further by always rounding off the
      jiffies to the timeout value.  This may delay the marking of 'clean'
      slightly, but ensure we only perform atomic operation here when absolutely
      needed.
      
      md_wakeup_thread() current always calls wake_up(), even if
      THREAD_WAKEUP is already set.  That too can be optimised to avoid
      calls to wake_up().
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      4ad23a97
    • NeilBrown's avatar
      percpu-refcount: support synchronous switch to atomic mode. · 210f7cdc
      NeilBrown authored
      percpu_ref_switch_to_atomic_sync() schedules the switch to atomic mode, then
      waits for it to complete.
      
      Also export percpu_ref_switch_to_* so they can be used from modules.
      
      This will be used in md/raid to count the number of pending write
      requests to an array.
      We occasionally need to check if the count is zero, but most often
      we don't care.
      We always want updates to the counter to be fast, as in some cases
      we count every 4K page.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      210f7cdc
    • NeilBrown's avatar
      md: close a race with setting mddev->in_sync · 55cc39f3
      NeilBrown authored
      If ->in_sync is being set just as md_write_start() is being called,
      it is possible that set_in_sync() won't see the elevated
      ->writes_pending, and md_write_start() won't see the set ->in_sync.
      
      To close this race, re-test ->writes_pending after setting ->in_sync,
      and add memory barriers to ensure the increment of ->writes_pending
      will be seen by the time of this second test, or the new ->in_sync
      will be seen by md_write_start().
      
      Add a spinlock to array_state_show() to ensure this temporary
      instability is never visible from userspace.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      55cc39f3
    • NeilBrown's avatar
      md: factor out set_in_sync() · 6497709b
      NeilBrown authored
      Three separate places in md.c check if the number of active
      writes is zero and, if so, sets mddev->in_sync.
      
      There are a few differences, but there shouldn't be:
      - it is always appropriate to notify the change in
        sysfs_state, and there is no need to do this outside a
        spin-locked region.
      - we never need to check ->recovery_cp.  The state of resync
        is not relevant for whether there are any pending writes
        or not (which is what ->in_sync reports).
      
      So create set_in_sync() which does the correct tests and
      makes the correct changes, and call this in all three
      places.
      
      Any behaviour changes here a minor and cosmetic.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      6497709b