1. 10 Apr, 2017 6 commits
    • NeilBrown's avatar
      md/raid1: avoid reusing a resync bio after error handling. · 0c9d5b12
      NeilBrown authored
      fix_sync_read_error() modifies a bio on a newly faulty
      device by setting bi_end_io to end_sync_write.
      This ensure that put_buf() will still call rdev_dec_pending()
      as required, but makes sure that subsequent code in
      fix_sync_read_error() doesn't try to read from the device.
      
      Unfortunately this interacts badly with sync_request_write()
      which assumes that any bio with bi_end_io set to non-NULL
      other than end_sync_read is safe to write to.
      
      As the device is now faulty it doesn't make sense to write.
      As the bio was recently used for a read, it is "dirty"
      and not suitable for immediate submission.
      In particular, ->bi_next might be non-NULL, which will cause
      generic_make_request() to complain.
      
      Break this interaction by refusing to write to devices
      which are marked as Faulty.
      Reported-and-tested-by: default avatarMichael Wang <yun.wang@profitbricks.com>
      Fixes: 2e52d449 ("md/raid1: add failfast handling for reads.")
      Cc: stable@vger.kernel.org (v4.10+)
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      0c9d5b12
    • Zhilong Liu's avatar
      md.c:didn't unlock the mddev before return EINVAL in array_size_store · b670883b
      Zhilong Liu authored
      md.c: it needs to release the mddev lock before
      the array_size_store() returns.
      
      Fixes: ab5a98b1 ("md-cluster: change array_sectors and update size are not supported")
      Signed-off-by: default avatarZhilong Liu <zlliu@suse.com>
      Reviewed-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      b670883b
    • NeilBrown's avatar
      md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop · 065e519e
      NeilBrown authored
      if called md_set_readonly and set MD_CLOSING bit, the mddev cannot
      be opened any more due to the MD_CLOING bit wasn't cleared. Thus it
      needs to be cleared in md_ioctl after any call to md_set_readonly()
      or do_md_stop().
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Fixes: af8d8e6f ("md: changes for MD_STILL_CLOSED flag")
      Cc: stable@vger.kernel.org (v4.9+)
      Signed-off-by: default avatarZhilong Liu <zlliu@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      065e519e
    • Guoqing Jiang's avatar
      md/raid10: reset the 'first' at the end of loop · 6f287ca6
      Guoqing Jiang authored
      We need to set "first = 0' at the end of rdev_for_each
      loop, so we can get the array's min_offset_diff correctly
      otherwise min_offset_diff just means the last rdev's
      offset diff.
      Suggested-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Reviewed-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      6f287ca6
    • NeilBrown's avatar
      md/raid6: Fix anomily when recovering a single device in RAID6. · 7471fb77
      NeilBrown authored
      When recoverying a single missing/failed device in a RAID6,
      those stripes where the Q block is on the missing device are
      handled a bit differently.  In these cases it is easy to
      check that the P block is correct, so we do.  This results
      in the P block be destroy.  Consequently the P block needs
      to be read a second time in order to compute Q.  This causes
      lots of seeks and hurts performance.
      
      It shouldn't be necessary to re-read P as it can be computed
      from the DATA.  But we only compute blocks on missing
      devices, since c337869d ("md: do not compute parity
      unless it is on a failed drive").
      
      So relax the change made in that commit to allow computing
      of the P block in a RAID6 which it is the only missing that
      block.
      
      This makes RAID6 recovery run much faster as the disk just
      "before" the recovering device is no longer seeking
      back-and-forth.
      Reported-by-tested-by: default avatarBrad Campbell <lists2009@fnarfbargle.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      7471fb77
    • Dennis Yang's avatar
      md: update slab_cache before releasing new stripes when stripes resizing · 583da48e
      Dennis Yang authored
      When growing raid5 device on machine with small memory, there is chance that
      mdadm will be killed and the following bug report can be observed. The same
      bug could also be reproduced in linux-4.10.6.
      
      [57600.075774] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
      [57600.114678] Oops: 0002 [#1] SMP
      [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P           O    4.2.8 #1
      [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
      [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
      [57600.204963] RIP: 0010:[<ffffffff81a6aa87>]  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.213057] RSP: 0018:ffff880043073810  EFLAGS: 00010046
      [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
      [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
      [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
      [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
      [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
      [57600.253999] FS:  00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
      [57600.262078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
      [57600.274942] Stack:
      [57600.276949]  ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
      [57600.284383]  ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
      [57600.291820]  ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
      [57600.299254] Call Trace:
      [57600.301698]  [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
      [57600.307523]  [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
      [57600.313779]  [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
      [57600.319344]  [<ffffffff81579942>] drop_one_stripe+0x62/0x90
      [57600.324915]  [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
      [57600.330563]  [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
      [57600.336650]  [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
      [57600.342039]  [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
      [57600.348210]  [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
      [57600.353959]  [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
      [57600.360303]  [<ffffffff8157a30b>] check_reshape+0x62b/0x770
      [57600.365866]  [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
      [57600.371778]  [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
      [57600.377604]  [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
      [57600.382827]  [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
      [57600.388307]  [<ffffffff81195238>] block_ioctl+0x38/0x40
      [57600.393525]  [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
      [57600.399010]  [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
      [57600.404400]  [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
      [57600.409447]  [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
      [57600.435460] RIP  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.441208]  RSP <ffff880043073810>
      [57600.444690] CR2: 0000000000000000
      [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---
      
      The problem is that resize_stripes() releases new stripe_heads before assigning new
      slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
      after resize_stripes() starting releasing new stripes but right before new slab cache
      being assigned, it is possible that these new stripe_heads will be freed with the old
      slab_cache which was already been destoryed and that triggers this bug.
      Signed-off-by: default avatarDennis Yang <dennisyang@qnap.com>
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1+)
      Reviewed-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      583da48e
  2. 28 Mar, 2017 1 commit
    • Ming Lei's avatar
      md: raid1: kill warning on powerpc_pseries · 8fc04e6e
      Ming Lei authored
      This patch kills the warning reported on powerpc_pseries,
      and actually we don't need the initialization.
      
      	After merging the md tree, today's linux-next build (powerpc
      	pseries_le_defconfig) produced this warning:
      
      	drivers/md/raid1.c: In function 'raid1d':
      	drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
      	     if (memcmp(page_address(ppages[j]),
      	         ^
      	drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here
      	   int page_len[RESYNC_PAGES];
             ^
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      8fc04e6e
  3. 27 Mar, 2017 1 commit
    • Song Liu's avatar
      md/raid5: use consistency_policy to remove journal feature · 0bb0c105
      Song Liu authored
      When journal device of an array fails, the array is forced into read-only
      mode. To make the array normal without adding another journal device, we
      need to remove journal _feature_ from the array.
      
      This patch allows remove journal _feature_ from an array, For journal
      existing journal should be either missing or faulty.
      
      To remove journal feature, it is necessary to remove the journal device
      first:
      
        mdadm --fail /dev/md0 /dev/sdb
        mdadm: set /dev/sdb faulty in /dev/md0
        mdadm --remove /dev/md0 /dev/sdb
        mdadm: hot removed /dev/sdb from /dev/md0
      
      Then the journal feature can be removed by echoing into the sysfs file:
      
       cat /sys/block/md0/md/consistency_policy
       journal
      
       echo resync > /sys/block/md0/md/consistency_policy
       cat /sys/block/md0/md/consistency_policy
       resync
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      0bb0c105
  4. 25 Mar, 2017 3 commits
  5. 24 Mar, 2017 17 commits
  6. 23 Mar, 2017 12 commits
    • NeilBrown's avatar
      MD: use per-cpu counter for writes_pending · 4ad23a97
      NeilBrown authored
      The 'writes_pending' counter is used to determine when the
      array is stable so that it can be marked in the superblock
      as "Clean".  Consequently it needs to be updated frequently
      but only checked for zero occasionally.  Recent changes to
      raid5 cause the count to be updated even more often - once
      per 4K rather than once per bio.  This provided
      justification for making the updates more efficient.
      
      So we replace the atomic counter a percpu-refcount.
      This can be incremented and decremented cheaply most of the
      time, and can be switched to "atomic" mode when more
      precise counting is needed.  As it is possible for multiple
      threads to want a precise count, we introduce a
      "sync_checker" counter to count the number of threads
      in "set_in_sync()", and only switch the refcount back
      to percpu mode when that is zero.
      
      We need to be careful about races between set_in_sync()
      setting ->in_sync to 1, and md_write_start() setting it
      to zero.  md_write_start() holds the rcu_read_lock()
      while checking if the refcount is in percpu mode.  If
      it is, then we know a switch to 'atomic' will not happen until
      after we call rcu_read_unlock(), in which case set_in_sync()
      will see the elevated count, and not set in_sync to 1.
      If it is not in percpu mode, we take the mddev->lock to
      ensure proper synchronization.
      
      It is no longer possible to quickly check if the count is zero, which
      we previously did to update a timer or to schedule the md_thread.
      So now we do these every time we decrement that counter, but make
      sure they are fast.
      
      mod_timer() already optimizes the case where the timeout value doesn't
      actually change.  We leverage that further by always rounding off the
      jiffies to the timeout value.  This may delay the marking of 'clean'
      slightly, but ensure we only perform atomic operation here when absolutely
      needed.
      
      md_wakeup_thread() current always calls wake_up(), even if
      THREAD_WAKEUP is already set.  That too can be optimised to avoid
      calls to wake_up().
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      4ad23a97
    • NeilBrown's avatar
      percpu-refcount: support synchronous switch to atomic mode. · 210f7cdc
      NeilBrown authored
      percpu_ref_switch_to_atomic_sync() schedules the switch to atomic mode, then
      waits for it to complete.
      
      Also export percpu_ref_switch_to_* so they can be used from modules.
      
      This will be used in md/raid to count the number of pending write
      requests to an array.
      We occasionally need to check if the count is zero, but most often
      we don't care.
      We always want updates to the counter to be fast, as in some cases
      we count every 4K page.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      210f7cdc
    • NeilBrown's avatar
      md: close a race with setting mddev->in_sync · 55cc39f3
      NeilBrown authored
      If ->in_sync is being set just as md_write_start() is being called,
      it is possible that set_in_sync() won't see the elevated
      ->writes_pending, and md_write_start() won't see the set ->in_sync.
      
      To close this race, re-test ->writes_pending after setting ->in_sync,
      and add memory barriers to ensure the increment of ->writes_pending
      will be seen by the time of this second test, or the new ->in_sync
      will be seen by md_write_start().
      
      Add a spinlock to array_state_show() to ensure this temporary
      instability is never visible from userspace.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      55cc39f3
    • NeilBrown's avatar
      md: factor out set_in_sync() · 6497709b
      NeilBrown authored
      Three separate places in md.c check if the number of active
      writes is zero and, if so, sets mddev->in_sync.
      
      There are a few differences, but there shouldn't be:
      - it is always appropriate to notify the change in
        sysfs_state, and there is no need to do this outside a
        spin-locked region.
      - we never need to check ->recovery_cp.  The state of resync
        is not relevant for whether there are any pending writes
        or not (which is what ->in_sync reports).
      
      So create set_in_sync() which does the correct tests and
      makes the correct changes, and call this in all three
      places.
      
      Any behaviour changes here a minor and cosmetic.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      6497709b
    • NeilBrown's avatar
      md/raid5: don't test ->writes_pending in raid5_remove_disk · 84dd97a6
      NeilBrown authored
      This test on ->writes_pending cannot be safe as the counter
      can be incremented at any moment and cannot be locked against.
      
      Change it to test conf->active_stripes, which at least
      can be locked against.  More changes are still needed.
      
      A future patch will change ->writes_pending, and testing it here will
      be very inconvenient.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      84dd97a6
    • NeilBrown's avatar
      md/raid1: stop using bi_phys_segment · 37011e3a
      NeilBrown authored
      Change to use bio->__bi_remaining to count number of r1bio attached
      to a bio.
      See precious raid10 patch for more details.
      
      Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending
      used to measure different things, but were being compared.
      
      This patch fixes another bug in that nr_pending previously did not
      could write-behind requests, so behind writes could continue while
      resync was happening.  How that nr_pending counts all r1_bio,
      the resync cannot commence until the behind writes have completed.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      37011e3a
    • NeilBrown's avatar
      md/raid10: stop using bi_phys_segments · fd16f2e8
      NeilBrown authored
      raid10 currently repurposes bi_phys_segments on each
      incoming bio to count how many r10bio was used to encode the
      request.
      
      We need to know when the number of attached r10bio reaches
      zero to:
      1/ call bio_endio() when all IO on the bio is finished
      2/ decrement ->nr_pending so that resync IO can proceed.
      
      Now that the bio has its own __bi_remaining counter, that
      can be used instead. We can call bio_inc_remaining to
      increment the counter and call bio_endio() every time an
      r10bio completes, rather than only when bi_phys_segments
      reaches zero.
      
      This addresses point 1, but not point 2.  bio_endio()
      doesn't (and cannot) report when the last r10bio has
      finished, so a different approach is needed.
      
      So: instead of counting bios in ->nr_pending, count r10bios.
      i.e. every time we attach a bio, increment nr_pending.
      Every time an r10bio completes, decrement nr_pending.
      
      Normally we only increment nr_pending after first checking
      that ->barrier is zero, or some other non-trivial tests and
      possible waiting.  When attaching multiple r10bios to a bio,
      we only need the tests and the waiting once.  After the
      first increment, subsequent increments can happen
      unconditionally as they are really all part of the one
      request.
      
      So introduce inc_pending() which can be used when we know
      that nr_pending is already elevated.
      
      Note that this fixes a bug.  freeze_array() contains the line
      	atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
      which implies that the units for ->nr_pending, ->nr_queued and extra
      are the same.
      ->nr_queue and extra count r10_bios, but prior to this patch,
      ->nr_pending counted bios.  If a bio ever resulted in multiple
      r10_bios (due to bad blocks), freeze_array() would not work correctly.
      Now it does.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      fd16f2e8
    • NeilBrown's avatar
      md/raid1, raid10: move rXbio accounting closer to allocation. · 6b6c8110
      NeilBrown authored
      When raid1 or raid10 find they will need to allocate a new
      r1bio/r10bio, in order to work around a known bad block, they
      account for the allocation well before the allocation is
      made.  This separation makes the correctness less obvious
      and requires comments.
      
      The accounting needs to be a little before: before the first
      rXbio is submitted, but that is all.
      
      So move the accounting down to where it makes more sense.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      6b6c8110
    • NeilBrown's avatar
      Revert "md/raid5: limit request size according to implementation limits" · 97d53438
      NeilBrown authored
      This reverts commit e8d7c332.
      
      Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
      need to impose these limits.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      97d53438
    • NeilBrown's avatar
      md/raid5: remove over-loading of ->bi_phys_segments. · 0472a42b
      NeilBrown authored
      When a read request, which bypassed the cache, fails, we need to retry
      it through the cache.
      This involves attaching it to a sequence of stripe_heads, and it may not
      be possible to get all the stripe_heads we need at once.
      We do what we can, and record how far we got in ->bi_phys_segments so
      we can pick up again later.
      
      There is only ever one bio which may have a non-zero offset stored in
      ->bi_phys_segments, the one that is either active in the single thread
      which calls retry_aligned_read(), or is in conf->retry_read_aligned
      waiting for retry_aligned_read() to be called again.
      
      So we only need to store one offset value.  This can be in a local
      variable passed between remove_bio_from_retry() and
      retry_aligned_read(), or in the r5conf structure next to the
      ->retry_read_aligned pointer.
      
      Storing it there allows the last usage of ->bi_phys_segments to be
      removed from md/raid5.c.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      0472a42b
    • NeilBrown's avatar
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter · 016c76ac
      NeilBrown authored
      md/raid5 needs to keep track of how many stripe_heads are processing a
      bio so that it can delay calling bio_endio() until all stripe_heads
      have completed.  It currently uses 16 bits of ->bi_phys_segments for
      this purpose.
      
      16 bits is only enough for 256M requests, and it is possible for a
      single bio to be larger than this, which causes problems.  Also, the
      bio struct contains a larger counter, __bi_remaining, which has a
      purpose very similar to the purpose of our counter.  So stop using
      ->bi_phys_segments, and instead use __bi_remaining.
      
      This means we don't need to initialize the counter, as our caller
      initializes it to '1'.  It also means we can call bio_endio() directly
      as it tests this counter internally.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      016c76ac
    • NeilBrown's avatar
      md/raid5: call bio_endio() directly rather than queueing for later. · bd83d0a2
      NeilBrown authored
      We currently gather bios that need to be returned into a bio_list
      and call bio_endio() on them all together.
      The original reason for this was to avoid making the calls while
      holding a spinlock.
      Locking has changed a lot since then, and that reason is no longer
      valid.
      
      So discard return_io() and various return_bi lists, and just call
      bio_endio() directly as needed.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      bd83d0a2