1. 05 Feb, 2015 2 commits
    • md: make reconfig_mutex optional for writes to md sysfs files. · 6791875e
      NeilBrown authored
      
      Rather than using mddev_lock() to take the reconfig_mutex
      when writing to any md sysfs file, we only take mddev_lock()
      in the particular _store() functions that require it.
      Admittedly this is most, but it isn't all.
      
      This also allows us to remove special-case handling for new_dev_store
      (in md_attr_store).
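
      A minimal sketch of the resulting pattern (the handler name here is
      hypothetical; the real _store functions live in drivers/md/md.c):

      	static ssize_t
      	example_store(struct mddev *mddev, const char *buf, size_t len)
      	{
      		int err = mddev_lock(mddev);	/* takes reconfig_mutex */

      		if (err)
      			return err;
      		/* parse buf and reconfigure the array under the mutex */
      		mddev_unlock(mddev);
      		return len;
      	}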
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use ->lock to protect accessing raid5 sysfs attributes. · 7b1485ba
      NeilBrown authored
      
      It is important that mddev->private isn't freed while
      a sysfs attribute function is accessing it.
      
      So use mddev->lock to protect the setting of ->private to NULL, and
      take that lock when checking ->private for NULL and de-referencing it
      in the sysfs access functions.
      
      This only applies to the read ('show') side of access.  Write
      access will be handled separately.
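
      A minimal sketch of the show-side pattern, modelled on the
      stripe_cache_size attribute (details may differ from the actual patch):

      	static ssize_t
      	raid5_show_stripe_cache_size(struct mddev *mddev, char *page)
      	{
      		struct r5conf *conf;
      		int ret = 0;

      		spin_lock(&mddev->lock);
      		conf = mddev->private;
      		if (conf)	/* only dereference ->private while known non-NULL */
      			ret = sprintf(page, "%d\n", conf->max_nr_stripes);
      		spin_unlock(&mddev->lock);
      		return ret;
      	}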
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 03 Feb, 2015 9 commits
    • md: rename ->stop to ->free · afa0f557
      NeilBrown authored
      
      Now that the ->stop function only frees the private data,
      rename it accordingly.
      
      Also pass in the private pointer as an arg rather than using
      mddev->private.  This flexibility will be useful in level_store().
      
      Finally, don't clear ->private.  It doesn't make sense to clear
      it, seeing that it isn't what we free, and it is no longer necessary
      to clear ->private (it was, some time ago, before ->to_remove was
      introduced).
      
      Setting ->to_remove in ->free() is a bit of a wart, but not a
      big problem at the moment.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: split detach operation out from ->stop. · 5aa61f42
      NeilBrown authored
      
      Each md personality has a 'stop' operation which does two
      things:
       1/ it finalizes some aspects of the array to ensure nothing
          is accessing the ->private data
       2/ it frees the ->private data.
      
      All the steps in '1' can apply to all arrays and so can be
      performed in common code.
      
      This is useful: when we change the personality which manages an
      array (in level_store()), it is helpful to do step 1 early and
      step 2 later.
      
      So split the 'step 1' functionality out into a new mddev_detach().
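
      A simplified sketch of the resulting split (the real mddev_detach()
      also waits for bitmap behind-writes; this is reconstructed from the
      description, not copied from the patch):

      	static void mddev_detach(struct mddev *mddev)
      	{
      		/* step 1: stop anything that might still touch ->private */
      		if (mddev->pers && mddev->pers->quiesce) {
      			mddev->pers->quiesce(mddev, 1);
      			mddev->pers->quiesce(mddev, 0);
      		}
      		md_unregister_thread(&mddev->thread);
      		if (mddev->queue)
      			blk_sync_queue(mddev->queue);	/* the unplug fn references 'conf' */
      	}

      	static void __md_stop(struct mddev *mddev)
      	{
      		struct md_personality *pers = mddev->pers;

      		mddev_detach(mddev);	/* step 1, common to all personalities */
      		mddev->pers = NULL;
      		pers->stop(mddev);	/* step 2, frees ->private */
      		module_put(pers->owner);
      	}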
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      NeilBrown authored
      
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
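
      A sketch of the dispatcher, reconstructed from the description (the
      actual function may differ in detail):

      	static int md_mergeable_bvec(struct request_queue *q,
      				     struct bvec_merge_data *bvm,
      				     struct bio_vec *biovec)
      	{
      		struct mddev *mddev = q->queuedata;
      		int ret;

      		rcu_read_lock();
      		if (mddev->suspended) {
      			/* must always allow one vec */
      			if (bvm->bi_size == 0)
      				ret = biovec->bv_len;
      			else
      				ret = 0;	/* reject every merge we can reject */
      		} else {
      			struct md_personality *pers = mddev->pers;

      			if (pers && pers->mergeable_bvec)
      				ret = pers->mergeable_bvec(mddev, bvm, biovec);
      			else
      				ret = biovec->bv_len;
      		}
      		rcu_read_unlock();
      		return ret;
      	}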
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown authored
      
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_read_lock.  Fortunately this is the case.
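
      A sketch of the central dispatcher described above (reconstructed from
      the description; the real mddev_congested may differ slightly):

      	static int mddev_congested(struct mddev *mddev, int bits)
      	{
      		struct md_personality *pers = mddev->pers;
      		int ret = 0;

      		rcu_read_lock();
      		if (mddev->suspended)
      			ret = 1;	/* safe guess: report congested */
      		else if (pers && pers->congested)
      			ret = pers->congested(mddev, bits);
      		rcu_read_unlock();
      		return ret;
      	}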
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: tidy/fix last condition. · ea664c82
      NeilBrown authored
      
      That last condition is unclear and over cautious.
      
      There are two related issues here.
      
      If a partial write is destined for a missing device, then
      either RMW or RCW can work.  We must read all the available
      blocks.  Only then can the missing blocks be calculated, and
      then the parity update performed.
      
      If RMW is not an option, then there is a complication even
      without partial writes.  If we would need to read a missing
      device to perform the reconstruction, then we must first read every
      block so the missing device data can be computed.
      This is the case for RAID6 (which currently does not support
      RMW) and for times when we don't trust the parity (after a crash)
      and so are in the process of resyncing it.
      
      So make these two cases more clear and separate, and perform
      the relevant tests more thoroughly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: start simplifying the last two conditions. · a9d56950
      NeilBrown authored
      
      Both the last two cases are only relevant if something has failed and
      something needs to be written (but not over-written), and if it is OK
      to pre-read blocks at this point.  So factor out those tests and
      explain them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate out the easy conditions in need_this_block. · a79cfe12
      NeilBrown authored
      
      Some of the conditions in need_this_block have very straightforward
      motivation.  Separate those out and document them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate large if clause out of fetch_block(). · 2c58f06e
      NeilBrown authored
      
      fetch_block() has a very large and hard to read 'if' condition.
      
      Separate it into its own function so that it can be
      made more readable.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: do_release_stripe(): No need to call md_wakeup_thread() twice · ad3ab8b6
      Jes Sorensen authored
      Commit 67f45548 introduced a call to
      md_wakeup_thread() when adding to the delayed_list.  However the md
      thread is woken up unconditionally just below.
      
      Remove the unnecessary wakeup call.
      Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 02 Feb, 2015 1 commit
    • md/raid5: fix another livelock caused by non-aligned writes. · b1b02fe9
      NeilBrown authored
      If a non-page-aligned write is destined for a device which
      is missing/faulty, we can deadlock.
      
      As the target device is missing, a read-modify-write cycle
      is not possible.
      As the write is not for a full page, a reconstruct-write cycle
      is not possible.
      
      This should be handled by logic in fetch_block() which notices
      there is a non-R5_OVERWRITE write to a missing device, and so
      loads all blocks.
      
      However since commit 67f45548, that code requires
      STRIPE_PREREAD_ACTIVE before it will act, and those circumstances
      never set STRIPE_PREREAD_ACTIVE.
      
      So: in handle_stripe_dirtying, if neither rmw nor rcw was possible,
      set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE to be set
      after a suitable delay.
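
      A sketch of the fix as described (the exact counters used by
      handle_stripe_dirtying are assumptions here):

      	/* neither read-modify-write nor reconstruct-write could cover
      	 * every block: delay the stripe so STRIPE_PREREAD_ACTIVE gets
      	 * set and fetch_block() will then load what is needed */
      	if (rcw > disks && rmw > disks &&
      	    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
      		set_bit(STRIPE_DELAYED, &sh->state);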
      
      Fixes: 67f45548
      
      
      Cc: stable@vger.kernel.org (v3.16+)
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 03 Dec, 2014 1 commit
    • md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. · 108cef3a
      NeilBrown authored
      It is critical that fetch_block() and handle_stripe_dirtying()
      are consistent in their analysis of what needs to be loaded.
      Otherwise raid5 can wait forever for a block that won't be loaded.
      
      Currently when writing to a RAID5 that is resyncing, to a location
      beyond the resync offset, handle_stripe_dirtying chooses a
      reconstruct-write cycle, but fetch_block() assumes a
      read-modify-write, and a lockup can happen.
      
      So treat that case just like RAID6, just as we do in
      handle_stripe_dirtying.  RAID6 always does reconstruct-write.
      
      This bug was introduced when the behaviour of handle_stripe_dirtying
      was changed in 3.7, so the patch is suitable for any kernel since,
      though it will need careful merging for some versions.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Fixes: a7854487
      
      Reported-by: Henry Cai <henryplusplus@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 14 Oct, 2014 2 commits
  6. 08 Oct, 2014 1 commit
  7. 02 Oct, 2014 1 commit
    • md/raid5: disable 'DISCARD' by default due to safety concerns. · 8e0e99ba
      NeilBrown authored
      
      It has come to my attention (thanks Martin) that 'discard_zeroes_data'
      is only a hint.  Some devices in some cases don't do what it
      says on the label.
      
      The use of DISCARD in RAID5 depends on reads from discarded regions
      being predictably zero.  If a write to a previously discarded region
      performs a read-modify-write cycle, it assumes that the parity block
      was consistent with the data blocks.  If all were zero, this would
      be the case.  If some are and some aren't, this would not be the case.
      This could lead to data corruption after a device failure when
      data needs to be reconstructed from the parity.
      
      As we cannot trust 'discard_zeroes_data', ignore it by default
      and so disallow DISCARD on all raid4/5/6 arrays.
      
      As many devices are trustworthy, and as there are benefits to using
      DISCARD, add a module parameter to over-ride this caution and cause
      DISCARD to work if discard_zeroes_data is set.
      
      If a site wants to enable DISCARD on some arrays but not on others they
      should select DISCARD support at the filesystem level, and set the
      raid456 module parameter.
          raid456.devices_handle_discard_safely=Y
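
      The override is ordinary module_param plumbing; a sketch (the
      description string is an assumption):

      	static bool devices_handle_discard_safely = false;
      	module_param(devices_handle_discard_safely, bool, 0644);
      	MODULE_PARM_DESC(devices_handle_discard_safely,
      			 "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");

      The same switch can be flipped at load time, e.g.
      "modprobe raid456 devices_handle_discard_safely=Y".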
      
      As this is a data-safety issue, I believe this patch is suitable for
      -stable.
      DISCARD support for RAID456 was added in 3.7.
      
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Heinz Mauelshagen <heinzm@redhat.com>
      Cc: stable@vger.kernel.org (3.7+)
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Fixes: 620125f2
      
      Signed-off-by: NeilBrown <neilb@suse.de>
  8. 18 Aug, 2014 2 commits
    • md/raid6: avoid data corruption during recovery of double-degraded RAID6 · 9c4bdf69
      NeilBrown authored
      During recovery of a double-degraded RAID6 it is possible for
      some blocks not to be recovered properly, leading to corruption.
      
      If a write happens to one block in a stripe that would be written to a
      missing device, and at the same time that stripe is recovering data
      to the other missing device, then that recovered data may not be written.
      
      This patch skips, in the double-degraded case, an optimisation that is
      only safe for single-degraded arrays.
      
      Bug was introduced in 2.6.32 and fix is suitable for any kernel since
      then.  In an older kernel with separate handle_stripe5() and
      handle_stripe6() functions the patch must change handle_stripe6().
      
      Cc: stable@vger.kernel.org (2.6.32+)
      Fixes: 6c0069c0
      
      
      Cc: Yuri Tikhonov <yur@emcraft.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
    • md/raid5: avoid livelock caused by non-aligned writes. · a40687ff
      NeilBrown authored
      If a stripe in a raid6 array receives a write to each data block while
      the array is degraded, and if any of these writes to a missing device
      are not page-aligned, then a live-lock happens.
      
      In this case the P and Q blocks need to be read so that the part of
      the missing block which is *not* being updated by the write can be
      constructed.  Due to a logic error, these blocks are not loaded, so
      the update cannot proceed and the stripe is 'handled' repeatedly in an
      infinite loop.
      
      This bug is unlikely as most writes are page aligned.  However as it
      can lead to a livelock it is suitable for -stable.  It was introduced
      in 3.16.
      
      Cc: stable@vger.kernel.org (v3.16)
      Fixes: 67f45548
      
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 10 Jun, 2014 1 commit
    • raid5: speedup sync_request processing · 053f5b65
      Eivind Sarto authored
      
      The raid5 sync_request() processing calls handle_stripe() within the context of
      the resync-thread.  The resync-thread issues the first set of read requests
      and this adds execution latency and slows down the scheduling of the next
      sync_request().
      The current rebuild/resync speed of raid5 is not much faster than what
      rotational HDDs can sustain.
      Testing the following patch on a 6-drive array, I can increase the rebuild
      speed from 100 MB/s to 175 MB/s.
      The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
      creates some more parallelism between the resync-thread and raid5 kernel daemon.
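
      The change itself is tiny; a sketch of the new tail of sync_request()
      (reconstructed from the description):

      	set_bit(STRIPE_HANDLE, &sh->state);	/* was: handle_stripe(sh); */
      	release_stripe(sh);			/* raid5d / a worker picks it up */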
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 05 Jun, 2014 1 commit
    • md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32
      hui jiao authored
      
      A chunk aligned read increases the counter active_aligned_reads and
      decreases it after the sub-device handles it successfully.  But when a
      read error occurs, the read is redispatched by raid5d, and
      active_aligned_reads will not be decreased until we can grab a stripe
      head in retry_aligned_read.  Now suppose a barrier io comes: it sets
      conf->quiesce to 2 and waits until both active_stripes and
      active_aligned_reads are zero.  The retried chunk aligned read gets
      stuck at get_active_stripe, waiting until conf->quiesce becomes 0.
      retry_aligned_read and the barrier io are now waiting for each other.
      One possible solution is to ignore conf->quiesce and let the retried
      aligned read finish.  I reproduced this deadlock and tested this patch
      on CentOS 6.0.
      Signed-off-by: NeilBrown <neilb@suse.de>
  11. 29 May, 2014 3 commits
    • raid5: add an option to avoid copy data from bio to stripe cache · d592a996
      Shaohua Li authored
      
      The stripe cache has two goals:
      1. cache data, so next time if data can be found in stripe cache, disk access
      can be avoided.
      2. stable data.  Data is copied from the bio to the stripe cache and
      parity is calculated from it.  Data written to disk comes from the stripe
      cache, so if the upper layer changes the bio data, the data written to
      disk isn't affected.
      
      In my environment, I can guarantee 2 will not happen, and
      BDI_CAP_STABLE_WRITES can guarantee it too.  Case 1 isn't common either:
      the block plug mechanism will dispatch a bunch of sequential small
      requests together, and since I'm using an SSD, I'm using a small chunk
      size.  It's a rare case where the stripe cache is really useful.
      
      So I'd like to avoid the copy from bio to stripe cache, which is very
      helpful for performance.  In my 1M randwrite tests, avoiding the copy
      increases performance by more than 30%.
      
      Of course, this shouldn't be enabled by default.  Enabling
      BDI_CAP_STABLE_WRITES has been reported to harm some workloads, so I
      added an option to control it.
      
      Neilb:
        changed BUG_ON to WARN_ON
        Removed some assignments from raid5_build_block which are now not needed.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: avoid release list until last reference of the stripe · cf170f3f
      Eivind Sarto authored
      
      The (lockless) release_list reduces lock contention, but there is excessive
      queueing and dequeuing of stripes on this list.  A stripe will currently be
      queued on the release_list with a stripe reference count > 1.  This can cause
      the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
      without doing any other useful processing of the stripe.  There are two cases
      when the stripe can be put on the release_list multiple times before it is
      actually handled by the kernel thread(s).
      1) make_request() activates the stripe processing in 4k increments.  When a
         write request is large enough to span multiple chunks of a stripe_head, the
         first 4k chunk adds the stripe to the plug list.  The next 4k chunk that is
         processed for the same stripe puts the stripe on the release_list with a
         refcount=2.  This can cause the kernel thread to process and decrement the
         stripe before the stripe is unplugged, which again will put it back on the
         release_list.
      2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
         refcount is set to the number of active IO (for each chunk).  The stripe is
         released as each IO completes, and can be queued and dequeued multiple times
         on the release_list, until its refcount finally reaches zero.
      
      This simple patch will ensure a stripe is only queued on the release_list when
      its refcount=1 and is ready to be handled by the kernel thread(s).  I added some
      instrumentation to raid5 and counted the number of times stripes were queued on
      the release_list for a variety of write IO sizes.  Without this patch the number
      of times stripes got queued on the release_list was 100-500% higher than with
      the patch.  The excess queuing will increase with the IO size.  The patch also
      improved throughput by 5-10%.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548
      NeilBrown authored
      
      If it is found that we need to pre-read some blocks before a write
      can succeed, we normally set STRIPE_DELAYED and don't actually perform
      the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
      
      However for a degraded RAID6 we currently perform the reads as soon
      as we see that a write is pending.  This significantly hurts
      throughput.
      
      So:
       - when handle_stripe_dirtying finds a block that it wants on a device
         that is failed, set STRIPE_DELAYED, instead of doing nothing, and
       - when fetch_block detects that a read might be required to satisfy a
         write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
         and if we would actually need to read something to complete the write.
      
      This also helps RAID5, though less often as RAID5 supports a
      read-modify-write cycle.  For RAID5 the read is performed too early
      only if the write is not a full 4K aligned write (i.e. not an
      R5_OVERWRITE).
      
      Also clean up a couple of horrible bits of formatting.
      Reported-by: Patrik Horník <patrik@dsl.sk>
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 18 Apr, 2014 1 commit
  13. 17 Apr, 2014 1 commit
  14. 09 Apr, 2014 2 commits
    • raid5: get_active_stripe avoids device_lock · e240c183
      Shaohua Li authored
      
      For a sequential workload (or a big-request-size workload),
      get_active_stripe can find a cached stripe.  In this case we always hold
      device_lock, which exposes a lot of lock contention for such workloads.
      If the stripe count isn't 0, we don't actually need to hold the lock,
      since we just increase its count.  And this is the hot code path for
      such workloads.  Unfortunately we must delete the BUG_ON.
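
      A sketch of the lockless fast path (reconstructed from the description;
      the slow-path details are elided):

      	if (!atomic_inc_not_zero(&sh->count)) {
      		/* count may be 0, so device_lock is needed after all */
      		spin_lock(&conf->device_lock);
      		if (!atomic_read(&sh->count))
      			list_del_init(&sh->lru);	/* first user: detach from lru */
      		atomic_inc(&sh->count);
      		spin_unlock(&conf->device_lock);
      	}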
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: make_request does less prepare wait · 27c0f68f
      Shaohua Li authored
      
      On a NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
      lot of contention for a sequential workload (or a big-request-size
      workload).  For such a workload, each bio includes several stripes, so we
      can just do prepare_to_wait/finish_wait once for the whole bio instead
      of for every stripe.  This reduces the lock contention completely for
      such workloads.  Random workloads might have similar lock contention
      too, but I didn't see it yet, maybe because my storage is still not fast
      enough.
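
      A sketch of the hoisted wait (simplified; the real make_request re-arms
      the wait only when it actually needs to sleep):

      	DEFINE_WAIT(w);

      	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
      	for (; logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
      		/* get and handle the stripe for this 4k chunk; schedule()
      		 * and re-prepare only if the stripe is busy */
      	}
      	finish_wait(&conf->wait_for_overlap, &w);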
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  15. 13 Feb, 2014 1 commit
    • md/raid5: Fix CPU hotplug callback registration · 789b5e03
      Oleg Nesterov authored
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Interestingly, the raid5 code can actually prevent double initialization and
      hence can use the following simplified form of callback registration:
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	put_online_cpus();
      
      A hotplug operation that occurs between registering the notifier and calling
      get_online_cpus() won't disrupt anything, because the code takes care to
      perform the memory allocations only once.
      
      So reorganize the code in raid5 this way to fix the deadlock with callback
      registration.
      
      Cc: linux-raid@vger.kernel.org
      Cc: stable@vger.kernel.org (v2.6.32+)
      Fixes: 36d1c647
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
      free_scratch_buffer() helper to condense code further and wrote the changelog.]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  16. 22 Jan, 2014 1 commit
    • md/raid5: close recently introduced race in stripe_head management. · 7da9d450
      NeilBrown authored
      As release_stripe and __release_stripe decrement ->count and then
      manipulate ->lru both under ->device_lock, it is important that
      get_active_stripe() increments ->count and clears ->lru also under
      ->device_lock.
      
      However we currently list_del_init ->lru under the lock, but increment
      the ->count outside the lock.  This can lead to races and list
      corruption.
      
      So move the atomic_inc(&sh->count) up inside the ->device_lock
      protected region.
      
      Note that we still increment ->count without device lock in the case
      where get_free_stripe() was called, and in fact don't take
      ->device_lock at all in that path.
      This is safe because if the stripe_head can be found by
      get_free_stripe, then the hash lock assures us that no-one else could
      possibly be calling release_stripe() at the same time.
      
      Fixes: 566c09c5
      
      
      Cc: stable@vger.kernel.org (3.13)
      Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  17. 15 Jan, 2014 1 commit
    • md/raid5: fix long-standing problem with bitmap handling on write failure. · 9f97e4b1
      NeilBrown authored
      
      Before a write starts we set a bit in the write-intent bitmap.
      When the write completes we clear that bit if the write was successful
      to all devices.  However if the write wasn't fully successful we
      should not clear the bit.  If the faulty drive is subsequently
      re-added, the fact that the bit is still set ensures that we will
      re-write the data that is missing.
      
      This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
      bitmap bit when this flag is not set.
      Currently we correctly set the flag if a write starts when some
      devices are failed or missing.  But we do *not* set the flag if some
      device failed during the write attempt.
      This is wrong and can result in clearing the bit inappropriately.
      
      So: set the flag when a write fails.
      
      This bug has been present since bitmaps were introduced, so the fix is
      suitable for any -stable kernel.
      Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  18. 14 Jan, 2014 2 commits
    • md/raid5: fix a recently broken BUG_ON(). · 5af9bef7
      NeilBrown authored
      commit 6d183de4
          md/raid5: fix newly-broken locking in get_active_stripe.
      
      simplified a BUG_ON, but removed too much so now it sometimes fires
      when it shouldn't.
      
      When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
      special list while multiple stripe_heads are collected, or it might
      not be on any list, even a 'free' list when the refcount is zero.  As
      long as STRIPE_EXPANDING is set, it will be found and added back to a
      list eventually.
      
      So both of the BUG_ONs which test for the ->lru being empty or not
      need to avoid the case where STRIPE_EXPANDING is set.
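
      A sketch of one of the corrected tests (from the description above; the
      other BUG_ON is adjusted analogously):

      	if (!atomic_read(&sh->count)) {
      		/* an unreferenced stripe must be on some list,
      		 * unless it is being collected for expansion */
      		BUG_ON(list_empty(&sh->lru) &&
      		       !test_bit(STRIPE_EXPANDING, &sh->state));
      		list_del_init(&sh->lru);
      	}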
      
      The patch which broke this was marked for -stable, so this patch needs
      to be applied to any branch that received 6d183de4
      
      Fixes: 6d183de4
      
      
      Cc: stable@vger.kernel.org (any release to which above was applied)
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: Fix possible confusion when multiple write errors occur. · 1cc03eb9
      NeilBrown authored
      commit 5d8c71f9
          md: raid5 crash during degradation
      
      Fixed a crash in an overly simplistic way which could leave
      R5_WriteError or R5_MadeGood set in the stripe cache for devices
      for which it is no longer relevant.
      When those devices are removed and spares added the flags are still
      set and can cause incorrect behaviour.
      
      commit 14a75d3e
      
      
          md/raid5: preferentially read from replacement device if possible.
      
      Fixed the same bug in a more effective way, so we can now revert
      the original commit.
      Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
      Fixes: 5d8c71f9
      
      Signed-off-by: NeilBrown <neilb@suse.de>
  19. 08 Jan, 2014 1 commit
    • bcache/md: Use raid stripe size · c78afc62
      Kent Overstreet authored
      
      Now that we've got code for raid5/6 stripe awareness, bcache just needs
      to know about the stripes and when writing partial stripes is expensive
      - we probably don't want to enable this optimization for raid1 or 10,
      even though they have stripes. So add a flag to queue_limits.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
  20. 28 Nov, 2013 2 commits
    • md/raid5: fix newly-broken locking in get_active_stripe. · 6d183de4
      NeilBrown authored
      commit 566c09c5 raid5: relieve lock contention in get_active_stripe()
      
      modified the locking in get_active_stripe() reducing the range
      protected by the (highly contended) device_lock.
      Unfortunately it reduced the range too much opening up some races.
      
      One race can occur if get_priority_stripe runs between the
      test on sh->count and device_lock being taken.
      This will mean that sh->lru is not empty while get_active_stripe
      thinks ->count is zero resulting in a 'BUG' firing.
      
      Another race happens if __release_stripe is called immediately
      after sh->count is tested and found to be non-zero.  If STRIPE_HANDLE
      is not set, get_active_stripe should increment ->active_stripes
      when it increments ->count from 0, but as it didn't think it was 0,
      it doesn't.
      
      Extending device_lock to cover the test on sh->count closes these
      races.
      
      While we are here, fix the two BUG tests:
       -If count is zero, then lru really must not be empty, or we've
        lost the stripe_hea...
    • md/raid5: fix new memory-reference bug in alloc_thread_groups. · 0c775d52
      NeilBrown authored
      In alloc_thread_groups, worker_groups is a pointer to an array,
      not an array of pointers.
      So
         worker_groups[i]
      is wrong.  It should be
         &(*worker_groups)[i]
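
      A self-contained illustration of the pointer-to-array pitfall
      (hypothetical types, not the raid5 code):

      	#include <stdio.h>

      	struct group { int id; };

      	int main(void)
      	{
      		struct group arr[3] = { {0}, {1}, {2} };
      		struct group (*worker_groups)[3] = &arr;	/* pointer to an array */
      		int i;

      		/* worker_groups[i] steps over whole 3-element arrays (wrong);
      		 * &(*worker_groups)[i] addresses element i (right) */
      		for (i = 0; i < 3; i++)
      			printf("%d\n", (&(*worker_groups)[i])->id);
      		return 0;
      	}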
      
      Found-by: coverity
      Fixes: 60aaf933
      
      Reported-by: Ben Hutchings <bhutchings@solarflare.com>
      Cc: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  21. 24 Nov, 2013 2 commits
    • block: Convert bio_for_each_segment() to bvec_iter · 7988613b
      Kent Overstreet authored
      
      More prep work for immutable biovecs - with immutable bvecs drivers
      won't be able to use the biovec directly, they'll need to use helpers
      that take into account bio->bi_iter.bi_bvec_done.
      
      This updates callers for the new usage without changing the
      implementation yet.
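
      After the conversion, callers iterate roughly like this (a sketch, not
      a hunk from the patch):

      	struct bio_vec bvec;
      	struct bvec_iter iter;

      	bio_for_each_segment(bvec, bio, iter) {
      		/* bvec is a copy; the iterator accounts for bi_bvec_done,
      		 * so drivers never index bio->bi_io_vec directly */
      		memset(page_address(bvec.bv_page) + bvec.bv_offset,
      		       0, bvec.bv_len);
      	}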
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Clements <Paul.Clements@steeleye.com>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> ...
    • block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet authored
      
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
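
      In practice the rename means field accesses move under bio->bi_iter,
      e.g. (sketch):

      	sector = bio->bi_iter.bi_sector;	/* was bio->bi_sector */
      	bytes  = bio->bi_iter.bi_size;		/* was bio->bi_size   */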
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.co...
      4f024f37
  22. 19 Nov, 2013 2 commits
    • md/raid5: Use conf->device_lock protect changing of multi-thread resources. · 60aaf933
      majianpeng authored
      When we change group_thread_cnt from the sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  135.299021] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299107] PGD 0
      [  135.299122] Oops: 0000 [#1] SMP
      [  135.299144] Modules linked in: netconsole e1000e ptp pps_core
      [  135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
      [  135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
      [  135.299283] RIP: 0010:[<ffffffff815188ab>]  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299323] RSP: 0018:ffff8800b77a5c48  EFLAGS: 00010002
      [  135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
      [  135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
      [  135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
      [  135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
      [  135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
      [  135.299479] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      [  135.299510] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
      [  135.299559] Stack:
      [  135.299570]  ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
      [  135.299611]  000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
      [  135.299654]  000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
      [  135.299696] Call Trace:
      [  135.299711]  [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
      [  135.299733]  [<ffffffff81518f88>] raid5d+0x4c8/0x680
      [  135.299756]  [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
      [  135.299781]  [<ffffffff81524c9f>] md_thread+0x11f/0x170
      [  135.299804]  [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
      [  135.299826]  [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
      [  135.299850]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  135.299871]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299899]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  135.299923]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
      [  135.300005] RIP  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.300005]  RSP <ffff8800b77a5c48>
      [  135.300005] CR2: 0000000000000000
      [  135.300005] ---[ end trace 504854e5bb7562ed ]---
      [  135.300005] Kernel panic - not syncing: Fatal exception
      
      This is because raid5d() can be running when the multi-thread
      resources are changed via sysfs, so we need to provide locking.
      
      mddev->device_lock is suitable, but we cannot simply call
      alloc_thread_groups under this lock as we cannot allocate memory
      while holding a spinlock.
      So change alloc_thread_groups() to allocate and return the data
      structures, then raid5_store_group_thread_cnt() can take the lock
      while updating the pointers to the data structures.
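
      A sketch of the resulting allocate-then-swap pattern (variable names
      are assumptions):

      	struct r5worker_group *new_groups, *old_groups;
      	int group_cnt, worker_cnt_per_group, err;

      	old_groups = conf->worker_groups;
      	err = alloc_thread_groups(conf, new_cnt, &group_cnt,
      				  &worker_cnt_per_group, &new_groups);
      	if (!err) {
      		spin_lock_irq(&conf->device_lock);	/* publish atomically */
      		conf->group_cnt = group_cnt;
      		conf->worker_cnt_per_group = worker_cnt_per_group;
      		conf->worker_groups = new_groups;
      		spin_unlock_irq(&conf->device_lock);
      		kfree(old_groups);
      	}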
      
      This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
      stable series.
      
      Fixes: b721420e
      
      
      Cc: stable@vger.kernel.org (3.12)
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
    • md/raid5: Before freeing old multi-thread worker, it should flush them. · d206dcfa
      majianpeng authored
      When changing group_thread_cnt from the sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961476] PGD b9013067 PUD b651e067 PMD 0
      [  740.961503] Oops: 0000 [#1] SMP
      [  740.961525] Modules linked in: netconsole e1000e ptp pps_core
      [  740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
      [  740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
      [  740.961673] RIP: 0010:[<ffffffff81062570>]  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961708] RSP: 0018:ffff88013a247e08  EFLAGS: 00010086
      [  740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
      [  740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
      [  740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
      [  740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
      [  740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
      [  740.961861] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [  740.961893] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
      [  740.962001] Stack:
      [  740.962001]  0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
      [  740.962001]  ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
      [  740.962001]  ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
      [  740.962001] Call Trace:
      [  740.962001]  [<ffffffff810639c6>] worker_thread+0x206/0x3f0
      [  740.962001]  [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
      [  740.962001]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
      [  740.962001] RIP  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.962001]  RSP <ffff88013a247e08>
      [  740.962001] CR2: 0000000000000008
      [  740.962001] ---[ end trace 39181460000748de ]---
      [  740.962001] Kernel panic - not syncing: Fatal exception
      
      This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
      A worker is queued to handle them.
      But before raid5_do_work() is called, raid5d handles those
      stripes, making conf->active_stripes = 0.
      So mddev_suspend() can return.
      We might then free the old worker resources before the queued
      raid5_do_work() has handled them.  When it runs, it crashes.
      
      	raid5d()		raid5_store_group_thread_cnt()
      	queue_work		mddev_suspend()
      				handle_stripes
      				active_stripe=0
      				free(old worker resources)
      	process_one_work
      	raid5_do_work
      
      To avoid this, we should flush the worker resources before freeing them.
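
      A sketch of the ordering fix (reconstructed from the description;
      raid5_wq is the raid5 workqueue):

      	mddev_suspend(mddev);
      	old_groups = conf->worker_groups;
      	if (old_groups)
      		flush_workqueue(raid5_wq);	/* drain any queued raid5_do_work() */
      	/* only now is it safe to free old_groups and install new ones */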
      
      This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
      stable series.
      
      Cc: stable@vger.kernel.org (3.12)
      Fixes: b721420e
      
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>