  1. 18 May, 2010 1 commit
  2. 17 May, 2010 1 commit
    • md: manage redundancy group in sysfs when changing level. · a64c876f
      NeilBrown authored
      Some levels expect the 'redundancy group' to be present,
      others don't.
      So when we change the level of an array we might need to
      add or remove this group.
      
      This requires fixing up the current practice of overloading ->private
      to indicate (when ->pers == NULL) that something needs to be removed.
      So create a new ->to_remove to fill that role.
      
      When changing levels, we may need to add or remove attributes.  When
      changing RAID5 -> RAID6, we both add and remove the same thing.  It is
      important to catch this and optimise it out as the removal is delayed
      until a lock is released, so trying to add immediately would cause
      problems.
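
      As a hedged illustration of the pattern described above (all names
      below, such as struct mddev_sketch and change_level_groups, are
      invented; the real code lives in drivers/md/md.c):

      	#include <linux/kernel.h>
      	#include <linux/kobject.h>
      	#include <linux/sysfs.h>

      	/* Illustrative only: a ->to_remove slot replaces the old trick
      	 * of overloading ->private to flag pending sysfs removal. */
      	struct mddev_sketch {
      		struct kobject kobj;
      		const struct attribute_group *to_remove;
      	};

      	static void change_level_groups(struct mddev_sketch *mddev,
      					const struct attribute_group *old_grp,
      					const struct attribute_group *new_grp)
      	{
      		if (old_grp == new_grp)
      			return;	/* RAID5 -> RAID6: adding now would collide
      				 * with the delayed removal, so skip both */
      		if (old_grp)
      			mddev->to_remove = old_grp;	/* removed after unlock */
      		if (new_grp)
      			WARN_ON(sysfs_create_group(&mddev->kobj, new_grp));
      	}

      	/* called once the per-array lock has been released */
      	static void md_flush_to_remove(struct mddev_sketch *mddev)
      	{
      		if (mddev->to_remove) {
      			sysfs_remove_group(&mddev->kobj, mddev->to_remove);
      			mddev->to_remove = NULL;
      		}
      	}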
      
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 26 Feb, 2010 1 commit
  4. 17 Feb, 2010 1 commit
    • percpu: add __percpu sparse annotations to what's left · a29d8b8e
      Tejun Heo authored
      Add __percpu sparse annotations to places which didn't make it in one
      of the previous patches.  All conversions are trivial.
      
      These annotations are to make sparse consider percpu variables to be
      in a different address space and warn if accessed without going
      through percpu accessors.  This patch doesn't affect normal builds.
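
      For illustration, a minimal hedged sketch of what the annotation
      buys (io_count is an invented variable, not one of the sites this
      patch touched):

      	#include <linux/errno.h>
      	#include <linux/percpu.h>
      	#include <linux/types.h>

      	/* __percpu puts the pointer in sparse's percpu address space:
      	 * a direct dereference now draws a sparse warning, while access
      	 * through the percpu accessors stays clean. */
      	static u64 __percpu *io_count;

      	static int init_io_count(void)
      	{
      		io_count = alloc_percpu(u64);
      		if (!io_count)
      			return -ENOMEM;
      		this_cpu_inc(*io_count);	/* OK: uses an accessor */
      		return 0;
      	}
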
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
  5. 10 Feb, 2010 1 commit
    • md: fix some lockdep issues between md and sysfs. · ef286f6f
      NeilBrown authored
      ======
      This fix is related to
          http://bugzilla.kernel.org/show_bug.cgi?id=15142
      but does not address that exact issue.
      ======
      
      sysfs does not like attributes being removed while they are being
      accessed (i.e. read or written), and waits for the access to complete.
      
      As accessing some md attributes takes the same lock that is held
      while removing those attributes, a deadlock can occur.
      
      This patch addresses 3 issues in md that could lead to this deadlock.
      
      Two relate to calling flush_scheduled_work while the lock is held.
      This is probably a bad idea in general, and as we use schedule_work
      to delete various sysfs objects it is particularly bad.
      
      In one case flush_scheduled_work is called from md_alloc (called by
      md_probe), which is called from do_md_run, which holds the lock.  This
      call is only present to ensure that ->gendisk is set.  However we can
      be sure that gendisk is always set (though possibly we couldn't when
      that code was originally written).  This is because do_md_run is
      called in three different contexts:
        1/ from md_ioctl.  This requires that md_open has succeeded, and it
           fails if ->gendisk is not set.
        2/ from writing a sysfs attribute.  This can only happen if the
           mddev has been registered in sysfs which happens in md_alloc
           after ->gendisk has been set.
        3/ from autorun_array which is only called by autorun_devices, which
           checks for ->gendisk to be set before calling autorun_array.
      So the call to md_probe in do_md_run can be removed, and the check on
      ->gendisk can also go.
      
      
      In the other case flush_scheduled_work is being called in do_md_stop,
      purportedly to wait for all md_delayed_delete calls (which delete the
      component rdevs) to complete.  However there really isn't any need to
      wait for them - they have already been disconnected in all important
      ways.
      
      The third issue is that raid5->stop() removes some attribute names
      while the lock is held.  There is already some infrastructure in place
      to delay attribute removal until after the lock is released (using
      schedule_work).  So extend that infrastructure to remove the
      raid5_attrs_group.
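
      A hedged sketch of that deferral infrastructure, with invented
      names throughout; the point is that the actual sysfs removal runs
      from a workqueue, after the array lock has been dropped:

      	#include <linux/errno.h>
      	#include <linux/kernel.h>
      	#include <linux/kobject.h>
      	#include <linux/slab.h>
      	#include <linux/sysfs.h>
      	#include <linux/workqueue.h>

      	struct deferred_del {
      		struct work_struct work;
      		struct kobject *kobj;
      		const struct attribute_group *grp;
      	};

      	static void deferred_del_fn(struct work_struct *ws)
      	{
      		struct deferred_del *dd =
      			container_of(ws, struct deferred_del, work);

      		/* no array lock held here, so a reader stuck in a sysfs
      		 * ->show on one of these attributes can finish first */
      		sysfs_remove_group(dd->kobj, dd->grp);
      		kfree(dd);
      	}

      	/* called with the array lock held, e.g. from ->stop() */
      	static int remove_group_later(struct kobject *kobj,
      				      const struct attribute_group *grp)
      	{
      		struct deferred_del *dd = kmalloc(sizeof(*dd), GFP_KERNEL);

      		if (!dd)
      			return -ENOMEM;
      		dd->kobj = kobj;
      		dd->grp = grp;
      		INIT_WORK(&dd->work, deferred_del_fn);
      		schedule_work(&dd->work);
      		return 0;
      	}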
      
      This does not address all lockdep issues related to the sysfs
      "s_active" lock.  The rest can be addressed by splitting that lockdep
      context between symlinks and non-symlinks, which will hopefully happen.
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 09 Feb, 2010 1 commit
    • md: fix 'degraded' calculation when starting a reshape. · 9eb07c25
      NeilBrown authored
      This code was written long ago when it was not possible to
      reshape a degraded array.  Now it is possible, so the current level of
      degraded-ness needs to be taken into account.  Also, newly added
      devices should only reduce degradedness if they are deemed to be
      in-sync.
      
      In particular, if you convert a RAID5 to a RAID6, and increase the
      number of devices at the same time, then the 5->6 conversion will
      make the array degraded so the current code will produce a wrong
      value for 'degraded' - "-1" to be precise.
      
      If the reshape runs to completion end_reshape will calculate a correct
      new value for 'degraded', but if a device fails during the reshape an
      incorrect decision might be made based on the incorrect value of
      "degraded".
      
      This patch is suitable for 2.6.32-stable and if they are still open,
      2.6.31-stable and 2.6.30-stable as well.
      
      Cc: stable@kernel.org
      Reported-by: Michael Evans <mjevans1983@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 14 Dec, 2009 4 commits
    • md: add MODULE_DESCRIPTION for all md related modules. · 0efb9e61
      NeilBrown authored
      Suggested by Oren Held <orenhe@il.ibm.com>
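
      For illustration, the kind of one-line addition this describes
      (the exact description strings in the patch may differ):

      	#include <linux/module.h>

      	MODULE_DESCRIPTION("RAID4/5/6 (striping with parity) personality for MD");
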
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: don't complete make_request on barrier until writes are scheduled · 729a1866
      NeilBrown authored
      The post-barrier-flush is sent by md as soon as make_request on the
      barrier write completes.  For raid5, the data might not be in the
      per-device queues yet.  So for barrier requests, wait for any
      pre-reading to be done so that the request will be in the per-device
      queues.
      
      We use the 'preread_active' count to check that nothing is still in
      the preread phase, and delay the decrement of this count until after
      write requests have been submitted to the underlying devices.
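
      A hedged sketch of that counting scheme, with invented names: the
      barrier path waits for the count to drain, and the count is only
      dropped once the stripe's writes are in the per-device queues:

      	#include <linux/atomic.h>
      	#include <linux/wait.h>

      	static atomic_t preread_active = ATOMIC_INIT(0);
      	static DECLARE_WAIT_QUEUE_HEAD(preread_wait);

      	static void stripe_begin_preread(void)
      	{
      		atomic_inc(&preread_active);
      	}

      	/* decrement only after writes reach the underlying devices */
      	static void stripe_writes_submitted(void)
      	{
      		if (atomic_dec_and_test(&preread_active))
      			wake_up(&preread_wait);
      	}

      	/* barrier path: don't complete make_request while prereads remain */
      	static void wait_for_prereads(void)
      	{
      		wait_event(preread_wait, atomic_read(&preread_active) == 0);
      	}
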
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: support barrier requests on all personalities. · a2826aa9
      NeilBrown authored
      Previously barriers were only supported on RAID1.  This is because
      other levels require synchronisation across all devices and so need
      a different approach.
      Here is that approach.
      
      When a barrier arrives, we send a zero-length barrier to every active
      device.  When that completes - and if the original request was not
      empty -  we submit the barrier request itself (with the barrier flag
      cleared) and then submit a fresh load of zero length barriers.
      
      The barrier request itself is asynchronous, but any subsequent
      request will block until the barrier completes.
      
      The reason for clearing the barrier flag is that a barrier request is
      allowed to fail.  If we pass a non-empty barrier through a striping
      raid level it is conceivable that part of it could succeed and part
      could fail.  That would be way too hard to deal with.
      So if the first run of zero-length barriers succeeds, we assume all is
      sufficiently well that we send the request and ignore errors in the
      second run of barriers.
      
      RAID5 needs extra care as write requests may not have been submitted
      to the underlying devices yet.  So we flush the stripe cache before
      proceeding with the barrier.
      
      Note that the second set of zero-length barriers is submitted
      immediately after the original request is submitted.  Thus when
      a personality finds mddev->barrier to be set during make_request,
      it should not return from make_request until the corresponding
      per-device request(s) have been queued.
      
      That will be done in later patches.
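
      A hedged sketch of the three-phase sequence, with invented helper
      names layered over the block API of that era:

      	#include <linux/bio.h>

      	struct md_sketch;	/* stand-in for the md device */

      	void submit_empty_barriers(struct md_sketch *mddev);
      	int wait_empty_barriers(struct md_sketch *mddev);	/* 0 on success */
      	void queue_payload(struct md_sketch *mddev, struct bio *bio);

      	void barrier_request_sketch(struct md_sketch *mddev, struct bio *bio)
      	{
      		/* 1: zero-length barrier to every active device */
      		submit_empty_barriers(mddev);
      		if (wait_empty_barriers(mddev) < 0) {
      			bio_io_error(bio);	/* a barrier is allowed to fail */
      			return;
      		}
      		if (bio_has_data(bio)) {
      			/* 2: the payload with the barrier flag cleared, so
      			 * it cannot partially fail across a striped array */
      			queue_payload(mddev, bio);
      			/* 3: a fresh round of zero-length barriers, sent as
      			 * soon as the payload is queued; errors ignored */
      			submit_empty_barriers(mddev);
      		}
      	}
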
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
    • md/raid5: remove some sparse warnings. · 8553fe7e
      NeilBrown authored
      qd_idx is previously declared and given exactly the same value!
      Signed-off-by: NeilBrown <neilb@suse.de>
  8. 13 Nov, 2009 2 commits
    • md/raid5: Allow dirty-degraded arrays to be assembled when only parity is degraded. · c148ffdc
      NeilBrown authored
      Normally it is not safe to allow a raid5 that is both dirty and
      degraded to be assembled without an explicit request from the admin,
      as it can cause hidden data corruption.
      This is because 'dirty' means that the parity cannot be trusted, and
      'degraded' means that the parity needs to be used.
      
      However, if the device that is missing contains only parity, then
      there is no issue and assembly can continue.
      This particularly applies when a RAID5 is being converted to a RAID6
      and there is an unclean shutdown while the conversion is happening.
      
      So check for whether the degraded space only contains parity, and
      in that case, allow the assembly.
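
      Conceptually, the new test asks whether the missing slot holds a
      parity block on every stripe.  A much-simplified, hedged sketch,
      assuming one of the fixed-parity layouts used while converting
      RAID5 to RAID6 (the real per-layout check in raid5.c is more
      involved, since rotating layouts move parity from stripe to
      stripe):

      	#include <linux/types.h>

      	/* With a fixed-parity layout the last max_degraded slots of
      	 * every stripe are parity (P, then Q for RAID6), so a missing
      	 * device in that range carries no data and a dirty-degraded
      	 * assembly is safe. */
      	static bool holds_only_parity(int raid_disk, int raid_disks,
      				      int max_degraded)
      	{
      		return raid_disk >= raid_disks - max_degraded;
      	}
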
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Don't unconditionally set in_sync on newly added device in raid5_reshape · 7ef90146
      NeilBrown authored
      When a reshape finds that it can add spare devices into the array,
      those devices might already be 'in_sync' if they are beyond the old
      size of the array, or they might not if they are within the array.
      
      The first case happens when we change an N-drive RAID5 to an
      N+1-drive RAID5.
      The second happens when we convert an N-drive RAID5 to an
      N+1-drive RAID6.
      
      So set the flag more carefully.
      Also, ->recovery_offset is only meaningful when the flag is clear,
      so only set it in that case.
      
      This change needs the preceding two to ensure that the non-in_sync
      device doesn't get evicted from the array when it is stopped, in the
      case where v0.90 metadata is used.
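
      A hedged sketch of that decision (invented names; the real code
      sets the In_sync flag and ->recovery_offset on the rdev):

      	#include <linux/types.h>

      	struct spare_sketch { bool in_sync; sector_t recovery_offset; };

      	/* A spare pulled in by a simple grow (N-drive RAID5 to
      	 * N+1-drive RAID5) lies beyond the old size and can start
      	 * in_sync; one pulled in by a level change (RAID5 to RAID6)
      	 * must recover, and only then is recovery_offset meaningful. */
      	static void activate_spare(struct spare_sketch *rdev,
      				   bool beyond_old_size)
      	{
      		rdev->in_sync = beyond_old_size;
      		if (!rdev->in_sync)
      			rdev->recovery_offset = 0;
      	}
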
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 06 Nov, 2009 1 commit
    • md/raid5: make sure curr_sync_completes is uptodate when reshape starts · 8dee7211
      NeilBrown authored
      This value is visible through sysfs and is used by mdadm
      when it manages a reshape (backing up data that is about to be
      rearranged).  So it is important that it is always correct.
      Currently it does not get updated properly when a reshape
      starts, which can cause problems when assembling an array
      that is in the middle of being reshaped.
      
      This is suitable for 2.6.31.y stable kernels.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 20 Oct, 2009 1 commit
  11. 16 Oct, 2009 6 commits
    • md/async: don't pass a memory pointer as a page pointer. · 5dd33c9a
      NeilBrown authored
      md/raid6 passes a list of 'struct page *' to the async_tx routines,
      which then either DMA map them for offload, or take the page_address
      for CPU based calculations.
      
      For RAID6 we sometimes leave 'blanks' in the list of pages.
      For CPU-based calcs, we want to treat these as a page of zeros.
      For offloaded calculations, we simply don't pass a page to the
      hardware.
      
      Currently the 'blanks' are encoded as a pointer to
      raid6_empty_zero_page.  This is a 4096 byte memory region, not a
      'struct page'.  This is mostly handled correctly but is rather ugly.
      
      So change the code to pass and expect a NULL pointer for the blanks.
      When taking page_address of a page, we need to check for a NULL and
      in that case use raid6_empty_zero_page.
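
      A hedged sketch of the convention after this change (src_address
      is invented; raid6_empty_zero_page is the real zero buffer
      declared in linux/raid/pq.h in kernels of this era):

      	#include <linux/mm.h>		/* page_address() */
      	#include <linux/raid/pq.h>	/* raid6_empty_zero_page */

      	/* A NULL entry in a source list now means "a page of zeros"
      	 * for the CPU-based math; the DMA offload path simply skips
      	 * NULL entries instead of mapping anything. */
      	static const void *src_address(struct page *p)
      	{
      		return p ? (const void *)page_address(p)
      			 : (const void *)raid6_empty_zero_page;
      	}
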
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix handling of raid5 array which is being reshaped to fewer devices. · 5e5e3e78
      NeilBrown authored
      When a raid5 (or raid6) array is being reshaped to have fewer devices,
      conf->raid_disks is the latter and hence smaller number of devices.
      However sometimes we want to use a number which is the total number of
      currently required devices - the larger of the 'old' and 'new' sizes.
      Before we implemented reducing the number of devices, this was always
      'new' i.e. ->raid_disks.
      Now we need max(raid_disks, previous_raid_disks) in those places.
      
      This particularly affects assembling an array that was shutdown while
      in the middle of a reshape to fewer devices.
      
      md.c needs a similar fix when interpreting the md metadata.
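
      A hedged sketch of the resulting idiom (invented helper; raid5.c
      open-codes the max() at each affected site):

      	#include <linux/kernel.h>	/* max() */

      	/* When shrinking, previous_raid_disks exceeds raid_disks, so a
      	 * loop over devices that may still hold data must cover the
      	 * larger of the old and new geometries. */
      	static int active_disks(int raid_disks, int previous_raid_disks)
      	{
      		return max(raid_disks, previous_raid_disks);
      	}
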
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix problems with RAID6 calculations for DDF. · e4424fee
      NeilBrown authored
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid456: downlevel multicore operations to raid_run_ops · 417b8d4a
      Dan Williams authored
      The percpu conversion allowed a straightforward handoff of stripe
      processing to the async subsystem that initially showed some modest
      gains (+4%).  However, this model is too simplistic and leads to
      stripes bouncing between raid5d and the async thread pool for every
      invocation of handle_stripe().  As reported by Holger, this can fall
      into a pathological situation severely impacting throughput (6x
      performance loss).
      
      By downleveling the parallelism to raid_run_ops the pathological
      stripe_head bouncing is eliminated.  This version still exhibits an
      average 11% throughput loss for:
      
      	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
      	echo 1024 > /sys/block/md0/md/stripe_cache_size
      	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
      
      ...but the results are at least stable and can be used as a base for
      further multicore experimentation.
      Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: initialize conf->device_lock earlier · f5efd45a
      Dan Williams authored
      Deallocating a raid5_conf_t structure requires taking 'device_lock'.
      Ensure it is initialized before it is used, i.e. initialize the lock
      before attempting any further initializations that might fail.
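
      A hedged sketch of the ordering rule, with invented names: the
      lock is initialized before anything that can fail, because every
      failure path may end up taking it during teardown:

      	#include <linux/slab.h>
      	#include <linux/spinlock.h>

      	struct conf_sketch {
      		spinlock_t device_lock;
      		void *stripe_table;
      	};

      	static void free_conf_sketch(struct conf_sketch *conf)
      	{
      		/* teardown may take device_lock, so it must be valid */
      		spin_lock(&conf->device_lock);
      		spin_unlock(&conf->device_lock);
      		kfree(conf->stripe_table);
      		kfree(conf);
      	}

      	static struct conf_sketch *alloc_conf_sketch(void)
      	{
      		struct conf_sketch *conf = kzalloc(sizeof(*conf), GFP_KERNEL);

      		if (!conf)
      			return NULL;
      		spin_lock_init(&conf->device_lock);	/* first! */

      		conf->stripe_table = kzalloc(4096, GFP_KERNEL);
      		if (!conf->stripe_table) {
      			free_conf_sketch(conf);
      			return NULL;
      		}
      		return conf;
      	}
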
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Revert "md: do not progress the resync process if the stripe was blocked" · 1442577b
      NeilBrown authored
      This reverts commit df10cfbc.
      
      This patch was based on a misunderstanding and risks introducing a busy-wait loop.
      So revert it.
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 23 Sep, 2009 3 commits
  13. 16 Sep, 2009 2 commits
  14. 11 Sep, 2009 1 commit
  15. 09 Sep, 2009 1 commit
    • dmaengine: add fence support · 0403e382
      Dan Williams authored
      Some engines optimize operation by reading ahead in the descriptor chain
      such that descriptor2 may start execution before descriptor1 completes.
      If descriptor2 depends on the result from descriptor1 then a fence is
      required (on descriptor2) to disable this optimization.  The async_tx
      api could implicitly identify dependencies via the 'depend_tx'
      parameter, but that would constrain cases where the dependency chain
      only specifies a completion order rather than a data dependency.  So,
      provide an ASYNC_TX_FENCE to explicitly identify data dependencies.
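
      A hedged usage sketch: init_async_submit, async_xor,
      ASYNC_TX_FENCE and addr_conv_t are the real async_tx interfaces of
      this series, while fenced_xor_chain and its buffers are invented:

      	#include <linux/async_tx.h>

      	/* The second XOR consumes the result of the first, a true data
      	 * dependency, so its descriptor carries ASYNC_TX_FENCE.  An
      	 * ordering-only dependency would pass depend_tx without it. */
      	static void fenced_xor_chain(struct page *dest, struct page **srcs,
      				     int src_cnt, size_t len,
      				     addr_conv_t *scribble)
      	{
      		struct dma_async_tx_descriptor *tx;
      		struct async_submit_ctl submit;

      		init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST,
      				  NULL, NULL, NULL, scribble);
      		tx = async_xor(dest, srcs, 0, src_cnt, len, &submit);

      		init_async_submit(&submit,
      				  ASYNC_TX_FENCE | ASYNC_TX_XOR_ZERO_DST,
      				  tx, NULL, NULL, scribble);
      		async_xor(dest, srcs, 0, src_cnt, len, &submit);
      	}
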
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  16. 30 Aug, 2009 12 commits
  17. 13 Aug, 2009 1 commit