Commits · 2b8bf3451d1e3133ebc3998721d14013a6c27114 · nexedi / linux

11 Oct, 2011 3 commits

md: remove typedefs: mdk_thread_t -> struct md_thread · 2b8bf345
NeilBrown authored 13 years ago

Signed-off-by: NeilBrown <neilb@suse.de>


md: remove typedefs: mddev_t -> struct mddev · fd01b88c

NeilBrown authored 13 years ago

Having both mddev_t and 'struct mddev_s' is ugly and not preferred.
Signed-off-by: NeilBrown <neilb@suse.de>

md: removing typedefs: mdk_rdev_t -> struct md_rdev · 3cb03002

NeilBrown authored 13 years ago

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.
Signed-off-by: NeilBrown <neilb@suse.de>

28 Jul, 2011 3 commits

md/raid5: Clear bad blocks on successful write. · b84db560

NeilBrown authored 13 years ago

On a successful write to a known bad block, flag the sh
so that raid5d can remove the known bad block from the list.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: write errors should be recorded as bad blocks if possible. · bc2607f3

NeilBrown authored 13 years ago

When a write error is detected, don't mark the device as failed
immediately but rather record the fact for handle_stripe to deal with.

Handle_stripe then attempts to record a bad block.  Only if that fails
does the device get marked as faulty.
Signed-off-by: NeilBrown <neilb@suse.de>
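
The fallback order described above (try to record a bad block first, fail the device only if that is not possible) can be sketched in plain C. This is an illustration only: `struct member_dev`, `record_bad_block()` and `handle_write_error()` are hypothetical names, and the fixed-capacity log is an assumption, not the kernel's bad-block implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a member device; not the kernel's type. */
struct member_dev {
    bool faulty;
    int  bad_blocks;      /* entries currently in the bad-block log */
    int  bad_block_limit; /* assumed fixed capacity of the log */
};

/* Try to log a bad block; returns false if the log is full. */
static bool record_bad_block(struct member_dev *dev)
{
    if (dev->bad_blocks >= dev->bad_block_limit)
        return false;
    dev->bad_blocks++;
    return true;
}

/* Deferred write-error handling: fail the device only as a last resort. */
static void handle_write_error(struct member_dev *dev)
{
    if (!record_bad_block(dev))
        dev->faulty = true;
}
```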

md/raid5: use bad-block log to improve handling of uncorrectable read errors. · 7f0da59b

NeilBrown authored 13 years ago

If we get an uncorrectable read error - record a bad block rather than
failing the device.
And if these errors (which may be due to known bad blocks) cause
recovery to be impossible, record a bad block on the recovering
devices, or abort the recovery.

As we might abort a recovery without failing a device we need to teach
RAID5 about recovery_disabled handling.
Signed-off-by: NeilBrown <neilb@suse.de>

26 Jul, 2011 4 commits

md/raid5: add some more fields to stripe_head_state · c5709ef6

NeilBrown authored 13 years ago

Adding these three fields will allow more common code to be moved
to handle_stripe()

struct field rearrangement by Namhyung Kim.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: unify stripe_head_state and r6_state · f2b3b44d

NeilBrown authored 13 years ago

'struct stripe_head_state' stores state about the 'current' stripe
that is passed around while handling the stripe.
For RAID6 there is an extension structure: r6_state, which is also
passed around.
There is no value in keeping these separate, so move the fields from
the latter into the former.

This means that all code now needs to treat s->failed_num as a small
array, but this is a small cost.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: replace sh->lock with an 'active' flag. · c4c1663b

NeilBrown authored 13 years ago

sh->lock is now mainly used to ensure that two threads aren't running
in the locked part of handle_stripe[56] at the same time.

That can more neatly be achieved with an 'active' flag which we set
while running handle_stripe.  If we find the flag is set, we simply
requeue the stripe for later by setting STRIPE_HANDLE.

For safety we take ->device_lock while examining the state of the
stripe and creating a summary in 'stripe_head_state / r6_state'.
This possibly isn't needed but as shared fields like ->toread,
->towrite are checked it is safer for now at least.

We leave the label after the old 'unlock' called "unlock" because it
will disappear in a few patches, so renaming seems pointless.

This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
later, but that is not a problem.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
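
A user-space sketch of the 'active' flag idea, using C11 atomics in place of the kernel's bit operations. The flag names echo the commit message, but the bit values, `struct stripe` and `handle_stripe_sketch()` are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit values; the real flag numbering is not shown in the log. */
enum { STRIPE_ACTIVE = 1 << 0, STRIPE_HANDLE = 1 << 1 };

struct stripe { atomic_uint state; };   /* hypothetical stand-in for stripe_head */

/* Returns true if this caller processed the stripe, false if it was requeued. */
static bool handle_stripe_sketch(struct stripe *sh)
{
    unsigned int old = atomic_fetch_or(&sh->state, STRIPE_ACTIVE);

    if (old & STRIPE_ACTIVE) {
        /* Another thread is already running: request a later pass instead. */
        atomic_fetch_or(&sh->state, STRIPE_HANDLE);
        return false;
    }

    /* ... examine and process the stripe here ... */

    atomic_fetch_and(&sh->state, ~(unsigned int)STRIPE_ACTIVE);
    return true;
}
```

The test-and-set on entry plays the role the lock used to play: only one caller at a time gets past it, and a loser leaves a note (STRIPE_HANDLE) rather than waiting.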

md/raid5: Remove use of sh->lock in sync_request · 83206d66

NeilBrown authored 13 years ago

This is the start of a series of patches to remove sh->lock.

sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
there is no race with testing it in handle_stripe[56].

Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
in handle_stripe[56] (after getting the same lock) and perform the
same set/clear operations if it was set.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

18 Apr, 2011 1 commit

md - remove old plugging code. · 482c0834

NeilBrown authored 13 years ago

md has some plugging infrastructure for RAID5 to use because the
normal plugging infrastructure required a 'request_queue', and when
called from dm, RAID5 doesn't have one of those available.

This relied on the ->unplug_fn callback which doesn't exist any more.

So remove all of that code, both in md and raid5.  Subsequent patches
will restore the plugging functionality.
Signed-off-by: NeilBrown <neilb@suse.de>

10 Mar, 2011 1 commit

block: remove per-queue plugging · 7eaceacc

Jens Axboe authored 14 years ago

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

10 Sep, 2010 1 commit

md: implment REQ_FLUSH/FUA support · e9c7469b

Tejun Heo authored 14 years ago

This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
  processing of other requests and thus there is no reason to mark the
  queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
  logic.  Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
  be handled the same way as other writes - their processing can be
  deferred to request_queue of member devices.  md_barrier_request()
  is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough.  raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
  request_queues of member devices.  Barrier related logic removed.

* raid5: Queue draining logic dropped.  FUA bit is propagated through
  biodrain and stripe reconstruction such that all the updated parts
  of the stripe are written out with FUA writes if any of the dirtying
  writes was FUA.  preread_active_stripes handling in make_request()
  is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
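
The raid5 rule above (write out the updated parts with FUA if any of the dirtying writes was FUA) amounts to an OR over the pending writes. A minimal sketch, with `struct pending_write` and `stripe_needs_fua()` as hypothetical names rather than anything from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: one queued write that dirtied part of a stripe. */
struct pending_write { bool fua; };

/* True if any write that dirtied the stripe asked for FUA. */
static bool stripe_needs_fua(const struct pending_write *w, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (w[i].fua)
            return true;
    return false;
}
```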

26 Jul, 2010 4 commits

md/raid5: export raid5 unplugging interface. · 9f7c2220

NeilBrown authored 14 years ago

Also remove remaining accesses to ->queue and ->gendisk when ->queue
is NULL (As it is in a DM target).
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: add simple plugging infrastructure. · 2ac87401

NeilBrown authored 14 years ago

md/raid5 uses the plugging infrastructure provided by the block layer
and 'struct request_queue'.  However when we plug raid5 under dm there
is no request queue so we cannot use that.

So create a similar infrastructure that is much lighter weight and use
it for raid5.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: export is_congested test · 11d8a6e3

NeilBrown authored 14 years ago

The dm module will need this for dm-raid45.

Also only access ->queue->backing_dev_info->congested_fn
if ->queue actually exists.  It won't in a dm target.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk · f4be6b43

NeilBrown authored 14 years ago

We will shortly allow md devices with no gendisk (they are attached to
a dm-target instead).  That will cause mdname() to return 'mdX'.
There is one place where mdname really needs to be unique: when
creating the name for a slab cache.
So in that case, if there is no gendisk, use the address of the mddev
formatted in HEX to provide a unique name.
Signed-off-by: NeilBrown <neilb@suse.de>
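
One way to picture the naming rule: fall back to the object's address, printed in hex, when there is no disk name to use. The struct, the function and the exact format strings below are assumptions for illustration, not the patch's code.

```c
#include <stdio.h>

/* Hypothetical stand-in for the fields the name depends on. */
struct mddev_sketch {
    int level;
    int has_gendisk;
    const char *disk_name;   /* only meaningful when has_gendisk is set */
};

/* Build a cache name that stays unique even without a gendisk. */
static void make_cache_name(char *buf, size_t len, const struct mddev_sketch *m)
{
    if (m->has_gendisk)
        snprintf(buf, len, "raid%d-%s", m->level, m->disk_name);
    else
        /* No gendisk: the address of the object is unique while it lives. */
        snprintf(buf, len, "raid%d-%p", m->level, (void *)m);
}
```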

21 Jul, 2010 1 commit

md/raid5: factor out code for changing size of stripe cache. · c41d4ac4

NeilBrown authored 14 years ago

Separate the actual 'change' code from the sysfs interface
so that it can eventually be called internally.
Signed-off-by: NeilBrown <neilb@suse.de>

17 Feb, 2010 1 commit

percpu: add __percpu sparse annotations to what's left · a29d8b8e

Tejun Heo authored 15 years ago

Add __percpu sparse annotations to places which didn't make it in one
of the previous patches.  All conversions are trivial.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Neil Brown <neilb@suse.de>

16 Oct, 2009 2 commits

md: fix problems with RAID6 calculations for DDF. · e4424fee
NeilBrown authored 15 years ago

Signed-off-by: NeilBrown <neilb@suse.de>


md/raid456: downlevel multicore operations to raid_run_ops · 417b8d4a

Dan Williams authored 15 years ago

The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest gains
(+4%).  However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe().  As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).

By downleveling the parallelism to raid_run_ops the pathological
stripe_head bouncing is eliminated.  This version still exhibits an
average 11% throughput loss for:

	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
	echo 1024 > /sys/block/md0/md/stripe_cache_size
	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

...but the results are at least stable and can be used as a base for
further multicore experimentation.
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

30 Aug, 2009 4 commits

md/raid6: asynchronous raid6 operations · ac6b53b6

Dan Williams authored 15 years ago

[ Based on an original patch by Yuri Tikhonov ]

The raid_run_ops routine uses the asynchronous offload api and
the stripe_operations member of a stripe_head to carry out xor+pq+copy
operations asynchronously, outside the lock.

The operations performed by RAID-6 are the same as in the RAID-5 case
except for no support of STRIPE_OP_PREXOR operations. All the others
are supported:
STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate missing blocks (1 or 2) in the cache from the other blocks
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_RECONSTRUCT
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct

The flow is the same as in the RAID-5 case, and reuses some routines, namely:
1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
2/ ops_complete_compute (updated to set up to 2 targets uptodate)
3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

[neilb@suse.de: fixes to get it to pass mdadm regression suite]
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

async_tx: add sum check flags · ad283ea4

Dan Williams authored 15 years ago

Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed
solomon syndrome) zero-sum result.  Use the SUM_CHECK_ namespace instead
of DMA_ since these flags will be used on non-dma-zero-sum enabled
platforms.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
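
A sketch of what replacing a flat result with per-parity flags might look like. The identifiers follow the SUM_CHECK_ namespace named above, but their exact spelling, the bit positions, and the convention that a set bit means a non-zero sum (a mismatch) are assumptions here; `p_parity_ok()` and `q_parity_ok()` are hypothetical helpers.

```c
#include <stdbool.h>

/* Assumed bit positions, one per parity block. */
enum sum_check_bits {
    SUM_CHECK_P = 0,
    SUM_CHECK_Q = 1
};

/* Assumed flag values derived from the bit positions. */
enum sum_check_flags {
    SUM_CHECK_P_RESULT = 1 << SUM_CHECK_P,
    SUM_CHECK_Q_RESULT = 1 << SUM_CHECK_Q
};

/* Assumed convention: a set flag means that parity check did not sum to zero. */
static bool p_parity_ok(unsigned int result)
{
    return (result & SUM_CHECK_P_RESULT) == 0;
}

static bool q_parity_ok(unsigned int result)
{
    return (result & SUM_CHECK_Q_RESULT) == 0;
}
```

Keeping P and Q in separate bits lets a caller tell which of the two checks failed, which a single flat value cannot express.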

md/raid5,6: add percpu scribble region for buffer lists · d6f38f31

Dan Williams authored 15 years ago

Use percpu memory rather than stack for storing the buffer lists used in
parity calculations.  Include space for dma address conversions and pass
that to async_tx via the async_submit_ctl.scribble pointer.

[ Impact: move memory pressure from stack to heap ]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md/raid6: move the spare page to a percpu allocation · 36d1c647

Dan Williams authored 15 years ago

In preparation for asynchronous handling of raid6 operations move the
spare page to a percpu allocation to allow multiple simultaneous
synchronous raid6 recovery operations.

Make this allocation cpu hotplug aware to maximize allocation
efficiency.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

17 Jun, 2009 1 commit

md: convert conf->chunk_size and conf->prev_chunk to sectors. · 09c9e5fa

Andre Noll authored 15 years ago

This kills some more shifts.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
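
To illustrate which shifts go away: with the chunk size kept in bytes, every sector calculation first converts bytes to 512-byte sectors; with it kept in sectors, the value is used directly. The function names are hypothetical and the 512-byte sector (shift by 9) is the usual block-layer convention, stated here as an assumption.

```c
#include <stdint.h>

/* Before: chunk size stored in bytes, converted at each use. */
static uint64_t chunk_index_from_bytes(uint64_t sector, uint32_t chunk_size_bytes)
{
    return sector / (chunk_size_bytes >> 9);   /* bytes -> 512-byte sectors */
}

/* After: chunk size stored in sectors, no conversion needed. */
static uint64_t chunk_index_from_sectors(uint64_t sector, uint32_t chunk_sectors)
{
    return sector / chunk_sectors;
}
```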

16 Jun, 2009 1 commit

md: remove mddev_to_conf "helper" macro · 070ec55d

NeilBrown authored 15 years ago

Having a macro just to cast a void* isn't really helpful.
I would much rather see that we are simply de-referencing ->private,
than have to know what the macro does.

So open code the macro everywhere and remove the pointless cast.
Signed-off-by: NeilBrown <neilb@suse.de>
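
For illustration, open-coding the access looks roughly like this. The struct names are hypothetical stand-ins; the point is only that C converts a `void *` to another object pointer type implicitly, so neither a macro nor a cast is needed.

```c
/* Hypothetical stand-ins for the personality's private data and the md device. */
struct raid5_conf_sketch { int raid_disks; };
struct mddev_private_sketch { void *private; };

static int raid_disks_of(struct mddev_private_sketch *mddev)
{
    /* Plain assignment from void *; no cast and no helper macro. */
    struct raid5_conf_sketch *conf = mddev->private;

    return conf->raid_disks;
}
```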

31 Mar, 2009 13 commits

md/raid5 revise rules for when to update metadata during reshape · c8f517c4

NeilBrown authored 15 years ago

We currently update the metadata:
 1/ every 3 megabytes
 2/ When the place we will write new-layout data to is recorded in
    the metadata as still containing old-layout data.

Rule one exists to avoid having to re-do too much reshaping in the
face of a crash/restart.  So it should really be time based rather
than size based.  So change it to "every 10 seconds".

Rule two turns out to be too harsh when restriping an array
'in-place', as in that case the metadata must be updated for every
stripe.
For the in-place update, it can only possibly be safe from a crash if
some user-space program makes a backup of every e.g. few hundred
stripes before allowing them to be reshaped.  In that case, the
constant metadata update is pointless.
So only update the metadata if the new metadata will report that the
end of the 'old-layout' data is beyond where we are currently
writing 'new-layout' data.
Signed-off-by: NeilBrown <neilb@suse.de>
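
The time-based half of the new rule can be sketched as a simple elapsed-time check. This covers only "every 10 seconds"; the position-based condition is left out, and `checkpoint_due()` with its plain seconds arguments is a hypothetical helper, not how the kernel tracks time.

```c
#include <stdbool.h>

#define CHECKPOINT_INTERVAL_SEC 10L   /* "every 10 seconds" from the message above */

/* True once at least the checkpoint interval has passed since the last update. */
static bool checkpoint_due(long now_sec, long last_checkpoint_sec)
{
    return now_sec - last_checkpoint_sec >= CHECKPOINT_INTERVAL_SEC;
}
```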

md/raid5: prepare for allowing reshape to change layout · e183eaed

NeilBrown authored 15 years ago

Add prev_algo to raid5_conf_t along the same lines as prev_chunk
and previous_raid_disks.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: prepare for allowing reshape to change chunksize. · 784052ec

NeilBrown authored 15 years ago

Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
remember what the chunk size was before the reshape that is currently
underway.

This seems like duplication with "chunk_size" and "new_chunk" in
mddev_t, and to some extent it is, but there are differences.
The values in mddev_t are always defined and often the same.
The prev* values are only defined if a reshape is underway.

Also (and more significantly) the raid5_conf_t values will be changed
at the same time (inside an appropriate lock) that the reshape is
started by setting reshape_position.  In contrast, the new_chunk value
is set when the sysfs file is written which could be well before the
reshape starts.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: clearly differentiate 'before' and 'after' stripes during reshape. · 86b42c71

NeilBrown authored 15 years ago

During a raid5 reshape, we have some stripes in the cache that are
'before' the reshape (and are still to be processed) and some that are
'after'.  They are currently differentiated by having different
->disks values as the only reshape currently supported involves changing
the number of disks.

However we will soon support reshapes that do not change the number
of disks (chunk parity or chunk size).  So make the difference more
explicit with a 'generation' number.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: change reshape-progress measurement to cope with reshaping backwards. · fef9c61f

NeilBrown authored 15 years ago

When reducing the number of devices in a raid4/5/6, the reshape
process has to start at the end of the array and work down to the
beginning.  So we need to handle expand_progress and expand_lo
differently.

This patch renames "expand_progress" and "expand_lo" to avoid the
implication that anything is getting bigger (expand->reshape) and
every place they are used, we make sure that they are used the right
way depending on whether delta_disks is positive or negative.
Signed-off-by: NeilBrown <neilb@suse.de>
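
A sketch of why the direction matters when comparing a sector against the progress marker. The function name and the treatment of the boundary sector are assumptions; only the dependence on the sign of delta_disks comes from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Does this sector still hold old-layout data?
 * Growing (delta_disks >= 0): the reshape runs upwards, so sectors at or
 * beyond the progress marker have not been reached yet.
 * Shrinking (delta_disks < 0): the reshape runs downwards from the end,
 * so sectors below the marker have not been reached yet.
 */
static bool uses_old_layout(uint64_t sector, uint64_t reshape_progress,
                            int delta_disks)
{
    if (delta_disks < 0)
        return sector < reshape_progress;
    return sector >= reshape_progress;
}
```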

md/raid5: drop qd_idx from r6_state · 34e04e87

NeilBrown authored 15 years ago

We now have this value in stripe_head so we don't need to duplicate
it.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid6: move raid6 data processing to raid6_pq.ko · f701d589

Dan Williams authored 15 years ago

Move the raid6 data processing routines into a standalone module
(raid6_pq) to prepare them to be called from async_tx wrappers and other
non-md drivers/modules.  This precludes a circular dependency of raid456
needing the async modules for data processing while those modules in
turn depend on raid456 for the base level synchronous raid6 routines.

To support this move:
1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
2/ The raid6_call, recovery calls, and table symbols are exported
3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
   compile
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: refactor raid5 "run" · 91adb564

NeilBrown authored 15 years ago

.. so that the code to create the private data structures is separate.
This will help with future code to change the level of an active
array.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: finish support for DDF/raid6 · 67cc2b81

NeilBrown authored 15 years ago

DDF requires RAID6 calculations over different devices in a different
order.
For md/raid6, we calculate over just the data devices, starting
immediately after the 'Q' block.
For ddf/raid6 we calculate over all devices, using zeros in place of
the P and Q blocks.

This requires unfortunately complex loops...
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: Add support for new layouts for raid5 and raid6. · 99c0fb5f

NeilBrown authored 15 years ago

DDF uses different layouts for P and Q blocks than current md/raid6
so add those that are missing.
Also add support for RAID6 layouts that are identical to various
raid5 layouts with the simple addition of one device to hold all of
the 'Q' blocks.
Finally add 'raid5' layouts to match raid4.
These last two will allow online level conversion.

Note that this does not provide correct support for DDF/raid6 yet
as the order in which data blocks are summed to produce the Q block
is significant and different between current md code and DDF
requirements.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid6: remove expectation that Q device is immediately after P device. · d0dabf7e

NeilBrown authored 15 years ago

Code currently assumes that the devices in a raid6 stripe are
  0 1 ... N-1 P Q
in some rotated order.  We will shortly add new layouts in which
this strict pattern is broken.
So remove this expectation.  We still assume that the data disks
are roughly in-order.  However P and Q can be inserted anywhere within
that order.
Signed-off-by: NeilBrown <neilb@suse.de>

md: move lots of #include lines out of .h files and into .c · bff61975

NeilBrown authored 15 years ago

This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>

md: move headers out of include/linux/raid/ · ef740c37

Christoph Hellwig authored 15 years ago

Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away.  md.h is left where it is for now as there
are some uses from the outside.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
