Commits · 7b92813c3c0b6990f14838e3985fb385d2655d0c · Kirill Smelkov / linux

An error occurred fetching the project authors.

18 May, 2010 1 commit

drivers/md: Remove unnecessary casts of void * · 7b92813c

void pointers do not need to be cast to other pointer types.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>

7b92813c

17 May, 2010 1 commit

md: manage redundancy group in sysfs when changing level. · a64c876f

NeilBrown authored 14 years ago

Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.

This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.

When changing levels, we may need to add or remove attributes.  When
changing RAID5 -> RAID6, we both add and remove the same thing.  It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

a64c876f

26 Feb, 2010 1 commit

block: Consolidate phys_segment and hw_segment limits · 8a78362c

Martin K. Petersen authored 15 years ago

Except for SCSI no device drivers distinguish between physical and
hardware segment limits.  Consolidate the two into a single segment
limit.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

8a78362c

17 Feb, 2010 1 commit

percpu: add __percpu sparse annotations to what's left · a29d8b8e

Tejun Heo authored 15 years ago

Add __percpu sparse annotations to places which didn't make it in one
of the previous patches.  All converions are trivial.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Neil Brown <neilb@suse.de>

a29d8b8e

10 Feb, 2010 1 commit

md: fix some lockdep issues between md and sysfs. · ef286f6f

NeilBrown authored 15 years ago

======
This fix is related to
    http://bugzilla.kernel.org/show_bug.cgi?id=15142
but does not address that exact issue.
======

sysfs does like attributes being removed while they are being accessed
(i.e. read or written) and waits for the access to complete.

As accessing some md attributes takes the same lock that is held while
removing those attributes a deadlock can occur.

This patch addresses 3 issues in md that could lead to this deadlock.

Two relate to calling flush_scheduled_work while the lock is held.
This is probably a bad idea in general and as we use schedule_work to
delete various sysfs objects it is particularly bad.

In one case flush_scheduled_work is called from md_alloc (called by
md_probe) called from do_md_run which holds the lock.  This call is
only present to ensure that ->gendisk is set.  However we can be sure
that gendisk is always set (though possibly we couldn't when that code
was originally written.  This is because do_md_run is called in three
different contexts:
  1/ from md_ioctl.  This requires that md_open has succeeded, and it
     fails if ->gendisk is not set.
  2/ from writing a sysfs attribute.  This can only happen if the
     mddev has been registered in sysfs which happens in md_alloc
     after ->gendisk has been set.
  3/ from autorun_array which is only called by autorun_devices, which
     checks for ->gendisk to be set before calling autorun_array.
So the call to md_probe in do_md_run can be removed, and the check on
->gendisk can also go.


In the other case flush_scheduled_work is being called in do_md_stop,
purportedly to wait for all md_delayed_delete calls (which delete the
component rdevs) to complete.  However there really isn't any need to
wait for them - they have already been disconnected in all important
ways.

The third issue is that raid5->stop() removes some attribute names
while the lock is held.  There is already some infrastructure in place
to delay attribute removal until after the lock is released (using
schedule_work).  So extend that infrastructure to remove the
raid5_attrs_group.

This does not address all lockdep issues related to the sysfs
"s_active" lock.  The rest can be address by splitting that lockdep
context between symlinks and non-symlinks which hopefully will happen.
Signed-off-by: NeilBrown <neilb@suse.de>

ef286f6f

09 Feb, 2010 1 commit

md: fix 'degraded' calculation when starting a reshape. · 9eb07c25

NeilBrown authored 15 years ago

This code was written long ago when it was not possible to
reshape a degraded array.  Now it is so the current level of
degraded-ness needs to be taken in to account.  Also newly addded
devices should only reduce degradedness if they are deemed to be
in-sync.

In particular, if you convert a RAID5 to a RAID6, and increase the
number of devices at the same time, then the 5->6 conversion will
make the array degraded so the current code will produce a wrong
value for 'degraded' - "-1" to be precise.

If the reshape runs to completion end_reshape will calculate a correct
new value for 'degraded', but if a device fails during the reshape an
incorrect decision might be made based on the incorrect value of
"degraded".

This patch is suitable for 2.6.32-stable and if they are still open,
2.6.31-stable and 2.6.30-stable as well.

Cc: stable@kernel.org
Reported-by: Michael Evans <mjevans1983@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

9eb07c25

14 Dec, 2009 4 commits

md: add MODULE_DESCRIPTION for all md related modules. · 0efb9e61
NeilBrown authored 15 years ago
```
Suggested by  Oren Held <orenhe@il.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
```
0efb9e61

md/raid5: don't complete make_request on barrier until writes are scheduled · 729a1866

NeilBrown authored 15 years ago

The post-barrier-flush is sent by md as soon as make_request on the
barrier write completes.  For raid5, the data might not be in the
per-device queues yet.  So for barrier requests, wait for any
pre-reading to be done so that the request will be in the per-device
queues.

We use the 'preread_active' count to check that nothing is still in
the preread phase, and delay the decrement of this count until after
write requests have been submitted to the underlying devices.
Signed-off-by: NeilBrown <neilb@suse.de>

729a1866

md: support barrier requests on all personalities. · a2826aa9

NeilBrown authored 15 years ago

Previously barriers were only supported on RAID1.  This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device.  When that completes - and if the original request was not
empty -  we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail.  If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail.  That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet.  So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted.  Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>

a2826aa9

md/raid5: remove some sparse warnings. · 8553fe7e

NeilBrown authored 15 years ago

qd_idx is previously declared and given exactly the same value!
Signed-off-by: NeilBrown <neilb@suse.de>

8553fe7e

13 Nov, 2009 2 commits

md/raid5: Allow dirty-degraded arrays to be assembled when only party is degraded. · c148ffdc

NeilBrown authored 15 years ago

Normally is it not safe to allow a raid5 that is both dirty and
degraded to be assembled without explicit request from that admin, as
it can cause hidden data corruption.
This is because 'dirty' means that the parity cannot be trusted, and
'degraded' means that the parity needs to be used.

However, if the device that is missing contains only parity, then
there is no issue and assembly can continue.
This particularly applies when a RAID5 is being converted to a RAID6
and there is an unclean shutdown while the conversion is happening.

So check for whether the degraded space only contains parity, and
in that case, allow the assembly.
Signed-off-by: NeilBrown <neilb@suse.de>

c148ffdc

Don't unconditionally set in_sync on newly added device in raid5_reshape · 7ef90146

NeilBrown authored 15 years ago

When a reshape finds that it can add spare devices into the array,
those devices might already be 'in_sync' if they are beyond the old
size of the array, or they might not if they are within the array.

The first case happens when we change an N-drive RAID5 to an
N+1-drive RAID5.
The second happens when we convert an N-drive RAID5 to an
N+1-drive RAID6.

So set the flag more carefully.
Also, ->recovery_offset is only meaningful when the flag is clear,
so only set it in that case.

This change needs the preceding two to ensure that the non-in_sync
device doesn't get evicted from the array when it is stopped, in the
case where v0.90 metadata is used.
Signed-off-by: NeilBrown <neilb@suse.de>

7ef90146

06 Nov, 2009 1 commit

md/raid5: make sure curr_sync_completes is uptodate when reshape starts · 8dee7211

NeilBrown authored 15 years ago

This value is visible through sysfs and is used by mdadm
when it manages a reshape (backing up data that is about to be
rearranged).  So it is important that it is always correct.
Current it does not get updated properly when a reshape
starts which can cause problems when assembling an array
that is in the middle of being reshaped.

This is suitable for 2.6.31.y stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

8dee7211

20 Oct, 2009 1 commit
- md/raid6: kill a gcc-4.0.1 'uninitialized variable' warning · 6629542e
  Dan Williams authored 15 years ago
```
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
```
  6629542e
16 Oct, 2009 6 commits

md/async: don't pass a memory pointer as a page pointer. · 5dd33c9a

NeilBrown authored 15 years ago

md/raid6 passes a list of 'struct page *' to the async_tx routines,
which then either DMA map them for offload, or take the page_address
for CPU based calculations.

For RAID6 we sometime leave 'blanks' in the list of pages.
For CPU based calcs, we want to treat theses as a page of zeros.
For offloaded calculations, we simply don't pass a page to the
hardware.

Currently the 'blanks' are encoded as a pointer to
raid6_empty_zero_page.  This is a 4096 byte memory region, not a
'struct page'.  This is mostly handled correctly but is rather ugly.

So change the code to pass and expect a NULL pointer for the blanks.
When taking page_address of a page, we need to check for a NULL and
in that case use raid6_empty_zero_page.
Signed-off-by: NeilBrown <neilb@suse.de>

5dd33c9a

md: Fix handling of raid5 array which is being reshaped to fewer devices. · 5e5e3e78

NeilBrown authored 15 years ago

When a raid5 (or raid6) array is being reshaped to have fewer devices,
conf->raid_disks is the latter and hence smaller number of devices.
However sometimes we want to use a number which is the total number of
currently required devices - the larger of the 'old' and 'new' sizes.
Before we implemented reducing the number of devices, this was always
'new' i.e. ->raid_disks.
Now we need max(raid_disks, previous_raid_disks) in those places.

This particularly affects assembling an array that was shutdown while
in the middle of a reshape to fewer devices.

md.c needs a similar fix when interpreting the md metadata.
Signed-off-by: NeilBrown <neilb@suse.de>

5e5e3e78

md: fix problems with RAID6 calculations for DDF. · e4424fee
NeilBrown authored 15 years ago
```
Signed-off-by: NeilBrown <neilb@suse.de>
```
e4424fee

md/raid456: downlevel multicore operations to raid_run_ops · 417b8d4a

Dan Williams authored 15 years ago

The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsytem that initially showed some modest gains
(+4%).  However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe().  As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).

By downleveling the parallelism to raid_run_ops the pathological
stripe_head bouncing is eliminated.  This version still exhibits an
average 11% throughput loss for:

	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
	echo 1024 > /sys/block/md0/md/stripe_cache_size
	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

...but the results are at least stable and can be used as a base for
further multicore experimentation.
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

417b8d4a

md/raid5: initialize conf->device_lock earlier · f5efd45a

Dan Williams authored 15 years ago

Deallocating a raid5_conf_t structure requires taking 'device_lock'.
Ensure it is initialized before it is used, i.e. initialize the lock
before attempting any further initializations that might fail.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

f5efd45a

Revert "md: do not progress the resync process if the stripe was blocked" · 1442577b

NeilBrown authored 15 years ago

This reverts commit df10cfbc.

This patch was based on a misunderstanding and risks introducing a busy-wait loop.
So revert it.
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

1442577b

23 Sep, 2009 3 commits

md: report device as congested when suspended · 3fa841d7

NeilBrown authored 15 years ago

This should writeback from coming when the device is temporarily
suspended.
Signed-off-by: NeilBrown <neilb@suse.de>

3fa841d7

md: Improve name of threads created by md_register_thread · 0da3c619

NeilBrown authored 15 years ago

The management thread for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusion.

So change md_register_thread to use the name from the personality
unless no alternate name (like 'resync' or 'reshape') is given.

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

0da3c619

md: remove sparse waring "symbol xxx shadows an earlier one" · a9f326eb

NeilBrown authored 15 years ago

Rename some variable and remove some duplicate definitions
to avoid there warnings.  None of them are actual errors.
Signed-off-by: NeilBrown <neilb@suse.de>

a9f326eb

16 Sep, 2009 2 commits

md/raid6: cleanup ops_run_compute6_2 · 6c910a78

Dan Williams authored 15 years ago

Neil says:
	"It is correct as it stands, but the fact that every branch in
	 the 'if' part ends with a 'return' isn't immediately obvious,
	 so it is clearer if we are explicit about the if / then / else
	 structure."
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

6c910a78

md/raid6: eliminate BUG_ON with side effect · 2d6e4ecc

Dan Williams authored 15 years ago

As pointed out by Neil it should be possible to build a driver with all
BUG_ON statements deleted.  It's bad form to have a BUG_ON with a side
effect.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

2d6e4ecc

11 Sep, 2009 1 commit

bio: first step in sanitizing the bio->bi_rw flag testing · 1f98a13f

Jens Axboe authored 15 years ago

Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

1f98a13f

09 Sep, 2009 1 commit

dmaengine: add fence support · 0403e382

Dan Williams authored 15 years ago

Some engines optimize operation by reading ahead in the descriptor chain
such that descriptor2 may start execution before descriptor1 completes.
If descriptor2 depends on the result from descriptor1 then a fence is
required (on descriptor2) to disable this optimization. The async_tx
api could implicitly identify dependencies via the 'depend_tx'
parameter, but that would constrain cases where the dependency chain
only specifies a completion order rather than a data dependency. So,
provide an ASYNC_TX_FENCE to explicitly identify data dependencies.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

0403e382

30 Aug, 2009 12 commits

md/raid456: distribute raid processing over multiple cores · 07a3b417

Dan Williams authored 15 years ago

Now that the resources to handle stripe_head operations are allocated
percpu it is possible for raid5d to distribute stripe handling over
multiple cores.  This conversion also adds a call to cond_resched() in
the non-multicore case to prevent one core from getting monopolized for
raid operations.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

07a3b417

md/raid6: remove synchronous infrastructure · b774ef49

Yuri Tikhonov authored 15 years ago

These routines have been replaced by there asynchronous counterparts.
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

b774ef49

md/raid6: asynchronous handle_stripe6 · 6c0069c0

Yuri Tikhonov authored 15 years ago

1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to
   raid_run_ops
2/ Implement a handler for sh->reconstruct_state similar to the raid5 case
   (adds handling of Q parity)
3/ Prevent handle_parity_checks6 from running concurrently with 'compute'
   operations
4/ Hook up raid_run_ops
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

6c0069c0

md/raid6: asynchronous handle_parity_check6 · d82dfee0

Dan Williams authored 15 years ago

[ Based on an original patch by Yuri Tikhonov ]

Implement the state machine for handling the RAID-6 parities check and
repair functionality.  Note that the raid6 case does not need to check
for new failures, like raid5, as it will always writeback the correct
disks.  The raid5 case can be updated to check zero_sum_result to avoid
getting confused by new failures rather than retrying the entire check
operation.
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

d82dfee0

md/raid6: asynchronous handle_stripe_dirtying6 · a9b39a74

Yuri Tikhonov authored 15 years ago

In the synchronous implementation of stripe dirtying we processed a
degraded stripe with one call to handle_stripe_dirtying6().  I.e.
compute the missing blocks from the other drives, then copy in the new
data and reconstruct the parities.

In the asynchronous case we do not perform stripe operations directly.
Instead, operations are scheduled with flags to be later serviced by
raid_run_ops.  So, for the degraded case the final reconstruction step
can only be carried out after all blocks have been brought up to date by
being read, or computed.  Like the raid5 case schedule_reconstruction()
sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and
through operation chaining can handle compute and reconstruct in a
single raid_run_ops pass.

[dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating]
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a9b39a74

md/raid6: asynchronous handle_stripe_fill6 · 5599becc

Yuri Tikhonov authored 15 years ago

Modify handle_stripe_fill6 to work asynchronously by introducing
fetch_block6 as the raid6 analog of fetch_block5 (schedule compute
operations for missing/out-of-sync disks).

[dan.j.williams@intel.com: compute D+Q in one pass]
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

5599becc

md/raid5,6: common schedule_reconstruction for raid5/6 · c0f7bddb

Yuri Tikhonov authored 15 years ago

Extend schedule_reconstruction5 for reuse by the raid6 path.  Add
support for generating Q and BUG() if a request is made to perform
'prexor'.
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

c0f7bddb

md/raid6: asynchronous raid6 operations · ac6b53b6

Dan Williams authored 15 years ago

[ Based on an original patch by Yuri Tikhonov ]

The raid_run_ops routine uses the asynchronous offload api and
the stripe_operations member of a stripe_head to carry out xor+pq+copy
operations asynchronously, outside the lock.

The operations performed by RAID-6 are the same as in the RAID-5 case
except for no support of STRIPE_OP_PREXOR operations. All the others
are supported:
STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate missing blocks (1 or 2) in the cache from the other blocks
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_RECONSTRUCT
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct

The flow is the same as in the RAID-5 case, and reuses some routines, namely:
1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
2/ ops_complete_compute (updated to set up to 2 targets uptodate)
3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

[neilb@suse.de: fixes to get it to pass mdadm regression suite]
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

ac6b53b6

md/raid5: factor out mark_uptodate from ops_complete_compute5 · 4e7d2c0a

Dan Williams authored 15 years ago

ops_complete_compute5 can be reused in the raid6 path if it is updated to
generically handle a second target.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

4e7d2c0a

async_tx: add sum check flags · ad283ea4

Dan Williams authored 15 years ago

Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed
solomon syndrome) zero-sum result.  Use the SUM_CHECK_ namespace instead
of DMA_ since these flags will be used on non-dma-zero-sum enabled
platforms.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

ad283ea4

md/raid5,6: add percpu scribble region for buffer lists · d6f38f31

Dan Williams authored 15 years ago

Use percpu memory rather than stack for storing the buffer lists used in
parity calculations.  Include space for dma address conversions and pass
that to async_tx via the async_submit_ctl.scribble pointer.

[ Impact: move memory pressure from stack to heap ]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

d6f38f31

md/raid6: move the spare page to a percpu allocation · 36d1c647

Dan Williams authored 15 years ago

In preparation for asynchronous handling of raid6 operations move the
spare page to a percpu allocation to allow multiple simultaneous
synchronous raid6 recovery operations.

Make this allocation cpu hotplug aware to maximize allocation
efficiency.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

36d1c647

13 Aug, 2009 1 commit

md/raid5: Properly remove excess drives after shrinking a raid5/6 · 1a67dde0

NeilBrown authored 15 years ago

We were removing the drives, from the array, but not
removing symlinks from /sys/.... and not marking the device
as having been removed.
Signed-off-by: NeilBrown <neilb@suse.de>

1a67dde0