Commits · 18b0033491f584a2d79697da714b1ef9d6b27d22 · Kirill Smelkov / linux

31 Mar, 2009 36 commits

md: raid5 run(): Fix max_degraded for raid level 4. · 18b00334

Andre Noll authored Mar 31, 2009

raid4 allows only one failed disk.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

18b00334

md: 'array_size' sysfs attribute · b522adcd

Dan Williams authored Mar 31, 2009

Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0)
   a) If size is set before the array is running, do_md_run will fail
      if size is greater than the default size
   b) A reshape attempt that reduces the default size to less than the set
      array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
   kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size.  Resync/reshape operations
always follow the default size.

Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

b522adcd
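
The two overflow checks mentioned in the 'array_size' commit above can be sketched in user-space C as follows. This is only an illustration, not the kernel's strict_blocks_to_sectors(): the function name, the return convention, and the 32-bit `sector32_t` type (chosen so the narrowing check is visible) are assumptions.

```c
#include <limits.h>
#include <stdint.h>

/* Assumption: model a build where the sector type is only 32 bits wide. */
typedef uint32_t sector32_t;

/*
 * Convert a count of 1K blocks to 512-byte sectors.
 * Returns 0 on success, -1 if either overflow check fails.
 */
static int blocks_to_sectors_checked(unsigned long long blocks,
				     sector32_t *sectors)
{
	unsigned long long doubled;

	/* blocks -> sectors overflow: doubling must not wrap */
	if (blocks > ULLONG_MAX / 2)
		return -1;
	doubled = blocks * 2;

	/* unsigned long long -> sector type overflow: value must fit */
	if (doubled > UINT32_MAX)
		return -1;

	*sectors = (sector32_t)doubled;
	return 0;
}
```

A caller would reject the user's input whenever the function returns -1, rather than silently truncating the size.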

md: centralize ->array_sectors modifications · 1f403624

Dan Williams authored Mar 31, 2009

Get personalities out of the business of directly modifying
->array_sectors.  Lays groundwork to introduce policy on when
->array_sectors can be modified.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

1f403624

md: add 'size' as a personality method · 80c3a6ce

Dan Williams authored Mar 17, 2009

In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested.  For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

80c3a6ce

md: fix typo in FSF address · 93ed05e2

Atsushi SAKAI authored Mar 31, 2009

Hello,

 I found a typo Bosto"m" in FSF address.
And I am checking around linux source code.
Here is the only place which uses Bosto"m" (not Boston).
Signed-off-by: Atsushi SAKAI <sakaia@jp.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>

93ed05e2

md: add takeover support for converting raid6 back into raid5 · fc9739c6

NeilBrown authored Mar 31, 2009

If a raid6 is still in the layout that comes from converting raid5
into a raid6, this will allow us to convert it back again.
Signed-off-by: NeilBrown <neilb@suse.de>

fc9739c6

md: add takeover support for raid4 -> raid5 conversion. · e9d4758f

NeilBrown authored Mar 31, 2009

Signed-off-by: NeilBrown <neilb@suse.de>

e9d4758f

md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. · b3546035

NeilBrown authored Mar 31, 2009

2-drive raid5's aren't very interesting.  But if you are converting
a raid1 into a raid5, you will at least temporarily have one.  And
that is a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.

layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.
Signed-off-by: NeilBrown <neilb@suse.de>

b3546035

md: add ->takeover method for raid5 to be able to take over raid1 · d562b0c4

NeilBrown authored Mar 31, 2009

The RAID1 must have two drives and be a suitable size to
be a multiple of a chunksize that isn't too small.
Signed-off-by: NeilBrown <neilb@suse.de>

d562b0c4
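
The size constraint in the raid1 takeover commit above ("a multiple of a chunksize that isn't too small") might be checked along these lines. This is a sketch under assumptions: the starting chunk of 64 KiB, the 4 KiB minimum, and the function name are illustrative and are not taken from the commit message.

```c
#include <stdint.h>

/* Assumptions: start at 64 KiB (128 sectors), refuse chunks below 4 KiB. */
#define START_CHUNK_SECTORS 128u
#define MIN_CHUNK_SECTORS   8u

/* Returns the chosen chunk size in sectors, or 0 if the array is unsuitable. */
static unsigned pick_chunk_sectors(uint64_t array_sectors)
{
	unsigned chunk = START_CHUNK_SECTORS;

	/* Halve the (power-of-two) chunk until the size is a whole multiple. */
	while (chunk > 1 && (array_sectors & (chunk - 1)))
		chunk >>= 1;

	return chunk >= MIN_CHUNK_SECTORS ? chunk : 0;
}
```

An array of 1000 sectors would end up with an 8-sector chunk here, while an odd-sized array would be refused.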

md: add ->takeover method to support changing the personality managing an array · 245f46c2

NeilBrown authored Mar 31, 2009

Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.
Signed-off-by: NeilBrown <neilb@suse.de>

245f46c2

md: enable suspend/resume of md devices. · 409c57f3

NeilBrown authored Mar 31, 2009

To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are
already 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).
Signed-off-by: NeilBrown <neilb@suse.de>

409c57f3
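
The gate described in the suspend/resume commit above can be modelled roughly as below. This is a single-threaded sketch with hypothetical names; real code would need locking and a wait queue, and none of these identifiers come from the md source.

```c
#include <stdbool.h>

/* Hypothetical, single-threaded model of the common entry point. */
struct gate {
	bool suspended;   /* new requests are refused while this is set */
	int  in_transit;  /* entered the gate, not yet counted internally */
};

/* Called for every incoming request; false means the caller must wait. */
static bool gate_enter(struct gate *g)
{
	if (g->suspended)
		return false;
	g->in_transit++;
	return true;
}

/* Called once the personality holds its own reference on the request. */
static void gate_handed_over(struct gate *g)
{
	g->in_transit--;
}

/* Refuse new arrivals; returns true once nothing is left in transit. */
static bool gate_suspend(struct gate *g)
{
	g->suspended = true;
	return g->in_transit == 0;
}

/* Let requests through again. */
static void gate_resume(struct gate *g)
{
	g->suspended = false;
}
```

Once gate_suspend() reports true, every request is visible to the personality's own counters, which is the point at which a quiesce-style wait becomes sufficient.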

md: md_unregister_thread should cope with being passed NULL · e0cf8f04

NeilBrown authored Mar 31, 2009

Mostly md_unregister_thread is only called when we know that the
thread is not NULL, but sometimes we need to check first.  It is safer
to put the check inside md_unregister_thread itself.
Signed-off-by: NeilBrown <neilb@suse.de>

e0cf8f04
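
Moving the NULL check into the teardown function, as the commit above does, follows the same pattern as free(NULL). A minimal sketch with a hypothetical `struct worker` standing in for the md thread handle:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a worker-thread handle. */
struct worker {
	int id;
};

/* Tear down a worker; NULL is accepted. Returns 1 if freed, 0 otherwise. */
static int worker_unregister(struct worker *w)
{
	if (!w)
		return 0;	/* nothing to do, callers need not check first */
	free(w);
	return 1;
}
```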

md/raid5: refactor raid5 "run" · 91adb564

NeilBrown authored Mar 31, 2009

.. so that the code to create the private data structures is separate.
This will help with future code to change the level of an active
array.
Signed-off-by: NeilBrown <neilb@suse.de>

91adb564

md: make sure new_level, new_chunksize, new_layout always have sensible values. · 34817e8c

NeilBrown authored Mar 31, 2009

When an md array is undergoing a change, we have new_* fields that
show the new values.
When no change is happening, it is least confusing if these have
the same value as the normal fields.
This is true in most cases, but not when the values are set via sysfs.

So fix this up.

A subsequent patch will BUG_ON if these things aren't consistent.
Signed-off-by: NeilBrown <neilb@suse.de>

34817e8c

md/raid5: finish support for DDF/raid6 · 67cc2b81

NeilBrown authored Mar 31, 2009

DDF requires RAID6 calculations over different devices in a different
order.
For md/raid6, we calculate over just the data devices, starting
immediately after the 'Q' block.
For ddf/raid6 we calculate over all devices, using zeros in place of
the P and Q blocks.

This requires unfortunately complex loops...
Signed-off-by: NeilBrown <neilb@suse.de>

67cc2b81

md/raid5: Add support for new layouts for raid5 and raid6. · 99c0fb5f

NeilBrown authored Mar 31, 2009

DDF uses different layouts for P and Q blocks than current md/raid6
so add those that are missing.
Also add support for RAID6 layouts that are identical to various
raid5 layouts with the simple addition of one device to hold all of
the 'Q' blocks.
Finally add 'raid5' layouts to match raid4.
These last two will allow online level conversion.

Note that this does not provide correct support for DDF/raid6 yet
as the order in which data blocks are summed to produce the Q block
is significant and different between current md code and DDF
requirements.
Signed-off-by: NeilBrown <neilb@suse.de>

99c0fb5f
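
The note above that the summation order matters for Q can be seen in a one-byte-per-device sketch. It assumes the usual RAID-6 arithmetic over GF(2^8) with polynomial 0x11d and generator 2; the function names are illustrative and the real code works on whole blocks.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by the generator 2 in GF(2^8), reducing by 0x11d. */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/* P is a plain XOR, so the order of the data bytes does not matter. */
static uint8_t syndrome_p(const uint8_t *d, size_t n)
{
	uint8_t p = 0;

	for (size_t i = 0; i < n; i++)
		p ^= d[i];
	return p;
}

/* Q = sum of 2^i * d[i], evaluated by Horner's rule from the top index. */
static uint8_t syndrome_q(const uint8_t *d, size_t n)
{
	uint8_t q = 0;

	for (size_t i = n; i-- > 0;)
		q = (uint8_t)(gf_mul2(q) ^ d[i]);
	return q;
}
```

Feeding the same bytes in a different order leaves P unchanged but gives a different Q, because each position carries its own coefficient.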

md/raid5: simplify raid5_compute_sector interface · 911d4ee8

NeilBrown authored Mar 31, 2009

Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
a 'struct stripe_head *' and fill in the relevant fields.  This is
more extensible.
Signed-off-by: NeilBrown <neilb@suse.de>

911d4ee8

md/raid6: remove expectation that Q device is immediately after P device. · d0dabf7e

NeilBrown authored Mar 31, 2009


Code currently assumes that the devices in a raid6 stripe are
  0 1 ... N-1 P Q
in some rotated order.  We will shortly add new layouts in which
this strict pattern is broken.
So remove this expectation.  We still assume that the data disks
are roughly in-order.  However P and Q can be inserted anywhere within
that order.
Signed-off-by: NeilBrown <neilb@suse.de>

d0dabf7e

md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument · 112bf897

NeilBrown authored Mar 31, 2009

This is similar to the recent change to get_active_stripe.
There is no functional change, just some rearrangement to make
future patches cleaner.
Signed-off-by: NeilBrown <neilb@suse.de>

112bf897

md/raid5: simplify interface for init_stripe and get_active_stripe · b5663ba4

NeilBrown authored Mar 31, 2009

Rather than passing 'pd_idx' and 'disks' to these functions, just pass
'previous' which tells whether to use the 'previous' or 'current'
geometry during a reshape, and let init_stripe calculate
disks and pd_idx and anything else it might need.

This is not a substantial simplification and even adds a division.
However we will shortly be adding more complexity to init_stripe
to handle more interesting 'reshape' activities, and without this
change, the interface to these functions would get very complex.
Signed-off-by: NeilBrown <neilb@suse.de>

b5663ba4

md: Represent raid device size in sectors. · dd8ac336

Andre Noll authored Mar 31, 2009

This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.

All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

dd8ac336

md: Make mddev->size sector-based. · 58c0fed4

Andre Noll authored Mar 31, 2009

This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows us to get
rid of a couple of divisions/multiplications by two.

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

58c0fed4

md: be more consistent about setting WriteMostly flag when adding a drive to an array · 575a80fa

NeilBrown authored Mar 31, 2009

When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from:  the metadata on the disk or the
flags passed through the IOCTL.

For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on if it is set (i.e. we
effectively 'or' the two sources together).

This makes it awkward to clear, and is at best inconsistent.

As documented code (in mdadm) requires that setting
MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
inconsistency by always using the value for this flag from the ioctl,
and ignoring the value on disk.
Signed-off-by: NeilBrown <neilb@suse.de>

575a80fa
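
The difference between the old and new WriteMostly policy in the commit above comes down to one expression. In this sketch the flag bit and function names are hypothetical and do not reflect the real MD_DISK_WRITEMOSTLY encoding.

```c
#include <stdbool.h>

/* Hypothetical bit position, chosen only for this illustration. */
#define FLAG_WRITEMOSTLY (1u << 0)

/* Old behaviour: either source can set the flag, so it is hard to clear. */
static bool write_mostly_old(unsigned on_disk, unsigned from_ioctl)
{
	return ((on_disk | from_ioctl) & FLAG_WRITEMOSTLY) != 0;
}

/* New behaviour: only the ioctl value counts; the on-disk value is ignored. */
static bool write_mostly_new(unsigned on_disk, unsigned from_ioctl)
{
	(void)on_disk;
	return (from_ioctl & FLAG_WRITEMOSTLY) != 0;
}
```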

md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash · 97e4f42d

NeilBrown authored Mar 31, 2009

Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much of the 'sync' effort after an
unclean shutdown.

One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed.  And we need to know what has completed to record
how much is recovered.  So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.

When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date.  So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number.  But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.
Signed-off-by: NeilBrown <neilb@suse.de>

97e4f42d
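
The "pause 16 times" rule from the checkpointing commit above can be simulated as below. This is a sketch, not the kernel's resync loop: the function, its parameters, and the way the next checkpoint is scheduled are assumptions.

```c
#include <stdint.h>

#define CHECKPOINTS 16u

/*
 * Simulate a recovery over `total` sectors that advances `step` sectors
 * per iteration, and return how many checkpoints were taken.
 */
static unsigned count_checkpoints(uint64_t total, uint64_t step)
{
	uint64_t interval = total / CHECKPOINTS;
	uint64_t done = 0, next;
	unsigned taken = 0;

	if (step == 0)
		return 0;
	if (interval == 0)
		interval = 1;
	next = interval;

	while (done < total) {
		done += step;
		if (done > total)
			done = total;
		if (done >= next) {
			/* Real code would wait for in-flight I/O, then record `done`. */
			taken++;
			next = done + interval;
		}
	}
	return taken;
}
```

With steps much smaller than one sixteenth of the device this yields 16 checkpoints; with very large steps it yields fewer.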

md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8

NeilBrown authored Mar 31, 2009

It really is nicer to keep related code together..
Signed-off-by: NeilBrown <neilb@suse.de>

43b2e5d8

md: move lots of #include lines out of .h files and into .c · bff61975

NeilBrown authored Mar 31, 2009

This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>

bff61975

md: move most content from md.h to md_k.h · 92022950

NeilBrown authored Mar 31, 2009

The extern function definitions are kernel-internal definitions, so
they belong in md_k.h

The MD_*_VERSION values could reasonably go in a number of places,
but md_u.h seems most reasonable.

This leaves almost nothing in md.h.  It will go soon.
Signed-off-by: NeilBrown <neilb@suse.de>

92022950

md: move LEVEL_* definition from md_k.h to md_u.h · 8b2b5c21

NeilBrown authored Mar 31, 2009

.. as they are part of the user-space interface.
Also move MdpMinorShift into there so we can remove duplication.

Lastly move mdp_major in.  It is less obviously part of the user-space
interface, but do_mounts_md.c uses it, and it is acting a bit like
user-space.
Signed-off-by: NeilBrown <neilb@suse.de>

8b2b5c21

md: move headers out of include/linux/raid/ · ef740c37

Christoph Hellwig authored Mar 31, 2009

Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away.  md.h is left where it is for now as there
are some uses from the outside.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>

ef740c37

cleanup drivers/md/Makefile · 2a40a8ae

Christoph Hellwig authored Mar 31, 2009

Use the -y variables instead of the old -objs so we can easily add
conditional objects to the modules.  Also always use += to add
subobjects to avoid problems when placing additional objects in
some place in the file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>

2a40a8ae

md: stop defining MAJOR_NR · 3dbd8c2e

Christoph Hellwig authored Mar 31, 2009

MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
kernels, so no need to keep it around.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>

3dbd8c2e

MD data integrity support · 3f9d99c1

Martin K. Petersen authored Mar 31, 2009

md: Add support for data integrity to MD

If all subdevices support the same protection format the MD device is
flagged as integrity capable.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>

3f9d99c1

md: write bitmap information to devices that are undergoing recovery. · 355a43e6

NeilBrown authored Mar 31, 2009

When we add some spares to an array and start recovery, and we have
a bitmap which is stored 'internally' on all devices, we call
bitmap_write_all to make sure the bitmap is correct on the new
device(s).
However that doesn't work as write_sb_page only writes to
'In_sync' devices, and devices undergoing recovery are not
'In_sync' until recovery finishes.

So extend write_sb_page (actually next_active_rdev) to include devices
that are under recovery.
Signed-off-by: NeilBrown <neilb@suse.de>

355a43e6

md: never clear bit from the write-intent bitmap when the array is degraded. · d0a4bb49

NeilBrown authored Mar 31, 2009


It is safe to clear a bit from the write-intent bitmap for a raid1
if we know the data has been written to all devices, which is
what the current test does.

But it is not always safe to update the 'events_cleared' counter in
that case.  This is because one request could complete successfully
after some other request has partially failed.

So simply disable the clearing and updating of events_cleared whenever
the array is degraded.  This might end up not clearing some bits that
could safely be cleared, but it is the safest approach.

Note that the bug fixed here did not risk corrupting data by letting
the array get out-of-sync.  Rather it meant that when a device is
removed and re-added to the array, it might incorrectly require a full
recovery rather than just recovering based on the bitmap.
Signed-off-by: NeilBrown <neilb@suse.de>

d0a4bb49

md: Allow write-intent bitmaps to have chunksize < PAGE_SIZE · 1187cf0a

NeilBrown authored Mar 31, 2009

md currently insists that the chunk size used for write-intent
bitmaps (the amount of data that corresponds to one chunk)
be at least one page.

The reason for this restriction is lost in the mists of time,
but a review of the code (and a vague memory) suggests that the only
problem would be related to resync.  Resync tries very hard to
work in multiples of a page, but also needs to sync with units
of a bitmap_chunk too.

This connection comes out in the bitmap_start_sync call.

So change bitmap_start_sync to always work in multiples of a page.
If the bitmap chunk size is less than one page, we flag multiple
chunks as 'syncing' and generally make them all appear to the
resync routines like one chunk.

All other code either already works with data ranges that could
span multiple chunks, or explicitly only cares about a single chunk.
Signed-off-by: Neil Brown <neilb@suse.de>

1187cf0a
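
The idea in the commit above of treating several small bitmap chunks as one page-sized unit can be sketched like this. It assumes 4 KiB pages and 512-byte sectors; the function name and signature are hypothetical and are not bitmap_start_sync's.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, 512-byte sectors. */
#define PAGE_SECTORS 8u

/*
 * Starting at `offset`, gather whole bitmap chunks until at least one page
 * is covered. Returns the sectors covered and stores the number of chunks
 * that would be flagged as 'syncing'. Returns 0 for a zero chunk size.
 */
static unsigned sync_span_sectors(uint64_t offset, unsigned chunk_sectors,
				  unsigned *chunks_flagged)
{
	unsigned covered = 0, n = 0;

	if (chunk_sectors == 0) {
		*chunks_flagged = 0;
		return 0;
	}
	while (covered < PAGE_SECTORS) {
		/* Distance from the current position to the end of its chunk. */
		unsigned to_end = chunk_sectors -
			(unsigned)((offset + covered) % chunk_sectors);

		covered += to_end;
		n++;
	}
	*chunks_flagged = n;
	return covered;
}
```

With 2-sector chunks, four chunks are grouped into one page-sized step; with chunks of a page or more, a single chunk is enough.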

md: Fix is_mddev_idle test (again). · eea1bf38

NeilBrown authored Mar 31, 2009

There are two problems with is_mddev_idle.

1/ sync_io is 'atomic_t' and hence 'int'.  curr_events and all the
   rest are 'long'.
   So if sync_io were to wrap on a 64bit host, the value of
   curr_events would go very negative suddenly, and take a very
   long time to return to positive.

   So do all calculations as 'int'.  That gives us plenty of precision
   for what we need.

2/ To initialise rdev->last_events we simply call is_mddev_idle, on
   the assumption that it will make sure that last_events is in a
   suitable range.  It used to do this, but now it does not.
   So now we need to be more explicit about initialisation.
Signed-off-by: NeilBrown <neilb@suse.de>

eea1bf38
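
Point 1/ of the is_mddev_idle commit above, a 32-bit counter mixed into wider arithmetic, can be illustrated with two small helpers. These are hypothetical and use fixed-width unsigned types to keep the wraparound well defined; the kernel code itself works with 'int' and 'long'.

```c
#include <stdint.h>

/* Difference taken in 32 bits: a wrapped counter still gives a small delta. */
static int32_t events_delta32(uint32_t curr, uint32_t last)
{
	return (int32_t)(curr - last);
}

/* Widened before subtracting: the wrap shows up as a huge negative jump. */
static int64_t events_delta64(uint32_t curr, uint32_t last)
{
	return (int64_t)curr - (int64_t)last;
}
```

If the counter moves from 0xFFFFFFF0 to 0x10, the 32-bit difference is 32, while the widened difference is a large negative number that would take a long time to become positive again.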

09 Mar, 2009 4 commits

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq · 99adcd9d

Linus Torvalds authored Mar 09, 2009

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
  [CPUFREQ] Add p4-clockmod sysfs-ui removal to feature-removal schedule.
  Revert "[CPUFREQ] Disable sysfs ui for p4-clockmod."

99adcd9d

copy_process: fix CLONE_PARENT && parent_exec_id interaction · 2d5516cb

Oleg Nesterov authored Mar 02, 2009

CLONE_PARENT can fool the ->self_exec_id/parent_exec_id logic. If we
re-use the old parent, we must also re-use ->parent_exec_id to make
sure exit_notify() sees the right ->xxx_exec_id's when the CLONE_PARENT'ed
task exits.

Also, move down the "p->parent_exec_id = p->self_exec_id" thing, to place
two different cases together.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2d5516cb
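
The rule in the copy_process commit above (re-use the old parent, therefore also re-use its exec id) can be modelled in user space as follows. The struct, the flag value, and the function name are hypothetical simplifications and are not the kernel's task_struct or CLONE_PARENT definition.

```c
#include <stdint.h>

#define SK_CLONE_PARENT 0x1u	/* hypothetical flag value */

/* Hypothetical, heavily reduced task descriptor. */
struct task {
	struct task *parent;
	uint32_t self_exec_id;
	uint32_t parent_exec_id;
};

/* Pick the child's parent and the exec id that belongs with that parent. */
static void set_parent(struct task *child, struct task *current,
		       unsigned flags)
{
	if (flags & SK_CLONE_PARENT) {
		/* Sibling of the caller: inherit both parent and its exec id. */
		child->parent = current->parent;
		child->parent_exec_id = current->parent_exec_id;
	} else {
		/* Ordinary child: the caller is the parent. */
		child->parent = current;
		child->parent_exec_id = current->self_exec_id;
	}
}
```

Keeping both assignments in the same branch is what places the two cases together, as the commit message puts it.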

[CPUFREQ] Add p4-clockmod sysfs-ui removal to feature-removal schedule. · 753b7aea

Dave Jones authored Mar 09, 2009

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>

753b7aea

Revert "[CPUFREQ] Disable sysfs ui for p4-clockmod." · 129f8ae9

Dave Jones authored Mar 09, 2009

This reverts commit e088e4c9.

Removing the sysfs interface for p4-clockmod was flagged as a
regression in bug 12826.

Course of action:
 - Find out the remaining causes of overheating, and fix them
   if possible. ACPI should be doing the right thing automatically.
   If it isn't, we need to fix that.
 - mark p4-clockmod ui as deprecated
 - try again with the removal in six months.

It's not really feasible to printk about the deprecation, because
it needs to happen at all the sysfs entry points, which means adding
a lot of strcmp("p4-clockmod".. calls to the core, which.. bleuch.
Signed-off-by: Dave Jones <davej@redhat.com>

129f8ae9