Commits · 388667bed591b2359713bb17d5de0cf56e961447 · Kirill Smelkov / linux

01 Aug, 2008 1 commit

md: raid10: wake up frozen array · 388667be

Arthur Jones authored Jul 25, 2008

When rescheduling a bio in raid10, we wake up
the md thread, but if the array is frozen, this
will have no effect.  This causes the array to
remain frozen for eternity.  We add a wake_up
to allow the array to de-freeze.  This code is
nearly identical to the raid1 code, which has
this fix already.
Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>

29 Jul, 2008 2 commits

md: do not count blocked devices as spares · e5427135

Dan Williams authored Jul 28, 2008

remove_and_add_spares() assumes that failed devices have been hot-removed
from the array.  Removal is skipped in the 'blocked' case so do not count a
device in this state as 'spare'.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md: do not progress the resync process if the stripe was blocked · df10cfbc

Dan Williams authored Jul 28, 2008

handle_stripe will take no action on a stripe when waiting for userspace
to unblock the array, so do not report completed sectors.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

23 Jul, 2008 3 commits

md: delay notification of 'active_idle' to the recovery thread · d8e64406

Dan Williams authored Jul 23, 2008

sysfs_notify might sleep, so do not call it from md_safemode_timeout.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

md: fix merge error · 23397883

Dan Williams authored Jul 23, 2008

The original STRIPE_OP_IO removal patch had the following hunk:

-               for (i = conf->raid_disks; i--; ) {
+               for (i = conf->raid_disks; i--; )
                        set_bit(R5_Wantwrite, &sh->dev[i].flags);
-                       if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
-                               sh->ops.count++;
-               }

However, it appears the hunk became broken after merging:

-               for (i = conf->raid_disks; i--; ) {
+               for (i = conf->raid_disks; i--; )
                        set_bit(R5_Wantwrite, &sh->dev[i].flags);
                        set_bit(R5_LOCKED, &dev->flags);
                        s.locked++;
-                       if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
-                               sh->ops.count++;
-               }
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
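
The effect of the mis-merged hunk can be reproduced in plain C: once the braces are gone, only the first statement after the `for` belongs to the loop. The following is a minimal user-space sketch; the struct and function names are hypothetical stand-ins and do not come from the raid5 code.

```c
/* Hypothetical counters standing in for the per-device flag updates. */
struct counts {
	int wantwrite;	/* how often the first statement ran */
	int locked;	/* how often the second statement ran */
};

/* Braces present: both statements run once per device. */
static struct counts with_braces(int raid_disks)
{
	struct counts c = { 0, 0 };
	int i;

	for (i = raid_disks; i--; ) {
		c.wantwrite++;
		c.locked++;
	}
	return c;
}

/* Braces dropped: only the first statement is the loop body. */
static struct counts without_braces(int raid_disks)
{
	struct counts c = { 0, 0 };
	int i;

	for (i = raid_disks; i--; )
		c.wantwrite++;
	c.locked++;	/* not part of the loop: runs exactly once */
	return c;
}
```

With four devices, `with_braces(4)` updates both counters four times, while `without_braces(4)` updates the second counter only once.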

md: move async_tx_issue_pending_all outside spin_lock_irq · c9f21aaf

Dan Williams authored Jul 23, 2008

Some dma drivers need to call spin_lock_bh in their device_issue_pending
routines.  This change avoids:

WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85()
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

21 Jul, 2008 7 commits

md: Protect access to mddev->disks list using RCU · 4b80991c

NeilBrown authored Jul 21, 2008

All modifications and most access to the mddev->disks list are made
under the reconfig_mutex lock.  However there are three places where
the list is walked without any locking.  If a reconfig happens at this
time, havoc (and oops) can ensue.

So use RCU to protect these accesses:
  - wrap them in rcu_read_{,un}lock()
  - use list_for_each_entry_rcu
  - add to the list with list_add_rcu
  - delete from the list with list_del_rcu
  - delay the 'free' with call_rcu rather than schedule_work

Note that export_rdev did a list_del_init on this list.  In almost all
cases the entry was not in the list anymore so it was a no-op and so
safe.  It is no longer safe as after list_del_rcu we may not touch
the list_head.
An audit shows that export_rdev is called:
  - after unbind_rdev_from_array, in which case the delete has
     already been done,
  - after bind_rdev_to_array fails, in which case the delete isn't needed.
  - before the device has been put on a list at all (e.g. in
      add_new_disk where reading the superblock fails).
  - and in autorun devices after a failure when the device is on a
      different list.

So remove the list_del_init call from export_rdev, and add it back
immediately before the call to export_rdev for that last case.

Note also that ->same_set is sometimes used for lists other than
mddev->list (e.g. candidates).  In these cases rcu is not needed.
Signed-off-by: NeilBrown <neilb@suse.de>

md: only count actual openers as access which prevent a 'stop' · f2ea68cf

NeilBrown authored Jul 21, 2008

Open isn't the only thing that increments ->active.  e.g. reading
/proc/mdstat will increment it briefly.  So to avoid false positives
in testing for concurrent access, introduce a new counter that counts
just the number of times the md device is open.
Signed-off-by: NeilBrown <neilb@suse.de>

md: linear: Make array_size sector-based and rename it to array_sectors. · d6e22150

Andre Noll authored Jul 21, 2008

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

md: Make mddev->array_size sector-based. · f233ea5c

Andre Noll authored Jul 21, 2008

This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

md: Make super_type->rdev_size_change() take sector-based sizes. · 15f4a5fd

Andre Noll authored Jul 21, 2008

Also, change the type of the size parameter from unsigned long long to
sector_t and rename it to num_sectors.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

md: Fix check for overlapping devices. · d07bd3bc

Andre Noll authored Jul 21, 2008

The checks in overlaps() expect all parameters either in block-based
or sector-based quantities. However, its single caller passes two
rdev->data_offset arguments as well as two rdev->size arguments, the
former being sector counts while the latter are measured in 1K blocks.

This could cause rdev_size_store() to accept an invalid size from user
space. Fix it by passing only sector-based quantities to overlaps().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
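
The unit mismatch is easy to see with a generic interval-overlap check. The function below is a sketch of such a check, not a copy of the kernel's overlaps(); the `sector_t` typedef is an assumption made so the example compiles in user space.

```c
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: a count of 512-byte sectors */

/* Return 1 if [s1, s1 + l1) and [s2, s2 + l2) overlap, 0 otherwise. */
static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
{
	if (s1 + l1 <= s2)
		return 0;
	if (s2 + l2 <= s1)
		return 0;
	return 1;
}
```

Take two devices of 1000 1K blocks (2000 sectors) each, starting at sectors 0 and 1500. Passing the lengths in blocks, `overlaps(0, 1000, 1500, 1000)`, reports no overlap; passing them in sectors, `overlaps(0, 2000, 1500, 2000)`, correctly reports one.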

md: Tidy up rdev_size_store a bit: · d7027458

Neil Brown authored Jul 12, 2008

 - use strict_strtoull in place of simple_strtoull
 - use my_mddev in place of rdev->mddev (they have the same value)
and more significantly,
 - don't adjust mddev->size to fit, rather reject changes which make
   rdev->size smaller than mddev->size

Adjusting mddev->size is a hangover from bind_rdev_to_array which
does a similar thing.  But it really is a better design to insist that
mddev->size is set as required, then the rdev->sizes are set to allow
for that.  The previous way invites confusion.
Signed-off-by: NeilBrown <neilb@suse.de>

11 Jul, 2008 13 commits

md: Remove some unused macros. · 7e93a892

Andre Noll authored Jul 11, 2008

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>

md: Turn rdev->sb_offset into a sector-based quantity. · 0f420358

Andre Noll authored Jul 11, 2008

Rename it to sb_start to make sure all users have been converted.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>

md: Make calc_dev_sboffset() return a sector count. · b73df2d3

Andre Noll authored Jul 11, 2008

As BLOCK_SIZE_BITS is 10 and

	MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x),

the return value of calc_dev_sboffset() doubles. Fix up all three
callers accordingly.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
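
The identity quoted above can be checked in user space. The sketch below assumes that both macros round down to a 64 KiB boundary and then subtract a 64 KiB reserved area; the constants and helper names are assumptions modelled on that reading, not verified copies of the md headers.

```c
#include <stdint.h>

#define MD_RESERVED_BYTES	(64 * 1024)			/* assumed reserved area */
#define MD_RESERVED_SECTORS	(MD_RESERVED_BYTES / 512)	/* 128 sectors */
#define MD_RESERVED_BLOCKS	(MD_RESERVED_BYTES / 1024)	/* 64 1K blocks */

/* Sector-based variant: round x down to 64 KiB, then reserve 64 KiB. */
static uint64_t new_size_sectors(uint64_t x)
{
	return (x & ~(uint64_t)(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS;
}

/* Block-based variant of the same computation, in 1K units. */
static uint64_t new_size_blocks(uint64_t x)
{
	return (x & ~(uint64_t)(MD_RESERVED_BLOCKS - 1)) - MD_RESERVED_BLOCKS;
}
```

For a device of 1000 blocks, `new_size_blocks(1000)` gives 896 and `new_size_sectors(2000)` gives 1792, which is exactly twice as much.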

md: Replace calc_dev_size() by calc_num_sectors(). · e7debaa4

Andre Noll authored Jul 11, 2008

Number of sectors is the preferred unit for sizes of raid devices,
so change calc_dev_size() so that it returns this unit instead of
the number of 1K blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>

md: Make update_size() take the number of sectors. · d71f9f88

Andre Noll authored Jul 11, 2008

Changing the internal representations of sizes of raid devices
from 1K blocks to sector counts (512B units) is desirable because
it allows us to get rid of many divisions/multiplications and unnecessary
casts that are present in the current code.

This patch is a first step in this direction. It replaces the old
1K-based "size" argument of update_size() by "num_sectors" and
fixes up its two callers.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
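
The conversions that a mixed-unit code base keeps repeating look roughly like this. The helper names are hypothetical and only illustrate the arithmetic between 1K blocks, 512-byte sectors and bytes.

```c
#include <stdint.h>

/* One 1K block is two 512-byte sectors. */
static uint64_t blocks_to_sectors(uint64_t blocks)
{
	return blocks << 1;
}

/* Going back halves the count (an odd trailing sector is dropped). */
static uint64_t sectors_to_blocks(uint64_t sectors)
{
	return sectors >> 1;
}

/* One sector is 512 (1 << 9) bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << 9;
}
```

Keeping every size in sectors means only the last helper is needed at the boundaries, and the block/sector conversions disappear from the callers.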

md: Better control of when do_md_stop is allowed to stop the array. · df5b20cf

Neil Brown authored Jul 11, 2008

do_md_stop checks the number of active users before allowing the array
to be stopped.
Two problems:
  1/ it assumes the request is coming through an open file descriptor
     (via ioctl) so it allows for that.  This is not always the case.
  2/ it doesn't do the check if the array hasn't been activated.
     This is not good for cases when we use an inactive array to hold
     some devices in a container.
Signed-off-by: Neil Brown <neilb@suse.de>

md: get_disk_info(): Don't convert between signed and unsigned and back. · 26ef379f

Andre Noll authored Jul 11, 2008

The current code copies a signed int from user space, converts it to
unsigned and passes the unsigned value to find_rdev_nr() which expects
a signed value. Simply pass the signed value from user space directly.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>

md: Simplify restart_array(). · 80fab1d7

Andre Noll authored Jul 11, 2008

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>

md: alloc_disk_sb(): Return proper error value. · ebc24337

Andre Noll authored Jul 11, 2008

If alloc_page() fails, ENOMEM is a more suitable error value
than EINVAL.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
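
A user-space analogue of the pattern, with malloc() standing in for alloc_page() and a hypothetical struct in place of the real rdev, might look like this:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device structure that owns the buffer. */
struct rdev_sketch {
	void *sb_page;
};

/* Return 0 on success, or -ENOMEM if the buffer cannot be allocated. */
static int alloc_sb_sketch(struct rdev_sketch *rdev, size_t page_size)
{
	rdev->sb_page = malloc(page_size);
	if (!rdev->sb_page)
		return -ENOMEM;	/* out of memory, not an invalid argument */
	return 0;
}
```

Callers can then tell an allocation failure apart from a genuinely invalid request, which would still be reported as -EINVAL.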

md: Simplify sb_equal(). · ce0c8e05

Andre Noll authored Jul 11, 2008

The only caller of sb_equal() tests the return value against
zero, so it's OK to return the negated return value of memcmp().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
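
The simplification amounts to returning `!memcmp(...)` directly. A small sketch under assumed types follows; the struct and its size are hypothetical and much smaller than a real superblock.

```c
#include <string.h>

/* Hypothetical, heavily reduced stand-in for a superblock. */
struct sb_sketch {
	unsigned int words[4];
};

/* Return 1 if both superblocks match, 0 otherwise. */
static int sb_equal_sketch(const struct sb_sketch *a, const struct sb_sketch *b)
{
	/* memcmp() returns 0 on a match, so negate it to mean "equal". */
	return !memcmp(a->words, b->words, sizeof(a->words));
}
```

Because the caller only compares the result against zero, collapsing every non-zero memcmp() result into 0 loses nothing.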

md: Simplify uuid_equal(). · 05710466

Andre Noll authored Jul 11, 2008

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>

Merge branch 'master' into for-next · 0306d5ef

Neil Brown authored Jul 11, 2008

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · e5a5816f

Linus Torvalds authored Jul 10, 2008

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  tun: Persistent devices can get stuck in xoff state
  xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info
  ipv6: missed namespace context in ipv6_rthdr_rcv
  netlabel: netlink_unicast calls kfree_skb on error path by itself
  ipv4: fib_trie: Fix lookup error return
  tcp: correct kcalloc usage
  ip: sysctl documentation cleanup
  Documentation: clarify tcp_{r,w}mem sysctl docs
  netfilter: nf_nat_snmp_basic: fix a range check in NAT for SNMP
  netfilter: nf_conntrack_tcp: fix endless loop
  libertas: fix memory alignment problems on the blackfin
  zd1211rw: stop beacons on remove_interface
  rt2x00: Disable synchronization during initialization
  rc80211_pid: Fix fast_start parameter handling
  sctp: Add documentation for sctp sysctl variable
  ipv6: fix race between ipv6_del_addr and DAD timer
  irda: Fix netlink error path return value
  irda: New device ID for nsc-ircc
  irda: via-ircc proper dma freeing
  sctp: Mark the tsn as received after all allocations finish
  ...

10 Jul, 2008 14 commits

tun: Persistent devices can get stuck in xoff state · e35259a9

Max Krasnyansky authored Jul 10, 2008

The scenario goes like this. App stops reading from tun/tap.
TX queue gets full and driver does netif_stop_queue().
App closes fd and TX queue gets flushed as part of the cleanup.
Next time the app opens tun/tap and starts reading from it but
the xoff state is not cleared. We're stuck.
Normally xoff state is cleared when netdev is brought up. But
in the case of persistent devices this happens only during
initial setup.

The fix is trivial. If device is already up when an app opens
it we clear xoff state and that gets things moving again.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
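
The scenario can be modelled as a tiny state machine. This is only a sketch of the described behaviour; the struct and function names are hypothetical and do not correspond to the tun driver's code.

```c
#include <stdbool.h>

/* Hypothetical model of a persistent tun/tap device. */
struct tun_sketch {
	bool up;	/* interface has been brought up */
	bool xoff;	/* TX queue is stopped */
};

/* Bringing the device up clears xoff (initial setup only). */
static void sketch_bring_up(struct tun_sketch *t)
{
	t->up = true;
	t->xoff = false;
}

/* The TX queue fills up because the app stopped reading. */
static void sketch_queue_full(struct tun_sketch *t)
{
	t->xoff = true;
}

/* The fix: if the device is already up when opened, clear xoff. */
static void sketch_open(struct tun_sketch *t)
{
	if (t->up)
		t->xoff = false;
}
```

Without the check in `sketch_open()`, a device that was stopped before the app closed it would stay in xoff after the next open.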

xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info · ccf9b3b8

Steffen Klassert authored Jul 10, 2008

Add a XFRM_STATE_AF_UNSPEC flag to handle the AF_UNSPEC behavior for
the selector family. Userspace applications can set this flag to leave
the selector family of the xfrm_state unspecified.  This can be used
to handle inter family tunnels if the selector is not set from
userspace.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: missed namespace context in ipv6_rthdr_rcv · 0ce28553

Denis V. Lunev authored Jul 10, 2008

Signed-off-by: Denis V. Lunev <den@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlabel: netlink_unicast calls kfree_skb on error path by itself · fe785bee

Denis V. Lunev authored Jul 10, 2008

So, no need to kfree_skb here on the error path. In this case we can
simply return.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib_trie: Fix lookup error return · 2e655571

Ben Hutchings authored Jul 10, 2008

In commit a07f5f50 "[IPV4] fib_trie: style
cleanup", the changes to check_leaf() and fn_trie_lookup() were wrong - where
fn_trie_lookup() would previously return a negative error value from
check_leaf(), it now returns 0.
 
Now fn_trie_lookup() doesn't appear to care about plen, so we can revert
check_leaf() to returning the error value.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Tested-by: William Boughton <bill@boughton.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: correct kcalloc usage · 3d8ea1fd

Milton Miller authored Jul 10, 2008

kcalloc is supposed to be called with the count as its first argument and
the element size as the second.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
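
The same argument order applies to user-space calloc(), which makes for a quick illustration. The kernel's kcalloc() additionally takes allocation flags as a third argument; the helper name below is hypothetical.

```c
#include <stdlib.h>

/* Allocate a zeroed array of n counters: count first, element size second. */
static unsigned int *alloc_counters(size_t n)
{
	return calloc(n, sizeof(unsigned int));
}
```

Swapping the two arguments requests the same number of bytes, so the bug is silent, but the conventional order keeps the call readable and consistent with the documented interface.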

ip: sysctl documentation cleanup · 4edc2f34

Stephen Hemminger authored Jul 10, 2008

Reduced version of the spelling cleanup patch.

Take out the confusing language in tcp_frto, and organize the
undocumented values.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: clarify tcp_{r,w}mem sysctl docs · 53025f5e

J. Bruce Fields authored Jul 10, 2008

Fix some of the defaults and attempt to clarify some language.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

slub: Fix use-after-preempt of per-CPU data structure · bdb21928

Dmitry Adamushko authored Jul 10, 2008

Vegard Nossum reported a crash in kmem_cache_alloc():

	BUG: unable to handle kernel paging request at da87d000
	IP: [<c01991c7>] kmem_cache_alloc+0xc7/0xe0
	*pde = 28180163 *pte = 1a87d160
	Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
	Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
	EIP: 0060:[<c01991c7>] EFLAGS: 00210203 CPU: 0
	EIP is at kmem_cache_alloc+0xc7/0xe0
	EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
	ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
	DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

and analyzed it:

  "The register %ecx looks innocent but is very important here. The disassembly:

       mov    %edx,%ecx
       shr    $0x2,%ecx
       rep stos %eax,%es:(%edi) <-- the fault

   So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
   (0x6b6b6b6b >> 2 == 0x1adadada.)

   %ecx is the counter for the memset, from here:

       memset(object, 0, c->objsize);

  i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
  Where did "c" come from? Uh-oh...

       c = get_cpu_slab(s, smp_processor_id());

  This looks like it has very much to do with CPU hotplug/unplug. Is
  there a race between SLUB/hotplug since the CPU slab is used after it
  has been freed?"

Good analysis.

Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc()
can be migrated to another CPU right after local_irq_restore() and
before memset().  The initial CPU can go offline in the meantime (or
the migration is a consequence of the CPU going offline), so its
'kmem_cache_cpu' structure gets freed (slab_cpuup_callback).

At some point the caller continues on another CPU with an
obsolete pointer...
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
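
The arithmetic in the quoted analysis can be verified directly. The sketch assumes the free-poison byte is 0x6b, as the analysis states; the helper names are hypothetical.

```c
#include <stdint.h>

#define POISON_FREE_BYTE 0x6b	/* poison byte named in the analysis */

/* A 32-bit word read from freed, poisoned memory. */
static uint32_t poison_word(void)
{
	return 0x01010101u * POISON_FREE_BYTE;
}

/* "shr $0x2,%ecx": the byte count becomes a count of 32-bit stores. */
static uint32_t stos_count(uint32_t objsize)
{
	return objsize >> 2;
}
```

Here `poison_word()` is 0x6b6b6b6b and `stos_count(poison_word())` is 0x1adadada, matching the values in the analysis; the %ecx value in the register dump is slightly lower because some stores had already completed before the fault.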

exec: fix stack executability without PT_GNU_STACK · 96a8e13e

Hugh Dickins authored Jul 10, 2008

Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32)
exec'ing an ELF without a PT_GNU_STACK program header should default to an
executable stack; but this got broken by the unlimited argv feature because
stack vma is now created before the right personality has been established:
so breaking old binaries using nested function trampolines.

Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack
vm_flags used to be set, before the mprotect_fixup. Checking through
our existing VM_flags, none would have changed since insert_vm_struct:
so this seems safer than finding a way through the personality labyrinth.

Reported-by: pageexec@freemail.hu
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 · f8804d39

Linus Torvalds authored Jul 10, 2008

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: Fix flags in ocfs2_file_lock

Merge branch 'sched-fixes-for-linus' of... · a26449da

Linus Torvalds authored Jul 10, 2008

Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: fix cpu hotplug, cleanup
  sched: fix cpu hotplug

sched: fix cpu hotplug, cleanup · b1e38734

Linus Torvalds authored Jul 10, 2008

Clean up __migrate_task(): to just have separate "done" and "fail"
cases, instead of that "out" case with random error behavior.
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Merge branch 'x86-fixes-for-linus' of... · 9cc30892

Linus Torvalds authored Jul 10, 2008

Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: fix /dev/mem compatibility under PAT
