1. 31 Mar, 2009 23 commits
    • NeilBrown's avatar
      md: make sure new_level, new_chunksize, new_layout always have sensible values. · 34817e8c
      NeilBrown authored
      When an md array is undergoing a change, we have new_* fields that
      show the new values.
      When no change is happening, it is least confusing if these have
      the same value as the normal fields.
      This is true in most cases, but not when the values are set via sysfs.
      
      So fix this up.
      
      A subsequent patch will BUG_ON if these things aren't consistent.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      34817e8c
    • NeilBrown's avatar
      md/raid5: finish support for DDF/raid6 · 67cc2b81
      NeilBrown authored
      DDF requires RAID6 calculations over different devices in a different
      order.
      For md/raid6, we calculate over just the data devices, starting
      immediately after the 'Q' block.
      For ddf/raid6 we calculate over all devices, using zeros in place of
      the P and Q blocks.
      
      This requires unfortunately complex loops...
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      67cc2b81
    • NeilBrown's avatar
      md/raid5: Add support for new layouts for raid5 and raid6. · 99c0fb5f
      NeilBrown authored
      DDF uses different layouts for P and Q blocks than current md/raid6
      so add those that are missing.
      Also add support for RAID6 layouts that are identical to various
      raid5 layouts with the simple addition of one device to hold all of
      the 'Q' blocks.
      Finally add 'raid5' layouts to match raid4.
      These last to will allow online level conversion.
      
      Note that this does not provide correct support for DDF/raid6 yet
      as the order in which data blocks are summed to produce the Q block
      is significant and different between current md code and DDF
      requirements.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      99c0fb5f
    • NeilBrown's avatar
      md/raid5: simplify raid5_compute_sector interface · 911d4ee8
      NeilBrown authored
      Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
      a 'struct stripe_head *' and fill in the relevant fields.  This is
      more extensible.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      911d4ee8
    • NeilBrown's avatar
      md/raid6: remove expectation that Q device is immediately after P device. · d0dabf7e
      NeilBrown authored
      
      Code currently assumes that the devices in a raid6 stripe are
        0 1 ... N-1 P Q
      in some rotated order.  We will shortly add new layouts in which
      this strict pattern is broken.
      So remove this expectation.  We still assume that the data disks
      are roughly in-order.  However P and Q can be inserted anywhere within
      that order.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      d0dabf7e
    • NeilBrown's avatar
      md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument · 112bf897
      NeilBrown authored
      This similar to the recent change to get_active_stripe.
      There is no functional change, just come rearrangement to make
      future patches cleaner.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      112bf897
    • NeilBrown's avatar
      md/raid5: simplify interface for init_stripe and get_active_stripe · b5663ba4
      NeilBrown authored
      Rather than passing 'pd_idx' and 'disks' to these functions, just pass
      'previous' which tells whether to use the 'previous' or 'current'
      geometry during a reshape, and let init_stripe calculate
      disks and pd_idx and anything else it might need.
      
      This is not a substantial simplification and even adds a division.
      However we will shortly be adding more complexity to init_stripe
      to handle more interesting 'reshape' activities, and without this
      change, the interface to these functions would get very complex.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      b5663ba4
    • Andre Noll's avatar
      md: Represent raid device size in sectors. · dd8ac336
      Andre Noll authored
      This patch renames the "size" field of struct mdk_rdev_s to
      "sectors" and changes this field to store sectors instead of
      blocks.
      
      All users of this field, linear.c, raid0.c and md.c, are fixed up
      accordingly which gets rid of many multiplications and divisions.
      Signed-off-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      dd8ac336
    • Andre Noll's avatar
      md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll authored
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows to get
      rid of a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      58c0fed4
    • NeilBrown's avatar
      md: be more consistent about setting WriteMostly flag when adding a drive to an array · 575a80fa
      NeilBrown authored
      When a drive is added to an array using ADD_NEW_DISK, there are two
      places we can get certain flags from:  the metadata on the disk or the
      flags passed through the IOCTL.
      
      For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
      from either of those sources depending on if it is set (i.e. we
      effectively 'or' the two sources together).
      
      This makes it awkward to clear, and is at best inconsistent.
      
      As documented code (in mdadm) requires that setting
      MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
      inconsistency by always using the value for this flag from the ioctl,
      and ignoring the value on disk.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      575a80fa
    • NeilBrown's avatar
      md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash · 97e4f42d
      NeilBrown authored
      Version 1.x metadata has the ability to record the status of a
      partially completed drive recovery.
      However we only update that record on a clean shutdown.
      It would be nice to update it on unclean shutdowns too, particularly
      when using a bitmap that removes much to the 'sync' effort after an
      unclean shutdown.
      
      One complication with checkpointing recovery is that we only know
      where we are up to in terms of IO requests started, not which ones
      have completed.  And we need to know what has completed to record
      how much is recovered.  So occasionally pause the recovery until all
      submitted requests are completed, then update the record of where
      we are up to.
      
      When we have a bitmap, we already do that pause occasionally to keep
      the bitmap up-to-date.  So enhance that code to record the recovery
      offset and schedule a superblock update.
      And when there is no bitmap, just pause 16 times during the resync to
      do a checkpoint.
      '16' is a fairly arbitrary number.  But we don't really have any good
      way to judge how often is acceptable, and it seems like a reasonable
      number for now.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      97e4f42d
    • NeilBrown's avatar
      md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown authored
      It really is nicer to keep related code together..
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      43b2e5d8
    • NeilBrown's avatar
      md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown authored
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      bff61975
    • NeilBrown's avatar
      md: move most content from md.h to md_k.h · 92022950
      NeilBrown authored
      The extern function definitions are kernel-internal definitions, so
      they belong in md_k.h
      
      The MD_*_VERSION values could reasonably go in a number of places,
      but md_u.h seems most reasonable.
      
      This leaves almost nothing in md.h.  It will go soon.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      92022950
    • NeilBrown's avatar
      md: move LEVEL_* definition from md_k.h to md_u.h · 8b2b5c21
      NeilBrown authored
      .. as they are part of the user-space interface.
      Also move MdpMinorShift into there so we can remove duplication.
      
      Lastly move mdp_major in.  It is less obviously part of the user-space
      interface, but do_mounts_md.c uses it, and it is acting a bit like
      user-space.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      8b2b5c21
    • Christoph Hellwig's avatar
      md: move headers out of include/linux/raid/ · ef740c37
      Christoph Hellwig authored
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      ef740c37
    • Christoph Hellwig's avatar
      cleanup drivers/md/Makefile · 2a40a8ae
      Christoph Hellwig authored
      Use the -y variables instead of the old -objs so we can easily add
      conditional objects to the modules.  Also always use += to add
      subobjects to avoid problems when placing additional objects in
      some place in the file.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      2a40a8ae
    • Christoph Hellwig's avatar
      md: stop defining MAJOR_NR · 3dbd8c2e
      Christoph Hellwig authored
      MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
      kernels, so no need to keep it around.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      3dbd8c2e
    • Martin K. Petersen's avatar
      MD data integrity support · 3f9d99c1
      Martin K. Petersen authored
      md: Add support for data integrity to MD
      
      If all subdevices support the same protection format the MD device is
      flagged as integrity capable.
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      3f9d99c1
    • NeilBrown's avatar
      md: write bitmap information to devices that are undergoing recovery. · 355a43e6
      NeilBrown authored
      When we add some spares to an array and start recovery, and we have
      a bitmap which is stored 'internally' on all devices, we call
      bitmap_write_all to make sure the bitmap is correct on the new
      device(s).
      However that doesn't work as write_sb_page only writes to
      'In_sync' devices, and devices undergoing recovery are not
      'In_sync' until recovery finishes.
      
      So extend write_sb_page (actually next_active_rdev) to include devices
      that are under recovery.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      355a43e6
    • NeilBrown's avatar
      md: never clear bit from the write-intent bitmap when the array is degraded. · d0a4bb49
      NeilBrown authored
      
      It is safe to clear a bit from the write-intent bitmap for a raid1
      if we know the data has been written to all devices, which is
      what the current test does.
      
      But it is not always safe to update the 'events_cleared' counter in
      that case.  This is because one request could complete successfully
      after some other request has partially failed.
      
      So simply disable the clearing and updating of events_cleared whenever
      the array is degraded.  This might end up not clearing some bits that
      could safely be cleared, but it is safest approach.
      
      Note that the bug fixed here did not risk corrupting data by letting
      the array get out-of-sync.  Rather it meant that when a device is
      removed and re-added to the array, it might incorrectly require a full
      recovery rather than just recovering based on the bitmap.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      d0a4bb49
    • NeilBrown's avatar
      md: Allow write-intent bitmaps to have chunksize < PAGE_SIZE · 1187cf0a
      NeilBrown authored
      md currently insists that the chunk size used for write-intent
      bitmaps (the amount of data that corresponds to one chunk)
      be at least one page.
      
      The reason for this restriction is lost in the mists of time,
      but a review of the code (and a vague memory) suggests that the only
      problem would be related to resync.  Resync tries very hard to
      work in multiples of a page, but also needs to sync with units
      of a bitmap_chunk too.
      
      This connection comes out in the bitmap_start_sync call.
      
      So change bitmap_start_sync to always work in multiples of a page.
      If the bitmap chunk size is less that one page, we flag multiple
      chunks as 'syncing' and generally make them all appear to the
      resync routines like one chunk.
      
      All other code either already works with data ranges that could
      span multiple chunks, or explicitly only cares about a single chunk.
      Signed-off-by: default avatarNeil Brown <neilb@suse.de>
      1187cf0a
    • NeilBrown's avatar
      md: Fix is_mddev_idle test (again). · eea1bf38
      NeilBrown authored
      There are two problems with is_mddev_idle.
      
      1/ sync_io is 'atomic_t' and hence 'int'.  curr_events and all the
         rest are 'long'.
         So if sync_io were to wrap on a 64bit host, the value of
         curr_events would go very negative suddenly, and take a very
         long time to return to positive.
      
         So do all calculations as 'int'.  That gives us plenty of precision
         for what we need.
      
      2/ To initialise rdev->last_events we simply call is_mddev_idle, on
         the assumption that it will make sure that last_events is in a
         suitable range.  It used to do this, but now it does not.
         So now we need to be more explicit about initialisation.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      eea1bf38
  2. 09 Mar, 2009 9 commits
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq · 99adcd9d
      Linus Torvalds authored
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
        [CPUFREQ] Add p4-clockmod sysfs-ui removal to feature-removal schedule.
        Revert "[CPUFREQ] Disable sysfs ui for p4-clockmod."
      99adcd9d
    • Oleg Nesterov's avatar
      copy_process: fix CLONE_PARENT && parent_exec_id interaction · 2d5516cb
      Oleg Nesterov authored
      CLONE_PARENT can fool the ->self_exec_id/parent_exec_id logic. If we
      re-use the old parent, we must also re-use ->parent_exec_id to make
      sure exit_notify() sees the right ->xxx_exec_id's when the CLONE_PARENT'ed
      task exits.
      
      Also, move down the "p->parent_exec_id = p->self_exec_id" thing, to place
      two different cases together.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d5516cb
    • Dave Jones's avatar
    • Dave Jones's avatar
      Revert "[CPUFREQ] Disable sysfs ui for p4-clockmod." · 129f8ae9
      Dave Jones authored
      This reverts commit e088e4c9.
      
      Removing the sysfs interface for p4-clockmod was flagged as a
      regression in bug 12826.
      
      Course of action:
       - Find out the remaining causes of overheating, and fix them
         if possible. ACPI should be doing the right thing automatically.
         If it isn't, we need to fix that.
       - mark p4-clockmod ui as deprecated
       - try again with the removal in six months.
      
      It's not really feasible to printk about the deprecation, because
      it needs to happen at all the sysfs entry points, which means adding
      a lot of strcmp("p4-clockmod".. calls to the core, which.. bleuch.
      Signed-off-by: default avatarDave Jones <davej@redhat.com>
      129f8ae9
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · df0b4a50
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (29 commits)
        p54: fix race condition in memory management
        cfg80211: test before subtraction on unsigned
        iwlwifi: fix error flow in iwl*_pci_probe
        rt2x00 : more devices to rt73usb.c
        rt2x00 : more devices to rt2500usb.c
        bonding: Fix device passed into ->ndo_neigh_setup().
        vlan: Fix vlan-in-vlan crashes.
        net: Fix missing dev->neigh_setup in register_netdevice().
        tmspci: fix request_irq race
        pkt_sched: act_police: Fix a rate estimator test.
        tg3: Fix 5906 link problems
        SCTP: change sctp_ctl_sock_init() to try IPv4 if IPv6 fails
        IPv6: add "disable" module parameter support to ipv6.ko
        sungem: another error printed one too early
        aoe: error printed 1 too early
        net pcmcia: worklimit reaches -1
        net: more timeouts that reach -1
        net: fix tokenring license
        dm9601: new vendor/product IDs
        netlink: invert error code in netlink_set_err()
        ...
      df0b4a50
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus · 39a3478c
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
        lguest: fix for CONFIG_SPARSE_IRQ=y
        lguest: fix crash 'unhandled trap 13 at <native_read_msr_safe>'
      39a3478c
    • Linus Torvalds's avatar
    • Chris Mason's avatar
      Btrfs: fix spinlock assertions on UP systems · b9447ef8
      Chris Mason authored
      btrfs_tree_locked was being used to make sure a given extent_buffer was
      properly locked in a few places.  But, it wasn't correct for UP compiled
      kernels.
      
      This switches it to using assert_spin_locked instead, and renames it to
      btrfs_assert_tree_locked to better reflect how it was really being used.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b9447ef8
    • Heiko Carstens's avatar
      Fix fixpoint divide exception in acct_update_integrals · 6d5b5acc
      Heiko Carstens authored
      Frans Pop reported the crash below when running an s390 kernel under Hercules:
      
        Kernel BUG at 000738b4  verbose debug info unavailable!
        fixpoint divide exception: 0009  #1! SMP
        Modules linked in: nfs lockd nfs_acl sunrpc ctcm fsm tape_34xx
           cu3088 tape ccwgroup tape_class ext3 jbd mbcache dm_mirror dm_log dm_snapshot
           dm_mod dasd_eckd_mod dasd_mod
        CPU: 0 Not tainted 2.6.27.19 #13
        Process awk (pid: 2069, task: 0f9ed9b8, ksp: 0f4f7d18)
        Krnl PSW : 070c1000 800738b4 (acct_update_integrals+0x4c/0x118)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0
        Krnl GPRS: 00000000 000007d0 7fffffff fffff830
                   00000000 ffffffff 00000002 0f9ed9b8
                   00000000 00008ca0 00000000 0f9ed9b8
                   0f9edda4 8007386e 0f4f7ec8 0f4f7e98
        Krnl Code: 800738aa: a71807d0         lhi     %r1,2000
                   800738ae: 8c200001         srdl    %r2,1
                   800738b2: 1d21             dr      %r2,%r1
                  >800738b4: 5810d10e         l       %r1,270(%r13)
                   800738b8: 1823             lr      %r2,%r3
                   800738ba: 4130f060         la      %r3,96(%r15)
                   800738be: 0de1             basr    %r14,%r1
                   800738c0: 5800f060         l       %r0,96(%r15)
        Call Trace:
        ( <000000000004fdea>! blocking_notifier_call_chain+0x1e/0x2c)
          <0000000000038502>! do_exit+0x106/0x7c0
          <0000000000038c36>! do_group_exit+0x7a/0xb4
          <0000000000038c8e>! SyS_exit_group+0x1e/0x30
          <0000000000021c28>! sysc_do_restart+0x12/0x16
          <0000000077e7e924>! 0x77e7e924
      
      Reason for this is that cpu time accounting usually only happens from
      interrupt context, but acct_update_integrals gets also called from
      process context with interrupts enabled.
      
      So in acct_update_integrals we may end up with the following scenario:
      
      Between reading tsk->stime/tsk->utime and tsk->acct_timexpd an interrupt
      happens which updates accouting values.  This causes acct_timexpd to be
      greater than the former stime + utime.  The subsequent calculation of
      
      	dtime = cputime_sub(time, tsk->acct_timexpd);
      
      will be negative and the division performed by
      
      	cputime_to_jiffies(dtime)
      
      will generate an exception since the result won't fit into a 32 bit
      register.
      
      In order to fix this just always disable interrupts while accessing any
      of the accounting values.
      
      Reported by: Frans Pop <elendil@planet.nl>
      Tested by: Frans Pop <elendil@planet.nl>
      Cc: stable@kernel.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d5b5acc
  3. 08 Mar, 2009 8 commits