1. 18 Feb, 2010 8 commits
  2. 17 Feb, 2010 21 commits
    • Merge commit 'jwb/next' into next · efd0f0f3
      Benjamin Herrenschmidt authored
    • powerpc/booke: Add support for advanced debug registers · 3bffb652
      Dave Kleikamp authored
      From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      
      Based on patches originally written by Torez Smith.
      
      This patch defines context switch and trap related functionality
      for BookE specific Debug Registers. It adds support to ptrace()
      for setting and getting BookE related Debug Registers.

      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <dwg@au1.ibm.com>
      Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Sergio Durigan Junior <sergiodj@br.ibm.com>
      Cc: Thiago Jung Bauermann <bauerman@br.ibm.com>
      Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/booke: Add definitions for advanced debug registers · 99396ac1
      Dave Kleikamp authored
      From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      
      Based on patches originally written by Torez Smith.
      
      This patch adds additional definitions for BookE Debug Registers
      to the reg_booke.h header file.

      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Acked-by: David Gibson <dwg@au1.ibm.com>
      Cc: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Sergio Durigan Junior <sergiodj@br.ibm.com>
      Cc: Thiago Jung Bauermann <bauerman@br.ibm.com>
      Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Extended ptrace interface · 3162d92d
      Dave Kleikamp authored
      From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      
      Based on patches originally written by Torez Smith.
      
      Add a new extended ptrace interface so that user-space has a single
      interface for powerpc, without having to know the specific layout
      of the debug registers.
      
      Implement:
      PPC_PTRACE_GETHWDEBUGINFO
      PPC_PTRACE_SETHWDEBUG
      PPC_PTRACE_DELHWDEBUG

      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Acked-by: David Gibson <dwg@au1.ibm.com>
      Cc: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Sergio Durigan Junior <sergiodj@br.ibm.com>
      Cc: Thiago Jung Bauermann <bauerman@br.ibm.com>
      Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/booke: Introduce new CONFIG options for advanced debug registers · 172ae2e7
      Dave Kleikamp authored
      From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      
      Introduce new config options to simplify the ifdefs pertaining to the
      advanced debug registers for booke and 40x processors:
      
      CONFIG_PPC_ADV_DEBUG_REGS - boolean: true for dac-based processors
      CONFIG_PPC_ADV_DEBUG_IACS - number of IAC registers
      CONFIG_PPC_ADV_DEBUG_DACS - number of DAC registers
      CONFIG_PPC_ADV_DEBUG_DVCS - number of DVC registers
      CONFIG_PPC_ADV_DEBUG_DAC_RANGE - DAC ranges supported
      
      Beginning conservatively, since I only have the facilities to test 440
      hardware.  I believe all 40x and booke platforms support at least 2 IAC
      and 2 DAC registers.  For 440, 4 IAC and 2 DVC registers are enabled, as
      well as the DAC ranges.

      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Acked-by: David Gibson <dwg@au1.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Improve 64bit copy_tofrom_user · 789c299c
      Anton Blanchard authored
      Here is a patch from Paul Mackerras that improves the ppc64 copy_tofrom_user.
      The loop now copies 32 bytes at a time and pairs loads and stores.
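
      The idea can be sketched in portable C. This is only an illustration
      of one reading of "32 bytes per iteration with grouped loads and stores";
      the real routine is hand-written assembly, and copy32_paired is a
      hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative sketch only: copy n bytes, 32 bytes per iteration,
 * issuing the 8-byte loads together and then the 8-byte stores together.
 * copy32_paired is a hypothetical name, not a kernel function.
 */
static void copy32_paired(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n >= 32) {
		uint64_t w0, w1, w2, w3;

		/* grouped loads (memcpy avoids alignment assumptions) */
		memcpy(&w0, s, 8);
		memcpy(&w1, s + 8, 8);
		memcpy(&w2, s + 16, 8);
		memcpy(&w3, s + 24, 8);

		/* grouped stores */
		memcpy(d, &w0, 8);
		memcpy(d + 8, &w1, 8);
		memcpy(d + 16, &w2, 8);
		memcpy(d + 24, &w3, 8);

		s += 32;
		d += 32;
		n -= 32;
	}

	/* copy the remaining tail one byte at a time */
	while (n--)
		*d++ = *s++;
}
```

      Whether a compiler keeps the loads and stores grouped is not guaranteed,
      which is one reason the kernel version is written in assembly.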
      
      A quick test case that reads 8kB over and over shows the improvement:
      
      POWER6: 53% faster
      POWER7: 51% faster
      
      #define _XOPEN_SOURCE 500
      #include <stdlib.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      
      #define BUFSIZE (8 * 1024)
      #define ITERATIONS 10000000
      
      int main(void)
      {
      	char tmpfile[] = "/tmp/copy_to_user_testXXXXXX";
      	int fd;
      	char buf[BUFSIZE] = { 0 };	/* byte buffer, zeroed before it is written out */
      	unsigned long i;
      
      	fd = mkstemp(tmpfile);
      	if (fd < 0) {
      		perror("mkstemp");
      		exit(1);
      	}
      
      	if (write(fd, buf, BUFSIZE) != BUFSIZE) {
      		perror("write");
      		exit(1);
      	}
      
      	for (i = 0; i < ITERATIONS; i++) {
      		if (pread(fd, buf, BUFSIZE, 0) != BUFSIZE) {
      			perror("pread");
      			exit(1);
      		}
      	}
      
      	unlink(tmpfile);
      
      	return 0;
      }

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Pair loads and stores in copy_4k_page · 63e6c5b8
      Anton Blanchard authored
      A number of our chips like loads and stores to be paired. A small kernel
      module testcase shows the improvement of pairing loads and stores in
      copy_4k_page:
      
      POWER6: +9%
      POWER7: +1.5%
      
      #include <linux/module.h>
      #include <linux/mm.h>
      
      #define ITERATIONS 10000000
      
      static int __init copypage_init(void)
      {
      	struct timespec before, after;
      	unsigned long i;
      	struct page *destpage, *srcpage;
      	char *dest, *src;
      
      	destpage = alloc_page(GFP_KERNEL);
      	srcpage = alloc_page(GFP_KERNEL);
      
      	dest = page_address(destpage);
      	src = page_address(srcpage);
      
      	getnstimeofday(&before);
      
      	for (i = 0; i < ITERATIONS; i++)
      		copy_4K_page(dest, src);
      
      	getnstimeofday(&after);
      
      	free_page((unsigned long)dest);
      	free_page((unsigned long)src);
      
      	printk(KERN_DEBUG "copy_4K_page loop took %lu ns\n",
      		(after.tv_sec - before.tv_sec) * NSEC_PER_SEC +
      		(after.tv_nsec - before.tv_nsec));
      
      	return 0;
      }
      
      static void __exit copypage_exit(void)
      {
      }
      
      module_init(copypage_init)
      module_exit(copypage_exit)
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("Anton Blanchard");

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Use lwsync for acquire barrier if CPU supports it · 5a0e9b57
      Anton Blanchard authored
      Nick Piggin discovered that lwsync barriers around locks were faster than isync
      on 970. That was a long time ago and I completely dropped the ball in testing
      his patches across other ppc64 processors.
      
      Turns out the idea helps on other chips. Using a microbenchmark that
      uses a lot of threads to contend on a global pthread mutex (and therefore a
      global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
      and while I couldn't measure an improvement, there was no regression.
      
      This patch uses the lwsync patching code to replace the isyncs with lwsyncs
      on CPUs that support the instruction. We were marking POWER3 and RS64 as lwsync
      capable but in reality they treat it as a full sync (ie slow). Remove the
      CPU_FTR_LWSYNC bit from these CPUs so they continue to use the faster isync
      method.

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Fix lwsync patching code on 64bit · 53eae228
      Anton Blanchard authored
      do_lwsync_fixups doesn't work on 64bit; we end up writing lwsyncs to the
      wrong addresses:
      
      0:mon> di c0000001000bfacc
      c0000001000bfacc  7c2004ac      lwsync
      
      Since the lwsync section has negative offsets we need to use a signed int
      pointer so we sign extend the value.
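
      A minimal sketch of why the signedness matters, assuming each table
      entry stores a 32-bit offset relative to the entry's own address. The
      helper names and the sample address are hypothetical.

```c
#include <stdint.h>

/*
 * Offset read through an unsigned 32-bit type: a stored -16 becomes
 * 0xfffffff0 and is zero-extended, so the result lands 4GB too high.
 */
static uint64_t addr_with_unsigned_offset(uint64_t entry, int32_t stored)
{
	uint32_t off = (uint32_t)stored;

	return entry + off;
}

/*
 * Offset read through a signed 32-bit type: the value is sign-extended
 * to 64 bits, so a negative offset points below the entry as intended.
 */
static uint64_t addr_with_signed_offset(uint64_t entry, int32_t stored)
{
	return entry + (uint64_t)(int64_t)stored;
}
```

      With a negative offset the two results differ by exactly 2^32, which
      matches the stray high bit visible in the address dumped above.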

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER · f10e2e5b
      Anton Blanchard authored
      For performance reasons we are about to change ISYNC_ON_SMP to sometimes be
      lwsync. Now that the macro name doesn't make sense, change it and LWSYNC_ON_SMP
      to better explain what the barriers are doing.
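
      For readers unfamiliar with the terms, here is a portable C11 analogue
      of acquire and release ordering around a lock. It does not use the
      powerpc macros; struct demo_lock and its helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set lock used only to illustrate the orderings. */
struct demo_lock {
	atomic_flag held;
};

static bool demo_trylock(struct demo_lock *l)
{
	/* acquire: accesses after this point cannot move before the lock is taken */
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void demo_unlock(struct demo_lock *l)
{
	/* release: accesses before this point cannot move after the lock is dropped */
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

      The new macro names describe these two roles rather than the specific
      instruction used to implement them.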

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Convert open coded native hashtable bit lock · 66d99b88
      Anton Blanchard authored
      Now that we have real bit locks, use them instead of open coding the lock.

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Use lwarx/ldarx hint in bit locks · 864b9e6f
      Anton Blanchard authored
      This patch implements the lwarx/ldarx hint bit for bit locks.

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Use lwarx hint in spinlocks · 4e14a4d1
      Anton Blanchard authored
      Recent versions of the PowerPC architecture added a hint bit to the larx
      instructions to differentiate between an atomic operation and a lock operation:
      
      > 0 Other programs might attempt to modify the word in storage addressed by EA
      > even if the subsequent Store Conditional succeeds.
      >
      > 1 Other programs will not attempt to modify the word in storage addressed by
      > EA until the program that has acquired the lock performs a subsequent store
      > releasing the lock.
      
      To avoid a binutils dependency this patch creates macros for the extended lwarx
      format and uses them in the spinlock code. To test this change I used a simple
      test case that acquires and releases a global pthread mutex:
      
      	pthread_mutex_lock(&mutex);
      	pthread_mutex_unlock(&mutex);
      
      On a 32 core POWER6, running 32 test threads we spend almost all our time in
      the futex spinlock code:
      
          94.37%     perf  [kernel]                     [k] ._raw_spin_lock
                     |
                     |--99.95%-- ._raw_spin_lock
                     |          |
                     |          |--63.29%-- .futex_wake
                     |          |
                     |          |--36.64%-- .futex_wait_setup
      
      Which is a good test for this patch. The results (in lock/unlock operations per
      second) are:
      
      before: 1538203 ops/sec
      after:  2189219 ops/sec
      
      An improvement of 42%
      
      A 32 core POWER7 improves even more:
      
      before: 1279529 ops/sec
      after:  2282076 ops/sec
      
      An improvement of 78%

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Convert global "BAD" interrupt to per cpu spurious · 17081102
      Anton Blanchard authored
      I often get asked if BAD interrupts are really bad. On some boxes (eg
      IBM machines running a hypervisor) there are valid cases where we are
      presented with an interrupt that is not for us. These cases are common
      enough to show up as thousands of BAD interrupts a day.
      
      Tone them down by calling them spurious. Since they can be a significant cause
      of OS jitter, we may as well log them per cpu so we know where they are
      occurring.

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Add timer, performance monitor and machine check counts to /proc/interrupts · 89713ed1
      Anton Blanchard authored
      With NO_HZ it is useful to know how often the decrementer is going off. The
      patch below adds an entry for it and also adds it into the /proc/stat
      summaries.
      
      While here, I added performance monitoring and machine check exceptions.
      I found it useful to keep an eye on the PMU exception rate
      when using the perf tool. Since it's possible to take a completely
      handled machine check on a System p box it also sounds like a good idea to
      keep a machine check summary.
      
      The event naming matches x86 to keep gratuitous differences to a minimum.

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Remove whitespace in irq chip name fields · fc380c0c
      Anton Blanchard authored
      Now that we use printf style alignment, there is no need to manually
      space these fields.

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Rework /proc/interrupts · c86845ed
      Anton Blanchard authored
      On a large machine I noticed the columns of /proc/interrupts failed to line up
      with the header after CPU9. At sufficiently large numbers of CPUs it becomes
      impossible to line up the CPU number with the counts.
      
      While fixing this I noticed x86 has a number of updates that we may as well
      pull in. On PowerPC we currently omit an interrupt completely if there is no
      active handler, whereas on x86 it is printed if there is a non zero count.
      
      The x86 code also spaces the first column correctly based on nr_irqs.
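
      A rough sketch of sizing the first column from nr_irqs. The loop is
      modeled on my understanding of the x86 approach and should be treated
      as an assumption; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Width of the IRQ-number column: at least 3 characters, widened by one
 * for every power of ten (from 1000 up) that nr_irqs reaches, capped at 10.
 */
static int irq_column_width(unsigned int nr_irqs)
{
	int prec;
	unsigned int j;

	for (prec = 3, j = 1000; prec < 10 && j <= nr_irqs; ++prec)
		j *= 10;

	return prec;
}

/* Format "<irq>: " right-aligned to the computed width; returns snprintf's result. */
static int format_irq_label(char *buf, size_t len, unsigned int nr_irqs, int irq)
{
	return snprintf(buf, len, "%*d: ", irq_column_width(nr_irqs), irq);
}
```

      Computing the width once and passing it through "%*d" keeps every row
      aligned regardless of how many interrupts the machine has.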

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Reduce footprint of xics_ipi_struct · fda9d861
      Anton Blanchard authored
      Right now we allocate a cacheline sized NR_CPUS array for xics IPI
      communication. Use DECLARE_PER_CPU_SHARED_ALIGNED to put it in percpu
      data in its own cacheline since it is written to by other cpus.
      
      On a kernel with NR_CPUS=1024, this saves quite a lot of memory:
      
         text    data     bss      dec         hex    filename
      8767779 2944260 1505724 13217763         c9afe3 vmlinux.irq_cpustat
      8767555 2813444 1505724 13086723         c7b003 vmlinux.xics
      
      A saving of around 128kB.

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Reduce footprint of irq_stat · 8c007bfd
      Anton Blanchard authored
      PowerPC is currently using asm-generic/hardirq.h which statically allocates an
      NR_CPUS irq_stat array. Switch to an arch specific implementation which uses
      per cpu data:
      
      On a kernel with NR_CPUS=1024, this saves quite a lot of memory:
      
         text    data     bss      dec         hex    filename
      8767938 2944132 1636796 13348866         cbb002 vmlinux.baseline
      8767779 2944260 1505724 13217763         c9afe3 vmlinux.irq_cpustat
      
      A saving of around 128kB.
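
      The figure can be checked from the table: the bss column shrinks by
      1636796 - 1505724 = 131072 bytes, which is exactly 128 KiB. That is
      consistent with one 128-byte, cacheline-aligned slot per possible CPU;
      the 128-byte slot size is an assumption here, and the helper name is
      hypothetical.

```c
#include <stddef.h>

/*
 * Bytes consumed by a statically sized array with one cacheline-aligned
 * slot per possible CPU. Ignores overflow; intended for small inputs.
 */
static size_t static_percpu_array_bytes(size_t nr_cpus, size_t slot_bytes)
{
	return nr_cpus * slot_bytes;
}
```

      For NR_CPUS=1024 and 128-byte slots this gives 131072 bytes, the same
      as the bss difference above.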

      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/eeh: Fix a bug when pci structure is null · 8d3d50bf
      Breno Leitao authored
      During an EEH recovery, the pci_dev structure can be null, mainly if an
      EEH event is detected during a PCI config operation. In this case, the
      pci_dev will not be known (and will be null) and the kernel will crash
      with the following message:
      
      Unable to handle kernel paging request for data at address 0x000000a0
      Faulting instruction address: 0xc00000000006b8b4
      Oops: Kernel access of bad area, sig: 11 [#1]
      
      NIP [c00000000006b8b4] .eeh_event_handler+0x10c/0x1a0
      LR [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
      Call Trace:
      [c0000003a80dff00] [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
      [c0000003a80dff90] [c000000000031f1c] .kernel_thread+0x54/0x70
      
      The bug occurs because pci_name() tries to access a null pointer.
      This patch just guarantees that pci_name() is not called on null pointers.
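
      The general shape of such a guard, sketched with stand-in types. This
      is not the actual patch; struct demo_dev, demo_dev_name() and the
      "(unknown)" fallback string are all hypothetical.

```c
#include <stddef.h>

/* Stand-in for struct pci_dev; only the name matters for this sketch. */
struct demo_dev {
	const char *name;
};

/* Like pci_name(): dereferences dev, so it must never be given NULL. */
static const char *demo_dev_name(const struct demo_dev *dev)
{
	return dev->name;
}

/* Guarded variant: falls back to a placeholder when the device is unknown. */
static const char *demo_safe_name(const struct demo_dev *dev)
{
	return dev ? demo_dev_name(dev) : "(unknown)";
}
```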

      Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
      Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Add coherent_dma_mask to mv64x60 devices · e0508b15
      Corey Minyard authored
      DMA ops requires that coherent_dma_mask be set properly for a device,
      but this was not being done for devices on the MV64x60 that use DMA.
      Both the serial and ethernet devices need this or they won't be able
      to allocate memory.

      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  3. 16 Feb, 2010 11 commits
    • Benjamin Herrenschmidt's avatar
      ec144a81
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm · 88626272
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
        dm: sysfs revert add empty release function to avoid debug warning
        dm mpath: fix stall when requeueing io
        dm raid1: fix null pointer dereference in suspend
        dm raid1: fail writes if errors are not handled and log fails
        dm log: userspace fix overhead_size calcuations
        dm snapshot: persistent annotate work_queue as on stack
        dm stripe: avoid divide by zero with invalid stripe count
    • Linus Torvalds's avatar
      5ae1d955
    • dm: sysfs revert add empty release function to avoid debug warning · 9307f6b1
      Alasdair G Kergon authored
      Revert commit d2bb7df8 at Greg's request.
      
          Author: Milan Broz <mbroz@redhat.com>
          Date:   Thu Dec 10 23:51:53 2009 +0000
      
          dm: sysfs add empty release function to avoid debug warning
      
          This patch just removes an unnecessary warning:
           kobject: 'dm': does not have a release() function,
           it is broken and must be fixed.
      
          The kobject is embedded in mapped device struct, so
          code does not need to release memory explicitly here.
      
      Cc: Greg KH <gregkh@suse.de>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: fix stall when requeueing io · 9eef87da
      Kiyoshi Ueda authored
      This patch fixes a problem where the system may stall if the target's
      ->map_rq returns DM_MAPIO_REQUEUE in map_request().
      E.g. a stall happens on a 1 CPU box when a dm-mpath device with
      queue_if_no_path bounces between all-paths-down and paths-up under I/O load.
      
      When target's ->map_rq returns DM_MAPIO_REQUEUE, map_request() requeues
      the request and returns to dm_request_fn().  Then, dm_request_fn()
      doesn't exit the I/O dispatching loop and continues processing
      the requeued request again.
      This map and requeue loop can be done with interrupt disabled,
      so 1 CPU system can be stalled if this situation happens.
      
      For example, commands below can stall my 1 CPU box within 1 minute or so:
        # dmsetup table mp
        mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1
        # while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done &
        # while true; do
        > dmsetup message mp 0 "fail_path 8:144"
        > dmsetup suspend --noflush mp
        > dmsetup resume mp
        > dmsetup message mp 0 "reinstate_path 8:144"
        > done
      
      To fix the problem above, this patch changes dm_request_fn() to exit
      the I/O dispatching loop as soon as a request is requeued in map_request().
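
      The control-flow change can be pictured with a small simulation. This
      is a sketch under stated assumptions, not the dm code: the demo_* names
      are hypothetical and requests are plain integers.

```c
/* Hypothetical stand-ins for the target's map result codes. */
enum demo_map_result {
	DEMO_MAPIO_SUBMITTED,
	DEMO_MAPIO_REQUEUE
};

typedef enum demo_map_result (*demo_map_fn)(int request);

/* A target that accepts every request. */
static enum demo_map_result demo_map_ok(int request)
{
	(void)request;
	return DEMO_MAPIO_SUBMITTED;
}

/* A target with no usable path: every request must be requeued. */
static enum demo_map_result demo_map_always_requeue(int request)
{
	(void)request;
	return DEMO_MAPIO_REQUEUE;
}

/*
 * Dispatch up to n requests and return how many map attempts were made.
 * On a requeue the loop is left immediately instead of retrying, so the
 * caller gets a chance to run other work before the next dispatch.
 */
static int demo_dispatch(const int *requests, int n, demo_map_fn map)
{
	int attempts = 0;

	for (int i = 0; i < n; i++) {
		attempts++;
		if (map(requests[i]) == DEMO_MAPIO_REQUEUE)
			break;
	}

	return attempts;
}
```

      With a target that always requeues, the loop makes a single attempt
      and returns rather than spinning on the same request.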

      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: fix null pointer dereference in suspend · 558569aa
      Takahiro Yasui authored
      When suspending a failed mirror, bios are completed by mirror_end_io() and
      __rh_lookup() in dm_rh_dec() returns NULL where a non-NULL return value is
      required by design.  Fix this by not changing the state of the recovery failed
      region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end().
      
      Issue
      
      On 2.6.33-rc1 kernel, I hit the bug when I suspended the failed
      mirror by dmsetup command.
      
      BUG: unable to handle kernel NULL pointer dereference at 00000020
      IP: [<f94f38e2>] dm_rh_dec+0x35/0xa1 [dm_region_hash]
      ...
      EIP: 0060:[<f94f38e2>] EFLAGS: 00010046 CPU: 0
      EIP is at dm_rh_dec+0x35/0xa1 [dm_region_hash]
      EAX: 00000286 EBX: 00000000 ECX: 00000286 EDX: 00000000
      ESI: eff79eac EDI: eff79e80 EBP: f6915cd4 ESP: f6915cc4
       DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      Process dmsetup (pid: 2849, ti=f6914000 task=eff03e80 task.ti=f6914000)
       ...
      Call Trace:
       [<f9530af6>] ? mirror_end_io+0x53/0x1b1 [dm_mirror]
       [<f9413104>] ? clone_endio+0x4d/0xa2 [dm_mod]
       [<f9530aa3>] ? mirror_end_io+0x0/0x1b1 [dm_mirror]
       [<f94130b7>] ? clone_endio+0x0/0xa2 [dm_mod]
       [<c02d6bcb>] ? bio_endio+0x28/0x2b
       [<f952f303>] ? hold_bio+0x2d/0x62 [dm_mirror]
       [<f952f942>] ? mirror_presuspend+0xeb/0xf7 [dm_mirror]
       [<c02aa3e2>] ? vmap_page_range+0xb/0xd
       [<f9414c8d>] ? suspend_targets+0x2d/0x3b [dm_mod]
       [<f9414ca9>] ? dm_table_presuspend_targets+0xe/0x10 [dm_mod]
       [<f941456f>] ? dm_suspend+0x4d/0x150 [dm_mod]
       [<f941767d>] ? dev_suspend+0x55/0x18a [dm_mod]
       [<c0343762>] ? _copy_from_user+0x42/0x56
       [<f9417fb0>] ? dm_ctl_ioctl+0x22c/0x281 [dm_mod]
       [<f9417628>] ? dev_suspend+0x0/0x18a [dm_mod]
       [<f9417d84>] ? dm_ctl_ioctl+0x0/0x281 [dm_mod]
       [<c02c3c4b>] ? vfs_ioctl+0x22/0x85
       [<c02c422c>] ? do_vfs_ioctl+0x4cb/0x516
       [<c02c42b7>] ? sys_ioctl+0x40/0x5a
       [<c0202858>] ? sysenter_do_call+0x12/0x28
      
      Analysis
      
      When the recovery process of a region fails, dm_rh_recovery_end()
      changes the state of the region from DM_RH_RECOVERING to DM_RH_NOSYNC.
      When recovery_complete() is executed between dm_rh_update_states() and
      do_writes() in do_mirror(), bios are processed with the region state
      DM_RH_NOSYNC. However, the region data is freed without checking its
      pending count the next time dm_rh_update_states() is called.
      
      When bios are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec()
      returns NULL even though a valid return value is expected.
      
      Solution
      
      Remove the state change of the recovery failed region from DM_RH_RECOVERING
      to DM_RH_NOSYNC in dm_rh_recovery_end(). We can remove the state change
      because:
      
        - If the region data has been released by dm_rh_update_states(),
          a new region data is created with the state of DM_RH_NOSYNC, and
          bios are processed according to the DM_RH_NOSYNC state.
      
        - If the region data has not been released by dm_rh_update_states(),
          a state of the region is DM_RH_RECOVERING and bios are put in the
          delayed_bio list.
      
      The flag change from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end()
      was added in the following commit:
        dm raid1: handle resync failures
        author  Jonathan Brassow <jbrassow@redhat.com>
          Thu, 12 Jul 2007 16:29:04 +0000 (17:29 +0100)
        http://git.kernel.org/linus/f44db678edcc6f4c2779ac43f63f0b9dfa28b724

      Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: fail writes if errors are not handled and log fails · 5528d17d
      Mikulas Patocka authored
      If the mirror log fails when the handle_errors option was not selected
      and there is no remaining valid mirror leg, writes return success even
      though they weren't actually written to any device.  This patch
      completes them with EIO instead.
      
      This code path is taken:
      do_writes:
      	bio_list_merge(&ms->failures, &sync);
      do_failures:
      	if (!get_valid_mirror(ms)) (false)
      	else if (errors_handled(ms)) (false)
      	else bio_endio(bio, 0);
      
      The logic in do_failures is based on presuming that the write was already
      tried: if it succeeded at least on one leg (without handle_errors) it
      is reported as success.
      
      Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197

      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm log: userspace fix overhead_size calcuations · ebfd32bb
      Jonathan Brassow authored
      This patch fixes two bugs that revolve around the miscalculation and
      misuse of the variable 'overhead_size'.  'overhead_size' is the size of
      the various header structures used during communication.
      
      The first bug is the use of 'sizeof' with the pointer of a structure
      instead of the structure itself - resulting in the wrong size being
      computed.  This is then used in a check to see if the payload
      (data_size) would be too large for the preallocated structure.  Since the
      bug produces a smaller value for the overhead, it was possible for the
      structure to be breached.  (Although the current users of the code do
      not currently send enough data to trigger this bug.)
      
      The second bug is that the 'overhead_size' value is used to compute how
      much of the preallocated space should be cleared before populating it
      with fresh data.  This should have simply been 'sizeof(struct cn_msg)'
      not overhead_size.  The fact that 'overhead_size' was computed
      incorrectly made this problem "less bad" - leaving only a pointer's
      worth of space at the end uncleared.  Thus, this bug was never producing
      a bad result, but still needs to be fixed - especially now that the
      value is computed correctly.
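
      The first bug is an instance of a common C pitfall, sketched below
      with a stand-in structure. struct demo_hdr and both helpers are
      hypothetical and are not the structures used by the patch.

```c
#include <stddef.h>

/* Hypothetical 64-byte header standing in for the real message headers. */
struct demo_hdr {
	char bytes[64];
};

/* Buggy form: sizeof applied to the pointer yields the pointer's size. */
static size_t overhead_from_pointer(const struct demo_hdr *h)
{
	return sizeof(h);
}

/* Intended form: sizeof applied to the pointed-to object yields the struct's size. */
static size_t overhead_from_struct(const struct demo_hdr *h)
{
	return sizeof(*h);
}
```

      Because the pointer is smaller than the structure, a bounds check
      built on the buggy form accepts payloads that do not actually fit.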
      
      Cc: stable@kernel.org
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: persistent annotate work_queue as on stack · 55f67f2d
      Mike Snitzer authored
      chunk_io() declares its 'struct mdata_req' on the stack and then
      initializes its 'struct work_struct' member.  Annotate the
      initialization of this workqueue with INIT_WORK_ON_STACK to suppress a
      debugobjects warning seen when CONFIG_DEBUG_OBJECTS_WORK is enabled.

      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm stripe: avoid divide by zero with invalid stripe count · 781248c1
      Nikanth Karthikesan authored
      If a table containing zero as stripe count is passed into stripe_ctr
      the code attempts to divide by zero.
      
      This patch changes DM_TABLE_LOAD to return -EINVAL if the stripe count
      is zero.
      
      We now get the following error messages:
        device-mapper: table: 253:0: striped: Invalid stripe count
        device-mapper: ioctl: error adding target to table
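
      A minimal sketch of the kind of check involved: validate the divisor
      before dividing and report -EINVAL otherwise. demo_stripe_width() is a
      hypothetical helper, not the stripe_ctr code.

```c
#include <errno.h>

/*
 * Split len sectors evenly across the given number of stripes.
 * Returns 0 and stores the per-stripe width, or -EINVAL when the
 * stripe count is zero (which would otherwise divide by zero).
 */
static int demo_stripe_width(unsigned long long len, unsigned int stripes,
			     unsigned long long *width)
{
	if (!stripes)
		return -EINVAL;

	*width = len / stripes;
	return 0;
}
```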

      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • x86: ELF_PLAT_INIT() shouldn't worry about TIF_IA32 · 11557b24
      Oleg Nesterov authored
      The 64-bit version of ELF_PLAT_INIT() clears TIF_IA32, but at this point
      it has already been cleared by SET_PERSONALITY == set_personality_64bit.

      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>