1. 20 Oct, 2010 3 commits
    • x86: Spread tlb flush vector between nodes · 93296720
      Shaohua Li authored
      Currently the flush tlb vector allocation is based on the equation below:
      	sender = smp_processor_id() % 8
      This isn't optimal: CPUs from different nodes can get the same vector,
      which causes a lot of lock contention. Instead, we can assign the same
      vectors to CPUs from the same node, while different nodes get different
      vectors (a sketch follows the list). This has the advantages below:
      a. if there is lock contention, it is between CPUs from one node, which
      should be much cheaper than contention between nodes.
      b. lock contention between nodes is avoided completely. This especially
      benefits kswapd, the biggest user of tlb flush, since kswapd sets its
      affinity to a specific node.
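      
      A minimal sketch of the idea (a simplification, not the verbatim patch;
      the real patch precomputes a per-cpu offset and reads it with
      this_cpu_read(), cf. the fold-in note below):
      
      	static DEFINE_PER_CPU_READ_MOSTLY(int, tlb_vector_offset);
      
      	static void calculate_tlb_offset(void)
      	{
      		int cpu, node, nr_node_vecs;
      
      		/* split the flush vectors evenly among online nodes */
      		nr_node_vecs = NUM_INVALIDATE_TLB_VECTORS / num_online_nodes();
      		if (!nr_node_vecs)
      			nr_node_vecs = 1;
      
      		for_each_online_cpu(cpu) {
      			node = cpu_to_node(cpu);
      			/* CPUs of one node spread over its vector slice */
      			per_cpu(tlb_vector_offset, cpu) = node * nr_node_vecs
      				+ cpu % nr_node_vecs;
      		}
      	}
      
      	/* the sender computation then becomes: */
      	sender = this_cpu_read(tlb_vector_offset);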
      
      In my test, this could reduce >20% of CPU overhead in an extreme case.
      The test machine has 4 nodes and each node has 16 CPUs. I bound each
      node's kswapd to the first CPU of the node, then ran a workload with 4
      sequential mmap file read threads. The files are empty sparse files.
      This workload triggers a lot of page reclaim and tlb flushing. Binding
      kswapd makes it easy to trigger the extreme tlb flush lock contention,
      because otherwise kswapd keeps migrating between the CPUs of a node and
      I can't get a stable result. Real workloads won't always show such heavy
      tlb flush lock contention, but it is possible.
      
      [ hpa: folded in fix from Eric Dumazet to use this_cpu_read() ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1287544023.4571.8.camel@sli10-conroe.sh.intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • percpu: Introduce a read-mostly percpu API · c957ef2c
      Shaohua Li authored
      Add a new readmostly percpu section and API.  This can be used to
      avoid dirtying cache lines that are generally not written to, which is
      especially important for data that may be accessed by processors
      other than the one the percpu area belongs to.
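      
      Usage sketch (the macro names are the ones this patch introduces; the
      variable is the one from the tlb flush vector patch above):
      
      	/* declaration in a header */
      	DECLARE_PER_CPU_READ_MOSTLY(int, tlb_vector_offset);
      
      	/* definition: lands in the read-mostly percpu section, away
      	 * from frequently-written percpu data, so reads from other
      	 * CPUs don't false-share with writes to neighboring lines */
      	DEFINE_PER_CPU_READ_MOSTLY(int, tlb_vector_offset);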
      
      [ hpa: moved it *after* the page-aligned section, for obvious
        reasons. ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1287544022.4571.7.camel@sli10-conroe.sh.intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, mm: Fix incorrect data type in vmalloc_sync_all() · f01f7c56
      Borislav Petkov authored
      Fix the following compiler warning:
      
       arch/x86/mm/fault.c: In function 'vmalloc_sync_all':
       arch/x86/mm/fault.c:238: warning: assignment makes integer from pointer without a cast
      
      introduced by commit 617d34d9.
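      
      The fix follows the usual pattern for this warning: the variable was
      declared with an integer type but is assigned a pointer, so it gets the
      pointer type of its right-hand side (a sketch of the pattern, not the
      verbatim diff):
      
      	-	unsigned long pgt_lock;
      	+	spinlock_t *pgt_lock;
      
      		/* the assignment the warning points at */
      		pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
      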
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      LKML-Reference: <20101020103642.GA3135@kryptos.osrc.amd.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  2. 19 Oct, 2010 2 commits
  3. 07 Oct, 2010 1 commit
    • x86-32: Fix sparse warning for the __PHYSICAL_MASK calculation · a416e9e1
      Namhyung Kim authored
      On 32-bit non-PAE systems, the cast to 'phys_addr_t' truncates the
      value before the subtraction. Subtracting before the cast produces
      the same result but removes the following warnings from sparse:
      
       arch/x86/include/asm/pgtable_types.h:255:38: warning: cast truncates bits from constant value (100000000 becomes 0)
       arch/x86/include/asm/pgtable_types.h:270:38: warning: cast truncates bits from constant value (100000000 becomes 0)
       arch/x86/include/asm/pgtable.h:127:32: warning: cast truncates bits from constant value (100000000 becomes 0)
       arch/x86/include/asm/pgtable.h:132:32: warning: cast truncates bits from constant value (100000000 becomes 0)
       arch/x86/include/asm/pgtable.h:344:31: warning: cast truncates bits from constant value (100000000 becomes 0)
      
      64-bit and PAE machines are not affected by this change.
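      
      The change, reconstructed from the description above: on non-PAE
      32-bit, __PHYSICAL_MASK_SHIFT is 32 and phys_addr_t is 32 bits wide,
      so casting 1ULL << 32 first truncates it to 0 (the unsigned arithmetic
      still wraps 0 - 1 to 0xffffffff, hence "same result"), while
      subtracting first yields 0xffffffff in 64 bits and the cast is then
      lossless:
      
      	-#define __PHYSICAL_MASK	((phys_addr_t)(1ULL << __PHYSICAL_MASK_SHIFT) - 1)
      	+#define __PHYSICAL_MASK	((phys_addr_t)((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
      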
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      LKML-Reference: <1285770588-14065-1-git-send-email-namhyung@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  4. 06 Oct, 2010 1 commit
  5. 17 Sep, 2010 1 commit
    • mm, x86: Saving vmcore with non-lazy freeing of vmas · 3ee48b6a
      Cliff Wickman authored
      While reading /proc/vmcore the kernel calls ioremap()/iounmap()
      repeatedly, and the buildup of unflushed vm_area_structs causes a
      great deal of overhead (rb_next() is chewing up most of that time).
      
      The solution is to provide a function, set_iounmap_nonlazy(), which
      causes a subsequent call to iounmap() to immediately purge the vma
      area (with try_purge_vmap_area_lazy()).
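      
      Usage sketch (simplified from the vmcore read path; only
      set_iounmap_nonlazy() is the new interface):
      
      	void *vaddr = ioremap_cache(pfn << PAGE_SHIFT, PAGE_SIZE);
      
      	/* ... copy the page out to the caller's buffer ... */
      
      	set_iounmap_nonlazy();	/* make the next iounmap() purge the
      				   vmap area immediately */
      	iounmap(vaddr);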
      
      With this patch we have seen the time for writing a 250MB
      compressed dump drop from 71 seconds to 44 seconds.
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: kexec@lists.infradead.org
      Cc: <stable@kernel.org>
      LKML-Reference: <E1OwHZ4-0005WK-Tw@eag09.americas.sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 09 Sep, 2010 1 commit
  7. 03 Sep, 2010 1 commit
  8. 30 Aug, 2010 1 commit
  9. 26 Aug, 2010 3 commits
    • x86, mm: Make spurious_fault check explicitly check the PRESENT bit · 660a293e
      Shaohua Li authored
      pte_present() returns true even when the present bit isn't set, as long
      as the _PAGE_PROTNONE (global) bit is set. With CONFIG_DEBUG_PAGEALLOC,
      free pages have the global bit set but the present bit clear. This patch
      lets us catch accesses to free pages with CONFIG_DEBUG_PAGEALLOC enabled.
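      
      A simplified sketch of the check (the real function also validates the
      write/NX permission bits before declaring the fault spurious):
      
      	static int spurious_fault_check(unsigned long error_code, pte_t *pte)
      	{
      		/*
      		 * Test _PAGE_PRESENT explicitly: pte_present() also
      		 * accepts _PAGE_PROTNONE, which DEBUG_PAGEALLOC leaves
      		 * set on free pages.
      		 */
      		if (!(pte_flags(*pte) & _PAGE_PRESENT))
      			return 0;
      
      		return 1;
      	}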
      
      [ hpa: added a comment in the code as a warning to janitors ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1280217988.32400.75.camel@sli10-desk.sh.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes · 9b861528
      Haicheng Li authored
      When a memory hotplug-add covers a large enough area that a new PGD
      entry is needed for the direct mapping, the PGDs of other processes
      do not get updated. This leads to some CPUs oopsing like below when
      they have to access the unmapped areas.
      
      [ 1139.243192] BUG: soft lockup - CPU#0 stuck for 61s! [bash:6534]
      [ 1139.243195] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill binfmt_misc
      dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan battery ac parport_pc
      lp parport joydev usbhid processor thermal thermal_sys container button rtc_cmos rtc_core rtc_lib
      i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd ehci_hcd usbcore
      [ 1139.243229] irq event stamp: 8538759
      [ 1139.243230] hardirqs last  enabled at (8538759): [<ffffffff8100c3fc>] restore_args+0x0/0x30
      [ 1139.243236] hardirqs last disabled at (8538757): [<ffffffff810422df>] __do_softirq+0x106/0x146
      [ 1139.243240] softirqs last  enabled at (8538758): [<ffffffff81042310>] __do_softirq+0x137/0x146
      [ 1139.243245] softirqs last disabled at (8538743): [<ffffffff8100cb5c>] call_softirq+0x1c/0x34
      [ 1139.243249] CPU 0:
      [ 1139.243250] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill binfmt_misc
      dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan battery ac parport_pc
      lp parport joydev usbhid processor thermal thermal_sys container button rtc_cmos rtc_core rtc_lib
      i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd ehci_hcd usbcore
      [ 1139.243284] Pid: 6534, comm: bash Tainted: G   M       2.6.32-haicheng-cpuhp #7 QSSC-S4R
      [ 1139.243287] RIP: 0010:[<ffffffff810ace35>]  [<ffffffff810ace35>] alloc_arraycache+0x35/0x69
      [ 1139.243292] RSP: 0018:ffff8802799f9d78  EFLAGS: 00010286
      [ 1139.243295] RAX: ffff8884ffc00000 RBX: ffff8802799f9d98 RCX: 0000000000000000
      [ 1139.243297] RDX: 0000000000190018 RSI: 0000000000000001 RDI: ffff8884ffc00010
      [ 1139.243300] RBP: ffffffff8100c34e R08: 0000000000000002 R09: 0000000000000000
      [ 1139.243303] R10: ffffffff8246dda0 R11: 000000d08246dda0 R12: ffff8802599bfff0
      [ 1139.243305] R13: ffff88027904c040 R14: ffff8802799f8000 R15: 0000000000000001
      [ 1139.243308] FS:  00007fe81bfe86e0(0000) GS:ffff88000d800000(0000) knlGS:0000000000000000
      [ 1139.243311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1139.243313] CR2: ffff8884ffc00000 CR3: 000000026cf2d000 CR4: 00000000000006f0
      [ 1139.243316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1139.243318] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 1139.243321] Call Trace:
      [ 1139.243324]  [<ffffffff810ace29>] ? alloc_arraycache+0x29/0x69
      [ 1139.243328]  [<ffffffff8135004e>] ? cpuup_callback+0x1b0/0x32a
      [ 1139.243333]  [<ffffffff8105385d>] ? notifier_call_chain+0x33/0x5b
      [ 1139.243337]  [<ffffffff810538a4>] ? __raw_notifier_call_chain+0x9/0xb
      [ 1139.243340]  [<ffffffff8134ecfc>] ? cpu_up+0xb3/0x152
      [ 1139.243344]  [<ffffffff813388ce>] ? store_online+0x4d/0x75
      [ 1139.243348]  [<ffffffff811e53f3>] ? sysdev_store+0x1b/0x1d
      [ 1139.243351]  [<ffffffff8110589f>] ? sysfs_write_file+0xe5/0x121
      [ 1139.243355]  [<ffffffff810b539d>] ? vfs_write+0xae/0x14a
      [ 1139.243358]  [<ffffffff810b587f>] ? sys_write+0x47/0x6f
      [ 1139.243362]  [<ffffffff8100b9ab>] ? system_call_fastpath+0x16/0x1b
      
      This patch makes sure new direct-mapping PGD entries are always
      replicated to the PGDs of all processes, and ensures the corresponding
      vmemmap mapping gets synced as well.
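      
      Conceptually, the hotplug mapping path ends with (a simplified sketch;
      sync_global_pgds() is factored out in the companion patch below):
      
      	/* after extending the direct mapping for the hot-added range */
      	if (pgd_changed)
      		sync_global_pgds(addr, end);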
      
      V1: initial code by Andi Kleen.
      V2: fix several issues found in testing.
      V3: as suggested by Wu Fengguang, reuse common code of vmalloc_sync_all().
      
      [ hpa: changed pgd_change from int to bool ]
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      LKML-Reference: <4C6E4FD8.6080100@linux.intel.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, mm: Separate x86_64 vmalloc_sync_all() into separate functions · 6afb5157
      Haicheng Li authored
      No change in behavior.
      
      Move some of the vmalloc_sync_all() code into a new function,
      sync_global_pgds(), that will be useful for memory hotplug.
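      
      A sketch of the factored-out helper (essentially the old x86-64
      vmalloc_sync_all() loop; details may differ from the patch):
      
      	void sync_global_pgds(unsigned long start, unsigned long end)
      	{
      		unsigned long address;
      
      		for (address = start; address <= end; address += PGDIR_SIZE) {
      			const pgd_t *pgd_ref = pgd_offset_k(address);
      			unsigned long flags;
      			struct page *page;
      
      			if (pgd_none(*pgd_ref))
      				continue;
      
      			/* copy the kernel entry into every process's pgd */
      			spin_lock_irqsave(&pgd_lock, flags);
      			list_for_each_entry(page, &pgd_list, lru) {
      				pgd_t *pgd = (pgd_t *)page_address(page)
      					     + pgd_index(address);
      				if (pgd_none(*pgd))
      					set_pgd(pgd, *pgd_ref);
      			}
      			spin_unlock_irqrestore(&pgd_lock, flags);
      		}
      	}
      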
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      LKML-Reference: <4C6E4ECD.1090607@linux.intel.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  10. 23 Aug, 2010 2 commits
    • x86, mm: Avoid unnecessary TLB flush · 61c77326
      Shaohua Li authored
      On x86, the accessed and dirty bits are set automatically by the CPU
      when it accesses memory. By the time we reach the
      flush_tlb_fix_spurious_fault() path below, we have already set the
      dirty bit in the pte and don't need to flush the tlb. This might mean
      the tlb entry on some CPUs doesn't have the dirty bit set, but that
      doesn't matter: when those CPUs write the page, they check the bit
      automatically, with no software involvement.
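      
      On x86 the fixup can therefore be a no-op (a sketch of the override;
      the generic fallback does a flush_tlb_page()):
      
      	/* A/D bits are set in hardware and the CPU re-walks the page
      	 * tables on write, so no flush is needed for a spurious fault */
      	#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)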
      
      On the other hand, flushing the tlb here is harmful. A test creates
      one thread per CPU; each thread writes to the same, but random, address
      in the same vma range, and we measure the total time. On a 4-socket
      system the original time is 1.96s, while with the patch it is 0.8s. On
      a 2-socket system there is a 20% time cut too. perf shows a lot of time
      is spent sending and handling IPIs for the tlb flush.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <20100816011655.GA362@sli10-desk.sh.intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • Linux 2.6.36-rc2 · 76be97c1
      Linus Torvalds authored
  11. 22 Aug, 2010 12 commits
  12. 21 Aug, 2010 6 commits
  13. 20 Aug, 2010 6 commits