1. 07 May, 2007 40 commits
    • Benjamin Herrenschmidt's avatar
      get_unmapped_area handles MAP_FIXED on alpha · 4b87b3b2
      Benjamin Herrenschmidt authored
      Handle MAP_FIXED in alpha's arch_get_unmapped_area(), simple case, just return
      the address as passed in
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b87b3b2
    • Benjamin Herrenschmidt's avatar
      get_unmapped_area handles MAP_FIXED on powerpc · d506a772
      Benjamin Herrenschmidt authored
      The current get_unmapped_area code calls the f_ops->get_unmapped_area or the
      arch one (via the mm) only when MAP_FIXED is not passed.  That makes it
      impossible for archs to impose proper constraints on regions of the virtual
      address space.  To work around that, get_unmapped_area() then calls some
      hugetlbfs specific hacks.
      
      This cause several problems, among others:
      
      - It makes it impossible for a driver or filesystem to do the same thing
        that hugetlbfs does (for example, to allow a driver to use larger page sizes
        to map external hardware) if that requires applying a constraint on the
        addresses (constraining that mapping in certain regions and other mappings
        out of those regions).
      
      - Some archs like arm, mips, sparc, sparc64, sh and sh64 already want
        MAP_FIXED to be passed down in order to deal with aliasing issues.  The code
        is there to handle it...  but is never called.
      
      This series of patches moves the logic to handle MAP_FIXED down to the various
      arch/driver get_unmapped_area() implementations, and then changes the generic
      code to always call them.  The hugetlbfs hacks then disappear from the generic
      code.
      
      Since I need to do some special 64K pages mappings for SPEs on cell, I need to
      work around the first problem at least.  I have further patches thus
      implementing a "slices" layer that handles multiple page sizes through slices
      of the address space for use by hugetlbfs, the SPE code, and possibly others,
      but it requires that serie of patches first/
      
      There is still a potential (but not practical) issue due to the fact that
      filesystems/drivers implemeting g_u_a will effectively bypass all arch checks.
       This is not an issue in practice as the only filesystems/drivers using that
      hook are doing so for arch specific purposes in the first place.
      
      There is also a problem with mremap that will completely bypass all arch
      checks.  I'll try to address that separately, I'm not 100% certain yet how,
      possibly by making it not work when the vma has a file whose f_ops has a
      get_unmapped_area callback, and by making it use is_hugepage_only_range()
      before expanding into a new area.
      
      Also, I want to turn is_hugepage_only_range() into a more generic
      is_normal_page_range() as that's really what it will end up meaning when used
      in stack grow, brk grow and mremap.
      
      None of the above "issues" however are introduced by this patch, they are
      already there, so I think the patch can go ini for 2.6.22.
      
      This patch:
      
      Handle MAP_FIXED in powerpc's arch_get_unmapped_area() in all 3
      implementations of it.
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: default avatarWilliam Irwin <bill.irwin@oracle.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d506a772
    • David Rientjes's avatar
      oom: fix constraint deadlock · 2b45ab33
      David Rientjes authored
      Fixes a deadlock in the OOM killer for allocations that are not
      __GFP_HARDWALL.
      
      Before the OOM killer checks for the allocation constraint, it takes
      callback_mutex.
      
      constrained_alloc() iterates through each zone in the allocation zonelist
      and calls cpuset_zone_allowed_softwall() to determine whether an allocation
      for gfp_mask is possible.  If a zone's node is not in the OOM-triggering
      task's mems_allowed, it is not exiting, and we did not fail on a
      __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take
      callback_mutex to check the nearest exclusive ancestor of current's cpuset.
       This results in deadlock.
      
      We now take callback_mutex after iterating through the zonelist since we
      don't need it yet.
      
      Cc: Andi Kleen <ak@suse.de>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Martin J. Bligh <mbligh@mbligh.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b45ab33
    • Yasunori Goto's avatar
      mm: fix handling of panic_on_oom when cpusets are in use · 2b744c01
      Yasunori Goto authored
      The current panic_on_oom may not work if there is a process using
      cpusets/mempolicy, because other nodes' memory may remain.  But some people
      want failover by panic ASAP even if they are used.  This patch makes new
      setting for its request.
      
      This is tested on my ia64 box which has 3 nodes.
      Signed-off-by: default avatarYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Ethan Solomita <solo@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b744c01
    • Akinobu Mita's avatar
      fault injection: fix failslab with CONFIG_NUMA · 824ebef1
      Akinobu Mita authored
      Currently failslab injects failures into ____cache_alloc().  But with enabling
      CONFIG_NUMA it's not enough to let actual slab allocator functions (kmalloc,
      kmem_cache_alloc, ...) return NULL.
      
      This patch moves fault injection hook inside of __cache_alloc() and
      __cache_alloc_node().  These are lower call path than ____cache_alloc() and
      enable to inject faulures to slab allocators with CONFIG_NUMA.
      Acked-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      824ebef1
    • Christoph Lameter's avatar
      slab allocators: remove multiple alignment specifications · f0f3980b
      Christoph Lameter authored
      It is not necessary to tell the slab allocators to align to a cacheline
      if an explicit alignment was already specified. It is rather confusing
      to specify multiple alignments.
      
      Make sure that the call sites only use one form of alignment.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0f3980b
    • Christoph Lameter's avatar
      KMEM_CACHE(): simplify slab cache creation · 0a31bd5f
      Christoph Lameter authored
      This patch provides a new macro
      
      KMEM_CACHE(<struct>, <flags>)
      
      to simplify slab creation. KMEM_CACHE creates a slab with the name of the
      struct, with the size of the struct and with the alignment of the struct.
      Additional slab flags may be specified if necessary.
      
      Example
      
      struct test_slab {
      	int a,b,c;
      	struct list_head;
      } __cacheline_aligned_in_smp;
      
      test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)
      
      will create a new slab named "test_slab" of the size sizeof(struct
      test_slab) and aligned to the alignment of test slab.  If it fails then we
      panic.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a31bd5f
    • Christoph Lameter's avatar
      slab allocators: Remove obsolete SLAB_MUST_HWCACHE_ALIGN · 5af60839
      Christoph Lameter authored
      This patch was recently posted to lkml and acked by Pekka.
      
      The flag SLAB_MUST_HWCACHE_ALIGN is
      
      1. Never checked by SLAB at all.
      
      2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB
      
      3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.
      
      The only remaining use is in sparc64 and ppc64 and their use there
      reflects some earlier role that the slab flag once may have had. If
      its specified then SLAB_HWCACHE_ALIGN is also specified.
      
      The flag is confusing, inconsistent and has no purpose.
      
      Remove it.
      Acked-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5af60839
    • Peter Zijlstra's avatar
      mm: optimize acorn partition truncate · 96018fda
      Peter Zijlstra authored
      invalidate_bdev() is superfluous when truncate_inode_pages() is also
      called.  do call invalidate_bh_lrus() though, to avoid stale pointers.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96018fda
    • Peter Zijlstra's avatar
      mm: optimize kill_bdev() · f9a14399
      Peter Zijlstra authored
      Remove duplicate work in kill_bdev().
      
      It currently invalidates and then truncates the bdev's mapping.
      invalidate_mapping_pages() will opportunistically remove pages from the
      mapping.  And truncate_inode_pages() will forcefully remove all pages.
      
      The only thing truncate doesn't do is flush the bh lrus.  So do that
      explicitly.  This avoids (very unlikely) but possible invalid lookup
      results if the same bdev is quickly re-issued.
      
      It also will prevent extreme kernel latencies which are observed when
      blockdevs which have a large amount of pagecache are unmounted, by avoiding
      invalidate_mapping_pages() on that path.  invalidate_mapping_pages() has no
      cond_resched (it can be called under spinlock), whereas truncate_inode_pages()
      has one.
      
      [akpm@linux-foundation.org: restore nrpages==0 optimisation]
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f9a14399
    • Peter Zijlstra's avatar
      mm: remove destroy_dirty_buffers from invalidate_bdev() · f98393a6
      Peter Zijlstra authored
      Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
      been used in 6 years (so akpm says).
      
      find * -name \*.[ch] | xargs grep -l invalidate_bdev |
      while read file; do
      	quilt add $file;
      	sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
      done
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f98393a6
    • Nick Piggin's avatar
      mm: madvise avoid exclusive mmap_sem · 0a27a14a
      Nick Piggin authored
      Avoid down_write of the mmap_sem in madvise when we can help it.
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a27a14a
    • matze's avatar
    • Akinobu Mita's avatar
      slob: handle SLAB_PANIC flag · bc0055ae
      Akinobu Mita authored
      kmem_cache_create() for slob doesn't handle SLAB_PANIC.
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc0055ae
    • David Miller's avatar
      Quicklist support for sparc64 · 3a2cba99
      David Miller authored
      I ported this to sparc64 as per the patch below, tested on UP SunBlade1500 and
      24 cpu Niagara T1000.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a2cba99
    • Christoph Lameter's avatar
      Quicklists for page table pages · 6225e937
      Christoph Lameter authored
      On x86_64 this cuts allocation overhead for page table pages down to a
      fraction (kernel compile / editing load.  TSC based measurement of times spend
      in each function):
      
      no quicklist
      
      pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
      pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
      pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
      pgd_alloc                260022 1s(920ns/4us/263.1us)
      
      quicklist:
      
      pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
      pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
      pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
      pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)
      
      pgd allocations are the most complex and there we see the most dramatic
      improvement (may be we can cut down the amount of pgds cached somewhat?).  But
      even the pte allocations still see a doubling of performance.
      
      1. Proven code from the IA64 arch.
      
      	The method used here has been fine tuned for years and
      	is NUMA aware. It is based on the knowledge that accesses
      	to page table pages are sparse in nature. Taking a page
      	off the freelists instead of allocating a zeroed pages
      	allows a reduction of number of cachelines touched
      	in addition to getting rid of the slab overhead. So
      	performance improves. This is particularly useful if pgds
      	contain standard mappings. We can save on the teardown
      	and setup of such a page if we have some on the quicklists.
      	This includes avoiding lists operations that are otherwise
      	necessary on alloc and free to track pgds.
      
      2. Light weight alternative to use slab to manage page size pages
      
      	Slab overhead is significant and even page allocator use
      	is pretty heavy weight. The use of a per cpu quicklist
      	means that we touch only two cachelines for an allocation.
      	There is no need to access the page_struct (unless arch code
      	needs to fiddle around with it). So the fast past just
      	means bringing in one cacheline at the beginning of the
      	page. That same cacheline may then be used to store the
      	page table entry. Or a second cacheline may be used
      	if the page table entry is not in the first cacheline of
      	the page. The current code will zero the page which means
      	touching 32 cachelines (assuming 128 byte). We get down
      	from 32 to 2 cachelines in the fast path.
      
      3. x86_64 gets lightweight page table page management.
      
      	This will allow x86_64 arch code to faster repopulate pgds
      	and other page table entries. The list operations for pgds
      	are reduced in the same way as for i386 to the point where
      	a pgd is allocated from the page allocator and when it is
      	freed back to the page allocator. A pgd can pass through
      	the quicklists without having to be reinitialized.
      
      64 Consolidation of code from multiple arches
      
      	So far arches have their own implementation of quicklist
      	management. This patch moves that feature into the core allowing
      	an easier maintenance and consistent management of quicklists.
      
      Page table pages have the characteristics that they are typically zero or in a
      known state when they are freed.  This is usually the exactly same state as
      needed after allocation.  So it makes sense to build a list of freed page
      table pages and then consume the pages already in use first.  Those pages have
      already been initialized correctly (thus no need to zero them) and are likely
      already cached in such a way that the MMU can use them most effectively.  Page
      table pages are used in a sparse way so zeroing them on allocation is not too
      useful.
      
      Such an implementation already exits for ia64.  Howver, that implementation
      did not support constructors and destructors as needed by i386 / x86_64.  It
      also only supported a single quicklist.  The implementation here has
      constructor and destructor support as well as the ability for an arch to
      specify how many quicklists are needed.
      
      Quicklists are defined by an arch defining CONFIG_QUICKLIST.  If more than one
      quicklist is necessary then we can define NR_QUICK for additional lists.  F.e.
       i386 needs two and thus has
      
      config NR_QUICK
      	int
      	default 2
      
      If an arch has requested quicklist support then pages can be allocated
      from the quicklist (or from the page allocator if the quicklist is
      empty) via:
      
      quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)
      
      Page table pages can be freed using:
      
      quicklist_free(<quicklist-nr>, <destructor>, <page>)
      
      Pages must have a definite state after allocation and before
      they are freed. If no constructor is specified then pages
      will be zeroed on allocation and must be zeroed before they are
      freed.
      
      If a constructor is used then the constructor will establish
      a definite page state. F.e. the i386 and x86_64 pgd constructors
      establish certain mappings.
      
      Constructors and destructors can also be used to track the pages.
      i386 and x86_64 use a list of pgds in order to be able to dynamically
      update standard mappings.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6225e937
    • Christoph Lameter's avatar
      slub: add slabinfo tool · c09d8751
      Christoph Lameter authored
      Add the tool which gets reports about slabs to the VM documentation directory.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c09d8751
    • Christoph Lameter's avatar
    • Christoph Lameter's avatar
      slub: remove object activities out of checking functions · 70d71228
      Christoph Lameter authored
      Make sure that the check function really only check things and do not perform
      activities.  Extract the tracing and object seeding out of the two check
      functions and place them into slab_alloc and slab_free
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70d71228
    • Christoph Lameter's avatar
      SLUB: Free slabs and sort partial slab lists in kmem_cache_shrink · 2086d26a
      Christoph Lameter authored
      At kmem_cache_shrink check if we have any empty slabs on the partial
      if so then remove them.
      
      Also--as an anti-fragmentation measure--sort the partial slabs so that
      the most fully allocated ones come first and the least allocated last.
      
      The next allocations may fill up the nearly full slabs. Having the
      least allocated slabs last gives them the maximum chance that their
      remaining objects may be freed. Thus we can hopefully minimize the
      partial slabs.
      
      I think this is the best one can do in terms antifragmentation
      measures. Real defragmentation (meaning moving objects out of slabs with
      the least free objects to those that are almost full) can be implemted
      by reverse scanning through the list produced here but that would mean
      that we need to provide a callback at slab cache creation that allows
      the deletion or moving of an object. This will involve slab API
      changes, so defer for now.
      
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2086d26a
    • Christoph Lameter's avatar
      slub: add ability to list alloc / free callers per slab · 88a420e4
      Christoph Lameter authored
      This patch enables listing the callers who allocated or freed objects in a
      cache.
      
      For example to list the allocators for kmalloc-128 do
      
      cat /sys/slab/kmalloc-128/alloc_calls
            7 sn_io_slot_fixup+0x40/0x700
            7 sn_io_slot_fixup+0x80/0x700
            9 sn_bus_fixup+0xe0/0x380
            6 param_sysfs_setup+0xf0/0x280
          276 percpu_populate+0xf0/0x1a0
           19 __register_chrdev_region+0x30/0x360
            8 expand_files+0x2e0/0x6e0
            1 sys_epoll_create+0x60/0x200
            1 __mounts_open+0x140/0x2c0
           65 kmem_alloc+0x110/0x280
            3 alloc_disk_node+0xe0/0x200
           33 as_get_io_context+0x90/0x280
           74 kobject_kset_add_dir+0x40/0x140
           12 pci_create_bus+0x2a0/0x5c0
            1 acpi_ev_create_gpe_block+0x120/0x9e0
           41 con_insert_unipair+0x100/0x1c0
            1 uart_open+0x1c0/0xba0
            1 dma_pool_create+0xe0/0x340
            2 neigh_table_init_no_netlink+0x260/0x4c0
            6 neigh_parms_alloc+0x30/0x200
            1 netlink_kernel_create+0x130/0x320
            5 fz_hash_alloc+0x50/0xe0
            2 sn_common_hubdev_init+0xd0/0x6e0
           28 kernel_param_sysfs_setup+0x30/0x180
           72 process_zones+0x70/0x2e0
      
      cat /sys/slab/kmalloc-128/free_calls
          558 <not-available>
            3 sn_io_slot_fixup+0x600/0x700
           84 free_fdtable_rcu+0x120/0x260
            2 seq_release+0x40/0x60
            6 kmem_free+0x70/0xc0
           24 free_as_io_context+0x20/0x200
            1 acpi_get_object_info+0x3a0/0x3e0
            1 acpi_add_single_object+0xcf0/0x1e40
            2 con_release_unimap+0x80/0x140
            1 free+0x20/0x40
      
      SLAB_STORE_USER must be enabled for a slab cache by either booting with
      "slab_debug" or enabling user tracking specifically for the slab of interest.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      88a420e4
    • Christoph Lameter's avatar
      SLUB: Add MIN_PARTIAL · e95eed57
      Christoph Lameter authored
      We leave a mininum of partial slabs on nodes when we search for
      partial slabs on other node. Define a constant for that value.
      
      Then modify slub to keep MIN_PARTIAL slabs around.
      
      This avoids bad situations where a function frees the last object
      in a slab (which results in the page being returned to the page
      allocator) only to then allocate one again (which requires getting
      a page back from the page allocator if the partial list was empty).
      Keeping a couple of slabs on the partial list reduces overhead.
      
      Empty slabs are added to the end of the partial list to insure that
      partially allocated slabs are consumed first (defragmentation).
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e95eed57
    • Christoph Lameter's avatar
      slub: validation of slabs (metadata and guard zones) · 53e15af0
      Christoph Lameter authored
      This enables validation of slab.  Validation means that all objects are
      checked to see if there are redzone violations, if padding has been
      overwritten or any pointers have been corrupted.  Also checks the consistency
      of slab counters.
      
      Validation enables the detection of metadata corruption without the kernel
      having to execute code that actually uses (allocs/frees) and object.  It
      allows one to make sure that the slab metainformation and the guard values
      around an object have not been compromised.
      
      A single slabcache can be checked by writing a 1 to the "validate" file.
      
      i.e.
      
      echo 1 >/sys/slab/kmalloc-128/validate
      
      or use the slabinfo tool to check all slabs
      
      slabinfo -v
      
      Error messages will show up in the syslog.
      
      Note that validation can only reach slabs that are on a list.  This means that
      we are usually restricted to partial slabs and active slabs unless
      SLAB_STORE_USER is active which will build a full slab list and allows
      validation of slabs that are fully in use.  Booting with "slub_debug" set will
      enable SLAB_STORE_USER and then full diagnostic are available.
      
      Note that we attempt to push cpu slabs back to the lists when we start the
      check.  If the cpu slab is reactivated before we get to it (another processor
      grabs it before we get to it) then it cannot be checked.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53e15af0
    • Christoph Lameter's avatar
      slub: enable tracking of full slabs · 643b1138
      Christoph Lameter authored
      If slab tracking is on then build a list of full slabs so that we can verify
      the integrity of all slabs and are also able to built list of alloc/free
      callers.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      643b1138
    • Christoph Lameter's avatar
      slub: fix object tracking · 77c5e2d0
      Christoph Lameter authored
      Object tracking did not work the right way for several call chains. Fix this up
      by adding a new parameter to slub_alloc and slub_free that specifies the
      caller address explicitly.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77c5e2d0
    • Christoph Lameter's avatar
    • Christoph Lameter's avatar
      mm: optimize compound_head() by avoiding a shared page flag · 6d777953
      Christoph Lameter authored
      The patch adds PageTail(page) and PageHead(page) to check if a page is the
      head or the tail of a compound page.  This is done by masking the two bits
      describing the state of a compound page and then comparing them.  So one
      comparision and a branch instead of two bit checks and two branches.
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d777953
    • Christoph Lameter's avatar
      Make page->private usable in compound pages · d85f3385
      Christoph Lameter authored
      If we add a new flag so that we can distinguish between the first page and the
      tail pages then we can avoid to use page->private in the first page.
      page->private == page for the first page, so there is no real information in
      there.
      
      Freeing up page->private makes the use of compound pages more transparent.
      They become more usable like real pages.  Right now we have to be careful f.e.
       if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
      can then no longer use the private field.  This is one of the issues that
      cause us not to support debugging for page size slabs in SLAB.
      
      Having page->private available for SLUB would allow more meta information in
      the page struct.  I can probably avoid the 16 bit ints that I have in there
      right now.
      
      Also if page->private is available then a compound page may be equipped with
      buffer heads.  This may free up the way for filesystems to support larger
      blocks than page size.
      
      We add PageTail as an alias of PageReclaim.  Compound pages cannot currently
      be reclaimed.  Because of the alias one needs to check PageCompound first.
      
      The RFC for the this approach was discussed at
      http://marc.info/?t=117574302800001&r=1&w=2
      
      [nacc@us.ibm.com: fix hugetlbfs]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d85f3385
    • Christoph Lameter's avatar
      PowerPC: Disable SLUB for configurations in which slab page structs are modified · 30520864
      Christoph Lameter authored
      PowerPC uses the slab allocator to manage the lowest level of the page
      table.  In high cpu configurations we also use the page struct to split the
      page table lock.  Disallow the selection of SLUB for that case.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      30520864
    • Christoph Lameter's avatar
      SLUB: allocate smallest object size if the user asks for 0 bytes · 614410d5
      Christoph Lameter authored
      Makes SLUB behave like SLAB in this area to avoid issues....
      
      Throw a stack dump to alert people.
      
      At some point the behavior should be switched back.  NULL is no memory as
      far as I can tell and if the use asked for 0 bytes then he need to get no
      memory.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      614410d5
    • Christoph Lameter's avatar
      SLUB: change default alignments · 47bfdc0d
      Christoph Lameter authored
      Structures may contain u64 items on 32 bit platforms that are only able to
      address 64 bit items on 64 bit boundaries.  Change the mininum alignment of
      slabs to conform to those expectations.
      
      ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structure
      are mixed in the general slabs.
      
      ARCH_SLAB_MINALIGN is changed because currently there is no consistent
      specification of object alignment.  We may have that in the future when the
      KMEM_CACHE and related macros are used to generate slabs.  These pass the
      alignment of the structure generated by the compiler to the slab.
      
      With KMEM_CACHE etc we could align structures that do not contain 64
      bit values to 32 bit boundaries potentially saving some memory.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      47bfdc0d
    • Christoph Lameter's avatar
      SLUB core · 81819f0f
      Christoph Lameter authored
      This is a new slab allocator which was motivated by the complexity of the
      existing code in mm/slab.c. It attempts to address a variety of concerns
      with the existing implementation.
      
      A. Management of object queues
      
         A particular concern was the complex management of the numerous object
         queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
         each allocating CPU and use objects from a slab directly instead of
         queueing them up.
      
      B. Storage overhead of object queues
      
         SLAB Object queues exist per node, per CPU. The alien cache queue even
         has a queue array that contain a queue for each processor on each
         node. For very large systems the number of queues and the number of
         objects that may be caught in those queues grows exponentially. On our
         systems with 1k nodes / processors we have several gigabytes just tied up
         for storing references to objects for those queues  This does not include
         the objects that could be on those queues. One fears that the whole
         memory of the machine could one day be consumed by those queues.
      
      C. SLAB meta data overhead
      
         SLAB has overhead at the beginning of each slab. This means that data
         cannot be naturally aligned at the beginning of a slab block. SLUB keeps
         all meta data in the corresponding page_struct. Objects can be naturally
         aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
         boundaries and can fit tightly into a 4k page with no bytes left over.
         SLAB cannot do this.
      
      D. SLAB has a complex cache reaper
      
         SLUB does not need a cache reaper for UP systems. On SMP systems
         the per CPU slab may be pushed back into partial list but that
         operation is simple and does not require an iteration over a list
         of objects. SLAB expires per CPU, shared and alien object queues
         during cache reaping which may cause strange hold offs.
      
      E. SLAB has complex NUMA policy layer support
      
         SLUB pushes NUMA policy handling into the page allocator. This means that
         allocation is coarser (SLUB does interleave on a page level) but that
         situation was also present before 2.6.13. SLABs application of
         policies to individual slab objects allocated in SLAB is
         certainly a performance concern due to the frequent references to
         memory policies which may lead a sequence of objects to come from
         one node after another. SLUB will get a slab full of objects
         from one node and then will switch to the next.
      
      F. Reduction of the size of partial slab lists
      
         SLAB has per node partial lists. This means that over time a large
         number of partial slabs may accumulate on those lists. These can
         only be reused if allocator occur on specific nodes. SLUB has a global
         pool of partial slabs and will consume slabs from that pool to
         decrease fragmentation.
      
      G. Tunables
      
         SLAB has sophisticated tuning abilities for each slab cache. One can
         manipulate the queue sizes in detail. However, filling the queues still
         requires the uses of the spin lock to check out slabs. SLUB has a global
         parameter (min_slab_order) for tuning. Increasing the minimum slab
         order can decrease the locking overhead. The bigger the slab order the
         less motions of pages between per CPU and partial lists occur and the
         better SLUB will be scaling.
      
      G. Slab merging
      
         We often have slab caches with similar parameters. SLUB detects those
         on boot up and merges them into the corresponding general caches. This
         leads to more effective memory use. About 50% of all caches can
         be eliminated through slab merging. This will also decrease
         slab fragmentation because partial allocated slabs can be filled
         up again. Slab merging can be switched off by specifying
         slub_nomerge on boot up.
      
         Note that merging can expose heretofore unknown bugs in the kernel
         because corrupted objects may now be placed differently and corrupt
         differing neighboring objects. Enable sanity checks to find those.
      
      H. Diagnostics
      
         The current slab diagnostics are difficult to use and require a
         recompilation of the kernel. SLUB contains debugging code that
         is always available (but is kept out of the hot code paths).
         SLUB diagnostics can be enabled via the "slab_debug" option.
         Parameters can be specified to select a single or a group of
         slab caches for diagnostics. This means that the system is running
         with the usual performance and it is much more likely that
         race conditions can be reproduced.
      
      I. Resiliency
      
         If basic sanity checks are on then SLUB is capable of detecting
         common error conditions and recover as best as possible to allow the
         system to continue.
      
      J. Tracing
      
         Tracing can be enabled via the slab_debug=T,<slabcache> option
         during boot. SLUB will then protocol all actions on that slabcache
         and dump the object contents on free.
      
      K. On demand DMA cache creation.
      
         Generally DMA caches are not needed. If a kmalloc is used with
         __GFP_DMA then just create this single slabcache that is needed.
         For systems that have no ZONE_DMA requirement the support is
         completely eliminated.
      
      L. Performance increase
      
         Some benchmarks have shown speed improvements on kernbench in the
         range of 5-10%. The locking overhead of slub is based on the
         underlying base allocation size. If we can reliably allocate
         larger order pages then it is possible to increase slub
         performance much further. The anti-fragmentation patches may
         enable further performance increases.
      
      Tested on:
      i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
      
      SLUB Boot options
      
      slub_nomerge		Disable merging of slabs
      slub_min_order=x	Require a minimum order for slab caches. This
      			increases the managed chunk size and therefore
      			reduces meta data and locking overhead.
      slub_min_objects=x	Mininum objects per slab. Default is 8.
      slub_max_order=x	Avoid generating slabs larger than order specified.
      slub_debug		Enable all diagnostics for all caches
      slub_debug=<options>	Enable selective options for all caches
      slub_debug=<o>,<cache>	Enable selective options for a certain set of
      			caches
      
      Available Debug options
      F		Double Free checking, sanity and resiliency
      R		Red zoning
      P		Object / padding poisoning
      U		Track last free / alloc
      T		Trace all allocs / frees (only use for individual slabs).
      
      To use SLUB: Apply this patch and then select SLUB as the default slab
      allocator.
      
      [hugh@veritas.com: fix an oops-causing locking error]
      [akpm@linux-foundation.org: various stupid cleanups and small fixes]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81819f0f
    • Andy Whitcroft's avatar
      tty_register_driver: only allocate tty instances when defined · 543691a6
      Andy Whitcroft authored
      If device->num is zero we attempt to kmalloc() zero bytes.  When SLUB is
      enabled this returns a null pointer and take that as an allocation failure
      and fail the device register.  Check for no devices and avoid the
      allocation.
      
      [akpm: opportunistic kzalloc() conversion]
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      543691a6
    • Christoph Lameter's avatar
      i386: use page allocator to allocate thread_info structure · b5637e65
      Christoph Lameter authored
      i386 uses kmalloc to allocate the threadinfo structure assuming that the
      allocations result in a page sized aligned allocation.  That has worked so
      far because SLAB exempts page sized slabs from debugging and aligns them in
      special ways that goes beyond the restrictions imposed by
      KMALLOC_ARCH_MINALIGN valid for other slabs in the kmalloc array.
      
      SLUB also works fine without debugging since page sized allocations neatly
      align at page boundaries.  However, if debugging is switched on then SLUB
      will extend the slab with debug information.  The resulting slab is not
      longer of page size.  It will only be aligned following the requirements
      imposed by KMALLOC_ARCH_MINALIGN.  As a result the threadinfo structure may
      not be page aligned which makes i386 fail to boot with SLUB debug on.
      
      Replace the calls to kmalloc with calls into the page allocator.
      
      An alternate solution may be to create a custom slab cache where the
      alignment is set to PAGE_SIZE.  That would allow slub debugging to be
      applied to the threadinfo structure.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5637e65
    • David Rientjes's avatar
      cpusets: allow TIF_MEMDIE threads to allocate anywhere · c596d9f3
      David Rientjes authored
      OOM killed tasks have access to memory reserves as specified by the
      TIF_MEMDIE flag in the hopes that it will quickly exit.  If such a task has
      memory allocations constrained by cpusets, we may encounter a deadlock if a
      blocking task cannot exit because it cannot allocate the necessary memory.
      
      We allow tasks that have the TIF_MEMDIE flag to allocate memory anywhere,
      including outside its cpuset restriction, so that it can quickly die
      regardless of whether it is __GFP_HARDWALL.
      
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c596d9f3
    • Andrew Morton's avatar
      slab: mark set_up_list3s() __init · a3a02be7
      Andrew Morton authored
      It is only ever used prior to free_initmem().
      
      (It will cause a warning when we run the section checking, but that's a
      false-positive and it simply changes the source of an existing warning, which
      is also a false-positive)
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3a02be7
    • Mel Gorman's avatar
      Do not disable interrupts when reading min_free_kbytes · 3b1d92c5
      Mel Gorman authored
      The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
      read or write.  This function iterates through every zone and calls
      spin_lock_irqsave() on the zone LRU lock.  When reading min_free_kbytes,
      this is a total waste of time that disables interrupts on the local
      processor.  It might even be noticable machines with large numbers of zones
      if a process started constantly reading min_free_kbytes.
      
      This patch only calls setup_per_zone_pages_min() only on write. Tested on
      an x86 laptop and it did the right thing.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarChristoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b1d92c5
    • Eric Dumazet's avatar
      slab: NUMA kmem_cache diet · 8da3430d
      Eric Dumazet authored
      Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
      possible nodes.  This patch dynamically sizes the 'struct kmem_cache' to
      allocate only needed space.
      
      I moved nodelists[] field at the end of struct kmem_cache, and use the
      following computation in kmem_cache_init()
      
      cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
                                       nr_node_ids * sizeof(struct kmem_list3 *);
      
      On my two nodes x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
      (This is because on x86_64, MAX_NUMNODES is 64)
      
      On bigger NUMA setups, this might reduce the gfporder of "cache_cache"
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8da3430d
    • Eric Dumazet's avatar
      SLAB: don't allocate empty shared caches · 63109846
      Eric Dumazet authored
      We can avoid allocating empty shared caches and avoid unecessary check of
      cache->limit.  We save some memory.  We avoid bringing into CPU cache
      unecessary cache lines.
      
      All accesses to l3->shared are already checking NULL pointers so this patch is
      safe.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Acked-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63109846
    • Eric Dumazet's avatar
      SLAB: use num_possible_cpus() in enable_cpucache() · 364fbb29
      Eric Dumazet authored
      The existing comment in mm/slab.c is *perfect*, so I reproduce it :
      
               /*
                * CPU bound tasks (e.g. network routing) can exhibit cpu bound
                * allocation behaviour: Most allocs on one cpu, most free operations
                * on another cpu. For these cases, an efficient object passing between
                * cpus is necessary. This is provided by a shared array. The array
                * replaces Bonwick's magazine layer.
                * On uniprocessor, it's functionally equivalent (but less efficient)
                * to a larger limit. Thus disabled by default.
                */
      
      As most shiped linux kernels are now compiled with CONFIG_SMP, there is no way
      a preprocessor #if can detect if the machine is UP or SMP. Better to use
      num_possible_cpus().
      
      This means on UP we allocate a 'size=0 shared array', to be more efficient.
      
      Another patch can later avoid the allocations of 'empty shared arrays', to
      save some memory.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Acked-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      364fbb29