1. 29 May, 2012 40 commits
    • mm/memcg: kill mem_cgroup_lru_del() · bbf808ed
      Konstantin Khlebnikov authored
      This patch kills mem_cgroup_lru_del(); we can use
      mem_cgroup_lru_del_list() instead.  On 0-order isolation we already have
      the right lru list id.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove lru type checks from __isolate_lru_page() · f3fd4a61
      Konstantin Khlebnikov authored
      After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
      completely remove anon/file and active/inactive lru type filters from
      __isolate_lru_page(), because isolation for 0-order reclaim always
      isolates pages from the right lru list.  And page isolation for lumpy
      shrink_inactive_list() or memory compaction is allowed to isolate
      pages from all evictable lru lists anyway.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mark mm-inline functions as __always_inline · 014483bc
      Konstantin Khlebnikov authored
      GCC sometimes ignores "inline" directives even for small and simple functions.
      This is supposed to be fixed in gcc 4.7, but it was released only yesterday.
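
      For illustration, a minimal sketch of forcing inlining with a compiler
      attribute, assuming GCC or Clang.  The names demo_always_inline,
      demo_lru and demo_is_file_lru are hypothetical and exist only for this
      example; the kernel's own __always_inline is built along similar lines.

```c
#include <assert.h>

/* Assumes GCC or Clang: plain "inline" is a hint, the attribute is not. */
#define demo_always_inline inline __attribute__((always_inline))

/* Hypothetical LRU numbering, used only by this example. */
enum demo_lru {
    DEMO_INACTIVE_ANON,
    DEMO_ACTIVE_ANON,
    DEMO_INACTIVE_FILE,
    DEMO_ACTIVE_FILE
};

/* A small helper of the kind that should never become a real call. */
static demo_always_inline int demo_is_file_lru(enum demo_lru lru)
{
    assert(lru <= DEMO_ACTIVE_FILE);
    return lru == DEMO_INACTIVE_FILE || lru == DEMO_ACTIVE_FILE;
}
```

      With the attribute, the compiler inlines the helper at every call site
      even when its own heuristics would have kept it out of line.
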
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: push lru index into shrink_[in]active_list() · 3cb99451
      Konstantin Khlebnikov authored
      Let's pass the lru index down the call stack to isolate_lru_pages(); this
      is better than reconstructing it from individual bits.
      
      [akpm@linux-foundation.org: fix kerneldoc, per Minchan]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: move reclaim_stat into lruvec · 89abfab1
      Hugh Dickins authored
      With mem_cgroup_disabled() now explicit, it becomes clear that the
      zone_reclaim_stat structure actually belongs in lruvec, per-zone when
      memcg is disabled but per-memcg per-zone when it's enabled.
      
      We can delete mem_cgroup_get_reclaim_stat(), and change
      update_page_reclaim_stat() to update just the one set of stats, the one
      which get_scan_count() will actually use.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: scanning_global_lru means mem_cgroup_disabled · c3c787e8
      Hugh Dickins authored
      Although one has to admire the skill with which it has been concealed,
      scanning_global_lru(mz) is actually just an interesting way to test
      mem_cgroup_disabled().  Too many developer hours have been wasted on
      confusing it with global_reclaim(): just use mem_cgroup_disabled().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg swap: use mem_cgroup_uncharge_swap() · 86493009
      Hugh Dickins authored
      That stuff __mem_cgroup_commit_charge_swapin() does with a swap entry, it
      has a name and even a declaration: just use mem_cgroup_uncharge_swap().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg swap: mem_cgroup_move_swap_account never needs fixup · e91cbb42
      Hugh Dickins authored
      The need_fixup arg to mem_cgroup_move_swap_account() is always false,
      so just remove it.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix/change behavior of shared anon at moving task · 4b91355e
      KAMEZAWA Hiroyuki authored
      This patch changes memcg's behavior at task_move().
      
      At task_move(), the kernel scans a task's page table and moves the charges
      for mapped pages from the source cgroup to the target cgroup.  There has
      been a long-standing bug in the handling of shared anonymous pages.
      
      Before patch:
        - The spec says 'shared anonymous pages are not moved.'
        - The implementation was 'shared anonymous pages may be moved'.
          If page_mapcount <= 2, the charges of shared anonymous pages were moved.
      
      After patch:
        - The spec says 'all anonymous pages are moved'.
        - The implementation is 'all anonymous pages are moved'.
      
      Considering how memcg is used, this will not affect the user's experience.
      'Shared anonymous' pages only exist within a tree of processes which
      don't do exec().  Moving one such process without exec() does not seem
      sane.  For example, libcgroup will not be affected by this change.
      (Anyway, no one noticed the implementation for a long time...)
      
      Below is a discussion log:
      
       - The current spec/implementation is complex.
       - Now, shared file caches are moved.
       - It adds an unclear check, page_mapcount().  To do a correct check,
         we should check swap users, etc.
       - No one noticed this implementation behavior, so no one benefits
         from the design.
       - In general, once a task is moved to a cgroup for running, it will not
         be moved....
       - Finally, we have a control knob, memory.move_charge_at_immigrate.
      
      Here is a patch to allow moving shared pages completely.  This makes
      memcg simpler and fixes the currently broken code.
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: fix memory leak on extending regions · 181eb394
      Gavin Shan authored
      The overall memblock is organized into memory regions and reserved
      regions.  Initially, the memory regions and reserved regions are
      stored in predetermined arrays of "struct memblock_region".  The arrays
      may be enlarged when we add new regions but there is no free space
      left.  The policy here is to create a double-sized array using either
      the slab allocator or the memblock allocator.  Unfortunately, we didn't
      free the old array, which might have been allocated through the slab
      allocator.  That would cause a memory leak.
      
      The patch introduces 2 variables to track where (slab or memblock) the
      memory and reserved regions come from.  The memory for the memory or
      reserved regions will be deallocated by kfree() if it was allocated by
      the slab allocator, which fixes the memory leak.
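
      The idea can be sketched in userspace C as follows.  This is only an
      illustration under stated assumptions, not the memblock code: malloc
      and free stand in for the slab allocator, a caller-provided array
      stands in for the predetermined one, and every demo_* name is
      hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct demo_region {
    unsigned long base;
    unsigned long size;
};

struct demo_region_array {
    struct demo_region *regions;
    size_t cnt;
    size_t max;
    bool from_heap; /* stand-in for "this array came from slab" */
};

/* Start with caller-provided (e.g. static) storage that must never be freed. */
static void demo_region_array_init(struct demo_region_array *a,
                                   struct demo_region *storage, size_t max)
{
    a->regions = storage;
    a->cnt = 0;
    a->max = max;
    a->from_heap = false;
}

/* Double the array; free the old one only if it was heap-allocated. */
static int demo_region_array_double(struct demo_region_array *a)
{
    assert(a->max > 0);
    size_t new_max = a->max * 2;
    struct demo_region *bigger = calloc(new_max, sizeof(*bigger));

    if (!bigger)
        return -1;
    memcpy(bigger, a->regions, a->cnt * sizeof(*bigger));
    if (a->from_heap)
        free(a->regions); /* the step whose absence caused the leak */
    a->regions = bigger;
    a->max = new_max;
    a->from_heap = true;
    return 0;
}

/* Append a region, growing the array first when it is full. */
static int demo_region_array_add(struct demo_region_array *a,
                                 unsigned long base, unsigned long size)
{
    if (a->cnt == a->max && demo_region_array_double(a) != 0)
        return -1;
    a->regions[a->cnt++] = (struct demo_region){ base, size };
    return 0;
}
```

      Recording the origin of the current array next to the array itself
      keeps the decision to free local to the resize path.
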
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: cleanup on duplicate VA/PA conversion · 4e2f0775
      Gavin Shan authored
      The overall memblock is organized into memory regions and reserved
      regions.  Initially, the memory regions and reserved regions are
      stored in predetermined arrays of "struct memblock_region".  The arrays
      may be enlarged when we add new regions but there is not enough space
      left.  In that situation, we create a double-sized array to meet the
      requirement.  However, the original implementation converted the VA
      (Virtual Address) of the newly allocated array of regions to a PA
      (Physical Address), then translated it back when the new array was
      allocated from slab.  That's actually unnecessary.
      
      The patch removes the duplicate VA/PA conversion.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix slab->page flags corruption · 5bf5f03c
      Pravin B Shelar authored
      Transparent huge pages can change page->flags (PG_compound_lock) without
      taking Slab lock.  Since THP can not break slab pages we can safely access
      compound page without taking compound lock.
      
      Specifically this patch fixes a race between compound_unlock() and slab
      functions which perform page-flags updates.  This can occur when
      get_page()/put_page() is called on a page from slab.
      
      [akpm@linux-foundation.org: tweak comment text, fix comment layout, fix label indenting]
      Reported-by: Amey Bhide <abhide@nicira.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix faulty initialization in vmalloc_init() · dbda591d
      KyongHo authored
      The transfer of ->flags causes some of the static mapping virtual
      addresses to be prematurely freed (before the mapping is removed) because
      VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set.  This might
      cause subsequent vmalloc/ioremap calls to fail because it might allocate
      one of the freed virtual address ranges that aren't unmapped.
      
      va->flags has different types of flags from tmp->flags.  If a region with
      VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
      by __purge_vmap_area_lazy().
      
      Fix vmalloc_init() to correctly initialize vmap_area for the given
      vm_struct.
      
      Also initialise va->vm.  If it is not set, find_vm_area() for the early
      vm regions will always fail.
      Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
      Cc: "Olav Haugan" <ohaugan@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition · 26c19178
      Andrea Arcangeli authored
      When holding the mmap_sem for reading, pmd_offset_map_lock should only
      run on a pmd_t that has been read atomically from the pmdp pointer,
      otherwise we may read only half of it leading to this crash.
      
      PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
       #0 [f06a9dd8] crash_kexec at c049b5ec
       #1 [f06a9e2c] oops_end at c083d1c2
       #2 [f06a9e40] no_context at c0433ded
       #3 [f06a9e64] bad_area_nosemaphore at c043401a
       #4 [f06a9e6c] __do_page_fault at c0434493
       #5 [f06a9eec] do_page_fault at c083eb45
       #6 [f06a9f04] error_code (via page_fault) at c083c5d5
          EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000
          DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
          CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
       #7 [f06a9f38] _spin_lock at c083bc14
       #8 [f06a9f44] sys_mincore at c0507b7d
       #9 [f06a9fb0] system_call at c083becd
                               start           len
          EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
          DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
          SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
          CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286
      
      This should be a longstanding bug affecting x86 32bit PAE without THP.
      Only archs with 64bit large pmd_t and 32bit unsigned long should be
      affected.
      
      With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
      would partly hide the bug when the pmd transitions from none to stable,
      by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
      enabled a new set of problems arises from the fact that the pmd could
      then transition freely between any of the none, pmd_trans_huge or
      pmd_trans_stable states.  So making the barrier in
      pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a good idea
      and it would be a flaky solution.
      
      This should be fully fixed by introducing a pmd_read_atomic that reads
      the pmd in order with THP disabled, or by reading the pmd atomically
      with cmpxchg8b with THP enabled.
      
      Luckily this new race condition only triggers in the places that must
      already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
      is localized there but this bug is not related to THP.
      
      NOTE: this can trigger on x86 32bit systems with PAE enabled with more
      than 4G of ram; otherwise the high part of the pmd never risks being
      truncated because it is zero at all times, which in turn hides the
      SMP race.
      
      This bug was discovered and fully debugged by Ulrich, quote:
      
      ----
      [..]
      pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
      eax.
      
          496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
          497 {
          498         /* depend on compiler for an atomic pmd read */
          499         pmd_t pmdval = *pmd;
      
                                      // edi = pmd pointer
      0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
      ...
                                      // edx = PTE page table high address
      0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
      ...
                                      // eax = PTE page table low address
      0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax
      
      [..]
      
      Please note that the PMD is not read atomically. These are two "mov"
      instructions where the high order bits of the PMD entry are fetched
      first. Hence, the above machine code is prone to the following race.
      
      -  The PMD entry {high|low} is 0x0000000000000000.
         The "mov" at 0xc0507a84 loads 0x00000000 into edx.
      
      -  A page fault (on another CPU) sneaks in between the two "mov"
         instructions and instantiates the PMD.
      
      -  The PMD entry {high|low} is now 0x00000003fda38067.
         The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
      ----
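
      The race in the quoted analysis can be replayed deterministically in
      plain C.  This is a single-threaded simulation, not kernel code: the
      "other CPU" is modeled as an optional store performed between the two
      32-bit loads, and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 64-bit entry stored as two 32-bit halves, as on 32-bit PAE. */
struct demo_pmd {
    uint32_t low;
    uint32_t high;
};

/*
 * Read the entry the way the quoted disassembly does: high half first,
 * low half second.  If racing_write is non-NULL it is applied between the
 * two loads, standing in for a page fault on another CPU.
 */
static uint64_t demo_torn_read(struct demo_pmd *pmd,
                               const struct demo_pmd *racing_write)
{
    assert(pmd != NULL);
    uint32_t high = pmd->high;      /* first "mov": high half */

    if (racing_write)
        *pmd = *racing_write;       /* entry instantiated meanwhile */

    uint32_t low = pmd->low;        /* second "mov": low half */
    return ((uint64_t)high << 32) | low;
}
```

      Starting from an empty entry and racing with a store of
      0x00000003fda38067, the reader assembles 0x00000000fda38067: the new
      low half combined with the stale high half.  A second read with no
      racing store returns the full value.
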
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: normalize oom scores to oom_score_adj scale only for userspace · a7f638f9
      David Rientjes authored
      The oom_score_adj scale ranges from -1000 to 1000 and represents the
      proportion of memory available to the process at allocation time.  This
      means an oom_score_adj value of 300, for example, will bias a process as
      though it was using an extra 30.0% of available memory and a value of
      -350 will discount 35.0% of available memory from its usage.
      
      The oom killer badness heuristic also uses this scale to report the oom
      score for each eligible process in determining the "best" process to
      kill.  Thus, it can only differentiate each process's memory usage by
      0.1% of system RAM.
      
      On large systems, this can end up being a large amount of memory: 256MB
      on 256GB systems, for example.
      
      This can be fixed by having the badness heuristic use the actual
      memory usage in scoring threads and then normalizing it to the
      oom_score_adj scale for userspace.  This results in better comparison
      between eligible threads for kill and no change from the userspace
      perspective.
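
      The arithmetic can be sketched as below.  This is a simplified model,
      not the kernel's badness function: usage is counted in pages, the
      helper names are hypothetical, and a 4KB page size is assumed in the
      numbers that follow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: badness measured in pages.  Each oom_score_adj
 * unit biases the score by 0.1% of total memory.
 */
static int64_t demo_badness_pages(int64_t usage_pages, int64_t total_pages,
                                  int oom_score_adj)
{
    assert(total_pages > 0);
    int64_t points = usage_pages +
                     (int64_t)oom_score_adj * total_pages / 1000;

    return points > 0 ? points : 0;
}

/* Normalize to the 0..1000 scale only when reporting to userspace. */
static int64_t demo_score_for_userspace(int64_t points, int64_t total_pages)
{
    assert(total_pages > 0);
    return points * 1000 / total_pages;
}
```

      On a 256GB machine (67108864 pages), tasks using 1010000 and 1060000
      pages differ by roughly 195MB, yet both normalize to a score of 15.
      Comparing the raw page counts still tells them apart.
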
      Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid swapping out with swappiness==0 · fe35004f
      Satoru Moriya authored
      Sometimes we'd like to avoid swapping out anonymous memory.  In
      particular, avoid swapping out pages of important processes or process
      groups while there is a reasonable amount of pagecache in RAM so that we
      can satisfy our customers' requirements.
      
      OTOH, we can control how aggressive the kernel will swap memory pages with
      /proc/sys/vm/swappiness for global and
      /sys/fs/cgroup/memory/memory.swappiness for each memcg.
      
      But with the current reclaim implementation, the kernel may swap out even
      if we set swappiness=0 and there is pagecache in RAM.
      
      This patch changes the behavior with swappiness==0.  If we set
      swappiness==0, the kernel does not swap out at all (for global reclaim,
      until the amount of free pages and filebacked pages in a zone has been
      reduced to something very very small (nr_free + nr_filebacked < high
      watermark)).
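
      The rule described above can be written as a small predicate.  This
      is a sketch under stated assumptions rather than the vmscan code: the
      struct, its field names and demo_may_scan_anon are hypothetical, and
      the memcg case is modeled as "never swap with swappiness==0".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone counters, all in pages. */
struct demo_zone_counts {
    unsigned long nr_free;
    unsigned long nr_file;
    unsigned long high_wmark;
};

/* Return true if anonymous pages may be scanned, and thus swapped out. */
static bool demo_may_scan_anon(int swappiness, bool global_reclaim,
                               const struct demo_zone_counts *z)
{
    assert(z != NULL);
    if (swappiness != 0)
        return true;
    if (!global_reclaim)
        return false;
    /* Global reclaim: only once free + file pages are nearly exhausted. */
    return z->nr_free + z->nr_file < z->high_wmark;
}
```

      With swappiness==0 the predicate stays false while the zone still has
      file pages to reclaim, and turns true only below the high watermark.
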
      Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: fix resv_map leak in error path · c50ac050
      Dave Hansen authored
      When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
      does a resv_map_alloc().  It depends on code in hugetlbfs's
      vm_ops->close() to release that allocation.
      
      However, in the mmap() failure path, we do a plain unmap_region() without
      the remove_vma() which actually calls vm_ops->close().
      
      This is a decent fix.  This leak could get reintroduced if new code (say,
      after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
      an error.  But, I think it would have to unroll the reservation anyway.
      
      Christoph's test case:
      
      	http://marc.info/?l=linux-mm&m=133728900729735
      
      This patch applies to 3.4 and later.  A version for earlier kernels is at
      https://lkml.org/lkml/2012/5/22/418.
      Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: Christoph Lameter <cl@linux.com>
      Tested-by: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[2.6.32+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/bootmem.c: cleanup on addition to bootmem data list · 5c2b8a16
      Gavin Shan authored
      The objects of "struct bootmem_data_t" are linked together to form a
      doubly linked list, sorted by their minimal page frame number.
      
      The current implementation implicitly supports the following cases,
      which means the insertion point for the current bootmem data depends on
      how "list_for_each" works.  That makes the code a little hard to read.
      Besides, "list_for_each" and "list_entry" can be replaced with
      "list_for_each_entry".
      
              - The linked list is empty.
              - There is no entry in the linked list whose minimal page
                frame number is bigger than the current one.
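
      A self-contained sketch of such a sorted insertion, with both cases
      handled by one code path.  It uses a simplified circular list with a
      sentinel head instead of the kernel's list_head helpers, and the
      demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified node: the list is circular and "head" is a sentinel. */
struct demo_bdata {
    unsigned long min_pfn;
    struct demo_bdata *prev;
    struct demo_bdata *next;
};

static void demo_list_init(struct demo_bdata *head)
{
    head->prev = head;
    head->next = head;
}

/* Insert item so the list stays sorted by ascending min_pfn. */
static void demo_link_bdata(struct demo_bdata *head, struct demo_bdata *item)
{
    struct demo_bdata *pos;

    assert(head != NULL && item != NULL);
    /*
     * Stop at the first entry with a larger min_pfn.  If the list is
     * empty, or no entry is larger, pos ends up back at the sentinel and
     * the item is appended at the tail.
     */
    for (pos = head->next; pos != head; pos = pos->next)
        if (item->min_pfn < pos->min_pfn)
            break;

    /* Link item immediately before pos. */
    item->next = pos;
    item->prev = pos->prev;
    pos->prev->next = item;
    pos->prev = item;
}
```

      Because the loop cursor falls back to the sentinel, "insert before
      pos" covers the empty list and the append case without extra branches.
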
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: consider all swapped back pages in used-once logic · e4898273
      Michal Hocko authored
      Commit 64574746 ("vmscan: detect mapped file pages used only once")
      made mapped pages have another round in the inactive list because they
      might be just short lived and so we could consider them again next time.
      This heuristic helps to reduce pressure on the active list with
      streaming IO workloads.
      
      This patch fixes a regression introduced by this commit for heavy shmem
      based workloads because unlike anon pages, which are excluded from this
      heuristic because they are usually long lived, shmem pages are handled
      as regular page cache.
      
      This doesn't work quite well, unfortunately, if the workload is mostly
      backed by shmem (in memory database sitting on 80% of memory) with a
      streaming IO in the background (backup - up to 20% of memory).  Anon
      inactive list is full of (dirty) shmem pages when watermarks are hit.
      Shmem pages are kept in the inactive list (they are referenced) in the
      first round and it is hard to reclaim anything else so we reach lower
      scanning priorities very quickly which leads to an excessive swap out.
      
      Let's fix this by excluding all swap backed pages (they tend to be long
      lived wrt. the regular page cache anyway) from the used-once heuristic
      and rather activate them if they are referenced.
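
      As a rough model of the resulting decision (a simplification for
      illustration, not the kernel's reference-checking code), with
      hypothetical names throughout:

```c
#include <assert.h>
#include <stdbool.h>

enum demo_ref_action { DEMO_RECLAIM, DEMO_KEEP, DEMO_ACTIVATE };

/*
 * swap_backed:     page is anon or shmem
 * referenced_ptes: number of page table references found in this round
 * seen_before:     page was already marked referenced in an earlier round
 */
static enum demo_ref_action demo_check_references(bool swap_backed,
                                                  int referenced_ptes,
                                                  bool seen_before)
{
    assert(referenced_ptes >= 0);
    if (referenced_ptes == 0)
        return DEMO_RECLAIM;
    /* Swap backed pages skip the used-once round entirely. */
    if (swap_backed)
        return DEMO_ACTIVATE;
    /* File pages are activated only on a second or shared reference. */
    if (seen_before || referenced_ptes > 1)
        return DEMO_ACTIVATE;
    /* First reference to a file page: one more round in the inactive list. */
    return DEMO_KEEP;
}
```

      In this model a referenced shmem page is activated immediately, while
      a file page referenced once stays on the inactive list for another
      round.
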
      
      The customer's workload is shmem backed database (80% of RAM) and they
      are measuring transactions/s with an IO in the background (20%).
      Transactions touch more or less random rows in the table.  The
      transaction rate fell by a factor of 3 (in the worst case) because of
      commit 64574746.  This patch restores the previous numbers.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: document the meminfo and vmstat fields of relevance to transparent hugepages · 69256994
      Mel Gorman authored
      Update Documentation/vm/transhuge.txt and
      Documentation/filesystems/proc.txt with some information on monitoring
      transparent huge page usage and the associated overhead.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: cleanups · 51300cef
      Andrew Morton authored
      - make pageflag_names[] const
      
      - remove null termination of pageflag_names[]
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: catch out-of-date list of page flag names · acc50c11
      Johannes Weiner authored
      String tables with names of enum items are always prone to go out of
      sync with the enums themselves.  Ensure during compile time that the
      name table of page flags has the same size as the page flags enum.
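
      The technique can be shown in standalone C11 with _Static_assert; in
      kernel code BUILD_BUG_ON() serves the same purpose.  The enum, table
      and helper below are hypothetical stand-ins, not the real page flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag enum; the last item counts the real entries. */
enum demo_pageflags {
    DEMO_PG_locked,
    DEMO_PG_dirty,
    DEMO_PG_lru,
    DEMO_NR_PAGEFLAGS
};

struct demo_flag_name {
    unsigned long mask;
    const char *name;
};

static const struct demo_flag_name demo_flag_names[] = {
    { 1UL << DEMO_PG_locked, "locked" },
    { 1UL << DEMO_PG_dirty,  "dirty"  },
    { 1UL << DEMO_PG_lru,    "lru"    },
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Adding an enum item without a table entry now fails the build. */
_Static_assert(DEMO_ARRAY_SIZE(demo_flag_names) == DEMO_NR_PAGEFLAGS,
               "flag name table out of sync with enum");

/* Look up the printable name for a single-bit mask, or NULL if unknown. */
static const char *demo_flag_name(unsigned long mask)
{
    assert(mask != 0);
    for (size_t i = 0; i < DEMO_ARRAY_SIZE(demo_flag_names); i++)
        if (demo_flag_names[i].mask == mask)
            return demo_flag_names[i].name;
    return NULL;
}
```

      The check costs nothing at run time, and it works without a
      terminating sentinel entry because the table size is taken from the
      array itself.
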
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/buddy: dump PG_compound_lock page flag · be9cd873
      Gavin Shan authored
      The array pageflag_names[] converts page flags into their
      corresponding names so that a meaningful representation of the
      corresponding page flag can be printed.  This mechanism is used while
      dumping page frames.  However, the array missed PG_compound_lock, so
      the PG_compound_lock page flag would be printed as a raw number
      instead of a meaningful string.
      
      The patch fixes that and prints "compound_lock" for the PG_compound_lock
      page flag.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move readahead syscall to mm/readahead.c · 782182e5
      Cong Wang authored
      It is better to define readahead(2) in mm/readahead.c than in
      mm/filemap.c.
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: support SEEK_DATA and SEEK_HOLE · 4fb5ef08
      Hugh Dickins authored
      It's quite easy for tmpfs to scan the radix_tree to support llseek's new
      SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still
      on my mind (in particular, the !PageUptodate-ness of pages fallocated but
      still unwritten).
      
      But I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
      would be of any use to them on tmpfs.  This code adds 92 lines and 752
      bytes on x86_64 - is that bloat or worthwhile?
      
      [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Josef Bacik <josef@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Jeff liu <jeff.liu@oracle.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: quit when fallocate fills memory · 1aac1400
      Hugh Dickins authored
      As it stands, a large fallocate() on tmpfs is liable to fill memory with
      pages, freed on failure except when they run into swap, at which point
      they become fixed into the file despite the failure.  That feels quite
      wrong, to be consuming resources precisely when they're in short supply.
      
      Go the other way instead: shmem_fallocate() indicate the range it has
      fallocated to shmem_writepage(), keeping count of pages it's allocating;
      shmem_writepage() reactivate instead of swapping out pages fallocated by
      this syscall (but happily swap out those from earlier occasions), keeping
      count; shmem_fallocate() compare counts and give up once the reactivated
      pages have started coming back to writepage (approximately: some zones
      would in fact recycle faster than others).
      
      This is a little unusual, but works well: although we could consider the
      failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
      added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.
      
      (If there's no swap, an over-large fallocate() on tmpfs is limited in the
      same way as writing: stopped by rlimit, or by tmpfs mount size if that was
      set sensibly, or by __vm_enough_memory() heuristics if OVERCOMMIT_GUESS or
      OVERCOMMIT_NEVER.  If OVERCOMMIT_ALWAYS, then it is liable to OOM-kill
      others as writing would, but stops and frees if interrupted.)
      
      Now that everything is freed on failure, we can then skip updating ctime.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: undo fallocation on failure · 1635f6a7
      Hugh Dickins authored
      In the previous episode, we left the already-fallocated pages attached to
      the file when shmem_fallocate() fails part way through.
      
      Now try to do better, by extending the earlier optimization of !Uptodate
      pages (then always under page lock) to !Uptodate pages (outside of page
      lock), representing fallocated pages.  And don't waste time clearing them
      at the time of fallocate(), leave that until later if necessary.
      
      Adapt shmem_truncate_range() to shmem_undo_range(), so that a failing
      fallocate can recognize and remove precisely those !Uptodate allocations
      which it added (and were not independently allocated by racing tasks).
      
      But unless we start playing with swapfile.c and memcontrol.c too, once one
      of our fallocated pages reaches shmem_writepage(), we do then have to
      instantiate it as an ordinarily allocated page, before swapping out.  This
      is unsatisfactory, but improved in the next episode.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: support fallocate preallocation · e2d12e22
      Hugh Dickins authored
      The systemd plumbers expressed a wish that tmpfs support preallocation.
      Cong Wang wrote a patch, but several kernel guys expressed scepticism:
      https://lkml.org/lkml/2011/11/18/137
      
      Christoph Hellwig: What for exactly? Please explain why preallocating on
      tmpfs would make any sense.
      
      Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files
      on the /dev/shm filesystem.  The glibc fallback loop for -ENOSYS [or
      -EOPNOTSUPP] on fallocate is just ugly.
      
      Hugh Dickins: If tmpfs is going to support
      fallocate(FALLOC_FL_PUNCH_HOLE), it would seem perverse to permit the
      deallocation but fail the allocation.  Christoph Hellwig: Agreed.
      
      Now that we do have shmem_fallocate() for hole-punching, plumb in basic
      support for preallocation mode too.  It's fairly straightforward (though
      quite a few details needed attention), except for when it fails part way
      through.  What a pity that fallocate(2) was not specified to return the
      length allocated, permitting short fallocations!
      
      As it is, when it fails part way through, we ought to free what has just
      been allocated by this system call; but must be very sure not to free any
      allocated earlier, or any allocated by racing accesses (not all excluded
      by i_mutex).
      
      But we cannot distinguish them: so in this patch simply leak allocations
      on partial failure (they will be freed later if the file is removed).
      
      An attractive alternative approach would have been for fallocate() not to
      allocate pages at all, but note reservations by entries in the radix-tree.
      But that would give less assurance, and, critically, would be hard to fit
      with mem cgroups (who owns the reservations?): allocating pages lets
      fallocate() behave in just the same way as write().
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fs: remove truncate_range · 17cf28af
      Hugh Dickins authored
      Remove vmtruncate_range(), and remove the truncate_range method from
      struct inode_operations: only tmpfs ever supported it, and tmpfs has now
      converted over to using the fallocate method of file_operations.
      
      Update Documentation accordingly, adding (setlease and) fallocate lines.
      And while we're in mm.h, remove duplicate declarations of shmem_lock() and
      shmem_file_setup(): everyone is now using the ones in shmem_fs.h.
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE · 3f31d075
      Hugh Dickins authored
      Now tmpfs supports hole-punching via fallocate(), switch madvise_remove()
      to use do_fallocate() instead of vmtruncate_range(): which extends
      madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs.
      
      There is one more user of vmtruncate_range() in our tree,
      staging/android's ashmem_shrink(): convert it to use do_fallocate() too
      (but if its unpinned areas are already unmapped - I don't know - then it
      would do better to use shmem_truncate_range() directly).
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@android.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE · 83e4fa9c
      Hugh Dickins authored
      tmpfs has supported hole-punching since 2.6.16, via
      madvise(,,MADV_REMOVE).
      
      But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is
      the agreed way to punch holes.
      
      So add shmem_fallocate() to support that, and tweak shmem_truncate_range()
      to support partial pages at both the beginning and end of range (never
      needed for madvise, which demands rounded addr and rounds up length).
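
The partial-page point can be illustrated with plain arithmetic: a byte-granular range generally splits into a partial head page, some whole pages, and a partial tail page. The sketch below is only that arithmetic, not the logic of shmem_truncate_range(); `struct punch_split` and `split_punch_range` are hypothetical names, and the page size is passed in by the caller rather than taken from the system.

```c
#include <stddef.h>

struct punch_split {
	unsigned long head_bytes;   /* partial page at the start of the range */
	unsigned long whole_pages;  /* pages lying entirely inside the range */
	unsigned long tail_bytes;   /* partial page at the end of the range */
};

/* Split the byte range [offset, offset + len) along page boundaries. */
static struct punch_split split_punch_range(unsigned long offset,
					    unsigned long len,
					    unsigned long page_size)
{
	struct punch_split s = { 0, 0, 0 };
	unsigned long end = offset + len;
	/* first page boundary at or after offset */
	unsigned long first_whole = (offset + page_size - 1) / page_size * page_size;
	/* last page boundary at or before end */
	unsigned long last_whole = end / page_size * page_size;

	if (first_whole > last_whole) {
		/* The range starts and ends inside a single page. */
		s.head_bytes = len;
		return s;
	}
	s.head_bytes = first_whole - offset;
	s.whole_pages = (last_whole - first_whole) / page_size;
	s.tail_bytes = end - last_whole;
	return s;
}
```

With 4096-byte pages, a punch at offset 100 of length 8192 splits into 3996 head bytes, one whole page, and 100 tail bytes. A madvise-style request (aligned start, length rounded up to a page multiple) always yields zero head and tail bytes, which is why the partial cases never arose before.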
      Based-on-patch-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: optimize clearing when writing · ec9516fb
      Hugh Dickins authored
      Nick proposed years ago that tmpfs should avoid clearing its pages where
      write will overwrite them with new data, as ramfs has long done.  But I
      messed it up and just got bad data.  Tried again recently, it works
      fine.
      
      Here's time output for writing 4GiB 16 times on this Core i5 laptop:
      
      before: real	0m21.169s user	0m0.028s sys	0m21.057s
              real	0m21.382s user	0m0.016s sys	0m21.289s
              real	0m21.311s user	0m0.020s sys	0m21.217s
      
      after:  real	0m18.273s user	0m0.032s sys	0m18.165s
              real	0m18.354s user	0m0.020s sys	0m18.265s
              real	0m18.440s user	0m0.032s sys	0m18.337s
      
      ramfs:  real	0m16.860s user	0m0.028s sys	0m16.765s
              real	0m17.382s user	0m0.040s sys	0m17.273s
              real	0m17.133s user	0m0.044s sys	0m17.021s
      
      Yes, I have done perf reports, but they need more explanation than they
      deserve: in summary, clear_page vanishes, its cache loading shifts into
      copy_user_generic_unrolled; shmem_getpage_gfp goes down, and
      surprisingly mark_page_accessed goes way up - I think because they are
      respectively where the cache gets to be reloaded after being purged by
      clear or copy.
      Suggested-by: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: enable NOSEC optimization · 2f6e38f3
      Hugh Dickins authored
      Let tmpfs into the NOSEC optimization (avoiding file_remove_suid()
      overhead on most common writes): set MS_NOSEC on its superblocks.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: replace page if mapping excludes its zone · bde05d1c
      Hugh Dickins authored
      The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the
      backing RAM has to be below 4GB.  Not a problem while the boards
      supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
      their GMA3600 is managed by the GMA500 driver.
      
      shmem/tmpfs has never pretended to support hardware restrictions on the
      backing memory, but it might have appeared to do so before v3.1, and
      even now it works fine until a page is swapped out then back in.  When
      read_cache_page_gfp() supplied a freshly allocated page for copy, that
      compensated for whatever choice might have been made by earlier swapin
      readahead; but swapoff was likely to destroy the illusion.
      
      We'd like to continue to support GMA500, so now add a new
      shmem_should_replace_page() check on the zone when about to move a page
      from swapcache to filecache (in swapin and swapoff cases), with
      shmem_replace_page() to allocate and substitute a suitable page (given
      gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
      
      This does involve a minor extension to mem_cgroup_replace_page_cache()
      (the page may or may not have already been charged); and I've removed a
      comment and call to mem_cgroup_uncharge_cache_page(), which in fact is
      always a no-op while PageSwapCache.
      
      Also removed optimization of an unlikely path in shmem_getpage_gfp(),
      now that we need to check PageSwapCache more carefully (a racing caller
      might already have made the copy).  And at one point shmem_unuse_inode()
      needs to use the hitherto private page_swapcount(), to guard against
      racing with inode eviction.
      
      It would make sense to extend shmem_should_replace_page(), to cover
      cpuset and NUMA mempolicy restrictions too, but set that aside for now:
      needs a cleanup of shmem mempolicy handling, and more testing, and ought
      to handle swap faults in do_swap_page() as well as shmem.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Stephane Marchesin <marcheu@chromium.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks · 5ceb9ce6
      Bartlomiej Zolnierkiewicz authored
      When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
      pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
      allocation takes ownership of the block may take too long.  The type of
      the pageblock remains unchanged so the pageblock cannot be used as a
      migration target during compaction.
      
      Fix it by:
      
      * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
        COMPACT_SYNC) and then converting sync field in struct compact_control
        to use it.
      
      * Adding nr_pageblocks_skipped field to struct compact_control and
        tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
        If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
        try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
        suitable page for allocation.  In this case, check whether there were
        enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
        COMPACT_ASYNC_UNMOVABLE mode.
      
      * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
        COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
        finding PageBuddy pages, page_count(page) == 0 or PageLRU pages.  If all
        pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
        sets change the whole pageblock type to MIGRATE_MOVABLE.
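
The rule in the last bullet can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel code: `struct model_page` and `pageblock_can_become_movable` are hypothetical stand-ins for struct page and for whatever helper the patch actually adds, and the three fields mirror PageBuddy, PageLRU and page_count(page).

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the per-page state the scan inspects. */
struct model_page {
	bool buddy;     /* PageBuddy: free in the buddy allocator */
	bool lru;       /* PageLRU: on an LRU list */
	int  refcount;  /* page_count(page) */
};

/*
 * True when every page in the block is in one of the three sets named
 * above, i.e. the block holds nothing that pins it as unmovable.
 */
static bool pageblock_can_become_movable(const struct model_page *pages,
					 size_t n)
{
	for (size_t i = 0; i < n; i++) {
		const struct model_page *p = &pages[i];

		if (!(p->buddy || p->refcount == 0 || p->lru))
			return false;
	}
	return true;
}
```

A single page that is in use, off the LRU and not free (for example a live kernel allocation) is enough to keep the block MIGRATE_UNMOVABLE in this model.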
      
      My particular test case (on an ARM EXYNOS4 device with 512 MiB, which means
      131072 standard 4KiB pages in 'Normal' zone) is to:
      
      - allocate 120000 pages for kernel's usage
      - free every second page (60000 pages) of memory just allocated
      - allocate and use 60000 pages from user space
      - free remaining 60000 pages of kernel memory
        (now we have fragmented memory occupied mostly by user space pages)
      - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage
      
      The results:
      - with compaction disabled I get 11 successful allocations
      - with compaction enabled - 14 successful allocations
      - with this patch I'm able to get all 100 successful allocations
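
The figures in the test case can be checked with a short calculation. The helpers below are hypothetical and assume only that an order-n allocation is 2^n contiguous base pages.

```c
/* Number of base pages contained in mem_bytes of memory. */
static unsigned long pages_in(unsigned long mem_bytes, unsigned long page_bytes)
{
	return mem_bytes / page_bytes;
}

/* Size in bytes of one order-n allocation: 2^n contiguous base pages. */
static unsigned long order_bytes(unsigned int order, unsigned long page_bytes)
{
	return (1UL << order) * page_bytes;
}
```

With 4 KiB pages, `pages_in(512UL << 20, 4096)` gives the 131072 pages quoted above, and `order_bytes(9, 4096)` gives 2048 KiB, so the 100 order-9 requests ask for 51200 pages (200 MiB) of contiguous blocks out of a zone in which 60000 scattered pages are still held by user space.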
      
      NOTE: If we can make kswapd aware of order-0 request during compaction, we
      can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
      (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE).  Please see the
      following thread:
      
      	http://marc.info/?l=linux-mm&m=133552069417068&w=2
      
      [minchan@kernel.org: minor cleanups]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove sparsemem allocation details from the bootmem allocator · 238305bb
      Johannes Weiner authored
      alloc_bootmem_section() derives allocation area constraints from the
      specified sparsemem section.  This is a bit specific for a generic memory
      allocator like bootmem, though, so move it over to sparsemem.
      
      As __alloc_bootmem_node_nopanic() already retries failed allocations with
      relaxed area constraints, the fallback code in sparsemem.c can be removed
      and the code becomes a bit more compact overall.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: pass pgdat instead of pgdat->bdata down the stack · e9079911
      Johannes Weiner authored
      Pass down the node descriptor instead of the more specific bootmem node
      descriptor down the call stack, like nobootmem does, when there is no good
      reason for the two to be different.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: nobootmem: unify allocation policy of (non-)panicking node allocations · ba539868
      Johannes Weiner authored
      While the panicking node-specific allocation function tries to satisfy
      node+goal, goal, node, anywhere, the non-panicking function still does
      node+goal, goal, anywhere.
      
      Make it simpler: define the panicking version in terms of the non-panicking
      one, like the node-agnostic interface, so they always behave the same way
      apart from how to deal with allocation failure.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: nobootmem: panic on node-specific allocation failure · 2c478eae
      Johannes Weiner authored
      __alloc_bootmem_node and __alloc_bootmem_low_node documentation claims
      the functions panic on allocation failure.  Do it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: unify allocation policy of (non-)panicking node allocations · 421456ed
      Johannes Weiner authored
      While the panicking node-specific allocation function tries to satisfy
      node+goal, goal, node, anywhere, the non-panicking function still does
      node+goal, goal, anywhere.
      
      Make it simpler: define the panicking version in terms of the
      non-panicking one, like the node-agnostic interface, so they always behave
      the same way apart from how to deal with allocation failure.
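
The shape of that unification can be sketched in userspace C. Everything here is a hypothetical model rather than the bootmem code: `try_alloc_fn`, `ANY_NODE`, the two wrapper names and the demo callback are invented for illustration, a goal of 0 stands for "no goal", and abort() stands in for the kernel's panic().

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define ANY_NODE (-1)

/* Hypothetical low-level allocator: NULL when the constraints cannot be met. */
typedef void *(*try_alloc_fn)(int node, unsigned long goal, size_t size);

/* Non-panicking variant: relax constraints step by step, NULL if all fail. */
static void *alloc_node_nopanic(try_alloc_fn try_alloc, int node,
				unsigned long goal, size_t size)
{
	void *p;

	if ((p = try_alloc(node, goal, size)))      /* node + goal */
		return p;
	if ((p = try_alloc(ANY_NODE, goal, size)))  /* goal */
		return p;
	if ((p = try_alloc(node, 0, size)))         /* node */
		return p;
	return try_alloc(ANY_NODE, 0, size);        /* anywhere */
}

/* Panicking variant: same policy, differs only in how failure is handled. */
static void *alloc_node(try_alloc_fn try_alloc, int node,
			unsigned long goal, size_t size)
{
	void *p = alloc_node_nopanic(try_alloc, node, goal, size);

	if (!p) {
		fprintf(stderr, "out of memory: %zu bytes\n", size);
		abort();  /* stands in for panic() */
	}
	return p;
}

/* Demo callback: succeeds only without constraints and counts attempts. */
static int demo_attempts;
static char demo_buf[64];

static void *demo_try_anywhere_only(int node, unsigned long goal, size_t size)
{
	demo_attempts++;
	if (node == ANY_NODE && goal == 0 && size <= sizeof(demo_buf))
		return demo_buf;
	return NULL;
}
```

Because the panicking wrapper contains no fallback logic of its own, the two entry points cannot drift apart again: any change to the fallback order happens in one place.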
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>