1. 20 May, 2016 40 commits
    • Joonsoo Kim's avatar
      mm/page_owner: add zone range overlapping check · 9d43f5ae
      Joonsoo Kim authored
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to care this overlapping when iterating pfn range.
      
      There are one place in page_owner.c that iterates pfn range and it
      doesn't consider this overlapping.  Add it.
      
      Without this patch, above system could over count early allocated page
      number before page_owner is activated.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d43f5ae
    • Joonsoo Kim's avatar
      mm/vmstat: add zone range overlapping check · a91c43c7
      Joonsoo Kim authored
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to care this overlapping when iterating pfn range.
      
      There are two places in vmstat.c that iterates pfn range and they don't
      consider this overlapping.  Add it.
      
      Without this patch, above system could over count pageblock number on a
      zone.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a91c43c7
    • Joonsoo Kim's avatar
      mm/memory_hotplug: add comment to some functions related to memory hotplug · b9eb6319
      Joonsoo Kim authored
      __offline_isolated_pages() and test_pages_isolated() are used by memory
      hotplug.  These functions require that range is in a single zone but
      there is no code to do this because memory hotplug checks it before
      calling these functions.  To avoid confusing future user of these
      functions, this patch adds comments to them.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9eb6319
    • Joonsoo Kim's avatar
      mm/hugetlb: add same zone check in pfn_range_valid_gigantic() · f44b2dda
      Joonsoo Kim authored
      This patchset deals with some problematic sites that iterate pfn ranges.
      
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to take care of this overlapping when iterating pfn
      range.
      
      I audit many iterating sites that uses pfn_valid(), pfn_valid_within(),
      zone_start_pfn and etc.  and others looks safe to me.  This is a
      preparation step for a new CMA implementation, ZONE_CMA
      (https://lkml.org/lkml/2015/2/12/95), because it would be easily
      overlapped with other zones.  But, zone overlap check is also needed for
      the general case so I send it separately.
      
      This patch (of 5):
      
      alloc_gigantic_page() uses alloc_contig_range() and this requires that
      the requested range is in a single zone.  To satisfy this requirement,
      add this check to pfn_range_valid_gigantic().
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f44b2dda
    • Andrew Morton's avatar
      mm: uninline page_mapped() · 1aa8aea5
      Andrew Morton authored
      It's huge.  Uninlining it saves 206 bytes per callsite.  Shaves 4924
      bytes from the x86_64 allmodconfig vmlinux.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1aa8aea5
    • Chanho Min's avatar
      mm/highmem: simplify is_highmem() · 29f9cb53
      Chanho Min authored
      is_highmem() can be simplified by use of is_highmem_idx().  This patch
      removes redundant code and will make it easier to maintain if the zone
      policy is changed or a new zone is added.
      
      (akpm: saves me 25 bytes of text per is_highmem() callsite)
      Signed-off-by: default avatarChanho Min <chanho.min@lge.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29f9cb53
    • Vlastimil Babka's avatar
      mm, compaction: skip blocks where isolation fails in async direct compaction · fdd048e1
      Vlastimil Babka authored
      The goal of direct compaction is to quickly make a high-order page
      available for the pending allocation.  Within an aligned block of pages
      of desired order, a single allocated page that cannot be isolated for
      migration means that the block cannot fully merge to a buddy page that
      would satisfy the allocation request.  Therefore we can reduce the
      allocation stall by skipping the rest of the block immediately on
      isolation failure.  For async compaction, this also means a higher
      chance of succeeding until it detects contention.
      
      We however shouldn't completely sacrifice the second objective of
      compaction, which is to reduce overal long-term memory fragmentation.
      As a compromise, perform the eager skipping only in direct async
      compaction, while sync compaction (including kcompactd) remains
      thorough.
      
      Testing was done using stress-highalloc from mmtests, configured for
      order-4 GFP_KERNEL allocations:
      
                                       4.6-rc1               4.6-rc1
                                        before                 after
        Success 1 Min         24.00 (  0.00%)       27.00 (-12.50%)
        Success 1 Mean        30.20 (  0.00%)       31.60 ( -4.64%)
        Success 1 Max         37.00 (  0.00%)       35.00 (  5.41%)
        Success 2 Min         42.00 (  0.00%)       32.00 ( 23.81%)
        Success 2 Mean        44.00 (  0.00%)       44.80 ( -1.82%)
        Success 2 Max         48.00 (  0.00%)       52.00 ( -8.33%)
        Success 3 Min         91.00 (  0.00%)       92.00 ( -1.10%)
        Success 3 Mean        92.20 (  0.00%)       92.80 ( -0.65%)
        Success 3 Max         94.00 (  0.00%)       93.00 (  1.06%)
      
      We can see that success rates are unaffected by the skipping.
      
                      4.6-rc1     4.6-rc1
                       before       after
        User         2587.42     2566.53
        System        482.89      471.20
        Elapsed      1395.68     1382.00
      
      Times are not so useful metric for this benchmark as main portion is the
      interfering kernel builds, but results do hint at reduced system times.
      
                                            4.6-rc1     4.6-rc1
                                             before       after
        Direct pages scanned                163614      159608
        Kswapd pages scanned               2070139     2078790
        Kswapd pages reclaimed             2061707     2069757
        Direct pages reclaimed              163354      159505
      
      Reduced direct reclaim was unintended, but could be explained by more
      successful first attempt at (async) direct compaction, which is
      attempted before the first reclaim attempt in __alloc_pages_slowpath().
      
        Compaction stalls                    33052       39853
        Compaction success                   12121       19773
        Compaction failures                  20931       20079
      
      Compaction is indeed more successful, and thus less likely to get
      deferred, so there are also more direct compaction stalls.
      
        Page migrate success               3781876     3326819
        Page migrate failure                 45817       41774
        Compaction pages isolated          7868232     6941457
        Compaction migrate scanned       168160492   127269354
        Compaction migrate prescanned            0           0
        Compaction free scanned         2522142582  2326342620
        Compaction free direct alloc             0           0
        Compaction free dir. all. miss           0           0
        Compaction cost                       5252        4476
      
      The patch reduces migration scanned pages by 25% thanks to the eager
      skipping.
      
      [hughd@google.com: prevent nr_isolated_* from going negative]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fdd048e1
    • Vlastimil Babka's avatar
      mm, compaction: reduce spurious pcplist drains · a34753d2
      Vlastimil Babka authored
      Compaction drains the local pcplists each time migration scanner moves
      away from a cc->order aligned block where it isolated pages for
      migration, so that the pages freed by migrations can merge into higher
      orders.
      
      The detection is currently coarser than it could be.  The
      cc->last_migrated_pfn variable should track the lowest pfn that was
      isolated for migration.  But it is set to the pfn where
      isolate_migratepages_block() starts scanning, which is typically the
      first pfn of the pageblock.  There, the scanner might fail to isolate
      several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in
      another block.  This would cause the pcplists drain to be performed,
      although the scanner didn't yet finish the block where it isolated from.
      
      This patch thus makes cc->last_migrated_pfn handling more accurate by
      setting it to the pfn of an actually isolated page in
      isolate_migratepages_block().  Although practical effects of this patch
      are likely low, it arguably makes the intent of the code more obvious.
      Also the next patch will make async direct compaction skip blocks more
      aggressively, and draining pcplists due to skipped blocks is wasteful.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a34753d2
    • Vlastimil Babka's avatar
      mm, compaction: wrap calculating first and last pfn of pageblock · 06b6640a
      Vlastimil Babka authored
      Compaction code has accumulated numerous instances of manual
      calculations of the first (inclusive) and last (exclusive) pfn of a
      pageblock (or a smaller block of given order), given a pfn within the
      pageblock.
      
      Wrap these calculations by introducing pageblock_start_pfn(pfn) and
      pageblock_end_pfn(pfn) macros.
      
      [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      06b6640a
    • Konstantin Khlebnikov's avatar
      mm/rmap: replace BUG_ON(anon_vma->degree) with VM_WARN_ON · e4c5800a
      Konstantin Khlebnikov authored
      This check effectively catches anon vma hierarchy inconsistence and some
      vma corruptions.  It was effective for catching corner cases in anon vma
      reusing logic.  For now this code seems stable so check could be hidden
      under CONFIG_DEBUG_VM and replaced with WARN because it's not so fatal.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Suggested-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4c5800a
    • Andrew Morton's avatar
      mm/mempolicy.c:offset_il_node() document and clarify · fee83b3a
      Andrew Morton authored
      This code was pretty obscure and was relying upon obscure side-effects
      of next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to
      -1.
      
      Clean that all up and document the function's intent.
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fee83b3a
    • Andrew Morton's avatar
      mm/hugetlb.c: use first_memory_node · 54f18d35
      Andrew Morton authored
      Instead of open-coding it.
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54f18d35
    • Li Zhang's avatar
      mm/page_alloc: Remove useless parameter of __free_pages_boot_core · 949698a3
      Li Zhang authored
      __free_pages_boot_core has parameter pfn which is not used at all.
      Remove it.
      Signed-off-by: default avatarLi Zhang <zhlcindy@linux.vnet.ibm.com>
      Reviewed-by: default avatarPan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      949698a3
    • Michal Hocko's avatar
      mm/memcontrol.c:mem_cgroup_select_victim_node(): clarify comment · fda3d69b
      Michal Hocko authored
      > The comment seems to have not much to do with the code?
      
      I guess the comment tries to say that the code path is triggered when we
      charge the page which happens _before_ it is added to the LRU list and
      so last_scanned_node might contain the stale data.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fda3d69b
    • Yaowei Bai's avatar
      mm/mempolicy.c: vma_migratable() can return bool · 4ee815be
      Yaowei Bai authored
      Make vma_migratable() return bool due to this particular function only
      using either one or zero as its return value.
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ee815be
    • Yaowei Bai's avatar
      mm/vmalloc.c: is_vmalloc_addr() can return bool · bb00a789
      Yaowei Bai authored
      Make is_vmalloc_addr() return bool to improve readability due to this
      particular function only using either one or zero as its return value.
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb00a789
    • Yaowei Bai's avatar
      mm/memory_hotplug: is_mem_section_removable() can return bool · c98940f6
      Yaowei Bai authored
      Make is_mem_section_removable() return bool to improve readability due
      to this particular function only using either one or zero as its return
      value.
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c98940f6
    • Yaowei Bai's avatar
      mm/hugetlb: is_vm_hugetlb_page() can return bool · 32f6271d
      Yaowei Bai authored
      Make is_vm_hugetlb_page() return bool to improve readability due to this
      particular function only using either one or zero as its return value.
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      32f6271d
    • Vaishali Thakkar's avatar
      x86: mm: use hugetlb_bad_size() · 2b18e532
      Vaishali Thakkar authored
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b18e532
    • Vaishali Thakkar's avatar
      tile: mm: use hugetlb_bad_size() · b3d424f1
      Vaishali Thakkar authored
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3d424f1
    • Vaishali Thakkar's avatar
      powerpc: mm: use hugetlb_bad_size() · 71bf79cc
      Vaishali Thakkar authored
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71bf79cc
    • Vaishali Thakkar's avatar
      metag: mm: use hugetlb_bad_size() · 9cc3387f
      Vaishali Thakkar authored
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9cc3387f
    • Vaishali Thakkar's avatar
      arm64: mm: use hugetlb_bad_size() · d77e20ce
      Vaishali Thakkar authored
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d77e20ce
    • Vaishali Thakkar's avatar
      mm/hugetlb: introduce hugetlb_bad_size() · 9fee021d
      Vaishali Thakkar authored
      When any unsupported hugepage size is specified, 'hugepagesz=' and
      'hugepages=' should be ignored during command line parsing until any
      supported hugepage size is found.  But currently incorrect number of
      hugepages are allocated when unsupported size is specified as it fails
      to ignore the 'hugepages=' command.
      
      Test case:
      
      Note that this is specific to x86 architecture.
      
      Boot the kernel with command line option 'hugepagesz=256M hugepages=X'.
      After boot, dmesg output shows that X number of hugepages of the size 2M
      is pre-allocated instead of 0.
      
      So, to handle such command line options, introduce new routine
      hugetlb_bad_size.  The routine hugetlb_bad_size sets the global variable
      parsed_valid_hugepagesz.  We are using parsed_valid_hugepagesz to save
      the state when unsupported hugepagesize is found so that we can ignore
      the 'hugepages=' parameters after that and then reset the variable when
      supported hugepage size is found.
      
      The routine hugetlb_bad_size can be called while setting 'hugepagesz='
      parameter in an architecture specific code.
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9fee021d
    • Mike Kravetz's avatar
      mm/hugetlb: optimize minimum size (min_size) accounting · 09a95e29
      Mike Kravetz authored
      It was observed that minimum size accounting associated with the
      hugetlbfs min_size mount option may not perform optimally and as
      expected.  As huge pages/reservations are released from the filesystem
      and given back to the global pools, they are reserved for subsequent
      filesystem use as long as the subpool reserved count is less than
      subpool minimum size.  It does not take into account used pages within
      the filesystem.  The filesystem size limits are not exceeded and this is
      technically not a bug.  However, better behavior would be to wait for
      the number of used pages/reservations associated with the filesystem to
      drop below the minimum size before taking reservations to satisfy
      minimum size.
      
      An optimization is also made to the hugepage_subpool_get_pages() routine
      which is called when pages/reservations are allocated.  This does not
      change behavior, but simply avoids the accounting if all reservations
      have already been taken (subpool reserved count == 0).
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09a95e29
    • Andrew Morton's avatar
      include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton authored
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
    • Rasmus Villemoes's avatar
      include/linux: apply __malloc attribute · 48a27055
      Rasmus Villemoes authored
      Attach the malloc attribute to a few allocation functions.  This helps
      gcc generate better code by telling it that the return value doesn't
      alias any existing pointers (which is even more valuable given the
      pessimizations implied by -fno-strict-aliasing).
      
      A simple example of what this allows gcc to do can be seen by looking at
      the last part of drm_atomic_helper_plane_reset:
      
      	plane->state = kzalloc(sizeof(*plane->state), GFP_KERNEL);
      
      	if (plane->state) {
      		plane->state->plane = plane;
      		plane->state->rotation = BIT(DRM_ROTATE_0);
      	}
      
      which compiles to
      
          e8 99 bf d6 ff          callq  ffffffff8116d540 <kmem_cache_alloc_trace>
          48 85 c0                test   %rax,%rax
          48 89 83 40 02 00 00    mov    %rax,0x240(%rbx)
          74 11                   je     ffffffff814015c4 <drm_atomic_helper_plane_reset+0x64>
          48 89 18                mov    %rbx,(%rax)
          48 8b 83 40 02 00 00    mov    0x240(%rbx),%rax [*]
          c7 40 40 01 00 00 00    movl   $0x1,0x40(%rax)
      
      With this patch applied, the instruction at [*] is elided, since the
      store to plane->state->plane is known to not alter the value of
      plane->state.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48a27055
    • Rasmus Villemoes's avatar
      compiler.h: add support for malloc attribute · d64e85d3
      Rasmus Villemoes authored
      gcc as far back as at least 3.04 documents the function attribute
      __malloc__.  Add a shorthand for attaching that to a function
      declaration.  This was also suggested by Andi Kleen way back in 2002
      [1], but didn't get applied, perhaps because gcc at that time generated
      the exact same code with and without this attribute.
      
      This attribute tells the compiler that the return value (if non-NULL)
      can be assumed not to alias any other valid pointers at the time of the
      call.
      
      Please note that the documentation for a range of gcc versions (starting
      from around 4.7) contained a somewhat confusing and self-contradicting
      text:
      
        The malloc attribute is used to tell the compiler that a function may
        be treated as if any non-NULL pointer it returns cannot alias any other
        pointer valid when the function returns and *that the memory has
        undefined content*.  [...] Standard functions with this property include
        malloc and *calloc*.
      
      (emphasis mine). The intended meaning has later been clarified [2]:
      
        This tells the compiler that a function is malloc-like, i.e., that the
        pointer P returned by the function cannot alias any other pointer valid
        when the function returns, and moreover no pointers to valid objects
        occur in any storage addressed by P.
      
      What this means is that we can apply the attribute to kmalloc and
      friends, and it is ok for the returned memory to have well-defined
      contents (__GFP_ZERO).  But it is not ok to apply it to kmemdup(), nor
      to other functions which both allocate and possibly initialize the
      memory with existing pointers.  So unless someone is doing something
      pretty perverted kstrdup() should also be a fine candidate.
      
      [1] http://thread.gmane.org/gmane.linux.kernel/57172
      [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56955Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d64e85d3
    • Joonsoo Kim's avatar
      mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Joonsoo Kim authored
      Many developers already know that field for reference count of the
      struct page is _count and atomic type.  They would try to handle it
      directly and this could break the purpose of page reference count
      tracepoint.  To prevent direct _count modification, this patch rename it
      to _refcount and add warning message on the code.  After that, developer
      who need to handle reference count will find that field should not be
      accessed directly.
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0139aa7b
    • Joonsoo Kim's avatar
      mm/page_ref: use page_ref helper instead of direct modification of _count · 6d061f9f
      Joonsoo Kim authored
      page_reference manipulation functions are introduced to track down
      reference count change of the page.  Use it instead of direct
      modification of _count.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d061f9f
    • Li Peng's avatar
      mm/slub.c: fix sysfs filename in comment · 43efd3ea
      Li Peng authored
      /sys/kernel/slab/xx/defrag_ratio should be remote_node_defrag_ratio.
      
      Link: http://lkml.kernel.org/r/1463449242-5366-1-git-send-email-lip@dtdream.comSigned-off-by: default avatarLi Peng <lip@dtdream.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43efd3ea
    • Yang Shi's avatar
      mm: slab: remove ZONE_DMA_FLAG · a3187e43
      Yang Shi authored
      Now we have IS_ENABLED helper to check if a Kconfig option is enabled or
      not, so ZONE_DMA_FLAG sounds no longer useful.
      
      And, the use of ZONE_DMA_FLAG in slab looks pointless according to the
      comment [1] from Johannes Weiner, so remove them and ORing passed in
      flags with the cache gfp flags has been done in kmem_getpages().
      
      [1] https://lkml.org/lkml/2014/9/25/553
      
      Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.orgSigned-off-by: default avatarYang Shi <yang.shi@linaro.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3187e43
    • Thomas Garnier's avatar
      mm: SLAB freelist randomization · c7ce4f60
      Thomas Garnier authored
      Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
      the SLAB freelist.  The list is randomized during initialization of a
      new set of pages.  The order on different freelist sizes is pre-computed
      at boot for performance.  Each kmem_cache has its own randomized
      freelist.  Before pre-computed lists are available freelists are
      generated dynamically.  This security feature reduces the predictability
      of the kernel SLAB allocator against heap overflows rendering attacks
      much less stable.
      
      For example this attack against SLUB (also applicable against SLAB)
      would be affected:
      
        https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/
      
      Also, since v4.6 the freelist was moved at the end of the SLAB.  It
      means a controllable heap is opened to new attacks not yet publicly
      discussed.  A kernel heap overflow can be transformed to multiple
      use-after-free.  This feature makes this type of attack harder too.
      
      To generate entropy, we use get_random_bytes_arch because 0 bits of
      entropy is available in the boot stage.  In the worse case this function
      will fallback to the get_random_bytes sub API.  We also generate a shift
      random number to shift pre-computed freelist for each new set of pages.
      
      The config option name is not specific to the SLAB as this approach will
      be extended to other allocators like SLUB.
      
      Performance results highlighted no major changes:
      
      Hackbench (running 90 10 times):
      
        Before average: 0.0698
        After average: 0.0663 (-5.01%)
      
      slab_test 1 run on boot.  Difference only seen on the 2048 size test
      being the worse case scenario covered by freelist randomization.  New
      slab pages are constantly being created on the 10000 allocations.
      Variance should be mainly due to getting new pages every few
      allocations.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
        10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
        10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
        10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
        10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
        10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
        10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
        10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
        10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
        10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
        10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
        10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 121 cycles
        10000 times kmalloc(64)/kfree -> 121 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 121 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
        10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
        10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
        10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
        10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
        10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
        10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
        10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
        10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
        10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
        10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
        10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 123 cycles
        10000 times kmalloc(64)/kfree -> 142 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 119 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
      Signed-off-by: default avatarThomas Garnier <thgarnie@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7ce4f60
    • Vladimir Davydov's avatar
      mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink() · 81ae6d03
      Vladimir Davydov authored
      When we call __kmem_cache_shrink on memory cgroup removal, we need to
      synchronize kmem_cache->cpu_partial update with put_cpu_partial that
      might be running on other cpus.  Currently, we achieve that by using
      kick_all_cpus_sync, which works as a system wide memory barrier.  Though
      fast it is, this method has a flaw - it issues a lot of IPIs, which
      might hurt high performance or real-time workloads.
      
      To fix this, let's replace kick_all_cpus_sync with synchronize_sched.
      Although the latter one may take much longer to finish, it shouldn't be
      a problem in this particular case, because memory cgroups are destroyed
      asynchronously from a workqueue so that no user visible effects should
      be introduced.  OTOH, it will save us from excessive IPIs when someone
      removes a cgroup.
      
      Anyway, even if using synchronize_sched turns out to take too long, we
      can always introduce a kind of __kmem_cache_shrink batching so that this
      method would only be called once per one cgroup destruction (not per
      each per memcg kmem cache as it is now).
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81ae6d03
    • Joonsoo Kim's avatar
      mm/slab: lockless decision to grow cache · 801faf0d
      Joonsoo Kim authored
      To check whether free objects exist or not precisely, we need to grab a
      lock.  But, accuracy isn't that important because race window would be
      even small and if there is too much free object, cache reaper would reap
      it.  So, this patch makes the check for free object exisistence not to
      hold a lock.  This will reduce lock contention in heavily allocation
      case.
      
      Note that until now, n->shared can be freed during the processing by
      writing slabinfo, but, with some trick in this patch, we can access it
      freely within interrupt disabled period.
      
      Below is the result of concurrent allocation/free in slab allocation
      benchmark made by Christoph a long time ago.  I make the output simpler.
      The number shows cycle count during alloc/free respectively so less is
      better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
        * After
        Kmalloc N*alloc N*free(32): Average=344/792
        Kmalloc N*alloc N*free(64): Average=347/882
        Kmalloc N*alloc N*free(128): Average=390/959
        Kmalloc N*alloc N*free(256): Average=393/1067
        Kmalloc N*alloc N*free(512): Average=683/1229
        Kmalloc N*alloc N*free(1024): Average=1295/1325
        Kmalloc N*alloc N*free(2048): Average=2513/1664
        Kmalloc N*alloc N*free(4096): Average=4742/2172
      
      It shows that allocation performance decreases for the object size up to
      128 and it may be due to extra checks in cache_alloc_refill().  But,
      with considering improvement of free performance, net result looks the
      same.  Result for other size class looks very promising, roughly, 50%
      performance improvement.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      801faf0d
    • Joonsoo Kim's avatar
      mm/slab: refill cpu cache through a new slab without holding a node lock · 213b4695
      Joonsoo Kim authored
      Until now, cache growing makes a free slab on node's slab list and then
      we can allocate free objects from it.  This necessarily requires to hold
      a node lock which is very contended.  If we refill cpu cache before
      attaching it to node's slab list, we can avoid holding a node lock as
      much as possible because this newly allocated slab is only visible to
      the current task.  This will reduce lock contention.
      
      Below is the result of concurrent allocation/free in slab allocation
      benchmark made by Christoph a long time ago.  I make the output simpler.
      The number shows cycle count during alloc/free respectively so less is
      better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
        * After
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
      It shows that contention is reduced for all the object sizes and
      performance increases by 30 ~ 40%.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      213b4695
    • Joonsoo Kim's avatar
      mm/slab: separate cache_grow() to two parts · 76b342bd
      Joonsoo Kim authored
      This is a preparation step to implement lockless allocation path when
      there is no free objects in kmem_cache.
      
      What we'd like to do here is to refill cpu cache without holding a node
      lock.  To accomplish this purpose, refill should be done after new slab
      allocation but before attaching the slab to the management list.  So,
      this patch separates cache_grow() to two parts, allocation and attaching
      to the list in order to add some code inbetween them in the following
      patch.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      76b342bd
    • Joonsoo Kim's avatar
      mm/slab: make cache_grow() handle the page allocated on arbitrary node · 511e3a05
      Joonsoo Kim authored
      Currently, cache_grow() assumes that allocated page's nodeid would be
      same with parameter nodeid which is used for allocation request.  If we
      discard this assumption, we can handle fallback_alloc() case gracefully.
      So, this patch makes cache_grow() handle the page allocated on arbitrary
      node and clean-up relevant code.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      511e3a05
    • Joonsoo Kim's avatar
      mm/slab: racy access/modify the slab color · 03d1d43a
      Joonsoo Kim authored
      Slab color isn't needed to be changed strictly.  Because locking for
      changing slab color could cause more lock contention so this patch
      implements racy access/modify the slab color.  This is a preparation
      step to implement lockless allocation path when there is no free objects
      in the kmem_cache.
      
      Below is the result of concurrent allocation/free in slab allocation
      benchmark made by Christoph a long time ago.  I make the output simpler.
      The number shows cycle count during alloc/free respectively so less is
      better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=365/806
        Kmalloc N*alloc N*free(64): Average=452/690
        Kmalloc N*alloc N*free(128): Average=736/886
        Kmalloc N*alloc N*free(256): Average=1167/985
        Kmalloc N*alloc N*free(512): Average=2088/1125
        Kmalloc N*alloc N*free(1024): Average=4115/1184
        Kmalloc N*alloc N*free(2048): Average=8451/1748
        Kmalloc N*alloc N*free(4096): Average=16024/2048
      
        * After
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
      It shows that contention is reduced for object size >= 1024 and
      performance increases by roughly 15%.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03d1d43a
    • Joonsoo Kim's avatar
      mm/slab: don't keep free slabs if free_objects exceeds free_limit · 6052b788
      Joonsoo Kim authored
      Currently, determination to free a slab is done whenever each freed
      object is put into the slab.  This has a following problem.
      
      Assume free_limit = 10 and nr_free = 9.
      
      Free happens as following sequence and nr_free changes as following.
      
      free(become a free slab) free(not become a free slab) nr_free: 9 -> 10
      (at first free) -> 11 (at second free)
      
      If we try to check if we can free current slab or not on each object
      free, we can't free any slab in this situation because current slab
      isn't a free slab when nr_free exceed free_limit (at second free) even
      if there is a free slab.
      
      However, if we check it lastly, we can free 1 free slab.
      
      This problem would cause to keep too much memory in the slab subsystem.
      This patch try to fix it by checking number of free object after all
      free work is done.  If there is free slab at that time, we can free slab
      as much as possible so we keep free slab as minimal.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6052b788