1. 12 Aug, 2020 37 commits
    • x86/mm: use max memory block size on bare metal · fe124c95
      Daniel Jordan authored
      Some of our servers spend significant time at kernel boot initializing
      memory block sysfs directories and then creating symlinks between them and
      the corresponding nodes.  The slowness happens because the machines get
      stuck with the smallest supported memory block size on x86 (128M), which
      results in 16,288 directories to cover the 2T of installed RAM.  The
      search for each memory block is noticeable even with commit 4fb6eabf
      ("drivers/base/memory.c: cache memory blocks in xarray to accelerate
      lookup").
      
      Commit 078eb6aa ("x86/mm/memory_hotplug: determine block size based on
      the end of boot memory") chooses the block size based on alignment with
      memory end.  That addresses hotplug failures in qemu guests, but for bare
      metal systems whose memory end isn't aligned to even the smallest size, it
      leaves them at 128M.
      
      Make kernels that aren't running on a hypervisor use the largest supported
      size (2G) to minimize overhead on big machines.  Kernel boot goes 7%
      faster on the aforementioned servers, shaving off half a second.
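
The effect of the block size on the number of sysfs directories can be estimated with simple arithmetic. The helper below is only an illustrative userspace sketch (the name memory_block_count is hypothetical and not part of the patch). For an idealized, hole-free 2 TiB it gives 16,384 blocks at 128M, close to the 16,288 reported above, and 1,024 blocks at 2G.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: number of memory blocks needed to
 * cover `ram_bytes` of RAM with blocks of `block_bytes`, rounded up.
 */
static uint64_t memory_block_count(uint64_t ram_bytes, uint64_t block_bytes)
{
    return (ram_bytes + block_bytes - 1) / block_bytes;
}
```

Calling memory_block_count(2ULL << 40, 128ULL << 20) and memory_block_count(2ULL << 40, 2ULL << 30) reproduces the two figures.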
      
      [daniel.m.jordan@oracle.com: v3]
        Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mmu_notifier: fix and extend kerneldoc · d49653f3
      Krzysztof Kozlowski authored
      Fix W=1 compile warnings (invalid kerneldoc):
      
          mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_begin'
          mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_register'
          mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
          mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
          mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
          mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'
      Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Link: http://lkml.kernel.org/r/20200728171109.28687-4-krzk@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/sched/mm.h: optimize current_gfp_context() · af161bee
      Waiman Long authored
      The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
      flags into the corresponding GFP_* flags for memory allocation.  In that
      function, current->flags is accessed 3 times.  That may lead to duplicated
      access of the same memory location.
      
      This is usually not a problem when only minimal debug config options are
      enabled, as the compiler can optimize away the duplicated memory accesses.
      With most of the debug config options on, however, that may not be the
      case.  For
      example, the x86-64 object size of the __need_fs_reclaim() in a debug
      kernel that calls current_gfp_context() was 309 bytes.  With this patch
      applied, the object size is reduced to 202 bytes.  This is a saving of 107
      bytes and will probably be slightly faster too.
      
      Use READ_ONCE() to access current->flags to prevent the compiler from
      possibly accessing current->flags multiple times.
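
The single-read pattern can be sketched in userspace as follows. This is an approximation for illustration, not the kernel source: READ_ONCE_SKETCH imitates READ_ONCE() with a volatile access, and the SK_* flag names and values as well as struct sketch_task are hypothetical stand-ins for the PF_MEMALLOC_* and GFP_* definitions.

```c
/* Userspace approximation of READ_ONCE(): a volatile access that the
 * compiler must perform exactly once (relies on the GNU __typeof__). */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical flag values, chosen only for this example. */
#define SK_PF_NOIO      0x1u
#define SK_PF_NOFS      0x2u
#define SK_PF_NOCMA     0x4u
#define SK_GFP_IO       0x10u
#define SK_GFP_FS       0x20u
#define SK_GFP_MOVABLE  0x40u

struct sketch_task {
    unsigned int flags;   /* stand-in for current->flags */
};

/* Convert per-task flags into allocation flags using one load of t->flags. */
static unsigned int sketch_gfp_context(struct sketch_task *t, unsigned int gfp)
{
    unsigned int pflags = READ_ONCE_SKETCH(t->flags);   /* single read */

    if (pflags & (SK_PF_NOIO | SK_PF_NOFS | SK_PF_NOCMA)) {
        /* All later tests use the local copy, never t->flags again. */
        if (pflags & SK_PF_NOIO)
            gfp &= ~(SK_GFP_IO | SK_GFP_FS);
        else if (pflags & SK_PF_NOFS)
            gfp &= ~SK_GFP_FS;
        if (pflags & SK_PF_NOCMA)
            gfp &= ~SK_GFP_MOVABLE;
    }
    return gfp;
}
```

Because every test operates on the local copy, the compiler has no reason to reload the flags word, regardless of which debug options are enabled.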
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cma: don't quit at first error when activating reserved areas · 3a5139f1
      Mike Kravetz authored
      The routine cma_init_reserved_areas is designed to activate all
      reserved cma areas.  It quits when it first encounters an error.
      This can leave some areas in a state where they are reserved but
      not activated.  There is no feedback to code which performed the
      reservation.  Attempting to allocate memory from areas in such a
      state will result in a BUG.
      
      Modify cma_init_reserved_areas to always attempt to activate all
      areas.  The called routine, cma_activate_area is responsible for
      leaving the area in a valid state.  No one is making active use
      of returned error codes, so change the routine to void.
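
The control-flow change can be illustrated with a small userspace sketch. It is not the kernel implementation: struct sketch_area, its can_activate field (which simulates whether activation would succeed), and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a reserved CMA area. */
struct sketch_area {
    const char *name;
    bool can_activate;   /* simulates success or failure of activation */
    bool active;
};

/* Returns void: on failure the area is left inactive but in a valid state. */
static void sketch_activate_area(struct sketch_area *a)
{
    if (!a->can_activate) {
        a->active = false;
        fprintf(stderr, "area %s could not be activated\n", a->name);
        return;
    }
    a->active = true;
}

/* Visits every area instead of returning at the first failure.
 * The return value (number of active areas) exists only for the example. */
static int sketch_init_reserved_areas(struct sketch_area *areas, int n)
{
    int activated = 0;

    for (int i = 0; i < n; i++) {
        sketch_activate_area(&areas[i]);
        if (areas[i].active)
            activated++;
    }
    return activated;
}
```

With three areas where only the second one fails, the first and third still end up active, which is the behavior the commit asks for.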
      
      How to reproduce:  This example uses kernelcore, hugetlb and cma
      as an easy way to reproduce.  However, this is a more general cma
      issue.
      
      Two node x86 VM 16GB total, 8GB per node
      Kernel command line parameters, kernelcore=4G hugetlb_cma=8G
      Related boot time messages,
        hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
        cma: Reserved 4096 MiB at 0x0000000100000000
        hugetlb_cma: reserved 4096 MiB on node 0
        cma: Reserved 4096 MiB at 0x0000000300000000
        hugetlb_cma: reserved 4096 MiB on node 1
        cma: CMA area hugetlb could not be activated
      
       # echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        ...
        Call Trace:
          bitmap_find_next_zero_area_off+0x51/0x90
          cma_alloc+0x1a5/0x310
          alloc_fresh_huge_page+0x78/0x1a0
          alloc_pool_huge_page+0x6f/0xf0
          set_max_huge_pages+0x10c/0x250
          nr_hugepages_store_common+0x92/0x120
          ? __kmalloc+0x171/0x270
          kernfs_fop_write+0xc1/0x1a0
          vfs_write+0xc7/0x1f0
          ksys_write+0x5f/0xe0
          do_syscall_64+0x4d/0x90
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: c64be2bb ("drivers: add Contiguous Memory Allocator")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200730163123.6451-1-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: fix the name of hugetlb CMA · 29d0f41d
      Barry Song authored
      Once we enable CMA_DEBUGFS, we get the following error: directory
      'cma-hugetlb' with parent 'cma' already present.
      
      We should have different names for different CMA areas.
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200616223131.33828-3-song.bao.hua@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: fix the name of CMA areas · 18e98e56
      Barry Song authored
      Patch series "mm: fix the names of general cma and hugetlb cma", v2.
      
      The current CMA code only works when users pass a const string as the
      name parameter.  We need to fix the way names are handled in CMA.  On the
      other hand, to avoid name conflicts after enabling CMA_DEBUGFS, each
      hugetlb CMA area should get a different CMA name.
      
      This patch (of 2):
      
      If users give a name saved on the stack, the current code will generate a
      magic pointer.  If users don't give a name (NULL), kasprintf() will always
      return NULL as we are at the early stage.  That means
      cma_init_reserved_mem() will return -ENOMEM if users set the name
      parameter to NULL.
      
      [natechancellor@gmail.com: return cma->name directly in cma_get_name]
        Link: https://github.com/ClangBuiltLinux/linux/issues/1063
        Link: http://lkml.kernel.org/r/20200623015840.621964-1-natechancellor@gmail.com
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200616223131.33828-2-song.bao.hua@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/cma.c: fix NULL pointer dereference when cma could not be activated · 835832ba
      Jianqun Xu authored
      In some cases the cma area could not be activated, but cma_alloc may
      still be called on it.  The kernel then crashes with a NULL pointer
      dereference.

      Add a bitmap validity check in cma_alloc to avoid this issue.
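
A minimal userspace sketch of such a guard is shown below. The structure layout, the function name, and the simplified one-page-at-a-time allocation are assumptions made for illustration and do not reproduce mm/cma.c.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a CMA area. */
struct sketch_cma {
    unsigned long count;     /* number of page slots in the area */
    unsigned long *bitmap;   /* NULL if the area was never activated */
};

/* Returns the index of the slot that was allocated, or -1 on failure. */
static long sketch_cma_alloc(struct sketch_cma *cma)
{
    /* Guard: refuse to touch an area without a valid bitmap. */
    if (!cma || !cma->count || !cma->bitmap)
        return -1;

    for (unsigned long i = 0; i < cma->count; i++) {
        unsigned long *word = &cma->bitmap[i / SKETCH_BITS_PER_LONG];
        unsigned long bit = 1UL << (i % SKETCH_BITS_PER_LONG);

        if (!(*word & bit)) {
            *word |= bit;    /* mark the slot as used */
            return (long)i;
        }
    }
    return -1;               /* area is full */
}
```

An area whose bitmap is NULL now produces a failed allocation instead of a crash inside the bitmap search.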
      Signed-off-by: Jianqun Xu <jay.xu@rock-chips.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/20200615010123.15596-1-jay.xu@rock-chips.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmstat: add events for THP migration without split · 1a5bae25
      Anshuman Khandual authored
      Add following new vmstat events which will help in validating THP
      migration without split.  Statistics reported through these new VM events
      will help in performance debugging.
      
      1. THP_MIGRATION_SUCCESS
      2. THP_MIGRATION_FAILURE
      3. THP_MIGRATION_SPLIT
      
      In addition, these new events also update normal page migration statistics
      appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE.  While here,
      this updates current trace event 'mm_migrate_pages' to accommodate now
      available THP statistics.
      
      [akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
      [ziy@nvidia.com: v2]
        Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
      [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
        Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: remove debug_cow switch · 4958e4d8
      Yang Shi authored
      Since commit 3917c802 ("thp: change CoW semantics for
      anon-THP"), the CoW page fault of THP has been rewritten and debug_cow is
      no longer used.  So, just remove it.
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: add migrate-shared test for migrate_vma_*() · b0fc0f3f
      Ralph Campbell authored
      Add a migrate_vma_*() self test for mmap(MAP_SHARED) to verify that
      !vma_anonymous() ranges won't be migrated.
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: "Bharata B Rao" <bharata@linux.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200710194840.7602-3-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20200709165711.26584-3-rcampbell@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: optimize migrate_vma_setup() for holes · 0744f280
      Ralph Campbell authored
      Patch series "mm/migrate: optimize migrate_vma_setup() for holes".
      
      A simple optimization for migrate_vma_*() when the source vma is not an
      anonymous vma and a new test case to exercise it.
      
      This patch (of 2):
      
      When migrating system memory to device private memory, if the source
      address range is a valid VMA range and there is no memory or a zero page,
      the source PFN array is marked as valid but with no PFN.
      
      This lets the device driver allocate private memory and clear it, then
      insert the new device private struct page into the CPU's page tables when
      migrate_vma_pages() is called.  migrate_vma_pages() only inserts the new
      page if the VMA is an anonymous range.
      
      There is no point in telling the device driver to allocate device private
      memory and then not migrate the page.  Instead, mark the source PFN array
      entries as not migrating to avoid this overhead.
      
      [rcampbell@nvidia.com: v2]
        Link: http://lkml.kernel.org/r/20200710194840.7602-2-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: "Bharata B Rao" <bharata@linux.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200710194840.7602-1-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20200709165711.26584-1-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20200709165711.26584-2-rcampbell@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem · 34ae204f
      Mike Kravetz authored
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
      in at least read mode.  This is because the explicit locking in
      huge_pmd_share (called by huge_pte_alloc) was removed.  When restructuring
      the code, the call to huge_pte_alloc in the else block at the beginning of
      hugetlb_fault was missed.
      
      Unfortunately, that else clause is exercised when there is no page table
      entry.  This will likely lead to a call to huge_pmd_share.  If
      huge_pmd_share thinks pmd sharing is possible, it will traverse the
      mapping tree (i_mmap) without holding i_mmap_rwsem.  If someone else is
      modifying the tree, bad things such as addressing exceptions or worse
      could happen.
      
      Simply remove the else clause.  It should have been removed previously.
      The code following the else will call huge_pte_alloc with the appropriate
      locking.
      
      To prevent this type of issue in the future, add routines to assert that
      i_mmap_rwsem is held, and call these routines in huge pmd sharing
      routines.
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: prevent filesystem stacking of hugetlbfs · 15568299
      Mike Kravetz authored
      syzbot found issues with having hugetlbfs on a union/overlay as reported
      in [1].  Due to the limitations (no write) and special functionality of
      hugetlbfs, it does not work well in filesystem stacking.  There are no
      known use cases for hugetlbfs stacking.  Rather than making modifications
      to get hugetlbfs working in such environments, simply prevent stacking.
      
      [1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/
      
      Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
      Suggested-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Colin Walters <walters@verbum.org>
      Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: show process exiting information in __oom_kill_process() · 619b5b46
      Yafang Shao authored
      When the OOM killer finds a victim and tries to kill it, if the victim is
      already exiting, the task mm will be NULL and no process will be killed.
      But the dump_header() has been already executed, so it will be strange to
      dump so much information without killing a process.  We'd better show some
      helpful information to indicate why this happens.
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • doc, mm: clarify /proc/<pid>/oom_score value range · b1aa7c93
      Michal Hocko authored
      The exported value includes oom_score_adj so the range is not [0, 1000] as
      described in the previous section but rather [0, 2000].  Mention that fact
      explicitly.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: http://lkml.kernel.org/r/20200709062603.18480-2-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • doc, mm: sync up oom_score_adj documentation · de3f32e1
      Michal Hocko authored
      There are at least two outdated notes in the oom section.  The 3% discount
      for root processes is gone since d46078b2 ("mm, oom: remove 3% bonus for
      CAP_SYS_ADMIN processes").

      Likewise children of the selected oom victim are not sacrificed since
      bbbe4802 ("mm, oom: remove 'prefer children over parent' heuristic").
      
      Drop both of them.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: make the calculation of oom badness more accurate · 9066e5cf
      Yafang Shao authored
      Recently we found an issue on our production environment that when memcg
      oom is triggered the oom killer doesn't choose the process with largest
      resident memory but chooses the first scanned process.  Note that all
      processes in this memcg have the same oom_score_adj, so the oom killer
      should choose the process with largest resident memory.

      Below is part of the oom info, which is enough to analyze this issue.
      [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
      [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
      [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
      [...]
      [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
      [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
      [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
      [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
      [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
      [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
      [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
      [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
      [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
      [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
      [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
      [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
      [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
      [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
      [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
      [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
      [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
      [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
      [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
      [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
      [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
      [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
      [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
      [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      We can find that the first scanned process 5740 (pause) was killed, but
      its rss is only one page.  That is because, when we calculate the oom
      badness in oom_badness(), we always ignore negative points and convert
      all of them to 1.  Now as oom_score_adj of all the
      processes in this targeted memcg have the same value -998, the points of
      these processes are all negative.  As a result, the first scanned
      process will be killed.

      The oom_score_adj (-998) in this memcg is set by kubelet, because it is
      a Guaranteed pod, which has higher priority to prevent it from being
      killed by system oom.

      To fix this issue, we should make the calculation of oom points more
      accurate.  We can achieve it by converting the chosen_point from 'unsigned
      long' to 'long'.
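
The effect of the clamping can be reproduced with a small userspace sketch fed with two rows from the log above. The scoring formula (pages in use plus oom_score_adj scaled by totalpages / 1000) is an assumption about the general shape of oom_badness(), and the struct, function names, and 4 KiB page size are hypothetical.

```c
#define SKETCH_PAGE_SIZE 4096L   /* assumed page size */

/* Hypothetical per-task inputs, taken from the columns of the oom dump. */
struct sketch_task {
    long rss;              /* resident pages */
    long swapents;         /* swap entries */
    long pgtables_bytes;   /* page table size in bytes */
    long oom_score_adj;    /* -1000..1000 */
};

/* Signed score: negative values keep their relative order. */
static long sketch_badness_signed(const struct sketch_task *t, long totalpages)
{
    long points = t->rss + t->swapents + t->pgtables_bytes / SKETCH_PAGE_SIZE;

    return points + t->oom_score_adj * (totalpages / 1000);
}

/* Behaviour described above: every non-positive score collapses to 1. */
static unsigned long sketch_badness_clamped(const struct sketch_task *t,
                                            long totalpages)
{
    long points = sketch_badness_signed(t, totalpages);

    return points > 0 ? (unsigned long)points : 1UL;
}
```

With the memcg limit of 16777216kB (4194304 pages), both pause (rss 1) and data_sim (rss 3967322) get a clamped score of 1, so they are indistinguishable. The signed scores are both negative but data_sim ranks higher.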
      
      [cai@lca.pw: reported an issue in the previous version]
      [mhocko@suse.com: fixed the issue reported by Cai]
      [mhocko@suse.com: add the comment in proc_oom_score()]
      [laoar.shao@gmail.com: v3]
        Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: check parameters first in kernel_get_mempolicy · 4605f057
      Wenchao Hao authored
      The previous implementation calls untagged_addr() before the error check.
      If the error check fails and returns EINVAL, the untagged_addr() call is
      just useless work.
      Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mempolicy: fix kerneldoc of numa_map_to_online_node() · f6e92f40
      Krzysztof Kozlowski authored
      Fix W=1 compile warnings (invalid kerneldoc):
      
          mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
          mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'
      Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200728171109.28687-3-krzk@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: correct the comments of compact_defer_shift · 860b3272
      Alex Shi authored
      There is no compact_defer_limit.  The field in use is compact_defer_shift.
      Also add an explanation of compact_order_failed.
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use unsigned types for fragmentation score · d34c0a75
      Nitin Gupta authored
      Proactive compaction uses per-node/zone "fragmentation score" which is
      always in range [0, 100], so use unsigned type for these scores as well as
      for related constants.
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix compile error due to COMPACTION_HPAGE_ORDER · 25788738
      Nitin Gupta authored
      Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
      HUGETLB_PAGE_ORDER.  The correct way to check if this constant is defined
      is to check for CONFIG_HUGETLBFS.
      Reported-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: proactive compaction · facdaa91
      Nitin Gupta authored
      For some applications, we need to allocate almost all memory as hugepages.
      However, on a running system, higher-order allocations can fail if the
      memory is fragmented.  Linux kernel currently does on-demand compaction as
      we request more hugepages, but this style of compaction incurs very high
      latency.  Experiments with one-time full memory compaction (followed by
      hugepage allocations) show that kernel is able to restore a highly
      fragmented memory state to a fairly compacted memory state within <1 sec
      for a 32G system.  Such data suggests that a more proactive compaction can
      help us allocate a large fraction of memory as hugepages while keeping
      allocation latencies low.
      
      For a more proactive compaction, the approach taken here is to define a
      new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
      external fragmentation which kcompactd tries to maintain.
      
      The tunable takes a value in range [0, 100], with a default of 20.
      
      Note that a previous version of this patch [1] was found to introduce too
      many tunables (per-order extfrag{low, high}), but this one reduces them to
      just one sysctl.  Also, the new tunable is an opaque value instead of
      asking for specific bounds of "external fragmentation", which would have
      been difficult to estimate.  The internal interpretation of this opaque
      value allows for future fine-tuning.
      
      Currently, we use a simple translation from this tunable to [low, high]
      "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
      The score for a node is defined as weighted mean of per-zone external
      fragmentation.  A zone's present_pages determines its weight.
      
      To periodically check per-node score, we reuse per-node kcompactd threads,
      which are woken up every 500 milliseconds to check the same.  If a node's
      score exceeds its high threshold (as derived from user-provided
      proactiveness value), proactive compaction is started until its score
      reaches its low threshold value.  By default, proactiveness is set to 20,
      which implies threshold values of low=80 and high=90.
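      
      The translation from the tunable to thresholds, the weighted node
      score, and the start/stop rule described above can be sketched as
      follows.  This is only an illustration of the arithmetic in this
      message, not the kernel code: the function names are hypothetical, and
      capping the high threshold at 100 is an assumption made for the sketch.

```python
def thresholds(proactiveness):
    """Map the tunable in [0, 100] to (low, high) score thresholds."""
    if not 0 <= proactiveness <= 100:
        raise ValueError("proactiveness must be in [0, 100]")
    low = 100 - proactiveness
    high = min(low + 10, 100)  # assumption: cap at the maximum score
    return low, high


def node_score(zones):
    """Weighted mean of per-zone scores; zones is [(score, present_pages)]."""
    total_pages = sum(pages for _, pages in zones)
    if total_pages == 0:
        return 0
    return sum(score * pages for score, pages in zones) / total_pages


def keep_compacting(compacting, score, low, high):
    """Hysteresis: start above high, then continue until score reaches low."""
    if compacting:
        return score > low
    return score > high
```

      With the default proactiveness of 20, thresholds(20) gives (80, 90),
      matching the values quoted above.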
      
      This patch is largely based on ideas from Michal Hocko [2].  See also the
      LWN article [3].
      
      Performance data
      ================
      
      System: x86_64, 1T RAM, 80 CPU threads.
      Kernel: 5.6.0-rc3 + this patch
      
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
      
      Before starting the driver, the system was fragmented from a userspace
      program that allocates all memory and then for each 2M aligned section,
      frees 3/4 of base pages using munmap.  The workload is mainly anonymous
      userspace pages, which are easy to move around.  I intentionally avoided
      unmovable pages in this test to see how much latency we incur when
      hugepage allocations hit direct compaction.
      
      1. Kernel hugepage allocation latencies
      
      With the system in such a fragmented state, a kernel driver then allocates
      as many hugepages as possible and measures allocation latency:
      
      (all latency values are in microseconds)
      
      - With vanilla 5.6.0-rc3
      
        percentile latency
        –––––––––– –––––––
      	   5    7894
      	  10    9496
      	  25   12561
      	  30   15295
      	  40   18244
      	  50   21229
      	  60   27556
      	  75   30147
      	  80   31047
      	  90   32859
      	  95   33799
      
      Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      sysctl -w vm.compaction_proactiveness=20
      
        percentile latency
        –––––––––– –––––––
      	   5       2
      	  10       2
      	  25       3
      	  30       3
      	  40       3
      	  50       4
      	  60       4
      	  75       4
      	  80       4
      	  90       5
      	  95     429
      
      Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      2. JAVA heap allocation
      
      In this test, we first fragment memory using the same method as for (1).
      
      Then, we start a Java process with a heap size set to 700G and request the
      heap to be allocated with THP hugepages.  We also set THP to madvise to
      allow hugepage backing of this heap.
      
      /usr/bin/time \
       java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
      
      The above command allocates 700G of Java heap using hugepages.
      
      - With vanilla 5.6.0-rc3
      
      17.39user 1666.48system 27:37.89elapsed
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      8.35user 194.58system 3:19.62elapsed
      
      Elapsed time remains around 3:15 as proactiveness is increased further.
      
      Note that proactive compaction happens throughout the runtime of these
      workloads.  The situation of one-time compaction, sufficient to supply
      hugepages for following allocation stream, can probably happen for more
      extreme proactiveness values, like 80 or 90.
      
      In the above Java workload, proactiveness is set to 20.  The test starts
      with a node's score of 80 or higher, depending on the delay between the
      fragmentation step and starting the benchmark, which gives more-or-less
      time for the initial round of compaction.  As the benchmark consumes
      hugepages, node's score quickly rises above the high threshold (90) and
      proactive compaction starts again, which brings down the score to the low
      threshold level (80).  Repeat.
      
      bpftrace also confirms proactive compaction running 20+ times during the
      runtime of this Java benchmark.  A kcompactd thread consumes 100% of one
      of the CPUs while it tries to bring a node's score within thresholds.
      
      Backoff behavior
      ================
      
      Above workloads produce a memory state which is easy to compact.  However,
      if memory is filled with unmovable pages, proactive compaction should
      essentially back off.  To test this aspect:
      
      - Created a kernel driver that allocates almost all memory as hugepages
        and then frees the first 3/4 of each hugepage.
      - Set proactiveness=40
      - Note that proactive_compact_node() is deferred the maximum number of
        times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
        (=> ~30 seconds between retries).
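      
      The "~30 seconds" figure can be reproduced with a little arithmetic.
      The 500 ms check interval comes from this message; the maximum defer
      shift of 6 (that is, 64 skipped checks) is an assumption of this
      sketch and is not stated here, and the names below are hypothetical.

```python
CHECK_INTERVAL_MSEC = 500  # wake-up period stated in this commit message
MAX_DEFER_SHIFT = 6        # assumption for this sketch, not stated above


def max_backoff_seconds(interval_msec=CHECK_INTERVAL_MSEC,
                        defer_shift=MAX_DEFER_SHIFT):
    """Time between retries when 2**defer_shift checks are skipped."""
    skipped_checks = 1 << defer_shift
    return skipped_checks * interval_msec / 1000
```

      Under that assumption the wait is 64 * 500 ms = 32 seconds, which is
      consistent with the approximate value quoted in the list above.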
      
      [1] https://patchwork.kernel.org/patch/11098289/
      [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
      [3] https://lwn.net/Articles/817905/
      
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nitin Gupta <ngupta@nitingupta.dev>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      facdaa91
    • Michal Koutný's avatar
      /proc/PID/smaps: consistent whitespace output format · 471e78cc
      Michal Koutný authored
      The keys in smaps output are padded to fixed width with spaces.  All
      except for THPeligible that uses tabs (only since commit c0630669
      ("mm: thp: fix false negative of shmem vma's THP eligibility")).
      
      Unify the output formatting to save time debugging some naïve parsers.
      (Part of the unification is also aligning FilePmdMapped with others.)
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200728083207.17531-1-mkoutny@suse.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      471e78cc
    • Joonsoo Kim's avatar
      mm/vmscan: restore active/inactive ratio for anonymous LRU · 4002570c
      Joonsoo Kim authored
      Now that workingset detection is implemented for the anonymous LRU, we no
      longer need a large inactive list to detect frequently accessed pages
      before they are reclaimed.  This effectively reverts the temporary
      measure put in by commit "mm/vmscan: make active/inactive ratio as 1:1
      for anon lru".
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4002570c
    • Joonsoo Kim's avatar
      mm/swap: implement workingset detection for anonymous LRU · aae466b0
      Joonsoo Kim authored
      This patch implements workingset detection for anonymous LRU.  All the
      infrastructure is implemented by the previous patches so this patch just
      activates the workingset detection by installing/retrieving the shadow
      entry and adding refault calculation.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aae466b0
    • Joonsoo Kim's avatar
      mm/swapcache: support to handle the shadow entries · 3852f676
      Joonsoo Kim authored
      Workingset detection for anonymous pages will be implemented in the
      following patch, and it requires storing shadow entries in the
      swapcache.  This patch implements the infrastructure to store a shadow
      entry in the swapcache.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3852f676
    • Joonsoo Kim's avatar
      mm/workingset: prepare the workingset detection infrastructure for anon LRU · 170b04b7
      Joonsoo Kim authored
      To prepare the workingset detection for anon LRU, this patch splits
      workingset event counters for refault, activate and restore into anon and
      file variants, as well as the refaults counter in struct lruvec.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      170b04b7
    • Joonsoo Kim's avatar
      mm/vmscan: protect the workingset on anonymous LRU · b518154e
      Joonsoo Kim authored
      In the current implementation, a newly created or swapped-in anonymous
      page starts on the active list.  Growing the active list triggers
      rebalancing of the active/inactive lists, so old pages on the active
      list are demoted to the inactive list.  Hence, pages on the active list
      aren't protected at all.
      
      The following is an example of this situation.
      
      Assume that there are 50 hot pages on the active list.  Numbers denote
      the number of pages on the active/inactive lists (active | inactive).
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      This patch tries to fix this issue.  As with the file LRU, newly created
      or swapped-in anonymous pages will be inserted into the inactive list.
      They are promoted to the active list if they are referenced enough.  This
      simple modification changes the above example as follows.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      As you can see, hot pages on the active list would be protected.
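      
      The two walkthroughs above can be replayed with a toy two-list model.
      This is a sketch for illustration only, not the kernel's LRU code: the
      function name is hypothetical, memory is assumed to hold 100 pages, and
      the old behaviour is modelled by demoting the oldest active page
      whenever the active list exceeds half of memory.

```python
from collections import Counter, deque


def simulate(new_pages_go_to, batches=2, batch_size=50, capacity=100):
    """Replay the example: 50 hot pages, then batches of used-once pages.

    new_pages_go_to is "active" (old behaviour) or "inactive" (this patch).
    The right end of each deque is the tail, i.e. the oldest page.
    """
    active = deque(["h"] * 50)
    inactive = deque()
    swapped_out = []
    for _ in range(batches * batch_size):
        if new_pages_go_to == "active":
            active.appendleft("uo")
            # Assumption: keep the active list at no more than half of memory.
            while len(active) > capacity // 2:
                inactive.appendleft(active.pop())
        else:
            inactive.appendleft("uo")
        # Reclaim from the inactive tail when memory is full.
        while len(active) + len(inactive) > capacity:
            swapped_out.append(inactive.pop())
    return Counter(active), Counter(inactive), Counter(swapped_out)
```

      Calling simulate("active") ends with the 50 hot pages swapped out,
      while simulate("inactive") keeps them on the active list and swaps out
      used-once pages instead.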
      
      Note that this implementation has a drawback: a page cannot be promoted
      and will be swapped out if its re-access interval is greater than the
      size of the inactive list but less than the total size
      (active+inactive).  To solve this potential issue, a following patch
      will apply workingset detection similar to what is already applied to
      the file LRU.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b518154e
    • Joonsoo Kim's avatar
      mm/vmscan: make active/inactive ratio as 1:1 for anon lru · ccc5dc67
      Joonsoo Kim authored
      Patch series "workingset protection/detection on the anonymous LRU list", v7.
      
      * PROBLEM
      In the current implementation, a newly created or swapped-in anonymous
      page starts on the active list.  Growing the active list triggers
      rebalancing of the active/inactive lists, so old pages on the active
      list are demoted to the inactive list.  Hence, hot pages on the active
      list aren't protected at all.
      
      The following is an example of this situation.
      
      Assume that there are 50 hot pages on the active list and that the
      system can hold 100 pages in total.  Numbers denote the number of pages
      on the active/inactive lists (active | inactive).  (h) stands for hot
      pages and (uo) stands for used-once pages.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      As we can see, hot pages are swapped out, which will cause swap-in later.
      
      * SOLUTION
      Since this is what we want to avoid, this patchset implements workingset
      protection.  As with the file LRU list, a newly created or swapped-in
      anonymous page starts on the inactive list.  Also as with the file LRU
      list, the page is promoted if it is referenced enough.  This simple
      modification changes the above example as follows.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      Hot pages remain on the active list. :)
      
      * EXPERIMENT
      I tested this scenario on my test bed and confirmed that this problem
      happens with the current implementation.  I also checked that it is
      fixed by this patchset.
      
      * SUBJECT
      workingset detection
      
      * PROBLEM
      A later part of the patchset implements workingset detection for the
      anonymous LRU list.  There is a corner case in which workingset
      protection could cause thrashing.  If we can avoid thrashing through
      workingset detection, we get better performance.
      
      The following is an example of thrashing due to workingset protection.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (will be hot) pages
      50(h) | 50(wh)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(wh)
      
      4. workload: 50 (will be hot) pages
      50(h) | 50(wh), swap-in 50(wh)
      
      5. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(wh)
      
      6. repeat 4, 5
      
      Without workingset detection, the will-be-hot pages in this kind of
      workload can never be promoted, and thrashing continues forever.
      
      * SOLUTION
      Therefore, this patchset implements workingset detection.  All the
      infrastructure for workingset detection is already implemented, so there
      is not much work to do.  First, extend the workingset detection code to
      deal with the anonymous LRU list.  Then, make the swap cache handle the
      exceptional value for the shadow entry.  Lastly, install/retrieve the
      shadow value into/from the swap cache and check the refault distance.
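      
      The install/retrieve/check sequence can be pictured with a toy model.
      This is a simplified sketch, not the kernel implementation: the class
      and method names are hypothetical, a plain dictionary stands in for the
      swap cache, and comparing the refault distance against a caller-supplied
      workingset size is an assumption about what "check the refault
      distance" means here.

```python
class ToyAnonLru:
    """Shadow entries remember the eviction clock of swapped-out pages."""

    def __init__(self):
        self.evictions = 0  # monotonically increasing eviction counter
        self.shadows = {}   # page id -> counter value when it was evicted

    def evict(self, page):
        """Install a shadow entry in place of the evicted page."""
        self.shadows[page] = self.evictions
        self.evictions += 1

    def refault(self, page, workingset_size):
        """Return True if the refaulting page should be activated at once."""
        shadow = self.shadows.pop(page, None)
        if shadow is None:
            return False  # no shadow entry: treat it as a brand-new page
        refault_distance = self.evictions - shadow
        return refault_distance <= workingset_size
```

      A page that comes back after only a few other evictions has a small
      refault distance and is activated, which is what breaks the
      swap-out/swap-in loop shown in the thrashing example.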
      
      * EXPERIMENT
      I made a test program to imitate the above scenario and confirmed that
      the problem exists.  Then, I checked that this patchset fixes it.
      
      My test setup is a virtual machine with 8 cpus and 6100MB of memory.
      However, the amount of memory that the test program can use is about
      280 MB, because the system uses a large ram-backed swap and a large
      ramdisk to capture the trace.
      
      The test scenario is as follows.
      
      1. allocate cold memory (512MB)
      2. allocate hot-1 memory (96MB)
      3. activate hot-1 memory (96MB)
      4. allocate another hot-2 memory (96MB)
      5. access cold memory (128MB)
      6. access hot-2 memory (96MB)
      7. repeat 5, 6
      
      Since hot-1 memory (96MB) is on the active list, the inactive list can
      contain roughly 190MB of pages.  hot-2 memory's re-access interval
      (96+128 MB) is more than 190MB, so it cannot be promoted without
      workingset detection and swap-in/out happens repeatedly.  With this
      patchset, workingset detection works and promotion happens.  Therefore,
      swap-in/out occurs less often.
      
      Here is the result. (average of 5 runs)
      
      type    swap-in  swap-out
      base     863240    989945
      patch    681565    809273
      
      As we can see, the patched kernel does less swap-in/out.
      
      * OVERALL TEST (ebizzy using modified random function)
      ebizzy is a test program in which the main thread allocates lots of
      memory and child threads access it randomly for a given time.  Swap-in
      will happen if the allocated memory is larger than the system memory.
      
      A random function that follows the zipf distribution is used to create
      hot/cold memory.  The hot/cold ratio is controlled by a parameter.  If
      the parameter is high, hot memory is accessed much more often than cold
      memory.  If the parameter is low, the number of accesses to each piece
      of memory is similar.  I use various parameters in order to show the
      effect of the patchset on workloads with various hot/cold ratios.
      
      My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
      ram swap.
      
      The result format is as follows.
      
      param: 1-1024-0.1
      - 1 (number of threads)
      - 1024 (allocated memory size, MB)
      - 0.1 (zipf distribution alpha;
        0.1 behaves like a roughly uniform random distribution,
        1.3 makes a small portion of memory hot and the rest cold)
      
      pswpin: smaller is better
      std: standard deviation
      improvement: negative is better
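      
      The improvement column in the tables below appears to be the relative
      change in pswpin against the base kernel, in percent.  The formula is
      not stated in this message; the sketch below is inferred from the
      published numbers, and the function name is hypothetical.

```python
def improvement(base_pswpin, patched_pswpin):
    """Relative change in swap-ins versus base, in percent (negative is better)."""
    return (patched_pswpin - base_pswpin) / base_pswpin * 100.0
```

      For the first single-thread row pair, improvement(14101983.40,
      14065875.80) rounds to -0.26, which matches the value listed for
      "prot" with param 1-1024.0-0.1.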
      
      * single thread
                 param        pswpin       std       improvement
            base 1-1024.0-0.1 14101983.40   79441.19
            prot 1-1024.0-0.1 14065875.80  136413.01  (   -0.26 )
          detect 1-1024.0-0.1 13910435.60  100804.82  (   -1.36 )
            base 1-1024.0-0.7 7998368.80   43469.32
            prot 1-1024.0-0.7 7622245.80   88318.74  (   -4.70 )
          detect 1-1024.0-0.7 7618515.20   59742.07  (   -4.75 )
            base 1-1024.0-1.3 1017400.80   38756.30
            prot 1-1024.0-1.3  940464.60   29310.69  (   -7.56 )
          detect 1-1024.0-1.3  945511.40   24579.52  (   -7.07 )
            base 1-1280.0-0.1 22895541.40   50016.08
            prot 1-1280.0-0.1 22860305.40   51952.37  (   -0.15 )
          detect 1-1280.0-0.1 22705565.20   93380.35  (   -0.83 )
            base 1-1280.0-0.7 13717645.60   46250.65
            prot 1-1280.0-0.7 12935355.80   64754.43  (   -5.70 )
          detect 1-1280.0-0.7 13040232.00   63304.00  (   -4.94 )
            base 1-1280.0-1.3 1654251.40    4159.68
            prot 1-1280.0-1.3 1522680.60   33673.50  (   -7.95 )
          detect 1-1280.0-1.3 1599207.00   70327.89  (   -3.33 )
            base 1-1536.0-0.1 31621775.40   31156.28
            prot 1-1536.0-0.1 31540355.20   62241.36  (   -0.26 )
          detect 1-1536.0-0.1 31420056.00  123831.27  (   -0.64 )
            base 1-1536.0-0.7 19620760.60   60937.60
            prot 1-1536.0-0.7 18337839.60   56102.58  (   -6.54 )
          detect 1-1536.0-0.7 18599128.00   75289.48  (   -5.21 )
            base 1-1536.0-1.3 2378142.40   20994.43
            prot 1-1536.0-1.3 2166260.60   48455.46  (   -8.91 )
          detect 1-1536.0-1.3 2183762.20   16883.24  (   -8.17 )
            base 1-1792.0-0.1 40259714.80   90750.70
            prot 1-1792.0-0.1 40053917.20   64509.47  (   -0.51 )
          detect 1-1792.0-0.1 39949736.40  104989.64  (   -0.77 )
            base 1-1792.0-0.7 25704884.40   69429.68
            prot 1-1792.0-0.7 23937389.00   79945.60  (   -6.88 )
          detect 1-1792.0-0.7 24271902.00   35044.30  (   -5.57 )
            base 1-1792.0-1.3 3129497.00   32731.86
            prot 1-1792.0-1.3 2796994.40   19017.26  (  -10.62 )
          detect 1-1792.0-1.3 2886840.40   33938.82  (   -7.75 )
            base 1-2048.0-0.1 48746924.40   50863.88
            prot 1-2048.0-0.1 48631954.40   24537.30  (   -0.24 )
          detect 1-2048.0-0.1 48509419.80   27085.34  (   -0.49 )
            base 1-2048.0-0.7 32046424.40   78624.22
            prot 1-2048.0-0.7 29764182.20   86002.26  (   -7.12 )
          detect 1-2048.0-0.7 30250315.80  101282.14  (   -5.60 )
            base 1-2048.0-1.3 3916723.60   24048.55
            prot 1-2048.0-1.3 3490781.60   33292.61  (  -10.87 )
          detect 1-2048.0-1.3 3585002.20   44942.04  (   -8.47 )
      
      * multi thread
                 param        pswpin       std       improvement
            base 8-1024.0-0.1 16219822.60  329474.01
            prot 8-1024.0-0.1 15959494.00  654597.45  (   -1.61 )
          detect 8-1024.0-0.1 15773790.80  502275.25  (   -2.75 )
            base 8-1024.0-0.7 9174107.80  537619.33
            prot 8-1024.0-0.7 8571915.00  385230.08  (   -6.56 )
          detect 8-1024.0-0.7 8489484.20  364683.00  (   -7.46 )
            base 8-1024.0-1.3 1108495.60   83555.98
            prot 8-1024.0-1.3 1038906.20   63465.20  (   -6.28 )
          detect 8-1024.0-1.3  941817.80   32648.80  (  -15.04 )
            base 8-1280.0-0.1 25776114.20  450480.45
            prot 8-1280.0-0.1 25430847.00  465627.07  (   -1.34 )
          detect 8-1280.0-0.1 25282555.00  465666.55  (   -1.91 )
            base 8-1280.0-0.7 15218968.00  702007.69
            prot 8-1280.0-0.7 13957947.80  492643.86  (   -8.29 )
          detect 8-1280.0-0.7 14158331.20  238656.02  (   -6.97 )
            base 8-1280.0-1.3 1792482.80   30512.90
            prot 8-1280.0-1.3 1577686.40   34002.62  (  -11.98 )
          detect 8-1280.0-1.3 1556133.00   22944.79  (  -13.19 )
            base 8-1536.0-0.1 33923761.40  575455.85
            prot 8-1536.0-0.1 32715766.20  300633.51  (   -3.56 )
          detect 8-1536.0-0.1 33158477.40  117764.51  (   -2.26 )
            base 8-1536.0-0.7 20628907.80  303851.34
            prot 8-1536.0-0.7 19329511.20  341719.31  (   -6.30 )
          detect 8-1536.0-0.7 20013934.00  385358.66  (   -2.98 )
            base 8-1536.0-1.3 2588106.40  130769.20
            prot 8-1536.0-1.3 2275222.40   89637.06  (  -12.09 )
          detect 8-1536.0-1.3 2365008.40  124412.55  (   -8.62 )
            base 8-1792.0-0.1 43328279.20  946469.12
            prot 8-1792.0-0.1 41481980.80  525690.89  (   -4.26 )
          detect 8-1792.0-0.1 41713944.60  406798.93  (   -3.73 )
            base 8-1792.0-0.7 27155647.40  536253.57
            prot 8-1792.0-0.7 24989406.80  502734.52  (   -7.98 )
          detect 8-1792.0-0.7 25524806.40  263237.87  (   -6.01 )
            base 8-1792.0-1.3 3260372.80  137907.92
            prot 8-1792.0-1.3 2879187.80   63597.26  (  -11.69 )
          detect 8-1792.0-1.3 2892962.20   33229.13  (  -11.27 )
            base 8-2048.0-0.1 50583989.80  710121.48
            prot 8-2048.0-0.1 49599984.40  228782.42  (   -1.95 )
          detect 8-2048.0-0.1 50578596.00  660971.66  (   -0.01 )
            base 8-2048.0-0.7 33765479.60  812659.55
            prot 8-2048.0-0.7 30767021.20  462907.24  (   -8.88 )
          detect 8-2048.0-0.7 32213068.80  211884.24  (   -4.60 )
            base 8-2048.0-1.3 3941675.80   28436.45
            prot 8-2048.0-1.3 3538742.40   76856.08  (  -10.22 )
          detect 8-2048.0-1.3 3579397.80   58630.95  (   -9.19 )
      
      As we can see, all the cases show improvement.  In particular, the test
      cases with zipf distribution alpha 1.3 show larger improvements.  This
      means that the patchset works better when anon pages have a hot/cold
      tendency.
      
      This patch (of 6):
      
      The current implementation of LRU management for anonymous pages has
      some problems.  The most important one is that it doesn't protect the
      workingset, that is, pages on the active LRU list.  Although this
      problem will be fixed by the following patches, some preparation is
      required, and this patch does it.
      
      What the following patch does is implement workingset protection.
      After the following patches, newly created or swapped-in pages will
      start their lifetime on the inactive list.  If the inactive list is too
      small, a page does not get enough chance to be referenced and cannot
      become part of the workingset.
      
      In order to give newly created or swapped-in anonymous pages enough
      chance to be referenced again, this patch makes the active/inactive LRU
      ratio 1:1.
      
      This is just a temporary measure.  A later patch in the series
      introduces workingset detection for the anonymous LRU that will be used
      to better decide whether pages should start on the active or the
      inactive list.  Afterwards this patch is effectively reverted.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ccc5dc67
    • Muchun Song's avatar
      mm/hugetlb: add mempolicy check in the reservation routine · 8ca39e68
      Muchun Song authored
      In the reservation routine, we only check whether the cpuset meets the
      memory allocation requirements, but we ignore the mempolicy in the
      MPOL_BIND case.  As a result, an mmap of hugetlb memory may succeed while
      the subsequent memory allocation fails due to mempolicy restrictions,
      and the task receives the SIGBUS signal.  This can be reproduced by the
      following steps.
      
       1) Compile the test case.
          cd tools/testing/selftests/vm/
          gcc map_hugetlb.c -o map_hugetlb
      
       2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
          system. Each node will pre-allocate one huge page.
          echo 2 > /proc/sys/vm/nr_hugepages
      
       3) Run the test case (mmap 4MB). We receive the SIGBUS signal.
          numactl --membind=0 ./map_hugetlb 4
      
      With this patch applied, the mmap will fail in step 3) and report
      "mmap: Cannot allocate memory".
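      
      The idea behind the added check can be sketched as counting free huge
      pages only on nodes that both the cpuset and an MPOL_BIND policy allow.
      This is an illustration rather than the kernel routine: the function
      and parameter names are hypothetical, and the example assumes 2 MB huge
      pages, so the 4MB mapping needs two of them.

```python
def can_reserve(requested, free_per_node, cpuset_nodes, bind_nodes=None):
    """Return True if enough free huge pages exist on nodes the task may use."""
    allowed = set(cpuset_nodes)
    if bind_nodes is not None:  # MPOL_BIND narrows the usable nodes further
        allowed &= set(bind_nodes)
    available = sum(free_per_node.get(node, 0) for node in allowed)
    return requested <= available
```

      With one free huge page on each of nodes 0 and 1, can_reserve(2,
      {0: 1, 1: 1}, {0, 1}) succeeds, but adding bind_nodes={0} makes it
      fail, which corresponds to mmap failing up front instead of the task
      receiving SIGBUS later.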
      
      [akpm@linux-foundation.org: include sched.h for `current']
      Reported-by: default avatarJianchao Guo <guojianchao@bytedance.com>
      Suggested-by: default avatarMichal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ca39e68
    • Roman Gushchin's avatar
      kselftests: cgroup: add perpcu memory accounting test · 90631e1d
      Roman Gushchin authored
      Add a simple test to check the percpu memory accounting.  The test creates
      a cgroup tree with 1000 child cgroups and checks values of memory.current
      and memory.stat::percpu.
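
A rough sketch of the kind of comparison such a test makes, written in Python rather than the selftest's C. The helper names, the sample file contents and the tolerance are all assumptions made for illustration; only the `memory.current` and `memory.stat` `percpu` names come from the description above.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def percpu_matches_usage(current, stat_text, tolerance):
    """Check that the percpu counter explains (almost) all of memory.current.

    `tolerance` is a hypothetical allowance for counter batching error.
    """
    percpu = parse_memory_stat(stat_text).get("percpu", 0)
    return current > 0 and percpu > 0 and abs(current - percpu) < tolerance


# Made-up sample values, not output from a real system.
sample_stat = "anon 0\nfile 0\npercpu 4194304\n"
print(percpu_matches_usage(4198400, sample_stat, tolerance=65536))
```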
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200608230819.832349-6-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90631e1d
    • Roman Gushchin's avatar
      mm: memcg: charge memcg percpu memory to the parent cgroup · 3e38e0aa
      Roman Gushchin authored
      Memory cgroups are using large chunks of percpu memory to store vmstat
      data.  Yet this memory is not accounted at all, so in the case when there
      are many (dying) cgroups, it's not exactly clear where all the memory is.
      
      Because the size of memory cgroup internal structures can dramatically
      exceed the size of object or page which is pinning it in the memory, it's
      not a good idea to simply ignore it.  It actually breaks the isolation
      between cgroups.
      
      Let's account the consumed percpu memory to the parent cgroup.
      
      [guro@fb.com: add WARN_ON_ONCE()s, per Johannes]
        Link: http://lkml.kernel.org/r/20200811170611.GB1507044@carbon.DHCP.thefacebook.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623184515.4132564-5-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e38e0aa
    • Roman Gushchin's avatar
      mm: memcg/percpu: per-memcg percpu memory statistics · 772616b0
      Roman Gushchin authored
      Percpu memory can represent a noticeable chunk of the total memory
      consumption, especially on big machines with many CPUs.  Let's track
      percpu memory usage for each memcg and display it in memory.stat.
      
      A percpu allocation is usually scattered over multiple pages (and nodes),
      and can be significantly smaller than a page.  So let's add a byte-sized
      counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
      created for slabs can be perfectly reused for percpu case.
      
      [guro@fb.com: v3]
        Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      772616b0
    • Roman Gushchin's avatar
      mm: memcg/percpu: account percpu memory to memory cgroups · 3c7be18a
      Roman Gushchin authored
      Percpu memory is becoming more and more widely used by various subsystems,
      and the total amount of memory controlled by the percpu allocator can make
      a good part of the total memory.
      
      As an example, bpf maps can consume a lot of percpu memory, and they are
      created by a user.  Also, some cgroup internals (e.g.  memory controller
      statistics) can be quite large.  On a machine with many CPUs and big
      number of cgroups they can consume hundreds of megabytes.
      
      So the lack of memcg accounting is creating a breach in the memory
      isolation.  Similar to the slab memory, percpu memory should be accounted
      by default.
      
      To implement the percpu accounting it's possible to take the slab memory
      accounting as a model to follow.  Let's introduce two types of percpu
      chunks: root and memcg.  What makes memcg chunks different is an
      additional space allocated to store memcg membership information.  If
      __GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
      If it's possible to charge the corresponding size to the target memory
      cgroup, allocation is performed, and the memcg ownership data is recorded.
      System-wide allocations are performed using root chunks, so there is no
      additional memory overhead.
      
      To implement a fast reparenting of percpu memory on memcg removal, we
      don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
      introduced for slab accounting.
      
      [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
      [akpm@linux-foundation.org: move unreachable code, per Roman]
      [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
        Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c7be18a
    • Roman Gushchin's avatar
      percpu: return number of released bytes from pcpu_free_area() · 5b32af91
      Roman Gushchin authored
      Patch series "mm: memcg accounting of percpu memory", v3.
      
      This patchset adds percpu memory accounting to memory cgroups.  It's based
      on the rework of the slab controller and reuses concepts and features
      introduced for the per-object slab accounting.
      
      Percpu memory is becoming more and more widely used by various subsystems,
      and the total amount of memory controlled by the percpu allocator can make
      a good part of the total memory.
      
      As an example, bpf maps can consume a lot of percpu memory, and they are
      created by a user.  Also, some cgroup internals (e.g.  memory controller
      statistics) can be quite large.  On a machine with many CPUs and big
      number of cgroups they can consume hundreds of megabytes.
      
      So the lack of memcg accounting is creating a breach in the memory
      isolation.  Similar to the slab memory, percpu memory should be accounted
      by default.
      
      Percpu allocations by their nature are scattered over multiple pages, so
      they can't be tracked on the per-page basis.  So the per-object tracking
      introduced by the new slab controller is reused.
      
      The patchset implements charging of percpu allocations, adds memcg-level
      statistics, enables accounting for percpu allocations made by memory
      cgroup internals and provides some basic tests.
      
      To implement the accounting of percpu memory without a significant memory
      and performance overhead the following approach is used: all accounted
      allocations are placed into a separate percpu chunk (or chunks).  These
      chunks are similar to default chunks, except that they do have an attached
      vector of pointers to obj_cgroup objects, which is big enough to save a
      pointer for each allocated object.  On the allocation, if the allocation
      has to be accounted (__GFP_ACCOUNT is passed, the allocating process
      belongs to a non-root memory cgroup, etc), the memory cgroup is getting
      charged and if the maximum limit is not exceeded the allocation is
      performed using a memcg-aware chunk.  Otherwise -ENOMEM is returned or the
      allocation is forced over the limit, depending on gfp (as any other kernel
      memory allocation).  The memory cgroup information is saved in the
      obj_cgroup vector at the corresponding offset.  On the release time the
      memcg information is restored from the vector and the cgroup is getting
      uncharged.  Unaccounted allocations (at this point the absolute majority
      of all percpu allocations) are performed in the old way, so no additional
      overhead is expected.
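
The charge/record/uncharge flow described above can be modelled in a few lines. This is a toy sketch under stated assumptions, not the percpu allocator: the classes `Cgroup` and `AccountedChunk`, fixed-size slots, and the `LimitExceeded` exception (standing in for -ENOMEM) are all hypothetical.

```python
class LimitExceeded(Exception):
    """Stand-in for -ENOMEM in this model."""


class Cgroup:
    """Minimal memory cgroup: a usage counter with a hard limit."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.usage = 0

    def charge(self, size):
        if self.usage + size > self.limit:
            raise LimitExceeded(self.name)
        self.usage += size

    def uncharge(self, size):
        self.usage -= size


class AccountedChunk:
    """Chunk with a side vector recording the owner of each allocated slot."""

    def __init__(self, nr_slots, slot_size):
        self.slot_size = slot_size
        # Models the attached vector of obj_cgroup pointers, one per object.
        self.owners = [None] * nr_slots

    def alloc(self, cgroup):
        slot = self.owners.index(None)   # raises ValueError if the chunk is full
        cgroup.charge(self.slot_size)    # may raise LimitExceeded; slot stays free
        self.owners[slot] = cgroup       # save ownership at the slot's offset
        return slot

    def free(self, slot):
        owner = self.owners[slot]        # restore ownership from the vector
        self.owners[slot] = None
        owner.uncharge(self.slot_size)
        return self.slot_size            # number of bytes released


job = Cgroup("job", limit=64)
chunk = AccountedChunk(nr_slots=4, slot_size=32)
first = chunk.alloc(job)
second = chunk.alloc(job)
print(job.usage)            # both allocations are charged to the cgroup
print(chunk.free(first))    # freeing reports the released size and uncharges
print(job.usage)
```

Unaccounted allocations would simply use a chunk without the `owners` vector, which is why they carry no extra cost in this scheme.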
      
      To avoid pinning dying memory cgroups by outstanding allocations,
      obj_cgroup API is used instead of directly saving memory cgroup pointers.
      obj_cgroup is basically a pointer to a memory cgroup with a standalone
      reference counter.  The trick is that it can be atomically swapped to
      point at the parent cgroup, so that the original memory cgroup can be
      released before all of the objects that have been charged to it.  Because all
      charges and statistics are fully recursive, it's perfectly correct to
      uncharge the parent cgroup instead.  This scheme is used in the slab
      memory accounting, and percpu memory can just follow the scheme.
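
To see why uncharging the parent is correct, here is a minimal model of the indirection. The class names `MemCgroup` and `ObjCgroup` and their methods are made up for this sketch, and the atomicity and reference counting of the real obj_cgroup are not modelled.

```python
class MemCgroup:
    """Memory cgroup whose charges propagate to every ancestor."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.usage = 0

    def _self_and_ancestors(self):
        cg = self
        while cg is not None:
            yield cg
            cg = cg.parent

    def charge(self, size):
        for cg in self._self_and_ancestors():
            cg.usage += size

    def uncharge(self, size):
        for cg in self._self_and_ancestors():
            cg.usage -= size


class ObjCgroup:
    """Indirection object: allocations point here, not at the cgroup itself."""

    def __init__(self, memcg):
        self.memcg = memcg

    def reparent(self):
        # In the kernel this swap is atomic; here it is a plain assignment.
        self.memcg = self.memcg.parent


root = MemCgroup("root")
parent = MemCgroup("parent", parent=root)
child = MemCgroup("child", parent=parent)

objcg = ObjCgroup(child)
objcg.memcg.charge(128)       # allocation charged while the child is alive
objcg.reparent()              # child is removed; outstanding objects follow
objcg.memcg.uncharge(128)     # the free is now accounted against the parent

print(parent.usage, root.usage)
```

Because the original charge was already counted in every ancestor, removing it from the parent upward leaves the surviving hierarchy balanced, and the dying child no longer needs to stay pinned.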
      
      This patch (of 5):
      
      To implement accounting of percpu memory we need the information about the
      size of freed object.  Return it from pcpu_free_area().
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b32af91
  2. 11 Aug, 2020 3 commits
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux · 00e4db51
      Linus Torvalds authored
      Pull perf tools updates from Arnaldo Carvalho de Melo:
       "New features:
      
         - Introduce controlling how 'perf stat' and 'perf record' work via a
           control file descriptor, allowing starting with events configured
           but disabled until commands are received via the control file
           descriptor. This allows, for instance, tools such as Intel VTune
           to make further use of perf as its Linux platform driver.
      
         - Improve 'perf record' to register in a perf.data file header the
           clockid used to help later correlate things like syslog files and
           perf events recorded.
      
         - Add basic syscall and find_next_bit benchmarks to 'perf bench'.
      
         - Allow using computed metrics in calculating other metrics. For
           instance:
      
      	  {
      	    .metric_expr    = "l2_rqsts.demand_data_rd_hit + l2_rqsts.pf_hit + l2_rqsts.rfo_hit",
      	    .metric_name    = "DCache_L2_All_Hits",
      	  },
      	  {
      	    .metric_expr    = "max(l2_rqsts.all_demand_data_rd - l2_rqsts.demand_data_rd_hit, 0) + l2_rqsts.pf_miss + l2_rqsts.rfo_miss",
      	    .metric_name    = "DCache_L2_All_Miss",
      	  },
      	  {
      	     .metric_expr    = "dcache_l2_all_hits + dcache_l2_all_miss",
      	     .metric_name    = "DCache_L2_All",
      	  }
      
         - Add support for 'd_ratio', '>' and '<' operators to the expression
           resolver used in calculating metrics in 'perf stat'.
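
How a metric that refers to other metrics might be flattened can be sketched as follows. This is not perf's implementation: the `expand` helper, the regex-based substitution and the case-insensitive name lookup are assumptions, the last one inferred from the example above, where `dcache_l2_all_hits` refers to `DCache_L2_All_Hits`.

```python
import re

# Identifiers may contain dots, as in the event name l2_rqsts.pf_hit.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def expand(name, metrics, _stack=()):
    """Return the expression for `name` with nested metric names inlined.

    `metrics` maps lower-cased metric names to expression strings.
    """
    key = name.lower()
    if key in _stack:
        raise ValueError("recursive metric reference: " + name)

    def substitute(match):
        token = match.group(0)
        if token.lower() in metrics:
            # A reference to another metric: inline it, parenthesised.
            return "(" + expand(token, metrics, _stack + (key,)) + ")"
        return token  # plain event name or function such as max()

    return IDENT.sub(substitute, metrics[key])


metrics = {
    "dcache_l2_all_hits": "l2_rqsts.demand_data_rd_hit + l2_rqsts.pf_hit + l2_rqsts.rfo_hit",
    "dcache_l2_all_miss": "max(l2_rqsts.all_demand_data_rd - l2_rqsts.demand_data_rd_hit, 0)"
                          " + l2_rqsts.pf_miss + l2_rqsts.rfo_miss",
    "dcache_l2_all": "dcache_l2_all_hits + dcache_l2_all_miss",
}

print(expand("DCache_L2_All", metrics))
```

The `_stack` tuple guards against metrics that refer back to themselves, which would otherwise recurse forever.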
      
        Support for new kernel features:
      
         - Support TEXT_POKE and KSYMBOL_TYPE_OOL perf metadata events to cope
           with things like ftrace, trampolines, i.e. changes in the kernel
           text that gets in the way of properly decoding Intel PT hardware
           traces, for instance.
      
        Intel PT:
      
         - Add various knobs to reduce the volume of Intel PT traces by
           reducing the level of details such as decoding just some types of
           packets (e.g., FUP/TIP, PSB+), also filtering by time range.
      
         - Add new itrace options (log flags to the 'd' option, error flags to
           the 'e' one, etc), controlling how Intel PT is transformed into
           perf events, document some missing options (e.g., how to synthesize
           callchains).
      
        BPF:
      
         - Properly report BPF errors when parsing events.
      
         - Do not setup side-band events if LIBBPF is not linked, fixing a
           segfault.
      
        Libraries:
      
         - Improvements to the libtraceevent plugin mechanism.
      
         - Improve libtraceevent support for KVM trace events SVM exit reasons.
      
         - Add libtraceevent plugins for decoding syscalls/sys_enter_futex
           and for tlb_flush.
      
         - Ensure sample_period is set for libpfm4 events in 'perf test'.
      
         - Fixup libperf namespacing, to make sure what is in libperf has the
           perf_ namespace while what is now only in tools/perf/ doesn't use
           that prefix.
      
        Arch specific:
      
         - Improve the testing of vendor events and metrics in 'perf test'.
      
         - Allow no ARM CoreSight hardware tracer sink to be specified on
           command line.
      
         - Fix arm_spe_x recording when mixed with other perf events.
      
         - Add s390 idle functions 'psw_idle' and 'psw_idle_exit' to list of
           idle symbols.
      
         - List kernel supplied event aliases for arm64 in 'perf list'.
      
         - Add support for extended register capability in PowerPC 9 and 10.
      
         - Added nest IMC power9 metric events.
      
        Miscellaneous:
      
         - No need to setup sample_regs_intr/sample_regs_user for dummy
           events.
      
         - Update various copies of kernel headers, some causing perf to
           handle new syscalls, MSRs, etc.
      
         - Improve usage of flex and yacc, enabling warnings and addressing
           the fallout.
      
         - Add missing '--output' option to 'perf kmem' so that it can pass it
           along to 'perf record'.
      
         - 'perf probe' fixes related to adding multiple probes on the same
           address for the same event.
      
         - Make 'perf probe' warn if the target function is a GNU indirect
           function.
      
         - Remove //anon mmap events from 'perf inject jit' to fix supporting
           both using ELF files for generated functions and the perf-PID.map
           approaches"
      
      * tag 'perf-tools-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (144 commits)
        perf record: Skip side-band event setup if HAVE_LIBBPF_SUPPORT is not set
        perf tools powerpc: Add support for extended regs in power10
        perf tools powerpc: Add support for extended register capability
        tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
        tools arch x86: Sync asm/cpufeatures.h with the kernel sources
        tools arch x86: Sync the msr-index.h copy with the kernel sources
        tools headers UAPI: update linux/in.h copy
        tools headers API: Update close_range affected files
        perf script: Add 'tod' field to display time of day
        perf script: Change the 'enum perf_output_field' enumerators to be 64 bits
        perf data: Add support to store time of day in CTF data conversion
        perf tools: Move clockid_res_ns under clock struct
        perf header: Store clock references for -k/--clockid option
        perf tools: Add clockid_name function
        perf clockid: Move parse_clockid() to new clockid object
        tools lib traceevent: Handle possible strdup() error in tep_add_plugin_path() API
        libtraceevent: Fixed description of tep_add_plugin_path() API
        libtraceevent: Fixed type in PRINT_FMT_STING
        libtraceevent: Fixed broken indentation in parse_ip4_print_args()
        libtraceevent: Improve error handling of tep_plugin_add_option() API
        ...
      00e4db51
    • Linus Torvalds's avatar
      Merge tag 'ktest-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest · ed3854ff
      Linus Torvalds authored
      Pull ktest updates from Steven Rostedt:
      
       - Have config-bisect save the good/bad configs at each step.
      
       - Show log file location even on success
      
       - Add PRE_TEST_DIE to kill test if the PRE_TEST fails
      
       - Add a NOT operator for conditionals in config file
      
       - Add the log output of the last test when emailing on failure.
      
       - Other minor clean ups and small fixes.
      
      * tag 'ktest-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
        ktest.pl: Fix spelling mistake "Cant" -> "Can't"
        ktest.pl: Change the logic to control the size of the log file emailed
        ktest.pl: Add MAIL_MAX_SIZE to limit the amount of log emailed
        ktest.pl: Add the log of last test in email on failure
        ktest.pl: Turn off buffering to the log file
        ktest.pl: Just open up the log file once
        ktest.pl: Add a NOT operator
        ktest.pl: Define PRE_TEST_DIE to kill the test if the PRE_TEST fails
        ktest.pl: Always show log file location if defined even on success
        ktest.pl: Have config-bisect save each config used in the bisect
      ed3854ff
    • Linus Torvalds's avatar
      Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 97d052ea
      Linus Torvalds authored
      Pull locking updates from Thomas Gleixner:
       "A set of locking fixes and updates:
      
         - Untangle the header spaghetti which causes build failures in
           various situations caused by the lockdep additions to seqcount to
           validate that the write side critical sections are non-preemptible.
      
         - The seqcount associated lock debug addons which were blocked by the
           above fallout.
      
           seqcount writers contrary to seqlock writers must be externally
           serialized, which usually happens via locking - except for strict
           per CPU seqcounts. As the lock is not part of the seqcount, lockdep
           cannot validate that the lock is held.
      
           This new debug mechanism adds the concept of associated locks.
           Sequence counts now have lock type variants and corresponding
           initializers which take a pointer to the associated lock used for
           writer serialization. If lockdep is enabled the pointer is stored
           and write_seqcount_begin() has a lockdep assertion to validate that
           the lock is held.
      
           Aside of the type and the initializer no other code changes are
           required at the seqcount usage sites. The rest of the seqcount API
           is unchanged and determines the type at compile time with the help
           of _Generic which is possible now that the minimal GCC version has
           been moved up.
      
           Adding this lockdep coverage unearthed a handful of seqcount bugs
           which have been addressed already independent of this.
      
           While generally useful this comes with a Trojan Horse twist: On RT
           kernels the write side critical section can become preemptible if
           the writers are serialized by an associated lock, which leads to
           the well known reader preempts writer livelock. RT prevents this by
           storing the associated lock pointer independent of lockdep in the
           seqcount and changing the reader side to block on the lock when a
           reader detects that a writer is in the write side critical section.
      
         - Conversion of seqcount usage sites to associated types and
           initializers"
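
The associated-lock idea can be illustrated with a toy sequence counter. This is a single-threaded sketch, not the kernel API: the class `SeqcountWithLock` and its methods are hypothetical, and `threading.Lock.locked()` only tells us that somebody holds the lock, which is weaker than what lockdep verifies.

```python
import threading


class SeqcountWithLock:
    """Sequence counter that remembers which lock serializes its writers."""

    def __init__(self, lock):
        self.sequence = 0
        self.lock = lock  # associated lock, stored at initialization time

    def write_begin(self):
        # Models the lockdep assertion: writers must hold the associated lock.
        if not self.lock.locked():
            raise RuntimeError("associated lock not held in write section")
        self.sequence += 1  # odd value: a write is in progress

    def write_end(self):
        self.sequence += 1  # even value: data is stable again

    def read_begin(self):
        return self.sequence

    def read_retry(self, start):
        # Retry if a write was in progress or completed during the read.
        return start % 2 == 1 or self.sequence != start


lock = threading.Lock()
seq = SeqcountWithLock(lock)
data = {"a": 0, "b": 0}

with lock:                      # correct usage: writer holds the lock
    seq.write_begin()
    data["a"], data["b"] = 1, 1
    seq.write_end()

start = seq.read_begin()
snapshot = dict(data)
print(snapshot, seq.read_retry(start))

try:
    seq.write_begin()           # buggy usage: lock not held
except RuntimeError as err:
    print("caught:", err)
```

Only the type and the initializer carry the lock; the read and write calls keep their usual shape, which mirrors the point made above about usage sites needing no other changes.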
      
      * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
        locking/seqlock, headers: Untangle the spaghetti monster
        locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
        x86/headers: Remove APIC headers from <asm/smp.h>
        seqcount: More consistent seqprop names
        seqcount: Compress SEQCNT_LOCKNAME_ZERO()
        seqlock: Fold seqcount_LOCKNAME_init() definition
        seqlock: Fold seqcount_LOCKNAME_t definition
        seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
        hrtimer: Use sequence counter with associated raw spinlock
        kvm/eventfd: Use sequence counter with associated spinlock
        userfaultfd: Use sequence counter with associated spinlock
        NFSv4: Use sequence counter with associated spinlock
        iocost: Use sequence counter with associated spinlock
        raid5: Use sequence counter with associated spinlock
        vfs: Use sequence counter with associated spinlock
        timekeeping: Use sequence counter with associated raw spinlock
        xfrm: policy: Use sequence counters with associated lock
        netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
        netfilter: conntrack: Use sequence counter with associated spinlock
        sched: tasks: Use sequence counter with associated spinlock
        ...
      97d052ea