25 Oct, 2023 (40 commits)
    • mm: convert wp_page_reuse() and finish_mkwrite_fault() to take a folio · a86bc96b
      Kefeng Wang authored
      Saves one compound_head() call, also in preparation for
      page_cpupid_xchg_last() conversion.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-18-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a86bc96b
    • mm: make finish_mkwrite_fault() static · c08b7e38
      Kefeng Wang authored
      Make finish_mkwrite_fault static since it is not used outside of
      memory.c.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-17-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c08b7e38
    • mm: huge_memory: use folio_xchg_last_cpupid() in __split_huge_page_tail() · c8253011
      Kefeng Wang authored
      Convert to use folio_xchg_last_cpupid() in __split_huge_page_tail().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-16-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c8253011
    • mm: migrate: use folio_xchg_last_cpupid() in folio_migrate_flags() · 4e694fe4
      Kefeng Wang authored
      Convert to use folio_xchg_last_cpupid() in folio_migrate_flags(), also
      directly use folio_nid() instead of page_to_nid(&folio->page).
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-15-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4e694fe4
    • sched/fair: use folio_xchg_last_cpupid() in should_numa_migrate_memory() · 1b143cc7
      Kefeng Wang authored
      Convert to use folio_xchg_last_cpupid() in should_numa_migrate_memory().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-14-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1b143cc7
    • mm: add folio_xchg_last_cpupid() · 136d0b47
      Kefeng Wang authored
      Add folio_xchg_last_cpupid() wrapper, which is required to convert
      page_cpupid_xchg_last() to a folio version later in the series.
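
      As a rough illustration, such a wrapper can simply forward to the existing
      page helper (a minimal sketch, not necessarily the exact upstream code):

      static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
      {
      	return page_cpupid_xchg_last(&folio->page, cpupid);
      }

      The folio_xchg_access_time() and folio_last_cpupid() wrappers added
      elsewhere in the series are expected to follow the same pattern.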
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-13-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      136d0b47
    • mm: remove xchg_page_access_time() · f3930843
      Kefeng Wang authored
      Since all calls use folio_xchg_access_time(), remove
      xchg_page_access_time().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-12-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f3930843
    • mm: huge_memory: use a folio in change_huge_pmd() · d986ba2b
      Kefeng Wang authored
      Use a folio in change_huge_pmd(), which helps to remove the last
      xchg_page_access_time() caller.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-11-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d986ba2b
    • mm: mprotect: use a folio in change_pte_range() · ec177880
      Kefeng Wang authored
      Use a folio in change_pte_range() to save three compound_head() calls.
      Since only normal and PMD-mapped pages are currently handled by NUMA
      balancing, it is enough to only update the entire folio's access time.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-10-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec177880
    • sched/fair: use folio_xchg_access_time() in numa_hint_fault_latency() · 0b201c36
      Kefeng Wang authored
      Convert to use folio_xchg_access_time() in numa_hint_fault_latency().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-9-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b201c36
    • mm: add folio_xchg_access_time() · 55c19938
      Kefeng Wang authored
      Add folio_xchg_access_time() wrapper, which is required to convert
      xchg_page_access_time() to a folio version later in the series.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-8-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55c19938
    • mm: remove page_cpupid_last() · f39eac30
      Kefeng Wang authored
      Since all calls use folio_last_cpupid(), remove page_cpupid_last().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-7-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f39eac30
    • mm: huge_memory: use folio_last_cpupid() in __split_huge_page_tail() · 19c1ac02
      Kefeng Wang authored
      Convert to use folio_last_cpupid() in __split_huge_page_tail().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      19c1ac02
    • mm: huge_memory: use folio_last_cpupid() in do_huge_pmd_numa_page() · c4a8d2fa
      Kefeng Wang authored
      Convert to use folio_last_cpupid() in do_huge_pmd_numa_page().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c4a8d2fa
    • mm: memory: use folio_last_cpupid() in do_numa_page() · 67b33e3f
      Kefeng Wang authored
      Convert to use folio_last_cpupid() in do_numa_page().
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      67b33e3f
    • mm: add folio_last_cpupid() · 155c98cf
      Kefeng Wang authored
      Add folio_last_cpupid() wrapper, which is required to convert
      page_cpupid_last() to a folio version later in the series.
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      155c98cf
    • mm_types: add virtual and _last_cpupid into struct folio · 1d44f2e6
      Kefeng Wang authored
      Patch series "mm: convert page cpupid functions to folios", v3.
      
      The cpupid (or access time) used by NUMA balancing is stored in the page
      flags or in page->_last_cpupid (if LAST_CPUPID_NOT_IN_PAGE_FLAGS).  This
      series converts the page cpupid to a folio cpupid: a new _last_cpupid
      field is added to struct folio, which lets us use folio->_last_cpupid
      directly, and the page cpupid functions are converted to folio ones.
      
        page_cpupid_last()		-> folio_last_cpupid()
        xchg_page_access_time()	-> folio_xchg_access_time()
        page_cpupid_xchg_last()	-> folio_xchg_last_cpupid()
      
      
      This patch (of 19):
      
      If WANT_PAGE_VIRTUAL and LAST_CPUPID_NOT_IN_PAGE_FLAGS are defined,
      'virtual' and '_last_cpupid' are in struct page.  Since _last_cpupid is
      used by the NUMA balancing feature, it is better to move it before the
      KMSAN metadata in struct page.  Also add both fields to struct folio so
      that they can be accessed from the folio directly.
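
      A sketch of the kind of addition to struct folio (the placement and the
      surrounding ifdefs are illustrative; the real layout has to stay
      compatible with struct page):

      struct folio {
      	/* ... existing page-compatible fields ... */
      #if defined(WANT_PAGE_VIRTUAL)
      	void *virtual;			/* mirrors page->virtual */
      #endif
      #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
      	int _last_cpupid;		/* mirrors page->_last_cpupid */
      #endif
      	/* ... */
      };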
      
      Link: https://lkml.kernel.org/r/20231018140806.2783514-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20231018140806.2783514-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1d44f2e6
    • mm/swap: avoid a xa load for swapout path · e5b306a0
      Kairui Song authored
      A variable is never used on the swapout path (shadowp is NULL), and the
      compiler is unable to optimize out the unneeded load since it's a
      function call.
      
      This was introduced by commit 3852f676 ("mm/swapcache: support to handle the
      shadow entries").
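
      A sketch of the kind of change (in add_to_swap_cache(), simplified, with
      locking and error handling omitted): only do the xarray load when the
      caller actually asked for the shadow entry.

      	/* Before, xas_load() ran unconditionally even when shadowp == NULL. */
      	if (shadowp) {
      		old = xas_load(&xas);
      		if (xa_is_value(old))
      			*shadowp = old;
      	}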
      
      Link: https://lkml.kernel.org/r/20231017011728.37508-1-ryncsn@gmail.com
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e5b306a0
    • mm: kmem: reimplement get_obj_cgroup_from_current() · e56808fe
      Roman Gushchin authored
      Reimplement get_obj_cgroup_from_current() using current_obj_cgroup(). 
      get_obj_cgroup_from_current() and current_obj_cgroup() share 80% of the
      code, so the new implementation is almost trivial.
      
      get_obj_cgroup_from_current() is a convenient function used by the
      bpf subsystem, so there is no reason to get rid of it completely.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-7-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e56808fe
    • percpu: scoped objcg protection · c63b835d
      Roman Gushchin authored
      Similar to slab and kmem, switch to a scope-based protection of the objcg
      pointer to avoid extra operations with the objcg reference counter.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-6-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c63b835d
    • mm: kmem: scoped objcg protection · e86828e5
      Roman Gushchin authored
      Switch to a scope-based protection of the objcg pointer on slab/kmem
      allocation paths.  Instead of using the get_() semantics in the
      pre-allocation hook and put the reference afterwards, let's rely on the
      fact that objcg is pinned by the scope.
      
      It's possible because:
      1) if the objcg is received from the current task struct, the task is
         keeping a reference to the objcg.
      2) if the objcg is received from an active memcg (remote charging),
         the memcg is pinned by the scope and has a reference to the
         corresponding objcg.
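
      As a rough illustration of the idea (the helper name below is hypothetical
      and the hook is heavily simplified), the pre-allocation path can borrow
      the objcg without a get/put pair:

      /* Hypothetical sketch, not the upstream hook. */
      static inline struct obj_cgroup *kmem_prealloc_objcg(gfp_t flags, size_t bytes)
      {
      	/*
      	 * The pointer returned by current_obj_cgroup() is pinned either by
      	 * the current task or by the active-memcg scope for the duration of
      	 * the allocation, so no reference is taken and no put is needed.
      	 */
      	struct obj_cgroup *objcg = current_obj_cgroup();

      	if (!objcg)
      		return NULL;			/* no accounting needed */
      	if (obj_cgroup_charge(objcg, flags, bytes))
      		return ERR_PTR(-ENOMEM);	/* charge failed */
      	return objcg;				/* borrowed, not referenced */
      }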
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-5-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e86828e5
    • mm: kmem: make memcg keep a reference to the original objcg · 675d6c9b
      Roman Gushchin authored
      Keep a reference to the original objcg object for the entire life of a
      memcg structure.
      
      This allows simplifying the synchronization on the kernel memory
      allocation paths: pinning a (live) memcg will also pin the corresponding
      objcg.
      
      The memory overhead of this change is minimal because object cgroups
      usually outlive their corresponding memory cgroups even without this
      change, so it's only an additional pointer per memcg.
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-4-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      675d6c9b
    • mm: kmem: add direct objcg pointer to task_struct · 1aacbd35
      Roman Gushchin authored
      To charge a freshly allocated kernel object to a memory cgroup, the kernel
      needs to obtain an objcg pointer.  Currently it does it indirectly by
      obtaining the memcg pointer first and then calling to
      __get_obj_cgroup_from_memcg().
      
      Usually tasks spend their entire life belonging to the same object cgroup.
      So it makes sense to save the objcg pointer on task_struct directly, so
      it can be obtained faster.  It requires some work on fork, exit and cgroup
      migrate paths, but these paths are way colder.
      
      To avoid any costly synchronization the following rules are applied:
      1) A task sets its objcg pointer itself.
      
      2) If a task is being migrated to another cgroup, the least
         significant bit of the objcg pointer is set atomically.
      
      3) On the allocation path the objcg pointer is obtained locklessly
         using the READ_ONCE() macro and the least significant bit is
         checked. If it's set, the following procedure is used to update
         it locklessly:
              - task->objcg is zeroed using cmpxchg
              - new objcg pointer is obtained
              - task->objcg is updated using try_cmpxchg
              - operation is repeated if try_cmpxchg fails
         It guarantees that no updates will be lost if task migration
         is racing against an objcg pointer update.  It also allows keeping
         both read and write paths fully lockless.
      
      Because the task is keeping a reference to the objcg, it can't go away
      while the task is alive.
      
      This commit doesn't change the way the remote memcg charging works.
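
      A condensed sketch of the update protocol described above (the helper name
      and the flag handling are illustrative, not a copy of the upstream code):

      #define OBJCG_UPDATE_BIT	0x1UL	/* set on cgroup migration */

      static struct obj_cgroup *task_objcg_update(struct task_struct *tsk)
      {
      	struct obj_cgroup *old, *objcg;

      	do {
      		/* Zero task->objcg (clearing the update bit) atomically. */
      		old = xchg(&tsk->objcg, NULL);
      		if (old) {
      			old = (struct obj_cgroup *)
      				((unsigned long)old & ~OBJCG_UPDATE_BIT);
      			if (old)
      				obj_cgroup_put(old);	/* drop the task's reference */
      			old = NULL;
      		}

      		/* Obtain the new objcg pointer from the task's memcg. */
      		rcu_read_lock();
      		objcg = rcu_dereference(mem_cgroup_from_task(tsk)->objcg);
      		if (objcg && !obj_cgroup_tryget(objcg))
      			objcg = NULL;
      		rcu_read_unlock();

      		/* Re-install; retry if a concurrent migration set the bit again. */
      	} while (!try_cmpxchg(&tsk->objcg, &old, objcg));

      	return objcg;
      }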
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-3-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1aacbd35
    • mm: kmem: optimize get_obj_cgroup_from_current() · 7d0715d0
      Roman Gushchin authored
      Patch series "mm: improve performance of accounted kernel memory
      allocations", v5.
      
      This patchset improves the performance of accounted kernel memory
      allocations by ~30% as measured by a micro-benchmark [1].  The benchmark
      is very straightforward: 1M of 64 bytes-large kmalloc() allocations.
      
      Below are results with kernel memory accounting disabled, with the
      original code, and with this patchset applied.
      
      |             | Kmem disabled | Original | Patched |  Delta |
      |-------------+---------------+----------+---------+--------|
      | User cgroup |         29764 |    84548 |   59078 | -30.0% |
      | Root cgroup |         29742 |    48342 |   31501 | -34.8% |
      
      As we can see, the patchset removes the majority of the overhead when
      there is no actual accounting (a task belongs to the root memory cgroup)
      and almost halves the accounting overhead otherwise.
      
      The main idea is to get rid of unnecessary memcg to objcg conversions and
      switch to a scope-based protection of objcgs, which eliminates extra
      operations with objcg reference counters under a rcu read lock.  More
      details are provided in individual commit descriptions.
      
      
      This patch (of 5):
      
      Manually inline memcg_kmem_bypass() and active_memcg() to speed up
      get_obj_cgroup_from_current() by avoiding duplicate in_task() checks and
      active_memcg() readings.
      
      Also add a likely() macro to __get_obj_cgroup_from_memcg():
      obj_cgroup_tryget() should succeed at almost all times except a very
      unlikely race with the memcg deletion path.
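
      The likely() part of the change might look roughly like this (a sketch of
      the lookup loop, not the exact diff; the caller holds rcu_read_lock()):

      	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
      		objcg = rcu_dereference(memcg->objcg);
      		/* The tryget only fails during a racing memcg deletion. */
      		if (likely(objcg && obj_cgroup_tryget(objcg)))
      			break;
      		objcg = NULL;
      	}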
      
      Link: https://lkml.kernel.org/r/20231019225346.1822282-1-roman.gushchin@linux.dev
      Link: https://lkml.kernel.org/r/20231019225346.1822282-2-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d0715d0
    • mm, pcp: reduce detecting time of consecutive high order page freeing · 6ccdcb6d
      Huang Ying authored
      In current PCP auto-tuning design, if the number of pages allocated is
      much more than that of pages freed on a CPU, the PCP high may become the
      maximal value even if the allocating/freeing depth is small, for example,
      in the sender of network workloads.  If a CPU that was originally used
      as a sender is then used as a receiver after a context switch, the whole
      PCP has to be filled up to the maximal high before PCP draining is
      triggered for consecutive high-order freeing.  This will hurt the
      performance of some network workloads.
      
      To solve the issue, in this patch, we will track the consecutive page
      freeing with a counter instead of relying on PCP draining.  So, we can
      detect consecutive page freeing much earlier.
      
      On a 2-socket Intel server with 128 logical CPU, we tested
      SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes. 
      With the patch, the network bandwidth improves 5.0%.  This restores the
      performance drop caused by PCP auto-tuning.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-10-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6ccdcb6d
    • mm, pcp: decrease PCP high if free pages < high watermark · 57c0419c
      Huang Ying authored
      One target of PCP is to minimize pages in PCP if the system's free pages
      are too few.  To reach that target, when page reclaiming is active for the
      zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in allocating
      path, decrease PCP high and free some pages in freeing path.  But this may
      be too late because the background page reclaiming may introduce latency
      for some workloads.  So, in this patch, during page allocation we will
      detect whether the number of free pages of the zone is below high
      watermark.  If so, we will stop increasing PCP high in allocating path,
      decrease PCP high and free some pages in freeing path.  With this, we can
      reduce the possibility of the premature background page reclaiming caused
      by too large PCP.
      
      The high watermark checking is done in allocating path to reduce the
      overhead in hotter freeing path.
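
      A sketch of the allocation-path check (the zone flag name and the exact
      placement are assumptions for illustration):

      	/*
      	 * Mark the zone when its free pages drop below the high watermark,
      	 * so the freeing path trims pcp->high instead of letting it grow.
      	 */
      	if (zone_page_state(zone, NR_FREE_PAGES) < high_wmark_pages(zone))
      		set_bit(ZONE_BELOW_HIGH, &zone->flags);
      	else
      		clear_bit(ZONE_BELOW_HIGH, &zone->flags);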
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-9-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      57c0419c
    • mm: tune PCP high automatically · 51a755c5
      Huang Ying authored
      The target to tune PCP high automatically is as follows,
      
      - Minimize allocation/freeing from/to shared zone
      
      - Minimize idle pages in PCP
      
      - Minimize pages in PCP if the system's free pages are too few
      
      To reach these targets, the following tuning algorithm is designed,
      
      - When we refill PCP via allocating from the zone, increase PCP high.
        Because if we had a larger PCP, we could avoid allocating from the
        zone.
      
      - In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
        decrease PCP high to try to free possible idle PCP pages.
      
      - When page reclaiming is active for the zone, stop increasing PCP
        high in allocating path, decrease PCP high and free some pages in
        freeing path.
      
      So, the PCP high can be tuned to the page allocating/freeing depth of
      workloads eventually.
      
      One issue of the algorithm is that if the number of pages allocated is
      much more than that of pages freed on a CPU, the PCP high may become the
      maximal value even if the allocating/freeing depth is small.  But this
      isn't a severe issue, because there are no idle pages in this case.
      
      One alternative choice is to increase PCP high when we drain PCP via
      trying to free pages to the zone, but don't increase PCP high during PCP
      refilling.  This can avoid the issue above.  But if the number of pages
      allocated is much less than that of pages freed on a CPU, there will be
      many idle pages in PCP and it is hard to free these idle pages.
      
      1/8 (>> 3) of PCP high will be decreased periodically.  The value 1/8 is
      kind of arbitrary.  Just to make sure that the idle PCP pages will be
      freed eventually.
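
      A sketch of the periodic decay step run from the vmstat kworker (field and
      helper usage follow the framework patch; the exact code may differ):

      	int high = READ_ONCE(pcp->high);

      	/* Decay pcp->high by 1/8 toward high_min so idle pages get freed. */
      	if (high > pcp->high_min)
      		WRITE_ONCE(pcp->high, max(high - (high >> 3), pcp->high_min));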
      
      On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
      in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
      kbuild server that is used by 0-Day kbuild service.  With the patch, the
      build time decreases 3.5%.  The cycles% of the spinlock contention (mostly
      for zone lock) decreases from 11.0% to 0.5%.  The number of PCP draining
      for high order pages freeing (free_high) decreases 65.6%.  The number of
      pages allocated from zone (instead of from PCP) decreases 83.9%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      51a755c5
    • mm: add framework for PCP high auto-tuning · 90b41691
      Huang Ying authored
      The page allocation performance requirements of different workloads are
      usually different.  So, we need to tune PCP (per-CPU pageset) high to
      optimize the workload page allocation performance.  Now, we have a system
      wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand.
      But, it's hard to find out the best value by hand.  And one global
      configuration may not work best for the different workloads that run on
      the same system.  One solution to these issues is to tune PCP high of each
      CPU automatically.
      
      This patch adds the framework for PCP high auto-tuning.  With it,
      pcp->high of each CPU will be changed automatically by tuning algorithm at
      runtime.  The minimal high (pcp->high_min) is the original PCP high value
      calculated based on the low watermark pages.  While the maximal high
      (pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction
      sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION.  That is, the
      maximal pcp->high that can be set via sysctl knob by hand.
      
      It's possible that PCP high auto-tuning doesn't work well for some
      workloads.  So, when PCP high is tuned by hand via the sysctl knob, the
      auto-tuning will be disabled.  The PCP high set by hand will be used
      instead.
      
      This patch only adds the framework, so pcp->high will be set to
      pcp->high_min (original default) always.  We will add actual auto-tuning
      algorithm in the following patches in the series.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-7-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      90b41691
    • mm, page_alloc: scale the number of pages that are batch allocated · c0a24239
      Huang Ying authored
      When a task is allocating a large number of order-0 pages, it may acquire
      the zone->lock multiple times allocating pages in batches.  This may
      unnecessarily contend on the zone lock when allocating a very large number
      of pages.  This patch adapts the size of the batch based on the recent
      pattern to scale the batch size for subsequent allocations.
      
      On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
      in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
      kbuild server that is used by 0-Day kbuild service.  With the patch, the
      cycles% of the spinlock contention (mostly for zone lock) decreases from
      12.6% to 11.0% (with PCP size == 367).
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-6-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c0a24239
    • mm: restrict the pcp batch scale factor to avoid too long latency · 52166607
      Huang Ying authored
      In page allocator, PCP (Per-CPU Pageset) is refilled and drained in
      batches to increase page allocation throughput, reduce page
      allocation/freeing latency per page, and reduce zone lock contention.  But
      too large batch size will cause too long maximal allocation/freeing
      latency, which may punish arbitrary users.  So the default batch size is
      chosen carefully (in zone_batchsize(), the value is 63 for zone > 1GB) to
      avoid that.
      
      In commit 3b12e7e9 ("mm/page_alloc: scale the number of pages that are
      batch freed"), the batch size will be scaled for large number of page
      freeing to improve page freeing performance and reduce zone lock
      contention.  Similar optimization can be used for large number of pages
      allocation too.
      
      To find out a suitable max batch scale factor (that is, max effective
      batch size), some tests and measurement on some machines were done as
      follows.
      
      A set of debug patches are implemented as follows,
      
      - Set PCP high to be 2 * batch to reduce the effect of PCP high
      
      - Disable free batch size scaling to get the raw performance.
      
      - The code with zone lock held is extracted from rmqueue_bulk() and
        free_pcppages_bulk() to 2 separate functions to make it easy to
        measure the function run time with ftrace function_graph tracer.
      
      - The batch size is hard coded to be 63 (default), 127, 255, 511,
        1023, 2047, 4095.
      
      Then will-it-scale/page_fault1 is used to generate the page
      allocation/freeing workload.  The page allocation/freeing throughput
      (page/s) is measured via will-it-scale.  The page allocation/freeing
      average latency (alloc/free latency avg, in us) and allocation/freeing
      latency at 99 percentile (alloc/free latency 99%, in us) are measured with
      ftrace function_graph tracer.
      
      The test results are as follows,
      
      Sapphire Rapids Server
      ======================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	513633.4	 2.33		 3.57		 2.67		  6.83
       127	517616.7	 4.35		 6.65		 4.22		 13.03
       255	520822.8	 8.29		13.32		 7.52		 25.24
       511	524122.0	15.79		23.42		14.02		 49.35
      1023	525980.5	30.25		44.19		25.36		 94.88
      2047	526793.6	59.39		84.50		45.22		140.81
      
      Ice Lake Server
      ===============
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	620210.3	 2.21		 3.68		 2.02		 4.35
       127	627003.0	 4.09		 6.86		 3.51		 8.28
       255	630777.5	 7.70		13.50		 6.17		15.97
       511	633651.5	14.85		22.62		11.66		31.08
      1023	637071.1	28.55		42.02		20.81		54.36
      2047	638089.7	56.54		84.06		39.28		91.68
      
      Cascade Lake Server
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	404706.7	 3.29		  5.03		 3.53		  4.75
       127	422475.2	 6.12		  9.09		 6.36		  8.76
       255	411522.2	11.68		 16.97		10.90		 16.39
       511	428124.1	22.54		 31.28		19.86		 32.25
      1023	414718.4	43.39		 62.52		40.00		 66.33
      2047	429848.7	86.64		120.34		71.14		106.08
      
      Comet Lake Desktop
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	795183.13	 2.18		 3.55		 2.03		 3.05
       127	803067.85	 3.91		 6.56		 3.85		 5.52
       255	812771.10	 7.35		10.80		 7.14		10.20
       511	817723.48	14.17		27.54		13.43		30.31
      1023	818870.19	27.72		40.10		27.89		46.28
      
      Coffee Lake Desktop
      ===================
      Batch	throughput	free latency	free latency	alloc latency	alloc latency
      	page/s		avg / us	99% / us	avg / us	99% / us
      -----	----------	------------	------------	-------------	-------------
        63	510542.8	 3.13		  4.40		 2.48		 3.43
       127	514288.6	 5.97		  7.89		 4.65		 6.04
       255	516889.7	11.86		 15.58		 8.96		12.55
       511	519802.4	23.10		 28.81		16.95		26.19
      1023	520802.7	45.30		 52.51		33.19		45.95
      2047	519997.1	90.63		104.00		65.26		81.74
      
      From the above data, to restrict the allocation/freeing latency to be less
      than 100 us most of the time, the max batch scale factor needs to be less
      than or equal to 5.

      Although it is reasonable to use 5 as the max batch scale factor for the
      systems tested, there are also slower systems, where a smaller value
      should be used to constrain the page allocation/freeing latency.
      
      So, in this patch, a new kconfig option (PCP_BATCH_SCALE_MAX) is added to
      set the max batch scale factor.  Its default value is 5, and users can
      reduce it when necessary.
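
      Conceptually, the cap works like the following sketch (illustrative, not
      the exact upstream code):

      	/* Never let scaling exceed batch << CONFIG_PCP_BATCH_SCALE_MAX. */
      	if (pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
      		pcp->free_factor++;
      	batch <<= pcp->free_factor;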
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-5-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      52166607
    • mm, pcp: reduce lock contention for draining high-order pages · 362d37a1
      Huang Ying authored
      In commit f26b3fa0 ("mm/page_alloc: limit number of high-order pages
      on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
      PCP is mostly used for high-order pages freeing to improve the cache-hot
      pages reusing between page allocating and freeing CPUs.
      
      On a system with a small per-CPU data cache slice, pages shouldn't be
      cached before draining, to guarantee that the reused pages are still
      cache-hot.  But on a system with a large per-CPU data cache slice, some
      pages can be cached before draining to reduce zone lock contention.
      
      So, in this patch, instead of draining without any caching, "pcp->batch"
      pages will be cached in PCP before draining if the size of the per-CPU
      data cache slice is more than "3 * batch".
      
      In theory, if the size of per-CPU data cache slice is more than "2 *
      batch", we can reuse cache-hot pages between CPUs.  But considering the
      other usage of cache (code, other data accessing, etc.), "3 * batch" is
      used.
      
      Note: "3 * batch" is chosen to make sure the optimization works on recent
      x86_64 server CPUs.  If you want to increase it, please check whether it
      breaks the optimization.
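
      A sketch of the setup-time decision (the flag and helper names are
      assumptions for illustration):

      	/*
      	 * Allow keeping pcp->batch pages cached before a free_high drain
      	 * only when the per-CPU share of the data cache can hold more than
      	 * 3 batches worth of pages.
      	 */
      	if (per_cpu_data_slice_size(cpu) > 3 * batch * PAGE_SIZE)
      		pcp->flags |= PCPF_FREE_HIGH_BATCH;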
      
      On a 2-socket Intel server with 128 logical CPU, with the patch, the
      network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite
      with 16-pair processes increases 70.5%.  The cycles% of the spinlock
      contention (mostly for zone lock) decreases from 46.1% to 21.3%.  The
      number of PCP draining for high order pages freeing (free_high) decreases
      89.9%.  The cache miss rate keeps 0.2%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      362d37a1
    • cacheinfo: calculate size of per-CPU data cache slice · 94a3bfe4
      Huang Ying authored
      This can be used to estimate the size of the data cache slice that can be
      used by one CPU under ideal circumstances.  Both DATA caches and UNIFIED
      caches are used in calculation.  So, the users need to consider the impact
      of the code cache usage.
      
      Because the cache inclusive/non-inclusive information isn't available now,
      we just use the size of the per-CPU slice of LLC to make the result more
      predictable across architectures.  This may be improved when more cache
      information is available in the future.
      
      A brute-force algorithm that iterates over all online CPUs is used to
      avoid allocating an extra cpumask, especially in the CPU offline callback.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      94a3bfe4
    • mm, pcp: avoid to drain PCP when process exit · ca71fe1a
      Huang Ying authored
      Patch series "mm: PCP high auto-tuning", v3.
      
      The page allocation performance requirements of different workloads are
      often different.  So, we need to tune the PCP (Per-CPU Pageset) high on
      each CPU automatically to optimize the page allocation performance.
      
      The list of patches in series is as follows,
      
      [1/9] mm, pcp: avoid to drain PCP when process exit
      [2/9] cacheinfo: calculate per-CPU data cache size
      [3/9] mm, pcp: reduce lock contention for draining high-order pages
      [4/9] mm: restrict the pcp batch scale factor to avoid too long latency
      [5/9] mm, page_alloc: scale the number of pages that are batch allocated
      [6/9] mm: add framework for PCP high auto-tuning
      [7/9] mm: tune PCP high automatically
      [8/9] mm, pcp: decrease PCP high if free pages < high watermark
      [9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
      
      Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive
      high-order pages freeing.
      
      Patch [4/9], [5/9] optimize batch freeing and allocating.
      
      Patch [6/9], [7/9], [8/9] implement and optimize a PCP high
      auto-tuning method.
      
      Patch [9/9] optimize the PCP draining for consecutive high order page
      freeing based on PCP high auto-tuning.
      
      The test results for patches with performance impact are as follows,
      
      kbuild
      ======
      
      On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
      in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
      kbuild server that is used by 0-Day kbuild service.
      
      	build time   lock contend%	free_high	alloc_zone
      	----------	----------	---------	----------
      base	     100.0	      14.0          100.0            100.0
      patch1	      99.5	      12.8	     19.5	      95.6
      patch3	      99.4	      12.6	      7.1	      95.6
      patch5	      98.6	      11.0	      8.1	      97.1
      patch7	      95.1	       0.5	      2.8	      15.6
      patch9	      95.0	       1.0	      8.8	      20.0
      
      The PCP draining optimization (patch [1/9], [3/9]) and PCP batch
      allocation optimization (patch [5/9]) reduces zone lock contention a
      little.  The PCP high auto-tuning (patch [7/9], [9/9]) reduces build time
      visibly.  The tuning target, the number of pages allocated from the zone,
      is reduced greatly, so the zone contention cycles% is reduced greatly.
      
      With PCP tuning patches (patch [7/9], [9/9]), the average used memory
      during test increases up to 18.4% because more pages are cached in PCP. 
      But at the end of the test, the amount of used memory decreases to the
      same level as that of the base patch.  That is, the pages cached in PCP
      will be released to zone after not being used actively.
      
      netperf SCTP_STREAM_MANY
      ========================
      
      On a 2-socket Intel server with 128 logical CPU, we tested
      SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.
      
      	     score   lock contend%	free_high	alloc_zone  cache miss rate%
      	     -----	----------	---------	----------  ----------------
      base	     100.0	       2.1          100.0            100.0	         1.3
      patch1	      99.4	       2.1	     99.4	      99.4		 1.3
      patch3	     106.4	       1.3	     13.3	     106.3		 1.3
      patch5	     106.0	       1.2	     13.2	     105.9		 1.3
      patch7	     103.4	       1.9	      6.7	      90.3		 7.6
      patch9	     108.6	       1.3	     13.7	     108.6		 1.3
      
      The PCP draining optimization (patch [1/9]+[3/9]) improves performance. 
      The PCP high auto-tuning (patch [7/9]) reduces performance a little
      because PCP draining cannot be triggered in time sometimes.  So, the cache
      miss rate% increases.  The further PCP draining optimization (patch [9/9])
      based on PCP tuning restore the performance.
      
      lmbench3 UNIX (AF_UNIX)
      =======================
      
      On a 2-socket Intel server with 128 logical CPU, we tested UNIX
      (AF_UNIX socket) test case of lmbench3 test suite with 16-pair
      processes.
      
      	     score   lock contend%	free_high	alloc_zone  cache miss rate%
      	     -----	----------	---------	----------  ----------------
      base	     100.0	      51.4          100.0            100.0	         0.2
      patch1	     116.8	      46.1           69.5	     104.3	         0.2
      patch3	     199.1	      21.3            7.0	     104.9	         0.2
      patch5	     200.0	      20.8            7.1	     106.9	         0.3
      patch7	     191.6	      19.9            6.8	     103.8	         2.8
      patch9	     193.4	      21.7            7.0	     104.7	         2.1
      
      The PCP draining optimization (patch [1/9], [3/9]) improves performance
      much.  The PCP tuning (patch [7/9]) reduces performance a little because
      PCP draining cannot be triggered in time sometimes.  The further PCP
      draining optimization (patch [9/9]) based on PCP tuning restores the
      performance partly.
      
      The patchset adds several fields in struct per_cpu_pages.  The struct
      layout before/after the patchset is as follows,
      
      base
      ====
      
      struct per_cpu_pages {
      	spinlock_t                 lock;                 /*     0     4 */
      	int                        count;                /*     4     4 */
      	int                        high;                 /*     8     4 */
      	int                        batch;                /*    12     4 */
      	short int                  free_factor;          /*    16     2 */
      	short int                  expire;               /*    18     2 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct list_head           lists[13];            /*    24   208 */
      
      	/* size: 256, cachelines: 4, members: 7 */
      	/* sum members: 228, holes: 1, sum holes: 4 */
      	/* padding: 24 */
      } __attribute__((__aligned__(64)));
      
      patched
      =======
      
      struct per_cpu_pages {
      	spinlock_t                 lock;                 /*     0     4 */
      	int                        count;                /*     4     4 */
      	int                        high;                 /*     8     4 */
      	int                        high_min;             /*    12     4 */
      	int                        high_max;             /*    16     4 */
      	int                        batch;                /*    20     4 */
      	u8                         flags;                /*    24     1 */
      	u8                         alloc_factor;         /*    25     1 */
      	u8                         expire;               /*    26     1 */
      
      	/* XXX 1 byte hole, try to pack */
      
      	short int                  free_count;           /*    28     2 */
      
      	/* XXX 2 bytes hole, try to pack */
      
      	struct list_head           lists[13];            /*    32   208 */
      
      	/* size: 256, cachelines: 4, members: 11 */
      	/* sum members: 237, holes: 2, sum holes: 3 */
      	/* padding: 16 */
      } __attribute__((__aligned__(64)));
      
      The size of the struct doesn't change with the patchset.
      
      
      This patch (of 9):
      
      In commit f26b3fa0 ("mm/page_alloc: limit number of high-order pages
      on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when
      PCP is mostly used for high-order pages freeing to improve the cache-hot
      pages reusing between page allocation and freeing CPUs.
      
      But, the PCP draining mechanism may be triggered unexpectedly when process
      exits.  With some customized trace point, it was found that PCP draining
      (free_high == true) was triggered with the order-1 page freeing with the
      following call stack,
      
       => free_unref_page_commit
       => free_unref_page
       => __mmdrop
       => exit_mm
       => do_exit
       => do_group_exit
       => __x64_sys_exit_group
       => do_syscall_64
      
      Checking the source code, this is the page table PGD freeing
      (mm_free_pgd()).  It's an order-1 page freeing if
      CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
      security.
      
      Just before that, page freeing with the following call stack was found,
      
       => free_unref_page_commit
       => free_unref_page_list
       => release_pages
       => tlb_batch_pages_flush
       => tlb_finish_mmu
       => exit_mmap
       => __mmput
       => exit_mm
       => do_exit
       => do_group_exit
       => __x64_sys_exit_group
       => do_syscall_64
      
      So, when a process exits,
      
      - a large number of the process's user pages are freed without any page
        allocation, so it is highly likely that pcp->free_factor becomes > 0.
        In fact, this is expected behavior that improves process exit
        performance.
      
      - after all user pages are freed, the PGD is freed, which is an order-1
        page freeing, so the PCP is drained.
      
      All in all, when a process exits, it is highly likely that the PCP will
      be drained.  This is unexpected behavior.
      
      To avoid this, with this patch, PCP draining is only triggered after 2
      consecutive high-order page freeings, as sketched below.
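      
      A minimal sketch of the idea, not the actual kernel code: the names
      below (pcp_state, prev_free_was_high, should_drain_pcp) are hypothetical
      and only illustrate "drain only after two consecutive high-order
      freeings".
      
      #include <stdbool.h>
      
      struct pcp_state {
      	bool prev_free_was_high;	/* was the previous free high-order? */
      	int  free_factor;		/* > 0 when freeing dominates allocation */
      };
      
      /* Decide whether this free should drain the PCP (free_high). */
      static bool should_drain_pcp(struct pcp_state *pcp, unsigned int order)
      {
      	bool high_order = order > 0;	/* order-0 frees never trigger draining here */
      	bool free_high = pcp->free_factor > 0 && high_order &&
      			 pcp->prev_free_was_high;
      
      	/* Remember whether this free was high-order for the next call. */
      	pcp->prev_free_was_high = high_order;
      	return free_high;
      }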
      
      On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
      instances in parallel (each with `make -j 28`) in 8 cgroups.  This
      simulates the kbuild server used by the 0-Day kbuild service.  With the
      patch, the cycles% of the spinlock contention (mostly for zone lock)
      decreases from 14.0% to 12.8% (with PCP size == 367).  The number of PCP
      drains for high-order page freeing (free_high) decreases by 80.5%.
      
      This helps network workloads too through reduced zone lock contention.
      On a 2-socket Intel server with 128 logical CPUs, with the patch, the
      network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
      suite with 16 pairs of processes increases by 16.8%.  The cycles% of the
      spinlock contention (mostly for zone lock) decreases from 51.4% to 46.1%.
      The number of PCP drains for high-order page freeing (free_high)
      decreases by 30.5%.  The cache miss rate stays at 0.2%.
      
      Link: https://lkml.kernel.org/r/20231016053002.756205-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20231016053002.756205-2-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ca71fe1a
    • Kairui Song's avatar
      mm/oom_killer: simplify OOM killer info dump helper · 1f4f7f0f
      Kairui Song authored
      There is only one caller that wants to dump the kill victim info, so
      just let it call the standalone helper; there is no need to make the
      generic info dump helper take an extra argument for that.
      
      Result of bloat-o-meter:
      ./scripts/bloat-o-meter ./mm/oom_kill.old.o ./mm/oom_kill.o
      add/remove: 0/0 grow/shrink: 1/2 up/down: 131/-142 (-11)
      Function                                     old     new   delta
      oom_kill_process                             412     543    +131
      out_of_memory                               1422    1418      -4
      dump_header                                  562     424    -138
      Total: Before=21514, After=21503, chg -0.05%
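      
      A hypothetical sketch of the refactor's shape; the names below
      (dump_generic_oom_info, dump_victim_info, kill_victim_example) are
      illustrative only and are not the actual mm/oom_kill.c interfaces.
      
      struct oom_control;
      struct task_struct;
      
      void dump_generic_oom_info(struct oom_control *oc);	/* no victim argument */
      void dump_victim_info(struct oom_control *oc, struct task_struct *victim);
      
      static void kill_victim_example(struct oom_control *oc,
      				struct task_struct *victim)
      {
      	dump_generic_oom_info(oc);	/* shared info dump path */
      	dump_victim_info(oc, victim);	/* only this caller needs victim details */
      	/* ... send SIGKILL, reap, etc. ... */
      }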
      
      Link: https://lkml.kernel.org/r/20231016113103.86477-1-ryncsn@gmail.comSigned-off-by: default avatarKairui Song <kasong@tencent.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1f4f7f0f
    • Pedro Falcato's avatar
      mm: kmsan: panic on failure to allocate early boot metadata · 09aec5f9
      Pedro Falcato authored
      Given large enough allocations and a machine with low enough memory
      (e.g., a default QEMU VM), it's entirely possible that
      kmsan_init_alloc_meta_for_range's shadow+origin allocation fails.
      
      Instead of eating a NULL deref kernel oops, check explicitly for
      memblock_alloc() failure and panic with a nice error message.
      
      Alexander Potapenko said:
      
      For posterity, it is generally quite important for the allocated shadow
      and origin to be contiguous, otherwise an unaligned memory write may
      result in memory corruption (the corresponding unaligned shadow write will
      be assuming that shadow pages are adjacent).  So instead of panicking we
      could have split the range into smaller ones until the allocation
      succeeds, but that would've led to hard-to-debug problems in the future.
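      
      A minimal sketch of the fix's shape, assuming the usual memblock API;
      the helper name alloc_meta_or_die is hypothetical, while the real check
      lives in kmsan_init_alloc_meta_for_range:
      
      #include <linux/init.h>
      #include <linux/kernel.h>
      #include <linux/memblock.h>
      
      static void __init alloc_meta_or_die(u64 size)
      {
      	/* Shadow and origin metadata come from memblock at early boot. */
      	void *shadow = memblock_alloc(size, PAGE_SIZE);
      	void *origin = memblock_alloc(size, PAGE_SIZE);
      
      	/* Panic with a clear message instead of hitting a NULL deref later. */
      	if (!shadow || !origin)
      		panic("%s: failed to allocate %llu bytes of KMSAN metadata\n",
      		      __func__, (unsigned long long)size);
      
      	/* ... record/map the shadow and origin for the covered range ... */
      }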
      
      Link: https://lkml.kernel.org/r/20231016153446.132763-1-pedro.falcato@gmail.comSigned-off-by: default avatarPedro Falcato <pedro.falcato@gmail.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      09aec5f9
    • Matthew Wilcox (Oracle)'s avatar
      buffer: remove folio_create_empty_buffers() · 0a88810d
      Matthew Wilcox (Oracle) authored
      With all users converted, remove the old create_empty_buffers() and rename
      folio_create_empty_buffers() to create_empty_buffers().
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-28-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0a88810d
    • Matthew Wilcox (Oracle)'s avatar
      ufs: remove ufs_get_locked_page() · c9f2480e
      Matthew Wilcox (Oracle) authored
      Both callers are now converted to ufs_get_locked_folio().
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-27-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c9f2480e
    • Matthew Wilcox (Oracle)'s avatar
      ufs: convert ufs_change_blocknr() to use folios · c7e8812c
      Matthew Wilcox (Oracle) authored
      Convert the locked_page argument to a folio, then use folios throughout. 
      Saves three hidden calls to compound_head().
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-26-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c7e8812c
    • Matthew Wilcox (Oracle)'s avatar
      ufs: use ufs_get_locked_folio() in ufs_alloc_lastblock() · e7ca7f17
      Matthew Wilcox (Oracle) authored
      Switch to the folio APIs, saving one folio->page->folio conversion.
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-25-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e7ca7f17
    • Matthew Wilcox (Oracle)'s avatar
      ufs: add ufs_get_locked_folio and ufs_put_locked_folio · 5fb7bd50
      Matthew Wilcox (Oracle) authored
      Convert the _page variants to call them.  Saves a few hidden calls to
      compound_head().
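      
      A sketch of the wrapper pattern this describes, assuming the legacy
      helper keeps its page-based signature and that the folio helper returns
      NULL on failure (not necessarily the exact ufs code):
      
      #include <linux/pagemap.h>
      
      struct folio *ufs_get_locked_folio(struct address_space *mapping, pgoff_t index);
      
      struct page *ufs_get_locked_page(struct address_space *mapping, pgoff_t index)
      {
      	struct folio *folio = ufs_get_locked_folio(mapping, index);
      
      	if (!folio)
      		return NULL;	/* assumed failure convention of the folio helper */
      	return folio_file_page(folio, index);
      }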
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-24-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5fb7bd50