1. 27 Sep, 2022 40 commits
    • mm/demotion: demote pages according to allocation fallback order · 32008027
      Jagdish Gediya authored
      Currently, a higher tier node can only be demoted to selected nodes on the
      next lower tier as defined by the demotion path.  This strict demotion
      order does not work in all use cases (e.g.  some use cases may want to
      allow cross-socket demotion to another node in the same demotion tier as a
      fallback when the preferred demotion node is out of space).  This demotion
      order is also inconsistent with the page allocation fallback order when
      all the nodes in a higher tier are out of space: The page allocation can
      fall back to any node from any lower tier, whereas the demotion order
      doesn't allow that currently.
      
      This patch adds support to get all the allowed demotion targets for a
      memory tier.  The demote_page_list() function is modified to use this
      allowed node mask as the fallback allocation mask.
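
      As a rough illustration of the idea (not the kernel implementation), the
      sketch below models tiers as sets of node IDs ordered from fastest to
      slowest.  The function names and the free_pages mapping are hypothetical
      and exist only for this example.

```python
from typing import Dict, List, Optional, Set


def allowed_demotion_targets(tiers: List[Set[int]]) -> Dict[int, Set[int]]:
    """Map each node to every node in any tier below its own.

    `tiers` is ordered from fastest (index 0) to slowest (assumption).
    """
    allowed: Dict[int, Set[int]] = {}
    lower: Set[int] = set()
    # Walk from the slowest tier upward, accumulating the nodes seen so far.
    for tier in reversed(tiers):
        for node in tier:
            allowed[node] = set(lower)
        lower |= tier
    return allowed


def pick_demotion_node(preferred: int, allowed: Set[int],
                       free_pages: Dict[int, int]) -> Optional[int]:
    """Try the preferred target first, then fall back to any allowed node."""
    if free_pages.get(preferred, 0) > 0:
        return preferred
    for node in sorted(allowed):
        if free_pages.get(node, 0) > 0:
            return node
    return None  # no allowed node has space
```

      With tiers [{0, 1}, {2, 3}, {4}], node 0 may fall back to node 3 or node 4
      when its preferred target, node 2, is out of space.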
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-9-aneesh.kumar@linux.ibm.com
      Signed-off-by: Jagdish Gediya <jvgediya.oss@gmail.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion: drop memtier from memtype · b26ac6f3
      Aneesh Kumar K.V authored
      Now that we track node-specific memtier in pg_data_t, we can drop memtier
      from memtype.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-8-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion: add pg_data_t member to track node memory tier details · 7766cf7a
      Aneesh Kumar K.V authored
      Also update the various helpers to use NODE_DATA()->memtier.  Since the
      node-specific memtier can change when a NUMA node is reassigned to a
      different memory tier, accessing NODE_DATA()->memtier needs to happen
      under an rcu read lock or memory_tier_lock.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-7-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion: build demotion targets based on explicit memory tiers · 6c542ab7
      Aneesh Kumar K.V authored
      This patch switches the demotion target building logic to use memory tiers
      instead of NUMA distance.  All N_MEMORY NUMA nodes will be placed in the
      default memory tier and additional memory tiers will be added by drivers
      like dax kmem.
      
      This patch builds the demotion target for a NUMA node by looking at all
      memory tiers below the tier to which the NUMA node belongs.  The closest
      node in the immediately following memory tier is used as a demotion
      target.
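
      The selection rule described above can be sketched as follows.  This is a
      simplified model, not the kernel code: tiers are sets of node IDs ordered
      from fastest to slowest, distance is a NUMA distance matrix, and keeping
      all equally close nodes as targets is an assumption made for this example.

```python
from typing import Dict, List, Set


def build_demotion_targets(tiers: List[Set[int]],
                           distance: List[List[int]]) -> Dict[int, Set[int]]:
    """For each node, pick the closest node(s) in the next lower tier."""
    targets: Dict[int, Set[int]] = {}
    for idx, tier in enumerate(tiers):
        # The immediately following tier, or nothing for the slowest tier.
        lower = tiers[idx + 1] if idx + 1 < len(tiers) else set()
        for node in tier:
            if not lower:
                targets[node] = set()  # nowhere left to demote to
                continue
            best = min(distance[node][t] for t in lower)
            # Keep every node that ties for the shortest distance.
            targets[node] = {t for t in lower if distance[node][t] == best}
    return targets
```

      For a two-socket system with DRAM nodes 0 and 1 and slower nodes 2 and 3,
      each DRAM node ends up targeting the slower node on its own socket.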
      
      Since we are now only building demotion targets for N_MEMORY NUMA nodes,
      the CPU hotplug calls are removed in this patch.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE · 7b88bda3
      Aneesh Kumar K.V authored
      By default, all nodes are assigned to the default memory tier, which is the
      memory tier designated for nodes with DRAM.
      
      Set the dax kmem device node's tier to a slower memory tier by assigning
      its abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE.  Low-level drivers
      like papr_scm or ACPI NFIT can initialize memory device type to a more
      accurate value based on device tree details or HMAT.  If the kernel
      doesn't find the memory type initialized, a default slower memory type is
      assigned by the kmem driver.
      
      [aneesh.kumar@linux.ibm.com: assign correct memory type for multiple dax devices with the same node affinity]
        Link: https://lkml.kernel.org/r/20220826100224.542312-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-5-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion: add hotplug callbacks to handle new numa node onlined · c6123a19
      Aneesh Kumar K.V authored
      If a newly onlined NUMA node doesn't have an abstract distance assigned,
      the kernel adds the NUMA node to the default memory tier.
      
      [aneesh.kumar@linux.ibm.com: fix kernel error with memory hotplug]
        Link: https://lkml.kernel.org/r/20220825092019.379069-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-4-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion: move memory demotion related code · 91952440
      Aneesh Kumar K.V authored
      This moves memory demotion related code to mm/memory-tiers.c.  No
      functional change in this patch.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion: add support for explicit memory tiers · 992bf775
      Aneesh Kumar K.V authored
      Patch series "mm/demotion: Memory tiers and demotion", v15.
      
      The current kernel has the basic memory tiering support: Inactive pages on
      a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
      node to make room for new allocations on the higher tier NUMA node. 
      Frequently accessed pages on a lower tier NUMA node can be migrated
      (promoted) to a higher tier NUMA node to improve the performance.
      
      In the current kernel, memory tiers are defined implicitly via a demotion
      path relationship between NUMA nodes, which is created during the kernel
      initialization and updated when a NUMA node is hot-added or hot-removed. 
      The current implementation puts all nodes with CPU into the highest tier,
      and builds the tier hierarchy tier-by-tier by establishing the per-node
      demotion targets based on the distances between nodes.
      
      This current memory tier kernel implementation needs to be improved for
      several important use cases:
      
      * The current tier initialization code always initializes each
        memory-only NUMA node into a lower tier.  But a memory-only NUMA node
        may have a high performance memory device (e.g.  a DRAM-backed
        memory-only node on a virtual machine) and that should be put into a
        higher tier.
      
      * The current tier hierarchy always puts CPU nodes into the top tier. 
        But on a system with HBM (e.g.  GPU memory) devices, these memory-only
        HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
        better placed in the next lower tier.
      
      * Also because the current tier hierarchy always puts CPU nodes into the
        top tier, when a CPU is hot-added (or hot-removed) and triggers a memory
        node from CPU-less into a CPU node (or vice versa), the memory tier
        hierarchy gets changed, even though no memory node is added or removed. 
        This can make the tier hierarchy unstable and make it difficult to
        support tier-based memory accounting.
      
      * A higher tier node can only be demoted to nodes with the shortest distance
        on the next lower tier as defined by the demotion path, not any other
        node from any lower tier.  This strict demotion order does not work in
        all use cases (e.g.  some use cases may want to allow cross-socket
        demotion to another node in the same demotion tier as a fallback when
        the preferred demotion node is out of space), and has resulted in the
        feature request for an interface to override the system-wide, per-node
        demotion order from the userspace.  This demotion order is also
        inconsistent with the page allocation fallback order when all the nodes
        in a higher tier are out of space: The page allocation can fall back to
        any node from any lower tier, whereas the demotion order doesn't allow
        that.
      
      This patch series makes the creation of memory tiers explicit, under the
      control of device drivers.
      
      Memory Tier Initialization
      ==========================
      
      Linux kernel presents memory devices as NUMA nodes and each memory device
      is of a specific type.  The memory type of a device is represented by its
      abstract distance.  A memory tier corresponds to a range of abstract
      distance.  This allows for classifying memory devices with a specific
      performance range into a memory tier.
      
      By default, all memory nodes are assigned to the default tier with
      abstract distance 512.
      
      A device driver can move its memory nodes from the default tier.  For
      example, PMEM can move its memory nodes below the default tier, whereas
      GPU can move its memory nodes above the default tier.
      
      The kernel initialization code makes the decision on which exact tier a
      memory node should be assigned to based on the requests from the device
      drivers as well as the memory device hardware information provided by the
      firmware.
      
      Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
      
      
      This patch (of 10):
      
      In the current kernel, memory tiers are defined implicitly via a demotion
      path relationship between NUMA nodes, which is created during the kernel
      initialization and updated when a NUMA node is hot-added or hot-removed. 
      The current implementation puts all nodes with CPU into the highest tier,
      and builds the tier hierarchy by establishing the per-node demotion
      targets based on the distances between nodes.
      
      This current memory tier kernel implementation needs to be improved for
      several important use cases:
      
      The current tier initialization code always initializes each memory-only
      NUMA node into a lower tier.  But a memory-only NUMA node may have a high
      performance memory device (e.g.  a DRAM-backed memory-only node on a
      virtual machine) that should be put into a higher tier.
      
      The current tier hierarchy always puts CPU nodes into the top tier.  But
      on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
      these devices should be in the top tier, and DRAM nodes with CPUs are
      better placed in the next lower tier.
      
      With the current kernel, a higher tier node can only be demoted to nodes with
      the shortest distance on the next lower tier as defined by the demotion path,
      not any other node from any lower tier.  This strict demotion order does
      not work in all use cases (e.g.  some use cases may want to allow
      cross-socket demotion to another node in the same demotion tier as a
      fallback when the preferred demotion node is out of space).  This demotion
      order is also inconsistent with the page allocation fallback order when
      all the nodes in a higher tier are out of space: The page allocation can
      fall back to any node from any lower tier, whereas the demotion order
      doesn't allow that.
      
      This patch series addresses the above by defining memory tiers explicitly.
      
      Linux kernel presents memory devices as NUMA nodes and each memory device
      is of a specific type.  The memory type of a device is represented by its
      abstract distance.  A memory tier corresponds to a range of abstract
      distance.  This allows for classifying memory devices with a specific
      performance range into a memory tier.
      
      This patch configures the range/chunk size to be 128.  The default DRAM
      abstract distance is 512.  We can have 4 memory tiers below the default
      DRAM with abstract distance ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
      Faster memory devices can be placed in these faster (higher) memory tiers.
      Slower memory devices like persistent memory will have abstract distance
      higher than the default DRAM level.
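
      The arithmetic behind these ranges can be checked with a short sketch.  It
      uses only the two values stated above (chunk size 128, DRAM abstract
      distance 512); the constant and function names are hypothetical and are
      not taken from the patch.

```python
from typing import Tuple

CHUNK_SIZE = 128       # abstract distance range covered by one memory tier
DRAM_ADISTANCE = 512   # default abstract distance for DRAM


def tier_range(adistance: int) -> Tuple[int, int]:
    """Return the inclusive abstract distance range of the tier holding `adistance`."""
    start = (adistance // CHUNK_SIZE) * CHUNK_SIZE
    return start, start + CHUNK_SIZE - 1


# Tiers that are faster than the default DRAM tier.
faster_tiers = [tier_range(start) for start in range(0, DRAM_ADISTANCE, CHUNK_SIZE)]
```

      Any device whose abstract distance is 640 or more falls outside the DRAM
      range computed here and therefore lands in a slower tier.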
      
      [akpm@linux-foundation.org: fix comment, per Aneesh]
      Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: design doc · 8be976a0
      Yu Zhao authored
      Add a design doc.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: admin guide · 07017acb
      Yu Zhao authored
      Add an admin guide.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: debugfs interface · d6c3af7d
      Yu Zhao authored
      Add /sys/kernel/debug/lru_gen for working set estimation and proactive
      reclaim.  These techniques are commonly used to optimize job scheduling
      (bin packing) in data centers [1][2].
      
      Compared with the page table-based approach and the PFN-based
      approach, this lruvec-based approach has the following advantages:
      1. It offers better choices because it is aware of memcgs, NUMA nodes,
         shared mappings and unmapped page cache.
      2. It is more scalable because it is O(nr_hot_pages), whereas the
         PFN-based approach is O(nr_total_pages).
      
      Add /sys/kernel/debug/lru_gen_full for debugging.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-13-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: thrashing prevention · 1332a809
      Yu Zhao authored
      Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
      requested by many desktop users [1].
      
      When set to value N, it prevents the working set of N milliseconds from
      getting evicted.  The OOM killer is triggered if this working set cannot
      be kept in memory.  Based on the average human detectable lag (~100ms),
      N=1000 usually eliminates intolerable lags due to thrashing.  Larger
      values like N=3000 make lags less noticeable at the risk of premature OOM
      kills.
      
      Compared with the size-based approach [2], this time-based approach
      has the following advantages:
      
      1. It is easier to configure because it is agnostic to applications
         and memory sizes.
      2. It is more reliable because it is directly wired to the OOM killer.
      
      [1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
      [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-12-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: kill switch · 354ed597
      Yu Zhao authored
      Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
      can be disabled include:
        0x0001: the multi-gen LRU core
        0x0002: walking page table, when arch_has_hw_pte_young() returns
                true
        0x0004: clearing the accessed bit in non-leaf PMD entries, when
                CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
        [yYnN]: apply to all the components above
      E.g.,
        echo y >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0007
        echo 5 >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0005
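
      A minimal sketch of how a value read back from this file could be decoded,
      assuming only the bit assignments listed above.  The table and function
      names are illustrative and not part of the patch.

```python
from typing import List

# Bit values as listed in the commit message; descriptions are paraphrased.
COMPONENTS = {
    0x0001: "multi-gen LRU core",
    0x0002: "page table walks",
    0x0004: "clearing the accessed bit in non-leaf PMD entries",
}


def decode_enabled(text: str) -> List[str]:
    """Turn a hex string such as '0x0005' into the list of enabled components."""
    mask = int(text.strip(), 16)  # base 16 accepts the leading '0x'
    return [name for bit, name in COMPONENTS.items() if mask & bit]
```

      For the second example above, decode_enabled("0x0005") reports the core
      and the non-leaf PMD handling, with page table walks disabled.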
      
      NB: the page table walks happen on the scale of seconds under heavy memory
      pressure, in which case the mmap_lock contention is a lesser concern,
      compared with the LRU lock contention and the I/O congestion.  So far the
      only well-known case of the mmap_lock contention happens on Android, due
      to Scudo [1] which allocates several thousand VMAs for merely a few
      hundred MBs.  The SPF and the Maple Tree also have provided their own
      assessments [2][3].  However, if walking page tables does worsen the
      mmap_lock contention, the kill switch can be used to disable it.  In this
      case the multi-gen LRU will suffer a minor performance degradation, as
      shown previously.
      
      Clearing the accessed bit in non-leaf PMD entries can also be disabled,
      since this behavior was not tested on x86 varieties other than Intel and
      AMD.
      
      [1] https://source.android.com/devices/tech/debug/scudo
      [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
      [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: optimize multiple memcgs · f76c8337
      Yu Zhao authored
      When multiple memcgs are available, it is possible to use generations as a
      frame of reference to make better choices and improve overall performance
      under global memory pressure.  This patch adds a basic optimization to
      select memcgs that can drop single-use unmapped clean pages first.  Doing
      so reduces the chance of going into the aging path or swapping, which can
      be costly.
      
      A typical example that benefits from this optimization is a server running
      mixed types of workloads, e.g., heavy anon workload in one memcg and heavy
      buffered I/O workload in the other.
      
      Though this optimization can be applied to both kswapd and direct reclaim,
      it is only added to kswapd to keep the patchset manageable.  Later
      improvements may cover the direct reclaim path.
      
      While ensuring certain fairness to all eligible memcgs, proportional scans
      of individual memcgs also require proper backoff to avoid overshooting
      their aggregate reclaim target by too much.  Otherwise it can cause high
      direct reclaim latency.  The conditions for backoff are:
      
      1. At low priorities, for direct reclaim, if aging fairness or direct
         reclaim latency is at risk, i.e., aging one memcg multiple times or
         swapping after the target is met.
      2. At high priorities, for global reclaim, if per-zone free pages are
         above respective watermarks.
      
      Server benchmark results:
        Mixed workloads:
          fio (buffered I/O): +[19, 21]%
                      IOPS         BW
            patch1-8: 1880k        7343MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[119, 123]%
                      Ops/sec      KB/sec
            patch1-8: 862768.65    33514.68
            patch1-9: 1911022.12   74234.54
      
        Mixed workloads:
          fio (buffered I/O): +[75, 77]%
                      IOPS         BW
            5.19-rc1: 1279k        4996MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[13, 15]%
                      Ops/sec      KB/sec
            5.19-rc1: 1673524.04   65008.87
            patch1-9: 1911022.12   74234.54
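
      The bracketed ranges can be cross-checked from the raw numbers.  A
      minimal sketch, assuming each range brackets the relative change between
      the two rows beneath it; pct_change() is a hypothetical helper and not
      part of the patchset.

```c
#include <assert.h>

/* Relative change from before to after, in percent. */
double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

      For example, pct_change(1880, 2252) for the fio IOPS of patch1-8 versus
      patch1-9 comes out between 19 and 21, in line with the reported range.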
      
        Configurations:
          (changes since patch 6)
      
          cat mixed.sh
          modprobe brd rd_nr=2 rd_size=56623104
      
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          mkfs.ext4 /dev/ram1
          mount -t ext4 /dev/ram1 /mnt
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=90m --group_reporting &
          pid=$!
      
          sleep 200
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
          kill -INT $pid
          wait
      
      Client benchmark results:
        no change (CONFIG_MEMCG=n)
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: support page table walks · bd74fdae
      Yu Zhao authored
      To further exploit spatial locality, the aging prefers to walk page tables
      to search for young PTEs and promote hot pages.  A kill switch will be
      added in the next patch to disable this behavior.  When disabled, the
      aging relies on the rmap only.
      
      NB: this behavior has nothing in common with the page table scanning in
      the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
      pages to swapcache and unmaps them.
      
      To avoid confusion, the term "iteration" specifically means the traversal
      of an entire mm_struct list; the term "walk" will be applied to page
      tables and the rmap, as usual.
      
      An mm_struct list is maintained for each memcg, and an mm_struct follows
      its owner task to the new memcg when this task is migrated.  Given an
      lruvec, the aging iterates lruvec_memcg()->mm_list and calls
      walk_page_range() with each mm_struct on this list to promote hot pages
      before it increments max_seq.
      
      When multiple page table walkers iterate the same list, each of them gets
      a unique mm_struct; therefore they can run concurrently.  Page table
      walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
      pages it left in the previous memcg will not be promoted when its current
      memcg is under reclaim.  Similarly, page table walkers will not promote
      pages from nodes other than the one under reclaim.
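
      The property that each walker gets a unique mm_struct can be pictured
      with a shared cursor.  The sketch below swaps in a C11 atomic counter
      over an array index for whatever locking and list handling the patch
      actually uses; struct mm_cursor, mm_cursor_init() and mm_cursor_next()
      are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mm_cursor {
    atomic_size_t next;  /* index of the next unclaimed entry */
    size_t nr;           /* number of entries on the list */
};

void mm_cursor_init(struct mm_cursor *c, size_t nr)
{
    assert(c);
    atomic_init(&c->next, 0);
    c->nr = nr;
}

/*
 * Claim the next entry.  Concurrent callers never receive the same index
 * because the fetch-and-add is atomic.  Returns -1 once the iteration over
 * the whole list is complete.
 */
long mm_cursor_next(struct mm_cursor *c)
{
    size_t i = atomic_fetch_add(&c->next, 1);

    return i < c->nr ? (long)i : -1;
}
```

      Each walker would call mm_cursor_next() in a loop and walk the page
      tables of the entry it claimed until -1 is returned.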
      
      This patch uses the following optimizations when walking page tables:
      1. It tracks the usage of mm_struct's between context switches so that
         page table walkers can skip processes that have been sleeping since
         the last iteration.
      2. It uses generational Bloom filters to record populated branches so
         that page table walkers can reduce their search space based on the
         query results, e.g., to skip page tables containing mostly holes or
         misplaced pages.
      3. It takes advantage of the accessed bit in non-leaf PMD entries when
         CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
      4. It does not zigzag between a PGD table and the same PMD table
         spanning multiple VMAs. IOW, it finishes all the VMAs within the
         range of the same PMD table before it returns to a PGD table. This
         improves the cache performance for workloads that have large
         numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
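
      Optimization 2 can be sketched as a pair of small Bloom filters that
      alternate by generation: one is queried during the current walk while
      the other is repopulated for the next.  Everything concrete below is an
      assumption for illustration (the filter size, the multiplicative hash,
      the two-filter rotation and all names); it is not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_SHIFT 15                  /* assumed: 2^15 bits per filter */
#define FILTER_BITS  (1u << FILTER_SHIFT)
#define NR_FILTERS   2                   /* assumed: current and next generation */

struct gen_bloom {
    uint8_t bits[NR_FILTERS][FILTER_BITS / 8];
};

/* Derive two bit positions from one multiplicative hash of the item. */
static void bloom_keys(uint64_t item, uint32_t key[2])
{
    uint64_t h = item * 0x9E3779B97F4A7C15ull;

    key[0] = (uint32_t)(h >> (64 - FILTER_SHIFT));
    key[1] = (uint32_t)(h >> (64 - 2 * FILTER_SHIFT)) & (FILTER_BITS - 1);
}

/* Clear the filter that belongs to generation seq before repopulating it. */
void bloom_reset(struct gen_bloom *b, unsigned long seq)
{
    assert(b);
    memset(b->bits[seq % NR_FILTERS], 0, sizeof(b->bits[0]));
}

/* Record that the branch identified by item was found populated. */
void bloom_add(struct gen_bloom *b, unsigned long seq, uint64_t item)
{
    uint8_t *bits = b->bits[seq % NR_FILTERS];
    uint32_t key[2];

    bloom_keys(item, key);
    bits[key[0] / 8] |= (uint8_t)(1u << (key[0] % 8));
    bits[key[1] / 8] |= (uint8_t)(1u << (key[1] % 8));
}

/* False means "definitely not recorded"; true may be a false positive. */
bool bloom_test(const struct gen_bloom *b, unsigned long seq, uint64_t item)
{
    const uint8_t *bits = b->bits[seq % NR_FILTERS];
    uint32_t key[2];

    bloom_keys(item, key);
    return ((bits[key[0] / 8] >> (key[0] % 8)) & 1) &&
           ((bits[key[1] / 8] >> (key[1] % 8)) & 1);
}
```

      A walker using such a filter could skip a branch whenever bloom_test()
      returns false for the current generation, at the cost of occasionally
      visiting a branch because of a false positive.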
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[8, 10]%
                      Ops/sec      KB/sec
            patch1-7: 1147696.57   44640.29
            patch1-8: 1245274.91   48435.66
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
          patch1-8
            49.44%  lzo1x_1_do_compress (real work)
             6.19%  page_vma_mapped_walk (overhead)
             5.97%  _raw_spin_unlock_irq
             3.13%  get_pfn_folio
             2.85%  ptep_clear_flush
             2.42%  __zram_bvec_write
             2.08%  do_raw_spin_lock
             1.92%  memmove
             1.44%  alloc_zspage
             1.36%  memset
      
        Configurations:
          no change
      
      Thanks to the following developers for their efforts [3].
        kernel test robot <lkp@intel.com>
      
      [1] https://lwn.net/Articles/23732/
      [2] https://llvm.org/docs/ScudoHardenedAllocator.html
      [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: exploit locality in rmap · 018ee47f
      Yu Zhao authored
      Searching the rmap for PTEs mapping each page on an LRU list (to test and
      clear the accessed bit) can be expensive because pages from different VMAs
      (PA space) are not cache friendly to the rmap (VA space).  For workloads
      mostly using mapped pages, searching the rmap can incur the highest CPU
      cost in the reclaim path.
      
      This patch exploits spatial locality to reduce the trips into the rmap. 
      When shrink_page_list() walks the rmap and finds a young PTE, a new
      function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
      PTEs.  On finding another young PTE, it clears the accessed bit and
      updates the gen counter of the page mapped by this PTE to
      (max_seq%MAX_NR_GENS)+1.
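
      One way such a look-around window could be chosen is sketched below.
      It assumes 512 PTEs per page table and BITS_PER_LONG = 64, and it
      ignores VMA boundaries; look_around_window() and struct pte_window are
      hypothetical names, not the patch's code.  The window covers 64 slots
      including the PTE that was found young, hence at most 63 neighbors.

```c
#include <assert.h>

#define PTRS_PER_TABLE 512u  /* assumed number of PTEs in one page table */
#define WINDOW         64u   /* assumed BITS_PER_LONG on a 64-bit system */

struct pte_window {
    unsigned int start;  /* first index scanned */
    unsigned int end;    /* one past the last index scanned */
};

/*
 * Pick a WINDOW-sized range of PTE indices around idx, clamped so that it
 * never leaves the page table containing idx.
 */
struct pte_window look_around_window(unsigned int idx)
{
    struct pte_window w = { 0, PTRS_PER_TABLE };

    assert(idx < PTRS_PER_TABLE);

    if (idx < WINDOW / 2) {
        /* Too close to the start of the table: pin to the start. */
        w.end = WINDOW;
    } else if (PTRS_PER_TABLE - idx < WINDOW / 2) {
        /* Too close to the end of the table: pin to the end. */
        w.start = PTRS_PER_TABLE - WINDOW;
    } else {
        /* Otherwise center the window on the young PTE. */
        w.start = idx - WINDOW / 2;
        w.end = idx + WINDOW / 2;
    }
    return w;
}
```

      Keeping the window inside one page table means the neighbors can be
      checked without another trip into the rmap, which is the saving the
      paragraph above describes.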
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[3, 5]%
                      Ops/sec      KB/sec
            patch1-6: 1106168.46   43025.04
            patch1-7: 1147696.57   44640.29
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
        Configurations:
          no change
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: minimal implementation · ac35a490
      Yu Zhao authored
      To avoid confusion, the terms "promotion" and "demotion" will be applied
      to the multi-gen LRU, as a new convention; the terms "activation" and
      "deactivation" will be applied to the active/inactive LRU, as usual.
      
      The aging produces young generations.  Given an lruvec, it increments
      max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
      hot pages to the youngest generation when it finds them accessed through
      page tables; the demotion of cold pages happens consequently when it
      increments max_seq.  Promotion in the aging path does not involve any LRU
      list operations, only the updates of the gen counter and
      lrugen->nr_pages[]; demotion, unless as the result of the increment of
      max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
      aging has the complexity O(nr_hot_pages), since it is only interested in
      hot pages.
      
      The eviction consumes old generations.  Given an lruvec, it increments
      min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
      A feedback loop modeled after the PID controller monitors refaults over
      anon and file types and decides which type to evict when both types are
      available from the same generation.
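
      The interplay of the two sequence numbers can be modeled with a toy
      structure.  This is a simplified sketch under stated assumptions, not
      the patch's code: it keeps a single min_seq instead of one per type,
      assumes MAX_NR_GENS is 4, reads "approaches MIN_NR_GENS" as "is at most
      MIN_NR_GENS", and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2ul
#define MAX_NR_GENS 4ul  /* assumed value */

struct toy_lruvec {
    unsigned long max_seq;               /* youngest generation */
    unsigned long min_seq;               /* oldest generation */
    unsigned long nr_pages[MAX_NR_GENS]; /* pages per truncated generation */
};

static unsigned long nr_gens(const struct toy_lruvec *l)
{
    return l->max_seq - l->min_seq + 1;
}

/* The aging: produce a new youngest generation when too few remain. */
bool toy_age(struct toy_lruvec *l)
{
    if (nr_gens(l) > MIN_NR_GENS)
        return false;
    l->max_seq++;
    assert(nr_gens(l) <= MAX_NR_GENS);
    return true;
}

/*
 * The eviction: retire oldest generations whose lists have become empty,
 * while keeping at least MIN_NR_GENS generations.  Returns how many
 * generations were retired.
 */
unsigned long toy_advance_min_seq(struct toy_lruvec *l)
{
    unsigned long advanced = 0;

    while (nr_gens(l) > MIN_NR_GENS &&
           l->nr_pages[l->min_seq % MAX_NR_GENS] == 0) {
        l->min_seq++;
        advanced++;
    }
    return advanced;
}
```

      In this model, incrementing max_seq is what makes every existing page
      one generation older without touching any list, which matches the
      description of demotion happening "consequently" above.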
      
      The protection of pages accessed multiple times through file descriptors
      takes place in the eviction path.  Each generation is divided into
      multiple tiers.  A page accessed N times through file descriptors is in
      tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
      bits in folio->flags.  The aforementioned feedback loop also monitors
      refaults over all tiers and decides when to protect pages in which tiers
      (N>1), using the first tier (N=0,1) as a baseline.  The first tier
      contains single-use unmapped clean pages, which are most likely the best
      choices.  In contrast to promotion in the aging path, the protection of a
      page in the eviction path is achieved by moving this page to the next
      generation, i.e., min_seq+1, if the feedback loop decides so.  This
      approach has the following advantages:
      
      1. It removes the cost of activation in the buffered access path by
         inferring whether pages accessed multiple times through file
         descriptors are statistically hot and thus worth protecting in the
         eviction path.
      2. It takes pages accessed through page tables into account and avoids
         overprotecting pages accessed multiple times through file
         descriptors. (Pages accessed through page tables are in the first
         tier, since N=0.)
      3. More tiers provide better protection for pages accessed more than
         twice through file descriptors, when under heavy buffered I/O
         workloads.
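
      The mapping from access count to tier can be worked through with a
      small user-space sketch.  order_base_2_ul() below reimplements the
      rounded-up base-2 logarithm (0 for inputs of 0 or 1) for illustration;
      it and tier_of() are hypothetical names, and the sketch follows the
      description above rather than the patch's code.

```c
#include <assert.h>

/* floor(log2(n)) for n > 0 */
static unsigned int ilog2_ul(unsigned long n)
{
    unsigned int r = 0;

    assert(n > 0);
    while (n >>= 1)
        r++;
    return r;
}

/* ceil(log2(n)), defined as 0 for n <= 1 */
unsigned int order_base_2_ul(unsigned long n)
{
    return n > 1 ? ilog2_ul(n - 1) + 1 : 0;
}

/* Tier of a page accessed n times through file descriptors. */
unsigned int tier_of(unsigned long n)
{
    return order_base_2_ul(n);
}
```

      With this mapping, N=0 and N=1 both land in the first tier, N=2 in the
      next, N=3 and N=4 share a tier, and N=5 through N=8 share the one after
      that, so each additional tier covers twice as many access counts.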
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): +[30, 32]%
                      IOPS         BW
            5.19-rc1: 2673k        10.2GiB/s
            patch1-6: 3491k        13.3GiB/s
      
        Single workload:
          memcached (anon): -[4, 6]%
                      Ops/sec      KB/sec
            5.19-rc1: 1161501.04   45177.25
            patch1-6: 1106168.46   43025.04
      
        Configurations:
          CPU: two Xeon 6154
          Mem: total 256G
      
          Node 1 was only used as a ram disk to reduce the variance in the
          results.
      
          patch drivers/block/brd.c <<EOF
          99,100c99,100
          < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
          < 	page = alloc_page(gfp_flags);
          ---
          > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
          > 	page = alloc_pages_node(1, gfp_flags, 0);
          EOF
      
          cat >>/etc/systemd/system.conf <<EOF
          CPUAffinity=numa
          NUMAPolicy=bind
          NUMAMask=0
          EOF
      
          cat >>/etc/memcached.conf <<EOF
          -m 184320
          -s /var/run/memcached/memcached.sock
          -a 0766
          -t 36
          -B binary
          EOF
      
          cat fio.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkfs.ext4 /dev/ram0
          mount -t ext4 /dev/ram0 /mnt
      
          mkdir /sys/fs/cgroup/user.slice/test
          echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
          echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
          fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=5m --group_reporting
      
          cat memcached.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
      Client benchmark results:
        kswapd profiles:
          5.19-rc1
            40.33%  page_vma_mapped_walk (overhead)
            21.80%  lzo1x_1_do_compress (real work)
             7.53%  do_raw_spin_lock
             3.95%  _raw_spin_unlock_irq
             2.52%  vma_interval_tree_iter_next
             2.37%  folio_referenced_one
             2.28%  vma_interval_tree_subtree_search
             1.97%  anon_vma_interval_tree_iter_first
             1.60%  ptep_clear_flush
             1.06%  __zram_bvec_write
      
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
        Configurations:
          CPU: single Snapdragon 7c
          Mem: total 4G
      
          ChromeOS MemoryPressure [1]
      
      [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: groundwork · ec1c86b2
      Yu Zhao authored
      Evictable pages are divided into multiple generations for each lruvec.
      The youngest generation number is stored in lrugen->max_seq for both
      anon and file types as they are aged on an equal footing. The oldest
      generation numbers are stored in lrugen->min_seq[] separately for anon
      and file types as clean file pages can be evicted regardless of swap
      constraints. These three variables are monotonically increasing.
      
      Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
      in order to fit into the gen counter in folio->flags. Each truncated
      generation number is an index to lrugen->lists[]. The sliding window
      technique is used to track at least MIN_NR_GENS and at most
      MAX_NR_GENS generations. The gen counter stores a value within [1,
      MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
      stores 0.
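
      The truncation can be made concrete with a short sketch.  It assumes
      MAX_NR_GENS is 4, which is not stated above, and uses hypothetical
      helper names; the relation "gen counter = index into lists[] plus one"
      follows from the [1, MAX_NR_GENS] range given in the text.

```c
#include <assert.h>

#define MAX_NR_GENS 4u  /* assumed value */
#define GEN_NONE    0u  /* counter value while a page is on no list */

/* Bits needed to hold values 0..MAX_NR_GENS, i.e. ceil(log2(MAX_NR_GENS+1)). */
unsigned int gen_counter_width(void)
{
    unsigned int bits = 0;

    while ((1u << bits) < MAX_NR_GENS + 1)
        bits++;
    return bits;
}

/* Truncated generation number: the index into lists[]. */
unsigned int lists_index(unsigned long seq)
{
    return (unsigned int)(seq % MAX_NR_GENS);
}

/* Value stored in the gen counter for a page on lists[lists_index(seq)]. */
unsigned int gen_counter(unsigned long seq)
{
    unsigned int gen = lists_index(seq) + 1;

    assert(gen >= 1 && gen <= MAX_NR_GENS);
    return gen;
}
```

      Reserving 0 for pages that are on none of the lists is why the counter
      needs room for MAX_NR_GENS+1 distinct values rather than MAX_NR_GENS.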
      
      There are two conceptually independent procedures: "the aging", which
      produces young generations, and "the eviction", which consumes old
      generations.  They form a closed-loop system, i.e., "the page reclaim". 
      Both procedures can be invoked from userspace for the purposes of working
      set estimation and proactive reclaim.  These techniques are commonly used
      to optimize job scheduling (bin packing) in data centers [1][2].
      
      To avoid confusion, the terms "hot" and "cold" will be applied to the
      multi-gen LRU, as a new convention; the terms "active" and "inactive" will
      be applied to the active/inactive LRU, as usual.
      
      The protection of hot pages and the selection of cold pages are based
      on page access channels and patterns. There are two access channels:
      one through page tables and the other through file descriptors. The
      protection of the former channel is by design stronger because:
      1. The uncertainty in determining the access patterns of the former
         channel is higher due to the approximation of the accessed bit.
      2. The cost of evicting the former channel is higher due to the TLB
         flushes required and the likelihood of encountering the dirty bit.
      3. The penalty of underprotecting the former channel is higher because
         applications usually do not prepare themselves for major page
         faults like they do for blocked I/O. E.g., GUI applications
         commonly use dedicated I/O threads to avoid blocking rendering
         threads.
      
      There are also two access patterns: one with temporal locality and the
      other without.  For the reasons listed above, the former channel is
      assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
      present; the latter channel is assumed to follow the latter pattern unless
      outlying refaults have been observed [3][4].
      
      The next patch will address the "outlying refaults".  Three macros, i.e.,
      LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
      this patch to make the entire patchset less diffy.
      
      A page is added to the youngest generation on faulting.  The aging needs
      to check the accessed bit at least twice before handing this page over to
      the eviction.  The first check takes care of the accessed bit set on the
      initial fault; the second check makes sure this page has not been used
      since then.  This protocol, AKA second chance, requires a minimum of two
      generations, hence MIN_NR_GENS.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      [3] https://lwn.net/Articles/495543/
      [4] https://lwn.net/Articles/815342/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec1c86b2
    • Yu Zhao's avatar
      Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" · aa1b6790
      Yu Zhao authored
      This patch undoes the following refactor: commit 289ccba1
      ("include/linux/mm_inline.h: fold __update_lru_size() into its sole
      caller")
      
      The upcoming changes to include/linux/mm_inline.h will reuse
      __update_lru_size().
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-5-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aa1b6790
    • Yu Zhao's avatar
      mm/vmscan.c: refactor shrink_node() · f1e1a7be
      Yu Zhao authored
      This patch refactors shrink_node() to improve readability for the upcoming
      changes to mm/vmscan.c.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-4-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f1e1a7be
    • Yu Zhao's avatar
      mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG · eed9a328
      Yu Zhao authored
      Some architectures support the accessed bit in non-leaf PMD entries, e.g.,
      x86 sets the accessed bit in a non-leaf PMD entry when using it as part of
      linear address translation [1].  Page table walkers that clear the
      accessed bit may use this capability to reduce their search space.
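
      How a walker might use this to shrink its search space can be
      sketched as below.  This is a user-space illustration, not the kernel
      implementation: struct pmd_model and walk_and_clear() are hypothetical
      names, and PTES_PER_PMD assumes x86-64 with 4 KiB base pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_PMD 512	/* assumption: x86-64, 4 KiB base pages */

/* Hypothetical model of one non-leaf PMD entry and the PTEs below it. */
struct pmd_model {
	bool young;			/* accessed bit of the PMD entry */
	bool pte_young[PTES_PER_PMD];	/* accessed bits of its PTEs */
};

/*
 * Clear accessed bits and return how many PTEs had to be examined.
 * A PMD entry whose accessed bit is clear has not been used for
 * translation since the bit was last cleared, so the PTEs below it
 * are skipped without being read.
 */
size_t walk_and_clear(struct pmd_model *pmds, size_t nr_pmds,
		      size_t *nr_young)
{
	size_t examined = 0;

	*nr_young = 0;
	for (size_t i = 0; i < nr_pmds; i++) {
		if (!pmds[i].young)
			continue;	/* prune the whole PTE table */

		pmds[i].young = false;
		for (size_t j = 0; j < PTES_PER_PMD; j++) {
			examined++;
			if (pmds[i].pte_young[j]) {
				pmds[i].pte_young[j] = false;
				(*nr_young)++;
			}
		}
	}
	return examined;
}
```

      With four PMD entries of which one is young, the sketch examines 512
      PTEs instead of 2048.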
      
      Note that:
      1. Although an inline function is preferable, this capability is added
         as a configuration option for consistency with the existing macros.
      2. Due to the little interest in other varieties, this capability was
         only tested on Intel and AMD CPUs.
      
      Thanks to the following developers for their efforts [2][3].
        Randy Dunlap <rdunlap@infradead.org>
        Stephen Rothwell <sfr@canb.auug.org.au>
      
      [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
           Volume 3 (June 2021), section 4.8
      [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
      [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eed9a328
    • Yu Zhao's avatar
      mm: x86, arm64: add arch_has_hw_pte_young() · e1fd09e3
      Yu Zhao authored
      Patch series "Multi-Gen LRU Framework", v14.
      
      What's new
      ==========
      1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
         Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
      2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
         machines. The old direct reclaim backoff, which tries to enforce a
         minimum fairness among all eligible memcgs, over-swapped by about
         (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
         pulls the plug on swapping once the target is met, trades some
         fairness for curtailed latency:
         https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
      3. Fixed minor build warnings and conflicts. More comments and nits.
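
      To give a feel for the over-swap estimate in item 2, here is a small
      worked calculation.  It is a sketch under assumptions: total_mem is
      taken to be counted in pages, DEF_PRIORITY is 12 as in mainline, and
      over_swap_estimate() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

#define DEF_PRIORITY 12

/*
 * Estimate from the text: (total_mem >> DEF_PRIORITY) - nr_to_reclaim,
 * with both values in pages, clamped at zero for small machines.
 */
unsigned long over_swap_estimate(unsigned long total_pages,
				 unsigned long nr_to_reclaim)
{
	unsigned long scanned = total_pages >> DEF_PRIORITY;

	return scanned > nr_to_reclaim ? scanned - nr_to_reclaim : 0;
}
```

      For example, a 4 TiB machine with 4 KiB pages has 1UL << 30 pages, so
      with a target of 32 pages the estimate is 262144 - 32 = 262112 pages,
      i.e. close to 1 GiB swapped beyond the target.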
      
      TLDR
      ====
      The current page reclaim is too expensive in terms of CPU usage and it
      often makes poor choices about what to evict. This patchset offers an
      alternative solution that is performant, versatile and
      straightforward.
      
      Patchset overview
      =================
      The design and implementation overview is in patch 14:
      https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
      
      01. mm: x86, arm64: add arch_has_hw_pte_young()
      02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
      Take advantage of hardware features when trying to clear the accessed
      bit in many PTEs.
      
      03. mm/vmscan.c: refactor shrink_node()
      04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
          its sole caller"
      Minor refactors to improve readability for the following patches.
      
      05. mm: multi-gen LRU: groundwork
      Adds the basic data structure and the functions that insert pages to
      and remove pages from the multi-gen LRU (MGLRU) lists.
      
      06. mm: multi-gen LRU: minimal implementation
      A minimal implementation without optimizations.
      
      07. mm: multi-gen LRU: exploit locality in rmap
      Exploits spatial locality to improve efficiency when using the rmap.
      
      08. mm: multi-gen LRU: support page table walks
      Further exploits spatial locality by optionally scanning page tables.
      
      09. mm: multi-gen LRU: optimize multiple memcgs
      Optimizes the overall performance for multiple memcgs running mixed
      types of workloads.
      
      10. mm: multi-gen LRU: kill switch
      Adds a kill switch to enable or disable MGLRU at runtime.
      
      11. mm: multi-gen LRU: thrashing prevention
      12. mm: multi-gen LRU: debugfs interface
      Provide userspace with features like thrashing prevention, working set
      estimation and proactive reclaim.
      
      13. mm: multi-gen LRU: admin guide
      14. mm: multi-gen LRU: design doc
      Add an admin guide and a design doc.
      
      Benchmark results
      =================
      Independent lab results
      -----------------------
      Based on the popularity of searches [01] and the memory usage in
      Google's public cloud, the most popular open-source memory-hungry
      applications, in alphabetical order, are:
            Apache Cassandra      Memcached
            Apache Hadoop         MongoDB
            Apache Spark          PostgreSQL
            MariaDB (MySQL)       Redis
      
      An independent lab evaluated MGLRU with the most widely used benchmark
      suites for the above applications. They posted 960 data points along
      with kernel metrics and perf profiles collected over more than 500
      hours of total benchmark time. Their final reports show that, with 95%
      confidence intervals (CIs), the above applications all performed
      significantly better for at least part of their benchmark matrices.
      
      On 5.14:
      1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
         less wall time to sort three billion random integers, respectively,
         under the medium- and the high-concurrency conditions, when
         overcommitting memory. There were no statistically significant
         changes in wall time for the rest of the benchmark matrix.
      2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
         more transactions per minute (TPM), respectively, under the medium-
         and the high-concurrency conditions, when overcommitting memory.
         There were no statistically significant changes in TPM for the rest
         of the benchmark matrix.
      3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
         and [21.59, 30.02]% more operations per second (OPS), respectively,
         for sequential access, random access and Gaussian (distribution)
         access, when THP=always; 95% CIs [13.85, 15.97]% and
         [23.94, 29.92]% more OPS, respectively, for random access and
         Gaussian access, when THP=never. There were no statistically
         significant changes in OPS for the rest of the benchmark matrix.
      4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
         [2.16, 3.55]% more operations per second (OPS), respectively, for
         exponential (distribution) access, random access and Zipfian
         (distribution) access, when underutilizing memory; 95% CIs
         [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
         respectively, for exponential access, random access and Zipfian
         access, when overcommitting memory.
      
      On 5.15:
      5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
         and [4.11, 7.50]% more operations per second (OPS), respectively,
         for exponential (distribution) access, random access and Zipfian
         (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
         [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
         exponential access, random access and Zipfian access, when swap was
         on.
      6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
         less average wall time to finish twelve parallel TeraSort jobs,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in average wall time for the rest of the
         benchmark matrix.
      7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
         minute (TPM) under the high-concurrency condition, when swap was
         off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in TPM for the rest of the benchmark matrix.
      8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
         [11.47, 19.36]% more total operations per second (OPS),
         respectively, for sequential access, random access and Gaussian
         (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
         [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
         for sequential access, random access and Gaussian access, when
         THP=never.
      
      Our lab results
      ---------------
      To supplement the above results, we ran the following benchmark suites
      on 5.16-rc7 and found no regressions [10].
            fs_fio_bench_hdd_mq      pft
            fs_lmbench               pgsql-hammerdb
            fs_parallelio            redis
            fs_postmark              stream
            hackbench                sysbenchthread
            kernbench                tpcc_spark
            memcached                unixbench
            multichase               vm-scalability
            mutilate                 will-it-scale
            nginx
      
      [01] https://trends.google.com
      [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
      [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
      [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
      [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
      [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
      [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
      [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
      [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
      [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
      
      Real-world applications
      =======================
      Third-party testimonials
      ------------------------
      Konstantin reported [11]:
         I have Archlinux with 8G RAM + zswap + swap. While developing, I
         have lots of apps opened such as multiple LSP-servers for different
         langs, chats, two browsers, etc... Usually, my system gets quickly
         to a point of SWAP-storms, where I have to kill LSP-servers,
         restart browsers to free memory, etc, otherwise the system lags
         heavily and is barely usable.
         
         1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
         patchset, and I started up by opening lots of apps to create memory
         pressure, and worked for a day like this. Till now I had not a
         single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
         getting to the point of 3G in SWAP before without a single
         SWAP-storm.
      
      Vaibhav from IBM reported [12]:
         In a synthetic MongoDB Benchmark, seeing an average of ~19%
         throughput improvement on POWER10(Radix MMU + 64K Page Size) with
         MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
         three different request distributions, namely, Exponential, Uniform
         and Zipfian.
      
      Shuang from U of Rochester reported [13]:
         With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
         and [9.26, 10.36]% higher throughput, respectively, for random
         access, Zipfian (distribution) access and Gaussian (distribution)
         access, when the average number of jobs per CPU is 1; 95% CIs
         [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
         throughput, respectively, for random access, Zipfian access and
         Gaussian access, when the average number of jobs per CPU is 2.
      
      Daniel from Michigan Tech reported [14]:
         With Memcached allocating ~100GB of byte-addressable Optane,
         performance improvement in terms of throughput (measured as queries
         per second) was about 10% for a series of workloads.
      
      Large-scale deployments
      -----------------------
      We've rolled out MGLRU to tens of millions of ChromeOS users and
      about a million Android users. Google's fleetwide profiling [15] shows
      an overall 40% decrease in kswapd CPU usage, in addition to
      improvements in other UX metrics, e.g., an 85% decrease in the number
      of low-memory kills at the 75th percentile and an 18% decrease in
      app launch time at the 50th percentile.
      
      The downstream kernels that have been using MGLRU include:
      1. Android [16]
      2. Arch Linux Zen [17]
      3. Armbian [18]
      4. ChromeOS [19]
      5. Liquorix [20]
      6. OpenWrt [21]
      7. post-factum [22]
      8. XanMod [23]
      
      [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
      [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
      [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
      [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
      [15] https://dl.acm.org/doi/10.1145/2749469.2750392
      [16] https://android.com
      [17] https://archlinux.org
      [18] https://armbian.com
      [19] https://chromium.org
      [20] https://liquorix.net
      [21] https://openwrt.org
      [22] https://codeberg.org/pf-kernel
      [23] https://xanmod.org
      
      Summary
      =======
      The facts are:
      1. The independent lab results and the real-world applications
         indicate substantial improvements; there are no known regressions.
      2. Thrashing prevention, working set estimation and proactive reclaim
         work out of the box; there are no equivalent solutions.
      3. There is a lot of new code; no smaller changes have demonstrated
         similar effects.
      
      Our options, accordingly, are:
      1. Given the amount of evidence, the reported improvements will likely
         materialize for a wide range of workloads.
      2. Gauging the interest from the past discussions, the new features
         will likely be put to use for both personal computers and data
         centers.
      3. Based on Google's track record, the new code will likely be well
         maintained in the long term. It'd be more difficult if not
         impossible to achieve similar effects with other approaches.
      
      
      This patch (of 14):
      
      Some architectures automatically set the accessed bit in PTEs, e.g., x86
      and arm64 v8.2.  On architectures that do not have this capability,
      clearing the accessed bit in a PTE usually triggers a page fault following
      the TLB miss of this PTE (to emulate the accessed bit).
      
      Being aware of this capability can help make better decisions, e.g.,
      whether to spread the work out over a period of time to reduce bursty page
      faults when trying to clear the accessed bit in many PTEs.
      
      Note that theoretically this capability can be unreliable, e.g.,
      hotplugged CPUs might be different from builtin ones.  Therefore it should
      not be used in architecture-independent code that involves correctness,
      e.g., to determine whether TLB flushes are required (in combination with
      the accessed bit).
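
      One way such a hint could inform a decision is sketched below.  This
      is purely illustrative: clear_batch_size() is a hypothetical policy
      helper, and the hw_pte_young parameter stands in for the result of
      arch_has_hw_pte_young().  As noted above, the hint only tunes how the
      work is spread out; nothing here depends on it for correctness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical policy: how many PTEs to clear the accessed bit in per
 * pass.  If hardware sets the bit by itself, clearing it does not cause
 * extra page faults, so everything can be done in one pass.  Otherwise
 * the work is split over nr_passes passes to avoid a burst of faults.
 */
size_t clear_batch_size(bool hw_pte_young, size_t nr_ptes, size_t nr_passes)
{
	assert(nr_passes > 0);

	if (hw_pte_young)
		return nr_ptes;

	/* Round up so that nr_passes passes cover every PTE. */
	return (nr_ptes + nr_passes - 1) / nr_passes;
}
```

      With 1000 PTEs and 8 passes, the sketch returns 1000 when the hint is
      true and 125 when it is false.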
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Acked-by: Will Deacon <will@kernel.org>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e1fd09e3
    • Yang Yang's avatar
      mm/page_io: count submission time as thrashing delay for delayacct · 3a9bb7b1
      Yang Yang authored
      Previously, we only supported thrashing accounting for the page cache.
      Then Joonsoo introduced workingset detection for anonymous pages, and we
      gained the ability to account for their thrashing as well [1].
      
      Like PSI, we count submission time as thrashing delay because, when the
      device is congested or the submitting cgroup is IO-throttled, submission
      can be a significant part of the overall IO time.
      
      Without this patch, swap thrashing through frontswap or through a block
      device that supports the rw_page operation isn't measured correctly.
      
      This patch is based on "delayacct: support re-entrance detection of
      thrashing accounting".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220815072835.74876-1-yang.yang29@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: wangyong <wang.yong12@zte.com.cn>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a9bb7b1
    • Yang Yang's avatar
      delayacct: support re-entrance detection of thrashing accounting · aa1cf99b
      Yang Yang authored
      Previously, we only supported thrashing accounting for the page cache.
      Then Joonsoo introduced workingset detection for anonymous pages, and we
      gained the ability to account for their thrashing as well [1].
      
      For page cache thrashing accounting, there is no suitable place at the
      fs level comparable to swap_readpage(), so we have to do it in
      folio_wait_bit_common().
      
      For anonymous page thrashing accounting, we then have to do it in both
      swap_readpage() and folio_wait_bit_common().  This is similar to PSI, so
      thrashing accounting should support re-entrance detection.
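
      The idea of re-entrance detection can be sketched in user space as
      follows.  This is a minimal model, not the delayacct code: struct
      task_model, thrashing_start() and thrashing_end() are hypothetical
      names, and the clock is passed in explicitly to keep the example
      deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task accounting state. */
struct task_model {
	bool in_thrashing;		/* an outer section is already open */
	unsigned long start;		/* when the outer section began */
	unsigned long total_delay;	/* accumulated thrashing delay */
	unsigned long count;		/* number of accounted sections */
};

/*
 * Open a thrashing section.  If one is already open for this task, the
 * caller is nested: remember that in *nested and leave the state alone.
 */
void thrashing_start(struct task_model *t, unsigned long now, bool *nested)
{
	*nested = t->in_thrashing;
	if (*nested)
		return;

	t->in_thrashing = true;
	t->start = now;
}

/* Close a section; only the outermost caller accounts the delay. */
void thrashing_end(struct task_model *t, unsigned long now, bool *nested)
{
	if (*nested)
		return;

	t->in_thrashing = false;
	t->total_delay += now - t->start;
	t->count++;
}
```

      With an outer section over [10, 20] and a nested one over [12, 15],
      the model accounts a single delay of 10 rather than 10 + 3.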
      
      This patch prepares for complete thrashing accounting and is based on
      the patch "filemap: make the accounting of thrashing more consistent".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220815071134.74551-1-yang.yang29@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: wangyong <wang.yong12@zte.com.cn>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aa1cf99b
    • Baolin Wang's avatar
      mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP · 7047b5a4
      Baolin Wang authored
      If a THP fails to migrate with -ENOSYS or -ENOMEM, the THP is split and
      migration of its subpages is attempted again.  We should not account
      retries in this second loop, since 'nr_thp_failed' was already accounted
      in the first loop.
      
      Moreover, we do not need to retry 10 times on -EAGAIN for the subpages
      of a fail-to-migrate THP in the second loop, since the THP has already
      been regarded as a migration failure.  This also saves some migration
      time (in the worst case, 512 * 10 attempts), according to the previous
      discussion [1].
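
      The worst-case figure can be checked with a small counting model.
      This is an illustration only: count_attempts() is a hypothetical
      helper, 512 assumes a 2 MiB THP with 4 KiB base pages, and the loop
      order is simplified compared with migrate_pages().

```c
#include <assert.h>
#include <stddef.h>

/*
 * needs[i] is the attempt on which subpage i finally migrates, or 0 if
 * it keeps failing with -EAGAIN.  Returns the total number of migration
 * attempts made when each subpage is tried at most max_passes times.
 */
unsigned long count_attempts(const unsigned int *needs, size_t n,
			     unsigned int max_passes)
{
	unsigned long attempts = 0;

	for (size_t i = 0; i < n; i++) {
		for (unsigned int pass = 1; pass <= max_passes; pass++) {
			attempts++;
			if (needs[i] == pass)
				break;	/* migrated, stop retrying */
		}
	}
	return attempts;
}
```

      For 512 subpages that never succeed, ten passes cost 5120 attempts
      while a single pass costs 512.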
      
      [1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.com
      Tested-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7047b5a4
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for retry · 077309bc
      Huang Ying authored
      After 10 retries, we will give up and the remaining pages will be counted
      as failure in nr_failed and nr_thp_failed.  We should count the failure in
      nr_failed_pages too.  This is done in this patch.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      077309bc
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for THP splitting · e6fa8a79
      Huang Ying authored
      If a THP fails to be migrated, it may be split and retried.  But after
      splitting, the head page is left in the "from" list, although the THP
      migration failure has already been counted.  If the head page also fails
      to be migrated, the failure is incorrectly counted twice.  This patch
      fixes that by also moving the head page of the THP to "thp_split_pages"
      after splitting.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e6fa8a79
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for THP on -ENOSYS · 577be05c
      Huang Ying authored
      If THP or hugetlbfs page migration isn't supported, unmap_and_move() or
      unmap_and_move_huge_page() will return -ENOSYS.  For THP, splitting will
      be tried, but if splitting doesn't succeed, the THP is wrongly left in
      the "from" list.  If some other pages are retried, the THP migration
      failure will be counted again.  This is fixed by moving the failed THP
      from "from" to "ret_pages".
      
      Another issue with the original code is that the handling of the
      unsupported case isn't consistent between THP and hugetlbfs pages.  Make
      them consistent in this patch, which also makes the code easier to
      understand.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for THP subpages retrying · 5fc30916
      Huang Ying authored
      If a THP fails to migrate with -ENOSYS or -ENOMEM, it is split into
      thp_split_pages, and after the other pages are migrated, the pages in
      thp_split_pages are migrated with no_subpage_counting == true, because
      their failure has already been counted.  If some pages in
      thp_split_pages are retried during migration, their failure should not
      be counted either when no_subpage_counting == true.  This patch does
      that to fix the failure counting when THP subpages are retried.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Huang Ying's avatar
      migrate_pages(): fix THP failure counting for -ENOMEM · fbed53b4
      Huang Ying authored
      In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be
      returned, and migrate_pages() will try to split the THP unless "reason" is
      MR_NUMA_MISPLACED (that is, nosplit == true).  But when nosplit == true,
      the THP migration failure will not be counted.
      
      This is incorrect, so in this patch the THP migration failure is
      counted for -ENOMEM regardless of whether nosplit is true or false.  The
      nr_failed counting isn't fixed because it's not used.  Added some
      comments for it per Baolin's suggestion.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Huang Ying's avatar
      migrate_pages(): remove unnecessary list_safe_reset_next() · 9c62ff00
      Huang Ying authored
      Before commit b5bade97 ("mm: migrate: fix the return value of
      migrate_pages()"), the tail pages of a THP were put in the "from"
      list directly.  So one of the loop cursors (page2) needed to be reset,
      as is done in try_split_thp() via list_safe_reset_next().  But since
      that commit, the tail pages of a THP are put in a dedicated
      list (thp_split_pages).  That is, the "from" list is not changed
      during splitting.  So it's no longer necessary to call
      list_safe_reset_next().
      
      This is a code cleanup, no functionality changes are expected.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Huang Ying's avatar
      migrate: fix syscall move_pages() return value for failure · a7504ed1
      Huang Ying authored
      Patch series "migrate_pages(): fix several bugs in error path", v3.
      
      While reviewing the code of migrate_pages() and building a test program
      for it, several bugs in the error path were identified.  They are fixed
      in this series.
      
      Most patches are tested via
      
      - Apply error-inject.patch in Linux kernel
      - Compile test-migrate.c (with -lnuma)
      - Test with test-migrate.sh
      
      error-inject.patch, test-migrate.c, and test-migrate.sh are as below.
      It turns out that error injection is an important tool for fixing bugs
      in the error path.
      
      
      This patch (of 8):
      
      The return value of move_pages() syscall is incorrect when counting
      the remaining pages to be migrated.  For example, for the following
      test program,
      
      "
       #define _GNU_SOURCE
      
       #include <stdbool.h>
       #include <stdio.h>
       #include <string.h>
       #include <stdlib.h>
       #include <errno.h>
      
       #include <fcntl.h>
       #include <sys/uio.h>
       #include <sys/mman.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <numaif.h>
       #include <numa.h>
      
       #ifndef MADV_FREE
       #define MADV_FREE	8		/* free pages only if memory pressure */
       #endif
      
       #define ONE_MB		(1024 * 1024)
       #define MAP_SIZE	(16 * ONE_MB)
       #define THP_SIZE	(2 * ONE_MB)
       #define THP_MASK	(THP_SIZE - 1)
      
       #define ERR_EXIT_ON(cond, msg)					\
      	 do {							\
      		 int __cond_in_macro = (cond);			\
      		 if (__cond_in_macro)				\
      			 error_exit(__cond_in_macro, (msg));	\
      	 } while (0)
      
       void error_msg(int ret, int nr, int *status, const char *msg)
       {
      	 int i;
      
      	 fprintf(stderr, "Error: %s, ret : %d, error: %s\n",
      		 msg, ret, strerror(errno));
      
      	 if (!nr)
      		 return;
      	 fprintf(stderr, "status: ");
      	 for (i = 0; i < nr; i++)
      		 fprintf(stderr, "%d ", status[i]);
      	 fprintf(stderr, "\n");
       }
      
       void error_exit(int ret, const char *msg)
       {
      	 error_msg(ret, 0, NULL, msg);
      	 exit(1);
       }
      
       int page_size;
      
       bool do_vmsplice;
       bool do_thp;
      
       static int pipe_fds[2];
       void *addr;
       char *pn;
       char *pn1;
       void *pages[2];
       int status[2];
      
       void prepare()
       {
      	 int ret;
      	 struct iovec iov;
      
      	 if (addr) {
      		 munmap(addr, MAP_SIZE);
      		 close(pipe_fds[0]);
      		 close(pipe_fds[1]);
      	 }
      
      	 ret = pipe(pipe_fds);
      	 ERR_EXIT_ON(ret, "pipe");
      
      	 addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	 ERR_EXIT_ON(addr == MAP_FAILED, "mmap");
      	 if (do_thp) {
      		 ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE);
      		 ERR_EXIT_ON(ret, "advise hugepage");
      	 }
      
      	 pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK);
      	 pn1 = pn + THP_SIZE;
      	 pages[0] = pn;
      	 pages[1] = pn1;
      	 *pn = 1;
      
      	 if (do_vmsplice) {
      		 iov.iov_base = pn;
      		 iov.iov_len = page_size;
      		 ret = vmsplice(pipe_fds[1], &iov, 1, 0);
      		 ERR_EXIT_ON(ret < 0, "vmsplice");
      	 }
      
      	 status[0] = status[1] = 1024;
       }
      
       void test_migrate()
       {
      	 int ret;
      	 int nodes[2] = { 1, 1 };
      	 pid_t pid = getpid();
      
      	 prepare();
      	 ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 1, status, "move 1 page");
      
      	 prepare();
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 not mapped");
      
      	 prepare();
      	 *pn1 = 1;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages");
      
      	 prepare();
      	 *pn1 = 1;
      	 nodes[1] = 0;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 to node 0");
       }
      
       int main(int argc, char *argv[])
       {
      	 numa_run_on_node(0);
      	 page_size = getpagesize();
      
      	 test_migrate();
      
      	 fprintf(stderr, "\nMake page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTest THP:\n");
      	 do_thp = true;
      	 do_vmsplice = false;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 return 0;
       }
      "
      
      The output of the current kernel is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 0, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 1, error: Success
      status: 1024 1024
      "
      
      While the expected output is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 1, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 1, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 2, error: Success
      status: 1024 1024
      "
      
      Fix this by correcting the counting of the remaining pages.  With the
      fix, the test program above produces the expected output.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Yang Yang's avatar
      filemap: make the accounting of thrashing more consistent · f347c9d2
      Yang Yang authored
      Once upon a time, we only supported accounting for thrashing of the page
      cache.  Then Joonsoo introduced workingset detection for anonymous pages
      and we gained the ability to account for their thrashing too[1].
      
      So let delayacct account for the thrashing of both page cache and
      anonymous pages, which makes the code more consistent and simpler.
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220805033838.1714674-1-yang.yang29@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Peter Xu's avatar
      mm/swap: cache swap migration A/D bits support · 5154e607
      Peter Xu authored
      Introduce a variable swap_migration_ad_supported to cache whether the arch
      supports swap migration A/D bits.
      
      One thing to mention is that SWP_MIG_TOTAL_BITS internally references
      the macro MAX_PHYSMEM_BITS, which is a function call on x86 (and a
      constant on all other archs).
      
      It's safe to reference it in swapfile_init() because by the time we get
      there we're already at initcall level 4, so the 5-level pgtable must
      have been initialized for x86_64 (right after early_identify_cpu()
      finishes).
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
            - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      This should slightly speed up the handling of migration swap entries.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Peter Xu's avatar
      mm/swap: cache maximum swapfile size when init swap · be45a490
      Peter Xu authored
      We used to have max_swapfile_size() fetching the maximum swapfile size
      per arch.
      
      As the number of callers of max_swapfile_size() grows, this patch
      introduces a variable "swapfile_maximum_size" and caches the value of
      the old max_swapfile_size(), so that we don't need to calculate the
      value every time.
      
      Caching the value in swapfile_init() is safe because by that phase all
      the relevant information should have been initialized.  The major arch
      to take care of is x86, which defines the max swapfile size based on
      the L1TF mitigation.
      
      Both X86_BUG_L1TF and l1tf_mitigation should have been set up properly
      by the time we reach swapfile_init().  As a reference, the code path
      looks like this for x86:
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - early_identify_cpu --> setup X86_BUG_L1TF
        - parse_early_param
          - l1tf_cmdline --> set l1tf_mitigation
        - check_bugs
          - l1tf_select_mitigation --> set l1tf_mitigation
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      The swapfile size only depends on the swp pte format on non-x86 archs,
      so caching it is safe too.
      
      While at it, rename max_swapfile_size() to arch_max_swapfile_size()
      because an arch can define its own function, so it's more
      straightforward to have "arch_" as its prefix.  At the same time, export
      swapfile_maximum_size to replace the old usages of max_swapfile_size().
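      
      The caching pattern described above can be sketched roughly as below.
      This is only an illustration in plain userspace C: the demo_* names,
      the call counter and the made-up limit are assumptions and are not
      taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts calls to the (hypothetical) arch hook so the caching is observable. */
unsigned int demo_arch_calls;

/* Stand-in for a per-arch hook that may be expensive to evaluate. */
unsigned long demo_arch_max_swapfile_size(void)
{
	demo_arch_calls++;
	return 1UL << 20;	/* made-up limit, in pages */
}

/* Cached copy of the limit, filled in once during init. */
unsigned long demo_swapfile_maximum_size;

/* Runs once, after everything the arch hook depends on has been set up. */
void demo_swapfile_init(void)
{
	demo_swapfile_maximum_size = demo_arch_max_swapfile_size();
}

/* Later users read the cached variable instead of calling the hook again. */
bool demo_swap_size_ok(unsigned long pages)
{
	assert(demo_swapfile_maximum_size != 0);	/* init must have run */
	return pages <= demo_swapfile_maximum_size;
}
```

      The point of the sketch is only the ordering: the hook is evaluated once
      at init time, and every later check is a plain variable read.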
      
      [peterx@redhat.com: declare arch_max_swapfile_size() in swapfile.h]
        Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local
      Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Peter Xu's avatar
      mm: remember young/dirty bit for page migrations · 2e346877
      Peter Xu authored
      When page migration happens, we always ignore the young/dirty bit
      settings in the old pgtable, mark the page as old in the new page table
      using either pte_mkold() or pmd_mkold(), and keep the pte clean.
      
      That's fine functionally, but it's not friendly to page reclaim because
      the page being moved can be actively accessed during the procedure.  Not
      to mention that having hardware set the young bit can bring quite some
      overhead on some systems, e.g.  x86_64 needs a few hundred nanoseconds
      to set the bit.  The same slowdown applies to the dirty bit when the
      memory is first written after page migration.
      
      Actually we can easily remember the A/D bit configuration and recover
      the information after the page is migrated.  To achieve it, define a new
      set of bits in the migration swap offset field to cache the A/D bits of
      the old pte.  Then when removing/recovering the migration entry, we can
      recover the A/D bits even if the page changed.
      
      One thing to mention is that we use max_swapfile_size() to detect how
      many swp offset bits we have, and we only enable this feature if we know
      the swp offset is big enough to store both the PFN value and the A/D
      bits.  Otherwise the A/D bits are dropped like before.
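      
      As a rough illustration of the idea, the sketch below packs a PFN plus
      young/dirty flags into one offset value and drops the flags when they
      would not fit.  The bit widths and all demo_* names are hypothetical;
      the real layout is arch dependent and is defined by the patch, not by
      this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All widths below are illustrative assumptions, not values from the patch. */
#define DEMO_PFN_BITS		40	/* bits needed for the largest PFN */
#define DEMO_OFFSET_BITS	50	/* bits available in the swap offset */
#define DEMO_PFN_MASK		((UINT64_C(1) << DEMO_PFN_BITS) - 1)
#define DEMO_YOUNG_BIT		(UINT64_C(1) << DEMO_PFN_BITS)
#define DEMO_DIRTY_BIT		(UINT64_C(1) << (DEMO_PFN_BITS + 1))

/* The flags are usable only if the PFN plus two extra bits fit in the offset. */
bool demo_ad_supported(void)
{
	return DEMO_PFN_BITS + 2 <= DEMO_OFFSET_BITS;
}

/* Build an offset from a PFN; the flags are silently dropped if unsupported. */
uint64_t demo_make_offset(uint64_t pfn, bool young, bool dirty)
{
	uint64_t off;

	assert(pfn <= DEMO_PFN_MASK);
	off = pfn;
	if (demo_ad_supported()) {
		if (young)
			off |= DEMO_YOUNG_BIT;
		if (dirty)
			off |= DEMO_DIRTY_BIT;
	}
	return off;
}

/* Extract only the PFN part of the offset. */
uint64_t demo_offset_pfn(uint64_t off)
{
	return off & DEMO_PFN_MASK;
}

bool demo_offset_young(uint64_t off)
{
	return demo_ad_supported() && (off & DEMO_YOUNG_BIT) != 0;
}

bool demo_offset_dirty(uint64_t off)
{
	return demo_ad_supported() && (off & DEMO_DIRTY_BIT) != 0;
}
```

      With these assumed widths the flags sit directly above the PFN, so an
      offset built without flags is numerically equal to the PFN itself.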
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Peter Xu's avatar
      mm/thp: carry over dirty bit when thp splits on pmd · 0ccf7f16
      Peter Xu authored
      Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
      shouldn't be a correctness issue, since when pmd_dirty() is true the
      page will be marked dirty anyway.  However, carrying the dirty bit over
      helps the first writes to the split ptes on some archs like x86.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Peter Xu's avatar
      mm/swap: add swp_offset_pfn() to fetch PFN from swap entry · 0d206b5d
      Peter Xu authored
      We've got a bunch of special swap entries that store a PFN inside the
      swap offset field.  To fetch the PFN, normally the user just calls
      swp_offset() assuming that'll be the PFN.
      
      Add a helper swp_offset_pfn() to fetch the PFN instead.  It fetches only
      the max possible length of a PFN on the host, and a BUILD_BUG_ON() in
      is_pfn_swap_entry() checks against MAX_PHYSMEM_BITS to make sure the
      swap offset can always store the PFN properly.
      
      One reason to do so is that we never tried to check whether the swap
      offset can really fit a PFN.  At the same time, this patch also prepares
      us for the future possibility of storing more information inside the swp
      offset field, so assuming "swp_offset(entry)" to be the PFN will very
      soon no longer hold.
      
      Replace many of the swp_offset() callers with swp_offset_pfn() where
      appropriate.  Note that many of the existing users are not candidates
      for the replacement, e.g.:
      
        (1) When the swap entry is not a pfn swap entry at all, or,
        (2) when we want to keep the whole swp_offset but only change the swp type.
      
      The latter can happen when fork() is triggered on a write-migration swap
      entry pte: we may want to only change the migration type from
      write->read but keep the rest, so it's not "fetching PFN" but "changing
      swap type only".  These are left alone so that when there is more
      information within the swp offset it will be carried over naturally in
      those cases.
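      
      The difference between "fetching PFN" and "changing swap type only" can
      be sketched as below.  The entry layout (6 type bits above a 58-bit
      offset), the 40-bit PFN width, the type values and every demo_* name
      are assumptions made for this illustration, not the kernel's actual
      definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: type bits on top of a 58-bit offset. */
#define DEMO_TYPE_SHIFT		58
#define DEMO_OFFSET_MASK	((UINT64_C(1) << DEMO_TYPE_SHIFT) - 1)
#define DEMO_PFN_BITS		40
#define DEMO_PFN_MASK		((UINT64_C(1) << DEMO_PFN_BITS) - 1)

enum { DEMO_MIG_READ = 1, DEMO_MIG_WRITE = 2 };	/* made-up type values */

typedef struct { uint64_t val; } demo_swp_entry_t;

demo_swp_entry_t demo_swp_entry(unsigned int type, uint64_t offset)
{
	demo_swp_entry_t e;

	assert(offset <= DEMO_OFFSET_MASK);
	e.val = ((uint64_t)type << DEMO_TYPE_SHIFT) | offset;
	return e;
}

unsigned int demo_swp_type(demo_swp_entry_t e)
{
	return (unsigned int)(e.val >> DEMO_TYPE_SHIFT);
}

/* The whole offset, including any extra bits stored above the PFN. */
uint64_t demo_swp_offset(demo_swp_entry_t e)
{
	return e.val & DEMO_OFFSET_MASK;
}

/* Only the PFN part of the offset. */
uint64_t demo_swp_offset_pfn(demo_swp_entry_t e)
{
	return demo_swp_offset(e) & DEMO_PFN_MASK;
}

/* Case (2): change only the type and carry the whole offset over. */
demo_swp_entry_t demo_make_readable(demo_swp_entry_t e)
{
	return demo_swp_entry(DEMO_MIG_READ, demo_swp_offset(e));
}
```

      Because demo_make_readable() passes the full offset through, any extra
      bits stored above the PFN survive the type change, while
      demo_swp_offset_pfn() deliberately ignores them.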
      
      While at it, drop hwpoison_entry_to_pfn() because that's exactly what
      the new swp_offset_pfn() is about.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Peter Xu's avatar
      mm/swap: comment all the ifdef in swapops.h · eba4d770
      Peter Xu authored
      swapops.h contains quite a few layers of ifdef, and some of the "else"
      and "endif" lines don't have a proper comment naming the macro, so it's
      hard to follow what they refer to.  Add the comments.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Peter Xu's avatar
      mm/x86: use SWP_TYPE_BITS in 3-level swap macros · 9c61d532
      Peter Xu authored
      Patch series "mm: Remember a/d bits for migration entries", v4.
      
      
      Problem
      =======
      
      When migrating a page, right now we always mark the migrated page as old &
      clean.
      
      However that could lead to at least two problems:
      
        (1) We lose the real hot/cold information that we could have preserved.
            That information shouldn't change even if the backing page is changed
            after the migration,
      
        (2) There can always be extra overhead on the immediate next access to
            any migrated page, because the hardware MMU needs cycles to set the
            young bit again for reads, and the dirty bit for writes, as long as
            the hardware MMU supports these bits.
      
      Much recent upstream work has shown that (2) is not trivial and is
      actually very measurable.  In my test case, reading a 1G chunk of memory
      - jumping in page size intervals - could take 99ms just because of the
      extra setting of the young bit on a generic x86_64 system, compared to
      4ms if the young bit is already set.
      
      This issue was originally reported by Andrea Arcangeli.
      
      Solution
      ========
      
      To solve this problem, this patchset tries to remember the young/dirty
      bits in the migration entries and carry them over when recovering the
      ptes.
      
      We have the chance to do so because in many systems the swap offset is not
      fully used.  Migration entries use the swp offset to store the PFN only,
      and the PFN normally needs fewer bits than the swp offset provides.  It
      means we do have some free bits in the swp offset that we can use to
      store things like A/D bits, and that's how this series approaches the
      problem.
      
      max_swapfile_size() is used here to detect the per-arch offset length in
      swp entries.  We'll automatically remember the A/D bits when we find that
      the swp offset field is large enough to keep both the PFN and the extra
      bits.
      
      Since max_swapfile_size() can be slow, the last two patches cache its
      result and also swap_migration_ad_supported as a whole.
      
      Known Issues / TODOs
      ====================
      
      We still haven't taught madvise() to recognize the new A/D bits in
      migration entries, namely MADV_COLD/MADV_FREE.  E.g.  when MADV_COLD is
      applied to a migration entry, it's not clear yet whether we should clear
      the A bit or just drop the entry directly.
      
      We didn't teach idle page tracking about the new migration entries,
      because that needs a larger rework of the rmap pgtable walk.  However,
      the series should already make things better: before this patchset a
      page was always old after migration, so the series fixes potential
      false negatives of idle page tracking when pages were migrated before
      being observed.
      
      The other thing is that migration A/D bits will not start working for
      private device swap entries.  The code is there for completeness, but
      since private device swap entries do not yet have fields to store A/D
      bits, even if we persist A/D across a present pte switching to a
      migration entry, we'll lose them again when the migration entry is
      converted to a private device swap entry.
      
      Tests
      =====
      
      With the patchset applied, the immediate read access test [1] of the
      above 1G chunk after migration shrinks from 99ms to 4ms.  The test moves
      1G of pages from node 0->1->0 and then reads them in page size jumps.
      The test was run on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
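      
      Purely to illustrate the access pattern ("read it in page size jumps"),
      a minimal sketch follows.  It is not the test program referenced as
      [1]: the function name is hypothetical, and both the migration step and
      the timing around the loop are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read one byte from each page of the buffer, so that every page is
 * touched exactly once.  The volatile qualifier keeps the compiler from
 * optimizing the reads away.  Returns the sum of the bytes read.
 */
uint64_t demo_touch_pages(const volatile unsigned char *buf, size_t len,
			  size_t page_size)
{
	uint64_t sum = 0;
	size_t off;

	assert(page_size > 0);
	for (off = 0; off < len; off += page_size)
		sum += buf[off];
	return sum;
}
```

      A measurement would wrap a call like this in a pair of timestamps,
      taken right after the pages have been migrated.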
      
      A similar effect can also be measured when writing to the memory for the
      first time after migration.
      
      After applying the patchset, both the initial read and the initial write
      after page migration perform similarly to before the migration.
      
      Patch Layout
      ============
      
      Patch 1-2:  Cleanups from either previous versions or on swapops.h macros.
      
      Patch 3-4:  Prepare for the introduction of migration A/D bits
      
      Patch 5:    The core patch to remember young/dirty bit in swap offsets.
      
      Patch 6-7:  Cache relevant fields to make migration_entry_supports_ad() fast.
      
      [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
      
      
      This patch (of 7):
      
      Replace all the magic "5" with the macro.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>