1. 18 Oct, 2023 18 commits
  2. 16 Oct, 2023 19 commits
    • mm/ksm: document pages_skipped sysfs knob · b0540208
      Stefan Roesch authored
      This adds documentation for the new metric pages_skipped.
      
      Link: https://lkml.kernel.org/r/20230926040939.516161-5-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: document smart scan mode · 75d7dd41
      Stefan Roesch authored
      This adds documentation for the smart scan mode of KSM.
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: document that smart_scan defaults to on]
      Link: https://lkml.kernel.org/r/20230926040939.516161-4-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: add pages_skipped metric · e5a68991
      Stefan Roesch authored
      This change adds the "pages skipped" metric.  To be able to evaluate how
      successful smart page scanning is, the pages skipped metric can be
      compared to the pages scanned metric.
      
      The pages skipped metric is a cumulative counter.  The counter is stored
      under /sys/kernel/mm/ksm/pages_skipped.
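
      A minimal sketch of that comparison follows. It assumes the scanned
      counter is exposed next to pages_skipped as pages_scanned, and that
      skipped pages are not also counted as scanned; both are assumptions of
      this sketch. The helper names read_counter and skip_ratio are
      illustrative only.

```python
from pathlib import Path

# Default KSM sysfs directory. pages_skipped is named in the text above;
# the pages_scanned file name is an assumption of this sketch.
KSM_SYSFS = Path("/sys/kernel/mm/ksm")


def read_counter(name, base=KSM_SYSFS):
    """Read one cumulative KSM counter as an integer."""
    return int((Path(base) / name).read_text().strip())


def skip_ratio(skipped, scanned):
    """Fraction of candidate page visits that smart scanning skipped."""
    total = skipped + scanned
    return skipped / total if total else 0.0


if __name__ == "__main__":
    try:
        skipped = read_counter("pages_skipped")
        scanned = read_counter("pages_scanned")
    except (OSError, ValueError) as err:
        # KSM may be unavailable, or the kernel may predate these counters.
        print(f"KSM counters unavailable: {err}")
    else:
        print(f"skipped={skipped} scanned={scanned} "
              f"ratio={skip_ratio(skipped, scanned):.1%}")
```

      Because both counters are cumulative, sampling them twice and comparing
      the deltas gives the ratio for a specific interval.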
      
      Link: https://lkml.kernel.org/r/20230926040939.516161-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: add "smart" page scanning mode · 5e924ff5
      Stefan Roesch authored
      Patch series "Smart scanning mode for KSM", v3.
      
      This patch series adds "smart scanning" for KSM.
      
      What is smart scanning?
      =======================
      KSM evaluates all the candidate pages for each scan. It does not use historic
      information from previous scans. This has the effect that candidate pages that
      couldn't be used for KSM de-duplication continue to be evaluated for each scan.
      
      The idea of "smart scanning" is to keep historic information. With the historic
      information we can temporarily skip the candidate page for one or several scans.
      
      Details:
      ========
      "Smart scanning" keeps two small counters for each candidate page. One
      counter stores how often we have already tried to use the page for KSM,
      and the other stores how often we still skip the page.
      
      How often we skip the candidate page depends on how often the page has
      failed KSM de-duplication. The code skips a maximum of 8 times. During
      testing this has proven to be a good compromise for different workloads.
      
      New sysfs knob:
      ===============
      Smart scanning is not enabled by default. It can be enabled with the
      /sys/kernel/mm/ksm/smart_scan knob.
      
      Monitoring:
      ===========
      To monitor how effective smart scanning is, a new sysfs knob has been
      introduced. /sys/kernel/mm/ksm/pages_skipped reports how many pages have
      been skipped by smart scanning.
      
      Results:
      ========
      - Various workloads have shown a 20% - 25% reduction in page scans.
        For the Instagram workload, for instance, the number of pages scanned has
        been reduced from over 20M pages per scan to less than 15M pages.
      - Fewer page scans also resulted in an overall higher de-duplication rate, as
        some shorter-lived pages could additionally be de-duplicated.
      - Fewer pages scanned allows the pages_to_scan parameter to be reduced,
        and this resulted in a 25% reduction in CPU usage.
      - The improvements have been observed for workloads that enable KSM with
        madvise as well as with prctl.
      
      
      This patch (of 4):
      
      This change adds a "smart" page scanning mode for KSM.  So far all the
      candidate pages are continuously scanned to find candidates for
      de-duplication.  There are a considerable number of pages that cannot be
      de-duplicated.  This is costly in terms of CPU.  By using smart scanning
      considerable CPU savings can be achieved.
      
      This change takes the history of scanning pages into account and skips the
      page scanning of certain pages for a while if de-duplication for this
      page has not been successful in the past.
      
      To do this it introduces two new fields in the ksm_rmap_item structure:
      age and remaining_skips.  age is the KSM age, and remaining_skips
      determines how often scanning of this page is skipped.  The age field is
      incremented each time the page is scanned and the page cannot be
      de-duplicated.  The age is capped at U8_MAX.
      
      How often a page is skipped depends on how often de-duplication has been
      tried so far, and the number of skips is currently limited to 8.  This
      value has proven effective with different workloads.
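
      The interaction of age and remaining_skips can be sketched as a small
      model. This is not the kernel code: the age thresholds in
      skips_for_age(), the min_age cut-off, and the choice to bump age on
      every visit are assumptions made for illustration. Only the U8_MAX cap
      and the limit of 8 skips come from the description above.

```python
from dataclasses import dataclass

U8_MAX = 255     # cap for the age counter, as stated above
MAX_SKIPS = 8    # upper bound on the skip budget, as stated above


@dataclass
class RmapItem:
    """Toy stand-in for the two new ksm_rmap_item fields."""
    age: int = 0
    remaining_skips: int = 0


def skips_for_age(age):
    """Map an age to a skip budget (thresholds are illustrative)."""
    if age <= 3:
        return 1
    if age <= 5:
        return 2
    if age <= 8:
        return 4
    return MAX_SKIPS


def should_skip(item, min_age=3):
    """Return True if this scan pass should skip the page.

    Meant to be called once per scan pass for a page that has not been
    de-duplicated so far.
    """
    age = item.age
    if item.age < U8_MAX:
        item.age += 1              # saturating increment
    if age < min_age:
        return False               # young pages always get scanned
    if item.remaining_skips == 0:
        # Budget used up: scan now, then refill the budget by age.
        item.remaining_skips = skips_for_age(age)
        return False
    item.remaining_skips -= 1
    return True
```

      In this model a page that never merges is scanned on its first passes,
      then skipped for progressively longer stretches, with at most 8 skipped
      passes between two scans.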
      
      The feature is currently disabled by default and can be enabled with the
      new smart_scan knob.
      
      The feature has proven to be very effective: up to 25% of the page scans
      can be eliminated; the pages_to_scan rate can be reduced by 40 - 50% and a
      similar de-duplication rate can be maintained.
      
      [akpm@linux-foundation.org: make ksm_smart_scan default true, for testing]
      Link: https://lkml.kernel.org/r/20230926040939.516161-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230926040939.516161-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dax, kmem: calculate abstract distance with general interface · 6bc2cfdf
      Huang Ying authored
      Previously, a fixed abstract distance, MEMTIER_DEFAULT_DAX_ADISTANCE, was
      used for the slow memory type in the kmem driver.  This limits the usage
      of the kmem driver; for example, it cannot be used for HBM (high bandwidth
      memory).
      
      So, we use the general abstract distance calculation mechanism in the kmem
      driver to get a more accurate abstract distance on systems with proper
      support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as a fallback
      only.
      
      Now, multiple memory types may be managed by kmem.  These memory types are
      put into the "kmem_memory_types" list and protected by
      kmem_memory_type_lock.
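
      The list-plus-lock arrangement can be sketched as follows. This is a
      simplified model, not the driver code: it assumes memory types are
      looked up by their abstract distance and created on first use, and the
      class and method names are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass
class MemoryType:
    """Toy memory type, identified only by its abstract distance."""
    adistance: int


class KmemMemoryTypes:
    """Models the "kmem_memory_types" list and its protecting lock."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for kmem_memory_type_lock
        self._types = []                # stands in for kmem_memory_types

    def find_or_alloc(self, adistance):
        """Return the existing type for adistance, or create and list it."""
        with self._lock:
            for mtype in self._types:
                if mtype.adistance == adistance:
                    return mtype
            mtype = MemoryType(adistance)
            self._types.append(mtype)
            return mtype

    def __len__(self):
        with self._lock:
            return len(self._types)
```

      Holding the lock across both the lookup and the append keeps two
      concurrent callers from creating duplicate entries for the same
      distance.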
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-5-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Bharata B Rao <bharata@amd.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • acpi, hmat: calculate abstract distance with HMAT · 3718c02d
      Huang Ying authored
      A memory tiering abstract distance calculation algorithm based on ACPI
      HMAT is implemented.  The basic idea is as follows.
      
      The performance attributes of the system default DRAM nodes are recorded
      as the baseline, whose abstract distance is MEMTIER_ADISTANCE_DRAM.  Then,
      the ratio of the abstract distance of a memory node (target) to
      MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
      attributes of the node to those of the default DRAM nodes.
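
      One way to read this scaling rule is sketched below. The exact formula
      is an assumption of this sketch: read and write values are summed, the
      distance grows with the latency ratio and shrinks with the bandwidth
      ratio. The Perf type, the function name, and the DRAM base value passed
      in are illustrative and are not the kernel's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Perf:
    """Performance attributes of a memory node (units must match)."""
    read_latency: int
    write_latency: int
    read_bandwidth: int
    write_bandwidth: int


def perf_to_adistance(perf, dram_perf, dram_adistance):
    """Scale the DRAM abstract distance by latency and bandwidth ratios."""
    latency = perf.read_latency + perf.write_latency
    bandwidth = perf.read_bandwidth + perf.write_bandwidth
    dram_latency = dram_perf.read_latency + dram_perf.write_latency
    dram_bandwidth = dram_perf.read_bandwidth + dram_perf.write_bandwidth
    if min(latency, bandwidth, dram_latency, dram_bandwidth) <= 0:
        raise ValueError("performance attributes must be positive")
    # Higher latency raises the distance; higher bandwidth lowers it.
    return (dram_adistance * latency * dram_bandwidth
            // (dram_latency * bandwidth))
```

      With this rule a node whose attributes equal the DRAM baseline gets
      exactly the DRAM distance, a node with twice the latency gets twice the
      distance, and a node with twice the bandwidth gets half of it.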
      
      The functions to record the read/write latency/bandwidth of the default
      DRAM nodes and calculate abstract distance according to read/write
      latency/bandwidth ratio will be used by CXL CDAT (Coherent Device
      Attribute Table) and other memory device drivers.  So, they are put in
      memory-tiers.c.
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Bharata B Rao <bharata@amd.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • acpi, hmat: refactor hmat_register_target_initiators() · d0376aac
      Huang Ying authored
      Previously, hmat_register_target_initiators() both calculated the
      performance attributes and created the corresponding sysfs links and
      files.  This function is called during memory onlining.
      
      But now, to calculate the abstract distance of a memory target before
      memory onlining, we need to calculate the performance attributes for a
      memory target without creating sysfs links and files.
      
      To do that, hmat_register_target_initiators() is refactored to make it
      possible to calculate performance attributes separately.
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Tested-by: Alistair Popple <apopple@nvidia.com>
      Tested-by: Bharata B Rao <bharata@amd.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory tiering: add abstract distance calculation algorithms management · 07a8bdd4
      Huang Ying authored
      Patch series "memory tiering: calculate abstract distance based on ACPI
      HMAT", v4.
      
      We have the explicit memory tiers framework to manage systems with
      multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
      There, memory devices of the same kind are grouped into memory types,
      which are then put into memory tiers.  To describe the performance of a
      memory type, abstract distance is defined.  It is directly proportional to
      the memory latency and inversely proportional to the memory bandwidth.  To
      keep the code as simple as possible, a fixed abstract distance is used in
      dax/kmem to describe slow memory such as Optane DCPMM.
      
      To support more memory types, in this series, we added the abstract
      distance calculation algorithm management mechanism, provided an algorithm
      implementation based on ACPI HMAT, and used the general abstract distance
      calculation interface in the dax/kmem driver.  So, dax/kmem can support HBM
      (high bandwidth memory) in addition to the original Optane DCPMM.
      
      
      This patch (of 4):
      
      The abstract distance may be calculated by various drivers, such as ACPI
      HMAT, CXL CDAT, etc., while it may be used by various code that hot-adds
      memory nodes, such as dax/kmem.  To decouple the algorithm users from
      the providers, the abstract distance calculation algorithms management
      mechanism is implemented in this patch.  It provides an interface for the
      providers to register their implementations, and an interface for the
      users.
      
      Multiple algorithm implementations can cooperate via calculating abstract
      distance for different memory nodes.  The preference of algorithm
      implementations can be specified via priority (notifier_block.priority).
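
      The decoupling can be pictured as a priority-ordered chain of callbacks.
      The sketch below is a simplified model with hypothetical names, not the
      kernel interface: each provider either returns a distance for a node or
      declines by returning None, and providers with a higher priority are
      asked first.

```python
class AdistanceAlgorithms:
    """Toy registry that decouples distance providers from users."""

    def __init__(self):
        self._providers = []    # list of (priority, callback) pairs

    def register(self, callback, priority=0):
        """Add a provider; higher priority providers are consulted first."""
        self._providers.append((priority, callback))
        # list.sort is stable, so equal priorities keep registration order.
        self._providers.sort(key=lambda entry: -entry[0])

    def unregister(self, callback):
        """Remove a previously registered provider."""
        self._providers = [e for e in self._providers
                           if e[1] is not callback]

    def calculate(self, node, default=None):
        """Return the first distance any provider offers for node."""
        for _priority, callback in self._providers:
            adistance = callback(node)
            if adistance is not None:
                return adistance
        return default          # no provider knows this node
```

      A user such as a hot-add path would call calculate() and fall back to
      its own default when no provider covers the node.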
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20230926060628.265989-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Bharata B Rao <bharata@amd.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: replace page_ref_freeze() with folio_ref_freeze() in hugetlb_folio_init_vmemmap() · a48bf7b4
      Sidhartha Kumar authored
      No functional difference, folio_ref_freeze() is currently a wrapper for
      page_ref_freeze().
      
      Link: https://lkml.kernel.org/r/20230926174433.81241-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/filemap: remove hugetlb special casing in filemap.c · a08c7193
      Sidhartha Kumar authored
      Remove special cased hugetlb handling code within the page cache by
      changing the granularity of ->index to the base page size rather than the
      huge page size.  The motivation of this patch is to reduce complexity
      within the filemap code while also increasing performance by removing
      branches that are evaluated on every page cache lookup.
      
      To support the change in index, new wrappers for hugetlb page cache
      interactions are added.  These wrappers perform the conversion to a linear
      index which is now expected by the page cache for huge pages.
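
      The index conversion itself is a shift by the huge page order. The
      sketch below assumes 4 KiB base pages and uses illustrative function
      names; it only models the arithmetic, not the wrappers added by the
      patch.

```python
BASE_PAGE_SHIFT = 12    # 4 KiB base pages; an assumption of this sketch


def huge_page_order(huge_page_size):
    """Return log2 of the number of base pages in one huge page."""
    if huge_page_size % (1 << BASE_PAGE_SHIFT):
        raise ValueError("size is not a multiple of the base page size")
    pages = huge_page_size >> BASE_PAGE_SHIFT
    if pages == 0 or pages & (pages - 1):
        raise ValueError("page count must be a power of two")
    return pages.bit_length() - 1


def hugetlb_to_linear_index(huge_index, huge_page_size):
    """Convert an index in huge page units to base page units."""
    return huge_index << huge_page_order(huge_page_size)


def linear_to_hugetlb_index(linear_index, huge_page_size):
    """Convert an index in base page units back to huge page units."""
    return linear_index >> huge_page_order(huge_page_size)
```

      For a 2 MiB huge page this gives an order of 9, so the third huge page
      of a file moves from index 2 to linear index 1024.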
      
      ========================= PERFORMANCE ======================================
      
      Perf was used to check the performance differences after the patch.
      Overall the performance is similar to mainline, with a very slightly
      larger overhead in __filemap_add_folio() and
      hugetlb_add_to_page_cache().  This is because of the larger overhead that
      occurs in xa_load() and xa_store() as the xarray is now using more entries
      to store hugetlb folios in the page cache.
      
      Timing
      
      aarch64
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                  real    1m49.568s
                  user    0m0.000s
                  sys     1m49.461s
      
              6.5-rc3:
                  [root]# time fallocate -l 700GB test.txt
                  real    1m47.495s
                  user    0m0.000s
                  sys     1m47.370s
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m47.024s
                  user    0m0.000s
                  sys     1m46.921s
      
              6.5-rc3:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m44.551s
                  user    0m0.000s
                  sys     1m44.438s
      
      x86
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                  real    0m22.383s
                  user    0m0.000s
                  sys     0m22.255s
      
              6.5-rc3:
                  [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                  real    0m22.735s
                  user    0m0.038s
                  sys     0m22.567s
      
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                  real    0m25.786s
                  user    0m0.001s
                  sys     0m25.589s
      
              6.5-rc3:
                  [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                  real    0m33.454s
                  user    0m0.001s
                  sys     0m33.193s
      
      aarch64:
          workload - fallocate a 700GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--95.04%--__pi_clear_page
                                |
                                |--3.57%--clear_huge_page
                                |          |
                                |          |--2.63%--rcu_all_qs
                                |          |
                                |           --0.91%--__cond_resched
                                |
                                 --0.67%--__cond_resched
                  0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                  0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio
      
          6.5-rc3
              2MB Page Size:
                      --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--94.91%--__pi_clear_page
                                |
                                |--4.11%--clear_huge_page
                                |          |
                                |          |--3.00%--rcu_all_qs
                                |          |
                                |           --1.10%--__cond_resched
                                |
                                 --0.59%--__cond_resched
                  0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                  0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      x86
          workload - fallocate a 100GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  hugetlbfs_fallocate
                  |
                  --99.57%--clear_huge_page
                      |
                      --98.47%--clear_page_erms
                          |
                          --0.53%--asm_sysvec_apic_timer_interrupt
      
                  0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store
      
          6.5-rc3
              2MB Page Size:
                      --99.93%--__x64_sys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                 --99.38%--clear_huge_page
                                           |
                                           |--98.40%--clear_page_erms
                                           |
                                            --0.59%--__cond_resched
                  0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      ========================= TESTING ======================================
      
      This patch passes libhugetlbfs tests and LTP hugetlb tests
      
      ********** TEST SUMMARY
      *                      2M
      *                      32-bit 64-bit
      *     Total testcases:   110    113
      *             Skipped:     0      0
      *                PASS:   107    113
      *                FAIL:     0      0
      *    Killed by signal:     3      0
      *   Bad configuration:     0      0
      *       Expected FAIL:     0      0
      *     Unexpected PASS:     0      0
      *    Test not present:     0      0
      * Strange test result:     0      0
      **********
      
          Done executing testcases.
          LTP Version:  20220527-178-g2761a81c4
      
      page migration was also tested using Mike Kravetz's test program.[8]
      
      [dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
        Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
      Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: test case for prctl fork/exec workflow · 0374af1d
      Stefan Roesch authored
      This adds a new test case to the ksm functional tests to make sure that
      the KSM setting is inherited by the child process when doing a fork/exec.
      
      Link: https://lkml.kernel.org/r/20230922211141.320789-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Carl Klemm <carl@uvos.xyz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: support fork/exec for prctl · 3c6f33b7
      Stefan Roesch authored
      Patch series "mm/ksm: add fork-exec support for prctl", v4.
      
      A process can enable KSM with the prctl system call.  When the process is
      forked the KSM flag is inherited by the child process.  However if the
      process is executing an exec system call directly after the fork, the KSM
      setting is cleared.  This patch series addresses this problem.
      
      1) Change the mask in coredump.h for execing a new process
      2) Add a new test case in ksm_functional_tests
      
      
      This patch (of 2):
      
      Today we have two ways to enable KSM:
      
      1) madvise system call
         This allows KSM to be enabled for a memory region for a long time.
      
      2) prctl system call
         This is a recent addition to enable KSM for the complete process.
         In addition when a process is forked, the KSM setting is inherited.
      
      This change only affects the second case.
      
      One of the use cases for (2) was to support the ability to enable
      KSM for cgroups. This allows systemd to enable KSM for the seed
      process. By enabling it in the seed process all child processes inherit
      the setting.
      
      This works correctly when the process is forked. However, it doesn't
      support the fork/exec workflow.
      
      From the previous cover letter:
      
      ....
      Use case 3:
      With the madvise call sharing opportunities are only enabled for the
      current process: it is a workload-local decision. A considerable number
      of sharing opportunities may exist across multiple workloads or jobs
      (if they are part of the same security domain). Only a higher level
      entity like a job scheduler or container can know for certain if it's
      running one or more instances of a job. That job scheduler however
      doesn't have the necessary internal workload knowledge to make targeted
      madvise calls.
      ....
      
      In addition it can also be a bit surprising that fork keeps the KSM
      setting and fork/exec does not.
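
      The effect of the mask change can be modeled with plain bit operations.
      This sketch assumes that the flags of the new mm after exec are the old
      flags filtered through an init mask; the bit positions and all constant
      names below are hypothetical and do not match the kernel headers.

```python
# Hypothetical flag bits, chosen only for this illustration.
FLAG_DUMPABLE = 1 << 0
FLAG_VM_MERGE_ANY = 1 << 5     # "KSM enabled for the whole process"

# Flags that survive into the new mm, before and after the change.
INIT_MASK_BEFORE = FLAG_DUMPABLE
INIT_MASK_AFTER = FLAG_DUMPABLE | FLAG_VM_MERGE_ANY


def flags_after_exec(parent_flags, init_mask):
    """Keep only the flags that the init mask lets through."""
    return parent_flags & init_mask


def ksm_enabled(flags):
    """True if the process-wide KSM flag is set."""
    return bool(flags & FLAG_VM_MERGE_ANY)
```

      With the old mask the KSM bit is dropped on exec; once the bit is part
      of the mask, a fork followed by exec keeps the setting just like a plain
      fork does.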
      
      Link: https://lkml.kernel.org/r/20230922211141.320789-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230922211141.320789-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Fixes: d7597f59 ("mm: add new api to enable ksm per process")
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reported-by: Carl Klemm <carl@uvos.xyz>
      Tested-by: Carl Klemm <carl@uvos.xyz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: remove unnecessary si_meminfo invoke. · 987ffa5a
      Huan Yang authored
      si_meminfo() reads and assigns more information than just the free and
      total RAM pages.  For DAMOS_WMARK_FREE_MEM_RATE, only the free and total
      RAM page counts are needed, so fetching just those saves CPU.
      
      Link: https://lkml.kernel.org/r/20230920015727.4482-1-link@vivo.com
      Signed-off-by: Huan Yang <link@vivo.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sched/numa, mm: make numa migrate functions to take a folio · 8c9ae56d
      Kefeng Wang authored
      The cpupid (or access time) is stored in the head page for THP, so it is
      safe to make should_numa_migrate_memory() and numa_hint_fault_latency()
      take a folio.  This is in preparation for large folio numa balancing.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-7-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mempolicy: make mpol_misplaced() to take a folio · 75c70128
      Kefeng Wang authored
      In preparation for large folio numa balancing, make mpol_misplaced()
      take a folio.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: make numa_migrate_prep() to take a folio · cda6d936
      Kefeng Wang authored
      In preparation for large folio numa balancing, make numa_migrate_prep()
      take a folio.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use a folio in do_numa_page() · 6695cf68
      Kefeng Wang authored
      Numa balancing only tries to migrate non-compound pages in
      do_numa_page().  Use a folio there to save several compound_head() calls.
      Note that we use folio_estimated_sharers(); it is enough to check the
      folio sharers since only normal pages are handled.  If large folio numa
      balancing is supported, a precise folio sharers check will be used.  No
      functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: huge_memory: use a folio in do_huge_pmd_numa_page() · 667ffc31
      Kefeng Wang authored
      Use a folio in do_huge_pmd_numa_page() to reduce three page_folio() calls
      to one.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: add vm_normal_folio_pmd() · 65610453
      Kefeng Wang authored
      Patch series "mm: convert numa balancing functions to use a folio", v2.
      
      do_numa_page() only handles non-compound pages, and only PMD-mapped THPs
      are handled in do_huge_pmd_numa_page().  But large, PTE-mapped folios
      will be supported, so let's convert more numa balancing functions to
      use/take a folio in preparation for that.  No functional change intended
      for now.
      
      
      This patch (of 6):
      
      The new vm_normal_folio_pmd() wrapper is similar to vm_normal_folio().
      It allows callers to completely replace their struct page variables with
      struct folio variables.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20230921074417.24004-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 06 Oct, 2023 3 commits