1. 27 Sep, 2022 26 commits
    • Yu Zhao's avatar
      mm: x86, arm64: add arch_has_hw_pte_young() · e1fd09e3
      Yu Zhao authored
      Patch series "Multi-Gen LRU Framework", v14.
      
      What's new
      ==========
      1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
         Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
      2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
         machines. The old direct reclaim backoff, which tries to enforce a
         minimum fairness among all eligible memcgs, over-swapped by about
         (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
         pulls the plug on swapping once the target is met, trades some
         fairness for curtailed latency:
         https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
      3. Fixed minior build warnings and conflicts. More comments and nits.
      
      TLDR
      ====
      The current page reclaim is too expensive in terms of CPU usage and it
      often makes poor choices about what to evict. This patchset offers an
      alternative solution that is performant, versatile and
      straightforward.
      
      Patchset overview
      =================
      The design and implementation overview is in patch 14:
      https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
      
      01. mm: x86, arm64: add arch_has_hw_pte_young()
      02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
      Take advantage of hardware features when trying to clear the accessed
      bit in many PTEs.
      
      03. mm/vmscan.c: refactor shrink_node()
      04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
          its sole caller"
      Minor refactors to improve readability for the following patches.
      
      05. mm: multi-gen LRU: groundwork
      Adds the basic data structure and the functions that insert pages to
      and remove pages from the multi-gen LRU (MGLRU) lists.
      
      06. mm: multi-gen LRU: minimal implementation
      A minimal implementation without optimizations.
      
      07. mm: multi-gen LRU: exploit locality in rmap
      Exploits spatial locality to improve efficiency when using the rmap.
      
      08. mm: multi-gen LRU: support page table walks
      Further exploits spatial locality by optionally scanning page tables.
      
      09. mm: multi-gen LRU: optimize multiple memcgs
      Optimizes the overall performance for multiple memcgs running mixed
      types of workloads.
      
      10. mm: multi-gen LRU: kill switch
      Adds a kill switch to enable or disable MGLRU at runtime.
      
      11. mm: multi-gen LRU: thrashing prevention
      12. mm: multi-gen LRU: debugfs interface
      Provide userspace with features like thrashing prevention, working set
      estimation and proactive reclaim.
      
      13. mm: multi-gen LRU: admin guide
      14. mm: multi-gen LRU: design doc
      Add an admin guide and a design doc.
      
      Benchmark results
      =================
      Independent lab results
      -----------------------
      Based on the popularity of searches [01] and the memory usage in
      Google's public cloud, the most popular open-source memory-hungry
      applications, in alphabetical order, are:
            Apache Cassandra      Memcached
            Apache Hadoop         MongoDB
            Apache Spark          PostgreSQL
            MariaDB (MySQL)       Redis
      
      An independent lab evaluated MGLRU with the most widely used benchmark
      suites for the above applications. They posted 960 data points along
      with kernel metrics and perf profiles collected over more than 500
      hours of total benchmark time. Their final reports show that, with 95%
      confidence intervals (CIs), the above applications all performed
      significantly better for at least part of their benchmark matrices.
      
      On 5.14:
      1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
         less wall time to sort three billion random integers, respectively,
         under the medium- and the high-concurrency conditions, when
         overcommitting memory. There were no statistically significant
         changes in wall time for the rest of the benchmark matrix.
      2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
         more transactions per minute (TPM), respectively, under the medium-
         and the high-concurrency conditions, when overcommitting memory.
         There were no statistically significant changes in TPM for the rest
         of the benchmark matrix.
      3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
         and [21.59, 30.02]% more operations per second (OPS), respectively,
         for sequential access, random access and Gaussian (distribution)
         access, when THP=always; 95% CIs [13.85, 15.97]% and
         [23.94, 29.92]% more OPS, respectively, for random access and
         Gaussian access, when THP=never. There were no statistically
         significant changes in OPS for the rest of the benchmark matrix.
      4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
         [2.16, 3.55]% more operations per second (OPS), respectively, for
         exponential (distribution) access, random access and Zipfian
         (distribution) access, when underutilizing memory; 95% CIs
         [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
         respectively, for exponential access, random access and Zipfian
         access, when overcommitting memory.
      
      On 5.15:
      5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
         and [4.11, 7.50]% more operations per second (OPS), respectively,
         for exponential (distribution) access, random access and Zipfian
         (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
         [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
         exponential access, random access and Zipfian access, when swap was
         on.
      6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
         less average wall time to finish twelve parallel TeraSort jobs,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in average wall time for the rest of the
         benchmark matrix.
      7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
         minute (TPM) under the high-concurrency condition, when swap was
         off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in TPM for the rest of the benchmark matrix.
      8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
         [11.47, 19.36]% more total operations per second (OPS),
         respectively, for sequential access, random access and Gaussian
         (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
         [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
         for sequential access, random access and Gaussian access, when
         THP=never.
      
      Our lab results
      ---------------
      To supplement the above results, we ran the following benchmark suites
      on 5.16-rc7 and found no regressions [10].
            fs_fio_bench_hdd_mq      pft
            fs_lmbench               pgsql-hammerdb
            fs_parallelio            redis
            fs_postmark              stream
            hackbench                sysbenchthread
            kernbench                tpcc_spark
            memcached                unixbench
            multichase               vm-scalability
            mutilate                 will-it-scale
            nginx
      
      [01] https://trends.google.com
      [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
      [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
      [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
      [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
      [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
      [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
      [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
      [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
      [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
      
      Read-world applications
      =======================
      Third-party testimonials
      ------------------------
      Konstantin reported [11]:
         I have Archlinux with 8G RAM + zswap + swap. While developing, I
         have lots of apps opened such as multiple LSP-servers for different
         langs, chats, two browsers, etc... Usually, my system gets quickly
         to a point of SWAP-storms, where I have to kill LSP-servers,
         restart browsers to free memory, etc, otherwise the system lags
         heavily and is barely usable.
         
         1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
         patchset, and I started up by opening lots of apps to create memory
         pressure, and worked for a day like this. Till now I had not a
         single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
         getting to the point of 3G in SWAP before without a single
         SWAP-storm.
      
      Vaibhav from IBM reported [12]:
         In a synthetic MongoDB Benchmark, seeing an average of ~19%
         throughput improvement on POWER10(Radix MMU + 64K Page Size) with
         MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
         three different request distributions, namely, Exponential, Uniform
         and Zipfan.
      
      Shuang from U of Rochester reported [13]:
         With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
         and [9.26, 10.36]% higher throughput, respectively, for random
         access, Zipfian (distribution) access and Gaussian (distribution)
         access, when the average number of jobs per CPU is 1; 95% CIs
         [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
         throughput, respectively, for random access, Zipfian access and
         Gaussian access, when the average number of jobs per CPU is 2.
      
      Daniel from Michigan Tech reported [14]:
         With Memcached allocating ~100GB of byte-addressable Optante,
         performance improvement in terms of throughput (measured as queries
         per second) was about 10% for a series of workloads.
      
      Large-scale deployments
      -----------------------
      We've rolled out MGLRU to tens of millions of ChromeOS users and
      about a million Android users. Google's fleetwide profiling [15] shows
      an overall 40% decrease in kswapd CPU usage, in addition to
      improvements in other UX metrics, e.g., an 85% decrease in the number
      of low-memory kills at the 75th percentile and an 18% decrease in
      app launch time at the 50th percentile.
      
      The downstream kernels that have been using MGLRU include:
      1. Android [16]
      2. Arch Linux Zen [17]
      3. Armbian [18]
      4. ChromeOS [19]
      5. Liquorix [20]
      6. OpenWrt [21]
      7. post-factum [22]
      8. XanMod [23]
      
      [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
      [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
      [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
      [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
      [15] https://dl.acm.org/doi/10.1145/2749469.2750392
      [16] https://android.com
      [17] https://archlinux.org
      [18] https://armbian.com
      [19] https://chromium.org
      [20] https://liquorix.net
      [21] https://openwrt.org
      [22] https://codeberg.org/pf-kernel
      [23] https://xanmod.org
      
      Summary
      =======
      The facts are:
      1. The independent lab results and the real-world applications
         indicate substantial improvements; there are no known regressions.
      2. Thrashing prevention, working set estimation and proactive reclaim
         work out of the box; there are no equivalent solutions.
      3. There is a lot of new code; no smaller changes have been
         demonstrated similar effects.
      
      Our options, accordingly, are:
      1. Given the amount of evidence, the reported improvements will likely
         materialize for a wide range of workloads.
      2. Gauging the interest from the past discussions, the new features
         will likely be put to use for both personal computers and data
         centers.
      3. Based on Google's track record, the new code will likely be well
         maintained in the long term. It'd be more difficult if not
         impossible to achieve similar effects with other approaches.
      
      
      This patch (of 14):
      
      Some architectures automatically set the accessed bit in PTEs, e.g., x86
      and arm64 v8.2.  On architectures that do not have this capability,
      clearing the accessed bit in a PTE usually triggers a page fault following
      the TLB miss of this PTE (to emulate the accessed bit).
      
      Being aware of this capability can help make better decisions, e.g.,
      whether to spread the work out over a period of time to reduce bursty page
      faults when trying to clear the accessed bit in many PTEs.
      
      Note that theoretically this capability can be unreliable, e.g.,
      hotplugged CPUs might be different from builtin ones.  Therefore it should
      not be used in architecture-independent code that involves correctness,
      e.g., to determine whether TLB flushes are required (in combination with
      the accessed bit).
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.comSigned-off-by: default avatarYu Zhao <yuzhao@google.com>
      Reviewed-by: default avatarBarry Song <baohua@kernel.org>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Acked-by: default avatarWill Deacon <will@kernel.org>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e1fd09e3
    • Yang Yang's avatar
      mm/page_io: count submission time as thrashing delay for delayacct · 3a9bb7b1
      Yang Yang authored
      Once upon a time, we only support accounting thrashing of page cache. 
      Then Joonsoo introduced workingset detection for anonymous pages and we
      gained the ability to account thrashing of them[1].
      
      Likes PSI, we count submission time as thrashing delay because when the
      device is congested, or the submitting cgroup IO-throttled, submission can
      be a significant part of overall IO time.
      
      Without this patch, swap thrashing through frontswap or some block
      device supporting rw_page operation isn't measured correctly.
      
      This patch is based on "delayacct: support re-entrance detection of
      thrashing accounting".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220815072835.74876-1-yang.yang29@zte.com.cnSigned-off-by: default avatarYang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: default avatarCGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: default avatarRan Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: default avatarwangyong <wang.yong12@zte.com.cn>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3a9bb7b1
    • Yang Yang's avatar
      delayacct: support re-entrance detection of thrashing accounting · aa1cf99b
      Yang Yang authored
      Once upon a time, we only support accounting thrashing of page cache. 
      Then Joonsoo introduced workingset detection for anonymous pages and we
      gained the ability to account thrashing of them[1].
      
      For page cache thrashing accounting, there is no suitable place to do it
      in fs level likes swap_readpage().  So we have to do it in
      folio_wait_bit_common().
      
      Then for anonymous pages thrashing accounting, we have to do it in both
      swap_readpage() and folio_wait_bit_common().  This likes PSI, so we should
      let thrashing accounting supports re-entrance detection.
      
      This patch is to prepare complete thrashing accounting, and is based on
      patch "filemap: make the accounting of thrashing more consistent".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220815071134.74551-1-yang.yang29@zte.com.cnSigned-off-by: default avatarYang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: default avatarCGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: default avatarRan Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: default avatarwangyong <wang.yong12@zte.com.cn>
      Acked-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      aa1cf99b
    • Baolin Wang's avatar
      mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP · 7047b5a4
      Baolin Wang authored
      If THP is failed to migrate due to -ENOSYS or -ENOMEM case, the THP will
      be split, and the subpages of fail-to-migrate THP will be tried to migrate
      again, so we should not account the retry counter in the second loop,
      since we already accounted 'nr_thp_failed' in the first loop.
      
      Moreover we also do not need retry 10 times for -EAGAIN case for the
      subpages of fail-to-migrate THP in the second loop, since we already
      regarded the THP as migration failure, and save some migration time (for
      the worst case, will try 512 * 10 times) according to previous discussion
      [1].
      
      [1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.comTested-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7047b5a4
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for retry · 077309bc
      Huang Ying authored
      After 10 retries, we will give up and the remaining pages will be counted
      as failure in nr_failed and nr_thp_failed.  We should count the failure in
      nr_failed_pages too.  This is done in this patch.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      077309bc
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for THP splitting · e6fa8a79
      Huang Ying authored
      If THP is failed to be migrated, it may be split and retry.  But after
      splitting, the head page will be left in "from" list, although THP
      migration failure has been counted already.  If the head page is failed to
      be migrated too, the failure will be counted twice incorrectly.  So this
      is fixed in this patch via moving the head page of THP after splitting to
      "thp_split_pages" too.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e6fa8a79
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for THP on -ENOSYS · 577be05c
      Huang Ying authored
      If THP or hugetlbfs page migration isn't supported, unmap_and_move() or
      unmap_and_move_huge_page() will return -ENOSYS.  For THP, splitting will
      be tried, but if splitting doesn't succeed, the THP will be left in "from"
      list wrongly.  If some other pages are retried, the THP migration failure
      will counted again.  This is fixed via moving the failure THP from "from"
      to "ret_pages".
      
      Another issue of the original code is that the unsupported failure
      processing isn't consistent between THP and hugetlbfs page.  Make them
      consistent in this patch to make the code easier to be understood too.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      577be05c
    • Huang Ying's avatar
      migrate_pages(): fix failure counting for THP subpages retrying · 5fc30916
      Huang Ying authored
      If THP is failed to be migrated for -ENOSYS and -ENOMEM, the THP will be
      split into thp_split_pages, and after other pages are migrated, pages in
      thp_split_pages will be migrated with no_subpage_counting == true, because
      its failure have been counted already.  If some pages in thp_split_pages
      are retried during migration, we should not count their failure if
      no_subpage_counting == true too.  This is done this patch to fix the
      failure counting for THP subpages retrying.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5fc30916
    • Huang Ying's avatar
      migrate_pages(): fix THP failure counting for -ENOMEM · fbed53b4
      Huang Ying authored
      In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be
      returned, and migrate_pages() will try to split the THP unless "reason" is
      MR_NUMA_MISPLACED (that is, nosplit == true).  But when nosplit == true,
      the THP migration failure will not be counted.
      
      This is incorrect, so in this patch, the THP migration failure will be
      counted for -ENOMEM regardless of nosplit is true or false.  The nr_failed
      counting isn't fixed because it's not used.  Added some comments for it
      per Baolin's suggestion.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fbed53b4
    • Huang Ying's avatar
      migrate_pages(): remove unnecessary list_safe_reset_next() · 9c62ff00
      Huang Ying authored
      Before commit b5bade97 ("mm: migrate: fix the return value of
      migrate_pages()"), the tail pages of THP will be put in the "from"
      list directly.  So one of the loop cursors (page2) needs to be reset,
      as is done in try_split_thp() via list_safe_reset_next().  But after
      the commit, the tail pages of THP will be put in a dedicated
      list (thp_split_pages).  That is, the "from" list will not be changed
      during splitting.  So, it's unnecessary to call list_safe_reset_next()
      anymore.
      
      This is a code cleanup, no functionality changes are expected.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9c62ff00
    • Huang Ying's avatar
      migrate: fix syscall move_pages() return value for failure · a7504ed1
      Huang Ying authored
      Patch series "migrate_pages(): fix several bugs in error path", v3.
      
      During review the code of migrate_pages() and build a test program for
      it.  Several bugs in error path are identified and fixed in this
      series.
      
      Most patches are tested via
      
      - Apply error-inject.patch in Linux kernel
      - Compile test-migrate.c (with -lnuma)
      - Test with test-migrate.sh
      
      error-inject.patch, test-migrate.c, and test-migrate.sh are as below.
      It turns out that error injection is an important tool to fix bugs in
      error path.
      
      
      This patch (of 8):
      
      The return value of move_pages() syscall is incorrect when counting
      the remaining pages to be migrated.  For example, for the following
      test program,
      
      "
       #define _GNU_SOURCE
      
       #include <stdbool.h>
       #include <stdio.h>
       #include <string.h>
       #include <stdlib.h>
       #include <errno.h>
      
       #include <fcntl.h>
       #include <sys/uio.h>
       #include <sys/mman.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <numaif.h>
       #include <numa.h>
      
       #ifndef MADV_FREE
       #define MADV_FREE	8		/* free pages only if memory pressure */
       #endif
      
       #define ONE_MB		(1024 * 1024)
       #define MAP_SIZE	(16 * ONE_MB)
       #define THP_SIZE	(2 * ONE_MB)
       #define THP_MASK	(THP_SIZE - 1)
      
       #define ERR_EXIT_ON(cond, msg)					\
      	 do {							\
      		 int __cond_in_macro = (cond);			\
      		 if (__cond_in_macro)				\
      			 error_exit(__cond_in_macro, (msg));	\
      	 } while (0)
      
       void error_msg(int ret, int nr, int *status, const char *msg)
       {
      	 int i;
      
      	 fprintf(stderr, "Error: %s, ret : %d, error: %s\n",
      		 msg, ret, strerror(errno));
      
      	 if (!nr)
      		 return;
      	 fprintf(stderr, "status: ");
      	 for (i = 0; i < nr; i++)
      		 fprintf(stderr, "%d ", status[i]);
      	 fprintf(stderr, "\n");
       }
      
       void error_exit(int ret, const char *msg)
       {
      	 error_msg(ret, 0, NULL, msg);
      	 exit(1);
       }
      
       int page_size;
      
       bool do_vmsplice;
       bool do_thp;
      
       static int pipe_fds[2];
       void *addr;
       char *pn;
       char *pn1;
       void *pages[2];
       int status[2];
      
       void prepare()
       {
      	 int ret;
      	 struct iovec iov;
      
      	 if (addr) {
      		 munmap(addr, MAP_SIZE);
      		 close(pipe_fds[0]);
      		 close(pipe_fds[1]);
      	 }
      
      	 ret = pipe(pipe_fds);
      	 ERR_EXIT_ON(ret, "pipe");
      
      	 addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	 ERR_EXIT_ON(addr == MAP_FAILED, "mmap");
      	 if (do_thp) {
      		 ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE);
      		 ERR_EXIT_ON(ret, "advise hugepage");
      	 }
      
      	 pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK);
      	 pn1 = pn + THP_SIZE;
      	 pages[0] = pn;
      	 pages[1] = pn1;
      	 *pn = 1;
      
      	 if (do_vmsplice) {
      		 iov.iov_base = pn;
      		 iov.iov_len = page_size;
      		 ret = vmsplice(pipe_fds[1], &iov, 1, 0);
      		 ERR_EXIT_ON(ret < 0, "vmsplice");
      	 }
      
      	 status[0] = status[1] = 1024;
       }
      
       void test_migrate()
       {
      	 int ret;
      	 int nodes[2] = { 1, 1 };
      	 pid_t pid = getpid();
      
      	 prepare();
      	 ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 1, status, "move 1 page");
      
      	 prepare();
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 not mapped");
      
      	 prepare();
      	 *pn1 = 1;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages");
      
      	 prepare();
      	 *pn1 = 1;
      	 nodes[1] = 0;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 to node 0");
       }
      
       int main(int argc, char *argv[])
       {
      	 numa_run_on_node(0);
      	 page_size = getpagesize();
      
      	 test_migrate();
      
      	 fprintf(stderr, "\nMake page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTest THP:\n");
      	 do_thp = true;
      	 do_vmsplice = false;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 return 0;
       }
      "
      
      The output of the current kernel is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 0, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 1, error: Success
      status: 1024 1024
      "
      
      While the expected output is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 1, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 1, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 2, error: Success
      status: 1024 1024
      "
      
      Fix this via correcting the remaining pages counting.  With the fix,
      the output for the test program as above is expected.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a7504ed1
    • Yang Yang's avatar
      filemap: make the accounting of thrashing more consistent · f347c9d2
      Yang Yang authored
      Once upon a time, we only support accounting thrashing of page cache. 
      Then Joonsoo introduced workingset detection for anonymous pages and we
      gained the ability to account thrashing of them[1].
      
      So let delayacct account both the thrashing of page cache and anonymous
      pages, this could make the codes more consistent and simpler.
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220805033838.1714674-1-yang.yang29@zte.com.cnSigned-off-by: default avatarYang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: default avatarCGEL ZTE <cgel.zte@gmail.com>
      Acked-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f347c9d2
    • Peter Xu's avatar
      mm/swap: cache swap migration A/D bits support · 5154e607
      Peter Xu authored
      Introduce a variable swap_migration_ad_supported to cache whether the arch
      supports swap migration A/D bits.
      
      Here one thing to mention is that SWP_MIG_TOTAL_BITS will internally
      reference the other macro MAX_PHYSMEM_BITS, which is a function call on
      x86 (constant on all the rest of archs).
      
      It's safe to reference it in swapfile_init() because when reaching here
      we're already during initcalls level 4 so we must have initialized 5-level
      pgtable for x86_64 (right after early_identify_cpu() finishes).
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
            - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      This should slightly speed up the migration swap entry handlings.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-8-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5154e607
    • Peter Xu's avatar
      mm/swap: cache maximum swapfile size when init swap · be45a490
      Peter Xu authored
      We used to have swapfile_maximum_size() fetching a maximum value of
      swapfile size per-arch.
      
      As the caller of max_swapfile_size() grows, this patch introduce a
      variable "swapfile_maximum_size" and cache the value of old
      max_swapfile_size(), so that we don't need to calculate the value every
      time.
      
      Caching the value in swapfile_init() is safe because when reaching the
      phase we should have initialized all the relevant information.  Here the
      major arch to take care of is x86, which defines the max swapfile size
      based on L1TF mitigation.
      
      Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
      when reaching swapfile_init().  As a reference, the code path looks like
      this for x86:
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - early_identify_cpu --> setup X86_BUG_L1TF
        - parse_early_param
          - l1tf_cmdline --> set l1tf_mitigation
        - check_bugs
          - l1tf_select_mitigation --> set l1tf_mitigation
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      The swapfile size only depends on swp pte format on non-x86 archs, so
      caching it is safe too.
      
      Since at it, rename max_swapfile_size() to arch_max_swapfile_size()
      because arch can define its own function, so it's more straightforward to
      have "arch_" as its prefix.  At the meantime, export swapfile_maximum_size
      to replace the old usages of max_swapfile_size().
      
      [peterx@redhat.com: declare arch_max_swapfile_size) in swapfile.h]
        Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local
      Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      be45a490
    • Peter Xu's avatar
      mm: remember young/dirty bit for page migrations · 2e346877
      Peter Xu authored
      When page migration happens, we always ignore the young/dirty bit settings
      in the old pgtable, and marking the page as old in the new page table
      using either pte_mkold() or pmd_mkold(), and keeping the pte clean.
      
      That's fine from functional-wise, but that's not friendly to page reclaim
      because the moving page can be actively accessed within the procedure. 
      Not to mention hardware setting the young bit can bring quite some
      overhead on some systems, e.g.  x86_64 needs a few hundreds nanoseconds to
      set the bit.  The same slowdown problem to dirty bits when the memory is
      first written after page migration happened.
      
      Actually we can easily remember the A/D bit configuration and recover the
      information after the page is migrated.  To achieve it, define a new set
      of bits in the migration swap offset field to cache the A/D bits for old
      pte.  Then when removing/recovering the migration entry, we can recover
      the A/D bits even if the page changed.
      
      One thing to mention is that here we used max_swapfile_size() to detect
      how many swp offset bits we have, and we'll only enable this feature if we
      know the swp offset is big enough to store both the PFN value and the A/D
      bits.  Otherwise the A/D bits are dropped like before.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2e346877
    • Peter Xu's avatar
      mm/thp: carry over dirty bit when thp splits on pmd · 0ccf7f16
      Peter Xu authored
      Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
      shouldn't be a correctness issue since when pmd_dirty() we'll have the
      page marked dirty anyway, however having dirty bit carried over helps the
      next initial writes of split ptes on some archs like x86.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarHuang Ying <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0ccf7f16
    • Peter Xu's avatar
      mm/swap: add swp_offset_pfn() to fetch PFN from swap entry · 0d206b5d
      Peter Xu authored
      We've got a bunch of special swap entries that stores PFN inside the swap
      offset fields.  To fetch the PFN, normally the user just calls
      swp_offset() assuming that'll be the PFN.
      
      Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
      max possible length of a PFN on the host, meanwhile doing proper check
      with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the
      PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
      
      One reason to do so is we never tried to sanitize whether swap offset can
      really fit for storing PFN.  At the meantime, this patch also prepares us
      with the future possibility to store more information inside the swp
      offset field, so assuming "swp_offset(entry)" to be the PFN will not stand
      any more very soon.
      
      Replace many of the swp_offset() callers to use swp_offset_pfn() where
      proper.  Note that many of the existing users are not candidates for the
      replacement, e.g.:
      
        (1) When the swap entry is not a pfn swap entry at all, or,
        (2) when we wanna keep the whole swp_offset but only change the swp type.
      
      For the latter, it can happen when fork() triggered on a write-migration
      swap entry pte, we may want to only change the migration type from
      write->read but keep the rest, so it's not "fetching PFN" but "changing
      swap type only".  They're left aside so that when there're more
      information within the swp offset they'll be carried over naturally in
      those cases.
      
      Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
      the new swp_offset_pfn() is about.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0d206b5d
    • Peter Xu's avatar
      mm/swap: comment all the ifdef in swapops.h · eba4d770
      Peter Xu authored
      swapops.h contains quite a few layers of ifdef, some of the "else" and
      "endif" doesn't get proper comment on the macro so it's hard to follow on
      what are they referring to.  Add the comments.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-3-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Suggested-by: default avatarNadav Amit <nadav.amit@gmail.com>
      Reviewed-by: default avatarHuang Ying <ying.huang@intel.com>
      Reviewed-by: default avatarAlistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      eba4d770
    • Peter Xu's avatar
      mm/x86: use SWP_TYPE_BITS in 3-level swap macros · 9c61d532
      Peter Xu authored
      Patch series "mm: Remember a/d bits for migration entries", v4.
      
      
      Problem
      =======
      
      When migrating a page, right now we always mark the migrated page as old &
      clean.
      
      However that could lead to at least two problems:
      
        (1) We lost the real hot/cold information while we could have persisted.
            That information shouldn't change even if the backing page is changed
            after the migration,
      
        (2) There can be always extra overhead on the immediate next access to
            any migrated page, because hardware MMU needs cycles to set the young
            bit again for reads, and dirty bits for write, as long as the
            hardware MMU supports these bits.
      
      Many of the recent upstream works showed that (2) is not something trivial
      and actually very measurable.  In my test case, reading 1G chunk of memory
      - jumping in page size intervals - could take 99ms just because of the
      extra setting on the young bit on a generic x86_64 system, comparing to
      4ms if young set.
      
      This issue is originally reported by Andrea Arcangeli.
      
      Solution
      ========
      
      To solve this problem, this patchset tries to remember the young/dirty
      bits in the migration entries and carry them over when recovering the
      ptes.
      
      We have the chance to do so because in many systems the swap offset is not
      really fully used.  Migration entries use swp offset to store PFN only,
      while the PFN is normally not as large as swp offset and normally smaller.
      It means we do have some free bits in swp offset that we can use to store
      things like A/D bits, and that's how this series tried to approach this
      problem.
      
      max_swapfile_size() is used here to detect per-arch offset length in swp
      entries.  We'll automatically remember the A/D bits when we find that we
      have enough swp offset field to keep both the PFN and the extra bits.
      
      Since max_swapfile_size() can be slow, the last two patches cache the
      results for it and also swap_migration_ad_supported as a whole.
      
      Known Issues / TODOs
      ====================
      
      We still haven't taught madvise() to recognize the new A/D bits in
      migration entries, namely MADV_COLD/MADV_FREE.  E.g.  when MADV_COLD upon
      a migration entry.  It's not clear yet on whether we should clear the A
      bit, or we should just drop the entry directly.
      
      We didn't teach idle page tracking on the new migration entries, because
      it'll need larger rework on the tree on rmap pgtable walk.  However it
      should make it already better because before this patchset page will be
      old page after migration, so the series will fix potential false negative
      of idle page tracking when pages were migrated before observing.
      
      The other thing is migration A/D bits will not start to working for
      private device swap entries.  The code is there for completeness but since
      private device swap entries do not yet have fields to store A/D bits, even
      if we'll persistent A/D across present pte switching to migration entry,
      we'll lose it again when the migration entry converted to private device
      swap entry.
      
      Tests
      =====
      
      After the patchset applied, the immediate read access test [1] of above 1G
      chunk after migration can shrink from 99ms to 4ms.  The test is done by
      moving 1G pages from node 0->1->0 then read it in page size jumps.  The
      test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
      
      Similar effect can also be measured when writting the memory the 1st time
      after migration.
      
      After applying the patchset, both initial immediate read/write after page
      migrated will perform similarly like before migration happened.
      
      Patch Layout
      ============
      
      Patch 1-2:  Cleanups from either previous versions or on swapops.h macros.
      
      Patch 3-4:  Prepare for the introduction of migration A/D bits
      
      Patch 5:    The core patch to remember young/dirty bit in swap offsets.
      
      Patch 6-7:  Cache relevant fields to make migration_entry_supports_ad() fast.
      
      [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
      
      
      This patch (of 7):
      
      Replace all the magic "5" with the macro.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarHuang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9c61d532
    • Miaohe Lin's avatar
      mm, hwpoison: cleanup some obsolete comments · 9cf28191
      Miaohe Lin authored
      1.Remove meaningless comment in kill_proc(). That doesn't tell anything.
      2.Fix the wrong function name get_hwpoison_unless_zero(). It should be
      get_page_unless_zero().
      3.The gate keeper for free hwpoison page has moved to check_new_page().
      Update the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-7-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9cf28191
    • Miaohe Lin's avatar
      mm, hwpoison: check PageTable() explicitly in hwpoison_user_mappings() · b680dae9
      Miaohe Lin authored
      PageTable can't be handled by memory_failure(). Filter it out explicitly in
      hwpoison_user_mappings(). This will also make code more consistent with the
      relevant check in unpoison_memory().
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-6-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b680dae9
    • Miaohe Lin's avatar
      mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon() · 36537a67
      Miaohe Lin authored
      If vma->vm_mm != t->mm, there's no need to call page_mapped_in_vma() as
      add_to_kill() won't be called in this case. Move up the mm check to avoid
      possible unneeded calling to page_mapped_in_vma().
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-5-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      36537a67
    • Miaohe Lin's avatar
      mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages · 21c9e90a
      Miaohe Lin authored
      Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
      num_poisoned_pages_dec() can be killed as there's no caller now.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      21c9e90a
    • Miaohe Lin's avatar
      mm, hwpoison: use __PageMovable() to detect non-lru movable pages · da294991
      Miaohe Lin authored
      It's more recommended to use __PageMovable() to detect non-lru movable
      pages. We can avoid bumping page refcnt via isolate_movable_page() for
      the isolated lru pages. Also if pages become PageLRU just after they're
      checked but before trying to isolate them, isolate_lru_page() will be
      called to do the right work.
      
      [linmiaohe@huawei.com: fixes per Naoya Horiguchi]
        Link: https://lkml.kernel.org/r/1f7ee86e-7d28-0d8c-e0de-b7a5a94519e8@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-3-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      da294991
    • Miaohe Lin's avatar
      mm, hwpoison: use ClearPageHWPoison() in memory_failure() · 2fe62e22
      Miaohe Lin authored
      Patch series "A few cleanup patches for memory-failure".
      
      his series contains a few cleanup patches to use __PageMovable() to detect
      non-lru movable pages, use num_poisoned_pages_sub() to reduce multiple
      atomic ops overheads and so on.  More details can be found in the
      respective changelogs.
      
      
      This patch (of 6):
      
      Use ClearPageHWPoison() instead of TestClearPageHWPoison() to clear page
      hwpoison flags to avoid unneeded full memory barrier overhead.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-2-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2fe62e22
    • Yang Shi's avatar
      mm: MADV_COLLAPSE: refetch vm_end after reacquiring mmap_lock · 4d24de94
      Yang Shi authored
      The syzbot reported the below problem:
      
      BUG: Bad page map in process syz-executor198  pte:8000000071c00227 pmd:74b30067
      addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
      file:(null) fault:0x0 mmap:0x0 read_folio:0x0
      CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
       vm_normal_page+0x10c/0x2a0 mm/memory.c:636
       hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
       madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
       do_madvise mm/madvise.c:1428 [inline]
       __do_sys_madvise mm/madvise.c:1428 [inline]
       __se_sys_madvise mm/madvise.c:1426 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f770ba87929
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
      R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
      R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
      
      Basically the test program does the below conceptually:
      1. mmap 0x2000000 - 0x21000000 as anonymous region
      2. mmap io_uring SQ stuff at 0x20563000 with MAP_FIXED, io_uring_mmap()
         actually remaps the pages with special PTEs
      3. call MADV_COLLAPSE for 0x20000000 - 0x21000000
      
      It actually triggered the below race:
      
                   CPU A                                          CPU B
      mmap 0x20000000 - 0x21000000 as anon
                                                 madvise_collapse is called on this area
                                                   Retrieve start and end address from the vma (NEVER updated later!)
                                                   Collapsed the first 2M area and dropped mmap_lock
      Acquire mmap_lock
      mmap io_uring file at 0x20563000
      Release mmap_lock
                                                   Reacquire mmap_lock
                                                   revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000
                                                   scan the next 2M (0x20200000 - 0x20400000), but due to whatever reason it didn't release mmap_lock
                                                   scan the 3rd 2M area (start from 0x20400000)
                                                     get into the vma created by io_uring
      
      The hend should be updated after MADV_COLLAPSE reacquire mmap_lock since
      the vma may be shrunk.  We don't have to worry about shink from the other
      direction since it could be caught by hugepage_vma_revalidate().  Either
      no valid vma is found or the vma doesn't fit anymore.
      
      Link: https://lkml.kernel.org/r/20220914162220.787703-1-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Reported-by: syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarZach O'Keefe <zokeefe@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4d24de94
  2. 26 Sep, 2022 12 commits
  3. 12 Sep, 2022 2 commits
    • David Hildenbrand's avatar
      mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      David Hildenbrand authored
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, that the PTE is
      cleared/invalidated and that the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine, however, we have to enforce a certain memory order and
      are missing memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
      convert an exclusive anonymous page to a KSM page, and that page is already
      mapped write-protected (-> no PTE change) would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      088b8aa5
    • Christophe JAILLET's avatar
      mm/mremap_pages: save a few cycles in get_dev_pagemap() · e7b72c48
      Christophe JAILLET authored
      Use 'percpu_ref_tryget_live_rcu()' instead of 'percpu_ref_tryget_live()'
      to save a few cycles when it is known that the rcu lock is already
      taken/released.
      
      Link: https://lkml.kernel.org/r/9ef1562a1975371360f3e263856e9f1c5749b656.1662136782.git.christophe.jaillet@wanadoo.frSigned-off-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e7b72c48