  1. 04 Aug, 2023 1 commit
  2. 23 Jun, 2023 1 commit
  3. 19 Jun, 2023 3 commits
    • mm: ptep_get() conversion · c33c7948
      Ryan Roberts authored
      Convert all instances of direct pte_t* dereferencing to instead use
      ptep_get() helper.  This means that by default, the accesses change from a
      C dereference to a READ_ONCE().  This is technically the correct thing to
      do since where pgtables are modified by HW (for access/dirty) they are
      volatile and therefore we should always ensure READ_ONCE() semantics.
      
      But more importantly, by always using the helper, it can be overridden by
      the architecture to fully encapsulate the contents of the pte.  Arch code
      is deliberately not converted, as the arch code knows best.  It is
      intended that arch code (arm64) will override the default with its own
      implementation that can (e.g.) hide certain bits from the core code, or
      determine young/dirty status by mixing in state from another source.
      
      Conversion was done using Coccinelle:
      
      ----
      
      // $ make coccicheck \
      //          COCCI=ptepget.cocci \
      //          SPFLAGS="--include-headers" \
      //          MODE=patch
      
      virtual patch
      
      @ depends on patch @
      pte_t *v;
      @@
      
      - *v
      + ptep_get(v)
      
      ----
      
      Then reviewed and hand-edited to avoid multiple unnecessary calls to
      ptep_get(), instead opting to store the result of a single call in a
      variable, where it is correct to do so.  This aims to negate any cost of
      READ_ONCE() and will benefit arch-overrides that may be more complex.
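
      For illustration, a minimal sketch of the resulting pattern.  The caller
      below is illustrative, not a specific converted site; the generic
      ptep_get() fallback in include/linux/pgtable.h is (roughly) a READ_ONCE()
      wrapper that architectures may override:

          /* Generic fallback, roughly: */
          static inline pte_t ptep_get(pte_t *ptep)
          {
                  return READ_ONCE(*ptep);
          }

          /* Hand-edited caller shape: one ptep_get(), result cached locally. */
          static bool example_pte_is_young(pte_t *ptep)
          {
                  pte_t pte = ptep_get(ptep);     /* was: *ptep, possibly read several times */

                  return pte_present(pte) && pte_young(pte);
          }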
      
      Included is a fix for an issue in an earlier version of this patch that
      was pointed out by kernel test robot.  The issue arose because config
      MMU=n elides definition of the ptep helper functions, including
      ptep_get().  HUGETLB_PAGE=n configs still define a simple
      huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
      So when both configs are disabled, this caused a build error because
      ptep_get() is not defined.  Fix by continuing to do a direct dereference
      when MMU=n.  This is safe because for this config the arch code cannot be
      trying to virtualize the ptes because none of the ptep helpers are
      defined.
      
      Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/various: give up if pte_offset_map[_lock]() fails · 04dee9e8
      Hugh Dickins authored
      Following the examples of nearby code, various functions can just give up
      if pte_offset_map() or pte_offset_map_lock() fails.  And there's no need
      for a preliminary pmd_trans_unstable() or other such check, since such
      cases are now safely handled inside.
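
      A hedged sketch of the pattern those converted sites now follow (the
      function and variable names here are illustrative only):

          static int example_touch_one_pte(struct mm_struct *mm, pmd_t *pmd,
                                           unsigned long addr)
          {
                  spinlock_t *ptl;
                  pte_t *pte;

                  pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
                  if (!pte)
                          return 0;       /* no page table here (or it changed): give up */
                  /* ... operate on the pte under ptl ... */
                  pte_unmap_unlock(pte, ptl);
                  return 0;
          }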
      
      Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: use pmdp_get_lockless() without surplus barrier() · 26e1a0c3
      Hugh Dickins authored
      Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.
      
      What is it all about?  Some mmap_lock avoidance i.e.  latency reduction. 
      Initially just for the case of collapsing shmem or file pages to THPs; but
      likely to be relied upon later in other contexts e.g.  freeing of empty
      page tables (but that's not work I'm doing).  mmap_write_lock avoidance
      when collapsing to anon THPs?  Perhaps, but again that's not work I've
      done: a quick attempt was not as easy as the shmem/file case.
      
      I would much prefer not to have to make these small but wide-ranging
      changes for such a niche case; but failed to find another way, and have
      heard that shmem MADV_COLLAPSE's usefulness is being limited by that
      mmap_write_lock it currently requires.
      
      These changes (though of course not these exact patches) have been in
      Google's data centre kernel for three years now: we do rely upon them.
      
      What is this preparatory series about?
      
      The current mmap locking will not be enough to guard against that tricky
      transition between pmd entry pointing to page table, and empty pmd entry,
      and pmd entry pointing to huge page: pte_offset_map() will have to
      validate the pmd entry for itself, returning NULL if no page table is
      there.  What to do about that varies: sometimes nearby error handling
      indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
      again" is appropriate (and if that risks an infinite loop, then there must
      have been an oops, or pfn 0 mistaken for page table, before).
      
      Given the likely extension to freeing empty page tables, I have not
      limited this set of changes to a THP config; and it has been easier, and
      sets a better example, if each site is given appropriate handling: even
      where deeper study might prove that failure could only happen if the pmd
      table were corrupted.
      
      Several of the patches are, or include, cleanup on the way; and by the
      end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
      pte_offset_map_lock() then handle those original races and more.  Most
      uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
      its place.
      
      
      This patch (of 32):
      
      Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
      reliable result with PAE (or READ_ONCE as before without PAE); and remove
      the unnecessary extra barrier()s which got left behind in its callers.
      
      HOWEVER: Note the small print in linux/pgtable.h, where it was designed
      specifically for fast GUP, and depends on interrupts being disabled for
      its full guarantee: most callers which have been added (here and before)
      do NOT have interrupts disabled, so there is still some need for caution.
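
      Illustrative shape of the conversion (a sketch, not a specific call site):

          static pmd_t example_read_pmd(pmd_t *pmdp)
          {
                  /* was: pmd_t pmdval = READ_ONCE(*pmdp); barrier(); */
                  return pmdp_get_lockless(pmdp); /* PAE-safe, no extra barrier() */
          }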
      
      Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 03 May, 2023 2 commits
  5. 21 Apr, 2023 2 commits
    • mm: add new KSM process and sysfs knobs · d21077fb
      Stefan Roesch authored
      This adds the general_profit KSM sysfs knob and the process profit metric
      knobs to ksm_stat.
      
      1) expose general_profit metric
      
         The documentation mentions a general profit metric; however, this
         metric is not calculated.  In addition, the formula depends on the
         size of internal structures, which makes the calculation difficult
         for an administrator to do by hand.  Add the metric for a better
         user experience (the documented formulas are sketched after this
         list).
      
      2) document general_profit sysfs knob
      
      3) calculate ksm process profit metric
      
         The ksm documentation mentions the process profit metric and how to
         calculate it.  This adds the calculation of the metric.
      
      4) mm: expose ksm process profit metric in ksm_stat
      
         This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
         The documentation mentions the formula for the ksm process profit
         metric, but the kernel does not calculate it.  In addition, the
         formula depends on the size of internal structures, so it makes
         sense to expose the value directly.
      
      5) document new procfs ksm knobs
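
      The documented formulas being implemented are, roughly (rmap_item here
      stands for KSM's internal per-page tracking structure; see
      Documentation/admin-guide/mm/ksm.rst):

          general_profit ~= pages_sharing * page_size -
                            all_rmap_items * sizeof(rmap_item)

          process_profit ~= ksm_merging_pages * page_size -
                            ksm_rmap_items * sizeof(rmap_item)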
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add new api to enable ksm per process · d7597f59
      Stefan Roesch authored
      Patch series "mm: process/cgroup ksm support", v9.
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      Use case 1:
        The madvise call is not available in the programming language.  An
        example of this is programs with forked workloads using a garbage
        collected language without pointers.  In such a language madvise cannot
        be made available.
      
        In addition the addresses of objects get moved around as they are
        garbage collected.  KSM sharing needs to be enabled "from the outside"
        for these types of workloads.
      
      Use case 2:
        The same interpreter can also be used for workloads where KSM brings
        no benefit or even has overhead.  We'd like to be able to enable KSM on
        a workload by workload basis.
      
      Use case 3:
        With the madvise call sharing opportunities are only enabled for the
        current process: it is a workload-local decision.  A considerable number
        of sharing opportunities may exist across multiple workloads or jobs (if
        they are part of the same security domain).  Only a higher level entity
        like a job scheduler or container can know for certain if it's running
        one or more instances of a job.  That job scheduler however doesn't have
        the necessary internal workload knowledge to make targeted madvise
        calls.
      
      Security concerns:
      
        In previous discussions security concerns have been brought up.  The
        problem is that an individual workload does not have the knowledge about
        what else is running on a machine.  Therefore it has to be very
        conservative in what memory areas can be shared or not.  However, if the
        system is dedicated to running multiple jobs within the same security
        domain, it's the job scheduler that has the knowledge that sharing can be
        safely enabled and is even desirable.
      
      Performance:
      
        Experiments with using UKSM have shown a capacity increase of around 20%.
      
        Here are the metrics from an instagram workload (taken from a machine
        with 64GB main memory):
      
         full_scans: 445
         general_profit: 20158298048
         max_page_sharing: 256
         merge_across_nodes: 1
         pages_shared: 129547
         pages_sharing: 5119146
         pages_to_scan: 4000
         pages_unshared: 1760924
         pages_volatile: 10761341
         run: 1
         sleep_millisecs: 20
         stable_node_chains: 167
         stable_node_chains_prune_millisecs: 2000
         stable_node_dups: 2751
         use_zero_pages: 0
         zero_pages_sharing: 0
      
      After the service is running for 30 minutes to an hour, 4 to 5 million
      shared pages are common for this workload when using KSM.
      
      
      Detailed changes:
      
      1. New options for prctl system command
         This patch series adds two new options to the prctl system call. 
         The first one allows enabling KSM at the process level and the second
         one queries the setting.
      
      The setting will be inherited by child processes.
      
      With the above setting, KSM can be enabled for the seed process of a cgroup
      and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMA's and enable KSM for the eligible VMA's.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
      3. Add general_profit metric
         The general_profit metric of KSM is specified in the documentation,
         but not calculated.  This adds the general profit metric to
         /sys/kernel/mm/ksm.
      
      4. Add more metrics to ksm_stat
         This adds the process profit metric to /proc/<pid>/ksm_stat.
      
      5. Add more tests to ksm_tests and ksm_functional_tests
         This adds an option to specify the merge type to the ksm_tests. 
         This allows testing madvise and prctl KSM.
      
         It also adds two new tests to ksm_functional_tests: one to test
         the new prctl options and the other one is a fork test to verify that
         the KSM process setting is inherited by client processes.
      
      
      This patch (of 3):
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      1. New options for prctl system command
      
         This patch series adds two new options to the prctl system call.
         The first one allows enabling KSM at the process level and the second
         one queries the setting (an illustrative usage sketch follows at the
         end of this list).
      
         The setting will be inherited by child processes.
      
         With the above setting, KSM can be enabled for the seed process of a
         cgroup and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
      
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMA's and enable KSM for the eligible VMA's.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
        1) Introduce new MMF_VM_MERGE_ANY flag
      
           This introduces the new MMF_VM_MERGE_ANY flag.  When this flag
           is set, kernel samepage merging (ksm) gets enabled for all vma's of a
           process.
      
        2) Setting VM_MERGEABLE on VMA creation
      
           When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
           VM_MERGEABLE flag will be set for this VMA.
      
        3) support disabling of ksm for a process
      
           This adds the ability to disable ksm for a process if ksm has been
           enabled for the process with prctl.
      
        4) add new prctl option to get and set ksm for a process
      
           This adds two new options to the prctl system call
           - enable ksm for all vmas of a process (if the vmas support it).
           - query if ksm has been enabled for a process.
      
      3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
      
         In the s390 architecture, when storage keys are used,
         MMF_VM_MERGE_ANY will be disabled.
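
      A hedged userspace usage sketch.  The PR_SET_MEMORY_MERGE /
      PR_GET_MEMORY_MERGE names and the fallback values below reflect
      include/uapi/linux/prctl.h as merged; treat them as assumptions when
      building against older headers:

          #include <stdio.h>
          #include <sys/prctl.h>

          #ifndef PR_SET_MEMORY_MERGE             /* assumed values for older headers */
          #define PR_SET_MEMORY_MERGE 67
          #define PR_GET_MEMORY_MERGE 68
          #endif

          int main(void)
          {
                  /* Opt the whole process (and future children) into KSM. */
                  if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                          perror("PR_SET_MEMORY_MERGE");

                  /* Query the setting: returns 1 if enabled, 0 if not. */
                  printf("ksm enabled: %d\n", prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
                  return 0;
          }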
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  6. 18 Apr, 2023 1 commit
  7. 28 Mar, 2023 1 commit
  8. 24 Mar, 2023 1 commit
  9. 09 Feb, 2023 1 commit
  10. 03 Feb, 2023 1 commit
  11. 12 Dec, 2022 3 commits
    • mm/ksm: convert break_ksm() to use walk_page_range_vma() · d7c0e68d
      David Hildenbrand authored
      FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually,
      there is not even the need to wait for the migration to finish, we only
      want to know if we're dealing with a KSM page.
      
      Using follow_page() just to identify a KSM page overcomplicates GUP code. 
      Let's use walk_page_range_vma() instead, because we don't actually care
      about the page itself, we only need to know a single property -- no need
      to even grab a reference.
      
      So, get rid of follow_page() usage such that we can get rid of
      FOLL_MIGRATION now and eventually be able to get rid of follow_page() in
      the future.
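
      A simplified sketch of the new shape (not the exact code from the patch;
      the real callback also handles swap/migration entries):

          static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr,
                                         unsigned long next, struct mm_walk *walk)
          {
                  bool *ksm_page = walk->private;
                  spinlock_t *ptl;
                  pte_t *pte;

                  pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
                  if (pte_present(*pte))
                          *ksm_page = PageKsm(pfn_to_page(pte_pfn(*pte)));
                  pte_unmap_unlock(pte, ptl);
                  return 0;
          }

          static const struct mm_walk_ops break_ksm_ops = {
                  .pmd_entry = break_ksm_pmd_entry,
          };

          /* in break_ksm(), sketched: no page reference taken at all */
          bool ksm_page = false;

          walk_page_range_vma(vma, addr, addr + 1, &break_ksm_ops, &ksm_page);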
      
      In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
      performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
      a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s).  I
      don't think we particularly care for now.
      
      Interestingly, the benchmark reduction is due to the single callback. 
      Adding a second callback (e.g., pud_entry()) reduces the benchmark by
      another 100-200 MiB/s.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE · 6cce3314
      David Hildenbrand authored
      Let's stop breaking COW via a fake write fault and let's use
      FAULT_FLAG_UNSHARE instead.  This avoids any wrong side effects of the
      fake write fault, such as mapping the PTE writable and marking the pte
      dirty/softdirty.
      
      Consequently, we will no longer trigger a fake write fault and break COW
      without any such side-effects.
      
      Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM
      page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault()
      will fail with VM_FAULT_SIGBUS and will simply return in break_ksm() with
      0 instead of actually breaking COW.
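
      The call-site change in break_ksm() is, roughly (a sketch):

          /* was: handle_mm_fault(vma, addr,
           *                      FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, NULL); */
          vm_fault_t ret = handle_mm_fault(vma, addr,
                                           FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
                                           NULL);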
      
      For now, the KSM unmerge tests can trigger that:
          $ sudo ./ksm_functional_tests
          TAP version 13
          1..3
          # [RUN] test_unmerge
          ok 1 Pages were unmerged
          # [RUN] test_unmerge_discarded
          ok 2 Pages were unmerged
          # [RUN] test_unmerge_uffd_wp
          not ok 3 Pages were unmerged
          Bail out! 1 out of 3 tests failed
          # Planned tests != run tests (2 != 3)
          # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0
      
      The warning in dmesg also indicates this wrong handling:
          [  230.096368] FAULT_FLAG_ALLOW_RETRY missing 881
          [  230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...]
          [  230.110124] Hardware name: [...]
          [  230.117775] Call Trace:
          [  230.120227]  <TASK>
          [  230.122334]  dump_stack_lvl+0x44/0x5c
          [  230.126010]  handle_userfault.cold+0x14/0x19
          [  230.130281]  ? tlb_finish_mmu+0x65/0x170
          [  230.134207]  ? uffd_wp_range+0x65/0xa0
          [  230.137959]  ? _raw_spin_unlock+0x15/0x30
          [  230.141972]  ? do_wp_page+0x50/0x590
          [  230.145551]  __handle_mm_fault+0x9f5/0xf50
          [  230.149652]  ? mmput+0x1f/0x40
          [  230.152712]  handle_mm_fault+0xb9/0x2a0
          [  230.156550]  break_ksm+0x141/0x180
          [  230.159964]  unmerge_ksm_pages+0x60/0x90
          [  230.163890]  ksm_madvise+0x3c/0xb0
          [  230.167295]  do_madvise.part.0+0x10c/0xeb0
          [  230.171396]  ? do_syscall_64+0x67/0x80
          [  230.175157]  __x64_sys_madvise+0x5a/0x70
          [  230.179082]  do_syscall_64+0x58/0x80
          [  230.182661]  ? do_syscall_64+0x67/0x80
          [  230.186413]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This is primarily a fix for KSM+userfaultfd-wp, however, the fake write
      fault was always questionable.  As this fix is not easy to backport and
      it's not very critical, let's not cc stable.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-6-david@redhat.com
      Fixes: 529b930b ("userfaultfd: wp: hook userfault handler to write protection fault")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE · 58f595c6
      David Hildenbrand authored
      Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole
      remaining user of VM_FAULT_WRITE.  As we also want to stop triggering a
      fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to
      GUP-triggered unsharing when taking a R/O pin on a shared anonymous page
      (including KSM pages), let's stop relying on VM_FAULT_WRITE.
      
      Let's rework break_ksm() to not rely on the return value of
      handle_mm_fault() anymore to figure out whether COW-breaking was
      successful.  Simply perform another follow_page() lookup to verify the
      result.
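
      A hedged sketch of the reworked loop (declarations and error handling
      trimmed; this patch still uses FAULT_FLAG_WRITE, which a later patch in
      the series switches to FAULT_FLAG_UNSHARE):

          do {
                  bool ksm_page = false;

                  cond_resched();
                  page = follow_page(vma, addr,
                                     FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
                  if (IS_ERR_OR_NULL(page))
                          break;
                  ksm_page = PageKsm(page);
                  put_page(page);

                  if (!ksm_page)
                          return 0;       /* no KSM page mapped here: COW was broken */
                  ret = handle_mm_fault(vma, addr,
                                        FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, NULL);
          } while (!(ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM)));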
      
      While this makes break_ksm() slightly less efficient, we can simplify
      handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE without
      introducing similar KSM-specific behavior for FAULT_FLAG_UNSHARE.
      
      In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
      performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
      a performance degradation of ~4% -- 5% (old: ~5250 MiB/s, new: ~5010
      MiB/s).
      
      I don't think that we particularly care about that performance drop when
      unmerging.  If it ever turns out to be an actual performance issue, we can
      think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep
      it simple for now.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  12. 30 Nov, 2022 1 commit
    • mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite · 6a56ccbc
      David Hildenbrand authored
      commit b191f9b1 ("mm: numa: preserve PTE write permissions across a
      NUMA hinting fault") added remembering write permissions using ordinary
      pte_write() for PROT_NONE mapped pages to avoid write faults when
      remapping the page !PROT_NONE on NUMA hinting faults.
      
      That commit noted:
      
          The patch looks hacky but the alternatives looked worse. The tidest was
          to rewalk the page tables after a hinting fault but it was more complex
          than this approach and the performance was worse. It's not generally
          safe to just mark the page writable during the fault if it's a write
          fault as it may have been read-only for COW so that approach was
          discarded.
      
      Later, commit 288bc549 ("mm/autonuma: let architecture override how
      the write bit should be stashed in a protnone pte.") introduced a family
      of savedwrite PTE functions that didn't necessarily improve the whole
      situation.
      
      One confusing thing is that nowadays, if a page is pte_protnone()
      and pte_savedwrite(), then pte_write() is also true. Another source of
      confusion is that there is only a single pte_mk_savedwrite() call in the
      kernel. All other write-protection code seems to silently rely on
      pte_wrprotect().
      
      Ever since PageAnonExclusive was introduced and we started using it in
      mprotect context via commit 64fe24a3 ("mm/mprotect: try avoiding write
      faults for exclusive anonymous pages when changing protection"), we do
      have machinery in place to avoid write faults when changing protection,
      which is exactly what we want to do here.
      
      Let's similarly do what ordinary mprotect() does nowadays when upgrading
      write permissions and reuse can_change_pte_writable() and
      can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
      writable.
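
      A simplified fragment of the resulting NUMA-fault restore path (a sketch;
      the real code has additional checks not shown here):

          /* do_numa_page()-style restore, sketched */
          pte = pte_modify(old_pte, vma->vm_page_prot);
          pte = pte_mkyoung(pte);
          /* Upgrade to writable only where plain mprotect() could as well. */
          if (was_writable && can_change_pte_writable(vma, vmf->address, pte))
                  pte = pte_mkwrite(pte);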
      
      For anonymous pages there should be absolutely no change: if an
      anonymous page is not exclusive, it could not have been mapped writable --
      because only exclusive anonymous pages can be mapped writable.
      
      However, there *might* be a change for writable shared mappings that
      require writenotify: if they are not dirty, we cannot map them writable.
      While it might not matter in practice, we'd need a different way to
      identify whether writenotify is actually required -- and ordinary mprotect
      would benefit from that as well.
      
      Note that we don't optimize for the actual migration case:
      (1) When migration succeeds the new PTE will not be writable because the
          source PTE was not writable (protnone); in the future we
          might just optimize that case similarly by reusing
          can_change_pte_writable()/can_change_pmd_writable() when removing
          migration PTEs.
      (2) When migration fails, we'd have to recalculate the "writable" flag
          because we temporarily dropped the PT lock; for now keep it simple and
          set "writable=false".
      
      We'll remove all savedwrite leftovers next.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  13. 09 Nov, 2022 1 commit
  14. 03 Oct, 2022 5 commits
  15. 27 Sep, 2022 3 commits
  16. 12 Sep, 2022 2 commits
    • mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      David Hildenbrand authored
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, that the PTE is
      cleared/invalidated and that the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine, however, we have to enforce a certain memory order and
      are missing memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
      convert an exclusive anonymous page to a KSM page, and that page is already
      mapped write-protected (-> no PTE change) would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage · 50722804
      Zach O'Keefe authored
      When scanning an anon pmd to see if it's eligible for collapse, return
      SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
      SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
      file-collapse path, since the latter might identify pte-mapped compound
      pages.  This is required by MADV_COLLAPSE which necessarily needs to know
      what hugepage-aligned/sized regions are already pmd-mapped.
      
      In order to determine if a pmd already maps a hugepage, refactor
      mm_find_pmd():
      
      Return mm_find_pmd() to its pre-commit f72e7dcd ("mm: let mm_find_pmd
      fix buggy race with THP fault") behavior.  ksm was the only caller that
      explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
      there (pmd_present() and pmd_trans_huge() checks).
      
      Undo revert change in commit f72e7dcd ("mm: let mm_find_pmd fix buggy
      race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
      and use mm_find_pmd() instead.
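
      The open-coded check in mm/ksm.c then looks roughly like this (as it
      reads in today's tree; a sketch, with context trimmed):

          pmd = mm_find_pmd(mm, addr);
          if (!pmd)
                  goto out;
          /*
           * mm_find_pmd() no longer guarantees a pte-mapping pmd, so check
           * for present and !THP explicitly before looking up the pte.
           */
          pmde = pmdp_get_lockless(pmd);
          if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                  goto out;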
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  17. 02 Aug, 2022 1 commit
  18. 18 Jul, 2022 1 commit
    • mm: handling Non-LRU pages returned by vm_normal_pages · 3218f871
      Alex Sierra authored
      With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
      device-managed anonymous pages that are not LRU pages.  Although they
      behave like normal pages for purposes of mapping in CPU page tables and
      for COW, they do not support LRU lists, NUMA migration or THP.
      
      Callers to follow_page() currently don't expect ZONE_DEVICE pages,
      however, with DEVICE_COHERENT we might now return ZONE_DEVICE.  Check for
      ZONE_DEVICE pages in applicable users of follow_page() as well.
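
      The added guards look roughly like this in the follow_page() callers
      (a sketched fragment of a scan loop):

          page = follow_page(vma, addr, FOLL_GET);
          if (IS_ERR_OR_NULL(page))
                  break;
          if (is_zone_device_page(page)) {
                  /* device-coherent pages are not on the LRU: skip them */
                  put_page(page);
                  continue;
          }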
      
      Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
      Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>	[v2]
      Reviewed-by: Alistair Popple <apopple@nvidia.com>	[v6]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  19. 27 Jun, 2022 1 commit
  20. 25 May, 2022 1 commit
  21. 19 May, 2022 1 commit
    • mm: don't be stuck to rmap lock on reclaim path · 6d4675e6
      Minchan Kim authored
      The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can be contended
      under memory pressure if processes keep working on their vmas (e.g., fork,
      mmap, munmap), which leaves the reclaim path stuck.  In our real workload
      traces, we see kswapd waiting for the lock for 300ms+ (worst case, a
      second), pushing other processes into direct reclaim, where they also get
      stuck on the lock.
      
      This patch makes the LRU aging path use try_lock mode, like
      shrink_page_list, so the reclaim context keeps working on the next LRU
      pages without getting stuck.  If it finds the rmap lock contended, it
      rotates the page back to the head of the LRU in both the active and
      inactive lists for consistent behavior, which is a basic starting point
      rather than adding more heuristics.
      
      Since this patch introduces a new "contended" field as an out-param along
      with the try_lock in-param in rmap_walk_control, the structure is no
      longer immutable when try_lock is set, so remove the const keywords on
      rmap-related functions.  Since rmap walking is already an expensive
      operation, I doubt the const gave a sizable benefit (and we didn't have it
      until 5.17).
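
      Sketch of how a reclaim-side rmap walk uses the new fields (simplified;
      folio_referenced() is one real user):

          struct rmap_walk_control rwc = {
                  .try_lock = true,
                  /* .rmap_one and the other callbacks as before */
          };

          rmap_walk(folio, &rwc);
          if (rwc.contended) {
                  /*
                   * The rmap lock was contended: rotate the folio back onto
                   * the LRU and keep reclaiming other folios instead of
                   * sleeping on the lock.
                   */
          }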
      
      In a heavy app workload on Android, traces show the following statistics;
      the patch almost removes rmap lock contention from the reclaim path.
      
      Martin Liu reported:
      
      Before:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
               1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
                601            0             601   145.544681        28817    198  rmap_walk_file
      
      After:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
                NaN          NaN              NaN          NaN          NaN    0.0             NaN
                  0            0                0     0.127645            1     12  rmap_walk_file
      
      [minchan@kernel.org: add comment, per Matthew]
        Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
      Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Martin Liu <liumartin@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  22. 10 May, 2022 2 commits
    • mm: remember exclusively mapped anonymous pages with PG_anon_exclusive · 6c287605
      David Hildenbrand authored
      Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
      exclusive, and use that information to make GUP pins reliable and stay
      consistent with the page mapped into the page table even if the page table
      entry gets write-protected.
      
      With that information at hand, we can extend our COW logic to always reuse
      anonymous pages that are exclusive.  For anonymous pages that might be
      shared, the existing logic applies.
      
      As already documented, PG_anon_exclusive is usually only expressive in
      combination with a page table entry.  Especially PTE vs.  PMD-mapped
      anonymous pages require more thought, some examples: due to mremap() we
      can easily have a single compound page PTE-mapped into multiple page
      tables exclusively in a single process -- multiple page table locks apply.
      Further, due to MADV_WIPEONFORK we might not necessarily write-protect
      all PTEs, and only some subpages might be pinned.  Long story short: once
      PTE-mapped, we have to track information about exclusivity per sub-page,
      but until then, we can just track it for the compound page in the head
      page and not having to update a whole bunch of subpages all of the time
      for a simple PMD mapping of a THP.
      
      For simplicity, this commit mostly talks about "anonymous pages", while
      it's for THP actually "the part of an anonymous folio referenced via a
      page table entry".
      
      To not spill PG_anon_exclusive code all over the mm code-base, we let the
      anon rmap code to handle all PG_anon_exclusive logic it can easily handle.
      
      If a writable, present page table entry points at an anonymous (sub)page,
      that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliable
      pin (FOLL_PIN) on an anonymous page referenced via a present page table
      entry, it must only pin if PG_anon_exclusive is set for the mapped
      (sub)page.
      
      This commit doesn't adjust GUP, so this is only implicitly handled for
      FOLL_WRITE, follow-up commits will teach GUP to also respect it for
      FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
      reliable.
      
      Whenever an anonymous page is to be shared (fork(), KSM), or when
      temporarily unmapping an anonymous page (swap, migration), the relevant
      PG_anon_exclusive bit has to be cleared to mark the anonymous page
      possibly shared.  Clearing will fail if there are GUP pins on the page:
      
      * For fork(), this means having to copy the page and not being able to
        share it.  fork() protects against concurrent GUP using the PT lock and
        the src_mm->write_protect_seq.
      
      * For KSM, this means sharing will fail.  For swap, this means unmapping
        will fail.  For migration, this means migration will fail early.  All
        three cases protect against concurrent GUP using the PT lock and a
        proper clear/invalidate+flush of the relevant page table entry.
      
      This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
      pinned page gets mapped R/O and the successive write fault ends up
      replacing the page instead of reusing it.  It improves the situation for
      O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
      fork() is *not* involved, however swapout and fork() are still
      problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
      users will fix the issue for them.
      
      I. Details about basic handling
      
      I.1. Fresh anonymous pages
      
      page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
      given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
      the mechanism fresh anonymous pages come into life (besides migration code
      where we copy the page->mapping), all fresh anonymous pages will start out
      as exclusive.
      
      I.2. COW reuse handling of anonymous pages
      
      When a COW handler stumbles over a (sub)page that's marked exclusive, it
      simply reuses it.  Otherwise, the handler tries harder under page lock to
      detect if the (sub)page is exclusive and can be reused.  If exclusive,
      page_move_anon_rmap() will mark the given (sub)page exclusive.
      
      Note that hugetlb code does not yet check for PageAnonExclusive(), as it
      still uses the old COW logic that is prone to the COW security issue
      because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
      pages are a scarce resource.
      
      I.3. Migration handling
      
      try_to_migrate() has to try marking an exclusive anonymous page shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  migrate_vma_collect_pmd() and
      __split_huge_pmd_locked() are handled similarly.
      
      Writable migration entries implicitly point at shared anonymous pages. 
      For readable migration entries that information is stored via a new
      "readable-exclusive" migration entry, specific to anonymous pages.
      
      When restoring a migration entry in remove_migration_pte(), information
      about exclusivity is detected via the migration entry type, and
      RMAP_EXCLUSIVE is set accordingly for
      page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
      
      I.4. Swapout handling
      
      try_to_unmap() has to try marking the mapped page possibly shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  For now, information about exclusivity is lost.  In
      the future, we might want to remember that information in the swap entry
      in some cases, however, it requires more thought, care, and a way to store
      that information in swap entries.
      
      I.5. Swapin handling
      
      do_swap_page() will never stumble over exclusive anonymous pages in the
      swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
      to detect manually if an anonymous page is exclusive and has to set
      RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
      
      I.6. THP handling
      
      __split_huge_pmd_locked() has to move the information about exclusivity
      from the PMD to the PTEs.
      
      a) In case we have a readable-exclusive PMD migration entry, simply
         insert readable-exclusive PTE migration entries.
      
      b) In case we have a present PMD entry and we don't want to freeze
         ("convert to migration entries"), simply forward PG_anon_exclusive to
         all sub-pages, no need to temporarily clear the bit.
      
      c) In case we have a present PMD entry and want to freeze, handle it
         similar to try_to_migrate(): try marking the page shared first.  In
         case we fail, we ignore the "freeze" instruction and simply split
         ordinarily.  try_to_migrate() will properly fail because the THP is
         still mapped via PTEs.
      
      When splitting a compound anonymous folio (THP), the information about
      exclusivity is implicitly handled via the migration entries: no need to
      replicate PG_anon_exclusive manually.
      
      I.7. fork() handling
      
      fork() handling is relatively easy, because PG_anon_exclusive is only
      expressive for some page table entry types.
      
      a) Present anonymous pages
      
      page_try_dup_anon_rmap() will mark the given subpage shared -- which will
      fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
      PMD to handle it on the PTE level).
      
      Note that device exclusive entries are just a pointer at a PageAnon()
      page.  fork() will first convert a device exclusive entry to a present
      page table and handle it just like present anonymous pages.
      
      b) Device private entry
      
      Device private entries point at PageAnon() pages that cannot be mapped
      directly and, therefore, cannot get pinned.
      
      page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
      fail because they cannot get pinned.
      
      c) HW poison entries
      
      PG_anon_exclusive will remain untouched and is stale -- the page table
      entry is just a placeholder after all.
      
      d) Migration entries
      
      Writable and readable-exclusive entries are converted to readable entries:
      possibly shared.
      
      I.8. mprotect() handling
      
      mprotect() only has to properly handle the new readable-exclusive
      migration entry:
      
      When write-protecting a migration entry that points at an anonymous page,
      remember the information about exclusivity via the "readable-exclusive"
      migration entry type.
      
      II. Migration and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a migration entry, we have to mark the page possibly
      shared and synchronize against GUP-fast by a proper clear/invalidate+flush
      to make the following scenario impossible:
      
      1. try_to_migrate() places a migration entry after checking for GUP pins
         and marks the page possibly shared.
      
      2. GUP-fast pins the page due to lack of synchronization
      
      3. fork() converts the "writable/readable-exclusive" migration entry into a
         readable migration entry
      
      4. Migration fails due to the GUP pin (failing to freeze the refcount)
      
      5. Migration entries are restored. PG_anon_exclusive is lost
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      Note that we move information about exclusivity from the page to the
      migration entry as it otherwise highly overcomplicates fork() and
      PTE-mapping a THP.
      
      III. Swapout and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a swap entry, we have to mark the page possibly shared
      and synchronize against GUP-fast by a proper clear/invalidate+flush to
      make the following scenario impossible:
      
      1. try_to_unmap() places a swap entry after checking for GUP pins and
         clears exclusivity information on the page.
      
      2. GUP-fast pins the page due to lack of synchronization.
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      If we'd ever store information about exclusivity in the swap entry,
      similar to migration handling, the same considerations as in II would
      apply.  This is future work.
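
      A simplified fragment of the central pattern on the unmap/migration side
      (sketched from the try_to_migrate_one() shape described above; context
      and declarations trimmed):

          bool anon_exclusive = folio_test_anon(folio) &&
                                PageAnonExclusive(subpage);

          /* The PTE was already cleared and flushed at this point. */
          if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
                  /* GUP pins exist: restore the PTE and fail the migration. */
                  set_pte_at(mm, address, pvmw.pte, pteval);
                  ret = false;
                  page_vma_mapped_walk_done(&pvmw);
                  break;
          }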
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: remove do_page_add_anon_rmap() · f1e2db12
      David Hildenbrand authored
      ... and instead convert page_add_anon_rmap() to accept flags.
      
      Passing flags instead of bools is usually nicer either way, and we want to
      more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
      that an anonymous page is exclusive: for example, when restoring an
      anonymous page from a writable migration entry.
      
      This is a preparation for marking an anonymous page inside
      page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.
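
      Illustrative before/after at a call site such as remove_migration_pte()
      (a sketch; RMAP_NONE/RMAP_EXCLUSIVE are assumed to be the rmap_t flag
      values this series introduces):

          /* before: the last argument was a bool ("compound") */
          page_add_anon_rmap(new, vma, pvmw.address, false);

          /* after: rmap_t flags; RMAP_EXCLUSIVE gets ORed in by later patches */
          page_add_anon_rmap(new, vma, pvmw.address, RMAP_NONE);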
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  23. 29 Apr, 2022 1 commit
    • ksm: count ksm merging pages for each process · 76093853
      xu xin authored
      Some applications or containers want to use KSM by calling madvise() to
      advise areas of address space to be MERGEABLE.  But they may not know
      which applications are more likely to cause real merges in the
      deployment.  If this patch is applied, it helps them know their
      corresponding number of merged pages, and then optimize their app code.
      
      Current KSM only counts the number of KSM merging pages (e.g.
      ksm_pages_sharing and ksm_pages_shared) for the whole system, so we cannot
      see more fine-grained KSM merging; for upper-level application
      optimization, the merging areas cannot be set easily according to the KSM
      page-merging probability of each process.  Therefore, it is necessary to
      add extra statistics so that upper-level users can know the detailed KSM
      merging information of each process.
      
      We add a new proc file named ksm_merging_pages under /proc/<pid>/ to
      report the number of KSM-merged pages of this process.
      
      [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s]
      Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn
      Signed-off-by: xu xin <xu.xin16@zte.com.cn>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Ohhoon Kwon <ohoono.kwon@samsung.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Cc: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  24. 22 Mar, 2022 2 commits
  25. 21 Mar, 2022 1 commit