1. 06 Oct, 2023 8 commits
  2. 04 Oct, 2023 32 commits
    • mm: mlock: update mlock_pte_range to handle large folio · dc68badc
      Yin Fengwei authored
      The current kernel only mlocks base-size folios during the mlock syscall.
      Add large folio support with the following rules:
        - Only mlock a large folio when it is in the VM_LOCKED VMA range
          and fully mapped in the page table.

          A fully mapped folio is required because, if the folio is not fully
          mapped to a VM_LOCKED VMA and the system is under memory pressure,
          page reclaim is allowed to pick up this folio, split it,
          and reclaim the pages which are not in the VM_LOCKED VMA.

        - munlock applies to a large folio which is within the VMA range
          or crosses the VMA boundary.

          This is required to handle the case where the large folio is
          mlocked and the VMA is later split in the middle of the large folio.
      
      Link: https://lkml.kernel.org/r/20230918073318.1181104-4-fengwei.yin@intel.com
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle large folio when large folio in VM_LOCKED VMA range · 1acbc3f9
      Yin Fengwei authored
      If a large folio is in the range of a VM_LOCKED VMA, it should be mlocked
      to avoid being picked by page reclaim, which may split the large folio and
      then mlock each page again.

      Mlock this kind of large folio to prevent it from being picked by page
      reclaim.

      For a large folio which crosses the boundary of a VM_LOCKED VMA or is not
      fully mapped to a VM_LOCKED VMA, we had better not mlock it.  Then, if the
      system is under memory pressure, this kind of large folio will be split
      and the pages out of the VM_LOCKED VMA can be reclaimed.

      Ideally, for a large folio, we should mlock it when the large folio is
      fully mapped to the VMA and munlock it if any page is unmapped from the
      VMA.  But it's not easy to detect whether the large folio is fully mapped
      to the VMA in some cases (like add/remove rmap).  So we update
      mlock_vma_folio() and munlock_vma_folio() to mlock/munlock the folio
      according to vma->vm_flags, and let the caller decide whether to call
      these two functions.

      For add rmap, only mlock normal 4K folios and postpone large folio
      handling to the page reclaim phase.  It is possible to reuse the page
      table iterator to detect whether the folio is fully mapped during the
      page reclaim phase.  For remove rmap, invoke munlock_vma_folio() to
      munlock the folio unconditionally, because removing the rmap leaves the
      folio not fully mapped to the VMA.
      
      Link: https://lkml.kernel.org/r/20230918073318.1181104-3-fengwei.yin@intel.com
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add functions folio_in_range() and folio_within_vma() · 28e56657
      Yin Fengwei authored
      Patch series "support large folio for mlock", v3.
      
      Yu mentioned at [1] that mlock() can't be applied to large folios.

      I studied the related code and here is my understanding:

      - For RLIMIT_MEMLOCK, there is no problem, because the
        RLIMIT_MEMLOCK statistics are not tied to the underlying pages.  That
        means mlocking or munlocking the underlying pages doesn't affect the
        RLIMIT_MEMLOCK accounting, which is always correct.

      - For keeping the page in RAM, there is no problem either.  At least
        during try_to_unmap_one(), once it detects that the VMA has the
        VM_LOCKED bit set in vm_flags, the folio is kept whether or not the
        folio is mlocked.

      So mlock functionally works for large folios.  But it's not optimized,
      because page reclaim needs to scan these large folios and may split them.
      
      This series classifies large folios for mlock into four types:
        - The large folio is in the VM_LOCKED range and fully mapped to the
          range

        - The large folio is in the VM_LOCKED range but not fully mapped to
          the range

        - The large folio crosses the VM_LOCKED VMA boundary

        - The large folio crosses a last-level page table boundary

      For the first type, we mlock the large folio so page reclaim will skip it.

      For the second/third type, we don't mlock the large folio.  As the pages
      not mapped to the VM_LOCKED range are mapped to a non-VM_LOCKED range, if
      the system is under memory pressure, the large folio can be picked by page
      reclaim and split.  Then the pages not mapped to the VM_LOCKED range can
      be reclaimed.
      
      For the fourth type, we don't mlock large folio because locking one page
      table lock can't prevent the part in another last level page table being
      unmapped.  Thanks to Ryan for pointing this out.
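
      The four types above can be sketched as a small user-space model.  This
      is only an illustration, not the kernel code: the struct, the enum, the
      function names, the order of the checks, and the 2 MiB last-level page
      table span (4 KiB pages, 512 entries) are all assumptions made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, 512 PTEs per last-level page table (2 MiB span). */
#define SKETCH_PT_SPAN (512ULL * 4096ULL)

/* Hypothetical description of one large folio mapping; not a kernel type. */
struct sketch_mapping {
	uint64_t addr;      /* user address where the folio's first page maps */
	uint64_t size;      /* folio size in bytes */
	bool fully_mapped;  /* every page of the folio is mapped in the VMA */
};

enum sketch_type {
	SKETCH_FULLY_MAPPED,       /* type 1: mlock it */
	SKETCH_PARTLY_MAPPED,      /* type 2: do not mlock */
	SKETCH_CROSSES_VMA,        /* type 3: do not mlock */
	SKETCH_CROSSES_PAGE_TABLE  /* type 4: do not mlock */
};

/* Classify a mapping against the VM_LOCKED range [vma_start, vma_end). */
static inline enum sketch_type
sketch_classify(const struct sketch_mapping *m,
		uint64_t vma_start, uint64_t vma_end)
{
	uint64_t last;

	assert(m->size > 0 && vma_start < vma_end);
	last = m->addr + m->size - 1;

	if (m->addr < vma_start || last >= vma_end)
		return SKETCH_CROSSES_VMA;
	/* First and last byte must be covered by the same page table. */
	if (m->addr / SKETCH_PT_SPAN != last / SKETCH_PT_SPAN)
		return SKETCH_CROSSES_PAGE_TABLE;
	if (!m->fully_mapped)
		return SKETCH_PARTLY_MAPPED;
	return SKETCH_FULLY_MAPPED;
}

/* Only the first type is mlocked. */
static inline bool
sketch_should_mlock(const struct sketch_mapping *m,
		    uint64_t vma_start, uint64_t vma_end)
{
	return sketch_classify(m, vma_start, vma_end) == SKETCH_FULLY_MAPPED;
}
```

      In this model a 64 KiB folio mapped at the start of a 2 MiB locked range
      is mlocked only while fully_mapped is true; every other case is left for
      page reclaim to split.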
      
      
      To check whether the folio is fully mapped to the range, the PTEs need to
      be checked to see whether each page of the folio is associated.  This
      requires taking the page table lock and is a heavy operation.  So far, the
      only places that need this check are madvise and page reclaim.  These
      functions already have their own PTE iterator.

      patch1 introduces an API to check whether a large folio is in the VMA range.
      patch2 makes page reclaim/mlock_vma_folio/munlock_vma_folio support
             large folio mlock/munlock.
      patch3 makes the mlock/munlock syscalls support large folios.

      Yu also mentioned a race which can make a folio unevictable after munlock
      during the RFC v2 discussion [3].
      We decided that the race issue doesn't block this series because:
        - The race issue was not introduced by this series

        - We have a looks-ok fix for the race issue.  It needs to wait
          for the mlock_count fixing patch, as Yosry Ahmed suggested [4]
      
      [1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/
      [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
      [3] https://lore.kernel.org/linux-mm/CAOUHufZ6=9P_=CAOQyw0xw-3q707q-1FVV09dBNDC-hpcpj2Pg@mail.gmail.com/
      
      
      This patch (of 3):
      
      folio_in_range() will be used to check whether the folio is mapped to
      specific VMA and whether the mapping address of folio is in the range.
      
      Also a helper function folio_within_vma() to check whether folio
      is in the range of vma based on folio_in_range().
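
      The kind of check these helpers perform can be sketched in user space as
      follows.  This is a simplified model under stated assumptions, not the
      code added by the patch: the sketch_ structs stand in for struct
      vm_area_struct and struct folio, pages are assumed to be 4 KiB, and the
      function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct sketch_vma {
	uint64_t vm_start; /* first byte of the VMA */
	uint64_t vm_end;   /* one past the last byte */
	uint64_t vm_pgoff; /* page offset mapped at vm_start */
};

/* Hypothetical, simplified stand-in for struct folio. */
struct sketch_folio {
	uint64_t pgoff;    /* page offset of the folio's first page */
	uint64_t nr_pages; /* number of base pages in the folio */
};

/* True if the whole folio maps inside [start, end) clamped to the VMA. */
static inline bool
sketch_folio_in_range(const struct sketch_folio *folio,
		      const struct sketch_vma *vma,
		      uint64_t start, uint64_t end)
{
	uint64_t vma_pages = (vma->vm_end - vma->vm_start) >> SKETCH_PAGE_SHIFT;
	uint64_t size = folio->nr_pages << SKETCH_PAGE_SHIFT;
	uint64_t addr;

	assert(vma->vm_start < vma->vm_end);

	/* Clamp the requested range to the VMA. */
	if (start < vma->vm_start)
		start = vma->vm_start;
	if (end > vma->vm_end)
		end = vma->vm_end;
	if (start > end)
		return false;

	/* The folio's first page must fall inside the VMA's offset window. */
	if (folio->pgoff < vma->vm_pgoff ||
	    folio->pgoff - vma->vm_pgoff >= vma_pages)
		return false;

	/* Address at which the folio's first page is mapped in this VMA. */
	addr = vma->vm_start +
	       ((folio->pgoff - vma->vm_pgoff) << SKETCH_PAGE_SHIFT);

	return addr >= start && addr <= end && end - addr >= size;
}

/* The VMA-wide variant simply passes the VMA's own bounds. */
static inline bool
sketch_folio_within_vma(const struct sketch_folio *folio,
			const struct sketch_vma *vma)
{
	return sketch_folio_in_range(folio, vma, vma->vm_start, vma->vm_end);
}
```

      Note that this only compares offsets and addresses; as the cover letter
      explains, deciding whether the folio is fully mapped still requires
      walking the PTEs.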
      
      Link: https://lkml.kernel.org/r/20230918073318.1181104-1-fengwei.yin@intel.com
      Link: https://lkml.kernel.org/r/20230918073318.1181104-2-fengwei.yin@intel.com
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core-test: fix memory leak in damon_new_ctx() · a0ce7925
      Jinjie Ruan authored
      With CONFIG_DAMON_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
      CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the memory leak below is detected.

      The damon_ctx allocated by kzalloc() in damon_new_ctx() is not freed in
      damon_test_ops_registration() and damon_test_set_attrs().  So use
      damon_destroy_ctx() to free it.  After applying this patch, the following
      memory leak is no longer detected:
      
          unreferenced object 0xffff2b49c6968800 (size 512):
            comm "kunit_try_catch", pid 350, jiffies 4294895294 (age 557.028s)
            hex dump (first 32 bytes):
              88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00  ................
              00 87 93 03 00 00 00 00 0a 00 00 00 00 00 00 00  ................
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290
              [<00000000b5f89cef>] kmalloc_trace+0x40/0x164
              [<00000000eb19e83f>] damon_new_ctx+0x28/0xb4
              [<00000000daf6227b>] damon_test_ops_registration+0x34/0x328
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
          unreferenced object 0xffff2b49c1a9cc00 (size 512):
            comm "kunit_try_catch", pid 356, jiffies 4294895306 (age 557.000s)
            hex dump (first 32 bytes):
              88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00  ................
              00 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00  ................
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290
              [<00000000b5f89cef>] kmalloc_trace+0x40/0x164
              [<00000000eb19e83f>] damon_new_ctx+0x28/0xb4
              [<00000000058495c4>] damon_test_set_attrs+0x30/0x1a8
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
      
      Link: https://lkml.kernel.org/r/20230918120951.2230468-3-ruanjinjie@huawei.com
      Fixes: d1836a3b ("mm/damon/core-test: initialise context before test in damon_test_set_attrs()")
      Fixes: 4f540f5a ("mm/damon/core-test: add a kunit test case for ops registration")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendan.higgins@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core-test: fix memory leak in damon_new_region() · f950fa6e
      Jinjie Ruan authored
      Patch series "mm/damon/core-test: Fix memory leaks in core-test", v3.
      
      There are a few memory leaks in core-test which are detected by kmemleak. 
      This patchset fixes the issues.
      
      
      This patch (of 2):
      
      With CONFIG_DAMON_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
      CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the memory leak below is detected.

      The damon_region allocated by kmem_cache_alloc() in damon_new_region()
      is not freed in damon_test_regions() and
      damon_test_update_monitoring_result().

      So for damon_test_regions(), replace the damon_del_region() call with
      damon_destroy_region(), which calls both damon_del_region() and
      damon_free_region(); the latter frees the damon_region.  For
      damon_test_update_monitoring_result(), call damon_free_region() to
      free it.  After applying this patch, the following memory leak is no
      longer detected.
      
          unreferenced object 0xffff2b49c3edc000 (size 56):
            comm "kunit_try_catch", pid 338, jiffies 4294895280 (age 557.084s)
            hex dump (first 32 bytes):
              01 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00  ................
              00 00 00 00 00 00 00 00 00 00 00 00 49 2b ff ff  ............I+..
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<00000000b528f67c>] kmem_cache_alloc+0x168/0x284
              [<000000008603f022>] damon_new_region+0x28/0x54
              [<00000000a3b8c64e>] damon_test_regions+0x38/0x270
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
          unreferenced object 0xffff2b49c5b20000 (size 56):
            comm "kunit_try_catch", pid 354, jiffies 4294895304 (age 556.988s)
            hex dump (first 32 bytes):
              03 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00  ................
              00 00 00 00 00 00 00 00 96 00 00 00 49 2b ff ff  ............I+..
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<00000000b528f67c>] kmem_cache_alloc+0x168/0x284
              [<000000008603f022>] damon_new_region+0x28/0x54
              [<00000000ca019f80>] damon_test_update_monitoring_result+0x18/0x34
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
      
      Link: https://lkml.kernel.org/r/20230918120951.2230468-1-ruanjinjie@huawei.com
      Link: https://lkml.kernel.org/r/20230918120951.2230468-2-ruanjinjie@huawei.com
      Fixes: 17ccae8b ("mm/damon: add kunit tests")
      Fixes: f4c978b6 ("mm/damon/core-test: add a test for damon_update_monitoring_results()")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendan.higgins@linux.dev>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/writeback: update filemap_dirty_folio() comment · ab428b4c
      Jianguo Bao authored
      Change to use new address space operation dirty_folio().
      
      Link: https://lkml.kernel.org/r/20230917-trycontrib1-v1-1-db22630b8839@gmail.com
      Fixes: 6f31a5a2 ("fs: Add aops->dirty_folio")
      Signed-off-by: Jianguo Bau <roidinev@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/ABI/damon: update for DAMOS apply intervals · d57d36b5
      SeongJae Park authored
      Update DAMON ABI document for the newly added DAMON sysfs file for DAMOS
      apply intervals (apply_interval_us file).
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/admin-guide/mm/damon/usage: update for DAMOS apply intervals · 033343d5
      SeongJae Park authored
      Update DAMON usage document's DAMON sysfs interface section for the newly
      added DAMOS apply intervals support (apply_interval_us file).
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon/sysfs: test DAMOS apply intervals · 65ded14e
      SeongJae Park authored
      Update DAMON selftests to test existence of the file for reading/writing
      DAMOS apply interval under each scheme directory.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: support DAMOS apply interval · a2a9f68e
      SeongJae Park authored
      Update DAMON sysfs interface to support DAMOS apply intervals by adding a
      new file, 'apply_interval_us' in each scheme directory.  Users can set and
      get the interval for each scheme in microseconds by writing to and reading
      from the file.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/mm/damon/design: document DAMOS apply intervals · 3f8723f1
      SeongJae Park authored
      Update DAMON design doc to explain about DAMOS apply intervals.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: implement scheme-specific apply interval · 42f994b7
      SeongJae Park authored
      DAMON-based operation schemes are applied every aggregation interval.
      That was mainly because schemes were using nr_accesses, which is complete
      and ready to use only at each aggregation interval.  However, the schemes
      now use nr_accesses_bp, which is updated every sampling interval in a way
      that is reasonable to use.  Therefore, there is no reason to apply
      schemes only at each aggregation interval.

      The unnecessary alignment with the aggregation interval was also making
      some use cases of DAMOS tricky.  Setting quotas under a long aggregation
      interval is one such example.  Suppose the aggregation interval is ten
      seconds, and there is a scheme with a CPU quota of 100ms per 1s.  The
      scheme will actually use 100ms per ten seconds, since it cannot be
      applied before the next aggregation interval.  The feature is working as
      intended, but the results might not be that intuitive for some users.
      This could be fixed by updating the quota to 1s per 10s.  But in that
      case, the CPU usage of DAMOS could look like spikes, and would actually
      have a bad effect on other CPU-sensitive workloads.
      
      Implement a dedicated timing interval for each DAMON-based operation
      scheme, namely apply_interval.  The interval will be sampling interval
      aligned, and each scheme will be applied for its apply_interval.  The
      interval is set to 0 by default, and it means the scheme should use the
      aggregation interval instead.  This avoids old users getting any
      behavioral difference.
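
      The zero-means-aggregation-interval rule and the quota example above can
      be sketched as below.  This is a rough user-space model, not the DAMON
      implementation: the function names are hypothetical, and the CPU-time
      estimate assumes a scheme spends at most one quota each time it runs and
      runs no more often than once per quota reset interval.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: an apply interval of 0 falls back to the
 * aggregation interval, as the commit message describes.
 */
static inline uint64_t
sketch_effective_apply_interval_us(uint64_t apply_interval_us,
				   uint64_t aggr_interval_us)
{
	return apply_interval_us ? apply_interval_us : aggr_interval_us;
}

/*
 * Hypothetical estimate of the CPU time (in microseconds) a scheme can
 * spend during window_us when it may use quota_us per reset_us and is
 * applied once every apply_us.
 */
static inline uint64_t
sketch_max_cpu_us(uint64_t window_us, uint64_t apply_us,
		  uint64_t reset_us, uint64_t quota_us)
{
	/* The scheme is charged at most once per max(apply, reset) period. */
	uint64_t period = apply_us > reset_us ? apply_us : reset_us;

	assert(period > 0);
	return (window_us / period) * quota_us;
}
```

      With a ten-second aggregation interval, a 100ms-per-1s quota and the
      default apply interval of 0, the model gives 100ms of CPU time per ten
      seconds; setting the apply interval to one second gives 1s per ten
      seconds, spread evenly instead of arriving as a spike.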
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: use nr_accesses_bp as a source of damos_before_apply tracepoint · a72217ad
      SeongJae Park authored
      damos_before_apply tracepoint is exposing access rate of DAMON regions
      using nr_accesses field of regions, which was actually used by DAMOS in
      the past.  However, it has changed to use nr_accesses_bp instead.  Update
      the tracepoint to expose the value that DAMOS is really using.
      
      Note that it doesn't expose the value as is in the basis point, but after
      converting it to the natural number by dividing it by 10,000.  Therefore
      this change doesn't make user-visible behavioral differences.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: use nr_accesses_bp as the source of tried_regions/<N>/nr_accesses · e7639bb4
      SeongJae Park authored
      DAMON sysfs interface exposes access rate of each region via DAMOS tried
      regions directory.  For this, the nr_accesses field of the region is used.
      DAMOS was actually using nr_accesses in the past, but it uses
      nr_accesses_bp now.  Use the value that it is really using as the source.
      
      Note that this doesn't expose nr_accesses_bp as is (in basis point), but
      after converting it to the natural number by dividing the value by 10,000.
      Hence there is no behavioral change from users' perspective.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: make DAMOS uses nr_accesses_bp instead of nr_accesses · affa87c7
      SeongJae Park authored
      Patch series "mm/damon: implement DAMOS apply intervals".
      
      DAMON-based operation schemes are applied every aggregation interval.
      That is mainly because schemes are using nr_accesses, which is complete
      and ready to use only at each aggregation interval.

      This makes some DAMOS use cases tricky.  Setting quotas under a long
      aggregation interval is one such example.  Suppose the aggregation
      interval is ten seconds, and there is a scheme with a CPU quota of 100ms
      per 1s.  The scheme will actually use 100ms per ten seconds, since it
      cannot be applied before the next aggregation interval.  The feature is
      working as intended, but the results might not be that intuitive for some
      users.  This could be fixed by updating the quota to 1s per 10s.  But in
      that case, the CPU usage of DAMOS could look like spikes, and actually
      have a bad effect on other CPU-sensitive workloads.

      Also, with such a huge aggregation interval, users may want schemes to be
      applied more frequently.

      DAMON provides nr_accesses_bp, which is updated every sampling interval
      in a way that is reasonable to use.  By using that instead of
      nr_accesses, DAMOS can have its own time interval and mitigate the
      above-mentioned issues.

      This patchset makes DAMOS schemes use nr_accesses_bp instead of
      nr_accesses, and have their own timing intervals.  It also updates the
      DAMOS tried regions sysfs files and the DAMOS before_apply tracepoint to
      use the new data as their source.  Note that the interval is zero by
      default, which is interpreted as using the aggregation interval instead.
      This avoids making user-visible behavioral changes.
      
      
      Patches Sequence
      ----------------

      The first patch (patch 1/9) makes DAMOS use nr_accesses_bp instead of
      nr_accesses, and the following two patches (patches 2/9 and 3/9) update
      the DAMON sysfs interface for DAMOS tried regions and the DAMOS
      before_apply tracepoint to use nr_accesses_bp instead of nr_accesses,
      respectively.

      The following two patches (patches 4/9 and 5/9) implement the
      scheme-specific apply interval for DAMON kernel API users and update the
      design document for the new feature.

      Finally, the following four patches (patches 6/9, 7/9, 8/9 and 9/9) add
      support of the feature in the DAMON sysfs interface, add a simple selftest
      test case, and document the new file in the usage and the ABI documents,
      respectively.
      
      
      This patch (of 9):
      
      DAMON provides nr_accesses_bp, which becomes the same as nr_accesses *
      10000 at every aggregation interval, but is updated every sampling
      interval with reasonable accuracy.  Since DAMON-based operation schemes
      are applied every aggregation interval using nr_accesses, using
      nr_accesses_bp instead makes no difference to users.  Meanwhile, it
      allows DAMOS to apply the schemes at a time interval shorter than the
      aggregation interval.  This could be useful and more flexible in some
      cases.  Do it.
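
      The basis-point relationship described here amounts to a fixed scale
      factor of 10,000.  A minimal sketch, with hypothetical helper names that
      are not part of DAMON, and covering only the conversion rather than how
      DAMON updates the value at each sampling interval:

```c
#include <assert.h>

#define SKETCH_BP_SCALE 10000u /* one access == 10,000 basis points */

/* Hypothetical helper: the value nr_accesses_bp takes at an aggregation boundary. */
static inline unsigned int sketch_nr_accesses_to_bp(unsigned int nr_accesses)
{
	return nr_accesses * SKETCH_BP_SCALE;
}

/* Hypothetical helper: convert back; integer division drops the fraction. */
static inline unsigned int sketch_bp_to_nr_accesses(unsigned int nr_accesses_bp)
{
	return nr_accesses_bp / SKETCH_BP_SCALE;
}
```

      A value taken between aggregation boundaries, such as 75000 basis
      points, converts back to 7 because the fractional part is discarded.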
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20230916020945.47296-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: convert remove_pool_huge_page() to remove_pool_hugetlb_folio() · d5b43e96
      Matthew Wilcox (Oracle) authored
      Convert the callers to expect a folio and remove the unnecessary conversion
      back to a struct page.
      
      Link: https://lkml.kernel.org/r/20230824141325.2704553-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: remove a few calls to page_folio() · 04bbfd84
      Matthew Wilcox (Oracle) authored
      Anything found on a linked list threaded through ->lru is guaranteed to be
      a folio as the compound_head found in a tail page overlaps the ->lru
      member of struct page.  So we can pull folios directly off these lists no
      matter whether pages or folios were added to the list.
      
      Link: https://lkml.kernel.org/r/20230824141325.2704553-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: use a folio in free_hpage_workfn() · 3ec145f9
      Matthew Wilcox (Oracle) authored
      Patch series "Small hugetlb cleanups", v2.
      
      Some trivial folio conversions
      
      
      This patch (of 3):
      
      update_and_free_hugetlb_folio puts the memory on hpage_freelist as a folio
      so we can take it off the list as a folio.
      
      Link: https://lkml.kernel.org/r/20230824141325.2704553-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20230824141325.2704553-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO · fde1c4ec
      Usama Arif authored
      The new boot flow when it comes to initialization of gigantic pages is as
      follows:
      
      - At boot time, for a gigantic page during __alloc_bootmem_hugepage, the
        region after the first struct page is marked as noinit.
      
      - This results in only the first struct page being initialized in
        reserve_bootmem_region.  As the tail struct pages are not initialized at
        this point, there can be a significant saving in boot time if HVO
        succeeds later on.
      
      - Later on in the boot, the head page is prepped and the first
        HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
        are initialized.
      
      - HVO is attempted.  If it is not successful, then the rest of the tail
        struct pages are initialized.  If it is successful, no more tail struct
        pages need to be initialized saving significant boot time.
      
      The WARN_ON for increased ref count in gather_bootmem_prealloc was changed
      to a VM_BUG_ON.  This is OK as there should be no speculative references
      this early in boot process.  The VM_BUG_ON's are there just in case such
      code is introduced.
      
      [akpm@linux-foundation.org: make it nicer for 80 cols]
      Link: https://lkml.kernel.org/r/20230913105401.519709-5-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memblock: introduce MEMBLOCK_RSRV_NOINIT flag · 77e6c43e
      Usama Arif authored
      For reserved memory regions marked with this flag, reserve_bootmem_region
      is not called during memmap_init_reserved_pages.  This can be used to
      avoid struct page initialization for regions which won't need them,
      e.g. hugepages with Hugepage Vmemmap Optimization enabled.
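
      The effect of the flag can be modelled in user space as a loop that
      skips flagged regions.  This is an illustration only: the sketch_ types,
      the flag value and the function are hypothetical, and setting the
      initialized field merely stands in for the work reserve_bootmem_region
      would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; the real flag values live in the memblock headers. */
#define SKETCH_RSRV_NOINIT (1u << 0)

/* Hypothetical, simplified model of one reserved region. */
struct sketch_region {
	unsigned long base;
	unsigned long size;
	unsigned int flags;
	bool initialized; /* set when the region's struct pages are "initialized" */
};

/*
 * Walk the reserved regions and initialize every region that is not
 * marked NOINIT.  Returns the number of regions initialized.
 */
static inline size_t
sketch_init_reserved(struct sketch_region *regions, size_t nr)
{
	size_t done = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (regions[i].flags & SKETCH_RSRV_NOINIT)
			continue; /* the caller initializes these later, if ever */
		regions[i].initialized = true;
		done++;
	}
	return done;
}
```

      Regions left uninitialized here remain the responsibility of whoever
      set the flag, which is what lets the hugetlb code defer or skip the tail
      struct pages of gigantic pages.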
      
      Link: https://lkml.kernel.org/r/20230913105401.519709-4-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memblock: pass memblock_type to memblock_setclr_flag · ee8d2071
      Usama Arif authored
      This allows setting flags on both memblock types and is in preparation for
      setting flags (e.g. to not initialize struct pages) on reserved memory
      regions.
      
      [usama.arif@bytedance.com: add missing argument definition]
        Link: https://lkml.kernel.org/r/20230918090657.220463-1-usama.arif@bytedance.com
      Link: https://lkml.kernel.org/r/20230913105401.519709-3-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ee8d2071
    • Usama Arif's avatar
      mm: hugetlb_vmemmap: use nid of the head page to reallocate it · a9e34ea1
      Usama Arif authored
      Patch series "mm: hugetlb: Skip initialization of gigantic tail struct
      pages if freed by HVO", v5.
      
      This series moves the boot time initialization of tail struct pages of a
      gigantic page to later on in the boot.  Only the
      HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
      are initialized at the start.  If HVO is successful, then no more tail
      struct pages need to be initialized.  For a 1G hugepage, this series avoids
      initialization of 262144 - 63 = 262081 struct pages per hugepage.
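
      The arithmetic above can be reproduced with a short sketch.  The
      constants are assumptions for illustration (4 KiB base pages, a 64-byte
      struct page, and a vmemmap reserve size of one base page), and the
      sketch_* names are hypothetical.

```c
/* Assumed values for illustration: 4 KiB base pages, 64-byte struct page. */
#define SKETCH_PAGE_SIZE		4096UL
#define SKETCH_STRUCT_PAGE_SIZE		64UL
/* Assumed: the vmemmap reserve size is one base page. */
#define SKETCH_VMEMMAP_RESERVE_SIZE	SKETCH_PAGE_SIZE

/* Number of struct pages that describe one hugepage of the given size. */
unsigned long sketch_struct_pages(unsigned long hugepage_size)
{
	return hugepage_size / SKETCH_PAGE_SIZE;
}

/* Tail struct pages initialized up front: RESERVE_SIZE / sizeof - 1. */
unsigned long sketch_early_tail_pages(void)
{
	return SKETCH_VMEMMAP_RESERVE_SIZE / SKETCH_STRUCT_PAGE_SIZE - 1;
}

/* Struct pages whose initialization is avoided, as counted in the text. */
unsigned long sketch_skipped_pages(unsigned long hugepage_size)
{
	return sketch_struct_pages(hugepage_size) - sketch_early_tail_pages();
}
```

      With these assumptions, a 1G hugepage has 262144 struct pages, 63 tail
      struct pages are initialized early, and the difference is 262081.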
      
      When tested on a 512G system (allocating 500 1G hugepages), the kexec-boot
      times with DEFERRED_STRUCT_PAGE_INIT enabled are:
      
      - with patches, HVO enabled: 1.32 seconds
      - with patches, HVO disabled: 2.15 seconds
      - without patches, HVO enabled: 3.90 seconds
      - without patches, HVO disabled: 3.58 seconds
      
      This represents an approximately 70% reduction in boot time and will
      significantly reduce server downtime when using a large number of gigantic
      pages.
      
      
      This patch (of 4):
      
      If tail page prep and initialization is skipped, then the "start" page
      will not contain the correct nid.  Use the nid from the first vmemmap page.
      
      Link: https://lkml.kernel.org/r/20230913105401.519709-1-usama.arif@bytedance.com
      Link: https://lkml.kernel.org/r/20230913105401.519709-2-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a9e34ea1
    • SeongJae Park's avatar
      mm/damon/core: mark damon_moving_sum() as a static function · 863803a7
      SeongJae Park authored
      The function is used by only mm/damon/core.c.  Mark it as a static
      function.
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      863803a7
    • SeongJae Park's avatar
      mm/damon/core: skip updating nr_accesses_bp for each aggregation interval · 401807a3
      SeongJae Park authored
      damon_merge_regions_of(), which is called for each aggregation interval,
      updates nr_accesses_bp to nr_accesses * 10000.  However, nr_accesses_bp is
      already updated for each sampling interval via damon_moving_sum(), using
      the aggregation interval as the moving time window.  By the definition of
      the algorithm, the value becomes the same as the discrete-window based sum
      at each time window-aligned time.  Hence, nr_accesses_bp will be the same
      as nr_accesses * 10000 at each aggregation interval without an explicit
      update.  Remove the unnecessary update of nr_accesses_bp in
      damon_merge_regions_of().
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      401807a3
    • SeongJae Park's avatar
      mm/damon/core: use pseudo-moving sum for nr_accesses_bp · ace30fb2
      SeongJae Park authored
      Let nr_accesses_bp be calculated as a pseudo-moving sum that is updated
      for every sampling interval, using damon_moving_sum().  This is assumed to
      be useful for cases where the aggregation interval is set quite large, but
      the monitoring results need to be collected before the next aggregation
      interval has passed.
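
      A rough user-space sketch of such a per-sample update follows.  The
      struct, its field names and the sketch_* helpers are hypothetical and
      only model the idea; they are not the kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down region; not the kernel's struct damon_region. */
struct sketch_damon_region {
	unsigned int nr_accesses;	/* discrete count, current window */
	unsigned int last_nr_accesses;	/* discrete count, previous window */
	unsigned int nr_accesses_bp;	/* pseudo-moving sum, basis points */
};

/* Drop a constant share of the last window's sum, then add the new value. */
unsigned int sketch_moving_sum(unsigned int mvsum, unsigned int nomvsum,
			       unsigned int len_window, unsigned int new_value)
{
	return mvsum - nomvsum / len_window + new_value;
}

/* Called once per region for every sampling interval. */
void sketch_update_access_rate(struct sketch_damon_region *r, bool accessed,
			       unsigned int len_window)
{
	r->nr_accesses_bp = sketch_moving_sum(r->nr_accesses_bp,
			r->last_nr_accesses * 10000, len_window,
			accessed ? 10000 : 0);
	if (accessed)
		r->nr_accesses++;
}

/* Called once per aggregation interval: start a new discrete window. */
void sketch_reset_aggregated(struct sketch_damon_region *r)
{
	r->last_nr_accesses = r->nr_accesses;
	r->nr_accesses = 0;
}
```

      In this model, after a full window of samples nr_accesses_bp equals
      nr_accesses * 10000, while intermediate samples already give an
      estimate.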
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ace30fb2
    • SeongJae Park's avatar
      mm/damon/core: introduce nr_accesses_bp · 80333828
      SeongJae Park authored
      Add yet another representation of the access rate of each region, namely
      nr_accesses_bp.  It is the same as nr_accesses but represents the value
      in basis points (1 in 10,000), and is updated once in every aggregation
      interval.  That is, nr_accesses_bp is just nr_accesses * 10000.  This may
      seem useless at the moment.  However, it will be useful for representing
      values of less than one nr_accesses, which will be needed to make a
      moving sum-based nr_accesses.
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      80333828
    • SeongJae Park's avatar
      mm/damon/core-test: add a unit test for damon_moving_sum() · 0926e8ff
      SeongJae Park authored
      Add a simple unit test for the pseudo moving-sum function
      (damon_moving_sum()).
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0926e8ff
    • SeongJae Park's avatar
      mm/damon/core: implement a pseudo-moving sum function · d2c062ad
      SeongJae Park authored
      For values that continuously change, a moving average or sum is a good way
      to provide fast updates while handling temporal and erroneous variability
      of the value.  For example, the access rate counter (nr_accesses) is
      calculated as a sum of the number of positive sampled access check results
      that are collected during a discrete time window (aggregation interval),
      and hence it handles temporal and erroneous access check results, but
      provides the update only for every aggregation interval.  Using a moving
      sum method for that could allow providing the value for every sampling
      interval.  That could be useful for getting a snapshot of the monitoring
      results or running DAMOS with fine-grained timing.

      However, supporting the moving sum for cases where the number of samples
      in the time window is arbitrary could impose high overhead, since the
      number of past values that it needs to keep could be too high.  The
      nr_accesses would also be one of those cases.  To mitigate the overhead,
      implement a pseudo-moving sum function that only provides an estimated
      pseudo-moving sum.  It assumes there was no error in the last discrete
      time window and subtracts a constant portion of the last discrete time
      window sum.

      Note that the function does not strictly implement the moving sum, but it
      keeps a property of the moving sum, which makes the value the same as the
      discrete-window based sum at each time window-aligned timing.  Hence,
      people collecting the value at the old timings would see no difference.
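
      The estimate described above can be sketched in a few lines.  This is a
      minimal model, assuming unsigned integer arithmetic and a hypothetical
      function name; it illustrates the idea and is not presented as the
      exact kernel code.

```c
/*
 * Pseudo-moving sum sketch.
 *
 * mvsum:      current pseudo-moving sum
 * nomvsum:    discrete-window sum of the last completed window
 * len_window: number of samples in one window
 * new_value:  value observed in the newest sample
 *
 * Rather than remembering every past sample, assume the last window's
 * sum was spread evenly over its samples and drop one such share.
 */
unsigned int sketch_pseudo_moving_sum(unsigned int mvsum,
				      unsigned int nomvsum,
				      unsigned int len_window,
				      unsigned int new_value)
{
	return mvsum - nomvsum / len_window + new_value;
}
```

      For instance, starting from a sum of 50000 with a last window sum of
      50000 and a window of 10 samples, feeding the values 10000, 0, 10000, 0,
      0, 0, 10000, 0, 0, 0 gives 55000, 50000, 55000, 50000, 45000, 40000,
      45000, 40000, 35000 and finally 30000.  The last value equals the plain
      sum of the ten new samples, which is the window-aligned property
      mentioned above.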
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2c062ad
    • SeongJae Park's avatar
      mm/damon/vaddr: call damon_update_region_access_rate() always · 22a77880
      SeongJae Park authored
      When getting the mm_struct of the monitoring target process fails, there
      will be no need to increase the access rate counter (nr_accesses) of the
      regions for the process.  Hence, damon_va_check_accesses() skips calling
      damon_update_region_access_rate() in that case.  This breaks the assumption
      that damon_update_region_access_rate() is called for every region, for
      every sampling interval.  Call the function for every region even in that
      case.  This might increase the overhead in some cases, but such cases would
      not be frequent, so no significant impact is really expected.
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      22a77880
    • SeongJae Park's avatar
      mm/damon/core: define and use a dedicated function for region access rate update · 78fbfb15
      SeongJae Park authored
      Patch series "mm/damon: provide pseudo-moving sum based access rate".
      
      DAMON checks the access to each region for every sampling interval, and
      increases the access rate counter of the region, namely nr_accesses, if
      an access was made.  For every aggregation interval, the counter is reset.
      The counter is exposed to users to be used as a metric showing the
      relative access rate (frequency) of each region.  In other words, DAMON
      provides the access rate of each region in every aggregation interval.  The
      aggregation prevents temporal access pattern changes from making things
      confusing.  However, this also makes a few DAMON-related operations
      unnecessarily need to be aligned to the aggregation interval.  This can
      restrict the flexibility of DAMON applications, especially when the
      aggregation interval is huge.

      To provide the monitoring results at finer-grained timing while still
      handling temporal access pattern changes, this patchset implements a
      pseudo-moving sum based access rate metric.  It is a pseudo-moving sum
      because a strict moving sum implementation would need to keep all values
      for the last time window, and that could incur high overhead if there
      could be an arbitrary number of values in a time window.  Especially in
      the case of nr_accesses, since the sampling interval and aggregation
      interval can be set arbitrarily and the past values would have to be
      maintained for every region, it could be risky.  The pseudo-moving sum
      assumes there was no temporal access pattern change in the last discrete
      time window, to remove the need for keeping the list of the last time
      window values.  As a result, it is not a strict moving sum
      implementation, but it provides reasonable accuracy.

      Also, it keeps an important property of the moving sum.  That is, the
      moving sum becomes the same as the discrete-window based sum at the time
      that aligns to the time window.  This means using the pseudo-moving sum
      based nr_accesses makes no change for users who read the value at every
      aggregation interval.
      
      Patches Sequence
      ----------------
      
      The sequence of the patches is as follows.  The first four patches are for
      preparation of the change.  The first two (patches 1 and 2) implement a
      helper function for nr_accesses update and eliminate a corner case that
      skips use of the function, respectively.  The following two (patches 3 and
      4) respectively implement the pseudo-moving sum function and its simple
      unit test case.

      Two patches for making DAMON use the pseudo-moving sum follow.  The
      fifth one (patch 5) introduces a new field for representing the
      pseudo-moving sum-based access rate of each region, and the sixth one
      (patch 6) makes the new representation actually be updated with the
      pseudo-moving sum function.

      The last two patches (patches 7 and 8) make followup fixes for skipping
      unnecessary updates and marking the moving sum function as static,
      respectively.
      
      
      This patch (of 8):
      
      Each DAMON operations set is updating the nr_accesses field of each
      damon_region for each of their access check results, from the
      check_accesses() callback.  Directly accessing the field could make things
      complex to manage and change in the future.  Define and use a dedicated
      function for the purpose.
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20230915025251.72816-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      78fbfb15
    • SeongJae Park's avatar
      mm/damon/core: use number of passed access sampling as a timer · 4472edf6
      SeongJae Park authored
      DAMON sleeps for the sampling interval after each sampling, and checks if
      the aggregation interval and the ops update interval have passed using
      ktime_get_coarse_ts64() and baseline timestamps for the intervals.  That
      design is for making the operations occur at deterministic timing
      regardless of the time spent on each piece of work.  However, it turned
      out that it is not that useful, and it incurs not-that-intuitive results.

      After all, timer functions, and especially sleep functions that DAMON uses
      to wait for specific timing, are not necessarily strictly accurate.  That
      is a legal design, so no problem.  However, depending on such
      inaccuracies, nr_accesses can be larger than the aggregation interval
      divided by the sampling interval.  For example, with the default setting
      (5 ms sampling interval and 100 ms aggregation interval) we frequently
      see regions having nr_accesses larger than 20.  Also, if the execution of
      a DAMOS scheme takes a long time, the next aggregation could happen before
      enough samples are collected.  This is not what usual users would
      intuitively expect.

      Since access check sampling is the smallest unit of work of DAMON, using
      the number of passed sampling intervals as the DAMON-internal timer can
      easily avoid these problems.  That is, convert the aggregation and ops
      update intervals to numbers of sampling intervals that need to pass before
      those operations are executed, count the number of passed sampling
      intervals, and invoke the operations as soon as the specific number of
      sampling intervals has passed.  Make the change.

      Note that this could make a behavioral change for settings that use
      intervals that are not aligned to the sampling interval.  For example, if
      the sampling interval is 5 ms and the aggregation interval is 12 ms, DAMON
      effectively uses 15 ms as its aggregation interval, because it checks
      whether the aggregation interval has passed after sleeping for the
      sampling interval.  This change will make DAMON effectively use 10 ms as
      the aggregation interval, since it uses 'aggregation interval / sampling
      interval * sampling interval' as the effective aggregation interval, and
      we don't use floating point types.  Usual users would have used aligned
      intervals, so this behavioral change is not expected to make any
      meaningful impact, so just make this change.
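
      The counting scheme can be sketched as a small user-space model.  The
      struct and function names below are hypothetical, and intervals are
      taken in microseconds purely for illustration.

```c
/* Hypothetical counter-based timer; the names are not from the kernel. */
struct sketch_timer {
	unsigned long passed_samples;		/* sampling intervals so far */
	unsigned long next_aggregation;		/* sample count of next run */
	unsigned long samples_per_aggregation;	/* integer division result */
};

/* Convert the aggregation interval into a number of sampling intervals. */
void sketch_timer_init(struct sketch_timer *t, unsigned long sample_us,
		       unsigned long aggr_us)
{
	if (!sample_us)
		sample_us = 1;	/* avoid division by zero in this sketch */
	t->passed_samples = 0;
	t->samples_per_aggregation = aggr_us / sample_us;
	t->next_aggregation = t->samples_per_aggregation;
}

/* Account for one finished sampling; return 1 if aggregation is due. */
int sketch_timer_tick(struct sketch_timer *t)
{
	t->passed_samples++;
	if (t->passed_samples < t->next_aggregation)
		return 0;
	t->next_aggregation = t->passed_samples + t->samples_per_aggregation;
	return 1;
}

/* Effective interval: 'aggregation / sampling * sampling', no floats. */
unsigned long sketch_effective_aggr_us(unsigned long sample_us,
				       unsigned long aggr_us)
{
	if (!sample_us)
		sample_us = 1;
	return aggr_us / sample_us * sample_us;
}
```

      With a 5 ms sampling interval and a 100 ms aggregation interval, the
      model aggregates on every 20th sample regardless of how long each sleep
      really took.  With a 12 ms aggregation interval it yields an effective
      10 ms, matching the example in the text.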
      
      Link: https://lkml.kernel.org/r/20230914021523.60649-1-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4472edf6
    • Zi Yan's avatar
      mips: use nth_page() in place of direct struct page manipulation · aa5fe31b
      Zi Yan authored
      __flush_dcache_pages() is called during hugetlb migration via
      migrate_pages() -> migrate_hugetlbs() -> unmap_and_move_huge_page() ->
      move_to_new_folio() -> flush_dcache_folio().  And with hugetlb and without
      sparsemem vmemmap, struct page is not guaranteed to be contiguous beyond a
      section.  Use nth_page() instead.
      
      Without the fix, a wrong address might be used for data cache page flush.
      No bug is reported. The fix comes from code inspection.
      
      Link: https://lkml.kernel.org/r/20230913201248.452081-6-zi.yan@sent.com
      Fixes: 15fa3e8e ("mips: implement the new page table range API")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aa5fe31b