04 Jun, 2020 (40 commits)
    • khugepaged: drain all LRU caches before scanning pages · a980df33
      Kirill A. Shutemov authored
      Having a page in an LRU add cache offsets the page refcount and gives a
      false negative on PageLRU(), which reduces the collapse success rate.
      
      Drain all LRU add caches before scanning.  This happens relatively
      rarely and should not disturb the system too much.
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Link: http://lkml.kernel.org/r/20200416160026.16538-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • khugepaged: do not stop collapse if less than half PTEs are referenced · ffe945e6
      Kirill A. Shutemov authored
      __collapse_huge_page_swapin() checks the number of referenced PTEs to
      decide if the memory range is hot enough to justify swapin.
      
      There are a few problems with this approach:
      
       - It is way too late: we can do the check much earlier and save time.
         khugepaged_scan_pmd() already knows if we have any pages to swap in
         and the number of referenced pages.
      
       - It stops the collapse altogether if there are not enough referenced
         pages, not only the swapin.
      
      Fix it by making the right check early.  We can also avoid the
      additional page table scan if khugepaged_scan_pmd() hasn't found any
      swap entries.
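      The reordering can be sketched as follows (a Python model under assumed semantics; `scan_pmd`, `try_collapse`, and the half-of-HPAGE_PMD_NR threshold are illustrative simplifications, not the kernel's exact heuristics):

```python
# Illustrative model of the reordered check: scan once, decide up
# front whether the range is hot enough, and only take the swapin
# path when the scan actually saw swap entries.

HPAGE_PMD_NR = 512  # 4K PTEs per 2M huge page on x86-64

def scan_pmd(ptes):
    """One pass over the PTEs, like khugepaged_scan_pmd():
    count referenced entries and swap entries."""
    referenced = sum(1 for p in ptes if p.get("referenced"))
    swap = sum(1 for p in ptes if p.get("swap"))
    return referenced, swap

def try_collapse(ptes):
    referenced, swap = scan_pmd(ptes)
    # Early check: require at least half the PTEs to be referenced
    # before doing any further work, swapin included.
    if referenced < HPAGE_PMD_NR // 2:
        return "skip: not referenced enough"
    if swap:
        # Only walk the swapin path when swap entries were seen;
        # otherwise the extra page table scan is skipped entirely.
        return "collapse after swapin"
    return "collapse"

hot = [{"referenced": True}] * 400 + [{}] * 112
cold = [{"referenced": True}] * 100 + [{}] * 412
hot_swapped = ([{"referenced": True}] * 300
               + [{"swap": True, "referenced": True}] * 100
               + [{}] * 112)
```

The point of the reordering is visible in the cold case: the decision to skip is made before any swapin work, rather than deep inside __collapse_huge_page_swapin() after the range has already been rescanned.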
      
      Fixes: 0db501f7 ("mm, thp: convert from optimistic swapin collapsing to conservative")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Link: http://lkml.kernel.org/r/20200416160026.16538-3-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • khugepaged: add self test · e0c13f97
      Kirill A. Shutemov authored
      Patch series "thp/khugepaged improvements and CoW semantics", v4.
      
      The patchset adds a khugepaged selftest (anon-THP only for now), expands
      the cases khugepaged can handle, and switches anon-THP copy-on-write
      handling to 4k.
      
      This patch (of 8):
      
      The test checks whether khugepaged is able to recover a huge page where
      we expect it to do so.  It only covers anon-THP for now.
      
      Currently the test shows a few failures.  They are going to be addressed
      by the following patches.
      
      [colin.king@canonical.com: fix several spelling mistakes]
        Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com
      [aneesh.kumar@linux.ibm.com: replace the usage of system(3) in the test]
        Link: http://lkml.kernel.org/r/20200429110727.89388-1-aneesh.kumar@linux.ibm.com
      [kirill@shutemov.name: fixup for issues I've noticed]
        Link: http://lkml.kernel.org/r/20200429124816.jp272trghrzxx5j5@box
      [jhubbard@nvidia.com: add khugepaged to .gitignore]
        Link: http://lkml.kernel.org/r/20200517002509.362401-1-jhubbard@nvidia.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Link: http://lkml.kernel.org/r/20200416160026.16538-1-kirill.shutemov@linux.intel.com
      Link: http://lkml.kernel.org/r/20200416160026.16538-2-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Chen Tao
    • padata: document multithreaded jobs · ec3b39c7
      Daniel Jordan authored
      Add Documentation for multithreaded jobs.
      
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-9-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make deferred init's max threads arch-specific · ecd09650
      Daniel Jordan authored
      Using padata during deferred init has only been tested on x86, so for now
      limit it to this architecture.
      
      If another arch wants this, it can find the max thread limit that's best
      for it and override deferred_page_init_max_threads().
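      The override pattern can be sketched like this (in the kernel this is a weak C function that x86 overrides; the dict-based dispatch below is purely illustrative Python, not the real mechanism):

```python
# Illustrative sketch of the arch-override pattern: the generic code
# provides a conservative default, and a tested architecture (x86
# here) overrides it. In C this would be a __weak function; the
# registry below is just for illustration.

def deferred_page_init_max_threads_default(node_cpus):
    # Generic default: stay singlethreaded on untested architectures.
    return 1

def deferred_page_init_max_threads_x86(node_cpus):
    # x86 override: deferred init with padata was benchmarked here,
    # so use every CPU on the node.
    return max(1, len(node_cpus))

ARCH_OVERRIDES = {"x86": deferred_page_init_max_threads_x86}

def max_threads(arch, node_cpus):
    fn = ARCH_OVERRIDES.get(arch, deferred_page_init_max_threads_default)
    return fn(node_cpus)
```

Another architecture that benchmarks a better limit only has to register (in the kernel: define) its own override; everything else falls back to the safe default.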
      
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-8-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: parallelize deferred_init_memmap() · e4443149
      Daniel Jordan authored
      Deferred struct page init is a significant bottleneck in kernel boot.
      Optimizing it maximizes availability for large-memory systems and allows
      spinning up short-lived VMs as needed without having to leave them
      running.  It also benefits bare metal machines hosting VMs that are
      sensitive to downtime.  In projects such as VMM Fast Restart[1], where
      guest state is preserved across kexec reboot, it helps prevent application
      and network timeouts in the guests.
      
      Multithread to take full advantage of system memory bandwidth.
      
      The maximum number of threads is capped at the number of CPUs on the node
      because speedups always improve with additional threads on every system
      tested, and at this phase of boot, the system is otherwise idle and
      waiting on page init to finish.
      
      Helper threads operate on section-aligned ranges to both avoid false
      sharing when setting the pageblock's migrate type and to avoid accessing
      uninitialized buddy pages, though max order alignment is enough for the
      latter.
      
      The minimum chunk size is also a section.  There was benefit to using
      multiple threads even on relatively small memory (1G) systems, and this is
      the smallest size that the alignment allows.
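      The section-aligned splitting can be sketched as follows, assuming the common x86-64 geometry (128M sections of 4K pages) and a section-aligned start pfn; `section_chunks` is an illustrative helper, not the kernel's:

```python
# Illustrative sketch of section-aligned work splitting: each helper
# thread gets chunks whose internal boundaries fall on section
# boundaries, so no two threads touch the same pageblock's migratetype
# or each other's not-yet-initialized buddy pages. PAGES_PER_SECTION
# below is the common x86-64 value (128M section / 4K pages); start_pfn
# is assumed section-aligned.

PAGES_PER_SECTION = 1 << 15  # 32768 pages = 128M

def section_chunks(start_pfn, end_pfn, nr_threads):
    """Split [start_pfn, end_pfn) into at most nr_threads chunks,
    every internal boundary aligned to a section."""
    total = end_pfn - start_pfn
    # Minimum chunk is one section; never use more threads than that
    # allows.
    nr = max(1, min(nr_threads, total // PAGES_PER_SECTION))
    size = -(-total // nr)  # ceil division: spread the work evenly
    # Round the chunk size up to a whole number of sections.
    size = -(-size // PAGES_PER_SECTION) * PAGES_PER_SECTION
    chunks = []
    pfn = start_pfn
    while pfn < end_pfn:
        chunks.append((pfn, min(pfn + size, end_pfn)))
        pfn += size
    return chunks

# Example: 10 sections of pages split across 4 helper threads.
chunks = section_chunks(0, 10 * PAGES_PER_SECTION, 4)
```

Every chunk start lands on a section boundary, and only the final chunk may be short, which matches the "minimum chunk is a section" rule above.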
      
      The time (milliseconds) is the slowest node to initialize since boot
      blocks until all nodes finish.  intel_pstate is loaded in active mode
      without hwp and with turbo enabled, and intel_idle is active as well.
      
          Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
            2 nodes * 26 cores * 2 threads = 104 CPUs
            384G/node = 768G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   4089.7 (  8.1)         --   1785.7 (  7.6)
             2% (  1)       1.7%   4019.3 (  1.5)       3.8%   1717.7 ( 11.8)
            12% (  6)      34.9%   2662.7 (  2.9)      79.9%    359.3 (  0.6)
            25% ( 13)      39.9%   2459.0 (  3.6)      91.2%    157.0 (  0.0)
            37% ( 19)      39.2%   2485.0 ( 29.7)      90.4%    172.0 ( 28.6)
            50% ( 26)      39.3%   2482.7 ( 25.7)      90.3%    173.7 ( 30.0)
            75% ( 39)      39.0%   2495.7 (  5.5)      89.4%    190.0 (  1.0)
           100% ( 52)      40.2%   2443.7 (  3.8)      92.3%    138.0 (  1.0)
      
          Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
            1 node * 16 cores * 2 threads = 32 CPUs
            192G/node = 192G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1988.7 (  9.6)         --   1096.0 ( 11.5)
             3% (  1)       1.1%   1967.0 ( 17.6)       0.3%   1092.7 ( 11.0)
            12% (  4)      41.1%   1170.3 ( 14.2)      73.8%    287.0 (  3.6)
            25% (  8)      47.1%   1052.7 ( 21.9)      83.9%    177.0 ( 13.5)
            38% ( 12)      48.9%   1016.3 ( 12.1)      86.8%    144.7 (  1.5)
            50% ( 16)      48.9%   1015.7 (  8.1)      87.8%    134.0 (  4.4)
            75% ( 24)      49.1%   1012.3 (  3.1)      88.1%    130.3 (  2.3)
           100% ( 32)      49.5%   1004.0 (  5.3)      88.5%    125.7 (  2.1)
      
          Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
            2 nodes * 18 cores * 2 threads = 72 CPUs
            128G/node = 256G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1680.0 (  4.6)         --    627.0 (  4.0)
             3% (  1)       0.3%   1675.7 (  4.5)      -0.2%    628.0 (  3.6)
            11% (  4)      25.6%   1250.7 (  2.1)      67.9%    201.0 (  0.0)
            25% (  9)      30.7%   1164.0 ( 17.3)      81.8%    114.3 ( 17.7)
            36% ( 13)      31.4%   1152.7 ( 10.8)      84.0%    100.3 ( 17.9)
            50% ( 18)      31.5%   1150.7 (  9.3)      83.9%    101.0 ( 14.1)
            75% ( 27)      31.7%   1148.0 (  5.6)      84.5%     97.3 (  6.4)
           100% ( 36)      32.0%   1142.3 (  4.0)      85.6%     90.0 (  1.0)
      
          AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
            1 node * 8 cores * 2 threads = 16 CPUs
            64G/node = 64G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1029.3 ( 25.1)         --    240.7 (  1.5)
             6% (  1)      -0.6%   1036.0 (  7.8)      -2.2%    246.0 (  0.0)
            12% (  2)      11.8%    907.7 (  8.6)      44.7%    133.0 (  1.0)
            25% (  4)      13.9%    886.0 ( 10.6)      62.6%     90.0 (  6.0)
            38% (  6)      17.8%    845.7 ( 14.2)      69.1%     74.3 (  3.8)
            50% (  8)      16.8%    856.0 ( 22.1)      72.9%     65.3 (  5.7)
            75% ( 12)      15.4%    871.0 ( 29.2)      79.8%     48.7 (  7.4)
           100% ( 16)      21.0%    813.7 ( 21.0)      80.5%     47.0 (  5.2)
      
      Server-oriented distros that enable deferred page init sometimes run in
      small VMs, and they still benefit even though the fraction of boot time
      saved is smaller:
      
          AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
            1 node * 2 cores * 2 threads = 4 CPUs
            16G/node = 16G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --    716.0 ( 14.0)         --     49.7 (  0.6)
            25% (  1)       1.8%    703.0 (  5.3)      -4.0%     51.7 (  0.6)
            50% (  2)       1.6%    704.7 (  1.2)      43.0%     28.3 (  0.6)
            75% (  3)       2.7%    696.7 ( 13.1)      49.7%     25.0 (  0.0)
           100% (  4)       4.1%    687.0 ( 10.4)      55.7%     22.0 (  0.0)
      
          Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
            1 node * 2 cores * 2 threads = 4 CPUs
            14G/node = 14G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --    787.7 (  6.4)         --    122.3 (  0.6)
            25% (  1)       0.2%    786.3 ( 10.8)      -2.5%    125.3 (  2.1)
            50% (  2)       5.9%    741.0 ( 13.9)      37.6%     76.3 ( 19.7)
            75% (  3)       8.3%    722.0 ( 19.0)      49.9%     61.3 (  3.2)
           100% (  4)       9.3%    714.7 (  9.5)      56.4%     53.3 (  1.5)
      
      On Josh's 96-CPU and 192G memory system:
      
          Without this patch series:
          [    0.487132] node 0 initialised, 23398907 pages in 292ms
          [    0.499132] node 1 initialised, 24189223 pages in 304ms
          ...
          [    0.629376] Run /sbin/init as init process
      
          With this patch series:
          [    0.231435] node 1 initialised, 24189223 pages in 32ms
          [    0.236718] node 0 initialised, 23398907 pages in 36ms
      
      [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
      
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-7-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't track number of pages during deferred initialization · 89c7c402
      Daniel Jordan authored
      Deferred page init used to report the number of pages initialized:
      
        node 0 initialised, 32439114 pages in 97ms
      
      Tracking this makes the code more complicated when using multiple threads.
      Given that the statistic probably has limited value, especially since a
      zone grows on demand so that the page count can vary, just remove it.
      
      The boot message now looks like
      
        node 0 deferred pages initialised in 97ms
      
      Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-6-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • padata: add basic support for multithreaded jobs · 004ed426
      Daniel Jordan authored
      Sometimes the kernel doesn't take full advantage of system memory
      bandwidth, leading to a single CPU spending excessive time in
      initialization paths where the data scales with memory size.
      
      Multithreading naturally addresses this problem.
      
      Extend padata, a framework that handles many parallel yet singlethreaded
      jobs, to also handle multithreaded jobs by adding support for splitting up
      the work evenly, specifying a minimum amount of work that's appropriate
      for one helper thread to do, load balancing between helpers, and
      coordinating them.
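      The job-splitting behavior described above might be modeled like this (a Python sketch; the real interface is C and the names only loosely mirror it):

```python
# Illustrative model of a multithreaded padata-style job: the caller
# supplies a range of work, a minimum sensible chunk per helper, and a
# thread function; the framework splits the range evenly, runs the
# helpers, and waits for all of them.

import threading

def do_multithreaded(thread_fn, start, size, min_chunk, max_threads):
    # Never spawn a helper for less than min_chunk units of work.
    nworks = max(1, min(max_threads, size // min_chunk))
    chunk = -(-size // nworks)  # ceil: spread the work evenly
    threads = []
    pos = start
    while pos < start + size:
        end = min(pos + chunk, start + size)
        t = threading.Thread(target=thread_fn, args=(pos, end))
        t.start()
        threads.append(t)
        pos = end
    for t in threads:  # the job is done only when all helpers finish
        t.join()

# Example: sum disjoint ranges in parallel, one bucket per helper.
results = []
lock = threading.Lock()

def work(lo, hi):
    s = sum(range(lo, hi))
    with lock:
        results.append(s)

do_multithreaded(work, 0, 1000, 100, 4)
total = sum(results)
```

The min_chunk floor is the load-balancing knob from the description above: it keeps the framework from cutting the range so fine that helper startup costs dominate.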
      
      This is inspired by work from Pavel Tatashin and Steve Sistare.
      
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-5-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • padata: allocate work structures for parallel jobs from a pool · 4611ce22
      Daniel Jordan authored
      padata allocates per-CPU, per-instance work structs for parallel jobs.  A
      do_parallel call assigns a job to a sequence number and hashes the number
      to a CPU, where the job will eventually run using the corresponding work.
      
      This approach fit with how padata used to bind a job to each CPU
      round-robin, but it makes less sense after commit bfde23ce ("padata:
      unbind parallel jobs from specific CPUs") because a work isn't bound to
      a particular CPU anymore, and it isn't needed at all for multithreaded
      jobs because they don't have sequence numbers.
      
      Replace the per-CPU works with a preallocated pool, which allows sharing
      them between existing padata users and the upcoming multithreaded user.
      The pool will also facilitate setting NUMA-aware concurrency limits with
      later users.
      
      The pool is sized according to the number of possible CPUs.  With this
      limit, MAX_OBJ_NUM no longer makes sense, so remove it.
      
      If the global pool is exhausted, a parallel job is run in the current task
      instead to throttle a system trying to do too much in parallel.
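      The pool-plus-fallback behavior can be sketched as follows (illustrative Python; `WorkPool` and its methods are hypothetical, not padata's API):

```python
# Illustrative model of the pool-with-fallback behavior: parallel jobs
# take a preallocated work item; when the pool is empty, the job runs
# synchronously in the caller, throttling over-parallel callers.

from collections import deque

class WorkPool:
    def __init__(self, nr_possible_cpus):
        # Pool sized by the number of possible CPUs, as described above.
        self.free = deque(object() for _ in range(nr_possible_cpus))

    def do_parallel(self, fn, arg):
        if not self.free:
            # Pool exhausted: throttle by running in the current task.
            fn(arg)
            return "ran synchronously"
        work = self.free.popleft()
        try:
            fn(arg)  # stand-in for queueing the work to a worker
        finally:
            self.free.append(work)  # work item returns to the pool
        return "queued"

pool = WorkPool(2)
log = []
first = pool.do_parallel(log.append, 1)

# Simulate exhaustion by draining the pool, then submit again.
pool.free.clear()
second = pool.do_parallel(log.append, 2)
```

Either way the job runs; exhaustion only changes *where* it runs, which is the throttling behavior the commit describes.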
      
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-4-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • padata: initialize earlier · f1b192b1
      Daniel Jordan authored
      padata will soon initialize the system's struct pages in parallel, so it
      needs to be ready by page_alloc_init_late().
      
      The error return from padata_driver_init() triggers an initcall warning,
      so add a warning to padata_init() to avoid silent failure.
      
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-3-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • padata: remove exit routine · 305dacf7
      Daniel Jordan authored
      Patch series "padata: parallelize deferred page init", v3.
      
      Deferred struct page init is a bottleneck in kernel boot--the biggest for
      us and probably others.  Optimizing it maximizes availability for
      large-memory systems and allows spinning up short-lived VMs as needed
      without having to leave them running.  It also benefits bare metal
      machines hosting VMs that are sensitive to downtime.  In projects such as
      VMM Fast Restart[1], where guest state is preserved across kexec reboot,
      it helps prevent application and network timeouts in the guests.
      
      So, multithread deferred init to take full advantage of system memory
      bandwidth.
      
      Extend padata, a framework that handles many parallel singlethreaded jobs,
      to handle multithreaded jobs as well by adding support for splitting up
      the work evenly, specifying a minimum amount of work that's appropriate
      for one helper thread to do, load balancing between helpers, and
      coordinating them.  More documentation in patches 4 and 8.
      
      This series is the first step in a project to address other memory
      proportional bottlenecks in the kernel such as pmem struct page init, vfio
      page pinning, hugetlb fallocate, and munmap.  Deferred page init doesn't
      require concurrency limits, resource control, or priority adjustments like
      these other users will because it happens during boot when the system is
      otherwise idle and waiting for page init to finish.
      
      This has been run on a variety of x86 systems and speeds up kernel boot by
      4% to 49%, saving up to 1.6 out of 4 seconds.  Patch 6 has more numbers.
      
      This patch (of 8):
      
      padata_driver_exit() is unnecessary because padata isn't built as a module
      and doesn't exit.
      
      padata's init routine will soon allocate memory, so getting rid of the
      exit function now avoids pointless code to free it.
      
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-1-daniel.m.jordan@oracle.com
      Link: http://lkml.kernel.org/r/20200527173608.2885243-2-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: call cond_resched() from deferred_init_memmap() · da97f2d5
      Pavel Tatashin authored
      Now that deferred pages are initialized with interrupts enabled, we can
      replace touch_nmi_watchdog() with cond_resched(), as it was before
      3a2d7fa8.
      
      For now, we cannot do the same in deferred_grow_zone() as it still
      initializes pages with interrupts disabled.
      
      This change fixes the RCU problem described in
      https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com
      
      [   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
      [   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
      [   60.475000] Sending NMI from CPU 0 to CPUs 1:
      [    1.760091] NMI backtrace for cpu 1
      [    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
      [    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
      [    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
      [    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
      [    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
      [    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
      [    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
      [    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
      [    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
      [    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
      [    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
      [    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
      [    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    1.760091] Call Trace:
      [    1.760091]  deferred_init_pages+0x8f/0xbf
      [    1.760091]  deferred_init_memmap+0x184/0x29d
      [    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
      [    1.760091]  kthread+0x112/0x130
      [    1.760091]  ? kthread_flush_work_fn+0x10/0x10
      [    1.760091]  ret_from_fork+0x35/0x40
      [   89.123011] node 0 initialised, 1055935372 pages in 88650ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Yiqian Wei <yiwei@redhat.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da97f2d5
    • Pavel Tatashin's avatar
      mm: initialize deferred pages with interrupts enabled · 3d060856
      Pavel Tatashin authored
      Initializing struct pages is a long task and keeping interrupts disabled
      for the duration of this operation introduces a number of problems.
      
      1. jiffies are not updated for a long period of time, and thus an
         incorrect time is reported. See the proposed solution and discussion
         here:
         lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      2. It prevents further improving deferred page initialization by
         allowing intra-node multi-threading.
      
      We are keeping interrupts disabled to solve a rather theoretical problem
      that was never observed in the real world (see 3a2d7fa8).
      
      Let's keep interrupts enabled. If we ever encounter a scenario where an
      interrupt thread wants to allocate a large amount of memory this early
      in boot, we can deal with it by growing the zone (see
      deferred_grow_zone()) by the needed amount before starting the
      deferred_init_memmap() threads.
      
      Before:
      [    1.232459] node 0 initialised, 12058412 pages in 1ms
      
      After:
      [    1.632580] node 0 initialised, 12051227 pages in 436ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d060856
    • Daniel Jordan's avatar
      mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in deferred init · 117003c3
      Daniel Jordan authored
      Patch series "initialize deferred pages with interrupts enabled", v4.
      
      Keep interrupts enabled during deferred page initialization in order to
      make code more modular and allow jiffies to update.
      
      Original approach, and discussion can be found here:
       http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      
      This patch (of 3):
      
      deferred_init_memmap() disables interrupts the entire time, so it calls
      touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
      will run with interrupts enabled, at which point cond_resched() should be
      used instead.
      
      deferred_grow_zone() makes the same watchdog calls through code shared
      with deferred init but will continue to run with interrupts disabled, so
      it can't call cond_resched().
      
      Pull the watchdog calls up to these two places to allow the first to be
      changed later, independently of the second.  The frequency reduces from
      twice per pageblock (init and free) to once per max order block.
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      117003c3
    • Anshuman Khandual's avatar
      mm/page_alloc: restrict and formalize compound_page_dtors[] · ae70eddd
      Anshuman Khandual authored
      Restrict elements in compound_page_dtors[] array per NR_COMPOUND_DTORS and
      explicitly position them according to enum compound_dtor_id.  This
      improves protection against possible misalignment between
      compound_page_dtors[] and enum compound_dtor_id later on.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/1589795958-19317-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae70eddd
    • Charan Teja Reddy's avatar
      mm, page_alloc: reset the zone->watermark_boost early · aa092591
      Charan Teja Reddy authored
      Updating the zone watermarks by any means, like min_free_kbytes,
      watermark_scale_factor etc., when ->watermark_boost is set will result
      in higher low and high watermarks than the user asked for.
      
      Below are the steps to reproduce the problem on system setup of Android
      kernel running on Snapdragon hardware.
      
      1) Default settings of the system are as below:
      
         #cat /proc/sys/vm/min_free_kbytes = 5162
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      8340
      		high     8539
      
      2) Monitor zone->watermark_boost (by adding a debug print in the
         kernel) and, whenever it is greater than zero, write back the same
         min_free_kbytes value obtained in step 1.
      
         #echo 5162 > /proc/sys/vm/min_free_kbytes
      
      3) Then read the zone watermarks in the system while ->watermark_boost
         is zero.  This should show the same watermark values as step 1, but
         higher values than asked for are shown.
      
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      21148
      		high     21347
      
      These higher values are because of updating the zone watermarks using the
      macro min_wmark_pages(zone) which also adds the zone->watermark_boost.
      
      	#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] +
      					z->watermark_boost)
      
      So the steps that lead to the issue are:
      
      1) On the extfrag event, watermarks are boosted by storing the required
         value in ->watermark_boost.
      
      2) User tries to update the zone watermarks level in the system through
         min_free_kbytes or watermark_scale_factor.
      
      3) Later, when kswapd wakes up, it resets zone->watermark_boost to
         zero.
      
      In step 2), we use the min_wmark_pages() macro to store the watermarks
      in the zone structure, so the values are always offset by the
      ->watermark_boost value.  This can be avoided by resetting
      ->watermark_boost to zero before it is used.
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa092591
    • Sandipan Das's avatar
      mm/page_alloc.c: reset numa stats for boot pagesets · b418a0f9
      Sandipan Das authored
      Initially, the per-cpu pagesets of each zone are set to the boot pagesets.
      The real pagesets are allocated later but before that happens, page
      allocations do occur and the numa stats for the boot pagesets get
      incremented since they are common to all zones at that point.
      
      The real pagesets, however, are allocated for the populated zones only.
      Unpopulated zones, like those associated with memory-less nodes, continue
      using the boot pageset and end up skewing the numa stats of the
      corresponding node.
      
      E.g.
      
        $ numactl -H
        available: 2 nodes (0-1)
        node 0 cpus: 0 1 2 3
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 1 cpus: 4 5 6 7
        node 1 size: 8131 MB
        node 1 free: 6980 MB
        node distances:
        node   0   1
          0:  10  40
          1:  40  10
      
        $ numastat
                                   node0           node1
        numa_hit                     108           56495
        numa_miss                      0               0
        numa_foreign                   0               0
        interleave_hit                 0            4537
        local_node                   108           31547
        other_node                     0           24948
      
      Hence, the boot pageset stats need to be cleared after the real pagesets
      are allocated.
      
      After this point, the stats of the boot pagesets do not change as page
      allocations requested for a memory-less node will either fail (if
      __GFP_THISNODE is used) or get fulfilled by a preferred zone of a
      different node based on the fallback zonelist.
      
      [sandipan@linux.ibm.com: v3]
        Link: http://lkml.kernel.org/r/20200511170356.162531-1-sandipan@linux.ibm.com
      Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Link: http://lkml.kernel.org/r/9c9c2d1b15e37f6e6bf32f99e3100035e90c4ac9.1588868430.git.sandipan@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b418a0f9
    • Wei Yang's avatar
      mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention · 01c0bfe0
      Wei Yang authored
      The pageblock migrate type is encoded in the GFP flags, just like
      zone_type and zonelist.
      
      Currently we use gfp_zone() and gfp_zonelist() to extract the related
      information; it would be proper to use the same naming convention for
      the migrate type.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200329080823.7735-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c0bfe0
    • Wei Yang's avatar
      mm/page_alloc.c: use NODE_MASK_NONE in build_zonelists() · d0ddf49b
      Wei Yang authored
      Slightly simplify the code by initializing user_mask with NODE_MASK_NONE,
      instead of later calling nodes_clear().  This saves a line of code.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200330220840.21228-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0ddf49b
    • Joonsoo Kim's avatar
      mm/page_alloc: integrate classzone_idx and high_zoneidx · 97a225e6
      Joonsoo Kim authored
      classzone_idx is just a different name for high_zoneidx now.  So,
      integrate them and add a comment to struct alloc_context in order to
      reduce future confusion about the meaning of this variable.
      
      The accessor ac_classzone_idx() is also removed since it isn't needed
      after the integration.
      
      In addition to the integration, this patch renames high_zoneidx to
      highest_zoneidx since that represents the meaning more precisely.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ye Xiaolong <xiaolong.ye@intel.com>
      Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a225e6
    • Joonsoo Kim's avatar
      mm/page_alloc: use ac->high_zoneidx for classzone_idx · 3334a45e
      Joonsoo Kim authored
      Patch series "integrate classzone_idx and high_zoneidx", v5.
      
      This patchset is a follow-up to the problem reported and discussed two
      years ago [1, 2].  The problem it solves is related to the
      classzone_idx on NUMA systems.  A problem occurs when lowmem reserve
      protection exists for some zones on a node that do not exist on other
      nodes.
      
      This problem was reported two years ago, and, at that time, the
      solution got general agreement [2].  But it was never upstreamed.
      
      [1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
      [2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com
      
      This patch (of 2):
      
      Currently, we use classzone_idx to calculate the lowmem reserve
      protection for an allocation request.  This classzone_idx causes a
      problem on NUMA systems when lowmem reserve protection exists for some
      zones on a node that do not exist on other nodes.
      
      Before further explanation, I should first clarify how to compute the
      classzone_idx and the high_zoneidx.
      
      - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
        represents the index of the highest zone the allocation can use
      
      - classzone_idx was supposed to be the index of the highest zone on the
        local node that the allocation can use, that is actually available in
        the system
      
      Consider the following example.  Node 0 has 4 populated zones,
      DMA/DMA32/NORMAL/MOVABLE.  Node 1 has 1 populated zone, NORMAL.  Some
      zones, such as MOVABLE, do not exist on node 1, and this makes the
      following difference.
      
      Assume an allocation request whose gfp_zone(gfp_mask) is the MOVABLE
      zone.  Then, its high_zoneidx is 3.  If this allocation is initiated on
      node 0, its classzone_idx is 3 since the actually available/usable zone
      on the local node (node 0) is MOVABLE.  If this allocation is initiated
      on node 1, its classzone_idx is 2 since the actually available/usable
      zone on the local node (node 1) is NORMAL.
      
      You can see that the classzone_idx of the allocation requests differs
      according to their starting node, even though their high_zoneidx is the
      same.
      
      Think more about these two allocation requests.  If they are processed
      locally, there is no problem.  However, if the allocation initiated on
      node 1 is processed remotely, in this example at the NORMAL zone on
      node 0 due to memory shortage, a problem occurs.  Their different
      classzone_idx leads to different lowmem reserves and then different min
      watermarks.  See the following example.
      
      root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
      Node 0, zone      DMA
        per-node stats
      ...
        pages free     3965
              min      5
              low      8
              high     11
              spanned  4095
              present  3998
              managed  3977
              protection: (0, 2961, 4928, 5440)
      ...
      Node 0, zone    DMA32
        pages free     757955
              min      1129
              low      1887
              high     2645
              spanned  1044480
              present  782303
              managed  758116
              protection: (0, 0, 1967, 2479)
      ...
      Node 0, zone   Normal
        pages free     459806
              min      750
              low      1253
              high     1756
              spanned  524288
              present  524288
              managed  503620
              protection: (0, 0, 0, 4096)
      ...
      Node 0, zone  Movable
        pages free     130759
              min      195
              low      326
              high     457
              spanned  1966079
              present  131072
              managed  131072
              protection: (0, 0, 0, 0)
      ...
      Node 1, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1006, 1006)
      Node 1, zone    DMA32
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1006, 1006)
      Node 1, zone   Normal
        per-node stats
      ...
        pages free     233277
              min      383
              low      640
              high     897
              spanned  262144
              present  262144
              managed  257744
              protection: (0, 0, 0, 0)
      ...
      Node 1, zone  Movable
        pages free     0
              min      0
              low      0
              high     0
              spanned  262144
              present  0
              managed  0
              protection: (0, 0, 0, 0)
      
      - static min watermark for the NORMAL zone on node 0 is 750.
      
      - lowmem reserve for the request with classzone idx 3 at the NORMAL on
        node 0 is 4096.
      
      - lowmem reserve for the request with classzone idx 2 at the NORMAL on
        node 0 is 0.
      
      So, overall min watermark is:
      allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
      allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750
      
      An allocation initiated on node 1 will take precedence over one
      initiated on node 0 because the min watermark of the former is lower.
      So, an allocation initiated on node 1 could succeed on node 0 when one
      initiated on node 0 could not, and this could cause too many numa_miss
      allocations, degrading performance.
      
      Recently, there was a regression report about this problem with the CMA
      patches, since those patches place CMA memory in ZONE_MOVABLE.  I
      checked that the problem disappears with this fix, which uses
      high_zoneidx for classzone_idx.
      
      http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
      
      Using high_zoneidx for classzone_idx is a more consistent approach than
      the previous one because the system's memory layout does not affect it.
      With this patch, both classzone_idx values in the above example will be
      3, so they will have the same min watermark.
      
      allocation initiated on node 0: 750 + 4096 = 4846
      allocation initiated on node 1: 750 + 4096 = 4846
      
      One could wonder whether there is a side effect: an allocation
      initiated on node 1 would face a higher bar when it is handled locally,
      since its classzone_idx could be higher than before.  This will not
      happen, because a zone without managed pages does not contribute to
      lowmem_reserve at all.
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3334a45e
    • Baoquan He's avatar
      mm/vmstat.c: do not show lowmem reserve protection information of empty zone · 26e7dead
      Baoquan He authored
      The lowmem reserve protection of an empty zone conveys no information;
      it only adds one more line to /proc/zoneinfo.
      
      Let's stop showing it for such zones.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200402140113.3696-4-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26e7dead
    • Baoquan He's avatar
      mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty · f6366156
      Baoquan He authored
      When a memory allocation request for a specific zone is not satisfied,
      it falls back to a lower zone to try allocating memory.  In this case,
      the lower zone's ->lowmem_reserve[] helps protect its own memory
      resource.  The higher the relevant ->lowmem_reserve[] is, the harder it
      is for the upper zone to get memory from this lower zone.
      
      However, this protection mechanism should be applied to populated
      zones, not empty ones.  Filling ->lowmem_reserve[] for an empty zone is
      unnecessary, and may mislead people into thinking it is valid data for
      that zone.
      
      Node 2, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1024, 1024)
      Node 2, zone    DMA32
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1024, 1024)
      Node 2, zone   Normal
        per-node stats
            nr_inactive_anon 0
            nr_active_anon 143
            nr_inactive_file 0
            nr_active_file 0
            nr_unevictable 0
            nr_slab_reclaimable 45
            nr_slab_unreclaimable 254
      
      Hence, clear out zone->lowmem_reserve[] if the zone is empty.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200402140113.3696-3-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6366156
    • Baoquan He's avatar
      mm/page_alloc.c: only tune sysctl_lowmem_reserve_ratio value once when changing it · 86aaf255
      Baoquan He authored
      Patch series "improvements about lowmem_reserve and /proc/zoneinfo", v2.
      
      This patch (of 3):
      
      When people write to /proc/sys/vm/lowmem_reserve_ratio to change
      sysctl_lowmem_reserve_ratio[], setup_per_zone_lowmem_reserve() is called
      to recalculate all ->lowmem_reserve[] for each zone of all nodes as below:
      
      static void setup_per_zone_lowmem_reserve(void)
      {
      ...
      	for_each_online_pgdat(pgdat) {
      		for (j = 0; j < MAX_NR_ZONES; j++) {
      			...
      			while (idx) {
      				...
      				if (sysctl_lowmem_reserve_ratio[idx] < 1) {
      					sysctl_lowmem_reserve_ratio[idx] = 0;
      					lower_zone->lowmem_reserve[j] = 0;
      				} else {
      				...
      			}
      		}
      	}
      }
      
      Meanwhile, sysctl_lowmem_reserve_ratio[idx] is tuned here if its value
      is smaller than '1'.  As we know, sysctl_lowmem_reserve_ratio[] is set
      per zone type, without regard to which node a zone belongs to.  That
      means the tuning is repeated on every node, even though it has already
      been done on the first node.
      
      The tuning also happens when init_per_zone_wmark_min() calls
      setup_per_zone_lowmem_reserve(), where nobody is actually trying to
      change sysctl_lowmem_reserve_ratio[].
      
      So move the tuning into lowmem_reserve_ratio_sysctl_handler(), to make
      the code logic more reasonable.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200402140113.3696-1-bhe@redhat.com
      Link: http://lkml.kernel.org/r/20200402140113.3696-2-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86aaf255
    • Baoquan He's avatar
      mm/page_alloc.c: remove unused free_bootmem_with_active_regions · 4ca7be24
      Baoquan He authored
      Since commit 397dc00e ("mips: sgi-ip27: switch from DISCONTIGMEM
      to SPARSEMEM"), the last caller of free_bootmem_with_active_regions()
      is gone; nothing calls it any more.
      
      Let's remove it.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200402143455.5145-1-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ca7be24
    • Roman Gushchin's avatar
      mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations · 16867664
      Roman Gushchin authored
      Currently a CMA area is barely used by the page allocator because it is
      used only as a fallback from MOVABLE; moreover, kswapd tries hard to
      make sure that the fallback path isn't used.
      
      This results in the system evicting memory and pushing data into swap
      while lots of CMA memory is still available.  This happens despite the
      fact that alloc_contig_range is perfectly capable of moving any movable
      allocations out of the way of an allocation.
      
      To use the CMA area effectively, let's alter the rules: if the zone has
      more free CMA pages than half of the total free pages in the zone, use
      CMA pageblocks first and fall back to movable blocks on failure.
      
      [guro@fb.com: ifdef the cma-specific code]
        Link: http://lkml.kernel.org/r/20200311225832.GA178154@carbon.DHCP.thefacebook.com
      Co-developed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200306150102.3e77354b@imladris.surriel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16867664
    • Wei Yang's avatar
      mm/page_alloc.c: extract check_[new|free]_page_bad() common part to page_bad_reason() · 58b7f119
      Wei Yang authored
      We share similar code in check_[new|free]_page_bad() to get the page's bad
      reason.
      
      Let's extract it and reduce code duplication.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-6-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58b7f119
    • Wei Yang's avatar
      mm/page_alloc.c: rename free_pages_check() to check_free_page() · 534fe5e3
      Wei Yang authored
      free_pages_check() is the counterpart of check_new_page().  Rename it to
      use the same naming convention.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-5-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      534fe5e3
    • Wei Yang's avatar
      mm/page_alloc.c: rename free_pages_check_bad() to check_free_page_bad() · 0d0c48a2
      Wei Yang authored
      free_pages_check_bad() is the counterpart of check_new_page_bad().  Rename
      it to use the same naming convention.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-4-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d0c48a2
    • Wei Yang's avatar
      mm/page_alloc.c: bad_flags is not necessary for bad_page() · 82a3241a
      Wei Yang authored
      After commit 5b57b8f2 ("mm/debug.c: always print flags in
      dump_page()"), page->flags is always printed for a bad page.  It is not
      necessary to have bad_flags any more.
      Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-3-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82a3241a
    • Wei Yang's avatar
      mm/page_alloc.c: bad_[reason|flags] is not necessary when PageHWPoison · 833d8a42
      Wei Yang authored
      Patch series "mm/page_alloc.c: cleanup on check page", v3.
      
      This patchset does some cleanup related to page checking.
      
      1. Remove an unnecessary bad_reason assignment
      2. Remove the bad_flags argument from bad_page()
      3. Rename functions for naming consistency
      4. Extract the common part of the page checks
      
      Thanks for the suggestions from David Rientjes and Anshuman Khandual.
      
      This patch (of 5):
      
      Since the function returns directly in the PageHWPoison case,
      bad_[reason|flags] is never used there, so drop the assignments.  Also
      move this check to the beginning of the function.
      
      This is a follow-up cleanup for commit e570f56c ("mm:
      check_new_page_bad() directly returns in __PG_HWPOISON case")
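      A tiny sketch of the ordering idea with hypothetical names (not the
      kernel code): the HWPoison check comes first and returns directly, so
      no bad_reason/bad_flags bookkeeping is needed for it.

```c
#include <stddef.h>

/* Hypothetical stand-in for the PG_hwpoison page flag bit. */
#define STUB_PG_HWPOISON (1UL << 0)

static const char *new_page_bad_reason(unsigned long flags, int refcount)
{
	/* Checked first; returns directly without touching any
	 * bad_reason/bad_flags state. */
	if (flags & STUB_PG_HWPOISON)
		return "HWPoisoned (hardware-corrupted)";
	if (refcount != 0)
		return "nonzero _refcount";
	return NULL;
}
```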
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-2-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      833d8a42
    • Mike Rapoport's avatar
      docs/vm: update memory-models documentation · 237e506c
      Mike Rapoport authored
      Update the documentation to reflect the changes to the free_area_init()
      family of functions.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-22-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      237e506c
    • Mike Rapoport's avatar
      mm: simplify find_min_pfn_with_active_regions() · 8a1b25fe
      Mike Rapoport authored
      find_min_pfn_with_active_regions() calls find_min_pfn_for_node() with nid
      parameter set to MAX_NUMNODES.  This makes the find_min_pfn_for_node()
      traverse all memblock memory regions although the first PFN in the system
      can be easily found with memblock_start_of_DRAM().
      
      Use memblock_start_of_DRAM() in find_min_pfn_with_active_regions() and
      drop the now-unused find_min_pfn_for_node().
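      A toy model of the simplification described above: the lowest PFN in
      the system is just the start of DRAM, so there is no need to walk
      every region per node. The region array and both helpers here are
      illustrative stand-ins for memblock and memblock_start_of_DRAM(), not
      the kernel code.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* 4 KiB pages, as on most architectures */

/* Simplified model of a memblock memory region. */
struct region { unsigned long base; unsigned long size; };

static unsigned long start_of_dram(const struct region *regions, int cnt)
{
	unsigned long min = regions[0].base;

	/* memblock keeps regions sorted, so the first base is the start;
	 * scanning here just keeps the model independent of ordering. */
	for (int i = 1; i < cnt; i++)
		if (regions[i].base < min)
			min = regions[i].base;
	return min;
}

/* The minimal PFN falls out of the DRAM start in one step. */
static unsigned long find_min_pfn(const struct region *regions, int cnt)
{
	return start_of_dram(regions, cnt) >> PAGE_SHIFT;
}
```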
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-21-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a1b25fe
    • Mike Rapoport's avatar
      mm: clean up free_area_init_node() and its helpers · 854e8848
      Mike Rapoport authored
      free_area_init_node() now always uses memblock info and the zone PFN
      limits so it does not need the backwards compatibility functions to
      calculate the zone spanned and absent pages.  The removal of the compat_
      versions of zone_{absent,spanned}_pages_in_node(), in turn, makes the
      zone_size and zhole_size parameters unused.
      
      The node_start_pfn is determined by get_pfn_range_for_nid(), so there is
      no need to pass it to free_area_init_node().
      
      As a result, the only required parameter to free_area_init_node() is
      the node ID; all the rest are removed along with the no longer used
      compat_zone_{absent,spanned}_pages_in_node() helpers.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-20-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      854e8848
    • Mike Rapoport's avatar
      mm: rename free_area_init_node() to free_area_init_memoryless_node() · bc9331a1
      Mike Rapoport authored
      free_area_init_node() is only used by x86 to initialize memory-less
      nodes.  Make its name reflect this and drop all the function parameters
      except the node ID, as they are all zero anyway.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-19-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc9331a1
    • Mike Rapoport's avatar
      mm: free_area_init: allow defining max_zone_pfn in descending order · 51930df5
      Mike Rapoport authored
      Some architectures (e.g.  ARC) have ZONE_HIGHMEM below ZONE_NORMAL.
      Allowing free_area_init() to parse the max_zone_pfn array even when it
      is sorted in descending order makes free_area_init() usable on such
      architectures.
      
      Add top -> down traversal of the max_zone_pfn array in free_area_init()
      and use free_area_init() in the ARC node/zone initialization.
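      The traversal can be sketched as follows. This is a simplified model
      with hypothetical names, not the kernel code: each zone's [lowest,
      highest) PFN range is derived from max_zone_pfn[], walking the array
      in reverse when the architecture defines its zones in descending
      order.

```c
#include <assert.h>

#define MAX_NR_ZONES 3	/* illustrative zone count */

static void compute_zone_ranges(const unsigned long *max_zone_pfn,
				int descending, unsigned long start_pfn,
				unsigned long *lowest, unsigned long *highest)
{
	for (int i = 0; i < MAX_NR_ZONES; i++) {
		/* Walk top -> down when zones are in descending order. */
		int zone = descending ? MAX_NR_ZONES - i - 1 : i;
		unsigned long end = max_zone_pfn[zone] > start_pfn ?
				    max_zone_pfn[zone] : start_pfn;

		lowest[zone] = start_pfn;
		highest[zone] = end;
		start_pfn = end;	/* next zone begins where this ends */
	}
}
```

      Either way the ranges stay contiguous and non-overlapping; only the
      order in which zone indices are visited changes.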
      
      [rppt@kernel.org: ARC fix]
        Link: http://lkml.kernel.org/r/20200504153901.GM14260@kernel.org
      [rppt@linux.ibm.com: arc: free_area_init(): take into account PAE40 mode]
        Link: http://lkml.kernel.org/r/20200507205900.GH683243@linux.ibm.com
      [akpm@linux-foundation.org: declare arch_has_descending_max_zone_pfns()]
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Link: http://lkml.kernel.org/r/20200412194859.12663-18-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51930df5
    • Mike Rapoport's avatar
      mm: remove early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODES · acd3f5c4
      Mike Rapoport authored
      The memmap_init() function was made to iterate over memblock regions
      and, as a result, the early_pfn_in_nid() function became obsolete.
      Since CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a
      real implementation of early_pfn_in_nid(), it is not needed anymore
      either.
      
      Remove both early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODES.
      Co-developed-by: Hoan Tran <Hoan@os.amperecomputing.com>
      Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-17-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acd3f5c4
    • Baoquan He's avatar
      mm: memmap_init: iterate over memblock regions rather that check each PFN · 73a6e474
      Baoquan He authored
      When called during boot the memmap_init_zone() function checks if each PFN
      is valid and actually belongs to the node being initialized using
      early_pfn_valid() and early_pfn_in_nid().
      
      Each such check may cost up to O(log(n)), where n is the number of
      memory banks, so for a large amount of memory the overall time spent in
      early_pfn*() becomes substantial.
      
      Since the information is present in memblock anyway, we can iterate
      over memblock memory regions in memmap_init() and only call
      memmap_init_zone() for PFN ranges that are known to be valid and in the
      appropriate node.
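      A toy model of the change: rather than asking "is this PFN valid and
      in this node?" for every single PFN, walk the list of known-valid
      ranges for the node and initialize each range wholesale. The range
      array and init_range() below stand in for memblock and
      memmap_init_zone(); all names are illustrative.

```c
#include <assert.h>

/* Simplified model of a memblock PFN range with its node ID. */
struct pfn_range { unsigned long start; unsigned long end; int nid; };

/* Stand-in for memmap_init_zone(): "initializes" a whole range at once
 * and reports how many pages that covered. */
static unsigned long init_range(unsigned long start, unsigned long end)
{
	return end - start;
}

/* One pass over the regions replaces millions of per-PFN checks. */
static unsigned long memmap_init_model(const struct pfn_range *ranges,
				       int cnt, int nid)
{
	unsigned long initialized = 0;

	for (int i = 0; i < cnt; i++)
		if (ranges[i].nid == nid)
			initialized += init_range(ranges[i].start,
						  ranges[i].end);
	return initialized;
}
```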
      
      [cai@lca.pw: fix a compilation warning from Clang]
        Link: http://lkml.kernel.org/r/CF6E407F-17DC-427C-8203-21979FB882EF@lca.pw
      [bhe@redhat.com: fix the incorrect hole in fast_isolate_freepages()]
        Link: http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
        Link: http://lkml.kernel.org/r/20200521014407.29690-1-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200412194859.12663-16-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73a6e474
    • Mike Rapoport's avatar
      xtensa: simplify detection of memory zone boundaries · da50c57b
      Mike Rapoport authored
      free_area_init() only requires the definition of the maximal PFN for
      each of the supported zones rather than the calculation of actual zone
      sizes and the sizes of the holes between the zones.
      
      After the removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP, free_area_init() is
      available to all architectures.
      
      Using this function instead of free_area_init_node() simplifies the zone
      detection.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-15-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da50c57b