    Merge branch 'hugepage-fallbacks' (hugepage patches from David Rientjes) · edf445ad
    Linus Torvalds authored
    Merge hugepage allocation updates from David Rientjes:
     "We (mostly Linus, Andrea, and myself) have been discussing offlist how
      to implement a sane default allocation strategy for hugepages on NUMA
      platforms.
    
      With these reverts in place, the page allocator will happily allocate
      a remote hugepage immediately rather than try to make a local hugepage
      available. This incurs a substantial performance degradation when
      memory compaction would have otherwise made a local hugepage
      available.
    
      This series reverts those reverts and attempts to propose a more sane
      default allocation strategy specifically for hugepages. Andrea
      acknowledges this is likely to fix the swap storms that he originally
      reported that resulted in the patches that removed __GFP_THISNODE from
      hugepage allocations.
    
      The immediate goal is to return 5.3 to the behavior the kernel has
      implemented over the past several years so that remote hugepages are
      not immediately allocated when local hugepages could have been made
      available because the increased access latency is untenable.
    
      The next goal is to introduce a sane default allocation strategy for
      hugepage allocations in general regardless of the configuration of
      the system so that we prevent thrashing of local memory when
      compaction is unlikely to succeed and can prefer remote hugepages over
      remote native pages when the local node is low on memory."
    
    Note on timing: this reverts the hugepage VM behavior changes that got
    introduced fairly late in the 5.3 cycle, and that fixed a huge
    performance regression for certain loads that had been around since
    4.18.
    
    Andrea had this note:
    
     "The regression of 4.18 was that it was taking hours to start a VM
      where 3.10 was only taking a few seconds, I reported all the details
      on lkml when it was finally tracked down in August 2018.
    
         https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/
    
      __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
      workload degrade like in the "current upstream" above. And it still
      would have been that bad as above until 5.3-rc5"
    
    where the bad behavior ends up happening as you fill up a local node,
    and without that change, you'd get into the nasty swap storm behavior
    due to compaction working overtime to make room for more memory on the
    nodes.
    
    As a result 5.3 got the two performance fix reverts in rc5.
    
    However, David Rientjes then noted that those performance fixes in turn
    regressed performance for other loads - although not quite to the same
    degree.  He suggested reverting the reverts and instead replacing them
    with two small changes to how hugepage allocations are done (patch
    descriptions rephrased by me):
    
     - "avoid expensive reclaim when compaction may not succeed": just admit
       that the allocation failed when you're trying to allocate a huge-page
       and compaction wasn't successful.
    
     - "allow hugepage fallback to remote nodes when madvised": when that
       node-local huge-page allocation failed, retry without forcing the
       local node.
    
    but by then I judged it too late to replace the fixes for a 5.3 release.
    So 5.3 was released with behavior that harked back to the pre-4.18 logic.
    
    But now we're in the merge window for 5.4, and we can see if this
    alternate model fixes not just the horrendous swap storm behavior, but
    also the performance regression that the late reverts caused.
    
    Fingers crossed.
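
    For readers who want the two changes described above in concrete
    form, here is a rough userspace sketch of the intended policy. It is
    a toy model and not the kernel code: struct node_state, try_node()
    and alloc_hugepage_sketch() are hypothetical names, and "compaction"
    is reduced to a single flag. It only shows the ordering: try the
    local node without expensive reclaim, admit failure if compaction
    cannot help, and retry without forcing the local node only when the
    region was madvised.

```c
#include <stdbool.h>

#define ALLOC_FAILED (-1)

/* Toy model of a NUMA node. All names here are hypothetical. */
struct node_state {
    int id;
    int free_hugepages;        /* hugepages available right now */
    bool compaction_can_help;  /* could compaction produce one more? */
};

/*
 * Try to take one hugepage from a single node.
 * Compaction may be attempted, but expensive reclaim never is: if
 * compaction cannot make a hugepage available, just admit failure.
 */
static int try_node(struct node_state *n)
{
    if (n->free_hugepages > 0) {
        n->free_hugepages--;
        return n->id;
    }
    if (n->compaction_can_help) {
        /* Model compaction producing one hugepage, used immediately. */
        n->compaction_can_help = false;
        return n->id;
    }
    return ALLOC_FAILED;
}

/*
 * Node-local attempt first. Only if that fails and the region was
 * madvised, retry without forcing the local node (modeled here as
 * trying one remote node). Returns the node id or ALLOC_FAILED.
 */
static int alloc_hugepage_sketch(struct node_state *local,
                                 struct node_state *remote,
                                 bool madvised)
{
    int got = try_node(local);

    if (got != ALLOC_FAILED)
        return got;
    if (madvised)
        return try_node(remote);
    return ALLOC_FAILED;
}
```

    In this model a failed non-madvised allocation simply returns
    ALLOC_FAILED, which stands in for the caller falling back to
    native-sized pages; a madvised one may end up on the remote node.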
    
    * emailed patches from David Rientjes <rientjes@google.com>:
      mm, page_alloc: allow hugepage fallback to remote nodes when madvised
      mm, page_alloc: avoid expensive reclaim when compaction may not succeed
      Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
      Revert "Revert "mm, thp: restore node-local hugepage allocations""