    hugetlb: add demote hugetlb page sysfs interfaces · 79dfc695
    Mike Kravetz authored
    Patch series "hugetlb: add demote/split page functionality", v4.
    
    The concurrent use of multiple hugetlb page sizes on a single system is
    becoming more common.  One of the reasons is better TLB support for
    gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
    being used to back VMs in hosting environments.
    
    When using hugetlb pages to back VMs, it is often desirable to
    preallocate hugetlb pools.  This avoids the delay and uncertainty of
    allocating hugetlb pages at VM startup.  In addition, preallocating huge
    pages minimizes the issue of memory fragmentation that increases the
    longer the system is up and running.
    
    In such environments, a combination of larger and smaller hugetlb pages
    is preallocated in anticipation of backing VMs of various sizes.  Over
    time, the preallocated pool of smaller hugetlb pages may become depleted
    while larger hugetlb pages still remain.  In such situations, it is
    desirable to convert larger hugetlb pages to smaller hugetlb pages.
    
    Converting larger to smaller hugetlb pages can be accomplished today by
    first freeing the larger page to the buddy allocator and then allocating
    the smaller pages.  For example, to convert 50 1GB pages to 2MB pages on
    x86:
    
      gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
      m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
      echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
      echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
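
    The figure 25600 in the example above follows from the page sizes: one
    1GB page (1048576 kB) covers 512 pages of 2MB (2048 kB), and 50 * 512 =
    25600.  A minimal sketch of that arithmetic, where the helper name
    equivalent_small_pages is hypothetical and not part of this series:

```shell
# Hypothetical helper (not part of the patch series): number of smaller
# huge pages covering the same memory as a number of larger huge pages.
# Sizes are in kB, matching the hugepages-<size>kB directory names.
equivalent_small_pages() (
    large_count=$1     # number of larger pages to convert
    large_kb=$2        # size of one larger page in kB
    small_kb=$3        # size of one smaller page in kB
    echo $(( large_count * (large_kb / small_kb) ))
)

# 50 pages of 1GB expressed as 2MB pages; prints 25600
equivalent_small_pages 50 1048576 2048
```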
    
    On an idle system this operation is fairly reliable: the number of 2MB
    pages increases as expected and the operation takes a second or two.
    
    However, when there is activity on the system the following issues
    arise:
    
    1) This process can take quite some time, especially if allocation of
       the smaller pages is not immediate and requires migration/compaction.
    
    2) There is no guarantee that the total size of smaller pages allocated
       will match the size of the larger page which was freed. This is
       because the area freed by the larger page could quickly be
       fragmented.
    
    In a test environment with a load that continually fills the page cache
    with clean pages, results such as the following can be observed:
    
      Unexpected number of 2MB pages allocated: Expected 25600, have 19944
      real    0m42.092s
      user    0m0.008s
      sys     0m41.467s
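
    To put the shortfall above in perspective, the arithmetic can be
    sketched as follows.  The helper name shortfall_report is hypothetical;
    the inputs are the expected and observed page counts from the output
    above and the 2MB page size in kB.

```shell
# Hypothetical helper: report how far an allocation fell short of the
# number of pages that was expected.
shortfall_report() (
    expected=$1   # pages expected
    actual=$2     # pages actually obtained
    page_kb=$3    # page size in kB
    missing=$(( expected - actual ))
    missing_mb=$(( missing * page_kb / 1024 ))
    percent=$(( actual * 100 / expected ))   # integer division, rounds down
    echo "missing_pages=$missing missing_mb=$missing_mb obtained_percent=$percent"
)

# prints: missing_pages=5656 missing_mb=11312 obtained_percent=77
shortfall_report 25600 19944 2048
```

    In other words, roughly 11 GB of the 50 GB that was freed could not be
    reclaimed as 2MB huge pages in that run.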
    
    To address these issues, introduce the concept of hugetlb page demotion.
    Demotion provides a means of 'in place' splitting of a hugetlb page to
    pages of a smaller size.  This avoids freeing pages to buddy and then
    trying to allocate from buddy.
    
    Page demotion is controlled via sysfs files that reside in the per-hugetlb
    page size and per node directories.
    
     - demote_size
            Target page size for demotion, a smaller huge page size. File
            can be written to choose a smaller huge page size if multiple are
            available.
    
     - demote
            Writable number of hugetlb pages to be demoted
    
    To demote 50 1GB huge pages, one would:
    
      cat .../hugepages-1048576kB/free_hugepages   # optional, verify free pages
      cat .../hugepages-1048576kB/demote_size      # optional, verify target size
      echo 50 > .../hugepages-1048576kB/demote
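
    The optional checks above could be folded into a small wrapper.  This
    is only a sketch: the function name demote_free_pages is hypothetical,
    and the per-size directory is passed in by the caller because its full
    path is elided above.  The wrapper caps the request at the current
    free_hugepages count; the kernel may still demote fewer pages.

```shell
# Hypothetical helper: request demotion of up to $2 pages from the
# per-size hugetlb directory $1, never asking for more pages than are
# currently reported free.
demote_free_pages() (
    dir=$1     # e.g. a hugepages-1048576kB directory, supplied by the caller
    want=$2    # maximum number of pages to demote
    free=$(cat "$dir/free_hugepages") || exit 1
    if [ "$free" -lt "$want" ]; then
        want=$free
    fi
    if [ "$want" -gt 0 ]; then
        echo "$want" > "$dir/demote" || exit 1
    fi
    echo "$want"    # number of pages that was requested for demotion
)
```

    For example, `demote_free_pages .../hugepages-1048576kB 50` would
    write at most 50 to the demote file and print the value it wrote.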
    
    Only hugetlb pages which are free at the time of the request can be
    demoted.  Demotion does not add to the complexity of surplus pages and
    honors reserved huge pages.  Therefore, when a value is written to the
    sysfs demote file, that value is only the maximum number of pages which
    will be demoted.  It is possible fewer will actually be demoted.  The
    recently introduced per-hstate mutex is used to synchronize demote
    operations with other operations that modify hugetlb pools.
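
    Because the written value is only an upper bound, a caller may want to
    check how many pages were actually demoted.  The sketch below assumes
    that each demoted page lowers nr_hugepages of the larger size by one
    and that nothing else resizes the pool in the meantime; both are
    assumptions, and the helper name demote_and_count is hypothetical.

```shell
# Hypothetical helper: request demotion of $2 pages in the per-size
# hugetlb directory $1 and print how many pages left the pool, measured
# as the drop in nr_hugepages.
# Assumption: no other operation changes this pool while the helper runs.
demote_and_count() (
    dir=$1
    request=$2
    before=$(cat "$dir/nr_hugepages") || exit 1
    echo "$request" > "$dir/demote" || exit 1
    after=$(cat "$dir/nr_hugepages") || exit 1
    echo $(( before - after ))
)
```

    A result smaller than the requested value means that some pages were
    not free, or were reserved, at the time of the request.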
    
    Real world use cases
    --------------------
    The above scenario describes a real world use case where hugetlb pages
    are used to back VMs on x86.  Both issues of long allocation times and
    not necessarily getting the expected number of smaller huge pages after
    a free and allocate cycle have been experienced.  The occurrence of
    these issues depends on other activity within the host and cannot be
    predicted.
    
    This patch (of 5):
    
    Two new sysfs files are added to demote hugetlb pages.  These files are
    both per-hugetlb page size and per node.  Files are:
    
      demote_size - The size in kB that pages are demoted to. (read-write)
      demote - The number of huge pages to demote. (write-only)
    
    By default, demote_size is the next smaller huge page size.  Any valid
    huge page size smaller than the huge page size of the directory may be
    written to this file.  When huge pages are demoted, they are demoted to
    this size.
    
    Writing a value to demote will result in an attempt to demote that
    number of hugetlb pages to an appropriate number of demote_size pages.
    
    NOTE: Demote interfaces are only provided for huge page sizes if there
    is a smaller target demote huge page size.  For example, on x86 1GB huge
    pages will have demote interfaces.  2MB huge pages will not have demote
    interfaces.
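
    Since the files exist only for sizes that have a smaller target, a
    script can test for them before attempting a demotion.  A minimal
    sketch, with the hypothetical helper name has_demote_interface and the
    per-size directory supplied by the caller:

```shell
# Hypothetical helper: succeed only if the per-size hugetlb directory $1
# exposes both demote files described above.
has_demote_interface() (
    dir=$1
    [ -e "$dir/demote" ] && [ -e "$dir/demote_size" ]
)
```

    On x86 this check would be expected to succeed for the 1GB directory
    and fail for the 2MB directory.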
    
    This patch does not provide full demote functionality.  It only provides
    the sysfs interfaces.
    
    It also provides documentation for the new interfaces.
    
    [mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
      Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
    
    Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: David Rientjes <rientjes@google.com>
    Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
    Cc: Nghia Le <nghialm78@gmail.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>