    mm: make alloc_contig_range handle free hugetlb pages · 369fa227
    Oscar Salvador authored
    alloc_contig_range will fail if it ever sees a HugeTLB page within the
    range we are trying to allocate, even when that page is free and can be
    easily reallocated.
    
    This has proved to be problematic for some users of alloc_contig_range,
    e.g. CMA and virtio-mem, which would fail the call even when those
    pages lie in ZONE_MOVABLE and are free.
    
    We can do better by trying to replace such a page.
    
    Free hugepages are tricky to handle: so that no userspace application
    notices any disruption, we need to replace the current free hugepage
    with a new one.
    
    In order to do that, a new function called alloc_and_dissolve_huge_page is
    introduced.  This function will first try to get a new fresh hugepage, and
    if it succeeds, it will replace the old one in the free hugepage pool.
    
    The free page replacement is done under hugetlb_lock, so no external users
    of hugetlb will notice the change.  To allocate the new huge page, we use
    alloc_buddy_huge_page(), so we do not have to deal with any counters, and
    prep_new_huge_page() is not called.  This is valuable because in case we
    need to free the new page, we only need to call __free_pages().
    
    Once we know that the page to be replaced is a genuine 0-refcounted huge
    page, we remove the old page from the freelist by remove_hugetlb_page().
    Then, we can call __prep_new_huge_page() and
    __prep_account_new_huge_page() for the new huge page to properly
    initialize it and increment the hstate->nr_huge_pages counter (previously
    decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
    enqueue_huge_page() and it is ready to be used.
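
    The replacement sequence described above can be sketched as a small
    userspace model.  This is only an illustration, not the kernel code:
    every model_* name is hypothetical, a C11 atomic_flag stands in for
    hugetlb_lock, calloc()/free() stand in for alloc_buddy_huge_page() and
    __free_pages(), and both counters are updated in one place for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_page {
    bool in_pool;               /* stands in for "is a free hugetlb page" */
    struct model_page *next;    /* free-list link */
};

struct model_pool {
    atomic_flag lock;           /* stands in for hugetlb_lock */
    struct model_page *free_list;
    unsigned long nr_pages;     /* stands in for hstate->nr_huge_pages */
    unsigned long nr_free;      /* number of pages on the free list */
};

void model_pool_init(struct model_pool *p)
{
    atomic_flag_clear(&p->lock);
    p->free_list = NULL;
    p->nr_pages = 0;
    p->nr_free = 0;
}

void model_lock(struct model_pool *p)
{
    while (atomic_flag_test_and_set(&p->lock))
        ; /* spin until the flag was previously clear */
}

void model_unlock(struct model_pool *p)
{
    atomic_flag_clear(&p->lock);
}

/* Put a page on the free list and account for it. */
void model_enqueue(struct model_pool *p, struct model_page *page)
{
    assert(!page->in_pool);
    page->in_pool = true;
    page->next = p->free_list;
    p->free_list = page;
    p->nr_pages++;
    p->nr_free++;
}

/* Unlink a page from the free list; returns false if it is not there. */
bool model_remove(struct model_pool *p, struct model_page *page)
{
    for (struct model_page **pp = &p->free_list; *pp; pp = &(*pp)->next) {
        if (*pp == page) {
            *pp = page->next;
            page->next = NULL;
            page->in_pool = false;
            p->nr_pages--;
            p->nr_free--;
            return true;
        }
    }
    return false;
}

/*
 * Replace a free page: allocate first without touching any counter, then
 * swap old for new while holding the lock.  On failure the new page is
 * simply freed, because nothing was accounted for it yet.
 */
struct model_page *model_replace_free_page(struct model_pool *p,
                                           struct model_page *old)
{
    struct model_page *fresh = calloc(1, sizeof(*fresh));

    if (!fresh)
        return NULL;

    model_lock(p);
    if (!model_remove(p, old)) {
        model_unlock(p);
        free(fresh);
        return NULL;
    }
    model_enqueue(p, fresh);    /* restores the counters dropped above */
    model_unlock(p);
    return fresh;
}
```

    In this model the counters read the same before and after a successful
    replacement, which is the property that keeps the swap invisible to
    other users of the pool.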
    
    There is one tricky case when the page's refcount is 0 because it is in
    the process of being released.  A missing PageHugeFreed bit will tell us
    that freeing is in flight, so we retry after dropping the hugetlb_lock.
    The race window should be small and the next retry should make forward
    progress.
    
    E.g.:
    
    CPU0				CPU1
    free_huge_page()		isolate_or_dissolve_huge_page
    				  PageHuge() == T
    				  alloc_and_dissolve_huge_page
    				    alloc_buddy_huge_page()
    				    spin_lock_irq(hugetlb_lock)
    				    // PageHuge() && !PageHugeFreed &&
    				    // !PageCount()
    				    spin_unlock_irq(hugetlb_lock)
      spin_lock_irq(hugetlb_lock)
      1) update_and_free_page
           PageHuge() == F
           __free_pages()
      2) enqueue_huge_page
           SetPageHugeFreed()
      spin_unlock_irq(&hugetlb_lock)
    				  spin_lock_irq(hugetlb_lock)
    				  1) PageHuge() == F (freed by case#1 from CPU0)
    				  2) PageHuge() == T
    				       PageHugeFreed() == T
    				       - proceed with replacing the page
    
    In the case above we retry, as the race window is quite small and we have
    a high chance of succeeding next time.
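
    The decision taken on each pass under the lock can be modelled as
    follows.  This is a hedged sketch rather than the actual implementation:
    the model_* names, the callback that plays the role of CPU0, and the
    max_attempts bound (added only so the model always terminates) are all
    assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the page state inspected under hugetlb_lock. */
struct model_hpage {
    bool huge;          /* models PageHuge() */
    int refcount;       /* models the page refcount */
    bool freed_flag;    /* models PageHugeFreed() */
};

enum model_outcome {
    MODEL_REPLACE,          /* genuine free huge page: replace it */
    MODEL_ALREADY_FREED,    /* no longer a huge page: nothing to do */
    MODEL_BUSY,             /* still referenced: cannot be replaced */
    MODEL_RETRY             /* free is in flight: look again */
};

/* What one pass sees while the lock is held. */
enum model_outcome model_classify(const struct model_hpage *pg)
{
    if (!pg->huge)
        return MODEL_ALREADY_FREED;
    if (pg->refcount > 0)
        return MODEL_BUSY;
    if (!pg->freed_flag)
        return MODEL_RETRY;
    return MODEL_REPLACE;
}

/* CPU0 finishing via case 1: the page goes back to the buddy allocator. */
void model_cpu0_frees_to_buddy(struct model_hpage *pg)
{
    pg->huge = false;
}

/* CPU0 finishing via case 2: the page lands on the free list. */
void model_cpu0_enqueues(struct model_hpage *pg)
{
    pg->freed_flag = true;
}

/*
 * Retry loop.  other_cpu (may be NULL) runs between passes, i.e. while
 * the lock is dropped.  *attempts counts the passes that were made.
 */
enum model_outcome model_dissolve_loop(struct model_hpage *pg,
                                       void (*other_cpu)(struct model_hpage *),
                                       int max_attempts, int *attempts)
{
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        enum model_outcome out = model_classify(pg);    /* lock held */

        (*attempts)++;
        if (out != MODEL_RETRY)
            return out;
        if (other_cpu)
            other_cpu(pg);                              /* lock dropped */
    }
    return MODEL_RETRY;
}
```

    Starting from a page with refcount 0 and no freed flag, the model needs
    a second pass in both branches of the diagram: it either finds the page
    gone or finds it properly enqueued and proceeds with the replacement.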
    
    With regard to the allocation, we restrict it to the node the page belongs
    to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.
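
    The effect of the node restriction can be illustrated with a toy
    allocator.  This is a sketch under stated assumptions: the per-node
    counters, the model_alloc_on_node() helper, and the fallback order by
    ascending node id are hypothetical and do not reflect how the kernel
    orders its zonelists.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_NODES 2

/*
 * Take one page from free_pages[nid] if possible.  With this_node_only
 * set (modelling __GFP_THISNODE) no other node is tried.  Returns the
 * node the page came from, or -1 if the allocation failed.
 */
int model_alloc_on_node(int free_pages[MODEL_NR_NODES], int nid,
                        bool this_node_only)
{
    assert(nid >= 0 && nid < MODEL_NR_NODES);

    if (free_pages[nid] > 0) {
        free_pages[nid]--;
        return nid;
    }
    if (this_node_only)
        return -1;

    /* Hypothetical fallback order: ascending node id. */
    for (int n = 0; n < MODEL_NR_NODES; n++) {
        if (free_pages[n] > 0) {
            free_pages[n]--;
            return n;
        }
    }
    return -1;
}
```

    With node 0 empty and node 1 populated, the restricted request fails
    instead of returning a page from node 1, so a successful replacement
    always stays on the node of the page being replaced.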
    
    Note that gigantic hugetlb pages are fenced off since there is a cyclic
    dependency between them and alloc_contig_range.
    
    Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
    Signed-off-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>