    arm64: mm: hugetlb: enable HUGETLB_PAGE_FREE_VMEMMAP for arm64 · 1e63ac08
    Muchun Song authored
    The feature of minimizing overhead of struct page associated with each
    HugeTLB page aims to free its vmemmap pages (used as struct page) to save
    memory, which amounts to ~14GB/16GB per 1TB of HugeTLB pages (2MB/1GB
    type).  In short, when a HugeTLB page is allocated or freed, the vmemmap
    array representing the range associated with the page needs to be
    remapped.  When a page is allocated, vmemmap pages are freed after
    remapping.  When a page is freed, previously discarded vmemmap pages must
    be allocated before remapping.  More implementation details can be found
    here [1].
    
    The infrastructure for freeing vmemmap pages associated with each HugeTLB
    page is already there, so we can easily enable HUGETLB_PAGE_FREE_VMEMMAP
    for arm64; the only thing that needs fixing is flush_dcache_page().
    
    flush_dcache_page() needs to be adapted to operate on the head page's
    flags, since the tail vmemmap pages are mapped read-only after the feature
    is enabled (a clear operation on them is not permitted).
    
    There were some discussions about this in the thread [2], but no
    conclusion was reached in the end.  I have copied the concerns raised by
    Anshuman here and explain why they are superfluous.  It is safe to enable
    this for arm64 just as it is for x86_64.
    
    1st concern:
    '''
    But what happens when a hot remove section's vmemmap area (which is
    being teared down) is nearby another vmemmap area which is either created
    or being destroyed for HugeTLB alloc/free purpose. As you mentioned
    HugeTLB pages inside the hot remove section might be safe. But what about
    other HugeTLB areas whose vmemmap area shares page table entries with
    vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
    /use/free test cycle using memory just adjacent to a memory hotplug area,
    which is always added and removed periodically, should be able to expose
    this problem.
    '''
    
    Answer: At the time memory is removed, all HugeTLB pages have either been
    migrated away or dissolved.  So there is no race between memory hot remove
    and free_huge_page_vmemmap(), and HugeTLB pages inside the hot-removed
    section are safe.  As for the question "what about other HugeTLB areas
    whose vmemmap area shares page table entries with vmemmap entries for a
    section being hot removed?", the situation cannot arise.  The minimal
    granularity of hotplug memory is 128MB (on arm64 with a 4K base page), so
    any HugeTLB page smaller than 128MB lies entirely within one section: it
    shares no PTE page tables with HugeTLB pages in other sections, a HugeTLB
    page cannot cross two sections, and while such pages exist the section
    cannot be freed.  Any HugeTLB page bigger than 128MB (the section size)
    has vmemmap whose size is an integer multiple of 2MB and is therefore
    PMD-mapped.  As long as:
    
      1) HugeTLBs are naturally aligned, power-of-two sizes
      2) The HugeTLB size >= the section size
      3) The HugeTLB size >= the vmemmap leaf mapping size
    
    Then a HugeTLB page will not share any leaf page table entries with
    *anything else*, though it will share intermediate entries.  In this case,
    at the time memory is removed, all HugeTLB pages have either been migrated
    away or dissolved, so there is again no race between memory hot remove and
    free_huge_page_vmemmap().
    
    2nd concern:
    '''
    differently, not sure if ptdump would require any synchronization.
    
    Dumping an wrong value is probably okay but crashing because a page table
    entry is being freed after ptdump acquired the pointer is bad. On arm64,
    ptdump() is protected against hotremove via [get|put]_online_mems().
    '''
    
    Answer: The ptdump should be fine since vmemmap_remap_free() only
    exchanges PTEs or splits a PMD entry (which means allocating a PTE page
    table).  Neither operation frees any page tables (PTE), so ptdump cannot
    run into a use-after-free on any page table.  The worst case is just
    dumping a wrong value.
    
    [1] https://lore.kernel.org/all/20210510030027.56044-1-songmuchun@bytedance.com/
    [2] https://lore.kernel.org/all/20210518091826.36937-1-songmuchun@bytedance.com/
    
    [songmuchun@bytedance.com: restructure the code comment inside flush_dcache_page()]
      Link: https://lkml.kernel.org/r/20220414072646.21910-1-songmuchun@bytedance.com
    Link: https://lkml.kernel.org/r/20220331065640.5777-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Barry Song <baohua@kernel.org>
    Tested-by: Barry Song <baohua@kernel.org>
    Cc: Will Deacon <will@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>