• Oscar Salvador's avatar
    mm,memory_hotplug: allocate memmap from the added memory range · a08a2ae3
    Oscar Salvador authored
    Physical memory hotadd has to allocate a memmap (struct page array) for
    the newly added memory section.  Currently, alloc_pages_node() is used
    for those allocations.
    
    This has some disadvantages:
     a) an existing memory is consumed for that purpose
        (eg: ~2MB per 128MB memory section on x86_64)
        This can even lead to extreme cases where system goes OOM because
        the physically hotplugged memory depletes the available memory before
        it is onlined.
     b) if the whole node is movable then we have off-node struct pages
        which has performance drawbacks.
     c) It might be there are no PMD_ALIGNED chunks so memmap array gets
        populated with base pages.
    
    This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
    
    Vmemap page tables can map arbitrary memory.  That means that we can
    reserve a part of the physically hotadded memory to back vmemmap page
    tables.  This implementation uses the beginning of the hotplugged memory
    for that purpose.
    
    There are some non-obviously things to consider though.
    
    Vmemmap pages are allocated/freed during the memory hotplug events
    (add_memory_resource(), try_remove_memory()) when the memory is
    added/removed.  This means that the reserved physical range is not
    online although it is used.  The most obvious side effect is that
    pfn_to_online_page() returns NULL for those pfns.  The current design
    expects that this should be OK as the hotplugged memory is considered a
    garbage until it is onlined.  For example hibernation wouldn't save the
    content of those vmmemmaps into the image so it wouldn't be restored on
    resume but this should be OK as there no real content to recover anyway
    while metadata is reachable from other data structures (e.g.  vmemmap
    page tables).
    
    The reserved space is therefore (de)initialized during the {on,off}line
    events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
    allocator independent initialization from the regular onlining path.
    The primary reason to handle the reserved space outside of
    {on,off}line_pages is to make each initialization specific to the
    purpose rather than special case them in a single function.
    
    As per above, the functions that are introduced are:
    
     - mhp_init_memmap_on_memory:
       Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
       kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
       fully span.
    
     - mhp_deinit_memmap_on_memory:
       Offlines as many sections as vmemmap pages fully span, removes the
       range from zhe zone by remove_pfn_range_from_zone(), and calls
       kasan_remove_zero_shadow() for the range.
    
    The new function memory_block_online() calls mhp_init_memmap_on_memory()
    before doing the actual online_pages().  Should online_pages() fail, we
    clean up by calling mhp_deinit_memmap_on_memory().  Adjusting of
    present_pages is done at the end once we know that online_pages()
    succedeed.
    
    On offline, memory_block_offline() needs to unaccount vmemmap pages from
    present_pages() before calling offline_pages().  This is necessary because
    offline_pages() tears down some structures based on the fact whether the
    node or the zone become empty.  If offline_pages() fails, we account back
    vmemmap pages.  If it succeeds, we call mhp_deinit_memmap_on_memory().
    
    Hot-remove:
    
     We need to be careful when removing memory, as adding and
     removing memory needs to be done with the same granularity.
     To check that this assumption is not violated, we check the
     memory range we want to remove and if a) any memory block has
     vmemmap pages and b) the range spans more than a single memory
     block, we scream out loud and refuse to proceed.
    
     If all is good and the range was using memmap on memory (aka vmemmap pages),
     we construct an altmap structure so free_hugepage_table does the right
     thing and calls vmem_altmap_free instead of free_pagetable.
    
    Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.deSigned-off-by: default avatarOscar Salvador <osalvador@suse.de>
    Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
    Acked-by: default avatarMichal Hocko <mhocko@suse.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    a08a2ae3
memory.c 22.1 KB