    powerpc/mm: Fix numa reserve bootmem page selection · 06eccea6
    Fix the powerpc NUMA reserve bootmem page selection logic.
    
    commit 8f64e1f2 (powerpc: Reserve
    in bootmem lmb reserved regions that cross NUMA nodes) changed
    the logic for how the powerpc LMB reserved regions were converted
    to bootmem reserved regions.  As the following discussion shows,
    the new logic was not correct.
    
    mark_reserved_regions_for_nid() goes through each LMB on the
    system that specifies a reserved area.  It searches for
    active regions that intersect with that LMB and are on the
    specified node.  It attempts to bootmem-reserve only the area
    where the active region and the reserved LMB intersect.  We
    cannot reserve things on other nodes as they may not have
    bootmem structures allocated yet.
    
    We base the size of the bootmem reservation on two possible
    things.  Normally, we just make the reservation start and
    stop exactly at the start and end of the LMB.
    
    However, the LMB reservations are not aware of NUMA nodes and
    on occasion a single LMB may cross into several adjacent
    active regions.  Those may even be on different NUMA nodes
    and will require separate calls to the bootmem reserve
    functions.  So, the bootmem reservation must be trimmed to
    fit inside the current active region.
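
    Conceptually, each reservation is the byte-wise intersection of
    the reserved LMB with the node's active region.  A rough sketch
    (illustrative, mirroring the names used above rather than the
    exact numa.c code):

      /* reserved LMB:       [physbase, physbase + size)        */
      /* node active region: [node_ar.start_pfn << PAGE_SHIFT,  */
      /*                      node_ar.end_pfn   << PAGE_SHIFT)  */
      unsigned long start = max(physbase,
                                node_ar.start_pfn << PAGE_SHIFT);
      unsigned long end   = min(physbase + size,
                                node_ar.end_pfn << PAGE_SHIFT);
      if (start < end)
              reserve_bootmem_node(node, start, end - start);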
    
    That's all fine and dandy, but we trim the reservation
    in a page-aligned fashion.  That's bad because we start the
    reservation at a non-page-aligned address: physbase.
    
    The reservation may only span 2 bytes, but those bytes may
    span two pfns and cause a reserve_size of 2*PAGE_SIZE.
    
    Take the case where you reserve 0x2 bytes at 0x0fff and
    where the active region ends at 0x1000.  You'll jump into
    that if() statement, but node_ar.end_pfn=0x1 and
    start_pfn=0x0.  You'll end up with a reserve_size=0x1000,
    and then call
    
      reserve_bootmem_node(node, physbase=0xfff, size=0x1000);
    
    0x1000 may not be on the same node as 0xfff.  Oops.
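
    Working the example numbers through a start_pfn-based trim (the
    math the current code effectively does) makes this concrete:

      start_pfn    = physbase >> PAGE_SHIFT;        /* 0xfff >> 12 == 0x0     */
      reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
                     - (start_pfn << PAGE_SHIFT);   /* 0x1000 - 0x0 == 0x1000 */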
    
    In almost all the vm code, end_<anything> is not inclusive.
    If you have an end_pfn of 0x1234, page 0x1234 is not
    included in the range.  Using PFN_UP() instead of a plain
    (>> PAGE_SHIFT) will make this consistent with the other VM
    code.
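
    PFN_UP() rounds an address up to the next page frame, while a
    plain shift truncates.  It is defined in include/linux/pfn.h
    along the lines of:

      #define PFN_UP(x)  (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

      end_pfn = (physbase + size) >> PAGE_SHIFT;  /* 0x1001 >> 12 == 0x1 */
      end_pfn = PFN_UP(physbase + size);          /* rounds up to 0x2    */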
    
    We also need to do math for the reserved size with physbase
    instead of start_pfn.  node_ar.end_pfn << PAGE_SHIFT is
    *precisely* the end of the node.  However,
    (start_pfn << PAGE_SHIFT) is *NOT* precisely the beginning
    of the reserved area.  That is, of course, physbase.
    If we don't use physbase here, the reserve_size can be
    made too large.
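
    So the trim ends up looking roughly like this (a sketch of the
    intended computation, not the verbatim patch):

      end_pfn = PFN_UP(physbase + size);
      if (end_pfn > node_ar.end_pfn)
              reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
                             - physbase;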
    
    From: Dave Hansen <dave@linux.vnet.ibm.com>
    Tested-by: Geoff Levand <geoffrey.levand@am.sony.com>  Tested on PS3.
    Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>