    change zonelist order: zonelist order selection logic · f0c0b2b8
    KAMEZAWA Hiroyuki authored
    Make zonelist creation policy selectable from sysctl/boot option v6.
    
    This patch makes NUMA's zonelist (of pgdat) order selectable.
    Available orders are Default (automatic), Node-based, and Zone-based.
    
    [Default Order]
    The kernel selects Node-based or Zone-based order automatically.
    
    [Node-based Order]
    This policy treats the locality of memory as the most important parameter.
    Zonelist order is created by each zone's locality. This means lower zones
    (e.g. ZONE_DMA) can be used before higher zones (e.g. ZONE_NORMAL) are
    exhausted. In other words, ZONE_DMA will be in the middle of the zonelist.
    The current 2.6.21 kernel uses this order.
    
    Pros.
     * A user can expect local memory as much as possible.
    Cons.
     * Lower zones may be exhausted before higher zones. This may cause OOM_KILL.
    
    This may be suitable if ZONE_DMA is relatively big, you never see OOM_KILL
    because of ZONE_DMA exhaustion, and you need the best locality.
    
    (example)
    Assume a 2-node NUMA system: node(0) has ZONE_DMA and ZONE_NORMAL, node(1)
    has only ZONE_NORMAL.
    
    *node(0)'s memory allocation order:
    
     node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
    
    *node(1)'s memory allocation order:
    
     node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
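
    The Node-based example can be reproduced with a small model. This is only
    an illustrative sketch, not the kernel code: the ZONES table and the
    node_order_zonelist() helper are hypothetical names, and "nearest node
    first" is simplified to "local node, then the other node", which is all
    the 2-node example needs.

```python
# Hypothetical model of the 2-node example: zones per node, highest type first.
ZONES = {0: ["NORMAL", "DMA"], 1: ["NORMAL"]}


def node_order_zonelist(local, zones=ZONES):
    """Build the fallback list for `local` in Node-based order."""
    # Assumption: the local node comes first, then the remaining nodes.
    nodes = [local] + [n for n in sorted(zones) if n != local]
    # Outer loop over nodes: every zone of a node is listed before
    # moving on to the next node, so locality wins over zone type.
    return [(n, z) for n in nodes for z in zones[n]]


if __name__ == "__main__":
    for node in sorted(ZONES):
        print(node, node_order_zonelist(node))
```

    Because the outer loop walks nodes, node(0)'s DMA zone lands in the middle
    of node(0)'s list, ahead of the remote NORMAL zone.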
    
    [Zone-based order]
    This policy treats the zone type as the most important parameter.
    Zonelist order is created by zone-type order. This means a lower zone is
    never used before the higher zones are exhausted.
    In other words, ZONE_DMA will always be at the tail of the zonelist.
    
    Pros.
     * OOM_KILL (because of lower zone exhaustion) occurs only if all zones
       are exhausted.
    Cons.
     * Memory locality may not be the best.
    
    (example)
    Assume a 2-node NUMA system: node(0) has ZONE_DMA and ZONE_NORMAL, node(1)
    has only ZONE_NORMAL.
    
    *node(0)'s memory allocation order:
    
     node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
    
    *node(1)'s memory allocation order:
    
     node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
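
    The Zone-based example can be modelled the same way. Again this is a
    sketch under stated assumptions rather than the kernel implementation:
    ZONE_TYPES, ZONES and zone_order_zonelist() are hypothetical names, and
    node distance is reduced to "local node first".

```python
# Hypothetical model of the 2-node example.
ZONE_TYPES = ["NORMAL", "DMA"]  # highest zone type first
ZONES = {0: {"NORMAL", "DMA"}, 1: {"NORMAL"}}


def zone_order_zonelist(local, zones=ZONES):
    """Build the fallback list for `local` in Zone-based order."""
    # Assumption: the local node comes first, then the remaining nodes.
    nodes = [local] + [n for n in sorted(zones) if n != local]
    # Outer loop over zone types: all nodes' NORMAL zones are listed
    # before any DMA zone, so zone type wins over locality.
    return [(n, z) for z in ZONE_TYPES for n in nodes if z in zones[n]]


if __name__ == "__main__":
    for node in sorted(ZONES):
        print(node, zone_order_zonelist(node))
```

    Swapping the loop nesting is the only difference from Node-based order,
    and it is what pushes ZONE_DMA to the tail of every list.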
    
    The boot option "numa_zonelist_order=" and proc/sysctl are supported.
    
    command:
    %echo N > /proc/sys/vm/numa_zonelist_order
    
    Will rebuild zonelist in Node-based order.
    
    command:
    %echo Z > /proc/sys/vm/numa_zonelist_order
    
    Will rebuild zonelist in Zone-based order.
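
    As a rough illustration of how such a value could be interpreted, the
    sketch below maps an order string to a policy name. It is not the kernel's
    parser: parse_zonelist_order() is a hypothetical helper, and matching only
    the first character, case-insensitively, with "D" meaning Default, is an
    assumption made for this example.

```python
# Hypothetical mapping from the first letter of the value to a policy name.
ORDERS = {"d": "default", "n": "node", "z": "zone"}


def parse_zonelist_order(value):
    """Return the policy name for a numa_zonelist_order-style string."""
    # Assumption: only the first character matters, case-insensitively.
    first = value.strip()[:1].lower()
    if first not in ORDERS:
        raise ValueError(f"invalid zonelist order: {value!r}")
    return ORDERS[first]


if __name__ == "__main__":
    for text in ("N", "Z", "default"):
        print(text, "->", parse_zonelist_order(text))
```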
    
    Thanks to Lee Schermerhorn, who gave me much help and code.
    
    [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Cc: Christoph Lameter <clameter@sgi.com>
    Cc: Andi Kleen <ak@suse.de>
    Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
    Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>