    mempolicy: restructure rebinding-mempolicy functions · 708c1bbc
    Miao Xie authored
    Nick Piggin reported that the allocator may see an empty nodemask when
    changing cpuset's mems[1].  It happens only on kernels that do not do
    atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).
    
    But I found that there is also a problem on kernels that can do atomic
    nodemask_t stores.  The problem is that the allocator can't find a node
    to allocate a page from while cpuset's mems is being changed, even
    though there is a lot of free memory.  The sequence is as follows:
    
    (mpol: mempolicy)
    	task1			task1's mpol	task2
    	alloc page		1
    	  alloc on node0? NO	1
    				1		change mems from 1 to 0
    				1		rebind task1's mpol
    				0-1		  set new bits
    				0	  	  clear disallowed bits
    	  alloc on node1? NO	0
    	  ...
    	can't alloc page
    	  goto oom
    
    I can reproduce it with the attached program by the following steps:
    
    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
       <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh
    
    Several hours later, oom will happen even though there is a lot of free
    memory.
    
    This patchset fixes this problem by expanding the nodes range first (set
    newly allowed bits) and shrinking it lazily (clear newly disallowed
    bits).  We use a variable to tell the write-side task that the read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes after the read-side task ends the current memory
    allocation.
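
    The interleaving shown earlier can be replayed deterministically in user
    space.  The following is only a minimal sketch, not kernel code: the
    nodemask is modeled as a plain unsigned long, and mpol_nodes,
    scan_nodes(), replay() and the rebind_*() helpers are hypothetical names
    invented for illustration.  It suggests why a reader that walks the nodes
    one by one can miss every node when the writer replaces the mask in one
    step, but finds a node when the writer only adds bits during the scan.

```c
#define NR_NODES 2

/* Toy stand-in for a mempolicy nodemask: bit n set means node n is allowed. */
static unsigned long mpol_nodes;

/* Hypothetical writer hook, run between the reader's checks of node 0 and 1. */
typedef void (*writer_fn)(void);

/* One-step rebind from {1} to {0}: the old node disappears immediately. */
void rebind_at_once(void)
{
	mpol_nodes = 1UL << 0;
}

/* First step of a two-step rebind: only add the newly allowed node. */
void rebind_expand(void)
{
	mpol_nodes |= 1UL << 0;
}

/* Second step: drop the newly disallowed node once the reader is done. */
void rebind_shrink(void)
{
	mpol_nodes &= ~(1UL << 1);
}

/*
 * Reader: try each node in order, re-reading the mask for every node.
 * Returns the node chosen, or -1 if no node was allowed ("goto oom").
 */
int scan_nodes(writer_fn writer)
{
	for (int node = 0; node < NR_NODES; node++) {
		if (mpol_nodes & (1UL << node))
			return node;
		if (node == 0 && writer)
			writer();	/* writer runs after node 0 was rejected */
	}
	return -1;
}

/* Start from task1's mempolicy {1} and run one scan against the writer. */
int replay(writer_fn writer)
{
	mpol_nodes = 1UL << 1;
	return scan_nodes(writer);
}
```

    In this model replay(rebind_at_once) returns -1, which corresponds to
    the spurious oom, while replay(rebind_expand) returns 1; rebind_shrink()
    can then be called safely once the scan has finished.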
    
    This patch:
    
    In order to fix the "no node to allocate memory from" problem, when we
    want to update mempolicy and mems_allowed, we expand the set of nodes
    first (set all the newly allowed nodes) and shrink the set of nodes
    lazily (clear disallowed nodes).  But the mempolicy's rebind functions
    may break the expansion.
    
    So we restructure the mempolicy's rebind functions and split the rebind
    work into two steps, just like the update of cpuset's mems.  The 1st
    step: expand the set of the mempolicy's nodes.  The 2nd step: shrink the
    set of the mempolicy's nodes.  The two-step rebind is used when there is
    no real lock protecting the mempolicy on the read side.  Otherwise we
    can do the rebind work at once.
    
    In order to implement it, we define
    
    	enum mpol_rebind_step {
    		MPOL_REBIND_ONCE,
    		MPOL_REBIND_STEP1,
    		MPOL_REBIND_STEP2,
    		MPOL_REBIND_NSTEP,
    	};
    
    If the mempolicy needn't be updated in two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions.  Otherwise we can pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
    MPOL_REBIND_STEP2 to do the second step.
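
    As a rough illustration of how a rebind function might dispatch on the
    step, consider the user-space sketch below.  It reuses the enum from this
    patch, but struct toy_mpol and toy_rebind_nodemask() are hypothetical,
    and the simple "take the new mask as the policy's nodes" rule is an
    assumption that ignores the different remapping modes of real
    mempolicies.

```c
enum mpol_rebind_step {
	MPOL_REBIND_ONCE,	/* do the whole rebind work at once */
	MPOL_REBIND_STEP1,	/* first step: set all the newly allowed nodes */
	MPOL_REBIND_STEP2,	/* second step: clear all the disallowed nodes */
	MPOL_REBIND_NSTEP,
};

/* Hypothetical, simplified policy: one word of node bits. */
struct toy_mpol {
	unsigned long nodes;
};

/*
 * Hypothetical rebind helper (not the kernel's function).
 * Returns 0 on success, -1 if the step value is not a valid step.
 */
int toy_rebind_nodemask(struct toy_mpol *pol, unsigned long new_nodes,
			enum mpol_rebind_step step)
{
	switch (step) {
	case MPOL_REBIND_ONCE:
		/* Read side is protected by a real lock: replace in one go. */
		pol->nodes = new_nodes;
		return 0;
	case MPOL_REBIND_STEP1:
		/* Expand only: old nodes stay usable for in-flight readers. */
		pol->nodes |= new_nodes;
		return 0;
	case MPOL_REBIND_STEP2:
		/* Shrink: keep only the nodes that are still allowed. */
		pol->nodes &= new_nodes;
		return 0;
	default:
		return -1;
	}
}
```

    A caller without read-side locking would invoke the helper twice, first
    with MPOL_REBIND_STEP1 and later with MPOL_REBIND_STEP2, passing the same
    new mask both times.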
    
    Besides that, there may be a long time between these two steps, and we
    have to release the lock that protects mempolicy and mems_allowed in
    between.  When we take the lock again, we must check whether the current
    mempolicy is under rebinding (the first step has been done) or not,
    because the task may allocate a new mempolicy while we don't hold the
    lock.  So we define the following flag to identify it:
    
    #define MPOL_F_REBINDING (1 << 2)
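
    One way such a flag could be used is sketched below.  The flag value is
    the one defined by this patch; struct toy_policy, toy_rebind_step1() and
    toy_rebind_step2() are hypothetical, and silently skipping the second
    step for a policy that never saw the first step is an assumption made
    for illustration, not necessarily what the kernel does in that case.

```c
#define MPOL_F_REBINDING (1 << 2)	/* value taken from this patch */

/* Hypothetical, simplified policy with a flags word and one word of nodes. */
struct toy_policy {
	unsigned int flags;
	unsigned long nodes;
};

/* First step: expand the node set and remember that a rebind is pending. */
void toy_rebind_step1(struct toy_policy *pol, unsigned long new_nodes)
{
	pol->nodes |= new_nodes;
	pol->flags |= MPOL_F_REBINDING;
}

/*
 * Second step, run after the lock was dropped and taken again.
 * Returns 1 if the shrink was performed, 0 if the policy was not under
 * rebinding (for example, the task installed a new policy in between).
 */
int toy_rebind_step2(struct toy_policy *pol, unsigned long new_nodes)
{
	if (!(pol->flags & MPOL_F_REBINDING))
		return 0;	/* step 1 never ran on this policy */

	pol->nodes &= new_nodes;
	pol->flags &= ~(unsigned int)MPOL_F_REBINDING;
	return 1;
}
```

    In this model a policy allocated between the two steps carries no
    MPOL_F_REBINDING bit, so the second step leaves its nodes untouched.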
    
    The new functions will be used in the next patch.
    
    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Nick Piggin <npiggin@suse.de>
    Cc: Paul Menage <menage@google.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
    Cc: Ravikiran Thirumalai <kiran@scalex86.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: Andi Kleen <andi@firstfloor.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>