    [PATCH] cpusets - big numa cpu and memory placement · 3a978e55
    Paul Jackson authored
    This is my cpuset patch, with the following changes in the last two weeks:
    
     1) Updated to 2.6.8.1-mm1
     2) [Simon Derr <Simon.Derr@bull.net>] Fix new cpuset to begin empty,
        not copied from parent.  Needed to avoid breaking exclusive property.
     3) [Dinakar Guniguntala <dino@in.ibm.com>] Finish initializing top
        cpuset from cpu_possible_map after smp_init() called.
     4) [Paul Jackson <pj@sgi.com>] Check on each call to __alloc_pages()
        if the current task's cpuset mems_allowed has changed.  Use a cpuset
        generation number, bumped on any cpuset memory placement change,
        to make this check efficient.  Update the task's mems_allowed from
        its cpuset, if the cpuset has changed.
     5) [Paul Jackson <pj@sgi.com>] If a task is moved to another cpuset,
        then update its cpus_allowed, using set_cpus_allowed().
     6) [Paul Jackson <pj@sgi.com>] Update Documentation/cpusets.txt to
        reflect above changes (4) and (5).
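
    The generation check in (4) can be modelled in user space roughly as
    follows.  This is a sketch, not the code in this patch: the struct
    layouts, global_mems_generation, cpuset_set_mems(), task_attach() and
    task_update_mems_allowed() are hypothetical; only the names mems_allowed
    and cpuset_mems_generation come from the description in this message.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only; not the kernel data structures. */
struct cpuset {
    unsigned long mems_allowed;  /* one bit per memory node */
    int mems_generation;         /* counter value at the last placement change */
};

struct task {
    struct cpuset *cpuset;
    unsigned long mems_allowed;  /* cached copy consulted by the allocator */
    int cpuset_mems_generation;  /* generation seen at the last refresh */
};

/* Hypothetical global counter, bumped on every memory placement change. */
static int global_mems_generation = 1;

/* Change a cpuset's memory placement and record a new generation. */
void cpuset_set_mems(struct cpuset *cs, unsigned long mems)
{
    assert(cs != NULL);
    cs->mems_allowed = mems;
    cs->mems_generation = ++global_mems_generation;
}

/* Put a task in a cpuset and take an initial copy of its placement. */
void task_attach(struct task *t, struct cpuset *cs)
{
    assert(t != NULL && cs != NULL);
    t->cpuset = cs;
    t->mems_allowed = cs->mems_allowed;
    t->cpuset_mems_generation = cs->mems_generation;
}

/*
 * Cheap check meant for the allocation path: one integer compare in the
 * common case.  Returns 1 if the cached mems_allowed was refreshed.
 */
int task_update_mems_allowed(struct task *t)
{
    assert(t != NULL && t->cpuset != NULL);
    if (t->cpuset_mems_generation == t->cpuset->mems_generation)
        return 0;
    t->mems_allowed = t->cpuset->mems_allowed;
    t->cpuset_mems_generation = t->cpuset->mems_generation;
    return 1;
}
```

    The point of the design is that the task's copy goes stale only until
    its next allocation, and the unchanged case costs a single compare.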
    
    I continue to recommend the following patch for inclusion in your 2.6.9-*mm
    series, when that opens.  It provides an important facility for high
    performance computing on large systems.  Simon Derr of Bull (France) and
    I are the primary authors.  Erich Focht has indicated that NEC is also
    a potential user of this patch on the TX-7 NUMA machines, and that he
    "would very much welcome the inclusion of cpusets."
    
    I offer this update to lkml, in order to invite continued feedback.
    
    The one prerequisite patch for this cpuset patch was just posted before this
    one.  That was a patch to provide a new bitmap list format, of which
    cpusets is the first user.
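
    For readers who have not seen that patch: the sketch below assumes the
    list format is comma-separated decimal bit numbers and ranges, such as
    "0-3,8".  The parser is a user-space illustration limited to 64 bits;
    parse_bit_list() is a hypothetical name, not a function from either
    patch, and it expects the string without a trailing newline.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a list such as "0-3,8" into a 64-bit mask.
 * An empty string yields an empty mask.
 * Returns 0 on success, -1 on malformed input or a bit number >= 64.
 */
int parse_bit_list(const char *s, uint64_t *mask)
{
    assert(s != NULL && mask != NULL);
    *mask = 0;
    while (*s != '\0') {
        char *end;
        unsigned long lo, hi;

        if (!isdigit((unsigned char)*s))
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;
        s = end;
        if (*s == '-') {                 /* a range "lo-hi" */
            s++;
            if (!isdigit((unsigned char)*s))
                return -1;
            hi = strtoul(s, &end, 10);
            s = end;
        }
        if (lo > hi || hi >= 64)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            *mask |= (uint64_t)1 << b;
        if (*s == ',') {
            s++;
            if (*s == '\0')              /* reject a trailing comma */
                return -1;
        } else if (*s != '\0') {
            return -1;
        }
    }
    return 0;
}
```

    With this sketch, "0-3,8" sets bits 0 through 3 and bit 8, giving the
    mask 0x10f.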
    
    This patch has been built on top of 2.6.8.1-mm1, for these architectures:
    
      i386 x86_64 sparc ia64 powerpc-405 powerpc-750 sparc64
    
    with and without CONFIG_CPUSETS.  It has been booted and tested on ia64
    (sn2_defconfig, SN2 hardware).  The 'alpha' arch also built, except for
    what seems to be an unrelated toolchain problem (crosstool ld sigsegv) in
    the final link step.
    
    ===
    
    Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to
    a set of tasks.
    
    Cpusets constrain the CPU and Memory placement of tasks to only the
    processor and memory resources within a task's current cpuset.  They form a
    nested hierarchy visible in a virtual file system.  These are the essential
    hooks, beyond what is already present, required to manage dynamic job
    placement on large systems.
    
    Cpusets require small kernel hooks in init, exit, fork, mempolicy,
    sched_setaffinity, page_alloc and vmscan.  And they require a "struct
    cpuset" pointer, a cpuset_mems_generation, and a "mems_allowed" nodemask_t
    (to go along with the "cpus_allowed" cpumask_t that's already there) in
    each task struct.
    
    These hooks:
      1) establish and propagate cpusets,
      2) enforce CPU placement in sched_setaffinity,
      3) enforce Memory placement in mbind and sys_set_mempolicy,
      4) restrict page allocation and scanning to mems_allowed, and
      5) restrict migration and set_cpus_allowed to cpus_allowed.
    
    The other required hook, restricting task scheduling to CPUs in a task's
    cpus_allowed mask, is already present.
    
    Cpusets extend the usefulness of the existing placement support that was
    added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, and
    mbind() and set_mempolicy() for memory placement.  On smaller or dedicated
    use systems, the existing calls are often sufficient.
    
    On larger NUMA systems running more than one performance-critical job,
    it is necessary to be able to manage jobs in their entirety.  This includes
    providing a job with exclusive CPU and memory that no other job can use,
    and being able to list all tasks currently in a cpuset.
    
    A given job running within a cpuset would likely use the existing
    placement calls to manage its CPU and memory placement in more detail.
    
    Cpusets are named, nested sets of CPUs and Memory Nodes.  Each cpuset is
    represented by a directory in the cpuset virtual file system, normally
    mounted at /dev/cpuset.
    
    Each cpuset directory provides the following files, which can be
    read and written:
    
      cpus:
          List of CPUs allowed to tasks in that cpuset.
    
      mems:
          List of Memory Nodes allowed to tasks in that cpuset.
    
      tasks:
          List of pids of tasks in that cpuset.
    
      cpu_exclusive:
          Flag (0 or 1) - if set, cpuset has exclusive use of
          its CPUs (no sibling or cousin cpuset may overlap CPUs).
    
      mem_exclusive:
          Flag (0 or 1) - if set, cpuset has exclusive use of
          its Memory Nodes (no sibling or cousin may overlap).
    
      notify_on_release:
          Flag (0 or 1) - if set, then /sbin/cpuset_release_agent
          will be invoked, with the name (/dev/cpuset relative path)
          of that cpuset in argv[1], when the last user of it (task
          or child cpuset) goes away.  This supports automatic
          cleanup of abandoned cpusets.
    
    In addition, one new file type is added to the /proc file system:
    
      /proc/<pid>/cpuset:
          For each task (pid), list its cpuset path, relative to the
          root of the cpuset file system.  This file is read-only.
    
    New cpusets are created using 'mkdir' (at the shell or in C).  Old ones are
    removed using 'rmdir'.  The above files are accessed using read(2) and
    write(2) system calls, or shell commands such as 'cat' and 'echo'.
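
    The "in C" case might look like the sketch below.  The helper names
    cpuset_create(), cpuset_write() and cpuset_read() are hypothetical, and
    the directory is passed in rather than hard-coded so the helpers work
    against any path; on a real system it would be a directory under the
    mounted cpuset file system, and the kernel may reject values that break
    the rules described in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Build "<dir>/<file>" in path.  Returns 0 on success, -1 if too long. */
static int cpuset_path(char *path, size_t len, const char *dir, const char *file)
{
    int n = snprintf(path, len, "%s/%s", dir, file);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a short string to <dir>/<file>.  Returns 0 on success, -1 on error. */
int cpuset_write(const char *dir, const char *file, const char *value)
{
    char path[4096];
    FILE *f;

    assert(dir != NULL && file != NULL && value != NULL);
    if (cpuset_path(path, sizeof(path), dir, file) != 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Read the first line of <dir>/<file> into buf, without the newline. */
int cpuset_read(const char *dir, const char *file, char *buf, size_t len)
{
    char path[4096];
    FILE *f;

    assert(dir != NULL && file != NULL && buf != NULL && len > 0);
    if (cpuset_path(path, sizeof(path), dir, file) != 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL)
        buf[0] = '\0';
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Create the cpuset directory (if needed) and set its cpus and mems. */
int cpuset_create(const char *dir, const char *cpus, const char *mems)
{
    assert(dir != NULL);
    if (mkdir(dir, 0755) != 0 && errno != EEXIST)
        return -1;
    if (cpuset_write(dir, "cpus", cpus) != 0)
        return -1;
    return cpuset_write(dir, "mems", mems);
}
```

    A caller would presumably place a task in the new cpuset the same way,
    by writing its pid to the 'tasks' file with cpuset_write().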
    
    The CPUs and Memory Nodes in a given cpuset are always a subset of its
    parent.  The root cpuset has all possible CPUs and Memory Nodes in the
    system.  A cpuset may be exclusive (cpu or memory) only if its parent is
    similarly exclusive.
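
    These rules can be stated as a small predicate.  The sketch below is a
    user-space model with single-word masks; struct cs_model and cs_valid()
    are hypothetical names, and the overlap test assumes the rule is
    symmetric (a cpuset may not overlap a sibling if either of the two is
    exclusive).  Cousins are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Model of one cpuset: one bit per CPU and per memory node. */
struct cs_model {
    unsigned long cpus;
    unsigned long mems;
    int cpu_exclusive;
    int mem_exclusive;
};

/* 1 if every bit set in child is also set in parent. */
static int is_subset(unsigned long child, unsigned long parent)
{
    return (child & ~parent) == 0;
}

/*
 * Check cs against its parent (NULL for the root) and its siblings.
 * Returns 1 if the configuration obeys the rules, 0 otherwise.
 */
int cs_valid(const struct cs_model *cs, const struct cs_model *parent,
             const struct cs_model *siblings, size_t nsiblings)
{
    assert(cs != NULL);
    assert(siblings != NULL || nsiblings == 0);

    if (parent != NULL) {
        /* CPUs and memory nodes must be a subset of the parent's. */
        if (!is_subset(cs->cpus, parent->cpus) ||
            !is_subset(cs->mems, parent->mems))
            return 0;
        /* Exclusive only if the parent is similarly exclusive. */
        if (cs->cpu_exclusive && !parent->cpu_exclusive)
            return 0;
        if (cs->mem_exclusive && !parent->mem_exclusive)
            return 0;
    }
    for (size_t i = 0; i < nsiblings; i++) {
        const struct cs_model *sib = &siblings[i];

        /* No overlap with a sibling when either side is exclusive. */
        if ((cs->cpu_exclusive || sib->cpu_exclusive) &&
            (cs->cpus & sib->cpus) != 0)
            return 0;
        if ((cs->mem_exclusive || sib->mem_exclusive) &&
            (cs->mems & sib->mems) != 0)
            return 0;
    }
    return 1;
}
```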
    
    See further Documentation/cpusets.txt, at the top of the following
    patch.
    
    
    /proc interface:
    
    It is useful, when learning and making new uses of cpusets and placement,
    to be able to see the current values of a task's cpus_allowed and
    mems_allowed, which are the actual placement used by the kernel scheduler
    and memory allocator.
    
    The cpus_allowed and mems_allowed values are needed by user space apps that
    are micromanaging placement, such as when moving an app to a obtained by
    that app within its cpuset using sched_setaffinity, mbind and
    set_mempolicy.
    
    The cpus_allowed value is also available via the sched_getaffinity system
    call.  But since the entire rest of the cpuset API, including the display
    of mems_allowed added here, is via an ascii style presentation in /proc and
    /dev/cpuset, it is worth the extra couple lines of code to display
    cpus_allowed in the same way.
    
    This patch adds the display of these two fields to the 'status' file in the
    /proc/<pid> directory of each task.  The fields are only added if
    CONFIG_CPUSETS is enabled (which is also needed to define the mems_allowed
    field of each task).  The new output lines look like:
    
      $ tail -2 /proc/1/status
      Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
      Mems_allowed:   ffffffff,ffffffff
    Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
    Signed-off-by: Paul Jackson <pj@sgi.com>
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Simon Derr <simon.derr@bull.net>
    Signed-off-by: Matt Mackall <mpm@selenic.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>