1. 10 Sep, 2013 27 commits
    • xfs-convert-buftarg-lru-to-generic-code-fix · addbda40
      Andrew Morton authored
      fix warnings
      
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • xfs: convert buftarg LRU to generic code · e80dfa19
      Dave Chinner authored
      Convert the buftarg LRU to use the new generic LRU list and take advantage
      of the functionality it supplies to make the buffer cache shrinker node
      aware.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: convert inode and dentry shrinking to be node aware · 9b17c623
      Dave Chinner authored
      Now that the shrinker is passing a node in the scan control structure, we
      can pass this to the generic LRU list code to isolate reclaim to the
      lists on matching nodes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vmscan: per-node deferred work · 1d3d4437
      Glauber Costa authored
      The list_lru infrastructure already keeps per-node LRU lists in its
      node-specific list_lru_node arrays and provides us with a per-node API, and
      the shrinkers are properly equipped with node information.  This means that
      we can now focus our shrinking effort on a single node, but the work that
      is deferred from one run to another is kept global in nr_in_batch.  Work
      can be deferred, for instance, during direct reclaim under a GFP_NOFS
      allocation, in which case all the filesystem shrinkers will be
      prevented from running and will accumulate in nr_in_batch the amount of
      work they should have done, but could not.
      
      This creates an impedance problem, where upon node pressure, work deferred
      will accumulate and end up being flushed in other nodes.  The problem we
      describe is particularly harmful in big machines, where many nodes can
      accumulate at the same time, all adding to the global counter nr_in_batch.
      As we accumulate more and more, we start to ask for the caches to flush
      even bigger numbers.  The result is that the caches are depleted and do
      not stabilize.  To achieve stable steady state behavior, we need to tackle
      it differently.
      
      In this patch we keep the deferred count per-node, in the new array
      nr_deferred[] (the name is also a bit more descriptive) and will never
      accumulate that to other nodes.
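
      A minimal sketch of the accounting change, written as a toy Python model
      rather than kernel C.  ToyShrinker, shrink() and the may_run flag are
      hypothetical names used only for illustration; nr_deferred[] is the only
      identifier taken from the changelog above.

```python
class ToyShrinker:
    """Hypothetical userspace model of per-node deferred shrinker work."""

    def __init__(self, num_nodes):
        # One deferred counter per NUMA node, like the nr_deferred[] array.
        self.nr_deferred = [0] * num_nodes

    def shrink(self, node, wanted, may_run):
        """Return how many objects this call would ask the cache to scan.

        may_run=False models a context where the shrinker must not run
        (for example a GFP_NOFS allocation); the work is then deferred,
        but only on the node that was under pressure.
        """
        total = wanted + self.nr_deferred[node]
        if not may_run:
            self.nr_deferred[node] = total
            return 0
        self.nr_deferred[node] = 0
        return total
```

      With two nodes, deferring 100 objects twice on node 0 leaves the counters
      at [200, 0].  A later run on node 1 asking for 10 objects scans exactly
      10, and the backlog is flushed only when node 0 itself is shrunk again.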
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shrinker: add node awareness · 0ce3d744
      Dave Chinner authored
      Pass the node of the current zone being reclaimed to shrink_slab(),
      allowing the shrinker control nodemask to be set appropriately for node
      aware shrinkers.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • list_lru: remove special case function list_lru_dispose_all. · 4e717f5c
      Glauber Costa authored
      The list_lru implementation has one function, list_lru_dispose_all, with
      only one user (the dentry code).  At first, such a function appears to make
      sense because we are really not interested in the result of isolating each
      dentry separately - all of them are going away anyway.  However, its
      implementation is buggy in the following way:
      
      When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries
      marking them with DCACHE_SHRINK_LIST.  However, this is done without the
      nlru->lock taken.  The immediate result of that is that someone else may
      add or remove the dentry from the LRU at the same time.  When list_lru_del
      happens in that scenario we will see an element that is not yet marked
      with DCACHE_SHRINK_LIST (even though it will be in the future) and
      obviously remove it from an lru where the element no longer is.  Since
      list_lru_dispose_all will in effect count down nlru's nr_items and
      list_lru_del will do the same, this will lead to an imbalance.
      
      The solution for this would not be so simple: we can obviously just keep
      the lru_lock taken, but then we have no guarantees that we will be able to
      acquire the dentry lock (dentry->d_lock).  To properly solve this, we need
      a communication mechanism between the lru and dentry code, so they can
      coordinate this with each other.
      
      Such a mechanism already exists in the form of the list_lru_walk_cb
      callback.  So it is possible to construct a dcache-side prune function
      that does the right thing only by calling list_lru_walk in a loop until no
      more dentries are available.
      
      With only one user, plus the fact that a sane solution for the problem
      would involve bouncing between dcache and list_lru anyway, I see little
      justification to keep the special case list_lru_dispose_all in tree.
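
      The "walk in a loop until nothing is left" idea can be sketched as a toy
      Python model.  ToyLRU, prune_all() and the on_shrink_list key are
      hypothetical stand-ins for the list_lru walk, the dcache-side prune
      function and DCACHE_SHRINK_LIST; this is not the kernel code.

```python
import threading

LRU_REMOVED, LRU_SKIP = "removed", "skip"


class ToyLRU:
    """Toy LRU: a list, a lock and an item count."""

    def __init__(self):
        self.lock = threading.Lock()
        self.items = []
        self.nr_items = 0

    def add(self, item):
        with self.lock:
            self.items.append(item)
            self.nr_items += 1

    def walk(self, isolate, nr_to_walk):
        """Call isolate(item) with the lock held; return how many were removed."""
        removed = 0
        with self.lock:
            for item in list(self.items):
                if nr_to_walk == 0:
                    break
                nr_to_walk -= 1
                if isolate(item) == LRU_REMOVED:
                    self.items.remove(item)
                    self.nr_items -= 1
                    removed += 1
        return removed


def prune_all(lru, batch=2):
    """Drain the LRU by calling walk() until a pass isolates nothing."""
    dispose = []

    def isolate(item):
        # The mark and the list move both happen while walk() holds the lock,
        # so the item count is adjusted exactly once per item.
        item["on_shrink_list"] = True
        dispose.append(item)
        return LRU_REMOVED

    while lru.walk(isolate, batch) > 0:
        pass
    return dispose
```

      Because the callback runs with the LRU lock held, marking an item and
      removing it from the list are a single step from the point of view of any
      concurrent add or delete, which is the property the unlocked
      list_lru_dispose_all path lacked.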
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • list_lru: per-node API · 6a4f496f
      Glauber Costa authored
      This patch adapts the list_lru API to accept an optional node argument, to
      be used by NUMA aware shrinking functions.  Code that does not care about
      the NUMA placement of objects can still call into the very same functions
      as before.  They will simply iterate over all nodes.
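
      As a rough illustration, the optional node argument can be modelled with a
      Python default parameter.  ToyNodeLRU and its methods are hypothetical
      names; the kernel's C API necessarily looks different.

```python
class ToyNodeLRU:
    """Toy LRU with one list per NUMA node."""

    def __init__(self, num_nodes):
        self.lists = [[] for _ in range(num_nodes)]

    def _nodes(self, node):
        # node=None means "the caller does not care": visit every node.
        return range(len(self.lists)) if node is None else [node]

    def add(self, item, node):
        self.lists[node].append(item)

    def count(self, node=None):
        """Count items on one node, or on all nodes when node is None."""
        return sum(len(self.lists[n]) for n in self._nodes(node))

    def walk(self, visit, node=None):
        """Call visit(item) for each item; return how many were visited."""
        walked = 0
        for n in self._nodes(node):
            for item in self.lists[n]:
                visit(item)
                walked += 1
        return walked
```

      A NUMA-aware shrinker would call count(node=2) or walk(cb, node=2), while
      existing callers keep using count() and walk(cb) unchanged.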
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • list_lru: fix broken LRU_RETRY behaviour · 5cedf721
      Dave Chinner authored
      The LRU_RETRY code assumes that the list traversal status remains valid
      after we have dropped and regained the list lock.  Unfortunately, this is
      not a valid assumption, and that can lead to racing traversals isolating
      objects that the other traversal expects to be the next item on the list.
      
      This is causing problems with the inode cache shrinker isolation, with
      races resulting in an inode on a dispose list being "isolated" because a
      racing traversal still thinks it is on the LRU.  The inode is then never
      reclaimed and that causes hangs if a subsequent lookup on that inode
      occurs.
      
      Fix it by always restarting the list walk on a LRU_RETRY return from the
      isolate callback.  Avoid the possibility of livelocks the current code was
      trying to avoid by always decrementing the nr_to_walk counter on retries
      so that even if we keep hitting the same item on the list we'll eventually
      stop trying to walk and exit out of the situation causing the problem.
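
      The restart-with-budget behaviour can be sketched in a few lines of
      Python.  The walk() function and the numeric status values are
      illustrative assumptions; only the LRU_RETRY name and the nr_to_walk
      counter come from the changelog.

```python
LRU_REMOVED, LRU_SKIP, LRU_RETRY = range(3)


def walk(items, isolate, nr_to_walk):
    """Walk items, restarting from the head whenever isolate() asks to retry.

    Every visit, including a retry, is charged against nr_to_walk, so a
    callback that keeps returning LRU_RETRY cannot livelock the walk.
    """
    removed = 0
    restart = True
    while restart:
        restart = False
        for item in list(items):
            if nr_to_walk <= 0:
                return removed
            nr_to_walk -= 1
            status = isolate(item)
            if status == LRU_REMOVED:
                items.remove(item)
                removed += 1
            elif status == LRU_RETRY:
                # The lock was dropped: the list may have changed, so the
                # saved position is no longer trusted.  Start over.
                restart = True
                break
    return removed
```

      With a callback that always returns LRU_RETRY for the head item and a
      budget of 5, the walk visits that item five times, removes nothing and
      returns.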
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • list_lru: per-node list infrastructure · 3b1d58a4
      Dave Chinner authored
      Now that we have an LRU list API, we can start to enhance the
      implementation.  This splits the single LRU list into per-node lists and
      locks to enhance scalability.  Items are placed on lists according to the
      node the memory belongs to.  To make scanning the lists efficient, also
      track whether the per-node lists have entries in them in an active
      nodemask.
      
      Note: We use a fixed-size array for the node LRU, so this struct can be very
      big if MAX_NUMNODES is big.  If this becomes a problem this is fixable by
      turning this into a pointer and dynamically allocating this to
      nr_node_ids.  This quantity is firmware-provided, and still would provide
      room for all nodes at the cost of a pointer lookup and an extra
      allocation.  Because that allocation will most likely come from a may very
      well fail.
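
      A toy Python model of the per-node lists and the active nodemask described
      above.  ToyPerNodeLRU is a hypothetical name, a set stands in for the
      nodemask, and the caller passes the node explicitly, whereas the
      changelog says items are placed according to the node their memory
      belongs to.

```python
import threading


class ToyPerNodeLRU:
    """Toy LRU split into per-node lists, each with its own lock."""

    def __init__(self, num_nodes):
        self.nodes = [
            {"lock": threading.Lock(), "items": []} for _ in range(num_nodes)
        ]
        # Stand-in for the active nodemask: nodes whose list is non-empty.
        self.active_nodes = set()

    def add(self, item, node):
        nlru = self.nodes[node]
        with nlru["lock"]:
            if item in nlru["items"]:
                return False
            nlru["items"].append(item)
            self.active_nodes.add(node)
            return True

    def delete(self, item, node):
        nlru = self.nodes[node]
        with nlru["lock"]:
            if item not in nlru["items"]:
                return False
            nlru["items"].remove(item)
            if not nlru["items"]:
                self.active_nodes.discard(node)
            return True

    def count(self):
        """Sum the list lengths, visiting only nodes marked active."""
        total = 0
        for node in sorted(self.active_nodes):
            with self.nodes[node]["lock"]:
                total += len(self.nodes[node]["items"])
        return total
```

      The model is meant to be read single-threaded; the locks only mark where
      a per-node lock would be held.  Scanning via active_nodes skips empty
      nodes entirely, which is the efficiency point made in the first
      paragraph.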
      
      [glommer@openvz.org: fix warnings, added note about node lru]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: convert to use new lru list infrastructure · f6041567
      Dave Chinner authored
      [glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • inode: move inode to a different list inside lock · d38fa698
      Glauber Costa authored
      When removing an element from the lru, the list manipulation is currently
      done after the lock is released. This is a clear mistake, although we are
      not sure if the bugs we are seeing are related to this. All list
      manipulations are done inside the lock, and so should this one.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Tested-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • inode: convert inode lru list to generic lru list code. · bc3b14cb
      Dave Chinner authored
      [glommer@openvz.org: adapted for new LRU return codes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • list: add a new LRU list type · a38e4082
      Dave Chinner authored
      Several subsystems use the same construct for LRU lists - a list head, a
      spin lock and an item count.  They also use exactly the same code for
      adding and removing items from the LRU.  Create a generic type for these
      LRU lists.
      
      This is the beginning of generic, node aware LRUs for shrinkers to work
      with.
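
      The shared construct (a list head, a lock and an item count) can be
      pictured as one small reusable type.  The sketch below is a toy Python
      model under that assumption; ToyListLRU and its method names are
      hypothetical and do not mirror the kernel interface.

```python
import threading
from collections import OrderedDict


class ToyListLRU:
    """Generic LRU: ordered items, one lock, one item count."""

    def __init__(self):
        self.lock = threading.Lock()
        self.items = OrderedDict()  # insertion order, oldest first
        self.nr_items = 0

    def add(self, item):
        """Add item if it is not already on the list; report whether it was added."""
        with self.lock:
            if item in self.items:
                return False
            self.items[item] = None
            self.nr_items += 1
            return True

    def delete(self, item):
        """Remove item if present; report whether it was removed."""
        with self.lock:
            if item not in self.items:
                return False
            del self.items[item]
            self.nr_items -= 1
            return True

    def count(self):
        with self.lock:
            return self.nr_items
```

      Two subsystems can then each hold their own instance, for example
      inode_lru = ToyListLRU() and dentry_lru = ToyListLRU(), instead of each
      open-coding the same list, lock and counter.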
      
      [glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shrinker: convert superblock shrinkers to new API · 0a234c6d
      Dave Chinner authored
      Convert superblock shrinker to use the new count/scan API, and propagate
      the API changes through to the filesystem callouts.  The filesystem
      callouts already use a count/scan API, so it's just changing counters to
      longs to match the VM API.
      
      This requires the dentry and inode shrinker callouts to be converted to
      the count/scan API.  This is mainly a mechanical change.
      
      [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • mm: new shrinker API · 24f7c6b9
      Dave Chinner authored
      The current shrinker callout API uses a single shrinker call for
      multiple functions.  To determine the function, a special magical value is
      passed in a parameter to change the behaviour.  This complicates the
      implementation and return value specification for the different
      behaviours.
      
      Separate the two different behaviours into separate operations, one to
      return a count of freeable objects in the cache, and another to scan a
      certain number of objects in the cache for freeing.  In defining these new
      operations, ensure the return values and resultant behaviours are clearly
      defined and documented.
      
      Modify shrink_slab() to use the new API and implement the callouts for all
      the existing shrinkers.
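
      The difference between the two styles can be illustrated with a toy
      Python model.  The class names and shrink_cache() are hypothetical, and
      using nr_to_scan == 0 as the "magic" query value is an assumption made
      for the sketch; count_objects and scan_objects are named after the two
      operations described above.

```python
class OldStyleCache:
    """One callback, two meanings selected by a magic argument value."""

    def __init__(self, objects):
        self.objects = list(objects)

    def shrink(self, nr_to_scan):
        # nr_to_scan == 0 only queries; anything else frees objects.
        # Both modes return the remaining count, which is easy to misuse.
        if nr_to_scan == 0:
            return len(self.objects)
        del self.objects[:nr_to_scan]
        return len(self.objects)


class NewStyleCache:
    """Two callbacks, each with a single, clearly defined return value."""

    def __init__(self, objects):
        self.objects = list(objects)

    def count_objects(self):
        """Return how many objects could be freed."""
        return len(self.objects)

    def scan_objects(self, nr_to_scan):
        """Try to free up to nr_to_scan objects; return how many were freed."""
        freed = min(nr_to_scan, len(self.objects))
        del self.objects[:freed]
        return freed


def shrink_cache(cache, target, batch=2):
    """Toy caller: count first, then scan in batches up to target objects."""
    freed = 0
    while freed < target and cache.count_objects() > 0:
        got = cache.scan_objects(min(batch, target - freed))
        if got == 0:
            break
        freed += got
    return freed
```

      In the new style the caller never has to interpret one return value in
      two different ways: counting and scanning are separate calls with
      separate contracts.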
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: remove dentries from LRU before putting on dispose list · dd1f6b2e
      Dave Chinner authored
      One of the big problems with modifying the way the dcache shrinker and LRU
      implementation works is that the LRU is abused in several ways.  One of
      these is shrink_dentry_list().
      
      Basically, we can move a dentry off the LRU onto a different list without
      doing any accounting changes, and then use dentry_lru_prune() to remove it
      from whatever list it is now on to do the LRU accounting at that point.
      
      This makes it -really hard- to change the LRU implementation.  The use of
      the per-sb LRU lock serialises movement of the dentries between the
      different lists and the removal of them, and this is the only reason that
      it works.  If we want to break up the dentry LRU lock and lists into, say,
      per-node lists, we remove the only serialisation that allows this lru
      list/dispose list abuse to work.
      
      To make this work effectively, the dispose list has to be isolated from
      the LRU list - dentries have to be removed from the LRU *before* being
      placed on the dispose list.  This means that the LRU accounting and
      isolation is completed before disposal is started, and that means we can
      change the LRU implementation freely in future.
      
      This means that dentries *must* be marked with DCACHE_SHRINK_LIST when
      they are placed on the dispose list so that we don't think that parent
      dentries found in try_prune_one_dentry() are on the LRU when they are
      actually on the dispose list.  This would result in accounting the dentry
      to the LRU a second time.  Hence dentry_lru_del() has to handle the
      DCACHE_SHRINK_LIST case.
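
      The ordering rule (finish LRU accounting first, then place the dentry on
      the dispose list and mark it) can be sketched as a toy Python model.
      ToyDentry, ToyDcache and their methods are hypothetical; the
      on_shrink_list attribute stands in for DCACHE_SHRINK_LIST.

```python
class ToyDentry:
    def __init__(self, name):
        self.name = name
        self.on_lru = False
        self.on_shrink_list = False  # stands in for DCACHE_SHRINK_LIST


class ToyDcache:
    def __init__(self):
        self.lru = []
        self.nr_unused = 0

    def lru_add(self, dentry):
        """Put an unused dentry on the LRU unless it is already listed."""
        if not dentry.on_lru and not dentry.on_shrink_list:
            self.lru.append(dentry)
            dentry.on_lru = True
            self.nr_unused += 1

    def isolate(self, dentry, dispose):
        """Take dentry off the LRU, with accounting, before disposal starts."""
        self.lru.remove(dentry)
        dentry.on_lru = False
        self.nr_unused -= 1
        dentry.on_shrink_list = True
        dispose.append(dentry)

    def lru_del(self, dentry, dispose):
        """Remove dentry from whichever list it is on, accounting only once."""
        if dentry.on_shrink_list:
            # Already accounted for when it left the LRU.
            dispose.remove(dentry)
            dentry.on_shrink_list = False
        elif dentry.on_lru:
            self.lru.remove(dentry)
            dentry.on_lru = False
            self.nr_unused -= 1
```

      Because the flag is checked first in lru_del(), a dentry that is already
      on the dispose list is never charged against nr_unused a second time.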
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dentry: move to per-sb LRU locks · 19156840
      Dave Chinner authored
      With the dentry LRUs being per-sb structures, there is no real need for
      a global dentry_lru_lock. The locking can be made more fine-grained by
      moving to a per-sb LRU lock, isolating the LRU operations of different
      filesystems completely from each other. The need for this is independent
      of any performance consideration that may arise: in the interest of
      abstracting the lru operations away, it is mandatory that each lru works
      around its own lock instead of a global lock for all of them.
      
      [glommer@openvz.org: updated changelog ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      19156840
    • Dave Chinner's avatar
      dcache: convert dentry_stat.nr_unused to per-cpu counters · 62d36c77
      Dave Chinner authored
      Before we split up the dcache_lru_lock, the unused dentry counter needs to
      be made independent of the global dcache_lru_lock.  Convert it to per-cpu
      counters to do this.
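
      The idea of a per-cpu counter can be sketched in plain C as below. This
      is a simplified, single-threaded model and not the kernel's per-cpu API:
      the names example_percpu_counter, example_counter_add and
      example_counter_sum, the fixed CPU count, and the clamping of the total
      at zero are all assumptions made for illustration.

```c
#include <assert.h>

#define EXAMPLE_NR_CPUS 4   /* hypothetical fixed CPU count */

/* One slot per CPU, so an update touches only the caller's own slot. */
struct example_percpu_counter {
    long slot[EXAMPLE_NR_CPUS];
};

/*
 * Adjust the slot belonging to cpu. In this model no shared lock is
 * needed because each CPU is assumed to write only its own slot.
 */
static void example_counter_add(struct example_percpu_counter *c,
                                unsigned int cpu, long delta)
{
    assert(cpu < EXAMPLE_NR_CPUS);
    c->slot[cpu] += delta;
}

/*
 * Fold all slots into one total. A single slot may go negative (an
 * object counted on one CPU can be uncounted on another), so the sum
 * is clamped at zero in case it is sampled mid-update.
 */
static long example_counter_sum(const struct example_percpu_counter *c)
{
    long sum = 0;

    for (unsigned int cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++)
        sum += c->slot[cpu];
    return sum < 0 ? 0 : sum;
}

/* Shared instance for the usage example; zero-initialised. */
static struct example_percpu_counter example_unused;
```

      Updates stay cheap because they never contend on a shared lock; the cost
      moves to the reader, which has to walk every slot to obtain a total.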
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      62d36c77
    • Glauber Costa's avatar
      super: fix calculation of shrinkable objects for small numbers · 55f841ce
      Glauber Costa authored
      The sysctl knob sysctl_vfs_cache_pressure is used to determine which
      percentage of the shrinkable objects in our cache we should actively try
      to shrink.
      
      It works great in situations in which we have many objects (at least more
      than 100), because the approximation errors will be negligible.  But if
      this is not the case, especially when total_objects < 100, we may end up
      concluding that we have no objects at all (total / 100 = 0, if total <
      100).
      
      This is certainly not the biggest killer in the world, but may matter in
      very low kernel memory situations.
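
      The truncation described above can be reproduced with ordinary integer
      arithmetic. The sketch below is illustrative only and is not necessarily
      how the patch itself is written: the function names are hypothetical,
      and pressure stands in for the sysctl_vfs_cache_pressure percentage.

```c
#include <assert.h>
#include <limits.h>

/* Divide first: any total below 100 truncates to zero before scaling. */
static unsigned long example_divide_first(unsigned long total,
                                          unsigned long pressure)
{
    return (total / 100) * pressure;
}

/*
 * Scale the whole hundreds and the remainder separately, so totals
 * below 100 still contribute. pressure is a percentage (100 = default).
 */
static unsigned long example_scaled(unsigned long total,
                                    unsigned long pressure)
{
    unsigned long quot = total / 100;
    unsigned long rem = total % 100;

    /* rem < 100, so this bound keeps rem * pressure from overflowing. */
    assert(pressure <= ULONG_MAX / 100);
    return quot * pressure + (rem * pressure) / 100;
}
```

      With 50 objects and a pressure of 100, the divide-first form reports 0
      shrinkable objects while the split form reports 50.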
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      55f841ce
    • Glauber Costa's avatar
      fs: bump inode and dentry counters to long · 3942c07c
      Glauber Costa authored
      This series reworks our current object cache shrinking infrastructure in
      two main ways:
      
       * Noticing that a lot of users copy and paste their own version of LRU
         lists for objects, we put some effort in providing a generic version.
         It is modeled after the filesystem users: dentries, inodes, and xfs
         (for various tasks), but we expect that other users could benefit in
         the near future with little or no modification.  Let us know if you
         have any issues.
      
       * The underlying list_lru being proposed automatically and
         transparently keeps the elements in per-node lists, and is able to
         manipulate the node lists individually.  Given this infrastructure, we
         are able to modify the up-to-now hammer called shrink_slab to proceed
         with node-reclaim instead of always searching memory from all over like
         it has been doing.
      
      Per-node lru lists are also expected to lead to less contention in the lru
      locks on multi-node scans, since we are now no longer fighting for a
      global lock.  The locks usually disappear from the profilers with this
      change.
      
      Although we have no official benchmarks for this version - be our guest to
      independently evaluate this - earlier versions of this series were
      performance tested (details at
      http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
      visible performance regressions while yielding a better qualitative
      behavior in NUMA machines.
      
      With this infrastructure in place, we can use the list_lru entry point to
      provide memcg isolation and per-memcg targeted reclaim.  Historically,
      those two pieces of work have been posted together.  This version presents
      only the infrastructure work, deferring the memcg work for a later time,
      so we can focus on getting this part tested.  You can see more about the
      history of such work at http://lwn.net/Articles/552769/
      
      Dave Chinner (18):
        dcache: convert dentry_stat.nr_unused to per-cpu counters
        dentry: move to per-sb LRU locks
        dcache: remove dentries from LRU before putting on dispose list
        mm: new shrinker API
        shrinker: convert superblock shrinkers to new API
        list: add a new LRU list type
        inode: convert inode lru list to generic lru list code.
        dcache: convert to use new lru list infrastructure
        list_lru: per-node list infrastructure
        shrinker: add node awareness
        fs: convert inode and dentry shrinking to be node aware
        xfs: convert buftarg LRU to generic code
        xfs: rework buffer dispose list tracking
        xfs: convert dquot cache lru to list_lru
        fs: convert fs shrinkers to new scan/count API
        drivers: convert shrinkers to new count/scan API
        shrinker: convert remaining shrinkers to count/scan API
        shrinker: Kill old ->shrink API.
      
      Glauber Costa (7):
        fs: bump inode and dentry counters to long
        super: fix calculation of shrinkable objects for small numbers
        list_lru: per-node API
        vmscan: per-node deferred work
        i915: bail out earlier when shrinker cannot acquire mutex
        hugepage: convert huge zero page shrinker to new shrinker API
        list_lru: dynamically adjust node arrays
      
      This patch:
      
      There are situations in very large machines in which we can have a large
      quantity of dirty inodes, unused dentries, etc.  This is particularly true
      when umounting a filesystem, where every live object will
      eventually be discarded.
      
      Dave Chinner reported a problem with this while experimenting with the
      shrinker revamp patchset.  So we believe it is time for a change.  This
      patch just changes the counters from int to long.  Machines where it matters should have a
      big long anyway.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3942c07c
    • Dave Jones's avatar
    • Al Viro's avatar
    • Christoph Hellwig's avatar
      fs: remove vfs_follow_link · aac34df1
      Christoph Hellwig authored
      For a long time no filesystem has been using vfs_follow_link, and as seen
      by recent filesystem submissions any new use is accidental as well.
      
      Remove vfs_follow_link, document the replacement in
      Documentation/filesystems/porting and also rename __vfs_follow_link
      to match its only caller better.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      aac34df1
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · b05430fc
      Linus Torvalds authored
      Pull vfs pile 3 (of many) from Al Viro:
       "Waiman's conversion of d_path() and bits related to it,
        kern_path_mountpoint(), several cleanups and fixes (exportfs
        one is -stable fodder, IMO).
      
        There definitely will be more...  ;-/"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
        dcache: Translating dentry into pathname without taking rename_lock
        autofs4 - fix device ioctl mount lookup
        introduce kern_path_mountpoint()
        rename user_path_umountat() to user_path_mountpoint_at()
        take unlazy_walk() into umount_lookup_last()
        Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
        prune_super(): sb->s_op is never NULL
        exportfs: don't assume that ->iterate() won't feed us too long entries
        afs: get rid of redundant ->d_name.len checks
      b05430fc
    • Linus Torvalds's avatar
      vfs: make sure we don't have a stale root path if unlazy_walk() fails · d0d27277
      Linus Torvalds authored
      When I moved the RCU walk termination into unlazy_walk(), I didn't copy
      quite all of it: for the successful RCU termination we properly add the
      necessary reference counts to our temporary copy of the root path, but
      for the failure case we need to make sure that any temporary root path
      information is cleared out (since it does _not_ have the proper
      reference counts from the RCU lookup).
      
      We could clean up this mess by just always dropping the temporary root
      information, but Al points out that that would mean that a single lookup
      through symlinks could see multiple different root entries if it races
      with another thread doing chroot.  Not that I think we should really
      care (we had that before too, back before we had a copy of the root path
      in the nameidata).
      
      Al says he has a cunning plan.  In the meantime, this is the minimal fix
      for the problem, even if it's not all that pretty.
      Reported-by: Mace Moneta <moneta.mace@gmail.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0d27277
    • Linus Torvalds's avatar
      Merge tag 'dmaengine-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine · 26b0332e
      Linus Torvalds authored
      Pull dmaengine update from Dan Williams:
       "Collection of random updates to the core and some end-driver fixups
        for ioatdma and mv_xor:
         - NUMA aware channel allocation
         - Cleanup dmatest debugfs interface
         - ioat: make raid-support Atom only
         - mv_xor: big endian
      
        Aside from the top three commits these have all had some soak time in
        -next.  The top commit fixes a recent build breakage.
      
        It has been a long while since my last pull request, hopefully it does
        not show.  Thanks to Vinod for keeping an eye on drivers/dma/ this
        past year"
      
      * tag 'dmaengine-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
        dmaengine: dma_sync_wait and dma_find_channel undefined
        MAINTAINERS: update email for Dan Williams
        dma: mv_xor: Fix incorrect error path
        ioatdma: silence GCC warnings
        dmaengine: make dma_channel_rebalance() NUMA aware
        dmaengine: make dma_submit_error() return an error code
        ioatdma: disable RAID on non-Atom platforms and reenable unaligned copies
        mv_xor: support big endian systems using descriptor swap feature
        mv_xor: use {readl, writel}_relaxed instead of __raw_{readl, writel}
        dmatest: print message on debug level in case of no error
        dmatest: remove IS_ERR_OR_NULL checks of debugfs calls
        dmatest: make module parameters writable
      26b0332e
    • Jon Mason's avatar
      dmaengine: dma_sync_wait and dma_find_channel undefined · 4a43f394
      Jon Mason authored
      dma_sync_wait and dma_find_channel are declared regardless of whether
      CONFIG_DMA_ENGINE is enabled, but calling the functions without
      CONFIG_DMA_ENGINE enabled results in "undefined reference" errors.
      
      To get around this, declare dma_sync_wait and dma_find_channel as inline
      functions if CONFIG_DMA_ENGINE is undefined.
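
      The stub pattern described here can be sketched as follows. This is a
      generic illustration, not the dmaengine header: CONFIG_EXAMPLE_ENGINE,
      struct example_chan, example_find_channel() and example_sync_wait() are
      hypothetical names, and the stub return values are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature switch standing in for a Kconfig symbol. */
/* #define CONFIG_EXAMPLE_ENGINE 1 */

struct example_chan {
    int id;
};

#ifdef CONFIG_EXAMPLE_ENGINE
/* Feature enabled: real definitions live in a separately built file. */
struct example_chan *example_find_channel(int type);
int example_sync_wait(struct example_chan *chan);
#else
/* Feature disabled: inline stubs keep every caller linkable. */
static inline struct example_chan *example_find_channel(int type)
{
    (void)type;
    return NULL;    /* no channel can exist without the engine */
}

static inline int example_sync_wait(struct example_chan *chan)
{
    (void)chan;
    return 0;       /* assumed "nothing to wait for" status */
}
#endif

/* A caller is written once and builds in both configurations. */
static int example_caller(void)
{
    struct example_chan *chan = example_find_channel(0);

    if (!chan)
        return -1;  /* no channel available */
    return example_sync_wait(chan);
}

/* Usage: with the feature left disabled, the caller sees "no channel". */
void example_check_disabled(void)
{
    assert(example_caller() == -1);
}
```

      Callers need no #ifdef of their own; the header decides whether they
      reach the real implementation or a stub.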
      Signed-off-by: Jon Mason <jon.mason@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      4a43f394
  2. 09 Sep, 2013 13 commits
    • Linus Torvalds's avatar
      Merge tag 'late-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 64041417
      Linus Torvalds authored
      Pull ARM SoC late changes from Kevin Hilman:
       "These are changes that arrived a little late before the merge window,
        or had dependencies on previous branches.
      
        Highlights:
         - ux500: misc.  cleanup, fixup I2C devices
         - exynos: DT updates for RTC; PM updates
         - at91: DT updates for NAND; new platforms added to generic defconfig
         - sunxi: DT updates: cubieboard2, pinctrl driver, gated clocks
         - highbank: LPAE fixes, select necessary ARM errata
         - omap: PM fixes and improvements; OMAP5 mailbox support
         - omap: basic support for new DRA7xx SoCs"
      
      * tag 'late-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (60 commits)
        ARM: dts: vexpress: Add CCI node to TC2 device-tree
        ARM: EXYNOS: Skip C1 cpuidle state for exynos5440
        ARM: EXYNOS: always enable PM domains support for EXYNOS4X12
        ARM: highbank: clean-up some unused includes
        ARM: sun7i: Enable the A20 clocks in the DTSI
        ARM: sun6i: Enable clock support in the DTSI
        ARM: sun5i: dt: Use the A10s gates in the DTSI
        ARM: at91: at91_dt_defconfig: enable rm9200 support
        ARM: dts: add ADC device tree node for exynos5420/5250
        ARM: dts: Add RTC DT node to Exynos5420 SoC
        ARM: dts: Update the "status" property of RTC DT node for Exynos5250 SoC
        ARM: dts: Fix the RTC DT node name for Exynos5250
        irqchip: mmp: avoid to include irqs head file
        ARM: mmp: avoid to include head file in mach-mmp
        irqchip: mmp: support irqchip
        irqchip: move mmp irq driver
        ARM: OMAP: AM33xx: clock: Add RNG clock data
        ARM: OMAP: TI81XX: add always-on powerdomain for TI81XX
        ARM: OMAP4: clock: Lock PLLs in the right sequence
        ARM: OMAP: AM33XX: hwmod: Add hwmod data for debugSS
        ...
      64041417
    • Linus Torvalds's avatar
      Merge tag 'renesas-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · fa91515c
      Linus Torvalds authored
      Pull ARM Renesas SoC cleanup, refactoring and more SMP support from Kevin Hilman:
       "Lots of cleanup and refactoring and some SMP additions for Renesas
        platforms.  Due to some inter-dependencies with other arm-soc
        branches, this Renesas stuff was separated out for sending after the
        other branches were merged.
      
        Highlights:
         - remove unused board support and cleanup of unused headers
         - refactoring of init and device registration
         - simplify IRQ initialization"
      
      * tag 'renesas-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (68 commits)
        ARM: shmobile: Per-CPU SMP boot / sleep code for SCU SoCs
        ARM: shmobile: Introduce per-CPU SMP boot / sleep code
        ARM: shmobile: Use shared SCU CPU Hotplug code on r8a7779
        ARM: shmobile: Use shared SCU CPU Hotplug code on sh73a0
        ARM: shmobile: Add shared SCU CPU Hotplug code
        ARM: shmobile: Use shared SCU SMP boot code on emev2
        ARM: shmobile: Use shared SCU SMP boot code on r8a7779
        ARM: shmobile: Use shared SCU SMP boot code on sh73a0
        ARM: shmobile: Introduce shared SCU SMP boot code
        ARM: shmobile: sh73a0: Remove global GPIO_NR definition
        ARM: shmobile: kzm9d: remove nfsroot settings from bootargs
        ARM: shmobile: armadillo800eva: remove nfsroot settings from bootargs
        ARM: shmobile: r8a7779: move r8a7779_init_irq_xxx() to setup
        ARM: shmobile: r8a7740: move r8a7740_init_irq_of() to setup
        ARM: shmobile: bockw: add missing __initdata
        ARM: shmobile: r8a7790: add missing __initdata
        ARM: shmobile: r8a7779: add missing __initdata
        ARM: shmobile: Remove unused shmobile_init_time()
        ARM: shmobile: Use clocksource_of_init() on r8a7790
        ARM: shmobile: Use default ->init_time() on KZM9G DT ref
        ...
      fa91515c
    • Linus Torvalds's avatar
      Merge tag 'drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · a35c6322
      Linus Torvalds authored
      Pull ARM SoC driver update from Kevin Hilman:
       "This contains the ARM SoC related driver updates for v3.12.  The only
        thing this cycle are core PM updates and CPUidle support for ARM's TC2
        big.LITTLE development platform"
      
      * tag 'drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        cpuidle: big.LITTLE: vexpress-TC2 CPU idle driver
        ARM: vexpress: tc2: disable GIC CPU IF in tc2_pm_suspend
        drivers: irq-chip: irq-gic: introduce gic_cpu_if_down()
      a35c6322
    • Linus Torvalds's avatar
      Merge tag 'clk-for-linus-3.12' of git://git.linaro.org/people/mturquette/linux · bef4a0ab
      Linus Torvalds authored
      Pull clock framework changes from Michael Turquette:
       "The common clk framework changes for 3.12 are dominated by clock
        driver patches, both new drivers and fixes to existing.  A high
        percentage of these are for Samsung platforms like Exynos.  Core
        framework fixes and some new features like automagical clock
        re-parenting round out the patches"
      
      * tag 'clk-for-linus-3.12' of git://git.linaro.org/people/mturquette/linux: (102 commits)
        clk: only call get_parent if there is one
        clk: samsung: exynos5250: Simplify registration of PLL rate tables
        clk: samsung: exynos4: Register PLL rate tables for Exynos4x12
        clk: samsung: exynos4: Register PLL rate tables for Exynos4210
        clk: samsung: exynos4: Reorder registration of mout_vpllsrc
        clk: samsung: pll: Add support for rate configuration of PLL46xx
        clk: samsung: pll: Use new registration method for PLL46xx
        clk: samsung: pll: Add support for rate configuration of PLL45xx
        clk: samsung: pll: Use new registration method for PLL45xx
        clk: samsung: exynos4: Rename exynos4_plls to exynos4x12_plls
        clk: samsung: exynos4: Remove checks for DT node
        clk: samsung: exynos4: Remove unused static clkdev aliases
        clk: samsung: Modify _get_rate() helper to use __clk_lookup()
        clk: samsung: exynos4: Use separate aliases for cpufreq related clocks
        clocksource: samsung_pwm_timer: Get clock from device tree
        ARM: dts: exynos4: Specify PWM clocks in PWM node
        pwm: samsung: Update DT bindings documentation to cover clocks
        clk: Move symbol export to proper location
        clk: fix new_parent dereference before null check
        clk: wm831x: Initialise wm831x pointer on init
        ...
      bef4a0ab
    • Linus Torvalds's avatar
      Merge tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 7eb69529
      Linus Torvalds authored
      Pull tracing updates from Steven Rostedt:
       "Not many changes for the 3.12 merge window.  The major tracing changes
        are still in flux, and will have to wait for 3.13.
      
        The changes for 3.12 are mostly clean ups and minor fixes.
      
        H Peter Anvin added a check to x86_32 static function tracing that
        helps a small segment of the kernel community.
      
        Oleg Nesterov had a few changes from 3.11, but they were mostly clean ups
        and not worth pushing in the -rc time frame.
      
        Li Zefan had a small clean up with annotating a raw_init with __init.
      
        I fixed a slight race in updating function callbacks, but the race is
        so small and the bug that happens when it occurs is so minor it's not
        even worth pushing to stable.
      
        The only real enhancement is from Alexander Z Lam that made the
        tracing_cpumask work for trace buffer instances, instead of them all
        sharing a global cpumask"
      
      * tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ftrace/rcu: Do not trace debug_lockdep_rcu_enabled()
        x86-32, ftrace: Fix static ftrace when early microcode is enabled
        ftrace: Fix a slight race in modifying what function callback gets traced
        tracing: Make tracing_cpumask available for all instances
        tracing: Kill the !CONFIG_MODULES code in trace_events.c
        tracing: Don't pass file_operations array to event_create_dir()
        tracing: Kill trace_create_file_ops() and friends
        tracing/syscalls: Annotate raw_init function with __init
      7eb69529
    • Alex Elder's avatar
      clk: only call get_parent if there is one · 12d29886
      Alex Elder authored
      In __clk_init(), after a clock is mostly initialized, a scan is done
      of the orphan clocks to see if the clock being registered is the
      parent of any of them.
      
      This code assumes that any clock that provides a get_parent method
      actually has at least one parent, and that's not a valid assumption.
      
      As a result, an orphan clock with no parent can return *something*
      as the parent index, and that value is blindly used to dereference
      the orphan's parent_names[] array (which will be ZERO_SIZE_PTR or
      NULL).
      
      Fix this by ensuring get_parent is only called for orphans with at
      least one parent.
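
      A guard of this kind can be sketched as below. The structure is a
      simplified, hypothetical model and not the kernel's struct clk; the
      field names follow the commit message (parent_names, get_parent), while
      num_parents, the helper names and the extra range check are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical clock model. */
struct example_clk {
    const char **parent_names;  /* may be NULL when num_parents == 0 */
    unsigned int num_parents;
    unsigned int (*get_parent)(const struct example_clk *clk);
};

/*
 * Return the current parent's name, or NULL when the clock has no
 * parents or reports an index outside parent_names[].
 */
static const char *example_parent_name(const struct example_clk *clk)
{
    unsigned int idx;

    assert(clk != NULL);

    /* Guard first: never consult get_parent for a parentless clock. */
    if (clk->num_parents == 0 || clk->get_parent == NULL)
        return NULL;

    idx = clk->get_parent(clk);
    if (idx >= clk->num_parents)
        return NULL;
    return clk->parent_names[idx];
}

/* Sample callbacks: one well-behaved, one returning a bogus index. */
static unsigned int example_first_parent(const struct example_clk *clk)
{
    (void)clk;
    return 0;
}

static unsigned int example_bogus_index(const struct example_clk *clk)
{
    (void)clk;
    return 7;
}

/* Sample clocks for the usage example. */
static const char *example_names[] = { "osc" };
static const struct example_clk example_orphan = {
    NULL, 0, example_bogus_index
};
static const struct example_clk example_child = {
    example_names, 1, example_first_parent
};
```

      Without the num_parents check, the bogus index from example_orphan would
      be used to read through a NULL parent_names pointer.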
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Mike Turquette <mturquette@linaro.org>
      12d29886
    • Al Viro's avatar
      split read_seqretry_or_unlock(), convert d_walk() to resulting primitives · 48f5ec21
      Al Viro authored
      Separate "check if we need to retry" from "unlock if we are done and
      had seq_writelock"; that allows us to use these guys in d_walk(), where
      we need to recheck every time we ascend back to parent, but do *not*
      want to unlock until the very end.  Lift rcu_read_lock/rcu_read_unlock
      out into callers.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      48f5ec21
    • Linus Torvalds's avatar
      Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs · 300893b0
      Linus Torvalds authored
      Pull xfs updates from Ben Myers:
       "For 3.12-rc1 there are a number of bugfixes in addition to work to
        ease usage of shared code between libxfs and the kernel, the rest of
        the work to enable project and group quotas to be used simultaneously,
        performance optimisations in the log and the CIL, directory entry file
        type support, fixes for log space reservations, some spelling/grammar
        cleanups, and the addition of user namespace support.
      
         - introduce readahead to log recovery
         - add directory entry file type support
         - fix a number of spelling errors in comments
         - introduce new Q_XGETQSTATV quotactl for project quotas
         - add USER_NS support
         - log space reservation rework
         - CIL optimisations
         - kernel/userspace libxfs rework"
      
      * tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits)
        xfs: XFS_MOUNT_QUOTA_ALL needed by userspace
        xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino
        Fix wrong flag ASSERT in xfs_attr_shortform_getvalue
        xfs: finish removing IOP_* macros.
        xfs: inode log reservations are too small
        xfs: check correct status variable for xfs_inobt_get_rec() call
        xfs: inode buffers may not be valid during recovery readahead
        xfs: check LSN ordering for v5 superblocks during recovery
        xfs: btree block LSN escaping to disk uninitialised
        XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
        xfs: fix bad dquot buffer size in log recovery readahead
        xfs: don't account buffer cancellation during log recovery readahead
        xfs: check for underflow in xfs_iformat_fork()
        xfs: xfs_dir3_sfe_put_ino can be static
        xfs: introduce object readahead to log recovery
        xfs: Simplify xfs_ail_min() with list_first_entry_or_null()
        xfs: Register hotcpu notifier after initialization
        xfs: add xfs sb v4 support for dirent filetype field
        xfs: Add write support for dirent filetype field
        xfs: Add read-only support for dirent filetype field
        ...
      300893b0
    • Olof Johansson's avatar
      direct-io: Use return from cmpxchg to decide if assignment happened · 45150c43
      Olof Johansson authored
      Not using the return value can in the generic case be racy, so it's
      in general good practice to check the return value instead.
      
      This also resolved the warning caused on ARM and other architectures:
      
        fs/direct-io.c: In function 'sb_init_dio_done_wq':
        fs/direct-io.c:557:2: warning: value computed is not used [-Wunused-value]
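
      Why the return value matters can be shown with a userspace analogue.
      The sketch below uses C11 atomic_compare_exchange_strong(), which
      returns a success flag, whereas the kernel's cmpxchg() returns the
      previous value; the slot, the struct and the function names are
      hypothetical and do not come from fs/direct-io.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct example_wq {
    int id;
};

/* Shared slot that should be initialised at most once (starts NULL). */
static _Atomic(struct example_wq *) example_slot;

/*
 * Try to publish candidate in the slot. The result of the
 * compare-and-exchange tells the caller whether its candidate was
 * installed; a caller that lost still owns its candidate and is
 * responsible for releasing it.
 */
static bool example_publish(struct example_wq *candidate)
{
    struct example_wq *expected = NULL;

    assert(candidate != NULL);
    return atomic_compare_exchange_strong(&example_slot, &expected,
                                          candidate);
}

/* Two candidates competing for the same slot in the usage example. */
static struct example_wq example_first = { 1 };
static struct example_wq example_second = { 2 };
```

      Ignoring the result would leave the second caller unable to tell that
      its candidate was never installed.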
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: H Peter Anvin <hpa@zytor.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45150c43
    • Waiman Long's avatar
      dcache: Translating dentry into pathname without taking rename_lock · 232d2d60
      Waiman Long authored
      When running the AIM7's short workload, Linus' lockref patch eliminated
      most of the spinlock contention. However, there were still some left:
      
           8.46%     reaim  [kernel.kallsyms]     [k] _raw_spin_lock
                       |--42.21%-- d_path
                       |          proc_pid_readlink
                       |          SyS_readlinkat
                       |          SyS_readlink
                       |          system_call
                       |          __GI___readlink
                       |
                       |--40.97%-- sys_getcwd
                       |          system_call
                       |          __getcwd
      
      The big one here is the rename_lock (seqlock) contention in d_path()
      and the getcwd system call. This patch will eliminate the need to take
      the rename_lock while translating dentries into the full pathnames.
      
      The need to take the rename_lock is to make sure that no rename
      operation can be ongoing while the translation is in progress. However,
      only one thread can take the rename_lock thus blocking all the other
      threads that need it even though the translation process won't make
      any change to the dentries.
      
      This patch replaces the writer's write_seqlock/write_sequnlock
      sequence on the rename_lock in the callers of the prepend_path() and
      __dentry_path() functions with the reader's read_seqbegin/read_seqretry
      sequence within these 2 functions. As a result, the code has to
      retry if one or more rename operations were performed in the meantime.
      In addition, the RCU read lock is taken during the translation process
      to make sure that no dentries go away. To prevent live-lock, the code
      falls back to taking the rename_lock if read_seqretry() fails three
      times.
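
The retry-then-fall-back scheme described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: every toy_* name is hypothetical, C11 atomics stand in for the kernel's seqlock primitives, RCU is not modelled, and the "rename" is simulated from inside the reader so the example stays single-threaded and deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_MAX_LOCKLESS_TRIES 3   /* the "three times" limit from the text */

/* Hypothetical stand-in for a seqlock: a sequence counter plus a lock. */
struct toy_seqlock {
    atomic_uint seq;    /* even: stable, odd: a writer is in progress */
    atomic_flag lock;   /* exclusive lock taken by writers and by the fallback */
};

#define TOY_SEQLOCK_INIT { 0, ATOMIC_FLAG_INIT }

typedef void (*toy_walk_fn)(void *arg);

/* Reader side: wait for an even sequence number and remember it. */
static unsigned toy_read_begin(struct toy_seqlock *sl)
{
    unsigned s;
    while ((s = atomic_load(&sl->seq)) & 1u)
        ;   /* a writer is active, spin */
    return s;
}

/* Reader side: true if a writer ran since toy_read_begin(). */
static bool toy_read_retry(struct toy_seqlock *sl, unsigned start)
{
    return atomic_load(&sl->seq) != start;
}

/* Writer side: returns false if the lock is already held, i.e. the
 * rename would have to wait. */
bool toy_try_rename(struct toy_seqlock *sl)
{
    if (atomic_flag_test_and_set(&sl->lock))
        return false;
    atomic_fetch_add(&sl->seq, 1);   /* sequence becomes odd */
    /* ... the rename itself would modify shared state here ... */
    atomic_fetch_add(&sl->seq, 1);   /* sequence becomes even again */
    atomic_flag_clear(&sl->lock);
    return true;
}

/* Run walk() locklessly; after too many failed attempts, take the lock
 * so that no writer can interfere. Returns the number of failed tries. */
int toy_read_with_fallback(struct toy_seqlock *sl, toy_walk_fn walk, void *arg)
{
    int failed = 0;

    assert(sl && walk);
    while (failed < TOY_MAX_LOCKLESS_TRIES) {
        unsigned start = toy_read_begin(sl);
        walk(arg);
        if (!toy_read_retry(sl, start))
            return failed;           /* consistent snapshot */
        failed++;
    }
    while (atomic_flag_test_and_set(&sl->lock))
        ;                            /* fallback: exclude all writers */
    walk(arg);
    atomic_flag_clear(&sl->lock);
    return failed;
}

/* Demo: the walk simulates a concurrent rename while any remain. */
struct toy_demo_ctx {
    struct toy_seqlock *sl;
    int renames_left;
    int calls;
};

static void toy_demo_walk(void *arg)
{
    struct toy_demo_ctx *ctx = arg;

    ctx->calls++;
    if (ctx->renames_left > 0 && toy_try_rename(ctx->sl))
        ctx->renames_left--;
}

/* Returns how many times the walk had to run for `renames` interferences. */
int toy_demo(int renames)
{
    struct toy_seqlock sl = TOY_SEQLOCK_INIT;
    struct toy_demo_ctx ctx = { &sl, renames, 0 };

    toy_read_with_fallback(&sl, toy_demo_walk, &ctx);
    return ctx.calls;
}
```

In this sketch the number of walks is bounded: three lockless attempts at most, then one walk under the lock, during which toy_try_rename() cannot succeed.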
      
      To further reduce spinlock contention, this patch does not take the
      dentry's d_lock when copying the filename from the dentries. Instead,
      it treats the name pointer and length as unreliable and copies the
      string byte by byte until it hits a null byte or the end of the
      string as specified by the length. This should avoid stepping into
      an invalid memory address. The error cases are left to be handled by
      the sequence number check.
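
The bounded byte-by-byte copy can be illustrated with a small userspace function. It is a sketch under assumptions rather than the patch's actual prepend_name(): the name toy_prepend_name and its signature are hypothetical, and passing the pointer and length by value stands in for taking a one-time snapshot of fields that may change underneath the reader.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Prepend "/" followed by a name in front of *buffer, which grows
 * downwards from the end of a caller-supplied array.
 *
 * `name` and `len` are treated as possibly inconsistent with each other:
 * the copy stops at whichever comes first, `len` bytes or a null byte.
 * A stale length therefore never makes the loop read past the string's
 * terminator; the resulting path may be wrong, which a caller is expected
 * to detect through a separate sequence number check and retry.
 */
int toy_prepend_name(char **buffer, size_t *buflen, const char *name, size_t len)
{
    char *p;

    assert(buffer && *buffer && buflen && name);

    if (*buflen < len + 1)
        return -ENAMETOOLONG;        /* no room for '/' plus the name */

    *buflen -= len + 1;
    *buffer -= len + 1;
    p = *buffer;

    *p++ = '/';                      /* separator is written unconditionally */
    while (len--) {
        char c = *name++;
        if (c == '\0')
            break;                   /* name ended earlier than `len` claimed */
        *p++ = c;
    }
    return 0;
}
```

A caller would start with `*buffer` pointing at the terminating null byte at the end of a zeroed array and call the function once per path component, from the leaf towards the root.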
      
      The following code refactorings are also made:
      1. Move prepend('/') into prepend_name() to remove one conditional
         check.
      2. Move the global root check in prepend_path() back to the top of
         the while loop.
      
      With this patch, _raw_spin_lock now accounts for only 1.2% of the
      total CPU cycles for the short workload. This patch also reduces the
      distortion that running perf introduces into its own profile, since
      the perf command itself can be a heavy user of the d_path() function
      depending on the complexity of the workload.
      
      When taking the perf profile of the high-systime workload, the amount
      of spinlock contention contributed by running perf without this patch
      was about 16%. With this patch, the spinlock contention caused by
      running perf goes away, giving a more accurate perf profile.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      232d2d60
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20130909' of git://git.infradead.org/linux-mtd · ef9a61be
      Linus Torvalds authored
      Pull mtd updates from David Woodhouse:
       - factor out common code from MTD tests
       - nand-gpio cleanup and portability to non-ARM
       - m25p80 support for 4-byte addressing chips, other new chips
       - pxa3xx cleanup and support for new platforms
       - remove obsolete alauda, octagon-5066 drivers
       - erase/write support for bcm47xxsflash
       - improve detection of ECC requirements for NAND, controller setup
       - NFC acceleration support for atmel-nand, read/write via SRAM
       - etc
      
      * tag 'for-linus-20130909' of git://git.infradead.org/linux-mtd: (184 commits)
        mtd: chips: Add support for PMC SPI Flash chips in m25p80.c
        mtd: ofpart: use for_each_child_of_node() macro
        mtd: mtdswap: replace strict_strtoul() with kstrtoul()
        mtd cs553x_nand: use kzalloc() instead of memset
        mtd: atmel_nand: fix error return code in atmel_nand_probe()
        mtd: bcm47xxsflash: writing support
        mtd: bcm47xxsflash: implement erasing support
        mtd: bcm47xxsflash: convert to module_platform_driver instead of init/exit
        mtd: bcm47xxsflash: convert kzalloc to avoid invalid access
        mtd: remove alauda driver
        mtd: nand: mxc_nand: mark 'const' properly
        mtd: maps: cfi_flagadm: add missing __iomem annotation
        mtd: spear_smi: add missing __iomem annotation
        mtd: r852: Staticize local symbols
        mtd: nandsim: Staticize local symbols
        mtd: impa7: add missing __iomem annotation
        mtd: sm_ftl: Staticize local symbols
        mtd: m25p80: add support for mr25h10
        mtd: m25p80: make CONFIG_M25PXX_USE_FAST_READ safe to enable
        mtd: m25p80: Pass flags through CAT25_INFO macro
        ...
      ef9a61be
    • Linus Torvalds's avatar
      Merge tag 'firewire-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 · b5f0998c
      Linus Torvalds authored
      Pull firewire updates from Stefan Richter:
      
       - Fix a regression since 3.2 inclusive: The subsystem workqueue
         deadlocked between transaction completion handling and bus reset
         handling if the worker pool could not be increased in time.
      
       - janitorial updates
      
      * tag 'firewire-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
        firewire: ohci: Fix deadlock at bus reset
        firewire: ohci: Change module_pci_driver to module_init/module_exit
        firewire: ohci: beautify some macro definitions
        firewire: ohci: change confusing name of a struct member
        firewire: core: typecast from gfp_t to bool more safely
        firewire: WQ_NON_REENTRANT is meaningless and going away
      b5f0998c
    • Dan Williams's avatar
      MAINTAINERS: update email for Dan Williams · ab5f8c6e
      Dan Williams authored
      Returned to intel.com
      
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Jon Mason <jon.mason@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      ab5f8c6e