1. 21 Jul, 2011 21 commits
  2. 20 Jul, 2011 19 commits
    • superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6
      Dave Chinner authored
      The per-sb shrinker has the same requirement as the writeback
      threads of ensuring that the superblock is usable and pinned for the
      time it takes to run the work. Both need to take a passive reference
      to the sb, take a read lock on the s_umount lock and then only
      continue if an unmount is not in progress.
      
      pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
      rename it to grab_super_passive() and export it via fs/internal.h
      so that all the VFS code can use it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
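The three steps named in the message above (passive reference, read lock on s_umount, bail out if an unmount is in progress) can be sketched as a single-threaded toy model. Everything prefixed `toy_` is hypothetical and is not the kernel code: the real function operates on struct super_block with real locking. The instance-list check and the non-blocking lock attempt are assumptions about how "only continue if an unmount is not in progress" might be realised.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded stand-in for struct super_block. */
struct toy_sb {
	int  s_count;        /* passive references */
	int  umount_readers; /* holders of the s_umount read side */
	bool umount_writer;  /* true while an unmount holds s_umount for write */
	bool on_list;        /* assumption: still on the list of live instances */
	bool has_root;       /* false once the root dentry has been torn down */
};

/* Drop a passive reference. */
void toy_put_super(struct toy_sb *sb)
{
	assert(sb->s_count > 0);
	sb->s_count--;
}

/* Non-blocking read lock: fails if a writer (an unmount) holds the lock. */
bool toy_down_read_trylock(struct toy_sb *sb)
{
	if (sb->umount_writer)
		return false;
	sb->umount_readers++;
	return true;
}

void toy_up_read(struct toy_sb *sb)
{
	assert(sb->umount_readers > 0);
	sb->umount_readers--;
}

/*
 * Returns true with a passive reference and the read lock held,
 * or false with nothing held.
 */
bool toy_grab_super_passive(struct toy_sb *sb)
{
	if (!sb->on_list)
		return false;

	sb->s_count++;                  /* passive reference */
	if (toy_down_read_trylock(sb)) {
		if (sb->has_root)
			return true;    /* usable and pinned */
		toy_up_read(sb);        /* unmount already got this far */
	}
	toy_put_super(sb);
	return false;
}
```

A caller that gets `true` back is expected to release the read lock and the passive reference once its work is done; a caller that gets `false` simply skips that superblock.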
    • inode: move to per-sb LRU locks · 09cc9fc7
      Dave Chinner authored
      With the inode LRUs moving to per-sb structures, there is no longer
      a need for a global inode_lru_lock. The locking can be made more
      fine-grained by moving to a per-sb LRU lock, isolating the LRU
      operations of different filesystems completely from each other.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • inode: Make unused inode LRU per superblock · 98b745c6
      Dave Chinner authored
      The inode unused list is currently a global LRU. This does not match
      the other global filesystem cache - the dentry cache - which uses
      per-superblock LRU lists. Hence we have related filesystem object
      types using different LRU reclamation schemes.
      
      To enable a per-superblock filesystem cache shrinker, both of these
      caches need to have per-sb unused object LRU lists. Hence this patch
      converts the global inode LRU to per-sb LRUs.
      
      The patch only does rudimentary per-sb proportioning in the shrinker
      infrastructure, as this gets removed when the per-sb shrinker
      callouts are introduced later on.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • inode: convert inode_stat.nr_unused to per-cpu counters · fcb94f72
      Dave Chinner authored
      Before we split up the inode_lru_lock, the unused inode counter
      needs to be made independent of the global inode_lru_lock. Convert
      it to per-cpu counters to do this.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
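As a rough illustration of why a per-cpu counter no longer needs the global lock, here is a minimal userspace sketch. The `toy_` names and the fixed CPU count are hypothetical, and clamping the sum at zero is an assumption made for this sketch: one CPU's slot can go negative when an inode is counted on one CPU and uncounted on another, so only the total is meaningful.

```c
#include <assert.h>

#define TOY_NR_CPUS 4	/* hypothetical fixed CPU count */

/* One slot per CPU; each CPU only ever modifies its own slot. */
struct toy_percpu_counter {
	long slot[TOY_NR_CPUS];
};

/* Update the local CPU's slot; no shared lock is involved. */
void toy_counter_add(struct toy_percpu_counter *c, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	c->slot[cpu] += delta;
}

/*
 * Read side: sum every slot. Individual slots may be negative,
 * so the result is clamped at zero (an assumption of this sketch).
 */
long toy_counter_sum(const struct toy_percpu_counter *c)
{
	long sum = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += c->slot[cpu];
	return sum < 0 ? 0 : sum;
}
```

The trade-off is that updates become cheap and contention-free while reads become a loop over all CPUs, which suits a statistic that is updated far more often than it is read.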
    • vmscan: add customisable shrinker batch size · e9299f50
      Dave Chinner authored
      For shrinkers that have their own cond_resched* calls, having
      shrink_slab break the work down into small batches is not
      particularly efficient. Add a custom batchsize field to the struct
      shrinker so that shrinkers can use a larger batch size if they
      desire.
      
      A value of zero (uninitialised) means "use the default", so
      behaviour is unchanged by this patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
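The "zero means use the default" rule can be sketched as follows. The `toy_` names are hypothetical, and the default of 128 objects per callout is an assumption made for illustration; the helper that counts callouts only models how a scan target would be broken into batches, with any remainder smaller than one batch left for a later pass.

```c
#include <assert.h>

#define TOY_DEFAULT_BATCH 128L	/* assumed default batch size */

/* Hypothetical stand-in for the relevant part of struct shrinker. */
struct toy_shrinker {
	long batch;	/* 0 (uninitialised) means "use the default" */
};

/* Pick the batch size for this shrinker. */
long toy_batch_size(const struct toy_shrinker *s)
{
	return s->batch ? s->batch : TOY_DEFAULT_BATCH;
}

/*
 * Count how many shrinker callouts are needed to work through
 * total_scan objects; a remainder smaller than one batch is not
 * scanned in this pass.
 */
long toy_nr_callouts(const struct toy_shrinker *s, long total_scan)
{
	long batch = toy_batch_size(s);
	long calls = 0;

	assert(total_scan >= 0);
	while (total_scan >= batch) {
		total_scan -= batch;
		calls++;
	}
	return calls;
}
```

With these assumed numbers, a scan target of 1024 objects takes eight callouts at the default batch size and a single callout for a shrinker that sets its batch to 1024.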
    • vmscan: reduce wind up shrinker->nr when shrinker can't do work · 3567b59a
      Dave Chinner authored
      When a shrinker returns -1 to shrink_slab() to indicate it cannot do
      any work given the current memory reclaim requirements, it adds the
      entire total_scan count to shrinker->nr. The idea behind this is that
      when the shrinker is next called and can do work, it will do the work
      of the previously aborted shrinker call as well.
      
      However, if a filesystem is doing lots of allocation with GFP_NOFS
      set, then we get many, many more aborts from the shrinkers than we
      do successful calls. The result is that shrinker->nr winds up to
      its maximum permissible value (twice the current cache size) and
      then when the next shrinker call that can do work is issued, it
      has enough scan count built up to free the entire cache twice over.
      
      This manifests itself in the cache going from full to empty in a
      matter of seconds, even when only a small part of the cache is
      needed to be emptied to free sufficient memory.
      
      Under metadata intensive workloads on ext4 and XFS, I'm seeing the
      VFS caches increase memory consumption up to 75% of memory (no page
      cache pressure) over a period of 30-60s, and then the shrinker
      empties them down to zero in the space of 2-3s. This cycle repeats
      over and over again, with the shrinker completely trashing the inode
      and dentry caches every minute or so the workload continues.
      
      This behaviour was made obvious by the shrink_slab tracepoints added
      earlier in the series, and made worse by the patch that corrected
      the concurrent accounting of shrinker->nr.
      
      To avoid this problem, stop repeated small increments of the total
      scan value from winding shrinker->nr up to a value that can cause
      the entire cache to be freed. We still need to allow it to wind up,
      so use the delta as the "large scan" threshold check - if the delta
      is more than a quarter of the entire cache size, then it is a large
      scan and allowed to cause lots of windup because we are clearly
      needing to free lots of memory.
      
      If it isn't a large scan then limit the total scan to half the size
      of the cache so that windup never increases to consume the whole
      cache. Reducing the total scan limit further does not allow enough
      wind-up to maintain the current levels of performance, whilst a
      higher threshold does not prevent the windup from freeing the entire
      cache under sustained workloads.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
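The thresholds described above (a quarter of the cache for the "large scan" check, half the cache as the limit otherwise, and twice the cache as the overall maximum) can be written down as a small pure function. This is a sketch with hypothetical names, not the patch itself; in particular, treating a delta of exactly one quarter as a large scan is an assumption.

```c
#include <assert.h>

/*
 * total_scan: deferred count plus this call's delta
 * delta:      scan count calculated for this call alone
 * cache_size: number of freeable objects in the cache
 */
long toy_limit_total_scan(long total_scan, long delta, long cache_size)
{
	assert(cache_size >= 0);

	/* Not a large scan: never let windup exceed half the cache. */
	if (delta < cache_size / 4 && total_scan > cache_size / 2)
		total_scan = cache_size / 2;

	/* Overall maximum: twice the current cache size. */
	if (total_scan > cache_size * 2)
		total_scan = cache_size * 2;

	return total_scan;
}
```

For a cache of 1000 objects, a wound-up total of 1900 with a delta of 10 is cut to 500, while the same total with a delta of 300 is left alone because the delta itself signals heavy memory pressure.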
    • vmscan: shrinker->nr updates race and go wrong · acf92b48
      Dave Chinner authored
      shrink_slab() allows shrinkers to be called in parallel so the
      struct shrinker can be updated concurrently. It does not provide any
      exclusion for such updates, so we can get the shrinker->nr value
      increasing or decreasing incorrectly.
      
      As a result, when a shrinker repeatedly returns a value of -1 (e.g.
      a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
      sometimes updating with the scan count that wasn't used, sometimes
      losing it altogether. Worse is when a shrinker does work and that
      update is lost due to racy updates, which means the shrinker will do
      the work again!
      
      Fix this by making the total_scan calculations independent of
      shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
      other updates via cmpxchg loops.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
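A compare-and-exchange loop of the kind mentioned above can be sketched with C11 atomics standing in for the kernel's cmpxchg(). The `toy_` names are hypothetical. The sketch assumes the pattern is: atomically claim the deferred count into a local variable, do all calculations on the local copy, then atomically add back whatever was not used.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the shrinker's deferred scan count. */
struct toy_shrinker_nr {
	atomic_long nr;
};

/* Atomically claim the whole deferred count, leaving zero behind. */
long toy_take_deferred(struct toy_shrinker_nr *s)
{
	long old = atomic_load(&s->nr);

	/* On failure, old is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak(&s->nr, &old, 0L))
		;
	return old;
}

/* Atomically add unused scan count back without losing other updates. */
void toy_return_unused(struct toy_shrinker_nr *s, long unused)
{
	long old;

	if (unused <= 0)
		return;

	old = atomic_load(&s->nr);
	/* old + unused is recomputed from the refreshed old on each retry. */
	while (!atomic_compare_exchange_weak(&s->nr, &old, old + unused))
		;
}
```

Because every caller works on its own local copy between the two atomic steps, a concurrent caller can neither see a half-updated count nor overwrite another caller's returned work.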
    • vmscan: add shrink_slab tracepoints · 09576073
      Dave Chinner authored
      It is impossible to understand what the shrinkers are actually doing
      without instrumenting the code, so add some tracepoints to allow
      insight to be gained.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) · a9049376
      Al Viro authored
      ... and simplify the living hell out of callers
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
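To show why passing errors straight through simplifies callers, here is a userspace toy of the error-pointer convention. All `toy_` names are hypothetical, the 4095 limit is an assumption borrowed for illustration, and the toy ignores everything the real d_splice_alias() does beyond error passthrough and attaching an inode (or NULL) to the dentry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095	/* assumed upper bound on errno values */

/* Encode a negative errno in a pointer, and decode it again. */
void *toy_err_ptr(long err)      { return (void *)(intptr_t)err; }
long toy_ptr_err(const void *p)  { return (long)(intptr_t)p; }
int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_inode  { int ino; };
struct toy_dentry { struct toy_inode *inode; };

/* Error in, same error out; otherwise attach inode (NULL = negative). */
struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
				    struct toy_dentry *dentry)
{
	if (toy_is_err(inode))
		return toy_err_ptr(toy_ptr_err(inode));
	dentry->inode = inode;
	return NULL;
}

/*
 * Hypothetical lookup: iget_result may be an inode, NULL or an error
 * pointer, and the caller no longer has to tell these cases apart.
 */
struct toy_dentry *toy_lookup(struct toy_inode *iget_result,
			      struct toy_dentry *dentry)
{
	return toy_splice_alias(iget_result, dentry);
}
```

Without the passthrough, a lookup of this shape would need its own error check before the splice call and a separate return path for the error case.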
    • deuglify squashfs_lookup() · 0c1aa9a9
      Al Viro authored
      d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL), NULL
      so no need for that if (inode) ... in there (or ERR_PTR(0), for that
      matter)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • nfsd4_list_rec_dir(): don't bother with reopening rec_file · 5b4b299c
      Al Viro authored
      just rewind it to the beginning before vfs_readdir() and be
      done with that...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • kill useless checks for sb->s_op == NULL · e7f59097
      Al Viro authored
      never is...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • btrfs: kill magical embedded struct superblock · 0ee5dc67
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of pointless checks for dentry->sb == NULL · fb408e6c
      Al Viro authored
      it never is...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Make ->d_sb assign-once and always non-NULL · a4464dbc
      Al Viro authored
      New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
      Allocates dentry, sets its ->d_sb to given superblock and sets
      ->d_op accordingly.  Old d_alloc(NULL, name) callers are converted
      to that (all of them know what superblock they want).  d_alloc()
      itself is left only for parent != NULL case; uses __d_alloc(),
      inserts result into the list of parent's children.
      
      Note that now ->d_sb is assign-once and never NULL *and*
      ->d_parent is never NULL either.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • unexport kern_path_parent() · e3c3d9c8
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • switch vfs_path_lookup() to struct path · e0a01249
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • kill lookup_create() · ed75e95d
      Al Viro authored
      folded into the only caller (kern_path_create())
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • devtmpfs: get rid of bogus mkdir in create_path() · 5da4e689
      Al Viro authored
      We do _NOT_ want to mkdir the path itself - we are preparing to
      mknod it, after all.  Normally it'll fail with -ENOENT and
      just do nothing, but if somebody has created the parent in
      the meanwhile, we'll get buggered...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>