1. 21 Oct, 2011 15 commits
    • Steven Whitehouse's avatar
      GFS2: Use cached rgrp in gfs2_rlist_add() · 70b0c365
      Steven Whitehouse authored
      Each block which is deallocated, requires a call to gfs2_rlist_add()
      and each of those calls was calling gfs2_blk2rgrpd() in order to
      figure out which rgrp the block belonged in. This can be speeded up
      by making use of the rgrp cached in the inode. We also reset this
      cached rgrp in case the block has changed rgrp. This should provide
      a big reduction in gfs2_blk2rgrpd() calls during deallocation.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      70b0c365
    • Steven Whitehouse's avatar
      GFS2: Call do_strip() directly from recursive_scan() · d56fa8a1
      Steven Whitehouse authored
      The recursive_scan() function only ever takes a single "bc"
      argument, so we might as well just call do_strip() directly
      from resource_scan() rather than pass it in as an argument.
      
      Also the "data" argument is always a struct strip_mine, so
      we can pass that in, rather than using a void pointer.
      
      This also moves do_strip() ahead of recursive_scan() so that
      we don't need to add a prototype.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      d56fa8a1
    • Steven Whitehouse's avatar
      GFS2: Remove obsolete assert · 534029e2
      Steven Whitehouse authored
      Given that a resource group has been locked, there is no reason why
      we should not be able to allocate as many blocks as are free. The
      al_requested parameter should really be considered as a minimum
      number of blocks to be available. Should this limit be overshot,
      there are other mechanisms which will prevent over allocation.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      534029e2
    • Steven Whitehouse's avatar
      GFS2: Cache the most recently used resource group in the inode · 54335b1f
      Steven Whitehouse authored
      This means that after the initial allocation for any inode, the
      last used resource group is cached in the inode for future use.
      This drastically reduces the number of lookups of resource
      groups in the common case, and this the contention on that
      data structure.
      
      The allocation algorithm is the same as previously, except that we
      always check to see if the goal block is within the cached rgrp
      first before going to the rbtree to look one up.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      54335b1f
    • Steven Whitehouse's avatar
      GFS2: Make resource groups "append only" during life of fs · 8339ee54
      Steven Whitehouse authored
      Since we have ruled out supporting online filesystem shrink,
      it is possible to make the resource group list append only
      during the life of a super block. This gives several benefits:
      
      Firstly, we only need to read new rindex elements as they are added
      rather than needing to reread the whole rindex file each time one
      element is added.
      
      Secondly, the rindex glock can be held for much shorter periods of
      time, and is completely removed from the fast path for allocations.
      The lock is taken in shared mode only when updating the resource
      groups when the first allocation occurs, and after a grow has
      taken place.
      
      Thirdly, this results in a reduction in code size, and everything
      gets a lot simpler to understand in this area.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      8339ee54
    • Bob Peterson's avatar
      GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme · 7c9ca621
      Bob Peterson authored
      Here is an update of Bob's original rbtree patch which, in addition, also
      resolves the rather strange ref counting that was being done relating to
      the bitmap blocks.
      
      Originally we had a dual system for journaling resource groups. The metadata
      blocks were journaled and also the rgrp itself was added to a list. The reason
      for adding the rgrp to the list in the journal was so that the "repolish
      clones" code could be run to update the free space, and potentially send any
      discard requests when the log was flushed. This was done by comparing the
      "cloned" bitmap with what had been written back on disk during the transaction
      commit.
      
      Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
      until the journal had been flushed. For that reason, there was a rather
      complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
      both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
      count on the buffers.
      
      However, the journal maintains a reference count on the buffers anyway, since
      they are being journaled as metadata buffers. So by moving the code which deals
      with the post-journal accounting for bitmap blocks to the metadata journaling
      code, we can entirely dispense with the rather strange buffer ref counting
      scheme and also the requirement to journal the rgrps.
      
      The net result of all this is that the ->sd_rindex_spin is left to do exactly
      one job, and that is to look after the rbtree or rgrps.
      
      This patch is designed to be a stepping stone towards using RCU for the rbtree
      of resource groups, however the reduction in the number of uses of the
      ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
      anyway.
      
      The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also
      be removed in future in favour of calling the functions directly where required
      in the code. That will allow locking of resource groups without needing to
      actually read them in - something that could be useful in speeding up statfs.
      
      In the mean time though it is valid to dereference ->bi_bh only when the rgrp
      is locked. This is basically the same rule as before, modulo the references not
      being valid until the following journal flush.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Cc: Benjamin Marzinski <bmarzins@redhat.com>
      7c9ca621
    • Steven Whitehouse's avatar
      GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added · 9453615a
      Steven Whitehouse authored
      We need to take the inode's glock whenever the inode's size
      is referenced, otherwise it might not be uptodate. Even
      though generic_file_llseek_unlocked() doesn't implement
      SEEK_DATA, SEEK_HOLE directly, it does reference the inode's
      size in those cases, so we need to add them to the list
      of origins which need the glock.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      9453615a
    • Steven Whitehouse's avatar
      GFS2: Clean up gfs2_create · 9a63edd1
      Steven Whitehouse authored
      If we pass through knowledge of whether the creation is intended to be
      exclusive or not, then we can deal with that in gfs2_create_inode
      and remove one set of locking. Also this removes the loop in
      gfs2_create and simplifies the code a bit.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      9a63edd1
    • Steven Whitehouse's avatar
      GFS2: Use ->dirty_inode() · ab9bbda0
      Steven Whitehouse authored
      The aim of this patch is to use the newly enhanced ->dirty_inode()
      super block operation to deal with atime updates, rather than
      piggy backing that code into ->write_inode() as is currently
      done.
      
      The net result is a simplification of the code in various places
      and a reduction of the number of gfs2_dinode_out() calls since
      this is now implied by ->dirty_inode().
      
      Some of the mark_inode_dirty() calls have been moved under glocks
      in order to take advantage of then being able to avoid locking in
      ->dirty_inode() when we already have suitable locks.
      
      One consequence is that generic_write_end() now correctly deals
      with file size updates, so that we do not need a separate check
      for that afterwards. This also, indirectly, means that fdatasync
      should work correctly on GFS2 - the current code always syncs the
      metadata whether it needs to or not.
      
      Has survived testing with postmark (with and without atime) and
      also fsx.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      ab9bbda0
    • Steven Whitehouse's avatar
      GFS2: Fix bug trap and journaled data fsync · f1818529
      Steven Whitehouse authored
      Journaled data requires that a complete flush of all dirty data for
      the file is done, in order that the ail flush which comes after
      will succeed.
      
      Also the recently enhanced bug trap can trigger falsely in case
      an ail flush from fsync races with a page read. This updates the
      bug trap such that it will ignore buffers which are locked and
      only trigger on dirty and/or pinned buffers when the ail flush
      is run from fsync. The original bug trap is retained when ail
      flush is run from ->go_sync()
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      f1818529
    • Steven Whitehouse's avatar
      GFS2: Fix inode allocation error path · 40ac218f
      Steven Whitehouse authored
      If we have got far enough through the inode allocation code
      path that an inode has already been allocated, then we must
      call iput to dispose of it, if an error occurs during a
      later part of the process. This will always be the final iput
      since there will be no other references to the inode.
      
      Unlike when the inode has been unlinked, its block state will
      be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
      to skip the test in ->evict_inode() for this one case in order
      to ensure that it will be deallocated correctly. This patch adds
      a new flag in order to ensure that this will happen correctly.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      40ac218f
    • Steven Whitehouse's avatar
      GFS2: Make atime checks more efficient · 1d4ec642
      Steven Whitehouse authored
      We do not need to start a transaction unless the atime
      check has proved positive. Also if we are going to flush
      the complete ail list anyway, we might as well skip the
      writeback for this specific inode's metadata, since that
      will be done as part of the ail writeback process in an
      order offering potentially more efficient I/O.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      1d4ec642
    • Steven Whitehouse's avatar
      GFS2: Fix bug-trap in ail flush code · 75549186
      Steven Whitehouse authored
      The assert was being tested under the wrong lock, a
      legacy of the original code. Also, if it does trigger,
      the resulting information was not always a lot of help.
      
      This moves the patch under the correct lock and also
      prints out more useful information in tacking down the
      source of the problem.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      75549186
    • Steven Whitehouse's avatar
      GFS2: Split data write & wait in fsync · 2f0264d5
      Steven Whitehouse authored
      Now that the data writing is part of fsync proper, we can split
      the waiting part out and do it later on. This reduces the
      number of waits that we do during fsync on average.
      
      There is also no need to take the i_mutex unless we are flushing
      metadata to disk, so we can move that to within the metadata
      flushing code.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      2f0264d5
    • Steven Whitehouse's avatar
      GFS2: Clean up dir hash table reading · 4c28d338
      Steven Whitehouse authored
      Since there is now only a single caller to gfs2_dir_read_data()
      and it has a number of constant arguments, we can factor
      those out. Also some tests relating to the inode size were
      being done twice.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      4c28d338
  2. 20 Oct, 2011 3 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · fd11e153
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc: Add alignment flag to PCI expansion resources
        sparc: Avoid calling sigprocmask()
        sparc: Use set_current_blocked()
        sparc32,leon: SRMMU MMU Table probe fix
      fd11e153
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 505f48b5
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        fib_rules: fix unresolved_rules counting
        r8169: fix wrong eee setting for rlt8111evl
        r8169: fix driver shutdown WoL regression.
        ehea: Change maintainer to me
        pptp: pptp_rcv_core() misses pskb_may_pull() call
        tproxy: copy transparent flag when creating a time wait
        pptp: fix skb leak in pptp_xmit()
        bonding: use local function pointer of bond->recv_probe in bond_handle_frame
        smsc911x: Add support for SMSC LAN89218
        tg3: negate USE_PHYLIB flag check
        netconsole: enable netconsole can make net_device refcnt incorrent
        bluetooth: Properly clone LSM attributes to newly created child connections
        l2tp: fix a potential skb leak in l2tp_xmit_skb()
        bridge: fix hang on removal of bridge via netlink
        x25: Prevent skb overreads when checking call user data
        x25: Handle undersized/fragmented skbs
        x25: Validate incoming call user data lengths
        udplite: fast-path computation of checksum coverage
        IPVS netns shutdown/startup dead-lock
        netfilter: nf_conntrack: fix event flooding in GRE protocol tracker
      505f48b5
    • Hugh Dickins's avatar
      mm: fix race between mremap and removing migration entry · 486cf46f
      Hugh Dickins authored
      I don't usually pay much attention to the stale "? " addresses in
      stack backtraces, but this lucky report from Pawel Sikora hints that
      mremap's move_ptes() has inadequate locking against page migration.
      
       3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
       kernel BUG at include/linux/swapops.h:105!
       RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                             migration_entry_wait+0x156/0x160
        [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
        [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
        [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
        [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
        [<ffffffff81106097>] ? vma_adjust+0x537/0x570
        [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
        [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
        [<ffffffff81421d5f>] page_fault+0x1f/0x30
      
      mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
      and pagetable locks, were good enough before page migration (with its
      requirement that every migration entry be found) came in, and enough
      while migration always held mmap_sem; but not enough nowadays, when
      there's memory hotremove and compaction.
      
      The danger is that move_ptes() lets a migration entry dodge around
      behind remove_migration_pte()'s back, so it's in the old location when
      looking at the new, then in the new location when looking at the old.
      
      Either mremap's move_ptes() must additionally take anon_vma lock(), or
      migration's remove_migration_pte() must stop peeking for is_swap_entry()
      before it takes pagetable lock.
      
      Consensus chooses the latter: we prefer to add overhead to migration
      than to mremapping, which gets used by JVMs and by exec stack setup.
      Reported-and-tested-by: default avatarPaweł Sikora <pluto@agmk.net>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      486cf46f
  3. 19 Oct, 2011 19 commits
  4. 18 Oct, 2011 3 commits