1. 21 Feb, 2017 4 commits
  2. 16 Feb, 2017 1 commit
    • Dan Williams's avatar
      async_tx: deprecate broken support for channel switching · b802c841
      Dan Williams authored
      Back in 2011, Russell pointed out that the "async_tx channel switch"
      capability was violating expectations of the dma mapping api [1]. At the
      time the existing uses were reviewed as still usable, but that longer
      term we needed a rework of the raid offload implementation. While some
      of the framework for a fixed implementation was introduced in 2012 [2],
      the wider rewrite never materialized.
      
      There continues to be interest in raid offload with new dma/raid engine
      drivers being submitted. Those drivers must not build on top of the
      broken channel switching capability.
      
      Prevent async_tx from using an offload engine if the channel switching
      capability is enabled. This still allows the engine to be used for other
      purposes, but the broken way async_tx uses these engines for raid will
      be disabled. For configurations where this causes a performance
      regression the only solution is to start the work of eliminating the
      async_tx api and moving channel management into the raid code directly
      where it can manage marshalling an operation stream between multiple dma
      channels.
      
      [1]: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-January/036753.html
      [2]: https://lkml.org/lkml/2012/12/6/71
      
      Cc: Anatolij Gustschin <agust@denx.de>
      Cc: Anup Patel <anup.patel@broadcom.com>
      Cc: Rameshwar Prasad Sahu <rsahu@apm.com>
      Cc: Saeed Bishara <saeed.bishara@gmail.com>
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Reported-by: default avatarRussell King <linux@armlinux.org.uk>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarVinod Koul <vinod.koul@intel.com>
      b802c841
  3. 14 Feb, 2017 1 commit
  4. 05 Feb, 2017 1 commit
  5. 31 Jan, 2017 1 commit
  6. 25 Jan, 2017 9 commits
  7. 14 Jan, 2017 2 commits
  8. 10 Jan, 2017 2 commits
  9. 03 Jan, 2017 6 commits
  10. 02 Jan, 2017 6 commits
  11. 01 Jan, 2017 2 commits
    • Linus Torvalds's avatar
      Linux 4.10-rc2 · 0c744ea4
      Linus Torvalds authored
      0c744ea4
    • Linus Torvalds's avatar
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 4759d386
      Linus Torvalds authored
      Pull DAX updates from Dan Williams:
       "The completion of Jan's DAX work for 4.10.
      
        As I mentioned in the libnvdimm-for-4.10 pull request, these are some
        final fixes for the DAX dirty-cacheline-tracking invalidation work
        that was merged through the -mm, ext4, and xfs trees in -rc1. These
        patches were prepared prior to the merge window, but we waited for
        4.10-rc1 to have a stable merge base after all the prerequisites were
        merged.
      
        Quoting Jan on the overall changes in these patches:
      
           "So I'd like all these 6 patches to go for rc2. The first three
            patches fix invalidation of exceptional DAX entries (a bug which
            is there for a long time) - without these patches data loss can
            occur on power failure even though user called fsync(2). The other
            three patches change locking of DAX faults so that ->iomap_begin()
            is called in a more relaxed locking context and we are safe to
            start a transaction there for ext4"
      
        These have received a build success notification from the kbuild
        robot, and pass the latest libnvdimm unit tests. There have not been
        any -next releases since -rc1, so they have not appeared there"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        ext4: Simplify DAX fault path
        dax: Call ->iomap_begin without entry lock during dax fault
        dax: Finish fault completely when loading holes
        dax: Avoid page invalidation races and unnecessary radix tree traversals
        mm: Invalidate DAX radix tree entries only if appropriate
        ext2: Return BH_New buffers for zeroed blocks
      4759d386
  12. 30 Dec, 2016 2 commits
  13. 29 Dec, 2016 2 commits
    • Olof Johansson's avatar
      mm/filemap: fix parameters to test_bit() · 98473f9f
      Olof Johansson authored
       mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
        mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
          return test_bit(PG_waiters);
               ^~~~~~~~
      
      Fixes: b91e1302 ('mm: optimize PageWaiters bit use for unlock_page()')
      Signed-off-by: default avatarOlof Johansson <olof@lixom.net>
      Brown-paper-bag-by: default avatarLinus Torvalds <dummy@duh.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98473f9f
    • Linus Torvalds's avatar
      mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Linus Torvalds authored
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hickup x86 has, for
      example, but some other architectures may not even care.
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, just the unlock_page() costs were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: default avatarNick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b91e1302
  14. 28 Dec, 2016 1 commit