1. 26 Sep, 2016 11 commits
    • Naohiro Aota's avatar
      btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs · 5d8eb6fe
      Naohiro Aota authored
      Currently, btrfs_relocate_chunk() is removing relocated BG by itself. But
      the work can be done by btrfs_delete_unused_bgs() (and it's better since it
      trim the BG). Let's dedupe the code.
      
      While btrfs_delete_unused_bgs() is already hitting the relocated BG, it
      skip the BG since the BG has "ro" flag set (to keep balancing BG intact).
      On the other hand, btrfs cannot drop "ro" flag here to prevent additional
      writes. So this patch make use of "removed" flag.
      btrfs_delete_unused_bgs() now detect the flag to distinguish whether a
      read-only BG is relocating or not.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@hgst.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5d8eb6fe
    • Liu Bo's avatar
      Btrfs: bail out if block group has different mixed flag · 49303381
      Liu Bo authored
      Currently we allow inconsistence about mixed flag
       (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA).
      
      We'd get ENOSPC if block group has mixed flag and btrfs doesn't.
      If that happens, we have one space_info with mixed flag and another
      space_info only with BTRFS_BLOCK_GROUP_METADATA, and
      global_block_rsv.space_info points to the latter one, but all bytes
      from block_group contributes to the mixed space_info, thus all the
      allocation will fail with ENOSPC.
      
      This adds a check for the above case.
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      [ updated message ]
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49303381
    • Liu Bo's avatar
      Btrfs: fix memory leak in reading btree blocks · 2571e739
      Liu Bo authored
      So we can read a btree block via readahead or intentional read,
      and we can end up with a memory leak when something happens as
      follows,
      1) readahead starts to read block A but does not wait for read
         completion,
      2) btree_readpage_end_io_hook finds that block A is corrupted,
         and it needs to clear all block A's pages' uptodate bit.
      3) meanwhile an intentional read kicks in and checks block A's
         pages' uptodate to decide which page needs to be read.
      4) when some pages have the uptodate bit during 3)'s check so
         3) doesn't count them for eb->io_pages, but they are later
         cleared by 2) so we has to readpage on the page, we get
         the wrong eb->io_pages which results in a memory leak of
         this block.
      
      This fixes the problem by firstly getting all pages's locking and
      then checking pages' uptodate bit.
      
         t1(readahead)                              t2(readahead endio)                                       t3(the following read)
      read_extent_buffer_pages                    end_bio_extent_readpage
        for pg in eb:                                for page 0,1,2 in eb:
            if pg is uptodate:                           btree_readpage_end_io_hook(pg)
                num_reads++                              if uptodate:
        eb->io_pages = num_reads                             SetPageUptodate(pg)              _______________
        for pg in eb:                                for page 3 in eb:                                     read_extent_buffer_pages
             if pg is NOT uptodate:                      btree_readpage_end_io_hook(pg)                       for pg in eb:
                 __extent_read_full_page(pg)                 sanity check reports something wrong                 if pg is uptodate:
                                                             clear_extent_buffer_uptodate(eb)                         num_reads++
                                                                 for pg in eb:                                eb->io_pages = num_reads
                                                                     ClearPageUptodate(page)  _______________
                                                                                                              for pg in eb:
                                                                                                                  if pg is NOT uptodate:
                                                                                                                      __extent_read_full_page(pg)
      
      So t3's eb->io_pages is not consistent with the number of pages it's reading,
      and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
      number so that we're not able to free the eb.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2571e739
    • Liu Bo's avatar
      Btrfs: remove BUG() in raid56 · e46a28ca
      Liu Bo authored
      This BUG() has been triggered by a fuzz testing image, which contains
      an invalid chunk type, ie. a single stripe chunk has the raid6 type.
      
      Btrfs can handle this gracefully by returning -EIO, so besides using
      btrfs_warn to give us more debugging information rather than a single
      BUG(), we can return error properly.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e46a28ca
    • Lu Fengqi's avatar
      btrfs: fix check_shared for fiemap ioctl · afce772e
      Lu Fengqi authored
      Only in the case of different root_id or different object_id, check_shared
      identified extent as the shared. However, If a extent was referred by
      different offset of same file, it should also be identified as shared.
      In addition, check_shared's loop scale is at least n^3, so if a extent
      has too many references, even causes soft hang up.
      
      First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
      if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
      Then individually add the on-disk reference(inline/keyed) to the ref_tree
      and calculate the unique_refs of the ref_tree to check if the unique_refs
      is greater than one.Because once there are two references to return
      SHARED, so the time complexity is close to the constant.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      afce772e
    • David Sterba's avatar
    • Eric Sandeen's avatar
      btrfs: fix perms on demonstration debugfs interface · 07f6a480
      Eric Sandeen authored
      btrfs provides a helpful demonstration of how to export
      a global variable via debugfs; however, it is unique among
      other debugfs files in that it is world-writable, which causes
      some concern to people who are not familiar with its purpose.
      
      Fix it so that it is only user-writable.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07f6a480
    • Liu Bo's avatar
      Btrfs: fix memory leak of block group cache · c79a1751
      Liu Bo authored
      While processing delayed refs, we may update block group's statistics
      and attach it to cur_trans->dirty_bgs, and later writing dirty block
      groups will process the list, which happens during
      btrfs_commit_transaction().
      
      For whatever reason, the transaction is aborted and dirty_bgs
      is not processed in cleanup_transaction(), we end up with memory leak
      of these dirty block group cache.
      
      Since btrfs_start_dirty_block_groups() doesn't make it go to the commit
      critical section, this also adds the cleanup work inside it.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c79a1751
    • Linus Torvalds's avatar
      Linux 4.8-rc8 · 08895a8b
      Linus Torvalds authored
      08895a8b
    • Linus Torvalds's avatar
      Merge tag 'trace-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 4c04b4b5
      Linus Torvalds authored
      Pull tracefs fixes from Steven Rostedt:
       "Al Viro has been looking at the tracefs code, and has pointed out some
        issues.  This contains one fix by me and one by Al.  I'm sure that
        he'll come up with more but for now I tested these patches and they
        don't appear to have any negative impact on tracing"
      
      * tag 'trace-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        fix memory leaks in tracing_buffers_splice_read()
        tracing: Move mutex to protect against resetting of seq data
      4c04b4b5
    • Dave Chinner's avatar
      fault_in_multipages_readable() throws set-but-unused error · 90b75db6
      Dave Chinner authored
      When building XFS with -Werror, it now fails with:
      
        include/linux/pagemap.h: In function 'fault_in_multipages_readable':
        include/linux/pagemap.h:602:16: error: variable 'c' set but not used [-Werror=unused-but-set-variable]
          volatile char c;
                        ^
      
      This is a regression caused by commit e23d4159 ("fix
      fault_in_multipages_...() on architectures with no-op access_ok()").
      Fix it by re-adding the "(void)c" trick taht was previously used to make
      the compiler think the variable is used.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90b75db6
  2. 25 Sep, 2016 7 commits
    • Lorenzo Stoakes's avatar
      mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing · 38e08854
      Lorenzo Stoakes authored
      The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
      defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
      PMDs respectively as requiring balancing upon a subsequent page fault.
      User-defined PROT_NONE memory regions which also have this flag set will
      not normally invoke the NUMA balancing code as do_page_fault() will send
      a segfault to the process before handle_mm_fault() is even called.
      
      However if access_remote_vm() is invoked to access a PROT_NONE region of
      memory, handle_mm_fault() is called via faultin_page() and
      __get_user_pages() without any access checks being performed, meaning
      the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
      region.
      
      A simple means of triggering this problem is to access PROT_NONE mmap'd
      memory using /proc/self/mem which reliably results in the NUMA handling
      functions being invoked when CONFIG_NUMA_BALANCING is set.
      
      This issue was reported in bugzilla (issue 99101) which includes some
      simple repro code.
      
      There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
      added at commit c0e7cad9 to avoid accidentally provoking strange
      behaviour by attempting to apply NUMA balancing to pages that are in
      fact PROT_NONE.  The BUG_ON()'s are consistently triggered by the repro.
      
      This patch moves the PROT_NONE check into mm/memory.c rather than
      invoking BUG_ON() as faulting in these pages via faultin_page() is a
      valid reason for reaching the NUMA check with the PROT_NONE page table
      flag set and is therefore not always a bug.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101Reported-by: default avatarTrevor Saunders <tbsaunde@tbsaunde.org>
      Signed-off-by: default avatarLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38e08854
    • Linus Torvalds's avatar
      Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 831e45d8
      Linus Torvalds authored
      Pull MIPS fixes from Ralf Baechle:
       "A round of 4.8 fixes:
      
        MIPS generic code:
         - Add a missing ".set pop" in an early commit
         - Fix memory regions reaching top of physical
         - MAAR: Fix address alignment
         - vDSO: Fix Malta EVA mapping to vDSO page structs
         - uprobes: fix incorrect uprobe brk handling
         - uprobes: select HAVE_REGS_AND_STACK_ACCESS_API
         - Avoid a BUG warning during PR_SET_FP_MODE prctl
         - SMP: Fix possibility of deadlock when bringing CPUs online
         - R6: Remove compact branch policy Kconfig entries
         - Fix size calc when avoiding IPIs for small icache flushes
         - Fix pre-r6 emulation FPU initialisation
         - Fix delay slot emulation count in debugfs
      
        ATH79:
         - Fix test for error return of clk_register_fixed_factor.
      
        Octeon:
         - Fix kernel header to work for VDSO build.
         - Fix initialization of platform device probing.
      
        paravirt:
         - Fix undefined reference to smp_bootstrap"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: Fix delay slot emulation count in debugfs
        MIPS: SMP: Fix possibility of deadlock when bringing CPUs online
        MIPS: Fix pre-r6 emulation FPU initialisation
        MIPS: vDSO: Fix Malta EVA mapping to vDSO page structs
        MIPS: Select HAVE_REGS_AND_STACK_ACCESS_API
        MIPS: Octeon: Fix platform bus probing
        MIPS: Octeon: mangle-port: fix build failure with VDSO code
        MIPS: Avoid a BUG warning during prctl(PR_SET_FP_MODE, ...)
        MIPS: c-r4k: Fix size calc when avoiding IPIs for small icache flushes
        MIPS: Add a missing ".set pop" in an early commit
        MIPS: paravirt: Fix undefined reference to smp_bootstrap
        MIPS: Remove compact branch policy Kconfig entries
        MIPS: MAAR: Fix address alignment
        MIPS: Fix memory regions reaching top of physical
        MIPS: uprobes: fix incorrect uprobe brk handling
        MIPS: ath79: Fix test for error return of clk_register_fixed_factor().
      831e45d8
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.8-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 751b9a5d
      Linus Torvalds authored
      Pull one more powerpc fix from Michael Ellerman:
       "powernv/pci: Fix m64 checks for SR-IOV and window alignment from
        Russell Currey"
      
      * tag 'powerpc-4.8-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv/pci: Fix m64 checks for SR-IOV and window alignment
      751b9a5d
    • Linus Torvalds's avatar
      radix tree: fix sibling entry handling in radix_tree_descend() · 8d2c0d36
      Linus Torvalds authored
      The fixes to the radix tree test suite show that the multi-order case is
      broken.  The basic reason is that the radix tree code uses tagged
      pointers with the "internal" bit in the low bits, and calculating the
      pointer indices was supposed to mask off those bits.  But gcc will
      notice that we then use the index to re-create the pointer, and will
      avoid doing the arithmetic and use the tagged pointer directly.
      
      This cleans the code up, using the existing is_sibling_entry() helper to
      validate the sibling pointer range (instead of open-coding it), and
      using entry_to_node() to mask off the low tag bit from the pointer.  And
      once you do that, you might as well just use the now cleaned-up pointer
      directly.
      
      [ Side note: the multi-order code isn't actually ever used in the kernel
        right now, and the only reason I didn't just delete all that code is
        that Kirill Shutemov piped up and said:
      
          "Well, my ext4-with-huge-pages patchset[1] uses multi-order entries.
           It also converts shmem-with-huge-pages and hugetlb to them.
      
           I'm okay with converting it to other mechanism, but I need
           something.  (I looked into Konstantin's RFC patchset[2].  It looks
           okay, but I don't feel myself qualified to review it as I don't
           know much about radix-tree internals.)"
      
        [1] http://lkml.kernel.org/r/20160915115523.29737-1-kirill.shutemov@linux.intel.com
        [2] http://lkml.kernel.org/r/147230727479.9957.1087787722571077339.stgit@zurg ]
      Reported-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Cedric Blancher <cedric.blancher@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d2c0d36
    • Matthew Wilcox's avatar
      radix tree test suite: Test radix_tree_replace_slot() for multiorder entries · 62fd5258
      Matthew Wilcox authored
      When we replace a multiorder entry, check that all indices reflect the
      new value.
      
      Also, compile the test suite with -O2, which shows other problems with
      the code due to some dodgy pointer operations in the radix tree code.
      Signed-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62fd5258
    • Al Viro's avatar
      fix memory leaks in tracing_buffers_splice_read() · 1ae2293d
      Al Viro authored
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      1ae2293d
    • Steven Rostedt (Red Hat)'s avatar
      tracing: Move mutex to protect against resetting of seq data · 1245800c
      Steven Rostedt (Red Hat) authored
      The iter->seq can be reset outside the protection of the mutex. So can
      reading of user data. Move the mutex up to the beginning of the function.
      
      Fixes: d7350c3f ("tracing/core: make the read callbacks reentrants")
      Cc: stable@vger.kernel.org # 2.6.30+
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      1245800c
  3. 24 Sep, 2016 10 commits
  4. 23 Sep, 2016 12 commits