1. 16 Jan, 2015 24 commits
  2. 08 Jan, 2015 16 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.18.2 · e609d3fc
      Greg Kroah-Hartman authored
      e609d3fc
    • Filipe Manana's avatar
      Btrfs: fix fs corruption on transaction abort if device supports discard · 3c6babf5
      Filipe Manana authored
      commit 678886bd upstream.
      
      When we abort a transaction we iterate over all the ranges marked as dirty
      in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
      from those trees, add them back (unpin) to the free space caches and, if
      the fs was mounted with "-o discard", perform a discard on those regions.
      Also, after adding the regions to the free space caches, a fitrim ioctl call
      can see those ranges in a block group's free space cache and perform a discard
      on the ranges, so the same issue can happen without "-o discard" as well.
      
      This causes corruption, affecting one or multiple btree nodes (in the worst
      case leaving the fs unmountable) because some of those ranges (the ones in
      the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
      referred by the last committed super block - breaking the rule that anything
      that was committed by a transaction is untouched until the next transaction
      commits successfully.
      
      I ran into this while running in a loop (for several hours) the fstest that
      I recently submitted:
      
        [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
      
      The corruption always happened when a transaction aborted and then fsck complained
      like this:
      
         _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
         *** fsck.btrfs output ***
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         read block failed check_tree_block
         Couldn't open file system
      
      In this case 94945280 corresponded to the root of a tree.
      Using frace what I observed was the following sequence of steps happened:
      
         1) transaction N started, fs_info->pinned_extents pointed to
            fs_info->freed_extents[0];
      
         2) node/eb 94945280 is created;
      
         3) eb is persisted to disk;
      
         4) transaction N commit starts, fs_info->pinned_extents now points to
            fs_info->freed_extents[1], and transaction N completes;
      
         5) transaction N + 1 starts;
      
         6) eb is COWed, and btrfs_free_tree_block() called for this eb;
      
         7) eb range (94945280 to 94945280 + 16Kb) is added to
            fs_info->pinned_extents (fs_info->freed_extents[1]);
      
         8) Something goes wrong in transaction N + 1, like hitting ENOSPC
            for example, and the transaction is aborted, turning the fs into
            readonly mode. The stack trace I got for example:
      
            [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
            [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
            [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
            [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
            [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
            [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
            [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
            (...)
            [112065.264493] ---[ end trace dd7903a975a31a08 ]---
            [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
            [112065.264997] BTRFS info (device sdc): forced readonly
      
         9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
            fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
            turn calls btrfs_destroy_pinned_extent();
      
         10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
             marked as dirty in fs_info->freed_extents[], and for each one
             it calls discard, if the fs was mounted with "-o discard", and
             adds the range to the free space cache of the respective block
             group;
      
         11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
             sees the free space entries and performs a discard;
      
         12) After an umount and mount (or fsck), our eb's location on disk was full
             of zeroes, and it should have been untouched, because it was marked as
             dirty in the fs_info->pinned_extents tree, and therefore used by the
             trees that the last committed superblock points to.
      
      Fix this by not performing a discard and not adding the ranges to the free space
      caches - it's useless from this point since the fs is now in readonly mode and
      we won't write free space caches to disk anymore (otherwise we would leak space)
      nor any new superblock. By not adding the ranges to the free space caches, it
      prevents other code paths from allocating that space and write to it as well,
      therefore being safer and simpler.
      
      This isn't a new problem, as it's been present since 2011 (git commit
      acce952b).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c6babf5
    • Josef Bacik's avatar
      Btrfs: make sure logged extents complete in the current transaction V3 · 9d15399d
      Josef Bacik authored
      commit 50d9aa99 upstream.
      
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are save, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d15399d
    • Josef Bacik's avatar
      Btrfs: do not move em to modified list when unpinning · 115f9146
      Josef Bacik authored
      commit a2804695 upstream.
      
      We use the modified list to keep track of which extents have been modified so we
      know which ones are candidates for logging at fsync() time.  Newly modified
      extents are added to the list at modification time, around the same time the
      ordered extent is created.  We do this so that we don't have to wait for ordered
      extents to complete before we know what we need to log.  The problem is when
      something like this happens
      
      log extent 0-4k on inode 1
      copy csum for 0-4k from ordered extent into log
      sync log
      commit transaction
      log some other extent on inode 1
      ordered extent for 0-4k completes and adds itself onto modified list again
      log changed extents
      see ordered extent for 0-4k has already been logged
      	at this point we assume the csum has been copied
      sync log
      crash
      
      On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
      which is the same one that we are replaying which also drops the csum, and then
      we won't find the csum in the log for that bytenr.  This of course causes us to
      have errors about not having csums for certain ranges of our inode.  So remove
      the modified list manipulation in unpin_extent_cache, any modified extents
      should have been added well before now, and we don't want them re-logged.  This
      fixes my test that I could reliably reproduce this problem with.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      115f9146
    • David Sterba's avatar
      btrfs: fix wrong accounting of raid1 data profile in statfs · 37ea7a1f
      David Sterba authored
      commit 0d95c1be upstream.
      
      The sizes that are obtained from space infos are in raw units and have
      to be adjusted according to the raid factor. This was missing for
      f_bavail and df reported doubled size for raid1.
      Reported-by: default avatarMartin Steigerwald <Martin@lichtvoll.de>
      Fixes: ba7b6e62 ("btrfs: adjust statfs calculations according to raid profiles")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      37ea7a1f
    • Josef Bacik's avatar
      Btrfs: make sure we wait on logged extents when fsycning two subvols · d77fe802
      Josef Bacik authored
      commit 9dba8cf1 upstream.
      
      If we have two fsync()'s race on different subvols one will do all of its work
      to get into the log_tree, wait on it's outstanding IO, and then allow the
      log_tree to finish it's commit.  The problem is we were just free'ing that
      subvols logged extents instead of waiting on them, so whoever lost the race
      wouldn't really have their data on disk.  Fix this by waiting properly instead
      of freeing the logged extents.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d77fe802
    • Michael Halcrow's avatar
      eCryptfs: Remove buggy and unnecessary write in file name decode routine · d7fad547
      Michael Halcrow authored
      commit 94208064 upstream.
      
      Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
      end of the allocated buffer during encrypted filename decoding. This
      fix corrects the issue by getting rid of the unnecessary 0 write when
      the current bit offset is 2.
      Signed-off-by: default avatarMichael Halcrow <mhalcrow@google.com>
      Reported-by: default avatarDmitry Chernenkov <dmitryc@google.com>
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7fad547
    • Tyler Hicks's avatar
      eCryptfs: Force RO mount when encrypted view is enabled · bbeb37ea
      Tyler Hicks authored
      commit 332b122d upstream.
      
      The ecryptfs_encrypted_view mount option greatly changes the
      functionality of an eCryptfs mount. Instead of encrypting and decrypting
      lower files, it provides a unified view of the encrypted files in the
      lower filesystem. The presence of the ecryptfs_encrypted_view mount
      option is intended to force a read-only mount and modifying files is not
      supported when the feature is in use. See the following commit for more
      information:
      
        e77a56dd [PATCH] eCryptfs: Encrypted passthrough
      
      This patch forces the mount to be read-only when the
      ecryptfs_encrypted_view mount option is specified by setting the
      MS_RDONLY flag on the superblock. Additionally, this patch removes some
      broken logic in ecryptfs_open() that attempted to prevent modifications
      of files when the encrypted view feature was in use. The check in
      ecryptfs_open() was not sufficient to prevent file modifications using
      system calls that do not operate on a file descriptor.
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Reported-by: default avatarPriya Bansal <p.bansal@samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bbeb37ea
    • Jan Kara's avatar
      udf: Check component length before reading it · 41ba2abb
      Jan Kara authored
      commit e237ec37 upstream.
      
      Check that length specified in a component of a symlink fits in the
      input buffer we are reading. Also properly ignore component length for
      component types that do not use it. Otherwise we read memory after end
      of buffer for corrupted udf image.
      Reported-by: default avatarCarl Henrik Lunde <chlunde@ping.uio.no>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      41ba2abb
    • Jan Kara's avatar
      udf: Verify symlink size before loading it · 53fbe4cb
      Jan Kara authored
      commit a1d47b26 upstream.
      
      UDF specification allows arbitrarily large symlinks. However we support
      only symlinks at most one block large. Check the length of the symlink
      so that we don't access memory beyond end of the symlink block.
      Reported-by: default avatarCarl Henrik Lunde <chlunde@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53fbe4cb
    • Jan Kara's avatar
      udf: Verify i_size when loading inode · a6a4afa5
      Jan Kara authored
      commit e159332b upstream.
      
      Verify that inode size is sane when loading inode with data stored in
      ICB. Otherwise we may get confused later when working with the inode and
      inode size is too big.
      Reported-by: default avatarCarl Henrik Lunde <chlunde@ping.uio.no>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6a4afa5
    • Jan Kara's avatar
      udf: Check path length when reading symlink · 1a927faa
      Jan Kara authored
      commit 0e5cc9a4 upstream.
      
      Symlink reading code does not check whether the resulting path fits into
      the page provided by the generic code. This isn't as easy as just
      checking the symlink size because of various encoding conversions we
      perform on path. So we have to check whether there is still enough space
      in the buffer on the fly.
      Reported-by: default avatarCarl Henrik Lunde <chlunde@ping.uio.no>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a927faa
    • Oleg Nesterov's avatar
      exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting · 9cc010ca
      Oleg Nesterov authored
      commit 24c037eb upstream.
      
      alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
      fails because disable_pid_allocation() was called by the exiting
      child_reaper.
      
      We could simply move get_pid_ns() down to successful return, but this fix
      tries to be as trivial as possible.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9cc010ca
    • Joonsoo Kim's avatar
      mm/CMA: fix boot regression due to physical address of high_memory · 2f4f9b92
      Joonsoo Kim authored
      commit 6b101e2a upstream.
      
      high_memory isn't direct mapped memory so retrieving it's physical address
      isn't appropriate.  But, it would be useful to check physical address of
      highmem boundary so it's justfiable to get physical address from it.  In
      x86, there is a validation check if CONFIG_DEBUG_VIRTUAL and it triggers
      following boot failure reported by Ingo.
      
        ...
        BUG: Int 6: CR2 00f06f53
        ...
        Call Trace:
          dump_stack+0x41/0x52
          early_idt_handler+0x6b/0x6b
          cma_declare_contiguous+0x33/0x212
          dma_contiguous_reserve_area+0x31/0x4e
          dma_contiguous_reserve+0x11d/0x125
          setup_arch+0x7b5/0xb63
          start_kernel+0xb8/0x3e6
          i386_start_kernel+0x79/0x7d
      
      To fix boot regression, this patch implements workaround to avoid
      validation check in x86 when retrieving physical address of high_memory.
      __pa_nodebug() used by this patch is implemented only in x86 so there is
      no choice but to use dirty #ifdef.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: default avatarIngo Molnar <mingo@kernel.org>
      Tested-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f4f9b92
    • Jan Kara's avatar
      ncpfs: return proper error from NCP_IOC_SETROOT ioctl · 522a8162
      Jan Kara authored
      commit a682e9c2 upstream.
      
      If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
      return value is then (in most cases) just overwritten before we return.
      This can result in reporting success to userspace although error happened.
      
      This bug was introduced by commit 2e54eb96 ("BKL: Remove BKL from
      ncpfs").  Propagate the errors correctly.
      
      Coverity id: 1226925.
      
      Fixes: 2e54eb96 ("BKL: Remove BKL from ncpfs")
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      522a8162
    • Rabin Vincent's avatar
      crypto: af_alg - fix backlog handling · fa06c84a
      Rabin Vincent authored
      commit 7e77bdeb upstream.
      
      If a request is backlogged, it's complete() handler will get called
      twice: once with -EINPROGRESS, and once with the final error code.
      
      af_alg's complete handler, unlike other users, does not handle the
      -EINPROGRESS but instead always completes the completion that recvmsg()
      is waiting on.  This can lead to a return to user space while the
      request is still pending in the driver.  If userspace closes the sockets
      before the requests are handled by the driver, this will lead to
      use-after-frees (and potential crashes) in the kernel due to the tfm
      having been freed.
      
      The crashes can be easily reproduced (for example) by reducing the max
      queue length in cryptod.c and running the following (from
      http://www.chronox.de/libkcapi.html) on AES-NI capable hardware:
      
       $ while true; do kcapi -x 1 -e -c '__ecb-aes-aesni' \
          -k 00000000000000000000000000000000 \
          -p 00000000000000000000000000000000 >/dev/null & done
      Signed-off-by: default avatarRabin Vincent <rabin.vincent@axis.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa06c84a