1. 26 Jan, 2015 10 commits
  2. 14 Jan, 2015 7 commits
  3. 13 Jan, 2015 2 commits
    • Jiri Slaby's avatar
      Linux 3.12.36 · f6101957
      Jiri Slaby authored
      f6101957
    • Kirill A. Shutemov's avatar
      thp: close race between split and zap huge pages · 198717b6
      Kirill A. Shutemov authored
      commit b5a8cad3 upstream.
      
      [stable 3.12 note]
      This commit was supposed to fix a completely other issue. But in 3.12,
      with commit f72e7dcd (mm: let
      mm_find_pmd fix buggy race with THP fault), we need this commit as
      well (it fixes the issue as a by-product). Hugh Dickins writes:
      <== citation starts here>
      Fine for this to go in, but there is one catch, which I discovered
      when backporting to v3.11: it needed one more hunk.  I haven't checked
      your base tree, but if this applies then I believe you need it - most
      of the time no problem, but it can case page migration to fail to find
      a migration entry it inserted earlier, then BUG_ON(!PageLocked(p)) in
      migration_entry_to_page() soon after.  Here's what I wrote back then:
      
      Note on rebase to v3.11: added a hunk to replace the use of
      mm_find_pmd() in page_check_address_pmd().  This call had been
      similarly replaced by the time of my v3.16 commit, in Kirill
      Shutemov's v3.15 b5a8cad3 ("thp: close race between split and zap
      huge pages"): which we do not need as such, since it's fixing v3.13
      117b0791 ("mm, thp: move ptl taking inside
      page_check_address_pmd()"), from a split page-table-lock series we are
      not backporting.  But without this additional hunk, rmap sometimes
      broke when the new semantic for mm_find_pmd() was used here.
      <== end of citation>
      
      But instead of appending hunks to commits, I am taking a full,
      backported version of commit b5a8cad3 with this note prepended.
      
      So the changelog of b5a8cad3 is left below, but does not apply to
      3.12 yet.
      [=== stable 3.12 note ends here]
      
      Sasha Levin has reported two THP BUGs[1][2].  I believe both of them
      have the same root cause.  Let's look to them one by one.
      
      The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".  It's
      BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().  From my
      testing I see that page_mapcount() is higher than mapcount here.
      
      I think it happens due to race between zap_huge_pmd() and
      page_check_address_pmd().  page_check_address_pmd() misses PMD which is
      under zap:
      
      	CPU0						CPU1
      						zap_huge_pmd()
      						  pmdp_get_and_clear()
      __split_huge_page()
        anon_vma_interval_tree_foreach()
          __split_huge_page_splitting()
            page_check_address_pmd()
              mm_find_pmd()
      	  /*
      	   * We check if PMD present without taking ptl: no
      	   * serialization against zap_huge_pmd(). We miss this PMD,
      	   * it's not accounted to 'mapcount' in __split_huge_page().
      	   */
      	  pmd_present(pmd) == 0
      
        BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!
      
      						  page_remove_rmap(page)
      						    atomic_add_negative(-1, &page->_mapcount)
      
      The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
      It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
      
      This happens in similar way:
      
      	CPU0						CPU1
      						zap_huge_pmd()
      						  pmdp_get_and_clear()
      						  page_remove_rmap(page)
      						    atomic_add_negative(-1, &page->_mapcount)
      __split_huge_page()
        anon_vma_interval_tree_foreach()
          __split_huge_page_splitting()
            page_check_address_pmd()
              mm_find_pmd()
      	  pmd_present(pmd) == 0	/* The same comment as above */
        /*
         * No crash this time since we already decremented page->_mapcount in
         * zap_huge_pmd().
         */
        BUG_ON(mapcount != page_mapcount(page))
      
        /*
         * We split the compound page here into small pages without
         * serialization against zap_huge_pmd()
         */
        __split_huge_page_refcount()
      						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
      
      So my understanding the problem is pmd_present() check in mm_find_pmd()
      without taking page table lock.
      
      The bug was introduced by me commit with commit 117b0791. Sorry for
      that. :(
      
      Let's open code mm_find_pmd() in page_check_address_pmd() and do the
      check under page table lock.
      
      Note that __page_check_address() does the same for PTE entires
      if sync != 0.
      
      I've stress tested split and zap code paths for 36+ hours by now and
      don't see crashes with the patch applied. Before it took <20 min to
      trigger the first bug and few hours for second one (if we ignore
      first).
      
      [1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
      [2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      198717b6
  4. 10 Jan, 2015 1 commit
  5. 09 Jan, 2015 3 commits
    • Hugh Dickins's avatar
      mm: let mm_find_pmd fix buggy race with THP fault · e1c34dac
      Hugh Dickins authored
      commit f72e7dcd upstream.
      
      Trinity has reported:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
          CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                                  3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
          lock_acquire (arch/x86/include/asm/current.h:14
                        kernel/locking/lockdep.c:3602)
          _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                          kernel/locking/spinlock.c:151)
          remove_migration_pte (mm/migrate.c:137)
          rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
          remove_migration_ptes (mm/migrate.c:224)
          migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
          migrate_misplaced_page (mm/migrate.c:1733)
          __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
          handle_mm_fault (mm/memory.c:3948)
          __get_user_pages (mm/memory.c:1851)
          __mlock_vma_pages_range (mm/mlock.c:255)
          __mm_populate (mm/mlock.c:711)
          SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
      
      I believe this comes about because, whereas collapsing and splitting THP
      functions take anon_vma lock in write mode (which excludes concurrent
      rmap walks), faulting THP functions (write protection and misplaced
      NUMA) do not - and mostly they do not need to.
      
      But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
      an instant (indeed, for a long instant, given the inter-CPU TLB flush in
      there), leaves *pmd neither present not trans_huge.
      
      Which can confuse a concurrent rmap walk, as when removing migration
      ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
      to insert, anon_vmas containing THPs are in no way segregated from
      4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
      that instant when a trans_huge pmd is temporarily absent.
      
      I don't think we need strengthen the locking at the THP end: it's easily
      handled with an ACCESS_ONCE() before testing both conditions.
      
      And since mm_find_pmd() had only one caller who wanted a THP rather than
      a pmd, let's slightly repurpose it to fail when it hits a THP or
      non-present pmd, and open code split_huge_page_address() again.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e1c34dac
    • Johan Hovold's avatar
      mfd: viperboard: Fix platform-device id collision · 0702a76d
      Johan Hovold authored
      commit b6684228 upstream.
      
      Allow more than one viperboard to be connected by registering with
      PLATFORM_DEVID_AUTO instead of PLATFORM_DEVID_NONE.
      
      The subdevices are currently registered with PLATFORM_DEVID_NONE, which
      will cause a name collision on the platform bus when a second viperboard
      is plugged in:
      
      viperboard 1-2.4:1.0: version 0.00 found at bus 001 address 004
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 181 at /home/johan/work/omicron/src/linux/fs/sysfs/dir.c:31 sysfs_warn_dup+0x74/0x84()
      sysfs: cannot create duplicate filename '/bus/platform/devices/viperboard-gpio'
      Modules linked in: i2c_viperboard viperboard netconsole [last unloaded: viperboard]
      CPU: 0 PID: 181 Comm: bash Tainted: G        W      3.17.0-rc6 #1
      [<c0016bf4>] (unwind_backtrace) from [<c0013860>] (show_stack+0x20/0x24)
      [<c0013860>] (show_stack) from [<c04305f8>] (dump_stack+0x24/0x28)
      [<c04305f8>] (dump_stack) from [<c0040fb4>] (warn_slowpath_common+0x80/0x98)
      [<c0040fb4>] (warn_slowpath_common) from [<c004100c>] (warn_slowpath_fmt+0x40/0x48)
      [<c004100c>] (warn_slowpath_fmt) from [<c016f1bc>] (sysfs_warn_dup+0x74/0x84)
      [<c016f1bc>] (sysfs_warn_dup) from [<c016f548>] (sysfs_do_create_link_sd.isra.2+0xcc/0xd0)
      [<c016f548>] (sysfs_do_create_link_sd.isra.2) from [<c016f588>] (sysfs_create_link+0x3c/0x48)
      [<c016f588>] (sysfs_create_link) from [<c02867ec>] (bus_add_device+0x12c/0x1e0)
      [<c02867ec>] (bus_add_device) from [<c0284820>] (device_add+0x410/0x584)
      [<c0284820>] (device_add) from [<c0289440>] (platform_device_add+0xd8/0x26c)
      [<c0289440>] (platform_device_add) from [<c02a5ae4>] (mfd_add_device+0x240/0x344)
      [<c02a5ae4>] (mfd_add_device) from [<c02a5ce0>] (mfd_add_devices+0xb8/0x110)
      [<c02a5ce0>] (mfd_add_devices) from [<bf00d1c8>] (vprbrd_probe+0x160/0x1b0 [viperboard])
      [<bf00d1c8>] (vprbrd_probe [viperboard]) from [<c030c000>] (usb_probe_interface+0x1bc/0x2a8)
      [<c030c000>] (usb_probe_interface) from [<c028768c>] (driver_probe_device+0x14c/0x3ac)
      [<c028768c>] (driver_probe_device) from [<c02879e4>] (__driver_attach+0xa4/0xa8)
      [<c02879e4>] (__driver_attach) from [<c0285698>] (bus_for_each_dev+0x70/0xa4)
      [<c0285698>] (bus_for_each_dev) from [<c0287030>] (driver_attach+0x2c/0x30)
      [<c0287030>] (driver_attach) from [<c030a288>] (usb_store_new_id+0x170/0x1ac)
      [<c030a288>] (usb_store_new_id) from [<c030a2f8>] (new_id_store+0x34/0x3c)
      [<c030a2f8>] (new_id_store) from [<c02853ec>] (drv_attr_store+0x30/0x3c)
      [<c02853ec>] (drv_attr_store) from [<c016eaa8>] (sysfs_kf_write+0x5c/0x60)
      [<c016eaa8>] (sysfs_kf_write) from [<c016dc68>] (kernfs_fop_write+0xd4/0x194)
      [<c016dc68>] (kernfs_fop_write) from [<c010fe40>] (vfs_write+0xb4/0x1c0)
      [<c010fe40>] (vfs_write) from [<c01104a8>] (SyS_write+0x4c/0xa0)
      [<c01104a8>] (SyS_write) from [<c000f900>] (ret_fast_syscall+0x0/0x48)
      ---[ end trace 98e8603c22d65817 ]---
      viperboard 1-2.4:1.0: Failed to add mfd devices to core.
      viperboard: probe of 1-2.4:1.0 failed with error -17
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarLee Jones <lee.jones@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0702a76d
    • Linus Walleij's avatar
      mfd: stmpe: Fix STMPE24xx GPMR LSB · 83ee5961
      Linus Walleij authored
      commit 871c3cf4 upstream.
      
      The least significat byte of the GPIO value read register
      on the STMPE24xx series is on addres 0xA4 not 0xA5. Correct
      against datasheet and tested on the STMPE2401 hardware.
      Signed-off-by: default avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: default avatarLee Jones <lee.jones@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      83ee5961
  6. 07 Jan, 2015 17 commits
    • Filipe Manana's avatar
      Btrfs: fix fs corruption on transaction abort if device supports discard · bc5e18c1
      Filipe Manana authored
      commit 678886bd upstream.
      
      When we abort a transaction we iterate over all the ranges marked as dirty
      in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
      from those trees, add them back (unpin) to the free space caches and, if
      the fs was mounted with "-o discard", perform a discard on those regions.
      Also, after adding the regions to the free space caches, a fitrim ioctl call
      can see those ranges in a block group's free space cache and perform a discard
      on the ranges, so the same issue can happen without "-o discard" as well.
      
      This causes corruption, affecting one or multiple btree nodes (in the worst
      case leaving the fs unmountable) because some of those ranges (the ones in
      the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
      referred by the last committed super block - breaking the rule that anything
      that was committed by a transaction is untouched until the next transaction
      commits successfully.
      
      I ran into this while running in a loop (for several hours) the fstest that
      I recently submitted:
      
        [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
      
      The corruption always happened when a transaction aborted and then fsck complained
      like this:
      
         _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
         *** fsck.btrfs output ***
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         read block failed check_tree_block
         Couldn't open file system
      
      In this case 94945280 corresponded to the root of a tree.
      Using frace what I observed was the following sequence of steps happened:
      
         1) transaction N started, fs_info->pinned_extents pointed to
            fs_info->freed_extents[0];
      
         2) node/eb 94945280 is created;
      
         3) eb is persisted to disk;
      
         4) transaction N commit starts, fs_info->pinned_extents now points to
            fs_info->freed_extents[1], and transaction N completes;
      
         5) transaction N + 1 starts;
      
         6) eb is COWed, and btrfs_free_tree_block() called for this eb;
      
         7) eb range (94945280 to 94945280 + 16Kb) is added to
            fs_info->pinned_extents (fs_info->freed_extents[1]);
      
         8) Something goes wrong in transaction N + 1, like hitting ENOSPC
            for example, and the transaction is aborted, turning the fs into
            readonly mode. The stack trace I got for example:
      
            [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
            [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
            [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
            [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
            [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
            [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
            [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
            (...)
            [112065.264493] ---[ end trace dd7903a975a31a08 ]---
            [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
            [112065.264997] BTRFS info (device sdc): forced readonly
      
         9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
            fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
            turn calls btrfs_destroy_pinned_extent();
      
         10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
             marked as dirty in fs_info->freed_extents[], and for each one
             it calls discard, if the fs was mounted with "-o discard", and
             adds the range to the free space cache of the respective block
             group;
      
         11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
             sees the free space entries and performs a discard;
      
         12) After an umount and mount (or fsck), our eb's location on disk was full
             of zeroes, and it should have been untouched, because it was marked as
             dirty in the fs_info->pinned_extents tree, and therefore used by the
             trees that the last committed superblock points to.
      
      Fix this by not performing a discard and not adding the ranges to the free space
      caches - it's useless from this point since the fs is now in readonly mode and
      we won't write free space caches to disk anymore (otherwise we would leak space)
      nor any new superblock. By not adding the ranges to the free space caches, it
      prevents other code paths from allocating that space and write to it as well,
      therefore being safer and simpler.
      
      This isn't a new problem, as it's been present since 2011 (git commit
      acce952b).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bc5e18c1
    • Josef Bacik's avatar
      Btrfs: do not move em to modified list when unpinning · f2595264
      Josef Bacik authored
      commit a2804695 upstream.
      
      We use the modified list to keep track of which extents have been modified so we
      know which ones are candidates for logging at fsync() time.  Newly modified
      extents are added to the list at modification time, around the same time the
      ordered extent is created.  We do this so that we don't have to wait for ordered
      extents to complete before we know what we need to log.  The problem is when
      something like this happens
      
      log extent 0-4k on inode 1
      copy csum for 0-4k from ordered extent into log
      sync log
      commit transaction
      log some other extent on inode 1
      ordered extent for 0-4k completes and adds itself onto modified list again
      log changed extents
      see ordered extent for 0-4k has already been logged
      	at this point we assume the csum has been copied
      sync log
      crash
      
      On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
      which is the same one that we are replaying which also drops the csum, and then
      we won't find the csum in the log for that bytenr.  This of course causes us to
      have errors about not having csums for certain ranges of our inode.  So remove
      the modified list manipulation in unpin_extent_cache, any modified extents
      should have been added well before now, and we don't want them re-logged.  This
      fixes my test that I could reliably reproduce this problem with.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f2595264
    • Michael Halcrow's avatar
      eCryptfs: Remove buggy and unnecessary write in file name decode routine · 8ffea99d
      Michael Halcrow authored
      commit 94208064 upstream.
      
      Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
      end of the allocated buffer during encrypted filename decoding. This
      fix corrects the issue by getting rid of the unnecessary 0 write when
      the current bit offset is 2.
      Signed-off-by: default avatarMichael Halcrow <mhalcrow@google.com>
      Reported-by: default avatarDmitry Chernenkov <dmitryc@google.com>
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8ffea99d
    • Tyler Hicks's avatar
      eCryptfs: Force RO mount when encrypted view is enabled · c1b5e3f8
      Tyler Hicks authored
      commit 332b122d upstream.
      
      The ecryptfs_encrypted_view mount option greatly changes the
      functionality of an eCryptfs mount. Instead of encrypting and decrypting
      lower files, it provides a unified view of the encrypted files in the
      lower filesystem. The presence of the ecryptfs_encrypted_view mount
      option is intended to force a read-only mount and modifying files is not
      supported when the feature is in use. See the following commit for more
      information:
      
        e77a56dd [PATCH] eCryptfs: Encrypted passthrough
      
      This patch forces the mount to be read-only when the
      ecryptfs_encrypted_view mount option is specified by setting the
      MS_RDONLY flag on the superblock. Additionally, this patch removes some
      broken logic in ecryptfs_open() that attempted to prevent modifications
      of files when the encrypted view feature was in use. The check in
      ecryptfs_open() was not sufficient to prevent file modifications using
      system calls that do not operate on a file descriptor.
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Reported-by: default avatarPriya Bansal <p.bansal@samsung.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c1b5e3f8
    • Jan Kara's avatar
      udf: Verify symlink size before loading it · 2bf2e3b9
      Jan Kara authored
      commit a1d47b26 upstream.
      
      UDF specification allows arbitrarily large symlinks. However we support
      only symlinks at most one block large. Check the length of the symlink
      so that we don't access memory beyond end of the symlink block.
      Reported-by: default avatarCarl Henrik Lunde <chlunde@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2bf2e3b9
    • Oleg Nesterov's avatar
      exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting · 259c1c87
      Oleg Nesterov authored
      commit 24c037eb upstream.
      
      alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
      fails because disable_pid_allocation() was called by the exiting
      child_reaper.
      
      We could simply move get_pid_ns() down to successful return, but this fix
      tries to be as trivial as possible.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      259c1c87
    • Jan Kara's avatar
      ncpfs: return proper error from NCP_IOC_SETROOT ioctl · 6186e536
      Jan Kara authored
      commit a682e9c2 upstream.
      
      If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
      return value is then (in most cases) just overwritten before we return.
      This can result in reporting success to userspace although error happened.
      
      This bug was introduced by commit 2e54eb96 ("BKL: Remove BKL from
      ncpfs").  Propagate the errors correctly.
      
      Coverity id: 1226925.
      
      Fixes: 2e54eb96 ("BKL: Remove BKL from ncpfs")
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6186e536
    • Rabin Vincent's avatar
      crypto: af_alg - fix backlog handling · 4f62b27e
      Rabin Vincent authored
      commit 7e77bdeb upstream.
      
      If a request is backlogged, it's complete() handler will get called
      twice: once with -EINPROGRESS, and once with the final error code.
      
      af_alg's complete handler, unlike other users, does not handle the
      -EINPROGRESS but instead always completes the completion that recvmsg()
      is waiting on.  This can lead to a return to user space while the
      request is still pending in the driver.  If userspace closes the sockets
      before the requests are handled by the driver, this will lead to
      use-after-frees (and potential crashes) in the kernel due to the tfm
      having been freed.
      
      The crashes can be easily reproduced (for example) by reducing the max
      queue length in cryptod.c and running the following (from
      http://www.chronox.de/libkcapi.html) on AES-NI capable hardware:
      
       $ while true; do kcapi -x 1 -e -c '__ecb-aes-aesni' \
          -k 00000000000000000000000000000000 \
          -p 00000000000000000000000000000000 >/dev/null & done
      Signed-off-by: default avatarRabin Vincent <rabin.vincent@axis.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4f62b27e
    • Richard Guy Briggs's avatar
      audit: restore AUDIT_LOGINUID unset ABI · 5055918c
      Richard Guy Briggs authored
      commit 041d7b98 upstream.
      
      A regression was caused by commit 780a7654:
      	 audit: Make testing for a valid loginuid explicit.
      (which in turn attempted to fix a regression caused by e1760bd5)
      
      When audit_krule_to_data() fills in the rules to get a listing, there was a
      missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.
      
      This broke userspace by not returning the same information that was sent and
      expected.
      
      The rule:
      	auditctl -a exit,never -F auid=-1
      gives:
      	auditctl -l
      		LIST_RULES: exit,never f24=0 syscall=all
      when it should give:
      		LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all
      
      Tag it so that it is reported the same way it was set.  Create a new
      private flags audit_krule field (pflags) to store it that won't interact with
      the public one from the API.
      Signed-off-by: default avatarRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: default avatarPaul Moore <pmoore@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5055918c
    • Eric W. Biederman's avatar
      userns: Unbreak the unprivileged remount tests · dc7a80cc
      Eric W. Biederman authored
      commit db86da7c upstream.
      
      A security fix in caused the way the unprivileged remount tests were
      using user namespaces to break.  Tweak the way user namespaces are
      being used so the test works again.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      dc7a80cc
    • Eric W. Biederman's avatar
      userns: Allow setting gid_maps without privilege when setgroups is disabled · 90aca415
      Eric W. Biederman authored
      commit 66d2f338 upstream.
      
      Now that setgroups can be disabled and not reenabled, setting gid_map
      without privielge can now be enabled when setgroups is disabled.
      
      This restores most of the functionality that was lost when unprivileged
      setting of gid_map was removed.  Applications that use this functionality
      will need to check to see if they use setgroups or init_groups, and if they
      don't they can be fixed by simply disabling setgroups before writing to
      gid_map.
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      90aca415
    • Eric W. Biederman's avatar
      userns: Add a knob to disable setgroups on a per user namespace basis · 84c4e718
      Eric W. Biederman authored
      commit 9cc46516 upstream.
      
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current processes user namespace and can not be enabled in the
        future in this user namespace.
      
        A value of "allow" means the segtoups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Prodcess with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      84c4e718
    • Eric W. Biederman's avatar
      userns: Rename id_map_mutex to userns_state_mutex · 8af7f144
      Eric W. Biederman authored
      commit f0d62aec upstream.
      
      Generalize id_map_mutex so it can be used for more state of a user namespace.
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8af7f144
    • Eric W. Biederman's avatar
      userns: Only allow the creator of the userns unprivileged mappings · 42eec84f
      Eric W. Biederman authored
      commit f95d7918 upstream.
      
      If you did not create the user namespace and are allowed
      to write to uid_map or gid_map you should already have the necessary
      privilege in the parent user namespace to establish any mapping
      you want so this will not affect userspace in practice.
      
      Limiting unprivileged uid mapping establishment to the creator of the
      user namespace makes it easier to verify all credentials obtained with
      the uid mapping can be obtained without the uid mapping without
      privilege.
      
      Limiting unprivileged gid mapping establishment (which is temporarily
      absent) to the creator of the user namespace also ensures that the
      combination of uid and gid can already be obtained without privilege.
      
      This is part of the fix for CVE-2014-8989.
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      42eec84f
    • Eric W. Biederman's avatar
      userns: Check euid no fsuid when establishing an unprivileged uid mapping · 6cffb252
      Eric W. Biederman authored
      commit 80dd00a2 upstream.
      
      setresuid allows the euid to be set to any of uid, euid, suid, and
      fsuid.  Therefor it is safe to allow an unprivileged user to map
      their euid and use CAP_SETUID privileged with exactly that uid,
      as no new credentials can be obtained.
      
      I can not find a combination of existing system calls that allows setting
      uid, euid, suid, and fsuid from the fsuid making the previous use
      of fsuid for allowing unprivileged mappings a bug.
      
      This is part of a fix for CVE-2014-8989.
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6cffb252
    • Eric W. Biederman's avatar
      userns: Don't allow unprivileged creation of gid mappings · c75f0d0b
      Eric W. Biederman authored
      commit be7c6dba upstream.
      
      As any gid mapping will allow and must allow for backwards
      compatibility dropping groups don't allow any gid mappings to be
      established without CAP_SETGID in the parent user namespace.
      
      For a small class of applications this change breaks userspace
      and removes useful functionality.  This small class of applications
      includes tools/testing/selftests/mount/unprivilged-remount-test.c
      
      Most of the removed functionality will be added back with the addition
      of a one way knob to disable setgroups.  Once setgroups is disabled
      setting the gid_map becomes as safe as setting the uid_map.
      
      For more common applications that set the uid_map and the gid_map
      with privilege this change will have no affect.
      
      This is part of a fix for CVE-2014-8989.
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c75f0d0b
    • Eric W. Biederman's avatar
      userns: Don't allow setgroups until a gid mapping has been setablished · ae254fcf
      Eric W. Biederman authored
      commit 273d2c67 upstream.
      
      setgroups is unique in not needing a valid mapping before it can be called,
      in the case of setgroups(0, NULL) which drops all supplemental groups.
      
      The design of the user namespace assumes that CAP_SETGID can not actually
      be used until a gid mapping is established.  Therefore add a helper function
      to see if the user namespace gid mapping has been established and call
      that function in the setgroups permission check.
      
      This is part of the fix for CVE-2014-8989, being able to drop groups
      without privilege using user namespaces.
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      ae254fcf