Commits · 069d5ac9ae0d271903cc4607890616418118379a · Kirill Smelkov / linux

30 Sep, 2016 2 commits

autofs: Fix automounts by using current_real_cred()->uid · 069d5ac9

Eric W. Biederman authored Sep 30, 2016

Seth Forshee reports that in 4.8-rcN some automounts are failing
because the requesting the automount changed.

The relevant call path is:
follow_automount()
    ->d_automount
    autofs4_d_automount
       autofs4_mount_wait
           autofs4_wait

In autofs4_wait wq_uid and wq_gid are set to current_uid() and
current_gid respectively.  With follow_automount now overriding creds
uid that we export to userspace changes and that breaks existing
setups.

To remove the regression set wq_uid and wq_gid from
current_real_cred()->uid and current_real_cred()->gid respectively.
This restores the current behavior as current->real_cred is identical
to current->cred except when override creds are used.

Cc: stable@vger.kernel.org
Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

069d5ac9

mnt: Add a per mount namespace limit on the number of mounts · d2921684

Eric W. Biederman authored Sep 28, 2016

CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.
Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

d2921684

23 Sep, 2016 7 commits

netns: move {inc,dec}_net_namespaces into #ifdef · 2ed6afde

Arnd Bergmann authored Sep 23, 2016

With the newly enforced limit on the number of namespaces,
we get a build warning if CONFIG_NETNS is disabled:

net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]

This moves the two added functions inside the #ifdef that guards
their callers.

Fixes: 70328660 ("netns: Add a limit on the number of net namespaces")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

2ed6afde

nsfs: Simplify __ns_get_path · 213b067c

Eric W. Biederman authored Sep 22, 2016

Move mntget from the very beginning of __ns_get_path to
the success path of __ns_get_path, and remove the mntget
calls.

This removes the possibility that there will be a mntget/mntput
pair of __ns_get_path has to retry, and generally simplifies the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

213b067c

Merge branch 'nsfs-ioctls' into HEAD · 78725596

Eric W. Biederman authored Sep 22, 2016

From: Andrey Vagin <avagin@openvz.org>

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system.  Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble.  I think this may actually a case for creating ioctls
> for these two cases.  Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since  Linux  4.X,  the  following  ioctl(2)  calls are supported for
namespace file descriptors.  The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning user names‐
      pace.

NS_GET_PARENT
      Returns  a  file descriptor that refers to a parent namespace.
      This ioctl(2) can be used for pid  and  user  namespaces.  For
      user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
      meaning.

In addition to generic ioctl(2) errors, the following  specific  ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is outside of the current namespace
      scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
  outside of the init namespace, so we can return EPERM in this case too.
  > The fewer special cases the easier the code is to get
  > correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
  grabs a reference.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>

78725596

tools/testing: add a test to check nsfs ioctl-s · 6ad92bf6

Andrey Vagin authored Sep 06, 2016

There are two new ioctl-s:
One ioctl for the user namespace that owns a file descriptor.
One ioctl for the parent namespace of a namespace file descriptor.

The test checks that these ioctl-s works and that they handle a case
when a target namespace is outside of the current process namespace.
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

6ad92bf6

nsfs: add ioctl to get a parent namespace · a7306ed8

Andrey Vagin authored Sep 06, 2016

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships.

In a future we will use this interface to dump and restore nested
namespaces.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

a7306ed8

nsfs: add ioctl to get an owning user namespace for ns file descriptor · 6786741d

Andrey Vagin authored Sep 06, 2016

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Understending namespaces relationships allows to answer the question:
what capability does process X have to perform operations on a resource
governed by namespace Y?

After a long discussion, Eric W. Biederman proposed to use ioctl-s for
this purpose.

The NS_GET_USERNS ioctl returns a file descriptor to an owning user
namespace.
It returns EPERM if a target namespace is outside of a current user
namespace.

v2: rename parent to relative

v3: Add a missing mntput when returning -EAGAIN --EWB
Acked-by: Serge Hallyn <serge@hallyn.com>
Link: https://lkml.org/lkml/2016/7/6/158Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

6786741d

kernel: add a helper to get an owning user namespace for a namespace · bcac25a5

Andrey Vagin authored Sep 06, 2016

Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

bcac25a5

22 Sep, 2016 8 commits

devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts · 93f0a88b

Eric W. Biederman authored Dec 08, 2015

In 99.99% of the cases only root in a user namespace can mount /dev/pts
and in those cases the owner of /dev/pts/ptmx will remain root.root

In the oddball case where someone else has CAP_SYS_ADMIN this code
modifies the /dev/pts mount code to use current_fsuid and current_fsgid
as the values to use when creating the /dev/ptmx inode.  As is done
when any other file is created.

This is a code simplification, and it allows running without a root
user entirely.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

93f0a88b

devpts: Remove sync_filesystems · 985e5d85

Eric W. Biederman authored Apr 20, 2016

devpts does not and never will have anything to sync
so don't bother calling sync_filesystems on remount.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

985e5d85

devpts: Make devpts_kill_sb safe if fsi is NULL · 0d126a7f
Eric W. Biederman authored Dec 22, 2015
```
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
```
0d126a7f

devpts: Simplify devpts_mount by using mount_nodev · ec0a9ba6

Eric W. Biederman authored Apr 19, 2016

Now that all of the work of setting up a superblock has been moved to
devpts_fill_super simplify devpts_mount by calling mount_nodev instead
of rolling mount_nodev by hand.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

ec0a9ba6

devpts: Move the creation of /dev/pts/ptmx into fill_super · 7dd17f71

Eric W. Biederman authored Apr 19, 2016

The code makes more sense here and things are just clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

7dd17f71

devpts: Move parse_mount_options into fill_super · 20890479
Eric W. Biederman authored Apr 20, 2016
```
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
```
20890479

userns: When the per user per user namespace limit is reached return ENOSPC · df75e774

Eric W. Biederman authored Sep 22, 2016

The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

df75e774

userns; Document per user per user namespace limits. · 9c722e40
Eric W. Biederman authored Sep 22, 2016
```
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
```
9c722e40

31 Aug, 2016 1 commit

mntns: Add a limit on the number of mount namespaces. · 537f7ccb

Eric W. Biederman authored Aug 08, 2016

v2: Fixed the very obvious lack of setting ucounts
    on struct mnt_ns reported by Andrei Vagin, and the kbuild
    test report.
Reported-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

537f7ccb

08 Aug, 2016 12 commits

netns: Add a limit on the number of net namespaces · 70328660

Eric W. Biederman authored Aug 08, 2016

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

70328660

cgroupns: Add a limit on the number of cgroup namespaces · d08311dd
Eric W. Biederman authored Aug 08, 2016
```
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
```
d08311dd

ipcns: Add a limit on the number of ipc namespaces · aba35661

Eric W. Biederman authored Aug 08, 2016

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

aba35661

utsns: Add a limit on the number of uts namespaces · f7af3d1c

Eric W. Biederman authored Aug 08, 2016

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

f7af3d1c

pidns: Add a limit on the number of pid namespaces · f333c700

Eric W. Biederman authored Aug 08, 2016

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

f333c700

userns: Generalize the user namespace count into ucount · 25f9c081

Eric W. Biederman authored Aug 08, 2016

The same kind of recursive sane default limit and policy
countrol that has been implemented for the user namespace
is desirable for the other namespaces, so generalize
the user namespace refernce count into a ucount.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

25f9c081

userns: Make the count of user namespaces per user · f6b2db1a

Eric W. Biederman authored Aug 08, 2016

Add a structure that is per user and per user ns and use it to hold
the count of user namespaces.  This makes prevents one user from
creating denying service to another user by creating the maximum
number of user namespaces.

Rename the sysctl export of the maximum count from
/proc/sys/userns/max_user_namespaces to /proc/sys/user/max_user_namespaces
to reflect that the count is now per user.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

f6b2db1a

userns: Add a limit on the number of user namespaces · b376c3e1

Eric W. Biederman authored Aug 08, 2016

Export the export the maximum number of user namespaces as
/proc/sys/userns/max_user_namespaces.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

b376c3e1

userns: Add per user namespace sysctls. · dbec2846

Eric W. Biederman authored Jul 30, 2016

Limit per userns sysctls to only be opened for write by a holder
of CAP_SYS_RESOURCE.

Add all of the necessary boilerplate for having per user namespace
sysctls.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

dbec2846

userns: Free user namespaces in process context · b032132c

Eric W. Biederman authored Jul 30, 2016

Add the necessary boiler plate to move freeing of user namespaces into
work queue and thus into process context where things can sleep.

This is a necessary precursor to per user namespace sysctls.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

b032132c

sysctl: Stop implicitly passing current into sysctl_table_root.lookup · 13bcc6a2

Eric W. Biederman authored Jul 16, 2016

Passing nsproxy into sysctl_table_root.lookup was a premature
optimization in attempt to avoid depending on current.  The
directory /proc/self/sys has not appeared and if and when
it does this code will need to be reviewed closely and reworked
anyway.  So remove the premature optimization.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

13bcc6a2

Linux 4.8-rc1 · 29b4817d
Linus Torvalds authored Aug 07, 2016

29b4817d

07 Aug, 2016 10 commits

Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 857953d7

Linus Torvalds authored Aug 07, 2016

Pull more block fixes from Jens Axboe:
 "As mentioned in the pull the other day, a few more fixes for this
  round, all related to the bio op changes in this series.

  Two fixes, and then a cleanup, renaming bio->bi_rw to bio->bi_opf.  I
  wanted to do that change right after or right before -rc1, so that
  risk of conflict was reduced.  I just rebased the series on top of
  current master, and no new ->bi_rw usage has snuck in"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: rename bio bi_rw to bi_opf
  target: iblock_execute_sync_cache() should use bio_set_op_attrs()
  mm: make __swap_writepage() use bio_set_op_attrs()
  block/mm: make bdev_ops->rw_page() take a bool for read/write

857953d7

Merge tag 'drm-for-v4.8-zpos' of git://people.freedesktop.org/~airlied/linux · 635a4ba1

Linus Torvalds authored Aug 07, 2016

Pull drm zpos property support from Dave Airlie:
 "This tree was waiting on some media stuff I hadn't had time to get a
  stable branchpoint off, so I just waited until it was all in your tree
  first.

  It's been around a bit on the list and shouldn't affect anything
  outside adding the generic API and moving some ARM drivers to using
  it"

* tag 'drm-for-v4.8-zpos' of git://people.freedesktop.org/~airlied/linux:
  drm: rcar: use generic code for managing zpos plane property
  drm/exynos: use generic code for managing zpos plane property
  drm: sti: use generic zpos for plane
  drm: add generic zpos property

635a4ba1

block: rename bio bi_rw to bi_opf · 1eff9d32

Jens Axboe authored Aug 05, 2016

Since commit 63a4cc24, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>

1eff9d32

target: iblock_execute_sync_cache() should use bio_set_op_attrs() · 31c64f78

Jens Axboe authored Aug 01, 2016

The original commit missed this function, it needs to mark it a
write flush.

Cc: Mike Christie <mchristi@redhat.com>
Fixes: e742fc32 ("target: use bio op accessors")
Signed-off-by: Jens Axboe <axboe@fb.com>

31c64f78

mm: make __swap_writepage() use bio_set_op_attrs() · ba13e83e
Jens Axboe authored Aug 01, 2016
```
Cleaner than manipulating bio->bi_rw flags directly.
Signed-off-by: Jens Axboe <axboe@fb.com>
```
ba13e83e

block/mm: make bdev_ops->rw_page() take a bool for read/write · c11f0c0b

Jens Axboe authored Aug 05, 2016

Commit abf54548 changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

c11f0c0b

Merge tag 'doc-4.8-fixes' of git://git.lwn.net/linux · 52ddb7e9

Linus Torvalds authored Aug 07, 2016

Pull documentation fixes from Jonathan Corbet:
 "Three fixes for the docs build, including removing an annoying warning
  on 'make help' if sphinx isn't present"

* tag 'doc-4.8-fixes' of git://git.lwn.net/linux:
  DocBook: use DOCBOOKS="" to ignore DocBooks instead of IGNORE_DOCBOOKS=1
  Documenation: update cgroup's document path
  Documentation/sphinx: do not warn about missing tools in 'make help'

52ddb7e9

Merge tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc · e9d488c3

Linus Torvalds authored Aug 07, 2016

Pull binfmt_misc update from James Bottomley:
 "This update is to allow architecture emulation containers to function
  such that the emulation binary can be housed outside the container
  itself.  The container and fs parts both have acks from relevant
  experts.

  To use the new feature you have to add an F option to your binfmt_misc
  configuration"

From the docs:
 "The usual behaviour of binfmt_misc is to spawn the binary lazily when
  the misc format file is invoked.  However, this doesn't work very well
  in the face of mount namespaces and changeroots, so the F mode opens
  the binary as soon as the emulation is installed and uses the opened
  image to spawn the emulator, meaning it is always available once
  installed, regardless of how the environment changes"

* tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc:
  binfmt_misc: add F option description to documentation
  binfmt_misc: add persistent opened binary handler for containers
  fs: add filp_clone_open API

e9d488c3

fs: return EPERM on immutable inode · 337684a1

Eryu Guan authored Aug 02, 2016

In most cases, EPERM is returned on immutable inode, and there're only a
few places returning EACCES. I noticed this when running LTP on
overlayfs, setxattr03 failed due to unexpected EACCES on immutable
inode.

So converting all EACCES to EPERM on immutable inode.
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

337684a1

Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · fe64f328

Linus Torvalds authored Aug 07, 2016

Pull more vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  In the "trivial API change" department - ->d_compare() losing 'parent'
  argument"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cachefiles: Fix race between inactivating and culling a cache object
  9p: use clone_fid()
  9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
  vfs: make dentry_needs_remove_privs() internal
  vfs: remove file_needs_remove_privs()
  vfs: fix deadlock in file_remove_privs() on overlayfs
  get rid of 'parent' argument of ->d_compare()
  cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
  affs ->d_compare(): don't bother with ->d_inode
  fold _d_rehash() and __d_rehash() together
  fold dentry_rcuwalk_invalidate() into its only remaining caller

fe64f328