1. 15 Dec, 2012 1 commit
    • userns: Require CAP_SYS_ADMIN for most uses of setns. · 5e4a0847
      Eric W. Biederman authored
      Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
      the permissions of setns.  With unprivileged user namespaces it
      became possible to create new namespaces without privilege.
      
      However the setns calls were relaxed to only require CAP_SYS_ADMIN in
      the user namespace of the target namespace.
      
      Which made the following nasty sequence possible.
      
      pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
      if (pid == 0) { /* child */
      	system("mount --bind /home/me/passwd /etc/passwd");
      }
      else if (pid != 0) { /* parent */
      	char path[PATH_MAX];
      	snprintf(path, sizeof(path), "/proc/%u/ns/mnt", pid);
      	fd = open(path, O_RDONLY);
      	setns(fd, 0);
      	system("su -");
      }
      
      Prevent this possibility by requiring CAP_SYS_ADMIN
      in the current user namespace when joining all but the user namespace.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
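The permission change in 5e4a0847 can be sketched as a small user-space model. This is not the kernel code: `struct setns_request` and both helper functions are hypothetical names invented for illustration, and the privilege facts are passed in as plain booleans rather than derived from real credentials.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical summary of what the caller of setns() is allowed to do. */
struct setns_request {
    bool target_is_user_ns;     /* is the target itself a user namespace? */
    bool admin_in_target_owner; /* CAP_SYS_ADMIN in the user namespace of the target */
    bool admin_in_current;      /* CAP_SYS_ADMIN in the caller's current user namespace */
};

/* Relaxed rule described in the message: only the target side is consulted. */
static int setns_check_relaxed(const struct setns_request *req)
{
    return req->admin_in_target_owner ? 0 : -EPERM;
}

/* Rule after the fix: joining anything but a user namespace additionally
 * requires CAP_SYS_ADMIN in the caller's current user namespace. */
static int setns_check_fixed(const struct setns_request *req)
{
    if (!req->admin_in_target_owner)
        return -EPERM;
    if (!req->target_is_user_ns && !req->admin_in_current)
        return -EPERM;
    return 0;
}
```

In the sequence quoted above, the parent is privileged over the child's namespaces but holds no CAP_SYS_ADMIN in its own user namespace, so the relaxed rule admits the setns call and the fixed rule rejects it.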
  2. 14 Dec, 2012 1 commit
    • Fix cap_capable to only allow owners in the parent user namespace to have caps. · 520d9eab
      Eric W. Biederman authored
      Andy Lutomirski pointed out that the current behavior of allowing the
      owner of a user namespace to have all caps when that owner is not in a
      parent user namespace is wrong.  Add a test to ensure the owner of a user
      namespace is in the parent of the user namespace to fix this bug.
      
      Thankfully this bug did not apply to the initial user namespace, keeping
      the mischief that can be caused by this bug quite small.
      
      This bug was introduced in v3.5 by commit 783291e6
      "Simplify the user_namespace by making userns->creator a kuid."
      but did not matter until the permissions required to create
      a user namespace were relaxed, allowing a user namespace to be created
      inside of a user namespace.
      
      The bug made it possible for the owner of a user namespace to be
      present in a child user namespace.  Since the owner of a user namespace
      is granted all capabilities it became possible for users in a
      grandchild user namespace to have all privileges over their parent user
      namespace.
      
      Reorder the checks in cap_capable.  This should make the common case
      faster and make it clear that nothing magic happens in the initial
      user namespace.  The reordering is safe because cred->user_ns
      can only be in targ_ns or targ_ns->parent but not both.
      
      Add a comment at the top of the loop to make the logic of
      the code clear.
      
      Add a distinct variable ns that changes as we walk up
      the user namespace hierarchy to make it clear which variable
      is changing.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
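The reordered walk described in 520d9eab can be illustrated with a minimal user-space model. The struct layouts and the name `cap_capable_model` are assumptions made for this sketch. Uids are compared as plain integers, and a single boolean stands in for the effective capability set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <errno.h>

struct user_ns {
    struct user_ns *parent; /* NULL marks the initial user namespace */
    unsigned int owner;     /* uid of the creator (simplified to a plain integer) */
};

struct cred {
    struct user_ns *user_ns; /* user namespace the task lives in */
    unsigned int euid;
    bool cap_raised;         /* stand-in for "capability is in the effective set" */
};

/* Walk from the target namespace towards the root. */
static int cap_capable_model(const struct cred *cred, const struct user_ns *targ_ns)
{
    const struct user_ns *ns = targ_ns;

    for (;;) {
        /* Same namespace: only the task's own capability set matters. */
        if (ns == cred->user_ns)
            return cred->cap_raised ? 0 : -EPERM;

        /* Reached the root without finding the task's namespace. */
        if (ns->parent == NULL)
            return -EPERM;

        /* The owner has all capabilities, but only when the owner
         * actually lives in the parent of the namespace being examined. */
        if (ns->parent == cred->user_ns && ns->owner == cred->euid)
            return 0;

        /* Privilege over a parent implies privilege over its children. */
        ns = ns->parent;
    }
}
```

With this ordering, a task that shares the owner's uid but lives in a grandchild namespace never matches the owner test for its parent namespace, which is the case the message describes.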
  3. 20 Nov, 2012 14 commits
  4. 19 Nov, 2012 22 commits
    • userns: Allow unprivileged users to create user namespaces. · 5eaf563e
      Eric W. Biederman authored
      Now that we have been through every permission check in the kernel
      having uid == 0 and gid == 0 in your local user namespace no
      longer adds any special privileges.  Even having a full set
      of caps in your local user namespace is safe because capabilities
      are relative to your local user namespace, and do not confer
      unexpected privileges.
      
      Over the long term this should allow much more of the kernel's
      functionality to be safely used by non-root users.  Functionality
      like unsharing the mount namespace is only unsafe because
      it can fool applications whose privileges are raised when they
      are executed.  Since those applications have no privileges in
      a user namespace it becomes safe to spoof and confuse those
      applications all you want.
      
      Those capabilities will still need to be enabled carefully because
      we may still need things like rlimits on the number of unprivileged
      mounts but that is to avoid DOS attacks not to avoid fooling root
      owned processes.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped · 3cdf5b45
      Eric W. Biederman authored
      When performing an exec where the binary lives in one user namespace and
      the execing process lives in another user namespace there is the possibility
      that the target uids can not be represented.
      
      Instead of failing the exec simply ignore the suid/sgid bits and run
      the binary with lower privileges.   We already do this in the case
      of MNT_NOSUID so this should be a well tested code path.
      
      As the user and group are not changed this should not introduce any
      security issues.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
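The behaviour described in 3cdf5b45 can be sketched as follows. This is an illustrative model only: `struct id_range`, `struct exec_file`, and `exec_euid` are hypothetical names, and the mapping is reduced to a single contiguous range of kernel-wide ids that are visible inside the caller's user namespace.

```c
#include <stdbool.h>

/* Hypothetical single-range view of which kernel-wide ids a user namespace can represent. */
struct id_range {
    unsigned int kernel_first; /* first kernel-wide id that is mapped */
    unsigned int count;        /* number of consecutive ids that are mapped */
};

static bool id_has_mapping(const struct id_range *map, unsigned int kid)
{
    return kid >= map->kernel_first && kid - map->kernel_first < map->count;
}

/* Hypothetical description of the binary being executed. */
struct exec_file {
    unsigned int owner_uid; /* kernel-wide owner uid */
    unsigned int owner_gid; /* kernel-wide owner gid */
    bool setuid_bit;
    bool nosuid_mount;      /* file sits on a mount with suid disabled */
};

/* Effective uid after exec: the setuid bit is honoured only when the mount
 * allows it and both owner ids are representable in the caller's namespace.
 * Otherwise the exec still succeeds, with unchanged credentials. */
static unsigned int exec_euid(const struct exec_file *f, unsigned int caller_euid,
                              const struct id_range *uid_map,
                              const struct id_range *gid_map)
{
    if (f->nosuid_mount)
        return caller_euid;
    if (!id_has_mapping(uid_map, f->owner_uid) ||
        !id_has_mapping(gid_map, f->owner_gid))
        return caller_euid;
    return f->setuid_bit ? f->owner_uid : caller_euid;
}
```

The unmapped case deliberately shares its outcome with the nosuid case, which mirrors the message's point that this path is already well exercised.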
    • userns: fix return value on mntns_install() failure · ae11e0f1
      Zhao Hongjiang authored
      Change return value from -EINVAL to -EPERM when the permission check fails.
      Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • vfs: Allow unprivileged manipulation of the mount namespace. · 0c55cfc4
      Eric W. Biederman authored
      - Add a filesystem flag to mark filesystems that are safe to mount as
        an unprivileged user.
      
      - Add a filesystem flag to mark filesystems that don't need MNT_NODEV
        when mounted by an unprivileged user.
      
      - Relax the permission checks to allow unprivileged users that have
        CAP_SYS_ADMIN permissions in the user namespace referred to by the
        current mount namespace to be allowed to mount, unmount, and move
        filesystems.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
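The interaction of the two filesystem flags listed in 0c55cfc4 might look roughly like the sketch below. The `MODEL_`-prefixed constants and `userns_mount_flags` are hypothetical: they are named after the behaviour described in the message, and their numeric values are arbitrary rather than the kernel's. The sketch also assumes the caller has already passed the CAP_SYS_ADMIN check against the user namespace of the current mount namespace.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical flag values chosen for this model only. */
#define MODEL_FS_USERNS_MOUNT     0x1u /* filesystem is safe for unprivileged mounts */
#define MODEL_FS_USERNS_DEV_MOUNT 0x2u /* filesystem does not need nodev forced on */
#define MODEL_MNT_NODEV           0x4u /* device nodes on the mount are ignored */

/* Decide whether a new mount is permitted and which mount flags get forced.
 * in_init_user_ns: the mount namespace belongs to the initial user namespace. */
static int userns_mount_flags(bool in_init_user_ns, unsigned int fs_flags,
                              unsigned int *mnt_flags)
{
    if (in_init_user_ns)
        return 0; /* traditional privileged mount, nothing extra to enforce */

    if (!(fs_flags & MODEL_FS_USERNS_MOUNT))
        return -EPERM; /* filesystem not marked safe for unprivileged users */

    if (!(fs_flags & MODEL_FS_USERNS_DEV_MOUNT))
        *mnt_flags |= MODEL_MNT_NODEV; /* force nodev unless explicitly exempt */

    return 0;
}
```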
    • vfs: Only support slave subtrees across different user namespaces · 7a472ef4
      Eric W. Biederman authored
      Sharing mount subtrees with mount namespaces created by unprivileged
      users allows unprivileged mounts created by unprivileged users to
      propagate to mount namespaces controlled by privileged users.
      
      Prevent nasty consequences by changing shared subtrees to slave
      subtrees when an unprivileged user creates a new mount namespace.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Add a user namespace reference from struct mnt_namespace · 771b1371
      Eric W. Biederman authored
      This will allow for support for unprivileged mounts in a new user namespace.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Add setns support for the mount namespace · 8823c079
      Eric W. Biederman authored
      setns support for the mount namespace is a little tricky as an
      arbitrary decision must be made about what to set fs->root and
      fs->pwd to, as there is no expectation of a relationship between
      the two mount namespaces.  Therefore I arbitrarily find the root
      mount point, and follow every mount on top of it to find the top
      of the mount stack.  Then I set fs->root and fs->pwd to that
      location.  The topmost root of the mount stack seems like a
      reasonable place to be.
      
      Bind mount support for the mount namespace inodes has the
      possibility of creating circular dependencies between mount
      namespaces.  Circular dependencies can result in loops that
      prevent mount namespaces from ever being freed.  I avoid
      creating those circular dependencies by adding a sequence number
      to the mount namespace and requiring all bind mounts be of a
      younger mount namespace into an older mount namespace.
      
      Add a helper function proc_ns_inode so it is possible to
      detect when we are attempting to bind mount a namespace inode.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
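The sequence-number rule from 8823c079 can be modelled in a few lines. The names `struct mnt_ns_model`, `new_mnt_ns`, and `may_bind_ns_inode` are assumptions for this sketch, and a plain global counter stands in for whatever allocation scheme the kernel uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mount namespace carrying only its creation sequence number. */
struct mnt_ns_model {
    uint64_t seq;
};

static uint64_t next_mnt_ns_seq = 1;

/* Each new namespace gets a strictly larger sequence number than all earlier ones. */
static struct mnt_ns_model new_mnt_ns(void)
{
    struct mnt_ns_model ns = { next_mnt_ns_seq++ };
    return ns;
}

/* A namespace inode for `src` may be bind mounted inside `dest` only when
 * `src` is younger (has a larger sequence number) than `dest`. */
static bool may_bind_ns_inode(const struct mnt_ns_model *dest,
                              const struct mnt_ns_model *src)
{
    return src->seq > dest->seq;
}
```

Because every permitted reference points from an older namespace to a strictly younger one, following references always increases the sequence number, so no chain of bind mounts can return to its starting namespace.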
    • vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace · a85fb273
      Eric W. Biederman authored
      Once you are confined to a user namespace applications can not gain
      privilege and escape the user namespace so there is no longer a reason
      to restrict chroot.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Support unsharing the pid namespace. · 50804fe3
      Eric W. Biederman authored
      Unsharing of the pid namespace unlike unsharing of other namespaces
      does not take effect immediately.  Instead it affects the children
      created with fork and clone.  The first of these children becomes the init
      process of the new pid namespace, the rest become oddball children
      of pid 0.  From the point of view of the new pid namespace the process
      that created it is pid 0, as its pid does not map.
      
      A couple of different semantics were considered but this one was
      settled on because it is easy to implement and it is usable from
      pam modules, the core reason for the existence of unshare.
      
      I took a survey of the callers of pam modules and the following
      appears to be a representative sample of their logic.
      {
      	setup stuff include pam
      	child = fork();
      	if (!child) {
      		setuid()
      		exec /bin/bash
      	}
      	waitpid(child);
      
      	pam and other cleanup
      }
      
      As you can see there is a fork to create the unprivileged user
      space process.  Which means that the unprivileged user space
      process will appear as pid 1 in the new pid namespace.  Further
      most login processes do not cope with extraneous children which
      means shifting the duty of reaping extraneous child processes to
      the creator of those extraneous children makes the system more
      comprehensible.
      
      The practical reason for this set of pid namespace semantics is
      that it is simple to implement and verify they work correctly.
      Whereas an implementation that requires changing the struct
      pid on a process comes with a lot more races and pain.  Not
      the least of which is that glibc caches getpid().
      
      These semantics are implemented by having two notions
      of the pid namespace of a process.  There is task_active_pid_ns
      which is the pid namespace the process was created with
      and the pid namespace that all pids are presented to
      that process in.  The task_active_pid_ns is stored
      in the struct pid of the task.
      
      Then there is the pid namespace that will be used for children;
      that pid namespace is stored in task->nsproxy->pid_ns.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
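The two notions of a task's pid namespace described in 50804fe3 can be illustrated with a toy model. Everything here is hypothetical (`struct pid_ns_model`, `struct task_model`, `unshare_pidns`, `fork_task`), and the model ignores the fact that a real process also receives a pid in every ancestor namespace.

```c
#include <stddef.h>

/* Hypothetical pid namespace that only tracks the last pid number handed out. */
struct pid_ns_model {
    int last_pid;
};

struct task_model {
    struct pid_ns_model *active_ns;       /* fixed at creation; pids are shown relative to it */
    struct pid_ns_model *ns_for_children; /* namespace future children will be created in */
    int pid;                              /* pid number inside active_ns */
};

/* Unsharing only redirects future children; the caller keeps its pid and view. */
static void unshare_pidns(struct task_model *t, struct pid_ns_model *fresh)
{
    fresh->last_pid = 0;
    t->ns_for_children = fresh;
}

/* A child becomes active in the namespace its parent has selected for children. */
static struct task_model fork_task(struct task_model *parent)
{
    struct task_model child;

    child.active_ns = parent->ns_for_children;
    child.ns_for_children = parent->ns_for_children;
    child.pid = ++child.active_ns->last_pid;
    return child;
}
```

In this model the first child forked after `unshare_pidns` receives pid 1 in the fresh namespace, matching the login-style fork pattern shown in the message, while the caller's own pid and active namespace stay as they were.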
    • pidns: Consolidate initialization of special init task state · 1c4042c2
      Eric W. Biederman authored
      Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
      for the system init process, and another way for pid namespace
      init processes, test pid->nr == 1 and use the same code for both.
      
      For the global init this results in SIGNAL_UNKILLABLE being set
      much earlier in the initialization process.
      
      This is a small cleanup and it paves the way for allowing unshare and
      enter of the pid namespace as that path like our global init also will
      not set CLONE_NEWPID.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Add setns support · 57e8391d
      Eric W. Biederman authored
      - Pid namespaces are designed to be inescapable so verify that the
        passed in pid namespace is a child of the currently active
        pid namespace or the currently active pid namespace itself.
      
        Allowing the currently active pid namespace is important so
        the effects of an earlier setns can be cancelled.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
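The check described in 57e8391d amounts to an ancestry test. A minimal sketch, assuming a hypothetical `struct pidns_node` that records only the parent link, could look like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pid namespace node; parent is NULL for the initial namespace. */
struct pidns_node {
    struct pidns_node *parent;
};

/* setns is acceptable only when `target` is the active namespace itself or
 * one of its descendants, so a task can never move towards the root. */
static bool pidns_setns_allowed(const struct pidns_node *active,
                                const struct pidns_node *target)
{
    const struct pidns_node *ns;

    for (ns = target; ns != NULL; ns = ns->parent) {
        if (ns == active)
            return true;
    }
    return false;
}
```

Accepting `target == active` is what lets a task cancel the effect of an earlier setns, as the message notes.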
    • pidns: Deny strange cases when creating pid namespaces. · 225778d6
      Eric W. Biederman authored
      task_active_pid_ns(current) != current->nsproxy->pid_ns will
      soon be allowed to support unshare and setns.
      
      The definition of creating a child pid namespace when
      task_active_pid_ns(current) != current->nsproxy->pid_ns could be that
      we create a child pid namespace of current->nsproxy->pid_ns.  However
      that leads to strange cases like trying to have a single process be
      init in multiple pid namespaces, which is racy and hard to think
      about.
      
      The definition of creating a child pid namespace when
      task_active_pid_ns(current) != current->nsproxy->pid_ns could be that
      we create a child pid namespace of task_active_pid_ns(current).  While
      that seems less racy it does not provide any utility.
      
      Therefore define the semantics of creating a child pid namespace when
      task_active_pid_ns(current) != current->nsproxy->pid_ns to be that the
      pid namespace creation fails.  That is easy to implement and easy
      to think about.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 · af4b8a83
      Eric W. Biederman authored
      Looking at pid_ns->nr_hashed is a bit simpler and it works for
      disjoint process trees that an unshare or a join of a pid_namespace
      may create.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Don't allow new processes in a dead pid namespace. · 5e1182de
      Eric W. Biederman authored
      Set nr_hashed to -1 just before we schedule the work to cleanup proc.
      Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
      fail.
      
      This guarantees that processes never enter a pid namespace after we
      have cleaned up the state to support processes in a pid namespace.
      
      Currently sending SIGKILL to all of the processes in a pid namespace as
      init exits gives us this guarantee but we need something a little
      stronger to support unsharing and joining a pid namespace.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
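The nr_hashed protocol from 5e1182de can be sketched as a small counter model. `struct pidns_counter`, `hash_new_pid`, and `unhash_pid` are hypothetical names, locking is omitted entirely, and the scheduling of the proc cleanup work is reduced to a boolean return value.

```c
#include <stdbool.h>

/* Hypothetical per-namespace count of hashed pids; -1 marks a dead namespace. */
struct pidns_counter {
    int nr_hashed;
};

/* Called just before a new pid would be hashed. Fails once the namespace is dead. */
static bool hash_new_pid(struct pidns_counter *ns)
{
    if (ns->nr_hashed < 0)
        return false;
    ns->nr_hashed++;
    return true;
}

/* Called when a pid is freed. Returns true when this was the last pid, in
 * which case the namespace is marked dead before cleanup would be scheduled. */
static bool unhash_pid(struct pidns_counter *ns)
{
    if (ns->nr_hashed <= 0)
        return false; /* nothing hashed, or already dead */
    if (--ns->nr_hashed != 0)
        return false;
    ns->nr_hashed = -1;
    return true;
}
```

Setting the counter to -1 before cleanup is what closes the window in which an unshare or setns caller could otherwise add a process to a namespace whose supporting state is already gone.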
    • pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Eric W. Biederman authored
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release_proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.
      
      Because of the call path detach_pid is always called with the
      rtnl_lock held, free_pid is not allowed to sleep, so the work to
      unmount proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman authored
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.
      
      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Capture the user namespace and filter ns_last_pid · 49f4d8b9
      Eric W. Biederman authored
      - Capture the user namespace that creates the pid namespace
      - Use that user namespace to test if it is ok to write to
        /proc/sys/kernel/ns_last_pid.
      
      Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns
      when destroying a pid_ns.  I have folded his patch into this one
      so that bisects will work properly.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • procfs: Don't cache a pid in the root inode. · ae06c7c8
      Eric W. Biederman authored
      Now that we have s_fs_info pointing to our pid namespace
      the original reason for the proc root inode having a struct
      pid is gone.
      
      Caching a pid in the root inode has led to some complicated
      code.  Now that we don't need the struct pid, just remove it.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • procfs: Use the proc generic infrastructure for proc/self. · e656d8a6
      Eric W. Biederman authored
      I had visions at one point of splitting proc into two filesystems.  If
      that had happened proc/self being the part of proc that actually deals
      with pids would have been a nice cleanup.  As it is proc/self requires
      a lot of unnecessary infrastructure for a single file.
      
      The only user visible change is that a mounted /proc for a pid namespace
      that is dead now shows a broken proc symlink, instead of being completely
      invisible.  I don't think anyone will notice or care.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • userns: On mips modify check_same_owner to use uid_eq · dd34ad35
      Eric W. Biederman authored
      The kbuild test robot <fengguang.wu@intel.com> reported the following error
      when building mips with user namespace support enabled.
      
      All error/warnings:
      arch/mips/kernel/mips-mt-fpaff.c: In function 'check_same_owner':
      arch/mips/kernel/mips-mt-fpaff.c:53:22: error: invalid operands to binary == (have 'kuid_t' and 'kuid_t')
      arch/mips/kernel/mips-mt-fpaff.c:54:15: error: invalid operands to binary == (have 'kuid_t' and 'kuid_t')
      
      Replacing "a == b" with uid_eq(a, b) removes this error and allows the
      code to work with user namespaces enabled.
      
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
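The compile error quoted in dd34ad35 follows from kuid_t being a structure type, which C does not allow as an operand of `==`. The sketch below reproduces the idea in user space. The one-member wrapper and `uid_eq` mirror the pattern the message names, while `same_owner` is a hypothetical simplification of the ownership test.

```c
#include <stdbool.h>

/* Wrapping the id in a struct makes accidental mixing with plain integers,
 * or with gids, a compile-time error. */
typedef struct {
    unsigned int val;
} kuid_t;

/* Structs cannot be compared with ==, so equality goes through a helper. */
static inline bool uid_eq(kuid_t left, kuid_t right)
{
    return left.val == right.val;
}

/* Hypothetical ownership test: the caller matches either the target's
 * effective uid or its real uid. */
static bool same_owner(kuid_t caller_euid, kuid_t target_euid, kuid_t target_uid)
{
    return uid_eq(caller_euid, target_euid) || uid_eq(caller_euid, target_uid);
}
```

Writing `caller_euid == target_euid` with these types fails to compile with a diagnostic of the same kind as the one in the build report above.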
    • userns: make each net (net_ns) belong to a user_ns · 038e7332
      Eric W. Biederman authored
      The user namespace which creates a new network namespace owns that
      namespace and all resources created in it.  This way we can target
      capability checks for privileged operations against network resources to
      the user_ns which created the network namespace in which the resource
      lives.  Privilege to the user namespace which owns the network
      namespace, or any parent user namespace thereof, provides the same
      privilege to the network resource.
      
      This patch is reworked from a version originally by
      Serge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS · d727abcb
      Eric W. Biederman authored
      The copy of copy_net_ns used when the network stack is not
      built is broken as it does not return -EINVAL when attempting
      to create a new network namespace.  We don't even have
      a previous network namespace.
      
      Since we need a copy of copy_net_ns in net/net_namespace.h that is
      available when the networking stack is not built at all, move the
      correct version of copy_net_ns from net_namespace.c into net_namespace.h,
      leaving us with just 2 versions of copy_net_ns: one version for when
      we compile in network namespace support and another stub for all other
      occasions.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  5. 15 Nov, 2012 2 commits
    • userns: Support fuse interacting with multiple user namespaces · 499dcf20
      Eric W. Biederman authored
      Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.
      
      The connection between a fuse filesystem and a fuse daemon is
      established when a fuse filesystem is mounted and provided with a file
      descriptor the fuse daemon created by opening /dev/fuse.
      
      For now restrict the communication of uids and gids between the fuse
      filesystem and the fuse daemon to the initial user namespace.  Enforce
      this by verifying the file descriptor passed to the mount of fuse was
      opened in the initial user namespace.  Ensuring the mount happens in
      the initial user namespace is not necessary as mounts from non-initial
      user namespaces are not yet allowed.
      
      In fuse_req_init_context convert the current fsuid and fsgid into the
      initial user namespace for the request that will be sent to the fuse
      daemon.
      
      In fuse_fill_attr convert the uid and gid passed from the fuse daemon
      from the initial user namespace into kuids and kgids.
      
      In iattr_to_fattr called from fuse_setattr convert kuids and kgids
      into the uids and gids in the initial user namespace before passing
      them to the fuse filesystem.
      
      In fuse_change_attributes_common called from fuse_dentry_revalidate,
      fuse_permission, fuse_getattr, fuse_setattr, and fuse_iget convert
      the uid and gid from the fuse daemon into a kuid and a kgid to store
      on the fuse inode.
      
      By default fuse mounts are restricted to tasks whose uid, suid, and
      euid match the fuse user_id and whose gid, sgid, and egid match
      the fuse group id.  Convert the user_id and group_id mount options
      into kuids and kgids at mount time, and use uid_eq and gid_eq to
      compare them in fuse_allow_task.
      
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
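The default access rule described at the end of 499dcf20 can be sketched as below. This is a user-space illustration, not the fuse code: `struct task_ids` and `fuse_allow_task_model` are hypothetical, and the kuid_t and kgid_t wrappers are simplified one-member structs.

```c
#include <stdbool.h>

typedef struct { unsigned int val; } kuid_t;
typedef struct { unsigned int val; } kgid_t;

static inline bool uid_eq(kuid_t l, kuid_t r) { return l.val == r.val; }
static inline bool gid_eq(kgid_t l, kgid_t r) { return l.val == r.val; }

/* Hypothetical subset of a task's credentials. */
struct task_ids {
    kuid_t uid, euid, suid;
    kgid_t gid, egid, sgid;
};

/* Access is granted only when all three uids match the mount's user_id and
 * all three gids match the mount's group_id. */
static bool fuse_allow_task_model(const struct task_ids *t,
                                  kuid_t user_id, kgid_t group_id)
{
    return uid_eq(t->uid, user_id) &&
           uid_eq(t->euid, user_id) &&
           uid_eq(t->suid, user_id) &&
           gid_eq(t->gid, group_id) &&
           gid_eq(t->egid, group_id) &&
           gid_eq(t->sgid, group_id);
}
```

Converting the mount options once at mount time means every later comparison happens between kernel-wide ids, independent of which user namespace the calling task lives in.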
    • userns: Support autofs4 interacting with multiple user namespaces · 45634cd8
      Eric W. Biederman authored
      Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.
      
      When creating directories and symlinks default the uid and gid of
      the mount requester to the global root uid and gid.  autofs4_wait
      will update these fields when a mount is requested.
      
      When generating autofsv5 packets report the uid and gid of the mount
      requester in the user namespace of the process that opened the pipe,
      reporting unmapped uids and gids as overflowuid and overflowgid.
      
      In autofs_dev_ioctl_requester return the uid and gid of the last mount
      requester converted into the calling process's user namespace.  When the
      uid or gid don't map return overflowuid and overflowgid as appropriate,
      allowing failure to find a mount requester to be distinguished from
      failure to map a mount requester.
      
      The uid and gid mount options specifying the user and group of the
      root autofs inode are converted into kuid and kgid as they are parsed
      defaulting to the current uid and current gid of the process that
      mounts autofs.
      
      Mounting of autofs for the present remains confined to processes in
      the initial user namespace.
      
      Cc: Ian Kent <raven@themaw.net>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
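Reporting an unmapped requester as overflowuid, as 45634cd8 describes, can be sketched with a single-extent mapping. The names `struct uid_extent`, `kuid_to_ns`, and `kuid_to_ns_munged` are hypothetical, and `MODEL_OVERFLOW_UID` assumes the conventional default of 65534, which is configurable on a real system.

```c
#include <stdbool.h>

/* Assumed default overflow id for this model. */
#define MODEL_OVERFLOW_UID 65534u

/* Hypothetical single extent of a uid map: `count` ids starting at
 * `ns_first` inside the namespace correspond to kernel-wide ids starting
 * at `kernel_first`. */
struct uid_extent {
    unsigned int ns_first;
    unsigned int kernel_first;
    unsigned int count;
};

/* Translate a kernel-wide uid into the namespace. Returns false when unmapped. */
static bool kuid_to_ns(const struct uid_extent *map, unsigned int kuid,
                       unsigned int *out)
{
    if (kuid < map->kernel_first || kuid - map->kernel_first >= map->count)
        return false;
    *out = map->ns_first + (kuid - map->kernel_first);
    return true;
}

/* Same translation, but unmapped ids are reported as the overflow uid so the
 * caller always receives some value. */
static unsigned int kuid_to_ns_munged(const struct uid_extent *map,
                                      unsigned int kuid)
{
    unsigned int uid;

    return kuid_to_ns(map, kuid, &uid) ? uid : MODEL_OVERFLOW_UID;
}
```

Returning the overflow value for an unmapped requester, instead of an error, is what keeps "no requester found" distinguishable from "requester exists but cannot be represented here".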