• Linus Torvalds's avatar
    Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace · 0cbee992
    Linus Torvalds authored
    Pull user namespace updates from Eric Biederman:
     "Long ago and far away when user namespaces where young it was realized
      that allowing fresh mounts of proc and sysfs with only user namespace
      permissions could violate the basic rule that only root gets to decide
      if proc or sysfs should be mounted at all.
    
      Some hacks were put in place to reduce the worst of the damage could
      be done, and the common sense rule was adopted that fresh mounts of
      proc and sysfs should allow no more than bind mounts of proc and
      sysfs.  Unfortunately that rule has not been fully enforced.
    
      There are two kinds of gaps in that enforcement.  Only filesystems
      mounted on empty directories of proc and sysfs should be ignored but
      the test for empty directories was insufficient.  So in my tree
      directories on proc, sysctl and sysfs that will always be empty are
      created specially.  Every other technique is imperfect as an ordinary
      directory can have entries added even after a readdir returns and
      shows that the directory is empty.  Special creation of directories
      for mount points makes the code in the kernel a smidge clearer about
      it's purpose.  I asked container developers from the various container
      projects to help test this and no holes were found in the set of mount
      points on proc and sysfs that are created specially.
    
      This set of changes also starts enforcing the mount flags of fresh
      mounts of proc and sysfs are consistent with the existing mount of
      proc and sysfs.  I expected this to be the boring part of the work but
      unfortunately unprivileged userspace winds up mounting fresh copies of
      proc and sysfs with noexec and nosuid clear when root set those flags
      on the previous mount of proc and sysfs.  So for now only the atime,
      read-only and nodev attributes which userspace happens to keep
      consistent are enforced.  Dealing with the noexec and nosuid
      attributes remains for another time.
    
      This set of changes also addresses an issue with how open file
      descriptors from /proc/<pid>/ns/* are displayed.  Recently readlink of
      /proc/<pid>/fd has been triggering a WARN_ON that has not been
      meaningful since it was added (as all of the code in the kernel was
      converted) and is not now actively wrong.
    
      There is also a short list of issues that have not been fixed yet that
      I will mention briefly.
    
      It is possible to rename a directory from below to above a bind mount.
      At which point any directory pointers below the renamed directory can
      be walked up to the root directory of the filesystem.  With user
      namespaces enabled a bind mount of the bind mount can be created
      allowing the user to pick a directory whose children they can rename
      to outside of the bind mount.  This is challenging to fix and doubly
      so because all obvious solutions must touch code that is in the
      performance part of pathname resolution.
    
      As mentioned above there is also a question of how to ensure that
      developers by accident or with purpose do not introduce exectuable
      files on sysfs and proc and in doing so introduce security regressions
      in the current userspace that will not be immediately obvious and as
      such are likely to require breaking userspace in painful ways once
      they are recognized"
    
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
      vfs: Remove incorrect debugging WARN in prepend_path
      mnt: Update fs_fully_visible to test for permanently empty directories
      sysfs: Create mountpoints with sysfs_create_mount_point
      sysfs: Add support for permanently empty directories to serve as mount points.
      kernfs: Add support for always empty directories.
      proc: Allow creating permanently empty directories that serve as mount points
      sysctl: Allow creating permanently empty directories that serve as mountpoints.
      fs: Add helper functions for permanently empty directories.
      vfs: Ignore unlocked mounts in fs_fully_visible
      mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace
    0cbee992
inode.c 31.4 KB