1. 12 Jun, 2020 1 commit
    • Eric W. Biederman's avatar
      proc: Use new_inode not new_inode_pseudo · ef1548ad
      Eric W. Biederman authored
      Recently syzbot reported that unmounting proc when there is an ongoing
      inotify watch on the root directory of proc could result in a use
      after free when the watch is removed after the unmount of proc
      when the watcher exits.
      
      Commit 69879c01 ("proc: Remove the now unnecessary internal mount
      of proc") made it easier to unmount proc and allowed syzbot to see the
      problem, but looking at the code it has been around for a long time.
      
      Looking at the code the fsnotify watch should have been removed by
      fsnotify_sb_delete in generic_shutdown_super.  Unfortunately the inode
      was allocated with new_inode_pseudo instead of new_inode so the inode
      was not on the sb->s_inodes list.  Which prevented
      fsnotify_unmount_inodes from finding the inode and removing the watch
      as well as made it so the "VFS: Busy inodes after unmount" warning
      could not find the inodes to warn about them.
      
      Make all of the inodes in proc visible to generic_shutdown_super,
      and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
      The only functional difference is that new_inode places the inodes
      on the sb->s_inodes list.
      
      I wrote a small test program and I can verify that without changes it
      can trigger this issue, and by replacing new_inode_pseudo with
      new_inode the issues goes away.
      
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com
      Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com
      Fixes: 0097875b ("proc: Implement /proc/thread-self to point at the directory of the current thread")
      Fixes: 021ada7d ("procfs: switch /proc/self away from proc_dir_entry")
      Fixes: 51f0885e ("vfs,proc: guarantee unique inodes in /proc")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      ef1548ad
  2. 10 Jun, 2020 1 commit
  3. 19 May, 2020 1 commit
  4. 30 Apr, 2020 3 commits
    • Eric W. Biederman's avatar
      posix-cpu-timers: Use pids not tasks in lookup · 2dd8083f
      Eric W. Biederman authored
      The current posix-cpu-timer code uses pids when holding persistent
      references in timers.  However the lookups from clock_id_t still return
      tasks that need to be converted into pids for use.
      
      This results in usage being pid->task->pid and that can race with
      release_task and de_thread.  This can lead to some not wrong but
      surprising results.  Surprising enough that Oleg and I both thought
      there were some bugs in the code for a while.
      
      This set of changes modifies the code to just lookup, verify, and return
      pids from the clockid_t lookups to remove those potentialy troublesome
      races.
      
      Eric W. Biederman (3):
            posix-cpu-timers: Extend rcu_read_lock removing task_struct references
            posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
            posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
      
       kernel/time/posix-cpu-timers.c | 102 ++++++++++++++++++-----------------------
       1 file changed, 45 insertions(+), 57 deletions(-)
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      2dd8083f
    • Oleg Nesterov's avatar
      remove the no longer needed pid_alive() check in __task_pid_nr_ns() · 1dd694a1
      Oleg Nesterov authored
      Starting from 2c470475 ("pids: Move the pgrp and session pid pointers
      from task_struct to signal_struct") __task_pid_nr_ns() doesn't dereference
      task->group_leader, we can remove the pid_alive() check.
      
      pid_nr_ns() has to check pid != NULL anyway, pid_alive() just adds the
      unnecessary confusion.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      1dd694a1
    • Eric W. Biederman's avatar
      Removing has_group_leader_pid · 06576edd
      Eric W. Biederman authored
      With de_thread now calling exchange_tids has_group_leader_pid no longer
      makes any sense and is equivalent to calling thread_group_leader.
      
      As there are only 2 remaining users of has_group_leader_pid let's
      update the code and get rid of has_group_leader_pid.
      
      There is one extra patch to lookup_task that performs that unifies
      to code paths that become identical when has_group_leader_pid went
      away.
      
      Eric W. Biederman (4):
            posix-cpu-timer: Tidy up group_leader logic in lookup_task
            posix-cpu-timer:  Unify the now redundant code in lookup_task
            exec: Remove BUG_ON(has_group_leader_pid)
            signal: Remove has_group_leader_pid
      
       fs/exec.c                      |  1 -
       include/linux/sched/signal.h   | 11 -----------
       kernel/time/posix-cpu-timers.c | 21 ++++++++-------------
       3 files changed, 8 insertions(+), 25 deletions(-)
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      06576edd
  5. 29 Apr, 2020 3 commits
    • Eric W. Biederman's avatar
      posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock · 96498773
      Eric W. Biederman authored
      Now that the codes store references to pids instead of referendes to
      tasks.  Looking up a task for a clock instead of looking up a struct
      pid makes the code more difficult to verify it is correct than
      necessary.
      
      In posix_cpu_timers_create get_task_pid can race with release_task for
      threads and return a NULL pid.  As put_pid and cpu_timer_task_rcu
      handle NULL pids just fine the code works without problems but it is
      an extra case to consider and keep in mind while verifying and
      modifying the code.
      
      There are races with de_thread to consider that only don't apply
      because thread clocks are only allowed for threads in the same
      thread_group.
      
      So instead of leaving a burden for people making modification to the
      code in the future return a rcu protected struct pid for the clock
      instead.
      
      The logic for __get_task_for_pid and lookup_task has been folded into
      the new function pid_for_clock with the only change being the logic
      has been modified from working on a task to working on a pid that
      will be returned.
      
      In posix_cpu_clock_get instead of calling pid_for_clock checking the
      result and then calling pid_task to get the task.  The result of
      pid_for_clock is fed directly into pid_task.  This is safe because
      pid_task handles NULL pids.  As such an extra error check was
      unnecessary.
      
      Instead of hiding the flag that enables the special clock_gettime
      handling, I have made the 3 callers just pass the flag in themselves.
      That is less code and seems just as simple to work with as the
      wrapper functions.
      
      Historically the clock_gettime special case of allowing a process
      clock to be found by the thread id did not even exist [33ab0fec]
      but Thomas Gleixner reports that he has found code that uses that
      functionality [55e8c8eb].
      
      Link: https://lkml.kernel.org/r/87zhaxqkwa.fsf@nanos.tec.linutronix.de/
      Ref: 33ab0fec ("posix-timers: Consolidate posix_cpu_clock_get()")
      Ref: 55e8c8eb ("posix-cpu-timers: Store a reference to a pid not a task")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      96498773
    • Eric W. Biederman's avatar
      posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type · fece9826
      Eric W. Biederman authored
      Taking a clock and returning a pid_type is a more general and
      a superset of taking a timer and returning a pid_type.
      
      Perform this generalization so that future changes may use
      this code on clocks as well as timers.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      fece9826
    • Eric W. Biederman's avatar
      posix-cpu-timers: Extend rcu_read_lock removing task_struct references · 9bf7c324
      Eric W. Biederman authored
      Now that the code stores of pid references it is no longer necessary
      or desirable to take a reference on task_struct in __get_task_for_clock.
      
      Instead extend the scope of rcu_read_lock and remove the reference
      counting on struct task_struct entirely.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      9bf7c324
  6. 28 Apr, 2020 7 commits
    • Eric W. Biederman's avatar
      signal: Remove has_group_leader_pid · bbd40fc4
      Eric W. Biederman authored
      After the introduction of exchange_tids has_group_leader_pid is
      equivalent to thread_group_leader.  After the last couple of cleanups
      has_group_leader_pid has no more callers.
      
      So remove the now unused and redundant has_group_leader_pid.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      bbd40fc4
    • Eric W. Biederman's avatar
      exec: Remove BUG_ON(has_group_leader_pid) · 610b8188
      Eric W. Biederman authored
      With the introduction of exchange_tids thread_group_leader and
      has_group_leader_pid have become equivalent.  Further at this point in the
      code a thread group has exactly two threads, the previous thread_group_leader
      that is waiting to be reaped and tsk.  So we know it is impossible for tsk to
      be the thread_group_leader.
      
      This is also the last user of has_group_leader_pid so removing this check
      will allow has_group_leader_pid to be removed.
      
      So remove the "BUG_ON(has_group_leader_pid)" that will never fire.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      610b8188
    • Eric W. Biederman's avatar
      posix-cpu-timer: Unify the now redundant code in lookup_task · c7f51940
      Eric W. Biederman authored
      Now that both !thread paths through lookup_task call
      thread_group_leader, unify them into the single test at the end of
      lookup_task.
      
      This unification just makes it clear what is happening in the gettime
      special case of lookup_task.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      c7f51940
    • Eric W. Biederman's avatar
      posix-cpu-timer: Tidy up group_leader logic in lookup_task · 8feebc67
      Eric W. Biederman authored
      Replace has_group_leader_pid with thread_group_leader.  Years ago Oleg
      suggested changing thread_group_leader to has_group_leader_pid to handle
      races.  Looking at the code then and now I don't see how it ever helped.
      Especially as then the code really did need to be the
      thread_group_leader.
      
      Today it doesn't make a difference if thread_group_leader races with
      de_thread as the task returned from lookup_task in the non-thread case is
      just used to find values in task->signal.
      
      Since the races with de_thread have never been handled revert
      has_group_header_pid to thread_group_leader for clarity.
      
      Update the comment in lookup_task to remove implementation details that
      are no longer true and to mention task->signal instead of task->sighand,
      as the relevant cpu timer details are all in task->signal.
      
      Ref: 55e8c8eb ("posix-cpu-timers: Store a reference to a pid not a task")
      Ref: c0deae8c ("posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      8feebc67
    • Eric W. Biederman's avatar
      proc: Ensure we see the exit of each process tid exactly · 50712280
      Eric W. Biederman authored
      In the work to remove proc_mnt I noticed that we were calling
      proc_flush_task now proc_flush_pid possibly multiple times for the same
      pid because of how de_thread works.
      
      This is a bare minimal patchset to sort out de_thread, by introducing
      exchange_tids and the helper of exchange_tids hlists_swap_heads_rcu.
      
      The actual call of exchange_tids should be slowpath so I have
      prioritized readability over getting every last drop of performance.
      
      I have also read through a bunch of the code to see if I could find
      anything that would be affected by this change.  Users of
      has_group_leader_pid were a good canidates.  But I also looked at other
      cases that might have a pid->task->pid transition.  I ignored other
      sources of races with de_thread and exec as those are preexisting.
      
      I found a close call with send_signals user of task_active_pid_ns, but
      all pids of a thread group are guaranteeds to be in the same pid
      namespace so there is not a problem.
      
      I found a few pieces of debugging code that do:
      
      	task = pid_task(pid, PIDTYPE_PID);
      	if (task) {
      		printk("%u\n", task->pid);
      	}
      
      But I can't see how we care if it happens at the wrong moment that
      task->pid might not match pid_nr(pid);
      
      Similarly because the code in posix-cpu-timers goes pid->task->pid it
      feels like there should be a problem.  But as the code that works with
      PIDTYPE_PID is only available within the thread group, and as de_thread
      kills all of the other threads before it makes any changes of this
      kind the race can not happen.
      
      In short I don't think this change will introduce any regressions.
      
      Eric W. Biederman (2):
            rculist: Add hlists_swap_heads_rcu
            proc: Ensure we see the exit of each process tid exactly once
      
       fs/exec.c               |  5 +----
       include/linux/pid.h     |  1 +
       include/linux/rculist.h | 21 +++++++++++++++++++++
       kernel/pid.c            | 19 +++++++++++++++++++
       4 files changed, 42 insertions(+), 4 deletions(-)
      
      Link: https://lore.kernel.org/lkml/87sggnajpv.fsf_-_@x220.int.ebiederm.org/Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      50712280
    • Eric W. Biederman's avatar
      proc: Ensure we see the exit of each process tid exactly once · 6b03d130
      Eric W. Biederman authored
      When the thread group leader changes during exec and the old leaders
      thread is reaped proc_flush_pid will flush the dentries for the entire
      process because the leader still has it's original pid.
      
      Fix this by exchanging the pids in an rcu safe manner,
      and wrapping the code to do that up in a helper exchange_tids.
      
      When I removed switch_exec_pids and introduced this behavior
      in d73d6529 ("[PATCH] pidhash: kill switch_exec_pids") there
      really was nothing that cared as flushing happened with
      the cached dentry and de_thread flushed both of them on exec.
      
      This lack of fully exchanging pids became a problem a few months later
      when I introduced 48e6484d ("[PATCH] proc: Rewrite the proc dentry
      flush on exit optimization").  Which overlooked the de_thread case
      was no longer swapping pids, and I was looking up proc dentries
      by task->pid.
      
      The current behavior isn't properly a bug as everything in proc will
      continue to work correctly just a little bit less efficiently.  Fix
      this just so there are no little surprise corner cases waiting to bite
      people.
      
      -- Oleg points out this could be an issue in next_tgid in proc where
         has_group_leader_pid is called, and reording some of the assignments
         should fix that.
      
      -- Oleg points out this will break the 10 year old hack in __exit_signal.c
      >	/*
      >	 * This can only happen if the caller is de_thread().
      >	 * FIXME: this is the temporary hack, we should teach
      >	 * posix-cpu-timers to handle this case correctly.
      >	 */
      >	if (unlikely(has_group_leader_pid(tsk)))
      >		posix_cpu_timers_exit_group(tsk);
      
      The code in next_tgid has been changed to use PIDTYPE_TGID,
      and the posix cpu timers code has been fixed so it does not
      need the 10 year old hack, so this should be safe to merge
      now.
      
      Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Fixes: 48e6484d ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      6b03d130
    • Eric W. Biederman's avatar
      rculist: Add hlists_swap_heads_rcu · 35fc0e3b
      Eric W. Biederman authored
      Using the struct pid to refer to two tasks in de_thread was a clever
      idea and ultimately too clever, as it has lead to proc_flush_task
      being called inconsistently.
      
      To support rectifying this add hlists_swap_heads_rcu.  An hlist
      primitive that just swaps the hlist heads of two lists.  This is
      exactly what is needed for exchanging the pids of two tasks.
      
      Only consideration of correctness of the code has been given,
      as the caller is expected to be a slowpath.
      
      Link: https://lore.kernel.org/lkml/87mu6vajnq.fsf_-_@x220.int.ebiederm.org/Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      35fc0e3b
  7. 24 Apr, 2020 4 commits
    • Eric W. Biederman's avatar
      proc: Use PIDTYPE_TGID in next_tgid · 3147d8aa
      Eric W. Biederman authored
      Combine the pid_task and thes test has_group_leader_pid into a single
      dereference by using pid_task(PIDTYPE_TGID).
      
      This makes the code simpler and proof against needing to even think
      about any shenanigans that de_thread might get up to.
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      3147d8aa
    • Eric W. Biederman's avatar
      proc: modernize proc to support multiple private instances · 0fb5ce62
      Eric W. Biederman authored
      Alexey Gladkov <gladkov.alexey@gmail.com> writes:
       Procfs modernization:
       ---------------------
       Historically procfs was always tied to pid namespaces, during pid
       namespace creation we internally create a procfs mount for it. However,
       this has the effect that all new procfs mounts are just a mirror of the
       internal one, any change, any mount option update, any new future
       introduction will propagate to all other procfs mounts that are in the
       same pid namespace.
      
       This may have solved several use cases in that time. However today we
       face new requirements, and making procfs able to support new private
       instances inside same pid namespace seems a major point. If we want to
       to introduce new features and security mechanisms we have to make sure
       first that we do not break existing usecases. Supporting private procfs
       instances will allow to support new features and behaviour without
       propagating it to all other procfs mounts.
      
       Today procfs is more of a burden especially to some Embedded, IoT,
       sandbox, container use cases. In user space we are over-mounting null
       or inaccessible files on top to hide files and information. If we want
       to hide pids we have to create PID namespaces otherwise mount options
       propagate to all other proc mounts, changing a mount option value in one
       mount will propagate to all other proc mounts. If we want to introduce
       new features, then they will propagate to all other mounts too, resulting
       either maybe new useful functionality or maybe breaking stuff. We have
       also to note that userspace should not workaround procfs, the kernel
       should just provide a sane simple interface.
      
       In this regard several developers and maintainers pointed out that
       there are problems with procfs and it has to be modernized:
      
       "Here's another one: split up and modernize /proc." by Andy Lutomirski [1]
      
       Discussion about kernel pointer leaks:
      
       "And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
       In fact, the primary things tend to be /proc and /sys, not dmesg
       itself." By Linus Torvalds [2]
      
       Lot of other areas in the kernel and filesystems have been updated to be
       able to support private instances, devpts is one major example [3].
      
       Which will be used for:
      
       1) Embedded systems and IoT: usually we have one supervisor for
       apps, we have some lightweight sandbox support, however if we create
       pid namespaces we have to manage all the processes inside too,
       where our goal is to be able to run a bunch of apps each one inside
       its own mount namespace, maybe use network namespaces for vlans
       setups, but right now we only want mount namespaces, without all the
       other complexity. We want procfs to behave more like a real file system,
       and block access to inodes that belong to other users. The 'hidepid=' will
       not work since it is a shared mount option.
      
       2) Containers, sandboxes and Private instances of file systems - devpts case
       Historically, lot of file systems inside Linux kernel view when instantiated
       were just a mirror of an already created and mounted filesystem. This was the
       case of devpts filesystem, it seems at that time the requirements were to
       optimize things and reuse the same memory, etc. This design used to work but not
       anymore with today's containers, IoT, hostile environments and all the privacy
       challenges that Linux faces.
      
       In that regards, devpts was updated so that each new mounts is a total
       independent file system by the following patches:
      
       "devpts: Make each mount of devpts an independent filesystem" by
       Eric W. Biederman [3] [4]
      
       3) Linux Security Modules have multiple ptrace paths inside some
       subsystems, however inside procfs, the implementation does not guarantee
       that the ptrace() check which triggers the security_ptrace_check() hook
       will always run. We have the 'hidepid' mount option that can be used to
       force the ptrace_may_access() check inside has_pid_permissions() to run.
       The problem is that 'hidepid' is per pid namespace and not attached to
       the mount point, any remount or modification of 'hidepid' will propagate
       to all other procfs mounts.
      
       This also does not allow to support Yama LSM easily in desktop and user
       sessions. Yama ptrace scope which restricts ptrace and some other
       syscalls to be allowed only on inferiors, can be updated to have a
       per-task context, where the context will be inherited during fork(),
       clone() and preserved across execve(). If we support multiple private
       procfs instances, then we may force the ptrace_may_access() on
       /proc/<pids>/ to always run inside that new procfs instances. This will
       allow to specifiy on user sessions if we should populate procfs with
       pids that the user can ptrace or not.
      
       By using Yama ptrace scope, some restricted users will only be able to see
       inferiors inside /proc, they won't even be able to see their other
       processes. Some software like Chromium, Firefox's crash handler, Wine
       and others are already using Yama to restrict which processes can be
       ptracable. With this change this will give the possibility to restrict
       /proc/<pids>/ but more importantly this will give desktop users a
       generic and usuable way to specifiy which users should see all processes
       and which user can not.
      
       Side notes:
      
       * This covers the lack of seccomp where it is not able to parse
       arguments, it is easy to install a seccomp filter on direct syscalls
       that operate on pids, however /proc/<pid>/ is a Linux ABI using
       filesystem syscalls. With this change all LSMs should be able to analyze
       open/read/write/close... on /proc/<pid>/
      
       4) This will allow to implement new features either in kernel or
       userspace without having to worry about procfs.
       In containers, sandboxes, etc we have workarounds to hide some /proc
       inodes, this should be supported natively without doing extra complex
       work, the kernel should be able to support sane options that work with
       today and future Linux use cases.
      
       5) Creation of new superblock with all procfs options for each procfs
       mount will fix the ignoring of mount options. The problem is that the
       second mount of procfs in the same pid namespace ignores the mount
       options. The mount options are ignored without error until procfs is
       remounted.
      
       Before:
      
       proc /proc proc rw,relatime,hidepid=2 0 0
      
       mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
       +++ exited with 0 +++
      
       proc /proc proc rw,relatime,hidepid=2 0 0
       proc /tmp/proc proc rw,relatime,hidepid=2 0 0
      
       proc /proc proc rw,relatime,hidepid=1 0 0
       proc /tmp/proc proc rw,relatime,hidepid=1 0 0
      
       After:
      
       proc /proc proc rw,relatime,hidepid=ptraceable 0 0
      
       proc /proc proc rw,relatime,hidepid=ptraceable 0 0
       proc /tmp/proc proc rw,relatime,hidepid=invisible 0 0
      
       Introduced changes:
       -------------------
       Each mount of procfs creates a separate procfs instance with its own
       mount options.
      
       This series adds few new mount options:
      
       * New 'hidepid=ptraceable' or 'hidepid=4' mount option to show only ptraceable
       processes in the procfs. This allows to support lightweight sandboxes in
       Embedded Linux, also solves the case for LSM where now with this mount option,
       we make sure that they have a ptrace path in procfs.
      
       * 'subset=pid' that allows to hide non-pid inodes from procfs. It can be used
       in containers and sandboxes, as these are already trying to hide and block
       access to procfs inodes anyway.
      
       ChangeLog:
       ----------
       * Rebase on top of v5.7-rc1.
       * Fix a resource leak if proc is not mounted or if proc is simply reconfigured.
       * Add few selftests.
      
       * After a discussion with Eric W. Biederman, the numerical values for hidepid
         parameter have been removed from uapi.
       * Remove proc_self and proc_thread_self from the pid_namespace struct.
       * I took into account the comment of Kees Cook.
       * Update Reviewed-by tags.
      
       * 'subset=pidfs' renamed to 'subset=pid' as suggested by Alexey Dobriyan.
       * Include Reviewed-by tags.
      
       * Rebase on top of Eric W. Biederman's procfs changes.
       * Add human readable values of 'hidepid' as suggested by Andy Lutomirski.
      
       * Started using RCU lock to clean dcache entries as suggested by Linus Torvalds.
      
       * 'pidonly=1' renamed to 'subset=pidfs' as suggested by Alexey Dobriyan.
       * HIDEPID_* moved to uapi/ as they are user interface to mount().
         Suggested-by Alexey Dobriyan <adobriyan@gmail.com>
      
       * 'hidepid=' and 'gid=' mount options are moved from pid namespace to superblock.
       * 'newinstance' mount option removed as suggested by Eric W. Biederman.
          Mount of procfs always creates a new instance.
       * 'limit_pids' renamed to 'hidepid=3'.
       * I took into account the comment of Linus Torvalds [7].
       * Documentation added.
      
       * Fixed a bug that caused a problem with the Fedora boot.
       * The 'pidonly' option is visible among the mount options.
      
       * Renamed mount options to 'newinstance' and 'pids='
      Suggested-by: default avatarAndy Lutomirski <luto@kernel.org>
       * Fixed order of commit, Suggested-by: Andy Lutomirski <luto@kernel.org>
       * Many bug fixes.
      
       * Removed 'unshared' mount option and replaced it with 'limit_pids'
          which is attached to the current procfs mount.
          Suggested-by Andy Lutomirski <luto@kernel.org>
       * Do not fill dcache with pid entries that we can not ptrace.
       * Many bug fixes.
      
       References:
       -----------
       [1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
       [2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
       [3] https://lwn.net/Articles/689539/
       [4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
       [5] https://lkml.org/lkml/2017/5/2/407
       [6] https://lkml.org/lkml/2017/5/3/357
       [7] https://lkml.org/lkml/2018/5/11/505
      
       Alexey Gladkov (7):
         proc: rename struct proc_fs_info to proc_fs_opts
         proc: allow to mount many instances of proc in one pid namespace
         proc: instantiate only pids that we can ptrace on 'hidepid=4' mount
           option
         proc: add option to mount only a pids subset
         docs: proc: add documentation for "hidepid=4" and "subset=pid" options
           and new mount behavior
        proc: use human-readable values for hidepid
         proc: use named enums for better readability
      
        Documentation/filesystems/proc.rst            |  92 +++++++++---
        fs/proc/base.c                                |  48 +++++--
        fs/proc/generic.c                             |   9 ++
        fs/proc/inode.c                               |  30 +++-
        fs/proc/root.c                                | 131 +++++++++++++-----
        fs/proc/self.c                                |   6 +-
        fs/proc/thread_self.c                         |   6 +-
        fs/proc_namespace.c                           |  14 +-
        include/linux/pid_namespace.h                 |  12 --
        include/linux/proc_fs.h                       |  30 +++-
        tools/testing/selftests/proc/.gitignore       |   2 +
        tools/testing/selftests/proc/Makefile         |   2 +
        .../selftests/proc/proc-fsconfig-hidepid.c    |  50 +++++++
        .../selftests/proc/proc-multiple-procfs.c     |  48 +++++++
        14 files changed, 384 insertions(+), 96 deletions(-)
        create mode 100644 tools/testing/selftests/proc/proc-fsconfig-hidepid.c
        create mode 100644 tools/testing/selftests/proc/proc-multiple-procfs.c
      
      Link: https://lore.kernel.org/lkml/20200419141057.621356-1-gladkov.alexey@gmail.com/Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      0fb5ce62
    • Alexey Gladkov's avatar
    • Eric W. Biederman's avatar
      proc: Put thread_pid in release_task not proc_flush_pid · 6ade99ec
      Eric W. Biederman authored
      Oleg pointed out that in the unlikely event the kernel is compiled
      with CONFIG_PROC_FS unset that release_task will now leak the pid.
      
      Move the put_pid out of proc_flush_pid into release_task to fix this
      and to guarantee I don't make that mistake again.
      
      When possible it makes sense to keep get and put in the same function
      so it can easily been seen how they pair up.
      
      Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      6ade99ec
  8. 22 Apr, 2020 7 commits
  9. 21 Apr, 2020 1 commit
    • Eric W. Biederman's avatar
      signal: Avoid corrupting si_pid and si_uid in do_notify_parent · 61e713bd
      Eric W. Biederman authored
      Christof Meerwald <cmeerw@cmeerw.org> writes:
      > Hi,
      >
      > this is probably related to commit
      > 7a0cf094 (signal: Correct namespace
      > fixups of si_pid and si_uid).
      >
      > With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a
      > properly set si_pid field - this seems to happen for multi-threaded
      > child processes.
      >
      > A simple test program (based on the sample from the signalfd man page):
      >
      > #include <sys/signalfd.h>
      > #include <signal.h>
      > #include <unistd.h>
      > #include <spawn.h>
      > #include <stdlib.h>
      > #include <stdio.h>
      >
      > #define handle_error(msg) \
      >     do { perror(msg); exit(EXIT_FAILURE); } while (0)
      >
      > int main(int argc, char *argv[])
      > {
      >   sigset_t mask;
      >   int sfd;
      >   struct signalfd_siginfo fdsi;
      >   ssize_t s;
      >
      >   sigemptyset(&mask);
      >   sigaddset(&mask, SIGCHLD);
      >
      >   if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
      >     handle_error("sigprocmask");
      >
      >   pid_t chldpid;
      >   char *chldargv[] = { "./sfdclient", NULL };
      >   posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL);
      >
      >   sfd = signalfd(-1, &mask, 0);
      >   if (sfd == -1)
      >     handle_error("signalfd");
      >
      >   for (;;) {
      >     s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
      >     if (s != sizeof(struct signalfd_siginfo))
      >       handle_error("read");
      >
      >     if (fdsi.ssi_signo == SIGCHLD) {
      >       printf("Got SIGCHLD %d %d %d %d\n",
      >           fdsi.ssi_status, fdsi.ssi_code,
      >           fdsi.ssi_uid, fdsi.ssi_pid);
      >       return 0;
      >     } else {
      >       printf("Read unexpected signal\n");
      >     }
      >   }
      > }
      >
      >
      > and a multi-threaded client to test with:
      >
      > #include <unistd.h>
      > #include <pthread.h>
      >
      > void *f(void *arg)
      > {
      >   sleep(100);
      > }
      >
      > int main()
      > {
      >   pthread_t t[8];
      >
      >   for (int i = 0; i != 8; ++i)
      >   {
      >     pthread_create(&t[i], NULL, f, NULL);
      >   }
      > }
      >
      > I tried to do a bit of debugging and what seems to be happening is
      > that
      >
      >   /* From an ancestor pid namespace? */
      >   if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
      >
      > fails inside task_pid_nr_ns because the check for "pid_alive" fails.
      >
      > This code seems to be called from do_notify_parent and there we
      > actually have "tsk != current" (I am assuming both are threads of the
      > current process?)
      
      I instrumented the code with a warning and received the following backtrace:
      > WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15
      > Modules linked in:
      > CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924
      > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      > RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15
      > Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd
      > RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046
      > RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4
      > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29
      > RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001
      > R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140
      > R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140
      > FS:  0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0
      > Call Trace:
      >  send_signal+0x1c8/0x310
      >  do_notify_parent+0x50f/0x550
      >  release_task.part.21+0x4fd/0x620
      >  do_exit+0x6f6/0xaf0
      >  do_group_exit+0x42/0xb0
      >  get_signal+0x13b/0xbb0
      >  do_signal+0x2b/0x670
      >  ? __audit_syscall_exit+0x24d/0x2b0
      >  ? rcu_read_lock_sched_held+0x4d/0x60
      >  ? kfree+0x24c/0x2b0
      >  do_syscall_64+0x176/0x640
      >  ? trace_hardirqs_off_thunk+0x1a/0x1c
      >  entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      The immediate problem is as Christof noticed that "pid_alive(current) == false".
      This happens because do_notify_parent is called from the last thread to exit
      in a process after that thread has been reaped.
      
      The bigger issue is that do_notify_parent can be called from any
      process that manages to wait on a thread of a multi-threaded process
      from wait_task_zombie.  So any logic based upon current for
      do_notify_parent is just nonsense, as current can be pretty much
      anything.
      
      So change do_notify_parent to call __send_signal directly.
      
      Inspecting the code it appears this problem has existed since the pid
      namespace support started handling this case in 2.6.30.  This fix only
      backports to 7a0cf094 ("signal: Correct namespace fixups of si_pid and si_uid")
      where the problem logic was moved out of __send_signal and into send_signal.
      
      Cc: stable@vger.kernel.org
      Fixes: 6588c1e3 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
      Ref: 921cf9f6 ("signals: protect cinit from unblocked SIG_DFL signals")
      Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/Reported-by: default avatarChristof Meerwald <cmeerw@cmeerw.org>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      61e713bd
  10. 16 Apr, 2020 1 commit
    • Eric W. Biederman's avatar
      proc: Handle umounts cleanly · 4fa3b1c4
      Eric W. Biederman authored
      syzbot writes:
      > KASAN: use-after-free Read in dput (2)
      >
      > proc_fill_super: allocate dentry failed
      > ==================================================================
      > BUG: KASAN: use-after-free in fast_dput fs/dcache.c:727 [inline]
      > BUG: KASAN: use-after-free in dput+0x53e/0xdf0 fs/dcache.c:846
      > Read of size 4 at addr ffff88808a618cf0 by task syz-executor.0/8426
      >
      > CPU: 0 PID: 8426 Comm: syz-executor.0 Not tainted 5.6.0-next-20200412-syzkaller #0
      > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      > Call Trace:
      >  __dump_stack lib/dump_stack.c:77 [inline]
      >  dump_stack+0x188/0x20d lib/dump_stack.c:118
      >  print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382
      >  __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511
      >  kasan_report+0x33/0x50 mm/kasan/common.c:625
      >  fast_dput fs/dcache.c:727 [inline]
      >  dput+0x53e/0xdf0 fs/dcache.c:846
      >  proc_kill_sb+0x73/0xf0 fs/proc/root.c:195
      >  deactivate_locked_super+0x8c/0xf0 fs/super.c:335
      >  vfs_get_super+0x258/0x2d0 fs/super.c:1212
      >  vfs_get_tree+0x89/0x2f0 fs/super.c:1547
      >  do_new_mount fs/namespace.c:2813 [inline]
      >  do_mount+0x1306/0x1b30 fs/namespace.c:3138
      >  __do_sys_mount fs/namespace.c:3347 [inline]
      >  __se_sys_mount fs/namespace.c:3324 [inline]
      >  __x64_sys_mount+0x18f/0x230 fs/namespace.c:3324
      >  do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
      >  entry_SYSCALL_64_after_hwframe+0x49/0xb3
      > RIP: 0033:0x45c889
      > Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      > RSP: 002b:00007ffc1930ec48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
      > RAX: ffffffffffffffda RBX: 0000000001324914 RCX: 000000000045c889
      > RDX: 0000000020000140 RSI: 0000000020000040 RDI: 0000000000000000
      > RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000
      > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
      > R13: 0000000000000749 R14: 00000000004ca15a R15: 0000000000000013
      
      Looking at the code now that it the internal mount of proc is no
      longer used it is possible to unmount proc.   If proc is unmounted
      the fields of the pid namespace that were used for filesystem
      specific state are not reinitialized.
      
      Which means that proc_self and proc_thread_self can be pointers to
      already freed dentries.
      
      The reported user after free appears to be from mounting and
      unmounting proc followed by mounting proc again and using error
      injection to cause the new root dentry allocation to fail.  This in
      turn results in proc_kill_sb running with proc_self and
      proc_thread_self still retaining their values from the previous mount
      of proc.  Then calling dput on either proc_self of proc_thread_self
      will result in double put.  Which KASAN sees as a use after free.
      
      Solve this by always reinitializing the filesystem state stored
      in the struct pid_namespace, when proc is unmounted.
      
      Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com
      Acked-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Fixes: 69879c01 ("proc: Remove the now unnecessary internal mount of proc")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      4fa3b1c4
  11. 12 Apr, 2020 10 commits
    • Linus Torvalds's avatar
      Linux 5.7-rc1 · 8f3d9f35
      Linus Torvalds authored
      8f3d9f35
    • Linus Torvalds's avatar
      MAINTAINERS: sort field names for all entries · 3b50142d
      Linus Torvalds authored
      This sorts the actual field names too, potentially causing even more
      chaos and confusion at merge time if you have edited the MAINTAINERS
      file.  But the end result is a more consistent layout, and hopefully
      it's a one-time pain minimized by doing this just before the -rc1
      release.
      
      This was entirely scripted:
      
        ./scripts/parse-maintainers.pl --input=MAINTAINERS --output=MAINTAINERS --order
      Requested-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b50142d
    • Linus Torvalds's avatar
      MAINTAINERS: sort entries by entry name · 4400b7d6
      Linus Torvalds authored
      They are all supposed to be sorted, but people who add new entries don't
      always know the alphabet.  Plus sometimes the entry names get edited,
      and people don't then re-order the entry.
      
      Let's see how painful this will be for merging purposes (the MAINTAINERS
      file is often edited in various different trees), but Joe claims there's
      relatively few patches in -next that touch this, and doing it just
      before -rc1 is likely the best time.  Fingers crossed.
      
      This was scripted with
      
        /scripts/parse-maintainers.pl --input=MAINTAINERS --output=MAINTAINERS
      
      but then I also ended up manually upper-casing a few entry names that
      stood out when looking at the end result.
      Requested-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4400b7d6
    • Linus Torvalds's avatar
      Merge tag 'x86-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4f8a3cc1
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A set of three patches to fix the fallout of the newly added split
        lock detection feature.
      
        It addressed the case where a KVM guest triggers a split lock #AC and
        KVM reinjects it into the guest which is not prepared to handle it.
      
        Add proper sanity checks which prevent the unconditional injection
        into the guest and handles the #AC on the host side in the same way as
        user space detections are handled. Depending on the detection mode it
        either warns and disables detection for the task or kills the task if
        the mode is set to fatal"
      
      * tag 'x86-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest
        KVM: x86: Emulate split-lock access as a write in emulator
        x86/split_lock: Provide handle_guest_split_lock()
      4f8a3cc1
    • Linus Torvalds's avatar
      Merge tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0785249f
      Linus Torvalds authored
      Pull time(keeping) updates from Thomas Gleixner:
      
       - Fix the time_for_children symlink in /proc/$PID/ so it properly
         reflects that it part of the 'time' namespace
      
       - Add the missing userns limit for the allowed number of time
         namespaces, which was half defined but the actual array member was
         not added. This went unnoticed as the array has an exessive empty
         member at the end but introduced a user visible regression as the
         output was corrupted.
      
       - Prevent further silent ucount corruption by adding a BUILD_BUG_ON()
         to catch half updated data.
      
      * tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        ucount: Make sure ucounts in /proc/sys/user don't regress again
        time/namespace: Add max_time_namespaces ucount
        time/namespace: Fix time_for_children symlink
      0785249f
    • Linus Torvalds's avatar
      Merge tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 590680d1
      Linus Torvalds authored
      Pull scheduler fixes/updates from Thomas Gleixner:
      
       - Deduplicate the average computations in the scheduler core and the
         fair class code.
      
       - Fix a raise between runtime distribution and assignement which can
         cause exceeding the quota by up to 70%.
      
       - Prevent negative results in the imbalanace calculation
      
       - Remove a stale warning in the workqueue code which can be triggered
         since the call site was moved out of preempt disabled code. It's a
         false positive.
      
       - Deduplicate the print macros for procfs
      
       - Add the ucmap values to the SCHED_DEBUG procfs output for completness
      
      * tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/debug: Add task uclamp values to SCHED_DEBUG procfs
        sched/debug: Factor out printing formats into common macros
        sched/debug: Remove redundant macro define
        sched/core: Remove unused rq::last_load_update_tick
        workqueue: Remove the warning in wq_worker_sleeping()
        sched/fair: Fix negative imbalance in imbalance calculation
        sched/fair: Fix race between runtime distribution and assignment
        sched/fair: Align rq->avg_idle and rq->avg_scan_cost
      590680d1
    • Linus Torvalds's avatar
      Merge tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 20e2aa81
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "Three fixes/updates for perf:
      
         - Fix the perf event cgroup tracking which tries to track the cgroup
           even for disabled events.
      
         - Add Ice Lake server support for uncore events
      
         - Disable pagefaults when retrieving the physical address in the
           sampling code"
      
      * tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/core: Disable page faults when getting phys address
        perf/x86/intel/uncore: Add Ice Lake server uncore support
        perf/cgroup: Correct indirection in perf_less_group_idx()
        perf/core: Fix event cgroup tracking
      20e2aa81
    • Linus Torvalds's avatar
      Merge tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 652fa53c
      Linus Torvalds authored
      Pull locking fixes from Thomas Gleixner:
       "Three small fixes/updates for the locking core code:
      
         - Plug a task struct reference leak in the percpu rswem
           implementation.
      
         - Document the refcount interaction with PID_MAX_LIMIT
      
         - Improve the 'invalid wait context' data dump in lockdep so it
           contains all information which is required to decode the problem"
      
      * tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/lockdep: Improve 'invalid wait context' splat
        locking/refcount: Document interaction with PID_MAX_LIMIT
        locking/percpu-rwsem: Fix a task_struct refcount
      652fa53c
    • Linus Torvalds's avatar
      Merge tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 · 4119bf9f
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Ten cifs/smb fixes:
      
         - five RDMA (smbdirect) related fixes
      
         - add experimental support for swap over SMB3 mounts
      
         - also a fix which improves performance of signed connections"
      
      * tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: enable swap on SMB3 mounts
        smb3: change noisy error message to FYI
        smb3: smbdirect support can be configured by default
        cifs: smbd: Do not schedule work to send immediate packet on every receive
        cifs: smbd: Properly process errors on ib_post_send
        cifs: Allocate crypto structures on the fly for calculating signatures of incoming packets
        cifs: smbd: Update receive credits before sending and deal with credits roll back on failure before sending
        cifs: smbd: Check send queue size before posting a send
        cifs: smbd: Merge code to track pending packets
        cifs: ignore cached share root handle closing errors
      4119bf9f
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 50bda5fa
      Linus Torvalds authored
      Pull NFS client bugfix from Trond Myklebust:
       "Fix an RCU read lock leakage in pnfs_alloc_ds_commits_list()"
      
      * tag 'nfs-for-5.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        pNFS: Fix RCU lock leakage
      50bda5fa
  12. 11 Apr, 2020 1 commit