1. 22 Apr, 2020 2 commits
    • proc: allow to mount many instances of proc in one pid namespace · fa10fed3
      Alexey Gladkov authored
      This patch allows multiple procfs instances inside the
      same pid namespace. The aim here is lightweight sandboxes, and to allow
      that we have to modernize procfs internals.
      
      1) The main aim of this work is to have on embedded systems one
      supervisor for apps. Right now we have some lightweight sandbox support,
      however if we create pid namespaces we have to manage all the
      processes inside too, where our goal is to be able to run a bunch of
      apps each one inside its own mount namespace without being able to
      notice each other. We only want to use mount namespaces, and we want
      procfs to behave more like a real mount point.
      
      2) Linux Security Modules have multiple ptrace paths inside some
      subsystems, however inside procfs, the implementation does not guarantee
      that the ptrace() check which triggers the security_ptrace_check() hook
      will always run. We have the 'hidepid' mount option that can be used to
      force the ptrace_may_access() check inside has_pid_permissions() to run.
      The problem is that 'hidepid' is per pid namespace and not attached to
      the mount point, any remount or modification of 'hidepid' will propagate
      to all other procfs mounts.
      
      This also makes it hard to support the Yama LSM in desktop and user
      sessions. Yama ptrace scope, which restricts ptrace and some other
      syscalls to be allowed only on inferiors, can be updated to have a
      per-task context, where the context will be inherited during fork(),
      clone() and preserved across execve(). If we support multiple private
      procfs instances, then we may force the ptrace_may_access() on
      /proc/<pids>/ to always run inside those new procfs instances. This will
      allow user sessions to specify whether procfs should be populated with
      pids that the user can ptrace or not.
      
      By using Yama ptrace scope, some restricted users will only be able to see
      inferiors inside /proc; they won't even be able to see their other
      processes. Some software like Chromium, Firefox's crash handler, Wine
      and others are already using Yama to restrict which processes can be
      ptraceable. This change makes it possible to restrict
      /proc/<pids>/ but more importantly it gives desktop users a
      generic and usable way to specify which users should see all processes
      and which users cannot.
      
      Side notes:
      * This covers a gap in seccomp, which is not able to parse
      arguments. It is easy to install a seccomp filter on direct syscalls
      that operate on pids, however /proc/<pid>/ is a Linux ABI using
      filesystem syscalls. With this change LSMs should be able to analyze
      open/read/write/close...
      
      In the new patch set version I removed the 'newinstance' option
      as suggested by Eric W. Biederman.
      
      Selftest has been added to verify new behavior.
      Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
      Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      fa10fed3
  2. 21 Apr, 2020 1 commit
    • signal: Avoid corrupting si_pid and si_uid in do_notify_parent · 61e713bd
      Eric W. Biederman authored
      Christof Meerwald <cmeerw@cmeerw.org> writes:
      > Hi,
      >
      > this is probably related to commit
      > 7a0cf094 (signal: Correct namespace
      > fixups of si_pid and si_uid).
      >
      > With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a
      > properly set si_pid field - this seems to happen for multi-threaded
      > child processes.
      >
      > A simple test program (based on the sample from the signalfd man page):
      >
      > #include <sys/signalfd.h>
      > #include <signal.h>
      > #include <unistd.h>
      > #include <spawn.h>
      > #include <stdlib.h>
      > #include <stdio.h>
      >
      > #define handle_error(msg) \
      >     do { perror(msg); exit(EXIT_FAILURE); } while (0)
      >
      > int main(int argc, char *argv[])
      > {
      >   sigset_t mask;
      >   int sfd;
      >   struct signalfd_siginfo fdsi;
      >   ssize_t s;
      >
      >   sigemptyset(&mask);
      >   sigaddset(&mask, SIGCHLD);
      >
      >   if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
      >     handle_error("sigprocmask");
      >
      >   pid_t chldpid;
      >   char *chldargv[] = { "./sfdclient", NULL };
      >   posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL);
      >
      >   sfd = signalfd(-1, &mask, 0);
      >   if (sfd == -1)
      >     handle_error("signalfd");
      >
      >   for (;;) {
      >     s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
      >     if (s != sizeof(struct signalfd_siginfo))
      >       handle_error("read");
      >
      >     if (fdsi.ssi_signo == SIGCHLD) {
      >       printf("Got SIGCHLD %d %d %d %d\n",
      >           fdsi.ssi_status, fdsi.ssi_code,
      >           fdsi.ssi_uid, fdsi.ssi_pid);
      >       return 0;
      >     } else {
      >       printf("Read unexpected signal\n");
      >     }
      >   }
      > }
      >
      >
      > and a multi-threaded client to test with:
      >
      > #include <unistd.h>
      > #include <pthread.h>
      >
      > void *f(void *arg)
      > {
      >   sleep(100);
      >   return NULL;
      > }
      >
      > int main()
      > {
      >   pthread_t t[8];
      >
      >   for (int i = 0; i != 8; ++i)
      >   {
      >     pthread_create(&t[i], NULL, f, NULL);
      >   }
      > }
      >
      > I tried to do a bit of debugging and what seems to be happening is
      > that
      >
      >   /* From an ancestor pid namespace? */
      >   if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
      >
      > fails inside task_pid_nr_ns because the check for "pid_alive" fails.
      >
      > This code seems to be called from do_notify_parent and there we
      > actually have "tsk != current" (I am assuming both are threads of the
      > current process?)
      
      I instrumented the code with a warning and received the following backtrace:
      > WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15
      > Modules linked in:
      > CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924
      > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      > RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15
      > Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd
      > RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046
      > RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4
      > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29
      > RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001
      > R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140
      > R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140
      > FS:  0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0
      > Call Trace:
      >  send_signal+0x1c8/0x310
      >  do_notify_parent+0x50f/0x550
      >  release_task.part.21+0x4fd/0x620
      >  do_exit+0x6f6/0xaf0
      >  do_group_exit+0x42/0xb0
      >  get_signal+0x13b/0xbb0
      >  do_signal+0x2b/0x670
      >  ? __audit_syscall_exit+0x24d/0x2b0
      >  ? rcu_read_lock_sched_held+0x4d/0x60
      >  ? kfree+0x24c/0x2b0
      >  do_syscall_64+0x176/0x640
      >  ? trace_hardirqs_off_thunk+0x1a/0x1c
      >  entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      The immediate problem is as Christof noticed that "pid_alive(current) == false".
      This happens because do_notify_parent is called from the last thread to exit
      in a process after that thread has been reaped.
      
      The bigger issue is that do_notify_parent can be called from any
      process that manages to wait on a thread of a multi-threaded process
      from wait_task_zombie.  So any logic based upon current for
      do_notify_parent is just nonsense, as current can be pretty much
      anything.
      
      So change do_notify_parent to call __send_signal directly.
      
      Inspecting the code it appears this problem has existed since the pid
      namespace support started handling this case in 2.6.30.  This fix only
      backports to 7a0cf094 ("signal: Correct namespace fixups of si_pid and si_uid")
      where the problem logic was moved out of __send_signal and into send_signal.
      
      Cc: stable@vger.kernel.org
      Fixes: 6588c1e3 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
      Ref: 921cf9f6 ("signals: protect cinit from unblocked SIG_DFL signals")
      Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/
      Reported-by: Christof Meerwald <cmeerw@cmeerw.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      61e713bd
  3. 16 Apr, 2020 1 commit
    • proc: Handle umounts cleanly · 4fa3b1c4
      Eric W. Biederman authored
      syzbot writes:
      > KASAN: use-after-free Read in dput (2)
      >
      > proc_fill_super: allocate dentry failed
      > ==================================================================
      > BUG: KASAN: use-after-free in fast_dput fs/dcache.c:727 [inline]
      > BUG: KASAN: use-after-free in dput+0x53e/0xdf0 fs/dcache.c:846
      > Read of size 4 at addr ffff88808a618cf0 by task syz-executor.0/8426
      >
      > CPU: 0 PID: 8426 Comm: syz-executor.0 Not tainted 5.6.0-next-20200412-syzkaller #0
      > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      > Call Trace:
      >  __dump_stack lib/dump_stack.c:77 [inline]
      >  dump_stack+0x188/0x20d lib/dump_stack.c:118
      >  print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382
      >  __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511
      >  kasan_report+0x33/0x50 mm/kasan/common.c:625
      >  fast_dput fs/dcache.c:727 [inline]
      >  dput+0x53e/0xdf0 fs/dcache.c:846
      >  proc_kill_sb+0x73/0xf0 fs/proc/root.c:195
      >  deactivate_locked_super+0x8c/0xf0 fs/super.c:335
      >  vfs_get_super+0x258/0x2d0 fs/super.c:1212
      >  vfs_get_tree+0x89/0x2f0 fs/super.c:1547
      >  do_new_mount fs/namespace.c:2813 [inline]
      >  do_mount+0x1306/0x1b30 fs/namespace.c:3138
      >  __do_sys_mount fs/namespace.c:3347 [inline]
      >  __se_sys_mount fs/namespace.c:3324 [inline]
      >  __x64_sys_mount+0x18f/0x230 fs/namespace.c:3324
      >  do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
      >  entry_SYSCALL_64_after_hwframe+0x49/0xb3
      > RIP: 0033:0x45c889
      > Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      > RSP: 002b:00007ffc1930ec48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
      > RAX: ffffffffffffffda RBX: 0000000001324914 RCX: 000000000045c889
      > RDX: 0000000020000140 RSI: 0000000020000040 RDI: 0000000000000000
      > RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000
      > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
      > R13: 0000000000000749 R14: 00000000004ca15a R15: 0000000000000013
      
      Looking at the code, now that the internal mount of proc is no
      longer used it is possible to unmount proc.  If proc is unmounted
      the fields of the pid namespace that were used for filesystem
      specific state are not reinitialized.
      
      Which means that proc_self and proc_thread_self can be pointers to
      already freed dentries.
      
      The reported use after free appears to be from mounting and
      unmounting proc followed by mounting proc again and using error
      injection to cause the new root dentry allocation to fail.  This in
      turn results in proc_kill_sb running with proc_self and
      proc_thread_self still retaining their values from the previous mount
      of proc.  Then calling dput on either proc_self or proc_thread_self
      will result in a double put.  Which KASAN sees as a use after free.
      
      Solve this by always reinitializing the filesystem state stored
      in the struct pid_namespace, when proc is unmounted.
      
      Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Fixes: 69879c01 ("proc: Remove the now unnecessary internal mount of proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      4fa3b1c4
  4. 12 Apr, 2020 10 commits
    • Linux 5.7-rc1 · 8f3d9f35
      Linus Torvalds authored
      8f3d9f35
    • MAINTAINERS: sort field names for all entries · 3b50142d
      Linus Torvalds authored
      This sorts the actual field names too, potentially causing even more
      chaos and confusion at merge time if you have edited the MAINTAINERS
      file.  But the end result is a more consistent layout, and hopefully
      it's a one-time pain minimized by doing this just before the -rc1
      release.
      
      This was entirely scripted:
      
        ./scripts/parse-maintainers.pl --input=MAINTAINERS --output=MAINTAINERS --order
      Requested-by: Joe Perches <joe@perches.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b50142d
    • MAINTAINERS: sort entries by entry name · 4400b7d6
      Linus Torvalds authored
      They are all supposed to be sorted, but people who add new entries don't
      always know the alphabet.  Plus sometimes the entry names get edited,
      and people don't then re-order the entry.
      
      Let's see how painful this will be for merging purposes (the MAINTAINERS
      file is often edited in various different trees), but Joe claims there's
      relatively few patches in -next that touch this, and doing it just
      before -rc1 is likely the best time.  Fingers crossed.
      
      This was scripted with
      
        ./scripts/parse-maintainers.pl --input=MAINTAINERS --output=MAINTAINERS
      
      but then I also ended up manually upper-casing a few entry names that
      stood out when looking at the end result.
      Requested-by: Joe Perches <joe@perches.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4400b7d6
    • Merge tag 'x86-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4f8a3cc1
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A set of three patches to fix the fallout of the newly added split
        lock detection feature.
      
        It addresses the case where a KVM guest triggers a split lock #AC and
        KVM reinjects it into the guest which is not prepared to handle it.
      
        Add proper sanity checks which prevent the unconditional injection
        into the guest and handle the #AC on the host side in the same way as
        user space detections are handled. Depending on the detection mode it
        either warns and disables detection for the task or kills the task if
        the mode is set to fatal"
      
      * tag 'x86-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest
        KVM: x86: Emulate split-lock access as a write in emulator
        x86/split_lock: Provide handle_guest_split_lock()
      4f8a3cc1
    • Merge tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0785249f
      Linus Torvalds authored
      Pull time(keeping) updates from Thomas Gleixner:
      
       - Fix the time_for_children symlink in /proc/$PID/ so it properly
         reflects that it is part of the 'time' namespace
      
       - Add the missing userns limit for the allowed number of time
         namespaces, which was half defined but the actual array member was
         not added. This went unnoticed as the array has an excessive empty
         member at the end but introduced a user visible regression as the
         output was corrupted.
      
       - Prevent further silent ucount corruption by adding a BUILD_BUG_ON()
         to catch half updated data.
      
      * tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        ucount: Make sure ucounts in /proc/sys/user don't regress again
        time/namespace: Add max_time_namespaces ucount
        time/namespace: Fix time_for_children symlink
      0785249f
    • Merge tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 590680d1
      Linus Torvalds authored
      Pull scheduler fixes/updates from Thomas Gleixner:
      
       - Deduplicate the average computations in the scheduler core and the
         fair class code.
      
       - Fix a race between runtime distribution and assignment which can
         cause exceeding the quota by up to 70%.
      
       - Prevent negative results in the imbalance calculation
      
       - Remove a stale warning in the workqueue code which can be triggered
         since the call site was moved out of preempt disabled code. It's a
         false positive.
      
       - Deduplicate the print macros for procfs
      
       - Add the uclamp values to the SCHED_DEBUG procfs output for completeness
      
      * tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/debug: Add task uclamp values to SCHED_DEBUG procfs
        sched/debug: Factor out printing formats into common macros
        sched/debug: Remove redundant macro define
        sched/core: Remove unused rq::last_load_update_tick
        workqueue: Remove the warning in wq_worker_sleeping()
        sched/fair: Fix negative imbalance in imbalance calculation
        sched/fair: Fix race between runtime distribution and assignment
        sched/fair: Align rq->avg_idle and rq->avg_scan_cost
      590680d1
    • Merge tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 20e2aa81
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "Three fixes/updates for perf:
      
         - Fix the perf event cgroup tracking which tries to track the cgroup
           even for disabled events.
      
         - Add Ice Lake server support for uncore events
      
         - Disable pagefaults when retrieving the physical address in the
           sampling code"
      
      * tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/core: Disable page faults when getting phys address
        perf/x86/intel/uncore: Add Ice Lake server uncore support
        perf/cgroup: Correct indirection in perf_less_group_idx()
        perf/core: Fix event cgroup tracking
      20e2aa81
    • Merge tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 652fa53c
      Linus Torvalds authored
      Pull locking fixes from Thomas Gleixner:
       "Three small fixes/updates for the locking core code:
      
         - Plug a task struct reference leak in the percpu rwsem
           implementation.
      
         - Document the refcount interaction with PID_MAX_LIMIT
      
         - Improve the 'invalid wait context' data dump in lockdep so it
           contains all information which is required to decode the problem"
      
      * tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/lockdep: Improve 'invalid wait context' splat
        locking/refcount: Document interaction with PID_MAX_LIMIT
        locking/percpu-rwsem: Fix a task_struct refcount
      652fa53c
    • Merge tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 · 4119bf9f
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Ten cifs/smb fixes:
      
         - five RDMA (smbdirect) related fixes
      
         - add experimental support for swap over SMB3 mounts
      
         - also a fix which improves performance of signed connections"
      
      * tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: enable swap on SMB3 mounts
        smb3: change noisy error message to FYI
        smb3: smbdirect support can be configured by default
        cifs: smbd: Do not schedule work to send immediate packet on every receive
        cifs: smbd: Properly process errors on ib_post_send
        cifs: Allocate crypto structures on the fly for calculating signatures of incoming packets
        cifs: smbd: Update receive credits before sending and deal with credits roll back on failure before sending
        cifs: smbd: Check send queue size before posting a send
        cifs: smbd: Merge code to track pending packets
        cifs: ignore cached share root handle closing errors
      4119bf9f
    • Merge tag 'nfs-for-5.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 50bda5fa
      Linus Torvalds authored
      Pull NFS client bugfix from Trond Myklebust:
       "Fix an RCU read lock leakage in pnfs_alloc_ds_commits_list()"
      
      * tag 'nfs-for-5.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        pNFS: Fix RCU lock leakage
      50bda5fa
  5. 11 Apr, 2020 14 commits
  6. 10 Apr, 2020 12 commits