    bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks · ff40e510
    Daniel Borkmann authored
    Commit 59438b46 ("security,lockdown,selinux: implement SELinux lockdown")
    added an implementation of the locked_down LSM hook to SELinux, with the aim
    to restrict which domains are allowed to perform operations that would breach
    lockdown. This indirectly also gets the audit subsystem involved to report
    events. The latter is problematic, as reported by Ondrej and Serhei, since it
    can bring down the whole system via audit:
    
      1) The audit events that are triggered due to calls to security_locked_down()
         can OOM-kill a machine, see details below [0].
    
      2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
         when trying to wake up kauditd, for example, when using trace_sched_switch()
         tracepoint, see details in [1]. Triggering this was not via some hypothetical
         corner case, but with existing tools like runqlat & runqslower from bcc, for
         example, which make use of this tracepoint. Rough call sequence goes like:
    
         rq_lock(rq) -> -------------------------+
           trace_sched_switch() ->               |
             bpf_prog_xyz() ->                   +-> deadlock
               selinux_lockdown() ->             |
                 audit_log_end() ->              |
                   wake_up_interruptible() ->    |
                     try_to_wake_up() ->         |
                       rq_lock(rq) --------------+
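
    The call sequence above can be sketched as a small user-space toy. This is
    not kernel code: toy_lock, toy_wake_up(), toy_bpf_helper() and
    toy_sched_switch() are hypothetical names, and the toy lock reports a
    second acquisition with -1 where a real non-recursive spinlock would
    simply spin forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive lock such as the runqueue lock.
 * All names are hypothetical; this only models the call sequence. */
struct toy_lock {
	bool held;
};

/* Returns 0 on success, -1 if the lock is already held.
 * A real spinlock would spin here instead of returning. */
static int toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return -1;
	l->held = true;
	return 0;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Models try_to_wake_up(): waking a task needs the runqueue lock. */
static int toy_wake_up(struct toy_lock *rq)
{
	if (toy_lock_acquire(rq) < 0)
		return -1;
	toy_lock_release(rq);
	return 0;
}

/* Models the helper fast path. When audit_in_helper is true, the helper
 * emits an audit record, which in turn wakes up the audit daemon. */
static int toy_bpf_helper(struct toy_lock *rq, bool audit_in_helper)
{
	return audit_in_helper ? toy_wake_up(rq) : 0;
}

/* Models a context switch: take the runqueue lock, run the program
 * attached to the tracepoint, release the lock.
 * Returns 0 if the sequence completes, -1 if it would deadlock. */
static int toy_sched_switch(bool audit_in_helper)
{
	struct toy_lock rq = { .held = false };
	int ret;

	if (toy_lock_acquire(&rq) < 0)
		return -1;
	ret = toy_bpf_helper(&rq, audit_in_helper);
	toy_lock_release(&rq);
	return ret;
}
```

    In this model, toy_sched_switch(true) hits the second acquisition of the
    already-held lock, while toy_sched_switch(false) completes normally.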
    
    What's worse is that the intention of 59438b46 to further restrict lockdown
    settings for specific applications with respect to the global lockdown policy is
    completely broken for BPF. The SELinux policy rule for the current lockdown check
    looks something like this:
    
      allow <who> <who> : lockdown { <reason> };
    
    However, this doesn't match the 'current' task in whose context security_locked_down()
    is executed. For example: httpd does a syscall. There is a tracing program attached
    to the syscall which triggers a BPF program to run, which ends up doing a
    bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
    the permission check against 'current', that is, httpd in this example. httpd
    has literally zero relation to this tracing program, and it would be nonsensical
    having to write an SELinux policy rule against httpd to let the tracing helper
    pass. The policy in this case needs to be against the entity that is installing
    the BPF program. For example, if bpftrace were to generate a histogram of syscall
    counts by user space application:
    
      bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    
    bpftrace would then go and generate a BPF program from this internally. One way
    of doing it [for the sake of the example] could be to call bpf_get_current_task()
    helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
    helpers. So the program itself has nothing to do with httpd or any other random
    app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
    check. The allow/deny policy belongs in the context of bpftrace: meaning, you
    want to grant bpftrace access to use these helpers, but other tracers on the
    system like my_random_tracer _not_.
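
    The mismatch between the two possible policy subjects can be illustrated
    with a minimal user-space sketch. The policy function and the toy_prog
    structure are hypothetical and only assume a policy that grants the
    lockdown permission to bpftrace and to nobody else; this is not how
    SELinux evaluates rules internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical policy: only the bpftrace domain is granted the
 * lockdown permission. */
static bool toy_policy_allows(const char *domain)
{
	return strcmp(domain, "bpftrace") == 0;
}

struct toy_prog {
	const char *loader;	/* task that installed the program */
};

/* Check against whichever task happens to be running when the helper
 * fires. The loader of the program plays no role in the decision. */
static bool toy_check_at_run_time(const struct toy_prog *prog,
				  const char *current_task)
{
	(void)prog;
	return toy_policy_allows(current_task);
}

/* Check against the task that is installing the program. */
static bool toy_check_at_load_time(const struct toy_prog *prog)
{
	return toy_policy_allows(prog->loader);
}
```

    With a program loaded by bpftrace, the run-time variant denies the helper
    whenever httpd is 'current', although the policy grants bpftrace access.
    Conversely, a program loaded by my_random_tracer would pass the run-time
    variant whenever it happens to fire in the context of bpftrace.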
    
    Therefore fix all three issues at the same time by taking a completely different
    approach for the security_locked_down() hook, that is, move the check into the
    program verification phase where we actually retrieve the BPF func proto. This
    also reliably gets the task (current) that is trying to install the BPF tracing
    program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
    we're moving this out of the BPF helper's fast-path which can be called several
    millions of times per second.
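
    As a rough illustration of the effect on the fast path, the following
    user-space sketch counts how often a lockdown hook is consulted when the
    check sits in the func proto lookup rather than in the helper. All toy_*
    names are hypothetical, and the sketch assumes that a NULL result from the
    lookup makes the program load fail; it does not reproduce the kernel code.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long toy_lockdown_checks;	/* number of hook invocations */
static int toy_locked_down_state;		/* nonzero: lockdown denies */

/* Toy stand-in for the lockdown hook: counts calls, returns < 0 on deny. */
static int toy_security_locked_down(void)
{
	toy_lockdown_checks++;
	return toy_locked_down_state ? -1 : 0;
}

typedef long (*toy_helper_fn)(void);

/* Toy helper: the fast path no longer contains a lockdown check. */
static long toy_probe_read_kernel(void)
{
	return 0;
}

/* Models the func proto lookup at verification time. Returning NULL
 * is assumed to reject the program. */
static toy_helper_fn toy_get_func_proto(void)
{
	return toy_security_locked_down() < 0 ? NULL : toy_probe_read_kernel;
}

/* Loads a program once and invokes its helper n times.
 * Returns -1 if the load was rejected, otherwise the number of calls. */
static long toy_load_and_run(unsigned long n)
{
	toy_helper_fn fn = toy_get_func_proto();
	unsigned long i;

	if (!fn)
		return -1;
	for (i = 0; i < n; i++)
		fn();
	return (long)n;
}
```

    In this model a program that calls the helper a million times triggers a
    single hook invocation at load time, and a denied program never reaches
    the helper at all.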
    
    The check is then also in line with other security_locked_down() hooks in the
    system where the enforcement is performed at open/load time, for example,
    open_kcore() for /proc/kcore access or module_sig_check() for module signatures
    just to pick a few random ones. What's out of scope in the fix as well as in
    other security_locked_down() hook locations /outside/ of BPF subsystem is that
    if the lockdown policy changes on the fly there is no retrospective action.
    This requires a different discussion, potentially complex infrastructure, and
    it's also not clear whether this can be solved generically. Either way, it is
    out of scope for a suitable stable fix which this one is targeting. Note that
    the breakage is specifically on 59438b46 where it started to rely on 'current'
    as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5 ("bpf:
    Restrict bpf when kernel lockdown is in confidentiality mode").
    
    [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:
    
      I starting seeing this with F-34. When I run a container that is traced with
      BPF to record the syscalls it is doing, auditd is flooded with messages like:
    
      type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
        for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
          scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
            tclass=lockdown permissive=0
    
      This seems to be leading to auditd running out of space in the backlog buffer
      and eventually OOMs the machine.
    
      [...]
      auditd running at 99% CPU presumably processing all the messages, eventually I get:
      Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
      Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
      Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
      Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
      Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
      Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
      Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
      Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
      Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
      [...]
    
    [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
        Serhei Makarov says:
    
      Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
      bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
      is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
      testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
      ppc64le. Example stack trace:
    
      [...]
      [  730.868702] stack backtrace:
      [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef.166.fc35.x86_64 #1
      [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      [  730.873278] Call Trace:
      [  730.873770]  dump_stack+0x7f/0xa1
      [  730.874433]  check_noncircular+0xdf/0x100
      [  730.875232]  __lock_acquire+0x1202/0x1e10
      [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
      [  730.876844]  lock_acquire+0xc2/0x3a0
      [  730.877551]  ? __wake_up_common_lock+0x52/0x90
      [  730.878434]  ? lock_acquire+0xc2/0x3a0
      [  730.879186]  ? lock_is_held_type+0xa7/0x120
      [  730.880044]  ? skb_queue_tail+0x1b/0x50
      [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
      [  730.881656]  ? __wake_up_common_lock+0x52/0x90
      [  730.882532]  __wake_up_common_lock+0x52/0x90
      [  730.883375]  audit_log_end+0x5b/0x100
      [  730.884104]  slow_avc_audit+0x69/0x90
      [  730.884836]  avc_has_perm+0x8b/0xb0
      [  730.885532]  selinux_lockdown+0xa5/0xd0
      [  730.886297]  security_locked_down+0x20/0x40
      [  730.887133]  bpf_probe_read_compat+0x66/0xd0
      [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
      [  730.888917]  trace_call_bpf+0xe9/0x240
      [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
      [  730.890579]  perf_trace_sched_switch+0x142/0x180
      [  730.891485]  ? __schedule+0x6d8/0xb20
      [  730.892209]  __schedule+0x6d8/0xb20
      [  730.892899]  schedule+0x5b/0xc0
      [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
      [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
      [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [...]
    
    Fixes: 59438b46 ("security,lockdown,selinux: implement SELinux lockdown")
    Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
    Reported-by: Jakub Hrozek <jhrozek@redhat.com>
    Reported-by: Serhei Makarov <smakarov@redhat.com>
    Reported-by: Jiri Olsa <jolsa@redhat.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Tested-by: Jiri Olsa <jolsa@redhat.com>
    Cc: Paul Moore <paul@paul-moore.com>
    Cc: James Morris <jamorris@linux.microsoft.com>
    Cc: Jerome Marchand <jmarchan@redhat.com>
    Cc: Frank Eigler <fche@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net