    bpf: Wait for busy refill_work when destroying bpf memory allocator · 3d058187
    Hou Tao authored
    A busy irq work is an unfinished irq work: it is either in the pending
    state or in the running state. When destroying the bpf memory
    allocator, refill_work may be busy on a PREEMPT_RT kernel, in which irq
    work is invoked in a per-CPU RT-kthread. It may also be busy on a kernel
    where arch_irq_work_has_interrupt() is false (e.g. a 1-cpu arm32 host or
    mips), in which irq work is invoked in the timer interrupt.
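
    The pending and running states can be modeled with two flag bits. The
    sketch below is a userspace illustration only: the iw_* names are
    hypothetical, and the bit layout is an assumption loosely modeled on how
    the kernel tracks irq work state, not a copy of kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the layout is an assumption for illustration. */
enum {
	IW_PENDING = 1 << 0,	/* queued, callback not started yet */
	IW_BUSY    = 1 << 1,	/* set from queueing until the callback returns */
	IW_CLAIMED = IW_PENDING | IW_BUSY,
};

/* Queueing marks the work as both pending and busy. */
static unsigned int iw_queue(unsigned int flags)
{
	return flags | (unsigned int)IW_CLAIMED;
}

/* Starting the callback clears pending; the work is now running. */
static unsigned int iw_start(unsigned int flags)
{
	return flags & ~(unsigned int)IW_PENDING;
}

/* Finishing the callback clears busy; the work may now be freed. */
static unsigned int iw_finish(unsigned int flags)
{
	return flags & ~(unsigned int)IW_BUSY;
}

static bool iw_is_pending(unsigned int flags)
{
	return (flags & IW_PENDING) != 0;
}

/* "Busy" covers both the pending state and the running state. */
static bool iw_is_busy(unsigned int flags)
{
	return (flags & IW_BUSY) != 0;
}
```

    In this model a work item is busy from iw_queue() until iw_finish(),
    regardless of whether its callback has started.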
    
    A busy refill_work leads to several issues. The obvious one is that
    irq work and memory draining will operate concurrently on free_by_rcu
    and free_list. Another is that call_rcu_in_progress is not reliable
    for checking for a pending RCU callback, because do_call_rcu() may not
    have been invoked by irq work yet. The last is a use-after-free if the
    irq work is freed before its callback is invoked, as shown below:
    
     BUG: kernel NULL pointer dereference, address: 0000000000000000
     #PF: supervisor instruction fetch in kernel mode
     #PF: error_code(0x0010) - not-present page
     PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
     Oops: 0010 [#1] PREEMPT_RT SMP
     CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
     RIP: 0010:0x0
     Code: Unable to access opcode bytes at 0xffffffffffffffd6.
     RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
     RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
     RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
     ......
     Call Trace:
      <TASK>
      irq_work_single+0x24/0x60
      irq_work_run_list+0x24/0x30
      run_irq_workd+0x23/0x30
      smpboot_thread_fn+0x203/0x300
      kthread+0x126/0x150
      ret_from_fork+0x1f/0x30
      </TASK>
    
    irq_work_sync() makes the concurrency easy to handle, adds no overhead
    on non-PREEMPT_RT and has-irq-work-interrupt kernels, and waits only
    briefly under PREEMPT_RT (when running two test_maps on a PREEMPT_RT
    kernel on a 72-cpu host, the max wait time is about 8ms and the 99th
    percentile is 10us). So just use irq_work_sync() to wait for busy
    refill_work to complete before memory draining and memory freeing.
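
    The ordering (sync first, then drain, then free) can be sketched in
    userspace. This is an analogy under stated assumptions, not the kernel
    change itself: a pthread stands in for the per-CPU irq work kthread, and
    every demo_* name is hypothetical. Only the idea of waiting until the
    work is no longer busy is taken from irq_work_sync().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct irq_work. */
struct demo_work {
	atomic_int busy;	/* 1 from queueing until func has returned */
	void (*func)(struct demo_work *work);
};

/* Hypothetical stand-in for a per-CPU memory cache. */
struct demo_cache {
	struct demo_work refill_work;	/* must stay the first member */
	int free_cnt;			/* number of objects on the free list */
};

/* Callback: refill the free list. */
static void demo_refill(struct demo_work *work)
{
	/* refill_work is the first member, so this cast recovers the cache. */
	struct demo_cache *c = (struct demo_cache *)work;

	c->free_cnt += 4;
}

/* Worker thread: plays the role of the kthread that runs irq work late. */
static void *demo_kthread(void *arg)
{
	struct demo_work *work = arg;
	int i;

	for (i = 0; i < 1000; i++)	/* the work stays pending for a while */
		sched_yield();
	work->func(work);		/* running state */
	atomic_store(&work->busy, 0);	/* finished; work is not touched again */
	return NULL;
}

/* Analogue of irq_work_sync(): wait until the work is no longer busy. */
static void demo_work_sync(struct demo_work *work)
{
	while (atomic_load(&work->busy))
		sched_yield();
}

/* Queue a refill, then destroy the cache; returns the objects drained. */
static int demo_destroy_after_sync(void)
{
	struct demo_cache *c = calloc(1, sizeof(*c));
	pthread_t tid;
	int drained;

	assert(c);
	c->refill_work.func = demo_refill;
	atomic_store(&c->refill_work.busy, 1);	/* queued: work is now busy */
	if (pthread_create(&tid, NULL, demo_kthread, &c->refill_work)) {
		free(c);
		return -1;
	}

	demo_work_sync(&c->refill_work);	/* wait for busy refill_work */
	drained = c->free_cnt;			/* memory draining */
	c->free_cnt = 0;
	free(c);				/* memory freeing */

	pthread_join(tid, NULL);
	return drained;
}
```

    Without the demo_work_sync() call, the worker could run demo_refill() on
    freed memory, and the drained count would depend on timing.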
    
    Fixes: 7c8199e2 ("bpf: Introduce any context BPF specific memory allocator.")
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
memalloc.c 17 KB