• Yonghong Song's avatar
    bpf: Permit cond_resched for some iterators · cf83b2d2
    Yonghong Song authored
    Commit e679654a ("bpf: Fix a rcu_sched stall issue with
    bpf task/task_file iterator") tries to fix rcu stalls warning
    which is caused by bpf task_file iterator when running
    "bpftool prog".
    
          rcu: INFO: rcu_sched self-detected stall on CPU
          rcu: \x097-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
          \x09(t=21031 jiffies g=2534773 q=179750)
          NMI backtrace for cpu 7
          CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
          Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
          Call Trace:
          <IRQ>
          dump_stack+0x57/0x70
          nmi_cpu_backtrace.cold+0x14/0x53
          ? lapic_can_unplug_cpu.cold+0x39/0x39
          nmi_trigger_cpumask_backtrace+0xb7/0xc7
          rcu_dump_cpu_stacks+0xa2/0xd0
          rcu_sched_clock_irq.cold+0x1ff/0x3d9
          ? tick_nohz_handler+0x100/0x100
          update_process_times+0x5b/0x90
          tick_sched_timer+0x5e/0xf0
          __hrtimer_run_queues+0x12a/0x2a0
          hrtimer_interrupt+0x10e/0x280
          __sysvec_apic_timer_interrupt+0x51/0xe0
          asm_call_on_stack+0xf/0x20
          </IRQ>
          sysvec_apic_timer_interrupt+0x6f/0x80
          ...
          task_file_seq_next+0x52/0xa0
          bpf_seq_read+0xb9/0x320
          vfs_read+0x9d/0x180
          ksys_read+0x5f/0xe0
          do_syscall_64+0x38/0x60
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    The fix is to limit the number of bpf program runs to be
    one million. This fixed the program in most cases. But
    we also found under heavy load, which can increase the wallclock
    time for bpf_seq_read(), the warning may still be possible.
    
    For example, calling bpf_delay() in the "while" loop of
    bpf_seq_read(), which will introduce artificial delay,
    the warning will show up in my qemu run.
    
      static unsigned q;
      volatile unsigned *p = &q;
      volatile unsigned long long ll;
      static void bpf_delay(void)
      {
             int i, j;
    
             for (i = 0; i < 10000; i++)
                     for (j = 0; j < 10000; j++)
                             ll += *p;
      }
    
    There are two ways to fix this issue. One is to reduce the above
    one million threshold to say 100,000 and hopefully rcu warning will
    not show up any more. Another is to introduce a target feature
    which enables bpf_seq_read() calling cond_resched().
    
    This patch took second approach as the first approach may cause
    more -EAGAIN failures for read() syscalls. Note that not all bpf_iter
    targets can permit cond_resched() in bpf_seq_read() as some, e.g.,
    netlink seq iterator, rcu read lock critical section spans through
    seq_ops->next() -> seq_ops->show() -> seq_ops->next().
    
    For the kernel code with the above hack, "bpftool p" roughly takes
    38 seconds to finish on my VM with 184 bpf program runs.
    Using the following command, I am able to collect the number of
    context switches:
       perf stat -e context-switches -- ./bpftool p >& log
    Without this patch,
       69      context-switches
    With this patch,
       75      context-switches
    This patch added additional 6 context switches, roughly every 6 seconds
    to reschedule, to avoid lengthy no-rescheduling which may cause the
    above RCU warnings.
    Signed-off-by: default avatarYonghong Song <yhs@fb.com>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20201028061054.1411116-1-yhs@fb.com
    cf83b2d2
task_iter.c 8.51 KB