1. 03 Feb, 2019 2 commits
  2. 01 Feb, 2019 19 commits
  3. 31 Jan, 2019 19 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-lockdep-fixes' · f01c2803
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      v1->v2:
      - reworded 2nd patch. It's a real dead lock. Not a false positive
      - dropped the lockdep fix for up_read_non_owner in bpf_get_stackid
      
      In addition to preempt_disable patch for socket filters
      https://patchwork.ozlabs.org/patch/1032437/
      First patch fixes lockdep false positive in percpu_freelist
      Second patch fixes potential deadlock in bpf_prog_register
      Third patch fixes another potential deadlock in stackmap access
      from tracing bpf prog and from syscall.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f01c2803
    • Martin KaFai Lau's avatar
      bpf: Fix syscall's stackmap lookup potential deadlock · 7c4cd051
      Martin KaFai Lau authored
      The map_lookup_elem used to not acquiring spinlock
      in order to optimize the reader.
      
      It was true until commit 557c0c6e ("bpf: convert stackmap to pre-allocation")
      The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
      bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
      If that is the case, pcpu_freelist_push() saves this elem for reuse later.
      This push requires a spinlock.
      
      If a tracing bpf_prog got run in the middle of the syscall's
      map_lookup_elem(stackmap) and this tracing bpf_prog is calling
      bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
      spinlock, it may end up with a dead lock situation as reported by
      Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
      
      The situation is the same as the syscall's map_update_elem() which
      needs to acquire the pcpu_freelist's spinlock and could race
      with tracing bpf_prog.  Hence, this patch fixes it by protecting
      bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
      to prevent tracing bpf_prog from running.
      
      A later syscall's map_lookup_elem commit f1a2e44a ("bpf: add queue and stack maps")
      also acquires a spinlock and races with tracing bpf_prog similarly.
      Hence, this patch is forward looking and protects the majority
      of the map lookups.  bpf_map_offload_lookup_elem() is the exception
      since it is for network bpf_prog only (i.e. never called by tracing
      bpf_prog).
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      7c4cd051
    • Alexei Starovoitov's avatar
      bpf: fix potential deadlock in bpf_prog_register · e16ec340
      Alexei Starovoitov authored
      Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
      [   13.007000] WARNING: possible circular locking dependency detected
      [   13.007587] 5.0.0-rc3-00018-g2fa53f89-dirty #477 Not tainted
      [   13.008124] ------------------------------------------------------
      [   13.008624] test_progs/246 is trying to acquire lock:
      [   13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
      [   13.009770]
      [   13.009770] but task is already holding lock:
      [   13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.010877]
      [   13.010877] which lock already depends on the new lock.
      [   13.010877]
      [   13.011532]
      [   13.011532] the existing dependency chain (in reverse order) is:
      [   13.012129]
      [   13.012129] -> #4 (bpf_event_mutex){+.+.}:
      [   13.012582]        perf_event_query_prog_array+0x9b/0x130
      [   13.013016]        _perf_ioctl+0x3aa/0x830
      [   13.013354]        perf_ioctl+0x2e/0x50
      [   13.013668]        do_vfs_ioctl+0x8f/0x6a0
      [   13.014003]        ksys_ioctl+0x70/0x80
      [   13.014320]        __x64_sys_ioctl+0x16/0x20
      [   13.014668]        do_syscall_64+0x4a/0x180
      [   13.015007]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.015469]
      [   13.015469] -> #3 (&cpuctx_mutex){+.+.}:
      [   13.015910]        perf_event_init_cpu+0x5a/0x90
      [   13.016291]        perf_event_init+0x1b2/0x1de
      [   13.016654]        start_kernel+0x2b8/0x42a
      [   13.016995]        secondary_startup_64+0xa4/0xb0
      [   13.017382]
      [   13.017382] -> #2 (pmus_lock){+.+.}:
      [   13.017794]        perf_event_init_cpu+0x21/0x90
      [   13.018172]        cpuhp_invoke_callback+0xb3/0x960
      [   13.018573]        _cpu_up+0xa7/0x140
      [   13.018871]        do_cpu_up+0xa4/0xc0
      [   13.019178]        smp_init+0xcd/0xd2
      [   13.019483]        kernel_init_freeable+0x123/0x24f
      [   13.019878]        kernel_init+0xa/0x110
      [   13.020201]        ret_from_fork+0x24/0x30
      [   13.020541]
      [   13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
      [   13.021051]        static_key_slow_inc+0xe/0x20
      [   13.021424]        tracepoint_probe_register_prio+0x28c/0x300
      [   13.021891]        perf_trace_event_init+0x11f/0x250
      [   13.022297]        perf_trace_init+0x6b/0xa0
      [   13.022644]        perf_tp_event_init+0x25/0x40
      [   13.023011]        perf_try_init_event+0x6b/0x90
      [   13.023386]        perf_event_alloc+0x9a8/0xc40
      [   13.023754]        __do_sys_perf_event_open+0x1dd/0xd30
      [   13.024173]        do_syscall_64+0x4a/0x180
      [   13.024519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.024968]
      [   13.024968] -> #0 (tracepoints_mutex){+.+.}:
      [   13.025434]        __mutex_lock+0x86/0x970
      [   13.025764]        tracepoint_probe_register_prio+0x2d/0x300
      [   13.026215]        bpf_probe_register+0x40/0x60
      [   13.026584]        bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.027042]        __do_sys_bpf+0x94f/0x1a90
      [   13.027389]        do_syscall_64+0x4a/0x180
      [   13.027727]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.028171]
      [   13.028171] other info that might help us debug this:
      [   13.028171]
      [   13.028807] Chain exists of:
      [   13.028807]   tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
      [   13.028807]
      [   13.029666]  Possible unsafe locking scenario:
      [   13.029666]
      [   13.030140]        CPU0                    CPU1
      [   13.030510]        ----                    ----
      [   13.030875]   lock(bpf_event_mutex);
      [   13.031166]                                lock(&cpuctx_mutex);
      [   13.031645]                                lock(bpf_event_mutex);
      [   13.032135]   lock(tracepoints_mutex);
      [   13.032441]
      [   13.032441]  *** DEADLOCK ***
      [   13.032441]
      [   13.032911] 1 lock held by test_progs/246:
      [   13.033239]  #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.033909]
      [   13.033909] stack backtrace:
      [   13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f89-dirty #477
      [   13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   13.035657] Call Trace:
      [   13.035859]  dump_stack+0x5f/0x8b
      [   13.036130]  print_circular_bug.isra.37+0x1ce/0x1db
      [   13.036526]  __lock_acquire+0x1158/0x1350
      [   13.036852]  ? lock_acquire+0x98/0x190
      [   13.037154]  lock_acquire+0x98/0x190
      [   13.037447]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.037876]  __mutex_lock+0x86/0x970
      [   13.038167]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.038600]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.039028]  ? __mutex_lock+0x86/0x970
      [   13.039337]  ? __mutex_lock+0x24a/0x970
      [   13.039649]  ? bpf_probe_register+0x1d/0x60
      [   13.039992]  ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
      [   13.040478]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.040906]  tracepoint_probe_register_prio+0x2d/0x300
      [   13.041325]  bpf_probe_register+0x40/0x60
      [   13.041649]  bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.042068]  ? __might_fault+0x3e/0x90
      [   13.042374]  __do_sys_bpf+0x94f/0x1a90
      [   13.042678]  do_syscall_64+0x4a/0x180
      [   13.042975]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.043382] RIP: 0033:0x7f23b10a07f9
      [   13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
      [   13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
      [   13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
      [   13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      [   13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000
      
      Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
      there is no need to take bpf_event_mutex too.
      bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs.
      bpf_raw_tracepoints don't need to take this mutex.
      
      Fixes: c4f6699d ("bpf: introduce BPF_RAW_TRACEPOINT")
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      e16ec340
    • Alexei Starovoitov's avatar
      bpf: fix lockdep false positive in percpu_freelist · a89fac57
      Alexei Starovoitov authored
      Lockdep warns about false positive:
      [   12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
      [   12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
      [   12.493275]  (&rq->lock){-.-.}
      [   12.493276]
      [   12.493276]
      [   12.493276] and interrupts could create inverse lock ordering between them.
      [   12.493276]
      [   12.494435]
      [   12.494435] other info that might help us debug this:
      [   12.494979]  Possible interrupt unsafe locking scenario:
      [   12.494979]
      [   12.495518]        CPU0                    CPU1
      [   12.495879]        ----                    ----
      [   12.496243]   lock(&head->lock);
      [   12.496502]                                local_irq_disable();
      [   12.496969]                                lock(&rq->lock);
      [   12.497431]                                lock(&head->lock);
      [   12.497890]   <Interrupt>
      [   12.498104]     lock(&rq->lock);
      [   12.498368]
      [   12.498368]  *** DEADLOCK ***
      [   12.498368]
      [   12.498837] 1 lock held by dd/276:
      [   12.499110]  #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
      [   12.499747]
      [   12.499747] the shortest dependencies between 2nd lock and 1st lock:
      [   12.500389]  -> (&rq->lock){-.-.} {
      [   12.500669]     IN-HARDIRQ-W at:
      [   12.500934]                       _raw_spin_lock+0x2f/0x40
      [   12.501373]                       scheduler_tick+0x4c/0xf0
      [   12.501812]                       update_process_times+0x40/0x50
      [   12.502294]                       tick_periodic+0x27/0xb0
      [   12.502723]                       tick_handle_periodic+0x1f/0x60
      [   12.503203]                       timer_interrupt+0x11/0x20
      [   12.503651]                       __handle_irq_event_percpu+0x43/0x2c0
      [   12.504167]                       handle_irq_event_percpu+0x20/0x50
      [   12.504674]                       handle_irq_event+0x37/0x60
      [   12.505139]                       handle_level_irq+0xa7/0x120
      [   12.505601]                       handle_irq+0xa1/0x150
      [   12.506018]                       do_IRQ+0x77/0x140
      [   12.506411]                       ret_from_intr+0x0/0x1d
      [   12.506834]                       _raw_spin_unlock_irqrestore+0x53/0x60
      [   12.507362]                       __setup_irq+0x481/0x730
      [   12.507789]                       setup_irq+0x49/0x80
      [   12.508195]                       hpet_time_init+0x21/0x32
      [   12.508644]                       x86_late_time_init+0xb/0x16
      [   12.509106]                       start_kernel+0x390/0x42a
      [   12.509554]                       secondary_startup_64+0xa4/0xb0
      [   12.510034]     IN-SOFTIRQ-W at:
      [   12.510305]                       _raw_spin_lock+0x2f/0x40
      [   12.510772]                       try_to_wake_up+0x1c7/0x4e0
      [   12.511220]                       swake_up_locked+0x20/0x40
      [   12.511657]                       swake_up_one+0x1a/0x30
      [   12.512070]                       rcu_process_callbacks+0xc5/0x650
      [   12.512553]                       __do_softirq+0xe6/0x47b
      [   12.512978]                       irq_exit+0xc3/0xd0
      [   12.513372]                       smp_apic_timer_interrupt+0xa9/0x250
      [   12.513876]                       apic_timer_interrupt+0xf/0x20
      [   12.514343]                       default_idle+0x1c/0x170
      [   12.514765]                       do_idle+0x199/0x240
      [   12.515159]                       cpu_startup_entry+0x19/0x20
      [   12.515614]                       start_kernel+0x422/0x42a
      [   12.516045]                       secondary_startup_64+0xa4/0xb0
      [   12.516521]     INITIAL USE at:
      [   12.516774]                      _raw_spin_lock_irqsave+0x38/0x50
      [   12.517258]                      rq_attach_root+0x16/0xd0
      [   12.517685]                      sched_init+0x2f2/0x3eb
      [   12.518096]                      start_kernel+0x1fb/0x42a
      [   12.518525]                      secondary_startup_64+0xa4/0xb0
      [   12.518986]   }
      [   12.519132]   ... key      at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
      [   12.519649]   ... acquired at:
      [   12.519892]    pcpu_freelist_pop+0x7b/0xd0
      [   12.520221]    bpf_get_stackid+0x1d2/0x4d0
      [   12.520563]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.520887]
      [   12.521008] -> (&head->lock){+...} {
      [   12.521292]    HARDIRQ-ON-W at:
      [   12.521539]                     _raw_spin_lock+0x2f/0x40
      [   12.521950]                     pcpu_freelist_push+0x2a/0x40
      [   12.522396]                     bpf_get_stackid+0x494/0x4d0
      [   12.522828]                     ___bpf_prog_run+0x8b4/0x11a0
      [   12.523296]    INITIAL USE at:
      [   12.523537]                    _raw_spin_lock+0x2f/0x40
      [   12.523944]                    pcpu_freelist_populate+0xc0/0x120
      [   12.524417]                    htab_map_alloc+0x405/0x500
      [   12.524835]                    __do_sys_bpf+0x1a3/0x1a90
      [   12.525253]                    do_syscall_64+0x4a/0x180
      [   12.525659]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   12.526167]  }
      [   12.526311]  ... key      at: [<ffffffff838f7668>] __key.13130+0x0/0x8
      [   12.526812]  ... acquired at:
      [   12.527047]    __lock_acquire+0x521/0x1350
      [   12.527371]    lock_acquire+0x98/0x190
      [   12.527680]    _raw_spin_lock+0x2f/0x40
      [   12.527994]    pcpu_freelist_push+0x2a/0x40
      [   12.528325]    bpf_get_stackid+0x494/0x4d0
      [   12.528645]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.528970]
      [   12.529092]
      [   12.529092] stack backtrace:
      [   12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f89 #475
      [   12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   12.530750] Call Trace:
      [   12.530948]  dump_stack+0x5f/0x8b
      [   12.531248]  check_usage_backwards+0x10c/0x120
      [   12.531598]  ? ___bpf_prog_run+0x8b4/0x11a0
      [   12.531935]  ? mark_lock+0x382/0x560
      [   12.532229]  mark_lock+0x382/0x560
      [   12.532496]  ? print_shortest_lock_dependencies+0x180/0x180
      [   12.532928]  __lock_acquire+0x521/0x1350
      [   12.533271]  ? find_get_entry+0x17f/0x2e0
      [   12.533586]  ? find_get_entry+0x19c/0x2e0
      [   12.533902]  ? lock_acquire+0x98/0x190
      [   12.534196]  lock_acquire+0x98/0x190
      [   12.534482]  ? pcpu_freelist_push+0x2a/0x40
      [   12.534810]  _raw_spin_lock+0x2f/0x40
      [   12.535099]  ? pcpu_freelist_push+0x2a/0x40
      [   12.535432]  pcpu_freelist_push+0x2a/0x40
      [   12.535750]  bpf_get_stackid+0x494/0x4d0
      [   12.536062]  ___bpf_prog_run+0x8b4/0x11a0
      
      It has been explained that is a false positive here:
      https://lkml.org/lkml/2018/7/25/756
      Recap:
      - stackmap uses pcpu_freelist
      - The lock in pcpu_freelist is a percpu lock
      - stackmap is only used by tracing bpf_prog
      - A tracing bpf_prog cannot be run if another bpf_prog
        has already been running (ensured by the percpu bpf_prog_active counter).
      
      Eric pointed out that this lockdep splats stops other
      legit lockdep splats in selftests/bpf/test_progs.c.
      
      Fix this by calling local_irq_save/restore for stackmap.
      
      Another false positive had also been worked around by calling
      local_irq_save in commit 89ad2fa3 ("bpf: fix lockdep splat").
      That commit added unnecessary irq_save/restore to fast path of
      bpf hash map. irqs are already disabled at that point, since htab
      is holding per bucket spin_lock with irqsave.
      
      Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop
      function w/o irqsave and convert pcpu_freelist_push/pop to irqsave
      to be used elsewhere (right now only in stackmap).
      It stops lockdep false positive in stackmap with a bit of acceptable overhead.
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a89fac57
    • Alexei Starovoitov's avatar
      bpf: run bpf programs with preemption disabled · 6cab5e90
      Alexei Starovoitov authored
      Disabled preemption is necessary for proper access to per-cpu maps
      from BPF programs.
      
      But the sender side of socket filters didn't have preemption disabled:
      unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN
      
      and a combination of af_packet with tun device didn't disable either:
      tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue->
        tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN
      
      Disable preemption before executing BPF programs (both classic and extended).
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      6cab5e90
    • Martynas Pumputis's avatar
      bpf, selftests: fix handling of sparse CPU allocations · 1bb54c40
      Martynas Pumputis authored
      Previously, bpf_num_possible_cpus() had a bug when calculating a
      number of possible CPUs in the case of sparse CPU allocations, as
      it was considering only the first range or element of
      /sys/devices/system/cpu/possible.
      
      E.g. in the case of "0,2-3" (CPU 1 is not available), the function
      returned 1 instead of 3.
      
      This patch fixes the function by making it parse all CPU ranges and
      elements.
      Signed-off-by: default avatarMartynas Pumputis <m@lambda.lt>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1bb54c40
    • Michael Chan's avatar
      bnxt_en: Disable interrupts when allocating CP rings or NQs. · 5e66e35a
      Michael Chan authored
      When calling firmware to allocate a CP ring or NQ, an interrupt associated
      with that ring may be generated immediately before the doorbell is even
      setup after the firmware call returns.  When servicing the interrupt, the
      driver may crash when trying to access the doorbell.
      
      Fix it by disabling interrupt on that vector until the doorbell is
      set up.
      
      Fixes: 697197e5 ("bnxt_en: Re-structure doorbells.")
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e66e35a
    • David S. Miller's avatar
      Merge branch 'ieee802154-for-davem-2019-01-31' of... · da0e5171
      David S. Miller authored
      Merge branch 'ieee802154-for-davem-2019-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
      
      Stefan Schmidt says:
      
      ====================
      pull-request: ieee802154 for net 2019-01-31
      
      An update from ieee802154 for your *net* tree.
      
      I waited a while to see if anything else comes up, but it seems this time
      we only have one fixup patch for the -rc rounds.
      Colin fixed some indentation in the mcr20a drivers. That's about it.
      
      If there are any problems with taking these two before the final 5.0 let
      me know.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      da0e5171
    • Eric Dumazet's avatar
      rds: fix refcount bug in rds_sock_addref · 6fa19f56
      Eric Dumazet authored
      syzbot was able to catch a bug in rds [1]
      
      The issue here is that the socket might be found in a hash table
      but that its refcount has already be set to 0 by another cpu.
      
      We need to use refcount_inc_not_zero() to be safe here.
      
      [1]
      
      refcount_t: increment on 0; use-after-free.
      WARNING: CPU: 1 PID: 23129 at lib/refcount.c:153 refcount_inc_checked lib/refcount.c:153 [inline]
      WARNING: CPU: 1 PID: 23129 at lib/refcount.c:153 refcount_inc_checked+0x61/0x70 lib/refcount.c:151
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 23129 Comm: syz-executor3 Not tainted 5.0.0-rc4+ #53
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
       panic+0x2cb/0x65c kernel/panic.c:214
       __warn.cold+0x20/0x48 kernel/panic.c:571
       report_bug+0x263/0x2b0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       fixup_bug arch/x86/kernel/traps.c:173 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:refcount_inc_checked lib/refcount.c:153 [inline]
      RIP: 0010:refcount_inc_checked+0x61/0x70 lib/refcount.c:151
      Code: 1d 51 63 c8 06 31 ff 89 de e8 eb 1b f2 fd 84 db 75 dd e8 a2 1a f2 fd 48 c7 c7 60 9f 81 88 c6 05 31 63 c8 06 01 e8 af 65 bb fd <0f> 0b eb c1 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 49
      RSP: 0018:ffff8880a0cbf1e8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc90006113000
      RDX: 000000000001047d RSI: ffffffff81685776 RDI: 0000000000000005
      RBP: ffff8880a0cbf1f8 R08: ffff888097c9e100 R09: ffffed1015ce5021
      R10: ffffed1015ce5020 R11: ffff8880ae728107 R12: ffff8880723c20c0
      R13: ffff8880723c24b0 R14: dffffc0000000000 R15: ffffed1014197e64
       sock_hold include/net/sock.h:647 [inline]
       rds_sock_addref+0x19/0x20 net/rds/af_rds.c:675
       rds_find_bound+0x97c/0x1080 net/rds/bind.c:82
       rds_recv_incoming+0x3be/0x1430 net/rds/recv.c:362
       rds_loop_xmit+0xf3/0x2a0 net/rds/loop.c:96
       rds_send_xmit+0x1355/0x2a10 net/rds/send.c:355
       rds_sendmsg+0x323c/0x44e0 net/rds/send.c:1368
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:631
       __sys_sendto+0x387/0x5f0 net/socket.c:1788
       __do_sys_sendto net/socket.c:1800 [inline]
       __se_sys_sendto net/socket.c:1796 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1796
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458089
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fc266df8c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 0000000000458089
      RDX: 0000000000000000 RSI: 00000000204b3fff RDI: 0000000000000005
      RBP: 000000000073bf00 R08: 00000000202b4000 R09: 0000000000000010
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc266df96d4
      R13: 00000000004c56e4 R14: 00000000004d94a8 R15: 00000000ffffffff
      
      Fixes: cc4dfb7f ("rds: fix two RCU related problems")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: rds-devel@oss.oracle.com
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6fa19f56
    • Bart Van Assche's avatar
      lib/test_rhashtable: Make test_insert_dup() allocate its hash table dynamically · fc42a689
      Bart Van Assche authored
      The test_insert_dup() function from lib/test_rhashtable.c passes a
      pointer to a stack object to rhltable_init(). Allocate the hash table
      dynamically to avoid that the following is reported with object
      debugging enabled:
      
      ODEBUG: object (ptrval) is on stack (ptrval), but NOT annotated.
      WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:368 __debug_object_init+0x312/0x480
      Modules linked in:
      EIP: __debug_object_init+0x312/0x480
      Call Trace:
       ? debug_object_init+0x1a/0x20
       ? __init_work+0x16/0x30
       ? rhashtable_init+0x1e1/0x460
       ? sched_clock_cpu+0x57/0xe0
       ? rhltable_init+0xb/0x20
       ? test_insert_dup+0x32/0x20f
       ? trace_hardirqs_on+0x38/0xf0
       ? ida_dump+0x10/0x10
       ? jhash+0x130/0x130
       ? my_hashfn+0x30/0x30
       ? test_rht_init+0x6aa/0xab4
       ? ida_dump+0x10/0x10
       ? test_rhltable+0xc5c/0xc5c
       ? do_one_initcall+0x67/0x28e
       ? trace_hardirqs_off+0x22/0xe0
       ? restore_all_kernel+0xf/0x70
       ? trace_hardirqs_on_thunk+0xc/0x10
       ? restore_all_kernel+0xf/0x70
       ? kernel_init_freeable+0x142/0x213
       ? rest_init+0x230/0x230
       ? kernel_init+0x10/0x110
       ? schedule_tail_wrapper+0x9/0xc
       ? ret_from_fork+0x19/0x24
      
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc42a689
    • Jacob Wen's avatar
      l2tp: copy 4 more bytes to linear part if necessary · 91c52470
      Jacob Wen authored
      The size of L2TPv2 header with all optional fields is 14 bytes.
      l2tp_udp_recv_core only moves 10 bytes to the linear part of a
      skb. This may lead to l2tp_recv_common read data outside of a skb.
      
      This patch make sure that there is at least 14 bytes in the linear
      part of a skb to meet the maximum need of l2tp_udp_recv_core and
      l2tp_recv_common. The minimum size of both PPP HDLC-like frame and
      Ethernet frame is larger than 14 bytes, so we are safe to do so.
      
      Also remove L2TP_HDR_SIZE_NOSEQ, it is unused now.
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Suggested-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarJacob Wen <jian.w.wen@oracle.com>
      Acked-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91c52470
    • Colin Ian King's avatar
      ieee802154: mcr20a: fix indentation, remove tabs · 34aaaac8
      Colin Ian King authored
      The are a couple of statments that are one level too deep, fix this by
      removing tabs.
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      34aaaac8
    • David S. Miller's avatar
      Merge branch 'stmmac-fixes' · 3aa9179b
      David S. Miller authored
      Jose Abreu says:
      
      ====================
      net: stmmac: Misc fixes
      
      Some misc fixes for stmmac targeting -net.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3aa9179b
    • Jose Abreu's avatar
      net: stmmac: Disable EEE mode earlier in XMIT callback · e2cd682d
      Jose Abreu authored
      In stmmac xmit callback we use a different flow for TSO packets but TSO
      xmit callback is not disabling the EEE mode.
      
      Fix this by disabling earlier the EEE mode, i.e. before calling the TSO
      xmit callback.
      Signed-off-by: default avatarJose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e2cd682d
    • Jose Abreu's avatar
      net: stmmac: Send TSO packets always from Queue 0 · c5acdbee
      Jose Abreu authored
      The number of TSO enabled channels in HW can be different than the
      number of total channels. There is no way to determined, at runtime, the
      number of TSO capable channels and its safe to assume that if TSO is
      enabled then at least channel 0 will be TSO capable.
      
      Lets always send TSO packets from Queue 0.
      Signed-off-by: default avatarJose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c5acdbee
    • Jose Abreu's avatar
      net: stmmac: Fallback to Platform Data clock in Watchdog conversion · 4ec5302f
      Jose Abreu authored
      If we don't have DT then stmmac_clk will not be available. Let's add a
      new Platform Data field so that we can specify the refclk by this mean.
      
      This way we can still use the coalesce command in PCI based setups.
      Signed-off-by: default avatarJose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4ec5302f
    • Daniel Borkmann's avatar
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · d5256083
      Daniel Borkmann authored
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s soley distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d5256083
    • Jacob Wen's avatar
      l2tp: fix reading optional fields of L2TPv3 · 4522a70d
      Jacob Wen authored
      Use pskb_may_pull() to make sure the optional fields are in skb linear
      parts, so we can safely read them later.
      
      It's easy to reproduce the issue with a net driver that supports paged
      skb data. Just create a L2TPv3 over IP tunnel and then generates some
      network traffic.
      Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.
      
      Changes in v4:
      1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
      2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
      3. Add 'Fixes' in commit messages.
      
      Changes in v3:
      1. To keep consistency, move the code out of l2tp_recv_common.
      2. Use "net" instead of "net-next", since this is a bug fix.
      
      Changes in v2:
      1. Only fix L2TPv3 to make code simple.
         To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
         It's complicated to do so.
      2. Reloading pointers after pskb_may_pull
      
      Fixes: f7faffa3 ("l2tp: Add L2TPv3 protocol support")
      Fixes: 0d76751f ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
      Fixes: a32e0eec ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
      Signed-off-by: default avatarJacob Wen <jian.w.wen@oracle.com>
      Acked-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4522a70d
    • George Amanakis's avatar
      tun: move the call to tun_set_real_num_queues · 3a03cb84
      George Amanakis authored
      Call tun_set_real_num_queues() after the increment of tun->numqueues
      since the former depends on it. Otherwise, the number of queues is not
      correctly accounted for, which results to warnings similar to:
      "vnet0 selects TX queue 11, but real number of TX queues is 11".
      
      Fixes: 0b7959b6 ("tun: publish tfile after it's fully initialized")
      Reported-and-tested-by: default avatarGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: default avatarGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3a03cb84