  1. 10 Aug, 2021 4 commits
    • bpf, core: Fix kernel-doc notation · 019d0454
      Randy Dunlap authored
      Fix kernel-doc warnings in kernel/bpf/core.c (found by scripts/kernel-doc
      and W=1 builds). That is, correct a function name in a comment and add
      return descriptions for 2 functions.
      
      Fixes these kernel-doc warnings:
      
        kernel/bpf/core.c:1372: warning: expecting prototype for __bpf_prog_run(). Prototype was for ___bpf_prog_run() instead
        kernel/bpf/core.c:1372: warning: No description found for return value of '___bpf_prog_run'
        kernel/bpf/core.c:1883: warning: No description found for return value of 'bpf_prog_select_runtime'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210809215229.7556-1-rdunlap@infradead.org
    • bpf: Fix potentially incorrect results with bpf_get_local_storage() · a2baf4e8
      Yonghong Song authored
      Commit b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
      helper") fixed a bug for bpf_get_local_storage() helper so different tasks
      won't mess up with each other's percpu local storage.
      
      The percpu data contains 8 slots so it can hold up to 8 contexts (same or
      different tasks), for 8 different program runs, at the same time. This in
      general is sufficient. But our internal testing showed the following warning
      multiple times:
      
        [...]
        warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
           __cgroup_bpf_run_filter_sock_ops+0x13e/0x180
        RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
        <IRQ>
         tcp_call_bpf.constprop.99+0x93/0xc0
         tcp_conn_request+0x41e/0xa50
         ? tcp_rcv_state_process+0x203/0xe00
         tcp_rcv_state_process+0x203/0xe00
         ? sk_filter_trim_cap+0xbc/0x210
         ? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
         tcp_v6_do_rcv+0x181/0x3e0
         tcp_v6_rcv+0xc65/0xcb0
         ip6_protocol_deliver_rcu+0xbd/0x450
         ip6_input_finish+0x11/0x20
         ip6_input+0xb5/0xc0
         ip6_sublist_rcv_finish+0x37/0x50
         ip6_sublist_rcv+0x1dc/0x270
         ipv6_list_rcv+0x113/0x140
         __netif_receive_skb_list_core+0x1a0/0x210
         netif_receive_skb_list_internal+0x186/0x2a0
         gro_normal_list.part.170+0x19/0x40
         napi_complete_done+0x65/0x150
         mlx5e_napi_poll+0x1ae/0x680
         __napi_poll+0x25/0x120
         net_rx_action+0x11e/0x280
         __do_softirq+0xbb/0x271
         irq_exit_rcu+0x97/0xa0
         common_interrupt+0x7f/0xa0
         </IRQ>
         asm_common_interrupt+0x1e/0x40
        RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
         ? __cgroup_bpf_run_filter_skb+0x378/0x4e0
         ? do_softirq+0x34/0x70
         ? ip6_finish_output2+0x266/0x590
         ? ip6_finish_output+0x66/0xa0
         ? ip6_output+0x6c/0x130
         ? ip6_xmit+0x279/0x550
         ? ip6_dst_check+0x61/0xd0
        [...]
      
      Using drgn [0] to dump the percpu buffer contents showed that on this CPU
      slot 0 is still available, but slots 1-7 are occupied and those tasks in
      slots 1-7 mostly don't exist any more. So we might have issues in
      bpf_cgroup_storage_unset().
      
      Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
      Currently, it tries to unset the "current" slot by searching from the start.
      So the following sequence is possible:
      
        1. A task is running and claims slot 0.
        2. The running BPF program is done; it has checked that slot 0 holds the
           "task" and is about to reset it to NULL (but has not done so yet).
        3. An interrupt happens, another BPF program runs and claims slot 1
           with the *same* task.
        4. The unset() in interrupt context releases slot 0 since it matches "task".
        5. The interrupt is done, and the task in process context resets slot 0.
      
      At the end, slot 1 is never reset, and the same process can go on to occupy
      slots 2-7 in the same way. Finally, when steps 1-5 are repeated once more,
      the BPF program in step 3 cannot claim an empty slot and a warning is issued.

      To fix the issue, the unset() function should traverse from the last slot
      to the first. This way, the above issue can be avoided.
      
      The same reverse traversal should also be done in bpf_get_local_storage() helper
      itself. Otherwise, incorrect local storage may be returned to BPF program.
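
The reverse traversal can be illustrated with a minimal userspace sketch. The names slots, slot_set() and slot_unset() are hypothetical stand-ins, not the kernel's actual per-CPU data or helpers; the sketch only models "claim the first empty slot, release the last matching slot".

```c
#include <stddef.h>

#define NSLOTS 8

/* Hypothetical stand-in for the per-CPU array of in-flight contexts. */
static const void *slots[NSLOTS];

/* Claim the first empty slot for `task`; returns its index, or -1 if full. */
static int slot_set(const void *task)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (slots[i] == NULL) {
            slots[i] = task;
            return i;
        }
    }
    return -1;
}

/* Release the most recently claimed slot owned by `task` by scanning
 * from the last slot to the first; returns its index, or -1 if none. */
static int slot_unset(const void *task)
{
    for (int i = NSLOTS - 1; i >= 0; i--) {
        if (slots[i] == task) {
            slots[i] = NULL;
            return i;
        }
    }
    return -1;
}
```

With this ordering, a nested run for the same task (slot 1) releases slot 1 rather than slot 0, so the outer run still finds and releases its own slot afterwards.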
      
        [0] https://github.com/osandov/drgn
      
      Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210810010413.1976277-1-yhs@fb.com
    • bpf: Add missing bpf_read_[un]lock_trace() for syscall program · 87b7b533
      Yonghong Song authored
      Commit 79a7f8bd ("bpf: Introduce bpf_sys_bpf() helper and program type.")
      added support for syscall program, which is a sleepable program.
      
      But the program run missed bpf_read_lock_trace()/bpf_read_unlock_trace(),
      which is needed to ensure proper rcu callback invocations. This patch adds
      bpf_read_[un]lock_trace() properly.
      
      Fixes: 79a7f8bd ("bpf: Introduce bpf_sys_bpf() helper and program type.")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210809235151.1663680-1-yhs@fb.com
    • bpf: Add lockdown check for probe_write_user helper · 51e1bb9e
      Daniel Borkmann authored
      Back then, commit 96ae5227 ("bpf: Add bpf_probe_write_user BPF helper
      to be called in tracers") added the bpf_probe_write_user() helper in order
      to allow overriding user space memory. Its original goal was to have a
      facility to "debug, divert, and manipulate execution of semi-cooperative
      processes" under CAP_SYS_ADMIN. Writing to kernel memory was explicitly
      disallowed since it would otherwise tamper with the kernel's integrity.
      
      One use case was shown in cf9b1199 ("samples/bpf: Add test/example of
      using bpf_probe_write_user bpf helper") where the program DNATs traffic
      at the time of connect(2) syscall, meaning, it rewrites the arguments to
      a syscall while they're still in userspace, and before the syscall has a
      chance to copy the argument into kernel space. These days we have better
      mechanisms in BPF for achieving the same (e.g. for load-balancers), but
      without having to write to userspace memory.
      
      Of course the bpf_probe_write_user() helper can also be used for
      many other things, for both good and bad purposes. Outside of BPF, there is
      a similar mechanism in ptrace(2), such as PTRACE_PEEK{TEXT,DATA} and
      PTRACE_POKE{TEXT,DATA}, but using it would likely require some more effort.
      Commit 96ae5227 explicitly dedicated the helper for experimentation
      purpose only. Thus, move the helper's availability behind a newly added
      LOCKDOWN_BPF_WRITE_USER lockdown knob so that the helper is disabled under
      the "integrity" mode. More fine-grained control can be implemented also
      from LSM side with this change.
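
How such a knob gates a helper can be sketched as follows. This is a simplified model under stated assumptions: the enum ld_reason and ld_check() are hypothetical names, and the enum is a reduced stand-in for the kernel's ordered list of lockdown reasons, where everything listed before the "integrity" marker is refused in integrity mode.

```c
#include <errno.h>

/* Hypothetical, reduced list of lockdown reasons. Order matters:
 * entries before LD_INTEGRITY_MAX are refused in "integrity" mode,
 * entries before LD_CONFIDENTIALITY_MAX in "confidentiality" mode. */
enum ld_reason {
    LD_NONE,
    LD_BPF_WRITE_USER,
    LD_INTEGRITY_MAX,
    LD_BPF_READ_KERNEL,
    LD_CONFIDENTIALITY_MAX,
};

/* Returns 0 if `what` is allowed at the current lockdown `level`,
 * -EPERM otherwise. */
static int ld_check(enum ld_reason level, enum ld_reason what)
{
    return level >= what ? -EPERM : 0;
}
```

In this model, integrity mode refuses the write-user reason while still allowing the read-kernel reason, which is only refused in confidentiality mode.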
      
      Fixes: 96ae5227 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
  2. 09 Aug, 2021 8 commits
    • bpf: Add _kernel suffix to internal lockdown_bpf_read · 71330842
      Daniel Borkmann authored
      Rename LOCKDOWN_BPF_READ into LOCKDOWN_BPF_READ_KERNEL so we have naming
      more consistent with a LOCKDOWN_BPF_WRITE_USER option that we are adding.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
    • net: sched: act_mirred: Reset ct info when mirror/redirect skb · d09c548d
      Hangbin Liu authored
      When mirroring/redirecting an skb to a different port, the ct info should be
      reset for reclassification. Otherwise the packets will match unexpected
      rules. For example, with the following topology and commands:
      
          -----------
                    |
             veth0 -+-------
                    |
             veth1 -+-------
                    |
         ------------
      
       tc qdisc add dev veth0 clsact
       # The same with "action mirred egress mirror dev veth1" or "action mirred ingress redirect dev veth1"
       tc filter add dev veth0 egress chain 1 protocol ip flower ct_state +trk action mirred ingress mirror dev veth1
       tc filter add dev veth0 egress chain 0 protocol ip flower ct_state -inv action ct commit action goto chain 1
       tc qdisc add dev veth1 clsact
       tc filter add dev veth1 ingress chain 0 protocol ip flower ct_state +trk action drop
      
       ping <remote ip via veth0> &
       tc -s filter show dev veth1 ingress
      
      With command 'tc -s filter show', we can find the pkts were dropped on
      veth1.
      
      Fixes: b57dc7c1 ("net/sched: Introduce action ct")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'smc-fixes' · 605bb443
      David S. Miller authored
      Guvenc Gulce says:
      
      ====================
      net/smc: fixes 2021-08-09
      
      please apply the following patch series for smc to netdev's net tree.
      One patch fixes invalid connection counting for links and the other
      one fixes an access to an already cleared link.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Correct smc link connection counter in case of smc client · 64513d26
      Guvenc Gulce authored
      SMC clients may be assigned to a different link after the initial
      connection between two peers was established. In such a case,
      the connection counter was not correctly set.
      
      Update the connection counter correctly when a smc client connection
      is assigned to a different smc link.
      
      Fixes: 07d51580 ("net/smc: Add connection counters for links")
      Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Tested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: fix wait on already cleared link · 8f3d65c1
      Karsten Graul authored
      There can be a race between the waiters for a tx work request buffer
      and the link down processing that finally clears the link. Although
      all waiters are woken up before the link is cleared there might be
      waiters which did not yet get back control and are still waiting.
      This results in an access to a cleared wait queue head.
      
      Fix this by introducing atomic reference counting around the wait calls,
      and wait with the link clear processing until all waiters have finished.
      Move the work request layer related calls into smc_wr.c and set the
      link state to INACTIVE before calling smcr_link_clear() in
      smc_llc_srv_add_link().
      
      Fixes: 15e1b99a ("net/smc: no WR buffer wait for terminating link group")
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: ti: cpsw: fix min eth packet size for non-switch use-cases · acc68b8d
      Grygorii Strashko authored
      The CPSW switchdev driver inherited the fix from commit 9421c901 ("net:
      ethernet: ti: cpsw: fix min eth packet size"), which changes the min TX packet
      size to 64 bytes (VLAN_ETH_ZLEN, excluding ETH_FCS). It was done to fix a HW
      packet drop issue when packets are sent from the Host to a port with PVID and
      un-tagging enabled. Unfortunately, this breaks some other non-switch
      specific use-cases, like:
      - [1] CPSW port as DSA CPU port with DSA-tag applied at the end of the
      packet
      - [2] Some industrial protocols, which expect a min TX packet size of 60 bytes
      (excluding FCS).
      
      Fix it by configuring min TX packet size depending on driver mode
       - 60Bytes (ETH_ZLEN) for multi mac (dual-mac) mode
       - 64Bytes (VLAN_ETH_ZLEN) for switch mode
      and update it during driver mode change and annotate with
      READ_ONCE()/WRITE_ONCE() as it can be read by napi while writing.
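
The mode-dependent minimum can be sketched in plain C. The enum and function names here are hypothetical and not taken from the driver; only the two sizes (60 bytes for ETH_ZLEN, 64 bytes for VLAN_ETH_ZLEN) come from the text above. The sketch also leaves out the READ_ONCE()/WRITE_ONCE() annotation, which matters only for the concurrent access in the real driver.

```c
#include <stddef.h>

#define SKETCH_ETH_ZLEN      60  /* min frame length, excluding FCS */
#define SKETCH_VLAN_ETH_ZLEN 64  /* min VLAN-tagged frame length, excluding FCS */

/* Hypothetical driver modes. */
enum sketch_mode { SKETCH_DUAL_MAC, SKETCH_SWITCH };

/* Minimum TX packet size for the given driver mode. */
static size_t sketch_tx_min_len(enum sketch_mode mode)
{
    return mode == SKETCH_SWITCH ? SKETCH_VLAN_ETH_ZLEN : SKETCH_ETH_ZLEN;
}

/* Length a frame of `len` bytes would be padded to before transmit. */
static size_t sketch_tx_padded_len(size_t len, enum sketch_mode mode)
{
    size_t min = sketch_tx_min_len(mode);

    return len < min ? min : len;
}
```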
      
      [1] https://lore.kernel.org/netdev/20210531124051.GA15218@cephalopod/
      [2] https://e2e.ti.com/support/arm/sitara_arm/f/791/t/701669
      
      Cc: stable@vger.kernel.org
      Fixes: ed3525ed ("net: ethernet: ti: introduce cpsw switchdev based driver part 1 - dual-emac")
      Reported-by: Ben Hutchings <ben.hutchings@essensium.com>
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: mask the page->signature before the checking · 0fa32ca4
      Yunsheng Lin authored
      As mentioned in commit c07aea3e ("mm: add a signature in
      struct page"):
      "The page->signature field is aliased to page->lru.next and
      page->compound_head."
      
      And as the comment in page_is_pfmemalloc():
      "lru.next has bit 1 set if the page is allocated from the
      pfmemalloc reserves. Callers may simply overwrite it if they
      do not need to preserve that information."
      
      The page->signature is OR'ed with PP_SIGNATURE when a page is
      allocated in page pool, see __page_pool_alloc_pages_slow(),
      and page->signature is compared directly with PP_SIGNATURE in
      page_pool_return_skb_page(), which might cause a resource leak
      for a page from page pool if bit 1 of lru.next is set
      for a pfmemalloc page. What happens here is that the original
      pp->signature is OR'ed with PP_SIGNATURE after the allocation
      in order to preserve any existing bits (such as bit 1, used
      to indicate a pfmemalloc page), so when those bits are present,
      those pages are not considered to be from page pool and the DMA
      mapping of those pages will be left stale.

      As bit 0 is used for page->compound_head, mask both bits 0 and 1 before
      the check in page_pool_return_skb_page(). The pfmemalloc pages
      are then returned to the page allocator after cleaning
      up the DMA mapping.
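
The effect of masking can be shown with a small sketch. SKETCH_PP_SIGNATURE is an assumed placeholder value (any constant with the low two bits clear works for the illustration), and the two helper names are hypothetical.

```c
#include <stdbool.h>

/* Assumed placeholder signature; its low two bits are clear. */
#define SKETCH_PP_SIGNATURE 0x40UL

/* Direct comparison: fails once bit 0 or bit 1 is set in the field. */
static bool sketch_is_pp_page_direct(unsigned long sig)
{
    return sig == SKETCH_PP_SIGNATURE;
}

/* Masked comparison: ignores bits 0 and 1 before comparing. */
static bool sketch_is_pp_page_masked(unsigned long sig)
{
    return (sig & ~0x3UL) == SKETCH_PP_SIGNATURE;
}
```

A field holding the signature plus the pfmemalloc bit (bit 1) is rejected by the direct comparison but accepted by the masked one.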
      
      Fixes: 6a5bcd84 ("page_pool: Allow drivers to hint on SKB recycling")
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: add do-while-0 stubs for dccp_pr_debug macros · 86aab09a
      Randy Dunlap authored
      GCC complains about empty macros in an 'if' statement, so convert
      them to 'do {} while (0)' macros.
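
A minimal sketch of the idiom, using a hypothetical macro name rather than the dccp one: with debugging compiled out, the macro expands to an empty do-while statement instead of to nothing, so an 'if' or 'else' that contains only the macro call still has a real statement as its body.

```c
/* Debugging disabled: expand to a no-op statement that still consumes
 * the trailing semicolon. (An empty expansion would leave `if (x) ;`.) */
#define sketch_pr_debug(fmt, ...) do {} while (0)

/* Uses the stub as the sole body of an 'if' branch. */
static int sketch_classify(int err)
{
    if (err)
        sketch_pr_debug("returned err=%d\n", err);
    else
        return 0;
    return 1;
}
```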
      
      Fixes these build warnings:
      
      net/dccp/output.c: In function 'dccp_xmit_packet':
      ../net/dccp/output.c:283:71: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
        283 |                 dccp_pr_debug("transmit_skb() returned err=%d\n", err);
      net/dccp/ackvec.c: In function 'dccp_ackvec_update_old':
      ../net/dccp/ackvec.c:163:80: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
        163 |                                               (unsigned long long)seqno, state);
      
      Fixes: dc841e30 ("dccp: Extend CCID packet dequeueing interface")
      Fixes: 38024086 ("dccp ccid-2: Update code for the Ack Vector input/registration routine")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: dccp@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 08 Aug, 2021 9 commits
    • ppp: Fix generating ppp unit id when ifname is not specified · 3125f26c
      Pali Rohár authored
      When registering a new ppp interface via the PPPIOCNEWUNIT ioctl, the kernel
      has to choose the interface name, as this ioctl API does not support
      specifying it.

      In this case the kernel registers the new interface with the name "ppp<id>",
      where <id> is the ppp unit id, which can be obtained via the PPPIOCGUNIT
      ioctl. This also applies when registering a new ppp interface via rtnl
      without supplying IFLA_IFNAME.

      The PPPIOCNEWUNIT ioctl allows specifying a ppp unit id of the caller's
      choice, which the kernel assigns to the ppp interface if this id is not
      already used by another ppp interface.

      If the user does not specify a ppp unit id, the kernel chooses the first free
      ppp unit id. This also applies when creating a ppp interface via the
      rtnl method, as it does not provide a way to specify a ppp unit id.

      If some network interface (it does not have to be ppp) has the name "ppp<id>"
      with this first free ppp id, then the PPPIOCNEWUNIT ioctl or rtnl call fails.

      Registering a new ppp interface is then no longer possible until the
      interface holding the conflicting name is renamed, or unless the rtnl method
      is used with a custom interface name in IFLA_IFNAME.

      As the list of allocated / used ppp unit ids cannot be retrieved from the
      kernel by userspace, userspace has no idea what is happening or which
      interface is causing the conflict.

      So change the algorithm by which the ppp unit id is generated, and choose the
      first number which is used neither as a ppp unit id nor in the name of some
      network interface matching the pattern "ppp<id>".
      
      This issue can be simply reproduced by following pppd call when there is no
      ppp interface registered and also no interface with name pattern "ppp<id>":
      
          pppd ifname ppp1 +ipv6 noip noauth nolock local nodetach pty "pppd +ipv6 noip noauth nolock local nodetach notty"
      
      Or by creating one ppp interface (which gets assigned ppp unit id 0),
      renaming it to "ppp1" and then trying to create a new ppp interface (which
      will always fail, as the next free ppp unit id is 1 but a network interface
      named "ppp1" exists).

      This patch fixes the issue described above by generating successive ppp unit
      ids until one that does not conflict with any network interface is found.
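
The selection rule can be sketched in userspace C. The function sketch_pick_unit() and its array parameters are hypothetical stand-ins for the kernel's unit id allocator and interface name lookup; the sketch only shows the "skip ids that are used or whose ppp<id> name is taken" loop.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return the first id in [0, max_id) that is neither an allocated unit
 * id nor already taken as an interface name "ppp<id>"; -1 if none. */
static int sketch_pick_unit(const int *used_units, int n_units,
                            const char *const *ifnames, int n_names,
                            int max_id)
{
    for (int id = 0; id < max_id; id++) {
        char name[32];
        bool taken = false;

        for (int i = 0; i < n_units; i++)
            if (used_units[i] == id)
                taken = true;

        snprintf(name, sizeof(name), "ppp%d", id);
        for (int i = 0; i < n_names; i++)
            if (strcmp(ifnames[i], name) == 0)
                taken = true;

        if (!taken)
            return id;
    }
    return -1;
}
```

For the second reproducer above (unit id 0 in use, its interface renamed to "ppp1"), this loop skips 0 and 1 and settles on 2.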
      Signed-off-by: Pali Rohár <pali@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ppp: Fix generating ifname when empty IFLA_IFNAME is specified · 2459dcb9
      Pali Rohár authored
      IFLA_IFNAME is a nul-terminated string, which means that the IFLA_IFNAME
      buffer can be larger than the length of the string it contains.

      Function __rtnl_newlink() generates its own ifname if either IFLA_IFNAME
      was not specified at all or userspace passed an empty nul-terminated string.

      It is expected that if userspace does not specify an ifname for a new ppp
      netdev, then the kernel generates one in the format "ppp<id>", where id
      matches the ppp unit id which can later be obtained via the PPPIOCGUNIT ioctl.

      And it works this way if IFLA_IFNAME is not specified at all. But it
      does not work when IFLA_IFNAME is specified as an empty string.

      So fix this logic for an empty IFLA_IFNAME in the ppp_nl_newlink() function
      as well, and correctly generate the ifname based on the ppp unit identifier
      if userspace did not provide a preferred ifname.

      Without this patch, when IFLA_IFNAME was specified as an empty string, the
      kernel created a new ppp interface named "ppp<id>", but id did not
      match the ppp unit id returned by the PPPIOCGUNIT ioctl. In this case id was
      some number generated by the __rtnl_newlink() function.
      Signed-off-by: Pali Rohár <pali@kernel.org>
      Fixes: bb8082f6 ("ppp: build ifname using unit identifier for rtnl based devices")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bnxt_en-ptp-fixes' · 2f5501a8
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: PTP fixes
      
      This series includes 2 fixes for the PTP feature.  Update to the new
      firmware interface so that the driver can pass the PTP sequence number
      header offset of TX packets to the firmware.  This is needed for all
      PTP packet types (v1, v2, with or without VLAN) to work.  The 2nd
      fix is to use a different register window to read the PHC to avoid
      conflict with an older Broadcom tool.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Use register window 6 instead of 5 to read the PHC · 92529df7
      Michael Chan authored
      Some older Broadcom debug tools use window 5 and may conflict, so switch
      to use window 6 instead.
      
      Fixes: 118612d5 ("bnxt_en: Add PTP clock APIs, ioctls, and ethtool methods")
      Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Update firmware call to retrieve TX PTP timestamp · 9e266807
      Michael Chan authored
      New firmware interface requires the PTP sequence ID header offset to
      be passed to the firmware to properly find the matching timestamp
      for all protocols.
      
      Fixes: 83bb623c ("bnxt_en: Transmit and retrieve packet timestamps")
      Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnxt_en: Update firmware interface to 1.10.2.52 · fbfee257
      Michael Chan authored
      The key change is the firmware call to retrieve the PTP TX timestamp.
      The header offset for the PTP sequence number field is now added.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • once: Fix panic when module unload · 1027b96e
      Kefeng Wang authored
      DO_ONCE
      DEFINE_STATIC_KEY_TRUE(___once_key);
      __do_once_done
        once_disable_jump(once_key);
          INIT_WORK(&w->work, once_deferred);
          struct once_work *w;
          w->key = key;
          schedule_work(&w->work);                     module unload
                                                         //*the key is destroyed*
      process_one_work
        once_deferred
          BUG_ON(!static_key_enabled(work->key));
             static_key_count((struct static_key *)x)    //*access key, crash*
      
      When a module uses the DO_ONCE mechanism, it can crash due to the
      concurrency problem shown above; it can be reproduced with link [1].

      Fix it by getting/putting a module reference around the once work processing.
      
      [1] https://lore.kernel.org/netdev/eaa6c371-465e-57eb-6be9-f4b16b9d7cbf@huawei.com/
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Reported-by: Minmin chen <chenmingmin@huawei.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: Fix possible memory leak caused by invalid cast · d329e41a
      Vinicius Costa Gomes authored
      Fixes possible leak of PTP virtual clocks.
      
      The number of PTP virtual clocks to be unregistered is passed as
      'u32', but the function that unregisters the devices handles it as
      'u8'.
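
Why the width mismatch can leak clocks is easy to see in a simplified sketch. The function names are hypothetical, and the sketch models the problem as a plain value narrowing, which is a simplification of how the count is actually handed to the unregister callback in the kernel.

```c
#include <stdint.h>

/* Number of clocks a consumer would process if it keeps the count in
 * 8 bits: anything above 255 is silently truncated. */
static unsigned int sketch_count_as_u8(uint32_t requested)
{
    uint8_t num = (uint8_t)requested;

    return num;
}

/* Same, but keeping the full 32-bit count. */
static unsigned int sketch_count_as_u32(uint32_t requested)
{
    return requested;
}
```

With 300 clocks requested, the 8-bit variant sees only 44 (300 mod 256), so the remaining clocks would never be unregistered.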
      
      Fixes: 73f37068 ("ptp: support ptp physical/virtual clocks conversion")
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: micrel: Fix link detection on ksz87xx switch" · 2383cb94
      Ben Hutchings authored
      Commit a5e63c7d "net: phy: micrel: Fix detection of ksz87xx
      switch" broke link detection on the external ports of the KSZ8795.
      
      The previously unused phy_driver structure for these devices specifies
      config_aneg and read_status functions that appear to be designed for a
      fixed link and do not work with the embedded PHYs in the KSZ8795.
      
      Delete the use of these functions in favour of the generic PHY
      implementations which were used previously.
      
      Fixes: a5e63c7d ("net: phy: micrel: Fix detection of ksz87xx switch")
      Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 Aug, 2021 6 commits
  5. 06 Aug, 2021 13 commits
    • bpf: Fix integer overflow involving bucket_size · c4eb1f40
      Tatsuhiko Yasumatsu authored
      In __htab_map_lookup_and_delete_batch(), hash buckets are iterated
      over to count the number of elements in each bucket (bucket_size).
      If bucket_size is large enough, the multiplication to calculate
      kvmalloc() size could overflow, resulting in out-of-bounds write
      as reported by KASAN:
      
        [...]
        [  104.986052] BUG: KASAN: vmalloc-out-of-bounds in __htab_map_lookup_and_delete_batch+0x5ce/0xb60
        [  104.986489] Write of size 4194224 at addr ffffc9010503be70 by task crash/112
        [  104.986889]
        [  104.987193] CPU: 0 PID: 112 Comm: crash Not tainted 5.14.0-rc4 #13
        [  104.987552] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
        [  104.988104] Call Trace:
        [  104.988410]  dump_stack_lvl+0x34/0x44
        [  104.988706]  print_address_description.constprop.0+0x21/0x140
        [  104.988991]  ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
        [  104.989327]  ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
        [  104.989622]  kasan_report.cold+0x7f/0x11b
        [  104.989881]  ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
        [  104.990239]  kasan_check_range+0x17c/0x1e0
        [  104.990467]  memcpy+0x39/0x60
        [  104.990670]  __htab_map_lookup_and_delete_batch+0x5ce/0xb60
        [  104.990982]  ? __wake_up_common+0x4d/0x230
        [  104.991256]  ? htab_of_map_free+0x130/0x130
        [  104.991541]  bpf_map_do_batch+0x1fb/0x220
        [...]
      
      In hashtable, if the elements' keys have the same jhash() value, the
      elements will be put into the same bucket. By putting a lot of elements
      into a single bucket, the value of bucket_size can be increased to
      trigger the integer overflow.
      
      Triggering the overflow is possible for both callers with CAP_SYS_ADMIN
      and callers without CAP_SYS_ADMIN.
      
      It will be trivial for a caller with CAP_SYS_ADMIN to intentionally
      reach this overflow by enabling BPF_F_ZERO_SEED. As this flag will set
      the random seed passed to jhash() to 0, it will be easy for the caller
      to prepare keys which will be hashed into the same value, and thus put
      all the elements into the same bucket.
      
      If the caller does not have CAP_SYS_ADMIN, BPF_F_ZERO_SEED cannot be
      used. However, it will be still technically possible to trigger the
      overflow, by guessing the random seed value passed to jhash() (32bit)
      and repeating the attempt to trigger the overflow. In this case,
      the probability to trigger the overflow will be low and will take
      a very long time.
      
      Fix the integer overflow by calling kvmalloc_array() instead of
      kvmalloc() to allocate memory.
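
The difference between the two allocation styles can be sketched in userspace C. Both function names are hypothetical, and malloc() stands in for the kernel allocator; the point is only that an array-style allocator checks the multiplication and fails instead of returning a buffer that is too small.

```c
#include <stdint.h>
#include <stdlib.h>

/* Unchecked: the product is computed in 32 bits and can wrap around. */
static uint32_t sketch_size_unchecked(uint32_t elem_size, uint32_t count)
{
    return elem_size * count;
}

/* Checked, in the spirit of an array allocator: refuse the request if
 * count * elem_size does not fit in size_t. */
static void *sketch_alloc_array(size_t count, size_t elem_size)
{
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return NULL;
    return malloc(count * elem_size);
}
```

With an element size of 0x10000 and a count of 0x10000, the unchecked 32-bit product wraps to 0, so a later copy of the real amount of data would run past the allocation.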
      
      Fixes: 05799638 ("bpf: Add batch ops to all htab bpf map")
      Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210806150419.109658-1-th.yasumatsu@gmail.com
    • libbpf, doc: Eliminate warnings in libbpf_naming_convention · 7c4a2233
      Randy Dunlap authored
      Use "code-block: none" instead of "c" for non-C-language code blocks.
      Removes these warnings:
      
        lnx-514-rc4/Documentation/bpf/libbpf/libbpf_naming_convention.rst:111: WARNING: Could not lex literal_block as "c". Highlighting skipped.
        lnx-514-rc4/Documentation/bpf/libbpf/libbpf_naming_convention.rst:124: WARNING: Could not lex literal_block as "c". Highlighting skipped.
      
      Fixes: f42cfb46 ("bpf: Add documentation for libbpf including API autogen")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210802015037.787-1-rdunlap@infradead.org
    • libbpf: Do not close un-owned FD 0 on errors · c34c338a
      Daniel Xu authored
      Before this patch, btf_new() was liable to close an arbitrary FD 0 if
      BTF parsing failed. This was because:
      
      * btf->fd was initialized to 0 through the calloc()
      * btf__free() (in the `done` label) closed any FDs >= 0
      * btf->fd is left at 0 if parsing fails
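
The three points above can be reproduced in a small model. Everything here is hypothetical (struct sketch_obj, sketch_free(), sketch_new_failing()), and the close() call is replaced by recording which FD would have been closed, so the sketch never touches a real descriptor. Setting the FD to -1 right after allocation is one way to avoid the problem; whether it matches the actual patch is an assumption.

```c
#include <stdlib.h>

struct sketch_obj {
    int fd;
};

/* Records which FD the cleanup path would have closed; -2 means none. */
static int sketch_closed_fd = -2;

/* Cleanup path: "closes" any FD that looks valid, then frees the object. */
static void sketch_free(struct sketch_obj *o)
{
    if (o->fd >= 0)
        sketch_closed_fd = o->fd; /* stand-in for close(o->fd) */
    free(o);
}

/* Constructor whose parsing step fails before any FD is opened.
 * If `init_fd` is nonzero, the FD is set to -1 right after calloc(). */
static void sketch_new_failing(int init_fd)
{
    struct sketch_obj *o = calloc(1, sizeof(*o));

    if (!o)
        return;
    if (init_fd)
        o->fd = -1;
    sketch_free(o); /* error path */
}
```

Without the initialization, the zeroed field looks like a valid FD 0 and the error path would close it.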
      
      This issue was discovered on a system using libbpf v0.3 (without
      BTF_KIND_FLOAT support) but with a kernel that had BTF_KIND_FLOAT types
      in BTF. Thus, parsing fails.
      
      While this patch technically doesn't fix any issues because upstream libbpf
      has BTF_KIND_FLOAT support, it'll help prevent issues in the future if
      more BTF types are added. It also allows the fix to be backported to
      older libbpf's.
      
      Fixes: 3289959b ("libbpf: Support BTF loading and raw data output in both endianness")
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/5969bb991adedb03c6ae93e051fd2a00d293cf25.1627513670.git.dxu@dxuuu.xyz
    • Robin Gögge's avatar
      libbpf: Fix probe for BPF_PROG_TYPE_CGROUP_SOCKOPT · 78d14bda
      Robin Gögge authored
      This patch fixes the probe for BPF_PROG_TYPE_CGROUP_SOCKOPT,
      so the probe reports accurate results when used by e.g.
      bpftool.
      
      Fixes: 4cdbfb59 ("libbpf: support sockopt hooks")
      Signed-off-by: Robin Gögge <r.goegge@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20210728225825.2357586-1-r.goegge@gmail.com
      78d14bda
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · cc4e5eec
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Restrict range element expansion in ipset to avoid soft lockup,
         from Jozsef Kadlecsik.
      
      2) Memleak in error path for nf_conntrack_bridge for IPv4 packets,
         from Yajun Deng.
      
      3) Simplify conntrack garbage collection strategy to avoid frequent
         wake-ups, from Florian Westphal.
      
      4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name.
      
      5) Missing chain family netlink attribute in chain description
         in nfnetlink_hook.
      
      6) Incorrect sequence number on nfnetlink_hook dumps.
      
      7) Use netlink request family in reply message for consistency.
      
      8) Remove offload_pickup sysctl, use conntrack for established state
         instead, from Florian Westphal.
      
      9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since
         NFPROTO_INET is not exposed through nfnetlink_hook.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
        netfilter: nfnetlink_hook: translate inet ingress to netdev
        netfilter: conntrack: remove offload_pickup sysctl again
        netfilter: nfnetlink_hook: Use same family as request message
        netfilter: nfnetlink_hook: use the sequence number of the request message
        netfilter: nfnetlink_hook: missing chain family
        netfilter: nfnetlink_hook: strip off module name from hookfn
        netfilter: conntrack: collect all entries in one cycle
        netfilter: nf_conntrack_bridge: Fix memory leak when error
        netfilter: ipset: Limit the maximal range of consecutive elements to add/delete
      ====================
      
      Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cc4e5eec
    • Pablo Neira Ayuso's avatar
      netfilter: nfnetlink_hook: translate inet ingress to netdev · 269fc695
      Pablo Neira Ayuso authored
      The NFPROTO_INET pseudofamily is not exposed through this new netlink
      interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6
      for NFPROTO_INET prerouting/input/forward/output/postrouting hooks.
      The NFNLA_CHAIN_FAMILY attribute provides the family chain, which
      specifies if this hook applies to inet traffic only (either IPv4 or
      IPv6).
      
      Translate the inet/ingress hook to netdev/ingress to fully hide the
      NFPROTO_INET implementation details.
      
      Fixes: e2cf17d3 ("netfilter: add new hook nfnl subsystem")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      269fc695
    • Florian Westphal's avatar
      netfilter: conntrack: remove offload_pickup sysctl again · 4592ee7f
      Florian Westphal authored
      These two sysctls were added because the hardcoded defaults (2 minutes
      for TCP, 30 seconds for UDP) turned out to be too low for some setups.
      
      They appeared in 5.14-rc1, so it should be fine to remove them again.
      
      Marcelo convinced me that there should be no difference between a flow
      that was offloaded vs. a flow that was not wrt. timeout handling.
      Thus the default is changed to those for TCP established and UDP stream,
      5 days and 120 seconds, respectively.
      
      Marcelo also suggested accounting for the timeout value used for the
      offloading. This avoids an increase beyond the value in the
      conntrack sysctl and also instantly expires the conntrack entry when
      the sysctls have been altered.
      
      Example:
         nf_conntrack_udp_timeout_stream=60
         nf_flowtable_udp_timeout=60
      
      This will remove offloaded udp flows after one minute, rather than two.
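
      A minimal sketch of the arithmetic behind this example, assuming the
      offload timeout is subtracted from the conntrack timeout and the result
      is clamped at zero. The function name and the use of plain seconds are
      illustrative assumptions, not the kernel's code:

```c
#include <assert.h>

/*
 * Hypothetical helper: time (in seconds) a flow is still granted in
 * conntrack after it has already idled for the flowtable timeout.
 * Clamping at zero means the entry expires immediately on pickup.
 */
static long demo_pickup_timeout(long ct_timeout_s, long flowtable_timeout_s)
{
	long remaining;

	assert(ct_timeout_s >= 0 && flowtable_timeout_s >= 0);
	remaining = ct_timeout_s - flowtable_timeout_s;
	return remaining < 0 ? 0 : remaining;
}
```

      With both values set to 60, the helper returns 0: the flow idles for
      60 seconds in the flowtable and gets no additional time afterwards,
      so it is gone after one minute instead of 60 + 60 seconds.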
      
      An earlier version of this patch also cleared the ASSURED bit to
      allow nf_conntrack to evict the entry via early_drop (i.e., table full).
      However, it looks like we can safely assume that connection timed out
      via HW is still in established state, so this isn't needed.
      
      Quoting Oz:
       [..] the hardware sends all packets with a set FIN flags to sw.
       [..] Connections that are aged in hardware are expected to be in the
       established state.
      
      In case it turns out that back-to-sw-path transition can occur for
      'dodgy' connections too (e.g., one side disappeared while software-path
      would have been in RETRANS timeout), we can adjust this later.
      
      Cc: Oz Shlomo <ozsh@nvidia.com>
      Cc: Paul Blakey <paulb@nvidia.com>
      Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      4592ee7f
    • Pablo Neira Ayuso's avatar
      netfilter: nfnetlink_hook: Use same family as request message · 69311e7c
      Pablo Neira Ayuso authored
      Use the same family as the request message, for consistency. The
      netlink payload provides sufficient information to describe the hook
      object, including the family.
      
      This makes it easier for userspace to correlate the hooks that are
      visited by the packets for a certain family.
      
      Fixes: e2cf17d3 ("netfilter: add new hook nfnl subsystem")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      69311e7c
    • Pablo Neira Ayuso's avatar
      netfilter: nfnetlink_hook: use the sequence number of the request message · 3d9bbaf6
      Pablo Neira Ayuso authored
      The sequence number allows correlating the netlink reply message (as
      part of the dump) with the original request message.
      
      The cb->seq field is used internally to detect interference (an update)
      of the hook list during the netlink dump; do not use it as the sequence
      number in the netlink dump header.
      
      Fixes: e2cf17d3 ("netfilter: add new hook nfnl subsystem")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      3d9bbaf6
    • Pablo Neira Ayuso's avatar
      netfilter: nfnetlink_hook: missing chain family · a6e57c4a
      Pablo Neira Ayuso authored
      The family is relevant for pseudo-families like NFPROTO_INET;
      otherwise, the user needs to rely on the hook function name to
      differentiate it from NFPROTO_IPV4 and NFPROTO_IPV6 names.
      
      Add nfnl_hook_chain_desc_attributes instead of using the existing
      NFTA_CHAIN_* attributes, since these do not provide a family number.
      
      Fixes: e2cf17d3 ("netfilter: add new hook nfnl subsystem")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a6e57c4a
    • Pablo Neira Ayuso's avatar
      netfilter: nfnetlink_hook: strip off module name from hookfn · 61e0c2bc
      Pablo Neira Ayuso authored
      NFNLA_HOOK_FUNCTION_NAME should include the hook function name only;
      the module name is already provided by NFNLA_HOOK_MODULE_NAME.
      
      Fixes: e2cf17d3 ("netfilter: add new hook nfnl subsystem")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      61e0c2bc
    • Florian Westphal's avatar
      netfilter: conntrack: collect all entries in one cycle · 4608fdfc
      Florian Westphal authored
      Michal Kubecek reports that conntrack gc is responsible for frequent
      wakeups (every 125ms) on idle systems.
      
      On busy systems, timed-out entries are evicted during lookup.
      The gc worker is only needed to remove entries after the system becomes
      idle following a busy period.
      
      To resolve this, always scan the entire table.
      If the scan is taking too long, reschedule so other work_structs can run
      and resume from the next bucket.
      
      After a completed scan, wait for 2 minutes before the next cycle.
      Heuristics for faster re-schedule are removed.
      
      GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow
      tuning this as-needed or even turn the gc worker off.
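
      The scan strategy described above can be sketched as a resumable loop.
      This is a simplified model, not the kernel's gc worker: the names are
      hypothetical, and the "taking too long" check is approximated by a
      per-run bucket budget instead of a time limit. Only the 2 minute
      interval comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Interval between full scans: 2 minutes, as stated in the message. */
#define DEMO_GC_SCAN_INTERVAL_MS (2u * 60u * 1000u)

/* Hypothetical worker state: the bucket at which the next run resumes. */
struct demo_gc_state {
	size_t next_bucket;
};

/*
 * Scan at most `budget` buckets of a table with `nbuckets` buckets.
 * Returns the delay in ms before the worker should run again:
 * 0 means "yield now and resume from next_bucket",
 * DEMO_GC_SCAN_INTERVAL_MS means "full scan finished, sleep".
 */
static unsigned int demo_gc_run(struct demo_gc_state *st, size_t nbuckets,
				size_t budget)
{
	size_t scanned = 0;

	assert(st != NULL);
	while (st->next_bucket < nbuckets) {
		if (scanned == budget)
			return 0;	/* let other work run, resume later */
		/* Expired entries in this bucket would be evicted here. */
		st->next_bucket++;
		scanned++;
	}

	st->next_bucket = 0;		/* next cycle starts from the top */
	return DEMO_GC_SCAN_INTERVAL_MS;
}
```

      For a 10 bucket table and a budget of 4, the first two runs return 0
      and stop at buckets 4 and 8; the third run finishes the table and
      returns the 2 minute delay.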
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      4608fdfc
    • John Hubbard's avatar
      net: mvvp2: fix short frame size on s390 · 704e624f
      John Hubbard authored
      On s390, the following build warning occurs:
      
      drivers/net/ethernet/marvell/mvpp2/mvpp2.h:844:2: warning: overflow in
      conversion from 'long unsigned int' to 'int' changes value from
      '18446744073709551584' to '-32' [-Woverflow]
      844 |  ((total_size) - MVPP2_SKB_HEADROOM - MVPP2_SKB_SHINFO_SIZE)
      
      This happens because MVPP2_SKB_SHINFO_SIZE, which is 320 bytes (which is
      already 64-byte aligned) on some architectures, actually gets ALIGN'd up
      to 512 bytes in the s390 case.
      
      So then, when this is invoked:
      
          MVPP2_RX_MAX_PKT_SIZE(MVPP2_BM_SHORT_FRAME_SIZE)
      
      ...that turns into:
      
           704 - 224 - 512 == -32
      
      ...which is not a good frame size to end up with! The warning above is a
      bit lucky: it notices a signed/unsigned bad behavior here, which leads
      to the real problem of a frame that is too short for its contents.
      
      Increase MVPP2_BM_SHORT_FRAME_SIZE by 32 (from 704 to 736), which is
      just exactly big enough. (The other values can't readily be changed
      without causing a lot of other problems.)
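
      The arithmetic can be reproduced outside the driver. The sketch below
      uses hypothetical DEMO_* names and fixed 64-bit types in place of the
      driver's macros; the constants 224, 512, 704 and 736 are the ones
      quoted above for the s390 case.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the message for the s390 case (assumed, not driver code). */
#define DEMO_SKB_HEADROOM	UINT64_C(224)
#define DEMO_SKB_SHINFO_SIZE	UINT64_C(512)

/* Unsigned subtraction, as in the macro the compiler warned about. */
static uint64_t demo_rx_max_pkt_size_unsigned(uint64_t total_size)
{
	return total_size - DEMO_SKB_HEADROOM - DEMO_SKB_SHINFO_SIZE;
}

/* The same computation in signed arithmetic, showing the intended value. */
static int64_t demo_rx_max_pkt_size_signed(uint64_t total_size)
{
	assert(total_size <= (uint64_t)INT64_MAX);
	return (int64_t)total_size - (int64_t)DEMO_SKB_HEADROOM -
	       (int64_t)DEMO_SKB_SHINFO_SIZE;
}
```

      For a frame size of 704 the unsigned form wraps around to
      18446744073709551584 (2^64 - 32) and the signed form yields -32; for
      736 both yield exactly 0, which is why 736 is the smallest frame size
      that avoids a negative packet size.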
      
      Fixes: 07dd0a7a ("mvpp2: add basic XDP support")
      Cc: Sven Auhagen <sven.auhagen@voleatech.de>
      Cc: Matteo Croce <mcroce@microsoft.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      704e624f