1. 29 Apr, 2024 28 commits
    • libbpf: handle nulled-out program in struct_ops correctly · f973fccd
      Andrii Nakryiko authored
      If struct_ops has one of its program callbacks set declaratively and the
      host kernel is old and doesn't support this callback, libbpf will allow
      loading such struct_ops as long as that callback was explicitly nulled-out
      (presumably through skeleton). This all works correctly, except that we
      don't reset the corresponding program slot to NULL before bailing out,
      which leads to libbpf not detecting that the BPF program must not be
      auto-loaded. Fix this by unconditionally resetting the corresponding
      program slot to NULL.
      
      Fixes: c911fc61 ("libbpf: Skip zeroed or null fields if not found in the kernel type.")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240428030954.3918764-1-andrii@kernel.org
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Include linux/types.h for u32 · cfd3bfe9
      Dmitrii Bundin authored
      Inclusion of the header linux/btf_ids.h relies on indirect inclusion of
      the header linux/types.h. Including it directly on the top level helps
      to avoid potential problems if linux/types.h hasn't been included
      before.
      
      The main motivation for this change is to avoid problems similar to those
      that have shown up in bpftool, where the build relied on GNU libc
      indirectly pulling in linux/types.h; without that, compilation fails with
      an error of the form:
      
         error: unknown type name 'u32'
                                   u32 cnt;
                                   ^~~
      
      The bpftool compile error was fixed in
      62248b22 ("tools/resolve_btfids: fix build with musl libc").
      Signed-off-by: Dmitrii Bundin <dmitrii.bundin.a@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240420042457.3198883-1-dmitrii.bundin.a@gmail.com
    • Merge branch 'free-strdup-memory-in-selftests' · 789d9a53
      Andrii Nakryiko authored
      Geliang Tang says:
      
      ====================
      Free strdup memory in selftests
      
      From: Geliang Tang <tanggeliang@kylinos.cn>
      
      Two fixes to free strdup memory in selftests to avoid memory leaks.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1714374022.git.tanggeliang@kylinos.cn
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • selftests/bpf: Free strdup memory in veristat · 25927d0a
      Geliang Tang authored
      The strdup() function returns a pointer to a new string which is a
      duplicate of the string "input". Memory for the new string is obtained
      with malloc(), and needs to be freed with free().
      
      This patch adds the missing "free(input)" calls in parse_stats() to avoid
      a memory leak in veristat.c.
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/ded44f8865cd7f337f52fc5fb0a5fbed7d6bd641.1714374022.git.tanggeliang@kylinos.cn
    • selftests/bpf: Free strdup memory in test_sockmap · 237c522c
      Geliang Tang authored
      The strdup() function returns a pointer to a new string which is a
      duplicate of the string "ptr". Memory for the new string is obtained
      with malloc(), and needs to be freed with free().
      
      This patch adds the missing "free(ptr)" calls in check_whitelist() and
      check_blacklist() to avoid memory leaks in test_sockmap.c.
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/b76f2f4c550aebe4ab8ea73d23c4cbe4f06ea996.1714374022.git.tanggeliang@kylinos.cn
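The pattern behind both strdup fixes can be sketched in user space as follows. This is a minimal illustration, not the selftest code: the helper name count_tokens() and its comma-separated input are hypothetical, and only strdup(), strtok_r() and free() are real libc calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts comma-separated tokens in `list`.
 * strdup() provides a private, writable copy (strtok_r() modifies its
 * input), and that copy must be released on every return path. */
int count_tokens(const char *list)
{
    char *copy, *tok, *save = NULL;
    int n = 0;

    copy = strdup(list);
    if (!copy)
        return -1;

    for (tok = strtok_r(copy, ",", &save); tok;
         tok = strtok_r(NULL, ",", &save))
        n++;

    free(copy); /* the kind of call the leaking paths were missing */
    return n;
}
```

Calling count_tokens("a,b,c") walks three tokens and releases the duplicate before returning; a tool such as valgrind would flag the copy as leaked if the free() were dropped.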
    • selftests/bpf: Run cgroup1_hierarchy test in own mount namespace · 19468ed5
      Viktor Malik authored
      The cgroup1_hierarchy test uses setup_classid_environment to set up the
      cgroupv1 environment. The problem is that the environment is set up in
      /sys/fs/cgroup and therefore, if not run in its own mount namespace,
      the test effectively deletes all system cgroups:
      
          $ ls /sys/fs/cgroup | wc -l
          27
          $ sudo ./test_progs -t cgroup1_hierarchy
          #41/1    cgroup1_hierarchy/test_cgroup1_hierarchy:OK
          #41/2    cgroup1_hierarchy/test_root_cgid:OK
          #41/3    cgroup1_hierarchy/test_invalid_level:OK
          #41/4    cgroup1_hierarchy/test_invalid_cgid:OK
          #41/5    cgroup1_hierarchy/test_invalid_hid:OK
          #41/6    cgroup1_hierarchy/test_invalid_cgrp_name:OK
          #41/7    cgroup1_hierarchy/test_invalid_cgrp_name2:OK
          #41/8    cgroup1_hierarchy/test_sleepable_prog:OK
          #41      cgroup1_hierarchy:OK
          Summary: 1/8 PASSED, 0 SKIPPED, 0 FAILED
          $ ls /sys/fs/cgroup | wc -l
          1
      
      To avoid this, run setup_cgroup_environment first, which creates a
      separate mount namespace. This only affects the cgroup1_hierarchy test, as
      all other cgroup1 test progs already run setup_cgroup_environment prior
      to running setup_classid_environment.
      
      Also add a comment to the header of setup_classid_environment to warn
      against this invalid usage in the future.
      
      Fixes: 36076923 ("selftests/bpf: Add selftests for cgroup1 hierarchy")
      Signed-off-by: Viktor Malik <vmalik@redhat.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240429112311.402497-1-vmalik@redhat.com
    • bpf: Switch to krealloc_array() · a3034872
      Andy Shevchenko authored
      Let the krealloc_array() copy the original data and
      check for a multiplication overflow.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20240429120005.3539116-1-andriy.shevchenko@linux.intel.com
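What the overflow check in krealloc_array() buys can be sketched with a user-space analogue. The function name realloc_array() below is hypothetical and the sketch uses realloc() rather than the kernel allocator; it only illustrates the idea of rejecting an element count whose byte size would wrap around.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space analogue of krealloc_array(): resize `p` to hold
 * `n` elements of `size` bytes each, refusing requests whose total byte
 * count would overflow size_t instead of silently wrapping around. */
void *realloc_array(void *p, size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL; /* n * size would overflow */

    /* realloc() preserves the existing contents up to the smaller size. */
    return realloc(p, n * size);
}
```

With an open-coded `realloc(p, n * size)`, an oversized `n` could wrap to a small allocation that later code then overruns; with the check, the caller sees a plain allocation failure and the original buffer stays valid.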
    • bpf: Use struct_size() · cb01621b
      Andy Shevchenko authored
      Use struct_size() instead of hand writing it.
      This is less verbose and more robust.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20240429121323.3818497-1-andriy.shevchenko@linux.intel.com
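As a rough illustration of why struct_size() is more robust than hand-written arithmetic, the sketch below computes the size of a structure with a trailing flexible array in plain C. The struct id_list type and the id_list_size() helper are hypothetical stand-ins, not the code touched by this commit; the saturation to SIZE_MAX mirrors the documented overflow behaviour of the kernel macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with a trailing flexible array member. */
struct id_list {
    uint32_t cnt;
    uint32_t ids[];
};

/* Bytes needed for an id_list holding `n` ids. Saturates to SIZE_MAX on
 * overflow so that a following allocation fails instead of being
 * undersized. */
size_t id_list_size(size_t n)
{
    const size_t elem = sizeof(uint32_t);

    if (n > (SIZE_MAX - sizeof(struct id_list)) / elem)
        return SIZE_MAX;
    return sizeof(struct id_list) + n * elem;
}
```

The hand-written form `sizeof(*p) + n * sizeof(p->ids[0])` gives the same result for sane inputs but wraps silently for a huge `n`, which is the case the helper guards against.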
    • samples/bpf: Add valid info for VMLINUX_BTF · 397658dd
      Tao Chen authored
      When I use the command 'make M=samples/bpf' to compile the samples/bpf code
      on Ubuntu 22.04, the following error occurs:
      Cannot find a vmlinux for VMLINUX_BTF at any of "  /home/ubuntu/code/linux/vmlinux",
      build the kernel or set VMLINUX_BTF or VMLINUX_H variable
      
      Others often run into this issue. Newer kernels have the vmlinux, so we can
      put its path in the error message, which seems more intuitive, like:
      Cannot find a vmlinux for VMLINUX_BTF at any of "  /home/ubuntu/code/linux/vmlinux",
      build the kernel or set VMLINUX_BTF like "VMLINUX_BTF=/sys/kernel/btf/vmlinux" or
      VMLINUX_H variable
      Signed-off-by: Tao Chen <chen.dylane@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240428161032.239043-1-chen.dylane@gmail.com
    • bpf: Fix verifier assumptions about socket->sk · 0db63c0b
      Alexei Starovoitov authored
      The verifier assumes that 'sk' field in 'struct socket' is valid
      and non-NULL when 'socket' pointer itself is trusted and non-NULL.
      That may not be the case when socket was just created and
      passed to LSM socket_accept hook.
      Fix this verifier assumption and adjust tests.
      Reported-by: Liam Wisehart <liamwisehart@meta.com>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Fixes: 6fcd486b ("bpf: Refactor RCU enforcement in the verifier.")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/r/20240427002544.68803-1-alexei.starovoitov@gmail.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 89de2db1
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2024-04-29
      
      We've added 147 non-merge commits during the last 32 day(s) which contain
      a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).
      
      The main changes are:
      
      1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
         memory addresses and implement support in x86 BPF JIT. This allows
         inlining per-CPU array and hashmap lookups
         and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.
      
      2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.
      
      3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
         atomics in bpf_arena which can be JITed as a single x86 instruction,
         from Alexei Starovoitov.
      
      4) Add support for passing mark with bpf_fib_lookup helper,
         from Anton Protopopov.
      
      5) Add a new bpf_wq API for deferring events and refactor sleepable
         bpf_timer code to keep common code where possible,
         from Benjamin Tissoires.
      
      6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
         to check when NULL is passed for non-NULLable parameters,
         from Eduard Zingerman.
      
      7) Harden the BPF verifier's and/or/xor value tracking,
         from Harishankar Vishwanathan.
      
      8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
         crypto subsystem, from Vadim Fedorenko.
      
      9) Various improvements to the BPF instruction set standardization doc,
         from Dave Thaler.
      
      10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
          from Andrea Righi.
      
      11) Bigger batch of BPF selftests refactoring to use common network helpers
          and to drop duplicate code, from Geliang Tang.
      
      12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
          from Jose E. Marchesi.
      
      13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
          program to have code sections where preemption is disabled,
          from Kumar Kartikeya Dwivedi.
      
      14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
          from David Vernet.
      
      15) Extend the BPF verifier to allow different input maps for a given
          bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.
      
      16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
          for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.
      
      17) Shut up a false-positive KMSAN splat in interpreter mode by unpoisoning
          the stack memory, from Martin KaFai Lau.
      
      18) Improve xsk selftest coverage with new tests on maximum and minimum
          hardware ring size configurations, from Tushar Vyavahare.
      
      19) Various ReST man pages fixes as well as documentation and bash completion
          improvements for bpftool, from Rameez Rehman & Quentin Monnet.
      
      20) Fix libbpf with regards to dumping subsequent char arrays,
          from Quentin Deslandes.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
        bpf, docs: Clarify PC use in instruction-set.rst
        bpf_helpers.h: Define bpf_tail_call_static when building with GCC
        bpf, docs: Add introduction for use in the ISA Internet Draft
        selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
        bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
        selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
        bpf: check bpf_dummy_struct_ops program params for test runs
        selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
        selftests/bpf: adjust dummy_st_ops_success to detect additional error
        bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
        selftests/bpf: Add ring_buffer__consume_n test.
        bpf: Add bpf_guard_preempt() convenience macro
        selftests: bpf: crypto: add benchmark for crypto functions
        selftests: bpf: crypto skcipher algo selftests
        bpf: crypto: add skcipher to bpf crypto
        bpf: make common crypto API for TC/XDP programs
        bpf: update the comment for BTF_FIELDS_MAX
        selftests/bpf: Fix wq test.
        selftests/bpf: Use make_sockaddr in test_sock_addr
        selftests/bpf: Use connect_to_addr in test_sock_addr
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: micrel: Add support for PTP_PF_EXTTS for lan8814 · b3f1a08f
      Horatiu Vultur authored
      Extend the PTP programmable GPIOs to also implement the PTP_PF_EXTTS
      function. The pins can be configured to capture both rising
      and falling edges. Once the event is seen, an interrupt is
      generated and the LTC is saved in the registers.
      On lan8814 only GPIO 3 can be configured for this.
      
      This was tested using:
      ts2phc -m -l 7 -s generic -f ts2phc.cfg
      
      Where the configuration was the following:
          ---
          [global]
          ts2phc.pin_index  3
      
          [eth0]
          ---
      Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-realtek-leds' · 3208bdd0
      David S. Miller authored
      Luiz Angelo Daros de Luca says:
      
      ====================
      net: dsa: realtek: fix LED support for rtl8366
      
      This series fixes the LED support for rtl8366. The existing code was not
      tested on a device with switch LEDs and it used flawed logic.
      
      The driver now keeps the default LED configuration if nothing requests a
      different behavior. This may be enough for most devices. This can be
      achieved either by omitting the LED from the device-tree or configuring
      all LEDs in a group with the default state set to "keep".
      
      The hardware trigger for LEDs in Realtek switches is shared among all
      LEDs in a group. This behavior doesn't align well with the Linux LED
      API, which controls LEDs individually. Once the OS changes the
      brightness of a LED in a group still triggered by the hardware, the
      entire group switches to software-controlled LEDs, even for those not
      mentioned in the device-tree. This shared behavior also prevents
      offloading the trigger to the hardware as it would require an
      orchestration between LEDs in a group, not currently present in the LED
      API.
      
      The assertion of device hardware reset during driver removal was removed
      because it was causing an issue with the LED release code. Devres
      devices are released after the driver's removal is executed. Asserting
      the reset at that point was causing timeout errors during LED release
      when it attempted to turn off the LED.
      
      To: Linus Walleij <linus.walleij@linaro.org>
      To: Alvin Šipraga <alsi@bang-olufsen.dk>
      To: Andrew Lunn <andrew@lunn.ch>
      To: Florian Fainelli <f.fainelli@gmail.com>
      To: Vladimir Oltean <olteanv@gmail.com>
      To: David S. Miller <davem@davemloft.net>
      To: Eric Dumazet <edumazet@google.com>
      To: Jakub Kicinski <kuba@kernel.org>
      To: Paolo Abeni <pabeni@redhat.com>
      To: Rob Herring <robh+dt@kernel.org>
      To: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
      To: Conor Dooley <conor+dt@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: devicetree@vger.kernel.org
      Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      
      Changes in v2:
      - Fixed commit message formatting
      - Added GROUP to LED group enum values. With that, moved the code that
        disables LED into a new function to keep the 80-column limit.
      - Dropped unused enable argument in rb8366rb_get_port_led()
      - Fixed variable order in rtl8366rb_setup_led()
      - Removed redundant led group test in rb8366rb_{g,s}et_port_led()
      - Initialize ret as 0 in rtl8366rb_setup_leds()
      - Updated comments related to LED blinking and setup
      - Link to v1: https://lore.kernel.org/r/20240310-realtek-led-v1-0-4d9813ce938e@gmail.com
      
      Changes in v1:
      - Rebased on new realtek DSA drivers
      - Improved commit messages
      - Added commit to remove the reset assert during .remove
      - Link to RFC: https://lore.kernel.org/r/20240106184651.3665-1-luizluca@gmail.com
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: realtek: add LED drivers for rtl8366rb · 32d61700
      Luiz Angelo Daros de Luca authored
      This commit introduces LED drivers for rtl8366rb, enabling LEDs to be
      described in the device tree using the same format as qca8k. Each port
      can configure up to 4 LEDs.
      
      If all LEDs in a group use the default state "keep", they will use the
      default behavior after a reset. Changing the brightness of one LED,
      either manually or by a trigger, will disable the default hardware
      trigger and switch the entire LED group to manually controlled LEDs.
      Once in this mode, there is no way to revert to hardware-controlled LEDs
      (except by resetting the switch).
      
      Software triggers function as expected with manually controlled LEDs.
      Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: realtek: do not assert reset on remove · 4f580e9a
      Luiz Angelo Daros de Luca authored
      The necessity of asserting the reset on removal was previously
      questioned, as DSA's own cleanup methods should suffice to prevent
      traffic leakage[1].
      
      When a driver has subdrivers controlled by devres, they will be
      unregistered after the main driver's .remove is executed. If it asserts
      a reset, the subdrivers will be unable to communicate with the hardware
      during their cleanup. For LEDs, this means that they will fail to turn
      off, resulting in a timeout error.
      
      [1] https://lore.kernel.org/r/20240123215606.26716-9-luizluca@gmail.com/
      Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: realtek: keep default LED state in rtl8366rb · 5edc6585
      Luiz Angelo Daros de Luca authored
      This switch family supports four LEDs for each of its six ports. Each
      LED group is composed of one of these four LEDs from all six ports. LED
      groups can be configured to display hardware information, such as link
      activity, or manually controlled through a bitmap in registers
      RTL8366RB_LED_0_1_CTRL_REG and RTL8366RB_LED_2_3_CTRL_REG.
      
      After a reset, the default LED group configuration for groups 0 to 3
      indicates, respectively, link activity, link at 1000M, 100M, and 10M, or
      RTL8366RB_LED_CTRL_REG as 0x5432. These configurations are commonly used
      for LED indications. However, the driver was replacing that
      configuration to use manually controlled LEDs (RTL8366RB_LED_FORCE)
      without providing a way for the OS to control them. The default
      configuration is deemed more useful than fixed, uncontrollable turned-on
      LEDs.
      
      The driver was enabling/disabling LEDs during port_enable/disable.
      However, these events occur when the port is administratively controlled
      (up or down) and are not related to link presence. Additionally, when a
      port N was disabled, the driver was turning off all LEDs for group N,
      not only the corresponding LED for port N in any of those 4 groups. In
      such cases, if port 0 was brought down, the LEDs for all ports in LED
      group 0 would be turned off. As another side effect, the driver was
      wrongly warning that port 5 didn't have an LED ("no LED for port 5").
      Since showing the administrative state of ports is not an orthodox way
      to use LEDs, it was not worth it to fix it and all this code was
      dropped.
      
      The code to disable LEDs was simplified to only change each LED group to
      the RTL8366RB_LED_OFF state. Registers RTL8366RB_LED_0_1_CTRL_REG and
      RTL8366RB_LED_2_3_CTRL_REG are only used when the corresponding LED
      group is configured with RTL8366RB_LED_FORCE and they don't need to be
      cleaned. The code still references an LED controlled by
      RTL8366RB_INTERRUPT_CONTROL_REG, but as of now, no test device has
      actually used it. Also, some magic numbers were replaced by macros.
      Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce dst_rt6_info() helper · e8dfd42c
      Eric Dumazet authored
      Instead of (struct rt6_info *)dst casts, we can use:
      
       #define dst_rt6_info(_ptr) \
               container_of_const(_ptr, struct rt6_info, dst)
      
      Some places were missing const qualifiers and needed them added:
      
      ip6_confirm_neigh(), ipv6_anycast_destination(),
      ipv6_unicast_destination(), has_gateway()
      
      v2: added missing parts (David Ahern)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
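The idea behind replacing a raw cast with a container_of-style helper can be sketched in user space. Everything below is a hypothetical stand-in: struct dst and struct route6 are not the kernel's dst_entry and rt6_info, and the simplified container_of() does not preserve constness the way the kernel's container_of_const() does.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct dst {
    int flags;
};

struct route6 {
    int metric;
    struct dst dst; /* deliberately not the first member */
};

/* Simplified container_of(): step back from a member pointer to the
 * structure that embeds it. Unlike container_of_const(), this sketch
 * neither preserves constness nor type-checks the member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define dst_route6(ptr) container_of(ptr, struct route6, dst)
```

A plain `(struct route6 *)d` cast is only correct while the embedded member sits at offset zero; the helper keeps working if the layout changes, and it names the relationship at each call site.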
    • bpf, docs: Clarify PC use in instruction-set.rst · 07801a24
      Dave Thaler authored
      This patch elaborates on the use of PC by expanding the PC acronym,
      explaining the units, and the relative position to which the offset
      applies.
      Signed-off-by: Dave Thaler <dthaler1968@googlemail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/bpf/20240426231126.5130-1-dthaler1968@gmail.com
    • Merge branch 'mlxsw-events-processing-performance' · fac87d32
      David S. Miller authored
      Petr Machata says:
      
      ====================
      mlxsw: Improve events processing performance
      
      Amit Cohen writes:
      
      Spectrum ASICs only support a single interrupt, which means that all the
      events are handled by one IRQ (interrupt request) handler.
      
      Currently, we schedule a tasklet to handle events in EQ, then we also use
      tasklet for CQ, SDQ and RDQ. Tasklet runs in softIRQ (software IRQ)
      context, and will be run on the same CPU which scheduled it. It means that
      today we have one CPU which handles all the packets (both network packets
      and EMADs) from hardware.
      
      The existing implementation is not efficient and can be improved.
      
      Measuring latency of EMADs in the driver (without the time in FW) shows
      that latency increases by a factor of 28 (x28) when network traffic is
      handled by the driver.
      
      Measuring throughput in CPU shows that the CPU can handle ~35% fewer packets
      of a specific flow when corrupted packets are also handled by the driver.
      In some cases these values are even worse: we measured a decrease of ~44%
      in packet rate.
      
      This can be improved if network packets and EMADs are handled in
      parallel by several CPUs and, beyond that, if different types of traffic
      are handled in parallel. We can achieve this using NAPI.
      
      This set converts the driver to process completions from hardware via NAPI.
      The idea is to add a NAPI instance per CQ (which is mapped 1:1 to SDQ/RDQ),
      which means that each DQ can be handled separately. We have a DQ for EMADs
      and DQs for each trap group (like LLDP, BGP, L3 drops, etc.). See more
      details in the commit messages.
      
      An additional improvement made as part of this set relates to ringing
      doorbells. The idea is to handle small chunks of Rx packets (which
      is also recommended when using NAPI) and ring doorbells once per chunk. This
      reduces accesses to hardware, which are expensive (time-wise) and might
      take time because of memory barriers.
      
      With this set we can see better performance.
      To summarize:
      
      EMADs latency:
      +------------------------------------------------------------------------+
      |                  | Before this set           | Now                     |
      |------------------|---------------------------|-------------------------|
      | Increased factor | x28                       | x1.5                    |
      +------------------------------------------------------------------------+
      Note that some measurements even show better latency when
      traffic is handled by the driver.
      
      Throughput:
      +------------------------------------------------------------------------+
      |             | Before this set            | Now                         |
      |-------------|----------------------------|-----------------------------|
      | Reduced     | 35%                        | 6%                          |
      | packet rate |                            |                             |
      +------------------------------------------------------------------------+
      
      Additional improvements are planned - use page pool for buffer allocations
      and avoid cache miss of each SKB using napi_build_skb().
      
      Patch set overview:
      Patches #1-#2 improve access to hardware by reducing doorbells' rings
      Patches #3-#4 are preparations for NAPI usage
      Patch #5 converts the driver to use NAPI
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: pci: Use NAPI for event processing · 3b0b3019
      Amit Cohen authored
      Spectrum ASICs only support a single interrupt, which means that all the
      events are handled by one IRQ (interrupt request) handler. Once an
      interrupt is received, we schedule a tasklet to handle events from EQ and
      then schedule tasklets to handle completions from CQs. A tasklet runs in
      softIRQ (software IRQ) context, and will be run on the same CPU which
      scheduled it. That means that today we use only one CPU to handle all the
      packets (both network packets and EMADs) from hardware.
      
      This can be improved using NAPI. The idea is to use a NAPI instance per
      CQ, which is mapped 1:1 to DQ (RDQ or SDQ). The NAPI poll method can be run
      in a kernel thread, so the driver will be able to handle WQEs on several
      CPUs. Convert the existing code to use NAPI APIs.
      
      Add a NAPI instance as part of 'struct mlxsw_pci_queue' and initialize it
      as part of CQs initialization. Set the appropriate poll method and dummy
      net device, according to queue number, similar to tasklet setup. For CQs
      which are used for completions of RDQ, use the Rx poll method and
      'napi_dev_rx', which is set as 'threaded'. It means that the Rx poll method
      will run in kernel thread context, so several RDQs will be handled in
      parallel. For CQs which are used for completions of SDQ, use the Tx poll
      method and 'napi_dev_tx'. This method will run in softIRQ context, as
      recommended in the NAPI documentation, since processing Tx packets is a
      short task.
      
      Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle the 'budget'
      argument - ignore it in the Tx poll method, as it is recommended not to limit
      Tx processing. For Rx processing, handle up to 'budget' completions.
      Return 'work_done', which is the number of completions that were handled.
      
      Handle the following cases:
      1. After processing 'budget' completions, the driver still has work to do:
         Return work-done = budget. In that case, the NAPI instance will be
         polled again (without the need to be rescheduled). Do not re-arm the
         queue, as NAPI will handle the reschedule, so we do not have to involve
         hardware to send an additional interrupt for the completions that should
         be processed.
      
      2. Event processing has been completed:
         Call napi_complete_done() to mark NAPI processing as completed, which
         means that the poll method will not be rescheduled. Re-arm the queue,
         as all completions were handled.
      
         In case the poll method handled exactly 'budget' completions, return
         work-done = budget - 1, to distinguish this from the case in which the
         driver still has completions to handle. Otherwise, return the number of
         completions that were handled.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
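The Rx budget cases listed in the commit message above can be modelled as a small decision function. This is a hypothetical user-space model, not driver code: rx_poll_model() and its parameters are invented for illustration, `pending` stands for the number of completions waiting in the CQ, `budget` is assumed to be positive, and the `*rearm` flag stands in for calling napi_complete_done() and re-arming the queue.

```c
#include <stdbool.h>

/* Hypothetical model of the Rx poll decision. Returns the value the poll
 * method would report as work_done and sets *rearm when processing is
 * complete and the queue should be re-armed. */
int rx_poll_model(int pending, int budget, bool *rearm)
{
    int work_done = pending < budget ? pending : budget;

    if (pending > budget) {
        /* Case 1: work remains after 'budget' completions. Report the
         * full budget so NAPI polls again; do not re-arm. */
        *rearm = false;
        return budget;
    }

    /* Case 2: everything was handled. Processing is marked complete and
     * the queue is re-armed. */
    *rearm = true;
    if (work_done == budget)
        work_done = budget - 1; /* must not look like case 1 */
    return work_done;
}
```

The `budget - 1` adjustment exists because returning exactly `budget` tells NAPI that more work is pending, which would contradict having completed processing and re-armed the queue.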
    • mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure · c0d92678
      Amit Cohen authored
      The next patch will set the driver to use NAPI for event processing. Then
      the tasklet mechanism will be used only for EQ. Reorganize 'mlxsw_pci_queue'
      to hold EQ and CQ attributes in a union. For now, add a tasklet for both EQ
      and CQ. This will be changed in the next patch, as 'tasklet_struct' will be
      replaced with a NAPI instance.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0d92678
    • Amit Cohen's avatar
      mlxsw: pci: Initialize dummy net devices for NAPI · 5d01ed2e
      Amit Cohen authored
      mlxsw will use NAPI for event processing in a next patch. As preparation,
      add two dummy net devices and initialize them.
      
      NAPI instance should be attached to net device. Usually each queue is used
      by a single net device in network drivers, so the mapping between net
      device to NAPI instance is intuitive. In our case, Rx queues are not per
      port, they are per trap-group. Tx queues are mapped to net devices, but we
      do not have a separate queue for each local port, several ports share the
      same queue.
      
      Use init_dummy_netdev() to initialize dummy net devices for NAPI.
      
      To run NAPI poll method in a kernel thread, the net device which NAPI
      instance is attached to should be marked as 'threaded'. It is
      recommended to handle Tx packets in softIRQ context, as usually this is
      a short task - just free the Tx packet which has been transmitted.
      Handling Rx packets is a more complicated task, so drivers can use a
      dedicated kernel thread to process them. It allows processing packets from
      different Rx queues in parallel. We would like to handle only Rx packets in
      kernel threads, which means that we will use two dummy net devices
      (one for Rx and one for Tx). Set only one of them with 'threaded' as it
      will be used for Rx processing. Do not fail in case that setting 'threaded'
      fails, as it is better to use regular softIRQ NAPI rather than preventing
      the driver from loading.
      
      Note that the net devices are initialized with init_dummy_netdev(), so
      they are not registered, which means that they will not be visible to user.
      It will not be possible to change the 'threaded' configuration from user
      space, but this is reasonable in our case, as no other configuration
      makes sense, considering that the user has no influence on the usage of
      each queue.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d01ed2e
    • Amit Cohen's avatar
      mlxsw: pci: Ring RDQ and CQ doorbells once per several completions · 6b3d015c
      Amit Cohen authored
      Currently, for each CQE in CQ, we ring CQ doorbell, then handle RDQ and
      ring RDQ doorbell. Finally we ring CQ arm doorbell - once per CQ tasklet.
      
      The idea of ringing CQ doorbell before RDQ doorbell, is to be sure that
      when we post new WQE (after RDQ is handled), there is an available CQE.
      This was done because of a hardware bug as part of
      commit c9ebea04 ("mlxsw: pci: Ring CQ's doorbell before RDQ's").
      
      There is no real reason to ring RDQ and CQ doorbells for each completion.
      It is better to handle several completions and reduce the number of rings,
      as access to hardware is expensive (time-wise) and might take time because
      of memory barriers.
      
      A previous patch changed CQ tasklet to handle up to 64 Rx packets. With
      this limitation, we can ring CQ and RDQ doorbells once per CQ tasklet.
      The counters of the doorbells are increased by the amount of packets
      that we handled, then the device will know for which completion to send
      an additional event.
      
      To avoid reordering the CQ and RDQ doorbell rings, let the tasklet also
      ring the RDQ doorbell. mlxsw_pci_cqe_rdq_handle() handles the counter but
      does not ring the doorbell.
      
      Note that with this change there is no need to copy the CQE, as we ring CQ
      doorbell only after Rx packet processing (which uses the CQE) is done.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b3d015c
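To make the saving described in the commit message above concrete, the following userspace model counts doorbell rings for one tasklet run under both schemes. It is a sketch, not driver code: `struct db_stats`, `tasklet_ring_per_completion`, `tasklet_ring_batched`, and `RX_CHUNK` are hypothetical names, the 64-completion limit is applied to both variants purely for comparison, and skipping the CQ and RDQ rings when nothing was handled is an assumption of this model.

```c
#include <assert.h>

#define RX_CHUNK 64 /* per-run Rx limit taken from the commit message */

/* Hypothetical counters for one tasklet run. */
struct db_stats {
	unsigned int cq_rings;   /* CQ doorbell writes */
	unsigned int rdq_rings;  /* RDQ doorbell writes */
	unsigned int arm_rings;  /* CQ arm doorbell writes */
	unsigned int handled;    /* completions processed */
};

/* Old scheme: ring the CQ doorbell, handle the RDQ, ring the RDQ
 * doorbell, once for every CQE. */
struct db_stats tasklet_ring_per_completion(unsigned int pending)
{
	struct db_stats s = { 0, 0, 0, 0 };

	while (s.handled < pending && s.handled < RX_CHUNK) {
		s.cq_rings++;   /* CQ doorbell first */
		/* ... the Rx packet would be processed here ... */
		s.rdq_rings++;  /* then RDQ doorbell */
		s.handled++;
	}
	s.arm_rings++;          /* arm once per tasklet run */
	return s;
}

/* New scheme: only advance the counters per CQE, then ring each
 * doorbell once, CQ before RDQ, at the end of the run. */
struct db_stats tasklet_ring_batched(unsigned int pending)
{
	struct db_stats s = { 0, 0, 0, 0 };

	while (s.handled < pending && s.handled < RX_CHUNK)
		s.handled++;    /* packet processed, counters advanced */

	assert(s.handled <= RX_CHUNK);

	if (s.handled > 0) {
		s.cq_rings++;   /* CQ doorbell first ... */
		s.rdq_rings++;  /* ... then RDQ doorbell */
	}
	s.arm_rings++;
	return s;
}
```

In this model, ten pending completions cost ten CQ and ten RDQ doorbell writes in the old scheme, and one of each in the batched scheme, with a single arm write in both.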
    • Amit Cohen's avatar
      mlxsw: pci: Handle up to 64 Rx completions in tasklet · e28d8aba
      Amit Cohen authored
      We can get many completions in one interrupt. Currently, the CQ tasklet
      handles up to half queue size completions, and then arms the hardware to
      generate additional events, which means that if there were additional
      completions that we did not handle, we will immediately get an
      additional interrupt to handle the rest.
      
      The decision to handle up to half of the queue size is arbitrary and was
      determined in 2015, when mlxsw driver was added to the kernel. One
      additional fact that should be taken into account is that while WQEs
      from RDQ are handled, the CPU that handles the tasklet is dedicated for
      this task, which means that we might hold the CPU for a long time.
      
      Handle WQEs in smaller chunks, then arm CQ doorbell to notify the hardware
      to send additional notifications. Set the chunk size to 64 as this number
      is recommended when using NAPI and the driver will use NAPI in a next patch.
      Note that for now we use ARM doorbell to retrigger CQ tasklet, but with
      NAPI it will be more efficient as software will reschedule the poll
      method and we will not involve hardware for that.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e28d8aba
    • Eric Dumazet's avatar
      ipv6: use call_rcu_hurry() in fib6_info_release() · b5327b9a
      Eric Dumazet authored
      This is a followup of commit c4e86b43 ("net: add two more
      call_rcu_hurry()")
      
      fib6_info_destroy_rcu() is calling nexthop_put() or fib6_nh_release()
      
      We must not delay it too much or risk unregister_netdevice/ref_tracker
      traces because references to netdev are not released in time.
      
      This should speed up device/netns dismantles when CONFIG_RCU_LAZY=y
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b5327b9a
    • Eric Dumazet's avatar
      inet: use call_rcu_hurry() in inet_free_ifa() · 61f5338d
      Eric Dumazet authored
      This is a followup of commit c4e86b43 ("net: add two more
      call_rcu_hurry()")
      
      Our reference to ifa->ifa_dev must be freed ASAP
      to release the reference to the netdev the same way.
      
      inet_rcu_free_ifa()
      
      	in_dev_put()
      	 -> in_dev_finish_destroy()
      	   -> netdev_put()
      
      This should speed up device/netns dismantles when CONFIG_RCU_LAZY=y
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61f5338d
    • Eric Dumazet's avatar
      net: give more chances to rcu in netdev_wait_allrefs_any() · cd42ba1c
      Eric Dumazet authored
      This came while reviewing commit c4e86b43 ("net: add two more
      call_rcu_hurry()").
      
      Paolo asked if adding one synchronize_rcu() would help.
      
      While synchronize_rcu() does not help, making sure to call
      rcu_barrier() before msleep(wait) is definitely helping
      to make sure lazy call_rcu() are completed.
      
      Instead of waiting ~100 seconds in my tests, the ref_tracker
      splats occur one time only, and netdev_wait_allrefs_any()
      latency is reduced to the strict minimum.
      
      Ideally we should audit our call_rcu() users to make sure
      no refcount (or cascading call_rcu()) is held too long,
      because rcu_barrier() is quite expensive.
      
      Fixes: 0e4be9e5 ("net: use exponential backoff in netdev_wait_allrefs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd42ba1c
    • Tanmay Patil's avatar
      net: ethernet: ti: am65-cpsw-qos: Add support to taprio for past base_time · d63394ab
      Tanmay Patil authored
      If the base-time for taprio is in the past, start the schedule at a time
      of the form "base_time + N*cycle_time" where N is the smallest possible
      integer such that the above time is in the future.
      Signed-off-by: Tanmay Patil <t-patil@ti.com>
      Signed-off-by: Chintan Vankar <c-vankar@ti.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d63394ab
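The "base_time + N*cycle_time" rule from the commit message above can be worked through with a small helper. This is a sketch under stated assumptions, not the driver's implementation: `taprio_effective_start` is a hypothetical name, all times are taken to be nanoseconds in the same clock domain with `now >= 0`, and a base time equal to `now` is treated as not being in the future.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the first schedule start time that is
 * strictly later than 'now'.
 */
int64_t taprio_effective_start(int64_t base_time, int64_t cycle_time,
			       int64_t now)
{
	int64_t n;

	assert(cycle_time > 0);

	/* A base time in the future is used unchanged (N = 0). */
	if (base_time > now)
		return base_time;

	/* Smallest N such that base_time + N * cycle_time > now.
	 * (now - base_time) is non-negative here, so the integer
	 * division rounds down as intended. */
	n = (now - base_time) / cycle_time + 1;

	return base_time + n * cycle_time;
}
```

For example, with base_time = 100, cycle_time = 10 and now = 125, the helper picks N = 3 and returns 130, the first cycle boundary after the current time.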
  2. 27 Apr, 2024 1 commit
  3. 26 Apr, 2024 11 commits