1. 01 Oct, 2021 8 commits
  2. 30 Sep, 2021 32 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · dd9a887b
      Jakub Kicinski authored
      drivers/net/phy/bcm7xxx.c
        d88fd1b5 ("net: phy: bcm7xxx: Fixed indirect MMD operations")
        f68d08c4 ("net: phy: bcm7xxx: Add EPHY entry for 72165")
      
      net/sched/sch_api.c
        b193e15a ("net: prevent user from passing illegal stab size")
        69508d43 ("net_sched: Use struct_size() and flex_array_size() helpers")
      
      Both cases trivial - adjacent code additions.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dd9a887b
    • Merge tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 4de593fb
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes, including fixes from mac80211, netfilter and bpf.
      
        Current release - regressions:
      
         - bpf, cgroup: assign cgroup in cgroup_sk_alloc when called from
           interrupt
      
         - mdio: revert mechanical patches which broke handling of optional
           resources
      
         - dev_addr_list: prevent address duplication
      
        Previous releases - regressions:
      
         - sctp: break out if skb_header_pointer returns NULL in sctp_rcv_ootb
           (NULL deref)
      
         - Revert "mac80211: do not use low data rates for data frames with no
           ack flag", fixing broadcast transmissions
      
         - mac80211: fix use-after-free in CCMP/GCMP RX
      
         - netfilter: include zone id in tuple hash again, minimize collisions
      
         - netfilter: nf_tables: unlink table before deleting it (race -> UAF)
      
         - netfilter: log: work around missing softdep backend module
      
         - mptcp: don't return sockets in foreign netns
      
         - sched: flower: protect fl_walk() with rcu (race -> UAF)
      
         - ixgbe: fix NULL pointer dereference in ixgbe_xdp_setup
      
         - smsc95xx: fix stalled rx after link change
      
         - enetc: fix the incorrect clearing of IF_MODE bits
      
         - ipv4: fix rtnexthop len when RTA_FLOW is present
      
         - dsa: mv88e6xxx: 6161: use correct MAX MTU config method for this
           SKU
      
         - e100: fix length calculation & buffer overrun in ethtool::get_regs
      
        Previous releases - always broken:
      
         - mac80211: fix using stale frag_tail skb pointer in A-MSDU tx
      
         - mac80211: drop frames from invalid MAC address in ad-hoc mode
      
         - af_unix: fix races in sk_peer_pid and sk_peer_cred accesses (race
           -> UAF)
      
         - bpf, x86: Fix bpf mapping of atomic fetch implementation
      
         - bpf: handle return value of BPF_PROG_TYPE_STRUCT_OPS prog
      
         - netfilter: ip6_tables: zero-initialize fragment offset
      
         - mhi: fix error path in mhi_net_newlink
      
         - af_unix: return errno instead of NULL in unix_create1() when over
           the fs.file-max limit
      
        Misc:
      
         - bpf: exempt CAP_BPF from checks against bpf_jit_limit
      
         - netfilter: conntrack: make max chain length random, prevent
           guessing buckets by attackers
      
         - netfilter: nf_nat_masquerade: make async masq_inet6_event handling
           generic, defer conntrack walk to work queue (prevent hogging RTNL
           lock)"
      
      * tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
        af_unix: fix races in sk_peer_pid and sk_peer_cred accesses
        net: stmmac: fix EEE init issue when paired with EEE capable PHYs
        net: dev_addr_list: handle first address in __hw_addr_add_ex
        net: sched: flower: protect fl_walk() with rcu
        net: introduce and use lock_sock_fast_nested()
        net: phy: bcm7xxx: Fixed indirect MMD operations
        net: hns3: disable firmware compatible features when uninstall PF
        net: hns3: fix always enable rx vlan filter problem after selftest
        net: hns3: PF enable promisc for VF when mac table is overflow
        net: hns3: fix show wrong state when add existing uc mac address
        net: hns3: fix mixed flag HCLGE_FLAG_MQPRIO_ENABLE and HCLGE_FLAG_DCB_ENABLE
        net: hns3: don't rollback when destroy mqprio fail
        net: hns3: remove tc enable checking
        net: hns3: do not allow call hns3_nic_net_open repeatedly
        ixgbe: Fix NULL pointer dereference in ixgbe_xdp_setup
        net: bridge: mcast: Associate the seqcount with its protecting lock.
        net: mdio-ipq4019: Fix the error for an optional regs resource
        net: hns3: fix hclge_dbg_dump_tm_pg() stack usage
        net: mdio: mscc-miim: Fix the mdio controller
        af_unix: Return errno instead of NULL in unix_create1().
        ...
      4de593fb
    • net/mlx5e: Mutually exclude setting of TX-port-TS and MQPRIO in channel mode · 3bf1742f
      Aya Levin authored
      TX-port-TS hijacks the PTP traffic to a specific HW TX-queue. This
      conflicts with MQPRIO in channel mode, which specifies explicitly which
      TC accepts the packet. This patch mutually excludes the above
      configuration.
      
      Fixes: ec60c458 ("net/mlx5e: Support MQPRIO channel mode")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      3bf1742f
    • net/mlx5e: Fix the presented RQ index in PTP stats · dd1979cf
      Lama Kayal authored
      The title format of the PTP-RQ counters contains a PTP-RQ identifier,
      which is mistakenly not passed to sprintf().
      As a result, garbage values are printed in place of the identifier.
      This patch fixes it.
      
      Before applying the patch:
      ethtool -S eth3 | grep ptp_rq
           ptp_rq15_packets: 0
           ptp_rq8_bytes: 0
           ptp_rq6_csum_complete: 0
           ptp_rq14_csum_complete_tail: 0
           ptp_rq3_csum_complete_tail_slow : 0
           ptp_rq9_csum_unnecessary: 0
           ptp_rq1_csum_unnecessary_inner: 0
           ptp_rq7_csum_none: 0
           ptp_rq10_xdp_drop: 0
           ptp_rq9_xdp_redirect: 0
           ptp_rq13_lro_packets: 0
           ptp_rq12_lro_bytes: 0
           ptp_rq10_ecn_mark: 0
           ptp_rq9_removed_vlan_packets: 0
           ptp_rq5_wqe_err: 0
           ptp_rq8_mpwqe_filler_cqes: 0
           ptp_rq2_mpwqe_filler_strides: 0
           ptp_rq5_oversize_pkts_sw_drop: 0
           ptp_rq6_buff_alloc_err: 0
           ptp_rq15_cqe_compress_blks: 0
           ptp_rq2_cqe_compress_pkts: 0
           ptp_rq2_cache_reuse: 0
           ptp_rq12_cache_full: 0
           ptp_rq11_cache_empty: 256
           ptp_rq12_cache_busy: 0
           ptp_rq11_cache_waive: 0
           ptp_rq12_congst_umr: 0
           ptp_rq11_arfs_err: 0
           ptp_rq9_recover: 0
      
      After applying the patch:
      ethtool -S eth3 | grep ptp_rq
           ptp_rq0_packets: 0
           ptp_rq0_bytes: 0
           ptp_rq0_csum_complete: 0
           ptp_rq0_csum_complete_tail: 0
           ptp_rq0_csum_complete_tail_slow : 0
           ptp_rq0_csum_unnecessary: 0
           ptp_rq0_csum_unnecessary_inner: 0
           ptp_rq0_csum_none: 0
           ptp_rq0_xdp_drop: 0
           ptp_rq0_xdp_redirect: 0
           ptp_rq0_lro_packets: 0
           ptp_rq0_lro_bytes: 0
           ptp_rq0_ecn_mark: 0
           ptp_rq0_removed_vlan_packets: 0
           ptp_rq0_wqe_err: 0
           ptp_rq0_mpwqe_filler_cqes: 0
           ptp_rq0_mpwqe_filler_strides: 0
           ptp_rq0_oversize_pkts_sw_drop: 0
           ptp_rq0_buff_alloc_err: 0
           ptp_rq0_cqe_compress_blks: 0
           ptp_rq0_cqe_compress_pkts: 0
           ptp_rq0_cache_reuse: 0
           ptp_rq0_cache_full: 0
           ptp_rq0_cache_empty: 256
           ptp_rq0_cache_busy: 0
           ptp_rq0_cache_waive: 0
           ptp_rq0_congst_umr: 0
           ptp_rq0_arfs_err: 0
           ptp_rq0_recover: 0
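
The bug class described in this message can be sketched in plain C. This is a minimal illustration, not the driver's code: the format string `PTP_RQ_STAT_FMT` and the helper `format_ptp_rq_stat_name()` are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical title format: the "%d" slot is the PTP-RQ identifier. */
#define PTP_RQ_STAT_FMT "ptp_rq%d_%s"

/*
 * Build a counter title such as "ptp_rq0_packets".
 * Every conversion in the format needs a matching argument; leaving out
 * rq_index is undefined behaviour and can print arbitrary numbers.
 */
static int format_ptp_rq_stat_name(char *buf, size_t len, int rq_index,
				   const char *counter)
{
	return snprintf(buf, len, PTP_RQ_STAT_FMT, rq_index, counter);
}
```

Calling the helper with index 0 and the counter name "packets" yields the title shown in the "after" listing.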
      
      Fixes: a28359e9 ("net/mlx5e: Add PTP-RX statistics")
      Signed-off-by: Lama Kayal <lkayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      dd1979cf
    • net/mlx5: Fix setting number of EQs of SFs · f88c4876
      Shay Drory authored
      When setting the number of completion EQs of the SF, consider the number
      of online CPUs.
      Without this consideration, when fewer than 8 CPUs are online, 8
      completion EQs are allocated unnecessarily.
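
One way to read the rule above is as a simple clamp. The sketch below assumes the cap of 8 mentioned in the message; `SF_MAX_COMP_EQS` and `sf_num_comp_eqs()` are hypothetical names, not the driver's.

```c
/* Assumed cap, taken from the message above: at most 8 completion EQs. */
#define SF_MAX_COMP_EQS 8u

/* Never allocate more completion EQs than there are online CPUs. */
static unsigned int sf_num_comp_eqs(unsigned int online_cpus)
{
	return online_cpus < SF_MAX_COMP_EQS ? online_cpus : SF_MAX_COMP_EQS;
}
```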
      
      Fixes: c36326d3 ("net/mlx5: Round-Robin EQs over IRQs")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      f88c4876
    • net/mlx5: Fix length of irq_index in chars · ac8b7d50
      Shay Drory authored
      The maximum irq_index can be 2047. This means irq_name should have 4
      characters reserved for the irq_index. Hence, increase it to 4.
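
The arithmetic can be checked with a small helper. `decimal_width()` is a hypothetical name; it relies on the C99 guarantee that `snprintf()` with a zero-sized buffer returns the number of characters that would have been written.

```c
#include <stdio.h>

/* Number of decimal characters needed to print a non-negative index. */
static int decimal_width(unsigned int value)
{
	return snprintf(NULL, 0, "%u", value);
}
```

For the maximum index of 2047 quoted above, the helper reports four characters.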
      
      Fixes: 3af26495 ("net/mlx5: Enlarge interrupt field in CREATE_EQ")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      ac8b7d50
    • net/mlx5: Avoid generating event after PPS out in Real time mode · 99b9a678
      Aya Levin authored
      When in Real-time mode, HW clock is synced with the PTP daemon. Hence
      driver should not re-calibrate the next pulse (via MTPPSE repetitive
      events mechanism).
      
      This patch arms repetitive events only in free-running mode.
      
      Fixes: 432119de ("net/mlx5: Add cyc2time HW translation mode support")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      99b9a678
    • net/mlx5: Force round second at 1PPS out start time · 64728294
      Aya Levin authored
      Allow configuration of 1PPS start time only with time-stamp representing
      a round second. Prior to this patch driver allowed setting of a
      non-round-second which is not supported by the device. Avoid unexpected
      behavior by restricting start-time configuration to a round-second.
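
A round-second restriction of this kind can be sketched as a check on the nanoseconds part of the timestamp. The helper name `pps_start_is_round_second()` is hypothetical, and `struct timespec` stands in for whatever timestamp type the driver actually uses.

```c
#include <stdbool.h>
#include <time.h>

/* Accept a start time only if it falls exactly on a second boundary. */
static bool pps_start_is_round_second(const struct timespec *start)
{
	return start->tv_nsec == 0;
}
```

A caller would reject the configuration request when the check fails instead of programming the device.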
      
      Fixes: 4272f9b8 ("net/mlx5e: Change 1PPS out scheme")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      64728294
    • net/mlx5: E-Switch, Fix double allocation of acl flow counter · a586775f
      Moshe Shemesh authored
      Flow counter is allocated in eswitch legacy acl setting functions
      without checking if already allocated by previous setting. Add a check
      to avoid such double allocation.
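
The guard described above follows a common pattern: allocate only when nothing was allocated by a previous setting. The sketch below uses hypothetical names (`struct acl_sketch`, `acl_setup_counter()`) and `calloc()` as a stand-in for the flow counter allocation.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the per-vport ACL state. */
struct acl_sketch {
	int *counter; /* NULL until a counter has been allocated */
};

/* Allocate the counter once; later calls reuse the existing one. */
static int acl_setup_counter(struct acl_sketch *acl)
{
	if (acl->counter)
		return 0; /* already allocated by a previous setting */

	acl->counter = calloc(1, sizeof(*acl->counter));
	return acl->counter ? 0 : -1;
}
```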
      
      Fixes: 07bab950 ("net/mlx5: E-Switch, Refactor eswitch ingress acl codes")
      Fixes: ea651a86 ("net/mlx5: E-Switch, Refactor eswitch egress acl codes")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      a586775f
    • net/mlx5e: Improve MQPRIO resiliency · 7dbc849b
      Tariq Toukan authored
      * Add netdev->tc_to_txq rollback in case of failure in
        mlx5e_update_netdev_queues().
      * Fix broken transition between the two modes:
        MQPRIO DCB mode with tc==8, and MQPRIO channel mode.
      * Disable MQPRIO channel mode if re-attaching with a different number
        of channels.
      * Improve code sharing.
      
      Fixes: ec60c458 ("net/mlx5e: Support MQPRIO channel mode")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      7dbc849b
    • net/mlx5e: Keep the value for maximum number of channels in-sync · 9d758d4a
      Tariq Toukan authored
      The value for maximum number of channels is first calculated based
      on the netdev's profile and current function resources (specifically,
      number of MSIX vectors, which depends among other things on the number
      of online cores in the system).
      This value is then used to calculate the netdev's number of rxqs/txqs.
      Once created (by alloc_etherdev_mqs), the number of netdev's rxqs/txqs
      is constant and we must not exceed it.
      
      To achieve this, keep the maximum number of channels in sync upon any
      netdevice re-attach.
      
      Use mlx5e_get_max_num_channels() for calculating the number of netdev's
      rxqs/txqs. After netdev is created, use mlx5e_calc_max_nch() (which
      considers core device resources, profile, and netdev) to init or
      update priv->max_nch.
      
      Before this patch, the value of priv->max_nch might get out of sync,
      mistakenly allowing accesses to out-of-bounds objects, which would
      crash the system.
      
      Track the number of channels stats structures used in a separate
      field, as they are persistent to suspend/resume operations. All the
      collected stats of every channel index that ever existed should be
      preserved. They are reset only when struct mlx5e_priv is,
      in mlx5e_priv_cleanup(), which is part of the profile changing flow.
      
      There is no point anymore in blocking a profile change due to max_nch
      mismatch in mlx5e_netdev_change_profile(). Remove the limitation.
      
      Fixes: a1f240f1 ("net/mlx5e: Adjust to max number of channles when re-attaching")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      9d758d4a
    • net/mlx5e: IPSEC RX, enable checksum complete · f9a10440
      Raed Salem authored
      Currently, in the Rx data path, IPsec crypto offloaded packets use the
      csum_none flag, so the checksum is handled by the stack. This naturally
      has some performance/CPU utilization impact on such flows. As Nvidia
      NICs starting from ConnectX6DX provide the checksum complete value out
      of the box for such flows as well, there is no sense in taking the
      csum_none path. Furthermore, the stack (xfrm) has a method to handle
      checksum complete corrections for such flows, i.e. IPsec trailer removal
      and the consequent checksum value adjustment.
      
      Because of the above, and because ConnectX6DX is the first HW that
      supports IPsec crypto offload, it is safe to report csum complete for
      IPsec offloaded traffic.
      
      Fixes: b2ac7541 ("net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offload")
      Signed-off-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      f9a10440
    • Merge tag 'gpio-fixes-for-v5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 115f6134
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
       "A single fix for the gpio-pca953x driver and two commits updating the
        MAINTAINERS entries for Mun Yew Tham (GPIO specific) and myself
        (treewide after a change in professional situation).
      
        Summary:
      
         - don't ignore I2C errors in gpio-pca953x
      
         - update MAINTAINERS entries for Mun Yew Tham and myself"
      
      * tag 'gpio-fixes-for-v5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        MAINTAINERS: Update Mun Yew Tham as Altera Pio Driver maintainer
        MAINTAINERS: update my email address
        gpio: pca953x: do not ignore i2c errors
      115f6134
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 78c56e53
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "Not much too exciting here, although two syzkaller bugs that seem to
        have 9 lives may have finally been squashed.
      
        Several core bugs and a batch of driver bug fixes:
      
         - Fix compilation problems in qib and hfi1
      
         - Do not corrupt the joined multicast group state when using
           SEND_ONLY
      
         - Several CMA bugs, a reference leak for listening and two syzkaller
           crashers
      
         - Various bug fixes for irdma
      
         - Fix a Sleeping while atomic bug in usnic
      
         - Properly sanitize kernel pointers in dmesg
      
         - Two bugs in the 64b CQE support for hns"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/hns: Add the check of the CQE size of the user space
        RDMA/hns: Fix the size setting error when copying CQE in clean_cq()
        RDMA/hfi1: Fix kernel pointer leak
        RDMA/usnic: Lock VF with mutex instead of spinlock
        RDMA/hns: Work around broken constant propagation in gcc 8
        RDMA/cma: Ensure rdma_addr_cancel() happens before issuing more requests
        RDMA/cma: Do not change route.addr.src_addr.ss_family
        RDMA/irdma: Report correct WC error when there are MW bind errors
        RDMA/irdma: Report correct WC error when transport retry counter is exceeded
        RDMA/irdma: Validate number of CQ entries on create CQ
        RDMA/irdma: Skip CQP ring during a reset
        MAINTAINERS: Update Broadcom RDMA maintainers
        RDMA/cma: Fix listener leak in rdma_cma_listen_on_all() failure
        IB/cma: Do not send IGMP leaves for sendonly Multicast groups
        IB/qib: Fix clang confusion of NULL pointer comparison
      78c56e53
    • af_unix: fix races in sk_peer_pid and sk_peer_cred accesses · 35306eb2
      Eric Dumazet authored
      Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
      are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
      
      In order to fix this issue, this patch adds a new spinlock that needs
      to be used whenever these fields are read or written.
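
The locking discipline can be illustrated in userspace. This is a sketch under stated assumptions, not the kernel change: `struct peer_sketch` and its helpers are hypothetical, and a C11 `atomic_flag` spin loop stands in for the kernel spinlock.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the two peer fields and their lock. */
struct peer_sketch {
	atomic_flag lock; /* plays the role of the new spinlock */
	int pid;
	int uid;
};

static void peer_lock(struct peer_sketch *p)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;
}

static void peer_unlock(struct peer_sketch *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Writers update both fields under the lock so readers never see a mix. */
static void peer_update(struct peer_sketch *p, int pid, int uid)
{
	peer_lock(p);
	p->pid = pid;
	p->uid = uid;
	peer_unlock(p);
}

/* Readers take the same lock and copy out a consistent snapshot. */
static void peer_read(struct peer_sketch *p, int *pid, int *uid)
{
	peer_lock(p);
	*pid = p->pid;
	*uid = p->uid;
	peer_unlock(p);
}
```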
      
      Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
      reading sk->sk_peer_pid which makes no sense, as this field
      is only possibly set by AF_UNIX sockets.
      We will have to clean this in a separate patch.
      This could be done by reverting b48596d1 "Bluetooth: L2CAP: Add get_peer_pid callback"
      or implementing what was truly expected.
      
      Fixes: 109f6e39 ("af_unix: Allow SO_PEERCRED to work across namespaces.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35306eb2
    • Merge branch 'snmp-optimizations' · b0517302
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: snmp: minor optimizations
      
      Fetching many SNMP counters on hosts with large number of cpus
      takes a lot of time. mptcp still uses the old non-batched
      fashion which is not cache friendly.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0517302
    • mptcp: use batch snmp operations in mptcp_seq_show() · acbd0c81
      Eric Dumazet authored
      Using snmp_get_cpu_field_batch() allows for better cpu cache
      utilization, especially on hosts with large number of cpus.
      
      Also remove special handling when mptcp mibs were not yet
      allocated.
      
      I chose to use temporary storage on the stack to keep this patch simple.
      We might in the future use the storage allocated in netstat_seq_show().
      
      Combined with prior patch (inlining snmp_get_cpu_field)
      time to fetch and output mptcp counters on a 256 cpu host [1]
      goes from 75 usec to 16 usec.
      
      [1] L1 cache size is 32KB, it is not big enough to hold all dataset.
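
The difference between the two access patterns can be sketched with a flat array of per-CPU counter blocks. This is an illustration only: the layout, `SKETCH_NR_FIELDS`, and both helper names are assumptions, and the real snmp_get_cpu_field_batch() is not reproduced here.

```c
#include <stddef.h>

#define SKETCH_NR_FIELDS 4 /* hypothetical number of MIB fields */

/*
 * Non-batched: one walk over all CPUs per field, so consecutive reads
 * land in different CPUs' blocks.
 */
static unsigned long sum_one_field(const unsigned long *percpu, size_t ncpus,
				   size_t field)
{
	unsigned long sum = 0;

	for (size_t cpu = 0; cpu < ncpus; cpu++)
		sum += percpu[cpu * SKETCH_NR_FIELDS + field];
	return sum;
}

/*
 * Batched: visit each CPU's block once and accumulate all of its fields
 * into a caller-provided buffer (for example, temporary stack storage).
 */
static void sum_all_fields_batched(const unsigned long *percpu, size_t ncpus,
				   unsigned long out[SKETCH_NR_FIELDS])
{
	for (size_t f = 0; f < SKETCH_NR_FIELDS; f++)
		out[f] = 0;

	for (size_t cpu = 0; cpu < ncpus; cpu++)
		for (size_t f = 0; f < SKETCH_NR_FIELDS; f++)
			out[f] += percpu[cpu * SKETCH_NR_FIELDS + f];
}
```

Both helpers produce the same totals; only the order in which memory is touched differs.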
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      acbd0c81
    • net: snmp: inline snmp_get_cpu_field() · 59f09ae8
      Eric Dumazet authored
      This trivial function is called ~90,000 times on 256 cpus hosts,
      when reading /proc/net/netstat. And this number keeps inflating.
      
      Inlining it saves many cycles.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59f09ae8
    • net/mlx4_en: Add XDP_REDIRECT statistics · dee3b2d0
      Joshua Roys authored
      Add counters for XDP REDIRECT success and failure. This brings the
      redirect path in line with metrics gathered via the other XDP paths.
      Signed-off-by: Joshua Roys <roysjosh@gmail.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dee3b2d0
    • net: stmmac: fix EEE init issue when paired with EEE capable PHYs · 656ed8b0
      Wong Vee Khee authored
      When STMMAC is paired with an Energy-Efficient Ethernet (EEE) capable
      PHY, and the PHY is advertising EEE by default, we need to enable EEE on
      the xPCS side too, instead of having the user manually trigger the
      enabling config via ethtool.
      
      Fix this by adding an xpcs_config_eee() call in stmmac_eee_init().
      
      Fixes: 7617af3d ("net: pcs: Introducing support for DWC xpcs Energy Efficient Ethernet")
      Cc: Michael Sit Wei Hong <michael.wei.hong.sit@intel.com>
      Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      656ed8b0
    • ixgbe: let the xdpdrv work with more than 64 cpus · 4fe81585
      Jason Xing authored
      Originally, the ixgbe driver does not allow xdpdrv to be attached if
      the server has more than 64 cpus online: loading xdpdrv fails with
      "NOMEM".
      
      We can adjust the algorithm and make it work by mapping the current
      cpu to some xdp ring under the protection of @tx_lock.
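
A minimal sketch of such a mapping follows. The modulo scheme is an assumption made for illustration (the message only says the cpu is mapped to "some" ring), and both helper names are hypothetical.

```c
#include <stdbool.h>

/* A lock is only needed when several CPUs can end up sharing one ring. */
static bool xdp_rings_shared(unsigned int nr_cpus, unsigned int nr_xdp_rings)
{
	return nr_cpus > nr_xdp_rings;
}

/* Map a CPU id onto an existing ring; CPUs beyond the ring count wrap. */
static unsigned int xdp_ring_for_cpu(unsigned int cpu, unsigned int nr_xdp_rings)
{
	return cpu % nr_xdp_rings;
}
```

When rings are shared, the transmit path would take the ring's lock around each enqueue; when every CPU has its own ring, the lock can be skipped.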
      
      Here are some numbers before/after applying this patch with xdp-example
      loaded on the eth0X:
      
      As client (tx path):
                           Before    After
      TCP_STREAM send-64   734.14    714.20
      TCP_STREAM send-128  1401.91   1395.05
      TCP_STREAM send-512  5311.67   5292.84
      TCP_STREAM send-1k   9277.40   9356.22 (not stable)
      TCP_RR     send-1    22559.75  21844.22
      TCP_RR     send-128  23169.54  22725.13
      TCP_RR     send-512  21670.91  21412.56
      
      As server (rx path):
                           Before    After
      TCP_STREAM send-64   1416.49   1383.12
      TCP_STREAM send-128  3141.49   3055.50
      TCP_STREAM send-512  9488.73   9487.44
      TCP_STREAM send-1k   9491.17   9356.22 (not stable)
      TCP_RR     send-1    23617.74  23601.60
      ...
      
      Notice: the TCP_RR mode is unstable as the official document explains.
      
      I tested many times with different parameter combinations through
      netperf. Though the results are not that accurate, I cannot see much
      influence from this patch. The static key is placed on the hot path, but
      it theoretically shouldn't cause a huge regression.
      Co-developed-by: Shujin Li <lishujin@kuaishou.com>
      Signed-off-by: Shujin Li <lishujin@kuaishou.com>
      Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fe81585
    • Merge branch 'SO_RESEVED_MEM' · a3e4abac
      David S. Miller authored
      Wei Wang says:
      
      ====================
      net: add new socket option SO_RESERVE_MEM
      
      This patch series introduces a new socket option SO_RESERVE_MEM.
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends fewer cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, with the cost of an amount of pre-allocated and
      unreclaimable memory, even under memory pressure.
      With a tcp_stream test with 10 flows running on a simulated 100ms RTT
      link, I can see the cycles spent in __sk_mem_raise_allocated() dropping
      by ~0.02%. Not a whole lot, since we already have logic in
      sk_mem_uncharge() to only reclaim 1MB when sk_forward_alloc has more
      than 2MB free space. But on a system constantly suffering memory
      pressure, the savings should be more.
      
      The first patch is the implementation of this socket option. The
      following 2 patches change the tcp stack to make use of this reserved
      memory when under memory pressure. This makes the tcp stack behavior
      more flexible when under memory pressure, and provides a way for user to
      control the distribution of the memory among its sockets.
      With a TCP connection on a simulated 100ms RTT link, the default
      throughput under memory pressure is ~500Kbps. With SO_RESERVE_MEM set to
      100KB, the throughput under memory pressure goes up to ~3.5Mbps.
      
      Change since v2:
      - Added description for new field added in struct sock in patch 1
      Change since v1:
      - Added performance stats in cover letter and rebased
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3e4abac
    • tcp: adjust rcv_ssthresh according to sk_reserved_mem · 053f3684
      Wei Wang authored
      When user sets SO_RESERVE_MEM socket option, in order to utilize the
      reserved memory when in memory pressure state, we adjust rcv_ssthresh
      according to the available reserved memory for the socket, instead of
      using 4 * advmss always.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      053f3684
    • tcp: adjust sndbuf according to sk_reserved_mem · ca057051
      Wei Wang authored
      If user sets SO_RESERVE_MEM socket option, in order to fully utilize the
      reserved memory in memory pressure state on the tx path, we modify the
      logic in sk_stream_moderate_sndbuf() to set sk_sndbuf according to
      available reserved memory, instead of MIN_SOCK_SNDBUF, and adjust it
      when new data is acked.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca057051
    • net: add new socket option SO_RESERVE_MEM · 2bb2f5fb
      Wei Wang authored
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends fewer cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, with the cost of an amount of pre-allocated and
      unreclaimable memory, even under memory pressure.
      
      Note:
      This socket option is only available when memory cgroup is enabled and we
      require this reserved memory to be charged to the user's memcg. We hope
      this prevents misbehaving users from abusing this feature to reserve a
      large amount on certain sockets and cause unfairness for others.
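
From userspace, the option would be set with an ordinary setsockopt() call at the SOL_SOCKET level. The sketch below is hedged: the wrapper name `reserve_socket_memory()` is hypothetical, the fallback value 73 is the asm-generic definition and may differ on some architectures, and the call is expected to fail on kernels or configurations without support.

```c
#include <sys/socket.h>

#ifndef SO_RESERVE_MEM
/* Assumed fallback for older headers: the asm-generic value. */
#define SO_RESERVE_MEM 73
#endif

/*
 * Ask the kernel to reserve `bytes` of memory for the socket.
 * Returns 0 on success, or -1 with errno set (for example when the
 * running kernel lacks the option or memory cgroup is not enabled).
 */
static int reserve_socket_memory(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM, &bytes, sizeof(bytes));
}
```

A caller should treat failure as non-fatal and continue without the reservation.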
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2bb2f5fb
    • net: dev_addr_list: handle first address in __hw_addr_add_ex · a5b8fd65
      Jakub Kicinski authored
      struct dev_addr_list is used for device addresses, unicast addresses
      and multicast addresses. The first of those needs special handling
      of the main address - netdev->dev_addr points directly to the data
      of the entry and drivers write to it freely, so we can't maintain
      it in the rbtree (for now, at least, to be fixed in net-next).
      
      The current workaround sprinkles special handling of the first
      address on the list throughout the code but it missed the case
      where an address is being added. The first address will not be visible
      during subsequent adds.
      
      Syzbot found a warning where unicast addresses are modified
      without holding the rtnl lock, tl;dr is that team generates
      the same modification multiple times, not necessarily when
      right locks are held.
      
      In the repro we have:
      
        macvlan -> team -> veth
      
      macvlan adds a unicast address to the team. Team then pushes
      that address down to its members (veths). Next something unrelated
      makes team sync member addrs again, and because of the bug
      the addr entries get duplicated in the veths. macvlan gets
      removed, removes its addr from team which removes only one
      of the duplicated addresses from veths. This removal is done
      under rtnl. Next syzbot uses iptables to add a multicast addr
      to team (which does not hold rtnl lock). Team syncs veth addrs,
      but because veths' unicast list still has the duplicate it will
      also get synced, even though this update is intended for mc addresses.
      Again, uc address updates need rtnl lock, boom.
      
      Reported-by: syzbot+7a2ab2cdc14d134de553@syzkaller.appspotmail.com
      Fixes: 406f42fa ("net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically.")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5b8fd65
    • Russell King's avatar
      net: phy: marvell10g: add downshift tunable support · 4075a6a0
      Russell King authored
      Add support for the downshift tunable for the Marvell 88x3310 PHY.
      Downshift is only usable with firmware 0.3.5.0 and later.
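
A firmware gate of this kind can be sketched as a packed version comparison. This is a minimal illustration, not the driver's code: `fw_version` and `downshift_supported` are hypothetical helpers, and the sketch assumes each of the four version components fits in 8 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a four-part version a.b.c.d into one integer so that ordinary
 * integer comparison orders versions correctly (assumes each part <= 255). */
uint32_t fw_version(unsigned int a, unsigned int b, unsigned int c, unsigned int d)
{
	assert(a <= 255 && b <= 255 && c <= 255 && d <= 255);
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | (uint32_t)d;
}

/* Hypothetical gate: downshift is treated as usable from 0.3.5.0 onwards. */
bool downshift_supported(uint32_t fw)
{
	return fw >= fw_version(0, 3, 5, 0);
}
```

With this packing, `fw_version(0, 3, 4, 255)` is rejected while `fw_version(0, 3, 5, 0)` and anything newer is accepted.
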
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4075a6a0
    • Vlad Buslov's avatar
      net: sched: flower: protect fl_walk() with rcu · d5ef1906
      Vlad Buslov authored
      The patch that refactored fl_walk() to use idr_for_each_entry_continue_ul()
      also removed rcu protection of individual filters, which causes the following
      use-after-free when a filter is deleted concurrently. Fix fl_walk() to obtain
      the rcu read lock while iterating and taking the filter reference, and to
      temporarily release the lock while calling the arg->fn() callback, which can sleep.
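
The locking pattern described above can be sketched in a single-threaded userspace model. Everything here is a hypothetical stand-in: `model_read_lock()`/`model_read_unlock()` only count nesting depth in place of the rcu read lock, and `filter_get()` uses a plain integer where kernel code would use an atomic helper such as `refcount_inc_not_zero()`. The sketch shows the ordering only: take the reference under the lock, drop the lock around the callback, then put the reference and re-take the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the read-side lock; they only track depth. */
int read_lock_depth;

void model_read_lock(void)
{
	read_lock_depth++;
}

void model_read_unlock(void)
{
	assert(read_lock_depth > 0);
	read_lock_depth--;
}

struct filter {
	int refcnt;	/* 0 models a filter that is already being torn down */
	int handle;
};

/* Take a reference only if the object is still live. */
bool filter_get(struct filter *f)
{
	if (f->refcnt == 0)
		return false;
	f->refcnt++;
	return true;
}

void filter_put(struct filter *f)
{
	assert(f->refcnt > 0);
	f->refcnt--;
}

typedef int (*walk_fn)(struct filter *f, void *ctx);

/* Walk all live filters; returns the number of filters visited. */
int walk_filters(struct filter *tbl, size_t n, walk_fn fn, void *ctx)
{
	int visited = 0;

	model_read_lock();
	for (size_t i = 0; i < n; i++) {
		struct filter *f = &tbl[i];
		int stop;

		if (!filter_get(f))	/* skip filters that are going away */
			continue;
		model_read_unlock();	/* the callback may sleep */
		stop = fn(f, ctx);
		filter_put(f);
		model_read_lock();
		visited++;
		if (stop)
			break;
	}
	model_read_unlock();
	return visited;
}

/* Example callback: checks it runs outside the read lock, sums handles. */
int sum_handles(struct filter *f, void *ctx)
{
	assert(read_lock_depth == 0);
	*(int *)ctx += f->handle;
	return 0;
}
```

The reference held across the callback is what keeps the filter alive once the read lock has been dropped; the model does not attempt to reproduce concurrency or deferred freeing.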
      
      KASAN trace:
      
      [  352.773640] ==================================================================
      [  352.775041] BUG: KASAN: use-after-free in fl_walk+0x159/0x240 [cls_flower]
      [  352.776304] Read of size 4 at addr ffff8881c8251480 by task tc/2987
      
      [  352.777862] CPU: 3 PID: 2987 Comm: tc Not tainted 5.15.0-rc2+ #2
      [  352.778980] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [  352.781022] Call Trace:
      [  352.781573]  dump_stack_lvl+0x46/0x5a
      [  352.782332]  print_address_description.constprop.0+0x1f/0x140
      [  352.783400]  ? fl_walk+0x159/0x240 [cls_flower]
      [  352.784292]  ? fl_walk+0x159/0x240 [cls_flower]
      [  352.785138]  kasan_report.cold+0x83/0xdf
      [  352.785851]  ? fl_walk+0x159/0x240 [cls_flower]
      [  352.786587]  kasan_check_range+0x145/0x1a0
      [  352.787337]  fl_walk+0x159/0x240 [cls_flower]
      [  352.788163]  ? fl_put+0x10/0x10 [cls_flower]
      [  352.789007]  ? __mutex_unlock_slowpath.constprop.0+0x220/0x220
      [  352.790102]  tcf_chain_dump+0x231/0x450
      [  352.790878]  ? tcf_chain_tp_delete_empty+0x170/0x170
      [  352.791833]  ? __might_sleep+0x2e/0xc0
      [  352.792594]  ? tfilter_notify+0x170/0x170
      [  352.793400]  ? __mutex_unlock_slowpath.constprop.0+0x220/0x220
      [  352.794477]  tc_dump_tfilter+0x385/0x4b0
      [  352.795262]  ? tc_new_tfilter+0x1180/0x1180
      [  352.796103]  ? __mod_node_page_state+0x1f/0xc0
      [  352.796974]  ? __build_skb_around+0x10e/0x130
      [  352.797826]  netlink_dump+0x2c0/0x560
      [  352.798563]  ? netlink_getsockopt+0x430/0x430
      [  352.799433]  ? __mutex_unlock_slowpath.constprop.0+0x220/0x220
      [  352.800542]  __netlink_dump_start+0x356/0x440
      [  352.801397]  rtnetlink_rcv_msg+0x3ff/0x550
      [  352.802190]  ? tc_new_tfilter+0x1180/0x1180
      [  352.802872]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
      [  352.803668]  ? tc_new_tfilter+0x1180/0x1180
      [  352.804344]  ? _copy_from_iter_nocache+0x800/0x800
      [  352.805202]  ? kasan_set_track+0x1c/0x30
      [  352.805900]  netlink_rcv_skb+0xc6/0x1f0
      [  352.806587]  ? rht_deferred_worker+0x6b0/0x6b0
      [  352.807455]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
      [  352.808324]  ? netlink_ack+0x4d0/0x4d0
      [  352.809086]  ? netlink_deliver_tap+0x62/0x3d0
      [  352.809951]  netlink_unicast+0x353/0x480
      [  352.810744]  ? netlink_attachskb+0x430/0x430
      [  352.811586]  ? __alloc_skb+0xd7/0x200
      [  352.812349]  netlink_sendmsg+0x396/0x680
      [  352.813132]  ? netlink_unicast+0x480/0x480
      [  352.813952]  ? __import_iovec+0x192/0x210
      [  352.814759]  ? netlink_unicast+0x480/0x480
      [  352.815580]  sock_sendmsg+0x6c/0x80
      [  352.816299]  ____sys_sendmsg+0x3a5/0x3c0
      [  352.817096]  ? kernel_sendmsg+0x30/0x30
      [  352.817873]  ? __ia32_sys_recvmmsg+0x150/0x150
      [  352.818753]  ___sys_sendmsg+0xd8/0x140
      [  352.819518]  ? sendmsg_copy_msghdr+0x110/0x110
      [  352.820402]  ? ___sys_recvmsg+0xf4/0x1a0
      [  352.821110]  ? __copy_msghdr_from_user+0x260/0x260
      [  352.821934]  ? _raw_spin_lock+0x81/0xd0
      [  352.822680]  ? __handle_mm_fault+0xef3/0x1b20
      [  352.823549]  ? rb_insert_color+0x2a/0x270
      [  352.824373]  ? copy_page_range+0x16b0/0x16b0
      [  352.825209]  ? perf_event_update_userpage+0x2d0/0x2d0
      [  352.826190]  ? __fget_light+0xd9/0xf0
      [  352.826941]  __sys_sendmsg+0xb3/0x130
      [  352.827613]  ? __sys_sendmsg_sock+0x20/0x20
      [  352.828377]  ? do_user_addr_fault+0x2c5/0x8a0
      [  352.829184]  ? fpregs_assert_state_consistent+0x52/0x60
      [  352.830001]  ? exit_to_user_mode_prepare+0x32/0x160
      [  352.830845]  do_syscall_64+0x35/0x80
      [  352.831445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  352.832331] RIP: 0033:0x7f7bee973c17
      [  352.833078] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  352.836202] RSP: 002b:00007ffcbb368e28 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  352.837524] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7bee973c17
      [  352.838715] RDX: 0000000000000000 RSI: 00007ffcbb368e50 RDI: 0000000000000003
      [  352.839838] RBP: 00007ffcbb36d090 R08: 00000000cea96d79 R09: 00007f7beea34a40
      [  352.841021] R10: 00000000004059bb R11: 0000000000000246 R12: 000000000046563f
      [  352.842208] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcbb36d088
      
      [  352.843784] Allocated by task 2960:
      [  352.844451]  kasan_save_stack+0x1b/0x40
      [  352.845173]  __kasan_kmalloc+0x7c/0x90
      [  352.845873]  fl_change+0x282/0x22db [cls_flower]
      [  352.846696]  tc_new_tfilter+0x6cf/0x1180
      [  352.847493]  rtnetlink_rcv_msg+0x471/0x550
      [  352.848323]  netlink_rcv_skb+0xc6/0x1f0
      [  352.849097]  netlink_unicast+0x353/0x480
      [  352.849886]  netlink_sendmsg+0x396/0x680
      [  352.850678]  sock_sendmsg+0x6c/0x80
      [  352.851398]  ____sys_sendmsg+0x3a5/0x3c0
      [  352.852202]  ___sys_sendmsg+0xd8/0x140
      [  352.852967]  __sys_sendmsg+0xb3/0x130
      [  352.853718]  do_syscall_64+0x35/0x80
      [  352.854457]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [  352.855830] Freed by task 7:
      [  352.856421]  kasan_save_stack+0x1b/0x40
      [  352.857139]  kasan_set_track+0x1c/0x30
      [  352.857854]  kasan_set_free_info+0x20/0x30
      [  352.858609]  __kasan_slab_free+0xed/0x130
      [  352.859348]  kfree+0xa7/0x3c0
      [  352.859951]  process_one_work+0x44d/0x780
      [  352.860685]  worker_thread+0x2e2/0x7e0
      [  352.861390]  kthread+0x1f4/0x220
      [  352.862022]  ret_from_fork+0x1f/0x30
      
      [  352.862955] Last potentially related work creation:
      [  352.863758]  kasan_save_stack+0x1b/0x40
      [  352.864378]  kasan_record_aux_stack+0xab/0xc0
      [  352.865028]  insert_work+0x30/0x160
      [  352.865617]  __queue_work+0x351/0x670
      [  352.866261]  rcu_work_rcufn+0x30/0x40
      [  352.866917]  rcu_core+0x3b2/0xdb0
      [  352.867561]  __do_softirq+0xf6/0x386
      
      [  352.868708] Second to last potentially related work creation:
      [  352.869779]  kasan_save_stack+0x1b/0x40
      [  352.870560]  kasan_record_aux_stack+0xab/0xc0
      [  352.871426]  call_rcu+0x5f/0x5c0
      [  352.872108]  queue_rcu_work+0x44/0x50
      [  352.872855]  __fl_put+0x17c/0x240 [cls_flower]
      [  352.873733]  fl_delete+0xc7/0x100 [cls_flower]
      [  352.874607]  tc_del_tfilter+0x510/0xb30
      [  352.886085]  rtnetlink_rcv_msg+0x471/0x550
      [  352.886875]  netlink_rcv_skb+0xc6/0x1f0
      [  352.887636]  netlink_unicast+0x353/0x480
      [  352.888285]  netlink_sendmsg+0x396/0x680
      [  352.888942]  sock_sendmsg+0x6c/0x80
      [  352.889583]  ____sys_sendmsg+0x3a5/0x3c0
      [  352.890311]  ___sys_sendmsg+0xd8/0x140
      [  352.891019]  __sys_sendmsg+0xb3/0x130
      [  352.891716]  do_syscall_64+0x35/0x80
      [  352.892395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [  352.893666] The buggy address belongs to the object at ffff8881c8251000
                      which belongs to the cache kmalloc-2k of size 2048
      [  352.895696] The buggy address is located 1152 bytes inside of
                      2048-byte region [ffff8881c8251000, ffff8881c8251800)
      [  352.897640] The buggy address belongs to the page:
      [  352.898492] page:00000000213bac35 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1c8250
      [  352.900110] head:00000000213bac35 order:3 compound_mapcount:0 compound_pincount:0
      [  352.901541] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
      [  352.902908] raw: 002ffff800010200 0000000000000000 dead000000000122 ffff888100042f00
      [  352.904391] raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000
      [  352.905861] page dumped because: kasan: bad access detected
      
      [  352.907323] Memory state around the buggy address:
      [  352.908218]  ffff8881c8251380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  352.909471]  ffff8881c8251400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  352.910735] >ffff8881c8251480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  352.912012]                    ^
      [  352.912642]  ffff8881c8251500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  352.913919]  ffff8881c8251580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  352.915185] ==================================================================
      
      Fixes: d39d7149 ("idr: introduce idr_for_each_entry_continue_ul()")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Acked-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5ef1906
    • Colin Ian King's avatar
      octeontx2-af: Remove redundant initialization of variable pin · 75f81afb
      Colin Ian King authored
      The variable pin is initialized with a value that is never read;
      it is only updated later on, in one case of a switch statement.
      The assignment is redundant and can be removed.
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75f81afb
    • Lars-Peter Clausen's avatar
      net: macb: ptp: Switch to gettimex64() interface · e51bb5c2
      Lars-Peter Clausen authored
      The macb PTP support currently implements the `gettime64` callback to
      retrieve the hardware clock time. Update the implementation to provide
      the `gettimex64` callback instead.
      
      The difference between the two is that with `gettime64` a snapshot of the
      system clock is taken before and after invoking the callback, whereas
      `gettimex64` expects the callback itself to take the snapshots.
      
      Getting the time from the macb Ethernet core requires multiple register
      accesses, only one of which happens at the time reported by the
      function. This leads to a non-symmetric delay and adds a slight offset
      between the hardware and system clock time when using the `gettime64`
      method. This offset can be a few hundred nanoseconds. Switching to the
      `gettimex64` method allows for a more precise correlation of the hardware
      and system clocks and results in a lower offset between the two.
      
      On a Xilinx ZynqMP system `phc2sys` reports a delay of 1120 ns before and
      300 ns after the patch, with the latter being mostly symmetric.
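
The effect of the snapshot placement can be worked through with a small arithmetic model. This is a sketch under stated assumptions: the consumer is assumed to estimate the moment of the hardware read as the midpoint of the two system-clock snapshots, `struct sample` and the helper names are hypothetical, and the position of the latching register access (150 ns in the usage note) is an invented figure for illustration. Only the 1120 ns and 300 ns window widths come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* System-clock snapshots (in ns) taken around a hardware clock read. */
struct sample {
	int64_t sys_before;
	int64_t sys_after;
};

/* Width of the snapshot window. */
int64_t window_delay(struct sample s)
{
	assert(s.sys_after >= s.sys_before);
	return s.sys_after - s.sys_before;
}

/* Midpoint of the window, used as the estimated time of the read. */
int64_t midpoint(struct sample s)
{
	return s.sys_before + window_delay(s) / 2;
}

/* Error of the midpoint estimate against the true latch time. */
int64_t midpoint_error(struct sample s, int64_t latch_time)
{
	return latch_time - midpoint(s);
}
```

For a window of 0..1120 ns around the whole callback and a latch at 150 ns, the midpoint is 560 ns and the error is -410 ns. For a window of 0..300 ns around only the latching access, the midpoint is 150 ns and the error is 0.
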
      Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e51bb5c2
    • Boris Sukholitko's avatar
      dissector: do not set invalid PPP protocol · 2e861e5e
      Boris Sukholitko authored
      The following flower filter fails to match non-PPP_IP{V6} packets
      wrapped in PPP_SES protocol:
      
      tc filter add dev eth0 ingress protocol ppp_ses flower \
              action simple sdata hi64
      
      The reason is that the proto local variable is being set even when
      the FLOW_DISSECT_RET_OUT_BAD status is returned.
      
      The fix is to avoid setting the proto variable if the PPP protocol is unknown.
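
The shape of such a fix can be sketched as follows. This is a simplified model, not the flow dissector itself: `map_ppp_proto` and `enum dissect_ret` are hypothetical names, while the numeric protocol values are the standard PPP and EtherType assignments. The point is that the output variable is written only on the paths that recognise the PPP protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPP_IP      0x0021	/* PPP protocol number for IPv4 */
#define PPP_IPV6    0x0057	/* PPP protocol number for IPv6 */
#define ETH_P_IP    0x0800	/* EtherType for IPv4 */
#define ETH_P_IPV6  0x86DD	/* EtherType for IPv6 */

/* Hypothetical result codes standing in for the dissector's statuses. */
enum dissect_ret {
	DISSECT_PROTO_AGAIN,	/* keep dissecting with the new *proto */
	DISSECT_OUT_BAD,	/* stop; *proto must stay as it was */
};

/* Map a PPP protocol to an EtherType; leave *proto untouched if unknown. */
enum dissect_ret map_ppp_proto(uint16_t ppp_proto, uint16_t *proto)
{
	assert(proto != NULL);

	switch (ppp_proto) {
	case PPP_IP:
		*proto = ETH_P_IP;
		return DISSECT_PROTO_AGAIN;
	case PPP_IPV6:
		*proto = ETH_P_IPV6;
		return DISSECT_PROTO_AGAIN;
	default:
		return DISSECT_OUT_BAD;
	}
}
```

If the caller starts with `proto` set to the PPPoE session EtherType (0x8864) and passes an unrecognised PPP protocol, `proto` still holds 0x8864 afterwards, so a filter keyed on the outer protocol can still match.
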
      Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e861e5e
    • Linus Walleij's avatar
      net: dsa: rtl8366rb: Use core filtering tracking · 55b115c7
      Linus Walleij authored
      We added a state variable to track whether a certain port
      was VLAN filtering or not, but we can just ask the DSA
      core about this instead.
      
      Cc: Vladimir Oltean <olteanv@gmail.com>
      Cc: Mauri Sandberg <sandberg@mailfence.com>
      Cc: DENG Qingfang <dqfext@gmail.com>
      Cc: Alvin Šipraga <alsi@bang-olufsen.dk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55b115c7