1. 25 Aug, 2021 21 commits
    • mptcp: shrink mptcp_out_options struct · d7b26908
      Paolo Abeni authored
      After the previous patch we can alias several fields in
      mptcp_out_options with a union. The struct is stack-allocated and
      memset() for each plain TCP out packet, so every saved byte counts.
      
      Before:
      pahole -EC mptcp_out_options
       # ...
      /* size: 136, cachelines: 3, members: 17 */
      
      After:
      pahole -EC mptcp_out_options
       # ...
      /* size: 56, cachelines: 1, members: 9 */
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
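The size reduction reported by pahole above comes from letting mutually exclusive fields share storage. The following is a minimal sketch of that idea only; the struct and field names (`opt_a`, `opt_b`, `out_opts_flat`, `out_opts_union`) are hypothetical and do not reproduce the real mptcp_out_options layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical option payloads; the real mptcp_out_options fields differ. */
struct opt_a { uint64_t key; uint64_t ack; uint32_t seq; };   /* only used by option A */
struct opt_b { uint64_t nonce; uint32_t token; uint8_t id; }; /* only used by option B */

/* Flat layout: both payloads always occupy space. */
struct out_opts_flat {
	uint16_t suboptions;
	struct opt_a a;
	struct opt_b b;
};

/* Union layout: payloads that never coexist share the same bytes. */
struct out_opts_union {
	uint16_t suboptions;
	union {
		struct opt_a a;
		struct opt_b b;
	};
};

/* Compile-time check that the aliasing actually shrinks the struct. */
_Static_assert(sizeof(struct out_opts_union) < sizeof(struct out_opts_flat),
	       "aliasing exclusive fields must shrink the struct");
```

A smaller struct also means a cheaper memset() on every packet, which is the motivation given in the commit message.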
    • mptcp: optimize out option generation · 1bff1e43
      Paolo Abeni authored
      Currently we have several protocol constraints on MPTCP options
      generation (e.g. MPC and MPJ subopt are mutually exclusive)
      and some additional ones required by our implementation
      (e.g. almost all ADD_ADDR variants are mutually exclusive with
      everything else).
      
      We can leverage the above to optimize the out option generation:
      we check DSS/MPC/MPJ presence in a mutually exclusive way,
      avoiding many unneeded conditionals in the common cases.
      
      Additionally extend the existing constraints on ADD_ADDR opt to
      all subvariants, so that it becomes fully mutually exclusive with
      the above and we can skip another conditional statement for the
      common case.
      
      This change is also needed by the next patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · d484dc2b
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      1GbE Intel Wired LAN Driver Updates 2021-08-24
      
      Vinicius Costa Gomes says:
      
      This adds support for PCIe PTM (Precision Time Measurement) to the igc
      driver. PCIe PTM allows the NIC and Host clocks to be compared more
      precisely, improving the clock synchronization accuracy.
      
      Patch 1/4 reverts a commit that made pci_enable_ptm() private to the
      PCI subsystem; reverting it makes the function callable from
      drivers.
      
      Patch 2/4 adds the pcie_ptm_enabled() helper.
      
      Patch 3/4 calls pci_enable_ptm() from the igc driver.
      
      Patch 4/4 implements the PCIe PTM support. Exposing it via the
      .getcrosststamp() API implies that the time measurements are made
      synchronously with the ioctl(). The hardware was implemented so the
      most convenient way to retrieve that information would be
      asynchronously. So, to follow the expectations of the ioctl() we have
      to use less convenient ways, triggering a PCIe PTM dialog every time
      an ioctl() is received.
      
      Some questions are raised (also pointed out in the commit message):
      
      1. Using convert_art_ns_to_tsc() is too x86 specific, there should be
         a common way to create a 'system_counterval_t' from a timestamp.
      
      2. convert_art_ns_to_tsc() says that it should only be used when
         X86_FEATURE_TSC_KNOWN_FREQ is true, but during tests it works even
         when it returns false. Should that check be done?
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'lan7800-improvements' · 38cbd6e7
      David S. Miller authored
      John Efstathiades says:
      
      ====================
      LAN7800 driver improvements
      
      This patch set introduces a number of improvements and fixes for
      problems found during testing of a modification to add a NAPI-style
      approach to packet handling to improve performance.
      
      NOTE: the NAPI changes are not part of this patch set and the issues
            fixed by this patch set are not coupled to the NAPI changes.
      
      Patch 1 fixes white space and style issues
      
      Patch 2 removes an unused timer
      
      Patch 3 introduces macros to set the internal packet FIFO flow
      control levels, which makes it easier to update the levels in future.
      
      Patch 4 removes an unused queue
      
      Patch 5 (updated for v2) introduces function return value checks and
      error propagation to various parts of the driver where a return
      code was captured but then ignored.
      
      This patch is completely different to patch 5 in version 1 of this patch
      set. The changes in the v1 patch 5 are being set aside for the time
      being.
      
      Patch 6 updates the LAN7800 MAC reset code to ensure there is no
      PHY register access in progress when the MAC is reset. This change
      prevents a kernel exception that can otherwise occur.
      
      Patch 7 fixes problems with system suspend and resume handling while
      the device is transmitting and receiving data.
      
      Patch 8 fixes problems with auto-suspend and resume handling and
      depends on changes introduced by patch 7.
      
      Patch 9 fixes problems with device disconnect handling that can result
      in kernel exceptions and/or hang.
      
      Patch 10 limits the rate at which driver warning messages are emitted.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lan78xx: Limit number of driver warning messages · df0d6f7a
      John Efstathiades authored
      Device removal can result in a large burst of driver warning messages
      (20 - 30) sent to the kernel log. Most of these are register read/write
      failures.
      
      This change limits the rate at which these messages are emitted.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
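Rate limiting a burst of warnings can be pictured as a simple window counter. The sketch below is a user-space illustration under assumed names (`msg_ratelimit`, `msg_ratelimit_ok`) and an assumed policy of "at most `burst` messages per interval"; the kernel has its own helpers for this (for example net_ratelimit()), and the commit message does not say which mechanism the driver uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: at most `burst` messages per `interval_ms`. */
struct msg_ratelimit {
	uint64_t interval_ms;      /* length of one window */
	unsigned int burst;        /* messages allowed per window */
	uint64_t window_start_ms;  /* start of the current window */
	unsigned int emitted;      /* messages emitted in this window */
	unsigned int suppressed;   /* total messages dropped so far */
};

/* Returns true if a message may be emitted at time `now_ms`. */
static bool msg_ratelimit_ok(struct msg_ratelimit *rl, uint64_t now_ms)
{
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		/* A new window begins: reset the emitted count. */
		rl->window_start_ms = now_ms;
		rl->emitted = 0;
	}
	if (rl->emitted < rl->burst) {
		rl->emitted++;
		return true;
	}
	rl->suppressed++;
	return false;
}
```

With a burst of 10, a run of 30 register-access failures during device removal would log 10 warnings and count the remaining 20 as suppressed.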
    • lan78xx: Fix race condition in disconnect handling · 77dfff5b
      John Efstathiades authored
      If there is a device disconnect at roughly the same time as a
      deferred PHY link reset there is a race condition that can result
      in a kernel lock up due to a null pointer dereference in the
      driver's deferred work handling routine lan78xx_delayedwork().
      The following changes fix this problem.
      
      Add new status flag EVENT_DEV_DISCONNECT to indicate when the
      device has been removed and use it to prevent operations, such as
      register access, that will fail once the device is removed.
      
      Stop processing of deferred work items when the driver's USB
      disconnect handler is invoked.
      
      Disconnect the PHY only after the network device has been
      unregistered and all delayed work has been cancelled.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
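The role of a disconnect flag such as EVENT_DEV_DISCONNECT can be sketched as a guard in front of every register access. This is a simplified user-space model, not the driver code: `dev_state`, `reg_read` and `on_disconnect` are hypothetical names, and a C11 atomic stands in for the driver's status flag.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device state; the atomic stands in for EVENT_DEV_DISCONNECT. */
struct dev_state {
	atomic_bool disconnected;
};

/* Called from the disconnect path before any teardown starts. */
static void on_disconnect(struct dev_state *d)
{
	atomic_store(&d->disconnected, true);
}

/* Register read that fails fast once the device is gone. */
static int reg_read(struct dev_state *d, unsigned int reg, unsigned int *val)
{
	if (atomic_load(&d->disconnected))
		return -ENODEV;   /* skip an access that is bound to fail */

	*val = reg;               /* placeholder for the real bus transfer */
	return 0;
}
```

Deferred work can test the same flag at its entry point and return early, which matches the commit's second step of stopping deferred work once the disconnect handler has run.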
    • lan78xx: Fix race conditions in suspend/resume handling · 5f4cc6e2
      John Efstathiades authored
      If the interface is given an IP address while the device is
      suspended (as a result of an auto-suspend event) there is a race
      between lan78xx_resume() and lan78xx_open() that can result in an
      exception or failure to handle incoming packets. The following
      changes fix this problem.
      
      Introduce a mutex to serialise operations in the network interface
      open and stop entry points with respect to the USB driver suspend
      and resume entry points.
      
      Move Tx and Rx data path start/stop to lan78xx_start() and
      lan78xx_stop() respectively and flush the packet FIFOs before
      starting the Tx and Rx data paths. This prevents the MAC and FIFOs
      getting out of step and delivery of malformed packets to the network
      stack.
      
      Stop processing of received packets before disconnecting the
      PHY from the MAC to prevent a kernel exception caused by handling
      packets after the PHY device has been removed.
      
      Refactor device auto-suspend code to make it consistent with the
      system suspend code and make the suspend handler easier to read.
      
      Add new code to stop wake-on-lan packets or PHY events resuming the
      host or device from suspend if the device has not been opened
      (typically after an IP address is assigned).
      
      This patch is dependent on changes to lan78xx_suspend() and
      lan78xx_resume() introduced in the previous patch of this patch set.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lan78xx: Fix partial packet errors on suspend/resume · e1210fe6
      John Efstathiades authored
      The MAC can get out of step with the internal packet FIFOs if the
      system goes to sleep when the link is active, especially at high
      data rates. This can result in partial frames in the packet FIFOs
      that in turn result in malformed frames being delivered to the host.
      This occurs because the driver does not enable/disable the internal
      packet FIFOs in step with the corresponding MAC data path. The
      following changes fix this problem.
      
      Update code that enables/disables the MAC receiver and transmitter
      to the more general Rx and Tx data path, where the data path in each
      direction consists of both the MAC function (Tx or Rx) and the
      corresponding packet FIFO.
      
      In the receive path the packet FIFO must be enabled before the MAC
      receiver but disabled after the MAC receiver.
      
      In the transmit path the opposite is true: the packet FIFO must be
      enabled after the MAC transmitter but disabled before the MAC
      transmitter.
      
      The packet FIFOs can be flushed safely once the corresponding data
      path is stopped.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
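The enable/disable ordering described in this commit can be summarised in a small model. The sketch below only mirrors the ordering rules stated above; `datapath`, `rx_start`, `rx_stop`, `tx_start`, `tx_stop` and `fifo_flush` are hypothetical names, and plain booleans stand in for the hardware register writes.

```c
#include <stdbool.h>

/* Hypothetical model of one direction's data path (MAC function + FIFO). */
struct datapath {
	bool mac_on;
	bool fifo_on;
	unsigned int fifo_bytes;   /* data currently held in the FIFO */
};

/* Rx start: enable the FIFO first, then the MAC receiver. */
static void rx_start(struct datapath *rx)
{
	rx->fifo_on = true;
	rx->mac_on = true;
}

/* Rx stop: disable the MAC receiver first, then the FIFO. */
static void rx_stop(struct datapath *rx)
{
	rx->mac_on = false;
	rx->fifo_on = false;
}

/* Tx start: enable the MAC transmitter first, then the FIFO. */
static void tx_start(struct datapath *tx)
{
	tx->mac_on = true;
	tx->fifo_on = true;
}

/* Tx stop: disable the FIFO first, then the MAC transmitter. */
static void tx_stop(struct datapath *tx)
{
	tx->fifo_on = false;
	tx->mac_on = false;
}

/* A FIFO may only be flushed once its whole data path is stopped. */
static bool fifo_flush(struct datapath *dp)
{
	if (dp->mac_on || dp->fifo_on)
		return false;
	dp->fifo_bytes = 0;
	return true;
}
```

In both directions the stage that consumes data is enabled before the stage that produces it and disabled after it, so no partial frame is left stranded between the two.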
    • lan78xx: Fix exception on link speed change · b1f6696d
      John Efstathiades authored
      An exception is sometimes seen when the link speed is changed
      from auto-negotiation to a fixed speed, or vice versa. The
      exception occurs when the MAC is reset (due to the link speed
      change) at the same time as the PHY state machine is accessing
      a PHY register. The following changes fix this problem.
      
      Rework the MAC reset to ensure there is no outstanding MDIO
      register transaction before the reset and then wait until the
      reset is complete before allowing any further MAC register access.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lan78xx: Add missing return code checks · 3415f6ba
      John Efstathiades authored
      There are many places in the driver where the return code from a
      function call is captured but never tested or acted upon.
      
      This patch adds the missing return code tests and action. In most
      cases the action is an early exit from the calling function.
      
      The function lan78xx_set_suspend() was also updated to make it
      consistent with lan78xx_suspend().
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lan78xx: Remove unused pause frame queue · 40b8452f
      John Efstathiades authored
      Remove the pause frame queue from the driver. It is initialised
      but not actually used.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lan78xx: Set flow control threshold to prevent packet loss · dc35f854
      John Efstathiades authored
      Set threshold at which flow control is triggered to 3/4 full of
      the internal Rx packet FIFO to prevent packet drops at high data
      rates. The new setting reduces the number of dropped UDP frames
      and TCP retransmit requests especially on less capable CPUs.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
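The 3/4 threshold is simple arithmetic on the FIFO capacity. The sketch below shows the calculation under an assumed FIFO size; `EXAMPLE_RX_FIFO_BYTES` and `flow_ctrl_threshold` are hypothetical names, and the real driver may encode the level in device-specific register units rather than bytes.

```c
#include <stdint.h>

/* Hypothetical Rx FIFO size used for illustration only. */
#define EXAMPLE_RX_FIFO_BYTES 12288u

/* Fill level (in bytes) at which flow control triggers: 3/4 of capacity. */
static inline uint32_t flow_ctrl_threshold(uint32_t fifo_bytes)
{
	/* Widen to 64 bits so the multiplication cannot overflow. */
	return (uint32_t)(((uint64_t)fifo_bytes * 3) / 4);
}
```

For the assumed 12288-byte FIFO this gives 9216 bytes, leaving the last quarter of the FIFO as headroom for frames that arrive before the link partner reacts to the pause request.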
    • lan78xx: Remove unused timer · 3bef6b9e
      John Efstathiades authored
      Remove kernel timer that is not used by the driver.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lan78xx: Fix white space and style issues · 9ceec7d3
      John Efstathiades authored
      Fix white space and code style issues identified by checkpatch.
      Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'xen-harden-netfront' · fbd029df
      David S. Miller authored
      Juergen Gross says:
      
      ====================
      xen: harden netfront against malicious backends
      
      Xen backends of para-virtualized devices can live in dom0 kernel, dom0
      user land, or in a driver domain. This means that a backend might
      reside in a less trusted environment than the Xen core components, so
      a backend should not be able to do harm to a Xen guest (it can still
      mess up I/O data, but it shouldn't be able to e.g. crash a guest by
      other means or cause a privilege escalation in the guest).
      
      Unfortunately netfront in the Linux kernel is fully trusting its
      backend. This series is fixing netfront in this regard.
      
      Handling this as a security problem was considered, but the topic
      had already been discussed in public, so it isn't a real secret.
      
      It should be mentioned that a similar series has been posted some years
      ago by Marek Marczykowski-Górecki, but this series has not been applied
      due to a Xen header not having been available in the Xen git repo at
      that time. Additionally my series is fixing some more DoS cases.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen/netfront: don't trust the backend response data blindly · a884daa6
      Juergen Gross authored
      Today netfront will trust the backend to send only sane response data.
      In order to avoid privilege escalations or crashes in case of malicious
      backends, verify the data to be within expected limits. In particular,
      make sure that the response always references an outstanding request.
      
      Note that only the tx queue needs special id handling, as for the rx
      queue the id is equal to the index in the ring page.
      
      Introduce a new indicator for the device whether it is broken and let
      the device stop working when it is set. Set this indicator in case the
      backend sets any weird data.
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
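The check that a response "always references an outstanding request", combined with the broken-device indicator, can be sketched as follows. This is an illustrative model, not the netfront code: `front_queue`, `accept_response`, and the ring size are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256u   /* hypothetical ring size */

/* Hypothetical per-queue bookkeeping kept in frontend-private memory. */
struct front_queue {
	bool pending[RING_SIZE];  /* true while request `id` is outstanding */
	bool broken;              /* set once the backend sent bogus data */
};

/* Accept a response only if it names an in-range, outstanding request. */
static bool accept_response(struct front_queue *q, uint16_t id)
{
	if (q->broken)
		return false;     /* device already disabled */

	if (id >= RING_SIZE || !q->pending[id]) {
		q->broken = true; /* stop trusting this backend */
		return false;
	}

	q->pending[id] = false;   /* each request is completed exactly once */
	return true;
}
```

Because the bookkeeping lives outside the shared ring page, the backend cannot forge an "outstanding" state, and a duplicate or out-of-range id takes the device out of service instead of corrupting frontend state.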
    • xen/netfront: disentangle tx_skb_freelist · 21631d2d
      Juergen Gross authored
      The tx_skb_freelist elements are in a single linked list with the
      request id used as link reference. The per element link field is in a
      union with the skb pointer of an in use request.
      
      Move the link reference out of the union in order to enable a later
      reuse of it for requests which need a populated skb pointer.
      
      Rename add_id_to_freelist() and get_id_from_freelist() to
      add_id_to_list() and get_id_from_list() in order to prepare using
      those for other lists as well. Define ~0 as value to indicate the end
      of a list and place that value into the link for a request not being
      on the list.
      
      When freeing an skb, zero the skb pointer in the request. Use a NULL
      value of the skb pointer instead of skb_entry_is_link() for deciding
      whether a request has an skb linked to it.
      
      Remove skb_entry_set_link() and open code it instead as it is really
      trivial now.
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
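A singly linked list threaded through request ids, with the link kept separate from the skb pointer and ~0 marking the end of the list, might look like the sketch below. The function names follow the commit message, but the signatures, the `slot_table` type, and the slot count are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 8u                 /* hypothetical number of request slots */
#define LINK_END  ((uint16_t)~0)     /* end of list, or "not on any list" */

/* Link and skb pointer are separate arrays, no longer a union. */
struct slot_table {
	uint16_t link[NUM_SLOTS];    /* next id on the list, or LINK_END */
	void *skb[NUM_SLOTS];        /* stand-in for the skb pointer, NULL if none */
};

static void table_init(struct slot_table *t, uint16_t *head)
{
	*head = LINK_END;
	for (uint16_t i = 0; i < NUM_SLOTS; i++) {
		t->link[i] = LINK_END;
		t->skb[i] = NULL;
	}
}

/* Push `id` onto the front of the list rooted at `*head`. */
static void add_id_to_list(uint16_t *head, uint16_t *link, uint16_t id)
{
	link[id] = *head;
	*head = id;
}

/* Pop the first id from the list, or return LINK_END if it is empty. */
static uint16_t get_id_from_list(uint16_t *head, uint16_t *link)
{
	uint16_t id = *head;

	if (id != LINK_END) {
		*head = link[id];
		link[id] = LINK_END;   /* mark the slot as off-list */
	}
	return id;
}
```

Since the head is passed in explicitly, the same two helpers can serve several lists, which is the reuse the rename in this commit prepares for.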
    • xen/netfront: don't read data from request on the ring page · 162081ec
      Juergen Gross authored
      In order to avoid a malicious backend being able to influence the local
      processing of a request, build the request locally first and then copy
      it to the ring page. Any reading from the request influencing the
      processing in the frontend needs to be done on the local instance.
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen/netfront: read response from backend only once · 8446066b
      Juergen Gross authored
      In order to avoid problems in case the backend is modifying a response
      on the ring page while the frontend has already seen it, just read the
      response into a local buffer in one go and then operate on that buffer
      only.
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
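Reading a response once and then working only on the private copy avoids double-fetch problems when the other side can rewrite shared memory at any time. The sketch below illustrates the pattern with a hypothetical `ring_rsp` layout and helper name `snapshot_rsp`; it is not the Xen ring macro API.

```c
#include <stdint.h>

/* Hypothetical response layout on the shared ring page. */
struct ring_rsp {
	uint16_t id;
	int16_t status;
};

/*
 * Read each field of the shared response exactly once into a local
 * copy. All later validation and processing must use the copy, never
 * the shared page.
 */
static struct ring_rsp snapshot_rsp(const volatile struct ring_rsp *shared)
{
	struct ring_rsp local;

	local.id = shared->id;
	local.status = shared->status;
	return local;
}
```

Any check performed on the local copy stays valid for the rest of the processing, even if the backend modifies the ring entry afterwards.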
    • qed: Enable automatic recovery on error condition. · 755f9053
      Alok Prasad authored
      This patch enables automatic recovery by default in case of various
      error conditions such as a firmware assert or a hardware error.
      It also ensures the driver can handle multiple iterations of
      assertion conditions.
      Signed-off-by: Ariel Elior <aelior@marvell.com>
      Signed-off-by: Shai Malin <smalin@marvell.com>
      Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
      Signed-off-by: Alok Prasad <palok@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-next: When a bond has a massive amount of VLANs with IPv6 addresses,... · 406f42fa
      Gilad Naaman authored
      net-next: When a bond has a massive amount of VLANs with IPv6 addresses, the performance of changing link state, attaching a VRF, changing an IPv6 address, etc. goes down dramatically.
      
      The source of most of the slowdown is the `dev_addr_lists.c` module,
      which maintains a linked list of HW addresses.
      When using IPv6, this list grows for each IPv6 address added on a
      VLAN, since each IPv6 address has a multicast HW address associated with
      it.
      
      When performing any modification to the involved links, this list is
      traversed many times, often for nothing, all while holding the RTNL
      lock.
      
      Instead, this patch adds an auxiliary rbtree which cuts down
      traversal time significantly.
      
      Performance can be seen with the following script:
      
      	#!/bin/bash
      	ip netns del test 2>/dev/null || true
      	ip netns add test
      
      	echo 1 | ip netns exec test tee /proc/sys/net/ipv6/conf/all/keep_addr_on_down > /dev/null
      
      	set -e
      
      	ip -n test link add foo type veth peer name bar
      	ip -n test link add b1 type bond
      	ip -n test link add florp type vrf table 10
      
      	ip -n test link set bar master b1
      	ip -n test link set foo up
      	ip -n test link set bar up
      	ip -n test link set b1 up
      	ip -n test link set florp up
      
      	VLAN_COUNT=1500
      	BASE_DEV=b1
      
      	echo Creating vlans
      	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
      	do ip -n test link add link $BASE_DEV name foo.\$i type vlan id \$i; done"
      
      	echo Bringing them up
      	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
      	do ip -n test link set foo.\$i up; done"
      
      	echo Assiging IPv6 Addresses
      	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
      	do ip -n test address add dev foo.\$i 2000::\$i/64; done"
      
      	echo Attaching to VRF
      	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
      	do ip -n test link set foo.\$i master florp; done"
      
      On an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz machine, the performance
      before the patch is (truncated):
      
      	Creating vlans
      	real 108.35
      	Bringing them up
      	real 4.96
      	Assiging IPv6 Addresses
      	real 19.22
      	Attaching to VRF
      	real 458.84
      
      After the patch:
      
      	Creating vlans
      	real 5.59
      	Bringing them up
      	real 5.07
      	Assiging IPv6 Addresses
      	real 5.64
      	Attaching to VRF
      	real 25.37
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Lu Wei <luwei32@huawei.com>
      Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Cc: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
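The idea of keeping the existing linked list and adding a tree purely as a lookup index can be sketched in user space. The commit uses the kernel rbtree; the sketch below swaps in POSIX tsearch()/tfind() as a stand-in so that it stays self-contained, and `hw_addr`, `hw_addr_list`, `hw_addr_find` and `hw_addr_add` are hypothetical names.

```c
#include <search.h>
#include <stdlib.h>
#include <string.h>

#define HW_ADDR_LEN 6

struct hw_addr {
	unsigned char addr[HW_ADDR_LEN];
	struct hw_addr *next;      /* existing linked list, kept for full walks */
};

struct hw_addr_list {
	struct hw_addr *head;
	void *index;               /* tsearch() tree root, keyed by address */
};

static int hw_addr_cmp(const void *a, const void *b)
{
	return memcmp(((const struct hw_addr *)a)->addr,
		      ((const struct hw_addr *)b)->addr, HW_ADDR_LEN);
}

/* Look up an address through the tree instead of walking the list. */
static struct hw_addr *hw_addr_find(struct hw_addr_list *l,
				    const unsigned char *addr)
{
	struct hw_addr key;
	void *node;

	memcpy(key.addr, addr, HW_ADDR_LEN);
	node = tfind(&key, &l->index, hw_addr_cmp);
	return node ? *(struct hw_addr **)node : NULL;
}

/* Add an address unless it is already present; NULL on allocation failure. */
static struct hw_addr *hw_addr_add(struct hw_addr_list *l,
				   const unsigned char *addr)
{
	struct hw_addr *ha = hw_addr_find(l, addr);

	if (ha)
		return ha;

	ha = calloc(1, sizeof(*ha));
	if (!ha)
		return NULL;
	memcpy(ha->addr, addr, HW_ADDR_LEN);

	if (!tsearch(ha, &l->index, hw_addr_cmp)) {
		free(ha);
		return NULL;
	}

	/* Keep the list in sync with the index. */
	ha->next = l->head;
	l->head = ha;
	return ha;
}
```

Each duplicate check or lookup then takes logarithmic rather than linear time in the number of addresses, which is consistent with the large drop in the "Creating vlans" and "Attaching to VRF" timings reported above.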
  2. 24 Aug, 2021 19 commits
    • net: bridge: change return type of br_handle_ingress_vlan_tunnel · a37c5c26
      Kangmin Park authored
      br_handle_ingress_vlan_tunnel() is only referenced in
      br_handle_frame(). If br_handle_ingress_vlan_tunnel() is called and
      returns a non-zero value, br_handle_frame() jumps to its drop label.

      However, br_handle_ingress_vlan_tunnel() always returns 0, so the
      code that checks the return value and jumps to drop is dead.

      Therefore, change the return type of br_handle_ingress_vlan_tunnel()
      to void and remove the if statement in br_handle_frame().
      Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Link: https://lore.kernel.org/r/20210823102118.17966-1-l4stpr0gr4m@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/net: Use kselftest skip code for skipped tests · 7844ec21
      Po-Hsu Lin authored
      Several test cases in the net directory still use exit 0 or exit 1
      when they need to be skipped. Use the kselftest framework skip code
      instead so that the return status distinguishes skipped tests from
      passes and failures.
      
      Criterion to filter out what should be fixed in net directory:
        grep -r "exit [01]" -B1 | grep -i skip
      
      This change might cause some false-positives if people are running
      these test scripts directly and only checking their return codes,
      which will change from 0 to 4. However I think the impact should be
      small as most of our scripts here are already using this skip code.
      And there will be no such issue if running them with the kselftest
      framework.
      Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20210823085854.40216-1-po-hsu.lin@canonical.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • igc: Add support for PTP getcrosststamp() · a90ec848
      Vinicius Costa Gomes authored
      i225 supports PCIe Precision Time Measurement (PTM), allowing us to
      support the PTP_SYS_OFFSET_PRECISE ioctl() in the driver via the
      getcrosststamp() function.
      
      The easiest way to expose the PTM registers would be to configure the PTM
      dialogs to run periodically, but the PTP_SYS_OFFSET_PRECISE ioctl()
      semantics are more aligned to using a kind of "one-shot" way of retrieving
      the PTM timestamps. But this causes a bit more code to be written: the
      trigger registers for the PTM dialogs are not cleared automatically.
      
      i225 can be configured to send "fake" packets with the PTM
      information, adding support for handling these types of packets is
      left for the future.
      
      PTM improves the accuracy of time synchronization, for example, using
      phc2sys, while a simple application is sending packets as fast as
      possible. First, without .getcrosststamp():
      
      phc2sys[191.382]: enp4s0 sys offset      -959 s2 freq    -454 delay   4492
      phc2sys[191.482]: enp4s0 sys offset       798 s2 freq   +1015 delay   4069
      phc2sys[191.583]: enp4s0 sys offset       962 s2 freq   +1418 delay   3849
      phc2sys[191.683]: enp4s0 sys offset       924 s2 freq   +1669 delay   3753
      phc2sys[191.783]: enp4s0 sys offset       664 s2 freq   +1686 delay   3349
      phc2sys[191.883]: enp4s0 sys offset       218 s2 freq   +1439 delay   2585
      phc2sys[191.983]: enp4s0 sys offset       761 s2 freq   +2048 delay   3750
      phc2sys[192.083]: enp4s0 sys offset       756 s2 freq   +2271 delay   4061
      phc2sys[192.183]: enp4s0 sys offset       809 s2 freq   +2551 delay   4384
      phc2sys[192.283]: enp4s0 sys offset      -108 s2 freq   +1877 delay   2480
      phc2sys[192.383]: enp4s0 sys offset     -1145 s2 freq    +807 delay   4438
      phc2sys[192.484]: enp4s0 sys offset       571 s2 freq   +2180 delay   3849
      phc2sys[192.584]: enp4s0 sys offset       241 s2 freq   +2021 delay   3389
      phc2sys[192.684]: enp4s0 sys offset       405 s2 freq   +2257 delay   3829
      phc2sys[192.784]: enp4s0 sys offset        17 s2 freq   +1991 delay   3273
      phc2sys[192.884]: enp4s0 sys offset       152 s2 freq   +2131 delay   3948
      phc2sys[192.984]: enp4s0 sys offset      -187 s2 freq   +1837 delay   3162
      phc2sys[193.084]: enp4s0 sys offset     -1595 s2 freq    +373 delay   4557
      phc2sys[193.184]: enp4s0 sys offset       107 s2 freq   +1597 delay   3740
      phc2sys[193.284]: enp4s0 sys offset       199 s2 freq   +1721 delay   4010
      phc2sys[193.385]: enp4s0 sys offset      -169 s2 freq   +1413 delay   3701
      phc2sys[193.485]: enp4s0 sys offset       -47 s2 freq   +1484 delay   3581
      phc2sys[193.585]: enp4s0 sys offset       -65 s2 freq   +1452 delay   3778
      phc2sys[193.685]: enp4s0 sys offset        95 s2 freq   +1592 delay   3888
      phc2sys[193.785]: enp4s0 sys offset       206 s2 freq   +1732 delay   4445
      phc2sys[193.885]: enp4s0 sys offset      -652 s2 freq    +936 delay   2521
      phc2sys[193.985]: enp4s0 sys offset      -203 s2 freq   +1189 delay   3391
      phc2sys[194.085]: enp4s0 sys offset      -376 s2 freq    +955 delay   2951
      phc2sys[194.185]: enp4s0 sys offset      -134 s2 freq   +1084 delay   3330
      phc2sys[194.285]: enp4s0 sys offset       -22 s2 freq   +1156 delay   3479
      phc2sys[194.386]: enp4s0 sys offset        32 s2 freq   +1204 delay   3602
      phc2sys[194.486]: enp4s0 sys offset       122 s2 freq   +1303 delay   3731
      
      Statistics for this run (total of 2179 lines), in nanoseconds:
        average: -1.12
        stdev: 634.80
        max: 1551
        min: -2215
      
      With .getcrosststamp() via PCIe PTM:
      
      phc2sys[367.859]: enp4s0 sys offset         6 s2 freq   +1727 delay      0
      phc2sys[367.959]: enp4s0 sys offset        -2 s2 freq   +1721 delay      0
      phc2sys[368.059]: enp4s0 sys offset         5 s2 freq   +1727 delay      0
      phc2sys[368.160]: enp4s0 sys offset        -1 s2 freq   +1723 delay      0
      phc2sys[368.260]: enp4s0 sys offset        -4 s2 freq   +1719 delay      0
      phc2sys[368.360]: enp4s0 sys offset        -5 s2 freq   +1717 delay      0
      phc2sys[368.460]: enp4s0 sys offset         1 s2 freq   +1722 delay      0
      phc2sys[368.560]: enp4s0 sys offset        -3 s2 freq   +1718 delay      0
      phc2sys[368.660]: enp4s0 sys offset         5 s2 freq   +1725 delay      0
      phc2sys[368.760]: enp4s0 sys offset        -1 s2 freq   +1721 delay      0
      phc2sys[368.860]: enp4s0 sys offset         0 s2 freq   +1721 delay      0
      phc2sys[368.960]: enp4s0 sys offset         0 s2 freq   +1721 delay      0
      phc2sys[369.061]: enp4s0 sys offset         4 s2 freq   +1725 delay      0
      phc2sys[369.161]: enp4s0 sys offset         1 s2 freq   +1724 delay      0
      phc2sys[369.261]: enp4s0 sys offset         4 s2 freq   +1727 delay      0
      phc2sys[369.361]: enp4s0 sys offset         8 s2 freq   +1732 delay      0
      phc2sys[369.461]: enp4s0 sys offset         7 s2 freq   +1733 delay      0
      phc2sys[369.561]: enp4s0 sys offset         4 s2 freq   +1733 delay      0
      phc2sys[369.661]: enp4s0 sys offset         1 s2 freq   +1731 delay      0
      phc2sys[369.761]: enp4s0 sys offset         1 s2 freq   +1731 delay      0
      phc2sys[369.861]: enp4s0 sys offset        -5 s2 freq   +1725 delay      0
      phc2sys[369.961]: enp4s0 sys offset        -4 s2 freq   +1725 delay      0
      phc2sys[370.062]: enp4s0 sys offset         2 s2 freq   +1730 delay      0
      phc2sys[370.162]: enp4s0 sys offset        -7 s2 freq   +1721 delay      0
      phc2sys[370.262]: enp4s0 sys offset        -3 s2 freq   +1723 delay      0
      phc2sys[370.362]: enp4s0 sys offset         1 s2 freq   +1726 delay      0
      phc2sys[370.462]: enp4s0 sys offset        -3 s2 freq   +1723 delay      0
      phc2sys[370.562]: enp4s0 sys offset        -1 s2 freq   +1724 delay      0
      phc2sys[370.662]: enp4s0 sys offset        -4 s2 freq   +1720 delay      0
      phc2sys[370.762]: enp4s0 sys offset        -7 s2 freq   +1716 delay      0
      phc2sys[370.862]: enp4s0 sys offset        -2 s2 freq   +1719 delay      0
      
      Statistics for this run (total of 2179 lines), in nanoseconds:
        average: 0.14
        stdev: 5.03
        max: 48
        min: -27
      
      For reference, the statistics for runs without PCIe congestion show
      that the improvements from enabling PTM are less dramatic. For two
      runs of 16466 entries:
        without PTM: avg -0.04 stdev 10.57 max 39 min -42
        with PTM: avg 0.01 stdev 4.20 max 19 min -16
      
      One possible explanation is that when PTM is not enabled, and there's a lot
      of traffic in the PCIe fabric, some register reads will take more time
      than the others because of congestion on the PCIe fabric.
      
      When PTM is enabled, even if the PTM dialogs take more time to
      complete under heavy traffic, the time measurements do not depend on
      the time to read the registers.
      
      This was implemented following the i225 EAS version 0.993.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      a90ec848
    • Vinicius Costa Gomes's avatar
      igc: Enable PCIe PTM · 1b5d73fb
      Vinicius Costa Gomes authored
      Enable PCIe PTM (Precision Time Measurement) support in the igc
      driver. Notify the PCI devices that PCIe PTM should be enabled.
      
      PCIe PTM is a protocol similar to PTP (Precision Time Protocol) that
      runs in the PCIe fabric. It allows devices to report time measurements
      from their internal clocks together with their correlation to the PCIe
      root clock.
      
      The i225 NIC exposes those time measurements through registers, which
      later patches will use to implement the PTP_SYS_OFFSET_PRECISE
      ioctl().
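
      For context, this is roughly how a userspace program could consume
      PTP_SYS_OFFSET_PRECISE once a driver supports it. It is a sketch under
      assumptions: the helper names ptp_time_to_ns and read_precise_offset
      are invented here, the device path is supplied by the caller, and
      error handling is reduced to a single -1 return.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ptp_clock.h>

/* Convert a struct ptp_clock_time (seconds + nanoseconds) to nanoseconds. */
int64_t ptp_time_to_ns(const struct ptp_clock_time *t)
{
	assert(t != NULL);
	return (int64_t)t->sec * 1000000000LL + (int64_t)t->nsec;
}

/*
 * Ask the PTP character device at 'path' for a cross-timestamp and store
 * (system realtime - device time) in *offset_ns.
 *
 * Returns 0 on success, or -1 if the device cannot be opened or the
 * driver does not implement the ioctl.
 */
int read_precise_offset(const char *path, int64_t *offset_ns)
{
	struct ptp_sys_offset_precise xts;
	int fd, ret;

	assert(path != NULL && offset_ns != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	memset(&xts, 0, sizeof(xts));
	ret = ioctl(fd, PTP_SYS_OFFSET_PRECISE, &xts);
	close(fd);
	if (ret < 0)
		return -1;

	*offset_ns = ptp_time_to_ns(&xts.sys_realtime) -
		     ptp_time_to_ns(&xts.device);
	return 0;
}
```

      A caller would pass a path such as /dev/ptp0 (the index depends on the
      system) and fall back to another measurement method when the call
      returns -1.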
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      1b5d73fb
    • Vinicius Costa Gomes's avatar
      PCI: Add pcie_ptm_enabled() · 014408cd
      Vinicius Costa Gomes authored
      Add a predicate that returns whether PCIe PTM (Precision Time
      Measurement) is enabled.
      
      It will only return true if it's enabled in all the ports in the path
      from the device to the root.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      014408cd
    • Vinicius Costa Gomes's avatar
      Revert "PCI: Make pci_enable_ptm() private" · 1d71eb53
      Vinicius Costa Gomes authored
      Make pci_enable_ptm() accessible from the drivers.
      
      Exposing this to the driver enables the driver to use the
      'ptm_enabled' field of 'pci_dev' to check if PTM is enabled or not.
      
      This reverts commit ac6c26da ("PCI: Make pci_enable_ptm() private").
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      1d71eb53
    • Jakub Kicinski's avatar
      Merge branch 'ethtool-extend-coalesce-uapi' · 3a62c333
      Jakub Kicinski authored
      Yufeng Mo says:
      
      ====================
      ethtool: extend coalesce uAPI
      
      In order to support more configuration in the coalesce uAPI, this
      series extends the coalesce uAPI and adds support for CQE mode.
      
      Below are some test results with the HNS3 driver:
      1. old ethtool(ioctl) + new kernel:
      estuary:/$ ethtool -c eth0
      Coalesce parameters for eth0:
      Adaptive RX: on  TX: on
      stats-block-usecs: 0
      sample-interval: 0
      pkt-rate-low: 0
      pkt-rate-high: 0
      
      rx-usecs: 20
      rx-frames: 0
      rx-usecs-irq: 0
      rx-frames-irq: 0
      
      tx-usecs: 20
      tx-frames: 0
      tx-usecs-irq: 0
      tx-frames-irq: 0
      
      rx-usecs-low: 0
      rx-frame-low: 0
      tx-usecs-low: 0
      tx-frame-low: 0
      
      rx-usecs-high: 0
      rx-frame-high: 0
      tx-usecs-high: 0
      tx-frame-high: 0
      
      2. ethtool(netlink with cqe mode) + kernel without cqe mode:
      estuary:/$ ethtool -c eth0
      Coalesce parameters for eth0:
      Adaptive RX: on  TX: on
      stats-block-usecs: n/a
      sample-interval: n/a
      pkt-rate-low: n/a
      pkt-rate-high: n/a
      
      rx-usecs: 20
      rx-frames: 0
      rx-usecs-irq: n/a
      rx-frames-irq: n/a
      
      tx-usecs: 20
      tx-frames: 0
      tx-usecs-irq: n/a
      tx-frames-irq: n/a
      
      rx-usecs-low: n/a
      rx-frame-low: n/a
      tx-usecs-low: n/a
      tx-frame-low: n/a
      
      rx-usecs-high: 0
      rx-frame-high: n/a
      tx-usecs-high: 0
      tx-frame-high: n/a
      
      CQE mode RX: n/a  TX: n/a
      
      3. ethtool(netlink with cqe mode) + kernel with cqe mode:
      estuary:/$ ethtool -c eth0
      Coalesce parameters for eth0:
      Adaptive RX: on  TX: on
      stats-block-usecs: n/a
      sample-interval: n/a
      pkt-rate-low: n/a
      pkt-rate-high: n/a
      
      rx-usecs: 20
      rx-frames: 0
      rx-usecs-irq: n/a
      rx-frames-irq: n/a
      
      tx-usecs: 20
      tx-frames: 0
      tx-usecs-irq: n/a
      tx-frames-irq: n/a
      
      rx-usecs-low: n/a
      rx-frame-low: n/a
      tx-usecs-low: n/a
      tx-frame-low: n/a
      
      rx-usecs-high: 0
      rx-frame-high: n/a
      tx-usecs-high: 0
      tx-frame-high: n/a
      
      CQE mode RX: off  TX: off
      
      4. ethtool(netlink without cqe mode) + kernel with cqe mode:
      estuary:/$ ethtool -c eth0
      Coalesce parameters for eth0:
      Adaptive RX: on  TX: on
      stats-block-usecs: n/a
      sample-interval: n/a
      pkt-rate-low: n/a
      pkt-rate-high: n/a
      
      rx-usecs: 20
      rx-frames: 0
      rx-usecs-irq: n/a
      rx-frames-irq: n/a
      
      tx-usecs: 20
      tx-frames: 0
      tx-usecs-irq: n/a
      tx-frames-irq: n/a
      
      rx-usecs-low: n/a
      rx-frame-low: n/a
      tx-usecs-low: n/a
      tx-frame-low: n/a
      
      rx-usecs-high: 0
      rx-frame-high: n/a
      tx-usecs-high: 0
      tx-frame-high: n/a
      
      Change log:
      V2 -> V3:
               fix some warning on W=1 builds in #2
      
      V1 -> V2:
               1. fix compile error using allmodconfig in #2
               2. move some property-related modifications from #2 to #1
                  for better review suggested by Jakub Kicinski.
      
      Change log from RFC:
      V3 -> V4:
               add document explaining the difference between CQE and EQE
               in #1 suggested by Jakub Kicinski.
      
      V2 -> V3:
               1. split #1 into adding new parameter and adding new attributes.
               2. use NLA_POLICY_MAX(NLA_U8, 1) instead of NLA_U8.
               3. modify the description of CQE in Document.
      
      V1 -> V2:
               refactor #1&#2 in V1 suggested by Jakub Kicinski.
      ====================
      
      Link: https://lore.kernel.org/r/1629444920-25437-1-git-send-email-moyufeng@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3a62c333
    • Yufeng Mo's avatar
      net: hns3: add ethtool support for CQE/EQE mode configuration · cce1689e
      Yufeng Mo authored
      Add support in ethtool for switching EQE/CQE mode.
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cce1689e
    • Yufeng Mo's avatar
      net: hns3: add support for EQE/CQE mode configuration · 9f0c6f4b
      Yufeng Mo authored
      For devices of version V3 and above, the GL can select EQE or CQE
      mode, so add support for it.
      
      In CQE mode, the coalesced timer will restart when the first new
      completion occurs, while in EQE mode, the timer will not restart.
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      9f0c6f4b
    • Yufeng Mo's avatar
      ethtool: extend coalesce setting uAPI with CQE mode · f3ccfda1
      Yufeng Mo authored
      In order to support more coalesce parameters through netlink,
      add two new parameters, kernel_coal and extack, to .set_coalesce
      and .get_coalesce, so that extra information can be returned to the
      user through the netlink API.
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f3ccfda1
    • Yufeng Mo's avatar
      ethtool: add two coalesce attributes for CQE mode · 029ee6b1
      Yufeng Mo authored
      Currently, many drivers support CQE mode configuration: some configure
      it as a fixed value at initialization, and some provide an interface
      to change it through ethtool private flags. To make it more generic,
      add two new coalesce attributes, 'ETHTOOL_A_COALESCE_USE_CQE_TX' and
      'ETHTOOL_A_COALESCE_USE_CQE_RX', so that these parameters can be
      accessed through the ethtool netlink coalesce uAPI.
      
      Also add a new structure, kernel_ethtool_coalesce, so that new
      parameters can be added to this struct.
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      029ee6b1
    • Jakub Kicinski's avatar
      netdevice: move xdp_rxq within netdev_rx_queue · 95d1d249
      Jakub Kicinski authored
      Both struct netdev_rx_queue and struct xdp_rxq_info are cacheline
      aligned. This causes extra padding before and after the xdp_rxq
      member. Move the member upfront, so that it's naturally aligned.
      
      Before:
      	/* size: 256, cachelines: 4, members: 6 */
      	/* sum members: 160, holes: 1, sum holes: 40 */
      	/* padding: 56 */
      	/* paddings: 1, sum paddings: 36 */
      	/* forced alignments: 1, forced holes: 1, sum forced holes: 40 */
      
      After:
      	/* size: 192, cachelines: 3, members: 6 */
      	/* padding: 32 */
      	/* paddings: 1, sum paddings: 36 */
      	/* forced alignments: 1 */
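
      The effect can be reproduced outside the kernel with a small C11
      sketch. The structs below are hypothetical stand-ins, not the real
      netdev_rx_queue or xdp_rxq_info, and a 64-byte cacheline is assumed.
      They only show how placing an over-aligned member first removes the
      hole that would otherwise open in front of it.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed cacheline size */

/* Stand-in for a cacheline-aligned member such as xdp_rxq_info. */
struct aligned_info {
	_Alignas(CACHELINE) char data[40];
};

/* Aligned member in the middle: the compiler pads up to offset 64. */
struct queue_middle {
	void *a;
	void *b;
	struct aligned_info info;
	void *c;
};

/* Aligned member first: it is naturally aligned, no hole before it. */
struct queue_front {
	struct aligned_info info;
	void *a;
	void *b;
	void *c;
};

/* The aligned member itself is padded to a full cacheline. */
static_assert(sizeof(struct aligned_info) == CACHELINE,
	      "aligned member occupies one cacheline");
/* In the middle, it is pushed to the next cacheline boundary. */
static_assert(offsetof(struct queue_middle, info) == CACHELINE,
	      "hole in front of the aligned member");
static_assert(sizeof(struct queue_middle) == 3 * CACHELINE,
	      "three cachelines with the member in the middle");
static_assert(sizeof(struct queue_front) == 2 * CACHELINE,
	      "two cachelines with the member first");
```

      The toy structs shrink from three cachelines to two, the same kind of
      saving that the pahole output above reports for the real struct.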
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/r/20210823180135.1153608-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      95d1d249
    • Heiner Kallweit's avatar
      r8169: enable ASPM L0s state · 18a9eae2
      Heiner Kallweit authored
      ASPM is disabled completely because we've seen different types of
      problems in the past. However it seems these problems occurred with
      L1 or L1 sub-states only. On all the chip versions I've seen the
      acceptable L0s exit latency is 512ns. This should be short enough not
      to cause problems. If the actual L0s exit latency of the PCIe link
      is bigger than 512ns then the PCI core will disable L0s anyway.
      So let's give it a try and disable L1 and L1 sub-states only.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18a9eae2
    • Yunsheng Lin's avatar
      page_pool: use relaxed atomic for release side accounting · 7fb9b66d
      Yunsheng Lin authored
      There is no need to synchronize the accounting update, so
      use a relaxed atomic to avoid a memory barrier in the
      data path.
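
      As a userspace analogy only (the kernel uses its own atomic API, not
      C11 <stdatomic.h>), a relaxed counter update looks like the sketch
      below. The names pool_stats, pool_note_release and pool_released are
      hypothetical. The point is that the increment stays atomic but imposes
      no ordering on the surrounding memory accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pool's release-side accounting. */
struct pool_stats {
	atomic_uint_fast32_t release_cnt;
};

/*
 * Count one released page. The update is atomic, but memory_order_relaxed
 * asks for no ordering against other loads and stores, so no extra
 * barrier is needed on weakly ordered CPUs.
 */
uint_fast32_t pool_note_release(struct pool_stats *p)
{
	assert(p != NULL);
	/* fetch_add returns the old value; add 1 to report the new count. */
	return atomic_fetch_add_explicit(&p->release_cnt, 1,
					 memory_order_relaxed) + 1;
}

/* Read the current count, again without ordering guarantees. */
uint_fast32_t pool_released(struct pool_stats *p)
{
	assert(p != NULL);
	return atomic_load_explicit(&p->release_cnt, memory_order_relaxed);
}
```

      This is only appropriate when, as the message says, nothing else
      relies on the counter update to order other memory accesses.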
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7fb9b66d
    • David S. Miller's avatar
      Merge branch 'dsa-sw-bridging' · 669f047e
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Plug holes in DSA's software bridging support
      
      Changes in v2:
      - Make sure that leaving an unoffloaded bridge works well too
      - Remove a set but unused variable
      - Tweak a commit message
      
      This series addresses some oddities reported by Alvin while he was
      working on the new rtl8365mb driver (a driver which does not implement
      bridge offloading for now, and relies on software bridging).
      
      First is that DSA behaves, in the lack of a .port_bridge_join method, as
      if the operation succeeds, and does not kick off its internal procedures
      for software bridging (the same procedures that were written for indirect
      software bridging, meaning bridging with an unoffloaded software LAG).
      
      Second is that even after being patched to treat ports with software
      bridging as standalone, we still don't get rid of bridge VLANs: even
      though we have code to ignore them, that code manages to get bypassed.
      This is in fact a recurring issue which was brought up by Tobias
      Waldekranz a while ago, but the solution never made it to the git tree.
      
      After debugging with Florian the last time:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210320225928.2481575-3-olteanv@gmail.com/
      I became very concerned about sending these patches to stable kernels.
      They are relatively large reworks, and they are only tested properly on
      net-next.
      
      A few commands on my test vehicle which has ds->vlan_filtering_is_global
      set to true:
      
      | Nothing is committed to hardware when we add VLAN 100 on a standalone
      | port
      $ ip link add link sw0p2 name sw0p2.100 type vlan id 100
      | When a neighbor port joins a VLAN-aware bridge, VLAN filtering gets
      | enabled globally on the switch. This replays the VLAN 100 from
      | sw0p2.100 and also installs VLAN 1 from the bridge on sw0p0.
      $ ip link add br0 type bridge vlan_filtering 1 && ip link set sw0p0 master br0
      [   97.948087] sja1105 spi2.0: Reset switch and programmed static config. Reason: VLAN filtering
      [   97.957989] sja1105 spi2.0: sja1105_bridge_vlan_add: port 2 vlan 100
      [   97.964442] sja1105 spi2.0: sja1105_bridge_vlan_add: port 4 vlan 100
      [   97.971202] device sw0p0 entered promiscuous mode
      [   97.976129] sja1105 spi2.0: sja1105_bridge_vlan_add: port 0 vlan 1
      [   97.982640] sja1105 spi2.0: sja1105_bridge_vlan_add: port 4 vlan 1
      | We can see that sw0p2, the standalone port, is now filtering because
      | of the bridge
      $ ethtool -k sw0p2 | grep vlan
      rx-vlan-filter: on [fixed]
      | When we make the bridge VLAN-unaware, the 8021q upper sw0p2.100 is
      | uncommitted from hardware. The VLANs managed by the bridge still remain
      | committed to hardware, because they are managed by the bridge.
      $ ip link set br0 type bridge vlan_filtering 0
      [  134.218869] sja1105 spi2.0: Reset switch and programmed static config. Reason: VLAN filtering
      [  134.228913] sja1105 spi2.0: sja1105_bridge_vlan_del: port 2 vlan 100
      | And now the standalone port is not filtering anymore.
      $ ethtool -k sw0p2 | grep vlan
      rx-vlan-filter: off [fixed]
      
      The same test with .port_bridge_join and .port_bridge_leave commented
      out from this driver:
      
      | Not a flinch
      $ ip link add link sw0p2 name sw0p2.100 type vlan id 100
      $ ip link add br0 type bridge vlan_filtering 1 && ip link set sw0p0 master br0
      Warning: dsa_core: Offloading not supported.
      $ ethtool -k sw0p2 | grep vlan
      rx-vlan-filter: off [fixed]
      $ ip link set br0 type bridge vlan_filtering 0
      $ ethtool -k sw0p2 | grep vlan
      rx-vlan-filter: off [fixed]
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      669f047e
    • Vladimir Oltean's avatar
      net: dsa: let drivers state that they need VLAN filtering while standalone · 58adf9dc
      Vladimir Oltean authored
      As explained in commit e358bef7 ("net: dsa: Give drivers the chance
      to veto certain upper devices"), the hellcreek driver uses some tricks
      to comply with the network stack expectations: it enforces port
      separation in standalone mode using VLANs. For untagged traffic,
      bridging between ports is prevented by using different PVIDs, and for
      VLAN-tagged traffic, it never accepts 8021q uppers with the same VID on
      two ports, so packets with one VLAN cannot leak from one port to another.
      
      That is almost fine*, and has worked because hellcreek relied on an
      implicit behavior of the DSA core that was changed by the previous
      patch: the standalone ports declare the 'rx-vlan-filter' feature as 'on
      [fixed]'. Since most of the DSA drivers are actually VLAN-unaware in
      standalone mode, that feature was actually incorrectly reflecting the
      hardware/driver state, so there was a desire to fix it. This leaves the
      hellcreek driver in a situation where it has to explicitly request this
      behavior from the DSA framework.
      
      We configure the ports as follows:
      
      - Standalone: 'rx-vlan-filter' is on. An 8021q upper on top of a
        standalone hellcreek port will go through dsa_slave_vlan_rx_add_vid
        and will add a VLAN to the hardware tables, giving the driver the
        opportunity to refuse it through .port_prechangeupper.
      
      - Bridged with vlan_filtering=0: 'rx-vlan-filter' is off. An 8021q upper
        on top of a bridged hellcreek port will not go through
        dsa_slave_vlan_rx_add_vid, because there will not be any attempt to
        offload this VLAN. The driver already disables VLAN awareness, so that
        upper should receive the traffic it needs.
      
      - Bridged with vlan_filtering=1: 'rx-vlan-filter' is on. An 8021q upper
        on top of a bridged hellcreek port will call dsa_slave_vlan_rx_add_vid,
        and can again be vetoed through .port_prechangeupper.
      
      *It is not actually completely fine, because if I follow through
      correctly, we can have the following situation:
      
      ip link add br0 type bridge vlan_filtering 0
      ip link set lan0 master br0 # lan0 now becomes VLAN-unaware
      ip link set lan0 nomaster # lan0 fails to become VLAN-aware again, therefore breaking isolation
      
      This patch fixes that corner case by extending the DSA core logic, based
      on this requested attribute, to change the VLAN awareness state of the
      switch (port) when it leaves the bridge.
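
      The three port states listed above boil down to a small decision
      function. The sketch below is a hypothetical model, not the DSA core's
      code: the enum, the function name and the flag name
      needs_standalone_vlan_filtering are assumptions made for illustration,
      and the vlan_filtering_is_global case is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical port states for this model. */
enum port_mode {
	PORT_STANDALONE,
	PORT_BRIDGED,
};

/*
 * Decide whether 'rx-vlan-filter' should be on for a port.
 *
 * Standalone ports filter only if the driver asked for it (hellcreek
 * would). Bridged ports simply follow the bridge's vlan_filtering
 * setting.
 */
bool rx_vlan_filter_on(enum port_mode mode, bool bridge_vlan_filtering,
		       bool needs_standalone_vlan_filtering)
{
	assert(mode == PORT_STANDALONE || mode == PORT_BRIDGED);

	if (mode == PORT_STANDALONE)
		return needs_standalone_vlan_filtering;

	return bridge_vlan_filtering;
}
```

      In this model, a port that leaves a VLAN-unaware bridge goes from
      "bridged, filtering off" back to "standalone", where the driver's
      request decides the result again. That is the transition the corner
      case above is about.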
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58adf9dc
    • Vladimir Oltean's avatar
      net: dsa: don't advertise 'rx-vlan-filter' when not needed · 06cfb2df
      Vladimir Oltean authored
      There have been multiple independent reports about
      dsa_slave_vlan_rx_add_vid being called (and consequently calling the
      drivers' .port_vlan_add) when it isn't needed, and sometimes (not
      always) causing problems in the process.
      
      Case 1:
      mv88e6xxx_port_vlan_prepare is stubborn and only accepts VLANs on
      bridged ports. That is understandably so, because standalone mv88e6xxx
      ports are VLAN-unaware, and VTU entries are said to be a scarce
      resource.
      
      Otherwise said, the following fails lamentably on mv88e6xxx:
      
      ip link add br0 type bridge vlan_filtering 1
      ip link set lan3 master br0
      ip link add link lan10 name lan10.1 type vlan id 1
      [485256.724147] mv88e6085 d0032004.mdio-mii:12: p10: hw VLAN 1 already used by port 3 in br0
      RTNETLINK answers: Operation not supported
      
      This has become a worse issue since commit 9b236d2a ("net: dsa:
      Advertise the VLAN offload netdev ability only if switch supports it").
      Up to that point, the driver was returning -EOPNOTSUPP and DSA was
      reconverting that error to 0, making the 8021q upper think all is ok
      (but obviously the error message was there even prior to this change).
      After that change the -EOPNOTSUPP is propagated to vlan_vid_add, and it
      is a hard error.
      
      Case 2:
      Ports that don't offload the Linux bridge (have a dp->bridge_dev = NULL
      because they don't implement .port_bridge_{join,leave}). Understandably,
      a standalone port should not offload VLANs either, it should remain VLAN
      unaware and any VLAN should be a software VLAN (as long as the hardware
      is not quirky, that is).
      
      In fact, dsa_slave_port_obj_add does do the right thing and rejects
      switchdev VLAN objects coming from the bridge when that bridge is not
      offloaded:
      
      	case SWITCHDEV_OBJ_ID_PORT_VLAN:
      		if (!dsa_port_offloads_bridge_port(dp, obj->orig_dev))
      			return -EOPNOTSUPP;
      
      		err = dsa_slave_vlan_add(dev, obj, extack);
      
      But it seems that the bridge is able to trick us. The __vlan_vid_add
      from br_vlan.c has:
      
      	/* Try switchdev op first. In case it is not supported, fallback to
      	 * 8021q add.
      	 */
      	err = br_switchdev_port_vlan_add(dev, v->vid, flags, extack);
      	if (err == -EOPNOTSUPP)
      		return vlan_vid_add(dev, br->vlan_proto, v->vid);
      
      So it says "no, no, you need this VLAN in your life!". And we, naive as
      we are, say "oh, this comes from the vlan_vid_add code path, it must be
      an 8021q upper, sure, I'll take that". And we end up with that bridge
      VLAN installed on our port anyway. But this time, it has the wrong flags:
      if the bridge was trying to install VLAN 1 as a pvid/untagged VLAN,
      failed via switchdev, retried via vlan_vid_add, we have this comment:
      
      	/* This API only allows programming tagged, non-PVID VIDs */
      
      So what we do makes absolutely no sense.
      
      Backtracing a bit, we see the common pattern. We allow the network stack
      to think that our standalone ports are VLAN-aware, but they aren't, for
      the vast majority of switches. The quirky ones should not dictate the
      norm. The dsa_slave_vlan_rx_add_vid and dsa_slave_vlan_rx_kill_vid
      methods exist for drivers that need the 'rx-vlan-filter: on' feature in
      ethtool -k, which can be due to any of the following reasons:
      
      1. vlan_filtering_is_global = true, and some ports are under a
         VLAN-aware bridge while others are standalone, and the standalone
         ports would otherwise drop VLAN-tagged traffic. This is described in
         commit 061f6a50 ("net: dsa: Add ndo_vlan_rx_{add, kill}_vid
         implementation").
      
      2. the ports that are under a VLAN-aware bridge should also set this
         feature, for 8021q uppers having a VID not claimed by the bridge.
         In this case, the driver will essentially not even know that the VID
         is coming from the 8021q layer and not the bridge.
      
      3. Hellcreek. This driver needs it because in standalone mode, it uses
         unique VLANs per port to ensure separation. For separation of untagged
         traffic, it uses different PVIDs for each port, and for separation of
         VLAN-tagged traffic, it never accepts 8021q uppers with the same vid
         on two ports.
      
      If a driver does not fall under any of the above 3 categories, there is
      no reason why it should advertise the 'rx-vlan-filter' feature, therefore
      no reason why it should offload the VLANs added through vlan_vid_add.
      
      This commit fixes the problem by removing the 'rx-vlan-filter' feature
      from the slave devices when they operate in standalone mode, and when
      they offload a VLAN-unaware bridge.
      
      The way it works is that vlan_vid_add will now stop its processing here:
      
      vlan_add_rx_filter_info:
      	if (!vlan_hw_filter_capable(dev, proto))
      		return 0;
      
      So the VLAN will still be saved in the interface's VLAN RX filtering
      list, but because it does not declare VLAN filtering in its features,
      the 8021q module will return zero without committing that VLAN to
      hardware.
      
      This gives the drivers what they want, since it keeps the 8021q VLANs
      away from the VLAN table until VLAN awareness is enabled (point at which
      the ports are no longer standalone, hence in the mv88e6xxx case, the
      check in mv88e6xxx_port_vlan_prepare passes).
      
      Since the issue predates the existence of the hellcreek driver, case 3
      will be dealt with in a separate patch.
      
      The main change that this patch makes is to no longer set
      NETIF_F_HW_VLAN_CTAG_FILTER unconditionally, but toggle it dynamically
      (for most switches, never).
      
      The second part of the patch addresses an issue that the first part
      introduces: because the 'rx-vlan-filter' feature is now dynamically
      toggled, and our .ndo_vlan_rx_add_vid does not get called when
      'rx-vlan-filter' is off, we need to avoid bugs such as the following by
      replaying the VLANs from 8021q uppers every time we enable VLAN
      filtering:
      
      ip link add link lan0 name lan0.100 type vlan id 100
      ip addr add 192.168.100.1/24 dev lan0.100
      ping 192.168.100.2 # should work
      ip link add br0 type bridge vlan_filtering 0
      ip link set lan0 master br0
      ping 192.168.100.2 # should still work
      ip link set br0 type bridge vlan_filtering 1
      ping 192.168.100.2 # should still work but doesn't
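
      The replay idea can be shown with a toy model in plain C. Nothing
      below is kernel code: struct toy_port and the toy_* helpers are
      invented for illustration, and only VLANs from 8021q uppers are
      modeled. A VID is always remembered in software, reaches "hardware"
      only while filtering is on, and is replayed when filtering gets
      enabled later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VIDS 16

/* Hypothetical port: software VLAN list plus a fake hardware table. */
struct toy_port {
	bool hw_filter;			/* models 'rx-vlan-filter' */
	unsigned short soft[MAX_VIDS];	/* VIDs remembered in software */
	size_t n_soft;
	unsigned short hw[MAX_VIDS];	/* VIDs committed to "hardware" */
	size_t n_hw;
};

/* Commit a VID to the fake hardware table, ignoring duplicates. */
static void toy_hw_commit(struct toy_port *p, unsigned short vid)
{
	size_t i;

	for (i = 0; i < p->n_hw; i++)
		if (p->hw[i] == vid)
			return;
	assert(p->n_hw < MAX_VIDS);
	p->hw[p->n_hw++] = vid;
}

/* Add an 8021q upper's VID: always saved, committed only if filtering. */
int toy_vlan_vid_add(struct toy_port *p, unsigned short vid)
{
	assert(p != NULL && p->n_soft < MAX_VIDS);

	p->soft[p->n_soft++] = vid;
	if (!p->hw_filter)
		return 0;	/* like the early return quoted earlier */
	toy_hw_commit(p, vid);
	return 0;
}

/* Toggle filtering; enabling it replays every remembered VID. */
void toy_set_filtering(struct toy_port *p, bool on)
{
	size_t i;

	assert(p != NULL);
	p->hw_filter = on;
	if (!on) {
		p->n_hw = 0;	/* uppers are uncommitted from hardware */
		return;
	}
	for (i = 0; i < p->n_soft; i++)
		toy_hw_commit(p, p->soft[i]);
}

/* Report whether a VID is present in the fake hardware table. */
bool toy_hw_has(const struct toy_port *p, unsigned short vid)
{
	size_t i;

	assert(p != NULL);
	for (i = 0; i < p->n_hw; i++)
		if (p->hw[i] == vid)
			return true;
	return false;
}
```

      Without the replay loop in toy_set_filtering(), VID 100 from the
      command sequence above would never reach the hardware table after
      vlan_filtering is switched to 1, which corresponds to the failing last
      ping.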
      
      As reported by Florian, some drivers look at ds->vlan_filtering in
      their .port_vlan_add() implementation. So this patch also makes sure
      that ds->vlan_filtering is committed before calling the driver. This is
      the reason why it is first committed, then restored on the failure path.
      Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reported-by: Alvin Šipraga <alsi@bang-olufsen.dk>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06cfb2df
    • Vladimir Oltean's avatar
      net: dsa: properly fall back to software bridging · 67b5fb5d
      Vladimir Oltean authored
      If the driver does not implement .port_bridge_{join,leave}, then we must
      fall back to standalone operation on that port, and trigger the error
      path of dsa_port_bridge_join. This sets dp->bridge_dev = NULL.
      
      In turn, having a non-NULL dp->bridge_dev when there is no offloading
      support makes the following things go wrong:
      
      - dsa_default_offload_fwd_mark makes the wrong decision in setting
        skb->offload_fwd_mark. It should set skb->offload_fwd_mark = 0 for
        ports that don't offload the bridge, which should instruct the bridge
        to forward in software. But this does not happen, dp->bridge_dev is
        incorrectly set to point to the bridge, so the bridge is told that
        packets have been forwarded in hardware, which they haven't.
      
      - switchdev objects (MDBs, VLANs) should not be offloaded by ports that
        don't offload the bridge. Standalone ports should behave as packet-in,
        packet-out and the bridge should not be able to manipulate the pvid of
        the port, or tag stripping on egress, or ingress filtering. This
        should already work fine because dsa_slave_port_obj_add has:
      
      	case SWITCHDEV_OBJ_ID_PORT_VLAN:
      		if (!dsa_port_offloads_bridge_port(dp, obj->orig_dev))
      			return -EOPNOTSUPP;
      
      		err = dsa_slave_vlan_add(dev, obj, extack);
      
        but since dsa_port_offloads_bridge_port works based on dp->bridge_dev,
        this is again sabotaging us.
      
      All of the above already works when the port has an unoffloaded LAG
      interface, so this is well-exercised code; we should apply it to plain
      unoffloaded bridge ports too.
      Reported-by: Alvin Šipraga <alsi@bang-olufsen.dk>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67b5fb5d
    • Vladimir Oltean's avatar
      net: dsa: don't call switchdev_bridge_port_unoffload for unoffloaded bridge ports · 09dba21b
      Vladimir Oltean authored
      For ports that have a NULL dp->bridge_dev, dsa_port_to_bridge_port()
      also returns NULL as expected.
      
      Issue #1 is that we are performing a NULL pointer dereference on brport_dev.
      
      Issue #2 is that these are ports on which switchdev_bridge_port_offload
      has not been called, so we should not call switchdev_bridge_port_unoffload
      on them either.
      
      Both issues are addressed by checking against a NULL brport_dev in
      dsa_port_pre_bridge_leave and exiting early.
      
      Fixes: 2f5dc00f ("net: bridge: switchdev: let drivers inform which bridge ports are offloaded")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09dba21b