1. 16 Aug, 2021 8 commits
    • net: pcs: xpcs: Add Pause Mode support for SGMII and 2500BaseX · 849d2f83
      Wong Vee Khee authored
      SGMII and 2500BaseX support Pause frames as defined in the IEEE 802.3x
      flow control standard.

      Add this as a supported feature under the xpcs_sgmii_features struct.

      Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'pktgen-samples' · 5fa5fb8b
      David S. Miller authored
      samples: pktgen: enhance the usability of pktgen samples
      
      This patchset improves the usability of the pktgen samples by adding an
      option that propagates a normal user's environment variables to sudo. It
      also adds the missing IPv6 option to the pktgen scripts.

      Currently, all pktgen samples can take environment variables instead of
      option parameters. However, this does not work when the samples are run
      as a normal user.

      These are the results of running a sample as root and as a normal user:
      
          // running as root
          # DEV=eth0 DEST_IP=10.1.0.1 DST_MAC=00:11:22:33:44:55 ./pktgen_sample01_simple.sh -v -n 1
          Running... ctrl^C to stop
      
          // running as normal user
          $ DEV=eth0 DEST_IP=10.1.0.1 DST_MAC=00:11:22:33:44:55 ./pktgen_sample01_simple.sh -v -n 1
          [...]
          ERROR: Please specify output device
      
      Passing environment variables does not work when running the samples as a
      normal user because the user's environment is not propagated to sudo (see
      root_check_run_with_sudo). The first commit solves this by using the "-E"
      (--preserve-env) option of "sudo", which passes the normal user's existing
      environment variables.

      Also, "sample04" and "sample05" do not work properly when run with the IPv6
      option ("-6"), because commit 0f06a678 ("samples: Add an IPv6 "-6" option to
      the pktgen scripts") omitted this option from these samples. The second
      commit adds the missing IPv6 option to these scripts.
      
      ====================
      
      Fixes: 0f06a678 ("samples: Add an IPv6 "-6" option to the pktgen scripts")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples: pktgen: add missing IPv6 option to pktgen scripts · 0f0c4f1b
      Juhee Kang authored
      Currently, "sample04" and "sample05" do not work properly when run
      with the IPv6 option ("-6"). Commit 0f06a678 ("samples: Add an IPv6
      "-6" option to the pktgen scripts") omitted this option from
      "sample04" and "sample05".

      To support the IPv6 option, this commit adds the IPv6-related logic
      to both scripts.

      Fixes: 0f06a678 ("samples: Add an IPv6 "-6" option to the pktgen scripts")
      Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples: pktgen: pass the environment variable of normal user to sudo · 7caeabd7
      Juhee Kang authored
      All pktgen samples can use environment variables instead of option
      parameters (e.g. $DEV can be used instead of the '-i' option).

      These are the results of running a sample as root and as a normal user:
      
          // running as root
          # DEV=eth0 DEST_IP=10.1.0.1 DST_MAC=00:11:22:33:44:55 ./pktgen_sample01_simple.sh -v -n 1
          Running... ctrl^C to stop
      
          // running as normal user
          $ DEV=eth0 DEST_IP=10.1.0.1 DST_MAC=00:11:22:33:44:55 ./pktgen_sample01_simple.sh -v -n 1
          [...]
          ERROR: Please specify output device
      
      These results show that the sample does not work properly when run as
      a normal user. The sample is restarted by the function
      root_check_run_with_sudo to run with sudo, and in this process the
      normal user's environment variables are not propagated to sudo.

      This can be solved by using the "-E" (--preserve-env) option of "sudo",
      which preserves the normal user's existing environment variables. This
      commit adds the "-E" option in root_check_run_with_sudo.
      Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipq-mdio' · cbbb7abd
      David S. Miller authored
      Luo Jie says:
      
      ====================
      net: mdio: Add IPQ MDIO reset related function
      
      This patch series adds the MDIO reset feature, which includes
      configuring the MDIO clock source frequency and indicating to CMN_PLL
      that the ethernet LDO is ready. This ethernet LDO is specific to the
      IPQ5018 platform.

      Also specify the additional chipsets (IPQ40xx, IPQ807x, IPQ60xx and
      IPQ50xx) supported by this MDIO driver.
      
      Changes in v3:
      	* simplify the function ipq_mdio_reset.
      
      Changes in v2:
      	* Addressed review comments (Andrew Lunn).
      	* Remove the IS_ERR().
      	* make binding patch part of series.
      	* document the property 'reg' and 'clock'.
      
      Changes in v1:
      	* make MDIO_IPQ4019 unchanged for backwards compatibility.
      	* remove the PHY reset functions
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: Add the properties for ipq4019 MDIO · 2a4c32e7
      Luo Jie authored
      The newly added "reg" resource property is for configuring the
      ethernet LDO in the IPQ5018 chipset, and the "clocks" property
      is for configuring the MDIO clock source frequency.
      Signed-off-by: Luo Jie <luoj@codeaurora.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • MDIO: Kconfig: Specify more IPQ chipset supported · c76ee263
      Luo Jie authored
      The IPQ MDIO driver currently supports the IPQ40xx, IPQ807x,
      IPQ60xx and IPQ50xx chipsets.

      Add the compatible 'qcom,ipq5018-mdio' because the ethernet LDO is
      specific to the IPQ5018 platform.
      Signed-off-by: Luo Jie <luoj@codeaurora.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdio: Add the reset function for IPQ MDIO driver · 23a890d4
      Luo Jie authored
      1. Configure the MDIO clock source frequency.
      2. The LDO resource is needed to make the ethernet LDO available
      to CMN_PLL.
      Signed-off-by: Luo Jie <luoj@codeaurora.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Aug, 2021 32 commits
    • MAINTAINERS: Remove the ipx network layer info · e4637f62
      Cai Huoqing authored
      Commit 47595e32 ("MAINTAINERS: Mark some staging directories")
      marked the ipx network layer as obsolete in the MAINTAINERS file
      in Jan 2018.

      Now, after three years of exposure to refactoring, remove the
      ipx network layer info from MAINTAINERS. Additionally, no module
      depends on ipx.h except a broken staging driver (r8188eu).
      Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Remove net/ipx.h and uapi/linux/ipx.h header files · 6c9b4084
      Cai Huoqing authored
      Commit 47595e32 ("MAINTAINERS: Mark some staging directories")
      marked the ipx network layer as obsolete in the MAINTAINERS file
      in Jan 2018.

      Now, after three years of exposure to refactoring, delete the
      uapi/linux/ipx.h and net/ipx.h header files for good. Additionally,
      no module depends on ipx.h except a broken staging driver (r8188eu).
      Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'iupa-last-things-before-pm-conversion' · fda4e19d
      David S. Miller authored
      Alex Elder says:
      
      ====================
      net: ipa: last things before PM conversion
      
      This series contains a few remaining changes needed before fully
      switching over to using runtime power management rather than the
      previous "IPA clock" mechanism.
      
      The first patch moves the calls to enable and disable the IPA
      interrupt as a system wakeup interrupt into "ipa_clock.c" with the
      rest of the power-related code.
      
      The second adds a flag to make it possible to distinguish runtime
      suspend from system suspend.
      
      The third and fourth patches arrange for the ->start_xmit path to
      resume hardware if necessary, to ensure it is powered.  If power is
      not active, the TX queue is stopped, and arrangements are made for
      the queue to be restarted once hardware power is active again.
      
      The fifth patch keeps the TX queue active during suspend.  This
      isn't necessary for system suspend but it's important for runtime
      suspend.
      
      And the last patch makes it so we don't hold the hardware active
      while the modem network device is open.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: don't hold clock reference while netdev open · 8dc181f2
      Alex Elder authored
      Currently a clock reference is taken whenever the ->ndo_open
      callback for the modem netdev is called.  That reference is dropped
      when the device is closed, in ipa_stop().
      
      We no longer need this, because ipa_start_xmit() now handles the
      situation where the hardware power state is not active.
      
      Drop the clock reference in ipa_open() when we're done, and take a
      new reference in ipa_stop() before we begin closing the interface.
      
      Finally (and unrelated, but trivial), change the return type of
      ipa_start_xmit() to be netdev_tx_t instead of int.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: don't stop TX on suspend · 8dcf8bb3
      Alex Elder authored
      Currently we stop the modem netdev transmit queue when suspending
      the hardware.  For system suspend this ensured we'd never attempt
      to transmit while attempting to suspend the modem endpoints.
      
      For runtime suspend, the IPA hardware might get suspended while the
      system is operating.  In that case we want an attempt to transmit a
      packet to cause the hardware to resume if necessary.  But if we
      disable the queue this cannot happen.
      
      So stop disabling the queue on suspend.  In case we end up disabling
      it in ipa_start_xmit() (see the previous commit), we still arrange
      to start the TX queue on resume.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: ensure hardware has power in ipa_start_xmit() · 6b51f802
      Alex Elder authored
      We need to ensure the hardware is powered when we transmit a packet.
      But if it's not, we can't block to wait for it.  So asynchronously
      request power in ipa_start_xmit(), and only proceed if the return
      value indicates the power state is active.
      
      If the hardware is not active, a runtime resume request will have
      been initiated.  In that case, stop the network stack from further
      transmit attempts until the resume completes.  Return NETDEV_TX_BUSY,
      to retry sending the packet once the queue is restarted.
      
      If the power request returns an error (other than -EINPROGRESS,
      which just means a resume requested elsewhere isn't complete), just
      drop the packet.
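
      A minimal sketch of the decision described above, written as plain C
      outside the kernel. The enum and function names are hypothetical, and
      the return convention (1 for already active, 0 or -EINPROGRESS for a
      pending resume, other negative values for errors) is an assumption
      modelled on pm_runtime_get(); the actual driver code may differ.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes for one transmit attempt. */
enum tx_action {
	TX_SEND,           /* power is active: transmit now */
	TX_STOP_AND_RETRY, /* resume pending: stop the queue, report busy */
	TX_DROP            /* resume will not happen: drop the packet */
};

/*
 * Map the result of an asynchronous power request to an action.
 * Assumed convention: 1 = already active, 0 or -EINPROGRESS = a resume
 * is pending, any other negative value = error.
 */
static enum tx_action xmit_action(int power_ret)
{
	assert(power_ret <= 1); /* larger values are not expected here */

	if (power_ret == 1)
		return TX_SEND;
	if (power_ret == 0 || power_ret == -EINPROGRESS)
		return TX_STOP_AND_RETRY;
	return TX_DROP;
}
```

      Under these assumptions, TX_STOP_AND_RETRY corresponds to stopping
      the TX queue and returning NETDEV_TX_BUSY, so the packet is retried
      once the queue is restarted.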
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: re-enable transmit in PM WQ context · a96e73fa
      Alex Elder authored
      Create a new work structure in the modem private data, and use it to
      re-enable the modem network device transmit queue when resuming.
      
      This is needed by the next patch, which stops the TX queue if IPA
      power isn't active when a transmit request arrives.  Packets will
      start arriving the instant the TX queue is enabled, but resuming
      isn't complete until ipa_modem_resume() returns.  This way we're
      sure to be resumed before transmits are allowed again.
      
      Cancel it before calling ipa_stop() in ipa_modem_stop() to ensure
      the transmit queue restart completes before it gets stopped there.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: distinguish system from runtime suspend · b9c532c1
      Alex Elder authored
      Add a new flag that is set when the hardware is suspended due to a
      system suspend operation, distinguishing it from runtime suspend.
      Use it in the SUSPEND IPA interrupt handler to determine whether to
      trigger a system resume because of the event.  Define new suspend
      and resume power management callback functions to set and clear the
      new flag, respectively.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: enable wakeup in ipa_power_setup() · d430fe4b
      Alex Elder authored
      Move the call to enable the IPA interrupt as a wakeup interrupt into
      ipa_power_setup(), and disable it in ipa_power_teardown().
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridgge-mcast' · 8db102a6
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: mcast: dump querier state
      
      This set adds the ability to dump the current multicast querier state.
      This is extremely useful when debugging multicast issues; we've had
      many cases of unexpected queriers causing strange behaviour and mcast
      test failures.

      The first patch changes the querier struct to record a port device's
      ifindex instead of a pointer to the port itself so we can later
      retrieve it. I chose this way because it's much simpler and doesn't
      require querier port ref counting; it is best effort anyway.

      Patch 02 makes the querier address/port updates consistent via a
      combination of multicast_lock and seqcount, so readers only need the
      seqcount to get a consistent snapshot of address and port.

      Patch 03 is a minor cleanup in preparation for the dump support. It
      consolidates the IPv4 and IPv6 querier selection paths, as they share
      most of the logic (except address comparisons, of course).

      Finally, the last three patches add the new querier state dumping
      support. For the bridge's global multicast context we embed the
      BRIDGE_QUERIER_xxx attributes into IFLA_BR_MCAST_QUERIER_STATE, and for
      the per-vlan global mcast contexts we embed them into
      BRIDGE_VLANDB_GOPTS_MCAST_QUERIER_STATE.
      
      The structure is:
        [IFLA_BR_MCAST_QUERIER_STATE / BRIDGE_VLANDB_GOPTS_MCAST_QUERIER_STATE]
        `[BRIDGE_QUERIER_IP_ADDRESS] - ip address of the querier
        `[BRIDGE_QUERIER_IP_PORT]    - bridge port ifindex where the querier was
                                       seen (set only if external querier)
        `[BRIDGE_QUERIER_IP_OTHER_TIMER]   -  other querier timeout
        `[BRIDGE_QUERIER_IPV6_ADDRESS] - ip address of the querier
        `[BRIDGE_QUERIER_IPV6_PORT]    - bridge port ifindex where the querier
                                         was seen (set only if external querier)
        `[BRIDGE_QUERIER_IPV6_OTHER_TIMER]   -  other querier timeout
      
      Later we can also add IGMP version of seen queriers and last seen values
      from the queries.
      ====================
    • net: bridge: vlan: dump mcast ctx querier state · ddc649d1
      Nikolay Aleksandrov authored
      Use the new mcast querier state dump infrastructure and export vlans'
      mcast context querier state embedded in attribute
      BRIDGE_VLANDB_GOPTS_MCAST_QUERIER_STATE.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: mcast: dump ipv6 querier state · 85b41082
      Nikolay Aleksandrov authored
      Add support for dumping global IPv6 querier state. We dump the state
      only if our own querier is enabled or there has been another external
      querier which has won the election. For the bridge global state we use
      a new attribute IFLA_BR_MCAST_QUERIER_STATE and embed the state inside.
      The structure is:
        [IFLA_BR_MCAST_QUERIER_STATE]
         `[BRIDGE_QUERIER_IPV6_ADDRESS] - ip address of the querier
         `[BRIDGE_QUERIER_IPV6_PORT]    - bridge port ifindex where the querier
                                          was seen (set only if external querier)
         `[BRIDGE_QUERIER_IPV6_OTHER_TIMER]   -  other querier timeout
      
      IPv4 and IPv6 attributes are embedded at the same level of
      IFLA_BR_MCAST_QUERIER_STATE. If we didn't dump anything we cancel the nest
      and return.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: mcast: dump ipv4 querier state · c7fa1d9b
      Nikolay Aleksandrov authored
      Add support for dumping global IPv4 querier state. We dump the state
      only if our own querier is enabled or there has been another external
      querier which has won the election. For the bridge global state we use
      a new attribute IFLA_BR_MCAST_QUERIER_STATE and embed the state inside.
      The structure is:
       [IFLA_BR_MCAST_QUERIER_STATE]
        `[BRIDGE_QUERIER_IP_ADDRESS] - ip address of the querier
        `[BRIDGE_QUERIER_IP_PORT]    - bridge port ifindex where the querier was
                                       seen (set only if external querier)
        `[BRIDGE_QUERIER_IP_OTHER_TIMER]   -  other querier timeout
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: mcast: consolidate querier selection for ipv4 and ipv6 · c3fb3698
      Nikolay Aleksandrov authored
      We can consolidate both functions as they share almost the same logic.
      This is easier to maintain and we have a single querier update function.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: mcast: make sure querier port/address updates are consistent · 67b746f9
      Nikolay Aleksandrov authored
      Use a sequence counter to make sure port/address updates can be read
      consistently without requiring the bridge multicast_lock. We need to
      zero out the port and address when the other querier has expired and
      we're about to select ourselves as querier. br_multicast_read_querier
      will be used later when dumping querier state. Updates are done only
      with the multicast spinlock and softirqs disabled, while reads are done
      from process context and from softirqs (due to notifications).
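
      The sequence counter idea can be sketched in userspace C11 as shown
      here. This is not the kernel's seqcount API: the struct and function
      names are hypothetical, plain sequentially consistent atomics stand
      in for the kernel primitives, and writers are assumed to be
      serialized by an external lock (the role multicast_lock plays above).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical querier record guarded by a sequence counter. */
struct querier {
	atomic_uint seq;          /* even: stable, odd: update in progress */
	_Atomic uint32_t addr;    /* querier address */
	atomic_int port_ifindex;  /* port where the querier was seen */
};

/* Writer side: callers must already hold the (assumed) writer lock. */
static void querier_update(struct querier *q, uint32_t addr, int ifindex)
{
	unsigned int s = atomic_load(&q->seq);

	assert((s & 1u) == 0);            /* no other update in flight */
	atomic_store(&q->seq, s + 1);     /* mark the update as started */
	atomic_store(&q->addr, addr);
	atomic_store(&q->port_ifindex, ifindex);
	atomic_store(&q->seq, s + 2);     /* mark the update as finished */
}

/* Reader side: lock-free, retries until it sees a stable snapshot. */
static void querier_read(struct querier *q, uint32_t *addr, int *ifindex)
{
	unsigned int start, end;

	do {
		start = atomic_load(&q->seq);
		*addr = atomic_load(&q->addr);
		*ifindex = atomic_load(&q->port_ifindex);
		end = atomic_load(&q->seq);
	} while ((start & 1u) || start != end);
}
```

      A reader never blocks the writer; it only repeats the read if an
      update overlapped with it, which is why readers do not need the lock.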
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: mcast: record querier port device ifindex instead of pointer · bb18ef8e
      Nikolay Aleksandrov authored
      Currently when a querier port is detected its net_bridge_port pointer is
      recorded, but it's used only for comparisons, so it's fine to have a
      stale pointer. In order to dereference and use the port pointer, proper
      accounting of its usage would have to be implemented, adding unnecessary
      complexity. To solve the problem we can just store the netdevice ifindex
      instead of the port pointer and use it to retrieve the bridge port. This
      is best effort, and the device must be validated as still being part of
      that bridge before use, but that is a small price to pay for avoiding
      querier reference counting for each port/vlan.
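
      A rough illustration of the "store an index, validate on use" idea,
      in plain C. The struct layout, the bridge_id field and the function
      name are hypothetical and only stand in for the real bridge port
      lookup; the point is that a stale index yields NULL rather than a
      dangling pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical port table entry. */
struct port {
	int ifindex;   /* interface index recorded for the querier */
	int bridge_id; /* bridge the port currently belongs to */
};

/*
 * Resolve a recorded ifindex back to a port, and accept it only if the
 * port is still a member of the expected bridge.
 */
static const struct port *querier_port(const struct port *ports, size_t n,
				       int ifindex, int bridge_id)
{
	assert(ports != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (ports[i].ifindex != ifindex)
			continue;
		/* Found the device: validate bridge membership before use. */
		return ports[i].bridge_id == bridge_id ? &ports[i] : NULL;
	}
	return NULL; /* device no longer exists */
}
```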
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'devlink-cleanup-for-delay-event' · 2fa16787
      David S. Miller authored
      Leon Romanovsky says:
      
      ====================
      Devlink cleanup for delay event series
      
      Jakub's request to make sure that devlink events are delayed and not
      printed until they are fully accessible [1] requires us to implement a
      delayed event notification system in devlink.

      In order to do it, I moved some of my patches (xarray etc.) from the future
      series to be before "Move devlink_register to be near devlink_reload_enable" [2].

      That allows us to rely on the DEVLINK_REGISTERED xarray mark to decide
      whether to print an event or not.

      The other patches are simple cleanups which are needed anyway.
      
      [1] https://lore.kernel.org/lkml/20210811071817.4af5ab34@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com
      [2] https://lore.kernel.org/lkml/cover.1628599239.git.leonro@nvidia.com
      
      Next in the queue:
       * Delay event series
       * Move devlink_register to be near devlink_reload_enable
       * Extension of devlink_ops to be set dynamically
       * devlink_reload_* delete
       * Devlink locks rework to use xarray and reference counting
       * ????
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: remove always exist devlink pointer check · a1fcb106
      Leon Romanovsky authored
      The devlink pointer always exists after hclge_devlink_init() succeeds.
      Remove that check together with the NULL setting after release, and ensure
      that devlink_register is the last command prior to the call to
      devlink_reload_enable().

      Fixes: b741269b ("net: hns3: add support for registering devlink for PF")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Clear whole devlink_flash_notify struct · ed43fbac
      Leon Romanovsky authored
      The { 0 } doesn't clear all fields in the struct, but tells the
      compiler to set all fields to zero and doesn't touch any sub-fields
      if they exist.

      The {} is an empty initialiser that instructs the compiler to fully
      initialize the whole struct including sub-fields, which is error-prone
      for future devlink_flash_notify extensions.

      Fixes: 6700acc5 ("devlink: collect flash notify params into a struct")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Use xarray to store devlink instances · 11a861d7
      Leon Romanovsky authored
      We can use an xarray instead of linearly organized linked lists for the
      devlink instances. This will let us revise the locking scheme in favour
      of the internal xarray locking that protects the database.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Count struct devlink consumers · 437ebfd9
      Leon Romanovsky authored
      The struct devlink itself is protected by an internal lock and doesn't
      need the global lock during operation. That global lock is used to protect
      the addition/removal of devlink instances to and from the global list in
      use by all devlink consumers in the system.

      The future conversion of the linked list to an xarray will allow us to
      actually delete that lock, but first we need to count all struct devlink
      users.

      Reference counting gives us a way to ensure that no new user space
      command succeeds in grabbing a devlink instance that is about to be
      destroyed, which makes it safe to access the instance without the lock.
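
      The "no new user can grab a dying instance" property is commonly
      obtained with an increment-unless-zero reference count. A minimal
      userspace C11 sketch of that pattern follows; the struct and
      function names are hypothetical and this is not the devlink code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted instance. */
struct instance {
	atomic_int refcount; /* starts at 1 for the registering owner */
};

/* Take a reference unless the count has already dropped to zero. */
static bool instance_try_get(struct instance *inst)
{
	int old = atomic_load(&inst->refcount);

	while (old > 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&inst->refcount, &old, old + 1))
			return true;
	}
	return false; /* instance is being destroyed */
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool instance_put(struct instance *inst)
{
	int old = atomic_fetch_sub(&inst->refcount, 1);

	assert(old > 0); /* a put without a matching get is a bug */
	return old == 1;
}
```

      Once the count reaches zero, instance_try_get() fails for every
      later caller, so whoever dropped the last reference can tear the
      instance down without holding a global lock.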
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Remove check of always valid devlink pointer · 7ca973dc
      Leon Romanovsky authored
      Devlink objects are accessible only after they were registered and
      have valid devlink_*->devlink pointers.

      Remove that check and simplify respective fill functions as an outcome
      of such change.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Simplify devlink_pernet_pre_exit call · cbf6ab67
      Leon Romanovsky authored
      devlink_pernet_pre_exit() is called when a net namespace exits.

      That routine is relevant only for devlink instances that were first
      assigned to that namespace. This assignment is possible only with the
      following command: "devlink reload DEV netns ...", which already checks
      reload support.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mptcp-improve-backup-subflows' · 38e3bfa8
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: Improve use of backup subflows
      
      Multipath TCP combines multiple TCP subflows into one stream, and the
      MPTCP-level socket must decide which subflow to use when sending (or
      resending) chunks of data. The choice of the "best" subflow to transmit
      on can vary depending on the priority (normal or backup) for each
      subflow and how well the subflow is performing.
      
      In order to improve MPTCP performance when some subflows are failing,
      this patch set changes how backup subflows are utilized and introduces
      tracking of "stale" subflows that are still connected but not making
      progress.
      
      Patch 1 adjusts MPTCP-level retransmit timeouts to use data from all
      subflows.
      
      Patch 2 makes MPTCP-level retransmissions less aggressive to avoid
      resending data that's still queued at the TCP level.
      
      Patch 3 changes the way pending data is handled when subflows are
      closed. Unacked MPTCP-level data still in the subflow tx queue is
      immediately moved to another subflow for transmission instead of waiting
      for MPTCP-level timeouts to trigger retransmission.
      
      Patch 4 has some sysctl code cleanup.
      
      Patches 5 and 6 add tracking of "stale" subflows, so only underlying TCP
      subflow connections that appear to be making progress are considered
      when selecting a subflow to (re)transmit data. How fast a subflow goes
      stale is configurable with a per-namespace sysctl. Related MIBS are
      added too.
      
      Patch 7 makes sure the backup flag is always correctly recorded when the
      MP_JOIN SYN/ACK is received for an added subflow.
      
      Patch 8 adds more test cases for backup subflows and stale subflows.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: add testcase for active-back · 7d1e6f16
      Paolo Abeni authored
      Add more test cases for link failure scenarios,
      including recovery from link failure using only
      backup subflows and bi-directional transfer.

      Additionally, explicitly check the stale count.
      Co-developed-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: backup flag from incoming MPJ ack option · 0460ce22
      Paolo Abeni authored
      The parsed incoming backup flag is not propagated
      to the subflow itself, so the client may end up using it
      to send data.

      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/191
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add mibs for stale subflows processing · fc1b4e3b
      Paolo Abeni authored
      This allows monitoring exceptional events like
      active backup scenarios.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: faster active backup recovery · ff5a0b42
      Paolo Abeni authored
      The msk can use backup subflows to transmit in-sequence data
      only if there are no other active subflows. In an active backup
      scenario, the MPTCP connection can make forward progress only
      through MPTCP retransmissions, since rtx can pick backup subflows.

      This patch introduces a new flag for MPTCP subflows: if the
      underlying TCP connection has made no progress for a long time,
      and there are other less problematic subflows available, the
      given subflow becomes stale.

      Stale subflows are not considered active: if all non-backup
      subflows become stale, the MPTCP scheduler can pick backup
      subflows for plain transmissions.

      Stale subflows can return to the active state as soon as any reply
      from the peer is observed.

      Active backup scenarios can now leverage the available b/w
      with no restriction.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff5a0b42
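
The stale-subflow rule described in the commit above can be sketched in a few lines of C. This is a simplified illustration only, not the kernel code: the struct and function names (`sched_subflow_sketch`, `pick_send_subflow_sketch`, `on_peer_reply_sketch`) are hypothetical, and the real scheduler also weighs other criteria that are ignored here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a subflow (not the kernel's struct). */
struct sched_subflow_sketch {
    bool backup; /* subflow is flagged as backup */
    bool stale;  /* no progress for a long time */
};

/* Any reply from the peer brings a stale subflow back to the active state. */
void on_peer_reply_sketch(struct sched_subflow_sketch *sf)
{
    sf->stale = false;
}

/*
 * Pick a subflow for plain transmission: prefer the first non-stale,
 * non-backup subflow; fall back to the first non-stale backup subflow
 * only when every non-backup subflow is stale. Returns -1 if none fits.
 */
int pick_send_subflow_sketch(const struct sched_subflow_sketch *sf, size_t n)
{
    int backup = -1;

    for (size_t i = 0; i < n; i++) {
        if (sf[i].stale)
            continue; /* stale subflows are not considered active */
        if (!sf[i].backup)
            return (int)i;
        if (backup < 0)
            backup = (int)i;
    }
    return backup;
}
```

With one regular and one backup subflow, the sketch picks the regular one until it is marked stale, then the backup, and returns to the regular one once a peer reply clears the flag.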
    • Paolo Abeni's avatar
      mptcp: cleanup sysctl data and helpers · 6da14d74
      Paolo Abeni authored
      Reorder the data in mptcp_pernet to avoid wasting space
      for no reason, and constify the access helpers.
      
      No functional changes intended.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6da14d74
    • Paolo Abeni's avatar
      mptcp: handle pending data on closed subflow · 1e1d9d6f
      Paolo Abeni authored
      The PM can close an active subflow, e.g. due to an ingress RM_ADDR
      option. Such a subflow could carry data still unacked at the
      MPTCP level, both in the write and the rtx_queue, which has
      never reached the other peer.
      
      Currently the MPTCP-level retransmission will deliver such data,
      but at a very low rate (at most 1 DSM for each MPTCP rtx interval).
      
      We can speed up the recovery a lot by moving all the unacked data in the
      tcp write_queue, so that it will be pushed again via other
      subflows, at the speed allowed by them.
      
      Also make the new helper available for later patches.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e1d9d6f
    • Paolo Abeni's avatar
      mptcp: less aggressive retransmission strategy · 71b7dec2
      Paolo Abeni authored
      The current MPTCP re-inject strategy is very aggressive:
      we have MPTCP-level retransmissions even on single-subflow
      connections, if the link in use is lossy.
      
      Let's be a little more conservative: we retransmit
      only if at least one subflow has empty write and rtx queues.
      
      Additionally, use the backup subflows only if the active
      subflows are stale (no progress in at least an rtx period),
      and ignore stale subflows for the rtx timeout update.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71b7dec2
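
As a rough illustration of the more conservative rule in the commit above, the following C sketch selects a subflow for MPTCP-level retransmission. It is an assumption-laden simplification, not the kernel implementation: `rtx_subflow_sketch` and `pick_rtx_subflow_sketch` are hypothetical names, and staleness is reduced to a single boolean per subflow.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a subflow (not the kernel's struct). */
struct rtx_subflow_sketch {
    bool backup;       /* subflow is flagged as backup */
    bool stale;        /* no progress in at least an rtx period */
    bool queues_empty; /* write queue and rtx queue are both empty */
};

/*
 * Return the index of the subflow to use for an MPTCP-level
 * retransmission, or -1 when no retransmission should happen.
 * A subflow qualifies only if its write and rtx queues are empty;
 * a backup subflow is used only if no non-backup subflow is still
 * making progress (i.e. all of them are stale).
 */
int pick_rtx_subflow_sketch(const struct rtx_subflow_sketch *sf, size_t n)
{
    int backup = -1;
    bool active_alive = false; /* some non-backup, non-stale subflow exists */

    for (size_t i = 0; i < n; i++) {
        if (sf[i].stale)
            continue;
        if (!sf[i].backup)
            active_alive = true;
        if (!sf[i].queues_empty)
            continue; /* data still outstanding at TCP level: skip */
        if (!sf[i].backup)
            return (int)i;
        if (backup < 0)
            backup = (int)i;
    }
    return active_alive ? -1 : backup;
}
```

In this sketch, a single lossy subflow with data still queued yields -1, so no MPTCP-level retransmission is triggered, which matches the single-subflow case the commit message calls out.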
    • Paolo Abeni's avatar
      mptcp: more accurate timeout · 33d41c9c
      Paolo Abeni authored
      As reported by Maxim, we have a lot of MPTCP-level
      retransmissions when multiple links with different latencies
      are in use.
      
      This patch refactors the MPTCP-level timeout accounting so that
      the maximum of all the active subflow timeouts is used. To avoid
      traversing the subflow list multiple times, the update is
      performed inside the packet scheduler.
      
      Additionally, clean up timeout handling a bit.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33d41c9c
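
The timeout accounting described in the commit above (use the maximum timeout across active subflows) can be sketched as follows. This is a minimal illustration under stated assumptions: `subflow_sketch`, `msk_timeout_sketch` and the `fallback` parameter are hypothetical, timeouts are plain tick counts, and the sketch uses a standalone loop for clarity, whereas the commit performs the update inside the packet scheduler to avoid a second traversal.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a subflow (not the kernel's struct). */
struct subflow_sketch {
    int active;            /* nonzero if the subflow is active */
    unsigned long timeout; /* per-subflow timeout, in arbitrary ticks */
};

/*
 * Return the largest timeout among the active subflows, so that a slow
 * link is not retransmitted on the schedule of a faster one.
 * Returns `fallback` when no subflow is active.
 */
unsigned long msk_timeout_sketch(const struct subflow_sketch *sf, size_t n,
                                 unsigned long fallback)
{
    unsigned long max = 0;
    int found = 0;

    for (size_t i = 0; i < n; i++) {
        if (!sf[i].active)
            continue; /* inactive subflows do not affect the timeout */
        if (!found || sf[i].timeout > max)
            max = sf[i].timeout;
        found = 1;
    }
    return found ? max : fallback;
}
```

For two active links with timeouts of 20 and 200 ticks, the sketch returns 200, so the MPTCP-level timer does not fire while the slower link is still within its own timeout.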