1. 11 Nov, 2019 13 commits
    • Vladimir Oltean's avatar
      net: mscc: ocelot: change prototypes of switchdev port attribute handlers · 4bda1415
      Vladimir Oltean authored
      This is needed so that the Felix DSA front-end can call the Ocelot
      implementations.
      
      The implementation of the "mc_disabled" switchdev attribute has also
      been simplified by using the read-modify-write macro instead of
      open-coding that operation.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4bda1415
    • Vladimir Oltean's avatar
      net: mscc: ocelot: change prototypes of hwtstamping ioctls · 306fd44b
      Vladimir Oltean authored
      This is needed in order to present a simpler prototype to the DSA
      front-end of ocelot.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      306fd44b
    • Vladimir Oltean's avatar
      net: mscc: ocelot: break out fdb operations into abstract implementations · 531ee1a6
      Vladimir Oltean authored
      To be able to implement a DSA front-end over ocelot_fdb_add,
      ocelot_fdb_del, ocelot_fdb_dump, these need to have a simple function
      prototype that is independent of struct net_device, netlink skb, etc.
      
      So rename the ndo ops of the ocelot driver into
      ocelot_port_fdb_{add,del,dump}, and have them all call the abstract
      implementations. At the same time, refactor ocelot_port_fdb_do_dump into
      a function whose prototype is compatible with dsa_fdb_dump_cb_t, so that
      the do_dump implementations can live together and be called by the
      ocelot_fdb_dump through a function pointer.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      531ee1a6
    • Vladimir Oltean's avatar
      net: mscc: ocelot: break apart vlan operations into ocelot_vlan_{add, del} · 9855934c
      Vladimir Oltean authored
      We need an implementation of these functions that is agnostic to the
      higher layer (switchdev or dsa).
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9855934c
    • Vladimir Oltean's avatar
      net: mscc: ocelot: break apart ocelot_vlan_port_apply · 97bb69e1
      Vladimir Oltean authored
      This patch transforms the ocelot_vlan_port_apply function ("apply
      what?") into 3 standalone functions:
      
      - ocelot_port_vlan_filtering
      - ocelot_port_set_native_vlan
      - ocelot_port_set_pvid
      
      These functions have a prototype that is better aligned to the DSA API.
      
      The function also had some static initialization (TPID, drop frames with
      multicast source MAC) which was not being changed from any place, so
      that was just moved to ocelot_probe_port (one of the 6 callers of
      ocelot_vlan_port_apply).
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      97bb69e1
    • David S. Miller's avatar
      Merge branch 'net-dsa-mv88e6xxx-Add-support-for-port-mirroring' · c82488df
      David S. Miller authored
      Iwan R Timmer says:
      
      ====================
      net: dsa: mv88e6xxx: Add support for port mirroring
      
      This patch series add support for port mirroring in the mv88e6xx switch driver.
      The first patch changes the set_egress_port function to allow different egress
      ports for egress and ingress traffic. The second patch adds the actual code for
      port mirroring support.
      
      Tested on a 88E6176 with:
      
      tc qdisc add dev wan0 clsact
      tc filter add dev wan0 ingress matchall skip_sw \
              action mirred egress mirror dev lan2
      tc filter add dev wan0 egress matchall skip_sw \
              action mirred egress mirror dev lan3
      
      Changes in v3
      
      - Use enum for egress traffic direction
      - Keep track of egress ports on mv88e6390
      - Move booleans in struct for better structure packing
      
      Changes in v2
      
      - Support mirroring egress and ingress traffic to different ports
      - Check for invalid configurations when multiple ports are mirrored
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c82488df
    • Iwan R Timmer's avatar
      net: dsa: mv88e6xxx: Add support for port mirroring · f0942e00
      Iwan R Timmer authored
      Add support for configuring port mirroring through the cls_matchall
      classifier. We do a full ingress and/or egress capture towards a
      capture port. It allows setting a different capture port for ingress
      and egress traffic.
      
      It keeps track of the mirrored ports and the destination ports to
      prevent changes to the capture port while other ports are being
      mirrored.
      Signed-off-by: default avatarIwan R Timmer <irtimmer@gmail.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f0942e00
    • Iwan R Timmer's avatar
      net: dsa: mv88e6xxx: Split monitor port configuration · 5c74c54c
      Iwan R Timmer authored
      Separate the configuration of the egress and ingress monitor port.
      This allows the port mirror functionality to do ingress and egress
      port mirroring to separate ports.
      Signed-off-by: default avatarIwan R Timmer <irtimmer@gmail.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5c74c54c
    • John Efstathiades's avatar
      Support LAN743x PTP periodic output on any GPIO · 22820017
      John Efstathiades authored
      The LAN743x Ethernet controller provides two independent PTP event
      channels. Each one can be used to generate a periodic output from
      the PTP clock. The output can be routed to any one of the available
      GPIO pins on the device.
      
      The PTP clock API can now be used to:
      - select any LAN743x GPIO pin to function as a periodic output
      - select either LAN743x PTP event channel to generate the output
      
      The LAN7430 has 4 GPIO pins that are multiplexed with its internal
      PHY LED control signals. A pin assigned to the LED control function
      will be assigned to the GPIO function if selected for PTP periodic
      output.
      Signed-off-by: default avatarJohn Efstathiades <john.efstathiades@pebblebay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      22820017
    • David S. Miller's avatar
      Merge branch 'Unlock-new-potential-in-SJA1105-with-PTP-system-timestamping' · 26285f13
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Unlock new potential in SJA1105 with PTP system timestamping
      
      The SJA1105 being an automotive switch means it is designed to live in a
      set-and-forget environment, far from the configure-at-runtime nature of
      Linux. Frequently resetting the switch to change its static config means
      it loses track of its PTP time, which is not good.
      
      This patch series implements PTP system timestamping for this switch
      (using the API introduced for SPI here:
      https://www.mail-archive.com/netdev@vger.kernel.org/msg316725.html),
      adding the following benefits to the driver:
      - When under control of a user space PTP servo loop (ptp4l, phc2sys),
        the loss of sync during a switch reset is much more manageable, and
        the switch still remains in the s2 (locked servo) state.
      - When synchronizing the switch using the software technique (based on
        reading clock A and writing the value to clock B, as opposed to
        relying on hardware timestamping), e.g. by using phc2sys, the sync
        accuracy is vastly improved due to the fact that the actual switch PTP
        time can now be more precisely correlated with something of better
        precision (CLOCK_REALTIME). The issue is that SPI transfers are
        inherently bad for measuring time with low jitter, but the newly
        introduced API aims to alleviate that issue somewhat.
      
      This series is also a requirement for a future patch set that adds full
      time-aware scheduling offload support for the switch.
      ====================
      Acked-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26285f13
    • Vladimir Oltean's avatar
      net: dsa: sja1105: Disallow management xmit during switch reset · af580ae2
      Vladimir Oltean authored
      The purpose here is to avoid ptp4l fail due to this condition:
      
        timed out while polling for tx timestamp
        increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
        port 1: send peer delay request failed
      
      So either reset the switch before the management frame was sent, or
      after it was timestamped as well, but not in the middle.
      
      The condition may arise either due to a true timeout (i.e. because
      re-uploading the static config takes time), or due to the TX timestamp
      actually getting lost due to reset. For the former we can increase
      tx_timestamp_timeout in userspace, for the latter we need this patch.
      
      Locking all traffic during switch reset does not make sense at all,
      though. Forcing all CPU-originated traffic to potentially block waiting
      for a sleepable context to send > 800 bytes over SPI is not a good idea.
      Flows that are autonomously forwarded by the switch will get dropped
      anyway during switch reset no matter what. So just let all other
      CPU-originated traffic be dropped as well.
      Signed-off-by: default avatarVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      af580ae2
    • Vladimir Oltean's avatar
      net: dsa: sja1105: Restore PTP time after switch reset · 6cf99c13
      Vladimir Oltean authored
      The PTP time of the switch is not preserved when uploading a new static
      configuration. Work around this hardware oddity by reading its PTP time
      before a static config upload, and restoring it afterwards.
      
      Static config changes are expected to occur at runtime even in scenarios
      directly related to PTP, i.e. the Time-Aware Scheduler of the switch is
      programmed in this way.
      
      Perhaps the larger implication of this patch is that the PTP .gettimex64
      and .settime functions need to be exposed to sja1105_main.c, where the
      PTP lock needs to be held during this entire process. So their core
      implementation needs to move to some common functions which get exposed
      in sja1105_ptp.h.
      Signed-off-by: default avatarVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6cf99c13
    • Vladimir Oltean's avatar
      net: dsa: sja1105: Implement the .gettimex64 system call for PTP · 34d76e9f
      Vladimir Oltean authored
      Through the PTP_SYS_OFFSET_EXTENDED ioctl, it is possible for userspace
      applications (i.e. phc2sys) to compensate for the delays incurred while
      reading the PHC's time.
      
      The task itself of taking the software timestamp is delegated to the SPI
      subsystem, through the newly introduced API in struct spi_transfer. The
      goal is to cross-timestamp I/O operations on the switch's PTP clock with
      values in the local system clock (CLOCK_REALTIME). For that we need to
      understand a bit of the hardware internals.
      
      The 'read PTP time' message is a 12 byte structure, first 4 bytes of
      which represent the SPI header, and the last 8 bytes represent the
      64-bit PTP time. The switch itself starts processing the command
      immediately after receiving the last bit of the address, i.e. at the
      middle of byte 3 (last byte of header). The PTP time is shadowed to a
      buffer register in the switch, and retrieved atomically during the
      subsequent SPI frames.
      
      A similar thing goes on for the 'write PTP time' message, although in
      that case the switch waits until the 64-bit PTP time becomes fully
      available before taking any action. So the byte that needs to be
      software-timestamped is byte 11 (last) of the transfer.
      
      The patch creates a common (and local) sja1105_xfer implementation for
      the SPI I/O, and offers 3 front-ends:
      
      - sja1105_xfer_u32 and sja1105_xfer_u64: these are capable of optionally
        requesting a PTP timestamp
      
      - sja1105_xfer_buf: this is for large transfers (e.g. the static config
        buffer) and other misc data, and there is no point in giving
        timestamping capabilities to this.
      Signed-off-by: default avatarVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      34d76e9f
  2. 10 Nov, 2019 7 commits
  3. 09 Nov, 2019 10 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 14684b93
      David S. Miller authored
      One conflict in the BPF samples Makefile, some fixes in 'net' whilst
      we were converting over to Makefile.target rules in 'net-next'.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      14684b93
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 0058b0a5
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) BPF sample build fixes from Björn Töpel
      
       2) Fix powerpc bpf tail call implementation, from Eric Dumazet.
      
       3) DCCP leaks jiffies on the wire, fix also from Eric Dumazet.
      
       4) Fix crash in ebtables when using dnat target, from Florian Westphal.
      
       5) Fix port disable handling whne removing bcm_sf2 driver, from Florian
          Fainelli.
      
       6) Fix kTLS sk_msg trim on fallback to copy mode, from Jakub Kicinski.
      
       7) Various KCSAN fixes all over the networking, from Eric Dumazet.
      
       8) Memory leaks in mlx5 driver, from Alex Vesker.
      
       9) SMC interface refcounting fix, from Ursula Braun.
      
      10) TSO descriptor handling fixes in stmmac driver, from Jose Abreu.
      
      11) Add a TX lock to synchonize the kTLS TX path properly with crypto
          operations. From Jakub Kicinski.
      
      12) Sock refcount during shutdown fix in vsock/virtio code, from Stefano
          Garzarella.
      
      13) Infinite loop in Intel ice driver, from Colin Ian King.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits)
        ixgbe: need_wakeup flag might not be set for Tx
        i40e: need_wakeup flag might not be set for Tx
        igb/igc: use ktime accessors for skb->tstamp
        i40e: Fix for ethtool -m issue on X722 NIC
        iavf: initialize ITRN registers with correct values
        ice: fix potential infinite loop because loop counter being too small
        qede: fix NULL pointer deref in __qede_remove()
        net: fix data-race in neigh_event_send()
        vsock/virtio: fix sock refcnt holding during the shutdown
        net: ethernet: octeon_mgmt: Account for second possible VLAN header
        mac80211: fix station inactive_time shortly after boot
        net/fq_impl: Switch to kvmalloc() for memory allocation
        mac80211: fix ieee80211_txq_setup_flows() failure path
        ipv4: Fix table id reference in fib_sync_down_addr
        ipv6: fixes rt6_probe() and fib6_nh->last_probe init
        net: hns: Fix the stray netpoll locks causing deadlock in NAPI path
        net: usb: qmi_wwan: add support for DW5821e with eSIM support
        CDC-NCM: handle incomplete transfer of MTU
        nfc: netlink: fix double device reference drop
        NFC: st21nfca: fix double free
        ...
      0058b0a5
    • Linus Torvalds's avatar
      Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block · 5cb8418c
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Two NVMe device removal crash fixes, and a compat fixup for for an
         ioctl that was introduced in this release (Anton, Charles, Max - via
         Keith)
      
       - Missing error path mutex unlock for drbd (Dan)
      
       - cgroup writeback fixup on dead memcg (Tejun)
      
       - blkcg online stats print fix (Tejun)
      
      * tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
        cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
        block: drbd: remove a stray unlock in __drbd_send_protocol()
        blkcg: make blkcg_print_stat() print stats only for online blkgs
        nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
        nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
        nvme-rdma: fix a segmentation fault during module unload
      5cb8418c
    • David S. Miller's avatar
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · a2582cdc
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Fixes 2019-11-08
      
      This series contains fixes to igb, igc, ixgbe, i40e, iavf and ice
      drivers.
      
      Colin Ian King fixes a potentially wrap-around counter in a for-loop.
      
      Nick fixes the default ITR values for the iavf driver to 50 usecs
      interval.
      
      Arkadiusz fixes 'ethtool -m' for X722 devices where the correct value
      cannot be obtained from the firmware, so add X722 to the check to ensure
      the wrong value is not returned.
      
      Jake fixes igb and igc drivers in their implementation of launch time
      support by declaring skb->tstamp value as ktime_t instead of s64.
      
      Magnus fixes ixgbe and i40e where the need_wakeup flag for transmit may
      not be set for AF_XDP sockets that are only used to send packets.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a2582cdc
    • Magnus Karlsson's avatar
      ixgbe: need_wakeup flag might not be set for Tx · 0843aa8f
      Magnus Karlsson authored
      The need_wakeup flag for Tx might not be set for AF_XDP sockets that
      are only used to send packets. This happens if there is at least one
      outstanding packet that has not been completed by the hardware and we
      get that corresponding completion (which will not generate an
      interrupt since interrupts are disabled in the napi poll loop) between
      the time we stopped processing the Tx completions and interrupts are
      enabled again. In this case, the need_wakeup flag will have been
      cleared at the end of the Tx completion processing as we believe we
      will get an interrupt from the outstanding completion at a later point
      in time. But if this completion interrupt occurs before interrupts
      are enable, we lose it and should at that point really have set the
      need_wakeup flag since there are no more outstanding completions that
      can generate an interrupt to continue the processing. When this
      happens, user space will see a Tx queue need_wakeup of 0 and skip
      issuing a syscall, which means will never get into the Tx processing
      again and we have a deadlock.
      
      This patch introduces a quick fix for this issue by just setting the
      need_wakeup flag for Tx to 1 all the time. I am working on a proper
      fix for this that will toggle the flag appropriately, but it is more
      challenging than I anticipated and I am afraid that this patch will
      not be completed before the merge window closes, therefore this easier
      fix for now. This fix has a negative performance impact in the range
      of 0% to 4%. Towards the higher end of the scale if you have driver
      and application on the same core and issue a lot of packets, and
      towards no negative impact if you use two cores, lower transmission
      speeds and/or a workload that also receives packets.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      0843aa8f
    • Magnus Karlsson's avatar
      i40e: need_wakeup flag might not be set for Tx · 70563957
      Magnus Karlsson authored
      The need_wakeup flag for Tx might not be set for AF_XDP sockets that
      are only used to send packets. This happens if there is at least one
      outstanding packet that has not been completed by the hardware and we
      get that corresponding completion (which will not generate an
      interrupt since interrupts are disabled in the napi poll loop) between
      the time we stopped processing the Tx completions and interrupts are
      enabled again. In this case, the need_wakeup flag will have been
      cleared at the end of the Tx completion processing as we believe we
      will get an interrupt from the outstanding completion at a later point
      in time. But if this completion interrupt occurs before interrupts
      are enable, we lose it and should at that point really have set the
      need_wakeup flag since there are no more outstanding completions that
      can generate an interrupt to continue the processing. When this
      happens, user space will see a Tx queue need_wakeup of 0 and skip
      issuing a syscall, which means will never get into the Tx processing
      again and we have a deadlock.
      
      This patch introduces a quick fix for this issue by just setting the
      need_wakeup flag for Tx to 1 all the time. I am working on a proper
      fix for this that will toggle the flag appropriately, but it is more
      challenging than I anticipated and I am afraid that this patch will
      not be completed before the merge window closes, therefore this easier
      fix for now. This fix has a negative performance impact in the range
      of 0% to 4%. Towards the higher end of the scale if you have driver
      and application on the same core and issue a lot of packets, and
      towards no negative impact if you use two cores, lower transmission
      speeds and/or a workload that also receives packets.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      70563957
    • Jacob Keller's avatar
      igb/igc: use ktime accessors for skb->tstamp · 6acab13b
      Jacob Keller authored
      When implementing launch time support in the igb and igc drivers, the
      skb->tstamp value is assumed to be a s64, but it's declared as a ktime_t
      value.
      
      Although ktime_t is typedef'd to s64 it wasn't always, and the kernel
      provides accessors for ktime_t values.
      
      Use the ktime_to_timespec64 and ktime_set accessors instead of directly
      assuming that the variable is always an s64.
      
      This improves portability if the code is ever moved to another kernel
      version, or if the definition of ktime_t ever changes again in the
      future.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Acked-by: default avatarVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      6acab13b
    • Arkadiusz Kubalewski's avatar
      i40e: Fix for ethtool -m issue on X722 NIC · 4c9da6f2
      Arkadiusz Kubalewski authored
      This patch contains fix for a problem with command:
      'ethtool -m <dev>'
      which breaks functionality of:
      'ethtool <dev>'
      when called on X722 NIC
      
      Disallowed update of link phy_types on X722 NIC
      Currently correct value cannot be obtained from FW
      Previously wrong value returned by FW was used and was
      a root cause for incorrect output of 'ethtool <dev>' command
      Signed-off-by: default avatarArkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      4c9da6f2
    • Nicholas Nunley's avatar
      iavf: initialize ITRN registers with correct values · 4eda4e00
      Nicholas Nunley authored
      Since commit 92418fb1 ("i40e/i40evf: Use usec value instead of reg
      value for ITR defines") the driver tracks the interrupt throttling
      intervals in single usec units, although the actual ITRN registers are
      programmed in 2 usec units. Most register programming flows in the driver
      correctly handle the conversion, although it is currently not applied when
      the registers are initialized to their default values. Most of the time
      this doesn't present a problem since the default values are usually
      immediately overwritten through the standard adaptive throttling mechanism,
      or updated manually by the user, but if adaptive throttling is disabled and
      the interval values are left alone then the incorrect value will persist.
      
      Since the intended default interval of 50 usecs (vs. 100 usecs as
      programmed) performs better for most traffic workloads, this can lead to
      performance regressions.
      
      This patch adds the correct conversion when writing the initial values to
      the ITRN registers.
      Signed-off-by: default avatarNicholas Nunley <nicholas.d.nunley@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      4eda4e00
    • Colin Ian King's avatar
      ice: fix potential infinite loop because loop counter being too small · 615457a2
      Colin Ian King authored
      Currently the for-loop counter i is a u8 however it is being checked
      against a maximum value hw->num_tx_sched_layers which is a u16. Hence
      there is a potential wrap-around of counter i back to zero if
      hw->num_tx_sched_layers is greater than 255.  Fix this by making i
      a u16.
      
      Addresses-Coverity: ("Infinite loop")
      Fixes: b36c598c ("ice: Updates to Tx scheduler code")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      615457a2
  4. 08 Nov, 2019 10 commits
    • David S. Miller's avatar
      Merge branch 'sctp-rfc7829' · 92da362c
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: update from rfc7829
      
      SCTP-PF was implemented based on a Internet-Draft in 2012:
      
        https://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
      
      It's been updated quite a few by rfc7829 in 2016.
      
      This patchset adds the following features:
      
        1. add SCTP_ADDR_POTENTIALLY_FAILED notification
        2. add pf_expose per netns/sock/asoc
        3. add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockopt
        4. add ps_retrans per netns/sock/asoc/transport
           (Primary Path Switchover)
        5. add spt_pathcpthld for SCTP_PEER_ADDR_THLDS sockopt
      
      v1->v2:
        - See Patch 2/5 and Patch 5/5.
      v2->v3:
        - See Patch 1/5, 2/5 and 3/5.
      v3->v4:
        - See Patch 1/5, 2/5, 3/5 and 4/5.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      92da362c
    • Xin Long's avatar
      sctp: add SCTP_PEER_ADDR_THLDS_V2 sockopt · d467ac0a
      Xin Long authored
      Section 7.2 of rfc7829: "Peer Address Thresholds (SCTP_PEER_ADDR_THLDS)
      Socket Option" extends 'struct sctp_paddrthlds' with 'spt_pathcpthld'
      added to allow a user to change ps_retrans per sock/asoc/transport, as
      other 2 paddrthlds: pf_retrans, pathmaxrxt.
      
      Note: to not break the user's program, here to support pf_retrans dump
      and setting by adding a new sockopt SCTP_PEER_ADDR_THLDS_V2, and a new
      structure sctp_paddrthlds_v2 instead of extending sctp_paddrthlds.
      
      Also, when setting ps_retrans, the value is not allowed to be greater
      than pf_retrans.
      
      v1->v2:
        - use SCTP_PEER_ADDR_THLDS_V2 to set/get pf_retrans instead,
          as Marcelo and David Laight suggested.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d467ac0a
    • Xin Long's avatar
      sctp: add support for Primary Path Switchover · 34515e94
      Xin Long authored
      This is a new feature defined in section 5 of rfc7829: "Primary Path
      Switchover". By introducing a new tunable parameter:
      
        Primary.Switchover.Max.Retrans (PSMR)
      
      The primary path will be changed to another active path when the path
      error counter on the old primary path exceeds PSMR, so that "the SCTP
      sender is allowed to continue data transmission on a new working path
      even when the old primary destination address becomes active again".
      
      This patch is to add this tunable parameter, 'ps_retrans' per netns,
      sock, asoc and transport. It also allows a user to change ps_retrans
      per netns by sysctl, and ps_retrans per sock/asoc/transport will be
      initialized with it.
      
      The check will be done in sctp_do_8_2_transport_strike() when this
      feature is enabled.
      
      Note this feature is disabled by initializing 'ps_retrans' per netns
      as 0xffff by default, and its value can't be less than 'pf_retrans'
      when changing by sysctl.
      
      v3->v4:
        - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
          sysctl 'ps_retrans'.
        - add a new entry for ps_retrans on ip-sysctl.txt.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      34515e94
    • Xin Long's avatar
      sctp: add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockopt · 8d2a6935
      Xin Long authored
      This is a sockopt defined in section 7.3 of rfc7829: "Exposing
      the Potentially Failed Path State", by which users can change
      pf_expose per sock and asoc.
      
      The new sockopt SCTP_EXPOSE_POTENTIALLY_FAILED_STATE is also
      known as SCTP_EXPOSE_PF_STATE for short.
      
      v2->v3:
        - return -EINVAL if params.assoc_value > SCTP_PF_EXPOSE_MAX.
        - define SCTP_EXPOSE_PF_STATE SCTP_EXPOSE_POTENTIALLY_FAILED_STATE.
      v3->v4:
        - improve changelog.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8d2a6935
    • Xin Long's avatar
      sctp: add SCTP_ADDR_POTENTIALLY_FAILED notification · 768e1518
      Xin Long authored
      SCTP Quick failover draft section 5.1, point 5 has been removed
      from rfc7829. Instead, "the sender SHOULD (i) notify the Upper
      Layer Protocol (ULP) about this state transition", as said in
      section 3.2, point 8.
      
      So this patch is to add SCTP_ADDR_POTENTIALLY_FAILED, defined
      in section 7.1, "which is reported if the affected address
      becomes PF". Also remove transport cwnd's update when moving
      from PF back to ACTIVE , which is no longer in rfc7829 either.
      
      Note that ulp_notify will be set to false if asoc->expose is
      not 'enabled', according to last patch.
      
      v2->v3:
        - define SCTP_ADDR_PF SCTP_ADDR_POTENTIALLY_FAILED.
      v3->v4:
        - initialize spc_state with SCTP_ADDR_AVAILABLE, as Marcelo suggested.
        - check asoc->pf_expose in sctp_assoc_control_transport(), as Marcelo
          suggested.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      768e1518
    • Xin Long's avatar
      sctp: add pf_expose per netns and sock and asoc · aef587be
      Xin Long authored
      As said in rfc7829, section 3, point 12:
      
        The SCTP stack SHOULD expose the PF state of its destination
        addresses to the ULP as well as provide the means to notify the
        ULP of state transitions of its destination addresses from
        active to PF, and vice versa.  However, it is recommended that
        an SCTP stack implementing SCTP-PF also allows for the ULP to be
        kept ignorant of the PF state of its destinations and the
        associated state transitions, thus allowing for retention of the
        simpler state transition model of [RFC4960] in the ULP.
      
      Not only does it allow to expose the PF state to ULP, but also
      allow to ignore sctp-pf to ULP.
      
      So this patch is to add pf_expose per netns, sock and asoc. And in
      sctp_assoc_control_transport(), ulp_notify will be set to false if
      asoc->expose is not 'enabled' in next patch.
      
      It also allows a user to change pf_expose per netns by sysctl, and
      pf_expose per sock and asoc will be initialized with it.
      
      Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
      to not allow a user to query the state of a sctp-pf peer address
      when pf_expose is 'disabled', as said in section 7.3.
      
      v1->v2:
        - Fix a build warning noticed by Nathan Chancellor.
      v2->v3:
        - set pf_expose to UNUSED by default to keep compatible with old
          applications.
      v3->v4:
        - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
        - change this patch to 1/5, and move sctp_assoc_control_transport
          change into 2/5, as Marcelo suggested.
        - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
          set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aef587be
    • Jiri Pirko's avatar
      devlink: disallow reload operation during device cleanup · a0c76345
      Jiri Pirko authored
      There is a race between driver code that does setup/cleanup of device
      and devlink reload operation that in some drivers works with the same
      code. Use after free could we easily obtained by running:
      
      while true; do
              echo 10 > /sys/bus/netdevsim/new_device
              devlink dev reload netdevsim/netdevsim10 &
              echo 10 > /sys/bus/netdevsim/del_device
      done
      
      Fix this by enabling reload only after setup of device is complete and
      disabling it at the beginning of the cleanup process.
      Reported-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Fixes: 2d8dc5bb ("devlink: Add support for reload")
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Acked-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a0c76345
    • Jiri Pirko's avatar
      selftest: net: add alternative names test · f95e6c9c
      Jiri Pirko authored
      Add a simple test for recently added netdevice alternative names.
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f95e6c9c
    • Manish Chopra's avatar
      qede: fix NULL pointer deref in __qede_remove() · deabc871
      Manish Chopra authored
      While rebooting the system with SR-IOV vfs enabled leads
      to below crash due to recurrence of __qede_remove() on the VF
      devices (first from .shutdown() flow of the VF itself and
      another from PF's .shutdown() flow executing pci_disable_sriov())
      
      This patch adds a safeguard in __qede_remove() flow to fix this,
      so that driver doesn't attempt to remove "already removed" devices.
      
      [  194.360134] BUG: unable to handle kernel NULL pointer dereference at 00000000000008dc
      [  194.360227] IP: [<ffffffffc03553c4>] __qede_remove+0x24/0x130 [qede]
      [  194.360304] PGD 0
      [  194.360325] Oops: 0000 [#1] SMP
      [  194.360360] Modules linked in: tcp_lp fuse tun bridge stp llc devlink bonding ip_set nfnetlink ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_umad rpcrdma sunrpc rdma_ucm ib_uverbs ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi dell_smbios iTCO_wdt iTCO_vendor_support dell_wmi_descriptor dcdbas vfat fat pcc_cpufreq skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd qedr ib_core pcspkr ses enclosure joydev ipmi_ssif sg i2c_i801 lpc_ich mei_me mei wmi ipmi_si ipmi_devintf ipmi_msghandler tpm_crb acpi_pad acpi_power_meter xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel mgag200
      [  194.361044]  qede i2c_algo_bit drm_kms_helper qed syscopyarea sysfillrect nvme sysimgblt fb_sys_fops ttm nvme_core mpt3sas crc8 ptp drm pps_core ahci raid_class scsi_transport_sas libahci libata drm_panel_orientation_quirks nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
      [  194.361297] CPU: 51 PID: 7996 Comm: reboot Kdump: loaded Not tainted 3.10.0-1062.el7.x86_64 #1
      [  194.361359] Hardware name: Dell Inc. PowerEdge MX840c/0740HW, BIOS 2.4.6 10/15/2019
      [  194.361412] task: ffff9cea9b360000 ti: ffff9ceabebdc000 task.ti: ffff9ceabebdc000
      [  194.361463] RIP: 0010:[<ffffffffc03553c4>]  [<ffffffffc03553c4>] __qede_remove+0x24/0x130 [qede]
      [  194.361534] RSP: 0018:ffff9ceabebdfac0  EFLAGS: 00010282
      [  194.361570] RAX: 0000000000000000 RBX: ffff9cd013846098 RCX: 0000000000000000
      [  194.361621] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9cd013846098
      [  194.361668] RBP: ffff9ceabebdfae8 R08: 0000000000000000 R09: 0000000000000000
      [  194.361715] R10: 00000000bfe14201 R11: ffff9ceabfe141e0 R12: 0000000000000000
      [  194.361762] R13: ffff9cd013846098 R14: 0000000000000000 R15: ffff9ceab5e48000
      [  194.361810] FS:  00007f799c02d880(0000) GS:ffff9ceacb0c0000(0000) knlGS:0000000000000000
      [  194.361865] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  194.361903] CR2: 00000000000008dc CR3: 0000001bdac76000 CR4: 00000000007607e0
      [  194.361953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  194.362002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  194.362051] PKRU: 55555554
      [  194.362073] Call Trace:
      [  194.362109]  [<ffffffffc0355500>] qede_remove+0x10/0x20 [qede]
      [  194.362180]  [<ffffffffb97d0f3e>] pci_device_remove+0x3e/0xc0
      [  194.362240]  [<ffffffffb98b3c52>] __device_release_driver+0x82/0xf0
      [  194.362285]  [<ffffffffb98b3ce3>] device_release_driver+0x23/0x30
      [  194.362343]  [<ffffffffb97c86d4>] pci_stop_bus_device+0x84/0xa0
      [  194.362388]  [<ffffffffb97c87e2>] pci_stop_and_remove_bus_device+0x12/0x20
      [  194.362450]  [<ffffffffb97f153f>] pci_iov_remove_virtfn+0xaf/0x160
      [  194.362496]  [<ffffffffb97f1aec>] sriov_disable+0x3c/0xf0
      [  194.362534]  [<ffffffffb97f1bc3>] pci_disable_sriov+0x23/0x30
      [  194.362599]  [<ffffffffc02f83c3>] qed_sriov_disable+0x5e3/0x650 [qed]
      [  194.362658]  [<ffffffffb9622df6>] ? kfree+0x106/0x140
      [  194.362709]  [<ffffffffc02cc0c0>] ? qed_free_stream_mem+0x70/0x90 [qed]
      [  194.362754]  [<ffffffffb9622df6>] ? kfree+0x106/0x140
      [  194.362803]  [<ffffffffc02cd659>] qed_slowpath_stop+0x1a9/0x1d0 [qed]
      [  194.362854]  [<ffffffffc035544e>] __qede_remove+0xae/0x130 [qede]
      [  194.362904]  [<ffffffffc03554e0>] qede_shutdown+0x10/0x20 [qede]
      [  194.362956]  [<ffffffffb97cf90a>] pci_device_shutdown+0x3a/0x60
      [  194.363010]  [<ffffffffb98b180b>] device_shutdown+0xfb/0x1f0
      [  194.363066]  [<ffffffffb94b66c6>] kernel_restart_prepare+0x36/0x40
      [  194.363107]  [<ffffffffb94b66e2>] kernel_restart+0x12/0x60
      [  194.363146]  [<ffffffffb94b6959>] SYSC_reboot+0x229/0x260
      [  194.363196]  [<ffffffffb95f200d>] ? handle_mm_fault+0x39d/0x9b0
      [  194.363253]  [<ffffffffb942b621>] ? __switch_to+0x151/0x580
      [  194.363304]  [<ffffffffb9b7ec28>] ? __schedule+0x448/0x9c0
      [  194.363343]  [<ffffffffb94b69fe>] SyS_reboot+0xe/0x10
      [  194.363387]  [<ffffffffb9b8bede>] system_call_fastpath+0x25/0x2a
      [  194.363430] Code: f9 e9 37 ff ff ff 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 4c 8d af 98 00 00 00 41 54 4c 89 ef 41 89 f4 53 e8 4c e4 55 f9 <80> b8 dc 08 00 00 01 48 89 c3 4c 8d b8 c0 08 00 00 4c 8b b0 c0
      [  194.363712] RIP  [<ffffffffc03553c4>] __qede_remove+0x24/0x130 [qede]
      [  194.363764]  RSP <ffff9ceabebdfac0>
      [  194.363791] CR2: 00000000000008dc
      Signed-off-by: default avatarManish Chopra <manishc@marvell.com>
      Signed-off-by: default avatarAriel Elior <aelior@marvell.com>
      Signed-off-by: default avatarSudarsana Kalluru <skalluru@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      deabc871
    • Eric Dumazet's avatar
      packet: fix data-race in fanout_flow_is_huge() · b756ad92
      Eric Dumazet authored
      KCSAN reported the following data-race [1]
      
      Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.
      
      Since the report hinted about multiple cpus using the history
      concurrently, I added a test avoiding writing on it if the
      victim slot already contains the desired value.
      
      [1]
      
      BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover
      
      read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1:
       fanout_flow_is_huge net/packet/af_packet.c:1303 [inline]
       fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353
       packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
       deliver_skb net/core/dev.c:1888 [inline]
       dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
       xmit_one net/core/dev.c:3195 [inline]
       dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
       __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
       dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
       neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
       ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
       udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
       udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
       inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
       ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
       __sys_sendmmsg+0x123/0x350 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0:
       fanout_flow_is_huge net/packet/af_packet.c:1306 [inline]
       fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353
       packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
       deliver_skb net/core/dev.c:1888 [inline]
       dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
       xmit_one net/core/dev.c:3195 [inline]
       dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
       __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
       dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
       neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
       ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
       udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
       udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
       inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
       ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
       __sys_sendmmsg+0x123/0x350 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 3b3a5b0a ("packet: rollover huge flows before small flows")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b756ad92