1. 24 Jun, 2021 37 commits
    • Xin Long's avatar
      sctp: do black hole detection in search complete state · 0dac127c
      Xin Long authored
      Currently the PLPMUTD probe will stop for a long period (interval * 30)
      after it enters search complete state. If there's a pmtu change on the
      route path, it takes a long time to be aware if the ICMP TooBig packet
      is lost or filtered.
      
      As it says in rfc8899#section-4.3:
      
        "A DPLPMTUD method MUST NOT rely solely on this method."
        (ICMP PTB message).
      
      This patch is to enable the other method for search complete state:
      
        "A PL can use the DPLPMTUD probing mechanism to periodically
         generate probe packets of the size of the current PLPMTU."
      
      With this patch, the probe will continue with the current pmtu every
      'interval' until the PMTU_RAISE_TIMER 'timeout', which we implement
      by adding raise_count to raise the probe size when it counts to 30
      and removing the SCTP_PL_COMPLETE check for PMTU_RAISE_TIMER.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0dac127c
    • David S. Miller's avatar
      Merge branch 'sja1110-doc' · 98ebad48
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Document the NXP SJA1110 switch as supported
      
      Now that most of the basic work for SJA1110 support has been done in the
      sja1105 DSA driver, let's add the missing documentation bits to make it
      clear that the driver can be used.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      98ebad48
    • Vladimir Oltean's avatar
      net: dsa: sja1105: document the SJA1110 in the Kconfig · 75e99470
      Vladimir Oltean authored
      Mention support for the SJA1110 in menuconfig.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      75e99470
    • Vladimir Oltean's avatar
      Documentation: net: dsa: add details about SJA1110 · 44531076
      Vladimir Oltean authored
      Denote that the new switch generation is supported, detail its pin
      strapping options (with differences compared to SJA1105) and explain how
      MDIO access to the internal 100base-T1 and 100base-TX PHYs is performed.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      44531076
    • David S. Miller's avatar
      Merge branch 'gve-dqo' · 89bddde3
      David S. Miller authored
      Bailey Forrest says:
      
      ====================
      gve: Introduce DQO descriptor format
      
      DQO is the descriptor format for our next generation virtual NIC. The existing
      descriptor format will be referred to as "GQI" in the patch set.
      
      One major change with DQO is it uses dual descriptor rings for both TX and RX
      queues.
      
      The TX path uses a TX queue to send descriptors to HW, and receives packet
      completion events on a TX completion queue.
      
      The RX path posts buffers to HW using an RX buffer queue and receives incoming
      packets on an RX queue.
      
      One important note is that DQO descriptors and doorbells are little endian. We
      continue to use the existing big endian control plane infrastructure.
      
      The general format of the patch series is:
      - Refactor existing code/data structures to be shared by DQO
      - Expand admin queues to support DQO device setup
      - Expand data structures and device setup to support DQO
      - Add logic to setup DQO queues
      - Implement datapath
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89bddde3
    • Bailey Forrest's avatar
      gve: DQO: Add RX path · 9b8dd5e5
      Bailey Forrest authored
      The RX queue has an array of `gve_rx_buf_state_dqo` objects. All
      allocated pages have an associated buf_state object. When a buffer is
      posted on the RX buffer queue, the buffer ID will be the buf_state's
      index into the RX queue's array.
      
      On packet reception, the RX queue will have one descriptor for each
      buffer associated with a received packet. Each RX descriptor will have
      a buffer_id that was posted on the buffer queue.
      
      Notable mentions:
      
      - We use a default buffer size of 2048 bytes. Based on page size, we
        may post separate sections of a single page as separate buffers.
      
      - The driver holds an extra reference on pages passed up the receive
        path with an skb and keeps these pages on a list. When posting new
        buffers to the NIC, we check if any of these pages has only our
        reference, or another buffer sized segment of the page has no
        references. If so, it is free to reuse. This page recycling approach
        is a common netdev optimization that reduces page alloc/free calls.
      
      - Pages in the free list have a page_count bias in order to avoid an
        atomic increment of pagecount every time we attempt to reuse a page.
        # references = page_count() - bias
      
      - In order to track when a page is safe to reuse, we keep track of the
        last offset which had a single SKB reference. When this occurs, it
        implies that every single other offset is reusable. Otherwise, we
        don't know if offsets can be safely reused.
      
      - We maintain two free lists of pages. List #1 (recycled_buf_states)
        contains pages we know can be reused right away. List #2
        (used_buf_states) contains pages which cannot be used right away. We
        only attempt to get pages from list #2 when list #1 is empty. We only
        attempt to use a small fixed number pages from list #2 before giving
        up and allocating a new page. Both lists are FIFOs in hope that by the
        time we attempt to reuse a page, the references were dropped.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9b8dd5e5
    • Bailey Forrest's avatar
      gve: DQO: Add TX path · a57e5de4
      Bailey Forrest authored
      TX SKBs will have their buffers DMA mapped with the device. Each buffer
      will have at least one TX descriptor associated. Each SKB will also have
      a metadata descriptor.
      
      Each TX queue maintains an array of `gve_tx_pending_packet_dqo` objects.
      Every TX SKB will have an associated pending_packet object. A TX SKB's
      descriptors will use its pending_packet's index as the completion tag,
      which will be returned on the TX completion queue.
      
      The device implements a "flow-miss model". Most packets will simply
      receive a packet completion. The flow-miss system may choose to process
      a packet based on its contents. A TX packet which experiences a flow
      miss would receive a miss completion followed by a later reinjection
      completion. The miss-completion is received when the packet starts to be
      processed by the flow-miss system and the reinjection completion is
      received when the flow-miss system completes processing the packet and
      sends it on the wire.
      
      Notable mentions:
      
      - Buffers may be freed after receiving the miss-completion, but in order
        to avoid packet reordering, we do not complete the SKB until receiving
        the reinjection completion.
      
      - The driver must robustly handle the unlikely scenario where a miss
        completion does not have an associated reinjection completion. This is
        accomplished by maintaining a list of packets which have a pending
        reinjection completion. After a short timeout (5 seconds), the
        SKB and buffers are released and the pending_packet is moved to a
        second list which has a longer timeout (60 seconds), where the
        pending_packet will not be reused. When the longer timeout elapses,
        the driver may assume the reinjection completion would never be
        received and the pending_packet may be reused.
      
      - Completion handling is triggered by an interrupt and is done in the
        NAPI poll function. Because the TX path and completion exist in
        different threading contexts they maintain their own lists for free
        pending_packet objects. The TX path uses a lock-free approach to steal
        the list from the completion path.
      
      - Both the TSO context and general context descriptors have metadata
        bytes. The device requires that if multiple descriptors contain the
        same field, each descriptor must have the same value set for that
        field.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a57e5de4
    • Bailey Forrest's avatar
      gve: DQO: Configure interrupts on device up · 0dcc144a
      Bailey Forrest authored
      When interrupts are first enabled, we also set the ratelimits, which
      will be static for the entire usage of the device.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0dcc144a
    • Bailey Forrest's avatar
      gve: DQO: Add ring allocation and initialization · 9c1a59a2
      Bailey Forrest authored
      Allocate the buffer and completion ring structures. Do not populate the
      rings yet. That will happen in the respective rx and tx datapath
      follow-on patches
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9c1a59a2
    • Bailey Forrest's avatar
      gve: DQO: Add core netdev features · 5e8c5adf
      Bailey Forrest authored
      Add napi netdev device registration, interrupt handling and initial tx
      and rx polling stubs. The stubs will be filled in follow-on patches.
      
      Also:
      - LRO feature advertisement and handling
      - Also update ethtool logic
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e8c5adf
    • Bailey Forrest's avatar
      gve: Update adminq commands to support DQO queues · 1f6228e4
      Bailey Forrest authored
      DQO queue creation requires additional parameters:
      - TX completion/RX buffer queue size
      - TX completion/RX buffer queue address
      - TX/RX queue size
      - RX buffer size
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1f6228e4
    • Bailey Forrest's avatar
      gve: Add DQO fields for core data structures · a4aa1f1e
      Bailey Forrest authored
      - Add new DQO datapath structures:
        - `gve_rx_buf_queue_dqo`
        - `gve_rx_compl_queue_dqo`
        - `gve_rx_buf_state_dqo`
        - `gve_tx_desc_dqo`
        - `gve_tx_pending_packet_dqo`
      
      - Incorporate these into the existing ring data structures:
        - `gve_rx_ring`
        - `gve_tx_ring`
      
      Noteworthy mentions:
      
      - `gve_rx_buf_state` represents an RX buffer which was posted to HW.
        Each RX queue has an array of these objects and the index into the
        array is used as the buffer_id when posted to HW.
      
      - `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The
        completion_tag is the index into the array.
      
      - These two structures have links for linked lists which are represented
        by 16b indexes into a contiguous array of these structures.
        This reduces memory footprint compared to 64b pointers.
      
      - We use unions for the writeable datapath structures to reduce cache
        footprint. GQI specific members will renamed like DQO members in a
        future patch.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a4aa1f1e
    • Bailey Forrest's avatar
      gve: Add dqo descriptors · 22319818
      Bailey Forrest authored
      General description of rings and descriptors:
      
      TX ring is used for sending TX packet buffers to the NIC. It has the
      following descriptors:
      - `gve_tx_pkt_desc_dqo` - Data buffer descriptor
      - `gve_tx_tso_context_desc_dqo` - TSO context descriptor
      - `gve_tx_general_context_desc_dqo` - Generic metadata descriptor
      
      Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo`
      which represents the logical interpetation of the metadata bytes. It's
      helpful to define this structure because the metadata bytes exist in
      multiple descriptor types (including `gve_tx_tso_context_desc_dqo`),
      and the device requires same field has the same value in all
      descriptors.
      
      The TX completion ring is used to receive completions from the NIC.
      Having a separate ring allows for completions to be out of order. The
      completion descriptor `gve_tx_compl_desc` has several different types,
      most important are packet and descriptor completions. Descriptor
      completions are used to notify the driver when descriptors sent on the
      TX ring are done being consumed. The descriptor completion is only used
      to signal that space is cleared in the TX ring. A packet completion will
      be received when a packet transmitted on the TX queue is done being
      transmitted.
      
      In addition there are "miss" and "reinjection" completions. The device
      implements a "flow-miss model". Most packets will simply receive a
      packet completion. The flow-miss system may choose to process a packet
      based on its contents. A TX packet which experiences a flow miss would
      receive a miss completion followed by a later reinjection completion.
      The miss-completion is received when the packet starts to be processed
      by the flow-miss system and the reinjection completion is received when
      the flow-miss system completes processing the packet and sends it on the
      wire.
      
      The RX buffer ring is used to send buffers to HW via the
      `gve_rx_desc_dqo` descriptor.
      
      Received packets are put into the RX queue by the device, which
      populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors
      refer to buffers posted by the buffer queue. Received buffers may be
      returned out of order, such as when HW LRO is enabled.
      
      Important concepts:
      - "TX" and "RX buffer" queues, which send descriptors to the device, use
        MMIO doorbells to notify the device of new descriptors.
      
      - "RX" and "TX completion" queues, which receive descriptors from the
        device, use a "generation bit" to know when a descriptor was populated
        by the device. The driver initializes all bits with the "current
        generation". The device will populate received descriptors with the
        "next generation" which is inverted from the current generation. When
        the ring wraps, the current/next generation are swapped.
      
      - It's the driver's responsibility to ensure that the RX and TX
        completion queues are not overrun. This can be accomplished by
        limiting the number of descriptors posted to HW.
      
      - TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
        buffer_id. These will be returned on the TX completion and RX queues
        respectively to let the driver know which packet/buffer was completed.
      
      Bitfields are used to describe descriptor fields. This notation is more
      concise and readable than shift-and-mask. It is possible because the
      driver is restricted to little endian platforms.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      22319818
    • Bailey Forrest's avatar
      gve: Add support for DQO RX PTYPE map · c4b87ac8
      Bailey Forrest authored
      Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the
      packet. L3 and L4 types are necessary in order to set the hash and csum
      on RX SKBs correctly.
      
      DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map
      enables the device to tell the driver how to map from PTYPE index to
      L3/L4 type.
      
      The device doesn't provide any guarantees about the range of possible
      PTYPEs, so we just use a 1024 entry array to implement a fast mapping
      structure.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c4b87ac8
    • Bailey Forrest's avatar
      gve: adminq: DQO specific device descriptor logic · 5ca2265e
      Bailey Forrest authored
      - In addition to TX and RX queues, DQO has TX completion and RX buffer
        queues.
        - TX completions are received when the device has completed sending a
          packet on the wire.
        - RX buffers are posted on a separate queue form the RX completions.
      - DQO descriptor rings are allowed to be smaller than PAGE_SIZE.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5ca2265e
    • Bailey Forrest's avatar
      gve: Introduce per netdev `enum gve_queue_format` · a5886ef4
      Bailey Forrest authored
      The currently supported queue formats are:
      - GQI_RDA - GQI with raw DMA addressing
      - GQI_QPL - GQI with queue page list
      - DQO_RDA - DQO with raw DMA addressing
      
      The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we
      remove it in favor of just checking against GQI_RDA
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a5886ef4
    • Bailey Forrest's avatar
      gve: Introduce a new model for device options · 8a39d3e0
      Bailey Forrest authored
      The current model uses an integer ID and a fixed size struct for the
      parameters of each device option.
      
      The new model allows the device option structs to grow in size over
      time. A driver may assume that changes to device option structs will
      always be appended.
      
      New device options will also generally have a
      `supported_features_mask` so that the driver knows which fields within a
      particular device option are enabled.
      
      `gve_device_option.feat_mask` is changed to `required_features_mask`,
      and it is a bitmask which must match the value expected by the driver.
      This gives the device the ability to break backwards compatibility with
      old drivers for certain features by blocking the old drivers from trying
      to use the feature.
      
      We maintain ABI compatibility with the old model for
      GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which
      does not support the new model.
      
      This patch introduces some new terminology:
      
      RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA
            mapped and read/updated by the device.
      
      QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped
            with the device for read/write and data is copied from/to SKBs.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8a39d3e0
    • Bailey Forrest's avatar
      gve: Make gve_rx_slot_page_info.page_offset an absolute offset · 920fb451
      Bailey Forrest authored
      Using `page_offset` like a boolean means a page may only be split into
      two sections. With page sizes larger than 4k, this can be very wasteful.
      Future commits in this patchset use `struct gve_rx_slot_page_info` in a
      way which supports a fixed buffer size and a variable page size.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      920fb451
    • Bailey Forrest's avatar
      gve: gve_rx_copy: Move padding to an argument · 35f9b2f4
      Bailey Forrest authored
      Future use cases will have a different padding value.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35f9b2f4
    • Bailey Forrest's avatar
      gve: Move some static functions to a common file · dbdaa675
      Bailey Forrest authored
      These functions will be shared by the GQI and DQO variants of the GVNIC
      driver as of follow-up patches in this series.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dbdaa675
    • Bailey Forrest's avatar
      gve: Update GVE documentation to describe DQO · c6a7ed77
      Bailey Forrest authored
      DQO is a new descriptor format for our next generation virtual NIC.
      Signed-off-by: default avatarBailey Forrest <bcf@google.com>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarCatherine Sullivan <csully@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c6a7ed77
    • Yajun Deng's avatar
      usbnet: add usbnet_event_names[] for kevent · 47889068
      Yajun Deng authored
      Modify the netdev_dbg content from int to char * in usbnet_defer_kevent(),
      this looks more readable.
      Signed-off-by: default avatarYajun Deng <yajun.deng@linux.dev>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      47889068
    • David S. Miller's avatar
      Merge branch 'add-sparx5i-driver' · 67faf76d
      David S. Miller authored
      Steen Hegelund says:
      
      ====================
      Adding the Sparx5i Switch Driver
      
      This series provides the Microchip Sparx5i Switch Driver
      
      The SparX-5 Enterprise Ethernet switch family provides a rich set of
      Enterprise switching features such as advanced TCAM-based VLAN and QoS
      processing enabling delivery of differentiated services, and security
      through TCAMbased frame processing using versatile content aware processor
      (VCAP). IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported
      with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K IPv6
      (S,G) multicast groups. L3 security features include source guard and
      reverse path forwarding (uRPF) tasks. Additional L3 features include
      VRF-Lite and IP tunnels (IP over GRE/IP).
      
      The SparX-5 switch family features a highly flexible set of Ethernet ports
      with support for 10G and 25G aggregation links, QSGMII, USGMII, and
      USXGMII.  The device integrates a powerful 1 GHz dual-core ARM® Cortex®-A53
      CPU enabling full management of the switch and advanced Enterprise
      applications.
      
      The SparX-5 switch family targets managed Layer 2 and Layer 3 equipment in
      SMB, SME, and Enterprise where high port count 1G/2.5G/5G/10G switching
      with 10G/25G aggregation links is required.
      
      The SparX-5 switch family consists of following SKUs:
      
        VSC7546 SparX-5-64 supports up to 64 Gbps of bandwidth with the following
        primary port configurations.
         - 6 ×10G
         - 16 × 2.5G + 2 × 10G
         - 24 × 1G + 4 × 10G
      
        VSC7549 SparX-5-90 supports up to 90 Gbps of bandwidth with the following
        primary port configurations.
         - 9 × 10G
         - 16 × 2.5G + 4 × 10G
         - 48 × 1G + 4 × 10G
      
        VSC7552 SparX-5-128 supports up to 128 Gbps of bandwidth with the
        following primary port configurations.
         - 12 × 10G
         - 6 x 10G + 2 x 25G
         - 16 × 2.5G + 8 × 10G
         - 48 × 1G + 8 × 10G
      
        VSC7556 SparX-5-160 supports up to 160 Gbps of bandwidth with the
        following primary port configurations.
         - 16 × 10G
         - 10 × 10G + 2 × 25G
         - 16 × 2.5G + 10 × 10G
         - 48 × 1G + 10 × 10G
      
        VSC7558 SparX-5-200 supports up to 200 Gbps of bandwidth with the
        following primary port configurations.
         - 20 × 10G
         - 8 × 25G
      
      In addition, the device supports one 10/100/1000/2500/5000 Mbps
      SGMII/SerDes node processor interface (NPI) Ethernet port.
      
      Time sensitive networking (TSN) is supported through a comprehensive set of
      features including frame preemption, cut-through, frame replication and
      elimination for reliability, enhanced scheduling: credit-based shaping,
      time-aware shaping, cyclic queuing, and forwarding, and per-stream policing
      and filtering.
      
      Together with IEEE 1588 and IEEE 802.1AS support, this guarantees
      low-latency deterministic networking for Industrial Ethernet.
      
      The Sparx5i support is developed on the PCB134 and PCB135 evaluation boards.
      
      - PCB134 main networking features:
        - 12x SFP+ front 10G module slots (connected to Sparx5i through SFI).
        - 8x SFP28 front 25G module slots (connected to Sparx5i through SFI high
          speed).
        - Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port
          (on-board VSC8211 PHY connected to Sparx5i through SGMII).
      
      - PCB135 main networking features:
        - 48x1G (10/100/1000M) RJ45 front ports using 12xVSC8514 QuadPHY’s each
          connected to VSC7558 through QSGMII.
        - 4x10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY
          each port connects to VSC7558 through SFI.
        - 4x SFP28 25G module slots on back connected to VSC7558 through SFI high
          speed.
        - Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board
          VSC8211 PHY, which can be connected to VSC7558 NPI port through SGMII
          using a loopback add-on PCB)
      
      This series provides support for:
        - SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G
          devices and media types.
        - Port module configuration for 10M to 25G speeds with SGMII, QSGMII,
          1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes.
        - SerDes configuration via the Sparx5i SerDes driver (see below).
        - Host mode providing register based injection and extraction.
        - Switch mode providing MAC/VLAN table learning and Layer2 switching
          offloaded to the Sparx5i switch.
        - STP state, VLAN support, host/bridge port mode, Forwarding DB, and
          configuration and statistics via ethtool.
      
      More support will be added at a later stage.
      
      The Sparx5i Chip Register Model can be browsed at this location:
      https://github.com/microchip-ung/sparx-5_reginfo
      and the datasheet is available here:
      https://ww1.microchip.com/downloads/en/DeviceDoc/SparX-5_Family_L2L3_Enterprise_10G_Ethernet_Switches_Datasheet_00003822B.pdf
      
      The series depends on the following series currently on their way
      into the kernel:
      
      - 25G Base-R phy mode
        Link: https://lore.kernel.org/r/20210611125453.313308-1-steen.hegelund@microchip.com/
      - Sparx5 Reset Driver
        Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/
      
      ChangeLog:
      v5:
          - cover letter
              - updated the description to match the latest data sheets
          - basic driver
              - added error message in case of reset controller error
              - port struct: replacing has_sfp with inband, adding pause_adv
          - host mode
              - port cleanup: unregisters netdevs and then removes phylink etc
              - checking for pause_adv when comparing port config changes
              - getting duplex and pause state in the link_up callback.
              - getting inband, autoneg and pause_adv config in the pcs_config
                callback.
          - port
              - use only the pause_adv bits when getting aneg status
              - use the inband state when updating the PCS and port config
      v4:
          - basic driver:
              Using devm_reset_control_get_optional_shared to get the reset
              control, and let the reset framework check if it is valid.
          - host mode (phylink):
              Use the PCS operations to get state and update configuration.
              Removed the setting of interface modes.  Let phylink control this.
              Using the new 5gbase-r and 25gbase-r modes.
              Using a helper function to check if one of the 3 base-r modes has
              been selected.
              Currently it will not be possible to change the interface mode by
              changing the speed (e.g via ethtool).  This will be added later.
      v3:
          - basic driver:
              - removed unneeded braces
              - release reference to ports node after use
              - use dev_err_probe to handle DEFER
              - update error value when bailing out (a few cases)
              - updated formatting of port struct and grouping of bool values
              - simplified the spx5_rmw and spx5_inst_rmw inline functions
          - host mode (netdev):
              - removed lockless flag
              - added port timer init
          - host mode (packet - manual injection):
              - updated error counters in error situations
              - implemented timer handling of watermark threshold: stop and
                restart netif queues.
              - fixed error message handling (rate limited)
              - fixed comment style error
              - used DIV_ROUND_UP macro
              - removed a debug message for open ports
      
      v2:
          - Updated bindings:
              - drop minItems for the reg property
          - Statistics implementation:
              - Reorganized statistics into ethtool groups:
                  eth-phy, eth-mac, eth-ctrl, rmon
                as defined by the IEEE 802.3 categories and RFC 2819.
              - The remaining statistics are provided by the classic ethtool
                statistics command.
          - Hostmode support:
              - Removed netdev renaming
              - Validate ethernet address in sparx5_set_mac_address()
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      67faf76d
    • Steen Hegelund's avatar
      arm64: dts: sparx5: Add the Sparx5 switch node · d0f482bb
      Steen Hegelund authored
      This provides the configuration for the currently available evaluation
      boards PCB134 and PCB135.
      
      The series depends on the following series currently on its way
      into the kernel:
      
      - Sparx5 Reset Driver
        Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d0f482bb
    • Steen Hegelund's avatar
      net: sparx5: add ethtool configuration and statistics support · af4b1102
      Steen Hegelund authored
      This adds statistic counters for the network interfaces provided
      by the driver.  It also adds CPU port counters (which are not
      exposed by ethtool).
      This also adds support for configuring the network interface
      parameters via ethtool: speed, duplex, aneg etc.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      af4b1102
    • Steen Hegelund's avatar
      net: sparx5: add calendar bandwidth allocation support · 0a9d48ad
      Steen Hegelund authored
      This configures the Sparx5 calendars according to the bandwidth
      requested in the Device Tree nodes.
      It also checks if the total requested bandwidth is within the
      specs of the detected Sparx5 models limits.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a9d48ad
    • Steen Hegelund's avatar
      net: sparx5: add switching support · d6fce514
      Steen Hegelund authored
      This adds SwitchDev support by hardware offloading the
      software bridge.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d6fce514
    • Steen Hegelund's avatar
      net: sparx5: add vlan support · 78eab33b
      Steen Hegelund authored
      This adds Sparx5 VLAN support.
      
      Sparx5 has more VLAN features than provided here, but these will be added
      in later series. For now we only add the basic L2 features.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      78eab33b
    • Steen Hegelund's avatar
      net: sparx5: add mactable support · b37a1bae
      Steen Hegelund authored
      This adds the Sparx5 MAC tables: listening for MAC table updates and
      updating on request.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b37a1bae
    • Steen Hegelund's avatar
      net: sparx5: add port module support · 946e7fd5
      Steen Hegelund authored
      This add configuration of the Sparx5 port module instances.
      
      Sparx5 has in total 65 logical ports (denoted D0 to D64) and 33
      physical SerDes connections (S0 to S32).  The 65th port (D64) is fixed
      allocated to SerDes0 (S0). The remaining 64 ports can in various
      multiplexing scenarios be connected to the remaining 32 SerDes using
      QSGMII, or USGMII or USXGMII extenders. 32 of the ports can have a 1:1
      mapping to the 32 SerDes.
      
      Some additional ports (D65 to D69) are internal to the device and do not
      connect to port modules or SerDes macros. For example, internal ports are
      used for frame injection and extraction to the CPU queues.
      
      The 65 logical ports are split up into the following blocks.
      
      - 13 x 5G ports (D0-D11, D64)
      - 32 x 2G5 ports (D16-D47)
      - 12 x 10G ports (D12-D15, D48-D55)
      - 8 x 25G ports (D56-D63)
      
      Each logical port supports different line speeds, and depending on the
      speeds supported, different port modules (MAC+PCS) are needed. A port
      supporting 5 Gbps, 10 Gbps, or 25 Gbps as maximum line speed, will have a
      DEV5G, DEV10G, or DEV25G module to support the 5 Gbps, 10 Gbps (incl 5
      Gbps), or 25 Gbps (including 10 Gbps and 5 Gbps) speeds. As well as, it
      will have a shadow DEV2G5 port module to support the lower speeds
      (10/100/1000/2500Mbps). When a port needs to operate at lower speed and the
      shadow DEV2G5 needs to be connected to its corresponding SerDes
      
      Not all interface modes are supported in this series, but will be added at
      a later stage.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      946e7fd5
    • Steen Hegelund's avatar
      net: sparx5: add hostmode with phylink support · f3cad261
      Steen Hegelund authored
      This patch adds netdevs and phylink support for the ports in the switch.
      It also adds register based injection and extraction for these ports.
      
      Frame DMA support for injection and extraction will be added in a later
      series.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f3cad261
    • Steen Hegelund's avatar
      net: sparx5: add the basic sparx5 driver · 3cfa11ba
      Steen Hegelund authored
      This adds the Sparx5 basic SwitchDev driver framework with IO range
      mapping, switch device detection and core clock configuration.
      
      Support for ports, phylink, netdev, mactable etc. are in the following
      patches.
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Reviewed-by: default avatarPhilipp Zabel <p.zabel@pengutronix.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3cfa11ba
    • Steen Hegelund's avatar
      dt-bindings: net: sparx5: Add sparx5-switch bindings · f8c63088
      Steen Hegelund authored
      Document the Sparx5 switch device driver bindings
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: default avatarLars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: default avatarBjarni Jonasson <bjarni.jonasson@microchip.com>
      Reviewed-by: default avatarRob Herring <robh@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8c63088
    • Marcin Wojtas's avatar
      net: mdiobus: fix fwnode_mdbiobus_register() fallback case · c88c192d
      Marcin Wojtas authored
      The fallback case of fwnode_mdbiobus_register()
      (relevant for !CONFIG_FWNODE_MDIO) was defined with wrong
      argument name, causing a compilation error. Fix that.
      Signed-off-by: default avatarMarcin Wojtas <mw@semihalf.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c88c192d
    • Jakub Kicinski's avatar
      net: ip: avoid OOM kills with large UDP sends over loopback · 6d123b81
      Jakub Kicinski authored
      Dave observed number of machines hitting OOM on the UDP send
      path. The workload seems to be sending large UDP packets over
      loopback. Since loopback has MTU of 64k kernel will try to
      allocate an skb with up to 64k of head space. This has a good
      chance of failing under memory pressure. What's worse if
      the message length is <32k the allocation may trigger an
      OOM killer.
      
      This is entirely avoidable, we can use an skb with page frags.
      
      af_unix solves a similar problem by limiting the head
      length to SKB_MAX_ALLOC. This seems like a good and simple
      approach. It means that UDP messages > 16kB will now
      use fragments if underlying device supports SG, if extra
      allocator pressure causes regressions in real workloads
      we can switch to trying the large allocation first and
      falling back.
      
      v4: pre-calculate all the additions to alloclen so
          we can be sure it won't go over order-2
      Reported-by: default avatarDave Jones <dsj@fb.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6d123b81
    • Lorenz Bauer's avatar
      tools/testing: add a selftest for SO_NETNS_COOKIE · ae24bab2
      Lorenz Bauer authored
      Make sure that SO_NETNS_COOKIE returns a non-zero value, and
      that sockets from different namespaces have a distinct cookie
      value.
      Signed-off-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ae24bab2
    • Martynas Pumputis's avatar
      net: retrieve netns cookie via getsocketopt · e8b9eab9
      Martynas Pumputis authored
      It's getting more common to run nested container environments for
      testing cloud software. One of such examples is Kind [1] which runs a
      Kubernetes cluster in Docker containers on a single host. Each container
      acts as a Kubernetes node, and thus can run any Pod (aka container)
      inside the former. This approach simplifies testing a lot, as it
      eliminates complicated VM setups.
      
      Unfortunately, such a setup breaks some functionality when cgroupv2 BPF
      programs are used for load-balancing. The load-balancer BPF program
      needs to detect whether a request originates from the host netns or a
      container netns in order to allow some access, e.g. to a service via a
      loopback IP address. Typically, the programs detect this by comparing
      netns cookies with the one of the init ns via a call to
      bpf_get_netns_cookie(NULL). However, in nested environments the latter
      cannot be used given the Kubernetes node's netns is outside the init ns.
      To fix this, we need to pass the Kubernetes node netns cookie to the
      program in a different way: by extending getsockopt() with a
      SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes
      node netns can retrieve the cookie and pass it to the program instead.
      
      Thus, this is following up on Eric's commit 3d368ab8 ("net:
      initialize net->net_cookie at netns setup") to allow retrieval via
      SO_NETNS_COOKIE.  This is also in line in how we retrieve socket cookie
      via SO_COOKIE.
      
        [1] https://kind.sigs.k8s.io/Signed-off-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: default avatarMartynas Pumputis <m@lambda.lt>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e8b9eab9
  2. 23 Jun, 2021 3 commits