1. 13 Oct, 2017 40 commits
    • i40e/i40evf: don't trust VF to reset itself · 17a9422d
      Alan Brady authored
      When using 'ethtool -L' on a VF to change number of requested queues
      from PF, we shouldn't trust the VF to reset itself after making the
      request.  Doing it that way opens the door for a potentially malicious
      VF to do nasty things to the PF which should never be the case.
      
      This makes it such that after VF makes a successful request, PF will
      then reset the VF to institute required changes.  Only if the request
      fails will PF send a message back to VF letting it know the request was
      unsuccessful.
      
      Testing-hints:
      There should be no real functional changes.  This is simply hardening
      against a potentially malicious VF.
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: fix link reporting · 8fdb69dd
      Alan Brady authored
      When querying the NVM for supported phy_types, on some firmware
      versions, we were failing to actually fill out the phy_types which means
      ethtool wouldn't report any link types.
      
      Testing-hints:
      Check 'ethtool <iface>' if you have the right (wrong?) firmware.
      Without this patch, no link modes will be reported.
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: make const array patterns static, reduces object code size · b06da8f9
      Colin Ian King authored
      Don't populate const array patterns on the stack, instead make it
      static. Makes the object code smaller by over 60 bytes:
      
      Before:
         text	   data	    bss	    dec	    hex	filename
         1953	    496	      0	   2449	    991	i40e_diag.o
      
      After:
         text	   data	    bss	    dec	    hex	filename
         1798	    584	      0	   2382	    94e	i40e_diag.o
      
      (gcc 6.3.0, x86-64)
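
      The net saving can be checked from the two size(1) listings above.
      The following is only a worked-arithmetic sketch using the numbers
      quoted in this message; the variable and function names are
      illustrative and not part of the patch.

```python
# Section sizes reported for i40e_diag.o in the commit message.
before = {"text": 1953, "data": 496, "bss": 0}
after = {"text": 1798, "data": 584, "bss": 0}


def total(sections):
    """Sum of all sections, i.e. the 'dec' column printed by size(1)."""
    return sum(sections.values())


# Per-section change in bytes (negative means the section shrank).
delta = {name: after[name] - before[name] for name in before}
# Net reduction of the object file in bytes.
saved = total(before) - total(after)

print(delta)
print(saved)
```

      The text section shrinks while the data section grows, because the
      array now lives in static data instead of being built on the stack.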
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Add support setting TC max bandwidth rates · 2027d4de
      Amritha Nambiar authored
      This patch enables setting up maximum Tx rates for the traffic
      classes in i40e. The maximum rate is offloaded to the hardware through
      the mqprio framework by specifying the mode option as 'channel' and
      shaper option as 'bw_rlimit' and is configured for the VSI. Configuring
      minimum Tx rate limit is not supported in the device. The minimum
      usable value for Tx rate is 50Mbps.
      
      Example:
      # tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
        queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
        max_rate 4Gbit 5Gbit
      
      To dump the bandwidth rates:
      # tc qdisc show dev eth0
      
      qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
                   queues:(0:3) (4:7)
                   mode:channel
                   shaper:bw_rlimit   max_rate:4Gbit 5Gbit
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Refactor VF BW rate limiting · 5ecae412
      Amritha Nambiar authored
      This patch refactors the BW rate limiting for Tx traffic
      on the VF to be reused in the next patch for rate limiting Tx
      traffic for the VSIs on the PF as well.
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Enable 'channel' mode in mqprio for TC configs · a9ce82f7
      Amritha Nambiar authored
      The i40e driver is modified to enable the new mqprio hardware
      offload mode and factor the TCs and queue configuration by
      creating channel VSIs. In this mode, the priority to traffic
      class mapping and the user specified queue ranges are used
      to configure the traffic classes by setting the mode option to
      'channel'.
      
      Example:
        map 0 0 0 0 1 2 2 3 queues 2@0 2@2 1@4 1@5\
        hw 1 mode channel
      
      qdisc mqprio 8038: root  tc 4 map 0 0 0 0 1 2 2 3 0 0 0 0 0 0 0 0
                   queues:(0:1) (2:3) (4:4) (5:5)
                   mode:channel
                   shaper:dcb
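
      The relation between the 'count@offset' queue tokens on the command
      line and the '(first:last)' ranges in the dump can be sketched as
      follows. This is an illustrative helper only, not code from tc or
      the driver; the function names are made up for this example.

```python
def queue_ranges(specs):
    """Convert mqprio 'count@offset' tokens into inclusive (first, last) ranges."""
    ranges = []
    for spec in specs.split():
        count, offset = (int(part) for part in spec.split("@"))
        ranges.append((offset, offset + count - 1))
    return ranges


def format_ranges(ranges):
    """Render ranges in the '(first:last)' style used by the qdisc dump."""
    return " ".join(f"({first}:{last})" for first, last in ranges)


print(format_ranges(queue_ranges("2@0 2@2 1@4 1@5")))
```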
      
      The HW channels created are removed and all the queue configuration
      is set to default when the qdisc is detached from the root of the
      device.
      
      This patch also disables setting up channels via ethtool (ethtool -L)
      when the TCs are configured using mqprio scheduler.
      
      The patch also limits setting ethtool Rx flow hash indirection
      (ethtool -X eth0 equal N) to max queues configured via mqprio.
      The Rx flow hash indirection input through ethtool should be
      validated so that it is within the queue range configured via
      tc/mqprio. The bound checking is achieved by reporting the current
      rss size to the kernel when queues are configured via mqprio.
      
      Example:
        map 0 0 0 1 0 2 3 0 queues 2@0 4@2 8@6 11@14\
        hw 1 mode channel
      
      Cannot set RX flow hash configuration: Invalid argument
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Add infrastructure for queue channel support · 8f88b303
      Amritha Nambiar authored
      This patch sets up the infrastructure for offloading TCs and
      queue configurations to the hardware by creating HW channels (VSI).
      A new channel is created for each of the traffic class
      configuration offloaded via mqprio framework except for the first TC
      (TC0). TC0 for the main VSI is also reconfigured as per user provided
      queue parameters. Queue counts that are not power-of-2 are handled by
      reconfiguring RSS by reprogramming LUTs using the queue count value.
      This patch also handles configuring the TX rings for the channels
      and setting up the RX queue map for each channel.
      
      Also, the channels so created are removed and all the queue
      configuration is set to default when the qdisc is detached from the
      root of the device.
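
      One way to picture the handling of queue counts that are not a power
      of 2 is a lookup table filled round-robin over the queue count. The
      sketch below is an assumption for illustration only: the table size
      of 512 and the modulo fill are not stated in this message, and the
      function name is hypothetical.

```python
from collections import Counter


def fill_rss_lut(lut_size, queue_count):
    """Spread LUT entries round-robin over queue_count queues (assumed scheme)."""
    if queue_count <= 0:
        raise ValueError("queue_count must be positive")
    return [i % queue_count for i in range(lut_size)]


# Hypothetical example: a 512-entry table shared by 6 queues.
lut = fill_rss_lut(512, 6)
usage = Counter(lut)
print(sorted(usage.items()))
```

      With such a fill, every queue index stays below the queue count and
      the entries differ by at most one between queues.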
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Signed-off-by: Kiran Patil <kiran.patil@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Add macro for PF reset bit · ff424188
      Amritha Nambiar authored
      Introduce a macro for the bit setting the PF reset flag and
      update its usages. This makes it easier to use this flag
      in functions to be introduced in future without encountering
      checkpatch issues related to alignment and line over 80
      characters.
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • mqprio: Introduce new hardware offload mode and shaper in mqprio · 4e8b86c0
      Amritha Nambiar authored
      The offload types currently supported in mqprio are 0 (no offload) and
      1 (offload only TCs) by setting these values for the 'hw' option. If
      offloads are supported by setting the 'hw' option to 1, the default
      offload mode is 'dcb' where only the TC values are offloaded to the
      device. This patch introduces a new hardware offload mode called
      'channel' with 'hw' set to 1 in mqprio which makes full use of the
      mqprio options, the TCs, the queue configurations and the QoS parameters
      for the TCs. This is achieved through a new netlink attribute for the
      'mode' option which takes values such as 'dcb' (default) and 'channel'.
      The 'channel' mode also supports QoS attributes for traffic class such as
      minimum and maximum values for bandwidth rate limits.
      
      This patch enables configuring additional HW shaper attributes associated
      with a traffic class. Currently the shaper for bandwidth rate limiting is
      supported which takes options such as minimum and maximum bandwidth rates
      and are offloaded to the hardware in the 'channel' mode. The min and max
      limits for bandwidth rates are provided by the user along with the TCs
      and the queue configurations when creating the mqprio qdisc. The interface
      can be extended to support new HW shapers in future through the 'shaper'
      attribute.
      
      This patch also introduces a new data structure, 'tc_mqprio_qopt_offload',
      for offloading mqprio queue options; it is shared between the kernel and
      the device driver. It contains a copy of the existing data structure
      for mqprio queue options. This new data structure can be extended when
      adding new attributes for traffic class such as mode, shaper, shaper
      parameters (bandwidth rate limits). The existing data structure for mqprio
      queue options will be shared between the kernel and userspace.
      
      Example:
        queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
        min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
      
      To dump the bandwidth rates:
      
      qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
                   queues:(0:3) (4:7)
                   mode:channel
                   shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit
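
      How the pieces of the dump fit together (priority map, per-TC queue
      range, per-TC rate limits) can be sketched with a small lookup. This
      is an illustrative model of the values shown above, not the layout of
      'tc_mqprio_qopt_offload'; the class and function names are
      hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrafficClass:
    first_queue: int
    last_queue: int
    min_rate_gbit: int
    max_rate_gbit: int


# Values copied from the qdisc dump in this message.
prio_tc_map = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
traffic_classes = [
    TrafficClass(first_queue=0, last_queue=3, min_rate_gbit=1, max_rate_gbit=4),
    TrafficClass(first_queue=4, last_queue=7, min_rate_gbit=2, max_rate_gbit=5),
]


def class_for_priority(priority):
    """Return the traffic class that a packet priority (0..15) maps to."""
    return traffic_classes[prio_tc_map[priority]]


tc = class_for_priority(5)
print(tc.first_queue, tc.last_queue, tc.min_rate_gbit, tc.max_rate_gbit)
```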
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • Merge branch 'tipc-comm-groups' · a00344bd
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: Introduce Communication Group feature
      
      With this commit series we introduce a 'Group Communication' feature in
      order to resolve the datagram and multicast flow control problem. This
      new feature makes it possible for a user to instantiate multiple private
      virtual brokerless message buses by just creating and joining member
      sockets.
      
      The main features are as follows:
      ---------------------------------
      - Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
        If it is the first socket of the group this implies creation of the
        group. This call takes four parameters: 'type' serves as group
        identifier, 'instance' serves as member identifier, and 'scope'
        indicates the visibility of the group (node/cluster/zone). Finally,
        'flags' indicates different options for the socket joining the group.
        For the time being, there are only two such flags: 1) 'LOOPBACK'
        indicates if the creator of the socket wants to receive a copy of
        broadcast or multicast messages it sends to the group, 2) EVENTS
        indicates if it wants to receive membership (JOINED/LEFT) events for
        the other members of the group.
      
      - Groups are closed, i.e., sockets which have not joined a group will
        not be able to send messages to or receive messages from members of
        the group, and vice versa. A socket can only be member of one group
        at a time.
      
      - There are four transmission modes.
        1: Unicast. The sender transmits a message using the port identity
           (node:port tuple) of the receiving socket.
        2: Anycast. The sender transmits a message using a port name (type:
           instance:scope) of one of the receiving sockets. If more than
           one member socket matches the given address a destination is
           selected according to a round-robin algorithm, but also considering
           the destination load (advertised window size) as an additional
           criterion.
        3: Multicast. The sender transmits a message using a port name
           (type:instance:scope) of one or more of the receiving sockets.
           All sockets in the group matching the given address will receive
           a copy of the message.
        4: Broadcast. The sender transmits a message using the primitive
           send(). All members of the group, irrespective of their member
           identity (instance) number receive a copy of the message.
      
      - TIPC broadcast is used for carrying messages in mode 3 or 4 when
        this is deemed more efficient, i.e., depending on number of actual
        destinations.
      
      - All transmission modes are flow controlled, so that messages are
        never dropped or rejected, just like we are used to from connection
        oriented communication. A special algorithm guarantees that this is
        true even for multipoint-to-point communication, i.e., at occasions
        where many source sockets may decide to send simultaneously towards
        the same destination socket.
      
      - Sequence order is always guaranteed, even between the different
        transmission modes.
      
      - Member join/leave events are received in all other member sockets
        in guaranteed order. I.e., a 'JOINED' (an empty message with the OOB
        bit set) will always be received before the first data message from
        a new member, and a 'LEAVE' (like 'JOINED', but with EOR bit set) will
        always arrive after the last data message from a leaving member.
      
      -----
      v2: Reordered variable declarations in descending length order, as per
          feedback from David Miller. This was done as far as permitted by
          the initialization order.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add multipoint-to-point flow control · 04d7b574
      Jon Maloy authored
      We already have point-to-multipoint flow control within a group. But
      we also need the opposite: a scheme which can handle potentially
      hundreds of sources trying to send messages to the same destination
      simultaneously without causing buffer overflow at the recipient. This
      commit adds such a mechanism.
      
      The algorithm works as follows:
      
      - When a member detects a new, joining member, it initially sets its
        state to JOINED and advertises a minimum window to the new member.
        This window is chosen so that the new member can send exactly one
        maximum sized message, or several smaller ones, to the recipient
        before it must stop and wait for an additional advertisement. This
        minimum window ADV_IDLE is set to 65 1kB blocks.
      
      - When a member receives the first data message from a JOINED member,
        it changes the state of the latter to ACTIVE, and advertises a larger
        window ADV_ACTIVE = 12 x ADV_IDLE blocks to the sender, so it can
        continue sending with minimal disturbances to the data flow.
      
      - The active members are kept in a dedicated linked list. Each time a
        message is received from an active member, it will be moved to the
        tail of that list. This way, we keep a record of which members have
        been most (tail) and least (head) recently active.
      
      - There is a maximum number (16) of permitted simultaneous active
        senders per receiver. When this limit is reached, the receiver will
        not advertise anything immediately to a new sender, but instead put
        it in a PENDING state, and add it to a corresponding queue. At the
        same time, it will pick the least recently active member, send it an
        advertisement RECLAIM message, and set this member to state
        RECLAIMING.
      
      - The reclaimee member has to respond with a REMIT message, meaning that
        it goes back to a send window of ADV_IDLE, and returns its unused
        advertised blocks beyond that value to the reclaiming member.
      
      - When the reclaiming member receives the REMIT message, it unlinks
        the reclaimee from its active list, resets its state to JOINED, and
        notes that it is now back at ADV_IDLE advertised blocks to that
        member. If there are still unread data messages sent out by
        reclaimee before the REMIT, the member goes into an intermediate
        state REMITTED, where it stays until the said messages have been
        consumed.
      
      - The returned advertised blocks can now be re-advertised to the
        pending member, which is now set to state ACTIVE and added to
        the active member list.
      
      - To be proactive, i.e., to minimize the risk that any member will
        end up in the pending queue, we start reclaiming resources already
        when the number of active members exceeds 3/4 of the permitted
        maximum.
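
      The state transitions described above can be sketched from the
      receiver's point of view. This is a simplified model under stated
      assumptions, not the kernel implementation: the class and method
      names are hypothetical, the intermediate REMITTED state is left
      out, and a reclaimed member is assumed to stay in the active list
      until its REMIT arrives. The constants come from the text.

```python
from collections import OrderedDict, deque

ADV_IDLE = 65                # minimum window in 1 kB blocks (from the text)
ADV_ACTIVE = 12 * ADV_IDLE   # window granted to an active sender
MAX_ACTIVE = 16              # permitted simultaneous active senders


class Receiver:
    """Receiver-side bookkeeping of its peers (simplified sketch)."""

    def __init__(self):
        self.state = {}              # peer -> JOINED/ACTIVE/PENDING/RECLAIMING
        self.active = OrderedDict()  # head = least recently active
        self.pending = deque()       # peers waiting for an active slot
        self.log = []                # (peer, message, blocks) sent to peers

    def member_joined(self, peer):
        self.state[peer] = "JOINED"
        self.log.append((peer, "ADV", ADV_IDLE))

    def data_received(self, peer):
        if self.state[peer] == "ACTIVE":
            self.active.move_to_end(peer)      # now most recently active
        elif self.state[peer] == "JOINED":
            if len(self.active) < MAX_ACTIVE:
                self._activate(peer)
            else:
                self.state[peer] = "PENDING"
                self.pending.append(peer)
                self._reclaim_least_recent()

    def remit_received(self, peer):
        if self.state.get(peer) != "RECLAIMING":
            return
        del self.active[peer]
        self.state[peer] = "JOINED"            # back at ADV_IDLE blocks
        if self.pending:
            self._activate(self.pending.popleft())

    def _activate(self, peer):
        self.state[peer] = "ACTIVE"
        self.active[peer] = True
        self.log.append((peer, "ADV", ADV_ACTIVE))
        # Proactive reclaim once more than 3/4 of the slots are in use.
        if 4 * len(self.active) > 3 * MAX_ACTIVE:
            self._reclaim_least_recent()

    def _reclaim_least_recent(self):
        for peer in self.active:               # iterates from the head
            if self.state[peer] == "ACTIVE":
                self.state[peer] = "RECLAIMING"
                self.log.append((peer, "RECLAIM", 0))
                return
```

      In this model, letting 17 peers join and send one message each
      leaves the 17th peer PENDING until the first REMIT arrives, at which
      point it is promoted to ACTIVE.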
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: guarantee delivery of last broadcast before DOWN event · a3bada70
      Jon Maloy authored
      The following scenario is possible:
      - A user sends a broadcast message, and thereafter immediately leaves
        the group.
      - The LEAVE message, following a different path than the broadcast,
        arrives ahead of the broadcast, and the sending member is removed
        from the receiver's list.
      - The broadcast message arrives, but is dropped because the sender
        now is unknown to the recipient.
      
      We fix this by sequence numbering membership events, just like ordinary
      unicast messages. Currently, when a JOIN is sent to a peer, it contains
      a synchronization point, i.e., the sequence number of the next sent
      broadcast, in order to give the receiver a start synchronization point.
      We now let even LEAVE messages contain such an "end synchronization"
      point, so that the recipient can delay the removal of the sending member
      until it knows that all messages have been received.
      
      The received synchronization points are added as sequence numbers to the
      generated membership events, making it possible to handle them almost
      the same way as regular unicasts in the receiving filter function. In
      particular, a DOWN event with a too high sequence number will be kept
      in the reordering queue until the missing broadcast(s) arrive and have
      been delivered.
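
      The effect of the end synchronization point can be illustrated with
      a small per-sender reordering queue. This is a sketch under
      simplifying assumptions, not the kernel code: the class and method
      names are hypothetical, sequence number wraparound is ignored, and
      duplicate broadcasts are not filtered.

```python
import heapq


class MemberFilter:
    """Per-sender filter delivering broadcasts and the DOWN event in order."""

    def __init__(self, start_seqno):
        self.next_seqno = start_seqno  # start synchronization point from JOIN
        self.deferred = []             # min-heap of (seqno, rank, item)
        self.delivered = []            # what the user would see, in order

    def receive_broadcast(self, seqno, payload):
        heapq.heappush(self.deferred, (seqno, 0, payload))
        self._drain()

    def receive_leave(self, end_seqno):
        # end_seqno: sequence number the leaver's next broadcast would have had.
        heapq.heappush(self.deferred, (end_seqno, 1, "DOWN"))
        self._drain()

    def _drain(self):
        # Deliver everything whose sequence number has been reached.
        while self.deferred and self.deferred[0][0] <= self.next_seqno:
            seqno, rank, item = heapq.heappop(self.deferred)
            self.delivered.append(item)
            if rank == 0:              # only broadcasts advance the counter
                self.next_seqno = seqno + 1


f = MemberFilter(start_seqno=10)
f.receive_leave(12)              # DOWN arrives first, is held back
f.receive_broadcast(11, "b11")   # still waiting for broadcast 10
f.receive_broadcast(10, "b10")   # gap closed, everything is released
print(f.delivered)
```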
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: guarantee delivery of UP event before first broadcast · 399574d4
      Jon Maloy authored
      The following scenario is possible:
      - A user joins a group, and immediately sends out a broadcast message
        to its members.
      - The broadcast message, following a different data path than the
        initial JOIN message sent out during the joining procedure, arrives
        at a receiver before the latter.
      - The receiver drops the message, since it is not ready to accept any
        messages until the JOIN has arrived.
      
      We avoid this by treating group protocol JOIN messages like unicast
      messages.
      - We let them pass through the recipient's multicast input queue, just
        like ordinary unicasts.
      - We force the first following broadcast to be sent as replicated
        unicast and to be acknowledged by the recipient before accepting
        any more broadcast transmissions.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: guarantee that group broadcast doesn't bypass group unicast · 2f487712
      Jon Maloy authored
      We need a mechanism guaranteeing that group unicasts sent out from a
      socket are not bypassed by later sent broadcasts from the same socket.
      We do this as follows:
      
      - Each time a unicast is sent, we set the broadcast method for the
        socket to "replicast" and "mandatory". This forces the first
        subsequent broadcast message to follow the same network and data path
        as the preceding unicast to a destination, hence preventing it from
        overtaking the latter.
      
      - In order to make the 'same data path' statement above true, we let
        group unicasts pass through the multicast link input queue, instead
        of as previously through the unicast link input queue.
      
      - In the first broadcast following a unicast, we set a new header flag,
        requiring all recipients to immediately acknowledge its reception.
      
      - During the period before all the expected acknowledgements are received,
        the socket refuses to accept any more broadcast attempts, i.e., by
        blocking or returning EAGAIN. This period should typically not be
        longer than a few microseconds.
      
      - When all acknowledgements have been received, the sending socket will
        open up for subsequent broadcasts, this time giving the link layer
        freedom to itself select the best transmission method.
      
      - The forced and/or abrupt transmission method changes described above
        may lead to broadcasts arriving out of order to the recipients. We
        remedy this by introducing code that checks and if necessary
        re-orders such messages at the receiving end.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: guarantee group unicast doesn't bypass group broadcast · b87a5ea3
      Jon Maloy authored
      Group unicast messages don't follow the same path as broadcast messages,
      and there is a high risk that unicasts sent from a socket might bypass
      previously sent broadcasts from the same socket.
      
      We fix this by letting all unicast messages carry the sequence number of
      the next sent broadcast from the same node, but without updating this
      number at the receiver. This way, a receiver can check and if necessary
      re-order such messages before they are added to the socket receive buffer.
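
      A receiver-side check along these lines might look as follows. This
      is a sketch with hypothetical names, not the kernel code: a unicast
      is held until all broadcasts preceding its stamped sequence number
      have been delivered, and delivering it leaves the expected broadcast
      number unchanged. Sequence number wraparound is ignored.

```python
class GroupReceiver:
    """Holds unicasts until earlier broadcasts from the same sender arrive."""

    def __init__(self, next_bc_seqno):
        self.next_bc = next_bc_seqno  # next broadcast expected from the sender
        self.held = []                # (kind, seqno, payload) not yet deliverable
        self.buffer = []              # stand-in for the socket receive buffer

    def receive(self, kind, seqno, payload):
        self.held.append((kind, seqno, payload))
        self._deliver_ready()

    def _next_ready(self):
        for msg in self.held:         # unicasts whose predecessors are all in
            if msg[0] == "ucast" and msg[1] <= self.next_bc:
                return msg
        for msg in self.held:         # otherwise the next broadcast in sequence
            if msg[0] == "bcast" and msg[1] == self.next_bc:
                return msg
        return None

    def _deliver_ready(self):
        while (msg := self._next_ready()) is not None:
            self.held.remove(msg)
            if msg[0] == "bcast":     # only broadcasts advance the counter
                self.next_bc += 1
            self.buffer.append(msg[2])


rx = GroupReceiver(next_bc_seqno=5)
rx.receive("ucast", 6, "u")    # stamped with next broadcast 6: must wait for 5
rx.receive("bcast", 5, "b5")   # releases the broadcast, then the unicast
print(rx.buffer)
```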
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce group multicast messaging · 5b8dddb6
      Jon Maloy authored
      The previously introduced message transport to all group members is
      based on the tipc multicast service, but is logically a broadcast
      service within the group, and that is what we call it.
      
      We now add functionality for sending messages to all group members
      having a certain identity. Correspondingly, we call this feature 'group
      multicast'. The service is using unicast when only one destination is
      found, otherwise it will use the bearer broadcast service to transfer
      the messages. In the latter case, the receiving members filter arriving
      messages by looking at the intended destination instance. If there is
      no match, the message will be dropped, while still being considered
      received and read as seen by the flow control mechanism.
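
      The receive-side filtering for the bearer broadcast case can be
      sketched like this. The names are hypothetical and the accounting is
      simplified to a message count, whereas the real flow control works
      on advertised blocks.

```python
class Member:
    """Group member that filters bearer broadcasts by destination instance."""

    def __init__(self, instance):
        self.instance = instance
        self.delivered = []   # messages handed to the user
        self.consumed = 0     # messages reported to flow control as read

    def on_bearer_broadcast(self, dest_instance, payload):
        # Counted as received and read whether or not it is for this member.
        self.consumed += 1
        if dest_instance == self.instance:
            self.delivered.append(payload)


member = Member(instance=7)
member.on_bearer_broadcast(7, "for me")
member.on_bearer_broadcast(9, "for someone else")
print(member.delivered, member.consumed)
```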
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce group anycast messaging · ee106d7f
      Jon Maloy authored
      In this commit, we make it possible to send connectionless unicast
      messages to any member corresponding to the given member identity,
      when there is more than one such member. The sender must use a
      TIPC_ADDR_NAME address to achieve this effect.
      
      We also perform load balancing between the destinations, i.e., we
      primarily select one which has advertised sufficient send window
      to not cause a block/EAGAIN delay, if any. This mechanism is
      overlaid on the always present round-robin selection.
      
      Anycast messages are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
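
      The destination selection described above, round-robin with a
      preference for members that have advertised enough window, can be
      sketched as follows. The class name, the tuple layout and the
      fallback to the plain round-robin candidate are assumptions made for
      this illustration.

```python
class AnycastSelector:
    """Round-robin selection that prefers members with enough send window."""

    def __init__(self, members):
        self.members = list(members)   # (member_id, advertised_window_blocks)
        self.cursor = 0                # round-robin starting position

    def select(self, msg_blocks):
        n = len(self.members)
        if n == 0:
            raise ValueError("no matching members")
        # Candidate indexes in round-robin order, starting at the cursor.
        candidates = [(self.cursor + i) % n for i in range(n)]
        # First candidate with sufficient window; else plain round-robin,
        # in which case the sender would block or get EAGAIN.
        chosen = next(
            (i for i in candidates if self.members[i][1] >= msg_blocks),
            candidates[0],
        )
        self.cursor = (chosen + 1) % n
        return self.members[chosen][0]


selector = AnycastSelector([("A", 0), ("B", 10), ("C", 10)])
picks = [selector.select(5) for _ in range(3)]
print(picks)
```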
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce group unicast messaging · 27bd9ec0
      Jon Maloy authored
      We now make it possible to send connectionless unicast messages
      within a communication group. To send a message, the sender can use
      either a direct port address, aka port identity, or an indirect port
      name to be looked up.
      
      Messages of this type are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce flow control for group broadcast messages · b7d42635
      Jon Maloy authored
      We introduce an end-to-end flow control mechanism for group broadcast
      messages. This ensures that no messages are ever lost because of
      destination receive buffer overflow, with minimal impact on performance.
      For now, the algorithm is based on the assumption that there is only one
      active transmitter at any moment in time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: receive group membership events via member socket · ae236fb2
      Jon Maloy authored
      Like with any other service, group members' availability can be
      subscribed for by connecting to the topology server. However, because
      the events arrive via a different socket than the member socket, there
      is a real risk that membership events may arrive out of sync with the
      actual JOIN/LEAVE action. I.e., it is possible to receive the first
      messages from a new member before the corresponding JOIN event arrives,
      just as it is possible to receive the last messages from a leaving
      member after the LEAVE event has already been received.
      
      Since each member socket is internally also subscribing for membership
      events, we now fix this problem by passing those events on to the user
      via the member socket. We leverage the already present member
      synchronization protocol to guarantee correct message/event order. An event
      is delivered to the user as an empty message where the two source
      addresses identify the new/lost member. Furthermore, we set the MSG_OOB
      bit in the message flags to mark it as an event. If the event is an
      indication about a member loss we also set the MSG_EOR bit, so it can
      be distinguished from a member addition event.
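
      On the user side, telling events from data then comes down to
      testing the two flag bits on a received message. The sketch below
      only shows that test; the function name is hypothetical, and the
      fallback value for MSG_EOR is the Linux constant, used in case the
      platform's socket module does not define it.

```python
import socket

MSG_OOB = socket.MSG_OOB
MSG_EOR = getattr(socket, "MSG_EOR", 0x80)  # 0x80 is the Linux value


def classify(flags):
    """Classify a received message by the flags returned from recvmsg()."""
    if not flags & MSG_OOB:
        return "data"
    return "member left" if flags & MSG_EOR else "member joined"


print(classify(MSG_OOB))
print(classify(MSG_OOB | MSG_EOR))
```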
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add second source address to recvmsg()/recvfrom() · 31c82a2d
      Jon Maloy authored
      With group communication, it becomes important for a message receiver to
      identify not only from which socket (identified by a node:port tuple) the
      message was sent, but also the logical identity (type:instance) of the
      sending member.
      
      We fix this by adding a second instance of struct sockaddr_tipc to the
      source address area when a message is read. The extra address struct
      is filled in with data found in the received message header (type) and
      in the local member representation struct (instance).
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce communication groups · 75da2163
      Jon Maloy authored
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able to keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as a logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and if it wants to be eligible as destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be a member of one group at a time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75da2163
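The group rules described in the commit above can be modelled in a few lines of user-space C. This is only a sketch of the semantics (one group per socket, closed membership, optional loopback of a member's own broadcasts), not the kernel implementation. The names `group_req`, `member`, `GRP_LOOPBACK`, `group_join()`, `may_exchange()` and `broadcast_receivers()` are hypothetical and chosen for illustration; the four struct fields simply mirror the four parameters the commit message lists.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit: the member wants copies of its own group traffic. */
#define GRP_LOOPBACK 0x1u

/* Illustrative mirror of the four join parameters named in the commit. */
struct group_req {
	uint32_t type;     /* group identifier */
	uint32_t instance; /* logical member identifier */
	uint32_t scope;    /* visibility: node/cluster/zone */
	uint32_t flags;    /* member properties */
};

/* A socket is either outside every group or a member of exactly one. */
struct member {
	bool joined;
	struct group_req req;
};

/* Joining fails if the socket already belongs to a group. */
static bool group_join(struct member *m, struct group_req req)
{
	if (m->joined)
		return false;
	m->joined = true;
	m->req = req;
	return true;
}

/* Closed group: both ends must have joined the same group type. */
static bool may_exchange(const struct member *a, const struct member *b)
{
	return a->joined && b->joined && a->req.type == b->req.type;
}

/* Count who receives a group broadcast sent by members[sender]. */
static size_t broadcast_receivers(const struct member *members, size_t n,
				  size_t sender)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (!may_exchange(&members[sender], &members[i]))
			continue;
		/* The sender only gets a copy if it asked for loopback. */
		if (i == sender && !(members[i].req.flags & GRP_LOOPBACK))
			continue;
		count++;
	}
	return count;
}
```

In this model a socket that never called `group_join()` is never counted as a receiver, which corresponds to the "closed group" property stated in the commit message.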
    • Jon Maloy's avatar
      tipc: improve destination linked list · a80ae530
      Jon Maloy authored
      We often see a need for a linked list of destination identities,
      sometimes containing a port number, sometimes a node identity, and
      sometimes both. The currently defined struct u32_list is not generic
      enough to cover all cases, so we extend it to contain two u32 integers
      and rename it to struct tipc_dest_list.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a80ae530
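As a rough illustration of the idea in the commit above, a destination list whose entries carry two 32-bit values could look like the following user-space sketch. It uses a plain singly linked list rather than the kernel's list primitives, and the names `dest`, `dest_find()`, `dest_push()` and `dest_pop()` are assumptions made for this example, not the identifiers used in the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* One destination: a node identity, a port number, or both. */
struct dest {
	struct dest *next;
	uint32_t node;
	uint32_t port;
};

/* Return the entry matching both values, or NULL if absent. */
static struct dest *dest_find(struct dest *head, uint32_t node, uint32_t port)
{
	for (struct dest *d = head; d; d = d->next)
		if (d->node == node && d->port == port)
			return d;
	return NULL;
}

/* Add a destination at the head; false on duplicate or allocation failure. */
static bool dest_push(struct dest **head, uint32_t node, uint32_t port)
{
	struct dest *d;

	if (dest_find(*head, node, port))
		return false;
	d = malloc(sizeof(*d));
	if (!d)
		return false;
	d->node = node;
	d->port = port;
	d->next = *head;
	*head = d;
	return true;
}

/* Remove the first entry and report its values; false if the list is empty. */
static bool dest_pop(struct dest **head, uint32_t *node, uint32_t *port)
{
	struct dest *d = *head;

	if (!d)
		return false;
	*node = d->node;
	*port = d->port;
	*head = d->next;
	free(d);
	return true;
}
```

A caller that only cares about nodes can leave `port` at 0, and vice versa, which is the flexibility a single-u32 list cannot offer.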
    • Jon Maloy's avatar
      tipc: add new function for sending multiple small messages · f70d37b7
      Jon Maloy authored
      We see an increasing need to send multiple single-buffer messages
      of TIPC_SYSTEM_IMPORTANCE to different individual destination nodes.
      Instead of looping over the send queue and sending each buffer
      individually, as we do now, we add a new helper function
      tipc_node_distr_xmit() to do this.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f70d37b7
    • Jon Maloy's avatar
      tipc: refactor function filter_rcv() · 64ac5f59
      Jon Maloy authored
      In the following commits we will need to handle multiple incoming and
      rejected/returned buffers in the function socket.c::filter_rcv().
      As a preparation for this, we generalize the function by handling
      buffer queues instead of individual buffers. We also introduce a
      helper function tipc_skb_reject(), and rename filter_rcv() to
      tipc_sk_filter_rcv() in line with other functions in socket.c.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      64ac5f59
    • Jon Maloy's avatar
      tipc: add ability to obtain node availability status from other files · 38077b8e
      Jon Maloy authored
      In the coming commits, functions at the socket level will need the
      ability to read the availability status of a given node. We therefore
      introduce a new function for this purpose, while renaming the existing
      static function currently having the wanted name.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      38077b8e
    • Jon Maloy's avatar
      tipc: improve address sanity check in tipc_connect() · 23998835
      Jon Maloy authored
      The address given to tipc_connect() is not completely sanity checked,
      under the assumption that this will be done later in the function
      __tipc_sendmsg() when the address is used there.
      
      However, the latter function will in the next commits serve as caller
      to several other send functions, so we want to move the corresponding
      sanity check there to the beginning of that function, before we possibly
      need to grab the address stored by tipc_connect(). We must therefore
      be able to trust that this address already has been thoroughly checked.
      
      We do this in this commit.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23998835
    • Jon Maloy's avatar
      tipc: add ability to order and receive topology events in driver · 14c04493
      Jon Maloy authored
      As preparation for introducing communication groups, we add the ability
      to issue topology subscriptions and receive topology events from kernel
      space. This will make it possible for group member sockets to keep track
      of other group members.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14c04493
    • Florian Westphal's avatar
    • Geert Uytterhoeven's avatar
      ravb: Consolidate clock handling · ab104615
      Geert Uytterhoeven authored
      The module clock is used for two purposes:
        - Wake-on-LAN (WoL), which is optional,
        - gPTP Timer Increment (GTI) configuration, which is mandatory.
      
      As the clock is needed for GTI configuration anyway, WoL is always
      available.  Hence remove duplication and repeated obtaining of the clock
      by making GTI use the stored clock for WoL use.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab104615
    • David S. Miller's avatar
      Merge branch 'net-support-bgmac-with-B50212E-B1-PHY' · c669b5cf
      David S. Miller authored
      Rafał Miłecki says:
      
      ====================
      net: support bgmac with B50212E B1 PHY
      
      I got a report that a board with BCM47189 SoC and B50212E B1 PHY doesn't
      work well with some devices, as there is massive ping loss. After
      analyzing the PHY state, it appeared that it runs in slave mode and
      doesn't automatically switch to master when needed.
      
      This patchset fixes this by:
      1) Adding new flag support to the PHY driver for setting master mode
      2) Modifying bgmac to request master mode for reported hardware
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c669b5cf
    • Rafał Miłecki's avatar
      net: bgmac: enable master mode for BCM54210E and B50212E PHYs · 12acd136
      Rafał Miłecki authored
      There are 4 very similar PHYs:
      0x600d84a1: BCM54210E (rev B0)
      0x600d84a2: BCM54210E (rev B1)
      0x600d84a5: B50212E (rev B0)
      0x600d84a6: B50212E (rev B1)
      that need master mode to be set manually. By default they run in slave
      mode with Automatic Slave/Master configuration disabled, which can lead
      to an unreliable connection with massive ping loss.
      
      So far this has been reported for a board with a BCM47189 SoC and a
      B50212E B1 PHY connected to the bgmac-supported Ethernet device. Telling
      the PHY driver to set up the PHY properly solves this issue.
      Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12acd136
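A minimal sketch of the decision described in the commit above: given a PHY identifier, decide whether master mode has to be requested. The four ID values are the ones listed in the commit message; the function name `phy_needs_forced_master()` is hypothetical, and the real driver may well match these IDs differently (for example with a mask), so treat this as an illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for the four PHY IDs the commit lists as needing master mode. */
static bool phy_needs_forced_master(uint32_t phy_id)
{
	switch (phy_id) {
	case 0x600d84a1: /* BCM54210E (rev B0) */
	case 0x600d84a2: /* BCM54210E (rev B1) */
	case 0x600d84a5: /* B50212E (rev B0) */
	case 0x600d84a6: /* B50212E (rev B1) */
		return true;
	default:
		return false;
	}
}
```

An Ethernet driver following this pattern would set the corresponding device flag before handing the PHY over to the PHY driver, and leave every other PHY untouched.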
    • Rafał Miłecki's avatar
      net: phy: broadcom: support new device flag for setting master mode · 2355a654
      Rafał Miłecki authored
      Some of Broadcom's PHYs run by default in slave mode with Automatic
      Slave/Master configuration disabled. It stops them from working properly
      with some devices.
      
      So far it has been verified for BCM54210E and BCM50212E which don't
      work well with Intel's I217-LM and I218-LM:
      http://ark.intel.com/products/60019/Intel-Ethernet-Connection-I217-LM
      http://ark.intel.com/products/71307/Intel-Ethernet-Connection-I218-LM
      I was told there is massive ping loss.
      
      This commit adds support for a new flag which can be set by an Ethernet
      driver to fix up the PHY setup.
      Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2355a654
    • Mahesh Bandewar's avatar
      ipvlan: always use the current L2 addr of the master · 32c10bbf
      Mahesh Bandewar authored
      If the underlying master ever changes its L2 address (e.g. a bonding
      device), make sure that the IPvlan slaves always emit packets with the
      current L2 address of the master instead of the stale MAC address that
      was copied during device creation. The problem can be seen with the
      following script:
      
        #!/bin/bash
        # Create a vEth pair
        ip link add dev veth0 type veth peer name veth1
        ip link set veth0 up
        ip link set veth1 up
        ip link show veth0
        ip link show veth1
        # Create an IPvlan device on one end of this vEth pair.
        ip link add link veth0 dev ipvl0 type ipvlan mode l2
        ip link show ipvl0
        # Change the mac-address of the vEth master.
        ip link set veth0 address 02:11:22:33:44:55
      
      Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.")
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32c10bbf
    • David S. Miller's avatar
      Merge branch 'act-ife-misc' · 743b8bb6
      David S. Miller authored
      Alexander Aring says:
      
      ====================
      sched: act: ife: UAPI checks and performance tweaks
      
      This patch series starts with a patch that adds a check for IFE_ENCODE
      and IFE_DECODE when an ife act is created or updated, so that only these
      two cases need to be handled inside the act callback.
      
      The second patch uses per-cpu counters and moves the spinlock so that
      it is held for less time in the act callback.
      
      The last patch uses RCU for updating parameters and also moves the
      spinlock, for the same purpose as in patch 2.
      
      Notes:
       - There is still a spinlock protecting the metalist and a rw-lock for
         another list. These should be migrated to an RCU list, if possible.
      
       - I still use a dereference in the dump callback. What I did not fully
         understand is what rcu_assign_pointer does while the RCU read lock
         is held. I suppose the pointer will be updated, in which case we
         don't have any issue here.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      743b8bb6
    • Alexander Aring's avatar
      sched: act: ife: update parameters via rcu handling · aa9fd9a3
      Alexander Aring authored
      This patch changes parameter updating to use RCU instead of a spinlock.
      This reduces the time that the spinlock is held.
      Signed-off-by: Alexander Aring <aring@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa9fd9a3
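The core idea behind an RCU-style parameter update is that readers load a pointer to an immutable parameter block while a writer publishes a complete replacement in one atomic step. The sketch below imitates only that publish/read step with C11 atomics in user space. It is not the kernel RCU API and deliberately omits the grace-period handling that real RCU provides before the old block may be freed. All names (`params`, `active_params`, `params_read()`, `params_publish()`) are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical parameter block; readers treat it as read-only. */
struct params {
	int action;
	unsigned int type;
};

/* Pointer to the currently active block; starts out as NULL. */
static _Atomic(struct params *) active_params;

/* Reader side: a single atomic load, no lock taken. */
static const struct params *params_read(void)
{
	return atomic_load_explicit(&active_params, memory_order_acquire);
}

/*
 * Writer side: publish a fully initialized block and hand back the old
 * one. The caller must not free the old block until no reader can still
 * be using it (the part that real RCU grace periods take care of).
 */
static struct params *params_publish(struct params *new_params)
{
	return atomic_exchange_explicit(&active_params, new_params,
					memory_order_acq_rel);
}
```

Because readers never take a lock in this scheme, any spinlock that remains only has to cover the state that is still modified in place.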
    • Alexander Aring's avatar
      sched: act: ife: migrate to use per-cpu counters · ced273ea
      Alexander Aring authored
      This patch migrates the current counter handling, which is protected by
      a spinlock, to per-cpu counter handling. This reduces the time during
      which the spinlock is held.
      Signed-off-by: Alexander Aring <aring@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ced273ea
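The per-cpu counter idea can be sketched in plain C as a set of counter slots, one per CPU: each CPU updates only its own slot, so the hot path needs no shared lock, and a reader sums all slots when it wants totals. This is a single-threaded model of the data layout, not the kernel's per-cpu API. `NR_SLOTS`, `slot_stats`, `sharded_stats`, `stats_add()` and `stats_sum()` are names invented for the example.

```c
#include <stdint.h>

#define NR_SLOTS 4 /* hypothetical number of CPUs */

/* Counters owned by one CPU. */
struct slot_stats {
	uint64_t packets;
	uint64_t bytes;
};

/* One slot per CPU; no slot is shared between writers. */
struct sharded_stats {
	struct slot_stats slot[NR_SLOTS];
};

/* Hot path: touch only the slot that belongs to the calling CPU. */
static void stats_add(struct sharded_stats *s, unsigned int cpu,
		      uint64_t bytes)
{
	struct slot_stats *mine = &s->slot[cpu % NR_SLOTS];

	mine->packets++;
	mine->bytes += bytes;
}

/* Slow path (e.g. a dump request): fold all slots into one total. */
static struct slot_stats stats_sum(const struct sharded_stats *s)
{
	struct slot_stats total = {0, 0};

	for (unsigned int i = 0; i < NR_SLOTS; i++) {
		total.packets += s->slot[i].packets;
		total.bytes += s->slot[i].bytes;
	}
	return total;
}
```

The trade-off is that reads become more expensive and only approximately consistent, which is usually acceptable for statistics.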
    • Alexander Aring's avatar
      sched: act: ife: move encode/decode check to init · 734534e9
      Alexander Aring authored
      This patch adds a check for the two possible ife handlings, encode and
      decode, to the init callback. The decode value exists for usability and
      is used in userspace code only. The current code offers only encode or
      else decode; this patch disallows any other option.
      Signed-off-by: Alexander Aring <aring@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      734534e9
    • David S. Miller's avatar
      Merge branch 'net-sched-fix-IFE-meta-modules-loading' · ed7f2622
      David S. Miller authored
      Roman Mashak says:
      
      ====================
      net: sched: Fix IFE meta modules loading
      
      Adjust module alias names of IFE meta modules and fix the bug that
      prevented auto-loading of IFE modules at run time.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ed7f2622
    • Roman Mashak's avatar
      net sched actions: fix module auto-loading · d3f24ba8
      Roman Mashak authored
      The macro __stringify_1() can stringify a macro argument, but IFE_META_*
      are enums, so they never expand. Since request_module expects an integer
      in the IFE module name, auto-loading always fails.
      
      Fixes: ef6980b6 ("introduce IFE action")
      Signed-off-by: Roman Mashak <mrv@mojatatu.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3f24ba8
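The preprocessor behaviour behind this bug can be reproduced in a few lines of standard C. Stringification happens before the compiler proper ever sees enumerators, so an enum constant is turned into its name, while a macro with the same value is expanded to the number first. The macros `STR_1()`/`STR()`, the identifiers `META_AS_ENUM` and `META_AS_MACRO`, and the `"meta-"` prefix are all made up for this sketch; they only mimic the shape of the situation the commit describes.

```c
#include <stddef.h>
#include <stdio.h>

/* Two-level stringification, the usual idiom for expanding the argument. */
#define STR_1(x) #x
#define STR(x) STR_1(x)

enum { META_AS_ENUM = 1 }; /* enumerator: invisible to the preprocessor */
#define META_AS_MACRO 1    /* macro: expanded before stringification */

/* Alias built at compile time from the enumerator: contains its name. */
static const char *alias_from_enum(void)
{
	return "meta-" STR(META_AS_ENUM);
}

/* Alias built at compile time from the macro: contains the number. */
static const char *alias_from_macro(void)
{
	return "meta-" STR(META_AS_MACRO);
}

/* Name built at run time from the value, as a module request would do. */
static int name_from_value(char *buf, size_t len, int id)
{
	return snprintf(buf, len, "meta-%d", id);
}
```

The name produced at run time from the numeric value matches the macro-based string but never the enum-based one, which is the kind of mismatch that makes a lookup by alias fail.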