1. 18 Oct, 2017 16 commits
    • David S. Miller
      Merge branch 'dsa-master-and-slave-helpers' · 1bbc7289
      David S. Miller authored
      Vivien Didelot says:
      
      ====================
      net: dsa: master and slave helpers
      
      This patch series adds a few helpers to DSA core for clarity and
      readability but brings no functional changes.
      
      A dsa_slave_notify helper calls the DSA notifiers when (un)registering a
      slave device.
      
      Most of the DSA slave code only needs to access the dsa_port structure,
      not the dsa_slave_priv (which only contains a few PHY-specific members).
      Thus a dsa_slave_to_port helper returns a dsa_port structure of a slave
      device.
      
      A dsa_slave_to_master helper returns the master device of a slave device.
      
      After that the netdev member of the dsa_port structure is split into two
      explicit master and slave members to avoid confusion, and a dsa_to_port
      helper is added for switch drivers to get a const reference to a port.
      
      Changes in v2:
        - prefer dsa_slave_to_master instead of dsa_slave_get_master
        - rename dsa_master_get_slave to dsa_master_find_slave
        - pack master and slave net devices into an anonymous union
        - add dsa_to_port public helper for switch drivers
        - add Reviewed-by tags from Florian
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vivien Didelot
      net: dsa: add dsa_to_port helper · c8652c83
      Vivien Didelot authored
      The dsa_port structure is part of DSA core data and must only be updated
      by the latter. It is OK and sometimes necessary for the DSA drivers to
      access this data, but this has to be read only.
      
      For that purpose, add a dsa_to_port() helper which returns a const
      pointer to a dsa_port structure which must be used by DSA drivers from
      now on instead of digging into ds->ports[] themselves.
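      A minimal userspace sketch of the idea, assuming heavily simplified stand-ins for struct dsa_switch and struct dsa_port (the real kernel structures have many more members, and MAX_PORTS is hypothetical). The helper hands drivers a const pointer, so the compiler rejects writes to core-owned data:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 12  /* hypothetical capacity */

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct dsa_port {
	int index;
	const char *name;
};

struct dsa_switch {
	size_t num_ports;
	struct dsa_port ports[MAX_PORTS];
};

/* Read-only accessor: drivers get a const pointer instead of indexing
 * ds->ports[] themselves, so any write through it fails to compile. */
static inline const struct dsa_port *dsa_to_port(struct dsa_switch *ds, int p)
{
	assert(p >= 0 && (size_t)p < ds->num_ports && p < MAX_PORTS);
	return &ds->ports[p];
}
```

      With this, a driver writes const struct dsa_port *dp = dsa_to_port(ds, port); and an accidental dp->index = 0; fails to compile instead of silently modifying DSA core state.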
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vivien Didelot
      net: dsa: split dsa_port's netdev member · f8b8b1cd
      Vivien Didelot authored
      The dsa_port structure has a "netdev" member, which can be used for
      either the master device, or the slave device, depending on its type.
      
      It is true that today, CPU ports are not exposed to userspace, thus the
      port's netdev member can be used to point to its master interface.
      
      But it is still slightly confusing, so split it into more explicit
      "master" and "slave" members inside an anonymous union.
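      The split can be pictured with a simplified structure. This is a sketch, not the kernel definition: the port_type enum and the port_master()/port_slave() accessors are hypothetical. Both names alias the same storage, so the structure does not grow, while each call site now states which device it means (anonymous unions need C11):

```c
#include <assert.h>
#include <stddef.h>

struct net_device { const char *name; };  /* hypothetical stand-in */

/* Illustrative port types; not the kernel's enum. */
enum port_type { PORT_TYPE_CPU, PORT_TYPE_USER };

struct dsa_port {
	enum port_type type;
	union {
		/* Meaningful for CPU ports: the host interface DSA sits on. */
		struct net_device *master;
		/* Meaningful for user ports: the netdev exposed to userspace. */
		struct net_device *slave;
	};
};

/* Accessors that check the port type before reading the union. */
static struct net_device *port_master(const struct dsa_port *dp)
{
	assert(dp->type == PORT_TYPE_CPU);
	return dp->master;
}

static struct net_device *port_slave(const struct dsa_port *dp)
{
	assert(dp->type == PORT_TYPE_USER);
	return dp->slave;
}
```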
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vivien Didelot
      net: dsa: rename dsa_master_get_slave · 2231c43b
      Vivien Didelot authored
      The dsa_master_get_slave is slightly confusing since the idiomatic "get"
      term often suggests reference counting, in symmetry to "put".
      
      Rename it to dsa_master_find_slave to make the look up operation clear.
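      A rough sketch of what a "find" style lookup looks like, assuming a simplified tree of switches indexed by switch and port number (find_slave(), MAX_PORTS and MAX_SWITCHES are hypothetical). It only follows pointers and takes no reference, which is why there is no matching "put":

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 8     /* hypothetical per-switch limit */
#define MAX_SWITCHES 4  /* hypothetical per-tree limit */

struct net_device { const char *name; };  /* hypothetical stand-in */

struct dsa_switch {
	size_t num_ports;
	struct net_device *slaves[MAX_PORTS];  /* NULL if no slave */
};

struct dsa_tree {
	size_t num_switches;
	struct dsa_switch *switches[MAX_SWITCHES];
};

/* Pure lookup: returns a borrowed pointer with no reference taken,
 * so there is nothing to release afterwards. NULL if not found. */
static struct net_device *find_slave(const struct dsa_tree *dst,
				     size_t device, size_t port)
{
	const struct dsa_switch *ds;

	if (device >= dst->num_switches || device >= MAX_SWITCHES)
		return NULL;
	ds = dst->switches[device];
	if (!ds || port >= ds->num_ports || port >= MAX_PORTS)
		return NULL;
	return ds->slaves[port];
}
```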
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vivien Didelot
      net: dsa: add slave to master helper · d0006b00
      Vivien Didelot authored
      Many parts of the DSA slave code need to get the master device
      assigned to a slave device. Remove dsa_master_netdev() in favor of a
      dsa_slave_to_master() helper which does that.
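      A simplified sketch of the pointer chain such a helper walks, assuming the slave device can reach its dsa_port and each port points at its CPU port through cpu_dp, whose master member is the host interface. The structures and slave_to_master() here are hypothetical simplifications, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>

struct net_device;

/* Hypothetical, simplified port: user ports point at their CPU port. */
struct dsa_port {
	struct dsa_port *cpu_dp;    /* CPU port this port is wired to */
	struct net_device *master;  /* set on the CPU port only */
};

/* Hypothetical net_device that can reach its dsa_port directly. */
struct net_device {
	const char *name;
	struct dsa_port *dp;
};

/* slave netdev -> its port -> that port's CPU port -> master netdev */
static struct net_device *slave_to_master(const struct net_device *dev)
{
	const struct dsa_port *dp = dev->dp;

	assert(dp != NULL && dp->cpu_dp != NULL);
	return dp->cpu_dp->master;
}
```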
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vivien Didelot
      net: dsa: add slave to port helper · d945097b
      Vivien Didelot authored
      Many portions of DSA core code need to get the dsa_port structure
      corresponding to a slave net_device. For this purpose, introduce a
      dsa_slave_to_port() helper.
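      The helper mainly hides the private-data indirection. A sketch, assuming a hypothetical net_device whose priv field stands in for netdev_priv() and a cut-down dsa_slave_priv holding a back-pointer to the port:

```c
#include <assert.h>
#include <stddef.h>

struct dsa_port { int index; };

/* Hypothetical private data: a PHY-specific member plus the
 * back-pointer to the port this slave represents. */
struct dsa_slave_priv {
	int phy_addr;
	struct dsa_port *dp;
};

/* Hypothetical netdev; 'priv' stands in for netdev_priv(dev). */
struct net_device {
	const char *name;
	struct dsa_slave_priv *priv;
};

/* Most slave code wants the port, not the PHY details, so the
 * private-data indirection lives in one place. */
static struct dsa_port *slave_to_port(const struct net_device *dev)
{
	assert(dev->priv != NULL);
	return dev->priv->dp;
}
```

      Call sites then read slave_to_port(dev)->index rather than reaching through the PHY-oriented private structure each time.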
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vivien Didelot
      net: dsa: add slave notify helper · 6158eaa7
      Vivien Didelot authored
      Both DSA slave create and destroy functions call call_dsa_notifiers with
      DSA_PORT_REGISTER and DSA_PORT_UNREGISTER respectively, and the same
      dsa_notifier_register_info structure.
      
      Wrap this in a dsa_slave_notify helper to avoid cluttering these
      functions.
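      A sketch of the wrapper pattern, assuming a simplified notifier info structure and a hypothetical call_notifiers() that just records what it receives (the real dsa_notifier_register_info and call_dsa_notifiers() differ). Create and destroy each reduce to one call that differs only in the event:

```c
#include <assert.h>
#include <stddef.h>

/* Event names follow the commit text; the values are illustrative. */
enum { DSA_PORT_REGISTER, DSA_PORT_UNREGISTER };

struct net_device { const char *name; };  /* hypothetical stand-in */

/* Simplified stand-in for dsa_notifier_register_info. */
struct register_info {
	struct net_device *master;
	int port_number;
};

/* Hypothetical notifier hook; records the last call for inspection. */
static unsigned long last_event = (unsigned long)-1;
static struct register_info last_info;

static int call_notifiers(unsigned long val, struct register_info *info)
{
	last_event = val;
	last_info = *info;
	return 0;
}

/* One place builds the info struct; callers only pick the event. */
static int slave_notify(struct net_device *master, int port,
			unsigned long val)
{
	struct register_info info = {
		.master = master,
		.port_number = port,
	};

	return call_notifiers(val, &info);
}
```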
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vivien Didelot
      net: dsa: use port's cpu_dp when creating a slave · a5b930e0
      Vivien Didelot authored
      When dsa_slave_create is called, the related port already has a CPU port
      assigned to it, available in its cpu_dp member. Use it instead of the
      unique tree cpu_dp.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Johannes Berg
      netlink: use NETLINK_CB(in_skb).sk instead of looking it up · a2084f56
      Johannes Berg authored
      When netlink_ack() reports an allocation error to the sending
      socket, there's no need to look up the sending socket since
      it's available in the SKB's CB. Use that instead of going to
      the trouble of looking it up.
      
      Note that the pointer has only been available since Eric Biederman's
      commit 3fbc2905 ("netlink: Make the sending netlink socket availabe in NETLINK_CB"),
      which is far newer than the original lookup code (Oct 2003).
      The field was called 'ssk' in that commit and only got renamed to
      'sk' later; I'd actually argue 'ssk' was better (or perhaps it
      should've been 'source_sk'), since there are so many different
      'sk's involved.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller
      Merge branch 'bpf-cpumap-type-for-XDP_REDIRECT' · 452606d6
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      net: New bpf cpumap type for XDP_REDIRECT
      
      Introducing a new way to redirect XDP frames.  Notice how no driver
      changes are necessary given the design of XDP_REDIRECT.
      
      This redirect map type is called 'cpumap', as it allows redirecting
      XDP frames to remote CPUs.  The remote CPU will do the SKB allocation
      and start the network stack invocation on that CPU.
      
      This is a scalability and isolation mechanism that allows separating
      the early driver network XDP layer from the rest of the netstack, and
      assigning dedicated CPUs for this stage.  The sysadmin controls and
      configures the RX-CPU to NIC-RX queue mapping (as usual) via procfs
      smp_affinity, and how many queues are configured via ethtool
      --set-channels.  Benchmarks show that a single CPU can handle approx
      11Mpps.  Thus, only assigning two NIC RX-queues (and two CPUs) is
      sufficient for handling 10Gbit/s wirespeed at the smallest packet
      size, 14.88Mpps.  Reducing the number of queues has the advantage
      that more packets are "bulk" available per hard interrupt[1].
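      The numbers above can be checked with a little arithmetic. This sketch assumes minimum-size 64-byte Ethernet frames plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap (84 bytes, or 672 bits, on the wire), and takes the ~11Mpps per-CPU figure from the benchmark above; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Minimum Ethernet frame (64 B) plus preamble/SFD (8 B) and
 * inter-frame gap (12 B): bytes occupied on the wire per packet. */
#define MIN_WIRE_BYTES (64 + 8 + 12)

/* Packets per second at line rate with minimum-size frames. */
static uint64_t wirespeed_pps(uint64_t link_bps)
{
	return link_bps / (MIN_WIRE_BYTES * 8);
}

/* Smallest number of CPUs whose combined rate covers the load. */
static uint64_t cpus_needed(uint64_t pps, uint64_t per_cpu_pps)
{
	return (pps + per_cpu_pps - 1) / per_cpu_pps;
}
```

      For a 10Gbit/s link, wirespeed_pps() gives 14,880,952 packets per second, and cpus_needed() rounds 14.88 / 11 up to 2 CPUs.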
      
      [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
      
      Use-cases:
      
      1. End-host based pre-filtering for DDoS mitigation.  This is fast
         enough to allow software to see and filter all packets at wirespeed.
         Thus, no packets get silently dropped by hardware.
      
      2. Given that NIC HW unevenly distributes packets across RX queues, this
         mechanism can be used for redistributing load across CPUs.  This
         usually happens when HW is unaware of a new protocol.  This
         resembles RPS (Receive Packet Steering), just faster, but with more
         responsibility placed on the BPF program for correct steering.
      
      3. Auto-scaling or power saving via only activating the appropriate
         number of remote CPUs for handling the current load.  The cpumap
         tracepoints can function as a feedback loop for this purpose.
      
      In V7, a --stress-mode was implemented for the samples program, which
      adds and removes CPUs from the map between each stats update,
      concurrently with traffic.  I did find and fix some concurrency issues
      in the tear-down path (details in the patch description).  The stress
      test has now been running for 15 hours without any issues, while being
      bombarded with 11.6 Mpps via pktgen_sample04_many_flows.sh.
      
      See individual patches for patchset-version changes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jesper Dangaard Brouer
      samples/bpf: add cpumap sample program xdp_redirect_cpu · fad3917e
      Jesper Dangaard Brouer authored
      This sample program shows how to use cpumap and the associated
      tracepoints.
      
      It provides command line stats, which show how the XDP-RX process,
      cpumap-enqueue and cpumap kthread dequeue are cooperating on a per CPU
      basis.  It also utilizes the xdp_exception and xdp_redirect_err
      tracepoints to allow users to quickly identify setup issues.
      
      One issue with the ixgbe driver is that it resets the link when
      loading XDP.  This resets the procfs smp_affinity settings.  Thus,
      after loading the program, these must be reconfigured.  The easiest
      workaround is to reduce the number of RX-queues to e.g. two via:
      
       # ethtool --set-channels ixgbe1 combined 2
      
      And then add CPUs above 0 and 1, like:
      
       # xdp_redirect_cpu --dev ixgbe1 --prog 2 --cpu 2 --cpu 3 --cpu 4
      
      Another issue with ixgbe is that the page recycle mechanism is tied to
      the RX-ring size, and the default setting of 512 elements is too
      small.  The same issue affects regular devmap XDP_REDIRECT.
      To overcome this I've been using an rx-ring size of 1024:
      
       # ethtool -G ixgbe1 rx 1024 tx 1024
      
      V3:
       - whitespace cleanups
       - bpf tracepoint cannot access top part of struct
      
      V4:
       - report on kthread sched events, according to tracepoint change
       - report average bulk enqueue size
      
      V5:
       - bpf_map_lookup_elem on cpumap not allowed from bpf_prog
         use separate map to mark CPUs not available
      
      V6:
       - correct kthread sched summary output
      
      V7:
       - Added a --stress-mode for concurrently changing underlying cpumap
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jesper Dangaard Brouer
      bpf: cpumap add tracepoints · f9419f7b
      Jesper Dangaard Brouer authored
      This adds two tracepoints to the cpumap.  One for the enqueue side
      trace_xdp_cpumap_enqueue() and one for the kthread dequeue side
      trace_xdp_cpumap_kthread().
      
      To mitigate the tracepoint overhead, these are invoked during the
      enqueue/dequeue bulking phases, thus amortizing the cost.
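      The amortization idea can be sketched as a consumer loop that emits one trace event per batch instead of one per packet. BULK, drain() and trace_dequeue() are hypothetical names, and the per-packet work is a placeholder:

```c
#include <assert.h>
#include <stddef.h>

#define BULK 8  /* hypothetical batch size */

/* Counters standing in for what a tracepoint would report. */
static unsigned int trace_calls;
static unsigned int traced_pkts;

/* Stand-in for the (comparatively expensive) trace hook. */
static void trace_dequeue(unsigned int processed)
{
	trace_calls++;
	traced_pkts += processed;
}

/* Drain 'n' queued items in batches of up to BULK, firing the
 * trace hook once per batch rather than once per packet. */
static void drain(const int *queue, size_t n, long *sum)
{
	size_t i = 0;

	while (i < n) {
		unsigned int processed = 0;

		while (i < n && processed < BULK) {
			*sum += queue[i++];  /* placeholder for packet work */
			processed++;
		}
		trace_dequeue(processed);
	}
}
```

      Draining 20 packets with a batch size of 8 produces three trace events (8 + 8 + 4) rather than 20.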
      
      The obvious use-cases are for debugging and monitoring.  The
      non-intuitive use-case is using these as a feedback loop to know the
      system load.  One can imagine auto-scaling by reducing, adding or
      activating more worker CPUs on demand.
      
      V4: tracepoint remove time_limit info, instead add sched info
      
      V8: intro struct bpf_cpu_map_entry members cpu+map_id in this patch
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jesper Dangaard Brouer
      bpf: cpumap xdp_buff to skb conversion and allocation · 1c601d82
      Jesper Dangaard Brouer authored
      This patch makes cpumap functional, by adding SKB allocation and
      invoking the network stack on the dequeuing CPU.
      
      For constructing the SKB on the remote CPU, the xdp_buff is converted
      into a struct xdp_pkt, which is mapped into the top headroom of the
      packet, to avoid allocating separate memory.  For now, struct xdp_pkt
      is just a cpumap internal data structure, with info carried from
      enqueue to dequeue.
      
      If a driver doesn't have enough headroom, the packet is simply dropped,
      with return code -EOVERFLOW.  This will be picked up by the xdp
      tracepoint infrastructure, to allow users to catch this.
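      A userspace sketch of the headroom trick, assuming simplified xdp_buff and xdp_pkt layouts (the *_sketch types and both helpers are hypothetical, not the kernel definitions). The enqueue side writes the metadata at the start of the headroom, or fails with -EOVERFLOW; the dequeue side copies it back out on the remote CPU:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified xdp_buff: one buffer with headroom before 'data'. */
struct xdp_buff_sketch {
	unsigned char *data_hard_start;
	unsigned char *data;
	unsigned char *data_end;
};

/* Hypothetical metadata carried from the enqueue to the dequeue CPU. */
struct xdp_pkt_sketch {
	void *data;
	unsigned short len;
	unsigned short headroom;  /* room left after the metadata */
};

/* Enqueue side: store packet info inside the headroom itself, so no
 * separate allocation is needed. Too little headroom means a drop. */
static int xdp_to_pkt(struct xdp_buff_sketch *xdp)
{
	size_t headroom = (size_t)(xdp->data - xdp->data_hard_start);
	struct xdp_pkt_sketch pkt;

	if (headroom < sizeof(pkt))
		return -EOVERFLOW;

	pkt.data = xdp->data;
	pkt.len = (unsigned short)(xdp->data_end - xdp->data);
	pkt.headroom = (unsigned short)(headroom - sizeof(pkt));
	memcpy(xdp->data_hard_start, &pkt, sizeof(pkt));
	return 0;
}

/* Dequeue side: recover the metadata from the top of the headroom. */
static void pkt_from_headroom(const unsigned char *hard_start,
			      struct xdp_pkt_sketch *out)
{
	memcpy(out, hard_start, sizeof(*out));
}
```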
      
      V2: take into account xdp->data_meta
      
      V4:
       - Drop busypoll tricks, keeping it more simple.
       - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei
      
      V5: correct RCU read protection around __netif_receive_skb_core.
      
      V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jesper Dangaard Brouer
      bpf: XDP_REDIRECT enable use of cpumap · 9c270af3
      Jesper Dangaard Brouer authored
      This patch connects cpumap to the xdp_do_redirect_map infrastructure.
      
      No SKB allocation is done yet.  The XDP frames are transferred
      to the other CPU, but they are simply refcnt decremented on the remote
      CPU.  This served as a good benchmark for measuring the overhead of
      remote refcnt decrement.  If the driver page recycle cache is not
      efficient, this exposes a bottleneck in the page allocator.
      
      A shout-out to MST's ptr_ring, which is the secret behind it being so
      efficient at transferring memory pointers between CPUs, without
      constantly bouncing cache-lines between CPUs.
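      The core idea that avoids cache-line bouncing, as we understand it, can be modeled as a ring where an empty slot is marked by NULL: the producer and consumer each own their own index and coordinate only through the slot contents. This is a simplified single-producer/single-consumer sketch, not ptr_ring's implementation (which, among other things, has per-side locking and batched consumption); RING_SIZE and the function names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 4  /* hypothetical, small for illustration */

/* A slot holding NULL is empty. Only the producer touches 'prod' and
 * only the consumer touches 'cons', so no shared head/tail index has
 * to bounce between the two CPUs' caches. */
struct ring {
	_Atomic(void *) slot[RING_SIZE];
	size_t prod;  /* written by the producer only */
	size_t cons;  /* written by the consumer only */
};

/* Returns 0 on success, -1 if the ring is full. ptr must be non-NULL. */
static int ring_produce(struct ring *r, void *ptr)
{
	assert(ptr != NULL);
	if (atomic_load_explicit(&r->slot[r->prod], memory_order_acquire))
		return -1;  /* consumer has not emptied this slot yet */
	atomic_store_explicit(&r->slot[r->prod], ptr, memory_order_release);
	r->prod = (r->prod + 1) % RING_SIZE;
	return 0;
}

/* Returns the next pointer, or NULL if the ring is empty. */
static void *ring_consume(struct ring *r)
{
	void *ptr = atomic_load_explicit(&r->slot[r->cons],
					 memory_order_acquire);

	if (!ptr)
		return NULL;
	/* Hand the slot back to the producer by clearing it. */
	atomic_store_explicit(&r->slot[r->cons], NULL, memory_order_release);
	r->cons = (r->cons + 1) % RING_SIZE;
	return ptr;
}
```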
      
      V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.
      
      V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet,
       as the implementation requires a separate upstream discussion.
      
      V5:
       - Fix a maybe-uninitialized pointed out by kbuild test robot.
       - Restrict bpf-prog side access to cpumap, open when use-cases appear
       - Implement cpu_map_enqueue() as a more simple void pointer enqueue
      
      V6:
       - Allow cpumap type for usage in helper bpf_redirect_map,
         general bpf-prog side restriction moved to earlier patch.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jesper Dangaard Brouer
      bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112
      Jesper Dangaard Brouer authored
      The 'cpumap' is primarily used as a backend map for XDP BPF helper
      call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
      
      This patch implements the main part of the map.  It is not connected to
      the XDP redirect system yet, and no SKB allocation is done yet.
      
      The main concern in this patch is to ensure the datapath can run
      without any locking.  This adds complexity to the setup and tear-down
      procedure, whose assumptions are extra carefully documented in the
      code comments.
      
      V2:
       - make sure array isn't larger than NR_CPUS
       - make sure CPUs added are valid possible CPUs
      
      V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
      
      V5:
       - Restrict map allocation to root / CAP_SYS_ADMIN
       - WARN_ON_ONCE if queue is not empty on tear-down
       - Return -EPERM on memlock limit instead of -ENOMEM
       - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
       - Moved cpu_map_enqueue() to next patch
      
      V6: all noticed by Daniel Borkmann
       - Fix err return code in cpu_map_alloc() introduced in V5
       - Move cpu_possible() check after max_entries boundary check
       - Forbid usage initially in check_map_func_compatibility()
      
      V7:
       - Fix alloc error path spotted by Daniel Borkmann
       - Did stress test adding+removing CPUs from the map concurrently
       - Fixed refcnt issue on cpu_map_entry, kthread started too soon
       - Make sure packets are flushed during tear-down; this involved use of
         rcu_barrier() and making the kthread (kthread_run) only exit after
         the queue is empty
       - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
      
      V8:
       - Nitpicking comments and grammar by Edward Cree
       - Fix missing semi-colon introduced in V7 due to rebasing
       - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
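      The V2 and V6 checks above can be sketched as two validation steps, assuming a hypothetical NR_CPUS_SKETCH limit and a boolean array standing in for cpu_possible(). The error codes are illustrative. Note the ordering: the bounds check runs first, so the key is known to be a safe index before the possible-CPU test uses it:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 64  /* hypothetical compile-time CPU limit */

struct cpu_map_sketch {
	unsigned int max_entries;
	const bool *possible;  /* stand-in for cpu_possible() */
};

/* Map allocation: the array must not be larger than NR_CPUS. */
static int cpu_map_alloc_check(unsigned int max_entries)
{
	if (max_entries == 0 || max_entries > NR_CPUS_SKETCH)
		return -EINVAL;
	return 0;
}

/* Map update: boundary check first, then the possible-CPU check. */
static int cpu_map_update_check(const struct cpu_map_sketch *map,
				unsigned int key)
{
	if (key >= map->max_entries)
		return -E2BIG;
	if (!map->possible[key])
		return -ENODEV;
	return 0;
}
```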
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Joel Stanley
      net: ftgmac100: Request clock and set speed · 4b70c62b
      Joel Stanley authored
      According to the ASPEED datasheet, gigabit speeds require a clock of
      100MHz or higher. Other speeds require 25MHz or higher. This patch
      configures a 100MHz clock if the system has a direct-attached
      PHY, or 25MHz if the system is running NC-SI, which is limited to 100Mbit/s.
      
      There appear to be no other upstream users of the FTGMAC100 driver, so
      it is hard to know the clocking requirements of other platforms.
      Therefore a conservative approach was taken with enabling clocks: if
      the platform is not ASPEED, both requesting the clock and configuring
      the speed are skipped.
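      The selection rule reduces to a small decision function. A sketch only: the function name and boolean parameters are hypothetical, and the real driver talks to the kernel clock API, which is not modeled here:

```c
#include <assert.h>
#include <stdbool.h>

#define MHZ 1000000UL

/* Returns the clock rate to request in Hz, or 0 when clock setup
 * should be skipped entirely. */
static unsigned long ftgmac100_clk_rate_sketch(bool is_aspeed, bool use_ncsi)
{
	if (!is_aspeed)
		return 0;         /* conservative: leave other platforms alone */
	if (use_ncsi)
		return 25 * MHZ;  /* NC-SI tops out at 100Mbit/s */
	return 100 * MHZ;         /* direct-attached PHY may run at gigabit */
}
```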
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Tested-by: Andrew Jeffery <andrew@aj.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 17 Oct, 2017 1 commit
  3. 16 Oct, 2017 23 commits