1. 24 Jan, 2017 23 commits
    • bpf: enable verifier to better track const alu ops · 3fadc801
      Daniel Borkmann authored
      William reported a couple of issues in relation to direct packet
      access. Typical scheme is to check for data + [off] <= data_end,
      where [off] can be either immediate or coming from a tracked
      register that contains an immediate, depending on the branch, we
      can then access the data. However, in case of calculating [off]
      for either the mentioned test itself or for access after the test
      in a more "complex" way, then the verifier will stop tracking the
      CONST_IMM marked register and will mark it as UNKNOWN_VALUE one.
      
      Adding that UNKNOWN_VALUE typed register to a pkt() marked
      register, the verifier then bails out in check_packet_ptr_add()
      as it finds the register's imm value below 48. In the first example
      below, that is due to evaluate_reg_imm_alu() not handling right
      shifts and thus marking the register as UNKNOWN_VALUE via helper
      __mark_reg_unknown_value() that resets imm to 0.
      
      In the second case the same happens at the time when r4 is set
      to r4 &= r5, where it transitions to UNKNOWN_VALUE from
      evaluate_reg_imm_alu(). Later on r4 we shift right by 3 inside
      evaluate_reg_alu(), where the register's imm turns into 3. That
      is, for registers with type UNKNOWN_VALUE, imm of 0 means that
      we don't know what value the register has, and for imm > 0 it
      means that the value has [imm] upper zero bits. F.e. when shifting
      an UNKNOWN_VALUE register by 3 to the right, no matter what value
      it had, we know that the 3 upper most bits must be zero now.
      This is to make sure that ALU operations with unknown registers
      don't overflow. Meaning, once we know that we have more than 48
      upper zero bits, or, in other words cannot go beyond 0xffff offset
      with ALU ops, such an addition will track the target register
      as a new pkt() register with a new id, but 0 offset and 0 range,
      so for that a new data/data_end test will be required. If the source
      register to be added to the pkt() register is a CONST_IMM one, or
      the source instruction is an add instruction with an immediate
      value, then it will get added if it stays within the max 0xffff bounds.
      From there, the pkt() type can be accessed should reg->off + imm be
      within the access range of pkt().
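
      The upper-zero-bits rule described above can be sketched as follows.
      This is a simplified userspace model, not the verifier's code; the
      function names are hypothetical, and only the two facts stated in the
      text are modelled: a right shift adds known upper zero bits, and at
      least 48 of them are needed before adding to a pkt() register.

```c
#include <stdbool.h>

/* Hypothetical model: 'known' is the number of upper bits of a 64-bit
 * register of unknown value that are known to be zero (0 = nothing known).
 * A logical right shift by 'shift' bits adds that many zero bits on top. */
static unsigned int upper_zero_bits_after_rsh(unsigned int known,
					      unsigned int shift)
{
	unsigned int bits = known + shift;

	return bits > 64 ? 64 : bits;
}

/* An unknown value may only be added to a packet pointer when it cannot
 * exceed 0xffff, i.e. when at least 48 of its 64 bits are known zero. */
static bool may_add_to_pkt(unsigned int upper_zero_bits)
{
	return upper_zero_bits >= 48;
}
```

      With this model, an unknown register shifted right by 3 has 3 known
      upper zero bits, which is far below 48, matching the rejection in the
      second trace below.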
      
        [...]
        from 28 to 30: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=22) R2=pkt_end
          R3=imm144,min_value=144,max_value=144
          R4=imm0,min_value=0,max_value=0
          R5=inv48,min_value=2054,max_value=2054 R10=fp
        30: (bf) r5 = r3
        31: (07) r5 += 23
        32: (77) r5 >>= 3
        33: (bf) r6 = r1
        34: (0f) r6 += r5
        cannot add integer value with 0 upper zero bits to ptr_to_packet
      
        [...]
        from 52 to 80: R0=imm1,min_value=1,max_value=1
          R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
          R4=imm272 R5=inv56,min_value=17,max_value=17
          R6=pkt(id=0,off=26,r=34) R10=fp
        80: (07) r4 += 71
        81: (18) r5 = 0xfffffff8
        83: (5f) r4 &= r5
        84: (77) r4 >>= 3
        85: (0f) r1 += r4
        cannot add integer value with 3 upper zero bits to ptr_to_packet
      
      Thus to get above use-cases working, evaluate_reg_imm_alu() has
      been extended for further ALU ops. This is fine, because we only
      operate strictly within realm of CONST_IMM types, so here we don't
      care about overflows as they will happen in the simulated but also
      real execution and interaction with pkt() in check_packet_ptr_add()
      will check actual imm value once added to pkt(), but it's irrelevant
      before.
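
      As an illustration of what tracking constants through these ALU ops
      buys, the two traces above can be folded by hand. The sketch below is
      a userspace model with hypothetical names (fold_const_alu and the
      enum are not kernel symbols), and it assumes plain 64-bit wrapping
      arithmetic; the set of ops the verifier actually folds may differ.

```c
#include <stdint.h>

enum alu_op { ALU_ADD, ALU_AND, ALU_RSH };

/* Fold one ALU operation on two known constants. Overflow simply wraps,
 * just as it would in the real execution of the instruction. */
static uint64_t fold_const_alu(enum alu_op op, uint64_t dst, uint64_t src)
{
	switch (op) {
	case ALU_ADD:
		return dst + src;
	case ALU_AND:
		return dst & src;
	case ALU_RSH:
		return dst >> (src & 63);
	}
	return dst;
}

/* First trace: r5 = r3 (144); r5 += 23; r5 >>= 3 */
static uint64_t first_example(void)
{
	uint64_t r5 = 144;

	r5 = fold_const_alu(ALU_ADD, r5, 23);
	return fold_const_alu(ALU_RSH, r5, 3);
}

/* Second trace: r4 = 272; r4 += 71; r5 = 0xfffffff8; r4 &= r5; r4 >>= 3 */
static uint64_t second_example(void)
{
	uint64_t r4 = 272, r5 = 0xfffffff8;

	r4 = fold_const_alu(ALU_ADD, r4, 71);
	r4 = fold_const_alu(ALU_AND, r4, r5);
	return fold_const_alu(ALU_RSH, r4, 3);
}
```

      Both results (20 and 42) are small constants, so once the register
      stays a known constant the later addition to pkt() can be checked
      against the actual value.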
      
      With regards to 06c1c049 ("bpf: allow helpers access to variable
      memory") that works on UNKNOWN_VALUE registers, the verifier becomes
      now a bit smarter as it can better resolve ALU ops, so we need to
      adapt two test cases there, as min/max bound tracking only becomes
      necessary when registers were spilled to stack. So while mask was
      set before to track upper bound for UNKNOWN_VALUE case, it's now
      resolved directly as CONST_IMM, and such constructs are only necessary
      when f.e. registers are spilled.
      
      For commit 6b173873 ("bpf: recognize 64bit immediate loads as
      consts") that initially enabled dw load tracking only for nfp jit/
      analyzer, I did a couple of tests on large, complex programs and we
      don't increase complexity badly (my tests were in ~3% range on avg).
      I've added a couple of tests similar to affected code above, and
      it works fine with verifier now.
      Reported-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Gianluca Borello <g.borello@gmail.com>
      Cc: William Tu <u9012063@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add prog tag test case to bpf selftests · 62b64660
      Daniel Borkmann authored
      Add the test case used to compare the results from fdinfo with
      af_alg's output on the tag. Tests are from min to max sized
      programs, with and without maps included.
      
        # ./test_tag
        test_tag: OK (40945 tests)
      
      Tested on x86_64 and s390x.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow option for setting bpf_l4_csum_replace from scratch · d1b662ad
      Daniel Borkmann authored
      When programs need to calculate the csum from scratch for small UDP
      packets and use bpf_l4_csum_replace() to feed the result from helpers
      like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0
      that would ignore the case of current csum being 0, and which would
      still allow for the helper to set the csum and transform when needed
      to CSUM_MANGLED_0.
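
      The situation can be modelled in a few lines. This is a userspace
      sketch under stated assumptions, not the helper's implementation: the
      function names and the two boolean parameters are hypothetical
      stand-ins for the flags, and 0xffff is the on-wire value used when a
      computed UDP checksum comes out as 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement sum into a 16-bit checksum. */
static uint16_t csum_fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* UDP reserves checksum 0 for "no checksum", so a computed value of 0
 * is transmitted as 0xffff instead (the mangled-zero form). */
static uint16_t udp_csum_finalize(uint16_t csum)
{
	return csum ? csum : 0xffff;
}

/* Hypothetical model of the decision described above: with only the
 * mangled-zero flag, a current checksum of 0 means "leave it alone";
 * the additional flag forces the update even in that case. */
static bool helper_skips_update(uint16_t cur_csum, bool mark_mangled_0,
				bool force)
{
	return mark_mangled_0 && !force && cur_csum == 0;
}
```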
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: enable load bytes helper for filter/reuseport progs · 2492d3b8
      Daniel Borkmann authored
      BPF_PROG_TYPE_SOCKET_FILTER are used in various facilities such as
      for SO_REUSEPORT and packet fanout demuxing, packet filtering, kcm,
      etc, and yet the only facility they can use is BPF_LD with {BPF_ABS,
      BPF_IND} for single byte/half/word access.
      
      Direct packet access is only restricted to tc programs right now,
      but we can still facilitate usage by allowing skb_load_bytes() helper
      added back then in 05c74e5e ("bpf: add bpf_skb_load_bytes helper")
      that calls skb_header_pointer() similarly to bpf_load_pointer(), but
      for stack buffers with larger access size.
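
      Conceptually, such a helper is a bounds-checked copy from the packet
      into a caller-provided buffer. The following is a userspace sketch
      over a linear buffer only; load_bytes() is a hypothetical name, and
      the real helper additionally has to deal with non-linear skbs.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy 'len' bytes starting at offset 'off' of a linear packet buffer
 * into 'to'. Returns 0 on success or -EFAULT when the requested range
 * does not lie fully inside the packet. */
static int load_bytes(const uint8_t *pkt, size_t pkt_len, size_t off,
		      void *to, size_t len)
{
	if (off > pkt_len || len > pkt_len - off)
		return -EFAULT;

	memcpy(to, pkt + off, len);
	return 0;
}
```

      Unlike BPF_LD with BPF_ABS or BPF_IND, one call can fetch an
      arbitrary number of bytes into a stack buffer.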
      
      Name the previous sk_filter_func_proto() as bpf_base_func_proto()
      since this is used everywhere else as well, similarly for the ctx
      converter, that is, bpf_convert_ctx_access().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: simplify __is_valid_access test on cb · 4faf940d
      Daniel Borkmann authored
      The __is_valid_access() test for cb[] from 62c7989b ("bpf: allow
      b/h/w/dw access for bpf's cb in ctx") was unnecessarily complex;
      we can just simplify it the same way as recent fix from 2d071c64
      ("bpf, trace: make ctx access checks more robust") did. Overflow can
      never happen as size is 1/2/4/8 depending on access.
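
      The shape of such a simplified check is roughly the following. The
      offsets CB_START and CB_END are illustrative placeholders rather than
      values taken from the patch, and cb_access_ok() is a hypothetical
      name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder byte offsets of the cb[] area inside the context struct. */
#define CB_START 48u
#define CB_END   68u	/* one past the last byte of cb[] */

/* 'size' is 1, 2, 4 or 8 depending on the access width, so off + size
 * cannot wrap around for any offset that points into a small struct. */
static bool cb_access_ok(uint32_t off, uint32_t size)
{
	return off >= CB_START && off + size <= CB_END;
}
```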
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • phy: marvell: remove conflicting initializer · 18702414
      Arnd Bergmann authored
      One line was apparently pasted incorrectly during a new feature patch:
      
      drivers/net/phy/marvell.c:2090:15: error: initialized field overwritten [-Werror=override-init]
         .features = PHY_GBIT_FEATURES,
      
      I'm removing the extraneous line here to avoid the W=1 warning and restore
      the previous flags value, and I'm slightly reordering the lines for consistency
      to make it less likely to happen again in the future. The ordering in the
      array is still not the same as in the structure definition, instead I picked
      the order that is most common in this file and that seems to make more sense
      here.
      
      Fixes: 0b04680f ("phy: marvell: Add support for temperature sensor")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dummy: Introduce dummy virtual functions · e1636836
      Phil Sutter authored
      The idea for this was born when testing VF support in iproute2 which was
      impeded by hardware requirements. In fact, not every VF-capable hardware
      driver implements all netdev ops, so testing the interface is still hard
      to do even with a well-sorted hardware shelf.
      
      To overcome this and allow for testing the user-kernel interface, this
      patch allows to turn dummy into a PF with a configurable amount of VFs.
      
      Since my patch series 'bus-agnostic-num-vf' has been accepted,
      implementing the required interfaces is pretty straightforward: Iff
      'num_vfs' module parameter was given a value >0, a dummy bus type is
      being registered which implements the 'num_vf()' callback. Additionally,
      a dummy parent device common to all dummy devices is registered which
      sits on the above dummy bus.
      
      Joint work with Sabrina Dubroca.
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: broadcom: bnx2x: use new api ethtool_{get|set}_link_ksettings · 8b86b2c1
      Philippe Reynes authored
      The ethtool api {get|set}_settings is deprecated.
      We move this driver to new api {get|set}_link_ksettings.
      
      As I don't have the hardware, I'd be very pleased if
      someone may test this patch.
      Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
      Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'packet-sampling-offload' · 36f87780
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      Add support for offloading packet-sampling
      
      Yotam says:
      
      The first patch introduces the psample module, a netlink channel dedicated
      to packet sampling implemented using generic netlink. This module provides
      a generic way for kernel modules to sample packets, while not being tied
      to any specific subsystem like NFLOG.
      
      The second patch adds the sample tc action, which uses psample to randomly
      sample packets that match a classifier. The user can configure the psample
      group number, the sampling rate and the packet's truncation (to save
      kernel-user traffic).
      
      The last two patches add the support for offloading the matchall-sample
      tc command in the mlxsw driver, for ingress qdiscs.
      
      An example for psample usage can be found in the libpsample project at:
      https://github.com/Mellanox/libpsample
      
      v1->v2:
      - Reword first patch's commit message
      - Fix typo in comment in second patch
      - Change order of tc_sample uapi enum to match convention
      - Rename act_sample action callback tcf_sample -> tcf_sample_act
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Add packet sample offloading support · 98d0f7b9
      Yotam Gigi authored
      Using the MPSC register, add the functions that configure port-based
      packet sampling in hardware and the necessary datatypes in the
      mlxsw_sp_port struct. In addition, add the necessary trap for sampled
      packets and integrate with matchall offloading to allow offloading of the
      sample tc action.
      
      The current offload support is for the tc command:
      
      tc filter add dev <DEV> parent ffff: \
      	  matchall skip_sw \
      	  action sample rate <RATE> group <GROUP> [trunc <SIZE>]
      
      Where only ingress qdiscs are supported, and only a combination of
      matchall classifier and sample action will lead to activating hardware
      packet sampling.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: add the Monitoring Packet Sampling Configuration Register · 0677d682
      Yotam Gigi authored
      The MPSC register allows configuring ingress packet sampling on a specific
      port of the mlxsw device. The sampled packets are then trapped via the
      PKT_SAMPLE trap.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: Introduce sample tc action · 5c5670fa
      Yotam Gigi authored
      This action allows the user to sample traffic matched by tc classifier.
      The sampling consists of choosing packets randomly and sampling them using
      the psample module. The user can configure the psample group number, the
      sampling rate and the packet's truncation (to save kernel-user traffic).
      
      Example:
      To sample ingress traffic from interface eth1, one may use the commands:
      
      tc qdisc add dev eth1 handle ffff: ingress
      
      tc filter add dev eth1 parent ffff: \
      	   matchall action sample rate 12 group 4
      
      Where the first command adds an ingress qdisc and the second starts
      sampling randomly with an average of one sampled packet per 12 packets on
      dev eth1 to psample group 4.
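
      One simple way to obtain "one packet in N on average" is to draw a
      random number per packet and sample when it is divisible by the rate.
      The sketch below assumes that approach for illustration; the function
      name is hypothetical and the random number is passed in so that the
      behaviour is deterministic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether to sample a packet, given a uniformly distributed
 * random number 'rnd' and a configured 'rate' (1 in 'rate' packets).
 * A rate of 0 is treated as "never sample" to avoid dividing by zero. */
static bool should_sample(uint32_t rnd, uint32_t rate)
{
	return rate != 0 && rnd % rate == 0;
}
```

      With rate 12, as in the example command, roughly every twelfth
      random draw satisfies the condition.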
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce psample, a new genetlink channel for packet sampling · 6ae0a628
      Yotam Gigi authored
      Add a general way for kernel modules to sample packets, without being tied
      to any specific subsystem. This netlink channel can be used by tc,
      iptables, etc. and allows standardizing packet sampling in the kernel.
      
      For every sampled packet, the psample module adds the following metadata
      fields:
      
      PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable
      
      PSAMPLE_ATTR_OIFINDEX - the packet's output ifindex, if applicable
      
      PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
         truncated during sampling
      
      PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
         user who initiated the sampling. This field allows the user to
         differentiate between several samplers working simultaneously and
         filter packets relevant to him
      
      PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
         sequence is kept for each group
      
      PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets
      
      PSAMPLE_ATTR_DATA - the actual packet bits
      
      The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
      group. In addition, add the GET_GROUPS netlink command which allows the
      user to see the current sample groups, their refcount and sequence number.
      This command currently supports only netlink dump mode.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mdio_module_driver-misc' · d36db83b
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: couple mdio_module_driver changes
      
      Small patch series fixing a comment for mdio_module_driver and
      finally utilizing it in b53_mdio.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: b53: Utilize mdio_module_driver · 8a180cc7
      Florian Fainelli authored
      Eliminate a bit of boilerplate code.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: Fix typo for MDIO module boilerplate comment · b70f43a1
      Florian Fainelli authored
      The module boilerplate macro is named mdio_module_driver and not
      module_mdio_driver, fix that.
      
      Fixes: a9049e0c ("mdio: Add support for mdio drivers.")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'stmmac-dwmac-meson8b-configurable-RGMII-TX-delay' · dd8e01fb
      David S. Miller authored
      Martin Blumenstingl says:
      
      ====================
      stmmac: dwmac-meson8b: configurable RGMII TX delay
      
      Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
      cycle (= 2ns) TX clock delay. This seems to work fine for many boards
      (for example Odroid-C2 or Amlogic's reference boards) but there are
      some others where TX traffic is simply broken.
      There are probably multiple reasons why it's working on some boards
      while it's broken on others:
      - some of Amlogic's reference boards are using a Micrel PHY
      - hardware circuit design
      - maybe more...
      
      iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with
      TX clock delay disabled on the MAC (as it's enabled in the PHY driver).
      TX throughput was virtually zero before:
      $ iperf3 -c 192.168.1.100 -R
      Connecting to host 192.168.1.100, port 5201
      Reverse mode, remote host 192.168.1.100 is sending
      [  4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   108 MBytes   901 Mbits/sec
      [  4]   1.00-2.00   sec  94.2 MBytes   791 Mbits/sec
      [  4]   2.00-3.00   sec  96.5 MBytes   810 Mbits/sec
      [  4]   3.00-4.00   sec  96.2 MBytes   808 Mbits/sec
      [  4]   4.00-5.00   sec  96.6 MBytes   810 Mbits/sec
      [  4]   5.00-6.00   sec  96.5 MBytes   810 Mbits/sec
      [  4]   6.00-7.00   sec  96.6 MBytes   810 Mbits/sec
      [  4]   7.00-8.00   sec  96.5 MBytes   809 Mbits/sec
      [  4]   8.00-9.00   sec   105 MBytes   884 Mbits/sec
      [  4]   9.00-10.00  sec   111 MBytes   934 Mbits/sec
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bandwidth       Retr
      [  4]   0.00-10.00  sec  1000 MBytes   839 Mbits/sec    0             sender
      [  4]   0.00-10.00  sec   998 MBytes   837 Mbits/sec                  receiver
      
      iperf Done.
      $ iperf3 -c 192.168.1.100
      Connecting to host 192.168.1.100, port 5201
      [  4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [  4]   0.00-1.01   sec  99.5 MBytes   829 Mbits/sec  117    139 KBytes
      [  4]   1.01-2.00   sec   105 MBytes   884 Mbits/sec  129   70.7 KBytes
      [  4]   2.00-3.01   sec   107 MBytes   889 Mbits/sec  106    187 KBytes
      [  4]   3.01-4.01   sec   105 MBytes   878 Mbits/sec   92    143 KBytes
      [  4]   4.01-5.00   sec   105 MBytes   882 Mbits/sec  140    129 KBytes
      [  4]   5.00-6.01   sec   106 MBytes   883 Mbits/sec  115    195 KBytes
      [  4]   6.01-7.00   sec   102 MBytes   863 Mbits/sec  133   70.7 KBytes
      [  4]   7.00-8.01   sec   106 MBytes   884 Mbits/sec  143   97.6 KBytes
      [  4]   8.01-9.01   sec   104 MBytes   875 Mbits/sec  124    107 KBytes
      [  4]   9.01-10.01  sec   105 MBytes   876 Mbits/sec   90    139 KBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bandwidth       Retr
      [  4]   0.00-10.01  sec  1.02 GBytes   874 Mbits/sec  1189             sender
      [  4]   0.00-10.01  sec  1.02 GBytes   873 Mbits/sec                  receiver
      
      iperf Done.
      
      I get similar TX throughput on my Meson GXBB "MXQ Pro+" board when I
      disable the PHY's TX-delay and configure a 4ns TX-delay on the MAC.
      So changes to at least the RTL8211F PHY driver are needed to get it
      working properly in all situations.
      
      Changes since v4:
      - add a fallback of 2ns (the value which was previously hardcoded) for
        the TX delay so we are backwards-compatible with older .dts'
      - update the documentation with the new fallback value and add a small
        note that the "amlogic,tx-delay" property is ignored when the phy-mode
        is "rmii".
      
      Changes since v3:
      - rebased to apply against current net-next branch (fixes a conflict
        with d2ed0a77 "net: ethernet: stmmac: fix of-node and
        fixed-link-phydev leaks")
      
      Changes since v2:
      - moved all .dts patches (3-7) to a separate series
      - removed the default 2ns TX delay when phy-mode RGMII is specified
      - (rebased against current net-next)
      
      Changes since v1:
      - renamed the devicetree property "amlogic,tx-delay" to
        "amlogic,tx-delay-ns", which makes the .dts easier to read as we can
        simply specify human-readable values instead of having "preprocessor
        defines and calculation in human brain". Thanks to Andrew Lunn for
        the suggestion!
      - improved documentation to indicate when the MAC TX-delay should be
        configured and how to use the PHY's TX-delay
      - changed the default TX-delay in the dwmac-meson8b driver from 2ns
        to 0ns when any of the rgmii-*id modes are used (the 2ns default
        value still applies for phy-mode "rgmii")
      - added patches to properly reset the PHY on Meson GXBB devices and to
        use a similar configuration than the one we use on Meson GXL devices
        (by passing a phy-handle to stmmac and defining the PHY in the mdio0
        bus - patch 3-6)
      - add the "amlogic,tx-delay-ns" property to all boards which are using
        the RGMII PHY (patch 7)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable · b765234e
      Martin Blumenstingl authored
      Prior to this patch we were using a hardcoded RGMII TX clock delay of
      2ns (= 1/4 cycle of the 125MHz RGMII TX clock). This value works for
      many boards, but unfortunately not for all (due to the way the actual
      circuit is designed, sometimes because the TX delay is enabled in the
      PHY, etc.). Making the TX delay on the MAC side configurable allows us
      to support all possible hardware combinations.
      
      This allows fixing a compatibility issue on some boards, where the
      RTL8211F PHY is configured to generate the TX delay. We can now turn
      off the TX delay in the MAC, because otherwise we would be applying the
      delay twice (which results in non-working TX traffic).
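
      The arithmetic behind "2ns = 1/4 cycle" can be spelled out as a small
      sketch. It assumes, for illustration only, that the delay is
      programmed in whole quarter cycles with at most three of them (0, 2,
      4 or 6 ns); the function names are hypothetical.

```c
#include <stdint.h>

#define RGMII_TX_CLK_HZ 125000000u

/* Period of the 125 MHz RGMII TX clock in nanoseconds. */
static uint32_t rgmii_tx_clk_period_ns(void)
{
	return 1000000000u / RGMII_TX_CLK_HZ;
}

/* Convert a TX delay in ns into a number of quarter cycles. Returns -1
 * when the delay is not a whole number of quarter cycles or exceeds the
 * assumed maximum of three quarter cycles. */
static int tx_delay_quarter_cycles(uint32_t delay_ns)
{
	uint32_t quarter_ns = rgmii_tx_clk_period_ns() / 4;

	if (delay_ns % quarter_ns || delay_ns / quarter_ns > 3)
		return -1;

	return (int)(delay_ns / quarter_ns);
}
```

      A 125 MHz clock has an 8 ns period, so the previously hardcoded 2 ns
      is exactly one quarter cycle, and 0 disables the MAC-side delay.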
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Tested-by: Neil Armstrong <narmstrong@baylibre.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac · d5490f1f
      Martin Blumenstingl authored
      This allows configuring the RGMII TX clock delay. The RGMII clock is
      generated by underlying hardware of the Meson 8b / GXBB DWMAC glue.
      The configuration depends on the actual hardware (no delay may be
      needed due to the design of the actual circuit, the PHY might add this
      delay, etc.).
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Tested-by: Neil Armstrong <narmstrong@baylibre.com>
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Fix inverted test for multiple CPU interface · 23e3d618
      Andrew Lunn authored
      Remove the wrong !, otherwise we get false positives about having
      multiple CPU interfaces.
      
      Fixes: b22de490 ("net: dsa: store CPU switch structure in the tree")
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: multicast to unicast · 6db6f0ea
      Felix Fietkau authored
      Implements an optional, per bridge port flag and feature to deliver
      multicast packets to any host on the according port via unicast
      individually. This is done by copying the packet per host and
      changing the multicast destination MAC to a unicast one accordingly.
      
      multicast-to-unicast works on top of the multicast snooping feature of
      the bridge. Which means unicast copies are only delivered to hosts which
      are interested in it and signaled this via IGMP/MLD reports
      previously.
      
      This feature is intended for interface types which have a more reliable
      and/or efficient way to deliver unicast packets than broadcast ones
      (e.g. wifi).
      
      However, it should only be enabled on interfaces where no IGMPv2/MLDv1
      report suppression takes place. This feature is disabled by default.
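
      The core transformation, copying the frame once per interested host
      and rewriting the destination MAC, can be sketched in userspace as
      follows. The struct and function names are hypothetical, and the
      sketch ignores everything the bridge does around it (snooping state,
      skb handling, port flags).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

struct mac {
	uint8_t addr[ETH_ALEN];
};

/* Write one copy of 'frame' per host into 'out' (which must hold
 * n_hosts * frame_len bytes), replacing the destination MAC, i.e. the
 * first six bytes of the Ethernet header, with the host's unicast MAC.
 * Returns the number of copies written. */
static size_t mcast_to_ucast(const uint8_t *frame, size_t frame_len,
			     const struct mac *hosts, size_t n_hosts,
			     uint8_t *out)
{
	size_t i;

	if (frame_len < ETH_ALEN)
		return 0;

	for (i = 0; i < n_hosts; i++) {
		uint8_t *copy = out + i * frame_len;

		memcpy(copy, frame, frame_len);
		memcpy(copy, hosts[i].addr, ETH_ALEN);
	}
	return n_hosts;
}
```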
      
      The initial patch and idea is from Felix Fietkau.
      Signed-off-by: Felix Fietkau <nbd@nbd.name>
      [linus.luessing@c0d3.blue: various bug + style fixes, commit message]
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Introduce a sysctl that modifies the value of PROT_SOCK. · 4548b683
      Krister Johansen authored
      Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
      that denotes the first unprivileged inet port in the namespace.  To
      disable all privileged ports set this to zero.  It also checks for
      overlap with the local port range.  The privileged and local range may
      not overlap.
      
      The use case for this change is to allow containerized processes to bind
      to privileged ports, but prevent them from ever being allowed to modify
      their container's network configuration.  The latter is accomplished by
      ensuring that the network namespace is not a child of the user
      namespace.  This modification was needed to allow the container manager
      to disable a namespace's privileged port restrictions without exposing
      control of the network namespace to processes in the user namespace.
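
      The semantics of the new sysctl can be summarised in a small sketch.
      The function names are hypothetical and the overlap rule is modelled
      in its simplest form: the privileged range [0, start) must end at or
      before the lowest local port.

```c
#include <stdbool.h>

/* A port is privileged when it lies below the first unprivileged port.
 * With unpriv_start == 0 no port is privileged at all. */
static bool port_is_privileged(unsigned int port, unsigned int unpriv_start)
{
	return port < unpriv_start;
}

/* A new start value is acceptable only if the privileged range does not
 * overlap the local port range, which begins at 'local_lo'. */
static bool unpriv_start_valid(unsigned int unpriv_start,
			       unsigned int local_lo)
{
	return unpriv_start <= local_lo;
}
```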
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, lpm: fix kfree of im_node in trie_update_elem · d140199a
      Daniel Borkmann authored
      We need to initialize im_node to NULL, otherwise in case of error path
      it gets passed to kfree() as uninitialized pointer.
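
      The pattern is the usual one for a shared cleanup label. Below is a
      userspace analogue using malloc()/free() instead of the kernel
      allocator; update_elem() and its fail_early parameter are
      hypothetical and exist only to exercise the error path.

```c
#include <stdlib.h>

/* Both pointers are released at the common 'out' label, so each must be
 * initialized to NULL: free(NULL) is a no-op, whereas freeing an
 * uninitialized pointer is undefined behaviour. */
static int update_elem(int fail_early)
{
	char *new_node = NULL;
	char *im_node = NULL;
	int ret = 0;

	new_node = malloc(16);
	if (!new_node) {
		ret = -1;
		goto out;
	}

	if (fail_early) {
		/* Error before im_node was ever allocated. */
		ret = -1;
		goto out;
	}

	im_node = malloc(16);
	if (!im_node)
		ret = -1;
out:
	/* The sketch frees on every path; real code would keep the nodes
	 * on success because they are linked into the data structure. */
	free(new_node);
	free(im_node);
	return ret;
}
```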
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 23 Jan, 2017 9 commits
    • Merge branch 'bpf-lpm' · 2acc76cb
      David S. Miller authored
      Daniel Mack says:
      
      ====================
      bpf: add longest prefix match map
      
      This patch set adds a longest prefix match algorithm that can be used
      to match IP addresses to a stored set of ranges. It is exposed as a
      bpf map type.
      
      Internally, data is stored in an unbalanced tree of nodes that has a
      maximum height of n, where n is the prefixlen the trie was created
      with.
      
      Note that this has nothing to do with fib or fib6 and is in no way meant
      to replace or share code with it. It's rather a much simpler
      implementation that is specifically written with bpf maps in mind.
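
      To illustrate only the lookup semantics, here is a longest prefix
      match over IPv4 addresses done as a plain linear scan. It is
      deliberately not a trie and not the map implementation; the struct
      and function names are hypothetical and addresses are in host byte
      order.

```c
#include <stddef.h>
#include <stdint.h>

struct prefix {
	uint32_t addr;	/* network address, host byte order */
	uint8_t len;	/* prefix length, 0..32 */
	int value;
};

/* Does 'ip' fall inside the range described by prefix 'p'? */
static int prefix_matches(const struct prefix *p, uint32_t ip)
{
	uint32_t mask = p->len ? ~(uint32_t)0 << (32 - p->len) : 0;

	return (ip & mask) == (p->addr & mask);
}

/* Return the matching entry with the longest prefix, or NULL. */
static const struct prefix *lpm_lookup(const struct prefix *tbl, size_t n,
				       uint32_t ip)
{
	const struct prefix *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].len > 32 || !prefix_matches(&tbl[i], ip))
			continue;
		if (!best || tbl[i].len > best->len)
			best = &tbl[i];
	}
	return best;
}
```

      A trie returns the same answer while visiting at most one node per
      prefix bit instead of every stored entry.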
      
      Patch 1/3 adds the implementation, 2/3 an extensive test suite and 3/3
      has benchmarking code for the new trie type.
      
      Feedback is much appreciated.
      
      Changelog:
      
      v3 -> v4:
      	* David added a 3rd patch that augments map_perf_test for
      	  LPM trie benchmarks
      	* Limit allocation of maps of this new type to CAP_SYS_ADMIN
      	  for now, as requested by Alexei
      	* Add a stub .map_delete_elem so the core does not stumble
      	  over a NULL pointer when the syscall is invoked
      	* Tests for non-power-of-2 prefix lengths were added
      	* More comment style fixes
      
      v2 -> v3:
      	* Store both the key match data and the caller provided
      	  value in the same byte array attached to a node. This
      	  avoids double allocations
      	* Bring back node->flags to distinguish between 'real'
      	  and intermediate nodes
      	* Fix comment style and some typos
      
      v1 -> v2:
      	* Turn spin lock into raw spinlock
      	* Lock with irqsave options during trie_update_elem()
      	* Return -ENOMEM properly from trie_alloc()
      	* Force attr->flags == BPF_F_NO_PREALLOC during creation
      	* Set trie->map.pages after creation to account for map memory
      	* Allow arbitrary value sizes
      	* Removed node->flags and denote intermediate nodes through
      	  node->value == NULL instead
      
      rfc -> v1:
      	* Add __rcu pointer annotations to make sparse happy
      	* Fold _lpm_trie_find_target_node() into its only caller
      	* Fix some minor documentation issues
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: add lpm-trie benchmark · b8a943e2
      David Herrmann authored
      Extend the map_perf_test_{user,kern}.c infrastructure to stress test
      lpm-trie lookups. We hook into the kprobe on sys_gettid() and measure
      the latency depending on trie size and lookup count.
      
      On my Intel Haswell i7-6400U, a single gettid() syscall with an empty
      bpf program takes roughly 6.5us. Lookups in empty tries
      take ~1.8us on first try, ~0.9us on retries. Lookups in tries with 8192
      entries take ~7.1us (on the first _and_ any subsequent try).
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Reviewed-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add tests for the lpm trie map · 4d3381f5
      David Herrmann authored
      The first part of this program runs randomized tests against the
      lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
      based on simple, linear, singly linked lists. The implementation
      should be pretty straightforward.
      
      Based on tlpm, this inserts randomized data into bpf-lpm-maps and
      verifies the trie-based bpf-map implementation behaves the same way
      as tlpm.
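
A reference matcher of this kind can be sketched as follows. This is not the test suite's code: the `tlpm_*` names, the fixed 4-byte key and the `uint64_t` value are assumptions made for illustration. It shows the idea of a linear scan that keeps the longest stored prefix covering the lookup key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list entry: a prefix of n_bits significant bits and a value. */
struct tlpm_entry {
	struct tlpm_entry *next;
	size_t n_bits;
	uint64_t value;
	uint8_t key[4];		/* fixed to 4 bytes (IPv4-sized) in this sketch */
};

/* Push an entry on the front of the list. Returns the new head, or NULL
 * if allocation fails (the old list is left untouched in that case). */
static struct tlpm_entry *tlpm_add(struct tlpm_entry *head,
				   const uint8_t key[4], size_t n_bits,
				   uint64_t value)
{
	struct tlpm_entry *e;

	assert(n_bits <= 32);
	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	memcpy(e->key, key, sizeof(e->key));
	e->n_bits = n_bits;
	e->value = value;
	e->next = head;
	return e;
}

/* Return 1 if the first e->n_bits bits of key equal the stored prefix. */
static int tlpm_covers(const struct tlpm_entry *e, const uint8_t key[4])
{
	size_t i;

	for (i = 0; i < e->n_bits; i++) {
		uint8_t mask = (uint8_t)(0x80 >> (i % 8));

		if ((e->key[i / 8] ^ key[i / 8]) & mask)
			return 0;
	}
	return 1;
}

/* Linear scan: keep the covering entry with the largest prefix length. */
static const struct tlpm_entry *tlpm_match(const struct tlpm_entry *head,
					   const uint8_t key[4])
{
	const struct tlpm_entry *best = NULL;

	for (; head; head = head->next)
		if (tlpm_covers(head, key) &&
		    (!best || head->n_bits > best->n_bits))
			best = head;
	return best;
}

/* Free every entry in the list. */
static void tlpm_clear(struct tlpm_entry *head)
{
	while (head) {
		struct tlpm_entry *next = head->next;

		free(head);
		head = next;
	}
}
```

Because every lookup visits every entry, such a list is slow but easy to reason about, which is what makes it a useful oracle for a randomized comparison against the trie.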
      
      The second part uses 'real world' IPv4 and IPv6 addresses and tests
      the trie with those.
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add a longest prefix match trie map implementation · b95a5c4d
      Daniel Mack authored
      This trie implements a longest prefix match algorithm that can be used
      to match IP addresses to a stored set of ranges.
      
      Internally, data is stored in an unbalanced trie of nodes that has a
      maximum height of n, where n is the prefixlen the trie was created
      with.
      
      Tries may be created with prefix lengths that are multiples of 8, in
      the range from 8 to 2048. The key used for lookup and update operations
      is a struct bpf_lpm_trie_key, and the value is a uint64_t.
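
To make the "maximum height of n" property concrete, here is a minimal userspace sketch of a longest prefix match over a binary trie. It is deliberately simpler than the kernel code and is not taken from it: it spends one node per key bit, fixes the key to 32 bits, and all `bit_*` names are hypothetical. A lookup walks at most 32 levels and remembers the deepest node that carries a value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One node per key bit; has_value marks nodes that hold a stored prefix. */
struct bit_node {
	struct bit_node *child[2];
	int has_value;
	uint64_t value;
};

/* Bit i of key, counting from the most significant bit (i in 0..31). */
static int key_bit(uint32_t key, unsigned int i)
{
	return (int)((key >> (31 - i)) & 1u);
}

/* Store value under the first prefixlen bits of key.
 * Returns 0 on success, -1 if an allocation fails. */
static int bit_trie_insert(struct bit_node *root, uint32_t key,
			   unsigned int prefixlen, uint64_t value)
{
	struct bit_node *n = root;
	unsigned int i;

	assert(prefixlen <= 32);
	for (i = 0; i < prefixlen; i++) {
		int b = key_bit(key, i);

		if (!n->child[b]) {
			n->child[b] = calloc(1, sizeof(*n));
			if (!n->child[b])
				return -1;
		}
		n = n->child[b];
	}
	n->has_value = 1;
	n->value = value;
	return 0;
}

/* Walk the key bits from the top and return the deepest valued node. */
static const struct bit_node *bit_trie_lookup(const struct bit_node *root,
					      uint32_t key)
{
	const struct bit_node *n = root, *best = NULL;
	unsigned int i;

	for (i = 0; n; i++) {
		if (n->has_value)
			best = n;
		if (i == 32)
			break;	/* all key bits consumed */
		n = n->child[key_bit(key, i)];
	}
	return best;
}

/* Recursively free everything below n (n itself is owned by the caller). */
static void bit_trie_free_children(struct bit_node *n)
{
	int b;

	for (b = 0; b < 2; b++) {
		if (n->child[b]) {
			bit_trie_free_children(n->child[b]);
			free(n->child[b]);
			n->child[b] = NULL;
		}
	}
}
```

With 192.168.0.0/16 and 192.168.1.0/24 stored, a lookup of 192.168.1.7 returns the /24 entry and a lookup of 192.168.2.7 falls back to the /16 entry.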
      
      The code carries more information about the internal implementation.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: xilinx: constify net_device_ops structure · 10eeb5e6
      Bhumika Goyal authored
      Declare the net_device_ops structure as const, as it is only stored in
      the netdev_ops field of a net_device structure. This field is
      const-qualified, so net_device_ops structures with the same properties
      can be made const too.
      Done using Coccinelle:
      
      @r1 disable optional_qualifier@
      identifier i;
      position p;
      @@
      static struct net_device_ops i@p={...};
      
      @ok1@
      identifier r1.i;
      position p;
      struct net_device ndev;
      @@
      ndev.netdev_ops=&i@p
      
      @bad@
      position p!={r1.p,ok1.p};
      identifier r1.i;
      @@
      i@p
      
      @depends on !bad disable optional_qualifier@
      identifier r1.i;
      @@
      +const
      struct net_device_ops i;
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
         6201	    744	      0	   6945	   1b21 ethernet/xilinx/xilinx_emaclite.o
      
      File size after:
         text	   data	    bss	    dec	    hex	filename
         6745	    192	      0	   6937	   1b19 ethernet/xilinx/xilinx_emaclite.o
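
The effect of the change can be sketched outside the kernel. The `demo_*` types below are hypothetical stand-ins, not the real net_device or net_device_ops definitions. The sketch shows an operations table that is only ever reached through a pointer-to-const field, which is what allows the table itself to be declared const. In the size output above, data drops from 744 to 192 bytes while text grows by a similar amount, which is consistent with the table moving into read-only data (an interpretation, not something the commit message states).

```c
#include <assert.h>
#include <stddef.h>

struct demo_dev;

/* Hypothetical operations table, standing in for net_device_ops. */
struct demo_ops {
	int (*open)(struct demo_dev *dev);
	int (*stop)(struct demo_dev *dev);
};

/* Hypothetical device: it only holds a pointer to a const ops table. */
struct demo_dev {
	const struct demo_ops *ops;
	int running;
};

static int demo_open(struct demo_dev *dev)
{
	dev->running = 1;
	return 0;
}

static int demo_stop(struct demo_dev *dev)
{
	dev->running = 0;
	return 0;
}

/* const: the table is never written after build time, so the toolchain
 * is free to place it in read-only data. */
static const struct demo_ops demo_netdev_ops = {
	.open = demo_open,
	.stop = demo_stop,
};

/* Attach the shared, read-only ops table to a device. */
static void demo_setup(struct demo_dev *dev)
{
	assert(dev != NULL);
	dev->ops = &demo_netdev_ops;
	dev->running = 0;
}
```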
      Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: moxa: constify net_device_ops structures · 30bd2f52
      Bhumika Goyal authored
      Declare the net_device_ops structure as const, as it is only stored in
      the netdev_ops field of a net_device structure. This field is
      const-qualified, so net_device_ops structures with the same properties
      can be made const too.
      Done using Coccinelle:
      
      @r1 disable optional_qualifier@
      identifier i;
      position p;
      @@
      static struct net_device_ops i@p={...};
      
      @ok1@
      identifier r1.i;
      position p;
      struct net_device ndev;
      @@
      ndev.netdev_ops=&i@p
      
      @bad@
      position p!={r1.p,ok1.p};
      identifier r1.i;
      @@
      i@p
      
      @depends on !bad disable optional_qualifier@
      identifier r1.i;
      @@
      +const
      struct net_device_ops i;
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
         4821	    744	      0	   5565	   15bd ethernet/moxa/moxart_ether.o
      
      File size after:
         text	   data	    bss	    dec	    hex	filename
         5373	    192	      0	   5565	   15bd ethernet/moxa/moxart_ether.o
      Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qcom/emac: claim the irq only when the device is opened · 4404323c
      Timur Tabi authored
      During reset, functions emac_mac_down() and emac_mac_up() are called,
      so we don't want to free and claim the IRQ unnecessarily.  Move those
      operations to open/close.
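
The resulting call structure can be modelled with a small mock. None of the names below come from the driver: `struct mock_adapter` and the `mock_*` functions are hypothetical, and a counter stands in for actually requesting an interrupt. The sketch shows that once the claim lives in open and the release in close, a reset that only cycles the MAC down and up leaves the IRQ untouched.

```c
#include <assert.h>

/* Hypothetical stand-in for the adapter state; not the driver's structure. */
struct mock_adapter {
	int irq_claimed;	/* 1 while the interrupt line is held */
	int irq_requests;	/* how many times the IRQ has been claimed */
	int mac_running;
};

static void mock_mac_up(struct mock_adapter *a)
{
	a->mac_running = 1;
}

static void mock_mac_down(struct mock_adapter *a)
{
	a->mac_running = 0;
}

/* open: the only place where the IRQ is claimed. */
static void mock_open(struct mock_adapter *a)
{
	assert(!a->irq_claimed);
	a->irq_claimed = 1;
	a->irq_requests++;
	mock_mac_up(a);
}

/* close: the only place where the IRQ is released. */
static void mock_close(struct mock_adapter *a)
{
	mock_mac_down(a);
	a->irq_claimed = 0;
}

/* reset: cycles the MAC without touching the IRQ. */
static void mock_reset(struct mock_adapter *a)
{
	mock_mac_down(a);
	mock_mac_up(a);
}
```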
      Signed-off-by: Timur Tabi <timur@codeaurora.org>
      Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qcom/emac: rename emac_phy to emac_sgmii and move it · 41c1093f
      Timur Tabi authored
      The EMAC has an internal PHY that is often called the "SGMII".  This
      SGMII is also connected to an external PHY, which is managed by phylib.
      These dual PHYs often cause confusion.  In this case, the data structure
      for managing the SGMII was mis-named and located in the wrong header file.
      
      Structure emac_phy is renamed to emac_sgmii to clearly indicate it applies
      to the internal PHY only.  It is also moved from emac_phy.h (which
      supports the external PHY) to emac_sgmii.h (where it belongs).
      
      To keep the changes minimal, only the structure name is changed, not
      the names of any variables of that type.
      Signed-off-by: Timur Tabi <timur@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: avoid two atomic ops per page on x86 · b9032741
      Eric Dumazet authored
      Commit 4cace675 ("bnx2x: Alloc 4k fragment for each rx ring buffer
      element") added extra put_page() and get_page() calls on arches where
      PAGE_SIZE=4K, like x86.
      
      Reorder things to avoid this overhead.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
      Cc: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 22 Jan, 2017 8 commits