1. 23 Oct, 2017 18 commits
  2. 22 Oct, 2017 22 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · f8ddadc4
      David S. Miller authored
      There were quite a few overlapping sets of changes here.
      
      Daniel's bug fix for off-by-ones in the new BPF branch instructions,
      along with the added allowances for "data_end > ptr + x" forms
      collided with the metadata additions.
      
      Along with those three changes came veritifer test cases, which in
      their final form I tried to group together properly.  If I had just
      trimmed GIT's conflict tags as-is, this would have split up the
      meta tests unnecessarily.
      
      In the socketmap code, a set of preemption disabling changes
      overlapped with the rename of bpf_compute_data_end() to
      bpf_compute_data_pointers().
      
      Changes were made to the mv88e6060.c driver set addr method
      which got removed in net-next.
      
      The hyperv transport socket layer had a locking change in 'net'
      which overlapped with a change of socket state macro usage
      in 'net-next'.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8ddadc4
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · b5ac3beb
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "A little more than usual this time around. Been travelling, so that is
        part of it.
      
        Anyways, here are the highlights:
      
         1) Deal with memcontrol races wrt. listener dismantle, from Eric
            Dumazet.
      
         2) Handle page allocation failures properly in nfp driver, from Jaku
            Kicinski.
      
         3) Fix memory leaks in macsec, from Sabrina Dubroca.
      
         4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault.
      
         5) Several fixes in bnxt_en driver, including preventing potential
            NVRAM parameter corruption from Michael Chan.
      
         6) Fix for KRACK attacks in wireless, from Johannes Berg.
      
         7) rtnetlink event generation fixes from Xin Long.
      
         8) Deadlock in mlxsw driver, from Ido Schimmel.
      
         9) Disallow arithmetic operations on context pointers in bpf, from
            Jakub Kicinski.
      
        10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from
            Xin Long.
      
        11) Only TCP is supported for sockmap, make that explicit with a
            check, from John Fastabend.
      
        12) Fix IP options state races in DCCP and TCP, from Eric Dumazet.
      
        13) Fix panic in packet_getsockopt(), also from Eric Dumazet.
      
        14) Add missing locked in hv_sock layer, from Dexuan Cui.
      
        15) Various aquantia bug fixes, including several statistics handling
            cures. From Igor Russkikh et al.
      
        16) Fix arithmetic overflow in devmap code, from John Fastabend.
      
        17) Fix busted socket memory accounting when we get a fault in the tcp
            zero copy paths. From Willem de Bruijn.
      
        18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
        stmmac: Don't access tx_q->dirty_tx before netif_tx_lock
        ipv6: flowlabel: do not leave opt->tot_len with garbage
        of_mdio: Fix broken PHY IRQ in case of probe deferral
        textsearch: fix typos in library helpers
        rxrpc: Don't release call mutex on error pointer
        net: stmmac: Prevent infinite loop in get_rx_timestamp_status()
        net: stmmac: Fix stmmac_get_rx_hwtstamp()
        net: stmmac: Add missing call to dev_kfree_skb()
        mlxsw: spectrum_router: Configure TIGCR on init
        mlxsw: reg: Add Tunneling IPinIP General Configuration Register
        net: ethtool: remove error check for legacy setting transceiver type
        soreuseport: fix initialization race
        net: bridge: fix returning of vlan range op errors
        sock: correct sk_wmem_queued accounting on efault in tcp zerocopy
        bpf: add test cases to bpf selftests to cover all access tests
        bpf: fix pattern matches for direct packet access
        bpf: fix off by one for range markings with L{T, E} patterns
        bpf: devmap fix arithmetic overflow in bitmap_size calculation
        net: aquantia: Bad udp rate on default interrupt coalescing
        net: aquantia: Enable coalescing management via ethtool interface
        ...
      b5ac3beb
    • Bernd Edlinger's avatar
      stmmac: Don't access tx_q->dirty_tx before netif_tx_lock · 8d5f4b07
      Bernd Edlinger authored
      This is the possible reason for different hard to reproduce
      problems on my ARMv7-SMP test system.
      
      The symptoms are in recent kernels imprecise external aborts,
      and in older kernels various kinds of network stalls and
      unexpected page allocation failures.
      
      My testing indicates that the trouble started between v4.5 and v4.6
      and prevails up to v4.14.
      
      Using the dirty_tx before acquiring the spin lock is clearly
      wrong and was first introduced with v4.6.
      
      Fixes: e3ad57c9 ("stmmac: review RX/TX ring management")
      Signed-off-by: default avatarBernd Edlinger <bernd.edlinger@hotmail.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8d5f4b07
    • Eric Dumazet's avatar
      ipv6: flowlabel: do not leave opt->tot_len with garbage · 864e2a1f
      Eric Dumazet authored
      When syzkaller team brought us a C repro for the crash [1] that
      had been reported many times in the past, I finally could find
      the root cause.
      
      If FlowLabel info is merged by fl6_merge_options(), we leave
      part of the opt_space storage provided by udp/raw/l2tp with random value
      in opt_space.tot_len, unless a control message was provided at sendmsg()
      time.
      
      Then ip6_setup_cork() would use this random value to perform a kzalloc()
      call. Undefined behavior and crashes.
      
      Fix is to properly set tot_len in fl6_merge_options()
      
      At the same time, we can also avoid consuming memory and cpu cycles
      to clear it, if every option is copied via a kmemdup(). This is the
      change in ip6_setup_cork().
      
      [1]
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 6613 Comm: syz-executor0 Not tainted 4.14.0-rc4+ #127
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff8801cb64a100 task.stack: ffff8801cc350000
      RIP: 0010:ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168
      RSP: 0018:ffff8801cc357550 EFLAGS: 00010203
      RAX: dffffc0000000000 RBX: ffff8801cc357748 RCX: 0000000000000010
      RDX: 0000000000000002 RSI: ffffffff842bd1d9 RDI: 0000000000000014
      RBP: ffff8801cc357620 R08: ffff8801cb17f380 R09: ffff8801cc357b10
      R10: ffff8801cb64a100 R11: 0000000000000000 R12: ffff8801cc357ab0
      R13: ffff8801cc357b10 R14: 0000000000000000 R15: ffff8801c3bbf0c0
      FS:  00007f9c5c459700(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020324000 CR3: 00000001d1cf2000 CR4: 00000000001406f0
      DR0: 0000000020001010 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Call Trace:
       ip6_make_skb+0x282/0x530 net/ipv6/ip6_output.c:1729
       udpv6_sendmsg+0x2769/0x3380 net/ipv6/udp.c:1340
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       SYSC_sendto+0x358/0x5a0 net/socket.c:1750
       SyS_sendto+0x40/0x50 net/socket.c:1718
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x4520a9
      RSP: 002b:00007f9c5c458c08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004520a9
      RDX: 0000000000000001 RSI: 0000000020fd1000 RDI: 0000000000000016
      RBP: 0000000000000086 R08: 0000000020e0afe4 R09: 000000000000001c
      R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004bb1ee
      R13: 00000000ffffffff R14: 0000000000000016 R15: 0000000000000029
      Code: e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ea 0f 00 00 48 8d 79 04 48 b8 00 00 00 00 00 fc ff df 45 8b 74 24 04 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
      RIP: ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168 RSP: ffff8801cc357550
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      864e2a1f
    • Geert Uytterhoeven's avatar
      of_mdio: Fix broken PHY IRQ in case of probe deferral · 66bdede4
      Geert Uytterhoeven authored
      If an Ethernet PHY is initialized before the interrupt controller it is
      connected to, a message like the following is printed:
      
          irq: no irq domain found for /interrupt-controller@e61c0000 !
      
      However, the actual error is ignored, leading to a non-functional (POLL)
      PHY interrupt later:
      
          Micrel KSZ8041RNLI ee700000.ethernet-ffffffff:01: attached PHY driver [Micrel KSZ8041RNLI] (mii_bus:phy_addr=ee700000.ethernet-ffffffff:01, irq=POLL)
      
      Depending on whether the PHY driver will fall back to polling, Ethernet
      may or may not work.
      
      To fix this:
        1. Switch of_mdiobus_register_phy() from irq_of_parse_and_map() to
           of_irq_get().
           Unlike the former, the latter returns -EPROBE_DEFER if the
           interrupt controller is not yet available, so this condition can be
           detected.
           Other errors are handled the same as before, i.e. use the passed
           mdio->irq[addr] as interrupt.
        2. Propagate and handle errors from of_mdiobus_register_phy() and
           of_mdiobus_register_device().
      Signed-off-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66bdede4
    • Randy Dunlap's avatar
      textsearch: fix typos in library helpers · 7433a8d6
      Randy Dunlap authored
      Fix spellos (typos) in textsearch library helpers.
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7433a8d6
    • David S. Miller's avatar
      Merge branch 'tun-timer-cleanups' · bdd091ba
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tun: timer cleanups
      
      While working on a syzkaller issue that might have been
      fixed already by Cong Wang in commit 0ad646c8
      ("tun: call dev_get_valid_name() before register_netdevice()")
      I made three small changes related to flow_gc_timer.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bdd091ba
    • Eric Dumazet's avatar
      tun: do not arm flow_gc_timer in tun_flow_init() · ee74d996
      Eric Dumazet authored
      Timer is properly armed on demand from tun_flow_update(),
      so there is no need to arm it at tun init.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee74d996
    • Eric Dumazet's avatar
      tun: avoid extra timer schedule in tun_flow_cleanup() · 81d98fa4
      Eric Dumazet authored
      If tun_flow_cleanup() deleted all flows, no need to
      arm the timer again. It will be armed next time
      tun_flow_update() is called.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      81d98fa4
    • Eric Dumazet's avatar
      tun: do not block BH again in tun_flow_cleanup() · 7dbfb4ef
      Eric Dumazet authored
      tun_flow_cleanup() being a timer callback, it is already
      running in BH context.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7dbfb4ef
    • David S. Miller's avatar
      Merge branch 'bpf-BASE_RTT' · 02db34d0
      David S. Miller authored
      Lawrence Brakmo says:
      
      ====================
      bpf: add support for BASE_RTT
      
      This patch set adds the following functionality to socket_ops BPF
      programs.
      1) Add bpf helper function bpf_getsocketops. Currently only supports
         TCP_CONGESTION
      2) Add BPF_SOCKET_OPS_BASE_RTT op to get the base RTT of the
         connection. In general, the base RTT indicates the threshold such
         that RTTs above it indicate congestion. More details in the
         relevant patches.
      
      Consists of the following patches:
      
      [PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT
      [PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops
      [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to
      [PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program
      [PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02db34d0
    • Lawrence Brakmo's avatar
      bpf: create samples/bpf/tcp_bpf.readme · bfdf7569
      Lawrence Brakmo authored
      Readme file explaining how to create a cgroupv2 and attach one
      of the tcp_*_kern.o socket_ops BPF program.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked_by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bfdf7569
    • Lawrence Brakmo's avatar
      bpf: sample BPF_SOCKET_OPS_BASE_RTT program · c890063e
      Lawrence Brakmo authored
      Sample socket_ops BPF program to test the BPF helper function
      bpf_getsocketops and the new socket_ops op BPF_SOCKET_OPS_BASE_RTT.
      
      The program provides a base RTT of 80us when the calling flow is
      within a DC (as determined by the IPV6 prefix) and the congestion
      algorithm is "nv".
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked_by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c890063e
    • Lawrence Brakmo's avatar
      bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv · 85cce215
      Lawrence Brakmo authored
      TCP_NV will try to get the base RTT from a socket_ops BPF program if one
      is loaded. NV will then use the base RTT to bound its min RTT (its
      notion of the base RTT). It uses the base RTT as an upper bound and 80%
      of the base RTT as its lower bound.
      
      In other words, NV will consider filtered RTTs larger than base RTT as a
      sign of congestion. As a result, there is no minRTT inflation when there
      is a lot of congestion. For example, in a DC where the RTTs are less
      than 40us when there is no congestion, a base RTT value of 80us improves
      the performance of NV. The difference between the uncongested RTT and
      the base RTT provided represents how much queueing we are willing to
      have (in practice it can be higher).
      
      NV has been tunned to reduce congestion when there are many flows at the
      cost of one flow not achieving full bandwith utilization. When a
      reasonable base RTT is provided, one NV flow can now fully utilize the
      full bandwidth. In addition, the performance is also improved when there
      are many flows.
      
      In the following examples the NV results are using a kernel with this
      patch set (i.e. both NV results are using the new nv_loss_dec_factor).
      
      With one host sending to another host and only one flow the
      goodputs are:
        Cubic: 9.3 Gbps, NV: 5.5 Gbps, NV (baseRTT=80us): 9.2 Gbps
      
      With 2 hosts sending to one host (1 flow per host, the goodput per flow
      is:
        Cubic: 4.6 Gbps, NV: 4.5 Gbps, NV (baseRTT=80us)L 4.6 Gbps
      
      But the RTTs seen by a ping process in the sender is:
        Cubic: 3.3ms  NV: 97us,  NV (baseRTT=80us): 146us
      
      With a lot of flows things look even better for NV with baseRTT. Here we
      have 3 hosts sending to one host. Each sending host has 6 flows: 1
      stream, 4x1MB RPC, 1x10KB RPC. Cubic, NV and NV with baseRTT all fully
      utilize the full available bandwidth. However, the distribution of
      bandwidth among the flows is very different. For the 10KB RPC flow:
        Cubic: 27Mbps, NV: 111Mbps, NV (baseRTT=80us): 222Mbps
      
      The 99% latencies for the 10KB flows are:
        Cubic: 26ms,  NV: 1ms,  NV (baseRTT=80us): 500us
      
      The RTT seen by a ping process at the senders:
        Cubic: 3.2ms  NV: 720us,  NV (baseRTT=80us): 330us
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      85cce215
    • Lawrence Brakmo's avatar
      bpf: Adding helper function bpf_getsockops · cd86d1fd
      Lawrence Brakmo authored
      Adding support for helper function bpf_getsockops to socket_ops BPF
      programs. This patch only supports TCP_CONGESTION.
      Signed-off-by: default avatarVlad Vysotsky <vlad@cs.ucla.edu>
      Acked-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Acked-by: default avatarAlexei Starovoitov <ast@fb.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cd86d1fd
    • Lawrence Brakmo's avatar
      bpf: add support for BPF_SOCK_OPS_BASE_RTT · e6546ef6
      Lawrence Brakmo authored
      A congestion control algorithm can make a call to the BPF socket_ops
      program to request the base RTT. The base RTT can be congestion control
      dependent and is meant to represent a congestion threshold such that
      RTTs above it indicate congestion. This is especially useful for flows
      within a DC where the base RTT is easy to obtain.
      
      Being provided a base RTT solves a basic problem in RTT based congestion
      avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
      to get the base RTT when the network is not congested, it is very
      diffcult to do when it is very congested. Newer connections get an
      inflated value of the base RTT leading to unfariness (newer flows with a
      larger base RTT get more bandwidth). As a result, RTT based congestion
      avoidance algorithms tend to update their base RTTs to improve fairness.
      In very congested networks this can lead to base RTT inflation, reducing
      the ability of these RTT based congestion control algorithms to prevent
      congestion.
      
      Note that in my experiments with TCP-NV, the base RTT provided can be
      much larger than the actual hardware RTT. For example, experimenting
      with hosts within a rack where the hardware RTT is 16-20us, I've used
      base RTTs up to 150us. The effect of using a larger base RTT is that the
      congestion avoidance algorithm will allow more queueing. When there are
      only a few flows the main effect is larger measured RTTs and RPC
      latencies due to the increased queueing. When there are a lot of flows,
      a larger base RTT can lead to more congestion and more packet drops.
      For this case, where the hardware RTT is 20us, a base RTT of 80us
      produces good results.
      
      This patch only introduces BPF_SOCK_OPS_BASE_RTT, a later patch in this
      set adds support for using it in TCP-NV. Further study and testing is
      needed before support can be added to other delay based congestion
      avoidance algorithms.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Acked-by: default avatarAlexei Starovoitov <ast@fb.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e6546ef6
    • Pieter Jansen van Vuuren's avatar
      nfp: use struct fields for 8 bit-wide access · 62d3f60b
      Pieter Jansen van Vuuren authored
      Use direct access struct fields rather than PREP_FIELD()
      macros to manipulate the jump ID and length, both of which
      are exactly 8-bits wide. This simplifies the code somewhat.
      Signed-off-by: default avatarSimon Horman <simon.horman@netronome.com>
      Signed-off-by: default avatarPieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Acked-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      62d3f60b
    • Gustavo A. R. Silva's avatar
      net: x25: mark expected switch fall-throughs · 0cea8e28
      Gustavo A. R. Silva authored
      In preparation to enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      Signed-off-by: default avatarGustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0cea8e28
    • Gustavo A. R. Silva's avatar
      net: af_unix: mark expected switch fall-through · 110af3ac
      Gustavo A. R. Silva authored
      In preparation to enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      Signed-off-by: default avatarGustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      110af3ac
    • David Howells's avatar
      rxrpc: Don't release call mutex on error pointer · 6cb3ece9
      David Howells authored
      Don't release call mutex at the end of rxrpc_kernel_begin_call() if the
      call pointer actually holds an error value.
      
      Fixes: 540b1c48 ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
      Reported-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6cb3ece9
    • David S. Miller's avatar
      Merge branch 'stmmac-hw-tstamp-fixes' · 748759d5
      David S. Miller authored
      Jose Abreu says:
      
      ====================
      net: stmmac: Fix HW timestamping
      
      Three fixes for HW timestamping feature, all of them for RX side.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      748759d5
    • Jose Abreu's avatar
      net: stmmac: Prevent infinite loop in get_rx_timestamp_status() · 9454360d
      Jose Abreu authored
      Prevent infinite loop by correctly setting the loop condition to
      break when i == 10.
      Signed-off-by: default avatarJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9454360d