1. 31 Jan, 2019 10 commits
    • Eric Dumazet's avatar
      rds: fix refcount bug in rds_sock_addref · 6fa19f56
      Eric Dumazet authored
      syzbot was able to catch a bug in rds [1]
      
      The issue here is that the socket might be found in a hash table
      but that its refcount has already be set to 0 by another cpu.
      
      We need to use refcount_inc_not_zero() to be safe here.
      
      [1]
      
      refcount_t: increment on 0; use-after-free.
      WARNING: CPU: 1 PID: 23129 at lib/refcount.c:153 refcount_inc_checked lib/refcount.c:153 [inline]
      WARNING: CPU: 1 PID: 23129 at lib/refcount.c:153 refcount_inc_checked+0x61/0x70 lib/refcount.c:151
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 23129 Comm: syz-executor3 Not tainted 5.0.0-rc4+ #53
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
       panic+0x2cb/0x65c kernel/panic.c:214
       __warn.cold+0x20/0x48 kernel/panic.c:571
       report_bug+0x263/0x2b0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       fixup_bug arch/x86/kernel/traps.c:173 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:refcount_inc_checked lib/refcount.c:153 [inline]
      RIP: 0010:refcount_inc_checked+0x61/0x70 lib/refcount.c:151
      Code: 1d 51 63 c8 06 31 ff 89 de e8 eb 1b f2 fd 84 db 75 dd e8 a2 1a f2 fd 48 c7 c7 60 9f 81 88 c6 05 31 63 c8 06 01 e8 af 65 bb fd <0f> 0b eb c1 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 49
      RSP: 0018:ffff8880a0cbf1e8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc90006113000
      RDX: 000000000001047d RSI: ffffffff81685776 RDI: 0000000000000005
      RBP: ffff8880a0cbf1f8 R08: ffff888097c9e100 R09: ffffed1015ce5021
      R10: ffffed1015ce5020 R11: ffff8880ae728107 R12: ffff8880723c20c0
      R13: ffff8880723c24b0 R14: dffffc0000000000 R15: ffffed1014197e64
       sock_hold include/net/sock.h:647 [inline]
       rds_sock_addref+0x19/0x20 net/rds/af_rds.c:675
       rds_find_bound+0x97c/0x1080 net/rds/bind.c:82
       rds_recv_incoming+0x3be/0x1430 net/rds/recv.c:362
       rds_loop_xmit+0xf3/0x2a0 net/rds/loop.c:96
       rds_send_xmit+0x1355/0x2a10 net/rds/send.c:355
       rds_sendmsg+0x323c/0x44e0 net/rds/send.c:1368
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:631
       __sys_sendto+0x387/0x5f0 net/socket.c:1788
       __do_sys_sendto net/socket.c:1800 [inline]
       __se_sys_sendto net/socket.c:1796 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1796
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458089
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fc266df8c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 0000000000458089
      RDX: 0000000000000000 RSI: 00000000204b3fff RDI: 0000000000000005
      RBP: 000000000073bf00 R08: 00000000202b4000 R09: 0000000000000010
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc266df96d4
      R13: 00000000004c56e4 R14: 00000000004d94a8 R15: 00000000ffffffff
      
      Fixes: cc4dfb7f ("rds: fix two RCU related problems")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: rds-devel@oss.oracle.com
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6fa19f56
    • Bart Van Assche's avatar
      lib/test_rhashtable: Make test_insert_dup() allocate its hash table dynamically · fc42a689
      Bart Van Assche authored
      The test_insert_dup() function from lib/test_rhashtable.c passes a
      pointer to a stack object to rhltable_init(). Allocate the hash table
      dynamically to avoid that the following is reported with object
      debugging enabled:
      
      ODEBUG: object (ptrval) is on stack (ptrval), but NOT annotated.
      WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:368 __debug_object_init+0x312/0x480
      Modules linked in:
      EIP: __debug_object_init+0x312/0x480
      Call Trace:
       ? debug_object_init+0x1a/0x20
       ? __init_work+0x16/0x30
       ? rhashtable_init+0x1e1/0x460
       ? sched_clock_cpu+0x57/0xe0
       ? rhltable_init+0xb/0x20
       ? test_insert_dup+0x32/0x20f
       ? trace_hardirqs_on+0x38/0xf0
       ? ida_dump+0x10/0x10
       ? jhash+0x130/0x130
       ? my_hashfn+0x30/0x30
       ? test_rht_init+0x6aa/0xab4
       ? ida_dump+0x10/0x10
       ? test_rhltable+0xc5c/0xc5c
       ? do_one_initcall+0x67/0x28e
       ? trace_hardirqs_off+0x22/0xe0
       ? restore_all_kernel+0xf/0x70
       ? trace_hardirqs_on_thunk+0xc/0x10
       ? restore_all_kernel+0xf/0x70
       ? kernel_init_freeable+0x142/0x213
       ? rest_init+0x230/0x230
       ? kernel_init+0x10/0x110
       ? schedule_tail_wrapper+0x9/0xc
       ? ret_from_fork+0x19/0x24
      
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc42a689
    • Jacob Wen's avatar
      l2tp: copy 4 more bytes to linear part if necessary · 91c52470
      Jacob Wen authored
      The size of L2TPv2 header with all optional fields is 14 bytes.
      l2tp_udp_recv_core only moves 10 bytes to the linear part of a
      skb. This may lead to l2tp_recv_common read data outside of a skb.
      
      This patch make sure that there is at least 14 bytes in the linear
      part of a skb to meet the maximum need of l2tp_udp_recv_core and
      l2tp_recv_common. The minimum size of both PPP HDLC-like frame and
      Ethernet frame is larger than 14 bytes, so we are safe to do so.
      
      Also remove L2TP_HDR_SIZE_NOSEQ, it is unused now.
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Suggested-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarJacob Wen <jian.w.wen@oracle.com>
      Acked-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91c52470
    • David S. Miller's avatar
      Merge branch 'stmmac-fixes' · 3aa9179b
      David S. Miller authored
      Jose Abreu says:
      
      ====================
      net: stmmac: Misc fixes
      
      Some misc fixes for stmmac targeting -net.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3aa9179b
    • Jose Abreu's avatar
      net: stmmac: Disable EEE mode earlier in XMIT callback · e2cd682d
      Jose Abreu authored
      In stmmac xmit callback we use a different flow for TSO packets but TSO
      xmit callback is not disabling the EEE mode.
      
      Fix this by disabling earlier the EEE mode, i.e. before calling the TSO
      xmit callback.
      Signed-off-by: default avatarJose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e2cd682d
    • Jose Abreu's avatar
      net: stmmac: Send TSO packets always from Queue 0 · c5acdbee
      Jose Abreu authored
      The number of TSO enabled channels in HW can be different than the
      number of total channels. There is no way to determined, at runtime, the
      number of TSO capable channels and its safe to assume that if TSO is
      enabled then at least channel 0 will be TSO capable.
      
      Lets always send TSO packets from Queue 0.
      Signed-off-by: default avatarJose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c5acdbee
    • Jose Abreu's avatar
      net: stmmac: Fallback to Platform Data clock in Watchdog conversion · 4ec5302f
      Jose Abreu authored
      If we don't have DT then stmmac_clk will not be available. Let's add a
      new Platform Data field so that we can specify the refclk by this mean.
      
      This way we can still use the coalesce command in PCI based setups.
      Signed-off-by: default avatarJose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4ec5302f
    • Daniel Borkmann's avatar
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · d5256083
      Daniel Borkmann authored
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s soley distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d5256083
    • Jacob Wen's avatar
      l2tp: fix reading optional fields of L2TPv3 · 4522a70d
      Jacob Wen authored
      Use pskb_may_pull() to make sure the optional fields are in skb linear
      parts, so we can safely read them later.
      
      It's easy to reproduce the issue with a net driver that supports paged
      skb data. Just create a L2TPv3 over IP tunnel and then generates some
      network traffic.
      Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.
      
      Changes in v4:
      1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
      2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
      3. Add 'Fixes' in commit messages.
      
      Changes in v3:
      1. To keep consistency, move the code out of l2tp_recv_common.
      2. Use "net" instead of "net-next", since this is a bug fix.
      
      Changes in v2:
      1. Only fix L2TPv3 to make code simple.
         To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
         It's complicated to do so.
      2. Reloading pointers after pskb_may_pull
      
      Fixes: f7faffa3 ("l2tp: Add L2TPv3 protocol support")
      Fixes: 0d76751f ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
      Fixes: a32e0eec ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
      Signed-off-by: default avatarJacob Wen <jian.w.wen@oracle.com>
      Acked-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4522a70d
    • George Amanakis's avatar
      tun: move the call to tun_set_real_num_queues · 3a03cb84
      George Amanakis authored
      Call tun_set_real_num_queues() after the increment of tun->numqueues
      since the former depends on it. Otherwise, the number of queues is not
      correctly accounted for, which results to warnings similar to:
      "vnet0 selects TX queue 11, but real number of TX queues is 11".
      
      Fixes: 0b7959b6 ("tun: publish tfile after it's fully initialized")
      Reported-and-tested-by: default avatarGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: default avatarGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3a03cb84
  2. 30 Jan, 2019 19 commits
    • Yohei Kanemaru's avatar
      ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation · ef489749
      Yohei Kanemaru authored
      skb->cb may contain data from previous layers (in an observed case
      IPv4 with L3 Master Device). In the observed scenario, the data in
      IPCB(skb)->frags was misinterpreted as IP6CB(skb)->frag_max_size,
      eventually caused an unexpected IPv6 fragmentation in ip6_fragment()
      through ip6_finish_output().
      
      This patch clears IP6CB(skb), which potentially contains garbage data,
      on the SRH ip4ip6 encapsulation.
      
      Fixes: 32d99d0b ("ipv6: sr: add support for ip4ip6 encapsulation")
      Signed-off-by: default avatarYohei Kanemaru <yohei.kanemaru@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ef489749
    • David S. Miller's avatar
      Merge branch 'virtio_net-Fix-problems-around-XDP-tx-and-napi_tx' · a10cc847
      David S. Miller authored
      Toshiaki Makita says:
      
      ====================
      virtio_net: Fix problems around XDP tx and napi_tx
      
      While I'm looking into how to account standard tx counters on XDP tx
      processing, I found several bugs around XDP tx and napi_tx.
      
      Patch1: Fix oops on error path. Patch2 depends on this.
      Patch2: Fix memory corruption on freeing xdp_frames with napi_tx enabled.
      Patch3: Minor fix patch5 depends on.
      Patch4: Fix memory corruption on processing xdp_frames when XDP is disabled.
        Also patch5 depends on this.
      Patch5: Fix memory corruption on processing xdp_frames while XDP is being
        disabled.
      Patch6: Minor fix patch7 depends on.
      Patch7: Fix memory corruption on freeing sk_buff or xdp_frames when a normal
        queue is reused for XDP and vise versa.
      
      v2:
      - patch5: Make rcu_assign_pointer/synchronize_net conditional instead of
                _virtnet_set_queues.
      - patch7: Use napi_consume_skb() instead of dev_consume_skb_any()
      ====================
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a10cc847
    • Toshiaki Makita's avatar
      virtio_net: Differentiate sk_buff and xdp_frame on freeing · 5050471d
      Toshiaki Makita authored
      We do not reset or free up unused buffers when enabling/disabling XDP,
      so it can happen that xdp_frames are freed after disabling XDP or
      sk_buffs are freed after enabling XDP on xdp tx queues.
      Thus we need to handle both forms (xdp_frames and sk_buffs) regardless
      of XDP setting.
      One way to trigger this problem is to disable XDP when napi_tx is
      enabled. In that case, virtnet_xdp_set() calls virtnet_napi_enable()
      which kicks NAPI. The NAPI handler will call virtnet_poll_cleantx()
      which invokes free_old_xmit_skbs() for queues which have been used by
      XDP.
      
      Note that even with this change we need to keep skipping
      free_old_xmit_skbs() from NAPI handlers when XDP is enabled, because XDP
      tx queues do not aquire queue locks.
      
      - v2: Use napi_consume_skb() instead of dev_consume_skb_any()
      
      Fixes: 4941d472 ("virtio-net: do not reset during XDP set")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5050471d
    • Toshiaki Makita's avatar
      virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqs · 07b344f4
      Toshiaki Makita authored
      put_page() can work as a fallback for freeing xdp_frames, but the
      appropriate way is to use xdp_return_frame().
      
      Fixes: cac320c8 ("virtio_net: convert to use generic xdp_frame and xdp_return_frame API")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      07b344f4
    • Toshiaki Makita's avatar
      virtio_net: Don't process redirected XDP frames when XDP is disabled · 03aa6d34
      Toshiaki Makita authored
      Commit 8dcc5b0a ("virtio_net: fix ndo_xdp_xmit crash towards dev not
      ready for XDP") tried to avoid access to unexpected sq while XDP is
      disabled, but was not complete.
      
      There was a small window which causes out of bounds sq access in
      virtnet_xdp_xmit() while disabling XDP.
      
      An example case of
       - curr_queue_pairs = 6 (2 for SKB and 4 for XDP)
       - online_cpu_num = xdp_queue_paris = 4
      when XDP is enabled:
      
      CPU 0                         CPU 1
      (Disabling XDP)               (Processing redirected XDP frames)
      
                                    virtnet_xdp_xmit()
      virtnet_xdp_set()
       _virtnet_set_queues()
        set curr_queue_pairs (2)
                                     check if rq->xdp_prog is not NULL
                                     virtnet_xdp_sq(vi)
                                      qp = curr_queue_pairs -
                                           xdp_queue_pairs +
                                           smp_processor_id()
                                         = 2 - 4 + 1 = -1
                                      sq = &vi->sq[qp] // out of bounds access
        set xdp_queue_pairs (0)
        rq->xdp_prog = NULL
      
      Basically we should not change curr_queue_pairs and xdp_queue_pairs
      while someone can read the values. Thus, when disabling XDP, assign NULL
      to rq->xdp_prog first, and wait for RCU grace period, then change
      xxx_queue_pairs.
      Note that we need to keep the current order when enabling XDP though.
      
      - v2: Make rcu_assign_pointer/synchronize_net conditional instead of
            _virtnet_set_queues.
      
      Fixes: 186b3c99 ("virtio-net: support XDP_REDIRECT")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      03aa6d34
    • Toshiaki Makita's avatar
      virtio_net: Fix out of bounds access of sq · 1667c08a
      Toshiaki Makita authored
      When XDP is disabled, curr_queue_pairs + smp_processor_id() can be
      larger than max_queue_pairs.
      There is no guarantee that we have enough XDP send queues dedicated for
      each cpu when XDP is disabled, so do not count drops on sq in that case.
      
      Fixes: 5b8f3c8d ("virtio_net: Add XDP related stats")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1667c08a
    • Toshiaki Makita's avatar
      virtio_net: Fix not restoring real_num_rx_queues · 188313c1
      Toshiaki Makita authored
      When _virtnet_set_queues() failed we did not restore real_num_rx_queues.
      Fix this by placing the change of real_num_rx_queues after
      _virtnet_set_queues().
      This order is also in line with virtnet_set_channels().
      
      Fixes: 4941d472 ("virtio-net: do not reset during XDP set")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      188313c1
    • Toshiaki Makita's avatar
      virtio_net: Don't call free_old_xmit_skbs for xdp_frames · 534da5e8
      Toshiaki Makita authored
      When napi_tx is enabled, virtnet_poll_cleantx() called
      free_old_xmit_skbs() even for xdp send queue.
      This is bogus since the queue has xdp_frames, not sk_buffs, thus mangled
      device tx bytes counters because skb->len is meaningless value, and even
      triggered oops due to general protection fault on freeing them.
      
      Since xdp send queues do not aquire locks, old xdp_frames should be
      freed only in virtnet_xdp_xmit(), so just skip free_old_xmit_skbs() for
      xdp send queues.
      
      Similarly virtnet_poll_tx() called free_old_xmit_skbs(). This NAPI
      handler is called even without calling start_xmit() because cb for tx is
      by default enabled. Once the handler is called, it enabled the cb again,
      and then the handler would be called again. We don't need this handler
      for XDP, so don't enable cb as well as not calling free_old_xmit_skbs().
      
      Also, we need to disable tx NAPI when disabling XDP, so
      virtnet_poll_tx() can safely access curr_queue_pairs and
      xdp_queue_pairs, which are not atomically updated while disabling XDP.
      
      Fixes: b92f1e67 ("virtio-net: transmit napi")
      Fixes: 7b0411ef ("virtio-net: clean tx descriptors from rx napi")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      534da5e8
    • Toshiaki Makita's avatar
      virtio_net: Don't enable NAPI when interface is down · 8be4d9a4
      Toshiaki Makita authored
      Commit 4e09ff53 ("virtio-net: disable NAPI only when enabled during
      XDP set") tried to fix inappropriate NAPI enabling/disabling when
      !netif_running(), but was not complete.
      
      On error path virtio_net could enable NAPI even when !netif_running().
      This can cause enabling NAPI twice on virtnet_open(), which would
      trigger BUG_ON() in napi_enable().
      
      Fixes: 4941d472 ("virtio-net: do not reset during XDP set")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8be4d9a4
    • David S. Miller's avatar
      Merge branch 'erspan-always-reports-output-key-to-userspace' · 41ef81be
      David S. Miller authored
      Lorenzo Bianconi says:
      
      ====================
      erspan: always reports output key to userspace
      
      Erspan protocol relies on output key to set session id header field.
      However TUNNEL_KEY bit is cleared in order to not add key field to
      the external GRE header and so the configured o_key is not reported
      to userspace.
      Fix the issue adding TUNNEL_KEY bit to the o_flags parameter dumping
      device info
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      41ef81be
    • Lorenzo Bianconi's avatar
      net: ip6_gre: always reports o_key to userspace · c706863b
      Lorenzo Bianconi authored
      As Erspan_v4, Erspan_v6 protocol relies on o_key to configure
      session id header field. However TUNNEL_KEY bit is cleared in
      ip6erspan_tunnel_xmit since ERSPAN protocol does not set the key field
      of the external GRE header and so the configured o_key is not reported
      to userspace. The issue can be triggered with the following reproducer:
      
      $ip link add ip6erspan1 type ip6erspan local 2000::1 remote 2000::2 \
          key 1 seq erspan_ver 1
      $ip link set ip6erspan1 up
      ip -d link sh ip6erspan1
      
      ip6erspan1@NONE: <BROADCAST,MULTICAST> mtu 1422 qdisc noop state DOWN mode DEFAULT
          link/ether ba:ff:09:24:c3:0e brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
          ip6erspan remote 2000::2 local 2000::1 encaplimit 4 flowlabel 0x00000 ikey 0.0.0.1 iseq oseq
      
      Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
      ip6gre_fill_info
      
      Fixes: 5a963eb6 ("ip6_gre: Add ERSPAN native tunnel support")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c706863b
    • Lorenzo Bianconi's avatar
      net: ip_gre: always reports o_key to userspace · feaf5c79
      Lorenzo Bianconi authored
      Erspan protocol (version 1 and 2) relies on o_key to configure
      session id header field. However TUNNEL_KEY bit is cleared in
      erspan_xmit since ERSPAN protocol does not set the key field
      of the external GRE header and so the configured o_key is not reported
      to userspace. The issue can be triggered with the following reproducer:
      
      $ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
          key 1 seq erspan_ver 1
      $ip link set erspan1 up
      $ip -d link sh erspan1
      
      erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
        link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
        erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0
      
      Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
      ipgre_fill_info
      
      Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      feaf5c79
    • Mathias Thore's avatar
      ucc_geth: Reset BQL queue when stopping device · e15aa3b2
      Mathias Thore authored
      After a timeout event caused by for example a broadcast storm, when
      the MAC and PHY are reset, the BQL TX queue needs to be reset as
      well. Otherwise, the device will exhibit severe performance issues
      even after the storm has ended.
      Co-authored-by: default avatarDavid Gounaris <david.gounaris@infinera.com>
      Signed-off-by: default avatarMathias Thore <mathias.thore@infinera.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e15aa3b2
    • David S. Miller's avatar
      Merge branch 'net-various-compat-ioctl-fixes' · 794827f3
      David S. Miller authored
      Johannes Berg says:
      
      ====================
      various compat ioctl fixes
      
      Back a long time ago, I already fixed a few of these by passing
      the size of the struct ifreq to do_sock_ioctl(). However, Robert
      found more cases, and now it won't be as simple because we'd have
      to pass that down all the way to e.g. bond_do_ioctl() which isn't
      really feasible.
      
      Therefore, restore the old code.
      
      While looking at why SIOCGIFNAME was broken, I realized that Al
      had removed that case - which had been handled in an explicit
      separate function - as well, and looking through his work at the
      time I saw that bond ioctls were also affected by the erroneous
      removal.
      
      I've restored SIOCGIFNAME and bond ioctls by going through the
      (now renamed) dev_ifsioc() instead of reintroducing their own
      helper functions, which I hope is correct but have only tested
      with SIOCGIFNAME.
      ====================
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      794827f3
    • Johannes Berg's avatar
      net: socket: make bond ioctls go through compat_ifreq_ioctl() · 98406133
      Johannes Berg authored
      Same story as before, these use struct ifreq and thus need
      to be read with the shorter version to not cause faults.
      
      Cc: stable@vger.kernel.org
      Fixes: f92d4fc9 ("kill bond_ioctl()")
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      98406133
    • Johannes Berg's avatar
      net: socket: fix SIOCGIFNAME in compat · c6c9fee3
      Johannes Berg authored
      As reported by Robert O'Callahan in
      https://bugzilla.kernel.org/show_bug.cgi?id=202273
      reverting the previous changes in this area broke
      the SIOCGIFNAME ioctl in compat again (I'd previously
      fixed it after his previous report of breakage in
      https://bugzilla.kernel.org/show_bug.cgi?id=199469).
      
      This is obviously because I fixed SIOCGIFNAME more or
      less by accident.
      
      Fix it explicitly now by making it pass through the
      restored compat translation code.
      
      Cc: stable@vger.kernel.org
      Fixes: 4cf808e7 ("kill dev_ifname32()")
      Reported-by: default avatarRobert O'Callahan <robert@ocallahan.org>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c6c9fee3
    • Johannes Berg's avatar
      Revert "kill dev_ifsioc()" · 37ac39bd
      Johannes Berg authored
      This reverts commit bf440573 ("kill dev_ifsioc()").
      
      This wasn't really unused as implied by the original commit,
      it still handles the copy to/from user differently, and the
      commit thus caused issues such as
        https://bugzilla.kernel.org/show_bug.cgi?id=199469
      and
        https://bugzilla.kernel.org/show_bug.cgi?id=202273
      
      However, deviating from a strict revert, rename dev_ifsioc()
      to compat_ifreq_ioctl() to be clearer as to its purpose and
      add a comment.
      
      Cc: stable@vger.kernel.org
      Fixes: bf440573 ("kill dev_ifsioc()")
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      37ac39bd
    • Johannes Berg's avatar
      Revert "socket: fix struct ifreq size in compat ioctl" · 63ff03ab
      Johannes Berg authored
      This reverts commit 1cebf8f1 ("socket: fix struct ifreq
      size in compat ioctl"), it's a bugfix for another commit that
      I'll revert next.
      
      This is not a 'perfect' revert, I'm keeping some coding style
      intact rather than revert to the state with indentation errors.
      
      Cc: stable@vger.kernel.org
      Fixes: 1cebf8f1 ("socket: fix struct ifreq size in compat ioctl")
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63ff03ab
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 62967898
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Need to save away the IV across tls async operations, from Dave
          Watson.
      
       2) Upon successful packet processing, we should liberate the SKB with
          dev_consume_skb{_irq}(). From Yang Wei.
      
       3) Only apply RX hang workaround on effected macb chips, from Harini
          Katakam.
      
       4) Dummy netdev need a proper namespace assigned to them, from Josh
          Elsasser.
      
       5) Some paths of nft_compat run lockless now, and thus we need to use a
          proper refcnt_t. From Florian Westphal.
      
       6) Avoid deadlock in mlx5 by doing IRQ locking, from Moni Shoua.
      
       7) netrom does not refcount sockets properly wrt. timers, fix that by
          using the sock timer API. From Cong Wang.
      
       8) Fix locking of inexact inserts of xfrm policies, from Florian
          Westphal.
      
       9) Missing xfrm hash generation bump, also from Florian.
      
      10) Missing of_node_put() in hns driver, from Yonglong Liu.
      
      11) Fix DN_IFREQ_SIZE, from Johannes Berg.
      
      12) ip6mr notifier is invoked during traversal of wrong table, from Nir
          Dotan.
      
      13) TX promisc settings not performed correctly in qed, from Manish
          Chopra.
      
      14) Fix OOB access in vhost, from Jason Wang.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
        MAINTAINERS: Add entry for XDP (eXpress Data Path)
        net: set default network namespace in init_dummy_netdev()
        net: b44: replace dev_kfree_skb_xxx by dev_consume_skb_xxx for drop profiles
        net: caif: call dev_consume_skb_any when skb xmit done
        net: 8139cp: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
        net: macb: Apply RXUBR workaround only to versions with errata
        net: ti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
        net: apple: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
        net: amd8111e: replace dev_kfree_skb_irq by dev_consume_skb_irq
        net: alteon: replace dev_kfree_skb_irq by dev_consume_skb_irq
        net: tls: Fix deadlock in free_resources tx
        net: tls: Save iv in tls_rec for async crypto requests
        vhost: fix OOB in get_rx_bufs()
        qed: Fix stack out of bounds bug
        qed: Fix system crash in ll2 xmit
        qed: Fix VF probe failure while FLR
        qed: Fix LACP pdu drops for VFs
        qed: Fix bug in tx promiscuous mode settings
        net: i825xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
        netfilter: ipt_CLUSTERIP: fix warning unused variable cn
        ...
      62967898
  3. 29 Jan, 2019 11 commits