1. 17 Jan, 2020 3 commits
    • Cong Wang's avatar
      net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key() · 53d37497
      Cong Wang authored
      syzbot reported some bogus lockdep warnings, for example bad unlock
      balance in sch_direct_xmit(). They are due to a race condition between
      slow path and fast path, that is qdisc_xmit_lock_key gets re-registered
      in netdev_update_lockdep_key() on slow path, while we could still
      acquire the queue->_xmit_lock on fast path in this small window:
      
      CPU A						CPU B
      						__netif_tx_lock();
      lockdep_unregister_key(qdisc_xmit_lock_key);
      						__netif_tx_unlock();
      lockdep_register_key(qdisc_xmit_lock_key);
      
      In fact, unlike the addr_list_lock which has to be reordered when
      the master/slave device relationship changes, queue->_xmit_lock is
      only acquired on fast path and only when NETIF_F_LLTX is not set,
      so there is likely no nested locking for it.
      
      Therefore, we can just get rid of re-registration of
      qdisc_xmit_lock_key.
      
      Reported-by: syzbot+4ec99438ed7450da6272@syzkaller.appspotmail.com
      Fixes: ab92d68f ("net: core: add generic lockdep keys")
      Cc: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      53d37497
    • Eric Dumazet's avatar
      net/sched: act_ife: initalize ife->metalist earlier · 44c23d71
      Eric Dumazet authored
      It seems better to init ife->metalist earlier in tcf_ife_init()
      to avoid the following crash :
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 10483 Comm: syz-executor216 Not tainted 5.5.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:_tcf_ife_cleanup net/sched/act_ife.c:412 [inline]
      RIP: 0010:tcf_ife_cleanup+0x6e/0x400 net/sched/act_ife.c:431
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 94 03 00 00 49 8b bd f8 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8d 67 e8 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 5c 03 00 00 48 bb 00 00 00 00 00 fc ff df 48 8b
      RSP: 0018:ffffc90001dc6d00 EFLAGS: 00010246
      RAX: dffffc0000000000 RBX: ffffffff864619c0 RCX: ffffffff815bfa09
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
      RBP: ffffc90001dc6d50 R08: 0000000000000004 R09: fffff520003b8d8e
      R10: fffff520003b8d8d R11: 0000000000000003 R12: ffffffffffffffe8
      R13: ffff8880a79fc000 R14: ffff88809aba0e00 R15: 0000000000000000
      FS:  0000000001b51880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000563f52cce140 CR3: 0000000093541000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tcf_action_cleanup+0x62/0x1b0 net/sched/act_api.c:119
       __tcf_action_put+0xfa/0x130 net/sched/act_api.c:135
       __tcf_idr_release net/sched/act_api.c:165 [inline]
       __tcf_idr_release+0x59/0xf0 net/sched/act_api.c:145
       tcf_idr_release include/net/act_api.h:171 [inline]
       tcf_ife_init+0x97c/0x1870 net/sched/act_ife.c:616
       tcf_action_init_1+0x6b6/0xa40 net/sched/act_api.c:944
       tcf_action_init+0x21a/0x330 net/sched/act_api.c:1000
       tcf_action_add+0xf5/0x3b0 net/sched/act_api.c:1410
       tc_ctl_action+0x390/0x488 net/sched/act_api.c:1465
       rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5424
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x58c/0x7d0 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:639 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:659
       ____sys_sendmsg+0x753/0x880 net/socket.c:2330
       ___sys_sendmsg+0x100/0x170 net/socket.c:2384
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2417
       __do_sys_sendmsg net/socket.c:2426 [inline]
       __se_sys_sendmsg net/socket.c:2424 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2424
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: 11a94d7f ("net/sched: act_ife: validate the control action inside init()")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      44c23d71
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · a72b6a1e
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix use-after-free in ipset bitmap destroy path, from Cong Wang.
      
      2) Missing init netns in entry cleanup path of arp_tables,
         from Florian Westphal.
      
      3) Fix WARN_ON in set destroy path due to missing cleanup on
         transaction error.
      
      4) Incorrect netlink sanity check in tunnel, from Florian Westphal.
      
      5) Missing sanity check for erspan version netlink attribute, also
         from Florian.
      
      6) Remove WARN in nft_request_module() that can be triggered from
         userspace, from Florian Westphal.
      
      7) Memleak in NFTA_HOOK_DEVS netlink parser, from Dan Carpenter.
      
      8) List poison from commit path for flowtables that are added and
         deleted in the same batch, from Florian Westphal.
      
      9) Fix NAT ICMP packet corruption, from Eyal Birger.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a72b6a1e
  2. 16 Jan, 2020 14 commits
  3. 15 Jan, 2020 23 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-sockmap-tls-fixes' · 85ddd9c3
      Daniel Borkmann authored
      John Fastabend says:
      
      ====================
      To date our usage of sockmap/tls has been fairly simple, the BPF programs
      did only well-defined pop, push, pull and apply/cork operations.
      
      Now that we started to push more complex programs into sockmap we uncovered
      a series of issues addressed here. Further OpenSSL3.0 version should be
      released soon with kTLS support so its important to get any remaining
      issues on BPF and kTLS support resolved.
      
      Additionally, I have a patch under development to allow sockmap to be
      enabled/disabled at runtime for Cilium endpoints. This allows us to stress
      the map insert/delete with kTLS more than previously where Cilium only
      added the socket to the map when it entered ESTABLISHED state and never
      touched it from the control path side again relying on the sockets own
      close() hook to remove it.
      
      To test I have a set of test cases in test_sockmap.c that expose these
      issues. Once we get fixes here merged and in bpf-next I'll submit the
      tests to bpf-next tree to ensure we don't regress again. Also I've run
      these patches in the Cilium CI with OpenSSL (master branch) this will
      run tools such as netperf, ab, wrk2, curl, etc. to get a broad set of
      testing.
      
      I'm aware of two more issues that we are working to resolve in another
      couple (probably two) patches. First we see an auth tag corruption in
      kTLS when sending small 1byte chunks under stress. I've not pinned this
      down yet. But, guessing because its under 1B stress tests it must be
      some error path being triggered. And second we need to ensure BPF RX
      programs are not skipped when kTLS ULP is loaded. This breaks some of the
      sockmap selftests when running with kTLS. I'll send a follow up for this.
      
      v2: I dropped a patch that added !0 size check in tls_push_record
          this originated from a panic I caught awhile ago with a trace
          in the crypto stack. But I can not reproduce it anymore so will
          dig into that and send another patch later if needed. Anyways
          after a bit of thought it would be nicer if tls/crypto/bpf didn't
          require special case handling for the !0 size.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      85ddd9c3
    • John Fastabend's avatar
      bpf: Sockmap/tls, fix pop data with SK_DROP return code · 7361d448
      John Fastabend authored
      When user returns SK_DROP we need to reset the number of copied bytes
      to indicate to the user the bytes were dropped and not sent. If we
      don't reset the copied arg sendmsg will return as if those bytes were
      copied giving the user a positive return value.
      
      This works as expected today except in the case where the user also
      pops bytes. In the pop case the sg.size is reduced but we don't correctly
      account for this when copied bytes is reset. The popped bytes are not
      accounted for and we return a small positive value potentially confusing
      the user.
      
      The reason this happens is due to a typo where we do the wrong comparison
      when accounting for pop bytes. In this fix notice the if/else is not
      needed and that we have a similar problem if we push data except its not
      visible to the user because if delta is larger the sg.size we return a
      negative value so it appears as an error regardless.
      
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
      7361d448
    • John Fastabend's avatar
      bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining · 9aaaa568
      John Fastabend authored
      Its possible through a set of push, pop, apply helper calls to construct
      a skmsg, which is just a ring of scatterlist elements, with the start
      value larger than the end value. For example,
      
            end       start
        |_0_|_1_| ... |_n_|_n+1_|
      
      Where end points at 1 and start points and n so that valid elements is
      the set {n, n+1, 0, 1}.
      
      Currently, because we don't build the correct chain only {n, n+1} will
      be sent. This adds a check and sg_chain call to correctly submit the
      above to the crypto and tls send path.
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-8-john.fastabend@gmail.com
      9aaaa568
    • John Fastabend's avatar
      bpf: Sockmap/tls, tls_sw can create a plaintext buf > encrypt buf · d468e477
      John Fastabend authored
      It is possible to build a plaintext buffer using push helper that is larger
      than the allocated encrypt buffer. When this record is pushed to crypto
      layers this can result in a NULL pointer dereference because the crypto
      API expects the encrypt buffer is large enough to fit the plaintext
      buffer. Kernel splat below.
      
      To resolve catch the cases this can happen and split the buffer into two
      records to send individually. Unfortunately, there is still one case to
      handle where the split creates a zero sized buffer. In this case we merge
      the buffers and unmark the split. This happens when apply is zero and user
      pushed data beyond encrypt buffer. This fixes the original case as well
      because the split allocated an encrypt buffer larger than the plaintext
      buffer and the merge simply moves the pointers around so we now have
      a reference to the new (larger) encrypt buffer.
      
      Perhaps its not ideal but it seems the best solution for a fixes branch
      and avoids handling these two cases, (a) apply that needs split and (b)
      non apply case. The are edge cases anyways so optimizing them seems not
      necessary unless someone wants later in next branches.
      
      [  306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [...]
      [  306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
      [...]
      [  306.770350] Call Trace:
      [  306.770956]  scatterwalk_map_and_copy+0x6c/0x80
      [  306.772026]  gcm_enc_copy_hash+0x4b/0x50
      [  306.772925]  gcm_hash_crypt_remain_continue+0xef/0x110
      [  306.774138]  gcm_hash_crypt_continue+0xa1/0xb0
      [  306.775103]  ? gcm_hash_crypt_continue+0xa1/0xb0
      [  306.776103]  gcm_hash_assoc_remain_continue+0x94/0xa0
      [  306.777170]  gcm_hash_assoc_continue+0x9d/0xb0
      [  306.778239]  gcm_hash_init_continue+0x8f/0xa0
      [  306.779121]  gcm_hash+0x73/0x80
      [  306.779762]  gcm_encrypt_continue+0x6d/0x80
      [  306.780582]  crypto_gcm_encrypt+0xcb/0xe0
      [  306.781474]  crypto_aead_encrypt+0x1f/0x30
      [  306.782353]  tls_push_record+0x3b9/0xb20 [tls]
      [  306.783314]  ? sk_psock_msg_verdict+0x199/0x300
      [  306.784287]  bpf_exec_tx_verdict+0x3f2/0x680 [tls]
      [  306.785357]  tls_sw_sendmsg+0x4a3/0x6a0 [tls]
      
      test_sockmap test signature to trigger bug,
      
      [TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com
      d468e477
    • John Fastabend's avatar
      bpf: Sockmap/tls, msg_push_data may leave end mark in place · cf21e9ba
      John Fastabend authored
      Leaving an incorrect end mark in place when passing to crypto
      layer will cause crypto layer to stop processing data before
      all data is encrypted. To fix clear the end mark on push
      data instead of expecting users of the helper to clear the
      mark value after the fact.
      
      This happens when we push data into the middle of a skmsg and
      have room for it so we don't do a set of copies that already
      clear the end flag.
      
      Fixes: 6fff607e ("bpf: sk_msg program helper bpf_msg_push_data")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-6-john.fastabend@gmail.com
      cf21e9ba
    • John Fastabend's avatar
      bpf: Sockmap, skmsg helper overestimates push, pull, and pop bounds · 6562e29c
      John Fastabend authored
      In the push, pull, and pop helpers operating on skmsg objects to make
      data writable or insert/remove data we use this bounds check to ensure
      specified data is valid,
      
       /* Bounds checks: start and pop must be inside message */
       if (start >= offset + l || last >= msg->sg.size)
           return -EINVAL;
      
      The problem here is offset has already included the length of the
      current element the 'l' above. So start could be past the end of
      the scatterlist element in the case where start also points into an
      offset on the last skmsg element.
      
      To fix do the accounting slightly different by adding the length of
      the previous entry to offset at the start of the iteration. And
      ensure its initialized to zero so that the first iteration does
      nothing.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Fixes: 6fff607e ("bpf: sk_msg program helper bpf_msg_push_data")
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-5-john.fastabend@gmail.com
      6562e29c
    • John Fastabend's avatar
      bpf: Sockmap/tls, push write_space updates through ulp updates · 33bfe20d
      John Fastabend authored
      When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
      and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
      don't push the write_space callback up and instead simply overwrite the
      op with the psock stored previous op. This may or may not be correct so
      to ensure we don't overwrite the TLS write space hook pass this field to
      the ULP and have it fixup the ctx.
      
      This completes a previous fix that pushed the ops through to the ULP
      but at the time missed doing this for write_space, presumably because
      write_space TLS hook was added around the same time.
      
      Fixes: 95fa1454 ("bpf: sockmap/tls, close can race with map free")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
      33bfe20d
    • John Fastabend's avatar
      bpf: Sockmap, ensure sock lock held during tear down · 7e81a353
      John Fastabend authored
      The sock_map_free() and sock_hash_free() paths used to delete sockmap
      and sockhash maps walk the maps and destroy psock and bpf state associated
      with the socks in the map. When done the socks no longer have BPF programs
      attached and will function normally. This can happen while the socks in
      the map are still "live" meaning data may be sent/received during the walk.
      
      Currently, though we don't take the sock_lock when the psock and bpf state
      is removed through this path. Specifically, this means we can be writing
      into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
      while they are also being called from the networking side. This is not
      safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
      believed it was safe. Further its not clear to me its even a good idea
      to try and do this on "live" sockets while networking side might also
      be using the socket. Instead of trying to reason about using the socks
      from both sides lets realize that every use case I'm aware of rarely
      deletes maps, in fact kubernetes/Cilium case builds map at init and
      never tears it down except on errors. So lets do the simple fix and
      grab sock lock.
      
      This patch wraps sock deletes from maps in sock lock and adds some
      annotations so we catch any other cases easier.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
      7e81a353
    • John Fastabend's avatar
      bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop · 4da6a196
      John Fastabend authored
      When a sockmap is free'd and a socket in the map is enabled with tls
      we tear down the bpf context on the socket, the psock struct and state,
      and then call tcp_update_ulp(). The tcp_update_ulp() call is to inform
      the tls stack it needs to update its saved sock ops so that when the tls
      socket is later destroyed it doesn't try to call the now destroyed psock
      hooks.
      
      This is about keeping stacked ULPs in good shape so they always have
      the right set of stacked ops.
      
      However, recently unhash() hook was removed from TLS side. But, the
      sockmap/bpf side is not doing any extra work to update the unhash op
      when is torn down instead expecting TLS side to manage it. So both
      TLS and sockmap believe the other side is managing the op and instead
      no one updates the hook so it continues to point at tcp_bpf_unhash().
      When unhash hook is called we call tcp_bpf_unhash() which detects the
      psock has already been destroyed and calls sk->sk_prot_unhash() which
      calls tcp_bpf_unhash() yet again and so on looping and hanging the core.
      
      To fix have sockmap tear down logic fixup the stale pointer.
      
      Fixes: 5d92e631 ("net/tls: partially revert fix transition through disconnect with close")
      Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-2-john.fastabend@gmail.com
      4da6a196
    • David S. Miller's avatar
      Merge branch 'stmmac-Fix-selftests-in-Synopsys-AXS101-board' · 567110f1
      David S. Miller authored
      Jose Abreu says:
      
      ====================
      net: stmmac: Fix selftests in Synopsys AXS101 board
      
      Set of fixes for sefltests so that they work in Synopsys AXS101 board.
      
      Final output:
      
      $ ethtool -t eth0
      The test result is PASS
      The test extra info:
       1. MAC Loopback                 0
       2. PHY Loopback                 -95
       3. MMC Counters                 0
       4. EEE                          -95
       5. Hash Filter MC               0
       6. Perfect Filter UC            0
       7. MC Filter                    0
       8. UC Filter                    0
       9. Flow Control                 -95
      10. RSS                          -95
      11. VLAN Filtering               -95
      12. VLAN Filtering (perf)        -95
      13. Double VLAN Filter           -95
      14. Double VLAN Filter (perf)    -95
      15. Flexible RX Parser           -95
      16. SA Insertion (desc)          -95
      17. SA Replacement (desc)        -95
      18. SA Insertion (reg)           -95
      19. SA Replacement (reg)         -95
      20. VLAN TX Insertion            -95
      21. SVLAN TX Insertion           -95
      22. L3 DA Filtering              -95
      23. L3 SA Filtering              -95
      24. L4 DA TCP Filtering          -95
      25. L4 SA TCP Filtering          -95
      26. L4 DA UDP Filtering          -95
      27. L4 SA UDP Filtering          -95
      28. ARP Offload                  -95
      29. Jumbo Frame                  0
      30. Multichannel Jumbo           -95
      31. Split Header                 -95
      
      Description:
      
      1) Fixes the unaligned accesses that caused CPU halt in Synopsys AXS101
      boards.
      
      2) Fixes the VLAN tests when filtering failed to work.
      
      3) Fixes the VLAN Perfect tests when filtering is not available in HW.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      567110f1
    • Jose Abreu's avatar
      net: stmmac: selftests: Guard VLAN Perfect test against non supported HW · 4eee13f1
      Jose Abreu authored
      When HW does not support perfect filtering the feature will not be
      enabled in the net_device. Add a check for this to prevent failures.
      
      Fixes: 1b2250a0 ("net: stmmac: selftests: Add tests for VLAN Perfect Filtering")
      Signed-off-by: default avatarJose Abreu <Jose.Abreu@synopsys.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4eee13f1
    • Jose Abreu's avatar
      net: stmmac: selftests: Mark as fail when received VLAN ID != expected · d39b68e5
      Jose Abreu authored
      When the VLAN ID does not match the expected one it means filter failed
      in HW. Fix it.
      
      Fixes: 94e18382 ("net: stmmac: selftests: Add selftest for VLAN TX Offload")
      Signed-off-by: default avatarJose Abreu <Jose.Abreu@synopsys.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d39b68e5
    • Jose Abreu's avatar
      net: stmmac: selftests: Make it work in Synopsys AXS101 boards · 0b9f932e
      Jose Abreu authored
      Synopsys AXS101 boards do not support unaligned memory loads or stores.
      Change the selftests mechanism to explicity:
      - Not add extra alignment in TX SKB
      - Use the unaligned version of ether_addr_equal()
      
      Fixes: 091810db ("net: stmmac: Introduce selftests support")
      Signed-off-by: default avatarJose Abreu <Jose.Abreu@synopsys.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0b9f932e
    • Colin Ian King's avatar
      net/wan/fsl_ucc_hdlc: fix out of bounds write on array utdm_info · ddf42039
      Colin Ian King authored
      Array utdm_info is declared as an array of MAX_HDLC_NUM (4) elements
      however up to UCC_MAX_NUM (8) elements are potentially being written
      to it.  Currently we have an array out-of-bounds write error on the
      last 4 elements. Fix this by making utdm_info UCC_MAX_NUM elements in
      size.
      
      Addresses-Coverity: ("Out-of-bounds write")
      Fixes: c19b6d24 ("drivers/net: support hdlc function for QE-UCC")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ddf42039
    • David S. Miller's avatar
      Merge tag 'batadv-net-for-davem-20200114' of git://git.open-mesh.org/linux-merge · 5a40420e
      David S. Miller authored
      Simon Wunderlich says:
      
      ====================
      Here is a batman-adv bugfix:
      
       - Fix DAT candidate selection on little endian systems,
         by Sven Eckelmann
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5a40420e
    • Daniel Borkmann's avatar
      bpf: Fix incorrect verifier simulation of ARSH under ALU32 · 0af2ffc9
      Daniel Borkmann authored
      Anatoly has been fuzzing with kBdysch harness and reported a hang in one
      of the outcomes:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (85) call bpf_get_socket_cookie#46
        1: R0_w=invP(id=0) R10=fp0
        1: (57) r0 &= 808464432
        2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        2: (14) w0 -= 810299440
        3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
        3: (c4) w0 s>>= 1
        4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
        4: (76) if w0 s>= 0x30303030 goto pc+216
        221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
        221: (95) exit
        processed 6 insns (limit 1000000) [...]
      
      Taking a closer look, the program was xlated as follows:
      
        # ./bpftool p d x i 12
        0: (85) call bpf_get_socket_cookie#7800896
        1: (bf) r6 = r0
        2: (57) r6 &= 808464432
        3: (14) w6 -= 810299440
        4: (c4) w6 s>>= 1
        5: (76) if w6 s>= 0x30303030 goto pc+216
        6: (05) goto pc-1
        7: (05) goto pc-1
        8: (05) goto pc-1
        [...]
        220: (05) goto pc-1
        221: (05) goto pc-1
        222: (95) exit
      
      Meaning, the visible effect is very similar to f54c7898 ("bpf: Fix
      precision tracking for unbounded scalars"), that is, the fall-through
      branch in the instruction 5 is considered to be never taken given the
      conclusion from the min/max bounds tracking in w6, and therefore the
      dead-code sanitation rewrites it as goto pc-1. However, real-life input
      disagrees with verification analysis since a soft-lockup was observed.
      
      The bug sits in the analysis of the ARSH. The definition is that we shift
      the target register value right by K bits through shifting in copies of
      its sign bit. In adjust_scalar_min_max_vals(), we do first coerce the
      register into 32 bit mode, same happens after simulating the operation.
      However, for the case of simulating the actual ARSH, we don't take the
      mode into account and act as if it's always 64 bit, but location of sign
      bit is different:
      
        dst_reg->smin_value >>= umin_val;
        dst_reg->smax_value >>= umin_val;
        dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
      
      Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
      for example return 0xffff. With the above ARSH simulation, we'd see the
      following results:
      
        [...]
        1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
        1: (85) call bpf_get_socket_cookie#46
        2: R0_w=invP(id=0) R10=fp0
        2: (57) r0 &= 808464432
          -> R0_runtime = 0x3030
        3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        3: (14) w0 -= 810299440
          -> R0_runtime = 0xcfb40000
        4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
                                    (0xffffffff)
        4: (c4) w0 s>>= 1
          -> R0_runtime = 0xe7da0000
        5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
                                    (0x67c00000)           (0x7ffbfff8)
        [...]
      
      In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
      0100 0000 0000 0000 0000', the result after the shift has 0xe7da0000 that
      is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
      retained in 32 bit mode. In insn4, the umax was 0xffffffff, and changed into
      0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000'
      and means here that the simulation didn't retain the sign bit. With above
      logic, the updates happen on the 64 bit min/max bounds and given we coerced
      the register, the sign bits of the bounds are cleared as well, meaning, we
      need to force the simulation into s32 space for 32 bit alu mode.
      
      Verification after the fix below. We're first analyzing the fall-through branch
      on 32 bit signed >= test eventually leading to rejection of the program in this
      specific case:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (b7) r2 = 808464432
        1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
        1: (85) call bpf_get_socket_cookie#46
        2: R0_w=invP(id=0) R10=fp0
        2: (bf) r6 = r0
        3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
        3: (57) r6 &= 808464432
        4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
        4: (14) w6 -= 810299440
        5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
        5: (c4) w6 s>>= 1
        6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
                                                    (0x67c00000)          (0xfffbfff8)
        6: (76) if w6 s>= 0x30303030 goto pc+216
        7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
        7: (30) r0 = *(u8 *)skb[808464432]
        BPF_LD_[ABS|IND] uses reserved fields
        processed 8 insns (limit 1000000) [...]
      
      Fixes: 9cbe1f5a ("bpf/verifier: improve register value range tracking with ARSH")
      Reported-by: default avatarAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net
      0af2ffc9
    • Mohammed Gamal's avatar
      hv_netvsc: Fix memory leak when removing rndis device · 536dc5df
      Mohammed Gamal authored
      kmemleak detects the following memory leak when hot removing
      a network device:
      
      unreferenced object 0xffff888083f63600 (size 256):
        comm "kworker/0:1", pid 12, jiffies 4294831717 (age 1113.676s)
        hex dump (first 32 bytes):
          00 40 c7 33 80 88 ff ff 00 00 00 00 10 00 00 00  .@.3............
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
        backtrace:
          [<00000000d4a8f5be>] rndis_filter_device_add+0x117/0x11c0 [hv_netvsc]
          [<000000009c02d75b>] netvsc_probe+0x5e7/0xbf0 [hv_netvsc]
          [<00000000ddafce23>] vmbus_probe+0x74/0x170 [hv_vmbus]
          [<00000000046e64f1>] really_probe+0x22f/0xb50
          [<000000005cc35eb7>] driver_probe_device+0x25e/0x370
          [<0000000043c642b2>] bus_for_each_drv+0x11f/0x1b0
          [<000000005e3d09f0>] __device_attach+0x1c6/0x2f0
          [<00000000a72c362f>] bus_probe_device+0x1a6/0x260
          [<0000000008478399>] device_add+0x10a3/0x18e0
          [<00000000cf07b48c>] vmbus_device_register+0xe7/0x1e0 [hv_vmbus]
          [<00000000d46cf032>] vmbus_add_channel_work+0x8ab/0x1770 [hv_vmbus]
          [<000000002c94bb64>] process_one_work+0x919/0x17d0
          [<0000000096de6781>] worker_thread+0x87/0xb40
          [<00000000fbe7397e>] kthread+0x333/0x3f0
          [<000000004f844269>] ret_from_fork+0x3a/0x50
      
      rndis_filter_device_add() allocates an instance of struct rndis_device
      which never gets deallocated as rndis_filter_device_remove() sets
      net_device->extension which points to the rndis_device struct to NULL,
      leaving the rndis_device dangling.
      
      Since net_device->extension is eventually freed in free_netvsc_device(),
      we refrain from setting it to NULL inside rndis_filter_device_remove()
      Signed-off-by: default avatarMohammed Gamal <mgamal@redhat.com>
      Reviewed-by: default avatarHaiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      536dc5df
    • Pengcheng Yang's avatar
      tcp: fix marked lost packets not being retransmitted · e176b1ba
      Pengcheng Yang authored
      When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
      retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
      If packet loss is detected at this time, retransmit_skb_hint will be set
      to point to the current packet loss in tcp_verify_retransmit_hint(),
      then the packets that were previously marked lost but not retransmitted
      due to the restriction of cwnd will be skipped and cannot be
      retransmitted.
      
      To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
      be reset only after all marked lost packets are retransmitted
      (retrans_out >= lost_out), otherwise we need to traverse from
      tcp_rtx_queue_head in tcp_xmit_retransmit_queue().
      
      Packetdrill to demonstrate:
      
      // Disable RACK and set max_reordering to keep things simple
          0 `sysctl -q net.ipv4.tcp_recovery=0`
         +0 `sysctl -q net.ipv4.tcp_max_reordering=3`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 8 data segments
         +0 write(4, ..., 8000) = 8000
         +0 > P. 1:8001(8000) ack 1
      
      // Enter recovery and 1:3001 is marked lost
       +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>
      
      // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
         +0 > . 1:1001(1000) ack 1
      
      // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
       +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
      // Now retransmit_skb_hint points to 4001:5001 which is now marked lost
      
      // BUG: 2001:3001 was not retransmitted
         +0 > . 2001:3001(1000) ack 1
      Signed-off-by: default avatarPengcheng Yang <yangpc@wangsu.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Tested-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e176b1ba
    • David S. Miller's avatar
      Merge branch 'mlxsw-Various-fixes' · 8b792f84
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Various fixes
      
      This patch set contains various fixes for mlxsw.
      
      Patch #1 splits the init() callback between Spectrum-2 and Spectrum-3 in
      order to avoid enforcing the same firmware version for both ASICs, as
      this can't possibly work. Without this patch the driver cannot boot with
      the Spectrum-3 ASIC.
      
      Patches #2-#3 fix a long standing race condition that was recently
      exposed while testing the driver on an emulator, which is very slow
      compared to the actual hardware. The problem is explained in detail in
      the commit messages.
      
      Patch #4 fixes a selftest.
      
      Patch #5 prevents offloaded qdiscs from presenting a non-zero backlog to
      the user when the netdev is down. This is done by clearing the cached
      backlog in the driver when the netdev goes down.
      
      Patch #6 fixes qdisc statistics (backlog and tail drops) to also take
      into account the multicast traffic classes.
      
      v2:
      * Patches #2-#3: use skb_cow_head() instead of skb_unshare() as
        suggested by Jakub. Remove unnecessary check regarding headroom
      * Patches #5-#6: new
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8b792f84
    • Petr Machata's avatar
      mlxsw: spectrum_qdisc: Include MC TCs in Qdisc counters · 85005b82
      Petr Machata authored
      mlxsw configures Spectrum in such a way that BUM traffic is passed not
      through its nominal traffic class TC, but through its MC counterpart TC+8.
      However, when collecting statistics, Qdiscs only look at the nominal TC and
      ignore the MC TC.
      
      Add two helpers to compute the value for logical TC from the constituents,
      one for backlog, the other for tail drops. Use them throughout instead of
      going through the xstats pointer directly.
      
      Counters for TX bytes and packets are deduced from packet priority
      counters, and therefore already include BUM traffic. wred_drop counter is
      irrelevant on MC TCs, because RED is not enabled on them.
      
      Fixes: 7b819530 ("mlxsw: spectrum: Configure MC-aware mode on mlxsw ports")
      Signed-off-by: default avatarPetr Machata <petrm@mellanox.com>
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      85005b82
    • Petr Machata's avatar
      mlxsw: spectrum: Wipe xstats.backlog of down ports · ca7609ff
      Petr Machata authored
      Per-port counter cache used by Qdiscs is updated periodically, unless the
      port is down. The fact that the cache is not updated for down ports is no
      problem for most counters, which are relative in nature. However, backlog
      is absolute in nature, and if there is a non-zero value in the cache around
      the time that the port goes down, that value just stays there. This value
      then leaks to offloaded Qdiscs that report non-zero backlog even if
      there (obviously) is no traffic.
      
      The HW does not keep backlog of a downed port, so do likewise: as the port
      goes down, wipe the backlog value from xstats.
      
      Fixes: 075ab8ad ("mlxsw: spectrum: Collect tclass related stats periodically")
      Signed-off-by: default avatarPetr Machata <petrm@mellanox.com>
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ca7609ff
    • Petr Machata's avatar
      selftests: mlxsw: qos_mc_aware: Fix mausezahn invocation · fef6d670
      Petr Machata authored
      Mausezahn does not recognize "own" as a keyword on source IP address. As a
      result, the MC stream is not running at all, and therefore no UC
      degradation can be observed even in principle.
      
      Fix the invocation, and tighten the test: due to the minimum shaper
      configured at the MC TCs, we always expect about 20% degradation. Fail the
      test if it is lower.
      
      Fixes: 573363a6 ("selftests: mlxsw: Add qos_lib.sh")
      Signed-off-by: default avatarPetr Machata <petrm@mellanox.com>
      Reported-by: default avatarAmit Cohen <amitc@mellanox.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fef6d670
    • Ido Schimmel's avatar
      mlxsw: switchx2: Do not modify cloned SKBs during xmit · 63963d0f
      Ido Schimmel authored
      The driver needs to prepend a Tx header to each packet it is
      transmitting. The header includes information such as the egress port
      and traffic class.
      
      The addition of the header requires the driver to modify the SKB's
      header and therefore it must not be shared. Otherwise, we risk hitting
      various race conditions.
      
      For example, when a packet is flooded (cloned) by the bridge driver to
      two switch ports swp1 and swp2:
      
      t0 - mlxsw_sp_port_xmit() is called for swp1. Tx header is prepended with
           swp1's port number
      t1 - mlxsw_sp_port_xmit() is called for swp2. Tx header is prepended with
           swp2's port number, overwriting swp1's port number
      t2 - The device processes data buffer from t0. Packet is transmitted via
           swp2
      t3 - The device processes data buffer from t1. Packet is transmitted via
           swp2
      
      Usually, the device is fast enough and transmits the packet before its
      Tx header is overwritten, but this is not the case in emulated
      environments.
      
      Fix this by making sure the SKB's header is writable by calling
      skb_cow_head(). Since the function ensures we have headroom to push the
      Tx header, the check further in the function can be removed.
      
      v2:
      * Use skb_cow_head() instead of skb_unshare() as suggested by Jakub
      * Remove unnecessary check regarding headroom
      
      Fixes: 31557f0f ("mlxsw: Introduce Mellanox SwitchX-2 ASIC support")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reported-by: default avatarShalom Toledo <shalomt@mellanox.com>
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63963d0f