1. 12 Sep, 2019 1 commit
    • tun: fix use-after-free when register netdev failed · 77f22f92
      Yang Yingliang authored
      I got a UAF report in the tun driver during fuzz testing:
      
      [  466.269490] ==================================================================
      [  466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0
      [  466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699
      [  466.271810]
      [  466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427
      [  466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      [  466.271838] Call Trace:
      [  466.271858]  dump_stack+0xca/0x13e
      [  466.271871]  ? tun_chr_read_iter+0x2ca/0x2d0
      [  466.271890]  print_address_description+0x79/0x440
      [  466.271906]  ? vprintk_func+0x5e/0xf0
      [  466.271920]  ? tun_chr_read_iter+0x2ca/0x2d0
      [  466.271935]  __kasan_report+0x15c/0x1df
      [  466.271958]  ? tun_chr_read_iter+0x2ca/0x2d0
      [  466.271976]  kasan_report+0xe/0x20
      [  466.271987]  tun_chr_read_iter+0x2ca/0x2d0
      [  466.272013]  do_iter_readv_writev+0x4b7/0x740
      [  466.272032]  ? default_llseek+0x2d0/0x2d0
      [  466.272072]  do_iter_read+0x1c5/0x5e0
      [  466.272110]  vfs_readv+0x108/0x180
      [  466.299007]  ? compat_rw_copy_check_uvector+0x440/0x440
      [  466.299020]  ? fsnotify+0x888/0xd50
      [  466.299040]  ? __fsnotify_parent+0xd0/0x350
      [  466.299064]  ? fsnotify_first_mark+0x1e0/0x1e0
      [  466.304548]  ? vfs_write+0x264/0x510
      [  466.304569]  ? ksys_write+0x101/0x210
      [  466.304591]  ? do_preadv+0x116/0x1a0
      [  466.304609]  do_preadv+0x116/0x1a0
      [  466.309829]  do_syscall_64+0xc8/0x600
      [  466.309849]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  466.309861] RIP: 0033:0x4560f9
      [  466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      [  466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127
      [  466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9
      [  466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
      [  466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000
      [  466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10
      [  466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000
      [  466.323057]
      [  466.323064] Allocated by task 2605:
      [  466.335165]  save_stack+0x19/0x80
      [  466.336240]  __kasan_kmalloc.constprop.8+0xa0/0xd0
      [  466.337755]  kmem_cache_alloc+0xe8/0x320
      [  466.339050]  getname_flags+0xca/0x560
      [  466.340229]  user_path_at_empty+0x2c/0x50
      [  466.341508]  vfs_statx+0xe6/0x190
      [  466.342619]  __do_sys_newstat+0x81/0x100
      [  466.343908]  do_syscall_64+0xc8/0x600
      [  466.345303]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  466.347034]
      [  466.347517] Freed by task 2605:
      [  466.348471]  save_stack+0x19/0x80
      [  466.349476]  __kasan_slab_free+0x12e/0x180
      [  466.350726]  kmem_cache_free+0xc8/0x430
      [  466.351874]  putname+0xe2/0x120
      [  466.352921]  filename_lookup+0x257/0x3e0
      [  466.354319]  vfs_statx+0xe6/0x190
      [  466.355498]  __do_sys_newstat+0x81/0x100
      [  466.356889]  do_syscall_64+0xc8/0x600
      [  466.358037]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  466.359567]
      [  466.360050] The buggy address belongs to the object at ffff888372139100
      [  466.360050]  which belongs to the cache names_cache of size 4096
      [  466.363735] The buggy address is located 336 bytes inside of
      [  466.363735]  4096-byte region [ffff888372139100, ffff88837213a100)
      [  466.367179] The buggy address belongs to the page:
      [  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
      [  466.371582] flags: 0x2fffff80010200(slab|head)
      [  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
      [  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  466.377778] page dumped because: kasan: bad access detected
      [  466.379730]
      [  466.380288] Memory state around the buggy address:
      [  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.388257]                                                  ^
      [  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.394667] ==================================================================
      
      tun_chr_read_iter() accessed memory that had been freed by free_netdev(),
      called from tun_set_iff():
      
              CPUA                                           CPUB
        tun_set_iff()
          alloc_netdev_mqs()
          tun_attach()
                                                        tun_chr_read_iter()
                                                          tun_get()
                                                          tun_do_read()
                                                            tun_ring_recv()
          register_netdevice() <-- inject error
          goto err_detach
          tun_detach_all() <-- set RCV_SHUTDOWN
          free_netdev() <-- called from
                           err_free_dev path
            netdev_freemem() <-- free the memory
                              without check refcount
            (In this path, the refcount cannot prevent
             freeing the memory of dev, and the memory
             will be used by dev_put() called by
             tun_chr_read_iter() on CPUB.)
                                                           (Break from tun_ring_recv(),
                                                           because RCV_SHUTDOWN is set)
                                                         tun_put()
                                                           dev_put() <-- use the memory
                                                                         freed by netdev_freemem()
      
      Publish tfile->tun only after register_netdevice() has succeeded,
      so tun_get() cannot obtain a tun pointer that the err_detach path
      frees when register_netdevice() fails.
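
      The ordering idea can be sketched in plain userspace C. This is only a
      simplified model, not the driver code: struct tun_model, struct file_model,
      model_register() and model_set_iff() are hypothetical names, and the real
      driver additionally needs proper locking and pointer-publication
      primitives that are left out here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct tun_model  { int registered; };
struct file_model { struct tun_model *tun; /* what a concurrent reader sees */ };

/* Stand-in for device registration; 'fail' models the injected error. */
static int model_register(struct tun_model *tun, int fail)
{
	if (fail)
		return -1;
	tun->registered = 1;
	return 0;
}

/*
 * Attach a device to a file. The pointer is published only after
 * registration has succeeded, so a failed registration never exposes
 * a device that is about to be freed.
 */
static int model_set_iff(struct file_model *file, struct tun_model *tun, int fail)
{
	int err = model_register(tun, fail);

	if (err)
		return err;	/* error path: nothing was published */

	file->tun = tun;	/* publish last */
	return 0;
}
```

      In this model, a reader that looks at file->tun after a failed
      model_set_iff() still finds NULL and therefore never takes a reference
      on the doomed device.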
      
      Fixes: eb0fb363 ("tuntap: attach queue 0 before registering netdevice")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Suggested-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 11 Sep, 2019 13 commits
  3. 10 Sep, 2019 5 commits
  4. 07 Sep, 2019 8 commits
    • nfp: flower: cmsg rtnl locks can timeout reify messages · 28abe579
      Fred Lotter authored
      Flower control message replies are handled in different locations. The truly
      high priority replies are handled in the BH (tasklet) context, while the
      remaining replies are handled in a predefined Linux work queue. The work
      queue handler orders replies into high and low priority groups, and always
      starts servicing the high priority replies within the received batch first.
      
      Reply Type:			Rtnl Lock:	Handler:
      
      CMSG_TYPE_PORT_MOD		no		BH tasklet (mtu)
      CMSG_TYPE_TUN_NEIGH		no		BH tasklet
      CMSG_TYPE_FLOW_STATS		no		BH tasklet
      CMSG_TYPE_PORT_REIFY		no		WQ high
      CMSG_TYPE_PORT_MOD		yes		WQ high (link/mtu)
      CMSG_TYPE_MERGE_HINT		yes		WQ low
      CMSG_TYPE_NO_NEIGH		no		WQ low
      CMSG_TYPE_ACTIVE_TUNS		no		WQ low
      CMSG_TYPE_QOS_STATS		no		WQ low
      CMSG_TYPE_LAG_CONFIG		no		WQ low
      
      A subset of control messages can block waiting for an rtnl lock (from both
      work queue priority groups). The rtnl lock is heavily contended for by
      external processes such as systemd-udevd, systemd-network and libvirtd,
      especially during netdev creation, such as when flower VFs and representors
      are instantiated.
      
      Kernel netlink instrumentation shows that external processes (such as
      systemd-udevd) often use successive rtnl_trylock() sequences, which can cause
      a control message blocked in rtnl_lock() to starve for longer periods of time
      during rtnl lock contention, i.e. netdev creation.
      
      In the current design a single blocked control message will block the entire
      work queue (both priorities), and introduce a latency which is
      nondeterministic and dependent on system wide rtnl lock usage.
      
      In some extreme cases, one blocked control message at exactly the wrong time,
      just before the maximum number of VFs are instantiated, can block the work
      queue for long enough to prevent VF representor REIFY replies from getting
      handled in time for the 40ms timeout.
      
      The firmware will deliver the total maximum number of REIFY message replies in
      around 300us.
      
      Only REIFY and MTU update messages require replies within a timeout period (of
      40ms). The MTU-only updates are already done directly in the BH (tasklet)
      handler.
      
      Move the REIFY handler down into the BH (tasklet) in order to resolve timeouts
      caused by a blocked work queue waiting on rtnl locks.
      Signed-off-by: Fred Lotter <frederik.lotter@netronome.com>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list · 3dcbdb13
      Shmulik Ladkani authored
      Historically, support for frag_list packets entering skb_segment() was
      limited to frag_list members terminating on exact same gso_size
      boundaries. This is verified with a BUG_ON since commit 89319d38
      ("net: Add frag_list support to skb_segment"), quote:
      
          As such we require all frag_list members terminate on exact MSS
          boundaries.  This is checked using BUG_ON.
          As there should only be one producer in the kernel of such packets,
          namely GRO, this requirement should not be difficult to maintain.
      
      However, since commit 6578171a ("bpf: add bpf_skb_change_proto helper"),
      the "exact MSS boundaries" assumption no longer holds:
      An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but
      leaves the frag_list members as originally merged by GRO with the
      original 'gso_size'. Examples of such programs are bpf-based NAT46 or
      NAT64.
      
      This led to a kernel BUG_ON for flows involving:
       - GRO generating a frag_list skb
       - bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room()
       - skb_segment() of the skb
      
      See example BUG_ON reports in [0].
      
      In commit 13acc94e ("net: permit skb_segment on head_frag frag_list skb"),
      skb_segment() was modified to support the "gso_size mangling" case of
      a frag_list GRO'ed skb, but *only* for frag_list members having
      head_frag==true (having a page-fragment head).
      
      Alas, GRO packets having frag_list members with a linear kmalloced head
      (head_frag==false) still hit the BUG_ON.
      
      This commit adds support to skb_segment() for a 'head_skb' packet having
      a frag_list whose members are *non* head_frag, with gso_size mangled, by
      disabling SG and thus falling-back to copying the data from the given
      'head_skb' into the generated segmented skbs - as suggested by Willem de
      Bruijn [1].
      
      Since this approach involves the penalty of skb_copy_and_csum_bits()
      when building the segments, care was taken in order to enable this
      solution only when required:
       - untrusted gso_size, by testing SKB_GSO_DODGY is set
         (SKB_GSO_DODGY is set by any gso_size mangling functions in
          net/core/filter.c)
       - the frag_list is non empty, its item is a non head_frag, *and* the
         headlen of the given 'head_skb' does not match the gso_size.
      
      [0]
      https://lore.kernel.org/netdev/20190826170724.25ff616f@pixies/
      https://lore.kernel.org/netdev/9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com/
      
      [1]
      https://lore.kernel.org/netdev/CA+FuTSfVsgNDi7c=GUU8nMg2hWxF2SjCNLXetHeVPdnxAW5K-w@mail.gmail.com/
      
      Fixes: 6578171a ("bpf: add bpf_skb_change_proto helper")
      Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: addrconf_f6i_alloc - fix non-null pointer check to !IS_ERR() · 8652f17c
      Maciej Żenczykowski authored
      Fixes a stupid bug I recently introduced...
      ip6_route_info_create() returns an ERR_PTR(err) and not a NULL on error.
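
      Why a NULL test misses this kind of failure can be shown with a small
      userspace model of the error-pointer convention. The helpers below
      (model_err_ptr(), model_is_err(), model_ptr_err(), model_route_create())
      are hypothetical re-implementations written for illustration only; they
      are not the kernel's definitions, and the limit of 4095 is an assumption
      of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095	/* assumed size of the reserved error range */

/* Encode a negative errno value in a pointer. */
static void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True if the pointer lies in the top MODEL_MAX_ERRNO addresses. */
static int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static int model_dummy_route;

/* Hypothetical creator: returns an error pointer, never NULL, on failure. */
static void *model_route_create(int fail)
{
	return fail ? model_err_ptr(-ENOMEM) : (void *)&model_dummy_route;
}

/* Buggy success test: an error pointer is non-NULL, so it passes. */
static int model_null_check_ok(const void *p)
{
	return p != NULL;
}

/* Corrected success test. */
static int model_is_err_check_ok(const void *p)
{
	return !model_is_err(p);
}
```

      With these helpers, model_null_check_ok() accepts the value returned by
      model_route_create(1) even though it encodes -ENOMEM, while
      model_is_err_check_ok() rejects it.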
      
      Fixes: d55a2e37 ("net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)")
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • isdn/capi: check message length in capi_write() · fe163e53
      Eric Biggers authored
      syzbot reported:
      
          BUG: KMSAN: uninit-value in capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700
          CPU: 0 PID: 10025 Comm: syz-executor379 Not tainted 4.20.0-rc7+ #2
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          Call Trace:
            __dump_stack lib/dump_stack.c:77 [inline]
            dump_stack+0x173/0x1d0 lib/dump_stack.c:113
            kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
            __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
            capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700
            do_loop_readv_writev fs/read_write.c:703 [inline]
            do_iter_write+0x83e/0xd80 fs/read_write.c:961
            vfs_writev fs/read_write.c:1004 [inline]
            do_writev+0x397/0x840 fs/read_write.c:1039
            __do_sys_writev fs/read_write.c:1112 [inline]
            __se_sys_writev+0x9b/0xb0 fs/read_write.c:1109
            __x64_sys_writev+0x4a/0x70 fs/read_write.c:1109
            do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
            entry_SYSCALL_64_after_hwframe+0x63/0xe7
          [...]
      
      The problem is that capi_write() is reading past the end of the message.
      Fix it by checking the message's length in the needed places.
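
      The general pattern, validating the length before touching any header
      field, might look like the following userspace sketch. It is not the
      driver's code: model_parse_command(), the 8-byte header length and the
      field offset are assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_HDR_LEN    8	/* assumed fixed header length */
#define MODEL_CMD_OFFSET 4	/* assumed offset of the command byte */

/*
 * Extract the command byte from a message supplied by the caller.
 * The length is checked first, so a short message is rejected
 * instead of being read past its end.
 */
static int model_parse_command(const uint8_t *msg, size_t count, uint8_t *cmd)
{
	if (count < MODEL_HDR_LEN)
		return -EINVAL;

	*cmd = msg[MODEL_CMD_OFFSET];
	return 0;
}
```

      A 3-byte write is refused with -EINVAL and leaves *cmd untouched; a
      message of at least MODEL_HDR_LEN bytes is parsed normally.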
      
      Reported-and-tested-by: syzbot+0849c524d9c634f5ae66@syzkaller.appspotmail.com
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ibmvnic: free reset work of removed device from queue · 1c2977c0
      Juliet Kim authored
      Commit 36f1031c ("ibmvnic: Do not process reset during or after
      device removal") made the reset exit if the driver has been
      removed, but it does not free the adapter's reset work items from the queue.
      
      Ensure all reset work items are freed when breaking out of the loop early.
      
      Fixes: 36f1031c ("ibmvnic: Do not process reset during or after device removal")
      Signed-off-by: Juliet Kim <julietk@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: Fix flow control resolution · 63b2ed4e
      Stefan Chulski authored
      According to the IEEE 802.3-2015 standard, section 2,
      28B.3 Priority resolution - Table 28-3 - Pause resolution:
      
      In case of Local device Pause=1 AsymDir=0, Link partner
      Pause=1 AsymDir=1, Local device resolution should be enable PAUSE
      transmit, disable PAUSE receive.
      And in case of Local device Pause=1 AsymDir=1, Link partner
      Pause=1 AsymDir=0, Local device resolution should be enable PAUSE
      receive, disable PAUSE transmit.
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: Stefan Chulski <stefanc@marvell.com>
      Reported-by: Shaul Ben-Mayor <shaulb@marvell.com>
      Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/hamradio/6pack: Fix the size of a sk_buff used in 'sp_bump()' · b82573fd
      Christophe JAILLET authored
      We 'allocate' 'count' bytes here. In fact, 'dev_alloc_skb' already adds some
      extra space for padding, so a bit more is allocated.
      
      However, we use 1 byte for the KISS command, then copy 'count' bytes, so
      count+1 bytes.
      
      Explicitly allocate and use 1 more byte to be safe.
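
      The size arithmetic can be illustrated with an ordinary heap buffer
      instead of an sk_buff. model_build_frame() and the command value 0 are
      hypothetical; the sketch only shows why count + 1 bytes are needed when
      one command byte precedes 'count' payload bytes.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a frame consisting of one leading command byte followed by
 * 'count' payload bytes. The buffer is sized for both parts explicitly
 * rather than relying on allocator padding.
 */
static unsigned char *model_build_frame(const unsigned char *payload,
					size_t count, size_t *out_len)
{
	size_t len = count + 1;		/* 1 command byte + payload */
	unsigned char *buf = malloc(len);

	if (!buf)
		return NULL;

	buf[0] = 0;			/* command byte (hypothetical value) */
	memcpy(buf + 1, payload, count);
	*out_len = len;
	return buf;
}
```

      For a 3-byte payload the frame is 4 bytes long, and the caller releases
      it with free().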
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 0c04eb72
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2019-09-06
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) verifier precision tracking fix, from Alexei.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 06 Sep, 2019 6 commits
    • Merge tag 'wireless-drivers-for-davem-2019-09-05' of... · 74346c43
      David S. Miller authored
      Merge tag 'wireless-drivers-for-davem-2019-09-05' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
      
      Kalle Valo says:
      
      ====================
      wireless-drivers fixes for 5.3
      
      Fourth set of fixes for 5.3, and hopefully really the last one. Quite
      a few CVE fixes this time but at least to my knowledge none of them
      have a known exploit.
      
      mt76
      
      * workaround firmware hang by disabling hardware encryption on MT7630E
      
      * disable 5GHz band for MT7630E as it's not working properly
      
      mwifiex
      
      * fix IE parsing to avoid a heap buffer overflow
      
      iwlwifi
      
      * fix for QuZ device initialisation
      
      rt2x00
      
      * another fix for rekeying
      
      * revert a commit causing degradation in rx signal levels
      
      rsi
      
      * fix a double free
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • MAINTAINERS: add myself as maintainer for xilinx axiethernet driver · b0a3caea
      Radhey Shyam Pandey authored
      I am maintaining xilinx axiethernet driver in xilinx tree and would like
      to maintain it in the mainline kernel as well. Hence adding myself as a
      maintainer. Also, Anirudha and John have moved to new roles, so, on
      request, removing them from the maintainer list.
      Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
      Acked-by: John Linn <john.linn@xilinx.com>
      Acked-by: Michal Simek <michal.simek@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix reordering issues · b88dd52c
      Eric Dumazet authored
      Whenever MQ is not used on a multiqueue device, we experience
      serious reordering problems. Bisection found the cited
      commit.
      
      The issue can be described this way :
      
      - A single qdisc hierarchy is shared by all transmit queues.
        (eg : tc qdisc replace dev eth0 root fq_codel)
      
      - When/if try_bulk_dequeue_skb_slow() dequeues a packet targeting
        a different transmit queue than the one used to build a packet train,
        we stop building the current list and save the 'bad' skb (P1) in a
        special queue. (bad_txq)
      
      - When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this
        skb (P1), it checks if the associated transmit queue is still in frozen
        state. If the queue is still blocked (by BQL or NIC tx ring full),
        we leave the skb in bad_txq and return NULL.
      
      - dequeue_skb() calls q->dequeue() to get another packet (P2)
      
        The other packet can target the problematic queue (that we found
        in frozen state for the bad_txq packet), but another cpu just ran
        TX completion and made room in the txq that is now ready to accept
        new packets.
      
      - Packet P2 is sent while P1 is still held in bad_txq, P1 might be sent
        at next round. In practice P2 is the lead of a big packet train
        (P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/
      
      To solve this problem, we have to block the dequeue process as long
      as the first packet in bad_txq can not be sent. Reordering issues
      disappear and no side effects have been seen.
      
      Fixes: a53851e2 ("net: sched: explicit locking in gso_cpu fallback")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · 2e9550ed
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      pull request (net): ipsec 2019-09-05
      
      1) Several xfrm interface fixes from Nicolas Dichtel:
         - Avoid an interface ID corruption on changelink.
         - Fix wrong interface names in the logs.
         - Fix a list corruption when changing network namespaces.
         - Fix unregistration of the underlying phydev.
      
      2) Fix a potential warning when merging xfrm_policy nodes.
         From Florian Westphal.
      
      Please pull or let me know if there are problems.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • forcedeth: use per cpu to collect xmit/recv statistics · f4b633b9
      Zhu Yanjun authored
      When testing with a background iperf pushing 1Gbit/sec traffic and running
      both ifconfig and netstat to collect statistics, some deadlocks occurred.
      
      Ifconfig and netstat call nv_get_stats64 to get software xmit/recv
      statistics. Since commit f5d827ae ("forcedeth: implement
      ndo_get_stats64() API"), ordinary tx/rx variables have been used to collect
      tx/rx statistics. The fix is to replace the ordinary tx/rx variables with
      per cpu 64-bit variables to collect xmit/recv statistics. The per cpu
      variables avoid deadlocks and provide fast, efficient statistics updates.
      
      In nv_probe, the per cpu variable is initialized. In nv_remove, this
      per cpu variable is freed.
      
      In xmit/recv process, this per cpu variable will be updated.
      
      In nv_get_stats64, this per cpu variable on each cpu is added up. Then
      the driver can get xmit/recv packets statistics.
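
      The collection scheme described above (each CPU updates its own
      counters, and the reader adds them up) can be modeled in userspace C as
      follows. This is a sketch only: the fixed CPU count, struct
      model_cpu_stats and the model_* functions are hypothetical, and the
      synchronization a real driver needs for consistent 64-bit reads is
      deliberately left out.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 4		/* assumed CPU count for the model */

/* One private set of counters per CPU; no shared lock on the hot path. */
struct model_cpu_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

static struct model_cpu_stats model_stats[MODEL_NR_CPUS];

/* Called from the receive path running on 'cpu'. */
static void model_count_rx(int cpu)
{
	model_stats[cpu].rx_packets++;
}

/* Called from the transmit path running on 'cpu'. */
static void model_count_tx(int cpu)
{
	model_stats[cpu].tx_packets++;
}

/* Reader side: sum the per-CPU counters into one total. */
static struct model_cpu_stats model_get_stats(void)
{
	struct model_cpu_stats total = { 0, 0 };
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		total.rx_packets += model_stats[cpu].rx_packets;
		total.tx_packets += model_stats[cpu].tx_packets;
	}
	return total;
}
```

      Because the update path touches only the current CPU's slot, it never
      has to wait for the reader, which is the property the commit relies on.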
      
      In a test that ran for several days with this commit, the deadlocks
      disappeared and the performance was better.
      
      Tested:
         - iperf SMP x86_64 ->
         Client connecting to 1.1.1.108, TCP port 5001
         TCP window size: 85.0 KByte (default)
         ------------------------------------------------------------
         [  3] local 1.1.1.105 port 38888 connected with 1.1.1.108 port 5001
         [ ID] Interval       Transfer     Bandwidth
         [  3]  0.0-10.0 sec  1.10 GBytes   943 Mbits/sec
      
         ifconfig results:
      
         enp0s9 Link encap:Ethernet  HWaddr 00:21:28:6f:de:0f
                inet addr:1.1.1.105  Bcast:0.0.0.0  Mask:255.255.255.0
                UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                RX packets:5774764531 errors:0 dropped:0 overruns:0 frame:0
                TX packets:633534193 errors:0 dropped:0 overruns:0 carrier:0
                collisions:0 txqueuelen:1000
                RX bytes:7646159340904 (7.6 TB) TX bytes:11425340407722 (11.4 TB)
      
         netstat results:
      
         Kernel Interface table
         Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
         ...
         enp0s9 1500 0  5774764531 0    0 0      633534193      0      0  0 BMRU
         ...
      
      Fixes: f5d827ae ("forcedeth: implement ndo_get_stats64() API")
      CC: Joe Jin <joe.jin@oracle.com>
      CC: JUNXIAO_BI <junxiao.bi@oracle.com>
      Reported-and-tested-by: Nan san <nan.1986san@gmail.com>
      Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sonic: return NETDEV_TX_OK if failed to map buffer · 6e1cdedc
      Mao Wenan authored
      NETDEV_TX_BUSY really should only be used by drivers that call
      netif_tx_stop_queue() at the wrong moment. If dma_map_single()
      fails to map the tx DMA buffer, returning it might trigger an infinite loop.
      This patch uses NETDEV_TX_OK instead of NETDEV_TX_BUSY, and changes
      printk to pr_err_ratelimited.
      
      Fixes: d9fb9f38 ("*sonic/natsemi/ns83829: Move the National Semi-conductor drivers")
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 05 Sep, 2019 7 commits
    • bpf: fix precision tracking of stack slots · 2339cd6c
      Alexei Starovoitov authored
      The problem can be seen in the following two tests:
      0: (bf) r3 = r10
      1: (55) if r3 != 0x7b goto pc+0
      2: (7a) *(u64 *)(r3 -8) = 0
      3: (79) r4 = *(u64 *)(r10 -8)
      ..
      0: (85) call bpf_get_prandom_u32#7
      1: (bf) r3 = r10
      2: (55) if r3 != 0x7b goto pc+0
      3: (7b) *(u64 *)(r3 -8) = r0
      4: (79) r4 = *(u64 *)(r10 -8)
      
      When backtracking needs to mark R4, it will mark slot fp-8.
      But the ST or STX into fp-8 could belong to the same block of instructions.
      When backtracking is done, the parent state may have the fp-8 slot
      as "unallocated stack", which will cause the verifier to warn
      and incorrectly reject such programs.
      
      Writes into stack via non-R10 register are rare. llvm always
      generates canonical stack spill/fill.
      For such pathological case fall back to conservative precision
      tracking instead of rejecting.
      
      Reported-by: syzbot+c8d66267fd2b5955287e@syzkaller.appspotmail.com
      Fixes: b5dc0163 ("bpf: precise scalar_value tracking")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: Properly update v4 routes with v6 nexthop · 7bdf4de1
      Donald Sharp authored
      When creating a v4 route that uses a v6 nexthop from a nexthop group,
      allow the kernel to properly send the nexthop as v6 via the RTA_VIA
      attribute.
      
      Broken behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
      
      Fixed behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
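
      One plausible reading of the broken output is that the first four bytes
      of the IPv6 gateway were reported as if they were an IPv4 address. The
      arithmetic can be checked with the sketch below; model_v6_prefix_as_v4()
      is a hypothetical helper written for this illustration and says nothing
      about how the kernel or iproute2 actually build the attribute.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <string.h>

/*
 * Format the first four bytes of an IPv6 address as if they were an
 * IPv4 address. Returns 'out' on success, NULL on a parse or format error.
 */
static const char *model_v6_prefix_as_v4(const char *v6_text, char *out,
					 size_t out_len)
{
	struct in6_addr v6;
	struct in_addr v4;

	if (inet_pton(AF_INET6, v6_text, &v6) != 1)
		return NULL;

	/* Keep network byte order: copy the leading 4 of the 16 bytes. */
	memcpy(&v4, v6.s6_addr, sizeof(v4));

	return inet_ntop(AF_INET, &v4, out, (socklen_t)out_len);
}
```

      For "fe80::9" the leading bytes are fe 80 00 00, which format as
      254.128.0.0, the gateway shown in the broken "ip route show" output.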
      
      v2, v3: Addresses code review comments from David Ahern
      
      Fixes: dcb1ecb5 ("ipv4: Prepare for fib6_nh from a nexthop object")
      Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nexthops-Fix-multipath-notifications-for-IPv6-and-selftests' · e9752c83
      David S. Miller authored
      David Ahern says:
      
      ====================
      nexthops: Fix multipath notifications for IPv6 and selftests
      
      A couple of bug fixes noticed while testing Donald's patch.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftest: A few cleanups for fib_nexthops.sh · 91bfb564
      David Ahern authored
      Cleanups of the tests in fib_nexthops.sh
      1. Several tests noted unexpected route output, but the
         discrepancy was not showing in the summary output and
         overlooked in the verbose output. Add a WARNING message
         to the summary output to make it clear a test is not showing
         expected output.
      
      2. Several check_* calls are missing extra data like scope and metric
         causing mismatches when the nexthops or routes are correct - some of
         them are a side effect of the evolving iproute2 command. Update the
         data to the expected output.
      
      3. Several check_routes are checking for the wrong nexthop data,
         most likely a copy-paste-update error.
      
      4. A couple of tests were re-using a nexthop id that already existed.
         Fix those to use a new id.
      
      Fixes: 6345266a ("selftests: Add test cases for nexthop objects")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Fix RTA_MULTIPATH with nexthop objects · 4255ff05
      David Ahern authored
      A change to the core nla helpers was missed during the push of
      the nexthop changes. rt6_fill_node_nexthop should be calling
      nla_nest_start_noflag not nla_nest_start. Currently, iproute2
      does not print multipath data because of parsing issues with
      the attribute.
      
      Fixes: f88d8ea6 ("ipv6: Plumb support for nexthop object in a fib6_info")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sock_map, fix missing ulp check in sock hash case · 44580a01
      John Fastabend authored
      sock_map and ULP only work together when ULP is loaded after the sock
      map is loaded. In the sock_map case we added a check for this to fail
      the load if ULP is already set. However, we missed the check on the
      sock_hash side.
      
      Add a ULP check to the sock_hash update path.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Reported-by: syzbot+7a6ee4d0078eac6bf782@syzkaller.appspotmail.com
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fixed_phy: Add forward declaration for struct gpio_desc; · ebe26aca
      Moritz Fischer authored
      Add forward declaration for struct gpio_desc in order to address
      the following:
      
      ./include/linux/phy_fixed.h:48:17: error: 'struct gpio_desc' declared inside parameter list [-Werror]
      ./include/linux/phy_fixed.h:48:17: error: its scope is only this definition or declaration, which is probably not what you want [-Werror]
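
      The effect of a forward declaration can be reproduced outside the
      kernel. In the sketch below, struct model_gpio_desc and
      model_register_with_gpiod() are hypothetical names; the point is only
      that declaring the struct tag at file scope before the prototype gives
      the parameter type file scope instead of prototype scope.

```c
#include <stddef.h>

/*
 * Forward declaration: introduces the struct tag at file scope as an
 * incomplete type. Without this line, a compiler would treat the tag in
 * the prototype below as a new type visible only inside that parameter
 * list, which is what the quoted warning complains about.
 */
struct model_gpio_desc;

/* The prototype may now refer to a pointer to the incomplete type. */
int model_register_with_gpiod(int irq, struct model_gpio_desc *gpiod);

/*
 * The pointer is only passed around, never dereferenced, so the full
 * definition of the struct is not needed here.
 */
int model_register_with_gpiod(int irq, struct model_gpio_desc *gpiod)
{
	return gpiod != NULL ? irq : -1;
}
```

      A header that only passes such pointers through does not need to
      include the header that defines the struct.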
      
      Fixes: 71bd106d ("net: fixed-phy: Add fixed_phy_register_with_gpiod() API")
      Signed-off-by: Moritz Fischer <mdf@kernel.org>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>