1. 27 Dec, 2017 9 commits
    • David S. Miller
      Merge branch 'tg3-fixes' · 67538790
      David S. Miller authored
      Siva Reddy Kallam says:
      
      ====================
      tg3: update on copyright and couple of fixes
      
      First patch:
      	Update copyright
      
      Second patch:
      	Add workaround to restrict 5762 MRRS
      
      Third patch:
      	Add PHY reset in change MTU path for 5720
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Siva Reddy Kallam
      tg3: Enable PHY reset in MTU change path for 5720 · e60ee41a
      Siva Reddy Kallam authored
      A customer noticed an RX path hang on 5717 and 5719 when the MTU is
      changed on the fly while running heavy traffic with NCSI enabled. Since
      the 5720 belongs to the same ASIC family, we observed the same issue
      there, and the same fix resolves it for the 5720.
      Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Siva Reddy Kallam
      tg3: Add workaround to restrict 5762 MRRS to 2048 · 4419bb1c
      Siva Reddy Kallam authored
      One AMD-based server with a 5762 hangs under jumbo frame traffic.
      This AMD platform has a southbridge limitation that restricts MRRS
      to 4000. As a workaround, the driver restricts the MRRS to 2048 for
      this particular 5762 NX1 card.
      Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Siva Reddy Kallam
      5a8bae97
    • David S. Miller
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · 65bbbf6c
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      pull request (net): ipsec 2017-12-22
      
      1) Check for valid id proto in validate_tmpl(), otherwise
         we may trigger a warning in xfrm_state_fini().
         From Cong Wang.
      
      2) Fix a typo on XFRMA_OUTPUT_MARK policy attribute.
         From Michal Kubecek.
      
      3) Verify the state is valid when encap_type < 0,
         otherwise we may crash on IPsec GRO.
         From Aviv Heller.
      
      4) Fix stack-out-of-bounds read on socket policy lookup.
         We access the flowi of the wrong address family in the
         IPv4 mapped IPv6 case, fix this by catching address
         family mismatches before we do the lookup.
      
      5) Fix xfrm_do_migrate() with AEAD to copy the geniv
         field too. Otherwise the state is not fully initialized
         and migration fails. From Antony Antony.
      
      6) Fix stack-out-of-bounds with misconfigured transport
         mode policies. Our policy template validation is not
         strict enough. It is possible to configure policies
         with transport mode template where the address family
         of the template does not match the selector's address
         family. Fix this by refusing such a configuration;
         the address family cannot change in transport mode.
      
      7) Fix a policy reference leak when reusing pcpu xdst
         entry. From Florian Westphal.
      
      8) Reinject transport-mode packets through tasklet,
         otherwise it is possible to create a recursion
         loop. From Herbert Xu.
      
      Please pull or let me know if there are problems.
      ====================
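Item 6 above refuses transport-mode templates whose address family differs from the selector's. A minimal sketch of that kind of check follows. The demo_* types and names are hypothetical stand-ins, not the real xfrm structures, and the in-kernel validation handles more cases than this:

```c
#include <stdbool.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical, simplified template modes (not the xfrm definitions). */
enum demo_mode { DEMO_MODE_TRANSPORT, DEMO_MODE_TUNNEL };

/* Hypothetical policy template: just a mode and an address family. */
struct demo_tmpl {
	enum demo_mode mode;
	int family;
};

/*
 * Reject a transport-mode template whose family differs from the
 * selector's family; other modes may legitimately change the family.
 */
static bool demo_tmpl_family_ok(const struct demo_tmpl *tmpl, int sel_family)
{
	if (tmpl->mode == DEMO_MODE_TRANSPORT && tmpl->family != sel_family)
		return false;
	return true;
}
```

With this sketch, a transport-mode AF_INET6 template paired with an AF_INET selector is refused, while the same pairing in tunnel mode is accepted.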
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Fugang Duan
      net: fec: unmap the xmit buffer that are not transferred by DMA · 178e5f57
      Fugang Duan authored
      The enet IP only supports 32-bit addressing, so on i.MX platforms it
      uses a swiotlb buffer for DMA mapping when the xmit buffer's DMA memory
      address is above 4G. After a suspend/resume stress test, it prints:
      
      log:
      [12826.352864] fec 5b040000.ethernet: swiotlb buffer is full (sz: 191 bytes)
      [12826.359676] DMA: Out of SW-IOMMU space for 191 bytes at device 5b040000.ethernet
      [12826.367110] fec 5b040000.ethernet eth0: Tx DMA memory map failed
      
      The issue is that xmit buffers which are ready and DMA mapped, but which
      DMA has not yet copied into the FIFO, are not unmapped when the MAC
      restarts. So check for DMA-mapped buffers and unmap them.
      Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Tommi Rantala
      tipc: fix tipc_mon_delete() oops in tipc_enable_bearer() error path · 642a8439
      Tommi Rantala authored
      Calling tipc_mon_delete() before the monitor has been created will oops.
      This can happen in tipc_enable_bearer() error path if tipc_disc_create()
      fails.
      
      [   48.589074] BUG: unable to handle kernel paging request at 0000000000001008
      [   48.590266] IP: tipc_mon_delete+0xea/0x270 [tipc]
      [   48.591223] PGD 1e60c5067 P4D 1e60c5067 PUD 1eb0cf067 PMD 0
      [   48.592230] Oops: 0000 [#1] SMP KASAN
      [   48.595610] CPU: 5 PID: 1199 Comm: tipc Tainted: G    B            4.15.0-rc4-pc64-dirty #5
      [   48.597176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
      [   48.598489] RIP: 0010:tipc_mon_delete+0xea/0x270 [tipc]
      [   48.599347] RSP: 0018:ffff8801d827f668 EFLAGS: 00010282
      [   48.600705] RAX: ffff8801ee813f00 RBX: 0000000000000204 RCX: 0000000000000000
      [   48.602183] RDX: 1ffffffff1de6a75 RSI: 0000000000000297 RDI: 0000000000000297
      [   48.604373] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff1dd1533
      [   48.605607] R10: ffffffff8eafbb05 R11: fffffbfff1dd1534 R12: 0000000000000050
      [   48.607082] R13: dead000000000200 R14: ffffffff8e73f310 R15: 0000000000001020
      [   48.608228] FS:  00007fc686484800(0000) GS:ffff8801f5540000(0000) knlGS:0000000000000000
      [   48.610189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   48.611459] CR2: 0000000000001008 CR3: 00000001dda70002 CR4: 00000000003606e0
      [   48.612759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   48.613831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   48.615038] Call Trace:
      [   48.615635]  tipc_enable_bearer+0x415/0x5e0 [tipc]
      [   48.620623]  tipc_nl_bearer_enable+0x1ab/0x200 [tipc]
      [   48.625118]  genl_family_rcv_msg+0x36b/0x570
      [   48.631233]  genl_rcv_msg+0x5a/0xa0
      [   48.631867]  netlink_rcv_skb+0x1cc/0x220
      [   48.636373]  genl_rcv+0x24/0x40
      [   48.637306]  netlink_unicast+0x29c/0x350
      [   48.639664]  netlink_sendmsg+0x439/0x590
      [   48.642014]  SYSC_sendto+0x199/0x250
      [   48.649912]  do_syscall_64+0xfd/0x2c0
      [   48.650651]  entry_SYSCALL64_slow_path+0x25/0x25
      [   48.651843] RIP: 0033:0x7fc6859848e3
      [   48.652539] RSP: 002b:00007ffd25dff938 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [   48.654003] RAX: ffffffffffffffda RBX: 00007ffd25dff990 RCX: 00007fc6859848e3
      [   48.655303] RDX: 0000000000000054 RSI: 00007ffd25dff990 RDI: 0000000000000003
      [   48.656512] RBP: 00007ffd25dff980 R08: 00007fc685c35fc0 R09: 000000000000000c
      [   48.657697] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000d13010
      [   48.658840] R13: 00007ffd25e009c0 R14: 0000000000000000 R15: 0000000000000000
      [   48.662972] RIP: tipc_mon_delete+0xea/0x270 [tipc] RSP: ffff8801d827f668
      [   48.664073] CR2: 0000000000001008
      [   48.664576] ---[ end trace e811818d54d5ce88 ]---
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Tommi Rantala
      tipc: error path leak fixes in tipc_enable_bearer() · 19142551
      Tommi Rantala authored
      Fix memory leak in tipc_enable_bearer() if enable_media() fails, and
      cleanup with bearer_disable() if tipc_mon_create() fails.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Avinash Repaka
      RDS: Check cmsg_len before dereferencing CMSG_DATA · 14e138a8
      Avinash Repaka authored
      RDS currently doesn't check whether the length of the control message is
      large enough to hold the required data before dereferencing the control
      message data. This results in the following crash:
      
      BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013
      [inline]
      BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90
      net/rds/send.c:1066
      Read of size 8 at addr ffff8801c928fb70 by task syzkaller455006/3157
      
      CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:252
       kasan_report_error mm/kasan/report.c:351 [inline]
       kasan_report+0x25b/0x340 mm/kasan/report.c:409
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
       rds_rdma_bytes net/rds/send.c:1013 [inline]
       rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066
       sock_sendmsg_nosec net/socket.c:628 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:638
       ___sys_sendmsg+0x320/0x8b0 net/socket.c:2018
       __sys_sendmmsg+0x1ee/0x620 net/socket.c:2108
       SYSC_sendmmsg net/socket.c:2139 [inline]
       SyS_sendmmsg+0x35/0x60 net/socket.c:2134
       entry_SYSCALL_64_fastpath+0x1f/0x96
      RIP: 0033:0x43fe49
      RSP: 002b:00007fffbe244ad8 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe49
      RDX: 0000000000000001 RSI: 000000002020c000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004017b0
      R13: 0000000000401840 R14: 0000000000000000 R15: 0000000000000000
      
      To fix this, we verify that the cmsg_len is large enough to hold the
      data to be read, before proceeding further.
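The length check described above can be sketched in userspace C with the standard CMSG_LEN() and CMSG_DATA() macros. This is an illustration under assumptions, not the RDS patch itself: struct demo_args and both demo_* functions are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical payload type standing in for the real control data. */
struct demo_args {
	uint64_t cookie;
	uint64_t nbytes;
};

/*
 * Copy the payload out of a control message, but only after checking
 * that cmsg_len covers the header plus a full struct demo_args.
 * Returns 0 on success, -1 if the control message is too short.
 */
static int demo_read_args(struct cmsghdr *cmsg, struct demo_args *out)
{
	if (cmsg->cmsg_len < CMSG_LEN(sizeof(*out)))
		return -1;

	memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
	return 0;
}

/*
 * Build a zeroed control message that claims payload_len bytes of data
 * and run it through the checked reader.
 */
static int demo_check(size_t payload_len)
{
	union {
		struct cmsghdr hdr;
		unsigned char buf[CMSG_SPACE(sizeof(struct demo_args))];
	} u;
	struct demo_args args;

	memset(&u, 0, sizeof(u));
	u.hdr.cmsg_len = CMSG_LEN(payload_len);
	return demo_read_args(&u.hdr, &args);
}
```

Here demo_check(sizeof(struct demo_args)) succeeds, while a message claiming only 4 bytes of payload is rejected before CMSG_DATA() is touched.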
      Reported-by: syzbot <syzkaller-bugs@googlegroups.com>
      Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 26 Dec, 2017 9 commits
    • Mat Martineau
      tcp: Avoid preprocessor directives in tracepoint macro args · 6a6b0b99
      Mat Martineau authored
      Using a preprocessor directive to check for CONFIG_IPV6 in the middle of
      a DECLARE_EVENT_CLASS macro's arg list causes sparse to report a series
      of errors:
      
      ./include/trace/events/tcp.h:68:1: error: directive in argument list
      ./include/trace/events/tcp.h:75:1: error: directive in argument list
      ./include/trace/events/tcp.h:144:1: error: directive in argument list
      ./include/trace/events/tcp.h:151:1: error: directive in argument list
      ./include/trace/events/tcp.h:216:1: error: directive in argument list
      ./include/trace/events/tcp.h:223:1: error: directive in argument list
      ./include/trace/events/tcp.h:274:1: error: directive in argument list
      ./include/trace/events/tcp.h:281:1: error: directive in argument list
      
      Once sparse finds an error, it stops printing warnings for the file it
      is checking. This masks any sparse warnings that would normally be
      reported for the core TCP code.
      
      Instead, handle the preprocessor conditionals in a couple of auxiliary
      macros. This also has the benefit of reducing duplicate code.
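The general technique can be sketched in plain C: resolve the conditional once in a helper macro, so that no directive appears inside another macro's argument list. All DEMO_* names below are hypothetical and are not the macros used in the actual patch.

```c
/* Hypothetical stand-in for CONFIG_IPV6; set to 0 to drop the v6 field. */
#define DEMO_HAVE_IPV6 1

/*
 * Resolve the conditional once, outside any macro invocation. The helper
 * expands to its argument when IPv6 is enabled and to nothing otherwise.
 */
#if DEMO_HAVE_IPV6
#define DEMO_IF_IPV6(x) x
#else
#define DEMO_IF_IPV6(x)
#endif

/* A macro that takes a block of field declarations as its argument. */
#define DEMO_DECLARE_EVENT(name, fields) struct name { fields }

/* No #if directive appears inside the argument list below. */
DEMO_DECLARE_EVENT(demo_tcp_event,
	unsigned int saddr_v4;
	DEMO_IF_IPV6(unsigned char saddr_v6[16];)
);
```

The C standard leaves directives inside a macro argument list undefined, which is why checkers such as sparse reject them; the helper-macro form keeps the same conditional content without that construct.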
      
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Maloy
      tipc: fix memory leak of group member when peer node is lost · 3a33a19b
      Jon Maloy authored
      When a group member receives a member WITHDRAW event, this might have
      two reasons: either the peer member is leaving the group, or the link
      to the member's node has been lost.
      
      In the latter case we need to issue a DOWN event to the user right away,
      and let function tipc_group_filter_msg() delete the member item.
      However, in this case we fail to change the state of the member
      item to MBR_LEAVING, so the member item is not deleted, and we have a
      memory leak.
      
      We now distinguish more clearly between the four sub-cases of a WITHDRAW
      event and make sure that each case is handled correctly.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jiri Pirko
      net: sched: fix possible null pointer deref in tcf_block_put · 4853f128
      Jiri Pirko authored
      We need to check block for being null in both tcf_block_put and
      tcf_block_put_ext.
      
      Fixes: 343723dd ("net: sched: fix clsact init error path")
      Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Maloy
      tipc: base group replicast ack counter on number of actual receivers · 0a3d805c
      Jon Maloy authored
      In commit 2f487712 ("tipc: guarantee that group broadcast doesn't
      bypass group unicast") we introduced a mechanism that requires the first
      (replicated) broadcast sent after a unicast to be acknowledged by all
      receivers before permitting sending of the next (true) broadcast.
      
      The counter for keeping track of the number of acknowledgements to expect
      is based on the tipc_group::member_cnt variable. But this misses that
      some of the known members may not be ready for reception, and will never
      acknowledge the message, either because they haven't fully joined the
      group or because they are leaving the group. Such members are identified
      by not fulfilling the condition tested for in the function
      tipc_group_is_enabled().
      
      We now set the counter for the actual number of acks to receive at the
      moment the message is sent, by just counting the number of recipients
      satisfying the tipc_group_is_enabled() test.
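A simplified sketch of the counting idea: derive the expected number of acks from the members that can actually receive, not from the total member count. The demo_* types and states are hypothetical and much simpler than the real TIPC group code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified member states (not the TIPC definitions). */
enum demo_state { DEMO_JOINING, DEMO_JOINED, DEMO_LEAVING };

struct demo_member {
	enum demo_state state;
};

/* Stand-in for an "is enabled" test: only fully joined members qualify. */
static bool demo_member_is_enabled(const struct demo_member *m)
{
	return m->state == DEMO_JOINED;
}

/*
 * Count the acks to expect at send time: one per member that passes the
 * enabled test, regardless of how many members are known in total.
 */
static unsigned int demo_expected_acks(const struct demo_member *members,
				       size_t n)
{
	unsigned int acks = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (demo_member_is_enabled(&members[i]))
			acks++;
	return acks;
}
```

For a group of four known members of which two are joined, one is joining and one is leaving, the sketch expects two acks rather than four, so the sender does not wait on members that will never acknowledge.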
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Cong Wang
      net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap() · b2fb01f4
      Cong Wang authored
      The rcu_barrier_bh() in mini_qdisc_pair_swap() waits for an in-flight
      RCU callback installed by a previous mini_qdisc_pair_swap(). However,
      we miss it on the tp_head==NULL path, so the RCU callback can still
      use miniq_old->rcu after it is freed together with the qdisc in
      qdisc_graft(). So just add it on that path too.
      
      Fixes: 46209401 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath ")
      Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Grygorii Strashko
      net: phy: micrel: ksz9031: reconfigure autoneg after phy autoneg workaround · c1a8d0a3
      Grygorii Strashko authored
      Under some circumstances the driver performs a PHY reset in
      ksz9031_read_status() to fix an autoneg failure case (idle error count =
      0xFF). When this happens, the ksz9031 no longer detects link status
      changes when connected to a Netgear 1G switch (the link can sometimes be
      recovered by restarting the netdevice with "ifconfig down up").
      Reproduced on a TI am572x board equipped with a ksz9031 PHY connected
      to a Netgear 1G switch.
      
      Fix the issue by reconfiguring autonegotiation after PHY reset in
      ksz9031_read_status().
      
      Fixes: d2fd719b ("net/phy: micrel: Add workaround for bad autoneg")
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Alexey Kodanev
      ip6_gre: fix device features for ioctl setup · e5a9336a
      Alexey Kodanev authored
      When ip6gre is created using ioctl, its features, such as
      scatter-gather, GSO and tx-checksumming will be turned off:
      
        # ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1
        # ethtool -k gre6 (truncated output)
          tx-checksumming: off
          scatter-gather: off
          tcp-segmentation-offload: off
          generic-segmentation-offload: off [requested on]
      
      But when netlink is used, they will be enabled:
        # ip link add gre6 type ip6gre remote fd00::1
        # ethtool -k gre6 (truncated output)
          tx-checksumming: on
          scatter-gather: on
          tcp-segmentation-offload: on
          generic-segmentation-offload: on
      
      This results in a loss of performance when gre6 is created via ioctl.
      The issue was found with LTP/gre tests.
      
      Fix it by moving the setup of device features to a separate function
      and invoke it with ndo_init callback because both netlink and ioctl
      will eventually call it via register_netdevice():
      
         register_netdevice()
             - ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init()
                 - ip6gre_tunnel_init_common()
                      - ip6gre_tnl_init_features()
      
      The moved code also contains two minor style fixes:
        * removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line.
        * fixed the issue reported by checkpatch: "Unnecessary parentheses around
          'nt->encap.type == TUNNEL_ENCAP_NONE'"
      
      Fixes: ac4eb009 ("ip6gre: Add support for basic offloads offloads excluding GSO")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Russell King
      phylink: ensure AN is enabled · 74ee0e8c
      Russell King authored
      Ensure that we mark AN as enabled at boot time, rather than leaving
      it disabled.  This is noticeable if your SFP module is fiber, and
      it supports faster speeds than 1G with 2.5G support in place.
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Russell King
      phylink: ensure the PHY interface mode is appropriately set · 182088aa
      Russell King authored
      When setting the ethtool settings, ensure that the validated PHY
      interface mode is propagated to the current link settings, so that
      2500BaseX can be selected.
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 21 Dec, 2017 22 commits
    • Linus Torvalds
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ead68f21
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "What's a holiday weekend without some networking bug fixes? [1]
      
         1) Fix some eBPF JIT bugs wrt. SKB pointers across helper function
            calls, from Daniel Borkmann.
      
         2) Fix regression from errata limiting change to marvell PHY driver,
            from Zhao Qiang.
      
         3) Fix u16 overflow in SCTP, from Xin Long.
      
         4) Fix potential memory leak during bridge newlink, from Nikolay
            Aleksandrov.
      
         5) Fix BPF selftest build on s390, from Hendrik Brueckner.
      
         6) Don't append to cfg80211 automatically generated certs file,
            always write new ones from scratch. From Thierry Reding.
      
         7) Fix sleep in atomic in mac80211 hwsim, from Jia-Ju Bai.
      
         8) Fix hang on tg3 MTU change with certain chips, from Brian King.
      
         9) Add stall detection to arc emac driver and reset chip when this
            happens, from Alexander Kochetkov.
      
        10) Fix MTU limiting in GRE tunnel drivers, from Xin Long.
      
        11) Fix stmmac timestamping bug due to mis-shifting of field. From
            Fredrik Hallenberg.
      
        12) Fix metrics match when deleting an ipv4 route. The kernel sets
            some internal metrics bits which the user isn't going to set when
            it makes the delete request. From Phil Sutter.
      
        13) mvneta driver loop over RX queues limits on "txq_number" :-) Fix
            from Yelena Krivosheev.
      
        14) Fix double free and memory corruption in get_net_ns_by_id, from
            Eric W. Biederman.
      
        15) Flush ipv4 FIB tables in the reverse order. Some tables can share
            their actual backing data, in particular this happens for the MAIN
            and LOCAL tables. We have to kill the LOCAL table first, because
            it uses MAIN's backing memory. Fix from Ido Schimmel.
      
        16) Several eBPF verifier value tracking fixes, from Edward Cree, Jann
            Horn, and Alexei Starovoitov.
      
        17) Make changes to ipv6 autoflowlabel sysctl really propagate to
            sockets, unless the socket has set the per-socket value
            explicitly. From Shaohua Li.
      
        18) Fix leaks and double callback invocations of zerocopy SKBs, from
            Willem de Bruijn"
      
      [1] Is this a trick question? "Relaxing"? "Quiet"? "Fine"? - Linus.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
        skbuff: skb_copy_ubufs must release uarg even without user frags
        skbuff: orphan frags before zerocopy clone
        net: reevalulate autoflowlabel setting after sysctl setting
        openvswitch: Fix pop_vlan action for double tagged frames
        ipv6: Honor specified parameters in fibmatch lookup
        bpf: do not allow root to mangle valid pointers
        selftests/bpf: add tests for recent bugfixes
        bpf: fix integer overflows
        bpf: don't prune branches when a scalar is replaced with a pointer
        bpf: force strict alignment checks for stack pointers
        bpf: fix missing error return in check_stack_boundary()
        bpf: fix 32-bit ALU op verification
        bpf: fix incorrect tracking of register size truncation
        bpf: fix incorrect sign extension in check_alu_op()
        bpf/verifier: fix bounds calculation on BPF_RSH
        ipv4: Fix use-after-free when flushing FIB tables
        s390/qeth: fix error handling in checksum cmd callback
        tipc: remove joining group member from congested list
        selftests: net: Adding config fragment CONFIG_NUMA=y
        nfp: bpf: keep track of the offloaded program
        ...
    • David S. Miller
      Merge branch 'net-zerocopy-fixes' · c50b7c47
      David S. Miller authored
      Saeed Mahameed says:
      
      ===================
      Mellanox, mlx5 fixes 2017-12-19
      
      The following series includes some fixes for the mlx5 core and ethernet
      driver.
      
      Please pull and let me know if there is any problem.
      
      This series doesn't introduce any conflict with the ongoing mlx5 for-next
      submission.
      
      For -stable:
      
      kernels >= v4.7.y
          ("net/mlx5e: Fix possible deadlock of VXLAN lock")
          ("net/mlx5e: Add refcount to VXLAN structure")
          ("net/mlx5e: Prevent possible races in VXLAN control flow")
          ("net/mlx5e: Fix features check of IPv6 traffic")
      
      kernels >= v4.9.y
          ("net/mlx5: Fix error flow in CREATE_QP command")
          ("net/mlx5: Fix rate limit packet pacing naming and struct")
      
      kernels >= v4.13.y
          ("net/mlx5: FPGA, return -EINVAL if size is zero")
      
      kernels >= v4.14.y
          ("Revert "mlx5: move affinity hints assignments to generic code"")
      
      All above patches apply and compile with no issues on corresponding -stable.
      ===================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Willem de Bruijn
      skbuff: skb_copy_ubufs must release uarg even without user frags · b90ddd56
      Willem de Bruijn authored
      skb_copy_ubufs creates a private copy of frags[] to release its hold
      on user frags, then calls uarg->callback to notify the owner.
      
      Call uarg->callback even when no frags exist. This edge case can
      happen when zerocopy_sg_from_iter finds enough room in skb_headlen
      to copy all the data.
      
      Fixes: 3ece7826 ("sock: skb_copy_ubufs support for compound pages")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Willem de Bruijn
      skbuff: orphan frags before zerocopy clone · 268b7906
      Willem de Bruijn authored
      Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate
      calls to skb_uarg(skb)->callback for the same data.
      
      skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb
      with each segment. This is only safe for uargs that do refcounting,
      which is those that pass skb_orphan_frags without dropping their
      shared frags. For others, skb_orphan_frags drops the user frags and
      sets the uarg to NULL, after which sock_zerocopy_clone has no effect.
      
      Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback
      calls for the same data causing the vhost_net_ubuf_ref->refcount to
      drop below zero.
      
      Link: http://lkml.kernel.org/r/<CAF=yD-LWyCD4Y0aJ9O0e_CHLR+3JOeKicRRTEVCPxgw4XOcqGQ@mail.gmail.com>
      Fixes: 1f8b977a ("sock: enable MSG_ZEROCOPY")
      Reported-by: Andreas Hartmann <andihartmann@01019freenet.de>
      Reported-by: David Hill <dhill@redhat.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Linus Torvalds
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 9035a896
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "It's been a few weeks, so here's a small collection of fixes that
        should go into the current series.
      
        This contains:
      
         - NVMe pull request from Christoph, with a few important fixes.
      
         - kyber hang fix from Omar.
      
         - A blk-throttl fix from Shaohua, fixing a case where we double
           charge a bio.
      
         - Two call_single_data alignment fixes from me, fixing up some
           unfortunate changes that went into 4.14 without being properly
           reviewed on the block side (since nobody was CC'ed on the
           patch...).
      
         - A bounce buffer fix in two parts, one from me and one from Ming.
      
         - Revert bdi debug error handling patch. It's causing boot issues for
           some folks, and a week down the line, we're still no closer to a
           fix. Revert this patch for now until it's figured out, then we can
           retry for 4.16"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        Revert "bdi: add error handle for bdi_debug_register"
        null_blk: unalign call_single_data
        block: unalign call_single_data in struct request
        block-throttle: avoid double charge
        block: fix blk_rq_append_bio
        block: don't let passthrough IO go into .make_request_fn()
        nvme: setup streams after initializing namespace head
        nvme: check hw sectors before setting chunk sectors
        nvme: call blk_integrity_unregister after queue is cleaned up
        nvme-fc: remove double put reference if admin connect fails
        nvme: set discard_alignment to zero
        kyber: fix another domain token wait queue hang
    • Linus Torvalds
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 409232a4
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "ARM fixes:
         - A bug in handling of SPE state for non-vhe systems
         - A fix for a crash on system shutdown
         - Three timer fixes, introduced by the timer optimizations for v4.15
      
        x86 fixes:
         - fix for a WARN that was introduced in 4.15
         - fix for SMM when guest uses PCID
         - fixes for several bugs found by syzkaller
      
        ... and a dozen papercut fixes for the kvm_stat tool"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
        tools/kvm_stat: sort '-f help' output
        kvm: x86: fix RSM when PCID is non-zero
        KVM: Fix stack-out-of-bounds read in write_mmio
        KVM: arm/arm64: Fix timer enable flow
        KVM: arm/arm64: Properly handle arch-timer IRQs after vtimer_save_state
        KVM: arm/arm64: timer: Don't set irq as forwarded if no usable GIC
        KVM: arm/arm64: Fix HYP unmapping going off limits
        arm64: kvm: Prevent restoring stale PMSCR_EL1 for vcpu
        KVM/x86: Check input paging mode when cs.l is set
        tools/kvm_stat: add line for totals
        tools/kvm_stat: stop ignoring unhandled arguments
        tools/kvm_stat: suppress usage information on command line errors
        tools/kvm_stat: handle invalid regular expressions
        tools/kvm_stat: add hint on '-f help' to man page
        tools/kvm_stat: fix child trace events accounting
        tools/kvm_stat: fix extra handling of 'help' with fields filter
        tools/kvm_stat: fix missing field update after filter change
        tools/kvm_stat: fix drilldown in events-by-guests mode
        tools/kvm_stat: fix command line option '-g'
        kvm: x86: fix WARN due to uninitialized guest FPU state
        ...
      409232a4
    • Shaohua Li's avatar
      net: reevalulate autoflowlabel setting after sysctl setting · 513674b5
      Shaohua Li authored
      sysctl.ipv6.auto_flowlabels defaults to 1. On our hosts, we set it to 2.
      If a socket does not set autoflowlabel via sockopt, outgoing packets from
      the host are not supposed to carry a flowlabel. This holds for normal
      packets, but not for reset packets.
      
      The reason is that ipv6_pinfo.autoflowlabel is set at sock creation. If
      sysctl.ipv6.auto_flowlabels is changed later, ipv6_pinfo.autoflowlabel is
      not updated, so the sock keeps its old auto flowlabel behavior. Reset
      packets suffer from this problem because they are sent from a special
      control socket that is created at boot time. Since
      sysctl.ipv6.auto_flowlabels is 1 by default, the control socket always
      has its ipv6_pinfo.autoflowlabel set, even after the user sets
      sysctl.ipv6.auto_flowlabels to 2, so reset packets always carry a
      flowlabel. A normal sock created before the sysctl change suffers from
      the same issue. We can't even turn off autoflowlabel unless we kill all
      socks on the host.
      
      To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
      autoflowlabel setting from user, otherwise we always call
      ip6_default_np_autolabel() which has the new settings of sysctl.
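      
      A minimal sketch of the lookup described above, using simplified
      stand-in types. struct net_sketch, struct sock_sketch, the
      autoflowlabel_set flag and effective_autolabel() are hypothetical names
      for illustration, not the kernel's definitions. Only sysctl modes 1 and
      2 are described in this message; how modes 0 and 3 map is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct net_sketch {
    int auto_flowlabels;        /* models the auto_flowlabels sysctl */
};

struct sock_sketch {
    bool autoflowlabel;         /* value requested via IPV6_AUTOFLOWLABEL */
    bool autoflowlabel_set;     /* true only if the sockopt was used */
};

/* Models ip6_default_np_autolabel(): derive the default from the sysctl.
 * Modes 1 (on) and 2 (off) follow the commit text; treating 0 as off and
 * 3 as on is an assumption of this sketch. */
bool default_autolabel(const struct net_sketch *net)
{
    return net->auto_flowlabels == 1 || net->auto_flowlabels == 3;
}

/* Evaluated at send time, so a later sysctl change takes effect. */
bool effective_autolabel(const struct net_sketch *net,
                         const struct sock_sketch *sk)
{
    if (sk->autoflowlabel_set)
        return sk->autoflowlabel;       /* explicit user choice wins */
    return default_autolabel(net);      /* otherwise follow the sysctl */
}

/* Usage: a control socket created while the sysctl was still 1. */
void example(void)
{
    struct net_sketch net = { .auto_flowlabels = 1 };
    struct sock_sketch ctl = { .autoflowlabel = false,
                               .autoflowlabel_set = false };

    assert(effective_autolabel(&net, &ctl));    /* default: label on */
    net.auto_flowlabels = 2;                    /* admin changes the sysctl */
    assert(!effective_autolabel(&net, &ctl));   /* the socket follows it */
}
```

      Because the default is looked up on every call instead of being cached
      in the sock, a socket that never used the sockopt picks up the new
      sysctl value immediately.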
      
      Note that this changes behavior slightly. Before commit 42240901
      (ipv6: Implement different admin modes for automatic flow labels), the
      autoflowlabel behavior of a sock was not sticky, e.g. if the sysctl
      changed, an existing connection changed its autoflowlabel behavior. After
      that commit, autoflowlabel behavior was sticky for the whole life of the
      sock. With this patch, the behavior is no longer sticky.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      513674b5
    • Eric Garver's avatar
      openvswitch: Fix pop_vlan action for double tagged frames · c48e7473
      Eric Garver authored
      skb_vlan_pop() expects skb->protocol to be a valid TPID for double
      tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop()
      shift the true ethertype into position for us.
      
      Fixes: 5108bbad ("openvswitch: add processing of L3 packets")
      Signed-off-by: Eric Garver <e@erig.me>
      Reviewed-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c48e7473
    • Jens Axboe's avatar
      Revert "bdi: add error handle for bdi_debug_register" · 6d0e4827
      Jens Axboe authored
      This reverts commit a0747a85.
      
      It breaks booting for some users, and more than a week
      into this, there's still no good fix. Revert this commit
      for now until a solution has been found.
      Reported-by: Laura Abbott <labbott@redhat.com>
      Reported-by: Bruno Wolff III <bruno@wolff.to>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6d0e4827
    • Ido Schimmel's avatar
      ipv6: Honor specified parameters in fibmatch lookup · 58acfd71
      Ido Schimmel authored
      Currently, parameters such as oif and source address are not taken into
      account during fibmatch lookup. Example (IPv4 for reference) before
      patch:
      
      $ ip -4 route show
      192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
      198.51.100.0/24 dev dummy1 proto kernel scope link src 198.51.100.1
      
      $ ip -6 route show
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      2001:db8:2::/64 dev dummy1 proto kernel metric 256 pref medium
      fe80::/64 dev dummy0 proto kernel metric 256 pref medium
      fe80::/64 dev dummy1 proto kernel metric 256 pref medium
      
      $ ip -4 route get fibmatch 192.0.2.2 oif dummy0
      192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
      $ ip -4 route get fibmatch 192.0.2.2 oif dummy1
      RTNETLINK answers: No route to host
      
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      
      After:
      
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
      RTNETLINK answers: Network is unreachable
      
      The problem stems from the fact that the necessary route lookup flags
      are not set based on these parameters.
      
      Instead of duplicating the same logic for fibmatch, we can simply
      resolve the original route from its copy and dump it instead.
      
      Fixes: 18c3a61c ("net: ipv6: RTM_GETROUTE: return matched fib result when requested")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58acfd71
    • Stefan Raspl's avatar
      tools/kvm_stat: sort '-f help' output · aa12f594
      Stefan Raspl authored
      Sort the fields returned by specifying '-f help' on the command line.
      While at it, simplify the code a bit, indent the output and eliminate an
      extra blank line at the beginning.
      Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aa12f594
    • Paolo Bonzini's avatar
      kvm: x86: fix RSM when PCID is non-zero · fae1a3e7
      Paolo Bonzini authored
      rsm_load_state_64() and rsm_enter_protected_mode() load CR3, then
      CR4 & ~PCIDE, then CR0, then CR4.
      
      However, setting CR4.PCIDE fails if CR3[11:0] != 0.  It's probably easier
      in the long run to replace rsm_enter_protected_mode() with an emulator
      callback that sets all the special registers (like KVM_SET_SREGS would
      do).  For now, set the PCID field of CR3 only after CR4.PCIDE is 1.
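      
      The ordering problem can be sketched with a toy model. struct
      cpu_sketch, set_cr4() and the two restore_*() helpers are hypothetical
      names for illustration and do not mirror the emulator's code; the CR0
      load is omitted. The only rule modelled is the one quoted above:
      enabling CR4.PCIDE fails while CR3[11:0] != 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR4_PCIDE      (1ULL << 17)
#define CR3_PCID_MASK  0xfffULL

/* Toy register file, for illustration only. */
struct cpu_sketch {
    uint64_t cr3;
    uint64_t cr4;
};

/* Rejects turning on CR4.PCIDE while CR3[11:0] is non-zero. */
bool set_cr4(struct cpu_sketch *cpu, uint64_t val)
{
    bool enabling = (val & CR4_PCIDE) && !(cpu->cr4 & CR4_PCIDE);

    if (enabling && (cpu->cr3 & CR3_PCID_MASK))
        return false;
    cpu->cr4 = val;
    return true;
}

/* Old order: CR3 (with PCID), CR4 & ~PCIDE, [CR0], CR4. */
bool restore_old_order(struct cpu_sketch *cpu, uint64_t cr3, uint64_t cr4)
{
    cpu->cr3 = cr3;
    if (!set_cr4(cpu, cr4 & ~CR4_PCIDE))
        return false;
    return set_cr4(cpu, cr4);           /* fails when the PCID is non-zero */
}

/* New order: load CR3 without the PCID, enable PCIDE, then add the PCID. */
bool restore_new_order(struct cpu_sketch *cpu, uint64_t cr3, uint64_t cr4)
{
    uint64_t pcid = 0;

    if (cr4 & CR4_PCIDE) {
        pcid = cr3 & CR3_PCID_MASK;
        cr3 &= ~CR3_PCID_MASK;
    }
    cpu->cr3 = cr3;
    if (!set_cr4(cpu, cr4 & ~CR4_PCIDE) || !set_cr4(cpu, cr4))
        return false;
    if (pcid)
        cpu->cr3 = cr3 | pcid;          /* safe now that PCIDE is 1 */
    return true;
}

void example(void)
{
    uint64_t saved_cr3 = 0x1000 | 0x5;  /* page-table base plus PCID 5 */
    struct cpu_sketch a = { 0, 0 }, b = { 0, 0 };

    assert(!restore_old_order(&a, saved_cr3, CR4_PCIDE));
    assert(restore_new_order(&b, saved_cr3, CR4_PCIDE));
    assert(b.cr3 == saved_cr3);
}
```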
      Reported-by: Laszlo Ersek <lersek@redhat.com>
      Tested-by: Laszlo Ersek <lersek@redhat.com>
      Fixes: 660a5d51
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fae1a3e7
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 8b6ca2bf
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2017-12-21
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix multiple security issues in the BPF verifier mostly related
         to the value and min/max bounds tracking rework in 4.14. Issues
         range from incorrect bounds calculation in some BPF_RSH cases,
         to improper sign extension and reg size handling on 32 bit
         ALU ops, missing strict alignment checks on stack pointers, and
         several others that got fixed, from Jann, Alexei and Edward.
      
      2) Fix various build failures in BPF selftests on sparc64. More
         specifically, librt needed to be added to the libs to link
         against, along with a few format string fixups for sizeof, from David.
      
      3) Fix one last remaining issue from the BPF selftest build that was
         still occurring on s390x from the asm/bpf_perf_event.h include,
         which could not find the asm/ptrace.h copy, from Hendrik.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b6ca2bf
    • Alexei Starovoitov's avatar
      bpf: do not allow root to mangle valid pointers · 82abbf8d
      Alexei Starovoitov authored
      Do not allow root to convert valid pointers into unknown scalars.
      In particular disallow:
       ptr &= reg
       ptr <<= reg
       ptr += ptr
      and explicitly allow:
       ptr -= ptr
      since pkt_end - pkt == length
      
      1.
      This minimizes the number of address leaks root can cause.
      In the future we may need to further tighten the leaks with kptr_restrict.
      
      2.
      If a program has such pointer math, it's likely a user mistake. When the
      verifier complains about it right away, instead of many instructions
      later on an invalid memory access, it's easier for users to fix their progs.
      
      3.
      When a register holding a pointer cannot change to a scalar, JITs can
      optimize better. For example, 32-bit archs could use a single register
      for pointers instead of the pair required to hold 64-bit scalars.
      
      4.
      It reduces architecture-dependent behavior, since code such as:
      r1 = r10;
      r1 &= 0xff;
      if (r1 ...)
      will behave differently on arm64 vs x64 and offloaded vs native.
      
      A significant chunk of ptr mangling was allowed by
      commit f1174f77 ("bpf/verifier: rework value tracking")
      yet some of it was allowed even earlier.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      82abbf8d
    • Daniel Borkmann's avatar
      Merge branch 'bpf-verifier-sec-fixes' · 3db9128f
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      This patch set addresses a set of security vulnerabilities
      in bpf verifier logic discovered by Jann Horn.
      All of the patches are candidates for 4.14 stable.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3db9128f
    • Jann Horn's avatar
      selftests/bpf: add tests for recent bugfixes · 2255f8d5
      Jann Horn authored
      These tests should cover the following cases:
      
       - MOV with both zero-extended and sign-extended immediates
       - implicit truncation of register contents via ALU32/MOV32
       - implicit 32-bit truncation of ALU32 output
       - oversized register source operand for ALU32 shift
       - right-shift of a number that could be positive or negative
       - map access where adding the operation size to the offset causes signed
         32-bit overflow
       - direct stack access at a ~4GiB offset
      
      Also remove the F_LOAD_WITH_STRICT_ALIGNMENT flag from a bunch of tests
      that should fail independent of what flags userspace passes.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2255f8d5
    • Alexei Starovoitov's avatar
      bpf: fix integer overflows · bb7f0f98
      Alexei Starovoitov authored
      There were various issues related to the limited size of integers used in
      the verifier:
       - `off + size` overflow in __check_map_access()
       - `off + reg->off` overflow in check_mem_access()
       - `off + reg->var_off.value` overflow or 32-bit truncation of
         `reg->var_off.value` in check_mem_access()
       - 32-bit truncation in check_stack_boundary()
      
      Make sure that any integer math cannot overflow by not allowing
      pointer math with large values.
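      
      A hedged illustration of this class of bug, not the verifier's actual
      code: a bounds check whose 32-bit sum wraps can accept an out-of-range
      access, while bounding both operands first keeps the sum from
      overflowing. SKETCH_MAX_OFF and all function names are hypothetical,
      and the bound 1 << 29 is chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_OFF (1 << 29)    /* hypothetical bound on operands */

/* 32-bit two's complement wrapping add. The conversion back to int32_t is
 * implementation-defined in C; GCC and Clang wrap modulo 2^32. */
int32_t wrapping_add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Naive check: the end offset can wrap negative and pass the comparison. */
bool naive_in_bounds(int32_t off, int32_t size, int32_t limit)
{
    return off >= 0 && size > 0 && wrapping_add32(off, size) <= limit;
}

/* Bounded check: reject large operands before doing any math. */
bool bounded_in_bounds(int32_t off, int32_t size, int32_t limit)
{
    if (off < 0 || off >= SKETCH_MAX_OFF)
        return false;
    if (size <= 0 || size >= SKETCH_MAX_OFF)
        return false;
    /* Both operands are below 2^29, so the sum is below 2^30. */
    return off + size <= limit;
}

void example(void)
{
    /* A huge offset slips through the naive check because the sum wraps. */
    assert(naive_in_bounds(INT32_MAX, 1, 64));
    assert(!bounded_in_bounds(INT32_MAX, 1, 64));

    /* Ordinary accesses behave the same in both versions. */
    assert(naive_in_bounds(8, 8, 64) && bounded_in_bounds(8, 8, 64));
    assert(!naive_in_bounds(60, 8, 64) && !bounded_in_bounds(60, 8, 64));
}
```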
      
      Also reduce the scope of "scalar op scalar" tracking.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      bb7f0f98
    • Jann Horn's avatar
      bpf: don't prune branches when a scalar is replaced with a pointer · 179d1c56
      Jann Horn authored
      This could be made safe by passing through a reference to env and checking
      for env->allow_ptr_leaks, but it would only work one way and is probably
      not worth the hassle - not doing it will not directly lead to program
      rejection.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      179d1c56
    • Jann Horn's avatar
      bpf: force strict alignment checks for stack pointers · a5ec6ae1
      Jann Horn authored
      Force strict alignment checks for stack pointers because the tracking of
      stack spills relies on it; unaligned stack accesses can lead to corruption
      of spilled registers, which is exploitable.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a5ec6ae1
    • Jann Horn's avatar
      bpf: fix missing error return in check_stack_boundary() · ea25f914
      Jann Horn authored
      Prevent indirect stack accesses at non-constant addresses, which would
      permit reading and corrupting spilled pointers.
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ea25f914
    • Jann Horn's avatar
      bpf: fix 32-bit ALU op verification · 468f6eaf
      Jann Horn authored
      32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
      Adjust the verifier accordingly.
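      
      For reference, the runtime semantics the verifier has to model can be
      sketched as follows. alu32_add() and alu64_add() are hypothetical
      helper names; the sketch assumes the usual eBPF behavior that a 32-bit
      ALU result is zero-extended into the 64-bit destination register.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit add: uses and produces the full register width. */
uint64_t alu64_add(uint64_t dst, uint64_t src)
{
    return dst + src;
}

/* 32-bit add: only the low halves take part, and the 32-bit result is
 * zero-extended into the 64-bit destination. */
uint64_t alu32_add(uint64_t dst, uint64_t src)
{
    return (uint32_t)((uint32_t)dst + (uint32_t)src);
}

void example(void)
{
    /* The carry out of bit 31 is kept by the 64-bit op, dropped by ALU32. */
    assert(alu64_add(0xffffffffULL, 1) == 0x100000000ULL);
    assert(alu32_add(0xffffffffULL, 1) == 0);

    /* Upper input bits are ignored by the 32-bit op. */
    assert(alu32_add(0x1234567800000001ULL, 1) == 2);
}
```

      A verifier that tracked the 32-bit case with 64-bit bounds would
      conclude that the first ALU32 result could be 0x100000000, which the
      hardware never produces.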
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      468f6eaf
    • Jann Horn's avatar
      bpf: fix incorrect tracking of register size truncation · 0c17d1d2
      Jann Horn authored
      Properly handle register truncation to a smaller size.
      
      The old code first mirrors the clearing of the high 32 bits in the bitwise
      tristate representation, which is correct. But then, it computes the new
      arithmetic bounds as the intersection between the old arithmetic bounds and
      the bounds resulting from the bitwise tristate representation. Therefore,
      when coerce_reg_to_32() is called on a number with bounds
      [0xffff'fff8, 0x1'0000'0007], the verifier computes
      [0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
      This is incorrect: The truncated number could also be in the range [0, 7],
      and no meaningful arithmetic bounds can be computed in that case apart from
      the obvious [0, 0xffff'ffff].
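      
      The example above can be checked with a small sketch. struct range64
      and truncate_range32() are hypothetical names, and this is one sound
      way to truncate an unsigned range, not necessarily the code used in
      the fix: keep the low halves only when the upper 32 bits of both
      bounds agree, otherwise fall back to the full 32-bit range.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive unsigned bounds of a tracked value. */
struct range64 {
    uint64_t min;
    uint64_t max;
};

/* Sound truncation of [min, max] to 32 bits. */
struct range64 truncate_range32(struct range64 r)
{
    struct range64 out;

    if ((r.min >> 32) == (r.max >> 32)) {
        /* Same upper half: the low 32 bits cannot wrap inside the range. */
        out.min = (uint32_t)r.min;
        out.max = (uint32_t)r.max;
    } else {
        /* The range crosses a 2^32 boundary: nothing better is known. */
        out.min = 0;
        out.max = UINT32_MAX;
    }
    return out;
}

void example(void)
{
    struct range64 r = { 0xfffffff8ULL, 0x100000007ULL };
    struct range64 t = truncate_range32(r);

    /* 0x100000003 lies in the original range and truncates to 3, which is
     * outside [0xfffffff8, 0xffffffff], so that bound is unsound. */
    assert((uint32_t)0x100000003ULL == 3);

    /* The sound answer is the full 32-bit range. */
    assert(t.min == 0 && t.max == 0xffffffffULL);
}
```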
      
      Starting with v4.14, this is exploitable by unprivileged users as long as
      the unprivileged_bpf_disabled sysctl isn't set.
      
      Debian assigned CVE-2017-16996 for this issue.
      
      v2:
       - flip the mask during arithmetic bounds calculation (Ben Hutchings)
      v3:
       - add CVE number (Ben Hutchings)
      
      Fixes: b03c9f9f ("bpf/verifier: track signed and unsigned min/max values")
      Signed-off-by: Jann Horn <jannh@google.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0c17d1d2