1. 06 Feb, 2019 22 commits
    • Nir Dotan's avatar
      ip6mr: Fix notifiers call on mroute_clean_tables() · 505e5f3d
      Nir Dotan authored
      [ Upstream commit 146820cc ]
      
      When the MC route socket is closed, mroute_clean_tables() is called to
      cleanup existing routes. Mistakenly notifiers call was put on the cleanup
      of the unresolved MC route entries cache.
      In a case where the MC socket closes before an unresolved route expires,
      the notifier call leads to a crash, caused by the driver trying to
      increment a non initialized refcount_t object [1] and then when handling
      is done, to decrement it [2]. This was detected by a test recently added in
      commit 6d4efada ("selftests: forwarding: Add multicast routing test").
      
      Fix that by putting notifiers call on the resolved entries traversal,
      instead of on the unresolved entries traversal.
      
      [1]
      
      [  245.748967] refcount_t: increment on 0; use-after-free.
      [  245.754829] WARNING: CPU: 3 PID: 3223 at lib/refcount.c:153 refcount_inc_checked+0x2b/0x30
      ...
      [  245.802357] Hardware name: Mellanox Technologies Ltd. MSN2740/SA001237, BIOS 5.6.5 06/07/2016
      [  245.811873] RIP: 0010:refcount_inc_checked+0x2b/0x30
      ...
      [  245.907487] Call Trace:
      [  245.910231]  mlxsw_sp_router_fib_event.cold.181+0x42/0x47 [mlxsw_spectrum]
      [  245.917913]  notifier_call_chain+0x45/0x7
      [  245.922484]  atomic_notifier_call_chain+0x15/0x20
      [  245.927729]  call_fib_notifiers+0x15/0x30
      [  245.932205]  mroute_clean_tables+0x372/0x3f
      [  245.936971]  ip6mr_sk_done+0xb1/0xc0
      [  245.940960]  ip6_mroute_setsockopt+0x1da/0x5f0
      ...
      
      [2]
      
      [  246.128487] refcount_t: underflow; use-after-free.
      [  246.133859] WARNING: CPU: 0 PID: 7 at lib/refcount.c:187 refcount_sub_and_test_checked+0x4c/0x60
      [  246.183521] Hardware name: Mellanox Technologies Ltd. MSN2740/SA001237, BIOS 5.6.5 06/07/2016
      ...
      [  246.193062] Workqueue: mlxsw_core_ordered mlxsw_sp_router_fibmr_event_work [mlxsw_spectrum]
      [  246.202394] RIP: 0010:refcount_sub_and_test_checked+0x4c/0x60
      ...
      [  246.298889] Call Trace:
      [  246.301617]  refcount_dec_and_test_checked+0x11/0x20
      [  246.307170]  mlxsw_sp_router_fibmr_event_work.cold.196+0x47/0x78 [mlxsw_spectrum]
      [  246.315531]  process_one_work+0x1fa/0x3f0
      [  246.320005]  worker_thread+0x2f/0x3e0
      [  246.324083]  kthread+0x118/0x130
      [  246.327683]  ? wq_update_unbound_numa+0x1b0/0x1b0
      [  246.332926]  ? kthread_park+0x80/0x80
      [  246.337013]  ret_from_fork+0x1f/0x30
      
      Fixes: 088aa3ee ("ip6mr: Support fib notifications")
      Signed-off-by: default avatarNir Dotan <nird@mellanox.com>
      Reviewed-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      505e5f3d
    • Aya Levin's avatar
      net/mlx5e: Allow MAC invalidation while spoofchk is ON · 50990a40
      Aya Levin authored
      [ Upstream commit 9d2cbdc5 ]
      
      Prior to this patch the driver prohibited spoof checking on invalid MAC.
      Now the user can set this configuration if it wishes to.
      
      This is required since libvirt might invalidate the VF Mac by setting it
      to zero, while spoofcheck is ON.
      
      Fixes: 1ab2068a ("net/mlx5: Implement vports admin state backup/restore")
      Signed-off-by: default avatarAya Levin <ayal@mellanox.com>
      Reviewed-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      50990a40
    • Xin Long's avatar
      sctp: improve the events for sctp stream adding · 4ec13999
      Xin Long authored
      [ Upstream commit 8220c870 ]
      
      This patch is to improve sctp stream adding events in 2 places:
      
        1. In sctp_process_strreset_addstrm_out(), move up SCTP_MAX_STREAM
           and in stream allocation failure checks, as the adding has to
           succeed after reconf_timer stops for the in stream adding
           request retransmission.
      
        3. In sctp_process_strreset_addstrm_in(), no event should be sent,
           as no in or out stream is added here.
      
      Fixes: 50a41591 ("sctp: implement receiver-side procedures for the Add Outgoing Streams Request Parameter")
      Fixes: c5c4ebb3 ("sctp: implement receiver-side procedures for the Add Incoming Streams Request Parameter")
      Reported-by: default avatarYing Xu <yinxu@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ec13999
    • Lorenzo Bianconi's avatar
      net: ip6_gre: always reports o_key to userspace · 9f7d849b
      Lorenzo Bianconi authored
      [ Upstream commit c706863b ]
      
      As Erspan_v4, Erspan_v6 protocol relies on o_key to configure
      session id header field. However TUNNEL_KEY bit is cleared in
      ip6erspan_tunnel_xmit since ERSPAN protocol does not set the key field
      of the external GRE header and so the configured o_key is not reported
      to userspace. The issue can be triggered with the following reproducer:
      
      $ip link add ip6erspan1 type ip6erspan local 2000::1 remote 2000::2 \
          key 1 seq erspan_ver 1
      $ip link set ip6erspan1 up
      ip -d link sh ip6erspan1
      
      ip6erspan1@NONE: <BROADCAST,MULTICAST> mtu 1422 qdisc noop state DOWN mode DEFAULT
          link/ether ba:ff:09:24:c3:0e brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
          ip6erspan remote 2000::2 local 2000::1 encaplimit 4 flowlabel 0x00000 ikey 0.0.0.1 iseq oseq
      
      Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
      ip6gre_fill_info
      
      Fixes: 5a963eb6 ("ip6_gre: Add ERSPAN native tunnel support")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f7d849b
    • Jason Wang's avatar
      vhost: fix OOB in get_rx_bufs() · aafe74b7
      Jason Wang authored
      [ Upstream commit b46a0bf7 ]
      
      After batched used ring updating was introduced in commit e2b3b35e
      ("vhost_net: batch used ring update in rx"). We tend to batch heads in
      vq->heads for more than one packet. But the quota passed to
      get_rx_bufs() was not correctly limited, which can result a OOB write
      in vq->heads.
      
              headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
                          vhost_len, &in, vq_log, &log,
                          likely(mergeable) ? UIO_MAXIOV : 1);
      
      UIO_MAXIOV was still used which is wrong since we could have batched
      used in vq->heads, this will cause OOB if the next buffer needs more
      than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH)) heads after we've
      batched 64 (VHOST_NET_BATCH) heads:
      Acked-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      
      =============================================================================
      BUG kmalloc-8k (Tainted: G    B            ): Redzone overwritten
      -----------------------------------------------------------------------------
      
      INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc
      INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674
          kmem_cache_alloc_trace+0xbb/0x140
          alloc_pd+0x22/0x60
          gen8_ppgtt_create+0x11d/0x5f0
          i915_ppgtt_create+0x16/0x80
          i915_gem_create_context+0x248/0x390
          i915_gem_context_create_ioctl+0x4b/0xe0
          drm_ioctl_kernel+0xa5/0xf0
          drm_ioctl+0x2ed/0x3a0
          do_vfs_ioctl+0x9f/0x620
          ksys_ioctl+0x6b/0x80
          __x64_sys_ioctl+0x11/0x20
          do_syscall_64+0x43/0xf0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x          (null) flags=0x200000000010201
      INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b
      
      Fixing this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for
      vhost-net. This is done through set the limitation through
      vhost_dev_init(), then set_owner can allocate the number of iov in a
      per device manner.
      
      This fixes CVE-2018-16880.
      
      Fixes: e2b3b35e ("vhost_net: batch used ring update in rx")
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aafe74b7
    • Mathias Thore's avatar
      ucc_geth: Reset BQL queue when stopping device · d0773852
      Mathias Thore authored
      [ Upstream commit e15aa3b2 ]
      
      After a timeout event caused by for example a broadcast storm, when
      the MAC and PHY are reset, the BQL TX queue needs to be reset as
      well. Otherwise, the device will exhibit severe performance issues
      even after the storm has ended.
      Co-authored-by: default avatarDavid Gounaris <david.gounaris@infinera.com>
      Signed-off-by: default avatarMathias Thore <mathias.thore@infinera.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d0773852
    • George Amanakis's avatar
      tun: move the call to tun_set_real_num_queues · dace5274
      George Amanakis authored
      [ Upstream commit 3a03cb84 ]
      
      Call tun_set_real_num_queues() after the increment of tun->numqueues
      since the former depends on it. Otherwise, the number of queues is not
      correctly accounted for, which results to warnings similar to:
      "vnet0 selects TX queue 11, but real number of TX queues is 11".
      
      Fixes: 0b7959b6 ("tun: publish tfile after it's fully initialized")
      Reported-and-tested-by: default avatarGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: default avatarGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dace5274
    • Xin Long's avatar
      sctp: improve the events for sctp stream reset · e569927a
      Xin Long authored
      [ Upstream commit 2e6dc4d9 ]
      
      This patch is to improve sctp stream reset events in 4 places:
      
        1. In sctp_process_strreset_outreq(), the flag should always be set with
           SCTP_STREAM_RESET_INCOMING_SSN instead of OUTGOING, as receiver's in
           stream is reset here.
        2. In sctp_process_strreset_outreq(), move up SCTP_STRRESET_ERR_WRONG_SSN
           check, as the reset has to succeed after reconf_timer stops for the
           in stream reset request retransmission.
        3. In sctp_process_strreset_inreq(), no event should be sent, as no in
           or out stream is reset here.
        4. In sctp_process_strreset_resp(), SCTP_STREAM_RESET_INCOMING_SSN or
           OUTGOING event should always be sent for stream reset requests, no
           matter it fails or succeeds to process the request.
      
      Fixes: 81054476 ("sctp: implement receiver-side procedures for the Outgoing SSN Reset Request Parameter")
      Fixes: 16e1a919 ("sctp: implement receiver-side procedures for the Incoming SSN Reset Request Parameter")
      Fixes: 11ae76e6 ("sctp: implement receiver-side procedures for the Reconf Response Parameter")
      Reported-by: default avatarYing Xu <yinxu@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e569927a
    • Simon Horman's avatar
      ravb: expand rx descriptor data to accommodate hw checksum · 4fae696c
      Simon Horman authored
      [ Upstream commit 12da6430 ]
      
      EtherAVB may provide a checksum of packet data appended to packet data. In
      order to allow this checksum to be received by the host descriptor data
      needs to be enlarged by 2 bytes to accommodate the checksum.
      
      In the case of MTU-sized packets without a VLAN tag the
      checksum were already accommodated by virtue of the space reserved for the
      VLAN tag. However, a packet of MTU-size with a  VLAN tag consumed all
      packet data space provided by a descriptor leaving no space for the
      trailing checksum.
      
      This was not detected by the driver which incorrectly used the last two
      bytes of packet data as the checksum and truncate the packet by two bytes.
      This resulted all such packets being dropped.
      
      A work around is to disable RX checksum offload
       # ethtool -K eth0 rx off
      
      This patch resolves this problem by increasing the size available for
      packet data in RX descriptors by two bytes.
      
      Tested on R-Car E3 (r8a77990) ES1.0 based Ebisu-4D board
      
      v2
      * Use sizeof(__sum16) directly rather than adding a driver-local
        #define for the size of the checksum provided by the hw (2 bytes).
      
      Fixes: 4d86d381 ("ravb: RX checksum offload")
      Signed-off-by: default avatarSimon Horman <horms+renesas@verge.net.au>
      Reviewed-by: default avatarSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fae696c
    • Josh Elsasser's avatar
      net: set default network namespace in init_dummy_netdev() · 5f1a18e0
      Josh Elsasser authored
      [ Upstream commit 35edfdc7 ]
      
      Assign a default net namespace to netdevs created by init_dummy_netdev().
      Fixes a NULL pointer dereference caused by busy-polling a socket bound to
      an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat
      if napi_poll() received packets:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
        IP: napi_busy_loop+0xd6/0x200
        Call Trace:
          sock_poll+0x5e/0x80
          do_sys_poll+0x324/0x5a0
          SyS_poll+0x6c/0xf0
          do_syscall_64+0x6b/0x1f0
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 7db6b048 ("net: Commonize busy polling code to focus on napi_id instead of socket")
      Signed-off-by: default avatarJosh Elsasser <jelsasser@appneta.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5f1a18e0
    • Bernard Pidoux's avatar
      net/rose: fix NULL ax25_cb kernel panic · fc4154c7
      Bernard Pidoux authored
      [ Upstream commit b0cf0292 ]
      
      When an internally generated frame is handled by rose_xmit(),
      rose_route_frame() is called:
      
              if (!rose_route_frame(skb, NULL)) {
                      dev_kfree_skb(skb);
                      stats->tx_errors++;
                      return NETDEV_TX_OK;
              }
      
      We have the same code sequence in Net/Rom where an internally generated
      frame is handled by nr_xmit() calling nr_route_frame(skb, NULL).
      However, in this function NULL argument is tested while it is not in
      rose_route_frame().
      Then kernel panic occurs later on when calling ax25cmp() with a NULL
      ax25_cb argument as reported many times and recently with syzbot.
      
      We need to test if ax25 is NULL before using it.
      
      Testing:
      Built kernel with CONFIG_ROSE=y.
      Signed-off-by: default avatarBernard Pidoux <f6bvp@free.fr>
      Acked-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reported-by: syzbot+1a2c456a1ea08fa5b5f7@syzkaller.appspotmail.com
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Bernard Pidoux <f6bvp@free.fr>
      Cc: linux-hams@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fc4154c7
    • Cong Wang's avatar
      netrom: switch to sock timer API · 2c6b5724
      Cong Wang authored
      [ Upstream commit 63346650 ]
      
      sk_reset_timer() and sk_stop_timer() properly handle
      sock refcnt for timer function. Switching to them
      could fix a refcounting bug reported by syzbot.
      
      Reported-and-tested-by: syzbot+defa700d16f1bd1b9a05@syzkaller.appspotmail.com
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-hams@vger.kernel.org
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c6b5724
    • Aya Levin's avatar
      net/mlx4_core: Add masking for a few queries on HCA caps · 00865891
      Aya Levin authored
      [ Upstream commit a40ded60 ]
      
      Driver reads the query HCA capabilities without the corresponding masks.
      Without the correct masks, the base addresses of the queues are
      unaligned.  In addition some reserved bits were wrongly read.  Using the
      correct masks, ensures alignment of the base addresses and allows future
      firmware versions safe use of the reserved bits.
      
      Fixes: ab9c17a0 ("mlx4_core: Modify driver initialization flow to accommodate SRIOV for Ethernet")
      Fixes: 0ff1fb65 ("{NET, IB}/mlx4: Add device managed flow steering firmware API")
      Signed-off-by: default avatarAya Levin <ayal@mellanox.com>
      Signed-off-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      00865891
    • Lorenzo Bianconi's avatar
      net: ip_gre: use erspan key field for tunnel lookup · 0a198e0b
      Lorenzo Bianconi authored
      [ Upstream commit cb73ee40 ]
      
      Use ERSPAN key header field as tunnel key in gre_parse_header routine
      since ERSPAN protocol sets the key field of the external GRE header to
      0 resulting in a tunnel lookup fail in ip6gre_err.
      In addition remove key field parsing and pskb_may_pull check in
      erspan_rcv and ip6erspan_rcv
      
      Fixes: 5a963eb6 ("ip6_gre: Add ERSPAN native tunnel support")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a198e0b
    • Lorenzo Bianconi's avatar
      net: ip_gre: always reports o_key to userspace · 897ea28b
      Lorenzo Bianconi authored
      [ Upstream commit feaf5c79 ]
      
      Erspan protocol (version 1 and 2) relies on o_key to configure
      session id header field. However TUNNEL_KEY bit is cleared in
      erspan_xmit since ERSPAN protocol does not set the key field
      of the external GRE header and so the configured o_key is not reported
      to userspace. The issue can be triggered with the following reproducer:
      
      $ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
          key 1 seq erspan_ver 1
      $ip link set erspan1 up
      $ip -d link sh erspan1
      
      erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
        link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
        erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0
      
      Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
      ipgre_fill_info
      
      Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      897ea28b
    • Jacob Wen's avatar
      l2tp: fix reading optional fields of L2TPv3 · 8de67666
      Jacob Wen authored
      [ Upstream commit 4522a70d ]
      
      Use pskb_may_pull() to make sure the optional fields are in skb linear
      parts, so we can safely read them later.
      
      It's easy to reproduce the issue with a net driver that supports paged
      skb data. Just create a L2TPv3 over IP tunnel and then generates some
      network traffic.
      Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.
      
      Changes in v4:
      1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
      2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
      3. Add 'Fixes' in commit messages.
      
      Changes in v3:
      1. To keep consistency, move the code out of l2tp_recv_common.
      2. Use "net" instead of "net-next", since this is a bug fix.
      
      Changes in v2:
      1. Only fix L2TPv3 to make code simple.
         To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
         It's complicated to do so.
      2. Reloading pointers after pskb_may_pull
      
      Fixes: f7faffa3 ("l2tp: Add L2TPv3 protocol support")
      Fixes: 0d76751f ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
      Fixes: a32e0eec ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
      Signed-off-by: default avatarJacob Wen <jian.w.wen@oracle.com>
      Acked-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8de67666
    • Jacob Wen's avatar
      l2tp: copy 4 more bytes to linear part if necessary · 3d418a25
      Jacob Wen authored
      [ Upstream commit 91c52470 ]
      
      The size of L2TPv2 header with all optional fields is 14 bytes.
      l2tp_udp_recv_core only moves 10 bytes to the linear part of a
      skb. This may lead to l2tp_recv_common read data outside of a skb.
      
      This patch make sure that there is at least 14 bytes in the linear
      part of a skb to meet the maximum need of l2tp_udp_recv_core and
      l2tp_recv_common. The minimum size of both PPP HDLC-like frame and
      Ethernet frame is larger than 14 bytes, so we are safe to do so.
      
      Also remove L2TP_HDR_SIZE_NOSEQ, it is unused now.
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Suggested-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarJacob Wen <jian.w.wen@oracle.com>
      Acked-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d418a25
    • Daniel Borkmann's avatar
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · fcc9c69a
      Daniel Borkmann authored
      [ Upstream commit d5256083 ]
      
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s soley distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcc9c69a
    • Yohei Kanemaru's avatar
      ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation · 2f704348
      Yohei Kanemaru authored
      [ Upstream commit ef489749 ]
      
      skb->cb may contain data from previous layers (in an observed case
      IPv4 with L3 Master Device). In the observed scenario, the data in
      IPCB(skb)->frags was misinterpreted as IP6CB(skb)->frag_max_size,
      eventually caused an unexpected IPv6 fragmentation in ip6_fragment()
      through ip6_finish_output().
      
      This patch clears IP6CB(skb), which potentially contains garbage data,
      on the SRH ip4ip6 encapsulation.
      
      Fixes: 32d99d0b ("ipv6: sr: add support for ip4ip6 encapsulation")
      Signed-off-by: default avatarYohei Kanemaru <yohei.kanemaru@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f704348
    • David Ahern's avatar
      ipv6: Consider sk_bound_dev_if when binding a socket to an address · 7e9a6476
      David Ahern authored
      [ Upstream commit c5ee0663 ]
      
      IPv6 does not consider if the socket is bound to a device when binding
      to an address. The result is that a socket can be bound to eth0 and then
      bound to the address of eth1. If the device is a VRF, the result is that
      a socket can only be bound to an address in the default VRF.
      
      Resolve by considering the device if sk_bound_dev_if is set.
      
      This problem exists from the beginning of git history.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e9a6476
    • Arnd Bergmann's avatar
      drm/msm/gpu: fix building without debugfs · 8877843b
      Arnd Bergmann authored
      commit c878a628 upstream.
      
      When debugfs is disabled, but coredump is turned on, the adreno driver fails to build:
      
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:4: error: 'struct msm_gpu_funcs' has no member named 'show'
         .show = adreno_show,
          ^~~~
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:11: note: (near initialization for 'funcs.base')
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:11: error: initialization of 'void (*)(struct msm_gpu *, struct msm_gem_submit *, struct msm_file_private *)' from incompatible pointer type 'void (*)(struct msm_gpu *, struct msm_gpu_state *, struct drm_printer *)' [-Werror=incompatible-pointer-types]
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:11: note: (near initialization for 'funcs.base.submit')
      drivers/gpu/drm/msm/adreno/a4xx_gpu.c:546:4: error: 'struct msm_gpu_funcs' has no member named 'show'
      drivers/gpu/drm/msm/adreno/a5xx_gpu.c:1460:4: error: 'struct msm_gpu_funcs' has no member named 'show'
      drivers/gpu/drm/msm/adreno/a6xx_gpu.c:769:4: error: 'struct msm_gpu_funcs' has no member named 'show'
      drivers/gpu/drm/msm/msm_gpu.c: In function 'msm_gpu_devcoredump_read':
      drivers/gpu/drm/msm/msm_gpu.c:289:12: error: 'const struct msm_gpu_funcs' has no member named 'show'
      
      Adjust the #ifdef to make it build again.
      
      Fixes: c0fec7f5 ("drm/msm/gpu: Capture the GPU state on a GPU hang")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarRob Clark <robdclark@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8877843b
    • Greg Kroah-Hartman's avatar
      Fix "net: ipv4: do not handle duplicate fragments as overlapping" · 8c763a3c
      Greg Kroah-Hartman authored
      ade44640 ("net: ipv4: do not handle duplicate fragments as
      overlapping") was backported to many stable trees, but it had a problem
      that was "accidentally" fixed by the upstream commit 0ff89efb ("ip:
      fail fast on IP defrag errors")
      
      This is the fixup for that problem as we do not want the larger patch in
      the older stable trees.
      
      Fixes: ade44640 ("net: ipv4: do not handle duplicate fragments as overlapping")
      Reported-by: default avatarIvan Babrou <ivan@cloudflare.com>
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c763a3c
  2. 31 Jan, 2019 18 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.19.19 · dffbba43
      Greg Kroah-Hartman authored
      dffbba43
    • Deepa Dinamani's avatar
      Input: input_event - fix the CONFIG_SPARC64 mixup · 3a3b6a6b
      Deepa Dinamani authored
      commit 141e5dca upstream.
      
      Arnd Bergmann pointed out that CONFIG_* cannot be used in a uapi header.
      Override with an equivalent conditional.
      
      Fixes: 2e746942 ("Input: input_event - provide override for sparc64")
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a3b6a6b
    • Christoph Hellwig's avatar
      ide: fix a typo in the settings proc file name · d4a6ac28
      Christoph Hellwig authored
      commit f8ff6c73 upstream.
      
      Fixes: ec7d9c9c ("ide: replace ->proc_fops with ->proc_show")
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4a6ac28
    • Jack Pham's avatar
      usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanup · 25ad17d6
      Jack Pham authored
      commit bd674224 upstream.
      
      OUT endpoint requests may somtimes have this flag set when
      preparing to be submitted to HW indicating that there is an
      additional TRB chained to the request for alignment purposes.
      If that request is removed before the controller can execute the
      transfer (e.g. ep_dequeue/ep_disable), the request will not go
      through the dwc3_gadget_ep_cleanup_completed_request() handler
      and will not have its needs_extra_trb flag cleared when
      dwc3_gadget_giveback() is called.  This same request could be
      later requeued for a new transfer that does not require an
      extra TRB and if it is successfully completed, the cleanup
      and TRB reclamation will incorrectly process the additional TRB
      which belongs to the next request, and incorrectly advances the
      TRB dequeue pointer, thereby messing up calculation of the next
      requeust's actual/remaining count when it completes.
      
      The right thing to do here is to ensure that the flag is cleared
      before it is given back to the function driver.  A good place
      to do that is in dwc3_gadget_del_and_unmap_request().
      
      Fixes: c6267a51 ("usb: dwc3: gadget: align transfers to wMaxPacketSize")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJack Pham <jackp@codeaurora.org>
      Signed-off-by: default avatarFelipe Balbi <felipe.balbi@linux.intel.com>
      [jackp: backport to <= 4.20: replaced 'needs_extra_trb' with 'unaligned'
              and 'zero' members in patch and reworded commit text]
      Signed-off-by: default avatarJack Pham <jackp@codeaurora.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25ad17d6
    • Michal Hocko's avatar
      Revert "mm, memory_hotplug: initialize struct pages for the full memory section" · 6bab9573
      Michal Hocko authored
      commit 4aa9fc2a upstream.
      
      This reverts commit 2830bf6f.
      
      The underlying assumption that one sparse section belongs into a single
      numa node doesn't hold really. Robert Shteynfeld has reported a boot
      failure. The boot log was not captured but his memory layout is as
      follows:
      
        Early memory node ranges
          node   1: [mem 0x0000000000001000-0x0000000000090fff]
          node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
          node   1: [mem 0x0000000100000000-0x0000001423ffffff]
          node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      This means that node0 starts in the middle of a memory section which is
      also in node1.  memmap_init_zone tries to initialize padding of a
      section even when it is outside of the given pfn range because there are
      code paths (e.g.  memory hotplug) which assume that the full worth of
      memory section is always initialized.
      
      In this particular case, though, such a range is already intialized and
      most likely already managed by the page allocator.  Scribbling over
      those pages corrupts the internal state and likely blows up when any of
      those pages gets used.
      Reported-by: default avatarRobert Shteynfeld <robert.shteynfeld@gmail.com>
      Fixes: 2830bf6f ("mm, memory_hotplug: initialize struct pages for the full memory section")
      Cc: stable@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6bab9573
    • Raju Rangoju's avatar
      nvmet-rdma: fix null dereference under heavy load · 7dbf1297
      Raju Rangoju authored
      commit 5cbab630 upstream.
      
      Under heavy load if we don't have any pre-allocated rsps left, we
      dynamically allocate a rsp, but we are not actually allocating memory
      for nvme_completion (rsp->req.rsp). In such a case, accessing pointer
      fields (req->rsp->status) in nvmet_req_init() will result in crash.
      
      To fix this, allocate the memory for nvme_completion by calling
      nvmet_rdma_alloc_rsp()
      
      Fixes: 8407879c("nvmet-rdma:fix possible bogus dereference under heavy load")
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarRaju Rangoju <rajur@chelsio.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7dbf1297
    • Israel Rukshin's avatar
      nvmet-rdma: Add unlikely for response allocated check · fa9184be
      Israel Rukshin authored
      commit ad1f8249 upstream.
      Signed-off-by: default avatarIsrael Rukshin <israelr@mellanox.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Raju  Rangoju <rajur@chelsio.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa9184be
    • David Hildenbrand's avatar
      s390/smp: Fix calling smp_call_ipl_cpu() from ipl CPU · 48046a01
      David Hildenbrand authored
      commit 60f1bf29 upstream.
      
      When calling smp_call_ipl_cpu() from the IPL CPU, we will try to read
      from pcpu_devices->lowcore. However, due to prefixing, that will result
      in reading from absolute address 0 on that CPU. We have to go via the
      actual lowcore instead.
      
      This means that right now, we will read lc->nodat_stack == 0 and
      therfore work on a very wrong stack.
      
      This BUG essentially broke rebooting under QEMU TCG (which will report
      a low address protection exception). And checking under KVM, it is
      also broken under KVM. With 1 VCPU it can be easily triggered.
      
      :/# echo 1 > /proc/sys/kernel/sysrq
      :/# echo b > /proc/sysrq-trigger
      [   28.476745] sysrq: SysRq : Resetting
      [   28.476793] Kernel stack overflow.
      [   28.476817] CPU: 0 PID: 424 Comm: sh Not tainted 5.0.0-rc1+ #13
      [   28.476820] Hardware name: IBM 2964 NE1 716 (KVM/Linux)
      [   28.476826] Krnl PSW : 0400c00180000000 0000000000115c0c (pcpu_delegate+0x12c/0x140)
      [   28.476861]            R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
      [   28.476863] Krnl GPRS: ffffffffffffffff 0000000000000000 000000000010dff8 0000000000000000
      [   28.476864]            0000000000000000 0000000000000000 0000000000ab7090 000003e0006efbf0
      [   28.476864]            000000000010dff8 0000000000000000 0000000000000000 0000000000000000
      [   28.476865]            000000007fffc000 0000000000730408 000003e0006efc58 0000000000000000
      [   28.476887] Krnl Code: 0000000000115bfe: 4170f000            la      %r7,0(%r15)
      [   28.476887]            0000000000115c02: 41f0a000            la      %r15,0(%r10)
      [   28.476887]           #0000000000115c06: e370f0980024        stg     %r7,152(%r15)
      [   28.476887]           >0000000000115c0c: c0e5fffff86e        brasl   %r14,114ce8
      [   28.476887]            0000000000115c12: 41f07000            la      %r15,0(%r7)
      [   28.476887]            0000000000115c16: a7f4ffa8            brc     15,115b66
      [   28.476887]            0000000000115c1a: 0707                bcr     0,%r7
      [   28.476887]            0000000000115c1c: 0707                bcr     0,%r7
      [   28.476901] Call Trace:
      [   28.476902] Last Breaking-Event-Address:
      [   28.476920]  [<0000000000a01c4a>] arch_call_rest_init+0x22/0x80
      [   28.476927] Kernel panic - not syncing: Corrupt kernel stack, can't continue.
      [   28.476930] CPU: 0 PID: 424 Comm: sh Not tainted 5.0.0-rc1+ #13
      [   28.476932] Hardware name: IBM 2964 NE1 716 (KVM/Linux)
      [   28.476932] Call Trace:
      
      Fixes: 2f859d0d ("s390/smp: reduce size of struct pcpu")
      Cc: stable@vger.kernel.org # 4.0+
      Reported-by: default avatarCornelia Huck <cohuck@redhat.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      48046a01
    • Daniel Borkmann's avatar
      bpf: fix inner map masking to prevent oob under speculation · 37c9e3ee
      Daniel Borkmann authored
      [ commit 9d5564dd upstream ]
      
      During review I noticed that inner meta map setup for map in
      map is buggy in that it does not propagate all needed data
      from the reference map which the verifier is later accessing.
      
      In particular one such case is index masking to prevent out of
      bounds access under speculative execution due to missing the
      map's unpriv_array/index_mask field propagation. Fix this such
      that the verifier is generating the correct code for inlined
      lookups in case of unpriviledged use.
      
      Before patch (test_verifier's 'map in map access' dump):
      
        # bpftool prog dump xla id 3
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:4]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking for
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+11
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     | Index masking missing (!)
          22: (35) if r0 >= 0x1 goto pc+3   | for inner map despite
          23: (67) r0 <<= 3                 | map->unpriv_array set.
          24: (0f) r0 += r1                 |
          25: (05) goto pc+1                |
          26: (b7) r0 = 0                   |
          27: (b7) r0 = 0
          28: (95) exit
      
      After patch:
      
        # bpftool prog dump xla id 1
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:2]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Same inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking due to
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+12
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     |
          22: (35) if r0 >= 0x1 goto pc+4   | Now fixed inlined inner map
          23: (54) (u32) r0 &= (u32) 0      | lookup with proper index masking
          24: (67) r0 <<= 3                 | for map->unpriv_array.
          25: (0f) r0 += r1                 |
          26: (05) goto pc+1                |
          27: (b7) r0 = 0                   |
          28: (b7) r0 = 0
          29: (95) exit
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      37c9e3ee
    • Daniel Borkmann's avatar
      bpf: fix sanitation of alu op with pointer / scalar type from different paths · eed84f94
      Daniel Borkmann authored
      [ commit d3bd7413 upstream ]
      
      While 979d63d5 ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
      
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      eed84f94
    • Daniel Borkmann's avatar
      bpf: prevent out of bounds speculation on pointer arithmetic · f92a819b
      Daniel Borkmann authored
      [ commit 979d63d5 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      f92a819b
    • Daniel Borkmann's avatar
      bpf: fix check_map_access smin_value test when pointer contains offset · 4f7f708d
      Daniel Borkmann authored
      [ commit b7137c4e upstream ]
      
      In check_map_access() we probe actual bounds through __check_map_access()
      with offset of reg->smin_value + off for lower bound and offset of
      reg->umax_value + off for the upper bound. However, even though the
      reg->smin_value could have a negative value, the final result of the
      sum with off could be positive when pointer arithmetic with known and
      unknown scalars is combined. In this case we reject the program with
      an error such as "R<x> min value is negative, either use unsigned index
      or do a if (index >=0) check." even though the access itself would be
      fine. Therefore extend the check to probe whether the actual resulting
      reg->smin_value + off is less than zero.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      4f7f708d
    • Daniel Borkmann's avatar
      bpf: restrict unknown scalars of mixed signed bounds for unprivileged · 44f8fc64
      Daniel Borkmann authored
      [ commit 9d7eceed upstream ]
      
      For unknown scalars of mixed signed bounds, meaning their smin_value is
      negative and their smax_value is positive, we need to reject arithmetic
      with pointer to map value. For unprivileged the goal is to mask every
      map pointer arithmetic and this cannot reliably be done when it is
      unknown at verification time whether the scalar value is negative or
      positive. Given this is a corner case, the likelihood of breaking should
      be very small.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      44f8fc64
    • Daniel Borkmann's avatar
      bpf: restrict stack pointer arithmetic for unprivileged · 5332dda9
      Daniel Borkmann authored
      [ commit e4298d25 upstream ]
      
      Restrict stack pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with
      a stack pointer as a destination we simulate a check_stack_access()
      of 1 byte on the destination and once that fails the program is
      rejected for unprivileged program loads. This is analog to map
      value pointer arithmetic and needed for masking later on.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      5332dda9
    • Daniel Borkmann's avatar
      bpf: restrict map value pointer arithmetic for unprivileged · 9e57b296
      Daniel Borkmann authored
      [ commit 0d6303db upstream ]
      
      Restrict map value pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with a
      map value pointer as a destination it will simulate a check_map_access()
      of 1 byte on the destination and once that fails the program is rejected
      for unprivileged program loads. We use this later on for masking any
      pointer arithmetic with the remainder of the map value space. The
      likelihood of breaking any existing real-world unprivileged eBPF
      program is very small for this corner case.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      9e57b296
    • Daniel Borkmann's avatar
      bpf: enable access to ax register also from verifier rewrite · 232ac70d
      Daniel Borkmann authored
      [ commit 9b73bfdd upstream ]
      
      Right now we are using BPF ax register in JIT for constant blinding as
      well as in interpreter as temporary variable. Verifier will not be able
      to use it simply because its use will get overridden from the former in
      bpf_jit_blind_insn(). However, it can be made to work in that blinding
      will be skipped if there is prior use in either source or destination
      register on the instruction. Taking constraints of ax into account, the
      verifier is then open to use it in rewrites under some constraints. Note,
      ax register already has mappings in every eBPF JIT.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      232ac70d
    • Daniel Borkmann's avatar
      bpf: move tmp variable into ax register in interpreter · b855e310
      Daniel Borkmann authored
      [ commit 144cd91c upstream ]
      
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed to later
      on allow restricted hidden use of ax in both interpreter and JITs.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      b855e310
    • Daniel Borkmann's avatar
      bpf: move {prev_,}insn_idx into verifier env · 333a31c8
      Daniel Borkmann authored
      [ commit c08435ec upstream ]
      
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment rather than changing all call-sites only
      to pass it along. insn_idx is useful in particular since this later
      on allows to hold state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      333a31c8