1. 10 Jul, 2019 28 commits
  2. 03 Jul, 2019 12 commits
    • Greg Kroah-Hartman's avatar
      Linux 5.1.16 · 8584aaf1
      Greg Kroah-Hartman authored
      8584aaf1
    • Jean-Philippe Brucker's avatar
      arm64: insn: Fix ldadd instruction encoding · 25998210
      Jean-Philippe Brucker authored
      commit c5e2edeb upstream.
      
      GCC 8.1.0 reports that the ldadd instruction encoding, recently added to
      insn.c, doesn't match the mask and couldn't possibly be identified:
      
       linux/arch/arm64/include/asm/insn.h: In function 'aarch64_insn_is_ldadd':
       linux/arch/arm64/include/asm/insn.h:280:257: warning: bitwise comparison always evaluates to false [-Wtautological-compare]
      
      Bits [31:30] normally encode the size of the instruction (1 to 8 bytes)
      and the current instruction value only encodes the 4- and 8-byte
      variants. At the moment only the BPF JIT needs this instruction, and
      doesn't require the 1- and 2-byte variants, but to be consistent with
      our other ldr and str instruction encodings, clear the size field in the
      insn value.
      
      Fixes: 34b8ab09 ("bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd")
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reported-by: default avatarKuninori Morimoto <kuninori.morimoto.gx@renesas.com>
      Signed-off-by: default avatarYoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
      Signed-off-by: default avatarJean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25998210
    • Xin Long's avatar
      tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb · cf9513b4
      Xin Long authored
      commit c3bcde02 upstream.
      
      udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device
      to count packets on dev->tstats, a perpcu variable. However, TIPC is using
      udp tunnel with no tunnel device, and pass the lower dev, like veth device
      that only initializes dev->lstats(a perpcu variable) when creating it.
      
      Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as
      a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each
      pointer points to a bigger struct than lstats, so when tstats->tx_bytes is
      increased, other percpu variable's members could be overwritten.
      
      syzbot has reported quite a few crashes due to fib_nh_common percpu member
      'nhc_pcpu_rth_output' overwritten, call traces are like:
      
        BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190
        net/ipv4/route.c:1556
          rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556
          __mkroute_output net/ipv4/route.c:2332 [inline]
          ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564
          ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393
          __ip_route_output_key include/net/route.h:125 [inline]
          ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651
          ip_route_output_key include/net/route.h:135 [inline]
        ...
      
      or:
      
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168
          <IRQ>
          rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline]
          free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217
          __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
          rcu_do_batch kernel/rcu/tree.c:2437 [inline]
          invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
          rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
        ...
      
      The issue exists since tunnel stats update is moved to iptunnel_xmit by
      Commit 039f5062 ("ip_tunnel: Move stats update to iptunnel_xmit()"),
      and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb
      so that the packets counting won't happen on dev->tstats.
      
      Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com
      Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com
      Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com
      Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com
      Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com
      Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com
      Fixes: 039f5062 ("ip_tunnel: Move stats update to iptunnel_xmit()")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf9513b4
    • Amir Goldstein's avatar
      fanotify: update connector fsid cache on add mark · b74063b5
      Amir Goldstein authored
      commit c285a2f0 upstream.
      
      When implementing connector fsid cache, we only initialized the cache
      when the first mark added to object was added by FAN_REPORT_FID group.
      We forgot to update conn->fsid when the second mark is added by
      FAN_REPORT_FID group to an already attached connector without fsid
      cache.
      
      Reported-and-tested-by: syzbot+c277e8e2f46414645508@syzkaller.appspotmail.com
      Fixes: 77115225 ("fanotify: cache fsid in fsnotify_mark_connector")
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b74063b5
    • Jason Gunthorpe's avatar
      RDMA: Directly cast the sockaddr union to sockaddr · 993a0821
      Jason Gunthorpe authored
      commit 641114d2 upstream.
      
      gcc 9 now does allocation size tracking and thinks that passing the member
      of a union and then accessing beyond that member's bounds is an overflow.
      
      Instead of using the union member, use the entire union with a cast to
      get to the sockaddr. gcc will now know that the memory extends the full
      size of the union.
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      993a0821
    • Will Deacon's avatar
      futex: Update comments and docs about return values of arch futex code · 41dd902f
      Will Deacon authored
      commit 42750351 upstream.
      
      The architecture implementations of 'arch_futex_atomic_op_inuser()' and
      'futex_atomic_cmpxchg_inatomic()' are permitted to return only -EFAULT,
      -EAGAIN or -ENOSYS in the case of failure.
      
      Update the comments in the asm-generic/ implementation and also a stray
      reference in the robust futex documentation.
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      41dd902f
    • Daniel Borkmann's avatar
      bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd · 272ca391
      Daniel Borkmann authored
      commit 34b8ab09 upstream.
      
      Since ARMv8.1 supplement introduced LSE atomic instructions back in 2016,
      lets add support for STADD and use that in favor of LDXR / STXR loop for
      the XADD mapping if available. STADD is encoded as an alias for LDADD with
      XZR as the destination register, therefore add LDADD to the instruction
      encoder along with STADD as special case and use it in the JIT for CPUs
      that advertise LSE atomics in CPUID register. If immediate offset in the
      BPF XADD insn is 0, then use dst register directly instead of temporary
      one.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      272ca391
    • Will Deacon's avatar
      arm64: futex: Avoid copying out uninitialised stack in failed cmpxchg() · fac9c643
      Will Deacon authored
      commit 8e4e0ac0 upstream.
      
      Returning an error code from futex_atomic_cmpxchg_inatomic() indicates
      that the caller should not make any use of *uval, and should instead act
      upon on the value of the error code. Although this is implemented
      correctly in our futex code, we needlessly copy uninitialised stack to
      *uval in the error case, which can easily be avoided.
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fac9c643
    • Martin KaFai Lau's avatar
      bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err · bb3fb093
      Martin KaFai Lau authored
      commit 4ac30c4b upstream.
      
      __udp6_lib_err() may be called when handling icmpv6 message. For example,
      the icmpv6 toobig(type=2).  __udp6_lib_lookup() is then called
      which may call reuseport_select_sock().  reuseport_select_sock() will
      call into a bpf_prog (if there is one).
      
      reuseport_select_sock() is expecting the skb->data pointing to the
      transport header (udphdr in this case).  For example, run_bpf_filter()
      is pulling the transport header.
      
      However, in the __udp6_lib_err() path, the skb->data is pointing to the
      ipv6hdr instead of the udphdr.
      
      One option is to pull and push the ipv6hdr in __udp6_lib_err().
      Instead of doing this, this patch follows how the original
      commit 538950a1 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
      was done in IPv4, which has passed a NULL skb pointer to
      reuseport_select_sock().
      
      Fixes: 538950a1 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
      Cc: Craig Gallek <kraig@google.com>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Acked-by: default avatarCraig Gallek <kraig@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb3fb093
    • Martin KaFai Lau's avatar
      bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro · da6dab63
      Martin KaFai Lau authored
      commit 257a525f upstream.
      
      When the commit a6024562 ("udp: Add GRO functions to UDP socket")
      added udp[46]_lib_lookup_skb to the udp_gro code path, it broke
      the reuseport_select_sock() assumption that skb->data is pointing
      to the transport header.
      
      This patch follows an earlier __udp6_lib_err() fix by
      passing a NULL skb to avoid calling the reuseport's bpf_prog.
      
      Fixes: a6024562 ("udp: Add GRO functions to UDP socket")
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      da6dab63
    • Daniel Borkmann's avatar
      bpf: fix unconnected udp hooks · 591c18e3
      Daniel Borkmann authored
      commit 983695fa upstream.
      
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrey Ignatov <rdna@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarMartynas Pumputis <m@lambda.lt>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      591c18e3
    • Matt Mullins's avatar
      bpf: fix nested bpf tracepoints with per-cpu data · 2a9fedc1
      Matt Mullins authored
      commit 9594dc3c upstream.
      
      BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
      they do not increment bpf_prog_active while executing.
      
      This enables three levels of nesting, to support
        - a kprobe or raw tp or perf event,
        - another one of the above that irq context happens to call, and
        - another one in nmi context
      (at most one of which may be a kprobe or perf event).
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
      Signed-off-by: default avatarMatt Mullins <mmullins@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a9fedc1