1. 24 Jun, 2021 12 commits
      ena: Remove rcu_read_lock() around XDP program invocation · 0939e053
      Toke Høiland-Jørgensen authored
      The ena driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP
      program invocations. However, the actual lifetime of the objects referred to
      by the XDP program invocation is longer, all the way through to the call to
      xdp_do_flush(), making the scope of the rcu_read_lock() too small. This
      turns out to be harmless because it all happens in a single NAPI poll
      cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock()
      misleading.
      
      Rather than extend the scope of the rcu_read_lock(), just get rid of it
      entirely. With the addition of RCU annotations to the XDP_REDIRECT map
      types that take bh execution into account, lockdep even understands this to
      be safe, so there's really no reason to keep it around.
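
      As a sketch, the shape of the change looks roughly like this (illustrative
      only, not the actual ena code; the function and parameter names are
      hypothetical):

        static int ena_xdp_sketch(struct net_device *dev, struct bpf_prog *prog,
                                  struct xdp_buff *xdp)
        {
                u32 act;

                /* rcu_read_lock();   <- removed: its scope ended before
                 *                       xdp_do_flush(), so it was too narrow */
                act = bpf_prog_run_xdp(prog, xdp);
                if (act == XDP_REDIRECT)
                        xdp_do_redirect(dev, xdp, prog);
                /* rcu_read_unlock(); <- removed */

                /* NAPI polling runs between local_bh_disable()/local_bh_enable(),
                 * which is what actually keeps the redirect state alive through
                 * the flush at the end of the poll cycle: */
                xdp_do_flush();
                return 0;
        }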
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Saeed Bishara <saeedb@amazon.com>
      Cc: Guy Tzalik <gtzalik@amazon.com>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-8-toke@redhat.com
      bpf, sched: Remove unneeded rcu_read_lock() around BPF program invocation · 77151ccf
      Toke Høiland-Jørgensen authored
      The rcu_read_lock() calls in cls_bpf and act_bpf are redundant: on the TX
      side, there's already a call to rcu_read_lock_bh() in __dev_queue_xmit(),
      and on RX there's a covering rcu_read_lock() in
      netif_receive_skb{,_list}_internal().
      
      With the previous patches we also amended the lockdep checks in the map
      code to not require any particular RCU flavour, so we can just get rid of
      the rcu_read_lock()s.
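
      A sketch of why the explicit lock is unnecessary (the classify function is
      simplified and hypothetical; the covering locks are the real ones named
      above):

        static int cls_bpf_classify_sketch(struct sk_buff *skb,
                                           struct bpf_prog *prog)
        {
                /* No rcu_read_lock() needed: on TX this runs inside the
                 * rcu_read_lock_bh() taken by __dev_queue_xmit(), and on RX
                 * inside the rcu_read_lock() taken by
                 * netif_receive_skb{,_list}_internal(). */
                return BPF_PROG_RUN(prog, skb);
        }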
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-7-toke@redhat.com
      xdp: Add proper __rcu annotations to redirect map entries · 782347b6
      Toke Høiland-Jørgensen authored
      XDP_REDIRECT works by a three-step process: the bpf_redirect() and
      bpf_redirect_map() helpers will lookup the target of the redirect and store
      it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
      Next, when the program returns the XDP_REDIRECT return code, the driver
      will call xdp_do_redirect() which will use the information thus stored to
      actually enqueue the frame into a bulk queue structure (that differs
      slightly by map type, but shares the same principle). Finally, before
      exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
      flush all the different bulk queues, thus completing the redirect.
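
      From a driver's point of view, the three steps look roughly like this
      (illustrative sketch; the function name is hypothetical):

        static void xdp_redirect_steps_sketch(struct net_device *dev,
                                              struct bpf_prog *prog,
                                              struct xdp_buff *xdp)
        {
                /* Step 1: bpf_redirect{,_map}() inside the program stores
                 * the target in the per-CPU struct bpf_redirect_info. */
                u32 act = bpf_prog_run_xdp(prog, xdp);

                /* Step 2: the frame is enqueued into the map type's bulk
                 * queue; nothing has been transmitted yet. */
                if (act == XDP_REDIRECT)
                        xdp_do_redirect(dev, xdp, prog);

                /* Step 3: before exiting the NAPI poll loop, flush all the
                 * bulk queues, completing the redirect. */
                xdp_do_flush();
        }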
      
      Pointers to the map entries will be kept around for this whole sequence of
      steps, protected by RCU. However, there is no top-level rcu_read_lock() in
      the core code; instead drivers add their own rcu_read_lock() around the XDP
      portions of the code, but somewhat inconsistently as Martin discovered[0].
      However, things still work because everything happens inside a single NAPI
      poll sequence, which means it's between a pair of calls to
      local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
      document this intention by using rcu_dereference_check() with
      rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
      lockdep to verify that everything is done correctly.
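
      The resulting pattern is simply (sketch; the map field name is
      illustrative):

        entry = rcu_dereference_check(map->slots[i], rcu_read_lock_bh_held());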
      
      This patch does just that: we add an __rcu annotation to the map entry
      pointers and remove the various comments explaining the NAPI poll assurance
      strewn through devmap.c in favour of a longer explanation in filter.c. The
      goal is to have one coherent documentation of the entire flow, and rely on
      the RCU annotations as a "standard" way of communicating the flow in the
      map code (which can additionally be understood by sparse and lockdep).
      
      The RCU annotation replacements are fairly straightforward: READ_ONCE()
      becomes rcu_dereference_check(), WRITE_ONCE() becomes rcu_assign_pointer(),
      and xchg() and cmpxchg() get wrapped in the
      proper constructs to cast the pointer back and forth between __rcu and
      __kernel address space (for the benefit of sparse). The one complication is
      that xskmap has a few constructions where double-pointers are passed back
      and forth; these simply all gain __rcu annotations, and only the final
      reference/dereference to the inner-most pointer gets changed.
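
      In devmap terms, the conversions take roughly this shape (a sketch based
      on the description above, not the verbatim diff; variable names are
      illustrative):

        /* READ_ONCE()  becomes rcu_dereference_check(): */
        dev = rcu_dereference_check(dtab->netdev_map[i], rcu_read_lock_bh_held());

        /* WRITE_ONCE() becomes rcu_assign_pointer(): */
        rcu_assign_pointer(dtab->netdev_map[i], new_dev);

        /* xchg() gets wrapped so sparse sees matching address spaces: */
        old = unrcu_pointer(xchg(&dtab->netdev_map[i], RCU_INITIALIZER(new_dev)));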
      
      With this, everything can be run through sparse without eliciting
      complaints, and lockdep can verify correctness even without the use of
      rcu_read_lock() in the drivers. Subsequent patches will clean these up from
      the drivers.
      
      [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
      [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
      bpf: Allow RCU-protected lookups to happen from bh context · 694cea39
      Toke Høiland-Jørgensen authored
      XDP programs are called from a NAPI poll context, which means the RCU
      reference liveness is ensured by local_bh_disable(). Add
      rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
      lockdep understands that the dereferences are safe from inside *either* an
      rcu_read_lock() section *or* a local_bh_disable() section. While both
      bh_disabled and rcu_read_lock() provide RCU protection, they are
      semantically distinct, so we need both conditions to prevent lockdep
      complaints.
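
      The relaxed check takes roughly this shape in the lookup paths (a
      simplified sketch of the pattern, not the verbatim diff):

        WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());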
      
      This change is done in preparation for removing the redundant
      rcu_read_lock()s from drivers.
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-5-toke@redhat.com
      doc: Give XDP as example of non-obvious RCU reader/updater pairing · e74c74f9
      Toke Høiland-Jørgensen authored
      This commit gives an example of non-obvious RCU reader/updater pairing
      in the guise of the XDP feature in networking, which calls BPF programs
      from network-driver NAPI (softirq) context.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-4-toke@redhat.com
      doc: Clarify and expand RCU updaters and corresponding readers · 9a145c04
      Paul E. McKenney authored
      This commit clarifies which primitives readers can use given that the
      corresponding updaters have made a specific choice.  This commit also adds
      this information for the various RCU Tasks flavors.  While in the area, it
      removes a paragraph that no longer applies in any straightforward manner.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-3-toke@redhat.com
      rcu: Create an unrcu_pointer() to remove __rcu from a pointer · b9964ce7
      Paul E. McKenney authored
      The xchg() and cmpxchg() functions are sometimes used to carry out RCU
      updates.  Unfortunately, this can result in sparse warnings for both
      the old-value and new-value arguments, as well as for the return value.
      The arguments can be dealt with using RCU_INITIALIZER():
      
              old_p = xchg(&p, RCU_INITIALIZER(new_p));
      
      But a sparse warning still remains due to assigning the __rcu pointer
      returned from xchg to the (most likely) non-__rcu pointer old_p.
      
      This commit therefore provides an unrcu_pointer() macro that strips
      the __rcu.  This macro can be used as follows:
      
              old_p = unrcu_pointer(xchg(&p, RCU_INITIALIZER(new_p)));
      Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-2-toke@redhat.com
      bpf: Support all gso types in bpf_skb_change_proto() · 0bc919d3
      Maciej Żenczykowski authored
      Since we no longer modify gso_size, it is now theoretically
      safe to not set SKB_GSO_DODGY and reset gso_segs to zero.
      
      This also means the skb_is_gso_tcp() check should no longer
      be necessary.
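
      For context, a hypothetical tc-BPF user of the helper whose restriction
      is being lifted (a sketch only; it assumes the usual libbpf headers for
      SEC(), bpf_htons(), and the TC_ACT_* and ETH_P_* constants):

        SEC("tc")
        int clat_6_to_4(struct __sk_buff *skb)
        {
                /* With this patch, this now succeeds for all gso types,
                 * not just skb_is_gso_tcp() skbs. */
                if (bpf_skb_change_proto(skb, bpf_htons(ETH_P_IP), 0))
                        return TC_ACT_SHOT;
                /* ... rewrite the L3 header ... */
                return TC_ACT_OK;
        }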
      
      Unfortunately we cannot remove the skb_{decrease,increase}_gso_size()
      helpers, as they are still used elsewhere:
      
        bpf_skb_net_grow() without BPF_F_ADJ_ROOM_FIXED_GSO
        bpf_skb_net_shrink() without BPF_F_ADJ_ROOM_FIXED_GSO
        net/core/lwt_bpf.c's handle_gso_type()
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dongseok Yi <dseok.yi@samsung.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/bpf/20210617000953.2787453-3-zenczykowski@gmail.com
      bpf: Do not change gso_size during bpf_skb_change_proto() · 364745fb
      Maciej Żenczykowski authored
      This is technically a backwards-incompatible change in behaviour, but I'm
      going to argue that it is very unlikely to break things, and likely to fix
      *far* more than it breaks.
      
      In no particular order, various reasons follow:
      
      (a) I've long had a bug assigned to myself to debug a super rare kernel crash
      on Android Pixel phones which can (per stacktrace) be traced back to BPF clat
      IPv6 to IPv4 protocol conversion causing some sort of ugly failure much later
      on during transmit deep in the GSO engine, AFAICT precisely because of this
      change to gso_size, though I've never been able to manually reproduce it. I
      believe it may be related to the particular network offload support of attached
      USB ethernet dongle being used for tethering off of an IPv6-only cellular
      connection. The reason might be we end up with more segments than max permitted,
      or with a GSO packet with only one segment... (either way we break some
      assumption and hit a BUG_ON)
      
      (b) There is no check that the gso_size is > 20 when reducing it by 20, so we
      might end up with a negative (or underflowing) gso_size or a gso_size of 0.
      This can't possibly be good. Indeed this is probably somehow exploitable (or
      at least can result in a kernel crash) by delivering crafted packets and perhaps
      triggering an infinite loop or a divide by zero... As a reminder: gso_size (MSS)
      is related to MTU, but not directly derived from it: gso_size/MSS may be
      significantly smaller than one would get by deriving it from the local MTU.
      And on some NICs (which do loose MTU checking on receive) it may even
      potentially be larger; for example, my work pc with 1500 MTU can receive
      1520 byte frames (and sometimes does, due to bugs in a vendor plat46
      implementation). Indeed, even just
      going from 21 to 1 is potentially problematic because it increases the number
      of segments by a factor of 21 (think DoS, or some other crash due to too many
      segments).
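
      A quick userspace illustration of the wrap-around in (b) (gso_size here has
      the same 16-bit width as skb_shared_info.gso_size; the values are made up):

        #include <stdio.h>

        int main(void)
        {
                unsigned short gso_size = 10;   /* u16, like shinfo->gso_size */

                gso_size -= 20;                 /* the blind 20-byte decrease */
                printf("%u\n", gso_size);       /* prints 65526, not -10 */
                return 0;
        }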
      
      (c) It's always safe to not increase the gso_size, because it doesn't result in
      the max packet size increasing.  So the skb_increase_gso_size() call was always
      unnecessary for correctness (and outright undesirable, see later). As such the
      only part which is potentially dangerous (ie. could cause backwards compatibility
      issues) is the removal of the skb_decrease_gso_size() call.
      
      (d) If the packets are ultimately destined to the local device, then there is
      absolutely no benefit to playing around with gso_size. It only matters if the
      packets will egress the device. ie. we're either forwarding, or transmitting
      from the device.
      
      (e) This logic only triggers for packets which are GSO. It does not trigger for
      skbs which are not GSO. It will not convert a non-GSO MTU sized packet into a
      GSO packet (and you don't even know what the MTU is, so you can't even fix it).
      As such your transmit path must *already* be able to handle an MTU 20 bytes
      larger than your receive path (for IPv4 to IPv6 translation) - and indeed 28
      bytes larger due to IPv4 fragments. Thus removing the skb_decrease_gso_size()
      call doesn't actually increase the size of the packets your transmit side must
      be able to handle. ie. to handle non-GSO max-MTU packets, the IPv4/IPv6 device/
      route MTUs must already be set correctly. Since for example with an IPv4 egress
      MTU of 1500, IPv4 to IPv6 translation will already build 1520 byte IPv6 frames,
      so you need a 1520 byte device MTU. This means if your IPv6 device's egress
      MTU is 1280, your IPv4 route must be 1260 (and actually 1252, because of the
      need to handle fragments). This is to handle normal non-GSO packets. Thus the
      reduction is simply not needed for GSO packets, because when they're correctly
      built, they will already be the right size.
      
      (f) TSO/GSO should be able to exactly undo GRO: the number of packets (TCP
      segments) should not be modified, so that TCP's MSS counting works correctly
      (this matters for congestion control). If protocol conversion changes the
      gso_size, then the number of TCP segments may increase or decrease. Packet loss
      after protocol conversion can result in partial loss of MSS segments that the
      sender sent. How's the sending TCP stack going to react to receiving ACKs/SACKs
      in the middle of the segments it sent?
      
      (g) skb_{decrease,increase}_gso_size() are already no-ops for the GSO_BY_FRAGS
      case (besides triggering WARN_ON_ONCE). This means you already cannot guarantee
      that gso_size (and thus resulting packet MTU) is changed. ie. you must assume
      it won't be changed.
      
      (h) Changing gso_size is outright buggy for UDP GSO packets, where framing
      matters (I believe that's also the case for SCTP, but it's already excluded
      by [g]).  So the only remaining case is TCP, which also doesn't want it
      (see [f]).
      
      (i) See also the reasoning on the previous attempt at fixing this
      (commit fa7b83bf), which shows that the current
      behaviour causes TCP packet loss:
      
        In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the
        coalesced packet payload can be > MSS, but < MSS + 20.
      
        bpf_skb_proto_6_to_4() will upgrade the MSS, and it can become > the
        payload length. tcp_gso_segment() then checks whether the payload length
        is <= MSS; this condition causes the packet to be dropped.
      
        tcp_gso_segment():
          [...]
          mss = skb_shinfo(skb)->gso_size;
          if (unlikely(skb->len <= mss)) goto out;
          [...]
      
      Thus changing the gso_size is simply a very bad idea. Increasing is unnecessary
      and buggy, and decreasing can go negative.
      
      Fixes: 6578171a ("bpf: add bpf_skb_change_proto helper")
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dongseok Yi <dseok.yi@samsung.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/bpf/CANP3RGfjLikQ6dg=YpBU0OeHvyv7JOki7CyOUS9modaXAi-9vQ@mail.gmail.com
      Link: https://lore.kernel.org/bpf/20210617000953.2787453-2-zenczykowski@gmail.com
      Revert "bpf: Check for BPF_F_ADJ_ROOM_FIXED_GSO when bpf_skb_change_proto" · ba47396e
      Maciej Żenczykowski authored
      This reverts commit fa7b83bf.
      
      See the follow-up commit for the reasoning why I believe the appropriate
      approach is to simply make this change without a flag, but it can basically
      be summarized as: using this helper without the flag is bug-prone or
      outright buggy, and thus the default should be this new behaviour.
      
      As that commit has only made it into net-next/master, and not into any
      real release, such a backwards-incompatible change is still ok.
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dongseok Yi <dseok.yi@samsung.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/bpf/20210617000953.2787453-1-zenczykowski@gmail.com
      media, bpf: Do not copy more entries than user space requested · 647d446d
      Sean Young authored
      The syscall bpf(BPF_PROG_QUERY, &attr) should use the prog_cnt field to
      see how many entries user space provided and return ENOSPC if there are
      more programs than that. Before this patch, this is not checked and
      ENOSPC is never returned.
      
      Note that one lirc device is limited to 64 bpf programs, and the user space
      I'm aware of -- ir-keytable -- always gives enough space for 64 entries
      already. However, we should not copy more program ids than are requested.
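
      A sketch of the user-space side this enforces (illustrative; it assumes
      <linux/bpf.h>, <sys/syscall.h> and <unistd.h>, and a hypothetical lirc_fd):

        static int query_lirc_progs_sketch(int lirc_fd)
        {
                __u32 ids[64] = {};  /* one lirc device allows at most 64 */
                union bpf_attr attr = {};

                attr.query.target_fd   = lirc_fd;
                attr.query.attach_type = BPF_LIRC_MODE2;
                attr.query.prog_ids    = (__u64)(unsigned long)ids;
                attr.query.prog_cnt    = 64;  /* copy at most this many ids */

                /* With this patch, ENOSPC is returned if more programs are
                 * attached than prog_cnt allows for. */
                return syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr));
        }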
      Signed-off-by: Sean Young <sean@mess.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210623213754.632-1-sean@mess.org
      bpf, x86: Remove unused cnt increase from EMIT macro · ced50fc4
      Jiri Olsa authored
      Remove the unused cnt increase from the EMIT macro, together with the cnt
      declarations. These were introduced in commit [1] to ensure proper code
      generation, but that code was removed in commit [2] and the extra cnt
      handling was left in.
      
        [1] b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper")
        [2] ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
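
      The shape of the change (a reconstructed sketch, not the verbatim diff):

        /* Before: every EMIT also maintained a local byte counter */
        #define EMIT(bytes, len) \
                do { prog = emit_code(prog, bytes, len); cnt += len; } while (0)

        /* After: the counter has served no purpose since commit ebf7d1f5 */
        #define EMIT(bytes, len) \
                do { prog = emit_code(prog, bytes, len); } while (0)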
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210623112504.709856-1-jolsa@kernel.org
  2. 22 Jun, 2021 2 commits
      libbpf: Switch to void * casting in netlink helpers · ee62a5c6
      Kumar Kartikeya Dwivedi authored
      Netlink helpers I added in 8bbb77b7 ("libbpf: Add various netlink
      helpers") used char * casts everywhere, and there were a few more that
      existed from before.
      
      Convert all of them to void * casts, as these are treated equivalently by
      clang/gcc for the purposes of pointer arithmetic, and to follow the
      convention elsewhere in the kernel/libbpf.
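
      An illustrative before/after (the statement itself is hypothetical; libbpf,
      like the kernel, relies on the gcc/clang extension allowing arithmetic on
      void *):

        /* before */
        nla = (struct nlattr *)((char *)nh + NLMSG_ALIGN(nh->nlmsg_len));
        /* after */
        nla = (struct nlattr *)((void *)nh + NLMSG_ALIGN(nh->nlmsg_len));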
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210619041454.417577-2-memxor@gmail.com
      libbpf: Add request buffer type for netlink messages · 0ae64fb6
      Kumar Kartikeya Dwivedi authored
      Coverity complains about OOB writes to nlmsghdr. There is no OOB as we
      write to the trailing buffer, but static analyzers and compilers may
      rightfully be confused as the nlmsghdr pointer has subobject provenance
      (and hence subobject bounds).
      
      Fix this by using an explicit request structure containing the nlmsghdr,
      struct tcmsg/ifinfomsg, and attribute buffer.
      
      Also switch nh_tail (renamed to req_tail) to cast req * to char *, so
      that it can be understood as arithmetic on a pointer to the representation
      array (hence having the same bounds as the request structure), which
      should further appease analyzers.
      
      As a bonus, callers don't have to pass sizeof(req) all the time now, as
      the size is implicitly obtained from the pointer type. While at it, also
      reduce the size of the attribute buffer to 128 bytes (132 for functions
      using ifinfomsg, due to padding).
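
      A sketch of the resulting request layout (the struct and field names here
      are illustrative, mirroring the description above rather than quoting the
      libbpf header):

        struct nla_req_sketch {
                struct nlmsghdr nh;
                union {
                        struct ifinfomsg ifinfo;
                        struct tcmsg tc;
                };
                char buf[128];  /* trailing attribute buffer */
        };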
      
      Summary of problem:
      
        Even though the C standard allows interconverting a pointer to the
        first member and a pointer to the struct, for the purposes of alias
        analysis the former is still considered to have pointer value "pointer
        to T", where T is the type of the first member, and hence subobject
        bounds. This allows analyzers, within reason, to complain when an
        object is accessed beyond the size of the pointed-to object.
      
        The only exception to this rule may be when a char * is formed to a
        member subobject. The compiler cannot tell whether the programmer
        intends it as a pointer to the member object or to the underlying
        representation array of the containing object, so such diagnosis is
        suppressed.
      
      Fixes: 715c5ce4 ("libbpf: Add low level TC-BPF management API")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210619041454.417577-1-memxor@gmail.com