1. 25 May, 2023 13 commits
  2. 24 May, 2023 7 commits
  3. 23 May, 2023 20 commits
    • John Fastabend's avatar
      bpf, sockmap: Test progs verifier error with latest clang · f726e035
      John Fastabend authored
      With a relatively recent clang (7090c10273119) and with this commit
      to fix warnings in selftests (c8ed6685) that uses __sink(err)
      to resolve unused variables. We get the following verifier error.
      
      root@6e731a24b33a:/host/tools/testing/selftests/bpf# ./test_sockmap
      libbpf: prog 'bpf_sockmap': BPF program load failed: Permission denied
      libbpf: prog 'bpf_sockmap': -- BEGIN PROG LOAD LOG --
      0: R1=ctx(off=0,imm=0) R10=fp0
      ; op = (int) skops->op;
      0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; switch (op) {
      1: (16) if w2 == 0x4 goto pc+5        ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      2: (56) if w2 != 0x5 goto pc+15       ; R2_w=5
      ; lport = skops->local_port;
      3: (61) r2 = *(u32 *)(r1 +68)         ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; if (lport == 10000) {
      4: (56) if w2 != 0x2710 goto pc+13 18: R1=ctx(off=0,imm=0) R2=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
      ; __sink(err);
      18: (bc) w1 = w0
      R0 !read_ok
      processed 18 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
      -- END PROG LOAD LOG --
      libbpf: prog 'bpf_sockmap': failed to load: -13
      libbpf: failed to load object 'test_sockmap_kern.bpf.o'
      load_bpf_file: (-1) No such file or directory
      ERROR: (-1) load bpf failed
      libbpf: prog 'bpf_sockmap': BPF program load failed: Permission denied
      libbpf: prog 'bpf_sockmap': -- BEGIN PROG LOAD LOG --
      0: R1=ctx(off=0,imm=0) R10=fp0
      ; op = (int) skops->op;
      0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; switch (op) {
      1: (16) if w2 == 0x4 goto pc+5        ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      2: (56) if w2 != 0x5 goto pc+15       ; R2_w=5
      ; lport = skops->local_port;
      3: (61) r2 = *(u32 *)(r1 +68)         ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; if (lport == 10000) {
      4: (56) if w2 != 0x2710 goto pc+13 18: R1=ctx(off=0,imm=0) R2=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
      ; __sink(err);
      18: (bc) w1 = w0
      R0 !read_ok
      processed 18 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
      -- END PROG LOAD LOG --
      libbpf: prog 'bpf_sockmap': failed to load: -13
      libbpf: failed to load object 'test_sockhash_kern.bpf.o'
      load_bpf_file: (-1) No such file or directory
      ERROR: (-1) load bpf failed
      libbpf: prog 'bpf_sockmap': BPF program load failed: Permission denied
      libbpf: prog 'bpf_sockmap': -- BEGIN PROG LOAD LOG --
      0: R1=ctx(off=0,imm=0) R10=fp0
      ; op = (int) skops->op;
      0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; switch (op) {
      1: (16) if w2 == 0x4 goto pc+5        ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      2: (56) if w2 != 0x5 goto pc+15       ; R2_w=5
      ; lport = skops->local_port;
      3: (61) r2 = *(u32 *)(r1 +68)         ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; if (lport == 10000) {
      4: (56) if w2 != 0x2710 goto pc+13 18: R1=ctx(off=0,imm=0) R2=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
      ; __sink(err);
      18: (bc) w1 = w0
      R0 !read_ok
      processed 18 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
      -- END PROG LOAD LOG --
      
      To fix simply remove the err value because its not actually used anywhere
      in the testing. We can investigate the root cause later. Future patch should
      probably actually test the err value as well. Although if the map updates
      fail they will get caught eventually by userspace.
      
      Fixes: c8ed6685 ("selftests/bpf: fix lots of silly mistakes pointed out by compiler")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-15-john.fastabend@gmail.com
      f726e035
    • John Fastabend's avatar
      bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops · 80e24d22
      John Fastabend authored
      When BPF program drops pkts the sockmap logic 'eats' the packet and
      updates copied_seq. In the PASS case where the sk_buff is accepted
      we update copied_seq from recvmsg path so we need a new test to
      handle the drop case.
      
      Original patch series broke this resulting in
      
      test_sockmap_skb_verdict_fionread:PASS:ioctl(FIONREAD) error 0 nsec
      test_sockmap_skb_verdict_fionread:FAIL:ioctl(FIONREAD) unexpected ioctl(FIONREAD): actual 1503041772 != expected 256
      
      After updated patch with fix.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-14-john.fastabend@gmail.com
      80e24d22
    • John Fastabend's avatar
      bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer · bb516f98
      John Fastabend authored
      A bug was reported where ioctl(FIONREAD) returned zero even though the
      socket with a SK_SKB verdict program attached had bytes in the msg
      queue. The result is programs may hang or more likely try to recover,
      but use suboptimal buffer sizes.
      
      Add a test to check that ioctl(FIONREAD) returns the correct number of
      bytes.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-13-john.fastabend@gmail.com
      bb516f98
    • John Fastabend's avatar
      bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0 · 1fa1fe8f
      John Fastabend authored
      When session gracefully shutdowns epoll needs to wake up and any recv()
      readers should return 0 not the -EAGAIN they previously returned.
      
      Note we use epoll instead of select to test the epoll wake on shutdown
      event as well.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-12-john.fastabend@gmail.com
      1fa1fe8f
    • John Fastabend's avatar
      bpf, sockmap: Build helper to create connected socket pair · 298970c8
      John Fastabend authored
      A common operation for testing is to spin up a pair of sockets that are
      connected. Then we can use these to run specific tests that need to
      send data, check BPF programs and so on.
      
      The sockmap_listen programs already have this logic lets move it into
      the new sockmap_helpers header file for general use.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-11-john.fastabend@gmail.com
      298970c8
    • John Fastabend's avatar
      bpf, sockmap: Pull socket helpers out of listen test for general use · 4e02588d
      John Fastabend authored
      No functional change here we merely pull the helpers in sockmap_listen.c
      into a header file so we can use these in other programs. The tests we
      are about to add aren't really _listen tests so doesn't make sense
      to add them here.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-10-john.fastabend@gmail.com
      4e02588d
    • John Fastabend's avatar
      bpf, sockmap: Incorrectly handling copied_seq · e5c6de5f
      John Fastabend authored
      The read_skb() logic is incrementing the tcp->copied_seq which is used for
      among other things calculating how many outstanding bytes can be read by
      the application. This results in application errors, if the application
      does an ioctl(FIONREAD) we return zero because this is calculated from
      the copied_seq value.
      
      To fix this we move tcp->copied_seq accounting into the recv handler so
      that we update these when the recvmsg() hook is called and data is in
      fact copied into user buffers. This gives an accurate FIONREAD value
      as expected and improves ACK handling. Before we were calling the
      tcp_rcv_space_adjust() which would update 'number of bytes copied to
      user in last RTT' which is wrong for programs returning SK_PASS. The
      bytes are only copied to the user when recvmsg is handled.
      
      Doing the fix for recvmsg is straightforward, but fixing redirect and
      SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then
      call this from skmsg handlers. This fixes another issue where a broken
      socket with a BPF program doing a resubmit could hang the receiver. This
      happened because although read_skb() consumed the skb through sock_drop()
      it did not update the copied_seq. Now if a single reccv socket is
      redirecting to many sockets (for example for lb) the receiver sk will be
      hung even though we might expect it to continue. The hang comes from
      not updating the copied_seq numbers and memory pressure resulting from
      that.
      
      We have a slight layer problem of calling tcp_eat_skb even if its not
      a TCP socket. To fix we could refactor and create per type receiver
      handlers. I decided this is more work than we want in the fix and we
      already have some small tweaks depending on caller that use the
      helper skb_bpf_strparser(). So we extend that a bit and always set
      the strparser bit when it is in use and then we can gate the
      seq_copied updates on this.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
      e5c6de5f
    • John Fastabend's avatar
      bpf, sockmap: Wake up polling after data copy · 6df7f764
      John Fastabend authored
      When TCP stack has data ready to read sk_data_ready() is called. Sockmap
      overwrites this with its own handler to call into BPF verdict program.
      But, the original TCP socket had sock_def_readable that would additionally
      wake up any user space waiters with sk_wake_async().
      
      Sockmap saved the callback when the socket was created so call the saved
      data ready callback and then we can wake up any epoll() logic waiting
      on the read.
      
      Note we call on 'copied >= 0' to account for returning 0 when a FIN is
      received because we need to wake up user for this as well so they
      can do the recvmsg() -> 0 and detect the shutdown.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
      6df7f764
    • John Fastabend's avatar
      bpf, sockmap: TCP data stall on recv before accept · ea444185
      John Fastabend authored
      A common mechanism to put a TCP socket into the sockmap is to hook the
      BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
      that can map the socket info to the correct BPF verdict parser. When
      the user adds the socket to the map the psock is created and the new
      ops are assigned to ensure the verdict program will 'see' the sk_buffs
      as they arrive.
      
      Part of this process hooks the sk_data_ready op with a BPF specific
      handler to wake up the BPF verdict program when data is ready to read.
      The logic is simple enough (posted here for easy reading)
      
       static void sk_psock_verdict_data_ready(struct sock *sk)
       {
      	struct socket *sock = sk->sk_socket;
      
      	if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
      		return;
      	sock->ops->read_skb(sk, sk_psock_verdict_recv);
       }
      
      The oversight here is sk->sk_socket is not assigned until the application
      accepts() the new socket. However, its entirely ok for the peer application
      to do a connect() followed immediately by sends. The socket on the receiver
      is sitting on the backlog queue of the listening socket until its accepted
      and the data is queued up. If the peer never accepts the socket or is slow
      it will eventually hit data limits and rate limit the session. But,
      important for BPF sockmap hooks when this data is received TCP stack does
      the sk_data_ready() call but the read_skb() for this data is never called
      because sk_socket is missing. The data sits on the sk_receive_queue.
      
      Then once the socket is accepted if we never receive more data from the
      peer there will be no further sk_data_ready calls and all the data
      is still on the sk_receive_queue(). Then user calls recvmsg after accept()
      and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler.
      The handler checks for data in the sk_msg ingress queue expecting that
      the BPF program has already run from the sk_data_ready hook and enqueued
      the data as needed. So we are stuck.
      
      To fix do an unlikely check in recvmsg handler for data on the
      sk_receive_queue and if it exists wake up data_ready. We have the sock
      locked in both read_skb and recvmsg so should avoid having multiple
      runners.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com
      ea444185
    • John Fastabend's avatar
      bpf, sockmap: Handle fin correctly · 901546fd
      John Fastabend authored
      The sockmap code is returning EAGAIN after a FIN packet is received and no
      more data is on the receive queue. Correct behavior is to return 0 to the
      user and the user can then close the socket. The EAGAIN causes many apps
      to retry which masks the problem. Eventually the socket is evicted from
      the sockmap because its released from sockmap sock free handling. The
      issue creates a delay and can cause some errors on application side.
      
      To fix this check on sk_msg_recvmsg side if length is zero and FIN flag
      is set then set return to zero. A selftest will be added to check this
      condition.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
      901546fd
    • John Fastabend's avatar
      bpf, sockmap: Improved check for empty queue · 405df89d
      John Fastabend authored
      We noticed some rare sk_buffs were stepping past the queue when system was
      under memory pressure. The general theory is to skip enqueueing
      sk_buffs when its not necessary which is the normal case with a system
      that is properly provisioned for the task, no memory pressure and enough
      cpu assigned.
      
      But, if we can't allocate memory due to an ENOMEM error when enqueueing
      the sk_buff into the sockmap receive queue we push it onto a delayed
      workqueue to retry later. When a new sk_buff is received we then check
      if that queue is empty. However, there is a problem with simply checking
      the queue length. When a sk_buff is being processed from the ingress queue
      but not yet on the sockmap msg receive queue its possible to also recv
      a sk_buff through normal path. It will check the ingress queue which is
      zero and then skip ahead of the pkt being processed.
      
      Previously we used sock lock from both contexts which made the problem
      harder to hit, but not impossible.
      
      To fix instead of popping the skb from the queue entirely we peek the
      skb from the queue and do the copy there. This ensures checks to the
      queue length are non-zero while skb is being processed. Then finally
      when the entire skb has been copied to user space queue or another
      socket we pop it off the queue. This way the queue length check allows
      bypassing the queue only after the list has been completely processed.
      
      To reproduce issue we run NGINX compliance test with sockmap running and
      observe some flakes in our testing that we attributed to this issue.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Suggested-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
      405df89d
    • John Fastabend's avatar
      bpf, sockmap: Reschedule is now done through backlog · bce22552
      John Fastabend authored
      Now that the backlog manages the reschedule() logic correctly we can drop
      the partial fix to reschedule from recvmsg hook.
      
      Rescheduling on recvmsg hook was added to address a corner case where we
      still had data in the backlog state but had nothing to kick it and
      reschedule the backlog worker to run and finish copying data out of the
      state. This had a couple limitations, first it required user space to
      kick it introducing an unnecessary EBUSY and retry. Second it only
      handled the ingress case and egress redirects would still be hung.
      
      With the correct fix, pushing the reschedule logic down to where the
      enomem error occurs we can drop this fix.
      
      Fixes: bec21719 ("skmsg: Schedule psock work if the cached skb exists on the psock")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
      bce22552
    • John Fastabend's avatar
      bpf, sockmap: Convert schedule_work into delayed_work · 29173d07
      John Fastabend authored
      Sk_buffs are fed into sockmap verdict programs either from a strparser
      (when the user might want to decide how framing of skb is done by attaching
      another parser program) or directly through tcp_read_sock. The
      tcp_read_sock is the preferred method for performance when the BPF logic is
      a stream parser.
      
      The flow for Cilium's common use case with a stream parser is,
      
       tcp_read_sock()
        sk_psock_verdict_recv
          ret = bpf_prog_run_pin_on_cpu()
          sk_psock_verdict_apply(sock, skb, ret)
           // if system is under memory pressure or app is slow we may
           // need to queue skb. Do this queuing through ingress_skb and
           // then kick timer to wake up handler
           skb_queue_tail(ingress_skb, skb)
           schedule_work(work);
      
      The work queue is wired up to sk_psock_backlog(). This will then walk the
      ingress_skb skb list that holds our sk_buffs that could not be handled,
      but should be OK to run at some later point. However, its possible that
      the workqueue doing this work still hits an error when sending the skb.
      When this happens the skbuff is requeued on a temporary 'state' struct
      kept with the workqueue. This is necessary because its possible to
      partially send an skbuff before hitting an error and we need to know how
      and where to restart when the workqueue runs next.
      
      Now for the trouble, we don't rekick the workqueue. This can cause a
      stall where the skbuff we just cached on the state variable might never
      be sent. This happens when its the last packet in a flow and no further
      packets come along that would cause the system to kick the workqueue from
      that side.
      
      To fix we could do simple schedule_work(), but while under memory pressure
      it makes sense to back off some instead of continue to retry repeatedly. So
      instead to fix convert schedule_work to schedule_delayed_work and add
      backoff logic to reschedule from backlog queue on errors. Its not obvious
      though what a good backoff is so use '1'.
      
      To test we observed some flakes whil running NGINX compliance test with
      sockmap we attributed these failed test to this bug and subsequent issue.
      
      >From on list discussion. This commit
      
       bec21719("skmsg: Schedule psock work if the cached skb exists on the psock")
      
      was intended to address similar race, but had a couple cases it missed.
      Most obvious it only accounted for receiving traffic on the local socket
      so if redirecting into another socket we could still get an sk_buff stuck
      here. Next it missed the case where copied=0 in the recv() handler and
      then we wouldn't kick the scheduler. Also its sub-optimal to require
      userspace to kick the internal mechanisms of sockmap to wake it up and
      copy data to user. It results in an extra syscall and requires the app
      to actual handle the EAGAIN correctly.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
      29173d07
    • John Fastabend's avatar
      bpf, sockmap: Pass skb ownership through read_skb · 78fa0d61
      John Fastabend authored
      The read_skb hook calls consume_skb() now, but this means that if the
      recv_actor program wants to use the skb it needs to inc the ref cnt
      so that the consume_skb() doesn't kfree the sk_buff.
      
      This is problematic because in some error cases under memory pressure
      we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
      Then we get this,
      
       skb_linearize()
         __pskb_pull_tail()
           pskb_expand_head()
             BUG_ON(skb_shared(skb))
      
      Because we incremented users refcnt from sk_psock_verdict_recv() we
      hit the bug on with refcnt > 1 and trip it.
      
      To fix lets simply pass ownership of the sk_buff through the skb_read
      call. Then we can drop the consume from read_skb handlers and assume
      the verdict recv does any required kfree.
      
      Bug found while testing in our CI which runs in VMs that hit memory
      constraints rather regularly. William tested TCP read_skb handlers.
      
      [  106.536188] ------------[ cut here ]------------
      [  106.536197] kernel BUG at net/core/skbuff.c:1693!
      [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
      [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
      [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
      [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
      [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
      [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
      [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
      [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
      [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
      [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
      [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
      [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  106.542255] Call Trace:
      [  106.542383]  <IRQ>
      [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
      [  106.542681]  skb_ensure_writable+0x85/0xa0
      [  106.542882]  sk_skb_pull_data+0x18/0x20
      [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
      [  106.543536]  ? migrate_disable+0x66/0x80
      [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
      [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
      [  106.544561]  tcp_read_skb+0x7b/0x120
      [  106.544740]  tcp_data_queue+0x904/0xee0
      [  106.544931]  tcp_rcv_established+0x212/0x7c0
      [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
      [  106.545326]  tcp_v4_rcv+0xe70/0xf60
      [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
      [  106.545744]  ip_local_deliver_finish+0xa7/0x150
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Reported-by: default avatarWilliam Findlay <will@isovalent.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
      78fa0d61
    • Nicolas Dichtel's avatar
      ipv{4,6}/raw: fix output xfrm lookup wrt protocol · 3632679d
      Nicolas Dichtel authored
      With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
      protocol field of the flow structure, build by raw_sendmsg() /
      rawv6_sendmsg()),  is set to IPPROTO_RAW. This breaks the ipsec policy
      lookup when some policies are defined with a protocol in the selector.
      
      For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
      specify the protocol. Just accept all values for IPPROTO_RAW socket.
      
      For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
      without breaking backward compatibility (the value of this field was never
      checked). Let's add a new kind of control message, so that the userland
      could specify which protocol is used.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      3632679d
    • Horatiu Vultur's avatar
      lan966x: Fix unloading/loading of the driver · 60076124
      Horatiu Vultur authored
      It was noticing that after a while when unloading/loading the driver and
      sending traffic through the switch, it would stop working. It would stop
      forwarding any traffic and the only way to get out of this was to do a
      power cycle of the board. The root cause seems to be that the switch
      core is initialized twice. Apparently initializing twice the switch core
      disturbs the pointers in the queue systems in the HW, so after a while
      it would stop sending the traffic.
      Unfortunetly, it is not possible to use a reset of the switch here,
      because the reset line is connected to multiple devices like MDIO,
      SGPIO, FAN, etc. So then all the devices will get reseted when the
      network driver will be loaded.
      So the fix is to check if the core is initialized already and if that is
      the case don't initialize it again.
      
      Fixes: db8bcaad ("net: lan966x: add the basic lan966x driver")
      Signed-off-by: default avatarHoratiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230522120038.3749026-1-horatiu.vultur@microchip.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      60076124
    • Shay Drory's avatar
      net/mlx5: Fix indexing of mlx5_irq · 1da438c0
      Shay Drory authored
      After the cited patch, mlx5_irq xarray index can be different then
      mlx5_irq MSIX table index.
      Fix it by storing both mlx5_irq xarray index and MSIX table index.
      
      Fixes: 3354822c ("net/mlx5: Use dynamic msix vectors allocation")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarEli Cohen <elic@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      1da438c0
    • Shay Drory's avatar
      net/mlx5: Fix irq affinity management · ef8c063c
      Shay Drory authored
      The cited patch deny the user of changing the affinity of mlx5 irqs,
      which break backward compatibility.
      Hence, allow the user to change the affinity of mlx5 irqs.
      
      Fixes: bbac70c7 ("net/mlx5: Use newer affinity descriptor")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarEli Cohen <elic@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      ef8c063c
    • Shay Drory's avatar
      net/mlx5: Free irqs only on shutdown callback · 9c2d0801
      Shay Drory authored
      Whenever a shutdown is invoked, free irqs only and keep mlx5_irq
      synthetic wrapper intact in order to avoid use-after-free on
      system shutdown.
      
      for example:
      ==================================================================
      BUG: KASAN: use-after-free in _find_first_bit+0x66/0x80
      Read of size 8 at addr ffff88823fc0d318 by task kworker/u192:0/13608
      
      CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
      Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
      Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
      Call Trace:
        <TASK>
        dump_stack_lvl+0x34/0x48
        print_report+0x170/0x473
        ? _find_first_bit+0x66/0x80
        kasan_report+0xad/0x130
        ? _find_first_bit+0x66/0x80
        _find_first_bit+0x66/0x80
        mlx5e_open_channels+0x3c5/0x3a10 [mlx5_core]
        ? console_unlock+0x2fa/0x430
        ? _raw_spin_lock_irqsave+0x8d/0xf0
        ? _raw_spin_unlock_irqrestore+0x42/0x80
        ? preempt_count_add+0x7d/0x150
        ? __wake_up_klogd.part.0+0x7d/0xc0
        ? vprintk_emit+0xfe/0x2c0
        ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
        ? dev_attr_show.cold+0x35/0x35
        ? devlink_health_do_dump.part.0+0x174/0x340
        ? devlink_health_report+0x504/0x810
        ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        ? process_one_work+0x680/0x1050
        mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
        ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
        ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
        mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
        ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
        devlink_health_reporter_recover+0xa6/0x1f0
        devlink_health_report+0x2f7/0x810
        ? vsnprintf+0x854/0x15e0
        mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
        ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
        ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
        ? newidle_balance+0x9b7/0xe30
        ? psi_group_change+0x6a7/0xb80
        ? mutex_lock+0x96/0xf0
        ? __mutex_lock_slowpath+0x10/0x10
        mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        process_one_work+0x680/0x1050
        worker_thread+0x5a0/0xeb0
        ? process_one_work+0x1050/0x1050
        kthread+0x2a2/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x22/0x30
        </TASK>
      
      Freed by task 1:
        kasan_save_stack+0x23/0x50
        kasan_set_track+0x21/0x30
        kasan_save_free_info+0x2a/0x40
        ____kasan_slab_free+0x169/0x1d0
        slab_free_freelist_hook+0xd2/0x190
        __kmem_cache_free+0x1a1/0x2f0
        irq_pool_free+0x138/0x200 [mlx5_core]
        mlx5_irq_table_destroy+0xf6/0x170 [mlx5_core]
        mlx5_core_eq_free_irqs+0x74/0xf0 [mlx5_core]
        shutdown+0x194/0x1aa [mlx5_core]
        pci_device_shutdown+0x75/0x120
        device_shutdown+0x35c/0x620
        kernel_restart+0x60/0xa0
        __do_sys_reboot+0x1cb/0x2c0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x4b/0xb5
      
      The buggy address belongs to the object at ffff88823fc0d300
        which belongs to the cache kmalloc-192 of size 192
      The buggy address is located 24 bytes inside of
        192-byte region [ffff88823fc0d300, ffff88823fc0d3c0)
      
      The buggy address belongs to the physical page:
      page:0000000010139587 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x23fc0c
      head:0000000010139587 order:1 compound_mapcount:0 compound_pincount:0
      flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
      raw: 002ffff800010200 0000000000000000 dead000000000122 ffff88810004ca00
      raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
        ffff88823fc0d200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff88823fc0d280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       >ffff88823fc0d300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                   ^
        ffff88823fc0d380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
        ffff88823fc0d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ==================================================================
      general protection fault, probably for non-canonical address
      0xdffffc005c40d7ac: 0000 [#1] PREEMPT SMP KASAN NOPTI
      KASAN: probably user-memory-access in range [0x00000002e206bd60-0x00000002e206bd67]
      CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
      Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
      Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
      RIP: 0010:__alloc_pages+0x141/0x5c0
      Call Trace:
        <TASK>
        ? sysvec_apic_timer_interrupt+0xa0/0xc0
        ? asm_sysvec_apic_timer_interrupt+0x16/0x20
        ? __alloc_pages_slowpath.constprop.0+0x1ec0/0x1ec0
        ? _raw_spin_unlock_irqrestore+0x3d/0x80
        __kmalloc_large_node+0x80/0x120
        ? kvmalloc_node+0x4e/0x170
        __kmalloc_node+0xd4/0x150
        kvmalloc_node+0x4e/0x170
        mlx5e_open_channels+0x631/0x3a10 [mlx5_core]
        ? console_unlock+0x2fa/0x430
        ? _raw_spin_lock_irqsave+0x8d/0xf0
        ? _raw_spin_unlock_irqrestore+0x42/0x80
        ? preempt_count_add+0x7d/0x150
        ? __wake_up_klogd.part.0+0x7d/0xc0
        ? vprintk_emit+0xfe/0x2c0
        ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
        ? dev_attr_show.cold+0x35/0x35
        ? devlink_health_do_dump.part.0+0x174/0x340
        ? devlink_health_report+0x504/0x810
        ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        ? process_one_work+0x680/0x1050
        mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
        ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
        ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
        mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
        ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
        devlink_health_reporter_recover+0xa6/0x1f0
        devlink_health_report+0x2f7/0x810
        ? vsnprintf+0x854/0x15e0
        mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
        ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
        ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
        ? newidle_balance+0x9b7/0xe30
        ? psi_group_change+0x6a7/0xb80
        ? mutex_lock+0x96/0xf0
        ? __mutex_lock_slowpath+0x10/0x10
        mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        process_one_work+0x680/0x1050
        worker_thread+0x5a0/0xeb0
        ? process_one_work+0x1050/0x1050
        kthread+0x2a2/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x22/0x30
        </TASK>
      ---[ end trace 0000000000000000  ]---
      RIP: 0010:__alloc_pages+0x141/0x5c0
      Code: e0 39 a3 96 89 e9 b8 22 01 32 01 83 e1 0f 48 89 fa 01 c9 48 c1 ea
      03 d3 f8 83 e0 03 89 44 24 6c 48 b8 00 00 00 00 00 fc ff df <80> 3c 02
      00 0f 85 fc 03 00 00 89 e8 4a 8b 14 f5 e0 39 a3 96 4c 89
      RSP: 0018:ffff888251f0f438 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 1ffff1104a3e1e8b RCX: 0000000000000000
      RDX: 000000005c40d7ac RSI: 0000000000000003 RDI: 00000002e206bd60
      RBP: 0000000000052dc0 R08: ffff8882b0044218 R09: ffff8882b0045e8a
      R10: fffffbfff300fefc R11: ffff888167af4000 R12: 0000000000000003
      R13: 0000000000000000 R14: 00000000696c7070 R15: ffff8882373f4380
      FS:  0000000000000000(0000) GS:ffff88bf2be80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005641d031eee8 CR3: 0000002e7ca14000 CR4: 0000000000350ee0
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range:
      0xffffffff80000000-0xffffffffbfffffff)
      ---[ end Kernel panic - not syncing: Fatal exception  ]---]
      Reported-by: default avatarFrederick Lawler <fred@cloudflare.com>
      Link: https://lore.kernel.org/netdev/be5b9271-7507-19c5-ded1-fa78f1980e69@cloudflare.comSigned-off-by: default avatarShay Drory <shayd@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      9c2d0801
    • Shay Drory's avatar
      net/mlx5: Devcom, serialize devcom registration · 1f893f57
      Shay Drory authored
      From one hand, mlx5 driver is allowing to probe PFs in parallel.
      From the other hand, devcom, which is a share resource between PFs, is
      registered without any lock. This might resulted in memory problems.
      
      Hence, use the global mlx5_dev_list_lock in order to serialize devcom
      registration.
      
      Fixes: fadd59fc ("net/mlx5: Introduce inter-device communication mechanism")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarMark Bloch <mbloch@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      1f893f57