1. 12 Feb, 2019 7 commits
  2. 11 Feb, 2019 9 commits
    • Merge branch 'skb_sk-sk_fullsock-tcp_sock' · d105fa98
      Alexei Starovoitov authored
      Martin KaFai Lau says:
      
      ====================
      This series adds __sk_buff->sk, "struct bpf_tcp_sock",
      BPF_FUNC_sk_fullsock and BPF_FUNC_tcp_sock.  Together, they provide
      a common way to expose the members of "struct tcp_sock" and
      "struct bpf_sock" for the bpf_prog to access.
      
      The patch series first adds a bpf_sock pointer to __sk_buff
      and a new helper BPF_FUNC_sk_fullsock.
      
      It then adds BPF_FUNC_tcp_sock to get a bpf_tcp_sock
      pointer from a bpf_sock pointer.
      
      The current use case is to allow a cg_skb_bpf_prog to provide
      per cgroup traffic policing/shaping.
      
      Please see the individual patches for details.
      
      v2:
      - Patch 1 depends on
        commit d6238766 ("bpf: Fix narrow load on a bpf_sock returned from sk_lookup()")
        in the bpf branch.
      - Add sk_to_full_sk() to bpf_sk_fullsock() and bpf_tcp_sock()
        such that there is a way to access the listener's sk and tcp_sk
        when __sk_buff->sk is a request_sock.
        The comments in the uapi bpf.h are updated accordingly.
      - bpf_ctx_range_till() is used in bpf_sock_common_is_valid_access()
        in patch 1.  Saved a few lines.
      - Patch 2 is new in v2 and it adds "state", "dst_ip4", "dst_ip6" and
        "dst_port" to the bpf_sock.  Narrow load is allowed on them.
        The "state" (i.e. sk_state) has already been used in
        INET_DIAG (e.g. ss -t) and getsockopt(TCP_INFO).
      - While at it in the new patch 2, also allow narrow load on some
        existing fields of the bpf_sock, which are "family", "type", "protocol"
        and "src_port".  Only loads starting at the first byte are allowed
        for now, i.e. a narrow load starting from the 2nd byte is rejected.
      - Add some narrow load tests to the test_verifier's sock.c
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d105fa98
    • bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock · e0b27b3f
      Martin KaFai Lau authored
      This patch adds a C program to show the usage of
      skb->sk and bpf_tcp_sock.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e0b27b3f
    • bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifer · fb47d1d9
      Martin KaFai Lau authored
      This patch tests accessing the skb->sk and the new helpers,
      bpf_sk_fullsock and bpf_tcp_sock.
      
      The errstr of some existing "reference tracking" tests is changed
      with s/bpf_sock/sock/ and s/socket/sock/ where "sock" is from the
      verifier's reg_type_str[].
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fb47d1d9
    • bpf: Sync bpf.h to tools/ · 281f9e75
      Martin KaFai Lau authored
      This patch syncs the uapi bpf.h to tools/.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      281f9e75
    • bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock · 655a51e5
      Martin KaFai Lau authored
      This patch adds a helper function BPF_FUNC_tcp_sock and it
      is currently available for cg_skb and sched_(cls|act):
      
      struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_tcp_sock *tp;
      	struct bpf_sock *sk;
      	__u32 snd_cwnd;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	tp = bpf_tcp_sock(sk);
      	if (!tp)
      		return 1;
      
      	snd_cwnd = tp->snd_cwnd;
      	/* ... */
      
      	return 1;
      }
      
      A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
      read-only access.  bpf_tcp_sock has all the existing tcp_sock fields
      that have already been exposed by bpf_sock_ops,
      i.e. no new tcp_sock's fields are exposed in bpf.h.
      
      This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
      or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
      returns NULL.  Hence, the caller needs to check for NULL before
      accessing it.
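
The NULL-return rule above can be modeled in plain userspace C. This is only a sketch: `struct model_sock`, `enum sk_kind` and the `model_*` functions are hypothetical stand-ins invented for illustration, not the kernel's types or the helper's actual implementation. It assumes that a request sock maps to its listener and that anything which does not resolve to a full TCP sock yields NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel types; not the real layouts. */
enum sk_kind { SK_FULL, SK_REQUEST, SK_TIMEWAIT };

#define MODEL_IPPROTO_TCP 6

struct model_sock {
	enum sk_kind kind;
	int protocol;                /* e.g. 6 for TCP, 17 for UDP */
	struct model_sock *listener; /* only meaningful for SK_REQUEST */
	unsigned int snd_cwnd;       /* only meaningful for full TCP socks */
};

/* Map a request sock to its listener; pass other socks through unchanged. */
static struct model_sock *model_sk_to_full_sk(struct model_sock *sk)
{
	if (sk && sk->kind == SK_REQUEST)
		return sk->listener;
	return sk;
}

/* Return the sock only if it resolves to a full TCP sock, else NULL. */
static struct model_sock *model_tcp_sock(struct model_sock *sk)
{
	sk = model_sk_to_full_sk(sk);
	if (!sk || sk->kind != SK_FULL || sk->protocol != MODEL_IPPROTO_TCP)
		return NULL;
	return sk;
}
```

In this model a request sock yields its listener's fields, while a UDP or timewait sock yields NULL, which is why the caller has to test the result before dereferencing it.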
      
      The current use case is to expose members from tcp_sock
      to allow a cg_skb_bpf_prog to provide per cgroup traffic
      policing/shaping.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      655a51e5
    • bpf: Refactor sock_ops_convert_ctx_access · 9b1f3d6e
      Martin KaFai Lau authored
      The next patch will introduce a new "struct bpf_tcp_sock" which
      exposes the same tcp_sock's fields already exposed in
      "struct bpf_sock_ops".
      
      This patch refactors the existing convert_ctx_access() code for
      "struct bpf_sock_ops" to get them ready to be reused for
      "struct bpf_tcp_sock".  The "rtt_min" is not refactored
      in this patch because its handling is different from other
      fields.
      
      The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD
      changes are code move only.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9b1f3d6e
    • bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock · aa65d696
      Martin KaFai Lau authored
      This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the
      bpf_sock.  The userspace has already been using "state",
      e.g. inet_diag (ss -t) and getsockopt(TCP_INFO).
      
      This patch also allows narrow load on the following existing fields:
      "family", "type", "protocol" and "src_port".  Unlike IP address,
      the load offset is restricted to the first byte for them but it
      can be relaxed later if there is a use case.
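
The difference between the two kinds of fields can be sketched as a small userspace predicate. This is a model under stated assumptions, not the kernel's bpf_sock_is_valid_access(): the function name, its parameters, and the alignment check are all assumptions made for illustration, and the offsets passed in are whatever the caller chooses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the narrow-load rule. A load no wider than the
 * field is accepted if it stays inside the field and is aligned to its
 * own size (the alignment rule is an assumption of this sketch). Fields
 * flagged first_byte_only must additionally be loaded from their first
 * byte.
 */
static bool model_narrow_load_ok(size_t field_off, size_t field_size,
				 size_t load_off, size_t load_size,
				 bool first_byte_only)
{
	if (load_size == 0 || load_size > field_size)
		return false;
	if (load_off < field_off ||
	    load_off + load_size > field_off + field_size)
		return false;
	if (load_off % load_size != 0)
		return false;
	if (first_byte_only && load_off != field_off)
		return false;
	return true;
}
```

With a 4-byte field at a made-up offset 4 and first_byte_only set, a 1-byte load at offset 4 passes and one at offset 5 fails; with the flag clear (the IP address case), a 2-byte load at the field's offset + 2 passes.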
      
      This patch also folds __sock_filter_check_size() into
      bpf_sock_is_valid_access() since it is not called
      from anywhere else.  All bpf_sock checking is now in
      one place.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      aa65d696
    • bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper · 46f8bc92
      Martin KaFai Lau authored
      In the kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
      before accessing the fields in sock.  For example, in __netdev_pick_tx:
      
      static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
      			    struct net_device *sb_dev)
      {
      	/* ... */
      
      	struct sock *sk = skb->sk;
      
      		if (queue_index != new_index && sk &&
      		    sk_fullsock(sk) &&
      		    rcu_access_pointer(sk->sk_dst_cache))
      			sk_tx_queue_set(sk, new_index);
      
      	/* ... */
      
      	return queue_index;
      }
      
      This patch adds a "struct bpf_sock *sk" pointer to "struct __sk_buff".
      A few of the convert_ctx_access() functions in filter.c already
      access the sock_common fields of skb->sk,
      e.g. sock_ops_convert_ctx_access().
      
      "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
      Some of the fields in "bpf_sock" will not be directly
      accessible through the "__sk_buff->sk" pointer.  It is limited
      by the new "bpf_sock_common_is_valid_access()".
      e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
           are not allowed.
      
      The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
      can be used to get a sk with all accessible fields in "bpf_sock".
      This helper is added to both cg_skb and sched_(cls|act).
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_sock *sk;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	sk = bpf_sk_fullsock(sk);
      	if (!sk)
      		return 1;
      
      	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
      		return 1;
      
      	/* some_traffic_shaping(); */
      
      	return 1;
      }
      
      (1) The sk is read only
      
      (2) There is no new "struct bpf_sock_common" introduced.
      
      (3) Future kernel sock's members could be added to bpf_sock only
          instead of repeatedly adding at multiple places like currently
          in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
      
      (4) After "sk = skb->sk", the reg holding sk is of type
          PTR_TO_SOCK_COMMON_OR_NULL.
      
      (5) After bpf_sk_fullsock(), the returned reg is of type
          PTR_TO_SOCKET_OR_NULL, which is the same as the return type of
          bpf_sk_lookup_xxx().
      
          However, bpf_sk_fullsock() does not take a refcnt, while
          acquire_reference_state() currently depends only on the return type.
          To avoid recording a reference that was never taken, a new
          is_acquire_function() is checked before calling
          acquire_reference_state().
      
      (6) The WARN_ON in "release_reference_state()" is no longer an
          internal verifier bug.
      
          When reg->id is not found in state->refs[], it means the
          bpf_prog does something wrong like
          "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
          never been acquired by calling "bpf_sk_fullsock(skb->sk)".
      
          -EINVAL is returned and a verbose message is logged instead of
          the WARN_ON.  A test is added to the test_verifier in a later patch.
      
          Since the WARN_ON in "release_reference_state()" is no longer
          needed, "__release_reference_state()" is folded into
          "release_reference_state()" also.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      46f8bc92
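
Points (5) and (6) of the commit above (46f8bc92) can be illustrated with a toy userspace model of reference tracking. Everything here is hypothetical: `enum model_func`, `struct model_state` and the `model_*` functions are invented for this sketch and do not reproduce the verifier's data structures. The sketch assumes that only the lookup helpers acquire a reference and that releasing an unknown id is reported as -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper ids and state; the verifier's real types differ. */
enum model_func {
	MODEL_SK_LOOKUP_TCP,
	MODEL_SK_LOOKUP_UDP,
	MODEL_SK_FULLSOCK,
	MODEL_TCP_SOCK,
};

#define MODEL_MAX_REFS 8

struct model_state {
	int refs[MODEL_MAX_REFS]; /* ids of currently held references */
	int nr_refs;
	int next_id;
};

/* Only the lookup helpers take a refcnt; the return type cannot tell. */
static bool model_is_acquire_function(enum model_func f)
{
	return f == MODEL_SK_LOOKUP_TCP || f == MODEL_SK_LOOKUP_UDP;
}

/* Model a helper call: returns the id (> 0) given to the returned pointer. */
static int model_call(struct model_state *st, enum model_func f)
{
	int id;

	if (model_is_acquire_function(f) && st->nr_refs == MODEL_MAX_REFS)
		return -ENOMEM;

	id = ++st->next_id;
	if (model_is_acquire_function(f))
		st->refs[st->nr_refs++] = id;
	return id;
}

/* Releasing an id that was never acquired is a program error. */
static int model_release(struct model_state *st, int id)
{
	for (int i = 0; i < st->nr_refs; i++) {
		if (st->refs[i] == id) {
			st->refs[i] = st->refs[--st->nr_refs];
			return 0;
		}
	}
	return -EINVAL;
}
```

In this model, releasing the id returned for MODEL_SK_FULLSOCK gives -EINVAL because no reference was recorded for it, which mirrors the "bpf_sk_release(bpf_sk_fullsock(skb->sk))" case described in (6).
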
    • bpf: Fix narrow load on a bpf_sock returned from sk_lookup() · 5f456649
      Martin KaFai Lau authored
      By adding this test to test_verifier:
      {
      	"reference tracking: access sk->src_ip4 (narrow load)",
      	.insns = {
      	BPF_SK_LOOKUP,
      	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
      	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
      	BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2),
      	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
      	BPF_EMIT_CALL(BPF_FUNC_sk_release),
      	BPF_EXIT_INSN(),
      	},
      	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
      	.result = ACCEPT,
      },
      
      The above test loads 2 bytes from sk->src_ip4 where
      sk is obtained by bpf_sk_lookup_tcp().
      
      It hits an internal verifier error from convert_ctx_accesses():
      [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
      Failed to load prog 'Invalid argument'!
      0: (b7) r2 = 0
      1: (63) *(u32 *)(r10 -8) = r2
      2: (7b) *(u64 *)(r10 -16) = r2
      3: (7b) *(u64 *)(r10 -24) = r2
      4: (7b) *(u64 *)(r10 -32) = r2
      5: (7b) *(u64 *)(r10 -40) = r2
      6: (7b) *(u64 *)(r10 -48) = r2
      7: (bf) r2 = r10
      8: (07) r2 += -48
      9: (b7) r3 = 36
      10: (b7) r4 = 0
      11: (b7) r5 = 0
      12: (85) call bpf_sk_lookup_tcp#84
      13: (bf) r6 = r0
      14: (15) if r0 == 0x0 goto pc+3
       R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
      15: (69) r2 = *(u16 *)(r0 +26)
      16: (bf) r1 = r6
      17: (85) call bpf_sk_release#86
      18: (95) exit
      
      from 14 to 18: safe
      processed 20 insns (limit 131072), stack depth 48
      bpf verifier is misconfigured
      Summary: 0 PASSED, 0 SKIPPED, 1 FAILED
      
      bpf_sock_is_valid_access() expects that src_ip4 can be narrow-loaded
      (i.e. any 1 or 2 bytes of src_ip4 can be loaded) and signals this by
      marking info->ctx_field_size.  However, this marked
      ctx_field_size is not used.  This patch fixes it.
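
As background for what a narrow load means, here is a userspace sketch showing that reading 1 or 2 bytes inside a 4-byte field gives the same value as reading the full field and then shifting and masking. Both function names are hypothetical and the code is not taken from convert_ctx_accesses(); it only illustrates the arithmetic, with the shift chosen from the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read size (1 or 2) bytes at byte offset off inside a 4-byte field. */
static uint32_t narrow_load_direct(const uint32_t *field, unsigned int off,
				   unsigned int size)
{
	const uint8_t *p = (const uint8_t *)field + off;

	if (size == 1) {
		return *p;
	} else {
		uint16_t v;

		memcpy(&v, p, sizeof(v));
		return v;
	}
}

/* Same value derived from a full-width load followed by shift and mask. */
static uint32_t narrow_load_shift_mask(uint32_t full, unsigned int off,
				       unsigned int size)
{
	const uint16_t probe = 1;
	int little = *(const uint8_t *)&probe == 1;
	/* On big-endian hosts the low-order bytes sit at the high offsets. */
	unsigned int shift = little ? off * 8 : (4 - off - size) * 8;
	uint32_t mask = (size == 2) ? 0xffffu : 0xffu;

	return (full >> shift) & mask;
}
```

The test case in this commit corresponds to off = 2 and size = 2, i.e. a BPF_H load at offsetof(struct bpf_sock, src_ip4) + 2.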
      
      Due to the recent refactoring in test_verifier,
      this new test will be added to the bpf-next branch
      (together with the bpf_tcp_sock patchset)
      to avoid merge conflict.
      
      Fixes: c64b7983 ("bpf: Add PTR_TO_SOCKET verifier type")
      Cc: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5f456649
  3. 08 Feb, 2019 23 commits
  4. 07 Feb, 2019 1 commit
    • net: vxlan: Free a leaked vetoed multicast rdst · fc4aa1ca
      Petr Machata authored
      When an rdst is rejected by a driver, the current code removes it from
      the remote list, but neglects to free it. This is triggered by
      tools/testing/selftests/drivers/net/mlxsw/vxlan_fdb_veto.sh and shows as
      the following kmemleak trace:
      
      unreferenced object 0xffff88817fa3d888 (size 96):
        comm "softirq", pid 0, jiffies 4372702718 (age 165.252s)
        hex dump (first 32 bytes):
          02 00 00 00 c6 33 64 03 80 f5 a2 61 81 88 ff ff  .....3d....a....
          06 df 71 ae ff ff ff ff 0c 00 00 00 04 d2 6a 6b  ..q...........jk
        backtrace:
          [<00000000296b27ac>] kmem_cache_alloc_trace+0x1ae/0x370
          [<0000000075c86dc6>] vxlan_fdb_append.part.12+0x62/0x3b0 [vxlan]
          [<00000000e0414b63>] vxlan_fdb_update+0xc61/0x1020 [vxlan]
          [<00000000f330c4bd>] vxlan_fdb_add+0x2e8/0x3d0 [vxlan]
          [<0000000008f81c2c>] rtnl_fdb_add+0x4c2/0xa10
          [<00000000bdc4b270>] rtnetlink_rcv_msg+0x6dd/0x970
          [<000000006701f2ce>] netlink_rcv_skb+0x290/0x410
          [<00000000c08a5487>] rtnetlink_rcv+0x15/0x20
          [<00000000d5f54b1e>] netlink_unicast+0x43f/0x5e0
          [<00000000db4336bb>] netlink_sendmsg+0x789/0xcd0
          [<00000000e1ee26b6>] sock_sendmsg+0xba/0x100
          [<00000000ba409802>] ___sys_sendmsg+0x631/0x960
          [<000000003c332113>] __sys_sendmsg+0xea/0x180
          [<00000000f4139144>] __x64_sys_sendmsg+0x78/0xb0
          [<000000006d1ddc59>] do_syscall_64+0x94/0x410
          [<00000000c8defa9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Move vxlan_dst_free() up and schedule a call thereof to plug this leak.
      
      Fixes: 61f46fe8 ("vxlan: Allow vetoing of FDB notifications")
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc4aa1ca