1. 13 Aug, 2018 31 commits
  2. 12 Aug, 2018 9 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-ancestor-cgroup-id' · 2ce3206b
      Daniel Borkmann authored
      Andrey Ignatov says:
      
      ====================
      This patch set adds new BPF helper bpf_skb_ancestor_cgroup_id that returns
      id of cgroup v2 that is ancestor of cgroup associated with the skb at the
      ancestor_level.
      
      The helper is useful to implement policies in TC based on cgroups that are
      upper in hierarchy than immediate cgroup associated with skb.
      
      v1->v2:
      - more reliable check for testing IPv6 to become ready in selftest.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      2ce3206b
    • Andrey Ignatov's avatar
      selftests/bpf: Selftest for bpf_skb_ancestor_cgroup_id · 5ecd8c22
      Andrey Ignatov authored
      Add selftests for bpf_skb_ancestor_cgroup_id helper.
      
      test_skb_cgroup_id.sh prepares testing interface and adds tc qdisc and
      filter for it using BPF object compiled from test_skb_cgroup_id_kern.c
      program.
      
      BPF program in test_skb_cgroup_id_kern.c gets ancestor cgroup id using
      the new helper at different levels of cgroup hierarchy that skb belongs
      to, including root level and non-existing level, and saves it to the map
      where the key is the level of corresponding cgroup and the value is its
      id.
      
      To trigger BPF program, user space program test_skb_cgroup_id_user is
      run. It adds itself into testing cgroup and sends UDP datagram to
      link-local multicast address of testing interface. Then it reads cgroup
      ids saved in kernel for different levels from the BPF map and compares
      them with those in user space. They must be equal for every level of
      ancestry.
      
      Example of run:
        # ./test_skb_cgroup_id.sh
        Wait for testing link-local IP to become available ... OK
        Note: 8 bytes struct bpf_elf_map fixup performed due to size mismatch!
        [PASS]
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      5ecd8c22
    • Andrey Ignatov's avatar
      selftests/bpf: Add cgroup id helpers to bpf_helpers.h · 02f6ac74
      Andrey Ignatov authored
      Add bpf_skb_cgroup_id and bpf_skb_ancestor_cgroup_id helpers to
      bpf_helpers.h to use them in tests and samples.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      02f6ac74
    • Andrey Ignatov's avatar
      bpf: Sync bpf.h to tools/ · 539764d0
      Andrey Ignatov authored
      Sync skb_ancestor_cgroup_id() related bpf UAPI changes to tools/.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      539764d0
    • Andrey Ignatov's avatar
      bpf: Introduce bpf_skb_ancestor_cgroup_id helper · 77236281
      Andrey Ignatov authored
      == Problem description ==
      
      It's useful to be able to identify cgroup associated with skb in TC so
      that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
      helper can help with this.
      
      Though in real life cgroup hierarchy and hierarchy to apply a policy to
      don't map 1:1.
      
      It's often the case that there is a container and corresponding cgroup,
      but there are many more sub-cgroups inside container, e.g. because it's
      delegated to containerized application to control resources for its
      subsystems, or to separate application inside container from infra that
      belongs to containerization system (e.g. sshd).
      
      At the same time it may be useful to apply a policy to container as a
      whole.
      
      If multiple containers like this are run on a host (what is often the
      case) and many of them have sub-cgroups, it may not be possible to apply
      per-container policy in TC with existing helpers such as
      bpf_skb_under_cgroup or bpf_skb_cgroup_id:
      
      * bpf_skb_cgroup_id will return id of immediate cgroup associated with
        skb, i.e. if it's a sub-cgroup inside container, it can't be used to
        identify container's cgroup;
      
      * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
        i.e. if there are N containers on a host and a policy has to be
        applied to M of them (0 <= M <= N), it'd require M calls to
        bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
        load new BPF program.
      
      == Solution ==
      
      The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
      used to get id of cgroup v2 that is an ancestor of cgroup associated
      with skb at specified level of cgroup hierarchy.
      
      That way admin can place all containers on one level of cgroup hierarchy
      (what is a good practice in general and already used in many
      configurations) and identify specific cgroup on this level no matter
      what sub-cgroup skb is associated with.
      
      E.g. if there is a cgroup hierarchy:
        root/
        root/container1/
        root/container1/app11/
        root/container1/app11/sub-app-a/
        root/container1/app12/
        root/container2/
        root/container2/app21/
        root/container2/app22/
        root/container2/app22/sub-app-b/
      
      , then having skb associated with root/container1/app11/sub-app-a/ it's
      possible to get ancestor at level 1, what is container1 and apply policy
      for this container, or apply another policy if it's container2.
      
      Policies can be kept e.g. in a hash map where key is a container cgroup
      id and value is an action.
      
      Levels where container cgroups are created are usually known in advance
      whether cgroup hierarchy inside container may be hard to predict
      especially in case when its creation is delegated to containerized
      application.
      
      == Implementation details ==
      
      The helper gets ancestor by walking parents up to specified level.
      
      Another option would be to get different kind of "id" from
      cgroup->ancestor_ids[level] and use it with idr_find() to get struct
      cgroup for ancestor. But that would require radix lookup what doesn't
      seem to be better (at least it's not obviously better).
      
      Format of return value of the new helper is same as that of
      bpf_skb_cgroup_id.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      77236281
    • Daniel Borkmann's avatar
      bpf: decouple btf from seq bpf fs dump and enable more maps · e8d2bec0
      Daniel Borkmann authored
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") and 699c86d6 ("bpf: btf: add pretty
      print for hash/lru_hash maps") enabled support for BTF and
      dumping via BPF fs for array and hash/lru map. However, both
      can be decoupled from each other such that regular BPF maps
      can be supported for attaching BTF key/value information,
      while not all maps necessarily need to dump via map_seq_show_elem()
      callback.
      
      The basic sanity check which is a prerequisite for all maps
      is that key/value size has to match in any case, and some maps
      can have extra checks via map_check_btf() callback, e.g.
      probing certain types or indicating no support in general. With
      that we can also enable retrieving BTF info for per-cpu map
      types and lpm.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      e8d2bec0
    • David S. Miller's avatar
      Merge branch 'ip-faster-in-order-IP-fragments' · 78cbac64
      David S. Miller authored
      Peter Oskolkov says:
      
      ====================
      ip: faster in-order IP fragments
      
      Added "Signed-off-by" in v2.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      78cbac64
    • Peter Oskolkov's avatar
      ip: process in-order fragments efficiently · a4fd284a
      Peter Oskolkov authored
      This patch changes the runtime behavior of IP defrag queue:
      incoming in-order fragments are added to the end of the current
      list/"run" of in-order fragments at the tail.
      
      On some workloads, UDP stream performance is substantially improved:
      
      RX: ./udp_stream -F 10 -T 2 -l 60
      TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60
      
      with this patchset applied on a 10Gbps receiver:
      
        throughput=9524.18
        throughput_units=Mbit/s
      
      upstream (net-next):
      
        throughput=4608.93
        throughput_units=Mbit/s
      Reported-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarPeter Oskolkov <posk@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a4fd284a
    • Peter Oskolkov's avatar
      ip: add helpers to process in-order fragments faster. · 353c9cb3
      Peter Oskolkov authored
      This patch introduces several helper functions/macros that will be
      used in the follow-up patch. No runtime changes yet.
      
      The new logic (fully implemented in the second patch) is as follows:
      
      * Nodes in the rb-tree will now contain not single fragments, but lists
        of consecutive fragments ("runs").
      
      * At each point in time, the current "active" run at the tail is
        maintained/tracked. Fragments that arrive in-order, adjacent
        to the previous tail fragment, are added to this tail run without
        triggering the re-balancing of the rb-tree.
      
      * If a fragment arrives out of order with the offset _before_ the tail run,
        it is inserted into the rb-tree as a single fragment.
      
      * If a fragment arrives after the current tail fragment (with a gap),
        it starts a new "tail" run, as is inserted into the rb-tree
        at the end as the head of the new run.
      
      skb->cb is used to store additional information
      needed here (suggested by Eric Dumazet).
      Reported-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarPeter Oskolkov <posk@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      353c9cb3