    bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
    Daniel Borkmann authored
    In Cilium we're mainly using BPF cgroup hooks today in order to implement
    kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
    ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
    between Cilium managed nodes. While this works in its current shape and avoids
    packet-level NAT for inter Cilium managed node traffic, there is one major
    limitation we're facing today, that is, lack of netns awareness.
    
    In Kubernetes, the concept of Pods (which hold one or multiple containers)
    has been built around network namespaces, so while we can use the global scope
    of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
    NodePort ports on loopback addresses), we also have the need to differentiate
    between the initial network namespace and non-initial ones. For example, ExternalIP
    services mandate that non-local service IPs are not to be translated from the
    host (initial) network namespace. Right now, we have an ugly
    work-around in place where non-local service IPs for ExternalIP services are
    not translated from the connect() and friends BPF hooks but instead via less efficient
    packet-level NAT on the veth tc ingress hook for Pod traffic.
    
    On top of determining whether we're in the initial or a non-initial network
    namespace, we also have a need for a socket-cookie-like mechanism at network
    namespace scope. Socket cookies have the nice property that they can be combined as part
    of the key structure e.g. for BPF LRU maps without having to worry that the
    cookie could be recycled. We are planning to use this for our sessionAffinity
    implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
    which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
    provide the cookie for the initial network namespace while passing the context
    instead of NULL would provide the cookie from the application's network namespace.
    We're using a hole, so no size increase; the assignment happens only once.
    This therefore allows for a comparison against the initial namespace as well as
    regular cookie usage as we have today with socket cookies. We could later enable
    this helper for other program types as well, as the need arises.
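
The two use cases above can be illustrated with a small userspace model. This
is a sketch only, not the kernel implementation: every model_* name and
struct affinity_key are hypothetical, the cookie source is assumed to be a
simple increasing counter, and the real helper takes the BPF program context
rather than a namespace pointer. What it mirrors from the description is the
calling convention (NULL selects the initial namespace) and the fact that a
cookie is assigned only once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a network namespace; 0 means "no cookie yet". */
struct model_netns {
    uint64_t cookie;
};

/* Assumption: cookies come from a counter that only grows, so no reuse. */
static uint64_t model_cookie_next;
static struct model_netns model_init_netns;

/* Assign the cookie on first use; every later call returns the same value. */
static uint64_t model_netns_cookie(struct model_netns *ns)
{
    if (ns->cookie == 0) {
        ns->cookie = ++model_cookie_next;
        assert(ns->cookie != 0); /* 0 stays reserved for "unassigned" */
    }
    return ns->cookie;
}

/* Mirrors the described convention: NULL means the initial namespace. */
static uint64_t model_get_netns_cookie(struct model_netns *ctx_ns)
{
    return model_netns_cookie(ctx_ns ? ctx_ns : &model_init_netns);
}

/* Use case 1: are we running in the initial (host) namespace? */
static bool model_in_init_netns(struct model_netns *ctx_ns)
{
    return model_get_netns_cookie(ctx_ns) == model_get_netns_cookie(NULL);
}

/* Use case 2: the cookie as part of a map key, e.g. for session affinity. */
struct affinity_key {
    uint64_t netns_cookie;
    uint32_t service_id;
    uint32_t pad; /* explicit padding keeps the key layout fully defined */
};

static struct affinity_key model_affinity_key(struct model_netns *ctx_ns,
                                              uint32_t service_id)
{
    struct affinity_key key = {
        .netns_cookie = model_get_netns_cookie(ctx_ns),
        .service_id = service_id,
        .pad = 0,
    };
    return key;
}
```

Because a cookie is never handed out twice in this model, a stale map entry
keyed by an old namespace's cookie cannot be mistaken for an entry belonging
to a namespace created later, which is the property the text relies on for
LRU map keys.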
    
      (*) Both externalTrafficPolicy={Local|Cluster} types
      [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
net_namespace.h 12 KB