    bpf: Make cgroup storages shared between programs on the same cgroup · 7d9c3427
    YiFei Zhu authored
    This change comes in several parts:
    
    First, the restriction that the CGROUP_STORAGE map can only be used
    by one program is removed. This results in the removal of the field
    'aux' in struct bpf_cgroup_storage_map, and removal of relevant
    code associated with the field, and removal of now-noop functions
    bpf_free_cgroup_storage and bpf_cgroup_storage_release.
    
    Second, we permit a key of type u64 as the key to the map.
    Providing such a key type indicates that the map should ignore
    attach type when comparing map keys. However, for simplicity newly
    linked storage will still have the attach type at link time in
    its key struct. cgroup_storage_check_btf is adapted to accept
    u64 as the type of the key.
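
The effect of the two key types on lookup can be sketched in plain C. This is a userspace model, not the kernel code: `struct storage_key`, `cmp_u64`, `storage_key_cmp` and the `isolated` flag are hypothetical names chosen for illustration, and the sketch assumes the comparison orders by cgroup id first and consults the attach type only for struct-keyed maps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in (assumption) for the struct-typed map key. */
struct storage_key {
	uint64_t cgroup_inode_id;
	uint32_t attach_type;
};

/* Three-way comparison of two cgroup ids: returns -1, 0 or 1. */
static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/*
 * Hypothetical comparison helper. `isolated` is true when the map was
 * declared with the struct key, so attach_type takes part in the
 * comparison. It is false when the map was declared with a u64 key,
 * in which case attach_type is stored but ignored.
 */
static int storage_key_cmp(bool isolated,
			   const struct storage_key *k1,
			   const struct storage_key *k2)
{
	int c;

	assert(k1 && k2);
	c = cmp_u64(k1->cgroup_inode_id, k2->cgroup_inode_id);
	if (c != 0 || !isolated)
		return c;
	return (k1->attach_type > k2->attach_type) -
	       (k1->attach_type < k2->attach_type);
}
```

With a u64 key, two storages that differ only in attach type compare equal, so programs of different attach types on the same cgroup resolve to the same storage.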
    
    Third, because the storages are now shared, the storages cannot
    be unconditionally freed on program detach. There could be two
    ways to solve this issue:
    * A. Reference count the usage of the storages, and free when the
         last program is detached.
    * B. Free only when the storage is impossible to be referred to
         again, i.e. when either the cgroup_bpf it is attached to, or
         the map itself, is freed.
    Option A has the side effect that, when the user detaches and
    reattach a program, whether the program gets a fresh storage
    depends on whether there is another program attached using that
    storage. This could trigger races if the user is multi-threaded,
    and since nondeterminism in data races is evil, go with option B.
    
    Both the map and the cgroup_bpf now track their associated
    storages, and the storage unlink and free are removed from
    cgroup_bpf_detach and added to cgroup_bpf_release and
    cgroup_storage_map_free. The latter also now holds the cgroup_mutex
    to prevent any races with the former.
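
The lifetime rule of option B can be illustrated with a small userspace model. All names here (`struct storage`, `struct cgroup_model`, `cgroup_link_storage`, `cgroup_detach`, `cgroup_release`) are hypothetical. The sketch tracks storages only on the cgroup side and leaves out the map-side list and the cgroup_mutex locking described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct storage {
	struct storage *next_in_cgroup;	/* link on the owning cgroup's list */
	int value;			/* stand-in for the stored data */
};

struct cgroup_model {
	struct storage *storages;	/* every storage linked to this cgroup */
	int attached_progs;
};

/* Allocate a storage and link it to the cgroup's tracking list. */
static struct storage *cgroup_link_storage(struct cgroup_model *cg)
{
	struct storage *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->next_in_cgroup = cg->storages;
	cg->storages = s;
	return s;
}

/* Detach drops the program only; the storage stays linked (option B). */
static void cgroup_detach(struct cgroup_model *cg)
{
	assert(cg->attached_progs > 0);
	cg->attached_progs--;
}

/* Release frees every tracked storage and returns how many were freed. */
static int cgroup_release(struct cgroup_model *cg)
{
	int freed = 0;

	while (cg->storages) {
		struct storage *s = cg->storages;

		cg->storages = s->next_in_cgroup;
		free(s);
		freed++;
	}
	return freed;
}
```

In this model a detach followed by a reattach always finds the old contents, regardless of what other programs are attached, which is the determinism that motivates option B.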
    
    Fourth, on attach, we reuse the old storage if the key already
    exists in the map, via cgroup_storage_lookup. If the storage
    does not exist yet, we create a new one, and publish it at the
    last step in the attach process. This does not create a race
    condition because for the whole attach the cgroup_mutex is held.
    We keep track of an array of new storages that were allocated,
    and if the process fails only the new storages are freed.
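
The reuse-or-create flow with late publication might look roughly like the following userspace model. It is a sketch under simplifying assumptions: one storage per attach instead of an array, a fixed-size slot table in place of the real map, and a `fail_late` flag that simulates an error after the storage has been prepared. `struct map_model`, `map_lookup`, `map_publish` and `attach` are hypothetical names, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_SLOTS 4

struct storage {
	uint64_t key;
};

struct map_model {
	struct storage *slot[MAX_SLOTS];	/* published storages */
};

/* Return the published storage for `key`, or NULL if there is none. */
static struct storage *map_lookup(const struct map_model *m, uint64_t key)
{
	for (size_t i = 0; i < MAX_SLOTS; i++)
		if (m->slot[i] && m->slot[i]->key == key)
			return m->slot[i];
	return NULL;
}

/* Make `s` visible to lookups; fails when the table is full. */
static bool map_publish(struct map_model *m, struct storage *s)
{
	assert(m && s);
	for (size_t i = 0; i < MAX_SLOTS; i++) {
		if (!m->slot[i]) {
			m->slot[i] = s;
			return true;
		}
	}
	return false;
}

/* Returns the storage the program ends up using, or NULL on failure. */
static struct storage *attach(struct map_model *m, uint64_t key,
			      bool fail_late)
{
	struct storage *old = map_lookup(m, key);
	struct storage *fresh = NULL;

	if (!old) {
		/* No storage for this key yet: prepare one, unpublished. */
		fresh = calloc(1, sizeof(*fresh));
		if (!fresh)
			return NULL;
		fresh->key = key;
	}

	if (fail_late) {
		/* Roll back only what this attach allocated. */
		free(fresh);	/* free(NULL) is a no-op when reusing */
		return NULL;
	}

	/* Publishing is the last step, once nothing else can fail. */
	if (fresh && !map_publish(m, fresh)) {
		free(fresh);
		return NULL;
	}
	return old ? old : fresh;
}
```

Because a new storage only becomes visible in the final step, a failed attach leaves the map exactly as it was, and a storage that was already in the map is never freed on the error path.
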
    Signed-off-by: YiFei Zhu <zhuyifei@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/d5401c6106728a00890401190db40020a1f84ff1.1595565795.git.zhuyifei@google.com
core.c 58.3 KB