26 Jul, 2020 (32 commits)
    • Merge branch 'shared-cgroup-storage' · 36f72484
      Alexei Starovoitov authored
      YiFei Zhu says:
      
      ====================
      To access the storage in a CGROUP_STORAGE map, one uses
      bpf_get_local_storage helper, which is extremely fast due to its
      use of per-CPU variables. However, its whole code is built on
      the assumption that one map can only be used by one program at any
      time, and this prohibits any sharing of data between multiple
      programs using these maps, eliminating a lot of use cases, such
      as some per-cgroup configuration storage, written to by a
      setsockopt program and read by a cg_sock_addr program.
      
      Why not use other map types? The great part of the
      CGROUP_STORAGE map is that it is isolated per cgroup it is
      attached to. When one program uses bpf_get_local_storage, even
      on the same map, it gets different storages if it runs as a
      result of attachments to different cgroups. The kernel manages
      the storages, simplifying both the BPF program and userspace.
      In theory, one could probably use other maps like array or
      hash to do the same thing, but that would add major overhead
      and complexity: userspace would need to know when a cgroup is
      freed in order to free up a slot in the replacement map.
      
      This patch set introduces a significant change to the semantics of
      CGROUP_STORAGE map type. Instead of each storage being tied to one
      single attachment, it is shared across different attachments to
      the same cgroup, and persists until either the map or the cgroup
      attached to is being freed.
      
      A user may use u64 as the key to the map, in which case the
      attach type is ignored during key comparison, and programs of
      different attach types will share the same storage if the
      cgroups they are attached to are the same.
      
      How could this break existing users?
      * Users that use detach & reattach / program replacement as a
        shortcut to zeroing the storage. Since we need sharing between
        programs, we cannot zero the storage. Users that expect this
        behavior should either attach a program with a new map, or
        explicitly zero the map with a syscall.
      This case is dependent on undocumented implementation details,
      so the impact should be very minimal.
      
      Patch 1 introduces a test on the old expected behavior of the map
      type.
      
      Patch 2 introduces a test showing how two programs cannot share
      one such map.
      
      Patch 3 implements the change of semantics to the map.
      
      Patch 4 amends the new test such that it yields the behavior we
      expect from the change.
      
      Patch 5 documents the map type.
      
      Changes since RFC:
      * Clarify commit message in patch 3 such that it says the lifetime
        of the storage is ended at the freeing of the cgroup_bpf, rather
        than the cgroup itself.
      * Restored an -ENOMEM check in __cgroup_bpf_attach.
      * Update selftests for recent change in network_helpers API.
      
      Changes since v1:
      * s/CHECK_FAIL/CHECK/
      * s/bpf_prog_attach/bpf_program__attach_cgroup/
      * Moved test__start_subtest to test_cg_storage_multi.
      * Removed some redundant CHECK_FAIL where they are already CHECK-ed.
      
      Changes since v2:
      * Lock cgroup_mutex during map_free.
      * Publish new storages only if attach is successful, by tracking
        exactly which storages are reused in an array of bools.
      * Mention bpftool map dump showing a value of zero for attach_type
        in patch 3 commit message.
      
      Changes since v3:
      * Use a much simpler lookup and allocate-if-not-exist from the fact
        that cgroup_mutex is locked during attach.
      * Removed an unnecessary spinlock hold.
      
      Changes since v4:
      * Changed semantics so that if the key type is struct
        bpf_cgroup_storage_key the map retains isolation between different
        attach types. Sharing between different attach types only occurs
        when the key type is u64.
      * Adapted tests and docs for the above change.
      
      Changes since v5:
      * Removed redundant NULL check before bpf_link__destroy.
      * Free the BPF object explicitly after asserting that it failed
        to load, in case the object unexpectedly loaded successfully.
      * Rename variable in bpf_cgroup_storage_key_cmp for clarity.
      * Added a lot of information to Documentation, more or less copied
        from what Martin KaFai Lau wrote.
      ====================
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Documentation/bpf: Document CGROUP_STORAGE map type · 4e15f460
      YiFei Zhu authored
      The mechanics and usage are not very straightforward. Given the
      changes it's better to document how it works and how to use it,
      rather than having to rely on the examples and implementation to
      infer what is going on.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/b412edfbb05cb1077c9e2a36a981a54ee23fa8b3.1595565795.git.zhuyifei@google.com
    • selftests/bpf: Test CGROUP_STORAGE behavior on shared egress + ingress · 3573f384
      YiFei Zhu authored
      This mirrors the original egress-only test. The cgroup_storage
      is now extended to have two packet counters, one for egress and
      one for ingress. We also extend it to have two egress programs,
      to test that egress will always share with other egress programs
      in the same cgroup. The behavior of the counters is exactly the
      same as in the original egress-only test.
      
      The test is split into two: an "isolated" test where, when the
      key type is struct bpf_cgroup_storage_key (which contains the
      attach type), programs of different attach types will see
      different storages; and a "shared" test where, when the key
      type is u64, programs of different attach types will see the
      same storage if they are attached to the same cgroup.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c756f5f1521227b8e6e90a453299dda722d7324d.1595565795.git.zhuyifei@google.com
    • Merge branch 'fix-bpf_get_stack-with-PEBS' · 90065c06
      Alexei Starovoitov authored
      Song Liu says:
      
      ====================
      Calling get_perf_callchain() on perf_events from PEBS entries may cause
      unwinder errors. To fix this issue, the perf subsystem fetches the
      callchain early, and such perf_events are marked with
      __PERF_SAMPLE_CALLCHAIN_EARLY.
      Similar issue exists when BPF program calls get_perf_callchain() via
      helper functions. For more information about this issue, please refer to
      discussions in [1].
      
      This set fixes this issue with helper proto bpf_get_stackid_pe and
      bpf_get_stack_pe.
      
      [1] https://lore.kernel.org/lkml/ED7B9430-6489-4260-B3C5-9CFA2E3AA87A@fb.com/
      
      Changes v4 => v5:
      1. Return -EPROTO instead of -EINVAL on PERF_EVENT_IOC_SET_BPF errors.
         (Alexei)
      2. Let libbpf print a hint message when PERF_EVENT_IOC_SET_BPF returns
         -EPROTO. (Alexei)
      
      Changes v3 => v4:
      1. Fix error check logic in bpf_get_stackid_pe and bpf_get_stack_pe.
         (Alexei)
      2. Do not allow attaching BPF programs with bpf_get_stack|stackid to
         perf_event with precise_ip > 0, but not proper callchain. (Alexei)
      3. Add selftest get_stackid_cannot_attach.
      
      Changes v2 => v3:
      1. Fix handling of stackmap skip field. (Andrii)
      2. Simplify the code in a few places. (Andrii)
      
      Changes v1 => v2:
      1. Simplify the design and avoid introducing new helper function. (Andrii)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Make cgroup storages shared between programs on the same cgroup · 7d9c3427
      YiFei Zhu authored
      This change comes in several parts:
      
      First, the restriction that the CGROUP_STORAGE map can only be used
      by one program is removed. This results in the removal of the field
      'aux' in struct bpf_cgroup_storage_map, and removal of relevant
      code associated with the field, and removal of now-noop functions
      bpf_free_cgroup_storage and bpf_cgroup_storage_release.
      
      Second, we permit a key of type u64 as the key to the map.
      Providing such a key type indicates that the map should ignore
      attach type when comparing map keys. However, for simplicity newly
      linked storage will still have the attach type at link time in
      its key struct. cgroup_storage_check_btf is adapted to accept
      u64 as the type of the key.
      
      Third, because the storages are now shared, the storages cannot
      be unconditionally freed on program detach. There could be two
      ways to solve this issue:
      * A. Reference count the usage of the storages, and free when the
           last program is detached.
      * B. Free only when the storage is impossible to be referred to
           again, i.e. when either the cgroup_bpf it is attached to, or
           the map itself, is freed.
      Option A has the side effect that, when the user detaches and
      reattaches a program, whether the program gets a fresh storage
      depends on whether another attached program is using that
      storage. This could trigger races if the user is multi-threaded,
      and since nondeterminism in data races is evil, we go with
      option B.
      
      Both the map and the cgroup_bpf now track their associated
      storages, and the storage unlink and free are removed from
      cgroup_bpf_detach and added to cgroup_bpf_release and
      cgroup_storage_map_free. The latter now also holds the
      cgroup_mutex to prevent any races with the former.
      
      Fourth, on attach, we reuse the old storage if the key already
      exists in the map, via cgroup_storage_lookup. If the storage
      does not exist yet, we create a new one, and publish it at the
      last step in the attach process. This does not create a race
      condition because for the whole attach the cgroup_mutex is held.
      We keep track of the newly allocated storages in an array, and
      if the attach fails, only those new storages get freed.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d5401c6106728a00890401190db40020a1f84ff1.1595565795.git.zhuyifei@google.com
    • selftests/bpf: Add get_stackid_cannot_attach · 346938e9
      Song Liu authored
      This test confirms that BPF program that calls bpf_get_stackid() cannot
      attach to perf_event with precise_ip > 0 but not PERF_SAMPLE_CALLCHAIN;
      and cannot attach if the perf_event has exclude_callchain_kernel.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723180648.1429892-6-songliubraving@fb.com
    • selftests/bpf: Test CGROUP_STORAGE map can't be used by multiple progs · 9e5bd1f7
      YiFei Zhu authored
      The current assumption is that the lifetime of a cgroup storage
      is tied to the program's attachment. The storage is created in
      cgroup_bpf_attach, and released upon cgroup_bpf_detach and
      cgroup_bpf_release.
      
      Because the current semantics give each attachment a completely
      independent cgroup storage, and multiple programs can be
      attached to the same (cgroup, attach type) pair, which is the
      key of the CGROUP_STORAGE map, looking up the map with this
      pair could yield multiple storages, and that is not permitted.
      Therefore, the kernel verifier checks that two programs cannot
      share the same CGROUP_STORAGE map, even if they have different
      expected attach types, considering that the actual attach type
      does not always have to be equal to the expected attach type.
      
      The test creates a CGROUP_STORAGE map and makes it shared
      across two different programs, one cgroup_skb/egress and one
      /ingress. It asserts that the two programs cannot both be
      loaded, due to a verifier failure for the above reason.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/30a6b0da67ae6b0296c4d511bfb19c5f3d035916.1595565795.git.zhuyifei@google.com
    • selftests/bpf: Add callchain_stackid · 1da4864c
      Song Liu authored
      This tests the new helper functions bpf_get_stackid_pe and
      bpf_get_stack_pe. These two helpers have a different
      implementation for perf_events with PEBS entries.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200723180648.1429892-5-songliubraving@fb.com
    • selftests/bpf: Add test for CGROUP_STORAGE map on multiple attaches · d4a89c1e
      YiFei Zhu authored
      This test creates a parent cgroup, and a child of that cgroup.
      It attaches a cgroup_skb/egress program that simply counts
      packets into a global variable (an ARRAY map) and into a
      CGROUP_STORAGE map. The program is first attached to the
      parent cgroup only, then to parent and child.
      
      The test case sends a message within the child cgroup, and
      because the program is inherited across parent / child cgroups,
      it will trigger the egress program for both the parent and
      child, if they exist. The program, when looking up the
      CGROUP_STORAGE map, uses the cgroup and attach type of the
      attachment parameters; therefore, the two attachments use
      different cgroup storages.

      We assert that all packet counts match what we expect.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/5a20206afa4606144691c7caa0d1b997cd60dec0.1595565795.git.zhuyifei@google.com
    • libbpf: Print hint when PERF_EVENT_IOC_SET_BPF returns -EPROTO · d4b4dd6c
      Song Liu authored
      The kernel prevents potential unwinder warnings and crashes by
      blocking BPF programs with bpf_get_[stack|stackid] on perf_events
      without PERF_SAMPLE_CALLCHAIN, or with
      exclude_callchain_[kernel|user]. Print a hint message in libbpf
      to help the user debug such issues.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723180648.1429892-4-songliubraving@fb.com
    • Merge branch 'bpf_iter-for-map-elems' · 909e446b
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Bpf iterator has been implemented for task, task_file,
      bpf_map, ipv6_route, netlink, tcp and udp so far.
      
      For map elements, there are two ways to traverse all elements from
      user space:
        1. using BPF_MAP_GET_NEXT_KEY bpf subcommand to get elements
           one by one.
        2. using BPF_MAP_LOOKUP_BATCH bpf subcommand to get a batch of
           elements.
      Both these approaches need to copy data from kernel to user space
      in order to do inspection.
      
      This patch implements bpf iterator for map elements.
      User can have a bpf program in kernel to run with each map element,
      do checking, filtering, aggregation, modifying values etc.
      without copying data to user space.
      
      Patch #1 and #2 are refactoring. Patch #3 implements readonly/readwrite
      buffer support in the verifier. Patches #4 - #7 implement map element
      support for hash, percpu hash, lru hash, lru percpu hash, array,
      percpu array and sock local storage maps. Patches #8 - #9 are libbpf
      and bpftool support. Patches #10 - #13 are selftests for the
      implemented map element iterators.
      
      Changelogs:
        v3 -> v4:
          . fix a kasan failure triggered by a failed bpf_iter link_create,
            not just free_link but need cleanup_link. (Alexei)
        v2 -> v3:
          . rebase on top of latest bpf-next
        v1 -> v2:
          . support to modify map element values. (Alexei)
          . map key/values can be used with helper arguments
            for those arguments with ARG_PTR_TO_MEM or
            ARG_PTR_TO_INIT_MEM register type. (Alexei)
          . remove unused variable. (kernel test robot)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fail PERF_EVENT_IOC_SET_BPF when bpf_get_[stack|stackid] cannot work · 5d99cb2c
      Song Liu authored
      bpf_get_[stack|stackid] on perf_events with precise_ip uses the
      callchain attached to perf_sample_data. If this callchain is not
      present, do not allow attaching BPF programs that call
      bpf_get_[stack|stackid] to such an event.
      
      In the error case, -EPROTO is returned so that libbpf can identify this
      error and print proper hint message.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723180648.1429892-3-songliubraving@fb.com
    • selftests/bpf: Add a test for out of bound rdonly buf access · 9efcc4ad
      Yonghong Song authored
      If the bpf program contains an out-of-bound access w.r.t. a
      particular map key/value size, the verification will still
      pass, i.e., it will be accepted by the verifier. But it will
      be rejected at link_create time. A test is added here to
      ensure the link_create failure does happen if an out-of-bound
      access happens.
        $ ./test_progs -n 4
        ...
        #4/23 rdonly-buf-out-of-bound:OK
        ...
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184124.591700-1-yhs@fb.com
    • bpf: Separate bpf_get_[stack|stackid] for perf events BPF · 7b04d6d6
      Song Liu authored
      Calling get_perf_callchain() on perf_events from PEBS entries may cause
      unwinder errors. To fix this issue, the callchain is fetched early. Such
      perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY.
      
      Similarly, calling bpf_get_[stack|stackid] on perf_events from PEBS may
      also cause unwinder errors. To fix this, add separate versions of these
      two helpers, bpf_get_[stack|stackid]_pe. These two helpers use the
      callchain in bpf_perf_event_data_kern->data->callchain.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com
    • selftests/bpf: Add a test for bpf sk_storage_map iterator · 3b1c420b
      Yonghong Song authored
      Added one test for bpf sk_storage_map_iterator.
        $ ./test_progs -n 4
        ...
        #4/22 bpf_sk_storage_map:OK
        ...
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184122.591591-1-yhs@fb.com
    • selftests/bpf: Add test for bpf array map iterators · 60dd49ea
      Yonghong Song authored
      Two subtests are added.
        $ ./test_progs -n 4
        ...
        #4/20 bpf_array_map:OK
        #4/21 bpf_percpu_array_map:OK
        ...
      
      The bpf_array_map subtest also tests the bpf program changing
      array element values and sending key/value pairs to user space
      through the bpf_seq_write() interface.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184121.591367-1-yhs@fb.com
    • selftests/bpf: Add test for bpf hash map iterators · 2a7c2fff
      Yonghong Song authored
      Two subtests are added.
        $ ./test_progs -n 4
        ...
        #4/18 bpf_hash_map:OK
        #4/19 bpf_percpu_hash_map:OK
        ...
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184120.590916-1-yhs@fb.com
    • tools/bpftool: Add bpftool support for bpf map element iterator · d8793aca
      Yonghong Song authored
      The optional parameter "map MAP" can be added to "bpftool iter"
      command to create a bpf iterator for map elements. For example,
        bpftool iter pin ./prog.o /sys/fs/bpf/p1 map id 333
      
      For a map element bpf iterator, the "map MAP" parameter is
      required. Otherwise, bpf link creation will return an error.
      
      Quentin Monnet kindly provided bash-completion implementation
      for new "map MAP" option.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184119.590799-1-yhs@fb.com
    • tools/libbpf: Add support for bpf map element iterator · cd31039a
      Yonghong Song authored
      Add map_fd to bpf_iter_attach_opts and flags to
      bpf_link_create_opts. Later on, bpftool or selftest
      will be able to create a bpf map element iterator
      by passing map_fd to the kernel during link
      creation time.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184117.590673-1-yhs@fb.com
    • bpf: Implement bpf iterator for sock local storage map · 5ce6e77c
      Yonghong Song authored
      The bpf iterator for the bpf sock local storage map is
      implemented. User space interacts with a sock local storage
      map using a socket fd as the key and the storage as the value.
      In the kernel, passing an fd to the bpf program does not
      really make sense; in this case, the sock itself is passed to
      the bpf program instead.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184116.590602-1-yhs@fb.com
    • bpf: Implement bpf iterator for array maps · d3cc2ab5
      Yonghong Song authored
      The bpf iterators for array and percpu array maps are
      implemented. Similar to hash maps, for a percpu array map the
      bpf program will receive values from all cpus.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184115.590532-1-yhs@fb.com
    • bpf: Implement bpf iterator for hash maps · d6c4503c
      Yonghong Song authored
      The bpf iterators for hash, percpu hash, lru hash
      and lru percpu hash are implemented. During link time,
      bpf_iter_reg->check_target() will check the map type
      and ensure the program's accessed key/value region is
      within the map-defined key/value size limit.
      
      For percpu hash and lru hash maps, the bpf program
      will receive values for all cpus. The map element
      bpf iterator infrastructure will prepare value
      properly before passing the value pointer to the
      bpf program.
      
      This patch set supports readonly map keys and
      read/write map values. It does not support deleting
      map elements, e.g., from hash tables. If there is
      a use case for this, the following mechanism can
      be used to support map deletion for hashtab, etc.:
        - permit a new bpf program return value, e.g., 2,
          to let bpf iterator know the map element should
          be removed.
        - since bucket lock is taken, the map element will be
          queued.
        - once bucket lock is released after all elements under
          this bucket are traversed, all to-be-deleted map
          elements can be deleted.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184114.590470-1-yhs@fb.com
    • bpf: Add bpf_prog iterator · a228a64f
      Alexei Starovoitov authored
      It's mostly a copy-paste of commit 6086d29d ("bpf: Add bpf_map iterator")
      that is used to implement bpf seq_file operations to traverse all bpf
      programs.
      
      v1->v2: Tweak to use build time btf_id
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Implement bpf iterator for map elements · a5cbe05a
      Yonghong Song authored
      The bpf iterator for map elements is implemented.
      The bpf program will receive four parameters:
        bpf_iter_meta *meta: the meta data
        bpf_map *map:        the bpf_map whose elements are traversed
        void *key:           the key of one element
        void *value:         the value of the same element
      
      Here, meta and map pointers are always valid, and
      key has register type PTR_TO_RDONLY_BUF_OR_NULL and
      value has register type PTR_TO_RDWR_BUF_OR_NULL.
      The kernel will track the access range of key and value
      during verification time. Later, these values will be compared
      against the values in the actual map to ensure all accesses
      are within range.
      
      A new field iter_seq_info is added to bpf_map_ops which
      is used to add map type specific information, i.e., seq_ops,
      init/fini seq_file func and seq_file private data size.
      Subsequent patches will have actual implementation
      for bpf_map_ops->iter_seq_info.
      
      In user space, BPF_ITER_LINK_MAP_FD needs to be
      specified in prog attr->link_create.flags, which indicates
      that attr->link_create.target_fd is a map_fd.
      The reason for such an explicit flag is for possible
      future cases where one bpf iterator may allow more than
      one possible customization, e.g., pid and cgroup id for
      task_file.
      
      The current kernel-internal implementation only allows the
      target to register at most one required bpf_iter_link_info.
      To support the above case, optional bpf_iter_link_info's are
      needed; the target can be extended to register such link
      infos, and the user-provided link_info needs to match one of
      the target-supported ones.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184112.590360-1-yhs@fb.com
    • bpf: Fix pos computation for bpf_iter seq_ops->start() · 3f9969f2
      Yonghong Song authored
      Currently, the pos pointer in the bpf iterator map/task/task_file
      seq_ops->start() is always incremented.
      This is incorrect. It should be incremented only if
      *pos is 0 (for SEQ_START_TOKEN), since these start()
      functions actually return the first real object.
      If *pos is not 0, start() merely finds the object
      based on the state in seq->private, without really
      advancing *pos. This patch fixes the issue
      by only incrementing *pos if it is 0.
      
      Note that the old *pos calculation, although not
      correct, does not affect correctness of bpf_iter
      as bpf_iter seq_file->read() does not support llseek.
      
      This patch also renamed "mid" in bpf_map iterator
      seq_file private data to "map_id" for better clarity.
      
      Fixes: 6086d29d ("bpf: Add bpf_map iterator")
      Fixes: eaaacd23 ("bpf: Add task and task/file iterator targets")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200722195156.4029817-1-yhs@fb.com
    • bpf: Support readonly/readwrite buffers in verifier · afbf21dc
      Yonghong Song authored
      Readonly and readwrite buffer register states are introduced.
      In total, four states,
      PTR_TO_RDONLY_BUF[_OR_NULL] and PTR_TO_RDWR_BUF[_OR_NULL],
      are supported. As suggested by their respective
      names, PTR_TO_RDONLY_BUF[_OR_NULL] are for
      readonly buffers and PTR_TO_RDWR_BUF[_OR_NULL]
      for read/write buffers.
      
      These new register states will be used by
      the later bpf map element iterator.

      The new register states share some similarity with
      PTR_TO_TP_BUFFER, as the accessed buffer size is
      calculated during verification time. The accessed buffer
      size will later be compared to other metrics at
      attach/link_create time.
      
      Similar to reg_state PTR_TO_BTF_ID_OR_NULL in bpf
      iterator programs, PTR_TO_RDONLY_BUF_OR_NULL or
      PTR_TO_RDWR_BUF_OR_NULL reg_types can be set at
      prog->aux->bpf_ctx_arg_aux, and bpf verifier will
      retrieve the values during btf_ctx_access().
      The later bpf map element iterator implementation
      will show how such information is assigned
      during target registration time.
      
      The verifier is also enhanced such that PTR_TO_RDONLY_BUF
      can be passed to ARG_PTR_TO_MEM[_OR_NULL] helper argument, and
      PTR_TO_RDWR_BUF can be passed to ARG_PTR_TO_MEM[_OR_NULL] or
      ARG_PTR_TO_UNINIT_MEM.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184111.590274-1-yhs@fb.com
    • selftests/bpf: Test BPF socket lookup and reuseport with connections · 86176a18
      Jakub Sitnicki authored
      Cover the case when BPF socket lookup returns a socket that belongs to a
      reuseport group, and the reuseport group contains connected UDP sockets.
      
      Ensure that the presence of connected UDP sockets in the reuseport
      group does not affect the socket lookup result. The socket selected
      by reuseport should always be used as the result in such a case.
      Signed-off-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Link: https://lore.kernel.org/bpf/20200722161720.940831-3-jakub@cloudflare.com
      86176a18
    • Yonghong Song's avatar
      bpf: Refactor to provide aux info to bpf_iter_init_seq_priv_t · f9c79272
      Yonghong Song authored
      This patch refactors the target bpf_iter_init_seq_priv_t
      callback function to accept additional information. This
      will be needed in later patches for map element targets,
      since the particular map whose elements are to be traversed
      must be passed in. In the future, other information, e.g.,
      pid, cgroup id, etc., may be passed to the target as well
      to customize the iterator.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com
      f9c79272
    • Yonghong Song's avatar
      bpf: Refactor bpf_iter_reg to have separate seq_info member · 14fc6bd6
      Yonghong Song authored
      This patch introduces no functional change.
      Struct bpf_iter_reg is used to register a bpf_iter target,
      and includes information for prog_load, link_create
      and seq_file creation.
      
      This patch puts the fields related to seq_file creation
      into a separate structure. This will be useful for the map
      element iterator, where one iterator covers different
      map types, and different map types may have different
      seq_ops, init/fini private_data functions and
      private_data sizes.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184109.590030-1-yhs@fb.com
      14fc6bd6
    • Jakub Sitnicki's avatar
      udp: Don't discard reuseport selection when group has connections · c8a2983c
      Jakub Sitnicki authored
      When the BPF socket lookup prog selects a socket that belongs
      to a reuseport group, and the reuseport group has connected
      sockets in it, the socket selected by reuseport is discarded,
      and the socket returned by BPF socket lookup is used instead.
      
      Modify this behavior so that the socket selected by reuseport running after
      BPF socket lookup always gets used. Ignore the fact that the reuseport
      group might have connections because it is only relevant when scoring
      sockets during regular hashtable-based lookup.
      
      Fixes: 72f7e944 ("udp: Run SK_LOOKUP BPF program on socket lookup")
      Fixes: 6d4201b1 ("udp6: Run SK_LOOKUP BPF program on socket lookup")
      Signed-off-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Link: https://lore.kernel.org/bpf/20200722161720.940831-2-jakub@cloudflare.com
      c8a2983c
    • Andrii Nakryiko's avatar
      tools/bpftool: Strip BPF .o files before skeleton generation · f3c93a93
      Andrii Nakryiko authored
      Strip away DWARF info from .bpf.o files, before generating BPF skeletons.
      This reduces bpftool binary size from 3.43MB to 2.58MB.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarQuentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200722043804.2373298-1-andriin@fb.com
      f3c93a93
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · a57066b1
      David S. Miller authored
      The UDP reuseport conflict was a little bit tricky.
      
      The net-next code, via bpf-next, extracted the reuseport handling
      into a helper so that the BPF sk lookup code could invoke it.
      
      At the same time, the logic for reuseport handling of unconnected
      sockets changed via commit efc6b6f6
      which changed the logic to carry on the reuseport result into the
      rest of the lookup loop if we do not return immediately.
      
      This requires moving the reuseport_has_conns() logic into the callers.
      
      While we are here, get rid of inline directives as they do not belong
      in foo.c files.
      
      The other changes were cases of more straightforward overlapping
      modifications.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a57066b1
  2. 25 Jul, 2020 8 commits
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.8-rc7' of... · 04300d66
      Linus Torvalds authored
      Merge tag 'riscv-for-linus-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux into master
      
      Pull RISC-V fixes from Palmer Dabbelt:
       "A few more fixes this week:
      
         - A fix to avoid using SBI calls during kasan initialization, as the
           SBI calls themselves have not been probed yet.
      
         - Three fixes related to systems with multiple memory regions"
      
      * tag 'riscv-for-linus-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Parse all memory blocks to remove unusable memory
        RISC-V: Do not rely on initrd_start/end computed during early dt parsing
        RISC-V: Set maximum number of mapped pages correctly
        riscv: kasan: use local_tlb_flush_all() to avoid uninitialized __sbi_rfence
      04300d66
    • Linus Torvalds's avatar
      Merge tag 'x86-urgent-2020-07-25' of... · fbe0d451
      Linus Torvalds authored
      Merge tag 'x86-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
      
      Pull x86 fixes from Ingo Molnar:
       "Misc fixes:
      
         - Fix a section end page alignment assumption that was causing
           crashes
      
         - Fix ORC unwinding on freshly forked tasks which haven't executed
           yet and which have empty user task stacks
      
         - Fix the debug.exception-trace=1 sysctl dumping of user stacks,
           which was broken by recent maccess changes"
      
      * tag 'x86-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/dumpstack: Dump user space code correctly again
        x86/stacktrace: Fix reliable check for empty user task stacks
        x86/unwind/orc: Fix ORC for newly forked tasks
        x86, vmlinux.lds: Page-align end of ..page_aligned sections
      fbe0d451
    • Linus Torvalds's avatar
      Merge tag 'perf-urgent-2020-07-25' of... · 78b1afe2
      Linus Torvalds authored
      Merge tag 'perf-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
      
      Pull uprobe fix from Ingo Molnar:
       "Fix an interaction/regression between uprobes based shared library
        tracing & GDB"
      
      * tag 'perf-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix GDB regression
      78b1afe2
    • Linus Torvalds's avatar
      Merge tag 'timers-urgent-2020-07-25' of... · a7b36c2b
      Linus Torvalds authored
      Merge tag 'timers-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
      
      Pull timer fix from Ingo Molnar:
       "Fix a suspend/resume regression (crash) on TI AM3/AM4 SoC's"
      
      * tag 'timers-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        clocksource/drivers/timer-ti-dm: Fix suspend and resume for am3 and am4
      a7b36c2b
    • Linus Torvalds's avatar
      Merge tag 'sched-urgent-2020-07-25' of... · 3077805e
      Linus Torvalds authored
      Merge tag 'sched-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
      
      Pull scheduler fixes from Ingo Molnar:
       "Fix a race introduced by the recent loadavg race fix, plus add a debug
        check for a hard to debug case of bogus wakeup function flags"
      
      * tag 'sched-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched: Warn if garbage is passed to default_wake_function()
        sched: Fix race against ptrace_freeze_trace()
      3077805e
    • Linus Torvalds's avatar
      Merge tag 'efi-urgent-2020-07-25' of... · 17baa442
      Linus Torvalds authored
      Merge tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
      
      Pull EFI fixes from Ingo Molnar:
       "Various EFI fixes:
      
         - Fix the layering violation in the use of the EFI runtime services
           availability mask in users of the 'efivars' abstraction
      
         - Revert build fix for GCC v4.8 which is no longer supported
      
         - Clean up some x86 EFI stub details, some of which are borderline
           bugs that copy around garbage into padding fields - let's fix these
           out of caution.
      
         - Fix build issues while working on RISC-V support
      
         - Avoid --whole-archive when linking the stub on arm64"
      
      * tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi: Revert "efi/x86: Fix build with gcc 4"
        efi/efivars: Expose RT service availability via efivars abstraction
        efi/libstub: Move the function prototypes to header file
        efi/libstub: Fix gcc error around __umoddi3 for 32 bit builds
        efi/libstub/arm64: link stub lib.a conditionally
        efi/x86: Only copy upto the end of setup_header
        efi/x86: Remove unused variables
      17baa442
    • Linus Torvalds's avatar
      Merge tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6 into master · 7cb3a5c5
      Linus Torvalds authored
      Pull cifs fix from Steve French:
       "A fix for a recently discovered regression in rename to older servers
        caused by a recent patch"
      
      * tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6:
        Revert "cifs: Fix the target file was deleted when rename failed."
      7cb3a5c5
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net into master · 1b64b2e2
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix RCU locking in iwlwifi, from Johannes Berg.
      
       2) mt76 can access uninitialized NAPI struct, from Felix Fietkau.
      
       3) Fix race in updating pause settings in bnxt_en, from Vasundhara
          Volam.
      
       4) Propagate error return properly during unbind failures in ax88172a,
          from George Kennedy.
      
       5) Fix memleak in adf7242_probe, from Liu Jian.
      
       6) smc_drv_probe() can leak, from Wang Hai.
      
       7) Don't muck with the carrier state if register_netdevice() fails in
          the bonding driver, from Taehee Yoo.
      
       8) Fix memleak in dpaa_eth_probe, from Liu Jian.
      
       9) Need to check skb_put_padto() return value in hsr_fill_tag(), from
          Murali Karicheri.
      
      10) Don't lose ionic RSS hash settings across FW update, from Shannon
          Nelson.
      
      11) Fix clobbered SKB control block in act_ct, from Wen Xu.
      
      12) Missing newline in "tx_timeout" sysfs output, from Xiongfeng Wang.
      
      13) An IS_UDPLITE cleanup from a long time ago incorrectly handled
          transformations involving UDPLITE_RECV_CC. From Miaohe Lin.
      
      14) Unbalanced locking in netdevsim, from Taehee Yoo.
      
      15) Suppress false-positive error messages in qed driver, from Alexander
          Lobakin.
      
      16) Out of bounds read in ax25_connect and ax25_sendmsg, from Peilin Ye.
      
      17) Missing SKB release in cxgb4's uld_send(), from Navid Emamdoost.
      
      18) Uninitialized value in geneve_changelink(), from Cong Wang.
      
      19) Fix deadlock in xen-netfront, from Andrea Righi.
      
      20) flush_backlog() frees skbs with IRQs disabled, so should use
          dev_kfree_skb_irq() instead of kfree_skb(). From Subash Abhinov
          Kasiviswanathan.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
        drivers/net/wan: lapb: Corrected the usage of skb_cow
        dev: Defer free of skbs in flush_backlog
        qrtr: orphan socket in qrtr_release()
        xen-netfront: fix potential deadlock in xennet_remove()
        flow_offload: Move rhashtable inclusion to the source file
        geneve: fix an uninitialized value in geneve_changelink()
        bonding: check return value of register_netdevice() in bond_newlink()
        tcp: allow at most one TLP probe per flight
        AX.25: Prevent integer overflows in connect and sendmsg
        cxgb4: add missing release on skb in uld_send()
        net: atlantic: fix PTP on AQC10X
        AX.25: Prevent out-of-bounds read in ax25_sendmsg()
        sctp: shrink stream outq when fails to do addstream reconf
        sctp: shrink stream outq only when new outcnt < old outcnt
        AX.25: Fix out-of-bounds read in ax25_connect()
        enetc: Remove the mdio bus on PF probe bailout
        net: ethernet: ti: add NETIF_F_HW_TC hw feature flag for taprio offload
        net: ethernet: ave: Fix error returns in ave_init
        drivers/net/wan/x25_asy: Fix to make it work
        ipvs: fix the connection sync failed in some cases
        ...
      1b64b2e2