1. 25 May, 2019 2 commits
    • bpf: verifier: mark patched-insn with sub-register zext flag · b325fbca
      Jiong Wang authored
      Patched insns do not go through generic verification, and therefore don't
      have zero extension information collected during insn walking.
      
      We don't bother analyzing them at the moment; for any sub-register def
      that comes from them, just conservatively mark it as needing zero extension.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: verifier: mark verified-insn with sub-register zext flag · 5327ed3d
      Jiong Wang authored
      The eBPF ISA specification requires the high 32 bits to be cleared when the
      low 32-bit sub-register is written. This applies to the destination register
      of ALU32 etc. JIT back-ends must guarantee this semantic when doing code-gen.
      The x86_64 and AArch64 ISAs have the same semantics, so the corresponding
      JIT back-ends don't need to do extra work.
      
      However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches
      (PowerPC, SPARC etc) need to do explicit zero extension to meet this
      requirement, otherwise code like the following will fail.
      
        u64_value = (u64) u32_value
        ... other uses of u64_value
      
      This is because the compiler could exploit the semantic described above and
      omit the zero extensions when extending u32_value to u64_value. These JIT
      back-ends are expected to guarantee the semantic by inserting extra zero
      extensions, which could significantly increase code size. Some benchmarks
      show that sub-register writes can make up ~40% of total insns, meaning at
      least ~40% extra code-gen.
      
      One observation is that these extra zero extensions are not always necessary.
      Taking the above code snippet as an example, it is possible that u32_value
      will never be cast to a u64; the high 32 bits of u32_value can then be
      ignored and the extra zero extension eliminated.
      
      This patch implements this idea: insns defining sub-registers will be
      marked when the high 32 bits of the defined sub-register matter. For
      unmarked insns, it is safe to eliminate the high 32-bit clearance.
      
      Algo:
       - Split read flags into READ32 and READ64.
      
       - Record index of insn that does sub-register write. Keep the index inside
         reg state and update it during verifier insn walking.
      
       - A full register read on a sub-register marks its definition insn as
         needing zero extension on dst register.
      
         A new sub-register write overrides the old one.
      
       - When propagating read64 during path pruning, also mark any insn defining
         a sub-register that is read in the pruned path as full-register.
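
The marking steps listed above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct toy_verifier` and the `toy_*` functions are hypothetical names, not the kernel's data structures, and the sketch leaves out the read64 propagation done during path pruning.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS  11   /* eBPF has registers R0..R10 */
#define MAX_INSNS 16   /* arbitrary size for this toy model */
#define NO_DEF    (-1)

/* Hypothetical toy state; not the kernel's register state. */
struct toy_verifier {
	int  subreg_def[NUM_REGS]; /* insn index of the last 32-bit write, or NO_DEF */
	bool zext_dst[MAX_INSNS];  /* true if insn i must zero-extend its dst */
};

static void toy_init(struct toy_verifier *v)
{
	for (int r = 0; r < NUM_REGS; r++)
		v->subreg_def[r] = NO_DEF;
	for (int i = 0; i < MAX_INSNS; i++)
		v->zext_dst[i] = false;
}

/* A sub-register write records its insn index; a new write overrides the old one. */
static void toy_write32(struct toy_verifier *v, int insn_idx, int reg)
{
	assert(insn_idx >= 0 && insn_idx < MAX_INSNS);
	v->subreg_def[reg] = insn_idx;
}

/* A full 64-bit write defines the whole register, so no sub-register def remains. */
static void toy_write64(struct toy_verifier *v, int reg)
{
	v->subreg_def[reg] = NO_DEF;
}

/* A full-register read of a sub-register def marks the defining insn. */
static void toy_read64(struct toy_verifier *v, int reg)
{
	int def = v->subreg_def[reg];

	if (def != NO_DEF)
		v->zext_dst[def] = true;
}

/* A 32-bit read never observes the high half, so nothing is marked. */
static void toy_read32(struct toy_verifier *v, int reg)
{
	(void)v;
	(void)reg;
}
```

In this model, an insn whose result is only ever consumed through `toy_read32()` stays unmarked, which corresponds to the case where a JIT back-end could skip the explicit zero extension.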
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 24 May, 2019 19 commits
    • Merge branch 'bpf-send-sig' · a08acd11
      Daniel Borkmann authored
      Yonghong Song says:
      
      ====================
      This patch tries to solve the following specific use case.
      
      Currently, a bpf program can already collect stack traces
      through the kernel function get_perf_callchain()
      when certain events happen (e.g., cache miss counter or
      cpu clock counter overflows). But such stack traces are
      not enough for jitted programs, e.g., hhvm (jitted php).
      To get a real stack trace, jit engine internal data structures
      need to be traversed in order to get the real user functions.
      
      bpf program itself may not be the best place to traverse
      the jit engine as the traversing logic could be complex and
      it is not a stable interface either.
      
      Instead, hhvm implements a signal handler,
      e.g. for SIGALRM, and a set of program locations at which
      it can dump stack traces. When it receives a signal, it will
      dump the stack at the next such program location.
      
      This patch implements bpf_send_signal() helper to send
      a signal to hhvm in real time, resulting in intended stack traces.
      
      Patch #1 implemented the bpf_send_signal() helper in the kernel.
      Patch #2 synced uapi header bpf.h to tools directory.
      Patch #3 added a self test which covers tracepoint
      and perf_event bpf programs.
      
      Changelogs:
        v4 => v5:
          . pass the "current" task struct to irq_work as well
            since the current task struct may change between
            nmi and subsequent irq_work_interrupt.
            Discovered by Daniel.
        v3 => v4:
          . fix one typo and declare "const char *id_path = ..."
            to avoid directly using the long string in the func body
            in Patch #3.
        v2 => v3:
          . change the standalone test to be part of prog_tests.
        RFC v1 => v2:
          . the previous version allowed sending a signal to an arbitrary
            pid. This version just sends the signal to the current
            task to avoid unstable pids and potential races between
            sending signals and task state changes for the pid.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tools/bpf: add selftest in test_progs for bpf_send_signal() helper · 16f0efc3
      Yonghong Song authored
      The test covered both nmi and tracepoint perf events.
        $ ./test_progs
        ...
        test_send_signal_tracepoint:PASS:tracepoint 0 nsec
        ...
        test_send_signal_common:PASS:tracepoint 0 nsec
        ...
        test_send_signal_common:PASS:perf_event 0 nsec
        ...
        test_send_signal:OK
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tools/bpf: sync bpf uapi header bpf.h to tools directory · edaccf89
      Yonghong Song authored
      The bpf uapi header include/uapi/linux/bpf.h is sync'ed
      to tools/include/uapi/linux/bpf.h.
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: implement bpf_send_signal() helper · 8b401f9e
      Yonghong Song authored
      This patch tries to solve the following specific use case.
      
      Currently, a bpf program can already collect stack traces
      through the kernel function get_perf_callchain()
      when certain events happen (e.g., cache miss counter or
      cpu clock counter overflows). But such stack traces are
      not enough for jitted programs, e.g., hhvm (jitted php).
      To get a real stack trace, jit engine internal data structures
      need to be traversed in order to get the real user functions.
      
      bpf program itself may not be the best place to traverse
      the jit engine as the traversing logic could be complex and
      it is not a stable interface either.
      
      Instead, hhvm implements a signal handler,
      e.g. for SIGALRM, and a set of program locations at which
      it can dump stack traces. When it receives a signal, it will
      dump the stack at the next such program location.
      
      Such a mechanism can be implemented in the following way:
        . a perf ring buffer is created between the bpf program
          and the tracing app.
        . once a particular event happens, the bpf program writes
          to the ring buffer and the tracing app gets notified.
        . the tracing app sends a SIGALRM signal to hhvm.
      
      But this method could incur large delays and skew the profiling
      results.
      
      This patch implements bpf_send_signal() helper to send
      a signal to hhvm in real time, resulting in intended stack traces.
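
The receiving side described above (a signal handler plus designated dump locations) can be sketched in userspace C. This is a minimal illustration, not hhvm's implementation: `on_alarm`, `install_handler`, `safe_point` and the `dumps_done` counter are hypothetical names, and the "dump" is reduced to incrementing a counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set by the handler, consumed at the next safe point. */
static volatile sig_atomic_t dump_requested;
/* Counts simulated stack dumps; a real runtime would walk its JIT metadata. */
static int dumps_done;

/* The handler only records the request; it does no real work itself. */
static void on_alarm(int signo)
{
	(void)signo;
	dump_requested = 1;
}

/* Installs the SIGALRM handler. Returns 0 on success, -1 on error. */
static int install_handler(void)
{
	struct sigaction sa = {0};

	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGALRM, &sa, NULL);
}

/* Called at program locations where dumping a stack trace is safe. */
static void safe_point(void)
{
	if (dump_requested) {
		dump_requested = 0;
		dumps_done++;
	}
}
```

The handler defers the actual dump to `safe_point()` because very little is safe to do inside a signal handler; the sooner the signal arrives after the sampled event, the closer the dumped stack is to the code that triggered it.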
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'btf2c-converter' · 5420f320
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      This patch set adds BTF-to-C dumping APIs to libbpf, making it possible to
      output a subset of BTF types as compilable C type definitions. This is useful
      by itself, as raw BTF output is not easy to inspect and comprehend. But it's
      also a big part of the BPF CO-RE (compile once - run everywhere) initiative,
      aimed at making it possible to write relocatable BPF programs that won't
      require on-the-host kernel headers (and would be able to inspect internal
      kernel structures not exposed through kernel headers).
      
      This patch set consists of three groups of patches and one pre-patch, with the
      BTF-to-C dumper API depending on the first two groups.
      
      Pre-patch #1 fixes issue with libbpf_internal.h.
      
      btf__parse_elf() API patches:
      - patch #2 adds btf__parse_elf() API to libbpf, which loads BTF and/or
        BTF.ext from an ELF file;
      - patch #3 utilizes btf__parse_elf() from bpftool for `btf dump file` command;
      - patch #4 switches test_btf.c to use btf__parse_elf() to check for presence
        of BTF data in object file.
      
      libbpf's internal hashmap patches:
      - patch #5 adds resizable non-thread safe generic hashmap to libbpf;
      - patch #6 adds tests for that hashmap;
      - patch #7 migrates btf_dedup()'s dedup_table to use hashmap w/ APPEND.
      
      BTF-to-C dumper API patches:
      - patch #8 adds btf_dump APIs with all the logic for laying out type
        definitions in correct order and emitting C syntax for them;
      - patch #9 adds lots of tests for common and quirky parts of C type system;
      - patch #10 adds support for C-syntax btf dumping to bpftool;
      - patch #11 updates bpftool documentation to mention C-syntax dump option;
      - patch #12 updates bash-completion for btf dump sub-command.
      
      v2->v3:
      - fix bpftool-btf.rst formatting (Quentin);
      - simplify bash autocompletion script (Quentin);
      - better error message in btf dump (Quentin);
      
      v1->v2:
      - removed unuseful file header (Jakub);
      - removed inlines in .c (Jakub);
      - added 'format {c|raw}' keyword/option (Jakub);
      - re-use i var for iteration in btf_dump_c() (Jakub);
      - bumped libbpf version to 0.0.4;
      
      v0->v1:
      - fix bug in hashmap__for_each_bucket_entry() not handling empty hashmap;
      - removed `btf dump`-specific libbpf logging hook up (Quentin has more generic
        patchset);
      - change btf__parse_elf() to always load .BTF and return it as a result, with
        .BTF.ext being optional and returned through struct btf_ext** arg (Alexei);
      - endianness check to use __BYTE_ORDER__ (Alexei);
      - bool:1 to __u8:1 in type_aux_state (Alexei);
      - added HASHMAP_APPEND strategy to hashmap, changed
        hashmap__for_each_key_entry() to also check for key equality during
        iteration (multimap iteration for key);
      - added new tests for empty hashmap and hashmap as a multimap;
      - tried to clarify weak/strong dependency ordering comments (Alexei);
      - btf dump test's expected output - support better commenting approach (Alexei);
      - added bash-completion for a new "c" option (Alexei).
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: update bash-completion w/ new c option for btf dump · 90eea408
      Andrii Nakryiko authored
      Add bash completion for new C btf dump option.
      
      Cc: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool/docs: add description of btf dump C option · 220ba451
      Andrii Nakryiko authored
      Document optional **c** option for btf dump subcommand.
      
      Cc: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: add C output format option to btf dump subcommand · 2119f218
      Andrii Nakryiko authored
      Utilize libbpf's new btf_dump API to emit BTF as C definitions.
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add btf_dump BTF-to-C conversion tests · 2d2a3ad8
      Andrii Nakryiko authored
      Add new test_btf_dump set of tests, validating BTF-to-C conversion
      correctness. Tests rely on clang to generate BTF from provided C test
      cases.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: add btf_dump API for BTF-to-C conversion · 351131b5
      Andrii Nakryiko authored
      BTF contains enough type information to allow generating a valid
      compilable C header w/ correct layout of structs/unions and all the
      typedef/enum definitions. This patch adds a new "object", btf_dump, to
      facilitate dumping BTF as valid C. btf_dump__dump_type() is the main API,
      which takes care of dumping out (through a user-provided printf-like
      callback function) C definitions for a given type ID and its required
      dependencies. This allows for not just dumping out the entirety of BTF types,
      but also selective filtering based on user-provided criteria w/ a minimal
      set of dependent types.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: switch btf_dedup() to hashmap for dedup table · 2fc3fc0b
      Andrii Nakryiko authored
      Utilize libbpf's hashmap as a multimap for dedup_table implementation.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add tests for libbpf's hashmap · 5d04ec68
      Andrii Nakryiko authored
      Test all APIs for internal hashmap implementation.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: add resizable non-thread safe internal hashmap · e3b92422
      Andrii Nakryiko authored
      There is a need for fast point lookups inside libbpf for multiple use
      cases (e.g., name resolution for BTF-to-C conversion, by-name lookups in
      BTF for upcoming BPF CO-RE relocation support, etc). This patch
      implements a simple resizable non-thread safe hashmap using singly linked
      list chains.
      
      Four different insert strategies are supported:
       - HASHMAP_ADD - only add key/value if key doesn't exist yet;
       - HASHMAP_SET - add key/value pair if key doesn't exist yet; otherwise,
         update value;
       - HASHMAP_UPDATE - update value, if key already exists; otherwise, do
         nothing and return -ENOENT;
       - HASHMAP_APPEND - always add key/value pair, even if key already exists.
         This turns hashmap into a multimap by allowing multiple values to be
         associated with the same key. Most useful read API for such hashmap is
         hashmap__for_each_key_entry() iteration. If hashmap__find() is still
         used, it will return last inserted key/value entry (first in a bucket
         chain).
      
      For HASHMAP_SET and HASHMAP_UPDATE, old key/value pair is returned, so
      that calling code can handle proper memory management, if necessary.
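
The four insert strategies can be illustrated with a toy chained hashmap. This is a sketch under stated assumptions, not libbpf's internal API: the `toy_*` names, the fixed bucket count (no resizing), the `long` keys and values, and the `-EEXIST` return for a rejected add are all choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

enum toy_strategy { TOY_ADD, TOY_SET, TOY_UPDATE, TOY_APPEND };

struct toy_entry {
	long key;
	long value;
	struct toy_entry *next; /* singly linked bucket chain */
};

#define TOY_BUCKETS 8 /* fixed size: this sketch does not resize */

struct toy_map {
	struct toy_entry *buckets[TOY_BUCKETS];
};

/* Returns the first entry in the chain matching key, or NULL. */
static struct toy_entry *toy_find(struct toy_map *m, long key)
{
	struct toy_entry *e = m->buckets[(unsigned long)key % TOY_BUCKETS];

	for (; e; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}

/*
 * Returns 0 on success or a negative errno. When an existing value is
 * overwritten and old_value is non-NULL, the previous value is stored there.
 */
static int toy_insert(struct toy_map *m, long key, long value,
		      enum toy_strategy strategy, long *old_value)
{
	struct toy_entry *e = toy_find(m, key);
	struct toy_entry **head;

	if (strategy != TOY_APPEND && e) {
		if (strategy == TOY_ADD)
			return -EEXIST;   /* add only if the key is absent */
		if (old_value)
			*old_value = e->value;
		e->value = value;         /* SET and UPDATE overwrite */
		return 0;
	}
	if (strategy == TOY_UPDATE)
		return -ENOENT;           /* UPDATE never creates entries */

	e = malloc(sizeof(*e));
	if (!e)
		return -ENOMEM;
	head = &m->buckets[(unsigned long)key % TOY_BUCKETS];
	e->key = key;
	e->value = value;
	e->next = *head;                  /* newest entry goes first in the chain */
	*head = e;
	return 0;
}

/* Counts entries stored under key (more than one only after APPEND). */
static int toy_count_key(struct toy_map *m, long key)
{
	struct toy_entry *e = m->buckets[(unsigned long)key % TOY_BUCKETS];
	int n = 0;

	for (; e; e = e->next)
		n += (e->key == key);
	return n;
}

static void toy_free(struct toy_map *m)
{
	for (int i = 0; i < TOY_BUCKETS; i++) {
		while (m->buckets[i]) {
			struct toy_entry *e = m->buckets[i];

			m->buckets[i] = e->next;
			free(e);
		}
	}
}
```

Because new entries are pushed at the head of the chain, a plain `toy_find()` after an append returns the most recently inserted entry, which mirrors the "last inserted (first in a bucket chain)" behaviour described above.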
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: use btf__parse_elf to check presence of BTF/BTF.ext · 9db32431
      Andrii Nakryiko authored
      Switch test_btf.c to rely on btf__parse_elf to check presence of BTF and
      BTF.ext data, instead of implementing its own ELF parsing.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: use libbpf's btf__parse_elf API · 58650cc4
      Andrii Nakryiko authored
      Use btf__parse_elf() API, provided by libbpf, instead of implementing
      ELF parsing by itself.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: add btf__parse_elf API to load .BTF and .BTF.ext · e6c64855
      Andrii Nakryiko authored
      Loading BTF and BTF.ext from an ELF file is a common need. Instead of
      requiring every user to re-implement it, let's provide this API from
      libbpf itself. It's mostly copy/paste from the `bpftool btf dump`
      implementation, which will be switched to libbpf's version in the next
      patch. btf__parse_elf loads BTF and, optionally, BTF.ext.
      This is also useful for tests that need to load/work with BTF loaded
      from test ELF files.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: ensure libbpf.h is included along libbpf_internal.h · 1d7a08b3
      Andrii Nakryiko authored
      libbpf_internal.h expects a bunch of stuff defined in libbpf.h to be
      defined. This patch makes sure that libbpf.h is always included.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • samples: bpf: Do not define bpf_printk macro · c87f60a7
      Michal Rostecki authored
      The bpf_printk macro was moved to bpf_helpers.h which is included in all
      example programs.
      Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests: bpf: Move bpf_printk to bpf_helpers.h · 37739d1b
      Michal Rostecki authored
      bpf_printk is a macro which is commonly used to print out debug messages
      in BPF programs and it was copied in many selftests and samples. Since
      all of them include bpf_helpers.h, this change moves the macro there.
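
For readers unfamiliar with the macro, the sketch below shows the general shape of a `bpf_printk`-style wrapper. It is an approximation written for userspace: `bpf_trace_printk` here is a stand-in built on `vprintf` so the example can run outside the kernel, and the macro body is reproduced from memory rather than copied from bpf_helpers.h. It relies on GNU C extensions (statement expressions and `##__VA_ARGS__`).

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Userspace stand-in for the kernel helper; it prints to stdout instead. */
static int bpf_trace_printk(const char *fmt, int fmt_size, ...)
{
	va_list args;
	int n;

	/* The helper takes the format's size, including the trailing NUL. */
	assert(fmt_size > 0 && fmt[fmt_size - 1] == '\0');
	va_start(args, fmt_size);
	n = vprintf(fmt, args);
	va_end(args);
	return n;
}

/*
 * Copies the format string into a local array so that both the string and
 * its size can be passed to the helper, then forwards any extra arguments.
 */
#define bpf_printk(fmt, ...)                                         \
({                                                                   \
	char ____fmt[] = fmt;                                        \
	bpf_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__);   \
})
```

A call such as `bpf_printk("pid=%d\n", 42)` expands to a single helper call; having one shared definition avoids the per-file copies the commit message mentions.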
      Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 23 May, 2019 19 commits
    • Merge branch 'bpf-explored-states' · 5762a20b
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      Convert explored_states array into hash table and use simple hash
      to reduce verifier peak memory consumption for programs with bpf2bpf
      calls. More details in patch 3.
      
      v1->v2: fixed Jakub's small nit in patch 1
      ====================
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: convert explored_states to hash table · dc2a4ebc
      Alexei Starovoitov authored
      All prune points inside a callee bpf function most likely will have
      different callsites. For example, if function foo() is called from
      two callsites, half of the explored states in all prune points in foo()
      will be useless for subsequent walking of one of those callsites.
      Fortunately, the explored_states pruning heuristics keep the number of states
      per prune point small, but walking these states is still a waste of cpu
      time when the callsite of the current state is different from the callsite
      of the explored state.
      
      To improve the pruning logic, convert explored_states into a hash table and
      use a simple insn_idx ^ callsite hash to select the hash bucket.
      This optimization has no effect on programs without bpf2bpf calls
      and drastically improves programs with calls.
      In the latter case it reduces total memory consumption in 1M scale tests
      by almost 3 times (peak_states drops from 5752 to 2016).
      
      Care should be taken when comparing the states for equivalency.
      Since the same hash bucket can now contain states with different indices,
      the insn_idx has to be part of verifier_state and compared.
      
      Different hash table sizes and different hash functions were explored,
      but the results were not significantly better vs this patch.
      They can be improved in the future.
      
      Hit/miss heuristic is not counting index miscompare as a miss.
      Otherwise verifier stats become unstable when experimenting
      with different hash functions.
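
The bucket selection and the extra index comparison can be sketched as follows. The table size and the `toy_*` names are assumptions made for this illustration; the kernel's actual sizing and state comparison are more involved.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HTAB_SIZE 64u /* hypothetical table size */

/* Only the two fields relevant to bucket selection are modelled. */
struct toy_state {
	uint32_t insn_idx; /* kept in the state so colliding entries can be told apart */
	uint32_t callsite;
};

/* Simple insn_idx ^ callsite hash, reduced to a bucket index. */
static uint32_t toy_bucket(uint32_t insn_idx, uint32_t callsite)
{
	return (insn_idx ^ callsite) % TOY_HTAB_SIZE;
}

/*
 * A bucket may hold states for different instructions, so a stored state
 * is only worth comparing further if it is at the same insn index.
 */
static int toy_is_candidate(const struct toy_state *explored,
			    const struct toy_state *cur)
{
	return explored->insn_idx == cur->insn_idx;
}
```

With these definitions, a prune point at insn 10 reached from callsites 3 and 40 lands in buckets 9 and 34 respectively, so states from one callsite are no longer walked while verifying the other. A state at insn 9 with callsite 0 also hashes to bucket 9, which is why the index check is needed.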
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: split explored_states · a8f500af
      Alexei Starovoitov authored
      Split explored_states into a prune_point boolean mark
      and a linked list of explored states.
      This removes the STATE_LIST_MARK hack and allows marks to be separate from states.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: cleanup explored_states · 5d839021
      Alexei Starovoitov authored
      Clean up explored_states to prep for the introduction of a hashtable.
      No functional changes.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'bpf-jmp-seq-limit' · 29c677c8
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      Patch 1 - jmp sequence limit
      Patch 2 - improve existing tests
      Patch 3 - add pyperf-based realistic bpf program that takes
                advantage of higher limit and use it as a stress test
      
      v1->v2: fixed nit in patch 3. added Andrii's acks
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: add pyperf scale test · 7c944106
      Alexei Starovoitov authored
      Add a snippet of pyperf bpf program used to collect python stack traces
      as a scale test for the verifier.
      
      At 189 loop iterations llvm 9.0 starts ignoring '#pragma unroll'
      and generates partially unrolled loop instead.
      Hence use 50, 100, and 180 loop iterations to stress test.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: adjust verifier scale test · 7c0c6095
      Alexei Starovoitov authored
      Adjust scale tests to check for new jmp sequence limit.
      
      BPF_JGT had to be changed to BPF_JEQ because the verifier was
      too smart. It tracked the known safe range of R0 values
      and pruned the search before hitting the exact 8192 limit.
      bpf_semi_rand_get() was too (un)?lucky.
      
      k = 0; was missing in bpf_fill_scale2.
      It was testing a bit shorter sequence of jumps than intended.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: bump jmp sequence limit · b285fcb7
      Alexei Starovoitov authored
      The limit of 1024 subsequent jumps was causing otherwise valid
      programs to be rejected. Bump it to 8192 and make the error more verbose.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: emit diff of mismatched public API, if any · 9efc7794
      Andrii Nakryiko authored
      It's easy to have a mismatch of "intended to be public" vs really
      exposed API functions. While the Makefile does check for this mismatch, if
      it actually occurs it's not trivial to determine which functions are
      accidentally exposed. This patch dumps out a diff showing what's not
      supposed to be exposed, making it easier to fix.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • hv_sock: perf: loop in send() to maximize bandwidth · 14a1eaa8
      Sunil Muthuswamy authored
      Currently, the hv_sock send() iterates once over the buffer, puts data into
      the VMBUS channel and returns. It doesn't take advantage of the case where
      there is a simultaneous reader draining data from the channel. In such a
      case, send() can maximize the bandwidth (and consequently minimize the cpu
      cycles) by iterating until the channel is found to be full.
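
The difference between the single-pass and looping send can be sketched with a toy channel model. Everything here is hypothetical: `struct toy_channel`, the byte-counting write, and the `drain_per_iter` parameter that stands in for a concurrent reader are simplifications for illustration, not the hv_sock code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a fixed-size ring buffer between writer and reader. */
struct toy_channel {
	size_t capacity;
	size_t used;
};

/* Writes as many bytes as fit and returns how many were accepted. */
static size_t toy_channel_write(struct toy_channel *ch, size_t len)
{
	size_t room = ch->capacity - ch->used;
	size_t n = len < room ? len : room;

	ch->used += n;
	return n;
}

/* Old behaviour: one pass over the buffer, then return. */
static size_t send_once(struct toy_channel *ch, size_t len)
{
	return toy_channel_write(ch, len);
}

/*
 * New behaviour: keep writing until everything is sent or the channel is
 * found to be full. drain_per_iter models a reader that consumes that many
 * bytes while the writer is busy.
 */
static size_t send_loop(struct toy_channel *ch, size_t len, size_t drain_per_iter)
{
	size_t sent = 0;

	while (sent < len) {
		size_t n = toy_channel_write(ch, len - sent);

		if (n == 0)
			break; /* channel full: return what was sent so far */
		sent += n;
		/* Simulated concurrent reader frees up some space. */
		ch->used -= drain_per_iter < ch->used ? drain_per_iter : ch->used;
	}
	return sent;
}
```

With a 16-byte channel and a 40-byte buffer, `send_once()` accepts 16 bytes and returns, while `send_loop()` with a reader draining 8 bytes per iteration delivers all 40 in one call, saving the caller several round trips through send().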
      
      Perf data:
      Total Data Transfer: 10GB/iteration
      Single threaded reader/writer, Linux hvsocket writer with Windows hvsocket
      reader
      Packet size: 64KB
      CPU sys time was captured using the 'time' command for the writer to send
      10GB of data.
      'Send Buffer Loop' is with the patch applied.
      The values below are over 10 iterations.
      
      |--------------------------------------------------------|
      |        |        Current        |   Send Buffer Loop    |
      |--------------------------------------------------------|
      |        | Throughput | CPU sys  | Throughput | CPU sys  |
      |        | (MB/s)     | time (s) | (MB/s)     | time (s) |
      |--------------------------------------------------------|
      | Min    |     407    |   7.048  |    401     |  5.958   |
      |--------------------------------------------------------|
      | Max    |     455    |   7.563  |    542     |  6.993   |
      |--------------------------------------------------------|
      | Avg    |     440    |   7.411  |    451     |  6.639   |
      |--------------------------------------------------------|
      | Median |     446    |   7.417  |    447     |  6.761   |
      |--------------------------------------------------------|
      
      Observation:
      1. The avg throughput doesn't really change much with this change for this
      scenario. This is most probably because the bottleneck on throughput is
      somewhere else.
      2. The average system (or kernel) cpu time goes down by 10%+ with this
      change, for the same amount of data transfer.
      Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
      Reviewed-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hv_sock: perf: Allow the socket buffer size options to influence the actual socket buffers · ac383f58
      Sunil Muthuswamy authored
      Currently, the hv_sock buffer size is static and can't scale to the
      bandwidth requirements of the application. This change allows the
      applications to influence the socket buffer sizes using the SO_SNDBUF and
      the SO_RCVBUF socket options.
      
      A few interesting points to note:
      1. Since the VMBUS does not allow a resize operation of the ring size, the
      socket buffer size option should be set prior to establishing the
      connection for it to take effect.
      2. Setting the socket option comes with the cost of that much memory being
      reserved/allocated by the kernel, for the lifetime of the connection.
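
Point 1 means the options have to be applied between socket() and connect(). A minimal sketch of that ordering, using only the standard sockets API: `request_buffers` and `effective_sndbuf` are hypothetical helper names, and the caller is assumed to create the socket with whatever address family its transport uses.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Requests send/receive buffer sizes on fd. Call this before connect(),
 * since the underlying ring cannot be resized afterwards.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int request_buffers(int fd, int sndbuf, int rcvbuf)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
		return -1;
	return 0;
}

/* Reads back the send buffer size the kernel actually applied, or -1. */
static int effective_sndbuf(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) < 0)
		return -1;
	return val;
}
```

The value read back may differ from the one requested, because the kernel can adjust or clamp it; checking it is a cheap way to see how much memory the connection will reserve.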
      
      Perf data:
      Total Data Transfer: 1GB
      Single threaded reader/writer
      Results below are summarized over 10 iterations.
      
      Linux hvsocket writer + Windows hvsocket reader:
      |---------------------------------------------------------------------------------------------|
      |Packet size ->   |      128B       |       1KB       |       4KB       |        64KB         |
      |---------------------------------------------------------------------------------------------|
      |SO_SNDBUF size | |                 Throughput in MB/s (min/max/avg/median):                  |
      |               v |                                                                           |
      |---------------------------------------------------------------------------------------------|
      |      Default    | 109/118/114/116 | 636/774/701/700 | 435/507/480/476 |   410/491/462/470   |
      |      16KB       | 110/116/112/111 | 575/705/662/671 | 749/900/854/869 |   592/824/692/676   |
      |      32KB       | 108/120/115/115 | 703/823/767/772 | 718/878/850/866 | 1593/2124/2000/2085 |
      |      64KB       | 108/119/114/114 | 592/732/683/688 | 805/934/903/911 | 1784/1943/1862/1843 |
      |---------------------------------------------------------------------------------------------|
      
      Windows hvsocket writer + Linux hvsocket reader:
      |---------------------------------------------------------------------------------------------|
      |Packet size ->   |     128B    |      1KB        |          4KB        |        64KB         |
      |---------------------------------------------------------------------------------------------|
      |SO_RCVBUF size | |               Throughput in MB/s (min/max/avg/median):                    |
      |               v |                                                                           |
      |---------------------------------------------------------------------------------------------|
      |      Default    | 69/82/75/73 | 313/343/333/336 |   418/477/446/445   |   659/701/676/678   |
      |      16KB       | 69/83/76/77 | 350/401/375/382 |   506/548/517/516   |   602/624/615/615   |
      |      32KB       | 62/83/73/73 | 471/529/496/494 |   830/1046/935/939  | 944/1180/1070/1100  |
      |      64KB       | 64/70/68/69 | 467/533/501/497 | 1260/1590/1430/1431 | 1605/1819/1670/1660 |
      |---------------------------------------------------------------------------------------------|
      Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
      Reviewed-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac383f58
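The SO_SNDBUF and SO_RCVBUF rows in the tables above correspond to buffer sizes requested by the application through setsockopt(). A minimal sketch of such a request follows; the helper names set_sndbuf_size and set_rcvbuf_size are hypothetical, and the code works on any socket descriptor rather than being specific to hvsocket.

```c
#include <sys/socket.h>

/*
 * Request a send-buffer size on an already-created socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 * set_sndbuf_size is a hypothetical helper name, not a library function.
 */
static int set_sndbuf_size(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes));
}

/* Same idea for the receive side (the SO_RCVBUF rows of the second table). */
static int set_rcvbuf_size(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```

A caller would typically invoke one of these right after socket() and before connecting, for example set_sndbuf_size(fd, 32 * 1024) for the 32KB row. The kernel may adjust the requested value, so the effective size should be read back with getsockopt() if it matters.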
    • ipv4/igmp: shrink struct ip_sf_list · 0db355d4
      Eric Dumazet authored
      Removing two 4-byte holes allows the kmalloc-32 kmem cache
      to be used instead of kmalloc-64 on 64-bit kernels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0db355d4
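The effect of removing padding holes can be illustrated with a small sketch. The two structs below are hypothetical stand-ins, not the kernel's definition of struct ip_sf_list; they only show how moving a 4-byte field behind an 8-byte-aligned array changes the size on an LP64 target. The slab_bucket helper is a simplified power-of-two model of allocator size classes and is likewise an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with a 4-byte field ahead of an 8-byte-aligned array. */
struct sf_entry_before {
	struct sf_entry_before *next;   /* 8 bytes on LP64 */
	uint32_t addr;                  /* 4 bytes, followed by a 4-byte hole */
	unsigned long count[2];         /* 16 bytes, 8-byte aligned on LP64 */
	unsigned char gsresp;
	unsigned char oldin;
	unsigned char crcount;          /* 3 bytes, then 5 bytes of tail padding */
};                                      /* 40 bytes on LP64 */

/* Same fields, reordered so the array directly follows the pointer. */
struct sf_entry_after {
	struct sf_entry_after *next;    /* 8 bytes on LP64 */
	unsigned long count[2];         /* 16 bytes, no hole before it */
	uint32_t addr;                  /* 4 bytes */
	unsigned char gsresp;
	unsigned char oldin;
	unsigned char crcount;          /* 3 bytes, then 1 byte of tail padding */
};                                      /* 32 bytes on LP64 */

/*
 * Simplified model of an allocator's size classes: round up to the next
 * power of two, with a minimum of 8. Real allocators may have more classes.
 */
static size_t slab_bucket(size_t size)
{
	size_t bucket = 8;

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}
```

Under these assumptions, slab_bucket(sizeof(struct sf_entry_before)) lands in the 64-byte class on LP64, while the reordered struct fits the 32-byte class.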
    • neighbor: Add tracepoint to __neigh_create · fc651001
      David Ahern authored
      Add tracepoint to __neigh_create to enable debugging of new entries.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc651001
    • selftests: pmtu: Simplify cleanup and namespace names · a92a0a7b
      David Ahern authored
      The point of the pause-on-fail argument is to leave the setup as is
      after a test fails so that a user can debug why it failed. Move the
      cleanup to after the result is reported to the user so that this holds.
      
      Random names for the namespaces are not user friendly when trying to
      debug a failure. Make them simpler and more direct for the tests. Run
      cleanup at the beginning to ensure they are cleaned up if they already
      exist.
      
      Remove cleanup_done. There is no harm in doing cleanup twice; just
      ignore any errors about namespaces that do not exist, which is already
      done.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a92a0a7b
    • selftests: fib-onlink: Make quiet by default · 9b7e94e6
      David Ahern authored
      Add VERBOSE argument to fib-onlink-tests.sh and make output quiet by
      default. Add getopt parsing of inputs and support for -v (verbose) and
      -p (pause on fail).
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b7e94e6
    • net: Set strict_start_type for routes and rules · 75425657
      David Ahern authored
      New userspace on an older kernel can send unknown and unsupported
      attributes, resulting in an incomplete config, which is almost
      always wrong for routing (the few exceptions are passthrough settings
      such as the protocol that installed the route).
      
      Set strict_start_type in the policies for IPv4 and IPv6 routes and
      rules to detect new, unsupported attributes and fail the route add.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75425657
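The idea of a strict start type can be sketched with a toy model. This is not the kernel's netlink validation code: the types toy_attr_kind and toy_policy and the function toy_attr_accepted are hypothetical, and the model covers only attribute types within the policy's range. It assumes that an attribute without a policy entry is tolerated below the strict start type and rejected at or above it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute kinds; TOY_UNSPEC means "no policy entry". */
enum toy_attr_kind { TOY_UNSPEC = 0, TOY_U32 };

/* Hypothetical policy table, loosely modeled on the idea in the commit. */
struct toy_policy {
	const enum toy_attr_kind *kinds;  /* indexed by attribute type, 0..max_type */
	int max_type;
	int strict_start_type;            /* 0 means strict checking is never applied */
};

/* Returns true if an attribute of the given type is accepted by the toy policy. */
static bool toy_attr_accepted(const struct toy_policy *p, int type)
{
	/* Types outside 1..max_type are outside this model; reject them
	 * only to keep the function total. */
	if (type <= 0 || type > p->max_type)
		return false;

	/* The attribute has a policy entry, so it is known and accepted. */
	if (p->kinds[type] != TOY_UNSPEC)
		return true;

	/* No policy entry: tolerated for legacy types, rejected once strict. */
	return !(p->strict_start_type && type >= p->strict_start_type);
}
```

With a strict start type of 3, an unspecified attribute of type 2 would still pass in this model, while an unspecified attribute of type 4 would cause the request to fail instead of being silently ignored.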
    • Merge branch 'net-Export-functions-for-nexthop-code' · e38f7cbd
      David S. Miller authored
      David Ahern says:
      
      ====================
      net: Export functions for nexthop code
      
      This set exports ipv4 and ipv6 fib functions for use by the nexthop
      code. It also adds new ones to send route notifications if a nexthop
      configuration changes.
      
      v2
      - repost of patches dropped at the end of the last dev window;
        added patch 8, which exports nh_update_mtu, since it is in line
        with the other patches
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e38f7cbd
    • ipv4: Rename and export nh_update_mtu · 06c77c3e
      David Ahern authored
      Rename nh_update_mtu to fib_nhc_update_mtu and export for use by the
      nexthop code.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06c77c3e
    • ipv4: export fib_info_update_nh_saddr · c3669486
      David Ahern authored
      Add scope as an input argument instead of relying on the fib_info
      reference in fib_nh, and export fib_info_update_nh_saddr.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3669486