Commits · 2fc3fc0bcdcc5429947d1e5ed8f6f0a77ba80997 · Kirill Smelkov / linux

24 May, 2019 9 commits

libbpf: switch btf_dedup() to hashmap for dedup table · 2fc3fc0b

Andrii Nakryiko authored May 24, 2019

Utilize libbpf's hashmap as a multimap fof dedup_table implementation.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2fc3fc0b

selftests/bpf: add tests for libbpf's hashmap · 5d04ec68

Andrii Nakryiko authored May 24, 2019

Test all APIs for internal hashmap implementation.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

5d04ec68

libbpf: add resizable non-thread safe internal hashmap · e3b92422

Andrii Nakryiko authored May 24, 2019

There is a need for fast point lookups inside libbpf for multiple use
cases (e.g., name resolution for BTF-to-C conversion, by-name lookups in
BTF for upcoming BPF CO-RE relocation support, etc). This patch
implements simple resizable non-thread safe hashmap using single linked
list chains.

Four different insert strategies are supported:
 - HASHMAP_ADD - only add key/value if key doesn't exist yet;
 - HASHMAP_SET - add key/value pair if key doesn't exist yet; otherwise,
   update value;
 - HASHMAP_UPDATE - update value, if key already exists; otherwise, do
   nothing and return -ENOENT;
 - HASHMAP_APPEND - always add key/value pair, even if key already exists.
   This turns hashmap into a multimap by allowing multiple values to be
   associated with the same key. Most useful read API for such hashmap is
   hashmap__for_each_key_entry() iteration. If hashmap__find() is still
   used, it will return last inserted key/value entry (first in a bucket
   chain).

For HASHMAP_SET and HASHMAP_UPDATE, old key/value pair is returned, so
that calling code can handle proper memory management, if necessary.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e3b92422

selftests/bpf: use btf__parse_elf to check presence of BTF/BTF.ext · 9db32431

Andrii Nakryiko authored May 24, 2019

Switch test_btf.c to rely on btf__parse_elf to check presence of BTF and
BTF.ext data, instead of implementing its own ELF parsing.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

9db32431

bpftool: use libbpf's btf__parse_elf API · 58650cc4

Andrii Nakryiko authored May 24, 2019

Use btf__parse_elf() API, provided by libbpf, instead of implementing
ELF parsing by itself.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

58650cc4

libbpf: add btf__parse_elf API to load .BTF and .BTF.ext · e6c64855

Andrii Nakryiko authored May 24, 2019

Loading BTF and BTF.ext from ELF file is a common need. Instead of
requiring every user to re-implement it, let's provide this API from
libbpf itself. It's mostly copy/paste from `bpftool btf dump`
implementation, which will be switched to libbpf's version in next
patch. btf__parse_elf allows to load BTF and optionally BTF.ext.
This is also useful for tests that need to load/work with BTF, loaded
from test ELF files.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e6c64855

libbpf: ensure libbpf.h is included along libbpf_internal.h · 1d7a08b3

Andrii Nakryiko authored May 24, 2019

libbpf_internal.h expects a bunch of stuff defined in libbpf.h to be
defined. This patch makes sure that libbpf.h is always included.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

1d7a08b3

samples: bpf: Do not define bpf_printk macro · c87f60a7

Michal Rostecki authored May 23, 2019

The bpf_printk macro was moved to bpf_helpers.h which is included in all
example programs.
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

c87f60a7

selftests: bpf: Move bpf_printk to bpf_helpers.h · 37739d1b

Michal Rostecki authored May 23, 2019

bpf_printk is a macro which is commonly used to print out debug messages
in BPF programs and it was copied in many selftests and samples. Since
all of them include bpf_helpers.h, this change moves the macro there.
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

37739d1b

23 May, 2019 31 commits

Merge branch 'bpf-explored-states' · 5762a20b

Daniel Borkmann authored May 24, 2019

Alexei Starovoitov says:

====================
Convert explored_states array into hash table and use simple hash
to reduce verifier peak memory consumption for programs with bpf2bpf
calls. More details in patch 3.

v1->v2: fixed Jakub's small nit in patch 1
====================
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

5762a20b

bpf: convert explored_states to hash table · dc2a4ebc

Alexei Starovoitov authored May 21, 2019

All prune points inside a callee bpf function most likely will have
different callsites. For example, if function foo() is called from
two callsites the half of explored states in all prune points in foo()
will be useless for subsequent walking of one of those callsites.
Fortunately explored_states pruning heuristics keeps the number of states
per prune point small, but walking these states is still a waste of cpu
time when the callsite of the current state is different from the callsite
of the explored state.

To improve pruning logic convert explored_states into hash table and
use simple insn_idx ^ callsite hash to select hash bucket.
This optimization has no effect on programs without bpf2bpf calls
and drastically improves programs with calls.
In the later case it reduces total memory consumption in 1M scale tests
by almost 3 times (peak_states drops from 5752 to 2016).

Care should be taken when comparing the states for equivalency.
Since the same hash bucket can now contain states with different indices
the insn_idx has to be part of verifier_state and compared.

Different hash table sizes and different hash functions were explored,
but the results were not significantly better vs this patch.
They can be improved in the future.

Hit/miss heuristic is not counting index miscompare as a miss.
Otherwise verifier stats become unstable when experimenting
with different hash functions.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

dc2a4ebc

bpf: split explored_states · a8f500af

Alexei Starovoitov authored May 21, 2019

split explored_states into prune_point boolean mark
and link list of explored states.
This removes STATE_LIST_MARK hack and allows marks to be separate from states.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

a8f500af

bpf: cleanup explored_states · 5d839021

Alexei Starovoitov authored May 21, 2019

clean up explored_states to prep for introduction of hashtable
No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

5d839021

Merge branch 'bpf-jmp-seq-limit' · 29c677c8

Daniel Borkmann authored May 23, 2019

Alexei Starovoitov says:

====================
Patch 1 - jmp sequence limit
Patch 2 - improve existing tests
Patch 3 - add pyperf-based realistic bpf program that takes
          advantage of higher limit and use it as a stress test

v1->v2: fixed nit in patch 3. added Andrii's acks
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

29c677c8

selftests/bpf: add pyperf scale test · 7c944106

Alexei Starovoitov authored May 21, 2019

Add a snippet of pyperf bpf program used to collect python stack traces
as a scale test for the verifier.

At 189 loop iterations llvm 9.0 starts ignoring '#pragma unroll'
and generates partially unrolled loop instead.
Hence use 50, 100, and 180 loop iterations to stress test.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

7c944106

selftests/bpf: adjust verifier scale test · 7c0c6095

Alexei Starovoitov authored May 21, 2019

Adjust scale tests to check for new jmp sequence limit.

BPF_JGT had to be changed to BPF_JEQ because the verifier was
too smart. It tracked the known safe range of R0 values
and pruned the search earlier before hitting exact 8192 limit.
bpf_semi_rand_get() was too (un)?lucky.

k = 0; was missing in bpf_fill_scale2.
It was testing a bit shorter sequence of jumps than intended.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

7c0c6095

bpf: bump jmp sequence limit · b285fcb7

Alexei Starovoitov authored May 21, 2019

The limit of 1024 subsequent jumps was causing otherwise valid
programs to be rejected. Bump it to 8192 and make the error more verbose.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

b285fcb7

libbpf: emit diff of mismatched public API, if any · 9efc7794

Andrii Nakryiko authored May 22, 2019

It's easy to have a mismatch of "intended to be public" vs really
exposed API functions. While Makefile does check for this mismatch, if
it actually occurs it's not trivial to determine which functions are
accidentally exposed. This patch dumps out a diff showing what's not
supposed to be exposed facilitating easier fixing.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

9efc7794

hv_sock: perf: loop in send() to maximize bandwidth · 14a1eaa8

Sunil Muthuswamy authored May 22, 2019

Currently, the hv_sock send() iterates once over the buffer, puts data into
the VMBUS channel and returns. It doesn't maximize on the case when there
is a simultaneous reader draining data from the channel. In such a case,
the send() can maximize the bandwidth (and consequently minimize the cpu
cycles) by iterating until the channel is found to be full.

Perf data:
Total Data Transfer: 10GB/iteration
Single threaded reader/writer, Linux hvsocket writer with Windows hvsocket
reader
Packet size: 64KB
CPU sys time was captured using the 'time' command for the writer to send
10GB of data.
'Send Buffer Loop' is with the patch applied.
The values below are over 10 iterations.

|--------------------------------------------------------|
|        |        Current        |   Send Buffer Loop    |
|--------------------------------------------------------|
|        | Throughput | CPU sys  | Throughput | CPU sys  |
|        | (MB/s)     | time (s) | (MB/s)     | time (s) |
|--------------------------------------------------------|
| Min    |     407    |   7.048  |    401     |  5.958   |
|--------------------------------------------------------|
| Max    |     455    |   7.563  |    542     |  6.993   |
|--------------------------------------------------------|
| Avg    |     440    |   7.411  |    451     |  6.639   |
|--------------------------------------------------------|
| Median |     446    |   7.417  |    447     |  6.761   |
|--------------------------------------------------------|

Observation:
1. The avg throughput doesn't really change much with this change for this
scenario. This is most probably because the bottleneck on throughput is
somewhere else.
2. The average system (or kernel) cpu time goes down by 10%+ with this
change, for the same amount of data transfer.
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14a1eaa8

hv_sock: perf: Allow the socket buffer size options to influence the actual socket buffers · ac383f58

Sunil Muthuswamy authored May 22, 2019

Currently, the hv_sock buffer size is static and can't scale to the
bandwidth requirements of the application. This change allows the
applications to influence the socket buffer sizes using the SO_SNDBUF and
the SO_RCVBUF socket options.

Few interesting points to note:
1. Since the VMBUS does not allow a resize operation of the ring size, the
socket buffer size option should be set prior to establishing the
connection for it to take effect.
2. Setting the socket option comes with the cost of that much memory being
reserved/allocated by the kernel, for the lifetime of the connection.

Perf data:
Total Data Transfer: 1GB
Single threaded reader/writer
Results below are summarized over 10 iterations.

Linux hvsocket writer + Windows hvsocket reader:
|---------------------------------------------------------------------------------------------|
|Packet size ->   |      128B       |       1KB       |       4KB       |        64KB         |
|---------------------------------------------------------------------------------------------|
|SO_SNDBUF size | |                 Throughput in MB/s (min/max/avg/median):                  |
|               v |                                                                           |
|---------------------------------------------------------------------------------------------|
|      Default    | 109/118/114/116 | 636/774/701/700 | 435/507/480/476 |   410/491/462/470   |
|      16KB       | 110/116/112/111 | 575/705/662/671 | 749/900/854/869 |   592/824/692/676   |
|      32KB       | 108/120/115/115 | 703/823/767/772 | 718/878/850/866 | 1593/2124/2000/2085 |
|      64KB       | 108/119/114/114 | 592/732/683/688 | 805/934/903/911 | 1784/1943/1862/1843 |
|---------------------------------------------------------------------------------------------|

Windows hvsocket writer + Linux hvsocket reader:
|---------------------------------------------------------------------------------------------|
|Packet size ->   |     128B    |      1KB        |          4KB        |        64KB         |
|---------------------------------------------------------------------------------------------|
|SO_RCVBUF size | |               Throughput in MB/s (min/max/avg/median):                    |
|               v |                                                                           |
|---------------------------------------------------------------------------------------------|
|      Default    | 69/82/75/73 | 313/343/333/336 |   418/477/446/445   |   659/701/676/678   |
|      16KB       | 69/83/76/77 | 350/401/375/382 |   506/548/517/516   |   602/624/615/615   |
|      32KB       | 62/83/73/73 | 471/529/496/494 |   830/1046/935/939  | 944/1180/1070/1100  |
|      64KB       | 64/70/68/69 | 467/533/501/497 | 1260/1590/1430/1431 | 1605/1819/1670/1660 |
|---------------------------------------------------------------------------------------------|
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac383f58

ipv4/igmp: shrink struct ip_sf_list · 0db355d4

Eric Dumazet authored May 22, 2019

Removing two 4 bytes holes allows to use kmalloc-32
kmem cache instead of kmalloc-64 on 64bit kernels.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0db355d4

neighbor: Add tracepoint to __neigh_create · fc651001

David Ahern authored May 22, 2019

Add tracepoint to __neigh_create to enable debugging of new entries.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc651001

selftests: pmtu: Simplify cleanup and namespace names · a92a0a7b

David Ahern authored May 22, 2019

The point of the pause-on-fail argument is to leave the setup as is after
a test fails to allow a user to debug why it failed. Move the cleanup
after posting the result to the user to make it so.

Random names for the namespaces are not user friendly when trying to
debug a failure. Make them simpler and more direct for the tests. Run
cleanup at the beginning to ensure they are cleaned up if they already
exist.

Remove cleanup_done. There is no harm in doing cleanup twice; just
ignore any errors related to not existing - which is already done.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a92a0a7b

selftests: fib-onlink: Make quiet by default · 9b7e94e6

David Ahern authored May 22, 2019

Add VERBOSE argument to fib-onlink-tests.sh and make output quiet by
default. Add getopt parsing of inputs and support for -v (verbose) and
-p (pause on fail).
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b7e94e6

net: Set strict_start_type for routes and rules · 75425657

David Ahern authored May 22, 2019

New userspace on an older kernel can send unknown and unsupported
attributes resulting in an incompelete config which is almost
always wrong for routing (few exceptions are passthrough settings
like the protocol that installed the route).

Set strict_start_type in the policies for IPv4 and IPv6 routes and
rules to detect new, unsupported attributes and fail the route add.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

75425657

Merge branch 'net-Export-functions-for-nexthop-code' · e38f7cbd

David S. Miller authored May 22, 2019

David Ahern says:

====================
net: Export functions for nexthop code

This set exports ipv4 and ipv6 fib functions for use by the nexthop
code. It also adds new ones to send route notifications if a nexthop
configuration changes.

v2
- repost of patches dropped at the end of the last dev window
  added patch 8 which exports nh_update_mtu since it is inline with
  the other patches
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e38f7cbd

ipv4: Rename and export nh_update_mtu · 06c77c3e

David Ahern authored May 22, 2019

Rename nh_update_mtu to fib_nhc_update_mtu and export for use by the
nexthop code.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06c77c3e

ipv4: export fib_info_update_nh_saddr · c3669486

David Ahern authored May 22, 2019

Add scope as input argument versus relying on fib_info reference in
fib_nh, and export fib_info_update_nh_saddr.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3669486

ipv4: export fib_flush · 9bd83667

David Ahern authored May 22, 2019

As nexthops are deleted, fib entries referencing it are marked dead.
Export fib_flush so those entries can be removed in a timely manner.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9bd83667

ipv4: export fib_check_nh · ac1fab2d

David Ahern authored May 22, 2019

Change fib_check_nh to take net, table and scope as input arguments
over struct fib_config and export for use by nexthop code.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac1fab2d

ipv4: Add function to send route updates · 1bff1a0c

David Ahern authored May 22, 2019

Add fib_info_notify_update to walk the fib and send RTM_NEWROUTE
notifications with NLM_F_REPLACE set for entries linked to a fib_info
that have nh_updated flag set. This helper will be used by the nexthop
code to notify userspace of routes that are impacted when a nexthop
config is updated via replace. The new function and its helper are
similar to how fib_flush and fib_table_flush work for address delete
and link down events.

This notification is needed for legacy apps that do not understand
the new nexthop object. Apps that are nexthop aware can use the
RTA_NH_ID attribute in the route notification to just ignore it.

In the future this should be wrapped in a sysctl to allow OS'es that
are fully updated to avoid the notificaton storm.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1bff1a0c

ipv6: export function to send route updates · 19a3b7ee

David Ahern authored May 22, 2019

Add fib6_rt_update to send RTM_NEWROUTE with NLM_F_REPLACE set. This
helper will be used by the nexthop code to notify userspace of routes
that are impacted when a nexthop config is updated via replace.

This notification is needed for legacy apps that do not understand
the new nexthop object. Apps that are nexthop aware can use the
RTA_NH_ID attribute in the route notification to just ignore it.

In the future this should be wrapped in a sysctl to allow OS'es that
are fully updated to avoid the notificaton storm.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19a3b7ee

ipv6: Add hook to bump sernum for a route to stubs · cdaa16a4

David Ahern authored May 22, 2019

Add hook to ipv6 stub to bump the sernum up to the root node for a
route. This is needed by the nexthop code when a nexthop config changes.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cdaa16a4

ipv6: Add delete route hook to stubs · 68a9b13d

David Ahern authored May 22, 2019

Add ip6_del_rt to the IPv6 stub. The hook is needed by the nexthop
code to remove entries linked to a nexthop that is getting deleted.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

68a9b13d

Merge branch 'net-phy-T1-support' · 26b1b8d7

David S. Miller authored May 22, 2019

Andrew Lunn says:

====================
net: phy: T1 support

T1 PHYs make use of a single twisted pair, rather than the traditional
2 pair for 100BaseT or 4 pair for 1000BaseT. This patchset adds link
modes for 100BaseT1 and 1000BaseT1, and them makes use of 100BaseT1 in
the list of PHY features used by current T1 drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

26b1b8d7

net: phy: Make phy_basic_t1_features use base100t1. · e5fb32c6

Andrew Lunn authored May 22, 2019

Now that there is a link mode for 100BaseT1, use it in
phy_basic_t1_features so T1 PHY drivers will indicate this mode via
the Ethtool API.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

e5fb32c6

net: phy: Add support for 100BaseT1 and 1000BaseT1 · b2557764

Andrew Lunn authored May 22, 2019

Add link modes for 100Mbps and 1Gbps over a single pair.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

b2557764

net: phy: dp83867: Allocate state struct in probe · 565d9d22

Trent Piepho authored May 22, 2019

This was being done in config the first time the phy was configured.
Should be in the probe method.

Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Trent Piepho <tpiepho@impinj.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

565d9d22

net: phy: dp83867: Validate FIFO depth property · f8bbf417

Trent Piepho authored May 22, 2019

Insure property is in valid range and fail when reading DT if it is not.
Also add error message for existing failure if required property is not
present.

Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Trent Piepho <tpiepho@impinj.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

f8bbf417

net: phy: dp83867: IO impedance is not dependent on RGMII delay · 27708eb5

Trent Piepho authored May 22, 2019

The driver would only set the IO impedance value when RGMII internal
delays were enabled.  There is no reason for this.  Move the IO
impedance block out of the RGMII delay block.

Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Trent Piepho <tpiepho@impinj.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

27708eb5