Commits · ede656e8465839530c3287c7f54adf75dc2b9563 · Kirill Smelkov / linux

28 Dec, 2019 5 commits

tcp_cubic: make Hystart aware of pacing · ede656e8

Eric Dumazet authored Dec 23, 2019

For years we disabled Hystart ACK train detection at Google
because it was fooled by TCP pacing.

ACK train detection uses a simple heuristic, detecting if
we receive ACK past half the RTT, to exit slow start before
hitting the bottleneck and experience massive drops.

But pacing by design might delay packets up to RTT/2,
so we need to tweak the Hystart logic to be aware of this
extra delay.

Tested:
 Added a 100 usec delay at receiver.

Before:
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
   9117
   7057
   9553
   8300
   7030
   6849
   9533
  10126
   6876
   8473
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1230               0.0

After :
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
   9845
  10103
  10866
  11096
  11936
  11487
  11773
  12188
  11066
  11894
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       6462               0.0

Disabling Hystart ACK Train detection gives similar numbers

echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  11173
  10954
  12455
  10627
  11578
  11583
  11222
  10880
  10665
  11366
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ede656e8

tcp_cubic: tweak Hystart detection for short RTT flows · 42f3a8aa

Eric Dumazet authored Dec 23, 2019

After switching ca->delay_min to usec resolution, we exit
slow start prematurely for very low RTT flows, setting
snd_ssthresh to 20.

The reason is that delay_min is fed with RTT of small packet
trains. Then as cwnd is increased, TCP sends bigger TSO packets.

LRO/GRO aggregation and/or interrupt mitigation strategies
on receiver tend to inflate RTT samples.

Fix this by adding to delay_min the expected delay of
two TSO packets, given current pacing rate.

Tested:

Sender uses pfifo_fast qdisc

Before :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  11348
  11707
  11562
  11428
  11773
  11534
   9878
  11693
  10597
  10968
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       200                0.0

After :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  14877
  14517
  15797
  18466
  17376
  14833
  17558
  17933
  16039
  18059
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1670               0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

42f3a8aa

tcp_cubic: switch bictcp_clock() to usec resolution · cff04e2d

Eric Dumazet authored Dec 23, 2019

Current 1ms clock feeds ca->round_start, ca->delay_min,
ca->last_ack.

This is quite problematic for data-center flows, where delay_min
is way below 1 ms.

This means Hystart Train detection triggers every time jiffies value
is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
expression becomes true.

This kind of random behavior can be solved by reusing the existing
usec timestamp that TCP keeps in tp->tcp_mstamp

Note that a followup patch will tweak things a bit, because
during slow start, GRO aggregation on receivers naturally
increases the RTT as TSO packets gradually come to ~64KB size.

To recap, right after this patch CUBIC Hystart train detection
is more aggressive, since short RTT flows might exit slow start at
cwnd = 20, instead of being possibly unbounded.

Following patch will address this problem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cff04e2d

tcp_cubic: remove one conditional from hystart_update() · 35821fc2

Eric Dumazet authored Dec 23, 2019

If we initialize ca->curr_rtt to ~0U, we do not need to test
for zero value in hystart_update()

We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
been processed, and thus ca->curr_rtt will have a sane value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35821fc2

tcp_cubic: optimize hystart_update() · 473900a5

Eric Dumazet authored Dec 23, 2019

We do not care which bit in ca->found is set.

We avoid accessing hystart and hystart_detect unless really needed,
possibly avoiding one cache line miss.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

473900a5

27 Dec, 2019 2 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 2bbc078f

David S. Miller authored Dec 27, 2019

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-12-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 127 non-merge commits during the last 17 day(s) which contain
a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

There are three merge conflicts. Conflicts and resolution looks as follows:

1) Merge conflict in net/bpf/test_run.c:

There was a tree-wide cleanup c593642c ("treewide: Use sizeof_field() macro")
which gets in the way with b590cb5f ("bpf: Switch to offsetofend in
BPF_PROG_TEST_RUN"):

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                             sizeof_field(struct __sk_buff, priority),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
  >>>>>>> 7c8dce4b

There are a few occasions that look similar to this. Always take the chunk with
offsetofend(). Note that there is one where the fields differ in here:

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                             sizeof_field(struct __sk_buff, tstamp),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
  >>>>>>> 7c8dce4b

Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
850a88cc ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

  <<<<<<< HEAD
          if (is_13b_check(off, insn))
                  return -1;
          emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
  =======
          emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
  >>>>>>> 7c8dce4b

Result should look like:

          emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

3) Merge conflict in arch/riscv/include/asm/pgtable.h:

  <<<<<<< HEAD
  =======
  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  #define vmemmap         ((struct page *)VMEMMAP_START)

  >>>>>>> 7c8dce4b

Only take the BPF_* defines from there and move them higher up in the
same file. Remove the rest from the chunk. The VMALLOC_* etc defines
got moved via 01f52e16 ("riscv: define vmemmap before pfn_to_page
calls"). Result:

  [...]
  #define __S101  PAGE_READ_EXEC
  #define __S110  PAGE_SHARED_EXEC
  #define __S111  PAGE_SHARED_EXEC

  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  [...]

Let me know if there are any other issues.

Anyway, the main changes are:

1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
   to a provided BPF object file. This provides an alternative, simplified API
   compared to standard libbpf interaction. Also, add libbpf extern variable
   resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
   generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
   add various BPF riscv JIT improvements, from Björn Töpel.

3) Extend bpftool to allow matching BPF programs and maps by name,
   from Paul Chaignon.

4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
   flag for allowing updates without service interruption, from Andrey Ignatov.

5) Cleanup and simplification of ring access functions for AF_XDP with a
   bonus of 0-5% performance improvement, from Magnus Karlsson.

6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
   audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.

7) Move and extend test_select_reuseport into BPF program tests under
   BPF selftests, from Jakub Sitnicki.

8) Various BPF sample improvements for xdpsock for customizing parameters
   to set up and benchmark AF_XDP, from Jay Jayatheerthan.

9) Improve libbpf to provide a ulimit hint on permission denied errors.
   Also change XDP sample programs to attach in driver mode by default,
   from Toke Høiland-Jørgensen.

10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

13) Minor misc improvements from various others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

2bbc078f

bpftool: Make skeleton C code compilable with C++ compiler · 7c8dce4b

Andrii Nakryiko authored Dec 26, 2019

When auto-generated BPF skeleton C code is included from C++ application, it
triggers compilation error due to void * being implicitly casted to whatever
target pointer type. This is supported by C, but not C++. To solve this
problem, add explicit casts, where necessary.

To ensure issues like this are captured going forward, add skeleton usage in
test_cpp test.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191226210253.3132060-1-andriin@fb.com

7c8dce4b

26 Dec, 2019 32 commits

Merge branch 's390-qeth-next' · 9e41fbf3

David S. Miller authored Dec 26, 2019

Julian Wiedmann says:

====================
s390/qeth: updates 2019-12-23

please apply the following patch series for qeth to your net-next tree.

This reworks the RX code to use napi_gro_frags() when building non-linear
skbs, along with some consolidation and cleanups.

Happy holidays - and many thanks for all the effort & support over the past
year, to both Jakub and you. It's much appreciated.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9e41fbf3

s390/qeth: remove QETH_RX_PULL_LEN · 8ca8559f

Julian Wiedmann authored Dec 23, 2019

Since commit f677fcb9 ("s390/qeth: ensure linear access to packet headers"),
the CQ-specific skbs are allocated with a slightly bigger linear part
than necessary. Shrink it down to the maximum that's needed by
qeth_extract_skb().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ca8559f

s390/qeth: use napi_gro_frags() for SG skbs · dcdcf867

Julian Wiedmann authored Dec 23, 2019

For non-linear packets, get the skb for attaching the page fragments
from napi_get_frags() so that it can be recycled during GRO.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dcdcf867

s390/qeth: consolidate RX code · c04b116a

Julian Wiedmann authored Dec 23, 2019

To reduce the path length and levels of indirection, move the RX
processing from the sub-drivers into the core.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c04b116a

af_packet: refactoring code for prb_calc_retire_blk_tmo · 0914d2bb

Mao Wenan authored Dec 23, 2019

If __ethtool_get_link_ksettings() is failed and with
non-zero value, prb_calc_retire_blk_tmo() should return
DEFAULT_PRB_RETIRE_TOV firstly.

This patch is to refactory code and make it more readable.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0914d2bb

xen-netback: support dynamic unbind/bind · 9476654b

Paul Durrant authored Dec 23, 2019

By re-attaching RX, TX, and CTL rings during connect() rather than
assuming they are freshly allocated (i.e. assuming the counters are zero),
and avoiding forcing state to Closed in netback_remove() it is possible
for vif instances to be unbound and re-bound from and to (respectively) a
running guest.

Dynamic unbind/bind is a highly useful feature for a backend module as it
allows it to be unloaded and re-loaded (i.e. updated) without requiring
domUs to be halted.

This has been tested by running iperf as a server in the test VM and
then running a client against it in a continuous loop, whilst also
running:

while true;
  do echo vif-$DOMID-$VIF >unbind;
  echo down;
  rmmod xen-netback;
  echo unloaded;
  modprobe xen-netback;
  cd $(pwd);
  brctl addif xenbr0 vif$DOMID.$VIF;
  ip link set vif$DOMID.$VIF up;
  echo up;
  sleep 5;
  done

in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
unload, re-load, re-bind and re-plumb the backend.

Clearly a performance drop was seen but no TCP connection resets were
observed during this test and moreover a parallel SSH connection into the
guest remained perfectly usable throughout.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9476654b

Merge branch 'RTL8211F-RGMII-RX-TX-delay-configuration-improvements' · 8d347992

David S. Miller authored Dec 26, 2019

Martin Blumenstingl says:

====================
RTL8211F: RGMII RX/TX delay configuration improvements

In discussion with Andrew [0] we figured out that it would be best to
make the RX delay of the RTL8211F PHY configurable (just like the TX
delay is already configurable).

While here I took the opportunity to add some logging to the TX delay
configuration as well.

There is no public documentation for the RX and TX delay registers.
I received this information a while ago (and created this RfC patch
back then: [1]). Realtek gave me permission to take the information
from the datasheet extracts and phase them in my own words and publish
that (I am not allowed to publish the datasheet extracts).

I have tested these patches on two boards:
- Amlogic Meson8b Odroid-C1
- Amlogic GXM Khadas VIM2
Both still behave as before these changes (iperf3 speeds are the same
in both directions: RX and TX), which is expected because they are
currently using phy-mode = "rgmii" with the RX delay not being generated
by the PHY.

[0] https://patchwork.ozlabs.org/patch/1215313/
[1] https://patchwork.ozlabs.org/patch/843946/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8d347992

net: phy: realtek: add support for configuring the RX delay on RTL8211F · 1b3047b5

Martin Blumenstingl authored Dec 26, 2019

On RTL8211F the RX and TX delays (2ns) can be configured in two ways:
- pin strapping (RXD1 for the TX delay and RXD0 for the RX delay, LOW
  means "off" and HIGH means "on") which is read during PHY reset
- using software to configure the TX and RX delay registers

So far only the configuration using pin strapping has been supported.
Add support for enabling or disabling the RGMII RX delay based on the
phy-mode to be able to get the RX delay into a known state. This is
important because the RX delay has to be coordinated between the PHY,
MAC and the PCB design (trace length). With an invalid RX delay applied
(for example if both PHY and MAC add a 2ns RX delay) Ethernet may not
work at all.

Also add debug logging when configuring the RX delay (just like the TX
delay) because this is a common source of problems.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1b3047b5

net: phy: realtek: add logging for the RGMII TX delay configuration · 3aec743d

Martin Blumenstingl authored Dec 26, 2019

RGMII requires a delay of 2ns between the data and the clock signal.
There are at least three ways this can happen. One possibility is by
having the PHY generate this delay.
This is a common source for problems (for example with slow TX speeds or
packet loss when sending data). The TX delay configuration of the
RTL8211F PHY can be set either by pin-strappping the RXD1 pin (HIGH
means enabled, LOW means disabled) or through configuring a paged
register. The setting from the RXD1 pin is also reflected in the
register.

Add debug logging to the TX delay configuration on RTL8211F so it's
easier to spot these issues (for example if the TX delay is enabled for
both, the RTL8211F PHY and the MAC).
This is especially helpful because there is no public datasheet for the
RTL8211F PHY available with all the RX/TX delay specifics.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3aec743d

Merge branch 'mlxsw-spectrum_router-Cleanups' · 1f4f16fa

David S. Miller authored Dec 26, 2019

Ido Schimmel says:

====================
mlxsw: spectrum_router: Cleanups

This patch set removes from mlxsw code that is no longer necessary after
the simplification of the IPv4 and IPv6 route offload API.

The patches eliminate unnecessary code by taking advantage of the fact
that mlxsw no longer needs to maintain a list of identical routes,
following recent changes in route offload API.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

1f4f16fa

mlxsw: spectrum_router: Remove FIB entry list from FIB node · 7c4a7ec8

Ido Schimmel authored Dec 26, 2019

As explained in previous patches, the driver no longer needs to maintain
a list of identical FIB entries (i.e, same {tb_id, prefix, prefix
length}) and therefore each FIB node can only store one FIB entry.

Remove the FIB entry list and simplify the code.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c4a7ec8

mlxsw: spectrum_router: Consolidate identical functions · b04720ae

Ido Schimmel authored Dec 26, 2019

After the last patch mlxsw_sp_fib{4,6}_node_entry_link() and
mlxsw_sp_fib{4,6}_node_entry_unlink() are identical and can therefore be
consolidated into the same common function.

Perform the consolidation.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b04720ae

mlxsw: spectrum_router: Make route creation and destruction symmetric · 0705297e

Ido Schimmel authored Dec 26, 2019

Host routes that perform decapsulation of IP in IP tunnels have a
special adjacency entry linked to them. This entry stores information
such as the expected underlay source IP. When the route is deleted this
entry needs to be freed.

The allocation of the adjacency entry happens in
mlxsw_sp_fib4_entry_type_set(), but it is freed in
mlxsw_sp_fib4_node_entry_unlink().

Create a new function - mlxsw_sp_fib4_entry_type_unset() - and free the
adjacency entry there.

This will allow us to consolidate mlxsw_sp_fib{4,6}_node_entry_unlink()
in the next patch.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0705297e

mlxsw: spectrum_router: Eliminate dead code · 0d2fb5aa

Ido Schimmel authored Dec 26, 2019

Since the driver no longer maintains a list of identical routes there is
no route to promote when a route is deleted.

Remove that code that took care of it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d2fb5aa

mlxsw: spectrum_router: Remove unnecessary checks · 231c8d2b

Ido Schimmel authored Dec 26, 2019

Now that the networking stack takes care of only notifying the routes of
interest, we do not need to maintain a list of identical routes.

Remove the check that tests if the route is the first route in the FIB
node.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

231c8d2b

bonding: rename AD_STATE_* to LACP_STATE_* · c1e46990

Andy Roulin authored Dec 26, 2019

As the LACP actor/partner state is now part of the uapi, rename the
3ad state defines with LACP prefix. The LACP prefix is preferred over
BOND_3AD as the LACP standard moved to 802.1AX.

Fixes: 826f66b3 ("bonding: move 802.3ad port state flags to uapi")
Signed-off-by: Andy Roulin <aroulin@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1e46990

sctp: move trace_sctp_probe_path into sctp_outq_sack · f643ee29

Kevin Kou authored Dec 26, 2019

The original patch bringed in the "SCTP ACK tracking trace event"
feature was committed at Dec.20, 2017, it replaced jprobe usage
with trace events, and bringed in two trace events, one is
TRACE_EVENT(sctp_probe), another one is TRACE_EVENT(sctp_probe_path).
The original patch intended to trigger the trace_sctp_probe_path in
TRACE_EVENT(sctp_probe) as below code,

+TRACE_EVENT(sctp_probe,
+
+	TP_PROTO(const struct sctp_endpoint *ep,
+		 const struct sctp_association *asoc,
+		 struct sctp_chunk *chunk),
+
+	TP_ARGS(ep, asoc, chunk),
+
+	TP_STRUCT__entry(
+		__field(__u64, asoc)
+		__field(__u32, mark)
+		__field(__u16, bind_port)
+		__field(__u16, peer_port)
+		__field(__u32, pathmtu)
+		__field(__u32, rwnd)
+		__field(__u16, unack_data)
+	),
+
+	TP_fast_assign(
+		struct sk_buff *skb = chunk->skb;
+
+		__entry->asoc = (unsigned long)asoc;
+		__entry->mark = skb->mark;
+		__entry->bind_port = ep->base.bind_addr.port;
+		__entry->peer_port = asoc->peer.port;
+		__entry->pathmtu = asoc->pathmtu;
+		__entry->rwnd = asoc->peer.rwnd;
+		__entry->unack_data = asoc->unack_data;
+
+		if (trace_sctp_probe_path_enabled()) {
+			struct sctp_transport *sp;
+
+			list_for_each_entry(sp, &asoc->peer.transport_addr_list,
+					    transports) {
+				trace_sctp_probe_path(sp, asoc);
+			}
+		}
+	),

But I found it did not work when I did testing, and trace_sctp_probe_path
had no output, I finally found that there is trace buffer lock
operation(trace_event_buffer_reserve) in include/trace/trace_events.h:

static notrace void							\
trace_event_raw_event_##call(void *__data, proto)			\
{									\
	struct trace_event_file *trace_file = __data;			\
	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
	struct trace_event_buffer fbuffer;				\
	struct trace_event_raw_##call *entry;				\
	int __data_size;						\
									\
	if (trace_trigger_soft_disabled(trace_file))			\
		return;							\
									\
	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
									\
	entry = trace_event_buffer_reserve(&fbuffer, trace_file,	\
				 sizeof(*entry) + __data_size);		\
									\
	if (!entry)							\
		return;							\
									\
	tstruct								\
									\
	{ assign; }							\
									\
	trace_event_buffer_commit(&fbuffer);				\
}

The reason caused no output of trace_sctp_probe_path is that
trace_sctp_probe_path written in TP_fast_assign part of
TRACE_EVENT(sctp_probe), and it will be placed( { assign; } ) after the
trace_event_buffer_reserve() when compiler expands Macro,

        entry = trace_event_buffer_reserve(&fbuffer, trace_file,        \
                                 sizeof(*entry) + __data_size);         \
                                                                        \
        if (!entry)                                                     \
                return;                                                 \
                                                                        \
        tstruct                                                         \
                                                                        \
        { assign; }                                                     \

so trace_sctp_probe_path finally can not acquire trace_event_buffer
and return no output, that is to say the nest of tracepoint entry function
is not allowed. The function call flow is:

trace_sctp_probe()
-> trace_event_raw_event_sctp_probe()
 -> lock buffer
 -> trace_sctp_probe_path()
   -> trace_event_raw_event_sctp_probe_path()  --nested
   -> buffer has been locked and return no output.

This patch is to remove trace_sctp_probe_path from the TP_fast_assign
part of TRACE_EVENT(sctp_probe) to avoid the nest of entry function,
and trigger sctp_probe_path_trace in sctp_outq_sack.

After this patch, you can enable both events individually,
  # cd /sys/kernel/debug/tracing
  # echo 1 > events/sctp/sctp_probe/enable
  # echo 1 > events/sctp/sctp_probe_path/enable

Or, you can enable all the events under sctp.

  # echo 1 > events/sctp/enable
Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f643ee29

bpf: Print error message for bpftool cgroup show · 1162f844

Hechao Li authored Dec 23, 2019

Currently, when bpftool cgroup show <path> has an error, no error
message is printed. This is confusing because the user may think the
result is empty.

Before the change:

$ bpftool cgroup show /sys/fs/cgroup
ID       AttachType      AttachFlags     Name
$ echo $?
255

After the change:
$ ./bpftool cgroup show /sys/fs/cgroup
Error: can't query bpf programs attached to /sys/fs/cgroup: Operation
not permitted

v2: Rename check_query_cgroup_progs to cgroup_has_attached_progs
Signed-off-by: Hechao Li <hechaol@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191224011742.3714301-1-hechaol@fb.com

1162f844

libbpf: Support CO-RE relocations for LDX/ST/STX instructions · 8ab9da57

Andrii Nakryiko authored Dec 23, 2019

Clang patch [0] enables emitting relocatable generic ALU/ALU64 instructions
(i.e, shifts and arithmetic operations), as well as generic load/store
instructions. The former ones are already supported by libbpf as is. This
patch adds further support for load/store instructions. Relocatable field
offset is encoded in BPF instruction's 16-bit offset section and are adjusted
by libbpf based on target kernel BTF.

These Clang changes and corresponding libbpf changes allow for more succinct
generated BPF code by encoding relocatable field reads as a single
ST/LDX/STX instruction. It also enables relocatable access to BPF context.
Previously, if context struct (e.g., __sk_buff) was accessed with CO-RE
relocations (e.g., due to preserve_access_index attribute), it would be
rejected by BPF verifier due to modified context pointer dereference. With
Clang patch, such context accesses are both relocatable and have a fixed
offset from the point of view of BPF verifier.

  [0] https://reviews.llvm.org/D71790Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191223180305.86417-1-andriin@fb.com

8ab9da57

Merge branch 'Peer-to-Peer-One-Step-time-stamping' · aea3dee8

David S. Miller authored Dec 25, 2019

Richard Cochran says:

====================
Peer to Peer One-Step time stamping

This series adds support for PTP (IEEE 1588) P2P one-step time
stamping along with a driver for a hardware device that supports this.

If the hardware supports p2p one-step, it subtracts the ingress time
stamp value from the Pdelay_Request correction field.  The user space
software stack then simply copies the correction field into the
Pdelay_Response, and on transmission the hardware adds the egress time
stamp into the correction field.

This new functionality extends CONFIG_NETWORK_PHY_TIMESTAMPING to
cover MII snooping devices, but it still depends on phylib, just as
that option does.  Expanding beyond phylib is not within the scope of
the this series.

User space support is available in the current linuxptp master branch.

- Patch 1 adds phy_device methods for existing time stamping fields.
- Patches 2-5 convert the stack and drivers to the new methods.
- Patch 6 moves code around the dp83640 driver.
- Patches 7-10 add support for MII time stamping in non-PHY devices.
- Patch 11 adds the new P2P 1-step option.
- Patch 12 adds a driver implementing the new option.

Thanks,
Richard

Changed in v9:
~~~~~~~~~~~~~~

- Fix two more drivers' switch/case blocks WRT the new HWTSTAMP ioctl.
- Picked up two more review tags from Andrew.

Changed in v8:
~~~~~~~~~~~~~~

- Avoided adding forward functional declarations in the dp83640 driver.
- Picked up Florian's new review tags and another one from Andrew.

Changed in v7:
~~~~~~~~~~~~~~

- Converted pr_debug|err to dev_ variants in new driver.
- Fixed device tree documentation per Rob's v6 review.
- Picked up Andrew's and Rob's review tags.
- Silenced sparse warnings in new driver.

Changed in v6:
~~~~~~~~~~~~~~

- Added methods for accessing the phy_device time stamping fields.
- Adjust the device tree documentation per Rob's v5 review.
- Fixed the build failures due to missing exports.

Changed in v5:
~~~~~~~~~~~~~~

- Fixed build failure in macvlan.
- Fixed latent bug with its gcc warning in the driver.

Changed in v4:
~~~~~~~~~~~~~~

- Correct error paths and PTR_ERR return values in the framework.
- Expanded KernelDoc comments WRT PHY locking.
- Pick up Andrew's review tag.

Changed in v3:
~~~~~~~~~~~~~~

- Simplify the device tree binding and document the time stamping
  phandle by itself.

Changed in v2:
~~~~~~~~~~~~~~

- Per the v1 review, changed the modeling of MII time stamping
  devices.  They are no longer a kind of mdio device.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

aea3dee8

ptp: Add a driver for InES time stamping IP core. · bad1eaa6

Richard Cochran authored Dec 25, 2019

The InES at the ZHAW offers a PTP time stamping IP core.  The FPGA
logic recognizes and time stamps PTP frames on the MII bus.  This
patch adds a driver for the core along with a device tree binding to
allow hooking the driver to MII buses.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bad1eaa6

net: Introduce peer to peer one step PTP time stamping. · b6fd7b96

Richard Cochran authored Dec 25, 2019

The 1588 standard defines one step operation for both Sync and
PDelay_Resp messages.  Up until now, hardware with P2P one step has
been rare, and kernel support was lacking.  This patch adds support of
the mode in anticipation of new hardware developments.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b6fd7b96

net: mdio: of: Register discovered MII time stampers. · 1dca22b1

Richard Cochran authored Dec 25, 2019

When parsing a PHY node, register its time stamper, if any, and attach
the instance to the PHY device.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1dca22b1

dt-bindings: ptp: Introduce MII time stamping devices. · 25d12e1d

Richard Cochran authored Dec 25, 2019

This patch add a new binding that allows non-PHY MII time stamping
devices to find their buses.  The new documentation covers both the
generic binding and one upcoming user.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

25d12e1d

net: Add a layer for non-PHY MII time stamping drivers. · 767ff483

Richard Cochran authored Dec 25, 2019

While PHY time stamping drivers can simply attach their interface
directly to the PHY instance, stand alone drivers require support in
order to manage their services.  Non-PHY MII time stamping drivers
have a control interface over another bus like I2C, SPI, UART, or via
a memory mapped peripheral.  The controller device will be associated
with one or more time stamping channels, each of which sits snoops in
on a MII bus.

This patch provides a glue layer that will enable time stamping
channels to find their controlling device.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

767ff483

net: Introduce a new MII time stamping interface. · 4715f65f

Richard Cochran authored Dec 25, 2019

Currently the stack supports time stamping in PHY devices.  However,
there are newer, non-PHY devices that can snoop an MII bus and provide
time stamps.  In order to support such devices, this patch introduces
a new interface to be used by both PHY and non-PHY devices.

In addition, the one and only user of the old PHY time stamping API is
converted to the new interface.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

4715f65f

net: phy: dp83640: Move the probe and remove methods around. · 12d0efb9

Richard Cochran authored Dec 25, 2019

An upcoming patch will change how the PHY time stamping functions are
registered with the networking stack, and adapting this driver would
entail adding forward declarations for four time stamping methods.
However, forward declarations are considered to be stylistic defects.
This patch avoids the issue by moving the probe and remove methods
immediately above the phy_driver interface structure.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

12d0efb9

net: netcp_ethss: Use the PHY time stamping interface. · bfd57b59

Richard Cochran authored Dec 25, 2019

The netcp_ethss driver tests fields of the phy_device in order to
determine whether to defer to the PHY's time stamping functionality.
This patch replaces the open coded logic with an invocation of the
proper methods.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bfd57b59

net: ethtool: Use the PHY time stamping interface. · 7774ee23

Richard Cochran authored Dec 25, 2019

The ethtool layer tests fields of the phy_device in order to determine
whether to invoke the PHY's tsinfo ethtool callback.  This patch
replaces the open coded logic with an invocation of the proper
methods.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7774ee23

net: vlan: Use the PHY time stamping interface. · dfe6d68f

Richard Cochran authored Dec 25, 2019

The vlan layer tests fields of the phy_device in order to determine
whether to invoke the PHY's tsinfo ethtool callback.  This patch
replaces the open coded logic with an invocation of the proper
methods.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dfe6d68f

net: macvlan: Use the PHY time stamping interface. · d25de984

Richard Cochran authored Dec 25, 2019

The macvlan layer tests fields of the phy_device in order to determine
whether to invoke the PHY's tsinfo ethtool callback.  This patch
replaces the open coded logic with an invocation of the proper
methods.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d25de984

net: phy: Introduce helper functions for time stamping support. · 0e5dafc8

Richard Cochran authored Dec 25, 2019

Some parts of the networking stack and at least one driver test fields
within the 'struct phy_device' in order to query time stamping
capabilities and to invoke time stamping methods.  This patch adds a
functional interface around the time stamping fields.  This will allow
insulating the callers from future changes to the details of the time
stamping implemenation.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0e5dafc8

25 Dec, 2019 1 commit

Merge branch 'Simplify-IPv6-route-offload-API' · 9f6cff99

David S. Miller authored Dec 24, 2019

Ido Schimmel says:

====================
Simplify IPv6 route offload API

Motivation
==========

This is the IPv6 counterpart of "Simplify IPv4 route offload API" [1].
The aim of this patch set is to simplify the IPv6 route offload API by
making the stack a bit smarter about the notifications it is generating.
This allows driver authors to focus on programming the underlying device
instead of having to duplicate the IPv6 route insertion logic in their
driver, which is error-prone.

Details
=======

Today, whenever an IPv6 route is added or deleted a notification is sent
in the FIB notification chain and it is up to offload drivers to decide
if the route should be programmed to the hardware or not. This is not an
easy task as in hardware routes are keyed by {prefix, prefix length,
table id}, whereas the kernel can store multiple such routes that only
differ in metric / nexthop info.

This series makes sure that only routes that are actually used in the
data path are notified to offload drivers. This greatly simplifies the
work these drivers need to do, as they are now only concerned with
programming the hardware and do not need to replicate the IPv6 route
insertion logic and store multiple identical routes.

The route that is notified is the first route in the IPv6 FIB node,
which represents a single prefix and length in a given table. In case
the route is deleted and there is another route with the same key, a
replace notification is emitted. Otherwise, a delete notification is
emitted.

Unlike IPv4, in IPv6 it is possible to append individual nexthops to an
existing multipath route. Therefore, in addition to the replace and
delete notifications present in IPv4, an append notification is also
used.

Testing
=======

To ensure there is no degradation in route insertion rates, I averaged
the insertion rate of 512k routes (/64 and /128) over 50 runs. Did not
observe any degradation.

Functional tests are available here [2]. They rely on route trap
indication, which is added in a subsequent patch set.

In addition, I have been running syzkaller for the past couple of weeks
with debug options enabled. Did not observe any problems.

Patch set overview
==================

Patches #1-#7 gradually introduce the new FIB notifications
Patch #8 converts mlxsw to use the new notifications
Patch #9 remove the old notifications

[1] https://patchwork.ozlabs.org/cover/1209738/
[2] https://github.com/idosch/linux/tree/fib-notifier
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9f6cff99