1. 28 Dec, 2019 5 commits
    • Eric Dumazet's avatar
      tcp_cubic: make Hystart aware of pacing · ede656e8
      Eric Dumazet authored
      For years we disabled Hystart ACK train detection at Google
      because it was fooled by TCP pacing.
      
      ACK train detection uses a simple heuristic, detecting if
      we receive ACK past half the RTT, to exit slow start before
      hitting the bottleneck and experience massive drops.
      
      But pacing by design might delay packets up to RTT/2,
      so we need to tweak the Hystart logic to be aware of this
      extra delay.
      
      Tested:
       Added a 100 usec delay at receiver.
      
      Before:
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9117
         7057
         9553
         8300
         7030
         6849
         9533
        10126
         6876
         8473
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1230               0.0
      
      After :
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9845
        10103
        10866
        11096
        11936
        11487
        11773
        12188
        11066
        11894
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       6462               0.0
      
      Disabling Hystart ACK Train detection gives similar numbers
      
      echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11173
        10954
        12455
        10627
        11578
        11583
        11222
        10880
        10665
        11366
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ede656e8
    • Eric Dumazet's avatar
      tcp_cubic: tweak Hystart detection for short RTT flows · 42f3a8aa
      Eric Dumazet authored
      After switching ca->delay_min to usec resolution, we exit
      slow start prematurely for very low RTT flows, setting
      snd_ssthresh to 20.
      
      The reason is that delay_min is fed with RTT of small packet
      trains. Then as cwnd is increased, TCP sends bigger TSO packets.
      
      LRO/GRO aggregation and/or interrupt mitigation strategies
      on receiver tend to inflate RTT samples.
      
      Fix this by adding to delay_min the expected delay of
      two TSO packets, given current pacing rate.
      
      Tested:
      
      Sender uses pfifo_fast qdisc
      
      Before :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11348
        11707
        11562
        11428
        11773
        11534
         9878
        11693
        10597
        10968
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       200                0.0
      
      After :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        14877
        14517
        15797
        18466
        17376
        14833
        17558
        17933
        16039
        18059
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1670               0.0
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      42f3a8aa
    • Eric Dumazet's avatar
      tcp_cubic: switch bictcp_clock() to usec resolution · cff04e2d
      Eric Dumazet authored
      Current 1ms clock feeds ca->round_start, ca->delay_min,
      ca->last_ack.
      
      This is quite problematic for data-center flows, where delay_min
      is way below 1 ms.
      
      This means Hystart Train detection triggers every time jiffies value
      is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
      expression becomes true.
      
      This kind of random behavior can be solved by reusing the existing
      usec timestamp that TCP keeps in tp->tcp_mstamp
      
      Note that a followup patch will tweak things a bit, because
      during slow start, GRO aggregation on receivers naturally
      increases the RTT as TSO packets gradually come to ~64KB size.
      
      To recap, right after this patch CUBIC Hystart train detection
      is more aggressive, since short RTT flows might exit slow start at
      cwnd = 20, instead of being possibly unbounded.
      
      Following patch will address this problem.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cff04e2d
    • Eric Dumazet's avatar
      tcp_cubic: remove one conditional from hystart_update() · 35821fc2
      Eric Dumazet authored
      If we initialize ca->curr_rtt to ~0U, we do not need to test
      for zero value in hystart_update()
      
      We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
      been processed, and thus ca->curr_rtt will have a sane value.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35821fc2
    • Eric Dumazet's avatar
      tcp_cubic: optimize hystart_update() · 473900a5
      Eric Dumazet authored
      We do not care which bit in ca->found is set.
      
      We avoid accessing hystart and hystart_detect unless really needed,
      possibly avoiding one cache line miss.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      473900a5
  2. 27 Dec, 2019 2 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 2bbc078f
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2019-12-27
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      We've added 127 non-merge commits during the last 17 day(s) which contain
      a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).
      
      There are three merge conflicts. Conflicts and resolution looks as follows:
      
      1) Merge conflict in net/bpf/test_run.c:
      
      There was a tree-wide cleanup c593642c ("treewide: Use sizeof_field() macro")
      which gets in the way with b590cb5f ("bpf: Switch to offsetofend in
      BPF_PROG_TEST_RUN"):
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                                   sizeof_field(struct __sk_buff, priority),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
        >>>>>>> 7c8dce4b
      
      There are a few occasions that look similar to this. Always take the chunk with
      offsetofend(). Note that there is one where the fields differ in here:
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                                   sizeof_field(struct __sk_buff, tstamp),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
        >>>>>>> 7c8dce4b
      
      Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
      850a88cc ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").
      
      2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:
      
      (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)
      
        <<<<<<< HEAD
                if (is_13b_check(off, insn))
                        return -1;
                emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
        =======
                emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
        >>>>>>> 7c8dce4b
      
      Result should look like:
      
                emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
      
      3) Merge conflict in arch/riscv/include/asm/pgtable.h:
      
        <<<<<<< HEAD
        =======
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        #define vmemmap         ((struct page *)VMEMMAP_START)
      
        >>>>>>> 7c8dce4b
      
      Only take the BPF_* defines from there and move them higher up in the
      same file. Remove the rest from the chunk. The VMALLOC_* etc defines
      got moved via 01f52e16 ("riscv: define vmemmap before pfn_to_page
      calls"). Result:
      
        [...]
        #define __S101  PAGE_READ_EXEC
        #define __S110  PAGE_SHARED_EXEC
        #define __S111  PAGE_SHARED_EXEC
      
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        [...]
      
      Let me know if there are any other issues.
      
      Anyway, the main changes are:
      
      1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
         to a provided BPF object file. This provides an alternative, simplified API
         compared to standard libbpf interaction. Also, add libbpf extern variable
         resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.
      
      2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
         generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
         add various BPF riscv JIT improvements, from Björn Töpel.
      
      3) Extend bpftool to allow matching BPF programs and maps by name,
         from Paul Chaignon.
      
      4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
         flag for allowing updates without service interruption, from Andrey Ignatov.
      
      5) Cleanup and simplification of ring access functions for AF_XDP with a
         bonus of 0-5% performance improvement, from Magnus Karlsson.
      
      6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
         audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.
      
      7) Move and extend test_select_reuseport into BPF program tests under
         BPF selftests, from Jakub Sitnicki.
      
      8) Various BPF sample improvements for xdpsock for customizing parameters
         to set up and benchmark AF_XDP, from Jay Jayatheerthan.
      
      9) Improve libbpf to provide a ulimit hint on permission denied errors.
         Also change XDP sample programs to attach in driver mode by default,
         from Toke Høiland-Jørgensen.
      
      10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
          programs, from Nikita V. Shirokov.
      
      11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.
      
      12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
          libbpf conversion, from Jesper Dangaard Brouer.
      
      13) Minor misc improvements from various others.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2bbc078f
    • Andrii Nakryiko's avatar
      bpftool: Make skeleton C code compilable with C++ compiler · 7c8dce4b
      Andrii Nakryiko authored
      When auto-generated BPF skeleton C code is included from C++ application, it
      triggers compilation error due to void * being implicitly casted to whatever
      target pointer type. This is supported by C, but not C++. To solve this
      problem, add explicit casts, where necessary.
      
      To ensure issues like this are captured going forward, add skeleton usage in
      test_cpp test.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191226210253.3132060-1-andriin@fb.com
      7c8dce4b
  3. 26 Dec, 2019 32 commits
  4. 25 Dec, 2019 1 commit
    • David S. Miller's avatar
      Merge branch 'Simplify-IPv6-route-offload-API' · 9f6cff99
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      Simplify IPv6 route offload API
      
      Motivation
      ==========
      
      This is the IPv6 counterpart of "Simplify IPv4 route offload API" [1].
      The aim of this patch set is to simplify the IPv6 route offload API by
      making the stack a bit smarter about the notifications it is generating.
      This allows driver authors to focus on programming the underlying device
      instead of having to duplicate the IPv6 route insertion logic in their
      driver, which is error-prone.
      
      Details
      =======
      
      Today, whenever an IPv6 route is added or deleted a notification is sent
      in the FIB notification chain and it is up to offload drivers to decide
      if the route should be programmed to the hardware or not. This is not an
      easy task as in hardware routes are keyed by {prefix, prefix length,
      table id}, whereas the kernel can store multiple such routes that only
      differ in metric / nexthop info.
      
      This series makes sure that only routes that are actually used in the
      data path are notified to offload drivers. This greatly simplifies the
      work these drivers need to do, as they are now only concerned with
      programming the hardware and do not need to replicate the IPv6 route
      insertion logic and store multiple identical routes.
      
      The route that is notified is the first route in the IPv6 FIB node,
      which represents a single prefix and length in a given table. In case
      the route is deleted and there is another route with the same key, a
      replace notification is emitted. Otherwise, a delete notification is
      emitted.
      
      Unlike IPv4, in IPv6 it is possible to append individual nexthops to an
      existing multipath route. Therefore, in addition to the replace and
      delete notifications present in IPv4, an append notification is also
      used.
      
      Testing
      =======
      
      To ensure there is no degradation in route insertion rates, I averaged
      the insertion rate of 512k routes (/64 and /128) over 50 runs. Did not
      observe any degradation.
      
      Functional tests are available here [2]. They rely on route trap
      indication, which is added in a subsequent patch set.
      
      In addition, I have been running syzkaller for the past couple of weeks
      with debug options enabled. Did not observe any problems.
      
      Patch set overview
      ==================
      
      Patches #1-#7 gradually introduce the new FIB notifications
      Patch #8 converts mlxsw to use the new notifications
      Patch #9 remove the old notifications
      
      [1] https://patchwork.ozlabs.org/cover/1209738/
      [2] https://github.com/idosch/linux/tree/fib-notifier
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9f6cff99