1. 13 Dec, 2019 10 commits
  2. 12 Dec, 2019 2 commits
  3. 11 Dec, 2019 13 commits
    • Daniel Borkmann's avatar
      bpf: Emit audit messages upon successful prog load and unload · bae141f5
      Daniel Borkmann authored
      Allow for audit messages to be emitted upon BPF program load and
      unload for having a timeline of events. The load itself is in
      syscall context, so additional info about the process initiating
      the BPF prog creation can be logged and later directly correlated
      to the unload event.
      
      The only info really needed from BPF side is the globally unique
      prog ID where then audit user space tooling can query / dump all
      info needed about the specific BPF program right upon load event
      and enrich the record, thus these changes needed here can be kept
      small and non-intrusive to the core.
      
      Raw example output:
      
        # auditctl -D
        # auditctl -a always,exit -F arch=x86_64 -S bpf
        # ausearch --start recent -m 1334
        ...
        ----
        time->Wed Nov 27 16:04:13 2019
        type=PROCTITLE msg=audit(1574867053.120:84664): proctitle="./bpf"
        type=SYSCALL msg=audit(1574867053.120:84664): arch=c000003e syscall=321   \
          success=yes exit=3 a0=5 a1=7ffea484fbe0 a2=70 a3=0 items=0 ppid=7477    \
          pid=12698 auid=1001 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001    \
          egid=1001 sgid=1001 fsgid=1001 tty=pts2 ses=4 comm="bpf"                \
          exe="/home/jolsa/auditd/audit-testsuite/tests/bpf/bpf"                  \
          subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
        type=UNKNOWN[1334] msg=audit(1574867053.120:84664): prog-id=76 op=LOAD
        ----
        time->Wed Nov 27 16:04:13 2019
        type=UNKNOWN[1334] msg=audit(1574867053.120:84665): prog-id=76 op=UNLOAD
        ...
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Co-developed-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Link: https://lore.kernel.org/bpf/20191206214934.11319-1-jolsa@kernel.org
      bae141f5
    • Stanislav Fomichev's avatar
      bpf: Switch to offsetofend in BPF_PROG_TEST_RUN · b590cb5f
      Stanislav Fomichev authored
      Switch existing pattern of "offsetof(..., member) + FIELD_SIZEOF(...,
      member)' to "offsetofend(..., member)" which does exactly what
      we need without all the copy-paste.
      Suggested-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191210191933.105321-1-sdf@google.com
      b590cb5f
    • Andrii Nakryiko's avatar
      libbpf: Bump libpf current version to v0.0.7 · 09c4708d
      Andrii Nakryiko authored
      New development cycles starts, bump to v0.0.7 proactively.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191209224022.3544519-1-andriin@fb.com
      09c4708d
    • Russell King's avatar
      ARM: net: bpf: Improve prologue code sequence · c4533128
      Russell King authored
      Improve the prologue code sequence to be able to take advantage of
      64-bit stores, changing the code from:
      
        push    {r4, r5, r6, r7, r8, r9, fp, lr}
        mov     fp, sp
        sub     ip, sp, #80     ; 0x50
        sub     sp, sp, #600    ; 0x258
        str     ip, [fp, #-100] ; 0xffffff9c
        mov     r6, #0
        str     r6, [fp, #-96]  ; 0xffffffa0
        mov     r4, #0
        mov     r3, r4
        mov     r2, r0
        str     r4, [fp, #-104] ; 0xffffff98
        str     r4, [fp, #-108] ; 0xffffff94
      
      to the tighter:
      
        push    {r4, r5, r6, r7, r8, r9, fp, lr}
        mov     fp, sp
        mov     r3, #0
        sub     r2, sp, #80     ; 0x50
        sub     sp, sp, #600    ; 0x258
        strd    r2, [fp, #-100] ; 0xffffff9c
        mov     r2, #0
        strd    r2, [fp, #-108] ; 0xffffff94
        mov     r2, r0
      
      resulting in a saving of three instructions.
      Signed-off-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/E1ieH2g-0004ih-Rb@rmk-PC.armlinux.org.uk
      c4533128
    • Shahjada Abul Husain's avatar
      cxgb4: add support for high priority filters · c2193999
      Shahjada Abul Husain authored
      T6 has a separate region known as high priority filter region
      that allows classifying packets going through ULD path. So,
      query firmware for HPFILTER resources and enable the high
      priority offload filter support when it is available.
      Signed-off-by: default avatarShahjada Abul Husain <shahjada@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c2193999
    • Chen Wandun's avatar
      enetc: remove variable 'tc_max_sized_frame' set but not used · 6525b5ef
      Chen Wandun authored
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      drivers/net/ethernet/freescale/enetc/enetc_qos.c: In function enetc_setup_tc_cbs:
      drivers/net/ethernet/freescale/enetc/enetc_qos.c:195:6: warning: variable tc_max_sized_frame set but not used [-Wunused-but-set-variable]
      
      Fixes: c431047c ("enetc: add support Credit Based Shaper(CBS) for hardware offload")
      Signed-off-by: default avatarChen Wandun <chenwandun@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6525b5ef
    • Jakub Kicinski's avatar
      nfp: add support for TLV device stats · ca866ee8
      Jakub Kicinski authored
      Device stats are currently hard coded in the PCI BAR0 layout.
      Add a ability to read them from the TLV area instead.
      Names for the stats are maintained by the driver, and their
      meaning documented. This allows us to more easily add and
      remove device stats.
      Signed-off-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ca866ee8
    • Kuniyuki Iwashima's avatar
      tcp: Cleanup duplicate initialization of sk->sk_state. · 5000b28b
      Kuniyuki Iwashima authored
      When a TCP socket is created, sk->sk_state is initialized twice as
      TCP_CLOSE in sock_init_data() and tcp_init_sock(). The tcp_init_sock() is
      always called after the sock_init_data(), so it is not necessary to update
      sk->sk_state in the tcp_init_sock().
      
      Before v2.1.8, the code of the two functions was in the inet_create(). In
      the patch of v2.1.8, the tcp_v4/v6_init_sock() were added and the code of
      initialization of sk->state was duplicated.
      Signed-off-by: default avatarKuniyuki Iwashima <kuni1840@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5000b28b
    • Michael Walle's avatar
      enetc: add software timestamping · 4caefbce
      Michael Walle authored
      Provide a software TX timestamp and add it to the ethtool query
      interface.
      
      skb_tx_timestamp() is also needed if one would like to use PHY
      timestamping.
      Signed-off-by: default avatarMichael Walle <michael@walle.cc>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4caefbce
    • David S. Miller's avatar
      Merge branch 'tipc-introduce-variable-window-congestion-control' · bb9d8454
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: introduce variable window congestion control
      
      We improve thoughput greatly by introducing a variety of the Reno
      congestion control algorithm at the link level.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bb9d8454
    • Jon Maloy's avatar
      tipc: introduce variable window congestion control · 16ad3f40
      Jon Maloy authored
      We introduce a simple variable window congestion control for links.
      The algorithm is inspired by the Reno algorithm, covering both 'slow
      start', 'congestion avoidance', and 'fast recovery' modes.
      
      - We introduce hard lower and upper window limits per link, still
        different and configurable per bearer type.
      
      - We introduce a 'slow start theshold' variable, initially set to
        the maximum window size.
      
      - We let a link start at the minimum congestion window, i.e. in slow
        start mode, and then let is grow rapidly (+1 per rceived ACK) until
        it reaches the slow start threshold and enters congestion avoidance
        mode.
      
      - In congestion avoidance mode we increment the congestion window for
        each window-size number of acked packets, up to a possible maximum
        equal to the configured maximum window.
      
      - For each non-duplicate NACK received, we drop back to fast recovery
        mode, by setting the both the slow start threshold to and the
        congestion window to (current_congestion_window / 2).
      
      - If the timeout handler finds that the transmit queue has not moved
        since the previous timeout, it drops the link back to slow start
        and forces a probe containing the last sent sequence number to the
        sent to the peer, so that this can discover the stale situation.
      
      This change does in reality have effect only on unicast ethernet
      transport, as we have seen that there is no room whatsoever for
      increasing the window max size for the UDP bearer.
      For now, we also choose to keep the limits for the broadcast link
      unchanged and equal.
      
      This algorithm seems to give a 50-100% throughput improvement for
      messages larger than MTU.
      Suggested-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      16ad3f40
    • Jon Maloy's avatar
      tipc: eliminate more unnecessary nacks and retransmissions · d3b09995
      Jon Maloy authored
      When we increase the link tranmsit window we often observe the following
      scenario:
      
      1) A STATE message bypasses a sequence of traffic packets and arrives
         far ahead of those to the receiver. STATE messages contain a
         'peers_nxt_snt' field to indicate which was the last packet sent
         from the peer. This mechanism is intended as a last resort for the
         receiver to detect missing packets, e.g., during very low traffic
         when there is no packet flow to help early loss detection.
      3) The receiving link compares the 'peer_nxt_snt' field to its own
         'rcv_nxt', finds that there is a gap, and immediately sends a
         NACK message back to the peer.
      4) When this NACKs arrives at the sender, all the requested
         retransmissions are performed, since it is a first-time request.
      
      Just like in the scenario described in the previous commit this leads
      to many redundant retransmissions, with decreased throughput as a
      consequence.
      
      We fix this by adding two more conditions before we send a NACK in
      this sitution. First, the deferred queue must be empty, so we cannot
      assume that the potential packet loss has already been detected by
      other means. Second, we check the 'peers_snd_nxt' field only in probe/
      probe_reply messages, thus turning this into a true mechanism of last
      resort as it was really meant to be.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d3b09995
    • Jon Maloy's avatar
      tipc: eliminate gap indicator from ACK messages · 02288248
      Jon Maloy authored
      When we increase the link send window we sometimes observe the
      following scenario:
      
      1) A packet #N arrives out of order far ahead of a sequence of older
         packets which are still under way. The packet is added to the
         deferred queue.
      2) The missing packets arrive in sequence, and for each 16th of them
         an ACK is sent back to the receiver, as it should be.
      3) When building those ACK messages, it is checked if there is a gap
         between the link's 'rcv_nxt' and the first packet in the deferred
         queue. This is always the case until packet number #N-1 arrives, and
         a 'gap' indicator is added, effectively turning them into NACK
         messages.
      4) When those NACKs arrive at the sender, all the requested
         retransmissions are done, since it is a first-time request.
      
      This sometimes leads to a huge amount of redundant retransmissions,
      causing a drop in max throughput. This problem gets worse when we
      in a later commit introduce variable window congestion control,
      since it drops the link back to 'fast recovery' much more often
      than necessary.
      
      We now fix this by not sending any 'gap' indicator in regular ACK
      messages. We already have a mechanism for sending explicit NACKs
      in place, and this is sufficient to keep up the packet flow.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02288248
  4. 10 Dec, 2019 8 commits
  5. 09 Dec, 2019 5 commits
    • Russell King's avatar
      net: sfp: avoid tx-fault with Nokia GPON module · 26c97a2d
      Russell King authored
      The Nokia GPON module can hold tx-fault active while it is initialising
      which can take up to 60s. Avoid this causing the module to be declared
      faulty after the SFP MSA defined non-cooled module timeout.
      Signed-off-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26c97a2d
    • Colin Ian King's avatar
      qed: remove redundant assignments to rc · e70ac628
      Colin Ian King authored
      The variable rc is assigned with a value that is never read and
      it is re-assigned a new value later on.  The assignment is redundant
      and can be removed.  Clean up multiple occurrances of this pattern.
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e70ac628
    • Mao Wenan's avatar
      NFC: port100: Convert cpu_to_le16(le16_to_cpu(E1) + E2) to use le16_add_cpu(). · 718eae27
      Mao Wenan authored
      Convert cpu_to_le16(le16_to_cpu(frame->datalen) + len) to
      use le16_add_cpu(), which is more concise and does the same thing.
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarMao Wenan <maowenan@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      718eae27
    • David S. Miller's avatar
      Merge branch 'for-upstream' of... · 4a63ef71
      David S. Miller authored
      Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
      
      Johan Hedberg says:
      
      ====================
      pull request: bluetooth-next 2019-12-09
      
      Here's the first bluetooth-next pull request for 5.6:
      
       - Devicetree bindings updates for Broadcom controllers
       - Add support for PCM configuration for Broadcom controllers
       - btusb: Fixes for Realtek devices
       - butsb: A few other smaller fixes (mem leak & non-atomic allocation issue)
      
      Please let me know if there are any issues pulling. Thanks.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4a63ef71
    • Jason A. Donenfeld's avatar
      net: WireGuard secure network tunnel · e7096c13
      Jason A. Donenfeld authored
      WireGuard is a layer 3 secure networking tunnel made specifically for
      the kernel, that aims to be much simpler and easier to audit than IPsec.
      Extensive documentation and description of the protocol and
      considerations, along with formal proofs of the cryptography, are
      available at:
      
        * https://www.wireguard.com/
        * https://www.wireguard.com/papers/wireguard.pdf
      
      This commit implements WireGuard as a simple network device driver,
      accessible in the usual RTNL way used by virtual network drivers. It
      makes use of the udp_tunnel APIs, GRO, GSO, NAPI, and the usual set of
      networking subsystem APIs. It has a somewhat novel multicore queueing
      system designed for maximum throughput and minimal latency of encryption
      operations, but it is implemented modestly using workqueues and NAPI.
      Configuration is done via generic Netlink, and following a review from
      the Netlink maintainer a year ago, several high profile userspace tools
      have already implemented the API.
      
      This commit also comes with several different tests, both in-kernel
      tests and out-of-kernel tests based on network namespaces, taking profit
      of the fact that sockets used by WireGuard intentionally stay in the
      namespace the WireGuard interface was originally created, exactly like
      the semantics of userspace tun devices. See wireguard.com/netns/ for
      pictures and examples.
      
      The source code is fairly short, but rather than combining everything
      into a single file, WireGuard is developed as cleanly separable files,
      making auditing and comprehension easier. Things are laid out as
      follows:
      
        * noise.[ch], cookie.[ch], messages.h: These implement the bulk of the
          cryptographic aspects of the protocol, and are mostly data-only in
          nature, taking in buffers of bytes and spitting out buffers of
          bytes. They also handle reference counting for their various shared
          pieces of data, like keys and key lists.
      
        * ratelimiter.[ch]: Used as an integral part of cookie.[ch] for
          ratelimiting certain types of cryptographic operations in accordance
          with particular WireGuard semantics.
      
        * allowedips.[ch], peerlookup.[ch]: The main lookup structures of
          WireGuard, the former being trie-like with particular semantics, an
          integral part of the design of the protocol, and the latter just
          being nice helper functions around the various hashtables we use.
      
        * device.[ch]: Implementation of functions for the netdevice and for
          rtnl, responsible for maintaining the life of a given interface and
          wiring it up to the rest of WireGuard.
      
        * peer.[ch]: Each interface has a list of peers, with helper functions
          available here for creation, destruction, and reference counting.
      
        * socket.[ch]: Implementation of functions related to udp_socket and
          the general set of kernel socket APIs, for sending and receiving
          ciphertext UDP packets, and taking care of WireGuard-specific sticky
          socket routing semantics for the automatic roaming.
      
        * netlink.[ch]: Userspace API entry point for configuring WireGuard
          peers and devices. The API has been implemented by several userspace
          tools and network management utility, and the WireGuard project
          distributes the basic wg(8) tool.
      
        * queueing.[ch]: Shared function on the rx and tx path for handling
          the various queues used in the multicore algorithms.
      
        * send.c: Handles encrypting outgoing packets in parallel on
          multiple cores, before sending them in order on a single core, via
          workqueues and ring buffers. Also handles sending handshake and cookie
          messages as part of the protocol, in parallel.
      
        * receive.c: Handles decrypting incoming packets in parallel on
          multiple cores, before passing them off in order to be ingested via
          the rest of the networking subsystem with GRO via the typical NAPI
          poll function. Also handles receiving handshake and cookie messages
          as part of the protocol, in parallel.
      
        * timers.[ch]: Uses the timer wheel to implement protocol particular
          event timeouts, and gives a set of very simple event-driven entry
          point functions for callers.
      
        * main.c, version.h: Initialization and deinitialization of the module.
      
        * selftest/*.h: Runtime unit tests for some of the most security
          sensitive functions.
      
        * tools/testing/selftests/wireguard/netns.sh: Aforementioned testing
          script using network namespaces.
      
      This commit aims to be as self-contained as possible, implementing
      WireGuard as a standalone module not needing much special handling or
      coordination from the network subsystem. I expect for future
      optimizations to the network stack to positively improve WireGuard, and
      vice-versa, but for the time being, this exists as intentionally
      standalone.
      
      We introduce a menu option for CONFIG_WIREGUARD, as well as providing a
      verbose debug log and self-tests via CONFIG_WIREGUARD_DEBUG.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: linux-crypto@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7096c13
  6. 08 Dec, 2019 2 commits
    • Linus Torvalds's avatar
      Linux 5.5-rc1 · e42617b8
      Linus Torvalds authored
      e42617b8
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 95e6ba51
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) More jumbo frame fixes in r8169, from Heiner Kallweit.
      
       2) Fix bpf build in minimal configuration, from Alexei Starovoitov.
      
       3) Use after free in slcan driver, from Jouni Hogander.
      
       4) Flower classifier port ranges don't work properly in the HW offload
          case, from Yoshiki Komachi.
      
       5) Use after free in hns3_nic_maybe_stop_tx(), from Yunsheng Lin.
      
       6) Out of bounds access in mqprio_dump(), from Vladyslav Tarasiuk.
      
       7) Fix flow dissection in dsa TX path, from Alexander Lobakin.
      
       8) Stale syncookie timestampe fixes from Guillaume Nault.
      
      [ Did an evil merge to silence a warning introduced by this pull - Linus ]
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
        r8169: fix rtl_hw_jumbo_disable for RTL8168evl
        net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add()
        r8169: add missing RX enabling for WoL on RTL8125
        vhost/vsock: accept only packets with the right dst_cid
        net: phy: dp83867: fix hfs boot in rgmii mode
        net: ethernet: ti: cpsw: fix extra rx interrupt
        inet: protect against too small mtu values.
        gre: refetch erspan header from skb->data after pskb_may_pull()
        pppoe: remove redundant BUG_ON() check in pppoe_pernet
        tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
        tcp: tighten acceptance of ACKs not matching a child socket
        tcp: fix rejected syncookies due to stale timestamps
        lpc_eth: kernel BUG on remove
        tcp: md5: fix potential overestimation of TCP option space
        net: sched: allow indirect blocks to bind to clsact in TC
        net: core: rename indirect block ingress cb function
        net-sysfs: Call dev_hold always in netdev_queue_add_kobject
        net: dsa: fix flow dissection on Tx path
        net/tls: Fix return values to avoid ENOTSUPP
        net: avoid an indirect call in ____sys_recvmsg()
        ...
      95e6ba51