1. 06 May, 2022 9 commits
  2. 05 May, 2022 19 commits
    • Tetsuo Handa's avatar
      net: rds: use maybe_get_net() when acquiring refcount on TCP sockets · 6997fbd7
      Tetsuo Handa authored
      Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
      delayed works queued in rds_wq might be invoked after a net namespace's
      refcount already reached 0.
      
      Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
      it is guaranteed that we can instead use maybe_get_net() from delayed work
      functions until rds_tcp_exit_net() returns.
      
      Note that I'm not convinced that all works which might access a net
      namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
      calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
      will fail to wait for work functions, and kmem_cache_free() could be
      called from net_free() before maybe_get_net() is called from
      rds_tcp_tune().
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 3a58f13a ("net: rds: acquire refcount on TCP sockets")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jpSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6997fbd7
    • Linus Torvalds's avatar
      Merge tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 68533eb1
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from can, rxrpc and wireguard.
      
        Previous releases - regressions:
      
         - igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
      
         - mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
      
         - rds: acquire netns refcount on TCP sockets
      
         - rxrpc: enable IPv6 checksums on transport socket
      
         - nic: hinic: fix bug of wq out of bound access
      
         - nic: thunder: don't use pci_irq_vector() in atomic context
      
         - nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS
           flag
      
         - nic: mlx5e:
            - lag, fix use-after-free in fib event handler
            - fix deadlock in sync reset flow
      
        Previous releases - always broken:
      
         - tcp: fix insufficient TCP source port randomness
      
         - can: grcan: grcan_close(): fix deadlock
      
         - nfc: reorder destructive operations in to avoid bugs
      
        Misc:
      
         - wireguard: improve selftests reliability"
      
      * tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
        NFC: netlink: fix sleep in atomic bug when firmware download timeout
        selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
        tcp: drop the hash_32() part from the index calculation
        tcp: increase source port perturb table to 2^16
        tcp: dynamically allocate the perturb table used by source ports
        tcp: add small random increments to the source port
        tcp: resalt the secret every 10 seconds
        tcp: use different parts of the port_offset for index and offset
        secure_seq: use the 64 bits of the siphash for port offset calculation
        wireguard: selftests: set panic_on_warn=1 from cmdline
        wireguard: selftests: bump package deps
        wireguard: selftests: restore support for ccache
        wireguard: selftests: use newer toolchains to fill out architectures
        wireguard: selftests: limit parallelism to $(nproc) tests at once
        wireguard: selftests: make routing loop test non-fatal
        net/mlx5: Fix matching on inner TTC
        net/mlx5: Avoid double clear or set of sync reset requested
        net/mlx5: Fix deadlock in sync reset flow
        net/mlx5e: Fix trust state reset in reload
        net/mlx5e: Avoid checking offload capability in post_parse action
        ...
      68533eb1
    • Duoming Zhou's avatar
      NFC: netlink: fix sleep in atomic bug when firmware download timeout · 4071bf12
      Duoming Zhou authored
      There are sleep in atomic bug that could cause kernel panic during
      firmware download process. The root cause is that nlmsg_new with
      GFP_KERNEL parameter is called in fw_dnld_timeout which is a timer
      handler. The call trace is shown below:
      
      BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
      Call Trace:
      kmem_cache_alloc_node
      __alloc_skb
      nfc_genl_fw_download_done
      call_timer_fn
      __run_timers.part.0
      run_timer_softirq
      __do_softirq
      ...
      
      The nlmsg_new with GFP_KERNEL parameter may sleep during memory
      allocation process, and the timer handler is run as the result of
      a "software interrupt" that should not call any other function
      that could sleep.
      
      This patch changes allocation mode of netlink message from GFP_KERNEL
      to GFP_ATOMIC in order to prevent sleep in atomic bug. The GFP_ATOMIC
      flag makes memory allocation operation could be used in atomic context.
      
      Fixes: 9674da87 ("NFC: Add firmware upload netlink command")
      Fixes: 9ea7187c ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
      Signed-off-by: default avatarDuoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: default avatarKrzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cnSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      4071bf12
    • Vladimir Oltean's avatar
      selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer · 5a7c5f70
      Vladimir Oltean authored
      As discussed here with Ido Schimmel:
      https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/
      
      the default conform-exceed action is "reclassify", for a reason we don't
      really understand.
      
      The point is that hardware can't offload that police action, so not
      specifying "conform-exceed" was always wrong, even though the command
      used to work in hardware (but not in software) until the kernel started
      adding validation for it.
      
      Fix the command used by the selftest by making the policer drop on
      exceed, and pass the packet to the next action (goto) on conform.
      
      Fixes: 8cd6b020 ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5a7c5f70
    • Jakub Kicinski's avatar
      Merge branch 'insufficient-tcp-source-port-randomness' · ef562489
      Jakub Kicinski authored
      Willy Tarreau says:
      
      ====================
      insufficient TCP source port randomness
      
      In a not-yet published paper, Moshe Kol, Amit Klein, and Yossi Gilad
      report being able to accurately identify a client by forcing it to emit
      only 40 times more connections than the number of entries in the
      table_perturb[] table, which is indexed by hashing the connection tuple.
      The current 2^8 setting allows them to perform that attack with only 10k
      connections, which is not hard to achieve in a few seconds.
      
      Eric, Amit and I have been working on this for a few weeks now imagining,
      testing and eliminating a number of approaches that Amit and his team were
      still able to break or that were found to be too risky or too expensive,
      and ended up with the simple improvements in this series that resists to
      the attack, doesn't degrade the performance, and preserves a reliable port
      selection algorithm to avoid connection failures, including the odd/even
      port selection preference that allows bind() to always find a port quickly
      even under strong connect() stress.
      
      The approach relies on several factors:
        - resalting the hash secret that's used to choose the table_perturb[]
          entry every 10 seconds to eliminate slow attacks and force the
          attacker to forget everything that was learned after this delay.
          This already eliminates most of the problem because if a client
          stays silent for more than 10 seconds there's no link between the
          previous and the next patterns, and 10s isn't yet frequent enough
          to cause too frequent repetition of a same port that may induce a
          connection failure ;
      
        - adding small random increments to the source port. Previously, a
          random 0 or 1 was added every 16 ports. Now a random 0 to 7 is
          added after each port. This means that with the default 32768-60999
          range, a worst case rollover happens after 1764 connections, and
          an average of 3137. This doesn't stop statistical attacks but
          requires significantly more iterations of the same attack to
          confirm a guess.
      
        - increasing the table_perturb[] size from 2^8 to 2^16, which Amit
          says will require 2.6 million connections to be attacked with the
          changes above, making it pointless to get a fingerprint that will
          only last 10 seconds. Due to the size, the table was made dynamic.
      
        - a few minor improvements on the bits used from the hash, to eliminate
          some unfortunate correlations that may possibly have been exploited
          to design future attack models.
      
      These changes were tested under the most extreme conditions, up to
      1.1 million connections per second to one and a few targets, showing no
      performance regression, and only 2 connection failures within 13 billion,
      which is less than 2^-32 and perfectly within usual values.
      
      The series is split into small reviewable changes and was already reviewed
      by Amit and Eric.
      ====================
      
      Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.euSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ef562489
    • Willy Tarreau's avatar
      tcp: drop the hash_32() part from the index calculation · e8161345
      Willy Tarreau authored
      In commit 190cc824 ("tcp: change source port randomizarion at
      connect() time"), the table_perturb[] array was introduced and an
      index was taken from the port_offset via hash_32(). But it turns
      out that hash_32() performs a multiplication while the input here
      comes from the output of SipHash in secure_seq, that is well
      distributed enough to avoid the need for yet another hash.
      Suggested-by: default avatarAmit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e8161345
    • Willy Tarreau's avatar
      tcp: increase source port perturb table to 2^16 · 4c2c8f03
      Willy Tarreau authored
      Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
      identify a client by forcing it to emit only 40 times more connections
      than there are entries in the table_perturb[] table. The previous two
      improvements consisting in resalting the secret every 10s and adding
      randomness to each port selection only slightly improved the situation,
      and the current value of 2^8 was too small as it's not very difficult
      to make a client emit 10k connections in less than 10 seconds.
      
      Thus we're increasing the perturb table from 2^8 to 2^16 so that the
      same precision now requires 2.6M connections, which is more difficult in
      this time frame and harder to hide as a background activity. The impact
      is that the table now uses 256 kB instead of 1 kB, which could mostly
      affect devices making frequent outgoing connections. However such
      components usually target a small set of destinations (load balancers,
      database clients, perf assessment tools), and in practice only a few
      entries will be visited, like before.
      
      A live test at 1 million connections per second showed no performance
      difference from the previous value.
      Reported-by: default avatarMoshe Kol <moshe.kol@mail.huji.ac.il>
      Reported-by: default avatarYossi Gilad <yossi.gilad@mail.huji.ac.il>
      Reported-by: default avatarAmit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4c2c8f03
    • Willy Tarreau's avatar
      tcp: dynamically allocate the perturb table used by source ports · e9261476
      Willy Tarreau authored
      We'll need to further increase the size of this table and it's likely
      that at some point its size will not be suitable anymore for a static
      table. Let's allocate it on boot from inet_hashinfo2_init(), which is
      called from tcp_init().
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e9261476
    • Willy Tarreau's avatar
      tcp: add small random increments to the source port · ca7af040
      Willy Tarreau authored
      Here we're randomly adding between 0 and 7 random increments to the
      selected source port in order to add some noise in the source port
      selection that will make the next port less predictable.
      
      With the default port range of 32768-60999 this means a worst case
      reuse scenario of 14116/8=1764 connections between two consecutive
      uses of the same port, with an average of 14116/4.5=3137. This code
      was stressed at more than 800000 connections per second to a fixed
      target with all connections closed by the client using RSTs (worst
      condition) and only 2 connections failed among 13 billion, despite
      the hash being reseeded every 10 seconds, indicating a perfectly
      safe situation.
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ca7af040
    • Eric Dumazet's avatar
      tcp: resalt the secret every 10 seconds · 4dfa9b43
      Eric Dumazet authored
      In order to limit the ability for an observer to recognize the source
      ports sequence used to contact a set of destinations, we should
      periodically shuffle the secret. 10 seconds looks effective enough
      without causing particular issues.
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Tested-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4dfa9b43
    • Willy Tarreau's avatar
      tcp: use different parts of the port_offset for index and offset · 9e9b70ae
      Willy Tarreau authored
      Amit Klein suggests that we use different parts of port_offset for the
      table's index and the port offset so that there is no direct relation
      between them.
      
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      9e9b70ae
    • Willy Tarreau's avatar
      secure_seq: use the 64 bits of the siphash for port offset calculation · b2d05756
      Willy Tarreau authored
      SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
      7cd23e53 ("secure_seq: use SipHash in place of MD5"), but the output
      remained truncated to 32-bit only. In order to exploit more bits from the
      hash, let's make the functions return the full 64-bit of siphash_3u32().
      We also make sure the port offset calculation in __inet_hash_connect()
      remains done on 32-bit to avoid the need for div_u64_rem() and an extra
      cost on 32-bit systems.
      
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b2d05756
    • Jakub Kicinski's avatar
      Merge branch 'wireguard-patches-for-5-18-rc6' · 205557ba
      Jakub Kicinski authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard patches for 5.18-rc6
      
      In working on some other problems, I wound up leaning on the WireGuard
      CI more than usual and uncovered a few small issues with reliability.
      These are fairly low key changes, since they don't impact kernel code
      itself.
      
      One change does stick out in particular, though, which is the "make
      routing loop test non-fatal" commit. I'm not thrilled about doing this,
      but currently [1] remains unsolved, and I'm still working on a real
      solution to that (hopefully for 5.19 or 5.20 if I can come up with a
      good idea...), so for now that test just prints a big red warning
      instead.
      
      [1] https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/
      ====================
      
      Link: https://lore.kernel.org/r/20220504202920.72908-1-Jason@zx2c4.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      205557ba
    • Jason A. Donenfeld's avatar
      wireguard: selftests: set panic_on_warn=1 from cmdline · 3fc1b11e
      Jason A. Donenfeld authored
      Rather than setting this once init is running, set panic_on_warn from
      the kernel command line, so that it catches splats from WireGuard
      initialization code and the various crypto selftests.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3fc1b11e
    • Jason A. Donenfeld's avatar
      wireguard: selftests: bump package deps · a6b8ea91
      Jason A. Donenfeld authored
      Use newer, more reliable package dependencies. These should hopefully
      reduce flakes. However, we keep the old iputils package, as it
      accumulated bugs after resulting in flakes on slow machines.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      a6b8ea91
    • Jason A. Donenfeld's avatar
      wireguard: selftests: restore support for ccache · d261ba6a
      Jason A. Donenfeld authored
      When moving to non-system toolchains, we inadvertantly killed the
      ability to use ccache. So instead, build ccache support into the test
      harness directly.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d261ba6a
    • Jason A. Donenfeld's avatar
      wireguard: selftests: use newer toolchains to fill out architectures · d5d9b29b
      Jason A. Donenfeld authored
      Rather than relying on the system to have cross toolchains available,
      simply download musl.cc's ones and use that libc.so, and then we use it
      to fill in a few missing platforms, such as riscv64, riscv64, powerpc64,
      and s390x.
      
      Since riscv doesn't have a second serial port in its device description,
      we have to use virtio's vport. This is actually the same situation on
      ARM, but we were previously hacking QEMU up to work around this, which
      required a custom QEMU. Instead just do the vport trick on ARM too.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d5d9b29b
    • Jason A. Donenfeld's avatar
      wireguard: selftests: limit parallelism to $(nproc) tests at once · 39f02bf1
      Jason A. Donenfeld authored
      The parallel tests were added to catch queueing issues from multiple
      cores. But what happens in reality when testing tons of processes is
      that these separate threads wind up fighting with the scheduler, and we
      wind up with contention in places we don't care about that decrease the
      chances of hitting a bug. So just do a test with the number of CPU
      cores, rather than trying to scale up arbitrarily.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      39f02bf1
    • Jason A. Donenfeld's avatar
      wireguard: selftests: make routing loop test non-fatal · ae2de669
      Jason A. Donenfeld authored
      I hate to do this, but I still do not have a good solution to actually
      fix this bug across architectures. So just disable it for now, so that
      the CI can still deliver actionable results. This commit adds a large
      red warning, so that at least the failure isn't lost forever, and
      hopefully this can be revisited down the line.
      
      Link: https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/
      Link: https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/
      Link: https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ae2de669
  3. 04 May, 2022 12 commits