1. 22 May, 2020 23 commits
    • Jonathan McDowell's avatar
      net: ethernet: stmmac: Enable interface clocks on probe for IPQ806x · a96ac8a0
      Jonathan McDowell authored
      The ipq806x_gmac_probe() function enables the PTP clock but not the
      appropriate interface clocks. This means that if the bootloader hasn't
      done so, attempting to bring up the interface will fail with an error
      like:
      
      [   59.028131] ipq806x-gmac-dwmac 37600000.ethernet: Failed to reset the dma
      [   59.028196] ipq806x-gmac-dwmac 37600000.ethernet eth1: stmmac_hw_setup: DMA engine initialization failed
      [   59.034056] ipq806x-gmac-dwmac 37600000.ethernet eth1: stmmac_open: Hw setup failed
      
      This patch, a slightly cleaned up version of one posted by Sergey
      Sergeev in:
      
      https://forum.openwrt.org/t/support-for-mikrotik-rb3011uias-rm/4064/257
      
      correctly enables the clock; we have already configured the source just
      before this.
      
      Tested on a MikroTik RB3011.
      Signed-off-by: Jonathan McDowell <noodles@earth.li>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'netdevsim-Two-small-fixes' · 7a40a2d2
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      netdevsim: Two small fixes
      
      Fix two bugs observed while analyzing regression failures.
      
      Patch #1 fixes a bug where sometimes the drop counter of a packet trap
      policer would not increase.
      
      Patch #2 adds a missing initialization of a variable in a related
      selftest.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ido Schimmel's avatar
      selftests: netdevsim: Always initialize 'RET' variable · 4d59e59c
      Ido Schimmel authored
      The variable is used by log_test() to check if the test case completed
      successfully or not. In case it is not initialized at the start of a
      test case, it is possible for the test case to fail despite not
      encountering any errors.
      
      Example:
      
      ```
      ...
      TEST: Trap group statistics                                         [ OK ]
      TEST: Trap policer                                                  [FAIL]
      	Policer drop counter was not incremented
      TEST: Trap policer binding                                          [FAIL]
      	Policer drop counter was not incremented
      ```
      
      Failure of trap_policer_test() caused trap_policer_bind_test() to fail
      as well.
      
      Fix by adding missing initialization of the variable.
      
      Fixes: 5fbff58e ("selftests: netdevsim: Add test cases for devlink-trap policers")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ido Schimmel's avatar
      netdevsim: Ensure policer drop counter always increases · be43224f
      Ido Schimmel authored
      In case the policer drop counter is retrieved when the jiffies value is
      a multiple of 64, the counter will not be incremented.
      
      This randomly breaks a selftest [1] that reads the counter twice and
      checks that it was incremented:
      
      ```
      TEST: Trap policer                                                  [FAIL]
      	Policer drop counter was not incremented
      ```
      
      Fix by always incrementing the counter by 1.
      
      [1] tools/testing/selftests/drivers/net/netdevsim/devlink_trap.sh
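
The failure mode can be sketched in userspace C. This is a minimal illustration, not the driver code: `drops_old()`, `drops_new()` and the `tick` parameter (standing in for jiffies) are hypothetical names, and the shape of the old update (adding `tick % 64`) is an assumption inferred from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old behaviour (assumed shape): the increment is derived from the tick
 * value, so it is 0 whenever tick is a multiple of 64. */
static uint64_t drops_old(uint64_t *counter, uint64_t tick)
{
    assert(counter != NULL);
    *counter += tick % 64;
    return *counter;
}

/* Fixed behaviour: every read advances the counter by exactly 1. */
static uint64_t drops_new(uint64_t *counter)
{
    assert(counter != NULL);
    *counter += 1;
    return *counter;
}
```

Two consecutive reads through `drops_old()` at ticks 128 and 192 return the same value, which is exactly what the selftest flags; `drops_new()` cannot return the same value twice in a row.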
      
      Fixes: ad188458 ("netdevsim: Add devlink-trap policer support")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge tag 'rxrpc-fixes-20200520' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 4629ed2e
      David S. Miller authored
      David Howells says:
      
      ====================
      rxrpc: Fix retransmission timeout and ACK discard
      
      Here are a couple of fixes and an extra tracepoint for AF_RXRPC:
      
       (1) Calculate the RTO pretty much as TCP does, rather than making
           something up, including an initial 4s timeout (which causes return
           probes from the fileserver to fail if a packet goes missing), and add
           backoff.
      
       (2) Fix the discarding of out-of-order received ACKs.  We mustn't let the
           hard-ACK point regress, nor do we want to do unnecessary
           retransmission because the soft-ACK list regresses.  This is not
           trivial, however, due to some loose wording in various old protocol
           specs, the ACK field that should be used for this sometimes has the
           wrong information in it.
      
       (3) Add a tracepoint to log a discarded ACK.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Valentin Longchamp's avatar
      net/ethernet/freescale: rework quiesce/activate for ucc_geth · 79dde73c
      Valentin Longchamp authored
      ugeth_quiesce/activate are used to halt the controller when there is a
      link change that requires reconfiguring the MAC.
      
      The previous implementation called netif_device_detach(). This however
      causes the initial activation of the netdevice to fail precisely because
      it's detached. For details, see [1].
      
      A possible workaround was the revert of commit
      net: linkwatch: add check for netdevice being present to linkwatch_do_dev
      However, the check introduced in the above commit is correct and shall be
      kept.
      
      The netif_device_detach() is thus replaced with
      netif_tx_stop_all_queues(), which prevents any transmission. This allows
      performing the MAC config change required by the link change without detaching
      the corresponding netdevice and thus not preventing its initial
      activation.
      
      [1] https://lists.openwall.net/netdev/2020/01/08/201
      
      Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
      Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jere Leppänen's avatar
      sctp: Start shutdown on association restart if in SHUTDOWN-SENT state and socket is closed · d3e8e4c1
      Jere Leppänen authored
      Commit bdf6fa52 ("sctp: handle association restarts when the
      socket is closed.") starts shutdown when an association is restarted,
      if in SHUTDOWN-PENDING state and the socket is closed. However, the
      rationale stated in that commit applies also when in SHUTDOWN-SENT
      state - we don't want to move an association to ESTABLISHED state when
      the socket has been closed, because that results in an association
      that is unreachable from user space.
      
      The problem scenario:
      
      1.  Client crashes and/or restarts.
      
      2.  Server (using one-to-one socket) calls close(). SHUTDOWN is lost.
      
      3.  Client reconnects using the same addresses and ports.
      
      4.  Server's association is restarted. The association and the socket
          move to ESTABLISHED state, even though the server process has
          closed its descriptor.
      
      Also, after step 4 when the server process exits, some resources are
      leaked in an attempt to release the underlying inet sock structure in
      ESTABLISHED state:
      
          IPv4: Attempt to release TCP socket in state 1 00000000377288c7
      
      Fix by acting the same way as in SHUTDOWN-PENDING state. That is, if
      an association is restarted in SHUTDOWN-SENT state and the socket is
      closed, then start shutdown and don't move the association or the
      socket to ESTABLISHED state.
      
      Fixes: bdf6fa52 ("sctp: handle association restarts when the socket is closed.")
      Signed-off-by: Jere Leppänen <jere.leppanen@nokia.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tipc: block BH before using dst_cache · 13788174
      Eric Dumazet authored
      dst_cache_get() documents it must be used with BH disabled.
      
      syzbot reported:
      
      BUG: using smp_processor_id() in preemptible [00000000] code: /21697
      caller is dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68
      CPU: 0 PID: 21697 Comm:  Not tainted 5.7.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x188/0x20d lib/dump_stack.c:118
       check_preemption_disabled lib/smp_processor_id.c:47 [inline]
       debug_smp_processor_id.cold+0x88/0x9b lib/smp_processor_id.c:57
       dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68
       tipc_udp_xmit.isra.0+0xb9/0xad0 net/tipc/udp_media.c:164
       tipc_udp_send_msg+0x3e6/0x490 net/tipc/udp_media.c:244
       tipc_bearer_xmit_skb+0x1de/0x3f0 net/tipc/bearer.c:526
       tipc_enable_bearer+0xb2f/0xd60 net/tipc/bearer.c:331
       __tipc_nl_bearer_enable+0x2bf/0x390 net/tipc/bearer.c:995
       tipc_nl_bearer_enable+0x1e/0x30 net/tipc/bearer.c:1003
       genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline]
       genl_family_rcv_msg net/netlink/genetlink.c:718 [inline]
       genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735
       netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
       genl_rcv+0x24/0x40 net/netlink/genetlink.c:746
       netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
       netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
       netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:672
       ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362
       ___sys_sendmsg+0x100/0x170 net/socket.c:2416
       __sys_sendmsg+0xec/0x1b0 net/socket.c:2449
       do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x45ca29
      
      Fixes: e9c1a793 ("tipc: add dst_cache support for udp media")
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Russell King's avatar
      net: mvpp2: fix RX hashing for non-10G ports · 3138a07c
      Russell King authored
      When rxhash is enabled on any ethernet port except the first in each CP
      block, traffic flow is prevented.  The analysis is below:
      
      I've been investigating this afternoon, and what I've found, comparing
      a kernel without 895586d5 and with 895586d5 applied is:
      
      - The table programmed into the hardware via mvpp22_rss_fill_table()
        appears to be identical with or without the commit.
      
      - When rxhash is enabled on eth2, mvpp2_rss_port_c2_enable() reports
        that c2.attr[0] and c2.attr[2] are written back containing:
      
         - with 895586d5, failing:    00200000 40000000
         - without 895586d5, working: 04000000 40000000
      
      - When disabling rxhash, c2.attr[0] and c2.attr[2] are written back as:
      
         04000000 00000000
      
      The second value represents the MVPP22_CLS_C2_ATTR2_RSS_EN bit, the
      first value is the queue number, which comprises two fields. The high
      5 bits are 24:29 and the low three are 21:23 inclusive. This comes
      from:
      
             c2.attr[0] = MVPP22_CLS_C2_ATTR0_QHIGH(qh) |
                           MVPP22_CLS_C2_ATTR0_QLOW(ql);
      
      So, the working case gives eth2 a queue id of 4.0, or 32 as per
      port->first_rxq, and the non-working case a queue id of 0.1, or 1.
      The allocation of queue IDs seems to be in mvpp2_port_probe():
      
              if (priv->hw_version == MVPP21)
                      port->first_rxq = port->id * port->nrxqs;
              else
                      port->first_rxq = port->id * priv->max_port_rxqs;
      
      Where:
      
              if (priv->hw_version == MVPP21)
                      priv->max_port_rxqs = 8;
              else
                      priv->max_port_rxqs = 32;
      
      Making the port 0 (eth0 / eth1) have port->first_rxq = 0, and port 1
      (eth2) be 32. It seems the idea is that the first 32 queues belong to
      port 0, the second 32 queues belong to port 1, etc.
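
The allocation rule quoted above can be restated as a small standalone sketch. The enum and function names here are hypothetical stand-ins for the driver's fields; only the arithmetic is taken from the quoted code.

```c
#include <assert.h>

enum hw_version { HW_MVPP21, HW_MVPP22 };   /* hypothetical stand-ins */

/* Per-port queue budget: 8 on MVPP21, 32 otherwise. */
static unsigned int max_port_rxqs(enum hw_version hw)
{
    return hw == HW_MVPP21 ? 8u : 32u;
}

/* First RX queue of a port, following the quoted mvpp2_port_probe() logic. */
static unsigned int first_rxq(enum hw_version hw, unsigned int port_id,
                              unsigned int nrxqs)
{
    assert(nrxqs > 0);
    if (hw == HW_MVPP21)
        return port_id * nrxqs;
    return port_id * max_port_rxqs(hw);
}
```

With `HW_MVPP22`, port ids 0, 1 and 2 start at queues 0, 32 and 64, which matches the "first 32 queues belong to port 0" reading.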
      
      mvpp2_rss_port_c2_enable() gets the queue number from its parameter,
      'ctx', which comes from mvpp22_rss_ctx(port, 0). This returns
      port->rss_ctx[0].
      
      mvpp22_rss_context_create() is responsible for allocating that, which
      it does by looking for an unallocated priv->rss_tables[] pointer. This
      table is shared amongst all ports on the CP silicon.
      
      When we write the tables in mvpp22_rss_fill_table(), the RSS table
      entry is defined by:
      
                      u32 sel = MVPP22_RSS_INDEX_TABLE(rss_ctx) |
                                MVPP22_RSS_INDEX_TABLE_ENTRY(i);
      
      where rss_ctx is the context ID (queue number) and i is the index in
      the table.
      
      If we look at what is written:
      
      - The first table to be written has "sel" values of 00000000..0000001f,
        containing values 0..3. This appears to be for eth1. This is table 0,
        RX queue number 0.
      - The second table has "sel" values of 00000100..0000011f, and appears
        to be for eth2.  These contain values 0x20..0x23. This is table 1,
        RX queue number 0.
      - The third table has "sel" values of 00000200..0000021f, and appears
        to be for eth3.  These contain values 0x40..0x43. This is table 2,
        RX queue number 0.
      
      How do queue numbers translate to the RSS table?  There is another
      table - the RXQ2RSS table, indexed by the MVPP22_RSS_INDEX_QUEUE field
      of MVPP22_RSS_INDEX and accessed through the MVPP22_RXQ2RSS_TABLE
      register. Before 895586d5, it was:
      
             mvpp2_write(priv, MVPP22_RSS_INDEX,
                         MVPP22_RSS_INDEX_QUEUE(port->first_rxq));
             mvpp2_write(priv, MVPP22_RXQ2RSS_TABLE,
                         MVPP22_RSS_TABLE_POINTER(port->id));
      
      and after:
      
             mvpp2_write(priv, MVPP22_RSS_INDEX, MVPP22_RSS_INDEX_QUEUE(ctx));
             mvpp2_write(priv, MVPP22_RXQ2RSS_TABLE, MVPP22_RSS_TABLE_POINTER(ctx));
      
      Before the commit, for eth2, that would've contained '32' for the
      index and '1' for the table pointer - mapping queue 32 to table 1.
      Remember that this is queue-high.queue-low of 4.0.
      
      After the commit, we appear to map queue 1 to table 1. That again
      looks fine on the face of it.
      
      Section 9.3.1 of the A8040 manual seems to indicate the reason that the
      queue number is separated. queue-low seems to always come from the
      classifier, whereas queue-high can be from the ingress physical port
      number or the classifier depending on the MVPP2_CLS_SWFWD_PCTRL_REG.
      
      We set the port bit in MVPP2_CLS_SWFWD_PCTRL_REG, meaning that queue-high
      comes from the MVPP2_CLS_SWFWD_P2HQ_REG() register... and this seems to
      be where our bug comes from.
      
      mvpp2_cls_oversize_rxq_set() sets this up as:
      
              mvpp2_write(port->priv, MVPP2_CLS_SWFWD_P2HQ_REG(port->id),
                          (port->first_rxq >> MVPP2_CLS_OVERSIZE_RXQ_LOW_BITS));
      
              val = mvpp2_read(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG);
              val |= MVPP2_CLS_SWFWD_PCTRL_MASK(port->id);
              mvpp2_write(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG, val);
      
      Setting the MVPP2_CLS_SWFWD_PCTRL_MASK bit means that the queue-high
      for eth2 is _always_ 4, so only queues 32 through 39 inclusive are
      available to eth2. Yet, we're trying to tell the classifier to set
      queue-high, which will be ignored, to zero. Hence, the queue-high
      field (MVPP22_CLS_C2_ATTR0_QHIGH()) from the classifier will be
      ignored.
      
      This means we end up directing traffic from eth2 not to queue 1, but
      to queue 33, and then we tell it to look up queue 33 in the RSS table.
      However, RSS table has not been programmed for queue 33, and so it ends
      up (presumably) dropping the packets.
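
The queue arithmetic behind "queue 1 becomes queue 33" can be checked with a short sketch. The macro and function names are hypothetical; the shifts (queue-low at bit 21, queue-high at bit 24, three low bits) are inferred from the c2.attr[0] values 04000000 and 00200000 quoted earlier, so treat them as assumptions rather than the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QLOW_BITS   3u    /* low three bits of the queue number */
#define QLOW_SHIFT  21u   /* assumed position of the queue-low field */
#define QHIGH_SHIFT 24u   /* assumed position of the queue-high field */

/* Build a c2.attr[0]-style word from a queue number: QHIGH(qh) | QLOW(ql). */
static uint32_t c2_attr0(unsigned int rxq)
{
    assert(rxq < 256u);   /* 5 high bits + 3 low bits */
    uint32_t qh = rxq >> QLOW_BITS;
    uint32_t ql = rxq & ((1u << QLOW_BITS) - 1u);
    return (qh << QHIGH_SHIFT) | (ql << QLOW_SHIFT);
}

/* Queue actually used by the hardware: queue-low always comes from the
 * classifier; queue-high comes from the per-port register when the port
 * bit is set, and from the classifier otherwise. */
static unsigned int effective_rxq(bool port_bit_set,
                                  unsigned int port_first_rxq,
                                  unsigned int cls_rxq)
{
    unsigned int ql = cls_rxq & ((1u << QLOW_BITS) - 1u);
    unsigned int qh = (port_bit_set ? port_first_rxq : cls_rxq) >> QLOW_BITS;
    return (qh << QLOW_BITS) | ql;
}
```

For eth2 (`port_first_rxq` of 32) and a classifier queue of 1, the sketch yields 33 with the port bit set and 1 with it cleared.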
      
      It seems that mvpp22_rss_context_create() doesn't take account of the
      fact that the upper 5 bits of the queue ID can't actually be changed
      due to the settings in mvpp2_cls_oversize_rxq_set(), _or_ it seems that
      mvpp2_cls_oversize_rxq_set() has been missed in this commit. Either
      way, these two functions mutually disagree with what queue number
      should be used.
      
      Looking deeper into what mvpp2_cls_oversize_rxq_set() and the MTU
      validation is doing, it seems that MVPP2_CLS_SWFWD_P2HQ_REG() is used
      for over-sized packets attempting to egress through this port. With
      the classifier having had RSS enabled and directing eth2 traffic to
      queue 1, we may still have packets appearing on queue 32 for this port.
      
      However, the only way we may end up with over-sized packets attempting
      to egress through eth2 - is if the A8040 forwards frames between its
      ports. From what I can see, we don't support that feature, and the
      kernel restricts the egress packet size to the MTU. In any case, if we
      were to attempt to transmit an oversized packet, we have no support in
      the kernel to deal with that appearing in the port's receive queue.
      
      So, this patch attempts to solve the issue by clearing the
      MVPP2_CLS_SWFWD_PCTRL_MASK() bit, allowing MVPP22_CLS_C2_ATTR0_QHIGH()
      from the classifier to define the queue-high field of the queue number.
      
      My testing seems to confirm my findings above - clearing this bit
      means that if I enable rxhash on eth2, the interface can then pass
      traffic, as we are now directing traffic to RX queue 1 rather than
      queue 33. Traffic still seems to work with rxhash off as well.
      Reported-by: Matteo Croce <mcroce@redhat.com>
      Tested-by: Matteo Croce <mcroce@redhat.com>
      Fixes: 895586d5 ("net: mvpp2: cls: Use RSS contexts to handle RSS tables")
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · d3b968bc
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2020-05-22
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 3 non-merge commits during the last 3 day(s) which contain
      a total of 5 files changed, 69 insertions(+), 11 deletions(-).
      
      The main changes are:
      
      1) Fix to reject mmap()'ing read-only array maps as writable since BPF verifier
         relies on such map content to be frozen, from Andrii Nakryiko.
      
      2) Fix breaking audit from secid_to_secctx() LSM hook by avoiding the use of
         call_int_hook() since this hook is not stackable, from KP Singh.
      
      3) Fix BPF flow dissector program ref leak on netns cleanup, from Jakub Sitnicki.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Claudiu Manoil's avatar
      felix: Fix initialization of ioremap resources · b4024c9e
      Claudiu Manoil authored
      The caller of devm_ioremap_resource(), either accidentally
      or by wrong assumption, is writing back derived resource data
      to global static resource initialization tables that should
      have been constant.  Meaning that after it computes the final
      physical start address it saves the address for no reason
      in the static tables.  This doesn't affect the first driver
      probing after reboot, but it breaks consecutive driver reloads
      (i.e. driver unbind & bind) because the initialization tables
      no longer have the correct initial values.  So the next probe()
      will map the device registers to wrong physical addresses,
      causing ARM SError async exceptions.
      This patch fixes all of the above.
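
A minimal userspace sketch of the write-back problem, assuming a simplified `struct res` in place of the kernel's resource type; the tables, offsets and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct res { unsigned long start, end; };   /* simplified stand-in */

/* Offsets relative to the device base; these are meant to stay constant. */
static struct res table_buggy[] = { { 0x0100, 0x01ff }, { 0x0200, 0x02ff } };
static const struct res table_fixed[] = { { 0x0100, 0x01ff }, { 0x0200, 0x02ff } };

/* Buggy pattern: the computed physical range is written back to the table,
 * so the next probe starts from already-shifted values. */
static unsigned long probe_buggy(unsigned long base, size_t i)
{
    assert(i < sizeof(table_buggy) / sizeof(table_buggy[0]));
    struct res *r = &table_buggy[i];
    r->start += base;
    r->end += base;
    return r->start;
}

/* Fixed pattern: work on a local copy and leave the table untouched. */
static unsigned long probe_fixed(unsigned long base, size_t i)
{
    assert(i < sizeof(table_fixed) / sizeof(table_fixed[0]));
    struct res r = table_fixed[i];
    r.start += base;
    r.end += base;
    return r.start;
}
```

Calling `probe_buggy()` twice with the same base gives two different start addresses, mirroring the unbind and bind failure; `probe_fixed()` is repeatable.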
      
      Fixes: 56051948 ("net: dsa: ocelot: add driver for Felix switch family")
      Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Todd Malsbary's avatar
      mptcp: use untruncated hash in ADD_ADDR HMAC · bd697222
      Todd Malsbary authored
      There is some ambiguity in the RFC as to whether the ADD_ADDR HMAC is
      the rightmost 64 bits of the entire hash or of the leftmost 160 bits
      of the hash.  The intention, as clarified with the author of the RFC,
      is the entire hash.
      
      This change returns the entire hash from
      mptcp_crypto_hmac_sha (instead of only the first 160 bits), and moves
      any truncation/selection operation on the hash to the caller.
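
The two readings of the RFC differ only in which 8 bytes are selected. The helper below is a hypothetical sketch, not the kernel function; it assumes a 32-byte HMAC-SHA256 output and a big-endian interpretation of the selected bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHA256_LEN 32u   /* full HMAC-SHA256 output in bytes */
#define TRUNC_LEN  20u   /* leftmost 160 bits in bytes */

/* Read the last 8 bytes of buf[0..len) as a big-endian 64-bit value. */
static uint64_t rightmost64(const uint8_t *buf, size_t len)
{
    uint64_t v = 0;

    assert(buf != NULL && len >= 8);
    for (size_t i = len - 8; i < len; i++)
        v = (v << 8) | buf[i];
    return v;
}
```

`rightmost64(hash, SHA256_LEN)` selects bytes 24..31 (the intended reading), while `rightmost64(hash, TRUNC_LEN)` selects bytes 12..19 (the reading that results from truncating first).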
      
      Fixes: 12555a2d ("mptcp: use rightmost 64 bits in ADD_ADDR HMAC")
      Reviewed-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jakub Sitnicki's avatar
      flow_dissector: Drop BPF flow dissector prog ref on netns cleanup · 5cf65922
      Jakub Sitnicki authored
      When attaching a flow dissector program to a network namespace with
      bpf(BPF_PROG_ATTACH, ...) we grab a reference to bpf_prog.
      
      If netns gets destroyed while a flow dissector is still attached, and there
      are no other references to the prog, we leak the reference and the program
      remains loaded.
      
      Leak can be reproduced by running flow dissector tests from selftests/bpf:
      
        # bpftool prog list
        # ./test_flow_dissector.sh
        ...
        selftests: test_flow_dissector [PASS]
        # bpftool prog list
        4: flow_dissector  name _dissect  tag e314084d332a5338  gpl
                loaded_at 2020-05-20T18:50:53+0200  uid 0
                xlated 552B  jited 355B  memlock 4096B  map_ids 3,4
                btf_id 4
        #
      
      Fix it by detaching the flow dissector program when netns is going away.
      
      Fixes: d58e468b ("flow_dissector: implements flow dissector BPF hook")
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/bpf/20200521083435.560256-1-jakub@cloudflare.com
    • Tang Bin's avatar
      net: sgi: ioc3-eth: Fix return value check in ioc3eth_probe() · a7654211
      Tang Bin authored
      If devm_platform_ioremap_resource() fails to get the resource, it
      returns an ERR_PTR() value, not NULL. The NULL check must therefore be
      replaced by IS_ERR(), or else it may result in crashes if a critical
      error path is encountered.
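
Why a NULL test misses this kind of failure can be shown with a userspace imitation of the error-pointer convention. The lowercase helpers and `map_resource()` below are hypothetical re-implementations for illustration, not the kernel macros themselves; the value 4095 for the maximum errno is the one the kernel uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *err_ptr(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* True if ptr lies in the top MAX_ERRNO addresses, i.e. encodes an error. */
static bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Hypothetical mapping helper that fails with an error pointer, never NULL. */
static void *map_resource(bool ok)
{
    static int regs;
    return ok ? (void *)&regs : err_ptr(-22 /* EINVAL */);
}
```

A failed `map_resource(false)` is non-NULL, so an `if (!ptr)` check lets it through and the caller would go on to dereference an invalid address; `is_err()` catches it.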
      
      Fixes: 0ce5ebd2 ("mfd: ioc3: Add driver for SGI IOC3 chip")
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Sabrina Dubroca's avatar
      net: don't return invalid table id error when we fall back to PF_UNSPEC · 41b4bd98
      Sabrina Dubroca authored
      In case we can't find a ->dumpit callback for the requested
      (family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're
      in the same situation as if userspace had requested a PF_UNSPEC
      dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all
      the registered RTM_GETROUTE handlers.
      
      The requested table id may or may not exist for all of those
      families. commit ae677bbb ("net: Don't return invalid table id
      error when dumping all families") fixed the problem when userspace
      explicitly requests a PF_UNSPEC dump, but missed the fallback case.
      
      For example, when we pass ipv6.disable=1 to a kernel with
      CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y,
      the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in
      rtnl_dump_all, and listing IPv6 routes will unexpectedly print:
      
        # ip -6 r
        Error: ipv4: MR table does not exist.
        Dump terminated
      
      commit ae677bbb introduced the dump_all_families variable, which
      gets set when userspace requests a PF_UNSPEC dump. However, we can't
      simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the
      fallback case to get dump_all_families == true, because some message
      types (for example RTM_GETRULE and RTM_GETNEIGH) only register the
      PF_UNSPEC handler and use the family to filter in the kernel what is
      dumped to userspace. We would then export more entries, that userspace
      would have to filter. iproute does that, but other programs may not.
      
      Instead, this patch removes dump_all_families and updates the
      RTM_GETROUTE handlers to check if the family that is being dumped is
      their own. When it's not, which covers both the intentional PF_UNSPEC
      dumps (as dump_all_families did) and the fallback case, ignore the
      missing table id error.
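
The per-handler check described above can be sketched as a pure decision function. Everything here is hypothetical (the enum values, the function name, the use of -ENOENT for the missing table); it only illustrates the rule "report a missing table only when the dump is for my own family".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical family identifiers, for illustration only. */
enum family { FAM_UNSPEC, FAM_INET, FAM_INET6, FAM_IPMR };

/* One handler's decision when asked to dump a table.
 * dump_family: family of the dump request being served.
 * own_family:  family this handler is registered for. */
static int dump_route(enum family dump_family, enum family own_family,
                      bool table_exists)
{
    assert(own_family != FAM_UNSPEC);   /* a handler has a concrete family */

    if (table_exists)
        return 1;         /* entries would be dumped here */
    if (dump_family != own_family)
        return 0;         /* not our dump: skip silently */
    return -ENOENT;       /* our own dump: report the missing table */
}
```

Under this rule, both an explicit PF_UNSPEC dump and the fallback from an unregistered family reach the handler with a `dump_family` that differs from `own_family`, so neither produces the error.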
      
      Fixes: cb167893 ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vadim Fedorenko's avatar
      net: ipip: fix wrong address family in init error path · 57ebc8f0
      Vadim Fedorenko authored
      In case of error with MPLS support the code is misusing AF_INET
      instead of AF_MPLS.
      
      Fixes: 1b69e7e6 ("ipip: support MPLS over IPv4")
      Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'net-tls-fix-encryption-error-path' · a5534617
      David S. Miller authored
      Vadim Fedorenko says:
      
      ====================
      net/tls: fix encryption error path
      
      The problem with data stream corruption was found in the KTLS
      transmit path with small socket send buffers and a large
      amount of data. bpf_exec_tx_verdict() frees the open record
      on any type of error, including EAGAIN, ENOMEM and ENOSPC,
      while callers are able to recover from these transient errors.
      Also, the wrong error code was returned to user space in that
      case. This patchset fixes both problems.
      ====================
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vadim Fedorenko's avatar
      net/tls: free record only on encryption error · 635d9398
      Vadim Fedorenko authored
      We cannot free the record on a transient error because it leads to
      losing previous data. Check the socket error to know whether the record
      must be freed or not.
      
      Fixes: d10523d0 ("net/tls: free the record on encryption error")
      Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vadim Fedorenko's avatar
      net/tls: fix encryption error checking · a7bff11f
      Vadim Fedorenko authored
      bpf_exec_tx_verdict() can return negative value for copied
      variable. In that case this value will be pushed back to caller
      and the real error code will be lost. Fix it using signed type and
      checking for positive value.
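
The return-value selection at the end of a send path can be sketched as follows. The function names are hypothetical and the "old" form is an assumption about the pattern being fixed; the point is only that a plain truthiness test treats a negative `copied` as a valid byte count.

```c
#include <assert.h>

/* Assumed old pattern: any non-zero 'copied' wins, including a negative one,
 * so the real error code in 'ret' is lost. */
static long send_result_old(long copied, int ret)
{
    assert(ret <= 0);
    return copied ? copied : ret;
}

/* Fixed pattern: only a positive byte count overrides the error code. */
static long send_result_new(long copied, int ret)
{
    assert(ret <= 0);
    return copied > 0 ? copied : ret;
}
```

With `copied` at -3 and `ret` at -11, the old form returns -3 while the new form returns -11; a positive `copied` is returned unchanged by both.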
      
      Fixes: d10523d0 ("net/tls: free the record on encryption error")
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'net-ethernet-ti-fix-some-return-value-check' · 04ba6b7d
      David S. Miller authored
      Wei Yongjun says:
      
      ====================
      net: ethernet: ti: fix some return value check
      
      This patchset converts cpsw_ale_create() to return ERR_PTR() only, and
      changes all the callers to check IS_ERR() instead of NULL.
      
      Since v2:
      1) rebased on net.git, as Jakub suggested
      2) split am65-cpsw-nuss.c changes, as Grygorii suggested
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Wei Yongjun's avatar
      net: ethernet: ti: am65-cpsw-nuss: fix error handling of am65_cpsw_nuss_probe · 1401cf60
      Wei Yongjun authored
      Use IS_ERR() instead of a NULL test for cpsw_ale_create() error
      handling. Also return a negative error code from this error handling
      case instead of 0.
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Wei Yongjun's avatar
      net: ethernet: ti: fix some return value check of cpsw_ale_create() · 3469660d
      Wei Yongjun authored
      cpsw_ale_create() can return both NULL and ERR_PTR(), but all of
      the callers only check for NULL in their error handling. This patch
      converts it to return only ERR_PTR() in all error cases, and the callers
      to use IS_ERR() instead of a NULL test.
      
      Fixes: 4b41d343 ("net: ethernet: ti: cpsw: allow untagged traffic on host port")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Manivannan Sadhasivam's avatar
      net: qrtr: Fix passing invalid reference to qrtr_local_enqueue() · d28ea1fb
      Manivannan Sadhasivam authored
      Once the traversal of the list is completed with list_for_each_entry(),
      the iterator (node) will point to an invalid object. So passing this to
      qrtr_local_enqueue(), which is outside of the iterator block, is erroneous
      even though the object is not used.
      
      So fix this by passing NULL to qrtr_local_enqueue().
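
      Why the cursor is invalid after a full traversal can be sketched with a
      minimal circular list. The names below (struct node, node_of(),
      find_node()) are hypothetical and only mimic the kernel list layout.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
struct node { unsigned int id; struct list_head item; };

/* Convert an embedded list_head back to its containing node. */
#define node_of(ptr) \
	((struct node *)((char *)(ptr) - offsetof(struct node, item)))

void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Returns the matching node or NULL, never the leftover loop cursor. */
struct node *find_node(struct list_head *head, unsigned int id)
{
	struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct node *n = node_of(pos);

		if (n->id == id)
			return n;
	}
	/*
	 * Here pos == head. The head is not embedded in a struct node, so
	 * converting it with node_of() would not yield a real entry.
	 */
	return NULL;
}
```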
      
      Fixes: bdabad3e ("net: Add Qualcomm IPC router")
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d28ea1fb
  2. 21 May, 2020 10 commits
    • Michal Kubecek's avatar
      ethtool: count header size in reply size estimate · 7c87e32d
      Michal Kubecek authored
      As ethnl_request_ops::reply_size handlers do not include common header
      size into calculated/estimated reply size, it needs to be added in
      ethnl_default_doit() and ethnl_default_notify() before allocating the
      message. On the other hand, strset_reply_size() should not add common
      header size.
      
      Fixes: 728480f1 ("ethtool: default handlers for GET requests")
      Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c87e32d
    • Stephen Worley's avatar
      net: nlmsg_cancel() if put fails for nhmsg · d69100b8
      Stephen Worley authored
      Fixes data remnant seen when we fail to reserve space for a
      nexthop group during a larger dump.
      
      If we fail the reservation, we goto nla_put_failure and
      cancel the message.
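
      The cancel-on-failure idea can be sketched with a plain byte buffer.
      This is not the netlink API: struct msgbuf, msg_put() and
      msg_add_record() are hypothetical names used only to show the rollback.

```c
#include <stddef.h>
#include <string.h>

struct msgbuf {
	unsigned char data[64];
	size_t len;
};

/* Append n bytes; returns 0 on success, -1 if there is not enough room. */
int msg_put(struct msgbuf *m, const void *src, size_t n)
{
	if (n > sizeof(m->data) - m->len)
		return -1;
	memcpy(m->data + m->len, src, n);
	m->len += n;
	return 0;
}

/*
 * Write a header followed by a payload as one record. If any part does not
 * fit, trim the buffer back to where the record started so that no partial
 * record is left behind.
 */
int msg_add_record(struct msgbuf *m, const void *hdr, size_t hdr_len,
		   const void *payload, size_t payload_len)
{
	size_t start = m->len;

	if (msg_put(m, hdr, hdr_len) || msg_put(m, payload, payload_len)) {
		m->len = start;	/* the "cancel" step */
		return -1;
	}
	return 0;
}
```

      Without the rollback, the header of the failed record would stay in the
      buffer, which corresponds to the data remnant described above.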
      
      Reproduce with the following iproute2 commands:
      =====================
      ip link add dummy1 type dummy
      ip link add dummy2 type dummy
      ip link add dummy3 type dummy
      ip link add dummy4 type dummy
      ip link add dummy5 type dummy
      ip link add dummy6 type dummy
      ip link add dummy7 type dummy
      ip link add dummy8 type dummy
      ip link add dummy9 type dummy
      ip link add dummy10 type dummy
      ip link add dummy11 type dummy
      ip link add dummy12 type dummy
      ip link add dummy13 type dummy
      ip link add dummy14 type dummy
      ip link add dummy15 type dummy
      ip link add dummy16 type dummy
      ip link add dummy17 type dummy
      ip link add dummy18 type dummy
      ip link add dummy19 type dummy
      ip link add dummy20 type dummy
      ip link add dummy21 type dummy
      ip link add dummy22 type dummy
      ip link add dummy23 type dummy
      ip link add dummy24 type dummy
      ip link add dummy25 type dummy
      ip link add dummy26 type dummy
      ip link add dummy27 type dummy
      ip link add dummy28 type dummy
      ip link add dummy29 type dummy
      ip link add dummy30 type dummy
      ip link add dummy31 type dummy
      ip link add dummy32 type dummy
      
      ip link set dummy1 up
      ip link set dummy2 up
      ip link set dummy3 up
      ip link set dummy4 up
      ip link set dummy5 up
      ip link set dummy6 up
      ip link set dummy7 up
      ip link set dummy8 up
      ip link set dummy9 up
      ip link set dummy10 up
      ip link set dummy11 up
      ip link set dummy12 up
      ip link set dummy13 up
      ip link set dummy14 up
      ip link set dummy15 up
      ip link set dummy16 up
      ip link set dummy17 up
      ip link set dummy18 up
      ip link set dummy19 up
      ip link set dummy20 up
      ip link set dummy21 up
      ip link set dummy22 up
      ip link set dummy23 up
      ip link set dummy24 up
      ip link set dummy25 up
      ip link set dummy26 up
      ip link set dummy27 up
      ip link set dummy28 up
      ip link set dummy29 up
      ip link set dummy30 up
      ip link set dummy31 up
      ip link set dummy32 up
      
      ip link set dummy33 up
      ip link set dummy34 up
      
      ip link set vrf-red up
      ip link set vrf-blue up
      
      ip link set dummyVRFred up
      ip link set dummyVRFblue up
      
      ip ro add 1.1.1.1/32 dev dummy1
      ip ro add 1.1.1.2/32 dev dummy2
      ip ro add 1.1.1.3/32 dev dummy3
      ip ro add 1.1.1.4/32 dev dummy4
      ip ro add 1.1.1.5/32 dev dummy5
      ip ro add 1.1.1.6/32 dev dummy6
      ip ro add 1.1.1.7/32 dev dummy7
      ip ro add 1.1.1.8/32 dev dummy8
      ip ro add 1.1.1.9/32 dev dummy9
      ip ro add 1.1.1.10/32 dev dummy10
      ip ro add 1.1.1.11/32 dev dummy11
      ip ro add 1.1.1.12/32 dev dummy12
      ip ro add 1.1.1.13/32 dev dummy13
      ip ro add 1.1.1.14/32 dev dummy14
      ip ro add 1.1.1.15/32 dev dummy15
      ip ro add 1.1.1.16/32 dev dummy16
      ip ro add 1.1.1.17/32 dev dummy17
      ip ro add 1.1.1.18/32 dev dummy18
      ip ro add 1.1.1.19/32 dev dummy19
      ip ro add 1.1.1.20/32 dev dummy20
      ip ro add 1.1.1.21/32 dev dummy21
      ip ro add 1.1.1.22/32 dev dummy22
      ip ro add 1.1.1.23/32 dev dummy23
      ip ro add 1.1.1.24/32 dev dummy24
      ip ro add 1.1.1.25/32 dev dummy25
      ip ro add 1.1.1.26/32 dev dummy26
      ip ro add 1.1.1.27/32 dev dummy27
      ip ro add 1.1.1.28/32 dev dummy28
      ip ro add 1.1.1.29/32 dev dummy29
      ip ro add 1.1.1.30/32 dev dummy30
      ip ro add 1.1.1.31/32 dev dummy31
      ip ro add 1.1.1.32/32 dev dummy32
      
      ip next add id 1 via 1.1.1.1 dev dummy1
      ip next add id 2 via 1.1.1.2 dev dummy2
      ip next add id 3 via 1.1.1.3 dev dummy3
      ip next add id 4 via 1.1.1.4 dev dummy4
      ip next add id 5 via 1.1.1.5 dev dummy5
      ip next add id 6 via 1.1.1.6 dev dummy6
      ip next add id 7 via 1.1.1.7 dev dummy7
      ip next add id 8 via 1.1.1.8 dev dummy8
      ip next add id 9 via 1.1.1.9 dev dummy9
      ip next add id 10 via 1.1.1.10 dev dummy10
      ip next add id 11 via 1.1.1.11 dev dummy11
      ip next add id 12 via 1.1.1.12 dev dummy12
      ip next add id 13 via 1.1.1.13 dev dummy13
      ip next add id 14 via 1.1.1.14 dev dummy14
      ip next add id 15 via 1.1.1.15 dev dummy15
      ip next add id 16 via 1.1.1.16 dev dummy16
      ip next add id 17 via 1.1.1.17 dev dummy17
      ip next add id 18 via 1.1.1.18 dev dummy18
      ip next add id 19 via 1.1.1.19 dev dummy19
      ip next add id 20 via 1.1.1.20 dev dummy20
      ip next add id 21 via 1.1.1.21 dev dummy21
      ip next add id 22 via 1.1.1.22 dev dummy22
      ip next add id 23 via 1.1.1.23 dev dummy23
      ip next add id 24 via 1.1.1.24 dev dummy24
      ip next add id 25 via 1.1.1.25 dev dummy25
      ip next add id 26 via 1.1.1.26 dev dummy26
      ip next add id 27 via 1.1.1.27 dev dummy27
      ip next add id 28 via 1.1.1.28 dev dummy28
      ip next add id 29 via 1.1.1.29 dev dummy29
      ip next add id 30 via 1.1.1.30 dev dummy30
      ip next add id 31 via 1.1.1.31 dev dummy31
      ip next add id 32 via 1.1.1.32 dev dummy32
      
      i=100
      
      while [ $i -le 200 ]
      do
      ip next add id $i group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
      
      	echo $i
      
      	((i++))
      
      done
      
      ip next add id 999 group 1/2/3/4/5/6
      
      ip next ls
      
      ========================
      
      Fixes: ab84be7e ("net: Initial nexthop code")
      Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d69100b8
    • Eric Dumazet's avatar
      ax25: fix setsockopt(SO_BINDTODEVICE) · 687775ce
      Eric Dumazet authored
      syzbot was able to trigger this trace [1], probably by using
      a zero optlen.
      
      While we are at it, cap optlen to IFNAMSIZ - 1 instead of IFNAMSIZ.
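
      A user-space sketch of the bounded copy, assuming the buffer is zeroed
      before copying so that a short or zero-length input never leaves
      uninitialized bytes behind. copy_devname() is a hypothetical helper, not
      the function in af_ax25.c.

```c
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ 16	/* same value as the kernel constant */

/* Copy at most IFNAMSIZ - 1 bytes so the result is always NUL-terminated. */
void copy_devname(char devname[IFNAMSIZ], const char *optval, size_t optlen)
{
	if (optlen > IFNAMSIZ - 1)
		optlen = IFNAMSIZ - 1;

	/* Zero first: with optlen == 0 nothing is copied at all. */
	memset(devname, 0, IFNAMSIZ);
	if (optlen)
		memcpy(devname, optval, optlen);
}
```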
      
      [1]
      BUG: KMSAN: uninit-value in strnlen+0xf9/0x170 lib/string.c:569
      CPU: 0 PID: 8807 Comm: syz-executor483 Not tainted 5.7.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x220 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
       strnlen+0xf9/0x170 lib/string.c:569
       dev_name_hash net/core/dev.c:207 [inline]
       netdev_name_node_lookup net/core/dev.c:277 [inline]
       __dev_get_by_name+0x75/0x2b0 net/core/dev.c:778
       ax25_setsockopt+0xfa3/0x1170 net/ax25/af_ax25.c:654
       __compat_sys_setsockopt+0x4ed/0x910 net/compat.c:403
       __do_compat_sys_setsockopt net/compat.c:413 [inline]
       __se_compat_sys_setsockopt+0xdd/0x100 net/compat.c:410
       __ia32_compat_sys_setsockopt+0x62/0x80 net/compat.c:410
       do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
       do_fast_syscall_32+0x3bf/0x6d0 arch/x86/entry/common.c:398
       entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139
      RIP: 0023:0xf7f57dd9
      Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
      RSP: 002b:00000000ffae8c1c EFLAGS: 00000217 ORIG_RAX: 000000000000016e
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000101
      RDX: 0000000000000019 RSI: 0000000020000000 RDI: 0000000000000004
      RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      
      Local variable ----devname@ax25_setsockopt created at:
       ax25_setsockopt+0xe6/0x1170 net/ax25/af_ax25.c:536
       ax25_setsockopt+0xe6/0x1170 net/ax25/af_ax25.c:536
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      687775ce
    • David S. Miller's avatar
      Merge branch 'wireguard-fixes' · 53cb0995
      David S. Miller authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 5.7-rc7
      
      Hopefully these are the last fixes for 5.7:
      
      1) A trivial bump in the selftest harness to support gcc-10.
         build.wireguard.com is still on gcc-9 but I'll probably switch to
         gcc-10 in the coming weeks.
      
      2) A concurrency fix regarding userspace modifying the pre-shared key at
         the same time as packets are being processed, reported by Matt
         Dunwoodie.
      
      3) We were previously clearing skb->hash on egress, which broke
         fq_codel, cake, and other things that actually make use of the flow
         hash for queueing, reported by Dave Taht and Toke Høiland-Jørgensen.
      
      4) A fix for the increased memory usage caused by (3). This can be
         thought of as part of patch (3), but because of the separate
         reasoning and breadth of it I thought made it a bit cleaner to put in
         a standalone commit.
      
      Fixes (2), (3), and (4) are -stable material.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53cb0995
    • Jason A. Donenfeld's avatar
      wireguard: noise: separate receive counter from send counter · a9e90d99
      Jason A. Donenfeld authored
      In "wireguard: queueing: preserve flow hash across packet scrubbing", we
      were required to slightly increase the size of the receive replay
      counter to something still fairly small, but an increase nonetheless.
      It turns out that we can recoup some of the additional memory overhead
      by splitting up the prior union type into two distinct types. Before, we
      used the same "noise_counter" union for both sending and receiving, with
      sending just using a simple atomic64_t, while receiving used the full
      replay counter checker. This meant that most of the memory being
      allocated for the sending counter was being wasted. Since the old
      "noise_counter" type increased in size in the prior commit, now is a
      good time to split up that union type into a distinct
      "noise_replay_counter" for receiving and a boring atomic64_t for sending,
      each using neither more nor less memory than required.
      
      Also, since sometimes the replay counter is accessed without
      necessitating additional accesses to the bitmap, we can reduce cache
      misses by hoisting the always-necessary lock above the bitmap in the
      struct layout. We also change a "noise_replay_counter" stack allocation
      to kmalloc in a -DDEBUG selftest so that KASAN doesn't trigger a stack
      frame warning.
      
      All in all, removing a bit of abstraction in this commit makes the code
      simpler and smaller, in addition to the motivating memory usage
      recuperation. For example, passing around raw "noise_symmetric_key"
      structs is something that really only makes sense within noise.c, in the
      one place where the sending and receiving keys can safely be thought of
      as the same type of object; subsequent to that, it's important that we
      uniformly access these through keypair->{sending,receiving}, where their
      distinct roles are always made explicit. So this patch allows us to draw
      that distinction clearly as well.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9e90d99
    • Jason A. Donenfeld's avatar
      wireguard: queueing: preserve flow hash across packet scrubbing · c78a0b4a
      Jason A. Donenfeld authored
      It's important that we clear most header fields during encapsulation and
      decapsulation, because the packet is substantially changed, and we don't
      want any info leak or logic bug due to an accidental correlation. But,
      for encapsulation, it's wrong to clear skb->hash, since it's used by
      fq_codel and flow dissection in general. Without it, classification does
      not proceed as usual. This change might make it easier to estimate the
      number of inner flows by examining clustering of out-of-order packets,
      but this shouldn't open up anything that can't already be inferred
      otherwise (e.g. syn packet size inference), and fq_codel can be disabled
      anyway.
      
      Furthermore, it might be the case that the hash isn't used or queried at
      all until after wireguard transmits the encrypted UDP packet, which
      means skb->hash might still be zero at this point, and thus no hash
      taken over the inner packet data. In order to address this situation, we
      force a calculation of skb->hash before encrypting packet data.
      
      Of course this means that fq_codel might transmit packets slightly more
      out of order than usual. Toke did some testing on beefy machines with
      high quantities of parallel flows and found that increasing the
      replay-attack counter to 8192 takes care of the most pathological cases
      pretty well.
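
      For readers unfamiliar with replay windows, the trade-off between window
      size and tolerated reordering can be sketched as below. This is a
      generic sliding-window check, not WireGuard's implementation; the names
      and the small REPLAY_WINDOW value are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define REPLAY_WINDOW 64	/* illustrative; the text above uses 8192 */

struct replay_window {
	bool started;
	uint64_t highest;		/* highest counter accepted so far */
	bool seen[REPLAY_WINDOW];	/* slot = counter % REPLAY_WINDOW */
};

/* Returns true if ctr is new and inside the window, false otherwise. */
bool replay_accept(struct replay_window *w, uint64_t ctr)
{
	if (!w->started) {
		memset(w->seen, 0, sizeof(w->seen));
		w->started = true;
		w->highest = ctr;
		w->seen[ctr % REPLAY_WINDOW] = true;
		return true;
	}
	if (ctr > w->highest) {
		/* Window moves forward: forget slots that fall out of it. */
		if (ctr - w->highest >= REPLAY_WINDOW) {
			memset(w->seen, 0, sizeof(w->seen));
		} else {
			for (uint64_t c = w->highest + 1; c <= ctr; c++)
				w->seen[c % REPLAY_WINDOW] = false;
		}
		w->highest = ctr;
		w->seen[ctr % REPLAY_WINDOW] = true;
		return true;
	}
	if (w->highest - ctr >= REPLAY_WINDOW)
		return false;	/* too old: reordered beyond the window */
	if (w->seen[ctr % REPLAY_WINDOW])
		return false;	/* duplicate */
	w->seen[ctr % REPLAY_WINDOW] = true;
	return true;
}
```

      A packet delayed by more than REPLAY_WINDOW counters is rejected, which
      is why heavier reordering calls for a larger window.
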
      Reported-by: Dave Taht <dave.taht@gmail.com>
      Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c78a0b4a
    • Jason A. Donenfeld's avatar
      wireguard: noise: read preshared key while taking lock · bc67d371
      Jason A. Donenfeld authored
      Previously we read the preshared key after dropping the handshake lock, which
      isn't an actual crypto issue if it races, but it's still not quite
      correct. So copy that part of the state into a temporary like we do with
      the rest of the handshake state variables. Then we can release the lock,
      operate on the temporary, and zero it out at the end of the function. In
      performance tests, the impact of this was entirely unnoticeable, probably
      because those bytes are coming from the same cacheline as other things
      that are being copied out in the same manner.
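
      The copy-under-lock pattern can be sketched in user space as follows.
      The spinlock built on atomic_flag stands in for the handshake lock, and
      struct handshake, consume_key() and secure_zero() are hypothetical
      names; the XOR fold merely stands in for the real key usage.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN 32

struct handshake {
	atomic_flag lock;		/* stand-in for the handshake lock */
	uint8_t preshared_key[KEY_LEN];
};

static void hs_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void hs_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Zero through a volatile pointer so the stores are not optimized away. */
static void secure_zero(void *p, size_t n)
{
	volatile uint8_t *v = p;

	while (n--)
		*v++ = 0;
}

uint8_t consume_key(struct handshake *hs)
{
	uint8_t tmp[KEY_LEN], acc = 0;

	hs_lock(&hs->lock);
	memcpy(tmp, hs->preshared_key, KEY_LEN);	/* snapshot under the lock */
	hs_unlock(&hs->lock);

	for (size_t i = 0; i < KEY_LEN; i++)		/* work on the private copy */
		acc ^= tmp[i];

	secure_zero(tmp, sizeof(tmp));			/* wipe the temporary */
	return acc;
}
```
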
      Reported-by: Matt Dunwoodie <ncon@noconroy.net>
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc67d371
    • Jason A. Donenfeld's avatar
      wireguard: selftests: use newer iproute2 for gcc-10 · ee3c1aa3
      Jason A. Donenfeld authored
      gcc-10 switched to defaulting to -fno-common, which broke iproute2-5.4.
      This was fixed in iproute2-5.6, so switch to that. Because we're after a
      stable testing surface, we generally don't like to bump these
      unnecessarily, but in this case, being able to actually build is a basic
      necessity.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee3c1aa3
    • Andrii Nakryiko's avatar
      bpf: Prevent mmap()'ing read-only maps as writable · dfeb376d
      Andrii Nakryiko authored
      As discussed in [0], it's dangerous to allow mapping a BPF map that's meant to
      be frozen and is read-only on the BPF program side, because that allows user space
      to keep a writable view of the page even after it is frozen. This is
      exacerbated by the BPF verifier making a strong assumption that the contents of such
      a frozen map will remain unchanged. To prevent this, disallow mapping
      BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever.
      
        [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/
      
      Fixes: fc970227 ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
      Suggested-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jann Horn <jannh@google.com>
      Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
      dfeb376d
    • KP Singh's avatar
      security: Fix hook iteration for secid_to_secctx · 0550cfe8
      KP Singh authored
      secid_to_secctx is not stackable, and since the BPF LSM registers this
      hook by default, the call_int_hook logic, which "bails-on-fail", is not
      suitable: it causes issues when other LSMs register this hook and
      eventually breaks Audit.
      
      In order to fix this, directly iterate over the security hooks instead
      of using call_int_hook as suggested in:
      
      https://lore.kernel.org/bpf/9d0eb6c6-803a-ff3a-5603-9ad6d9edfc00@schaufler-ca.com/#t
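
      The direct iteration can be sketched as below, assuming that a hook
      which does not handle the request returns a default "not supported"
      value and that the first other result wins. HOOK_DEFAULT, the hook
      array and the two sample hooks are hypothetical, not the LSM framework's
      data structures.

```c
#include <errno.h>
#include <stddef.h>

#define HOOK_DEFAULT (-EOPNOTSUPP)	/* assumed "not handled" value */

typedef int (*secctx_hook)(unsigned int secid);

/* Walk the hooks in order and return the first result that is not the default. */
int first_handled(const secctx_hook *hooks, size_t n, unsigned int secid)
{
	for (size_t i = 0; i < n; i++) {
		int rc = hooks[i](secid);

		if (rc != HOOK_DEFAULT)
			return rc;
	}
	return HOOK_DEFAULT;
}

/* A hook registered by default that never handles the request. */
int passive_hook(unsigned int secid)
{
	(void)secid;
	return HOOK_DEFAULT;
}

/* A hook that actually answers: 0 on success, -EINVAL for secid 0. */
int active_hook(unsigned int secid)
{
	return secid ? 0 : -EINVAL;
}
```

      A loop that stopped at the first non-zero return would end at
      passive_hook() and never reach active_hook().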
      
      Fixes: 98e828a0 ("security: Refactor declaration of LSM hooks")
      Fixes: 625236ba ("security: Fix the default value of secid_to_secctx hook")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Link: https://lore.kernel.org/bpf/20200520125616.193765-1-kpsingh@chromium.org
      0550cfe8
  3. 20 May, 2020 2 commits
    • David Howells's avatar
      rxrpc: Fix ack discard · 441fdee1
      David Howells authored
      The Rx protocol has a "previousPacket" field in it that is not handled in
      the same way by all protocol implementations.  Sometimes it contains the
      serial number of the last DATA packet received, sometimes the sequence
      number of the last DATA packet received and sometimes the highest sequence
      number so far received.
      
      AF_RXRPC is using this to weed out ACKs that are out of date (it's possible
      for ACK packets to get reordered on the wire), but this does not work with
      OpenAFS which will just stick the sequence number of the last packet seen
      into previousPacket.
      
      The issue being seen is that big AFS FS.StoreData RPCs (e.g. of ~256MiB) are
      timing out when partly sent.  A trace was captured, with an additional
      tracepoint to show ACKs being discarded in rxrpc_input_ack().  Here's an
      excerpt showing the problem.
      
       52873.203230: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 0002449c q=00024499 fl=09
      
      A DATA packet with sequence number 00024499 has been transmitted (the "q="
      field).
      
       ...
       52873.243296: rxrpc_rx_ack: c=000004ae 00012a2b DLY r=00024499 f=00024497 p=00024496 n=0
       52873.243376: rxrpc_rx_ack: c=000004ae 00012a2c IDL r=0002449b f=00024499 p=00024498 n=0
       52873.243383: rxrpc_rx_ack: c=000004ae 00012a2d OOS r=0002449d f=00024499 p=0002449a n=2
      
      The Out-Of-Sequence ACK indicates that the server didn't see DATA sequence
      number 00024499, but did see seq 0002449a (previousPacket, shown as "p=",
      skipped the number, but firstPacket, "f=", which shows the bottom of the
      window is set at that point).
      
       52873.252663: rxrpc_retransmit: c=000004ae q=24499 a=02 xp=14581537
       52873.252664: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244bc q=00024499 fl=0b *RETRANS*
      
      The packet has been retransmitted.  Retransmission recurs until the peer
      says it got the packet.
      
       52873.271013: rxrpc_rx_ack: c=000004ae 00012a31 OOS r=000244a1 f=00024499 p=0002449e n=6
      
      More OOS ACKs indicate that the other packets that are already in the
      transmission pipeline are being received.  The specific-ACK list is up to 6
      ACKs and NAKs.
      
       ...
       52873.284792: rxrpc_rx_ack: c=000004ae 00012a49 OOS r=000244b9 f=00024499 p=000244b6 n=30
       52873.284802: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=63505500
       52873.284804: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c2 q=00024499 fl=0b *RETRANS*
       52873.287468: rxrpc_rx_ack: c=000004ae 00012a4a OOS r=000244ba f=00024499 p=000244b7 n=31
       52873.287478: rxrpc_rx_ack: c=000004ae 00012a4b OOS r=000244bb f=00024499 p=000244b8 n=32
      
      At this point, the server's receive window is full (n=32) with presumably 1
      NAK'd packet and 31 ACK'd packets.  We can't transmit any more packets.
      
       52873.287488: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=61327980
       52873.287489: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c3 q=00024499 fl=0b *RETRANS*
       52873.293850: rxrpc_rx_ack: c=000004ae 00012a4c DLY r=000244bc f=000244a0 p=00024499 n=25
      
      And now we've received an ACK indicating that a DATA retransmission was
      received.  7 packets have been processed (the occupied part of the window
      moved, as indicated by f= and n=).
      
       52873.293853: rxrpc_rx_discard_ack: c=000004ae r=00012a4c 000244a0<00024499 00024499<000244b8
      
      However, the DLY ACK gets discarded because its previousPacket has gone
      backwards (from p=000244b8, in the ACK at 52873.287478 to p=00024499 in the
      ACK at 52873.293850).
      
      We then end up in a continuous cycle of retransmit/discard.  kafs fails to
      update its window because it's discarding the ACKs and can't transmit an
      extra packet that would clear the issue because the window is full.
      OpenAFS doesn't change the previousPacket value in the ACKs because no new
      DATA packets are received with a different previousPacket number.
      
      Fix this by altering the discard check to only discard an ACK based on
      previousPacket if there was no advance in the firstPacket.  This allows us
      to transmit a new packet which will cause previousPacket to advance in the
      next ACK.
      
      The check, however, needs to allow for the possibility that previousPacket
      may actually have had the serial number placed in it instead - in which
      case it will go outside the window and we should ignore it.
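
      The altered check can be sketched as follows, using the values from the
      trace above. This is a simplified illustration: struct ack_state and
      ack_is_usable() are hypothetical names, and the handling of a serial
      number in previousPacket (the out-of-window case) is deliberately left
      out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequence comparisons that tolerate 32-bit wraparound. */
static bool seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static bool seq_after(uint32_t a, uint32_t b)  { return (int32_t)(a - b) > 0; }

struct ack_state {
	uint32_t first_seq;	/* firstPacket from the last accepted ACK */
	uint32_t prev_seq;	/* previousPacket from the last accepted ACK */
};

bool ack_is_usable(const struct ack_state *st, uint32_t first_pkt,
		   uint32_t prev_pkt)
{
	if (seq_after(first_pkt, st->first_seq))
		return true;	/* window advanced: previousPacket is ignored */
	if (seq_before(first_pkt, st->first_seq))
		return false;	/* firstPacket went backwards: stale ACK */
	/* No advance in firstPacket: fall back to previousPacket. */
	return !seq_before(prev_pkt, st->prev_seq);
}
```

      With first_seq=0x24499 and prev_seq=0x244b8, the DLY ACK carrying
      f=000244a0 p=00024499 is accepted because firstPacket advanced.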
      
      Fixes: 1a2391c3 ("rxrpc: Fix detection of out of order acks")
      Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
      Signed-off-by: David Howells <dhowells@redhat.com>
      441fdee1
    • David Howells's avatar
      rxrpc: Trace discarded ACKs · d1f12947
      David Howells authored
      Add a tracepoint to track received ACKs that are discarded due to being
      outside of the Tx window.
      Signed-off-by: David Howells <dhowells@redhat.com>
      d1f12947
  4. 19 May, 2020 5 commits
    • Neil Horman's avatar
      sctp: Don't add the shutdown timer if its already been added · 20a785aa
      Neil Horman authored
      This BUG halt was reported a while back, but the patch somehow got
      missed:
      
      PID: 2879   TASK: c16adaa0  CPU: 1   COMMAND: "sctpn"
       #0 [f418dd28] crash_kexec at c04a7d8c
       #1 [f418dd7c] oops_end at c0863e02
       #2 [f418dd90] do_invalid_op at c040aaca
       #3 [f418de28] error_code (via invalid_op) at c08631a5
          EAX: f34baac0  EBX: 00000090  ECX: f418deb0  EDX: f5542950  EBP: 00000000
          DS:  007b      ESI: f34ba800  ES:  007b      EDI: f418dea0  GS:  00e0
          CS:  0060      EIP: c046fa5e  ERR: ffffffff  EFLAGS: 00010286
       #4 [f418de5c] add_timer at c046fa5e
       #5 [f418de68] sctp_do_sm at f8db8c77 [sctp]
       #6 [f418df30] sctp_primitive_SHUTDOWN at f8dcc1b5 [sctp]
       #7 [f418df48] inet_shutdown at c080baf9
       #8 [f418df5c] sys_shutdown at c079eedf
       #9 [f418df70] sys_socketcall at c079fe88
          EAX: ffffffda  EBX: 0000000d  ECX: bfceea90  EDX: 0937af98
          DS:  007b      ESI: 0000000c  ES:  007b      EDI: b7150ae4
          SS:  007b      ESP: bfceea7c  EBP: bfceeaa8  GS:  0033
          CS:  0073      EIP: b775c424  ERR: 00000066  EFLAGS: 00000282
      
      It appears that the side effect that starts the shutdown timer was processed
      multiple times, which can happen as multiple paths can trigger it.  This of
      course leads to the BUG halt in add_timer getting called.
      
      The fix seems pretty straightforward: just check before the timer is added
      whether it has already been started.  If it has, mod the timer instead to
      min(current expiration, new expiration).
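
      The start-or-shorten logic can be sketched as below. struct timer and
      timer_start_or_shorten() are hypothetical stand-ins for the kernel timer
      API, and the plain comparison ignores counter wraparound for brevity.

```c
#include <stdbool.h>

struct timer {
	bool pending;
	unsigned long expires;
};

/* Arm the timer, or if it is already armed keep the earlier expiry. */
void timer_start_or_shorten(struct timer *t, unsigned long expires)
{
	if (!t->pending) {
		t->expires = expires;
		t->pending = true;
		return;
	}
	/* Already armed: never add it twice, only move it earlier. */
	if (expires < t->expires)
		t->expires = expires;
}
```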
      
      It has been tested but not confirmed to fix the problem, as the issue has only
      occurred in production environments where test kernels are enjoined from being
      installed.  It appears to be a sane fix to me though.  Also, recently,
      Jere found a reproducer posted on list to confirm that this resolves the
      issues.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jere.leppanen@nokia.com
      CC: marcelo.leitner@gmail.com
      CC: netdev@vger.kernel.org
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20a785aa
    • Boris Sukholitko's avatar
      __netif_receive_skb_core: pass skb by reference · c0bbbdc3
      Boris Sukholitko authored
      __netif_receive_skb_core may change the skb pointer passed into it (e.g.
      in rx_handler). The original skb may be freed as a result of this
      operation.
      
      The callers of __netif_receive_skb_core may further process original skb
      by using pt_prev pointer returned by __netif_receive_skb_core thus
      leading to unpleasant effects.
      
      The solution is to pass skb by reference into __netif_receive_skb_core.
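
      Passing by reference can be sketched in user space as follows. struct
      buf and process() are hypothetical; the point is only that the callee
      may free and replace the buffer, and the caller then continues with the
      updated pointer instead of the stale one.

```c
#include <stdlib.h>
#include <string.h>

struct buf {
	size_t len;
	char data[32];
};

/*
 * May replace *pbuf with a new buffer and free the old one. Because the
 * pointer is passed by reference, the caller never keeps a dangling copy.
 */
int process(struct buf **pbuf)
{
	struct buf *old = *pbuf;
	struct buf *copy = malloc(sizeof(*copy));

	if (!copy)
		return -1;
	memcpy(copy, old, sizeof(*copy));
	copy->data[0] = 'A';	/* some modification made on the new buffer */
	free(old);
	*pbuf = copy;		/* publish the replacement to the caller */
	return 0;
}
```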
      
      v2: Added Fixes tag and comment regarding ppt_prev and skb invariant.
      
      Fixes: 88eb1944 ("net: core: propagate SKB lists through packet_type lookup")
      Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0bbbdc3
    • Martin KaFai Lau's avatar
      net: inet_csk: Fix so_reuseport bind-address cache in tb->fast* · 88d7fcfa
      Martin KaFai Lau authored
      The commit 637bc8bb ("inet: reset tb->fastreuseport when adding a reuseport sk")
      added a bind-address cache in tb->fast*.  The tb->fast* caches the address
      of a sk which has successfully been bound with SO_REUSEPORT ON.  The idea
      is to avoid the expensive conflict search in inet_csk_bind_conflict().
      
      There is an issue with wildcard matching where sk_reuseport_match() should
      have returned false but it is currently returning true.  It ends up
      hiding bind conflict.  For example,
      
      bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
      bind("[::2]:443"); /* with    SO_REUSEPORT. Succeed. */
      bind("[::]:443");  /* with    SO_REUSEPORT. Still Succeed where it shouldn't */
      
      The last bind("[::]:443") with SO_REUSEPORT on should have failed because
      it should have a conflict with the very first bind("[::1]:443") which
      has SO_REUSEPORT off.  However, the address "[::2]" is cached in
      tb->fast* in the second bind. In the last bind, the sk_reuseport_match()
      returns true because the binding sk's wildcard addr "[::]" matches with
      the "[::2]" cached in tb->fast*.
      
      The correct bind conflict is reported by removing the second
      bind such that tb->fast* cache is not involved and forces the
      bind("[::]:443") to go through the inet_csk_bind_conflict():
      
      bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
      bind("[::]:443");  /* with    SO_REUSEPORT. -EADDRINUSE */
      
      The expected behavior for sk_reuseport_match() is, it should only allow
      the "cached" tb->fast* address to be used as a wildcard match but not
      the address of the binding sk.  To do that, the current
      "bool match_wildcard" arg is split into
      "bool match_sk1_wildcard" and "bool match_sk2_wildcard".
      
      This change only affects the sk_reuseport_match() which is only
      used by inet_csk (e.g. TCP).
      The other use cases are calling inet_rcv_saddr_equal() and
      this patch makes it pass the same "match_wildcard" arg twice to
      the "ipv[46]_rcv_saddr_equal(..., match_wildcard, match_wildcard)".
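
      The effect of splitting the flag can be sketched with simplified
      addresses, where 0 stands for the wildcard. The function below is a
      hypothetical reduction of the idea, not the kernel's
      ipv[46]_rcv_saddr_equal().

```c
#include <stdbool.h>
#include <stdint.h>

#define ADDR_ANY 0u	/* stand-in for the wildcard address */

/* Each side is allowed to act as a wildcard only if its own flag is set. */
bool rcv_saddr_equal(uint32_t addr1, uint32_t addr2,
		     bool match_addr1_wildcard, bool match_addr2_wildcard)
{
	if (addr1 == addr2)
		return true;
	if (match_addr1_wildcard && addr1 == ADDR_ANY)
		return true;
	if (match_addr2_wildcard && addr2 == ADDR_ANY)
		return true;
	return false;
}
```

      With the cached address as addr1 and the binding socket's address as
      addr2, calling it with (true, false) no longer lets a binding wildcard
      match a cached specific address, while (true, true) reproduces the old
      single-flag behavior.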
      
      Cc: Josef Bacik <jbacik@fb.com>
      Fixes: 637bc8bb ("inet: reset tb->fastreuseport when adding a reuseport sk")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88d7fcfa
    • Marc Payne's avatar
      r8152: support additional Microsoft Surface Ethernet Adapter variant · c27a2043
      Marc Payne authored
      Device id 0927 is the RTL8153B-based component of the 'Surface USB-C to
      Ethernet and USB Adapter' and may be used as a component of other devices
      in future. Tested and working with the r8152 driver.
      
      Update the cdc_ether blacklist due to the RTL8153 'network jam on suspend'
      issue which this device will cause (personally confirmed).
      Signed-off-by: Marc Payne <marc.payne@mdpsys.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c27a2043
    • Todd Malsbary's avatar
      mptcp: use rightmost 64 bits in ADD_ADDR HMAC · 12555a2d
      Todd Malsbary authored
      This changes the HMAC used in the ADD_ADDR option from the leftmost 64
      bits to the rightmost 64 bits as described in RFC 8684, section 3.4.1.
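
      The difference between the two truncations can be sketched as below,
      assuming a 32-byte digest read in big-endian byte order. The helper
      names are hypothetical and no actual HMAC is computed here.

```c
#include <stddef.h>
#include <stdint.h>

#define DIGEST_LEN 32	/* assumed digest size (HMAC-SHA256) */

/* Read 8 bytes as a big-endian 64-bit value. */
static uint64_t be64_at(const uint8_t *p)
{
	uint64_t v = 0;

	for (size_t i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Old behavior: first 8 bytes of the digest. */
uint64_t hmac_leftmost64(const uint8_t digest[DIGEST_LEN])
{
	return be64_at(digest);
}

/* New behavior: last 8 bytes of the digest. */
uint64_t hmac_rightmost64(const uint8_t digest[DIGEST_LEN])
{
	return be64_at(digest + DIGEST_LEN - 8);
}
```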
      
      This issue was discovered while adding support to packetdrill for the
      ADD_ADDR v1 option.
      
      Fixes: 3df523ab ("mptcp: Add ADD_ADDR handling")
      Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com>
      Acked-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12555a2d