1. 17 May, 2018 40 commits
    • Manish Chopra's avatar
      qede: Add build_skb() support. · 8a863397
      Manish Chopra authored
      This patch uses build_skb() throughout the driver's receive data
      path [HW GRO flow and non-HW GRO flow]. With this, the driver can
      build an skb directly from the page segments that are already mapped
      to the hardware, instead of allocating a new skb via netdev_alloc_skb()
      and copying the data with memcpy(), which is quite costly.
      
      This improves performance in terms of CPU utilization while keeping
      rx throughput the same or slightly higher. CPU utilization is
      significantly reduced [almost half] in the non-HW GRO flow, where for
      every incoming MTU-sized packet the driver had to allocate an skb,
      memcpy the headers, etc. In that flow it also gets rid of a bunch of
      additional overhead [eth_get_headlen() etc.] needed to split headers
      and data in the skb.
      
      Tested with:
      system: 2 sockets, 4 cores per socket, hyperthreading, 2x4x2=16 cores
      iperf [server]: iperf -s
      iperf [client]: iperf -c <server_ip> -t 500 -i 10 -P 32
      
      HW GRO off - w/o build_skb(), throughput: 36.8 Gbits/sec
      
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
      Average:     all    0.59    0.00   32.93    0.00    0.00   43.07    0.00    0.00   23.42
      
      HW GRO off - with build_skb(), throughput: 36.9 Gbits/sec
      
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
      Average:     all    0.70    0.00   31.70    0.00    0.00   25.68    0.00    0.00   41.92
      
      HW GRO on - w/o build_skb(), throughput: 36.9 Gbits/sec
      
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
      Average:     all    0.86    0.00   24.14    0.00    0.00    6.59    0.00    0.00   68.41
      
      HW GRO on - with build_skb(), throughput: 37.5 Gbits/sec
      
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
      Average:     all    0.87    0.00   23.75    0.00    0.00    6.19    0.00    0.00   69.19
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a863397
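To put the "almost half" remark next to the mpstat figures quoted in the qede commit above, the relative reductions can be worked out with a small helper. This is only arithmetic on the numbers in the commit message; the function name `reduction_pct()` is made up for this sketch.

```c
#include <assert.h>

/* Relative reduction, in percent, going from `before` to `after`. */
static double reduction_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Applied to the "HW GRO off" rows, `reduction_pct(43.07, 25.68)` gives roughly 40% for %soft, and `reduction_pct(100.0 - 23.42, 100.0 - 41.92)` gives roughly 24% for total non-idle CPU.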
    • David S. Miller's avatar
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 56a9a9e7
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      10GbE Intel Wired LAN Driver Updates 2018-05-17
      
      This series contains updates to ixgbe, ixgbevf and ice drivers.
      
      Cathy Zhou resolves sparse warnings by using the force attribute.
      
      Mauro S M Rodrigues fixes a bug where IRQs were not freed if the PCI
      error recovery system opts to remove the device. In that case
      ixgbe_io_error_detected() returned PCI_ERS_RESULT_DISCONNECT before
      calling ixgbe_close_suspend(), so the IRQs were never freed and the
      kernel crashed when the remove handler called pci_disable_device().
      Resolved this by calling ixgbe_close_suspend() before evaluating the
      PCI channel state.
      
      Pavel Tatashin releases the rtnl_lock during the call to
      ixgbe_close_suspend() to allow scaling if device_shutdown() is
      multi-threaded.
      
      Emil modifies ixgbe to not validate the MAC address during a reset,
      unless the MAC was set on the host so that the VF will get a new MAC
      address every time it reloads.  Also updates ixgbevf to set
      hw->mac.perm_addr in order to retain the custom MAC on a reset.
      
      Anirudh updates the ice NVM read/erase/update AQ commands to align with
      the latest specification.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56a9a9e7
    • Roman Mashak's avatar
    • Dan Carpenter's avatar
      net/ncsi: prevent a couple array underflows · 990a9d49
      Dan Carpenter authored
      We recently refactored this code and introduced a static checker
      warning.  Smatch complains that if cmd->index is zero then we would
      underflow the arrays.  That's obviously true.
      
      The question is whether we prevent cmd->index from being zero at a
      different level.  I've looked at the code and I don't immediately see
      a check for that.
      
      Fixes: 062b3e1b ("net/ncsi: Refactor MAC, VLAN filters")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      990a9d49
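A minimal sketch of the kind of guard the net/ncsi commit above describes, written as plain user-space C. The table type, `filter_slot()` and `N_FILTERS` are hypothetical and are not the ncsi code; the point is only that a 1-based index has to be rejected when it is zero before `index - 1` is used as an array subscript.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_FILTERS 4	/* hypothetical table size */

struct filter_table {
	uint16_t vids[N_FILTERS];
};

/*
 * Return a pointer to the slot for a 1-based index, or NULL when the
 * index is out of range. Without the index == 0 check, index - 1 would
 * wrap around for an unsigned type and address memory outside the array.
 */
static uint16_t *filter_slot(struct filter_table *t, unsigned int index)
{
	assert(t != NULL);
	if (index == 0 || index > N_FILTERS)
		return NULL;
	return &t->vids[index - 1];
}
```

A caller would treat a NULL return as an error and bail out before touching the table.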
    • Eric Dumazet's avatar
      net/smc: init conn.tx_work & conn.send_lock sooner · be7f3e59
      Eric Dumazet authored
      syzkaller found that the following program crashes the host:
      
      {
        int fd = socket(AF_SMC, SOCK_STREAM, 0);
        int val = 1;
      
        listen(fd, 0);
        shutdown(fd, SHUT_RDWR);
        setsockopt(fd, 6, TCP_NODELAY, &val, 4);
      }
      
      Simply initialize conn.tx_work & conn.send_lock at socket creation,
      rather than deeper in the stack.
      
      ODEBUG: assert_init not available (active state 0) object type: timer_list hint:           (null)
      WARNING: CPU: 1 PID: 13988 at lib/debugobjects.c:329 debug_print_object+0x16a/0x210 lib/debugobjects.c:326
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 13988 Comm: syz-executor0 Not tainted 4.17.0-rc4+ #46
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1b9/0x294 lib/dump_stack.c:113
       panic+0x22f/0x4de kernel/panic.c:184
       __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
       report_bug+0x252/0x2d0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
      RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326
      RSP: 0018:ffff880197a37880 EFLAGS: 00010086
      RAX: 0000000000000061 RBX: 0000000000000005 RCX: ffffc90001ed0000
      RDX: 0000000000004aaf RSI: ffffffff8160f6f1 RDI: 0000000000000001
      RBP: ffff880197a378c0 R08: ffff8801aa7a0080 R09: ffffed003b5e3eb2
      R10: ffffed003b5e3eb2 R11: ffff8801daf1f597 R12: 0000000000000001
      R13: ffffffff88d96980 R14: ffffffff87fa19a0 R15: ffffffff81666ec0
       debug_object_assert_init+0x309/0x500 lib/debugobjects.c:692
       debug_timer_assert_init kernel/time/timer.c:724 [inline]
       debug_assert_init kernel/time/timer.c:776 [inline]
       del_timer+0x74/0x140 kernel/time/timer.c:1198
       try_to_grab_pending+0x439/0x9a0 kernel/workqueue.c:1223
       mod_delayed_work_on+0x91/0x250 kernel/workqueue.c:1592
       mod_delayed_work include/linux/workqueue.h:541 [inline]
       smc_setsockopt+0x387/0x6d0 net/smc/af_smc.c:1367
       __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
       __do_sys_setsockopt net/socket.c:1914 [inline]
       __se_sys_setsockopt net/socket.c:1911 [inline]
       __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
       do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: 01d2f7e2 ("net/smc: sockopts TCP_NODELAY and TCP_CORK")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ursula Braun <ubraun@linux.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be7f3e59
    • Jiri Pirko's avatar
      nfp: flower: fix error path during representor creation · 3b734ff6
      Jiri Pirko authored
      Don't store the repr pointer in the reprs array until the representor
      is successfully created. This avoids the message about "representor
      destruction" when it was never created, and it also cleans up the flow.
      Also, check the return value after port alloc.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b734ff6
    • David S. Miller's avatar
      Merge branch 'mvpp2-small-improvements' · 6d9f868f
      David S. Miller authored
      Antoine Tenart says:
      
      ====================
      net: mvpp2: small improvements
      
      Those 3 patches are small improvements to the Marvell PPv2 driver. The
      series does not conflict with the one sent about phylink and
      1000/2500baseX support, so the two series can live in parallel.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d9f868f
    • Yan Markman's avatar
      net: mvpp2: print rx error with rate-limit · 934e0f83
      Yan Markman authored
      Prevent flood of RX error prints during heavy traffic with weak signal
      in link by checking net_ratelimit() before using netdev_err().
      Signed-off-by: Yan Markman <ymarkman@marvell.com>
      [Antoine: small rework, commit message]
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      934e0f83
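To illustrate the idea of gating a log call behind a rate limiter, as the mvpp2 commit above does with net_ratelimit(), here is a user-space sketch. It is not the kernel's net_ratelimit() implementation; the `ratelimit` struct, `ratelimit_ok()` and the burst and interval values are assumptions chosen for the example. The caller passes the current time explicitly so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limiter: at most `burst` events per `interval_ms` window. */
struct ratelimit {
	uint64_t interval_ms;
	unsigned int burst;
	uint64_t window_start_ms;
	unsigned int printed;	/* events allowed in the current window */
	unsigned int missed;	/* events suppressed in the current window */
};

/* Return true when the caller may log at time now_ms. */
static bool ratelimit_ok(struct ratelimit *rl, uint64_t now_ms)
{
	assert(rl != NULL);
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		/* A new window starts: reset the counters. */
		rl->window_start_ms = now_ms;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

A caller would write `if (ratelimit_ok(&rl, now)) log_error(...);`, which mirrors the "check the limiter, then print" pattern named in the commit message.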
    • Yan Markman's avatar
      net: mvpp2: set mac address does not require the stop/start sequence · 5b0ab2f4
      Yan Markman authored
      Remove special stop/start handling from the set_mac_address callback.
      All this special care is not needed, and can be removed. It also
      simplifies the up/down status in the driver and helps avoid possible
      link status mismatch issues.
      Signed-off-by: Yan Markman <ymarkman@marvell.com>
      [Antoine: commit message]
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b0ab2f4
    • Yan Markman's avatar
      net: mvpp2: avoid checking for free aggregated descriptors twice · 914365f1
      Yan Markman authored
      Avoid repeating the check for free aggregated descriptors when it
      already failed at the beginning of the function.
      Signed-off-by: Yan Markman <ymarkman@marvell.com>
      [Antoine: commit message]
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      914365f1
    • David S. Miller's avatar
      Merge branch 'mvpp2-phylink-conversion' · 808e2fc3
      David S. Miller authored
      Antoine Tenart says:
      
      ====================
      net: mvpp2: phylink conversion
      
      This series converts the Marvell PPv2 driver to phylink (models the MAC
      to PHY link).
      
      One important point is the PPv2 driver supports two probe modes: device
      tree and ACPI. This series only brings phylink support for the device
      tree mode, as the ACPI one will need further work. Still, the driver
      should be working as before when using ACPI. This split should be
      temporary, and was discussed with Marcin (in Cc.) who added ACPI support
      to the driver.
      
      Also, the SFP cages on both DB boards can be considered as non-wired.
      We thus chose not to describe those SFP cages and we use fixed-link.
      
      The rest of the series uses phylink to add support for 1000BaseX and
      2500BaseX modes in the PPv2 driver. To do this, two patches are needed
      in the common PHY framework (patches 3 and 4). The last 4 patches modify
      the device tree to use the new PPv2 functionalities.
      
      The series has been tested for the device tree mode on the 7040-db,
      8040-db and 8040-mcbin boards, to ensure all the interfaces were working
      as expected.
      
      @Dave: patches 7 to 10 should go through the mvebu tree (Gregory in
      Cc.) to avoid any conflict with the other mvebu dt patches taken during
      this cycle.
      
      The series is based on today's net-next.
      
      Since v2:
        - Removed the SFP description from the DB boards, as their SFP cages
          are wired properly. We now use fixed-link.
        - Because of this rework, split the series in two, so that the SFP
          part is reviewed separately.
        - Small fixes in the phylink patch.
        - Rebased on the latest net-next branch.
      
      Since v1:
        - Chose a different approach to the SFP changes, as the previous ones
          weren't valid and reworked both DB boards device trees.
        - Misc fixes.
        - Added Kishon's acked-by on one patch.
        - Rebased on latest net-next branch.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      808e2fc3
    • Antoine Tenart's avatar
      net: mvpp2: 2500baseX support · a6fe31de
      Antoine Tenart authored
      This patch adds the 2500Base-X PHY mode support in the Marvell PPv2
      driver. 2500Base-X is quite close to 1000Base-X and SGMII modes and uses
      nearly the same code path.
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6fe31de
    • Antoine Tenart's avatar
      net: mvpp2: 1000baseX support · d97c9f4a
      Antoine Tenart authored
      This patch adds the 1000Base-X PHY mode support in the Marvell PPv2
      driver. 1000Base-X is quite close to SGMII and uses nearly the same
      code path.
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d97c9f4a
    • Antoine Tenart's avatar
      phy: cp110-comphy: 2.5G SGMII mode · 9ad8bd81
      Antoine Tenart authored
      This patch allows the CP110 comphy to configure some lanes in the
      2.5G SGMII mode. This mode is quite close to SGMII and uses nearly the
      same code path.
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ad8bd81
    • Antoine Tenart's avatar
      phy: add 2.5G SGMII mode to the phy_mode enum · 5490b872
      Antoine Tenart authored
      This patch adds one more generic PHY mode to the phy_mode enum, to allow
      configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
      callback.
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Acked-by: Kishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5490b872
    • Antoine Tenart's avatar
      net: mvpp2: phylink support · 4bb04326
      Antoine Tenart authored
      Convert the PPv2 driver to implement phylink helpers, and use phylink in
      DT mode. The other mode supported is ACPI, which will need further work
      in order to be entirely compatible with phylink.
      
      The MAC and GoP configuration functions were completely moved to fit
      into the phylink helpers. When a PHY is always present between the MAC
      and the physical port, only phylink is used. When this is not the
      case (the MAC is directly connected to the physical port), the link IRQ
      is used to detect changes in the link state and call phylink_mac_change.
      
      The ACPI mode does not use phylink as of now, and the changes shouldn't
      impact its use.
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bb04326
    • Antoine Tenart's avatar
      net: mvpp2: align the ethtool ops definition · dcd3e73a
      Antoine Tenart authored
      Cosmetic patch to align the ethtool functions to ops definitions. This
      patch does not change in any way the driver's behaviour.
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dcd3e73a
    • David S. Miller's avatar
      Merge tag 'wireless-drivers-next-for-davem-2018-05-17' of... · a564b659
      David S. Miller authored
      Merge tag 'wireless-drivers-next-for-davem-2018-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Kalle Valo says:
      
      ====================
      wireless-drivers-next patches for 4.18
      
      The first pull request for 4.18. As usual new features and bug fixes
      but nothing really special.
      
      I also merged wireless-drivers due to an iwlwifi patch dependency.
      
      Major changes:
      
      iwlwifi
      
      * implement Traffic Condition Monitor and use it for scan, BT coex and
        to detect when the AP doesn't support UAPSD properly
      
      * some more work for the 22000 family of devices;
      
      * introduce AMSDU rate control offload
      
      qtnfmac
      
      * DFS offload support
      
      rsi
      
      * roaming enhancements
      
      * increase max supported aggregation subframes
      
      * don't advertise 5 GHz support if the device doesn't support it
      
      brcmfmac
      
      * add support for BCM4366E chipset
      
      * add support for bcm43364 wireless chipset
      
      ath10k
      
      * enable temperature reads for QCA6174 and QCA9377
      
      * add firmware memory dump support for QCA9984
      
      * continue adding WCN3990 support via SNOC bus
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a564b659
    • YueHaibing's avatar
      vmxnet3: Replace msleep(1) with usleep_range() · 93c65d13
      YueHaibing authored
      As documented in Documentation/timers/timers-howto.txt,
      replace msleep(1) with usleep_range().
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93c65d13
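The usual argument against msleep(1), as in the vmxnet3 commit above, is timer granularity: a jiffies-based sleep is rounded up to whole ticks. The sketch below models that rounding in user space. It assumes the sleep lasts `msecs_to_jiffies(msecs) + 1` jiffies, which is believed to match msleep() in kernels of that era but should be treated as an assumption; the function names and HZ values are illustrative and are not kernel code.

```c
#include <assert.h>

/* Round a millisecond delay up to whole jiffies for a given tick rate. */
static unsigned long ms_to_jiffies_roundup(unsigned long ms, unsigned long hz)
{
	assert(hz > 0);
	return (ms * hz + 999) / 1000;
}

/*
 * Upper bound, in milliseconds, of a jiffies-based sleep, under the
 * assumption that one extra jiffy is added to guarantee the minimum delay.
 */
static unsigned long jiffy_sleep_upper_bound_ms(unsigned long ms,
						unsigned long hz)
{
	unsigned long jiffies = ms_to_jiffies_roundup(ms, hz) + 1;

	return jiffies * 1000 / hz;
}
```

Under this model a 1 ms request can take up to about 20 ms with HZ=100, which is the situation timers-howto warns about. usleep_range() is hrtimer-based and lets the caller state the acceptable range directly.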
    • Tonghao Zhang's avatar
      bonding: introduce link change helper · 7e878b60
      Tonghao Zhang authored
      Introduce a new common helper to avoid redundancy.
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e878b60
    • David S. Miller's avatar
      Merge branch 'tcp-default-RACK-loss-recovery' · 10e361e1
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      tcp: default RACK loss recovery
      
      This patch set implements the features corresponding to the
      draft-ietf-tcpm-rack-03 version of the RACK draft.
      https://datatracker.ietf.org/meeting/101/materials/slides-101-tcpm-update-on-tcp-rack-00
      
      1. SACK: implement equivalent DUPACK threshold heuristic in RACK to
         replace existing RFC6675 recovery (tcp_mark_head_lost).
      
      2. Non-SACK: simplify RFC6582 NewReno implementation
      
      3. RTO: apply RACK's time-based approach to avoid spuriously
         marking very recently sent packets lost.
      
      4. with (1)(2)(3), make RACK the exclusive fast recovery mechanism to
         mark losses based on time on S/ACK. Tail loss probe and F-RTO remain
         enabled by default as complementary mechanisms to send probes in
         CA_Open and CA_Loss states. The probes would solicit S/ACKs to trigger
         RACK time-based loss detection.
      
      All Google web and internal servers have been running RACK-only mode
      (4) for a while now. a/b experiments indicate RACK/TLP on average
      reduces recovery latency by 10% compared to RFC6675. RFC6675
      is default-off now but can be enabled by disabling RACK (sysctl
      net.ipv4.tcp_recovery=0) for unseen issues.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10e361e1
    • Yuchung Cheng's avatar
      tcp: don't mark recently sent packets lost on RTO · 56f8c5d7
      Yuchung Cheng authored
      An RTO event indicates the head has not been acked for a long time
      after its last (re)transmission. But the other packets are not
      necessarily lost if they have been only sent recently (for example
      due to application limit). This patch prevents packets sent within
      the last RTT from being marked lost on an RTO event, using logic
      similar to TCP RACK detection.
      
      Normally the head (SND.UNA) would be marked lost since RTO should
      fire strictly after the head was sent. An exception is when the
      most recent RACK RTT measurement is larger than the (previous)
      RTO. To address this exception the head is always marked lost.
      
      Congestion control interaction: since we may not mark every packet
      lost, the congestion window may be more than 1 (inflight plus 1).
      But only one packet will be retransmitted after RTO, since
      tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The
      connection still performs slow start from one packet (with Cubic
      congestion control).
      
      This commit was tested in an A/B test with Google web servers,
      and showed a reduction of 2% in (spurious) retransmits post
      timeout (SlowStartRetrans), and correspondingly reduced DSACKs
      (DSACKIgnoredOld) by 7%.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56f8c5d7
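The marking rule described in the RTO commit above can be sketched in user-space C as follows. The `pkt` struct, `rto_mark_lost()` and the microsecond timestamps are assumptions made for the example and do not reproduce the kernel's data structures; the reordering window is taken as zero for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pkt {
	uint64_t sent_us;	/* time of the last (re)transmission */
	bool sacked;
	bool lost;
};

/*
 * On RTO, mark unsacked packets lost, but skip packets that were sent
 * less than one RTT ago. An unsacked head (index 0) is always marked
 * lost, which covers the case where the RTT sample is larger than the
 * previous RTO. Returns the number of packets newly marked lost.
 */
static size_t rto_mark_lost(struct pkt *q, size_t n, uint64_t now_us,
			    uint64_t rtt_us)
{
	size_t marked = 0;

	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (q[i].sacked || q[i].lost)
			continue;
		if (i != 0 && now_us - q[i].sent_us < rtt_us)
			continue;	/* sent within the last RTT */
		q[i].lost = true;
		marked++;
	}
	return marked;
}
```

With a 20 ms RTT, a packet sent 5 ms before the timeout stays unmarked, while the head is marked lost even if it was retransmitted only 1 ms earlier.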
    • Yuchung Cheng's avatar
      tcp: new helper tcp_rack_skb_timeout · b8fef65a
      Yuchung Cheng authored
      Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack
      to prepare the final RTO change.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8fef65a
    • Yuchung Cheng's avatar
      tcp: separate loss marking and state update on RTO · c77d62ff
      Yuchung Cheng authored
      Previously when TCP times out, it first updates cwnd and ssthresh,
      marks packets lost, and then updates congestion state again. This
      was fine because everything not yet delivered is marked lost,
      so the inflight is always 0 and cwnd can be safely set to 1 to
      retransmit one packet on timeout.
      
      But the inflight may not always be 0 on timeout if TCP changes to
      mark packets lost based on packet sent time. Therefore we must
      first mark the packet lost, then set the cwnd based on the
      (updated) inflight.
      
      This is not a pure refactor. Congestion control may potentially
      break if it uses (not yet updated) inflight to compute ssthresh.
      Fortunately no existing congestion control module does that.
      Also it changes the inflight when CA_LOSS_EVENT is called, and only
      westwood processes such an event but does not use inflight.
      
      This change has two other minor side benefits:
      1) consistent with Fast Recovery s.t. the inflight is updated
         first before tcp_enter_recovery flips state to CA_Recovery.
      
      2) avoid intertwining loss marking with state update, making the
         code more readable.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c77d62ff
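The ordering constraint from the commit above (mark losses first, then derive cwnd from the updated in-flight count) can be illustrated with a small sketch. The `flow` struct and function names are hypothetical; the in-flight formula (packets out, minus SACKed and lost, plus retransmitted) follows the usual Linux accounting but is simplified here.

```c
#include <assert.h>

struct flow {
	unsigned int packets_out;	/* sent, not yet cumulatively ACKed */
	unsigned int sacked_out;
	unsigned int lost_out;
	unsigned int retrans_out;
	unsigned int cwnd;
};

static unsigned int packets_in_flight(const struct flow *f)
{
	return f->packets_out - (f->sacked_out + f->lost_out) + f->retrans_out;
}

/*
 * Timeout handling in the order the commit argues for: account the newly
 * lost packets first, then set cwnd from the updated in-flight count so
 * that exactly one more packet may be sent.
 */
static void enter_loss(struct flow *f, unsigned int newly_lost)
{
	assert(newly_lost <= f->packets_out - f->sacked_out - f->lost_out);
	f->lost_out += newly_lost;
	f->cwnd = packets_in_flight(f) + 1;
}
```

When all 10 outstanding packets are marked lost, this reproduces the old result of cwnd = 1. When only 7 of 10 are marked, 3 remain in flight and cwnd becomes 4, which still permits just one new transmission.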
    • Yuchung Cheng's avatar
      tcp: new helper tcp_timeout_mark_lost · 2ad55f56
      Yuchung Cheng authored
      Refactor using a new helper, tcp_timeout_mark_lost(), that marks packets
      lost upon RTO.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ad55f56
    • Yuchung Cheng's avatar
      tcp: account lost retransmit after timeout · d716bfdb
      Yuchung Cheng authored
      The previous approach for the lost and retransmit bits was to
      wipe the slate clean: zero all the lost and retransmit bits,
      correspondingly zero the lost_out and retrans_out counters, and
      then add back the lost bits (and correspondingly increment lost_out).
      
      The new approach is to treat this very much like marking packets
      lost in fast recovery. We don’t wipe the slate clean. We just say
      that for all packets that were not yet marked sacked or lost, we now
      mark them as lost in exactly the same way we do for fast recovery.
      
      This fixes the lost retransmit accounting at RTO time and greatly
      simplifies the RTO code by sharing much of the logic with Fast
      Recovery.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d716bfdb
    • Yuchung Cheng's avatar
      tcp: simpler NewReno implementation · 6ac06ecd
      Yuchung Cheng authored
      This is a rewrite of NewReno loss recovery implementation that is
      simpler and standalone for readability and better performance by
      using fewer states.
      
      Note that NewReno refers to RFC6582 as a modification to the fast
      recovery algorithm. It is used only if the connection does not
      support SACK in Linux. It should not be confused with the Reno
      (AIMD) congestion control.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ac06ecd
    • Yuchung Cheng's avatar
      tcp: disable RFC6675 loss detection · b38a51fe
      Yuchung Cheng authored
      This patch disables RFC6675 loss detection and makes sysctl
      net.ipv4.tcp_recovery = 1 control a binary choice between RACK
      (1) and RFC6675 (0).
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b38a51fe
    • Yuchung Cheng's avatar
      tcp: support DUPACK threshold in RACK · 20b654df
      Yuchung Cheng authored
      This patch adds support for the classic DUPACK threshold rule
      (#DupThresh) in RACK.
      
      When the number of packets SACKed is greater than or equal to the
      threshold, RACK sets the reordering window to zero, which
      immediately marks all the unsacked packets below the highest SACKed
      sequence lost. Since this approach is known to not work well with
      reordering, RACK only uses it if no reordering has been observed.
      
      The DUPACK threshold rule is a particularly useful extension to the
      fast recoveries triggered by the RACK reordering timer. Examples are
      data-center transfers where the RTT is much smaller than a timer
      tick, or high-RTT paths where the default RTT/4 may take too long.
      
      Note that this patch differs slightly from RFC6675. RFC6675
      considers a packet lost when at least #DupThresh higher-sequence
      packets are SACKed.
      
      With RACK, for connections that have seen reordering, RACK
      continues to use a dynamically-adaptive time-based reordering
      window to detect losses. But for connections on which we have not
      yet seen reordering, this patch considers a packet lost when at
      least one higher sequence packet is SACKed and the total number
      of SACKed packets is at least DupThresh. For example, suppose a
      connection has not seen reordering, and sends 10 packets, and
      packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
      lost. RACK considers packets 1, 2, 4, 6 lost.
      
      There is some small risk of spurious retransmits here due to
      reordering. However, this is mostly limited to the first flight of
      a connection on which the sender receives SACKs from reordering.
      And RFC 6675 and FACK loss detection have a similar risk on the
      first flight with reordering (it's just that the risk of spurious
      retransmits from reordering was slightly narrower for those older
      algorithms due to the margin of 3*MSS).
      
      Also the minimum reordering window is reduced from 1 msec to 0
      to recover quicker on short RTT transfers. Therefore RACK is more
      aggressive in marking packets lost during recovery to reduce the
      reordering window timeouts.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20b654df
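The 10-packet example in the DUPACK threshold commit above (packets 3, 5 and 7 SACKed) can be checked with a short sketch. Both predicates are simplified restatements of the rules as worded in the commit message, not the kernel code or the full RFC6675 IsLost() definition; the function names and the packet-number indexing are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DUP_THRESH 3

/* sacked[] is indexed by packet number 1..n; index 0 is unused. */

/* Count SACKed packets with a number strictly above pkt. */
static size_t sacked_above(const bool *sacked, size_t n, size_t pkt)
{
	size_t count = 0;

	for (size_t i = pkt + 1; i <= n; i++)
		count += sacked[i];
	return count;
}

/* Simplified RFC6675 rule: lost once DupThresh higher packets are SACKed. */
static bool rfc6675_lost(const bool *sacked, size_t n, size_t pkt)
{
	assert(pkt >= 1 && pkt <= n);
	return !sacked[pkt] && sacked_above(sacked, n, pkt) >= DUP_THRESH;
}

/*
 * Rule from the commit message for flows without observed reordering:
 * lost when at least one higher packet is SACKed and the total number
 * of SACKed packets reaches DupThresh.
 */
static bool rack_dupthresh_lost(const bool *sacked, size_t n, size_t pkt)
{
	assert(pkt >= 1 && pkt <= n);
	return !sacked[pkt] &&
	       sacked_above(sacked, n, pkt) >= 1 &&
	       sacked_above(sacked, n, 0) >= DUP_THRESH;
}
```

For the SACK pattern in the commit message, the first predicate flags packets 1 and 2, and the second flags packets 1, 2, 4 and 6, matching the sets stated there.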
    • Ivan Khoronzhuk's avatar
      net: ethernet: ti: cpsw: disable mq feature for "AM33xx ES1.0" devices · 9611d6d6
      Ivan Khoronzhuk authored
      The early versions of am33xx devices, related to ES1.0 SoC revision
      have errata limiting mq support. That's the same errata as
      commit 7da11600 ("drivers: net: cpsw: add am335x errata workarround for
      interrutps")
      
      AM33xx Errata [1] Advisory 1.0.9
      http://www.ti.com/lit/er/sprz360f/sprz360f.pdf
      
      After additional investigation were found that drivers w/a is
      propagated on all AM33xx SoCs and on DM814x. But the errata exists
      only for ES1.0 of AM33xx family, limiting mq support for revisions
      after ES1.0. So, disable mq support only for related SoCs and use
      separate polls for revisions allowing mq.
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9611d6d6
    • David S. Miller's avatar
      Merge branch 'sched-refactor-NOLOCK-qdiscs' · 4b9c7768
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      sched: refactor NOLOCK qdiscs
      
      With the introduction of NOLOCK qdiscs, pfifo_fast performance in the
      uncontended scenario degraded measurably, especially after commit
      eb82a994 ("net: sched, fix OOO packets with pfifo_fast").
      
      This series restores pfifo_fast performance in that scenario to the
      previous level, mainly by reducing the number of atomic operations
      required to perform the qdisc_run() call. Performance in the contended
      scenario also increases measurably.
      
      Note: This series is on top of:
      
      sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b9c7768
    • Paolo Abeni's avatar
      pfifo_fast: drop unneeded additional lock on dequeue · 021a17ed
      Paolo Abeni authored
      After the previous patch, for NOLOCK qdiscs, q->seqlock is
      always held when dequeue() is invoked, so we can drop
      any additional locking that protected this operation.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      021a17ed
    • Paolo Abeni's avatar
      sched: replace __QDISC_STATE_RUNNING bit with a spin lock · 96009c7d
      Paolo Abeni authored
      So that we can use lockdep on it.
      The newly introduced sequence lock has the same scope as busylock,
      so it shares the same lockdep annotation, but it's only used for
      NOLOCK qdiscs.
      
      With this changeset we acquire this lock in the control path around
      the flushing operation (qdisc reset), to allow further NOLOCK qdisc
      perf improvement in the next patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96009c7d
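The idea described in the two pfifo_fast/sched commits above, a try-lock that admits a single runner and makes extra dequeue locking unnecessary, can be sketched in user space. This is only a model under stated assumptions, not the kernel code: `struct model_qdisc`, `model_init()`, `model_run_begin()`, `model_run_end()` and `model_run()` are hypothetical names, and a C11 `atomic_flag` stands in for the kernel spin lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a NOLOCK qdisc. */
struct model_qdisc {
    atomic_flag seqlock;   /* models q->seqlock; clear means unlocked */
    int dequeued;          /* packets dequeued while holding the lock */
};

static void model_init(struct model_qdisc *q)
{
    atomic_flag_clear(&q->seqlock);
    q->dequeued = 0;
}

/* Try to become the single runner; never blocks. */
static bool model_run_begin(struct model_qdisc *q)
{
    /* test_and_set returns the previous value: false means we took it. */
    return !atomic_flag_test_and_set_explicit(&q->seqlock,
                                              memory_order_acquire);
}

static void model_run_end(struct model_qdisc *q)
{
    atomic_flag_clear_explicit(&q->seqlock, memory_order_release);
}

/* Run the queue only if nobody else is already running it. */
static bool model_run(struct model_qdisc *q)
{
    if (!model_run_begin(q))
        return false;
    q->dequeued++;   /* dequeue runs under the lock, no extra locking */
    model_run_end(q);
    return true;
}
```

In this model a second caller that finds the lock taken simply returns, which mirrors the "already running" outcome the state bit used to signal.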
    • Anirudh Venkataramanan's avatar
      ice: Update NVM AQ command functions · 43c89b16
      Anirudh Venkataramanan authored
      This patch updates the NVM read/erase/update AQ commands to align with
      the latest specification.
      Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      43c89b16
    • Emil Tantilov's avatar
      ixgbevf: fix MAC address changes through ixgbevf_set_mac() · 6e7d0ba1
      Emil Tantilov authored
      Set hw->mac.perm_addr in ixgbevf_set_mac() in order to avoid losing the
      custom MAC on reset. This can happen in the following case:
      
      >ip link set $vf address $mac
      >ethtool -r $vf
      Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      6e7d0ba1
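The ixgbevf fix above can be illustrated with a small user-space model. It assumes, as the commit message implies, that a reset reloads the in-use address from `perm_addr`. The names `struct model_mac`, `model_reset()`, `model_set_mac_old()` and `model_set_mac_fixed()` are hypothetical and are not the driver's functions.

```c
#include <string.h>

#define MODEL_ETH_ALEN 6

/* Hypothetical model of the two MAC fields mentioned in the commit. */
struct model_mac {
    unsigned char addr[MODEL_ETH_ALEN];       /* address currently in use */
    unsigned char perm_addr[MODEL_ETH_ALEN];  /* address restored on reset */
};

/* Assumption: a reset reloads the in-use address from perm_addr. */
static void model_reset(struct model_mac *mac)
{
    memcpy(mac->addr, mac->perm_addr, MODEL_ETH_ALEN);
}

/* Old behaviour in this model: only the in-use address is updated. */
static void model_set_mac_old(struct model_mac *mac,
                              const unsigned char *new_addr)
{
    memcpy(mac->addr, new_addr, MODEL_ETH_ALEN);
}

/* Fixed behaviour: perm_addr is updated too, so a reset keeps the MAC. */
static void model_set_mac_fixed(struct model_mac *mac,
                                const unsigned char *new_addr)
{
    memcpy(mac->addr, new_addr, MODEL_ETH_ALEN);
    memcpy(mac->perm_addr, new_addr, MODEL_ETH_ALEN);
}
```

With the old variant, a set followed by a reset falls back to the stale `perm_addr`. With the fixed variant the custom address survives the reset.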
    • Emil Tantilov's avatar
      ixgbe: force VF to grab new MAC on driver reload · a8d9bb3d
      Emil Tantilov authored
      Do not validate the MAC address during a reset, unless the MAC was set on
      the host. This way the VF will get a new MAC address every time it reloads.
      
      Remove the "no MAC address assigned" message since it will get spammed on
      reset and it doesn't help much as the MAC on the VF is randomly generated.
      Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      a8d9bb3d
    • Pavel Tatashin's avatar
      ixgbe: release lock for the duration of ixgbe_suspend_close() · 6710f970
      Pavel Tatashin authored
      Currently, during device_shutdown() ixgbe holds rtnl_lock for the duration
      of lengthy ixgbe_close_suspend(). On machines with multiple ixgbe cards
      this lock prevents scaling if device_shutdown() function is multi-threaded.
      
      It is not necessary to hold this lock during ixgbe_close_suspend(),
      as it is not held when ixgbe_close() is called, which also happens
      during shutdown but in the kexec case.
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      6710f970
    • Mauro S M Rodrigues's avatar
      ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device · b212d815
      Mauro S M Rodrigues authored
      Since commit f7f37e7f ("ixgbe: handle close/suspend race with
      netif_device_detach/present") ixgbe_close_suspend is called, from
      ixgbe_close, only if the device is present, i.e. if it isn't detached.
      That exposed a situation where IRQs weren't freed if a PCI error
      recovery system opts to remove the device. In that case the PCI
      channel state is set to pci_channel_io_perm_failure, and
      ixgbe_io_error_detected returned PCI_ERS_RESULT_DISCONNECT before
      calling ixgbe_close_suspend. Consequently the IRQ was not freed, and
      the remove handler crashed when it called pci_disable_device, hitting
      a BUG_ON in free_msi_irqs, which asserts that there is no non-free
      IRQ associated with the device to be removed:
      
      BUG_ON(irq_has_action(entry->irq + i));
      
      The issue is fixed by calling ixgbe_close_suspend before evaluating
      the PCI channel state.
      Reported-by: Naresh Bannoth <nbannoth@in.ibm.com>
      Reported-by: Abdul Haleem <abdhalee@in.ibm.com>
      Signed-off-by: Mauro S M Rodrigues <maurosr@linux.vnet.ibm.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      b212d815
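The ordering problem described in the commit above can be shown with a simplified model. It is a sketch under assumptions, not the driver code: the `model_*` types and functions are hypothetical, and the only state tracked is whether an IRQ is still allocated.

```c
#include <stdbool.h>

enum model_channel_state {
    MODEL_IO_NORMAL,
    MODEL_IO_FROZEN,
    MODEL_IO_PERM_FAILURE
};

enum model_ers_result { MODEL_NEED_RESET, MODEL_DISCONNECT };

struct model_adapter {
    bool running;        /* interface is up */
    bool irq_allocated;  /* IRQs requested and not yet freed */
};

/* Stands in for the close/suspend step that frees the IRQs. */
static void model_close_suspend(struct model_adapter *a)
{
    a->irq_allocated = false;
}

/* Old order: return early on permanent failure, IRQs stay allocated. */
static enum model_ers_result
model_error_detected_old(struct model_adapter *a, enum model_channel_state s)
{
    if (s == MODEL_IO_PERM_FAILURE)
        return MODEL_DISCONNECT;
    if (a->running)
        model_close_suspend(a);
    return MODEL_NEED_RESET;
}

/* Fixed order: free the IRQs first, then look at the channel state. */
static enum model_ers_result
model_error_detected_fixed(struct model_adapter *a, enum model_channel_state s)
{
    if (a->running)
        model_close_suspend(a);
    if (s == MODEL_IO_PERM_FAILURE)
        return MODEL_DISCONNECT;
    return MODEL_NEED_RESET;
}

/* Models the BUG_ON condition: removal is safe only with no IRQ left. */
static bool model_remove_is_safe(const struct model_adapter *a)
{
    return !a->irq_allocated;
}
```

Both variants return the same result for a permanent failure. What differs is the state the adapter is left in when the remove handler runs.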
    • Cathy Zhou's avatar
      ixgbe: cleanup sparse warnings · 9cfbfa70
      Cathy Zhou authored
      Sparse complains about valid conversions between restricted types;
      the force attribute is used to avoid those warnings.
      Signed-off-by: Cathy Zhou <cathy.zhou@oracle.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      9cfbfa70
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · b9f672af
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-05-17
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Provide a new BPF helper for doing a FIB and neighbor lookup
         in the kernel tables from an XDP or tc BPF program. The helper
         provides a fast-path for forwarding packets. The API supports
         IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
         implemented in this initial work, from David (Ahern).
      
      2) Just a tiny diff but a huge feature enabled for the nfp driver by
         extending the BPF offload beyond a pure host processing offload.
         Offloaded XDP programs are allowed to set the RX queue index,
         thus opening the door for defining a fully programmable RSS/n-tuple
         filter replacement. Once BPF has decided on a queue, the device
         data-path will skip the conventional RSS processing completely,
         from Jakub.
      
      3) The original sockmap implementation was array based similar to
         devmap. However unlike devmap where an ifindex has a 1:1 mapping
         into the map there are use cases with sockets that need to be
         referenced using longer keys. Hence, sockhash map is added reusing
         as much of the sockmap code as possible, from John.
      
      4) Introduce BTF ID. The ID is allocated through an IDR, similar to
         BPF maps and progs. It also makes BTF accessible to user
         space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
         through BPF_OBJ_GET_INFO_BY_FD, from Martin.
      
      5) Enable BPF stackmap with build_id also in NMI context. Due to the
         up_read() of current->mm->mmap_sem build_id cannot be parsed.
         This work defers the up_read() via a per-cpu irq_work so that
         at least limited support can be enabled, from Song.
      
      6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
         JIT conversion as well as implementation of an optimized 32/64 bit
         immediate load in the arm64 JIT that reduces the number of
         emitted instructions; tested real-world programs
         shrank by three percent, from Daniel.
      
      7) Add ifindex parameter to the libbpf loader in order to enable
         BPF offload support. Right now only iproute2 can load offloaded
         BPF and this will also enable libbpf for direct integration into
         other applications, from David (Beckett).
      
      8) Convert the plain text documentation under Documentation/bpf/ into
         RST format since this is the appropriate standard the kernel is
         moving to for all documentation. Also add an overview README.rst,
         from Jesper.
      
      9) Add __printf verification attribute to the bpf_verifier_vlog()
         helper. Though it uses va_list we can still allow gcc to check
         the format string, from Mathieu.
      
      10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
          is a bash 4.0+ feature which is not guaranteed to be available
          when calling out to shell, therefore use a more portable variant,
          from Joe.
      
      11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
          instead of relying on the gcc built-in, from Björn.
      
      12) Fix a sock hashmap kmalloc warning reported by syzbot when an
          overly large key size is used in hashmap then causing overflows
          in htab->elem_size. Reject bogus attr->key_size early in the
          sock_hash_alloc(), from Yonghong.
      
      13) Ensure in BPF selftests when urandom_read is being linked that
          --build-id is always enabled so that test_stacktrace_build_id[_nmi]
          won't be failing, from Alexei.
      
      14) Add bitsperlong.h as well as errno.h uapi headers into the tools
          header infrastructure which point to one of the arch specific
          uapi headers. This was needed in order to fix a build error on
          some systems for the BPF selftests, from Sirio.
      
      15) Allow for short options to be used in the xdp_monitor BPF sample
          code. And also a bpf.h tools uapi header sync in order to fix a
          selftest build failure. Both from Prashant.
      
      16) More formally clarify the meaning of ID in the direct packet access
          section of the BPF documentation, from Wang.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9f672af
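Item 9 of the pull request above notes that a helper taking a va_list can still have its format string checked by gcc. A minimal user-space sketch of that technique follows. `model_vlog()` and `model_log()` are hypothetical names, not the kernel helpers, and the attribute is written out in its gcc/clang spelling rather than through the kernel's `__printf` macro. A last index of 0 tells the compiler that the arguments are not available for checking, so only the format string itself is checked.

```c
#include <stdarg.h>
#include <stdio.h>

/* va_list variant: format string is parameter 3, and 0 means the
 * arguments cannot be checked, only the format string itself. */
__attribute__((format(printf, 3, 0)))
static int model_vlog(char *buf, size_t len, const char *fmt, va_list args)
{
    return vsnprintf(buf, len, fmt, args);
}

/* Variadic wrapper: format string is parameter 3, and the arguments
 * to check against it start at parameter 4. */
__attribute__((format(printf, 3, 4)))
static int model_log(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = model_vlog(buf, len, fmt, args);
    va_end(args);
    return n;
}
```

With these annotations, a call such as `model_log(buf, sizeof(buf), "%s", 42)` would be flagged by `-Wformat` at compile time instead of misbehaving at run time.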