1. 03 Aug, 2018 10 commits
  2. 28 Jul, 2018 30 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.14.59 · 53208e12
      Greg Kroah-Hartman authored
      53208e12
    • Arnd Bergmann's avatar
      turn off -Wattribute-alias · e94f784f
      Arnd Bergmann authored
      Starting with gcc-8.1, we get a warning about all system call definitions,
      which use an alias between functions with incompatible prototypes, e.g.:
      
      In file included from ../mm/process_vm_access.c:19:
      ../include/linux/syscalls.h:211:18: warning: 'sys_process_vm_readv' alias between functions of incompatible types 'long int(pid_t,  const struct iovec *, long unsigned int,  const struct iovec *, long unsigned int,  long unsigned int)' {aka 'long int(int,  const struct iovec *, long unsigned int,  const struct iovec *, long unsigned int,  long unsigned int)'} and 'long int(long int,  long int,  long int,  long int,  long int,  long int)' [-Wattribute-alias]
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                        ^~~
      ../include/linux/syscalls.h:207:2: note: in expansion of macro '__SYSCALL_DEFINEx'
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
        ^~~~~~~~~~~~~~~~~
      ../include/linux/syscalls.h:201:36: note: in expansion of macro 'SYSCALL_DEFINEx'
       #define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)
                                          ^~~~~~~~~~~~~~~
      ../mm/process_vm_access.c:300:1: note: in expansion of macro 'SYSCALL_DEFINE6'
       SYSCALL_DEFINE6(process_vm_readv, pid_t, pid, const struct iovec __user *, lvec,
       ^~~~~~~~~~~~~~~
      ../include/linux/syscalls.h:215:18: note: aliased declaration here
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                        ^~~
      ../include/linux/syscalls.h:207:2: note: in expansion of macro '__SYSCALL_DEFINEx'
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
        ^~~~~~~~~~~~~~~~~
      ../include/linux/syscalls.h:201:36: note: in expansion of macro 'SYSCALL_DEFINEx'
       #define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)
                                          ^~~~~~~~~~~~~~~
      ../mm/process_vm_access.c:300:1: note: in expansion of macro 'SYSCALL_DEFINE6'
       SYSCALL_DEFINE6(process_vm_readv, pid_t, pid, const struct iovec __user *, lvec,
      
      This is really noisy and does not indicate a real problem. In the latest
      mainline kernel, this was addressed by commit bee20031 ("disable
      -Wattribute-alias warning for SYSCALL_DEFINEx()"), which seems too invasive
      to backport.
      
      This takes a much simpler approach and just disables the warning across the
      kernel.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e94f784f
    • Roman Fietze's avatar
      can: m_can.c: fix setup of CCCR register: clear CCCR NISO bit before checking can.ctrlmode · 08382d3a
      Roman Fietze authored
      commit 393753b2 upstream.
      
      Inside m_can_chip_config(), when setting up the new value of the CCCR,
      the CCCR_NISO bit is not cleared like the others, CCCR_TEST, CCCR_MON,
      CCCR_BRSE and CCCR_FDOE, before checking the can.ctrlmode bits for
      CAN_CTRLMODE_FD_NON_ISO.
      
      This way once the controller was configured for CAN_CTRLMODE_FD_NON_ISO,
      this mode could never be cleared again.
      
      This fix is only relevant for controllers with version 3.1.x or 3.2.x.
      Older versions do not support NISO.
      Signed-off-by: default avatarRoman Fietze <roman.fietze@telemotive.de>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08382d3a
    • Stephane Grosjean's avatar
      can: peak_canfd: fix firmware < v3.3.0: limit allocation to 32-bit DMA addr only · a55d3d73
      Stephane Grosjean authored
      commit 5d4c94ed upstream.
      
      The DMA logic in firmwares < v3.3.0 embedded in the PCAN-PCIe FD cards
      family is not capable of handling a mix of 32-bit and 64-bit logical
      addresses. If the board is equipped with 2 or 4 CAN ports, then such a
      situation might lead to a PCIe Bus Error "Malformed TLP" packet
      as well as "irq xx: nobody cared" issue.
      
      This patch adds a workaround that requests only 32-bit DMA addresses
      when these might be allocated outside of the 4 GB area.
      
      This issue has been fixed in firmware v3.3.0 and next.
      Signed-off-by: default avatarStephane Grosjean <s.grosjean@peak-system.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a55d3d73
    • Anssi Hannula's avatar
      can: xilinx_can: fix RX overflow interrupt not being enabled · 60454a97
      Anssi Hannula authored
      commit 83997997 upstream.
      
      RX overflow interrupt (RXOFLW) is disabled even though xcan_interrupt()
      processes it. This means that an RX overflow interrupt will only be
      processed when another interrupt gets asserted (e.g. for RX/TX).
      
      Fix that by enabling the RXOFLW interrupt.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      60454a97
    • Anssi Hannula's avatar
      can: xilinx_can: fix incorrect clear of non-processed interrupts · 19c756e0
      Anssi Hannula authored
      commit 2f4f0f33 upstream.
      
      xcan_interrupt() clears ERROR|RXOFLV|BSOFF|ARBLST interrupts if any of
      them is asserted. This does not take into account that some of them
      could have been asserted between interrupt status read and interrupt
      clear, therefore clearing them without handling them.
      
      Fix the code to only clear those interrupts that it knows are asserted
      and therefore going to be processed in xcan_err_interrupt().
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      19c756e0
    • Anssi Hannula's avatar
      can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting · 189c7890
      Anssi Hannula authored
      commit 620050d9 upstream.
      
      The xilinx_can driver assumes that the TXOK interrupt only clears after
      it has been acknowledged as many times as there have been successfully
      sent frames.
      
      However, the documentation does not mention such behavior, instead
      saying just that the interrupt is cleared when the clear bit is set.
      
      Similarly, testing seems to also suggest that it is immediately cleared
      regardless of the amount of frames having been sent. Performing some
      heavy TX load and then going back to idle has the tx_head drifting
      further away from tx_tail over time, steadily reducing the amount of
      frames the driver keeps in the TX FIFO (but not to zero, as the TXOK
      interrupt always frees up space for 1 frame from the driver's
      perspective, so frames continue to be sent) and delaying the local echo
      frames.
      
      The TX FIFO tracking is also otherwise buggy as it does not account for
      TX FIFO being cleared after software resets, causing
        BUG!, TX FIFO full when queue awake!
      messages to be output.
      
      There does not seem to be any way to accurately track the state of the
      TX FIFO for local echo support while using the full TX FIFO.
      
      The Zynq version of the HW (but not the soft-AXI version) has watermark
      programming support and with it an additional TX-FIFO-empty interrupt
      bit.
      
      Modify the driver to only put 1 frame into TX FIFO at a time on soft-AXI
      and 2 frames at a time on Zynq. On Zynq the TXFEMP interrupt bit is used
      to detect whether 1 or 2 frames have been sent at interrupt processing
      time.
      
      Tested with the integrated CAN on Zynq-7000 SoC. The 1-frame-FIFO mode
      was also tested.
      
      An alternative way to solve this would be to drop local echo support but
      keep using the full TX FIFO.
      
      v2: Add FIFO space check before TX queue wake with locking to
      synchronize with queue stop. This avoids waking the queue when xmit()
      had just filled it.
      
      v3: Keep local echo support and reduce the amount of frames in FIFO
      instead as suggested by Marc Kleine-Budde.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      189c7890
    • Anssi Hannula's avatar
      can: xilinx_can: fix device dropping off bus on RX overrun · 96bf3257
      Anssi Hannula authored
      commit 2574fe54 upstream.
      
      The xilinx_can driver performs a software reset when an RX overrun is
      detected. This causes the device to enter Configuration mode where no
      messages are received or transmitted.
      
      The documentation does not mention any need to perform a reset on an RX
      overrun, and testing by inducing an RX overflow also indicated that the
      device continues to work just fine without a reset.
      
      Remove the software reset.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      96bf3257
    • Anssi Hannula's avatar
      can: xilinx_can: fix recovery from error states not being propagated · c5846b2f
      Anssi Hannula authored
      commit 877e0b75 upstream.
      
      The xilinx_can driver contains no mechanism for propagating recovery
      from CAN_STATE_ERROR_WARNING and CAN_STATE_ERROR_PASSIVE.
      
      Add such a mechanism by factoring the handling of
      XCAN_STATE_ERROR_PASSIVE and XCAN_STATE_ERROR_WARNING out of
      xcan_err_interrupt and checking for recovery after RX and TX if the
      interface is in one of those states.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c5846b2f
    • Anssi Hannula's avatar
      can: xilinx_can: fix power management handling · f820de2a
      Anssi Hannula authored
      commit 8ebd83bd upstream.
      
      There are several issues with the suspend/resume handling code of the
      driver:
      
      - The device is attached and detached in the runtime_suspend() and
        runtime_resume() callbacks if the interface is running. However,
        during xcan_chip_start() the interface is considered running,
        causing the resume handler to incorrectly call netif_start_queue()
        at the beginning of xcan_chip_start(), and on xcan_chip_start() error
        return the suspend handler detaches the device leaving the user
        unable to bring-up the device anymore.
      
      - The device is not brought properly up on system resume. A reset is
        done and the code tries to determine the bus state after that.
        However, after reset the device is always in Configuration mode
        (down), so the state checking code does not make sense and
        communication will also not work.
      
      - The suspend callback tries to set the device to sleep mode (low-power
        mode which monitors the bus and brings the device back to normal mode
        on activity), but then immediately disables the clocks (possibly
        before the device reaches the sleep mode), which does not make sense
        to me. If a clean shutdown is wanted before disabling clocks, we can
        just bring it down completely instead of only sleep mode.
      
      Reorganize the PM code so that only the clock logic remains in the
      runtime PM callbacks and the system PM callbacks contain the device
      bring-up/down logic. This makes calling the runtime PM callbacks during
      e.g. xcan_chip_start() safe.
      
      The system PM callbacks now simply call common code to start/stop the
      HW if the interface was running, replacing the broken code from before.
      
      xcan_chip_stop() is updated to use the common reset code so that it will
      wait for the reset to complete. Reset also disables all interrupts so do
      not do that separately.
      
      Also, the device_may_wakeup() checks are removed as the driver does not
      have wakeup support.
      
      Tested on Zynq-7000 integrated CAN.
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f820de2a
    • Anssi Hannula's avatar
      can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK · 464a3f91
      Anssi Hannula authored
      commit 32852c56 upstream.
      
      If the device gets into a state where RXNEMP (RX FIFO not empty)
      interrupt is asserted without RXOK (new frame received successfully)
      interrupt being asserted, xcan_rx_poll() will continue to try to clear
      RXNEMP without actually reading frames from RX FIFO. If the RX FIFO is
      not empty, the interrupt will not be cleared and napi_schedule() will
      just be called again.
      
      This situation can occur when:
      
      (a) xcan_rx() returns without reading RX FIFO due to an error condition.
      The code tries to clear both RXOK and RXNEMP but RXNEMP will not clear
      due to a frame still being in the FIFO. The frame will never be read
      from the FIFO as RXOK is no longer set.
      
      (b) A frame is received between xcan_rx_poll() reading interrupt status
      and clearing RXOK. RXOK will be cleared, but RXNEMP will again remain
      set as the new message is still in the FIFO.
      
      I'm able to trigger case (b) by flooding the bus with frames under load.
      
      There does not seem to be any benefit in using both RXNEMP and RXOK in
      the way the driver does, and the polling example in the reference manual
      (UG585 v1.10 18.3.7 Read Messages from RxFIFO) also says that either
      RXOK or RXNEMP can be used for detecting incoming messages.
      
      Fix the issue and simplify the RX processing by only using RXNEMP
      without RXOK.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      464a3f91
    • Rafael J. Wysocki's avatar
      driver core: Partially revert "driver core: correct device's shutdown order" · 55cb8f40
      Rafael J. Wysocki authored
      commit 722e5f2b upstream.
      
      Commit 52cdbdd4 (driver core: correct device's shutdown order)
      introduced a regression by breaking device shutdown on some systems.
      
      Namely, the devices_kset_move_last() call in really_probe() added by
      that commit is a mistake as it may cause parents to follow children
      in the devices_kset list which then causes shutdown to fail.  For
      example, if a device has children before really_probe() is called
      for it (which is not uncommon), that call will cause it to be
      reordered after the children in the devices_kset list and the
      ordering of that list will not reflect the correct device shutdown
      order any more.
      
      Also it causes the devices_kset list to be constantly reordered
      until all drivers have been probed which is totally pointless
      overhead in the majority of cases and it only covered an issue
      with system shutdown, while system-wide suspend/resume potentially
      had the same issue on the affected platforms (which was not covered).
      
      Moreover, the shutdown issue originally addressed by the change in
      really_probe() made by commit 52cdbdd4 is not present in 4.18-rc
      any more, since dra7 started to use the sdhci-omap driver which
      doesn't disable any regulators during shutdown, so the really_probe()
      part of commit 52cdbdd4 can be safely reverted.  [The original
      issue was related to the omap_hsmmc driver used by dra7 previously.]
      
      For the above reasons, revert the really_probe() modifications made
      by commit 52cdbdd4.
      
      The other code changes made by commit 52cdbdd4 are useful and
      they need not be reverted.
      
      Fixes: 52cdbdd4 (driver core: correct device's shutdown order)
      Link: https://lore.kernel.org/lkml/CAFgQCTt7VfqM=UyCnvNFxrSw8Z6cUtAi3HUwR4_xPAc03SgHjQ@mail.gmail.com/Reported-by: default avatarPingfan Liu <kernelfans@gmail.com>
      Tested-by: default avatarPingfan Liu <kernelfans@gmail.com>
      Reviewed-by: default avatarKishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      55cb8f40
    • Jerry Zhang's avatar
      usb: gadget: f_fs: Only return delayed status when len is 0 · 5421694d
      Jerry Zhang authored
      commit 4d644abf upstream.
      
      Commit 1b9ba000 ("Allow function drivers to pause control
      transfers") states that USB_GADGET_DELAYED_STATUS is only
      supported if data phase is 0 bytes.
      
      It seems that when the length is not 0 bytes, there is no
      need to explicitly delay the data stage since the transfer
      is not completed until the user responds. However, when the
      length is 0, there is no data stage and the transfer is
      finished once setup() returns, hence there is a need to
      explicitly delay completion.
      
      This manifests as the following bugs:
      
      Prior to 946ef68a ('Let setup() return
      USB_GADGET_DELAYED_STATUS'), when setup is 0 bytes, ffs
      would require user to queue a 0 byte request in order to
      clear setup state. However, that 0 byte request was actually
      not needed and would hang and cause errors in other setup
      requests.
      
      After the above commit, 0 byte setups work since the gadget
      now accepts empty queues to ep0 to clear the delay, but all
      other setups hang.
      
      Fixes: 946ef68a ("Let setup() return USB_GADGET_DELAYED_STATUS")
      Signed-off-by: default avatarJerry Zhang <zhangjerry@google.com>
      Cc: stable <stable@vger.kernel.org>
      Acked-by: default avatarFelipe Balbi <felipe.balbi@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5421694d
    • Antti Seppälä's avatar
      usb: dwc2: Fix DMA alignment to start at allocated boundary · 68fc92a0
      Antti Seppälä authored
      commit 56406e01 upstream.
      
      The commit 3bc04e28 ("usb: dwc2: host: Get aligned DMA in a more
      supported way") introduced a common way to align DMA allocations.
      The code in the commit aligns the struct dma_aligned_buffer but the
      actual DMA address pointed by data[0] gets aligned to an offset from
      the allocated boundary by the kmalloc_ptr and the old_xfer_buffer
      pointers.
      
      This is against the recommendation in Documentation/DMA-API.txt which
      states:
      
        Therefore, it is recommended that driver writers who don't take
        special care to determine the cache line size at run time only map
        virtual regions that begin and end on page boundaries (which are
        guaranteed also to be cache line boundaries).
      
      The effect of this is that architectures with non-coherent DMA caches
      may run into memory corruption or kernel crashes with Unhandled
      kernel unaligned accesses exceptions.
      
      Fix the alignment by positioning the DMA area in front of the allocation
      and use memory at the end of the area for storing the orginal
      transfer_buffer pointer. This may have the added benefit of increased
      performance as the DMA area is now fully aligned on all architectures.
      
      Tested with Lantiq xRX200 (MIPS) and RPi Model B Rev 2 (ARM).
      
      Fixes: 3bc04e28 ("usb: dwc2: host: Get aligned DMA in a more supported way")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: default avatarDouglas Anderson <dianders@chromium.org>
      Signed-off-by: default avatarAntti Seppälä <a.seppala@gmail.com>
      Signed-off-by: default avatarFelipe Balbi <felipe.balbi@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68fc92a0
    • Bin Liu's avatar
      usb: core: handle hub C_PORT_OVER_CURRENT condition · ac3f65c6
      Bin Liu authored
      commit 249a32b7 upstream.
      
      Based on USB2.0 Spec Section 11.12.5,
      
        "If a hub has per-port power switching and per-port current limiting,
        an over-current on one port may still cause the power on another port
        to fall below specific minimums. In this case, the affected port is
        placed in the Power-Off state and C_PORT_OVER_CURRENT is set for the
        port, but PORT_OVER_CURRENT is not set."
      
      so let's check C_PORT_OVER_CURRENT too for over current condition.
      
      Fixes: 08d1dec6 ("usb:hub set hub->change_bits when over-current happens")
      Cc: <stable@vger.kernel.org>
      Tested-by: default avatarAlessandro Antenucci <antenucci@korg.it>
      Signed-off-by: default avatarBin Liu <b-liu@ti.com>
      Acked-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac3f65c6
    • Lubomir Rintel's avatar
      usb: cdc_acm: Add quirk for Castles VEGA3000 · e089c305
      Lubomir Rintel authored
      commit 1445cbe4 upstream.
      
      The device (a POS terminal) implements CDC ACM, but has not union
      descriptor.
      Signed-off-by: default avatarLubomir Rintel <lkundrak@v3.sk>
      Acked-by: default avatarOliver Neukum <oneukum@suse.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e089c305
    • Samuel Thibault's avatar
      staging: speakup: fix wraparound in uaccess length check · ab9489c4
      Samuel Thibault authored
      commit b96fba8d upstream.
      
      If softsynthx_read() is called with `count < 3`, `count - 3` wraps, causing
      the loop to copy as much data as available to the provided buffer. If
      softsynthx_read() is invoked through sys_splice(), this causes an
      unbounded kernel write; but even when userspace just reads from it
      normally, a small size could cause userspace crashes.
      
      Fixes: 425e586c ("speakup: add unicode variant of /dev/softsynth")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSamuel Thibault <samuel.thibault@ens-lyon.org>
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab9489c4
    • Eric Dumazet's avatar
      tcp: add tcp_ooo_try_coalesce() helper · 22e3d317
      Eric Dumazet authored
      [ Upstream commit 58152ecb ]
      
      In case skb in out_or_order_queue is the result of
      multiple skbs coalescing, we would like to get a proper gso_segs
      counter tracking, so that future tcp_drop() can report an accurate
      number.
      
      I chose to not implement this tracking for skbs in receive queue,
      since they are not dropped, unless socket is disconnected.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      22e3d317
    • Eric Dumazet's avatar
      tcp: call tcp_drop() from tcp_data_queue_ofo() · ec645ae6
      Eric Dumazet authored
      [ Upstream commit 8541b21e ]
      
      In order to be able to give better diagnostics and detect
      malicious traffic, we need to have better sk->sk_drops tracking.
      
      Fixes: 9f5afeae ("tcp: use an RB tree for ooo receive queue")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec645ae6
    • Eric Dumazet's avatar
      tcp: detect malicious patterns in tcp_collapse_ofo_queue() · 6285a74a
      Eric Dumazet authored
      [ Upstream commit 3d4bf93a ]
      
      In case an attacker feeds tiny packets completely out of order,
      tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
      expensive copies, but not changing socket memory usage at all.
      
      1) Do not attempt to collapse tiny skbs.
      2) Add logic to exit early when too many tiny skbs are detected.
      
      We prefer not doing aggressive collapsing (which copies packets)
      for pathological flows, and revert to tcp_prune_ofo_queue() which
      will be less expensive.
      
      In the future, we might add the possibility of terminating flows
      that are proven to be malicious.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6285a74a
    • Eric Dumazet's avatar
      tcp: avoid collapses in tcp_prune_queue() if possible · 81e6b01d
      Eric Dumazet authored
      [ Upstream commit f4a3313d ]
      
      Right after a TCP flow is created, receiving tiny out of order
      packets allways hit the condition :
      
      if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
      	tcp_clamp_window(sk);
      
      tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
      (guarded by tcp_rmem[2])
      
      Calling tcp_collapse_ofo_queue() in this case is not useful,
      and offers a O(N^2) surface attack to malicious peers.
      
      Better not attempt anything before full queue capacity is reached,
      forcing attacker to spend lots of resource and allow us to more
      easily detect the abuse.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      81e6b01d
    • Eric Dumazet's avatar
      tcp: free batches of packets in tcp_prune_ofo_queue() · f3a5ba63
      Eric Dumazet authored
      [ Upstream commit 72cd43ba ]
      
      Juha-Matti Tilli reported that malicious peers could inject tiny
      packets in out_of_order_queue, forcing very expensive calls
      to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
      every incoming packet. out_of_order_queue rb-tree can contain
      thousands of nodes, iterating over all of them is not nice.
      
      Before linux-4.9, we would have pruned all packets in ofo_queue
      in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
      truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
      
      Since we plan to increase tcp_rmem[2] in the future to cope with
      modern BDP, can not revert to the old behavior, without great pain.
      
      Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
      
      Fixes: 36a6503f ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarJuha-Matti Tilli <juha-matti.tilli@iki.fi>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f3a5ba63
    • Yuchung Cheng's avatar
      tcp: do not delay ACK in DCTCP upon CE status change · ae70b615
      Yuchung Cheng authored
      [ Upstream commit a0496ef2 ]
      
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously DCTCP implementation may continue to delay the ACK. This
      patch fixes that to implement the RFC by forcing an immediate ACK.
      
      Tested with this packetdrill script provided by Larry Brakmo
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae70b615
    • Yuchung Cheng's avatar
      tcp: do not cancel delay-AcK on DCTCP special ACK · 78636179
      Yuchung Cheng authored
      [ Upstream commit 27cde44a ]
      
      Currently when a DCTCP receiver delays an ACK and receive a
      data packet with a different CE mark from the previous one's, it
      sends two immediate ACKs acking previous and latest sequences
      respectly (for ECN accounting).
      
      Previously sending the first ACK may mark off the delayed ACK timer
      (tcp_event_ack_sent). This may subsequently prevent sending the
      second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
      The culprit is that tcp_send_ack() assumes it always acknowleges
      the latest sequence, which is not true for the first special ACK.
      
      The fix is to not make the assumption in tcp_send_ack and check the
      actual ack sequence before cancelling the delayed ACK. Further it's
      safer to pass the ack sequence number as a local variable into
      tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
      future bugs like this.
      Reported-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      78636179
    • Yuchung Cheng's avatar
      tcp: helpers to send special DCTCP ack · f7f24b36
      Yuchung Cheng authored
      [ Upstream commit 2987babb ]
      
      Refactor and create helpers to send the special ACK in DCTCP.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7f24b36
    • Yuchung Cheng's avatar
      tcp: fix dctcp delayed ACK schedule · 68c9bdfc
      Yuchung Cheng authored
      [ Upstream commit b0c05d0e ]
      
      Previously, when a data segment was sent an ACK was piggybacked
      on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
      event to notify congestion control modules. So the DCTCP
      ca->delayed_ack_reserved flag could incorrectly stay set when
      in fact there were no delayed ACKs being reserved. This could result
      in sending a special ECN notification ACK that carries an older
      ACK sequence, when in fact there was no need for such an ACK.
      DCTCP keeps track of the delayed ACK status with its own separate
      state ca->delayed_ack_reserved. Previously it may accidentally cancel
      the delayed ACK without updating this field upon sending a special
      ACK that carries a older ACK sequence. This inconsistency would
      lead to DCTCP receiver never acknowledging the latest data until the
      sender times out and retry in some cases.
      
      Packetdrill script (provided by Larry Brakmo)
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Reported-by: default avatarLarry Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68c9bdfc
    • Roopa Prabhu's avatar
      vxlan: fix default fdb entry netlink notify ordering during netdev create · 68974d0b
      Roopa Prabhu authored
      [ Upstream commit e99465b9 ]
      
      Problem:
      In vxlan_newlink, a default fdb entry is added before register_netdev.
      The default fdb creation function also notifies user-space of the
      fdb entry on the vxlan device which user-space does not know about yet.
      (RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex).
      
      This patch fixes the user-space netlink notification ordering issue
      with the following changes:
      - decouple fdb notify from fdb create.
      - Move fdb notify after register_netdev.
      - Call rtnl_configure_link in vxlan newlink handler to notify
      userspace about the newlink before fdb notify and
      hence avoiding the user-space race.
      
      Fixes: afbd8bae ("vxlan: add implicit fdb entry for default destination")
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68974d0b
    • Roopa Prabhu's avatar
      vxlan: make netlink notify in vxlan_fdb_destroy optional · bb0335aa
      Roopa Prabhu authored
      [ Upstream commit f6e05385 ]
      
      Add a new option do_notify to vxlan_fdb_destroy to make
      sending netlink notify optional. Used by a later patch.
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb0335aa
    • Roopa Prabhu's avatar
      vxlan: add new fdb alloc and create helpers · 1c345a52
      Roopa Prabhu authored
      [ Upstream commit 7431016b ]
      
      - Add new vxlan_fdb_alloc helper
      - rename existing vxlan_fdb_create into vxlan_fdb_update:
              because it really creates or updates an existing
              fdb entry
      - move new fdb creation into a separate vxlan_fdb_create
      
      Main motivation for this change is to introduce the ability
      to decouple vxlan fdb creation and notify, used in a later patch.
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c345a52
    • Roopa Prabhu's avatar
      rtnetlink: add rtnl_link_state check in rtnl_configure_link · 23557c5d
      Roopa Prabhu authored
      [ Upstream commit 5025f7f7 ]
      
      rtnl_configure_link sets dev->rtnl_link_state to
      RTNL_LINK_INITIALIZED and unconditionally calls
      __dev_notify_flags to notify user-space of dev flags.
      
      current call sequence for rtnl_configure_link
      rtnetlink_newlink
          rtnl_link_ops->newlink
          rtnl_configure_link (unconditionally notifies userspace of
                               default and new dev flags)
      
      If a newlink handler wants to call rtnl_configure_link
      early, we will end up with duplicate notifications to
      user-space.
      
      This patch fixes rtnl_configure_link to check rtnl_link_state
      and call __dev_notify_flags with gchanges = 0 if already
      RTNL_LINK_INITIALIZED.
      
      Later in the series, this patch will help the following sequence
      where a driver implementing newlink can call rtnl_configure_link
      to initialize the link early.
      
      makes the following call sequence work:
      rtnetlink_newlink
          rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
                                                      link and notifies
                                                      user-space of default
                                                      dev flags)
          rtnl_configure_link (updates dev flags if requested by user ifm
                               and notifies user-space of new dev flags)
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23557c5d