1. 10 Oct, 2017 8 commits
    • Bernd Edlinger's avatar
    • Willem de Bruijn's avatar
      vhost_net: do not stall on zerocopy depletion · 1e6f7453
      Willem de Bruijn authored
      Vhost-net has a hard limit on the number of zerocopy skbs in flight.
      When reached, transmission stalls. Stalls cause latency, as well as
      head-of-line blocking of other flows that do not use zerocopy.
      
      Instead of stalling, revert to copy-based transmission.
      
      Tested by sending two udp flows from guest to host, one with payload
      of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
      large flow is redirected to a netem instance with 1MBps rate limit
      and deep 1000 entry queue.
      
        modprobe ifb
        ip link set dev ifb0 up
        tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
      
        tc qdisc add dev tap0 ingress
        tc filter add dev tap0 parent ffff: protocol ip \
            u32 match ip dport 8000 0xffff \
            action mirred egress redirect dev ifb0
      
      Before the delay, both flows process around 80K pps. With the delay,
      before this patch, both process around 400. After this patch, the
      large flow is still rate limited, while the small reverts to its
      original rate. See also discussion in the first link, below.
      
      Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to
      send at 100% zerocopy.
      
      The limit in vhost_exceeds_maxpend must be carefully chosen. With
      vq->num >> 1, the flows remain correlated. This value happens to
      correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
      fractions and ensure correctness also for much smaller values of
      vq->num, by testing the min() of both explicitly. See also the
      discussion in the second link below.
      
      Changes
        v1 -> v2
          - replaced min with typed min_t
          - avoid unnecessary whitespace change
      
      Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
      Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.comSigned-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1e6f7453
    • William Tu's avatar
      openvswitch: Add erspan tunnel support. · ceaa001a
      William Tu authored
      Add erspan netlink interface for OVS.
      Signed-off-by: default avatarWilliam Tu <u9012063@gmail.com>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ceaa001a
    • Eric Biggers's avatar
      once: switch to new jump label API · cf4c950b
      Eric Biggers authored
      Switch the DO_ONCE() macro from the deprecated jump label API to the new
      one.  The new one is more readable, and for DO_ONCE() it also makes the
      generated code more icache-friendly: now the one-time initialization
      code is placed out-of-line at the jump target, rather than at the inline
      fallthrough case.
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cf4c950b
    • David S. Miller's avatar
    • Wei Wang's avatar
      ipv6: use rcu_dereference_bh() in ipv6_route_seq_next() · d0e60206
      Wei Wang authored
      This patch replaces rcu_deference() with rcu_dereference_bh() in
      ipv6_route_seq_next() to avoid the following warning:
      
      [   19.431685] WARNING: suspicious RCU usage
      [   19.433451] 4.14.0-rc3-00914-g66f5d6ce #118 Not tainted
      [   19.435509] -----------------------------
      [   19.437267] net/ipv6/ip6_fib.c:2259 suspicious
      rcu_dereference_check() usage!
      [   19.440790]
      [   19.440790] other info that might help us debug this:
      [   19.440790]
      [   19.444734]
      [   19.444734] rcu_scheduler_active = 2, debug_locks = 1
      [   19.447757] 2 locks held by odhcpd/3720:
      [   19.449480]  #0:  (&p->lock){+.+.}, at: [<ffffffffb1231f7d>]
      seq_read+0x3c/0x333
      [   19.452720]  #1:  (rcu_read_lock_bh){....}, at: [<ffffffffb1d2b984>]
      ipv6_route_seq_start+0x5/0xfd
      [   19.456323]
      [   19.456323] stack backtrace:
      [   19.458812] CPU: 0 PID: 3720 Comm: odhcpd Not tainted
      4.14.0-rc3-00914-g66f5d6ce #118
      [   19.462042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.10.2-1 04/01/2014
      [   19.465414] Call Trace:
      [   19.466788]  dump_stack+0x86/0xc0
      [   19.468358]  lockdep_rcu_suspicious+0xea/0xf3
      [   19.470183]  ipv6_route_seq_next+0x71/0x164
      [   19.471963]  seq_read+0x244/0x333
      [   19.473522]  proc_reg_read+0x48/0x67
      [   19.475152]  ? proc_reg_write+0x67/0x67
      [   19.476862]  __vfs_read+0x26/0x10b
      [   19.478463]  ? __might_fault+0x37/0x84
      [   19.480148]  vfs_read+0xba/0x146
      [   19.481690]  SyS_read+0x51/0x8e
      [   19.483197]  do_int80_syscall_32+0x66/0x15a
      [   19.484969]  entry_INT80_compat+0x32/0x50
      [   19.486707] RIP: 0023:0xf7f0be8e
      [   19.488244] RSP: 002b:00000000ffa75d04 EFLAGS: 00000246 ORIG_RAX:
      0000000000000003
      [   19.491431] RAX: ffffffffffffffda RBX: 0000000000000009 RCX:
      0000000008056068
      [   19.493886] RDX: 0000000000001000 RSI: 0000000008056008 RDI:
      0000000000001000
      [   19.496331] RBP: 00000000000001ff R08: 0000000000000000 R09:
      0000000000000000
      [   19.498768] R10: 0000000000000000 R11: 0000000000000000 R12:
      0000000000000000
      [   19.501217] R13: 0000000000000000 R14: 0000000000000000 R15:
      0000000000000000
      
      Fixes: 66f5d6ce ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
      Reported-by: default avatarXiaolong Ye <xiaolong.ye@intel.com>
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d0e60206
    • Linus Torvalds's avatar
      Merge branch 'ppc-bundle' (bundle from Michael Ellerman) · 529a86e0
      Linus Torvalds authored
      Merge powerpc transactional memory fixes from Michael Ellerman:
       "I figured I'd still send you the commits using a bundle to make sure
        it works in case I need to do it again in future"
      
      This fixes transactional memory state restore for powerpc.
      
      * bundle'd patches from Michael Ellerman:
        powerpc/tm: Fix illegal TM state in signal handler
        powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
      529a86e0
    • David S. Miller's avatar
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 9f7be893
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2017-10-09
      
      This series contains updates to i40e and i40evf only.
      
      Jake fixes missed flag conversion from u64 to u32.  Fixes a deafult ITR
      value issue where the driver defaults to an ITR value of half the
      expected value (in terms of minimum microseconds between interrupts).  So
      fix this by changing the default values to be calculated using the
      ITR_REG_TO_USEC() macro which indicates that we are converting from the
      register units into microseconds. Updates the drivers to bump the tail in
      increments of 8 and double the number of descriptors we will bundle into
      one tail bump when receiving.  With the recent kernel support for
      enabling XPS and QoS at the same time, we no longer need to worry about
      the number of traffic classes when enabling XPS.
      
      Lihong converts the use of hash_for_each() to hash_for_each_safe() to
      safely remove a hash entry.  Adds a check for the return value for
      find_first_bit() in the case that it returns the size passed to search.
      
      Alan fixes a bug in which filters are erroneously removed if they are
      removed and then added again.  So make sure that when adding a filter, if
      we find it already existed in our list, make sure it is not marked to be
      removed.
      
      Jayaprakash adds the retrying of PHY reads when the I2C is busy for a
      maximum period of 500ms.
      
      Rami fixes code comment typo.
      
      Stefano Brivio simplifies the code by removing the use of a local
      return code variable and simply return the results of the read function.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9f7be893
  2. 09 Oct, 2017 32 commits
    • David S. Miller's avatar
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 0349a86c
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      10GbE Intel Wired LAN Driver Updates 2017-10-09
      
      This series contains updates to ixgbe only.
      
      Emil fixes an issue where the semaphore bits could be stuck after a reset
      or a crash, by adding the clearing of software resource bits in the
      software/firmware synchronization register.  Added error checks when we
      attempt to identify and initialize the PHY to prevent a crash.  Fixed a
      few issues in the logic of ixgbe_clean_test_rings() which was exposed by
      a previous commit that was causing a crash in ethtool diagnostics.
      
      Bhumika Goyal fixes a couple of instances which were overlooked when we
      made ixgbe_mac_operations constant.
      
      Shannon Nelson fixes an issue to restore normal operations after the
      last MACVLAN offload is removed, otherwise we get stuck in a single queue
      operations.
      
      The infamous Jesper Dangaard Brouer adds a counter which counts the
      number of times the recycle fails and the real page allocator is invoked.
      
      Alex updates the adaptive ITR algorithm to better support the needs of the
      network.  This attempt to make it so that our ITR algorithm will try to
      prevent either starving a socket buffer for memory in the case of
      transmit, or overrunning an receive socket buffer on receive.  We should
      function better with new features like XDP which can handle small packets
      at high rates without needing to lock us into NAPI polling mode.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0349a86c
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ff33952e
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix object leak on IPSEC offload failure, from Steffen Klassert.
      
       2) Fix range checks in ipset address range addition operations, from
          Jozsef Kadlecsik.
      
       3) Fix pernet ops unregistration order in ipset, from Florian Westphal.
      
       4) Add missing netlink attribute policy for nl80211 packet pattern
          attrs, from Peng Xu.
      
       5) Fix PPP device destruction race, from Guillaume Nault.
      
       6) Write marks get lost when BPF verifier processes R1=R2 register
          assignments, causing incorrect liveness information and less state
          pruning. Fix from Alexei Starovoitov.
      
       7) Fix blockhole routes so that they are marked dead and therefore not
          cached in sockets, otherwise IPSEC stops working. From Steffen
          Klassert.
      
       8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits)
        cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
        net: thunderx: mark expected switch fall-throughs in nicvf_main()
        udp: fix bcast packet reception
        netlink: do not set cb_running if dump's start() errs
        ipv4: Fix traffic triggered IPsec connections.
        ipv6: Fix traffic triggered IPsec connections.
        ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
        net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
        Revert commit 1a8b6d76 ("net:add one common config...")
        ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
        ixgbe: Return error when getting PHY address if PHY access is not supported
        netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
        netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
        tipc: Unclone message at secondary destination lookup
        tipc: correct initialization of skb list
        gso: fix payload length when gso_size is zero
        mlxsw: spectrum_router: Avoid expensive lookup during route removal
        bpf: fix liveness marking
        doc: Fix typo "8023.ad" in bonding documentation
        ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
        ...
      ff33952e
    • Aleksander Morgado's avatar
      cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan · fdfbad32
      Aleksander Morgado authored
      The u-blox TOBY-L2 is a LTE Cat 4 module with HSPA+ and 2G fallback.
      This module allows switching to different USB profiles with the
      'AT+UUSBCONF' command, and provides a ECM network interface when the
      'AT+UUSBCONF=2' profile is selected.
      
      The u-blox SARA-U2 is a HSPA module with 2G fallback. The default USB
      configuration includes a ECM network interface.
      
      Both these modules are controlled via AT commands through one of the
      TTYs exposed. Connecting these modules may be done just by activating
      the desired PDP context with 'AT+CGACT=1,<cid>' and then running DHCP
      on the ECM interface.
      Signed-off-by: default avatarAleksander Morgado <aleksander@aleksander.es>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fdfbad32
    • Stefano Brivio's avatar
      i40e: Avoid some useless variables and initializers in NVM functions · 2c4d36b7
      Stefano Brivio authored
      Fixes: 09f79fd4 ("i40e: avoid NVM acquire deadlock during NVM update")
      Signed-off-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      2c4d36b7
    • Rami Rosen's avatar
      i40e: fix a typo · 3d7d7a86
      Rami Rosen authored
      This patch fixes a typo in i40e_vsi_alloc_arrays() documentation.
      The first parameter name should be "vsi" instead of "type".
      Signed-off-by: default avatarRami Rosen <rami.rosen@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      3d7d7a86
    • Lihong Yang's avatar
      i40e: use a local variable instead of calculating multiple times · 9bcc07f0
      Lihong Yang authored
      The computed result of I40E_MAX_VSI_QP * I40E_VIRTCHNL_SUPPORTED_QTYPES
      is used more than three times in function i40e_config_irq_link_list.
      Simply declare a local variable to store it to improve readability.
      Signed-off-by: default avatarLihong Yang <lihong.yang@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      9bcc07f0
    • Jayaprakash Shanmugam's avatar
      i40e: Retry AQC GetPhyAbilities to overcome I2CRead hangs · 4988410f
      Jayaprakash Shanmugam authored
      - When the I2C is busy, the PHY reads are delayed.  The firmware will
        return EGAIN in these cases with an expectation that the SW will
        trigger the reads again
      - This patch retries the operation for a maximum period of 500ms
      Signed-off-by: default avatarJayaprakash Shanmugam <jayaprakash.shanmugam@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      4988410f
    • Lihong Yang's avatar
      i40e: add check for return from find_first_bit call · b861fb76
      Lihong Yang authored
      The find_first_bit function will return the size passed to search
      if the first set bit is not found. This patch adds the check in case
      that happens as the return value would be used as the index in an array
      and that would have caused the out-of-bounds access.
      
      Detected by CoverityScan, CID 1295969 Out-of-bounds access
      Signed-off-by: default avatarLihong Yang <lihong.yang@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      b861fb76
    • Jacob Keller's avatar
      i40e: allow XPS with QoS enabled · 6f853d4f
      Jacob Keller authored
      Recently, the kernel gained support for enabling XPS and QoS at the
      same time. Thus, we no longer need to worry about the number of
      traffic classes when enabling XPS.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      6f853d4f
    • Jacob Keller's avatar
      i40e/i40evf: bundle more descriptors when allocating buffers · 95bc2fb4
      Jacob Keller authored
      Double the number of descriptors we'll bundle into one tail bump when
      receiving. Empirical testing has shown that we reduce CPU utilization
      and don't appear to reduce throughput or packet rate. 32 seems to be the
      sweet spot, as it's half the default polling budget, so we'd essentially
      reduce from 4 tail writes when polling down to 2. Increasing this up to
      64 appears to have negative impacts as it may become possible that we
      don't bump the tail each time we get polled, which could cause a long
      delay between returning descriptors to the hardware.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      95bc2fb4
    • Jacob Keller's avatar
      i40e/i40evf: bump tail only in multiples of 8 · 11f29003
      Jacob Keller authored
      Hardware only fetches descriptors on cachelines of 8, essentially
      ignoring the lower 3 bits of the tail register. Thus, it is pointless to
      bump tail by an unaligned access as the hardware will ignore some of the
      new descriptors we allocated. Thus, it's ideal if we can ensure tail
      writes are always aligned to 8.
      
      At first, it seems like we'd already do this, since we allocate
      descriptors in batches which are a multiple of 8. Since we'd always
      increment by a multiple of 8, it seems like the value should always be
      aligned.
      
      However, this ignores allocation failures. If we fail to allocate
      a buffer, our tail register will become unaligned. Once it has become
      unaligned it will essentially be stuck unaligned until a buffer
      allocation happens to fail at the exact amount necessary to re-align it.
      
      We can do better, by simply rounding down the number of buffers we're
      about to allocate (cleaned_count) such that "next_to_clean
      + cleaned_count" is rounded to the nearest multiple of 8.
      
      We do this by calculating how far off that value is and subtracting it
      from the cleaned_count. This essentially defers allocation of buffers if
      they're going to be ignored by hardware anyways, and re-aligns our
      next_to_use and tail values after a failure to allocate a descriptor.
      
      This calculation ensures that we always align the tail writes in a way
      the hardware expects and don't unnecessarily allocate buffers which
      won't be fetched immediately.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      11f29003
    • Jacob Keller's avatar
      i40e: reduce lrxqthresh from 2 to 1 · 7362be9e
      Jacob Keller authored
      The lrxq thresh value tells hardware to immediately interrupt when there
      are fewer than N*64 packets left in the ring.
      
      Counter intuitively, empirical testing has shown that decreasing this
      value from 2 to 1, and thus changing from an immediate interrupt at
      fewer than 128 descriptors down to 64 descriptors causes a small
      increase in the maximum total packets per second we can receive. This
      increase occurs even when we're polling with interrupts masked, as the
      hardware must still handle interrupts internally even if we've disabled
      them in software.
      
      Also reduce the value for any VFs we allocate.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      7362be9e
    • Jacob Keller's avatar
      i40e/i40evf: always set the CLEARPBA flag when re-enabling interrupts · dbadbbe2
      Jacob Keller authored
      In the past we changed driver behavior to not clear the PBA when
      re-enabling interrupts. This change was motivated by the flawed belief
      that clearing the PBA would cause a lost interrupt if a receive
      interrupt occurred while interrupts were disabled.
      
      According to empirical testing this isn't the case. Additionally, the
      data sheet specifically says that we should set the CLEARPBA bit when
      re-enabling interrupts in a polling setup.
      
      This reverts commit 40d72a50 ("i40e/i40evf: don't lose interrupts")
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      dbadbbe2
    • Jacob Keller's avatar
      i40e/i40evf: fix incorrect default ITR values on driver load · 42702559
      Jacob Keller authored
      The ITR register expects to be programmed in units of 2 microseconds.
      Because of this, all of the drivers I40E_ITR_* constants are in terms of
      this 2 microsecond register.
      
      Unfortunately, the rx_itr_default value is expected to be programmed in
      microseconds.
      
      Effectively the driver defaults to an ITR value of half the expected
      value (in terms of minimum microseconds between interrupts).
      
      Fix this by changing the default values to be calculated using
      ITR_REG_TO_USEC macro which indicates that we're converting from the
      register units into microseconds.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      42702559
    • Alan Brady's avatar
      i40evf: fix mac filter removal timing issue · c766b9af
      Alan Brady authored
      Due to the asynchronous nature in which mac filters are added and
      deleted, there exists a bug in which filters are erroneously removed if
      removed then added again quickly.
      
      The events are as such:
          - filter marked for removal
          - same filter is re-added before watchdog that cleans up filters
          - we skip re-adding the filter because we have it already in the
      list
          - watchdog filter cleanup kicks off and filter is removed
      
      So when we were re-adding the same filter, it didn't actually get added
      because it already existed in the list, but was marked for removal and
      had yet to actually be removed.
      
      This patch fixes the issue by making sure that when adding a filter, if
      we find it already existing in our list, make sure it is not marked to
      be removed.
      Signed-off-by: default avatarAlan Brady <alan.brady@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      c766b9af
    • Lihong Yang's avatar
      i40e: use the safe hash table iterator when deleting mac filters · 784548c4
      Lihong Yang authored
      This patch replaces hash_for_each function with hash_for_each_safe
      when calling  __i40e_del_filter. The hash_for_each_safe function is
      the right one to use when iterating over a hash table to safely remove
      a hash entry. Otherwise, incorrect values may be read from freed memory.
      
      Detected by CoverityScan, CID 1402048 Read from pointer after free
      Signed-off-by: default avatarLihong Yang <lihong.yang@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      784548c4
    • Jacob Keller's avatar
      i40e: fix flags declaration · b48be997
      Jacob Keller authored
      Since we don't yet have more than 32 flags, we'll use a u32 for both the
      hw_features and flag field. Should we gain more flags in the future, we
      may need to convert to a u64 or separate flags out into two fields.
      
      This was overlooked in the previous commit 2781de2134c4 ("i40e/i40evf:
      organize and re-number feature flags"), where the feature flag was not
      converted form u64 to u32.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: default avatarMitch Williams <mitch.a.williams@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      b48be997
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 68ebe3cb
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Hightlights include:
      
        stable fixes:
         - nfs/filelayout: fix oops when freeing filelayout segment
         - NFS: Fix uninitialized rpc_wait_queue
      
        bugfixes:
         - NFSv4/pnfs: Fix an infinite layoutget loop
         - nfs: RPC_MAX_AUTH_SIZE is in bytes"
      
      * tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFSv4/pnfs: Fix an infinite layoutget loop
        nfs/filelayout: fix oops when freeing filelayout segment
        sunrpc: remove redundant initialization of sock
        NFS: Fix uninitialized rpc_wait_queue
        NFS: Cleanup error handling in nfs_idmap_request_key()
        nfs: RPC_MAX_AUTH_SIZE is in bytes
      68ebe3cb
    • David S. Miller's avatar
      Merge branch 'ipv6-addrlabel-avoid-dirtying-ip6addrlbl_entry' · 2e997d8b
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      ipv6: addrlabel: avoid dirtying ip6addrlbl_entry
      
      The refcount on ip6addrlbl_entry is only used to make sure ip6addrlbl_entry
      does not disappear while ip6addrlbl_get() is allocating an skb.
      
      We can instead allocate skb first, then use RCU, so that we no longer need
      to refcount these structures.
      ====================
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2e997d8b
    • Eric Dumazet's avatar
      ipv6: addrlabel: remove refcounting · 2809c095
      Eric Dumazet authored
      After previous patch ("ipv6: addrlabel: rework ip6addrlbl_get()")
      we can remove the refcount from struct ip6addrlbl_entry,
      since it is no longer elevated in p6addrlbl_get()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2809c095
    • Eric Dumazet's avatar
      ipv6: addrlabel: rework ip6addrlbl_get() · 66c77ff3
      Eric Dumazet authored
      If we allocate skb before the lookup, we can use RCU
      without the need of ip6addrlbl_hold()
      
      This means that the following patch can get rid of refcounting.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66c77ff3
    • Gustavo A. R. Silva's avatar
      net: thunderx: mark expected switch fall-throughs in nicvf_main() · 1a2ace56
      Gustavo A. R. Silva authored
      In preparation to enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Robert Richter <rric@kernel.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1a2ace56
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · fb60bccc
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS fixes for net
      
      The following patchset contains Netfilter/IPVS fixes for your net tree,
      they are:
      
      1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
         Fedorenko.
      
      2) Fix splat with mark restoration in xt_socket with non-full-sock,
         patch from Subash Abhinov Kasiviswanathan.
      
      3) ipset bogusly bails out when adding IPv4 range containing more than
         2^31 addresses, from Jozsef Kadlecsik.
      
      4) Incorrect pernet unregistration order in ipset, from Florian Westphal.
      
      5) Races between dump and swap in ipset results in BUG_ON splats, from
         Ross Lagerwall.
      
      6) Fix chain renames in nf_tables, from JingPiao Chen.
      
      7) Fix race in pernet codepath with ebtables table registration, from
         Artem Savkov.
      
      8) Memory leak in error path in set name allocation in nf_tables, patch
         from Arvind Yadav.
      
      9) Don't dump chain counters if they are not available, this fixes a
         crash when listing the ruleset.
      
      10) Fix out of bound memory read in strlcpy() in x_tables compat code,
          from Eric Dumazet.
      
      11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
          Lin Zhang.
      
      12) Cannot load rules incrementally anymore after xt_bpf with pinned
          objects, added in revision 1. From Shmulik Ladkani.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fb60bccc
    • David S. Miller's avatar
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · 5766cd68
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2017-10-09
      
      This series contains updates to ixgbe and arch/Kconfig.
      
      Mark fixes a case where PHY register access is not supported and we were
      returning a PHY address, when we should have been returning -EOPNOTSUPP.
      
      Sabrina Dubroca fixes the use of a logical "and" when it should have been
      the bitwise "and" operator.
      
      Ding Tianhong reverts the commit that added the Kconfig bool option
      ARCH_WANT_RELAX_ORDER, since there is now a new flag
      PCI_DEV_FLAGS_NO_RELAXED_ORDERING that has been added to indicate that
      Relaxed Ordering Attributes should not be used for Transaction Layer
      Packets.  Then follows up with making the needed changes to ixgbe to
      use the new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag.
      
      John Fastabend fixes an issue in the ring accounting when the transmit
      ring parameters are changed via ethtool when an XDP program is attached.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5766cd68
    • David S. Miller's avatar
      Merge branch 'mlx4-static-checker-warnings' · 1ee87d7a
      David S. Miller authored
      Tariq Toukan says:
      
      ====================
      Fix mlx4 static checker warnings
      
      This patchset contains fixes for static checker warnings
      in the mlx4 Core and Eth drivers.
      
      Patch 1 fixes an actual bug discovered by the checker.
      Patches 2 and 3 fix the warnings without functional changes.
      
      Series generated against net-next commit:
      c49c777f qed: Delete redundant check on dcb_app priority
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1ee87d7a
    • Tariq Toukan's avatar
      net/mlx4_en: Use __force to fix a sparse warning in TX datapath · 7ba5e7bd
      Tariq Toukan authored
      In TX data-path, we intentionally do not byte-swap, as documented
      in code and in the cited commit log.
      This fixes sparse warning:
      en_tx.c:720:23: warning: incorrect type in argument 1 (different base types)
      en_tx.c:720:23:    expected unsigned int [unsigned] [usertype] <noident>
      en_tx.c:720:23:    got restricted __be32 [usertype] doorbell_qpn
      
      Fixes: 492f5add ("net/mlx4_en: Doorbell is byteswapped in Little Endian archs")
      Signed-off-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7ba5e7bd
    • Tariq Toukan's avatar
      net/mlx4_core: Fix cast warning in fw.c · b71322d9
      Tariq Toukan authored
      Fix the following SPARSE warning, in MLX4_GET() macro:
      drivers/net/ethernet/mellanox/mlx4/fw.c:233:9: warning: cast to restricted __be64
      
      Fixes: 17d5ceb6 ("net/mlx4_core: Fix unaligned accesses")
      Signed-off-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b71322d9
    • Tariq Toukan's avatar
      net/mlx4: Fix endianness issue in qp context params · bb428a5c
      Tariq Toukan authored
      Should take care of the endianness before assigning to params2 field.
      
      Fixes: 53f33ae2 ("net/mlx4_core: Port aggregation upper layer interface")
      Signed-off-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bb428a5c
    • Mika Westerberg's avatar
      thunderbolt: Initialize Thunderbolt bus earlier · acb40d84
      Mika Westerberg authored
      The 0day kbuild robot reports following crash:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000004
        IP: tb_property_find+0xe/0x41
        *pde = 00000000
        Oops: 0000 [#1]
        CPU: 0 PID: 1 Comm: swapper Not tainted 4.14.0-rc1-00741-ge69b6c02 #412
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        task: 89c80000 task.stack: 89c7c000
        EIP: tb_property_find+0xe/0x41
        EFLAGS: 00210246 CPU: 0
        EAX: 00000000 EBX: 7a368f47 ECX: 00000044 EDX: 7a368f47
        ESI: 8851d340 EDI: 7a368f47 EBP: 89c7df0c ESP: 89c7defc
         DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
        CR0: 80050033 CR2: 00000004 CR3: 027a2000 CR4: 00000690
        Call Trace:
         tb_register_property_dir+0x49/0xb9
         ? cdc_mbim_driver_init+0x1b/0x1b
         tbnet_init+0x77/0x9f
         ? cdc_mbim_driver_init+0x1b/0x1b
         do_one_initcall+0x7e/0x145
         ? parse_args+0x10c/0x1b3
         ? kernel_init_freeable+0xbe/0x159
         kernel_init_freeable+0xd1/0x159
         ? rest_init+0x110/0x110
         kernel_init+0xd/0xd0
         ret_from_fork+0x19/0x30
      
      The reason is that both Thunderbolt bus and thunderbolt-net are build
      into the kernel image, and the latter is linked first because
      drivers/net comes before drivers/thunderbolt. Since both use
      module_init() thunderbolt-net ends up calling Thunderbolt bus functions
      too early triggering the above crash.
      
      Fix this by moving Thunderbolt bus initialization to happen earlier to
      make sure all the data structures are ready when Thunderbolt service
      drivers are initialized. To be on the safe side also add a check for
      properly initialized xdomain_property_dir to tb_register_property_dir().
      Reported-by: default avatarkernel test robot <fengguang.wu@intel.com>
      Signed-off-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      acb40d84
    • Eric Dumazet's avatar
      ipv6: avoid zeroing per cpu data again · bfd8e5a4
      Eric Dumazet authored
      per cpu allocations are already zeroed, no need to clear them again.
      
      Fixes: d52d3997 ("ipv6: Create percpu rt6_info")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bfd8e5a4
    • Paolo Abeni's avatar
      udp: fix bcast packet reception · 996b44fc
      Paolo Abeni authored
      The commit bc044e8d ("udp: perform source validation for
      mcast early demux") does not take into account that broadcast packets
      lands in the same code path and they need different checks for the
      source address - notably, zero source address are valid for bcast
      and invalid for mcast.
      
      As a result, 2nd and later broadcast packets with 0 source address
      landing to the same socket are dropped. This breaks dhcp servers.
      
      Since we don't have stringent performance requirements for ingress
      broadcast traffic, fix it by disabling UDP early demux such traffic.
      Reported-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Fixes: bc044e8d ("udp: perform source validation for mcast early demux")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      996b44fc
    • Jason A. Donenfeld's avatar
      netlink: do not set cb_running if dump's start() errs · 41c87425
      Jason A. Donenfeld authored
      It turns out that multiple places can call netlink_dump(), which means
      it's still possible to dereference partially initialized values in
      dump() that were the result of a faulty returned start().
      
      This fixes the issue by calling start() _before_ setting cb_running to
      true, so that there's no chance at all of hitting the dump() function
      through any indirect paths.
      
      It also moves the call to start() to be when the mutex is held. This has
      the nice side effect of serializing invocations to start(), which is
      likely desirable anyway. It also prevents any possible other races that
      might come out of this logic.
      
      In testing this with several different pieces of tricky code to trigger
      these issues, this commit fixes all avenues that I'm aware of.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Reviewed-by: default avatarJohannes Berg <johannes@sipsolutions.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      41c87425