1. 24 Sep, 2015 8 commits
    • bnx2x: byte swap rss_key to comply to Toeplitz specs · d682d2bd
      Eric Dumazet authored
      After a good amount of debugging, I found bnx2x was byte swapping
      the 40 bytes of rss_key.
      
      If we byte swap the key, then bnx2x generates hashes matching
      MSDN specs as documented in (Verifying the RSS Hash Calculation)
      
      https://msdn.microsoft.com/en-us/library/windows/hardware/ff571021%28v=vs.85%29.aspx
      
      It is mostly a non-issue, unless we want to mix different NICs
      in a host and want consistent hashing among all of them, i.e.
      if they all use the boot-time generated rss key, or if some application
      is choosing specific tuple(s) so that incoming traffic lands into known
      rx queue(s).
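
A minimal software sketch of the Toeplitz hash, assuming the usual definition (XOR together the 32-bit key window at every set input bit, sliding the window one bit per input bit). It is not taken from the bnx2x driver; the name toeplitz_hash and its signature are illustrative only. A reference like this can be compared against the hash a NIC reports for the same key and tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: Toeplitz hash of `data` under `key`.
 * The key must supply 32 window bits plus one bit per input bit. */
uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                       const uint8_t *data, size_t data_len)
{
    assert(key_len >= data_len + 4);

    /* Window starts as the first 32 bits of the key, MSB first. */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    uint32_t hash = 0;

    for (size_t i = 0; i < data_len; i++) {
        uint8_t next = key[i + 4]; /* key byte that feeds the window */

        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window left by one key bit. */
            window = (window << 1) | ((uint32_t)(next >> bit) & 1u);
        }
    }
    return hash;
}
```

For a TCP/IPv4 flow the input would be the 12 bytes of source address, destination address, source port and destination port in network byte order. Feeding the key bytes in a different order changes which window lines up with each input bit, which is why a byte-swapped key yields different hashes.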
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: revert "net_sched: move tp->root allocation into fw_init()" · d8aecb10
      WANG Cong authored
      fw filter uses tp->root==NULL to check if it is the old method,
      so it doesn't need allocation at all in this case. This patch
      reverts the offending commit and adds some comments for the old
      method to make it obvious.
      
      Fixes: 33f8b9ec ("net_sched: move tp->root allocation into fw_init()")
      Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'lwt_arp' · 8fe79c60
      David S. Miller authored
      Jiri Benc says:
      
      ====================
      lwtunnel: make it really work, for IPv4
      
      One of the selling points of lwtunnel was the ability to specify the tunnel
      destination using routes. However, this doesn't really work currently, as
      ARP and ndisc replies are not handled correctly. ARP and ndisc replies won't
      have tunnel metadata attached, thus they will be sent out with the default
      parameters or not sent at all, either way never reaching the requester.
      
      Most of the egress tunnel parameters can be inferred from the ingress
      metadata. The only, and important, exception is UDP ports. This patchset infers
      the egress data from the ingress data and disallows setting of UDP ports in
      tunnel routes. If there's a need for different UDP ports, a new interface
      needs to be created for each port combination. Note that it's still possible
      to specify the UDP ports to use, it just needs to be done while creating the
      vxlan/geneve interface.
      
      This covers only ARPs. IPv6 ndisc has the same problem but is harder to
      solve, as there's already dst attached to outgoing skbs. Ideas to solve this
      are welcome.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lwtunnel: remove source and destination UDP port config option · b194f30c
      Jiri Benc authored
      The UDP tunnel config is asymmetric with respect to the ports used. The source and
      destination ports from one direction of the tunnel are not related to the
      ports of the other direction. We need to be able to respond to ARP requests
      using the correct ports without involving routing.
      
      As a consequence, UDP ports need to be a fixed property of the tunnel
      interface and cannot be set per route. Remove the ability to set ports per
      route. This is still okay to do, as no kernel has been released with these
      attributes yet.
      
      Note that the ability to specify source and destination ports is preserved
      for other users of the lwtunnel API which don't use routes for tunnel key
      specification (like openvswitch).
      
      If in the future we rework ARP handling to allow port specification, the
      attributes can be added back.
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: send arp replies to the correct tunnel · 63d008a4
      Jiri Benc authored
      When using ip lwtunnels, the additional data for xmit (basically, the actual
      tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
      metadata dst. When replying to ARP requests, we need to send the reply to
      the same tunnel the request came from. This means we need to construct
      proper metadata dst for ARP replies.
      
      We could perform another route lookup to get a dst entry with the correct
      lwtstate. However, this won't always ensure that the outgoing tunnel is the
      same as the incoming one, and it won't work anyway for IPv4 duplicate
      address detection.
      
      The only thing to do is to "reverse" the ip_tunnel_info.
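
As a rough illustration of what "reversing" could mean, here is a userspace sketch with a simplified, hypothetical key structure (struct tun_key and tun_key_reply are not the kernel's ip_tunnel_info API). It assumes the reply keeps the tunnel ID of the request and is addressed to the request's outer source, leaving the outer source of the reply to be chosen later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a tunnel key. */
struct tun_key {
    uint64_t tun_id; /* tunnel identifier, e.g. a VNI */
    uint32_t src;    /* outer IPv4 source address */
    uint32_t dst;    /* outer IPv4 destination address */
};

/* Build the transmit key for a reply from the key of the received request. */
struct tun_key tun_key_reply(const struct tun_key *rx)
{
    struct tun_key tx;

    assert(rx != NULL);
    memset(&tx, 0, sizeof(tx));
    tx.tun_id = rx->tun_id; /* stay in the same tunnel */
    tx.dst = rx->src;       /* send the reply back to the requester */
    /* tx.src stays 0: assumed to be filled in on transmit */
    return tx;
}
```

No route lookup is involved, so the reply cannot end up in a different tunnel than the one the request arrived on.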
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: gianfar: remove misuse of IRQF_NO_SUSPEND flag · d5b8d640
      Sudeep Holla authored
      The device is set as wakeup capable using proper wakeup API but the
      driver misuses IRQF_NO_SUSPEND to set the interrupt as wakeup source
      which is incorrect.
      
      This patch removes the use of the IRQF_NO_SUSPEND flag, replacing it with
      enable_irq_wake() instead.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Claudiu Manoil <claudiu.manoil@freescale.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
      Acked-by: Claudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: Fix skb checksum flag on skb pull · 6ae459bd
      Pravin B Shelar authored
      A VXLAN device can receive an skb with checksum partial, but the checksum
      offset could be in the outer header, which is pulled on receive. This results
      in a negative checksum offset for the skb. Such an skb can cause the assert
      failure in skb_checksum_help(). The following patch fixes the bug by setting
      checksum-none while pulling outer header.
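
The idea can be modelled in userspace roughly as follows. This is a sketch with a hypothetical struct pkt and hypothetical names, not the kernel's sk_buff code: if a pull moves the start of the data past the point where checksumming was supposed to begin, the "partial" state can no longer be honoured and is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checksum states, loosely modelled on the skb flags. */
enum csum_state { CSUM_NONE, CSUM_PARTIAL };

/* Hypothetical packet model: offsets are relative to the buffer start. */
struct pkt {
    size_t data_off;   /* where the current header starts */
    size_t csum_start; /* where checksumming is supposed to start */
    enum csum_state ip_summed;
};

/* Pull `len` bytes of header off the front of the packet. */
void pkt_pull(struct pkt *p, size_t len)
{
    assert(p != NULL);
    p->data_off += len;

    /* The checksum start now lies before the data: the offset relative
     * to the data would be negative, so fall back to "no checksum". */
    if (p->ip_summed == CSUM_PARTIAL && p->csum_start < p->data_off)
        p->ip_summed = CSUM_NONE;
}
```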
      
      The following is the kernel panic message from an old kernel hitting the bug.
      
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:1906!
      RIP: 0010:[<ffffffff81518034>] skb_checksum_help+0x144/0x150
      Call Trace:
      <IRQ>
      [<ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch]
      [<ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch]
      [<ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
      [<ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
      [<ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
      [<ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch]
      [<ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
      [<ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0
      [<ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0
      [<ffffffff8157ba7a>] udp_rcv+0x1a/0x20
      [<ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280
      [<ffffffff81550128>] ip_local_deliver+0x88/0x90
      [<ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370
      [<ffffffff81550365>] ip_rcv+0x235/0x300
      [<ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620
      [<ffffffff8151c360>] netif_receive_skb+0x80/0x90
      [<ffffffff81459935>] virtnet_poll+0x555/0x6f0
      [<ffffffff8151cd04>] net_rx_action+0x134/0x290
      [<ffffffff810683d8>] __do_softirq+0xa8/0x210
      [<ffffffff8162fe6c>] call_softirq+0x1c/0x30
      [<ffffffff810161a5>] do_softirq+0x65/0xa0
      [<ffffffff810687be>] irq_exit+0x8e/0xb0
      [<ffffffff81630733>] do_IRQ+0x63/0xe0
      [<ffffffff81625f2e>] common_interrupt+0x6e/0x6e
      Reported-by: Anupam Chanda <achanda@vmware.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: Replace rhash_portid with bound · da314c99
      Herbert Xu authored
      On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
      >
      > store_release and load_acquire are different from the usual memory
      > barriers and can't be paired this way.  You have to pair store_release
      > and load_acquire.  Besides, it isn't a particularly good idea to
      
      OK I've decided to drop the acquire/release helpers as they don't
      help us at all and simply pessimise the code by using full memory
      barriers (on some architectures) where only a write or read barrier
      is needed.
      
      > depend on memory barriers embedded in other data structures like the
      > above.  Here, especially, rhashtable_insert() would have write barrier
      > *before* the entry is hashed not necessarily *after*, which means that
      > in the above case, a socket which appears to have set bound to a
      > reader might not be visible when the reader tries to look up the socket
      > on the hashtable.
      
      But you are right we do need an explicit write barrier here to
      ensure that the hashing is visible.
      
      > There's no reason to be overly smart here.  This isn't a crazy hot
      > path, write barriers tend to be very cheap, store_release more so.
      > Please just do smp_store_release() and note what it's paired with.
      
      It's not about being overly smart.  It's about actually understanding
      what's going on with the code.  I've seen too many instances of
      people simply sprinkling synchronisation primitives around without
      any knowledge of what is happening underneath, which is just a recipe
      for creating hard-to-debug races.
      
      > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
      > >  		}
      > >  	}
      > >
      > > -	if (!nlk->portid) {
      > > +	if (!nlk->bound) {
      >
      > I don't think you can skip load_acquire here just because this is the
      > second deref of the variable.  That doesn't change anything.  Race
      > condition could still happen between the first and second tests and
      > skipping the second would lead to the same kind of bug.
      
      The reason this one is OK is because we do not use nlk->portid or
      try to get nlk from the hash table before we return to user-space.
      
      However, there is a real bug here that none of these acquire/release
      helpers discovered.  The two bound tests here used to be a single
      one.  Now that they are separate it is entirely possible for another
      thread to come in the middle and bind the socket.  So we need to
      repeat the portid check in order to maintain consistency.
      
      > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
      > >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
      > >  		return -EPERM;
      > >
      > > -	if (!nlk->portid)
      > > +	if (!nlk->bound)
      >
      > Don't we need load_acquire here too?  Is this path holding a lock
      > which makes that unnecessary?
      
      Ditto.
      
      ---8<---
      The commit 1f770c0a ("netlink: Fix autobind race condition that
      leads to zero port ID") created some new races that can occur due
      to inconsistencies between the two port IDs.
      
      Tejun is right that a barrier is unavoidable.  Therefore I am
      reverting to the original patch that used a boolean to indicate
      that a user netlink socket has been bound.
      
      Barriers have been added where necessary to ensure that a valid
      portid and the hashed socket is visible.
      
      I have also changed netlink_insert to only return EBUSY if the
      socket is bound to a portid different to the requested one.  This
      combined with only reading nlk->bound once in netlink_bind fixes
      a race where two threads that bind the socket at the same time
      with different port IDs may both succeed.
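
A userspace analogue of the "publish, then set a flag" pattern, written with C11 fences rather than the kernel's smp_wmb()/smp_rmb(). The struct and function names are hypothetical and this is only a sketch of the ordering requirement, not the netlink code: the writer makes portid visible before bound, and a reader that sees bound set may then read portid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two fields that must stay consistent. */
struct nl_sock_model {
    uint32_t portid;
    atomic_bool bound;
};

/* Writer: store the port ID, then publish it by setting `bound`. */
void nl_publish(struct nl_sock_model *s, uint32_t portid)
{
    assert(s != NULL);
    s->portid = portid;
    /* Write barrier: portid must be visible before bound. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->bound, true, memory_order_relaxed);
}

/* Reader: only trust portid after observing `bound`. */
bool nl_lookup_portid(struct nl_sock_model *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (!atomic_load_explicit(&s->bound, memory_order_relaxed))
        return false;
    /* Read barrier: pairs with the fence in nl_publish(). */
    atomic_thread_fence(memory_order_acquire);
    *out = s->portid;
    return true;
}
```

The sketch covers only the visibility ordering. The separate race described above, where two threads bind with different port IDs, needs the single read of bound and the EBUSY check, which are not modelled here.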
      
      Fixes: 1f770c0a ("netlink: Fix autobind race condition that leads to zero port ID")
      Reported-by: Tejun Heo <tj@kernel.org>
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Nacked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 23 Sep, 2015 23 commits
    • geneve: use network byte order for destination port config parameter · 7bbe33ff
      John W. Linville authored
      This is primarily for consistency with vxlan and other tunnels which
      use network byte order for similar parameters.
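
A small sketch of what keeping a port in network byte order looks like, using the POSIX htons()/ntohs() helpers. The struct tunnel_cfg and its accessors are hypothetical and are not the geneve driver's interface.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tunnel configuration. */
struct tunnel_cfg {
    uint16_t dst_port_be; /* destination port, network byte order */
};

/* Convert once, at the configuration boundary. */
void tunnel_cfg_set_port(struct tunnel_cfg *cfg, uint16_t host_port)
{
    assert(cfg != NULL);
    cfg->dst_port_be = htons(host_port);
}

/* Convert back only when a host-order value is needed, e.g. for printing. */
uint16_t tunnel_cfg_get_port(const struct tunnel_cfg *cfg)
{
    assert(cfg != NULL);
    return ntohs(cfg->dst_port_be);
}
```

Storing the value in wire order means it can be copied into a UDP header without a further swap, and all tunnel types agree on what the parameter means.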
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 8139cp: Dump contents of descriptor ring on TX timeout · 41b97641
      David Woodhouse authored
      We are seeing unexplained TX timeouts under heavy load. Let's try to get
      a better idea of what's going on.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 8139cp: Fix DMA unmapping of transmitted buffers · 7f4c6856
      David Woodhouse authored
      The low 16 bits of the 'opts1' field in the TX descriptor are supposed
      to still contain the buffer length when the descriptor is handed back to
      us. In practice, at least on my hardware, they don't. So stash the
      original value of the opts1 field and get the length to unmap from
      there.
      
      There are other ways we could have worked out the length, but I actually
      want a stash of the opts1 field anyway so that I can dump it alongside
      the contents of the descriptor ring when we suffer a TX timeout.
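
The stash can be pictured as a driver-private shadow array alongside the hardware ring. The sketch below is a userspace model with hypothetical names and ring size, not the 8139cp code: the unmap length is taken from the private copy, so it survives even if the hardware overwrites the descriptor field.

```c
#include <assert.h>
#include <stdint.h>

#define TX_RING_SIZE 64 /* hypothetical ring size */

/* Hypothetical model of a TX ring with a shadow copy of opts1. */
struct tx_ring {
    uint32_t hw_opts1[TX_RING_SIZE]; /* descriptor field, hardware may rewrite it */
    uint32_t tx_opts[TX_RING_SIZE];  /* driver's private stash */
};

/* Queue a buffer: write the descriptor and remember what was written. */
void tx_queue(struct tx_ring *r, unsigned int slot, uint32_t opts1)
{
    assert(slot < TX_RING_SIZE);
    r->hw_opts1[slot] = opts1;
    r->tx_opts[slot] = opts1;
}

/* Length to unmap: low 16 bits of the stashed value, not the descriptor. */
uint32_t tx_unmap_len(const struct tx_ring *r, unsigned int slot)
{
    assert(slot < TX_RING_SIZE);
    return r->tx_opts[slot] & 0xffff;
}
```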
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 8139cp: Reduce duplicate csum/tso code in cp_start_xmit() · 0a5aeee0
      David Woodhouse authored
      We calculate the value of the opts1 descriptor field in three different
      places, with two different behaviours when given an invalid packet to
      be checksummed, neither of them correct. Sort that out.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 8139cp: Fix TSO/scatter-gather descriptor setup · a3b80404
      David Woodhouse authored
      When sending a TSO frame in multiple buffers, we were neglecting to set
      the first descriptor up in TSO mode.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 8139cp: Fix tx_queued debug message to print correct slot numbers · 26b0bad6
      David Woodhouse authored
      After a certain amount of staring at the debug output of this driver, I
      realised it was lying to me.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 8139cp: Do not re-enable RX interrupts in cp_tx_timeout() · aaa0062e
      David Woodhouse authored
      If an RX interrupt was already received but NAPI has not yet run when
      the TX timeout happens, we end up in cp_tx_timeout() with RX interrupts
      already disabled. Blindly re-enabling them will cause an IRQ storm.
      
      (This is made particularly horrid by the fact that cp_interrupt() always
      returns that it's handled the interrupt, even when it hasn't actually
      done anything. If it didn't do that, the core IRQ code would have
      detected the storm and handled it, I'd have had a clear smoking gun
      backtrace instead of just a spontaneously resetting router, and I'd have
      at *least* two days of my life back. Changing the return value of
      cp_interrupt() will be argued about under separate cover.)
      
      Unconditionally leave RX interrupts disabled after the reset, and
      schedule NAPI to check the receive ring and re-enable them.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netcp-fixes' · 3c6cb3ac
      David S. Miller authored
      Murali Karicheri says:
      
      ====================
      net: netcp: a set of bug fixes
      
      This patch series fixes a set of issues in netcp driver seen during internal
      testing of the driver. While at it, do some clean up as well.
      
      The fixes are tested on K2HK, K2L and K2E EVMs and the boot up logs can be
      seen at
      
       http://pastebin.ubuntu.com/12533100/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: netcp: fix deadlock reported by lockup detector · 8ceaf361
      Karicheri, Muralidharan authored
      A deadlock trace is seen in netcp driver with lockup detector enabled.
      The trace log is provided below for reference. This patch fixes the
      bug by removing the usage of netcp_modules_lock within ndo_ops functions.
      ndo_{open/close/ioctl}() is already called with rtnl_lock held. So there
      is no need to hold another mutex for serialization across processes on
      multiple cores.  So remove use of netcp_modules_lock mutex from these
      ndo ops functions.
      
      ndo_set_rx_mode() shouldn't be using a mutex as it is called from atomic
      context. In the case of ndo_set_rx_mode(), there can be a call to this API
      without rtnl_lock held from an atomic context. As the underlying modules
      are expected to add address to a hardware table, it is to be protected
      across concurrent updates and hence a spin lock is used to synchronize
      the access. Same with ndo_vlan_rx_add_vid() & ndo_vlan_rx_kill_vid().
      
      Probably the netcp_modules_lock is used to protect the module from being
      removed as part of rmmod. Currently this is not fully implemented and
      assumes the interface is brought down before doing rmmod of modules.
      The support for rmmod while the interface is up is expected in a future
      patch set when additional modules such as pa, qos are added. For now
      all of the tests such as if up/down, reboot, iperf work fine with this
      patch applied.
      
      Deadlock trace seen with lockup detector enabled is shown below for
      reference.
      
      [   16.863014] ======================================================
      [   16.869183] [ INFO: possible circular locking dependency detected ]
      [   16.875441] 4.1.6-01265-gfb1e101 #1 Tainted: G        W
      [   16.881176] -------------------------------------------------------
      [   16.887432] ifconfig/1662 is trying to acquire lock:
      [   16.892386]  (netcp_modules_lock){+.+.+.}, at: [<c03e8110>]
      netcp_ndo_open+0x168/0x518
      [   16.900321]
      [   16.900321] but task is already holding lock:
      [   16.906144]  (rtnl_mutex){+.+.+.}, at: [<c053a418>] devinet_ioctl+0xf8/0x7e4
      [   16.913206]
      [   16.913206] which lock already depends on the new lock.
      [   16.913206]
      [   16.921372]
      [   16.921372] the existing dependency chain (in reverse order) is:
      [   16.928844]
      -> #1 (rtnl_mutex){+.+.+.}:
      [   16.932865]        [<c06023f0>] mutex_lock_nested+0x68/0x4a8
      [   16.938521]        [<c04c5758>] register_netdev+0xc/0x24
      [   16.943831]        [<c03e65c0>] netcp_module_probe+0x214/0x2ec
      [   16.949660]        [<c03e8a54>] netcp_register_module+0xd4/0x140
      [   16.955663]        [<c089654c>] keystone_gbe_init+0x10/0x28
      [   16.961233]        [<c000977c>] do_one_initcall+0xb8/0x1f8
      [   16.966714]        [<c0867e04>] kernel_init_freeable+0x148/0x1e8
      [   16.972720]        [<c05f9994>] kernel_init+0xc/0xe8
      [   16.977682]        [<c0010038>] ret_from_fork+0x14/0x3c
      [   16.982905]
      -> #0 (netcp_modules_lock){+.+.+.}:
      [   16.987619]        [<c006eab0>] lock_acquire+0x118/0x320
      [   16.992928]        [<c06023f0>] mutex_lock_nested+0x68/0x4a8
      [   16.998582]        [<c03e8110>] netcp_ndo_open+0x168/0x518
      [   17.004064]        [<c04c48f0>] __dev_open+0xa8/0x10c
      [   17.009112]        [<c04c4b74>] __dev_change_flags+0x94/0x144
      [   17.014853]        [<c04c4c3c>] dev_change_flags+0x18/0x48
      [   17.020334]        [<c053a9fc>] devinet_ioctl+0x6dc/0x7e4
      [   17.025729]        [<c04a59ec>] sock_ioctl+0x1d0/0x2a8
      [   17.030865]        [<c0142844>] do_vfs_ioctl+0x41c/0x688
      [   17.036173]        [<c0142ae4>] SyS_ioctl+0x34/0x5c
      [   17.041046]        [<c000ff60>] ret_fast_syscall+0x0/0x54
      [   17.046441]
      [   17.046441] other info that might help us debug this:
      [   17.046441]
      [   17.054434]  Possible unsafe locking scenario:
      [   17.054434]
      [   17.060343]        CPU0                    CPU1
      [   17.064862]        ----                    ----
      [   17.069381]   lock(rtnl_mutex);
      [   17.072522]                                lock(netcp_modules_lock);
      [   17.078875]                                lock(rtnl_mutex);
      [   17.084532]   lock(netcp_modules_lock);
      [   17.088366]
      [   17.088366]  *** DEADLOCK ***
      [   17.088366]
      [   17.094279] 1 lock held by ifconfig/1662:
      [   17.098278]  #0:  (rtnl_mutex){+.+.+.}, at: [<c053a418>]
      devinet_ioctl+0xf8/0x7e4
      [   17.105774]
      [   17.105774] stack backtrace:
      [   17.110124] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   17.118637] Hardware name: Keystone
      [   17.122123] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   17.129862] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   17.137079] [<c05ff450>] (dump_stack) from [<c0068e34>]
      (print_circular_bug+0x210/0x330)
      [   17.145161] [<c0068e34>] (print_circular_bug) from [<c006ab7c>]
      (validate_chain.isra.35+0xf98/0x13ac)
      [   17.154372] [<c006ab7c>] (validate_chain.isra.35) from [<c006da60>]
      (__lock_acquire+0x52c/0xcc0)
      [   17.163149] [<c006da60>] (__lock_acquire) from [<c006eab0>]
      (lock_acquire+0x118/0x320)
      [   17.171058] [<c006eab0>] (lock_acquire) from [<c06023f0>]
      (mutex_lock_nested+0x68/0x4a8)
      [   17.179140] [<c06023f0>] (mutex_lock_nested) from [<c03e8110>]
      (netcp_ndo_open+0x168/0x518)
      [   17.187484] [<c03e8110>] (netcp_ndo_open) from [<c04c48f0>]
      (__dev_open+0xa8/0x10c)
      [   17.195133] [<c04c48f0>] (__dev_open) from [<c04c4b74>]
      (__dev_change_flags+0x94/0x144)
      [   17.203129] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>]
      (dev_change_flags+0x18/0x48)
      [   17.211560] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>]
      (devinet_ioctl+0x6dc/0x7e4)
      [   17.219729] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>]
      (sock_ioctl+0x1d0/0x2a8)
      [   17.227378] [<c04a59ec>] (sock_ioctl) from [<c0142844>]
      (do_vfs_ioctl+0x41c/0x688)
      [   17.234939] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>]
      (SyS_ioctl+0x34/0x5c)
      [   17.242242] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>]
      (ret_fast_syscall+0x0/0x54)
      [   17.258855] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
      control off
      [   17.271282] BUG: sleeping function called from invalid context at
      kernel/locking/mutex.c:616
      [   17.279712] in_atomic(): 1, irqs_disabled(): 0, pid: 1662, name: ifconfig
      [   17.286500] INFO: lockdep is turned off.
      [   17.290413] Preemption disabled at:[<  (null)>]   (null)
      [   17.295728]
      [   17.297214] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   17.305735] Hardware name: Keystone
      [   17.309223] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   17.316970] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   17.324194] [<c05ff450>] (dump_stack) from [<c06023b0>]
      (mutex_lock_nested+0x28/0x4a8)
      [   17.332112] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>]
      (netcp_set_rx_mode+0x160/0x210)
      [   17.340724] [<c03e9840>] (netcp_set_rx_mode) from [<c04c483c>]
      (dev_set_rx_mode+0x1c/0x28)
      [   17.348982] [<c04c483c>] (dev_set_rx_mode) from [<c04c490c>]
      (__dev_open+0xc4/0x10c)
      [   17.356724] [<c04c490c>] (__dev_open) from [<c04c4b74>]
      (__dev_change_flags+0x94/0x144)
      [   17.364729] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>]
      (dev_change_flags+0x18/0x48)
      [   17.373166] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>]
      (devinet_ioctl+0x6dc/0x7e4)
      [   17.381344] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>]
      (sock_ioctl+0x1d0/0x2a8)
      [   17.388994] [<c04a59ec>] (sock_ioctl) from [<c0142844>]
      (do_vfs_ioctl+0x41c/0x688)
      [   17.396563] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>]
      (SyS_ioctl+0x34/0x5c)
      [   17.403873] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>]
      (ret_fast_syscall+0x0/0x54)
      [   17.413772] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
      udhcpc (v1.20.2) started
      Sending discover...
      [   18.690666] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
      control off
      Sending discover...
      [   22.250972] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
      control off
      [   22.258721] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
      [   22.265458] BUG: sleeping function called from invalid context at
      kernel/locking/mutex.c:616
      [   22.273896] in_atomic(): 1, irqs_disabled(): 0, pid: 342, name: kworker/1:1
      [   22.280854] INFO: lockdep is turned off.
      [   22.284767] Preemption disabled at:[<  (null)>]   (null)
      [   22.290074]
      [   22.291568] CPU: 1 PID: 342 Comm: kworker/1:1 Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   22.300255] Hardware name: Keystone
      [   22.303750] Workqueue: ipv6_addrconf addrconf_dad_work
      [   22.308895] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   22.316643] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   22.323867] [<c05ff450>] (dump_stack) from [<c06023b0>]
      (mutex_lock_nested+0x28/0x4a8)
      [   22.331786] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>]
      (netcp_set_rx_mode+0x160/0x210)
      [   22.340394] [<c03e9840>] (netcp_set_rx_mode) from [<c04c9d18>]
      (__dev_mc_add+0x54/0x68)
      [   22.348401] [<c04c9d18>] (__dev_mc_add) from [<c05ab358>]
      (igmp6_group_added+0x168/0x1b4)
      [   22.356580] [<c05ab358>] (igmp6_group_added) from [<c05ad2cc>]
      (ipv6_dev_mc_inc+0x4f0/0x5a8)
      [   22.365019] [<c05ad2cc>] (ipv6_dev_mc_inc) from [<c058f0d0>]
      (addrconf_dad_work+0x21c/0x33c)
      [   22.373460] [<c058f0d0>] (addrconf_dad_work) from [<c0042850>]
      (process_one_work+0x214/0x8d0)
      [   22.381986] [<c0042850>] (process_one_work) from [<c0042f54>]
      (worker_thread+0x48/0x4bc)
      [   22.390071] [<c0042f54>] (worker_thread) from [<c004868c>]
      (kthread+0xf0/0x108)
      [   22.397381] [<c004868c>] (kthread) from [<c0010038>]
      
      Trace related to incorrect usage of mutex inside ndo_set_rx_mode
      
      [   24.086066] BUG: sleeping function called from invalid context at
      kernel/locking/mutex.c:616
      [   24.094506] in_atomic(): 1, irqs_disabled(): 0, pid: 1682, name: ifconfig
      [   24.101291] INFO: lockdep is turned off.
      [   24.105203] Preemption disabled at:[<  (null)>]   (null)
      [   24.110511]
      [   24.112005] CPU: 2 PID: 1682 Comm: ifconfig Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   24.120518] Hardware name: Keystone
      [   24.124018] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   24.131772] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   24.138989] [<c05ff450>] (dump_stack) from [<c06023b0>]
      (mutex_lock_nested+0x28/0x4a8)
      [   24.146908] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>]
      (netcp_set_rx_mode+0x160/0x210)
      [   24.155523] [<c03e9840>] (netcp_set_rx_mode) from [<c04c483c>]
      (dev_set_rx_mode+0x1c/0x28)
      [   24.163787] [<c04c483c>] (dev_set_rx_mode) from [<c04c490c>]
      (__dev_open+0xc4/0x10c)
      [   24.171531] [<c04c490c>] (__dev_open) from [<c04c4b74>]
      (__dev_change_flags+0x94/0x144)
      [   24.179528] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>]
      (dev_change_flags+0x18/0x48)
      [   24.187966] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>]
      (devinet_ioctl+0x6dc/0x7e4)
      [   24.196145] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>]
      (sock_ioctl+0x1d0/0x2a8)
      [   24.203803] [<c04a59ec>] (sock_ioctl) from [<c0142844>]
      (do_vfs_ioctl+0x41c/0x688)
      [   24.211373] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>]
      (SyS_ioctl+0x34/0x5c)
      [   24.218676] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>]
      (ret_fast_syscall+0x0/0x54)
      [   24.227156] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: netcp: allocate buffers to desc before re-enable interrupt · 99f8ef5d
      Karicheri, Muralidharan authored
      Currently netcp_rxpool_refill(), which refills descriptors and attaches
      buffers to the fdq, runs while the interrupt is enabled as part of NAPI
      poll. Doing it while the interrupt is disabled could be beneficial, as the
      hardware will not be starved when the CPU is busy processing interrupts.
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: netcp: check for interface handle in netcp_module_probe() · 915c5857
      Karicheri, Muralidharan authored
      Currently netcp_module_probe() doesn't check the return value of
      of_parse_phandle() that points to the interface data for the
      module and then passes the node ptr to the module, which is incorrect.
      Check the return value and free the intf_modpriv if there is an error.
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      915c5857
    • Karicheri, Muralidharan's avatar
      net: netcp: add error check to netcp_allocate_rx_buf() · e558b1fb
      Karicheri, Muralidharan authored
      Currently, if netcp_allocate_rx_buf() fails due to no descriptors
      in the rx free descriptor queue, the iterative loop that fills buffers
      inside netcp_rxpool_refill() doesn't terminate right away. So modify
      netcp_allocate_rx_buf() to return an error code and use it to break the
      loop when there is an error.
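The early-exit behaviour described above can be sketched in plain C. This is a minimal user-space model, not the driver code: `struct fdq_model`, `allocate_rx_buf()` and `rxpool_refill()` are hypothetical stand-ins, and the free descriptor queue is reduced to a counter.

```c
#include <errno.h>

/* Hypothetical stand-in for the rx free descriptor queue: just counters. */
struct fdq_model {
    int free_descs; /* descriptors still available */
    int filled;     /* buffers attached so far */
};

/* Returns 0 on success, -ENOMEM when no descriptor is available. */
static int allocate_rx_buf(struct fdq_model *q)
{
    if (q->free_descs == 0)
        return -ENOMEM;
    q->free_descs--;
    q->filled++;
    return 0;
}

/*
 * Try to attach up to `wanted` buffers and stop at the first failure.
 * Returns the number of allocation attempts that were made.
 */
static int rxpool_refill(struct fdq_model *q, int wanted)
{
    int attempts = 0;

    while (wanted-- > 0) {
        attempts++;
        if (allocate_rx_buf(q) != 0)
            break; /* no point in retrying: the queue is empty */
    }
    return attempts;
}
```

With three free descriptors and ten buffers requested, the loop in this model stops on the fourth attempt instead of running all ten iterations.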
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e558b1fb
    • Karicheri, Muralidharan's avatar
      net: netcp: move netcp_register_interface() to after attach module · 736532a0
      Karicheri, Muralidharan authored
      The netcp interface is not fully initialized before the module is attached
      to the interface. For example, the tx pipe/rx pipe is initialized
      in ethss module as part of attach(). So until this is complete, the
      interface can't be registered.  So move registration of interface to
      net device outside the current loop that attaches the modules to the
      interface.
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      736532a0
    • Karicheri, Muralidharan's avatar
      net: netcp: remove dead code from the driver · 156e3c21
      Karicheri, Muralidharan authored
      netcp_core is the first driver that will get initialized and the modules
      (ethss, pa etc) will then get initialized. So the code at the end of
      netcp_probe() that iterates over the modules is dead code, as the module
      list will always be empty. So remove this code.
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      156e3c21
    • WingMan Kwok's avatar
      net: netcp: ethss: fix error in calling sgmii api with incorrect offset · 8c85151d
      WingMan Kwok authored
      On K2HK, sgmii module registers of slave 0 and 1 are mem
      mapped to one contiguous block, while those of slave 2
      and 3 are mapped to another contiguous block.  However,
      on K2E and K2L, sgmii module registers of all slaves are
      mem mapped to one contiguous block.  SGMII APIs expect
      slave 0 sgmii base when API is invoked for slave 0 and 1,
      and slave 2 sgmii base when invoked for other slaves.
      Before this patch, slave 0 sgmii base is always passed to
      sgmii API for K2E regardless of which slave the API is invoked
      for.  This patch fixes the problem.
      Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c85151d
    • David Woodhouse's avatar
      Fix AF_PACKET ABI breakage in 4.2 · d3869efe
      David Woodhouse authored
      Commit 7d824109 ("virtio: add explicit big-endian support to memory
      accessors") accidentally changed the virtio_net header used by
      AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.
      
      Since virtio_legacy_is_little_endian() is a very long identifier,
      define a vio_le macro and use that throughout the code instead of the
      hard-coded 'false' for little-endian.
      
      This restores the ABI to match 4.1 and earlier kernels, and makes my
      test program work again.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3869efe
    • Neil Horman's avatar
      netpoll: Close race condition between poll_one_napi and napi_disable · 2d8bff12
      Neil Horman authored
      Drivers might call napi_disable while not holding the napi instance poll_lock.
      In those instances, it's possible for a race condition to exist between
      poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
      NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
      the following may happen:
      
      CPU0				CPU1
      ndo_tx_timeout			napi_poll_dev
       napi_disable			 poll_one_napi
        test_and_set_bit (ret 0)
      				  test_bit (ret 1)
         reset adapter		   napi_poll_routine
      
      If the adapter gets a tx timeout without a napi instance scheduled, it's possible
      for the adapter to think it has exclusive access to the hardware  (as the napi
      instance is now scheduled via the napi_disable call), while the netpoll code
      thinks there is simply work to do.  The result is parallel hardware access
      leading to corrupt data structures in the driver, and a crash.
      
      Additionally, there is another, more critical race between netpoll and
      napi_disable.  The disabled napi state is actually identical to the scheduled
      state for a given napi instance.  The implication being that, if a napi instance
      is disabled, a netconsole instance would see the napi state of the device as
      having been scheduled, and poll it, likely while the driver was doing something
      requiring exclusive access.  In the case above, it's fairly clear that not having
      the rings in a state ready to be polled will cause any number of crashes.
      
      The fix should be pretty easy.  netpoll uses its own bit to indicate that
      the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
      We can just gate disabling on that bit as well as the sched bit.  That should
      prevent netpoll from conducting a napi poll if we convert its set bit to a
      test_and_set_bit operation to provide mutual exclusion.
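The gating idea above can be modelled in user space with C11 atomics. This is only a sketch under assumptions: `struct napi_model`, the `STATE_*` values and every function name below are hypothetical, `atomic_fetch_or()` stands in for the kernel's test_and_set_bit(), and the disable side is written as a single non-blocking attempt, whereas a real disable path would wait until both bits can be taken.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for the sched and NPSVC bits. */
enum { STATE_SCHED = 1u << 0, STATE_NPSVC = 1u << 1 };

struct napi_model {
    atomic_uint state;
};

/* Atomically set `bit` and report whether it was already set. */
static bool test_and_set(struct napi_model *n, unsigned int bit)
{
    return (atomic_fetch_or(&n->state, bit) & bit) != 0;
}

static void clear_state_bit(struct napi_model *n, unsigned int bit)
{
    atomic_fetch_and(&n->state, ~bit);
}

/* netpoll side: claim NPSVC first and skip the poll if it is already held. */
static bool netpoll_try_poll(struct napi_model *n)
{
    if (test_and_set(n, STATE_NPSVC))
        return false; /* disabled, or another netpoll pass is running */
    /* ... the poll routine would run here ... */
    clear_state_bit(n, STATE_NPSVC);
    return true;
}

/* Disable side: succeeds only when both bits could be taken. */
static bool try_disable(struct napi_model *n)
{
    if (test_and_set(n, STATE_SCHED))
        return false;
    if (test_and_set(n, STATE_NPSVC)) {
        clear_state_bit(n, STATE_SCHED); /* back off and let the caller retry */
        return false;
    }
    return true;
}
```

Once try_disable() has succeeded in this model, netpoll_try_poll() refuses to poll, which is the mutual exclusion the message asks for.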
      
      Change notes:
      V2)
      	Remove a trailing whitespace
      	Resubmit with proper subject prefix
      
      V3)
      	Clean up spacing nits
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jmaxwell@redhat.com
      Tested-by: jmaxwell@redhat.com
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d8bff12
    • Eric Dumazet's avatar
      tcp: add proper TS val into RST packets · 675ee231
      Eric Dumazet authored
      RST packets sent on behalf of TCP connections with TS option (RFC 7323
      TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.
      
      A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
      ecr 0], length 0
      B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
      1460,nop,nop,TS val 7264344 ecr 100], length 0
      A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
      7264344], length 0
      
      B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
      ecr 110], length 0
      
      We need to call skb_mstamp_get() to get proper TS val,
      derived from skb->skb_mstamp
      
      Note that RFC 1323 was advocating to not send TS option in RST segment,
      but RFC 7323 recommends the opposite :
      
        Once TSopt has been successfully negotiated, that is both <SYN> and
        <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
        segment for the duration of the connection, and SHOULD be sent in an
        <RST> segment (see Section 5.2 for details)
      
      Note this RFC recommends to send TS val = 0, but we believe it is
      premature : We do not know if all TCP stacks are properly
      handling the receive side :
      
         When an <RST> segment is
         received, it MUST NOT be subjected to the PAWS check by verifying an
         acceptable value in SEG.TSval, and information from the Timestamps
         option MUST NOT be used to update connection state information.
         SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.
      
      In 5 years, if/when all TCP stacks are RFC 7323 ready, we might consider
      sending TS val = 0, if it buys something.
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      675ee231
    • Neil Armstrong's avatar
      net: dsa: Fix Marvell Egress Trailer check · fbd03513
      Neil Armstrong authored
      The Marvell Egress rx trailer check must be fixed to
      correctly detect bad bits in the third byte of the
      Egress trailer as described in Table 28 of the
      88E6060 datasheet.
      The current code incorrectly omits the check of the third
      byte and checks the fourth byte twice.
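A small sketch of the difference between the two checks follows. The function names are hypothetical and the mask values are illustrative assumptions, not taken from the datasheet; the only point is that the third byte (index 2) must be tested instead of testing the fourth byte (index 3) twice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask values below are illustrative assumptions. */

/* Old shape of the check: index 3 is tested twice, index 2 never. */
static bool trailer_ok_buggy(const uint8_t t[4])
{
    return !(t[0] != 0x80 || (t[1] & 0xf8) != 0x00 ||
             (t[3] & 0xef) != 0x00 || t[3] != 0x00);
}

/* Corrected shape: each of the four trailer bytes is examined once. */
static bool trailer_ok_fixed(const uint8_t t[4])
{
    return !(t[0] != 0x80 || (t[1] & 0xf8) != 0x00 ||
             (t[2] & 0xef) != 0x00 || t[3] != 0x00);
}
```

A trailer with a stray bit in its third byte passes the first variant and is rejected by the second.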
      Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
      Acked-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fbd03513
    • Dmitriy Vyukov's avatar
      lib: fix data race in rhashtable_rehash_one · 7def0f95
      Dmitriy Vyukov authored
      rhashtable_rehash_one() uses complex logic to update entry->next field,
      after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
      
      entry->next = 1 | ((base + off) << 1)
      
      This can be compiled along the lines of:
      
      entry->next = base + off
      entry->next <<= 1
      entry->next |= 1
      
      Which will break concurrent readers.
      
      NULLS value recomputation is not needed here, so just remove
      the complex logic.
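To make the hazard concrete, here is a user-space model of the two store sequences. It is not the rhashtable code and not what the patch does (the patch removes the recomputation); `struct store_log` and the helper names are hypothetical. The log records every value that lands in the shared slot, i.e. every value a concurrent reader could observe.

```c
#include <assert.h>
#include <stdint.h>

/* Records each value written to the shared slot, in order. */
struct store_log {
    uintptr_t seen[4];
    int n;
};

static void store(struct store_log *log, uintptr_t *slot, uintptr_t v)
{
    assert(log->n < 4);
    *slot = v;
    log->seen[log->n++] = v;
}

/* What a compiler may legally emit for a plain assignment of the expression. */
static void publish_stepwise(struct store_log *log, uintptr_t *slot,
                             uintptr_t base, uintptr_t off)
{
    store(log, slot, base + off);
    store(log, slot, *slot << 1);
    store(log, slot, *slot | 1);
}

/* Compute in a local first, then write the slot exactly once. */
static void publish_once(struct store_log *log, uintptr_t *slot,
                         uintptr_t base, uintptr_t off)
{
    uintptr_t v = 1 | ((base + off) << 1);

    store(log, slot, v);
}
```

Both variants end with the same value in the slot, but the stepwise one exposes two intermediate values on the way. In kernel code a single-store guarantee would additionally need an annotated store such as WRITE_ONCE(); the local temporary alone does not force the compiler to emit one store.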
      
      The data race was found with KernelThreadSanitizer (KTSAN).
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7def0f95
    • Tobias Klauser's avatar
      ch9200: Convert to use module_usb_driver · 23eedbc2
      Tobias Klauser authored
      Converts the ch9200 driver to use the module_usb_driver() macro which
      makes the code smaller and a bit simpler.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Acked-by: Matthew Garrett <mjg59@srcf.ucam.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23eedbc2
    • Jesse Gross's avatar
      openvswitch: Zero flows on allocation. · ae5f2fb1
      Jesse Gross authored
      When support for megaflows was introduced, OVS needed to start
      installing flows with a mask applied to them. Since masking is an
      expensive operation, OVS also had an optimization that would only
      take the parts of the flow keys that were covered by a non-zero
      mask. The values stored in the remaining pieces should not matter
      because they are masked out.
      
      While this works fine for the purposes of matching (which must always
      look at the mask), serialization to netlink can be problematic. Since
      the flow and the mask are serialized separately, the uninitialized
      portions of the flow can be encoded with whatever values happen to be
      present.
      
      In terms of functionality, this has little effect since these fields
      will be masked out by definition. However, it leaks kernel memory to
      userspace, which is a potential security vulnerability. It is also
      possible that other code paths could look at the masked key and get
      uninitialized data, although this does not currently appear to be an
      issue in practice.
      
      This removes the mask optimization for flows that are being installed.
      This was always intended to be the case as the mask optimizations were
      really targeting per-packet flow operations.
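The leak can be illustrated with a toy key of eight bytes. This is a sketch under assumptions, not the OVS implementation: `KEY_LEN`, `mask_range()` and `mask_full()` are hypothetical, and 0xAA bytes in the usage stand in for stale memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN 8

/*
 * Optimised variant: only bytes in [start, end) are written, so the rest
 * of dst keeps whatever it contained before.
 */
static void mask_range(uint8_t *dst, const uint8_t *src, const uint8_t *mask,
                       size_t start, size_t end)
{
    assert(start <= end && end <= KEY_LEN);
    for (size_t i = start; i < end; i++)
        dst[i] = src[i] & mask[i];
}

/* Full variant: every byte of dst is defined after the call. */
static void mask_full(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
    mask_range(dst, src, mask, 0, KEY_LEN);
}
```

If dst is first filled with 0xAA and only bytes 2..3 are covered by the mask, mask_range() leaves 0xAA in the uncovered bytes, and those bytes would be serialized as-is. mask_full() yields zeroes there because the mask itself is zero.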
      
      Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae5f2fb1
    • Russell King's avatar
      net: dsa: actually force the speed on the CPU port · 53adc9e8
      Russell King authored
      Commit 54d792f2 ("net: dsa: Centralise global and port setup
      code into mv88e6xxx.") merged in the 4.2 merge window broke the link
      speed forcing for the CPU port of Marvell DSA switches.  The original
      code was:
      
              /* MAC Forcing register: don't force link, speed, duplex
               * or flow control state to any particular values on physical
               * ports, but force the CPU port and all DSA ports to 1000 Mb/s
               * full duplex.
               */
              if (dsa_is_cpu_port(ds, p) || ds->dsa_port_mask & (1 << p))
                      REG_WRITE(addr, 0x01, 0x003e);
              else
                      REG_WRITE(addr, 0x01, 0x0003);
      
      but the new code does a read-modify-write:
      
                      reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_PCS_CTRL);
                      if (dsa_is_cpu_port(ds, port) ||
                          ds->dsa_port_mask & (1 << port)) {
                              reg |= PORT_PCS_CTRL_FORCE_LINK |
                                      PORT_PCS_CTRL_LINK_UP |
                                      PORT_PCS_CTRL_DUPLEX_FULL |
                                      PORT_PCS_CTRL_FORCE_DUPLEX;
                              if (mv88e6xxx_6065_family(ds))
                                      reg |= PORT_PCS_CTRL_100;
                              else
                                      reg |= PORT_PCS_CTRL_1000;
      
      The link speed in the PCS control register is a two bit field.  Forcing
      the link speed in this way doesn't ensure that the bit field is set to
      the correct value - on the hardware I have here, the speed bitfield
      remains set to 0x03, resulting in the speed not being forced to gigabit.
      
      We must clear both bits before forcing the link speed.
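The arithmetic behind that statement, as a minimal sketch: the `SPEED_*` encodings and function names below are hypothetical, chosen only so that the two-bit field can hold 0x03 as described above.

```c
#include <stdint.h>

/* Hypothetical encodings for a two-bit speed field in bits 1:0. */
#define SPEED_MASK 0x0003u
#define SPEED_100  0x0001u
#define SPEED_1000 0x0002u

/* OR-only update: bits already set in the field survive. */
static uint16_t force_speed_or_only(uint16_t reg, uint16_t speed)
{
    return (uint16_t)(reg | speed);
}

/* Clear the whole field first, then set the requested speed. */
static uint16_t force_speed_cleared(uint16_t reg, uint16_t speed)
{
    return (uint16_t)((reg & ~SPEED_MASK) | speed);
}
```

Starting from a field value of 0x03, OR-ing in SPEED_1000 leaves 0x03 in place, while the clear-then-set form produces 0x02 and keeps the bits outside the field untouched.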
      
      Fixes: 54d792f2 ("net: dsa: Centralise global and port setup code into mv88e6xxx.")
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Acked-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53adc9e8
  3. 22 Sep, 2015 1 commit
  4. 21 Sep, 2015 8 commits