1. 03 May, 2012 7 commits
    • tcp: change tcp_adv_win_scale and tcp_rmem[2] · b49960a0
      Eric Dumazet authored
      tcp_adv_win_scale default value is 2, meaning we expect a good citizen
      skb to have skb->len / skb->truesize ratio of 75% (3/4)
      
      In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
      1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
      So these skbs were considered as not bloated.
      
      With recent truesize fixes, a typical MSS=1460 frame truesize is now the
      more precise :
      2048 + 256 = 2304. But 2304 * 3/4 = 1728.
      So these skbs are not good citizens anymore, because 1460 < 1728
      
      (GRO can escape this problem because it builds skbs with a too low
      truesize.)
      
      This also means tcp advertises a too optimistic window for a given
      allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
      sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
      especially when application is slow to drain its receive queue or in
      case of losses (netperf is fast, scp is slow). This is a major latency
      source.
      
      We should adjust the len/truesize ratio to 50% instead of 75%
      
      This patch :
      
      1) changes tcp_adv_win_scale default to 1 instead of 2
      
      2) increases tcp_rmem[2] limit from 4MB to 6MB to take into account
      better truesize tracking and to allow autotuning tcp receive window to
      reach the same value as before. Note that the same amount of kernel memory is
      consumed compared to 2.6 kernels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
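The arithmetic in the tcp_adv_win_scale commit message can be checked with a small standalone sketch. It follows the shape of the kernel's tcp_win_from_space() helper but takes the scale as a parameter; the names `win_from_space` and `is_good_citizen` are illustrative assumptions, not kernel code.

```c
#include <assert.h>

/*
 * Sketch of how tcp_adv_win_scale turns buffer space into window space.
 * scale > 0:  space / 2^scale is reserved for overhead, the rest is window.
 * scale <= 0: the window is space / 2^(-scale).
 * Hypothetical userspace helper, not the kernel function itself.
 */
static int win_from_space(int space, int scale)
{
	assert(space >= 0);
	return scale <= 0 ? space >> -scale : space - (space >> scale);
}

/*
 * An skb counts as a "good citizen" when its payload length covers the
 * window share of its truesize.
 */
static int is_good_citizen(int len, int truesize, int scale)
{
	return len >= win_from_space(truesize, scale);
}
```

With scale 2, a truesize of 1856 gives 1392 and a truesize of 2304 gives 1728, matching the figures in the message. With scale 1, a 6MB tcp_rmem[2] yields the same 3MB window space that 4MB yielded with scale 2.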
    • net: l2tp: unlock socket lock before returning from l2tp_ip_sendmsg · 84768edb
      Sasha Levin authored
      l2tp_ip_sendmsg could return without releasing socket lock, making it all the
      way to userspace, and generating the following warning:
      
      [  130.891594] ================================================
      [  130.894569] [ BUG: lock held when returning to user space! ]
      [  130.897257] 3.4.0-rc5-next-20120501-sasha #104 Tainted: G        W
      [  130.900336] ------------------------------------------------
      [  130.902996] trinity/8384 is leaving the kernel with locks still held!
      [  130.906106] 1 lock held by trinity/8384:
      [  130.907924]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff82b9503f>] l2tp_ip_sendmsg+0x2f/0x550
      
      Introduced by commit 2f16270f ("l2tp: Fix locking in l2tp_ip.c").
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
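The class of bug fixed in l2tp_ip_sendmsg (an early return that skips the unlock) is commonly avoided by routing every exit through a single label. The sketch below is a userspace illustration under that assumption; `fake_sock`, `fake_lock`, `fake_release` and `fake_sendmsg` are hypothetical names and do not reproduce the actual patch.

```c
#include <errno.h>

/* Hypothetical stand-in for a socket with a lock; not the kernel's struct sock. */
struct fake_sock {
	int locked;     /* 1 while the lock is held */
	int connected;  /* 0 means the send must fail */
};

static void fake_lock(struct fake_sock *sk)    { sk->locked = 1; }
static void fake_release(struct fake_sock *sk) { sk->locked = 0; }

/* Every exit taken after fake_lock() goes through the "out" label. */
static int fake_sendmsg(struct fake_sock *sk, int len)
{
	int rc;

	fake_lock(sk);

	if (!sk->connected) {
		/* A bare "return -ENOTCONN;" here would leave the lock held. */
		rc = -ENOTCONN;
		goto out;
	}
	if (len <= 0) {
		rc = -EINVAL;
		goto out;
	}
	rc = len;  /* pretend the whole message was sent */
out:
	fake_release(sk);
	return rc;
}
```

Keeping the unlock in one place means a newly added error path cannot return to the caller with the lock still held, as long as it jumps to the label instead of returning directly.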
    • drop_monitor: prevent init path from scheduling on the wrong cpu · 4fdcfa12
      Neil Horman authored
      I just noticed after some recent updates that the init path for the drop
      monitor protocol has a minor error.  drop monitor maintains a per cpu structure
      that gets initialized from a single cpu.  Normally this is fine, as the protocol
      isn't in use yet, but I recently made a change that causes a failed skb
      allocation to reschedule itself.  Given the current code, the implication is
      that this workqueue reschedule will take place on the wrong cpu.  If drop
      monitor is used early during the boot process, it's possible that two cpus will
      access a single per-cpu structure in parallel, possibly leading to data
      corruption.
      
      This patch fixes the situation by storing the cpu number that a given instance
      of this per-cpu data should be accessed from.  In the case of a need for a
      reschedule, the cpu stored in the struct is assigned the reschedule, rather than
      the currently executing cpu.
      
      Tested successfully by myself.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • usbnet: fix failure handling in usbnet_probe · a4723848
      tom.leiming@gmail.com authored
      If register_netdev returns failure, the dev->interrupt and
      its transfer buffer should be released, so just fix it.
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • usbnet: fix leak of transfer buffer of dev->interrupt · 720f3d7c
      tom.leiming@gmail.com authored
      The transfer buffer of dev->interrupt is allocated in .probe path,
      but not freed in .disconnect path, so mark the interrupt URB as
      URB_FREE_BUFFER to free the buffer when the URB is destroyed.
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Acked-by: Oliver Neukum <oneukum@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ucc_geth: Add 16 bytes to max TX frame for VLANs · 70f8002d
      Joakim Tjernlund authored
      Creating a VLAN interface on top of ucc_geth adds 4 bytes
      to the frame and the HW controller is not prepared to
      TX a frame bigger than 1518 bytes which is 4 bytes too
      small for a full VLAN frame. Add 16 bytes which will handle
      a simple VLAN and leave 12 bytes for future expansion.
      Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ucc_geth, increase no. of HW RX descriptors · 5bbdc057
      Joakim Tjernlund authored
      In a busy network we see ucc_geth dropping RX packets every now
      and then. Increase the RX queue's HW descriptors from
      16 to 32 to deal with this.
      Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 01 May, 2012 17 commits
  3. 30 Apr, 2012 9 commits
    • mac80211: fix AP mode EAP tx for VLAN stations · 66f2c99a
      Felix Fietkau authored
      EAP frames for stations in an AP VLAN are sent on the main AP interface
      to avoid race conditions wrt. moving stations.
      For that to work properly, sta_info_get_bss must be used instead of
      sta_info_get when sending EAP packets.
      Previously this was only done for cooked monitor injected packets, so
      this patch adds a check for tx->skb->protocol to the same place.
      Signed-off-by: Felix Fietkau <nbd@openwrt.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
    • tcp: fix infinite cwnd in tcp_complete_cwr() · 1cebce36
      Yuchung Cheng authored
      When the cwnd reduction is done, ssthresh may be infinite
      if TCP enters CWR via ECN or F-RTO. If cwnd is not undone, i.e.,
      undo_marker is set, tcp_complete_cwr() falsely sets cwnd to the
      infinite ssthresh value. The correct operation is to keep cwnd
      intact because it has been updated in ECN or F-RTO.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
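One way to express "keep cwnd intact when ssthresh is infinite" is to let the completion step only ever lower cwnd. The sketch below illustrates that idea under stated assumptions: `fake_tcp`, `FAKE_INFINITE_SSTHRESH` and `complete_cwr_sketch` are hypothetical names, and the logic is a simplification rather than the patch itself.

```c
/* Assumption: stand-in for the "infinite" ssthresh value mentioned above. */
#define FAKE_INFINITE_SSTHRESH 0x7fffffffU

/* Hypothetical, reduced view of the per-connection congestion state. */
struct fake_tcp {
	unsigned int snd_cwnd;
	unsigned int snd_ssthresh;
	int undo_marker;  /* non-zero: the cwnd reduction was not undone */
};

/*
 * On completing the reduction, clamp cwnd down to ssthresh but never raise
 * it. An infinite ssthresh therefore leaves cwnd untouched.
 */
static void complete_cwr_sketch(struct fake_tcp *tp)
{
	if (tp->undo_marker && tp->snd_ssthresh < tp->snd_cwnd)
		tp->snd_cwnd = tp->snd_ssthresh;
}
```

Assigning ssthresh to cwnd unconditionally is what would produce the infinite cwnd described in the message; taking the minimum avoids it.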
    • bpf jit: Let the powerpc jit handle negative offsets · 05be1824
      Jan Seiffert authored
      Now that the helper function from filter.c for negative offsets is exported,
      it can be used in the jit to handle negative offsets.
      
      First modify the asm load helper functions to handle:
      - known positive offsets
      - known negative offsets
      - any offset
      
      then the compiler can be modified to explicitly use these helpers
      when appropriate.
      
      This fixes the case of a negative X register and allows lifting
      the restriction that bpf programs with negative offsets can't
      be jited.
      Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Jan Seiffert <kaffeemonster@googlemail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix sk_sockets_allocated_read_positive · 518fbf9c
      Eric Dumazet authored
      Denys Fedoryshchenko reported frequent crashes on a proxy server and kindly
      provided a lockdep report that explains it all :
      
        [  762.903868]
        [  762.903880] =================================
        [  762.903890] [ INFO: inconsistent lock state ]
        [  762.903903] 3.3.4-build-0061 #8 Not tainted
        [  762.904133] ---------------------------------
        [  762.904344] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        [  762.904542] squid/1603 [HC0[0]:SC0[0]:HE1:SE1] takes:
        [  762.904542]  (key#3){+.?...}, at: [<c0232cc4>] __percpu_counter_sum+0xd/0x58
        [  762.904542] {IN-SOFTIRQ-W} state was registered at:
        [  762.904542]   [<c0158b84>] __lock_acquire+0x284/0xc26
        [  762.904542]   [<c01598e8>] lock_acquire+0x71/0x85
        [  762.904542]   [<c0349765>] _raw_spin_lock+0x33/0x40
        [  762.904542]   [<c0232c93>] __percpu_counter_add+0x58/0x7c
        [  762.904542]   [<c02cfde1>] sk_clone_lock+0x1e5/0x200
        [  762.904542]   [<c0303ee4>] inet_csk_clone_lock+0xe/0x78
        [  762.904542]   [<c0315778>] tcp_create_openreq_child+0x1b/0x404
        [  762.904542]   [<c031339c>] tcp_v4_syn_recv_sock+0x32/0x1c1
        [  762.904542]   [<c031615a>] tcp_check_req+0x1fd/0x2d7
        [  762.904542]   [<c0313f77>] tcp_v4_do_rcv+0xab/0x194
        [  762.904542]   [<c03153bb>] tcp_v4_rcv+0x3b3/0x5cc
        [  762.904542]   [<c02fc0c4>] ip_local_deliver_finish+0x13a/0x1e9
        [  762.904542]   [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d
        [  762.904542]   [<c02fc652>] ip_local_deliver+0x41/0x45
        [  762.904542]   [<c02fc4d1>] ip_rcv_finish+0x31a/0x33c
        [  762.904542]   [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d
        [  762.904542]   [<c02fc857>] ip_rcv+0x201/0x23e
        [  762.904542]   [<c02daa3a>] __netif_receive_skb+0x319/0x368
        [  762.904542]   [<c02dac07>] netif_receive_skb+0x4e/0x7d
        [  762.904542]   [<c02dacf6>] napi_skb_finish+0x1e/0x34
        [  762.904542]   [<c02db122>] napi_gro_receive+0x20/0x24
        [  762.904542]   [<f85d1743>] e1000_receive_skb+0x3f/0x45 [e1000e]
        [  762.904542]   [<f85d3464>] e1000_clean_rx_irq+0x1f9/0x284 [e1000e]
        [  762.904542]   [<f85d3926>] e1000_clean+0x62/0x1f4 [e1000e]
        [  762.904542]   [<c02db228>] net_rx_action+0x90/0x160
        [  762.904542]   [<c012a445>] __do_softirq+0x7b/0x118
        [  762.904542] irq event stamp: 156915469
        [  762.904542] hardirqs last  enabled at (156915469): [<c019b4f4>] __slab_alloc.clone.58.clone.63+0xc4/0x2de
        [  762.904542] hardirqs last disabled at (156915468): [<c019b452>] __slab_alloc.clone.58.clone.63+0x22/0x2de
        [  762.904542] softirqs last  enabled at (156915466): [<c02ce677>] lock_sock_nested+0x64/0x6c
        [  762.904542] softirqs last disabled at (156915464): [<c0349914>] _raw_spin_lock_bh+0xe/0x45
        [  762.904542]
        [  762.904542] other info that might help us debug this:
        [  762.904542]  Possible unsafe locking scenario:
        [  762.904542]
        [  762.904542]        CPU0
        [  762.904542]        ----
        [  762.904542]   lock(key#3);
        [  762.904542]   <Interrupt>
        [  762.904542]     lock(key#3);
        [  762.904542]
        [  762.904542]  *** DEADLOCK ***
        [  762.904542]
        [  762.904542] 1 lock held by squid/1603:
        [  762.904542]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [<c03055c0>] lock_sock+0xa/0xc
        [  762.904542]
        [  762.904542] stack backtrace:
        [  762.904542] Pid: 1603, comm: squid Not tainted 3.3.4-build-0061 #8
        [  762.904542] Call Trace:
        [  762.904542]  [<c0347b73>] ? printk+0x18/0x1d
        [  762.904542]  [<c015873a>] valid_state+0x1f6/0x201
        [  762.904542]  [<c0158816>] mark_lock+0xd1/0x1bb
        [  762.904542]  [<c015876b>] ? mark_lock+0x26/0x1bb
        [  762.904542]  [<c015805d>] ? check_usage_forwards+0x77/0x77
        [  762.904542]  [<c0158bf8>] __lock_acquire+0x2f8/0xc26
        [  762.904542]  [<c0159b8e>] ? mark_held_locks+0x5d/0x7b
        [  762.904542]  [<c0159cf6>] ? trace_hardirqs_on+0xb/0xd
        [  762.904542]  [<c0158dd4>] ? __lock_acquire+0x4d4/0xc26
        [  762.904542]  [<c01598e8>] lock_acquire+0x71/0x85
        [  762.904542]  [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c0349765>] _raw_spin_lock+0x33/0x40
        [  762.904542]  [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c0232cc4>] __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c02cebc4>] __sk_mem_schedule+0xdd/0x1c7
        [  762.904542]  [<c02d178d>] ? __alloc_skb+0x76/0x100
        [  762.904542]  [<c0305e8e>] sk_wmem_schedule+0x21/0x2d
        [  762.904542]  [<c0306370>] sk_stream_alloc_skb+0x42/0xaa
        [  762.904542]  [<c0306567>] tcp_sendmsg+0x18f/0x68b
        [  762.904542]  [<c031f3dc>] ? ip_fast_csum+0x30/0x30
        [  762.904542]  [<c0320193>] inet_sendmsg+0x53/0x5a
        [  762.904542]  [<c02cb633>] sock_aio_write+0xd2/0xda
        [  762.904542]  [<c015876b>] ? mark_lock+0x26/0x1bb
        [  762.904542]  [<c01a1017>] do_sync_write+0x9f/0xd9
        [  762.904542]  [<c01a2111>] ? file_free_rcu+0x2f/0x2f
        [  762.904542]  [<c01a17a1>] vfs_write+0x8f/0xab
        [  762.904542]  [<c01a284d>] ? fget_light+0x75/0x7c
        [  762.904542]  [<c01a1900>] sys_write+0x3d/0x5e
        [  762.904542]  [<c0349ec9>] syscall_call+0x7/0xb
        [  762.904542]  [<c0340000>] ? rp_sidt+0x41/0x83
      
      Bug is that sk_sockets_allocated_read_positive() calls
      percpu_counter_sum_positive() without BH being disabled.
      
      This bug was added in commit 180d8cd9
      (foundations of per-cgroup memory pressure controlling.), since previous
      code was using percpu_counter_read_positive() which is IRQ safe.
      
      In __sk_mem_schedule() we don't need the precise count of allocated
      sockets and can revert to previous behavior.
      Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
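The difference between the approximate and the precise per-cpu counter reads mentioned in this commit can be illustrated with a simplified userspace model. The `fake_*` names and the fixed CPU count are assumptions for illustration; the real percpu_counter also involves locking and batching that this sketch leaves out.

```c
#define NR_FAKE_CPUS 4

/* Hypothetical, simplified model of a per-cpu counter. */
struct fake_percpu_counter {
	long count;                 /* global value, only approximately current */
	long deltas[NR_FAKE_CPUS];  /* per-cpu contributions not yet folded in */
};

/* Cheap approximate read: looks only at the global count, takes no lock. */
static long fake_read_positive(const struct fake_percpu_counter *c)
{
	return c->count > 0 ? c->count : 0;
}

/*
 * Precise read: walks every cpu's delta. The kernel counterpart takes the
 * counter's spinlock while doing so, which is why the calling context matters.
 */
static long fake_sum_positive(const struct fake_percpu_counter *c)
{
	long sum = c->count;
	int i;

	for (i = 0; i < NR_FAKE_CPUS; i++)
		sum += c->deltas[i];
	return sum > 0 ? sum : 0;
}
```

The approximate read can lag behind by whatever has accumulated in the per-cpu deltas, which is acceptable when, as the message notes, the precise count is not needed.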
    • David S. Miller's avatar
      5414fc12
    • netfilter: xt_CT: fix wrong checking in the timeout assignment path · 6cf51852
      Pablo Neira Ayuso authored
      The current checking always succeeded. We have to check the first
      character of the string to check that it's empty, thus, skipping
      the timeout path.
      
      This fixes the use of the CT target without the timeout option.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
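A check that "always succeeded" is typical of testing a fixed-size character array member for non-NULL instead of testing its first character. The sketch below assumes that is the situation here; `fake_ct_info` and `has_timeout` are hypothetical names and the 32-byte size is an illustrative choice.

```c
/* Hypothetical stand-in for a target-info struct carrying a timeout name. */
struct fake_ct_info {
	char timeout[32];  /* empty string means "no timeout option given" */
};

/*
 * Writing "if (info->timeout)" would test the address of the array, which is
 * never NULL, so the branch would always be taken. Testing the first
 * character is what distinguishes an empty name from a set one.
 */
static int has_timeout(const struct fake_ct_info *info)
{
	return info->timeout[0] != '\0';
}
```

With this form, a rule that omits the timeout option leaves the string empty and the timeout path is skipped.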
    • ipvs: kernel oops - do_ip_vs_get_ctl · 8537de8a
      Hans Schillstrom authored
      Change order of init so netns init is ready
      when registering ioctl and netlink.
      
      Ver2
      	Whitespace fixes and __init added.
      Reported-by: "Ryan O'Hara" <rohara@redhat.com>
      Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: take care of return value from protocol init_netns · 582b8e3e
      Hans Schillstrom authored
      ip_vs_create_timeout_table() can return NULL.
      All protocol init_netns functions are affected by this patch.
      Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: null check of net->ipvs in lblc(r) schedulers · 4b984cd5
      Hans Schillstrom authored
      Avoid crash when registering schedulers after
      the IPVS core initialization for netns fails. Do this by
      checking that the core (net->ipvs) is present.
      Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  4. 28 Apr, 2012 2 commits
    • drop_monitor: Make updating data->skb smp safe · 3885ca78
      Neil Horman authored
      Eric Dumazet pointed out to me that the drop_monitor protocol has some holes in
      its smp protections.  Specifically, it's possible to replace data->skb while it's
      being written.  This patch corrects that by making data->skb an rcu protected
      variable.  That will prevent it from being overwritten while a tracepoint is
      modifying it.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Miller <davem@davemloft.net>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor: fix sleeping in invalid context warning · cde2e9a6
      Neil Horman authored
      Eric Dumazet pointed out this warning in the drop_monitor protocol to me:
      
      [   38.352571] BUG: sleeping function called from invalid context at kernel/mutex.c:85
      [   38.352576] in_atomic(): 1, irqs_disabled(): 0, pid: 4415, name: dropwatch
      [   38.352580] Pid: 4415, comm: dropwatch Not tainted 3.4.0-rc2+ #71
      [   38.352582] Call Trace:
      [   38.352592]  [<ffffffff8153aaf0>] ? trace_napi_poll_hit+0xd0/0xd0
      [   38.352599]  [<ffffffff81063f2a>] __might_sleep+0xca/0xf0
      [   38.352606]  [<ffffffff81655b16>] mutex_lock+0x26/0x50
      [   38.352610]  [<ffffffff8153aaf0>] ? trace_napi_poll_hit+0xd0/0xd0
      [   38.352616]  [<ffffffff810b72d9>] tracepoint_probe_register+0x29/0x90
      [   38.352621]  [<ffffffff8153a585>] set_all_monitor_traces+0x105/0x170
      [   38.352625]  [<ffffffff8153a8ca>] net_dm_cmd_trace+0x2a/0x40
      [   38.352630]  [<ffffffff8154a81a>] genl_rcv_msg+0x21a/0x2b0
      [   38.352636]  [<ffffffff810f8029>] ? zone_statistics+0x99/0xc0
      [   38.352640]  [<ffffffff8154a600>] ? genl_rcv+0x30/0x30
      [   38.352645]  [<ffffffff8154a059>] netlink_rcv_skb+0xa9/0xd0
      [   38.352649]  [<ffffffff8154a5f0>] genl_rcv+0x20/0x30
      [   38.352653]  [<ffffffff81549a7e>] netlink_unicast+0x1ae/0x1f0
      [   38.352658]  [<ffffffff81549d76>] netlink_sendmsg+0x2b6/0x310
      [   38.352663]  [<ffffffff8150824f>] sock_sendmsg+0x10f/0x130
      [   38.352668]  [<ffffffff8150abe0>] ? move_addr_to_kernel+0x60/0xb0
      [   38.352673]  [<ffffffff81515f04>] ? verify_iovec+0x64/0xe0
      [   38.352677]  [<ffffffff81509c46>] __sys_sendmsg+0x386/0x390
      [   38.352682]  [<ffffffff810ffaf9>] ? handle_mm_fault+0x139/0x210
      [   38.352687]  [<ffffffff8165b5bc>] ? do_page_fault+0x1ec/0x4f0
      [   38.352693]  [<ffffffff8106ba4d>] ? set_next_entity+0x9d/0xb0
      [   38.352699]  [<ffffffff81310b49>] ? tty_ldisc_deref+0x9/0x10
      [   38.352703]  [<ffffffff8106d363>] ? pick_next_task_fair+0x63/0x140
      [   38.352708]  [<ffffffff8150b8d4>] sys_sendmsg+0x44/0x80
      [   38.352713]  [<ffffffff8165f8e2>] system_call_fastpath+0x16/0x1b
      
      It stems from holding a spinlock (trace_state_lock) while attempting to register
      or unregister tracepoint hooks, making in_atomic() true in this context, leading
      to the warning when the tracepoint calls might_sleep() while it's taking a mutex.
      Since we only use the trace_state_lock to prevent trace protocol state races, as
      well as hardware stat list updates on an rcu write side, we can just convert the
      spinlock to a mutex to avoid this problem.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Miller <davem@davemloft.net>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 27 Apr, 2012 5 commits