1. 17 Jan, 2014 36 commits
    • virtio-net: auto-tune mergeable rx buffer size for improved performance · ab7db917
      Michael Dalton authored
      Commit 2613af0e ("virtio_net: migrate mergeable rx buffers to page frag
      allocators") changed the mergeable receive buffer size from PAGE_SIZE to
      MTU-size, introducing a single-stream regression for benchmarks with large
      average packet size. There is no single optimal buffer size for all
      workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
      header-sized buffers are preferred as larger buffers reduce the TCP window
      due to SKB truesize. However, single-stream workloads with large average
      packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
      are used.
      
      This commit auto-tunes the mergeable receive buffer packet size by
      choosing the packet buffer size based on an EWMA of the recent packet
      sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
      virtio-net header len to PAGE_SIZE. This improves throughput for
      large packet workloads, as any workload with average packet size >=
      PAGE_SIZE will use PAGE_SIZE buffers.
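
      The idea can be pictured with a short sketch (illustrative only; the
      names ewma_pkt_len, VNET_HDR_LEN and MTU_LEN below are placeholders,
      not necessarily the driver's own identifiers):

        /* Keep a per-receive-queue EWMA of observed packet lengths. */
        static unsigned int ewma_update(unsigned int avg, unsigned int sample)
        {
                /* example decay weight of 1/64 */
                return avg - (avg >> 6) + (sample >> 6);
        }

        static unsigned int mergeable_buf_len(unsigned int ewma_pkt_len)
        {
                unsigned int len = ewma_pkt_len + VNET_HDR_LEN;

                /* Clamp between MTU + header size and PAGE_SIZE. */
                return clamp_t(unsigned int, len,
                               MTU_LEN + VNET_HDR_LEN, PAGE_SIZE);
        }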
      
      These optimizations interact positively with recent commit
      ba275241 ("virtio-net: coalesce rx frags when possible during rx"),
      which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
      optimizations benefit buffers of any size.
      
      Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
      between two QEMU VMs on a single physical machine. Each VM has two VCPUs
      with all offloads & vhost enabled. All VMs and vhost threads run in a
      single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
      in the system will not be scheduled on the benchmark CPUs. Trunk includes
      SKB rx frag coalescing.
      
      net-next w/ virtio_net before 2613af0e (PAGE_SIZE bufs): 14642.85Gb/s
      net-next (MTU-size bufs):  13170.01Gb/s
      net-next + auto-tune: 14555.94Gb/s
      
      Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
      using MTU-sized buffers to about 26Gb/s using auto-tuning.
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: use per-receive queue page frag alloc for mergeable bufs · fb51879d
      Michael Dalton authored
      The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
      mergeable rx buffer allocations. This commit migrates virtio-net to use
      per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
      mergeable rx buffer memory allocation, which now will use skb_page_frag_refill()
      for both atomic and GFP-WAIT buffer allocations.
      
      To address fragmentation concerns, if after buffer allocation there
      is too little space left in the page frag to allocate a subsequent
      buffer, the remaining space is added to the currently allocated buffer
      so that it can be used to store packet data.
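
      A minimal sketch of that logic (hypothetical field and macro names,
      not the exact driver code):

        /* If the tail of the page frag is too small for another buffer,
         * give it to the buffer we just carved out instead of wasting it.
         */
        hole = alloc_frag->size - alloc_frag->offset;
        if (hole < MIN_MERGEABLE_BUF_LEN) {
                len += hole;
                alloc_frag->offset += hole;
        }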
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: allow > 0 order atomic page alloc in skb_page_frag_refill · 097b4f19
      Michael Dalton authored
      skb_page_frag_refill currently permits only order-0 page allocs
      unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
      higher-order page allocations whether or not GFP_WAIT is used. If
      memory cannot be allocated, the allocator will fall back to
      successively smaller page allocs (down to order-0 page allocs).
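
      Roughly, the allocation now looks like this (a simplified sketch under
      the assumption of a generic starting order MAX_TRY_ORDER; gfp flag
      details are omitted):

        /* Try a high-order page first, then fall back to smaller orders,
         * ending with an order-0 allocation.
         */
        for (order = MAX_TRY_ORDER; order >= 0; order--) {
                page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN |
                                   __GFP_NORETRY, order);
                if (page) {
                        pfrag->page = page;
                        pfrag->size = PAGE_SIZE << order;
                        pfrag->offset = 0;
                        return true;
                }
        }
        return false;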
      
      This change brings skb_page_frag_refill in line with the existing
      page allocation strategy employed by netdev_alloc_frag, which attempts
      higher-order page allocations whether or not GFP_WAIT is set, falling
      back to successively lower-order page allocations on failure. Part
      of migration of virtio-net to per-receive queue page frag allocators.
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fix error return code in fw_change_attrs() · 722e47d7
      Wei Yongjun authored
      The error code was not set if changing the indev fails, so the error
      condition wasn't reflected in the return value. Fix to return a
      negative error code from this error handling case instead of 0.
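
      The fix follows the usual pattern (an illustrative sketch, not the
      exact diff; the call shape of tcf_change_indev() is approximated):

        ret = tcf_change_indev(net, tb[TCA_FW_INDEV]);
        if (ret < 0) {
                err = ret;              /* propagate the failure ... */
                goto errout;            /* ... instead of returning 0 */
        }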
      
      Fixes: 2519a602 ('net_sched: optimize tcf_match_indev()')
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tipc' · 8b88a11e
      David S. Miller authored
      Ying Xue says:
      
      ====================
      tipc: align TIPC behaviours of waiting for events with other stacks
      
      Comparing the current implementations of waiting for events in TIPC
      socket layer with other stacks, TIPC's behaviour is very different
      because wait_event_interruptible_timeout()/wait_event_interruptible()
      are always used by TIPC to wait for events while relevant socket or
      port variables are fed to them as their arguments. As socket lock has
      to be released temporarily before the two routines of waiting for
      events are called, their arguments associated with socket or port
      structures are out of socket lock protection. This might cause
      serious issues where a process calling a socket syscall such as
      sendmsg(), connect(), accept() or recvmsg() cannot be woken up at
      all even when the proper event arrives, or is woken up improperly
      although the condition for waking the process is not actually
      satisfied.
      
      Therefore, aligning its behaviours with similar functions implemented
      in other stacks, for instance, sk_stream_wait_connect() and
      inet_csk_wait_for_connect() etc, can avoid above risks for us.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize recvmsg routine · 9bbb4ecc
      Ying Xue authored
      Standardize the behaviour of waiting for events in TIPC recvmsg()
      so that all variables of socket or port structures are protected
      within socket lock, allowing the process of calling recvmsg() to
      be woken up at appropriate time.
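
      The pattern being adopted is roughly the following (a hedged sketch
      of the common "check the condition under the socket lock, drop it
      only for the actual sleep" idiom, not the exact TIPC code):

        static int wait_for_data(struct sock *sk, long timeo)
        {
                DEFINE_WAIT(wait);
                int err = 0;

                while (skb_queue_empty(&sk->sk_receive_queue)) {
                        prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
                        release_sock(sk);
                        timeo = schedule_timeout(timeo);
                        lock_sock(sk);
                        finish_wait(sk_sleep(sk), &wait);

                        if (!timeo) {
                                err = -EAGAIN;
                                break;
                        }
                        if (signal_pending(current)) {
                                err = sock_intr_errno(timeo);
                                break;
                        }
                }
                return err;
        }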
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize sendmsg routine of connected socket · 391a6dd1
      Ying Xue authored
      Standardize the behaviour of waiting for events in TIPC send_packet()
      so that all variables of socket or port structures are protected within
      socket lock, allowing the process of calling sendmsg() to be woken up
      at appropriate time.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize sendmsg routine of connectionless socket · 3f40504f
      Ying Xue authored
      Comparing the behaviour of how to wait for events in TIPC sendmsg()
      with other stacks, the TIPC implementation might be perceived as
      different, and sometimes even incorrect. For instance, sk_sleep()
      and tport->congested variables associated with socket are exposed
      without socket lock protection while wait_event_interruptible_timeout()
      accesses them. Standardizing it with the similar implementation
      in other stacks helps us correct these errors, where the process
      calling sendmsg() cannot be woken up even if an expected event
      arrives at the socket, or is woken up improperly although the wake
      condition is not met.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize accept routine · 6398e23c
      Ying Xue authored
      Comparing the behaviour of how to wait for events in TIPC accept()
      with other stacks, the TIPC implementation might be perceived as
      different, and sometimes even incorrect. As sk_sleep() and
      sk->sk_receive_queue variables associated with socket are not
      protected by socket lock, the process of calling accept() may be
      woken up improperly or sometimes cannot be woken up at all. After
      standardizing it with the inet_csk_wait_for_connect routine, we
      gain several benefits: avoiding the 'thundering herd' phenomenon,
      adding a timeout mechanism for accept(), coping with a pending
      signal, and keeping sk_sleep() and sk->sk_receive_queue always
      protected within the socket lock scope.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize connect routine · 78eb3a53
      Ying Xue authored
      Comparing the behaviour of how to wait for events in TIPC connect()
      with other stacks, the TIPC implementation might be perceived as
      different, and sometimes even incorrect. For instance, as both
      sock->state and sk_sleep() are directly fed to
      wait_event_interruptible_timeout() as its arguments, and socket lock
      has to be released before we call wait_event_interruptible_timeout(),
      the two variables associated with socket are exposed out of socket
      lock protection and may hold stale values, so that the process
      calling connect() cannot be woken up even if the correct event
      arrives, or is woken up improperly even though the wake condition
      is not satisfied in practice. Therefore, standardizing its
      behaviour with sk_stream_wait_connect routine can avoid these risks.
      
      Additionally the implementation of connect routine is simplified as a
      whole, allowing it to return correct values in all different cases.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: remove the unnecessary assignment · abfce3ef
      wangweidong authored
      When we take the successful path, status is already 0, so there is
      no need to assign it again. Just remove the assignment.
      Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: drop rq->max and rq->num · be121f46
      Jason Wang authored
      It looks like there's no need for those two fields:
      
      - Unless there's a failure for the first refill try, rq->max should be always
        equal to the vring size.
      - rq->num is only used to determine the condition that we need to do the refill,
        we could check vq->num_free instead.
      - rq->num was required to be increased or decreased explicitly after each
        get/put, which results in a bad API.
      
      So this patch removes them both to make the code simpler.
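
      The refill condition can then be expressed against the vring state
      directly, roughly like this (sketch; the threshold is illustrative):

        /* Refill when a large share of the vring descriptors are free,
         * instead of tracking a separate rq->num counter.
         */
        if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2)
                schedule_delayed_work(&vi->refill, 0);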
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: davinci_mdio: Fix sparse warning · 9b05f462
      Lad, Prabhakar authored
      This patch fixes the following sparse warning:
      davinci_mdio.c:85:27: warning: symbol 'default_pdata' was not declared. Should it be static?
      It also makes default_pdata a constant.
      Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: handle slave's name change with primary_slave logic · 3ec775b9
      Veaceslav Falico authored
      Currently, if a slave's name changes, we just ignore it. However, if the
      slave is the current primary_slave, we end up using a slave whose
      name != params.primary as the primary_slave. And vice versa: if we don't
      have a primary_slave but params.primary is set, we will not detect a new
      primary_slave.
      
      Fix this by catching the NETDEV_CHANGENAME event and setting primary_slave
      accordingly. Also, if the primary_slave was changed, issue a reselection of
      the active slave, because the priorities have changed.
      Reported-by: Ding Tianhong <dingtianhong@huawei.com>
      CC: Ding Tianhong <dingtianhong@huawei.com>
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Acked-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: act: pick a different type for act_xt · 6c80563c
      WANG Cong authored
      In tcf_register_action() we check either ->type or ->kind to see if
      there is an existing action registered, but ipt action registers two
      actions with the same type but different kinds. They should have different
      types too.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge · 7dff08bb
      David S. Miller authored
      Included change:
      - properly format already existing kerneldoc
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: act: use tcf_hash_release() in net/sched/act_police.c · fb1d598d
      WANG Cong authored
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: updates to AdminQ interface · 0aebd2d9
      Shannon Nelson authored
      Refinements to cloud support in the Firmware API.
      Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
      Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: check desc pointer before printing · 68bf94aa
      Shannon Nelson authored
      Check that the descriptors were allocated before trying to dump
      them to the logfile.  While we're there, de-trick-ify the code so it
      is easier to read and no longer abuses the types and unions.
      
      Change-ID: I22898f4b22cecda3582d4d9e4018da9cd540f177
      Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
      Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • team: block mtu change before it happens via NETDEV_PRECHANGEMTU · b01f236c
      Veaceslav Falico authored
      Currently the team driver catches the NETDEV_CHANGEMTU notification,
      which is signaled after the actual change has already happened on the
      device, and returns NOTIFY_BAD so that the change on the device is
      reverted.
      
      This can be quite costly and messy, so use the new NETDEV_PRECHANGEMTU
      to catch the MTU change before the actual change happens and signal
      that it is forbidden.
      
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add NETDEV_PRECHANGEMTU to notify before mtu change happens · 1d486bfb
      Veaceslav Falico authored
      Currently, if a device changes its mtu, the change happens first (involving
      all the side effects), and only afterwards is NETDEV_CHANGEMTU sent so that
      other devices can catch up with the new mtu. However, if they return
      NOTIFY_BAD, then the change is reverted and an error returned.
      
      This can be a really long and costly operation. To fix this, add a
      NETDEV_PRECHANGEMTU notification which is called prior to any change
      actually happening; if any callee returns NOTIFY_BAD, the change is
      aborted. This way we skip all the apply/revert games with the mtu.
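
      A driver-side notifier can then veto the change up front, roughly
      like this (hedged sketch, not taken from a particular driver;
      foo_mtu_is_allowed() is a placeholder):

        static int foo_netdev_event(struct notifier_block *nb,
                                    unsigned long event, void *ptr)
        {
                struct net_device *dev = netdev_notifier_info_to_dev(ptr);

                if (event == NETDEV_PRECHANGEMTU &&
                    !foo_mtu_is_allowed(dev))
                        return NOTIFY_BAD;   /* abort before any change */

                return NOTIFY_DONE;
        }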
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r6040: use ETH_ZLEN instead of MISR for SKB length checking · 31cf344c
      Florian Fainelli authored
      Ever since this driver was merged the following code was included:
      
      if (skb->len < MISR)
      	skb->len = MISR;
      
      MISR is defined as 0x3C, which happens to equal ETH_ZLEN, but use
      ETH_ZLEN directly since that is exactly what we want to be checking for.
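
      The check presumably becomes simply:

        if (skb->len < ETH_ZLEN)
                skb->len = ETH_ZLEN;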
      Reported-by: Marc Volovic <marcv@ezchip.com>
      Signed-off-by: Florian Fainelli <florian@openwrt.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r6040: add delays in MDIO read/write polling loops · 4f8d9f3c
      Florian Fainelli authored
      On newer and faster machines (Vortex X86DX) using the r6040 driver, it
      was noticed that the driver returned an error during probing, traced
      down to the MDIO bus probing and the inability to complete an MDIO
      read operation in time. It turns out that the MDIO operations on these
      faster machines usually complete after ~2140 iterations, which is more
      than 2048 (MAC_DEF_TIMEOUT) and results in spurious timeouts depending
      on the system load.
      
      Update r6040_phy_read() and r6040_phy_write() to include a 1
      microsecond delay in each iteration of the busy-wait loop, which is a
      much safer fix than simply increasing MAC_DEF_TIMEOUT.
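
      The polling loops then look roughly like this (simplified sketch;
      register and bit names are abbreviated):

        for (limit = MAC_DEF_TIMEOUT; limit; limit--) {
                cmd = ioread16(ioaddr + MMDIO);
                if (!(cmd & MDIO_READ))
                        break;
                udelay(1);      /* added delay per iteration */
        }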
      Reported-by: Nils Koehler <nils.koehler@ibt-interfaces.de>
      Reported-by: Daniel Goertzen <daniel.goertzen@gmail.com>
      Signed-off-by: Florian Fainelli <florian@openwrt.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netfront: add support for IPv6 offloads · 2c0057de
      Paul Durrant authored
      This patch adds support for IPv6 checksum offload and GSO when those
      features are available in the backend.
      Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Check skb->rxhash in gro_receive · 0b4cec8c
      Tom Herbert authored
      When initializing a gro_list for a packet, first check the rxhash of
      the incoming skb against those of the skbs in the list. This should be
      a very strong indicator of whether the flow is going to be matched,
      and potentially allows a lot of other checks to be short-circuited.
      Use skb_get_hash_raw() so that we don't force the hash to be calculated.
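
      In dev_gro_receive()-style code, the idea is roughly (simplified
      sketch, not the exact kernel diff):

        for (p = napi->gro_list; p; p = p->next) {
                if (skb_get_hash_raw(p) != skb_get_hash_raw(skb)) {
                        /* Cheap mismatch: rule this flow out before the
                         * full header comparison.
                         */
                        NAPI_GRO_CB(p)->same_flow = 0;
                        continue;
                }
                /* hashes match: proceed with the usual flow comparison */
        }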
      
      Tested by running netperf 200 TCP_STREAMs between two machines with
      GRO, HW rxhash, and 1G. Saw no performance degradation and a slight
      reduction of time spent in dev_gro_receive.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add skb_get_hash_raw · 57bdf7f4
      Tom Herbert authored
      Function to just return skb->rxhash without checking to see if it needs
      to be recomputed.
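
      The helper is presumably little more than (sketch):

        static inline __u32 skb_get_hash_raw(const struct sk_buff *skb)
        {
                return skb->rxhash;
        }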
      Signed-off-by: Tom Herbert <therbert@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxge: make local functions static · e40c10fc
      stephen hemminger authored
      Remove the unused function vxge_hw_vpath_vid_get().
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnad: code cleanup · 2fd888a5
      stephen hemminger authored
      Use 'make namespacecheck' to find code that could be declared static.
      After that, remove code that is not being used.
      
      Compile tested only.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dm9000: fix a lot of checkpatch issues · 5b22721d
      Barry Song authored
      Recently the dm9000 code has accumulated many checkpatch errors and warnings:
      
      WARNING: please, no space before tabs
      3: FILE: dm9000.c:3:
      + * ^ICopyright (C) 1997  Sten Wang$
      
      WARNING: please, no space before tabs
      5: FILE: dm9000.c:5:
      + * ^IThis program is free software; you can redistribute it and/or$
      
      WARNING: please, no space before tabs
      6: FILE: dm9000.c:6:
      + * ^Imodify it under the terms of the GNU General Public License$
      
      WARNING: please, no space before tabs
      7: FILE: dm9000.c:7:
      + * ^Ias published by the Free Software Foundation; either version 2$
      
      WARNING: please, no space before tabs
      8: FILE: dm9000.c:8:
      + * ^Iof the License, or (at your option) any later version.$
      
      WARNING: please, no space before tabs
      10: FILE: dm9000.c:10:
      + * ^IThis program is distributed in the hope that it will be useful,$
      
      WARNING: please, no space before tabs
      11: FILE: dm9000.c:11:
      + * ^Ibut WITHOUT ANY WARRANTY; without even the implied warranty of$
      
      WARNING: please, no space before tabs
      12: FILE: dm9000.c:12:
      + * ^IMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the$
      
      WARNING: please, no space before tabs
      13: FILE: dm9000.c:13:
      + * ^IGNU General Public License for more details.$
      
      WARNING: do not add new typedefs
      97: FILE: dm9000.c:97:
      +typedef struct board_info {
      
      ERROR: spaces prohibited around that ':' (ctx:WxV)
      113: FILE: dm9000.c:113:
      +	unsigned int	in_suspend :1;
       	            	           ^
      
      ERROR: spaces prohibited around that ':' (ctx:WxV)
      114: FILE: dm9000.c:114:
      +	unsigned int	wake_supported :1;
       	            	               ^
      
      This patch fixes the important errors among them.
      Signed-off-by: Barry Song <Baohua.Song@csr.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: use percpu mmap tx frame pending refcount · b0138408
      Daniel Borkmann authored
      In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
      and one atomic_dec() call in skb destructor and use a percpu
      reference count instead in order to determine if packets are
      still pending to be sent out. Micro-benchmark with [1] that has
      been slightly modified (that is, protocol = 0 in socket(2) and
      bind(2)), example on a rather crappy testing machine; I expect
      it to scale and have even better results on bigger machines:
      
      ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:
      
      With patch:    4,022,015 cyc
      Without patch: 4,812,994 cyc
      
      time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:
      
      With patch:
        real         1m32.241s
        user         0m0.287s
        sys          1m29.316s
      
      Without patch:
        real         1m38.386s
        user         0m0.265s
        sys          1m35.572s
      
      In function tpacket_snd(), it is okay to use packet_read_pending()
      since in fast-path we short-circuit the condition already with
      ph != NULL, since we have next frames to process. In case we have
      MSG_DONTWAIT, we also do not execute this path as need_wait is
      false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
      okay to call a packet_read_pending(), because when we ever reach
      that path, we're done processing outgoing frames anyway and only
      look if there are skbs still outstanding to be orphaned. We can
      stay lockless in this percpu counter since it's acceptable when we
      reach this path for the sum to be imprecise first, but we'll level
      out at 0 after all pending frames have reached the skb destructor
      eventually through tx reclaim. When people pin a tx process to
      particular CPUs, we expect overflows to happen in the reference
      counter as on one CPU we expect heavy increase; and distributed
      through ksoftirqd on all CPUs a decrease, for example. As
      David Laight points out, since the C language doesn't define the
      result of signed int overflow (i.e. rather than wrap, it is
      allowed to saturate as a possible outcome), we have to use
      unsigned int as reference count. The sum over all CPUs when tx
      is complete will result in 0 again.
      
      The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
      can _only_ be set from inside tpacket_snd() path and we made sure
      to increase tx_ring.pending in any case before we called po->xmit(skb).
      So testing for tx_ring.pending == 0 is not too useful. Instead, it
      would rather have been useful to test if lower layers didn't orphan
      the skb so that we're missing ring slots being put back to
      TP_STATUS_AVAILABLE. But such a bug will be caught in user space
      already as we end up realizing that we do not have any
      TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
      
      Btw, in case of RX_RING path, we do not make use of the pending
      member, therefore we also don't need to use up any percpu memory
      here. Also note that __alloc_percpu() already returns a zero-filled
      percpu area, so initialization is done already.
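
      A sketch of the counter (names approximate the description above,
      not copied verbatim from the patch):

        struct packet_ring_buffer {
                /* ... existing fields ... */
                unsigned int __percpu   *pending_refcnt;
        };

        static void packet_inc_pending(struct packet_ring_buffer *rb)
        {
                this_cpu_inc(*rb->pending_refcnt);
        }

        static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
        {
                unsigned int refcnt = 0;
                int cpu;

                /* The sum may be momentarily imprecise, but it levels out
                 * at 0 once all pending frames hit the skb destructor.
                 */
                for_each_possible_cpu(cpu)
                        refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);
                return refcnt;
        }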
      
        [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: don't unconditionally schedule() in case of MSG_DONTWAIT · 87a2fd28
      Daniel Borkmann authored
      In tpacket_snd(), when we've discovered a first frame that is
      not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
      we exit the send routine in case of MSG_DONTWAIT, since we've
      finished traversing the mmaped send ring buffer and don't care
      about pending frames.
      
      While doing so, we still unconditionally call an expensive
      schedule() in the packet_current_frame() "error" path, which
      is unnecessary in this case since it's enough to just quit
      the function.
      
      Also, in case MSG_DONTWAIT is not set, we should rather test
      for need_resched() first and do schedule() only if necessary
      since meanwhile pending frames could already have finished
      processing and called skb destructor.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: improve socket create/bind latency in some cases · 902fefb8
      Daniel Borkmann authored
      Most people acquire PF_PACKET sockets with a protocol argument in
      the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
      all its sockets. Most likely, at some point in time a subsequent
      bind() call will follow, e.g. in libpcap with ...
      
        memset(&sll, 0, sizeof(sll));
        sll.sll_family          = AF_PACKET;
        sll.sll_ifindex         = ifindex;
        sll.sll_protocol        = htons(ETH_P_ALL);
      
      ... as arguments. What happens in the kernel is that already
      in socket() syscall, we install a proto hook via register_prot_hook()
      if our protocol argument is != 0. Yet, in bind() we're almost
      doing the same work by calling unregister_prot_hook() with an
      expensive synchronize_net() call in case during socket() the proto
      was != 0, plus follow-up register_prot_hook() with a bound device
      to it this time, in order to limit traffic we get.
      
      In the case when the protocol and user supplied device index (== 0)
      does not change from socket() to bind(), we can spare ourselves doing
      the same work twice. Similarly for re-binding to the same device
      and protocol. For these scenarios, we can decrease create/bind
      latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
      with this patch.
      
      Alternatively, for the first case, if people care, they should
      simply create their sockets with proto == 0 argument and define
      the protocol during bind() as this saves a call to synchronize_net()
      as well (sock-bind-3 case).
      
      In all other cases, we're tied to user space behaviour we must not
      change, also since a bind() is not strictly required. Thus, we need
      the synchronize_net() to make sure no asynchronous packet processing
      paths still refer to the previous elements of po->prot_hook.
      
      In case of mmap()ed sockets, the workflow that includes bind() is
      socket() -> setsockopt(<ring>) -> bind(). In that case, a pair of
      {__unregister, register}_prot_hook is being called from setsockopt()
      in order to install the new protocol receive handler. Thus, when
      we call bind and can skip a re-hook, we have already previously
      installed the new handler. For fanout, this is handled entirely
      differently, so we should be good.
      
      Timings on an i7-3520M machine:
      
        * sock-bind-1:   89 us
        * sock-bind-2: 7447 us
        * sock-bind-3:   75 us
      
      sock-bind-1:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-2:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-3:
        socket(PF_PACKET, SOCK_RAW, 0) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: Remove autogenerated Module.symvers file. · ec48a787
      David S. Miller authored
      Fixes: 9d8bf547 ("i40e: associate VMDq queue with VM type")
      Reported-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4: don't use module_init in non-modular gre_offload · cf172283
      Paul Gortmaker authored
      Recent commit 438e38fa
      ("gre_offload: statically build GRE offloading support") added
      new module_init/module_exit calls to the gre_offload.c file.
      
      The file is obj-y and can't be anything other than built-in.
      Currently it can never be built modular, so using module_init
      as an alias for __initcall can be somewhat misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.  We also make the inclusion explicit.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of device_initcall
      directly in this change means that the runtime impact is
      zero -- it will remain at level 6 in initcall ordering.
      
      As for the module_exit, rather than replace it with __exitcall,
      we simply remove it, since it appears only UML does anything
      with those, and even for UML, there is no relevant cleanup
      to be done here.
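
      The resulting registration is presumably just (sketch; assuming the
      init function is named gre_offload_init as in the original code):

        #include <linux/init.h>

        device_initcall(gre_offload_init);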
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_core: clean up srq_res_start_move_to() · f088cbb8
      Paul Bolle authored
      Building resource_tracker.o triggers a GCC warning:
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 'mlx4_HW2SW_SRQ_wrapper':
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3202:17: warning: 'srq' may be used uninitialized in this function [-Wmaybe-uninitialized]
            atomic_dec(&srq->mtt->ref_count);
                           ^
      
      This is a false positive. But a cleanup of srq_res_start_move_to() can
      help GCC here. The code currently uses a switch statement where a plain
      if/else would do, since only two of the switch's four cases can ever
      occur. Dropping that switch makes the warning go away.
      
      While we're at it, add some missing braces, and convert state to the
      correct type.
      Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_core: clean up cq_res_start_move_to() · c9218a9e
      Paul Bolle authored
      Building resource_tracker.o triggers a GCC warning:
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 'mlx4_HW2SW_CQ_wrapper':
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3019:16: warning: 'cq' may be used uninitialized in this function [-Wmaybe-uninitialized]
            atomic_dec(&cq->mtt->ref_count);
                          ^
      
      This is a false positive. But a cleanup of cq_res_start_move_to() can
      help GCC here. The code currently uses a switch statement where an
      if/else construct would do too, since only two of the switch's four
      cases can ever occur. Dropping that switch makes the warning go away.
      
      While we're at it, add some missing braces.
      Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Jan, 2014 4 commits