1. 29 Jan, 2013 13 commits
    • Barry Grussling's avatar
      ethoc: Cleanup driver format · 72aa8e1b
      Barry Grussling authored
      Cleanup the format of ethoc.c to meet network driver style as
      per checkpatch.pl.
      Signed-off-by: default avatarBarry Grussling <barry@grussling.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      72aa8e1b
    • David Ward's avatar
      ip_gre: When TOS is inherited, use configured TOS value for non-IP packets · 040468a0
      David Ward authored
      A GRE tunnel can be configured so that outgoing tunnel packets inherit
      the value of the TOS field from the inner IP header. In doing so, when
      a non-IP packet is transmitted through the tunnel, the TOS field will
      always be set to 0.
      
      Instead, the user should be able to configure a different TOS value as
      the fallback to use for non-IP packets. This is helpful when the non-IP
      packets are all control packets and should be handled by routers outside
      the tunnel as having Internet Control precedence. One example of this is
      the NHRP packets that control a DMVPN-compatible mGRE tunnel; they are
      encapsulated directly by GRE and do not contain an inner IP header.
      
      Under the existing behavior, the IFLA_GRE_TOS parameter must be set to
      '1' for the TOS value to be inherited. Now, only the least significant
      bit of this parameter must be set to '1', and when a non-IP packet is
      sent through the tunnel, the upper 6 bits of this same parameter will be
      copied into the TOS field. (The ECN bits get masked off as before.)
      
      This behavior is backwards-compatible with existing configurations and
      iproute2 versions.
      Signed-off-by: default avatarDavid Ward <david.ward@ll.mit.edu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      040468a0
    • Jiri Pirko's avatar
      ipv4: introduce address lifetime · 5c766d64
      Jiri Pirko authored
      There are some usecase when lifetime of ipv4 addresses might be helpful.
      For example:
      1) initramfs networkmanager uses a DHCP daemon to learn network
      configuration parameters
      2) initramfs networkmanager addresses, routes and DNS configuration
      3) initramfs networkmanager is requested to stop
      4) initramfs networkmanager stops all daemons including dhclient
      5) there are addresses and routes configured but no daemon running. If
      the system doesn't start networkmanager for some reason, addresses and
      routes will be used forever, which violates RFC 2131.
      
      This patch is essentially a backport of ivp6 address lifetime mechanism
      for ipv4 addresses.
      
      Current "ip" tool supports this without any patch (since it does not
      distinguish between ipv4 and ipv6 addresses in this perspective.
      
      Also, this should be back-compatible with all current netlink users.
      Reported-by: default avatarPavel Šimerda <psimerda@redhat.com>
      Signed-off-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5c766d64
    • David S. Miller's avatar
      Merge branch 'ipfrags' · 5a1dc317
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      This patchset is V2, with some trivial code fixes, which were noticed
      by DaveM. It is still a partly respin of my fragmentation optimization
      patches: http://thread.gmane.org/gmane.linux.network/250914
      
      This is not the complete patchset, from the gmane link above. In this
      patchset, I primarily focus on adjusting cacheline for better SMP/NUMA
      performance.
      
      Once this patchset have been agreed upon, I will continue and respin
      the rest of my patches.
      
      This time around, I have created a frag DoS generator, via the tool
      trafgen (http://netsniff-ng.org/).  To create a stable DoS scenario
      (no longer relying on frame dropping due to disabled flow-control).
      
      Two 10G interfaces are under-test, and uses Ethernet flow-control.  A
      third interface is used for generating the DoS attack (this interface
      is also 10G, but it does not need to be, as 500Kpps DoS is enough).
      
      Test types summary (netperf):
       Test-20G64K     == 2x10G with 65K fragments
       Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
       Test-20G64K+DoS == Same as 20G64K with frag DoS
       Test-20G3F+DoS  == Same as 20G3F  with frag DoS
      
      Patch list:
       Patch-01 - net: cacheline adjust struct netns_frags for better frag performance
       Patch-02 - net: cacheline adjust struct inet_frags for better frag performance
       Patch-03 - net: cacheline adjust struct inet_frag_queue
       Patch-04 - net: frag helper functions for mem limit tracking
       Patch-05 - net: use lib/percpu_counter API for fragmentation mem accounting
       Patch-06 - net: frag, move LRU list maintenance outside of rwlock
      
      Performance table summary:
      
       Test-type:  Test-20G64K    Test-20G3F  20G64K+DoS   20G3F+DoS
       ----------  -----------    ----------  ----------   ---------
        net-next:  15114.5 Mbit/s   8954.21     2444.28     3918.01 Mbit/s
        Patch-01:  16075.8 Mbit/s   8976.18     2621.49     4072.79 Mbit/s
        Patch-02:  17806.9 Mbit/s   9280.32     2478.62     4274.59 Mbit/s
        Patch-03:  17317.4 Mbit/s   9308.62     2546.05     4336.59 Mbit/s
        Patch-04:  17635.9 Mbit/s   9256.16     2535.25     4327.63 Mbit/s
        Patch-05:  18027.0 Mbit/s   9918.99     2492.62     3621.68 Mbit/s
        Patch-06:  18486.7 Mbit/s  10723.20     3657.85     4560.64 Mbit/s
      
       I cannot explain the under-DoS regression that patch-05/percpu_counter
       introduces.  But patch-06/LRU-lock corrects the situation again.
      
      Below is a testlab setup description, with links to the trafgen DoS
      packet config used.
      
      Testlab
      =======
      
      Server setup
      ------------
      The machine acting as a server:
       - 2x CPU (E5-2630)
       - Thus a NUMA arch/machine
       - 4x 10Gbit/s ports
       - NICs 2x Intel Dual port 82599 based (driver ixgbe)
      
      Setup:
       - Interfaces uses Ethernet flow control
       - Flush all iptables
       - Remove all iptables related module.
       - Kill irqbalance
       - Pin each 10G NIC port to a *single* CPU each
      
      Pinning can easily be done by command hacks::
      
       for x in /proc/irq/*/eth8*/../smp_affinity_list ; do echo 1 > $x; done
       for x in /proc/irq/*/eth9*/../smp_affinity_list ; do echo 3 > $x; done
       for x in /proc/irq/*/eth31*/../smp_affinity_list; do echo 6 > $x; done
       for x in /proc/irq/*/eth32*/../smp_affinity_list; do echo 8 > $x; done
      
      Notice NUMA setting: The CPU to NIC tying is carefully choosen
      according to the NUMA node setup.  Thus, NICs connected to a PCI-e
      slot that is connected to a physical CPU socket are tied together.
      
      Choosing only a single CPU per NIC (port) is just to ease provoking
      and debugging this performance issue. (In real setups, you can choose
      more CPU, just remember the NUMA node in the equation).
      
      Tools
      -----
      
      Netperf is used, with option -T to ensure CPU binding.
      The netserver processes, are NAPI pinned::
      
       numactl -m0 -c0 netserver
       numactl -m1 -c 1 netserver -p 1337
      
      I now have a frag DoS generator, created via the tool:
        trafgen (see: http://netsniff-ng.org/)
      
      Trafgen packet config file:
       http://people.netfilter.org/hawk/frag_work/trafgen/frag_packet03_small_frag.txf
      
      Notice, I'm using features of trafgen, recently developed by Daniel
      Borkmann, thus you need the latest git tree to use my trafgen packet
      config.
      
       git://github.com/borkmann/netsniff-ng.git
      
      Command line:
       trafgen --dev eth51 --conf frag_packet03_small_frag.txf -V -k 100 --cpus 2
      
      Tests types
      -----------
      
      Test(20G64K) UDP-64K 2x 10Gbit/s with no DoS traffic:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
       export SIZE=$((65507)); export TIME=$((20)); export LOG=/tmp/netperf.log ;\
       netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.31 &\
       netperf         -H 192.168.81.2 -T2,2 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.81 && \
       wait $! && tail -n3 ${LOG}.* && \
       tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'
      
      Test(20G3F) UDP-3xfrags 2x 10Gbit/s with no DoS traffic:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
       export SIZE=$((3*1472)); export TIME=$((20)); export LOG=/tmp/netperf.log ;\
       netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.31 &\
       netperf         -H 192.168.81.2 -T2,2 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.81 && \
       wait $! && tail -n3 ${LOG}.* && \
      tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'
      
      Awk script for summming results:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5a1dc317
    • Jesper Dangaard Brouer's avatar
      net: frag, move LRU list maintenance outside of rwlock · 3ef0eb0d
      Jesper Dangaard Brouer authored
      Updating the fragmentation queues LRU (Least-Recently-Used) list,
      required taking the hash writer lock.  However, the LRU list isn't
      tied to the hash at all, so we can use a separate lock for it.
      Original-idea-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3ef0eb0d
    • Jesper Dangaard Brouer's avatar
      net: use lib/percpu_counter API for fragmentation mem accounting · 6d7b857d
      Jesper Dangaard Brouer authored
      Replace the per network namespace shared atomic "mem" accounting
      variable, in the fragmentation code, with a lib/percpu_counter.
      
      Getting percpu_counter to scale to the fragmentation code usage
      requires some tweaks.
      
      At first view, percpu_counter looks superfast, but it does not
      scale on multi-CPU/NUMA machines, because the default batch size
      is too small, for frag code usage.  Thus, I have adjusted the
      batch size by using __percpu_counter_add() directly, instead of
      percpu_counter_sub() and percpu_counter_add().
      
      The batch size is increased to 130.000, based on the largest 64K
      fragment memory usage.  This does introduce some imprecise
      memory accounting, but its does not need to be strict for this
      use-case.
      
      It is also essential, that the percpu_counter, does not
      share cacheline with other writers, to make this scale.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6d7b857d
    • Jesper Dangaard Brouer's avatar
      net: frag helper functions for mem limit tracking · d433673e
      Jesper Dangaard Brouer authored
      This change is primarily a preparation to ease the extension of memory
      limit tracking.
      
      The change does reduce the number atomic operation, during freeing of
      a frag queue.  This does introduce a some performance improvement, as
      these atomic operations are at the core of the performance problems
      seen on NUMA systems.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d433673e
    • Jesper Dangaard Brouer's avatar
      net: cacheline adjust struct inet_frag_queue · 6e34a8b3
      Jesper Dangaard Brouer authored
      Fragmentation code cacheline adjusting of struct inet_frag_queue.
      
      Take advantage of the size of struct timer_list, and move all but
      spinlock_t lock, below the timer struct.  On 64-bit 'lru_list',
      'list' and 'refcnt', fits exactly into the next cacheline, and a
      new cacheline starts at 'fragments'.
      
      The netns_frags *net pointer is moved to the end of the struct,
      because its used in a compare, with "next/close-by" elements of
      which this struct is embedded into.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6e34a8b3
    • Jesper Dangaard Brouer's avatar
      net: cacheline adjust struct inet_frags for better frag performance · 5f8e1e8b
      Jesper Dangaard Brouer authored
      The globally shared rwlock, of struct inet_frags, shares
      cacheline with the 'rnd' number, which is used by the hash
      calculations.  Fix this, as this obviously is a bad idea, as
      unnecessary cache-misses will occur when accessing the 'rnd'
      number.
      
      Also small note that, moving function ptr (*match) up in struct,
      is to avoid it lands on the next cacheline (on 64-bit).
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5f8e1e8b
    • Jesper Dangaard Brouer's avatar
      net: cacheline adjust struct netns_frags for better frag performance · cd39a789
      Jesper Dangaard Brouer authored
      This small cacheline adjustment of struct netns_frags improves
      performance significantly for the fragmentation code.
      
      Struct members 'lru_list' and 'mem' are both hot elements, and it
      hurts performance, due to cacheline bouncing at every call point,
      when they share a cacheline.  Also notice, how mem is placed
      together with 'high_thresh' and 'low_thresh', as they are used in
      the compare operations together.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cd39a789
    • Felipe Balbi's avatar
      net: ks8851: convert to threaded IRQ · 656a05c8
      Felipe Balbi authored
      just as it should have been. It also helps
      removing the, now unnecessary, workqueue.
      Signed-off-by: default avatarFelipe Balbi <balbi@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      656a05c8
    • YOSHIFUJI Hideaki / 吉藤英明's avatar
      net neigh: Optimize neighbor entry size calculation. · 08433eff
      YOSHIFUJI Hideaki / 吉藤英明 authored
      When allocating memory for neighbour cache entry, if
      tbl->entry_size is not set, we always calculate
      sizeof(struct neighbour) + tbl->key_len, which is common
      in the same table.
      
      With this change, set tbl->entry_size during the table
      initialization phase, if it was not set, and use it in
      neigh_alloc() and neighbour_priv().
      
      This change also allow us to have both of protocol private
      data and device priate data at tha same time.
      
      Note that the only user of prototcol private is DECnet
      and the only user of device private is ATM CLIP.
      Since those are exclusive, we have not been facing issues
      here.
      Signed-off-by: default avatarYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      08433eff
    • bingtian.ly@taobao.com's avatar
      net: avoid to hang up on sending due to sysctl configuration overflow. · cdda8891
      bingtian.ly@taobao.com authored
          I found if we write a larger than 4GB value to some sysctl
      variables, the sending syscall will hang up forever, because these
      variables are 32 bits, such large values make them overflow to 0 or
      negative.
      
          This patch try to fix overflow or prevent from zero value setup
      of below sysctl variables:
      
      net.core.wmem_default
      net.core.rmem_default
      
      net.core.rmem_max
      net.core.wmem_max
      
      net.ipv4.udp_rmem_min
      net.ipv4.udp_wmem_min
      
      net.ipv4.tcp_wmem
      net.ipv4.tcp_rmem
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarLi Yu <raise.sail@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cdda8891
  2. 28 Jan, 2013 27 commits