1. 02 Jul, 2014 1 commit
  2. 01 Jul, 2014 13 commits
    • David S. Miller's avatar
      Merge branch 'bnx2x-next' · b6fd8b7f
      David S. Miller authored
      Yuval Mintz says:
      
      ====================
      bnx2x: Enhancement patch series
      
      This patch series introduces the ability to propagate link parameters
      to VFs as well as control the VF link via hypervisor.
      
      In addition, it contains 2 small improvements [one IOV-related and the
      other improves performance on machines with short cache lines].
      
      Please consider applying these patches to `net-next'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6fd8b7f
    • Yuval Mintz's avatar
      bnx2x: Fail probe of VFs using an old incompatible driver · ebf457f9
      Yuval Mintz authored
      There are linux distributions where the inbox bnx2x driver contains SRIOV
      support but doesn't contain the changes introduced in b9871bcf
      "bnx2x: VF RSS support - PF side".
      
      A VF in a VM running that distribution over a new hypervisor will access
      incorrect addresses when trying to transmit packets, causing an attention
      in the hypervisor and making that VF inactive until FLRed.
      
      The driver in the VM has to be upgraded [no real way to overcome this], but
      due to the HW attention that currently arises, upgrading the driver in the VM
      would not suffice [since the VF also needs to be FLRed if the previous driver
      was already loaded].
      
      This patch causes the PF to fail the acquire message from a VF running an
      old problematic driver; the VF will then gracefully fail its probe, preventing
      the HW attention [and allowing a clean upgrade of the driver in the VM].
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebf457f9
    • Dmitry Kravkov's avatar
      bnx2x: enlarge minimal alignemnt of data offset · 9927b514
      Dmitry Kravkov authored
      This improves the performance of the driver on machines whose L1 cache line
      is at most 32 bytes, i.e. L1_CACHE_SHIFT of at most 5 [HW was planned for
      64-byte aligned fastpath data].
      Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9927b514
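
The alignment idea in the commit above can be sketched as follows, in Python for brevity rather than kernel C. The names `align_up`, `MIN_FP_ALIGN` and `data_offset` are illustrative assumptions and not identifiers from the bnx2x driver. The sketch only shows that taking the larger of the cache-line size and 64 bytes keeps fastpath data 64-byte aligned on machines with short cache lines.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


# Assumed minimum, taken from the "64-byte aligned fastpath data" remark.
MIN_FP_ALIGN = 64


def data_offset(header_len: int, l1_cache_bytes: int) -> int:
    """Offset of packet data, aligned to max(cache line, 64 bytes)."""
    return align_up(header_len, max(l1_cache_bytes, MIN_FP_ALIGN))
```

With a 32-byte cache line the offset is still rounded to a multiple of 64, and with a 128-byte cache line the larger value is used.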
    • Dmitry Kravkov's avatar
      bnx2x: VF can report link speed · 6495d15a
      Dmitry Kravkov authored
      Until now VFs were oblivious to the actual configured link parameters.
      This patch does 2 things:
      
        1. It enables a PF to inform its VF using the bulletin board of the link
           configured, and allows the VF to present that information.
      
        2. It adds support of `ndo_set_vf_link_state', allowing the hypervisor
           to set the VF link state.
      Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6495d15a
    • David S. Miller's avatar
      Merge branch 'pktgen' · edd79ca8
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      Optimizing pktgen for single CPU performance
      
      This series focuses on optimizing "pktgen" for single CPU performance.
      
      V2-series:
       - Removed some patches
       - Doc real reason for TX ring buffer filling up
      
      NIC tuning for pktgen:
       http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
      
      General overload setup according to:
       http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
      
      Hardware:
       System: CPU E5-2630
       NIC: Intel ixgbe/82599 chip
      
      Testing done with net-next git tree on top of
       commit 6623b419 ("Merge branch 'master' of...jkirsher/net-next")
      
      Pktgen script exercising race condition:
       https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh
      
      Tool for measuring LOCK overhead:
       https://github.com/netoptimizer/network-testing/blob/master/src/overhead_cmpxchg.c
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edd79ca8
    • Jesper Dangaard Brouer's avatar
      pktgen: RCU-ify "if_list" to remove lock in next_to_run() · 8788370a
      Jesper Dangaard Brouer authored
      The if_lock()/if_unlock() in next_to_run() adds a significant
      overhead, because it's called for every packet in the busy loop of
      pktgen_thread_worker().  (Thomas Graf originally pointed me
      at this lock problem).
      
      Removing these two "LOCK" operations should in theory save us approx
      16ns (8ns x 2).  As illustrated below, we do save 16ns when removing
      the locks and introducing RCU protection.
      
      Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
       (single CPU performance, ixgbe 10Gbit/s, E5-2630)
       * Prev   : 5684009 pps --> 175.93ns (1/5684009*10^9)
       * RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
       * Diff   : +588195 pps --> -16.50ns
      
      To understand this RCU patch, I describe the pktgen thread model
      below.
      
      In pktgen there are several kernel threads, but there is only one CPU
      running each kernel thread.  Communication with the kernel threads is
      done through some thread control flags.  This allows the thread to
      change data structures at a known synchronization point, see main
      thread func pktgen_thread_worker().
      
      Userspace changes are communicated through proc-file writes.  There
      are three types of changes: general control changes "pgctrl"
      (func:pgctrl_write), thread changes "kpktgend_X"
      (func:pktgen_thread_write), and interface config changes "ethX@N"
      (func:pktgen_if_write).
      
      Userspace "pgctrl" and "thread" changes are synchronized via the mutex
      pktgen_thread_lock, thus only a single userspace instance can run.
      The mutex is taken while the packet generator is running, by pgctrl
      "start".  Thus e.g. "add_device" cannot be invoked when pktgen is
      running/started.
      
      All "pgctrl" and all "thread" changes, except thread "add_device",
      communicate via the thread control flags.  The main problem is the
      exception "add_device", which modifies the thread's "if_list" directly.
      
      Fortunately "add_device" cannot be invoked while pktgen is running.
      But there exists a race between "rem_device_all" and "add_device"
      (which normally doesn't occur, because "rem_device_all" waits 125ms
      before returning). Background'ing "rem_device_all" and running
      "add_device" immediately allows the race to occur.
      
      The race affects the thread's (list of devices) "if_list".  The if_lock
      is used for protecting this "if_list".  Other readers are given
      lock-free access to the list under RCU read sections.
      
      Note, interface config changes (via proc) can occur while pktgen is
      running, which worries me a bit.  I'm assuming proc_remove() takes
      appropriate locks, to assure no writers exist after proc_remove()
      finishes.
      
      I've been running a script exercising the race condition (leading me
      to fix the proc_remove order), without any issues.  The script also
      exercises concurrent proc writes, while the interface config is
      getting removed.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8788370a
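
The per-packet timings quoted in the commit above follow directly from the packet rates. A small Python sketch of that arithmetic, using only the pps figures from the commit message (the function name `ns_per_packet` is an illustrative choice):

```python
def ns_per_packet(pps: int) -> float:
    """Average time spent per packet, in nanoseconds, at a given packet rate."""
    return 1e9 / pps


prev = ns_per_packet(5684009)  # rate before the RCU change
rcu = ns_per_packet(6272204)   # rate after the RCU change
saved = prev - rcu

print(f"{prev:.2f}ns -> {rcu:.2f}ns, saved {saved:.2f}ns")
# prints "175.93ns -> 159.43ns, saved 16.50ns"
```

The saving is close to the 2 x 8ns estimated for the two removed LOCK-prefixed operations.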
    • Jesper Dangaard Brouer's avatar
      pktgen: avoid expensive set_current_state() call in loop · baac167b
      Jesper Dangaard Brouer authored
      Avoid calling set_current_state() inside the busy-loop in
      pktgen_thread_worker().  When pkt_dev->delay is set, it is still
      used/enabled in pktgen_xmit() via the spin() call.
      
      The set_current_state(TASK_INTERRUPTIBLE) uses an xchg, which is
      implicitly LOCK prefixed.  I've measured the asm LOCK operation to take
      approx 8ns on this E5-2630 CPU.  The performance increase correlates
      with this measurement.
      
      Performance data with CLONE_SKB==100000, rx-usecs=30:
       (single CPU performance, ixgbe 10Gbit/s, E5-2630)
       * Prev:  5454050 pps --> 183.35ns (1/5454050*10^9)
       * Now:   5684009 pps --> 175.93ns (1/5684009*10^9)
       * Diff:  +229959 pps -->  -7.42ns
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      baac167b
    • Jesper Dangaard Brouer's avatar
      pktgen: document tuning for max NIC performance · 9ceb87fc
      Jesper Dangaard Brouer authored
      Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX ring
      running full.  Thus, the TX ring is artificially limiting pktgen.
      (Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
      counters.)
      
      Using ixgbe, the real reason behind the TX ring running full is the
      TX ring not being cleaned up fast enough. The ixgbe driver combines
      TX+RX ring cleanups, and the cleanup interval is affected by the
      ethtool --coalesce setting of parameter "rx-usecs".
      
      Do not increase the default NIC TX ring buffer or default cleanup
      interval.  Instead simply document that pktgen needs special NIC
      tuning for maximum packets per sec performance.
      
      Performance results with pktgen with clone_skb=100000.
      TX ring size 512 (default), adjusting "rx-usecs":
       (Single CPU performance, E5-2630, ixgbe)
       - 3935002 pps - rx-usecs:  1 (irqs:  9346)
       - 5132350 pps - rx-usecs: 10 (irqs: 99157)
       - 5375111 pps - rx-usecs: 20 (irqs: 50154)
       - 5454050 pps - rx-usecs: 30 (irqs: 33872)
       - 5496320 pps - rx-usecs: 40 (irqs: 26197)
       - 5502510 pps - rx-usecs: 50 (irqs: 21527)
      
      TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
       - 3935002 pps - tx-size:  512
       - 5354401 pps - tx-size:  768
       - 5356847 pps - tx-size: 1024
       - 5327595 pps - tx-size: 1536
       - 5356779 pps - tx-size: 2048
       - 5353438 pps - tx-size: 4096
      
      Note that after commit 6f25cd47 (pktgen: fix xmit test for BQL enabled
      devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
      the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allows us to put
      more pressure on the TX ring buffers.
      
      It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
      pktgen respects this in the call to netif_xmit_frozen_or_drv_stopped(txq).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ceb87fc
    • Jiri Pirko's avatar
      openvswitch: introduce rtnl ops stub · 5b9e7e16
      Jiri Pirko authored
      This stub now allows userspace to see IFLA_INFO_KIND for ovs master and
      IFLA_INFO_SLAVE_KIND for slave.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b9e7e16
    • Jiri Pirko's avatar
      rtnetlink: allow to register ops without ops->setup set · b0ab2fab
      Jiri Pirko authored
      So far, it is assumed that ops->setup is filled in. But there might be
      a case where ops make sense even without ->setup. In that case,
      forbid newlink and dellink.
      
      This allows registering simple rtnl link ops containing only ->kind.
      That allows a consistent way of passing the device kind (either
      device-kind or slave-kind) to userspace.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0ab2fab
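
A minimal Python sketch of the rule described in the commit above: link ops that carry only a kind can be registered, but creating or deleting a link through them is refused. The class `LinkOps`, the functions `newlink` and `dellink`, and the choice of `EOPNOTSUPP` as the error are assumptions made for illustration. They are not the kernel's actual code.

```python
import errno
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LinkOps:
    kind: str
    # None models a stub that only exposes its kind to userspace.
    setup: Optional[Callable[[], None]] = None


def newlink(ops: LinkOps) -> int:
    # Without a setup callback there is no way to initialise a new device.
    if ops.setup is None:
        return -errno.EOPNOTSUPP  # assumed error code
    ops.setup()
    return 0


def dellink(ops: LinkOps) -> int:
    # Devices of a kind-only stub were not created here, so refuse deletion too.
    if ops.setup is None:
        return -errno.EOPNOTSUPP  # assumed error code
    return 0
```

For example, `newlink(LinkOps(kind="openvswitch"))` returns a negative error, while ops with a `setup` callback proceed as before.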
    • Ying Xue's avatar
      net: fix some typos in comment · 9bf2b8c2
      Ying Xue authored
      In commit 37112105 ("net: QDISC_STATE_RUNNING dont need atomic bit ops")
      __QDISC_STATE_RUNNING was renamed to __QDISC___STATE_RUNNING,
      but the old name was not completely replaced with the new name
      in the comments.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9bf2b8c2
    • Ben Greear's avatar
      ipv6: Allow accepting RA from local IP addresses. · d9333196
      Ben Greear authored
      This can be used in virtual networking applications, and
      may have other uses as well.  The option is disabled by
      default.
      
      A specific use case is setting up virtual routers, bridges, and
      hosts on a single OS without the use of network namespaces or
      virtual machines.  With proper use of ip rules, routing tables,
      veth interface pairs and/or other virtual interfaces,
      and applications that can bind to interfaces and/or IP addresses,
      it is possible to create one or more virtual routers with multiple
      hosts attached.  The host interfaces can act as IPv6 systems,
      with radvd running on the ports in the virtual routers.  With the
      option provided in this patch enabled, those hosts can now properly
      obtain IPv6 addresses from the radvd.
      Signed-off-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d9333196
    • Ben Greear's avatar
      ipv6: Add more debugging around accept-ra logic. · f2a762d8
      Ben Greear authored
      This is disabled by default, just like similar debug info
      already in this module.  But it makes it easier to find out
      why an RA is not being accepted when debugging strange behaviour.
      Signed-off-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f2a762d8
  3. 30 Jun, 2014 1 commit
  4. 27 Jun, 2014 25 commits
    • David S. Miller's avatar
      Merge branch 'tcp_conn_request_unification' · 9e1a21b6
      David S. Miller authored
      Octavian Purdila says:
      
      ====================
      tcp: remove code duplication in tcp_v[46]_conn_request
      
      This patch series unifies the TCPv4 and TCPv6 connection request flow
      in a single new function (tcp_conn_request).
      
      The first 3 patches are small cleanups and fixes found during the code
      merge process.
      
      The next patches add new methods in tcp_request_sock_ops to abstract
      the IPv4/IPv6 operations and keep the TCP connection request flow
      common.
      
      To identify potential performance issues this patch series has been
      tested by measuring the connections per second rate with nginx and an
      httperf-like client (to allow for concurrent connection requests - 256 CC
      were used during testing) using the loopback interface. A dual-core i5 Ivy
      Bridge processor was used and each process was bound to a different
      core to make results consistent.
      
      Results for IPv4, unit is connections per second, higher is better, 20
      measurements have been collected:
      
      		before		after
      min		27917		27962
      max		28262		28366
      avg		28094.1		28212.75
      stdev		87.35		97.26
      
      Results for IPv6, unit is connections per second, higher is better, 20
      measurements have been collected:
      
      		before		after
      min		24813		24877
      max		25029		25119
      avg		24935.5		25017
      stdev		64.13		62.93
      
      Changes since v1:
      
       * add benchmarking datapoints
      
       * fix a few issues in the last patch (IPv6 related)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e1a21b6
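
To put the benchmark averages above in perspective, the relative change can be computed directly from the quoted numbers. This Python sketch uses only the "avg" rows of the two tables (the helper name `pct_change` is an illustrative choice):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


ipv4 = pct_change(28094.1, 28212.75)  # IPv4 average connections per second
ipv6 = pct_change(24935.5, 25017.0)   # IPv6 average connections per second

print(f"IPv4 avg: {ipv4:+.2f}%  IPv6 avg: {ipv6:+.2f}%")
# prints "IPv4 avg: +0.42%  IPv6 avg: +0.33%"
```

Both differences are well under one percent, which is consistent with the series introducing no measurable regression in connection rate.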
    • Octavian Purdila's avatar
      tcp: add tcp_conn_request · 1fb6f159
      Octavian Purdila authored
      Create tcp_conn_request and remove most of the code from
      tcp_v4_conn_request and tcp_v6_conn_request.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1fb6f159
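
The pattern used by this series, one shared request flow that reaches every address-family-specific step through an ops table, can be sketched in Python as follows. This is only an illustration of the dispatch idea. `RequestSockOps`, `conn_request`, the dictionary fields and all numeric values are made up for the example and do not reproduce the kernel's `tcp_request_sock_ops` layout or semantics.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class RequestSockOps:
    """Per-family hooks; a simplified stand-in for tcp_request_sock_ops."""
    family: str
    mss_clamp: int
    init_seq: Callable[[Dict], int]
    send_synack: Callable[[Dict], bool]


def conn_request(ops: RequestSockOps, syn: Dict) -> Dict:
    """Shared flow: every family-specific step is reached through ops."""
    req = {"family": ops.family, "mss": min(syn["mss"], ops.mss_clamp)}
    req["isn"] = ops.init_seq(syn)
    req["synack_sent"] = ops.send_synack(req)
    return req


# Made-up values and trivial callbacks, purely to exercise the dispatch.
ipv4_ops = RequestSockOps("ipv4", 1460, lambda syn: 1000, lambda req: True)
ipv6_ops = RequestSockOps("ipv6", 1440, lambda syn: 2000, lambda req: True)


def v4_conn_request(syn: Dict) -> Dict:
    # The per-family entry point shrinks to a one-line call.
    return conn_request(ipv4_ops, syn)


def v6_conn_request(syn: Dict) -> Dict:
    return conn_request(ipv6_ops, syn)
```

Once both entry points delegate to the same function, a fix made in the shared flow applies to IPv4 and IPv6 alike.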
    • Octavian Purdila's avatar
      tcp: add queue_add_hash to tcp_request_sock_ops · 695da14e
      Octavian Purdila authored
      Add queue_add_hash member to tcp_request_sock_ops so that we can later
      unify tcp_v4_conn_request and tcp_v6_conn_request.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      695da14e
    • Octavian Purdila's avatar
      tcp: add mss_clamp to tcp_request_sock_ops · 2aec4a29
      Octavian Purdila authored
      Add mss_clamp member to tcp_request_sock_ops so that we can later
      unify tcp_v4_conn_request and tcp_v6_conn_request.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2aec4a29
    • Octavian Purdila's avatar
      tcp: add send_synack method to tcp_request_sock_ops · d6274bd8
      Octavian Purdila authored
      Create a new tcp_request_sock_ops method to unify the IPv4/IPv6
      signature for tcp_v[46]_send_synack. This allows us to later unify
      tcp_v4_rtx_synack with tcp_v6_rtx_synack and tcp_v4_conn_request with
      tcp_v6_conn_request.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6274bd8
    • Octavian Purdila's avatar
      tcp: add init_seq method to tcp_request_sock_ops · 936b8bdb
      Octavian Purdila authored
      More work in preparation of unifying tcp_v4_conn_request and
      tcp_v6_conn_request: indirect the init sequence calls via the
      tcp_request_sock_ops.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      936b8bdb
    • Octavian Purdila's avatar
      tcp: move around a few calls in tcp_v6_conn_request · 94037159
      Octavian Purdila authored
      Make the flow of calls in tcp_v6_conn_request similar to that of
      tcp_v4_conn_request.
      
      Note that want_cookie can be true only if isn is zero and that is why
      we can move the if (want_cookie) block out of the if (!isn) block.
      
      Moving security_inet_conn_request() has a couple of side effects:
      it misses the inet_rsk(req)->ecn_ok update and the req->cookie_ts
      update. However, neither the SELinux nor the Smack security hooks seem
      to check them. This change should also avoid future different behaviour
      for IPv4 and IPv6 in the security hooks.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94037159
    • Octavian Purdila's avatar
      tcp: add route_req method to tcp_request_sock_ops · d94e0417
      Octavian Purdila authored
      Create wrappers with same signature for the IPv4/IPv6 request routing
      calls and use these wrappers (via route_req method from
      tcp_request_sock_ops) in tcp_v4_conn_request and tcp_v6_conn_request
      with the purpose of unifying the two functions in a later patch.
      
      We can later drop the wrapper functions and modify inet_csk_route_req
      and inet6_csk_route_req to use the same signature.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d94e0417
    • Octavian Purdila's avatar
      tcp: add init_cookie_seq method to tcp_request_sock_ops · fb7b37a7
      Octavian Purdila authored
      Move the specific IPv4/IPv6 cookie sequence initialization to a new
      method in tcp_request_sock_ops in preparation for unifying
      tcp_v4_conn_request and tcp_v6_conn_request.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fb7b37a7
    • Octavian Purdila's avatar
      tcp: add init_req method to tcp_request_sock_ops · 16bea70a
      Octavian Purdila authored
      Move the specific IPv4/IPv6 initializations to a new method in
      tcp_request_sock_ops in preparation for unifying tcp_v4_conn_request
      and tcp_v6_conn_request.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16bea70a
    • Octavian Purdila's avatar
      net: remove inet6_reqsk_alloc · 476eab82
      Octavian Purdila authored
      Since pktops is used for IPv6 only and opts is used for IPv4
      only, we can move these fields into a union. This allows us to drop
      the inet6_reqsk_alloc function, as after this change it becomes
      equivalent to inet_reqsk_alloc.
      
      This patch also fixes a kmemcheck issue in the IPv6 stack: the flags
      field was not annotated after a request_sock was allocated.
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      476eab82
    • Octavian Purdila's avatar
      tcp: tcp_v[46]_conn_request: fix snt_synack initialization · aa27fc50
      Octavian Purdila authored
      Commit 016818d0 (tcp: TCP Fast Open Server - take SYNACK RTT after
      completing 3WHS) changes the code to only take a snt_synack timestamp
      when a SYNACK transmit or retransmit succeeds. This behaviour is later
      broken by commit 843f4a55 (tcp: use tcp_v4_send_synack on first
      SYN-ACK), as snt_synack is now updated even if tcp_v4_send_synack
      fails.
      
      Also, commit 3a19ce0e (tcp: IPv6 support for fastopen server) misses
      the required IPv6 updates for 016818d0.
      
      This patch makes sure that snt_synack is updated only when the SYNACK
      transmit/retransmit succeeds, for both IPv4 and IPv6.
      
      Cc: Cardwell <ncardwell@google.com>
      Cc: Daniel Lee <longinus00@gmail.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa27fc50
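
The rule restored by the commit above, stamping snt_synack only when the SYNACK was actually sent, can be sketched in Python. The function `send_synack_and_stamp`, the dictionary-based request and the `send` callback are hypothetical stand-ins for illustration, not kernel code.

```python
from typing import Callable, Dict


def send_synack_and_stamp(req: Dict, send: Callable[[Dict], bool], now: int) -> bool:
    """Send a SYNACK and record the send time only if the send succeeded."""
    ok = send(req)
    if ok:
        # A failed transmit must not overwrite the previous timestamp,
        # otherwise a later RTT sample would be measured from a SYNACK
        # that never left the host.
        req["snt_synack"] = now
    return ok
```

A failed retransmit therefore leaves the timestamp of the last successful SYNACK in place.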
    • David S. Miller's avatar
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · c1c27fb9
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2014-06-26
      
      This series contains updates to i40e and i40evf.
      
      Kamil provides a cleanup patch to i40e where we do not need to acquire the
      NVM for shadow RAM checksum calculation, since we only read the shadow RAM
      through SRCTL register.
      
      Paul provides a fix for handling HMC for big endian architectures for i40e
      and i40evf.
      
      Mitch provides four cleanups and fixes for i40evf.  Fix an issue where if
      the VF driver fails to complete early init, then rmmod can cause a softlock
      when the driver tries to stop a watchdog timer that never got initialized.
      So add a check to see if the timer is actually initialized before stopping
      it.  Make the function i40evf_send_api_ver() return more useful information,
      instead of just returning -EIO, by propagating firmware errors back to the
      caller and logging a message if the PF sends an invalid reply.  Fix up a log
      message that was missing a word, which makes the log message more readable.
      Fix an initialization failure if many VFs are instantiated at the same time
      and the VF module is autoloaded by simply resending the firmware request if
      there is no response the first time.
      
      Jacob does a rename of the function i40e_ptp_enable() to
      i40e_ptp_feature_enable(), like he did for ixgbe, to reduce possible
      confusion and ambiguity in the purpose of the function.  Does follow on
      PTP work on i40e, like he did for ixgbe, by breaking the PTP hardware
      control from the ioctl command for timestamping mode.  By doing this,
      we can maintain state about the 1588 timestamping mode and properly
      re-enable the last known mode during a re-initialization of 1588 bits.
      
      Anjali cleans up the i40e driver where TCP-IPv4 filters were being added
      twice, which seems to be left over from when we had to add two PTYPEs for
      one filter.  Fixes the flow director sideband logic to detect when there
      is a full flow director table.  Also fixes the programming of FDIR, where
      a couple of fields in the descriptor setup were not being
      programmed, which left the opportunity for stale data to be pushed as
      part of the descriptor next time it was used.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1c27fb9
    • David S. Miller's avatar
      Merge branch 'tipc-next' · 0ff9275a
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: new unicast transmission code
      
      As a step towards making the data transmission code more maintainable
      and performant, we introduce a number of new functions for
      building, sending and rejecting messages. The new functions will
      eventually be used for all data transmission: user data unicast,
      service internal messaging, and multicast/broadcast.
      
      We start with this series, where we introduce the functions, and
      let user data unicast and the internal connection protocol use them.
      The remaining users will come in a later series.
      
      There are only minor changes to data structures, and no protocol
      changes, so the older functions can still be used in parallel for
      some time. Until the old functions are removed, we use temporary
      names for the new functions, such as tipc_build_msg2, tipc_link_xmit2.
      
      It should be noted that the first two commits are unrelated to the
      rest of the series.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ff9275a
    • Jon Paul Maloy's avatar
      tipc: simplify connection congestion handling · 60120526
      Jon Paul Maloy authored
      As a consequence of the recently introduced serialized access
      to the socket in commit 8d94168a761819d10252bab1f8de6d7b202c3baa
      ("tipc: same receive code path for connection protocol and data
      messages") we can make a number of simplifications in the
      detection and handling of connection congestion situations.
      
      - We don't need to keep two counters, one for sent messages and one
        for acked messages. There is no longer any risk for races between
        acknowledge messages arriving in BH and data message sending
        running in user context. So we merge this into one counter,
        'sent_unacked', which is incremented at sending and subtracted
        from at acknowledge reception.
      
      - We don't need to set the 'congested' field in tipc_port to
        true before we send the message, and clear it when sending
        is successful. (As a matter of fact, it was never necessary;
        the field was set in link_schedule_port() before any wakeup
        could arrive anyway.)
      
      - We keep the conditions for link congestion and connection
        congestion separated. There would otherwise be a risk that an arriving
        acknowledge message may wake up a user sleeping because of link
        congestion.
      
      - We can simplify reception of acknowledge messages.
      
      We also make some cosmetic/structural changes:
      
      - We rename the 'congested' field to the more correct 'link_cong'.
      
      - We rename 'conn_unacked' to 'rcv_unacked'
      
      - We move the above mentioned fields from struct tipc_port to
        struct tipc_sock.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60120526
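
The single-counter scheme described in the commit above can be sketched in Python. The class `Connection`, its method names and the `window` limit are assumptions for illustration. The commit only states that one counter, 'sent_unacked', is incremented at sending and subtracted from at acknowledge reception, which is safe because access to the socket is now serialized.

```python
class Connection:
    """Toy model of connection-level congestion with one counter."""

    def __init__(self, window: int) -> None:
        self.window = window   # assumed limit on unacknowledged messages
        self.sent_unacked = 0  # replaces separate 'sent' and 'acked' counters

    def congested(self) -> bool:
        return self.sent_unacked >= self.window

    def on_send(self) -> bool:
        """Account for one sent message; refuse if the connection is congested."""
        if self.congested():
            return False
        self.sent_unacked += 1
        return True

    def on_ack(self, acked: int) -> None:
        """The peer acknowledged 'acked' messages."""
        self.sent_unacked -= acked
```

Because senders and acknowledge handling no longer race, a plain integer is enough and no second counter is needed to reconcile the two sides.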
    • Jon Paul Maloy's avatar
      tipc: clean up connection protocol reception function · ac0074ee
      Jon Paul Maloy authored
      We simplify the code for receiving connection probes, leveraging the
      recently introduced tipc_msg_reverse() function. We also stick to
      the principle of sending a possible response message directly from
      the calling (tipc_sk_rcv or backlog_rcv) functions, hence making
      the call chain shallower and easier to follow.
      
      We make one small protocol change here, allowed according to
      the spec. If a protocol message arrives from a remote socket that
      is not the one we are connected to, we are currently generating a
      connection abort message and sending it to the source. This behavior
      is unnecessary, and might even be a security risk, so instead we
      now choose to simply ignore the message. The consequence for the sender
      is that it will need a longer time to discover its mistake (until the
      next timeout), but this is an extreme corner case, and may happen
      anyway under other circumstances, so we deem this change acceptable.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac0074ee
    • Jon Paul Maloy's avatar
      tipc: same receive code path for connection protocol and data messages · ec8a2e56
      Jon Paul Maloy authored
      As a preparation to eliminate port_lock we need to bring reception
      of connection protocol messages under proper protection of bh_lock_sock
      or socket owner.
      
      We fix this by letting those messages follow the same code path as
      incoming data messages.
      
      As a side effect of this change, the last reference to the function
      net_route_msg() disappears, and we can eliminate that function.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec8a2e56
    • Jon Paul Maloy's avatar
      tipc: let port protocol senders use new link send function · b786e2b0
      Jon Paul Maloy authored
      Several functions in port.c, related to the port protocol and
      connection shutdown, need to send messages. We now convert them
      to use the new link send function.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b786e2b0
    • Jon Paul Maloy's avatar
      tipc: connection oriented transport uses new send functions · 4ccfe5e0
      Jon Paul Maloy authored
      We move the message sending across established connections
      to use the message preparation and send functions introduced
      earlier in this series. We now do the message preparation
      and call to the link send function directly from the socket,
      instead of going via the port layer.
      
      As a consequence of this change, the functions tipc_send(),
      tipc_port_iovec_rcv(), tipc_port_iovec_reject() and tipc_reject_msg()
      become unreferenced and can be eliminated from port.c. For the same
      reason, the functions tipc_link_xmit_fast(), tipc_link_iovec_xmit_long()
      and tipc_link_iovec_fast() can be eliminated from link.c.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ccfe5e0
    • Jon Paul Maloy's avatar
      tipc: RDM/DGRAM transport uses new fragmenting and sending functions · e2dafe87
      Jon Paul Maloy authored
      We merge the code for sending port name and port identity addressed
      messages into the corresponding send functions in socket.c, and start
      using the new fragmenting and transmit functions we have just introduced.
      
      This saves a call level and quite a few code lines, as well as making
      this part of the code easier to follow. As a consequence, the functions
      tipc_send2name() and tipc_send2port() in port.c can be removed.
      
      For practical reasons, we break out the code for sending multicast messages
      from tipc_sendmsg() and move it into a separate function, tipc_sendmcast(),
      but we do not yet convert it into using the new build/send functions.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2dafe87
    • Jon Paul Maloy's avatar
      tipc: introduce message evaluation function · 5a379074
      Jon Paul Maloy authored
      When a message arrives in a node and finds no destination
      socket, we may need to drop it, reject it, or forward it after
      a secondary destination lookup. The latter two cases currently
      result in a code path that is perceived as complex, because it
      follows a deep call chain via obscure functions such as
      net_route_named_msg() and net_route_msg().
      
      We now introduce a function, tipc_msg_eval(), that takes the
      decision about whether such a message should be rejected or
      forwarded, but leaves it to the caller to actually perform
      the indicated action.
      
      If the decision is 'reject', it is still the task of the recently
      introduced function tipc_msg_reverse() to take the final decision
      about whether the message is rejectable or not. In the latter case
      it drops the message.
      
      As a result of this change, we can finally eliminate the function
      net_route_named_msg(), and hence become independent of net_route_msg().
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a379074
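
The decide-then-act split described in the commit above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: `struct msg`, `lookup_name()`, `msg_eval()`, `msg_reverse()` and `handle_undeliverable()` are hypothetical stand-ins, and the toy name table (name 42 lives on node 7) is invented for the example. The evaluation function only returns a verdict; the caller acts on it, and the reverse step has the final say on whether a rejection is possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Verdict returned by the evaluation step; the caller performs the action. */
enum action { ACT_REJECT, ACT_FORWARD };

/* What finally happened to an undeliverable message. */
enum outcome { OUT_DROPPED, OUT_REJECTED, OUT_FORWARDED };

/* Hypothetical, heavily simplified message descriptor. */
struct msg {
    bool name_addressed;  /* destination given as a name, not a port id */
    bool is_error;        /* message is itself a returned (rejected) message */
    unsigned reroutes;    /* number of secondary lookups already done */
    uint32_t name;
    uint32_t dest_node;
};

/* Toy name table (assumption for the example): only name 42 is known. */
static bool lookup_name(uint32_t name, uint32_t *node)
{
    if (name == 42) {
        *node = 7;
        return true;
    }
    return false;
}

/* Decide whether to forward or reject; never transmits anything itself. */
static enum action msg_eval(struct msg *m)
{
    uint32_t node;

    if (m->is_error || !m->name_addressed || m->reroutes > 0)
        return ACT_REJECT;
    if (!lookup_name(m->name, &node))
        return ACT_REJECT;
    m->reroutes++;
    m->dest_node = node;
    return ACT_FORWARD;
}

/* Turn the message around; returns false if it cannot be rejected. */
static bool msg_reverse(struct msg *m, uint32_t orig_node)
{
    if (m->is_error)
        return false;     /* never bounce an error message again */
    m->is_error = true;
    m->dest_node = orig_node;
    return true;
}

/* Caller side: act on the verdict (transmission itself is omitted). */
static enum outcome handle_undeliverable(struct msg *m, uint32_t orig_node)
{
    if (msg_eval(m) == ACT_FORWARD)
        return OUT_FORWARDED;
    return msg_reverse(m, orig_node) ? OUT_REJECTED : OUT_DROPPED;
}
```

In this sketch a name-addressed message with a known name is forwarded once, an unknown name is turned around towards its origin, and a message that is already an error return is dropped. Keeping the decision free of side effects on the network lets the caller choose where and under which locks the actual send happens.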
    • Jon Paul Maloy's avatar
      tipc: separate building and sending of rejected messages · 8db1bae3
      Jon Paul Maloy authored
      The way we build and send rejected messages is currently perceived as
      hard to follow, partly because we let the transmission go via deep
      call chains through functions such as tipc_reject_msg() and
      net_route_msg().
      
      We want to remove those functions, and make the call sequences shallower
      and simpler. For this purpose, we separate building and sending of
      rejected messages. We build the reject message using the new function
      tipc_msg_reverse(), and let the transmission go via the newly introduced
      tipc_link_xmit2() function, as all transmission eventually will do. We
      also ensure that all calls to tipc_link_xmit2() are made outside
      port_lock/bh_lock_sock.
      
      Finally, we replace all calls to tipc_reject_msg() with the two new
      calls at all locations in the code that we want to keep. The remaining
      calls are made from code that we are planning to remove, along with
      tipc_reject_msg() itself.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8db1bae3
    • Jon Paul Maloy's avatar
      tipc: introduce direct iovec to buffer chain fragmentation function · 067608e9
      Jon Paul Maloy authored
      Fragmentation at message sending is currently performed in two
      places in link.c, depending on whether data to be transmitted
      is delivered in the form of an iovec or as a big sk_buff. Those
      functions are also tightly entangled with the send functions
      that are using them.
      
      We now introduce a re-entrant, standalone function, tipc_msg_build2(),
      that builds a packet chain directly from an iovec. Each fragment is
      sized according to the MTU value given by the caller, and is prepended
      with a correctly built fragment header, when needed. The function is
      independent of who is calling and where the chain will be delivered,
      as long as the caller is able to indicate a correct MTU.
      
      The function is tested, but not called by anybody yet. Since it is
      incompatible with the existing tipc_msg_build(), and we cannot yet
      remove that function, we have given it a temporary name.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      067608e9
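
The idea of building a fragment chain directly from an iovec, sized by a caller-supplied MTU, can be illustrated with a small user-space sketch. It is not tipc_msg_build2() and makes several assumptions: `struct frag`, `chain_build()`, `chain_free()` and the 4-byte `FRAG_HDR_SIZE` are hypothetical, and the "header" is just a zeroed prefix carrying a toy fragment number. As in the description above, a header is prepended only when the data does not fit in a single MTU-sized packet.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#define FRAG_HDR_SIZE 4 /* hypothetical; not the real TIPC header size */

/* One packet in the chain: optional header followed by payload. */
struct frag {
    struct frag *next;
    size_t len;            /* header (if any) + payload bytes */
    unsigned char data[];
};

static void chain_free(struct frag *head)
{
    while (head) {
        struct frag *next = head->next;
        free(head);
        head = next;
    }
}

/*
 * Build a chain of packets of at most 'mtu' bytes each from an iovec.
 * Returns the head of the chain and stores the packet count in *count.
 * Returns NULL for empty input, an unusable MTU, or allocation failure.
 */
static struct frag *chain_build(const struct iovec *iov, int iovcnt,
                                size_t mtu, size_t *count)
{
    struct frag *head = NULL, **tail = &head;
    size_t total = 0, hdr, room, off = 0;
    int i = 0;

    *count = 0;
    for (int k = 0; k < iovcnt; k++)
        total += iov[k].iov_len;

    /* A fragment header is only needed when one packet is not enough. */
    hdr = total > mtu ? FRAG_HDR_SIZE : 0;
    if (mtu <= hdr)
        return NULL;
    room = mtu - hdr;

    while (total > 0) {
        size_t payload = total < room ? total : room;
        size_t copied = 0;
        struct frag *f = malloc(sizeof(*f) + hdr + payload);

        if (!f) {
            chain_free(head);
            *count = 0;
            return NULL;
        }
        f->next = NULL;
        f->len = hdr + payload;
        if (hdr) {
            memset(f->data, 0, hdr);
            f->data[0] = (unsigned char)(*count + 1); /* toy fragment number */
        }
        /* Gather payload bytes, crossing iovec entries as needed. */
        while (copied < payload) {
            size_t avail = iov[i].iov_len - off;
            size_t chunk = avail < payload - copied ? avail : payload - copied;

            memcpy(f->data + hdr + copied,
                   (const unsigned char *)iov[i].iov_base + off, chunk);
            copied += chunk;
            off += chunk;
            if (off == iov[i].iov_len) {
                i++;
                off = 0;
            }
        }
        *tail = f;
        tail = &f->next;
        total -= payload;
        (*count)++;
    }
    return head;
}
```

The function keeps no state between calls and depends only on its arguments, so the same builder can serve any sender that is able to supply a correct MTU. The caller owns the returned chain and releases it with `chain_free()`.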