  1. 29 Apr, 2015 1 commit
  2. 09 Dec, 2014 1 commit
  3. 24 Nov, 2014 1 commit
  4. 21 Nov, 2014 1 commit
    • tcp: Restore RFC5961-compliant behavior for SYN packets · 0c228e83
      Calvin Owens authored
      Commit c3ae62af ("tcp: should drop incoming frames without ACK
      flag set") was created to mitigate a security vulnerability in which a
      local attacker is able to inject data into locally-opened sockets by
      using TCP protocol statistics in procfs to quickly find the correct
      sequence number.
      
      This broke the RFC5961 requirement to send a challenge ACK in response
      to spurious RST packets, which was subsequently fixed by commit
      7b514a88 ("tcp: accept RST without ACK flag").
      
      Unfortunately, the RFC5961 requirement that spurious SYN packets be
      handled in a similar manner remains broken.
      
      RFC5961 section 4 states that:
      
         ... the handling of the SYN in the synchronized state SHOULD be
         performed as follows:
      
         1) If the SYN bit is set, irrespective of the sequence number, TCP
            MUST send an ACK (also referred to as challenge ACK) to the remote
            peer:
      
            <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
      
            After sending the acknowledgment, TCP MUST drop the unacceptable
            segment and stop processing further.
      
         By sending an ACK, the remote peer is challenged to confirm the loss
         of the previous connection and the request to start a new connection.
         A legitimate peer, after restart, would not have a TCB in the
         synchronized state.  Thus, when the ACK arrives, the peer should send
         a RST segment back with the sequence number derived from the ACK
         field that caused the RST.
      
         This RST will confirm that the remote peer has indeed closed the
         previous connection.  Upon receipt of a valid RST, the local TCP
         endpoint MUST terminate its connection.  The local TCP endpoint
         should then rely on SYN retransmission from the remote end to
         re-establish the connection.
      
      This patch lets SYN packets through the discard added in c3ae62af,
      so that spurious SYN packets are properly dealt with as per the RFC.
      
      The challenge ACK is sent unconditionally and is rate-limited, so the
      original vulnerability is not reintroduced by this patch.
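      
      The resulting check can be sketched as follows (placement in
      tcp_validate_incoming() is an assumption based on this changelog):
      
        /* The discard added in c3ae62af previously read
         *     if (!th->ack && !th->rst)
         *             goto discard;
         * and now also lets SYNs through, so a spurious SYN reaches the
         * RFC5961 challenge-ACK handling instead of being dropped. */
        if (!th->ack && !th->rst && !th->syn)
                goto discard;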
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c228e83
  5. 11 Nov, 2014 1 commit
    • net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Joe Perches authored
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
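      
      An illustrative before/after of the conversion (the message text here
      is a made-up example, not a specific call site):
      
        /* Before: ratelimited via the old LIMIT_NETDEBUG machinery */
        LIMIT_NETDEBUG(KERN_DEBUG "TCP: bad checksum, seq %u\n", seq);
      
        /* After: dynamic_debug capable, still ratelimited */
        net_dbg_ratelimited("TCP: bad checksum, seq %u\n", seq);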
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO and are now not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba7a46f1
  6. 05 Nov, 2014 1 commit
    • tcp: zero retrans_stamp if all retrans were acked · 1f37bf87
      Marcelo Leitner authored
      Ueki Kohei reported that when we are using NewReno with connections that
      have very low traffic, we may time out the connection too early if a
      second loss occurs after the first one was successfully acked but no
      data was transferred later. Below is his description of it:
      
      When SACK is disabled, and a socket suffers multiple separate TCP
      retransmissions, that socket's ETIMEDOUT value is calculated from the
      time of the *first* retransmission instead of the *latest*
      retransmission.
      
      This happens because the tcp_sock's retrans_stamp is set once then never
      cleared.
      
      Take the following connection:
      
                            Linux                    remote-machine
                              |                           |
               send#1---->(*1)|--------> data#1 --------->|
                        |     |                           |
                       RTO    :                           :
                        |     |                           |
                       ---(*2)|----> data#1(retrans) ---->|
                        | (*3)|<---------- ACK <----------|
                        |     |                           |
                        |     :                           :
                        |     :                           :
                        |     :                           :
                      16 minutes (or more)                :
                        |     :                           :
                        |     :                           :
                        |     :                           :
                        |     |                           |
               send#2---->(*4)|--------> data#2 --------->|
                        |     |                           |
                       RTO    :                           :
                        |     |                           |
                       ---(*5)|----> data#2(retrans) ---->|
                        |     |                           |
                        |     |                           |
                      RTO*2   :                           :
                        |     |                           |
                        |     |                           |
            ETIMEDOUT<----(*6)|                           |
      
      (*1) One data packet sent.
      (*2) Because no ACK packet is received, the packet is retransmitted.
      (*3) The ACK packet is received. The transmitted packet is acknowledged.
      
      At this point the first "retransmission event" has passed and been
      recovered from. Any future retransmission is a completely new "event".
      
      (*4) After 16 minutes (to correspond with retries2=15), a new data
      packet is sent. Note: No data is transmitted between (*3) and (*4).
      
      The socket's timeout SHOULD be calculated from this point in time, but
      instead it's calculated from the prior "event" 16 minutes ago.
      
      (*5) Because no ACK packet is received, the packet is retransmitted.
      (*6) At the time of the 2nd retransmission, the socket returns
      ETIMEDOUT.
      
      Therefore, now we clear retrans_stamp as soon as all data during the
      loss window is fully acked.
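      
      A minimal sketch of the clearing (placement in the undo path and use
      of the tcp_any_retrans_done() helper are assumptions):
      
        /* All retransmitted data in the loss window has been acked: the
         * "event" is over, so give the next one a fresh timeout base
         * instead of the stamp from the first retransmission. */
        if (!tcp_any_retrans_done(sk))
                tp->retrans_stamp = 0;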
      
      Reported-by: Ueki Kohei
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f37bf87
  7. 04 Nov, 2014 1 commit
    • net: allow setting ecn via routing table · f7b3bec6
      Florian Westphal authored
      This patch allows ECN to be set on a per-route basis in case the sysctl
      tcp_ecn is not set to 1. In other words, when ECN is set for specific
      routes, it provides tcp_ecn=1 behaviour for those routes while the rest
      of the stack acts according to the global settings.
      
      One can use 'ip route change dev $dev $net features ecn' to toggle this.
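      
      Conceptually, ECN negotiation now also consults the route; a hedged
      sketch (the exact call site is an assumption):
      
        /* Negotiate ECN if the global sysctl asks for it, or if this
         * route carries the feature set via "ip route ... features ecn". */
        bool ecn_ok = sysctl_tcp_ecn == 1 ||
                      dst_feature(dst, RTAX_FEATURE_ECN);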
      
      Having a more fine-grained per-route setting can be beneficial in
      various scenarios, for example 1) within data centers, or 2) for local
      ISPs that deploy ECN support for their own video/streaming services [1].
      
      There was a recent measurement study/paper [2] which scanned the Alexa's
      publicly available top million websites list from a vantage point in US,
      Europe and Asia:
      
      Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
      due to commit 255cac91 ("tcp: extend ECN sysctl to allow server-side
      only ECN") ;)); the on-path break in connectivity was found in about
      1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
      more common in the negotiation phase (and mostly seen in the Alexa
      middle band, ranks around 50k-150k): of the roughly 12,000 hosts on
      which there _may_ be ECN-linked connection failures, only 79 failed
      with an RST while _not_ failing with an RST when ECN is not requested.
      
      It's unclear though, how much equipment in the wild actually marks CE
      when buffers start to fill up.
      
      We thought about a fallback to non-ECN for retransmitted SYNs as another
      global option (which could perhaps one day be made default), but as Eric
      points out, there's much more work needed to detect broken middleboxes.
      
      Two examples Eric mentioned are buggy firewalls that accept only a single
      SYN per flow, and middleboxes that successfully let an ECN flow establish,
      but later mark CE for all packets (so cwnd converges to 1).
      
       [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
       [2] http://ecn.ethz.ch/
      
      Joint work with Daniel Borkmann.
      
      Reference: http://thread.gmane.org/gmane.linux.network/335797
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7b3bec6
  8. 30 Oct, 2014 2 commits
  9. 29 Oct, 2014 1 commit
    • tcp: allow for bigger reordering level · dca145ff
      Eric Dumazet authored
      While testing an upcoming patch from Yaogong (converting the out-of-order
      queue into an RB tree), I hit the max reordering level of the Linux TCP
      stack.
      
      The reordering level was limited to 127 for no good reason, and some
      network setups [1] can easily reach this limit and get limited
      throughput.
      
      Allow a new max limit of 300, and add a sysctl so that admins can allow
      bigger (or lower) values if needed.
      
      [1] Aggregation of links, per packet load balancing, fabrics not doing
       deep packet inspections, alternative TCP congestion modules...
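      
      A sketch of the resulting clamp (the sysctl variable name is an
      assumption following the usual naming convention):
      
        /* Previously capped at a hard-coded 127; now clamped against a
         * sysctl (default 300) so admins can raise or lower the ceiling. */
        tp->reordering = min_t(u32, sysctl_tcp_max_reordering, metric);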
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dca145ff
  10. 14 Oct, 2014 1 commit
    • tcp: fix tcp_ack() performance problem · ad971f61
      Eric Dumazet authored
      We worked hard to improve tcp_ack() performance by not accessing
      skb_shinfo() in the fast path (cd7d8498, "tcp: change tcp_skb_pcount()
      location").
      
      We still have one spurious access because of ACK timestamping,
      added in commit e1c8a607 ("net-timestamp: ACK timestamp for
      bytestreams")
      
      By checking if sk_tsflags has SOF_TIMESTAMPING_TX_ACK set,
      we can avoid two cache line misses for the common case.
      
      While we are at it, add two prefetchw() calls:
      
      One in tcp_ack() to bring skb at the head of write queue.
      
      One in tcp_clean_rtx_queue() loop to bring following skb,
      as we will delete skb from the write queue and dirty skb->next->prev.
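      
      A hedged sketch of both ideas (helper and field names assumed from
      this changelog):
      
        /* sk->sk_tsflags sits on an already-hot cache line, so testing it
         * avoids touching skb_shinfo() at all in the common case. */
        if (likely(!(sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)))
                return;
      
        /* In tcp_ack(): warm the head of the write queue for writing. */
        prefetchw(sk->sk_write_queue.next);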
      
      Add a couple of [un]likely() clauses.
      
      After this patch, tcp_ack() is no longer the most consuming
      function in tcp stack.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad971f61
  11. 29 Sep, 2014 5 commits
  12. 28 Sep, 2014 4 commits
    • tcp: use tcp_flags in tcp_data_queue() · 155c6e1a
      Peter Pan(潘卫平) authored
      This patch is a cleanup which follows the idea of commit e11ecddf ("tcp:
      use TCP_SKB_CB(skb)->tcp_flags in input path"), and it may reduce
      register pressure since skb->cb[] access is fast, because skb is
      probably in a register.
      
      v2: remove variable th
      v3: reword the changelog
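      
      An illustrative before/after (not the exact diff):
      
        /* Before: dereferences the TCP header */
        if (th->fin)
                tcp_fin(sk);
      
        /* After: reads the flags cached in skb->cb by the input path */
        if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
                tcp_fin(sk);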
      Signed-off-by: Weiping Pan <panweiping3@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      155c6e1a
    • tcp: change tcp_skb_pcount() location · cd7d8498
      Eric Dumazet authored
      Our goal is to make no more than one cache line access per skb in
      a write or receive queue when doing the various walks.
      
      After recent TCP_SKB_CB() reorganizations, it is almost done.
      
      The last part is tcp_skb_pcount(), which currently uses
      skb_shinfo(skb)->gso_segs, a terrible choice because it needs
      3 cache lines in the current kernel (skb->head, skb->end, and
      shinfo->gso_segs are in 3 different cache lines, far from skb->cb).
      
      This very simple patch reuses the space currently taken by tcp_tw_isn,
      which is used only in the input path, as tcp_skb_pcount is only needed
      for skbs stored in the write queue.
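      
      A sketch of the layout change in struct tcp_skb_cb (field names are
      assumptions based on this changelog):
      
        union {
                __u32   tcp_tw_isn;     /* input path only */
                __u32   tcp_gso_segs;   /* write queue only */
        };
      
        static inline int tcp_skb_pcount(const struct sk_buff *skb)
        {
                return TCP_SKB_CB(skb)->tcp_gso_segs; /* skb->cb: 1 line */
        }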
      
      This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
      to get SKBTX_ACK_TSTAMP, which seems possible.
      
      This also speeds up all sack processing in general.
      
      This speeds up tcp_sendmsg() because it no longer has to access/dirty
      shinfo.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd7d8498
    • net_dma: revert 'copied_early' · d27f9bc1
      Dan Williams authored
      Now that tcp_dma_try_early_copy() is gone nothing ever sets
      copied_early.
      
      Also reverts "53240c20 tcp: Fix possible double-ack w/ user dma"
      since it is no longer necessary.
      
      Cc: Ali Saidi <saidi@engin.umich.edu>
      Cc: James Morris <jmorris@namei.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      d27f9bc1
    • net_dma: simple removal · 7bced397
      Dan Williams authored
      Per commit "77873803 net_dma: mark broken" net_dma is no longer used
      and there is no plan to fix it.
      
      This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
      Reverting the remainder of the net_dma induced changes is deferred to
      subsequent patches.
      
      Marked for stable due to Roman's report of a memory leak in
      dma_pin_iovec_pages():
      
          https://lkml.org/lkml/2014/9/3/177
      
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: David Whipple <whipple@securedatainnovations.ch>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: <stable@vger.kernel.org>
      Reported-by: Roman Gushchin <klamm@yandex-team.ru>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      7bced397
  13. 23 Sep, 2014 1 commit
  14. 22 Sep, 2014 1 commit
  15. 19 Sep, 2014 1 commit
  16. 15 Sep, 2014 3 commits
  17. 06 Sep, 2014 2 commits
  18. 23 Aug, 2014 1 commit
    • tcp: improve undo on timeout · 989e04c5
      Yuchung Cheng authored
      Upon timeout, undo (via both timestamps/Eifel and DSACKs) was
      disabled if any retransmits were still in flight.  The concern was
      perhaps that spurious retransmission sent in a previous recovery
      episode may trigger DSACKs to falsely undo the current recovery.
      
      However, this inadvertently misses undo opportunities (using either
      TCP timestamps or DSACKs) when timeout occurs during a loss episode,
      i.e.  recurring timeouts or timeout during fast recovery. In these
      cases some retransmissions will be in flight but we should allow
      undo. Furthermore, we should only reset undo_marker and undo_retrans
      upon timeout if we are starting a new recovery episode. Finally,
      when we do reset our undo state, we now do so in a manner similar
      to tcp_enter_recovery(), so that we require a DSACK for each of
      the outstanding retransmissions. This will achieve the original
      goal by requiring that we receive the same number of DSACKs as
      retransmissions.
      
      This patch increases the undo events by 50% on Google servers.
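      
      A hedged sketch of the reset described above (the helper name is an
      assumption):
      
        /* Arm undo state the way recovery does: remember snd_una and
         * require one DSACK per retransmission still in flight. */
        static void tcp_init_undo(struct tcp_sock *tp)
        {
                tp->undo_marker = tp->snd_una;
                /* Retransmissions in flight may yield DSACKs later. */
                tp->undo_retrans = tp->retrans_out ? : -1;
        }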
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      989e04c5
  19. 14 Aug, 2014 3 commits
    • tcp: fix ssthresh and undo for consecutive short FRTO episodes · 0c9ab092
      Neal Cardwell authored
      Fix TCP FRTO logic so that it always notices when snd_una advances,
      indicating that any RTO after that point will be a new and distinct
      loss episode.
      
      Previously there was a very specific sequence that could cause FRTO to
      fail to notice a new loss episode had started:
      
      (1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
      (2) receiver ACKs packet 1
      (3) FRTO sends 2 more packets
      (4) RTO timer fires again (should start a new loss episode)
      
      The problem was in step (3) above, where tcp_process_loss() returned
      early (in the spot marked "Step 2.b"), so that it never got to the
      logic to clear icsk_retransmits. Thus icsk_retransmits stayed
      non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
      icsk_retransmits, decide that this RTO is not a new episode, and
      decide not to cut ssthresh and remember the current cwnd and ssthresh
      for undo.
      
      There were two main consequences to the bug that we have
      observed. First, ssthresh was not decreased in step (4). Second, when
      there was a series of such FRTO (1-4) sequences that happened to be
      followed by an FRTO undo, we would restore the cwnd and ssthresh from
      before the entire series started (instead of the cwnd and ssthresh
      from before the most recent RTO). This could result in cwnd and
      ssthresh being restored to values much bigger than the proper values.
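      
      A minimal sketch of the fix (placement in tcp_process_loss() and the
      exact condition are assumptions based on this changelog):
      
        /* Once snd_una has advanced, the previous loss episode is over:
         * clear icsk_retransmits even on the early-return path, so the
         * next RTO is treated as a new and distinct episode. */
        if (flag & FLAG_SND_UNA_ADVANCED)
                inet_csk(sk)->icsk_retransmits = 0;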
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c9ab092
    • tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic · a26552af
      Hannes Frederic Sowa authored
      tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
      ordering of incoming connections and teardowns without the need to
      hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
      the last measured RTO. To do so, we keep the last seen timestamp in a
      per-host indexed data structure and verify if the incoming timestamp
      in a connection request is strictly greater than the saved one during
      last connection teardown. Thus we can verify later on that no old data
      packets will be accepted by the new connection.
      
      When moving a socket to the time-wait state we already verify whether
      timestamps were seen on the connection. Only if that was the case do
      we let the time-wait socket expire after the RTO; otherwise the normal
      TCP_TIMEWAIT_LEN is used. But we don't verify this on incoming SYN packets. If a
      connection teardown was less than TCP_PAWS_MSL seconds in the past we
      cannot guarantee to not accept data packets from an old connection if
      no timestamps are present. We should drop this SYN packet. This patch
      closes this loophole.
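      
      A sketch of the added check in the SYN path (the tcp_peer_is_proven()
      signature shown is an assumption):
      
        /* With tcp_tw_recycle on, a SYN without timestamps gives us no way
         * to prove that old duplicate data cannot be accepted: drop it. */
        if (tcp_death_row.sysctl_tw_recycle &&
            !tcp_peer_is_proven(req, dst, true, tmp_opt.saw_tstamp))
                goto drop_and_release;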
      
      Please note, this patch does not make tcp_tw_recycle in any way more
      usable but only adds another safety check:
      Sporadic drops of SYN packets because of reordering in the network or
      in the socket backlog queues can happen. Users behind NAT trying to
      connect to a tcp_tw_recycle enabled server can get caught in blackholes
      and their connection requests may regularly get dropped, because hosts
      behind an address translator don't have synchronized tcp timestamp clocks.
      tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.
      
      In general, use of tcp_tw_recycle is discouraged.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a26552af
    • net-timestamp: fix missing ACK timestamp · 712a7221
      Willem de Bruijn authored
      ACK timestamps are generated in tcp_clean_rtx_queue. The TSO datapath
      can break out early, causing the timestamp code to be skipped. Move
      the code up before the break.
      Reported-by: David S. Miller <davem@davemloft.net>
      
      Also fix a boundary condition: tp->snd_una is the next unacknowledged
      byte, and between() tests inclusively (a <= b <= c), so generate an ACK
      timestamp if (prior_snd_una <= tskey <= tp->snd_una - 1).
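      
      The inclusive test maps onto the kernel's between() helper; a hedged
      sketch:
      
        /* between(seq, low, high) is inclusive on both ends; tp->snd_una
         * is the first byte not yet acked, hence the "- 1". */
        if (between(shinfo->tskey, prior_snd_una, tp->snd_una - 1))
                __skb_tstamp_tx(skb, NULL, sk, SCM_TSTAMP_ACK);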
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      712a7221
  20. 05 Aug, 2014 2 commits
    • net-timestamp: ACK timestamp for bytestreams · e1c8a607
      Willem de Bruijn authored
      Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
      in the send() call is acknowledged. It implements the feature for TCP.
      
      The timestamp is generated when the TCP socket cumulative ACK is moved
      beyond the tracked seqno for the first time. The feature ignores SACK
      and FACK, because those acknowledge the specific byte, but not
      necessarily the entire contents of the buffer up to that byte.
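      
      From userspace the new flag is requested like any other SO_TIMESTAMPING
      bit; a minimal usage sketch:
      
        #include <linux/net_tstamp.h>
        #include <sys/socket.h>
      
        static int enable_ack_timestamps(int fd)
        {
                /* Timestamp when the last byte of each send() is
                 * cumulatively ACKed; completions are read from the
                 * socket error queue. */
                int val = SOF_TIMESTAMPING_TX_ACK | SOF_TIMESTAMPING_SOFTWARE;
      
                return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                                  &val, sizeof(val));
        }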
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1c8a607
    • tcp: reduce spurious retransmits due to transient SACK reneging · 5ae344c9
      Neal Cardwell authored
      This commit reduces spurious retransmits due to apparent SACK reneging
      by only reacting to SACK reneging that persists for a short delay.
      
      When a sequence space hole at snd_una is filled, some TCP receivers
      send a series of ACKs as they apparently scan their out-of-order queue
      and cumulatively ACK all the packets that have now been consecutively
      received. This is essentially misbehavior B in "Misbehaviors in TCP
      SACK generation" ACM SIGCOMM Computer Communication Review, April
      2011, so we suspect that this is from several common OSes (Windows
      2000, Windows Server 2003, Windows XP). However, this issue has also
      been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
      into spurious retransmissions by lack of timestamps?" from March 2014,
      where the receiver was thought to be a BSD box.
      
      Since snd_una would temporarily be adjacent to a previously SACKed
      range in these scenarios, this receiver behavior triggered the Linux
      SACK reneging code path in the sender. This led the sender to clear
      the SACK scoreboard, enter CA_Loss, and spuriously retransmit
      (potentially) every packet from the entire write queue at line rate
      just a few milliseconds before the ACK for each packet arrives at the
      sender.
      
      To avoid such situations, now when a sender sees apparent reneging it
      does not yet retransmit, but rather adjusts the RTO timer to give the
      receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
      that will restore sanity to the SACK scoreboard. If the reneging
      persists until this RTO then, as before, we clear the SACK scoreboard
      and enter CA_Loss.
      
      A 10ms delay tolerates a receiver sending such a stream of ACKs at
      56Kbit/sec. And to allow for receivers with slower or more congested
      paths, we wait for at least RTT/2.
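      
      A sketch of the delay computation (assuming tp->srtt_us stores the
      smoothed RTT in microseconds left-shifted by 3, so >> 4 yields RTT/2):
      
        /* Wait max(RTT/2, 10ms) before treating the reneging as real. */
        unsigned long delay = max(usecs_to_jiffies(tp->srtt_us >> 4),
                                  msecs_to_jiffies(10));
      
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);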
      
      We validated the resulting max(RTT/2, 10ms) delay formula with a mix
      of North American and South American Google web server traffic, and
      found that for ACKs displaying transient reneging:
      
       (1) 90% of inter-ACK delays were less than 10ms
       (2) 99% of inter-ACK delays were less than RTT/2
      
      In tests on Google web servers this commit reduced reneging events by
      75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
      any measurable impact on latency for user HTTP and SPDY requests.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ae344c9
  21. 15 Jul, 2014 1 commit
  22. 08 Jul, 2014 1 commit
    • tcp: fix false undo corner cases · 6e08d5e3
      Yuchung Cheng authored
      The undo code assumes that, upon entering loss recovery, TCP
      1) always retransmits something, and
      2) the retransmission never fails locally (e.g., qdisc drop),
      
      so undo_marker is set in tcp_enter_recovery() and undo_retrans is
      incremented only when tcp_retransmit_skb() is successful.
      
      When the assumption is broken because TCP's cwnd is too small to
      retransmit, or the retransmit fails locally, the next (DUP)ACK
      will incorrectly revert the cwnd and the congestion state in
      tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
      may re-enter the recovery state. The sender repeatedly enters and
      (incorrectly) exits recovery states if the retransmits continue to
      fail locally while receiving (DUP)ACKs.
      
      The fix is to initialize undo_retrans to -1 and start counting on
      the first retransmission. Always increment undo_retrans even if the
      retransmissions fail locally because they couldn't cause DSACKs to
      undo the cwnd reduction.
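      
      A hedged sketch of the counting change (placement assumed):
      
        /* At recovery entry: -1 means "nothing retransmitted yet". */
        tp->undo_retrans = -1;
      
        /* At every retransmission attempt, even one that fails locally;
         * a local drop can never produce a DSACK, so it must count. */
        if (tp->undo_retrans < 0)
                tp->undo_retrans = 0;
        tp->undo_retrans += tcp_skb_pcount(skb);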
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e08d5e3
  23. 30 Jun, 2014 1 commit
  24. 27 Jun, 2014 1 commit
  25. 20 Jun, 2014 1 commit
    • tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb · 2cd0d743
      Neal Cardwell authored
      If there is an MSS change (or misbehaving receiver) that causes a SACK
      to arrive that covers the end of an skb but is less than one MSS, then
      tcp_match_skb_to_sack() was rounding up pkt_len to the full length of
      the skb ("Round if necessary..."), then chopping all bytes off the skb
      and creating a zero-byte skb in the write queue.
      
      This was visible now because the recently simplified TLP logic in
      bef1909e ("tcp: fixing TLP's FIN recovery") could find that 0-byte
      skb at the end of the write queue, and now that we do not check that
      skb's length we could send it as a TLP probe.
      
      Consider the following example scenario:
      
       mss: 1000
       skb: seq: 0 end_seq: 4000  len: 4000
       SACK: start_seq: 3999 end_seq: 4000
      
      The tcp_match_skb_to_sack() code will compute:
      
       in_sack = false
       pkt_len = start_seq - TCP_SKB_CB(skb)->seq = 3999 - 0 = 3999
       new_len = (pkt_len / mss) * mss = (3999/1000)*1000 = 3000
       new_len += mss = 4000
      
      Previously we would find the new_len > skb->len check failing, so we
      would fall through and set pkt_len = new_len = 4000 and chop off
      pkt_len of 4000 from the 4000-byte skb, leaving a 0-byte segment
      afterward in the write queue.
      
      With this commit, we notice that the new new_len >= skb->len check
      succeeds, so we return without trying to fragment.
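      
      A sketch of the corrected rounding (structure assumed from the worked
      example above):
      
        if (pkt_len > mss) {
                unsigned int new_len = (pkt_len / mss) * mss;
      
                if (!in_sack && new_len < pkt_len)
                        new_len += mss;
                pkt_len = new_len;
        }
        if (pkt_len >= skb->len && !in_sack)
                return 0;       /* would leave a zero-byte skb: don't split */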
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2cd0d743
  26. 11 Jun, 2014 1 commit