Commit 032ee423 authored by Neal Cardwell, committed by David S. Miller

tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks

Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.

This patch includes:

- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending
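
The counters added by this patch are registered in snmp4_net_list, so
they appear among the TcpExt fields of /proc/net/netstat, which prints
one line of names followed by one line of values. A minimal sketch of
looking one up, assuming that two-line layout; the helper name
netstat_lookup() is hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): find `name` in a pair of
 * /proc/net/netstat-style lines and store the matching value in *out.
 *   names:  "TcpExt: CounterA CounterB ..."
 *   values: "TcpExt: 1 2 ..."
 * Returns false if the name is absent or the lines run out of tokens.
 */
static bool netstat_lookup(const char *names, const char *values,
                           const char *name, unsigned long long *out)
{
    assert(names && values && name && out);
    size_t want = strlen(name);

    for (;;) {
        /* Skip separators, then measure the next token on each line. */
        names += strspn(names, " \n");
        values += strspn(values, " \n");
        size_t nlen = strcspn(names, " \n");
        size_t vlen = strcspn(values, " \n");

        if (nlen == 0 || vlen == 0)
            return false;
        if (nlen == want && memcmp(names, name, nlen) == 0) {
            *out = strtoull(values, NULL, 10);
            return true;
        }
        names += nlen;
        values += vlen;
    }
}
```

A caller would read the two TcpExt lines from /proc/net/netstat and pass
a name such as "TCPACKSkippedSeq"; a value that keeps growing suggests
that dupacks are being suppressed for that connection state.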

The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.

We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to elicit a dupack in
response.

We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.
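
The decision described above can be modeled in userspace. This is a
sketch, not the kernel code: it keeps time in milliseconds rather than
jiffies, and struct oow_segment, oow_rate_limited() and the skipped
counter are hypothetical stand-ins for the skb, the helper in this
patch, and the SNMP counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an incoming out-of-window segment. As in the
 * kernel, end_seq is seq plus the payload length, plus one for a SYN,
 * so a pure ACK has seq == end_seq.
 */
struct oow_segment {
    uint32_t seq;
    uint32_t end_seq;
    bool syn;
};

/* Returns true if the dupack should be suppressed right now.
 * *last_ms holds the time of the last dupack we allowed (0 = none yet).
 */
static bool oow_rate_limited(const struct oow_segment *seg, uint32_t now_ms,
                             int32_t limit_ms, uint32_t *last_ms,
                             unsigned long *skipped)
{
    assert(seg && last_ms && skipped);

    /* Segments carrying data (and no SYN) are never rate-limited. */
    if (seg->seq != seg->end_seq && !seg->syn)
        return false;

    if (*last_ms) {
        /* Signed difference so that clock wraparound is not misread. */
        int32_t elapsed = (int32_t)(now_ms - *last_ms);

        if (elapsed >= 0 && elapsed < limit_ms) {
            (*skipped)++;       /* stands in for the SNMP counter */
            return true;
        }
    }
    *last_ms = now_ms;          /* remember when we allowed a dupack */
    return false;
}
```

With a 500 ms limit, a pure ACK at t=100 is answered, one at t=300 is
suppressed, and one at t=700 is answered again, because the interval is
measured from the last dupack that was allowed. A limit of 0 never
suppresses anything.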

The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing knob for
ICMP, icmp_ratelimit.

The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
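
The patch stores the limit in jiffies (the default is HZ/2, and the
sysctl handler is proc_dointvec_ms_jiffies), while administrators read
and write it in milliseconds. A simplified sketch of that conversion,
assuming round-up to a whole tick; ms_to_ticks() and ticks_to_ms() are
hypothetical names, and the HZ values in the note below are only
illustrative:

```c
#include <assert.h>

/* Hypothetical helper: milliseconds to clock ticks at `hz` ticks per
 * second, rounded up so that a nonzero interval never becomes zero.
 */
static unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
    assert(hz > 0);
    return (ms * hz + 999) / 1000;
}

/* Hypothetical helper: clock ticks back to milliseconds. */
static unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return ticks * 1000 / hz;
}
```

For tick rates of 100, 250 and 1000, HZ/2 is 50, 125 and 500 ticks, and
each of those converts back to the documented 500 ms default.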

Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
parent ca539345
@@ -290,6 +290,28 @@ tcp_frto - INTEGER
 	By default it's enabled with a non-zero value. 0 disables F-RTO.
+tcp_invalid_ratelimit - INTEGER
+	Limit the maximal rate for sending duplicate acknowledgments
+	in response to incoming TCP packets that are for an existing
+	connection but that are invalid due to any of these reasons:
+	  (a) out-of-window sequence number,
+	  (b) out-of-window acknowledgment number, or
+	  (c) PAWS (Protection Against Wrapped Sequence numbers) check failure
+	This can help mitigate simple "ack loop" DoS attacks, wherein
+	a buggy or malicious middlebox or man-in-the-middle can
+	rewrite TCP header fields in manner that causes each endpoint
+	to think that the other is sending invalid TCP segments, thus
+	causing each side to send an unterminating stream of duplicate
+	acknowledgments for invalid segments.
+	Using 0 disables rate-limiting of dupacks in response to
+	invalid segments; otherwise this value specifies the minimal
+	space between sending such dupacks, in milliseconds.
+	Default: 500 (milliseconds).
 tcp_keepalive_time - INTEGER
 	How often TCP sends out keepalive messages when keepalive is enabled.
 	Default: 2hours.
......
@@ -274,6 +274,7 @@ extern int sysctl_tcp_challenge_ack_limit;
 extern unsigned int sysctl_tcp_notsent_lowat;
 extern int sysctl_tcp_min_tso_segs;
 extern int sysctl_tcp_autocorking;
+extern int sysctl_tcp_invalid_ratelimit;
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -1236,6 +1237,37 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
 	return true;
 }
+/* Return true if we're currently rate-limiting out-of-window ACKs and
+ * thus shouldn't send a dupack right now. We rate-limit dupacks in
+ * response to out-of-window SYNs or ACKs to mitigate ACK loops or DoS
+ * attacks that send repeated SYNs or ACKs for the same connection. To
+ * do this, we do not send a duplicate SYNACK or ACK if the remote
+ * endpoint is sending out-of-window SYNs or pure ACKs at a high rate.
+ */
+static inline bool tcp_oow_rate_limited(struct net *net,
+					const struct sk_buff *skb,
+					int mib_idx, u32 *last_oow_ack_time)
+{
+	/* Data packets without SYNs are not likely part of an ACK loop. */
+	if ((TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq) &&
+	    !tcp_hdr(skb)->syn)
+		goto not_rate_limited;
+	if (*last_oow_ack_time) {
+		s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);
+		if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
+			NET_INC_STATS_BH(net, mib_idx);
+			return true;	/* rate-limited: don't send yet! */
+		}
+	}
+	*last_oow_ack_time = tcp_time_stamp;
+not_rate_limited:
+	return false;	/* not rate-limited: go ahead, send dupack now! */
+}
 static inline void tcp_mib_init(struct net *net)
 {
 	/* See RFC 2012 */
......
@@ -270,6 +270,12 @@ enum
 	LINUX_MIB_TCPHYSTARTTRAINCWND,		/* TCPHystartTrainCwnd */
 	LINUX_MIB_TCPHYSTARTDELAYDETECT,	/* TCPHystartDelayDetect */
 	LINUX_MIB_TCPHYSTARTDELAYCWND,		/* TCPHystartDelayCwnd */
+	LINUX_MIB_TCPACKSKIPPEDSYNRECV,		/* TCPACKSkippedSynRecv */
+	LINUX_MIB_TCPACKSKIPPEDPAWS,		/* TCPACKSkippedPAWS */
+	LINUX_MIB_TCPACKSKIPPEDSEQ,		/* TCPACKSkippedSeq */
+	LINUX_MIB_TCPACKSKIPPEDFINWAIT2,	/* TCPACKSkippedFinWait2 */
+	LINUX_MIB_TCPACKSKIPPEDTIMEWAIT,	/* TCPACKSkippedTimeWait */
+	LINUX_MIB_TCPACKSKIPPEDCHALLENGE,	/* TCPACKSkippedChallenge */
 	__LINUX_MIB_MAX
 };
......
@@ -292,6 +292,12 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPHystartTrainCwnd", LINUX_MIB_TCPHYSTARTTRAINCWND),
 	SNMP_MIB_ITEM("TCPHystartDelayDetect", LINUX_MIB_TCPHYSTARTDELAYDETECT),
 	SNMP_MIB_ITEM("TCPHystartDelayCwnd", LINUX_MIB_TCPHYSTARTDELAYCWND),
+	SNMP_MIB_ITEM("TCPACKSkippedSynRecv", LINUX_MIB_TCPACKSKIPPEDSYNRECV),
+	SNMP_MIB_ITEM("TCPACKSkippedPAWS", LINUX_MIB_TCPACKSKIPPEDPAWS),
+	SNMP_MIB_ITEM("TCPACKSkippedSeq", LINUX_MIB_TCPACKSKIPPEDSEQ),
+	SNMP_MIB_ITEM("TCPACKSkippedFinWait2", LINUX_MIB_TCPACKSKIPPEDFINWAIT2),
+	SNMP_MIB_ITEM("TCPACKSkippedTimeWait", LINUX_MIB_TCPACKSKIPPEDTIMEWAIT),
+	SNMP_MIB_ITEM("TCPACKSkippedChallenge", LINUX_MIB_TCPACKSKIPPEDCHALLENGE),
 	SNMP_MIB_SENTINEL
 };
......
@@ -728,6 +728,13 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "tcp_invalid_ratelimit",
+		.data		= &sysctl_tcp_invalid_ratelimit,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies,
+	},
 	{
 		.procname	= "icmp_msgs_per_sec",
 		.data		= &sysctl_icmp_msgs_per_sec,
......
@@ -100,6 +100,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_early_retrans __read_mostly = 3;
+int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
......