Commit 20718485 authored by Eric Dumazet, committed by Jakub Kicinski

tcp/dccp: change source port selection at connect() time

In commit 1580ab63 ("tcp/dccp: better use of ephemeral ports in connect()")
we added a heuristic to select even ports for connect() and odd ports for bind().

This was nice because no application changes were needed.

But it added more costs when all even ports are in use,
when there are few listeners and many active connections.

Since then, IP_LOCAL_PORT_RANGE has been added to permit an application
to partition ephemeral port range at will.
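
As an illustration, an application could set a per-socket range roughly as sketched below. This is a minimal sketch, not code from the patch: the helper names pack_port_range() and set_local_port_range() are hypothetical, and the fallback value 51 for IP_LOCAL_PORT_RANGE is taken from the Linux uapi header for libc headers that do not define it yet. The option takes a 32-bit value with the lower bound in the low 16 bits and the upper bound in the high 16 bits.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef IP_LOCAL_PORT_RANGE
#define IP_LOCAL_PORT_RANGE 51	/* from linux/in.h; assumed missing in older libc headers */
#endif

/* Hypothetical helper: pack [low, high] as the option expects,
 * upper bound in the high 16 bits, lower bound in the low 16 bits.
 */
uint32_t pack_port_range(uint16_t low, uint16_t high)
{
	return ((uint32_t)high << 16) | low;
}

/* Hypothetical helper: restrict the ephemeral port range of one socket.
 * Call it before connect(). Returns 0 on success or a negative errno.
 */
int set_local_port_range(int fd, uint16_t low, uint16_t high)
{
	uint32_t range = pack_port_range(low, high);

	if (low > high)
		return -EINVAL;
	if (setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE,
		       &range, sizeof(range)) < 0)
		return -errno;
	return 0;
}
```

On kernels that predate the option, the setsockopt() call is expected to fail, so callers should treat a negative return as "fall back to the system-wide range".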

This patch extends the idea so that if IP_LOCAL_PORT_RANGE is set on
a socket before connect(), port selection no longer favors even ports.

This means that connect() can find a suitable source port faster,
and applications can use a different split between connect() and bind()
users.
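
To make the change in scan order concrete, here is a small userspace model of the loop in __inet_hash_connect() as modified by this patch. It is only a sketch: scan_order() is a hypothetical name, the bind-bucket checks are left out so every candidate port is simply recorded, and the starting offset is passed in by the caller and reduced modulo the range size inside the sketch (an assumption, since the code deriving the offset is outside the hunks shown here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Record the order in which candidate source ports in [low, high]
 * would be visited. local_ports models "IP_LOCAL_PORT_RANGE was set
 * on the socket". Returns the number of ports written to out[].
 */
size_t scan_order(int low, int high, unsigned int offset,
		  bool local_ports, int *out, size_t cap)
{
	int step = local_ports ? 1 : 2;
	unsigned int remaining, i;
	size_t n = 0;
	int port;

	assert(low <= high);
	high++;				/* [low, high] -> [low, high[ */
	remaining = high - low;
	if (!local_ports && remaining > 1)
		remaining &= ~1U;	/* keep both parity passes equal in size */

	offset %= remaining;		/* assumption of this sketch */
	if (!local_ports)
		offset &= ~1U;		/* first pass: ports of low's parity */
other_parity_scan:
	port = low + offset;
	for (i = 0; i < remaining; i += step, port += step) {
		if (port >= high)
			port -= remaining;
		if (n < cap)
			out[n++] = port;
	}
	if (!local_ports) {
		offset++;
		if ((offset & 1) && remaining > 1)
			goto other_parity_scan;	/* second pass: other parity */
	}
	return n;
}
```

With the range [10, 15] and a starting offset of 3, the default mode visits 12, 14, 10 and only then 13, 15, 11, while the local_ports mode visits 13, 14, 15, 10, 11, 12 in a single pass.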

This should give more entropy to the Toeplitz hash used in RSS: using even
ports was wasting one bit from the 16-bit sport.
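
As a rough worked example of the "one bit" remark, the sketch below counts how many source ports the preferred first pass can pick from, compared with the whole range. The helper names are hypothetical, and the range [32768, 60999] is the one mentioned in the code comment of the patch.

```c
/* Ports in [low, high] sharing the parity of low: the candidates of
 * the first pass when the scan advances by 2.
 */
unsigned int first_pass_ports(unsigned int low, unsigned int high)
{
	return (high - low) / 2 + 1;
}

/* All ports in [low, high]: the candidates when the scan advances by 1. */
unsigned int total_ports(unsigned int low, unsigned int high)
{
	return high - low + 1;
}
```

For [32768, 60999] this gives 14116 first-pass candidates out of 28232 ports. Under low contention the lowest bit of the chosen port is therefore constant, which is the bit that stops contributing to the hash input.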

A similar change can be done in inet_csk_find_open_port() if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parent 41db7626
@@ -1012,7 +1012,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	bool tb_created = false;
 	u32 remaining, offset;
 	int ret, i, low, high;
-	int l3mdev;
+	bool local_ports;
+	int step, l3mdev;
 	u32 index;
 	if (port) {
@@ -1024,10 +1025,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	l3mdev = inet_sk_bound_l3mdev(sk);
-	inet_sk_get_local_port_range(sk, &low, &high);
+	local_ports = inet_sk_get_local_port_range(sk, &low, &high);
+	step = local_ports ? 1 : 2;
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
-	if (likely(remaining > 1))
+	if (!local_ports && remaining > 1)
 		remaining &= ~1U;
 	get_random_sleepable_once(table_perturb,
@@ -1040,10 +1043,11 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	/* In first pass we try ports of @low parity.
 	 * inet_csk_get_port() does the opposite choice.
 	 */
-	offset &= ~1U;
+	if (!local_ports)
+		offset &= ~1U;
 other_parity_scan:
 	port = low + offset;
-	for (i = 0; i < remaining; i += 2, port += 2) {
+	for (i = 0; i < remaining; i += step, port += step) {
 		if (unlikely(port >= high))
 			port -= remaining;
 		if (inet_is_local_reserved_port(net, port))
@@ -1083,10 +1087,11 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		cond_resched();
 	}
-	offset++;
-	if ((offset & 1) && remaining > 1)
-		goto other_parity_scan;
+	if (!local_ports) {
+		offset++;
+		if ((offset & 1) && remaining > 1)
+			goto other_parity_scan;
+	}
 	return -EADDRNOTAVAIL;
 ok:
@@ -1109,8 +1114,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	 * on low contention the randomness is maximal and on high contention
 	 * it may be inexistent.
 	 */
-	i = max_t(int, i, get_random_u32_below(8) * 2);
-	WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
+	i = max_t(int, i, get_random_u32_below(8) * step);
+	WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + step);
 	/* Head lock still held and bh's disabled */
 	inet_bind_hash(sk, tb, tb2, port);