An error occurred fetching the project authors.
- 09 Aug, 2020 1 commit
-
-
Christoph Paasch authored
BugLink: https://bugs.launchpad.net/bugs/1888690 [ Upstream commit ce69e563 ] syzkaller found its way into setsockopt with TCP_CONGESTION "cdg". tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock just copies all the memory, the allocated pointer will be copied as well, if the app called setsockopt(..., TCP_CONGESTION) on the listener. If now the socket will be destroyed before the congestion-control has properly been initialized (through a call to tcp_init_transfer), we will end up freeing memory that does not belong to that particular socket, opening the door to a double-free: [ 11.413102] ================================================================== [ 11.414181] BUG: KASAN: double-free or invalid-free in tcp_cleanup_congestion_control+0x58/0xd0 [ 11.415329] [ 11.415560] CPU: 3 PID: 4884 Comm: syz-executor.5 Not tainted 5.8.0-rc2 #80 [ 11.416544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 11.418148] Call Trace: [ 11.418534] <IRQ> [ 11.418834] dump_stack+0x7d/0xb0 [ 11.419297] print_address_description.constprop.0+0x1a/0x210 [ 11.422079] kasan_report_invalid_free+0x51/0x80 [ 11.423433] __kasan_slab_free+0x15e/0x170 [ 11.424761] kfree+0x8c/0x230 [ 11.425157] tcp_cleanup_congestion_control+0x58/0xd0 [ 11.425872] tcp_v4_destroy_sock+0x57/0x5a0 [ 11.426493] inet_csk_destroy_sock+0x153/0x2c0 [ 11.427093] tcp_v4_syn_recv_sock+0xb29/0x1100 [ 11.427731] tcp_get_cookie_sock+0xc3/0x4a0 [ 11.429457] cookie_v4_check+0x13d0/0x2500 [ 11.433189] tcp_v4_do_rcv+0x60e/0x780 [ 11.433727] tcp_v4_rcv+0x2869/0x2e10 [ 11.437143] ip_protocol_deliver_rcu+0x23/0x190 [ 11.437810] ip_local_deliver+0x294/0x350 [ 11.439566] __netif_receive_skb_one_core+0x15d/0x1a0 [ 11.441995] process_backlog+0x1b1/0x6b0 [ 11.443148] net_rx_action+0x37e/0xc40 [ 11.445361] __do_softirq+0x18c/0x61a [ 11.445881] asm_call_on_stack+0x12/0x20 [ 11.446409] </IRQ> [ 11.446716] do_softirq_own_stack+0x34/0x40 [ 11.447259] do_softirq.part.0+0x26/0x30 [ 11.447827] __local_bh_enable_ip+0x46/0x50 [ 11.448406] ip_finish_output2+0x60f/0x1bc0 [ 11.450109] __ip_queue_xmit+0x71c/0x1b60 [ 11.451861] __tcp_transmit_skb+0x1727/0x3bb0 [ 11.453789] tcp_rcv_state_process+0x3070/0x4d3a [ 11.456810] tcp_v4_do_rcv+0x2ad/0x780 [ 11.457995] __release_sock+0x14b/0x2c0 [ 11.458529] release_sock+0x4a/0x170 [ 11.459005] __inet_stream_connect+0x467/0xc80 [ 11.461435] inet_stream_connect+0x4e/0xa0 [ 11.462043] __sys_connect+0x204/0x270 [ 11.465515] __x64_sys_connect+0x6a/0xb0 [ 11.466088] do_syscall_64+0x3e/0x70 [ 11.466617] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 11.467341] RIP: 0033:0x7f56046dc469 [ 11.467844] Code: Bad RIP value. [ 11.468282] RSP: 002b:00007f5604dccdd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a [ 11.469326] RAX: ffffffffffffffda RBX: 000000000068bf00 RCX: 00007f56046dc469 [ 11.470379] RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000004 [ 11.471311] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000 [ 11.472286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 11.473341] R13: 000000000041427c R14: 00007f5604dcd5c0 R15: 0000000000000003 [ 11.474321] [ 11.474527] Allocated by task 4884: [ 11.475031] save_stack+0x1b/0x40 [ 11.475548] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 11.476182] tcp_cdg_init+0xf0/0x150 [ 11.476744] tcp_init_congestion_control+0x9b/0x3a0 [ 11.477435] tcp_set_congestion_control+0x270/0x32f [ 11.478088] do_tcp_setsockopt.isra.0+0x521/0x1a00 [ 11.478744] __sys_setsockopt+0xff/0x1e0 [ 11.479259] __x64_sys_setsockopt+0xb5/0x150 [ 11.479895] do_syscall_64+0x3e/0x70 [ 11.480395] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 11.481097] [ 11.481321] Freed by task 4872: [ 11.481783] save_stack+0x1b/0x40 [ 11.482230] __kasan_slab_free+0x12c/0x170 [ 11.482839] kfree+0x8c/0x230 [ 11.483240] tcp_cleanup_congestion_control+0x58/0xd0 [ 11.483948] tcp_v4_destroy_sock+0x57/0x5a0 [ 11.484502] inet_csk_destroy_sock+0x153/0x2c0 [ 11.485144] tcp_close+0x932/0xfe0 [ 11.485642] inet_release+0xc1/0x1c0 [ 11.486131] __sock_release+0xc0/0x270 [ 11.486697] sock_close+0xc/0x10 [ 11.487145] __fput+0x277/0x780 [ 11.487632] task_work_run+0xeb/0x180 [ 11.488118] __prepare_exit_to_usermode+0x15a/0x160 [ 11.488834] do_syscall_64+0x4a/0x70 [ 11.489326] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Wei Wang fixed a part of these CDG-malloc issues with commit c1201444 ("tcp: memset ca_priv data to 0 properly"). This patch here fixes the listener-scenario: We make sure that listeners setting the congestion-control through setsockopt won't initialize it (thus CDG never allocates on listeners). For those who use AF_UNSPEC to reuse a socket, tcp_disconnect() is changed to cleanup afterwards. (The issue can be reproduced at least down to v4.4.x.) Cc: Wei Wang <weiwan@google.com> Cc: Eric Dumazet <edumazet@google.com> Fixes: 2b0a8c9e ("tcp: add CDG congestion control") Signed-off-by:
Christoph Paasch <cpaasch@apple.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com> Signed-off-by:
Khalid Elmously <khalid.elmously@canonical.com>
-
- 15 Mar, 2020 2 commits
-
-
Eric Dumazet authored
BugLink: https://bugs.launchpad.net/bugs/1864775 [ Upstream commit 784f8344 ] tp->segs_in and tp->segs_out need to be cleared in tcp_disconnect(). tcp_disconnect() is rarely used, but it is worth fixing it. Fixes: 2efd055c ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info") Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Khalid Elmously <khalid.elmously@canonical.com> Signed-off-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com>
-
Eric Dumazet authored
BugLink: https://bugs.launchpad.net/bugs/1864775 [ Upstream commit c13c48c0 ] total_retrans needs to be cleared in tcp_disconnect(). tcp_disconnect() is rarely used, but it is worth fixing it. Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: SeongJae Park <sjpark@amazon.de> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Khalid Elmously <khalid.elmously@canonical.com> Signed-off-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com>
-
- 17 Sep, 2019 1 commit
-
-
Christoph Paasch authored
BugLink: https://bugs.launchpad.net/bugs/1840081 [ Upstream commit e858faf5 ] If an app is playing tricks to reuse a socket via tcp_disconnect(), bytes_acked/received needs to be reset to 0. Otherwise tcp_info will report the sum of the current and the old connection.. Cc: Eric Dumazet <edumazet@google.com> Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info") Fixes: bdd1f9ed ("tcp: add tcpi_bytes_received to tcp_info") Signed-off-by:
Christoph Paasch <cpaasch@apple.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com>
-
- 05 Jun, 2019 1 commit
-
-
Eric Dumazet authored
Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
Jonathan Looney <jtl@netflix.com> Acked-by:
Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com> BugLink: https://bugs.launchpad.net/bugs/1831637 (Remote denial of service (system crash) caused by integer overflow in TCP SACK handling (LP: #1831637)) [tyhicks: Backport to Xenial: - Adjust context in linux/tcp.h and tcp.c - tcp_shifted_skb() doesn't take the prev skb as a parameter - tcp_collapse_retrans() doesn't do frag shifting since commit f8071cde ("tcp: enhance tcp_collapse_retrans() with skb_shift()") isn't present so no changes are needed to that function] Signed-off-by:
Tyler Hicks <tyhicks@canonical.com> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com>
-
- 14 Mar, 2019 1 commit
-
-
Eric Dumazet authored
BugLink: https://bugs.launchpad.net/bugs/1818815 [ Upstream commit 04c03114 ] soukjin bae reported a crash in tcp_v4_err() handling ICMP_DEST_UNREACH after tcp_write_queue_head(sk) returned a NULL pointer. Current logic should have prevented this : if (seq != tp->snd_una || !icsk->icsk_retransmits || !icsk->icsk_backoff || fastopen) break; Problem is the write queue might have been purged and icsk_backoff has not been cleared. Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
soukjin bae <soukjin.bae@samsung.com> Acked-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Juerg Haefliger <juergh@canonical.com> Signed-off-by:
Khalid Elmously <khalid.elmously@canonical.com>
-
- 13 Nov, 2018 1 commit
-
-
Yaogong Wang authored
BugLink: https://bugs.launchpad.net/bugs/1801893 [ Upstream commit 9f5afeae ] Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering to reach the 2 Gbytes limit. Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS. In presence of packet losses (or reorders), TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range. Most packets are appended to the tail of this queue, and when packets can finally be transferred to receive queue, we scan the queue from its head. However, in presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches. This patch converts it to a RB tree, to get bounded latencies. Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added ofo_last_skb cache, polishing and tests. Tested with network dropping between 1 and 10 % packets, with good success (about 30 % increase of throughput in stress tests) Next step would be to also use an RB tree for the write queue at sender side ;) Signed-off-by:
Yaogong Wang <wygivan@google.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-By:
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Mao Wenan <maowenan@huawei.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Juerg Haefliger <juergh@canonical.com> Signed-off-by:
Khalid Elmously <khalid.elmously@canonical.com>
-
- 02 Oct, 2018 1 commit
-
-
Randy Dunlap authored
BugLink: https://bugs.launchpad.net/bugs/1792377 [ Upstream commit e56b8ce3 ] Attempt to make cryptic TCP seq number error messages clearer by (1) identifying the source of the message as "TCP", (2) identifying the errors as "seq # bug", and (3) grouping the field identifiers and values by separating them with commas. E.g., the following message is changed from: recvmsg bug 2: copied 73BCB6CD seq 70F17CBE rcvnxt 73BCB9AA fl 0 WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:1881 tcp_recvmsg+0x649/0xb90 to: TCP recvmsg seq # bug 2: copied 73BCB6CD, seq 70F17CBE, rcvnxt 73BCB9AA, fl 0 WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:2011 tcp_recvmsg+0x694/0xba0 Suggested-by:
積丹尼 Dan Jacobson <jidanni@jidanni.org> Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Sasha Levin <alexander.levin@microsoft.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com> Signed-off-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com>
-
- 07 Jun, 2018 2 commits
-
-
Yuchung Cheng authored
BugLink: https://bugs.launchpad.net/bugs/1775477 [ Upstream commit 16ae6aa1 ] The TCP repair sequence of operation is to first set the socket in repair mode, then inject the TCP stats into the socket with repair socket options, then call connect() to re-activate the socket. The connect syscall simply returns and set state to ESTABLISHED mode. As a result Fast Open is meaningless for TCP repair. However allowing sendto() system call with MSG_FASTOPEN flag half-way during the repair operation could unexpectedly cause data to be sent, before the operation finishes changing the internal TCP stats (e.g. MSS). This in turn triggers TCP warnings on inconsistent packet accounting. The fix is to simply disallow Fast Open operation once the socket is in the repair mode. Reported-by:
syzbot <syzkaller@googlegroups.com> Signed-off-by:
Yuchung Cheng <ycheng@google.com> Reviewed-by:
Neal Cardwell <ncardwell@google.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com> Acked-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by:
Khalid Elmously <khalid.elmously@canonical.com>
-
Eric Dumazet authored
BugLink: http://bugs.launchpad.net/bugs/1774173 commit bf2acc94 upstream. syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out() with following C-repro : socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242 setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0 writev(3, [{"\270", 1}], 1) = 1 setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0 writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144 The 3rd system call looks odd : setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0 This patch makes sure bound checking is using an unsigned compare. Fixes: ee995283 ("tcp: Initial repair mode") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
syzbot <syzkaller@googlegroups.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Juerg Haefliger <juergh@canonical.com> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com>
-
- 23 May, 2018 1 commit
-
-
Eric Dumazet authored
BugLink: http://bugs.launchpad.net/bugs/1768474 [ Upstream commit 72123032 ] syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1] I believe this was caused by a TCP_MD5SIG being set on live flow. This is highly unexpected, since TCP option space is limited. For instance, presence of TCP MD5 option automatically disables TCP TimeStamp option at SYN/SYNACK time, which we can not do once flow has been established. Really, adding/deleting an MD5 key only makes sense on sockets in CLOSE or LISTEN state. [1] BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720 CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720 tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline] tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184 tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747 SyS_sendto+0x8a/0xb0 net/socket.c:1715 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x448fe9 RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9 RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004 RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010 R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000 R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2737 [inline] __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369 __kmalloc_reserve net/core/skbuff.c:138 [inline] __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206 alloc_skb include/linux/skbuff.h:984 [inline] tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624 __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline] tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline] tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747 SyS_sendto+0x8a/0xb0 net/socket.c:1715 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: cfb6eeb4 ("[TCP]: MD5 Signature Option (RFC2385) support.") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
syzbot <syzkaller@googlegroups.com> Acked-by:
Yuchung Cheng <ycheng@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Juerg Haefliger <juergh@canonical.com> Signed-off-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com>
-
- 03 Apr, 2018 1 commit
-
-
Li RongQing authored
BugLink: http://bugs.launchpad.net/bugs/1756121 [ Upstream commit 9b42d55a ] socket can be disconnected and gets transformed back to a listening socket, if sk_frag.page is not released, which will be cloned into a new socket by sk_clone_lock, but the reference count of this page is increased, lead to a use after free or double free issue Signed-off-by:
Li RongQing <lirongqing@baidu.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Juerg Haefliger <juergh@canonical.com> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com>
-
- 13 Mar, 2018 1 commit
-
-
Dan Streetman authored
BugLink: http://bugs.launchpad.net/bugs/1754592 [ Upstream commit 4ee806d5 ] When a tcp socket is closed, if it detects that its net namespace is exiting, close immediately and do not wait for FIN sequence. For normal sockets, a reference is taken to their net namespace, so it will never exit while the socket is open. However, kernel sockets do not take a reference to their net namespace, so it may begin exiting while the kernel socket is still open. In this case if the kernel socket is a tcp socket, it will stay open trying to complete its close sequence. The sock's dst(s) hold a reference to their interface, which are all transferred to the namespace's loopback interface when the real interfaces are taken down. When the namespace tries to take down its loopback interface, it hangs waiting for all references to the loopback interface to release, which results in messages like: unregister_netdevice: waiting for lo to become free. Usage count = 1 These messages continue until the socket finally times out and closes. Since the net namespace cleanup holds the net_mutex while calling its registered pernet callbacks, any new net namespace initialization is blocked until the current net namespace finishes exiting. After this change, the tcp socket notices the exiting net namespace, and closes immediately, releasing its dst(s) and their reference to the loopback interface, which lets the net namespace continue exiting. Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811Signed-off-by:
Dan Streetman <ddstreet@canonical.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com> Signed-off-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com>
-
- 15 Sep, 2017 1 commit
-
-
Wei Wang authored
CVE-2017-14106 When tcp_disconnect() is called, inet_csk_delack_init() sets icsk->icsk_ack.rcv_mss to 0. This could potentially cause tcp_recvmsg() => tcp_cleanup_rbuf() => __tcp_select_window() call path to have division by 0 issue. So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0. Reported-by:
Andrey Konovalov <andreyknvl@google.com> Signed-off-by:
Wei Wang <weiwan@google.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
Yuchung Cheng <ycheng@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 499350a5) Signed-off-by:
Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by:
Stefan Bader <stefan.bader@canonical.com> Acked-by:
Colin King <colin.king@canonical.com> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com>
-
- 11 Aug, 2017 1 commit
-
-
WANG Cong authored
BugLink: http://bugs.launchpad.net/bugs/1705707 commit d747a7a5 upstream. We have to reset the sk->sk_rx_dst when we disconnect a TCP connection, because otherwise when we re-connect it this dst reference is simply overridden in tcp_finish_connect(). This fixes a dst leak which leads to a loopback dev refcnt leak. It is a long-standing bug, Kevin reported a very similar (if not same) bug before. Thanks to Andrei for providing such a reliable reproducer which greatly narrows down the problem. Fixes: 41063e9d ("ipv4: Early TCP socket demux.") Reported-by:
Andrei Vagin <avagin@gmail.com> Reported-by:
Kevin Xu <kaiwen.xu@hulu.com> Signed-off-by:
Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com> Signed-off-by:
Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
- 27 Jun, 2017 1 commit
-
-
Wei Wang authored
BugLink: http://bugs.launchpad.net/bugs/1697001 [ Upstream commit ba615f67 ] Fastopen API should be used to perform fastopen operations on the TCP socket. It does not make sense to use fastopen API to perform disconnect by calling it with AF_UNSPEC. The fastopen data path is also prone to race conditions and bugs when using with AF_UNSPEC. One issue reported and analyzed by Vegard Nossum is as follows: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Thread A: Thread B: ------------------------------------------------------------------------ sendto() - tcp_sendmsg() - sk_stream_memory_free() = 0 - goto wait_for_sndbuf - sk_stream_wait_memory() - sk_wait_event() // sleep | sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC) | - tcp_sendmsg() | - tcp_sendmsg_fastopen() | - __inet_stream_connect() | - tcp_disconnect() //because of AF_UNSPEC | - tcp_transmit_skb()// send RST | - return 0; // no reconnect! | - sk_stream_wait_connect() | - sock_error() | - xchg(&sk->sk_err, 0) | - return -ECONNRESET - ... // wake up, see sk->sk_err == 0 - skb_entail() on TCP_CLOSE socket If the connection is reopened then we will send a brand new SYN packet after thread A has already queued a buffer. At this point I think the socket internal state (sequence numbers etc.) becomes messed up. When the new connection is closed, the FIN-ACK is rejected because the sequence number is outside the window. The other side tries to retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which corrupts the skb data length and hits a BUG() in copy_and_csum_bits(). +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Hence, this patch adds a check for AF_UNSPEC in the fastopen data path and return EOPNOTSUPP to user if such case happens. Fixes: cf60af03 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)") Reported-by:
Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by:
Wei Wang <weiwan@google.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com> Signed-off-by:
Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
- 05 May, 2017 1 commit
-
-
Eric Dumazet authored
BugLink: http://bugs.launchpad.net/bugs/1688505 [ Upstream commit 17c3060b ] In the (very unlikely) case a passive socket becomes a listener, we do not want to duplicate its saved SYN headers. This would lead to double frees, use after free, and please hackers and various fuzzers Tested: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0 +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 5) = 0 +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7> +0 > S. 0:0(0) ack 1 <...> +.1 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 connect(4, AF_UNSPEC, ...) = 0 +0 close(3) = 0 +0 bind(4, ..., ...) = 0 +0 listen(4, 5) = 0 +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7> +0 > S. 0:0(0) ack 1 <...> +.1 < . 1:1(0) ack 1 win 257 Fixes: cd8ae852 ("tcp: provide SYN headers for passive connections") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Stefan Bader <stefan.bader@canonical.com> Signed-off-by:
Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
- 06 Apr, 2017 1 commit
-
-
Eric Dumazet authored
BugLink: http://bugs.launchpad.net/bugs/1666324 [ Upstream commit ccf7abb9 ] Splicing from TCP socket is vulnerable when a packet with URG flag is received and stored into receive queue. __tcp_splice_read() returns 0, and sk_wait_data() immediately returns since there is the problematic skb in queue. This is a nice way to burn cpu (aka infinite loop) and trigger soft lockups. Again, this gem was found by syzkaller tool. Fixes: 9c55e01c ("[TCP]: Splice receive support.") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
Dmitry Vyukov <dvyukov@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Tim Gardner <tim.gardner@canonical.com> Signed-off-by:
Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
- 06 Dec, 2016 1 commit
-
-
Eric Dumazet authored
BugLink: http://bugs.launchpad.net/bugs/1643637 [ Upstream commit ac9e70b1 ] Imagine initial value of max_skb_frags is 17, and last skb in write queue has 15 frags. Then max_skb_frags is lowered to 14 or smaller value. tcp_sendmsg() will then be allowed to add additional page frags and eventually go past MAX_SKB_FRAGS, overflowing struct skb_shared_info. Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags") Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Cc: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Luis Henriques <luis.henriques@canonical.com>
-
- 06 Apr, 2016 2 commits
-
-
Hans Westgaard Ry authored
BugLink: http://bugs.launchpad.net/bugs/1553179 [ Upstream commit 5f74f82e ] Devices may have limits on the number of fragments in an skb they support. Current codebase uses a constant as maximum for number of fragments one skb can hold and use. When enabling scatter/gather and running traffic with many small messages the codebase uses the maximum number of fragments and may thereby violate the max for certain devices. The patch introduces a global variable as max number of fragments. Signed-off-by:
Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by:
Håkon Bugge <haakon.bugge@oracle.com> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Tim Gardner <tim.gardner@canonical.com>
-
Eric Dumazet authored
BugLink: http://bugs.launchpad.net/bugs/1553179 [ Upstream commit ff5d7497 ] With some combinations of user provided flags in netlink command, it is possible to call tcp_get_info() with a buffer that is not 8-bytes aligned. It does matter on some arches, so we need to use put_unaligned() to store the u64 fields. Current iproute2 package does not trigger this particular issue. Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info") Fixes: 977cb0ec ("tcp: add pacing_rate information into tcp_info") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Tim Gardner <tim.gardner@canonical.com>
-
- 01 Dec, 2015 1 commit
-
-
Eric Dumazet authored
This patch is a cleanup to make following patch easier to review. Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA from (struct socket)->flags to a (struct socket_wq)->flags to benefit from RCU protection in sock_wake_async() To ease backports, we rename both constants. Two new helpers, sk_set_bit(int nr, struct sock *sk) and sk_clear_bit(int net, struct sock *sk) are added so that following patch can change their implementation. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 15 Nov, 2015 1 commit
-
-
Eric Dumazet authored
Some functions access TCP sockets without holding a lock and might output non consistent data, depending on compiler and or architecture. tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ... Introduce sk_state_load() and sk_state_store() to fix the issues, and more clearly document where this lack of locking is happening. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 21 Oct, 2015 1 commit
-
-
Yuchung Cheng authored
Kathleen Nichols' algorithm for tracking the minimum RTT of a data stream over some measurement window. It uses constant space and constant time per update. Yet it almost always delivers the same minimum as an implementation that has to keep all the data in the window. The measurement window is tunable via sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes. The algorithm keeps track of the best, 2nd best & 3rd best min values, maintaining an invariant that the measurement time of the n'th best >= n-1'th best. It also makes sure that the three values are widely separated in the time window since that bounds the worse case error when that data is monotonically increasing over the window. Upon getting a new min, we can forget everything earlier because it has no value - the new min is less than everything else in the window by definition and it's the most recent. So we restart fresh on every new min and overwrites the 2nd & 3rd choices. The same property holds for the 2nd & 3rd best. Therefore we have to maintain two invariants to maximize the information in the samples, one on values (1st.v <= 2nd.v <= 3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <= now). These invariants determine the structure of the code The RTT input to the windowed filter is the minimum RTT measured from ACK or SACK, or as the last resort from TCP timestamps. The accessor tcp_min_rtt() returns the minimum RTT seen in the window. ~0U indicates it is not available. The minimum is 1usec even if the true RTT is below that. Signed-off-by:
Yuchung Cheng <ycheng@google.com> Signed-off-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 07 Oct, 2015 1 commit
-
-
Yuvaraja Mariappan authored
Fixed an assignment coding style issue Signed-off-by:
Yuvaraja Mariappan <ymariappan@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 29 Sep, 2015 1 commit
-
-
Eric Dumazet authored
While auditing TCP stack for upcoming 'lockless' listener changes, I found I had to change fastopen_init_queue() to properly init the object before publishing it. Otherwise an other cpu could try to lock the spinlock before it gets properly initialized. Instead of adding appropriate barriers, just remove dynamic memory allocations : - Structure is 28 bytes on 64bit arches. Using additional 8 bytes for holding a pointer seems overkill. - Two listeners can share same cache line and performance would suffer. If we really want to save few bytes, we would instead dynamically allocate whole struct request_sock_queue in the future. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 25 Aug, 2015 1 commit
-
-
Eric Dumazet authored
slow start after idle might reduce cwnd, but we perform this after first packet was cooked and sent. With TSO/GSO, it means that we might send a full TSO packet even if cwnd should have been reduced to IW10. Moving the SSAI check in skb_entail() makes sense, because we slightly reduce number of times this check is done, especially for large send() and TCP Small queue callbacks from softirq context. As Neal pointed out, we also need to perform the check if/when receive window opens. Tested: Following packetdrill test demonstrates the problem // Test of slow start after idle `sysctl -q net.ipv4.tcp_slow_start_after_idle=1` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.100 < . 1:1(0) ack 1 win 511 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0 +0 write(4, ..., 26000) = 26000 +0 > . 1:5001(5000) ack 1 +0 > . 5001:10001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10 }% +.100 < . 1:1(0) ack 10001 win 511 +0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }% +0 > . 10001:20001(10000) ack 1 +0 > P. 20001:26001(6000) ack 1 +.100 < . 1:1(0) ack 26001 win 511 +0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }% +4 write(4, ..., 20000) = 20000 // If slow start after idle works properly, we should send 5 MSS here (cwnd/2) +0 > . 26001:31001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }% +0 > . 31001:36001(5000) ack 1 Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 27 Jul, 2015 1 commit
-
-
Sabrina Dubroca authored
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL | MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258Reported-by:
Enrico Scholz <rh-bugzilla@ensc.de> Reported-by:
Dan Searle <dan@censornet.com> Signed-off-by:
Sabrina Dubroca <sd@queasysnail.net> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 23 Jun, 2015 1 commit
-
-
Christoph Paasch authored
tcp_fastopen_reset_cipher really cannot be called from interrupt context. It allocates the tcp_fastopen_context with GFP_KERNEL and calls crypto_alloc_cipher, which allocates all kind of stuff with GFP_KERNEL. Thus, we might sleep when the key-generation is triggered by an incoming TFO cookie-request which would then happen in interrupt- context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP: [ 36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266 [ 36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill [ 36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14 [ 36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [ 36.008250] 00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8 [ 36.009630] ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928 [ 36.011076] 0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00 [ 36.012494] Call Trace: [ 36.012953] <IRQ> [<ffffffff8171d53a>] dump_stack+0x4f/0x6d [ 36.014085] [<ffffffff810967d3>] ___might_sleep+0x103/0x170 [ 36.015117] [<ffffffff81096892>] __might_sleep+0x52/0x90 [ 36.016117] [<ffffffff8118e887>] kmem_cache_alloc_trace+0x47/0x190 [ 36.017266] [<ffffffff81680d82>] ? tcp_fastopen_reset_cipher+0x42/0x130 [ 36.018485] [<ffffffff81680d82>] tcp_fastopen_reset_cipher+0x42/0x130 [ 36.019679] [<ffffffff81680f01>] tcp_fastopen_init_key_once+0x61/0x70 [ 36.020884] [<ffffffff81680f2c>] __tcp_fastopen_cookie_gen+0x1c/0x60 [ 36.022058] [<ffffffff816814ff>] tcp_try_fastopen+0x58f/0x730 [ 36.023118] [<ffffffff81671788>] tcp_conn_request+0x3e8/0x7b0 [ 36.024185] [<ffffffff810e3872>] ? __module_text_address+0x12/0x60 [ 36.025327] [<ffffffff8167b2e1>] tcp_v4_conn_request+0x51/0x60 [ 36.026410] [<ffffffff816727e0>] tcp_rcv_state_process+0x190/0xda0 [ 36.027556] [<ffffffff81661f97>] ? __inet_lookup_established+0x47/0x170 [ 36.028784] [<ffffffff8167c2ad>] tcp_v4_do_rcv+0x16d/0x3d0 [ 36.029832] [<ffffffff812e6806>] ? security_sock_rcv_skb+0x16/0x20 [ 36.030936] [<ffffffff8167cc8a>] tcp_v4_rcv+0x77a/0x7b0 [ 36.031875] [<ffffffff816af8c3>] ? iptable_filter_hook+0x33/0x70 [ 36.032953] [<ffffffff81657d22>] ip_local_deliver_finish+0x92/0x1f0 [ 36.034065] [<ffffffff81657f1a>] ip_local_deliver+0x9a/0xb0 [ 36.035069] [<ffffffff81657c90>] ? ip_rcv+0x3d0/0x3d0 [ 36.035963] [<ffffffff81657569>] ip_rcv_finish+0x119/0x330 [ 36.036950] [<ffffffff81657ba7>] ip_rcv+0x2e7/0x3d0 [ 36.037847] [<ffffffff81610652>] __netif_receive_skb_core+0x552/0x930 [ 36.038994] [<ffffffff81610a57>] __netif_receive_skb+0x27/0x70 [ 36.040033] [<ffffffff81610b72>] process_backlog+0xd2/0x1f0 [ 36.041025] [<ffffffff81611482>] net_rx_action+0x122/0x310 [ 36.042007] [<ffffffff81076743>] __do_softirq+0x103/0x2f0 [ 36.042978] [<ffffffff81723e3c>] do_softirq_own_stack+0x1c/0x30 This patch moves the call to tcp_fastopen_init_key_once to the places where a listener socket creates its TFO-state, which always happens in user-context (either from the setsockopt, or implicitly during the listen()-call) Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Fixes: 222e83d2 ("tcp: switch tcp_fastopen key generation to net_get_random_once") Signed-off-by:
Christoph Paasch <cpaasch@apple.com> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 16 Jun, 2015 1 commit
-
-
Craig Gallek authored
This get_info handler will simply dispatch to the appropriate existing inet protocol handler. This patch also includes a new netlink attribute (INET_DIAG_PROTOCOL). This attribute is currently only used for multicast messages. Without this attribute, there is no way of knowing the IP protocol used by the socket information being broadcast. This attribute is not necessary in the 'dump' variant of this protocol (though it could easily be added) because dump requests are issued for specific family/protocol pairs. Tested: ss -E (note, the -E option has not yet been merged into the upstream version of ss). Signed-off-by:
Craig Gallek <kraig@google.com> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 25 May, 2015 1 commit
-
-
Hannes Frederic Sowa authored
Prepare skb_splice_bits to be able to deal with AF_UNIX sockets. AF_UNIX sockets don't use lock_sock/release_sock and thus we have to use a callback to make the locking and unlocking configureable. Signed-off-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 22 May, 2015 2 commits
-
-
Eric Dumazet authored
Taking socket spinlock in tcp_get_info() can deadlock, as inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i], while packet processing can use the reverse locking order. We could avoid this locking for TCP_LISTEN states, but lockdep would certainly get confused as all TCP sockets share same lockdep classes. [ 523.722504] ====================================================== [ 523.728706] [ INFO: possible circular locking dependency detected ] [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted [ 523.739202] ------------------------------------------------------- [ 523.745474] ss/18032 is trying to acquire lock: [ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360 [ 523.758129] [ 523.758129] but task is already holding lock: [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0 [ 523.774661] [ 523.774661] which lock already depends on the new lock. [ 523.774661] [ 523.782850] [ 523.782850] the existing dependency chain (in reverse order) is: [ 523.790326] -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}: [ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270 [ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50 [ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110 [ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350 [ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500 [ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0 [ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10 [ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0 [ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0 [ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0 [ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420 Lets use u64_sync infrastructure instead. As a bonus, 64bit arches get optimized, as these are nop for them. Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Marcelo Ricardo Leitner authored
This patch tracks the total number of inbound and outbound segments on a TCP socket. One may use this number to have an idea on connection quality when compared against the retransmissions. RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut These are a 32bit field each and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->segs_out was placed near tp->snd_nxt for good data locality and minimal performance impact, while tp->segs_in was placed near tp->bytes_received for the same reason. Join work with Eric Dumazet. Note that received SYN are accounted on the listener, but sent SYNACK are not accounted. Signed-off-by:
Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 21 May, 2015 2 commits
-
-
Jason Baron authored
We currently rely on the setting of SOCK_NOSPACE in the write() path to ensure that we wake up any epoll edge trigger waiters when acks return to free space in the write queue. However, if we fail to allocate even a single skb in the write queue, we could end up waiting indefinitely. Fix this by explicitly issuing a wakeup when we detect the condition of an empty write queue and a return value of -EAGAIN. This allows userspace to re-try as we expect this to be a temporary failure. I've tested this approach by artificially making sk_stream_alloc_skb() return NULL periodically. In that case, epoll edge trigger waiters will hang indefinitely in epoll_wait() without this patch. Signed-off-by:
Jason Baron <jbaron@akamai.com> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger") we fixed a possible hang of TCP sockets under memory pressure, by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule() if no packet is in socket write queue. It turns out there are other cases where we want to force memory schedule : tcp_fragment() & tso_fragment() need to split a big TSO packet into two smaller ones. If we block here because of TCP memory pressure, we can effectively block TCP socket from sending new data. If no further ACK is coming, this hang would be definitive, and socket has no chance to effectively reduce its memory usage. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 19 May, 2015 1 commit
-
-
Eric B Munson authored
Currently the getsockopt() requesting the cached contents of the syn packet headers will fail silently if the caller uses a buffer that is too small to contain the requested data. Rather than fail silently and discard the headers, getsockopt() should return an error and report the required size to hold the data. Signed-off-by:
Eric B Munson <emunson@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 18 May, 2015 2 commits
-
-
Eric Dumazet authored
Allowing tcp to use ~19% of physical memory is way too much, and allowed bugs to be hidden. Add to this that some drivers use a full page per incoming frame, so real cost can be twice the advertized one. Reduce tcp_mem by 50 % as a first step to sanity. tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Under memory pressure, tcp_sendmsg() can fail to queue a packet while no packet is present in write queue. If we return -EAGAIN with no packet in write queue, no ACK packet will ever come to raise EPOLLOUT. We need to allow one skb per TCP socket, and make sure that tcp sockets can release their forward allocations under pressure. This is a followup to commit 790ba456 ("tcp: set SOCK_NOSPACE under memory pressure") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 05 May, 2015 1 commit
-
-
Eric Dumazet authored
This patch allows a server application to get the TCP SYN headers for its passive connections. This is useful if the server is doing fingerprinting of clients based on SYN packet contents. Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN. The first is used on a socket to enable saving the SYN headers for child connections. This can be set before or after the listen() call. The latter is used to retrieve the SYN headers for passive connections, if the parent listener has enabled TCP_SAVE_SYN. TCP_SAVED_SYN is read once, it frees the saved SYN headers. The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP headers. Original patch was written by Tom Herbert, I changed it to not hold a full skb (and associated dst and conntracking reference). We have used such patch for about 3 years at Google. Signed-off-by:
Eric Dumazet <edumazet@google.com> Acked-by:
Neal Cardwell <ncardwell@google.com> Tested-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 29 Apr, 2015 1 commit
-
-
Eric Dumazet authored
Some Congestion Control modules can provide per flow information, but current way to get this information is to use netlink. Like TCP_INFO, let's add TCP_CC_INFO so that applications can issue a getsockopt() if they have a socket file descriptor, instead of playing complex netlink games. Sample usage would be : union tcp_cc_info info; socklen_t len = sizeof(info); if (getsockopt(fd, SOL_TCP, TCP_CC_INFO, &info, &len) == -1) Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by:
Neal Cardwell <ncardwell@google.com> Acked-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Yuchung Cheng <ycheng@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-