- 26 Sep, 2018 4 commits
-
-
Cong Wang authored
[ Upstream commit cc4dfb7f ] When a rds sock is bound, it is inserted into the bind_hash_table which is protected by RCU. But when releasing rds sock, after it is removed from this hash table, it is freed immediately without respecting RCU grace period. This could cause some use-after-free as reported by syzbot. Mark the rds sock with SOCK_RCU_FREE before inserting it into the bind_hash_table, so that it would be always freed after a RCU grace period. The other problem is in rds_find_bound(), the rds sock could be freed in between rhashtable_lookup_fast() and rds_sock_addref(), so we need to extend RCU read lock protection in rds_find_bound() to close this race condition. Reported-and-tested-by: syzbot+8967084bcac563795dc6@syzkaller.appspotmail.com Reported-by: syzbot+93a5839deb355537440f@syzkaller.appspotmail.com Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: rds-devel@oss.oracle.com Signed-off-by:
Cong Wang <xiyou.wangcong@gmail.com> Acked-by:
Santosh Shilimkar <santosh.shilimkar@oarcle.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Stefan Wahren authored
[ Upstream commit e65a9e48 ] With performance optimization the spi transfer and messages of basic register operations like qcaspi_read_register moved into the private driver structure. But they weren't protected against mutual access (e.g. between driver kthread and ethtool). So dumping the QCA7000 registers via ethtool during network traffic could make spi_sync hang forever, because the completion in spi_message is overwritten. So revert the optimization completely. Fixes: 291ab06e ("net: qualcomm: new Ethernet over SPI driver for QCA700") Signed-off-by:
Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jack Morgenstein authored
[ Upstream commit 76d5581c ] When the mlx5 health mechanism detects a problem while the driver is in the middle of init_one or remove_one, the driver needs to prevent the health mechanism from scheduling future work; if future work is scheduled, there is a problem with use-after-free: the system WQ tries to run the work item (which has been freed) at the scheduled future time. Prevent this by disabling work item scheduling in the health mechanism when the driver is in the middle of init_one() or remove_one(). Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by:
Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by:
Feras Daoud <ferasda@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Petr Oros authored
[ Upstream commit 9d7f19dc ] DMA allocated memory is lost in be_cmd_get_profile_config() when we call it with non-NULL port_res parameter. Signed-off-by:
Petr Oros <poros@redhat.com> Reviewed-by:
Ivan Vecera <ivecera@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 19 Sep, 2018 36 commits
-
-
Greg Kroah-Hartman authored
-
Linus Torvalds authored
commit 7a9cdebd upstream. Jann Horn points out that the vmacache_flush_all() function is not only potentially expensive, it's buggy too. It also happens to be entirely unnecessary, because the sequence number overflow case can be avoided by simply making the sequence number be 64-bit. That doesn't even grow the data structures in question, because the other adjacent fields are already 64-bit. So simplify the whole thing by just making the sequence number overflow case go away entirely, which gets rid of all the complications and makes the code faster too. Win-win. [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics also just goes away entirely with this ] Reported-by:
Jann Horn <jannh@google.com> Suggested-by:
Will Deacon <will.deacon@arm.com> Acked-by:
Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ian Kent authored
commit 0633da48 upstream. autofs_sbi() does not check the superblock magic number to verify it has been given an autofs super block. Backport Note: autofs4 has been renamed to autofs upstream. As a result the upstream patch does not apply cleanly onto 4.14.y. Link: http://lkml.kernel.org/r/153475422934.17131.7563724552005298277.stgit@pluto.themaw.net Reported-by: <syzbot+87c3c541582e56943277@syzkaller.appspotmail.com> Signed-off-by:
Ian Kent <raven@themaw.net> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Zubin Mithra <zsm@chromium.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jason Wang authored
commit 7063efd3 upstream. After commit b196d88a ("tun: fix use after free for ptr_ring") we need clean up tx ring during release(). But unfortunately, it tries to do the cleanup blindly after socket were destroyed which will lead another use-after-free. Fix this by doing the cleanup before dropping the last reference of the socket in __tun_detach(). Backport Note :- Upstream commit moves the ptr_ring_cleanup call from tun_chr_close to __tun_detach. Upstream applied that patch after replacing skb_array with ptr_ring. This patch moves the skb_array_cleanup call from tun_chr_close to __tun_detach. Reported-by:
Andrei Vagin <avagin@virtuozzo.com> Acked-by:
Andrei Vagin <avagin@virtuozzo.com> Fixes: b196d88a ("tun: fix use after free for ptr_ring") Signed-off-by:
Jason Wang <jasowang@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Zubin Mithra <zsm@chromium.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jason Wang authored
commit b196d88a upstream. We used to initialize ptr_ring during TUNSETIFF, this is because its size depends on the tx_queue_len of netdevice. And we try to clean it up when socket were detached from netdevice. A race were spotted when trying to do uninit during a read which will lead a use after free for pointer ring. Solving this by always initialize a zero size ptr_ring in open() and do resizing during TUNSETIFF, and then we can safely do cleanup during close(). With this, there's no need for the workaround that was introduced by commit 4df0bfc7 ("tun: fix a memory leak for tfile->tx_array"). Backport Note :- Comparison with the upstream patch: [1] A "semantic revert" of the changes made in 4df0bfc7("tun: fix a memory leak for tfile->tx_array"). 4df0bfc7 was applied upstream, and then skb array was changed to use ptr_ring. The upstream patch then removes the changes introduced by 4df0bfc7. This backport does the same; "revert" the changes made by 4df0bfc7. [2] xdp_rxq_info_unreg() being called in relevant locations As xdp_rxq_info related patches are not present in 4.14, these changes are not needed in the backport. [3] An instance of ptr_ring_init needs to be replaced by skb_array_init Inside tun_attach() [4] ptr_ring_cleanup needs to be replaced by skb_array_cleanup Inside tun_chr_close() Note that the backport for 7063efd3 ("tuntap: fix use after free during release") needs to be applied on top of this patch. Reported-by: syzbot+e8b902c3c3fadf0a9dba@syzkaller.appspotmail.com Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Michael S. Tsirkin <mst@redhat.com> Fixes: 1576d986 ("tun: switch to use skb array for tx") Signed-off-by:
Jason Wang <jasowang@redhat.com> Acked-by:
Michael S. Tsirkin <mst@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Zubin Mithra <zsm@chromium.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Wei Yongjun authored
commit 7233982a upstream. Fix to return error code -ENOMEM from the kmem_cache_alloc() error handling case instead of 0, as done elsewhere in this function. Fixes: f78e5623 ("ubi: fastmap: Erase outdated anchor PEBs during attach") Signed-off-by:
Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by:
Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by:
Richard Weinberger <richard@nod.at> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Taehee Yoo authored
commit 5d407b07 upstream A kernel crash occurrs when defragmented packet is fragmented in ip_do_fragment(). In defragment routine, skb_orphan() is called and skb->ip_defrag_offset is set. but skb->sk and skb->ip_defrag_offset are same union member. so that frag->sk is not NULL. Hence crash occurrs in skb->sk check routine in ip_do_fragment() when defragmented packet is fragmented. test commands: %iptables -t nat -I POSTROUTING -j MASQUERADE %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000 splat looks like: [ 261.069429] kernel BUG at net/ipv4/ip_output.c:636! [ 261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3 [ 261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600 [ 261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c [ 261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202 [ 261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004 [ 261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8 [ 261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395 [ 261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4 [ 261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000 [ 261.174169] FS: 00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000 [ 261.183012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0 [ 261.198158] Call Trace: [ 261.199018] ? dst_output+0x180/0x180 [ 261.205011] ? save_trace+0x300/0x300 [ 261.209018] ? ip_copy_metadata+0xb00/0xb00 [ 261.213034] ? sched_clock_local+0xd4/0x140 [ 261.218158] ? kill_l4proto+0x120/0x120 [nf_conntrack] [ 261.223014] ? rt_cpu_seq_stop+0x10/0x10 [ 261.227014] ? find_held_lock+0x39/0x1c0 [ 261.233008] ip_finish_output+0x51d/0xb50 [ 261.237006] ? ip_fragment.constprop.56+0x220/0x220 [ 261.243011] ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack] [ 261.250152] ? rcu_is_watching+0x77/0x120 [ 261.255010] ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4] [ 261.261033] ? nf_hook_slow+0xb1/0x160 [ 261.265007] ip_output+0x1c7/0x710 [ 261.269005] ? ip_mc_output+0x13f0/0x13f0 [ 261.273002] ? __local_bh_enable_ip+0xe9/0x1b0 [ 261.278152] ? ip_fragment.constprop.56+0x220/0x220 [ 261.282996] ? nf_hook_slow+0xb1/0x160 [ 261.287007] raw_sendmsg+0x21f9/0x4420 [ 261.291008] ? dst_output+0x180/0x180 [ 261.297003] ? sched_clock_cpu+0x126/0x170 [ 261.301003] ? find_held_lock+0x39/0x1c0 [ 261.306155] ? stop_critical_timings+0x420/0x420 [ 261.311004] ? check_flags.part.36+0x450/0x450 [ 261.315005] ? _raw_spin_unlock_irq+0x29/0x40 [ 261.320995] ? _raw_spin_unlock_irq+0x29/0x40 [ 261.326142] ? cyc2ns_read_end+0x10/0x10 [ 261.330139] ? raw_bind+0x280/0x280 [ 261.334138] ? sched_clock_cpu+0x126/0x170 [ 261.338995] ? check_flags.part.36+0x450/0x450 [ 261.342991] ? __lock_acquire+0x4500/0x4500 [ 261.348994] ? inet_sendmsg+0x11c/0x500 [ 261.352989] ? dst_output+0x180/0x180 [ 261.357012] inet_sendmsg+0x11c/0x500 [ ... ] v2: - clear skb->sk at reassembly routine.(Eric Dumarzet) Fixes: fa0f5273 ("ip: use rb trees for IP frag queue.") Suggested-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Taehee Yoo <ap420073@gmail.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Oskolkov authored
This patch changes the runtime behavior of IP defrag queue: incoming in-order fragments are added to the end of the current list/"run" of in-order fragments at the tail. On some workloads, UDP stream performance is substantially improved: RX: ./udp_stream -F 10 -T 2 -l 60 TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60 with this patchset applied on a 10Gbps receiver: throughput=9524.18 throughput_units=Mbit/s upstream (net-next): throughput=4608.93 throughput_units=Mbit/s Reported-by:
Willem de Bruijn <willemb@google.com> Signed-off-by:
Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit a4fd284a) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Oskolkov authored
This patch introduces several helper functions/macros that will be used in the follow-up patch. No runtime changes yet. The new logic (fully implemented in the second patch) is as follows: * Nodes in the rb-tree will now contain not single fragments, but lists of consecutive fragments ("runs"). * At each point in time, the current "active" run at the tail is maintained/tracked. Fragments that arrive in-order, adjacent to the previous tail fragment, are added to this tail run without triggering the re-balancing of the rb-tree. * If a fragment arrives out of order with the offset _before_ the tail run, it is inserted into the rb-tree as a single fragment. * If a fragment arrives after the current tail fragment (with a gap), it starts a new "tail" run, as is inserted into the rb-tree at the end as the head of the new run. skb->cb is used to store additional information needed here (suggested by Eric Dumazet). Reported-by:
Willem de Bruijn <willemb@google.com> Signed-off-by:
Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 353c9cb3) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Dan Carpenter authored
We accidentally removed the parentheses here, but they are required because '!' has higher precedence than '&'. Fixes: fa0f5273 ("ip: use rb trees for IP frag queue.") Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 70837ffe) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
commit bffa72cf upstream skb->rbnode shares space with skb->next, skb->prev and skb->tstamp Current uses (TCP receive ofo queue and netem) need to save/restore tstamp, while skb->dev is either NULL (TCP) or a constant for a given queue (netem). Since we plan using an RB tree for TCP retransmit queue to speedup SACK processing with large BDP, this patch exchanges skb->dev and skb->tstamp. This saves some overhead in both TCP and netem. v2: removes the swtstamp field from struct tcp_skb_cb Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by:
Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Geeralize private netem_rb_to_skb() TCP rtx queue will soon be converted to rb-tree, so we will need skb_rbtree_walk() helpers. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 18a4c0ea) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
After working on IP defragmentation lately, I found that some large packets defeat CHECKSUM_COMPLETE optimization because of NIC adding zero paddings on the last (small) fragment. While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed to CHECKSUM_NONE, forcing a full csum validation, even if all prior fragments had CHECKSUM_COMPLETE set. We can instead compute the checksum of the part we are trimming, usually smaller than the part we keep. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 88078d98) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Florian Westphal authored
don't bother with pathological cases, they only waste cycles. IPv6 requires a minimum MTU of 1280 so we should never see fragments smaller than this (except last frag). v3: don't use awkward "-offset + len" v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68). There were concerns that there could be even smaller frags generated by intermediate nodes, e.g. on radio networks. Cc: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 0ed4229b) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Oskolkov authored
Tested: see the next patch is the series. Suggested-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Peter Oskolkov <posk@google.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 385114de) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
As measured in my prior patch ("sch_netem: faster rb tree removal"), rbtree_postorder_for_each_entry_safe() is nice looking but much slower than using rb_next() directly, except when tree is small enough to fit in CPU caches (then the cost is the same) Also note that there is not even an increase of text size : $ size net/core/skbuff.o.before net/core/skbuff.o text data bss dec hex filename 40711 1298 0 42009 a419 net/core/skbuff.o.before 40711 1298 0 42009 a419 net/core/skbuff.o From: Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 7c90584c) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Oskolkov authored
This behavior is required in IPv6, and there is little need to tolerate overlapping fragments in IPv4. This change simplifies the code and eliminates potential DDoS attack vectors. Tested: ran ip_defrag selftest (not yet available uptream). Suggested-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Peter Oskolkov <posk@google.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Acked-by:
Stephen Hemminger <stephen@networkplumber.org> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 7969e5c4) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches, since linker might place next to it a non zero value preventing a change to ip6frag_low_thresh. ip6frag_low_thresh is not used anymore in the kernel, but we do not want to prematuraly break user scripts wanting to change it. Since specifying a minimal value of 0 for proc_doulongvec_minmax() is moot, let's remove these zero values in all defrag units. Fixes: 6e00f7dd ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
Maciej Żenczykowski <maze@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 3d234012) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately this integer is currently in a different cache line than skb->next, meaning that we use two cache lines per skb when finding the insertion point. By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields in a single cache line and save precious memory bandwidth. Note that after the fast path added by Changli Gao in commit d6bebca9 ("fragment: add fast path for in-order fragments") this change wont help the fast path, since we still need to access prev->len (2nd cache line), but will show great benefits when slow path is entered, since we perform a linear scan of a potentially long list. Also, note that this potential long list is an attack vector, we might consider also using an rb-tree there eventually. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit bf663371) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Put the read-mostly fields in a separate cache line at the beginning of struct netns_frags, to reduce false sharing noticed in inet_frag_kill() Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit c2615cf5) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
While under frags DDOS I noticed unfortunate false sharing between @nelems and @params.automatic_shrinking Move @nelems at the end of struct rhashtable so that first cache line is shared between all cpus, because almost never dirtied. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit e5d672a0) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Make it similar to IPv4 ip_expire(), and release the lock before calling icmp functions. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 05c0b86b) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
An skb_clone() was added in commit ec4fbd64 ("inet: frag: release spinlock before calling icmp_send()") While fixing the bug at that time, it also added a very high cost for DDOS frags, as the ICMP rate limit is applied after this expensive operation (skb_clone() + consume_skb(), implying memory allocations, copy, and freeing) We can use skb_get(head) here, all we want is to make sure skb wont be freed by another cpu. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 1eec5d56) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonnably well under pressure. Current memory tracking is using one atomic_t and integers. Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches. Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true : if (... || frag_mem_limit(nf) > nf->high_thresh) Tested: $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh <frag DDOS> $ grep FRAG /proc/net/sockstat FRAG: inuse 14705885 memory 16000002880 $ nstat -n ; sleep 1 ; nstat | grep Reas IpReasmReqds 3317150 0.0 IpReasmFails 3317112 0.0 Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 3e67f106) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
This function is obsolete, after rhashtable addition to inet defrag. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 2d44ed22) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
This refactors ip_expire() since one indentation level is removed. Note: in the future, we should try hard to avoid the skb_clone() since this is a serious performance cost. Under DDOS, the ICMP message wont be sent because of rate limits. Fact that ip6_expire_frag_queue() does not use skb_clone() is disturbing too. Presumably IPv6 should have the same issue than the one we fixed in commit ec4fbd64 ("inet: frag: release spinlock before calling icmp_send()") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 399d1404) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem() Also since we use rhashtable we can bring back the number of fragments in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was removed in commit 434d3054 ("inet: frag: don't account number of fragment queues") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 6befe4a7) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Some applications still rely on IP fragmentation, and to be fair linux reassembly unit is not working under any serious load. It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!) A work queue is supposed to garbage collect items when host is under memory pressure, and doing a hash rebuild, changing seed used in hash computations. This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if host is under fire. Then there is the problem of sharing this hash table for all netns. It is time to switch to rhashtables, and allocate one of them per netns to speedup netns dismantle, since this is a critical metric these days. Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations. Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS. After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout) $ grep FRAG /proc/net/sockstat FRAG: inuse 1966916 memory 2140004608 A followup patch will change the limits for 64bit arches. Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Florian Westphal <fw@strlen.de> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 648700f7) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
Rehashing and destroying large hash table takes a lot of time, and happens in process context. It is safe to add cond_resched() in rhashtable_rehash_table() and rhashtable_free_and_destroy() Signed-off-by:
Eric Dumazet <edumazet@google.com> Acked-by:
Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit ae6da1f5) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
IPv4 was changed in commit 52a773d6 ("net: Export ip fragment sysctl to unprivileged users") The only sysctl that is not per-netns is not used : ip6frag_secret_interval Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Nikolay Borisov <kernel@kyup.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 18dcbe12) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
We want to call lowpan_net_frag_init() earlier. Similar to commit "inet: frags: refactor ipv6_frag_init()" This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 807f1844) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
We want to call inet_frags_init() earlier. This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 5b975bab) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Kees Cook authored
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Cc: linux-wpan@vger.kernel.org Cc: netdev@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: coreteam@netfilter.org Signed-off-by:
Kees Cook <keescook@chromium.org> Acked-by: Stefan Schmidt <stefan@osg.samsung.com> # for ieee802154 Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 78802011) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
We need to call inet_frags_init() before register_pernet_subsys(), as a prereq for following patch ("inet: frags: use rhashtables for reassembly units") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 483a6e4f) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
In order to simplify the API, add a pointer to struct inet_frags. This will allow us to make things less complex. These functions no longer have a struct inet_frags parameter : inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */) ip6_expire_frag_queue(struct net *net, struct frag_queue *fq) Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 093ba729) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
We will soon initialize one rhashtable per struct netns_frags in inet_frags_init_net(). This patch changes the return value to eventually propagate an error. Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit 787bea77) Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-