- 30 Oct, 2013 1 commit
-
-
Daniel Borkmann authored
Unlike UDP or TCP, SCTP checksums do not cover a pseudo-header. So when the port mapping stays the same, we do not need to recalculate the whole SCTP checksum in software, which is very expensive. Also, as with TCP, take into account when a private helper has mangled the packet; in that case we need to recalculate the checksum even if the ports stay the same. Thanks to Julian Anastasov for feedback on the skb->ip_summed checks; here is a summary of these checks for snat and dnat:

* For snat_handler(), we can see CHECKSUM_PARTIAL from virtual devices and from LOCAL_OUT; otherwise it should be CHECKSUM_UNNECESSARY. In general, snat is more complex: the skb contains the original route and ip_vs_route_me_harder() can change the route after snat_handler(), so for locally generated replies from a local server we cannot preserve the CHECKSUM_PARTIAL mode. It is a chicken-or-egg dilemma: snat_handler() needs the device after rerouting (to check for NETIF_F_SCTP_CSUM), while ip_route_me_harder() wants snat_handler() to put in the new saddr for proper rerouting.

* For dnat_handler(), we should not see CHECKSUM_COMPLETE for SCTP; in fact the small set of drivers that support SCTP offloading return CHECKSUM_UNNECESSARY for a correctly received SCTP csum. We can see CHECKSUM_PARTIAL from the local stack or from virtual drivers. The idea is that SCTP avoids csum calculation if hardware supports offloading. IPVS can change the device after rerouting to the real server, but we can preserve the CHECKSUM_PARTIAL mode if the new device supports offloading too. This works because the skb dst is changed before dnat_handler() and we see the new device. So, the checks in the 'if' part decide whether it is ok to keep CHECKSUM_PARTIAL for the output. If the packet came with CHECKSUM_NONE, we are dealing with an unknown checksum; as we recalculate the sum for the IP header in all cases, it should be safe to use CHECKSUM_UNNECESSARY, and we may forward a wrong checksum in this case (without cp->app). In case of CHECKSUM_UNNECESSARY, the csum was valid on receive.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
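As a rough illustration of the idea (a hedged sketch, not the exact patch), the decision in the NAT handlers boils down to something like the following; the surrounding variables (sctph, cp, sctphoff, payload_csum) and the sctp_nat_csum() helper are taken as given by the context:

    /* Sketch: SCTP's CRC32c covers no pseudo-header, so the checksum only
     * has to be redone when the port actually changed, when a helper
     * (cp->app) mangled the payload, or when hw offload is pending. */
    if (sctph->source != cp->vport || payload_csum ||
        skb->ip_summed == CHECKSUM_PARTIAL) {
            sctph->source = cp->vport;
            sctp_nat_csum(skb, sctph, sctphoff);   /* full recalculation */
    } else {
            skb->ip_summed = CHECKSUM_UNNECESSARY; /* csum still valid   */
    }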
-
- 28 Oct, 2013 1 commit
-
-
Daniel Borkmann authored
If skb_header_pointer() fails, we need to assign a verdict (NF_DROP in this case); otherwise we would leave the verdict from conn_schedule() uninitialized when returning. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
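A minimal sketch of the fixed pattern (names follow the IPVS SCTP protocol handler; treat the exact context as illustrative):

    sh = skb_header_pointer(skb, iph->len, sizeof(_sctph), &_sctph);
    if (sh == NULL) {
            *verdict = NF_DROP;    /* previously left uninitialized      */
            return 0;              /* tell the caller not to continue    */
    }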
-
- 27 Oct, 2013 3 commits
-
-
Florian Westphal authored
Pekka Pietikäinen reports an xt_socket behavioural change after commit 00028aa3 (netfilter: xt_socket: use IP early demux). The reason is that xt_socket no longer does an unconditional sk lookup - it re-uses the existing skb->sk if possible, assuming ->sk was set by IP early demux. However, when netfilter is invoked via the bridge, this can cause 'bogus' sockets to be examined by the match, e.g. a 'tun' device socket. Bridge netfilter should orphan the skb just like the routing path does before invoking the ipv4/ipv6 netfilter hooks to avoid this. Reported-and-tested-by: Pekka Pietikäinen <pp@ee.oulu.fi> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
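Conceptually, the bridge netfilter path gains the same orphaning the routing path already performs before the IPv4/IPv6 hooks run; a hedged sketch (exact placement in br_netfilter is illustrative):

    /* Drop any socket reference attached by the originating device
     * (e.g. a tun socket) so early demux / xt_socket does not act on it. */
    skb_orphan(skb);
    NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, skb->dev, NULL,
            br_nf_pre_routing_finish);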
-
Michael Opdenacker authored
This patch removes a duplicate define from net/netfilter/ipset/ip_set_hash_gen.h Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
-
Jozsef Kadlecsik authored
When the creation of the bitmap types was restructured in ipset, the wrong (too large) memory allocation was copied for the bitmap:port type (netfilter bugzilla id #859). Reported-by: Quentin Armitage <quentin@armitage.org.uk> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
-
- 23 Oct, 2013 1 commit
-
-
Stanislav Fomichev authored
Don't verify the checksum of outgoing packets, because checksum calculation may be deferred to the device.

Without this patch:

    $ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset
    $ time telnet ipv6.google.com 80
    Trying 2a00:1450:4010:c03::67...
    telnet: Unable to connect to remote host: Connection timed out

    real 0m7.201s
    user 0m0.000s
    sys  0m0.000s

With the patch applied:

    $ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset
    $ time telnet ipv6.google.com 80
    Trying 2a00:1450:4010:c03::67...
    telnet: Unable to connect to remote host: Connection refused

    real 0m0.085s
    user 0m0.000s
    sys  0m0.000s

Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
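The gist, sketched as an assumption rather than the literal patch: only insist on a verified TCP checksum for packets that were actually received, since locally generated packets may still have their checksum left to the NIC (CHECKSUM_PARTIAL):

    /* Hedged sketch: the hook number distinguishes locally generated
     * packets (LOCAL_OUT) from received ones. */
    if (hook != NF_INET_LOCAL_OUT &&
        nf_ip6_checksum(oldskb, hook, tcphoff, IPPROTO_TCP)) {
            pr_debug("TCP checksum is invalid\n");
            return;    /* don't send a RST for a corrupt received packet */
    }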
-
- 22 Oct, 2013 2 commits
-
-
Jozsef Kadlecsik authored
It should be possible to initialize the unnamed union directly, but unfortunately it is not:

    /usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c: In function 'hash_netnet4_kadt':
    /usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c:141: error: unknown field 'cidr' specified in initializer

Reported-by: Husnu Demir <hdemir@metu.edu.tr> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Jozsef Kadlecsik authored
Instead of cb->data, use the callback dump args only, and introduce symbolic names instead of plain numbers when accessing the argument members. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 17 Oct, 2013 7 commits
-
-
Gao feng authored
We can now allow users in net namespaces other than the init namespace to operate ipt_CLUSTERIP. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Gao feng authored
Create the proc entries under the ipt_CLUSTERIP directory of the proper net namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Gao feng authored
In order to find a clusterip_config within its net namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Gao feng authored
This lock protects the per-net-namespace clusterip_configs list, so it should be per net namespace too. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Gao feng authored
clusterip_configs should be per net namespace, so that operating on a cluster in one net namespace won't affect other net namespaces. Right now, only the clusterip_configs of the init net namespace can be operated on. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
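Roughly, the per-namespace state ends up looking like the sketch below (field names are illustrative, not the literal patch); the related CLUSTERIP patches in this series then hang the lock and the /proc directory off the same structure:

    /* Hedged sketch of per-net-namespace CLUSTERIP state. */
    struct clusterip_net {
            struct list_head configs;        /* clusterip_config entries  */
            spinlock_t lock;                 /* protects the configs list */
    #ifdef CONFIG_PROC_FS
            struct proc_dir_entry *procdir;  /* /proc/net/ipt_CLUSTERIP   */
    #endif
    };

    static int clusterip_net_id __read_mostly; /* for net_generic() lookup */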
-
Gao feng authored
Create the /proc/net/ipt_CLUSTERIP directory per net namespace. Right now, entries under ipt_CLUSTERIP can only be created in the init net namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Eric Dumazet authored
TCP listener refactoring, part 7 : Use sock_gen_put() instead of xt_socket_put_sk() for future SYN_RECV support. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 15 Oct, 2013 3 commits
-
-
Alexander Frolkin authored
Improve the SH fallback realserver selection strategy. With sh and sh-fallback, if a realserver is down, this attempts to distribute the traffic that would have gone to that server evenly among the remaining servers. Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
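In spirit, the fallback selection probes successive hash buckets starting from the original slot until it finds a usable server, so a dead server's share spreads over all remaining servers; a hedged sketch (bucket layout and the is_unavailable() check are assumptions based on the IPVS SH scheduler):

    for (roffset = 0; roffset < IP_VS_SH_TAB_SIZE; roffset++) {
            offset = (ihash + roffset) % IP_VS_SH_TAB_SIZE;
            dest = rcu_dereference(s->buckets[offset].dest);
            if (dest && !is_unavailable(dest))
                    return dest;   /* first reachable server wins */
    }
    return NULL;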
-
Julian Anastasov authored
commit 578bc3ef ("ipvs: reorganize dest trash") added rcu_barrier() on cleanup to wait for dest users and schedulers like LBLC and LBLCR to put their last dest reference. Using rcu_barrier with many namespaces is problematic. Trying to fix it by freeing the dest with kfree_rcu is not a solution: RCU callbacks can run in parallel and their execution order is random. Fix it by creating a new function, ip_vs_dest_put_and_free(), which is heavier than ip_vs_dest_put(). We will use it only for schedulers like LBLC and LBLCR that can delay their dest release. By default, a dest's refcnt is above 0 while it is present in a service and 0 when it has been deleted but is still on the trash list. Change the dest trash code to use ip_vs_dest_put_and_free(), so that refcnt -1 can be used for freeing. As a result, such checks remain in the slow path and the rcu_barrier() in netns cleanup can be removed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
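The new helper is essentially a heavier put that also frees once the counter goes negative; a sketch of the idea:

    static inline void ip_vs_dest_put_and_free(struct ip_vs_dest *dest)
    {
            /* refcnt 0 means "deleted but still in trash"; dropping below
             * zero means the last delayed user (LBLC/LBLCR) is gone. */
            if (atomic_dec_return(&dest->refcnt) < 0)
                    kfree(dest);
    }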
-
Julian Anastasov authored
It was wrong (too big), but the problem is harmless. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
-
- 10 Oct, 2013 20 commits
-
-
David S. Miller authored
Merge of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next. Jeff Kirsher says: ==================== This series contains updates to i40e only. Alex provides the majority of the patches against i40e, where he cleans up the Tx and Rx queues and aligns the code with the known-good Tx/Rx queue code in the ixgbe driver. Anjali provides an i40e patch to update link events to not print to the log until the device is administratively up. Catherine provides a patch to update the driver version. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
In commit 634fb979 ("inet: includes a sock_common in request_sock") I forgot that the two ports in sock_common do not have the same byte order: skc_dport is __be16 (network order), but skc_num is __u16 (host order). So sparse complains because ir_loc_port (mapped onto skc_num) is considered __u16 while it should be __be16. Let's rename ir_loc_port to ireq->ir_num (by analogy with inet->inet_num) and perform the appropriate htons/ntohs conversions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Catherine Sullivan authored
Update the version number of the driver. Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
This change brings support for 64-bit netstats to the driver. Previously the stats were 64 bit but highly racy, because 64-bit transactions are not atomic on 32-bit systems. This change makes it so that the 64-bit byte and packet stats are reliable on all architectures. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
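The usual kernel pattern for this is u64_stats_sync, which costs nothing on 64-bit and degrades to a seqcount on 32-bit; a hedged sketch of how a driver-side writer and an ndo_get_stats64 reader pair up (the ring/stats names are illustrative, not the i40e layout):

    #include <linux/u64_stats_sync.h>

    /* writer, in the ring clean-up path */
    u64_stats_update_begin(&ring->syncp);
    ring->stats.packets += total_packets;
    ring->stats.bytes   += total_bytes;
    u64_stats_update_end(&ring->syncp);

    /* reader, in .ndo_get_stats64: retry if a writer ran concurrently */
    do {
            start   = u64_stats_fetch_begin_bh(&ring->syncp);
            packets = ring->stats.packets;
            bytes   = ring->stats.bytes;
    } while (u64_stats_fetch_retry_bh(&ring->syncp, start));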
-
Alexander Duyck authored
Allocate the queue pairs individually instead of as a group. This allows for much easier queue management as it is possible to dynamically resize the queues without having to free and allocate the entire block. Ease statistic collection by treating Tx/Rx queue pairs as a single unit. Each pair is allocated together and starts with a Tx queue and ends with an Rx queue. By ordering them this way it is possible to know the Rx offset based on a pointer to the Tx queue. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
This replaces the ring container array with a linked list. The idea is to make the logic much easier to deal with since this will allow us to call a simple helper function from the q_vectors to go through the entire list. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
Allocate the q_vectors individually. The advantage to this is that it allows for easier freeing and allocation. In addition it makes it so that we could do node specific allocations at some point in the future. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
This makes it so that the Tx and Rx byte and packet counts are separated from the rest of the statistics. This allows for better isolation of these stats when we move them into the 64 bit statistics. Simplify things by re-ordering how the stats display in ethtool. Instead of displaying all of the Tx queues as a block, followed by all the Rx queues, the new order is Tx[0], Rx[0], Tx[1], Rx[1], ..., Tx[n], Rx[n]. This reduces the loops and cleans up the display for testing purposes since it is very easy to verify if flow director is doing the right thing as the Tx and Rx queue pair are shown in pairs. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
Implement BQL (byte queue limit) support in i40e. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
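BQL support in a driver generally reduces to two calls: report bytes queued on the transmit path and bytes completed on the clean-up path; a hedged sketch (the queue-index lookup is illustrative):

    /* on the transmit path, after the descriptors are written */
    netdev_tx_sent_queue(netdev_get_tx_queue(netdev, tx_ring->queue_index),
                         skb->len);

    /* on the clean-up path, once descriptors have completed */
    netdev_tx_completed_queue(netdev_get_tx_queue(netdev, tx_ring->queue_index),
                              total_packets, total_bytes);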
-
Alexander Duyck authored
Drop the TXSW Tx flag, which is tested but never set. As a result of this change we can drop a complicated check that always resulted in the final result of i40e_tx_csum being equal to the CHECKSUM_PARTIAL value. As such, we can replace the entire function call with just a check for skb->ip_summed == CHECKSUM_PARTIAL. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
Sync the fast path for i40e_tx_map and i40e_clean_tx_irq so that they are similar to igb and ixgbe:

- Only update the Tx descriptor ring in tx_map
- Make the skb mapping always on the first buffer in the chain
- Drop the use of the MAPPED_AS_PAGE Tx flag
- Only store flags on the first buffer_info structure

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
Avoid directly incrementing next_to_use for multiple reasons. The main reason being that if we directly increment it then it can attain a state where it is equal to the ring count. Technically this is a state it should not be able to reach but the way this is written it now can. This patch pulls the value off into a register and then increments it and writes back either the value or 0 depending on if the value is equal to the ring count. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
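The resulting pattern is the classic bounded increment; a minimal sketch:

    u16 i = tx_ring->next_to_use;

    i++;                                           /* advance a local copy     */
    tx_ring->next_to_use = (i < tx_ring->count) ? i : 0;
                                                   /* never publish i == count */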
-
Alexander Duyck authored
- Drop the mapped_as_page u8 from the Tx buffer info, as it was unused
- Use the DMA unmap accessors for Tx DMA
- Replace checks of DMA with checks of the unmap length to verify whether an unmap is needed
- Update the Tx buffer layout to make it consistent with igb and ixgbe

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
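A hedged sketch of the map/unmap pairing with the DMA unmap accessors (assumes the buffer_info struct declares DEFINE_DMA_UNMAP_ADDR(dma) and DEFINE_DMA_UNMAP_LEN(len); names are illustrative):

    /* mapping */
    dma = dma_map_single(dev, skb->data, size, DMA_TO_DEVICE);
    dma_unmap_len_set(tx_buf, len, size);
    dma_unmap_addr_set(tx_buf, dma, dma);

    /* unmapping: the stored length doubles as "is an unmap needed?" */
    if (dma_unmap_len(tx_buf, len)) {
            dma_unmap_single(dev, dma_unmap_addr(tx_buf, dma),
                             dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
            dma_unmap_len_set(tx_buf, len, 0);
    }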
-
Eric Dumazet authored
sk_pacing_rate is read by sch_fq packet scheduler at any time, with no synchronization, so make sure we update it in a sensible way. ACCESS_ONCE() is how we instruct compiler to not do stupid things, like using the memory location as a temporary variable. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
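The write side then becomes a single annotated store; a sketch (rate stands for the freshly computed pacing value):

    /* sch_fq fetches sk_pacing_rate without any lock, so forbid the
     * compiler from tearing the store or using the field as scratch. */
    ACCESS_ONCE(sk->sk_pacing_rate) = rate;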
-
Eric Dumazet authored
TCP listener refactoring, part 5 : We want to be able to insert request sockets (SYN_RECV) into the main ehash table instead of the per-listener hash table, to allow RCU lookups and remove listener lock contention. This patch includes the needed struct sock_common in front of struct request_sock. This means there is no longer an IPv6-specific inet6_request_sock structure. The following inet_request_sock fields were renamed, as they became macros referencing fields from struct sock_common. The prefix ir_ was chosen to avoid name collisions.

    loc_port -> ir_loc_port
    loc_addr -> ir_loc_addr
    rmt_addr -> ir_rmt_addr
    rmt_port -> ir_rmt_port
    iif      -> ir_iif

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
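After the change the ir_ names are simply aliases onto the embedded sock_common; a hedged, partial sketch of what the macros look like (field mapping shown for a few members only, treat as illustrative):

    struct inet_request_sock {
            struct request_sock     req;
    #define ir_loc_addr             req.__req_common.skc_rcv_saddr
    #define ir_rmt_addr             req.__req_common.skc_daddr
    #define ir_rmt_port             req.__req_common.skc_dport
    #define ir_iif                  req.__req_common.skc_bound_dev_if
            /* ... remaining inet_request_sock fields ... */
    };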
-
Eric Dumazet authored
skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb, typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags. It's relatively easy to extend the skb using a frag_list to allow more frags to be appended onto the last sk_buff. This still builds very efficient skbs and allows reaching 45 MSS per skb (a 45-MSS GRO packet uses one skb plus a frag_list containing 2 additional sk_buffs). High-speed TCP flows benefit from this extension through lower TCP stack cpu usage (fewer packets stored in the receive queue, fewer ACK packets processed). Forwarding setups could be hurt, as such skbs will need to be linearized, although it's not a new problem, as GRO could already provide skbs with a frag_list. We could make the 65536-byte threshold a tunable to mitigate this. (The first time we need to linearize an skb in skb_needs_linearize(), we could lower the tunable to ~16*1460 so that following skb_gro_receive() calls build smaller skbs.) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
baker.zhang authored
This is an enhancement: for the first node in the fib_trie, newpos is 0 and bit is 1. Only a leaf or a node with an unmatched key needs to calculate pos. Signed-off-by: baker.zhang <baker.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Gao feng authored
We can only set up a multicast address for a network device when net_device_ops->ndo_set_rx_mode is not NULL. Some configurations, such as the netfilter cluster match module, need to add a multicast address to a net device. Add a fake ndo_set_rx_mode function to allow this operation. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
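For a device with no real receive filtering (the dummy device is the case at hand), the hook can literally be a no-op; a hedged sketch:

    /* Accepting the multicast list is enough: there is no hardware filter
     * to program, but dev_mc_add() on the device now succeeds. */
    static void set_multicast_list(struct net_device *dev)
    {
    }

    static const struct net_device_ops dummy_netdev_ops = {
            .ndo_set_rx_mode = set_multicast_list,
            /* ... other ops ... */
    };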
-
Alexander Duyck authored
The Tx "completed" stat was part of the original rewrite for detecting Tx hangs. However some time ago in ixgbe I determined that we could just use the packets stat instead. Since then this stat was removed from ixgbe and it serves no purpose in i40e so it can be dropped. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Anjali Singhai authored
Link events should not print to the log until the device is administratively up. Signed-off-by: Anjali Singhai <anjali.singhai@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 09 Oct, 2013 2 commits
-
-
Ajit Khaparde authored
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ajit Khaparde authored
SkyHawk-R can support RoCE. Add code to display RoCE specific counters maintained in hardware. Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-