- 31 Oct, 2019 33 commits
-
-
Jiri Pirko authored
Don't compute the original base local port during unsplit, rather remember it in mlxsw_sp_port structure during split port creation. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
In Spectrum-3 the modules have 8 lanes, so split by count 2 results in two split ports each of 4 lanes. Add a resource that can be used to obtain local port offset in that case. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
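The lane arithmetic in this message can be sketched in a few lines of C. This is only an illustration of "8 lanes split by 2 gives 4 lanes each"; the function name `split_port_width` is hypothetical and is not taken from the mlxsw driver, and the local port offset resource itself is not modeled.

```c
#include <assert.h>

/* Hypothetical helper, not driver code: number of lanes each split port
 * gets when a module with module_width lanes is split into count ports. */
static unsigned int split_port_width(unsigned int module_width,
                                     unsigned int count)
{
    /* Assumption for this sketch: the split must divide the lanes evenly. */
    assert(count > 0 && module_width % count == 0);
    return module_width / count;
}
```

For the Spectrum-3 case described above, `split_port_width(8, 2)` yields 4.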
-
Jiri Pirko authored
Get local port offsets of split port in a separate helper function and use it in both split and unsplit function. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Driver assumes certain values in the PMLP register. Add checks that verify that PMLP register provides fitting values. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Pass the port mapping structure down to create, module_map and other functions instead of individual values. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Don't use constant max width value and instead of that, use the actual width of the port. Also don't pass module value and use the value stored in the same structure. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Store the initial PMLP register configuration into array of structures instead of just simple array of module numbers. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Currently when user does split, he is not able to distinguish if the port cannot be split because it is already split, or because it cannot be split at all. Add another check for split flag to distinguish this. Also add check forbidding split when maximal width is 1. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
The fact that the port cannot be split further should be checked before checking the count, so move it. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
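The ordering of split checks described in this commit and the one before it might look roughly like the sketch below. The enum, the function name and the exact rule for a valid count are assumptions made for illustration; only the order (already split, then not splittable at all, then count) comes from the messages above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes; the driver reports errors differently. */
enum split_status {
    SPLIT_OK,
    SPLIT_ERR_ALREADY_SPLIT,   /* port is already a split port */
    SPLIT_ERR_NOT_SPLITTABLE,  /* maximal width is 1, nothing to split */
    SPLIT_ERR_BAD_COUNT,       /* requested count does not fit the width */
};

static enum split_status split_check(bool already_split,
                                     unsigned int max_width,
                                     unsigned int count)
{
    assert(max_width > 0);

    /* "Cannot be split further" is checked before looking at the count. */
    if (already_split)
        return SPLIT_ERR_ALREADY_SPLIT;
    if (max_width == 1)
        return SPLIT_ERR_NOT_SPLITTABLE;

    /* Assumed rule for this sketch: count must divide the width evenly. */
    if (count < 2 || count > max_width || max_width % count != 0)
        return SPLIT_ERR_BAD_COUNT;

    return SPLIT_OK;
}
```

With this order, a user who passes a bad count to an already split port is told about the split state first, which is the distinction the commit message asks for.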
-
Jiri Pirko authored
Currently the max module width is hard-coded according to ASIC type. That is not entirely correct, as the max module width might differ per-board. Use PMTM register to query FW for maximal width of a module. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
The PMTM allows query or configuration of module types. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
The tx/rx lane fields got extended to 4 bits, update the reg field description accordingly. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christophe JAILLET authored
Use '__skb_queue_purge()' instead of re-implementing it. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Vlad Buslov says:

====================
Control action percpu counters allocation by netlink flag

Currently, a significant fraction of CPU time during TC filter allocation is spent in the percpu allocator. Moreover, the percpu allocator is protected with a single global mutex, which negates any potential to improve its performance by means of recent developments in the TC filter update API that removed the rtnl lock for some Qdiscs and classifiers. In order to significantly improve filter update rate and reduce memory usage we would like to allow users to skip percpu counters allocation for a specific action if they don't expect a high traffic rate hitting the action, which is a reasonable expectation for a hardware-offloaded setup. In that case any potential gains to software fast-path performance from percpu-allocated counters, compared to regular integer counters protected by a spinlock, are not important, but the amount of additional CPU and memory consumed by them is significant.

In order to allow configuring action counters allocation type at runtime, implement the following changes:

- Implement helper functions to update the action counters and use them in affected actions instead of updating counters directly. This step abstracts the actions implementation from the counter types that are being used for a particular action instance at runtime.

- Modify the new helpers to use percpu counters if they were allocated during action initialization and use regular counters otherwise.

- Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update hardware-offloaded actions to not allocate percpu counters when the flag is set.

With these changes, users that prefer action update slow-path speed over software fast-path speed can dynamically request actions to skip percpu counters allocation without affecting other users.

Now, let's look at the actual performance gains provided by this change. A simple test is used to measure insertion rate: iproute2 TC is executed in parallel by xargs in batch mode, and its total execution time is measured by the shell builtin "time" command. The command runs 20 concurrent tc instances, each with its own batch file with 100k rules:

    $ time ls add* | xargs -n 1 -P 20 sudo tc -b

Two main rule profiles are tested. The first is a simple L2 flower classifier with a single gact drop action. The configuration is chosen as the worst case scenario because with single-action rules pressure on the percpu allocator is minimized. Example rule:

    filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop

The second profile is a typical real-world scenario that uses a flower classifier with some L2-4 fields and two actions (tunnel_key+mirred). Example rule:

    filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip 192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port 1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port 4789 action mirred egress redirect dev vxlan1

     Profile           |    percpu     |   no_percpu   | X improvement
                       | (k rules/sec) | (k rules/sec) |
    -------------------+---------------+---------------+---------------
     Gact drop         |      203      |      259      |     1.28
     tunnel_key+mirred |       92      |      204      |     2.22

For the simple drop action, removing percpu allocation leads to ~25% insertion rate improvement. Perf profiles highlight the bottlenecks.

Perf profile of run with percpu allocation (gact drop):

    + 89.11%  0.48%  tc  [kernel.vmlinux]  [k] entry_SYSCALL_64
    + 88.58%  0.04%  tc  [kernel.vmlinux]  [k] do_syscall_64
    + 87.50%  0.04%  tc  libc-2.29.so      [.] __libc_sendmsg
    + 86.96%  0.04%  tc  [kernel.vmlinux]  [k] __sys_sendmsg
    + 86.85%  0.01%  tc  [kernel.vmlinux]  [k] ___sys_sendmsg
    + 86.60%  0.05%  tc  [kernel.vmlinux]  [k] sock_sendmsg
    + 86.55%  0.12%  tc  [kernel.vmlinux]  [k] netlink_sendmsg
    + 86.04%  0.13%  tc  [kernel.vmlinux]  [k] netlink_unicast
    + 85.42%  0.03%  tc  [kernel.vmlinux]  [k] netlink_rcv_skb
    + 84.68%  0.04%  tc  [kernel.vmlinux]  [k] rtnetlink_rcv_msg
    + 84.56%  0.24%  tc  [kernel.vmlinux]  [k] tc_new_tfilter
    + 75.73%  0.65%  tc  [cls_flower]      [k] fl_change
    + 71.30%  0.03%  tc  [kernel.vmlinux]  [k] tcf_exts_validate
    + 71.27%  0.13%  tc  [kernel.vmlinux]  [k] tcf_action_init
    + 71.06%  0.01%  tc  [kernel.vmlinux]  [k] tcf_action_init_1
    + 70.41%  0.04%  tc  [act_gact]        [k] tcf_gact_init
    + 53.59%  1.21%  tc  [kernel.vmlinux]  [k] __mutex_lock.isra.0
    + 52.34%  0.34%  tc  [kernel.vmlinux]  [k] tcf_idr_create
    - 51.23%  2.17%  tc  [kernel.vmlinux]  [k] pcpu_alloc
       - 49.05% pcpu_alloc
          + 39.35% __mutex_lock.isra.0
            4.99% memset_erms
          + 2.16% pcpu_alloc_area
       + 2.17% __libc_sendmsg
    + 45.89% 44.33%  tc  [kernel.vmlinux]  [k] osq_lock
    +  9.94%  0.04%  tc  [kernel.vmlinux]  [k] tcf_idr_check_alloc
    +  7.76%  0.00%  tc  [kernel.vmlinux]  [k] tcf_idr_insert
    +  6.50%  0.03%  tc  [kernel.vmlinux]  [k] tfilter_notify
    +  6.24%  6.11%  tc  [kernel.vmlinux]  [k] mutex_spin_on_owner
    +  5.73%  5.32%  tc  [kernel.vmlinux]  [k] memset_erms
    +  5.31%  0.18%  tc  [kernel.vmlinux]  [k] tcf_fill_node

Here the bottleneck is clearly in the pcpu_alloc() function, which takes more than half of the CPU time, mostly wasted busy-waiting for the internal percpu allocator global lock.

With percpu allocation removed (gact drop):

    + 87.50%  0.51%  tc  [kernel.vmlinux]  [k] entry_SYSCALL_64
    + 86.94%  0.07%  tc  [kernel.vmlinux]  [k] do_syscall_64
    + 85.75%  0.04%  tc  libc-2.29.so      [.] __libc_sendmsg
    + 85.00%  0.07%  tc  [kernel.vmlinux]  [k] __sys_sendmsg
    + 84.84%  0.07%  tc  [kernel.vmlinux]  [k] ___sys_sendmsg
    + 84.59%  0.01%  tc  [kernel.vmlinux]  [k] sock_sendmsg
    + 84.58%  0.14%  tc  [kernel.vmlinux]  [k] netlink_sendmsg
    + 83.95%  0.12%  tc  [kernel.vmlinux]  [k] netlink_unicast
    + 83.34%  0.01%  tc  [kernel.vmlinux]  [k] netlink_rcv_skb
    + 82.39%  0.12%  tc  [kernel.vmlinux]  [k] rtnetlink_rcv_msg
    + 82.16%  0.25%  tc  [kernel.vmlinux]  [k] tc_new_tfilter
    + 75.13%  0.84%  tc  [cls_flower]      [k] fl_change
    + 69.92%  0.05%  tc  [kernel.vmlinux]  [k] tcf_exts_validate
    + 69.87%  0.11%  tc  [kernel.vmlinux]  [k] tcf_action_init
    + 69.61%  0.02%  tc  [kernel.vmlinux]  [k] tcf_action_init_1
    - 68.80%  0.10%  tc  [act_gact]        [k] tcf_gact_init
       - 68.70% tcf_gact_init
          + 36.08% tcf_idr_check_alloc
          + 31.88% tcf_idr_insert
    + 63.72%  0.58%  tc  [kernel.vmlinux]  [k] __mutex_lock.isra.0
    + 58.80% 56.68%  tc  [kernel.vmlinux]  [k] osq_lock
    + 36.08%  0.04%  tc  [kernel.vmlinux]  [k] tcf_idr_check_alloc
    + 31.88%  0.01%  tc  [kernel.vmlinux]  [k] tcf_idr_insert

The gact actions (like all other action types) are inserted in a single idr instance protected by a global (per namespace) lock that becomes the new bottleneck with such a simple rule profile and prevents achieving the 2x+ performance increase that can be expected by looking at profiling data for inserting an action with percpu counters.

Perf profile of run with percpu allocation (tunnel_key+mirred):

    + 91.95%  0.21%  tc  [kernel.vmlinux]  [k] entry_SYSCALL_64
    + 91.74%  0.06%  tc  [kernel.vmlinux]  [k] do_syscall_64
    + 90.74%  0.01%  tc  libc-2.29.so      [.] __libc_sendmsg
    + 90.52%  0.01%  tc  [kernel.vmlinux]  [k] __sys_sendmsg
    + 90.50%  0.04%  tc  [kernel.vmlinux]  [k] ___sys_sendmsg
    + 90.41%  0.02%  tc  [kernel.vmlinux]  [k] sock_sendmsg
    + 90.38%  0.04%  tc  [kernel.vmlinux]  [k] netlink_sendmsg
    + 90.10%  0.06%  tc  [kernel.vmlinux]  [k] netlink_unicast
    + 89.76%  0.01%  tc  [kernel.vmlinux]  [k] netlink_rcv_skb
    + 89.28%  0.04%  tc  [kernel.vmlinux]  [k] rtnetlink_rcv_msg
    + 89.15%  0.03%  tc  [kernel.vmlinux]  [k] tc_new_tfilter
    + 83.41%  0.33%  tc  [cls_flower]      [k] fl_change
    + 81.17%  0.04%  tc  [kernel.vmlinux]  [k] tcf_exts_validate
    + 81.13%  0.06%  tc  [kernel.vmlinux]  [k] tcf_action_init
    + 81.04%  0.04%  tc  [kernel.vmlinux]  [k] tcf_action_init_1
    - 73.59%  2.16%  tc  [kernel.vmlinux]  [k] pcpu_alloc
       - 71.42% pcpu_alloc
          + 61.41% __mutex_lock.isra.0
            5.02% memset_erms
          + 2.93% pcpu_alloc_area
       + 2.16% __libc_sendmsg
    + 63.58%  0.17%  tc  [kernel.vmlinux]  [k] tcf_idr_create
    + 63.40%  0.60%  tc  [kernel.vmlinux]  [k] __mutex_lock.isra.0
    + 57.85% 56.38%  tc  [kernel.vmlinux]  [k] osq_lock
    + 46.27%  0.13%  tc  [act_tunnel_key]  [k] tunnel_key_init
    + 34.26%  0.02%  tc  [act_mirred]      [k] tcf_mirred_init
    + 10.99%  0.00%  tc  [kernel.vmlinux]  [k] dst_cache_init
    +  5.32%  5.11%  tc  [kernel.vmlinux]  [k] memset_erms

With two times more actions, pressure on the percpu allocator doubles, so now it takes ~74% of CPU execution time.

With percpu allocation removed (tunnel_key+mirred):

    + 86.02%  0.50%  tc  [kernel.vmlinux]  [k] entry_SYSCALL_64
    + 85.51%  0.12%  tc  [kernel.vmlinux]  [k] do_syscall_64
    + 84.40%  0.03%  tc  libc-2.29.so      [.] __libc_sendmsg
    + 83.84%  0.03%  tc  [kernel.vmlinux]  [k] __sys_sendmsg
    + 83.72%  0.01%  tc  [kernel.vmlinux]  [k] ___sys_sendmsg
    + 83.56%  0.01%  tc  [kernel.vmlinux]  [k] sock_sendmsg
    + 83.50%  0.08%  tc  [kernel.vmlinux]  [k] netlink_sendmsg
    + 83.02%  0.17%  tc  [kernel.vmlinux]  [k] netlink_unicast
    + 82.48%  0.00%  tc  [kernel.vmlinux]  [k] netlink_rcv_skb
    + 81.89%  0.11%  tc  [kernel.vmlinux]  [k] rtnetlink_rcv_msg
    + 81.71%  0.25%  tc  [kernel.vmlinux]  [k] tc_new_tfilter
    + 73.99%  0.63%  tc  [cls_flower]      [k] fl_change
    + 69.72%  0.00%  tc  [kernel.vmlinux]  [k] tcf_exts_validate
    + 69.72%  0.09%  tc  [kernel.vmlinux]  [k] tcf_action_init
    + 69.53%  0.05%  tc  [kernel.vmlinux]  [k] tcf_action_init_1
    + 53.08%  0.91%  tc  [kernel.vmlinux]  [k] __mutex_lock.isra.0
    + 45.52% 43.99%  tc  [kernel.vmlinux]  [k] osq_lock
    - 36.02%  0.21%  tc  [act_tunnel_key]  [k] tunnel_key_init
       - 35.81% tunnel_key_init
          + 15.95% tcf_idr_check_alloc
          + 13.91% tcf_idr_insert
          - 4.70% dst_cache_init
             + 4.68% pcpu_alloc
    + 33.22%  0.04%  tc  [kernel.vmlinux]  [k] tcf_idr_check_alloc
    + 32.34%  0.05%  tc  [act_mirred]      [k] tcf_mirred_init
    + 28.24%  0.01%  tc  [kernel.vmlinux]  [k] tcf_idr_insert
    +  7.79%  0.05%  tc  [kernel.vmlinux]  [k] idr_alloc_u32
    +  7.67%  7.35%  tc  [kernel.vmlinux]  [k] idr_get_free
    +  6.46%  6.22%  tc  [kernel.vmlinux]  [k] mutex_spin_on_owner
    +  5.11%  0.05%  tc  [kernel.vmlinux]  [k] tfilter_notify

With percpu allocation removed, insertion rate is increased by ~120%. Such a rule profile scales much better than the simple single-action one because both types of actions were competing for a single lock in the percpu allocator, but not for the action idr lock, which is per-action. Note that the percpu allocator is still used by dst_cache in tunnel_key actions and consumes 4.68% CPU time. Dst_cache seems like a good opportunity for further insertion rate optimization but is not addressed by this change.

Another improvement provided by this change is significantly reduced memory usage. The test is implemented by sampling the "used memory" value from "vmstat -s" command output. The following table includes memory usage measurements for the same two configurations that were used for measuring insertion rate:

     Profile           | Mem per rule | Mem per rule no_percpu | Less memory used
                       |     (KB)     |          (KB)          |       (KB)
    -------------------+--------------+------------------------+------------------
     Gact drop         |     3.91     |          2.51          |       1.4
     tunnel_key+mirred |     6.73     |          3.91          |       2.8

Results indicate that memory usage of the percpu allocator per action is ~1.4 KB. Note that any measurement of percpu allocator memory usage is inherently tied to the particular setup, since memory usage is linear in the number of cores in the system. It is to be expected that on current top of the line servers percpu allocator memory usage will be 2-5x more than on the 24 CPUs setup that was used for testing.

Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory

Patches applied on top of net-next branch:

    commit 2203cbf2 (net-next)
    Author: Russell King <rmk+kernel@armlinux.org.uk>
    Date:   Tue Oct 15 11:38:39 2019 +0100

        net: sfp: move fwnode parsing into sfp-bus layer

Changes V1 -> V2:

- Include memory measurements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Add basic tests to verify action creation with new fast_init flag for all actions that support the flag. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Extend struct tc_action with new "tcfa_flags" field. Set the field in tcf_idr_create() function and provide new helper tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags value. Update individual hardware-offloaded actions init() to pass their "flags" argument to new helper in order to skip percpu stats allocation when user requested it through flags. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Extend TCA_ACT space with nla_bitfield32 flags. Add TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in tcf_action_init_1() and pass resulting value as additional argument to a_o->init(). Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
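A bitfield32 attribute carries a value and a selector, and validation against an allowed mask follows a simple pattern. The sketch below is a userspace stand-in, assuming the usual value/selector semantics; `struct bitfield32`, `ACT_FLAG_NO_PERCPU_STATS` and both function names are illustrative, and the bit position chosen here is an assumption rather than the UAPI definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for a bitfield32 payload: "selector" marks the bits
 * the sender cares about, "value" carries the values of those bits. */
struct bitfield32 {
    uint32_t value;
    uint32_t selector;
};

/* Illustrative bit position; not copied from the kernel headers. */
#define ACT_FLAG_NO_PERCPU_STATS (1u << 0)
#define ACT_FLAGS_ALLOWED        ACT_FLAG_NO_PERCPU_STATS

/* Reject unknown bits and value bits that were not selected. */
static bool act_flags_valid(const struct bitfield32 *bf)
{
    if (bf->selector & ~ACT_FLAGS_ALLOWED)
        return false;
    if (bf->value & ~ACT_FLAGS_ALLOWED)
        return false;
    if (bf->value & ~bf->selector)
        return false;
    return true;
}

/* Derive a "use percpu stats" boolean from already validated flags. */
static bool act_wants_percpu_stats(const struct bitfield32 *bf)
{
    assert(act_flags_valid(bf));
    return !(bf->value & ACT_FLAG_NO_PERCPU_STATS);
}
```

An attribute with no flags selected keeps the default behaviour (percpu stats allocated), so existing users are unaffected.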
-
Vlad Buslov authored
Modify stats update helper functions introduced in previous patches in this series to fallback to regular tc_action->tcfa_{b|q}stats if cpu stats are not allocated for the action argument. If regular non-percpu allocated counters are in use, then obtain action tcfa_lock while modifying them. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
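The fallback described here can be sketched in userspace C as follows. This is not the kernel helper: the struct layout, `NR_CPUS`, the function names and the C11 `atomic_flag` spinlock standing in for tcfa_lock are all assumptions made so the example is self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4 /* illustrative CPU count */

struct counters {
    uint64_t bytes;
    uint64_t packets;
};

struct action {
    struct counters *cpu_stats; /* NULL when percpu stats were skipped */
    struct counters stats;      /* shared counters, guarded by lock */
    atomic_flag lock;           /* stand-in for the per-action lock */
};

static void action_stats_update(struct action *a, unsigned int cpu,
                                uint64_t bytes)
{
    assert(cpu < NR_CPUS);

    if (a->cpu_stats) {
        /* Percpu path: each CPU owns its slot, so no lock is taken. */
        a->cpu_stats[cpu].bytes += bytes;
        a->cpu_stats[cpu].packets++;
        return;
    }

    /* Fallback path: take the lock around the shared counters. */
    while (atomic_flag_test_and_set_explicit(&a->lock, memory_order_acquire))
        ; /* spin until the lock is free */
    a->stats.bytes += bytes;
    a->stats.packets++;
    atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

/* Sum packets over whichever counter type the action uses. */
static uint64_t action_total_packets(const struct action *a)
{
    uint64_t total = a->stats.packets;

    if (a->cpu_stats)
        for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
            total += a->cpu_stats[cpu].packets;
    return total;
}
```

Callers never need to know which counter type an action instance uses, which is the point of routing all updates through one helper.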
-
Vlad Buslov authored
Previous commit introduced helper function for updating qstats and refactored set of actions to use the helpers, instead of modifying qstats directly. However, one of the affected action exposes its qstats to skb_tc_reinsert(), which then modifies it. Refactor skb_tc_reinsert() to return integer error code and don't increment overlimit qstats in case of error, and use the returned error code in tcf_mirred_act() to manually increment the overlimit counter with new helper function. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Extract common code that increments cpu_qstats counters into standalone act API functions. Change hardware offloaded actions that use percpu counter allocation to use the new functions instead of accessing cpu_qstats directly. This commit doesn't change functionality. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Extract common code that increments cpu_bstats counter into standalone act API function. Change hardware offloaded actions that use percpu counter allocation to use the new function instead of incrementing cpu_bstats directly. This commit doesn't change functionality. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Currently, all implementations of tc_action_ops->stats_update() callback have almost exactly the same implementation of counters update code (besides gact which also updates drop counter). In order to simplify support for using both percpu-allocated and regular action counters depending on run-time flag in following patches, extract action counters update code into standalone function in act API. This commit doesn't change functionality. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christophe JAILLET authored
Use 'skb_queue_purge()' instead of re-implementing it. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
David S. Miller authored
Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2019-10-29

This series contains updates to e1000e, igb, ixgbe and i40e drivers.

Sasha adds support for Intel client platforms Comet Lake and Tiger Lake to the e1000e driver. Also adds a fix for a compiler warning that was recently introduced, when CONFIG_PM_SLEEP is not defined, so wrap the code that requires this kernel configuration to be defined.

Alex fixes a potential race condition between network configuration and power management for e1000e, which is similar to a past issue in the igb driver. Also provided a bit of code cleanup since the driver no longer checks for __E1000_DOWN.

Josh Hunt adds UDP segmentation offload support for igb, ixgbe and i40e.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
zhong jiang authored
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file operation rather than DEFINE_SIMPLE_ATTRIBUTE. It is detected with the help of coccinelle. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
This patch adds glue logic to make pause settings per port configurable via ethtool. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
This parameter has never been used. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
Add downshift support for 88E1145, it uses the same downshift configuration registers as 88E1111. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Matteo Croce says:

====================
ICMP flow improvements

This series improves the flow inspector handling of ICMP packets:

The first two patches just add some comments in the code which would have saved me a few minutes of time, and refactor a piece of code.

The third one adds to the flow inspector the capability to extract the Identifier field, if present, so echo requests and replies are classified as part of the same flow.

The fourth patch uses the function introduced earlier in the bonding driver, so echo replies can be balanced across bonding slaves.

v1 -> v2:
- remove unused struct members
- add a helper to check for the Id field
- use a local flow_dissector_key in the bonding to avoid changing behaviour of the flow dissector
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Matteo Croce authored
The bonding uses the L4 ports to balance flows between slaves. As the ICMP protocol has no ports, those packets are sent all to the same device:

    # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
    # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64

But some ICMP packets have an Identifier field which is used to match packets within sessions, let's use this value in the hash function to balance these packets between bond slaves:

    # ping -qc1 192.168.0.2
    0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64

Also, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP, so we can balance pings encapsulated in a tunnel when using mode encap3+4:

    # ping -q 192.168.1.2 -c1
    0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
    # ping -q 192.168.1.2 -c1
    1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64

Signed-off-by: Matteo Croce <mcroce@redhat.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Matteo Croce authored
The ICMP flow dissector currently parses only the Type and Code fields. Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which is used to correlate packets. Add such a field to flow_dissector_key_icmp and replace skb_flow_get_be16() with a more complex function which populates this field. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
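To make the idea concrete, here is a small standalone sketch of extracting the Identifier only for ICMPv4 types that carry one. It is not the kernel dissector: `struct icmp_key`, `icmp4_type_has_id()` and `icmp4_dissect()` are hypothetical names, only ICMPv4 is handled, and the id is returned in host byte order for readability. The type numbers and header layout follow RFC 792.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ICMPv4 type numbers (RFC 792) whose header carries an Identifier. */
enum {
    ICMP4_ECHO_REPLY      = 0,
    ICMP4_ECHO            = 8,
    ICMP4_TIMESTAMP       = 13,
    ICMP4_TIMESTAMP_REPLY = 14,
};

/* Hypothetical flow key: type, code and, where present, the id. */
struct icmp_key {
    uint8_t  type;
    uint8_t  code;
    uint16_t id; /* host byte order; 0 when the type has no Identifier */
};

static bool icmp4_type_has_id(uint8_t type)
{
    switch (type) {
    case ICMP4_ECHO:
    case ICMP4_ECHO_REPLY:
    case ICMP4_TIMESTAMP:
    case ICMP4_TIMESTAMP_REPLY:
        return true;
    default:
        return false;
    }
}

/* Header layout: type(1) code(1) checksum(2) identifier(2) sequence(2). */
static struct icmp_key icmp4_dissect(const uint8_t *hdr, size_t len)
{
    struct icmp_key key = {0};

    assert(hdr != NULL);
    if (len < 2)
        return key; /* too short to hold type and code */

    key.type = hdr[0];
    key.code = hdr[1];
    if (icmp4_type_has_id(key.type) && len >= 6)
        key.id = (uint16_t)((hdr[4] << 8) | hdr[5]);
    return key;
}
```

Because an echo request and its reply share the same Identifier, a hash over this key can treat them as one flow.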
-
Matteo Croce authored
FLOW_DISSECTOR_KEY_ICMP is checked for every packet, not only ICMP ones. Even if the test overhead is probably negligible, move the ICMP dissector code under the big 'switch(ip_proto)' so it gets called only for ICMP packets. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Matteo Croce authored
Document two pieces of code which can't be understood at a glance. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 30 Oct, 2019 7 commits
-
-
Roman Mashak authored
Two pedit tests were failing due to incorrect operation value in matchPattern, should be 'add' not 'val', so fix it. Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jon Maloy authored
We introduce a feature that works like a combination of TCP_NAGLE and TCP_CORK, but without some of the weaknesses of those. In particular, we will not observe long delivery delays because of delayed acks, since the algorithm itself decides if and when acks are to be sent from the receiving peer.

- The nagle property as such is determined by manipulating a new 'maxnagle' field in struct tipc_sock. If certain conditions are met, 'maxnagle' will define max size of the messages which can be bundled. If it is set to zero no messages are ever bundled, implying that the nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more than 4 messages have been sent out without receiving any data message from the peer.
- A socket leaves nagle mode whenever it receives a data message from the peer.

In nagle mode, messages smaller than 'maxnagle' are accumulated in the socket write queue. The last buffer in the queue is marked with a new 'ack_required' bit, which forces the receiving peer to send a CONN_ACK message back to the sender upon reception.

The accumulated contents of the write queue are transmitted when one of the following events or conditions occurs.

- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently no outstanding message where the 'ack_required' bit was set. As a consequence, the first message added after we enter nagle mode is always sent directly with this bit set.

This new feature gives a 50-100% improvement of throughput for small (i.e., less than MTU size) messages, while it might add up to one RTT to latency time when the socket is in nagle mode.

Acked-by: Ying Xue <ying.xue@windreiver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
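The send-side decision described in this commit message can be modeled as a tiny state machine. The sketch below is a simplification under stated assumptions: `struct nagle_sock`, the enum and all function names are hypothetical, the write queue and its size-based flush are not modeled, and clearing the outstanding ack on data reception is a guess. Only the "more than 4 messages", 'maxnagle' and 'ack_required' rules come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NAGLE_ENTER_THRESHOLD 4 /* "more than 4 messages" in the text */

enum send_decision {
    SEND_NOW,              /* not in nagle mode, or message too large */
    SEND_NOW_ACK_REQUIRED, /* sent directly with the ack_required bit */
    QUEUE,                 /* accumulated in the write queue */
};

struct nagle_sock {
    size_t maxnagle;         /* 0 disables bundling */
    unsigned int unanswered; /* sends since the last received data message */
    bool ack_outstanding;    /* an ack_required message awaits its CONN_ACK */
};

static bool nagle_mode(const struct nagle_sock *sk)
{
    return sk->maxnagle != 0 && sk->unanswered > NAGLE_ENTER_THRESHOLD;
}

static enum send_decision nagle_on_send(struct nagle_sock *sk, size_t len)
{
    bool bundle;

    assert(sk != NULL);
    bundle = nagle_mode(sk) && len < sk->maxnagle;
    sk->unanswered++;

    if (!bundle)
        return SEND_NOW;
    if (!sk->ack_outstanding) {
        /* No CONN_ACK to wait for: send directly and request one. */
        sk->ack_outstanding = true;
        return SEND_NOW_ACK_REQUIRED;
    }
    return QUEUE;
}

/* A CONN_ACK arrived: nothing is outstanding (queue flush not modeled). */
static void nagle_on_conn_ack(struct nagle_sock *sk)
{
    sk->ack_outstanding = false;
}

/* A data message from the peer takes the socket out of nagle mode. */
static void nagle_on_data_received(struct nagle_sock *sk)
{
    sk->unanswered = 0;
    sk->ack_outstanding = false; /* assumption of this sketch */
}
```

With 'maxnagle' set to 1000, the first five small sends go out directly, the sixth goes out with the ack request, and later small sends are queued until a CONN_ACK or a data message arrives.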
-
David S. Miller authored
Ido Schimmel says:

====================
mlxsw: Update firmware version

This patch set updates the firmware version for Spectrum-1 and enforces a firmware version for Spectrum-2.

The version adds support for querying port module type. It will be used by a followup patch set from Jiri to make port split code more generic.

Patch #1 increases the size of an existing register in order to be compatible with the new firmware version. In the future the firmware will assign default values to fields not specified by the driver.

Patch #2 temporarily increases the PCI reset timeout for SN3800 systems. Note that in normal cases the driver will need to wait no longer than 5 seconds for the device to become ready following reset command.

Patch #3 bumps the firmware version for Spectrum-1.

Patch #4 enforces a minimum firmware version for Spectrum-2.

v2:
* Added patch #2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
In a similar fashion to Spectrum-1, enforce a specific firmware version for Spectrum-2 so that the driver and firmware are always in sync with regards to new features. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The version adds support for querying port module type. It will be used by a followup patch set from Jiri to make port split code more generic. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
SN3800 Spectrum-2 based systems have gearboxes that need to be initialized by the firmware during its initialization flow. In certain cases, the firmware might need to flash these gearboxes, which is currently a time-consuming process. In newer firmware versions, the firmware will not signal to the driver that it is ready until the gearboxes are flashed. Increase the PCI reset timeout for these situations. In normal cases, the driver will need to wait no longer than 5 seconds. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
In new firmware versions this register is extended with a sampling rate for Spectrum-2 and future ASICs. Increase the size of the register to ensure the field is initialized to 0 which means every packet is mirrored. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-