- 16 Jun, 2016 24 commits
-
-
Raghu Vatsavayi authored
This patch is to refactor packet size calculations to support PTP enabled for 66xx and 68xx cards and also other cards that do not support PTP. Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Raghu Vatsavayi authored
This patch is to add page based buffers for receive side descriptors of the driver and separate free routines for rx and tx buffers. Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Raghu Vatsavayi authored
This patch is to allocate rx queue's memory based on numa node and also use page based buffers for rx traffic improvements. Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Raghu Vatsavayi authored
This patch is to allocate and manage scatter gather lists per input queue(iq's) and remove queue's interdependence. Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Raghu Vatsavayi authored
This patch is to allocate the input queues based on Numa node in tx path and queue mapping changes based on the mapping info provided by firmware. Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Raghu Vatsavayi authored
This patch is to resolve the double free issue by checking proper return values from soft command. Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tom Herbert authored
The algorithm for checksum neutral mapping is incorrect. This problem was being hidden since we were previously always performing checksum offload on the translated addresses and only with IPv6 HW csum. Enabling an ILA router shows the issue. Corrected algorithm: old_loc is the original locator in the packet, new_loc is the value to overwrite with and is found in the lookup table. old_flag is the old flag value (zero of CSUM_NEUTRAL_FLAG) and new_flag is then (old_flag ^ CSUM_NEUTRAL_FLAG) & CSUM_NEUTRAL_FLAG. Need SUM(new_id + new_flag + diff) == SUM(old_id + old_flag) for checksum neutral translation. Solving for diff gives: diff = (old_id - new_id) + (old_flag - new_flag) compute_csum_diff8(new_id, old_id) gives old_id - new_id If old_flag is set old_flag - new_flag = old_flag = CSUM_NEUTRAL_FLAG Else old_flag - new_flag = -new_flag = ~CSUM_NEUTRAL_FLAG Tested: - Implemented a user space program that creates random addresses and random locators to overwrite. Compares the checksum over the address before and after translation (must always be equal) - Enabled ILA router and showed proper operation. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Philip Prindeville authored
In the presence of firewalls which improperly block ICMP Unreachable (including Fragmentation Required) messages, Path MTU Discovery is prevented from working. A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as is done for other payloads like AppleTalk--and doing transparent fragmentation and reassembly. Redux includes the enforcement of mutual exclusion between this feature and Path MTU Discovery as suggested by Alexander Duyck. Cc: Alexander Duyck <alexander.duyck@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Attempting to delete a VRF device with a socket bound to it can stall: unregister_netdevice: waiting for red to become free. Usage count = 1 The unregister is waiting for the dst to be released and with it references to the vrf device. Similar to dst_ifdown switch the dst dev to loopback on delete for all of the dst's for the vrf device and release the references to the vrf device. Fixes: 193125db ("net: Introduce VRF device driver") Fixes: 35402e31 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdmaDavid S. Miller authored
Mellanox shared code between RDMA and net-next trees This is Mellanox mlx5_core shared code for both net-next and RDMA trees for 4.8 kernel cycle.
-
Arnd Bergmann authored
The latest changes to the MDIO code introduced a false-positive warning with gcc-6 (possibly others): drivers/net/phy/mdio-mux.c: In function 'mdio_mux_init': drivers/net/phy/mdio-mux.c:188:3: error: 'parent_bus_node' may be used uninitialized in this function [-Werror=maybe-uninitialized] It's easy to avoid the warning by making sure the parent_bus_node is initialized in both cases at the start of the function, since the later 'of_node_put()' call is also valid for a NULL pointer argument. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: f20e6657 ("mdio: mux: Enhanced MDIO mux framework for integrated multiplexers") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Alexander Aring says: ==================== 6lowpan: introduce 6lowpan-nd David can you please pick-up this patch-serie for your net-next tree? Thanks in advance. This patch series introduces the ndisc ops callback structure to add different handling for IPv6 neighbour discovery cache functionality. It implements at first the two following use-cases: - 6CO handling as userspace option (For all 6LoWPAN layers, BTLE/802.15.4) [0] - short address handling for 802.15.4 6LoWPAN only [1] Since my last patch series, I completely changed the whole ndisc_ops callback structure to not replace the whole ndisc functionality at recv/send level of NS/NA/RS/RA which I send in my previous patch-series "6lowpan: introduce basic 6lowpan-nd". I changed it now to add different handling in a very low-level way of ndisc functionality. The ndisc_ops don't must be registered to dev->ndisc_ops anymore, if they are not set, then no additional ipv6 ndisc handling will be done. This patch series now introduce a complete handling of short address for 802.15.4 6LoWPAN in case of send/recv of NA/NS/RS and RA. In case of RA (receive only) and PIO we also need a second prefix + short-address based address. This callback structure can be used later (I hope) for RFC 6775 [0]. This RFC defines some new option fields and messages for 6LoWPAN-ND. This patch series does not implement RFC 6775 (except we decide now to handle 6CO in userspace). Additional we can use the current ops for parse/fill ndisc options for kernel handled ndisc messages to add 6CIO, see [2]. I tested RA/NS/NA/RS messages with short address which seems to work, what I didn't test is the redirect messages since I don't know how to generate them. The short address for redirect messages are also some special case here, because the short address by a L3 target address lookuped by neighbour cache need to be added. btw: According to [3] sending redirect messages should be also disabled by default on 6lowpan interfaces, but can be activated afterwards. This is maybe something for the ipv6_devconf structure. There is a "accept_redirects" but no "disable_redirects". - Alex [0] https://tools.ietf.org/html/rfc6775 [1] https://tools.ietf.org/html/rfc4944#section-8 [2] https://tools.ietf.org/html/rfc7400 changes since v3: - add acked-by and reviewed-by tags - fix url references in cover-letter - add cover-letter that this patch series is okay to go through net-next tree changes since RFC: - add lowlevel functions __ndisc_opt_addr_space, __ndisc_opt_addr_data and __ndisc_fill_addr_option for corresponding functions which doesn't requires net_device argument. - move ndisc_ops e.g. ndisc_ops_fill_addr_option function call into the corresponding device argument function ndisc_fill_addr_option. (Introduced a special static inline function for redirect handling). - fix error handling in addrconf_prefix_rcv_add_addr. (Please see, introduce new API handling that second address registration (in case of 802.15.4 6LoWPAN) will still be notified if failed, because dev->addr was successful. - add ieee802154 sub-directory in short address entry for 6lowpan UAPI. - add lowpan_802154_is_valid_src_short_addr, because 802.15.4 6lowpan defines the first bit as multicast (don't know how this can be working at the end, because some hardware addresses will handle such addresses in L2 as unicast. See: https://www.iana.org/assignments/_6lowpan-parameters/_6lowpan-parameters.xhtml#_6lowpan-parameters-2 changes since v2: - Introduce ndisc_ops to have our own implementation for dealing with NS/NA which allows also to support RFC6775 (e.g. ARO). - add handling for handling 6CO as userspace option for RA messages in case of 6LoWPAN interfaces. - change lowpan_is_ll to check on linklayer type only. - added some reviewed-by's. - move short addr slaac to net/6lowpan instead ipv6 handling. - add handling for context based address compression in case for short address as link-layer address. - change strategy to use short address, a short address will always be used when it's available. - Handle override flag in NA messages to update short address information or not. ==================== Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch adds necessary handling for use the short address for 802.15.4 6lowpan. It contains support for IPHC address compression and new matching algorithmn to decide which link layer address will be used for 802.15.4 frame. Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
In case of sending RA messages we need some way to get the short address from an 802.15.4 6LoWPAN interface. This patch will add a temporary debugfs entry for experimental userspace api. Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch introduce different 6lowpan handling for receive and transmit NS/NA messages for the ipv6 neighbour discovery. The first use-case is for supporting 802.15.4 short addresses inside the option fields and handling for RFC6775 6CO option field as userspace option. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch exports some neighbour discovery functions which can be used by 6lowpan neighbour discovery ops functionality then. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch introduces neighbour discovery ops callback structure. The idea is to separate the handling for 6LoWPAN into the 6lowpan module. These callback offers 6lowpan different handling, such as 802.15.4 short address handling or RFC6775 (Neighbor Discovery Optimization for IPv6 over 6LoWPANs). Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch moves the functionality to add a RA PIO prefix generated address in an own function. This move prepares to add a hook for adding a second address for a second link-layer address. E.g. short address for 802.15.4 6LoWPAN. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch adds __ndisc_fill_addr_option as low-level function for ndisc_fill_addr_option which doesn't depend on net_device parameter. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch adds __ndisc_opt_addr_data as low-level function for ndisc_opt_addr_data which doesn't depend on net_device parameter. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch adds __ndisc_opt_addr_space as low-level function for ndisc_opt_addr_space which doesn't depend on net_device parameter. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
Since we use exported function from ipv6 kernel module we don't need to request the module anymore to have ipv6 functionality. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch adds the autoconfiguration if a valid 802.15.4 short address is available for 802.15.4 6LoWPAN interfaces. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch will introduce a 6lowpan neighbour private data. Like the interface private data we handle private data for generic 6lowpan and for link-layer specific 6lowpan. The current first use case if to save the short address for a 802.15.4 6lowpan neighbour. Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 15 Jun, 2016 16 commits
-
-
David S. Miller authored
Hariprasad Shenai says: ==================== Add SRIOV configuration via sysfs and few fixes This series adds support to configure SR-IOV via PCI sysfs interface, reduces resource allocation in kdump kernel by disabling offload. Also synchronize unicast and multicast mac address, even in the interface is in Promiscuous mode. This patch series has been created against net-next tree and includes patches on cxgb4 and cxgb4vf driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Even if interface is in Promiscuous mode/Allmulti mode synchronize MAC addresses. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Implement callback in the driver for the new PCI bus driver interface that allows the user to enable/disable SR-IOV virtual functions in a device via the sysfs interface. Deprecate module parameter used to configure SRIOV Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
When is_kdump_kernel() is true, Forcing cxgb4 driver as Master so we can reinitialize the Firmware/Chip. Also reduce memory usage by disabling offload. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says: ==================== net_sched: defer skb freeing while changing qdiscs qdiscs/classes are changed under RTNL protection and often while blocking BH and root qdisc spinlock. When lots of skbs need to be dropped, we free them under these locks causing TX/RX freezes, and more generally latency spikes. I saw spikes of 50+ ms on quite fast hardware... This patch series adds a simple queue protected by RTNL where skbs can be placed until RTNL is released. Note that this might also serve in the future for optional reinjection of packets when a qdisc is replaced. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
pie_change() can use rtnl_qdisc_drop() to benefit from deferred freeing. pie_reset() is already using qdisc_reset_queue() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
rtnl_kfree_skbs() can be used in tfifo_reset() It would be nice if we could iterate through rb tree instead of removing one skb at a time, and build a single skb chain. But this is left for a future patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Both htb_reset() and htb_destroy() can use __qdisc_reset_queue() instead of __skb_queue_purge() to defer skb freeing of internal queues. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Both fq_change() and fq_reset() can use rtnl_kfree_skbs() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
codel_change() can use rtnl_qdisc_drop() to defer expensive skb freeing after locks are released. codel_reset() already has support for deferred skb freeing because it uses qdisc_reset_queue() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
choke_reset() and choke_change() can use rtnl_qdisc_drop() to defer expensive skb freeing after locks are released. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
qdisc are changed under RTNL protection and often while blocking BH and root qdisc spinlock. When lots of skbs need to be dropped, we free them under these locks causing TX/RX freezes, and more generally latency spikes. This commit adds rtnl_kfree_skbs(), used to queue skbs for deferred freeing. Actual freeing happens right after RTNL is released, with appropriate scheduling points. rtnl_qdisc_drop() can also be used in place of disc_drop() when RTNL is held. qdisc_reset_queue() and __qdisc_reset_queue() get the new behavior, so standard qdiscs like pfifo, pfifo_fast... have their ->reset() method automatically handled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
TIPC based clusters are by default set up with full-mesh link connectivity between all nodes. Those links are expected to provide a short failure detection time, by default set to 1500 ms. Because of this, the background load for neighbor monitoring in an N-node cluster increases with a factor N on each node, while the overall monitoring traffic through the network infrastructure increases at a ~(N * (N - 1)) rate. Experience has shown that such clusters don't scale well beyond ~100 nodes unless we significantly increase failure discovery tolerance. This commit introduces a framework and an algorithm that drastically reduces this background load, while basically maintaining the original failure detection times across the whole cluster. Using this algorithm, background load will now grow at a rate of ~(2 * sqrt(N)) per node, and at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will now have to actively monitor 38 neighbors in a 400-node cluster, instead of as before 399. This "Overlapping Ring Supervision Algorithm" is completely distributed and employs no centralized or coordinated state. It goes as follows: - Each node makes up a linearly ascending, circular list of all its N known neighbors, based on their TIPC node identity. This algorithm must be the same on all nodes. - The node then selects the next M = sqrt(N) - 1 nodes downstream from itself in the list, and chooses to actively monitor those. This is called its "local monitoring domain". - It creates a domain record describing the monitoring domain, and piggy-backs this in the data area of all neighbor monitoring messages (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in the cluster eventually (default within 400 ms) will learn about its monitoring domain. - Whenever a node discovers a change in its local domain, e.g., a node has been added or has gone down, it creates and sends out a new version of its node record to inform all neighbors about the change. - A node receiving a domain record from anybody outside its local domain matches this against its own list (which may not look the same), and chooses to not actively monitor those members of the received domain record that are also present in its own list. Instead, it relies on indications from the direct monitoring nodes if an indirectly monitored node has gone up or down. If a node is indicated lost, the receiving node temporarily activates its own direct monitoring towards that node in order to confirm, or not, that it is actually gone. - Since each node is actively monitoring sqrt(N) downstream neighbors, each node is also actively monitored by the same number of upstream neighbors. This means that all non-direct monitoring nodes normally will receive sqrt(N) indications that a node is gone. - A major drawback with ring monitoring is how it handles failures that cause massive network partitionings. If both a lost node and all its direct monitoring neighbors are inside the lost partition, the nodes in the remaining partition will never receive indications about the loss. To overcome this, each node also chooses to actively monitor some nodes outside its local domain. Those nodes are called remote domain "heads", and are selected in such a way that no node in the cluster will be more than two direct monitoring hops away. Because of this, each node, apart from monitoring the member of its local domain, will also typically monitor sqrt(N) remote head nodes. - As an optimization, local list status, domain status and domain records are marked with a generation number. This saves senders from unnecessarily conveying unaltered domain records, and receivers from performing unneeded re-adaptations of their node monitoring list, such as re-assigning domain heads. - As a measure of caution we have added the possibility to disable the new algorithm through configuration. We do this by keeping a threshold value for the cluster size; a cluster that grows beyond this value will switch from full-mesh to ring monitoring, and vice versa when it shrinks below the value. This means that if the threshold is set to a value larger than any anticipated cluster size (default size is 32) the new algorithm is effectively disabled. A patch set for altering the threshold value and for listing the table contents will follow shortly. - This change is fully backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-