- 01 Dec, 2017 13 commits
-
-
Heiner Kallweit authored
read_status and config_aneg are the only mandatory callbacks and most of the time the generic implementation is used by drivers. So make the core fall back to the generic version if a driver doesn't implement the respective callback. Also currently the core doesn't seem to verify that drivers implement the mandatory calls. If a driver doesn't do so we'd just get a NPE. With this patch this potential issue doesn't exit any longer. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
William Tu says: ==================== ip6_gre: add erspan native tunnel for ipv6 The patch series add support for ERSPAN tunnel over ipv6. The first patch refectors the existing ipv4 gre implementation and the second refactors the ipv6 gre's xmit code. Finally the last patch introduces erspan protocol. change in v5: - add cover-letter description change in v4: - rebase on top of net-next - use log_ecn_error in ip6_tnl_rcv change in v3: - add inline for functions in header - rebase on top of net-next change in v2: - remove inline - fix some indent - fix errors reports by clang and scan-build ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
William Tu authored
The patch adds support for ERSPAN tunnel over ipv6. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
William Tu authored
This patch refactors the ip6gre_xmit_{ipv4, ipv6}. It is a prep work to add the ip6erspan tunnel. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
William Tu authored
Move two erspan functions to header file, erspan.h, so ipv6 erspan implementation can use it. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Scott Branden says: ==================== net: ethtool: add support for ETH_RESET_AP Add support to reset appplication processors inside SmartNICs by defining new ETH_RESET_AP bit. And use new ETH_RESET_AP bit in bnxt ethernet driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Branden authored
Add ETH_RESET_AP support handling to reset the internal Application Processor(s) of the SmartNIC card. Signed-off-by: Scott Branden <scott.branden@broadcom.com> Acked-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Branden authored
Add ETH_RESET_AP to reset the application processor(s) inside the NIC interface. Current ETH_RESET_MGMT supports a management processor inside this NIC. This is typically used for remote NIC management purposes. Application processors exist inside some SmartNICs to run various applications inside the NIC processor - be it a simple algorithm without an OS to as complex as hosting multiple VMs. Signed-off-by: Scott Branden <scott.branden@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Sowmini Varadhan says: ==================== rds-tcp netns delete related fixes Patchset contains cleanup and bug fixes. Patch 1 is the removal of some redundant code/functions. Patch 2 and 3 are fixes for corner cases identified by syzkaller. I've not been able to reproduce the actual use-after-free race flagged in the syzkaller reports, thus these fixes are based on code inspection plus manual testing to make sure the modified code paths are executed without problems in the commonly encountered timing cases. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
The rds_tcp_kill_sock() function parses the rds_tcp_conn_list to find the rds_connection entries marked for deletion as part of the netns deletion under the protection of the rds_tcp_conn_lock. Since the rds_tcp_conn_list tracks rds_tcp_connections (which have a 1:1 mapping with rds_conn_path), multiple tc entries in the rds_tcp_conn_list will map to a single rds_connection, and will be deleted as part of the rds_conn_destroy() operation that is done outside the rds_tcp_conn_lock. The rds_tcp_conn_list traversal done under the protection of rds_tcp_conn_lock should not leave any doomed tc entries in the list after the rds_tcp_conn_lock is released, else another concurrently executiong netns delete (for a differnt netns) thread may trip on these entries. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Commit 8edc3aff ("rds: tcp: Take explicit refcounts on struct net") introduces a regression in rds-tcp netns cleanup. The cleanup_net(), (and thus rds_tcp_dev_event notification) is only called from put_net() when all netns refcounts go to 0, but this cannot happen if the rds_connection itself is holding a c_net ref that it expects to release in rds_tcp_kill_sock. Instead, the rds_tcp_kill_sock callback should make sure to tear down state carefully, ensuring that the socket teardown is only done after all data-structures and workqs that depend on it are quiesced. The original motivation for commit 8edc3aff ("rds: tcp: Take explicit refcounts on struct net") was to resolve a race condition reported by syzkaller where workqs for tx/rx/connect were triggered after the namespace was deleted. Those worker threads should have been cancelled/flushed before socket tear-down and indeed, rds_conn_path_destroy() does try to sequence this by doing /* cancel cp_send_w */ /* cancel cp_recv_w */ /* flush cp_down_w */ /* free data structures */ Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that we ought to have satisfied the requirement that "socket-close is done after all other dependent state is quiesced". However, rds_conn_shutdown has a bug in that it *always* triggers the reconnect workq (and if connection is successful, we always restart tx/rx workqs so with the right timing, we risk the race conditions reported by syzkaller). Netns deletion is like module teardown- no need to restart a reconnect in this case. We can use the c_destroy_in_prog bit to avoid restarting the reconnect. Fixes: 8edc3aff ("rds: tcp: Take explicit refcounts on struct net") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
A side-effect of Commit c14b0366 ("rds: tcp: set linger to 1 when unloading a rds-tcp") is that we always send a RST on the tcp connection for rds_conn_destroy(), so rds_tcp_conn_paths_destroy() is not needed any more and is removed in this patch. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jon Maloy authored
When sending node local messages the code is using an 'mtu' of 66060 bytes to avoid unnecessary fragmentation. During situations of low memory tipc_msg_build() may sometimes fail to allocate such large buffers, resulting in unnecessary send failures. This can easily be remedied by falling back to a smaller MTU, and then reassemble the buffer chain as if the message were arriving from a remote node. At the same time, we change the initial MTU setting of the broadcast link to a lower value, so that large messages always are fragmented into smaller buffers even when we run in single node mode. Apart from obtaining the same advantage as for the 'fallback' solution above, this turns out to give a significant performance improvement. This can probably be explained with the __pskb_copy() operation performed on the buffer for each recipient during reception. We found the optimal value for this, considering the most relevant skb pool, to be 3744 bytes. Acked-by: Ying Xue <ying.xue@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 30 Nov, 2017 27 commits
-
-
David S. Miller authored
Rafal Ozieblo says: ==================== Receive packets filtering for macb driver This patch series adds support for receive packets filtering for Cadence GEM driver. Packets can be redirect to different hardware queues based on source IP, destination IP, source port or destination port. To enable filtering, support for RX queueing was added as well. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Rafal Ozieblo authored
This patch allows filtering received packets to different hardware queues (aka ntuple). Signed-off-by: Rafal Ozieblo <rafalo@cadence.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Rafal Ozieblo authored
Added statistics per queue: - qX_rx_packets - qX_rx_bytes - qX_rx_dropped - qX_tx_packets - qX_tx_bytes - qX_tx_dropped Signed-off-by: Rafal Ozieblo <rafalo@cadence.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Rafal Ozieblo authored
To be able for packet reception on different RX queues some configuration has to be performed. This patch checks how many hardware queue does GEM support and initializes them. Signed-off-by: Rafal Ozieblo <rafalo@cadence.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Shrikrishna Khare authored
There are several reasons for increasing the receive ring sizes: 1. The original ring size of 256 was chosen about 10 years ago when vmxnet3 was first created. At that time, 10Gbps Ethernet was not prevalent and servers were dominated by 1Gbps Ethernet. Now 10Gbps is common place, and higher bandwidth links -- 25Gbps, 40Gbps, 50Gbps -- are starting to appear. 256 Rx ring entries are simply not enough to keep up with higher link speed when there is a burst of network frames coming from these high speed links. Even with full MTU size frames, they are gone in a short time. It is also more common to have a mix of frame sizes, and more likely bi-modal distribution of frame sizes so the average frame size is not close to full MTU. If we consider average frame size of 800B, 1024 frames that come in a burst takes ~0.65 ms to arrive at 10Gbps. With 256 entires, it takes ~0.16 ms to arrive at 10Gbps. At 25Gbps or 40Gbps, this time is reduced accordingly. 2. On a hypervisor where there are many VMs and CPU is over committed, i.e. the number of VCPUs is more than the number of VCPUs, each PCPU is in effect time shared between multiple VMs/VCPUs. The time granularity at which this multiplexing occurs is typically coarser than between processes on a guest OS. Trying to time slice more finely is not efficient, for example, if memory cache is barely warmed up when switching from one VM to another occurs. This CPU overcommit adds delay to when the driver in a VM can service incoming packets. Whether CPU is over committed really depends on customer workloads. For certain situations, it is very common. For example, workloads of desktop VMs and product testing setups. Consolidation and sharing is what drives efficiency of a customer setup for such workloads. In these situations, the raw network bandwidth may not be very high, but the delays between when a VM is running or not running can also be relatively long. Signed-off-by: Shrikrishna Khare <skhare@vmware.com> Acked-by: Jin Heo <heoj@vmware.com> Acked-by: Guolin Yang <gyang@vmware.com> Acked-by: Boon Ang <bang@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
Utilize the much more capable b53_get_tag_protocol() which takes care of all Broadcom switches specifics to resolve which port can have Broadcom tags enabled or not. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
Since commit e32ea7e7 ("soreuseport: fast reuseport UDP socket selection") and commit c125e80b ("soreuseport: fast reuseport TCP socket selection") the relevant reuseport socket matching the current packet is selected by the reuseport_select_sock() call. The only exceptions are invalid BPF filters/filters returning out-of-range indices. In the latter case the code implicitly falls back to using the hash demultiplexing, but instead of selecting the socket inside the reuseport_select_sock() function, it relies on the hash selection logic introduced with the early soreuseport implementation. With this patch, in case of a BPF filter returning a bad socket index value, we fall back to hash-based selection inside the reuseport_select_sock() body, so that we can drop some duplicate code in the ipv4 and ipv6 stack. This also allows faster lookup in the above scenario and will allow us to avoid computing the hash value for successful, BPF based demultiplexing - in a later patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Linus Walleij authored
This is not supported anymore, devices needing a MAC address just assign one at random, it's just a driver pecularity. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
David Miller says: ==================== net: Significantly shrink the size of routes. Through a combination of several things, our route structures are larger than they need to be. Mostly this stems from having members in dst_entry which are only used by one class of routes. So the majority of the work in this series is about "un-commoning" these members and pushing them into the type specific structures. Unfortunately, IPSEC needed the most surgery. The majority of the changes here had to do with bundle creation and management. The other issue is the refcount alignment in dst_entry. Once we get rid of the not-so-common members, it really opens the door to removing that alignment entirely. I think the new layout looks really nice, so I'll reproduce it here: struct net_device *dev; struct dst_ops *ops; unsigned long _metrics; unsigned long expires; struct xfrm_state *xfrm; int (*input)(struct sk_buff *); int (*output)(struct net *net, struct sock *sk, struct sk_buff *skb); unsigned short flags; short obsolete; unsigned short header_len; unsigned short trailer_len; atomic_t __refcnt; int __use; unsigned long lastuse; struct lwtunnel_state *lwtstate; struct rcu_head rcu_head; short error; short __pad; __u32 tclassid; (This is for 64-bit, on 32-bit the __refcnt comes at the very end) So, the good news: 1) struct dst_entry shrinks from 160 to 112 bytes. 2) struct rtable shrinks from 216 to 168 bytes. 3) struct rt6_info shrinks from 384 to 320 bytes. Enjoy. v2: Collapse some patches logically based upon feedback. Fix the strange patch #7. v3: xfrm_dst_path() needs inline keyword Properly align __refcnt on 32-bit. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Miller authored
There are no more users. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
While building ipsec bundles, blocks of xfrm dsts are linked together using dst->next from bottom to the top. The only thing this is used for is initializing the pmtu values of the xfrm stack, and for updating the mtu values at xfrm_bundle_ok() time. The bundle pmtu entries must be processed in this order so that pmtu values lower in the stack of routes can propagate up to the higher ones. Avoid using dst->next by simply maintaining an array of dst pointers as we already do for the xfrm_state objects when building the bundle. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
We have padding to try and align the refcount on a separate cache line. But after several simplifications the padding has increased substantially. So now it's easy to change the layout to get rid of the padding entirely. We group the write-heavy __refcnt and __use with less often used items such as the rcu_head and the error code. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
The first member of an IPSEC route bundle chain sets it's dst->path to the underlying ipv4/ipv6 route that carries the bundle. Stated another way, if one were to follow the xfrm_dst->child chain of the bundle, the final non-NULL pointer would be the path and point to either an ipv4 or an ipv6 route. This is largely used to make sure that PMTU events propagate down to the correct ipv4 or ipv6 route. When we don't have the top of an IPSEC bundle 'dst->path == dst'. Move it down into xfrm_dst and key off of dst->xfrm. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
The dst->from value is only used by ipv6 routes to track where a route "came from". Any time we clone or copy a core ipv6 route in the ipv6 routing tables, we have the copy/clone's ->from point to the base route. This is used to handle route expiration properly. Only ipv6 uses this mechanism, and only ipv6 code references it. So it is safe to move it into rt6_info. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
XFRM bundle child chains look like this: xdst1 --> xdst2 --> xdst3 --> path_dst All of xdstN are xfrm_dst objects and xdst->u.dst.xfrm is non-NULL. The final child pointer in the chain, here called 'path_dst', is some other kind of route such as an ipv4 or ipv6 one. The xfrm output path pops routes, one at a time, via the child pointer, until we hit one which has a dst->xfrm pointer which is NULL. We can easily preserve the above mechanisms with child sitting only in the xfrm_dst structure. All children in the chain before we break out of the xfrm_output() loop have dst->xfrm non-NULL and are therefore xfrm_dst objects. Since we break out of the loop when we find dst->xfrm NULL, we will not try to dereference 'dst' as if it were an xfrm_dst. Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Miller authored
This will make a future change moving the dst->child pointer less invasive. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
Only IPSEC routes have a non-NULL dst->child pointer. And IPSEC routes are identified by a non-NULL dst->xfrm pointer. Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Miller authored
Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
David Miller authored
Delete it. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <edumazet@google.com>
-
Zhu Yanjun authored
In xmit, it is very impossible that TX_ERROR occurs. So using unlikely optimizes the xmit process. CC: Srinivas Eeda <srinivas.eeda@oracle.com> CC: Joe Jin <joe.jin@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tina Ruchandani authored
net/atm/mpoa_* files use 'struct timeval' to store event timestamps. struct timeval uses a 32-bit seconds field which will overflow in the year 2038 and beyond. Morever, the timestamps are being compared only to get seconds elapsed, so struct timeval which stores a seconds and microseconds field is an overkill. This patch replaces the use of struct timeval with time64_t to store a 64-bit seconds field. Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Colin Ian King authored
There are several statements that have incorrect indentation. Fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Arnd Bergmann authored
timespec is deprecated because of the y2038 overflow, so let's convert this one to ktime_get_ts64(). The code is already safe even on 32-bit architectures, since it uses monotonic times. On 64-bit architectures, nothing changes, while on 32-bit architectures this avoids one type conversion. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Arnd Bergmann authored
netxen_collect_minidump() evidently just wants to get a monotonic timestamp. Using jiffies_to_timespec(jiffies, &ts) is not appropriate here, since it will overflow after 2^32 jiffies, which may be as short as 49 days of uptime. ktime_get_seconds() is the correct interface here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Richard Leitner authored
Previously phy_id was u32 and phy_id_mask was unsigned int. As the phy_id_mask defines the important bits of the phy_id (and is therefore the same size) these two variables should be the same data type. Signed-off-by: Richard Leitner <richard.leitner@skidata.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lukas Wunner authored
No need to reinvent the wheel, we have bus_find_device_by_name(). Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-