Commits · 69e766612c4bcb79e19cebed9eed61d4222c1d47 · Kirill Smelkov / linux

03 Jul, 2017 34 commits

Jiri Benc authored Jul 02, 2017

It's not a good idea to add the same hlist_node to two different hash lists.
This leads to various hard-to-debug memory corruptions.

Fixes: b1be00a6 ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

69e76661
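
To illustrate why double insertion corrupts the lists, here is a minimal userspace sketch. It copies the hlist node layout and the pointer updates of hlist_add_head(); the function second_on_a_after_double_add() and the node names are hypothetical and are not taken from the vxlan code.

```c
#include <stddef.h>

/* Minimal userspace copy of the hlist layout (illustration only). */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Same pointer updates as an hlist head insertion. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/*
 * Hypothetical demo: put "shared" on list A, then on list B, and return
 * the node that now follows it when walking list A.
 */
static struct hlist_node *
second_on_a_after_double_add(struct hlist_node *a1, struct hlist_node *b1,
			     struct hlist_node *shared)
{
	struct hlist_head a = { NULL }, b = { NULL };

	hlist_add_head(a1, &a);
	hlist_add_head(shared, &a);	/* A: shared -> a1 */
	hlist_add_head(b1, &b);
	hlist_add_head(shared, &b);	/* overwrites shared->next with b1 */

	return a.first->next;		/* b1: a1 is no longer reachable from A */
}
```

After the second insertion, a walk of list A silently continues into list B, and a1 still holds a pprev pointer into the shared node, so a later removal of a1 would write into a list it no longer belongs to.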

net/mlxfw: Properly handle dependency with non-loadable mlx5 · c1c1d86b

Or Gerlitz authored Jul 02, 2017

If mlx5 is set to be built-in and mlxfw as a module, we
get a link error:

drivers/built-in.o: In function `mlx5_firmware_flash':
(.text+0x5aed72): undefined reference to `mlxfw_firmware_flash'

Since we don't want to mandate selecting mlxfw for mlx5 users, we
use the IS_REACHABLE macro to make sure that a stub is exposed
to the caller.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1c1d86b
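
The stub pattern described above can be sketched in userspace as follows. MLXFW_REACHABLE is a stand-in for IS_REACHABLE(CONFIG_MLXFW), and the function names, signature and -EOPNOTSUPP return value are assumptions made for illustration rather than the driver's actual declarations.

```c
#include <errno.h>

/*
 * Stand-in for IS_REACHABLE(CONFIG_MLXFW). 0 models the failing setup:
 * the caller is built-in while mlxfw is a module.
 */
#ifndef MLXFW_REACHABLE
#define MLXFW_REACHABLE 0
#endif

#if MLXFW_REACHABLE
/* Real implementation lives in the other module and is linkable. */
int mlxfw_flash_sketch(const void *image, unsigned long len);
#else
/* Stub exposed to the caller so the link succeeds without mlxfw. */
static inline int mlxfw_flash_sketch(const void *image, unsigned long len)
{
	(void)image;
	(void)len;
	return -EOPNOTSUPP;
}
#endif

/* Hypothetical caller: compiles and links the same way in both setups. */
static int mlx5_flash_sketch(const void *image, unsigned long len)
{
	return mlxfw_flash_sketch(image, len);
}
```

The caller needs no #ifdef of its own; it only has to cope with the error returned by the stub.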

iucv: Convert sk_wmem_alloc accesses to refcount_t. · b2c9c5df

David S. Miller authored Jul 03, 2017

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b2c9c5df

ctcm_fsms: Convert skb->user accesses to refcount_t · bba5850c

David S. Miller authored Jul 03, 2017

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bba5850c

Merge branch 'bpf-misc-helper-verifier-improvements' · 63d7c880

David S. Miller authored Jul 03, 2017

Daniel Borkmann says:

====================
Misc BPF helper/verifier improvements

Miscellaneous improvements I still had in my queue. The series adds a new
bpf_skb_adjust_room() helper for cls_bpf, exports to fdinfo whether the
tail call array owner is JITed so that iproute2 error reporting can be
improved in that regard, and includes a small cleanup and extension to
trace printk. There are also two verifier patches: one to make the code
around narrower ctx access a bit more straightforward, and one to allow
imm += x operations, which we've seen LLVM generating and the verifier
currently rejecting. We've included patch 6 given that it's rather small
and we ran into it from the LLVM side; it would be great if it could be
queued for stable as well after the merge window. Last but not least,
test cases related to the imm alu improvement are added.

Thanks a lot!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

63d7c880

bpf: add various test cases for verifier selftest · 6d191ed4

Daniel Borkmann authored Jul 02, 2017

Add a couple of verifier test cases for x|imm += pkt_ptr, including the
imm += x extension.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d191ed4

bpf, verifier: add additional patterns to evaluate_reg_imm_alu · 43188702

John Fastabend authored Jul 02, 2017

Currently the verifier does not track imm across alu operations when
the source register is of unknown type. This adds additional pattern
matching to catch this and track imm. We've seen LLVM generating this
pattern while working on cilium.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

43188702

bpf: extend bpf_trace_printk to support %i · 7bda4b40

John Fastabend authored Jul 02, 2017

Currently, bpf_trace_printk does not support the common format
specifier '%i', although vsprintf, which the bpf helper eventually
calls, does. If users are used to '%i' and make use of it,
bpf_trace_printk just returns an error without dumping anything
to the trace pipe. Add support for '%i' to the helper.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

7bda4b40
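
As a rough illustration of the kind of check involved, the sketch below validates a format string against a small whitelist of conversions that includes 'i'. It is a simplified userspace model, not the helper's actual code: fmt_is_allowed() is a hypothetical name, and length modifiers, strings and argument counting are left out.

```c
#include <stdbool.h>

/*
 * Hypothetical whitelist check: accept only %d, %i, %u and %x.
 * Anything else after '%' (including end of string) is rejected.
 */
static bool fmt_is_allowed(const char *fmt)
{
	for (const char *p = fmt; *p; p++) {
		if (*p != '%')
			continue;
		p++;
		switch (*p) {
		case 'd':
		case 'i':	/* the conversion this change is about */
		case 'u':
		case 'x':
			break;
		default:
			return false;
		}
	}
	return true;
}
```

With 'i' missing from such a whitelist, a format like "val=%i\n" would be rejected before anything reaches the trace pipe, which matches the behaviour the commit message describes.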

bpf: export whether tail call has jited owner · 9780c0ab

Daniel Borkmann authored Jul 02, 2017

We already export through fdinfo whether a prog is JITed or not.
A program load can fail when only one of the prog and the tail call
map has the JITed property (both must be JITed, or both not JITed), so
we can facilitate error reporting in loaders like iproute2 by exporting
owner_jited of the tail call map. We already export owner_prog_type
through this facility, so a parser can pick up both for comparison.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9780c0ab

bpf: simplify narrower ctx access · f96da094

Daniel Borkmann authored Jul 02, 2017

This work tries to make the semantics and code around the
narrower ctx access a bit easier to follow. Right now
everything is done inside the .is_valid_access(). Offset
matching is done differently for read/write types: writes
don't support narrower access, so matching only on
offsetof(struct foo, bar) is enough, whereas for the read
case that supports narrower access we must check the range
offsetof(struct foo, bar) ... offsetof(struct foo, bar) +
sizeof(<bar>) - 1 for each of the cases. For read cases of
individual members that don't support narrower access (like
packet pointers or the skb->cb[] case, which has its own narrow
access logic), we check as usual only offsetof(struct foo,
bar), like in the write case. Then, for the case where narrower
access is allowed, we also need to set the aux info for the
access, meaning ctx_field_size and converted_op_size have
to be set. The first is the original field size, e.g. sizeof(<bar>)
as in the above example from the user facing ctx, and the latter
is the target size after the actual rewrite happened, thus
for the kernel facing ctx. Here, too, we need the range match,
and we need to keep convert_ctx_access() and the
converted_op_size set from is_valid_access() in sync when
changing either, as the two are not at the same location.

We can simplify the code a bit: check_ctx_access() becomes
simpler in that we only store ctx_field_size as a meta data
and later in convert_ctx_accesses() we fetch the target_size
right from the location where we do convert. Should the verifier
be misconfigured we do reject for BPF_WRITE cases or target_size
that are not provided. For the subsystems, we always work on
ranges in is_valid_access() and add small helpers for ranges
and narrow access, convert_ctx_accesses() sets target_size
for the relevant instruction.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f96da094
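
The difference between the exact-offset match and the range match can be sketched like this. The struct layout and the helper names are hypothetical and only mirror the "struct foo, bar" placeholder used in the message; they are not the verifier's own helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user facing ctx, standing in for "struct foo". */
struct foo {
	uint32_t len;
	uint32_t bar;
	uint64_t ptr;
};

/* First and last byte offset covered by the member "bar". */
#define FOO_BAR_FIRST offsetof(struct foo, bar)
#define FOO_BAR_LAST \
	(offsetof(struct foo, bar) + sizeof(((struct foo *)0)->bar) - 1)

/* Write case, or reads without narrow access: exact offset only. */
static bool bar_write_off_ok(size_t off)
{
	return off == FOO_BAR_FIRST;
}

/* Read case with narrow access: any offset inside the member. */
static bool bar_read_off_ok(size_t off)
{
	return off >= FOO_BAR_FIRST && off <= FOO_BAR_LAST;
}
```

In this layout bar occupies offsets 4 to 7, so a one-byte read at offset 6 passes the range check while a write at the same offset is rejected.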

bpf: add bpf_skb_adjust_room helper · 2be7e212

Daniel Borkmann authored Jul 02, 2017

This work adds a helper that can be used to adjust net room of an
skb. The helper is generic and can be further extended in future.
Main use case is for having a programmatic way to add/remove room to
v4/v6 header options along with cls_bpf on egress and ingress hook
of the data path. It reuses most of the infrastructure that we added
for the bpf_skb_change_proto() helper, which can be used in nat64
translations. Similarly, the helper only takes care of adjusting the
room; populating the related data and adapting the csum are left to the
BPF program using it.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2be7e212

bpf, net: add skb_mac_header_len helper · 0daf4349

Daniel Borkmann authored Jul 02, 2017

Add a small skb_mac_header_len() helper, similar to the
skb_network_header_len() we have, and replace open coded
places in BPF's bpf_skb_change_proto() helper. Will also
be used in upcoming work.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

0daf4349
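
A minimal sketch of what such a helper computes, assuming the MAC header length is the distance between the MAC and network header offsets. struct skb_sketch is a reduced, hypothetical stand-in for struct sk_buff that keeps only the two offsets involved.

```c
#include <stdint.h>

/* Reduced stand-in for struct sk_buff (illustration only). */
struct skb_sketch {
	uint16_t mac_header;		/* offset of the MAC header */
	uint16_t network_header;	/* offset of the network header */
};

/* Assumed definition: bytes between MAC header and network header. */
static inline uint32_t mac_header_len_sketch(const struct skb_sketch *skb)
{
	return skb->network_header - skb->mac_header;
}
```

For a plain Ethernet frame whose MAC header starts at offset 64 and whose network header starts at offset 78, this yields the usual 14 bytes.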

net: cdc_mbim: apply "NDP to end" quirk to HP lt4132 · a68491f8

Tore Anderson authored Jul 01, 2017

The HP lt4132 LTE/HSPA+ 4G Module (03f0:a31d) is a rebranded Huawei
ME906s-158 device. It, like the ME906s-158, requires the "NDP to end"
quirk for correct operation.
Signed-off-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

a68491f8

Documentation: fix wrong example command · 75674c4c

Matteo Croce authored Jun 30, 2017

In the IPVLAN documentation there is an example command line where the
master and slave interface names are inverted.
Fix the command line and also add the optional `name' keyword to better
describe what the command is doing.

v2: added commit message
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

75674c4c

vxlan: correctly set vxlan->net when creating the device in a netns · 889ce937

Sabrina Dubroca authored Jun 30, 2017

Commit a985343b ("vxlan: refactor verification and application of
configuration") modified vxlan device creation, and replaced the
assignment of vxlan->net to src_net with dev_net(netdev) in ->setup().

But dev_net(netdev) is not the same as src_net. At the time ->setup()
is called, dev_net hasn't been set yet, so we end up creating the
socket for the vxlan device in init_net.

Fix this by bringing back the assignment of vxlan->net during device
creation.

Fixes: a985343b ("vxlan: refactor verification and application of configuration")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

889ce937

Merge branch 'hns-phy-loopback' · c3b99db8

David S. Miller authored Jul 03, 2017

Lin Yun Sheng says:

====================
Add loopback support in phy_driver and hns ethtool fix

This patch set adds set_loopback to phy_driver and uses it to set up loopback
when doing the ethtool phy self_test.

Patch V8:
	Respin the Patch based on net-next

Patch V7:
	1. Add comment why resume the phy in hns_nic_config_phy_loopback.
	2. Fix a typo error in patch description.

Patch V6:
	Fix Or'ing error code in __lb_setup.

Patch V5:
	Removing non loopback related code change.

Patch V4:
	1. Remove c45 checking
	2. Add -ENOTSUPP when function pointer is null,
	   take mutex in phy_loopback.

Patch V3:
	Calling phy_loopback enable and disable in pair in hns mac driver.

Patch V2:
	1. Add phy_loopback in phy_device.c.
	2. Do error checking and do the read and write once in
	   genphy_loopback.
	3. Remove gen10g_loopback in phy_device.c.

Patch V1:
	Initial Submit
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c3b99db8

net: hns: Use phy_driver to setup Phy loopback · 67cd9a99

Lin Yun Sheng authored Jun 30, 2017

Use function set_loopback in phy_driver to setup phy loopback
when doing ethtool self test.
Signed-off-by: Lin Yun Sheng <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67cd9a99

net: phy: Add phy loopback support in net phy framework · f0f9b4ed

Lin Yun Sheng authored Jun 30, 2017

This patch adds set_loopback to phy_driver, which is used by the MAC
driver to enable or disable phy loopback. It also adds a generic
genphy_loopback function, which uses the BMCR loopback bit to enable
or disable loopback.
Signed-off-by: Lin Yun Sheng <linyunsheng@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f0f9b4ed
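
The register manipulation behind a BMCR-based loopback toggle can be sketched as below. The function works on a plain register value instead of doing MDIO reads and writes, and its name and the BMCR_LOOPBACK_BIT macro are hypothetical; the bit position (bit 14, 0x4000) is the loopback bit of the standard MII basic mode control register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Loopback bit of the MII basic mode control register (bit 14). */
#define BMCR_LOOPBACK_BIT 0x4000

/*
 * Return the BMCR value with loopback enabled or disabled, leaving
 * all other bits untouched. A driver would read the register, apply
 * this, and write the result back.
 */
static uint16_t bmcr_set_loopback(uint16_t bmcr, bool enable)
{
	if (enable)
		return bmcr | BMCR_LOOPBACK_BIT;
	return bmcr & (uint16_t)~BMCR_LOOPBACK_BIT;
}
```

Doing the read and the write once, with only the loopback bit changed, keeps settings such as speed and autonegotiation intact.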

net/mlx5: fix memcpy limit? · 6992c6c5

Stephen Rothwell authored Jun 30, 2017

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

6992c6c5

ipv6: dad: don't remove dynamic addresses if link is down · ec8add2a

Sabrina Dubroca authored Jun 29, 2017

Currently, when the link for $DEV is down, this command succeeds but the
address is removed immediately by DAD (1):

    ip addr add 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800

In the same situation, this will succeed and not remove the address (2):

    ip addr add 1111::12/64 dev $DEV
    ip addr change 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800

The comment in addrconf_dad_begin() when !IF_READY makes it look like
this is the intended behavior, but doesn't explain why:

     * If the device is not ready:
     * - keep it tentative if it is a permanent address.
     * - otherwise, kill it.

We clearly cannot prevent userspace from doing (2), but we can make (1)
work consistently with (2).

addrconf_dad_stop() is only called in two cases: if DAD failed, or to
skip DAD when the link is down. In that second case, the fix is to avoid
deleting the address, like we already do for permanent addresses.

Fixes: 3c21edbd ("[IPV6]: Defer IPv6 device initialization until the link becomes ready.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec8add2a

net: cdc_ncm: Reduce memory use when kernel memory low · e1069bbf

Jim Baxter authored Jun 28, 2017

The CDC-NCM driver can require large amounts of memory to create
skb's and this can be a problem when the memory becomes fragmented.

This especially affects embedded systems that have constrained
resources but wish to maximise the throughput of CDC-NCM with 16KiB
NTB's.

The issue is after running for a while the kernel memory can become
fragmented and it needs compacting.
If the NTB allocation is needed before the memory has been compacted
the atomic allocation can fail which can cause increased latency,
large re-transmissions or disconnections depending upon the data
being transmitted at the time.
This situation occurs for less than a second until the kernel has
compacted the memory but the failed devices can take a lot longer to
recover from the failed TX packets.

To ease this temporary situation I modified the CDC-NCM TX path to
temporarily switch into a reduced memory mode which allocates an NTB
that will fit into a USB_CDC_NCM_NTB_MIN_OUT_SIZE (default 2048 Bytes)
sized memory block and only transmit NTB's with a single network frame
until the memory situation is resolved.
Each time this issue occurs we wait for an increasing number of
reduced size allocations before requesting a full size one to not
put additional pressure on a low memory system.

Once the memory is compacted the CDC-NCM data can resume transmitting
at the normal tx_max rate once again.
Signed-off-by: Jim Baxter <jim_baxter@mentor.com>
Reviewed-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1069bbf
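
The back-off described above, where each failure makes the driver stay longer on small allocations before retrying a full-size one, might look roughly like this. It is a userspace model under stated assumptions: the struct, the cap of 10 and the function name are hypothetical, and the allocation result is passed in as a flag instead of calling an allocator.

```c
#include <stdbool.h>
#include <stddef.h>

#define LOW_MEM_MAX_CNT 10	/* arbitrary cap chosen for this sketch */

struct tx_mem_state {
	unsigned int low_mem_max_cnt;	/* back-off length after last failure */
	unsigned int low_mem_val;	/* small allocations left before retry */
};

/*
 * Pick the size of the next NTB allocation. full_alloc_ok models
 * whether a full-size atomic allocation would succeed right now.
 */
static size_t next_ntb_size(struct tx_mem_state *st, size_t full_size,
			    size_t min_size, bool full_alloc_ok)
{
	if (st->low_mem_val > 0) {
		/* Still backing off: do not even try a full-size block. */
		st->low_mem_val--;
		return min_size;
	}
	if (full_alloc_ok) {
		st->low_mem_max_cnt = 0;
		return full_size;
	}
	/* Full-size attempt failed: back off one step longer than before. */
	if (st->low_mem_max_cnt < LOW_MEM_MAX_CNT)
		st->low_mem_max_cnt++;
	st->low_mem_val = st->low_mem_max_cnt;
	return min_size;
}
```

In this model the first failure is followed by one reduced allocation, the second by two, and so on, so a system that is still short of memory sees fewer and fewer full-size requests.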

Merge branch 'qed-Add-iWARP-support-for-QL4xxxx' · 2da95be9

David S. Miller authored Jul 03, 2017

Michal Kalderon says:

====================
qed: Add iWARP support for QL4xxxx

This patch series adds iWARP support to our QL4xxxx networking adapters.
The code changes span across qed and qedr drivers, but this series contains
changes to qed only. Once the series is accepted, the qedr series will
be submitted to the rdma tree.
There is one additional qed patch which enables iWARP; that patch is
delayed until the qedr series is accepted.

The patches were previously sent as an RFC, and these are the first 12
patches in the RFC series:
https://www.spinics.net/lists/linux-rdma/msg51416.html

This series was tested and built against net-next.

MAINTAINERS file is not updated in this PATCH as there is a pending patch
for qedr driver update https://patchwork.kernel.org/patch/9752761.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

2da95be9

qed: Add iWARP support for physical queue allocation · 93c45984

Kalderon, Michal authored Jul 02, 2017

iWARP has different physical queue requirements than RoCE
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

93c45984

qed: Add iWARP protocol support in context allocation · 5d7dc962

Kalderon, Michal authored Jul 02, 2017

When computing how much memory is required for the different hw clients,
the iWARP protocol should be taken into account.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d7dc962

qed: iWARP CM add error handling · 9816b614

Kalderon, Michal authored Jul 02, 2017

This patch introduces error handling for errors that occurred during
connection establishment.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9816b614

qed: iWARP implement disconnect flows · fc4c6065

Kalderon, Michal authored Jul 02, 2017

This patch takes care of active/passive disconnect flows.
Disconnect flows can be initiated remotely, in which case an async event
will arrive from the peer and be indicated to the qedr driver. These
are referred to as exceptions. When a QP is destroyed, it needs to check
that its associated ep has been closed.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc4c6065

qed: iWARP CM add active side connect · 4b0fdd7c

Kalderon, Michal authored Jul 02, 2017

This patch implements the active side connect.
Offload a connection, process MPA reply and send RTR.
In some of the common passive/active functions, the active side
will work in blocking mode.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b0fdd7c

qed: iWARP CM add passive side connect · 456a5849

Kalderon, Michal authored Jul 02, 2017

This patch implements the passive side connect.
It addresses pre-allocating resources, creating a connection
element upon receiving a valid SYN packet, calling the upper layer, and
implementing the accept/reject calls.

Error handling is not part of this patch.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

456a5849

qed: iWARP CM add listener functions and initial SYN processing · 65a91a6c

Kalderon, Michal authored Jul 02, 2017

This patch adds the ability to add and remove listeners and identify
whether the SYN packet received is intended for iWARP or not. If
a listener is not found the SYN packet is posted back to the chip.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

65a91a6c

qed: iWARP CM - setup a ll2 connection for handling SYN packets · b5c29ca7

Kalderon, Michal authored Jul 02, 2017

iWARP handles incoming SYN packets using the ll2 interface. This patch
implements ll2 setup and teardown. Additional ll2 connections will
be used in the future which are not part of this patch series.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b5c29ca7

qed: Add iWARP support in ll2 connections · cc4ad324

Kalderon, Michal authored Jul 02, 2017

Add a new connection type for iWARP ll2 connections for setting
correct ll2 filters and connection type to FW.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc4ad324

qed: Rename some ll2 related defines · 526d1d05

Kalderon, Michal authored Jul 02, 2017

Make some names more generic as they will be used by iWARP too.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

526d1d05

qed: Implement iWARP initialization, teardown and qp operations · 67b40dcc

Kalderon, Michal authored Jul 02, 2017

This patch adds iWARP support for flows that have common code
between RoCE and iWARP, such as initialization, teardown and
qp setup verbs: create, destroy, modify, query.
It introduces the iWARP specific files qed_iwarp.[ch] and
iwarp_common.h
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67b40dcc

qed: Introduce iWARP personality · c851a9dc

Kalderon, Michal authored Jul 02, 2017

iWARP personality introduced the need for differentiating in several
places in the code whether we are RoCE, iWARP or either. This
leads to introducing new macros for querying the personality.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c851a9dc

02 Jul, 2017 1 commit

bpf: fix to bpf_setsockops · a5192c52

Lawrence Brakmo authored Jul 02, 2017

Fixed build error due to misplaced "#ifdef CONFIG_INET" (moved 1
statement up).
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5192c52

01 Jul, 2017 5 commits

Merge branch 'bpf-Add-support-for-sock_ops' · bcdb239b

David S. Miller authored Jul 01, 2017

Lawrence Brakmo says:

====================
bpf: Add support for sock_ops

Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.) and to set
connection parameters such as buffer sizes, initial window, SYN/SYN-ACK
RTOs, etc.

Unlike current BPF program types that expect to be called at a particular
place in the network stack code, SOCK_OPS programs can be called at
different places and use an "op" field to indicate the context. There
are currently two types of operations, those whose effect is through
their return value and those whose effect is through the new
bpf_setsockopt BPF helper function.

Example operands of the first type are:
  BPF_SOCK_OPS_TIMEOUT_INIT
  BPF_SOCK_OPS_RWND_INIT
  BPF_SOCK_OPS_NEEDS_ECN

Example operands of the second type are:
  BPF_SOCK_OPS_TCP_CONNECT_CB
  BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB
  BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB

Current operands are only called during connection establishment so
there should not be any BPF overheads after connection establishment. The
main idea is to use connection information from both hosts, such as IP
addresses and ports, to allow setting of per connection parameters to
optimize the connection's performance.

Although there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.

It uses the existing bpf cgroups infrastructure so the programs can be
attached per cgroup with full inheritance support. Although the bpf cgroup
framework already contains a sock related program type (BPF_PROG_TYPE_CGROUP_SOCK),
I created the new type (BPF_PROG_TYPE_SOCK_OPS) because the existing type
expects to be called only once during the connection's lifetime. In contrast,
the new program type will be called multiple times from different places in the
network stack code. For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set congestion
control, etc. As a result it has an "op" field to specify the type of operation
requested.

This patch set also includes sample BPF programs to demonstrate the different
features.

v2: Formatting changes, rebased to latest net-next

v3: Fixed build issues, changed socket_ops to sock_ops throughout,
    fixed formatting issues, removed the syscall to load sock_ops
    program and added functionality to use existing bpf attach and
    bpf detach system calls, removed reader/writer locks in
    sock_bpfops.c (used when saving sock_ops global program)
    and fixed missing module refcount increment.

v4: Removed global sock_ops program and instead used existing cgroup bpf
    infrastructure to support a new BPF_CGROUP_ATTCH type.

v5: fixed kbuild warning happening in bpf-cgroup.h
    removed automatic conversion to host byte order from some sock_ops
      fields (ipv4 and ipv6 addresses, remote port)
    Added conversion to host byte order in some of the sample programs
    Added to sample BPF program comments about using load_sock_ops to load
    Removed is_req_sock field from bpf_sock_ops_kern and related places,
      using sk_fullsock() instead.

v6: fixes to BPF helper function setsockopt (possible NULL dereferencing, etc.)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bcdb239b
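
The two kinds of operations described in the cover letter can be modelled in plain userspace C as follows. Nothing here is the real BPF API: the enum, the struct and the values are hypothetical stand-ins, and the direct field writes take the place of what a real program would do through the setsockopt helper.

```c
/* Hypothetical op codes, modelled on the two groups listed above. */
enum sketch_op {
	SK_OP_TIMEOUT_INIT,		/* effect through return value */
	SK_OP_RWND_INIT,		/* effect through return value */
	SK_OP_TCP_CONNECT_CB,		/* effect through socket settings */
	SK_OP_ACTIVE_ESTABLISHED_CB	/* effect through socket settings */
};

/* Hypothetical stand-in for the socket state a program may adjust. */
struct sketch_sock {
	int sndbuf;
	int rcvbuf;
};

/* One program, called at several places, dispatching on the op. */
static int sketch_sock_ops(enum sketch_op op, struct sketch_sock *sk)
{
	switch (op) {
	case SK_OP_TIMEOUT_INIT:
		return 10;	/* illustrative timeout value */
	case SK_OP_RWND_INIT:
		return 40;	/* illustrative initial window, in packets */
	case SK_OP_TCP_CONNECT_CB:
	case SK_OP_ACTIVE_ESTABLISHED_CB:
		/* A real program would call the setsockopt helper here. */
		sk->sndbuf = 1500000;
		sk->rcvbuf = 1500000;
		return 0;
	}
	return -1;	/* unknown op: leave everything at its default */
}
```

The first group leaves the socket untouched and only answers a question; the second group ignores the return value and changes connection parameters as a side effect.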

bpf: update tools/include/uapi/linux/bpf.h · 04df41e3

Lawrence Brakmo authored Jun 30, 2017

Update tools/include/uapi/linux/bpf.h to include changes related to new
bpf sock_ops program type.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04df41e3

bpf: Sample bpf program to set sndcwnd clamp · 6c4a01b2

Lawrence Brakmo authored Jun 30, 2017

Sample BPF program, tcp_clamp_kern.c, to demonstrate the use
of setting the sndcwnd clamp. This program assumes that if the
first 5.5 bytes of the hosts' IPv6 addresses are the same, then
the hosts are in the same datacenter and sets sndcwnd clamp to
100 packets, SYN and SYN-ACK RTOs to 10ms and send/receive buffer
sizes to 150KB.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c4a01b2
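
Comparing the first 5.5 bytes (44 bits) of two IPv6 addresses can be sketched as below: five whole bytes plus the upper four bits of the sixth. The function name and the byte-array representation are assumptions for illustration; the sample program itself works on the address words exposed by the sock_ops context.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical check: true if the first 44 bits of both addresses
 * (given in network byte order) are identical.
 */
static bool same_datacenter_sketch(const uint8_t a[16], const uint8_t b[16])
{
	if (memcmp(a, b, 5) != 0)
		return false;			/* first 40 bits differ */
	return (a[5] & 0xf0) == (b[5] & 0xf0);	/* next 4 bits */
}
```

Only the high nibble of the sixth byte takes part in the comparison, so addresses that differ from bit 45 onwards still count as being in the same datacenter.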

bpf: Adds support for setting sndcwnd clamp · 13bf9641

Lawrence Brakmo authored Jun 30, 2017

Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
sets the sndcwnd clamp, an upper bound on the send congestion window. It is
useful to limit the sndcwnd when the hosts are close to each other (small RTT).
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

13bf9641

bpf: Sample BPF program to set initial cwnd · 7bc62e28

Lawrence Brakmo authored Jun 30, 2017

Sample BPF program that assumes hosts are far away (i.e. large RTTs)
and sets initial cwnd and initial receive window to 40 packets,
send and receive buffers to 1.5MB.

In practice there would be a test to ensure the hosts are actually
far enough away.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7bc62e28