Commits · b898441f4ece44933af90b116b467f7864dd1ae7 · Kirill Smelkov / linux

02 Mar, 2015 37 commits

Merge branch 'neigh_cleanups' · b898441f

David S. Miller authored Mar 02, 2015

Eric W. Biederman says:

====================
Neighbour table and ax25 cleanups

While looking at the neighbour table to what it would take to allow
using next hops in a different address family than the current packets
I found a partial resolution for my issues and I stumbled upon some
work that makes the neighbour table code easier to understand and
maintain.

Long ago in a much younger kernel ax25 found a hack to use
dev_rebuild_header to transmit it's packets instead of going through
what today is ndo_start_xmit.

When the neighbour table was rewritten into it's current form the ax25
code was such a challenge that arp_broken_ops appeard in arp.c and
neigh_compat_output appeared in neighbour.c to keep the ax25 hack alive.

With a little bit of work I was able to remove some of the hack that
is the ax25 transmit path for ip packets and to isolate what remains
into a slightly more readable piece of code in ax25_ip.c.  Removing the
need for the generic code to worry about ax25 special cases.

After cleaning up the old ax25 hacks I also performed a little bit of
work on neigh_resolve_output to remove the need for a dst entry and to
ensure cached headers get a deterministic protocol value in their cached
header.   This guarantees that a cached header will not be different
depending on which protocol of packet is transmitted, and it allows
packets to be transmitted that don't have a dst entry.  There remains
a small amount of code that takes advantage of when packets have a dst
entry but that is something different.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b898441f

neigh: Don't require a dst in neigh_resolve_output · 435e8eb2

Eric W. Biederman authored Mar 02, 2015

Having a dst helps a little bit for teql but is fundamentally
unnecessary and there are code paths where a dst is not available that
it would be nice to use the neighbour cache.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

435e8eb2

neigh: Don't require dst in neigh_hh_init · bdf53c58

Eric W. Biederman authored Mar 02, 2015

- Add protocol to neigh_tbl so that dst->ops->protocol is not needed
- Acquire the device from neigh->dev

This results in a neigh_hh_init that will cache the samve values
regardless of the packets flowing through it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bdf53c58

arp: Kill arp_find · 59b2af26

Eric W. Biederman authored Mar 02, 2015

There are no more callers so kill this function.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

59b2af26

net: Kill dev_rebuild_header · d476059e

Eric W. Biederman authored Mar 02, 2015

Now that there are no more users kill dev_rebuild_header and all of it's
implementations.

This is long overdue.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d476059e

ax25: Stop depending on arp_find · 945db424

Eric W. Biederman authored Mar 02, 2015

Have ax25_neigh_output perform ordinary arp resolution before calling
ax25_neigh_xmit.

Call dev_hard_header in ax25_neigh_output with a destination address so
it will not fail, and the destination mac address will not need to be
set in ax25_neigh_xmit.

Remove arp_find from ax25_neigh_xmit (the ordinary arp resolution added
to ax25_neigh_output removes the need for calling arp_find).

Document how close ax25_neigh_output is to neigh_resolve_output.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

945db424

ax25: Stop calling/abusing dev_rebuild_header · abb7b755

Eric W. Biederman authored Mar 02, 2015

- Rename ax25_rebuild_header to ax25_neigh_xmit and call it from
  ax25_neigh_output directly.  The rename is to make it clear
  that this is not a rebuild_header operation.

- Remove ax25_rebuild_header from ax25_header_ops.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

abb7b755

neigh: Move neigh_compat_output into ax25_ip.c · def67753

Eric W. Biederman authored Mar 02, 2015

The only caller is now is ax25_neigh_construct so move
neigh_compat_output into ax25_ip.c make it static and rename it
ax25_neigh_output.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

def67753

arp: Remove special case to give AX25 it's open arp operations. · 21bfb8e9

Eric W. Biederman authored Mar 02, 2015

The special case has been pushed out into ax25_neigh_construct so there
is no need to keep this code in arp.c
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21bfb8e9

ax25: Refactor to use private neighbour operations. · 3b6a94be

Eric W. Biederman authored Mar 02, 2015

AX25 already has it's own private arp cache operations to isolate
it's abuse of dev_rebuild_header to transmit packets.  Add a function
ax25_neigh_construct that will allow all of the ax25 devices to
force using these operations, so that the generic arp code does
not need to.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b6a94be

ax25: Make ax25_header and ax25_rebuild_header static · 46d4e47a

Eric W. Biederman authored Mar 02, 2015

The only user is in ax25_ip.c so stop exporting these functions.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46d4e47a

ax25/6pack: Replace sp_header_ops with ax25_header_ops · 204d2dca

Eric W. Biederman authored Mar 02, 2015

The two sets of header operations are functionally identical remove
the duplicate definition.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

204d2dca

ax25/kiss: Replace ax_header_ops with ax25_header_ops · 56fc7e71

Eric W. Biederman authored Mar 02, 2015

The two sets of header operations are functionally identical remove the
duplicate definition.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

56fc7e71

rose: Transmit packets in rose_xmit not rose_rebuild_header · 03ec2ac0

Eric W. Biederman authored Mar 02, 2015

Patterned after the similar code in net/rom this turns out
to be a trivial obviously correct transmformation.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03ec2ac0

rose: Set the destination address in rose_header · b753af31

Eric W. Biederman authored Mar 02, 2015

Not setting the destination address is a bug that I suspect causes no
problems today, as only the arp code seems to call dev_hard_header and
the description I have of rose is that it is expected to be used with a
static neigbour table.

I have derived the offset and the length of the rose destination address
from rose_rebuild_header where arp_find calls neigh_ha_snapshot to set
the destination address.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b753af31

ax25: In ax25_rebuild_header add missing kfree_skb · e18dbd05

Eric W. Biederman authored Mar 01, 2015

In the unlikely (impossible?) event that we attempt to transmit
an ax25 packet over a non-ax25 device free the skb so we don't
leak it.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e18dbd05

ebpf: move CONFIG_BPF_SYSCALL-only function declarations · 61e021f3

Daniel Borkmann authored Mar 02, 2015

Masami noted that it would be better to hide the remaining CONFIG_BPF_SYSCALL-only
function declarations within the BPF header ifdef, w/o else path dummy alternatives
since these functions are not supposed to have a user outside of CONFIG_BPF_SYSCALL.
Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reference: http://article.gmane.org/gmane.linux.kernel.api/8658Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

61e021f3

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 77f0379f

David S. Miller authored Mar 02, 2015

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

A small batch with accumulated updates in nf-next, mostly IPVS updates,
they are:

1) Add 64-bits stats counters to IPVS, from Julian Anastasov.

2) Move NETFILTER_XT_MATCH_ADDRTYPE out of NETFILTER_ADVANCED as docker
seem to require this, from Anton Blanchard.

3) Use boolean instead of numeric value in set_match_v*(), from
coccinelle via Fengguang Wu.

4) Allows rescheduling of new connections in IPVS when port reuse is
detected, from Marcelo Ricardo Leitner.

5) Add missing bits to support arptables extensions from nft_compat,
from Arturo Borrero.

Patrick is preparing a large batch to enhance the set infrastructure,
named expressions among other things, that should follow up soon after
this batch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

77f0379f

filter: refactor common filter attach code into __sk_attach_prog · 49b31e57

Daniel Borkmann authored Mar 02, 2015

Both sk_attach_filter() and sk_attach_bpf() are setting up sk_filter,
charging skmem and attaching it to the socket after we got the eBPF
prog up and ready. Lets refactor that into a common helper.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

49b31e57

Merge branch 'for-upstream' of... · 70c836a4

David S. Miller authored Mar 02, 2015

Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2015-03-02

Here's the first bluetooth-next pull request targeting the 4.1 kernel:

 - ieee802154/6lowpan cleanups
 - SCO routing to host interface support for the btmrvl driver
 - AMP code cleanups
 - Fixes to AMP HCI init sequence
 - Refactoring of the HCI callback mechanism
 - Added shutdown routine for Intel controllers in the btusb driver
 - New config option to enable/disable Bluetooth debugfs information
 - Fix for early data reception on L2CAP fixed channels

Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

70c836a4

Merge branch 'sendmsg_recvmsg_iocb_removal' · b4844353

David S. Miller authored Mar 02, 2015

Ying Xue says:

====================
net: Remove iocb argument from sendmsg and recvmsg

Currently there is only one user - TIPC whose sendmsg() instances
using iocb argument. Meanwhile, there is no user using iocb argument
in its recvmsg() instance. Therefore, if we eliminate the werid usage
of iobc argument from TIPC, the iocb argument can be removed from
all sendmsg() and recvmsg() instances of the whole networking stack.

Reference:
https://patchwork.ozlabs.org/patch/433960/

Changes:

v2:
 * Fix compile errors of DCCP module pointed by David
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b4844353

net: Remove iocb argument from sendmsg and recvmsg · 1b784140

Ying Xue authored Mar 02, 2015

After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1b784140

tipc: Don't use iocb argument in socket layer · 39a0295f

Ying Xue authored Mar 02, 2015

Currently the iocb argument is used to idenfiy whether or not socket
lock is hold before tipc_sendmsg()/tipc_send_stream() is called. But
this usage prevents iocb argument from being dropped through sendmsg()
at socket common layer. Therefore, in the commit we introduce two new
functions called __tipc_sendmsg() and __tipc_send_stream(). When they
are invoked, it assumes that their callers have taken socket lock,
thereby avoiding the weird usage of iocb argument.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

39a0295f

netfilter: nft_compat: add support for arptables extensions · 5f158939

Arturo Borrero authored Feb 16, 2015

This patch adds support to arptables extensions from nft_compat.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

5f158939

Merge branch 'dropcount' · 6556c385

David S. Miller authored Mar 02, 2015

Eyal Birger says:

====================
net: move skb->dropcount to skb->cb[]

Commit 97775007 ("af_packet: add interframe drop cmsg (v6)")
unionized skb->mark and skb->dropcount in order to allow recording
of the socket drop count while maintaining struct sk_buff size.

skb->dropcount was introduced since there was no available room
in skb->cb[] in packet sockets. However, its introduction led to
the inability to export skb->mark to userspace.

It was considered to alias skb->priority instead of skb->mark.
However, that would lead to the inabilty to export skb->priority
to userspace if desired. Such change may also lead to hard-to-find
issues as skb->priority is assumed to be alias free, and, as noted
by Shmulik Ladkani, is not 'naturally orthogonal' with other skb
fields.

This patch series follows the suggestions made by Eric Dumazet moving
the dropcount metric to skb->cb[], eliminating this problem
at the expense of 4 bytes less in skb->cb[] for protocol families
using it.

The patch series include compactization of bluetooth and packet
use of skb->cb[] as well as the infrastructure for placing dropcount
in skb->cb[].
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6556c385

net: move skb->dropcount to skb->cb[] · 744d5a3e

Eyal Birger authored Mar 01, 2015

Commit 97775007 ("af_packet: add interframe drop cmsg (v6)")
unionized skb->mark and skb->dropcount in order to allow recording
of the socket drop count while maintaining struct sk_buff size.

skb->dropcount was introduced since there was no available room
in skb->cb[] in packet sockets. However, its introduction led to
the inability to export skb->mark, or any other aliased field to
userspace if so desired.

Moving the dropcount metric to skb->cb[] eliminates this problem
at the expense of 4 bytes less in skb->cb[] for protocol families
using it.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

744d5a3e

net: add common accessor for setting dropcount on packets · 3bc3b96f

Eyal Birger authored Mar 01, 2015

As part of an effort to move skb->dropcount to skb->cb[], use
a common function in order to set dropcount in struct sk_buff.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3bc3b96f

net: use common macro for assering skb->cb[] available size in protocol families · b4772ef8

Eyal Birger authored Mar 01, 2015

As part of an effort to move skb->dropcount to skb->cb[] use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4772ef8

net: packet: use sockaddr_ll fields as storage for skb original length in recvmsg path · 2472d761

Eyal Birger authored Mar 01, 2015

As part of an effort to move skb->dropcount to skb->cb[], 4 bytes
of additional room are needed in skb->cb[] in packet sockets.

Store the skb original length in the first two fields of sockaddr_ll
(sll_family and sll_protocol) as they can be derived from the skb when
needed.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2472d761

net: rxrpc: change call to sock_recv_ts_and_drops() on rxrpc recvmsg to sock_recv_timestamp() · 2cfdf9fc

Eyal Birger authored Mar 01, 2015

Commit 3b885787 ("net: Generalize socket rx gap / receive queue overflow cmsg")
allowed receiving packet dropcount information as a socket level option.
RXRPC sockets recvmsg function was changed to support this by calling
sock_recv_ts_and_drops() instead of sock_recv_timestamp().

However, protocol families wishing to receive dropcount should call
sock_queue_rcv_skb() or set the dropcount specifically (as done
in packet_rcv()). This was not done for rxrpc and thus this feature
never worked on these sockets.

Formalizing this by not calling sock_recv_ts_and_drops() in rxrpc as
part of an effort to move skb->dropcount into skb->cb[]
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2cfdf9fc

net: bluetooth: compact struct bt_skb_cb by converting boolean fields to bit fields · 6368c235

Eyal Birger authored Mar 01, 2015

Convert boolean fields incoming and req_start to bit fields and move
force_active in order save space in bt_skb_cb in an effort to use
a portion of skb->cb[] for storing skb->dropcount.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6368c235

net: bluetooth: compact struct bt_skb_cb by inlining struct hci_req_ctrl · 49a6fe05

Eyal Birger authored Mar 01, 2015

struct hci_req_ctrl is never used outside of struct bt_skb_cb;
Inlining it frees 8 bytes on a 64 bit system in skb->cb[] allowing
the addition of more ancillary data.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

49a6fe05

pppoe: Use workqueue to die properly when a PADT is received · 287f3a94

Simon Farnsworth authored Mar 01, 2015

When a PADT frame is received, the socket may not be in a good state to
close down the PPP interface. The current implementation handles this by
simply blocking all further PPP traffic, and hoping that the lack of traffic
will trigger the user to investigate.

Use schedule_work to get to a process context from which we clear down the
PPP interface, in a fashion analogous to hangup on a TTY-based PPP
interface. This causes pppd to disconnect immediately, and allows tools to
take immediate corrective action.

Note that pppd's rp_pppoe.so plugin has code in it to disable the session
when it disconnects; however, as a consequence of this patch, the session is
already disabled before rp_pppoe.so is asked to disable the session. The
result is a harmless error message:

Failed to disconnect PPPoE socket: 114 Operation already in progress

This message is safe to ignore, as long as the error is 114 Operation
already in progress; in that specific case, it means that the PPPoE session
has already been disabled before pppd tried to disable it.
Signed-off-by: Simon Farnsworth <simon@farnz.org.uk>
Tested-by: Dan Williams <dcbw@redhat.com>
Tested-by: Christoph Schulz <develop@kristov.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

287f3a94

bnx2: disable toggling of rxvlan if necessary · 26caa346

Ivan Vecera authored Feb 26, 2015

The bnx2 driver uses .ndo_fix_features to force enable of Rx VLAN tag
stripping when the card cannot disable it. The driver should remove
NETIF_F_HW_VLAN_CTAG_RX flag from hw_features instead so it is fixed
for the ethtool.

Cc: Sony Chacko <sony.chacko@qlogic.com>
Cc: Dept-HSGLinuxNICDev@qlogic.com
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26caa346

net: macb: Properly add DMACFG bit definitions · ea373041

Arun Chandran authored Mar 01, 2015

Add *_SIZE macros for the bits ENDIA_DESC and
ENDIA_PKT
Signed-off-by: Arun Chandran <achandran@mvista.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ea373041

net: macb: Add on the fly CPU endianness detection · 62f6924c

Arun Chandran authored Mar 01, 2015

Program management descriptor's access mode according to the
dynamically detected CPU endianness.
Signed-off-by: Arun Chandran <achandran@mvista.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Tested-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62f6924c

Driver: Vmxnet3: Copy TCP header to mapped frame for IPv6 packets · 759c9359

Shrikrishna Khare authored Feb 28, 2015

Allows for packet parsing to be done by the fast path. This performance
optimization already exists for IPv4. Add similar logic for IPv6.
Signed-off-by: Amitabha Banerjee <banerjeea@vmware.com>
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

759c9359

01 Mar, 2015 3 commits

Merge branch 'ebpf_support_for_cls_bpf' · 68932f71

David S. Miller authored Mar 01, 2015

Daniel Borkmann says:

====================
eBPF support for cls_bpf

This is the non-RFC version of my patchset posted before netdev01 [1]
conference. It contains a couple of eBPF cleanups and preparation
patches to get eBPF support into cls_bpf. The last patch adds the
actual support. I'll post the iproute2 parts after the kernel bits
are merged, an initial preview link to the code is mentioned in the
last patch.

Patch 4 and 5 were originally one patch, but I've split them into
two parts upon request as patch 4 only is also needed for Alexei's
tracing patches that go via tip tree.

Tested with tc and all in-kernel available BPF test suites.

I have configured and built LLVM with --enable-experimental-targets=BPF
but as Alexei put it, the plan is to get rid of the experimental
status in future [2].

Thanks a lot!

v1 -> v2:
 - Removed arch patches from this series
  - x86 is already queued in tip tree, under x86/mm
  - arm64 just reposted directly to arm folks
 - Rest is unchanged

  [1] http://thread.gmane.org/gmane.linux.network/350191
  [2] http://article.gmane.org/gmane.linux.kernel/1874969
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

68932f71

cls_bpf: add initial eBPF support for programmable classifiers · e2e9b654

Daniel Borkmann authored Mar 01, 2015

This work extends the "classic" BPF programmable tc classifier by
extending its scope also to native eBPF code!

This allows for user space to implement own custom, 'safe' C like
classifiers (or whatever other frontend language LLVM et al may
provide in future), that can then be compiled with the LLVM eBPF
backend to an eBPF elf file. The result of this can be loaded into
the kernel via iproute2's tc. In the kernel, they can be JITed on
major archs and thus run in native performance.

Simple, minimal toy example to demonstrate the workflow:

  #include <linux/ip.h>
  #include <linux/if_ether.h>
  #include <linux/bpf.h>

  #include "tc_bpf_api.h"

  __section("classify")
  int cls_main(struct sk_buff *skb)
  {
    return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
  }

  char __license[] __section("license") = "GPL";

The classifier can then be compiled into eBPF opcodes and loaded
via tc, for example:

  clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
  tc filter add dev em1 parent 1: bpf cls.o [...]

As it has been demonstrated, the scope can even reach up to a fully
fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).

For tc, maps are allowed to be used, but from kernel context only,
in other words, eBPF code can keep state across filter invocations.
In future, we perhaps may reattach from a different application to
those maps e.g., to read out collected statistics/state.

Similarly as in socket filters, we may extend functionality for eBPF
classifiers over time depending on the use cases. For that purpose,
cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so
we can allow additional functions/accessors (e.g. an ABI compatible
offset translation to skb fields/metadata). For an initial cls_bpf
support, we allow the same set of helper functions as eBPF socket
filters, but we could diverge at some point in time w/o problem.

I was wondering whether cls_bpf and act_bpf could share C programs,
I can imagine that at some point, we introduce i) further common
handlers for both (or even beyond their scope), and/or if truly needed
ii) some restricted function space for each of them. Both can be
abstracted easily through struct bpf_verifier_ops in future.

The context of cls_bpf versus act_bpf is slightly different though:
a cls_bpf program will return a specific classid whereas act_bpf a
drop/non-drop return code, latter may also in future mangle skbs.
That said, we can surely have a "classify" and "action" section in
a single object file, or considered mentioned constraint add a
possibility of a shared section.

The workflow for getting native eBPF running from tc [1] is as
follows: for f_bpf, I've added a slightly modified ELF parser code
from Alexei's kernel sample, which reads out the LLVM compiled
object, sets up maps (and dynamically fixes up map fds) if any, and
loads the eBPF instructions all centrally through the bpf syscall.

The resulting fd from the loaded program itself is being passed down
to cls_bpf, which looks up struct bpf_prog from the fd store, and
holds reference, so that it stays available also after tc program
lifetime. On tc filter destruction, it will then drop its reference.

Moreover, I've also added the optional possibility to annotate an
eBPF filter with a name (e.g. path to object file, or something
else if preferred) so that when tc dumps currently installed filters,
some more context can be given to an admin for a given instance (as
opposed to just the file descriptor number).

Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
exported, so that eBPF can be used from cls_bpf built as a module.
Thanks to 60a3b225 ("net: bpf: make eBPF interpreter images
read-only") I think this is of no concern since anything wanting to
alter eBPF opcode after verification stage would crash the kernel.

  [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpfSigned-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2e9b654

ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux · 24701ece

Daniel Borkmann authored Mar 01, 2015

is_gpl_compatible and prog_type should be moved directly into bpf_prog
as they stay immutable during bpf_prog's lifetime, are core attributes
and they can be locked as read-only later on via bpf_prog_select_runtime().

With a bit of rearranging, this also allows us to shrink bpf_prog_aux
to exactly 1 cacheline.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24701ece