Commits · 3be2a49e5c08d268f8af0dd4fe89a24ea8cdc339 · nexedi / linux

16 May, 2014 29 commits

of: provide a binding for fixed link PHYs · 3be2a49e

Thomas Petazzoni authored May 16, 2014

Some Ethernet MACs have a "fixed link", and are not connected to a
normal MDIO-managed PHY device. For those situations, a Device Tree
binding allows to describe a "fixed link" using a special PHY node.

This patch adds:

 * A documentation for the fixed PHY Device Tree binding.

 * An of_phy_is_fixed_link() function that an Ethernet driver can call
   on its PHY phandle to find out whether it's a fixed link PHY or
   not. It should typically be used to know if
   of_phy_register_fixed_link() should be called.

 * An of_phy_register_fixed_link() function that instantiates the
   fixed PHY into the PHY subsystem, so that when the driver calls
   of_phy_connect(), the PHY device associated to the OF node will be
   found.

These two additional functions also support the old fixed-link Device
Tree binding used on PowerPC platforms, so that ultimately, the
network device drivers for those platforms could be converted to use
of_phy_is_fixed_link() and of_phy_register_fixed_link() instead of
of_phy_connect_fixed_link(), while keeping compatibility with their
respective Device Tree bindings.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3be2a49e

net: phy: extend fixed driver with fixed_phy_register() · a7595121

Thomas Petazzoni authored May 16, 2014

The existing fixed_phy_add() function has several drawbacks that
prevents it from being used as is for OF-based declaration of fixed
PHYs:

 * The address of the PHY on the fake bus needs to be passed, while a
   dynamic allocation is desired.

 * Since the phy_device instantiation is post-poned until the next
   mdiobus scan, there is no way to associate the fixed PHY with its
   OF node, which later prevents of_phy_connect() from finding this
   fixed PHY from a given OF node.

To solve this, this commit introduces fixed_phy_register(), which will
allocate an available PHY address, add the PHY using fixed_phy_add()
and instantiate the phy_device structure associated with the provided
OF node.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a7595121

net: phy: decouple PHY id and PHY address in fixed PHY driver · 9b744942

Thomas Petazzoni authored May 16, 2014

Until now, the fixed_phy_add() function was taking as argument
'phy_id', which was used both as the PHY address on the fake fixed
MDIO bus, and as the PHY id, as available in the MII_PHYSID1 and
MII_PHYSID2 registers. However, those two informations are completely
unrelated.

This patch decouples them. The PHY id of fixed PHYs is hardcoded to be
0x0. Ideally, a really reserved value would be nicer, but there
doesn't seem to be an easy of making sure a dummy value can be
assigned to the Linux kernel for such usage.

The PHY address remains passed by the caller of phy_fixed_add().
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b744942

Merge branch 'bridge-non-promisc' · 2770abcc

David S. Miller authored May 16, 2014

Vlad Yasevich says:

====================
bridge: Non-promisc bridge ports support

This series adds functionality to the bridge device to enable
operations without setting all ports to promiscuous mode.

The basic concept is this.  The bridge keeps track of the ports
that support learning and flooding packets to unknown destinations.
We call these ports auto-discovery ports since they automatically
discover who is behind them through learning and flooding.

If flooding and learning are disabled via flags, then the port
requires static configuration to tell it which mac addresses
are behind it.  This is accomplished through adding of fdbs.
These fdbs should be static as dynamic fdbs can expire and systems
will become unreachable due to lack of flooding.

If the user marks all ports as needing static configuration then
we can safely make them non-promiscuous since we will know all the
information about them.

If the user leaves only 1 port as automatic, then we can mark
that port as not-promiscuous as well.  One could think of
this a edge relay similar to what's support by embedded switches
in SRIOV devices.  Since we have all the information about the
other ports, we can just program the mac addresses into the
single automatic port to receive all necessary traffic.
More information about this is patch 6.

In other cases, we keep all ports promiscuous as before.

There are some other cases when promiscuous mode has to be turned
back on.  One is when the bridge itself if placed in promiscuous
mode (user sets promisc flag).  The other is if vlan filtering is
turned off.  Since this is the default configuration, the default
bridge operation is not changed.

Changes since v2:
 - White space and spelling fixes from Michael Tsirkin
 - Squash patches 6, 7 and 8 to prevent bisect breakage.

Changes since v1:
 - Address issues rasied by Stephen Heminger
 - Address initializer comments raised by Sergey Shtylyov
 - Rebased recent net-next.

Changes since rfc v2:
 - Better description of in the commit logs
 - Leave port in promiscuous mode if IFF_UNICAST_FLT is disabled on the
   device.
 - Fix issue with flag masking
 - Rework patch ordering a bit.

Changes since rfc v1:
 - Removed private list.  We now traverse the fdb hashtable itself
   to write necessary addresses to the ports (Stephen's concern)
 - Add learning flag to the mask for flags that decides if the port
   is 'auto' or not (suggest by MST and Jamal).
 - Simplified tracking of such ports at the cost of a loop over all
   ports (suggested by MST)

I've played with quite a large number of ports and the current approach
seems to work fairly well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

2770abcc

bridge: Automatically manage port promiscuous mode. · 2796d0c6

Vlad Yasevich authored May 16, 2014

There exist configurations where the administrator or another management
entity has the foreknowledge of all the mac addresses of end systems
that are being bridged together.

In these environments, the administrator can statically configure known
addresses in the bridge FDB and disable flooding and learning on ports.
This makes it possible to turn off promiscuous mode on the interfaces
connected to the bridge.

Here is why disabling flooding and learning allows us to control
promiscuity:
 Consider port X.  All traffic coming into this port from outside the
bridge (ingress) will be either forwarded through other ports of the
bridge (egress) or dropped.  Forwarding (egress) is defined by FDB
entries and by flooding in the event that no FDB entry exists.
In the event that flooding is disabled, only FDB entries define
the egress.  Once learning is disabled, only static FDB entries
provided by a management entity define the egress.  If we provide
information from these static FDBs to the ingress port X, then we'll
be able to accept all traffic that can be successfully forwarded and
drop all the other traffic sooner without spending CPU cycles to
process it.
 Another way to define the above is as following equations:
    ingress = egress + drop
 expanding egress
    ingress = static FDB + learned FDB + flooding + drop
 disabling flooding and learning we a left with
    ingress = static FDB + drop

By adding addresses from the static FDB entries to the MAC address
filter of an ingress port X, we fully define what the bridge can
process without dropping and can thus turn off promiscuous mode,
thus dropping packets sooner.

There have been suggestions that we may want to allow learning
and update the filters with learned addresses as well.  This
would require mac-level authentication similar to 802.1x to
prevent attacks against the hw filters as they are limited
resource.

Additionally, if the user places the bridge device in promiscuous mode,
all ports are placed in promiscuous mode regardless of the changes
to flooding and learning.

Since the above functionality depends on full static configuration,
we have also require that vlan filtering be enabled to take
advantage of this.  The reason is that the bridge has to be
able to receive and process VLAN-tagged frames and the there
are only 2 ways to accomplish this right now: promiscuous mode
or vlan filtering.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2796d0c6

bridge: Add addresses from static fdbs to non-promisc ports · 145beee8

Vlad Yasevich authored May 16, 2014

When a static fdb entry is created, add the mac address
from this fdb entry to any ports that are currently running
in non-promiscuous mode.  These ports need this data so that
they can receive traffic destined to these addresses.
By default ports start in promiscuous mode, so this feature
is disabled.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

145beee8

bridge: Introduce BR_PROMISC flag · f3a6ddf1

Vlad Yasevich authored May 16, 2014

Introduce a BR_PROMISC per-port flag that will help us track if the
current port is supposed to be in promiscuous mode or not.  For now,
always start in promiscuous mode.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3a6ddf1

bridge: Add functionality to sync static fdb entries to hw · 8db24af7

Vlad Yasevich authored May 16, 2014

Add code that allows static fdb entires to be synced to the
hw list for a specified port.  This will be used later to
program ports that can function in non-promiscuous mode.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8db24af7

bridge: Keep track of ports capable of automatic discovery. · e028e4b8

Vlad Yasevich authored May 16, 2014

By default, ports on the bridge are capable of automatic
discovery of nodes located behind the port.  This is accomplished
via flooding of unknown traffic (BR_FLOOD) and learning the
mac addresses from these packets (BR_LEARNING).
If the above functionality is disabled by turning off these
flags, the port requires static configuration in the form
of static FDB entries to function properly.

This patch adds functionality to keep track of all ports
capable of automatic discovery.  This will later be used
to control promiscuity settings.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e028e4b8

bridge: Turn flag change macro into a function. · 63c3a622

Vlad Yasevich authored May 16, 2014

Turn the flag change macro into a function to allow
easier updates and to reduce space.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

63c3a622

net: pch_gbe depends on x86_32 · 4c30b525

Jean Delvare authored May 16, 2014

The pch_gbe driver is for a companion chip to the Intel Atom E600
series processors. These are 32-bit x86 processors so the driver is
only needed on X86_32.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c30b525

ip_tunnel: don't add tunnel twice · ee30ef4d

Duan Jiong authored May 15, 2014

When using command "ip tunnel add" to add a tunnel, the tunnel will be added twice,
through ip_tunnel_create() and ip_tunnel_update().

Because the second is unnecessary, so we can just break after adding tunnel
through ip_tunnel_create().
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee30ef4d

tools: bpf_jit_disasm: increase image buffer size · 9bb1a208

Alexei Starovoitov authored May 15, 2014

JITed seccomp filters can be quite large if they check a lot of syscalls
Simply increase buffer size
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9bb1a208

tools: bpf_jit_disasm: ignore image address for disasm · ed4afd45

Alexei Starovoitov authored May 15, 2014

seccomp filters use kernel JIT image addresses, so bpf_jit_enable=2 prints
[ 20.146438] flen=3 proglen=82 pass=0 image=0000000000000000
[ 20.146442] JIT code: 00000000: 55 48 89 e5 48 81 ec 28 02 00 00 ...

ignore image address, so that seccomp filters can be disassembled
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed4afd45

Merge branch 'systemport-next' · e0a1272c

David S. Miller authored May 16, 2014

Florian Fainelli says:

====================
net: systemport: DMA and MAC fixes

This patch series contains a critical fix in how the DMA unmapping of packet
is done, as well as a less critical fix in how we disable the Ethernet MAC
RX/TX functions.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e0a1272c

net: systemport: wait for packet in umac_enable_set() · 00b91c69

Florian Fainelli authored May 15, 2014

When umac_enable_set() is used to disable the UniMAC receiver or
transmitter, we need to make sure that we wait for a full-sized packet
to be processed because the UniMAC hardware stops on a packet boundary,
not immediately.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

00b91c69

net: systemport: fix dma_unmap_single() len · b1ff53e9

Florian Fainelli authored May 15, 2014

dma_unmap_single() was called with dma_unmap_len(cb, dma_len),
unfortunately we failed to assign this length field in
bcm_sysport_rx_refill() or bcm_sysport_alloc_rx_bufs() using
dma_unmap_len_set().

This causes packet contents corruption because are we not invoking the
cache invalidation routines with the proper length.  Fix this by using
the full RX buffer size (RX_BUF_LENGTH) because the mappings for the RX
bufers are created with that size.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b1ff53e9

Merge branch 'bonding-next' · bd508065

David S. Miller authored May 16, 2014

Veaceslav Falico says:

====================
bonding: simple macro cleanup

Trivial patchset that converts most of the bonding's macros into inline
functions. It introduces only one macro, BOND_MODE(), which is just
bond->params.mode, better to write/understand/remember.

The only real change is the removal of IFF_UP verification, which always
came in pair with && netif_running(), and is though useless, as it's always
IFF_UP when LINK_STATE_RUNNING.

v2->v3: fix 3/9 to actually invert bond_mode_uses_arp() and add
	bond_uses_arp() alongside bond_mode_uses_arp()
v1->v2: use inlined functions instead of macros.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bd508065

bonding: replace SLAVE_IS_OK() with bond_slave_can_tx() · 8557cd74

Veaceslav Falico authored May 15, 2014

They're verifying the same thing (except of IFF_UP, which is implied for
netif_running(), which is also a prerequisite).

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8557cd74

bonding: rename {, bond_}slave_can_tx and clean it up · 891ab54d

Veaceslav Falico authored May 15, 2014

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

891ab54d

bonding: convert IS_UP(slave->dev) to inline function · b6adc610

Veaceslav Falico authored May 15, 2014

Also, remove the IFF_UP verification cause we can't be netif_running() with
being also IFF_UP.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b6adc610

bonding: make IS_IP_TARGET_UNUSABLE_ADDRESS an inline function · 2807a9fe

Veaceslav Falico authored May 15, 2014

Also, use standard IP primitives to check the address.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2807a9fe

bonding: create a macro for bond mode and use it · 01844098

Veaceslav Falico authored May 15, 2014

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

01844098

bonding: make USES_PRIMARY inline functions · ec0865a9

Veaceslav Falico authored May 15, 2014

Change the name a bit to better reflect its scope, and update some
comments. Two functions added - one which takes bond as a param and the
other which takes the mode.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec0865a9

bonding: make BOND_NO_USES_ARP an inline function · 267bed77

Veaceslav Falico authored May 15, 2014

Also, change its name to better reflect its scope, and skip the "no"
part.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

267bed77

bonding: make TX_QUEUE_OVERRIDE() macro an inline function · d1e2e5cd

Veaceslav Falico authored May 15, 2014

Also, make it accept bonding as a parameter and change the name a bit to
better reflect its scope.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d1e2e5cd

bonding: remove BOND_MODE_IS_LB macro · befb903a

Veaceslav Falico authored May 15, 2014

It's used only in an inline function and is useless.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

befb903a

net: unix: Align send data_len up to PAGE_SIZE · 31ff6aa5

Kirill Tkhai authored May 15, 2014

Using whole of allocated pages reduces requested skb->data size.
This is just a little more thriftily allocation.

netperf does not show difference with the current performance.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31ff6aa5

macvlan: simplify the structure port · a188a54d

dingtianhong authored May 15, 2014

The port->count was used to count the number of macvlan devs
in the same port, but the list vlans could play the same role
to do that, so free the port if the list vlans is empty and
no need to use the parameter count.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a188a54d

15 May, 2014 11 commits

vti6: delete unneeded call to netdev_priv · 112a3513

Julia Lawall authored May 15, 2014

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

112a3513

ip_tunnel: delete unneeded call to netdev_priv · 4929fd8c

Julia Lawall authored May 15, 2014

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

4929fd8c

net/ariadne: delete unneeded call to netdev_priv · 688c3b00

Julia Lawall authored May 15, 2014

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

688c3b00

drivers/net/wan: delete unneeded call to netdev_priv · 7077f22f

Julia Lawall authored May 15, 2014

Netdev_priv is an accessor function, and has no purpose if its result is
not used.

A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

7077f22f

net: systemport: pad packets to a minimum of 68 bytes · dab531b4

Florian Fainelli authored May 14, 2014

Packets need to be at least 64 bytes to enter the switch port logic,
including the FCS, otherwise they will be discarded as RUNT packets.

With packets having Broadcom tags, the 4-bytes tag is first stripped
off the packet, and the packet length is then checked, so we need to
make sure that the packet length with FCS is at least 64 bytes.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dab531b4

net: systemport: only update UMAC_CMD if something changed · d5e32cc7

Florian Fainelli authored May 14, 2014

The link adjustment callback can be called as frequently as desired by
the PHY library, as such, let's avoid doing a Read/Modify/Write sequence
if nothing changed, which is more than likely since we are interfaced
with a switch device.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d5e32cc7

ti: Remove trailing semicolon from do {...} while (0) macro · 5f47dfb4

Joe Perches authored May 14, 2014

These should not have trailing semicolons so remove them.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5f47dfb4

Merge branch 'filter-next' · 1f499d6a

David S. Miller authored May 15, 2014

Alexei Starovoitov says:

====================
internal BPF jit for x64 and JITed seccomp

Internal BPF JIT compiler for x86_64 replaces classic BPF JIT.
Use it in seccomp and in tracing filters (sent as separate patch)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

1f499d6a

seccomp: JIT compile seccomp filter · 8f577cad

Alexei Starovoitov authored May 13, 2014

Take advantage of internal BPF JIT

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

 seccomp_rule_add_exact(ctx,...
 seccomp_rule_add_exact(ctx,...

 rc = seccomp_load(ctx);

 for (i = 0; i < 10000000; i++)
    syscall(...);

$ sudo sysctl net.core.bpf_jit_enable=1
$ time ./bench
real	0m2.769s
user	0m1.136s
sys	0m1.624s

$ sudo sysctl net.core.bpf_jit_enable=0
$ time ./bench
real	0m5.825s
user	0m1.268s
sys	0m4.548s
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f577cad

net: filter: x86: internal BPF JIT · 62258278

Alexei Starovoitov authored May 13, 2014

Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.

Performance:

1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
No performance difference is observed for filters that were JIT-able before

Example assembler code for BPF filter "tcpdump port 22"

original BPF -> old JIT: original BPF -> internal BPF -> new JIT:
0: push %rbp 0: push %rbp
1: mov %rsp,%rbp 1: mov %rsp,%rbp
4: sub $0x60,%rsp 4: sub $0x228,%rsp
8: mov %rbx,-0x8(%rbp) b: mov %rbx,-0x228(%rbp) // prologue
12: mov %r13,-0x220(%rbp)
19: mov %r14,-0x218(%rbp)
20: mov %r15,-0x210(%rbp)
27: xor %eax,%eax // clear A
c: xor %ebx,%ebx 29: xor %r13,%r13 // clear X
e: mov 0x68(%rdi),%r9d 2c: mov 0x68(%rdi),%r9d
12: sub 0x6c(%rdi),%r9d 30: sub 0x6c(%rdi),%r9d
16: mov 0xd8(%rdi),%r8 34: mov 0xd8(%rdi),%r10
3b: mov %rdi,%rbx
1d: mov $0xc,%esi 3e: mov $0xc,%esi
22: callq 0xffffffffe1021e15 43: callq 0xffffffffe102bd75
27: cmp $0x86dd,%eax 48: cmp $0x86dd,%rax
2c: jne 0x0000000000000069 4f: jne 0x000000000000009a
2e: mov $0x14,%esi 51: mov $0x14,%esi
33: callq 0xffffffffe1021e31 56: callq 0xffffffffe102bd91
38: cmp $0x84,%eax 5b: cmp $0x84,%rax
3d: je 0x0000000000000049 62: je 0x0000000000000074
3f: cmp $0x6,%eax 64: cmp $0x6,%rax
42: je 0x0000000000000049 68: je 0x0000000000000074
44: cmp $0x11,%eax 6a: cmp $0x11,%rax
47: jne 0x00000000000000c6 6e: jne 0x0000000000000117
49: mov $0x36,%esi 74: mov $0x36,%esi
4e: callq 0xffffffffe1021e15 79: callq 0xffffffffe102bd75
53: cmp $0x16,%eax 7e: cmp $0x16,%rax
56: je 0x00000000000000bf 82: je 0x0000000000000110
58: mov $0x38,%esi 88: mov $0x38,%esi
5d: callq 0xffffffffe1021e15 8d: callq 0xffffffffe102bd75
62: cmp $0x16,%eax 92: cmp $0x16,%rax
65: je 0x00000000000000bf 96: je 0x0000000000000110
67: jmp 0x00000000000000c6 98: jmp 0x0000000000000117
69: cmp $0x800,%eax 9a: cmp $0x800,%rax
6e: jne 0x00000000000000c6 a1: jne 0x0000000000000117
70: mov $0x17,%esi a3: mov $0x17,%esi
75: callq 0xffffffffe1021e31 a8: callq 0xffffffffe102bd91
7a: cmp $0x84,%eax ad: cmp $0x84,%rax
7f: je 0x000000000000008b b4: je 0x00000000000000c2
81: cmp $0x6,%eax b6: cmp $0x6,%rax
84: je 0x000000000000008b ba: je 0x00000000000000c2
86: cmp $0x11,%eax bc: cmp $0x11,%rax
89: jne 0x00000000000000c6 c0: jne 0x0000000000000117
8b: mov $0x14,%esi c2: mov $0x14,%esi
90: callq 0xffffffffe1021e15 c7: callq 0xffffffffe102bd75
95: test $0x1fff,%ax cc: test $0x1fff,%rax
99: jne 0x00000000000000c6 d3: jne 0x0000000000000117
d5: mov %rax,%r14
9b: mov $0xe,%esi d8: mov $0xe,%esi
a0: callq 0xffffffffe1021e44 dd: callq 0xffffffffe102bd91 // MSH
e2: and $0xf,%eax
e5: shl $0x2,%eax
e8: mov %rax,%r13
eb: mov %r14,%rax
ee: mov %r13,%rsi
a5: lea 0xe(%rbx),%esi f1: add $0xe,%esi
a8: callq 0xffffffffe1021e0d f4: callq 0xffffffffe102bd6d
ad: cmp $0x16,%eax f9: cmp $0x16,%rax
b0: je 0x00000000000000bf fd: je 0x0000000000000110
ff: mov %r13,%rsi
b2: lea 0x10(%rbx),%esi 102: add $0x10,%esi
b5: callq 0xffffffffe1021e0d 105: callq 0xffffffffe102bd6d
ba: cmp $0x16,%eax 10a: cmp $0x16,%rax
bd: jne 0x00000000000000c6 10e: jne 0x0000000000000117
bf: mov $0xffff,%eax 110: mov $0xffff,%eax
c4: jmp 0x00000000000000c8 115: jmp 0x000000000000011c
c6: xor %eax,%eax 117: mov $0x0,%eax
c8: mov -0x8(%rbp),%rbx 11c: mov -0x228(%rbp),%rbx // epilogue
cc: leaveq 123: mov -0x220(%rbp),%r13
cd: retq 12a: mov -0x218(%rbp),%r14
131: mov -0x210(%rbp),%r15
138: leaveq
139: retq

On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.

The difference in generated assembler is due to the following:

Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.

New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.

Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.

2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
extensions. New JIT supports all BPF extensions.
Performance of such filters improves 2-4 times depending on a filter.
The longer the filter the higher performance gain.
Synthetic benchmarks with many ancillary loads see 20x speedup
which seems to be the maximum gain from JIT

Notes:

. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
and can be used to see generated assembler

. there are two jit_compile() functions and code flow for classic filters is:
sk_attach_filter() - load classic BPF
bpf_jit_compile() - try to JIT from classic BPF
sk_convert_filter() - convert classic to internal
bpf_int_jit_compile() - JIT from internal BPF

seccomp and tracing filters will just call bpf_int_jit_compile()
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62258278

net: filter: x86: split bpf_jit_compile() · f3c2af7b

Alexei Starovoitov authored May 13, 2014

Split bpf_jit_compile() into two functions to improve readability
of for(pass++) loop. The change follows similar style of JIT compilers
for arm, powerpc, s390

The body of new do_jit() was not reformatted to reduce noise
in this patch, since the following patch replaces most of it.

Tested with BPF testsuite.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3c2af7b