Commits · 58f9f9b5555543d47ad9a721489b557a790dc0c7 · Kirill Smelkov / linux

16 Jan, 2021 20 commits

Merge branch 'configuring-congestion-watermarks-on-ocelot-switch-using-devlink-sb' · 58f9f9b5

Jakub Kicinski authored Jan 15, 2021

Vladimir Oltean says:

====================
Configuring congestion watermarks on ocelot switch using devlink-sb

In some applications, it is important to create resource reservations in
the Ethernet switches, to prevent background traffic, or deliberate
attacks, from inducing denial of service into the high-priority traffic.

These patches give the user some knobs to turn. The ocelot switches
support per-port and per-port-tc reservations, on ingress and on egress.
The resources that are monitored are packet buffers (in cells of 60
bytes each) and frame references.

The frames that exceed the reservations can optionally consume from
sharing watermarks which are not per-port but global across the switch.
There are 10 sharing watermarks, 8 of them are per traffic class and 2
are per drop priority.

I am configuring the hardware using the best of my knowledge, and mostly
through trial and error. Same goes for devlink-sb integration. Feedback
is welcome.
====================

Link: https://lore.kernel.org/r/20210115021120.3055988-1-olteanv@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

58f9f9b5

net: mscc: ocelot: configure watermarks using devlink-sb · f59fd9ca

Vladimir Oltean authored Jan 15, 2021

Using devlink-sb, we can configure 12/16 (the important 75%) of the
switch's controlling watermarks for congestion drops, and we can monitor
50% of the watermark occupancies (we can monitor the reservation
watermarks, but not the sharing watermarks, which are exposed as pool
sizes).

The following definitions can be made:

SB_BUF=0 # The devlink-sb for frame buffers
SB_REF=1 # The devlink-sb for frame references
POOL_ING=0 # The pool for ingress traffic. Both devlink-sb instances
           # have one of these.
POOL_EGR=1 # The pool for egress traffic. Both devlink-sb instances
           # have one of these.

Editing the hardware watermarks is done in the following way:
BUF_xxxx_I is accessed when sb=$SB_BUF and pool=$POOL_ING
REF_xxxx_I is accessed when sb=$SB_REF and pool=$POOL_ING
BUF_xxxx_E is accessed when sb=$SB_BUF and pool=$POOL_EGR
REF_xxxx_E is accessed when sb=$SB_REF and pool=$POOL_EGR

Configuring the sharing watermarks for COL_SHR(dp=0) is done implicitly
by modifying the corresponding pool size. By default, the pool size has
maximum size, so this can be skipped.

devlink sb pool set pci/0000:00:00.5 sb $SB_BUF pool $POOL_ING \
	size 129840 thtype static

Since by default there is no buffer reservation, the above command has
maxed out BUF_COL_SHR_I(dp=0).

Configuring the per-port reservation watermark (P_RSRV) is done in the
following way:

devlink sb port pool set pci/0000:00:00.5/0 sb $SB_BUF \
	pool $POOL_ING th 1000

The above command sets BUF_P_RSRV_I(port 0) to 1000 bytes. After this
command, the sharing watermarks are internally reconfigured with 1000
bytes less, i.e. from 129840 bytes to 128840 bytes.

Configuring the per-port-tc reservation watermarks (Q_RSRV) is done in
the following way:

for tc in {0..7}; do
	devlink sb tc bind set pci/0000:00:00.5/0 sb 0 tc $tc \
		type ingress pool $POOL_ING \
		th 3000
done

The above command sets BUF_Q_RSRV_I(port 0, tc 0..7) to 3000 bytes.
The sharing watermarks are again reconfigured with 24000 bytes less.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f59fd9ca

net: mscc: ocelot: initialize watermarks to sane defaults · a4ae997a

Vladimir Oltean authored Jan 15, 2021

This is meant to be a gentle introduction into the world of watermarks
on ocelot. The code is placed in ocelot_devlink.c because it will be
integrated with devlink, even if it isn't right now.

My first step was intended to be to replicate the default configuration
of the congestion watermarks programatically, since they are now going
to be tuned by the user.

But after studying and understanding through trial and error how they
work, I now believe that the configuration used out of reset does not do
justice to the word "reservation", since the sum of all reservations
exceeds the total amount of resources (otherwise said, all reservations
cannot be fulfilled at the same time, which means that, contrary to the
reference manual, they don't guarantee anything).

As an example, here's a dump of the reservation watermarks for frame
buffers, for port 0 (for brevity, the ports 1-6 were omitted, but they
have the same configuration):

BUF_Q_RSRV_I(port 0, prio 0) = max 3000 bytes
BUF_Q_RSRV_I(port 0, prio 1) = max 3000 bytes
BUF_Q_RSRV_I(port 0, prio 2) = max 3000 bytes
BUF_Q_RSRV_I(port 0, prio 3) = max 3000 bytes
BUF_Q_RSRV_I(port 0, prio 4) = max 3000 bytes
BUF_Q_RSRV_I(port 0, prio 5) = max 3000 bytes
BUF_Q_RSRV_I(port 0, prio 6) = max 3000 bytes
BUF_Q_RSRV_I(port 0, prio 7) = max 3000 bytes

Otherwise said, every port-tc has an ingress reservation of 3000 bytes,
and there are 7 ports in VSC9959 Felix (6 user ports and 1 CPU port).
Concentrating only on the ingress reservations, there are, in total,
8 [traffic classes] x 7 [ports] x 3000 [bytes] = 168,000 bytes of memory
reserved on ingress.
But, surprise, Felix only has 128 KB of packet buffer in total...
A similar thing happens with Seville, which has a larger packet buffer,
but also more ports, and the default configuration is also overcommitted.

This patch disables the (apparently) bogus reservations and moves all
resources to the shared area. This way, real reservations can be set up
by the user, using devlink-sb.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a4ae997a

net: mscc: ocelot: register devlink ports · 6c30384e

Vladimir Oltean authored Jan 15, 2021

Add devlink integration into the mscc_ocelot switchdev driver. All
physical ports (i.e. the unused ones as well) except the CPU port module
at ocelot->num_phys_ports are registered with devlink, and that requires
keeping the devlink_port structure outside struct ocelot_port_private,
since the latter has a 1:1 mapping with a struct net_device (which does
not exist for unused ports).

Since we use devlink_port_type_eth_set to link the devlink port to the
net_device, we can as well remove the .ndo_get_phys_port_name and
.ndo_get_port_parent_id implementations, since devlink takes care of
retrieving the port name and number automatically, once
.ndo_get_devlink_port is implemented.

Note that the felix DSA driver is already integrated with devlink by
default, since that is a thing that the DSA core takes care of. This is
the reason why these devlink stubs were put in ocelot_net.c and not in
the common library. It is also the reason why ocelot::devlink is a
pointer and not a full structure embedded inside struct ocelot: because
the mscc_ocelot driver allocates that by itself (as the container of
struct ocelot, in fact), but in the case of felix, it is DSA who
allocates the devlink, and felix just propagates the pointer towards
struct ocelot.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6c30384e

net: mscc: ocelot: delete unused ocelot_set_cpu_port prototype · c6c65d47

Vladimir Oltean authored Jan 15, 2021

This is a leftover of commit 69df578c ("net: mscc: ocelot: eliminate
confusion between CPU and NPI port") which renamed that function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c6c65d47

net: mscc: ocelot: export NUM_TC constant from felix to common switch lib · 70d39a6e

Vladimir Oltean authored Jan 15, 2021

We should be moving anything that isn't DSA-specific or SoC-specific out
of the felix DSA driver, and into the common mscc_ocelot switch library.

The number of traffic classes is one of the aspects that is common
between all ocelot switches, so it belongs in the library.

This patch also makes seville use 8 TX queues, and therefore enables
prioritization via the QOS_CLASS field in the NPI injection header.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

70d39a6e

net: dsa: felix: perform teardown in reverse order of setup · d19741b0

Vladimir Oltean authored Jan 15, 2021

In general it is desirable that cleanup is the reverse process of setup.
In this case I am not seeing any particular issue, but with the
introduction of devlink-sb for felix, a non-obvious decision had to be
made as to where to put its cleanup method. When there's a convention in
place, that decision becomes obvious.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d19741b0

net: dsa: felix: reindent struct dsa_switch_ops · a7096915

Vladimir Oltean authored Jan 15, 2021

The devlink function pointer names are super long, and they would break
the alignment. So reindent the existing ops now by adding one tab.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a7096915

net: dsa: add ops for devlink-sb · 2a6ef763

Vladimir Oltean authored Jan 15, 2021

Switches that care about QoS might have hardware support for reserving
buffer pools for individual ports or traffic classes, and configuring
their sizes and thresholds. Through devlink-sb (shared buffers), this is
all configurable, as well as their occupancy being viewable.

Add the plumbing in DSA for these operations.

Individual drivers still need to call devlink_sb_register() with the
shared buffers they want to expose. A helper was not created in DSA for
this purpose (unlike, say, dsa_devlink_params_register), since in my
opinion it does not bring any benefit over plainly calling
devlink_sb_register() directly.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2a6ef763

net: mscc: ocelot: add ops for decoding watermark threshold and occupancy · 703b7621

Vladimir Oltean authored Jan 15, 2021

We'll need to read back the watermark thresholds and occupancy from
hardware (for devlink-sb integration), not only to write them as we did
so far in ocelot_port_set_maxlen. So introduce 2 new functions in struct
ocelot_ops, similar to wm_enc, and implement them for the 3 supported
mscc_ocelot switches.

Remove the INUSE and MAXUSE unpacking helpers for the QSYS_RES_STAT
register, because that doesn't scale with the number of switches that
mscc_ocelot supports now. They have different bit widths for the
watermarks, and we need function pointers to abstract that difference
away.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

703b7621

net: mscc: ocelot: auto-detect packet buffer size and number of frame references · f6fe01d6

Vladimir Oltean authored Jan 15, 2021

Instead of reading these values from the reference manual and writing
them down into the driver, it appears that the hardware gives us the
option of detecting them dynamically.

The number of frame references corresponds to what the reference manual
notes, however it seems that the frame buffers are reported as slightly
less than the books would indicate. On VSC9959 (Felix), the books say it
should have 128KB of packet buffer, but the registers indicate only
129840 bytes (126.79 KB). Also, the unit of measurement for FREECNT from
the documentation of all these devices is incorrect (taken from an older
generation). This was confirmed by Younes Leroul from Microchip support.

Not having anything better to do with these values at the moment* (this
will change soon), let's just print them.

*The frame buffer size is, in fact, used to calculate the tail dropping
watermarks.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f6fe01d6

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 2d9116be

Jakub Kicinski authored Jan 15, 2021

Daniel Borkmann says:

====================
pull-request: bpf-next 2021-01-16

1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
   that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.

2) Add support for using kernel module global variables (__ksym externs in BPF
   programs) retrieved via module's BTF, from Andrii Nakryiko.

3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
   stored in mmap2 event for perf, from Jiri Olsa.

4) Various fixes for cross-building BPF sefltests out-of-tree which then will
   unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.

5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.

6) Clean up driver's XDP buffer init and split into two helpers to init per-
   descriptor and non-changing fields during processing, from Lorenzo Bianconi.

7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
  perf: Add build id data in mmap2 event
  bpf: Add size arg to build_id_parse function
  bpf: Move stack_map_get_build_id into lib
  bpf: Document new atomic instructions
  bpf: Add tests for new BPF atomic operations
  bpf: Add bitwise atomic instructions
  bpf: Pull out a macro for interpreting atomic ALU operations
  bpf: Add instructions for atomic_[cmp]xchg
  bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
  bpf: Move BPF_STX reserved field check into BPF_STX verifier code
  bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
  bpf: x86: Factor out a lookup table for some ALU opcodes
  bpf: x86: Factor out emission of REX byte
  bpf: x86: Factor out emission of ModR/M for *(reg + off)
  tools/bpftool: Add -Wall when building BPF programs
  bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
  selftests/bpf: Install btf_dump test cases
  selftests/bpf: Fix installation of urandom_read
  selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
  selftests/bpf: Fix out-of-tree build
  ...
====================

Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2d9116be

net: ks8851: remove definition of DEBUG · 3ada665b

Tom Rix authored Jan 15, 2021

Defining DEBUG should only be done in development.
So remove DEBUG.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Link: https://lore.kernel.org/r/20210115153128.131026-1-trix@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3ada665b

neighbor: remove definition of DEBUG · e794e7fa

Tom Rix authored Jan 14, 2021

Defining DEBUG should only be done in development.
So remove DEBUG.
Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210114212917.48174-1-trix@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e794e7fa

gianfar: remove definition of DEBUG · 2267c530

Tom Rix authored Jan 13, 2021

Defining DEBUG should only be done in development.
So remove DEBUG.
Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210113215603.1721958-1-trix@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2267c530

net: dsa: set configure_vlan_while_not_filtering to true by default · 0ee2af4e

Vladimir Oltean authored Jan 16, 2021

As explained in commit 54a0ed0d ("net: dsa: provide an option for
drivers to always receive bridge VLANs"), DSA has historically been
skipping VLAN switchdev operations when the bridge wasn't in
vlan_filtering mode, but the reason why it was doing that has never been
clear. So the configure_vlan_while_not_filtering option is there merely
to preserve functionality for existing drivers. It isn't some behavior
that drivers should opt into. Ideally, when all drivers leave this flag
set, we can delete the dsa_port_skip_vlan_configuration() function.

New drivers always seem to omit setting this flag, for some reason. So
let's reverse the logic: the DSA core sets it by default to true before
the .setup() callback, and legacy drivers can turn it off. This way, new
drivers get the new behavior by default, unless they explicitly set the
flag to false, which is more obvious during review.

Remove the assignment from drivers which were setting it to true, and
add the assignment to false for the drivers that didn't previously have
it. This way, it should be easier to see how many we have left.

The following drivers: lan9303, mv88e6060 were skipped from setting this
flag to false, because they didn't have any VLAN offload ops in the
first place.

The Broadcom Starfighter 2 driver calls the common b53_switch_alloc and
therefore also inherits the configure_vlan_while_not_filtering=true
behavior.

Also, print a message through netlink extack every time a VLAN has been
skipped. This is mildly annoying on purpose, so that (a) it is at least
clear that VLANs are being skipped - the legacy behavior in itself is
confusing, and the extack should be much more difficult to miss, unlike
kernel logs - and (b) people have one more incentive to convert to the
new behavior.

No behavior change except for the added prints is intended at this time.

$ ip link add br0 type bridge vlan_filtering 0
$ ip link set sw0p2 master br0
[ 60.315148] br0: port 1(sw0p2) entered blocking state
[ 60.320350] br0: port 1(sw0p2) entered disabled state
[ 60.327839] device sw0p2 entered promiscuous mode
[ 60.334905] br0: port 1(sw0p2) entered blocking state
[ 60.340142] br0: port 1(sw0p2) entered forwarding state
Warning: dsa_core: skipping configuration of VLAN. # This was the pvid
$ bridge vlan add dev sw0p2 vid 100
Warning: dsa_core: skipping configuration of VLAN.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210115231919.43834-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0ee2af4e

netxen_nic: switch from 'pci_' to 'dma_' API · 297af515

Christophe JAILLET authored Jan 13, 2021

The wrappers in include/linux/pci-dma-compat.h should go away.

The patch has been generated with the coccinelle script below and has been
hand modified to replace GFP_ with a correct flag.
It has been compile tested.

When memory is allocated in 'netxen_get_minidump_template()' GFP_KERNEL can
be used because its only caller, ' netxen_setup_minidump(()' already uses
it and no lock is acquired in the between.

When memory is allocated in other function in 'netxen_nic_ctx.c' GFP_KERNEL
can be used because the call chain already uses GFP_KERNEL and no lock is
taken in the between.
The call chain is:
  netxen_nic_attach()
    --> netxen_alloc_sw_resources()          : already uses GFP_KERNEL
    --> netxen_alloc_hw_resources()
      --> nx_fw_cmd_create_rx_ctx()
      --> nx_fw_cmd_create_tx_ctx()

When memory is allocated in 'netxen_init_dummy_dma()' GFP_KERNEL can be
used because its only call chain already uses it and no lock is acquired in
the between.
The call chain is:
  --> netxen_start_firmware
    --> netxen_request_firmware()
      --> request_firmware()
        --> _request_firmware(()
          --> fw_get_filesystem_firmware()
            --> __getname()                  : already uses GFP_KERNEL
    --> netxen_init_dummy_dma()

@@
@@
-    PCI_DMA_BIDIRECTIONAL
+    DMA_BIDIRECTIONAL

@@
@@
-    PCI_DMA_TODEVICE
+    DMA_TO_DEVICE

@@
@@
-    PCI_DMA_FROMDEVICE
+    DMA_FROM_DEVICE

@@
@@
-    PCI_DMA_NONE
+    DMA_NONE

@@
expression e1, e2, e3;
@@
-    pci_alloc_consistent(e1, e2, e3)
+    dma_alloc_coherent(&e1->dev, e2, e3, GFP_)

@@
expression e1, e2, e3;
@@
-    pci_zalloc_consistent(e1, e2, e3)
+    dma_alloc_coherent(&e1->dev, e2, e3, GFP_)

@@
expression e1, e2, e3, e4;
@@
-    pci_free_consistent(e1, e2, e3, e4)
+    dma_free_coherent(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_map_single(e1, e2, e3, e4)
+    dma_map_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_unmap_single(e1, e2, e3, e4)
+    dma_unmap_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4, e5;
@@
-    pci_map_page(e1, e2, e3, e4, e5)
+    dma_map_page(&e1->dev, e2, e3, e4, e5)

@@
expression e1, e2, e3, e4;
@@
-    pci_unmap_page(e1, e2, e3, e4)
+    dma_unmap_page(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_map_sg(e1, e2, e3, e4)
+    dma_map_sg(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_unmap_sg(e1, e2, e3, e4)
+    dma_unmap_sg(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_single_for_cpu(e1, e2, e3, e4)
+    dma_sync_single_for_cpu(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_single_for_device(e1, e2, e3, e4)
+    dma_sync_single_for_device(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_sg_for_cpu(e1, e2, e3, e4)
+    dma_sync_sg_for_cpu(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_dma_sync_sg_for_device(e1, e2, e3, e4)
+    dma_sync_sg_for_device(&e1->dev, e2, e3, e4)

@@
expression e1, e2;
@@
-    pci_dma_mapping_error(e1, e2)
+    dma_mapping_error(&e1->dev, e2)

@@
expression e1, e2;
@@
-    pci_set_dma_mask(e1, e2)
+    dma_set_mask(&e1->dev, e2)

@@
expression e1, e2;
@@
-    pci_set_consistent_dma_mask(e1, e2)
+    dma_set_coherent_mask(&e1->dev, e2)
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20210113202519.487672-1-christophe.jaillet@wanadoo.frSigned-off-by: Jakub Kicinski <kuba@kernel.org>

297af515

Merge branch 'net-dsa-mv88e6xxx-lag-fixes' · 7c140b05

Jakub Kicinski authored Jan 15, 2021

Tobias Waldekranz says:

====================
net: dsa: mv88e6xxx: LAG fixes

The kernel test robot kindly pointed out that Global 2 support in
mv88e6xxx is optional.

This also made me realize that we should verify that the hardware
actually supports LAG offloading before trying to configure it.
====================

Link: https://lore.kernel.org/r/20210115125259.22542-1-tobias@waldekranz.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

7c140b05

net: dsa: mv88e6xxx: Only allow LAG offload on supported hardware · b80dc51b

Tobias Waldekranz authored Jan 15, 2021

There are chips that do have Global 2 registers, and therefore trunk
mapping/mask tables are not available. Refuse the offload as early as
possible on those devices.

Fixes: 57e661aa ("net: dsa: mv88e6xxx: Link aggregation support")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b80dc51b

net: dsa: mv88e6xxx: Provide dummy implementations for trunk setters · d38001d3

Tobias Waldekranz authored Jan 15, 2021

Support for Global 2 registers is build-time optional. In the case
where it was not enabled the build would fail as no "dummy"
implementation of these functions was available.

Fixes: 57e661aa ("net: dsa: mv88e6xxx: Link aggregation support")
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d38001d3

15 Jan, 2021 20 commits

Merge branch 'arrow-speedchips-xrs700x-dsa-driver' · 8a39bee1

Jakub Kicinski authored Jan 15, 2021

George McCollister says:

====================
Arrow SpeedChips XRS700x DSA Driver

This series adds a DSA driver for the Arrow SpeedChips XRS 7000 series
of HSR/PRP gigabit switch chips.

The chips use Flexibilis IP.
More information can be found here:
 https://www.flexibilis.com/products/speedchips-xrs7000/

The switches have up to three RGMII ports and one MII port and are
managed via mdio or i2c. They use a one byte trailing tag to identify
the switch port when in managed mode so I've added a tag driver which
implements this.

This series contains minimal DSA functionality which may be built upon
in future patches. The ultimate goal is to add HSR and PRP
(IEC 62439-3 Clause 5 & 4) offloading with integration into net/hsr.
====================

Link: https://lore.kernel.org/r/20210114195734.55313-1-george.mccollister@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8a39bee1

dt-bindings: net: dsa: add bindings for xrs700x switches · 8204c2b0

George McCollister authored Jan 14, 2021

Add documentation and an example for Arrow SpeedChips XRS7000 Series
single chip Ethernet switches.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8204c2b0

net: dsa: add Arrow SpeedChips XRS700x driver · ee00b24f

George McCollister authored Jan 14, 2021

Add a driver with initial support for the Arrow SpeedChips XRS7000
series of gigabit Ethernet switch chips which are typically used in
critical networking applications.

The switches have up to three RGMII ports and one RMII port.
Management to the switches can be performed over i2c or mdio.

Support for advanced features such as PTP and
HSR/PRP (IEC 62439-3 Clause 5 & 4) is not included in this patch and
may be added at a later date.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ee00b24f

dsa: add support for Arrow XRS700x tag trailer · 54a52823

George McCollister authored Jan 14, 2021

Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
is modeled on tag_trailer.c which works in a similar way.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

54a52823

Merge branch 'add-further-dt-configuration-for-at803x-phys' · e7fa5c80

Jakub Kicinski authored Jan 15, 2021

Russell King says:

====================
Add further DT configuration for AT803x PHYs

This patch series adds the ability to configure the SmartEEE feature
in AT803x PHYs. SmartEEE defaults to enabled on these PHYs, and has
a history of causing random sporadic link drops at Gigabit speeds.

There appears to be two solutions to this. There is the approach that
Freescale adopted early on, which is to disable the SmartEEE feature.
However, this loses the power saving provided by EEE. Another solution
was found by Jon Nettleton is to increase the Tw parameter for Gigabit
links.

This patch series adds support for both approaches, by adding a boolean:

	qca,disable-smarteee

if one wishes to disable SmartEEE, and two properties to configure the
SmartEEE Tw parameters:

	qca,smarteee-tw-us-100m
	qca,smarteee-tw-us-1g

Sadly, the PHY quirk I merged a while back for AT8035 on iMX6 is broken
- rather than disabling SmartEEE mode, it enables it.

The addition of these properties will be sent to the appropriate
platform maintainers - although for SolidRun platforms, we only make use
of "qca,smarteee-tw-us-1g".
====================

Link: https://lore.kernel.org/r/20210114104455.GP1551@shell.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e7fa5c80

net: phy: at803x: add support for configuring SmartEEE · 390b4cad

Russell King authored Jan 14, 2021

SmartEEE for the atheros phy was deemed buggy by Freescale and commits
were added to disable it for their boards.

In initial testing, SolidRun found that the default settings were
causing disconnects but by increasing the Tw buffer time we could allow
enough time for all parts of the link to come out of a low power state
and function properly without causing a disconnect. This allows us to
have functional power savings of between 300 and 400mW, rather than
disabling the feature altogether.

This commit adds support for disabling SmartEEE and configuring the Tw
parameters for 1G and 100M speeds.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

390b4cad

dt: ar803x: document SmartEEE properties · 623c1329

Russell King authored Jan 14, 2021

The SmartEEE feature of Atheros AR803x PHYs can cause the link to
bounce. Add DT properties to allow SmartEEE to be disabled, and to
allow the Tw parameters for 100M and 1G links to be configured.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

623c1329

Merge branch 'perf: Add mmap2 build id support' · eed6a9a9

Alexei Starovoitov authored Jan 14, 2021

Jiri Olsa says:

====================

hi,
adding the support to have buildid stored in mmap2 event,
so we can bypass the final perf record hunt on build ids.

This patchset allows perf to record build ID in mmap2 event,
and adds perf tooling to store/download binaries to .debug
cache based on these build IDs.

Note that the build id retrieval code is stolen from bpf
code, where it's been used (together with file offsets)
to replace IPs in user space stack traces. It's now added
under lib directory.

v7 changes:
  - included only missing kernel patches, cc-ed bpf@vger and
    rebased on bpf-next/master [Alexei]

v6 changes:
  - last 4 patches rebased Arnaldo's perf/core

v5 changes:
  - rebased on latest perf/core
  - several patches already pulled in
  - fixed trace+probe_vfs_getname.sh output redirection
  - fixed changelogs [Arnaldo]
  - renamed BUILD_ID_SIZE to BUILD_ID_SIZE_MAX [Song]

v4 changes:
  - fixed typo in changelog [Namhyung]
  - removed force_download bool from struct dso_store_data,
    because it's not used  [Namhyung]

v3 changes:
  - added acks
  - removed forgotten debug code [Arnaldo]
  - fixed readlink termination [Ian]
  - fixed doc for --debuginfod=URLs [Ian]
  - adopted kernel's memchr_inv function and used
    it in build_id__is_defined function [Arnaldo]

On recording server:

  - on the recording server we can run record with --buildid-mmap
    option to store build ids in mmap2 events:

    # perf record --buildid-mmap
    ^C[ perf record: Woken up 2 times to write data ]
    [ perf record: Captured and wrote 0.836 MB perf.data ]

  - it stores nothing to ~/.debug cache:

    # find ~/.debug
    find: ‘/root/.debug’: No such file or directory

  - and still reports properly:

    # perf report --stdio
    ...
    99.82%  swapper          [kernel.kallsyms]  [k] native_safe_halt
     0.03%  swapper          [kernel.kallsyms]  [k] finish_task_switch
     0.02%  swapper          [kernel.kallsyms]  [k] __softirqentry_text_start
     0.01%  kcompactd0       [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.01%  ksoftirqd/6      [kernel.kallsyms]  [k] slab_free_freelist_hook
     0.01%  kworker/17:1H-x  [kernel.kallsyms]  [k] slab_free_freelist_hook

  - display used/hit build ids:

    # perf buildid-list | head -5
    5dcec522abf136fcfd3128f47e131f2365834dd7 /proc/kcore
    589e403a34f55486bcac848a45e00bcdeedd1ca8 /usr/lib64/libcrypto.so.1.1.1g
    94569566d4eac7e9c87ba029d43d4e2158f9527e /usr/lib64/libpthread-2.30.so
    559b9702bebe31c6d132c8dc5cc887673d65d5b5 /usr/lib64/libc-2.30.so
    40da7abe89f631f60538a17686a7d65c6a02ed31 /usr/lib64/ld-2.30.so

  - store build id binaries into build id cache:

    # perf buildid-cache -a perf.data
    OK   5dcec522abf136fcfd3128f47e131f2365834dd7 /proc/kcore
    OK   589e403a34f55486bcac848a45e00bcdeedd1ca8 /usr/lib64/libcrypto.so.1.1.1g
    OK   94569566d4eac7e9c87ba029d43d4e2158f9527e /usr/lib64/libpthread-2.30.so
    OK   559b9702bebe31c6d132c8dc5cc887673d65d5b5 /usr/lib64/libc-2.30.so
    OK   40da7abe89f631f60538a17686a7d65c6a02ed31 /usr/lib64/ld-2.30.so
    OK   a674f7a47c78e35a088104647b9640710277b489 /usr/sbin/sshd
    OK   e5cb4ca25f46485bdbc691c3a92e7e111dac3ef2 /usr/bin/bash
    OK   9bc8589108223c944b452f0819298a0c3cba6215 /usr/bin/find

    # find ~/.debug | head -5
    /root/.debug
    /root/.debug/proc
    /root/.debug/proc/kcore
    /root/.debug/proc/kcore/5dcec522abf136fcfd3128f47e131f2365834dd7
    /root/.debug/proc/kcore/5dcec522abf136fcfd3128f47e131f2365834dd7/kallsyms

  - run debuginfod daemon to provide binaries to another server (below)
    (the initialization could take some time)

    # debuginfod -F /

On another server:

  - copy perf.data from 'record' server and run:

    $ find ~/.debug/
    find: ‘/home/jolsa/.debug/’: No such file or directory

    $ perf buildid-list | head -5
    No kallsyms or vmlinux with build-id 5dcec522abf136fcfd3128f47e131f2365834dd7 was found
    5dcec522abf136fcfd3128f47e131f2365834dd7 [kernel.kallsyms]
    5784f813b727a50cfd3363234aef9fcbab685cc4 /lib/modules/5.10.0-rc2speed+/kernel/fs/xfs/xfs.ko
    589e403a34f55486bcac848a45e00bcdeedd1ca8 /usr/lib64/libcrypto.so.1.1.1g
    94569566d4eac7e9c87ba029d43d4e2158f9527e /usr/lib64/libpthread-2.30.so
    559b9702bebe31c6d132c8dc5cc887673d65d5b5 /usr/lib64/libc-2.30.so

  - report does not show anything (kernel build id does not match):

   $ perf report --stdio
   ...
    76.73%  swapper          [kernel.kallsyms]    [k] 0xffffffff81aa8ebe
     1.89%  find             [kernel.kallsyms]    [k] 0xffffffff810f2167
     0.93%  sshd             [kernel.kallsyms]    [k] 0xffffffff8153380c
     0.83%  swapper          [kernel.kallsyms]    [k] 0xffffffff81104b0b
     0.71%  kworker/u40:2-e  [kernel.kallsyms]    [k] 0xffffffff810f3850
     0.70%  kworker/u40:0-e  [kernel.kallsyms]    [k] 0xffffffff810f3850
     0.64%  find             [kernel.kallsyms]    [k] 0xffffffff81a9ba0a
     0.63%  find             [kernel.kallsyms]    [k] 0xffffffff81aa93b0

  - add build ids does not work, because existing binaries (on another server)
    have different build ids:

    $ perf buildid-cache -a perf.data
    No kallsyms or vmlinux with build-id 5dcec522abf136fcfd3128f47e131f2365834dd7 was found
    FAIL 5dcec522abf136fcfd3128f47e131f2365834dd7 [kernel.kallsyms]
    FAIL 5784f813b727a50cfd3363234aef9fcbab685cc4 /lib/modules/5.10.0-rc2speed+/kernel/fs/xfs/xfs.ko
    FAIL 589e403a34f55486bcac848a45e00bcdeedd1ca8 /usr/lib64/libcrypto.so.1.1.1g
    FAIL 94569566d4eac7e9c87ba029d43d4e2158f9527e /usr/lib64/libpthread-2.30.so
    FAIL 559b9702bebe31c6d132c8dc5cc887673d65d5b5 /usr/lib64/libc-2.30.so
    FAIL 40da7abe89f631f60538a17686a7d65c6a02ed31 /usr/lib64/ld-2.30.so
    FAIL a674f7a47c78e35a088104647b9640710277b489 /usr/sbin/sshd
    FAIL e5cb4ca25f46485bdbc691c3a92e7e111dac3ef2 /usr/bin/bash
    FAIL 9bc8589108223c944b452f0819298a0c3cba6215 /usr/bin/find

  - add build ids with debuginfod setup pointing to record server:

    $ perf buildid-cache -a perf.data --debuginfod http://192.168.122.174:8002
    No kallsyms or vmlinux with build-id 5dcec522abf136fcfd3128f47e131f2365834dd7 was found
    OK   5dcec522abf136fcfd3128f47e131f2365834dd7 [kernel.kallsyms]
    OK   5784f813b727a50cfd3363234aef9fcbab685cc4 /lib/modules/5.10.0-rc2speed+/kernel/fs/xfs/xfs.ko
    OK   589e403a34f55486bcac848a45e00bcdeedd1ca8 /usr/lib64/libcrypto.so.1.1.1g
    OK   94569566d4eac7e9c87ba029d43d4e2158f9527e /usr/lib64/libpthread-2.30.so
    OK   559b9702bebe31c6d132c8dc5cc887673d65d5b5 /usr/lib64/libc-2.30.so
    OK   40da7abe89f631f60538a17686a7d65c6a02ed31 /usr/lib64/ld-2.30.so
    OK   a674f7a47c78e35a088104647b9640710277b489 /usr/sbin/sshd
    OK   e5cb4ca25f46485bdbc691c3a92e7e111dac3ef2 /usr/bin/bash
    OK   9bc8589108223c944b452f0819298a0c3cba6215 /usr/bin/find

  - and report works:

    $ perf report --stdio
    ...
    76.73%  swapper          [kernel.kallsyms]    [k] native_safe_halt
     1.91%  find             [kernel.kallsyms]    [k] queue_work_on
     0.93%  sshd             [kernel.kallsyms]    [k] iowrite16
     0.83%  swapper          [kernel.kallsyms]    [k] finish_task_switch
     0.72%  kworker/u40:2-e  [kernel.kallsyms]    [k] process_one_work
     0.70%  kworker/u40:0-e  [kernel.kallsyms]    [k] process_one_work
     0.64%  find             [kernel.kallsyms]    [k] syscall_enter_from_user_mode
     0.63%  find             [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore

  - because we have the data in build id cache:

    $ find ~/.debug | head -10
    .../.debug
    .../.debug/home
    .../.debug/home/jolsa
    .../.debug/home/jolsa/.cache
    .../.debug/home/jolsa/.cache/debuginfod_client
    .../.debug/home/jolsa/.cache/debuginfod_client/5dcec522abf136fcfd3128f47e131f2365834dd7
    .../.debug/home/jolsa/.cache/debuginfod_client/5dcec522abf136fcfd3128f47e131f2365834dd7/executable
    .../.debug/home/jolsa/.cache/debuginfod_client/5dcec522abf136fcfd3128f47e131f2365834dd7/executable/5dcec522abf136fcfd3128f47e131f2365834dd7
    .../.debug/home/jolsa/.cache/debuginfod_client/5dcec522abf136fcfd3128f47e131f2365834dd7/executable/5dcec522abf136fcfd3128f47e131f2365834dd7/elf
    .../.debug/home/jolsa/.cache/debuginfod_client/5dcec522abf136fcfd3128f47e131f2365834dd7/executable/5dcec522abf136fcfd3128f47e131f2365834dd7/debug

Available also in:
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  perf/build_id

thanks,
jirka
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

eed6a9a9

perf: Add build id data in mmap2 event · 88a16a13

Jiri Olsa authored Jan 14, 2021

Adding support to carry build id data in mmap2 event.

The build id data replaces maj/min/ino/ino_generation
fields, which are also used to identify map's binary,
so it's ok to replace them with build id data:

  union {
          struct {
                  u32       maj;
                  u32       min;
                  u64       ino;
                  u64       ino_generation;
          };
          struct {
                  u8        build_id_size;
                  u8        __reserved_1;
                  u16       __reserved_2;
                  u8        build_id[20];
          };
  };

Replaced maj/min/ino/ino_generation fields give us size
of 24 bytes. We use 20 bytes for build id data, 1 byte
for size and rest is unused.

There's new misc bit for mmap2 to signal there's build
id data in it:

  #define PERF_RECORD_MISC_MMAP_BUILD_ID   (1 << 14)
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20210114134044.1418404-4-jolsa@kernel.org

88a16a13

bpf: Add size arg to build_id_parse function · 921f88fc

Jiri Olsa authored Jan 14, 2021

It's possible to have other build id types (other than default SHA1).
Currently there's also ld support for MD5 build id.

Adding size argument to build_id_parse function, that returns (if defined)
size of the parsed build id, so we can recognize the build id type.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210114134044.1418404-3-jolsa@kernel.org

921f88fc

bpf: Move stack_map_get_build_id into lib · bd7525da

Jiri Olsa authored Jan 14, 2021

Moving stack_map_get_build_id into lib with
declaration in linux/buildid.h header:

  int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id);

This function returns build id for given struct vm_area_struct.
There is no functional change to stack_map_get_build_id function.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210114134044.1418404-2-jolsa@kernel.org

bd7525da

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 1d9f03c0
Jakub Kicinski authored Jan 14, 2021
```
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
```
1d9f03c0

Merge branch 'Atomics for eBPF' · 7064a734

Alexei Starovoitov authored Jan 14, 2021

Brendan Jackman says:

====================

There's still one unresolved review comment from John[3] which I
will resolve with a followup patch.

Differences from v6->v7 [1]:

* Fixed riscv build error detected by 0-day robot.

Differences from v5->v6 [1]:

* Carried Björn Töpel's ack for RISC-V code, plus a couple more acks from
  Yonhgong.

* Doc fixups.

* Trivial cleanups.

Differences from v4->v5 [1]:

* Fixed bogus type casts in interpreter that led to warnings from
  the 0day robot.

* Dropped feature-detection for Clang per Andrii's suggestion in [4].
  The selftests will now fail to build unless you have llvm-project
  commit 286daafd6512. The ENABLE_ATOMICS_TEST macro is still needed
  to support the no_alu32 tests.

* Carried some Acks from John and Yonghong.

* Dropped confusing usage of __atomic_exchange from prog_test in
  favour of __sync_lock_test_and_set.

* [Really] got rid of all the forest of instruction macros
  (BPF_ATOMIC_FETCH_ADD and friends); now there's just BPF_ATOMIC_OP
  to define all the instructions as we use them in the verifier
  tests. This makes the atomic ops less special in that API, and I
  don't think the resulting usage is actually any harder to read.

Differences from v3->v4 [1]:

* Added one Ack from Yonghong. He acked some other patches but those
  have now changed non-trivally so I didn't add those acks.

* Fixups to commit messages.

* Fixed disassembly and comments: first arg to atomic_fetch_* is a
  pointer.

* Improved prog_test efficiency. BPF progs are now all loaded in a
  single call, then the skeleton is re-used for each subtest.

* Dropped use of tools/build/feature in favour of a one-liner in the
  Makefile.

* Dropped the commit that created an emit_neg helper in the x86
  JIT. It's not used any more (it wasn't used in v3 either).

* Combined all the different filter.h macros (used to be
  BPF_ATOMIC_ADD, BPF_ATOMIC_FETCH_ADD, BPF_ATOMIC_AND, etc) into
  just BPF_ATOMIC32 and BPF_ATOMIC64.

* Removed some references to BPF_STX_XADD from tools/, samples/ and
  lib/ that I missed before.

Differences from v2->v3 [1]:

* More minor fixes and naming/comment changes

* Dropped atomic subtract: compilers can implement this by preceding
  an atomic add with a NEG instruction (which is what the x86 JIT did
  under the hood anyway).

* Dropped the use of -mcpu=v4 in the Clang BPF command-line; there is
  no longer an architecture version bump. Instead a feature test is
  added to Kbuild - it builds a source file to check if Clang
  supports BPF atomics.

* Fixed the prog_test so it no longer breaks
  test_progs-no_alu32. This requires some ifdef acrobatics to avoid
  complicating the prog_tests model where the same userspace code
  exercises both the normal and no_alu32 BPF test objects, using the
  same skeleton header.

Differences from v1->v2 [1]:

* Fixed mistakes in the netronome driver

* Addd sub, add, or, xor operations

* The above led to some refactors to keep things readable. (Maybe I
  should have just waited until I'd implemented these before starting
  the review...)

* Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
  include the BPF_FETCH flag

* Added a bit of documentation. Suggestions welcome for more places
  to dump this info...

The prog_test that's added depends on Clang/LLVM features added by
Yonghong in commit 286daafd6512 (was
https://reviews.llvm.org/D72184).

This only includes a JIT implementation for x86_64 - I don't plan to
implement JIT support myself for other architectures.

Operations
==========

This patchset adds atomic operations to the eBPF instruction set. The
use-case that motivated this work was a trivial and efficient way to
generate globally-unique cookies in BPF progs, but I think it's
obvious that these features are pretty widely applicable.  The
instructions that are added here can be summarised with this list of
kernel operations:

* atomic[64]_[fetch_]add
* atomic[64]_[fetch_]and
* atomic[64]_[fetch_]or
* atomic[64]_xchg
* atomic[64]_cmpxchg

The following are left out of scope for this effort:

* 16 and 8 bit operations
* Explicit memory barriers

Encoding
========

I originally planned to add new values for bpf_insn.opcode. This was
rather unpleasant: the opcode space has holes in it but no entire
instruction classes[2]. Yonghong Song had a better idea: use the
immediate field of the existing STX XADD instruction to encode the
operation. This works nicely, without breaking existing programs,
because the immediate field is currently reserved-must-be-zero, and
extra-nicely because BPF_ADD happens to be zero.

Note that this of course makes immediate-source atomic operations
impossible. It's hard to imagine a measurable speedup from such
instructions, and if it existed it would certainly not benefit x86,
which has no support for them.

The BPF_OP opcode fields are re-used in the immediate, and an
additional flag BPF_FETCH is used to mark instructions that should
fetch a pre-modification value from memory.

So, BPF_XADD is now called BPF_ATOMIC (the old name is kept to avoid
breaking userspace builds), and where we previously had .imm = 0, we
now have .imm = BPF_ADD (which is 0).

Operands
========

Reg-source eBPF instructions only have two operands, while these
atomic operations have up to four. To avoid needing to encode
additional operands, then:

- One of the input registers is re-used as an output register
  (e.g. atomic_fetch_add both reads from and writes to the source
  register).

- Where necessary (i.e. for cmpxchg) , R0 is "hard-coded" as one of
  the operands.

This approach also allows the new eBPF instructions to map directly
to single x86 instructions.

[1] Previous iterations:
    v1: https://lore.kernel.org/bpf/20201123173202.1335708-1-jackmanb@google.com/
    v2: https://lore.kernel.org/bpf/20201127175738.1085417-1-jackmanb@google.com/
    v3: https://lore.kernel.org/bpf/X8kN7NA7bJC7aLQI@google.com/
    v4: https://lore.kernel.org/bpf/20201207160734.2345502-1-jackmanb@google.com/
    v5: https://lore.kernel.org/bpf/20201215121816.1048557-1-jackmanb@google.com/
    v6: https://lore.kernel.org/bpf/20210112154235.2192781-1-jackmanb@google.com/

[2] Visualisation of eBPF opcode space:
    https://gist.github.com/bjackman/00fdad2d5dfff601c1918bc29b16e778

[3] Comment from John about propagating bounds in verifier:
    https://lore.kernel.org/bpf/5fcf0fbcc8aa8_9ab320853@john-XPS-13-9370.notmuch/

[4] Mail from Andrii about not supporting old Clang in selftests:
    https://lore.kernel.org/bpf/CAEf4BzYBddPaEzRUs=jaWSo5kbf=LZdb7geAUVj85GxLQztuAQ@mail.gmail.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

7064a734

bpf: Document new atomic instructions · de948576

Brendan Jackman authored Jan 14, 2021

Document new atomic instructions.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-12-jackmanb@google.com

de948576

bpf: Add tests for new BPF atomic operations · 98d666d0

Brendan Jackman authored Jan 14, 2021

The prog_test that's added depends on Clang/LLVM features added by
Yonghong in commit 286daafd6512 (was https://reviews.llvm.org/D72184).

Note the use of a define called ENABLE_ATOMICS_TESTS: this is used
to:

 - Avoid breaking the build for people on old versions of Clang
 - Avoid needing separate lists of test objects for no_alu32, where
   atomics are not supported even if Clang has the feature.

The atomics_test.o BPF object is built unconditionally both for
test_progs and test_progs-no_alu32. For test_progs, if Clang supports
atomics, ENABLE_ATOMICS_TESTS is defined, so it includes the proper
test code. Otherwise, progs and global vars are defined anyway, as
stubs; this means that the skeleton user code still builds.

The atomics_test.o userspace object is built once and used for both
test_progs and test_progs-no_alu32. A variable called skip_tests is
defined in the BPF object's data section, which tells the userspace
object whether to skip the atomics test.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-11-jackmanb@google.com

98d666d0

bpf: Add bitwise atomic instructions · 981f94c3

Brendan Jackman authored Jan 14, 2021

This adds instructions for

atomic[64]_[fetch_]and
atomic[64]_[fetch_]or
atomic[64]_[fetch_]xor

All these operations are isomorphic enough to implement with the same
verifier, interpreter, and x86 JIT code, hence being a single commit.

The main interesting thing here is that x86 doesn't directly support
the fetch_ version these operations, so we need to generate a CMPXCHG
loop in the JIT. This requires the use of two temporary registers,
IIUC it's safe to use BPF_REG_AX and x86's AUX_REG for this purpose.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-10-jackmanb@google.com

981f94c3

bpf: Pull out a macro for interpreting atomic ALU operations · 46291067

Brendan Jackman authored Jan 14, 2021

Since the atomic operations that are added in subsequent commits are
all isomorphic with BPF_ADD, pull out a macro to avoid the
interpreter becoming dominated by lines of atomic-related code.

Note that this sacrificies interpreter performance (combining
STX_ATOMIC_W and STX_ATOMIC_DW into single switch case means that we
need an extra conditional branch to differentiate them) in favour of
compact and (relatively!) simple C code.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-9-jackmanb@google.com

46291067

bpf: Add instructions for atomic_[cmp]xchg · 5ffa2550

Brendan Jackman authored Jan 14, 2021

This adds two atomic opcodes, both of which include the BPF_FETCH
flag. XCHG without the BPF_FETCH flag would naturally encode
atomic_set. This is not supported because it would be of limited
value to userspace (it doesn't imply any barriers). CMPXCHG without
BPF_FETCH woulud be an atomic compare-and-write. We don't have such
an operation in the kernel so it isn't provided to BPF either.

There are two significant design decisions made for the CMPXCHG
instruction:

 - To solve the issue that this operation fundamentally has 3
   operands, but we only have two register fields. Therefore the
   operand we compare against (the kernel's API calls it 'old') is
   hard-coded to be R0. x86 has similar design (and A64 doesn't
   have this problem).

   A potential alternative might be to encode the other operand's
   register number in the immediate field.

 - The kernel's atomic_cmpxchg returns the old value, while the C11
   userspace APIs return a boolean indicating the comparison
   result. Which should BPF do? A64 returns the old value. x86 returns
   the old value in the hard-coded register (and also sets a
   flag). That means return-old-value is easier to JIT, so that's
   what we use.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-8-jackmanb@google.com

5ffa2550

bpf: Add BPF_FETCH field / create atomic_fetch_add instruction · 5ca419f2

Brendan Jackman authored Jan 14, 2021

The BPF_FETCH field can be set in bpf_insn.imm, for BPF_ATOMIC
instructions, in order to have the previous value of the
atomically-modified memory location loaded into the src register
after an atomic op is carried out.
Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-7-jackmanb@google.com

5ca419f2

bpf: Move BPF_STX reserved field check into BPF_STX verifier code · c5bcb5eb

Brendan Jackman authored Jan 14, 2021

I can't find a reason why this code is in resolve_pseudo_ldimm64;
since I'll be modifying it in a subsequent commit, tidy it up.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-6-jackmanb@google.com

c5bcb5eb