Commits · d133e4f1fa12bff58e1203c7d6c75f993fb5dead · nexedi / linux

18 Jun, 2019 8 commits

netdevsim: Ignore IPv6 multipath notifications · d133e4f1

Ido Schimmel authored Jun 18, 2019

In a similar fashion to previous patch, have netdevsim ignore IPv6
multipath notifications for now.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d133e4f1

mlxsw: spectrum_router: Ignore IPv6 multipath notifications · f6c3bb75

Ido Schimmel authored Jun 18, 2019

IPv6 multipath notifications are about to be sent, but mlxsw is not
ready to process them, so ignore them.

The limitation will be lifted by a subsequent patch which will also stop
the kernel from sending a notification for each nexthop.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f6c3bb75

ipv6: Extend notifier info for multipath routes · d4b96c7b

Ido Schimmel authored Jun 18, 2019

Extend the IPv6 FIB notifier info with number of sibling routes being
notified.

This will later allow listeners to process one notification for a
multipath routes instead of N, where N is the number of nexthops.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4b96c7b

netlink: Add field to skip in-kernel notifications · c82481f7

Ido Schimmel authored Jun 18, 2019

The struct includes a 'skip_notify' flag that indicates if netlink
notifications to user space should be suppressed. As explained in commit
3b1137fe ("net: ipv6: Change notifications for multipath add to
RTA_MULTIPATH"), this is useful to suppress per-nexthop RTM_NEWROUTE
notifications when an IPv6 multipath route is added / deleted. Instead,
one notification is sent for the entire multipath route.

This concept is also useful for in-kernel notifications. Sending one
in-kernel notification for the addition / deletion of an IPv6 multipath
route - instead of one per-nexthop - provides a significant increase in
the insertion / deletion rate to underlying devices.

Add a 'skip_notify_kernel' flag to suppress in-kernel notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c82481f7

netlink: Document all fields of 'struct nl_info' · 3de205cd

Ido Schimmel authored Jun 18, 2019

Some fields were not documented. Add documentation.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3de205cd

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 714a485a

David S. Miller authored Jun 18, 2019

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2019-06-17

This series contains updates to the iavf driver only.

Akeem updates the driver to change how VLAN tags are being populated and
programmed into the hardware by starting from the first member of the
list until the number of allowed VLAN tags is exhausted.

Mitch fixed the variable type since the variable counter starts out
negative and climbs to zero, so use a signed integer instead of
unsigned.  Also increase the timeout to avoid erroneous errors.  Fixed
the driver to be able to handle when the hardware hands us a null
receive descriptor with no data attached, yet is still valid.

Aleksandr fixes the driver to use GFP_ATOMIC when allocating memory in
atomic context.

Avinash updates the driver to fix a calculation error in virtchnl
regarding the valid length.

Jakub does some refactoring of the commands processing the watchdog
state machine to reduce the length and complexity of the function.  Also
decalre watchdog task as delayed work and use a dedicated work queue to
service the driver tasks.

Paul updated the iavf_process_aq_command to call the necessary functions
to be able to clear cloud filter bits that need to be cleared.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

714a485a

mlxsw: spectrum_ptp: Fix compilation on 32-bit ARM · cd4bb2a3

Shalom Toledo authored Jun 18, 2019

Compilation on 32-bit ARM fails after commit 992aa864 ("mlxsw:
spectrum_ptp: Add implementation for physical hardware clock operations")
because of 64-bit division:

arm-linux-gnueabi-ld:
drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.o: in function
`mlxsw_sp1_ptp_phc_settime': spectrum_ptp.c:(.text+0x39c): undefined
reference to `__aeabi_uldivmod'

Fix by using div_u64().

Fixes: 992aa864 ("mlxsw: spectrum_ptp: Add implementation for physical hardware clock operations")
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd4bb2a3

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 13091aa3

David S. Miller authored Jun 17, 2019

Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.
Signed-off-by: David S. Miller <davem@davemloft.net>

13091aa3

17 Jun, 2019 32 commits

Merge branch 'UDP-GSO-audit-tests' · f97252a8

David S. Miller authored Jun 17, 2019

Fred Klassen says:

====================
UDP GSO audit tests

Updates to UDP GSO selftests ot optionally stress test CMSG
subsytem, and report the reliability and performance of both
TX Timestamping and ZEROCOPY messages.
====================
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f97252a8

net/udpgso_bench.sh test fails on error · 4ffc37f5

Fred Klassen authored Jun 17, 2019

Ensure that failure on any individual test results in an overall
failure of the test script.
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4ffc37f5

net/udpgso_bench.sh add UDP GSO audit tests · ade90d69

Fred Klassen authored Jun 17, 2019

Audit tests count the total number of messages sent and compares
with total number of CMSG received on error queue. Example:

    udp gso zerocopy timestamp audit
    udp rx:   1599 MB/s  1166414 calls/s
    udp tx:   1615 MB/s    27395 calls/s  27395 msg/s
    udp rx:   1634 MB/s  1192261 calls/s
    udp tx:   1633 MB/s    27699 calls/s  27699 msg/s
    udp rx:   1633 MB/s  1191358 calls/s
    udp tx:   1631 MB/s    27678 calls/s  27678 msg/s
    Summary over 4.000 seconds...
    sum udp tx:   1665 MB/s      82772 calls (27590/s)      82772 msgs (27590/s)
    Tx Timestamps:               82772 received                 0 errors
    Zerocopy acks:               82772 received

Errors are thrown if CMSG count does not equal send count,
example:

    Summary over 4.000 seconds...
    sum tcp tx:   7451 MB/s     493706 calls (123426/s)     493706 msgs (123426/s)
    ./udpgso_bench_tx: Unexpected number of Zerocopy completions:    493706 expected    493704 received

Also reduce individual test time from 4 to 3 seconds so that
overall test time does not increase significantly.

v3: Enhancements as per Willem de Bruijn <willemb@google.com>
    - document -P option for TCP audit
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ade90d69

net/udpgso_bench_tx: options to exercise TX CMSG · 79ebc3c2

Fred Klassen authored Jun 17, 2019

This enhancement adds options that facilitate load testing with
additional TX CMSG options, and to optionally print results of
various send CMSG operations.

These options are especially useful in isolating situations
where error-queue messages are lost when combined with other
CMSG operations (e.g. SO_ZEROCOPY).

New options:
    -a - count all CMSG messages and match to sent messages
    -T - add TX CMSG that requests TX software timestamps
    -H - similar to -T except request TX hardware timestamps
    -P - call poll() before reading error queue
    -v - print detailed results

v2: Enhancements as per Willem de Bruijn <willemb@google.com>
    - Updated control and buffer parameters for recvmsg
    - poll() parameter cleanup
    - fail on bad audit results
    - remove TOS options
    - improved reporting

v3: Enhancements as per Willem de Bruijn <willemb@google.com>
    - add SOF_TIMESTAMPING_OPT_TSONLY to eliminate MSG_TRUNC
    - general code cleanup
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

79ebc3c2

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 29f785ff

Linus Torvalds authored Jun 17, 2019

Pull vfs fixes from Al Viro:
 "MS_MOVE regression fix + breakage in fsmount(2) (also introduced in
  this cycle, along with fsmount(2) itself).

  I'm still digging through the piles of mail, so there might be more
  fixes to follow, but these two are obvious and self-contained, so
  there's no point delaying those..."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/namespace: fix unprivileged mount propagation
  vfs: fsmount: add missing mntget()

29f785ff

Merge branch 'net-ipv4-remove-erroneous-advancement-of-list-pointer' · 4bd366ce

David S. Miller authored Jun 17, 2019

Florian Westphal says:

====================
net: ipv4: remove erroneous advancement of list pointer

Tariq reported a soft lockup on net-next that Mellanox was able to
bisect to 2638eb8b ("net: ipv4: provide __rcu annotation for ifa_list").

While reviewing above patch I found a regression when addresses have a
lifetime specified.

Second patch extends rtnetlink.sh to trigger crash
(without first patch applied).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

4bd366ce

selftests: rtnetlink: add addresses with fixed life time · 3cfa1488

Florian Westphal authored Jun 17, 2019

This exercises kernel code path that deal with addresses that have
a limited lifetime.

Without previous fix, this triggers following crash on net-next:
 BUG: KASAN: null-ptr-deref in check_lifetime+0x403/0x670
 Read of size 8 at addr 0000000000000010 by task kworker [..]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

3cfa1488

net: ipv4: remove erroneous advancement of list pointer · 40008e92

Florian Westphal authored Jun 17, 2019

Causes crash when lifetime expires on an adress as garbage is
dereferenced soon after.

This used to look like this:

 for (ifap = &ifa->ifa_dev->ifa_list;
      *ifap != NULL; ifap = &(*ifap)->ifa_next) {
          if (*ifap == ifa) ...

but this was changed to:

struct in_ifaddr *tmp;

ifap = &ifa->ifa_dev->ifa_list;
tmp = rtnl_dereference(*ifap);
while (tmp) {
   tmp = rtnl_dereference(tmp->ifa_next); // Bogus
   if (rtnl_dereference(*ifap) == ifa) {
     ...
   ifap = &tmp->ifa_next;		// Can be NULL
   tmp = rtnl_dereference(*ifap);	// Dereference
   }
}

Remove the bogus assigment/list entry skip.

Fixes: 2638eb8b ("net: ipv4: provide __rcu annotation for ifa_list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

40008e92

net: dsa: sja1105: fix ptp link error · 78fe8a28

Arnd Bergmann authored Jun 17, 2019

Due to a reversed dependency, it is possible to build
the lower ptp driver as a loadable module and the actual
driver using it as built-in, causing a link error:

drivers/net/dsa/sja1105/sja1105_spi.o: In function `sja1105_static_config_upload':
sja1105_spi.c:(.text+0x6f0): undefined reference to `sja1105_ptp_reset'
drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x2d4): undefined reference to `sja1105et_ptp_cmd'
drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x604): undefined reference to `sja1105pqrs_ptp_cmd'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_remove':
sja1105_main.c:(.text+0x8d4): undefined reference to `sja1105_ptp_clock_unregister'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_rxtstamp_work':
sja1105_main.c:(.text+0x964): undefined reference to `sja1105_tstamp_reconstruct'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_setup':
sja1105_main.c:(.text+0xb7c): undefined reference to `sja1105_ptp_clock_register'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_port_deferred_xmit':
sja1105_main.c:(.text+0x1fa0): undefined reference to `sja1105_ptpegr_ts_poll'
sja1105_main.c:(.text+0x1fc4): undefined reference to `sja1105_tstamp_reconstruct'
drivers/net/dsa/sja1105/sja1105_main.o:(.rodata+0x5b0): undefined reference to `sja1105_get_ts_info'

Change the Makefile logic to always build the ptp module
the same way as the rest. Another option would be to
just add it to the same module and remove the exports,
but I don't know if there was a good reason to keep them
separate.

Fixes: bb77f36a ("net: dsa: sja1105: Add support for the PTP clock")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

78fe8a28

net: stmmac: fix unused-variable warning · c63d1e5c

Arnd Bergmann authored Jun 17, 2019

When building without CONFIG_OF, we get a harmless build warning:

drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function 'stmmac_phy_setup':
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:973:22: error: unused variable 'node' [-Werror=unused-variable]
struct device_node *node = priv->plat->phy_node;

Reword it so we always use the local variable, by making it the
fwnode pointer instead of the device_node.

Fixes: 74371272 ("net: stmmac: Convert to phylink and remove phylib logic")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

c63d1e5c

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · da0f3820

Linus Torvalds authored Jun 17, 2019

Pull networking fixes from David Miller:
 "Lots of bug fixes here:

   1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

   2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
      Crispin.

   3) Use after free in psock backlog workqueue, from John Fastabend.

   4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
      Salem.

   5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

   6) Network header needs to be set for packet redirect in nfp, from
      John Hurley.

   7) Fix udp zerocopy refcnt, from Willem de Bruijn.

   8) Don't assume linear buffers in vxlan and geneve error handlers,
      from Stefano Brivio.

   9) Fix TOS matching in mlxsw, from Jiri Pirko.

  10) More SCTP cookie memory leak fixes, from Neil Horman.

  11) Fix VLAN filtering in rtl8366, from Linus Walluij.

  12) Various TCP SACK payload size and fragmentation memory limit fixes
      from Eric Dumazet.

  13) Use after free in pneigh_get_next(), also from Eric Dumazet.

  14) LAPB control block leak fix from Jeremy Sowden"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
  lapb: fixed leak of control-blocks.
  tipc: purge deferredq list for each grp member in tipc_group_delete
  ax25: fix inconsistent lock state in ax25_destroy_timer
  neigh: fix use-after-free read in pneigh_get_next
  tcp: fix compile error if !CONFIG_SYSCTL
  hv_sock: Suppress bogus "may be used uninitialized" warnings
  be2net: Fix number of Rx queues used for flow hashing
  net: handle 802.1P vlan 0 packets properly
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  Revert "net: phylink: set the autoneg state in phylink_phy_change"
  bpf: fix nested bpf tracepoints with per-cpu data
  bpf: Fix out of bounds memory access in bpf_sk_storage
  vsock/virtio: set SOCK_DONE on peer shutdown
  net: dsa: rtl8366: Fix up VLAN filtering
  net: phylink: set the autoneg state in phylink_phy_change
  net: add high_order_alloc_disable sysctl/static key
  tcp: add tcp_tx_skb_cache sysctl
  ...

da0f3820

iavf: allow null RX descriptors · efa14c39

Mitch Williams authored May 14, 2019

In some circumstances, the hardware can hand us a null receive
descriptor, with no data attached but otherwise valid. Unfortunately,
the driver was ill-equipped to handle such an event, and would stop
processing packets at that point.

To fix this, use the Descriptor Done bit instead of the size to
determine whether or not a descriptor is ready to be processed. Add some
checks to allow for unused buffers.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

efa14c39

iavf: add call to iavf_[add|del]_cloud_filter · 68dfe634

Paul Greenwalt authored May 14, 2019

Add call to iavf_add_cloud_filter and iavf_del_cloud_filter from
iavf_process_aq_command to clear aq_required
IAVF_FLAG_AQ_ADD_CLOUD_FILTER and IAVF_FLAG_AQ_DEL_CLOUD_FILTER bits.

aq_required IAVF_FLAG_AQ_DEL_CLOUD_FILTER bit is being set in
iavf_down and iavf_delete_clsflower, and are never cleared.

aq_required IAVF_FLAG_AQ_ADD_CLOUD_FILTER bit is being set in
iavf_handle_reset and iavf_configure_clsflower, and are never
cleared.

Since the aq_required is not zero, iavf_watchdog_task is setting the
queue_delayed_work to 20 msec instead of the longer delay.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

68dfe634

iavf: Refactor init state machine · b66c7bc1

Jakub Pawlak authored May 14, 2019

Cleanup of init state machine, move state specific
code to separate functions and rewrite the
iavf_init_task() function.
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

b66c7bc1

iavf: Refactor the watchdog state machine · bac84861

Jan Sokolowski authored May 14, 2019

Refactor the watchdog state machine implementation.
Add the additional state __IAVF_COMM_FAILED to process
the PF communication fails. Prepare the watchdog state machine
to integrate with init state machine.
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

bac84861

iavf: Remove timer for work triggering, use delaying work instead · fdd4044f

Jakub Pawlak authored May 14, 2019

Remove the watchdog timer, instead declare watchdog task
as delayed work and use dedicated workqueue to service driver
tasks. The dedicated driver workqueue iavf_wq is common
for all driver instances.
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fdd4044f

iavf: Move commands processing to the separate function · b476b003

Jakub Pawlak authored May 14, 2019

Move the commands processing outside the watchdog_task()
function. This reduce length and complexity of the function
which is mainly designed to process the watchdog state machine.
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

b476b003

iavf: Fix the math for valid length for ADq enable · 16e00c25

Avinash Dayanand authored May 14, 2019

There was a calculation error in virtchnl regarding the valid
length which was fixed recently and a corresponding change needs
to go into the code while we enable ADq.
Signed-off-by: Avinash Dayanand <avinash.dayanand@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

16e00c25

iavf: Change GFP_KERNEL to GFP_ATOMIC in kzalloc() · f0a48fb4

Aleksandr Loktionov authored May 14, 2019

iavf_add_vlan() is being called in atomic context
so kzalloc() needs GFP_ATOMIC. This patch fixes it.
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

f0a48fb4

iavf: wait longer for close to complete · 88ec7308

Mitch Williams authored May 14, 2019

On some hardware/driver/architecture combinations, it may take longer
than 200msec for all close operations to be completed, causing a
spurious error message to be logged.

Increase the timeout value to 500msec to avoid this erroneous error.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

88ec7308

iavf: use signed variable · 168d91cf

Mitch Williams authored May 14, 2019

The counter variable in iavf_clean_tx_irq starts out negative and climbs
to 0. So allocating it as u16 is actually a really bad idea that just
happens to work because the value underflows and overflows consistently
on most architectures.

Replace the u16 with an int so signed math works as expected.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

168d91cf

iavf: Create VLAN tag elements starting from the first element · c2417a7b

Akeem G Abodunrin authored May 14, 2019

This patch changes how VLAN tag are being populated and programmed into
the HW - Instead of start adding VF VLAN tag from the last member of the
element list, start from the first member of the list, until number of
allowed VLAN tags is exhausted in the HW.
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

c2417a7b

fs/namespace: fix unprivileged mount propagation · d728cf79

Christian Brauner authored Jun 17, 2019

When propagating mounts across mount namespaces owned by different user
namespaces it is not possible anymore to move or umount the mount in the
less privileged mount namespace.

Here is a reproducer:

  sudo mount -t tmpfs tmpfs /mnt
  sudo --make-rshared /mnt

  # create unprivileged user + mount namespace and preserve propagation
  unshare -U -m --map-root --propagation=unchanged

  # now change back to the original mount namespace in another terminal:
  sudo mkdir /mnt/aaa
  sudo mount -t tmpfs tmpfs /mnt/aaa

  # now in the unprivileged user + mount namespace
  mount --move /mnt/aaa /opt

Unfortunately, this is a pretty big deal for userspace since this is
e.g. used to inject mounts into running unprivileged containers.
So this regression really needs to go away rather quickly.

The problem is that a recent change falsely locked the root of the newly
added mounts by setting MNT_LOCKED. Fix this by only locking the mounts
on copy_mnt_ns() and not when adding a new mount.

Fixes: 3bd045cc ("separate copying and locking mount tree on cross-userns copies")
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Tested-by: Christian Brauner <christian@brauner.io>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d728cf79

vfs: fsmount: add missing mntget() · 1b0b9cc8

Eric Biggers authored Jun 12, 2019

sys_fsmount() needs to take a reference to the new mount when adding it
to the anonymous mount namespace. Otherwise the filesystem can be
unmounted while it's still in use, as found by syzkaller.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: syzbot+99de05d099a170867f22@syzkaller.appspotmail.com
Reported-by: syzbot+7008b8b8ba7df475fdc8@syzkaller.appspotmail.com
Fixes: 93766fbd ("vfs: syscall: Add fsmount() to create a mount for a superblock")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1b0b9cc8

net: sched: cls_matchall: allow to delete filter · f517f271

Jiri Pirko authored Jun 17, 2019

Currently user is unable to delete the filter. See following example:
$ tc filter add dev ens16np1 ingress pref 1 handle 1 matchall action drop
$ tc filter show dev ens16np1 ingress
filter protocol all pref 1 matchall chain 0
filter protocol all pref 1 matchall chain 0 handle 0x1
  in_hw
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1

$ tc filter del dev ens16np1 ingress pref 1 handle 1 matchall action drop
RTNETLINK answers: Operation not supported

Implement tcf_proto_ops->delete() op and allow user to delete the filter.
Reported-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f517f271

net: hns3: fix dereference of ae_dev before it is null checked · ad9bf545

Colin Ian King authored Jun 17, 2019

Pointer ae_dev is null checked however, prior to that it is dereferenced
when assigned pointer ops. Fix this by assigning pointer ops after ae_dev
has been null checked.

Addresses-Coverity: ("Dereference before null check")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad9bf545

Merge branch 'net-sched-act_ctinfo-fixes' · 43321251

David S. Miller authored Jun 17, 2019

Kevin Darbyshire-Bryant says:

====================
net: sched: act_ctinfo: fixes

This is first attempt at sending a small series.  Order is important
because one bug (policy validation) prevents us from encountering the
more important 'OOPS' generating bug in action creation.  Fix the OOPS
first.

Confession time: Until very recently, development of this module has
been done on 'net-next' tree to 'clean compile' level with run-time
testing on backports to 4.14 & 4.19 kernels under openwrt.  It turns out
that sched: action: based code has been under more active change than I
realised.

During the back & forward porting during development & testing, the
critical ACT_P_CREATED return code got missed despite being in the 4.14
& 4.19 backports.  I have now gone through the init functions, using
act_csum as reference with a fine toothed comb and am happy they do the
same things.

This issue hadn't been caught till now due to another issue caused by
new strict nla_parse_nested function failing parsing validation before
action creation.

Thanks to Marcelo Leitner <marcelo.leitner@gmail.com> for flagging
extack deficiency (fixed in 733f0766 sched: act_ctinfo: use extack
error reporting) which led to b424e432 ("netlink: add validation of
NLA_F_NESTED flag") and 8cb08174 ("netlink: make validation more
configurable for future strictness”) which led to the policy validation
fix, which then led to the action creation fix both contained in this
series.

If I ever get to a developer conference please feel free to
tar/feather/apply cone of shame.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

43321251

net: sched: act_ctinfo: fix policy validation · c197d636

Kevin Darbyshire-Bryant authored Jun 17, 2019

Fix nla_policy definition by specifying an exact length type attribute
to CTINFO action paraneter block structure. Without this change,
netlink parsing will fail validation and the action will not be
instantiated.

8cb08174 ("netlink: make validation more configurable for future")
introduced much stricter checking to attributes being passed via
netlink. Existing actions were updated to use less restrictive
deprecated versions of nla_parse_nested.

As a new module, act_ctinfo should be designed to use the strict
checking model otherwise, well, what was the point of implementing it.

Confession time: Until very recently, development of this module has
been done on 'net-next' tree to 'clean compile' level with run-time
testing on backports to 4.14 & 4.19 kernels under openwrt. This is how
I managed to miss the run-time impacts of the new strict
nla_parse_nested function. I hopefully have learned something from this
(glances toward laptop running a net-next kernel)

There is however a still outstanding implication on iproute2 user space
in that it needs to be told to pass nested netlink messages with the
nested attribute actually set. So even with this kernel fix to do
things correctly you still cannot instantiate a new 'strict'
nla_parse_nested based action such as act_ctinfo with iproute2's tc.
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

c197d636

net: sched: act_ctinfo: fix action creation · a658c2e4

Kevin Darbyshire-Bryant authored Jun 17, 2019

Use correct return value on action creation: ACT_P_CREATED.

The use of incorrect return value could result in a situation where the
system thought a ctinfo module was listening but actually wasn't
instantiated correctly leading to an OOPS in tcf_generic_walker().

Confession time: Until very recently, development of this module has
been done on 'net-next' tree to 'clean compile' level with run-time
testing on backports to 4.14 & 4.19 kernels under openwrt. During the
back & forward porting during development & testing, the critical
ACT_P_CREATED return code got missed despite being in the 4.14 & 4.19
backports. I have now gone through the init functions, using act_csum
as reference with a fine toothed comb. Bonus, no more OOPSes. I
managed to also miss this issue till now due to the new strict
nla_parse_nested function failing validation before action creation.

As an inexperienced developer I've learned that
copy/pasting/backporting/forward porting code correctly is hard. If I
ever get to a developer conference I shall don the cone of shame.
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

a658c2e4

vhost_net: disable zerocopy by default · 098eadce

Jason Wang authored Jun 17, 2019

Vhost_net was known to suffer from HOL[1] issues which is not easy to
fix. Several downstream disable the feature by default. What's more,
the datapath was split and datacopy path got the support of batching
and XDP support recently which makes it faster than zerocopy part for
small packets transmission.

It looks to me that disable zerocopy by default is more
appropriate. It cold be enabled by default again in the future if we
fix the above issues.

[1] https://patchwork.kernel.org/patch/3787671/Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

098eadce

net: ipv4: move tcp_fastopen server side code to SipHash library · c681edae

Ard Biesheuvel authored Jun 17, 2019

Using a bare block cipher in non-crypto code is almost always a bad idea,
not only for security reasons (and we've seen some examples of this in
the kernel in the past), but also for performance reasons.

In the TCP fastopen case, we call into the bare AES block cipher one or
two times (depending on whether the connection is IPv4 or IPv6). On most
systems, this results in a call chain such as

  crypto_cipher_encrypt_one(ctx, dst, src)
    crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
      aesni_encrypt
        kernel_fpu_begin();
        aesni_enc(ctx, dst, src); // asm routine
        kernel_fpu_end();

It is highly unlikely that the use of special AES instructions has a
benefit in this case, especially since we are doing the above twice
for IPv6 connections, instead of using a transform which can process
the entire input in one go.

We could switch to the cbcmac(aes) shash, which would at least get
rid of the duplicated overhead in *some* cases (i.e., today, only
arm64 has an accelerated implementation of cbcmac(aes), while x86 will
end up using the generic cbcmac template wrapping the AES-NI cipher,
which basically ends up doing exactly the above). However, in the given
context, it makes more sense to use a light-weight MAC algorithm that
is more suitable for the purpose at hand, such as SipHash.

Since the output size of SipHash already matches our chosen value for
TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
sizes, this greatly simplifies the code as well.

NOTE: Server farms backing a single server IP for load balancing purposes
      and sharing a single fastopen key will be adversely affected by
      this change unless all systems in the pool receive their kernel
      upgrades at the same time.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c681edae

tipc: include retrans failure detection for unicast · 6a6b5c8b

Tuong Lien authored Jun 17, 2019

In patch series, commit 9195948f ("tipc: improve TIPC throughput by
Gap ACK blocks"), as for simplicity, the repeated retransmit failures'
detection in the function - "tipc_link_retrans()" was kept there for
broadcast retransmissions only.

This commit now reapplies this feature for link unicast retransmissions
that has been done via the function - "tipc_link_advance_transmq()".

Also, the "tipc_link_retrans()" is renamed to "tipc_link_bc_retrans()"
as it is used only for broadcast.
Acked-by: Jon Maloy <jon.maloy@ericsson.se>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a6b5c8b