Commits · a8c2eb15797a0f0bf17734d365da7bdc3e263155 · nexedi / linux

03 Sep, 2017 16 commits

net/mlx5e: Stop NAPI when irq balancer changes affinity · a8c2eb15

Tariq Toukan authored Jul 02, 2017

NAPI context keeps rescheduling on same CPU as long as it's busy.
This doesn't give the oppurtunity for changes in irq affinities
to take effect.
Fix that by calling napi_complete_done() upon a change in affinity.
This would stop the NAPI and reschedule it on the new CPU.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

a8c2eb15

net/mlx5e: Use kernel's mechanism to avoid missing NAPIs · 7b33aaea

Tariq Toukan authored Jul 02, 2017

We used a channel state bit MLX5E_CHANNEL_NAPI_SCHED to make
sure no NAPI is missed when a channel's napi_schedule() is called
for completion events of the different channel's resources/rings
while NAPI is currently running.
Now, as similar mechanism is implemented in kernel,
("39e6c820 net: solve a NAPI race"),
we obsolete our own implementation and rely on the return value
of napi_complete_done().

This patch removes a redundant overhead of atomic bit operations.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

7b33aaea

net/mlx5e: Slightly increase RX page-cache size · 29c2849e

Tariq Toukan authored Jul 02, 2017

In XDP_TX flow, we now get back quicker to each page in page-cache,
and on some occasions refcount does not get back to 1 on time, causing
some costly page allocations.
Slightly increase the size of RX page-cache to significantly decrease
the chances for this to happen.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

29c2849e

net/mlx5e: Don't recycle page if moved to far NUMA · 70871f1e

Tariq Toukan authored Jul 13, 2017

Avoid recycling an RX page if it moved to another NUMA node.
Add an ethtool counter to count such events.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

70871f1e

net/mlx5e: Remove unnecessary fields in ICO SQ · 3b56f7b2

Tariq Toukan authored Jul 17, 2017

As of current design, in each NAPI, only a single UMR WQE
completion could be available in the completion queue of the
the internal control operations (ICO) send queue, in addition
to nop operations that require no actions upon completion.
This renders the consume index obsolete, as the wqe_counter
field in CQE is sufficient.

This helps removing a memory barrier, and obsoletes the need
for tracking the num_wqebbs to update the consumer counter.

In addition, remove other unused fields in icosq struct:
pdev, dma_fifo_pc, and prev_cc.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

3b56f7b2

net/mlx5e: Type-specific optimizations for RX post WQEs function · 7cc6d77b

Tariq Toukan authored Jul 17, 2017

Separate the RX post WQEs function of the different RQ types.
This enables RQ type-specific optimizations in data-path.

Poll the ICOSQ completion queue only for Striding RQ,
and only when a UMR post completion could be possibly available.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

7cc6d77b

net/mlx5e: Non-atomic RQ state indicator for UMR WQE in progress · a071cb9f

Tariq Toukan authored Jul 03, 2017

The indication for a UMR WQE in progress is needed only within
the NAPI context, and hence no races possible and no need for
the use of atomic operations.
The only place the flag is read outside of NAPI context is
in closure flow, after RQ is disabled flag is no more accessed
in NAPI.
Use a boolean instead of a bit in ring state, so that its
non-atomic set operations do not race with the atomic sets of
the other bits.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

a071cb9f

net/mlx5e: Non-atomic indicator for ring enabled state · a1eaba4c

Tariq Toukan authored Jul 03, 2017

Rings enabled state change occurs in control path only, and is always
followed by a napi_sychronize(), so that following NAPIs read the
new value. This read does not need to be atomic.

The RQ auto-moderation bit is not set/cleared in data-path.
No need for atomic read, a regular read operation is sufficient.
In RQ creation time as well, there's no multiple threads trying
to access it yet, hence a regular read can be used.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

a1eaba4c

net/mlx5e: Refactor data-path lro header function · 604acb19

Tariq Toukan authored Jun 05, 2017

Refactor function mlx5e_lro_update_hdr() to reduce number of
branches.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

604acb19

net/mlx5e: Early-return on empty completion queues · 4b7dfc99

Tariq Toukan authored Jun 19, 2017

NAPI context handles different kinds of completion queues
(RX, TX, and others). Hence, upon a poll trial, some of them
might be empty.
Here we early-return upon empty completion queues, as well as
full rx buffer, and save unnecessary logic and memory barriers.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

4b7dfc99

net/mlx5e: NAPI busy-poll when UMR post is in progress · 4cbb7558

Tariq Toukan authored Jun 19, 2017

If a UMR post is in progress, it means that there's a missing
WQE in RQ, and that a completion will be shortly available in
ICO SQ completion queue. Prefer busy-poll to handle it as soon
as possible.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

4cbb7558

net/mlx5e: Small enhancements for RX MPWQE allocation and free · 4c2af5cc

Tariq Toukan authored Jun 25, 2017

The dma offset of a MPWQE (Multi-Packet WQE) in memory region
is fixed for all rounds. Calculate it once on creation time,
instead of in runtime. This also obsoletes the wqe argument in
the function.

In addition, optimize dma_info iterator calculation.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

4c2af5cc

net/mlx5e: Use memset to init skbs_frags array to zeros · 9bafe2ad

Tariq Toukan authored Jun 25, 2017

In RX data-path, use memset() instead of loop assignment
to init the whole skbs_frags array.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

9bafe2ad

net/mlx5e: Remove unnecessary wqe_sz field from RQ buffer · b681c481

Tariq Toukan authored Jul 03, 2017

Field is used only locally within the RQ create function.
The use of a local variable is sufficient.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

b681c481

net/mlx5e: Replace multiplication by stride size with a shift · 89e89f7a

Tariq Toukan authored Jul 02, 2017

In RX data-path, use shift operations instead of a regular multiplication
by stride size, as it is a power of two.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

89e89f7a

net/mlx5e: Reorganize struct mlx5e_rq · b45d8b50

Tariq Toukan authored Feb 13, 2017

Bring fast-path fields together, and combine RX WQE mutual
exclusive fields into a union.

Page-reuse and XDP are mutually exclusive and cannot be used at
the same time.
Use a union to combine their footprints.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

b45d8b50

02 Sep, 2017 23 commits

Merge branch 'hv_netvsc-channel-settings-cleanups-and-fixes' · 32d9b70a

David S. Miller authored Sep 01, 2017

Haiyang Zhang says:

====================
hv_netvsc: cleanups and fixes of channel settings

This patch set cleans up some unused variables, unnecessary checks.
Also fixed some limit checking of channel number.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

32d9b70a

hv_netvsc: Fix the channel limit in netvsc_set_rxfh() · db3cd7af

Haiyang Zhang authored Sep 01, 2017

The limit of setting receive indirection table value should be
the current number of channels, not the VRSS_CHANNEL_MAX.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

db3cd7af

hv_netvsc: Simplify the limit check in netvsc_set_channels() · 06be580a

Haiyang Zhang authored Sep 01, 2017

Because of the following code, net->num_tx_queues equals to
VRSS_CHANNEL_MAX, and max_chn is less than or equals to VRSS_CHANNEL_MAX.

netvsc_drv.c:
alloc_etherdev_mq(sizeof(struct net_device_context),
                                VRSS_CHANNEL_MAX);
rndis_filter.c:
net_device->max_chn = min_t(u32, VRSS_CHANNEL_MAX, num_possible_rss_qs);

So this patch removes the unnecessary limit check before comparing
with "max_chn".
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06be580a

hv_netvsc: Simplify num_chn checking in rndis_filter_device_add() · 5c4217d0

Haiyang Zhang authored Sep 01, 2017

The minus one and assignment to a local variable is not necessary.
This patch simplifies it.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c4217d0

hv_netvsc: Clean up an unused parameter in rndis_filter_set_rss_param() · 715e2ec5

Haiyang Zhang authored Sep 01, 2017

This patch removes the parameter, num_queue in
rndis_filter_set_rss_param(), which is no longer in use.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

715e2ec5

net: Add module reference to FIB notifiers · 864150df

Ido Schimmel authored Sep 01, 2017

When a listener registers to the FIB notification chain it receives a
dump of the FIB entries and rules from existing address families by
invoking their dump operations.

While we call into these modules we need to make sure they aren't
removed. Do that by increasing their reference count before invoking
their dump operations and decrease it afterwards.

Fixes: 04b1d4e5 ("net: core: Make the FIB notification chain generic")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

864150df

Merge branch 'netvsc-vf-cleanups' · 9e2cf36d

David S. Miller authored Sep 01, 2017

Stephen Hemminger says:

====================
netvsc: transparent VF related cleanups

The first gets rid of unnecessary ref counting, and second
allows removing hv_netvsc driver even if VF present.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9e2cf36d

netvsc: allow driver to be removed even if VF is present · ec158f77

Stephen Hemminger authored Aug 31, 2017

If VF is attached then can still allow netvsc driver module to
be removed. Just have to make sure and do the cleanup.

Also, avoid extra rtnl round trip when calling unregister.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec158f77

netvsc: cleanup datapath switch · 9a0c48df

Stephen Hemminger authored Aug 31, 2017

Use one routine for datapath up/down. Don't need to reopen
the rndis layer.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9a0c48df

bpf: sockmap update/simplify memory accounting scheme · 90a9631c

John Fastabend authored Sep 01, 2017

Instead of tracking wmem_queued and sk_mem_charge by incrementing
in the verdict SK_REDIRECT paths and decrementing in the tx work
path use skb_set_owner_w and sock_writeable helpers. This solves
a few issues with the current code. First, in SK_REDIRECT inc on
sk_wmem_queued and sk_mem_charge were being done without the peers
sock lock being held. Under stress this can result in accounting
errors when tx work and/or multiple verdict decisions are working
on the peer psock.

Additionally, this cleans up the code because we can rely on the
default destructor to decrement memory accounting on kfree_skb. Also
this will trigger sk_write_space when space becomes available on
kfree_skb() which wasn't happening before and prevent __sk_free
from being called until all in-flight packets are completed.

Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

90a9631c

Merge branch 'net-ubuf_info-refcnt-conversion' · 250b0f78

David S. Miller authored Sep 01, 2017

Eric Dumazet says:

====================
net: ubuf_info.refcnt conversion

Yet another atomic_t -> refcount_t conversion, split in two patches.

First patch prepares the automatic conversion done in the second patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

250b0f78

net: convert (struct ubuf_info)->refcnt to refcount_t · c1d1b437

Eric Dumazet authored Aug 31, 2017

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

v2: added the change in drivers/vhost/net.c as spotted
by Willem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1d1b437

net: prepare (struct ubuf_info)->refcnt conversion · db5bce32

Eric Dumazet authored Aug 31, 2017

In order to convert this atomic_t refcnt to refcount_t,
we need to init the refcount to one to not trigger
a 0 -> 1 transition.

This also removes one atomic operation in fast path.

v2: removed dead code in sock_zerocopy_put_abort()
as suggested by Willem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

db5bce32

net: systemport: Correctly set TSB endian for host · 487234cc

Florian Fainelli authored Sep 01, 2017

Similarly to how we configure the RSB (Receive Status Block) we also
need to set the TSB (Transmit Status Block) based on the host endian.
This was missing from the commit indicated below.

Fixes: 389a06bc ("net: systemport: Set correct RSB endian bits based on host")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

487234cc

Merge branch 'inet_diag-TCP-MD5' · d49e3a9f

David S. Miller authored Sep 01, 2017

Ivan Delalande says:

====================
inet_diag: report TCP MD5 signing keys and addresses

Allow userspace to retrieve MD5 signature keys and addresses configured
on TCP sockets through inet_diag.

Thanks to Eric Dumazet and Stephen Hemminger for their useful
explanations and feedback.

v5: - memset the whole netlink payload after it has been nla_reserve-d
      in tcp_diag_put_md5sig (a third memset had to be added for
      tcpm_key so we might as well have just one for entire region).
    - move the nla_total_size call from inet_sk_attr_size to the
      idiag_get_aux_size defined by protocols as they could add multiple
      netlink attributes,
    - add check for net_admin in tcp_diag_get_aux_size.

v4: - add new struct tcp_diag_md5sig to report the data instead of
      tcp_md5sig to avoid wasting 112 bytes on every tcpm_addr,
    - memset tcpm_addr on IPv4 addresses to avoid leaks,
    - style fix in inet_diag_dump_one_icsk.

v3: - rename inet_diag_*md5sig in tcp_diag.c to tcp_diag_* for
      consistency,
    - don't lock the socket in tcp_diag_put_md5sig,
    - add checks on md5sig_count in tcp_diag_put_md5sig to not create
      the netlink attribute if the list is empty, and to avoid overflows
      or memory leaks if the list has changed in the meantime.

v2: - move changes to tcp_diag.c and extend inet_diag_handler to allow
      protocols to provide additional data on INET_DIAG_INFO,
    - lock socket before calling tcp_diag_put_md5sig.

I also have a patch for iproute2/ss to test this change, making it print
this new attribute. I'm planning to polish and send it if this series
gets applied.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d49e3a9f

tcp_diag: report TCP MD5 signing keys and addresses · c03fa9bc

Ivan Delalande authored Aug 31, 2017

Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
not possible to retrieve these from the kernel once they have been
configured on sockets.
Signed-off-by: Ivan Delalande <colona@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c03fa9bc

inet_diag: allow protocols to provide additional data · b37e8840

Ivan Delalande authored Aug 31, 2017

Extend inet_diag_handler to allow individual protocols to report
additional data on INET_DIAG_INFO through idiag_get_aux. The size
can be dynamic and is computed by idiag_get_aux_size.
Signed-off-by: Ivan Delalande <colona@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b37e8840

ipv6: sr: Use ARRAY_SIZE macro · 6391c4f6

Thomas Meyer authored Aug 31, 2017

Grepping for "sizeof\(.+\) / sizeof\(" found this as one of the first
candidates.
Maybe a coccinelle can catch all of those.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

6391c4f6

net: qualcomm: rmnet: remove unused variable priv · 5debc53f

Colin Ian King authored Aug 31, 2017

priv is being assigned but is never used, so remove it.

Cleans up clang build warning:
"warning: Value stored to 'priv' is never read"

Fixes: ceed73a2 ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5debc53f

net: phy: bcm7xxx: make array bcm7xxx_suspend_cfg static, reduces object code size · 33c81821

Colin Ian King authored Aug 31, 2017

Don't populate the array bcm7xxx_suspend_cfg A on the stack, instead
make it static.  Makes the object code smaller by over 300 bytes:

Before:
   text	   data	    bss	    dec	    hex	filename
   6351	   8146	      0	  14497	   38a1	drivers/net/phy/bcm7xxx.o

After:
   text	   data	    bss	    dec	    hex	filename
   5986	   8210	      0	  14196	   3774	drivers/net/phy/bcm7xxx.o
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

33c81821

fsl/fman: make arrays port_ids static, reduces object code size · d05071ed

Colin Ian King authored Aug 31, 2017

Don't populate the arrays port_ids on the stack, instead make them static.
Makes the object code smaller by over 700 bytes:

Before:
   text	   data	    bss	    dec	    hex	filename
  28785	   5832	    192	  34809	   87f9	fman.o

After:
   text	   data	    bss	    dec	    hex	filename
  27921	   5992	    192	  34105	   8539	fman.o
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d05071ed

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 6026e043
David S. Miller authored Sep 01, 2017
```
Three cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
```
6026e043

inetpeer: fix RCU lookup() · 4cc5b44b

Eric Dumazet authored Sep 01, 2017

Excess of seafood or something happened while I cooked the commit
adding RB tree to inetpeer.

Of course, RCU rules need to be respected or bad things can happen.

In this particular loop, we need to read *pp once per iteration, not
twice.

Fixes: b145425f ("inetpeer: remove AVL implementation in favor of RB tree")
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4cc5b44b

01 Sep, 2017 1 commit

epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() · 138e4ad6

Oleg Nesterov authored Sep 01, 2017

The race was introduced by me in commit 971316f0 ("epoll:
ep_unregister_pollwait() can use the freed pwq->whead").  I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
->whead = NULL, only whead->lock can save us from the race with
ep_free() or ep_remove().

Move ->whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

	BUG: KASAN: use-after-free in debug_spin_lock_before
	...
	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll->lock),

	...
	Freed by task 17774:
	...
	 kfree+0xe8/0x2c0 mm/slub.c:3883
	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f0 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞 <long7573@126.com>
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

138e4ad6