Commits · 5e092be7418fdf0e1e288529bd7e657cb9d7954c · Kirill Smelkov / linux

18 Jun, 2023 2 commits

NFSD: Distinguish per-net namespace initialization · 5e092be7

Chuck Lever authored Jun 16, 2023

I find the naming of nfsd_init_net() and nfsd_startup_net() to be
confusingly similar. Rename the namespace initialization and tear-
down ops and add comments to distinguish their separate purposes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5e092be7

nfsd: move init of percpu reply_cache_stats counters back to nfsd_init_net · ed9ab734

Jeff Layton authored Jun 16, 2023

Commit f5f9d4a3 ("nfsd: move reply cache initialization into nfsd
startup") moved the initialization of the reply cache into nfsd startup,
but didn't account for the stats counters, which can be accessed before
nfsd is ever started. The result can be a NULL pointer dereference when
someone accesses /proc/fs/nfsd/reply_cache_stats while nfsd is still
shut down.

This is a regression and a user-triggerable oops in the right situation:

- non-x86_64 arch
- /proc/fs/nfsd is mounted in the namespace
- nfsd is not started in the namespace
- unprivileged user calls "cat /proc/fs/nfsd/reply_cache_stats"

Although this is easy to trigger on some arches (like aarch64), on
x86_64, calling this_cpu_ptr(NULL) evidently returns a pointer to the
fixed_percpu_data. That struct looks just enough like a newly
initialized percpu var to allow nfsd_reply_cache_stats_show to access
it without Oopsing.

Move the initialization of the per-net+per-cpu reply-cache counters
back into nfsd_init_net, while leaving the rest of the reply cache
allocations to be done at nfsd startup time.

Kudos to Eirik who did most of the legwork to track this down.

Cc: stable@vger.kernel.org # v6.3+
Fixes: f5f9d4a3 ("nfsd: move reply cache initialization into nfsd startup")
Reported-and-tested-by: Eirik Fuller <efuller@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2215429
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ed9ab734
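
The ordering problem fixed by ed9ab734 can be sketched in userspace C. This is only a model under stated assumptions: every name below (`model_init_net`, `model_startup_net`, `model_stats_show`, `struct model_net`) is hypothetical, and plain heap allocations stand in for the kernel's per-net and per-cpu allocations. The point it illustrates is that anything a procfs reader can reach before the server starts has to be allocated at namespace initialization, while the rest can wait until startup.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the reply-cache statistics counters. */
struct model_stats {
	unsigned long hits;
	unsigned long misses;
};

/* Hypothetical stand-in for the per-namespace nfsd state. */
struct model_net {
	struct model_stats *stats;	/* readable for the namespace's whole life */
	void *cache;			/* only needed while the server runs */
};

/* Namespace creation: allocate what a stats reader may touch at any time. */
int model_init_net(struct model_net *nn)
{
	nn->cache = NULL;
	nn->stats = calloc(1, sizeof(*nn->stats));
	return nn->stats ? 0 : -1;
}

/* Server start: allocate the structures that only a running server uses. */
int model_startup_net(struct model_net *nn)
{
	nn->cache = malloc(4096);
	return nn->cache ? 0 : -1;
}

/* Server stop: the counters deliberately survive this step. */
void model_shutdown_net(struct model_net *nn)
{
	free(nn->cache);
	nn->cache = NULL;
}

/* Namespace teardown: the server must already be shut down. */
void model_exit_net(struct model_net *nn)
{
	assert(nn->cache == NULL);
	free(nn->stats);
	nn->stats = NULL;
}

/* Stand-in for the stats file's show routine; returns -1 rather than crash. */
int model_stats_show(const struct model_net *nn, char *buf, size_t len)
{
	if (!nn->stats)
		return -1;
	return snprintf(buf, len, "hits: %lu\nmisses: %lu\n",
			nn->stats->hits, nn->stats->misses);
}
```

In this model, `model_stats_show()` succeeds both before `model_startup_net()` and after `model_shutdown_net()`, which is the property the commit restores for the real counters.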

17 Jun, 2023 11 commits

SUNRPC: Address RCU warning in net/sunrpc/svc.c · 00a87e5d

Chuck Lever authored Jun 16, 2023

$ make C=1 W=1 net/sunrpc/svc.o
make[1]: Entering directory 'linux/obj/manet.1015granger.net'
  GEN     Makefile
  CALL    linux/server-development/scripts/checksyscalls.sh
  DESCEND objtool
  INSTALL libsubcmd_headers
  DESCEND bpf/resolve_btfids
  INSTALL libsubcmd_headers
  CC [M]  net/sunrpc/svc.o
  CHECK   linux/server-development/net/sunrpc/svc.c
linux/server-development/net/sunrpc/svc.c:1225:9: warning: incorrect type in argument 1 (different address spaces)
linux/server-development/net/sunrpc/svc.c:1225:9:    expected struct spinlock [usertype] *lock
linux/server-development/net/sunrpc/svc.c:1225:9:    got struct spinlock [noderef] __rcu *
linux/server-development/net/sunrpc/svc.c:1227:40: warning: incorrect type in argument 1 (different address spaces)
linux/server-development/net/sunrpc/svc.c:1227:40:    expected struct spinlock [usertype] *lock
linux/server-development/net/sunrpc/svc.c:1227:40:    got struct spinlock [noderef] __rcu *
make[1]: Leaving directory 'linux/obj/manet.1015granger.net'

Warning introduced by commit 913292c9 ("sched.h: Annotate
sighand_struct with __rcu").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

00a87e5d

SUNRPC: Use sysfs_emit in place of strlcpy/sprintf · a9156d7e

Azeem Shaikh authored Jun 14, 2023

Part of an effort to remove strlcpy() tree-wide [1].

Direct replacement is safe here since the getter in kernel_params_ops
handles -errno return [2].

[1] https://github.com/KSPP/linux/issues/89
[2] https://elixir.bootlin.com/linux/v6.4-rc6/source/include/linux/moduleparam.h#L52
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a9156d7e

SUNRPC: Remove transport class dprintk call sites · 6c53da5d

Chuck Lever authored Jun 12, 2023

Remove a couple of dprintk call sites that are of little value.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6c53da5d

SUNRPC: Fix comments for transport class registration · 02cea33f

Chuck Lever authored Jun 12, 2023

The preceding block comment before svc_register_xprt_class() is
not related to that function.

While we're here, add proper documenting comments for these two
publicly-visible functions.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

02cea33f

svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt() · b55c6333

Chuck Lever authored Jun 12, 2023

Clean up.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b55c6333

svcrdma: trace cc_release calls · a23c76e9

Chuck Lever authored Jun 12, 2023

This event brackets the svcrdma_post_* trace points. If this trace
event is enabled but does not appear as expected, that indicates a
chunk_ctxt leak.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a23c76e9

svcrdma: Convert "might sleep" comment into a code annotation · 91f8ce28

Chuck Lever authored Jun 12, 2023

Try to catch incorrect calling contexts mechanically rather than by
code review.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

91f8ce28

NFSD: Add an nfsd4_encode_nfstime4() helper · 26217679

Chuck Lever authored Jun 12, 2023

Clean up: de-duplicate some common code.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

26217679

SUNRPC: Move initialization of rq_stime · f8335a21

Chuck Lever authored Jun 12, 2023

Micro-optimization: Call ktime_get() only when ->xpo_recvfrom() has
given us a full RPC message to process. rq_stime isn't used
otherwise, so this avoids pointless work.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f8335a21
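
The idea behind f8335a21 can be shown with a small userspace model. It is a sketch under assumptions, not the kernel code: `model_svc_recv`, `model_ktime_get`, and the two receive callbacks are hypothetical names, and the standard `clock()` call stands in for `ktime_get()`. A counter records how often the clock is read so the saving is observable.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Counts clock reads so the saving is observable; not a kernel facility. */
static unsigned long model_clock_reads;

/* Hypothetical stand-in for ktime_get(). */
static clock_t model_ktime_get(void)
{
	model_clock_reads++;
	return clock();
}

struct model_rqst {
	clock_t stime;	/* stand-in for rq_stime */
};

/* Hypothetical receive callback: bytes of a complete message, or 0. */
typedef int (*model_recv_fn)(struct model_rqst *rq);

int model_recv_nothing(struct model_rqst *rq)
{
	(void)rq;
	return 0;
}

int model_recv_message(struct model_rqst *rq)
{
	(void)rq;
	return 128;
}

/* Read the clock only once a full message is available to process. */
int model_svc_recv(struct model_rqst *rq, model_recv_fn recvfrom)
{
	int len;

	assert(recvfrom != NULL);
	len = recvfrom(rq);
	if (len <= 0)
		return 0;	/* nothing to process, so skip the clock read */

	rq->stime = model_ktime_get();
	return len;
}
```

Calling `model_svc_recv()` with `model_recv_nothing` leaves `model_clock_reads` unchanged; only a call that returns a message pays for the timestamp.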

SUNRPC: Optimize page release in svc_rdma_sendto() · 5581cf8e

Chuck Lever authored Jun 12, 2023

Now that we have bulk page allocation and release APIs, it's more
efficient to use those than it is for nfsd threads to wait for send
completions. Previous patches have eliminated the calls to
wait_for_completion() and complete(), in order to avoid scheduler
overhead.

Now release pages-under-I/O in the send completion handler using
the efficient bulk release API.

I've measured a 7% reduction in cumulative CPU utilization in
svc_rdma_sendto(), svc_rdma_wc_send(), and svc_xprt_release(). In
particular, using release_pages() instead of complete() cuts the
time per svc_rdma_wc_send() call by two-thirds. This helps improve
scalability because svc_rdma_wc_send() is single-threaded per
connection.
Reviewed-by: Tom Talpey <tom@talpey.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5581cf8e

svcrdma: Prevent page release when nothing was received · baf6d18b

Chuck Lever authored Jun 12, 2023

I noticed that svc_rqst_release_pages() was still unnecessarily
releasing a page when svc_rdma_recvfrom() returns zero.

Fixes: a53d5cb0 ("svcrdma: Avoid releasing a page in svc_xprt_release()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

baf6d18b

12 Jun, 2023 11 commits

svcrdma: Revert ("svcrdma: Normalize Send page handling") · c4b50cdf

Chuck Lever authored Jun 12, 2023

Get rid of the completion wait in svc_rdma_sendto(), and release
pages in the send completion handler again. A subsequent patch will
handle releasing those pages more efficiently.

Reverted by hand: patch -R would not apply cleanly.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c4b50cdf

SUNRPC: Revert ("svcrdma: Remove unused sc_pages field") · a944209c

Chuck Lever authored Jun 12, 2023

Pre-requisite for releasing pages in the send completion handler.
Reverted by hand: patch -R would not apply cleanly.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a944209c

SUNRPC: Revert ("svcrdma: Retain the page backing rq_res.head[0].iov_base") · 6be7afcd

Chuck Lever authored Jun 12, 2023

Pre-requisite for releasing pages in the send completion handler.
Reverted by hand: patch -R would not apply cleanly.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6be7afcd

NFSD: add encoding of op_recall flag for write delegation · 58f5d894

Dai Ngo authored Jun 06, 2023

Modified nfsd4_encode_open to encode the op_recall flag properly
for OPEN result with write delegation granted.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org

58f5d894

NFSD: Add "official" reviewers for this subsystem · 8111c17c

Chuck Lever authored Jun 05, 2023

At LFS 2023, it was suggested we should publicly document the name and
email of reviewers who new contributors can trust. This also gives them
some recognition for their work as reviewers.
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

8111c17c

mailmap: Add Bruce Fields' latest e-mail addresses · b1c6ffb2

Chuck Lever authored Jun 05, 2023

Ensure that Bruce's old e-mail addresses map to his current one so
he doesn't miss out on all the fun.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b1c6ffb2

svcrdma: Clean up allocation of svc_rdma_rw_ctxt · ac3c32bb

Chuck Lever authored Jun 05, 2023

The physical device's favored NUMA node ID is available when
allocating a rw_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ac3c32bb

svcrdma: Clean up allocation of svc_rdma_send_ctxt · ed51b426

Chuck Lever authored Jun 05, 2023

The physical device's favored NUMA node ID is available when
allocating a send_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ed51b426

svcrdma: Clean up allocation of svc_rdma_recv_ctxt · c5d68d25

Chuck Lever authored Jun 05, 2023

The physical device's favored NUMA node ID is available when
allocating a recv_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.

This clean up eliminates the hack of destroying recv_ctxts that
were not created by the receive CQ thread -- recv_ctxts are now
always allocated on a "good" node.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c5d68d25

svcrdma: Allocate new transports on device's NUMA node · fe2b401e

Chuck Lever authored Jun 05, 2023

The physical device's NUMA node ID is available when allocating an
svc_xprt for an incoming connection. Use that value to ensure the
svc_xprt structure is allocated on the NUMA node closest to the
device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

fe2b401e

lockd: drop inappropriate svc_get() from lockd_get() · 665e89ab

NeilBrown authored Jun 03, 2023

The below-mentioned patch was intended to simplify refcounting on the
svc_serv used by lockd.  The goal was to only ever have a single
reference from the single thread.  To that end we dropped a call to
lockd_start_svc() (except when creating the thread) which would take a
reference, and dropped the svc_put(serv) that would drop that reference.

Unfortunately we didn't also remove the svc_get() from
lockd_create_svc() in the case where the svc_serv already existed.
So after the patch:
 - on the first call the svc_serv was allocated and the one reference
   was given to the thread, so there are no extra references
 - on subsequent calls svc_get() was called so there is now an extra
   reference.
This is clearly not consistent.

The inconsistency is also clear in the current code: lockd_get()
takes *two* references, one on nlmsvc_serv and one by incrementing
nlmsvc_users.   This clearly does not match lockd_put().

So: drop that svc_get() from lockd_get() (which used to be in
lockd_create_svc()).
Reported-by: Ido Schimmel <idosch@idosch.org>
Closes: https://lore.kernel.org/linux-nfs/ZHsI%2FH16VX9kJQX1@shredder/T/#u
Fixes: b73a2972 ("lockd: move lockd_start_svc() call into lockd_create_svc()")
Signed-off-by: NeilBrown <neilb@suse.de>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

665e89ab
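
The imbalance described in 665e89ab can be reproduced with a simplified userspace model. This is a sketch, not the lockd code: `struct model_lockd`, `model_lockd_get`, and `model_lockd_put` are hypothetical names, the `extra_svc_get` flag exists only to switch between the old and new behavior, and the thread's single reference is modeled as being dropped when the last user goes away.

```c
#include <assert.h>

struct model_lockd {
	int serv_refs;	/* references held on the single service object */
	int users;	/* stand-in for nlmsvc_users */
};

/*
 * extra_svc_get != 0 models the old behavior, where a second and later
 * caller took a service reference in addition to bumping the user count.
 */
void model_lockd_get(struct model_lockd *l, int extra_svc_get)
{
	if (l->users > 0) {
		if (extra_svc_get)
			l->serv_refs++;	/* the svc_get() that the commit drops */
		l->users++;
		return;
	}
	/* First caller: create the service; its one reference goes to the thread. */
	l->serv_refs = 1;
	l->users = 1;
}

/* The put side only ever undoes the user count and the thread's reference. */
void model_lockd_put(struct model_lockd *l)
{
	assert(l->users > 0);
	if (--l->users > 0)
		return;
	l->serv_refs--;	/* last user gone: the thread's reference is released */
}
```

With two gets followed by two puts, the model ends at `serv_refs == 0` when `extra_svc_get` is 0, and at `serv_refs == 1` (one leaked reference) when it is 1.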

11 Jun, 2023 6 commits

nfsd: don't provide pre/post-op attrs if fh_getattr fails · 518f375c

Jeff Layton authored May 19, 2023

nfsd calls fh_getattr to get the latest inode attrs for pre/post-op
info. In the event that fh_getattr fails, it resorts to scraping cached
values out of the inode directly.

Since these attributes are optional, we can just skip providing them
altogether when this happens.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Neil Brown <neilb@suse.de>

518f375c

NFSD: Remove nfsd_readv() · df56b384

Chuck Lever authored May 18, 2023

nfsd_readv()'s consumers now use nfsd_iter_read().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

df56b384

NFSD: Hoist rq_vec preparation into nfsd_read() [step two] · 703d7521

Chuck Lever authored May 18, 2023

Now that the preparation of an rq_vec has been removed from the
generic read path, nfsd_splice_read() no longer needs to reset
rq_next_page.

nfsd4_encode_read() calls nfsd_splice_read() directly. As far as I
can ascertain, resetting rq_next_page for NFSv4 splice reads is
unnecessary because rq_next_page is already set correctly.

Moreover, resetting it might even be incorrect if previous
operations in the COMPOUND have already consumed at least a page of
the send buffer. I would expect that the result would be encoding
the READ payload over previously-encoded results.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

703d7521

NFSD: Hoist rq_vec preparation into nfsd_read() · 507df40e

Chuck Lever authored May 18, 2023

Accrue the following benefits:

a) Deduplicate this common bit of code.

b) Don't prepare rq_vec for NFSv2 and NFSv3 spliced reads, which
   don't use rq_vec. This is already the case for
   nfsd4_encode_read().

c) Eventually, converting NFSD's read path to use a bvec iterator
   will be simpler.

In the next patch, nfsd_iter_read() will replace nfsd_readv() for
all NFS versions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

507df40e

NFSD: Update rq_next_page between COMPOUND operations · ed4a567a

Chuck Lever authored May 18, 2023

A GETATTR with a large result can advance xdr->page_ptr without
updating rq_next_page. If a splice READ follows that GETATTR in the
COMPOUND, nfsd_splice_actor can start splicing at the wrong page.

I've also seen READLINK and READDIR leave rq_next_page in an
unmodified state.

There are potentially a myriad of combinations like this, so play it
safe: move the rq_next_page update to nfsd4_encode_operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ed4a567a
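
The hazard addressed by ed4a567a can be modeled with page indices in userspace C. This is a sketch under assumptions: `struct model_reply` and the `model_*` helpers are hypothetical, the page size is fixed at 4096 bytes, and the rule "next page is one past the encoder's current page" is the model's reading of the commit message rather than a quotation of the kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

struct model_reply {
	size_t page_ptr;	/* index of the page the encoder is writing into */
	size_t offset;		/* bytes already used in that page */
	size_t next_page;	/* first page a splice read may claim */
};

/* Encode len bytes of a result, spilling into later pages as needed. */
void model_encode(struct model_reply *r, size_t len)
{
	size_t total = r->offset + len;

	r->page_ptr += total / MODEL_PAGE_SIZE;
	r->offset = total % MODEL_PAGE_SIZE;
}

/* Run after every operation: keep next_page one past the encoder's page. */
void model_sync_next_page(struct model_reply *r)
{
	r->next_page = r->page_ptr + 1;
}

/* A splice read is safe only if it starts beyond the encoder's page. */
int model_splice_is_safe(const struct model_reply *r)
{
	return r->next_page > r->page_ptr;
}

/* Returns the page index where a splice read would begin. */
size_t model_splice_start(const struct model_reply *r)
{
	assert(model_splice_is_safe(r));
	return r->next_page;
}
```

Starting from page 0 with `next_page` set to 1, a 6000-byte result moves the encoder onto page 1, so a splice read that still began at page 1 would overwrite encoded data. Calling `model_sync_next_page()` after each operation moves the splice start to page 2.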

NFSD: Use svcxdr_encode_opaque_pages() in nfsd4_encode_splice_read() · ba21e20b

Chuck Lever authored May 18, 2023

Commit 15b23ef5 ("nfsd4: fix corruption of NFSv4 read data")
encountered exactly the same issue: after a splice read, a
filesystem-owned page is left in rq_pages[]; the symptoms are the
same as described there.

If the computed number of pages in nfsd4_encode_splice_read() is not
exactly the same as the actual number of pages that were consumed by
nfsd_splice_actor() (say, because of a bug) then hilarity ensues.

Instead of recomputing the page offset based on the size of the
payload, use rq_next_page, which is already properly updated by
nfsd_splice_actor(), to cause svc_rqst_release_pages() to operate
correctly in every instance.

This is a defensive change since we believe that after commit
27c934dd ("nfsd: don't replace page in rq_pages if it's a
continuation of last page") has been applied, there are no known
opportunities for nfsd_splice_actor() to screw up. So I'm not
marking it for stable backport.
Reported-by: Andy Zlotek <andy.zlotek@oracle.com>
Suggested-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ba21e20b

05 Jun, 2023 10 commits

NFSD: Ensure that xdr_write_pages updates rq_next_page · 82078b98

Chuck Lever authored May 18, 2023

All other NFSv[23] procedures manage to keep page_ptr and
rq_next_page in lock step.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

82078b98

NFSD: Replace encode_cinfo() · 66a21db7

Chuck Lever authored May 16, 2023

De-duplicate "reserve_space; encode_cinfo".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

66a21db7

NFSD: Add encoders for NFSv4 clientids and verifiers · adaa7a50

Chuck Lever authored May 16, 2023

Deduplicate some common code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

adaa7a50

SUNRPC: Use __alloc_bulk_pages() in svc_init_buffer() · 88e4d41a

Chuck Lever authored May 15, 2023

Clean up: Use the bulk page allocator when filling a server thread's
buffer page array.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

88e4d41a

SUNRPC: Resupply rq_pages from node-local memory · 5f7fc5d6

Chuck Lever authored May 15, 2023

svc_init_buffer() is careful to allocate the initial set of server
thread buffer pages from memory on the local NUMA node.
svc_alloc_arg() should also be that careful.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5f7fc5d6

NFSD: trace nfsctl operations · 39d432fc

Chuck Lever authored May 15, 2023

Add trace log eye-catchers that record the arguments used to
configure NFSD. This helps when troubleshooting the NFSD
administrative interfaces.

These tracepoints can capture NFSD start-up and shutdown times and
parameters, changes in lease time and thread count, and a request
to end the namespace's NFSv4 grace period, in addition to the set
of NFS versions that are enabled.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

39d432fc

NFSD: Clean up nfsctl_transaction_write() · 3434d7aa

Chuck Lever authored May 15, 2023

For easier readability, follow the common convention:

    if (error)
        handle_error;
    continue_normally;

No behavior change is expected.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

3434d7aa

NFSD: Clean up nfsctl white-space damage · 442a6290

Chuck Lever authored May 15, 2023

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

442a6290

SUNRPC: Trace struct svc_sock lifetime events · c42bebca

Chuck Lever authored May 15, 2023

Capture a timestamp and pointer address during the creation and
destruction of struct svc_sock to record its lifetime. This helps
to diagnose transport reference counting issues.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c42bebca

SUNRPC: Improve observability in svc_tcp_accept() · d7900dae

Chuck Lever authored May 15, 2023

The -ENOMEM arm could fire repeatedly if the system runs low on
memory, so remove it.

Don't bother to trace -EAGAIN error events, since those fire after
a listener is created (with no work done) and once again after an
accept has been handled successfully (again, with no work done).
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

d7900dae