Commits · 8db55a032ac7ac1ed7b98d6b1dc980e6378c652f · Kirill Smelkov / linux

13 Mar, 2022 13 commits

SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC · 8db55a03

NeilBrown authored Mar 07, 2022

rpc tasks can be marked as RPC_TASK_SWAPPER.  This causes GFP_MEMALLOC
to be used for some allocations.  This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.

Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.

GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.

xdr_alloc_bvec is called while the XPRT_LOCK is held.  If this blocks,
then it blocks all other queued tasks.  So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.

Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.

So with this patch:
 1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
 2/ __rpc_execute() sets PF_MEMALLOC while handling any task
    with RPC_TASK_SWAPPER set, or when handling any task that
    holds the XPRT_LOCKED lock on an xprt used for swap.
    This removes the need for the RPC_IS_SWAPPER() test
    in ->buf_alloc handlers.
 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
    any task to a swapper xprt.  __rpc_execute() will clear it.
 4/ PF_MEMALLOC is set for all the connect workers.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
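
As a rough userspace model of points 1/ and 2/ above (not the kernel code itself), the sketch below sets a PF_MEMALLOC-style bit only while a qualifying task runs and then restores the previous state. The names `worker_flags`, `needs_memalloc()` and `execute_step()` are hypothetical, and the flag values are made up for illustration.

```c
#include <stdbool.h>

/* Illustrative values only; these are not the kernel's definitions. */
#define PF_MEMALLOC      0x1u   /* worker may dip into memory reserves */
#define RPC_TASK_SWAPPER 0x2u   /* task is a swap write */

struct xprt { bool swapper; };  /* is the transport used for swap? */

struct task {
    unsigned int flags;         /* RPC_TASK_* bits */
    struct xprt *xprt;
    bool holds_xprt_lock;       /* task currently owns the transport lock */
};

/* Stand-in for the worker thread's current->flags. */
unsigned int worker_flags;

/* True when the task must not stall waiting for memory reclaim. */
bool needs_memalloc(const struct task *t)
{
    if (t->flags & RPC_TASK_SWAPPER)
        return true;
    return t->xprt && t->xprt->swapper && t->holds_xprt_lock;
}

/* Run one step of a task; returns whether it ran with PF_MEMALLOC set. */
bool execute_step(const struct task *t)
{
    unsigned int saved = worker_flags;
    bool ran_with_memalloc;

    if (needs_memalloc(t))
        worker_flags |= PF_MEMALLOC;

    /* The task's real work would happen here. */
    ran_with_memalloc = (worker_flags & PF_MEMALLOC) != 0;

    /* Put the bit back to whatever it was before this task ran. */
    worker_flags = (worker_flags & ~PF_MEMALLOC) | (saved & PF_MEMALLOC);
    return ran_with_memalloc;
}
```

In this model a read that holds the lock on a swap transport runs with the bit set, while the same read without the lock does not, which mirrors the distinction the message draws.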

NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS · 89c2be8a

NeilBrown authored Mar 07, 2022

NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests.  This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: remove scheduling boost for "SWAPPER" tasks. · a80a8461

NeilBrown authored Mar 07, 2022

Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.

This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks.  So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).

This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it.  If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
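
The LIFO effect described above can be seen in a small userspace model. This is only a sketch: `struct queue`, `enqueue_head()` and `enqueue_tail()` are hypothetical stand-ins, not the rpc_queue implementation.

```c
#include <stddef.h>

#define QUEUE_MAX 8

struct queue {
    int ids[QUEUE_MAX];   /* task ids in service order */
    size_t len;
};

/* Boosted insertion: the task goes to the front of the queue. */
void enqueue_head(struct queue *q, int id)
{
    if (q->len == QUEUE_MAX)
        return;
    for (size_t i = q->len; i > 0; i--)
        q->ids[i] = q->ids[i - 1];
    q->ids[0] = id;
    q->len++;
}

/* Ordinary insertion: the task goes to the back of the queue. */
void enqueue_tail(struct queue *q, int id)
{
    if (q->len == QUEUE_MAX)
        return;
    q->ids[q->len++] = id;
}
```

If every task gets the boost, tasks 1, 2, 3 are served as 3, 2, 1, so the "priority" only reverses the order.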

SUNRPC/xprt: async tasks mustn't block waiting for memory · a7210354

NeilBrown authored Mar 07, 2022

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
workqueue threads and NFS can deadlock.  So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block.  However it sets the error
to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
causes immediate retry.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
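
The difference between the two error codes can be summarised with a small dispatch function. This is a simplified sketch of the behaviour the message attributes to call_reserveresult(); the enum and function name are hypothetical.

```c
#include <errno.h>

enum next_step {
    STEP_PROCEED,           /* slot obtained, carry on */
    STEP_RETRY_NOW,         /* try again immediately */
    STEP_SLEEP_THEN_RETRY,  /* back off before trying again */
    STEP_FAIL               /* anything else in this sketch */
};

/* Decide what a task does with the status from slot reservation. */
enum next_step reserve_result(int status)
{
    switch (status) {
    case 0:
        return STEP_PROCEED;
    case -EAGAIN:
        return STEP_RETRY_NOW;          /* no sleep: can spin under memory pressure */
    case -ENOMEM:
        return STEP_SLEEP_THEN_RETRY;   /* the only status that causes a sleep */
    default:
        return STEP_FAIL;
    }
}
```

Under this model, a non-blocking allocator that wants the caller to back off would need to report -ENOMEM rather than -EAGAIN.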

SUNRPC/auth: async tasks mustn't block waiting for memory · a41b05ed

NeilBrown authored Mar 07, 2022

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks.  So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory.  If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again.  HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
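
For a sense of scale, HZ is the number of jiffies per second, so HZ>>4 is one sixteenth of a second rounded down to whole jiffies. The helpers below are hypothetical and only do the arithmetic for a given tick rate.

```c
/* Number of jiffies in the HZ >> 4 back-off for a given tick rate. */
unsigned int retry_delay_jiffies(unsigned int hz)
{
    return hz >> 4;
}

/* The same delay expressed in whole milliseconds. */
unsigned int retry_delay_ms(unsigned int hz)
{
    return retry_delay_jiffies(hz) * 1000u / hz;
}
```

With HZ=100 this gives 6 jiffies (60 ms), with HZ=250 it gives 15 jiffies (60 ms), and with HZ=1000 it gives 62 jiffies (62 ms).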

SUNRPC/call_alloc: async tasks mustn't block waiting for memory · c487216b

NeilBrown authored Mar 07, 2022

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
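
The rule can be illustrated with a toy fixed-size pool. This is a userspace sketch under stated assumptions, not rpc_malloc(): `struct pool`, `pool_try_take()`, `pool_put()` and `task_alloc()` are hypothetical, and "blocking" is only reported through a flag rather than actually performed.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 2

struct pool {
    char bufs[POOL_SIZE][64];
    bool used[POOL_SIZE];
};

struct task { bool async; };

/* Non-blocking take: returns NULL when the pool is empty. */
void *pool_try_take(struct pool *p)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            return p->bufs[i];
        }
    }
    return NULL;
}

/* Return a buffer to the pool. */
void pool_put(struct pool *p, void *buf)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        if (buf == (void *)p->bufs[i])
            p->used[i] = false;
}

/*
 * Allocate for a task. An async task never waits: on failure it gets NULL
 * and is expected to retry later. Only a sync task is allowed to wait,
 * which this sketch signals by setting *would_block.
 */
void *task_alloc(struct pool *p, const struct task *t, bool *would_block)
{
    void *buf = pool_try_take(p);

    *would_block = (buf == NULL && !t->async);
    return buf;
}
```

The point of the check is that the worker running async tasks stays free to run the completion that eventually puts a buffer back.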

NFS: remove IS_SWAPFILE hack · 944d95f7

NeilBrown authored Mar 07, 2022

This code is pointless as IS_SWAPFILE is always defined.
So remove it.
Suggested-by: Mark Hemment <markhemm@googlemail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Remove remaining dfprintks related to fscache and remove NFSDBG_FSCACHE · b5fdf66f

Dave Wysochanski authored Mar 01, 2022

The fscache cookie APIs including fscache_acquire_cookie() and
fscache_relinquish_cookie() now have very good tracing.  Thus,
there is no real need for dfprintks in the NFS fscache interface.

The NFS fscache interface has removed all dfprintks so remove the
NFSDBG_FSCACHE defines.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Replace dfprintks with tracepoints in fscache read and write page functions · e3f0a7fe

Dave Wysochanski authored Mar 01, 2022

Most of fscache and other NFS IO paths are now using tracepoints.
Remove the dfprintks in the NFS fscache read/write page functions
and replace them with tracepoints at the beginning and end of the functions.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Rename fscache read and write pages functions · fc1c5abf

Dave Wysochanski authored Mar 01, 2022

Rename NFS fscache functions in a more consistent fashion
to better reflect when we read from and write to fscache.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Cleanup usage of nfs_inode in fscache interface · 45f3a70b

Dave Wysochanski authored Mar 01, 2022

A number of places in the fscache interface used nfs_inode when inode could
be used, simplifying the code.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFSv4.1 restrict GETATTR fs_location query to the main transport · b4be2c59

Olga Kornievskaia authored Feb 15, 2022

In the presence of trunking transports, it's helpful to make sure
that during a migration event, the GETATTR for the fs_location attribute
happens on the main transport.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: remove unneeded check in decode_devicenotify_args() · cb8fac6d

Alexey Khoroshilov authored Feb 15, 2022

The overflow check is not needed anymore now that we have switched to kmalloc_array().
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Fixes: a4f743a6 ("NFSv4.1: Convert open-coded array allocation calls to kmalloc_array()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
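
The reason the separate check became redundant is that kmalloc_array() itself refuses a count and element size whose product would overflow. A userspace analogue, with the hypothetical name `alloc_array()` and malloc() standing in for the kernel allocator, might look like this:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n elements of the given size, or NULL if n * size overflows. */
void *alloc_array(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;            /* the multiplication would wrap around */
    return malloc(n * size);
}
```

A caller that already handles a NULL return therefore gets the overflow case covered without an extra test of its own.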

02 Mar, 2022 20 commits

NFS: Cache all entries in the readdirplus reply · 612896ec

Trond Myklebust authored Feb 24, 2022

Even if we're not able to cache all the entries in the readdir buffer,
let's ensure that we do prime the dcache.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Optimise away the previous cookie field · 0adf85b4

Trond Myklebust authored Feb 27, 2022

Replace the 'previous cookie' field in struct nfs_entry with the
array->last_cookie.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Fix up forced readdirplus · b0365ccb

Trond Myklebust authored Feb 23, 2022

Avoid clearing the entire readdir page cache if we're just doing forced
readdirplus for the 'ls -l' heuristic.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Convert readdir page cache to use a cookie based index · f648022f

Trond Myklebust authored Feb 23, 2022

Instead of using a linear index to address the pages, use the cookie of
the first entry, since that is what we use to match the page anyway.

This allows us to avoid re-reading the entire cache on a seekdir() type
of operation. The latter is very common when re-exporting NFS, and is a
major performance drain.

The change does affect our duplicate cookie detection, since we can no
longer rely on the page index as a linear offset for detecting whether
we looped backwards. However since we no longer do a linear search
through all the pages on each call to nfs_readdir(), this is less of a
concern than it was previously.
The other downside is that invalidate_mapping_pages() no longer can use
the page index to avoid clearing pages that have been read. A subsequent
patch will restore the functionality this provides to the 'ls -l'
heuristic.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
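
To illustrate the idea of addressing cached pages by cookie rather than by position, here is a minimal userspace sketch. Everything in it is an assumption made for illustration: the slot count, the hash function, and the names `struct dir_page`, `cookie_to_index()`, `cache_store()` and `cache_lookup()` are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 64   /* hypothetical number of cache slots */

struct dir_page {
    bool valid;
    uint64_t first_cookie;   /* cookie of the first entry in the page */
};

/* Hypothetical hash: multiply, then keep the top 6 bits (0..63). */
size_t cookie_to_index(uint64_t cookie)
{
    return (size_t)((cookie * 0x9E3779B97F4A7C15ull) >> 58);
}

/* Record a page whose first entry has the given cookie. */
void cache_store(struct dir_page *cache, uint64_t first_cookie)
{
    struct dir_page *pg = &cache[cookie_to_index(first_cookie)];

    pg->valid = true;
    pg->first_cookie = first_cookie;
}

/* Jump straight to the page for a cookie; no walk from the first page. */
struct dir_page *cache_lookup(struct dir_page *cache, uint64_t cookie)
{
    struct dir_page *pg = &cache[cookie_to_index(cookie)];

    return (pg->valid && pg->first_cookie == cookie) ? pg : NULL;
}
```

A seekdir() to an arbitrary cookie then costs one lookup. The trade-off the message mentions also shows up here: the slot index no longer says anything about where the page sits in the directory.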

NFS: Clean up page array initialisation/free · 9332cf14

Trond Myklebust authored Feb 26, 2022

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Trace effects of the readdirplus heuristic · 11d03d0a

Trond Myklebust authored Feb 19, 2022

Enable tracking of when the readdirplus heuristic causes a page cache
invalidation.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Trace effects of readdirplus on the dcache · eace45a1

Trond Myklebust authored Feb 19, 2022

Trace the effects of readdirplus on attribute and dentry revalidation.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Add basic readdir tracing · 310e3187

Trond Myklebust authored Feb 19, 2022

Add tracing to track how often the client goes to the server for updated
readdir information.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Don't request readdirplus when revalidation was forced · 0b3cc71b

Trond Myklebust authored Feb 19, 2022

If the revalidation was forced, due to the presence of a LOOKUP_EXCL or
a LOOKUP_REVAL flag, then readdirplus won't help. It also can't help
when we're doing a path component lookup.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Readdirplus can't help lookup for case insensitive filesystems · 2c2c3365

Trond Myklebust authored Feb 19, 2022

If the filesystem is case insensitive, then readdirplus can't help with
cache misses, since it won't return case folded variants of the filename.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFSv4: Ask for a full XDR buffer of readdir goodness · c49c6894

Trond Myklebust authored Feb 18, 2022

Instead of pretending that we know the ratio of directory info vs
readdirplus attribute info, just set the 'dircount' field to the same
value as the 'maxcount' field.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Don't ask for readdirplus unless it can help nfs_getattr() · ad1e109a

Trond Myklebust authored Feb 17, 2022

If attribute caching is turned off, then use of readdirplus is not going
to help stat() performance.
Readdirplus also doesn't help if a file is being written to, since we
will have to flush those writes in order to sync the mtime/ctime.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Improve heuristic for readdirplus · 230bc98f

Trond Myklebust authored Feb 17, 2022

The heuristic for readdirplus is designed to try to detect 'ls -l' and
similar patterns. It does so by looking for cache hit/miss patterns in
both the attribute cache and in the dcache of the files in a given
directory, and then sets a flag for the readdirplus code to interpret.

The problem with this approach is that a single attribute or dcache miss
can cause the NFS code to force a refresh of the attributes for the
entire set of files contained in the directory.

To be able to make a more nuanced decision, let's sample the number of
hits and misses in the set of open directory descriptors. That allows us
to set thresholds at which we start preferring READDIRPLUS over regular
READDIR, or at which we start to force a re-read of the remaining
readdir cache using READDIRPLUS.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
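
One plausible shape for such a threshold-based decision is sketched below. The message does not give the threshold values or the exact rule, so the constants, the policy and the names `struct dir_stats` and `choose_mode()` are all hypothetical.

```c
enum readdir_mode {
    USE_READDIR,               /* plain READDIR is good enough */
    PREFER_READDIRPLUS,        /* use READDIRPLUS for new requests */
    FORCE_READDIRPLUS_REREAD   /* re-read the remaining cache with READDIRPLUS */
};

/* Hypothetical thresholds chosen for the example. */
#define USAGE_THRESHOLD 8u
#define MISS_THRESHOLD  16u

/* Counters sampled from the open directory descriptors. */
struct dir_stats {
    unsigned int hits;
    unsigned int misses;
};

/* Pick a mode from the sampled counters instead of from a single miss. */
enum readdir_mode choose_mode(const struct dir_stats *s)
{
    if (s->misses >= MISS_THRESHOLD)
        return FORCE_READDIRPLUS_REREAD;
    if (s->hits + s->misses >= USAGE_THRESHOLD)
        return PREFER_READDIRPLUS;
    return USE_READDIR;
}
```

The contrast with the flag-based approach is that one isolated miss leaves the mode at plain READDIR here, instead of triggering a refresh for every file in the directory.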

NFS: Reduce use of uncached readdir · 9c3f4d98

Trond Myklebust authored Feb 17, 2022

When reading a very large directory, we want to try to keep the page
cache up to date if doing so is inexpensive. With the change to allow
readdir to continue reading even when the cache is incomplete, we no
longer need to fall back to uncached readdir in order to scale to large
directories.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Simplify nfs_readdir_xdr_to_array() · 9ff89c25

Trond Myklebust authored Feb 07, 2022

Recent changes to readdir mean that we can cope with partially filled
page cache entries, so we no longer need to rely on looping in
nfs_readdir_xdr_to_array().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: If the cookie verifier changes, we must invalidate the page cache · 6c34f05b

Trond Myklebust authored Feb 22, 2022

Ensure that if the cookie verifier changes when we use the zero-valued
cookie, then we invalidate any cached pages.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Adjust the amount of readahead performed by NFS readdir · 580f2367

Trond Myklebust authored Feb 07, 2022

The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.

This patch therefore tries to tone down the amount of readahead we
perform, and adjust it to try to match the amount of data being
requested by user space.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Don't advance the page pointer unless the page is full · c8f0523b

Trond Myklebust authored Feb 26, 2022

When we hit the end of the data in the readdir page, we don't want to
start filling a new page, unless this one is full.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Don't re-read the entire page cache to find the next cookie · 728dd0ab

Trond Myklebust authored Feb 22, 2022

If the page cache entry that was last read gets invalidated for some
reason, then make sure we can re-create it on the next call to readdir.
This, combined with the cache page validation, allows us to reuse the
cached value of page-index on successive calls to nfs_readdir.

Credit is due to Benjamin Coddington for showing that the concept works,
and that it allows for improved cache sharing between processes even in
the case where pages are lost due to LRU or active invalidation.
Suggested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Store the change attribute in the directory page cache · d09e673f

Trond Myklebust authored Feb 22, 2022

Use the change attribute and the first cookie in a directory page cache
entry to validate that the page is up to date.
Suggested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
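
The validation described here amounts to a two-field comparison. The sketch below is a simplified userspace model; `struct dir_page_hdr` and `page_is_current()` are hypothetical names, and the real page layout holds more than these two fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Metadata stored alongside a cached directory page. */
struct dir_page_hdr {
    uint64_t change_attr;    /* directory change attribute when the page was filled */
    uint64_t first_cookie;   /* cookie of the first entry in the page */
};

/* A page is usable only if the directory is unchanged and it starts where we expect. */
bool page_is_current(const struct dir_page_hdr *pg,
                     uint64_t dir_change_attr, uint64_t wanted_cookie)
{
    return pg->change_attr == dir_change_attr &&
           pg->first_cookie == wanted_cookie;
}
```

A page that fails either comparison is treated as stale and would need to be refilled from the server.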

28 Feb, 2022 7 commits

NFS: Calculate page offsets algorithmically · 0b2662b7

Trond Myklebust authored Feb 22, 2022

Instead of relying on counting the page offsets as we walk through the
page cache, switch to calculating them algorithmically.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context · 281f31b2

Trond Myklebust authored Feb 22, 2022

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Initialise the readdir verifier as best we can in nfs_opendir() · d1e32ea3

Trond Myklebust authored Feb 25, 2022

For the purpose of ensuring that opendir() followed by seekdir() work as
correctly as possible, try to initialise the readdir verifier in
nfs_opendir().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Trace lookup revalidation failure · 2eef8a31

Trond Myklebust authored Feb 19, 2022

Enable tracing of lookup revalidation failures.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: constify nfs_server_capable() and nfs_have_writebacks() · 1a93b82c

Trond Myklebust authored Feb 18, 2022

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Return valid errors from nfs2/3_decode_dirent() · 64cfca85

Trond Myklebust authored Feb 24, 2022

Valid return values for decode_dirent() callback functions are:
 0: Success
 -EBADCOOKIE: End of directory
 -EAGAIN: End of xdr_stream

All errors need to map into one of those three values.

Fixes: 573c4e1e ("NFS: Simplify ->decode_dirent() calling sequence")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
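
The constraint can be expressed as a small normalising function. This is a sketch only: `normalise_dirent_error()` is a hypothetical helper, the choice of -EAGAIN for unexpected errors is an assumption, and EBADCOOKIE is given a stand-in value because it is a kernel-internal errno that userspace headers do not define.

```c
#include <errno.h>

/* Stand-in for the kernel-internal errno, for userspace builds only. */
#ifndef EBADCOOKIE
#define EBADCOOKIE 523
#endif

/* Map any decode result onto the three values the caller understands. */
int normalise_dirent_error(int err)
{
    switch (err) {
    case 0:             /* success */
    case -EBADCOOKIE:   /* end of directory */
    case -EAGAIN:       /* end of xdr_stream */
        return err;
    default:
        return -EAGAIN; /* assumption: treat anything else as "out of data" */
    }
}
```

With this in place a stray -EIO or -ENOMEM from a lower layer can no longer reach a caller that only distinguishes the three documented outcomes.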

Revert "NFSv4: use unique client identifiers in network namespaces" · b38e09b9

Trond Myklebust authored Feb 28, 2022

This reverts commit 50c790a0.

The functionality is believed to be capable of causing regressions in
existing setups, so the author has requested that it be reverted.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
