- 14 Dec, 2020 2 commits
-
-
Frank van der Linden authored
XDRBUF_SPARSE_PAGES can cause problems for the RDMA transport, and it's easy enough to allocate enough pages for the request up front, so do that. Also, since we've allocated the pages anyway, use the full page aligned length for the receive buffer. This will allow caching of valid replies that are too large for the caller, but that still fit in the allocated pages. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
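A minimal sketch of the idea, in kernel-style C, with an assumed helper name (the real change lives in the NFS client's request setup paths):

    /*
     * Hypothetical helper illustrating the approach above: allocate every
     * receive page up front instead of marking the buffer
     * XDRBUF_SPARSE_PAGES, and report the page-aligned length so the
     * caller can use the full allocation as its receive buffer.
     */
    static ssize_t nfs_alloc_recv_pages(struct page **pages, size_t count)
    {
            size_t i, npages = DIV_ROUND_UP(count, PAGE_SIZE);

            for (i = 0; i < npages; i++) {
                    pages[i] = alloc_page(GFP_KERNEL);
                    if (!pages[i])
                            return -ENOMEM;
            }
            return npages << PAGE_SHIFT;    /* page-aligned receive length */
    }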
-
Dan Aloni authored
When receiving pages data, the return value 'ret', when positive, includes `buf->page_base`, so we should subtract that before it is used for changing `offset` and comparing against `want`. This was discovered in the very rare cases where the server returned a chunk of bytes that, when added to the amount of bytes already received for the pages, happened to match the current `recv.len`, for example in this case: buf->page_base : 258356 actually received from socket: 1740 ret : 260096 want : 260096 In this case neither of the two 'if ... goto out' checks triggers, and we continue to tail parsing. It is worth mentioning that the ensuing EMSGSIZE from the continued execution of `xs_read_xdr_buf` may be observed by an application, due to 4 superfluous bytes being added to the pages data. Fixes: 277e4ab7 ("SUNRPC: Simplify TCP receive code by switching to using iterators") Signed-off-by: Dan Aloni <dan@kernelim.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
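A small stand-alone illustration of the accounting problem, using the numbers from the report above (plain C, purely to show the arithmetic; variable names mirror the description rather than the in-tree code):

    #include <stdio.h>

    int main(void)
    {
            unsigned int page_base = 258356; /* buf->page_base            */
            unsigned int received  = 1740;   /* actually read from socket */
            unsigned int ret       = page_base + received;  /* 260096     */
            unsigned int want      = 260096;

            /* Comparing the raw 'ret' against 'want' spuriously matches
             * here; only after subtracting page_base does the value
             * reflect the bytes actually consumed from the pages. */
            printf("raw ret == want?  %s\n", ret == want ? "yes (bug)" : "no");
            printf("ret - page_base = %u\n", ret - page_base);
            return 0;
    }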
-
- 10 Dec, 2020 1 commit
-
-
Trond Myklebust authored
Ensure that both getxattr and listxattr page array are correctly aligned, and that getxattr correctly accounts for the page padding word. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
- 02 Dec, 2020 37 commits
-
-
NeilBrown authored
nfsiod is currently a concurrency-managed workqueue (CMWQ). This means that work items scheduled to nfsiod on a given CPU are queued behind all other work items queued on any CMWQ on the same CPU. This can introduce unexpected latency. Occasionally the latency can even be excessive. If the work item to complete a CLOSE request calls the final iput() on an inode, the address_space of that inode will be dismantled. This takes time proportional to the number of in-memory pages, which on a large host working on large files (e.g. 5TB) can be a large number of pages, resulting in a noticeable number of seconds. We can avoid these latency problems by switching nfsiod to WQ_UNBOUND. This causes each concurrent work item to get a dedicated thread which can be scheduled to an idle CPU. There is precedent for this, as several other filesystems use WQ_UNBOUND workqueues for handling various async events. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: ada609ee ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
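A minimal sketch of the kind of change described, assuming nfsiod is created with alloc_workqueue() and already carries WQ_MEM_RECLAIM (names follow the NFS client but this is not a verbatim copy of the patch):

    static struct workqueue_struct *nfsiod_workqueue;   /* assumed global */

    static int nfsiod_start_sketch(void)
    {
            struct workqueue_struct *wq;

            /* WQ_UNBOUND: each concurrent work item gets its own worker
             * thread, schedulable on any idle CPU, instead of queueing
             * behind other CMWQ work on the submitting CPU. */
            wq = alloc_workqueue("nfsiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
            if (wq == NULL)
                    return -ENOMEM;
            nfsiod_workqueue = wq;
            return 0;
    }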
-
Calum Mackay authored
NLM uses an interval-based rebinding, i.e. it clears the transport's binding under certain conditions if more than 60 seconds have elapsed since the connection was last bound. This rebinding is not necessary for an autobind RPC client over a connection-oriented protocol like TCP. It can also cause problems: it is possible for nlm_bind_host() to clear XPRT_BOUND whilst a connection worker is in the middle of trying to reconnect, after it had already been checked in xprt_connect(). When the connection worker notices that XPRT_BOUND has been cleared under it, in xs_tcp_finish_connecting(), that results in: xs_tcp_setup_socket: connect returned unhandled error -107 Worse, it's possible that the two can get into lockstep, resulting in the same behaviour repeated indefinitely, with the above error every 300 seconds, without ever recovering, and the connection never being established. This has been seen in practice, with a large number of NLM client tasks, following a server restart. The existing callers of nlm_bind_host & nlm_rebind_host should not need to force the rebind, for TCP, so restrict the interval-based rebinding to UDP only. For TCP, we will still rebind when needed, e.g. on timeout, and connection error (including closure), since connection-related errors on an existing connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(), already unconditionally clear XPRT_BOUND. To avoid having to add the fix, and explanation, to both nlm_bind_host() and nlm_rebind_host(), remove the duplicate code from the former, and have it call the latter. Drop the dprintk, which adds no value over a trace. Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Fixes: 35f5a422 ("SUNRPC: new interface to force an RPC rebind") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
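A sketch of the resulting rebind logic, with field and constant names assumed to match the lockd host structure; it is illustrative, not a verbatim copy of the patch:

    /* Interval-based rebinding only makes sense for connectionless (UDP)
     * transports; TCP rebinds via the timeout and connection-error paths. */
    static void nlm_rebind_host_sketch(struct nlm_host *host)
    {
            if (host->h_proto != IPPROTO_UDP)
                    return;
            if (host->h_rpcclnt && time_after_eq(jiffies, host->h_nextrebind)) {
                    rpc_force_rebind(host->h_rpcclnt);
                    host->h_nextrebind = jiffies + NLM_HOST_REBIND;
            }
    }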
-
Fedor Tokarev authored
'snprintf' returns the number of characters which would have been written if enough space had been available, excluding the terminating null byte. Thus, a return value of 'sizeof(buf)' means that the last character has been dropped. Signed-off-by: Fedor Tokarev <ftokarev@gmail.com> Fixes: 2f34b8bf ("SUNRPC: add links for all client xprts to debugfs") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
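An illustrative user-space example of the correct truncation check (the overflow test must treat a return value of sizeof(buf) or more as truncation):

    #include <stdio.h>

    int main(void)
    {
            char buf[8];
            int len = snprintf(buf, sizeof(buf), "abcdefgh"); /* 8 chars + NUL */

            /* snprintf() returns the length it *wanted* to write, so a
             * value >= sizeof(buf) means the output was cut short. */
            if (len >= (int)sizeof(buf))
                    printf("truncated: wanted %d bytes, buffer holds %zu\n",
                           len, sizeof(buf) - 1);
            return 0;
    }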
-
Sargun Dhillon authored
In several patches work has been done to enable NFSv4 to use user namespaces: 58002399: NFSv4: Convert the NFS client idmapper to use the container user namespace 3b7eb5e3: NFS: When mounting, don't share filesystems between different user namespaces Unfortunately, the userspace APIs were only such that the userspace-facing side of the filesystem (superblock s_user_ns) could be set to a non-init user namespace. This furthers the fs_context-related refactoring, and piggybacks on top of that logic, so that the superblock user namespace and the NFS user namespace are the same. Users can still use rpc.idmapd if they choose to, but there are complexities with user namespaces and request-key that have yet to be addressed. Eventually, we will need to at least: * Come up with an upcall mechanism that can be triggered inside of the container, or safely triggered outside, with the requisite context to do the right mapping. * Handle whatever refactoring needs to be done in net/sunrpc. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Tested-by: Alban Crequy <alban.crequy@gmail.com> Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Sargun Dhillon authored
Refactoring to use the fs_context for mounting was done in: 62a55d08: NFS: Additional refactoring for fs_context conversion This made it so that the net_ns is fetched from the fs_context (the netns that fsopen is called in). This change also makes it so that the credential fetched during fsopen is used as well as the net_ns. NFS has already had a number of changes to prepare it for user namespaces: 1a58e8a0: NFS: Store the credential of the mount process in the nfs_server 264d948c: NFS: Convert NFSv3 to use the container user namespace c207db2f: NFS: Convert NFSv2 to use the container user namespace Previously, different credentials could be used for creation of the fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did the actual credential check, and that's where current_creds() were fetched. This meant that the user namespace which fsopen was called in could be a non-init user namespace. This still requires that the user that calls FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns. This roughly allows a privileged user to mount on behalf of an unprivileged user namespace, by forking off and calling fsopen in the unprivileged user namespace. It can then pass back that fsfd to the privileged process which can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE before switching back into the mount namespace of the container, and finish up the mounting process and call fsmount and move_mount. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Tested-by: Alban Crequy <alban.crequy@gmail.com> Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
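A minimal user-space sketch of the flow described above, assuming a libc/kernel that exposes the new mount API syscalls (fsopen, fsconfig, fsmount, move_mount); error handling and the actual fd hand-off between the unprivileged and privileged processes are omitted, and the server address and mount point are placeholders:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mount.h>

    static int nfs_mount_sketch(void)
    {
            /* 1. In the container's user namespace: open the fs context.
             *    The credentials (and user namespace) are captured here. */
            int fsfd = syscall(SYS_fsopen, "nfs4", 0);

            /* 2. In the privileged process that received fsfd: configure
             *    it and create the superblock (requires CAP_SYS_ADMIN in
             *    the init user ns). */
            syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source",
                    "server:/export", 0);
            syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

            /* 3. Back in the container's mount namespace: attach the mount. */
            int mfd = syscall(SYS_fsmount, fsfd, 0, 0);
            return syscall(SYS_move_mount, mfd, "", AT_FDCWD, "/mnt",
                           MOVE_MOUNT_F_EMPTY_PATH);
    }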
-
Trond Myklebust authored
When returning the layout in nfs4_evict_inode(), we need to ensure that the layout is actually done being freed before we can proceed to free the inode itself. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
While we always want to align to the next page and/or the beginning of the tail buffer when we call xdr_set_next_page(), the functions xdr_align_data() and xdr_expand_hole() really want to align to the next object in that next page or tail. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to contain the length of the data that we expect to be placed in the head kvec, plus a count of 1 word of padding that is placed after the page data. This is very confusing when trying to read the code, and sometimes leads to callers adding an arbitrary value of '1' just in order to satisfy the requirement (whether or not the page data actually needs such padding). This patch aims to clarify the code by changing the 'hdrsize' argument to remove that 1 word of padding. This means we need to subtract the padding from all the existing callers. Fixes: 02ef04e4 ("NFS: Account for XDR pad of buf->pages") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
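A hedged before/after illustration of a caller; 'decode_size' is only a stand-in for the per-operation reply size constant, not an actual identifier from the tree:

    /* before: hdrsize had to include the extra 1-word pad placed after
     * the page data */
    rpc_prepare_reply_pages(req, args->pages, args->pgbase, args->count,
                            decode_size + 1);

    /* after: pass only the expected head length; the pad is no longer
     * part of the argument */
    rpc_prepare_reply_pages(req, args->pages, args->pgbase, args->count,
                            decode_size);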
-
Trond Myklebust authored
Fix up xdr_read_pages() so that it can handle object lengths that are larger than the page length, by simply aligning to the next object in the buffer tail. The function will continue to return the length of the truncated object data that actually fit into the pages. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Allow xdr_set_iov() to set a base so that we can use it to set the cursor to a specific position in the kvec buffer. If the new base overflows the kvec/pages buffer in either xdr_set_iov() or xdr_set_page_base(), then truncate it so that we point to the end of the buffer. Finally, change both functions to return the number of bytes remaining to read in their buffers. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
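A minimal sketch of the behaviour described, not the exact in-tree code (the xdr_stream cursor fields are real; the function name is suffixed to mark it as illustrative):

    static unsigned int xdr_set_iov_sketch(struct xdr_stream *xdr,
                                           struct kvec *iov,
                                           unsigned int base, unsigned int len)
    {
            if (base > len)
                    base = len;     /* truncate: point at the end of the buffer */
            xdr->p = (__be32 *)((char *)iov->iov_base + base);
            xdr->end = (__be32 *)((char *)iov->iov_base + len);
            return len - base;      /* bytes remaining to read */
    }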
-
Trond Myklebust authored
We already know that the head buffer and page are empty, so if there is any data, it is in the tail. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
We can fit the device_addr4 opaque data padding in the pages. Fixes: cf500bac ("SUNRPC: Introduce rpc_prepare_reply_pages()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Use the existing xdr_stream_decode_string_dup() to safely decode into kmalloced strings. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Ensure that we report the correct netid when using UDP or RDMA transports to the DSes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
We want to enable RDMA and UDP as valid transport methods if a GETDEVICEINFO call specifies one of them. Do so by adding a parser for the netid that translates it to an appropriate argument for the RPC transport layer. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
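A hypothetical illustration of such a netid parser (the helper name is an assumption; the transport identifiers are the existing sunrpc constants):

    static int nfs4_netid_to_xprt_sketch(const char *netid)
    {
            if (!strcmp(netid, "tcp") || !strcmp(netid, "tcp6"))
                    return XPRT_TRANSPORT_TCP;
            if (!strcmp(netid, "udp") || !strcmp(netid, "udp6"))
                    return XPRT_TRANSPORT_UDP;
            if (!strcmp(netid, "rdma") || !strcmp(netid, "rdma6"))
                    return XPRT_TRANSPORT_RDMA;
            return -EINVAL;         /* unsupported netid */
    }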
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
If the pNFS metadata server advertises multiple addresses for the same data server, we should try to connect to just one protocol family and transport type on the assumption that homogeneity will improve performance. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Switch the mount code to use xprt_find_transport_ident() and to check the results before allowing the mount to proceed. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
After we've looked up the transport module, we need to ensure it can't go away until we've finished running the transport setup code. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
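A sketch of the usual pinning pattern this implies ('t' and 'args' are assumed locals, not the exact code):

    /* Pin the transport class's owning module across its setup routine
     * so it cannot be unloaded while we are still executing its code. */
    if (!try_module_get(t->owner))
            return ERR_PTR(-EINVAL);
    xprt = t->setup(args);
    module_put(t->owner);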
-
Trond Myklebust authored
According to RFC5666, the correct netid for an IPv6 addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is that when we try to load the module "xprtrdma6", that will fail, since there is no module alias of that name. Fixes: 181342c5 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
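For reference, a module alias is the conventional mechanism that lets a request for one module name resolve to an existing module; a one-line illustration (shown only to explain the term, not necessarily the approach this series takes):

    /* In the xprtrdma module: make a request for "xprtrdma6" resolve to
     * the existing module (illustrative only). */
    MODULE_ALIAS("xprtrdma6");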
-
Trond Myklebust authored
If the directory is changing, causing the page cache to get invalidated while we are listing the contents, then the NFS client is currently forced to read in the entire directory contents from scratch, because it needs to perform a linear search for the readdir cookie. While this is not an issue for small directories, it does not scale to directories with millions of entries. In order to be able to deal with large directories that are changing, add a heuristic to ensure that if the page cache is empty, and we are searching for a cookie that is not the zero cookie, we just default to performing uncached readdir. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
If we're doing uncached readdir, allocate multiple pages in order to try to avoid duplicate RPC calls for the same getdents() call. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
If the server is handing out monotonically increasing readdir cookie values, then we can optimise away searches through pages that contain cookies that lie outside our search range. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
If the server insists on using the readdir verifiers in order to allow cookies to expire, then we should ensure that we cache the verifier with the cookie, so that we can return an error if the application tries to use the expired cookie. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
If the server returns NFS4ERR_NOT_SAME or tells us that the cookie is bad in response to a READDIR call, then we should empty the page cache so that we can fill it from scratch again. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
If we're ever going to allow support for servers that use the readdir verifier, then that use needs to be managed by the middle layers as those need to be able to reject cookies from other verifiers. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
The descriptor and the struct nfs_entry are both large structures, so don't allocate them from the stack. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
Clean up nfs_do_filldir(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
Remove the redundant caching of the credential in struct nfs_open_dir_context. Pass the buffer size as an argument to nfs_readdir_xdr_filler(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
Support readdir buffers of up to 1MB in size so that we can read large directories using few RPC calls. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-
Trond Myklebust authored
We don't need to store a hash, so replace struct qstr with a simple const char pointer and length. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com>
-