- 20 Sep, 2019 9 commits
-
-
Trond Myklebust authored
Both close and delegreturn have identical code to handle pNFS return-on-close. This patch refactors that code and places it in pnfs.c Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
If the server rejects our layout return with a state error such as NFS4ERR_BAD_STATEID, or even a stale inode error, then we do want to clear out all the remaining layout segments and mark that stateid as invalid. Fixes: 1c5bd76d ("pNFS: Enable layoutreturn operation for...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
This check has been hanging around since we used to have parallel paths to add a dentry in nfs_create(), but that hasn't been the case for some years. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
Since commit b0c6108e ("nfs_instantiate(): prevent multiple aliases for directory inode"), nfs_instantiate() may succeed without actually instantiating the dentry that was passed in. That can be problematic for some callers in NFSv3, so this patch breaks things up so that we can obtain the actual dentry. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
If the congestion window closes just as the transport disconnects, a reconnect is never driven because: 1. The XPRT_CONG_WAIT flag prevents tasks from taking the write lock 2. There's no wake-up of the first task on the xprt->sending queue To address this, clear the congestion wait flag as part of completing a disconnect. Fixes: 75891f50 ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
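A minimal sketch of the shape of the fix, assuming it sits alongside the other congestion helpers in net/sunrpc/xprt.c (the flag the message calls XPRT_CONG_WAIT appears in the tree as XPRT_CWND_WAIT; treat the helper below as illustrative, not the verbatim patch):

    /* On disconnect completion, drop the congestion-wait state and
     * wake the next queued sender so a reconnect can be driven. */
    static void xprt_clear_congestion_window_wait(struct rpc_xprt *xprt)
    {
            if (test_and_clear_bit(XPRT_CWND_WAIT, &xprt->state)) {
                    spin_lock_bh(&xprt->transport_lock);
                    __xprt_lock_write_next_cong(xprt);
                    spin_unlock_bh(&xprt->transport_lock);
            }
    }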
-
Trond Myklebust authored
If the copy of the RPC reply into our buffers did not complete, we could end up with a truncated message. In that case, just resend the call. Fixes: a0584ee9 ("SUNRPC: Use struct xdr_stream when decoding...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
Let the name reflect the single use. The function now assumes the GSS MIC is the last object in the buffer. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
The GSS Message Integrity Check data for krb5i may lie partially in the XDR reply buffer's pages and tail. If so, we try to copy the entire MIC into free space in the tail. But as the estimates of the slack space required for authentication and verification have improved, there may be less free space in the tail to complete this copy -- see commit 2c94b8ec ("SUNRPC: Use au_rslack when computing reply buffer size"). In fact, there may only be room in the tail for a single copy of the MIC, not part of the MIC and then another complete copy. The real-world failure reported is that `ls` of a directory on NFS may sometimes return -EIO, which can be traced back to xdr_buf_read_netobj() failing to find available free space in the tail to copy the MIC. Fix this by checking whether the MIC crosses the boundaries of head, pages, and tail. If so, shift the buffer until the MIC is contained completely within the pages or tail. This allows the remainder of the function to create a sub-buffer that directly addresses the complete MIC. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org # v5.1 Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
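To make the boundary condition concrete, here is an illustrative predicate (the struct xdr_buf fields are real; the helper itself is a sketch, not the merged code). When it returns true for the MIC's offset and length, the fix shifts the buffer until the MIC lands wholly within the pages or the tail:

    /* Return true if an object at [offset, offset + len) straddles the
     * head/pages or pages/tail boundary of @buf. */
    static bool xdr_obj_crosses_boundary(const struct xdr_buf *buf,
                                         unsigned int offset,
                                         unsigned int len)
    {
            unsigned int head_end = buf->head[0].iov_len;
            unsigned int page_end = head_end + buf->page_len;

            if (offset < head_end && offset + len > head_end)
                    return true;    /* spans head into pages */
            if (offset < page_end && offset + len > page_end)
                    return true;    /* spans pages into tail */
            return false;
    }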
-
- 17 Sep, 2019 3 commits
-
-
Trond Myklebust authored
Ensure that we set task->tk_rpc_status for all RPC level errors so that the caller can distinguish between those and server reply status errors. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
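What the caller gains, sketched (tk_status and tk_rpc_status are the real fields; the handler names are hypothetical):

    static void example_examine_status(struct rpc_task *task)
    {
            if (task->tk_rpc_status != 0) {
                    /* RPC-level failure: connection loss, timeout,
                     * encode/decode error, and so on. */
                    handle_rpc_error(task->tk_rpc_status);  /* hypothetical */
                    return;
            }
            /* The server replied; tk_status carries the reply status. */
            handle_reply_status(task->tk_status);           /* hypothetical */
    }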
-
Trond Myklebust authored
If we've removed the request from the receive list, and have added it back after resetting the request receive buffer, then we should only receive message data if it is a new reply (i.e. if transport->recv.copied is zero). Fixes: 277e4ab7 ("SUNRPC: Simplify TCP receive code by switching...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
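The invariant, restated as a trivial predicate (a sketch; the merged check in xprtsock.c is inlined into the receive path):

    /* A request that was dequeued and re-queued with a reset receive
     * buffer must only accept data that begins a brand-new reply. */
    static bool xs_reply_is_new(const struct sock_xprt *transport)
    {
            return transport->recv.copied == 0;
    }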
-
Trond Myklebust authored
Ensure that we dequeue the request from the transport receive queue while we're re-encoding to prevent issues like use-after-free when we release the bvec. Fixes: 75369089 ("SUNRPC: Ensure the bvecs are reset when we re-encode...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 26 Aug, 2019 3 commits
-
-
Chuck Lever authored
Eli Dorfman reports that after a series of idle disconnects, an RPC/RDMA transport becomes unusable (rdma_create_qp returns -ENOMEM). The problem was tracked down to the Send Queue size growing after each reconnect. The rdma_create_qp() API does not promise to leave its @qp_init_attr parameter unaltered. In fact, some drivers do modify one or more of its fields. Thus our calls to rdma_create_qp must use a fresh copy of ib_qp_init_attr each time. This fix is appropriate for kernels dating back to late 2007, though it will have to be adapted, as the connect code has changed over the years. Reported-by: Eli Dorfman <eli@vastdata.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
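The essence of the fix, sketched (rdma_create_qp() and struct ib_qp_init_attr are the real verbs API; the surrounding function is illustrative):

    #include <rdma/rdma_cm.h>

    /* Never hand rdma_create_qp() the transport's long-lived template:
     * providers are allowed to modify the attr struct they are given. */
    static int example_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
                                 const struct ib_qp_init_attr *template)
    {
            struct ib_qp_init_attr attr = *template;    /* fresh copy */

            return rdma_create_qp(id, pd, &attr);
    }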
-
Chuck Lever authored
Ensure that the re-establishment delay does not grow exponentially on each good reconnect. This probably should have been part of commit 675dd90a ("xprtrdma: Modernize ops->connect"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
The optimization done in "xprtrdma: Simplify rpcrdma_mr_pop" was a bit too optimistic. MRs left over after a reconnect still need to be recycled, not added back to the free list, since they could be in flight or actually fully registered. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 22 Aug, 2019 6 commits
-
-
Anna Schumaker authored
This removes some code duplication, since both functions were doing the same thing. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Anna Schumaker authored
We need to use the custom rpc_task_setup here to set the RPC_TASK_NO_ROUND_ROBIN flag on the RPC call. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Anna Schumaker authored
An async call followed by an rpc_wait_for_completion() is basically the same as a synchronous call, so we can use nfs4_call_sync_custom() to keep our custom callback ops and the RPC_TASK_NO_ROUND_ROBIN flag. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Anna Schumaker authored
We do this to set the RPC_TASK_NO_ROUND_ROBIN flag in the task_setup structure. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Anna Schumaker authored
This avoids running the task manually. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Anna Schumaker authored
There are a few cases where we need to manually configure the rpc_task_setup structure to get the behavior we want. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
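The helper this series relies on, nfs4_call_sync_custom(), plausibly looks like this (a sketch; the merged fs/nfs/nfs4proc.c code may differ): run a task from a fully caller-configured rpc_task_setup and return its status.

    static int nfs4_call_sync_custom(struct rpc_task_setup *task_setup)
    {
            struct rpc_task *task;
            int ret;

            task = rpc_run_task(task_setup);
            if (IS_ERR(task))
                    return PTR_ERR(task);
            ret = task->tk_status;
            rpc_put_task(task);
            return ret;
    }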
-
- 21 Aug, 2019 8 commits
-
-
Wenwen Wang authored
In nfs4_try_migration(), if nfs4_begin_drain_session() fails, the previously allocated 'page' and 'locations' are not deallocated, leading to memory leaks. To fix this issue, go to the 'out' label to free 'page' and 'locations' before returning the error. Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
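The pattern of the fix, sketched with the names from the message (the migration work itself is elided; treat the function as illustrative):

    static int example_try_migration(struct nfs_client *clp)
    {
            struct page *page = alloc_page(GFP_KERNEL);
            struct nfs4_fs_locations *locations =
                    kmalloc(sizeof(*locations), GFP_KERNEL);
            int status = -ENOMEM;

            if (!page || !locations)
                    goto out;

            status = nfs4_begin_drain_session(clp);
            if (status != 0)
                    goto out;       /* was: return status -- leaked both */

            /* ... migration work elided ... */
    out:
            if (page)
                    __free_page(page);
            kfree(locations);
            return status;
    }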
-
Chuck Lever authored
Micro-optimization: In rpcrdma_post_recvs, since commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)"), the common case is to return without doing anything. Found with perf. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Micro-optimization: Save the cost of three function calls during transport header encoding. These were "noinline" before to generate more meaningful call stacks during debugging, but this code is now pretty stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
For the moment the returned value just happens to be correct because the current backchannel server implementation does not vary the number of credits it offers. The spec does permit this value to change during the lifetime of a connection, however. The actual maximum is fixed for all RPC/RDMA transports, because each transport instance has to pre-allocate the resources for processing BC requests. That's the value that should be returned. Fixes: 7402a4fe ("SUNRPC: Fix up backchannel slot table ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
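The shape of the fix, sketched (RPCRDMA_BACKWARD_WRS is the transport's real pre-allocation constant; whether the divisor below matches the merged code is an assumption):

    /* Report the fixed, pre-allocated backchannel capacity instead of
     * echoing whatever credit value the server currently advertises. */
    static unsigned int xprt_rdma_bc_max_slots(struct rpc_xprt *xprt)
    {
            return RPCRDMA_BACKWARD_WRS >> 1;
    }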
-
Chuck Lever authored
Clean up: The function name should match the documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
rpcrdma_rep objects are removed from their free list by only a single thread: the Receive completion handler. Thus that free list can be converted to an llist, where a single-threaded consumer and a multi-threaded producer (rpcrdma_buffer_put) can both access the llist without the need for any serialization. This eliminates spin lock contention between the Receive completion handler and rpcrdma_buffer_get, and makes the rep consumer wait-free. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
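The core of the scheme, sketched with the kernel's <linux/llist.h> primitives (structure and helper names simplified): any context may push, but only the single consumer pops, which is exactly the serialization llist_del_first() requires.

    #include <linux/llist.h>

    struct example_rep {
            struct llist_node rr_node;
            /* ... rest of the rep ... */
    };

    /* Multi-producer: any context may release a rep. */
    static void rep_put(struct llist_head *free_reps, struct example_rep *rep)
    {
            llist_add(&rep->rr_node, free_reps);
    }

    /* Single consumer: only the Receive completion handler takes one. */
    static struct example_rep *rep_get(struct llist_head *free_reps)
    {
            struct llist_node *node = llist_del_first(free_reps);

            return node ? llist_entry(node, struct example_rep, rr_node) : NULL;
    }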
-
Chuck Lever authored
Clean up: Now that the free list is used sparingly, get rid of the separate spin lock protecting it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Instead of a globally-contended MR free list, cache MRs in each rpcrdma_req as they are released. This means acquiring and releasing an MR will be lock-free in the common case, even outside the transport send lock. The original idea of per-rpcrdma_req MR free lists was suggested by Shirley Ma <shirley.ma@oracle.com> several years ago. I just now figured out how to make that idea work with on-demand MR allocation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
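Acquiring and releasing an MR under this scheme, sketched (the rl_free_mrs field name and the helpers are assumptions): the request's private cache is tried first, and only an empty cache falls back to the demand-allocation path.

    /* Release: cache the MR on its owning request; no shared lock. */
    static void example_mr_put(struct rpcrdma_req *req, struct rpcrdma_mr *mr)
    {
            list_add(&mr->mr_list, &req->rl_free_mrs);
    }

    /* Acquire: the request's own cache first, demand allocation second. */
    static struct rpcrdma_mr *example_mr_get(struct rpcrdma_xprt *r_xprt,
                                             struct rpcrdma_req *req)
    {
            struct rpcrdma_mr *mr = rpcrdma_mr_pop(&req->rl_free_mrs);

            return mr ? mr : rpcrdma_mr_get(r_xprt);
    }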
-
- 20 Aug, 2019 11 commits
-
-
Chuck Lever authored
It would probably be good to also pass GFP flags to ib_alloc_mr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Refactor: Retrieve an MR and handle error recovery entirely in rpc_rdma.c, as this is not a device-specific function. Note that since commit 89f90fe1 ("SUNRPC: Allow calls to xprt_transmit() to drain the entire transmit queue"), the xprt_transmit function handles the cond_resched. The transport no longer has to do this itself. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up. There is only one remaining rpcrdma_mr_put call site, and it can be directly replaced with mr_unmap_and_put because mr->mr_dir is set to DMA_NONE just before the call. Now all the call sites do a DMA unmap, and we can just rename mr_unmap_and_put to mr_put, which nicely matches mr_get. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: rpcrdma_mr_pop call sites check if the list is empty first. Let's replace the list_empty with less costly logic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
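The cheaper idiom, sketched (likely close to, though not necessarily, the merged code): fold the emptiness check into the pop itself, so callers test the returned pointer instead of calling list_empty() first.

    static struct rpcrdma_mr *rpcrdma_mr_pop(struct list_head *list)
    {
            struct rpcrdma_mr *mr;

            mr = list_first_entry_or_null(list, struct rpcrdma_mr, mr_list);
            if (mr)
                    list_del_init(&mr->mr_list);
            return mr;      /* NULL when the list is empty */
    }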
-
Chuck Lever authored
Commit 48be539d ("xprtrdma: Introduce ->alloc_slot call-out for xprtrdma") added a separate alloc_slot and free_slot to the RPC/RDMA transport. Later, commit 75891f50 ("SUNRPC: Support for congestion control when queuing is enabled") modified the generic alloc/free_slot methods, but neglected the methods in xprtrdma. Found via code review. Fixes: 75891f50 ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: There are other "all" list heads. For code clarity, rename this one to make clear that it is for use only with MRs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Make the field name the same for all trace points that handle pointers to struct rpcrdma_rep. That makes it easy to grep trace output for entries with a matching rep pointer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
I've heard rumors of an NFS/RDMA server implementation that has a default credit limit of 1024. The client's default setting remains at 128. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Although I haven't seen any performance results that justify it, I've received several complaints that NFS/RDMA no longer supports a maximum rsize and wsize of 1MB. These days it is somewhat smaller. To simplify the logic that determines whether a chunk list is necessary, the implementation uses a fixed maximum size of the transport header. Currently that maximum size is 256 bytes, one quarter of the default inline threshold size for RPC/RDMA v1. Since commit a7886849 ("xprtrdma: Reduce max_frwr_depth"), the size of chunks is also smaller to take advantage of inline page lists in device internal MR data structures. The combination of these two design choices has reduced the maximum NFS rsize and wsize that can be used for most RNIC/HCAs. Increasing the maximum transport header size and the maximum number of RDMA segments it can contain increases the negotiated maximum rsize/wsize on common RNIC/HCAs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Commit 302d3deb ("xprtrdma: Prevent inline overflow") added this calculation back in 2016, but got it wrong. I tested only the lower bound, which is why there is a max_t there. The upper bound should be rounded up too. Using DIV_ROUND_UP takes care of the lower bound as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
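A worked example of the rounding point (the constants are illustrative, not the driver's actual values): splitting 256 data segments across MRs of depth 30 needs DIV_ROUND_UP(256, 30) = 9 MRs, whereas plain integer division yields 8, and max_t(unsigned int, 1, 256 / 30) only guarded against a zero result.

    #include <linux/kernel.h>

    /* Round up so a final, partially filled MR is still counted; a
     * nonzero numerator also guarantees a result of at least 1. */
    static unsigned int example_max_segs(unsigned int max_data_segs,
                                         unsigned int frwr_depth)
    {
            return DIV_ROUND_UP(max_data_segs, frwr_depth);
    }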
-
Chuck Lever authored
Comment was made obsolete by commit 8cec3dba ("xprtrdma: rpcrdma_regbuf alignment"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-