Commits · 347543e64082782379627cb21162cb859590f3c7 · nexedi / linux

12 Jul, 2019 1 commit

Merge tag 'nfs-rdma-for-5.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs · 347543e6

Trond Myklebust authored Jul 11, 2019

NFSoRDMA client updates for 5.3

New features:
- Add a way to place MRs back on the free list
- Reduce context switching
- Add new trace events

Bugfixes and cleanups:
- Fix a BUG when tracing is enabled with NFSv4.1
- Fix a use-after-free in rpcrdma_post_recvs
- Replace use of xdr_stream_pos in rpcrdma_marshal_req
- Fix occasional transport deadlock
- Fix show_nfs_errors macros, other tracing improvements
- Remove RPCRDMA_REQ_F_PENDING and fr_state
- Various simplifications and refactors


09 Jul, 2019 17 commits

NFS: Record task, client ID, and XID in xdr_status trace points · 62a92ba9

Chuck Lever authored Jun 19, 2019

When triggering an nfs_xdr_status trace point, record the task ID
and XID of the failing RPC to better pinpoint the problem.

This feels like a bit of a layering violation.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


NFS: Update symbolic flags displayed by trace events · 7d4006c1

Chuck Lever authored Jun 19, 2019

Add missing symbolic flag names and display flags variables in
hexadecimal to improve observability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


NFS: Display symbolic status code names in trace log · 38a638a7

Chuck Lever authored Jun 19, 2019

For improved readability, add nfs_show_status() call-sites in the
generic NFS trace points so that the symbolic status code name is
displayed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


NFS: Fix show_nfs_errors macros again · 96650e2e

Chuck Lever authored Jun 19, 2019

I noticed that NFS status values stopped working again.

trace_print_symbols_seq() takes an unsigned long. Passing a negative
errno or negative NFSERR value just confuses it, and since we're
using C macros here and not static inline functions, all bets are
off due to implicit type conversion.

Straight-line the calling conventions so that error codes are stored
in the trace record as positive values in an unsigned long field,
mapped to symbolic as an unsigned long, and displayed as a negative
value, to continue to enable grepping on "error=-".

It's often the case that an error value that is positive is a byte
count but when it's negative, it's an error (e.g. nfs4_write). Fix
those cases so that the value that is eventually stored in the
error field is a positive NFS status or errno, or zero.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
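
The calling convention described in this commit message can be sketched in plain userspace C. This is only an illustration, not the kernel's tracing macros: `record_error`, `show_status`, and `format_error` are hypothetical names, and the symbol table covers just a few well-known errno values.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: a return value is either a byte count (>= 0)
 * or a negative error. Store a positive error code, or zero. */
static unsigned long record_error(long ret)
{
    return ret < 0 ? 0UL - (unsigned long)ret : 0UL;
}

/* Hypothetical lookup: map the positive, unsigned value to a name. */
static const char *show_status(unsigned long error)
{
    switch (error) {
    case 0:      return "OK";
    case EPERM:  return "EPERM";
    case ENOENT: return "ENOENT";
    case EIO:    return "EIO";
    default:     return "UNKNOWN";
    }
}

/* Display the stored value negated, so that grepping for "error=-"
 * still finds every failure. */
static int format_error(char *buf, size_t len, unsigned long error)
{
    return snprintf(buf, len, "error=%ld (%s)",
                    -(long)error, show_status(error));
}
```

With this sketch, a successful write that returns a byte count is recorded as error 0, while a return of `-EIO` is recorded as the positive value of `EIO` and printed with a leading minus sign.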


NFS4: Add a trace event to record invalid CB sequence IDs · c5833f0d

Chuck Lever authored Jun 19, 2019

Help debug NFSv4 callback failures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Modernize ops->connect · 675dd90a

Chuck Lever authored Jun 19, 2019

Adapt and apply changes that were made to the TCP socket connect
code. See the following commits for details on the purpose of
these changes:

Commit 7196dbb0 ("SUNRPC: Allow changing of the TCP timeout parameters on the fly")
Commit 3851f1cd ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout")
Commit 02910177 ("SUNRPC: Fix reconnection timeouts")

Some common transport code is moved to xprt.c to satisfy the code
duplication police.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Remove rpcrdma_req::rl_buffer · 5828ceba

Chuck Lever authored Jun 19, 2019

Clean up.

There is only one remaining function, rpcrdma_buffer_put(), that
uses this field. Its caller can supply a pointer to the correct
rpcrdma_buffer, enabling the removal of an 8-byte pointer field
from a frequently-allocated shared data structure.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Refactor chunk encoding · 6a6c6def

Chuck Lever authored Jun 19, 2019

Clean up.

Move the "not present" case into the individual chunk encoders. This
improves code organization and readability.

The reason for the original organization was to optimize for the
case where there are no chunks. The optimization turned out to
be inconsequential, so let's err on the side of code readability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Streamline rpcrdma_post_recvs · 9ef33ef5

Chuck Lever authored Jun 19, 2019

rb_lock is contended between rpcrdma_buffer_create,
rpcrdma_buffer_put, and rpcrdma_post_recvs.

Commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)")
causes rpcrdma_post_recvs to take the rb_lock repeatedly when it
determines more Receives are needed. Streamline this code path so
it takes the lock just once in most cases to build the Receive
chain that is about to be posted.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Simplify rpcrdma_rep_create · 379d1bc5

Chuck Lever authored Jun 19, 2019

Clean up.

Commit 7c8d9e7c ("xprtrdma: Move Receive posting to Receive
handler") reduced the number of rpcrdma_rep_create call sites to
one. After that commit, the backchannel code no longer invokes it.

Therefore the free list logic added by commit d698c4a0
("xprtrdma: Fix backchannel allocation of extra rpcrdma_reps") is
no longer necessary, and in fact adds some extra overhead that we
can do without.

Simply post any newly created reps. They will get added back to
the rb_recv_bufs list when they subsequently complete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Wake RPCs directly in rpcrdma_wc_send path · 0ab11523

Chuck Lever authored Jun 19, 2019

Eliminate a context switch in the path that handles RPC wake-ups
when a Receive completion has to wait for a Send completion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Reduce context switching due to Local Invalidation · d8099fed

Chuck Lever authored Jun 19, 2019

Since commit ba69cd12 ("xprtrdma: Remove support for FMR memory
registration"), FRWR is the only supported memory registration mode.

We can take advantage of the asynchronous nature of FRWR's LOCAL_INV
Work Requests to get rid of the completion wait by having the
LOCAL_INV completion handler take care of DMA unmapping MRs and
waking the upper layer RPC waiter.

This eliminates two context switches when local invalidation is
necessary. As a side benefit, we will no longer need the per-xprt
deferred completion work queue.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Add mechanism to place MRs back on the free list · 40088f0e

Chuck Lever authored Jun 19, 2019

When a marshal operation fails, any MRs that were already set up for
that request are recycled. Recycling releases MRs and creates new
ones, which is expensive.

Since commit f2877623 ("xprtrdma: Chain Send to FastReg WRs")
was merged, recycling FRWRs is unnecessary. This is because before
that commit, frwr_map had already posted FAST_REG Work Requests,
so ownership of the MRs had already been passed to the NIC and thus
dealing with them had to be delayed until they completed.

Since that commit, however, FAST_REG WRs are posted at the same time
as the Send WR. This means that if marshaling fails, we are certain
the MRs are safe to simply unmap and place back on the free list
because neither the Send nor the FAST_REG WRs have been posted yet.
The kernel still has ownership of the MRs at this point.

This reduces the total number of MRs that the xprt has to create
under heavy workloads and makes the marshaling logic less brittle.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Remove fr_state · 84756894

Chuck Lever authored Jun 19, 2019

Now that both the Send and Receive completions are handled in
process context, it is safe to DMA unmap and return MRs to the
free or recycle lists directly in the completion handlers.

Doing this means rpcrdma_frwr no longer needs to track the state of
each MR, meaning that a VALID or FLUSHED MR can no longer appear on
an xprt's MR free list. Thus there is no longer a need to track the
MR's registration state in rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag · 5809ea4f

Chuck Lever authored Jun 19, 2019

Commit 9590d083 ("xprtrdma: Use xprt_pin_rqst in
rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they
can be left in the pending requests queue while they are being
processed without introducing a race between ->buf_free and the
transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no
longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Fix occasional transport deadlock · 05eb06d8

Chuck Lever authored Jun 19, 2019

Under high I/O workloads, I've noticed that an RPC/RDMA transport
occasionally deadlocks (IOPS goes to zero, and doesn't recover).
Diagnosis shows that the sendctx queue is empty, but when sendctxs
are returned to the queue, the xprt_write_space wake-up never
occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy.

I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented
via an atomic bit. Just one of those is sufficient. Removing
EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock
un-reproducible.

Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and
is therefore removed.

Unfortunately this patch does not apply cleanly to stable. If
needed, someone will have to port it and test it.

Fixes: 2fad6592 ("xprtrdma: Wait on empty sendctx queue")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_req · 1310051c

Chuck Lever authored Jun 19, 2019

This is a latent bug. xdr_stream_pos works by subtracting
xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not
initialized by xdr_init_encode().

It works today only because all fields in rpcrdma_req::rl_stream
are initialized to zero by rpcrdma_req_create, making the
subtraction in xdr_stream_pos always a no-op.

I found this issue via code inspection. It was introduced by commit
39f4cd9e ("xprtrdma: Harden chunk list encoding against send
buffer overflow"), but the code has changed enough since then that
this fix can't be automatically applied to stable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
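
To see why an uninitialized word count matters, here is a simplified model of the position calculation as the commit message describes it. The struct and function names are hypothetical stand-ins carrying only the two fields involved; the real kernel structures and rounding differ.

```c
/* Simplified stand-ins for xdr_buf and xdr_stream (assumption:
 * only the fields relevant to the calculation are modeled). */
struct xdr_buf_sketch {
    unsigned int len;            /* bytes in the buffer */
};

struct xdr_stream_sketch {
    struct xdr_buf_sketch *buf;
    unsigned int nwords;         /* word count used by the calculation */
};

/* Position = buffer length minus the bytes covered by nwords.
 * The result is only meaningful if nwords was initialized. */
static unsigned int stream_pos_sketch(const struct xdr_stream_sketch *xdr)
{
    return xdr->buf->len - (xdr->nwords << 2);
}
```

In this model a zeroed `nwords` makes the subtraction a no-op, so the position equals `buf->len`. Any stale non-zero value would silently shift the result.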


06 Jul, 2019 22 commits

SUNRPC: Fix possible autodisconnect during connect due to old last_used · 80d3c45f

Dave Wysochanski authored Jun 26, 2019

Ensure last_used is updated before calling mod_timer inside
xprt_schedule_autodisconnect. This avoids a possible xprt_autoclose
firing immediately after a successful connect when xprt_unlock_connect
calls xprt_schedule_autodisconnect with an old value of last_used.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
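
The ordering problem can be illustrated with a small model. This is a sketch under stated assumptions: the struct and function names are hypothetical, time is a plain counter, and counter wraparound is ignored.

```c
#include <stdbool.h>

struct xprt_sketch {
    unsigned long last_used;     /* time of last activity, in ticks */
    unsigned long idle_timeout;  /* idle period before autodisconnect */
    unsigned long timer_expires; /* when the modeled timer fires */
};

/* Old ordering: arm the timer from a possibly stale last_used. */
static void schedule_stale(struct xprt_sketch *x, unsigned long now)
{
    (void)now;
    x->timer_expires = x->last_used + x->idle_timeout;
}

/* Fixed ordering: refresh last_used first, then arm the timer. */
static void schedule_fixed(struct xprt_sketch *x, unsigned long now)
{
    x->last_used = now;
    x->timer_expires = x->last_used + x->idle_timeout;
}

/* True if the timer would already be due at time 'now'. */
static bool fires_immediately(const struct xprt_sketch *x, unsigned long now)
{
    return x->timer_expires <= now;
}
```

If a connect attempt takes longer than the idle timeout, the stale ordering yields an expiry time that is already in the past, which corresponds to the immediate autoclose described above.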


SUNRPC: Drop redundant CONFIG_ from CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES · 4368d77a

Anna Schumaker authored Jun 19, 2019

The "CONFIG_" portion is added automatically, so this was being expanded
into "CONFIG_CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES"
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS: Cleanup if nfs_match_client is interrupted · 9f7761cf

Benjamin Coddington authored Jun 11, 2019

Don't bail out before cleaning up a new allocation if the wait for
searching for a matching nfs client is interrupted. Otherwise the
allocation is leaked.

Reported-by: syzbot+7fe11b49c1cc30e3fce2@syzkaller.appspotmail.com
Fixes: 950a578c ("NFS: make nfs_match_client killable")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


nfs: disable client side deduplication · 9026b3a9

Darrick J. Wong authored May 31, 2019

The NFS protocol doesn't support deduplication, so turn it off again.

Fixes: ce96e888 ("Fix nfs4.2 return -EINVAL when do dedupe operation")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFSv4: Add lease_time and lease_expired to 'nfs4:' line of mountstats · 1a7441b2

Dave Wysochanski authored May 17, 2019

On the NFS client there is no low-impact way to determine the nfs4
lease time or whether the lease is expired, so add these to mountstats
with times displayed in seconds.

If the lease is not expired, display lease_expired=0. Otherwise,
display lease_expired=seconds_since_expired, similar to 'age:' line
in mountstats.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
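
The lease_expired rule can be sketched as follows. The helper names, the parameters, and the exact output layout are assumptions for illustration; only the field names `lease_time` and `lease_expired` come from the commit message, and all times are taken to be in seconds.

```c
#include <stdio.h>

/* Hypothetical helper: seconds elapsed since the lease expired,
 * or 0 if the lease is still valid. */
static unsigned long lease_expired_secs(unsigned long now,
                                        unsigned long last_renewal,
                                        unsigned long lease_time)
{
    unsigned long expiry = last_renewal + lease_time;

    return now > expiry ? now - expiry : 0;
}

/* Hypothetical formatter for the two new fields. */
static int format_lease(char *buf, size_t len,
                        unsigned long lease_time, unsigned long expired)
{
    return snprintf(buf, len, "lease_time=%lu,lease_expired=%lu",
                    lease_time, expired);
}
```

For a 90-second lease last renewed at time 0, the sketch reports 0 at time 50 and 10 at time 100.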


NFS: Clean up writeback code · 2b17d725

Trond Myklebust authored Jun 11, 2019

Now that the VM promises never to recurse back into the filesystem
layer on writeback, remove all the GFP_NOFS references etc from
the generic writeback code.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


Merge branch 'multipath_tcp' · c98ebe29

Trond Myklebust authored Jun 11, 2019

Merge branch 'containers' · 28ade856

Trond Myklebust authored Jun 28, 2019

Merge branch 'cache_consistency' · 02a2779f

Trond Myklebust authored Jun 11, 2019

SUNRPC: Remove warning in debugfs.c when compiling with W=1 · b6580ab3

Trond Myklebust authored May 30, 2019

Remove the following warning:

net/sunrpc/debugfs.c:13: warning: cannot understand function prototype: 'struct dentry *topdir;
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


Merge branch 'bh-remove' · 41adafa0

Trond Myklebust authored Jun 11, 2019

SUNRPC: add links for all client xprts to debugfs · 2f34b8bf

NeilBrown authored May 30, 2019

Now that a client can have multiple xprts, we need to add
them all to debugfs.
The first one is still "xprt".
Subsequent xprts are "xprt1", "xprt2", etc.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
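
The naming scheme for the links can be sketched like this. `xprt_link_name` is a hypothetical helper, not the function used in debugfs.c; it only reproduces the naming rule stated above.

```c
#include <stdio.h>

/* Hypothetical helper: the first transport keeps the legacy name
 * "xprt"; later ones get a numeric suffix starting at 1. */
static int xprt_link_name(char *buf, size_t len, unsigned int index)
{
    if (index == 0)
        return snprintf(buf, len, "xprt");
    return snprintf(buf, len, "xprt%u", index);
}
```

Keeping the unsuffixed name for the first transport means existing tooling that reads the "xprt" link continues to work.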


SUNRPC: Count ops completing with tk_status < 0 · a332518f

Dave Wysochanski authored May 23, 2019

We often see various error conditions with NFS4.x that show up with
a very high operation count all completing with tk_status < 0 in a
short period of time.  Add a count to rpc_iostats to record on a
per-op basis the ops that complete in this manner, which will
enable lower overhead diagnostics.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts. · 10db5691

NeilBrown authored May 30, 2019

Now that a client can have multiple xprts, we need to
report the statistics for all of them.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Use proper printk specifiers for unsigned long long · 93ba048e

Dave Wysochanski authored May 23, 2019

Update the printk specifiers inside _print_rpc_iostats to avoid
a checkpatch warning.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Move call to rpc_count_iostats before rpc_call_done · 9dfe52a9

Dave Wysochanski authored May 23, 2019

For diagnostic purposes, it would be useful to have an rpc_iostats
metric of RPCs completing with tk_status < 0. Unfortunately,
tk_status is reset inside the rpc_call_done functions for each
operation, and the call to tally the per-op metrics comes after
rpc_call_done. Move the call to rpc_count_iostats earlier in
rpc_exit_task so we can count these RPCs completing in error.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
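
Why the ordering matters can be shown with a small model. Every name below is a hypothetical stand-in, not a kernel structure or function; the sketch assumes only that the per-operation done callback resets the status, as the message states.

```c
struct task_sketch {
    int tk_status;               /* negative on error */
};

struct iostats_sketch {
    unsigned long ops;           /* operations completed */
    unsigned long errors;        /* operations that ended with status < 0 */
};

/* Models a per-operation done callback that resets the status. */
static void call_done_sketch(struct task_sketch *t)
{
    t->tk_status = 0;
}

static void count_iostats_sketch(const struct task_sketch *t,
                                 struct iostats_sketch *s)
{
    s->ops++;
    if (t->tk_status < 0)
        s->errors++;
}

/* Old ordering: the status is already reset when stats are tallied. */
static void exit_task_old(struct task_sketch *t, struct iostats_sketch *s)
{
    call_done_sketch(t);
    count_iostats_sketch(t, s);
}

/* New ordering: tally first, so the error is still visible. */
static void exit_task_new(struct task_sketch *t, struct iostats_sketch *s)
{
    count_iostats_sketch(t, s);
    call_done_sketch(t);
}
```

With a failing task, the old ordering counts the operation but never the error, while the new ordering counts both.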


NFS: send state management on a single connection. · 5a0c257f

NeilBrown authored May 30, 2019

With NFSv4.1, different network connections need to be explicitly
bound to a session.  During session startup, this is not possible
so only a single connection must be used for session startup.

So add a task flag to disable the default round-robin choice of
connections (when nconnect > 1) and force the use of a single
connection.
Then use that flag on all requests for session management - for
consistency, include NFSv4.0 management (SETCLIENTID) and session
destruction.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
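
The selection rule can be sketched as follows. The flag name, the struct, and `pick_connection` are hypothetical illustrations; the sketch also assumes that "a single connection" means the first one in the set.

```c
#include <stddef.h>

#define TASK_NO_ROUND_ROBIN 0x1u     /* hypothetical flag name */

struct xprt_switch_sketch {
    size_t nconnect;                 /* number of connections, >= 1 */
    size_t next;                     /* round-robin cursor */
};

/* Return the index of the connection a task should use. */
static size_t pick_connection(struct xprt_switch_sketch *xps,
                              unsigned int task_flags)
{
    size_t chosen;

    /* State management traffic always uses the first connection
     * and does not advance the cursor. */
    if (task_flags & TASK_NO_ROUND_ROBIN)
        return 0;

    chosen = xps->next;
    xps->next = (xps->next + 1) % xps->nconnect;
    return chosen;
}
```

Ordinary requests cycle through all connections, while flagged requests bypass the cursor entirely, so session setup never lands on a connection that is not yet bound to the session.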


NFS: Allow multiple connections to a NFSv2 or NFSv3 server · 53c32630

Trond Myklebust authored Sep 17, 2018

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Display the "nconnect" mount option if it is set. · fd87c8b7

Trond Myklebust authored Apr 27, 2017

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Allow multiple connections to the DS · bb71e4a5

Trond Myklebust authored Apr 27, 2017

If the user specifies the -o nconnect=<number> mount option, and the transport
protocol is TCP, then set up <number> connections to the pNFS data server
as well. The connections will all go to the same IP address.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>


NFSv4: Allow multiple connections to NFSv4.x (x>0) servers · 6619079d

Trond Myklebust authored Apr 27, 2017

If the user specifies the -o nconnect=<number> mount option, and the transport
protocol is TCP, then set up <number> connections to the server. The
connections will all go to the same IP address.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>


NFS: Add a mount option to specify number of TCP connections to use · 28cc5cd8

Trond Myklebust authored Apr 26, 2017

Allow the user to specify that the client should use multiple connections
to the server. For the moment, this functionality will be limited to
TCP and to NFSv4.x (x>0).
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
