Commits · 9a5c63e9c4056de8a73555131e6f698ddb0b9e0d · nexedi / linux

10 Feb, 2017 8 commits

xprtrdma: Refactor management of mw_list field · 9a5c63e9

Chuck Lever authored Feb 08, 2017

Clean up some duplicate code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

9a5c63e9

xprtrdma: Handle stale connection rejection · 0a90487b

Chuck Lever authored Feb 08, 2017

A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that correspond to an old connection that was
abandoned. This might occur after a client crashes and restarts.

Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.

As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

0a90487b
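
The distinction drawn in commit 0a90487b can be sketched in plain C as below. This is only an illustration: the enum, the function name, and the choice of errno values are assumptions, not the kernel's actual constants or the body of rpcrdma_conn_upcall().

```c
#include <errno.h>

/* Hypothetical reject reasons; these names are not the kernel's constants. */
enum reject_reason {
	REJ_OTHER,	/* any ordinary rejection */
	REJ_STALE_CONN,	/* server saw a QPN/GUID from an abandoned connection */
};

/*
 * Map a rejection to a connection status. The specific errno values are
 * an assumption chosen for illustration.
 */
static int reject_to_connstate(enum reject_reason reason)
{
	if (reason == REJ_STALE_CONN)
		return -EAGAIN;		/* transient: a fresh attempt may succeed */
	return -ECONNREFUSED;		/* normal rejection */
}
```

Keeping the two cases apart lets the caller treat a stale-parameter rejection as retryable instead of as a hard refusal.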

xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs · 18c0fb31

Chuck Lever authored Feb 08, 2017

Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.

When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.

Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.

- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.

- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

18c0fb31
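
The two orderings described in commit 18c0fb31 can be replayed with a small single-threaded model. The state names come from the commit message; the struct and the sim_* functions are hypothetical stand-ins, not the xprtrdma code, and the "race" is modeled simply by running the two steps in either order.

```c
/* State names are taken from the commit message; the rest is hypothetical. */
enum frmr_state { FRMR_IS_INVALID, FRMR_IS_VALID, FRMR_FLUSHED_FR };

struct frmr_sim {
	enum frmr_state state;
	int dma_unmaps;		/* how many times this MR was DMA unmapped */
};

/* MR recovery: DMA unmap, then mark the MR reusable. */
static void sim_recover(struct frmr_sim *f)
{
	f->dma_unmaps++;
	f->state = FRMR_IS_INVALID;
}

/* Completion handler for a FASTREG WR that was flushed. */
static void sim_wc_fastreg_flushed(struct frmr_sim *f)
{
	f->state = FRMR_FLUSHED_FR;
}

/* Re-use: an MR not marked invalid is recovered before it is used. */
static void sim_map(struct frmr_sim *f)
{
	if (f->state != FRMR_IS_INVALID)
		sim_recover(f);
	f->state = FRMR_IS_VALID;
}

/* Run both racing steps in the given order, then re-use the MR. */
static int sim_unmaps_after_reuse(int completion_runs_last)
{
	struct frmr_sim f = { FRMR_IS_VALID, 0 };

	if (completion_runs_last) {
		sim_recover(&f);
		sim_wc_fastreg_flushed(&f);
	} else {
		sim_wc_fastreg_flushed(&f);
		sim_recover(&f);
	}
	sim_map(&f);
	return f.dma_unmaps;
}
```

In this model sim_unmaps_after_reuse(0) yields one unmap, while sim_unmaps_after_reuse(1) yields two, which is the double unmap the commit sets out to prevent.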

xprtrdma: Shrink send SGEs array · c6f5b47f

Chuck Lever authored Feb 08, 2017

We no longer need to accommodate an xdr_buf whose pages start at an
offset and cross extra page boundaries. If there are more partial or
whole pages to send than there are available SGEs, the marshaling
logic is now smart enough to use a Read chunk instead of failing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c6f5b47f

xprtrdma: Reduce required number of send SGEs · 16f906d6

Chuck Lever authored Feb 08, 2017

The MAX_SEND_SGES check introduced in commit 655fec69
("xprtrdma: Use gathered Send for large inline messages") fails
for devices that have a small max_sge.

Instead of checking for a large fixed maximum number of SGEs,
check for a minimum small number. RPC-over-RDMA will switch to
using a Read chunk if an xdr_buf has more pages than can fit in
the device's max_sge limit. This is considerably better than
failing altogether to mount the server.

This fix supports devices that have as few as three send SGEs
available.
Reported-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reported-by: Honggang Li <honli@redhat.com>
Reported-by: Ram Amrani <Ram.Amrani@cavium.com>
Fixes: 655fec69 ("xprtrdma: Use gathered Send for large ...")
Cc: stable@vger.kernel.org # v4.9+
Tested-by: Honggang Li <honli@redhat.com>
Tested-by: Ram Amrani <Ram.Amrani@cavium.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

16f906d6
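
A rough sketch of the policy in commit 16f906d6, assuming a minimum of three SGEs as stated in the message. The macro and function names are hypothetical and do not reproduce the transport's real checks.

```c
#include <stdbool.h>

/* "As few as three send SGEs" is from the commit message; the name is made up. */
#define SIM_MIN_SEND_SGES 3u

/* Accept any device that offers at least the small minimum. */
static bool sim_device_usable(unsigned int max_sge)
{
	return max_sge >= SIM_MIN_SEND_SGES;
}

/* Fall back to a Read chunk when the payload needs more SGEs than exist. */
static bool sim_needs_read_chunk(unsigned int sges_needed, unsigned int max_sge)
{
	return sges_needed > max_sge;
}
```

The point of the design is that a small max_sge changes how a request is sent, not whether the mount succeeds.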

xprtrdma: Disable pad optimization by default · c95a3c6b

Chuck Lever authored Feb 08, 2017

Commit d5440e27 ("xprtrdma: Enable pad optimization") made the
Linux client omit XDR round-up padding in normal Read and Write
chunks so that the client doesn't have to register and invalidate
3-byte memory regions that contain no real data.

Unfortunately, my cheery 2014 assessment that this optimization "is
supported now by both Linux and Solaris servers" was premature.
We've found bugs in Solaris in this area since commit d5440e27
("xprtrdma: Enable pad optimization") was merged (SYMLINK is the
main offender).

So for maximum interoperability, I'm disabling this optimization
again. If a CM private message is exchanged when connecting, the
client recognizes that the server is Linux, and enables the
optimization for that connection.

Until now the Solaris server bugs did not impact common operations,
and were thus largely benign. Soon, less capable devices on Linux
NFS/RDMA clients will make use of Read chunks more often, and these
Solaris bugs will prevent interoperation in more cases.

Fixes: 677eb17e ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c95a3c6b

xprtrdma: Per-connection pad optimization · b5f0afbe

Chuck Lever authored Feb 08, 2017

Pad optimization is changed by echoing into
/proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
affecting all RPC-over-RDMA connections to all servers.

The marshaling code picks up that value and uses it for decisions
about how to construct each RPC-over-RDMA frame. Having it change
suddenly in mid-operation can result in unexpected failures. And
some servers a client mounts might need chunk round-up, while
others don't.

So instead, copy the pad_optimize setting into each connection's
rpcrdma_ia when the transport is created, and use the copy, which
can't change during the life of the connection, instead.

This also removes a hack: rpcrdma_convert_iovs was using
the remote-invalidation-expected flag to predict when it could leave
out Write chunk padding. This is because the Linux server handles
implicit XDR padding on Write chunks correctly, and only Linux
servers can set the connection's remote-invalidation-expected flag.

It's more sensible to use the pad optimization setting instead.

Fixes: 677eb17e ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

b5f0afbe
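
The idea in commit b5f0afbe, snapshotting a global tunable when the transport is created, can be sketched as follows. The variable and struct names are hypothetical; only the copy-at-creation pattern is taken from the commit message.

```c
#include <stdbool.h>

/* Stands in for the global rdma_pad_optimize setting. */
static bool sim_global_pad_optimize;

/* Stands in for the per-connection structure that holds the copy. */
struct sim_conn {
	bool pad_optimize;
};

/* Copy the global value once, when the connection is set up. */
static void sim_conn_init(struct sim_conn *c)
{
	c->pad_optimize = sim_global_pad_optimize;
}

/* Marshaling decisions consult only the per-connection copy. */
static bool sim_conn_omits_padding(const struct sim_conn *c)
{
	return c->pad_optimize;
}
```

A later change to the global value affects only connections created afterwards, so an in-progress operation never sees the setting flip.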

xprtrdma: Fix Read chunk padding · 24abdf1b

Chuck Lever authored Feb 08, 2017

When pad optimization is disabled, rpcrdma_convert_iovs still
does not add explicit XDR round-up padding to a Read chunk.

Commit 677eb17e ("xprtrdma: Fix XDR tail buffer marshalling")
incorrectly short-circuited the test for whether round-up padding
is needed that appears later in rpcrdma_convert_iovs.

However, if this is indeed a regular Read chunk (and not a
Position-Zero Read chunk), the tail iovec _always_ contains the
chunk's padding, and never anything else.

So, it's easy to just skip the tail when padding optimization is
enabled, and add the tail in a subsequent Read chunk segment, if
disabled.

Fixes: 677eb17e ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

24abdf1b
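
For commit 24abdf1b, the arithmetic behind XDR round-up padding (XDR pads data to a multiple of four bytes) can be sketched as below. The function names are hypothetical, and the sketch only models how many bytes a regular Read chunk would cover, not how rpcrdma_convert_iovs builds segments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Bytes of padding needed to round len up to a 4-byte XDR boundary. */
static size_t sim_xdr_pad_len(size_t len)
{
	return (4 - (len & 3)) & 3;
}

/*
 * Bytes covered by a regular Read chunk: the tail (which holds only the
 * padding) is skipped when pad optimization is on, and added as a further
 * segment when it is off.
 */
static size_t sim_read_chunk_len(size_t payload_len, bool pad_optimize)
{
	if (pad_optimize)
		return payload_len;
	return payload_len + sim_xdr_pad_len(payload_len);
}
```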

09 Feb, 2017 4 commits

NFSv4: Set the connection timeout to match the lease period · 26ae102f

Trond Myklebust authored Feb 08, 2017

Set the timeout for TCP connections to be 1 lease period to ensure
that we don't lose our lease due to a faulty TCP connection.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

26ae102f

SUNRPC: Allow changing of the TCP timeout parameters on the fly · 7196dbb0

Trond Myklebust authored Feb 08, 2017

When the NFSv4 server tells us the lease period, we usually want
to adjust down the timeout parameters on the TCP connection to
ensure that we don't miss lease renewals due to a faulty connection.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7196dbb0
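
"Adjusting down" in commit 7196dbb0 can be read as clamping a timeout to the lease period. The helper below is a hypothetical illustration of that reading; it is not the SUNRPC interface, and the real code may adjust more than one parameter.

```c
/*
 * Hypothetical helper: never raise a timeout, only lower it so that it
 * does not exceed the lease period reported by the server.
 */
static unsigned long sim_adjust_timeout(unsigned long current_secs,
					unsigned long lease_secs)
{
	return lease_secs < current_secs ? lease_secs : current_secs;
}
```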

SUNRPC: Refactor TCP socket timeout code into a helper function · 8d1b8c62

Trond Myklebust authored Feb 08, 2017

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

8d1b8c62

SUNRPC: Remove unused function rpc_get_timeout() · d23bb113

Trond Myklebust authored Feb 08, 2017

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

d23bb113

08 Feb, 2017 10 commits

NFSv4: Fix memory and state leak in _nfs4_open_and_get_state · a974deee

Trond Myklebust authored Feb 08, 2017

If we exit because the file access check failed, we currently
leak the struct nfs4_state. We need to attach it to the
open context before returning.

Fixes: 3efb9722 ("NFSv4: Refactor _nfs4_open_and_get_state..")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

a974deee
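
The leak fixed by commit a974deee follows a common pattern: a reference is taken, an error path returns early, and nothing owns the reference any more. A minimal model, with hypothetical struct and function names, attaches the state to its owner before any early return:

```c
#include <errno.h>
#include <stddef.h>

struct sim_state {
	int refs;
};

struct sim_open_ctx {
	struct sim_state *state;
};

/* Attach the state first, so that every exit path leaves it owned by ctx. */
static int sim_open_and_get_state(struct sim_open_ctx *ctx,
				  struct sim_state *st, int access_ok)
{
	st->refs++;		/* reference obtained by the open */
	ctx->state = st;	/* owner recorded before any error return */
	if (!access_ok)
		return -EACCES;
	return 0;
}

/* Releasing the context drops whatever state is attached to it. */
static void sim_ctx_release(struct sim_open_ctx *ctx)
{
	if (ctx->state) {
		ctx->state->refs--;
		ctx->state = NULL;
	}
}
```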

sunrpc: use simple_read_from_buffer for reading cache flush · 8ccc8691

Kinglong Mee authored Feb 07, 2017

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

8ccc8691

sunrpc: record rpc client pointer in seq->private directly · 3f373e81

Kinglong Mee authored Feb 07, 2017

The pos field in rpc_clnt_iter is useless; drop it and record clnt in
seq->private.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

3f373e81

sunrpc: update the comments of sunrpc proc path · 6489a8f4

Kinglong Mee authored Feb 07, 2017

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

6489a8f4

sunrpc: remove dead codes of cr_magic in rpc_cred · af4926e5

Kinglong Mee authored Feb 07, 2017

No code uses cr_magic anywhere.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

af4926e5

sunrpc: rename NFS_NGROUPS to UNX_NGROUPS for auth unix · 5786461b

Kinglong Mee authored Feb 07, 2017

NFS_NGROUPS has been moved to sunrpc; rename it to UNX_NGROUPS.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

5786461b

sunrpc/nfs: cleanup procfs/pipefs entry in cache_detail · 863d7d9c

Kinglong Mee authored Feb 07, 2017

Recording the flush/channel/content entries is useless; remove them.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

863d7d9c

sunrpc: error out if register_shrinker fail · 2864486b

Kinglong Mee authored Feb 07, 2017

register_shrinker() may return an error when registration fails; error
out in that case.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

2864486b
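
The pattern behind commit 2864486b is to check the return value of a registration call and undo earlier setup on failure. In the sketch below the registration function is a fake stand-in with a made-up signature, used only so the example compiles outside the kernel.

```c
#include <errno.h>

/* Fake stand-in for a registration call that can fail. */
static int sim_register_shrinker(int should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

/*
 * Hypothetical init routine: set something up, register, and unwind the
 * setup if registration reports an error.
 */
static int sim_init(int shrinker_fails, int *cache_created)
{
	int err;

	*cache_created = 1;			/* earlier setup step */
	err = sim_register_shrinker(shrinker_fails);
	if (err) {
		*cache_created = 0;		/* undo the earlier step */
		return err;			/* propagate the failure */
	}
	return 0;
}
```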

nfs: no PG_private waiters remain, remove waker · 600424e3

Nicholas Piggin authored Jan 04, 2017

Since commit 4f52b6bb ("NFS: Don't call COMMIT in ->releasepage()"),
no tasks wait on PagePrivate, so the wake introduced in commit 95905446
("NFS: avoid deadlocks with loop-back mounted NFS filesystems.") can
be removed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

600424e3

NFS: nfs_rename() handle -ERESTARTSYS dentry left behind · 920b4530

Benjamin Coddington authored Feb 01, 2017

An interrupted rename will leave the old dentry behind if the rename
succeeds. Fix this by moving the final local work of the rename to
rpc_call_done so that the results of the RENAME can always be handled,
even if the original process has already returned with -ERESTARTSYS.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

920b4530

30 Jan, 2017 18 commits

NFSv4: Fix warning for using 0 as NULL · 68e33bd6

Wei Yongjun authored Jan 12, 2017

Fixes the following sparse warning:

fs/nfs/nfs4state.c:862:60: warning: Using plain integer as NULL pointer
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

68e33bd6
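
The kind of change that silences this sparse warning is replacing a literal 0 with NULL wherever a pointer is expected. A small illustration with hypothetical names, not the code at fs/nfs/nfs4state.c:862:

```c
#include <stddef.h>

struct sim_owner {
	int id;
};

/* Linear lookup that returns a pointer, or NULL when nothing matches. */
static struct sim_owner *sim_find_owner(struct sim_owner *table, size_t n,
					int id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (table[i].id == id)
			return &table[i];
	return NULL;	/* not "return 0": sparse flags a plain integer here */
}
```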

pNFS/flexfiles: Make local symbol layoutreturn_ops static · 2e54b9b1

Wei Yongjun authored Jan 12, 2017

Fixes the following sparse warning:

fs/nfs/flexfilelayout/flexfilelayout.c:2114:34: warning:
 symbol 'layoutreturn_ops' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

2e54b9b1

NFS: Return the comparison result directly in nfs41_match_stateid() · 045c5519

Anna Schumaker authored Jan 11, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

045c5519

NFS: Clean up nfs41_same_server_scope() · 49ad0145

Anna Schumaker authored Jan 11, 2017

The function is cleaner this way, since we can use the result of
memcmp() directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

49ad0145
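
Using the result of memcmp() directly, as commit 49ad0145 describes, looks roughly like the sketch below. The struct layout and names are assumptions made for the example and do not mirror the NFS server-scope structure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scope record: a length plus an opaque identifier. */
struct sim_scope {
	size_t sz;
	char data[64];
};

/* Return the comparison itself instead of branching on it. */
static bool sim_same_scope(const struct sim_scope *a,
			   const struct sim_scope *b)
{
	if (a->sz != b->sz)
		return false;
	return memcmp(a->data, b->data, a->sz) == 0;
}
```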

NFS: No need to set and return status in nfs41_lock_expired() · 81b68de4

Anna Schumaker authored Jan 11, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

81b68de4

NFS: Remove unnecessary goto in nfs4_lookup_root_sec() · 9df1336c

Anna Schumaker authored Jan 11, 2017

Once again, it's easier and cleaner just to return the error directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

9df1336c

NFS: Remove nfs4_recover_expired_lease() · 334f87dd

Anna Schumaker authored Jan 11, 2017

This function doesn't add much, since all it does is access the server's
nfs_client variable.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

334f87dd

NFS: Remove an extra if in _nfs4_recover_proc_open() · d7e98258

Anna Schumaker authored Jan 11, 2017

It's simpler just to return the status unconditionally.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

d7e98258

NFS: Return errors directly in _nfs4_opendata_reclaim_to_nfs4_state() · 37a8484a

Anna Schumaker authored Jan 11, 2017

There is no need for a goto just to return an error code without any
cleanup. Returning the error directly helps to clean up the code.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

37a8484a
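
The style that commit 37a8484a moves toward, returning an error at the point of failure when there is nothing to clean up, can be shown with a hypothetical function. The name, parameters, and error values here are illustrative only.

```c
#include <errno.h>

/*
 * No resources are held at either failure point, so each error is
 * returned directly instead of jumping to a shared exit label.
 */
static int sim_reclaim(int have_state, int rpc_status)
{
	if (!have_state)
		return -EAGAIN;
	if (rpc_status)
		return rpc_status;
	return 0;
}
```

A goto-based exit remains the better fit when the function has to release something on the way out; the direct return applies only when it does not.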

NFS: Remove nfs4_wait_for_completion_rpc_task() · 820bf85c

Anna Schumaker authored Jan 11, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

820bf85c

NFS: Clean up _nfs4_is_integrity_protected() · eeea5361

Anna Schumaker authored Jan 11, 2017

We can cut out the if statement and return the results of the comparison
directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

eeea5361

NFS: Fix inconsistent indentation in nfs4proc.c · d9b67e1e

Anna Schumaker authored Jan 11, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

d9b67e1e

NFS: Make trace_nfs4_setup_sequence() available to NFS v4.0 · ad05cc0f

Anna Schumaker authored Jan 11, 2017

This tracepoint displays information about the slot that was chosen for
the RPC, in addition to session information.  This could be useful
information for debugging, and we can set the session id hash to 0 to
indicate that there is no session.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

ad05cc0f

NFS: Merge the remaining setup_sequence functions · 3d35808b

Anna Schumaker authored Jan 11, 2017

This creates a single place for all the work to happen, using the
presence of a session to determine if extra values need to be set.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

3d35808b

NFS: Check if the slot table is draining from nfs4_setup_sequence() · 76ee0354

Anna Schumaker authored Jan 10, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

76ee0354

NFS: Handle setup sequence task rescheduling in a single place · 0dcee8bb

Anna Schumaker authored Jan 10, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

0dcee8bb

NFS: Lock the slot table from a single place during setup sequence · 6994cdd7

Anna Schumaker authored Jan 10, 2017

Rather than implementing this twice for NFS v4.0 and v4.1.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

6994cdd7

NFS: Move slot-already-allocated check into nfs_setup_sequence() · 9dd9107f

Anna Schumaker authored Jan 10, 2017

This puts the check in a single place, rather than needing to implement
it twice for v4.0 and v4.1.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

9dd9107f