Commits · 54551d85ad48b5b5f5735b9b76c147096828b626 · Kirill Smelkov / linux

26 Apr, 2017 1 commit

NFS: Add a few more fatal I/O errors to nfs_error_is_fatal() · 54551d85

Trond Myklebust authored Apr 26, 2017

EACCES, EDQUOT, EFBIG and ESTALE are all fatal errors as far as NFS
I/O is concerned. They need to be reported back to the application.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

54551d85

25 Apr, 2017 18 commits

Merge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma · 35a24421

Trond Myklebust authored Apr 25, 2017

NFS: NFS over RDMA Client Side Changes

New Features:
- Break RDMA connections after a connection timeout
- Support for unloading the underlying device driver

Bugfixes and cleanups:
- Mark the receive workqueue as "read-mostly"
- Silence warnings caused by ENOBUFS
- Update a comment in xdr_init_decode_pages()
- Remove rpcrdma_buffer->rb_pool.

35a24421

NFSv3: nfs3_nlm_alloc_call should be declared static · bb3393d5
Trond Myklebust authored Apr 25, 2017
```
Fix compiler warnings.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
```
bb3393d5

xprtrdma: Remove rpcrdma_buffer::rb_pool · 2be1fce9

Chuck Lever authored Apr 11, 2017

Since commit 1e465fd4 ("xprtrdma: Replace send and receive
arrays"), this field is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

2be1fce9

sunrpc: Fix xdr_init_decode_pages() documenting comment · 7ecce75f

Chuck Lever authored Apr 11, 2017

Clean up.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7ecce75f

xprtrdma: Squelch ENOBUFS warnings · 0031e47c

Chuck Lever authored Apr 11, 2017

When ro_map is out of buffers, that's not a permanent error, so
don't report a problem.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

0031e47c

xprtrdma: Annotate receive workqueue · 7d7fa9b5

Chuck Lever authored Apr 11, 2017

Micro-optimize the receive workqueue by marking it's anchor "read-
mostly."
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7d7fa9b5

xprtrdma: Revert commit · 56a6bd15

Chuck Lever authored Apr 11, 2017

Device removal is now adequately supported. Pinning the underlying
device driver to prevent removal while an NFS mount is active is no
longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

56a6bd15

xprtrdma: Restore transport after device removal · a9b0e381

Chuck Lever authored Apr 11, 2017

After a device removal, enable the transport connect worker to
restore normal operation if there is another device with
connectivity to the server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

a9b0e381

xprtrdma: Refactor rpcrdma_ep_connect · 1890896b

Chuck Lever authored Apr 11, 2017

I'm about to add another arm to

    if (ep->rep_connected != 0)

It will be cleaner to use a switch statement here. We'll be looking
for a couple of specific errnos, or "anything else," basically to
sort out the difference between a normal reconnect and recovery from
device removal.

This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

1890896b

xprtrdma: Support unplugging an HCA from under an NFS mount · bebd0318

Chuck Lever authored Apr 11, 2017

The device driver for the underlying physical device associated
with an RPC-over-RDMA transport can be removed while RPC-over-RDMA
transports are still in use (ie, while NFS filesystems are still
mounted and active). The IB core performs a connection event upcall
to request that consumers free all RDMA resources associated with
a transport.

There may be pending RPCs when this occurs. Care must be taken to
release associated resources without leaving references that can
trigger a subsequent crash if a signal or soft timeout occurs. We
rely on the caller of the transport's ->close method to ensure that
the previous RPC task has invoked xprt_release but the transport
remains write-locked.

A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close
is invoked, it destroys the transport's H/W resources, then wakes
the upcall, which completes and allows the core driver unload to
continue.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

bebd0318

xprtrdma: Use same device when mapping or syncing DMA buffers · 91a10c52

Chuck Lever authored Apr 11, 2017

When the underlying device driver is reloaded, ia->ri_device will be
replaced. All cached copies of that device pointer have to be
updated as well.

Commit 54cbd6b0 ("xprtrdma: Delay DMA mapping Send and Receive
buffers") added the rg_device field to each regbuf. As part of
handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on
all regbufs for a transport.

Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after
the driver has been reloaded should reinitialize rg_device correctly
for every case except rpcrdma_wc_receive, which still uses
rpcrdma_rep::rr_device.

Ensure the same device that was used to map a Receive buffer is also
used to sync it in rpcrdma_wc_receive by using rg_device there
instead of rr_device.

This is the only use of rr_device, so it can be removed.

The use of regbufs in the send path is also updated, for
completeness.

Fixes: 54cbd6b0 ("xprtrdma: Delay DMA mapping Send and ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

91a10c52

xprtrdma: Refactor rpcrdma_ia_open() · fff09594

Chuck Lever authored Apr 11, 2017

In order to unload a device driver and reload it, xprtrdma will need
to close a transport's interface adapter, and then call
rpcrdma_ia_open again, possibly finding a different interface
adapter.

Make rpcrdma_ia_open safe to call on the same transport multiple
times.

This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

fff09594

xprtrdma: Detect unreachable NFS/RDMA servers more reliably · 33849792

Chuck Lever authored Apr 11, 2017

Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.

Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:

	/* Must suppress retransmit to maintain credits */
	if (req->rl_connect_cookie == xprt->connect_cookie)
		goto drop_connection;
	req->rl_connect_cookie = xprt->connect_cookie;
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

33849792

sunrpc: Export xprt_force_disconnect() · e2a4f4fb

Chuck Lever authored Apr 11, 2017

xprt_force_disconnect() is already invoked from the socket
transport. I want to invoke xprt_force_disconnect() from the
RPC-over-RDMA transport, which is a separate module from sunrpc.ko.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e2a4f4fb

xprtrdma: Cancel refresh worker during buffer shutdown · 9378b274

Chuck Lever authored Apr 11, 2017

Trying to create MRs while the transport is being torn down can
cause a crash.

Fixes: e2ac236c ("xprtrdma: Allocate MRs on demand")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

9378b274

NFS: Don't write back further requests if there is a pending write error · a6598813

Trond Myklebust authored Apr 25, 2017

If the server has already returned a fatal write error that the user
has not yet received on this file, then don't write back the other pages.
Instead, act as if they have been sent, and have returned with the same
error.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

a6598813

pNFS: Fix use after free issues in pnfs_do_read() · 6aeafd05

Trond Myklebust authored Apr 25, 2017

The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr
has been consumed, and so we should not be testing hdr->task.tk_status.
If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and
free hdr.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

6aeafd05

pNFS: Ensure we check layout segment validity in the pg_init() callback · b3230e80

Trond Myklebust authored Apr 25, 2017

If we have a layout segment cached in pgio->pg_lseg, we should check it
for validity before reusing it in a new RPC request. Otherwise, if we
recoalesce, we can end up looping forever.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

b3230e80

21 Apr, 2017 6 commits

NFS: Always wait for I/O completion before unlock · f30cb757

Benjamin Coddington authored Apr 11, 2017

NFS attempts to wait for read and write completion before unlocking in
order to ensure that the data returned was protected by the lock.  When
this waiting is interrupted by a signal, the unlock may be skipped, and
messages similar to the following are seen in the kernel ring buffer:

[20.167876] Leaked locks on dev=0x0:0x2b ino=0x8dd4c3:
[20.168286] POSIX: fl_owner=ffff880078b06940 fl_flags=0x1 fl_type=0x0 fl_pid=20183
[20.168727] POSIX: fl_owner=ffff880078b06680 fl_flags=0x1 fl_type=0x0 fl_pid=20185

For NFSv3, the missing unlock will cause the server to refuse conflicting
locks indefinitely.  For NFSv4, the leftover lock will be removed by the
server after the lease timeout.

This patch fixes this issue by skipping the usual wait in
nfs_iocounter_wait if the FL_CLOSE flag is set when signaled.  Instead, the
wait happens in the unlock RPC task on the NFS UOC rpc_waitqueue.

For NFSv3, use lockd's new nlmclnt_operations along with
nfs_async_iocounter_wait to defer NLM's unlock task until the lock
context's iocounter reaches zero.

For NFSv4, call nfs_async_iocounter_wait() directly from unlock's
current rpc_call_prepare.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f30cb757

lockd: Introduce nlmclnt_operations · b1ece737

Benjamin Coddington authored Apr 11, 2017

NFS would enjoy the ability to modify the behavior of the NLM client's
unlock RPC task in order to delay the transmission of the unlock until IO
that was submitted under that lock has completed.  This ability can ensure
that the NLM client will always complete the transmission of an unlock even
if the waiting caller has been interrupted with fatal signal.

For this purpose, a pointer to a struct nlmclnt_operations can be assigned
in a nfs_module's nfs_rpc_ops that will install those nlmclnt_operations on
the nlm_host.  The struct nlmclnt_operations defines three callback
operations that will be used in a following patch:

nlmclnt_alloc_call - used to call back after a successful allocation of
	a struct nlm_rqst in nlmclnt_proc().

nlmclnt_unlock_prepare - used to call back during NLM unlock's
	rpc_call_prepare.  The NLM client defers calling rpc_call_start()
	until this callback returns false.

nlmclnt_release_call - used to call back when the NLM client's struct
	nlm_rqst is freed.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

b1ece737

NFS: Add an iocounter wait function for async RPC tasks · 7d6ddf88

Benjamin Coddington authored Apr 11, 2017

By sleeping on a new NFS Unlock-On-Close waitqueue, rpc tasks may wait for
a lock context's iocounter to reach zero. The rpc waitqueue is only woken
when the open_context has the NFS_CONTEXT_UNLOCK flag set in order to
mitigate spurious wake-ups for any iocounter reaching zero.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

7d6ddf88

locks: Set FL_CLOSE when removing flock locks on close() · 50f2112c

Benjamin Coddington authored Apr 11, 2017

Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following
patch.

Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

50f2112c

NFS: Move the flock open mode check into nfs_flock() · e1293727

Benjamin Coddington authored Apr 11, 2017

We only need to check lock exclusive/shared types against open mode when
flock() is used on NFS, so move it into the flock-specific path instead of
checking it for all locks.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

e1293727

NFS4: remove a redundant lock range check · 12a16d15

Benjamin Coddington authored Apr 11, 2017

flock64_to_posix_lock() is already doing this check
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

12a16d15

20 Apr, 2017 15 commits

pNFS: unexport nfs4_pnfs_v3_ds_connect_unload · 675e508f

Trond Myklebust authored Apr 20, 2017

It is not used outside the NFSv4 module.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

675e508f

pNFS: Unexport pnfs_put_lseg_locked and _pnfs_return_layout · b9419688
Trond Myklebust authored Apr 20, 2017
```
They are not used outside the NFSv4 module.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
```
b9419688

pNFS: Remove unused layout driver callbacks · 73504740

Trond Myklebust authored Apr 20, 2017

encode_layoutreturn and encode_layoutcommit are now unused. Let's
remove them.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

73504740

nfs: remove the objlayout driver · 6d22323b

Christoph Hellwig authored Apr 12, 2017

The objlayout code has been in the tree, but it's been unmaintained and
no server product for it actually ever shipped.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

6d22323b

pNFS/flexfiles: Check the result of nfs4_pnfs_ds_connect · 260f32ad

Trond Myklebust authored Apr 20, 2017

The check in nfs4_ff_layout_prepare_ds() seems to be missing.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: a33e4b03 ("pNFS: return status from nfs4_pnfs_ds_connect")
Cc: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # v4.11

260f32ad

NFSv4: Fix a hang in OPEN related to server reboot · 56e0d71e

Trond Myklebust authored Apr 15, 2017

If the server fails to return the attributes as part of an OPEN
reply, and then reboots, we can end up hanging. The reason is that
the client attempts to send a GETATTR in order to pick up the
missing OPEN call, but fails to release the slot first, causing
reboot recovery to deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 2e80dbe7 ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+

56e0d71e

NFS: move rw_mode to nfs_pageio_header · fbe77c30

Benjamin Coddington authored Apr 19, 2017

Let's try to have it in a cacheline in nfs4_proc_pgio_rpc_prepare().
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

fbe77c30

NFS: move nfs_pgarray_set() to open code · 8ef9b0b9

Benjamin Coddington authored Apr 19, 2017

Since commit 00bfa30a ("NFS: Create a common pgio_alloc and
pgio_release function"), nfs_pgarray_set() has only a single caller.  Let's
open code it.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

8ef9b0b9

NFS: Use GFP_NOIO for two allocations in writeback · ae97aa52

Benjamin Coddington authored Apr 19, 2017

Prevent a deadlock that can occur if we wait on allocations
that try to write back our pages.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 00bfa30a ("NFS: Create a common pgio_alloc and pgio_release...")
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

ae97aa52

NFS: Fix use after free in write error path · 1f84ccdf

Fred Isaman authored Apr 14, 2017

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Fixes: 0bcbf039 ("nfs: handle request add failure properly")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

1f84ccdf

NFS: Fix missing pg_cleanup after nfs_pageio_cond_complete() · 43b7d964

Benjamin Coddington authored Apr 14, 2017

Commit a7d42ddb ("nfs: add mirroring
support to pgio layer") moved pg_cleanup out of the path when there was
non-sequental I/O that needed to be flushed.  The result is that for
layouts that have more than one layout segment per file, the pg_lseg is not
cleared, so we can end up hitting the WARN_ON_ONCE(req_start >= seg_end) in
pnfs_generic_pg_test since the pg_lseg will be pointing to that
previously-flushed layout segment.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: a7d42ddb ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

43b7d964

sunrpc: don't check for failure from mempool_alloc() · 62b2417e

NeilBrown authored Apr 10, 2017

When mempool_alloc() is allowed to sleep (GFP_NOIO allows
sleeping) it cannot fail.
So rpc_alloc_task() cannot fail, so rpc_new_task doesn't need
to test for failure.
Consequently rpc_new_task() cannot fail, so the callers
don't need to test.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

62b2417e

NFS: fix usage of mempools. · 518662e0

NeilBrown authored Apr 10, 2017

When passed GFP flags that allow sleeping (such as
GFP_NOIO), mempool_alloc() will never return NULL, it will
wait until memory is available.

This means that we don't need to handle failure, but that we
do need to ensure one thread doesn't call mempool_alloc()
twice on the one pool without queuing or freeing the first
allocation.  If multiple threads did this during times of
high memory pressure, the pool could be exhausted and a
deadlock could result.

pnfs_generic_alloc_ds_commits() attempts to allocate from
the nfs_commit_mempool while already holding an allocation
from that pool.  This is not safe.  So change
nfs_commitdata_alloc() to take a flag that indicates whether
failure is acceptable.

In pnfs_generic_alloc_ds_commits(), accept failure and
handle it as we currently do.  Else where, do not accept
failure, and do not handle it.

Even when failure is acceptable, we want to succeed if
possible.  That means both
 - using an entry from the pool if there is one
 - waiting for direct reclaim is there isn't.

We call mempool_alloc(GFP_NOWAIT) to achieve the first, then
kmem_cache_alloc(GFP_NOIO|__GFP_NORETRY) to achieve the
second.  Each of these can fail, but together they do the
best they can without blocking indefinitely.

The objects returned by kmem_cache_alloc() will still be freed
by mempool_free().  This is safe as mempool_alloc() uses
exactly the same function to allocate objects (since the mempool
was created with mempool_create_slab_pool()).  The object returned
by mempool_alloc() and kmem_cache_alloc() are indistinguishable
so mempool_free() will handle both identically, either adding to the
pool or calling kmem_cache_free().

Also, don't test for failure when allocating from
nfs_wdata_mempool.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

518662e0

NFS: Clean up nfs4_proc_get_lease_time() · f6148713

Anna Schumaker authored Apr 07, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f6148713

NFS: Clean up _nfs4_proc_exchange_id() · e917f0d1

Anna Schumaker authored Apr 07, 2017

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

e917f0d1