- 13 Jul, 2017 40 commits
-
-
Chuck Lever authored
When ib_post_send() fails, all LOCAL_INV WRs past @bad_wr have to be examined, and the MRs reset by hand. I'm not sure how the existing code can work by comparing R_keys. Restructure the logic so that instead it walks the chain of WRs, starting from the first bad one. Make sure to wait for completion if at least one WR was actually posted. Otherwise, if the ib_post_send fails, we can end up DMA-unmapping the MR while LOCAL_INV operations are in flight. Commit 7a89f9c6 ("xprtrdma: Honor ->send_request API contract") added the rdma_disconnect() call site. The disconnect actually causes more problems than it solves, and SQ overruns happen only as a result of software bugs. So remove it. Fixes: d7a21c1b ("xprtrdma: Reset MRs in frwr_op_unmap_sync()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: I'm about to use the rl_free field for purposes other than a free list. So use a more generic name. This is a refactoring change only. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
There are rare cases where an rpcrdma_req can be re-used (via rpcrdma_buffer_put) while the RPC reply handler is still running. This is due to a signal firing at just the wrong instant. Since commit 9d6b0409 ("xprtrdma: Place registered MWs on a per-req list"), rpcrdma_mws are self-contained; ie., they fully describe an MR and scatterlist, and no part of that information is stored in struct rpcrdma_req. As part of closing the above race window, pass only the req's list of registered MRs to ro_unmap_sync, rather than the rpcrdma_req itself. Some extra transport header sanity checking is removed. Since the client depends on its own recollection of what memory had been registered, there doesn't seem to be a way to abuse this change. And, the check was not terribly effective. If the client had sent Read chunks, the "list_empty" test is negative in both of the removed cases, which are actually looking for Write or Reply chunks. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
There are rare cases where an rpcrdma_req and its matched rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC reply handler is still using that req. This is typically due to a signal firing at just the wrong instant. As part of closing this race window, avoid using the wrong rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as invalidated while we are sure the rep is still OK to use. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Callers assume the ro_unmap_sync and ro_unmap_safe methods empty the list of registered MRs. Ensure that all paths through fmr_op_unmap_sync() remove MWs from that list. Fixes: 9d6b0409 ("xprtrdma: Place registered MWs on a ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
NeilBrown authored
If an NFS server returns a filehandle that we have previously seen, and reports a different type, then nfs_refresh_inode() will log a warning and return an error. nfs_fhget() does not check for this error and may return an inode with a different type than the one that the server reported. This is likely to cause confusion, and is one way that ->open_context() could return a directory inode as discussed in the previous patch. So if nfs_refresh_inode() returns and error, return that error from nfs_fhget() to avoid the confusion propagating. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
NeilBrown authored
A confused server could return a filehandle for an NFSv4 OPEN request, which it previously returned for a directory. So the inode returned by ->open_context() in nfs_atomic_open() could conceivably be a directory inode. This has particular implications for the call to nfs_file_set_open_context() in nfs_finish_open(). If that is called on a directory inode, then the nfs_open_context that gets stored in the filp->private_data will be linked to nfs_inode->open_files. When the directory is closed, nfs_closedir() will (ultimately) free the ->private_data, but not unlink it from nfs_inode->open_files (because it doesn't expect an nfs_open_context there). Subsequently the memory could get used for something else and eventually if the ->open_files list is walked, the walker will fall off the end and crash. So: change nfs_finish_open() to only call nfs_file_set_open_context() for regular-file inodes. This failure mode has been seen in a production setting (unknown NFS server implementation). The kernel was v3.0 and the specific sequence seen would not affect more recent kernels, but I think a risk is still present, and caution is wise. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
NeilBrown authored
Since commit bafc9b75 ("vfs: More precise tests in d_invalidate") in v3.18, a return of '0' from ->d_revalidate() will cause the dentry to be invalidated even if it has filesystems mounted on or it or on a descendant. The mounted filesystem is unmounted. This means we need to be careful not to return 0 unless the directory referred to truly is invalid. So -ESTALE or -ENOENT should invalidate the directory. Other errors such a -EPERM or -ERESTARTSYS should be returned from ->d_revalidate() so they are propagated to the caller. A particular problem can be demonstrated by: 1/ mount an NFS filesystem using NFSv3 on /mnt 2/ mount any other filesystem on /mnt/foo 3/ ls /mnt/foo 4/ turn off network, or otherwise make the server unable to respond 5/ ls /mnt/foo & 6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack 7/ kill -9 $! # this results in -ERESTARTSYS being returned 8/ observe that /mnt/foo has been unmounted. This patch changes nfs_lookup_revalidate() to only treat -ESTALE from nfs_lookup_verify_inode() and -ESTALE or -ENOENT from ->lookup() as indicating an invalid inode. Other errors are returned. Also nfs_check_inode_attributes() is changed to return -ESTALE rather than -EIO. This is consistent with the error returned in similar circumstances from nfs_update_inode(). As this bug allows any user to unmount a filesystem mounted on an NFS filesystem, this fix is suitable for stable kernels. Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate") Cc: stable@vger.kernel.org (v3.18+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Olga Kornievskaia authored
Upon receiving a stateid error such as BAD_STATEID, the client should retry the operation against the MDS before deciding to do stateid recovery. Previously, the code would initiate state recovery and it could lead to a race in a state manager that could chose an incorrect recovery method which would lead to the EIO failure for the application. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Olga Kornievskaia authored
Commit fabbbee0 "PNFS fix fallback to MDS if got error on commit to DS" moved the pnfs_set_lo_fail() to unhandled errors which was not correct and lead to a kernel oops on umount. Instead, fix the original EACCESS on commit to DS error by getting the new layout and re-doing the IO. Fixes: fabbbee0 ("PNFS fix fallback to MDS if got error on commit to DS") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Cc: stable@vger.kernel.org # v4.12 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Dan Carpenter authored
Static checkers have gotten clever enough to complain that "id_long" is uninitialized on the failure path. It's harmless, but simple to fix. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Tuo Chen Peng authored
nfs_show_stats() was incorrectly reading statistics for bytes when printing that for fsc. It caused files like /proc/self/mountstats to report incorrect fsc statistics for NFS mounts. Signed-off-by: Tuo Chen Peng <tpeng@nvidia.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
Commit 8ef9b0b9 open-coded nfs_pgarray_set(), and left out the initialization of the nfs_page_array's npages. This mistake didn't show up until testing with block layouts, and there shows that all pNFS reads return -EIO. Fixes: 8ef9b0b9 ("NFS: move nfs_pgarray_set() to open code") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org # 4.12 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
Now that the writes will schedule a commit on their own, we don't need nfs_write_inode() to schedule one if there are outstanding writes, and we're being called in non-blocking mode. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
If the page cache is being flushed, then we want to ensure that we do start a commit once the pages are done being flushed. If we just wait until all I/O is done to that file, we can end up livelocking until the balance_dirty_pages() mechanism puts its foot down and forces I/O to stop. So instead we do more or less the same thing that O_DIRECT does, and set up a counter to tell us when the flush is done, Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
Remove the 'layout_private' fields that were only used by the pNFS OSD layout driver. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
In xprt_alloc_slot(), the spin lock is only needed to provide atomicity between the atomic_add_unless() failure and the call to xprt_add_backlog(). We do not actually need to hold it across the memory allocation itself. By dropping the lock, we can use a more resilient GFP_NOFS allocation, just as we now do in the rest of the RPC client code. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
An interrupted rename will leave the old dentry behind if the rename succeeds. Fix this by forcing a lookup the next time through ->d_revalidate. A previous attempt at solving this problem took the approach to complete the work of the rename asynchronously, however that approach was wrong since it would allow the d_move() to occur after the directory's i_mutex had been dropped by the original process. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Benjamin Coddington authored
NFS uses some int, and unsigned int :1, and bool as flags in structs and args. Assert the preference for uniformly replacing these with the bool type. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Anna Schumaker authored
The current code worked okay for getdents(), but getdents64() expects the d_type field to get filled out properly in the stat structure. Setting this field fixes xfstests generic/401. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Christoph Hellwig authored
nfsd4_ops contains function pointers, and marking it as constant avoids it being able to be used as an attach vector for code injections. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
struct svc_procinfo contains function pointers, and marking it as constant avoids it being able to be used as an attach vector for code injections. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
pc_count is the only writeable memeber of struct svc_procinfo, which is a good candidate to be const-ified as it contains function pointers. This patch moves it into out out struct svc_procinfo, and into a separate writable array that is pointed to by struct svc_version. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Pass union nfsd4_op_u to the op_func callbacks instead of using unsafe function pointer casts. It also adds two missing structures to struct nfsd4_op.u to facilitate this. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Except for a lot of unnecessary casts this typedef only has one user, so remove the casts and expand it in struct nfsd4_operation. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Pass union nfsd4_op_u to the op_set_currentstateid callbacks instead of using unsafe function pointer casts. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Given the args union in struct nfsd4_op a name, and pass it to the op_set_currentstateid callbacks instead of using unsafe function pointer casts. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Remove the now unused typedef. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Drop the resp argument as it can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Drop the argp argument as it can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Drop the p and resp arguments as they are always NULL or can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Drop the argp and resp arguments as they can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to svc_procfunc as well as the svc_procfunc typedef itself. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
struct rpc_procinfo contains function pointers, and marking it as constant avoids it being able to be used as an attach vector for code injections. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
p_count is the only writeable memeber of struct rpc_procinfo, which is a good candidate to be const-ified as it contains function pointers. This patch moves it into out out struct rpc_procinfo, and into a separate writable array that is pointed to by struct rpc_version and indexed by p_statidx. Signed-off-by: Christoph Hellwig <hch@lst.de>
-