- 13 Jul, 2019 4 commits
-
-
Donald Buczek authored
Compile nfs4_proc_get_lease_time, enc_get_lease_time and dec_get_lease_time for nfs4.0. Use nfs4_sequence_done instead of nfs41_sequence_done in nfs4_proc_get_lease_time, Signed-off-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Donald Buczek authored
The debug message of decode_attr_lease_time incorrectly says "file size". Fix it to "lease time". Signed-off-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Markus Elfring authored
Some strings should be put into a sequence. Thus use the corresponding function “seq_puts”. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Markus Elfring authored
A single character (line break) should be put into a sequence. Thus use the corresponding function “seq_putc”. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
- 12 Jul, 2019 3 commits
-
-
Max Kellermann authored
This reverts commit be4c2d47. That commit caused a severe memory leak in nfs_readdir_make_qstr(). When listing a directory with more than 100 files (this is how many struct nfs_cache_array_entry elements fit in one 4kB page), all allocated file name strings past those 100 leak. The root of the leakage is that those string pointers are managed in pages which are never linked into the page cache. fs/nfs/dir.c puts pages into the page cache by calling read_cache_page(); the callback function nfs_readdir_filler() will then fill the given page struct which was passed to it, which is already linked in the page cache (by do_read_cache_page() calling add_to_page_cache_lru()). Commit be4c2d47 added another (local) array of allocated pages, to be filled with more data, instead of discarding excess items received from the NFS server. Those additional pages can be used by the next nfs_readdir_filler() call (from within the same nfs_readdir() call). The leak happens when some of those additional pages are never used (copied to the page cache using copy_highpage()). The pages will be freed by nfs_readdir_free_pages(), but their contents will not. The commit did not invoke nfs_readdir_clear_array() (and doing so would have been dangerous, because it did not track which of those pages were already copied to the page cache, risking double free bugs). How to reproduce the leak: - Use a kernel with CONFIG_SLUB_DEBUG_ON. - Create a directory on a NFS mount with more than 100 files with names long enough to use the "kmalloc-32" slab (so we can easily look up the allocation counts): for i in `seq 110`; do touch ${i}_0123456789abcdef; done - Drop all caches: echo 3 >/proc/sys/vm/drop_caches - Check the allocation counter: grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls 30564391 nfs_readdir_add_to_array+0x73/0xd0 age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1 - Request a directory listing and check the allocation counters again: ls [...] grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls 30564511 nfs_readdir_add_to_array+0x73/0xd0 age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1 There are now 120 new allocations. - Drop all caches and check the counters again: echo 3 >/proc/sys/vm/drop_caches grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls 30564401 nfs_readdir_add_to_array+0x73/0xd0 age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1 110 allocations are gone, but 10 have leaked and will never be freed. Unhelpfully, those allocations are explicitly excluded from KMEMLEAK, that's why my initial attempts with KMEMLEAK were not successful: /* * Avoid a kmemleak false positive. The pointer to the name is stored * in a page cache page which kmemleak does not scan. */ kmemleak_not_leak(string->name); It would be possible to solve this bug without reverting the whole commit: - keep track of which pages were not used, and call nfs_readdir_clear_array() on them, or - manually link those pages into the page cache But for now I have decided to just revert the commit, because the real fix would require complex considerations, risking more dangerous (crash) bugs, which may seem unsuitable for the stable branches. Signed-off-by: Max Kellermann <mk@cm4all.com> Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Ensure that we do the required accounting for the round robin queue when the caller to rpc_init_task() has passed in a transport to be used. Reported-by: Olga Kornievskaia <aglo@umich.edu> Reported-by: Neil Brown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
git://git.linux-nfs.org/projects/anna/linux-nfsTrond Myklebust authored
NFSoRDMA client updates for 5.3 New features: - Add a way to place MRs back on the free list - Reduce context switching - Add new trace events Bugfixes and cleanups: - Fix a BUG when tracing is enabled with NFSv4.1 - Fix a use-after-free in rpcrdma_post_recvs - Replace use of xdr_stream_pos in rpcrdma_marshal_req - Fix occasional transport deadlock - Fix show_nfs_errors macros, other tracing improvements - Remove RPCRDMA_REQ_F_PENDING and fr_state - Various simplifications and refactors
-
- 09 Jul, 2019 17 commits
-
-
Chuck Lever authored
When triggering an nfs_xdr_status trace point, record the task ID and XID of the failing RPC to better pinpoint the problem. This feels like a bit of a layering violation. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Add missing symbolic flag names and display flags variables in hexadecimal to improve observability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
For improved readability, add nfs_show_status() call-sites in the generic NFS trace points so that the symbolic status code name is displayed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
I noticed that NFS status values stopped working again. trace_print_symbols_seq() takes an unsigned long. Passing a negative errno or negative NFSERR value just confuses it, and since we're using C macros here and not static inline functions, all bets are off due to implicit type conversion. Straight-line the calling conventions so that error codes are stored in the trace record as positive values in an unsigned long field, mapped to symbolic as an unsigned long, and displayed as a negative value, to continue to enable grepping on "error=-". It's often the case that an error value that is positive is a byte count but when it's negative, it's an error (e.g. nfs4_write). Fix those cases so that the value that is eventually stored in the error field is a positive NFS status or errno, or zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Help debug NFSv4 callback failures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Adapt and apply changes that were made to the TCP socket connect code. See the following commits for details on the purpose of these changes: Commit 7196dbb0 ("SUNRPC: Allow changing of the TCP timeout parameters on the fly") Commit 3851f1cd ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout") Commit 02910177 ("SUNRPC: Fix reconnection timeouts") Some common transport code is moved to xprt.c to satisfy the code duplication police. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up. There is only one remaining function, rpcrdma_buffer_put(), that uses this field. Its caller can supply a pointer to the correct rpcrdma_buffer, enabling the removal of an 8-byte pointer field from a frequently-allocated shared data structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up. Move the "not present" case into the individual chunk encoders. This improves code organization and readability. The reason for the original organization was to optimize for the case where there there are no chunks. The optimization turned out to be inconsequential, so let's err on the side of code readability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
rb_lock is contended between rpcrdma_buffer_create, rpcrdma_buffer_put, and rpcrdma_post_recvs. Commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)") causes rpcrdma_post_recvs to take the rb_lock repeatedly when it determines more Receives are needed. Streamline this code path so it takes the lock just once in most cases to build the Receive chain that is about to be posted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up. Commit 7c8d9e7c ("xprtrdma: Move Receive posting to Receive handler") reduced the number of rpcrdma_rep_create call sites to one. After that commit, the backchannel code no longer invokes it. Therefore the free list logic added by commit d698c4a0 ("xprtrdma: Fix backchannel allocation of extra rpcrdma_reps") is no longer necessary, and in fact adds some extra overhead that we can do without. Simply post any newly created reps. They will get added back to the rb_recv_bufs list when they subsequently complete. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Eliminate a context switch in the path that handles RPC wake-ups when a Receive completion has to wait for a Send completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Since commit ba69cd12 ("xprtrdma: Remove support for FMR memory registration"), FRWR is the only supported memory registration mode. We can take advantage of the asynchronous nature of FRWR's LOCAL_INV Work Requests to get rid of the completion wait by having the LOCAL_INV completion handler take care of DMA unmapping MRs and waking the upper layer RPC waiter. This eliminates two context switches when local invalidation is necessary. As a side benefit, we will no longer need the per-xprt deferred completion work queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
When a marshal operation fails, any MRs that were already set up for that request are recycled. Recycling releases MRs and creates new ones, which is expensive. Since commit f2877623 ("xprtrdma: Chain Send to FastReg WRs") was merged, recycling FRWRs is unnecessary. This is because before that commit, frwr_map had already posted FAST_REG Work Requests, so ownership of the MRs had already been passed to the NIC and thus dealing with them had to be delayed until they completed. Since that commit, however, FAST_REG WRs are posted at the same time as the Send WR. This means that if marshaling fails, we are certain the MRs are safe to simply unmap and place back on the free list because neither the Send nor the FAST_REG WRs have been posted yet. The kernel still has ownership of the MRs at this point. This reduces the total number of MRs that the xprt has to create under heavy workloads and makes the marshaling logic less brittle. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Now that both the Send and Receive completions are handled in process context, it is safe to DMA unmap and return MRs to the free or recycle lists directly in the completion handlers. Doing this means rpcrdma_frwr no longer needs to track the state of each MR, meaning that a VALID or FLUSHED MR can no longer appear on an xprt's MR free list. Thus there is no longer a need to track the MR's registration state in rpcrdma_frwr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Commit 9590d083 ("xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they can be left in the pending requests queue while they are being processed without introducing a race between ->buf_free and the transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Under high I/O workloads, I've noticed that an RPC/RDMA transport occasionally deadlocks (IOPS goes to zero, and doesn't recover). Diagnosis shows that the sendctx queue is empty, but when sendctxs are returned to the queue, the xprt_write_space wake-up never occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy. I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented via an atomic bit. Just one of those is sufficient. Removing EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock un-reproducible. Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and is therefore removed. Unfortunately this patch does not apply cleanly to stable. If needed, someone will have to port it and test it. Fixes: 2fad6592 ("xprtrdma: Wait on empty sendctx queue") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
This is a latent bug. xdr_stream_pos works by subtracting xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not initialized by xdr_init_encode(). It works today only because all fields in rpcrdma_req::rl_stream are initialized to zero by rpcrdma_req_create, making the subtraction in xdr_stream_pos always a no-op. I found this issue via code inspection. It was introduced by commit 39f4cd9e ("xprtrdma: Harden chunk list encoding against send buffer overflow"), but the code has changed enough since then that this fix can't be automatically applied to stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 06 Jul, 2019 16 commits
-
-
Dave Wysochanski authored
Ensure last_used is updated before calling mod_timer inside xprt_schedule_autodisconnect. This avoids a possible xprt_autoclose firing immediately after a successful connect when xprt_unlock_connect calls xprt_schedule_autodisconnect with an old value of last_used. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Anna Schumaker authored
The "CONFIG_" portion is added automatically, so this was being expanded into "CONFIG_CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES" Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Benjamin Coddington authored
Don't bail out before cleaning up a new allocation if the wait for searching for a matching nfs client is interrupted. Memory leaks. Reported-by: syzbot+7fe11b49c1cc30e3fce2@syzkaller.appspotmail.com Fixes: 950a578c ("NFS: make nfs_match_client killable") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Darrick J. Wong authored
The NFS protocol doesn't support deduplication, so turn it off again. Fixes: ce96e888 ("Fix nfs4.2 return -EINVAL when do dedupe operation") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Dave Wysochanski authored
On the NFS client there is no low-impact way to determine the nfs4 lease time or whether the lease is expired, so add these to mountstats with times displayed in seconds. If the lease is not expired, display lease_expired=0. Otherwise, display lease_expired=seconds_since_expired, similar to 'age:' line in mountstats. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
Now that the VM promises never to recurse back into the filesystem layer on writeback, remove all the GFP_NOFS references etc from the generic writeback code. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
-
Trond Myklebust authored
-
Trond Myklebust authored
-
Trond Myklebust authored
Remove the following warning: net/sunrpc/debugfs.c:13: warning: cannot understand function prototype: 'struct dentry *topdir; Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Trond Myklebust authored
-
NeilBrown authored
Now that a client can have multiple xprts, we need to add them all to debugs. The first one is still "xprt" Subsequent xprts are "xprt1", "xprt2", etc. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Dave Wysochanski authored
We often see various error conditions with NFS4.x that show up with a very high operation count all completing with tk_status < 0 in a short period of time. Add a count to rpc_iostats to record on a per-op basis the ops that complete in this manner, which will enable lower overhead diagnostics. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
NeilBrown authored
Now that a client can have multiple xprts, we need to report the statistics for all of them. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Dave Wysochanski authored
Update the printk specifiers inside _print_rpc_iostats to avoid a checkpatch warning. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Dave Wysochanski authored
For diagnostic purposes, it would be useful to have an rpc_iostats metric of RPCs completing with tk_status < 0. Unfortunately, tk_status is reset inside the rpc_call_done functions for each operation, and the call to tally the per-op metrics comes after rpc_call_done. Refactor the call to rpc_count_iostat earlier in rpc_exit_task so we can count these RPCs completing in error. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-