Commits · 4cbc8a571c24133a8b645c62188205908ef2ea83 · nexedi / linux

02 Mar, 2019 12 commits

NFS/flexfile: Simplify nfs4_ff_layout_select_ds_stateid() · 4cbc8a57

Trond Myklebust authored Feb 28, 2019

Pass in a pointer to the mirror rather than forcing another
array access.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
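
The pattern behind this and the neighbouring "Simplify" commits can be sketched as follows. This is only an illustration: the struct layouts, field names and function names are made up for the example and are not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical, simplified types; the kernel structures differ. */
struct mirror {
    int ds_version;
};

struct layout_segment {
    struct mirror **mirror_array;
    size_t mirror_count;
};

/* Before: the helper takes an index and looks the mirror up itself. */
static int ds_version_by_index(const struct layout_segment *lseg, size_t idx)
{
    return lseg->mirror_array[idx]->ds_version;
}

/* After: the caller already holds the mirror, so it passes the pointer. */
static int ds_version_of(const struct mirror *mirror)
{
    return mirror->ds_version;
}
```

A caller that has already fetched the mirror from the array avoids a second lookup by using the pointer-based form.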


NFS/flexfile: Simplify nfs4_ff_layout_ds_version() · 626d48b1

Trond Myklebust authored Feb 28, 2019

Pass in a pointer to the mirror rather than forcing another
array access.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS/flexfiles: Simplify ff_layout_get_ds_cred() · 312cd4cb

Trond Myklebust authored Feb 28, 2019

Pass in a pointer to the mirror rather than forcing another
array access.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS/flexfiles: Simplify nfs4_ff_find_or_create_ds_client() · 561d6f8a

Trond Myklebust authored Feb 28, 2019

Pass in a pointer to the mirror rather than forcing another
array access.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS/flexfiles: Simplify nfs4_ff_layout_select_ds_fh() · 749da527

Trond Myklebust authored Feb 28, 2019

Pass in a pointer to the mirror rather than having to retrieve it from
the array and then verify the resulting pointer.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS/flexfiles: Speed up read failover when DSes are down · 76c66905

Trond Myklebust authored Feb 14, 2019

If we notice that a DS may be down, we should attempt to read from the
other mirrors first before we go back to retry the dead DS.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
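
A minimal sketch of that selection order, assuming a hypothetical per-mirror "unavailable" flag (the type and the function `pick_read_mirror` are invented for illustration and do not appear in the driver):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a flexfiles mirror; not the kernel's struct. */
struct mirror {
    bool ds_unavailable;   /* set when the DS was seen to be unresponsive */
};

/*
 * Pick the mirror to read from, scanning from index `start`.
 * Pass 0 considers only mirrors not marked unavailable; pass 1 falls
 * back to the mirrors marked as down. Returns -1 if there are no mirrors.
 */
static int pick_read_mirror(const struct mirror *m, size_t count, size_t start)
{
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < count; i++) {
            size_t idx = (start + i) % count;

            if (pass == 0 && m[idx].ds_unavailable)
                continue;
            return (int)idx;
        }
    }
    return -1;
}
```

Only when every mirror is marked as down does the second pass return to a DS that was previously seen to be dead.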


NFS/flexfiles: Don't invalidate DS deviceids for being unresponsive · 17aaec81

Trond Myklebust authored Feb 26, 2019

If the DS is unresponsive, we want to just mark it as such, while
reporting the errors. If the server later returns the same deviceid
in a new layout, then we don't want to have to look it up again.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS/flexfiles: Remove bogus checks for invalid deviceids · d082d4b5

Trond Myklebust authored Feb 26, 2019

We already check the deviceids before we start the RPC call.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS/flexfiles: Avoid unnecessary layout invalidations · 0a156dd5

Trond Myklebust authored Feb 27, 2019

In ff_layout_mirror_valid() we may not want to invalidate the layout
segment despite the call to GETDEVICEINFO failing. The reason is that
a read may still be able to make progress on another mirror.

So instead we let the caller (in this case nfs4_ff_layout_prepare_ds())
decide whether or not it needs to invalidate.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS/flexfiles: refactor calls to nfs4_ff_layout_prepare_ds() · 2444ff27

Trond Myklebust authored Feb 14, 2019

While we may want to skip attempting to connect to a downed mirror
when we're deciding which mirror to select for a read, we do not
want to do so once we've committed to attempting the I/O in
ff_layout_read/write_pagelist(), or ff_layout_initiate_commit()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFSv4: Handle early exit in layoutget by returning an error · 18c0778a

Trond Myklebust authored Feb 13, 2019

If the LAYOUTGET rpc call exits early without an error, convert it to
EAGAIN.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
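
The conversion can be pictured as below. The `call_result` struct and `layoutget_status` helper are assumptions made for this sketch; the real code works on the RPC task and its layoutget data.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result of an RPC task; not a kernel structure. */
struct call_result {
    bool completed;  /* false if the task exited before the reply was handled */
    int  status;     /* 0 or a negative errno */
};

/* Map "exited early with no error" to -EAGAIN so that the caller retries. */
static int layoutget_status(const struct call_result *res)
{
    if (!res->completed && res->status == 0)
        return -EAGAIN;
    return res->status;
}
```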


NFS/flexfiles: Send LAYOUTERROR when failing over mirrored reads · f0922a6c

Trond Myklebust authored Feb 10, 2019

When a read to the preferred mirror returns an error, the flexfiles
driver records the error in the inode list and currently marks the
layout for return before failing over the attempted read to the next
mirror.
What we actually want to do is fire off a LAYOUTERROR to notify the
MDS that there is an issue with the preferred mirror, then we fail
over. Only once we've failed to read from all mirrors should we
return the layout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
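
The intended ordering can be sketched as a small simulation. Everything here is hypothetical: `mirror_status[i]` stands for the result a read from mirror `i` would give, and the counters stand in for sending LAYOUTERROR and returning the layout.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the sketch. */
struct read_stats {
    int  layouterrors_sent;  /* LAYOUTERROR notifications sent to the MDS */
    bool layout_returned;    /* set only once every mirror has failed */
};

/*
 * Try each mirror in turn. A failure is reported (LAYOUTERROR) and the
 * read fails over to the next mirror; the layout is returned only when
 * no mirror could serve the read.
 */
static int read_with_failover(const int *mirror_status, size_t nmirrors,
                              struct read_stats *st)
{
    int err = -EIO;

    for (size_t i = 0; i < nmirrors; i++) {
        err = mirror_status[i];
        if (err == 0)
            return 0;
        st->layouterrors_sent++;
    }
    st->layout_returned = true;
    return err;
}
```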


01 Mar, 2019 8 commits

NFSv4.2: Add client support for the generic 'layouterror' RPC call · 3eb86093

Trond Myklebust authored Feb 08, 2019

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated · a79f194a

Trond Myklebust authored Feb 27, 2019

If a layout segment gets invalidated while a pNFS I/O operation
is queued for transmission, then we ideally want to abort
immediately. This is particularly the case when there is a large
number of I/O related RPCs queued in the RPC layer, and the layout
segment gets invalidated due to an ENOSPC error, or an EACCES (because
the client was fenced). We may end up forced to spam the MDS with a
lot of otherwise unnecessary LAYOUTERRORs after that I/O fails.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFSv4/pnfs: Fix barriers in nfs4_mark_deviceid_unavailable() · 39a5201a

Trond Myklebust authored Feb 26, 2019

Fix the memory barriers in nfs4_mark_deviceid_unavailable() and
nfs4_test_deviceid_unavailable().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
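
As a rough userspace analogue of the ordering problem, the sketch below uses C11 atomics rather than the kernel's bit operations and barriers. The struct and function names are invented, and it assumes a single writer marks the device at a time.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device ID node; not the kernel's nfs4_deviceid_node. */
struct devid_node {
    unsigned long timestamp_unavailable;
    atomic_bool   unavailable;
};

/* Publish the timestamp before the flag: release pairs with acquire below. */
static void mark_unavailable(struct devid_node *n, unsigned long now)
{
    n->timestamp_unavailable = now;
    atomic_store_explicit(&n->unavailable, true, memory_order_release);
}

/* A reader that sees the flag is guaranteed to see the matching timestamp. */
static bool test_unavailable(struct devid_node *n, unsigned long now,
                             unsigned long timeout)
{
    if (!atomic_load_explicit(&n->unavailable, memory_order_acquire))
        return false;
    if (now - n->timestamp_unavailable < timeout)
        return true;
    /* The timeout expired: clear the flag so the device is retried. */
    atomic_store_explicit(&n->unavailable, false, memory_order_relaxed);
    return false;
}
```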


NFS/flexfiles: Fix up sparse RCU annotations · 762bb7e9

Trond Myklebust authored Feb 26, 2019

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFSv4/flexfiles: Fix invalid deref in FF_LAYOUT_DEVID_NODE() · 108bb4af

Trond Myklebust authored Feb 26, 2019

If the attempt to instantiate the mirror's layout DS pointer failed,
then that pointer may hold a value of type ERR_PTR(), so we need
to check that before we dereference it.

Fixes: 65990d1a ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
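
The kind of check described here can be illustrated in userspace. The helpers below imitate the kernel's ERR_PTR() convention (error codes encoded in the top 4095 pointer values); the struct layouts and the `devid_node` function are assumptions for the example, not the driver's code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR()/IS_ERR_OR_NULL(). */
static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline bool is_err_or_null(const void *p)
{
    return p == NULL || (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, simplified types. */
struct ds     { int devid_node; };
struct mirror { struct ds *mirror_ds; };

/* Check the pointer before dereferencing it; return NULL if unusable. */
static int *devid_node(struct mirror *m)
{
    if (is_err_or_null(m->mirror_ds))
        return NULL;
    return &m->mirror_ds->devid_node;
}
```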


NFS: Add missing encode / decode sequence_maxsz to v4.2 operations · 1a3466ae

Anna Schumaker authored Mar 01, 2019

These really should have been there from the beginning, but we never
noticed because there was enough slack in the RPC request for the extra
bytes. Chuck's recent patch to use au_cslack and au_rslack to compute
buffer size shrunk the buffer enough that this was now a problem for
SEEK operations on my test client.

Fixes: f4ac1674 ("nfs: Add ALLOCATE support")
Fixes: 2e72448b ("NFS: Add COPY nfs operation")
Fixes: cb95deea ("NFS OFFLOAD_CANCEL xdr")
Fixes: 624bd5b7 ("nfs: Add DEALLOCATE support")
Fixes: 1c6dcbe5 ("NFS: Implement SEEK")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFSv4.1: Don't process the sequence op more than once. · c71c46f0

Trond Myklebust authored Mar 01, 2019

Ensure that if we call nfs41_sequence_process() a second time for the
same rpc_task, then we only process the results once.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
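
One way to picture a "process only once" guard is a flag on the result structure. This is a hypothetical sketch; the actual patch may track completion differently.

```c
#include <stdbool.h>

/* Hypothetical sequence result; not the kernel's nfs4_sequence_res. */
struct sequence_res {
    unsigned int slot_seq_nr;
    bool processed;
};

/* Returns true if the results were processed by this call. */
static bool sequence_process(struct sequence_res *res)
{
    if (res->processed)
        return false;       /* second call for the same task: do nothing */
    res->processed = true;
    res->slot_seq_nr++;     /* side effect that must happen exactly once */
    return true;
}
```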


NFSv4.1: Reinitialise sequence results before retransmitting a request · c1dffe0b

Trond Myklebust authored Mar 01, 2019

If we have to retransmit a request, we should ensure that we reinitialise
the sequence results structure, since in the event of a signal
we need to treat the request as if it had not been sent.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org


26 Feb, 2019 1 commit

SUNRPC: Fix an Oops in udp_poll() · a73881c9

Trond Myklebust authored Feb 26, 2019

udp_poll() checks the struct file for the O_NONBLOCK flag, so we must not
call it with a NULL file pointer.

Fixes: 0ffe86f4 ("SUNRPC: Use poll() to fix up the socket requeue races")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


25 Feb, 2019 1 commit

Merge tag 'nfs-rdma-for-5.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs · 06b5fc3a

Trond Myklebust authored Feb 25, 2019

NFSoRDMA client updates for 5.1

New features:
- Convert rpc auth layer to use xdr_streams
- Config option to disable insecure enctypes
- Reduce size of RPC receive buffers

Bugfixes and cleanups:
- Fix sparse warnings
- Check inline size before providing a write chunk
- Reduce the receive doorbell rate
- Various tracepoint improvements

[Trond: Fix up merge conflicts]
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


23 Feb, 2019 1 commit

NFS/pnfs: Bulk destroy of layouts needs to be safe w.r.t. umount · 5085607d

Trond Myklebust authored Feb 22, 2019

If a bulk layout recall or a metadata server reboot coincides with a
umount, then holding a reference to an inode is unsafe unless we
also hold a reference to the super block.

Fixes: fd9a8d71 ("NFSv4.1: Fix bulk recall and destroy of layouts")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


21 Feb, 2019 2 commits

NFS: Fix a soft lockup in the delegation recovery code · 6f9449be

Trond Myklebust authored Feb 21, 2019

Fix a soft lockup when NFS client delegation recovery is attempted
but the inode is in the process of being freed. When the
igrab(inode) call fails, and we have to restart the recovery process,
we need to ensure that we won't attempt to recover the same delegation
again.

Fixes: 45870d69 ("NFSv4.1: Test delegation stateids when server...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFSv4.1: Avoid false retries when RPC calls are interrupted · 3453d570

Trond Myklebust authored Jun 20, 2018

A 'false retry' in NFSv4.1 occurs when the client attempts to transmit a
new RPC call using a slot+sequence number combination that references an
already cached one. Currently, the Linux NFS client will do this if a
user process interrupts an RPC call that is in progress.
The problem with doing so is that we defeat the main mechanism used by
the server to differentiate between a new call and a replayed one. Even
if the server is able to perfectly cache the arguments of the old call,
it cannot know if the client intended to replay or send a new call.

The obvious fix is to bump the sequence number pre-emptively if an
RPC call is interrupted, but in order to deal with the corner cases
where the interrupted call is not actually received and processed by
the server, we need to interpret the error NFS4ERR_SEQ_MISORDERED
as a sign that we need to either wait or locate a correct sequence
number that lies between the value we sent, and the last value that
was acked by a SEQUENCE call on that slot.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Jason Tibbitts <tibbs@math.uh.edu>
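
The slot bookkeeping described above might look roughly like this. It is a simplified sketch: the field and function names are assumptions, and the real client has more state and more cases.

```c
#include <stdint.h>

/* Hypothetical session slot state. */
struct slot {
    uint32_t seq_nr;               /* sequence number for the next call */
    uint32_t seq_nr_last_acked;    /* last value acked by a SEQUENCE reply */
    uint32_t seq_nr_highest_sent;  /* highest value ever transmitted */
};

enum retry { RETRY_NOW, RETRY_AFTER_DELAY };

/* An in-flight call was interrupted: bump so the next call is not a replay. */
static void slot_interrupted(struct slot *s)
{
    s->seq_nr++;
}

/*
 * The server answered NFS4ERR_SEQ_MISORDERED. If we are more than one
 * ahead of the last acked value, an interrupted call may never have
 * reached the server, so step back and retry. Otherwise wait and
 * restart the search from the highest value sent.
 */
static enum retry slot_misordered(struct slot *s)
{
    if ((int32_t)(s->seq_nr - s->seq_nr_last_acked) > 1) {
        s->seq_nr--;
        return RETRY_NOW;
    }
    s->seq_nr = s->seq_nr_highest_sent;
    return RETRY_AFTER_DELAY;
}
```

The signed difference keeps the comparison correct when the 32-bit sequence number wraps around.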


20 Feb, 2019 15 commits

SUNRPC: Remove the redundant 'zerocopy' argument to xs_sendpages() · 6f903b11

Trond Myklebust authored Feb 19, 2019

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Further cleanups of xs_sendpages() · c87dc4c7

Trond Myklebust authored Feb 19, 2019

Now that we send the pages using a struct msghdr, instead of
using sendpage(), we no longer need to 'prime the socket' with
an address for unconnected UDP messages.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Convert socket page send code to use iov_iter() · 0472e476

Trond Myklebust authored Feb 19, 2019

Simplify the page send code using iov_iter and bvecs.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec() · e791f8e9

Trond Myklebust authored Feb 19, 2019

Prepare the socket transmission code to use iov_iter.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Initiate a connection close on an ESHUTDOWN error in stream receive · 5f52a9d4

Trond Myklebust authored Feb 16, 2019

If the client stream receive code receives an ESHUTDOWN error either
because the server closed the connection, or because it sent a
callback which cannot be processed, then we should shut down
the connection.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Don't suppress socket errors when a message read completes · 727fcc64

Trond Myklebust authored Feb 15, 2019

If the message read completes, but the socket returned an error
condition, we should ensure that the error is propagated.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


SUNRPC: Handle zero length fragments correctly · e92053a5

Trond Myklebust authored Feb 15, 2019

A zero length fragment is really a bug, but let's ensure we don't
go nuts when one turns up.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
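
For context, RPC over a stream transport prefixes each fragment with a 4-byte record marker: the top bit flags the last fragment and the low 31 bits give the fragment length. A hedged sketch of how a receiver might decide what to do after reading a marker, including the zero-length case (the enum and function are invented for illustration and take the marker in host byte order):

```c
#include <stdbool.h>
#include <stdint.h>

#define RPC_LAST_FRAGMENT      0x80000000u
#define RPC_FRAGMENT_SIZE_MASK 0x7fffffffu

/* Hypothetical receiver states for the sketch. */
enum next_step { READ_PAYLOAD, READ_NEXT_MARKER, MESSAGE_COMPLETE };

/* Decide the next action from a record marker in host byte order. */
static enum next_step after_marker(uint32_t marker)
{
    uint32_t len = marker & RPC_FRAGMENT_SIZE_MASK;
    bool last = (marker & RPC_LAST_FRAGMENT) != 0;

    if (len != 0)
        return READ_PAYLOAD;
    /* Zero-length fragment: there is no payload to wait for. */
    return last ? MESSAGE_COMPLETE : READ_NEXT_MARKER;
}
```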


SUNRPC: Don't reset the stream record info when the receive worker is running · ae053551

Trond Myklebust authored Feb 20, 2019

To ensure that the receive worker has exclusive access to the stream record
info, we must not reset the contents other than when holding the
transport->recv_mutex.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


nfs: fix xfstest generic/099 failed on nfsv3 · ded52fbe

ZhangXiaoxu authored Feb 18, 2019

After setxattr, the NFSv3 client caches the ACL that was set by the user.

But on the backend, the exported file system (e.g. ext4) checks
the ACL and, if it can be merged into the mode bits, does not add an ACL
to the file. So the ACL cached by the NFSv3 client is redundant.

Don't call 'set_cached_acl' on setxattr.
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


pNFS: Avoid read/modify/write when it is not necessary · 2cde04e9

Kazuo Ito authored Feb 14, 2019

As the block and SCSI layouts can only read/write fixed-length
blocks, we must perform read-modify-write when data to be written is
not aligned to a block boundary or smaller than the block size.
(612aa983 pnfs: add flag to force read-modify-write in ->write_begin)

The current code tries to see if we have to do read-modify-write
on block-oriented pNFS layouts by just checking !PageUptodate(page),
but the same condition also applies when overwriting any uncached
portions of existing files, making such operations excessively slow
even when they are block-aligned.

The change does not affect the optimization for modify-write-read
cases (38c73044 NFS: read-modify-write page updating),
because partial update of !PageUptodate() pages can only happen
in layouts that can do arbitrary length read/write and never
in block-based ones.
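
The decision can be sketched with plain values instead of page flags. This is an assumption-laden simplification: `valid_len` stands for the amount of existing data in the block, and the function names are invented for the example.

```c
#include <stdbool.h>

/* True if the write covers all existing data in the block. */
static bool full_block_write(unsigned int valid_len, unsigned int offset,
                             unsigned int len)
{
    return valid_len == 0 || (offset == 0 && offset + len >= valid_len);
}

/*
 * For a block-oriented layout, a read before the write is only needed
 * when the block is not already cached and the write leaves part of
 * the existing data untouched.
 */
static bool want_read_modify_write(bool uptodate, unsigned int valid_len,
                                   unsigned int offset, unsigned int len)
{
    if (uptodate)
        return false;
    return !full_block_write(valid_len, offset, len);
}
```

With this shape, an aligned overwrite of an uncached block skips the read, which is the slow case the message describes.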

Testing results:

We ran fio on one of the pNFS clients running 4.20 kernel
(vanilla and patched) in this configuration to read/write/overwrite
files on the storage array, exported as pnfs share by the server.

 pNFS clients ---1G Ethernet--- pNFS server
 (HP DL360 G8)                  (HP DL360 G8)
       |                              |
       |                              |
       +------8G Fiber Channel--------+
                     |
               Storage Array
                 (HP P6350)

Throughput of overwrite (both buffered and O_SYNC) is noticeably
improved.

Ops.     |block size|   Throughput   |
         |  (KiB)   |    (MiB/s)     |
         |          |  4.20 | patched|
---------+----------+----------------+
buffered |         4|  21.3 |  232   |
overwrite|        32|  22.2 |  256   |
         |       512|  22.4 |  260   |
---------+----------+----------------+
O_SYNC   |         4|   3.84|    4.77|
overwrite|        32|  12.2 |   32.0 |
         |       512|  18.5 |  152   |
---------+----------+----------------+

Read and write (buffered and O_SYNC) throughput on the same client is
essentially unchanged by the patch, as expected.

Ops.     |block size|   Throughput   |
         |  (KiB)   |    (MiB/s)     |
         |          |  4.20 | patched|
---------+----------+----------------+
read     |         4| 548   |  550   |
         |        32| 547   |  551   |
         |       512| 548   |  551   |
---------+----------+----------------+
buffered |         4| 237   |  244   |
write    |        32| 261   |  268   |
         |       512| 265   |  272   |
---------+----------+----------------+
O_SYNC   |         4|   0.46|    0.46|
write    |        32|   3.60|    3.57|
         |       512| 105   |  106   |
---------+----------+----------------+
Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


pNFS: Fix potential corruption of page being written · 97ae91bb

Kazuo Ito authored Feb 14, 2019

nfs_want_read_modify_write() didn't check for !PagePrivate when pNFS
block or SCSI layout was in use, therefore we could lose data forever
if the page being written was filled by a read before completion.
Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS: Fix typo in comments of nfs_readdir_alloc_pages() · bf211ca1

zhangliguang authored Feb 16, 2019

This fixes the comments of nfs_readdir_alloc_pages(), which are out of
date because nfs_readdir_large_page and nfs_readdir_free_pagearray have
been renamed.
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS: Remove redundant semicolon · 42f72cf3

zhangliguang authored Feb 12, 2019

This removes a redundant semicolon at the end of a statement.

Fixes: c7944ebb ("NFSv4: Fix lookup revalidate of regular files")
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


NFS: readdirplus optimization by cache mechanism · be4c2d47

luanshi authored Jan 29, 2019

When listing very large directories via NFS, clients may take a long
time to complete. There are about three factors involved:

First of all, ls and practically every other method of listing a
directory, including Python's os.listdir and find, rely on libc readdir().
However, readdir() only reads 32K of directory entries at a time, which
means that if you have a lot of files in the same directory, it is going
to take an insanely long time to read all the directory entries.

Secondly, libc readdir() reads 32K of directory entries at a time, and in
kernel space the 32K buffer is split into 8 pages. One NFS readdirplus RPC
is issued per page, which results in many readdirplus RPC calls.

Lastly, one NFS readdirplus RPC asks for 32K of data (filled by nfs_dentry)
to fill one page (filled by dentry), and we found that nearly one third of
the data was wasted.

To solve the above problems, a page cache mechanism is introduced. One NFS
readdirplus RPC asks for a larger amount of data (more than 32K), which can
fill more than one page, and the cached pages can be used by the next
readdir call. This greatly reduces the number of readdirplus RPC calls and
improves readdirplus performance.

TESTING:
When listing a very large directory (containing 300 thousand files) via NFS:

time ls -l /nfs_mount | wc -l

without the patch:
300001
real    1m53.524s
user    0m2.314s
sys     0m2.599s

with the patch:
300001
real    0m23.487s
user    0m2.305s
sys     0m2.558s

Improved performance: 79.6%
readdirplus rpc calls decrease: 85%
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


fs/nfs: Fix nfs_parse_devname to not modify its argument · 40cc394b

Eric W. Biederman authored Jan 30, 2019

In the rare and unsupported case of a hostname list, nfs_parse_devname
will modify dev_name. There is no need to modify dev_name, as all
that is being computed is the length of the hostname, so the computed
length can just be shortened.

Fixes: dc045898 ("NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
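
A minimal sketch of the idea, assuming a "host[,host...]:/path" device name and ignoring other forms such as bracketed IPv6 addresses. The function `first_hostname_len` is hypothetical and is not the kernel's parser.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the first hostname in "host[,host...]:/path",
 * or 0 if there is no ':' separator. The input string is never written.
 */
static size_t first_hostname_len(const char *dev_name)
{
    const char *end = strchr(dev_name, ':');
    const char *comma;
    size_t len;

    if (end == NULL)
        return 0;
    len = (size_t)(end - dev_name);

    /* Hostname lists are unsupported: shorten the length at the comma. */
    comma = memchr(dev_name, ',', len);
    if (comma != NULL)
        len = (size_t)(comma - dev_name);
    return len;
}
```

Because only the computed length changes, the caller's string can stay `const`.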
