- 17 May, 2016 37 commits
-
-
Jeff Layton authored
LAYOUTRETURN is "special" in that servers and clients are expected to work with old stateids. When the client sends a LAYOUTRETURN with an old stateid in it then the server is expected to only tear down layout segments that were present when that seqid was current. Ensure that the client handles its accounting accordingly. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Jeff Layton authored
When we want to selectively do a LAYOUTRETURN, we need to specify a stateid that represents most recent layout acquisition that is to be returned. When we mark a layout stateid to be returned, we update the return sequence number in the layout header with that value, if it's newer than the existing one. Then, when we go to do a LAYOUTRETURN on layout header put, we overwrite the seqid in the stateid with the saved one, and then zero it out. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Jeff Layton authored
In later patches, we're going to teach the client to be more selective about how it returns layouts. This means keeping a record of what the stateid's seqid was at the time that the server handed out a layout segment. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Jeff Layton authored
Otherwise, we'll end up returning layouts that we've just received if the client issues a new LAYOUTGET prior to the LAYOUTRETURN. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Tom Haynes authored
If we are initializing reads or writes and can not connect to a DS, then check whether or not IO is allowed through the MDS. If it is allowed, reset to the MDS. Else, fail the layout segment and force a retry of a new layout segment. Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Tom Haynes authored
Whenever we check to see if we have the needed number of DSes for the action, we may also have to check to see whether IO is allowed to go to the MDS or not. [jlayton: fix merge conflict due to lack of localio patches here] Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
This patch fixes a problem whereby the pNFS client falls back to doing reads and writes through the metadata server even when the layout flag FF_FLAGS_NO_IO_THRU_MDS is set. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
No need to make them a priority any more, or to make them succeed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
When we're using a delegation to represent our open state, we should ensure that we use the stateid that was used to create that delegation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
In order to more easily distinguish what kind of stateid we are dealing with, introduce a type that can be used to label the stateid structure. The label will be useful both for debugging, but also when dealing with operations like SETATTR, READ and WRITE that can take several different types of stateid as arguments. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up. After "xprtrdma: Remove ro_unmap() from all registration modes", there are no longer any sites that take rpcrdma_ia::qplock for read. The one site that takes it for write is always single-threaded. It is safe to remove it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
In a cluster failover scenario, it is desirable for the client to attempt to reconnect quickly, as an alternate NFS server is already waiting to take over for the down server. The client can't see that a server IP address has moved to a new server until the existing connection is gone. For fabrics and devices where it is meaningful, set a definite upper bound on the amount of time before it is determined that a connection is no longer valid. This allows the RPC client to detect connection loss in a timely matter, then perform a fresh resolution of the server GUID in case it has changed (cluster failover). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: The ro_unmap method is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
There needs to be a safe method of releasing registered memory resources when an RPC terminates. Safe can mean a number of things: + Doesn't have to sleep + Doesn't rely on having a QP in RTS ro_unmap_safe will be that safe method. It can be used in cases where synchronous memory invalidation can deadlock, or needs to have an active QP. The important case is fencing an RPC's memory regions after it is signaled (^C) and before it exits. If this is not done, there is a window where the server can write an RPC reply into memory that the client has released and re-used for some other purpose. Note that this is a full solution for FRWR, but FMR and physical still have some gaps where a particularly bad server can wreak some havoc on the client. These gaps are not made worse by this patch and are expected to be exceptionally rare and timing-based. They are noted in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Separate the DMA unmap operation from freeing the MW. In a subsequent patch they will not always be done at the same time, and they are not related operations (except by order; freeing the MW must be the last step during invalidation). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
In a subsequent patch, the fr_xprt and fr_worker fields will be needed by another memory registration mode. Move them into the generic rpcrdma_mw structure that wraps struct rpcrdma_frmr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Maintain the order of invalidation and DMA unmapping when doing a background MR reset. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
frwr_op_unmap_sync() is now invoked in a workqueue context, the same as __frwr_queue_recovery(). There's no need to defer MR reset if posting LOCAL_INV MRs fails. This means that even when ib_post_send() fails (which should occur very rarely) the invalidation and DMA unmapping steps are still done in the correct order. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Move the the I/O direction field from rpcrdma_mr_seg into the rpcrdma_frmr. This makes it possible to DMA-unmap the frwr long after an RPC has exited and its rpcrdma_mr_seg array has been released and re-used. This might occur if an RPC times out while waiting for a new connection to be established. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: Follow same naming convention as other fields in struct rpcrdma_frwr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with the new ib_drain_qp() API. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-By: Leon Romanovsky <leonro@mellanox.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
rpcrdma_create_chunks() has been replaced, and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline. In fact, this assumption is asserted in the code: if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) { dprintk("RPC: %s: cannot marshal multiple chunk lists\n", __func__); return -EIO; } But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk. Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail. The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear. Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Update documenting comments to reflect code changes over the past year. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Avoid the latency and interrupt overhead of registering a Write chunk when handling NFS READ requests of a few hundred bytes or less. This change does not interoperate with Linux NFS/RDMA servers that do not have commit 9d11b51c ('svcrdma: Fix send_reply() scatter/gather set-up'). Commit 9d11b51c was introduced in v4.3, and is included in 4.2.y, 4.1.y, and 3.18.y. Oracle bug 22925946 has been filed to request that the above fix be included in the Oracle Linux UEK4 NFS/RDMA server. Red Hat bugzillas 1327280 and 1327554 have been filed to request that RHEL NFS/RDMA server backports include the above fix. Workaround: Replace the "proto=rdma,port=20049" mount options with "proto=tcp" until commit 9d11b51c is applied to your NFS server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
When deciding whether to send a Call inline, rpcrdma_marshal_req doesn't take into account header bytes consumed by chunk lists. This results in Call messages on the wire that are sometimes larger than the inline threshold. Likewise, when a Write list or Reply chunk is in play, the server's reply has to emit an RDMA Send that includes a larger-than-minimal RPC-over-RDMA header. The actual size of a Call message cannot be estimated until after the chunk lists have been registered. Thus the size of each RPC-over-RDMA header can be estimated only after chunks are registered; but the decision to register chunks is based on the size of that header. Chicken, meet egg. The best a client can do is estimate header size based on the largest header that might occur, and then ensure that inline content is always smaller than that. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Send buffer space is shared between the RPC-over-RDMA header and an RPC message. A large RPC-over-RDMA header means less space is available for the associated RPC message, which then has to be moved via an RDMA Read or Write. As more segments are added to the chunk lists, the header increases in size. Typical modern hardware needs only a few segments to convey the maximum payload size, but some devices and registration modes may need a lot of segments to convey data payload. Sometimes so many are needed that the remaining space in the Send buffer is not enough for the RPC message. Sending such a message usually fails. To ensure a transport can always make forward progress, cap the number of RDMA segments that are allowed in chunk lists. This prevents less-capable devices and memory registrations from consuming a large portion of the Send buffer by reducing the maximum data payload that can be conveyed with such devices. For now I choose an arbitrary maximum of 8 RDMA segments. This allows a maximum size RPC-over-RDMA header to fit nicely in the current 1024 byte inline threshold with over 700 bytes remaining for an inline RPC message. The current maximum data payload of NFS READ or WRITE requests is one megabyte. To convey that payload on a client with 4KB pages, each chunk segment would need to handle 32 or more data pages. This is well within the capabilities of FMR. For physical registration, the maximum payload size on platforms with 4KB pages is reduced to 32KB. For FRWR, a device's maximum page list depth would need to be at least 34 to support the maximum 1MB payload. A device with a smaller maximum page list depth means the maximum data payload is reduced when using that device. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Currently the sysctls that allow setting the inline threshold allow any value to be set. Small values only make the transport run slower. The default 1KB setting is as low as is reasonable. And the logic that decides how to divide a Send buffer between RPC-over-RDMA header and RPC message assumes (but does not check) that the lower bound is not crazy (say, 57 bytes). Send and receive buffers share a page with some control information. Values larger than about 3KB can't be supported, currently. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
RPC-over-RDMA transports have a limit on how large a backward direction (backchannel) RPC message can be. Ensure that the NFSv4.x CREATE_SESSION operation advertises this limit to servers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Commit 176e21ee ("SUNRPC: Support for RPC over AF_LOCAL transports") added a 5-character netid, but did not bump RPCBIND_MAXNETIDLEN from 4 to 5. Fixes: 176e21ee ("SUNRPC: Support for RPC over AF_LOCAL ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Shirley Ma authored
RFC 5666: The "rdma" netid is to be used when IPv4 addressing is employed by the underlying transport, and "rdma6" for IPv6 addressing. Add mount -o proto=rdma6 option to support NFS/RDMA IPv6 addressing. Changes from v2: - Integrated comments from Chuck Level, Anna Schumaker, Trodt Myklebust - Add a little more to the patch description to describe NFS/RDMA IPv6 suggested by Chuck Level and Anna Schumaker - Removed duplicated rdma6 define - Remove Opt_xprt_rdma mountfamily since it doesn't support Signed-off-by: Shirley Ma <shirley.ma@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Tigran Mkrtchyan authored
OPEN_CREATE with EXCLUSIVE4_1 sends initial file permission. Ignoring fact, that server have indicated that file mod is set, client will send yet another SETATTR request, but, as mode is already set, new SETATTR will be empty. This is not a problem, nevertheless an extra roundtrip and slow open on high latency networks. This change is aims to skip extra setattr after open if there are no attributes to be set. Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Anna Schumaker authored
This adds the copy_range file_ops function pointer used by the sys_copy_range() function call. This patch only implements sync copies, so if an async copy happens we decode the stateid and ignore it. Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
-
Anna Schumaker authored
Copy will use this to set up a commit request for a generic range. I don't want to allocate a new pagecache entry for the file, so I needed to change parts of the commit path to handle requests with a null wb_page. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Olga Kornievskaia authored
Commit 80f96427 ("NFSv4.x: Enforce the ca_maxreponsesize_cached on the back channel") causes an oops when it receives a callback with cachethis=yes. [ 109.667378] BUG: unable to handle kernel NULL pointer dereference at 00000000000002c8 [ 109.669476] IP: [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4] [ 109.671216] PGD 0 [ 109.671736] Oops: 0000 [#1] SMP [ 109.705427] CPU: 1 PID: 3579 Comm: nfsv4.1-svc Not tainted 4.5.0-rc1+ #1 [ 109.706987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 [ 109.709468] task: ffff8800b4408000 ti: ffff88008448c000 task.ti: ffff88008448c000 [ 109.711207] RIP: 0010:[<ffffffffa08a3e68>] [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4] [ 109.713521] RSP: 0018:ffff88008448fca0 EFLAGS: 00010286 [ 109.714762] RAX: ffff880081ee202c RBX: ffff8800b7b5b600 RCX: 0000000000000001 [ 109.716427] RDX: 0000000000000008 RSI: 0000000000000008 RDI: 0000000000000000 [ 109.718091] RBP: ffff88008448fda8 R08: 0000000000000000 R09: 000000000b000000 [ 109.719757] R10: ffff880137786000 R11: ffff8800b7b5b600 R12: 0000000001000000 [ 109.721415] R13: 0000000000000002 R14: 0000000053270000 R15: 000000000000000b [ 109.723061] FS: 0000000000000000(0000) GS:ffff880139640000(0000) knlGS:0000000000000000 [ 109.724931] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 109.726278] CR2: 00000000000002c8 CR3: 0000000034d50000 CR4: 00000000001406e0 [ 109.727972] Stack: [ 109.728465] ffff880081ee202c ffff880081ee201c 000000008448fcc0 ffff8800baccb800 [ 109.730349] ffff8800baccc800 ffffffffa08d0380 0000000000000000 0000000000000000 [ 109.732211] ffff8800b7b5b600 0000000000000001 ffffffff81d073c0 ffff880081ee3090 [ 109.734056] Call Trace: [ 109.734657] [<ffffffffa03795d4>] svc_process_common+0x5c4/0x6c0 [sunrpc] [ 109.736267] [<ffffffffa0379a4c>] bc_svc_process+0x1fc/0x360 [sunrpc] [ 109.737775] [<ffffffffa08a2c2c>] nfs41_callback_svc+0x10c/0x1d0 [nfsv4] [ 109.739335] [<ffffffff810cb380>] ? prepare_to_wait_event+0xf0/0xf0 [ 109.740799] [<ffffffffa08a2b20>] ? nfs4_callback_svc+0x50/0x50 [nfsv4] [ 109.742349] [<ffffffff810a6998>] kthread+0xd8/0xf0 [ 109.743495] [<ffffffff810a68c0>] ? kthread_park+0x60/0x60 [ 109.744776] [<ffffffff816abc4f>] ret_from_fork+0x3f/0x70 [ 109.746037] [<ffffffff810a68c0>] ? kthread_park+0x60/0x60 [ 109.747324] Code: cc 45 31 f6 48 8b 85 00 ff ff ff 44 89 30 48 8b 85 f8 fe ff ff 44 89 20 48 8b 9d 38 ff ff ff 48 8b bd 30 ff ff ff 48 85 db 74 4c <4c> 8b af c8 02 00 00 4d 8d a5 08 02 00 00 49 81 c5 98 02 00 00 [ 109.754361] RIP [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4] [ 109.756123] RSP <ffff88008448fca0> [ 109.756951] CR2: 00000000000002c8 [ 109.757738] ---[ end trace 2b8555511ab5dfb4 ]--- [ 109.758819] Kernel panic - not syncing: Fatal exception [ 109.760126] Kernel Offset: disabled [ 118.938934] ---[ end Kernel panic - not syncing: Fatal exception It doesn't unlock the table nor does it set the cps->clp pointer which is later needed by nfs4_cb_free_slot(). Fixes: 80f96427 ("NFSv4.x: Enforce the ca_maxresponsesize_cached ...") CC: stable@vger.kernel.org Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 09 May, 2016 3 commits
-
-
J. Bruce Fields authored
There's no guarantee that an IP address in a different network namespace actually represents the same endpoint. Also, if we allow unprivileged nfs mounts some day then this might allow an unprivileged user in another network namespace to misdirect somebody else's nfs mounts. If sharing between containers is really what's wanted then that could still be arranged explicitly, for example with bind mounts. Reported-by: "Eric W. Biederman" <ebiederm@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
At Connectathon 2016, we found that recent upstream Linux clients would occasionally send a LOCK operation with a zero stateid. This appeared to happen in close proximity to another thread returning a delegation before unlinking the same file while it remained open. Earlier, the client received a write delegation on this file and returned the open stateid. Now, as it is getting ready to unlink the file, it returns the write delegation. But there is still an open file descriptor on that file, so the client must OPEN the file again before it returns the delegation. Since commit 24311f88 ('NFSv4: Recovery of recalled read delegations is broken'), nfs_open_delegation_recall() clears the NFS_DELEGATED_STATE flag _before_ it sends the OPEN. This allows a racing LOCK on the same inode to be put on the wire before the OPEN operation has returned a valid open stateid. To eliminate this race, serialize delegation return with the acquisition of a file lock on the same file. Adopt the same approach as is used in the unlock path. This patch also eliminates a similar race seen when sending a LOCK operation at the same time as returning a delegation on the same file. Fixes: 24311f88 ('NFSv4: Recovery of recalled read ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Anna: Add sentence about LOCK / delegation race] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Jeff Layton authored
A mirror can be shared between multiple layouts, even with different iomodes. That makes stats gathering simpler, but it causes a problem when we get different creds in READ vs. RW layouts. The current code drops the newer credentials onto the floor when this occurs. That's problematic when you fetch a READ layout first, and then a RW. If the READ layout doesn't have the correct creds to do a write, then writes will fail. We could just overwrite the READ credentials with the RW ones, but that would break the ability for the server to fence the layout for reads if things go awry. We need to be able to revert to the earlier READ creds if the RW layout is returned afterward. The simplest fix is to just keep two sets of creds per mirror. One for READ layouts and one for RW, and then use the appropriate set depending on the iomode of the layout segment. Also fix up some RCU nits that sparse found. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-