1. 01 Mar, 2024 40 commits
    • Trond Myklebust's avatar
      nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr() · 24d92de9
      Trond Myklebust authored
      The main point of the guarded SETATTR is to prevent races with other
      WRITE and SETATTR calls. That requires that the check of the guard time
      against the inode ctime be done after taking the inode lock.
      
      Furthermore, we need to take into account the 32-bit nature of
      timestamps in NFSv3, and the possibility that files may change at a
      faster rate than once a second.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      24d92de9
    • Trond Myklebust's avatar
      nfsd: Fix a regression in nfsd_setattr() · 6412e44c
      Trond Myklebust authored
      Commit bb4d53d6 ("NFSD: use (un)lock_inode instead of
      fh_(un)lock for file operations") broke the NFSv3 pre/post op
      attributes behaviour when doing a SETATTR rpc call by stripping out
      the calls to fh_fill_pre_attrs() and fh_fill_post_attrs().
      
      Fixes: bb4d53d6 ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarNeilBrown <neilb@suse.de>
      Message-ID: <20240216012451.22725-1-trondmy@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6412e44c
    • Dai Ngo's avatar
      NFSD: OP_CB_RECALL_ANY should recall both read and write delegations · 5826e09b
      Dai Ngo authored
      Add RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY
      Signed-off-by: default avatarDai Ngo <dai.ngo@oracle.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      5826e09b
    • Dai Ngo's avatar
      NFSD: handle GETATTR conflict with write delegation · c5967721
      Dai Ngo authored
      If the GETATTR request on a file that has write delegation in effect
      and the request attributes include the change info and size attribute
      then the request is handled as below:
      
      Server sends CB_GETATTR to client to get the latest change info and file
      size. If these values are the same as the server's cached values then
      the GETATTR proceeds as normal.
      
      If either the change info or file size is different from the server's
      cached values, or the file was already marked as modified, then:
      
          . update time_modify and time_metadata into file's metadata
            with current time
      
          . encode GETATTR as normal except the file size is encoded with
            the value returned from CB_GETATTR
      
          . mark the file as modified
      
      If the CB_GETATTR fails for any reasons, the delegation is recalled
      and NFS4ERR_DELAY is returned for the GETATTR.
      Signed-off-by: default avatarDai Ngo <dai.ngo@oracle.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      c5967721
    • Dai Ngo's avatar
      NFSD: add support for CB_GETATTR callback · 6487a13b
      Dai Ngo authored
      Includes:
         . CB_GETATTR proc for nfs4_cb_procedures[]
         . XDR encoding and decoding function for CB_GETATTR request/reply
         . add nfs4_cb_fattr to nfs4_delegation for sending CB_GETATTR
           and store file attributes from client's reply.
      Signed-off-by: default avatarDai Ngo <dai.ngo@oracle.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6487a13b
    • Chuck Lever's avatar
      NFSD: Document the phases of CREATE_SESSION · b910544a
      Chuck Lever authored
      As described in RFC 8881 Section 18.36.4, CREATE_SESSION can be
      split into four phases. NFSD's implementation now does it like that
      description.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      b910544a
    • Chuck Lever's avatar
      NFSD: Fix the NFSv4.1 CREATE_SESSION operation · e4469c6c
      Chuck Lever authored
      RFC 8881 Section 18.36.4 discusses the implementation of the NFSv4.1
      CREATE_SESSION operation. The section defines four phases of
      operation.
      
      Phase 2 processes the CREATE_SESSION sequence ID. As a separate
      step, Phase 3 evaluates the CREATE_SESSION arguments.
      
      The problem we are concerned with is when phase 2 is successful but
      phase 3 fails. The spec language in this case is "No changes are
      made to any client records on the server."
      
      RFC 8881 Section 18.35.4 defines a "client record", and it does
      /not/ contain any details related to the special CREATE_SESSION
      slot. Therefore NFSD is incorrect to skip incrementing the
      CREATE_SESSION sequence id when phase 3 (see Section 18.36.4) of
      CREATE_SESSION processing fails. In other words, even though NFSD
      happens to store the cs_slot in a client record, in terms of the
      protocol the slot is logically separate from the client record.
      
      Three complications:
      
      1. The world has moved on since commit 86c3e16c ("nfsd4: confirm
         only on succesful create_session") broke this. So we can't simply
         revert that commit.
      
      2. NFSD's CREATE_SESSION implementation does not cleanly delineate
         the logic of phases 2 and 3. So this won't be a surgical fix.
      
      3. Because of the way it currently handles the CREATE_SESSION slot
         sequence number, nfsd4_create_session() isn't caching error
         responses in the CREATE_SESSION slot. Instead of replaying the
         response cache in those cases, it's executing the transaction
         again.
      
      Reorganize the CREATE_SESSION slot sequence number accounting. This
      requires that error responses are appropriately cached in the
      CREATE_SESSION slot (once it is found).
      Reported-by: default avatarConnor Smith <connor.smith@hitachivantara.com>
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218382Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      e4469c6c
    • Chen Hanxiao's avatar
      nfsd: clean up comments over nfs4_client definition · f8104027
      Chen Hanxiao authored
      nfsd fault injection has been deprecated since
      commit 9d60d931 ("Deprecate nfsd fault injection")
      and removed by
      commit e56dc9e2 ("nfsd: remove fault injection code")
      
      So remove the outdated parts about fault injection.
      Signed-off-by: default avatarChen Hanxiao <chenhx.fnst@fujitsu.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      f8104027
    • Chuck Lever's avatar
      svcrdma: Add Write chunk WRs to the RPC's Send WR chain · e084ee67
      Chuck Lever authored
      Chain RDMA Writes that convey Write chunks onto the local Send
      chain. This means all WRs for an RPC Reply are now posted with a
      single ib_post_send() call, and there is a single Send completion
      when all of these are done. That reduces both the per-transport
      doorbell rate and completion rate.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      e084ee67
    • Chuck Lever's avatar
      svcrdma: Post WRs for Write chunks in svc_rdma_sendto() · d2727cef
      Chuck Lever authored
      Refactor to eventually enable svcrdma to post the Write WRs for each
      RPC response using the same ib_post_send() as the Send WR (ie, as a
      single WR chain).
      
      svc_rdma_result_payload (originally svc_rdma_read_payload) was added
      so that the upper layer XDR encoder could identify a range of bytes
      to be possibly conveyed by RDMA (if a Write chunk was provided by
      the client).
      
      The purpose of commit f6ad7759 ("svcrdma: Post RDMA Writes while
      XDR encoding replies") was to post as much of the result payload
      outside of svc_rdma_sendto() as possible because svc_rdma_sendto()
      used to be called with the xpt_mutex held.
      
      However, since commit ca4faf54 ("SUNRPC: Move xpt_mutex into
      socket xpo_sendto methods"), the xpt_mutex is no longer held when
      calling svc_rdma_sendto(). Thus, that benefit is no longer an issue.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      d2727cef
    • Chuck Lever's avatar
      svcrdma: Post the Reply chunk and Send WR together · 10e6fc10
      Chuck Lever authored
      Reduce the doorbell and Send completion rates when sending RPC/RDMA
      replies that have Reply chunks. NFS READDIR procedures typically
      return their result in a Reply chunk, for example.
      
      Instead of calling ib_post_send() to post the Write WRs for the
      Reply chunk, and then calling it again to post the Send WR that
      conveys the transport header, chain the Write WRs to the Send WR
      and call ib_post_send() only once.
      
      Thanks to the Send Queue completion ordering rules, when the Send
      WR completes, that guarantees that Write WRs posted before it have
      also completed successfully. Thus all Write WRs for the Reply chunk
      can remain unsignaled. Instead of handling a Write completion and
      then a Send completion, only the Send completion is seen, and it
      handles clean up for both the Writes and the Send.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      10e6fc10
    • Chuck Lever's avatar
      svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt · a1f5788a
      Chuck Lever authored
      Since the RPC transaction's svc_rdma_send_ctxt will stay around for
      the duration of the RDMA Write operation, the write_info structure
      for the Reply chunk can reside in the request's svc_rdma_send_ctxt
      instead of being allocated separately.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      a1f5788a
    • Chuck Lever's avatar
      svcrdma: Post Send WR chain · 71b43531
      Chuck Lever authored
      Eventually I'd like the server to post the reply's Send WR along
      with any Write WRs using only a single call to ib_post_send(), in
      order to reduce the NIC's doorbell rate.
      
      To do this, add an anchor for a WR chain to svc_rdma_send_ctxt, and
      refactor svc_rdma_send() to post this WR chain to the Send Queue. For
      the moment, the posted chain will continue to contain a single Send
      WR.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      71b43531
    • Chuck Lever's avatar
      svcrdma: Fix retry loop in svc_rdma_send() · fc709d82
      Chuck Lever authored
      Don't call ib_post_send() at all if the transport is already
      shutting down.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      fc709d82
    • Chuck Lever's avatar
      svcrdma: Prevent a UAF in svc_rdma_send() · 773f6c5b
      Chuck Lever authored
      In some error flow cases, svc_rdma_wc_send() releases @ctxt. Copy
      the sc_cid field in @ctxt to a stack variable in order to guarantee
      that the value is available after the ib_post_send() call.
      
      In case the new comment looks a little strange, this will be done
      with at least one more field in a subsequent patch.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      773f6c5b
    • Chuck Lever's avatar
      svcrdma: Fix SQ wake-ups · 5b9a8589
      Chuck Lever authored
      Ensure there is a wake-up when increasing sc_sq_avail.
      
      Likewise, if a wake-up is done, sc_sq_avail needs to be updated,
      otherwise the wait_event() conditional is never going to be met.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      5b9a8589
    • Chuck Lever's avatar
      svcrdma: Increase the per-transport rw_ctx count · 2da0f610
      Chuck Lever authored
      rdma_rw_mr_factor() returns the smallest number of MRs needed to
      move a particular number of pages. svcrdma currently asks for the
      number of MRs needed to move RPCSVC_MAXPAGES (a little over one
      megabyte), as that is the number of pages in the largest r/wsize
      the server supports.
      
      This call assumes that the client's NIC can bundle a full one
      megabyte payload in a single rdma_segment. In fact, most NICs cannot
      handle a full megabyte with a single rkey / rdma_segment. Clients
      will typically split even a single Read chunk into many segments.
      
      The server needs one MR to read each rdma_segment in a Read chunk,
      and thus each one needs an rw_ctx.
      
      svcrdma has been vastly underestimating the number of rw_ctxs needed
      to handle 64 RPC requests with large Read chunks using small
      rdma_segments.
      
      Unfortunately there doesn't seem to be a good way to estimate this
      number without knowing the client NIC's capabilities. Even then,
      the client RPC/RDMA implementation is still free to split a chunk
      into smaller segments (for example, it might be using physical
      registration, which needs an rdma_segment per page).
      
      The best we can do for now is choose a number that will guarantee
      forward progress in the worst case (one page per segment).
      
      At some later point, we could add some mechanisms to make this
      much less of a problem:
      - Add a core API to add more rw_ctxs to an already-established QP
      - svcrdma could treat rw_ctx exhaustion as a temporary error and
        try again
      - Limit the number of Reads in flight
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      2da0f610
    • Chuck Lever's avatar
      svcrdma: Update max_send_sges after QP is created · 4c8c0fa0
      Chuck Lever authored
      rdma_create_qp() can modify cap.max_send_sges. Copy the new value
      to the svcrdma transport so it is bound by the new limit instead
      of the requested one.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      4c8c0fa0
    • Chuck Lever's avatar
      svcrdma: Report CQ depths in debugging output · 5485d6dd
      Chuck Lever authored
      Check that svc_rdma_accept() is allocating an appropriate number of
      CQEs.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      5485d6dd
    • Chuck Lever's avatar
      svcrdma: Reserve an extra WQE for ib_drain_rq() · e67792cc
      Chuck Lever authored
      Do as other ULPs already do: ensure there is an extra Receive WQE
      reserved for the tear-down drain WR. I haven't heard reports of
      problems but it can't hurt.
      
      Note that rq_depth is used to compute the Send Queue depth as well,
      so this fix should affect both the SQ and RQ.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      e67792cc
    • Jeff Layton's avatar
      MAINTAINERS: add Alex Aring as Reviewer for file locking code · c8004c1c
      Jeff Layton authored
      Alex helps co-maintain the DLM code and did some recent work to fix up
      how lockd and GFS2 work together. Add him as a Reviewer for file locking
      changes.
      Acked-by: default avatarAlexander Aring <aahringo@redhat.com>
      Acked-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      c8004c1c
    • Kunwu Chan's avatar
      nfsd: Simplify the allocation of slab caches in nfsd4_init_slabs · 649e58d5
      Kunwu Chan authored
      Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
      to simplify the creation of SLAB caches.
      Make the code cleaner and more readable.
      Signed-off-by: default avatarKunwu Chan <chentao@kylinos.cn>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      649e58d5
    • Kunwu Chan's avatar
      nfsd: Simplify the allocation of slab caches in nfsd_drc_slab_create · 192d80cd
      Kunwu Chan authored
      Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
      to simplify the creation of SLAB caches.
      And change cache name from 'nfsd_drc' to 'nfsd_cacherep'.
      Signed-off-by: default avatarKunwu Chan <chentao@kylinos.cn>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      192d80cd
    • Kunwu Chan's avatar
      nfsd: Simplify the allocation of slab caches in nfsd_file_cache_init · 2f74991a
      Kunwu Chan authored
      Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
      to simplify the creation of SLAB caches.
      Signed-off-by: default avatarKunwu Chan <chentao@kylinos.cn>
      Acked-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      2f74991a
    • Kunwu Chan's avatar
      nfsd: Simplify the allocation of slab caches in nfsd4_init_pnfs · 10bcc2f1
      Kunwu Chan authored
      commit 0a31bd5f ("KMEM_CACHE(): simplify slab cache creation")
      introduces a new macro.
      Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
      to simplify the creation of SLAB caches.
      Signed-off-by: default avatarKunwu Chan <chentao@kylinos.cn>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      10bcc2f1
    • NeilBrown's avatar
      nfsd: don't call locks_release_private() twice concurrently · 05eda6e7
      NeilBrown authored
      It is possible for free_blocked_lock() to be called twice concurrently,
      once from nfsd4_lock() and once from nfsd4_release_lockowner() calling
      remove_blocked_locks().  This is why a kref was added.
      
      It is perfectly safe for locks_delete_block() and kref_put() to be
      called in parallel as they use locking or atomicity respectively as
      protection.  However locks_release_private() has no locking.  It is
      safe for it to be called twice sequentially, but not concurrently.
      
      This patch moves that call from free_blocked_lock() where it could race
      with itself, to free_nbl() where it cannot.  This will slightly delay
      the freeing of private info or release of the owner - but not by much.
      It is arguably more natural for this freeing to happen in free_nbl()
      where the structure itself is freed.
      
      This bug was found by code inspection - it has not been seen in practice.
      
      Fixes: 47446d74 ("nfsd4: add refcount for nfsd4_blocked_lock")
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      05eda6e7
    • NeilBrown's avatar
      nfsd: allow layout state to be admin-revoked. · 1e33e141
      NeilBrown authored
      When there is layout state on a filesystem that is being "unlocked" that
      is now revoked, which involves closing the nfsd_file and releasing the
      vfs lease.
      
      To avoid races, ->ls_file can now be accessed either:
       - under ->fi_lock for the state's sc_file or
       - under rcu_read_lock() if nfsd_file_get() is used.
      To support this, ->fence_client and nfsd4_cb_layout_fail() now take a
      second argument being the nfsd_file.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      1e33e141
    • NeilBrown's avatar
      nfsd: allow delegation state ids to be revoked and then freed · 06efa667
      NeilBrown authored
      Revoking state through 'unlock_filesystem' now revokes any delegation
      states found.  When the stateids are then freed by the client, the
      revoked stateids will be cleaned up correctly.
      
      As there is already support for revoking delegations, we build on that
      for admin-revoking.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      06efa667
    • NeilBrown's avatar
      nfsd: allow open state ids to be revoked and then freed · 39657c74
      NeilBrown authored
      Revoking state through 'unlock_filesystem' now revokes any open states
      found.  When the stateids are then freed by the client, the revoked
      stateids will be cleaned up correctly.
      
      Possibly the related lock states should be revoked too, but a
      subsequent patch will do that for all lock state on the superblock.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      39657c74
    • NeilBrown's avatar
      nfsd: allow lock state ids to be revoked and then freed · 1c13bf9f
      NeilBrown authored
      Revoking state through 'unlock_filesystem' now revokes any lock states
      found.  When the stateids are then freed by the client, the revoked
      stateids will be cleaned up correctly.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      1c13bf9f
    • NeilBrown's avatar
      nfsd: allow admin-revoked NFSv4.0 state to be freed. · d688d858
      NeilBrown authored
      For NFSv4.1 and later the client easily discovers if there is any
      admin-revoked state and will then find and explicitly free it.
      
      For NFSv4.0 there is no such mechanism.  The client can only find that
      state is admin-revoked if it tries to use that state, and there is no
      way for it to explicitly free the state.  So the server must hold on to
      the stateid (at least) for an indefinite amount of time.  A
      RELEASE_LOCKOWNER request might justify forgetting some of these
      stateids, as would the whole clients lease lapsing, but these are not
      reliable.
      
      This patch takes two approaches.
      
      Whenever a client uses an revoked stateid, that stateid is then
      discarded and will not be recognised again.  This might confuse a client
      which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at
      all, but should mostly work.  Hopefully one error will lead to other
      resources being closed (e.g.  process exits), which will result in more
      stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED.
      
      Also, any admin-revoked stateids that have been that way for more than
      one lease time are periodically revoke.
      
      No actual freeing of state happens in this patch.  That will come in
      future patches which handle the different sorts of revoked state.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      d688d858
    • NeilBrown's avatar
      nfsd: report in /proc/fs/nfsd/clients/*/states when state is admin-revoke · 11b2cfbf
      NeilBrown authored
      Add "admin-revoked" to the status information for any states that have
      been admin-revoked.  This can be useful for confirming correct
      behaviour.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      11b2cfbf
    • NeilBrown's avatar
      nfsd: allow state with no file to appear in /proc/fs/nfsd/clients/*/states · 39e1be64
      NeilBrown authored
      Change the "show" functions to show some content even if a file cannot
      be found.  This is the case for admin-revoked state.
      This is primarily useful for debugging - to ensure states are being
      removed eventually.
      
      So change several seq_printf() to seq_puts().  Some of these are needed
      to keep checkpatch happy.  Others were done for consistency.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      39e1be64
    • NeilBrown's avatar
      nfsd: prepare for supporting admin-revocation of state · 1ac3629b
      NeilBrown authored
      The NFSv4 protocol allows state to be revoked by the admin and has error
      codes which allow this to be communicated to the client.
      
      This patch
       - introduces a new state-id status SC_STATUS_ADMIN_REVOKED
         which can be set on open, lock, or delegation state.
       - reports NFS4ERR_ADMIN_REVOKED when these are accessed
       - introduces a per-client counter of these states and returns
         SEQ4_STATUS_ADMIN_STATE_REVOKED when the counter is not zero.
         Decrements this when freeing any admin-revoked state.
       - introduces stub code to find all interesting states for a given
         superblock so they can be revoked via the 'unlock_filesystem'
         file in /proc/fs/nfsd/
         No actual states are handled yet.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      1ac3629b
    • NeilBrown's avatar
      nfsd: split sc_status out of sc_type · 3f29cc82
      NeilBrown authored
      sc_type identifies the type of a state - open, lock, deleg, layout - and
      also the status of a state - closed or revoked.
      
      This is a bit untidy and could get worse when "admin-revoked" states are
      added.  So clean it up.
      
      With this patch, the type is now all that is stored in sc_type.  This is
      zero when the state is first added to ->cl_stateids (causing it to be
      ignored), and is then set appropriately once it is fully initialised.
      It is set under ->cl_lock to ensure atomicity w.r.t lookup.  It is now
      never cleared.
      
      sc_type is still a bit-set even though at most one bit is set.  This allows
      lookup functions to be given a bitmap of acceptable types.
      
      sc_type is now an unsigned short rather than char.  There is no value in
      restricting to just 8 bits.
      
      All the constants now start SC_TYPE_ matching the field in which they
      are stored.  Keeping the existing names and ensuring clear separation
      from non-type flags would have required something like
      NFS4_STID_TYPE_CLOSED which is cumbersome.  The "NFS4" prefix is
      redundant was they only appear in NFS4 code, so remove that and change
      STID to SC to match the field.
      
      The status is stored in a separate unsigned short named "sc_status".  It
      has two flags: SC_STATUS_CLOSED and SC_STATUS_REVOKED.
      CLOSED combines NFS4_CLOSED_STID, NFS4_CLOSED_DELEG_STID, and is used
      for SC_TYPE_LOCK and SC_TYPE_LAYOUT instead of setting the sc_type to zero.
      These flags are only ever set, never cleared.
      For deleg stateids they are set under the global state_lock.
      For open and lock stateids they are set under ->cl_lock.
      For layout stateids they are set under ->ls_lock
      
      nfs4_unhash_stid() has been removed, and we never set sc_type = 0.  This
      was only used for LOCK and LAYOUT stids and they now use
      SC_STATUS_CLOSED.
      
      Also TRACE_DEFINE_NUM() calls for the various STID #define have been
      removed because these things are not enums, and so that call is
      incorrect.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      3f29cc82
    • NeilBrown's avatar
      nfsd: avoid race after unhash_delegation_locked() · 83e73316
      NeilBrown authored
      NFS4_CLOSED_DELEG_STID and NFS4_REVOKED_DELEG_STID are similar in
      purpose.
      REVOKED is used for NFSv4.1 states which have been revoked because the
      lease has expired.  CLOSED is used in other cases.
      The difference has two practical effects.
      1/ REVOKED states are on the ->cl_revoked list
      2/ REVOKED states result in nfserr_deleg_revoked from
         nfsd4_verify_open_stid() and nfsd4_validate_stateid while
         CLOSED states result in nfserr_bad_stid.
      
      Currently a state that is being revoked is first set to "CLOSED" in
      unhash_delegation_locked(), then possibly to "REVOKED" in
      revoke_delegation(), at which point it is added to the cl_revoked list.
      
      It is possible that a stateid test could see the CLOSED state
      which really should be REVOKED, and so return the wrong error code.  So
      it is safest to remove this window of inconsistency.
      
      With this patch, unhash_delegation_locked() always sets the state
      correctly, and revoke_delegation() no longer changes the state.
      
      Also remove a redundant test on minorversion when
      NFS4_REVOKED_DELEG_STID is seen - it can only be seen when minorversion
      is non-zero.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      83e73316
    • NeilBrown's avatar
      nfsd: don't call functions with side-effecting inside WARN_ON() · c6540026
      NeilBrown authored
      Code like:
      
          WARN_ON(foo())
      
      looks like an assertion and might not be expected to have any side
      effects.
      When testing if a function with side-effects fails a construct like
      
          if (foo())
             WARN_ON(1);
      
      makes the intent more obvious.
      
      nfsd has several WARN_ON calls where the test has side effects, so it
      would be good to change them.  These cases don't really need the
      WARN_ON.  They have never failed in 8 years of usage so let's just
      remove the WARN_ON wrapper.
      Suggested-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      c6540026
    • NeilBrown's avatar
      nfsd: hold ->cl_lock for hash_delegation_locked() · 77945728
      NeilBrown authored
      The protocol for creating a new state in nfsd is to allocate the state
      leaving it largely uninitialised, add that state to the ->cl_stateids
      idr so as to reserve a state-id, then complete initialisation of the
      state and only set ->sc_type to non-zero once the state is fully
      initialised.
      
      If a state is found in the idr with ->sc_type == 0, it is ignored.
      The ->cl_lock lock is used to avoid races - it is held while checking
      sc_type during lookup, and held when a non-zero value is stored in
      ->sc_type.
      
      ... except... hash_delegation_locked() finalises the initialisation of a
      delegation state, but does NOT hold ->cl_lock.
      
      So this patch takes ->cl_lock at the appropriate time w.r.t other locks,
      and so ensures there are no races (which are extremely unlikely in any
      case).
      As ->fi_lock is often taken when ->cl_lock is held, we need to take
      ->cl_lock first of those two.
      Currently ->cl_lock and state_lock are never both taken at the same time.
      We need both for this patch so an arbitrary choice is needed concerning
      which to take first.  As state_lock is more global, it might be more
      contended, so take it first.
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      77945728
    • NeilBrown's avatar
      nfsd: remove stale comment in nfs4_show_deleg() · 6b4ca49d
      NeilBrown authored
      As we do now support write delegations, this comment is unhelpful and
      misleading.
      Reported-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6b4ca49d
    • Chuck Lever's avatar
      NFSD: Remove redundant cb_seq_status initialization · e3179e44
      Chuck Lever authored
      As far as I can see, setting cb_seq_status in nfsd4_init_cb() is
      superfluous because it is set again in nfsd4_cb_prepare().
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      e3179e44