1. 27 Jul, 2017 1 commit
    • NeilBrown's avatar
      NFS: invalidate file size when taking a lock. · 442ce049
      NeilBrown authored
      Prior to commit ca0daa27 ("NFS: Cache aggressively when file is open
      for writing"), NFS would revalidate, or invalidate, the file size when
      taking a lock.  Since that commit it only invalidates the file content.
      
      If the file size is changed on the server while wait for the lock, the
      client will have an incorrect understanding of the file size and could
      corrupt data.  This particularly happens when writing beyond the
      (supposed) end of file and can be easily be demonstrated with
      posix_fallocate().
      
      If an application opens an empty file, waits for a write lock, and then
      calls posix_fallocate(), glibc will determine that the underlying
      filesystem doesn't support fallocate (assuming version 4.1 or earlier)
      and will write out a '0' byte at the end of each 4K page in the region
      being fallocated that is after the end of the file.
      NFS will (usually) detect that these writes are beyond EOF and will
      expand them to cover the whole page, and then will merge the pages.
      Consequently, NFS will write out large blocks of zeroes beyond where it
      thought EOF was.  If EOF had moved, the pre-existing part of the file
      will be over-written.  Locking should have protected against this,
      but it doesn't.
      
      This patch restores the use of nfs_zap_caches() which invalidated the
      cached attributes.  When posix_fallocate() asks for the file size, the
      request will go to the server and get a correct answer.
      
      cc: stable@vger.kernel.org (v4.8+)
      Fixes: ca0daa27 ("NFS: Cache aggressively when file is open for writing")
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      442ce049
  2. 26 Jul, 2017 1 commit
  3. 21 Jul, 2017 7 commits
  4. 19 Jul, 2017 5 commits
  5. 13 Jul, 2017 26 commits
    • Trond Myklebust's avatar
      NFS: Don't run wake_up_bit() when nobody is waiting... · 301bfa48
      Trond Myklebust authored
      "perf lock" shows fairly heavy contention for the bit waitqueue locks
      when doing an I/O heavy workload.
      Use a bit to tell whether or not there has been contention for a lock
      so that we can optimise away the bit waitqueue options in those cases.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      301bfa48
    • Peng Tao's avatar
      nfs: add export operations · 00422483
      Peng Tao authored
      This support for opening files on NFS by file handle, both through the
      open_by_handle syscall, and for re-exporting NFS (for example using a
      different version).  The support is very basic for now, as each open by
      handle will have to do an NFSv4 open operation on the wire.  In the
      future this will hopefully be mitigated by an open file cache, as well
      as various optimizations in NFS for this specific case.
      Signed-off-by: default avatarPeng Tao <tao.peng@primarydata.com>
      [hch: incorporated various changes, resplit the patches, new changelog]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      00422483
    • Jeff Layton's avatar
      nfs4: add NFSv4 LOOKUPP handlers · 5b5faaf6
      Jeff Layton authored
      This will be needed in order to implement the get_parent export op
      for nfsd.
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      5b5faaf6
    • Peng Tao's avatar
      nfs: add a nfs_ilookup helper · f174ff7a
      Peng Tao authored
      This helper will allow to find an existing NFS inode by the file handle
      and fattr.
      Signed-off-by: default avatarPeng Tao <tao.peng@primarydata.com>
      [hch: split from a larger patch]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      f174ff7a
    • Peng Tao's avatar
      nfs: replace d_add with d_splice_alias in atomic_open · 774d9513
      Peng Tao authored
      It's a trival change but follows knfsd export document that asks
      for d_splice_alias during lookup.
      Signed-off-by: default avatarPeng Tao <tao.peng@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      774d9513
    • Jason A. Donenfeld's avatar
      sunrpc: use constant time memory comparison for mac · 15a8b93f
      Jason A. Donenfeld authored
      Otherwise, we enable a MAC forgery via timing attack.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Anna Schumaker <anna.schumaker@netapp.com>
      Cc: linux-nfs@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      15a8b93f
    • Olga Kornievskaia's avatar
      NFSv4.2 fix size storage for nfs42_proc_copy · 1ee48bdd
      Olga Kornievskaia authored
      Return size of COPY is u64 but it was assigned to an "int" status.
      Signed-off-by: default avatarOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      1ee48bdd
    • Chuck Lever's avatar
      xprtrdma: Fix documenting comments in frwr_ops.c · 6afafa77
      Chuck Lever authored
      Clean up.
      
      FASTREG and LOCAL_INV WRs are typically not signaled. localinv_wake
      is used for the last LOCAL_INV WR in a chain, which is always
      signaled. The documenting comments should reflect that.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      6afafa77
    • Chuck Lever's avatar
      xprtrdma: Replace PAGE_MASK with offset_in_page() · d933cc32
      Chuck Lever authored
      Clean up.
      
      Reported by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      d933cc32
    • Chuck Lever's avatar
      xprtrdma: FMR does not need list_del_init() · e2f6ef09
      Chuck Lever authored
      Clean up.
      
      Commit 38f1932e ("xprtrdma: Remove FMRs from the unmap list
      after unmapping") utilized list_del_init() to try to prevent some
      list corruption. The corruption was actually caused by the reply
      handler racing with a signal. Now that MR invalidation is properly
      serialized, list_del_init() can safely be replaced.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      e2f6ef09
    • Chuck Lever's avatar
      xprtrdma: Demote "connect" log messages · 173b8f49
      Chuck Lever authored
      Some have complained about the log messages generated when xprtrdma
      opens or closes a connection to a server. When an NFS mount is
      mostly idle these can appear every few minutes as the client idles
      out the connection and reconnects.
      
      Connection and disconnection is a normal part of operation, and not
      exceptional, so change these to dprintk's for now. At some point
      all of these will be converted to tracepoints, but that's for
      another day.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      173b8f49
    • Chuck Lever's avatar
      NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration · 838edb94
      Chuck Lever authored
      Transparent State Migration copies a client's lease state from the
      server where a filesystem used to reside to the server where it now
      resides. When an NFSv4.1 client first contacts that destination
      server, it uses EXCHANGE_ID to detect trunking relationships.
      
      The lease that was copied there is returned to that client, but the
      destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
      the client. This is because the lease was confirmed on the source
      server (before it was copied).
      
      When CONFIRMED_R is set, the client throws away the sequence ID
      returned by the server. During a Transparent State Migration, however
      there's no other way for the client to know what sequence ID to use
      with a lease that's been migrated.
      
      Therefore, the client must save and use the contrived slot sequence
      value returned by the destination server even when CONFIRMED_R is
      set.
      
      Note that some servers always return a seqid of 1 after a migration.
      Reported-by: default avatarXuan Qi <xuan.qi@oracle.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Tested-by: default avatarXuan Qi <xuan.qi@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      838edb94
    • Chuck Lever's avatar
      NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration · 8dcbec6d
      Chuck Lever authored
      Transparent State Migration copies a client's lease state from the
      server where a filesystem used to reside to the server where it now
      resides. When an NFSv4.1 client first contacts that destination
      server, it uses EXCHANGE_ID to detect trunking relationships.
      
      The lease that was copied there is returned to that client, but the
      destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
      the client. This is because the lease was confirmed on the source
      server (before it was copied).
      
      Normally, when CONFIRMED_R is set, a client purges the lease and
      creates a new one. However, that throws away the entire benefit of
      Transparent State Migration.
      
      Therefore, the client must not purge that lease when it is possible
      that Transparent State Migration has occurred.
      Reported-by: default avatarXuan Qi <xuan.qi@oracle.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Tested-by: default avatarXuan Qi <xuan.qi@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      8dcbec6d
    • Chuck Lever's avatar
      xprtrdma: Don't defer MR recovery if ro_map fails · 1f541895
      Chuck Lever authored
      Deferred MR recovery does a DMA-unmapping of the MW. However, ro_map
      invokes rpcrdma_defer_mr_recovery in some error cases where the MW
      has not even been DMA-mapped yet.
      
      Avoid a DMA-unmapping error replacing rpcrdma_defer_mr_recovery.
      
      Also note that if ib_dma_map_sg is asked to map 0 nents, it will
      return 0. So the extra "if (i == 0)" check is no longer needed.
      
      Fixes: 42fe28f6 ("xprtrdma: Do not leak an MW during a DMA ...")
      Fixes: 505bbe64 ("xprtrdma: Refactor MR recovery work queues")
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      1f541895
    • Chuck Lever's avatar
      xprtrdma: Fix FRWR invalidation error recovery · 8d75483a
      Chuck Lever authored
      When ib_post_send() fails, all LOCAL_INV WRs past @bad_wr have to be
      examined, and the MRs reset by hand.
      
      I'm not sure how the existing code can work by comparing R_keys.
      Restructure the logic so that instead it walks the chain of WRs,
      starting from the first bad one.
      
      Make sure to wait for completion if at least one WR was actually
      posted. Otherwise, if the ib_post_send fails, we can end up
      DMA-unmapping the MR while LOCAL_INV operations are in flight.
      
      Commit 7a89f9c6 ("xprtrdma: Honor ->send_request API contract")
      added the rdma_disconnect() call site. The disconnect actually
      causes more problems than it solves, and SQ overruns happen only as
      a result of software bugs. So remove it.
      
      Fixes: d7a21c1b ("xprtrdma: Reset MRs in frwr_op_unmap_sync()")
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      8d75483a
    • Chuck Lever's avatar
      xprtrdma: Fix client lock-up after application signal fires · 431af645
      Chuck Lever authored
      After a signal, the RPC client aborts synchronous RPCs running on
      behalf of the signaled application.
      
      The server is still executing those RPCs, and will write the results
      back into the client's memory when it's done. By the time the server
      writes the results, that memory is likely being used for other
      purposes. Therefore xprtrdma has to immediately invalidate all
      memory regions used by those aborted RPCs to prevent the server's
      writes from clobbering that re-used memory.
      
      With FMR memory registration, invalidation takes a relatively long
      time. In fact, the invalidation is often still running when the
      server tries to write the results into the memory regions that are
      being invalidated.
      
      This sets up a race between two processes:
      
      1.  After the signal, xprt_rdma_free calls ro_unmap_safe.
      2.  While ro_unmap_safe is still running, the server replies and
          rpcrdma_reply_handler runs, calling ro_unmap_sync.
      
      Both processes invoke ib_unmap_fmr on the same FMR.
      
      The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
      the same time, but HCAs generally don't tolerate this. Sometimes
      this can result in a system crash.
      
      If the HCA happens to survive, rpcrdma_reply_handler continues. It
      removes the rpc_rqst from rq_list and releases the transport_lock.
      This enables xprt_rdma_free to run in another process, and the
      rpc_rqst is released while rpcrdma_reply_handler is still waiting
      for the ib_unmap_fmr call to finish.
      
      But further down in rpcrdma_reply_handler, the transport_lock is
      taken again, and "rqst" is dereferenced. If "rqst" has already been
      released, this triggers a general protection fault. Since bottom-
      halves are disabled, the system locks up.
      
      Address both issues by reversing the order of the xprt_lookup_rqst
      call and the ro_unmap_sync call. Introduce a separate lookup
      mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
      xprt_lookup_rqst. Now the handler takes the transport_lock once
      and holds it for the XID lookup and RPC completion.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      431af645
    • Chuck Lever's avatar
      xprtrdma: Rename rpcrdma_req::rl_free · a80d66c9
      Chuck Lever authored
      Clean up: I'm about to use the rl_free field for purposes other than
      a free list. So use a more generic name.
      
      This is a refactoring change only.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      a80d66c9
    • Chuck Lever's avatar
      xprtrdma: Pass only the list of registered MRs to ro_unmap_sync · 451d26e1
      Chuck Lever authored
      There are rare cases where an rpcrdma_req can be re-used (via
      rpcrdma_buffer_put) while the RPC reply handler is still running.
      This is due to a signal firing at just the wrong instant.
      
      Since commit 9d6b0409 ("xprtrdma: Place registered MWs on a
      per-req list"), rpcrdma_mws are self-contained; ie., they fully
      describe an MR and scatterlist, and no part of that information is
      stored in struct rpcrdma_req.
      
      As part of closing the above race window, pass only the req's list
      of registered MRs to ro_unmap_sync, rather than the rpcrdma_req
      itself.
      
      Some extra transport header sanity checking is removed. Since the
      client depends on its own recollection of what memory had been
      registered, there doesn't seem to be a way to abuse this change.
      
      And, the check was not terribly effective. If the client had sent
      Read chunks, the "list_empty" test is negative in both of the
      removed cases, which are actually looking for Write or Reply
      chunks.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      451d26e1
    • Chuck Lever's avatar
      xprtrdma: Pre-mark remotely invalidated MRs · 4b196dc6
      Chuck Lever authored
      There are rare cases where an rpcrdma_req and its matched
      rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC
      reply handler is still using that req. This is typically due to a
      signal firing at just the wrong instant.
      
      As part of closing this race window, avoid using the wrong
      rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as
      invalidated while we are sure the rep is still OK to use.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      4b196dc6
    • Chuck Lever's avatar
      xprtrdma: On invalidation failure, remove MWs from rl_registered · 04d25b7d
      Chuck Lever authored
      Callers assume the ro_unmap_sync and ro_unmap_safe methods empty
      the list of registered MRs. Ensure that all paths through
      fmr_op_unmap_sync() remove MWs from that list.
      
      Fixes: 9d6b0409 ("xprtrdma: Place registered MWs on a ... ")
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      04d25b7d
    • NeilBrown's avatar
      NFS: check for nfs_refresh_inode() errors in nfs_fhget() · 26fde4df
      NeilBrown authored
      If an NFS server returns a filehandle that we have previously
      seen, and reports a different type, then nfs_refresh_inode()
      will log a warning and return an error.
      
      nfs_fhget() does not check for this error and may return an
      inode with a different type than the one that the server
      reported.
      
      This is likely to cause confusion, and is one way that
      ->open_context() could return a directory inode as discussed
      in the previous patch.
      
      So if nfs_refresh_inode() returns and error, return that error
      from nfs_fhget() to avoid the confusion propagating.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      26fde4df
    • NeilBrown's avatar
      NFS: guard against confused server in nfs_atomic_open() · eaa2b82c
      NeilBrown authored
      A confused server could return a filehandle for an
      NFSv4 OPEN request, which it previously returned for a directory.
      So the inode returned by  ->open_context() in nfs_atomic_open()
      could conceivably be a directory inode.
      
      This has particular implications for the call to
      nfs_file_set_open_context() in nfs_finish_open().
      If that is called on a directory inode, then the nfs_open_context
      that gets stored in the filp->private_data will be linked to
      nfs_inode->open_files.
      
      When the directory is closed, nfs_closedir() will (ultimately)
      free the ->private_data, but not unlink it from nfs_inode->open_files
      (because it doesn't expect an nfs_open_context there).
      
      Subsequently the memory could get used for something else and eventually
      if the ->open_files list is walked, the walker will fall off the end and
      crash.
      
      So: change nfs_finish_open() to only call nfs_file_set_open_context()
      for regular-file inodes.
      
      This failure mode has been seen in a production setting (unknown NFS
      server implementation).  The kernel was v3.0 and the specific sequence
      seen would not affect more recent kernels, but I think a risk is still
      present, and caution is wise.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      eaa2b82c
    • NeilBrown's avatar
      NFS: only invalidate dentrys that are clearly invalid. · cc89684c
      NeilBrown authored
      Since commit bafc9b75 ("vfs: More precise tests in d_invalidate")
      in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
      to be invalidated even if it has filesystems mounted on or it or on a
      descendant.  The mounted filesystem is unmounted.
      
      This means we need to be careful not to return 0 unless the directory
      referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
      the directory.  Other errors such a -EPERM or -ERESTARTSYS should be
      returned from ->d_revalidate() so they are propagated to the caller.
      
      A particular problem can be demonstrated by:
      
      1/ mount an NFS filesystem using NFSv3 on /mnt
      2/ mount any other filesystem on /mnt/foo
      3/ ls /mnt/foo
      4/ turn off network, or otherwise make the server unable to respond
      5/ ls /mnt/foo &
      6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
      7/ kill -9 $! # this results in -ERESTARTSYS being returned
      8/ observe that /mnt/foo has been unmounted.
      
      This patch changes nfs_lookup_revalidate() to only treat
        -ESTALE from nfs_lookup_verify_inode() and
        -ESTALE or -ENOENT from ->lookup()
      as indicating an invalid inode.  Other errors are returned.
      
      Also nfs_check_inode_attributes() is changed to return -ESTALE rather
      than -EIO.  This is consistent with the error returned in similar
      circumstances from nfs_update_inode().
      
      As this bug allows any user to unmount a filesystem mounted on an NFS
      filesystem, this fix is suitable for stable kernels.
      
      Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate")
      Cc: stable@vger.kernel.org (v3.18+)
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      cc89684c
    • Olga Kornievskaia's avatar
      PNFS for stateid errors retry against MDS first · 22368ff1
      Olga Kornievskaia authored
      Upon receiving a stateid error such as BAD_STATEID, the client
      should retry the operation against the MDS before deciding to
      do stateid recovery.
      
      Previously, the code would initiate state recovery and it could
      lead to a race in a state manager that could chose an incorrect
      recovery method which would lead to the EIO failure for the
      application.
      Signed-off-by: default avatarOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      22368ff1
    • Olga Kornievskaia's avatar
      PNFS fix EACCESS on commit to DS handling · a0bc01e0
      Olga Kornievskaia authored
      Commit fabbbee0 "PNFS fix fallback to MDS if got error on
      commit to DS" moved the pnfs_set_lo_fail() to unhandled errors
      which was not correct and lead to a kernel oops on umount.
      
      Instead, fix the original EACCESS on commit to DS error by
      getting the new layout and re-doing the IO.
      
      Fixes: fabbbee0 ("PNFS fix fallback to MDS if got error on commit to DS")
      Signed-off-by: default avatarOlga Kornievskaia <kolga@netapp.com>
      Cc: stable@vger.kernel.org # v4.12
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      a0bc01e0
    • Dan Carpenter's avatar
      NFS: silence a uninitialized variable warning · 4cd1ec95
      Dan Carpenter authored
      Static checkers have gotten clever enough to complain that "id_long" is
      uninitialized on the failure path.  It's harmless, but simple to fix.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      4cd1ec95