1. 13 Jul, 2017 40 commits
    • Merge tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs · b86faee6
      Linus Torvalds authored
      Pull NFS client updates from Anna Schumaker:
       "Stable bugfixes:
         - Fix -EACCES on commit to DS handling
         - Fix initialization of nfs_page_array->npages
         - Only invalidate dentries that are actually invalid
      
        Features:
         - Enable NFSoRDMA transparent state migration
         - Add support for lookup-by-filehandle
         - Add support for nfs re-exporting
      
        Other bugfixes and cleanups:
         - Christoph cleaned up the way we declare NFS operations
         - Clean up various internal structures
         - Various cleanups to commits
         - Various improvements to error handling
         - Set the dt_type of . and .. entries in NFS v4
         - Make slot allocation more reliable
         - Fix fscache stat printing
         - Fix uninitialized variable warnings
         - Fix potential list overrun in nfs_atomic_open()
         - Fix a race in NFSoRDMA RPC reply handler
         - Fix return size for nfs42_proc_copy()
         - Protect against MAC forgery timing attacks"
      
      * tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (68 commits)
        NFS: Don't run wake_up_bit() when nobody is waiting...
        nfs: add export operations
        nfs4: add NFSv4 LOOKUPP handlers
        nfs: add a nfs_ilookup helper
        nfs: replace d_add with d_splice_alias in atomic_open
        sunrpc: use constant time memory comparison for mac
        NFSv4.2 fix size storage for nfs42_proc_copy
        xprtrdma: Fix documenting comments in frwr_ops.c
        xprtrdma: Replace PAGE_MASK with offset_in_page()
        xprtrdma: FMR does not need list_del_init()
        xprtrdma: Demote "connect" log messages
        NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
        NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration
        xprtrdma: Don't defer MR recovery if ro_map fails
        xprtrdma: Fix FRWR invalidation error recovery
        xprtrdma: Fix client lock-up after application signal fires
        xprtrdma: Rename rpcrdma_req::rl_free
        xprtrdma: Pass only the list of registered MRs to ro_unmap_sync
        xprtrdma: Pre-mark remotely invalidated MRs
        xprtrdma: On invalidation failure, remove MWs from rl_registered
        ...
    • Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 48ea2ced
      Linus Torvalds authored
      Pull SCSI target updates from Nicholas Bellinger:
       "It's been unusually busy this summer, with most of the efforts centered
        around TCMU developments and various target-core + fabric driver bug
        fixing activities. Not particularly large in terms of LoC, but lots of
        smaller patches from many different folks.
      
        The highlights include:
      
         - ibmvscsis logical partition manager support (Michael Cyr + Bryant
           Ly)
      
         - Convert target/iblock WRITE_SAME to blkdev_issue_zeroout (hch +
           nab)
      
         - Add support for TMR percpu LUN reference counting (nab)
      
         - Fix a potential deadlock between EXTENDED_COPY and iscsi shutdown
           (Bart)
      
         - Fix COMPARE_AND_WRITE caw_sem leak during se_cmd quiesce (Jiang Yi)
      
         - Fix TCMU module removal (Xiubo Li)
      
         - Fix iser-target OOPs during login failure (Andrea Righi + Sagi)
      
         - Breakup target-core free_device backend driver callback (mnc)
      
         - Perform TCMU add/delete/reconfig synchronously (mnc)
      
         - Fix TCMU multiple UIO open/close sequences (mnc)
      
         - Fix TCMU CHECK_CONDITION sense handling (mnc)
      
         - Fix target-core SAM_STAT_BUSY + TASK_SET_FULL handling (mnc + nab)
      
         - Introduce TYPE_ZBC support in PSCSI (Damien Le Moal)
      
         - Fix possible TCMU memory leak + OOPs when recalculating cmd base
           size (Xiubo Li + Bryant Ly + Damien Le Moal + mnc)
      
         - Add login_keys_workaround attribute for non RFC initiators (Robert
           LeBlanc + Arun Easi + nab)"
      
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (68 commits)
        iscsi-target: Add login_keys_workaround attribute for non RFC initiators
        Revert "qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT"
        tcmu: clean up the code and with one small fix
        tcmu: Fix possbile memory leak / OOPs when recalculating cmd base size
        target: export lio pgr/alua support as device attr
        target: Fix return sense reason in target_scsi3_emulate_pr_out
        target: Fix cmd size for PR-OUT in passthrough_parse_cdb
        tcmu: Fix dev_config_store
        target: pscsi: Introduce TYPE_ZBC support
        target: Use macro for WRITE_VERIFY_32 operation codes
        target: fix SAM_STAT_BUSY/TASK_SET_FULL handling
        target: remove transport_complete
        pscsi: finish cmd processing from pscsi_req_done
        tcmu: fix sense handling during completion
        target: add helper to copy sense to se_cmd buffer
        target: do not require a transport_complete for SCF_TRANSPORT_TASK_SENSE
        target: make device_mutex and device_list static
        tcmu: Fix flushing cmd entry dcache page
        tcmu: fix multiple uio open/close sequences
        tcmu: drop configured check in destroy
        ...
    • NFS: Don't run wake_up_bit() when nobody is waiting... · b4f937cf
      Trond Myklebust authored
      "perf lock" shows fairly heavy contention for the bit waitqueue locks
      when doing an I/O heavy workload.
      Use a bit to tell whether or not there has been contention for a lock
      so that we can optimise away the bit waitqueue options in those cases.
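      A minimal userspace sketch of the idea, assuming C11 atomics: the lock
      owner clears both bits on unlock and only pays for a wakeup when the
      contention bit was set. The flag names, struct bitlock and the wakeups
      counter (which stands in for wake_up_bit()) are hypothetical, not the
      fs/nfs code. A real waiter must also re-check the lock bit after
      setting the contention bit and before sleeping, or it can miss its
      wakeup; this sketch leaves sleeping out entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; not the names used in fs/nfs. */
#define LOCK_BIT      (1u << 0) /* lock is held */
#define CONTENDED_BIT (1u << 1) /* someone failed to get it and may wait */

struct bitlock {
    atomic_uint flags;
    unsigned int wakeups; /* stands in for calls to wake_up_bit() */
};

/* Try to take the lock; on failure, record that there is contention. */
static bool bitlock_trylock(struct bitlock *l)
{
    unsigned int old = atomic_fetch_or(&l->flags, LOCK_BIT);
    if (!(old & LOCK_BIT))
        return true;
    /* A would-be waiter marks contention before it goes to sleep. */
    atomic_fetch_or(&l->flags, CONTENDED_BIT);
    return false;
}

/* Release the lock; only pay for a wakeup if someone was contending. */
static void bitlock_unlock(struct bitlock *l)
{
    unsigned int old = atomic_fetch_and(&l->flags, ~(LOCK_BIT | CONTENDED_BIT));
    if (old & CONTENDED_BIT)
        l->wakeups++; /* the real code would call wake_up_bit() here */
}
```

      In the uncontended case, lock and unlock touch only the flags word and
      never the shared bit waitqueue.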
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • nfs: add export operations · 20fa1902
      Peng Tao authored
      This adds support for opening files on NFS by file handle, both through the
      open_by_handle_at() syscall, and for re-exporting NFS (for example using a
      different version).  The support is very basic for now, as each open by
      handle will have to do an NFSv4 open operation on the wire.  In the
      future this will hopefully be mitigated by an open file cache, as well
      as various optimizations in NFS for this specific case.
      Signed-off-by: Peng Tao <tao.peng@primarydata.com>
      [hch: incorporated various changes, resplit the patches, new changelog]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Merge tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux · 62403005
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
       "Chuck's RDMA update overhauls the "call receive" side of the
        RPC-over-RDMA transport to use the new rdma_rw API.
      
        Christoph cleaned up the way NFS operations are declared, removing a
        bunch of function-pointer casts and declaring the operation vectors as
        const.
      
        Christoph's changes touch both client and server, and both client and
        server pulls this time around should be based on the same commits from
        Christoph"
      
      * tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux: (53 commits)
        svcrdma: fix an incorrect check on -E2BIG and -EINVAL
        nfsd4: factor ctime into change attribute
        svcrdma: Remove svc_rdma_chunk_ctxt::cc_dir field
        svcrdma: use offset_in_page() macro
        svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw API
        svcrdma: Clean-up svc_rdma_unmap_dma
        svcrdma: Remove frmr cache
        svcrdma: Remove unused Read completion handlers
        svcrdma: Properly compute .len and .buflen for received RPC Calls
        svcrdma: Use generic RDMA R/W API in RPC Call path
        svcrdma: Add recvfrom helpers to svc_rdma_rw.c
        sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
        svcrdma: Don't account for Receive queue "starvation"
        svcrdma: Improve Reply chunk sanity checking
        svcrdma: Improve Write chunk sanity checking
        svcrdma: Improve Read chunk sanity checking
        svcrdma: Remove svc_rdma_marshal.c
        svcrdma: Avoid Send Queue overflow
        svcrdma: Squelch disconnection messages
        sunrpc: Disable splice for krb5i
        ...
    • Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 19c6e12c
      Linus Torvalds authored
      Pull ext2, udf, reiserfs fixes from Jan Kara:
       "Several ext2, udf, and reiserfs fixes"
      
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        ext2: Fix memory leak when truncate races ext2_get_blocks
        reiserfs: fix race in prealloc discard
        reiserfs: don't preallocate blocks for extended attributes
        udf: Convert udf_disk_stamp_to_time() to use mktime64()
        udf: Use time64_to_tm for timestamp conversion
        udf: Fix deadlock between writeback and udf_setsize()
        udf: Use i_size_read() in udf_adinicb_writepage()
        udf: Fix races with i_size changes during readpage
        udf: Remove unused UDF_DEFAULT_BLOCKSIZE
    • Merge tag '4.13-fixes' of git://git.lwn.net/linux · 954e6e03
      Linus Torvalds authored
      Pull documentation fixes from Jonathan Corbet:
       "A set of fixes for various warnings, including the one caused by the
        removal of kernel/rcu/srcu.c. Also correct a stray pointer in
        memory-barriers.txt"
      
      * tag '4.13-fixes' of git://git.lwn.net/linux:
        kokr/memory-barriers.txt: Fix obsolete link to atomic_ops.txt
        memory-barriers.txt: Fix broken link to atomic_ops.txt
        docs: Turn off section numbering for the input docs
        docs: Include uaccess docs from the right file
        docs: Do not include from kernel/rcu/srcu.c
    • Merge tag 'kbuild-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 80fc6238
      Linus Torvalds authored
      Pull more Kbuild updates from Masahiro Yamada:
      
       - Move generic-y of exported headers to uapi/asm/Kbuild for complete
         de-coupling of UAPI
      
       - Clean up scripts/Makefile.headersinst
      
       - Fix host programs for 32-bit machines with an XFS file system
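      For context on the 32-bit/XFS item: without Large File Support, a
      32-bit host program gets 32-bit off_t and ino_t, and stat() can fail
      with EOVERFLOW when XFS hands out inode numbers above 2^32. A sketch of
      the feature-test-macro route, assuming glibc, where
      _FILE_OFFSET_BITS=64 must be defined before any system header
      (`getconf LFS_CFLAGS` reports the equivalent compiler flags for the
      host). How Kbuild wires this up is not shown here.

```c
/* Must precede every system header, or glibc keeps the 32-bit types. */
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>

/*
 * With Large File Support in effect, off_t and ino_t are 64 bits wide even
 * on a 32-bit host, so stat() can report large inode numbers instead of
 * failing with EOVERFLOW.
 */
_Static_assert(sizeof(off_t) == 8, "off_t should be 64-bit with LFS");
_Static_assert(sizeof(ino_t) == 8, "ino_t should be 64-bit with LFS");
```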
      
      * tag 'kbuild-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (29 commits)
        kbuild: Enable Large File Support for hostprogs
        kbuild: remove wrapper files handling from Makefile.headersinst
        kbuild: split exported generic header creation into uapi-asm-generic
        kbuild: do not include old-kbuild-file from Makefile.headersinst
        xtensa: move generic-y of exported headers to uapi/asm/Kbuild
        unicore32: move generic-y of exported headers to uapi/asm/Kbuild
        tile: move generic-y of exported headers to uapi/asm/Kbuild
        sparc: move generic-y of exported headers to uapi/asm/Kbuild
        sh: move generic-y of exported headers to uapi/asm/Kbuild
        parisc: move generic-y of exported headers to uapi/asm/Kbuild
        openrisc: move generic-y of exported headers to uapi/asm/Kbuild
        nios2: move generic-y of exported headers to uapi/asm/Kbuild
        nios2: remove unneeded arch/nios2/include/(generated/)asm/signal.h
        microblaze: move generic-y of exported headers to uapi/asm/Kbuild
        metag: move generic-y of exported headers to uapi/asm/Kbuild
        m68k: move generic-y of exported headers to uapi/asm/Kbuild
        m32r: move generic-y of exported headers to uapi/asm/Kbuild
        ia64: remove redundant generic-y += kvm_para.h from asm/Kbuild
        hexagon: move generic-y of exported headers to uapi/asm/Kbuild
        h8300: move generic-y of exported headers to uapi/asm/Kbuild
        ...
    • Merge tag 'trace-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · bc0f51d3
      Linus Torvalds authored
      Pull more tracing updates from Steven Rostedt:
       "A few more minor updates:
      
         - Show the tgid mappings for user space trace tools to use
      
         - Fix and optimize the comm and tgid cache recording
      
         - Sanitize derived kprobe names
      
         - Ftrace selftest updates
      
         - trace file header fix
      
         - Update of Documentation/trace/ftrace.txt
      
         - Compiler warning fixes
      
         - Fix possible uninitialized variable"
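      On "Sanitize derived kprobe names": when no event name is given, one is
      derived from the probed symbol, and symbols such as GCC's "foo.isra.0"
      can contain characters that do not belong in an event name. A
      hypothetical sanitizer, assuming only letters, digits and '_' are
      allowed (the exact character set the kernel accepts is not stated
      here):

```c
#include <ctype.h>

/*
 * Hypothetical helper (not the kernel's): rewrite, in place, every
 * character other than a letter, digit or '_' to '_'.
 */
static void sanitize_name_sketch(char *name)
{
    for (; *name != '\0'; name++)
        if (!isalnum((unsigned char)*name) && *name != '_')
            *name = '_';
}
```

      With this rule, "foo.isra.0" becomes "foo_isra_0".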
      
      * tag 'trace-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ftrace: Fix uninitialized variable in match_records()
        ftrace: Remove an unneeded NULL check
        ftrace: Hide cached module code for !CONFIG_MODULES
        tracing: Do note expose stack_trace_filter without DYNAMIC_FTRACE
        tracing: Update Documentation/trace/ftrace.txt
        tracing: Fixup trace file header alignment
        selftests/ftrace: Add a testcase for kprobe event naming
        selftests/ftrace: Add a test to probe module functions
        selftests/ftrace: Update multiple kprobes test for powerpc
        trace/kprobes: Sanitize derived event names
        tracing: Attempt to record other information even if some fail
        tracing: Treat recording tgid for idle task as a success
        tracing: Treat recording comm for idle task as a success
        tracing: Add saved_tgids file to show cached pid to tgid mappings
    • nfs4: add NFSv4 LOOKUPP handlers · 5b5faaf6
      Jeff Layton authored
      This will be needed in order to implement the get_parent export op
      for nfsd.
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • nfs: add a nfs_ilookup helper · f174ff7a
      Peng Tao authored
      This helper allows finding an existing NFS inode by its file handle
      and fattr.
      Signed-off-by: Peng Tao <tao.peng@primarydata.com>
      [hch: split from a larger patch]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • nfs: replace d_add with d_splice_alias in atomic_open · 774d9513
      Peng Tao authored
      It's a trivial change, but it follows the knfsd export documentation,
      which asks for d_splice_alias() during lookup.
      Signed-off-by: Peng Tao <tao.peng@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • sunrpc: use constant time memory comparison for mac · 15a8b93f
      Jason A. Donenfeld authored
      Otherwise, a timing attack could be used to forge a MAC.
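      A sketch of a constant-time comparison in userspace C: every byte is
      examined and the differences are OR-ed together, so the running time
      depends only on the length, never on the position of the first
      mismatch (which is what an early-exit memcmp() leaks). The kernel's
      crypto_memneq() has the same contract (0 when equal); ct_memneq()
      below is a hypothetical stand-in, and plain C gives no hard guarantee
      that a compiler will not reintroduce an early exit.

```c
#include <stddef.h>

/*
 * Compare two MACs without exiting early. Returns 0 if the buffers are
 * equal, nonzero otherwise. volatile discourages the compiler from
 * short-circuiting the loop.
 */
static int ct_memneq(const void *a, const void *b, size_t len)
{
    const volatile unsigned char *pa = a;
    const volatile unsigned char *pb = b;
    unsigned char diff = 0;

    for (size_t i = 0; i < len; i++)
        diff |= pa[i] ^ pb[i]; /* accumulate differences; never break out */

    return diff != 0;
}
```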
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Anna Schumaker <anna.schumaker@netapp.com>
      Cc: linux-nfs@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4.2 fix size storage for nfs42_proc_copy · 1ee48bdd
      Olga Kornievskaia authored
      The return size of COPY is a u64, but it was assigned to an "int" status.
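      A small illustration of the bug class, assuming a target where int is
      32 bits: squeezing a 64-bit byte count into an int loses the upper
      bits, so a 4 GiB copy looks like a 0-byte one. Both functions are
      hypothetical and are not nfs42_proc_copy(). (Converting an
      out-of-range value to int is implementation-defined in C; GCC and
      Clang wrap modulo 2^32.)

```c
#include <stdint.h>

/* Buggy pattern: the 64-bit COPY result is stored in an int. */
static int64_t copy_result_buggy(uint64_t bytes_copied)
{
    int status = (int)bytes_copied; /* upper 32 bits are lost */
    return status;
}

/* Fixed pattern: keep the result in a 64-bit type end to end. */
static int64_t copy_result_fixed(uint64_t bytes_copied)
{
    int64_t res = (int64_t)bytes_copied;
    return res;
}
```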
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix documenting comments in frwr_ops.c · 6afafa77
      Chuck Lever authored
      Clean up.
      
      FASTREG and LOCAL_INV WRs are typically not signaled. localinv_wake
      is used for the last LOCAL_INV WR in a chain, which is always
      signaled. The documenting comments should reflect that.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Replace PAGE_MASK with offset_in_page() · d933cc32
      Chuck Lever authored
      Clean up.
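      For reference, the kernel defines offset_in_page(p) as
      ((unsigned long)(p) & ~PAGE_MASK), i.e. the byte offset of an address
      within its page, so the helper simply names the open-coded mask. A
      userspace rendering, assuming 4 KiB pages:

```c
/* Assumes 4 KiB pages; the kernel derives these from the architecture. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Same expression as the kernel macro: offset of p within its page. */
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
```

      For example, offset_in_page(0x12345) is 0x345.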
      
      Reported by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: FMR does not need list_del_init() · e2f6ef09
      Chuck Lever authored
      Clean up.
      
      Commit 38f1932e ("xprtrdma: Remove FMRs from the unmap list
      after unmapping") utilized list_del_init() to try to prevent some
      list corruption. The corruption was actually caused by the reply
      handler racing with a signal. Now that MR invalidation is properly
      serialized, list_del_init() can safely be replaced.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Demote "connect" log messages · 173b8f49
      Chuck Lever authored
      Some have complained about the log messages generated when xprtrdma
      opens or closes a connection to a server. When an NFS mount is
      mostly idle these can appear every few minutes as the client idles
      out the connection and reconnects.
      
      Connection and disconnection is a normal part of operation, and not
      exceptional, so change these to dprintks for now. At some point
      all of these will be converted to tracepoints, but that's for
      another day.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration · 838edb94
      Chuck Lever authored
      Transparent State Migration copies a client's lease state from the
      server where a filesystem used to reside to the server where it now
      resides. When an NFSv4.1 client first contacts that destination
      server, it uses EXCHANGE_ID to detect trunking relationships.
      
      The lease that was copied there is returned to that client, but the
      destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
      the client. This is because the lease was confirmed on the source
      server (before it was copied).
      
      When CONFIRMED_R is set, the client throws away the sequence ID
      returned by the server. During a Transparent State Migration, however,
      there's no other way for the client to know what sequence ID to use
      with a lease that's been migrated.
      
      Therefore, the client must save and use the contrived slot sequence
      value returned by the destination server even when CONFIRMED_R is
      set.
      
      Note that some servers always return a seqid of 1 after a migration.
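      A sketch of the behaviour change, with a hypothetical lease_state
      standing in for the client's real bookkeeping and simplified so that
      the new path always keeps the value. EXCHGID4_FLAG_CONFIRMED_R and
      eir_sequenceid are the names used in RFC 5661; the old path discards
      the returned sequence ID when that flag is set, and the new one keeps
      it.

```c
#include <stdint.h>

/* EXCHANGE_ID result flag, value from RFC 5661. */
#define EXCHGID4_FLAG_CONFIRMED_R 0x80000000u

/* Hypothetical client bookkeeping; not the layout of struct nfs_client. */
struct lease_state {
    uint32_t slot_seqid;
};

/* Old behaviour: drop the returned seqid whenever CONFIRMED_R is set. */
static void exchange_id_done_old(struct lease_state *st, uint32_t eir_flags,
                                 uint32_t eir_sequenceid)
{
    if (!(eir_flags & EXCHGID4_FLAG_CONFIRMED_R))
        st->slot_seqid = eir_sequenceid;
}

/*
 * New behaviour: keep the server's value even when CONFIRMED_R is set,
 * since after a migration it is the only source for the right seqid.
 */
static void exchange_id_done_new(struct lease_state *st, uint32_t eir_flags,
                                 uint32_t eir_sequenceid)
{
    (void)eir_flags; /* no longer decides whether the seqid is saved */
    st->slot_seqid = eir_sequenceid;
}
```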
      Reported-by: Xuan Qi <xuan.qi@oracle.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Xuan Qi <xuan.qi@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration · 8dcbec6d
      Chuck Lever authored
      Transparent State Migration copies a client's lease state from the
      server where a filesystem used to reside to the server where it now
      resides. When an NFSv4.1 client first contacts that destination
      server, it uses EXCHANGE_ID to detect trunking relationships.
      
      The lease that was copied there is returned to that client, but the
      destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
      the client. This is because the lease was confirmed on the source
      server (before it was copied).
      
      Normally, when CONFIRMED_R is set, a client purges the lease and
      creates a new one. However, that throws away the entire benefit of
      Transparent State Migration.
      
      Therefore, the client must not purge that lease when it is possible
      that Transparent State Migration has occurred.
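      The purge decision reduces to a predicate along these lines. How the
      client determines that migration "is possible" is left as a
      hypothetical boolean parameter; it is not how the kernel derives it.

```c
#include <stdbool.h>
#include <stdint.h>

/* EXCHANGE_ID result flag, value from RFC 5661. */
#define EXCHGID4_FLAG_CONFIRMED_R 0x80000000u

/*
 * Hypothetical predicate: purge and recreate the lease only when the server
 * reports it as already confirmed AND Transparent State Migration cannot
 * explain that.
 */
static bool should_purge_lease(uint32_t eir_flags, bool migration_possible)
{
    return (eir_flags & EXCHGID4_FLAG_CONFIRMED_R) && !migration_possible;
}
```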
      Reported-by: Xuan Qi <xuan.qi@oracle.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Xuan Qi <xuan.qi@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Don't defer MR recovery if ro_map fails · 1f541895
      Chuck Lever authored
      Deferred MR recovery does a DMA-unmapping of the MW. However, ro_map
      invokes rpcrdma_defer_mr_recovery in some error cases where the MW
      has not even been DMA-mapped yet.
      
      Avoid a DMA-unmapping error by replacing rpcrdma_defer_mr_recovery.
      
      Also note that if ib_dma_map_sg is asked to map 0 nents, it will
      return 0. So the extra "if (i == 0)" check is no longer needed.
      
      Fixes: 42fe28f6 ("xprtrdma: Do not leak an MW during a DMA ...")
      Fixes: 505bbe64 ("xprtrdma: Refactor MR recovery work queues")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix FRWR invalidation error recovery · 8d75483a
      Chuck Lever authored
      When ib_post_send() fails, all LOCAL_INV WRs past @bad_wr have to be
      examined, and the MRs reset by hand.
      
      I'm not sure how the existing code can work by comparing R_keys.
      Restructure the logic so that instead it walks the chain of WRs,
      starting from the first bad one.
      
      Make sure to wait for completion if at least one WR was actually
      posted. Otherwise, if the ib_post_send fails, we can end up
      DMA-unmapping the MR while LOCAL_INV operations are in flight.
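      A sketch of the restructured recovery walk, using a simplified,
      hypothetical send_wr list rather than struct ib_send_wr: every WR from
      @bad_wr to the end of the chain was never posted, so its MR is reset by
      hand, and the caller waits for completions only when the first WR in
      the chain did get posted (i.e. @bad_wr is not the first one).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified work request; not struct ib_send_wr. */
struct send_wr {
    struct send_wr *next;
    bool mr_needs_reset; /* stands in for the MR behind a LOCAL_INV */
};

/* Walk from the first unposted WR to the end; returns how many were reset. */
static int reset_unposted(struct send_wr *bad_wr)
{
    int n = 0;

    for (struct send_wr *wr = bad_wr; wr != NULL; wr = wr->next) {
        wr->mr_needs_reset = true; /* the real code recovers the MR here */
        n++;
    }
    return n;
}

/* At least one LOCAL_INV is in flight iff the failure wasn't on the first WR. */
static bool must_wait(const struct send_wr *first, const struct send_wr *bad_wr)
{
    return bad_wr != first;
}
```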
      
      Commit 7a89f9c6 ("xprtrdma: Honor ->send_request API contract")
      added the rdma_disconnect() call site. The disconnect actually
      causes more problems than it solves, and SQ overruns happen only as
      a result of software bugs. So remove it.
      
      Fixes: d7a21c1b ("xprtrdma: Reset MRs in frwr_op_unmap_sync()")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix client lock-up after application signal fires · 431af645
      Chuck Lever authored
      After a signal, the RPC client aborts synchronous RPCs running on
      behalf of the signaled application.
      
      The server is still executing those RPCs, and will write the results
      back into the client's memory when it's done. By the time the server
      writes the results, that memory is likely being used for other
      purposes. Therefore xprtrdma has to immediately invalidate all
      memory regions used by those aborted RPCs to prevent the server's
      writes from clobbering that re-used memory.
      
      With FMR memory registration, invalidation takes a relatively long
      time. In fact, the invalidation is often still running when the
      server tries to write the results into the memory regions that are
      being invalidated.
      
      This sets up a race between two processes:
      
      1.  After the signal, xprt_rdma_free calls ro_unmap_safe.
      2.  While ro_unmap_safe is still running, the server replies and
          rpcrdma_reply_handler runs, calling ro_unmap_sync.
      
      Both processes invoke ib_unmap_fmr on the same FMR.
      
      The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
      the same time, but HCAs generally don't tolerate this. Sometimes
      this can result in a system crash.
      
      If the HCA happens to survive, rpcrdma_reply_handler continues. It
      removes the rpc_rqst from rq_list and releases the transport_lock.
      This enables xprt_rdma_free to run in another process, and the
      rpc_rqst is released while rpcrdma_reply_handler is still waiting
      for the ib_unmap_fmr call to finish.
      
      But further down in rpcrdma_reply_handler, the transport_lock is
      taken again, and "rqst" is dereferenced. If "rqst" has already been
      released, this triggers a general protection fault. Since bottom-
      halves are disabled, the system locks up.
      
      Address both issues by reversing the order of the xprt_lookup_rqst
      call and the ro_unmap_sync call. Introduce a separate lookup
      mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
      xprt_lookup_rqst. Now the handler takes the transport_lock once
      and holds it for the XID lookup and RPC completion.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Rename rpcrdma_req::rl_free · a80d66c9
      Chuck Lever authored
      Clean up: I'm about to use the rl_free field for purposes other than
      a free list. So use a more generic name.
      
      This is a refactoring change only.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Pass only the list of registered MRs to ro_unmap_sync · 451d26e1
      Chuck Lever authored
      There are rare cases where an rpcrdma_req can be re-used (via
      rpcrdma_buffer_put) while the RPC reply handler is still running.
      This is due to a signal firing at just the wrong instant.
      
      Since commit 9d6b0409 ("xprtrdma: Place registered MWs on a
      per-req list"), rpcrdma_mws are self-contained; i.e., they fully
      describe an MR and scatterlist, and no part of that information is
      stored in struct rpcrdma_req.
      
      As part of closing the above race window, pass only the req's list
      of registered MRs to ro_unmap_sync, rather than the rpcrdma_req
      itself.
      
      Some extra transport header sanity checking is removed. Since the
      client depends on its own recollection of what memory had been
      registered, there doesn't seem to be a way to abuse this change.
      
      And, the check was not terribly effective. If the client had sent
      Read chunks, the "list_empty" test is negative in both of the
      removed cases, which are actually looking for Write or Reply
      chunks.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Pre-mark remotely invalidated MRs · 4b196dc6
      Chuck Lever authored
      There are rare cases where an rpcrdma_req and its matched
      rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC
      reply handler is still using that req. This is typically due to a
      signal firing at just the wrong instant.
      
      As part of closing this race window, avoid using the wrong
      rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as
      invalidated while we are sure the rep is still OK to use.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: On invalidation failure, remove MWs from rl_registered · 04d25b7d
      Chuck Lever authored
      Callers assume the ro_unmap_sync and ro_unmap_safe methods empty
      the list of registered MRs. Ensure that all paths through
      fmr_op_unmap_sync() remove MWs from that list.
      
      Fixes: 9d6b0409 ("xprtrdma: Place registered MWs on a ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: check for nfs_refresh_inode() errors in nfs_fhget() · 26fde4df
      NeilBrown authored
      If an NFS server returns a filehandle that we have previously
      seen, and reports a different type, then nfs_refresh_inode()
      will log a warning and return an error.
      
      nfs_fhget() does not check for this error and may return an
      inode with a different type than the one that the server
      reported.
      
      This is likely to cause confusion, and is one way that
      ->open_context() could return a directory inode as discussed
      in the previous patch.
      
      So if nfs_refresh_inode() returns an error, return that error
      from nfs_fhget() to avoid the confusion propagating.
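      A userspace sketch of the error-propagation pattern, using the
      kernel's ERR_PTR convention (an errno value encoded in the top 4095
      pointer values, as in include/linux/err.h). struct inode_stub and
      fhget_sketch() are hypothetical; they only show the refresh error being
      returned instead of the mismatched inode.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace copies of the helpers in include/linux/err.h. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
    return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct inode. */
struct inode_stub {
    int i_mode;
};

/* refresh_err stands in for the return value of nfs_refresh_inode(). */
static struct inode_stub *fhget_sketch(struct inode_stub *inode, int refresh_err)
{
    if (refresh_err)
        return ERR_PTR(refresh_err); /* don't hand back a confused inode */
    return inode;
}
```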
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: guard against confused server in nfs_atomic_open() · eaa2b82c
      NeilBrown authored
      A confused server could return a filehandle for an
      NFSv4 OPEN request, which it previously returned for a directory.
      So the inode returned by ->open_context() in nfs_atomic_open()
      could conceivably be a directory inode.
      
      This has particular implications for the call to
      nfs_file_set_open_context() in nfs_finish_open().
      If that is called on a directory inode, then the nfs_open_context
      that gets stored in the filp->private_data will be linked to
      nfs_inode->open_files.
      
      When the directory is closed, nfs_closedir() will (ultimately)
      free the ->private_data, but not unlink it from nfs_inode->open_files
      (because it doesn't expect an nfs_open_context there).
      
      Subsequently the memory could get used for something else and eventually
      if the ->open_files list is walked, the walker will fall off the end and
      crash.
      
      So: change nfs_finish_open() to only call nfs_file_set_open_context()
      for regular-file inodes.
      
      This failure mode has been seen in a production setting (unknown NFS
      server implementation).  The kernel was v3.0 and the specific sequence
      seen would not affect more recent kernels, but I think a risk is still
      present, and caution is wise.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NeilBrown's avatar
      NFS: only invalidate dentrys that are clearly invalid. · cc89684c
      NeilBrown authored
      Since commit bafc9b75 ("vfs: More precise tests in d_invalidate")
      in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
      to be invalidated even if it has filesystems mounted on it or on a
      descendant, and any such mounted filesystem is unmounted.
      
      This means we need to be careful not to return 0 unless the directory
      referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
      the directory.  Other errors such as -EPERM or -ERESTARTSYS should be
      returned from ->d_revalidate() so they are propagated to the caller.
      
      A particular problem can be demonstrated by:
      
      1/ mount an NFS filesystem using NFSv3 on /mnt
      2/ mount any other filesystem on /mnt/foo
      3/ ls /mnt/foo
      4/ turn off network, or otherwise make the server unable to respond
      5/ ls /mnt/foo &
      6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
      7/ kill -9 $! # this results in -ERESTARTSYS being returned
      8/ observe that /mnt/foo has been unmounted.
      
      This patch changes nfs_lookup_revalidate() to only treat
        -ESTALE from nfs_lookup_verify_inode() and
        -ESTALE or -ENOENT from ->lookup()
      as indicating an invalid inode.  Other errors are returned.
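      The classification above can be modelled as a small mapping function. This is a hypothetical helper, not kernel code: the from_lookup flag distinguishes errors from ->lookup() (where -ENOENT also means invalid) from errors found by nfs_lookup_verify_inode() (where only -ESTALE does). ERESTARTSYS is a kernel-internal errno that userspace headers do not define, so the sketch defines it with the kernel's value of 512.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512 /* kernel-internal; absent from userspace errno.h */
#endif

/*
 * Hypothetical model of the policy in nfs_lookup_revalidate().
 * Return value follows ->d_revalidate() conventions:
 *   1         dentry is valid
 *   0         dentry is invalid (VFS may unmount anything below it)
 *   negative  error propagated to the caller; dentry left in place
 */
static int model_revalidate_result(int err, bool from_lookup)
{
	if (err == 0)
		return 1;
	if (err == -ESTALE)
		return 0;               /* object truly gone */
	if (err == -ENOENT && from_lookup)
		return 0;               /* ->lookup() says it no longer exists */
	return err;                     /* e.g. -EPERM, -ERESTARTSYS */
}
```

      With this policy, step 7 of the reproducer above yields -ERESTARTSYS for the killed ls, and /mnt/foo stays mounted.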
      
      Also nfs_check_inode_attributes() is changed to return -ESTALE rather
      than -EIO.  This is consistent with the error returned in similar
      circumstances from nfs_update_inode().
      
      As this bug allows any user to unmount a filesystem mounted on an NFS
      filesystem, this fix is suitable for stable kernels.
      
      Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate")
      Cc: stable@vger.kernel.org (v3.18+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Olga Kornievskaia's avatar
      PNFS for stateid errors retry against MDS first · 22368ff1
      Olga Kornievskaia authored
      Upon receiving a stateid error such as BAD_STATEID, the client
      should retry the operation against the MDS before deciding to
      do stateid recovery.
      
      Previously, the code would initiate state recovery, which could
      lead to a race in the state manager; the state manager could then
      choose an incorrect recovery method, resulting in an EIO failure
      for the application.
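      The revised ordering can be sketched as a decision function. Everything here is hypothetical: the enum names, the from_mds flag, and the assumption that recovery is started only once the MDS itself reports a bad stateid. Handling of other status codes is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status of a completed pNFS operation. */
enum model_status { MODEL_OK, MODEL_BAD_STATEID };

/* Hypothetical follow-up action. */
enum model_action {
	MODEL_DONE,           /* nothing more to do */
	MODEL_RETRY_VIA_MDS,  /* resend the operation through the MDS */
	MODEL_STATE_RECOVERY, /* hand the problem to the state manager */
};

/*
 * A stateid error from a data server first triggers a retry through
 * the metadata server; state recovery is considered only if the MDS
 * also rejects the stateid.
 */
static enum model_action model_handle_status(enum model_status st,
					     bool from_mds)
{
	if (st == MODEL_OK)
		return MODEL_DONE;
	return from_mds ? MODEL_STATE_RECOVERY : MODEL_RETRY_VIA_MDS;
}
```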
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Olga Kornievskaia's avatar
      PNFS fix EACCESS on commit to DS handling · a0bc01e0
      Olga Kornievskaia authored
      Commit fabbbee0 "PNFS fix fallback to MDS if got error on
      commit to DS" moved the pnfs_set_lo_fail() call to the unhandled-error
      path, which was not correct and led to a kernel oops on umount.
      
      Instead, fix the original EACCESS error on commit to DS by
      getting the new layout and re-doing the IO.
      
      Fixes: fabbbee0 ("PNFS fix fallback to MDS if got error on commit to DS")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Cc: stable@vger.kernel.org # v4.12
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Dan Carpenter's avatar
      NFS: silence a uninitialized variable warning · 4cd1ec95
      Dan Carpenter authored
      Static checkers have gotten clever enough to complain that "id_long" is
      uninitialized on the failure path.  It's harmless, but simple to fix.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Tuo Chen Peng's avatar
      nfs: Fix fscache stat printing in nfs_show_stats() · ce85bd29
      Tuo Chen Peng authored
      nfs_show_stats() was incorrectly reading the byte statistics when printing
      the fsc statistics. This caused files like /proc/self/mountstats to report
      incorrect fsc statistics for NFS mounts.
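      A minimal model of the corrected output loop, assuming a hypothetical model_stats structure with separate byte and fscache counter arrays and made-up array sizes. The point is only that the fsc: line must index the fscache array, not the byte array.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_NBYTES_STATS 3 /* hypothetical sizes */
#define MODEL_NFSC_STATS   2

struct model_stats {
	unsigned long long bytes[MODEL_NBYTES_STATS];
	unsigned long long fsc[MODEL_NFSC_STATS];
};

/*
 * Format the "fsc:" line into buf (len must be > 0).  The bug amounted
 * to reading s->bytes[i] here; the fix reads s->fsc[i].
 */
static void model_show_fsc(const struct model_stats *s, char *buf, size_t len)
{
	size_t off;
	int i;

	off = (size_t)snprintf(buf, len, "fsc:");
	for (i = 0; i < MODEL_NFSC_STATS && off < len; i++)
		off += (size_t)snprintf(buf + off, len - off, " %llu", s->fsc[i]);
}
```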
      Signed-off-by: Tuo Chen Peng <tpeng@nvidia.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Benjamin Coddington's avatar
      NFS: Fix initialization of nfs_page_array->npages · 2eb3aea7
      Benjamin Coddington authored
      Commit 8ef9b0b9 open-coded nfs_pgarray_set(), and left out the
      initialization of the nfs_page_array's npages.  This mistake didn't show up
      until testing with block layouts, where it causes all pNFS reads to
      return -EIO.
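      The invariant the open-coded version lost is that npages must be set whenever the page vector is. The sketch below is hypothetical: the struct, the 4 KiB page size and the round-up formula for the page count are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12 /* assumed 4 KiB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Pages spanned by len bytes starting at in-page offset base. */
static unsigned int model_page_array_len(unsigned int base, size_t len)
{
	return (unsigned int)(((unsigned long)len + base +
			       MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT);
}

/* Hypothetical stand-in for struct nfs_page_array. */
struct model_page_array {
	unsigned int npages; /* the field the open-coded version forgot */
	void **pagevec;
};

/* Initialise both fields together, so neither can be left stale. */
static void model_pgarray_init(struct model_page_array *pg, void **pages,
			       unsigned int base, size_t len)
{
	pg->pagevec = pages;
	pg->npages = model_page_array_len(base, len);
}
```

      Keeping the two assignments in one helper is what the original nfs_pgarray_set() provided; open-coding them made it easy to drop one.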
      
      Fixes: 8ef9b0b9 ("NFS: move nfs_pgarray_set() to open code")
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Cc: stable@vger.kernel.org # 4.12
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Trond Myklebust's avatar
      NFS: Fix commit policy for non-blocking calls to nfs_write_inode() · 1a4edf0f
      Trond Myklebust authored
      Now that the writes will schedule a commit on their own, we don't
      need nfs_write_inode() to schedule one if there are outstanding
      writes, and we're being called in non-blocking mode.
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Trond Myklebust's avatar
      NFS: Ensure we commit after writeback is complete · 919e3bd9
      Trond Myklebust authored
      If the page cache is being flushed, then we want to ensure that we
      do start a commit once the pages are done being flushed.
      If we just wait until all I/O is done to that file, we can end up
      livelocking until the balance_dirty_pages() mechanism puts its
      foot down and forces I/O to stop.
      So instead we do more or less the same thing that O_DIRECT does,
      and set up a counter to tell us when the flush is done.
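      The counter pattern can be sketched with C11 atomics. This is a hypothetical model, not the kernel code: the flusher holds one extra reference (a bias) while it is still issuing writes, so the count cannot reach zero early, and whichever party drops the last reference schedules the commit (here it just increments commits_started).

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-flush tracker. */
struct model_commit_tracker {
	atomic_int outstanding;     /* issued writes + the flusher's bias */
	atomic_int commits_started; /* stands in for "schedule a COMMIT" */
};

/* Start a flush; must not race with completions of an earlier flush. */
static void model_flush_begin(struct model_commit_tracker *t)
{
	atomic_init(&t->outstanding, 1); /* bias held by the flusher */
	atomic_init(&t->commits_started, 0);
}

/* Drop one reference; the last one out schedules the commit. */
static void model_put(struct model_commit_tracker *t)
{
	if (atomic_fetch_sub(&t->outstanding, 1) == 1)
		atomic_fetch_add(&t->commits_started, 1);
}

/* Called by the flusher for each write it sends. */
static void model_write_issued(struct model_commit_tracker *t)
{
	atomic_fetch_add(&t->outstanding, 1);
}

/* Called from each write's completion path. */
static void model_write_done(struct model_commit_tracker *t)
{
	model_put(t);
}

/* Called once the flusher has issued all of its writes. */
static void model_flush_end(struct model_commit_tracker *t)
{
	model_put(t);
}
```

      Unlike waiting for all I/O on the file to drain, the commit fires as soon as this flush's own writes complete, even if other writers keep dirtying pages.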
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Trond Myklebust's avatar
      NFS: Remove unused fields in the page I/O structures · b5973a8c
      Trond Myklebust authored
      Remove the 'layout_private' fields that were only used by the pNFS OSD
      layout driver.
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Trond Myklebust's avatar
      SUNRPC: Make slot allocation more reliable · 92ea011f
      Trond Myklebust authored
      In xprt_alloc_slot(), the spin lock is only needed to provide atomicity
      between the atomic_add_unless() failure and the call to xprt_add_backlog().
      We do not actually need to hold it across the memory allocation itself.
      
      By dropping the lock, we can use a more resilient GFP_NOFS allocation,
      just as we now do in the rest of the RPC client code.
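      The locking rule can be illustrated in userspace with a pthread mutex standing in for the spin lock and calloc() standing in for a GFP_NOFS allocation. model_xprt, model_inc_unless() and model_alloc_slot() are hypothetical; model_inc_unless() mimics the semantics of the kernel's atomic_add_unless(v, 1, limit), which adds 1 unless the value equals limit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the transport's slot accounting. */
struct model_xprt {
	pthread_mutex_t reserve_lock;
	atomic_int num_reqs; /* slots currently allocated */
	int max_reqs;        /* upper bound on slots */
	int backlog;         /* callers parked waiting for a slot */
};

/* Increment *v unless it already equals limit; true if incremented. */
static bool model_inc_unless(atomic_int *v, int limit)
{
	int cur = atomic_load(v);

	while (cur != limit) {
		if (atomic_compare_exchange_weak(v, &cur, cur + 1))
			return true;
	}
	return false;
}

/*
 * Called with reserve_lock held.  The lock covers the gap between a
 * failed increment and joining the backlog, but is dropped around the
 * allocation, which may then sleep.
 */
static void *model_alloc_slot(struct model_xprt *x)
{
	void *req;

	if (!model_inc_unless(&x->num_reqs, x->max_reqs)) {
		x->backlog++; /* still under the lock: no lost wakeup */
		return NULL;
	}
	pthread_mutex_unlock(&x->reserve_lock);
	req = calloc(1, 64); /* size is arbitrary in this model */
	pthread_mutex_lock(&x->reserve_lock);
	if (!req)
		atomic_fetch_sub(&x->num_reqs, 1); /* give the slot back */
	return req;
}
```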
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Benjamin Coddington's avatar
      NFS: nfs_rename() - revalidate directories on -ERESTARTSYS · 818a8dbe
      Benjamin Coddington authored
      An interrupted rename will leave the old dentry behind if the rename
      succeeds.  Fix this by forcing a lookup the next time through
      ->d_revalidate.
      
      A previous attempt at solving this problem took the approach of completing
      the work of the rename asynchronously; however, that approach was wrong
      since it would allow the d_move() to occur after the directory's i_mutex
      had been dropped by the original process.
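      One way to picture "forcing a lookup the next time through ->d_revalidate" is a per-directory generation counter that dentries compare against. This is purely an assumption for illustration: model_dir, model_dentry and the helpers are hypothetical, and the sketch does not claim to reproduce how nfs_rename() or the NFS dentry cache actually record the need for a lookup.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512 /* kernel-internal; absent from userspace errno.h */
#endif

/* Hypothetical directory and dentry models. */
struct model_dir { unsigned long gen; };
struct model_dentry {
	struct model_dir *parent;
	unsigned long verified_gen; /* parent's gen when last verified */
};

/* Invalidate every cached child of dir in one step. */
static void model_force_lookup_revalidate(struct model_dir *dir)
{
	dir->gen++;
}

/* What a ->d_revalidate() in this model would consult. */
static bool model_dentry_needs_lookup(const struct model_dentry *d)
{
	return d->verified_gen != d->parent->gen;
}

/* Outcome of the rename is unknown after a signal: distrust both dirs. */
static void model_rename_finish(int error, struct model_dir *old_dir,
				struct model_dir *new_dir)
{
	if (error == -ERESTARTSYS) {
		model_force_lookup_revalidate(old_dir);
		if (new_dir != old_dir)
			model_force_lookup_revalidate(new_dir);
	}
}
```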
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>