1. 17 May, 2016 19 commits
    • xprtrdma: Refactor the FRWR recovery worker · 660bb497
      Chuck Lever authored
      Maintain the order of invalidation and DMA unmapping when doing
      a background MR reset.
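      
      A minimal sketch of that ordering follows; the container struct and
      helper are illustrative stand-ins, not the actual xprtrdma code:
      
        #include <rdma/ib_verbs.h>
      
        /* Illustrative container, standing in for the real rpcrdma MR state. */
        struct example_mr {
        	struct ib_device	*device;
        	struct ib_pd		*pd;
        	struct ib_mr		*mr;
        	struct scatterlist	*sg;
        	int			sg_nents;
        	enum dma_data_direction	dir;
        	u32			max_pages;
        };
      
        static void example_mr_reset(struct example_mr *emr)
        {
        	/* Invalidate first: deregistering the MR guarantees the
        	 * device no longer holds a translation for these pages. */
        	ib_dereg_mr(emr->mr);
      
        	/* Only after invalidation is it safe to DMA-unmap. */
        	ib_dma_unmap_sg(emr->device, emr->sg, emr->sg_nents, emr->dir);
      
        	emr->mr = ib_alloc_mr(emr->pd, IB_MR_TYPE_MEM_REG,
        			      emr->max_pages);
        }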
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Reset MRs in frwr_op_unmap_sync() · d7a21c1b
      Chuck Lever authored
      frwr_op_unmap_sync() is now invoked in a workqueue context, the same
      as __frwr_queue_recovery(). There's no need to defer MR reset if
      posting LOCAL_INV WRs fails.
      
      This means that even when ib_post_send() fails (which should occur
      very rarely) the invalidation and DMA unmapping steps are still done
      in the correct order.
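      
      A hedged sketch of the error path's new shape, reusing the
      example_mr_reset() sketch above (surrounding names illustrative):
      
        	rc = ib_post_send(ia->ri_id->qp, invalidate_wrs, &bad_wr);
        	if (rc) {
        		/* Already in workqueue context: reset right here
        		 * instead of deferring, so invalidation still
        		 * precedes DMA unmapping. */
        		example_mr_reset(emr);
        	}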
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Save I/O direction in struct rpcrdma_frwr · a3aa8b2b
      Chuck Lever authored
      Move the I/O direction field from rpcrdma_mr_seg into the
      rpcrdma_frmr.
      
      This makes it possible to DMA-unmap the frwr long after an RPC has
      exited and its rpcrdma_mr_seg array has been released and re-used.
      This might occur if an RPC times out while waiting for a new
      connection to be established.
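      
      A sketch of the resulting layout; the field names illustrate the
      pattern rather than quote the source:
      
        struct rpcrdma_frmr_example {
        	struct scatterlist	*fr_sg;
        	int			fr_nents;
        	enum dma_data_direction	fr_dir;	/* moved from rpcrdma_mr_seg */
        	struct ib_mr		*fr_mr;
        };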
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Rename rpcrdma_frwr::sg and sg_nents · 55fdfce1
      Chuck Lever authored
      Clean up: follow the same naming convention as the other fields in
      struct rpcrdma_frwr.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use core ib_drain_qp() API · 550d7502
      Chuck Lever authored
      Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
      the new ib_drain_qp() API.
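      
      ib_drain_qp() blocks until all outstanding send and receive work
      requests have flushed. A typical teardown sequence might look like
      this (the surrounding names are illustrative):
      
        	rdma_disconnect(id);	/* move the QP out of RTS */
        	ib_drain_qp(id->qp);	/* wait for WR flush completions */
        	rdma_destroy_qp(id);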
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove rpcrdma_create_chunks() · 3c19409b
      Chuck Lever authored
      rpcrdma_create_chunks() has been replaced, and can be removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allow Read list and Reply chunk simultaneously · 94f58c58
      Chuck Lever authored
      rpcrdma_marshal_req() makes a simplifying assumption: that NFS
      operations with large Call messages have small Reply messages, and
      vice versa. Therefore, with RPC-over-RDMA, only one chunk type is
      ever needed for each Call/Reply pair: if one direction needs
      chunks, the other direction will always fit inline.
      
      In fact, this assumption is asserted in the code:
      
        if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
        	dprintk("RPC:       %s: cannot marshal multiple chunk lists\n",
        		__func__);
        	return -EIO;
        }
      
      But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
      perform data transformation on RPC messages before they are
      transmitted, direct data placement techniques cannot be used; thus
      RPC messages must be sent via a Long Call in both directions.
      All such calls are sent with a Position Zero Read chunk, and all
      such replies are handled with a Reply chunk. Thus the client must
      provide every Call/Reply pair with both a Read list and a Reply
      chunk.
      
      Without any special security in effect, NFSv4 WRITEs may now also
      use the Read list and provide a Reply chunk. The marshal_req
      logic was preventing that, meaning an NFSv4 WRITE with a large
      payload that included a GETATTR result larger than the inline
      threshold would fail.
      
      The code that encodes each chunk list is now completely contained in
      its own function. There is some code duplication, but the trade-off
      is that the overall logic should be more clear.
      
      Note that all three chunk lists now share the rl_segments array.
      Some additional per-req accounting is necessary to track this
      usage. For the same reasons that the above simplifying assumption
      has held true for so long, I don't expect more array elements are
      needed at this time.
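      
      A hedged sketch of the resulting marshaling shape (the encoder names
      are illustrative): each chunk list has its own encoder, and a Read
      list and a Reply chunk may now be emitted for the same Call.
      
        	if (rtype != rpcrdma_noch)
        		iptr = encode_read_list(r_xprt, req, rqst, iptr, rtype);
        	if (wtype == rpcrdma_writech)
        		iptr = encode_write_list(r_xprt, req, rqst, iptr, wtype);
        	if (wtype == rpcrdma_replych)
        		iptr = encode_reply_chunk(r_xprt, req, rqst, iptr, wtype);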
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Update comments in rpcrdma_marshal_req() · 88b18a12
      Chuck Lever authored
      Update documenting comments to reflect code changes over the past
      year.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Avoid using Write list for small NFS READ requests · cce6deeb
      Chuck Lever authored
      Avoid the latency and interrupt overhead of registering a Write
      chunk when handling NFS READ requests of a few hundred bytes or
      less.
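      
      A hedged sketch of the decision (the predicate name is hypothetical):
      
        	if (reply_fits_inline(r_xprt, rqst))	/* hypothetical */
        		wtype = rpcrdma_noch;	/* skip Write chunk registration */
        	else
        		wtype = rpcrdma_writech;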
      
      This change does not interoperate with Linux NFS/RDMA servers
      that do not have commit 9d11b51c ('svcrdma: Fix send_reply()
      scatter/gather set-up'). Commit 9d11b51c was introduced in v4.3,
      and is included in 4.2.y, 4.1.y, and 3.18.y.
      
      Oracle bug 22925946 has been filed to request that the above fix
      be included in the Oracle Linux UEK4 NFS/RDMA server.
      
      Red Hat bugzillas 1327280 and 1327554 have been filed to request
      that RHEL NFS/RDMA server backports include the above fix.
      
      Workaround: Replace the "proto=rdma,port=20049" mount options
      with "proto=tcp" until commit 9d11b51c is applied to your
      NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Prevent inline overflow · 302d3deb
      Chuck Lever authored
      When deciding whether to send a Call inline, rpcrdma_marshal_req
      doesn't take into account header bytes consumed by chunk lists.
      This results in Call messages on the wire that are sometimes larger
      than the inline threshold.
      
      Likewise, when a Write list or Reply chunk is in play, the server's
      reply has to emit an RDMA Send that includes a larger-than-minimal
      RPC-over-RDMA header.
      
      The actual size of a Call message cannot be estimated until after
      the chunk lists have been registered. Thus the size of each
      RPC-over-RDMA header can be estimated only after chunks are
      registered; but the decision to register chunks is based on the size
      of that header. Chicken, meet egg.
      
      The best a client can do is estimate header size based on the
      largest header that might occur, and then ensure that inline content
      is always smaller than that.
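      
      A sketch of such a pessimistic bound; the XDR word counts here are
      illustrative, not the exact on-the-wire sizes:
      
        static unsigned int example_max_call_header_size(unsigned int maxsegs)
        {
        	unsigned int words = 7;		/* fixed transport header */
      
        	words += 2 + maxsegs * 6;	/* worst-case Read list */
        	words += 2 + maxsegs * 4;	/* worst-case Reply chunk */
        	return words * sizeof(u32);	/* XDR words are 4 bytes */
        }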
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers · 94931746
      Chuck Lever authored
      Send buffer space is shared between the RPC-over-RDMA header and
      an RPC message. A large RPC-over-RDMA header means less space is
      available for the associated RPC message, which then has to be
      moved via an RDMA Read or Write.
      
      As more segments are added to the chunk lists, the header increases
      in size.  Typical modern hardware needs only a few segments to
      convey the maximum payload size, but some devices and registration
      modes may need a lot of segments to convey data payload. Sometimes
      so many are needed that the remaining space in the Send buffer is
      not enough for the RPC message. Sending such a message usually
      fails.
      
      To ensure a transport can always make forward progress, cap the
      number of RDMA segments that are allowed in chunk lists. This
      prevents less-capable devices and memory registrations from
      consuming a large portion of the Send buffer by reducing the
      maximum data payload that can be conveyed with such devices.
      
      For now I choose an arbitrary maximum of 8 RDMA segments. This
      allows a maximum size RPC-over-RDMA header to fit nicely in the
      current 1024 byte inline threshold with over 700 bytes remaining
      for an inline RPC message.
      
      The current maximum data payload of NFS READ or WRITE requests is
      one megabyte. To convey that payload on a client with 4KB pages,
      each chunk segment would need to handle 32 or more data pages. This
      is well within the capabilities of FMR. For physical registration,
      the maximum payload size on platforms with 4KB pages is reduced to
      32KB.
      
      For FRWR, a device's maximum page list depth would need to be at
      least 34 to support the maximum 1MB payload. A device with a smaller
      maximum page list depth means the maximum data payload is reduced
      when using that device.
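      
      The arithmetic above, worked out for 4KB pages (the figure of 34
      leaves headroom for unaligned buffers):
      
        /* 1MB payload / 4KB pages  = 256 pages
         * 256 pages / 8 segments   = 32 pages per segment
         * FRWR: a device max_fast_reg_page_list_len >= 34 supports the
         * full 1MB; a smaller depth caps the payload at roughly
         * 8 * depth * PAGE_SIZE bytes. */
        #define EXAMPLE_MAX_SEGS	8	/* the cap chosen here */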
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Bound the inline threshold values · 29c55422
      Chuck Lever authored
      Currently the sysctls that allow setting the inline threshold allow
      any value to be set.
      
      Small values only make the transport run slower. The default 1KB
      setting is as low as is reasonable. And the logic that decides how
      to divide a Send buffer between RPC-over-RDMA header and RPC message
      assumes (but does not check) that the lower bound is not crazy (say,
      57 bytes).
      
      Send and receive buffers share a page with some control information.
      Values larger than about 3KB can't be supported, currently.
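      
      A hedged sketch of how such bounds are typically enforced, using the
      kernel's clamped sysctl handler (the variable names and exact bounds
      here are illustrative):
      
        static int min_inline_size = 1024;	/* slower below this */
        static int max_inline_size = 3072;	/* buffers share a page */
      
        static struct ctl_table example_parm_table[] = {
        	{
        		.procname	= "rdma_max_inline_read",
        		.data		= &xprt_rdma_max_inline_read,
        		.maxlen		= sizeof(unsigned int),
        		.mode		= 0644,
        		.proc_handler	= proc_dointvec_minmax,
        		.extra1		= &min_inline_size,
        		.extra2		= &max_inline_size,
        	},
        	{ },
        };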
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • sunrpc: Advertise maximum backchannel payload size · 6b26cc8c
      Chuck Lever authored
      RPC-over-RDMA transports have a limit on how large a backward
      direction (backchannel) RPC message can be. Ensure that the NFSv4.x
      CREATE_SESSION operation advertises this limit to servers.
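      
      A hedged sketch of the shape of the change; rpc_max_bc_payload() is
      the transport query this patch introduces, and the session-attribute
      names follow the NFSv4.1 CREATE_SESSION setup:
      
        	/* Bound the back channel by what the transport can carry. */
        	max_bc_payload = rpc_max_bc_payload(clnt);
        	args->bc_attrs.max_rqst_sz = min_t(u32,
        					   args->bc_attrs.max_rqst_sz,
        					   max_bc_payload);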
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • sunrpc: Update RPCBIND_MAXNETIDLEN · 4b9c7f9d
      Chuck Lever authored
      Commit 176e21ee ("SUNRPC: Support for RPC over AF_LOCAL
      transports") added a 5-character netid, but did not bump
      RPCBIND_MAXNETIDLEN from 4 to 5.
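      
      The AF_LOCAL netid is the five-character string "local", so the
      constant needs one more byte:
      
        #define RPCBIND_MAXNETIDLEN	(5u)	/* was (4u) */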
      
      Fixes: 176e21ee ("SUNRPC: Support for RPC over AF_LOCAL ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Add rdma6 option to support NFS/RDMA IPv6 · 181342c5
      Shirley Ma authored
      RFC 5666: The "rdma" netid is to be used when IPv4 addressing
      is employed by the underlying transport, and "rdma6" for IPv6
      addressing.
      
      Add mount -o proto=rdma6 option to support NFS/RDMA IPv6 addressing.
      
      Changes from v2:
       - Integrated comments from Chuck Lever, Anna Schumaker, and Trond Myklebust
       - Added more to the patch description to describe NFS/RDMA IPv6, as
         suggested by Chuck Lever and Anna Schumaker
       - Removed duplicated rdma6 define
       - Removed the Opt_xprt_rdma mount family, since it doesn't support IPv6
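      
      For reference, the netid constants follow the existing sunrpc
      pattern; the IPv6 define is the one this patch adds:
      
        #define RPCBIND_NETID_RDMA	"rdma"
        #define RPCBIND_NETID_RDMA6	"rdma6"
      
      With that in place, "mount -o proto=rdma6 server:/export /mnt"
      selects NFS/RDMA with IPv6 addressing.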
      Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • nfs4: client: do not send empty SETATTR after OPEN_CREATE · a1d1c4f1
      Tigran Mkrtchyan authored
      An OPEN_CREATE with EXCLUSIVE4_1 sends the initial file permissions.
      Ignoring the fact that the server has indicated that the file mode is
      already set, the client sends yet another SETATTR request; but, as
      the mode is already applied, the new SETATTR is empty. This is not a
      problem, but it is an extra round trip that slows open on
      high-latency networks.
      
      This change aims to skip the extra SETATTR after open if there are
      no attributes to be set.
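      
      A hedged sketch of the check (illustrative; the real patch works
      against the open context's attributes):
      
        	sattr->ia_valid &= ~ATTR_MODE;	/* mode already set at open */
        	if (!sattr->ia_valid)
        		return 0;	/* nothing to set: skip the SETATTR */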
      Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: Add COPY nfs operation · 2e72448b
      Anna Schumaker authored
      This adds the copy_range file_ops function pointer used by the
      sys_copy_range() function call.  This patch only implements sync copies,
      so if an async copy happens we decode the stateid and ignore it.
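      
      A sketch of the wiring; in later mainline kernels this VFS hook is
      named ->copy_file_range(), and the handler name here is illustrative:
      
        const struct file_operations nfs4_file_operations_sketch = {
        	/* ... the usual NFSv4 file ops ... */
        	.copy_file_range	= nfs4_copy_file_range,	/* illustrative */
        };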
      Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
    • NFS: Add nfs_commit_file() · 67911c8f
      Anna Schumaker authored
      Copy will use this to set up a commit request for a generic range.  I
      don't want to allocate a new pagecache entry for the file, so I needed
      to change parts of the commit path to handle requests with a null
      wb_page.
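      
      A hedged sketch of the guard the commit path needs once requests can
      arrive without a pagecache page (names illustrative):
      
        	if (req->wb_page)	/* NULL for COPY-generated requests */
        		nfs_mark_page_unstable(req->wb_page, cinfo);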
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Fixing oops in callback path · c2985d00
      Olga Kornievskaia authored
      Commit 80f96427 ("NFSv4.x: Enforce the ca_maxresponsesize_cached
      on the back channel") causes an oops when the client receives a
      callback with cachethis=yes.
      
      [  109.667378] BUG: unable to handle kernel NULL pointer dereference at 00000000000002c8
      [  109.669476] IP: [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
      [  109.671216] PGD 0
      [  109.671736] Oops: 0000 [#1] SMP
      [  109.705427] CPU: 1 PID: 3579 Comm: nfsv4.1-svc Not tainted 4.5.0-rc1+ #1
      [  109.706987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
      [  109.709468] task: ffff8800b4408000 ti: ffff88008448c000 task.ti: ffff88008448c000
      [  109.711207] RIP: 0010:[<ffffffffa08a3e68>]  [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
      [  109.713521] RSP: 0018:ffff88008448fca0  EFLAGS: 00010286
      [  109.714762] RAX: ffff880081ee202c RBX: ffff8800b7b5b600 RCX: 0000000000000001
      [  109.716427] RDX: 0000000000000008 RSI: 0000000000000008 RDI: 0000000000000000
      [  109.718091] RBP: ffff88008448fda8 R08: 0000000000000000 R09: 000000000b000000
      [  109.719757] R10: ffff880137786000 R11: ffff8800b7b5b600 R12: 0000000001000000
      [  109.721415] R13: 0000000000000002 R14: 0000000053270000 R15: 000000000000000b
      [  109.723061] FS:  0000000000000000(0000) GS:ffff880139640000(0000) knlGS:0000000000000000
      [  109.724931] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  109.726278] CR2: 00000000000002c8 CR3: 0000000034d50000 CR4: 00000000001406e0
      [  109.727972] Stack:
      [  109.728465]  ffff880081ee202c ffff880081ee201c 000000008448fcc0 ffff8800baccb800
      [  109.730349]  ffff8800baccc800 ffffffffa08d0380 0000000000000000 0000000000000000
      [  109.732211]  ffff8800b7b5b600 0000000000000001 ffffffff81d073c0 ffff880081ee3090
      [  109.734056] Call Trace:
      [  109.734657]  [<ffffffffa03795d4>] svc_process_common+0x5c4/0x6c0 [sunrpc]
      [  109.736267]  [<ffffffffa0379a4c>] bc_svc_process+0x1fc/0x360 [sunrpc]
      [  109.737775]  [<ffffffffa08a2c2c>] nfs41_callback_svc+0x10c/0x1d0 [nfsv4]
      [  109.739335]  [<ffffffff810cb380>] ? prepare_to_wait_event+0xf0/0xf0
      [  109.740799]  [<ffffffffa08a2b20>] ? nfs4_callback_svc+0x50/0x50 [nfsv4]
      [  109.742349]  [<ffffffff810a6998>] kthread+0xd8/0xf0
      [  109.743495]  [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
      [  109.744776]  [<ffffffff816abc4f>] ret_from_fork+0x3f/0x70
      [  109.746037]  [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
      [  109.747324] Code: cc 45 31 f6 48 8b 85 00 ff ff ff 44 89 30 48 8b 85 f8 fe ff ff 44 89 20 48 8b 9d 38 ff ff ff 48 8b bd 30 ff ff ff 48 85 db 74 4c <4c> 8b af c8 02 00 00 4d 8d a5 08 02 00 00 49 81 c5 98 02 00 00
      [  109.754361] RIP  [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
      [  109.756123]  RSP <ffff88008448fca0>
      [  109.756951] CR2: 00000000000002c8
      [  109.757738] ---[ end trace 2b8555511ab5dfb4 ]---
      [  109.758819] Kernel panic - not syncing: Fatal exception
      [  109.760126] Kernel Offset: disabled
      [  118.938934] ---[ end Kernel panic - not syncing: Fatal exception
      
      It doesn't unlock the table, nor does it set the cps->clp pointer,
      which is later needed by nfs4_cb_free_slot().
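      
      A hedged sketch of the two missing steps (illustrative placement):
      
        	cps->clp = clp;		/* nfs4_cb_free_slot() needs this */
        	spin_unlock(&tbl->slot_tbl_lock);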
      
      Fixes: 80f96427 ("NFSv4.x: Enforce the ca_maxresponsesize_cached ...")
      CC: stable@vger.kernel.org
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 09 May, 2016 13 commits
  3. 08 May, 2016 1 commit
  4. 07 May, 2016 7 commits