1. 04 Dec, 2013 1 commit
    • Trond Myklebust's avatar
      NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery · f22e5edd
      Trond Myklebust authored
      Andy Adamson reports:
      
      The state manager is recovering expired state and recovery OPENs are being
      processed. If kswapd is pruning inodes at the same time, a deadlock can occur
      when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
      resultant layoutreturn gets an error that the state mangager is to handle,
      causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
      
      At the same time an open is waiting for the inode deletion to complete in
      __wait_on_freeing_inode.
      
      If the open is either the open called by the state manager, or an open from
      the same open owner that is holding the NFSv4 sequence id which causes the
      OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
      then the state is deadlocked with kswapd.
      
      The fix is simply to have layoutreturn ignore all errors except NFS4ERR_DELAY.
      We already know that layouts are dropped on all server reboots, and that
      it has to be coded to deal with the "forgetful client model" that doesn't
      send layoutreturns.
      Reported-by: default avatarAndy Adamson <andros@netapp.com>
      Link: http://lkml.kernel.org/r/1385402270-14284-1-git-send-email-andros@netapp.comSigned-off-by: default avatarTrond Myklebust <Trond.Myklebust@primarydata.com>
      f22e5edd
  2. 26 Nov, 2013 1 commit
  3. 20 Nov, 2013 3 commits
    • Trond Myklebust's avatar
      NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED · 69794ad7
      Trond Myklebust authored
      Also ensure that we zero out the stateid mode when exiting
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      69794ad7
    • Trond Myklebust's avatar
      NFSv4: Update list of irrecoverable errors on DELEGRETURN · c97cf606
      Trond Myklebust authored
      If the DELEGRETURN errors out with something like NFS4ERR_BAD_STATEID
      then there is no recovery possible. Just quit without returning an error.
      
      Also, note that the client must not assume that the NFSv4 lease has been
      renewed when it sees an error on DELEGRETURN.
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org
      c97cf606
    • Andy Adamson's avatar
      NFSv4 wait on recovery for async session errors · 4a82fd7c
      Andy Adamson authored
      When the state manager is processing the NFS4CLNT_DELEGRETURN flag, session
      draining is off, but DELEGRETURN can still get a session error.
      The async handler calls nfs4_schedule_session_recovery returns -EAGAIN, and
      the DELEGRETURN done then restarts the RPC task in the prepare state.
      With the state manager still processing the NFS4CLNT_DELEGRETURN flag with
      session draining off, these DELEGRETURNs will cycle with errors filling up the
      session slots.
      
      This prevents OPEN reclaims (from nfs_delegation_claim_opens) required by the
      NFS4CLNT_DELEGRETURN state manager processing from completing, hanging the
      state manager in the __rpc_wait_for_completion_task in nfs4_run_open_task
      as seen in this kernel thread dump:
      
      kernel: 4.12.32.53-ma D 0000000000000000     0  3393      2 0x00000000
      kernel: ffff88013995fb60 0000000000000046 ffff880138cc5400 ffff88013a9df140
      kernel: ffff8800000265c0 ffffffff8116eef0 ffff88013fc10080 0000000300000001
      kernel: ffff88013a4ad058 ffff88013995ffd8 000000000000fbc8 ffff88013a4ad058
      kernel: Call Trace:
      kernel: [<ffffffff8116eef0>] ? cache_alloc_refill+0x1c0/0x240
      kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
      kernel: [<ffffffffa0358152>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
      kernel: [<ffffffff8152914f>] __wait_on_bit+0x5f/0x90
      kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
      kernel: [<ffffffff815291f8>] out_of_line_wait_on_bit+0x78/0x90
      kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50
      kernel: [<ffffffffa035810d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
      kernel: [<ffffffffa040d44c>] nfs4_run_open_task+0x11c/0x160 [nfs]
      kernel: [<ffffffffa04114e7>] nfs4_open_recover_helper+0x87/0x120 [nfs]
      kernel: [<ffffffffa0411646>] nfs4_open_recover+0xc6/0x150 [nfs]
      kernel: [<ffffffffa040cc6f>] ? nfs4_open_recoverdata_alloc+0x2f/0x60 [nfs]
      kernel: [<ffffffffa0414e1a>] nfs4_open_delegation_recall+0x6a/0xa0 [nfs]
      kernel: [<ffffffffa0424020>] nfs_end_delegation_return+0x120/0x2e0 [nfs]
      kernel: [<ffffffff8109580f>] ? queue_work+0x1f/0x30
      kernel: [<ffffffffa0424347>] nfs_client_return_marked_delegations+0xd7/0x110 [nfs]
      kernel: [<ffffffffa04225d8>] nfs4_run_state_manager+0x548/0x620 [nfs]
      kernel: [<ffffffffa0422090>] ? nfs4_run_state_manager+0x0/0x620 [nfs]
      kernel: [<ffffffff8109b0f6>] kthread+0x96/0xa0
      kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      kernel: [<ffffffff8109b060>] ? kthread+0x0/0xa0
      kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      The state manager can not therefore process the DELEGRETURN session errors.
      Change the async handler to wait for recovery on session errors.
      Signed-off-by: default avatarAndy Adamson <andros@netapp.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      4a82fd7c
  4. 19 Nov, 2013 2 commits
  5. 15 Nov, 2013 2 commits
  6. 14 Nov, 2013 1 commit
    • Jeff Layton's avatar
      nfs: don't retry detect_trunking with RPC_AUTH_UNIX more than once · 6d769f1e
      Jeff Layton authored
      Currently, when we try to mount and get back NFS4ERR_CLID_IN_USE or
      NFS4ERR_WRONGSEC, we create a new rpc_clnt and then try the call again.
      There is no guarantee that doing so will work however, so we can end up
      retrying the call in an infinite loop.
      
      Worse yet, we create the new client using rpc_clone_client_set_auth,
      which creates the new client as a child of the old one. Thus, we can end
      up with a *very* long lineage of rpc_clnts. When we go to put all of the
      references to them, we can end up with a long call chain that can smash
      the stack as each rpc_free_client() call can recurse back into itself.
      
      This patch fixes this by simply ensuring that the SETCLIENTID call will
      only be retried in this situation if the last attempt did not use
      RPC_AUTH_UNIX.
      
      Note too that with this change, we don't need the (i > 2) check in the
      -EACCES case since we now have a more reliable test as to whether we
      should reattempt.
      
      Cc: stable@vger.kernel.org # v3.10+
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Tested-by/Acked-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      6d769f1e
  7. 12 Nov, 2013 1 commit
  8. 08 Nov, 2013 1 commit
    • Trond Myklebust's avatar
      SUNRPC: Fix a data corruption issue when retransmitting RPC calls · a6b31d18
      Trond Myklebust authored
      The following scenario can cause silent data corruption when doing
      NFS writes. It has mainly been observed when doing database writes
      using O_DIRECT.
      
      1) The RPC client uses sendpage() to do zero-copy of the page data.
      2) Due to networking issues, the reply from the server is delayed,
         and so the RPC client times out.
      
      3) The client issues a second sendpage of the page data as part of
         an RPC call retransmission.
      
      4) The reply to the first transmission arrives from the server
         _before_ the client hardware has emptied the TCP socket send
         buffer.
      5) After processing the reply, the RPC state machine rules that
         the call to be done, and triggers the completion callbacks.
      6) The application notices the RPC call is done, and reuses the
         pages to store something else (e.g. a new write).
      
      7) The client NIC drains the TCP socket send buffer. Since the
         page data has now changed, it reads a corrupted version of the
         initial RPC call, and puts it on the wire.
      
      This patch fixes the problem in the following manner:
      
      The ordering guarantees of TCP ensure that when the server sends a
      reply, then we know that the _first_ transmission has completed. Using
      zero-copy in that situation is therefore safe.
      If a time out occurs, we then send the retransmission using sendmsg()
      (i.e. no zero-copy), We then know that the socket contains a full copy of
      the data, and so it will retransmit a faithful reproduction even if the
      RPC call completes, and the application reuses the O_DIRECT buffer in
      the meantime.
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org
      a6b31d18
  9. 04 Nov, 2013 5 commits
  10. 01 Nov, 2013 2 commits
    • Trond Myklebust's avatar
      NFS: Fix a missing initialisation when reading the SELinux label · fcb63a9b
      Trond Myklebust authored
      Ensure that _nfs4_do_get_security_label() also initialises the
      SEQUENCE call correctly, by having it call into nfs4_call_sync().
      Reported-by: default avatarJeff Layton <jlayton@redhat.com>
      Cc: stable@vger.kernel.org # 3.11+
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      fcb63a9b
    • Jeff Layton's avatar
      nfs: fix oops when trying to set SELinux label · 12207f69
      Jeff Layton authored
      Chao reported the following oops when testing labeled NFS:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffffa0568703>] nfs4_xdr_enc_setattr+0x43/0x110 [nfsv4]
      PGD 277bbd067 PUD 2777ea067 PMD 0
      Oops: 0000 [#1] SMP
      Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache sg coretemp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul iTCO_wdt glue_helper ablk_helper cryptd iTCO_vendor_support bnx2 pcspkr serio_raw i7core_edac cdc_ether microcode usbnet edac_core mii lpc_ich i2c_i801 mfd_core shpchp ioatdma dca acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod sd_mod cdrom crc_t10dif mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ata_generic ttm pata_acpi drm ata_piix libata megaraid_sas i2c_core dm_mirror dm_region_hash dm_log dm_mod
      CPU: 4 PID: 25657 Comm: chcon Not tainted 3.10.0-33.el7.x86_64 #1
      Hardware name: IBM System x3550 M3 -[7944OEJ]-/90Y4784     , BIOS -[D6E150CUS-1.11]- 02/08/2011
      task: ffff880178397220 ti: ffff8801595d2000 task.ti: ffff8801595d2000
      RIP: 0010:[<ffffffffa0568703>]  [<ffffffffa0568703>] nfs4_xdr_enc_setattr+0x43/0x110 [nfsv4]
      RSP: 0018:ffff8801595d3888  EFLAGS: 00010296
      RAX: 0000000000000000 RBX: ffff8801595d3b30 RCX: 0000000000000b4c
      RDX: ffff8801595d3b30 RSI: ffff8801595d38e0 RDI: ffff880278b6ec00
      RBP: ffff8801595d38c8 R08: ffff8801595d3b30 R09: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801595d38e0
      R13: ffff880277a4a780 R14: ffffffffa05686c0 R15: ffff8802765f206c
      FS:  00007f2c68486800(0000) GS:ffff88027fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 000000027651a000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 ffff880277865800 ffff880278b6ec00 ffff880277a4a780
       ffff8801595d3948 ffffffffa02ad926 ffff8801595d3b30 ffff8802765f206c
      Call Trace:
       [<ffffffffa02ad926>] rpcauth_wrap_req+0x86/0xd0 [sunrpc]
       [<ffffffffa02a1d40>] ? call_connect+0xb0/0xb0 [sunrpc]
       [<ffffffffa02a1d40>] ? call_connect+0xb0/0xb0 [sunrpc]
       [<ffffffffa02a1ecb>] call_transmit+0x18b/0x290 [sunrpc]
       [<ffffffffa02a1d40>] ? call_connect+0xb0/0xb0 [sunrpc]
       [<ffffffffa02aae14>] __rpc_execute+0x84/0x400 [sunrpc]
       [<ffffffffa02ac40e>] rpc_execute+0x5e/0xa0 [sunrpc]
       [<ffffffffa02a2ea0>] rpc_run_task+0x70/0x90 [sunrpc]
       [<ffffffffa02a2f03>] rpc_call_sync+0x43/0xa0 [sunrpc]
       [<ffffffffa055284d>] _nfs4_do_set_security_label+0x11d/0x170 [nfsv4]
       [<ffffffffa0558861>] nfs4_set_security_label.isra.69+0xf1/0x1d0 [nfsv4]
       [<ffffffff815fca8b>] ? avc_alloc_node+0x24/0x125
       [<ffffffff815fcd2f>] ? avc_compute_av+0x1a3/0x1b5
       [<ffffffffa055897b>] nfs4_xattr_set_nfs4_label+0x3b/0x50 [nfsv4]
       [<ffffffff811bc772>] generic_setxattr+0x62/0x80
       [<ffffffff811bcfc3>] __vfs_setxattr_noperm+0x63/0x1b0
       [<ffffffff811bd1c5>] vfs_setxattr+0xb5/0xc0
       [<ffffffff811bd2fe>] setxattr+0x12e/0x1c0
       [<ffffffff811a4d22>] ? final_putname+0x22/0x50
       [<ffffffff811a4f2b>] ? putname+0x2b/0x40
       [<ffffffff811aa1cf>] ? user_path_at_empty+0x5f/0x90
       [<ffffffff8119bc29>] ? __sb_start_write+0x49/0x100
       [<ffffffff811bd66f>] SyS_lsetxattr+0x8f/0xd0
       [<ffffffff8160cf99>] system_call_fastpath+0x16/0x1b
      Code: 48 8b 02 48 c7 45 c0 00 00 00 00 48 c7 45 c8 00 00 00 00 48 c7 45 d0 00 00 00 00 48 c7 45 d8 00 00 00 00 48 c7 45 e0 00 00 00 00 <48> 8b 00 48 8b 00 48 85 c0 0f 84 ae 00 00 00 48 8b 80 b8 03 00
      RIP  [<ffffffffa0568703>] nfs4_xdr_enc_setattr+0x43/0x110 [nfsv4]
       RSP <ffff8801595d3888>
      CR2: 0000000000000000
      
      The problem is that _nfs4_do_set_security_label calls rpc_call_sync()
      directly which fails to do any setup of the SEQUENCE call. Have it use
      nfs4_call_sync() instead which does the right thing. While we're at it
      change the name of "args" to "arg" to better match the pattern in
      _nfs4_do_setattr.
      Reported-by: default avatarChao Ye <cye@redhat.com>
      Cc: David Quigley <dpquigl@davequigley.com>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Cc: stable@vger.kernel.org # 3.11+
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      12207f69
  11. 31 Oct, 2013 3 commits
    • Jeff Layton's avatar
      nfs: fix inverted test for delegation in nfs4_reclaim_open_state · 1acd1c30
      Jeff Layton authored
      commit 6686390b (NFS: remove incorrect "Lock reclaim failed!"
      warning.) added a test for a delegation before checking to see if any
      reclaimed locks failed. The test however is backward and is only doing
      that check when a delegation is held instead of when one isn't.
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Fixes: 6686390b: NFS: remove incorrect "Lock reclaim failed!" warning.
      Cc: stable@vger.kernel.org # 3.12
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      1acd1c30
    • Trond Myklebust's avatar
      SUNRPC: Cleanup xs_destroy() · a1311d87
      Trond Myklebust authored
      There is no longer any need for a separate xs_local_destroy() helper.
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      a1311d87
    • NeilBrown's avatar
      SUNRPC: close a rare race in xs_tcp_setup_socket. · 93dc41bd
      NeilBrown authored
      We have one report of a crash in xs_tcp_setup_socket.
      The call path to the crash is:
      
        xs_tcp_setup_socket -> inet_stream_connect -> lock_sock_nested.
      
      The 'sock' passed to that last function is NULL.
      
      The only way I can see this happening is a concurrent call to
      xs_close:
      
        xs_close -> xs_reset_transport -> sock_release -> inet_release
      
      inet_release sets:
         sock->sk = NULL;
      inet_stream_connect calls
         lock_sock(sock->sk);
      which gets NULL.
      
      All calls to xs_close are protected by XPRT_LOCKED as are most
      activations of the workqueue which runs xs_tcp_setup_socket.
      The exception is xs_tcp_schedule_linger_timeout.
      
      So presumably the timeout queued by the later fires exactly when some
      other code runs xs_close().
      
      To protect against this we can move the cancel_delayed_work_sync()
      call from xs_destory() to xs_close().
      
      As xs_close is never called from the worker scheduled on
      ->connect_worker, this can never deadlock.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      [Trond: Make it safe to call cancel_delayed_work_sync() on AF_LOCAL sockets]
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      93dc41bd
  12. 30 Oct, 2013 1 commit
  13. 28 Oct, 2013 17 commits