1. 25 Sep, 2014 10 commits
    • NFS: avoid deadlocks with loop-back mounted NFS filesystems. · 95905446
      NeilBrown authored
      Support for loop-back mounted NFS filesystems is useful when NFS is
      used to access shared storage in a high-availability cluster.
      
      If the node running the NFS server fails, some other node can mount the
      filesystem and start providing NFS service.  If that node already had
      the filesystem NFS mounted, it will now have it loop-back mounted.
      
      nfsd can suffer a deadlock when allocating memory and entering direct
      reclaim.
      While direct reclaim does not write to the NFS filesystem, it can send
      and wait for a COMMIT through nfs_release_page().
      
      This patch modifies nfs_release_page() to wait a limited time for the
      commit to complete - one second.  If the commit doesn't complete
      in this time, nfs_release_page() will fail.  This means it might now
      fail in some cases where it wouldn't before.  These cases are only
      when 'gfp' includes '__GFP_WAIT'.
      
      nfs_release_page() is only called by try_to_release_page(), and that
      can only be called on an NFS page with required 'gfp' flags from
       - page_cache_pipe_buf_steal() in splice.c
       - shrink_page_list() in vmscan.c
       - invalidate_inode_pages2_range() in truncate.c
      
      The first two handle failure quite safely.  The last is only called
      after ->launder_page() has been called, and that will have waited
      for the commit to finish already.
      
      So aborting if the commit takes longer than 1 second is perfectly safe.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Jeff Layton <jlayton@primarydata.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
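The bounded wait described in this commit can be pictured with a small userspace model. This is only a sketch under stated assumptions: `fake_page`, `fake_release_page()`, `FAKE_GFP_WAIT` and the simulated commit duration are hypothetical stand-ins, not the kernel's `nfs_release_page()` or its data structures.

```c
#include <stdbool.h>

#define FAKE_GFP_WAIT        0x1u   /* hypothetical stand-in for __GFP_WAIT */
#define COMMIT_WAIT_LIMIT_MS 1000u  /* the one-second bound from the commit message */

/* Hypothetical model of a page with an outstanding COMMIT. */
struct fake_page {
    bool has_private;             /* stand-in for PG_private being set */
    unsigned int commit_takes_ms; /* how long the simulated COMMIT needs */
};

/* Returns true if the page can be released, false if the caller must cope
 * with failure. Waiting is only attempted when the caller is allowed to
 * sleep, and never for longer than COMMIT_WAIT_LIMIT_MS. */
static bool fake_release_page(struct fake_page *page, unsigned int gfp)
{
    if (page->has_private && (gfp & FAKE_GFP_WAIT)) {
        /* The commit clears the private state only if it finishes in time. */
        if (page->commit_takes_ms <= COMMIT_WAIT_LIMIT_MS)
            page->has_private = false;
    }
    return !page->has_private;
}
```

In this model a slow commit makes the release fail instead of blocking indefinitely, which is the trade-off the message argues is safe for the three callers it lists.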
    • MM: export page_wakeup functions · a4796e37
      NeilBrown authored
      This will allow NFS to wait for PG_private to be cleared and,
      particularly, to send a wake-up when it is.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • SCHED: add some "wait..on_bit...timeout()" interfaces. · cbbce822
      NeilBrown authored
      In commit c1221321
         sched: Allow wait_on_bit_action() functions to support a timeout
      
      I suggested that a "wait_on_bit_timeout()" interface would not meet my
      need.  This isn't true - I was just over-engineering.
      
      Including a 'private' field in wait_bit_key instead of a focused
      "timeout" field was just premature generalization.  If some other
      use is ever found, it can be generalized or added later.
      
      So this patch renames "private" to "timeout", meaning "stop
      waiting when 'jiffies' reaches or passes 'timeout'",
      and adds two of the many possible wait..bit..timeout() interfaces:
      
      wait_on_page_bit_killable_timeout(), which is the one I want to use,
      and out_of_line_wait_on_bit_timeout() which is a reasonably general
      example.  Others can be added as needed.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
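The rule "stop waiting when jiffies reaches or passes timeout" has to keep working when the tick counter wraps around. Below is a minimal userspace sketch of that comparison, using the same signed-difference idiom as the kernel's `time_after_eq()`. The helper names `deadline_reached()` and `wait_for_bit_clear()` and the fake clock are hypothetical; they are not the interfaces this commit adds.

```c
#include <stdbool.h>

/* Wrap-safe "now >= deadline" for a free-running tick counter. The unsigned
 * difference is reinterpreted as signed, which assumes two's complement
 * conversion (as the kernel does). */
static bool deadline_reached(unsigned long now, unsigned long deadline)
{
    return (long)(now - deadline) >= 0;
}

/* Poll a bit in *word, advancing a fake clock by one tick per poll.
 * Returns 0 once the bit is clear, or -1 if the deadline arrives first. */
static int wait_for_bit_clear(const unsigned long *word, int bit,
                              unsigned long *now, unsigned long deadline)
{
    while (*word & (1UL << bit)) {
        if (deadline_reached(*now, deadline))
            return -1;
        (*now)++;
    }
    return 0;
}
```

Storing an absolute deadline rather than a remaining duration means a waiter that is woken and goes back to sleep does not have to recompute how much time is left.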
    • NFS: don't use STABLE writes during writeback. · e87b4c7a
      NeilBrown authored
      commit b31268ac
        FS: Use stable writes when not doing a bulk flush
      
      was a bit heavy handed.
      The particular problem that led to that patch was that
      small writes to an O_SYNC file were being written as UNSTABLE writes
      followed by a commit.
      This is appropriate for large writes (which require multiple NFS
      requests) but for small writes (single NFS request), using
      NFS_FILE_SYNC is more efficient.
      
      So that patch causes the code to select between the two methods
      depending on how many nfs requests get generated.
      
      Unfortunately this ends up applying to non O_SYNC writes as well.
      In particular if you memory-map a file and update random pages, then
      when they are eventually written out by writeback they will go as
      NFS_FILE_SYNC.  This is inefficient and slows down the application.
      
      So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL.
      With this patch:
       - O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE
         followed by COMMIT for multiple requests.
       - Writes immediately before a close or fsync follow the same pattern.
       - Non-O_SYNC writes without an fsync or close eventually get flushed
         out as UNSTABLE, and a commit follows eventually as appropriate.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
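The selection rule summarised in this commit can be written as a small decision function. This is a sketch only: the enums and `choose_stable_how()` are hypothetical stand-ins for `wbc->sync_mode`, `FLUSH_COND_STABLE` and the NFS stability levels, not the kernel code itself.

```c
/* Hypothetical stand-ins for WB_SYNC_NONE / WB_SYNC_ALL. */
enum fake_sync_mode { FAKE_SYNC_NONE, FAKE_SYNC_ALL };

/* Hypothetical stand-ins for NFS_UNSTABLE / NFS_FILE_SYNC. */
enum fake_stable_how { FAKE_UNSTABLE, FAKE_FILE_SYNC };

/* Decide how a batch of write requests is sent. The "conditionally stable"
 * behaviour is enabled only for data-integrity writeback, and even then a
 * stable write is chosen only when the batch is a single request. Everything
 * else goes out unstable and relies on a later COMMIT. */
static enum fake_stable_how choose_stable_how(enum fake_sync_mode mode,
                                              unsigned int nr_requests)
{
    int cond_stable = (mode == FAKE_SYNC_ALL);

    if (cond_stable && nr_requests == 1)
        return FAKE_FILE_SYNC;
    return FAKE_UNSTABLE;
}
```

Under this model a background flush of a single dirty mmap page (`FAKE_SYNC_NONE`, one request) is sent unstable, which is the case the commit message says was previously slowed down.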
    • NFSv4: use exponential retry on NFS4ERR_DELAY for async requests. · 8478eaa1
      NeilBrown authored
      Currently synchronous NFSv4 requests will be retried with
      exponential timeout (from 1/10 to 15 seconds), but async
      requests will always use a 15-second retry.
      
      Some "async" requests are really synchronous though.  The
      async mechanism is used to allow the request to continue if
      the requesting process is killed.
      In those cases, an exponential retry is appropriate.
      
      For example, if two different clients both open a file and
      get a READ delegation, and one client then unlinks the file
      (while still holding an open file descriptor), that unlink
      will use the "silly-rename" handling which is async.
      The first rename will result in NFS4ERR_DELAY while the
      delegation is reclaimed from the other client.  The rename
      will not be retried for 15 seconds, causing an unlink to take
      15 seconds rather than 100msec.
      
      This patch only added exponential timeout for async unlink and
      async rename.  Other async calls, such as 'close' are sometimes
      waited for so they might benefit from exponential timeout too.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
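The exponential retry described here, from 1/10 of a second up to 15 seconds, can be sketched as a delay that doubles on every NFS4ERR_DELAY until it hits the cap. The constants are taken from the commit message; the function name `next_retry_delay_ms()` and the millisecond units are assumptions made for illustration.

```c
#define RETRY_MIN_MS 100u    /* 1/10 second, from the commit message */
#define RETRY_MAX_MS 15000u  /* 15 seconds, from the commit message */

/* Return the delay to use for this retry and prepare the next one.
 * *timeout_ms carries state between calls; start it at 0. */
static unsigned int next_retry_delay_ms(unsigned int *timeout_ms)
{
    unsigned int delay;

    if (*timeout_ms == 0)
        *timeout_ms = RETRY_MIN_MS;   /* first retry: shortest delay */
    if (*timeout_ms > RETRY_MAX_MS)
        *timeout_ms = RETRY_MAX_MS;   /* never wait longer than the cap */

    delay = *timeout_ms;
    *timeout_ms <<= 1;                /* double for the following retry */
    return delay;
}
```

Starting from zero, successive calls yield 100, 200, 400 and so on up to 12800 ms, after which every further retry waits 15000 ms. In the silly-rename example from the message, the first retry would therefore happen after about 100 ms instead of 15 seconds.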
    • rpc: Add -EPERM processing for xs_udp_send_request() · 3dedbb5c
      Jason Baron authored
      If an iptables drop rule is added for an nfs server, the client can end up in
      a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
      is ignored since the prior bits of the packet may have been successfully queued
      and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
      thinks that because some bits were queued it should return -EAGAIN. We then try
      the request again and again, resulting in cpu spinning. Reproducer:
      
      1) open a file on the nfs server '/nfs/foo' (mounted using udp)
      2) iptables -A OUTPUT -d <nfs server ip> -j DROP
      3) write to /nfs/foo
      4) close /nfs/foo
      5) iptables -D OUTPUT -d <nfs server ip> -j DROP
      
      The softlockup occurs in step 4 above.
      
      The previous patch allows xs_sendpages() to return both a sent count and
      any error values that may have occurred. Thus, if we get an -EPERM, return
      that to the higher level code.
      
      With this patch in place we can successfully abort the above sequence and
      avoid the softlockup.
      
      I also tried the above test case on an nfs mount on tcp and although the system
      does not softlockup, I still ended up with the 'hung_task' firing after 120
      seconds, due to the i/o being stuck. The tcp case appears a bit harder to fix,
      since -EPERM appears to get ignored much lower down in the stack and does not
      propagate up to xs_sendpages(). This case is not quite as insidious as the
      softlockup and it is not addressed here.
      Reported-by: Yigong Lou <ylou@akamai.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
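The difference between retrying forever and aborting can be sketched as two small status functions that take the error and the byte count separately. Both functions and their names are hypothetical simplifications written for this illustration; they are not the body of `xs_udp_send_request()`.

```c
#include <errno.h>
#include <stddef.h>

/* Old behaviour (simplified): any partial progress hides the error, so an
 * incomplete send always asks the caller to try again. */
static int status_ignoring_err(int err, size_t sent, size_t total)
{
    if (sent > 0 || err == 0)
        return sent >= total ? 0 : -EAGAIN;
    return err;
}

/* New behaviour (simplified): -EPERM is reported even if some bytes were
 * queued, so the caller can give up instead of spinning. */
static int status_with_eperm(int err, size_t sent, size_t total)
{
    if (err == -EPERM)
        return -EPERM;
    if (sent > 0 || err == 0)
        return sent >= total ? 0 : -EAGAIN;
    return err;
}
```

With 200 of 300 bytes queued and `-EPERM` from the lower layer, the first function returns `-EAGAIN` on every attempt, while the second returns `-EPERM` and lets the higher-level code abort.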
    • rpc: return sent and err from xs_sendpages() · f279cd00
      Jason Baron authored
      If an error is returned after the first bits of a packet have already been
      successfully queued, xs_sendpages() will return a positive 'int' value
      indicating success. Callers seem to treat this as -EAGAIN.
      
      However, there are cases where it's not a question of waiting for the write
      queue to drain. For example, when there is an iptables rule dropping packets
      to the destination, the lower level code can return -EPERM only after parts
      of the packet have been successfully queued. In this case, we can end up
      continuously retrying resulting in a kernel softlockup.
      
      This patch is intended to make no changes in behavior but is in preparation for
      subsequent patches that can make decisions based on both the number of bytes
      sent by xs_sendpages() and any errors that may have been returned.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
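The interface change described in this commit, returning the error code and the number of bytes sent separately, can be sketched with an out-parameter. Everything in this block is hypothetical: `fake_sock`, `fake_queue()` and `send_chunks()` model a lower layer that accepts a few chunks and then fails, and are not the real `xs_sendpages()` signature.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical lower layer: accepts `chunks_accepted` chunks, then fails. */
struct fake_sock {
    unsigned int chunks_accepted;
    int fail_err;               /* negative errno returned once exhausted */
};

/* Queue one chunk; returns bytes queued or a negative errno. */
static int fake_queue(struct fake_sock *sk, size_t len)
{
    if (sk->chunks_accepted == 0)
        return sk->fail_err;
    sk->chunks_accepted--;
    return (int)len;
}

/* Send nchunks chunks of chunk_len bytes each. The return value is 0 or a
 * negative errno; *sent always reports how many bytes were queued, so a
 * late error no longer disappears behind a positive byte count. */
static int send_chunks(struct fake_sock *sk, unsigned int nchunks,
                       size_t chunk_len, size_t *sent)
{
    *sent = 0;
    for (unsigned int i = 0; i < nchunks; i++) {
        int ret = fake_queue(sk, chunk_len);
        if (ret < 0)
            return ret;
        *sent += (size_t)ret;
    }
    return 0;
}
```

A caller that sees `-EPERM` together with a non-zero `*sent` can now tell "partly queued, then rejected" apart from "queue full, try again".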
    • lockd: Try to reconnect if statd has moved · 173b3afc
      Benjamin Coddington authored
      If rpc.statd is restarted, upcalls to monitor hosts can fail with
      ECONNREFUSED.  In that case force a lookup of statd's new port and retry the
      upcall.
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
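The retry described in this commit, looking up statd's new port after ECONNREFUSED and trying the upcall once more, can be modelled as follows. This is a sketch under assumptions: `fake_statd`, `fake_upcall()`, `fake_rebind()` and `monitor_host()` are hypothetical names for a userspace model, not lockd's actual functions.

```c
#include <errno.h>

/* Hypothetical model of a statd endpoint whose port may change on restart. */
struct fake_statd {
    int cached_port;  /* port the client believes statd listens on */
    int actual_port;  /* port statd really listens on after a restart */
    int lookups;      /* number of forced port lookups performed */
};

/* The upcall succeeds only when the cached port is current. */
static int fake_upcall(const struct fake_statd *s)
{
    return s->cached_port == s->actual_port ? 0 : -ECONNREFUSED;
}

/* Force a fresh lookup of the port. */
static void fake_rebind(struct fake_statd *s)
{
    s->cached_port = s->actual_port;
    s->lookups++;
}

/* Try the upcall; on ECONNREFUSED, refresh the port and retry once. */
static int monitor_host(struct fake_statd *s)
{
    int status = fake_upcall(s);

    if (status == -ECONNREFUSED) {
        fake_rebind(s);
        status = fake_upcall(s);
    }
    return status;
}
```

The retry is deliberately limited to a single attempt in this model, so a statd that is really down still produces an error rather than a loop.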
    • SUNRPC: Don't wake tasks during connection abort · a743419f
      Benjamin Coddington authored
      When aborting a connection to preserve source ports, don't wake the task in
      xs_error_report.  This allows tasks with RPC_TASK_SOFTCONN to succeed if the
      connection needs to be re-established since it preserves the task's status
      instead of setting it to the status of the aborting kernel_connect().
      
      This may also avoid a potential conflict on the socket's lock.
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • Fixing lease renewal · 8faaa6d5
      Olga Kornievskaia authored
      Commit c9fdeb28 removed a 'continue' after checking if the lease needs
      to be renewed. However, if the client hasn't moved, the code falls through
      to starting reboot recovery erroneously (i.e., sends open reclaim and gets
      back a stale_clientid error) before recovering from getting stale_clientid
      on the renew operation.
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Fixes: c9fdeb28 (NFS: Add basic migration support to state manager thread)
      Cc: stable@vger.kernel.org # 3.13+
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  2. 21 Sep, 2014 1 commit
  3. 15 Sep, 2014 1 commit
  4. 12 Sep, 2014 15 commits
  5. 10 Sep, 2014 13 commits