1. 10 Feb, 2020 11 commits
    • Bob Peterson's avatar
      gfs2: Ignore dlm recovery requests if gfs2 is withdrawn · 03678a99
      Bob Peterson authored
      When a node fails, user space informs dlm of the node failure,
      and dlm instructs gfs2 on the surviving nodes to perform journal
      recovery. It does this by calling various callback functions in
      lock_dlm.c. To mark its progress, it keeps generation numbers
      and recover bits in a dlm "control" lock lvb, which is seen by
      all nodes to determine which journals need to be replayed.
      
      The gfs2 on all nodes get the same recovery requests from dlm,
      so they all try to do the recovery, but only one will be
      granted the exclusive lock on the journal. The others fail
      with a "Busy" message on their "try lock."
      
      However, when a node is withdrawn, it cannot safely do any
      recovery or replay any journals. To make matters worse,
      gfs2 might withdraw as a result of attempting recovery. For
      example, this might happen if the device goes offline, or if
      an hba fails. But in today's gfs2 code, it doesn't check for
      being withdrawn at any step in the recovery process. What's
      worse is that these callbacks from dlm have no return code,
      so there is no way to indicate failure back to dlm. We can
      send a "Recovery failed" uevent eventually, but that tells
      user space what happened, not dlm's kernel code.
      
      Before this patch, lock_dlm would perform its recovery steps but
      ignore the result, and eventually it would still update its
      generation number in the lvb, despite the fact that it may have
      withdrawn or encountered an error. The other nodes would then
      see the newer generation number in the lvb and conclude that
      they don't need to do recovery because the generation number
      is newer than the last one they saw. They think a different
      node has already recovered the journal.
      
      This patch adds checks to several of the callbacks used by dlm
      in its recovery state machine so that the functions are ignored
      and skipped if an io error has occurred or if the file system
      is withdrawn. That prevents the lvb bits from being updated, and
      therefore dlm and user space still see the need for recovery to
      take place.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Reviewed-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      03678a99
    • Bob Peterson's avatar
      gfs2: Only complain the first time an io error occurs in quota or log · f34a6135
      Bob Peterson authored
      Before this patch, all io errors received by the quota daemon or the
      logd daemon would cause a complaint message to be issued, such as:
      
         gfs2: fsid=dm-13.0: Error 10 writing to journal, jid=0
      
      This patch changes it so that the error message is only issued the
      first time the error is encountered.
      
      Also, before this patch function gfs2_end_log_write did not set the
      sd_log_error value, so log errors would not cause the file system to
      be withdrawn. This patch sets the error code so the file system is
      properly withdrawn if an io error is encountered writing to the journal.
      
      WARNING: This change in function breaks check xfstests generic/441
      and causes it to fail: io errors writing to the log should cause a
      file system to be withdrawn, and no further operations are tolerated.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Reviewed-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      f34a6135
    • Bob Peterson's avatar
      gfs2: log error reform · 036330c9
      Bob Peterson authored
      Before this patch, gfs2 kept track of journal io errors in two
      places sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags.
      This patch consolidates the two into sd_log_error so that it
      reflects the first error encountered writing to the journal.
      In future patches, we will take advantage of this by checking
      this value rather than having to check both when reacting to
      io errors.
      
      In addition, this fixes a tight loop in unmount: If buffers
      get on the ail1 list and an io error occurs elsewhere, the
      ail1 list would never be cleared because they were always busy.
      So unmount would hang, waiting for the ail1 list to empty.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Reviewed-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      036330c9
    • Bob Peterson's avatar
      gfs2: Rework how rgrp buffer_heads are managed · b3422cac
      Bob Peterson authored
      Before this patch, the rgrp code had a serious problem related to
      how it managed buffer_heads for resource groups. The problem caused
      file system corruption, especially in cases of journal replay.
      
      When an rgrp glock was demoted to transfer ownership to a
      different cluster node, do_xmote() first calls rgrp_go_sync and then
      rgrp_go_inval, as expected. When it calls rgrp_go_sync, that called
      gfs2_rgrp_brelse() that dropped the buffer_head reference count.
      In most cases, the reference count went to zero, which is right.
      However, there were other places where the buffers are handled
      differently.
      
      After rgrp_go_sync, do_xmote called rgrp_go_inval which called
      gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to
      truncate_inode_pages_range would get rid of the pages in memory,
      but only if the reference count drops to 0.
      
      Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
      So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
      to the buffer_heads in cases where the reference count was still 1.
      Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
      it failed the check for "if (bi->bi_bh)" and thus failed to call
      brelse a second time. Because of that, the reference count on those
      buffers sometimes failed to drop from 1 to 0. And that caused
      function truncate_inode_pages_range to keep the pages in page cache
      rather than freeing them.
      
      The next time the rgrp glock was acquired, the metadata read of
      the rgrp buffers re-used the pages in memory, which were now
      wrong because they were likely modified by the other node who
      acquired the glock in EX (which is why we demoted the glock).
      This re-use of the page cache caused corruption because changes
      made by the other nodes were never seen, so the bitmaps were
      inaccurate.
      
      For some reason, the problem became most apparent when journal
      replay forced the replay of rgrps in memory, which caused newer
      rgrp data to be overwritten by the older in-core pages.
      
      A big part of the problem was that the rgrp buffer were released
      in multiple places: The go_unlock function would release them when
      the glock was released rather than when the glock is demoted,
      which is clearly wrong because our intent was to cache them until
      the glock is demoted from SH or EX.
      
      This patch attempts to clean up the mess and make one consistent
      and centralized mechanism for managing the rgrp buffer_heads by
      implementing several changes:
      
      1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
         We don't want to release the buffers or zero the pointers when
         syncing for the reasons stated above. It only makes sense to
         release them when the glock is actually invalidated (go_inval).
         And when we do, then we set the bh pointers to NULL.
      2. The go_unlock function (which was only used for rgrps) is
         eliminated, as we've talked about doing many times before.
         The go_unlock function was called too early in the glock dq
         process, and should not happen until the glock is invalidated.
      3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
         That will now happen automatically when the rgrp glocks are
         demoted, and shouldn't happen any sooner or later than that.
         Instead, function gfs2_clear_rgrpd has been modified to demote
         the rgrp glocks, and therefore, free those pages, before the
         remaining glocks are culled by gfs2_gl_hash_clear. This
         prevents the gl_object from hanging around when the glocks are
         culled.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Reviewed-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      b3422cac
    • Bob Peterson's avatar
      gfs2: clear ail1 list when gfs2 withdraws · 30fe70a8
      Bob Peterson authored
      This patch fixes a bug in which function gfs2_log_flush can get into
      an infinite loop when a gfs2 file system is withdrawn. The problem
      is the infinite loop "for (;;)" in gfs2_log_flush which would never
      finish because the io error and subsequent withdraw prevented the
      items from being taken off the ail list.
      
      This patch tries to clean up the mess by allowing withdraw situations
      to move not-in-flight buffer_heads to the ail2 list, where they will
      be dealt with later.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Reviewed-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      30fe70a8
    • Bob Peterson's avatar
      gfs2: Introduce concept of a pending withdraw · 69511080
      Bob Peterson authored
      File system withdraws can be delayed when inconsistencies are
      discovered when we cannot withdraw immediately, for example, when
      critical spin_locks are held. But delaying the withdraw can cause
      gfs2 to ignore the error and keep running for a short period of time.
      For example, an rgrp glock may be dequeued and demoted while there
      are still buffers that haven't been properly revoked, due to io
      errors writing to the journal.
      
      This patch introduces a new concept of a pending withdraw, which
      means an inconsistency has been discovered and we need to withdraw
      at the earliest possible opportunity. In these cases, we aren't
      quite withdrawn yet, but we still need to not dequeue glocks and
      other critical things. If we dequeue the glocks and the withdraw
      results in our journal being replayed, the replay could overwrite
      data that's been modified by a different node that acquired the
      glock in the meantime.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Reviewed-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      69511080
    • Andreas Gruenbacher's avatar
      gfs2: Return bool from gfs2_assert functions · 8e28ef1f
      Andreas Gruenbacher authored
      The gfs2_assert functions only print messages when the filesystem hasn't been
      withdrawn yet, and they indicate whether or not they've printed something in
      their return value.  However, none of the callers use that information, so
      simply return whether or not the assert has failed.
      
      (The gfs2_assert functions are still backwards; they return false when an
      assertion is true.)
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      8e28ef1f
    • Andreas Gruenbacher's avatar
      gfs2: Turn gfs2_consist into void functions · a5ca2f1c
      Andreas Gruenbacher authored
      Change the various gfs2_consist functions to return void.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      a5ca2f1c
    • Andreas Gruenbacher's avatar
      gfs2: Remove usused cluster_wide arguments of gfs2_consist functions · d7e7ab3f
      Andreas Gruenbacher authored
      These arguments are always passed as 0, and they are never evaluated.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      d7e7ab3f
    • Andreas Gruenbacher's avatar
      gfs2: Report errors before withdraw · 8dc88ac6
      Andreas Gruenbacher authored
      In gfs2_rgrp_verify and compute_bitstructs, make sure to report errors before
      withdrawing the filesystem: otherwise, when we withdraw first and withdraw is
      configured to panic, we'll never get to the error reporting.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      8dc88ac6
    • Andreas Gruenbacher's avatar
      gfs2: Split gfs2_lm_withdraw into two functions · badb55ec
      Andreas Gruenbacher authored
      Split gfs2_lm_withdraw into a function that prints an error message and a
      function that withdraws the filesystem.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      badb55ec
  2. 06 Feb, 2020 3 commits
  3. 31 Jan, 2020 26 commits