1. 14 Jun, 2016 25 commits
    • Lars Ellenberg's avatar
      drbd: bump current uuid when resuming IO with diskless peer · 20004e24
      Lars Ellenberg authored
      Scenario, starting with normal operation
       Connected Primary/Secondary UpToDate/UpToDate
       NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
       ... more failures happen, secondary loses its disk,
       but eventually is able to re-establish the replication link ...
       Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)
      
      We used to just resume/resend suspended requests,
      without bumping the UUID.
      
      Which will lead to problems later, when we want to re-attach the disk on
      the peer, without first disconnecting, or if we experience additional
      failures, because we now have diverging data without being able to
      recognize it.
      
      Make sure we also bump the current data generation UUID,
      if we notice "peer disk unknown" -> "peer disk known bad".
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      20004e24
    • Lars Ellenberg's avatar
      drbd: disallow promotion during resync handshake, avoid deadlock and hard reset · 31d64604
      Lars Ellenberg authored
      We already serialize connection state changes,
      and other, non-connection state changes (role changes)
      while we are establishing a connection.
      
      But if we have an established connection,
      then trigger a resync handshake (by primary --force or similar),
      until now we just had to be "lucky".
      
      Consider this sequence (e.g. deployment scenario):
      create-md; up;
        -> Connected Secondary/Secondary Inconsistent/Inconsistent
      then do a racy primary --force on both peers.
      
       block drbd0: drbd_sync_handshake:
       block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
       block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
        *** HERE things go wrong. ***
       block drbd0: role( Secondary -> Primary )
       block drbd0: drbd_sync_handshake:
       block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: Becoming sync target due to disk states.
       block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
       block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
       drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
      
      The problem here is that the local promotion happens before the sync handshake
      triggered by the remote promotion was completed.  Some assumptions elsewhere
      become wrong, and when the expected resync handshake is then received and
      processed, we get stuck in a deadlock, which can only be recovered by reboot :-(
      
      Fix: if we know the peer has good data,
      and our own disk is present, but NOT good,
      and there is no resync going on yet,
      we expect a sync handshake to happen "soon".
      So reject a racy promotion with SS_IN_TRANSIENT_STATE.
      
      Result:
       ... as above ...
       block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
        *** local promotion being postponed until ... ***
       block drbd0: drbd_sync_handshake:
       block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
        ...
       block drbd0: conn( WFBitMapT -> WFSyncUUID )
       block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
       block drbd0: conn( WFSyncUUID -> SyncTarget )
        *** ... after the resync handshake ***
       block drbd0: role( Secondary -> Primary )
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      31d64604
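The promotion check described in the fix above can be sketched as a small predicate. This is a minimal illustration, not the DRBD implementation: the enums, their ordering, and the names `check_promotion` and `PROMOTION_IN_TRANSIENT_STATE` are hypothetical stand-ins for the real state machine and for SS_IN_TRANSIENT_STATE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified state model; names and ordering are illustrative only. */
enum disk_state { DISKLESS, INCONSISTENT, OUTDATED, UP_TO_DATE };
enum conn_state { CONNECTED, SYNC_SOURCE, SYNC_TARGET };
enum verdict { PROMOTION_OK, PROMOTION_IN_TRANSIENT_STATE };

/* Decide whether a local promotion to Primary should be rejected for now. */
static enum verdict check_promotion(enum conn_state conn,
                                    enum disk_state disk,
                                    enum disk_state pdsk)
{
    /* The peer is known to have good data. */
    bool peer_has_good_data = (pdsk == UP_TO_DATE);
    /* Our own disk is present, but NOT good. */
    bool local_present_not_good = (disk > DISKLESS && disk < UP_TO_DATE);
    /* No resync is going on yet. */
    bool resync_running = (conn == SYNC_SOURCE || conn == SYNC_TARGET);

    /* A sync handshake is expected "soon": reject the racy promotion. */
    if (peer_has_good_data && local_present_not_good && !resync_running)
        return PROMOTION_IN_TRANSIENT_STATE;
    return PROMOTION_OK;
}
```

In the log sequence above, the rejected case corresponds to Connected with a local Inconsistent disk and pdsk UpToDate; once the state reaches SyncTarget, the same call returns PROMOTION_OK.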
    • Lars Ellenberg's avatar
      drbd: sync_handshake: handle identical uuids with current (frozen) Primary · f2d3d75b
      Lars Ellenberg authored
      If, in a two-primary scenario, we lost our peer, froze IO,
      and are still frozen (no UUID rotation) when the peer comes back
      as Secondary after a hard crash, we will see identical UUIDs.
      
      The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
      arbitration, but that would cause the still running (but frozen) Primary
      to become SyncTarget (which it typically refuses), and the handshake is
      declined.
      
      Fix: check current roles.
      If we have *one* current primary, the Primary wins.
      (rule_nr = 41)
      
      Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
      to determine if rule_nr = 41 can be applied.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f2d3d75b
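The arbitration for identical UUIDs described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the function name, the boolean inputs, and the return convention (+1 for "we become sync source", -1 for "we become sync target", 0 for "no decision"). The fallback branch is a deliberately simplified stand-in for rule_nr = 40 and leaves ties unresolved.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the arbitration for identical UUIDs.
 * Returns +1 if we should become sync source, -1 if we should
 * become sync target, 0 if this sketch makes no decision.
 */
static int arbitrate_identical_uuids(bool self_primary, bool peer_primary,
                                     bool self_crashed_primary,
                                     bool peer_crashed_primary,
                                     bool both_support_rule_41)
{
    /* rule_nr = 41: with exactly one current Primary, the Primary wins. */
    if (both_support_rule_41 && self_primary != peer_primary)
        return self_primary ? 1 : -1;

    /* Simplified rule_nr = 40: the crashed-primary flag arbitrates. */
    if (self_crashed_primary == peer_crashed_primary)
        return 0; /* tie handling is omitted in this sketch */
    return self_crashed_primary ? 1 : -1;
}
```

With a frozen local Primary and a peer that crashed as Primary, the sketch returns +1 when both sides negotiated the new rule, and -1 (the declined handshake described above) when they did not.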
    • Lars Ellenberg's avatar
      drbd: introduce WRITE_SAME support · 9104d31a
      Lars Ellenberg authored
      We will support WRITE_SAME, if
       * all peers support WRITE_SAME (both in kernel and DRBD version),
       * all peer devices support WRITE_SAME
       * logical_block_size is identical on all peers.
      
      We may at some point introduce a fallback on the receiving side
      for devices/kernels that do not support WRITE_SAME,
      by open-coding a submit loop. But not yet.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9104d31a
    • Lars Ellenberg's avatar
      drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0 · 65f5be35
      Lars Ellenberg authored
      Even if discard_zeroes_data == 0,
      if discard_zeroes_if_aligned is set, we assume we can reliably
      zero-out/discard using the drbd_issue_peer_discard() helper.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      65f5be35
    • Lars Ellenberg's avatar
      drbd: only restart frozen disk io when D_UP_TO_DATE · af61494a
      Lars Ellenberg authored
      When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
      R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
      backend that is being attached as D_UP_TO_DATE.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      af61494a
    • Lars Ellenberg's avatar
      drbd: if there is no good data accessible, writes should be IO errors · 0ead5cca
      Lars Ellenberg authored
      If DRBD lost all paths to good data,
      and the on-no-data-accessible policy is OND_SUSPEND_IO,
      all pending and new IO requests are suspended (will block).
      
      If that setting is OND_IO_ERROR, IO will still be completed.
      READ to "clean" areas (e.g. on a D_INCONSISTENT device,
      and bitmap indicates a block is already in sync) will succeed.
      READ to "unclean" areas (bitmap indicates block is out-of-sync),
      will return EIO.
      
      If we are already D_DISKLESS (or D_FAILED), we also return EIO.
      
      Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
      after replication link loss, new WRITE requests still went through OK.
      
      They would also set the "out-of-sync" bit on their way, so READ after
      WRITE would still return EIO. Also, since the data generation UUIDs had
      not been bumped, we would cause data divergence, without being able to
      detect it on the next sync handshake, given the right sequence of events
      in a multiple error scenario and "improper" order of recovery actions.
      
      The right thing to do is to return EIO for all new writes,
      unless we have access to good, current, D_UP_TO_DATE data.
      
      The "established best practices" way to avoid these situations in the
      first place is to set OND_SUSPEND_IO, or even do a hard-reset from
      the pri-on-incon-degr policy helper hook.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0ead5cca
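The intended behaviour described above can be summarised as a decision function. This is a sketch under stated assumptions: the enum and parameter names are hypothetical, and the real code operates on DRBD's state and request structures rather than on plain booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names standing in for OND_IO_ERROR / OND_SUSPEND_IO. */
enum ond_policy { POLICY_IO_ERROR, POLICY_SUSPEND_IO };
enum io_verdict { IO_PROCEED, IO_SUSPEND, IO_FAIL_EIO };

/*
 * have_up_to_date_data: some D_UP_TO_DATE copy (local or peer) is reachable.
 * have_local_disk:      false models D_DISKLESS or D_FAILED.
 * block_in_sync:        the bitmap marks the block as in sync locally.
 */
static enum io_verdict decide_io(enum ond_policy policy, bool is_write,
                                 bool have_up_to_date_data,
                                 bool have_local_disk, bool block_in_sync)
{
    if (have_up_to_date_data)
        return IO_PROCEED;
    if (policy == POLICY_SUSPEND_IO)
        return IO_SUSPEND;           /* pending and new IO blocks */
    if (is_write)
        return IO_FAIL_EIO;          /* the fix: no good data, no writes */
    if (have_local_disk && block_in_sync)
        return IO_PROCEED;           /* READ from a "clean" area */
    return IO_FAIL_EIO;              /* "unclean" area, or no disk at all */
}
```

The third branch is the behaviour this commit adds: before the change, a write in that situation would have gone through.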
    • Lars Ellenberg's avatar
      drbd: don't forget error completion when "unsuspending" IO · 7bd000cb
      Lars Ellenberg authored
      Possible sequence of events:
      SyncTarget is made Primary, then loses replication link
      (only path to good data on SyncSource).
      
      Behavior is then controlled by the on-no-data-accessible policy,
      which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).
      
      If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
      (IO suspended due to fencing policy) flag, do NOT set the susp_nod
      (IO suspended due to no data) flag.
      
      But we forgot to call the IO error completion for all pending,
      suspended, requests.
      
      While at it, also add a race check for a theoretically possible
      race with a new handshake (network hiccup): we may be able to
      re-send requests, and can avoid passing IO errors up the stack.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7bd000cb
    • Lars Ellenberg's avatar
      drbd: introduce unfence-peer handler · 26a96110
      Lars Ellenberg authored
      When resync is finished, we already call the "after-resync-target"
      handler (on the former sync target, obviously), once per volume.
      
      Paired with the before-resync-target handler, you can create snapshots,
      before the resync causes the volumes to become inconsistent,
      and discard those snapshots again, once they are no longer needed.
      
      It was also overloaded to be paired with the "fence-peer" handler,
      to "unfence" once the volumes are up-to-date and known good.
      
      This has some disadvantages, though: we call "fence-peer" for the whole
      connection (once for the group of volumes), but would call unfence as
      side-effect of after-resync-target once for each volume.
      
      Also, we fence on a (current, or about to become) Primary,
      which will later become the sync-source.
      
      Calling unfence only as a side effect of the after-resync-target
      handler opens a race window, between a new fence on the Primary
      (SyncSource) and the unfence on the SyncTarget, which is difficult to
      close without some kind of "cluster wide lock" in those handlers.
      
      We would not need those handlers if we could still communicate.
      Which makes trying to acquire a cluster wide lock from those handlers
      seem like a very bad idea.
      
      This introduces the "unfence-peer" handler, which will be called
      per connection (once for the group of volumes), just like the fence
      handler, only once all volumes are back in sync, and on the SyncSource.
      
      Which is expected to be the node that previously called "fence", the
      node that is currently allowed to be Primary, and thus the only node
      that could trigger a new "fence" that could race with this unfence.
      
      Which makes us not need any cluster wide synchronization here,
      serializing two scripts running on the same node is trivial.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      26a96110
    • Lars Ellenberg's avatar
      drbd: finish resync on sync source only by notification from sync target · 5052fee2
      Lars Ellenberg authored
      If the replication link breaks exactly during "resync finished" detection,
      finishing too early on the sync source could again lead to UUIDs rotated
      too fast, and potentially a spurious full resync on next handshake.
      
      Always wait for explicit resync finished state change notification from
      the sync target.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5052fee2
    • Lars Ellenberg's avatar
      drbd: allow larger max_discard_sectors · 505675f9
      Lars Ellenberg authored
      Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
      al-extents available, and allow up to half of that to be
      discarded in one bio.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      505675f9
    • Lars Ellenberg's avatar
      drbd: zero-out partial unaligned discards on local backend · 7435e901
      Lars Ellenberg authored
      For consistency, also zero-out partial unaligned chunks of discard
      requests on the local backend.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7435e901
    • Lars Ellenberg's avatar
      drbd: possibly disable discard support, if backend has discard_zeroes_data=0 · 69ba1ee9
      Lars Ellenberg authored
      Now that we have the discard_zeroes_if_aligned setting, we should also
      check it when setting up our queue parameters on the primary,
      not only on the receiving side.
      
      We announce discard support,
      UNLESS
      
       * we are connected to a peer that does not support TRIM
         on the DRBD protocol level.  Otherwise, it would either discard, or
         do a fallback to zero-out, depending on its backend and configuration.
      
       * our local backend does not support discards,
         or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      69ba1ee9
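The two "UNLESS" conditions above translate into a short predicate. The following is only a sketch: the function and parameter names are hypothetical, and the real decision is made while setting up the request queue limits.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: should the Primary announce discard support?
 * All parameter names are illustrative, not DRBD identifiers.
 */
static bool announce_discard(bool connected, bool peer_supports_trim,
                             bool backend_supports_discard,
                             bool backend_discard_zeroes_data,
                             bool discard_zeroes_if_aligned)
{
    /* Connected to a peer that lacks TRIM on the DRBD protocol level. */
    if (connected && !peer_supports_trim)
        return false;
    /* The local backend does not support discards at all. */
    if (!backend_supports_discard)
        return false;
    /* discard_zeroes_data=0 AND discard_zeroes_if_aligned=no. */
    if (!backend_discard_zeroes_data && !discard_zeroes_if_aligned)
        return false;
    return true;
}
```

A thin LVM backend (discard_zeroes_data=0) with the default discard_zeroes_if_aligned=yes therefore still announces discards, which matches the stated goal of not breaking existing setups.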
    • Lars Ellenberg's avatar
      drbd: when receiving P_TRIM, zero-out partial unaligned chunks · dd4f699d
      Lars Ellenberg authored
      We can avoid spurious data divergence caused by partially-ignored
      discards on certain backends with discard_zeroes_data=0, if we
      translate partial unaligned discard requests into explicit zero-out.
      
      The relevant use case is LVM/DM thin.
      
      If on different nodes, DRBD is backed by devices with differing
      discard characteristics, discards may lead to data divergence
      (old data or garbage left over on one backend, zeroes due to
      unmapped areas on the other backend). Online verify would now
      potentially report tons of spurious differences.
      
      While probably harmless for most use cases (fstrim on a file system),
      DRBD cannot have that, it would violate our promise to upper layers
      that our data instances on the nodes are identical.
      
      To be correct and play safe (make sure data is identical on both copies),
      we would have to disable discard support, if our local backend (on a
      Primary) does not support "discard_zeroes_data=true".
      
      We'd also have to translate discards to explicit zero-out on the
      receiving (typically: Secondary) side, unless the receiving side
      supports "discard_zeroes_data=true".
      
      Which both would allocate those blocks, instead of unmapping them,
      in contrast with expectations.
      
      LVM/DM thin does set discard_zeroes_data=0,
      because it silently ignores discards to partial chunks.
      
      We can work around this by checking the alignment first.
      For unaligned (wrt. alignment and granularity) or too small discards,
      we zero-out the initial (and/or) trailing unaligned partial chunks,
      but discard all the aligned full chunks.
      
      At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
      
      Arguably it should behave this way internally, by default,
      and we'll try to make that happen.
      
      But our workaround is still valid for already deployed setups,
      and for other devices that may behave this way.
      
      Setting discard-zeroes-if-aligned=yes will allow DRBD to use
      discards, and to announce discard_zeroes_data=true, even on
      backends that announce discard_zeroes_data=false.
      
      Setting discard-zeroes-if-aligned=no will cause DRBD to always
      fall back to zero-out on the receiving side, and to not even
      announce discard capabilities on the Primary, if the respective
      backend announces discard_zeroes_data=false.
      
      We used to ignore the discard_zeroes_data setting completely.
      To not break established and expected behaviour, and suddenly
      cause fstrim on thin-provisioned LVs to run out-of-space,
      instead of freeing up space, the default value is "yes".
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dd4f699d
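The alignment workaround described above (zero-out the unaligned head and tail, discard the aligned full chunks) comes down to a little range arithmetic. The sketch below only computes the split; it is not the drbd_issue_peer_discard() helper, and the struct, function name, and parameters are assumptions made for illustration. All quantities are in sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: how one discard request is split up. */
struct discard_plan {
    uint64_t zero_head;     /* sectors to zero-out before the first full chunk */
    uint64_t discard_start; /* first sector of the aligned discard */
    uint64_t discard_len;   /* sectors to discard, a multiple of granularity */
    uint64_t zero_tail;     /* sectors to zero-out after the last full chunk */
};

/* Chunk boundaries are assumed to lie at alignment + k * granularity. */
static struct discard_plan plan_discard(uint64_t start, uint64_t nr_sectors,
                                        uint64_t granularity, uint64_t alignment)
{
    struct discard_plan p = { 0, start, 0, 0 };
    uint64_t end = start + nr_sectors;
    uint64_t off, first, full;

    assert(granularity > 0);
    alignment %= granularity;

    /* Offset of start within its chunk, then round up to the next boundary. */
    off = (start + granularity - alignment) % granularity;
    first = off ? start + (granularity - off) : start;

    if (first >= end || end - first < granularity) {
        /* Too small to contain a full chunk: zero-out everything. */
        p.zero_head = nr_sectors;
        p.discard_start = end;
        return p;
    }

    full = ((end - first) / granularity) * granularity;
    p.zero_head = first - start;
    p.discard_start = first;
    p.discard_len = full;
    p.zero_tail = end - (first + full);
    return p;
}
```

For example, a request covering sectors 3 to 32 with a granularity of 8 and alignment 0 yields a 5-sector zero-out, a 24-sector discard starting at sector 8, and a 1-sector zero-out at the end.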
    • Lars Ellenberg's avatar
      drbd: allow parallel flushes for multi-volume resources · f9ff0da5
      Lars Ellenberg authored
      To maintain write-order fidelity across all volumes in a DRBD resource,
      the receiver of a P_BARRIER needs to issue flushes to all volumes.
      We used to do this by calling blkdev_issue_flush(), synchronously,
      one volume at a time.
      
      We now submit all flushes to all volumes in parallel, then wait for all
      completions, to reduce worst-case latencies on multi-volume resources.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f9ff0da5
    • Lars Ellenberg's avatar
      drbd: fix for truncated minor number in callback command line · 0982368b
      Lars Ellenberg authored
      The command line parameter the kernel module uses to communicate the
      device minor to the userland helper is flawed: the device identifier
      "minor-%d" is truncated for minors with more than 5 digits.
      
      But DRBD 8.4 allows 2^20 == 1048576 minors,
      thus a minimum of 7 digits must be supported.
      
      Reported by Veit Wahlich on drbd-dev.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0982368b
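The digit arithmetic above is easy to check with snprintf(), which reports how many characters the full string would need. The helper below is a hypothetical illustration, not the kernel code; the buffer sizes 12 and 14 used in the note are assumptions derived from the digit counts in the commit message ("minor-" is 6 characters, plus the digits, plus the terminating NUL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: does "minor-%d" for the given minor fit into a
 * buffer of bufsize bytes (including the terminating NUL)?
 */
static int minor_arg_fits(size_t bufsize, int minor)
{
    char buf[32];
    int needed;

    assert(bufsize <= sizeof(buf));
    /* snprintf returns the length the untruncated string would have. */
    needed = snprintf(buf, bufsize, "minor-%d", minor);
    return needed >= 0 && (size_t)needed < bufsize;
}
```

Under these assumptions, a 12-byte buffer holds at most 5 digits (minor 99999 fits, minor 100000 does not), while 14 bytes are enough for the largest 7-digit minor, 1048575.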
    • Lars Ellenberg's avatar
      drbd: fix regression: protocol A sometimes synchronous, C sometimes double-latency · 1b228c98
      Lars Ellenberg authored
      Regression introduced with 8.4.5
       drbd: application writes may set-in-sync in protocol != C
      
      Overwriting the same block (LBA) while a former version is still
      "in-flight" to the peer (to be exact: we did not receive the
      P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
      former version to be acknowledged by the peer.
      
      In synchronous and quasi-synchronous protocols C and B,
      this may double the latency on overwrites.
      
      With protocol A, which is supposed to be asynchronous and only wait for
      local completion, it is even worse: it would make overwrites
      quasi-synchronous, they would be hit by the full RTT, which protocol A
      was specifically meant to avoid, and possibly the additional time it
      takes to drain the buffers first.
      
      Particularly bad for databases, or anything else that
      does frequent updates to the same blocks (various file system meta data).
      
      No impact if >= rtt passes between updates to the same block.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1b228c98
    • Philipp Reisner's avatar
      drbd: Create the protocol feature THIN_RESYNC · 92d94ae6
      Philipp Reisner authored
      If thinly provisioned volumes are used, during a resync the sync source
      tries to find out if a block is deallocated. If it is deallocated, then
      the resync target uses blkdev_issue_zeroout() on the range in
      question.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      92d94ae6
    • Philipp Reisner's avatar
      drbd: Introduce new disk config option rs-discard-granularity · a5ca66c4
      Philipp Reisner authored
      As long as the value is 0 the feature is disabled. When set
      to a positive value, DRBD limits and aligns its resync requests
      to the rs-discard-granularity setting. If the sync source detects
      all zeros in such a block, the resync target discards the range
      on disk.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a5ca66c4
    • Philipp Reisner's avatar
      drbd: Implement handling of thinly provisioned storage on resync target nodes · 700ca8c0
      Philipp Reisner authored
      If during resync we read only zeroes for a range of sectors, assume
      that these sectors can be discarded on the sync target node.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      700ca8c0
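The "read only zeroes" test mentioned above is, at its core, a scan of the data that was just read. A minimal userspace sketch follows; the function name is hypothetical, and the kernel code works on bio pages rather than on a flat buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte in the range is zero. */
static bool range_is_all_zero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            return false; /* real data: the block must be transferred */
    }
    return true; /* candidate for a discard on the sync target */
}
```

A sync source could run such a check on each resync block and, when it returns true, tell the target to discard the range instead of sending the zeroes over the wire.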
    • Philipp Reisner's avatar
      c5c23854
    • Lars Ellenberg's avatar
      drbd: change bitmap write-out when leaving resync states · be115b69
      Lars Ellenberg authored
      When leaving resync states because of disconnect,
      do the bitmap write-out synchronously in the drbd_disconnected() path.
      
      When leaving resync states because we go back to AHEAD/BEHIND, or
      because resync actually finished, or some disk was lost during resync,
      trigger the write-out from after_state_ch().
      
      The bitmap write-out for resync -> ahead/behind was missing completely before.
      
      Note that this is all only an optimization to avoid double-resyncs of
      already completed blocks in case this node crashes.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      be115b69
    • Lars Ellenberg's avatar
      drbd: bitmap bulk IO: do not always suspend IO · c0065f98
      Lars Ellenberg authored
      The intention was to only suspend IO if some normal bitmap operation is
      supposed to be locked out, not always. If the bulk operation is flagged
      as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c0065f98
  2. 12 Jun, 2016 9 commits
  3. 09 Jun, 2016 1 commit
  4. 08 Jun, 2016 3 commits
  5. 07 Jun, 2016 2 commits