1. 23 Mar, 2017 7 commits
    • md/raid1, raid10: move rXbio accounting closer to allocation. · 6b6c8110
      NeilBrown authored
      When raid1 or raid10 find they will need to allocate a new
      r1bio/r10bio, in order to work around a known bad block, they
      account for the allocation well before the allocation is
      made.  This separation makes the correctness less obvious
      and requires comments.
      
      The accounting only needs to happen slightly before the allocation:
      it must precede the submission of the first rXbio, but that is all.
      
      So move the accounting down to where it makes more sense.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • Revert "md/raid5: limit request size according to implementation limits" · 97d53438
      NeilBrown authored
      This reverts commit e8d7c332.
      
      Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
      need to impose these limits.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: remove over-loading of ->bi_phys_segments. · 0472a42b
      NeilBrown authored
      When a read request, which bypassed the cache, fails, we need to retry
      it through the cache.
      This involves attaching it to a sequence of stripe_heads, and it may not
      be possible to get all the stripe_heads we need at once.
      We do what we can, and record how far we got in ->bi_phys_segments so
      we can pick up again later.
      
      There is only ever one bio which may have a non-zero offset stored in
      ->bi_phys_segments, the one that is either active in the single thread
      which calls retry_aligned_read(), or is in conf->retry_read_aligned
      waiting for retry_aligned_read() to be called again.
      
      So we only need to store one offset value.  This can be in a local
      variable passed between remove_bio_from_retry() and
      retry_aligned_read(), or in the r5conf structure next to the
      ->retry_read_aligned pointer.
      
      Storing it there allows the last usage of ->bi_phys_segments to be
      removed from md/raid5.c.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter · 016c76ac
      NeilBrown authored
      md/raid5 needs to keep track of how many stripe_heads are processing a
      bio so that it can delay calling bio_endio() until all stripe_heads
      have completed.  It currently uses 16 bits of ->bi_phys_segments for
      this purpose.
      
      16 bits is only enough for requests of up to 256MB, and it is possible for a
      single bio to be larger than this, which causes problems.  Also, the
      bio struct contains a larger counter, __bi_remaining, which has a
      purpose very similar to the purpose of our counter.  So stop using
      ->bi_phys_segments, and instead use __bi_remaining.
      
      This means we don't need to initialize the counter, as our caller
      initializes it to '1'.  It also means we can call bio_endio() directly
      as it tests this counter internally.
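
The 256M limit mentioned above can be checked with a little arithmetic. The sketch below assumes that each stripe_head covers 4 KiB of a bio (one 4 KiB page per device), which the commit message does not state; the helper name `counter_overflows` is made up for the illustration.

```python
# Assumption: one stripe_head covers 4 KiB of a bio (4 KiB pages).
STRIPE_BYTES = 4 * 1024
COUNTER_BITS = 16

# Largest value a 16-bit field can hold.
max_count = (1 << COUNTER_BITS) - 1

# Approximate request size at which the 16-bit count runs out.
capacity_bytes = (1 << COUNTER_BITS) * STRIPE_BYTES


def counter_overflows(bio_bytes: int) -> bool:
    """True if a bio of this size needs more stripe_heads than 16 bits can count."""
    stripes_needed = -(-bio_bytes // STRIPE_BYTES)  # ceiling division
    return stripes_needed > max_count
```

Under that assumption `capacity_bytes` works out to 256 MiB, so a 128 MiB bio still fits while a 512 MiB bio does not.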
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: call bio_endio() directly rather than queueing for later. · bd83d0a2
      NeilBrown authored
      We currently gather bios that need to be returned into a bio_list
      and call bio_endio() on them all together.
      The original reason for this was to avoid making the calls while
      holding a spinlock.
      Locking has changed a lot since then, and that reason is no longer
      valid.
      
      So discard return_io() and various return_bi lists, and just call
      bio_endio() directly as needed.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: simplify delaying of writes while metadata is updated. · 16d997b7
      NeilBrown authored
      If a device fails during a write, we must ensure the failure is
      recorded in the metadata before the completion of the write is
      acknowledged.
      
      Commit c3cce6cd ("md/raid5: ensure device failure recorded before
      write request returns.")  added code for this, but it was
      unnecessarily complicated.  We already had similar functionality for
      handling updates to the bad-block-list, thanks to Commit de393cde
      ("md: make it easier to wait for bad blocks to be acknowledged.")
      
      So revert most of the former commit, and instead avoid collecting
      completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
      the metadata and retry the stripe_head.
      As this change can leave a stripe_head ready for handling immediately
      after handle_active_stripes() returns, we change raid5_do_work() to
      pause when MD_CHANGE_PENDING is set, so that it doesn't spin.
      
      We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
      asynchronously.  After analyse_stripe(), we have collected stable data
      about the state of devices, which will be used to make decisions.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: use md_write_start to count stripes, not bios · 49728050
      NeilBrown authored
      We use md_write_start() to increase the count of pending writes, and
      md_write_end() to decrement the count.  We currently count bios
      submitted to md/raid5.  Change it to count the stripe_heads that a
      WRITE bio has been attached to.
      
      So now, raid5_make_request() calls md_write_start() and then
      md_write_end() to keep the count elevated during the setup of the
      request.
      
      add_stripe_bio() calls md_write_start() for each stripe_head, and the
      completion routines always call md_write_end(), instead of only
      calling it when raid5_dec_bi_active_stripes() returns 0.
      make_discard_request also calls md_write_start/end().
      
      The parallel between md_write_{start,end} and use of bi_phys_segments
      can be seen in that:
       Whenever we set bi_phys_segments to 1, we now call md_write_start().
       Whenever we increment it on non-read requests with
         raid5_inc_bi_active_stripes(), we now call md_write_start().
       Whenever we decrement bi_phys_segments on non-read requests with
         raid5_dec_bi_active_stripes(), we now call md_write_end().
      
      This reduces our dependence on keeping a per-bio count of active
      stripes in bi_phys_segments.
      
      md_write_inc() is added which parallels md_write_start(), but requires
      that a write has already been started, and is certain never to sleep.
      This can be used inside a spinlocked region when adding to a write
      request.
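
A toy model of the counting scheme described above may help. It is only a sketch: the class and function names are invented, and it assumes the per-stripe increments use the md_write_inc() variant because a write has already been started at that point.

```python
class WriteCounter:
    """User-space stand-in for the pending-writes count kept by md."""

    def __init__(self):
        self.writes_pending = 0

    def write_start(self):   # models md_write_start()
        self.writes_pending += 1

    def write_inc(self):     # models md_write_inc(): a write is already started
        assert self.writes_pending > 0
        self.writes_pending += 1

    def write_end(self):     # models md_write_end()
        self.writes_pending -= 1


def make_write_request(counter, n_stripes):
    """Model of request setup: returns the count left once setup is done."""
    counter.write_start()            # keep the count elevated during setup
    for _ in range(n_stripes):
        counter.write_inc()          # one count per stripe_head attached
    counter.write_end()              # setup finished
    return counter.writes_pending
```

After setup the count equals the number of stripe_heads attached, and it only drops back to zero once every stripe has completed and called the model's `write_end()`.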
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  2. 16 Mar, 2017 21 commits
    • md: move bitmap_destroy to the beginning of __md_stop · 48df498d
      Guoqing Jiang authored
      Since we have switched to handling the METADATA_UPDATED msg
      synchronously for md-cluster, process_metadata_update now depends
      on mddev->thread->wqueue.

      With the new change, clustered raid could possibly hang if the
      array received a METADATA_UPDATED msg after it unregistered
      mddev->thread, so we need to stop clustered raid (bitmap_destroy
      -> bitmap_free -> md_cluster_stop) earlier than unregistering the
      thread (mddev_detach -> md_unregister_thread).
      
      This change should be safe for non-clustered raid since
      all writes are stopped before the destroy. Also, in md_run,
      we activate the personality (pers->run()) before activating
      the bitmap (bitmap_create()). So it is pleasingly symmetric
      to stop the bitmap (bitmap_destroy()) before stopping the
      personality (__md_stop() calls pers->free()). We achieve this
      by moving bitmap_destroy to the beginning of __md_stop.

      But we don't want to break the code that waits for behind IO, as
      Shaohua mentioned, so introduce bitmap_wait_behind_writes to
      hold that code, and call the new function from both mddev_detach
      and bitmap_destroy. This keeps the original behind IO code
      working and also fits the new ordering.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: generate R5LOG_PAYLOAD_FLUSH · ea17481f
      Song Liu authored
      In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is appended to
      log->current_io.

      Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to the
      journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
      quiesce.

      R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload.
      However, the current implementation uses one stripe per
      R5LOG_PAYLOAD_FLUSH, which is simpler.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery · 2d4f4687
      Song Liu authored
      This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery.
      The next patch will add logic that generates R5LOG_PAYLOAD_FLUSH when
      a flush finishes.
      
      When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity
      will be dropped from recovery. This will reduce the number of stripes
      to replay, and thus accelerate the recovery process.
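
The effect of a flush payload on recovery can be sketched with a toy journal. The `(kind, stripe)` tuple format and the function name are invented for this illustration and do not reflect the on-disk layout.

```python
def stripes_to_replay(log_entries):
    """Scan a toy journal and return the stripes that still need replaying.

    Each entry is a (kind, stripe) tuple, where kind is "data", "parity"
    or "flush". This format is hypothetical.
    """
    pending = {}  # stripe -> payload kinds seen so far, in log order
    for kind, stripe in log_entries:
        if kind == "flush":
            # The stripe already reached the RAID disks: drop its pending
            # data and parity from recovery.
            pending.pop(stripe, None)
        else:
            pending.setdefault(stripe, []).append(kind)
    return pending


journal = [
    ("data", 0), ("parity", 0),
    ("data", 8), ("parity", 8),
    ("flush", 0),
    ("data", 16),
]
remaining = stripes_to_replay(journal)
```

Here stripe 0 is dropped by its flush entry, so only stripes 8 and 16 remain to be replayed.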
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5-ppl: runtime PPL enabling or disabling · ba903a3e
      Artur Paszkiewicz authored
      Allow writing to 'consistency_policy' attribute when the array is
      active. Add a new function 'change_consistency_policy' to the
      md_personality operations structure to handle the change in the
      personality code. Values "ppl" and "resync" are accepted and
      turn PPL on and off respectively.
      
      When enabling PPL its location and size should first be set using
      'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
      written at this location on each member device.
      
      Enabling or disabling PPL is performed under a suspended array.  The
      raid5_reset_stripe_cache function frees the stripe cache and allocates
      it again in order to allocate or free the ppl_pages for the stripes in
      the stripe cache.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5-ppl: support disk hot add/remove with PPL · 6358c239
      Artur Paszkiewicz authored
      Add a function to modify the log by removing an rdev when a drive fails
      or adding when a spare/replacement is activated as a raid member.
      
      Removing a disk just clears the child log rdev pointer. No new stripes
      will be accepted for this child log in ppl_write_stripe() and running io
      units will be processed without writing PPL to the device.
      
      Adding a disk sets the child log rdev pointer and writes an empty PPL
      header.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5-ppl: load and recover the log · 4536bf9b
      Artur Paszkiewicz authored
      Load the log from each disk when starting the array and recover if the
      array is dirty.
      
      The initial empty PPL is written by mdadm. When loading the log we
      verify the header checksum and signature. For external metadata arrays
      the signature is verified in userspace, so here we read it from the
      header, verifying only if it matches on all disks, and use it later when
      writing PPL.
      
      In addition to the header checksum, each header entry also contains a
      checksum of its partial parity data. If the header is valid, recovery is
      performed for each entry until an invalid entry is found. If the array
      is not degraded and recovery using PPL fully succeeds, there is no need
      to resync the array because data and parity will be consistent, so in
      this case resync will be disabled.
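
The "recover each entry until an invalid entry is found" rule can be sketched as follows. This is a simplified model: `zlib.crc32` stands in for whatever checksum the real PPL format uses, and the `(stored_checksum, partial_parity)` entry layout is invented for the example.

```python
import zlib


def recover_entries(header_ok, entries):
    """Return the payloads to recover, stopping at the first bad entry.

    `entries` is a list of (stored_checksum, partial_parity_bytes) pairs;
    this layout is hypothetical.
    """
    if not header_ok:
        return []  # invalid header: nothing in this log is used
    recovered = []
    for stored_checksum, partial_parity in entries:
        if zlib.crc32(partial_parity) != stored_checksum:
            break  # invalid entry found: stop recovery here
        recovered.append(partial_parity)
    return recovered


good_a, good_b, torn = b"pp-entry-a", b"pp-entry-b", b"pp-entry-c"
log = [
    (zlib.crc32(good_a), good_a),
    (zlib.crc32(good_b), good_b),
    (zlib.crc32(torn) ^ 1, torn),  # checksum mismatch simulates a torn write
    (zlib.crc32(good_a), good_a),  # never reached
]
```

With this log, only the first two entries are recovered; the entry after the corrupted one is ignored even though its own checksum is valid.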
      
      Due to compatibility with IMSM implementations on other systems, we
      can't assume that the recovery data block size is always 4K. Writes
      generated by MD raid5 don't have this issue, but when recovering PPL
      written in other environments it is possible to have entries with
      512-byte sector granularity. The recovery code takes this into account
      and also the logical sector size of the underlying drives.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: add sysfs entries for PPL · 664aed04
      Artur Paszkiewicz authored
      Add 'consistency_policy' attribute for array. It indicates how the array
      maintains consistency in case of unexpected shutdown.
      
      Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
      and size of the PPL space on the device. They can't be changed for
      active members if the array is started and PPL is enabled, so in the
      setter functions only basic checks are performed. More checks are done
      in ppl_validate_rdev() when starting the log.
      
      These attributes are writable to allow enabling PPL for external
      metadata arrays and (later) to enable/disable PPL for a running array.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5-ppl: Partial Parity Log write logging implementation · 3418d036
      Artur Paszkiewicz authored
      Implement the calculation of partial parity for a stripe and PPL write
      logging functionality. The description of PPL is added to the
      documentation. More details can be found in the comments in raid5-ppl.c.
      
      Attach a page for holding the partial parity data to stripe_head.
      Allocate it only if mddev has the MD_HAS_PPL flag set.
      
      Partial parity is the XOR of the unmodified data chunks of a stripe and
      is calculated as follows:

      - reconstruct-write case:
        XOR data from all disks in the stripe that are not updated

      - read-modify-write case:
        XOR old data and parity from all updated disks in the stripe
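
The two cases above yield the same partial parity, which a small worked example makes visible. This is a user-space sketch with a made-up four-disk stripe and tiny chunks; it does not use the async_tx API.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy stripe: four data chunks (old contents) and their parity.
old_data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]),
            bytes([9, 10, 11, 12]), bytes([13, 14, 15, 16])]
old_parity = xor_blocks(old_data)

updated = {0, 1}  # indices of the data disks being written

# reconstruct-write: XOR data from all disks that are NOT updated
pp_rcw = xor_blocks([d for i, d in enumerate(old_data) if i not in updated])

# read-modify-write: XOR old data of the updated disks with the old parity
pp_rmw = xor_blocks([old_data[i] for i in sorted(updated)] + [old_parity])
```

Both `pp_rcw` and `pp_rmw` equal the XOR of chunks 2 and 3, because XORing the old parity with the old data of the updated disks cancels those disks out of the parity.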
      
      Implement it using the async_tx API and integrate into raid_run_ops().
      It must be called when we still have access to old data, so do it when
      STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
      stored into sh->ppl_page.
      
      Partial parity is not meaningful for full stripe write and is not stored
      in the log or used for recovery, so don't attempt to calculate it when
      stripe has STRIPE_FULL_WRITE.
      
      Put the PPL metadata structures in md_p.h because userspace tools
      (mdadm) will also need to read/write PPL.

      For now, warn about using PPL when a disk's volatile write-back cache
      is enabled. The warning can be removed once disk cache flushing before
      writing PPL is implemented.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5: separate header for log functions · ff875738
      Artur Paszkiewicz authored
      Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
      wrappers for functions which will be shared with ppl and use them in
      raid5 core instead of direct calls to raid5-cache.
      
      Remove unused parameter from r5c_cache_data(), move two duplicated
      pr_debug() calls to r5l_init_log().
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: superblock changes for PPL · ea0213e0
      Artur Paszkiewicz authored
      Include information about PPL location and size into mdp_superblock_1
      and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
      put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
      'feature_map', analogous to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
      to mddev->flags to indicate that PPL is enabled on an array.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: improve recovery with read ahead page pool · effe6ee7
      Song Liu authored
      In r5cache recovery, the journal device is scanned page by page.
      Currently, we use sync_page_io() to read the journal device. This is
      not efficient when we have to recover many stripes from the journal.

      To improve the speed of recovery, this patch introduces a read ahead
      page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
      pages are read in one IO. Then the recovery code reads the journal from
      ra_pool.
      
      With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
      r5l_recovery_log() is refactored so r5l_recovery_ctx is not using
      stack space.
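
The read-ahead idea can be modelled in a few lines. This is only a sketch: the class name, the pool size of 16 pages, and the in-memory `bytes` object standing in for the journal device are all assumptions made for the example.

```python
PAGE_SIZE = 4096


class ReadAheadPool:
    """Toy read-ahead pool: fetch several consecutive pages per device read."""

    def __init__(self, device: bytes, pool_pages: int):
        self.device = device          # stands in for the journal device
        self.pool_pages = pool_pages  # pages fetched per IO (made-up size)
        self.pool_start = None        # index of the first page held in the pool
        self.pool = b""
        self.io_count = 0             # number of simulated device reads

    def read_page(self, page_index: int) -> bytes:
        in_pool = (self.pool_start is not None and
                   self.pool_start <= page_index <
                   self.pool_start + len(self.pool) // PAGE_SIZE)
        if not in_pool:
            # Miss: one IO brings in a run of consecutive pages.
            start = page_index * PAGE_SIZE
            self.pool = self.device[start:start + self.pool_pages * PAGE_SIZE]
            self.pool_start = page_index
            self.io_count += 1
        offset = (page_index - self.pool_start) * PAGE_SIZE
        return self.pool[offset:offset + PAGE_SIZE]


journal = bytes(range(256)) * 16 * 64  # 64 pages of dummy data
pool = ReadAheadPool(journal, pool_pages=16)
pages = [pool.read_page(i) for i in range(64)]
```

Scanning the 64 pages sequentially costs 4 simulated IOs here instead of 64, which is the saving the patch is after when many stripes have to be recovered.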
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: sort bios · aaf9f12e
      Shaohua Li authored
      A previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
      defers IO dispatching. The goal is to create a better IO pattern. At that
      time, we didn't sort the deferred IO and hoped the block layer could
      merge and sort it. Now the raid5-cache writeback can create a large
      number of bios. And if we enable multi-threading for stripe handling, we
      can't control when IO is dispatched to the raid disks. Much of the time,
      we are dispatching IO which the block layer can't merge effectively.

      This patch takes the IO dispatch deferral further. We accumulate
      bios, but we don't dispatch all the bios after a threshold is met. This
      'dispatch partial portion of bios' strategy allows bios arriving over a
      large time window to be sent to disks together. At dispatch time,
      there is a good chance the block layer can merge the bios. To make this
      more effective, we dispatch IO in ascending order. This increases the
      chance of request merging and reduces disk seeks.
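
A minimal sketch of the "accumulate, then dispatch part of the queue in ascending order" strategy follows. The threshold, the batch size, and all names are hypothetical; the commit message does not say how the real code picks them.

```python
class DeferredDispatcher:
    """Toy model: accumulate bios, dispatch only part of them, sorted."""

    def __init__(self, threshold, batch):
        self.threshold = threshold  # pending count that triggers a dispatch
        self.batch = batch          # how many bios to send per dispatch
        self.pending = []           # start sectors of accumulated bios
        self.dispatched = []        # order in which bios reached the disks

    def add(self, sector):
        self.pending.append(sector)
        if len(self.pending) >= self.threshold:
            self._dispatch_some()

    def _dispatch_some(self):
        # Sort so that IO goes out in ascending sector order, then send
        # only part of the queue; the rest keeps accumulating neighbours.
        self.pending.sort()
        self.dispatched.extend(self.pending[:self.batch])
        del self.pending[:self.batch]


d = DeferredDispatcher(threshold=6, batch=3)
for sector in [80, 16, 48, 8, 64, 32, 24, 40, 56]:
    d.add(sector)
```

Each dispatched batch is in ascending order, and the bios left behind stay queued so that later arrivals with nearby sectors can go out alongside them.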
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5-cache: bump flush stripe batch size · 84890c03
      Shaohua Li authored
      Bump the flush stripe batch size to 2048. For my 12-disk raid
      array, the stripes take:
      12 * 4k * 2048 = 96MB

      This is still quite small. A hardware raid card generally has a 1GB
      cache, and we suggest that raid5-cache have a similar cache size.

      The advantage of a big batch size is that we can dispatch a lot of IO
      at the same time, and then do some scheduling to create a better IO
      pattern.

      The last patch prioritizes stripes, so we don't need to worry that a
      big flush stripe batch will starve normal stripes.
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: prioritize stripes for writeback · 535ae4eb
      Shaohua Li authored
      In raid5-cache writeback mode, we have two types of stripes to handle.
      - stripes which aren't cached yet
      - stripes which are cached and flushing out to raid disks
      
      The upper layer is generally more sensitive to the latency of the first
      type of stripes. But we only have one handle list for all these stripes,
      where the two types of stripes are mixed together. When reclaim flushes
      a lot of stripes, the first type of stripes could be noticeably delayed.
      On the other hand, if the log space is tight, we'd like to handle the
      second type of stripes faster and free log space.

      This patch distinguishes the two types of stripes. They are added to
      different handle lists. When we try to get a stripe to handle, we prefer
      the first type of stripes unless log space is tight.

      This should have no impact for the !writeback case.
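
The selection rule can be sketched with two queues. The class and method names are invented, and "log space is tight" is reduced to a boolean passed in by the caller.

```python
from collections import deque


class HandleLists:
    """Toy model of two handle lists with a preference rule."""

    def __init__(self):
        self.uncached = deque()  # stripes not cached yet (latency sensitive)
        self.flushing = deque()  # cached stripes being flushed to RAID disks

    def add(self, stripe, is_flushing):
        (self.flushing if is_flushing else self.uncached).append(stripe)

    def next_stripe(self, log_space_tight):
        # Prefer uncached stripes, unless log space is tight, in which
        # case flushing stripes go first so that log space gets freed.
        order = ((self.flushing, self.uncached) if log_space_tight
                 else (self.uncached, self.flushing))
        for queue in order:
            if queue:
                return queue.popleft()
        return None


lists = HandleLists()
lists.add("flush-1", is_flushing=True)
lists.add("new-1", is_flushing=False)
lists.add("flush-2", is_flushing=True)
```

With log space available, `new-1` is picked ahead of the flushing stripes even though it was queued later; once log space is tight, the flushing stripes are served first.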
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: add the support for resize · 818da59f
      Guoqing Jiang authored
      To update the size of a cluster raid, we need to make
      sure all nodes can perform the change successfully.
      However, it is possible that some of them can't do
      it due to a failure (bitmap_resize could fail). So
      we need to consider this before we set the
      capacity unconditionally, and we use the steps below
      to perform a sanity check.

      1. A changes the size, then broadcasts a METADATA_UPDATED
         msg.
      2. B and C receive METADATA_UPDATED and change the size,
         except that they don't call set_capacity; sync_size is
         not updated if the change failed. They also call
         bitmap_update_sb to sync the sb to disk.
      3. A checks the other nodes' sync_size. If sync_size has
         been updated on all nodes, it sends a CHANGE_CAPACITY
         msg; otherwise it sends a msg to revert the previous change.
      4. B and C call set_capacity if they receive the CHANGE_CAPACITY
         msg; otherwise pers->resize will be called to restore
         the old value.
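
The four steps can be simulated in user space to see how one failing node forces a rollback. Everything here is a simplified model: the `Node` class, its method names, and the starting size of 100 are invented, and message passing is reduced to direct method calls.

```python
class Node:
    def __init__(self, name, resize_fails=False):
        self.name = name
        self.resize_fails = resize_fails  # simulate bitmap_resize failing
        self.size = 100       # size known to the personality
        self.sync_size = 100  # size recorded in this node's bitmap
        self.capacity = 100   # size exposed to users (set_capacity)

    def on_metadata_updated(self, new_size):
        # Step 2: change the size but do not set the capacity yet;
        # sync_size stays unchanged if the change failed.
        if not self.resize_fails:
            self.size = new_size
            self.sync_size = new_size

    def on_change_capacity(self):
        self.capacity = self.size  # step 4, success path

    def on_revert(self, old_size):
        self.size = old_size       # step 4, restore the old value
        self.sync_size = old_size


def cluster_resize(initiator, others, new_size):
    old_size = initiator.size
    nodes = [initiator] + others
    for node in nodes:                                 # steps 1 and 2
        node.on_metadata_updated(new_size)
    ok = all(n.sync_size == new_size for n in nodes)   # step 3
    for node in nodes:                                 # step 4
        if ok:
            node.on_change_capacity()
        else:
            node.on_revert(old_size)
    return ok
```

For example, `cluster_resize(Node("A"), [Node("B"), Node("C")], 200)` succeeds and every node ends with capacity 200, whereas making B fail its resize leaves all three nodes at the old size and capacity.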
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: introduce cluster_check_sync_size · b98938d1
      Guoqing Jiang authored
      Supporting resize is a little complex for clustered
      raid, since we need to ensure all the nodes share
      the same knowledge about the size of the raid.

      We achieve this by checking the sync_size which
      is in each node's bitmap; we can only change the
      capacity after cluster_check_sync_size returns 0.

      Also, get_bitmap_from_slot is added to get a slot's
      bitmap. And we export some funcs since they are
      used in cluster_check_sync_size().

      We can also reuse get_bitmap_from_slot to remove
      redundant code in bitmap_copy_from_slot.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: add CHANGE_CAPACITY message type · 7da3d203
      Guoqing Jiang authored
      The msg type CHANGE_CAPACITY is introduced to support
      resizing clustered raid in a later patch. It is sent
      after all the nodes have the same sync_size; a receiving
      node just needs to set the new capacity once it has
      received this msg.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang authored
      Previously, when a node received a METADATA_UPDATED msg, it just
      needed to wake up mddev->thread, and md_reload_sb would be called
      eventually.

      We took the asynchronous way to avoid a deadlock issue. The
      deadlock could happen when one node is receiving the
      METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
      the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, we
      need the metadata update handling to be synchronous so that
      the initiating node can detect failure, so we need to change
      the way the METADATA_UPDATED msg is handled.

      But we obviously need to avoid the above deadlock with the
      sync way. To make this happen, we decided not to hold
      reconfig_mutex when calling md_reload_sb in one case: if some
      other thread has already taken reconfig_mutex and is waiting for
      the 'token', then process_recvd_msg() can safely call
      md_reload_sb() without taking the mutex. This is because we can
      be certain that no other thread will take the mutex, and we are
      also certain that the actions performed by md_reload_sb() won't
      interfere with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      Which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex() held (the specific case is the
      resync_info_update which is distinguished well in previous
      patch), if process_recvd_msg() finds that the new bit is set,
      then the mutex must be held by some other thread, and it will
      keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these apply.
      We need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue. So we can get lock_token() to call
      wake_up() on that when it sets the bit.
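
The three-way decision described above can be written down as a small pure function. This is a sketch of the decision only, with made-up return labels; it does not model the locking or the wait queue itself.

```python
def metadata_update_action(trylock_succeeded, holding_mutex_for_recvd):
    """Decide what the receive path does with a METADATA_UPDATED message.

    The return values are hypothetical labels, not kernel identifiers.
    """
    if trylock_succeeded:
        return "reload_sb_with_mutex"
    if holding_mutex_for_recvd:
        # Another thread holds reconfig_mutex and is waiting for the token,
        # so reloading without the mutex cannot interfere with it.
        return "reload_sb_without_mutex"
    # Neither applies: wait until a wake_up() on mddev->thread->wqueue,
    # then evaluate both conditions again.
    return "wait_for_wakeup"
```

The waiting branch is the one that relies on mddev_unlock() and lock_token() both issuing a wake_up(), since either event can change the outcome of the next evaluation.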
      
      There are also some related changes inside this commit:
      1. remove RELOAD_SB related code since it is not valid anymore.
      2. mddev is added into md_cluster_info so that we can get mddev inside
         lock_token.
      3. add a new parameter for lock_token to distinguish whether
         reconfig_mutex is held or not.

      And we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the places below:
      1. set it before unregistering the thread, otherwise a deadlock could
         appear when stopping a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
         also needs to be set before unregistering the thread.
      2. set it in metadata_update_start to fix another deadlock.
      	a. Node A sends METADATA_UPDATED msg (held Token lock).
      	b. Node B wants to do resync, and is blocked since it can't
      	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
      	   not set since the callchain
      	   (md_do_sync -> sync_request
              	       -> resync_info_update
      		       -> sendmsg
      		       -> lock_comm -> lock_token)
      	   doesn't hold reconfig_mutex.
      	c. Node B tries to update sb (held reconfig_mutex), but stopped
      	   at wait_event() in metadata_update_start since we have set
      	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
      	d. Then Node B receives METADATA_UPDATED msg from A, of course
      	   recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex,
         we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token doesn't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • Merge tag 'xfs-4.11-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · d11507e1
      Linus Torvalds authored
      Pull xfs fix from Darrick Wong:
       "Here's a single fix for -rc3 to improve input validation on inline
        directory data to prevent buffer overruns due to corrupt metadata"
      
      * tag 'xfs-4.11-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: verify inline directory data forks
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 325513d9
      Linus Torvalds authored
      Pull arm64 fixes/cleanups from Catalin Marinas:
       "In Will's absence I'm sending the arm64 fixes he queued for 4.11-rc3:
      
         - fix arm64 kernel boot warning when DEBUG_VIRTUAL and KASAN are
           enabled
      
         - enable KEYS_COMPAT for keyctl compat support
      
         - use cpus_have_const_cap() for system_uses_ttbr0_pan() (slight
           performance improvement)
      
         - update kerneldoc for cpu_suspend() rename
      
         - remove the arm64-specific kprobe_exceptions_notify (weak generic
           variant defined)"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: kernel: Update kerneldoc for cpu_suspend() rename
        arm64: use const cap for system_uses_ttbr0_pan()
        arm64: support keyctl() system call in 32-bit mode
        arm64: kasan: avoid bad virt_to_pfn()
        arm64: kprobes: remove kprobe_exceptions_notify
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md · 3009b303
      Linus Torvalds authored
      Pull MD fixes from Shaohua Li:
      
       - fix a parity calculation bug of raid5 cache by Song
      
       - fix a potential deadlock issue by me
      
       - fix two endian issues by Jason
      
       - fix a disk limitation issue by Neil
      
       - other small fixes and cleanup
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
        md/raid1: fix a trivial typo in comments
        md/r5cache: fix set_syndrome_sources() for data in cache
        md: fix incorrect use of lexx_to_cpu in does_sb_need_changing
        md: fix super_offset endianness in super_1_rdev_size_change
        md/raid1/10: fix potential deadlock
        md: don't impose the MD_SB_DISKS limit on arrays without metadata.
        md: move funcs from pers->resize to update_size
        md-cluster: remove useless memset from gather_all_resync_info
        md-cluster: free md_cluster_info if node leave cluster
        md: delete dead code
        md/raid10: submit bio directly to replacement disk
  3. 15 Mar, 2017 7 commits
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 69eea5a4
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Four small fixes for this cycle:
      
         - followup fix from Neil for a fix that went in before -rc2, ensuring
           that we always see the full per-task bio_list.
      
         - fix for blk-mq-sched from me that ensures that we retain similar
           direct-to-issue behavior on running the queue.
      
         - fix from Sagi fixing a potential NULL pointer dereference in blk-mq
           on spurious CPU unplug.
      
         - a memory leak fix in writeback from Tahsin, fixing a case where
           device removal of a mounted device can leak a struct
           wb_writeback_work"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
        writeback: fix memory leak in wb_queue_work()
        blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
        blk: Ensure users for current->bio_list can see the full list.
      69eea5a4
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 95422dec
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "This is a rather large set of fixes. The bulk are for lpfc correcting
        a lot of issues in the new NVME driver code which just went in in the
        merge window.
      
        The others are:
      
         - fix a hang in the vmware paravirt driver caused by incorrect
           handling of the new MSI vector allocation
      
         - long standing bug in storvsc, which recent block changes turned
           from being a harmless annoyance into a hang
      
         - yet more fallout (in mpt3sas) from the changes to device blocking
      
        The remainder are small fixes and updates"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (34 commits)
        scsi: lpfc: Add shutdown method for kexec
        scsi: storvsc: Workaround for virtual DVD SCSI version
        scsi: lpfc: revise version number to 11.2.0.10
        scsi: lpfc: code cleanups in NVME initiator discovery
        scsi: lpfc: code cleanups in NVME initiator base
        scsi: lpfc: correct rdp diag portnames
        scsi: lpfc: remove dead sli3 nvme code
        scsi: lpfc: correct double print
        scsi: lpfc: Rename LPFC_MAX_EQ_DELAY to LPFC_MAX_EQ_DELAY_EQID_CNT
        scsi: lpfc: Rework lpfc Kconfig for NVME options
        scsi: lpfc: add transport eh_timed_out reference
        scsi: lpfc: Fix eh_deadline setting for sli3 adapters.
        scsi: lpfc: add NVME exchange aborts
        scsi: lpfc: Fix nvme allocation bug on failed nvme_fc_register_localport
        scsi: lpfc: Fix IO submission if WQ is full
        scsi: lpfc: Fix NVME CMD IU byte swapped word 1 problem
        scsi: lpfc: Fix RCTL value on NVME LS request and response
        scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters
        scsi: lpfc: fix missing spin_unlock on sql_list_lock
        scsi: lpfc: don't dereference dma_buf->iocbq before null check
        ...
      95422dec
    • Linus Torvalds's avatar
      Merge tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · aabcf5fc
      Linus Torvalds authored
      Pull gfs2 fix from Bob Peterson:
       "This is an emergency patch for 4.11-rc3
      
        The GFS2 developers uncovered a really nasty problem that can lead to
        random corruption and kernel panic, much like the last one. Andreas
        Gruenbacher wrote a simple one-line patch to fix the problem."
      
      * tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: Avoid alignment hole in struct lm_lockname
      aabcf5fc
    • Linus Torvalds's avatar
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · defc7d75
      Linus Torvalds authored
      Pull crypto fixes from Herbert Xu:
      
       - self-test failure of crc32c on powerpc
      
       - regressions of ecb(aes) when used with xts/lrw in s5p-sss
      
       - a number of bugs in the omap RNG driver
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: s5p-sss - Fix spinlock recursion on LRW(AES)
        hwrng: omap - Do not access INTMASK_REG on EIP76
        hwrng: omap - use devm_clk_get() instead of of_clk_get()
        hwrng: omap - write registers after enabling the clock
        crypto: s5p-sss - Fix completing crypto request in IRQ handler
        crypto: powerpc - Fix initialisation of crc32c context
      defc7d75
    • Andreas Gruenbacher's avatar
      gfs2: Avoid alignment hole in struct lm_lockname · 28ea06c4
      Andreas Gruenbacher authored
      Commit 88ffbf3e switches to using rhashtables for glocks, hashing over
      the entire struct lm_lockname instead of its individual fields.  On some
      architectures, struct lm_lockname contains a hole of uninitialized
      memory due to alignment rules, which now leads to incorrect hash values.
      Get rid of that hole.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      CC: <stable@vger.kernel.org> #v4.3+
      28ea06c4
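The hashing problem described in the gfs2 commit above can be illustrated with a small userspace sketch. It is not the actual gfs2 patch: `struct demo_key`, `DEMO_KEY_LEN`, `demo_hash()` and `demo_key_hash()` are hypothetical names, and FNV-1a merely stands in for whatever hash the kernel uses. The idea shown is to order the members so that no padding sits between them, and to hash only up to the end of the last member so that tail padding is never read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key type, not the real struct lm_lockname. The members are
 * ordered by decreasing alignment so no padding sits between them. */
struct demo_key {
	uint64_t number;
	void *owner;
	unsigned int type;
};

/* Hash only up to the end of the last member, skipping any tail padding. */
#define DEMO_KEY_LEN (offsetof(struct demo_key, type) + sizeof(unsigned int))

/* Fail the build if this ABI would still leave a hole inside the key. */
static_assert(offsetof(struct demo_key, owner) == sizeof(uint64_t),
	      "hole before owner");
static_assert(offsetof(struct demo_key, type) ==
	      sizeof(uint64_t) + sizeof(void *), "hole before type");

/* FNV-1a over raw bytes; a stand-in for the real hash function. */
static uint32_t demo_hash(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Pre-fill the object with junk to mimic uninitialized memory, then set
 * the members one by one and hash the hole-free prefix. */
static uint32_t demo_key_hash(unsigned char junk, uint64_t number,
			      void *owner, unsigned int type)
{
	struct demo_key k;

	memset(&k, junk, sizeof(k));
	k.number = number;
	k.owner = owner;
	k.type = type;
	return demo_hash(&k, DEMO_KEY_LEN);
}
```

With this layout, two keys holding the same member values hash identically regardless of what the memory contained beforehand. Hashing `sizeof(struct demo_key)` bytes instead would also cover the tail padding on ABIs that have it.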
    • Darrick J. Wong's avatar
      xfs: verify inline directory data forks · 630a04e7
      Darrick J. Wong authored
      When we're reading or writing the data fork of an inline directory,
      check the contents to make sure we're not overflowing buffers or eating
      garbage data.  xfs/348 corrupts an inline symlink into an inline
      directory, triggering a buffer overflow bug.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      ---
      v2: add more checks consistent with _dir2_sf_check and make the verifier
      usable from anywhere.
      630a04e7
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ae50dfd6
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
          from Steffen Klassert.
      
       2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
          Alexei Starovoitov.
      
       3) Fix detection of VLAN filtering feature for bnx2x VF devices, from
          Michal Schmidt.
      
       4) We can get a divide by zero when a TCP socket is morphed into
          listening state, fix from Eric Dumazet.
      
       5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
          skb_complete_tx_timestamp(). From Eric Dumazet.
      
       6) Use after free in dccp_feat_activate_values(), also from Eric
          Dumazet.
      
       7) Like bonding, team needs to use ETH_MAX_MTU as netdev->max_mtu, from
          Jarod Wilson.
      
       8) Fix use after free in vrf_xmit(), from David Ahern.
      
       9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
          Alexey Kodanev.
      
      10) Properly check napi_complete_done() return value in order to decide
          whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
          Lendacky.
      
      11) Fix double free of hwmon device in marvell phy driver, from Andrew
          Lunn.
      
      12) Don't crash on malformed netlink attributes in act_connmark, from
          Etienne Noss.
      
      13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
          from Sabrina Dubroca.
      
      14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
          Florian Westphal.
      
      15) Fix routing redirect races in dccp and tcp, basically the ICMP
          handler can't modify the socket's cached route if it's locked by the
          user at this moment. From Jon Maxwell.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
        qed: Enable iSCSI Out-of-Order
        qed: Correct out-of-bound access in OOO history
        qed: Fix interrupt flags on Rx LL2
        qed: Free previous connections when releasing iSCSI
        qed: Fix mapping leak on LL2 rx flow
        qed: Prevent creation of too-big u32-chains
        qed: Align CIDs according to DORQ requirement
        mlxsw: reg: Fix SPVMLR max record count
        mlxsw: reg: Fix SPVM max record count
        net: Resend IGMP memberships upon peer notification.
        dccp: fix memory leak during tear-down of unsuccessful connection request
        tun: fix premature POLLOUT notification on tun devices
        dccp/tcp: fix routing redirect race
        ucc/hdlc: fix two little issue
        vxlan: fix ovs support
        net: use net->count to check whether a netns is alive or not
        bridge: drop netfilter fake rtable unconditionally
        ipv6: avoid write to a possibly cloned skb
        net: wimax/i2400m: fix NULL-deref at probe
        isdn/gigaset: fix NULL-deref at probe
        ...
      ae50dfd6
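Item 10 in the networking pull above concerns using the return value of napi_complete_done() to decide whether to re-enable interrupts. A minimal userspace sketch of that pattern follows. Everything prefixed `demo_` is a hypothetical mock, not kernel or amd-xgbe code; the mock simply returns false when polling should continue, which is the case in which the driver should leave its IRQ disabled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-queue NAPI state. */
struct demo_napi {
	bool keep_polling;	/* the core wants to keep polling this queue */
	bool irq_enabled;	/* device interrupt currently unmasked */
	int completed;		/* how many times polling was completed */
};

/* Mock of the completion call: returns false when polling must continue,
 * true when the poll cycle really ended. */
static bool demo_napi_complete_done(struct demo_napi *n, int work_done)
{
	(void)work_done;
	if (n->keep_polling)
		return false;
	n->completed++;
	return true;
}

/* Poll handler sketch: re-enable the interrupt only when the budget was
 * not exhausted AND completion actually succeeded. */
static int demo_poll(struct demo_napi *n, int work_done, int budget)
{
	if (work_done < budget && demo_napi_complete_done(n, work_done))
		n->irq_enabled = true;
	return work_done;
}
```

Ignoring the return value and unmasking the interrupt unconditionally would let the device raise IRQs while the queue is still being polled.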
  4. 14 Mar, 2017 5 commits
    • Linus Torvalds's avatar
      Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 352526f4
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Three cgroup fixes.  Nothing critical:
      
         - the pids controller could trigger suspicious RCU warning
           spuriously. Fixed.
      
         - in the debug controller, %p -> %pK to protect kernel pointer
           from getting exposed.
      
         - documentation formatting fix"
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroups: censor kernel pointer in debug files
        cgroup/pids: remove spurious suspicious RCU usage warning
        cgroup: Fix indenting in PID controller documentation
      352526f4
    • Linus Torvalds's avatar
      Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · 6517569d
      Linus Torvalds authored
      Pull libata fixes from Tejun Heo:
       "Three libata fixes:
      
         - fix for a circular reference bug in sysfs code which prevented
           pata_legacy devices from being released after probe failure, which
           in turn prevented devres from releasing the associated resources.
      
         - drop spurious WARN in the command issue path which can be triggered
           by a legitimate passthrough command.
      
         - an ahci_qoriq specific fix"
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
        ahci: qoriq: correct the sata ecc setting error
        libata: drop WARN from protocol error in ata_sff_qc_issue()
        libata: transport: Remove circular dependency at free time
      6517569d
    • Linus Torvalds's avatar
      Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · bc258879
      Linus Torvalds authored
      Pull workqueue fix from Tejun Heo:
       "If a delayed work is queued with NULL @wq, workqueue code explodes
        after the timer expires at which point it's difficult to tell who the
        culprit was.
      
        This actually happened and the offender was net/smc this time.
      
        Add an explicit sanity check for it in the queueing path"
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
      bc258879
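The sanity check described in the workqueue pull above can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel patch: `struct demo_wq`, `struct demo_delayed_work` and `demo_queue_delayed_work()` are hypothetical, and rejecting the request after warning is a choice made for this sketch only. The point is that checking for a NULL queue at queueing time reports the problem while the caller is still on the stack, rather than later when the timer fires.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for a workqueue and a delayed work item. */
struct demo_wq {
	const char *name;
	int pending;
};

struct demo_delayed_work {
	struct demo_wq *wq;
	unsigned long delay;
};

/* Queue dwork on wq after delay ticks. A NULL wq is reported right here,
 * in the caller's context, instead of surfacing at timer expiry. */
static bool demo_queue_delayed_work(struct demo_wq *wq,
				    struct demo_delayed_work *dwork,
				    unsigned long delay)
{
	if (wq == NULL) {
		fprintf(stderr, "WARN: delayed work queued with NULL wq\n");
		return false;	/* sketch-only choice: refuse the request */
	}
	dwork->wq = wq;
	dwork->delay = delay;
	wq->pending++;
	return true;
}
```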
    • Linus Torvalds's avatar
      Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · 83e63226
      Linus Torvalds authored
      Pull percpu fixes from Tejun Heo:
      
       - the allocation path was updating pcpu_nr_empty_pop_pages without the
         required locking which can lead to incorrect handling of empty chunks
         (e.g. keeping too many around), which is buggy but shouldn't lead to
         critical failures. Fixed by adding the locking
      
       - a trivial patch to drop an unused param from pcpu_get_pages()
      
      * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
        percpu: remove unused chunk_alloc parameter from pcpu_get_pages()
        percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages
      83e63226
    • David S. Miller's avatar
      Merge branch 'qed-fixes' · 1e6a1cd8
      David S. Miller authored
      Yuval Mintz says:
      
      ====================
      qed: Fixes series
      
      This addresses several different issues in qed.
      The more significant portions:
      
      Patch #1 would cause timeout when qedr utilizes the highest
      CIDs available for it [or when future qede adapters would utilize
      queues in some constellations].
      
      Patch #4 fixes a leak of mapped addresses; When iommu is enabled,
      offloaded storage protocols might eventually run out of resources
      and fail to map additional buffers.
      
      Patches #6,#7 were missing in the initial iSCSI infrastructure
      submissions, and would hamper qedi's stability when it reaches
      out-of-order scenarios.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e6a1cd8