1. 23 Feb, 2017 1 commit
  2. 20 Feb, 2017 3 commits
    • Shaohua Li's avatar
      md/raid1: fix a use-after-free bug · af5f42a7
      Shaohua Li authored
      Commit fd76863e (RAID1: a new I/O barrier implementation to remove resync
      window) introduces a user-after-free bug.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      af5f42a7
    • colyli@suse.de's avatar
      RAID1: avoid unnecessary spin locks in I/O barrier code · 824e47da
      colyli@suse.de authored
      When I run a parallel reading performan testing on a md raid1 device with
      two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
      block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
      only 2.7GB/s, this is around 50% of the idea performance number.
      
      The perf reports locking contention happens at allow_barrier() and
      wait_barrier() code,
       - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
         - _raw_spin_lock_irqsave
               + 89.92% allow_barrier
               + 9.34% __wake_up
       - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
         - _raw_spin_lock_irq
               - 100.00% wait_barrier
      
      The reason is, in these I/O barrier related functions,
       - raise_barrier()
       - lower_barrier()
       - wait_barrier()
       - allow_barrier()
      They always hold conf->resync_lock firstly, even there are only regular
      reading I/Os and no resync I/O at all. This is a huge performance penalty.
      
      The solution is a lockless-like algorithm in I/O barrier code, and only
      holding conf->resync_lock when it has to.
      
      The original idea is from Hannes Reinecke, and Neil Brown provides
      comments to improve it. I continue to work on it, and make the patch into
      current form.
      
      In the new simpler raid1 I/O barrier implementation, there are two
      wait barrier functions,
       - wait_barrier()
         Which calls _wait_barrier(), is used for regular write I/O. If there is
         resync I/O happening on the same I/O barrier bucket, or the whole
         array is frozen, task will wait until no barrier on same barrier bucket,
         or the whold array is unfreezed.
       - wait_read_barrier()
         Since regular read I/O won't interfere with resync I/O (read_balance()
         will make sure only uptodate data will be read out), it is unnecessary
         to wait for barrier in regular read I/Os, waiting in only necessary
         when the whole array is frozen.
      
      The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
      barrier[idx] are very carefully designed in raise_barrier(),
      lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
      avoid unnecessary spin locks in these functions. Once conf->
      nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
      has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
      raised in same barrier bucket index and array is not frozen, the regular
      I/O doesn't need to hold conf->resync_lock, it can just increase
      conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
      very similar to _wait_barrier(), the only difference is it only waits when
      array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
      code almostly gets rid of all spin lock cost.
      
      This patch significantly improves raid1 reading peroformance. From my
      testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
      blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
      increases from 2.7GB/s to 4.6GB/s (+70%).
      
      Changelog
      V4:
      - Change conf->nr_queued[] to atomic_t.
      - Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
      V3:
      - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
      - Change conf->nr_queued[] from atomic_t to int.
      - Change conf->array_frozen from atomic_t back to int, and use
        READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
        in _wait_barrier() and wait_read_barrier().
      - In _wait_barrier() and wait_read_barrier(), add a call to
        wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
        to fix a deadlock between  _wait_barrier()/wait_read_barrier and
        freeze_array().
      V2:
      - Remove a spin_lock/unlock pair in raid1d().
      - Add more code comments to explain why there is no racy when checking two
        atomic_t variables at same time.
      V1:
      - Original RFC patch for comments.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: default avatarNeil Brown <neilb@suse.de>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      824e47da
    • colyli@suse.de's avatar
      RAID1: a new I/O barrier implementation to remove resync window · fd76863e
      colyli@suse.de authored
      'Commit 79ef3a8a ("raid1: Rewrite the implementation of iobarrier.")'
      introduces a sliding resync window for raid1 I/O barrier, this idea limits
      I/O barriers to happen only inside a slidingresync window, for regular
      I/Os out of this resync window they don't need to wait for barrier any
      more. On large raid1 device, it helps a lot to improve parallel writing
      I/O throughput when there are background resync I/Os performing at
      same time.
      
      The idea of sliding resync widow is awesome, but code complexity is a
      challenge. Sliding resync window requires several variables to work
      collectively, this is complexed and very hard to make it work correctly.
      Just grep "Fixes: 79ef3a8a" in kernel git log, there are 8 more patches
      to fix the original resync window patch. This is not the end, any further
      related modification may easily introduce more regreassion.
      
      Therefore I decide to implement a much simpler raid1 I/O barrier, by
      removing resync window code, I believe life will be much easier.
      
      The brief idea of the simpler barrier is,
       - Do not maintain a global unique resync window
       - Use multiple hash buckets to reduce I/O barrier conflicts, regular
         I/O only has to wait for a resync I/O when both them have same barrier
         bucket index, vice versa.
       - I/O barrier can be reduced to an acceptable number if there are enough
         barrier buckets
      
      Here I explain how the barrier buckets are designed,
       - BARRIER_UNIT_SECTOR_SIZE
         The whole LBA address space of a raid1 device is divided into multiple
         barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
         Bio requests won't go across border of barrier unit size, that means
         maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
         For random I/O 64MB is large enough for both read and write requests,
         for sequential I/O considering underlying block layer may merge them
         into larger requests, 64MB is still good enough.
         Neil also points out that for resync operation, "we want the resync to
         move from region to region fairly quickly so that the slowness caused
         by having to synchronize with the resync is averaged out over a fairly
         small time frame". For full speed resync, 64MB should take less then 1
         second. When resync is competing with other I/O, it could take up a few
         minutes. Therefore 64MB size is fairly good range for resync.
      
       - BARRIER_BUCKETS_NR
         There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
              #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
              #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
         this patch makes the bellowed members of struct r1conf from integer
         to array of integers,
              -       int                     nr_pending;
              -       int                     nr_waiting;
              -       int                     nr_queued;
              -       int                     barrier;
              +       int                     *nr_pending;
              +       int                     *nr_waiting;
              +       int                     *nr_queued;
              +       int                     *barrier;
         number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
         kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
         barrier buckets, and each array of integers occupies single memory page.
         1024 means for a request which is smaller than the I/O barrier unit size
         has ~0.1% chance to wait for resync to pause, which is quite a small
         enough fraction. Also requesting single memory page is more friendly to
         kernel page allocator than larger memory size.
      
       - I/O barrier bucket is indexed by bio start sector
         If multiple I/O requests hit different I/O barrier units, they only need
         to compete I/O barrier with other I/Os which hit the same I/O barrier
         bucket index with each other. The index of a barrier bucket which a
         bio should look for is calculated by sector_to_idx() which is defined
         in raid1.h as an inline function,
              static inline int sector_to_idx(sector_t sector)
              {
                      return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                                      BARRIER_BUCKETS_NR_BITS);
              }
         Here sector_nr is the start sector number of a bio.
      
       - Single bio won't go across boundary of a I/O barrier unit
         If a request goes across boundary of barrier unit, it will be split. A
         bio may be split in raid1_make_request() or raid1_sync_request(), if
         sectors returned by align_to_barrier_unit_end() is smaller than
         original bio size.
      
      Comparing to single sliding resync window,
       - Currently resync I/O grows linearly, therefore regular and resync I/O
         will conflict within a single barrier units. So the I/O behavior is
         similar to single sliding resync window.
       - But a barrier unit bucket is shared by all barrier units with identical
         barrier uinit index, the probability of conflict might be higher
         than single sliding resync window, in condition that writing I/Os
         always hit barrier units which have identical barrier bucket indexs with
         the resync I/Os. This is a very rare condition in real I/O work loads,
         I cannot imagine how it could happen in practice.
       - Therefore we can achieve a good enough low conflict rate with much
         simpler barrier algorithm and implementation.
      
      There are two changes should be noticed,
       - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
         single loop, it looks like this,
              spin_lock_irqsave(&conf->device_lock, flags);
              conf->nr_queued[idx]--;
              spin_unlock_irqrestore(&conf->device_lock, flags);
         This change generates more spin lock operations, but in next patch of
         this patch set, it will be replaced by a single line code,
              atomic_dec(&conf->nr_queueud[idx]);
         So we don't need to worry about spin lock cost here.
       - Mainline raid1 code split original raid1_make_request() into
         raid1_read_request() and raid1_write_request(). If the original bio
         goes across an I/O barrier unit size, this bio will be split before
         calling raid1_read_request() or raid1_write_request(),  this change
         the code logic more simple and clear.
       - In this patch wait_barrier() is moved from raid1_make_request() to
         raid1_write_request(). In raid_read_request(), original wait_barrier()
         is replaced by raid1_read_request().
         The differnece is wait_read_barrier() only waits if array is frozen,
         using different barrier function in different code path makes the code
         more clean and easy to read.
      Changelog
      V4:
      - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
      - Fix many typos in patch comments.
      - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
      V3:
      - Rebase the patch against latest upstream kernel code.
      - Many fixes by review comments from Neil,
        - Back to use pointers to replace arraries in struct r1conf
        - Remove total_barriers from struct r1conf
        - Add more patch comments to explain how/why the values of
          BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
        - Use get_unqueued_pending() to replace get_all_pendings() and
          get_all_queued()
        - Increase bucket number from 512 to 1024
      - Change code comments format by review from Shaohua.
      V2:
      - Use bio_split() to split the orignal bio if it goes across barrier unit
        bounday, to make the code more simple, by suggestion from Shaohua and
        Neil.
      - Use hash_long() to replace original linear hash, to avoid a possible
        confilict between resync I/O and sequential write I/O, by suggestion from
        Shaohua.
      - Add conf->total_barriers to record barrier depth, which is used to
        control number of parallel sync I/O barriers, by suggestion from Shaohua.
      - In V1 patch the bellowed barrier buckets related members in r1conf are
        allocated in memory page. To make the code more simple, V2 patch moves
        the memory space into struct r1conf, like this,
              -       int                     nr_pending;
              -       int                     nr_waiting;
              -       int                     nr_queued;
              -       int                     barrier;
              +       int                     nr_pending[BARRIER_BUCKETS_NR];
              +       int                     nr_waiting[BARRIER_BUCKETS_NR];
              +       int                     nr_queued[BARRIER_BUCKETS_NR];
              +       int                     barrier[BARRIER_BUCKETS_NR];
        This change is by the suggestion from Shaohua.
      - Remove some inrelavent code comments, by suggestion from Guoqing.
      - Add a missing wait_barrier() before jumping to retry_write, in
        raid1_make_write_request().
      V1:
      - Original RFC patch for comments
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: default avatarNeil Brown <neilb@suse.de>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      fd76863e
  3. 16 Feb, 2017 1 commit
  4. 15 Feb, 2017 5 commits
  5. 14 Feb, 2017 1 commit
  6. 13 Feb, 2017 10 commits
    • Shaohua Li's avatar
      md/raid5-cache: exclude reclaiming stripes in reclaim check · e33fbb9c
      Shaohua Li authored
      stripes which are being reclaimed are still accounted into cached
      stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
      stripes and does unnecessary stripe reclaim. In practice, I saw one
      stripe is reclaimed one time. This will cause bad IO pattern. Fixing
      this by excluding the reclaing stripes in the check.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      e33fbb9c
    • Shaohua Li's avatar
      md/raid5-cache: stripe reclaim only counts valid stripes · e8fd52ee
      Shaohua Li authored
      When log space is tight, we try to reclaim stripes from log head. There
      are stripes which can't be reclaimed right now if some conditions are
      met. We skip such stripes but accidentally count them, which might cause
      no stripes are claimed. Fixing this by only counting valid stripes.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      e8fd52ee
    • Shaohua Li's avatar
      MD: add doc for raid5-cache · 5a6265f9
      Shaohua Li authored
      I'm starting document of the raid5-cache feature. Please note this is a
      kernel doc instead of a mdadm manual, so I don't add the details about
      how to use the feature in mdadm side.
      
      Cc: NeilBrown <neilb@suse.com>
      Reviewed-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      5a6265f9
    • Shaohua Li's avatar
      1601c590
    • NeilBrown's avatar
      md: ensure md devices are freed before module is unloaded. · 9356863c
      NeilBrown authored
      Commit: cbd19983 ("md: Fix unfortunate interaction with evms")
      change mddev_put() so that it would not destroy an md device while
      ->ctime was non-zero.
      
      Unfortunately, we didn't make sure to clear ->ctime when unloading
      the module, so it is possible for an md device to remain after
      module unload.  An attempt to open such a device will trigger
      an invalid memory reference in:
        get_gendisk -> kobj_lookup -> exact_lock -> get_disk
      
      when tring to access disk->fops, which was in the module that has
      been removed.
      
      So ensure we clear ->ctime in md_exit(), and explain how that is useful,
      as it isn't immediately obvious when looking at the code.
      
      Fixes: cbd19983 ("md: Fix unfortunate interaction with evms")
      Tested-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      9356863c
    • Song Liu's avatar
      md/r5cache: improve journal device efficiency · 39b99586
      Song Liu authored
      It is important to be able to flush all stripes in raid5-cache.
      Therefore, we need reserve some space on the journal device for
      these flushes. If flush operation includes pending writes to the
      stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
      for the flush out. This reduces the efficiency of journal space.
      If we exclude these pending writes from flush operation, we only
      need (conf->max_degraded + 1) pages per stripe.
      
      With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
      pending writes will be excluded from stripe flush out. Therefore,
      we can reduce reserved space for flush out and thus improve journal
      device efficiency.
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      39b99586
    • Song Liu's avatar
      md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Song Liu authored
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB-page,
      so this chunk contain 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, struct big_stripe_info tracks how many stripes
      of this big_stripe are in the write back cache. We count how many
      stripes of this big_stripe are in the write back cache. These
      counters are tracked in a radix tree (big_stripe_tree).
      r5c_tree_index() is used to calculate keys for the radix tree.
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This look up is protected by
      rcu_read_lock().
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding new flag, we reuses existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags are set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Reviewed-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      03b047f4
    • Song Liu's avatar
      EXPORT_SYMBOL radix_tree_replace_slot · 10257d71
      Song Liu authored
      It will be used in drivers/md/raid5-cache.c
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      10257d71
    • Shaohua Li's avatar
      raid5: only dispatch IO from raid5d for harddisk raid · 765d704d
      Shaohua Li authored
      We made raid5 stripe handling multi-thread before. It works well for
      SSD. But for harddisk, the multi-threading creates more disk seek, so
      not always improve performance. For several hard disks based raid5,
      multi-threading is required as raid5d becames a bottleneck especially
      for sequential write.
      
      To overcome the disk seek issue, we only dispatch IO from raid5d if the
      array is harddisk based. Other threads can still handle stripes, but
      can't dispatch IO.
      
      Idealy, we should control IO dispatching order according to IO position
      interrnally. Right now we still depend on block layer, which isn't very
      efficient sometimes though.
      
      My setup has 9 harddisks, each disk can do around 180M/s sequential
      write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
      write. The test machine uses an ATOM CPU. I measure sequential write
      with large iodepth bandwidth to raid array:
      
      without patch: ~600M/s
      without patch and group_thread_cnt=4: 750M/s
      with patch and group_thread_cnt=4: 950M/s
      with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
      
      We are pretty close to the maximum bandwidth in the large iodepth
      iodepth case. The performance gap of small iodepth sequential write
      between software raid and theory value is still very big though, because
      we don't have an efficient pipeline.
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      765d704d
    • colyli@suse.de's avatar
      md linear: fix a race between linear_add() and linear_congested() · 03a9e24e
      colyli@suse.de authored
      Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
      disk to a md linear device causes kernel crash at linear_congested(). From
      the crash image analysis, I find in linear_congested(), mddev->raid_disks
      contains value N, but conf->disks[] only has N-1 pointers available. Then
      a NULL pointer deference crashes the kernel.
      
      There is a race between linear_add() and linear_congested(), RCU stuffs
      used in these two functions cannot avoid the race. Since Linuv v4.0
      RCU code is replaced by introducing mddev_suspend().  After checking the
      upstream code, it seems linear_congested() is not called in
      generic_make_request() code patch, so mddev_suspend() cannot provent it
      from being called. The possible race still exists.
      
      Here I explain how the race still exists in current code.  For a machine
      has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
      md linear device; at the same time on other CPU, linear_congested() is
      called to detect whether this md linear device is congested before issuing
      an I/O request onto it.
      
      Now I use a possible code execution time sequence to demo how the possible
      race happens,
      
      seq    linear_add()                linear_congested()
       0                                 conf=mddev->private
       1   oldconf=mddev->private
       2   mddev->raid_disks++
       3                              for (i=0; i<mddev->raid_disks;i++)
       4                                bdev_get_queue(conf->disks[i].rdev->bdev)
       5   mddev->private=newconf
      
      In linear_add() mddev->raid_disks is increased in time seq 2, and on
      another CPU in linear_congested() the for-loop iterates conf->disks[i] by
      the increased mddev->raid_disks in time seq 3,4. But conf with one more
      element (which is a pointer to struct dev_info type) to conf->disks[] is
      not updated yet, accessing its structure member in time seq 4 will cause a
      NULL pointer deference fault.
      
      To fix this race, there are 2 parts of modification in the patch,
       1) Add 'int raid_disks' in struct linear_conf, as a copy of
          mddev->raid_disks. It is initialized in linear_conf(), always being
          consistent with pointers number of 'struct dev_info disks[]'. When
          iterating conf->disks[] in linear_congested(), use conf->raid_disks to
          replace mddev->raid_disks in the for-loop, then NULL pointer deference
          will not happen again.
       2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
          free oldconf memory. Because oldconf may be referenced as mddev->private
          in linear_congested(), kfree_rcu() makes sure that its memory will not
          be released until no one uses it any more.
      Also some code comments are added in this patch, to make this modification
      to be easier understandable.
      
      This patch can be applied for kernels since v4.0 after commit:
      3be260cc ("md/linear: remove rcu protections in favour of
      suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
      people who maintain kernels before Linux v4.0, they need to do some back
      back port to this patch.
      
      Changelog:
       - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
             replace rcu_call() in linear_add().
       - v2: add RCU stuffs by suggestion from Shaohua and Neil.
       - v1: initial effort.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Neil Brown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      03a9e24e
  7. 12 Feb, 2017 1 commit
  8. 11 Feb, 2017 8 commits
  9. 10 Feb, 2017 10 commits