1. 18 Nov, 2016 9 commits
    • md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu authored
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we prioritize reclaim of full stripe writes.
      To free up more journal space, we free earliest data on the journal.
      
      In the current implementation, reclaim happens:
      1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds), if
         there has been no reclaim in the past 5 seconds.
      2. When there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or the cached stripes are enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe).
      3. When there is pressure on stripe cache (r5c_check_stripe_cache_usage).
      4. When there is pressure on journal space (r5l_write_stripe, r5c_cache_data).
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 of stripes are cached,
      or there are empty inactive lists), flush all full stripes. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some partial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
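
      A minimal sketch of the stripe cache policy described above, in
      plain userspace C. The names classify_pressure(), plan_flush() and
      struct flush_plan are hypothetical and are not the kernel code.
      Treating exactly 1/2 as low pressure, and topping up with partial
      stripes until stripe_group is reached, are assumptions made for
      illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pressure levels; thresholds follow the commit message. */
enum cache_pressure { PRESSURE_LOW, PRESSURE_MODERATE, PRESSURE_HIGH };

struct flush_plan {
    int full_stripes;    /* full stripes to flush */
    int partial_stripes; /* partial stripes to flush */
};

static enum cache_pressure classify_pressure(int cached, int total,
                                             bool empty_inactive_list)
{
    assert(total > 0 && cached >= 0 && cached <= total);
    /* High: more than 3/4 cached, or an inactive list is empty. */
    if (empty_inactive_list || cached * 4 > total * 3)
        return PRESSURE_HIGH;
    /* Moderate: more than 1/2 cached (boundary handling is an assumption). */
    if (cached * 2 > total)
        return PRESSURE_MODERATE;
    return PRESSURE_LOW;
}

static struct flush_plan plan_flush(enum cache_pressure pressure, int full,
                                    int partial, int stripe_group)
{
    struct flush_plan plan = { 0, 0 };

    if (pressure == PRESSURE_LOW)
        return plan;

    /* Moderate and high pressure both flush every full stripe. */
    plan.full_stripes = full;

    /* Under high pressure, top up with partial stripes (assumed amount). */
    if (pressure == PRESSURE_HIGH && full < stripe_group) {
        int wanted = stripe_group - full;
        plan.partial_stripes = wanted < partial ? wanted : partial;
    }
    return plan;
}
```

      With 3 full and 20 partial stripes cached and an example
      stripe_group of 16, this sketch flushes 3 full and 13 partial
      stripes under high pressure, and only the 3 full stripes under
      moderate pressure.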
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically includes pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
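
      The space calculation and the two thresholds can be sketched as
      follows. This is an illustration only: the helper names and struct
      log_state are hypothetical, and the sketch counts space in pages,
      whereas the real code may use a different unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two cache_state flags. */
struct log_state {
    bool tight;    /* corresponds to R5C_LOG_TIGHT */
    bool critical; /* corresponds to R5C_LOG_CRITICAL */
};

/*
 * Pages needed to flush every cached stripe: one meta data page plus
 * raid_disks pages of data and parity per stripe.
 */
static uint64_t log_required_to_flush_cache(uint64_t raid_disks,
                                            uint64_t stripe_in_journal_count)
{
    assert(raid_disks > 0);
    return (raid_disks + 1) * stripe_in_journal_count;
}

/* Compare free log space against 3x and 2x of the required space. */
static struct log_state update_log_state(uint64_t free_pages,
                                         uint64_t reclaim_required_space)
{
    struct log_state s;

    s.tight = free_pages < 3 * reclaim_required_space;
    s.critical = free_pages < 2 * reclaim_required_space;
    return s;
}
```

      For a 5-disk array with 10 cached stripes, the sketch yields 60
      pages of reclaim_required_space, so the log becomes tight below
      180 free pages and critical below 120.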
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
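
      A small sketch of how the log tail follows the list, using an
      array in place of the kernel list. struct cached_stripe and
      log_tail() are hypothetical names, and falling back to a
      caller-supplied position when the list is empty is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an entry of stripe_in_journal_list. */
struct cached_stripe {
    uint64_t journal_start; /* first appearance of the stripe in the log */
};

/*
 * Entries are ordered by first appearance on the journal, so the oldest
 * live data belongs to the first entry. An empty list has no live data,
 * so the tail may advance to fallback_position (assumed behaviour).
 */
static uint64_t log_tail(const struct cached_stripe *list, size_t count,
                         uint64_t fallback_position)
{
    assert(count == 0 || list != NULL);
    return count > 0 ? list[0].journal_start : fallback_position;
}
```

      Once the first stripe is fully written to the RAID disks and
      dropped from the list, the tail moves to the journal_start of the
      next entry, which frees the log space in between.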
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal_list. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: caching phase of r5cache · 1e6d690b
      Song Liu authored
      As described in previous patch, write back cache operates in two
      phases: caching and writing-out. The caching phase works as:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase is as:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calculate parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements caching phase. The cache is integrated with
      stripe cache of raid456. It leverages code of r5l_log to write
      data to journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, write operation does not wait for parity calculation
      and write out, so the write latency is lower (1 write to journal
      device vs. read and then write to raid disks). Also, r5cache will
      reduce RAID overhead (multiple IO due to read-modify-write of
      parity) and provide more opportunities of full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
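
      A sketch of how a released stripe could be routed to one of these
      lists. pick_list() and enum stripe_list are hypothetical, and
      defining a full stripe as one whose data blocks are all cached is
      an assumption, not a statement about the actual implementation.

```c
#include <assert.h>

/* Hypothetical list identifiers for the sketch. */
enum stripe_list { LIST_INACTIVE, LIST_R5C_PARTIAL, LIST_R5C_FULL };

/*
 * Choose a list from the number of data blocks held in the cache.
 * parity_disks is 1 for RAID5 and 2 for RAID6.
 */
static enum stripe_list pick_list(int cached_data_blocks, int raid_disks,
                                  int parity_disks)
{
    int data_disks = raid_disks - parity_disks;

    assert(cached_data_blocks >= 0 && cached_data_blocks <= data_disks);
    if (cached_data_blocks == 0)
        return LIST_INACTIVE;    /* nothing cached */
    if (cached_data_blocks == data_disks)
        return LIST_R5C_FULL;    /* would carry STRIPE_R5C_FULL_STRIPE */
    return LIST_R5C_PARTIAL;     /* would carry STRIPE_R5C_PARTIAL_STRIPE */
}
```

      On a 5-disk RAID5 array the sketch treats a stripe with 4 cached
      data blocks as full and a stripe with 2 cached data blocks as
      partial.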
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
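
      The RMW arithmetic above can be illustrated for XOR parity with a
      few lines of userspace C. The function names are hypothetical and
      the sketch covers only the XOR (P) parity; it says nothing about
      how the kernel schedules these operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR a data buffer into the parity buffer, byte by byte. */
static void xor_into(uint8_t *parity, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= data[i];
}

/*
 * Read-modify-write parity update for one data block:
 *   prexor:      remove the old data (orig_page) from the parity
 *   reconstruct: add the new data (page) back into the parity
 * With XOR, "subtract" and "add" are the same operation.
 */
static void rmw_update_parity(uint8_t *parity, const uint8_t *orig_page,
                              const uint8_t *page, size_t len)
{
    assert(parity != NULL && orig_page != NULL && page != NULL);
    xor_into(parity, orig_page, len);
    xor_into(parity, page, len);
}
```

      After the update, the parity equals the XOR of the new block and
      the untouched blocks, without reading those other blocks. This is
      why the old data has to be kept in a separate page while the new
      data sits in ->page.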
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at any time. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With writeback cache, read path must enter state machine, which
         is a significant bottleneck for some workloads.
      4. There is no per stripe checkpoint (with r5l_payload_flush) in
         the log, so recovery code has to replay more data than necessary
         (sometimes all the log from last_checkpoint). This reduces
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu authored
      This patch adds state machine for raid5-cache. With log device, the
      raid456 array could operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      Existing code of raid5-cache only has write-through mode. For write-back
      cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In writing-out phase, the stripe behaves like a stripe in write-through
      mode (R5C_MODE_WRITE_THROUGH).
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
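
      A simplified model of the phase bit, as a sketch only. The struct,
      the bit value and the helper names are hypothetical; the real
      handlers check more conditions than are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for r5c_journal_mode and STRIPE_R5C_CACHING. */
enum journal_mode { MODE_WRITE_THROUGH, MODE_WRITE_BACK };
#define SKETCH_STRIPE_CACHING (1u << 0)

struct sketch_stripe {
    unsigned int state;
};

/* A write moves the stripe to the caching phase only with a write-back journal. */
static void sketch_handle_write(struct sketch_stripe *sh, bool has_journal,
                                enum journal_mode mode)
{
    assert(sh != NULL);
    if (has_journal && mode == MODE_WRITE_BACK)
        sh->state |= SKETCH_STRIPE_CACHING;
}

/* Clearing the bit starts the writing-out phase. */
static void sketch_make_stripe_write_out(struct sketch_stripe *sh)
{
    assert(sh != NULL);
    sh->state &= ~SKETCH_STRIPE_CACHING;
}

static bool sketch_in_caching_phase(const struct sketch_stripe *sh)
{
    return (sh->state & SKETCH_STRIPE_CACHING) != 0;
}
```

      In this model a stripe without a journal, or with a write-through
      journal, never has the bit set, which matches the note that the
      patch is a no-op for write-through mode.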
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With the RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: move some code to raid5.h · 937621c3
      Song Liu authored
      Move some defines and inline functions to raid5.h, so they can be
      used in raid5-cache.c.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: Check array size in r5l_init_log · c757ec95
      Song Liu authored
      Currently, r5l_write_stripe checks meta size for each stripe write,
      which is not necessary.
      
      With this patch, r5l_init_log checks maximal meta size of the array,
      which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
      If this is too big to fit in one page, r5l_init_log aborts.
      
      With current meta data, r5l_log supports raid_disks up to 203.
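
      The size check can be sketched as below. The three constants are
      assumptions (a 4096-byte page, a 32-byte meta block header and 20
      bytes of payload per disk) chosen because they reproduce the limit
      of 203 quoted above; verify them against the real structure
      definitions before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed sizes in bytes; not taken from the kernel headers. */
#define SKETCH_PAGE_SIZE        4096u
#define SKETCH_META_HEADER      32u /* assumed size of r5l_meta_block */
#define SKETCH_PAYLOAD_PER_DISK 20u /* assumed payload plus one checksum */

/* True if the worst-case meta data for raid_disks fits in one page. */
static bool meta_fits_in_page(unsigned int raid_disks)
{
    assert(raid_disks > 0);
    return SKETCH_META_HEADER +
           (size_t)raid_disks * SKETCH_PAYLOAD_PER_DISK <= SKETCH_PAGE_SIZE;
}

/* Largest raid_disks value that still passes the check. */
static unsigned int max_raid_disks(void)
{
    return (SKETCH_PAGE_SIZE - SKETCH_META_HEADER) / SKETCH_PAYLOAD_PER_DISK;
}
```

      Doing this check once at initialization, instead of on every
      stripe write, removes a comparison from the write path.
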
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: add blktrace event for writes to superblock · 504634f6
      Shaohua Li authored
      A superblock write is an expensive operation. With raid5-cache, it can
      happen regularly. Add tracing to help debug performance.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: NeilBrown <neilb@suse.com>
    • md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      NeilBrown authored
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messages so we can see when that is happening when
      looking for performance artefacts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/bitmap: add blktrace event for writes to the bitmap · 581dbd94
      NeilBrown authored
      We trace whenever bitmap_unplug() finds that it needs to write
      to the bitmap, or when bitmap_daemon_work() finds there is work
      to do.
      
      This makes it easier to correlate bitmap updates with data writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: add block tracing for bio_remapping · 109e3765
      NeilBrown authored
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the original bio,
      or a secondary mapping for the bio that errors.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  2. 17 Nov, 2016 1 commit
  3. 10 Nov, 2016 1 commit
  4. 09 Nov, 2016 2 commits
    • md: define mddev flags, recovery flags and r1bio state bits using enums · be306c29
      NeilBrown authored
      This is less error prone than using individual #defines.
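
      A short illustration of the idea, with hypothetical names that are
      not the actual md flags. With an enum, each bit number is assigned
      automatically, so inserting a flag cannot accidentally reuse a
      number the way a hand-written #define can.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the compiler numbers them 0, 1, 2, ... */
enum demo_flag {
    DEMO_CHANGE_DEVS,
    DEMO_CHANGE_CLEAN,
    DEMO_CHANGE_PENDING,
    DEMO_FLAG_COUNT /* number of flags, must stay last */
};

/* Set one flag bit in a flags word. */
static void demo_set_flag(unsigned long *flags, enum demo_flag bit)
{
    assert(bit < DEMO_FLAG_COUNT);
    *flags |= 1UL << bit;
}

/* Test one flag bit in a flags word. */
static bool demo_test_flag(unsigned long flags, enum demo_flag bit)
{
    assert(bit < DEMO_FLAG_COUNT);
    return ((flags >> bit) & 1UL) != 0;
}
```

      The trailing DEMO_FLAG_COUNT entry also gives a cheap bound for
      sanity checks, which separate #defines do not provide.
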
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid1: fix: IO can block resync indefinitely · f2c771a6
      NeilBrown authored
      While performing a resync/recovery, raid1 divides the
      array space into three regions:
       - before the resync
       - at or shortly after the resync point
       - much further ahead of the resync point.
      
      Write requests to the first or third do not need to wait.  Write
      requests to the middle region do need to wait if resync requests are
      pending.
      
      If there are any active write requests in the middle region, resync
      will wait for them.
      
      Due to an accounting error, there is a small range of addresses,
      between conf->next_resync and conf->start_next_window, where write
      requests will *not* be blocked, but *will* be counted in the middle
      region.  This can effectively block resync indefinitely if filesystem
      writes happen repeatedly to this region.
      
      As ->next_window_requests is incremented when the sector is after
        conf->start_next_window + NEXT_NORMALIO_DISTANCE
      the same boundary should be used for determining when write requests
      should wait.
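
      The point of the fix, that counting and waiting must use the same
      boundary, can be sketched with a single classifier shared by both
      decisions. All names are hypothetical, and using a resync_completed
      position as the lower boundary is an assumption; this is not the
      raid1 code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The three regions described in the commit message. */
enum write_region { REGION_BEHIND, REGION_NEAR, REGION_FAR_AHEAD };

/*
 * Classify a write covering sectors [start, end). The far-ahead boundary
 * is start_next_window + distance, where distance plays the role of
 * NEXT_NORMALIO_DISTANCE.
 */
static enum write_region classify_write(uint64_t start, uint64_t end,
                                        uint64_t resync_completed,
                                        uint64_t start_next_window,
                                        uint64_t distance)
{
    assert(start <= end);
    if (end <= resync_completed)
        return REGION_BEHIND;
    if (start >= start_next_window + distance)
        return REGION_FAR_AHEAD;
    return REGION_NEAR;
}

/* Only writes in the middle region wait, and only while resync is pending. */
static bool write_must_wait(enum write_region region, bool resync_pending)
{
    return region == REGION_NEAR && resync_pending;
}
```

      Because one function decides the region, a write can no longer be
      counted in the middle region while being allowed to proceed
      without waiting.
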
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  5. 07 Nov, 2016 24 commits
  6. 06 Nov, 2016 1 commit
    • openrisc: Define __ro_after_init to avoid crash · 2c7a5c5c
      Guenter Roeck authored
      openrisc qemu tests fail with the following crash.
      
      Unable to handle kernel access at virtual address 0xc0300c34
      
      Oops#: 0001
      CPU #: 0
         PC: c016c710    SR: 0000ae67    SP: c1017e04
         GPR00: 00000000 GPR01: c1017e04 GPR02: c0300c34 GPR03: c0300c34
         GPR04: 00000000 GPR05: c0300cb0 GPR06: c0300c34 GPR07: 000000ff
         GPR08: c107f074 GPR09: c0199ef4 GPR10: c1016000 GPR11: 00000000
         GPR12: 00000000 GPR13: c107f044 GPR14: c0473774 GPR15: 07ce0000
         GPR16: 00000000 GPR17: c107ed8a GPR18: 00009600 GPR19: c107f044
         GPR20: c107ee74 GPR21: 00000003 GPR22: c0473770 GPR23: 00000033
         GPR24: 000000bf GPR25: 00000019 GPR26: c046400c GPR27: 00000001
         GPR28: c0464028 GPR29: c1018000 GPR30: 00000006 GPR31: ccf37483
           RES: 00000000 oGPR11: ffffffff
           Process swapper (pid: 1, stackpage=c1001960)
      
           Stack: Stack dump [0xc1017cf8]:
           sp + 00: 0xc1017e04
           sp + 04: 0xc0300c34
           sp + 08: 0xc0300c34
           sp + 12: 0x00000000
      ...
      
      Bisect points to commit d2ec3f77 ("pty: make ptmx file ops read-only
      after init"). Fix by defining __ro_after_init for the openrisc
      architecture, similar to parisc.
      
      Fixes: d2ec3f77 ("pty: make ptmx file ops read-only after init")
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: Stafford Horne <shorne@gmail.com>
  7. 05 Nov, 2016 2 commits