Commit 0576b1c6 authored by Shaohua Li's avatar Shaohua Li Committed by NeilBrown

raid5: log reclaim support

This is the reclaim support for raid5 log. A stripe write will have
following steps:

1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused

In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.

It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.

Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.

Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.

There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.

The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.

Includes fixes to problems which were...
Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
Signed-off-by: default avatarShaohua Li <shli@fb.com>
Signed-off-by: default avatarNeilBrown <neilb@suse.com>
parent f6bed0ef
This diff is collapsed.
...@@ -3098,6 +3098,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh, ...@@ -3098,6 +3098,8 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
if (bi) if (bi)
bitmap_end = 1; bitmap_end = 1;
r5l_stripe_write_finished(sh);
if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
wake_up(&conf->wait_for_overlap); wake_up(&conf->wait_for_overlap);
...@@ -3498,6 +3500,8 @@ static void handle_stripe_clean_event(struct r5conf *conf, ...@@ -3498,6 +3500,8 @@ static void handle_stripe_clean_event(struct r5conf *conf,
WARN_ON(dev->page != dev->orig_page); WARN_ON(dev->page != dev->orig_page);
} }
r5l_stripe_write_finished(sh);
if (!discard_pending && if (!discard_pending &&
test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) { test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) {
clear_bit(R5_Discard, &sh->dev[sh->pd_idx].flags); clear_bit(R5_Discard, &sh->dev[sh->pd_idx].flags);
...@@ -5882,6 +5886,8 @@ static void raid5d(struct md_thread *thread) ...@@ -5882,6 +5886,8 @@ static void raid5d(struct md_thread *thread)
mutex_unlock(&conf->cache_size_mutex); mutex_unlock(&conf->cache_size_mutex);
} }
r5l_flush_stripe_to_raid(conf->log);
async_tx_issue_pending_all(); async_tx_issue_pending_all();
blk_finish_plug(&plug); blk_finish_plug(&plug);
......
...@@ -627,4 +627,6 @@ extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev); ...@@ -627,4 +627,6 @@ extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
extern void r5l_exit_log(struct r5l_log *log); extern void r5l_exit_log(struct r5l_log *log);
extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh); extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
extern void r5l_write_stripe_run(struct r5l_log *log); extern void r5l_write_stripe_run(struct r5l_log *log);
extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
extern void r5l_stripe_write_finished(struct stripe_head *sh);
#endif #endif
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment