Commit 3418d036 authored by Artur Paszkiewicz, committed by Shaohua Li

raid5-ppl: Partial Parity Log write logging implementation

Implement the calculation of partial parity for a stripe and PPL write
logging functionality. The description of PPL is added to the
documentation. More details can be found in the comments in raid5-ppl.c.

Attach a page for holding the partial parity data to stripe_head.
Allocate it only if mddev has the MD_HAS_PPL flag set.

Partial parity is the XOR of the data chunks of a stripe that are not
modified by the write. It is calculated as follows (a short sketch in C
follows this list):

- reconstruct-write case:
  xor data from all not updated disks in a stripe

- read-modify-write case:
  xor old data and parity from all updated disks in a stripe
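
A minimal sketch of the two cases above, in plain C. This is illustrative
only: CHUNK_SIZE, xor_into() and the updated[] bookkeeping are made up for
the example; the kernel computes this with the async_tx API into
sh->ppl_page.

/*
 * Illustrative sketch only - not the kernel's async_tx implementation.
 * A "chunk" here is one member disk's slice of the stripe; CHUNK_SIZE and
 * the helper names are invented for the example.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 4096

static void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/* reconstruct-write: XOR the data chunks NOT updated by this write */
static void partial_parity_rcw(uint8_t *pp, uint8_t *data[], int ndata,
			       const int updated[])
{
	int d;

	memset(pp, 0, CHUNK_SIZE);
	for (d = 0; d < ndata; d++)
		if (!updated[d])
			xor_into(pp, data[d], CHUNK_SIZE);
}

/* read-modify-write: XOR old parity with the old data of the updated chunks */
static void partial_parity_rmw(uint8_t *pp, const uint8_t *old_parity,
			       uint8_t *old_data[], int ndata,
			       const int updated[])
{
	int d;

	memcpy(pp, old_parity, CHUNK_SIZE);
	for (d = 0; d < ndata; d++)
		if (updated[d])
			xor_into(pp, old_data[d], CHUNK_SIZE);
}

Both paths produce the same value - the XOR of the data chunks the write
leaves untouched - which is what ends up in sh->ppl_page.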

Implement it using the async_tx API and integrate into raid_run_ops().
It must be called when we still have access to old data, so do it when
STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
stored into sh->ppl_page.

Partial parity is not meaningful for a full stripe write and is not stored
in the log or used for recovery, so don't attempt to calculate it when the
stripe has STRIPE_FULL_WRITE set.

Put the PPL metadata structures in md_p.h because userspace tools (mdadm)
will also need to read/write PPL.

For now, warn when PPL is used with the disk volatile write-back cache
enabled. The warning can be removed once flushing the disk cache before
writing the PPL is implemented.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
parent ff875738
Partial Parity Log
Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, parity of a particular stripe
may become inconsistent with data on other member disks. If the array is also
in degraded state, there is no way to recalculate parity, because one of the
disks is missing. This can lead to silent data corruption when rebuilding the
array or using it as degraded - data calculated from parity for array blocks
that have not been touched by a write request during the unclean shutdown can
be incorrect. Such condition is known as the RAID5 Write Hole. Because of
this, md by default does not allow starting a dirty degraded array.
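
A toy worked example of the problem (made-up byte values, three data chunks
d0..d2 plus one parity chunk):

  d0 = 0x11, d1 = 0x22, d2 = 0x33, parity = d0 ^ d1 ^ d2 = 0x00
  A write updates d0 to 0x44; the machine crashes after d0 reaches its disk
  but before parity is updated, and d1's disk is missing on restart.
  Reconstructing d1 from the stale parity gives
      parity ^ d0 ^ d2 = 0x00 ^ 0x44 ^ 0x33 = 0x77   (should be 0x22)
  even though d1 itself was never written - silent corruption.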
Partial parity for a write operation is the XOR of stripe data chunks not
modified by this write. It is just enough data needed for recovering from the
write hole. XORing partial parity with the modified chunks produces parity for
the stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the not modified data disks of
this stripe is missing, this updated parity can be used to recover its
contents. PPL recovery is also performed when starting an array after an
unclean shutdown and all disks are available, eliminating the need to resync
the array. Because of this, using write-intent bitmap and PPL together is not
supported.
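
A minimal runnable sketch of the recovery property described above, reusing
the toy byte values from the write-hole example earlier. This is an
illustration only, not the recovery code in raid5-ppl.c.

#include <assert.h>
#include <stdint.h>

int main(void)
{
	/* One-byte "chunks": three data disks d0..d2 plus parity. */
	uint8_t d0 = 0x11, d1 = 0x22, d2 = 0x33;
	uint8_t parity = d0 ^ d1 ^ d2;		/* 0x00 */

	/* A write modifies only d0; partial parity covers the untouched chunks. */
	uint8_t pp = d1 ^ d2;			/* 0x11, written to the PPL first */

	/* Dirty shutdown: new d0 reached its disk, the parity write did not,
	 * and d1's disk is missing on restart. */
	uint8_t d0_on_disk = 0x44;
	uint8_t stale_parity = parity;

	/* PPL recovery: partial parity XOR the modified chunks as found on
	 * disk gives parity consistent with the data currently on disk... */
	uint8_t parity_recovered = pp ^ d0_on_disk;	/* 0x55 */
	assert(parity_recovered == (d0_on_disk ^ d1 ^ d2));

	/* ...so the missing, not-modified member d1 can be rebuilt correctly, */
	uint8_t d1_rebuilt = parity_recovered ^ d0_on_disk ^ d2;
	assert(d1_rebuilt == 0x22);

	/* whereas the stale parity would have produced garbage (0x77). */
	assert((uint8_t)(stale_parity ^ d0_on_disk ^ d2) == 0x77);
	return 0;
}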
When handling a write request, PPL writes partial parity before new data and
parity are dispatched to disks. PPL is a distributed log - it is stored on
array member drives in the metadata area, on the parity drive of a particular
stripe. It does not require a dedicated journaling drive. Write performance is
reduced by up to 30%-40%, but it scales with the number of drives in the
array, and there is no dedicated journaling drive to become a bottleneck or a
single point of failure.
Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect from losing in-flight data, only from
silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
performed for this stripe (parity is not updated). So it is possible to have
arbitrary data in the written part of a stripe if that disk is lost. In such
a case the behavior is the same as in plain raid5.
PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using the mdadm option
--consistency-policy=ppl. Currently, the volatile write-back cache should be
disabled on all member drives when using PPL. Otherwise, consistency cannot
be guaranteed in case of a power failure.
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 dm-era-y += dm-era-target.o
 dm-verity-y += dm-verity-target.o
 md-mod-y += md.o bitmap.o
-raid456-y += raid5.o raid5-cache.o
+raid456-y += raid5.o raid5-cache.o raid5-ppl.o
 # Note: link order is important. All raid personalities
 # and must come before md.o, as they each initialise
@@ -31,6 +31,20 @@ extern struct md_sysfs_entry r5c_journal_mode;
 extern void r5c_update_on_rdev_error(struct mddev *mddev);
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
 
+extern struct dma_async_tx_descriptor *
+ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
+		       struct dma_async_tx_descriptor *tx);
+extern int ppl_init_log(struct r5conf *conf);
+extern void ppl_exit_log(struct r5conf *conf);
+extern int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
+extern void ppl_write_stripe_run(struct r5conf *conf);
+extern void ppl_stripe_write_finished(struct stripe_head *sh);
+
+static inline bool raid5_has_ppl(struct r5conf *conf)
+{
+	return test_bit(MD_HAS_PPL, &conf->mddev->flags);
+}
+
 static inline int log_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 {
 	struct r5conf *conf = sh->raid_conf;
@@ -45,6 +59,8 @@ static inline int log_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 			/* caching phase */
 			return r5c_cache_data(conf->log, sh);
 		}
+	} else if (raid5_has_ppl(conf)) {
+		return ppl_write_stripe(conf, sh);
 	}
 
 	return -EAGAIN;
@@ -56,24 +72,32 @@ static inline void log_stripe_write_finished(struct stripe_head *sh)
 	if (conf->log)
 		r5l_stripe_write_finished(sh);
+	else if (raid5_has_ppl(conf))
+		ppl_stripe_write_finished(sh);
 }
 
 static inline void log_write_stripe_run(struct r5conf *conf)
 {
 	if (conf->log)
 		r5l_write_stripe_run(conf->log);
+	else if (raid5_has_ppl(conf))
+		ppl_write_stripe_run(conf);
 }
 
 static inline void log_exit(struct r5conf *conf)
 {
 	if (conf->log)
 		r5l_exit_log(conf);
+	else if (raid5_has_ppl(conf))
+		ppl_exit_log(conf);
 }
 
 static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev)
 {
 	if (journal_dev)
 		return r5l_init_log(conf, journal_dev);
+	else if (raid5_has_ppl(conf))
+		return ppl_init_log(conf);
 
 	return 0;
 }
(The diff of the new file raid5-ppl.c is collapsed in this view; see the
comments in raid5-ppl.c for the implementation details.)
@@ -482,6 +482,11 @@ static void shrink_buffers(struct stripe_head *sh)
 		sh->dev[i].page = NULL;
 		put_page(p);
 	}
+
+	if (sh->ppl_page) {
+		put_page(sh->ppl_page);
+		sh->ppl_page = NULL;
+	}
 }
 
 static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
@@ -498,6 +503,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
 		sh->dev[i].page = page;
 		sh->dev[i].orig_page = page;
 	}
+
+	if (raid5_has_ppl(sh->raid_conf)) {
+		sh->ppl_page = alloc_page(gfp);
+		if (!sh->ppl_page)
+			return 1;
+	}
+
 	return 0;
 }
@@ -746,7 +758,7 @@ static bool stripe_can_batch(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 
-	if (conf->log)
+	if (conf->log || raid5_has_ppl(conf))
 		return false;
 	return test_bit(STRIPE_BATCH_READY, &sh->state) &&
 	       !test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
@@ -2093,6 +2105,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 		async_tx_ack(tx);
 	}
 
+	if (test_bit(STRIPE_OP_PARTIAL_PARITY, &ops_request))
+		tx = ops_run_partial_parity(sh, percpu, tx);
+
 	if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
 		if (level < 6)
 			tx = ops_run_prexor5(sh, percpu, tx);
@@ -3168,6 +3183,12 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 			s->locked++;
 		}
 
+	if (raid5_has_ppl(sh->raid_conf) &&
+	    test_bit(STRIPE_OP_BIODRAIN, &s->ops_request) &&
+	    !test_bit(STRIPE_FULL_WRITE, &sh->state) &&
+	    test_bit(R5_Insync, &sh->dev[pd_idx].flags))
+		set_bit(STRIPE_OP_PARTIAL_PARITY, &s->ops_request);
+
 	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
 		 __func__, (unsigned long long)sh->sector,
 		 s->locked, s->ops_request);
@@ -3215,6 +3236,36 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
 		goto overlap;
 
+	if (forwrite && raid5_has_ppl(conf)) {
+		/*
+		 * With PPL only writes to consecutive data chunks within a
+		 * stripe are allowed because for a single stripe_head we can
+		 * only have one PPL entry at a time, which describes one data
+		 * range. Not really an overlap, but wait_for_overlap can be
+		 * used to handle this.
+		 */
+		sector_t sector;
+		sector_t first = 0;
+		sector_t last = 0;
+		int count = 0;
+		int i;
+
+		for (i = 0; i < sh->disks; i++) {
+			if (i != sh->pd_idx &&
+			    (i == dd_idx || sh->dev[i].towrite)) {
+				sector = sh->dev[i].sector;
+				if (count == 0 || sector < first)
+					first = sector;
+				if (sector > last)
+					last = sector;
+				count++;
+			}
+		}
+
+		if (first + conf->chunk_sectors * (count - 1) != last)
+			goto overlap;
+	}
+
 	if (!forwrite || previous)
 		clear_bit(STRIPE_BATCH_READY, &sh->state);
@@ -7208,6 +7259,13 @@ static int raid5_run(struct mddev *mddev)
 		BUG_ON(mddev->delta_disks != 0);
 	}
 
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
+	    test_bit(MD_HAS_PPL, &mddev->flags)) {
+		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
+			mdname(mddev));
+		clear_bit(MD_HAS_PPL, &mddev->flags);
+	}
+
 	if (mddev->private == NULL)
 		conf = setup_conf(mddev);
 	else
@@ -7689,7 +7747,7 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
 	sector_t newsize;
 	struct r5conf *conf = mddev->private;
 
-	if (conf->log)
+	if (conf->log || raid5_has_ppl(conf))
 		return -EINVAL;
 	sectors &= ~((sector_t)conf->chunk_sectors - 1);
 	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
@@ -7740,7 +7798,7 @@ static int check_reshape(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
 
-	if (conf->log)
+	if (conf->log || raid5_has_ppl(conf))
 		return -EINVAL;
 	if (mddev->delta_disks == 0 &&
 	    mddev->new_layout == mddev->layout &&
@@ -224,10 +224,16 @@ struct stripe_head {
 	spinlock_t		batch_lock; /* only header's lock is useful */
 	struct list_head	batch_list; /* protected by head's batch lock*/
 
-	struct r5l_io_unit	*log_io;
+	union {
+		struct r5l_io_unit	*log_io;
+		struct ppl_io_unit	*ppl_io;
+	};
 	struct list_head	log_list;
 	sector_t		log_start; /* first meta block on the journal */
 	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
+
+	struct page		*ppl_page; /* partial parity of this stripe */
+
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -400,6 +406,7 @@ enum {
 	STRIPE_OP_BIODRAIN,
 	STRIPE_OP_RECONSTRUCT,
 	STRIPE_OP_CHECK,
+	STRIPE_OP_PARTIAL_PARITY,
 };
 
 /*
@@ -696,6 +703,7 @@ struct r5conf {
 	int			group_cnt;
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
+	void			*log_private;
 
 	spinlock_t		pending_bios_lock;
 	bool			batch_bio_dispatch;
@@ -398,4 +398,31 @@ struct r5l_meta_block {
 #define R5LOG_VERSION 0x1
 #define R5LOG_MAGIC 0x6433c509
 
+struct ppl_header_entry {
+	__le64 data_sector;	/* raid sector of the new data */
+	__le32 pp_size;		/* length of partial parity */
+	__le32 data_size;	/* length of data */
+	__le32 parity_disk;	/* member disk containing parity */
+	__le32 checksum;	/* checksum of partial parity data for this
+				 * entry (~crc32c) */
+} __attribute__ ((__packed__));
+
+#define PPL_HEADER_SIZE 4096
+#define PPL_HDR_RESERVED 512
+#define PPL_HDR_ENTRY_SPACE \
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
+#define PPL_HDR_MAX_ENTRIES \
+	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
+
+struct ppl_header {
+	__u8	reserved[PPL_HDR_RESERVED];/* reserved space, fill with 0xff */
+	__le32	signature;	/* signature (family number of volume) */
+	__le32	padding;	/* zero pad */
+	__le64	generation;	/* generation number of the header */
+	__le32	entries_count;	/* number of entries in entry array */
+	__le32	checksum;	/* checksum of the header (~crc32c) */
+	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
+} __attribute__ ((__packed__));
+
 #endif
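
The commit message notes that userspace (mdadm) will also need to read and
write PPL using these structures. With the sizes above, one 4 KiB header
holds up to 148 entries ((4096 - 512 - 24) / 24). Below is a hedged
userspace sketch that mirrors the layout with <stdint.h> types and dumps a
header's fields; it assumes a little-endian host and deliberately leaves out
how the header is located on disk and how its ~crc32c checksums are
computed, since those details belong to raid5-ppl.c and mdadm.

/*
 * Hedged userspace sketch: dump the fields of a PPL header. The struct
 * layout mirrors md_p.h above; checksum verification and on-disk placement
 * are intentionally not reproduced here. Assumes a little-endian host, so
 * the __le fields can be read directly as native integers.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PPL_HEADER_SIZE 4096
#define PPL_HDR_RESERVED 512
#define PPL_HDR_ENTRY_SPACE \
	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(uint32_t) - sizeof(uint64_t))

struct ppl_header_entry {
	uint64_t data_sector;	/* raid sector of the new data */
	uint32_t pp_size;	/* length of partial parity */
	uint32_t data_size;	/* length of data */
	uint32_t parity_disk;	/* member disk containing parity */
	uint32_t checksum;	/* ~crc32c of the partial parity data */
} __attribute__((__packed__));

#define PPL_HDR_MAX_ENTRIES \
	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))

struct ppl_header {
	uint8_t  reserved[PPL_HDR_RESERVED];
	uint32_t signature;
	uint32_t padding;
	uint64_t generation;
	uint32_t entries_count;
	uint32_t checksum;
	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
} __attribute__((__packed__));

/* buf must hold PPL_HEADER_SIZE bytes read from the PPL area. */
static void ppl_dump_header(const void *buf)
{
	struct ppl_header hdr;
	uint32_t i, count;

	memcpy(&hdr, buf, sizeof(hdr));
	count = hdr.entries_count;
	if (count > PPL_HDR_MAX_ENTRIES)
		count = PPL_HDR_MAX_ENTRIES;

	printf("signature 0x%08x generation %llu entries %u\n",
	       hdr.signature, (unsigned long long)hdr.generation, count);
	for (i = 0; i < count; i++)
		printf("  entry %u: data_sector %llu pp_size %u data_size %u parity_disk %u\n",
		       i, (unsigned long long)hdr.entries[i].data_sector,
		       hdr.entries[i].pp_size, hdr.entries[i].data_size,
		       hdr.entries[i].parity_disk);
}

int main(void)
{
	uint8_t buf[PPL_HEADER_SIZE];

	/* Feed the tool a 4 KiB header image on stdin, e.g. dd'd from disk. */
	if (fread(buf, 1, sizeof(buf), stdin) != sizeof(buf)) {
		fprintf(stderr, "short read\n");
		return 1;
	}
	ppl_dump_header(buf);
	return 0;
}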