Commit e5021876 authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md

Pull MD updates from Shaohua Li:

 - Add Partial Parity Log (ppl) feature found in Intel IMSM raid array
   by Artur Paszkiewicz. This feature is another way to close RAID5
   writehole. The Linux implementation is also available for normal
   RAID5 array if specific superblock bit is set.

 - A number of md-cluser fixes and enabling md-cluster array resize from
   Guoqing Jiang

 - A bunch of patches from Ming Lei and Neil Brown to rewrite MD bio
   handling related code. Now MD doesn't directly access bio bvec,
   bi_phys_segments and uses modern bio API for bio split.

 - Improve RAID5 IO pattern to improve performance for hard disk based
   RAID5/6 from me.

 - Several patches from Song Liu to speed up raid5-cache recovery and
   allow raid5 cache feature disabling in runtime.

 - Fix a performance regression in raid1 resync from Xiao Ni.

 - Other cleanup and fixes from various people.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (84 commits)
  md/raid10: skip spare disk as 'first' disk
  md/raid1: Use a new variable to count flighting sync requests
  md: clear WantReplacement once disk is removed
  md/raid1/10: remove unused queue
  md: handle read-only member devices better.
  md/raid10: wait up frozen array in handle_write_completed
  uapi: fix linux/raid/md_p.h userspace compilation error
  md-cluster: Fix a memleak in an error handling path
  md: support disabling of create-on-open semantics.
  md: allow creation of mdNNN arrays via md_mod/parameters/new_array
  raid5-ppl: use a single mempool for ppl_io_unit and header_page
  md/raid0: fix up bio splitting.
  md/linear: improve bio splitting.
  md/raid5: make chunk_aligned_read() split bios more cleanly.
  md/raid10: simplify handle_read_error()
  md/raid10: simplify the splitting of requests.
  md/raid1: factor out flush_bio_list()
  md/raid1: simplify handle_read_error().
  Revert "block: introduce bio_copy_data_partial"
  md/raid1: simplify alloc_behind_master_bio()
  ...
parents 46f0537b e265eb3a
......@@ -276,14 +276,14 @@ All md devices contain:
array creation it will default to 0, though starting the array as
``clean`` will set it much larger.
new_dev
new_dev
This file can be written but not read. The value written should
be a block device number as major:minor. e.g. 8:0
This will cause that device to be attached to the array, if it is
available. It will then appear at md/dev-XXX (depending on the
name of the device) and further configuration is then possible.
safe_mode_delay
safe_mode_delay
When an md array has seen no write requests for a certain period
of time, it will be marked as ``clean``. When another write
request arrives, the array is marked as ``dirty`` before the write
......@@ -292,7 +292,7 @@ All md devices contain:
period as a number of seconds. The default is 200msec (0.200).
Writing a value of 0 disables safemode.
array_state
array_state
This file contains a single word which describes the current
state of the array. In many cases, the state can be set by
writing the word for the desired state, however some states
......@@ -401,7 +401,30 @@ All md devices contain:
once the array becomes non-degraded, and this fact has been
recorded in the metadata.
consistency_policy
This indicates how the array maintains consistency in case of unexpected
shutdown. It can be:
none
Array has no redundancy information, e.g. raid0, linear.
resync
Full resync is performed and all redundancy is regenerated when the
array is started after unclean shutdown.
bitmap
Resync assisted by a write-intent bitmap.
journal
For raid4/5/6, journal device is used to log transactions and replay
after unclean shutdown.
ppl
For raid5 only, Partial Parity Log is used to close the write hole and
eliminate resync.
The accepted values when writing to this file are ``ppl`` and ``resync``,
used to enable and disable PPL.
As component devices are added to an md array, they appear in the ``md``
......@@ -563,6 +586,9 @@ Each directory contains:
adds bad blocks without acknowledging them. This is largely
for testing.
ppl_sector, ppl_size
Location and size (in sectors) of the space used for Partial Parity Log
on this device.
An active md device will also contain an entry for each active device
......
......@@ -321,4 +321,4 @@ The algorithm is:
There are somethings which are not supported by cluster MD yet.
- update size and change array_sectors.
- change array_sectors.
Partial Parity Log
Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, parity of a particular stripe
may become inconsistent with data on other member disks. If the array is also
in degraded state, there is no way to recalculate parity, because one of the
disks is missing. This can lead to silent data corruption when rebuilding the
array or using it is as degraded - data calculated from parity for array blocks
that have not been touched by a write request during the unclean shutdown can
be incorrect. Such condition is known as the RAID5 Write Hole. Because of
this, md by default does not allow starting a dirty degraded array.
Partial parity for a write operation is the XOR of stripe data chunks not
modified by this write. It is just enough data needed for recovering from the
write hole. XORing partial parity with the modified chunks produces parity for
the stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the not modified data disks of
this stripe is missing, this updated parity can be used to recover its
contents. PPL recovery is also performed when starting an array after an
unclean shutdown and all disks are available, eliminating the need to resync
the array. Because of this, using write-intent bitmap and PPL together is not
supported.
When handling a write request PPL writes partial parity before new data and
parity are dispatched to disks. PPL is a distributed log - it is stored on
array member drives in the metadata area, on the parity drive of a particular
stripe. It does not require a dedicated journaling drive. Write performance is
reduced by up to 30%-40% but it scales with the number of drives in the array
and the journaling drive does not become a bottleneck or a single point of
failure.
Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect from losing in-flight data, only from
silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
performed for this stripe (parity is not updated). So it is possible to have
arbitrary data in the written part of a stripe if that disk is lost. In such
case the behavior is the same as in plain raid5.
PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
Currently, volatile write-back cache should be disabled on all member drives
when using PPL. Otherwise it cannot guarantee consistency in case of power
failure.
......@@ -633,20 +633,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);
static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs, int offset,
int size)
/**
* bio_clone_bioset - clone a bio
* @bio_src: bio to clone
* @gfp_mask: allocation priority
* @bs: bio_set to allocate from
*
* Clone bio. Caller will own the returned bio, but not the actual data it
* points to. Reference count of returned bio will be one.
*/
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs)
{
struct bvec_iter iter;
struct bio_vec bv;
struct bio *bio;
struct bvec_iter iter_src = bio_src->bi_iter;
/* for supporting partial clone */
if (offset || size != bio_src->bi_iter.bi_size) {
bio_advance_iter(bio_src, &iter_src, offset);
iter_src.bi_size = size;
}
/*
* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
......@@ -670,8 +671,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
* __bio_clone_fast() anyways.
*/
bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
&iter_src), bs);
bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
if (!bio)
return NULL;
bio->bi_bdev = bio_src->bi_bdev;
......@@ -688,7 +688,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
break;
default:
__bio_for_each_segment(bv, bio_src, iter, iter_src)
bio_for_each_segment(bv, bio_src, iter)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
break;
}
......@@ -707,43 +707,8 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
return bio;
}
/**
* bio_clone_bioset - clone a bio
* @bio_src: bio to clone
* @gfp_mask: allocation priority
* @bs: bio_set to allocate from
*
* Clone bio. Caller will own the returned bio, but not the actual data it
* points to. Reference count of returned bio will be one.
*/
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs)
{
return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
bio_src->bi_iter.bi_size);
}
EXPORT_SYMBOL(bio_clone_bioset);
/**
* bio_clone_bioset_partial - clone a partial bio
* @bio_src: bio to clone
* @gfp_mask: allocation priority
* @bs: bio_set to allocate from
* @offset: cloned starting from the offset
* @size: size for the cloned bio
*
* Clone bio. Caller will own the returned bio, but not the actual data it
* points to. Reference count of returned bio will be one.
*/
struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs, int offset,
int size)
{
return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
}
EXPORT_SYMBOL(bio_clone_bioset_partial);
/**
* bio_add_pc_page - attempt to add page to bio
* @q: the target queue
......
......@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
dm-era-y += dm-era-target.o
dm-verity-y += dm-verity-target.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o raid5-cache.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o
# Note: link order is important. All raid personalities
# and must come before md.o, as they each initialise
......
......@@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
kunmap_atomic(sb);
write_page(bitmap, bitmap->storage.sb_page, 1);
}
EXPORT_SYMBOL(bitmap_update_sb);
/* print out the bitmap file superblock */
void bitmap_print_sb(struct bitmap *bitmap)
......@@ -696,7 +697,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)
out:
kunmap_atomic(sb);
/* Assiging chunksize is required for "re_read" */
/* Assigning chunksize is required for "re_read" */
bitmap->mddev->bitmap_info.chunksize = chunksize;
if (err == 0 && nodes && (bitmap->cluster_slot < 0)) {
err = md_setup_cluster(bitmap->mddev, nodes);
......@@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
/*
* free memory that was allocated
*/
static void bitmap_free(struct bitmap *bitmap)
void bitmap_free(struct bitmap *bitmap)
{
unsigned long k, pages;
struct bitmap_page *bp;
......@@ -1761,6 +1762,21 @@ static void bitmap_free(struct bitmap *bitmap)
kfree(bp);
kfree(bitmap);
}
EXPORT_SYMBOL(bitmap_free);
void bitmap_wait_behind_writes(struct mddev *mddev)
{
struct bitmap *bitmap = mddev->bitmap;
/* wait for behind writes to complete */
if (bitmap && atomic_read(&bitmap->behind_writes) > 0) {
pr_debug("md:%s: behind writes in progress - waiting to stop.\n",
mdname(mddev));
/* need to kick something here to make sure I/O goes? */
wait_event(bitmap->behind_wait,
atomic_read(&bitmap->behind_writes) == 0);
}
}
void bitmap_destroy(struct mddev *mddev)
{
......@@ -1769,6 +1785,8 @@ void bitmap_destroy(struct mddev *mddev)
if (!bitmap) /* there was no bitmap */
return;
bitmap_wait_behind_writes(mddev);
mutex_lock(&mddev->bitmap_info.mutex);
spin_lock(&mddev->lock);
mddev->bitmap = NULL; /* disconnect from the md device */
......@@ -1920,6 +1938,27 @@ int bitmap_load(struct mddev *mddev)
}
EXPORT_SYMBOL_GPL(bitmap_load);
struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
{
int rv = 0;
struct bitmap *bitmap;
bitmap = bitmap_create(mddev, slot);
if (IS_ERR(bitmap)) {
rv = PTR_ERR(bitmap);
return ERR_PTR(rv);
}
rv = bitmap_init_from_disk(bitmap, 0);
if (rv) {
bitmap_free(bitmap);
return ERR_PTR(rv);
}
return bitmap;
}
EXPORT_SYMBOL(get_bitmap_from_slot);
/* Loads the bitmap associated with slot and copies the resync information
* to our bitmap
*/
......@@ -1929,14 +1968,13 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
int rv = 0, i, j;
sector_t block, lo = 0, hi = 0;
struct bitmap_counts *counts;
struct bitmap *bitmap = bitmap_create(mddev, slot);
if (IS_ERR(bitmap))
return PTR_ERR(bitmap);
struct bitmap *bitmap;
rv = bitmap_init_from_disk(bitmap, 0);
if (rv)
goto err;
bitmap = get_bitmap_from_slot(mddev, slot);
if (IS_ERR(bitmap)) {
pr_err("%s can't get bitmap from slot %d\n", __func__, slot);
return -1;
}
counts = &bitmap->counts;
for (j = 0; j < counts->chunks; j++) {
......@@ -1963,8 +2001,7 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
bitmap_unplug(mddev->bitmap);
*low = lo;
*high = hi;
err:
bitmap_free(bitmap);
return rv;
}
EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
......
......@@ -267,8 +267,11 @@ void bitmap_daemon_work(struct mddev *mddev);
int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
int chunksize, int init);
struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
int bitmap_copy_from_slot(struct mddev *mddev, int slot,
sector_t *lo, sector_t *hi, bool clear_bits);
void bitmap_free(struct bitmap *bitmap);
void bitmap_wait_behind_writes(struct mddev *mddev);
#endif
#endif
......@@ -249,54 +249,49 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
{
char b[BDEVNAME_SIZE];
struct dev_info *tmp_dev;
struct bio *split;
sector_t start_sector, end_sector, data_offset;
sector_t bio_sector = bio->bi_iter.bi_sector;
if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
md_flush_request(mddev, bio);
return;
}
do {
sector_t bio_sector = bio->bi_iter.bi_sector;
tmp_dev = which_dev(mddev, bio_sector);
start_sector = tmp_dev->end_sector - tmp_dev->rdev->sectors;
end_sector = tmp_dev->end_sector;
data_offset = tmp_dev->rdev->data_offset;
bio->bi_bdev = tmp_dev->rdev->bdev;
if (unlikely(bio_sector >= end_sector ||
bio_sector < start_sector))
goto out_of_bounds;
if (unlikely(bio_end_sector(bio) > end_sector)) {
/* This bio crosses a device boundary, so we have to
* split it.
*/
split = bio_split(bio, end_sector - bio_sector,
GFP_NOIO, fs_bio_set);
bio_chain(split, bio);
} else {
split = bio;
}
tmp_dev = which_dev(mddev, bio_sector);
start_sector = tmp_dev->end_sector - tmp_dev->rdev->sectors;
end_sector = tmp_dev->end_sector;
data_offset = tmp_dev->rdev->data_offset;
if (unlikely(bio_sector >= end_sector ||
bio_sector < start_sector))
goto out_of_bounds;
if (unlikely(bio_end_sector(bio) > end_sector)) {
/* This bio crosses a device boundary, so we have to split it */
struct bio *split = bio_split(bio, end_sector - bio_sector,
GFP_NOIO, mddev->bio_set);
bio_chain(split, bio);
generic_make_request(bio);
bio = split;
}
split->bi_iter.bi_sector = split->bi_iter.bi_sector -
start_sector + data_offset;
if (unlikely((bio_op(split) == REQ_OP_DISCARD) &&
!blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
/* Just ignore it */
bio_endio(split);
} else {
if (mddev->gendisk)
trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
split, disk_devt(mddev->gendisk),
bio_sector);
mddev_check_writesame(mddev, split);
mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
bio->bi_bdev = tmp_dev->rdev->bdev;
bio->bi_iter.bi_sector = bio->bi_iter.bi_sector -
start_sector + data_offset;
if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) {
/* Just ignore it */
bio_endio(bio);
} else {
if (mddev->gendisk)
trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
bio, disk_devt(mddev->gendisk),
bio_sector);
mddev_check_writesame(mddev, bio);
mddev_check_write_zeroes(mddev, bio);
generic_make_request(bio);
}
return;
out_of_bounds:
......
This diff is collapsed.
......@@ -27,6 +27,7 @@ struct md_cluster_operations {
int (*gather_bitmaps)(struct md_rdev *rdev);
int (*lock_all_bitmaps)(struct mddev *mddev);
void (*unlock_all_bitmaps)(struct mddev *mddev);
void (*update_size)(struct mddev *mddev, sector_t old_dev_sectors);
};
#endif /* _MD_CLUSTER_H */
This diff is collapsed.
......@@ -122,6 +122,13 @@ struct md_rdev {
* sysfs entry */
struct badblocks badblocks;
struct {
short offset; /* Offset from superblock to start of PPL.
* Not used by external metadata. */
unsigned int size; /* Size in sectors of the PPL space */
sector_t sector; /* First sector of the PPL space */
} ppl;
};
enum flag_bits {
Faulty, /* device is known to have a fault */
......@@ -219,9 +226,6 @@ enum mddev_flags {
* it then */
MD_JOURNAL_CLEAN, /* A raid with journal is already clean */
MD_HAS_JOURNAL, /* The raid array has journal feature set */
MD_RELOAD_SB, /* Reload the superblock because another node
* updated it.
*/
MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
* already took resync lock, need to
* release the lock */
......@@ -229,6 +233,7 @@ enum mddev_flags {
* supported as calls to md_error() will
* never cause the array to become failed.
*/
MD_HAS_PPL, /* The raid array has PPL feature set */
};
enum mddev_sb_flags {
......@@ -404,7 +409,8 @@ struct mddev {
*/
unsigned int safemode_delay;
struct timer_list safemode_timer;
atomic_t writes_pending;
struct percpu_ref writes_pending;
int sync_checkers; /* # of threads checking writes_pending */
struct request_queue *queue; /* for plugging ... */
struct bitmap *bitmap; /* the bitmap for the device */
......@@ -540,6 +546,8 @@ struct md_personality
/* congested implements bdi.congested_fn().
* Will not be called while array is 'suspended' */
int (*congested)(struct mddev *mddev, int bits);
/* Changes the consistency policy of an active array. */
int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
};
struct md_sysfs_entry {
......@@ -641,6 +649,7 @@ extern void md_wakeup_thread(struct md_thread *thread);
extern void md_check_recovery(struct mddev *mddev);
extern void md_reap_sync_thread(struct mddev *mddev);
extern void md_write_start(struct mddev *mddev, struct bio *bi);
extern void md_write_inc(struct mddev *mddev, struct bio *bi);
extern void md_write_end(struct mddev *mddev);
extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
......@@ -716,4 +725,58 @@ static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio
!bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
mddev->queue->limits.max_write_zeroes_sectors = 0;
}
/* Maximum size of each resync request */
#define RESYNC_BLOCK_SIZE (64*1024)
#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
/* for managing resync I/O pages */
struct resync_pages {
unsigned idx; /* for get/put page from the pool */
void *raid_bio;
struct page *pages[RESYNC_PAGES];
};
static inline int resync_alloc_pages(struct resync_pages *rp,
gfp_t gfp_flags)
{
int i;
for (i = 0; i < RESYNC_PAGES; i++) {
rp->pages[i] = alloc_page(gfp_flags);
if (!rp->pages[i])
goto out_free;
}
return 0;
out_free:
while (--i >= 0)
put_page(rp->pages[i]);
return -ENOMEM;
}
static inline void resync_free_pages(struct resync_pages *rp)
{
int i;
for (i = 0; i < RESYNC_PAGES; i++)
put_page(rp->pages[i]);
}
static inline void resync_get_all_pages(struct resync_pages *rp)
{
int i;
for (i = 0; i < RESYNC_PAGES; i++)
get_page(rp->pages[i]);
}
static inline struct page *resync_fetch_page(struct resync_pages *rp,
unsigned idx)
{
if (WARN_ON_ONCE(idx >= RESYNC_PAGES))
return NULL;
return rp->pages[idx];
}
#endif /* _MD_MD_H */
......@@ -29,7 +29,8 @@
#define UNSUPPORTED_MDDEV_FLAGS \
((1L << MD_HAS_JOURNAL) | \
(1L << MD_JOURNAL_CLEAN) | \
(1L << MD_FAILFAST_SUPPORTED))
(1L << MD_FAILFAST_SUPPORTED) |\
(1L << MD_HAS_PPL))
static int raid0_congested(struct mddev *mddev, int bits)
{
......@@ -462,53 +463,54 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
{
struct strip_zone *zone;
struct md_rdev *tmp_dev;
struct bio *split;
sector_t bio_sector;
sector_t sector;
unsigned chunk_sects;
unsigned sectors;
if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
md_flush_request(mddev, bio);
return;
}
do {
sector_t bio_sector = bio->bi_iter.bi_sector;
sector_t sector = bio_sector;
unsigned chunk_sects = mddev->chunk_sectors;
bio_sector = bio->bi_iter.bi_sector;
sector = bio_sector;
chunk_sects = mddev->chunk_sectors;
unsigned sectors = chunk_sects -
(likely(is_power_of_2(chunk_sects))
? (sector & (chunk_sects-1))
: sector_div(sector, chunk_sects));
sectors = chunk_sects -
(likely(is_power_of_2(chunk_sects))
? (sector & (chunk_sects-1))
: sector_div(sector, chunk_sects));
/* Restore due to sector_div */
sector = bio_sector;
/* Restore due to sector_div */
sector = bio_sector;
if (sectors < bio_sectors(bio)) {
split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
bio_chain(split, bio);
} else {
split = bio;
}
if (sectors < bio_sectors(bio)) {
struct bio *split = bio_split(bio, sectors, GFP_NOIO, mddev->bio_set);
bio_chain(split, bio);
generic_make_request(bio);
bio = split;
}
zone = find_zone(mddev->private, &sector);
tmp_dev = map_sector(mddev, zone, sector, &sector);
split->bi_bdev = tmp_dev->bdev;
split->bi_iter.bi_sector = sector + zone->dev_start +
tmp_dev->data_offset;
if (unlikely((bio_op(split) == REQ_OP_DISCARD) &&
!blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
/* Just ignore it */
bio_endio(split);
} else {
if (mddev->gendisk)
trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
split, disk_devt(mddev->gendisk),
bio_sector);
mddev_check_writesame(mddev, split);
mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
zone = find_zone(mddev->private, &sector);
tmp_dev = map_sector(mddev, zone, sector, &sector);
bio->bi_bdev = tmp_dev->bdev;
bio->bi_iter.bi_sector = sector + zone->dev_start +
tmp_dev->data_offset;
if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) {
/* Just ignore it */
bio_endio(bio);
} else {
if (mddev->gendisk)
trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
bio, disk_devt(mddev->gendisk),
bio_sector);
mddev_check_writesame(mddev, bio);
mddev_check_write_zeroes(mddev, bio);
generic_make_request(bio);
}
}
static void raid0_status(struct seq_file *seq, struct mddev *mddev)
......
This diff is collapsed.
......@@ -84,6 +84,7 @@ struct r1conf {
*/
wait_queue_head_t wait_barrier;
spinlock_t resync_lock;
atomic_t nr_sync_pending;
atomic_t *nr_pending;
atomic_t *nr_waiting;
atomic_t *nr_queued;
......@@ -107,6 +108,8 @@ struct r1conf {
mempool_t *r1bio_pool;
mempool_t *r1buf_pool;
struct bio_set *bio_split;
/* temporary buffer to synchronous IO when attempting to repair
* a read error.
*/
......@@ -153,9 +156,13 @@ struct r1bio {
int read_disk;
struct list_head retry_list;
/* Next two are only valid when R1BIO_BehindIO is set */
struct bio_vec *behind_bvecs;
int behind_page_count;
/*
* When R1BIO_BehindIO is set, we store pages for write behind
* in behind_master_bio.
*/
struct bio *behind_master_bio;
/*
* if the IO is in WRITE direction, then multiple bios are used.
* We choose the number when they are allocated.
......
This diff is collapsed.
......@@ -82,6 +82,7 @@ struct r10conf {
mempool_t *r10bio_pool;
mempool_t *r10buf_pool;
struct page *tmppage;
struct bio_set *bio_split;
/* When taking over an array from a different personality, we store
* the new thread here until we fully activate the array.
......
This diff is collapsed.
#ifndef _RAID5_LOG_H
#define _RAID5_LOG_H
extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
extern void r5l_exit_log(struct r5conf *conf);
extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
extern void r5l_write_stripe_run(struct r5l_log *log);
extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
extern void r5l_stripe_write_finished(struct stripe_head *sh);
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern void r5l_quiesce(struct r5l_log *log, int state);
extern bool r5l_log_disk_error(struct r5conf *conf);
extern bool r5c_is_writeback(struct r5l_log *log);
extern int
r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s, int disks);
extern void
r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_release_extra_page(struct stripe_head *sh);
extern void r5c_use_extra_page(struct stripe_head *sh);
extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
extern void r5c_handle_cached_data_endio(struct r5conf *conf,
struct stripe_head *sh, int disks);
extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh);
extern void r5c_make_stripe_write_out(struct stripe_head *sh);
extern void r5c_flush_cache(struct r5conf *conf, int num);
extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
extern void r5c_check_cached_full_stripe(struct r5conf *conf);
extern struct md_sysfs_entry r5c_journal_mode;
extern void r5c_update_on_rdev_error(struct mddev *mddev);
extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
extern struct dma_async_tx_descriptor *
ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
struct dma_async_tx_descriptor *tx);
extern int ppl_init_log(struct r5conf *conf);
extern void ppl_exit_log(struct r5conf *conf);
extern int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
extern void ppl_write_stripe_run(struct r5conf *conf);
extern void ppl_stripe_write_finished(struct stripe_head *sh);
extern int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
static inline bool raid5_has_ppl(struct r5conf *conf)
{
return test_bit(MD_HAS_PPL, &conf->mddev->flags);
}
static inline int log_stripe(struct stripe_head *sh, struct stripe_head_state *s)
{
struct r5conf *conf = sh->raid_conf;
if (conf->log) {
if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
/* writing out phase */
if (s->waiting_extra_page)
return 0;
return r5l_write_stripe(conf->log, sh);
} else if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
/* caching phase */
return r5c_cache_data(conf->log, sh);
}
} else if (raid5_has_ppl(conf)) {
return ppl_write_stripe(conf, sh);
}
return -EAGAIN;
}
static inline void log_stripe_write_finished(struct stripe_head *sh)
{
struct r5conf *conf = sh->raid_conf;
if (conf->log)
r5l_stripe_write_finished(sh);
else if (raid5_has_ppl(conf))
ppl_stripe_write_finished(sh);
}
static inline void log_write_stripe_run(struct r5conf *conf)
{
if (conf->log)
r5l_write_stripe_run(conf->log);
else if (raid5_has_ppl(conf))
ppl_write_stripe_run(conf);
}
static inline void log_exit(struct r5conf *conf)
{
if (conf->log)
r5l_exit_log(conf);
else if (raid5_has_ppl(conf))
ppl_exit_log(conf);
}
static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev,
bool ppl)
{
if (journal_dev)
return r5l_init_log(conf, journal_dev);
else if (ppl)
return ppl_init_log(conf);
return 0;
}
static inline int log_modify(struct r5conf *conf, struct md_rdev *rdev, bool add)
{
if (raid5_has_ppl(conf))
return ppl_modify_log(conf, rdev, add);
return 0;
}
#endif
This diff is collapsed.
This diff is collapsed.
......@@ -224,10 +224,16 @@ struct stripe_head {
spinlock_t batch_lock; /* only header's lock is useful */
struct list_head batch_list; /* protected by head's batch lock*/
struct r5l_io_unit *log_io;
union {
struct r5l_io_unit *log_io;
struct ppl_io_unit *ppl_io;
};
struct list_head log_list;
sector_t log_start; /* first meta block on the journal */
struct list_head r5c; /* for r5c_cache->stripe_in_journal */
struct page *ppl_page; /* partial parity of this stripe */
/**
* struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
......@@ -272,7 +278,6 @@ struct stripe_head_state {
int dec_preread_active;
unsigned long ops_request;
struct bio_list return_bi;
struct md_rdev *blocked_rdev;
int handle_bad_blocks;
int log_failed;
......@@ -400,6 +405,7 @@ enum {
STRIPE_OP_BIODRAIN,
STRIPE_OP_RECONSTRUCT,
STRIPE_OP_CHECK,
STRIPE_OP_PARTIAL_PARITY,
};
/*
......@@ -481,50 +487,6 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
return NULL;
}
/*
* We maintain a biased count of active stripes in the bottom 16 bits of
* bi_phys_segments, and a count of processed stripes in the upper 16 bits
*/
static inline int raid5_bi_processed_stripes(struct bio *bio)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
return (atomic_read(segments) >> 16) & 0xffff;
}
static inline int raid5_dec_bi_active_stripes(struct bio *bio)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
return atomic_sub_return(1, segments) & 0xffff;
}
static inline void raid5_inc_bi_active_stripes(struct bio *bio)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
atomic_inc(segments);
}
static inline void raid5_set_bi_processed_stripes(struct bio *bio,
unsigned int cnt)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
int old, new;
do {
old = atomic_read(segments);
new = (old & 0xffff) | (cnt << 16);
} while (atomic_cmpxchg(segments, old, new) != old);
}
static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
atomic_set(segments, cnt);
}
/* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
* This is because we sometimes take all the spinlocks
* and creating that much locking depth can cause
......@@ -542,6 +504,7 @@ struct r5worker {
struct r5worker_group {
struct list_head handle_list;
struct list_head loprio_list;
struct r5conf *conf;
struct r5worker *workers;
int stripes_cnt;
......@@ -571,6 +534,14 @@ enum r5_cache_state {
*/
};
#define PENDING_IO_MAX 512
#define PENDING_IO_ONE_FLUSH 128
struct r5pending_data {
struct list_head sibling;
sector_t sector; /* stripe sector */
struct bio_list bios;
};
struct r5conf {
struct hlist_head *stripe_hashtbl;
/* only protect corresponding hash list and inactive_list */
......@@ -608,10 +579,12 @@ struct r5conf {
*/
struct list_head handle_list; /* stripes needing handling */
struct list_head loprio_list; /* low priority stripes */
struct list_head hold_list; /* preread ready stripes */
struct list_head delayed_list; /* stripes that have plugged requests */
struct list_head bitmap_list; /* stripes delaying awaiting bitmap update */
struct bio *retry_read_aligned; /* currently retrying aligned bios */
unsigned int retry_read_offset; /* sector offset into retry_read_aligned */
struct bio *retry_read_aligned_list; /* aligned bios retry list */
atomic_t preread_active_stripes; /* stripes with scheduled io */
atomic_t active_aligned_reads;
......@@ -621,9 +594,6 @@ struct r5conf {
int skip_copy; /* Don't copy data from bio to stripe cache */
struct list_head *last_hold; /* detect hold_list promotions */
/* bios to have bi_end_io called after metadata is synced */
struct bio_list return_bi;
atomic_t reshape_stripes; /* stripes with pending writes for reshape */
/* unfortunately we need two cache names as we temporarily have
* two caches.
......@@ -676,6 +646,7 @@ struct r5conf {
int pool_size; /* number of disks in stripeheads in pool */
spinlock_t device_lock;
struct disk_info *disks;
struct bio_set *bio_split;
/* When taking over an array from a different personality, we store
* the new thread here until we fully activate the array.
......@@ -686,10 +657,15 @@ struct r5conf {
int group_cnt;
int worker_cnt_per_group;
struct r5l_log *log;
void *log_private;
struct bio_list pending_bios;
spinlock_t pending_bios_lock;
bool batch_bio_dispatch;
struct r5pending_data *pending_data;
struct list_head free_list;
struct list_head pending_list;
int pending_data_cnt;
struct r5pending_data *next_pending_data;
};
......@@ -765,34 +741,4 @@ extern struct stripe_head *
raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
int previous, int noblock, int noquiesce);
extern int raid5_calc_degraded(struct r5conf *conf);
extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
extern void r5l_exit_log(struct r5l_log *log);
extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
extern void r5l_write_stripe_run(struct r5l_log *log);
extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
extern void r5l_stripe_write_finished(struct stripe_head *sh);
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern void r5l_quiesce(struct r5l_log *log, int state);
extern bool r5l_log_disk_error(struct r5conf *conf);
extern bool r5c_is_writeback(struct r5l_log *log);
extern int
r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s, int disks);
extern void
r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_release_extra_page(struct stripe_head *sh);
extern void r5c_use_extra_page(struct stripe_head *sh);
extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
extern void r5c_handle_cached_data_endio(struct r5conf *conf,
struct stripe_head *sh, int disks, struct bio_list *return_bi);
extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_make_stripe_write_out(struct stripe_head *sh);
extern void r5c_flush_cache(struct r5conf *conf, int num);
extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
extern void r5c_check_cached_full_stripe(struct r5conf *conf);
extern struct md_sysfs_entry r5c_journal_mode;
extern void r5c_update_on_rdev_error(struct mddev *mddev);
extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
#endif
......@@ -183,7 +183,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
static inline unsigned bio_segments(struct bio *bio)
{
unsigned segs = 0;
struct bio_vec bv;
......@@ -205,17 +205,12 @@ static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
break;
}
__bio_for_each_segment(bv, bio, iter, *bvec)
bio_for_each_segment(bv, bio, iter)
segs++;
return segs;
}
static inline unsigned bio_segments(struct bio *bio)
{
return __bio_segments(bio, &bio->bi_iter);
}
/*
* get a reference to a bio, so it won't disappear. the intended use is
* something like:
......@@ -389,8 +384,6 @@ extern void bio_put(struct bio *);
extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
extern struct bio *bio_clone_bioset_partial(struct bio *, gfp_t,
struct bio_set *, int, int);
extern struct bio_set *fs_bio_set;
......
......@@ -99,6 +99,7 @@ int __must_check percpu_ref_init(struct percpu_ref *ref,
void percpu_ref_exit(struct percpu_ref *ref);
void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
percpu_ref_func_t *confirm_switch);
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
percpu_ref_func_t *confirm_kill);
......
This diff is collapsed.
......@@ -260,6 +260,22 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
/**
* percpu_ref_switch_to_atomic_sync - switch a percpu_ref to atomic mode
* @ref: percpu_ref to switch to atomic mode
*
* Schedule switching the ref to atomic mode, and wait for the
* switch to complete. Caller must ensure that no other thread
* will switch back to percpu mode.
*/
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
{
percpu_ref_switch_to_atomic(ref, NULL);
wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
/**
* percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode
......@@ -290,6 +306,7 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
/**
* percpu_ref_kill_and_confirm - drop the initial ref and schedule confirmation
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment