Commit 474095e4 authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'md/4.1' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "More updates that usual this time.  A few have performance impacts
  which hould mostly be positive, but RAID5 (in particular) can be very
  work-load ensitive...  We'll have to wait and see.

  Highlights:

   - "experimental" code for managing md/raid1 across a cluster using
     DLM.  Code is not ready for general use and triggers a WARNING if
     used.  However it is looking good and mostly done and having in
     mainline will help co-ordinate development.

   - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
     handle a full (chunk wide) stripe as a single unit.

   - RAID6 can now perform read-modify-write cycles which should help
     performance on larger arrays: 6 or more devices.

   - RAID5/6 stripe cache now grows and shrinks dynamically.  The value
     set is used as a minimum.

   - Resync is now allowed to go a little faster than the 'minimum' when
     there is competing IO.  How much faster depends on the speed of the
     devices, so the effective minimum should scale with device speed to
     some extent"

* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
  md/raid5: don't do chunk aligned read on degraded array.
  md/raid5: allow the stripe_cache to grow and shrink.
  md/raid5: change ->inactive_blocked to a bit-flag.
  md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
  md/raid5: pass gfp_t arg to grow_one_stripe()
  md/raid5: introduce configuration option rmw_level
  md/raid5: activate raid6 rmw feature
  md/raid6 algorithms: xor_syndrome() for SSE2
  md/raid6 algorithms: xor_syndrome() for generic int
  md/raid6 algorithms: improve test program
  md/raid6 algorithms: delta syndrome functions
  raid5: handle expansion/resync case with stripe batching
  raid5: handle io error of batch list
  RAID5: batch adjacent full stripe write
  raid5: track overwrite disk count
  raid5: add a new flag to track if a stripe can be batched
  raid5: use flex_array for scribble data
  md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
  md: allow resync to go faster when there is competing IO.
  md: remove 'go_faster' option from ->sync_request()
  ...
parents d56a669c 9ffc8f7c
The cluster MD is a shared-device RAID for a cluster.
1. On-disk format
Separate write-intent bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:
 0                    4k                     8k                    12k
 -------------------------------------------------------------------
 | idle                | md super            | bm super [0] + bits  |
 | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]    |
 | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits   |
 | bm bits [3, contd]  |                     |                      |
During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will
- set the appropriate bit (if not already set)
- commit the write to all mirrors
- schedule the bit to be cleared after a timeout.
Reads are just handled normally. It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.
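The following self-contained user-space sketch only restates the three-step
write discipline above; the names are illustrative and are not the driver's
code:

    #include <stdbool.h>
    #include <stdio.h>

    #define CHUNKS 16

    static bool bit[CHUNKS];                  /* this node's write-intent bits */

    static void write_all_mirrors(int chunk)  { printf("write chunk %d to all mirrors\n", chunk); }
    static void schedule_bit_clear(int chunk) { printf("schedule clear of bit %d\n", chunk); }

    static void clustered_write(int chunk)
    {
        if (!bit[chunk])
            bit[chunk] = true;                /* 1. set the appropriate bit       */
        write_all_mirrors(chunk);             /* 2. commit the write              */
        schedule_bit_clear(chunk);            /* 3. clear the bit after a timeout */
    }

    int main(void)
    {
        clustered_write(3);
        return 0;
    }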
2. DLM Locks for management
There are two locks for managing the device:
2.1 Bitmap lock resource (bm_lockres)
The bm_lockres protects the individual node bitmaps. They are named in the
form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
joins the cluster, it acquires the lock in PW mode and holds it for as
long as the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem. Since
DLM starts node count from one and bitmap slots start from zero, one is
subtracted from the DLM slot number to arrive at the bitmap slot number.
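To illustrate the naming and the off-by-one between DLM slots and bitmap
slots, here is a small stand-alone sketch (identifiers are illustrative,
not the driver's):

    #include <stdio.h>

    int main(void)
    {
        int dlm_slot = 2;                 /* DLM slots count from 1    */
        int bitmap_slot = dlm_slot - 1;   /* bitmap slots count from 0 */
        char lockres[16];

        /* lock resource name as described above: bitmap001, bitmap002, ... */
        snprintf(lockres, sizeof(lockres), "bitmap%03d", dlm_slot);
        printf("DLM slot %d -> lock resource %s, bitmap slot %d\n",
               dlm_slot, lockres, bitmap_slot);
        return 0;
    }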
3. Communication
Each node has to communicate with the other nodes when starting or ending
a resync, and for metadata superblock updates.
3.1 Message Types
There are three types of messages which are passed:
3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
updated, and the node must re-read the md superblock. This is performed
synchronously.
3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
so that each node may suspend or resume the region.
3.1.3 NEWDISK: informs other nodes that a device is being added to the
array, along with the uuid and slot number with which it was added (see
section 5).
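Purely as an illustration of what such a message has to carry, a
hypothetical layout could look as follows (field names are illustrative,
not the driver's struct):

    #include <stdint.h>

    enum msg_type {
        METADATA_UPDATED,            /* re-read the md superblock         */
        RESYNC,                      /* suspend/resume the given region   */
        NEWDISK,                     /* a device is being added           */
    };

    struct example_cluster_msg {
        uint32_t type;               /* one of enum msg_type              */
        uint32_t slot;               /* sender's slot number              */
        uint64_t low, high;          /* affected sector range for RESYNC  */
        uint8_t  uuid[16];           /* device identifier for NEWDISK     */
    };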
3.2 Communication mechanism
The DLM LVB is used to communicate between the nodes of the cluster. There
are three resources used for the purpose:
3.2.1 Token: The resource which protects the entire communication
system. The node having the token resource is allowed to
communicate.
3.2.2 Message: The lock resource which carries the data to
communicate.
3.2.3 Ack: The resource whose acquisition means that the message has been
acknowledged by all nodes in the cluster. The BAST of the resource is used
to inform the receiving node that a node wants to communicate.
The algorithm is:

 1. receive status

   sender                  receiver                 receiver
   ACK:CR                  ACK:CR                   ACK:CR

 2. sender get EX of TOKEN
    sender get EX of MESSAGE

   sender                  receiver                 receiver
   TOKEN:EX                ACK:CR                   ACK:CR
   MESSAGE:EX
   ACK:CR

    Sender checks that it still needs to send a message. Messages received
    or other events that happened while waiting for the TOKEN may have made
    this message inappropriate or redundant.

 3. sender write LVB.
    sender down-convert MESSAGE from EX to CR
    sender try to get EX of ACK
    [ wait until all receivers have *processed* the MESSAGE ]

    [ triggered by bast of ACK ]
    receiver get CR of MESSAGE
    receiver read LVB
    receiver processes the message
    [ wait finish ]
    receiver release ACK

   sender                  receiver                 receiver
   TOKEN:EX                MESSAGE:CR               MESSAGE:CR
   MESSAGE:CR
   ACK:EX

 4. triggered by grant of EX on ACK (indicating all receivers have processed
    the message)

    sender down-convert ACK from EX to CR
    sender release MESSAGE
    sender release TOKEN
                            receiver upconvert to EX of MESSAGE
                            receiver get CR of ACK
                            receiver release MESSAGE

   sender                  receiver                 receiver
   ACK:CR                  ACK:CR                   ACK:CR
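A rough user-space model of the sender's side of this sequence follows; the
lock helpers below are stand-ins that only print the transitions (the real
driver uses the kernel DLM API):

    #include <stdio.h>

    enum mode { CR, EX };   /* subset of DLM lock modes used here */

    static void lock(const char *res, enum mode m) { printf("lock   %-7s %s\n", res, m == EX ? "EX" : "CR"); }
    static void unlock(const char *res)            { printf("unlock %s\n", res); }
    static void write_lvb(const char *res)         { printf("write LVB of %s\n", res); }

    static void send_message(void)
    {
        lock("TOKEN", EX);      /* only the TOKEN holder may communicate         */
        lock("MESSAGE", EX);
        /* ...check the message is still needed... */
        write_lvb("MESSAGE");   /* the payload travels in the LVB                */
        lock("MESSAGE", CR);    /* down-convert so receivers can read the LVB    */
        lock("ACK", EX);        /* granted only once every receiver released ACK */
        lock("ACK", CR);        /* down-convert ACK back to CR                   */
        unlock("MESSAGE");
        unlock("TOKEN");
    }

    int main(void) { send_message(); return 0; }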
4. Handling Failures
4.1 Node Failure
When a node fails, the DLM informs the cluster with the slot number of the
failed node. The node
starts a cluster recovery thread. The cluster recovery thread:
- acquires the bitmap<number> lock of the failed node
- opens the bitmap
- reads the bitmap of the failed node
- copies the set bits to the local node's bitmap
- cleans the bitmap of the failed node
- releases bitmap<number> lock of the failed node
- initiates resync of the bitmap on the current node
The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed, it needs to tell other nodes
of the areas which are suspended. Before a resync starts, the node
sends out RESYNC_START with the (lo,hi) range of the area which needs
to be suspended. Each node maintains a suspend_list, which contains
the list of ranges which are currently suspended. On receiving
RESYNC_START, the node adds the range to the suspend_list. Similarly,
when the node performing the resync finishes, it sends RESYNC_FINISHED
to the other nodes, and the other nodes remove the corresponding entry
from the suspend_list.
A helper function, should_suspend(), can be used to check whether a
particular I/O range should be suspended or not.
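A minimal user-space sketch of such a range check, assuming a simple
singly linked suspend_list (the names below are illustrative, not the
driver's):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned long long sector_t;

    struct suspend_info {
        sector_t lo, hi;              /* suspended range, inclusive */
        struct suspend_info *next;
    };

    static struct suspend_info *suspend_list;

    static void add_suspend(sector_t lo, sector_t hi)
    {
        struct suspend_info *s = malloc(sizeof(*s));
        s->lo = lo; s->hi = hi; s->next = suspend_list;
        suspend_list = s;
    }

    /* Returns true if [lo, hi) overlaps any suspended range. */
    static bool should_suspend(sector_t lo, sector_t hi)
    {
        for (struct suspend_info *s = suspend_list; s; s = s->next)
            if (lo <= s->hi && hi > s->lo)
                return true;
        return false;
    }

    int main(void)
    {
        add_suspend(1000, 2000);      /* e.g. received via RESYNC_START */
        printf("%d %d\n", should_suspend(500, 900), should_suspend(1500, 1600));
        return 0;
    }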
4.2 Device Failure
Device failures are handled and communicated with the metadata update
routine.
5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
was found:
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
disc.number set to slot number)
ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
as SpareLocal
9. If node 1 cannot get the no-new-devs lock, it fails the operation and
sends METADATA_UPDATED.
10. Other nodes learn whether the disk was added or not from the following
METADATA_UPDATED message.
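For step 1 above, a rough user-space sketch of the ioctl is shown below.
It assumes headers from a kernel that carries this series (for
MD_DISK_CLUSTER_ADD); the device paths are examples only and error
handling is minimal:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <linux/major.h>
    #include <linux/raid/md_u.h>      /* ADD_NEW_DISK, mdu_disk_info_t */
    #include <linux/raid/md_p.h>      /* MD_DISK_CLUSTER_ADD           */

    int main(void)
    {
        struct stat st;
        mdu_disk_info_t info = { 0 };
        int md = open("/dev/md0", O_RDWR);        /* example array      */

        if (md < 0 || stat("/dev/sdb", &st) < 0)  /* example new device */
            return 1;

        info.major = major(st.st_rdev);
        info.minor = minor(st.st_rdev);
        info.state = 1 << MD_DISK_CLUSTER_ADD;    /* request a cluster-wide add */

        if (ioctl(md, ADD_NEW_DISK, &info) < 0)
            perror("ADD_NEW_DISK");
        return 0;
    }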
@@ -124,6 +124,7 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
 {
     void **srcs;
     int i;
+    int start = -1, stop = disks - 3;

     if (submit->scribble)
         srcs = submit->scribble;
@@ -134,10 +135,21 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
         if (blocks[i] == NULL) {
             BUG_ON(i > disks - 3); /* P or Q can't be zero */
             srcs[i] = (void*)raid6_empty_zero_page;
-        } else
+        } else {
             srcs[i] = page_address(blocks[i]) + offset;
+            if (i < disks - 2) {
+                stop = i;
+                if (start == -1)
+                    start = i;
+            }
+        }
     }
-    raid6_call.gen_syndrome(disks, len, srcs);
+    if (submit->flags & ASYNC_TX_PQ_XOR_DST) {
+        BUG_ON(!raid6_call.xor_syndrome);
+        if (start >= 0)
+            raid6_call.xor_syndrome(disks, start, stop, len, srcs);
+    } else
+        raid6_call.gen_syndrome(disks, len, srcs);
     async_tx_sync_epilog(submit);
 }
@@ -178,7 +190,8 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
     if (device)
         unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOIO);

-    if (unmap &&
+    /* XORing P/Q is only implemented in software */
+    if (unmap && !(submit->flags & ASYNC_TX_PQ_XOR_DST) &&
         (src_cnt <= dma_maxpq(device, 0) ||
          dma_maxpq(device, DMA_PREP_CONTINUE) > 0) &&
         is_dma_pq_aligned(device, offset, 0, len)) {
...
@@ -175,6 +175,22 @@ config MD_FAULTY

       In unsure, say N.

+config MD_CLUSTER
+    tristate "Cluster Support for MD (EXPERIMENTAL)"
+    depends on BLK_DEV_MD
+    depends on DLM
+    default n
+    ---help---
+    Clustering support for MD devices. This enables locking and
+    synchronization across multiple systems on the cluster, so all
+    nodes in the cluster can access the MD devices simultaneously.
+
+    This brings the redundancy (and uptime) of RAID levels across the
+    nodes of the cluster.
+
+    If unsure, say N.
+
 source "drivers/md/bcache/Kconfig"

 config BLK_DEV_DM_BUILTIN
...
@@ -30,6 +30,7 @@ obj-$(CONFIG_MD_RAID10)   += raid10.o
 obj-$(CONFIG_MD_RAID456)   += raid456.o
 obj-$(CONFIG_MD_MULTIPATH) += multipath.o
 obj-$(CONFIG_MD_FAULTY)    += faulty.o
+obj-$(CONFIG_MD_CLUSTER)   += md-cluster.o
 obj-$(CONFIG_BCACHE)       += bcache/
 obj-$(CONFIG_BLK_DEV_MD)   += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)   += dm-mod.o
...
@@ -130,8 +130,9 @@ typedef struct bitmap_super_s {
     __le32 write_behind; /* 60  number of outstanding write-behind writes */
     __le32 sectors_reserved; /* 64 number of 512-byte sectors that are
                               * reserved for the bitmap. */
-    __u8  pad[256 - 68]; /* set to zero */
+    __le32 nodes;        /* 68 the maximum number of nodes in cluster. */
+    __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
+    __u8  pad[256 - 136]; /* set to zero */
 } bitmap_super_t;

 /* notes:
@@ -226,12 +227,13 @@ struct bitmap {
     wait_queue_head_t behind_wait;

     struct kernfs_node *sysfs_can_clear;
+    int cluster_slot;       /* Slot offset for clustered env */
 };

 /* the bitmap API */

 /* these are used only by md/bitmap */
-int  bitmap_create(struct mddev *mddev);
+struct bitmap *bitmap_create(struct mddev *mddev, int slot);
 int bitmap_load(struct mddev *mddev);
 void bitmap_flush(struct mddev *mddev);
 void bitmap_destroy(struct mddev *mddev);
@@ -260,6 +262,8 @@ void bitmap_daemon_work(struct mddev *mddev);

 int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
                   int chunksize, int init);
+int bitmap_copy_from_slot(struct mddev *mddev, int slot,
+                          sector_t *lo, sector_t *hi, bool clear_bits);

 #endif

 #endif
#ifndef _MD_CLUSTER_H
#define _MD_CLUSTER_H
#include "md.h"
struct mddev;
struct md_rdev;
struct md_cluster_operations {
int (*join)(struct mddev *mddev, int nodes);
int (*leave)(struct mddev *mddev);
int (*slot_number)(struct mddev *mddev);
void (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi);
int (*resync_start)(struct mddev *mddev, sector_t lo, sector_t hi);
void (*resync_finish)(struct mddev *mddev);
int (*metadata_update_start)(struct mddev *mddev);
int (*metadata_update_finish)(struct mddev *mddev);
int (*metadata_update_cancel)(struct mddev *mddev);
int (*area_resyncing)(struct mddev *mddev, sector_t lo, sector_t hi);
int (*add_new_disk_start)(struct mddev *mddev, struct md_rdev *rdev);
int (*add_new_disk_finish)(struct mddev *mddev);
int (*new_disk_ack)(struct mddev *mddev, bool ack);
int (*remove_disk)(struct mddev *mddev, struct md_rdev *rdev);
int (*gather_bitmaps)(struct md_rdev *rdev);
};
#endif /* _MD_CLUSTER_H */
@@ -23,6 +23,7 @@
 #include <linux/timer.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
+#include "md-cluster.h"

 #define MaxSector (~(sector_t)0)

@@ -170,6 +171,10 @@ enum flag_bits {
                  * a want_replacement device with same
                  * raid_disk number.
                  */
+    Candidate,   /* For clustered environments only:
+                  * This device is seen locally but not
+                  * by the whole cluster
+                  */
 };

 #define BB_LEN_MASK (0x00000000000001FFULL)
@@ -202,6 +207,8 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
                int is_new);
 extern void md_ack_all_badblocks(struct badblocks *bb);

+struct md_cluster_info;
+
 struct mddev {
     void                *private;
     struct md_personality   *pers;
@@ -430,6 +437,8 @@ struct mddev {
         unsigned long   daemon_sleep; /* how many jiffies between updates? */
         unsigned long   max_write_behind; /* write-behind mode */
         int             external;
+        int             nodes; /* Maximum number of nodes in the cluster */
+        char            cluster_name[64]; /* Name of the cluster */
     } bitmap_info;

     atomic_t            max_corr_read_errors; /* max read retries */
@@ -448,6 +457,7 @@ struct mddev {
     struct work_struct flush_work;
     struct work_struct event_work;  /* used by dm to report failure event */
     void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
+    struct md_cluster_info      *cluster_info;
 };

 static inline int __must_check mddev_lock(struct mddev *mddev)
@@ -496,7 +506,7 @@ struct md_personality
     int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
     int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
     int (*spare_active) (struct mddev *mddev);
-    sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster);
+    sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped);
     int (*resize) (struct mddev *mddev, sector_t sectors);
     sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks);
     int (*check_reshape) (struct mddev *mddev);
@@ -608,6 +618,11 @@ static inline void safe_put_page(struct page *p)

 extern int register_md_personality(struct md_personality *p);
 extern int unregister_md_personality(struct md_personality *p);
+extern int register_md_cluster_operations(struct md_cluster_operations *ops,
+        struct module *module);
+extern int unregister_md_cluster_operations(void);
+extern int md_setup_cluster(struct mddev *mddev, int nodes);
+extern void md_cluster_stop(struct mddev *mddev);
 extern struct md_thread *md_register_thread(
     void (*run)(struct md_thread *thread),
     struct mddev *mddev,
@@ -654,6 +669,10 @@ extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
                    struct mddev *mddev);
 extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule);
+extern void md_reload_sb(struct mddev *mddev);
+extern void md_update_sb(struct mddev *mddev, int force);
+extern void md_kick_rdev_from_array(struct md_rdev * rdev);
+struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);

 static inline int mddev_check_plugged(struct mddev *mddev)
 {
     return !!blk_check_plugged(md_unplug, mddev,
@@ -669,4 +688,9 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
     }
 }

+extern struct md_cluster_operations *md_cluster_ops;
+static inline int mddev_is_clustered(struct mddev *mddev)
+{
+    return mddev->cluster_info && mddev->bitmap_info.nodes > 1;
+}
 #endif /* _MD_MD_H */
@@ -271,14 +271,16 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
         goto abort;
     }

-    blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
-    blk_queue_io_opt(mddev->queue,
-             (mddev->chunk_sectors << 9) * mddev->raid_disks);
+    if (mddev->queue) {
+        blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
+        blk_queue_io_opt(mddev->queue,
+                 (mddev->chunk_sectors << 9) * mddev->raid_disks);

-    if (!discard_supported)
-        queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
-    else
-        queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+        if (!discard_supported)
+            queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+        else
+            queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
+    }

     pr_debug("md/raid0:%s: done.\n", mdname(mddev));
     *private_conf = conf;
@@ -429,9 +431,12 @@ static int raid0_run(struct mddev *mddev)
     }
     if (md_check_no_bitmap(mddev))
         return -EINVAL;
-    blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
-    blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-    blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+
+    if (mddev->queue) {
+        blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+        blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
+        blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+    }

     /* if private is not null, we are here after takeover */
     if (mddev->private == NULL) {
@@ -448,16 +453,17 @@ static int raid0_run(struct mddev *mddev)
     printk(KERN_INFO "md/raid0:%s: md_size is %llu sectors.\n",
            mdname(mddev),
            (unsigned long long)mddev->array_sectors);
-    /* calculate the max read-ahead size.
-     * For read-ahead of large files to be effective, we need to
-     * readahead at least twice a whole stripe. i.e. number of devices
-     * multiplied by chunk size times 2.
-     * If an individual device has an ra_pages greater than the
-     * chunk size, then we will not drive that device as hard as it
-     * wants. We consider this a configuration error: a larger
-     * chunksize should be used in that case.
-     */
-    {
+
+    if (mddev->queue) {
+        /* calculate the max read-ahead size.
+         * For read-ahead of large files to be effective, we need to
+         * readahead at least twice a whole stripe. i.e. number of devices
+         * multiplied by chunk size times 2.
+         * If an individual device has an ra_pages greater than the
+         * chunk size, then we will not drive that device as hard as it
+         * wants. We consider this a configuration error: a larger
+         * chunksize should be used in that case.
+         */
         int stripe = mddev->raid_disks *
             (mddev->chunk_sectors << 9) / PAGE_SIZE;
         if (mddev->queue->backing_dev_info.ra_pages < 2* stripe)
...
@@ -539,7 +539,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
     has_nonrot_disk = 0;
     choose_next_idle = 0;

-    choose_first = (conf->mddev->recovery_cp < this_sector + sectors);
+    if ((conf->mddev->recovery_cp < this_sector + sectors) ||
+        (mddev_is_clustered(conf->mddev) &&
+        md_cluster_ops->area_resyncing(conf->mddev, this_sector,
+            this_sector + sectors)))
+        choose_first = 1;
+    else
+        choose_first = 0;

     for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
         sector_t dist;
@@ -1102,8 +1108,10 @@ static void make_request(struct mddev *mddev, struct bio * bio)
     md_write_start(mddev, bio); /* wait on superblock update early */

     if (bio_data_dir(bio) == WRITE &&
-        bio_end_sector(bio) > mddev->suspend_lo &&
-        bio->bi_iter.bi_sector < mddev->suspend_hi) {
+        ((bio_end_sector(bio) > mddev->suspend_lo &&
+        bio->bi_iter.bi_sector < mddev->suspend_hi) ||
+        (mddev_is_clustered(mddev) &&
+         md_cluster_ops->area_resyncing(mddev, bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
         /* As the suspend_* range is controlled by
          * userspace, we want an interruptible
          * wait.
@@ -1114,7 +1122,10 @@ static void make_request(struct mddev *mddev, struct bio * bio)
             prepare_to_wait(&conf->wait_barrier,
                     &w, TASK_INTERRUPTIBLE);
             if (bio_end_sector(bio) <= mddev->suspend_lo ||
-                bio->bi_iter.bi_sector >= mddev->suspend_hi)
+                bio->bi_iter.bi_sector >= mddev->suspend_hi ||
+                (mddev_is_clustered(mddev) &&
+                 !md_cluster_ops->area_resyncing(mddev,
+                     bio->bi_iter.bi_sector, bio_end_sector(bio))))
                 break;
             schedule();
         }
@@ -1561,6 +1572,7 @@ static int raid1_spare_active(struct mddev *mddev)
         struct md_rdev *rdev = conf->mirrors[i].rdev;
         struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
         if (repl
+            && !test_bit(Candidate, &repl->flags)
             && repl->recovery_offset == MaxSector
             && !test_bit(Faulty, &repl->flags)
             && !test_and_set_bit(In_sync, &repl->flags)) {
@@ -2468,7 +2480,7 @@ static int init_resync(struct r1conf *conf)
  * that can be installed to exclude normal IO requests.
  */

-static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster)
+static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped)
 {
     struct r1conf *conf = mddev->private;
     struct r1bio *r1_bio;
@@ -2521,13 +2533,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
         *skipped = 1;
         return sync_blocks;
     }
-    /*
-     * If there is non-resync activity waiting for a turn,
-     * and resync is going fast enough,
-     * then let it though before starting on this new sync request.
-     */
-    if (!go_faster && conf->nr_waiting)
-        msleep_interruptible(1000);

     bitmap_cond_end_sync(mddev->bitmap, sector_nr);
     r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
...
@@ -2889,7 +2889,7 @@ static int init_resync(struct r10conf *conf)
  */

 static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
-                 int *skipped, int go_faster)
+                 int *skipped)
 {
     struct r10conf *conf = mddev->private;
     struct r10bio *r10_bio;
@@ -2994,12 +2994,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
     if (conf->geo.near_copies < conf->geo.raid_disks &&
         max_sector > (sector_nr | chunk_mask))
         max_sector = (sector_nr | chunk_mask) + 1;
-    /*
-     * If there is non-resync activity waiting for us then
-     * put in a delay to throttle resync.
-     */
-    if (!go_faster && conf->nr_waiting)
-        msleep_interruptible(1000);

     /* Again, very different code for resync and recovery.
      * Both must result in an r10bio with a list of bios that
...
@@ -210,11 +210,19 @@ struct stripe_head {
     atomic_t        count;        /* nr of active thread/requests */
     int             bm_seq;       /* sequence number for bitmap flushes */
     int             disks;        /* disks in stripe */
+    int             overwrite_disks; /* total overwrite disks in stripe,
+                                      * this is only checked when stripe
+                                      * has STRIPE_BATCH_READY
+                                      */
     enum check_states       check_state;
     enum reconstruct_states reconstruct_state;
     spinlock_t      stripe_lock;
     int             cpu;
     struct r5worker_group   *group;
+
+    struct stripe_head      *batch_head; /* protected by stripe lock */
+    spinlock_t              batch_lock; /* only header's lock is useful */
+    struct list_head        batch_list; /* protected by head's batch lock*/
     /**
      * struct stripe_operations
      * @target - STRIPE_OP_COMPUTE_BLK target
@@ -327,8 +335,15 @@ enum {
     STRIPE_ON_UNPLUG_LIST,
     STRIPE_DISCARD,
     STRIPE_ON_RELEASE_LIST,
+    STRIPE_BATCH_READY,
+    STRIPE_BATCH_ERR,
 };

+#define STRIPE_EXPAND_SYNC_FLAG \
+    ((1 << STRIPE_EXPAND_SOURCE) |\
+    (1 << STRIPE_EXPAND_READY) |\
+    (1 << STRIPE_EXPANDING) |\
+    (1 << STRIPE_SYNC_REQUESTED))
 /*
  * Operation request flags
  */
@@ -340,6 +355,24 @@ enum {
     STRIPE_OP_RECONSTRUCT,
     STRIPE_OP_CHECK,
 };
+
+/*
+ * RAID parity calculation preferences
+ */
+enum {
+    PARITY_DISABLE_RMW = 0,
+    PARITY_ENABLE_RMW,
+    PARITY_PREFER_RMW,
+};
+
+/*
+ * Pages requested from set_syndrome_sources()
+ */
+enum {
+    SYNDROME_SRC_ALL,
+    SYNDROME_SRC_WANT_DRAIN,
+    SYNDROME_SRC_WRITTEN,
+};
 /*
  * Plugging:
  *
@@ -396,10 +429,11 @@ struct r5conf {
     spinlock_t      hash_locks[NR_STRIPE_HASH_LOCKS];
     struct mddev    *mddev;
     int             chunk_sectors;
-    int             level, algorithm;
+    int             level, algorithm, rmw_level;
     int             max_degraded;
     int             raid_disks;
     int             max_nr_stripes;
+    int             min_nr_stripes;

     /* reshape_progress is the leading edge of a 'reshape'
      * It has value MaxSector when no reshape is happening
@@ -458,15 +492,11 @@ struct r5conf {
     /* per cpu variables */
     struct raid5_percpu {
         struct page *spare_page; /* Used when checking P/Q in raid6 */
-        void        *scribble;   /* space for constructing buffer
+        struct flex_array *scribble; /* space for constructing buffer
                                   * lists and performing address
                                   * conversions
                                   */
     } __percpu *percpu;
-    size_t          scribble_len; /* size of scribble region must be
-                                   * associated with conf to handle
-                                   * cpu hotplug while reshaping
-                                   */
 #ifdef CONFIG_HOTPLUG_CPU
     struct notifier_block   cpu_notify;
 #endif
@@ -480,9 +510,19 @@ struct r5conf {
     struct llist_head       released_stripes;
     wait_queue_head_t       wait_for_stripe;
     wait_queue_head_t       wait_for_overlap;
-    int                     inactive_blocked;   /* release of inactive stripes blocked,
-                                                 * waiting for 25% to be free
-                                                 */
+    unsigned long           cache_state;
+#define R5_INACTIVE_BLOCKED 1   /* release of inactive stripes blocked,
+                                 * waiting for 25% to be free
+                                 */
+#define R5_ALLOC_MORE       2   /* It might help to allocate another
+                                 * stripe.
+                                 */
+#define R5_DID_ALLOC        4   /* A stripe was allocated, don't allocate
+                                 * more until at least one has been
+                                 * released.  This avoids flooding
+                                 * the cache.
+                                 */
+    struct shrinker         shrinker;
     int                     pool_size; /* number of disks in stripeheads in pool */
     spinlock_t              device_lock;
     struct disk_info        *disks;
@@ -497,6 +537,7 @@ struct r5conf {
     int                     worker_cnt_per_group;
 };

 /*
  * Our supported algorithms
  */
...
@@ -60,12 +60,15 @@ struct dma_chan_ref {
  * dependency chain
  * @ASYNC_TX_FENCE: specify that the next operation in the dependency
  * chain uses this operation's result as an input
+ * @ASYNC_TX_PQ_XOR_DST: do not overwrite the syndrome but XOR it with the
+ * input data. Required for rmw case.
  */
 enum async_tx_flags {
     ASYNC_TX_XOR_ZERO_DST    = (1 << 0),
     ASYNC_TX_XOR_DROP_DST    = (1 << 1),
     ASYNC_TX_ACK             = (1 << 2),
     ASYNC_TX_FENCE           = (1 << 3),
+    ASYNC_TX_PQ_XOR_DST      = (1 << 4),
 };

 /**
...
@@ -72,6 +72,7 @@ extern const char raid6_empty_zero_page[PAGE_SIZE];
 /* Routine choices */
 struct raid6_calls {
     void (*gen_syndrome)(int, size_t, void **);
+    void (*xor_syndrome)(int, int, int, size_t, void **);
     int  (*valid)(void);    /* Returns 1 if this routine set is usable */
     const char *name;       /* Name of this routine set */
     int prefer;             /* Has special performance attribute */
...
@@ -78,6 +78,12 @@
 #define MD_DISK_ACTIVE      1 /* disk is running or spare disk */
 #define MD_DISK_SYNC        2 /* disk is in sync with the raid set */
 #define MD_DISK_REMOVED     3 /* disk is in sync with the raid set */
+#define MD_DISK_CLUSTER_ADD 4 /* Initiate a disk add across the cluster
+                               * For clustered enviroments only.
+                               */
+#define MD_DISK_CANDIDATE   5 /* disk is added as spare (local) until confirmed
+                               * For clustered enviroments only.
+                               */

 #define MD_DISK_WRITEMOSTLY 9 /* disk is "write-mostly" is RAID1 config.
                                * read requests will only be sent here in
@@ -101,6 +107,7 @@ typedef struct mdp_device_descriptor_s {
 #define MD_SB_CLEAN             0
 #define MD_SB_ERRORS            1
+#define MD_SB_CLUSTERED         5 /* MD is clustered */
 #define MD_SB_BITMAP_PRESENT    8 /* bitmap may be present nearby */

 /*
...
@@ -62,6 +62,7 @@
 #define STOP_ARRAY          _IO (MD_MAJOR, 0x32)
 #define STOP_ARRAY_RO       _IO (MD_MAJOR, 0x33)
 #define RESTART_ARRAY_RW    _IO (MD_MAJOR, 0x34)
+#define CLUSTERED_DISK_NACK _IO (MD_MAJOR, 0x35)

 /* 63 partitions with the alternate major number (mdp) */
 #define MdpMinorShift 6
...
@@ -131,11 +131,12 @@ static inline const struct raid6_recov_calls *raid6_choose_recov(void)
 static inline const struct raid6_calls *raid6_choose_gen(
     void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks)
 {
-    unsigned long perf, bestperf, j0, j1;
+    unsigned long perf, bestgenperf, bestxorperf, j0, j1;
+    int start = (disks>>1)-1, stop = disks-3;   /* work on the second half of the disks */
     const struct raid6_calls *const *algo;
     const struct raid6_calls *best;

-    for (bestperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) {
+    for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) {
         if (!best || (*algo)->prefer >= best->prefer) {
             if ((*algo)->valid && !(*algo)->valid())
                 continue;
@@ -153,19 +154,45 @@ static inline const struct raid6_calls *raid6_choose_gen(
             }
             preempt_enable();

-            if (perf > bestperf) {
-                bestperf = perf;
+            if (perf > bestgenperf) {
+                bestgenperf = perf;
                 best = *algo;
             }
-            pr_info("raid6: %-8s %5ld MB/s\n", (*algo)->name,
+            pr_info("raid6: %-8s gen() %5ld MB/s\n", (*algo)->name,
                 (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2));

+            if (!(*algo)->xor_syndrome)
+                continue;
+
+            perf = 0;
+
+            preempt_disable();
+            j0 = jiffies;
+            while ((j1 = jiffies) == j0)
+                cpu_relax();
+            while (time_before(jiffies,
+                        j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
+                (*algo)->xor_syndrome(disks, start, stop,
+                              PAGE_SIZE, *dptrs);
+                perf++;
+            }
+            preempt_enable();
+
+            if (best == *algo)
+                bestxorperf = perf;
+
+            pr_info("raid6: %-8s xor() %5ld MB/s\n", (*algo)->name,
+                (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1));
         }
     }

     if (best) {
-        pr_info("raid6: using algorithm %s (%ld MB/s)\n",
+        pr_info("raid6: using algorithm %s gen() %ld MB/s\n",
             best->name,
-            (bestperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2));
+            (bestgenperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2));
+        if (best->xor_syndrome)
+            pr_info("raid6: .... xor() %ld MB/s, rmw enabled\n",
+                (bestxorperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1));
         raid6_call = *best;
     } else
         pr_err("raid6: Yikes! No algorithm found!\n");
...
@@ -119,6 +119,7 @@ int raid6_have_altivec(void)
 const struct raid6_calls raid6_altivec$# = {
     raid6_altivec$#_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_altivec,
     "altivecx$#",
     0
...
@@ -89,6 +89,7 @@ static void raid6_avx21_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_avx2x1 = {
     raid6_avx21_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_avx2,
     "avx2x1",
     1               /* Has cache hints */
@@ -150,6 +151,7 @@ static void raid6_avx22_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_avx2x2 = {
     raid6_avx22_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_avx2,
     "avx2x2",
     1               /* Has cache hints */
@@ -242,6 +244,7 @@ static void raid6_avx24_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_avx2x4 = {
     raid6_avx24_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_avx2,
     "avx2x4",
     1               /* Has cache hints */
...
@@ -107,9 +107,48 @@ static void raid6_int$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
     }
 }
static void raid6_int$#_xor_syndrome(int disks, int start, int stop,
size_t bytes, void **ptrs)
{
u8 **dptr = (u8 **)ptrs;
u8 *p, *q;
int d, z, z0;
unative_t wd$$, wq$$, wp$$, w1$$, w2$$;
z0 = stop; /* P/Q right side optimization */
p = dptr[disks-2]; /* XOR parity */
q = dptr[disks-1]; /* RS syndrome */
for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
/* P/Q data pages */
wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE];
for ( z = z0-1 ; z >= start ; z-- ) {
wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE];
wp$$ ^= wd$$;
w2$$ = MASK(wq$$);
w1$$ = SHLBYTE(wq$$);
w2$$ &= NBYTES(0x1d);
w1$$ ^= w2$$;
wq$$ = w1$$ ^ wd$$;
}
/* P/Q left side optimization */
for ( z = start-1 ; z >= 0 ; z-- ) {
w2$$ = MASK(wq$$);
w1$$ = SHLBYTE(wq$$);
w2$$ &= NBYTES(0x1d);
wq$$ = w1$$ ^ w2$$;
}
*(unative_t *)&p[d+NSIZE*$$] ^= wp$$;
*(unative_t *)&q[d+NSIZE*$$] ^= wq$$;
}
}
 const struct raid6_calls raid6_intx$# = {
     raid6_int$#_gen_syndrome,
-    NULL,           /* always valid */
+    raid6_int$#_xor_syndrome,
+    NULL,           /* always valid */
     "int" NSTRING "x$#",
     0
 };
...
@@ -76,6 +76,7 @@ static void raid6_mmx1_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_mmxx1 = {
     raid6_mmx1_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_mmx,
     "mmxx1",
     0
@@ -134,6 +135,7 @@ static void raid6_mmx2_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_mmxx2 = {
     raid6_mmx2_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_mmx,
     "mmxx2",
     0
...
@@ -42,6 +42,7 @@
     }                                           \
     struct raid6_calls const raid6_neonx ## _n = {  \
         raid6_neon ## _n ## _gen_syndrome,      \
+        NULL,       /* XOR not yet implemented */ \
         raid6_have_neon,                        \
         "neonx" #_n,                            \
         0                                       \
...
@@ -92,6 +92,7 @@ static void raid6_sse11_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_sse1x1 = {
     raid6_sse11_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_sse1_or_mmxext,
     "sse1x1",
     1               /* Has cache hints */
@@ -154,6 +155,7 @@ static void raid6_sse12_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_sse1x2 = {
     raid6_sse12_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     raid6_have_sse1_or_mmxext,
     "sse1x2",
     1               /* Has cache hints */
...
@@ -88,8 +88,58 @@ static void raid6_sse21_gen_syndrome(int disks, size_t bytes, void **ptrs)
     kernel_fpu_end();
 }
static void raid6_sse21_xor_syndrome(int disks, int start, int stop,
size_t bytes, void **ptrs)
{
u8 **dptr = (u8 **)ptrs;
u8 *p, *q;
int d, z, z0;
z0 = stop; /* P/Q right side optimization */
p = dptr[disks-2]; /* XOR parity */
q = dptr[disks-1]; /* RS syndrome */
kernel_fpu_begin();
asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0]));
for ( d = 0 ; d < bytes ; d += 16 ) {
asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
asm volatile("pxor %xmm4,%xmm2");
/* P/Q data pages */
for ( z = z0-1 ; z >= start ; z-- ) {
asm volatile("pxor %xmm5,%xmm5");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("paddb %xmm4,%xmm4");
asm volatile("pand %xmm0,%xmm5");
asm volatile("pxor %xmm5,%xmm4");
asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
asm volatile("pxor %xmm5,%xmm2");
asm volatile("pxor %xmm5,%xmm4");
}
/* P/Q left side optimization */
for ( z = start-1 ; z >= 0 ; z-- ) {
asm volatile("pxor %xmm5,%xmm5");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("paddb %xmm4,%xmm4");
asm volatile("pand %xmm0,%xmm5");
asm volatile("pxor %xmm5,%xmm4");
}
asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
/* Don't use movntdq for r/w memory area < cache line */
asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));
asm volatile("movdqa %%xmm2,%0" : "=m" (p[d]));
}
asm volatile("sfence" : : : "memory");
kernel_fpu_end();
}
 const struct raid6_calls raid6_sse2x1 = {
     raid6_sse21_gen_syndrome,
+    raid6_sse21_xor_syndrome,
     raid6_have_sse2,
     "sse2x1",
     1               /* Has cache hints */
@@ -150,8 +200,76 @@ static void raid6_sse22_gen_syndrome(int disks, size_t bytes, void **ptrs)
     kernel_fpu_end();
 }
static void raid6_sse22_xor_syndrome(int disks, int start, int stop,
size_t bytes, void **ptrs)
{
u8 **dptr = (u8 **)ptrs;
u8 *p, *q;
int d, z, z0;
z0 = stop; /* P/Q right side optimization */
p = dptr[disks-2]; /* XOR parity */
q = dptr[disks-1]; /* RS syndrome */
kernel_fpu_begin();
asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0]));
for ( d = 0 ; d < bytes ; d += 32 ) {
asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16]));
asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16]));
asm volatile("pxor %xmm4,%xmm2");
asm volatile("pxor %xmm6,%xmm3");
/* P/Q data pages */
for ( z = z0-1 ; z >= start ; z-- ) {
asm volatile("pxor %xmm5,%xmm5");
asm volatile("pxor %xmm7,%xmm7");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("pcmpgtb %xmm6,%xmm7");
asm volatile("paddb %xmm4,%xmm4");
asm volatile("paddb %xmm6,%xmm6");
asm volatile("pand %xmm0,%xmm5");
asm volatile("pand %xmm0,%xmm7");
asm volatile("pxor %xmm5,%xmm4");
asm volatile("pxor %xmm7,%xmm6");
asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16]));
asm volatile("pxor %xmm5,%xmm2");
asm volatile("pxor %xmm7,%xmm3");
asm volatile("pxor %xmm5,%xmm4");
asm volatile("pxor %xmm7,%xmm6");
}
/* P/Q left side optimization */
for ( z = start-1 ; z >= 0 ; z-- ) {
asm volatile("pxor %xmm5,%xmm5");
asm volatile("pxor %xmm7,%xmm7");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("pcmpgtb %xmm6,%xmm7");
asm volatile("paddb %xmm4,%xmm4");
asm volatile("paddb %xmm6,%xmm6");
asm volatile("pand %xmm0,%xmm5");
asm volatile("pand %xmm0,%xmm7");
asm volatile("pxor %xmm5,%xmm4");
asm volatile("pxor %xmm7,%xmm6");
}
asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
/* Don't use movntdq for r/w memory area < cache line */
asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));
asm volatile("movdqa %%xmm6,%0" : "=m" (q[d+16]));
asm volatile("movdqa %%xmm2,%0" : "=m" (p[d]));
asm volatile("movdqa %%xmm3,%0" : "=m" (p[d+16]));
}
asm volatile("sfence" : : : "memory");
kernel_fpu_end();
}
 const struct raid6_calls raid6_sse2x2 = {
     raid6_sse22_gen_syndrome,
+    raid6_sse22_xor_syndrome,
     raid6_have_sse2,
     "sse2x2",
     1               /* Has cache hints */
@@ -248,8 +366,117 @@ static void raid6_sse24_gen_syndrome(int disks, size_t bytes, void **ptrs)
     kernel_fpu_end();
 }
static void raid6_sse24_xor_syndrome(int disks, int start, int stop,
size_t bytes, void **ptrs)
{
u8 **dptr = (u8 **)ptrs;
u8 *p, *q;
int d, z, z0;
z0 = stop; /* P/Q right side optimization */
p = dptr[disks-2]; /* XOR parity */
q = dptr[disks-1]; /* RS syndrome */
kernel_fpu_begin();
asm volatile("movdqa %0,%%xmm0" :: "m" (raid6_sse_constants.x1d[0]));
for ( d = 0 ; d < bytes ; d += 64 ) {
asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16]));
asm volatile("movdqa %0,%%xmm12" :: "m" (dptr[z0][d+32]));
asm volatile("movdqa %0,%%xmm14" :: "m" (dptr[z0][d+48]));
asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16]));
asm volatile("movdqa %0,%%xmm10" : : "m" (p[d+32]));
asm volatile("movdqa %0,%%xmm11" : : "m" (p[d+48]));
asm volatile("pxor %xmm4,%xmm2");
asm volatile("pxor %xmm6,%xmm3");
asm volatile("pxor %xmm12,%xmm10");
asm volatile("pxor %xmm14,%xmm11");
/* P/Q data pages */
for ( z = z0-1 ; z >= start ; z-- ) {
asm volatile("prefetchnta %0" :: "m" (dptr[z][d]));
asm volatile("prefetchnta %0" :: "m" (dptr[z][d+32]));
asm volatile("pxor %xmm5,%xmm5");
asm volatile("pxor %xmm7,%xmm7");
asm volatile("pxor %xmm13,%xmm13");
asm volatile("pxor %xmm15,%xmm15");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("pcmpgtb %xmm6,%xmm7");
asm volatile("pcmpgtb %xmm12,%xmm13");
asm volatile("pcmpgtb %xmm14,%xmm15");
asm volatile("paddb %xmm4,%xmm4");
asm volatile("paddb %xmm6,%xmm6");
asm volatile("paddb %xmm12,%xmm12");
asm volatile("paddb %xmm14,%xmm14");
asm volatile("pand %xmm0,%xmm5");
asm volatile("pand %xmm0,%xmm7");
asm volatile("pand %xmm0,%xmm13");
asm volatile("pand %xmm0,%xmm15");
asm volatile("pxor %xmm5,%xmm4");
asm volatile("pxor %xmm7,%xmm6");
asm volatile("pxor %xmm13,%xmm12");
asm volatile("pxor %xmm15,%xmm14");
asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16]));
asm volatile("movdqa %0,%%xmm13" :: "m" (dptr[z][d+32]));
asm volatile("movdqa %0,%%xmm15" :: "m" (dptr[z][d+48]));
asm volatile("pxor %xmm5,%xmm2");
asm volatile("pxor %xmm7,%xmm3");
asm volatile("pxor %xmm13,%xmm10");
asm volatile("pxor %xmm15,%xmm11");
asm volatile("pxor %xmm5,%xmm4");
asm volatile("pxor %xmm7,%xmm6");
asm volatile("pxor %xmm13,%xmm12");
asm volatile("pxor %xmm15,%xmm14");
}
asm volatile("prefetchnta %0" :: "m" (q[d]));
asm volatile("prefetchnta %0" :: "m" (q[d+32]));
/* P/Q left side optimization */
for ( z = start-1 ; z >= 0 ; z-- ) {
asm volatile("pxor %xmm5,%xmm5");
asm volatile("pxor %xmm7,%xmm7");
asm volatile("pxor %xmm13,%xmm13");
asm volatile("pxor %xmm15,%xmm15");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("pcmpgtb %xmm6,%xmm7");
asm volatile("pcmpgtb %xmm12,%xmm13");
asm volatile("pcmpgtb %xmm14,%xmm15");
asm volatile("paddb %xmm4,%xmm4");
asm volatile("paddb %xmm6,%xmm6");
asm volatile("paddb %xmm12,%xmm12");
asm volatile("paddb %xmm14,%xmm14");
asm volatile("pand %xmm0,%xmm5");
asm volatile("pand %xmm0,%xmm7");
asm volatile("pand %xmm0,%xmm13");
asm volatile("pand %xmm0,%xmm15");
asm volatile("pxor %xmm5,%xmm4");
asm volatile("pxor %xmm7,%xmm6");
asm volatile("pxor %xmm13,%xmm12");
asm volatile("pxor %xmm15,%xmm14");
}
asm volatile("movntdq %%xmm2,%0" : "=m" (p[d]));
asm volatile("movntdq %%xmm3,%0" : "=m" (p[d+16]));
asm volatile("movntdq %%xmm10,%0" : "=m" (p[d+32]));
asm volatile("movntdq %%xmm11,%0" : "=m" (p[d+48]));
asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
asm volatile("pxor %0,%%xmm12" : : "m" (q[d+32]));
asm volatile("pxor %0,%%xmm14" : : "m" (q[d+48]));
asm volatile("movntdq %%xmm4,%0" : "=m" (q[d]));
asm volatile("movntdq %%xmm6,%0" : "=m" (q[d+16]));
asm volatile("movntdq %%xmm12,%0" : "=m" (q[d+32]));
asm volatile("movntdq %%xmm14,%0" : "=m" (q[d+48]));
}
asm volatile("sfence" : : : "memory");
kernel_fpu_end();
}
 const struct raid6_calls raid6_sse2x4 = {
     raid6_sse24_gen_syndrome,
+    raid6_sse24_xor_syndrome,
     raid6_have_sse2,
     "sse2x4",
     1               /* Has cache hints */
...
@@ -28,11 +28,11 @@ char *dataptrs[NDISKS];
 char data[NDISKS][PAGE_SIZE];
 char recovi[PAGE_SIZE], recovj[PAGE_SIZE];

-static void makedata(void)
+static void makedata(int start, int stop)
 {
     int i, j;

-    for (i = 0; i < NDISKS; i++) {
+    for (i = start; i <= stop; i++) {
         for (j = 0; j < PAGE_SIZE; j++)
             data[i][j] = rand();
@@ -91,34 +91,55 @@ int main(int argc, char *argv[])
 {
     const struct raid6_calls *const *algo;
     const struct raid6_recov_calls *const *ra;
-    int i, j;
+    int i, j, p1, p2;
     int err = 0;

-    makedata();
+    makedata(0, NDISKS-1);

     for (ra = raid6_recov_algos; *ra; ra++) {
         if ((*ra)->valid && !(*ra)->valid())
             continue;
         raid6_2data_recov = (*ra)->data2;
         raid6_datap_recov = (*ra)->datap;
         printf("using recovery %s\n", (*ra)->name);

         for (algo = raid6_algos; *algo; algo++) {
-            if (!(*algo)->valid || (*algo)->valid()) {
-                raid6_call = **algo;
+            if ((*algo)->valid && !(*algo)->valid())
+                continue;

-                /* Nuke syndromes */
-                memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
+            raid6_call = **algo;

-                /* Generate assumed good syndrome */
-                raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
-                            (void **)&dataptrs);
+            /* Nuke syndromes */
+            memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);

-                for (i = 0; i < NDISKS-1; i++)
-                    for (j = i+1; j < NDISKS; j++)
-                        err += test_disks(i, j);
-            }
+            /* Generate assumed good syndrome */
+            raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
+                        (void **)&dataptrs);
+
+            for (i = 0; i < NDISKS-1; i++)
+                for (j = i+1; j < NDISKS; j++)
+                    err += test_disks(i, j);
+
+            if (!raid6_call.xor_syndrome)
+                continue;
+
+            for (p1 = 0; p1 < NDISKS-2; p1++)
+                for (p2 = p1; p2 < NDISKS-2; p2++) {
+
+                    /* Simulate rmw run */
+                    raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
+                                (void **)&dataptrs);
+                    makedata(p1, p2);
+                    raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
+                                (void **)&dataptrs);
+
+                    for (i = 0; i < NDISKS-1; i++)
+                        for (j = i+1; j < NDISKS; j++)
+                            err += test_disks(i, j);
+                }
+
         }
         printf("\n");
     }
...
@@ -80,6 +80,7 @@ void raid6_tilegx$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
 const struct raid6_calls raid6_tilegx$# = {
     raid6_tilegx$#_gen_syndrome,
+    NULL,           /* XOR not yet implemented */
     NULL,
     "tilegx$#",
     0
...