Commit 53365383 authored by Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm

* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (80 commits)
  dm snapshot: use merge origin if snapshot invalid
  dm snapshot: report merge failure in status
  dm snapshot: merge consecutive chunks together
  dm snapshot: trigger exceptions in remaining snapshots during merge
  dm snapshot: delay merging a chunk until writes to it complete
  dm snapshot: queue writes to chunks being merged
  dm snapshot: add merging
  dm snapshot: permit only one merge at once
  dm snapshot: support barriers in snapshot merge target
  dm snapshot: avoid allocating exceptions in merge
  dm snapshot: rework writing to origin
  dm snapshot: add merge target
  dm exception store: add merge specific methods
  dm snapshot: create function for chunk_is_tracked wait
  dm snapshot: make bio optional in __origin_write
  dm mpath: reject messages when device is suspended
  dm: export suspended state to targets
  dm: rename dm_suspended to dm_suspended_md
  dm: swap target postsuspend call and setting suspended flag
  dm crypt: add plain64 iv
  ...
parents 51b736b8 d2fdb776
@@ -8,13 +8,19 @@ the block device which are also writable without interfering with the
 original content;
 *) To create device "forks", i.e. multiple different versions of the
 same data stream.
+*) To merge a snapshot of a block device back into the snapshot's origin
+device.
 
-In both cases, dm copies only the chunks of data that get changed and
-uses a separate copy-on-write (COW) block device for storage.
+In the first two cases, dm copies only the chunks of data that get
+changed and uses a separate copy-on-write (COW) block device for
+storage.
 
-There are two dm targets available: snapshot and snapshot-origin.
+For snapshot merge the contents of the COW storage are merged back into
+the origin device.
+
+There are three dm targets available:
+snapshot, snapshot-origin, and snapshot-merge.
 
 *) snapshot-origin <origin>
@@ -40,8 +46,25 @@ The difference is that for transient snapshots less metadata must be
 saved on disk - they can be kept in memory by the kernel.
 
 
-How this is used by LVM2
-========================
+* snapshot-merge <origin> <COW device> <persistent> <chunksize>
+
+takes the same table arguments as the snapshot target except it only
+works with persistent snapshots.  This target assumes the role of the
+"snapshot-origin" target and must not be loaded if the "snapshot-origin"
+is still present for <origin>.
+
+Creates a merging snapshot that takes control of the changed chunks
+stored in the <COW device> of an existing snapshot, through a handover
+procedure, and merges these chunks back into the <origin>.  Once merging
+has started (in the background) the <origin> may be opened and the merge
+will continue while I/O is flowing to it.  Changes to the <origin> are
+deferred until the merging snapshot's corresponding chunk(s) have been
+merged.  Once merging has started the snapshot device, associated with
+the "snapshot" target, will return -EIO when accessed.
+
+
+How snapshot is used by LVM2
+============================
 When you create the first LVM2 snapshot of a volume, four dm devices are used:
 
 1) a device containing the original mapping table of the source volume;
@@ -72,3 +95,30 @@ brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
 brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
 brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
+
+
+How snapshot-merge is used by LVM2
+==================================
+A merging snapshot assumes the role of the "snapshot-origin" while
+merging.  As such the "snapshot-origin" is replaced with
+"snapshot-merge".  The "-real" device is not changed and the "-cow"
+device is renamed to <origin name>-cow to aid LVM2's cleanup of the
+merging snapshot after it completes.  The "snapshot" that hands over its
+COW device to the "snapshot-merge" is deactivated (unless using lvchange
+--refresh); but if it is left active it will simply return I/O errors.
+
+A snapshot will merge into its origin with the following command:
+
+lvconvert --merge volumeGroup/snap
+
+we'll now have this situation:
+
+# dmsetup table|grep volumeGroup
+
+volumeGroup-base-real: 0 2097152 linear 8:19 384
+volumeGroup-base-cow: 0 204800 linear 8:19 2097536
+volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
+
+# ls -lL /dev/mapper/volumeGroup-*
+brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
+brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
+brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
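
Merge progress can be followed with "dmsetup status" on the merging
device.  The status line ends with the usual snapshot usage triple
<sectors_allocated>/<total_sectors> <metadata_sectors>; the first figure
shrinks as chunks are copied back, and merging is finished once it equals
the metadata overhead.  The values below are illustrative only:

# dmsetup status volumeGroup-base
0 2097152 snapshot-merge 10432/204800 32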
 /*
  * Copyright (C) 2003 Christophe Saout <christophe@saout.de>
  * Copyright (C) 2004 Clemens Fruhwirth <clemens@endorphin.org>
- * Copyright (C) 2006-2008 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2006-2009 Red Hat, Inc. All rights reserved.
  *
  * This file is released under the GPL.
  */
@@ -71,10 +71,21 @@ struct crypt_iv_operations {
 	int (*ctr)(struct crypt_config *cc, struct dm_target *ti,
 		   const char *opts);
 	void (*dtr)(struct crypt_config *cc);
-	const char *(*status)(struct crypt_config *cc);
+	int (*init)(struct crypt_config *cc);
+	int (*wipe)(struct crypt_config *cc);
 	int (*generator)(struct crypt_config *cc, u8 *iv, sector_t sector);
 };
 
+struct iv_essiv_private {
+	struct crypto_cipher *tfm;
+	struct crypto_hash *hash_tfm;
+	u8 *salt;
+};
+
+struct iv_benbi_private {
+	int shift;
+};
+
 /*
  * Crypt: maps a linear range of a block device
  * and encrypts / decrypts at the same time.
@@ -102,8 +113,8 @@ struct crypt_config {
 	struct crypt_iv_operations *iv_gen_ops;
 	char *iv_mode;
 	union {
-		struct crypto_cipher *essiv_tfm;
-		int benbi_shift;
+		struct iv_essiv_private essiv;
+		struct iv_benbi_private benbi;
 	} iv_gen_private;
 	sector_t iv_offset;
 	unsigned int iv_size;
@@ -147,6 +158,9 @@ static void kcryptd_queue_crypt(struct dm_crypt_io *io);
  * plain: the initial vector is the 32-bit little-endian version of the sector
  *   number, padded with zeros if necessary.
  *
+ * plain64: the initial vector is the 64-bit little-endian version of the sector
+ *   number, padded with zeros if necessary.
+ *
  * essiv: "encrypted sector|salt initial vector", the sector number is
  *   encrypted with the bulk cipher using a salt as key. The salt
  *   should be derived from the bulk cipher's key via hashing.
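/*
 * For illustration only: a mapping table line selecting the new plain64
 * generator looks like any other dm-crypt table, with the ivmode appended
 * to the cipher spec (fields are
 * <start> <length> crypt <cipher> <key> <iv_offset> <device> <offset>):
 *
 *   0 2097152 crypt aes-cbc-plain64 <64-hex-char-key> 0 /dev/sdb1 0
 *
 * plain64 matters for devices larger than 2 TiB of 512-byte sectors,
 * where a 32-bit "plain" IV would wrap and start repeating.
 */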
...@@ -169,88 +183,123 @@ static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv, sector_t sector) ...@@ -169,88 +183,123 @@ static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv, sector_t sector)
return 0; return 0;
} }
static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti, static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
const char *opts) sector_t sector)
{ {
struct crypto_cipher *essiv_tfm; memset(iv, 0, cc->iv_size);
struct crypto_hash *hash_tfm; *(u64 *)iv = cpu_to_le64(sector);
return 0;
}
/* Initialise ESSIV - compute salt but no local memory allocations */
static int crypt_iv_essiv_init(struct crypt_config *cc)
{
struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
struct hash_desc desc; struct hash_desc desc;
struct scatterlist sg; struct scatterlist sg;
unsigned int saltsize;
u8 *salt;
int err; int err;
if (opts == NULL) { sg_init_one(&sg, cc->key, cc->key_size);
desc.tfm = essiv->hash_tfm;
desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
err = crypto_hash_digest(&desc, &sg, cc->key_size, essiv->salt);
if (err)
return err;
return crypto_cipher_setkey(essiv->tfm, essiv->salt,
crypto_hash_digestsize(essiv->hash_tfm));
}
/* Wipe salt and reset key derived from volume key */
static int crypt_iv_essiv_wipe(struct crypt_config *cc)
{
struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
unsigned salt_size = crypto_hash_digestsize(essiv->hash_tfm);
memset(essiv->salt, 0, salt_size);
return crypto_cipher_setkey(essiv->tfm, essiv->salt, salt_size);
}
static void crypt_iv_essiv_dtr(struct crypt_config *cc)
{
struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
crypto_free_cipher(essiv->tfm);
essiv->tfm = NULL;
crypto_free_hash(essiv->hash_tfm);
essiv->hash_tfm = NULL;
kzfree(essiv->salt);
essiv->salt = NULL;
}
static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
const char *opts)
{
struct crypto_cipher *essiv_tfm = NULL;
struct crypto_hash *hash_tfm = NULL;
u8 *salt = NULL;
int err;
if (!opts) {
ti->error = "Digest algorithm missing for ESSIV mode"; ti->error = "Digest algorithm missing for ESSIV mode";
return -EINVAL; return -EINVAL;
} }
/* Hash the cipher key with the given hash algorithm */ /* Allocate hash algorithm */
hash_tfm = crypto_alloc_hash(opts, 0, CRYPTO_ALG_ASYNC); hash_tfm = crypto_alloc_hash(opts, 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(hash_tfm)) { if (IS_ERR(hash_tfm)) {
ti->error = "Error initializing ESSIV hash"; ti->error = "Error initializing ESSIV hash";
return PTR_ERR(hash_tfm); err = PTR_ERR(hash_tfm);
goto bad;
} }
saltsize = crypto_hash_digestsize(hash_tfm); salt = kzalloc(crypto_hash_digestsize(hash_tfm), GFP_KERNEL);
salt = kmalloc(saltsize, GFP_KERNEL); if (!salt) {
if (salt == NULL) {
ti->error = "Error kmallocing salt storage in ESSIV"; ti->error = "Error kmallocing salt storage in ESSIV";
crypto_free_hash(hash_tfm); err = -ENOMEM;
return -ENOMEM; goto bad;
} }
sg_init_one(&sg, cc->key, cc->key_size); /* Allocate essiv_tfm */
desc.tfm = hash_tfm;
desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
err = crypto_hash_digest(&desc, &sg, cc->key_size, salt);
crypto_free_hash(hash_tfm);
if (err) {
ti->error = "Error calculating hash in ESSIV";
kfree(salt);
return err;
}
/* Setup the essiv_tfm with the given salt */
essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC); essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(essiv_tfm)) { if (IS_ERR(essiv_tfm)) {
ti->error = "Error allocating crypto tfm for ESSIV"; ti->error = "Error allocating crypto tfm for ESSIV";
kfree(salt); err = PTR_ERR(essiv_tfm);
return PTR_ERR(essiv_tfm); goto bad;
} }
if (crypto_cipher_blocksize(essiv_tfm) != if (crypto_cipher_blocksize(essiv_tfm) !=
crypto_ablkcipher_ivsize(cc->tfm)) { crypto_ablkcipher_ivsize(cc->tfm)) {
ti->error = "Block size of ESSIV cipher does " ti->error = "Block size of ESSIV cipher does "
"not match IV size of block cipher"; "not match IV size of block cipher";
crypto_free_cipher(essiv_tfm); err = -EINVAL;
kfree(salt); goto bad;
return -EINVAL;
} }
err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
if (err) {
ti->error = "Failed to set key for ESSIV cipher";
crypto_free_cipher(essiv_tfm);
kfree(salt);
return err;
}
kfree(salt);
cc->iv_gen_private.essiv_tfm = essiv_tfm; cc->iv_gen_private.essiv.salt = salt;
cc->iv_gen_private.essiv.tfm = essiv_tfm;
cc->iv_gen_private.essiv.hash_tfm = hash_tfm;
return 0; return 0;
}
static void crypt_iv_essiv_dtr(struct crypt_config *cc) bad:
{ if (essiv_tfm && !IS_ERR(essiv_tfm))
crypto_free_cipher(cc->iv_gen_private.essiv_tfm); crypto_free_cipher(essiv_tfm);
cc->iv_gen_private.essiv_tfm = NULL; if (hash_tfm && !IS_ERR(hash_tfm))
crypto_free_hash(hash_tfm);
kfree(salt);
return err;
} }
static int crypt_iv_essiv_gen(struct crypt_config *cc, u8 *iv, sector_t sector) static int crypt_iv_essiv_gen(struct crypt_config *cc, u8 *iv, sector_t sector)
{ {
memset(iv, 0, cc->iv_size); memset(iv, 0, cc->iv_size);
*(u64 *)iv = cpu_to_le64(sector); *(u64 *)iv = cpu_to_le64(sector);
crypto_cipher_encrypt_one(cc->iv_gen_private.essiv_tfm, iv, iv); crypto_cipher_encrypt_one(cc->iv_gen_private.essiv.tfm, iv, iv);
return 0; return 0;
} }
@@ -273,7 +322,7 @@ static int crypt_iv_benbi_ctr(struct crypt_config *cc, struct dm_target *ti,
 		return -EINVAL;
 	}
 
-	cc->iv_gen_private.benbi_shift = 9 - log;
+	cc->iv_gen_private.benbi.shift = 9 - log;
 
 	return 0;
 }
@@ -288,7 +337,7 @@ static int crypt_iv_benbi_gen(struct crypt_config *cc, u8 *iv, sector_t sector)
 	memset(iv, 0, cc->iv_size - sizeof(u64)); /* rest is cleared below */
 
-	val = cpu_to_be64(((u64)sector << cc->iv_gen_private.benbi_shift) + 1);
+	val = cpu_to_be64(((u64)sector << cc->iv_gen_private.benbi.shift) + 1);
 	put_unaligned(val, (__be64 *)(iv + cc->iv_size - sizeof(u64)));
 
 	return 0;
@@ -305,9 +354,15 @@ static struct crypt_iv_operations crypt_iv_plain_ops = {
 	.generator = crypt_iv_plain_gen
 };
 
+static struct crypt_iv_operations crypt_iv_plain64_ops = {
+	.generator = crypt_iv_plain64_gen
+};
+
 static struct crypt_iv_operations crypt_iv_essiv_ops = {
 	.ctr = crypt_iv_essiv_ctr,
 	.dtr = crypt_iv_essiv_dtr,
+	.init = crypt_iv_essiv_init,
+	.wipe = crypt_iv_essiv_wipe,
 	.generator = crypt_iv_essiv_gen
 };
@@ -934,14 +989,14 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 
 	set_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
-	return 0;
+	return crypto_ablkcipher_setkey(cc->tfm, cc->key, cc->key_size);
 }
 
 static int crypt_wipe_key(struct crypt_config *cc)
 {
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 	memset(&cc->key, 0, cc->key_size * sizeof(u8));
-	return 0;
+	return crypto_ablkcipher_setkey(cc->tfm, cc->key, cc->key_size);
 }
 
 /*
@@ -983,11 +1038,6 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		return -ENOMEM;
 	}
 
-	if (crypt_set_key(cc, argv[1])) {
-		ti->error = "Error decoding key";
-		goto bad_cipher;
-	}
-
 	/* Compatibility mode for old dm-crypt cipher strings */
 	if (!chainmode || (strcmp(chainmode, "plain") == 0 && !ivmode)) {
 		chainmode = "cbc";
@@ -1015,6 +1065,11 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	strcpy(cc->chainmode, chainmode);
 	cc->tfm = tfm;
 
+	if (crypt_set_key(cc, argv[1]) < 0) {
+		ti->error = "Error decoding and setting key";
+		goto bad_ivmode;
+	}
+
 	/*
 	 * Choose ivmode. Valid modes: "plain", "essiv:<esshash>", "benbi".
 	 * See comments at iv code
@@ -1024,6 +1079,8 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		cc->iv_gen_ops = NULL;
 	else if (strcmp(ivmode, "plain") == 0)
 		cc->iv_gen_ops = &crypt_iv_plain_ops;
+	else if (strcmp(ivmode, "plain64") == 0)
+		cc->iv_gen_ops = &crypt_iv_plain64_ops;
 	else if (strcmp(ivmode, "essiv") == 0)
 		cc->iv_gen_ops = &crypt_iv_essiv_ops;
 	else if (strcmp(ivmode, "benbi") == 0)
@@ -1039,6 +1096,12 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	    cc->iv_gen_ops->ctr(cc, ti, ivopts) < 0)
 		goto bad_ivmode;
 
+	if (cc->iv_gen_ops && cc->iv_gen_ops->init &&
+	    cc->iv_gen_ops->init(cc) < 0) {
+		ti->error = "Error initialising IV";
+		goto bad_slab_pool;
+	}
+
 	cc->iv_size = crypto_ablkcipher_ivsize(tfm);
 	if (cc->iv_size)
 		/* at least a 64 bit sector number should fit in our buffer */
@@ -1085,11 +1148,6 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad_bs;
 	}
 
-	if (crypto_ablkcipher_setkey(tfm, cc->key, key_size) < 0) {
-		ti->error = "Error setting key";
-		goto bad_device;
-	}
-
 	if (sscanf(argv[2], "%llu", &tmpll) != 1) {
 		ti->error = "Invalid iv_offset sector";
 		goto bad_device;
@@ -1278,6 +1336,7 @@ static void crypt_resume(struct dm_target *ti)
 static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 {
 	struct crypt_config *cc = ti->private;
+	int ret = -EINVAL;
 
 	if (argc < 2)
 		goto error;
@@ -1287,10 +1346,22 @@ static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 			DMWARN("not suspended during key manipulation.");
 			return -EINVAL;
 		}
-		if (argc == 3 && !strnicmp(argv[1], MESG_STR("set")))
-			return crypt_set_key(cc, argv[2]);
-		if (argc == 2 && !strnicmp(argv[1], MESG_STR("wipe")))
+		if (argc == 3 && !strnicmp(argv[1], MESG_STR("set"))) {
+			ret = crypt_set_key(cc, argv[2]);
+			if (ret)
+				return ret;
+			if (cc->iv_gen_ops && cc->iv_gen_ops->init)
+				ret = cc->iv_gen_ops->init(cc);
+			return ret;
+		}
+		if (argc == 2 && !strnicmp(argv[1], MESG_STR("wipe"))) {
+			if (cc->iv_gen_ops && cc->iv_gen_ops->wipe) {
+				ret = cc->iv_gen_ops->wipe(cc);
+				if (ret)
+					return ret;
+			}
 			return crypt_wipe_key(cc);
+		}
 	}
 
 error:
......
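The crypt_message() changes above plug into the existing "key" message
interface; assuming the device is suspended, the calls look like this
(device name illustrative):

dmsetup message cryptvol 0 key wipe
dmsetup message cryptvol 0 key set <hex-key>

With this series, "key set" also re-derives the ESSIV salt through
iv_gen_ops->init(), and "key wipe" clears it through ->wipe() before the
key itself is wiped.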
...@@ -172,7 +172,8 @@ int dm_exception_store_set_chunk_size(struct dm_exception_store *store, ...@@ -172,7 +172,8 @@ int dm_exception_store_set_chunk_size(struct dm_exception_store *store,
} }
/* Validate the chunk size against the device block size */ /* Validate the chunk size against the device block size */
if (chunk_size % (bdev_logical_block_size(store->cow->bdev) >> 9)) { if (chunk_size %
(bdev_logical_block_size(dm_snap_cow(store->snap)->bdev) >> 9)) {
*error = "Chunk size is not a multiple of device blocksize"; *error = "Chunk size is not a multiple of device blocksize";
return -EINVAL; return -EINVAL;
} }
...@@ -190,6 +191,7 @@ int dm_exception_store_set_chunk_size(struct dm_exception_store *store, ...@@ -190,6 +191,7 @@ int dm_exception_store_set_chunk_size(struct dm_exception_store *store,
} }
int dm_exception_store_create(struct dm_target *ti, int argc, char **argv, int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
struct dm_snapshot *snap,
unsigned *args_used, unsigned *args_used,
struct dm_exception_store **store) struct dm_exception_store **store)
{ {
...@@ -198,7 +200,7 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv, ...@@ -198,7 +200,7 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
struct dm_exception_store *tmp_store; struct dm_exception_store *tmp_store;
char persistent; char persistent;
if (argc < 3) { if (argc < 2) {
ti->error = "Insufficient exception store arguments"; ti->error = "Insufficient exception store arguments";
return -EINVAL; return -EINVAL;
} }
...@@ -209,14 +211,15 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv, ...@@ -209,14 +211,15 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
return -ENOMEM; return -ENOMEM;
} }
persistent = toupper(*argv[1]); persistent = toupper(*argv[0]);
if (persistent == 'P') if (persistent == 'P')
type = get_type("P"); type = get_type("P");
else if (persistent == 'N') else if (persistent == 'N')
type = get_type("N"); type = get_type("N");
else { else {
ti->error = "Persistent flag is not P or N"; ti->error = "Persistent flag is not P or N";
return -EINVAL; r = -EINVAL;
goto bad_type;
} }
if (!type) { if (!type) {
...@@ -226,32 +229,23 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv, ...@@ -226,32 +229,23 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
} }
tmp_store->type = type; tmp_store->type = type;
tmp_store->ti = ti; tmp_store->snap = snap;
r = dm_get_device(ti, argv[0], 0, 0,
FMODE_READ | FMODE_WRITE, &tmp_store->cow);
if (r) {
ti->error = "Cannot get COW device";
goto bad_cow;
}
r = set_chunk_size(tmp_store, argv[2], &ti->error); r = set_chunk_size(tmp_store, argv[1], &ti->error);
if (r) if (r)
goto bad_ctr; goto bad;
r = type->ctr(tmp_store, 0, NULL); r = type->ctr(tmp_store, 0, NULL);
if (r) { if (r) {
ti->error = "Exception store type constructor failed"; ti->error = "Exception store type constructor failed";
goto bad_ctr; goto bad;
} }
*args_used = 3; *args_used = 2;
*store = tmp_store; *store = tmp_store;
return 0; return 0;
bad_ctr: bad:
dm_put_device(ti, tmp_store->cow);
bad_cow:
put_type(type); put_type(type);
bad_type: bad_type:
kfree(tmp_store); kfree(tmp_store);
...@@ -262,7 +256,6 @@ EXPORT_SYMBOL(dm_exception_store_create); ...@@ -262,7 +256,6 @@ EXPORT_SYMBOL(dm_exception_store_create);
void dm_exception_store_destroy(struct dm_exception_store *store) void dm_exception_store_destroy(struct dm_exception_store *store)
{ {
store->type->dtr(store); store->type->dtr(store);
dm_put_device(store->ti, store->cow);
put_type(store->type); put_type(store->type);
kfree(store); kfree(store);
} }
......
@@ -26,7 +26,7 @@ typedef sector_t chunk_t;
  * of chunks that follow contiguously. Remaining bits hold the number of the
  * chunk within the device.
  */
-struct dm_snap_exception {
+struct dm_exception {
 	struct list_head hash_list;
 
 	chunk_t old_chunk;
@@ -64,16 +64,33 @@ struct dm_exception_store_type {
 	 * Find somewhere to store the next exception.
 	 */
 	int (*prepare_exception) (struct dm_exception_store *store,
-				  struct dm_snap_exception *e);
+				  struct dm_exception *e);
 
 	/*
 	 * Update the metadata with this exception.
 	 */
 	void (*commit_exception) (struct dm_exception_store *store,
-				  struct dm_snap_exception *e,
+				  struct dm_exception *e,
 				  void (*callback) (void *, int success),
 				  void *callback_context);
 
+	/*
+	 * Returns 0 if the exception store is empty.
+	 *
+	 * If there are exceptions still to be merged, sets
+	 * *last_old_chunk and *last_new_chunk to the most recent
+	 * still-to-be-merged chunk and returns the number of
+	 * consecutive previous ones.
+	 */
+	int (*prepare_merge) (struct dm_exception_store *store,
+			      chunk_t *last_old_chunk, chunk_t *last_new_chunk);
+
+	/*
+	 * Clear the last n exceptions.
+	 * nr_merged must be <= the value returned by prepare_merge.
+	 */
+	int (*commit_merge) (struct dm_exception_store *store, int nr_merged);
+
 	/*
 	 * The snapshot is invalid, note this in the metadata.
 	 */
@@ -86,19 +103,19 @@ struct dm_exception_store_type {
 	/*
 	 * Return how full the snapshot is.
 	 */
-	void (*fraction_full) (struct dm_exception_store *store,
-			       sector_t *numerator,
-			       sector_t *denominator);
+	void (*usage) (struct dm_exception_store *store,
+		       sector_t *total_sectors, sector_t *sectors_allocated,
+		       sector_t *metadata_sectors);
 
 	/* For internal device-mapper use only. */
 	struct list_head list;
 };
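/*
 * Rough sketch (not kernel code) of how a caller is expected to drive the
 * two merge methods above; copy_chunks_back_to_origin() is a hypothetical
 * helper standing in for the snapshot target's kcopyd work.
 */
static int merge_one_batch(struct dm_exception_store *store)
{
	chunk_t old_chunk, new_chunk;
	int nr, r;

	nr = store->type->prepare_merge(store, &old_chunk, &new_chunk);
	if (nr <= 0)
		return nr;	/* 0: store is empty, merging has finished */

	/*
	 * old_chunk/new_chunk name the most recent unmerged chunk; the
	 * batch covers it plus the nr - 1 consecutive chunks before it.
	 */
	r = copy_chunks_back_to_origin(store, old_chunk - (nr - 1),
				       new_chunk - (nr - 1), nr);
	if (r)
		return r;

	/* Only after the data is safely back may the exceptions go. */
	return store->type->commit_merge(store, nr);
}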
+struct dm_snapshot;
+
 struct dm_exception_store {
 	struct dm_exception_store_type *type;
-	struct dm_target *ti;
-
-	struct dm_dev *cow;
+	struct dm_snapshot *snap;
 
 	/* Size of data blocks saved - must be a power of 2 */
 	unsigned chunk_size;
...@@ -108,6 +125,11 @@ struct dm_exception_store { ...@@ -108,6 +125,11 @@ struct dm_exception_store {
void *context; void *context;
}; };
/*
* Obtain the cow device used by a given snapshot.
*/
struct dm_dev *dm_snap_cow(struct dm_snapshot *snap);
/* /*
* Funtions to manipulate consecutive chunks * Funtions to manipulate consecutive chunks
*/ */
...@@ -120,18 +142,25 @@ static inline chunk_t dm_chunk_number(chunk_t chunk) ...@@ -120,18 +142,25 @@ static inline chunk_t dm_chunk_number(chunk_t chunk)
return chunk & (chunk_t)((1ULL << DM_CHUNK_NUMBER_BITS) - 1ULL); return chunk & (chunk_t)((1ULL << DM_CHUNK_NUMBER_BITS) - 1ULL);
} }
static inline unsigned dm_consecutive_chunk_count(struct dm_snap_exception *e) static inline unsigned dm_consecutive_chunk_count(struct dm_exception *e)
{ {
return e->new_chunk >> DM_CHUNK_NUMBER_BITS; return e->new_chunk >> DM_CHUNK_NUMBER_BITS;
} }
static inline void dm_consecutive_chunk_count_inc(struct dm_snap_exception *e) static inline void dm_consecutive_chunk_count_inc(struct dm_exception *e)
{ {
e->new_chunk += (1ULL << DM_CHUNK_NUMBER_BITS); e->new_chunk += (1ULL << DM_CHUNK_NUMBER_BITS);
BUG_ON(!dm_consecutive_chunk_count(e)); BUG_ON(!dm_consecutive_chunk_count(e));
} }
static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e)
{
BUG_ON(!dm_consecutive_chunk_count(e));
e->new_chunk -= (1ULL << DM_CHUNK_NUMBER_BITS);
}
# else # else
# define DM_CHUNK_CONSECUTIVE_BITS 0 # define DM_CHUNK_CONSECUTIVE_BITS 0
...@@ -140,12 +169,16 @@ static inline chunk_t dm_chunk_number(chunk_t chunk) ...@@ -140,12 +169,16 @@ static inline chunk_t dm_chunk_number(chunk_t chunk)
return chunk; return chunk;
} }
static inline unsigned dm_consecutive_chunk_count(struct dm_snap_exception *e) static inline unsigned dm_consecutive_chunk_count(struct dm_exception *e)
{ {
return 0; return 0;
} }
static inline void dm_consecutive_chunk_count_inc(struct dm_snap_exception *e) static inline void dm_consecutive_chunk_count_inc(struct dm_exception *e)
{
}
static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e)
{ {
} }
...@@ -162,7 +195,7 @@ static inline sector_t get_dev_size(struct block_device *bdev) ...@@ -162,7 +195,7 @@ static inline sector_t get_dev_size(struct block_device *bdev)
static inline chunk_t sector_to_chunk(struct dm_exception_store *store, static inline chunk_t sector_to_chunk(struct dm_exception_store *store,
sector_t sector) sector_t sector)
{ {
return (sector & ~store->chunk_mask) >> store->chunk_shift; return sector >> store->chunk_shift;
} }
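/*
 * Worked example: with 16-sector (8KiB) chunks, chunk_shift is 4, so
 * sector 35 lands in chunk 35 >> 4 = 2.  The "& ~chunk_mask" dropped
 * above only cleared the low bits that the shift discards anyway, so
 * the result is unchanged.
 */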
int dm_exception_store_type_register(struct dm_exception_store_type *type); int dm_exception_store_type_register(struct dm_exception_store_type *type);
...@@ -173,6 +206,7 @@ int dm_exception_store_set_chunk_size(struct dm_exception_store *store, ...@@ -173,6 +206,7 @@ int dm_exception_store_set_chunk_size(struct dm_exception_store *store,
char **error); char **error);
int dm_exception_store_create(struct dm_target *ti, int argc, char **argv, int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
struct dm_snapshot *snap,
unsigned *args_used, unsigned *args_used,
struct dm_exception_store **store); struct dm_exception_store **store);
void dm_exception_store_destroy(struct dm_exception_store *store); void dm_exception_store_destroy(struct dm_exception_store *store);
......
...@@ -5,6 +5,8 @@ ...@@ -5,6 +5,8 @@
* This file is released under the GPL. * This file is released under the GPL.
*/ */
#include "dm.h"
#include <linux/device-mapper.h> #include <linux/device-mapper.h>
#include <linux/bio.h> #include <linux/bio.h>
...@@ -14,12 +16,19 @@ ...@@ -14,12 +16,19 @@
#include <linux/slab.h> #include <linux/slab.h>
#include <linux/dm-io.h> #include <linux/dm-io.h>
#define DM_MSG_PREFIX "io"
#define DM_IO_MAX_REGIONS BITS_PER_LONG
struct dm_io_client { struct dm_io_client {
mempool_t *pool; mempool_t *pool;
struct bio_set *bios; struct bio_set *bios;
}; };
/* FIXME: can we shrink this ? */ /*
* Aligning 'struct io' reduces the number of bits required to store
* its address. Refer to store_io_and_region_in_bio() below.
*/
struct io { struct io {
unsigned long error_bits; unsigned long error_bits;
unsigned long eopnotsupp_bits; unsigned long eopnotsupp_bits;
...@@ -28,7 +37,9 @@ struct io { ...@@ -28,7 +37,9 @@ struct io {
struct dm_io_client *client; struct dm_io_client *client;
io_notify_fn callback; io_notify_fn callback;
void *context; void *context;
}; } __attribute__((aligned(DM_IO_MAX_REGIONS)));
static struct kmem_cache *_dm_io_cache;
/* /*
* io contexts are only dynamically allocated for asynchronous * io contexts are only dynamically allocated for asynchronous
...@@ -53,7 +64,7 @@ struct dm_io_client *dm_io_client_create(unsigned num_pages) ...@@ -53,7 +64,7 @@ struct dm_io_client *dm_io_client_create(unsigned num_pages)
if (!client) if (!client)
return ERR_PTR(-ENOMEM); return ERR_PTR(-ENOMEM);
client->pool = mempool_create_kmalloc_pool(ios, sizeof(struct io)); client->pool = mempool_create_slab_pool(ios, _dm_io_cache);
if (!client->pool) if (!client->pool)
goto bad; goto bad;
@@ -88,18 +99,29 @@ EXPORT_SYMBOL(dm_io_client_destroy);
 /*-----------------------------------------------------------------
  * We need to keep track of which region a bio is doing io for.
- * In order to save a memory allocation we store this the last
- * bvec which we know is unused (blech).
- * XXX This is ugly and can OOPS with some configs... find another way.
+ * To avoid a memory allocation to store just 5 or 6 bits, we
+ * ensure the 'struct io' pointer is aligned so enough low bits are
+ * always zero and then combine it with the region number directly in
+ * bi_private.
  *---------------------------------------------------------------*/
-static inline void bio_set_region(struct bio *bio, unsigned region)
+static void store_io_and_region_in_bio(struct bio *bio, struct io *io,
+				       unsigned region)
 {
-	bio->bi_io_vec[bio->bi_max_vecs].bv_len = region;
+	if (unlikely(!IS_ALIGNED((unsigned long)io, DM_IO_MAX_REGIONS))) {
+		DMCRIT("Unaligned struct io pointer %p", io);
+		BUG();
+	}
+
+	bio->bi_private = (void *)((unsigned long)io | region);
 }
 
-static inline unsigned bio_get_region(struct bio *bio)
+static void retrieve_io_and_region_from_bio(struct bio *bio, struct io **io,
+					    unsigned *region)
 {
-	return bio->bi_io_vec[bio->bi_max_vecs].bv_len;
+	unsigned long val = (unsigned long)bio->bi_private;
+
+	*io = (void *)(val & -(unsigned long)DM_IO_MAX_REGIONS);
+	*region = val & (DM_IO_MAX_REGIONS - 1);
 }
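/*
 * Stand-alone illustration of the pointer-packing trick above (plain
 * user-space C, not dm code): any object aligned to N bytes has its low
 * log2(N) address bits free to carry a small tag such as a region number.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_REGIONS 64	/* plays the role of DM_IO_MAX_REGIONS */

struct toy_io {
	char payload[32];
} __attribute__((aligned(MAX_REGIONS)));

static void *pack(struct toy_io *io, unsigned region)
{
	assert(region < MAX_REGIONS && ((uintptr_t)io % MAX_REGIONS) == 0);
	return (void *)((uintptr_t)io | region);
}

static struct toy_io *unpack(void *priv, unsigned *region)
{
	uintptr_t val = (uintptr_t)priv;

	*region = val & (MAX_REGIONS - 1);
	return (struct toy_io *)(val & ~(uintptr_t)(MAX_REGIONS - 1));
}

int main(void)
{
	static struct toy_io io;	/* static storage honours the alignment */
	unsigned region;
	void *priv = pack(&io, 5);

	assert(unpack(priv, &region) == &io && region == 5);
	printf("recovered region %u\n", region);
	return 0;
}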
/*----------------------------------------------------------------- /*-----------------------------------------------------------------
...@@ -140,10 +162,8 @@ static void endio(struct bio *bio, int error) ...@@ -140,10 +162,8 @@ static void endio(struct bio *bio, int error)
/* /*
* The bio destructor in bio_put() may use the io object. * The bio destructor in bio_put() may use the io object.
*/ */
io = bio->bi_private; retrieve_io_and_region_from_bio(bio, &io, &region);
region = bio_get_region(bio);
bio->bi_max_vecs++;
bio_put(bio); bio_put(bio);
dec_count(io, region, error); dec_count(io, region, error);
...@@ -243,7 +263,10 @@ static void vm_dp_init(struct dpages *dp, void *data) ...@@ -243,7 +263,10 @@ static void vm_dp_init(struct dpages *dp, void *data)
static void dm_bio_destructor(struct bio *bio) static void dm_bio_destructor(struct bio *bio)
{ {
struct io *io = bio->bi_private; unsigned region;
struct io *io;
retrieve_io_and_region_from_bio(bio, &io, &region);
bio_free(bio, io->client->bios); bio_free(bio, io->client->bios);
} }
...@@ -286,26 +309,23 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where, ...@@ -286,26 +309,23 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
unsigned num_bvecs; unsigned num_bvecs;
sector_t remaining = where->count; sector_t remaining = where->count;
while (remaining) { /*
* where->count may be zero if rw holds a write barrier and we
* need to send a zero-sized barrier.
*/
do {
/* /*
* Allocate a suitably sized-bio: we add an extra * Allocate a suitably sized-bio.
* bvec for bio_get/set_region() and decrement bi_max_vecs
* to hide it from bio_add_page().
*/ */
num_bvecs = dm_sector_div_up(remaining, num_bvecs = dm_sector_div_up(remaining,
(PAGE_SIZE >> SECTOR_SHIFT)); (PAGE_SIZE >> SECTOR_SHIFT));
num_bvecs = 1 + min_t(int, bio_get_nr_vecs(where->bdev), num_bvecs = min_t(int, bio_get_nr_vecs(where->bdev), num_bvecs);
num_bvecs);
if (unlikely(num_bvecs > BIO_MAX_PAGES))
num_bvecs = BIO_MAX_PAGES;
bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios); bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
bio->bi_sector = where->sector + (where->count - remaining); bio->bi_sector = where->sector + (where->count - remaining);
bio->bi_bdev = where->bdev; bio->bi_bdev = where->bdev;
bio->bi_end_io = endio; bio->bi_end_io = endio;
bio->bi_private = io;
bio->bi_destructor = dm_bio_destructor; bio->bi_destructor = dm_bio_destructor;
bio->bi_max_vecs--; store_io_and_region_in_bio(bio, io, region);
bio_set_region(bio, region);
/* /*
* Try and add as many pages as possible. * Try and add as many pages as possible.
...@@ -323,7 +343,7 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where, ...@@ -323,7 +343,7 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
atomic_inc(&io->count); atomic_inc(&io->count);
submit_bio(rw, bio); submit_bio(rw, bio);
} } while (remaining);
} }
static void dispatch_io(int rw, unsigned int num_regions, static void dispatch_io(int rw, unsigned int num_regions,
...@@ -333,6 +353,8 @@ static void dispatch_io(int rw, unsigned int num_regions, ...@@ -333,6 +353,8 @@ static void dispatch_io(int rw, unsigned int num_regions,
int i; int i;
struct dpages old_pages = *dp; struct dpages old_pages = *dp;
BUG_ON(num_regions > DM_IO_MAX_REGIONS);
if (sync) if (sync)
rw |= (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG); rw |= (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG);
...@@ -342,7 +364,7 @@ static void dispatch_io(int rw, unsigned int num_regions, ...@@ -342,7 +364,7 @@ static void dispatch_io(int rw, unsigned int num_regions,
*/ */
for (i = 0; i < num_regions; i++) { for (i = 0; i < num_regions; i++) {
*dp = old_pages; *dp = old_pages;
if (where[i].count) if (where[i].count || (rw & (1 << BIO_RW_BARRIER)))
do_region(rw, i, where + i, dp, io); do_region(rw, i, where + i, dp, io);
} }
...@@ -357,7 +379,14 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions, ...@@ -357,7 +379,14 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
struct dm_io_region *where, int rw, struct dpages *dp, struct dm_io_region *where, int rw, struct dpages *dp,
unsigned long *error_bits) unsigned long *error_bits)
{ {
struct io io; /*
* gcc <= 4.3 can't do the alignment for stack variables, so we must
* align it on our own.
* volatile prevents the optimizer from removing or reusing
* "io_" field from the stack frame (allowed in ANSI C).
*/
volatile char io_[sizeof(struct io) + __alignof__(struct io) - 1];
struct io *io = (struct io *)PTR_ALIGN(&io_, __alignof__(struct io));
if (num_regions > 1 && (rw & RW_MASK) != WRITE) { if (num_regions > 1 && (rw & RW_MASK) != WRITE) {
WARN_ON(1); WARN_ON(1);
...@@ -365,33 +394,33 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions, ...@@ -365,33 +394,33 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
} }
retry: retry:
io.error_bits = 0; io->error_bits = 0;
io.eopnotsupp_bits = 0; io->eopnotsupp_bits = 0;
atomic_set(&io.count, 1); /* see dispatch_io() */ atomic_set(&io->count, 1); /* see dispatch_io() */
io.sleeper = current; io->sleeper = current;
io.client = client; io->client = client;
dispatch_io(rw, num_regions, where, dp, &io, 1); dispatch_io(rw, num_regions, where, dp, io, 1);
while (1) { while (1) {
set_current_state(TASK_UNINTERRUPTIBLE); set_current_state(TASK_UNINTERRUPTIBLE);
if (!atomic_read(&io.count)) if (!atomic_read(&io->count))
break; break;
io_schedule(); io_schedule();
} }
set_current_state(TASK_RUNNING); set_current_state(TASK_RUNNING);
if (io.eopnotsupp_bits && (rw & (1 << BIO_RW_BARRIER))) { if (io->eopnotsupp_bits && (rw & (1 << BIO_RW_BARRIER))) {
rw &= ~(1 << BIO_RW_BARRIER); rw &= ~(1 << BIO_RW_BARRIER);
goto retry; goto retry;
} }
if (error_bits) if (error_bits)
*error_bits = io.error_bits; *error_bits = io->error_bits;
return io.error_bits ? -EIO : 0; return io->error_bits ? -EIO : 0;
} }
static int async_io(struct dm_io_client *client, unsigned int num_regions, static int async_io(struct dm_io_client *client, unsigned int num_regions,
...@@ -472,3 +501,18 @@ int dm_io(struct dm_io_request *io_req, unsigned num_regions, ...@@ -472,3 +501,18 @@ int dm_io(struct dm_io_request *io_req, unsigned num_regions,
&dp, io_req->notify.fn, io_req->notify.context); &dp, io_req->notify.fn, io_req->notify.context);
} }
EXPORT_SYMBOL(dm_io); EXPORT_SYMBOL(dm_io);
int __init dm_io_init(void)
{
_dm_io_cache = KMEM_CACHE(io, 0);
if (!_dm_io_cache)
return -ENOMEM;
return 0;
}
void dm_io_exit(void)
{
kmem_cache_destroy(_dm_io_cache);
_dm_io_cache = NULL;
}
...@@ -56,6 +56,11 @@ static void dm_hash_remove_all(int keep_open_devices); ...@@ -56,6 +56,11 @@ static void dm_hash_remove_all(int keep_open_devices);
*/ */
static DECLARE_RWSEM(_hash_lock); static DECLARE_RWSEM(_hash_lock);
/*
* Protects use of mdptr to obtain hash cell name and uuid from mapped device.
*/
static DEFINE_MUTEX(dm_hash_cells_mutex);
static void init_buckets(struct list_head *buckets) static void init_buckets(struct list_head *buckets)
{ {
unsigned int i; unsigned int i;
...@@ -206,7 +211,9 @@ static int dm_hash_insert(const char *name, const char *uuid, struct mapped_devi ...@@ -206,7 +211,9 @@ static int dm_hash_insert(const char *name, const char *uuid, struct mapped_devi
list_add(&cell->uuid_list, _uuid_buckets + hash_str(uuid)); list_add(&cell->uuid_list, _uuid_buckets + hash_str(uuid));
} }
dm_get(md); dm_get(md);
mutex_lock(&dm_hash_cells_mutex);
dm_set_mdptr(md, cell); dm_set_mdptr(md, cell);
mutex_unlock(&dm_hash_cells_mutex);
up_write(&_hash_lock); up_write(&_hash_lock);
return 0; return 0;
...@@ -224,9 +231,11 @@ static void __hash_remove(struct hash_cell *hc) ...@@ -224,9 +231,11 @@ static void __hash_remove(struct hash_cell *hc)
/* remove from the dev hash */ /* remove from the dev hash */
list_del(&hc->uuid_list); list_del(&hc->uuid_list);
list_del(&hc->name_list); list_del(&hc->name_list);
mutex_lock(&dm_hash_cells_mutex);
dm_set_mdptr(hc->md, NULL); dm_set_mdptr(hc->md, NULL);
mutex_unlock(&dm_hash_cells_mutex);
table = dm_get_table(hc->md); table = dm_get_live_table(hc->md);
if (table) { if (table) {
dm_table_event(table); dm_table_event(table);
dm_table_put(table); dm_table_put(table);
...@@ -321,13 +330,15 @@ static int dm_hash_rename(uint32_t cookie, const char *old, const char *new) ...@@ -321,13 +330,15 @@ static int dm_hash_rename(uint32_t cookie, const char *old, const char *new)
*/ */
list_del(&hc->name_list); list_del(&hc->name_list);
old_name = hc->name; old_name = hc->name;
mutex_lock(&dm_hash_cells_mutex);
hc->name = new_name; hc->name = new_name;
mutex_unlock(&dm_hash_cells_mutex);
list_add(&hc->name_list, _name_buckets + hash_str(new_name)); list_add(&hc->name_list, _name_buckets + hash_str(new_name));
/* /*
* Wake up any dm event waiters. * Wake up any dm event waiters.
*/ */
table = dm_get_table(hc->md); table = dm_get_live_table(hc->md);
if (table) { if (table) {
dm_table_event(table); dm_table_event(table);
dm_table_put(table); dm_table_put(table);
...@@ -512,8 +523,6 @@ static int list_versions(struct dm_ioctl *param, size_t param_size) ...@@ -512,8 +523,6 @@ static int list_versions(struct dm_ioctl *param, size_t param_size)
return 0; return 0;
} }
static int check_name(const char *name) static int check_name(const char *name)
{ {
if (strchr(name, '/')) { if (strchr(name, '/')) {
...@@ -524,6 +533,40 @@ static int check_name(const char *name) ...@@ -524,6 +533,40 @@ static int check_name(const char *name)
return 0; return 0;
} }
/*
* On successful return, the caller must not attempt to acquire
* _hash_lock without first calling dm_table_put, because dm_table_destroy
* waits for this dm_table_put and could be called under this lock.
*/
static struct dm_table *dm_get_inactive_table(struct mapped_device *md)
{
struct hash_cell *hc;
struct dm_table *table = NULL;
down_read(&_hash_lock);
hc = dm_get_mdptr(md);
if (!hc || hc->md != md) {
DMWARN("device has been removed from the dev hash table.");
goto out;
}
table = hc->new_map;
if (table)
dm_table_get(table);
out:
up_read(&_hash_lock);
return table;
}
static struct dm_table *dm_get_live_or_inactive_table(struct mapped_device *md,
struct dm_ioctl *param)
{
return (param->flags & DM_QUERY_INACTIVE_TABLE_FLAG) ?
dm_get_inactive_table(md) : dm_get_live_table(md);
}
/* /*
* Fills in a dm_ioctl structure, ready for sending back to * Fills in a dm_ioctl structure, ready for sending back to
* userland. * userland.
...@@ -536,7 +579,7 @@ static int __dev_status(struct mapped_device *md, struct dm_ioctl *param) ...@@ -536,7 +579,7 @@ static int __dev_status(struct mapped_device *md, struct dm_ioctl *param)
param->flags &= ~(DM_SUSPEND_FLAG | DM_READONLY_FLAG | param->flags &= ~(DM_SUSPEND_FLAG | DM_READONLY_FLAG |
DM_ACTIVE_PRESENT_FLAG); DM_ACTIVE_PRESENT_FLAG);
if (dm_suspended(md)) if (dm_suspended_md(md))
param->flags |= DM_SUSPEND_FLAG; param->flags |= DM_SUSPEND_FLAG;
param->dev = huge_encode_dev(disk_devt(disk)); param->dev = huge_encode_dev(disk_devt(disk));
...@@ -548,18 +591,30 @@ static int __dev_status(struct mapped_device *md, struct dm_ioctl *param) ...@@ -548,18 +591,30 @@ static int __dev_status(struct mapped_device *md, struct dm_ioctl *param)
*/ */
param->open_count = dm_open_count(md); param->open_count = dm_open_count(md);
if (get_disk_ro(disk))
param->flags |= DM_READONLY_FLAG;
param->event_nr = dm_get_event_nr(md); param->event_nr = dm_get_event_nr(md);
param->target_count = 0;
table = dm_get_table(md); table = dm_get_live_table(md);
if (table) { if (table) {
param->flags |= DM_ACTIVE_PRESENT_FLAG; if (!(param->flags & DM_QUERY_INACTIVE_TABLE_FLAG)) {
param->target_count = dm_table_get_num_targets(table); if (get_disk_ro(disk))
param->flags |= DM_READONLY_FLAG;
param->target_count = dm_table_get_num_targets(table);
}
dm_table_put(table); dm_table_put(table);
} else
param->target_count = 0; param->flags |= DM_ACTIVE_PRESENT_FLAG;
}
if (param->flags & DM_QUERY_INACTIVE_TABLE_FLAG) {
table = dm_get_inactive_table(md);
if (table) {
if (!(dm_table_get_mode(table) & FMODE_WRITE))
param->flags |= DM_READONLY_FLAG;
param->target_count = dm_table_get_num_targets(table);
dm_table_put(table);
}
}
return 0; return 0;
} }
...@@ -634,9 +689,9 @@ static struct mapped_device *find_device(struct dm_ioctl *param) ...@@ -634,9 +689,9 @@ static struct mapped_device *find_device(struct dm_ioctl *param)
* Sneakily write in both the name and the uuid * Sneakily write in both the name and the uuid
* while we have the cell. * while we have the cell.
*/ */
strncpy(param->name, hc->name, sizeof(param->name)); strlcpy(param->name, hc->name, sizeof(param->name));
if (hc->uuid) if (hc->uuid)
strncpy(param->uuid, hc->uuid, sizeof(param->uuid)-1); strlcpy(param->uuid, hc->uuid, sizeof(param->uuid));
else else
param->uuid[0] = '\0'; param->uuid[0] = '\0';
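/*
 * strlcpy() both truncates safely and guarantees NUL termination, which
 * strncpy() does not when the source fills the buffer; it also avoids
 * strncpy()'s zero-padding of the rest of the destination.
 */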
...@@ -784,7 +839,7 @@ static int do_suspend(struct dm_ioctl *param) ...@@ -784,7 +839,7 @@ static int do_suspend(struct dm_ioctl *param)
if (param->flags & DM_NOFLUSH_FLAG) if (param->flags & DM_NOFLUSH_FLAG)
suspend_flags |= DM_SUSPEND_NOFLUSH_FLAG; suspend_flags |= DM_SUSPEND_NOFLUSH_FLAG;
if (!dm_suspended(md)) if (!dm_suspended_md(md))
r = dm_suspend(md, suspend_flags); r = dm_suspend(md, suspend_flags);
if (!r) if (!r)
...@@ -800,7 +855,7 @@ static int do_resume(struct dm_ioctl *param) ...@@ -800,7 +855,7 @@ static int do_resume(struct dm_ioctl *param)
unsigned suspend_flags = DM_SUSPEND_LOCKFS_FLAG; unsigned suspend_flags = DM_SUSPEND_LOCKFS_FLAG;
struct hash_cell *hc; struct hash_cell *hc;
struct mapped_device *md; struct mapped_device *md;
struct dm_table *new_map; struct dm_table *new_map, *old_map = NULL;
down_write(&_hash_lock); down_write(&_hash_lock);
...@@ -826,14 +881,14 @@ static int do_resume(struct dm_ioctl *param) ...@@ -826,14 +881,14 @@ static int do_resume(struct dm_ioctl *param)
suspend_flags &= ~DM_SUSPEND_LOCKFS_FLAG; suspend_flags &= ~DM_SUSPEND_LOCKFS_FLAG;
if (param->flags & DM_NOFLUSH_FLAG) if (param->flags & DM_NOFLUSH_FLAG)
suspend_flags |= DM_SUSPEND_NOFLUSH_FLAG; suspend_flags |= DM_SUSPEND_NOFLUSH_FLAG;
if (!dm_suspended(md)) if (!dm_suspended_md(md))
dm_suspend(md, suspend_flags); dm_suspend(md, suspend_flags);
r = dm_swap_table(md, new_map); old_map = dm_swap_table(md, new_map);
if (r) { if (IS_ERR(old_map)) {
dm_table_destroy(new_map); dm_table_destroy(new_map);
dm_put(md); dm_put(md);
return r; return PTR_ERR(old_map);
} }
if (dm_table_get_mode(new_map) & FMODE_WRITE) if (dm_table_get_mode(new_map) & FMODE_WRITE)
...@@ -842,9 +897,11 @@ static int do_resume(struct dm_ioctl *param) ...@@ -842,9 +897,11 @@ static int do_resume(struct dm_ioctl *param)
set_disk_ro(dm_disk(md), 1); set_disk_ro(dm_disk(md), 1);
} }
if (dm_suspended(md)) if (dm_suspended_md(md))
r = dm_resume(md); r = dm_resume(md);
if (old_map)
dm_table_destroy(old_map);
if (!r) { if (!r) {
dm_kobject_uevent(md, KOBJ_CHANGE, param->event_nr); dm_kobject_uevent(md, KOBJ_CHANGE, param->event_nr);
...@@ -982,7 +1039,7 @@ static int dev_wait(struct dm_ioctl *param, size_t param_size) ...@@ -982,7 +1039,7 @@ static int dev_wait(struct dm_ioctl *param, size_t param_size)
if (r) if (r)
goto out; goto out;
table = dm_get_table(md); table = dm_get_live_or_inactive_table(md, param);
if (table) { if (table) {
retrieve_status(table, param, param_size); retrieve_status(table, param, param_size);
dm_table_put(table); dm_table_put(table);
...@@ -1215,7 +1272,7 @@ static int table_deps(struct dm_ioctl *param, size_t param_size) ...@@ -1215,7 +1272,7 @@ static int table_deps(struct dm_ioctl *param, size_t param_size)
if (r) if (r)
goto out; goto out;
table = dm_get_table(md); table = dm_get_live_or_inactive_table(md, param);
if (table) { if (table) {
retrieve_deps(table, param, param_size); retrieve_deps(table, param, param_size);
dm_table_put(table); dm_table_put(table);
...@@ -1244,13 +1301,13 @@ static int table_status(struct dm_ioctl *param, size_t param_size) ...@@ -1244,13 +1301,13 @@ static int table_status(struct dm_ioctl *param, size_t param_size)
if (r) if (r)
goto out; goto out;
table = dm_get_table(md); table = dm_get_live_or_inactive_table(md, param);
if (table) { if (table) {
retrieve_status(table, param, param_size); retrieve_status(table, param, param_size);
dm_table_put(table); dm_table_put(table);
} }
out: out:
dm_put(md); dm_put(md);
return r; return r;
} }
...@@ -1288,10 +1345,15 @@ static int target_message(struct dm_ioctl *param, size_t param_size) ...@@ -1288,10 +1345,15 @@ static int target_message(struct dm_ioctl *param, size_t param_size)
goto out; goto out;
} }
table = dm_get_table(md); table = dm_get_live_table(md);
if (!table) if (!table)
goto out_argv; goto out_argv;
if (dm_deleting_md(md)) {
r = -ENXIO;
goto out_table;
}
ti = dm_table_find_target(table, tmsg->sector); ti = dm_table_find_target(table, tmsg->sector);
if (!dm_target_is_valid(ti)) { if (!dm_target_is_valid(ti)) {
DMWARN("Target message sector outside device."); DMWARN("Target message sector outside device.");
...@@ -1303,6 +1365,7 @@ static int target_message(struct dm_ioctl *param, size_t param_size) ...@@ -1303,6 +1365,7 @@ static int target_message(struct dm_ioctl *param, size_t param_size)
r = -EINVAL; r = -EINVAL;
} }
out_table:
dm_table_put(table); dm_table_put(table);
out_argv: out_argv:
kfree(argv); kfree(argv);
...@@ -1582,8 +1645,7 @@ int dm_copy_name_and_uuid(struct mapped_device *md, char *name, char *uuid) ...@@ -1582,8 +1645,7 @@ int dm_copy_name_and_uuid(struct mapped_device *md, char *name, char *uuid)
if (!md) if (!md)
return -ENXIO; return -ENXIO;
dm_get(md); mutex_lock(&dm_hash_cells_mutex);
down_read(&_hash_lock);
hc = dm_get_mdptr(md); hc = dm_get_mdptr(md);
if (!hc || hc->md != md) { if (!hc || hc->md != md) {
r = -ENXIO; r = -ENXIO;
...@@ -1596,8 +1658,7 @@ int dm_copy_name_and_uuid(struct mapped_device *md, char *name, char *uuid) ...@@ -1596,8 +1658,7 @@ int dm_copy_name_and_uuid(struct mapped_device *md, char *name, char *uuid)
strcpy(uuid, hc->uuid ? : ""); strcpy(uuid, hc->uuid ? : "");
out: out:
up_read(&_hash_lock); mutex_unlock(&dm_hash_cells_mutex);
dm_put(md);
return r; return r;
} }
...@@ -450,7 +450,10 @@ static void dispatch_job(struct kcopyd_job *job) ...@@ -450,7 +450,10 @@ static void dispatch_job(struct kcopyd_job *job)
{ {
struct dm_kcopyd_client *kc = job->kc; struct dm_kcopyd_client *kc = job->kc;
atomic_inc(&kc->nr_jobs); atomic_inc(&kc->nr_jobs);
push(&kc->pages_jobs, job); if (unlikely(!job->source.count))
push(&kc->complete_jobs, job);
else
push(&kc->pages_jobs, job);
wake(kc); wake(kc);
} }
......
...@@ -145,8 +145,9 @@ int dm_dirty_log_type_unregister(struct dm_dirty_log_type *type) ...@@ -145,8 +145,9 @@ int dm_dirty_log_type_unregister(struct dm_dirty_log_type *type)
EXPORT_SYMBOL(dm_dirty_log_type_unregister); EXPORT_SYMBOL(dm_dirty_log_type_unregister);
struct dm_dirty_log *dm_dirty_log_create(const char *type_name, struct dm_dirty_log *dm_dirty_log_create(const char *type_name,
struct dm_target *ti, struct dm_target *ti,
unsigned int argc, char **argv) int (*flush_callback_fn)(struct dm_target *ti),
unsigned int argc, char **argv)
{ {
struct dm_dirty_log_type *type; struct dm_dirty_log_type *type;
struct dm_dirty_log *log; struct dm_dirty_log *log;
...@@ -161,6 +162,7 @@ struct dm_dirty_log *dm_dirty_log_create(const char *type_name, ...@@ -161,6 +162,7 @@ struct dm_dirty_log *dm_dirty_log_create(const char *type_name,
return NULL; return NULL;
} }
log->flush_callback_fn = flush_callback_fn;
log->type = type; log->type = type;
if (type->ctr(log, ti, argc, argv)) { if (type->ctr(log, ti, argc, argv)) {
kfree(log); kfree(log);
...@@ -208,7 +210,9 @@ struct log_header { ...@@ -208,7 +210,9 @@ struct log_header {
struct log_c { struct log_c {
struct dm_target *ti; struct dm_target *ti;
int touched; int touched_dirtied;
int touched_cleaned;
int flush_failed;
uint32_t region_size; uint32_t region_size;
unsigned int region_count; unsigned int region_count;
region_t sync_count; region_t sync_count;
...@@ -233,6 +237,7 @@ struct log_c { ...@@ -233,6 +237,7 @@ struct log_c {
* Disk log fields * Disk log fields
*/ */
int log_dev_failed; int log_dev_failed;
int log_dev_flush_failed;
struct dm_dev *log_dev; struct dm_dev *log_dev;
struct log_header header; struct log_header header;
...@@ -253,14 +258,14 @@ static inline void log_set_bit(struct log_c *l, ...@@ -253,14 +258,14 @@ static inline void log_set_bit(struct log_c *l,
uint32_t *bs, unsigned bit) uint32_t *bs, unsigned bit)
{ {
ext2_set_bit(bit, (unsigned long *) bs); ext2_set_bit(bit, (unsigned long *) bs);
l->touched = 1; l->touched_cleaned = 1;
} }
static inline void log_clear_bit(struct log_c *l, static inline void log_clear_bit(struct log_c *l,
uint32_t *bs, unsigned bit) uint32_t *bs, unsigned bit)
{ {
ext2_clear_bit(bit, (unsigned long *) bs); ext2_clear_bit(bit, (unsigned long *) bs);
l->touched = 1; l->touched_dirtied = 1;
} }
/*---------------------------------------------------------------- /*----------------------------------------------------------------
...@@ -287,6 +292,19 @@ static int rw_header(struct log_c *lc, int rw) ...@@ -287,6 +292,19 @@ static int rw_header(struct log_c *lc, int rw)
return dm_io(&lc->io_req, 1, &lc->header_location, NULL); return dm_io(&lc->io_req, 1, &lc->header_location, NULL);
} }
static int flush_header(struct log_c *lc)
{
struct dm_io_region null_location = {
.bdev = lc->header_location.bdev,
.sector = 0,
.count = 0,
};
lc->io_req.bi_rw = WRITE_BARRIER;
return dm_io(&lc->io_req, 1, &null_location, NULL);
}
static int read_header(struct log_c *log) static int read_header(struct log_c *log)
{ {
int r; int r;
...@@ -378,7 +396,9 @@ static int create_log_context(struct dm_dirty_log *log, struct dm_target *ti, ...@@ -378,7 +396,9 @@ static int create_log_context(struct dm_dirty_log *log, struct dm_target *ti,
} }
lc->ti = ti; lc->ti = ti;
lc->touched = 0; lc->touched_dirtied = 0;
lc->touched_cleaned = 0;
lc->flush_failed = 0;
lc->region_size = region_size; lc->region_size = region_size;
lc->region_count = region_count; lc->region_count = region_count;
lc->sync = sync; lc->sync = sync;
...@@ -406,6 +426,7 @@ static int create_log_context(struct dm_dirty_log *log, struct dm_target *ti, ...@@ -406,6 +426,7 @@ static int create_log_context(struct dm_dirty_log *log, struct dm_target *ti,
} else { } else {
lc->log_dev = dev; lc->log_dev = dev;
lc->log_dev_failed = 0; lc->log_dev_failed = 0;
lc->log_dev_flush_failed = 0;
lc->header_location.bdev = lc->log_dev->bdev; lc->header_location.bdev = lc->log_dev->bdev;
lc->header_location.sector = 0; lc->header_location.sector = 0;
...@@ -614,6 +635,11 @@ static int disk_resume(struct dm_dirty_log *log) ...@@ -614,6 +635,11 @@ static int disk_resume(struct dm_dirty_log *log)
/* write the new header */ /* write the new header */
r = rw_header(lc, WRITE); r = rw_header(lc, WRITE);
if (!r) {
r = flush_header(lc);
if (r)
lc->log_dev_flush_failed = 1;
}
if (r) { if (r) {
DMWARN("%s: Failed to write header on dirty region log device", DMWARN("%s: Failed to write header on dirty region log device",
lc->log_dev->name); lc->log_dev->name);
...@@ -656,18 +682,40 @@ static int core_flush(struct dm_dirty_log *log) ...@@ -656,18 +682,40 @@ static int core_flush(struct dm_dirty_log *log)
static int disk_flush(struct dm_dirty_log *log) static int disk_flush(struct dm_dirty_log *log)
{ {
int r; int r, i;
struct log_c *lc = (struct log_c *) log->context; struct log_c *lc = log->context;
/* only write if the log has changed */ /* only write if the log has changed */
if (!lc->touched) if (!lc->touched_cleaned && !lc->touched_dirtied)
return 0; return 0;
if (lc->touched_cleaned && log->flush_callback_fn &&
log->flush_callback_fn(lc->ti)) {
/*
* At this point it is impossible to determine which
* regions are clean and which are dirty (without
* re-reading the log off disk). So mark all of them
* dirty.
*/
lc->flush_failed = 1;
for (i = 0; i < lc->region_count; i++)
log_clear_bit(lc, lc->clean_bits, i);
}
r = rw_header(lc, WRITE); r = rw_header(lc, WRITE);
if (r) if (r)
fail_log_device(lc); fail_log_device(lc);
else else {
lc->touched = 0; if (lc->touched_dirtied) {
r = flush_header(lc);
if (r) {
lc->log_dev_flush_failed = 1;
fail_log_device(lc);
} else
lc->touched_dirtied = 0;
}
lc->touched_cleaned = 0;
}
return r; return r;
} }
...@@ -681,7 +729,8 @@ static void core_mark_region(struct dm_dirty_log *log, region_t region) ...@@ -681,7 +729,8 @@ static void core_mark_region(struct dm_dirty_log *log, region_t region)
static void core_clear_region(struct dm_dirty_log *log, region_t region) static void core_clear_region(struct dm_dirty_log *log, region_t region)
{ {
struct log_c *lc = (struct log_c *) log->context; struct log_c *lc = (struct log_c *) log->context;
log_set_bit(lc, lc->clean_bits, region); if (likely(!lc->flush_failed))
log_set_bit(lc, lc->clean_bits, region);
} }
static int core_get_resync_work(struct dm_dirty_log *log, region_t *region) static int core_get_resync_work(struct dm_dirty_log *log, region_t *region)
...@@ -762,7 +811,9 @@ static int disk_status(struct dm_dirty_log *log, status_type_t status, ...@@ -762,7 +811,9 @@ static int disk_status(struct dm_dirty_log *log, status_type_t status,
switch(status) { switch(status) {
case STATUSTYPE_INFO: case STATUSTYPE_INFO:
DMEMIT("3 %s %s %c", log->type->name, lc->log_dev->name, DMEMIT("3 %s %s %c", log->type->name, lc->log_dev->name,
lc->log_dev_failed ? 'D' : 'A'); lc->log_dev_flush_failed ? 'F' :
lc->log_dev_failed ? 'D' :
'A');
break; break;
case STATUSTYPE_TABLE: case STATUSTYPE_TABLE:
......
...@@ -93,6 +93,10 @@ struct multipath { ...@@ -93,6 +93,10 @@ struct multipath {
* can resubmit bios on error. * can resubmit bios on error.
*/ */
mempool_t *mpio_pool; mempool_t *mpio_pool;
struct mutex work_mutex;
unsigned suspended; /* Don't create new I/O internally when set. */
}; };
/* /*
...@@ -198,6 +202,7 @@ static struct multipath *alloc_multipath(struct dm_target *ti) ...@@ -198,6 +202,7 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
m->queue_io = 1; m->queue_io = 1;
INIT_WORK(&m->process_queued_ios, process_queued_ios); INIT_WORK(&m->process_queued_ios, process_queued_ios);
INIT_WORK(&m->trigger_event, trigger_event); INIT_WORK(&m->trigger_event, trigger_event);
mutex_init(&m->work_mutex);
m->mpio_pool = mempool_create_slab_pool(MIN_IOS, _mpio_cache); m->mpio_pool = mempool_create_slab_pool(MIN_IOS, _mpio_cache);
if (!m->mpio_pool) { if (!m->mpio_pool) {
kfree(m); kfree(m);
...@@ -885,13 +890,18 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc, ...@@ -885,13 +890,18 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc,
return r; return r;
} }
static void multipath_dtr(struct dm_target *ti) static void flush_multipath_work(void)
{ {
struct multipath *m = (struct multipath *) ti->private;
flush_workqueue(kmpath_handlerd); flush_workqueue(kmpath_handlerd);
flush_workqueue(kmultipathd); flush_workqueue(kmultipathd);
flush_scheduled_work(); flush_scheduled_work();
}
static void multipath_dtr(struct dm_target *ti)
{
struct multipath *m = ti->private;
flush_multipath_work();
free_multipath(m); free_multipath(m);
} }
...@@ -1261,6 +1271,16 @@ static void multipath_presuspend(struct dm_target *ti) ...@@ -1261,6 +1271,16 @@ static void multipath_presuspend(struct dm_target *ti)
queue_if_no_path(m, 0, 1); queue_if_no_path(m, 0, 1);
} }
static void multipath_postsuspend(struct dm_target *ti)
{
struct multipath *m = ti->private;
mutex_lock(&m->work_mutex);
m->suspended = 1;
flush_multipath_work();
mutex_unlock(&m->work_mutex);
}
/* /*
* Restore the queue_if_no_path setting. * Restore the queue_if_no_path setting.
*/ */
...@@ -1269,6 +1289,10 @@ static void multipath_resume(struct dm_target *ti) ...@@ -1269,6 +1289,10 @@ static void multipath_resume(struct dm_target *ti)
struct multipath *m = (struct multipath *) ti->private; struct multipath *m = (struct multipath *) ti->private;
unsigned long flags; unsigned long flags;
mutex_lock(&m->work_mutex);
m->suspended = 0;
mutex_unlock(&m->work_mutex);
spin_lock_irqsave(&m->lock, flags); spin_lock_irqsave(&m->lock, flags);
m->queue_if_no_path = m->saved_queue_if_no_path; m->queue_if_no_path = m->saved_queue_if_no_path;
spin_unlock_irqrestore(&m->lock, flags); spin_unlock_irqrestore(&m->lock, flags);
...@@ -1397,51 +1421,71 @@ static int multipath_status(struct dm_target *ti, status_type_t type, ...@@ -1397,51 +1421,71 @@ static int multipath_status(struct dm_target *ti, status_type_t type,
static int multipath_message(struct dm_target *ti, unsigned argc, char **argv) static int multipath_message(struct dm_target *ti, unsigned argc, char **argv)
{ {
int r; int r = -EINVAL;
struct dm_dev *dev; struct dm_dev *dev;
struct multipath *m = (struct multipath *) ti->private; struct multipath *m = (struct multipath *) ti->private;
action_fn action; action_fn action;
mutex_lock(&m->work_mutex);
if (m->suspended) {
r = -EBUSY;
goto out;
}
if (dm_suspended(ti)) {
r = -EBUSY;
goto out;
}
if (argc == 1) { if (argc == 1) {
if (!strnicmp(argv[0], MESG_STR("queue_if_no_path"))) if (!strnicmp(argv[0], MESG_STR("queue_if_no_path"))) {
return queue_if_no_path(m, 1, 0); r = queue_if_no_path(m, 1, 0);
else if (!strnicmp(argv[0], MESG_STR("fail_if_no_path"))) goto out;
return queue_if_no_path(m, 0, 0); } else if (!strnicmp(argv[0], MESG_STR("fail_if_no_path"))) {
r = queue_if_no_path(m, 0, 0);
goto out;
}
} }
if (argc != 2) if (argc != 2) {
goto error; DMWARN("Unrecognised multipath message received.");
goto out;
}
if (!strnicmp(argv[0], MESG_STR("disable_group"))) if (!strnicmp(argv[0], MESG_STR("disable_group"))) {
return bypass_pg_num(m, argv[1], 1); r = bypass_pg_num(m, argv[1], 1);
else if (!strnicmp(argv[0], MESG_STR("enable_group"))) goto out;
return bypass_pg_num(m, argv[1], 0); } else if (!strnicmp(argv[0], MESG_STR("enable_group"))) {
else if (!strnicmp(argv[0], MESG_STR("switch_group"))) r = bypass_pg_num(m, argv[1], 0);
return switch_pg_num(m, argv[1]); goto out;
else if (!strnicmp(argv[0], MESG_STR("reinstate_path"))) } else if (!strnicmp(argv[0], MESG_STR("switch_group"))) {
r = switch_pg_num(m, argv[1]);
goto out;
} else if (!strnicmp(argv[0], MESG_STR("reinstate_path")))
action = reinstate_path; action = reinstate_path;
else if (!strnicmp(argv[0], MESG_STR("fail_path"))) else if (!strnicmp(argv[0], MESG_STR("fail_path")))
action = fail_path; action = fail_path;
else else {
goto error; DMWARN("Unrecognised multipath message received.");
goto out;
}
r = dm_get_device(ti, argv[1], ti->begin, ti->len, r = dm_get_device(ti, argv[1], ti->begin, ti->len,
dm_table_get_mode(ti->table), &dev); dm_table_get_mode(ti->table), &dev);
if (r) { if (r) {
DMWARN("message: error getting device %s", DMWARN("message: error getting device %s",
argv[1]); argv[1]);
return -EINVAL; goto out;
} }
r = action_dev(m, dev, action); r = action_dev(m, dev, action);
dm_put_device(ti, dev); dm_put_device(ti, dev);
out:
mutex_unlock(&m->work_mutex);
return r; return r;
error:
DMWARN("Unrecognised multipath message received.");
return -EINVAL;
} }
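
The restructured multipath_message() above replaces the scattered early returns with a single exit path so that m->work_mutex is always released, and it now rejects messages with -EBUSY whenever the target is suspended, since a message could otherwise start new internal I/O. A minimal sketch of that gate, with a hypothetical handler standing in for queue_if_no_path() and the other message actions (build with -pthread):

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct toy_mpath {
    pthread_mutex_t work_mutex;
    int suspended;              /* don't create new internal I/O when set */
};

/* Hypothetical stand-in for queue_if_no_path() and the other handlers. */
static int toy_queue_if_no_path(struct toy_mpath *m, int enable)
{
    (void)m;
    (void)enable;
    return 0;
}

static int toy_message(struct toy_mpath *m, int argc, char **argv)
{
    int r = -EINVAL;

    pthread_mutex_lock(&m->work_mutex);

    if (m->suspended) {                 /* reject messages while suspended */
        r = -EBUSY;
        goto out;
    }

    if (argc == 1 && !strcmp(argv[0], "queue_if_no_path"))
        r = toy_queue_if_no_path(m, 1);
    else if (argc == 1 && !strcmp(argv[0], "fail_if_no_path"))
        r = toy_queue_if_no_path(m, 0);
    else
        fprintf(stderr, "Unrecognised multipath message received.\n");
out:
    pthread_mutex_unlock(&m->work_mutex);   /* one unlock on every path */
    return r;
}

int main(void)
{
    struct toy_mpath m = {
        .work_mutex = PTHREAD_MUTEX_INITIALIZER,
        .suspended = 0,
    };
    char *msg[] = { "queue_if_no_path" };

    printf("running:   %d\n", toy_message(&m, 1, msg));
    m.suspended = 1;
    printf("suspended: %d (-EBUSY = %d)\n", toy_message(&m, 1, msg), -EBUSY);
    return 0;
}
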
static int multipath_ioctl(struct dm_target *ti, unsigned int cmd, static int multipath_ioctl(struct dm_target *ti, unsigned int cmd,
...@@ -1567,13 +1611,14 @@ static int multipath_busy(struct dm_target *ti) ...@@ -1567,13 +1611,14 @@ static int multipath_busy(struct dm_target *ti)
*---------------------------------------------------------------*/ *---------------------------------------------------------------*/
static struct target_type multipath_target = { static struct target_type multipath_target = {
.name = "multipath", .name = "multipath",
.version = {1, 1, 0}, .version = {1, 1, 1},
.module = THIS_MODULE, .module = THIS_MODULE,
.ctr = multipath_ctr, .ctr = multipath_ctr,
.dtr = multipath_dtr, .dtr = multipath_dtr,
.map_rq = multipath_map, .map_rq = multipath_map,
.rq_end_io = multipath_end_io, .rq_end_io = multipath_end_io,
.presuspend = multipath_presuspend, .presuspend = multipath_presuspend,
.postsuspend = multipath_postsuspend,
.resume = multipath_resume, .resume = multipath_resume,
.status = multipath_status, .status = multipath_status,
.message = multipath_message, .message = multipath_message,
......
...@@ -35,6 +35,7 @@ static DECLARE_WAIT_QUEUE_HEAD(_kmirrord_recovery_stopped); ...@@ -35,6 +35,7 @@ static DECLARE_WAIT_QUEUE_HEAD(_kmirrord_recovery_stopped);
*---------------------------------------------------------------*/ *---------------------------------------------------------------*/
enum dm_raid1_error { enum dm_raid1_error {
DM_RAID1_WRITE_ERROR, DM_RAID1_WRITE_ERROR,
DM_RAID1_FLUSH_ERROR,
DM_RAID1_SYNC_ERROR, DM_RAID1_SYNC_ERROR,
DM_RAID1_READ_ERROR DM_RAID1_READ_ERROR
}; };
...@@ -57,6 +58,7 @@ struct mirror_set { ...@@ -57,6 +58,7 @@ struct mirror_set {
struct bio_list reads; struct bio_list reads;
struct bio_list writes; struct bio_list writes;
struct bio_list failures; struct bio_list failures;
struct bio_list holds; /* bios are waiting until suspend */
struct dm_region_hash *rh; struct dm_region_hash *rh;
struct dm_kcopyd_client *kcopyd_client; struct dm_kcopyd_client *kcopyd_client;
...@@ -67,6 +69,7 @@ struct mirror_set { ...@@ -67,6 +69,7 @@ struct mirror_set {
region_t nr_regions; region_t nr_regions;
int in_sync; int in_sync;
int log_failure; int log_failure;
int leg_failure;
atomic_t suspend; atomic_t suspend;
atomic_t default_mirror; /* Default mirror */ atomic_t default_mirror; /* Default mirror */
...@@ -179,6 +182,17 @@ static void set_default_mirror(struct mirror *m) ...@@ -179,6 +182,17 @@ static void set_default_mirror(struct mirror *m)
atomic_set(&ms->default_mirror, m - m0); atomic_set(&ms->default_mirror, m - m0);
} }
static struct mirror *get_valid_mirror(struct mirror_set *ms)
{
struct mirror *m;
for (m = ms->mirror; m < ms->mirror + ms->nr_mirrors; m++)
if (!atomic_read(&m->error_count))
return m;
return NULL;
}
/* fail_mirror /* fail_mirror
* @m: mirror device to fail * @m: mirror device to fail
* @error_type: one of the enum's, DM_RAID1_*_ERROR * @error_type: one of the enum's, DM_RAID1_*_ERROR
...@@ -198,6 +212,8 @@ static void fail_mirror(struct mirror *m, enum dm_raid1_error error_type) ...@@ -198,6 +212,8 @@ static void fail_mirror(struct mirror *m, enum dm_raid1_error error_type)
struct mirror_set *ms = m->ms; struct mirror_set *ms = m->ms;
struct mirror *new; struct mirror *new;
ms->leg_failure = 1;
/* /*
* error_count is used for nothing more than a * error_count is used for nothing more than a
* simple way to tell if a device has encountered * simple way to tell if a device has encountered
...@@ -224,19 +240,50 @@ static void fail_mirror(struct mirror *m, enum dm_raid1_error error_type) ...@@ -224,19 +240,50 @@ static void fail_mirror(struct mirror *m, enum dm_raid1_error error_type)
goto out; goto out;
} }
for (new = ms->mirror; new < ms->mirror + ms->nr_mirrors; new++) new = get_valid_mirror(ms);
if (!atomic_read(&new->error_count)) { if (new)
set_default_mirror(new); set_default_mirror(new);
break; else
}
if (unlikely(new == ms->mirror + ms->nr_mirrors))
DMWARN("All sides of mirror have failed."); DMWARN("All sides of mirror have failed.");
out: out:
schedule_work(&ms->trigger_event); schedule_work(&ms->trigger_event);
} }
static int mirror_flush(struct dm_target *ti)
{
struct mirror_set *ms = ti->private;
unsigned long error_bits;
unsigned int i;
struct dm_io_region io[ms->nr_mirrors];
struct mirror *m;
struct dm_io_request io_req = {
.bi_rw = WRITE_BARRIER,
.mem.type = DM_IO_KMEM,
.mem.ptr.bvec = NULL,
.client = ms->io_client,
};
for (i = 0, m = ms->mirror; i < ms->nr_mirrors; i++, m++) {
io[i].bdev = m->dev->bdev;
io[i].sector = 0;
io[i].count = 0;
}
error_bits = -1;
dm_io(&io_req, ms->nr_mirrors, io, &error_bits);
if (unlikely(error_bits != 0)) {
for (i = 0; i < ms->nr_mirrors; i++)
if (test_bit(i, &error_bits))
fail_mirror(ms->mirror + i,
DM_RAID1_FLUSH_ERROR);
return -EIO;
}
return 0;
}
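
mirror_flush() above issues one zero-length barrier to every leg through a single dm_io() call and receives a bitmask with one bit per leg; each set bit marks the corresponding leg failed with the new DM_RAID1_FLUSH_ERROR reason. A toy decoder for such a per-device error bitmask (hypothetical types; the real code calls fail_mirror()):

#include <stdio.h>

#define TOY_NR_MIRRORS 3
#define TOY_EIO 5

struct toy_mirror {
    const char *name;
    int failed;
};

/* Decode a completion bitmask: bit i set means leg i failed its flush. */
static int toy_handle_flush_errors(struct toy_mirror *m, unsigned nr,
                                   unsigned long error_bits)
{
    unsigned i;
    int r = 0;

    for (i = 0; i < nr; i++) {
        if (error_bits & (1UL << i)) {  /* like test_bit(i, &error_bits) */
            m[i].failed = 1;            /* fail_mirror(..., FLUSH_ERROR) */
            r = -TOY_EIO;
        }
    }
    return r;
}

int main(void)
{
    struct toy_mirror legs[TOY_NR_MIRRORS] = {
        { "leg0", 0 }, { "leg1", 0 }, { "leg2", 0 },
    };
    unsigned i;
    int r = toy_handle_flush_errors(legs, TOY_NR_MIRRORS, 1UL << 1);

    for (i = 0; i < TOY_NR_MIRRORS; i++)
        printf("%s: %s\n", legs[i].name, legs[i].failed ? "failed" : "ok");
    printf("flush result: %d\n", r);
    return 0;
}
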
/*----------------------------------------------------------------- /*-----------------------------------------------------------------
* Recovery. * Recovery.
* *
...@@ -396,6 +443,8 @@ static int mirror_available(struct mirror_set *ms, struct bio *bio) ...@@ -396,6 +443,8 @@ static int mirror_available(struct mirror_set *ms, struct bio *bio)
*/ */
static sector_t map_sector(struct mirror *m, struct bio *bio) static sector_t map_sector(struct mirror *m, struct bio *bio)
{ {
if (unlikely(!bio->bi_size))
return 0;
return m->offset + (bio->bi_sector - m->ms->ti->begin); return m->offset + (bio->bi_sector - m->ms->ti->begin);
} }
...@@ -413,6 +462,27 @@ static void map_region(struct dm_io_region *io, struct mirror *m, ...@@ -413,6 +462,27 @@ static void map_region(struct dm_io_region *io, struct mirror *m,
io->count = bio->bi_size >> 9; io->count = bio->bi_size >> 9;
} }
static void hold_bio(struct mirror_set *ms, struct bio *bio)
{
/*
* If device is suspended, complete the bio.
*/
if (atomic_read(&ms->suspend)) {
if (dm_noflush_suspending(ms->ti))
bio_endio(bio, DM_ENDIO_REQUEUE);
else
bio_endio(bio, -EIO);
return;
}
/*
* Hold bio until the suspend is complete.
*/
spin_lock_irq(&ms->lock);
bio_list_add(&ms->holds, bio);
spin_unlock_irq(&ms->lock);
}
/*----------------------------------------------------------------- /*-----------------------------------------------------------------
* Reads * Reads
*---------------------------------------------------------------*/ *---------------------------------------------------------------*/
...@@ -511,7 +581,6 @@ static void write_callback(unsigned long error, void *context) ...@@ -511,7 +581,6 @@ static void write_callback(unsigned long error, void *context)
unsigned i, ret = 0; unsigned i, ret = 0;
struct bio *bio = (struct bio *) context; struct bio *bio = (struct bio *) context;
struct mirror_set *ms; struct mirror_set *ms;
int uptodate = 0;
int should_wake = 0; int should_wake = 0;
unsigned long flags; unsigned long flags;
...@@ -524,36 +593,27 @@ static void write_callback(unsigned long error, void *context) ...@@ -524,36 +593,27 @@ static void write_callback(unsigned long error, void *context)
* This way we handle both writes to SYNC and NOSYNC * This way we handle both writes to SYNC and NOSYNC
* regions with the same code. * regions with the same code.
*/ */
if (likely(!error)) if (likely(!error)) {
goto out; bio_endio(bio, ret);
return;
}
for (i = 0; i < ms->nr_mirrors; i++) for (i = 0; i < ms->nr_mirrors; i++)
if (test_bit(i, &error)) if (test_bit(i, &error))
fail_mirror(ms->mirror + i, DM_RAID1_WRITE_ERROR); fail_mirror(ms->mirror + i, DM_RAID1_WRITE_ERROR);
else
uptodate = 1;
if (unlikely(!uptodate)) { /*
DMERR("All replicated volumes dead, failing I/O"); * Need to raise event. Since raising
/* None of the writes succeeded, fail the I/O. */ * events can block, we need to do it in
ret = -EIO; * the main thread.
} else if (errors_handled(ms)) { */
/* spin_lock_irqsave(&ms->lock, flags);
* Need to raise event. Since raising if (!ms->failures.head)
* events can block, we need to do it in should_wake = 1;
* the main thread. bio_list_add(&ms->failures, bio);
*/ spin_unlock_irqrestore(&ms->lock, flags);
spin_lock_irqsave(&ms->lock, flags); if (should_wake)
if (!ms->failures.head) wakeup_mirrord(ms);
should_wake = 1;
bio_list_add(&ms->failures, bio);
spin_unlock_irqrestore(&ms->lock, flags);
if (should_wake)
wakeup_mirrord(ms);
return;
}
out:
bio_endio(bio, ret);
} }
static void do_write(struct mirror_set *ms, struct bio *bio) static void do_write(struct mirror_set *ms, struct bio *bio)
...@@ -562,7 +622,7 @@ static void do_write(struct mirror_set *ms, struct bio *bio) ...@@ -562,7 +622,7 @@ static void do_write(struct mirror_set *ms, struct bio *bio)
struct dm_io_region io[ms->nr_mirrors], *dest = io; struct dm_io_region io[ms->nr_mirrors], *dest = io;
struct mirror *m; struct mirror *m;
struct dm_io_request io_req = { struct dm_io_request io_req = {
.bi_rw = WRITE, .bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
.mem.type = DM_IO_BVEC, .mem.type = DM_IO_BVEC,
.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx, .mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
.notify.fn = write_callback, .notify.fn = write_callback,
...@@ -603,6 +663,11 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes) ...@@ -603,6 +663,11 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes)
bio_list_init(&requeue); bio_list_init(&requeue);
while ((bio = bio_list_pop(writes))) { while ((bio = bio_list_pop(writes))) {
if (unlikely(bio_empty_barrier(bio))) {
bio_list_add(&sync, bio);
continue;
}
region = dm_rh_bio_to_region(ms->rh, bio); region = dm_rh_bio_to_region(ms->rh, bio);
if (log->type->is_remote_recovering && if (log->type->is_remote_recovering &&
...@@ -672,8 +737,12 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes) ...@@ -672,8 +737,12 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes)
dm_rh_delay(ms->rh, bio); dm_rh_delay(ms->rh, bio);
while ((bio = bio_list_pop(&nosync))) { while ((bio = bio_list_pop(&nosync))) {
map_bio(get_default_mirror(ms), bio); if (unlikely(ms->leg_failure) && errors_handled(ms))
generic_make_request(bio); hold_bio(ms, bio);
else {
map_bio(get_default_mirror(ms), bio);
generic_make_request(bio);
}
} }
} }
...@@ -681,20 +750,12 @@ static void do_failures(struct mirror_set *ms, struct bio_list *failures) ...@@ -681,20 +750,12 @@ static void do_failures(struct mirror_set *ms, struct bio_list *failures)
{ {
struct bio *bio; struct bio *bio;
if (!failures->head) if (likely(!failures->head))
return;
if (!ms->log_failure) {
while ((bio = bio_list_pop(failures))) {
ms->in_sync = 0;
dm_rh_mark_nosync(ms->rh, bio, bio->bi_size, 0);
}
return; return;
}
/* /*
* If the log has failed, unattempted writes are being * If the log has failed, unattempted writes are being
* put on the failures list. We can't issue those writes * put on the holds list. We can't issue those writes
* until a log has been marked, so we must store them. * until a log has been marked, so we must store them.
* *
* If a 'noflush' suspend is in progress, we can requeue * If a 'noflush' suspend is in progress, we can requeue
...@@ -709,23 +770,27 @@ static void do_failures(struct mirror_set *ms, struct bio_list *failures) ...@@ -709,23 +770,27 @@ static void do_failures(struct mirror_set *ms, struct bio_list *failures)
* for us to treat them the same and requeue them * for us to treat them the same and requeue them
* as well. * as well.
*/ */
if (dm_noflush_suspending(ms->ti)) { while ((bio = bio_list_pop(failures))) {
while ((bio = bio_list_pop(failures))) if (!ms->log_failure) {
bio_endio(bio, DM_ENDIO_REQUEUE); ms->in_sync = 0;
return; dm_rh_mark_nosync(ms->rh, bio);
} }
if (atomic_read(&ms->suspend)) { /*
while ((bio = bio_list_pop(failures))) * If all the legs are dead, fail the I/O.
* If we have been told to handle errors, hold the bio
* and wait for userspace to deal with the problem.
* Otherwise pretend that the I/O succeeded. (This would
* be wrong if the failed leg returned after reboot and
* got replicated back to the good legs.)
*/
if (!get_valid_mirror(ms))
bio_endio(bio, -EIO); bio_endio(bio, -EIO);
return; else if (errors_handled(ms))
hold_bio(ms, bio);
else
bio_endio(bio, 0);
} }
spin_lock_irq(&ms->lock);
bio_list_merge(&ms->failures, failures);
spin_unlock_irq(&ms->lock);
delayed_wake(ms);
} }
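
The reworked do_failures() makes the policy for each failed write explicit: with no live leg left the bio fails with -EIO, with error handling enabled it is parked via hold_bio() until userspace has reacted to the dm event, and otherwise it is completed successfully; as the comment notes, that last choice would be wrong if a failed leg later came back and were replicated over the good ones. A compact model of that three-way decision (toy types; legs_alive and handle_errors stand in for get_valid_mirror() and errors_handled()):

#include <stdio.h>

enum toy_outcome { TOY_FAIL_EIO, TOY_HOLD, TOY_COMPLETE_OK };

struct toy_mirror_set {
    int legs_alive;     /* legs whose error_count is still zero */
    int handle_errors;  /* mirror log created with 'handle_errors' */
};

/* Decide what to do with a write that failed on at least one leg. */
static enum toy_outcome toy_handle_failed_write(const struct toy_mirror_set *ms)
{
    if (!ms->legs_alive)
        return TOY_FAIL_EIO;    /* every leg is dead: report -EIO */
    if (ms->handle_errors)
        return TOY_HOLD;        /* park until userspace has reconfigured */
    return TOY_COMPLETE_OK;     /* at least one good leg holds the data */
}

int main(void)
{
    static const char *names[] = { "-EIO", "hold", "ok" };
    struct toy_mirror_set cases[] = {
        { 0, 1 },   /* every leg failed */
        { 1, 1 },   /* errors handled by userspace */
        { 1, 0 },   /* fire and forget */
    };
    unsigned i;

    for (i = 0; i < sizeof(cases) / sizeof(cases[0]); i++)
        printf("legs_alive=%d handle_errors=%d -> %s\n",
               cases[i].legs_alive, cases[i].handle_errors,
               names[toy_handle_failed_write(&cases[i])]);
    return 0;
}
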
static void trigger_event(struct work_struct *work) static void trigger_event(struct work_struct *work)
...@@ -784,12 +849,17 @@ static struct mirror_set *alloc_context(unsigned int nr_mirrors, ...@@ -784,12 +849,17 @@ static struct mirror_set *alloc_context(unsigned int nr_mirrors,
} }
spin_lock_init(&ms->lock); spin_lock_init(&ms->lock);
bio_list_init(&ms->reads);
bio_list_init(&ms->writes);
bio_list_init(&ms->failures);
bio_list_init(&ms->holds);
ms->ti = ti; ms->ti = ti;
ms->nr_mirrors = nr_mirrors; ms->nr_mirrors = nr_mirrors;
ms->nr_regions = dm_sector_div_up(ti->len, region_size); ms->nr_regions = dm_sector_div_up(ti->len, region_size);
ms->in_sync = 0; ms->in_sync = 0;
ms->log_failure = 0; ms->log_failure = 0;
ms->leg_failure = 0;
atomic_set(&ms->suspend, 0); atomic_set(&ms->suspend, 0);
atomic_set(&ms->default_mirror, DEFAULT_MIRROR); atomic_set(&ms->default_mirror, DEFAULT_MIRROR);
...@@ -889,7 +959,8 @@ static struct dm_dirty_log *create_dirty_log(struct dm_target *ti, ...@@ -889,7 +959,8 @@ static struct dm_dirty_log *create_dirty_log(struct dm_target *ti,
return NULL; return NULL;
} }
dl = dm_dirty_log_create(argv[0], ti, param_count, argv + 2); dl = dm_dirty_log_create(argv[0], ti, mirror_flush, param_count,
argv + 2);
if (!dl) { if (!dl) {
ti->error = "Error creating mirror dirty log"; ti->error = "Error creating mirror dirty log";
return NULL; return NULL;
...@@ -995,6 +1066,7 @@ static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv) ...@@ -995,6 +1066,7 @@ static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv)
ti->private = ms; ti->private = ms;
ti->split_io = dm_rh_get_region_size(ms->rh); ti->split_io = dm_rh_get_region_size(ms->rh);
ti->num_flush_requests = 1;
ms->kmirrord_wq = create_singlethread_workqueue("kmirrord"); ms->kmirrord_wq = create_singlethread_workqueue("kmirrord");
if (!ms->kmirrord_wq) { if (!ms->kmirrord_wq) {
...@@ -1122,7 +1194,8 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, ...@@ -1122,7 +1194,8 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
* We need to dec pending if this was a write. * We need to dec pending if this was a write.
*/ */
if (rw == WRITE) { if (rw == WRITE) {
dm_rh_dec(ms->rh, map_context->ll); if (likely(!bio_empty_barrier(bio)))
dm_rh_dec(ms->rh, map_context->ll);
return error; return error;
} }
...@@ -1180,6 +1253,9 @@ static void mirror_presuspend(struct dm_target *ti) ...@@ -1180,6 +1253,9 @@ static void mirror_presuspend(struct dm_target *ti)
struct mirror_set *ms = (struct mirror_set *) ti->private; struct mirror_set *ms = (struct mirror_set *) ti->private;
struct dm_dirty_log *log = dm_rh_dirty_log(ms->rh); struct dm_dirty_log *log = dm_rh_dirty_log(ms->rh);
struct bio_list holds;
struct bio *bio;
atomic_set(&ms->suspend, 1); atomic_set(&ms->suspend, 1);
/* /*
...@@ -1202,6 +1278,22 @@ static void mirror_presuspend(struct dm_target *ti) ...@@ -1202,6 +1278,22 @@ static void mirror_presuspend(struct dm_target *ti)
* we know that all of our I/O has been pushed. * we know that all of our I/O has been pushed.
*/ */
flush_workqueue(ms->kmirrord_wq); flush_workqueue(ms->kmirrord_wq);
/*
* Now that ms->suspend is set and the workqueue has been flushed, no
* more entries can be added to the ms->holds list, so process it.
*
* Bios can still arrive concurrently with or after this
* presuspend function, but they cannot join the hold list
* because ms->suspend is set.
*/
spin_lock_irq(&ms->lock);
holds = ms->holds;
bio_list_init(&ms->holds);
spin_unlock_irq(&ms->lock);
while ((bio = bio_list_pop(&holds)))
hold_bio(ms, bio);
} }
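
The presuspend code above relies on ordering: ms->suspend is set first and the workqueue flushed, so nothing can add to ms->holds any more; only then is the list emptied under the lock and each bio pushed back through hold_bio(), which now completes it (DM_ENDIO_REQUEUE for a noflush suspend, -EIO otherwise) rather than parking it again. A small single-threaded sketch of that drain, with a toy singly linked list in place of struct bio_list and no locking:

#include <stdio.h>
#include <stdlib.h>

struct toy_bio {
    int id;
    struct toy_bio *next;
};

struct toy_ms {
    int suspended;          /* atomic_t suspend in the real code */
    struct toy_bio *holds;  /* bios parked until suspend */
};

static void toy_endio(struct toy_bio *bio, const char *how)
{
    printf("bio %d completed: %s\n", bio->id, how);
    free(bio);
}

/* Park the bio, unless the device is already suspending. */
static void toy_hold_bio(struct toy_ms *ms, struct toy_bio *bio)
{
    if (ms->suspended) {
        toy_endio(bio, "requeue or -EIO");  /* picked by noflush in real code */
        return;
    }
    bio->next = ms->holds;
    ms->holds = bio;
}

/* Steal the hold list, then push every entry back through toy_hold_bio(),
 * which now completes it because 'suspended' is already set. */
static void toy_presuspend(struct toy_ms *ms)
{
    struct toy_bio *bio, *list;

    ms->suspended = 1;
    list = ms->holds;
    ms->holds = NULL;
    while ((bio = list)) {
        list = bio->next;
        toy_hold_bio(ms, bio);
    }
}

int main(void)
{
    struct toy_ms ms = { 0, NULL };
    int i;

    for (i = 0; i < 3; i++) {
        struct toy_bio *bio = malloc(sizeof(*bio));

        if (!bio)
            return 1;
        bio->id = i;
        toy_hold_bio(&ms, bio);
    }
    toy_presuspend(&ms);
    return 0;
}
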
static void mirror_postsuspend(struct dm_target *ti) static void mirror_postsuspend(struct dm_target *ti)
...@@ -1244,7 +1336,8 @@ static char device_status_char(struct mirror *m) ...@@ -1244,7 +1336,8 @@ static char device_status_char(struct mirror *m)
if (!atomic_read(&(m->error_count))) if (!atomic_read(&(m->error_count)))
return 'A'; return 'A';
return (test_bit(DM_RAID1_WRITE_ERROR, &(m->error_type))) ? 'D' : return (test_bit(DM_RAID1_FLUSH_ERROR, &(m->error_type))) ? 'F' :
(test_bit(DM_RAID1_WRITE_ERROR, &(m->error_type))) ? 'D' :
(test_bit(DM_RAID1_SYNC_ERROR, &(m->error_type))) ? 'S' : (test_bit(DM_RAID1_SYNC_ERROR, &(m->error_type))) ? 'S' :
(test_bit(DM_RAID1_READ_ERROR, &(m->error_type))) ? 'R' : 'U'; (test_bit(DM_RAID1_READ_ERROR, &(m->error_type))) ? 'R' : 'U';
} }
......
...@@ -79,6 +79,11 @@ struct dm_region_hash { ...@@ -79,6 +79,11 @@ struct dm_region_hash {
struct list_head recovered_regions; struct list_head recovered_regions;
struct list_head failed_recovered_regions; struct list_head failed_recovered_regions;
/*
* If there was a barrier failure no regions can be marked clean.
*/
int barrier_failure;
void *context; void *context;
sector_t target_begin; sector_t target_begin;
...@@ -211,6 +216,7 @@ struct dm_region_hash *dm_region_hash_create( ...@@ -211,6 +216,7 @@ struct dm_region_hash *dm_region_hash_create(
INIT_LIST_HEAD(&rh->quiesced_regions); INIT_LIST_HEAD(&rh->quiesced_regions);
INIT_LIST_HEAD(&rh->recovered_regions); INIT_LIST_HEAD(&rh->recovered_regions);
INIT_LIST_HEAD(&rh->failed_recovered_regions); INIT_LIST_HEAD(&rh->failed_recovered_regions);
rh->barrier_failure = 0;
rh->region_pool = mempool_create_kmalloc_pool(MIN_REGIONS, rh->region_pool = mempool_create_kmalloc_pool(MIN_REGIONS,
sizeof(struct dm_region)); sizeof(struct dm_region));
...@@ -377,8 +383,6 @@ static void complete_resync_work(struct dm_region *reg, int success) ...@@ -377,8 +383,6 @@ static void complete_resync_work(struct dm_region *reg, int success)
/* dm_rh_mark_nosync /* dm_rh_mark_nosync
* @ms * @ms
* @bio * @bio
* @done
* @error
* *
* The bio was written on some mirror(s) but failed on other mirror(s). * The bio was written on some mirror(s) but failed on other mirror(s).
* We can successfully endio the bio but should avoid the region being * We can successfully endio the bio but should avoid the region being
...@@ -386,8 +390,7 @@ static void complete_resync_work(struct dm_region *reg, int success) ...@@ -386,8 +390,7 @@ static void complete_resync_work(struct dm_region *reg, int success)
* *
* This function is _not_ safe in interrupt context! * This function is _not_ safe in interrupt context!
*/ */
void dm_rh_mark_nosync(struct dm_region_hash *rh, void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio)
struct bio *bio, unsigned done, int error)
{ {
unsigned long flags; unsigned long flags;
struct dm_dirty_log *log = rh->log; struct dm_dirty_log *log = rh->log;
...@@ -395,6 +398,11 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh, ...@@ -395,6 +398,11 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh,
region_t region = dm_rh_bio_to_region(rh, bio); region_t region = dm_rh_bio_to_region(rh, bio);
int recovering = 0; int recovering = 0;
if (bio_empty_barrier(bio)) {
rh->barrier_failure = 1;
return;
}
/* We must inform the log that the sync count has changed. */ /* We must inform the log that the sync count has changed. */
log->type->set_region_sync(log, region, 0); log->type->set_region_sync(log, region, 0);
...@@ -419,7 +427,6 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh, ...@@ -419,7 +427,6 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh,
BUG_ON(!list_empty(&reg->list)); BUG_ON(!list_empty(&reg->list));
spin_unlock_irqrestore(&rh->region_lock, flags); spin_unlock_irqrestore(&rh->region_lock, flags);
bio_endio(bio, error);
if (recovering) if (recovering)
complete_resync_work(reg, 0); complete_resync_work(reg, 0);
} }
...@@ -515,8 +522,11 @@ void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios) ...@@ -515,8 +522,11 @@ void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
{ {
struct bio *bio; struct bio *bio;
for (bio = bios->head; bio; bio = bio->bi_next) for (bio = bios->head; bio; bio = bio->bi_next) {
if (bio_empty_barrier(bio))
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio)); rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
} }
EXPORT_SYMBOL_GPL(dm_rh_inc_pending); EXPORT_SYMBOL_GPL(dm_rh_inc_pending);
...@@ -544,7 +554,14 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region) ...@@ -544,7 +554,14 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region)
*/ */
/* do nothing for DM_RH_NOSYNC */ /* do nothing for DM_RH_NOSYNC */
if (reg->state == DM_RH_RECOVERING) { if (unlikely(rh->barrier_failure)) {
/*
* If a write barrier failed some time ago, we
* don't know whether or not this write made it
* to the disk, so we must resync the device.
*/
reg->state = DM_RH_NOSYNC;
} else if (reg->state == DM_RH_RECOVERING) {
list_add_tail(&reg->list, &rh->quiesced_regions); list_add_tail(&reg->list, &rh->quiesced_regions);
} else if (reg->state == DM_RH_DIRTY) { } else if (reg->state == DM_RH_DIRTY) {
reg->state = DM_RH_CLEAN; reg->state = DM_RH_CLEAN;
......
...@@ -55,6 +55,8 @@ ...@@ -55,6 +55,8 @@
*/ */
#define SNAPSHOT_DISK_VERSION 1 #define SNAPSHOT_DISK_VERSION 1
#define NUM_SNAPSHOT_HDR_CHUNKS 1
struct disk_header { struct disk_header {
uint32_t magic; uint32_t magic;
...@@ -120,7 +122,22 @@ struct pstore { ...@@ -120,7 +122,22 @@ struct pstore {
/* /*
* The next free chunk for an exception. * The next free chunk for an exception.
*
* When creating exceptions, all the chunks here and above are
* free. It holds the next chunk to be allocated. On rare
* occasions (e.g. after a system crash) holes can be left in
* the exception store because chunks can be committed out of
* order.
*
* When merging exceptions, it does not necessarily mean all the
* chunks here and above are free. It holds the value it would
* have held if all chunks had been committed in order of
* allocation. Consequently the value may occasionally be
* slightly too low, but since it's only used for 'status' and
* it can never reach its minimum value too early this doesn't
* matter.
*/ */
chunk_t next_free; chunk_t next_free;
/* /*
...@@ -214,7 +231,7 @@ static int chunk_io(struct pstore *ps, void *area, chunk_t chunk, int rw, ...@@ -214,7 +231,7 @@ static int chunk_io(struct pstore *ps, void *area, chunk_t chunk, int rw,
int metadata) int metadata)
{ {
struct dm_io_region where = { struct dm_io_region where = {
.bdev = ps->store->cow->bdev, .bdev = dm_snap_cow(ps->store->snap)->bdev,
.sector = ps->store->chunk_size * chunk, .sector = ps->store->chunk_size * chunk,
.count = ps->store->chunk_size, .count = ps->store->chunk_size,
}; };
...@@ -294,7 +311,8 @@ static int read_header(struct pstore *ps, int *new_snapshot) ...@@ -294,7 +311,8 @@ static int read_header(struct pstore *ps, int *new_snapshot)
*/ */
if (!ps->store->chunk_size) { if (!ps->store->chunk_size) {
ps->store->chunk_size = max(DM_CHUNK_SIZE_DEFAULT_SECTORS, ps->store->chunk_size = max(DM_CHUNK_SIZE_DEFAULT_SECTORS,
bdev_logical_block_size(ps->store->cow->bdev) >> 9); bdev_logical_block_size(dm_snap_cow(ps->store->snap)->
bdev) >> 9);
ps->store->chunk_mask = ps->store->chunk_size - 1; ps->store->chunk_mask = ps->store->chunk_size - 1;
ps->store->chunk_shift = ffs(ps->store->chunk_size) - 1; ps->store->chunk_shift = ffs(ps->store->chunk_size) - 1;
chunk_size_supplied = 0; chunk_size_supplied = 0;
...@@ -408,6 +426,15 @@ static void write_exception(struct pstore *ps, ...@@ -408,6 +426,15 @@ static void write_exception(struct pstore *ps,
e->new_chunk = cpu_to_le64(de->new_chunk); e->new_chunk = cpu_to_le64(de->new_chunk);
} }
static void clear_exception(struct pstore *ps, uint32_t index)
{
struct disk_exception *e = get_exception(ps, index);
/* clear it */
e->old_chunk = 0;
e->new_chunk = 0;
}
/* /*
* Registers the exceptions that are present in the current area. * Registers the exceptions that are present in the current area.
* 'full' is filled in to indicate if the area has been * 'full' is filled in to indicate if the area has been
...@@ -489,11 +516,23 @@ static struct pstore *get_info(struct dm_exception_store *store) ...@@ -489,11 +516,23 @@ static struct pstore *get_info(struct dm_exception_store *store)
return (struct pstore *) store->context; return (struct pstore *) store->context;
} }
static void persistent_fraction_full(struct dm_exception_store *store, static void persistent_usage(struct dm_exception_store *store,
sector_t *numerator, sector_t *denominator) sector_t *total_sectors,
sector_t *sectors_allocated,
sector_t *metadata_sectors)
{ {
*numerator = get_info(store)->next_free * store->chunk_size; struct pstore *ps = get_info(store);
*denominator = get_dev_size(store->cow->bdev);
*sectors_allocated = ps->next_free * store->chunk_size;
*total_sectors = get_dev_size(dm_snap_cow(store->snap)->bdev);
/*
* First chunk is the fixed header.
* Then there are (ps->current_area + 1) metadata chunks, each one
* separated from the next by ps->exceptions_per_area data chunks.
*/
*metadata_sectors = (ps->current_area + 1 + NUM_SNAPSHOT_HDR_CHUNKS) *
store->chunk_size;
} }
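
The metadata_sectors arithmetic in persistent_usage() follows from the layout described in the next_free comment earlier in this file: chunk 0 holds the fixed header (NUM_SNAPSHOT_HDR_CHUNKS of them), and each area of ps->exceptions_per_area data chunks is preceded by one metadata chunk, so with current_area counted from zero there are current_area + 1 metadata chunks in use. A worked example with made-up values (chunk size and counts are illustrative only):

#include <stdio.h>

#define NUM_SNAPSHOT_HDR_CHUNKS 1

int main(void)
{
    /* Illustrative numbers only: 8-sector (4KiB) chunks, three metadata
     * areas in use (current_area == 2), next_free pointing at chunk 700. */
    unsigned long long chunk_size = 8;      /* sectors per chunk */
    unsigned long long current_area = 2;    /* areas are counted from 0 */
    unsigned long long next_free = 700;     /* next chunk to allocate */

    unsigned long long sectors_allocated = next_free * chunk_size;
    unsigned long long metadata_sectors =
        (current_area + 1 + NUM_SNAPSHOT_HDR_CHUNKS) * chunk_size;

    printf("sectors_allocated = %llu\n", sectors_allocated); /* 5600 */
    printf("metadata_sectors  = %llu\n", metadata_sectors);  /* 32 = header + 3 area chunks */
    return 0;
}

Returning total, allocated and metadata sectors separately is what replaces the old numerator/denominator pair of fraction_full, so callers can tell metadata overhead apart from copied data.
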
static void persistent_dtr(struct dm_exception_store *store) static void persistent_dtr(struct dm_exception_store *store)
...@@ -552,44 +591,40 @@ static int persistent_read_metadata(struct dm_exception_store *store, ...@@ -552,44 +591,40 @@ static int persistent_read_metadata(struct dm_exception_store *store,
ps->current_area = 0; ps->current_area = 0;
zero_memory_area(ps); zero_memory_area(ps);
r = zero_disk_area(ps, 0); r = zero_disk_area(ps, 0);
if (r) { if (r)
DMWARN("zero_disk_area(0) failed"); DMWARN("zero_disk_area(0) failed");
return r; return r;
} }
} else { /*
/* * Sanity checks.
* Sanity checks. */
*/ if (ps->version != SNAPSHOT_DISK_VERSION) {
if (ps->version != SNAPSHOT_DISK_VERSION) { DMWARN("unable to handle snapshot disk version %d",
DMWARN("unable to handle snapshot disk version %d", ps->version);
ps->version); return -EINVAL;
return -EINVAL; }
}
/* /*
* Metadata are valid, but snapshot is invalidated * Metadata are valid, but snapshot is invalidated
*/ */
if (!ps->valid) if (!ps->valid)
return 1; return 1;
/* /*
* Read the metadata. * Read the metadata.
*/ */
r = read_exceptions(ps, callback, callback_context); r = read_exceptions(ps, callback, callback_context);
if (r)
return r;
}
return 0; return r;
} }
static int persistent_prepare_exception(struct dm_exception_store *store, static int persistent_prepare_exception(struct dm_exception_store *store,
struct dm_snap_exception *e) struct dm_exception *e)
{ {
struct pstore *ps = get_info(store); struct pstore *ps = get_info(store);
uint32_t stride; uint32_t stride;
chunk_t next_free; chunk_t next_free;
sector_t size = get_dev_size(store->cow->bdev); sector_t size = get_dev_size(dm_snap_cow(store->snap)->bdev);
/* Is there enough room ? */ /* Is there enough room ? */
if (size < ((ps->next_free + 1) * store->chunk_size)) if (size < ((ps->next_free + 1) * store->chunk_size))
...@@ -611,7 +646,7 @@ static int persistent_prepare_exception(struct dm_exception_store *store, ...@@ -611,7 +646,7 @@ static int persistent_prepare_exception(struct dm_exception_store *store,
} }
static void persistent_commit_exception(struct dm_exception_store *store, static void persistent_commit_exception(struct dm_exception_store *store,
struct dm_snap_exception *e, struct dm_exception *e,
void (*callback) (void *, int success), void (*callback) (void *, int success),
void *callback_context) void *callback_context)
{ {
...@@ -672,6 +707,85 @@ static void persistent_commit_exception(struct dm_exception_store *store, ...@@ -672,6 +707,85 @@ static void persistent_commit_exception(struct dm_exception_store *store,
ps->callback_count = 0; ps->callback_count = 0;
} }
static int persistent_prepare_merge(struct dm_exception_store *store,
chunk_t *last_old_chunk,
chunk_t *last_new_chunk)
{
struct pstore *ps = get_info(store);
struct disk_exception de;
int nr_consecutive;
int r;
/*
* When current area is empty, move back to preceding area.
*/
if (!ps->current_committed) {
/*
* Have we finished?
*/
if (!ps->current_area)
return 0;
ps->current_area--;
r = area_io(ps, READ);
if (r < 0)
return r;
ps->current_committed = ps->exceptions_per_area;
}
read_exception(ps, ps->current_committed - 1, &de);
*last_old_chunk = de.old_chunk;
*last_new_chunk = de.new_chunk;
/*
* Find number of consecutive chunks within the current area,
* working backwards.
*/
for (nr_consecutive = 1; nr_consecutive < ps->current_committed;
nr_consecutive++) {
read_exception(ps, ps->current_committed - 1 - nr_consecutive,
&de);
if (de.old_chunk != *last_old_chunk - nr_consecutive ||
de.new_chunk != *last_new_chunk - nr_consecutive)
break;
}
return nr_consecutive;
}
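
persistent_prepare_merge() works backwards from the most recently committed exception and counts how many immediately preceding entries continue a run in which both old_chunk and new_chunk decrease by exactly one per step; that run can then be merged back as one contiguous copy, and commit_merge() later removes exactly that many entries. A standalone sketch of the backwards scan (plain arrays instead of the on-disk exception area, and without the roll-back to the preceding area when the current one is empty):

#include <stdio.h>

struct toy_exception {
    unsigned long long old_chunk;   /* chunk on the origin */
    unsigned long long new_chunk;   /* chunk in the COW store */
};

/*
 * Starting from the last committed exception, count how many immediately
 * preceding entries are consecutive in both old_chunk and new_chunk.
 * Returns the length of the run and reports its last (highest) entry.
 */
static int toy_prepare_merge(const struct toy_exception *e, int committed,
                             struct toy_exception *last)
{
    int nr;

    if (!committed)
        return 0;

    *last = e[committed - 1];
    for (nr = 1; nr < committed; nr++) {
        const struct toy_exception *prev = &e[committed - 1 - nr];

        if (prev->old_chunk != last->old_chunk - nr ||
            prev->new_chunk != last->new_chunk - nr)
            break;
    }
    return nr;
}

int main(void)
{
    /* Origin chunks 10..13 were copied to consecutive COW chunks 2..5;
     * chunk 40 (COW chunk 1) breaks the run. */
    struct toy_exception area[] = {
        { 40, 1 }, { 10, 2 }, { 11, 3 }, { 12, 4 }, { 13, 5 },
    };
    struct toy_exception last;
    int nr = toy_prepare_merge(area, 5, &last);

    printf("mergeable run: %d chunks ending at origin %llu / cow %llu\n",
           nr, last.old_chunk, last.new_chunk);   /* 4, 13, 5 */
    return 0;
}
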
static int persistent_commit_merge(struct dm_exception_store *store,
int nr_merged)
{
int r, i;
struct pstore *ps = get_info(store);
BUG_ON(nr_merged > ps->current_committed);
for (i = 0; i < nr_merged; i++)
clear_exception(ps, ps->current_committed - 1 - i);
r = area_io(ps, WRITE);
if (r < 0)
return r;
ps->current_committed -= nr_merged;
/*
* At this stage, only persistent_usage() uses ps->next_free, so
* we make no attempt to keep ps->next_free strictly accurate
* as exceptions may have been committed out-of-order originally.
* Once a snapshot has become merging, we set it to the value it
* would have held had all the exceptions been committed in order.
*
* ps->current_area does not get reduced by prepare_merge() until
* after commit_merge() has removed the nr_merged previous exceptions.
*/
ps->next_free = (area_location(ps, ps->current_area) - 1) +
(ps->current_committed + 1) + NUM_SNAPSHOT_HDR_CHUNKS;
return 0;
}
static void persistent_drop_snapshot(struct dm_exception_store *store) static void persistent_drop_snapshot(struct dm_exception_store *store)
{ {
struct pstore *ps = get_info(store); struct pstore *ps = get_info(store);
...@@ -697,7 +811,7 @@ static int persistent_ctr(struct dm_exception_store *store, ...@@ -697,7 +811,7 @@ static int persistent_ctr(struct dm_exception_store *store,
ps->area = NULL; ps->area = NULL;
ps->zero_area = NULL; ps->zero_area = NULL;
ps->header_area = NULL; ps->header_area = NULL;
ps->next_free = 2; /* skipping the header and first area */ ps->next_free = NUM_SNAPSHOT_HDR_CHUNKS + 1; /* header and 1st area */
ps->current_committed = 0; ps->current_committed = 0;
ps->callback_count = 0; ps->callback_count = 0;
...@@ -726,8 +840,7 @@ static unsigned persistent_status(struct dm_exception_store *store, ...@@ -726,8 +840,7 @@ static unsigned persistent_status(struct dm_exception_store *store,
case STATUSTYPE_INFO: case STATUSTYPE_INFO:
break; break;
case STATUSTYPE_TABLE: case STATUSTYPE_TABLE:
DMEMIT(" %s P %llu", store->cow->name, DMEMIT(" P %llu", (unsigned long long)store->chunk_size);
(unsigned long long)store->chunk_size);
} }
return sz; return sz;
...@@ -741,8 +854,10 @@ static struct dm_exception_store_type _persistent_type = { ...@@ -741,8 +854,10 @@ static struct dm_exception_store_type _persistent_type = {
.read_metadata = persistent_read_metadata, .read_metadata = persistent_read_metadata,
.prepare_exception = persistent_prepare_exception, .prepare_exception = persistent_prepare_exception,
.commit_exception = persistent_commit_exception, .commit_exception = persistent_commit_exception,
.prepare_merge = persistent_prepare_merge,
.commit_merge = persistent_commit_merge,
.drop_snapshot = persistent_drop_snapshot, .drop_snapshot = persistent_drop_snapshot,
.fraction_full = persistent_fraction_full, .usage = persistent_usage,
.status = persistent_status, .status = persistent_status,
}; };
...@@ -754,8 +869,10 @@ static struct dm_exception_store_type _persistent_compat_type = { ...@@ -754,8 +869,10 @@ static struct dm_exception_store_type _persistent_compat_type = {
.read_metadata = persistent_read_metadata, .read_metadata = persistent_read_metadata,
.prepare_exception = persistent_prepare_exception, .prepare_exception = persistent_prepare_exception,
.commit_exception = persistent_commit_exception, .commit_exception = persistent_commit_exception,
.prepare_merge = persistent_prepare_merge,
.commit_merge = persistent_commit_merge,
.drop_snapshot = persistent_drop_snapshot, .drop_snapshot = persistent_drop_snapshot,
.fraction_full = persistent_fraction_full, .usage = persistent_usage,
.status = persistent_status, .status = persistent_status,
}; };
......
...@@ -36,10 +36,10 @@ static int transient_read_metadata(struct dm_exception_store *store, ...@@ -36,10 +36,10 @@ static int transient_read_metadata(struct dm_exception_store *store,
} }
static int transient_prepare_exception(struct dm_exception_store *store, static int transient_prepare_exception(struct dm_exception_store *store,
struct dm_snap_exception *e) struct dm_exception *e)
{ {
struct transient_c *tc = store->context; struct transient_c *tc = store->context;
sector_t size = get_dev_size(store->cow->bdev); sector_t size = get_dev_size(dm_snap_cow(store->snap)->bdev);
if (size < (tc->next_free + store->chunk_size)) if (size < (tc->next_free + store->chunk_size))
return -1; return -1;
...@@ -51,7 +51,7 @@ static int transient_prepare_exception(struct dm_exception_store *store, ...@@ -51,7 +51,7 @@ static int transient_prepare_exception(struct dm_exception_store *store,
} }
static void transient_commit_exception(struct dm_exception_store *store, static void transient_commit_exception(struct dm_exception_store *store,
struct dm_snap_exception *e, struct dm_exception *e,
void (*callback) (void *, int success), void (*callback) (void *, int success),
void *callback_context) void *callback_context)
{ {
...@@ -59,11 +59,14 @@ static void transient_commit_exception(struct dm_exception_store *store, ...@@ -59,11 +59,14 @@ static void transient_commit_exception(struct dm_exception_store *store,
callback(callback_context, 1); callback(callback_context, 1);
} }
static void transient_fraction_full(struct dm_exception_store *store, static void transient_usage(struct dm_exception_store *store,
sector_t *numerator, sector_t *denominator) sector_t *total_sectors,
sector_t *sectors_allocated,
sector_t *metadata_sectors)
{ {
*numerator = ((struct transient_c *) store->context)->next_free; *sectors_allocated = ((struct transient_c *) store->context)->next_free;
*denominator = get_dev_size(store->cow->bdev); *total_sectors = get_dev_size(dm_snap_cow(store->snap)->bdev);
*metadata_sectors = 0;
} }
static int transient_ctr(struct dm_exception_store *store, static int transient_ctr(struct dm_exception_store *store,
...@@ -91,8 +94,7 @@ static unsigned transient_status(struct dm_exception_store *store, ...@@ -91,8 +94,7 @@ static unsigned transient_status(struct dm_exception_store *store,
case STATUSTYPE_INFO: case STATUSTYPE_INFO:
break; break;
case STATUSTYPE_TABLE: case STATUSTYPE_TABLE:
DMEMIT(" %s N %llu", store->cow->name, DMEMIT(" N %llu", (unsigned long long)store->chunk_size);
(unsigned long long)store->chunk_size);
} }
return sz; return sz;
...@@ -106,7 +108,7 @@ static struct dm_exception_store_type _transient_type = { ...@@ -106,7 +108,7 @@ static struct dm_exception_store_type _transient_type = {
.read_metadata = transient_read_metadata, .read_metadata = transient_read_metadata,
.prepare_exception = transient_prepare_exception, .prepare_exception = transient_prepare_exception,
.commit_exception = transient_commit_exception, .commit_exception = transient_commit_exception,
.fraction_full = transient_fraction_full, .usage = transient_usage,
.status = transient_status, .status = transient_status,
}; };
...@@ -118,7 +120,7 @@ static struct dm_exception_store_type _transient_compat_type = { ...@@ -118,7 +120,7 @@ static struct dm_exception_store_type _transient_compat_type = {
.read_metadata = transient_read_metadata, .read_metadata = transient_read_metadata,
.prepare_exception = transient_prepare_exception, .prepare_exception = transient_prepare_exception,
.commit_exception = transient_commit_exception, .commit_exception = transient_commit_exception,
.fraction_full = transient_fraction_full, .usage = transient_usage,
.status = transient_status, .status = transient_status,
}; };
......
...@@ -25,6 +25,11 @@ ...@@ -25,6 +25,11 @@
#define DM_MSG_PREFIX "snapshots" #define DM_MSG_PREFIX "snapshots"
static const char dm_snapshot_merge_target_name[] = "snapshot-merge";
#define dm_target_is_snapshot_merge(ti) \
((ti)->type->name == dm_snapshot_merge_target_name)
/* /*
* The percentage increment we will wake up users at * The percentage increment we will wake up users at
*/ */
...@@ -49,7 +54,7 @@ ...@@ -49,7 +54,7 @@
#define DM_TRACKED_CHUNK_HASH(x) ((unsigned long)(x) & \ #define DM_TRACKED_CHUNK_HASH(x) ((unsigned long)(x) & \
(DM_TRACKED_CHUNK_HASH_SIZE - 1)) (DM_TRACKED_CHUNK_HASH_SIZE - 1))
struct exception_table { struct dm_exception_table {
uint32_t hash_mask; uint32_t hash_mask;
unsigned hash_shift; unsigned hash_shift;
struct list_head *table; struct list_head *table;
...@@ -59,22 +64,31 @@ struct dm_snapshot { ...@@ -59,22 +64,31 @@ struct dm_snapshot {
struct rw_semaphore lock; struct rw_semaphore lock;
struct dm_dev *origin; struct dm_dev *origin;
struct dm_dev *cow;
struct dm_target *ti;
/* List of snapshots per Origin */ /* List of snapshots per Origin */
struct list_head list; struct list_head list;
/* You can't use a snapshot if this is 0 (e.g. if full) */ /*
* You can't use a snapshot if this is 0 (e.g. if full).
* A snapshot-merge target never clears this.
*/
int valid; int valid;
/* Origin writes don't trigger exceptions until this is set */ /* Origin writes don't trigger exceptions until this is set */
int active; int active;
/* Whether or not owning mapped_device is suspended */
int suspended;
mempool_t *pending_pool; mempool_t *pending_pool;
atomic_t pending_exceptions_count; atomic_t pending_exceptions_count;
struct exception_table pending; struct dm_exception_table pending;
struct exception_table complete; struct dm_exception_table complete;
/* /*
* pe_lock protects all pending_exception operations and access * pe_lock protects all pending_exception operations and access
...@@ -95,8 +109,51 @@ struct dm_snapshot { ...@@ -95,8 +109,51 @@ struct dm_snapshot {
mempool_t *tracked_chunk_pool; mempool_t *tracked_chunk_pool;
spinlock_t tracked_chunk_lock; spinlock_t tracked_chunk_lock;
struct hlist_head tracked_chunk_hash[DM_TRACKED_CHUNK_HASH_SIZE]; struct hlist_head tracked_chunk_hash[DM_TRACKED_CHUNK_HASH_SIZE];
/*
* The merge operation failed if this flag is set.
* Failure modes are handled as follows:
* - I/O error reading the header
* => don't load the target; abort.
* - Header does not have "valid" flag set
* => use the origin; forget about the snapshot.
* - I/O error when reading exceptions
* => don't load the target; abort.
* (We can't use the intermediate origin state.)
* - I/O error while merging
* => stop merging; set merge_failed; process I/O normally.
*/
int merge_failed;
/* Wait for events based on state_bits */
unsigned long state_bits;
/* Range of chunks currently being merged. */
chunk_t first_merging_chunk;
int num_merging_chunks;
/*
* Incoming bios that overlap with chunks being merged must wait
* for them to be committed.
*/
struct bio_list bios_queued_during_merge;
}; };
/*
* state_bits:
* RUNNING_MERGE - Merge operation is in progress.
* SHUTDOWN_MERGE - Set to signal that merge needs to be stopped;
* cleared afterwards.
*/
#define RUNNING_MERGE 0
#define SHUTDOWN_MERGE 1
struct dm_dev *dm_snap_cow(struct dm_snapshot *s)
{
return s->cow;
}
EXPORT_SYMBOL(dm_snap_cow);
static struct workqueue_struct *ksnapd; static struct workqueue_struct *ksnapd;
static void flush_queued_bios(struct work_struct *work); static void flush_queued_bios(struct work_struct *work);
...@@ -116,7 +173,7 @@ static int bdev_equal(struct block_device *lhs, struct block_device *rhs) ...@@ -116,7 +173,7 @@ static int bdev_equal(struct block_device *lhs, struct block_device *rhs)
} }
struct dm_snap_pending_exception { struct dm_snap_pending_exception {
struct dm_snap_exception e; struct dm_exception e;
/* /*
* Origin buffers waiting for this to complete are held * Origin buffers waiting for this to complete are held
...@@ -125,28 +182,6 @@ struct dm_snap_pending_exception { ...@@ -125,28 +182,6 @@ struct dm_snap_pending_exception {
struct bio_list origin_bios; struct bio_list origin_bios;
struct bio_list snapshot_bios; struct bio_list snapshot_bios;
/*
* Short-term queue of pending exceptions prior to submission.
*/
struct list_head list;
/*
* The primary pending_exception is the one that holds
* the ref_count and the list of origin_bios for a
* group of pending_exceptions. It is always last to get freed.
* These fields get set up when writing to the origin.
*/
struct dm_snap_pending_exception *primary_pe;
/*
* Number of pending_exceptions processing this chunk.
* When this drops to zero we must complete the origin bios.
* If incrementing or decrementing this, hold pe->snap->lock for
* the sibling concerned and not pe->primary_pe->snap->lock unless
* they are the same.
*/
atomic_t ref_count;
/* Pointer back to snapshot context */ /* Pointer back to snapshot context */
struct dm_snapshot *snap; struct dm_snapshot *snap;
...@@ -221,6 +256,16 @@ static int __chunk_is_tracked(struct dm_snapshot *s, chunk_t chunk) ...@@ -221,6 +256,16 @@ static int __chunk_is_tracked(struct dm_snapshot *s, chunk_t chunk)
return found; return found;
} }
/*
* This conflicting I/O is extremely improbable in the caller,
* so msleep(1) is sufficient and there is no need for a wait queue.
*/
static void __check_for_conflicting_io(struct dm_snapshot *s, chunk_t chunk)
{
while (__chunk_is_tracked(s, chunk))
msleep(1);
}
/* /*
* One of these per registered origin, held in the snapshot_origins hash * One of these per registered origin, held in the snapshot_origins hash
*/ */
...@@ -243,6 +288,10 @@ struct origin { ...@@ -243,6 +288,10 @@ struct origin {
static struct list_head *_origins; static struct list_head *_origins;
static struct rw_semaphore _origins_lock; static struct rw_semaphore _origins_lock;
static DECLARE_WAIT_QUEUE_HEAD(_pending_exceptions_done);
static DEFINE_SPINLOCK(_pending_exceptions_done_spinlock);
static uint64_t _pending_exceptions_done_count;
static int init_origin_hash(void) static int init_origin_hash(void)
{ {
int i; int i;
...@@ -290,23 +339,145 @@ static void __insert_origin(struct origin *o) ...@@ -290,23 +339,145 @@ static void __insert_origin(struct origin *o)
list_add_tail(&o->hash_list, sl); list_add_tail(&o->hash_list, sl);
} }
/*
* _origins_lock must be held when calling this function.
* Returns number of snapshots registered using the supplied cow device, plus:
* snap_src - a snapshot suitable for use as a source of exception handover
* snap_dest - a snapshot capable of receiving exception handover.
* snap_merge - an existing snapshot-merge target linked to the same origin.
* There can be at most one snapshot-merge target. The parameter is optional.
*
* Possible return values and states of snap_src and snap_dest.
* 0: NULL, NULL - first new snapshot
* 1: snap_src, NULL - normal snapshot
* 2: snap_src, snap_dest - waiting for handover
* 2: snap_src, NULL - handed over, waiting for old to be deleted
* 1: NULL, snap_dest - source got destroyed without handover
*/
static int __find_snapshots_sharing_cow(struct dm_snapshot *snap,
struct dm_snapshot **snap_src,
struct dm_snapshot **snap_dest,
struct dm_snapshot **snap_merge)
{
struct dm_snapshot *s;
struct origin *o;
int count = 0;
int active;
o = __lookup_origin(snap->origin->bdev);
if (!o)
goto out;
list_for_each_entry(s, &o->snapshots, list) {
if (dm_target_is_snapshot_merge(s->ti) && snap_merge)
*snap_merge = s;
if (!bdev_equal(s->cow->bdev, snap->cow->bdev))
continue;
down_read(&s->lock);
active = s->active;
up_read(&s->lock);
if (active) {
if (snap_src)
*snap_src = s;
} else if (snap_dest)
*snap_dest = s;
count++;
}
out:
return count;
}
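
The block comment above __find_snapshots_sharing_cow() enumerates the legal combinations of its return count and the snap_src/snap_dest outputs. Purely as a readability aid, the same table expressed as a small classification function over plain flags (no dm types involved):

#include <stdio.h>

/* Same table as the comment above __find_snapshots_sharing_cow(). */
static const char *toy_handover_state(int count, int have_src, int have_dest)
{
    if (count == 0 && !have_src && !have_dest)
        return "first new snapshot";
    if (count == 1 && have_src && !have_dest)
        return "normal snapshot";
    if (count == 2 && have_src && have_dest)
        return "waiting for handover";
    if (count == 2 && have_src && !have_dest)
        return "handed over, waiting for the old snapshot to be deleted";
    if (count == 1 && !have_src && have_dest)
        return "source destroyed without handover";
    return "unexpected combination";
}

int main(void)
{
    printf("%s\n", toy_handover_state(2, 1, 1));    /* waiting for handover */
    printf("%s\n", toy_handover_state(0, 0, 0));    /* first new snapshot */
    return 0;
}
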
/*
* On success, returns 1 if this snapshot is a handover destination,
* otherwise returns 0.
*/
static int __validate_exception_handover(struct dm_snapshot *snap)
{
struct dm_snapshot *snap_src = NULL, *snap_dest = NULL;
struct dm_snapshot *snap_merge = NULL;
/* Does snapshot need exceptions handed over to it? */
if ((__find_snapshots_sharing_cow(snap, &snap_src, &snap_dest,
&snap_merge) == 2) ||
snap_dest) {
snap->ti->error = "Snapshot cow pairing for exception "
"table handover failed";
return -EINVAL;
}
/*
* If no snap_src was found, snap cannot become a handover
* destination.
*/
if (!snap_src)
return 0;
/*
* Non-snapshot-merge handover?
*/
if (!dm_target_is_snapshot_merge(snap->ti))
return 1;
/*
* Do not allow more than one merging snapshot.
*/
if (snap_merge) {
snap->ti->error = "A snapshot is already merging.";
return -EINVAL;
}
if (!snap_src->store->type->prepare_merge ||
!snap_src->store->type->commit_merge) {
snap->ti->error = "Snapshot exception store does not "
"support snapshot-merge.";
return -EINVAL;
}
return 1;
}
static void __insert_snapshot(struct origin *o, struct dm_snapshot *s)
{
struct dm_snapshot *l;
/* Sort the list according to chunk size, largest first, smallest last */
list_for_each_entry(l, &o->snapshots, list)
if (l->store->chunk_size < s->store->chunk_size)
break;
list_add_tail(&s->list, &l->list);
}
/* /*
* Make a note of the snapshot and its origin so we can look it * Make a note of the snapshot and its origin so we can look it
* up when the origin has a write on it. * up when the origin has a write on it.
*
* Also validate snapshot exception store handovers.
* On success, returns 1 if this registration is a handover destination,
* otherwise returns 0.
*/ */
static int register_snapshot(struct dm_snapshot *snap) static int register_snapshot(struct dm_snapshot *snap)
{ {
struct dm_snapshot *l; struct origin *o, *new_o = NULL;
struct origin *o, *new_o;
struct block_device *bdev = snap->origin->bdev; struct block_device *bdev = snap->origin->bdev;
int r = 0;
new_o = kmalloc(sizeof(*new_o), GFP_KERNEL); new_o = kmalloc(sizeof(*new_o), GFP_KERNEL);
if (!new_o) if (!new_o)
return -ENOMEM; return -ENOMEM;
down_write(&_origins_lock); down_write(&_origins_lock);
o = __lookup_origin(bdev);
r = __validate_exception_handover(snap);
if (r < 0) {
kfree(new_o);
goto out;
}
o = __lookup_origin(bdev);
if (o) if (o)
kfree(new_o); kfree(new_o);
else { else {
...@@ -320,14 +491,27 @@ static int register_snapshot(struct dm_snapshot *snap) ...@@ -320,14 +491,27 @@ static int register_snapshot(struct dm_snapshot *snap)
__insert_origin(o); __insert_origin(o);
} }
/* Sort the list according to chunk size, largest-first smallest-last */ __insert_snapshot(o, snap);
list_for_each_entry(l, &o->snapshots, list)
if (l->store->chunk_size < snap->store->chunk_size) out:
break; up_write(&_origins_lock);
list_add_tail(&snap->list, &l->list);
return r;
}
/*
* Move snapshot to correct place in list according to chunk size.
*/
static void reregister_snapshot(struct dm_snapshot *s)
{
struct block_device *bdev = s->origin->bdev;
down_write(&_origins_lock);
list_del(&s->list);
__insert_snapshot(__lookup_origin(bdev), s);
up_write(&_origins_lock); up_write(&_origins_lock);
return 0;
} }
static void unregister_snapshot(struct dm_snapshot *s) static void unregister_snapshot(struct dm_snapshot *s)
...@@ -338,7 +522,7 @@ static void unregister_snapshot(struct dm_snapshot *s) ...@@ -338,7 +522,7 @@ static void unregister_snapshot(struct dm_snapshot *s)
o = __lookup_origin(s->origin->bdev); o = __lookup_origin(s->origin->bdev);
list_del(&s->list); list_del(&s->list);
if (list_empty(&o->snapshots)) { if (o && list_empty(&o->snapshots)) {
list_del(&o->hash_list); list_del(&o->hash_list);
kfree(o); kfree(o);
} }
...@@ -351,8 +535,8 @@ static void unregister_snapshot(struct dm_snapshot *s) ...@@ -351,8 +535,8 @@ static void unregister_snapshot(struct dm_snapshot *s)
* The lowest hash_shift bits of the chunk number are ignored, allowing * The lowest hash_shift bits of the chunk number are ignored, allowing
* some consecutive chunks to be grouped together. * some consecutive chunks to be grouped together.
*/ */
static int init_exception_table(struct exception_table *et, uint32_t size, static int dm_exception_table_init(struct dm_exception_table *et,
unsigned hash_shift) uint32_t size, unsigned hash_shift)
{ {
unsigned int i; unsigned int i;
...@@ -368,10 +552,11 @@ static int init_exception_table(struct exception_table *et, uint32_t size, ...@@ -368,10 +552,11 @@ static int init_exception_table(struct exception_table *et, uint32_t size,
return 0; return 0;
} }
static void exit_exception_table(struct exception_table *et, struct kmem_cache *mem) static void dm_exception_table_exit(struct dm_exception_table *et,
struct kmem_cache *mem)
{ {
struct list_head *slot; struct list_head *slot;
struct dm_snap_exception *ex, *next; struct dm_exception *ex, *next;
int i, size; int i, size;
size = et->hash_mask + 1; size = et->hash_mask + 1;
...@@ -385,19 +570,12 @@ static void exit_exception_table(struct exception_table *et, struct kmem_cache * ...@@ -385,19 +570,12 @@ static void exit_exception_table(struct exception_table *et, struct kmem_cache *
vfree(et->table); vfree(et->table);
} }
static uint32_t exception_hash(struct exception_table *et, chunk_t chunk) static uint32_t exception_hash(struct dm_exception_table *et, chunk_t chunk)
{ {
return (chunk >> et->hash_shift) & et->hash_mask; return (chunk >> et->hash_shift) & et->hash_mask;
} }
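The renamed exception_hash() above still drops the lowest hash_shift bits of the chunk number before masking, so a run of consecutive chunks lands in one bucket and dm_insert_exception() can extend an existing entry instead of allocating a new one. A minimal user-space sketch of that bucketing; the shift, mask and chunk numbers below are invented for illustration:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t chunk_t;

/* Illustrative only: mirrors the shape of exception_hash(), not the
 * kernel's dm_exception_table. */
static uint32_t bucket_of(chunk_t chunk, unsigned hash_shift, uint32_t hash_mask)
{
	return (uint32_t)((chunk >> hash_shift) & hash_mask);
}

int main(void)
{
	const unsigned hash_shift = 4;   /* plays the role of DM_CHUNK_CONSECUTIVE_BITS */
	const uint32_t hash_mask = 255;  /* a 256-bucket table */
	chunk_t c;

	/* Chunks 32..47 share one bucket; chunk 48 rolls over to the next. */
	for (c = 32; c <= 48; c++)
		printf("chunk %llu -> bucket %u\n",
		       (unsigned long long)c,
		       bucket_of(c, hash_shift, hash_mask));
	return 0;
}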
static void insert_exception(struct exception_table *eh, static void dm_remove_exception(struct dm_exception *e)
struct dm_snap_exception *e)
{
struct list_head *l = &eh->table[exception_hash(eh, e->old_chunk)];
list_add(&e->hash_list, l);
}
static void remove_exception(struct dm_snap_exception *e)
{ {
list_del(&e->hash_list); list_del(&e->hash_list);
} }
@@ -406,11 +584,11 @@ static void remove_exception(struct dm_snap_exception *e)
* Return the exception data for a sector, or NULL if not * Return the exception data for a sector, or NULL if not
* remapped. * remapped.
*/ */
static struct dm_snap_exception *lookup_exception(struct exception_table *et, static struct dm_exception *dm_lookup_exception(struct dm_exception_table *et,
chunk_t chunk) chunk_t chunk)
{ {
struct list_head *slot; struct list_head *slot;
struct dm_snap_exception *e; struct dm_exception *e;
slot = &et->table[exception_hash(et, chunk)]; slot = &et->table[exception_hash(et, chunk)];
list_for_each_entry (e, slot, hash_list) list_for_each_entry (e, slot, hash_list)
@@ -421,9 +599,9 @@ static struct dm_snap_exception *lookup_exception(struct exception_table *et,
return NULL; return NULL;
} }
static struct dm_snap_exception *alloc_exception(void) static struct dm_exception *alloc_completed_exception(void)
{ {
struct dm_snap_exception *e; struct dm_exception *e;
e = kmem_cache_alloc(exception_cache, GFP_NOIO); e = kmem_cache_alloc(exception_cache, GFP_NOIO);
if (!e) if (!e)
@@ -432,7 +610,7 @@ static struct dm_snap_exception *alloc_exception(void)
return e; return e;
} }
static void free_exception(struct dm_snap_exception *e) static void free_completed_exception(struct dm_exception *e)
{ {
kmem_cache_free(exception_cache, e); kmem_cache_free(exception_cache, e);
} }
@@ -457,12 +635,11 @@ static void free_pending_exception(struct dm_snap_pending_exception *pe)
atomic_dec(&s->pending_exceptions_count); atomic_dec(&s->pending_exceptions_count);
} }
static void insert_completed_exception(struct dm_snapshot *s, static void dm_insert_exception(struct dm_exception_table *eh,
struct dm_snap_exception *new_e) struct dm_exception *new_e)
{ {
struct exception_table *eh = &s->complete;
struct list_head *l; struct list_head *l;
struct dm_snap_exception *e = NULL; struct dm_exception *e = NULL;
l = &eh->table[exception_hash(eh, new_e->old_chunk)]; l = &eh->table[exception_hash(eh, new_e->old_chunk)];
@@ -478,7 +655,7 @@ static void insert_completed_exception(struct dm_snapshot *s,
new_e->new_chunk == (dm_chunk_number(e->new_chunk) + new_e->new_chunk == (dm_chunk_number(e->new_chunk) +
dm_consecutive_chunk_count(e) + 1)) { dm_consecutive_chunk_count(e) + 1)) {
dm_consecutive_chunk_count_inc(e); dm_consecutive_chunk_count_inc(e);
free_exception(new_e); free_completed_exception(new_e);
return; return;
} }
@@ -488,7 +665,7 @@ static void insert_completed_exception(struct dm_snapshot *s,
dm_consecutive_chunk_count_inc(e); dm_consecutive_chunk_count_inc(e);
e->old_chunk--; e->old_chunk--;
e->new_chunk--; e->new_chunk--;
free_exception(new_e); free_completed_exception(new_e);
return; return;
} }
@@ -507,9 +684,9 @@ static void insert_completed_exception(struct dm_snapshot *s,
static int dm_add_exception(void *context, chunk_t old, chunk_t new) static int dm_add_exception(void *context, chunk_t old, chunk_t new)
{ {
struct dm_snapshot *s = context; struct dm_snapshot *s = context;
struct dm_snap_exception *e; struct dm_exception *e;
e = alloc_exception(); e = alloc_completed_exception();
if (!e) if (!e)
return -ENOMEM; return -ENOMEM;
@@ -518,11 +695,30 @@ static int dm_add_exception(void *context, chunk_t old, chunk_t new)
/* Consecutive_count is implicitly initialised to zero */ /* Consecutive_count is implicitly initialised to zero */
e->new_chunk = new; e->new_chunk = new;
insert_completed_exception(s, e); dm_insert_exception(&s->complete, e);
return 0; return 0;
} }
#define min_not_zero(l, r) (((l) == 0) ? (r) : (((r) == 0) ? (l) : min(l, r)))
/*
* Return a minimum chunk size of all snapshots that have the specified origin.
* Return zero if the origin has no snapshots.
*/
static sector_t __minimum_chunk_size(struct origin *o)
{
struct dm_snapshot *snap;
unsigned chunk_size = 0;
if (o)
list_for_each_entry(snap, &o->snapshots, list)
chunk_size = min_not_zero(chunk_size,
snap->store->chunk_size);
return chunk_size;
}
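__minimum_chunk_size() relies on the fully parenthesised min_not_zero() added at the top of this hunk (the unparenthesised copy next to origin_map() is removed further down), so a chunk size of zero - a snapshot whose size is not yet known - never wins the comparison. A small stand-alone sketch of the same calculation, with invented chunk sizes:

#include <stdio.h>

#define min(a, b) ((a) < (b) ? (a) : (b))
/* Same shape as the macro in the hunk above. */
#define min_not_zero(l, r) (((l) == 0) ? (r) : (((r) == 0) ? (l) : min(l, r)))

int main(void)
{
	/* Hypothetical chunk sizes (in sectors) of three snapshots of one origin;
	 * zero means "not set yet" and must be skipped, not chosen. */
	unsigned long sizes[] = { 16, 0, 8 };
	unsigned long chunk_size = 0;
	unsigned i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		chunk_size = min_not_zero(chunk_size, sizes[i]);

	printf("minimum chunk size: %lu\n", chunk_size);	/* prints 8 */
	return 0;
}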
/* /*
* Hard coded magic. * Hard coded magic.
*/ */
@@ -546,16 +742,18 @@ static int init_hash_tables(struct dm_snapshot *s)
* Calculate based on the size of the original volume or * Calculate based on the size of the original volume or
* the COW volume... * the COW volume...
*/ */
cow_dev_size = get_dev_size(s->store->cow->bdev); cow_dev_size = get_dev_size(s->cow->bdev);
origin_dev_size = get_dev_size(s->origin->bdev); origin_dev_size = get_dev_size(s->origin->bdev);
max_buckets = calc_max_buckets(); max_buckets = calc_max_buckets();
hash_size = min(origin_dev_size, cow_dev_size) >> s->store->chunk_shift; hash_size = min(origin_dev_size, cow_dev_size) >> s->store->chunk_shift;
hash_size = min(hash_size, max_buckets); hash_size = min(hash_size, max_buckets);
if (hash_size < 64)
hash_size = 64;
hash_size = rounddown_pow_of_two(hash_size); hash_size = rounddown_pow_of_two(hash_size);
if (init_exception_table(&s->complete, hash_size, if (dm_exception_table_init(&s->complete, hash_size,
DM_CHUNK_CONSECUTIVE_BITS)) DM_CHUNK_CONSECUTIVE_BITS))
return -ENOMEM; return -ENOMEM;
/* /*
@@ -566,14 +764,284 @@ static int init_hash_tables(struct dm_snapshot *s)
if (hash_size < 64) if (hash_size < 64)
hash_size = 64; hash_size = 64;
if (init_exception_table(&s->pending, hash_size, 0)) { if (dm_exception_table_init(&s->pending, hash_size, 0)) {
exit_exception_table(&s->complete, exception_cache); dm_exception_table_exit(&s->complete, exception_cache);
return -ENOMEM; return -ENOMEM;
} }
return 0; return 0;
} }
static void merge_shutdown(struct dm_snapshot *s)
{
clear_bit_unlock(RUNNING_MERGE, &s->state_bits);
smp_mb__after_clear_bit();
wake_up_bit(&s->state_bits, RUNNING_MERGE);
}
static struct bio *__release_queued_bios_after_merge(struct dm_snapshot *s)
{
s->first_merging_chunk = 0;
s->num_merging_chunks = 0;
return bio_list_get(&s->bios_queued_during_merge);
}
/*
* Remove one chunk from the index of completed exceptions.
*/
static int __remove_single_exception_chunk(struct dm_snapshot *s,
chunk_t old_chunk)
{
struct dm_exception *e;
e = dm_lookup_exception(&s->complete, old_chunk);
if (!e) {
DMERR("Corruption detected: exception for block %llu is "
"on disk but not in memory",
(unsigned long long)old_chunk);
return -EINVAL;
}
/*
* If this is the only chunk using this exception, remove exception.
*/
if (!dm_consecutive_chunk_count(e)) {
dm_remove_exception(e);
free_completed_exception(e);
return 0;
}
/*
* The chunk may be either at the beginning or the end of a
* group of consecutive chunks - never in the middle. We are
* removing chunks in the opposite order to that in which they
* were added, so this should always be true.
* Decrement the consecutive chunk counter and adjust the
* starting point if necessary.
*/
if (old_chunk == e->old_chunk) {
e->old_chunk++;
e->new_chunk++;
} else if (old_chunk != e->old_chunk +
dm_consecutive_chunk_count(e)) {
DMERR("Attempt to merge block %llu from the "
"middle of a chunk range [%llu - %llu]",
(unsigned long long)old_chunk,
(unsigned long long)e->old_chunk,
(unsigned long long)
e->old_chunk + dm_consecutive_chunk_count(e));
return -EINVAL;
}
dm_consecutive_chunk_count_dec(e);
return 0;
}
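Because chunks are merged back in reverse order, a consecutive exception range only ever loses a chunk from one of its ends, never from the middle. The toy model below mirrors that bookkeeping with made-up chunk numbers; the single-chunk case, where the whole exception would be freed, is left out:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t chunk_t;

/* Toy model of one completed exception covering a consecutive range:
 * old_chunk..old_chunk+count is remapped to new_chunk..new_chunk+count. */
struct range_exception {
	chunk_t old_chunk;
	chunk_t new_chunk;
	unsigned count;		/* number of extra consecutive chunks */
};

/* Returns 0 on success, -1 if 'old' is not at either end of the range. */
static int trim_chunk(struct range_exception *e, chunk_t old)
{
	if (old == e->old_chunk) {			/* trim the front */
		e->old_chunk++;
		e->new_chunk++;
	} else if (old != e->old_chunk + e->count) {	/* middle chunk: refuse */
		return -1;
	}
	e->count--;					/* either end shrinks the range */
	return 0;
}

int main(void)
{
	struct range_exception e = { 100, 500, 3 };	/* chunks 100..103 */

	trim_chunk(&e, 103);			/* back:  100..102 remain */
	trim_chunk(&e, 100);			/* front: 101..102 remain */
	printf("middle chunk rejected: %d\n", trim_chunk(&e, 50));
	printf("range is now %llu..%llu\n",
	       (unsigned long long)e.old_chunk,
	       (unsigned long long)(e.old_chunk + e.count));
	return 0;
}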
static void flush_bios(struct bio *bio);
static int remove_single_exception_chunk(struct dm_snapshot *s)
{
struct bio *b = NULL;
int r;
chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1;
down_write(&s->lock);
/*
* Process chunks (and associated exceptions) in reverse order
* so that dm_consecutive_chunk_count_dec() accounting works.
*/
do {
r = __remove_single_exception_chunk(s, old_chunk);
if (r)
goto out;
} while (old_chunk-- > s->first_merging_chunk);
b = __release_queued_bios_after_merge(s);
out:
up_write(&s->lock);
if (b)
flush_bios(b);
return r;
}
static int origin_write_extent(struct dm_snapshot *merging_snap,
sector_t sector, unsigned chunk_size);
static void merge_callback(int read_err, unsigned long write_err,
void *context);
static uint64_t read_pending_exceptions_done_count(void)
{
uint64_t pending_exceptions_done;
spin_lock(&_pending_exceptions_done_spinlock);
pending_exceptions_done = _pending_exceptions_done_count;
spin_unlock(&_pending_exceptions_done_spinlock);
return pending_exceptions_done;
}
static void increment_pending_exceptions_done_count(void)
{
spin_lock(&_pending_exceptions_done_spinlock);
_pending_exceptions_done_count++;
spin_unlock(&_pending_exceptions_done_spinlock);
wake_up_all(&_pending_exceptions_done);
}
static void snapshot_merge_next_chunks(struct dm_snapshot *s)
{
int i, linear_chunks;
chunk_t old_chunk, new_chunk;
struct dm_io_region src, dest;
sector_t io_size;
uint64_t previous_count;
BUG_ON(!test_bit(RUNNING_MERGE, &s->state_bits));
if (unlikely(test_bit(SHUTDOWN_MERGE, &s->state_bits)))
goto shut;
/*
* valid flag never changes during merge, so no lock required.
*/
if (!s->valid) {
DMERR("Snapshot is invalid: can't merge");
goto shut;
}
linear_chunks = s->store->type->prepare_merge(s->store, &old_chunk,
&new_chunk);
if (linear_chunks <= 0) {
if (linear_chunks < 0) {
DMERR("Read error in exception store: "
"shutting down merge");
down_write(&s->lock);
s->merge_failed = 1;
up_write(&s->lock);
}
goto shut;
}
/* Adjust old_chunk and new_chunk to reflect start of linear region */
old_chunk = old_chunk + 1 - linear_chunks;
new_chunk = new_chunk + 1 - linear_chunks;
/*
* Use one (potentially large) I/O to copy all 'linear_chunks'
* from the exception store to the origin
*/
io_size = linear_chunks * s->store->chunk_size;
dest.bdev = s->origin->bdev;
dest.sector = chunk_to_sector(s->store, old_chunk);
dest.count = min(io_size, get_dev_size(dest.bdev) - dest.sector);
src.bdev = s->cow->bdev;
src.sector = chunk_to_sector(s->store, new_chunk);
src.count = dest.count;
/*
* Reallocate any exceptions needed in other snapshots then
* wait for the pending exceptions to complete.
* Each time any pending exception (globally on the system)
* completes we are woken and repeat the process to find out
* if we can proceed. While this may not seem a particularly
* efficient algorithm, it is not expected to have any
* significant impact on performance.
*/
previous_count = read_pending_exceptions_done_count();
while (origin_write_extent(s, dest.sector, io_size)) {
wait_event(_pending_exceptions_done,
(read_pending_exceptions_done_count() !=
previous_count));
/* Retry after the wait, until all exceptions are done. */
previous_count = read_pending_exceptions_done_count();
}
down_write(&s->lock);
s->first_merging_chunk = old_chunk;
s->num_merging_chunks = linear_chunks;
up_write(&s->lock);
/* Wait until writes to all 'linear_chunks' drain */
for (i = 0; i < linear_chunks; i++)
__check_for_conflicting_io(s, old_chunk + i);
dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, merge_callback, s);
return;
shut:
merge_shutdown(s);
}
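The retry loop above samples the global "pending exceptions done" counter before sleeping and re-waits whenever the counter changes, so a completion that races with the check cannot be missed. A user-space analogue of the same sample-then-recheck pattern, using a mutex and condition variable; the helper thread and the count of three in-flight exceptions are invented:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static uint64_t done_count;		/* monotonically increasing, like the kernel counter */
static int outstanding = 3;		/* pretend three exceptions are in flight */

static void *completer(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		if (!outstanding) {
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		outstanding--;
		done_count++;			/* increment, then wake all waiters */
		pthread_cond_broadcast(&done_cond);
		pthread_mutex_unlock(&lock);
	}
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, completer, NULL);

	pthread_mutex_lock(&lock);
	while (outstanding) {			/* the "must wait" condition */
		uint64_t previous = done_count;	/* sample before sleeping */
		while (done_count == previous)	/* wait for any completion */
			pthread_cond_wait(&done_cond, &lock);
	}
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	printf("all pending exceptions completed\n");
	return 0;
}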
static void error_bios(struct bio *bio);
static void merge_callback(int read_err, unsigned long write_err, void *context)
{
struct dm_snapshot *s = context;
struct bio *b = NULL;
if (read_err || write_err) {
if (read_err)
DMERR("Read error: shutting down merge.");
else
DMERR("Write error: shutting down merge.");
goto shut;
}
if (s->store->type->commit_merge(s->store,
s->num_merging_chunks) < 0) {
DMERR("Write error in exception store: shutting down merge");
goto shut;
}
if (remove_single_exception_chunk(s) < 0)
goto shut;
snapshot_merge_next_chunks(s);
return;
shut:
down_write(&s->lock);
s->merge_failed = 1;
b = __release_queued_bios_after_merge(s);
up_write(&s->lock);
error_bios(b);
merge_shutdown(s);
}
static void start_merge(struct dm_snapshot *s)
{
if (!test_and_set_bit(RUNNING_MERGE, &s->state_bits))
snapshot_merge_next_chunks(s);
}
static int wait_schedule(void *ptr)
{
schedule();
return 0;
}
/*
* Stop the merging process and wait until it finishes.
*/
static void stop_merge(struct dm_snapshot *s)
{
set_bit(SHUTDOWN_MERGE, &s->state_bits);
wait_on_bit(&s->state_bits, RUNNING_MERGE, wait_schedule,
TASK_UNINTERRUPTIBLE);
clear_bit(SHUTDOWN_MERGE, &s->state_bits);
}
/* /*
* Construct a snapshot mapping: <origin_dev> <COW-dev> <p/n> <chunk-size> * Construct a snapshot mapping: <origin_dev> <COW-dev> <p/n> <chunk-size>
*/ */
@@ -582,50 +1050,73 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
struct dm_snapshot *s; struct dm_snapshot *s;
int i; int i;
int r = -EINVAL; int r = -EINVAL;
char *origin_path; char *origin_path, *cow_path;
struct dm_exception_store *store; unsigned args_used, num_flush_requests = 1;
unsigned args_used; fmode_t origin_mode = FMODE_READ;
if (argc != 4) { if (argc != 4) {
ti->error = "requires exactly 4 arguments"; ti->error = "requires exactly 4 arguments";
r = -EINVAL; r = -EINVAL;
goto bad_args; goto bad;
}
if (dm_target_is_snapshot_merge(ti)) {
num_flush_requests = 2;
origin_mode = FMODE_WRITE;
} }
origin_path = argv[0]; origin_path = argv[0];
argv++; argv++;
argc--; argc--;
r = dm_exception_store_create(ti, argc, argv, &args_used, &store); s = kmalloc(sizeof(*s), GFP_KERNEL);
if (!s) {
ti->error = "Cannot allocate snapshot context private "
"structure";
r = -ENOMEM;
goto bad;
}
cow_path = argv[0];
argv++;
argc--;
r = dm_get_device(ti, cow_path, 0, 0,
FMODE_READ | FMODE_WRITE, &s->cow);
if (r) {
ti->error = "Cannot get COW device";
goto bad_cow;
}
r = dm_exception_store_create(ti, argc, argv, s, &args_used, &s->store);
if (r) { if (r) {
ti->error = "Couldn't create exception store"; ti->error = "Couldn't create exception store";
r = -EINVAL; r = -EINVAL;
goto bad_args; goto bad_store;
} }
argv += args_used; argv += args_used;
argc -= args_used; argc -= args_used;
s = kmalloc(sizeof(*s), GFP_KERNEL); r = dm_get_device(ti, origin_path, 0, ti->len, origin_mode, &s->origin);
if (!s) {
ti->error = "Cannot allocate snapshot context private "
"structure";
r = -ENOMEM;
goto bad_snap;
}
r = dm_get_device(ti, origin_path, 0, ti->len, FMODE_READ, &s->origin);
if (r) { if (r) {
ti->error = "Cannot get origin device"; ti->error = "Cannot get origin device";
goto bad_origin; goto bad_origin;
} }
s->store = store; s->ti = ti;
s->valid = 1; s->valid = 1;
s->active = 0; s->active = 0;
s->suspended = 0;
atomic_set(&s->pending_exceptions_count, 0); atomic_set(&s->pending_exceptions_count, 0);
init_rwsem(&s->lock); init_rwsem(&s->lock);
INIT_LIST_HEAD(&s->list);
spin_lock_init(&s->pe_lock); spin_lock_init(&s->pe_lock);
s->state_bits = 0;
s->merge_failed = 0;
s->first_merging_chunk = 0;
s->num_merging_chunks = 0;
bio_list_init(&s->bios_queued_during_merge);
/* Allocate hash table for COW data */ /* Allocate hash table for COW data */
if (init_hash_tables(s)) { if (init_hash_tables(s)) {
@@ -659,39 +1150,55 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
spin_lock_init(&s->tracked_chunk_lock); spin_lock_init(&s->tracked_chunk_lock);
/* Metadata must only be loaded into one table at once */ bio_list_init(&s->queued_bios);
INIT_WORK(&s->queued_bios_work, flush_queued_bios);
ti->private = s;
ti->num_flush_requests = num_flush_requests;
/* Add snapshot to the list of snapshots for this origin */
/* Exceptions aren't triggered till snapshot_resume() is called */
r = register_snapshot(s);
if (r == -ENOMEM) {
ti->error = "Snapshot origin struct allocation failed";
goto bad_load_and_register;
} else if (r < 0) {
/* invalid handover, register_snapshot has set ti->error */
goto bad_load_and_register;
}
/*
* Metadata must only be loaded into one table at once, so skip this
* if metadata will be handed over during resume.
* Chunk size will be set during the handover - set it to zero to
* ensure it's ignored.
*/
if (r > 0) {
s->store->chunk_size = 0;
return 0;
}
r = s->store->type->read_metadata(s->store, dm_add_exception, r = s->store->type->read_metadata(s->store, dm_add_exception,
(void *)s); (void *)s);
if (r < 0) { if (r < 0) {
ti->error = "Failed to read snapshot metadata"; ti->error = "Failed to read snapshot metadata";
goto bad_load_and_register; goto bad_read_metadata;
} else if (r > 0) { } else if (r > 0) {
s->valid = 0; s->valid = 0;
DMWARN("Snapshot is marked invalid."); DMWARN("Snapshot is marked invalid.");
} }
bio_list_init(&s->queued_bios);
INIT_WORK(&s->queued_bios_work, flush_queued_bios);
if (!s->store->chunk_size) { if (!s->store->chunk_size) {
ti->error = "Chunk size not set"; ti->error = "Chunk size not set";
goto bad_load_and_register; goto bad_read_metadata;
}
/* Add snapshot to the list of snapshots for this origin */
/* Exceptions aren't triggered till snapshot_resume() is called */
if (register_snapshot(s)) {
r = -EINVAL;
ti->error = "Cannot register snapshot origin";
goto bad_load_and_register;
} }
ti->private = s;
ti->split_io = s->store->chunk_size; ti->split_io = s->store->chunk_size;
ti->num_flush_requests = 1;
return 0; return 0;
bad_read_metadata:
unregister_snapshot(s);
bad_load_and_register: bad_load_and_register:
mempool_destroy(s->tracked_chunk_pool); mempool_destroy(s->tracked_chunk_pool);
@@ -702,19 +1209,22 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
dm_kcopyd_client_destroy(s->kcopyd_client); dm_kcopyd_client_destroy(s->kcopyd_client);
bad_kcopyd: bad_kcopyd:
exit_exception_table(&s->pending, pending_cache); dm_exception_table_exit(&s->pending, pending_cache);
exit_exception_table(&s->complete, exception_cache); dm_exception_table_exit(&s->complete, exception_cache);
bad_hash_tables: bad_hash_tables:
dm_put_device(ti, s->origin); dm_put_device(ti, s->origin);
bad_origin: bad_origin:
kfree(s); dm_exception_store_destroy(s->store);
bad_snap: bad_store:
dm_exception_store_destroy(store); dm_put_device(ti, s->cow);
bad_cow:
kfree(s);
bad_args: bad:
return r; return r;
} }
@@ -723,8 +1233,39 @@ static void __free_exceptions(struct dm_snapshot *s)
dm_kcopyd_client_destroy(s->kcopyd_client); dm_kcopyd_client_destroy(s->kcopyd_client);
s->kcopyd_client = NULL; s->kcopyd_client = NULL;
exit_exception_table(&s->pending, pending_cache); dm_exception_table_exit(&s->pending, pending_cache);
exit_exception_table(&s->complete, exception_cache); dm_exception_table_exit(&s->complete, exception_cache);
}
static void __handover_exceptions(struct dm_snapshot *snap_src,
struct dm_snapshot *snap_dest)
{
union {
struct dm_exception_table table_swap;
struct dm_exception_store *store_swap;
} u;
/*
* Swap all snapshot context information between the two instances.
*/
u.table_swap = snap_dest->complete;
snap_dest->complete = snap_src->complete;
snap_src->complete = u.table_swap;
u.store_swap = snap_dest->store;
snap_dest->store = snap_src->store;
snap_src->store = u.store_swap;
snap_dest->store->snap = snap_dest;
snap_src->store->snap = snap_src;
snap_dest->ti->split_io = snap_dest->store->chunk_size;
snap_dest->valid = snap_src->valid;
/*
* Set source invalid to ensure it receives no further I/O.
*/
snap_src->valid = 0;
} }
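__handover_exceptions() gives the merging snapshot the completed exception table and exception store of the snapshot being merged, then invalidates the source so it takes no further I/O. A toy user-space model of that swap, with made-up structures and a hypothetical device name:

#include <stdio.h>

/* Toy stand-ins for the exception table and store being handed over. */
struct toy_table { int nr_exceptions; };
struct toy_store { const char *backing; };

struct toy_snapshot {
	struct toy_table complete;
	struct toy_store *store;
	int valid;
};

/* Mirrors the shape of the swap above: the destination inherits the source's
 * completed exceptions and store; the source is invalidated. */
static void handover(struct toy_snapshot *src, struct toy_snapshot *dest)
{
	struct toy_table table_swap = dest->complete;
	struct toy_store *store_swap = dest->store;

	dest->complete = src->complete;
	src->complete = table_swap;

	dest->store = src->store;
	src->store = store_swap;

	dest->valid = src->valid;
	src->valid = 0;
}

int main(void)
{
	struct toy_store cow = { "/dev/mapper/cow" };	/* hypothetical name */
	struct toy_store empty = { NULL };
	struct toy_snapshot old_snap = { { 42 }, &cow, 1 };
	struct toy_snapshot merging = { { 0 }, &empty, 1 };

	handover(&old_snap, &merging);
	printf("merging snapshot now owns %d exceptions on %s; source valid=%d\n",
	       merging.complete.nr_exceptions, merging.store->backing,
	       old_snap.valid);
	return 0;
}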
static void snapshot_dtr(struct dm_target *ti) static void snapshot_dtr(struct dm_target *ti)
@@ -733,9 +1274,24 @@ static void snapshot_dtr(struct dm_target *ti)
int i; int i;
#endif #endif
struct dm_snapshot *s = ti->private; struct dm_snapshot *s = ti->private;
struct dm_snapshot *snap_src = NULL, *snap_dest = NULL;
flush_workqueue(ksnapd); flush_workqueue(ksnapd);
down_read(&_origins_lock);
/* Check whether exception handover must be cancelled */
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
if (snap_src && snap_dest && (s == snap_src)) {
down_write(&snap_dest->lock);
snap_dest->valid = 0;
up_write(&snap_dest->lock);
DMERR("Cancelling snapshot handover.");
}
up_read(&_origins_lock);
if (dm_target_is_snapshot_merge(ti))
stop_merge(s);
/* Prevent further origin writes from using this snapshot. */ /* Prevent further origin writes from using this snapshot. */
/* After this returns there can be no new kcopyd jobs. */ /* After this returns there can be no new kcopyd jobs. */
unregister_snapshot(s); unregister_snapshot(s);
@@ -763,6 +1319,8 @@ static void snapshot_dtr(struct dm_target *ti)
dm_exception_store_destroy(s->store); dm_exception_store_destroy(s->store);
dm_put_device(ti, s->cow);
kfree(s); kfree(s);
} }
@@ -795,6 +1353,26 @@ static void flush_queued_bios(struct work_struct *work)
flush_bios(queued_bios); flush_bios(queued_bios);
} }
static int do_origin(struct dm_dev *origin, struct bio *bio);
/*
* Flush a list of buffers.
*/
static void retry_origin_bios(struct dm_snapshot *s, struct bio *bio)
{
struct bio *n;
int r;
while (bio) {
n = bio->bi_next;
bio->bi_next = NULL;
r = do_origin(s->origin, bio);
if (r == DM_MAPIO_REMAPPED)
generic_make_request(bio);
bio = n;
}
}
/* /*
* Error a list of buffers. * Error a list of buffers.
*/ */
@@ -825,45 +1403,12 @@ static void __invalidate_snapshot(struct dm_snapshot *s, int err)
s->valid = 0; s->valid = 0;
dm_table_event(s->store->ti->table); dm_table_event(s->ti->table);
}
static void get_pending_exception(struct dm_snap_pending_exception *pe)
{
atomic_inc(&pe->ref_count);
}
static struct bio *put_pending_exception(struct dm_snap_pending_exception *pe)
{
struct dm_snap_pending_exception *primary_pe;
struct bio *origin_bios = NULL;
primary_pe = pe->primary_pe;
/*
* If this pe is involved in a write to the origin and
* it is the last sibling to complete then release
* the bios for the original write to the origin.
*/
if (primary_pe &&
atomic_dec_and_test(&primary_pe->ref_count)) {
origin_bios = bio_list_get(&primary_pe->origin_bios);
free_pending_exception(primary_pe);
}
/*
* Free the pe if it's not linked to an origin write or if
* it's not itself a primary pe.
*/
if (!primary_pe || primary_pe != pe)
free_pending_exception(pe);
return origin_bios;
} }
static void pending_complete(struct dm_snap_pending_exception *pe, int success) static void pending_complete(struct dm_snap_pending_exception *pe, int success)
{ {
struct dm_snap_exception *e; struct dm_exception *e;
struct dm_snapshot *s = pe->snap; struct dm_snapshot *s = pe->snap;
struct bio *origin_bios = NULL; struct bio *origin_bios = NULL;
struct bio *snapshot_bios = NULL; struct bio *snapshot_bios = NULL;
@@ -877,7 +1422,7 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
goto out; goto out;
} }
e = alloc_exception(); e = alloc_completed_exception();
if (!e) { if (!e) {
down_write(&s->lock); down_write(&s->lock);
__invalidate_snapshot(s, -ENOMEM); __invalidate_snapshot(s, -ENOMEM);
@@ -888,28 +1433,27 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
down_write(&s->lock); down_write(&s->lock);
if (!s->valid) { if (!s->valid) {
free_exception(e); free_completed_exception(e);
error = 1; error = 1;
goto out; goto out;
} }
/* /* Check for conflicting reads */
* Check for conflicting reads. This is extremely improbable, __check_for_conflicting_io(s, pe->e.old_chunk);
* so msleep(1) is sufficient and there is no need for a wait queue.
*/
while (__chunk_is_tracked(s, pe->e.old_chunk))
msleep(1);
/* /*
* Add a proper exception, and remove the * Add a proper exception, and remove the
* in-flight exception from the list. * in-flight exception from the list.
*/ */
insert_completed_exception(s, e); dm_insert_exception(&s->complete, e);
out: out:
remove_exception(&pe->e); dm_remove_exception(&pe->e);
snapshot_bios = bio_list_get(&pe->snapshot_bios); snapshot_bios = bio_list_get(&pe->snapshot_bios);
origin_bios = put_pending_exception(pe); origin_bios = bio_list_get(&pe->origin_bios);
free_pending_exception(pe);
increment_pending_exceptions_done_count();
up_write(&s->lock); up_write(&s->lock);
@@ -919,7 +1463,7 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
else else
flush_bios(snapshot_bios); flush_bios(snapshot_bios);
flush_bios(origin_bios); retry_origin_bios(s, origin_bios);
} }
static void commit_callback(void *context, int success) static void commit_callback(void *context, int success)
@@ -963,7 +1507,7 @@ static void start_copy(struct dm_snap_pending_exception *pe)
src.sector = chunk_to_sector(s->store, pe->e.old_chunk); src.sector = chunk_to_sector(s->store, pe->e.old_chunk);
src.count = min((sector_t)s->store->chunk_size, dev_size - src.sector); src.count = min((sector_t)s->store->chunk_size, dev_size - src.sector);
dest.bdev = s->store->cow->bdev; dest.bdev = s->cow->bdev;
dest.sector = chunk_to_sector(s->store, pe->e.new_chunk); dest.sector = chunk_to_sector(s->store, pe->e.new_chunk);
dest.count = src.count; dest.count = src.count;
@@ -975,7 +1519,7 @@ static void start_copy(struct dm_snap_pending_exception *pe)
static struct dm_snap_pending_exception * static struct dm_snap_pending_exception *
__lookup_pending_exception(struct dm_snapshot *s, chunk_t chunk) __lookup_pending_exception(struct dm_snapshot *s, chunk_t chunk)
{ {
struct dm_snap_exception *e = lookup_exception(&s->pending, chunk); struct dm_exception *e = dm_lookup_exception(&s->pending, chunk);
if (!e) if (!e)
return NULL; return NULL;
@@ -1006,8 +1550,6 @@ __find_pending_exception(struct dm_snapshot *s,
pe->e.old_chunk = chunk; pe->e.old_chunk = chunk;
bio_list_init(&pe->origin_bios); bio_list_init(&pe->origin_bios);
bio_list_init(&pe->snapshot_bios); bio_list_init(&pe->snapshot_bios);
pe->primary_pe = NULL;
atomic_set(&pe->ref_count, 0);
pe->started = 0; pe->started = 0;
if (s->store->type->prepare_exception(s->store, &pe->e)) { if (s->store->type->prepare_exception(s->store, &pe->e)) {
@@ -1015,16 +1557,15 @@ __find_pending_exception(struct dm_snapshot *s,
return NULL; return NULL;
} }
get_pending_exception(pe); dm_insert_exception(&s->pending, &pe->e);
insert_exception(&s->pending, &pe->e);
return pe; return pe;
} }
static void remap_exception(struct dm_snapshot *s, struct dm_snap_exception *e, static void remap_exception(struct dm_snapshot *s, struct dm_exception *e,
struct bio *bio, chunk_t chunk) struct bio *bio, chunk_t chunk)
{ {
bio->bi_bdev = s->store->cow->bdev; bio->bi_bdev = s->cow->bdev;
bio->bi_sector = chunk_to_sector(s->store, bio->bi_sector = chunk_to_sector(s->store,
dm_chunk_number(e->new_chunk) + dm_chunk_number(e->new_chunk) +
(chunk - e->old_chunk)) + (chunk - e->old_chunk)) +
@@ -1035,14 +1576,14 @@ static void remap_exception(struct dm_snapshot *s, struct dm_snap_exception *e,
static int snapshot_map(struct dm_target *ti, struct bio *bio, static int snapshot_map(struct dm_target *ti, struct bio *bio,
union map_info *map_context) union map_info *map_context)
{ {
struct dm_snap_exception *e; struct dm_exception *e;
struct dm_snapshot *s = ti->private; struct dm_snapshot *s = ti->private;
int r = DM_MAPIO_REMAPPED; int r = DM_MAPIO_REMAPPED;
chunk_t chunk; chunk_t chunk;
struct dm_snap_pending_exception *pe = NULL; struct dm_snap_pending_exception *pe = NULL;
if (unlikely(bio_empty_barrier(bio))) { if (unlikely(bio_empty_barrier(bio))) {
bio->bi_bdev = s->store->cow->bdev; bio->bi_bdev = s->cow->bdev;
return DM_MAPIO_REMAPPED; return DM_MAPIO_REMAPPED;
} }
@@ -1063,7 +1604,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
} }
/* If the block is already remapped - use that, else remap it */ /* If the block is already remapped - use that, else remap it */
e = lookup_exception(&s->complete, chunk); e = dm_lookup_exception(&s->complete, chunk);
if (e) { if (e) {
remap_exception(s, e, bio, chunk); remap_exception(s, e, bio, chunk);
goto out_unlock; goto out_unlock;
@@ -1087,7 +1628,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
goto out_unlock; goto out_unlock;
} }
e = lookup_exception(&s->complete, chunk); e = dm_lookup_exception(&s->complete, chunk);
if (e) { if (e) {
free_pending_exception(pe); free_pending_exception(pe);
remap_exception(s, e, bio, chunk); remap_exception(s, e, bio, chunk);
@@ -1125,6 +1666,78 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
return r; return r;
} }
/*
* A snapshot-merge target behaves like a combination of a snapshot
* target and a snapshot-origin target. It only generates new
* exceptions in other snapshots and not in the one that is being
* merged.
*
* For each chunk, if there is an existing exception, it is used to
* redirect I/O to the cow device. Otherwise I/O is sent to the origin,
* which in turn might generate exceptions in other snapshots.
* If merging is currently taking place on the chunk in question, the
* I/O is deferred by adding it to s->bios_queued_during_merge.
*/
static int snapshot_merge_map(struct dm_target *ti, struct bio *bio,
union map_info *map_context)
{
struct dm_exception *e;
struct dm_snapshot *s = ti->private;
int r = DM_MAPIO_REMAPPED;
chunk_t chunk;
if (unlikely(bio_empty_barrier(bio))) {
if (!map_context->flush_request)
bio->bi_bdev = s->origin->bdev;
else
bio->bi_bdev = s->cow->bdev;
map_context->ptr = NULL;
return DM_MAPIO_REMAPPED;
}
chunk = sector_to_chunk(s->store, bio->bi_sector);
down_write(&s->lock);
/* Full merging snapshots are redirected to the origin */
if (!s->valid)
goto redirect_to_origin;
/* If the block is already remapped - use that */
e = dm_lookup_exception(&s->complete, chunk);
if (e) {
/* Queue writes overlapping with chunks being merged */
if (bio_rw(bio) == WRITE &&
chunk >= s->first_merging_chunk &&
chunk < (s->first_merging_chunk +
s->num_merging_chunks)) {
bio->bi_bdev = s->origin->bdev;
bio_list_add(&s->bios_queued_during_merge, bio);
r = DM_MAPIO_SUBMITTED;
goto out_unlock;
}
remap_exception(s, e, bio, chunk);
if (bio_rw(bio) == WRITE)
map_context->ptr = track_chunk(s, chunk);
goto out_unlock;
}
redirect_to_origin:
bio->bi_bdev = s->origin->bdev;
if (bio_rw(bio) == WRITE) {
up_write(&s->lock);
return do_origin(s->origin, bio);
}
out_unlock:
up_write(&s->lock);
return r;
}
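The decision the merge target makes for each bio can be summarised as a small routing table: no exception (or an invalid snapshot) goes to the origin, a remapped chunk goes to the COW device, and a write that lands inside the window currently being merged is queued until that window has been merged. An illustrative stand-alone version of that decision, with invented chunk numbers:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t chunk_t;

enum route { TO_ORIGIN, TO_COW, QUEUE_UNTIL_MERGED };

/* Toy decision table for the merge target's map path above.  'valid',
 * 'remapped' and the merging window stand in for the snapshot state that
 * the real code reads under its lock. */
static enum route route_bio(int is_write, int valid, int remapped,
			    chunk_t chunk, chunk_t first_merging,
			    unsigned num_merging)
{
	if (!valid || !remapped)
		return TO_ORIGIN;
	if (is_write && chunk >= first_merging &&
	    chunk < first_merging + num_merging)
		return QUEUE_UNTIL_MERGED;	/* deferred, resubmitted later */
	return TO_COW;				/* use the existing exception */
}

int main(void)
{
	/* Pretend chunks 10..13 are currently being merged back into the origin. */
	printf("%d\n", route_bio(1, 1, 1, 11, 10, 4));	/* 2: write queued */
	printf("%d\n", route_bio(0, 1, 1, 11, 10, 4));	/* 1: read from COW */
	printf("%d\n", route_bio(1, 1, 0, 99, 10, 4));	/* 0: sent to origin */
	return 0;
}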
static int snapshot_end_io(struct dm_target *ti, struct bio *bio, static int snapshot_end_io(struct dm_target *ti, struct bio *bio,
int error, union map_info *map_context) int error, union map_info *map_context)
{ {
@@ -1137,40 +1750,135 @@ static int snapshot_end_io(struct dm_target *ti, struct bio *bio,
return 0; return 0;
} }
static void snapshot_merge_presuspend(struct dm_target *ti)
{
struct dm_snapshot *s = ti->private;
stop_merge(s);
}
static void snapshot_postsuspend(struct dm_target *ti)
{
struct dm_snapshot *s = ti->private;
down_write(&s->lock);
s->suspended = 1;
up_write(&s->lock);
}
static int snapshot_preresume(struct dm_target *ti)
{
int r = 0;
struct dm_snapshot *s = ti->private;
struct dm_snapshot *snap_src = NULL, *snap_dest = NULL;
down_read(&_origins_lock);
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
if (snap_src && snap_dest) {
down_read(&snap_src->lock);
if (s == snap_src) {
DMERR("Unable to resume snapshot source until "
"handover completes.");
r = -EINVAL;
} else if (!snap_src->suspended) {
DMERR("Unable to perform snapshot handover until "
"source is suspended.");
r = -EINVAL;
}
up_read(&snap_src->lock);
}
up_read(&_origins_lock);
return r;
}
static void snapshot_resume(struct dm_target *ti) static void snapshot_resume(struct dm_target *ti)
{ {
struct dm_snapshot *s = ti->private; struct dm_snapshot *s = ti->private;
struct dm_snapshot *snap_src = NULL, *snap_dest = NULL;
down_read(&_origins_lock);
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
if (snap_src && snap_dest) {
down_write(&snap_src->lock);
down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
__handover_exceptions(snap_src, snap_dest);
up_write(&snap_dest->lock);
up_write(&snap_src->lock);
}
up_read(&_origins_lock);
/* Now we have correct chunk size, reregister */
reregister_snapshot(s);
down_write(&s->lock); down_write(&s->lock);
s->active = 1; s->active = 1;
s->suspended = 0;
up_write(&s->lock); up_write(&s->lock);
} }
static sector_t get_origin_minimum_chunksize(struct block_device *bdev)
{
sector_t min_chunksize;
down_read(&_origins_lock);
min_chunksize = __minimum_chunk_size(__lookup_origin(bdev));
up_read(&_origins_lock);
return min_chunksize;
}
static void snapshot_merge_resume(struct dm_target *ti)
{
struct dm_snapshot *s = ti->private;
/*
* Handover exceptions from existing snapshot.
*/
snapshot_resume(ti);
/*
* snapshot-merge acts as an origin, so set ti->split_io
*/
ti->split_io = get_origin_minimum_chunksize(s->origin->bdev);
start_merge(s);
}
static int snapshot_status(struct dm_target *ti, status_type_t type, static int snapshot_status(struct dm_target *ti, status_type_t type,
char *result, unsigned int maxlen) char *result, unsigned int maxlen)
{ {
unsigned sz = 0; unsigned sz = 0;
struct dm_snapshot *snap = ti->private; struct dm_snapshot *snap = ti->private;
down_write(&snap->lock);
switch (type) { switch (type) {
case STATUSTYPE_INFO: case STATUSTYPE_INFO:
down_write(&snap->lock);
if (!snap->valid) if (!snap->valid)
DMEMIT("Invalid"); DMEMIT("Invalid");
else if (snap->merge_failed)
DMEMIT("Merge failed");
else { else {
if (snap->store->type->fraction_full) { if (snap->store->type->usage) {
sector_t numerator, denominator; sector_t total_sectors, sectors_allocated,
snap->store->type->fraction_full(snap->store, metadata_sectors;
&numerator, snap->store->type->usage(snap->store,
&denominator); &total_sectors,
DMEMIT("%llu/%llu", &sectors_allocated,
(unsigned long long)numerator, &metadata_sectors);
(unsigned long long)denominator); DMEMIT("%llu/%llu %llu",
(unsigned long long)sectors_allocated,
(unsigned long long)total_sectors,
(unsigned long long)metadata_sectors);
} }
else else
DMEMIT("Unknown"); DMEMIT("Unknown");
} }
up_write(&snap->lock);
break; break;
case STATUSTYPE_TABLE: case STATUSTYPE_TABLE:
@@ -1179,14 +1887,12 @@ static int snapshot_status(struct dm_target *ti, status_type_t type,
* to make private copies if the output is to * to make private copies if the output is to
* make sense. * make sense.
*/ */
DMEMIT("%s", snap->origin->name); DMEMIT("%s %s", snap->origin->name, snap->cow->name);
snap->store->type->status(snap->store, type, result + sz, snap->store->type->status(snap->store, type, result + sz,
maxlen - sz); maxlen - sz);
break; break;
} }
up_write(&snap->lock);
return 0; return 0;
} }
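Note the STATUSTYPE_INFO output changes shape here: instead of "<numerator>/<denominator>" it now reports "<sectors_allocated>/<total_sectors> <metadata_sectors>", and "Merge failed" joins "Invalid" and "Unknown" as possible strings. A trivial sketch of the new field order; the numbers are invented purely to show the format:

#include <stdio.h>

int main(void)
{
	unsigned long long total_sectors = 409600;
	unsigned long long sectors_allocated = 102400;
	unsigned long long metadata_sectors = 264;

	/* Same shape as the DMEMIT() in the hunk above. */
	printf("%llu/%llu %llu\n",
	       sectors_allocated, total_sectors, metadata_sectors);
	return 0;
}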
@@ -1202,17 +1908,36 @@ static int snapshot_iterate_devices(struct dm_target *ti,
/*----------------------------------------------------------------- /*-----------------------------------------------------------------
* Origin methods * Origin methods
*---------------------------------------------------------------*/ *---------------------------------------------------------------*/
static int __origin_write(struct list_head *snapshots, struct bio *bio)
/*
* If no exceptions need creating, DM_MAPIO_REMAPPED is returned and any
* supplied bio was ignored. The caller may submit it immediately.
* (No remapping actually occurs as the origin is always a direct linear
* map.)
*
* If further exceptions are required, DM_MAPIO_SUBMITTED is returned
* and any supplied bio is added to a list to be submitted once all
* the necessary exceptions exist.
*/
static int __origin_write(struct list_head *snapshots, sector_t sector,
struct bio *bio)
{ {
int r = DM_MAPIO_REMAPPED, first = 0; int r = DM_MAPIO_REMAPPED;
struct dm_snapshot *snap; struct dm_snapshot *snap;
struct dm_snap_exception *e; struct dm_exception *e;
struct dm_snap_pending_exception *pe, *next_pe, *primary_pe = NULL; struct dm_snap_pending_exception *pe;
struct dm_snap_pending_exception *pe_to_start_now = NULL;
struct dm_snap_pending_exception *pe_to_start_last = NULL;
chunk_t chunk; chunk_t chunk;
LIST_HEAD(pe_queue);
/* Do all the snapshots on this origin */ /* Do all the snapshots on this origin */
list_for_each_entry (snap, snapshots, list) { list_for_each_entry (snap, snapshots, list) {
/*
* Don't make new exceptions in a merging snapshot
* because it has effectively been deleted
*/
if (dm_target_is_snapshot_merge(snap->ti))
continue;
down_write(&snap->lock); down_write(&snap->lock);
@@ -1221,24 +1946,21 @@ static int __origin_write(struct list_head *snapshots, struct bio *bio)
goto next_snapshot; goto next_snapshot;
/* Nothing to do if writing beyond end of snapshot */ /* Nothing to do if writing beyond end of snapshot */
if (bio->bi_sector >= dm_table_get_size(snap->store->ti->table)) if (sector >= dm_table_get_size(snap->ti->table))
goto next_snapshot; goto next_snapshot;
/* /*
* Remember, different snapshots can have * Remember, different snapshots can have
* different chunk sizes. * different chunk sizes.
*/ */
chunk = sector_to_chunk(snap->store, bio->bi_sector); chunk = sector_to_chunk(snap->store, sector);
/* /*
* Check exception table to see if block * Check exception table to see if block
* is already remapped in this snapshot * is already remapped in this snapshot
* and trigger an exception if not. * and trigger an exception if not.
*
* ref_count is initialised to 1 so pending_complete()
* won't destroy the primary_pe while we're inside this loop.
*/ */
e = lookup_exception(&snap->complete, chunk); e = dm_lookup_exception(&snap->complete, chunk);
if (e) if (e)
goto next_snapshot; goto next_snapshot;
@@ -1253,7 +1975,7 @@ static int __origin_write(struct list_head *snapshots, struct bio *bio)
goto next_snapshot; goto next_snapshot;
} }
e = lookup_exception(&snap->complete, chunk); e = dm_lookup_exception(&snap->complete, chunk);
if (e) { if (e) {
free_pending_exception(pe); free_pending_exception(pe);
goto next_snapshot; goto next_snapshot;
@@ -1266,59 +1988,43 @@ static int __origin_write(struct list_head *snapshots, struct bio *bio)
} }
} }
if (!primary_pe) { r = DM_MAPIO_SUBMITTED;
/*
* Either every pe here has same
* primary_pe or none has one yet.
*/
if (pe->primary_pe)
primary_pe = pe->primary_pe;
else {
primary_pe = pe;
first = 1;
}
bio_list_add(&primary_pe->origin_bios, bio);
r = DM_MAPIO_SUBMITTED; /*
} * If an origin bio was supplied, queue it to wait for the
* completion of this exception, and start this one last,
* at the end of the function.
*/
if (bio) {
bio_list_add(&pe->origin_bios, bio);
bio = NULL;
if (!pe->primary_pe) { if (!pe->started) {
pe->primary_pe = primary_pe; pe->started = 1;
get_pending_exception(primary_pe); pe_to_start_last = pe;
}
} }
if (!pe->started) { if (!pe->started) {
pe->started = 1; pe->started = 1;
list_add_tail(&pe->list, &pe_queue); pe_to_start_now = pe;
} }
next_snapshot: next_snapshot:
up_write(&snap->lock); up_write(&snap->lock);
}
if (!primary_pe) if (pe_to_start_now) {
return r; start_copy(pe_to_start_now);
pe_to_start_now = NULL;
/* }
* If this is the first time we're processing this chunk and
* ref_count is now 1 it means all the pending exceptions
* got completed while we were in the loop above, so it falls to
* us here to remove the primary_pe and submit any origin_bios.
*/
if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
flush_bios(bio_list_get(&primary_pe->origin_bios));
free_pending_exception(primary_pe);
/* If we got here, pe_queue is necessarily empty. */
return r;
} }
/* /*
* Now that we have a complete pe list we can start the copying. * Submit the exception against which the bio is queued last,
* to give the other exceptions a head start.
*/ */
list_for_each_entry_safe(pe, next_pe, &pe_queue, list) if (pe_to_start_last)
start_copy(pe); start_copy(pe_to_start_last);
return r; return r;
} }
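The reworked __origin_write() drops the primary_pe reference counting: the origin bio is simply queued on the pending exception of whichever snapshot it hit, every other snapshot's copy is started as the loop goes round, and the copy holding the bio is started last so the others get a head start. A toy illustration of that start ordering; the snapshot names are invented:

#include <stdio.h>

int main(void)
{
	const char *needs_copy[] = { "snap1", "snap2", "snap3" };
	const int holds_origin_bio = 1;		/* pretend snap2 queued the origin bio */
	const char *start_last = NULL;
	int i;

	for (i = 0; i < 3; i++) {
		if (i == holds_origin_bio) {
			start_last = needs_copy[i];	/* remember it, don't start yet */
			continue;
		}
		printf("start_copy(%s)\n", needs_copy[i]);
	}
	if (start_last)
		printf("start_copy(%s)  /* started last */\n", start_last);
	return 0;
}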
@@ -1334,12 +2040,47 @@ static int do_origin(struct dm_dev *origin, struct bio *bio)
down_read(&_origins_lock); down_read(&_origins_lock);
o = __lookup_origin(origin->bdev); o = __lookup_origin(origin->bdev);
if (o) if (o)
r = __origin_write(&o->snapshots, bio); r = __origin_write(&o->snapshots, bio->bi_sector, bio);
up_read(&_origins_lock); up_read(&_origins_lock);
return r; return r;
} }
/*
* Trigger exceptions in all non-merging snapshots.
*
* The chunk size of the merging snapshot may be larger than the chunk
* size of some other snapshot so we may need to reallocate multiple
* chunks in other snapshots.
*
* We scan all the overlapping exceptions in the other snapshots.
* Returns 1 if anything was reallocated and must be waited for,
* otherwise returns 0.
*
* size must be a multiple of merging_snap's chunk_size.
*/
static int origin_write_extent(struct dm_snapshot *merging_snap,
sector_t sector, unsigned size)
{
int must_wait = 0;
sector_t n;
struct origin *o;
/*
* The origin's __minimum_chunk_size() got stored in split_io
* by snapshot_merge_resume().
*/
down_read(&_origins_lock);
o = __lookup_origin(merging_snap->origin->bdev);
for (n = 0; n < size; n += merging_snap->ti->split_io)
if (__origin_write(&o->snapshots, sector + n, NULL) ==
DM_MAPIO_SUBMITTED)
must_wait = 1;
up_read(&_origins_lock);
return must_wait;
}
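origin_write_extent() walks the extent in units of the merging snapshot's split_io, which snapshot_merge_resume() set to the minimum chunk size of all snapshots on the origin, so every chunk of every other snapshot that overlaps the extent is probed at least once. A small sketch of that stepping with two invented chunk sizes (all sizes in sectors):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;

int main(void)
{
	sector_t start = 128, size = 32;	/* extent being merged: sectors 128..159 */
	sector_t step = 8;			/* minimum chunk size = split_io */
	sector_t chunk_a = 8, chunk_b = 16;	/* chunk sizes of two other snapshots */
	sector_t n;

	/* Stepping by the smallest chunk size guarantees no chunk of either
	 * snapshot inside the extent is skipped. */
	for (n = 0; n < size; n += step)
		printf("sector %3llu -> snapA chunk %llu, snapB chunk %llu\n",
		       (unsigned long long)(start + n),
		       (unsigned long long)((start + n) / chunk_a),
		       (unsigned long long)((start + n) / chunk_b));
	return 0;
}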
/* /*
* Origin: maps a linear range of a device, with hooks for snapshotting. * Origin: maps a linear range of a device, with hooks for snapshotting.
*/ */
@@ -1391,8 +2132,6 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
return (bio_rw(bio) == WRITE) ? do_origin(dev, bio) : DM_MAPIO_REMAPPED; return (bio_rw(bio) == WRITE) ? do_origin(dev, bio) : DM_MAPIO_REMAPPED;
} }
#define min_not_zero(l, r) (l == 0) ? r : ((r == 0) ? l : min(l, r))
/* /*
* Set the target "split_io" field to the minimum of all the snapshots' * Set the target "split_io" field to the minimum of all the snapshots'
* chunk sizes. * chunk sizes.
@@ -1400,19 +2139,8 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
static void origin_resume(struct dm_target *ti) static void origin_resume(struct dm_target *ti)
{ {
struct dm_dev *dev = ti->private; struct dm_dev *dev = ti->private;
struct dm_snapshot *snap;
struct origin *o;
unsigned chunk_size = 0;
down_read(&_origins_lock);
o = __lookup_origin(dev->bdev);
if (o)
list_for_each_entry (snap, &o->snapshots, list)
chunk_size = min_not_zero(chunk_size,
snap->store->chunk_size);
up_read(&_origins_lock);
ti->split_io = chunk_size; ti->split_io = get_origin_minimum_chunksize(dev->bdev);
} }
static int origin_status(struct dm_target *ti, status_type_t type, char *result, static int origin_status(struct dm_target *ti, status_type_t type, char *result,
@@ -1455,17 +2183,35 @@ static struct target_type origin_target = {
static struct target_type snapshot_target = { static struct target_type snapshot_target = {
.name = "snapshot", .name = "snapshot",
.version = {1, 7, 0}, .version = {1, 9, 0},
.module = THIS_MODULE, .module = THIS_MODULE,
.ctr = snapshot_ctr, .ctr = snapshot_ctr,
.dtr = snapshot_dtr, .dtr = snapshot_dtr,
.map = snapshot_map, .map = snapshot_map,
.end_io = snapshot_end_io, .end_io = snapshot_end_io,
.postsuspend = snapshot_postsuspend,
.preresume = snapshot_preresume,
.resume = snapshot_resume, .resume = snapshot_resume,
.status = snapshot_status, .status = snapshot_status,
.iterate_devices = snapshot_iterate_devices, .iterate_devices = snapshot_iterate_devices,
}; };
static struct target_type merge_target = {
.name = dm_snapshot_merge_target_name,
.version = {1, 0, 0},
.module = THIS_MODULE,
.ctr = snapshot_ctr,
.dtr = snapshot_dtr,
.map = snapshot_merge_map,
.end_io = snapshot_end_io,
.presuspend = snapshot_merge_presuspend,
.postsuspend = snapshot_postsuspend,
.preresume = snapshot_preresume,
.resume = snapshot_merge_resume,
.status = snapshot_status,
.iterate_devices = snapshot_iterate_devices,
};
static int __init dm_snapshot_init(void) static int __init dm_snapshot_init(void)
{ {
int r; int r;
@@ -1477,7 +2223,7 @@ static int __init dm_snapshot_init(void)
} }
r = dm_register_target(&snapshot_target); r = dm_register_target(&snapshot_target);
if (r) { if (r < 0) {
DMERR("snapshot target register failed %d", r); DMERR("snapshot target register failed %d", r);
goto bad_register_snapshot_target; goto bad_register_snapshot_target;
} }
@@ -1485,34 +2231,40 @@ static int __init dm_snapshot_init(void)
r = dm_register_target(&origin_target); r = dm_register_target(&origin_target);
if (r < 0) { if (r < 0) {
DMERR("Origin target register failed %d", r); DMERR("Origin target register failed %d", r);
goto bad1; goto bad_register_origin_target;
}
r = dm_register_target(&merge_target);
if (r < 0) {
DMERR("Merge target register failed %d", r);
goto bad_register_merge_target;
} }
r = init_origin_hash(); r = init_origin_hash();
if (r) { if (r) {
DMERR("init_origin_hash failed."); DMERR("init_origin_hash failed.");
goto bad2; goto bad_origin_hash;
} }
exception_cache = KMEM_CACHE(dm_snap_exception, 0); exception_cache = KMEM_CACHE(dm_exception, 0);
if (!exception_cache) { if (!exception_cache) {
DMERR("Couldn't create exception cache."); DMERR("Couldn't create exception cache.");
r = -ENOMEM; r = -ENOMEM;
goto bad3; goto bad_exception_cache;
} }
pending_cache = KMEM_CACHE(dm_snap_pending_exception, 0); pending_cache = KMEM_CACHE(dm_snap_pending_exception, 0);
if (!pending_cache) { if (!pending_cache) {
DMERR("Couldn't create pending cache."); DMERR("Couldn't create pending cache.");
r = -ENOMEM; r = -ENOMEM;
goto bad4; goto bad_pending_cache;
} }
tracked_chunk_cache = KMEM_CACHE(dm_snap_tracked_chunk, 0); tracked_chunk_cache = KMEM_CACHE(dm_snap_tracked_chunk, 0);
if (!tracked_chunk_cache) { if (!tracked_chunk_cache) {
DMERR("Couldn't create cache to track chunks in use."); DMERR("Couldn't create cache to track chunks in use.");
r = -ENOMEM; r = -ENOMEM;
goto bad5; goto bad_tracked_chunk_cache;
} }
ksnapd = create_singlethread_workqueue("ksnapd"); ksnapd = create_singlethread_workqueue("ksnapd");
@@ -1526,19 +2278,21 @@ static int __init dm_snapshot_init(void)
bad_pending_pool: bad_pending_pool:
kmem_cache_destroy(tracked_chunk_cache); kmem_cache_destroy(tracked_chunk_cache);
bad5: bad_tracked_chunk_cache:
kmem_cache_destroy(pending_cache); kmem_cache_destroy(pending_cache);
bad4: bad_pending_cache:
kmem_cache_destroy(exception_cache); kmem_cache_destroy(exception_cache);
bad3: bad_exception_cache:
exit_origin_hash(); exit_origin_hash();
bad2: bad_origin_hash:
dm_unregister_target(&merge_target);
bad_register_merge_target:
dm_unregister_target(&origin_target); dm_unregister_target(&origin_target);
bad1: bad_register_origin_target:
dm_unregister_target(&snapshot_target); dm_unregister_target(&snapshot_target);
bad_register_snapshot_target: bad_register_snapshot_target:
dm_exception_store_exit(); dm_exception_store_exit();
return r; return r;
} }
@@ -1548,6 +2302,7 @@ static void __exit dm_snapshot_exit(void)
dm_unregister_target(&snapshot_target); dm_unregister_target(&snapshot_target);
dm_unregister_target(&origin_target); dm_unregister_target(&origin_target);
dm_unregister_target(&merge_target);
exit_origin_hash(); exit_origin_hash();
kmem_cache_destroy(pending_cache); kmem_cache_destroy(pending_cache);
...
@@ -59,7 +59,7 @@ static ssize_t dm_attr_uuid_show(struct mapped_device *md, char *buf)
static ssize_t dm_attr_suspended_show(struct mapped_device *md, char *buf)
{
- sprintf(buf, "%d\n", dm_suspended(md));
+ sprintf(buf, "%d\n", dm_suspended_md(md));
return strlen(buf);
}
@@ -79,6 +79,13 @@ static struct sysfs_ops dm_sysfs_ops = {
.show = dm_attr_show, .show = dm_attr_show,
}; };
/*
* The sysfs structure is embedded in md struct, nothing to do here
*/
static void dm_sysfs_release(struct kobject *kobj)
{
}
/* /*
* dm kobject is embedded in mapped_device structure * dm kobject is embedded in mapped_device structure
* no need to define release function here * no need to define release function here
@@ -86,6 +93,7 @@ static struct sysfs_ops dm_sysfs_ops = {
static struct kobj_type dm_ktype = { static struct kobj_type dm_ktype = {
.sysfs_ops = &dm_sysfs_ops, .sysfs_ops = &dm_sysfs_ops,
.default_attrs = dm_attrs, .default_attrs = dm_attrs,
.release = dm_sysfs_release
}; };
/* /*
...
@@ -238,6 +238,9 @@ void dm_table_destroy(struct dm_table *t)
{ {
unsigned int i; unsigned int i;
if (!t)
return;
while (atomic_read(&t->holders)) while (atomic_read(&t->holders))
msleep(1); msleep(1);
smp_mb(); smp_mb();
...
@@ -139,14 +139,13 @@ void dm_send_uevents(struct list_head *events, struct kobject *kobj)
list_del_init(&event->elist);
/*
- * Need to call dm_copy_name_and_uuid from here for now.
- * Context of previous var adds and locking used for
- * hash_cell not compatable.
+ * When a device is being removed this copy fails and we
+ * discard these unsent events.
*/
if (dm_copy_name_and_uuid(event->md, event->name,
event->uuid)) {
- DMERR("%s: dm_copy_name_and_uuid() failed",
+ DMINFO("%s: skipping sending uevent for lost device",
__func__);
goto uevent_free;
}
...
@@ -142,10 +142,20 @@ struct mapped_device {
*/ */
int barrier_error; int barrier_error;
/*
* Protect barrier_error from concurrent endio processing
* in request-based dm.
*/
spinlock_t barrier_error_lock;
/* /*
* Processing queue (flush/barriers) * Processing queue (flush/barriers)
*/ */
struct workqueue_struct *wq; struct workqueue_struct *wq;
struct work_struct barrier_work;
/* A pointer to the currently processing pre/post flush request */
struct request *flush_request;
/* /*
* The current mapping. * The current mapping.
@@ -178,9 +188,6 @@ struct mapped_device {
/* forced geometry settings */ /* forced geometry settings */
struct hd_geometry geometry; struct hd_geometry geometry;
/* marker of flush suspend for request-based dm */
struct request suspend_rq;
/* For saving the address of __make_request for request based dm */ /* For saving the address of __make_request for request based dm */
make_request_fn *saved_make_request_fn; make_request_fn *saved_make_request_fn;
@@ -275,6 +282,7 @@ static int (*_inits[])(void) __initdata = {
dm_target_init, dm_target_init,
dm_linear_init, dm_linear_init,
dm_stripe_init, dm_stripe_init,
dm_io_init,
dm_kcopyd_init, dm_kcopyd_init,
dm_interface_init, dm_interface_init,
}; };
@@ -284,6 +292,7 @@ static void (*_exits[])(void) = {
dm_target_exit, dm_target_exit,
dm_linear_exit, dm_linear_exit,
dm_stripe_exit, dm_stripe_exit,
dm_io_exit,
dm_kcopyd_exit, dm_kcopyd_exit,
dm_interface_exit, dm_interface_exit,
}; };
@@ -320,6 +329,11 @@ static void __exit dm_exit(void)
/* /*
* Block device functions * Block device functions
*/ */
int dm_deleting_md(struct mapped_device *md)
{
return test_bit(DMF_DELETING, &md->flags);
}
static int dm_blk_open(struct block_device *bdev, fmode_t mode) static int dm_blk_open(struct block_device *bdev, fmode_t mode)
{ {
struct mapped_device *md; struct mapped_device *md;
@@ -331,7 +345,7 @@ static int dm_blk_open(struct block_device *bdev, fmode_t mode)
goto out; goto out;
if (test_bit(DMF_FREEING, &md->flags) || if (test_bit(DMF_FREEING, &md->flags) ||
test_bit(DMF_DELETING, &md->flags)) { dm_deleting_md(md)) {
md = NULL; md = NULL;
goto out; goto out;
} }
@@ -388,7 +402,7 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg) unsigned int cmd, unsigned long arg)
{ {
struct mapped_device *md = bdev->bd_disk->private_data; struct mapped_device *md = bdev->bd_disk->private_data;
struct dm_table *map = dm_get_table(md); struct dm_table *map = dm_get_live_table(md);
struct dm_target *tgt; struct dm_target *tgt;
int r = -ENOTTY; int r = -ENOTTY;
@@ -401,7 +415,7 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
tgt = dm_table_get_target(map, 0); tgt = dm_table_get_target(map, 0);
if (dm_suspended(md)) { if (dm_suspended_md(md)) {
r = -EAGAIN; r = -EAGAIN;
goto out; goto out;
} }
@@ -430,9 +444,10 @@ static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
mempool_free(tio, md->tio_pool); mempool_free(tio, md->tio_pool);
} }
static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md) static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md,
gfp_t gfp_mask)
{ {
return mempool_alloc(md->tio_pool, GFP_ATOMIC); return mempool_alloc(md->tio_pool, gfp_mask);
} }
static void free_rq_tio(struct dm_rq_target_io *tio) static void free_rq_tio(struct dm_rq_target_io *tio)
...@@ -450,6 +465,12 @@ static void free_bio_info(struct dm_rq_clone_bio_info *info) ...@@ -450,6 +465,12 @@ static void free_bio_info(struct dm_rq_clone_bio_info *info)
mempool_free(info, info->tio->md->io_pool); mempool_free(info, info->tio->md->io_pool);
} }
static int md_in_flight(struct mapped_device *md)
{
return atomic_read(&md->pending[READ]) +
atomic_read(&md->pending[WRITE]);
}
static void start_io_acct(struct dm_io *io) static void start_io_acct(struct dm_io *io)
{ {
struct mapped_device *md = io->md; struct mapped_device *md = io->md;
...@@ -512,7 +533,7 @@ static void queue_io(struct mapped_device *md, struct bio *bio) ...@@ -512,7 +533,7 @@ static void queue_io(struct mapped_device *md, struct bio *bio)
* function to access the md->map field, and make sure they call * function to access the md->map field, and make sure they call
* dm_table_put() when finished. * dm_table_put() when finished.
*/ */
struct dm_table *dm_get_table(struct mapped_device *md) struct dm_table *dm_get_live_table(struct mapped_device *md)
{ {
struct dm_table *t; struct dm_table *t;
unsigned long flags; unsigned long flags;
...@@ -716,28 +737,38 @@ static void end_clone_bio(struct bio *clone, int error) ...@@ -716,28 +737,38 @@ static void end_clone_bio(struct bio *clone, int error)
blk_update_request(tio->orig, 0, nr_bytes); blk_update_request(tio->orig, 0, nr_bytes);
} }
static void store_barrier_error(struct mapped_device *md, int error)
{
unsigned long flags;
spin_lock_irqsave(&md->barrier_error_lock, flags);
/*
* Basically, the first error is taken, but:
* -EOPNOTSUPP supersedes any I/O error.
* Requeue request supersedes any I/O error but -EOPNOTSUPP.
*/
if (!md->barrier_error || error == -EOPNOTSUPP ||
(md->barrier_error != -EOPNOTSUPP &&
error == DM_ENDIO_REQUEUE))
md->barrier_error = error;
spin_unlock_irqrestore(&md->barrier_error_lock, flags);
}
/* /*
* Don't touch any member of the md after calling this function because * Don't touch any member of the md after calling this function because
* the md may be freed in dm_put() at the end of this function. * the md may be freed in dm_put() at the end of this function.
* Or do dm_get() before calling this function and dm_put() later. * Or do dm_get() before calling this function and dm_put() later.
*/ */
static void rq_completed(struct mapped_device *md, int run_queue) static void rq_completed(struct mapped_device *md, int rw, int run_queue)
{ {
int wakeup_waiters = 0; atomic_dec(&md->pending[rw]);
struct request_queue *q = md->queue;
unsigned long flags;
spin_lock_irqsave(q->queue_lock, flags);
if (!queue_in_flight(q))
wakeup_waiters = 1;
spin_unlock_irqrestore(q->queue_lock, flags);
/* nudge anyone waiting on suspend queue */ /* nudge anyone waiting on suspend queue */
if (wakeup_waiters) if (!md_in_flight(md))
wake_up(&md->wait); wake_up(&md->wait);
if (run_queue) if (run_queue)
blk_run_queue(q); blk_run_queue(md->queue);
/* /*
* dm_put() must be at the end of this function. See the comment above * dm_put() must be at the end of this function. See the comment above
...@@ -753,6 +784,44 @@ static void free_rq_clone(struct request *clone) ...@@ -753,6 +784,44 @@ static void free_rq_clone(struct request *clone)
free_rq_tio(tio); free_rq_tio(tio);
} }
/*
* Complete the clone and the original request.
* Must be called without queue lock.
*/
static void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
int run_queue = 1;
bool is_barrier = blk_barrier_rq(clone);
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;
if (blk_pc_request(rq) && !is_barrier) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;
if (rq->sense)
/*
* We are using the sense buffer of the original
* request.
* So setting the length of the sense data is enough.
*/
rq->sense_len = clone->sense_len;
}
free_rq_clone(clone);
if (unlikely(is_barrier)) {
if (unlikely(error))
store_barrier_error(md, error);
run_queue = 0;
} else
blk_end_request_all(rq, error);
rq_completed(md, rw, run_queue);
}
static void dm_unprep_request(struct request *rq) static void dm_unprep_request(struct request *rq)
{ {
struct request *clone = rq->special; struct request *clone = rq->special;
...@@ -768,12 +837,23 @@ static void dm_unprep_request(struct request *rq) ...@@ -768,12 +837,23 @@ static void dm_unprep_request(struct request *rq)
*/ */
void dm_requeue_unmapped_request(struct request *clone) void dm_requeue_unmapped_request(struct request *clone)
{ {
int rw = rq_data_dir(clone);
struct dm_rq_target_io *tio = clone->end_io_data; struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md; struct mapped_device *md = tio->md;
struct request *rq = tio->orig; struct request *rq = tio->orig;
struct request_queue *q = rq->q; struct request_queue *q = rq->q;
unsigned long flags; unsigned long flags;
if (unlikely(blk_barrier_rq(clone))) {
/*
* Barrier clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
dm_end_request(clone, DM_ENDIO_REQUEUE);
return;
}
dm_unprep_request(rq); dm_unprep_request(rq);
spin_lock_irqsave(q->queue_lock, flags); spin_lock_irqsave(q->queue_lock, flags);
...@@ -782,7 +862,7 @@ void dm_requeue_unmapped_request(struct request *clone) ...@@ -782,7 +862,7 @@ void dm_requeue_unmapped_request(struct request *clone)
blk_requeue_request(q, rq); blk_requeue_request(q, rq);
spin_unlock_irqrestore(q->queue_lock, flags); spin_unlock_irqrestore(q->queue_lock, flags);
rq_completed(md, 0); rq_completed(md, rw, 0);
} }
EXPORT_SYMBOL_GPL(dm_requeue_unmapped_request); EXPORT_SYMBOL_GPL(dm_requeue_unmapped_request);
...@@ -815,34 +895,28 @@ static void start_queue(struct request_queue *q) ...@@ -815,34 +895,28 @@ static void start_queue(struct request_queue *q)
spin_unlock_irqrestore(q->queue_lock, flags); spin_unlock_irqrestore(q->queue_lock, flags);
} }
/* static void dm_done(struct request *clone, int error, bool mapped)
* Complete the clone and the original request.
* Must be called without queue lock.
*/
static void dm_end_request(struct request *clone, int error)
{ {
int r = error;
struct dm_rq_target_io *tio = clone->end_io_data; struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md; dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
struct request *rq = tio->orig;
if (blk_pc_request(rq)) { if (mapped && rq_end_io)
rq->errors = clone->errors; r = rq_end_io(tio->ti, clone, error, &tio->info);
rq->resid_len = clone->resid_len;
if (rq->sense) if (r <= 0)
/* /* The target wants to complete the I/O */
* We are using the sense buffer of the original dm_end_request(clone, r);
* request. else if (r == DM_ENDIO_INCOMPLETE)
* So setting the length of the sense data is enough. /* The target will handle the I/O */
*/ return;
rq->sense_len = clone->sense_len; else if (r == DM_ENDIO_REQUEUE)
/* The target wants to requeue the I/O */
dm_requeue_unmapped_request(clone);
else {
DMWARN("unimplemented target endio return value: %d", r);
BUG();
} }
free_rq_clone(clone);
blk_end_request_all(rq, error);
rq_completed(md, 1);
} }
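
The return value from a target's rq_end_io hook drives the dispatch in dm_done() above. A minimal sketch of such a hook, not part of this commit and assuming a hypothetical error_is_retryable() helper:

static int example_rq_end_io(struct dm_target *ti, struct request *clone,
			     int error, union map_info *map_context)
{
	/* Hypothetical retry policy: error_is_retryable() is an assumed helper. */
	if (error && error_is_retryable(error))
		return DM_ENDIO_REQUEUE;	/* dm_done() requeues the original request */

	/* Returning DM_ENDIO_INCOMPLETE would mean the target completes the I/O itself. */
	return error;				/* <= 0: dm_done() completes the original request */
}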
/* /*
...@@ -850,27 +924,14 @@ static void dm_end_request(struct request *clone, int error) ...@@ -850,27 +924,14 @@ static void dm_end_request(struct request *clone, int error)
*/ */
static void dm_softirq_done(struct request *rq) static void dm_softirq_done(struct request *rq)
{ {
bool mapped = true;
struct request *clone = rq->completion_data; struct request *clone = rq->completion_data;
struct dm_rq_target_io *tio = clone->end_io_data; struct dm_rq_target_io *tio = clone->end_io_data;
dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
int error = tio->error;
if (!(rq->cmd_flags & REQ_FAILED) && rq_end_io) if (rq->cmd_flags & REQ_FAILED)
error = rq_end_io(tio->ti, clone, error, &tio->info); mapped = false;
if (error <= 0) dm_done(clone, tio->error, mapped);
/* The target wants to complete the I/O */
dm_end_request(clone, error);
else if (error == DM_ENDIO_INCOMPLETE)
/* The target will handle the I/O */
return;
else if (error == DM_ENDIO_REQUEUE)
/* The target wants to requeue the I/O */
dm_requeue_unmapped_request(clone);
else {
DMWARN("unimplemented target endio return value: %d", error);
BUG();
}
} }
/* /*
...@@ -882,6 +943,19 @@ static void dm_complete_request(struct request *clone, int error) ...@@ -882,6 +943,19 @@ static void dm_complete_request(struct request *clone, int error)
struct dm_rq_target_io *tio = clone->end_io_data; struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig; struct request *rq = tio->orig;
if (unlikely(blk_barrier_rq(clone))) {
/*
* Barrier clones share an original request. So can't use
* softirq_done with the original.
* Pass the clone to dm_done() directly in this special case.
* It is safe (even if clone->q->queue_lock is held here)
* because there is no I/O dispatching during the completion
* of barrier clone.
*/
dm_done(clone, error, true);
return;
}
tio->error = error; tio->error = error;
rq->completion_data = clone; rq->completion_data = clone;
blk_complete_request(rq); blk_complete_request(rq);
...@@ -898,6 +972,17 @@ void dm_kill_unmapped_request(struct request *clone, int error) ...@@ -898,6 +972,17 @@ void dm_kill_unmapped_request(struct request *clone, int error)
struct dm_rq_target_io *tio = clone->end_io_data; struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig; struct request *rq = tio->orig;
if (unlikely(blk_barrier_rq(clone))) {
/*
* Barrier clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
BUG_ON(error > 0);
dm_end_request(clone, error);
return;
}
rq->cmd_flags |= REQ_FAILED; rq->cmd_flags |= REQ_FAILED;
dm_complete_request(clone, error); dm_complete_request(clone, error);
} }
...@@ -1214,7 +1299,7 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio) ...@@ -1214,7 +1299,7 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
struct clone_info ci; struct clone_info ci;
int error = 0; int error = 0;
ci.map = dm_get_table(md); ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) { if (unlikely(!ci.map)) {
if (!bio_rw_flagged(bio, BIO_RW_BARRIER)) if (!bio_rw_flagged(bio, BIO_RW_BARRIER))
bio_io_error(bio); bio_io_error(bio);
...@@ -1255,7 +1340,7 @@ static int dm_merge_bvec(struct request_queue *q, ...@@ -1255,7 +1340,7 @@ static int dm_merge_bvec(struct request_queue *q,
struct bio_vec *biovec) struct bio_vec *biovec)
{ {
struct mapped_device *md = q->queuedata; struct mapped_device *md = q->queuedata;
struct dm_table *map = dm_get_table(md); struct dm_table *map = dm_get_live_table(md);
struct dm_target *ti; struct dm_target *ti;
sector_t max_sectors; sector_t max_sectors;
int max_size = 0; int max_size = 0;
...@@ -1352,11 +1437,6 @@ static int dm_make_request(struct request_queue *q, struct bio *bio) ...@@ -1352,11 +1437,6 @@ static int dm_make_request(struct request_queue *q, struct bio *bio)
{ {
struct mapped_device *md = q->queuedata; struct mapped_device *md = q->queuedata;
if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
bio_endio(bio, -EOPNOTSUPP);
return 0;
}
return md->saved_make_request_fn(q, bio); /* call __make_request() */ return md->saved_make_request_fn(q, bio); /* call __make_request() */
} }
...@@ -1375,6 +1455,25 @@ static int dm_request(struct request_queue *q, struct bio *bio) ...@@ -1375,6 +1455,25 @@ static int dm_request(struct request_queue *q, struct bio *bio)
return _dm_request(q, bio); return _dm_request(q, bio);
} }
/*
* Mark this request as flush request, so that dm_request_fn() can
* recognize.
*/
static void dm_rq_prepare_flush(struct request_queue *q, struct request *rq)
{
rq->cmd_type = REQ_TYPE_LINUX_BLOCK;
rq->cmd[0] = REQ_LB_OP_FLUSH;
}
static bool dm_rq_is_flush_request(struct request *rq)
{
if (rq->cmd_type == REQ_TYPE_LINUX_BLOCK &&
rq->cmd[0] == REQ_LB_OP_FLUSH)
return true;
else
return false;
}
void dm_dispatch_request(struct request *rq) void dm_dispatch_request(struct request *rq)
{ {
int r; int r;
...@@ -1420,25 +1519,54 @@ static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig, ...@@ -1420,25 +1519,54 @@ static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
static int setup_clone(struct request *clone, struct request *rq, static int setup_clone(struct request *clone, struct request *rq,
struct dm_rq_target_io *tio) struct dm_rq_target_io *tio)
{ {
int r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC, int r;
dm_rq_bio_constructor, tio);
if (r) if (dm_rq_is_flush_request(rq)) {
return r; blk_rq_init(NULL, clone);
clone->cmd_type = REQ_TYPE_FS;
clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
} else {
r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
dm_rq_bio_constructor, tio);
if (r)
return r;
clone->cmd = rq->cmd;
clone->cmd_len = rq->cmd_len;
clone->sense = rq->sense;
clone->buffer = rq->buffer;
}
clone->cmd = rq->cmd;
clone->cmd_len = rq->cmd_len;
clone->sense = rq->sense;
clone->buffer = rq->buffer;
clone->end_io = end_clone_request; clone->end_io = end_clone_request;
clone->end_io_data = tio; clone->end_io_data = tio;
return 0; return 0;
} }
static int dm_rq_flush_suspending(struct mapped_device *md) static struct request *clone_rq(struct request *rq, struct mapped_device *md,
gfp_t gfp_mask)
{ {
return !md->suspend_rq.special; struct request *clone;
struct dm_rq_target_io *tio;
tio = alloc_rq_tio(md, gfp_mask);
if (!tio)
return NULL;
tio->md = md;
tio->ti = NULL;
tio->orig = rq;
tio->error = 0;
memset(&tio->info, 0, sizeof(tio->info));
clone = &tio->clone;
if (setup_clone(clone, rq, tio)) {
/* -ENOMEM */
free_rq_tio(tio);
return NULL;
}
return clone;
} }
/* /*
...@@ -1447,39 +1575,19 @@ static int dm_rq_flush_suspending(struct mapped_device *md) ...@@ -1447,39 +1575,19 @@ static int dm_rq_flush_suspending(struct mapped_device *md)
static int dm_prep_fn(struct request_queue *q, struct request *rq) static int dm_prep_fn(struct request_queue *q, struct request *rq)
{ {
struct mapped_device *md = q->queuedata; struct mapped_device *md = q->queuedata;
struct dm_rq_target_io *tio;
struct request *clone; struct request *clone;
if (unlikely(rq == &md->suspend_rq)) { if (unlikely(dm_rq_is_flush_request(rq)))
if (dm_rq_flush_suspending(md)) return BLKPREP_OK;
return BLKPREP_OK;
else
/* The flush suspend was interrupted */
return BLKPREP_KILL;
}
if (unlikely(rq->special)) { if (unlikely(rq->special)) {
DMWARN("Already has something in rq->special."); DMWARN("Already has something in rq->special.");
return BLKPREP_KILL; return BLKPREP_KILL;
} }
tio = alloc_rq_tio(md); /* Only one for each original request */ clone = clone_rq(rq, md, GFP_ATOMIC);
if (!tio) if (!clone)
/* -ENOMEM */
return BLKPREP_DEFER;
tio->md = md;
tio->ti = NULL;
tio->orig = rq;
tio->error = 0;
memset(&tio->info, 0, sizeof(tio->info));
clone = &tio->clone;
if (setup_clone(clone, rq, tio)) {
/* -ENOMEM */
free_rq_tio(tio);
return BLKPREP_DEFER; return BLKPREP_DEFER;
}
rq->special = clone; rq->special = clone;
rq->cmd_flags |= REQ_DONTPREP; rq->cmd_flags |= REQ_DONTPREP;
...@@ -1487,11 +1595,10 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq) ...@@ -1487,11 +1595,10 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
return BLKPREP_OK; return BLKPREP_OK;
} }
static void map_request(struct dm_target *ti, struct request *rq, static void map_request(struct dm_target *ti, struct request *clone,
struct mapped_device *md) struct mapped_device *md)
{ {
int r; int r;
struct request *clone = rq->special;
struct dm_rq_target_io *tio = clone->end_io_data; struct dm_rq_target_io *tio = clone->end_io_data;
/* /*
...@@ -1511,6 +1618,8 @@ static void map_request(struct dm_target *ti, struct request *rq, ...@@ -1511,6 +1618,8 @@ static void map_request(struct dm_target *ti, struct request *rq,
break; break;
case DM_MAPIO_REMAPPED: case DM_MAPIO_REMAPPED:
/* The target has remapped the I/O so dispatch it */ /* The target has remapped the I/O so dispatch it */
trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
blk_rq_pos(tio->orig));
dm_dispatch_request(clone); dm_dispatch_request(clone);
break; break;
case DM_MAPIO_REQUEUE: case DM_MAPIO_REQUEUE:
...@@ -1536,29 +1645,26 @@ static void map_request(struct dm_target *ti, struct request *rq, ...@@ -1536,29 +1645,26 @@ static void map_request(struct dm_target *ti, struct request *rq,
static void dm_request_fn(struct request_queue *q) static void dm_request_fn(struct request_queue *q)
{ {
struct mapped_device *md = q->queuedata; struct mapped_device *md = q->queuedata;
struct dm_table *map = dm_get_table(md); struct dm_table *map = dm_get_live_table(md);
struct dm_target *ti; struct dm_target *ti;
struct request *rq; struct request *rq, *clone;
/* /*
* For noflush suspend, check blk_queue_stopped() to immediately * For suspend, check blk_queue_stopped() and increment
* quit I/O dispatching. * ->pending within a single queue_lock not to increment the
* number of in-flight I/Os after the queue is stopped in
* dm_suspend().
*/ */
while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) { while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
rq = blk_peek_request(q); rq = blk_peek_request(q);
if (!rq) if (!rq)
goto plug_and_out; goto plug_and_out;
if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend maker */ if (unlikely(dm_rq_is_flush_request(rq))) {
if (queue_in_flight(q)) BUG_ON(md->flush_request);
/* Not quiet yet. Wait more */ md->flush_request = rq;
goto plug_and_out;
/* This device should be quiet now */
__stop_queue(q);
blk_start_request(rq); blk_start_request(rq);
__blk_end_request_all(rq, 0); queue_work(md->wq, &md->barrier_work);
wake_up(&md->wait);
goto out; goto out;
} }
...@@ -1567,8 +1673,11 @@ static void dm_request_fn(struct request_queue *q) ...@@ -1567,8 +1673,11 @@ static void dm_request_fn(struct request_queue *q)
goto plug_and_out; goto plug_and_out;
blk_start_request(rq); blk_start_request(rq);
clone = rq->special;
atomic_inc(&md->pending[rq_data_dir(clone)]);
spin_unlock(q->queue_lock); spin_unlock(q->queue_lock);
map_request(ti, rq, md); map_request(ti, clone, md);
spin_lock_irq(q->queue_lock); spin_lock_irq(q->queue_lock);
} }
...@@ -1595,7 +1704,7 @@ static int dm_lld_busy(struct request_queue *q) ...@@ -1595,7 +1704,7 @@ static int dm_lld_busy(struct request_queue *q)
{ {
int r; int r;
struct mapped_device *md = q->queuedata; struct mapped_device *md = q->queuedata;
struct dm_table *map = dm_get_table(md); struct dm_table *map = dm_get_live_table(md);
if (!map || test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) if (!map || test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))
r = 1; r = 1;
...@@ -1610,7 +1719,7 @@ static int dm_lld_busy(struct request_queue *q) ...@@ -1610,7 +1719,7 @@ static int dm_lld_busy(struct request_queue *q)
static void dm_unplug_all(struct request_queue *q) static void dm_unplug_all(struct request_queue *q)
{ {
struct mapped_device *md = q->queuedata; struct mapped_device *md = q->queuedata;
struct dm_table *map = dm_get_table(md); struct dm_table *map = dm_get_live_table(md);
if (map) { if (map) {
if (dm_request_based(md)) if (dm_request_based(md))
...@@ -1628,7 +1737,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits) ...@@ -1628,7 +1737,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
struct dm_table *map; struct dm_table *map;
if (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) { if (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
map = dm_get_table(md); map = dm_get_live_table(md);
if (map) { if (map) {
/* /*
* Request-based dm cares about only own queue for * Request-based dm cares about only own queue for
...@@ -1725,6 +1834,7 @@ static int next_free_minor(int *minor) ...@@ -1725,6 +1834,7 @@ static int next_free_minor(int *minor)
static const struct block_device_operations dm_blk_dops; static const struct block_device_operations dm_blk_dops;
static void dm_wq_work(struct work_struct *work); static void dm_wq_work(struct work_struct *work);
static void dm_rq_barrier_work(struct work_struct *work);
/* /*
* Allocate and initialise a blank device with a given minor. * Allocate and initialise a blank device with a given minor.
...@@ -1754,6 +1864,7 @@ static struct mapped_device *alloc_dev(int minor) ...@@ -1754,6 +1864,7 @@ static struct mapped_device *alloc_dev(int minor)
init_rwsem(&md->io_lock); init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock); mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock); spin_lock_init(&md->deferred_lock);
spin_lock_init(&md->barrier_error_lock);
rwlock_init(&md->map_lock); rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1); atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0); atomic_set(&md->open_count, 0);
...@@ -1788,6 +1899,8 @@ static struct mapped_device *alloc_dev(int minor) ...@@ -1788,6 +1899,8 @@ static struct mapped_device *alloc_dev(int minor)
blk_queue_softirq_done(md->queue, dm_softirq_done); blk_queue_softirq_done(md->queue, dm_softirq_done);
blk_queue_prep_rq(md->queue, dm_prep_fn); blk_queue_prep_rq(md->queue, dm_prep_fn);
blk_queue_lld_busy(md->queue, dm_lld_busy); blk_queue_lld_busy(md->queue, dm_lld_busy);
blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
dm_rq_prepare_flush);
md->disk = alloc_disk(1); md->disk = alloc_disk(1);
if (!md->disk) if (!md->disk)
...@@ -1797,6 +1910,7 @@ static struct mapped_device *alloc_dev(int minor) ...@@ -1797,6 +1910,7 @@ static struct mapped_device *alloc_dev(int minor)
atomic_set(&md->pending[1], 0); atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait); init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work); INIT_WORK(&md->work, dm_wq_work);
INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
init_waitqueue_head(&md->eventq); init_waitqueue_head(&md->eventq);
md->disk->major = _major; md->disk->major = _major;
...@@ -1921,9 +2035,13 @@ static void __set_size(struct mapped_device *md, sector_t size) ...@@ -1921,9 +2035,13 @@ static void __set_size(struct mapped_device *md, sector_t size)
mutex_unlock(&md->bdev->bd_inode->i_mutex); mutex_unlock(&md->bdev->bd_inode->i_mutex);
} }
static int __bind(struct mapped_device *md, struct dm_table *t, /*
struct queue_limits *limits) * Returns old map, which caller must destroy.
*/
static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct queue_limits *limits)
{ {
struct dm_table *old_map;
struct request_queue *q = md->queue; struct request_queue *q = md->queue;
sector_t size; sector_t size;
unsigned long flags; unsigned long flags;
...@@ -1938,11 +2056,6 @@ static int __bind(struct mapped_device *md, struct dm_table *t, ...@@ -1938,11 +2056,6 @@ static int __bind(struct mapped_device *md, struct dm_table *t,
__set_size(md, size); __set_size(md, size);
if (!size) {
dm_table_destroy(t);
return 0;
}
dm_table_event_callback(t, event_callback, md); dm_table_event_callback(t, event_callback, md);
/* /*
...@@ -1958,26 +2071,31 @@ static int __bind(struct mapped_device *md, struct dm_table *t, ...@@ -1958,26 +2071,31 @@ static int __bind(struct mapped_device *md, struct dm_table *t,
__bind_mempools(md, t); __bind_mempools(md, t);
write_lock_irqsave(&md->map_lock, flags); write_lock_irqsave(&md->map_lock, flags);
old_map = md->map;
md->map = t; md->map = t;
dm_table_set_restrictions(t, q, limits); dm_table_set_restrictions(t, q, limits);
write_unlock_irqrestore(&md->map_lock, flags); write_unlock_irqrestore(&md->map_lock, flags);
return 0; return old_map;
} }
static void __unbind(struct mapped_device *md) /*
* Returns unbound table for the caller to free.
*/
static struct dm_table *__unbind(struct mapped_device *md)
{ {
struct dm_table *map = md->map; struct dm_table *map = md->map;
unsigned long flags; unsigned long flags;
if (!map) if (!map)
return; return NULL;
dm_table_event_callback(map, NULL, NULL); dm_table_event_callback(map, NULL, NULL);
write_lock_irqsave(&md->map_lock, flags); write_lock_irqsave(&md->map_lock, flags);
md->map = NULL; md->map = NULL;
write_unlock_irqrestore(&md->map_lock, flags); write_unlock_irqrestore(&md->map_lock, flags);
dm_table_destroy(map);
return map;
} }
/* /*
...@@ -2059,18 +2177,18 @@ void dm_put(struct mapped_device *md) ...@@ -2059,18 +2177,18 @@ void dm_put(struct mapped_device *md)
BUG_ON(test_bit(DMF_FREEING, &md->flags)); BUG_ON(test_bit(DMF_FREEING, &md->flags));
if (atomic_dec_and_lock(&md->holders, &_minor_lock)) { if (atomic_dec_and_lock(&md->holders, &_minor_lock)) {
map = dm_get_table(md); map = dm_get_live_table(md);
idr_replace(&_minor_idr, MINOR_ALLOCED, idr_replace(&_minor_idr, MINOR_ALLOCED,
MINOR(disk_devt(dm_disk(md)))); MINOR(disk_devt(dm_disk(md))));
set_bit(DMF_FREEING, &md->flags); set_bit(DMF_FREEING, &md->flags);
spin_unlock(&_minor_lock); spin_unlock(&_minor_lock);
if (!dm_suspended(md)) { if (!dm_suspended_md(md)) {
dm_table_presuspend_targets(map); dm_table_presuspend_targets(map);
dm_table_postsuspend_targets(map); dm_table_postsuspend_targets(map);
} }
dm_sysfs_exit(md); dm_sysfs_exit(md);
dm_table_put(map); dm_table_put(map);
__unbind(md); dm_table_destroy(__unbind(md));
free_dev(md); free_dev(md);
} }
} }
...@@ -2080,8 +2198,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible) ...@@ -2080,8 +2198,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
{ {
int r = 0; int r = 0;
DECLARE_WAITQUEUE(wait, current); DECLARE_WAITQUEUE(wait, current);
struct request_queue *q = md->queue;
unsigned long flags;
dm_unplug_all(md->queue); dm_unplug_all(md->queue);
...@@ -2091,15 +2207,7 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible) ...@@ -2091,15 +2207,7 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
set_current_state(interruptible); set_current_state(interruptible);
smp_mb(); smp_mb();
if (dm_request_based(md)) { if (!md_in_flight(md))
spin_lock_irqsave(q->queue_lock, flags);
if (!queue_in_flight(q) && blk_queue_stopped(q)) {
spin_unlock_irqrestore(q->queue_lock, flags);
break;
}
spin_unlock_irqrestore(q->queue_lock, flags);
} else if (!atomic_read(&md->pending[0]) &&
!atomic_read(&md->pending[1]))
break; break;
if (interruptible == TASK_INTERRUPTIBLE && if (interruptible == TASK_INTERRUPTIBLE &&
...@@ -2194,98 +2302,106 @@ static void dm_queue_flush(struct mapped_device *md) ...@@ -2194,98 +2302,106 @@ static void dm_queue_flush(struct mapped_device *md)
queue_work(md->wq, &md->work); queue_work(md->wq, &md->work);
} }
/* static void dm_rq_set_flush_nr(struct request *clone, unsigned flush_nr)
* Swap in a new table (destroying old one).
*/
int dm_swap_table(struct mapped_device *md, struct dm_table *table)
{ {
struct queue_limits limits; struct dm_rq_target_io *tio = clone->end_io_data;
int r = -EINVAL;
mutex_lock(&md->suspend_lock); tio->info.flush_request = flush_nr;
}
/* device must be suspended */ /* Issue barrier requests to targets and wait for their completion. */
if (!dm_suspended(md)) static int dm_rq_barrier(struct mapped_device *md)
goto out; {
int i, j;
struct dm_table *map = dm_get_live_table(md);
unsigned num_targets = dm_table_get_num_targets(map);
struct dm_target *ti;
struct request *clone;
r = dm_calculate_queue_limits(table, &limits); md->barrier_error = 0;
if (r)
goto out;
/* cannot change the device type, once a table is bound */ for (i = 0; i < num_targets; i++) {
if (md->map && ti = dm_table_get_target(map, i);
(dm_table_get_type(md->map) != dm_table_get_type(table))) { for (j = 0; j < ti->num_flush_requests; j++) {
DMWARN("can't change the device type after a table is bound"); clone = clone_rq(md->flush_request, md, GFP_NOIO);
goto out; dm_rq_set_flush_nr(clone, j);
atomic_inc(&md->pending[rq_data_dir(clone)]);
map_request(ti, clone, md);
}
} }
__unbind(md); dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
r = __bind(md, table, &limits); dm_table_put(map);
out:
mutex_unlock(&md->suspend_lock);
return r;
}
static void dm_rq_invalidate_suspend_marker(struct mapped_device *md) return md->barrier_error;
{
md->suspend_rq.special = (void *)0x1;
} }
static void dm_rq_abort_suspend(struct mapped_device *md, int noflush) static void dm_rq_barrier_work(struct work_struct *work)
{ {
int error;
struct mapped_device *md = container_of(work, struct mapped_device,
barrier_work);
struct request_queue *q = md->queue; struct request_queue *q = md->queue;
struct request *rq;
unsigned long flags; unsigned long flags;
spin_lock_irqsave(q->queue_lock, flags); /*
if (!noflush) * Hold the md reference here and leave it at the last part so that
dm_rq_invalidate_suspend_marker(md); * the md can't be deleted by device opener when the barrier request
__start_queue(q); * completes.
spin_unlock_irqrestore(q->queue_lock, flags); */
} dm_get(md);
static void dm_rq_start_suspend(struct mapped_device *md, int noflush) error = dm_rq_barrier(md);
{
struct request *rq = &md->suspend_rq;
struct request_queue *q = md->queue;
if (noflush) rq = md->flush_request;
stop_queue(q); md->flush_request = NULL;
else {
blk_rq_init(q, rq); if (error == DM_ENDIO_REQUEUE) {
blk_insert_request(q, rq, 0, NULL); spin_lock_irqsave(q->queue_lock, flags);
} blk_requeue_request(q, rq);
spin_unlock_irqrestore(q->queue_lock, flags);
} else
blk_end_request_all(rq, error);
blk_run_queue(q);
dm_put(md);
} }
static int dm_rq_suspend_available(struct mapped_device *md, int noflush) /*
* Swap in a new table, returning the old one for the caller to destroy.
*/
struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
{ {
int r = 1; struct dm_table *map = ERR_PTR(-EINVAL);
struct request *rq = &md->suspend_rq; struct queue_limits limits;
struct request_queue *q = md->queue; int r;
unsigned long flags;
if (noflush) mutex_lock(&md->suspend_lock);
return r;
/* The marker must be protected by queue lock if it is in use */ /* device must be suspended */
spin_lock_irqsave(q->queue_lock, flags); if (!dm_suspended_md(md))
if (unlikely(rq->ref_count)) { goto out;
/*
* This can happen, when the previous flush suspend was r = dm_calculate_queue_limits(table, &limits);
* interrupted, the marker is still in the queue and if (r) {
* this flush suspend has been invoked, because we don't map = ERR_PTR(r);
* remove the marker at the time of suspend interruption. goto out;
* We have only one marker per mapped_device, so we can't
* start another flush suspend while it is in use.
*/
BUG_ON(!rq->special); /* The marker should be invalidated */
DMWARN("Invalidating the previous flush suspend is still in"
" progress. Please retry later.");
r = 0;
} }
spin_unlock_irqrestore(q->queue_lock, flags);
return r; /* cannot change the device type, once a table is bound */
if (md->map &&
(dm_table_get_type(md->map) != dm_table_get_type(table))) {
DMWARN("can't change the device type after a table is bound");
goto out;
}
map = __bind(md, table, &limits);
out:
mutex_unlock(&md->suspend_lock);
return map;
} }
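
With __bind()/__unbind() and dm_swap_table() now returning the previous table instead of destroying it, the caller is expected to free it. A hypothetical caller sketch (example_swap() and new_table are illustrative names, not from this commit):

static int example_swap(struct mapped_device *md, struct dm_table *new_table)
{
	struct dm_table *old_map;

	old_map = dm_swap_table(md, new_table);
	if (IS_ERR(old_map))
		return PTR_ERR(old_map);	/* bind failed; new_table is still owned by the caller */

	if (old_map)
		dm_table_destroy(old_map);	/* destroy the table that was previously live */

	return 0;
}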
/* /*
...@@ -2330,49 +2446,11 @@ static void unlock_fs(struct mapped_device *md) ...@@ -2330,49 +2446,11 @@ static void unlock_fs(struct mapped_device *md)
/* /*
* Suspend mechanism in request-based dm. * Suspend mechanism in request-based dm.
* *
* After the suspend starts, further incoming requests are kept in * 1. Flush all I/Os by lock_fs() if needed.
* the request_queue and deferred. * 2. Stop dispatching any I/O by stopping the request_queue.
* Remaining requests in the request_queue at the start of suspend are flushed * 3. Wait for all in-flight I/Os to be completed or requeued.
* if it is flush suspend.
* The suspend completes when the following conditions have been satisfied,
* so wait for it:
* 1. q->in_flight is 0 (which means no in_flight request)
* 2. queue has been stopped (which means no request dispatching)
*
* *
* Noflush suspend * To abort suspend, start the request_queue.
* ---------------
* Noflush suspend doesn't need to dispatch remaining requests.
* So stop the queue immediately. Then, wait for all in_flight requests
* to be completed or requeued.
*
* To abort noflush suspend, start the queue.
*
*
* Flush suspend
* -------------
* Flush suspend needs to dispatch remaining requests. So stop the queue
* after the remaining requests are completed. (Requeued request must be also
* re-dispatched and completed. Until then, we can't stop the queue.)
*
* During flushing the remaining requests, further incoming requests are also
* inserted to the same queue. To distinguish which requests are to be
* flushed, we insert a marker request to the queue at the time of starting
* flush suspend, like a barrier.
* The dispatching is blocked when the marker is found on the top of the queue.
* And the queue is stopped when all in_flight requests are completed, since
* that means the remaining requests are completely flushed.
* Then, the marker is removed from the queue.
*
* To abort flush suspend, we also need to take care of the marker, not only
* starting the queue.
* We don't remove the marker forcibly from the queue since it's against
* the block-layer manner. Instead, we put a invalidated mark on the marker.
* When the invalidated marker is found on the top of the queue, it is
* immediately removed from the queue, so it doesn't block dispatching.
* Because we have only one marker per mapped_device, we can't start another
* flush suspend until the invalidated marker is removed from the queue.
* So fail and return with -EBUSY in such a case.
*/ */
int dm_suspend(struct mapped_device *md, unsigned suspend_flags) int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
{ {
...@@ -2383,17 +2461,12 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags) ...@@ -2383,17 +2461,12 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
mutex_lock(&md->suspend_lock); mutex_lock(&md->suspend_lock);
if (dm_suspended(md)) { if (dm_suspended_md(md)) {
r = -EINVAL; r = -EINVAL;
goto out_unlock; goto out_unlock;
} }
if (dm_request_based(md) && !dm_rq_suspend_available(md, noflush)) { map = dm_get_live_table(md);
r = -EBUSY;
goto out_unlock;
}
map = dm_get_table(md);
/* /*
* DMF_NOFLUSH_SUSPENDING must be set before presuspend. * DMF_NOFLUSH_SUSPENDING must be set before presuspend.
...@@ -2406,8 +2479,10 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags) ...@@ -2406,8 +2479,10 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
dm_table_presuspend_targets(map); dm_table_presuspend_targets(map);
/* /*
* Flush I/O to the device. noflush supersedes do_lockfs, * Flush I/O to the device.
* because lock_fs() needs to flush I/Os. * Any I/O submitted after lock_fs() may not be flushed.
* noflush takes precedence over do_lockfs.
* (lock_fs() flushes I/Os and waits for them to complete.)
*/ */
if (!noflush && do_lockfs) { if (!noflush && do_lockfs) {
r = lock_fs(md); r = lock_fs(md);
...@@ -2436,10 +2511,15 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags) ...@@ -2436,10 +2511,15 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags); set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
up_write(&md->io_lock); up_write(&md->io_lock);
flush_workqueue(md->wq); /*
* Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
* can be kicked until md->queue is stopped. So stop md->queue before
* flushing md->wq.
*/
if (dm_request_based(md)) if (dm_request_based(md))
dm_rq_start_suspend(md, noflush); stop_queue(md->queue);
flush_workqueue(md->wq);
/* /*
* At this point no more requests are entering target request routines. * At this point no more requests are entering target request routines.
...@@ -2458,7 +2538,7 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags) ...@@ -2458,7 +2538,7 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
dm_queue_flush(md); dm_queue_flush(md);
if (dm_request_based(md)) if (dm_request_based(md))
dm_rq_abort_suspend(md, noflush); start_queue(md->queue);
unlock_fs(md); unlock_fs(md);
goto out; /* pushback list is already flushed, so skip flush */ goto out; /* pushback list is already flushed, so skip flush */
...@@ -2470,10 +2550,10 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags) ...@@ -2470,10 +2550,10 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
* requests are being added to md->deferred list. * requests are being added to md->deferred list.
*/ */
dm_table_postsuspend_targets(map);
set_bit(DMF_SUSPENDED, &md->flags); set_bit(DMF_SUSPENDED, &md->flags);
dm_table_postsuspend_targets(map);
out: out:
dm_table_put(map); dm_table_put(map);
...@@ -2488,10 +2568,10 @@ int dm_resume(struct mapped_device *md) ...@@ -2488,10 +2568,10 @@ int dm_resume(struct mapped_device *md)
struct dm_table *map = NULL; struct dm_table *map = NULL;
mutex_lock(&md->suspend_lock); mutex_lock(&md->suspend_lock);
if (!dm_suspended(md)) if (!dm_suspended_md(md))
goto out; goto out;
map = dm_get_table(md); map = dm_get_live_table(md);
if (!map || !dm_table_get_size(map)) if (!map || !dm_table_get_size(map))
goto out; goto out;
...@@ -2592,18 +2672,29 @@ struct mapped_device *dm_get_from_kobject(struct kobject *kobj) ...@@ -2592,18 +2672,29 @@ struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
return NULL; return NULL;
if (test_bit(DMF_FREEING, &md->flags) || if (test_bit(DMF_FREEING, &md->flags) ||
test_bit(DMF_DELETING, &md->flags)) dm_deleting_md(md))
return NULL; return NULL;
dm_get(md); dm_get(md);
return md; return md;
} }
int dm_suspended(struct mapped_device *md) int dm_suspended_md(struct mapped_device *md)
{ {
return test_bit(DMF_SUSPENDED, &md->flags); return test_bit(DMF_SUSPENDED, &md->flags);
} }
int dm_suspended(struct dm_target *ti)
{
struct mapped_device *md = dm_table_get_md(ti->table);
int r = dm_suspended_md(md);
dm_put(md);
return r;
}
EXPORT_SYMBOL_GPL(dm_suspended);
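
dm_suspended() now takes a dm_target rather than a mapped_device, so a target can query its own device directly. A hypothetical sketch of a target message handler using it (example_message() and the -EBUSY policy are assumptions, not from this commit):

static int example_message(struct dm_target *ti, unsigned argc, char **argv)
{
	if (dm_suspended(ti))		/* new target-facing prototype */
		return -EBUSY;		/* assumed policy: refuse messages while suspended */

	/* ... handle the message ... */
	return 0;
}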
int dm_noflush_suspending(struct dm_target *ti) int dm_noflush_suspending(struct dm_target *ti)
{ {
struct mapped_device *md = dm_table_get_md(ti->table); struct mapped_device *md = dm_table_get_md(ti->table);
......
...@@ -88,6 +88,16 @@ int dm_target_iterate(void (*iter_func)(struct target_type *tt, ...@@ -88,6 +88,16 @@ int dm_target_iterate(void (*iter_func)(struct target_type *tt,
int dm_split_args(int *argc, char ***argvp, char *input); int dm_split_args(int *argc, char ***argvp, char *input);
/*
* Is this mapped_device being deleted?
*/
int dm_deleting_md(struct mapped_device *md);
/*
* Is this mapped_device suspended?
*/
int dm_suspended_md(struct mapped_device *md);
/* /*
* The device-mapper can be driven through one of two interfaces; * The device-mapper can be driven through one of two interfaces;
* ioctl or filesystem, depending which patch you have applied. * ioctl or filesystem, depending which patch you have applied.
...@@ -118,6 +128,9 @@ int dm_lock_for_deletion(struct mapped_device *md); ...@@ -118,6 +128,9 @@ int dm_lock_for_deletion(struct mapped_device *md);
void dm_kobject_uevent(struct mapped_device *md, enum kobject_action action, void dm_kobject_uevent(struct mapped_device *md, enum kobject_action action,
unsigned cookie); unsigned cookie);
int dm_io_init(void);
void dm_io_exit(void);
int dm_kcopyd_init(void); int dm_kcopyd_init(void);
void dm_kcopyd_exit(void); void dm_kcopyd_exit(void);
......
...@@ -235,7 +235,7 @@ void dm_uevent_add(struct mapped_device *md, struct list_head *elist); ...@@ -235,7 +235,7 @@ void dm_uevent_add(struct mapped_device *md, struct list_head *elist);
const char *dm_device_name(struct mapped_device *md); const char *dm_device_name(struct mapped_device *md);
int dm_copy_name_and_uuid(struct mapped_device *md, char *name, char *uuid); int dm_copy_name_and_uuid(struct mapped_device *md, char *name, char *uuid);
struct gendisk *dm_disk(struct mapped_device *md); struct gendisk *dm_disk(struct mapped_device *md);
int dm_suspended(struct mapped_device *md); int dm_suspended(struct dm_target *ti);
int dm_noflush_suspending(struct dm_target *ti); int dm_noflush_suspending(struct dm_target *ti);
union map_info *dm_get_mapinfo(struct bio *bio); union map_info *dm_get_mapinfo(struct bio *bio);
union map_info *dm_get_rq_mapinfo(struct request *rq); union map_info *dm_get_rq_mapinfo(struct request *rq);
...@@ -276,7 +276,7 @@ void dm_table_unplug_all(struct dm_table *t); ...@@ -276,7 +276,7 @@ void dm_table_unplug_all(struct dm_table *t);
/* /*
* Table reference counting. * Table reference counting.
*/ */
struct dm_table *dm_get_table(struct mapped_device *md); struct dm_table *dm_get_live_table(struct mapped_device *md);
void dm_table_get(struct dm_table *t); void dm_table_get(struct dm_table *t);
void dm_table_put(struct dm_table *t); void dm_table_put(struct dm_table *t);
...@@ -295,8 +295,10 @@ void dm_table_event(struct dm_table *t); ...@@ -295,8 +295,10 @@ void dm_table_event(struct dm_table *t);
/* /*
* The device must be suspended before calling this method. * The device must be suspended before calling this method.
* Returns the previous table, which the caller must destroy.
*/ */
int dm_swap_table(struct mapped_device *md, struct dm_table *t); struct dm_table *dm_swap_table(struct mapped_device *md,
struct dm_table *t);
/* /*
* A wrapper around vmalloc. * A wrapper around vmalloc.
......
...@@ -21,6 +21,7 @@ struct dm_dirty_log_type; ...@@ -21,6 +21,7 @@ struct dm_dirty_log_type;
struct dm_dirty_log { struct dm_dirty_log {
struct dm_dirty_log_type *type; struct dm_dirty_log_type *type;
int (*flush_callback_fn)(struct dm_target *ti);
void *context; void *context;
}; };
...@@ -136,8 +137,9 @@ int dm_dirty_log_type_unregister(struct dm_dirty_log_type *type); ...@@ -136,8 +137,9 @@ int dm_dirty_log_type_unregister(struct dm_dirty_log_type *type);
* type->constructor/destructor() directly. * type->constructor/destructor() directly.
*/ */
struct dm_dirty_log *dm_dirty_log_create(const char *type_name, struct dm_dirty_log *dm_dirty_log_create(const char *type_name,
struct dm_target *ti, struct dm_target *ti,
unsigned argc, char **argv); int (*flush_callback_fn)(struct dm_target *ti),
unsigned argc, char **argv);
void dm_dirty_log_destroy(struct dm_dirty_log *log); void dm_dirty_log_destroy(struct dm_dirty_log *log);
#endif /* __KERNEL__ */ #endif /* __KERNEL__ */
......
/* /*
* Copyright (C) 2001 - 2003 Sistina Software (UK) Limited. * Copyright (C) 2001 - 2003 Sistina Software (UK) Limited.
* Copyright (C) 2004 - 2005 Red Hat, Inc. All rights reserved. * Copyright (C) 2004 - 2009 Red Hat, Inc. All rights reserved.
* *
* This file is released under the LGPL. * This file is released under the LGPL.
*/ */
...@@ -266,9 +266,9 @@ enum { ...@@ -266,9 +266,9 @@ enum {
#define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl) #define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl)
#define DM_VERSION_MAJOR 4 #define DM_VERSION_MAJOR 4
#define DM_VERSION_MINOR 15 #define DM_VERSION_MINOR 16
#define DM_VERSION_PATCHLEVEL 0 #define DM_VERSION_PATCHLEVEL 0
#define DM_VERSION_EXTRA "-ioctl (2009-04-01)" #define DM_VERSION_EXTRA "-ioctl (2009-11-05)"
/* Status bits */ /* Status bits */
#define DM_READONLY_FLAG (1 << 0) /* In/Out */ #define DM_READONLY_FLAG (1 << 0) /* In/Out */
...@@ -309,4 +309,11 @@ enum { ...@@ -309,4 +309,11 @@ enum {
*/ */
#define DM_NOFLUSH_FLAG (1 << 11) /* In */ #define DM_NOFLUSH_FLAG (1 << 11) /* In */
/*
* If set, any table information returned will relate to the inactive
* table instead of the live one. Always check DM_INACTIVE_PRESENT_FLAG
* is set before using the data returned.
*/
#define DM_QUERY_INACTIVE_TABLE_FLAG (1 << 12) /* In */
#endif /* _LINUX_DM_IOCTL_H */ #endif /* _LINUX_DM_IOCTL_H */
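
A sketch of how userspace might use the new flag, not part of this commit: query_inactive_table() is an illustrative name, fd is an open /dev/mapper/control, and io is a struct dm_ioctl already prepared with version, data_size and the device name.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/dm-ioctl.h>

static int query_inactive_table(int fd, struct dm_ioctl *io)
{
	io->flags |= DM_QUERY_INACTIVE_TABLE_FLAG;	/* In: ask about the inactive table */

	if (ioctl(fd, DM_TABLE_STATUS, io) < 0)
		return -errno;

	if (!(io->flags & DM_INACTIVE_PRESENT_FLAG))
		return -ENOENT;		/* no inactive table is loaded; ignore the data */

	/* ... the returned payload now describes the inactive table ... */
	return 0;
}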
...@@ -78,8 +78,7 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region); ...@@ -78,8 +78,7 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region);
/* Delay bios on regions. */ /* Delay bios on regions. */
void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio); void dm_rh_delay(struct dm_region_hash *rh, struct bio *bio);
void dm_rh_mark_nosync(struct dm_region_hash *rh, void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio);
struct bio *bio, unsigned done, int error);
/* /*
* Region recovery control. * Region recovery control.
......