Commit 311f7128 authored by Linus Torvalds


Merge tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Improve DM snapshot target's scalability by using finer grained
   locking. Requires some list_bl interface improvements.

 - Add ability for DM integrity to use a bitmap mode that tracks
   regions where data and metadata are out of sync, instead of using a
   journal.

 - Improve DM thin provisioning target to not write metadata changes to
   disk if the thin-pool and associated thin devices are merely
   activated but not used. This avoids metadata corruption due to
   concurrent activation of thin devices across different OS instances
   (e.g. split brain scenarios, which ultimately would be avoided if
   proper device filters were used -- but not having proper filtering
   has proven a very common configuration mistake)

 - Fix missing call to path selector type->end_io in DM multipath. This
   fixes reported performance problems due to inaccurate path selector
   IO accounting causing an imbalance of IO (e.g. avoiding issuing IO
   to a particular path because it seemed to be heavily used).

 - Fix bug in DM cache metadata's loading of its discard bitset that
   could lead to all cache blocks being discarded if the very first
   cache block was discarded (thankfully in practice the first cache
   block is generally in use; be it FS superblock, partition table, disk
   label, etc).

 - Add testing-only DM dust target which simulates a device that has
   failing sectors and/or read failures.

 - Fix a DM init error path reference count hang that caused boot hangs
   if the user supplied malformed input on the kernel command line.

 - Fix a couple issues with DM crypt target's logging being overly
   verbose or lacking context.

 - Various other small fixes to DM init, DM multipath, DM zoned, and DM
   crypt.

* tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (42 commits)
  dm: fix a couple brace coding style issues
  dm crypt: print device name in integrity error message
  dm crypt: move detailed message into debug level
  dm ioctl: fix hang in early create error condition
  dm integrity: whitespace, coding style and dead code cleanup
  dm integrity: implement synchronous mode for reboot handling
  dm integrity: handle machine reboot in bitmap mode
  dm integrity: add a bitmap mode
  dm integrity: introduce a function add_new_range_and_wait()
  dm integrity: allow large ranges to be described
  dm integrity: pass size to dm_integrity_alloc_page_list()
  dm integrity: introduce rw_journal_sectors()
  dm integrity: update documentation
  dm integrity: don't report unused options
  dm integrity: don't check null pointer before kvfree and vfree
  dm integrity: correctly calculate the size of metadata area
  dm dust: Make dm_dust_init and dm_dust_exit static
  dm dust: remove redundant unsigned comparison to less than zero
  dm mpath: always free attached_handler_name in parse_path()
  dm init: fix max devices/targets checks
  ...
parents 7878c231 8454fca4
dm-dust
=======
This target emulates the behavior of bad sectors at arbitrary
locations, and allows the emulation of those failures to be enabled
at an arbitrary time.

This target behaves similarly to a linear target. At a given time,
the user can send a message to the target to start failing read
requests on specific blocks (to emulate the behavior of a hard disk
drive with bad sectors).

When the failure behavior is enabled (i.e.: when the output of
"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
in the "bad block list" will fail with EIO ("Input/output error").

Writes of blocks in the "bad block list" will result in the following:

1. Remove the block from the "bad block list".
2. Successfully complete the write.

This emulates the "remapped sector" behavior of a drive with bad
sectors.

Normally, a drive that is encountering bad sectors will most likely
encounter more bad sectors, at an unknown time or location.
With dm-dust, the user can use the "addbadblock" and "removebadblock"
messages to add arbitrary bad blocks at new locations, and the
"enable" and "disable" messages to modulate the state of whether the
configured "bad blocks" will be treated as bad, or bypassed.
This allows the pre-writing of test data and metadata prior to
simulating a "failure" event where bad sectors start to appear.
Table parameters:
-----------------
<device_path> <offset> <blksz>
Mandatory parameters:
<device_path>: path to the block device.
<offset>: offset to data area from start of device_path
<blksz>: block size in bytes
(minimum 512, maximum 1073741824, must be a power of 2)
Usage instructions:
-------------------
First, find the size (in 512-byte sectors) of the device to be used:
$ sudo blockdev --getsz /dev/vdb1
33552384
Create the dm-dust device:
(For a device with a block size of 512 bytes)
$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 512'
(For a device with a block size of 4096 bytes)
$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096'
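(If no spare block device is available for testing, a loopback device
can stand in; the backing file and loop device paths below are only
examples:)
$ sudo dd if=/dev/zero of=/tmp/dust-backing.img bs=1M count=512
$ sudo losetup /dev/loop0 /tmp/dust-backing.img
$ sudo dmsetup create dust1 --table "0 $(sudo blockdev --getsz /dev/loop0) dust /dev/loop0 0 512"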
Check the status of the read behavior ("bypass" indicates that all I/O
will be passed through to the underlying device):
$ sudo dmsetup status dust1
0 33552384 dust 252:17 bypass
$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
128+0 records in
128+0 records out
$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
128+0 records in
128+0 records out
Adding and removing bad blocks:
-------------------------------
At any time (i.e.: whether the device has the "bad block" emulation
enabled or disabled), bad blocks may be added or removed from the
device via the "addbadblock" and "removebadblock" messages:
$ sudo dmsetup message dust1 0 addbadblock 60
kernel: device-mapper: dust: badblock added at block 60
$ sudo dmsetup message dust1 0 addbadblock 67
kernel: device-mapper: dust: badblock added at block 67
$ sudo dmsetup message dust1 0 addbadblock 72
kernel: device-mapper: dust: badblock added at block 72
These bad blocks will be stored in the "bad block list".
While the device is in "bypass" mode, reads and writes will succeed:
$ sudo dmsetup status dust1
0 33552384 dust 252:17 bypass
Enabling block read failures:
-----------------------------
To enable the "fail read on bad block" behavior, send the "enable" message:
$ sudo dmsetup message dust1 0 enable
kernel: device-mapper: dust: enabling read failures on bad sectors
$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block
With the device in "fail read on bad block" mode, attempting to read a
block in the "bad block list" will result in an "Input/output error":
$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
dd: error reading '/dev/mapper/dust1': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.00040651 s, 0.0 kB/s
...and writing to the bad blocks will remove the blocks from the list,
therefore emulating the "remap" behavior of hard disk drives:
$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
128+0 records in
128+0 records out
kernel: device-mapper: dust: block 60 removed from badblocklist by write
kernel: device-mapper: dust: block 67 removed from badblocklist by write
kernel: device-mapper: dust: block 72 removed from badblocklist by write
Bad block add/remove error handling:
------------------------------------
Attempting to add a bad block that already exists in the list will
result in an "Invalid argument" error, as well as a helpful message:
$ sudo dmsetup message dust1 0 addbadblock 88
device-mapper: message ioctl on dust1 failed: Invalid argument
kernel: device-mapper: dust: block 88 already in badblocklist
Attempting to remove a bad block that doesn't exist in the list will
result in an "Invalid argument" error, as well as a helpful message:
$ sudo dmsetup message dust1 0 removebadblock 87
device-mapper: message ioctl on dust1 failed: Invalid argument
kernel: device-mapper: dust: block 87 not found in badblocklist
Counting the number of bad blocks in the bad block list:
--------------------------------------------------------
To count the number of bad blocks configured in the device, run the
following message command:
$ sudo dmsetup message dust1 0 countbadblocks
A message will print with the number of bad blocks currently
configured on the device:
kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
Querying for specific bad blocks:
---------------------------------
To find out if a specific block is in the bad block list, run the
following message command:
$ sudo dmsetup message dust1 0 queryblock 72
The following message will print if the block is in the list:
device-mapper: dust: queryblock: block 72 found in badblocklist
The following message will print if the block is not in the list:
device-mapper: dust: queryblock: block 72 not found in badblocklist
The "queryblock" message command will work in both the "enabled"
and "disabled" modes, allowing the verification of whether a block
will be treated as "bad" without having to issue I/O to the device,
or having to "enable" the bad block emulation.
Clearing the bad block list:
----------------------------
To clear the bad block list (without needing to individually run
a "removebadblock" message command for every block), run the
following message command:
$ sudo dmsetup message dust1 0 clearbadblocks
After clearing the bad block list, the following message will appear:
kernel: device-mapper: dust: clearbadblocks: badblocks cleared
If there were no bad blocks to clear, the following message will
appear:
kernel: device-mapper: dust: clearbadblocks: no badblocks found
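To verify that the list is now empty, the "countbadblocks" message
can be run again; it should then report zero blocks (this output is
inferred from the counting example above):
$ sudo dmsetup message dust1 0 countbadblocks
kernel: device-mapper: dust: countbadblocks: 0 badblock(s) found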
Message commands list:
----------------------
Below is a list of the messages that can be sent to a dust device:
Operations on blocks (requires a <blknum> argument):
addbadblock <blknum>
queryblock <blknum>
removebadblock <blknum>
...where <blknum> is a block number within the range of the device
(corresponding to the block size of the device).
Single argument message commands:
countbadblocks
clearbadblocks
disable
enable
quiet
Device removal:
---------------
When finished, remove the device via the "dmsetup remove" command:
$ sudo dmsetup remove dust1
Quiet mode:
-----------
On test runs with many bad blocks, it may be desirable to avoid
excessive logging (from bad blocks added, removed, or "remapped").
This can be done by enabling "quiet mode" via the following message:
$ sudo dmsetup message dust1 0 quiet
This will suppress log messages from the add, remove, and
removed-by-write operations. Log messages from the "countbadblocks"
and "queryblock" message commands will still print in quiet mode.
The status of quiet mode can be seen by running "dmsetup status":
$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block quiet
To disable quiet mode, send the "quiet" message again:
$ sudo dmsetup message dust1 0 quiet
$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block verbose
(The presence of "verbose" indicates normal logging.)
"Why not...?"
-------------
scsi_debug has a "medium error" mode that can fail reads on one
specified sector (sector 0x1234, hardcoded in the source code), but
it uses RAM for the persistent storage, which drastically decreases
the potential device size.
dm-flakey fails all I/O from all block locations on a cyclic time
interval, rather than at a given point in time.
When a bad sector occurs on a hard disk drive, reads to that sector
are failed by the device, usually resulting in an error code of EIO
("I/O error") or ENODATA ("No data available"). However, a write to
the sector may succeed, and result in the sector becoming readable
after the device controller no longer experiences errors reading the
sector (or after a reallocation of the sector). Even so, more bad
sectors may occur on the device in the future, in a different,
unpredictable location.
This target seeks to provide a device that can exhibit the behavior
of a bad sector at a known sector location, at a known time, based
on a large storage device (at least tens of gigabytes, not occupying
system memory).
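As a worked example (a sketch only, not part of the original document;
the device path and block numbers are placeholders), the manual steps
above can be combined into a single test script:

#!/bin/sh
# Sketch: exercise dm-dust end to end. Run as root; /dev/vdb1 is a
# placeholder backing device.
DEV=/dev/vdb1
SECTORS=$(blockdev --getsz $DEV)
dmsetup create dust1 --table "0 $SECTORS dust $DEV 0 512"
# Pre-write test data while the device is still in "bypass" mode.
dd if=/dev/urandom of=/dev/mapper/dust1 bs=512 count=1024 oflag=direct
# Configure bad blocks, then start failing reads on them.
dmsetup message dust1 0 addbadblock 60
dmsetup message dust1 0 addbadblock 67
dmsetup message dust1 0 enable
# This read should now fail with EIO:
dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=60 iflag=direct
# A write "remaps" block 60, after which the same read should succeed:
dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=1 seek=60 oflag=direct
dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=60 iflag=direct
dmsetup remove dust1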
......@@ -21,6 +21,13 @@ mode it calculates and verifies the integrity tag internally. In this
mode, the dm-integrity target can be used to detect silent data
corruption on the disk or in the I/O path.
There's an alternate mode of operation where dm-integrity uses a bitmap
instead of a journal. If a bit in the bitmap is 1, the corresponding
region's data and integrity tags are not synchronized - if the machine
crashes, the unsynchronized regions will be recalculated. The bitmap mode
is faster than the journal mode, because we don't have to write the data
twice, but it is also less reliable, because if data corruption happens
when the machine crashes, it may not be detected.
When loading the target for the first time, the kernel driver will format
the device. But it will only format the device if the superblock contains
......@@ -59,6 +66,10 @@ Target arguments:
either both data and tag or none of them are written. The
journaled mode degrades write throughput twice because the
data have to be written twice.
B - bitmap mode - data and metadata are written without any
    synchronization; the driver maintains a bitmap of dirty
regions where data and metadata don't match. This mode can
only be used with internal hash.
R - recovery mode - in this mode, journal is not replayed,
checksums are not checked and writes to the device are not
allowed. This mode is useful for data recovery if the
......@@ -79,6 +90,10 @@ interleave_sectors:number
a power of two. If the device is already formatted, the value from
the superblock is used.
meta_device:device
Don't interleave the data and metadata on one device. Use a
separate device for metadata.
buffer_sectors:number
The number of sectors in one buffer. The value is rounded down to
a power of two.
......@@ -146,6 +161,15 @@ block_size:number
Supported values are 512, 1024, 2048 and 4096 bytes. If not
specified the default block size is 512 bytes.
sectors_per_bit:number
In the bitmap mode, this parameter specifies the number of
512-byte sectors that corresponds to one bitmap bit.
bitmap_flush_interval:number
The bitmap flush interval in milliseconds. The metadata buffers
are synchronized when this interval expires.
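As an illustration (a hypothetical example, not from the original
document; the device path and sizes are placeholders, and the target
length must not exceed the "provided data sectors" the device
reports), a bitmap-mode target with an internal crc32c hash (4-byte
tags), 128 sectors per bitmap bit, and a 10-second flush interval
could be created with:
$ sudo dmsetup create integ1 --table '0 2097152 integrity /dev/sdb 0 4 B 3 internal_hash:crc32c sectors_per_bit:128 bitmap_flush_interval:10000'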
The journal mode (D/J), buffer_sectors, journal_watermark, and
commit_time can be changed when reloading the target (load an inactive
table and swap the tables with suspend and resume). The other arguments
should not be changed
......@@ -167,7 +191,13 @@ The layout of the formatted block device:
provides (i.e. the size of the device minus the size of all
metadata and padding). The user of this target should not send
bios that access data beyond the "provided data sectors" limit.
* flags - a flag is set if journal_mac is used
* flags
SB_FLAG_HAVE_JOURNAL_MAC - a flag is set if journal_mac is used
SB_FLAG_RECALCULATING - recalculating is in progress
SB_FLAG_DIRTY_BITMAP - journal area contains the bitmap of dirty
blocks
* log2(sectors per block)
* a position where recalculating finished
* journal
The journal is divided into sections, each section contains:
* metadata area (4kiB), it contains journal entries
......
......@@ -436,6 +436,15 @@ config DM_DELAY
If unsure, say N.
config DM_DUST
tristate "Bad sector simulation target"
depends on BLK_DEV_DM
---help---
A target that simulates bad sector behavior.
Useful for testing.
If unsure, say N.
config DM_INIT
bool "DM \"dm-mod.create=\" parameter support"
depends on BLK_DEV_DM=y
......
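(Illustrative note, not part of the Kconfig text: when DM_DUST is
built as a module, i.e. CONFIG_DM_DUST=m, the target can be loaded
before creating dust devices:)
$ sudo modprobe dm-dust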
......@@ -48,6 +48,7 @@ obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
obj-$(CONFIG_DM_DUST) += dm-dust.o
obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
......
......@@ -1167,11 +1167,18 @@ static int __load_discards(struct dm_cache_metadata *cmd,
if (r)
return r;
for (b = 0; b < from_dblock(cmd->discard_nr_blocks); b++) {
for (b = 0; ; b++) {
r = fn(context, cmd->discard_block_size, to_dblock(b),
dm_bitset_cursor_get_value(&c));
if (r)
break;
if (b >= (from_dblock(cmd->discard_nr_blocks) - 1))
break;
r = dm_bitset_cursor_next(&c);
if (r)
break;
}
dm_bitset_cursor_end(&c);
......
......@@ -946,6 +946,7 @@ static int crypt_integrity_ctr(struct crypt_config *cc, struct dm_target *ti)
{
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity *bi = blk_get_integrity(cc->dev->bdev->bd_disk);
struct mapped_device *md = dm_table_get_md(ti->table);
/* From now we require underlying device with our integrity profile */
if (!bi || strcasecmp(bi->profile->name, "DM-DIF-EXT-TAG")) {
......@@ -965,7 +966,7 @@ static int crypt_integrity_ctr(struct crypt_config *cc, struct dm_target *ti)
if (crypt_integrity_aead(cc)) {
cc->integrity_tag_size = cc->on_disk_tag_size - cc->integrity_iv_size;
DMINFO("Integrity AEAD, tag size %u, IV size %u.",
DMDEBUG("%s: Integrity AEAD, tag size %u, IV size %u.", dm_device_name(md),
cc->integrity_tag_size, cc->integrity_iv_size);
if (crypto_aead_setauthsize(any_tfm_aead(cc), cc->integrity_tag_size)) {
......@@ -973,7 +974,7 @@ static int crypt_integrity_ctr(struct crypt_config *cc, struct dm_target *ti)
return -EINVAL;
}
} else if (cc->integrity_iv_size)
DMINFO("Additional per-sector space %u bytes for IV.",
DMDEBUG("%s: Additional per-sector space %u bytes for IV.", dm_device_name(md),
cc->integrity_iv_size);
if ((cc->integrity_tag_size + cc->integrity_iv_size) != bi->tag_size) {
......@@ -1031,11 +1032,11 @@ static u8 *org_iv_of_dmreq(struct crypt_config *cc,
return iv_of_dmreq(cc, dmreq) + cc->iv_size;
}
static uint64_t *org_sector_of_dmreq(struct crypt_config *cc,
static __le64 *org_sector_of_dmreq(struct crypt_config *cc,
struct dm_crypt_request *dmreq)
{
u8 *ptr = iv_of_dmreq(cc, dmreq) + cc->iv_size + cc->iv_size;
return (uint64_t*) ptr;
return (__le64 *) ptr;
}
static unsigned int *org_tag_of_dmreq(struct crypt_config *cc,
......@@ -1071,7 +1072,7 @@ static int crypt_convert_block_aead(struct crypt_config *cc,
struct bio_vec bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
struct dm_crypt_request *dmreq;
u8 *iv, *org_iv, *tag_iv, *tag;
uint64_t *sector;
__le64 *sector;
int r = 0;
BUG_ON(cc->integrity_iv_size && cc->integrity_iv_size != cc->iv_size);
......@@ -1143,9 +1144,11 @@ static int crypt_convert_block_aead(struct crypt_config *cc,
r = crypto_aead_decrypt(req);
}
if (r == -EBADMSG)
DMERR_LIMIT("INTEGRITY AEAD ERROR, sector %llu",
if (r == -EBADMSG) {
char b[BDEVNAME_SIZE];
DMERR_LIMIT("%s: INTEGRITY AEAD ERROR, sector %llu", bio_devname(ctx->bio_in, b),
(unsigned long long)le64_to_cpu(*sector));
}
if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
r = cc->iv_gen_ops->post(cc, org_iv, dmreq);
......@@ -1166,7 +1169,7 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc,
struct scatterlist *sg_in, *sg_out;
struct dm_crypt_request *dmreq;
u8 *iv, *org_iv, *tag_iv;
uint64_t *sector;
__le64 *sector;
int r = 0;
/* Reject unexpected unaligned bio. */
......@@ -1788,7 +1791,8 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
error = cc->iv_gen_ops->post(cc, org_iv_of_dmreq(cc, dmreq), dmreq);
if (error == -EBADMSG) {
DMERR_LIMIT("INTEGRITY AEAD ERROR, sector %llu",
char b[BDEVNAME_SIZE];
DMERR_LIMIT("%s: INTEGRITY AEAD ERROR, sector %llu", bio_devname(ctx->bio_in, b),
(unsigned long long)le64_to_cpu(*org_sector_of_dmreq(cc, dmreq)));
io->error = BLK_STS_PROTECTION;
} else if (error < 0)
......@@ -1887,7 +1891,7 @@ static int crypt_alloc_tfms_skcipher(struct crypt_config *cc, char *ciphermode)
* algorithm implementation is used. Help people debug performance
* problems by logging the ->cra_driver_name.
*/
DMINFO("%s using implementation \"%s\"", ciphermode,
DMDEBUG_LIMIT("%s using implementation \"%s\"", ciphermode,
crypto_skcipher_alg(any_tfm(cc))->base.cra_driver_name);
return 0;
}
......@@ -1907,7 +1911,7 @@ static int crypt_alloc_tfms_aead(struct crypt_config *cc, char *ciphermode)
return err;
}
DMINFO("%s using implementation \"%s\"", ciphermode,
DMDEBUG_LIMIT("%s using implementation \"%s\"", ciphermode,
crypto_aead_alg(any_tfm_aead(cc))->base.cra_driver_name);
return 0;
}
......
......@@ -121,7 +121,8 @@ static void delay_dtr(struct dm_target *ti)
{
struct delay_c *dc = ti->private;
destroy_workqueue(dc->kdelayd_wq);
if (dc->kdelayd_wq)
destroy_workqueue(dc->kdelayd_wq);
if (dc->read.dev)
dm_put_device(ti, dc->read.dev);
......
......@@ -11,6 +11,7 @@
#define _LINUX_DM_EXCEPTION_STORE
#include <linux/blkdev.h>
#include <linux/list_bl.h>
#include <linux/device-mapper.h>
/*
......@@ -27,7 +28,7 @@ typedef sector_t chunk_t;
* chunk within the device.
*/
struct dm_exception {
struct list_head hash_list;
struct hlist_bl_node hash_list;
chunk_t old_chunk;
chunk_t new_chunk;
......
......@@ -160,7 +160,7 @@ static int __init dm_parse_table(struct dm_device *dev, char *str)
while (table_entry) {
DMDEBUG("parsing table \"%s\"", str);
if (++dev->dmi.target_count >= DM_MAX_TARGETS) {
if (++dev->dmi.target_count > DM_MAX_TARGETS) {
DMERR("too many targets %u > %d",
dev->dmi.target_count, DM_MAX_TARGETS);
return -EINVAL;
......@@ -242,9 +242,9 @@ static int __init dm_parse_devices(struct list_head *devices, char *str)
return -ENOMEM;
list_add_tail(&dev->list, devices);
if (++ndev >= DM_MAX_DEVICES) {
DMERR("too many targets %u > %d",
dev->dmi.target_count, DM_MAX_TARGETS);
if (++ndev > DM_MAX_DEVICES) {
DMERR("too many devices %lu > %d",
ndev, DM_MAX_DEVICES);
return -EINVAL;
}
......
......@@ -2069,7 +2069,7 @@ int __init dm_early_create(struct dm_ioctl *dmi,
/* alloc table */
r = dm_table_create(&t, get_mode(dmi), dmi->target_count, md);
if (r)
goto err_destroy_dm;
goto err_hash_remove;
/* add targets */
for (i = 0; i < dmi->target_count; i++) {
......@@ -2116,6 +2116,10 @@ int __init dm_early_create(struct dm_ioctl *dmi,
err_destroy_table:
dm_table_destroy(t);
err_hash_remove:
(void) __hash_remove(__get_name_cell(dmi->name));
/* release reference from __get_name_cell */
dm_put(md);
err_destroy_dm:
dm_put(md);
dm_destroy(md);
......
......@@ -544,8 +544,23 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
return DM_MAPIO_REMAPPED;
}
static void multipath_release_clone(struct request *clone)
static void multipath_release_clone(struct request *clone,
union map_info *map_context)
{
if (unlikely(map_context)) {
/*
* non-NULL map_context means caller is still map
* method; must undo multipath_clone_and_map()
*/
struct dm_mpath_io *mpio = get_mpio(map_context);
struct pgpath *pgpath = mpio->pgpath;
if (pgpath && pgpath->pg->ps.type->end_io)
pgpath->pg->ps.type->end_io(&pgpath->pg->ps,
&pgpath->path,
mpio->nr_bytes);
}
blk_put_request(clone);
}
......@@ -882,6 +897,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
if (attached_handler_name || m->hw_handler_name) {
INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
r = setup_scsi_dh(p->path.dev->bdev, m, &attached_handler_name, &ti->error);
kfree(attached_handler_name);
if (r) {
dm_put_device(ti, p->path.dev);
goto bad;
......@@ -896,7 +912,6 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
return p;
bad:
kfree(attached_handler_name);
free_pgpath(p);
return ERR_PTR(r);
}
......
......@@ -168,7 +168,7 @@ static void dm_end_request(struct request *clone, blk_status_t error)
struct request *rq = tio->orig;
blk_rq_unprep_clone(clone);
tio->ti->type->release_clone_rq(clone);
tio->ti->type->release_clone_rq(clone, NULL);
rq_end_stats(md, rq);
blk_mq_end_request(rq, error);
......@@ -201,7 +201,7 @@ static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_
rq_end_stats(md, rq);
if (tio->clone) {
blk_rq_unprep_clone(tio->clone);
tio->ti->type->release_clone_rq(tio->clone);
tio->ti->type->release_clone_rq(tio->clone, NULL);
}
dm_mq_delay_requeue_request(rq, delay_ms);
......@@ -398,7 +398,7 @@ static int map_request(struct dm_rq_target_io *tio)
case DM_MAPIO_REMAPPED:
if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
/* -ENOMEM */
ti->type->release_clone_rq(clone);
ti->type->release_clone_rq(clone, &tio->info);
return DM_MAPIO_REQUEUE;
}
......@@ -408,7 +408,7 @@ static int map_request(struct dm_rq_target_io *tio)
ret = dm_dispatch_clone_request(clone, rq);
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
blk_rq_unprep_clone(clone);
tio->ti->type->release_clone_rq(clone);
tio->ti->type->release_clone_rq(clone, &tio->info);
tio->clone = NULL;
return DM_MAPIO_REQUEUE;
}
......
......@@ -136,7 +136,8 @@ static int io_err_clone_and_map_rq(struct dm_target *ti, struct request *rq,
return DM_MAPIO_KILL;
}
static void io_err_release_clone_rq(struct request *clone)
static void io_err_release_clone_rq(struct request *clone,
union map_info *map_context)
{
}
......
......@@ -201,6 +201,13 @@ struct dm_pool_metadata {
*/
bool fail_io:1;
/*
* Set once a thin-pool has been accessed through one of the interfaces
* that imply the pool is in-service (e.g. thin devices created/deleted,
* thin-pool message, metadata snapshots, etc).
*/
bool in_service:1;
/*
* Reading the space map roots can fail, so we read it into these
* buffers before the superblock is locked and updated.
......@@ -367,6 +374,32 @@ static int subtree_equal(void *context, const void *value1_le, const void *value
/*----------------------------------------------------------------*/
/*
* Variant that is used for in-core only changes or code that
* shouldn't put the pool in service on its own (e.g. commit).
*/
static inline void __pmd_write_lock(struct dm_pool_metadata *pmd)
__acquires(pmd->root_lock)
{
down_write(&pmd->root_lock);
}
#define pmd_write_lock_in_core(pmd) __pmd_write_lock((pmd))
static inline void pmd_write_lock(struct dm_pool_metadata *pmd)
{
__pmd_write_lock(pmd);
if (unlikely(!pmd->in_service))
pmd->in_service = true;
}
static inline void pmd_write_unlock(struct dm_pool_metadata *pmd)
__releases(pmd->root_lock)
{
up_write(&pmd->root_lock);
}
/*----------------------------------------------------------------*/
static int superblock_lock_zero(struct dm_pool_metadata *pmd,
struct dm_block **sblock)
{
......@@ -790,6 +823,9 @@ static int __commit_transaction(struct dm_pool_metadata *pmd)
*/
BUILD_BUG_ON(sizeof(struct thin_disk_superblock) > 512);
if (unlikely(!pmd->in_service))
return 0;
r = __write_changed_details(pmd);
if (r < 0)
return r;
......@@ -853,6 +889,7 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
pmd->time = 0;
INIT_LIST_HEAD(&pmd->thin_devices);
pmd->fail_io = false;
pmd->in_service = false;
pmd->bdev = bdev;
pmd->data_block_size = data_block_size;
......@@ -903,7 +940,6 @@ int dm_pool_metadata_close(struct dm_pool_metadata *pmd)
DMWARN("%s: __commit_transaction() failed, error = %d",
__func__, r);
}
if (!pmd->fail_io)
__destroy_persistent_data_objects(pmd);
......@@ -1032,10 +1068,10 @@ int dm_pool_create_thin(struct dm_pool_metadata *pmd, dm_thin_id dev)
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io)
r = __create_thin(pmd, dev);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1123,10 +1159,10 @@ int dm_pool_create_snap(struct dm_pool_metadata *pmd,
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io)
r = __create_snap(pmd, dev, origin);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1166,10 +1202,10 @@ int dm_pool_delete_thin_device(struct dm_pool_metadata *pmd,
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io)
r = __delete_device(pmd, dev);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1180,7 +1216,7 @@ int dm_pool_set_metadata_transaction_id(struct dm_pool_metadata *pmd,
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (pmd->fail_io)
goto out;
......@@ -1194,7 +1230,7 @@ int dm_pool_set_metadata_transaction_id(struct dm_pool_metadata *pmd,
r = 0;
out:
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1225,7 +1261,12 @@ static int __reserve_metadata_snap(struct dm_pool_metadata *pmd)
* We commit to ensure the btree roots which we increment in a
* moment are up to date.
*/
__commit_transaction(pmd);
r = __commit_transaction(pmd);
if (r < 0) {
DMWARN("%s: __commit_transaction() failed, error = %d",
__func__, r);
return r;
}
/*
* Copy the superblock.
......@@ -1283,10 +1324,10 @@ int dm_pool_reserve_metadata_snap(struct dm_pool_metadata *pmd)
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io)
r = __reserve_metadata_snap(pmd);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1331,10 +1372,10 @@ int dm_pool_release_metadata_snap(struct dm_pool_metadata *pmd)
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io)
r = __release_metadata_snap(pmd);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1377,19 +1418,19 @@ int dm_pool_open_thin_device(struct dm_pool_metadata *pmd, dm_thin_id dev,
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock_in_core(pmd);
if (!pmd->fail_io)
r = __open_device(pmd, dev, 0, td);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
int dm_pool_close_thin_device(struct dm_thin_device *td)
{
down_write(&td->pmd->root_lock);
pmd_write_lock_in_core(td->pmd);
__close_device(td);
up_write(&td->pmd->root_lock);
pmd_write_unlock(td->pmd);
return 0;
}
......@@ -1570,10 +1611,10 @@ int dm_thin_insert_block(struct dm_thin_device *td, dm_block_t block,
{
int r = -EINVAL;
down_write(&td->pmd->root_lock);
pmd_write_lock(td->pmd);
if (!td->pmd->fail_io)
r = __insert(td, block, data_block);
up_write(&td->pmd->root_lock);
pmd_write_unlock(td->pmd);
return r;
}
......@@ -1657,10 +1698,10 @@ int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
{
int r = -EINVAL;
down_write(&td->pmd->root_lock);
pmd_write_lock(td->pmd);
if (!td->pmd->fail_io)
r = __remove(td, block);
up_write(&td->pmd->root_lock);
pmd_write_unlock(td->pmd);
return r;
}
......@@ -1670,10 +1711,10 @@ int dm_thin_remove_range(struct dm_thin_device *td,
{
int r = -EINVAL;
down_write(&td->pmd->root_lock);
pmd_write_lock(td->pmd);
if (!td->pmd->fail_io)
r = __remove_range(td, begin, end);
up_write(&td->pmd->root_lock);
pmd_write_unlock(td->pmd);
return r;
}
......@@ -1696,13 +1737,13 @@ int dm_pool_inc_data_range(struct dm_pool_metadata *pmd, dm_block_t b, dm_block_
{
int r = 0;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
for (; b != e; b++) {
r = dm_sm_inc_block(pmd->data_sm, b);
if (r)
break;
}
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1711,13 +1752,13 @@ int dm_pool_dec_data_range(struct dm_pool_metadata *pmd, dm_block_t b, dm_block_
{
int r = 0;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
for (; b != e; b++) {
r = dm_sm_dec_block(pmd->data_sm, b);
if (r)
break;
}
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1765,10 +1806,10 @@ int dm_pool_alloc_data_block(struct dm_pool_metadata *pmd, dm_block_t *result)
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io)
r = dm_sm_new_block(pmd->data_sm, result);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1777,12 +1818,16 @@ int dm_pool_commit_metadata(struct dm_pool_metadata *pmd)
{
int r = -EINVAL;
down_write(&pmd->root_lock);
/*
* Care is taken to not have commit be what
* triggers putting the thin-pool in-service.
*/
__pmd_write_lock(pmd);
if (pmd->fail_io)
goto out;
r = __commit_transaction(pmd);
if (r <= 0)
if (r < 0)
goto out;
/*
......@@ -1790,7 +1835,7 @@ int dm_pool_commit_metadata(struct dm_pool_metadata *pmd)
*/
r = __begin_transaction(pmd);
out:
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1806,7 +1851,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (pmd->fail_io)
goto out;
......@@ -1817,7 +1862,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
pmd->fail_io = true;
out:
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1948,10 +1993,10 @@ int dm_pool_resize_data_dev(struct dm_pool_metadata *pmd, dm_block_t new_count)
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io)
r = __resize_space_map(pmd->data_sm, new_count);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -1960,29 +2005,29 @@ int dm_pool_resize_metadata_dev(struct dm_pool_metadata *pmd, dm_block_t new_cou
{
int r = -EINVAL;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
if (!pmd->fail_io) {
r = __resize_space_map(pmd->metadata_sm, new_count);
if (!r)
__set_metadata_reserve(pmd);
}
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
void dm_pool_metadata_read_only(struct dm_pool_metadata *pmd)
{
down_write(&pmd->root_lock);
pmd_write_lock_in_core(pmd);
dm_bm_set_read_only(pmd->bm);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
}
void dm_pool_metadata_read_write(struct dm_pool_metadata *pmd)
{
down_write(&pmd->root_lock);
pmd_write_lock_in_core(pmd);
dm_bm_set_read_write(pmd->bm);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
}
int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
......@@ -1992,9 +2037,9 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
{
int r;
down_write(&pmd->root_lock);
pmd_write_lock_in_core(pmd);
r = dm_sm_register_threshold_callback(pmd->metadata_sm, threshold, fn, context);
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......@@ -2005,7 +2050,7 @@ int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
struct dm_block *sblock;
struct thin_disk_superblock *disk_super;
down_write(&pmd->root_lock);
pmd_write_lock(pmd);
pmd->flags |= THIN_METADATA_NEEDS_CHECK_FLAG;
r = superblock_lock(pmd, &sblock);
......@@ -2019,7 +2064,7 @@ int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
dm_bm_unlock(sblock);
out:
up_write(&pmd->root_lock);
pmd_write_unlock(pmd);
return r;
}
......
......@@ -190,7 +190,6 @@ struct writeback_struct {
struct dm_writecache *wc;
struct wc_entry **wc_list;
unsigned wc_list_n;
unsigned page_offset;
struct page *page;
struct wc_entry *wc_list_inline[WB_LIST_INLINE];
struct bio bio;
......@@ -546,21 +545,20 @@ static struct wc_entry *writecache_find_entry(struct dm_writecache *wc,
e = container_of(node, struct wc_entry, rb_node);
if (read_original_sector(wc, e) == block)
break;
node = (read_original_sector(wc, e) >= block ?
e->rb_node.rb_left : e->rb_node.rb_right);
if (unlikely(!node)) {
if (!(flags & WFE_RETURN_FOLLOWING)) {
if (!(flags & WFE_RETURN_FOLLOWING))
return NULL;
}
if (read_original_sector(wc, e) >= block) {
break;
return e;
} else {
node = rb_next(&e->rb_node);
if (unlikely(!node)) {
if (unlikely(!node))
return NULL;
}
e = container_of(node, struct wc_entry, rb_node);
break;
return e;
}
}
}
......@@ -571,7 +569,7 @@ static struct wc_entry *writecache_find_entry(struct dm_writecache *wc,
node = rb_prev(&e->rb_node);
else
node = rb_next(&e->rb_node);
if (!node)
if (unlikely(!node))
return e;
e2 = container_of(node, struct wc_entry, rb_node);
if (read_original_sector(wc, e2) != block)
......@@ -804,7 +802,7 @@ static void writecache_discard(struct dm_writecache *wc, sector_t start, sector_
writecache_free_entry(wc, e);
}
if (!node)
if (unlikely(!node))
break;
e = container_of(node, struct wc_entry, rb_node);
......@@ -1478,10 +1476,9 @@ static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeba
bio = bio_alloc_bioset(GFP_NOIO, max_pages, &wc->bio_set);
wb = container_of(bio, struct writeback_struct, bio);
wb->wc = wc;
wb->bio.bi_end_io = writecache_writeback_endio;
bio_set_dev(&wb->bio, wc->dev->bdev);
wb->bio.bi_iter.bi_sector = read_original_sector(wc, e);
wb->page_offset = PAGE_SIZE;
bio->bi_end_io = writecache_writeback_endio;
bio_set_dev(bio, wc->dev->bdev);
bio->bi_iter.bi_sector = read_original_sector(wc, e);
if (max_pages <= WB_LIST_INLINE ||
unlikely(!(wb->wc_list = kmalloc_array(max_pages, sizeof(struct wc_entry *),
GFP_NOIO | __GFP_NORETRY |
......@@ -1507,12 +1504,12 @@ static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeba
wb->wc_list[wb->wc_list_n++] = f;
e = f;
}
bio_set_op_attrs(&wb->bio, REQ_OP_WRITE, WC_MODE_FUA(wc) * REQ_FUA);
bio_set_op_attrs(bio, REQ_OP_WRITE, WC_MODE_FUA(wc) * REQ_FUA);
if (writecache_has_error(wc)) {
bio->bi_status = BLK_STS_IOERR;
bio_endio(&wb->bio);
bio_endio(bio);
} else {
submit_bio(&wb->bio);
submit_bio(bio);
}
__writeback_throttle(wc, wbl);
......
......@@ -1169,6 +1169,9 @@ static int dmz_init_zones(struct dmz_metadata *zmd)
goto out;
}
if (!nr_blkz)
break;
/* Process report */
for (i = 0; i < nr_blkz; i++) {
ret = dmz_init_zone(zmd, zone, &blkz[i]);
......@@ -1204,6 +1207,8 @@ static int dmz_update_zone(struct dmz_metadata *zmd, struct dm_zone *zone)
/* Get zone information from disk */
ret = blkdev_report_zones(zmd->dev->bdev, dmz_start_sect(zmd, zone),
&blkz, &nr_blkz, GFP_NOIO);
if (!nr_blkz)
ret = -EIO;
if (ret) {
dmz_dev_err(zmd->dev, "Get zone %u report failed",
dmz_id(zmd, zone));
......
......@@ -643,7 +643,8 @@ static int dmz_get_zoned_device(struct dm_target *ti, char *path)
q = bdev_get_queue(dev->bdev);
dev->capacity = i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1);
aligned_capacity = dev->capacity &
~((sector_t)blk_queue_zone_sectors(q) - 1);
if (ti->begin ||
((ti->len != dev->capacity) && (ti->len != aligned_capacity))) {
ti->error = "Partial mapping not supported";
......
......@@ -781,7 +781,8 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
}
static struct table_device *find_table_device(struct list_head *l, dev_t dev,
fmode_t mode) {
fmode_t mode)
{
struct table_device *td;
list_for_each_entry(td, l, list)
......@@ -792,7 +793,8 @@ static struct table_device *find_table_device(struct list_head *l, dev_t dev,
}
int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
struct dm_dev **result) {
struct dm_dev **result)
{
int r;
struct table_device *td;
......@@ -1906,7 +1908,6 @@ static void cleanup_mapped_device(struct mapped_device *md)
static struct mapped_device *alloc_dev(int minor)
{
int r, numa_node_id = dm_get_numa_node();
struct dax_device *dax_dev = NULL;
struct mapped_device *md;
void *old_md;
......@@ -1969,11 +1970,10 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);
if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
if (!dax_dev)
md->dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
if (!md->dax_dev)
goto bad;
}
md->dax_dev = dax_dev;
add_disk_no_queue_reg(md->disk);
format_dev_t(md->name, MKDEV(_major, minor));
......
......@@ -190,6 +190,8 @@ static int sm_find_free(void *addr, unsigned begin, unsigned end,
static int sm_ll_init(struct ll_disk *ll, struct dm_transaction_manager *tm)
{
memset(ll, 0, sizeof(struct ll_disk));
ll->tm = tm;
ll->bitmap_info.tm = tm;
......
......@@ -62,7 +62,8 @@ typedef int (*dm_clone_and_map_request_fn) (struct dm_target *ti,
struct request *rq,
union map_info *map_context,
struct request **clone);
typedef void (*dm_release_clone_request_fn) (struct request *clone);
typedef void (*dm_release_clone_request_fn) (struct request *clone,
union map_info *map_context);
/*
* Returns:
......
......@@ -789,7 +789,7 @@ static inline void hlist_add_behind(struct hlist_node *n,
struct hlist_node *prev)
{
n->next = prev->next;
WRITE_ONCE(prev->next, n);
prev->next = n;
n->pprev = &prev->next;
if (n->next)
......
......@@ -86,6 +86,32 @@ static inline void hlist_bl_add_head(struct hlist_bl_node *n,
hlist_bl_set_first(h, n);
}
static inline void hlist_bl_add_before(struct hlist_bl_node *n,
struct hlist_bl_node *next)
{
struct hlist_bl_node **pprev = next->pprev;
n->pprev = pprev;
n->next = next;
next->pprev = &n->next;
/* pprev may be `first`, so be careful not to lose the lock bit */
WRITE_ONCE(*pprev,
(struct hlist_bl_node *)
((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
}
static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
struct hlist_bl_node *prev)
{
n->next = prev->next;
n->pprev = &prev->next;
prev->next = n;
if (n->next)
n->next->pprev = &n->next;
}
static inline void __hlist_bl_del(struct hlist_bl_node *n)
{
struct hlist_bl_node *next = n->next;
......