1. 21 May, 2019 1 commit
• dm: make sure to obey max_io_len_target_boundary · 51b86f9a
      Michael Lass authored
      Commit 61697a6a ("dm: eliminate 'split_discard_bios' flag from DM
      target interface") incorrectly removed code from
      __send_changing_extent_only() that is required to impose a per-target IO
      boundary on IO that exceeds max_io_len_target_boundary().  Otherwise
      "special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
      where allowed.
      
Fix this by restoring the max_io_len_target_boundary() limit in
__send_changing_extent_only().
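
As a rough illustration of the arithmetic involved (a self-contained
sketch, not the kernel code; names and values are simplified):

  #include <stdio.h>

  typedef unsigned long long sector_t;

  /* Simplified model: if a target sets max_io_len, an IO starting at
   * 'offset' (relative to the target) must not cross the next multiple
   * of max_io_len.  max_io_len == 0 means "no boundary". */
  static sector_t max_io_len_target_boundary(sector_t offset,
                                             sector_t max_io_len)
  {
          if (!max_io_len)
                  return (sector_t)-1;    /* effectively unlimited */
          return max_io_len - (offset % max_io_len);
  }

  int main(void)
  {
          sector_t offset = 300, len = 1000, max_io_len = 512;
          sector_t allowed = max_io_len_target_boundary(offset, max_io_len);
          /* The restored min(): issue 212 sectors here instead of
           * letting all 1000 cross the boundary. */
          sector_t issue = len < allowed ? len : allowed;

          printf("issue %llu of %llu sectors\n", issue, len);
          return 0;
  }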
      
      Fixes: 61697a6a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
      Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Michael Lass <bevan@bi-co.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 16 May, 2019 4 commits
  3. 09 May, 2019 1 commit
  4. 08 May, 2019 4 commits
  5. 07 May, 2019 9 commits
  6. 30 Apr, 2019 3 commits
• dm mpath: always free attached_handler_name in parse_path() · 940bc471
      Martin Wilck authored
      Commit b592211c ("dm mpath: fix attached_handler_name leak and
      dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh() may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_dh()
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.
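
The ownership convention that makes this safe can be sketched in plain
C (hypothetical names, not the dm-mpath source): the callee NULLs the
caller's pointer whenever it takes ownership, so an unconditional free
in the caller is always correct:

  #include <stdlib.h>
  #include <string.h>

  /* The callee "steals" the string only on the path that keeps it,
   * and signals that by NULLing the caller's pointer. */
  static void setup_handler(char **name, int retain)
  {
          if (retain && *name) {
                  /* ... stash *name for later use ... */
                  *name = NULL;   /* caller no longer owns it */
          }
  }

  int main(void)
  {
          char *attached_handler_name = strdup("alua");

          setup_handler(&attached_handler_name, 0);
          /* Always safe: free(NULL) is a no-op after a steal, and
           * this frees the buffer when ownership was not taken. */
          free(attached_handler_name);
          return 0;
  }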
      
      Fixes: b592211c ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
      Cc: stable@vger.kernel.org
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm init: fix max devices/targets checks · 8e890c1a
      Helen Koike authored
      dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
      and not DM_MAX_{DEVICES,TARGETS} - 1.
      
Fix the checks, and also fix the error message printed when the number
of devices is exceeded.
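
The off-by-one is the familiar ">=" versus ">" bounds check; a minimal
sketch (the DM_MAX_DEVICES value here is illustrative):

  #include <stdio.h>

  #define DM_MAX_DEVICES 256      /* illustrative value */

  static int check_ndevices(unsigned int ndev)
  {
          /* The buggy form rejected ndev == DM_MAX_DEVICES:
           *         if (ndev >= DM_MAX_DEVICES) ...
           * The fixed form allows up to and including the maximum: */
          if (ndev > DM_MAX_DEVICES) {
                  fprintf(stderr, "too many devices %u > %d\n",
                          ndev, DM_MAX_DEVICES);
                  return -1;
          }
          return 0;
  }

  int main(void)
  {
          return check_ndevices(DM_MAX_DEVICES);  /* now accepted: 0 */
  }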
      
      Fixes: 6bbc923d ("dm: add support to directly boot to a mapped device")
      Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm: add dust target · e4f3fabd
      Bryan Gurney authored
Add the dm-dust target, which simulates the behavior of bad sectors
at arbitrary locations, and allows the emulation of read failures to
be enabled at an arbitrary time.
      
      This target behaves similarly to a linear target.  At a given time,
      the user can send a message to the target to start failing read
      requests on specific blocks.  When the failure behavior is enabled,
      reads of blocks configured "bad" will fail with EIO.
      
      Writes of blocks configured "bad" will result in the following:
      
      1. Remove the block from the "bad block list".
      2. Successfully complete the write.
      
      After this point, the block will successfully contain the written
      data, and will service reads and writes normally.  This emulates the
      behavior of a "remapped sector" on a hard disk drive.
      
dm-dust provides logging of which blocks have been added to or removed
from the "bad block list", as well as logging when a block is removed
from the list by a write. These messages can be used alongside the
messages from the driver using a dm-dust device to analyze the driver's
behavior when a read fails at a given time.
      
      (This logging can be reduced via a "quiet" mode, if desired.)
      
      NOTE: If the block size is larger than 512 bytes, only the first sector
      of each "dust block" is detected.  Placing a limiting layer above a dust
      target, to limit the minimum I/O size to the dust block size, will
      ensure proper emulation of the given large block size.
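
A toy model of the read/write semantics described above (not the
target's code; the data structure and function names are illustrative):

  #include <errno.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define NBLOCKS 16

  static bool bad[NBLOCKS];       /* toy stand-in for the bad block list */

  static int dust_read(unsigned int b)
  {
          return bad[b] ? -EIO : 0;       /* fail reads of "bad" blocks */
  }

  static int dust_write(unsigned int b)
  {
          bad[b] = false;         /* 1. remove block from the list */
          return 0;               /* 2. complete the write successfully */
  }

  int main(void)
  {
          bad[5] = true;                           /* mark block 5 bad */
          printf("read 5 -> %d\n", dust_read(5));  /* -EIO */
          dust_write(5);
          printf("read 5 -> %d\n", dust_read(5));  /* 0: serviced normally */
          return 0;
  }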
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Co-developed-by: Joe Shimkus <jshimkus@redhat.com>
Co-developed-by: John Dorminy <jdorminy@redhat.com>
Co-developed-by: John Pittman <jpittman@redhat.com>
Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  7. 26 Apr, 2019 4 commits
  8. 25 Apr, 2019 1 commit
• dm mpath: fix missing call of path selector type->end_io · 5de719e3
      Yufen Yu authored
After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() requeues the tio
when the issued clone request returns BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE.
      
Thus, if the device driver keeps reporting an error, a tio may be
requeued multiple times until the return value is not DM_MAPIO_REQUEUE.
That means type->start_io may be called multiple times, while
type->end_io is only called when the IO completes.
      
In fact, even without commit 396eaf21, a setup_clone() failure can
also cause a tio requeue and an associated missed call to type->end_io.
      
The service-time path selector selects a path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() leave in_flight_size over-counted and cause
the selector to make the wrong choice. The queue-length path selector
is affected in the same way.
      
To fix the problem, call type->end_io in ->release_clone_rq before the
tio is requeued. map_info is passed to ->release_clone_rq() for the
map_request() error paths that result in a requeue.
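
The invariant the fix restores is simple pairing: every start_io must
be matched by exactly one end_io, even across requeues. A minimal
accounting sketch (hypothetical, simplified from the service-time
selector):

  #include <stdio.h>

  static long in_flight_size;     /* what service-time selects on */

  static void st_start_io(long bytes) { in_flight_size += bytes; }
  static void st_end_io(long bytes)   { in_flight_size -= bytes; }

  int main(void)
  {
          /* A requeued tio calls start_io again on each retry; without
           * the matching end_io in ->release_clone_rq the counter only
           * grows and the selector sees phantom in-flight IO. */
          for (int retry = 0; retry < 3; retry++) {
                  st_start_io(4096);
                  st_end_io(4096);        /* the fix: one end_io per requeue */
          }
          printf("in_flight_size = %ld\n", in_flight_size);       /* 0 */
          return 0;
  }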
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernel.org
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  9. 18 Apr, 2019 13 commits
• dm thin metadata: do not write metadata if no changes occurred · 873f258b
      Mike Snitzer authored
Otherwise, just activating a thin-pool and thin device and then
deactivating them will cause the thin-pool metadata to be written
(e.g. superblock written) -- even though no metadata was actually
changed.
      
      Add 'in_service' flag to struct dm_pool_metadata and set it in
      pmd_write_lock() because all on-disk metadata changes must take a write
      lock of pmd->root_lock.  Once 'in_service' is set it is never cleared.
      __commit_transaction() will return 0 if 'in_service' is not set.
      dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that
      it isn't the sole reason for putting a thin-pool in service.
      
Also fix dm_pool_commit_metadata() to open the next transaction if the
return from __commit_transaction() is 0. It is not clear why the early
return ever made sense for a return of 0, given that dm-io's
async_io(), as used by bufio, always returns 0.
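
A rough userspace sketch of the wrapper pattern (pthread locks standing
in for the kernel's rwsem; simplified, not the dm-thin source):

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  struct pool_md {
          pthread_rwlock_t root_lock;
          bool in_service;        /* set once, never cleared */
  };

  /* All on-disk changes take this write lock, so it is the one place
   * that needs to flip in_service. */
  static void pmd_write_lock(struct pool_md *pmd)
  {
          pthread_rwlock_wrlock(&pmd->root_lock);
          pmd->in_service = true;
  }

  /* Variant for dm_pool_commit_metadata(): takes the lock without
   * being, by itself, a reason to put the pool in service. */
  static void __pmd_write_lock(struct pool_md *pmd)
  {
          pthread_rwlock_wrlock(&pmd->root_lock);
  }

  static int commit_transaction(struct pool_md *pmd)
  {
          if (!pmd->in_service)
                  return 0;       /* nothing changed; skip the writes */
          /* ... write out superblock etc. ... */
          return 0;
  }

  int main(void)
  {
          struct pool_md pmd = { PTHREAD_RWLOCK_INITIALIZER, false };

          __pmd_write_lock(&pmd);         /* plain commit: still a no-op */
          printf("commit -> %d, in_service = %d\n",
                 commit_transaction(&pmd), pmd.in_service);
          pthread_rwlock_unlock(&pmd.root_lock);

          pmd_write_lock(&pmd);           /* a real metadata change */
          pthread_rwlock_unlock(&pmd.root_lock);
          printf("in_service = %d\n", pmd.in_service);
          return 0;
  }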
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm thin metadata: add wrappers for managing write locking of metadata · 6a1b1ddc
      Mike Snitzer authored
No functional change, but this prepares for hooking additional
functionality off of pmd_write_lock() (as provided in the next commit).
Suggested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm thin metadata: check __commit_transaction()'s return · a1ed4d9e
      Mike Snitzer authored
      Fix __reserve_metadata_snap() to return early if __commit_transaction()
      fails.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm space map common: zero entire ll_disk · c6e086e0
      Mike Snitzer authored
      Otherwise, memory that is allocated (and potentially not previously
      zeroed) will get written to disk as part of the space maps.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm writecache: add unlikely for returned value of rb_next/prev · 84420b1e
      Huaisheng Ye authored
In writecache_discard() and writecache_find_entry() there is a high
probability that the rb_node pointer will not be NULL. Annotate the
NULL checks with unlikely().
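
For reference, unlikely() is just a branch-prediction hint; a minimal
sketch using the underlying GCC builtin (the struct here is a stand-in,
not the kernel's rb_node):

  #include <stddef.h>

  #define unlikely(x) __builtin_expect(!!(x), 0)

  struct node { struct node *next; };   /* stand-in for rb_node */

  static int visit(struct node *n)
  {
          /* The NULL case is rare in these loops, so tell the compiler
           * to lay out the common (non-NULL) path first. */
          if (unlikely(n == NULL))
                  return 0;
          /* ... process the node ... */
          return 1;
  }

  int main(void)
  {
          return visit(NULL);
  }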
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm writecache: remove needless dereferences in __writecache_writeback_pmem() · 09f2d656
      Huaisheng Ye authored
bio is already available, so there is no need to access it through the
wb pointer.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm snapshot: Use fine-grained locking scheme · 3f1637f2
      Nikos Tsironis authored
Replace the global locking scheme with a fine-grained one, employing
the read-write semaphore and the scalable exception tables with
per-bucket locks introduced by the previous two commits.
      
      Summarizing, we now use a read-write semaphore to protect the mostly
      read fields of the snapshot structure, e.g., valid, active, etc., and
      per-bucket bit spinlocks to protect accesses to the complete and pending
      exception tables.
      
      Finally, we use an extra spinlock (pe_allocation_lock) to serialize the
      allocation of new exceptions by the exception store. This allocation is
      really fast, so the extra spinlock doesn't hurt the performance.
      
      This scheme allows dm-snapshot to scale better, resulting in increased
      IOPS and reduced latency.
      
      Following are some benchmark results using the null_blk device:
      
        modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
         queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1
      
      * Benchmark fio_origin_randwrite_throughput_N, from the device mapper
        test suite [1] (direct IO, random 4K writes to origin device, IO
        engine libaio):
      
        +--------------+-------------+------------+
        | # of workers | IOPS Before | IOPS After |
        +--------------+-------------+------------+
        |      1       |    57708    |   66421    |
        |      2       |    63415    |   77589    |
        |      4       |    67276    |   98839    |
        |      8       |    60564    |   109258   |
        +--------------+-------------+------------+
      
      * Benchmark fio_origin_randwrite_latency_N, from the device mapper test
        suite [1] (direct IO, random 4K writes to origin device, IO engine
        psync):
      
        +--------------+-----------------------+----------------------+
        | # of workers | Latency (usec) Before | Latency (usec) After |
        +--------------+-----------------------+----------------------+
        |      1       |         16.25         |        13.27         |
        |      2       |         31.65         |        25.08         |
        |      4       |         55.28         |        41.08         |
        |      8       |         121.47        |        74.44         |
        +--------------+-----------------------+----------------------+
      
      * Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
        test suite [1] (direct IO, random 4K writes to snapshot device, IO
        engine libaio):
      
        +--------------+-------------+------------+
        | # of workers | IOPS Before | IOPS After |
        +--------------+-------------+------------+
        |      1       |    72593    |   84938    |
        |      2       |    97379    |   134973   |
        |      4       |    90610    |   143077   |
        |      8       |    90537    |   180085   |
        +--------------+-------------+------------+
      
      * Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
        test suite [1] (direct IO, random 4K writes to snapshot device, IO
        engine psync):
      
        +--------------+-----------------------+----------------------+
        | # of workers | Latency (usec) Before | Latency (usec) After |
        +--------------+-----------------------+----------------------+
        |      1       |         12.53         |         10.6         |
        |      2       |         19.78         |        14.89         |
        |      4       |         40.37         |        23.47         |
        |      8       |         89.32         |        48.48         |
        +--------------+-----------------------+----------------------+
      
[1] https://github.com/jthornber/device-mapper-test-suite

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm snapshot: Make exception tables scalable · f79ae415
      Nikos Tsironis authored
Use list_bl to implement the exception hash tables' buckets. This
change permits concurrent access to distinct buckets by multiple
threads.
      
      Also, implement helper functions to lock and unlock the exception tables
      based on the chunk number of the exception at hand.
      
We retain the global locking, by means of down_write(), which is
replaced in the next commit.
      
      Still, we must acquire the per-bucket spinlocks when accessing the hash
      tables, since list_bl does not allow modification on unlocked lists.
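
A compact userspace approximation of the idea (the kernel's list_bl
actually folds the lock into bit 0 of the bucket's head pointer; here a
separate flag stands in for it, and the helper names are illustrative):

  #include <stdatomic.h>
  #include <stdint.h>

  #define NR_BUCKETS 64           /* illustrative table size */

  struct bucket {
          atomic_bool lock;       /* stand-in for the hlist_bl bit lock */
          /* ... bucket list head ... */
  };

  static struct bucket table[NR_BUCKETS];

  static struct bucket *exception_bucket(uint64_t chunk)
  {
          return &table[chunk % NR_BUCKETS];
  }

  /* Lock/unlock helpers keyed by the chunk number of the exception. */
  static void lock_exception(uint64_t chunk)
  {
          while (atomic_exchange(&exception_bucket(chunk)->lock, true))
                  ;       /* spin; only this bucket's users contend */
  }

  static void unlock_exception(uint64_t chunk)
  {
          atomic_store(&exception_bucket(chunk)->lock, false);
  }

  int main(void)
  {
          lock_exception(42);
          /* ... insert/lookup in bucket 42 % NR_BUCKETS ... */
          unlock_exception(42);
          return 0;
  }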
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm snapshot: Replace mutex with rw semaphore · 4ad8d880
      Nikos Tsironis authored
      dm-snapshot uses a single mutex to serialize every access to the
      snapshot state. This includes all accesses to the complete and pending
      exception tables, which occur at every origin write, every snapshot
      read/write and every exception completion.
      
The lock statistics indicate that this mutex is a bottleneck (average
wait time ~480 usecs for 8 processes doing random 4K writes to the
origin device), preventing dm-snapshot from scaling as the number of
threads doing IO increases.
      
      The major contention points are __origin_write()/snapshot_map() and
      pending_complete(), i.e., the submission and completion of pending
      exceptions.
      
      Replace this mutex with a rw semaphore.
      
      We essentially revert commit ae1093be ("dm snapshot: use mutex
      instead of rw_semaphore") and together with the next two patches we
      substitute the single mutex with a fine-grained locking scheme, where we
      use a read-write semaphore to protect the mostly read fields of the
      snapshot structure, e.g., valid, active, etc., and per-bucket bit
      spinlocks to protect accesses to the complete and pending exception
      tables.
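
As a userspace analogy (pthreads standing in for kernel rwsems; the
names are hypothetical), the gain is that the read-mostly paths can now
run concurrently:

  #include <pthread.h>

  static pthread_rwlock_t snap_lock = PTHREAD_RWLOCK_INITIALIZER;
  static int snapshot_valid = 1;

  static int map_io(void)
  {
          int valid;

          pthread_rwlock_rdlock(&snap_lock);      /* many readers at once */
          valid = snapshot_valid;
          pthread_rwlock_unlock(&snap_lock);
          return valid;
  }

  static void invalidate_snapshot(void)
  {
          pthread_rwlock_wrlock(&snap_lock);      /* rare, exclusive */
          snapshot_valid = 0;
          pthread_rwlock_unlock(&snap_lock);
  }

  int main(void)
  {
          (void)map_io();
          invalidate_snapshot();
          return map_io();        /* 0 */
  }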
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm snapshot: Don't sleep holding the snapshot lock · 65fc7c37
      Nikos Tsironis authored
      When completing a pending exception, pending_complete() waits for all
      conflicting reads to drain, before inserting the final, completed
      exception. Conflicting reads are snapshot reads redirected to the
      origin, because the relevant chunk is not remapped to the COW device the
      moment we receive the read.
      
      The completed exception must be inserted into the exception table after
      all conflicting reads drain to ensure snapshot reads don't return
      corrupted data. This is required because inserting the completed
      exception into the exception table signals that the relevant chunk is
      remapped and both origin writes and snapshot merging will now overwrite
      the chunk in origin.
      
      This wait is done holding the snapshot lock to ensure that
      pending_complete() doesn't starve if new snapshot reads keep coming for
      this chunk.
      
      In preparation for the next commit, where we use a spinlock instead of a
      mutex to protect the exception tables, we remove the need for holding
      the lock while waiting for conflicting reads to drain.
      
      We achieve this in two steps:
      
      1. pending_complete() inserts the completed exception before waiting for
         conflicting reads to drain and removes the pending exception after
         all conflicting reads drain.
      
         This ensures that new snapshot reads will be redirected to the COW
         device, instead of the origin, and thus pending_complete() will not
         starve. Moreover, we use the existence of both a completed and
         a pending exception to signify that the COW is done but there are
         conflicting reads in flight.
      
      2. In __origin_write() we check first if there is a pending exception
         and then if there is a completed exception. If there is a pending
         exception any submitted BIO is delayed on the pe->origin_bios list and
         DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
         origin nor snapshot merging can overwrite the origin chunk, until all
         conflicting reads drain, and thus snapshot reads will not return
         corrupted data.
      
      Summarizing, we now have the following possible combinations of pending
      and completed exceptions for a chunk, along with their meaning:
      
      A. No exceptions exist: The chunk has not been remapped yet.
      B. Only a pending exception exists: The chunk is currently being copied
         to the COW device.
      C. Both a pending and a completed exception exist: COW for this chunk
         has completed but there are snapshot reads in flight which had been
         redirected to the origin before the chunk was remapped.
      D. Only the completed exception exists: COW has been completed and there
         are no conflicting reads in flight.
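
The four combinations and the check order from step 2 can be encoded
as a toy decision table (hypothetical helper names, not the dm-snapshot
source):

  #include <stdbool.h>
  #include <stdio.h>

  /* States A-D above, driven by the two exception-table lookups. */
  static const char *chunk_state(bool pending, bool completed)
  {
          if (!pending && !completed)
                  return "A: not remapped yet";
          if (pending && !completed)
                  return "B: COW in progress";
          if (pending && completed)
                  return "C: COW done, conflicting reads in flight";
          return "D: remapped, no conflicting reads";
  }

  /* Check order from step 2: pending first, then completed. */
  static const char *origin_write(bool pending, bool completed)
  {
          if (pending)
                  return "delay bio on pe->origin_bios";
          if (completed)
                  return "chunk remapped: write may proceed";
          return "start a new pending exception";
  }

  int main(void)
  {
          printf("%s -> %s\n", chunk_state(true, true),
                 origin_write(true, true));
          return 0;
  }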
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• list_bl: Add hlist_bl_add_before/behind helpers · 34191ae8
      Nikos Tsironis authored
      Add hlist_bl_add_before/behind helpers to add an element before/after an
      existing element in a bl_list.
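
Roughly, the helpers look like this (a simplified doubly-linked sketch
that ignores the lock bit the real hlist_bl keeps in the low bit of its
pointers):

  #include <stddef.h>

  struct node {
          struct node *next;
          struct node **pprev;    /* address of the pointer pointing at us */
  };

  /* Insert n immediately before an existing element. */
  static void add_before(struct node *n, struct node *next)
  {
          n->pprev = next->pprev;
          n->next = next;
          next->pprev = &n->next;
          *n->pprev = n;  /* the real helper must preserve the lock bit */
  }

  /* Insert n immediately after an existing element. */
  static void add_behind(struct node *n, struct node *prev)
  {
          n->next = prev->next;
          n->pprev = &prev->next;
          prev->next = n;
          if (n->next)
                  n->next->pprev = &n->next;
  }

  int main(void)
  {
          struct node a = { NULL, NULL }, b, c;
          struct node *head = &a;

          a.pprev = &head;
          add_behind(&c, &a);     /* head -> a -> c */
          add_before(&b, &c);     /* head -> a -> b -> c */
          return (a.next == &b && b.next == &c) ? 0 : 1;
  }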
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• list: Don't use WRITE_ONCE() in hlist_add_behind() · ae325dcd
      Nikos Tsironis authored
      Commit 1c97be67 ("list: Use WRITE_ONCE() when adding to lists and
      hlists") introduced the use of WRITE_ONCE() to atomically write the list
      head's ->next pointer.
      
      hlist_add_behind() doesn't touch the hlist head's ->first pointer so
      there is no reason to use WRITE_ONCE() in this case.
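
The contrast, sketched (simplified, not the kernel source): adding at
the head publishes through the head pointer that lockless readers start
from, while adding behind only writes an existing node's ->next:

  struct hnode { struct hnode *next, **pprev; };
  struct hhead { struct hnode *first; };

  static void add_head(struct hnode *n, struct hhead *h)
  {
          n->next = h->first;
          if (h->first)
                  h->first->pprev = &n->next;
          h->first = n;           /* publishes via the head pointer that
                                   * lockless readers start from; the
                                   * kernel uses WRITE_ONCE() here */
          n->pprev = &h->first;
  }

  static void add_behind(struct hnode *n, struct hnode *prev)
  {
          n->next = prev->next;
          n->pprev = &prev->next;
          prev->next = n;         /* never touches h->first */
          if (n->next)
                  n->next->pprev = &n->next;
  }

  int main(void)
  {
          struct hhead h = { 0 };
          struct hnode a, b;

          add_head(&a, &h);
          add_behind(&b, &a);
          return (h.first == &a && a.next == &b) ? 0 : 1;
  }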
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
• dm cache metadata: Fix loading discard bitset · e28adc3b
      Nikos Tsironis authored
      Add missing dm_bitset_cursor_next() to properly advance the bitset
      cursor.
      
      Otherwise, the discarded state of all blocks is set according to the
      discarded state of the first block.
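
The bug is the classic cursor loop that forgets to advance;
schematically (a hypothetical cursor API modeled loosely on
dm_bitset_cursor):

  #include <stdbool.h>
  #include <stdio.h>

  struct cursor { const bool *bits; unsigned int n, i; };

  static bool cursor_get(const struct cursor *c)  { return c->bits[c->i]; }
  static void cursor_next(struct cursor *c)       { if (c->i + 1 < c->n) c->i++; }

  int main(void)
  {
          const bool discards[] = { true, false, false, true };
          struct cursor c = { discards, 4, 0 };

          for (unsigned int b = 0; b < 4; b++) {
                  printf("block %u discarded = %d\n", b, cursor_get(&c));
                  cursor_next(&c);        /* the missing call: without it,
                                           * every block reads bit 0 */
          }
          return 0;
  }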
      
      Fixes: ae4a46a1 ("dm cache metadata: use bitset cursor api to load discard bitset")
      Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>