Commits · 78d8e58a086b214dddf1fd463e20a7e1d82d7866 · Kirill Smelkov / linux

26 Jun, 2015 2 commits

Revert "block, dm: don't copy bios for request clones" · 78d8e58a

Mike Snitzer authored Jun 26, 2015

This reverts commit 5f1b670d.

Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

this change should not be pushed to mainline yet.

Firstly, Christoph has a newer version of the patch that fixes a silent
data corruption problem:
  https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

And the new version still depends on LLDDs to always complete requests
to the end when an error happens, while the block API doesn't enforce such
a requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
  https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

78d8e58a

Revert "dm: do not allocate any mempools for blk-mq request-based DM" · 4e6e36c3

Mike Snitzer authored Jun 26, 2015

This reverts commit cbc4e3c1.
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

4e6e36c3

17 Jun, 2015 6 commits

dm stats: add support for request-based DM devices · e262f347

Mikulas Patocka authored Jun 09, 2015

This makes it possible to use dm stats with DM multipath.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

e262f347

dm stats: collect and report histogram of IO latencies · dfcfac3e

Mikulas Patocka authored Jun 09, 2015

Add an option to dm statistics to collect and report a histogram of
IO latencies.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dfcfac3e
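
As an illustration of what collecting a latency histogram involves, here is a minimal userspace sketch. It is not the dm-stats implementation: the function name, the boundary layout and the bucket convention are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration only: count one latency sample into a histogram
 * whose buckets are delimited by sorted, strictly increasing boundaries.
 * With n boundaries there are n + 1 buckets, so counts[] needs n + 1 slots.
 * A sample equal to a boundary lands in the bucket above that boundary.
 */
static void latency_histogram_add(const uint64_t *boundaries, size_t n,
                                  uint64_t *counts, uint64_t latency_ns)
{
    size_t lo = 0, hi = n;

    /* Binary search for the first boundary greater than the sample. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (boundaries[mid] <= latency_ns)
            lo = mid + 1;
        else
            hi = mid;
    }
    counts[lo]++;
}
```

With boundaries of 1000 and 10000 ns there are three buckets: below 1000, 1000 up to but excluding 10000, and 10000 or more.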

dm stats: support precise timestamps · c96aec34

Mikulas Patocka authored Jun 09, 2015

Make it possible to use precise timestamps with nanosecond granularity
in dm statistics.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c96aec34

dm stats: fix divide by zero if 'number_of_areas' arg is zero · dd4c1b7d

Mikulas Patocka authored Jun 05, 2015

If the number_of_areas argument was zero the kernel would crash on
div-by-zero.  Add better input validation.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+

dd4c1b7d
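
The kind of input validation described above can be sketched as follows. This is a hypothetical userspace helper, not the kernel code: the name compute_area_step() and the round-up rule are assumptions chosen for the example.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive a per-area step from a region length and a
 * requested number of areas, rejecting zero before it reaches the division.
 */
static int compute_area_step(uint64_t region_len, uint64_t number_of_areas,
                             uint64_t *step)
{
    /* Validate before dividing: a zero divisor must never reach '/'. */
    if (number_of_areas == 0)
        return -EINVAL;

    /* Round up so that number_of_areas steps cover the whole region. */
    *step = region_len / number_of_areas +
            (region_len % number_of_areas != 0);
    return 0;
}
```

On the error path the output parameter is left untouched, so a caller cannot pick up a half-computed value.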

dm cache: switch the "default" cache replacement policy from mq to smq · bccab6a0

Mike Snitzer authored Jun 17, 2015

The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of
less memory utilization, improved performance and increased adaptability
in the face of changing workloads.  SMQ also does not have any
cumbersome tuning knobs.

Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target.  Doing so will cause all of the
mq policy's hints to be dropped.  Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.

In the future the "mq" policy will just silently make use of "smq" and
the mq code will be removed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>

bccab6a0

dm space map metadata: fix occasional leak of a metadata block on resize · 6096d91a

Joe Thornber authored Jun 17, 2015

The metadata space map has a simplified 'bootstrap' mode that is
operational when extending the space maps.  Whilst in this mode it's
possible for some refcount decrement operations to become queued (eg, as
a result of shadowing one of the bitmap indexes).  These decrements were
not being applied when switching out of bootstrap mode.

The effect of this bug was the leaking of a 4k metadata block.  This is
detected by the latest version of thin_check as a non fatal error.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

6096d91a

11 Jun, 2015 12 commits

dm thin metadata: fix a race when entering fail mode · b1f11aff

Joe Thornber authored Jun 11, 2015

In dm_thin_find_block() the ->fail_io flag was checked outside the
metadata device's root_lock, causing dm_thin_find_block() to race with
the setting of this flag.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

b1f11aff

dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages · fd467696

Mike Snitzer authored Jun 09, 2015

Use the EOPNOTSUPP error code, rather than EINVAL, when the user attempts
to send a message to a pool that cannot handle messages.  Otherwise
userspace is led to believe the message failed due to an invalid argument.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

fd467696
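
The distinction between the two error codes can be sketched like this. The enum, the function name and the rule that only a writable pool accepts messages are assumptions for the example, not the dm-thin code.

```c
#include <errno.h>

/* Hypothetical pool modes for this sketch. */
enum pool_mode_sketch {
    SKETCH_PM_WRITE,
    SKETCH_PM_READ_ONLY,
    SKETCH_PM_FAIL
};

/*
 * Hypothetical message handler: "the pool cannot process messages right
 * now" and "the message itself is malformed" get different error codes,
 * so the caller can tell them apart.
 */
static int pool_message_sketch(enum pool_mode_sketch mode, int argc)
{
    if (mode != SKETCH_PM_WRITE)
        return -EOPNOTSUPP;     /* nothing wrong with the arguments */

    if (argc < 1)
        return -EINVAL;         /* genuinely invalid argument list */

    return 0;
}
```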

dm thin: range discard support · 34fbcf62

Joe Thornber authored Apr 16, 2015

Previously REQ_DISCARD bios have been split into block sized chunks
before submission to the thin target.  There are a couple of issues with
this:

 - If the block size is small, a large discard request can
   get broken up into a great many bios which is both slow and causes
   a lot of memory pressure.

 - The thin pool block size and the discard granularity for the
   underlying data device need to be compatible if we want to passdown
   the discard.

This patch relaxes the block size granularity for thin devices.  It
makes use of the recent range locking added to the bio_prison to
quiesce a whole range of thin blocks before unmapping them.  Once a
thin range has been unmapped the discard can then be passed down to
the data device for those sub ranges where the data blocks are no
longer used (ie. they weren't shared in the first place).

This patch also doesn't make any apologies about open-coding portions
of block core as a means to supporting async discard completions in the
near-term -- if/when late bio splitting lands it'll all get cleaned up.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

34fbcf62

dm thin metadata: add dm_thin_remove_range() · 6550f075

Joe Thornber authored Apr 13, 2015

Removes a range of blocks from the btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

6550f075

dm thin metadata: add dm_thin_find_mapped_range() · a5d895a9

Joe Thornber authored Apr 16, 2015

Retrieve the next run of contiguously mapped blocks.  Useful for working
out where to break up IO.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

a5d895a9
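
To make "next run of contiguously mapped blocks" concrete, here is a small sketch over a flat array instead of the metadata btree. The array representation, the UNMAPPED_SKETCH marker and all names are assumptions for the example. A run is taken to mean consecutive virtual blocks backed by consecutive data blocks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNMAPPED_SKETCH UINT64_MAX  /* hypothetical "no mapping" marker */

struct mapped_run {
    size_t thin_begin;      /* first virtual block of the run */
    size_t thin_end;        /* one past the last virtual block */
    uint64_t data_begin;    /* data block backing thin_begin */
};

/*
 * Hypothetical helper: map[i] holds the data block for virtual block i.
 * Finds the first run of contiguously mapped blocks within [begin, end).
 * Returns false if no block in that range is mapped.
 */
static bool find_mapped_run(const uint64_t *map, size_t begin, size_t end,
                            struct mapped_run *run)
{
    size_t b = begin;

    /* Skip leading unmapped blocks. */
    while (b < end && map[b] == UNMAPPED_SKETCH)
        b++;
    if (b >= end)
        return false;

    /* Extend while virtual and data blocks both stay consecutive. */
    size_t e = b + 1;
    while (e < end && map[e] != UNMAPPED_SKETCH &&
           map[e] == map[b] + (e - b))
        e++;

    run->thin_begin = b;
    run->thin_end = e;
    run->data_begin = map[b];
    return true;
}
```

A caller that wants to break up IO can call the helper repeatedly, passing the previous thin_end as the new begin, until it returns false.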

dm btree: add dm_btree_remove_leaves() · 4ec331c3

Joe Thornber authored Apr 13, 2015

Removes a range of leaf values from the tree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

4ec331c3

dm stats: Use kvfree() in dm_kvfree() · 0f24b79b

Pekka Enberg authored May 15, 2015

Use kvfree() instead of open-coding it.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0f24b79b

dm cache: age and write back cache entries even without active IO · fba10109

Joe Thornber authored May 29, 2015

The policy tick() method is normally called from interrupt context.
Both the mq and smq policies do some bottom half work for the tick
method in their map functions.  However if no IO is going through the
cache, then that bottom half work doesn't occur.  With these policies
this means recently hit entries do not age and do not get written
back as early as we'd like.

Fix this by introducing a new 'can_block' parameter to the tick()
method.  When this is set the bottom half work occurs immediately.
'can_block' is set when the tick method is called every second by the
core target (not in interrupt context).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

fba10109
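
The effect of the 'can_block' parameter can be sketched in a few lines. The structure and function names below are hypothetical, and the "bottom half" is reduced to bumping a generation counter. The real policies do more work at this point.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified policy state. */
struct policy_sketch {
    unsigned tick_pending;  /* a tick was recorded but not yet processed */
    unsigned generation;    /* advanced by the deferred ("bottom half") work */
};

/* Deferred work: age entries once per recorded tick. */
static void policy_end_period(struct policy_sketch *p)
{
    if (p->tick_pending) {
        p->generation++;
        p->tick_pending = 0;
    }
}

/*
 * From interrupt context (can_block == false) only record the tick.
 * From a context that may block, run the deferred work immediately.
 */
static void policy_tick(struct policy_sketch *p, bool can_block)
{
    p->tick_pending = 1;
    if (can_block)
        policy_end_period(p);
}

/* The map path picks up any deferred work, but only runs when there is IO. */
static void policy_map(struct policy_sketch *p)
{
    policy_end_period(p);
    /* ... block lookup would follow here ... */
}
```

Without the can_block path, an idle cache never calls policy_map(), so the generation counter in this sketch would stop advancing.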

dm cache: prefix all DMERR and DMINFO messages with cache device name · b61d9509

Mike Snitzer authored Apr 22, 2015

Having the DM device name associated with the ERR or INFO message is
very helpful.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

b61d9509

dm cache: add fail io mode and needs_check flag · 028ae9f7

Joe Thornber authored Apr 22, 2015

If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode.  If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.

Once needs_check is set the cache device will not be allowed to
activate.  Activation requires write access to metadata.  Future work is
needed to add proper support for running the cache in read-only mode.

Once in fail-io mode the cache will report a status of "Fail".

Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

028ae9f7
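
The mode transitions described above can be summarised as a tiny state function. The enum and function names are hypothetical and the sketch ignores locking and the on-disk needs_check flag itself. It only encodes the decision stated in the commit message.

```c
#include <stdbool.h>

/* Hypothetical cache modes for this sketch. */
enum cache_mode_sketch {
    CM_SKETCH_WRITE,
    CM_SKETCH_READ_ONLY,
    CM_SKETCH_FAIL
};

/*
 * After a failed metadata operation: if the transaction was aborted and
 * needs_check was set successfully, fall back to read-only mode. If either
 * recovery step failed, go to fail-io mode.
 */
static enum cache_mode_sketch
mode_after_metadata_failure(bool abort_ok, bool needs_check_set_ok)
{
    if (!abort_ok || !needs_check_set_ok)
        return CM_SKETCH_FAIL;
    return CM_SKETCH_READ_ONLY;
}

/* Commit wrapper rule: commits are only allowed in write mode. */
static bool commit_allowed(enum cache_mode_sketch mode)
{
    return mode == CM_SKETCH_WRITE;
}
```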

dm cache: wake the worker thread every time we free a migration object · 88bf5184

Joe Thornber authored May 27, 2015

When the cache is idle, writeback work was only being issued every
second.  With this change outstanding writebacks are streamed
constantly.  This offers a writeback performance improvement.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

88bf5184

dm cache: add stochastic-multi-queue (smq) policy · 66a63635

Joe Thornber authored May 15, 2015

The stochastic-multi-queue (smq) policy addresses some of the problems
with the current multiqueue (mq) policy.

Memory usage
------------

The mq policy uses a lot of memory; 88 bytes per cache block on a 64
bit machine.

SMQ uses 28-bit indexes to implement its data structures rather than
pointers.  It avoids storing an explicit hit count for each block.  It
has a 'hotspot' queue rather than a pre cache which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).

All these mean smq uses ~25 bytes per cache block.  Still a lot of
memory, but a substantial improvement nonetheless.

Level balancing
---------------

MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)).  This means the bottom
levels generally have the most entries, and the top ones have very
few.  Having unbalanced levels like this reduces the efficacy of the
multiqueue.

SMQ does not maintain a hit count.  Instead it swaps hit entries with
the least recently used entry from the level above.  The overall
ordering is a side effect of this stochastic process.  With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.

Adaptability
------------

The MQ policy maintains a hit count for each cache block.  For a
different block to get promoted to the cache its hit count has to
exceed the lowest currently in the cache.  This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.

SMQ doesn't maintain hit counts, so a lot of this problem just goes
away.  In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote.  If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels.  This lets it adapt to new IO patterns very quickly.

Performance
-----------

In my tests SMQ shows substantially better performance than MQ.  Once
this matures a bit more I'm sure it'll become the default policy.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

66a63635
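
The level-balancing idea (swap a hit entry with the least recently used entry from the level above) can be sketched as below. This is not the smq implementation: the real policy uses index-linked queues, whereas this example uses a flat array with a linear scan and a hypothetical last_used stamp, purely to show that level populations stay fixed.

```c
#include <stddef.h>

/* Hypothetical entry: which level it sits in and when it was last hit. */
struct smq_entry_sketch {
    unsigned level;
    unsigned last_used;
};

/*
 * On a hit, swap levels with the least recently used entry of the level
 * above. No hit counts are kept, and the number of entries per level does
 * not change. If the level above is empty the entry stays where it is
 * (an assumption of this sketch).
 */
static void smq_hit_sketch(struct smq_entry_sketch *es, size_t n, size_t hit,
                           unsigned nr_levels, unsigned now)
{
    struct smq_entry_sketch *e = &es[hit];
    struct smq_entry_sketch *victim = NULL;
    size_t i;

    e->last_used = now;
    if (e->level + 1 >= nr_levels)
        return;     /* already in the top level */

    /* Find the least recently used entry in the level above. */
    for (i = 0; i < n; i++) {
        if (es[i].level != e->level + 1)
            continue;
        if (!victim || es[i].last_used < victim->last_used)
            victim = &es[i];
    }

    if (victim) {
        victim->level = e->level;   /* demote the victim ... */
        e->level++;                 /* ... and promote the hit entry */
    }
}
```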

29 May, 2015 20 commits

dm cache: boost promotion of blocks that will be overwritten · 40775257

Joe Thornber authored May 15, 2015

When considering whether to move a block to the cache we already give
preferential treatment to discarded blocks, since they are cheap to
promote (no read of the origin required since the data is junk).

The same is true of blocks that are about to be completely
overwritten, so we likewise boost their promotion chances.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

40775257
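
A rough sketch of the test involved, under stated assumptions: the helper names are hypothetical, sizes are in sectors, and a "complete overwrite" is taken to be a write that starts on a block boundary and is exactly one block long.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: does this write cover one whole cache block? */
static bool writes_complete_block(bool is_write, uint64_t start_sector,
                                  uint64_t nr_sectors,
                                  uint64_t sectors_per_block)
{
    return is_write && sectors_per_block != 0 &&
           start_sector % sectors_per_block == 0 &&
           nr_sectors == sectors_per_block;
}

/*
 * Promotion is cheap when the old contents need not be read from the
 * origin: either the block was discarded or it is about to be fully
 * overwritten.
 */
static bool cheap_to_promote(bool discarded, bool complete_overwrite)
{
    return discarded || complete_overwrite;
}
```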

dm cache: defer whole cells · 651f5fa2

Joe Thornber authored May 15, 2015

Currently individual bios are deferred to the worker thread if they
cannot be processed immediately (eg, a block is in the process of
being moved to the fast device).

This patch passes whole cells across to the worker.  This saves
reacquiring the cell, and also collects bios destined for the same block
together, which allows them to be mapped with a single look up to the
policy.  This reduces the overhead of using dm-cache.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

651f5fa2

dm bio prison: add dm_cell_promote_or_release() · 3cdf93f9

Joe Thornber authored May 15, 2015

Rather than always releasing the prisoners in a cell, the client may
want to promote one of them to be the new holder. There is a race here
though between releasing an empty cell, and other threads adding new
inmates. So this function makes the decision with its lock held.

This function can have two outcomes:
i) An inmate is promoted to be the holder of the cell (return value of 0).
ii) The cell has no inmate for promotion and is released (return value of 1).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

3cdf93f9
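
The two outcomes can be illustrated with a simplified cell. The structure, the fixed-size inmate array and the integer "bio" identifiers are assumptions for the example. The sketch takes no lock itself and assumes the caller already holds the lock that protects the cell, which is the point of making the decision in one function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cell: one holder plus a FIFO of waiting inmates. */
struct cell_sketch {
    int holder;
    int inmates[8];
    size_t nr_inmates;
    bool in_prison;
};

/*
 * Caller is assumed to hold the lock protecting the cell, so no new inmate
 * can be added between the emptiness check and the release.
 * Returns 0 if an inmate was promoted to holder, 1 if the cell was released.
 */
static int cell_promote_or_release(struct cell_sketch *c)
{
    if (c->nr_inmates == 0) {
        c->in_prison = false;   /* nothing to promote: release the cell */
        return 1;
    }

    /* Promote the oldest inmate and shift the rest down. */
    c->holder = c->inmates[0];
    c->nr_inmates--;
    memmove(&c->inmates[0], &c->inmates[1],
            c->nr_inmates * sizeof(c->inmates[0]));
    return 0;
}
```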

dm cache: pull out some bitset utility functions for reuse · 451b9e00

Joe Thornber authored May 15, 2015

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

451b9e00

dm cache: pass a new 'critical' flag to the policies when requesting writeback work · 20f6814b

Joe Thornber authored May 15, 2015

We only allow non critical writeback if the origin is idle.  It is up
to the policy to decide what writeback work is critical.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

20f6814b

dm cache: track IO to the origin device using io_tracker · 066dbaa3

Joe Thornber authored May 15, 2015

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

066dbaa3

dm cache: add io_tracker · 77289d32

Joe Thornber authored May 15, 2015

A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.

FIXME: rather than jiffies, may be best to use ktime_t (to support faster
devices).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

77289d32
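
A minimal sketch of such a tracker, assuming a caller-supplied tick counter in place of jiffies and no locking. The names and the exact idle comparison are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker: volume in flight and when the device went idle. */
struct io_tracker_sketch {
    uint64_t in_flight;     /* sectors currently outstanding */
    uint64_t idle_since;    /* tick at which in_flight last dropped to zero */
};

static void iot_begin(struct io_tracker_sketch *t, uint64_t sectors)
{
    t->in_flight += sectors;
}

static void iot_end(struct io_tracker_sketch *t, uint64_t sectors,
                    uint64_t now)
{
    t->in_flight -= sectors;
    if (t->in_flight == 0)
        t->idle_since = now;
}

/* True if nothing is in flight and at least 'ticks' have passed since then. */
static bool iot_idle_for(const struct io_tracker_sketch *t, uint64_t now,
                         uint64_t ticks)
{
    return t->in_flight == 0 && now - t->idle_since >= ticks;
}
```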

dm cache: fix race when issuing a POLICY_REPLACE operation · fb4100ae

Joe Thornber authored May 20, 2015

There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.

This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.

Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

fb4100ae

dm crypt: add comments to better describe crypto processing logic · 54cea3f6

Milan Broz authored May 15, 2015

A crypto driver can process requests synchronously or asynchronously
and can use an internal driver queue to backlog requests.
Add some comments to clarify internal logic and completion return codes.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

54cea3f6

dm raid1: keep issuing IO after leg failure · ed63287d

Lidong Zhong authored May 13, 2015

Currently if there is a leg failure, the bio will be put into the hold
list until userspace does a remove/replace on the leg.  Doing so in a
cluster config (clvmd) is problematic because there may be a temporary
path failure that results in cluster raid1 remove/replace.  Such
recovery takes a long time due to a full resync.

Update dm-raid1 to optionally ignore these failures so bios continue
being issued without interruption.  To enable this feature userspace
must pass "keep_log" when creating the dm-raid1 device.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Tested-by: Liuhua Wang <lwang@suse.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

ed63287d

dm log writes: use ULL suffix for 64-bit constants · f4ad317a

Geert Uytterhoeven authored Apr 19, 2015

On 32-bit:
drivers/md/dm-log-writes.c: In function ‘log_super’:
drivers/md/dm-log-writes.c:323: warning: integer constant is too large for ‘long’ type

Add a ULL suffix to WRITE_LOG_MAGIC to fix this.
Also add a ULL suffix to WRITE_LOG_VERSION as it's stored in a __le64
field.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f4ad317a
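
The pattern behind the fix can be shown with made-up constants. The values and names below are placeholders, not the ones from dm-log-writes.c, and the endianness conversion a __le64 field needs is left out.

```c
#include <stdint.h>

/*
 * Hypothetical constants. The ULL suffix makes each one unsigned long long
 * on every target, so a value wider than 32 bits does not depend on the
 * width of 'long' (32 bits on many 32-bit targets).
 */
#define SKETCH_MAGIC    0x1122334455667788ULL
#define SKETCH_VERSION  1ULL

struct sketch_super {
    uint64_t magic;
    uint64_t version;
};

static void sketch_super_init(struct sketch_super *s)
{
    s->magic = SKETCH_MAGIC;
    s->version = SKETCH_VERSION;
}
```

Suffixing the small version constant as well keeps both initialisers the same type as the 64-bit fields they are stored in.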

dm stripe: drop useless exit point from dm_stripe_init() · e223e1de

Luis Henriques authored Apr 27, 2015

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

e223e1de

dm raid: add support for the MD RAID0 personality · 0cf45031

Heinz Mauelshagen authored Apr 29, 2015

Add dm-raid access to the MD RAID0 personality to enable single zone
striping.

The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditional in super_validate(), because
  bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
  to allow the raid0 personality to calculate the MD array
  size properly
- use mddev_(un)lock() functions instead of direct mutex_(un)lock()
  (wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
  so that userspace checks for 100% sync will succeed and allow
  for resize (and takeover/reshape once added in future patches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
  that resulted from bio payloads that grew too large.  This problem
  did not occur with the other raid levels because it either did not
  apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0cf45031

dm raid: a few cleanups · c76d53f4

Heinz Mauelshagen authored Apr 29, 2015

- ensure maximum device limit in superblock
- rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags)
  and their respective struct raid_set member
- use strcasecmp() in raid10_format_to_md_layout() as in the constructor
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c76d53f4

dm raid: fixup documentation for discard support · 0f4106b3

Heinz Mauelshagen authored Apr 29, 2015

Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually a module parameter.

Also, backfill dm-raid target version 1.6.0 documentation.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0f4106b3

dm thin metadata: remove in-core 'read_only' flag · 49f154c7

Mike Snitzer authored Apr 23, 2015

Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

49f154c7

dm thin: cleanup schedule_zero() to read more logically · f8ae7525

Mike Snitzer authored May 14, 2015

The overwrite has only ever been about optimizing away the need to zero a
block if the entire block was being overwritten.  As such it is only
relevant when zeroing is enabled.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>

f8ae7525

dm thin: cleanup overwrite's endio restore to be centralized · 8b908f8e

Mike Snitzer authored May 13, 2015

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

8b908f8e

dm: factor out a common cleanup_mapped_device() · 0f20972f

Mike Snitzer authored Apr 28, 2015

Introduce a single common method for cleaning up a DM device's
mapped_device.  No functional change, just eliminates duplication of
delicate mapped_device cleanup code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0f20972f

dm: cleanup methods that requeue requests · 2d76fff1

Mike Snitzer authored Apr 29, 2015

More often than not a request that is requeued _is_ mapped (meaning the
clone request is allocated and clone->q is initialized).  Rename
dm_requeue_unmapped_original_request() to avoid potential confusion due
to function name containing "unmapped".

Also, remove dm_requeue_unmapped_request() since callers can easily call
the dm_requeue_original_request() directly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2d76fff1