Commits · 1836df0891423dcf2b68771e04b28e2208ee95f2 · Kirill Smelkov / linux

03 Jan, 2018 1 commit

dm mpath: move dm_bio_restore out of endio method · 1836df08

Mike Snitzer authored Dec 06, 2017

Moving the dm_bio_restore() to process_queued_bios() avoids doing that
work in multipath_end_io_bio().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

1836df08

20 Dec, 2017 5 commits

dm mpath: optimize retrieval of bio_details from per-bio-data · d07a241d

Mike Snitzer authored Dec 11, 2017

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d07a241d

dm mpath: remove unnecessary memset() calls for per-io-data · d0442f80

Mike Snitzer authored Dec 11, 2017

All underlying members are initialized directly so the memset() calls
are not needed.  Also, initialize mpio->nr_bytes from the start since it
never changes.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d0442f80

dm mpath: remove unused param from multipath_init_per_bio_data() · 63f6e6fd

Mike Snitzer authored Dec 05, 2017

'struct dm_bio_details *' isn't ever needed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

63f6e6fd

dm: optimize bio-based NVMe IO submission · 978e51ba

Mike Snitzer authored Dec 09, 2017

Upper level bio-based drivers that stack immediately on top of NVMe can
leverage direct_make_request().  In addition DM's NVMe bio-based support
will initially only ever have one NVMe device that it submits IO to at a
time.  There is no splitting needed.  Enhance DM core so that
DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
characteristics.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

978e51ba

dm: introduce DM_TYPE_NVME_BIO_BASED · 22c11858

Mike Snitzer authored Dec 04, 2017

If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
all devices in the DM table do not support partial completions.  Also,
the table has a single immutable target that doesn't require DM core to
split bios.

This will enable adding NVMe optimizations to bio-based DM.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

22c11858

17 Dec, 2017 6 commits

dm: simplify start of block stats accounting for bio-based · f3986374

Mike Snitzer authored Dec 17, 2017

There is no apparent need to call generic_start_io_acct() until just
before the IO is ready for submission.  start_io_acct() is the proper
place to do this
accounting -- it is also where DM accounts for pending IO and, if
enabled, starts dm-stats accounting.

Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
This eliminates needing to take part_stat_lock() multiple times when
starting an IO on bio-based devices.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f3986374

dm: remove redundant mapped_device member from clone_info structure · bc02cdbe

Mike Snitzer authored Dec 14, 2017

'struct dm_io' already has the same pointer.  So update all accesses
from ci->md to ci->io->md.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

bc02cdbe

dm: remove now unused bio-based io_pool and _io_cache · dde1e1ec

Mike Snitzer authored Dec 11, 2017

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dde1e1ec

dm: improve performance by moving dm_io structure to per-bio-data · 64f52b0e

Mike Snitzer authored Dec 11, 2017

Eliminates need for a separate mempool to allocate 'struct dm_io'
objects from.  As such, it saves an extra mempool allocation for each
original bio that DM core is issued.

This complicates the per-bio-data accessor functions by needing to
conditionally add extra padding to get to a target's per-bio-data.  But
in the end this provides a decent performance improvement for all
bio-based DM devices.

On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based
DM linear performance improved by 2% (went from 2665 to 2777 MB/s).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

64f52b0e
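The "conditionally add extra padding" remark can be pictured with a small userspace model. This is only a sketch under assumptions: `toy_bio`, `toy_tio`, `toy_io`, `toy_per_bio_data()` and `toy_alloc()` are hypothetical names, and the layout (target data placed in front of the embedding structure) is a simplification, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's structures. */
struct toy_bio { int sector; };

struct toy_tio {
	bool inside_io;          /* true when this tio is embedded in a toy_io */
	struct toy_bio clone;    /* the bio handed to the target */
};

struct toy_io {
	int pending;
	struct toy_tio tio;      /* the first tio lives inside the io itself */
};

/* Walk back from the clone bio to the per-bio-data placed in front of it. */
static void *toy_per_bio_data(struct toy_bio *bio, size_t data_size)
{
	struct toy_tio *tio =
		(struct toy_tio *)((char *)bio - offsetof(struct toy_tio, clone));
	size_t back = offsetof(struct toy_tio, clone) + data_size;

	if (tio->inside_io)
		back += offsetof(struct toy_io, tio);   /* the extra padding case */
	return (char *)bio - back;
}

/*
 * One allocation holds [per-bio-data][toy_io or toy_tio].  Returns the
 * clone bio and reports where the allocation (the per-bio-data) starts.
 */
static struct toy_bio *toy_alloc(size_t data_size, bool with_io, void **base)
{
	size_t front = with_io ? offsetof(struct toy_io, tio) : 0;
	struct toy_tio *tio;
	char *mem;

	/* Keep the structure behind the data suitably aligned. */
	assert(data_size % _Alignof(max_align_t) == 0);
	mem = calloc(1, data_size + sizeof(struct toy_io));
	if (!mem)
		return NULL;
	tio = (struct toy_tio *)(mem + data_size + front);
	tio->inside_io = with_io;
	*base = mem;
	return &tio->clone;
}
```

In this model a single allocation serves both the io bookkeeping and the target's data, and the accessor has to know which of the two layouts it is looking at.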

dm: rename 'bio' member of dm_io structure to 'orig_bio' · 745dc570

Mike Snitzer authored Dec 11, 2017

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

745dc570

dm: remove stale comment blocks · 2abf1fc9

Mike Snitzer authored Dec 09, 2017

These CRUD comments have worn out their welcome.  The code is what it
is, over time it'll hopefully get better.  But these comments serve no
purpose whatsoever.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2abf1fc9

13 Dec, 2017 15 commits

dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions() · ad3793fc

Mike Snitzer authored Dec 04, 2017

Do this rather than treating DAX support as a special case that is set
based on table type in dm_setup_md_queue().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

ad3793fc

dm: fix __send_changing_extent_only() to send first bio and chain remainder · 3d7f4562

Mike Snitzer authored Dec 08, 2017

__send_changing_extent_only() must follow the same pattern that was
established with commit "dm: ensure bio submission follows a depth-first
tree walk".  That is: submit first bio up to split boundary and then
split the remainder to further submissions.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

3d7f4562

dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs · 0776aa0e

Mike Snitzer authored Dec 08, 2017

alloc_multiple_bios() assumes it can allocate the requested number of
bios but until now there was no guarantee that the mempools would be
accommodating.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0776aa0e

dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure · 4a3f54d9

Mike Snitzer authored Nov 22, 2017

Now that all of DM has been revised and/or verified to no longer require
the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

4a3f54d9

dm: safely allocate multiple bioset bios · 318716dd

Mike Snitzer authored Nov 22, 2017

DM targets can request multiple bios be sent to them by DM core (see:
num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
bios were allocated in an unsafe manner that could potentially exhaust
the DM device's bioset -- in the face of multiple threads each trying to
do multiple allocations from the same DM device's bioset.

Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
allocation strategy used by alloc_multiple_bios() models that used by
dm-crypt.c:crypt_alloc_buffer().

Neil Brown initially proposed this fix but the implementation has been
revised enough that it is inappropriate to attribute the entirety of it
to him.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

318716dd
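The allocation strategy described in the commit above can be modelled in userspace as two passes: an opportunistic pass that never waits and hands everything back on partial failure, then a serialized pass that is allowed to wait. The sketch below is an assumption-laden toy, not the kernel code: `toy_pool`, `pool_get()`, `pool_put()` and `alloc_multiple()` are hypothetical names, the `locked` flag merely stands in for a mutex, and "waiting" is modelled by a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy pool; all names here are hypothetical. */
struct toy_pool {
	unsigned avail;   /* objects that can be handed out without waiting */
	unsigned waits;   /* how many times a caller had to "block" */
	bool locked;      /* stands in for the serializing mutex */
};

/*
 * A non-blocking get fails when the pool is empty.  A blocking get
 * "waits" until some other holder returns an object, which this model
 * represents by counting the wait and then succeeding.
 */
static bool pool_get(struct toy_pool *p, bool may_wait)
{
	if (p->avail) {
		p->avail--;
		return true;
	}
	if (!may_wait)
		return false;
	p->waits++;
	return true;
}

static void pool_put(struct toy_pool *p)
{
	p->avail++;
}

/* Returns the pass (0 or 1) on which all 'num' objects were obtained. */
static int alloc_multiple(struct toy_pool *p, unsigned num)
{
	for (int pass = 0; pass < 2; pass++) {
		unsigned got;

		if (pass) {
			assert(!p->locked);   /* only one waiter at a time */
			p->locked = true;
		}
		for (got = 0; got < num; got++)
			if (!pool_get(p, pass != 0))
				break;
		if (pass)
			p->locked = false;
		if (got == num)
			return pass;
		/* Partial set: give everything back before retrying. */
		while (got--)
			pool_put(p);
	}
	return -1;   /* not reached: the blocking pass always completes */
}
```

The point of releasing a partial set is that no caller sleeps while holding some of the objects another caller needs; only the single serialized waiter may block.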

dm: remove unused 'num_write_bios' target interface · f31c21e4

NeilBrown authored Nov 22, 2017

No DM target provides num_write_bios and none has since dm-cache's
brief use in 2013.

Having the possibility of num_write_bios > 1 complicates bio
allocation.  So remove the interface and assume there is only one bio
needed.

If a target ever needs more, it must provide a suitable bioset and
allocate itself based on its particular needs.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f31c21e4

dm: ensure bio submission follows a depth-first tree walk · 18a25da8

NeilBrown authored Sep 06, 2017

A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.

The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner.  Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.

In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.

This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the
canonical approach.

dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available.  It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bio gets stuck behind the other in the queues
managed by generic_make_request().  This requires the 'rescue'
functionality provided by dm_offload_{start,end}.

Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach.  That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately.  This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.

__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work().  When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.

When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder.  Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference to a bio
(result of the first clone) from the same bioset.  This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.

Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio().  This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().

To provide the clone bio when splitting, we use q->bio_split.  This
was previously being freed by bio-based dm to avoid having excess
rescuer threads.  As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

18a25da8
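The ordering rule from the commit above (map the head that fits in one target, send the remainder back to the submission queue) can be sketched as a small userspace simulation. Everything here is hypothetical: `toy_bio`, `toy_queue`, `handle_one()`, `submit()` and the fixed `TARGET_LEN` are illustration only, and the model ignores biosets, cloning and completion accounting.

```c
#include <assert.h>
#include <stddef.h>

#define TARGET_LEN 8u   /* hypothetical: every target maps 8 sectors */
#define QUEUE_MAX  16

struct toy_bio { unsigned start, len; };

/* Simple FIFO standing in for the list of pending submissions. */
struct toy_queue {
	struct toy_bio item[QUEUE_MAX];
	size_t head, tail;
};

static void queue_push(struct toy_queue *q, struct toy_bio b)
{
	assert(q->tail < QUEUE_MAX);
	q->item[q->tail++] = b;
}

/*
 * Handle one queued bio: map only the head that fits in a single target
 * and requeue the rest instead of processing it right away.
 */
static struct toy_bio handle_one(struct toy_queue *q)
{
	struct toy_bio bio = q->item[q->head++];
	unsigned room = TARGET_LEN - bio.start % TARGET_LEN;

	if (bio.len > room) {
		struct toy_bio rest = { bio.start + room, bio.len - room };

		bio.len = room;
		queue_push(q, rest);   /* remainder goes back to the queue */
	}
	return bio;   /* this piece is handled by exactly one target */
}

/* Drain the queue, recording pieces in the order they were mapped. */
static size_t submit(struct toy_bio bio, struct toy_bio *out, size_t max)
{
	struct toy_queue q = { .head = 0, .tail = 0 };
	size_t n = 0;

	queue_push(&q, bio);
	while (q.head < q.tail && n < max)
		out[n++] = handle_one(&q);
	return n;
}
```

Because each call maps at most one piece, a later piece is never issued from inside the same call that issued an earlier one, which is the property the commit message relies on to avoid the deadlock.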

dm io: remove BIOSET_NEED_RESCUER flag from bios bioset · c110a4b6

NeilBrown authored Aug 30, 2017

The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm_io() is called from make_request_fn() context.  The closest it comes
to multiple allocations is in chunk_io() in dm-snap-persistent.  But
there the code uses a separate thread to avoid problems.

So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c110a4b6

dm crypt: remove BIOSET_NEED_RESCUER flag · 80cd1757

NeilBrown authored Aug 30, 2017

The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm-crypt does allocate from this bioset inside the dm make_request_fn,
but does so using GFP_NOWAIT so that the allocation will not block.

So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

80cd1757

dm: fix comment above dm_accept_partial_bio · c06b3e58

NeilBrown authored Nov 21, 2017

Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
bios.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c06b3e58

dm raid: use rs_is_raid*() · 552aa679

Heinz Mauelshagen authored Dec 13, 2017

Cleanup, no functional change.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

552aa679

dm raid: simplify rs_get_progress() · 7c29744e

Heinz Mauelshagen authored Dec 13, 2017

No need to calculate the reshaping progress because
mddev->curr_resync_completed holds it.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

7c29744e

dm raid: ensure 'a' chars during reshape · dc15b943

Heinz Mauelshagen authored Dec 13, 2017

During reshape, 'A' chars were reported in status rather than 'a'.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dc15b943

dm raid: stop keeping raid set frozen altogether · 11e47232

Heinz Mauelshagen authored Dec 13, 2017

In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared. The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.

Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended, which avoids
keeping the raid set frozen altogether. This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.

Fixes: d39f0010 ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

11e47232

dm raid: validate current raid sets redundancy · 53bf5384

Heinz Mauelshagen authored Dec 13, 2017

Verifying the current raid set's redundancy based on retrieved
superblock content has to use the superblock's raid level (e.g. raid0),
not the constructor requested one (e.g. raid10).

Using the requested raid level of raid10 led to a "divide error"
on raid0, which defines the data copies used as the divisor to be zero.

Also check for bogus data copies.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

53bf5384

08 Dec, 2017 13 commits

dm raid: bump target version to reflect numerous fixes · b84cf269

Mike Snitzer authored Dec 04, 2017

Also update Documentation accordingly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

b84cf269

dm raid: small cleanup and remove unsed "struct raid_set" member · 78a75d10

Heinz Mauelshagen authored Dec 02, 2017

Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to
mddev_resume().

Also, remove unused 'bitmap_loaded' member from "struct raid_set".

No functional changes.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

78a75d10

dm raid: fix rs_get_progress() synchronization state/ratio · 4102d9de

Heinz Mauelshagen authored Dec 02, 2017

Fix various sync state issues causing racy/bogus sync ratio,
sync_action and health chars in dm_status() info output.

Sync ratio could be N/N (i.e. 100%) shortly after raid set
creation, i.e. creating a new RaidLV or upconverting a linear LV to
raid1 thus:
  "0 2097152 raid raid1 2 Aa 2097152/2097152 recover 0 0 -"
instead of:
  "0 2097152 raid raid1 2 Aa 0/2097152 idle 0 0 -"

Sync action could be non-idle, when the MD thread was done with io.

Health chars could be 'A' when they should be 'a' for a short time
before a resynchronization started.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

4102d9de

dm raid: avoid passing array_in_sync variable to raid_status() callees · 242ea5ad

Heinz Mauelshagen authored Dec 02, 2017

The raid_status() function passes the bool array_in_sync variable around
providing synchronization state of the MD array.  Replace it with a
runtime flag.  This will avoid a pattern of having to pass discrete
variables to various functions.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

242ea5ad

dm raid: display a consistent copy of the MD status via raid_status() · 67143510

Heinz Mauelshagen authored Dec 02, 2017

The MD sync thread updates recovery flags providing state of any
running, idle, frozen, recovering, reshaping, ... activity it performs
and updates respective flags asynchronously versus dm processing
raid_status().  To close that race window, take a single copy of the
flags and pass it into its callees.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

67143510

dm raid: fix raid_resume() to keep raid set frozen as needed · d39f0010

Heinz Mauelshagen authored Dec 02, 2017

During a reshape request: if userspace reloads a "raid" table multiple
times, resulting in multiple superblock reads, the raid set needs to
stay frozen until all config changes (chunk size, layout, data_offset,
delta_disks) have been stored in the superblocks and respective flags
cleared.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d39f0010

dm raid: add component device size checks to avoid runtime failure · 188a212d

Heinz Mauelshagen authored Dec 02, 2017

Check all component data device sizes versus calculated size.
Reject if device(s) are too small.  Otherwise, MD will fail the
operation by accessing beyond the end of the data device.

An example is a growing bitmap that no longer fits: the MD runtime
reports an error, when DM raid should have caught this earlier.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

188a212d

dm raid: fix raid set size revalidation · 61e06e2c

Heinz Mauelshagen authored Dec 02, 2017

The raid set size is being revalidated unconditionally before a
reshaping conversion is started.  MD requires the size to only be
reduced in case of a stripe removing (i.e. shrinking) reshape but not
when growing because the raid array has to stay small until after the
growing reshape finishes.

Fix by avoiding the size revalidation in preresume unless a shrinking
reshape is requested.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

61e06e2c

dm raid: correct resizing state relative to reshape space in ctr · 7501537e

Heinz Mauelshagen authored Dec 02, 2017

Pay attention to existing reshape space to define if a raid set needs
resizing.  Otherwise we can hit "Can't resize a reshaping raid set"
when a reshape is being requested.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

7501537e

dm raid: consume sizes after md_finish_reshape() completes changing them · 052b2b1e

Heinz Mauelshagen authored Dec 02, 2017

The md raid personalities call md_finish_reshape() at the end of a
reshape conversion which adjusts rdev->sectors.

Correct/check rdev->sectors before initiating a reshape and raise the
recovery pointer accordingly.

Otherwise, the DM raid coordinated reshape will fail.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

052b2b1e

dm raid: fix deadlock caused by premature md_stop_writes() · 1af2048a

Heinz Mauelshagen authored Dec 02, 2017

md_stop_writes() is called in raid_presuspend() causing deadlocks on
bios submitted afterwards -- which happens on loaded raid sets with
conversion requests.

Fix by moving md_stop_writes() to raid_postsuspend().  NOTE: when the
recovery's frozen (MD_RECOVERY_FROZEN), writes haven't been started (or
are already stopped) so don't stop them again.

Also remove superfluous readonly setting.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

1af2048a

dm bufio: fix shrinker scans when (nr_to_scan < retain_target) · fbc7c07e

Suren Baghdasaryan authored Dec 06, 2017

When the system is under memory pressure, the dm bufio shrinker is
observed to often reclaim only one buffer per scan. This change fixes
the following two issues in the dm bufio shrinker that cause this behavior:

1. ((nr_to_scan - freed) <= retain_target) condition is used to
terminate slab scan process. This assumes that nr_to_scan is equal
to the LRU size, which might not be correct because do_shrink_slab()
in vmscan.c calculates nr_to_scan using multiple inputs.
As a result when nr_to_scan is less than retain_target (64) the scan
will terminate after the first iteration, effectively reclaiming one
buffer per scan and making scans very inefficient. This hurts vmscan
performance especially because mutex is acquired/released every time
dm_bufio_shrink_scan() is called.
New implementation uses ((LRU size - freed) <= retain_target)
condition for scan termination. LRU size can be safely determined
inside __scan() because this function is called after dm_bufio_lock().

2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to
determine number of freeable objects in the slab. However dm_bufio
always retains retain_target buffers in its LRU and will terminate
a scan when this mark is reached. Therefore returning the entire LRU size
from dm_bufio_shrink_count() is misleading because that does not
represent the number of freeable objects that slab will reclaim during
a scan. Returning (LRU size - retain_target) better represents the
number of freeable objects in the slab. This way do_shrink_slab()
returns 0 when (LRU size < retain_target) and vmscan will not try to
scan this shrinker avoiding scans that will not reclaim any memory.

Test: tested using Android device running
<AOSP>/system/extras/alloc-stress that generates memory pressure
and causes intensive shrinker scans
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

fbc7c07e
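The two shrinker fixes described above can be illustrated with a small userspace model. It is a sketch under assumptions: `toy_shrink_count()`, `toy_scan()`, `toy_scan_old()` and `toy_scan_new()` are hypothetical names, every eviction is assumed to succeed, and `RETAIN_TARGET` is fixed at the value 64 mentioned in the message.

```c
#include <assert.h>

#define RETAIN_TARGET 64ul   /* value quoted in the commit message */

/* Freeable objects: the retained buffers are never counted. */
static unsigned long toy_shrink_count(unsigned long lru_size)
{
	return lru_size > RETAIN_TARGET ? lru_size - RETAIN_TARGET : 0;
}

/*
 * Common scan loop.  'count' is the quantity compared against the
 * retain target; the old and new behaviour differ only in what is
 * passed for it.  Every eviction is assumed to succeed.
 */
static unsigned long toy_scan(unsigned long lru_size,
			      unsigned long nr_to_scan,
			      unsigned long count)
{
	unsigned long freed = 0;

	/* Guarantees that (count - freed) below cannot wrap around. */
	assert(count >= nr_to_scan || count >= lru_size);

	while (freed < lru_size && nr_to_scan) {
		freed++;
		nr_to_scan--;
		if (count - freed <= RETAIN_TARGET)
			break;
	}
	return freed;
}

/* Old rule: terminate based on nr_to_scan. */
static unsigned long toy_scan_old(unsigned long lru_size,
				  unsigned long nr_to_scan)
{
	return toy_scan(lru_size, nr_to_scan, nr_to_scan);
}

/* New rule: terminate based on the actual LRU size. */
static unsigned long toy_scan_new(unsigned long lru_size,
				  unsigned long nr_to_scan)
{
	return toy_scan(lru_size, nr_to_scan, lru_size);
}
```

With an LRU of 1000 buffers and a request to scan 32, the old rule stops after a single buffer in this model, while the new rule frees all 32; the count function reports nothing freeable once the LRU is at or below the retain target.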

dm mpath: fix bio-based multipath queue_if_no_path handling · c1fd0abe

Mike Snitzer authored Dec 07, 2017

Commit ca5beb76 ("dm mpath: micro-optimize the hot path relative to
MPATHF_QUEUE_IF_NO_PATH") caused bio-based DM-multipath to fail mptest's
"test_02_sdev_delete".

Restoring the logic that existed prior to commit ca5beb76 fixes this
bio-based DM-multipath regression.  Also verified all mptest tests pass
with request-based DM-multipath.

This commit effectively reverts commit ca5beb76 -- but it does so
without reintroducing the need to take the m->lock spinlock in
must_push_back_{rq,bio}.

Fixes: ca5beb76 ("dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c1fd0abe