Commits · bde14184781bd24ee6fb0e1af8d69ca21acbd6e6 · Kirill Smelkov / linux

17 Jan, 2018 3 commits

dm bufio: add missed destroys of client mutex · bde14184

Aliaksei Karaliou authored Dec 23, 2017

The client's mutex needs to be destroyed in dm_bufio_client_destroy() as
well as the dm_bufio_client_create() error path.
Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

bde14184

dm bufio: use REQ_OP_READ and REQ_OP_WRITE · 905be0a1

Mikulas Patocka authored Dec 02, 2017

Use REQ_OP_READ and REQ_OP_WRITE macros instead of READ and WRITE.  They
have the same value, but the block layer uses REQ_OP so bufio should
too.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

905be0a1

dm: add unstriped target · 18a5bf27

Scott Bauer authored Dec 18, 2017

This device mapper "unstriped" target remaps and unstripes I/O so it
is issued solely on a single drive in a HW RAID0 or dm-striped target.

In a 4 drive HW RAID0 the striped target exposes 1/4th of the LBA range
as a virtual drive.  Each I/O to that virtual drive will only be issued
to the 1 drive that was selected of the 4 drives in the HW RAID0.

This unstriped target is most useful for Intel NVMe drives that have
multiple cores but that do not have firmware control to pin separate LBA
ranges to each discrete cpu core.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

18a5bf27

06 Jan, 2018 2 commits

dm mpath: factor out SCSI vs NVMe path selection · 0001ec56

Mike Snitzer authored Dec 11, 2017

Trying to do both SCSI and NVMe bio-based handling with branching in the
same common code has proven too tedious on a code maintenance level.  In
addition it slightly hurts IO performance.

Fix this by factoring out __map_bio() and __map_bio_nvme().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0001ec56

dm mpath: optimize NVMe bio-based support · 848b8aef

Mike Snitzer authored Dec 10, 2017

All code that deals with pg_init is not used with bio-based NVMe mode.
This includes skipping initialization of pg_init related variables.

Also, pg_init related members on 'struct multipath' have been grouped
together.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

848b8aef

05 Jan, 2018 1 commit

dm mpath: implement NVMe bio-based support · cd025384

Mike Snitzer authored Dec 05, 2017

This DM multipath NVMe bio-based support requires CONFIG_NVME_MULTIPATH
to not be set. In the future hopefully NVMe multipath and DM multipath
can co-exist more seemlessly. But as is, if CONFIG_NVME_MULTIPATH=Y
then all the individal NVMe paths will remain hidden to upper layers and
as such DM multipath will not be able to manage them.

Though NVMe's native multipathing doesn't multipath namespaces across
subsystems; so technically a user _could_ use CONFIG_NVME_MULTIPATH=Y
and also use DM multipath to multipath across subsystems.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

cd025384

03 Jan, 2018 1 commit

dm mpath: move dm_bio_restore out of endio method · 1836df08

Mike Snitzer authored Dec 06, 2017

Moving the dm_bio_restore() to process_queued_bios() avoids doing that
work in multipath_end_io_bio().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

1836df08

20 Dec, 2017 5 commits

dm mpath: optimize retrieval of bio_details from per-bio-data · d07a241d
Mike Snitzer authored Dec 11, 2017
```
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
```
d07a241d

dm mpath: remove unnecessary memset() calls for per-io-data · d0442f80

Mike Snitzer authored Dec 11, 2017

All underlying members are initialized directly so the memset() calls
are not needed.  Also, initialize mpio->nr_bytes from the start since it
never changes.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d0442f80

dm mpath: remove unused param from multipath_init_per_bio_data() · 63f6e6fd
Mike Snitzer authored Dec 05, 2017
```
'struct dm_bio_details *' isn't ever needed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
```
63f6e6fd

dm: optimize bio-based NVMe IO submission · 978e51ba

Mike Snitzer authored Dec 09, 2017

Upper level bio-based drivers that stack immediately ontop of NVMe can
leverage direct_make_request().  In addition DM's NVMe bio-based
will initially only ever have one NVMe device that it submits IO to at a
time.  There is no splitting needed.  Enhance DM core so that
DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
characteristics.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

978e51ba

dm: introduce DM_TYPE_NVME_BIO_BASED · 22c11858

Mike Snitzer authored Dec 04, 2017

If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
all devices in the DM table do not support partial completions.  Also,
the table has a single immutable target that doesn't require DM core to
split bios.

This will enable adding NVMe optimizations to bio-based DM.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

22c11858

17 Dec, 2017 6 commits

dm: simplify start of block stats accounting for bio-based · f3986374

Mike Snitzer authored Dec 17, 2017

No apparent need to generic_start_io_acct() until before the IO is ready
for submission.  start_io_acct() is the proper place to do this
accounting -- it is also where DM accounts for pending IO and, if
enabled, starts dm-stats accounting.

Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
This eliminates needing to take part_stat_lock() multiple times when
starting an IO on bio-based devices.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f3986374

dm: remove redundant mapped_device member from clone_info structure · bc02cdbe

Mike Snitzer authored Dec 14, 2017

'struct dm_io' already has the same pointer.  So update all accesses
from ci->md to ci->io->md.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

bc02cdbe

dm: remove now unused bio-based io_pool and _io_cache · dde1e1ec
Mike Snitzer authored Dec 11, 2017
```
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
```
dde1e1ec

dm: improve performance by moving dm_io structure to per-bio-data · 64f52b0e

Mike Snitzer authored Dec 11, 2017

Eliminates need for a separate mempool to allocate 'struct dm_io'
objects from.  As such, it saves an extra mempool allocation for each
original bio that DM core is issued.

This complicates the per-bio-data accessor functions by needing to
conditonally add extra padding to get to a target's per-bio-data.  But
in the end this provides a decent performance improvement for all
bio-based DM devices.

On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based
DM linear performance improved by 2% (went from 2665 to 2777 MB/s).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

64f52b0e

dm: rename 'bio' member of dm_io structure to 'orig_bio' · 745dc570
Mike Snitzer authored Dec 11, 2017
```
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
```
745dc570

dm: remove stale comment blocks · 2abf1fc9

Mike Snitzer authored Dec 09, 2017

These CRUD comments have worn out their welcome.  The code is what it
is, over time it'll hopefully get better.  But these comments serve no
purpose whatsoever.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2abf1fc9

13 Dec, 2017 15 commits

dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions() · ad3793fc

Mike Snitzer authored Dec 04, 2017

Rather than having DAX support be unique by setting it based on table
type in dm_setup_md_queue().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

ad3793fc

dm: fix __send_changing_extent_only() to send first bio and chain remainder · 3d7f4562

Mike Snitzer authored Dec 08, 2017

__send_changing_extent_only() must follow the same pattern that was
established with commit "dm: ensure bio submission follows a depth-first
tree walk".  That is: submit first bio up to split boundary and then
split the remainder to further submissions.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

3d7f4562

dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs · 0776aa0e

Mike Snitzer authored Dec 08, 2017

alloc_multiple_bios() assumes it can allocate the requested number of
bios but until now there was no gaurantee that the mempools would be
accomodating.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0776aa0e

dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure · 4a3f54d9

Mike Snitzer authored Nov 22, 2017

Now that all of DM has been revised and/or verified to no longer require
the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

4a3f54d9

dm: safely allocate multiple bioset bios · 318716dd

Mike Snitzer authored Nov 22, 2017

DM targets can request multiple bios be sent to them by DM core (see:
num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
bios were allocated in an unsafe manner than could potentially exhaust
the DM device's bioset -- in the face of multiple threads each trying to
do multiple allocations from the same DM device's bioset.

Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
allocation strategy used by alloc_multiple_bios() models that used by
dm-crypt.c:crypt_alloc_buffer().

Neil Brown initially proposed this fix but the implementation has been
revised enough that it inappropriate to attribute the entirety of it to
him.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

318716dd

dm: remove unused 'num_write_bios' target interface · f31c21e4

NeilBrown authored Nov 22, 2017

No DM target provides num_write_bios and none has since dm-cache's
brief use in 2013.

Having the possibility of num_write_bios > 1 complicates bio
allocation.  So remove the interface and assume there is only one bio
needed.

If a target ever needs more, it must provide a suitable bioset and
allocate itself based on its particular needs.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f31c21e4

dm: ensure bio submission follows a depth-first tree walk · 18a25da8

NeilBrown authored Sep 06, 2017

A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.

The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner.  Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.

In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.

This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the
canonical approach.

dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available.  It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bios gets stuck behind the other in the queues
managed by generic_make_request().  This requires the 'rescue'
functionality provided by dm_offload_{start,end}.

Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach.  That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately.  This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.

__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work().  When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.

When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder.  Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference a bio
(result of the first clone) from the same bioset.  This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.

Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio().  This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().

To provide the clone bio when splitting, we use q->bio_split.  This
was previously being freed by bio-based dm to avoid having excess
rescuer threads.  As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

18a25da8

dm io: remove BIOSET_NEED_RESCUER flag from bios bioset · c110a4b6

NeilBrown authored Aug 30, 2017

The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm_io() is called from make_request_fn() context.  The closest it comes
to multiple allocations is in chunk_io() in dm-snap-persistent.  But
there the code uses a separate thread to avoid problems.

So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c110a4b6

dm crypt: remove BIOSET_NEED_RESCUER flag · 80cd1757

NeilBrown authored Aug 30, 2017

The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.

dm-crypt does allocate from this bioset inside the dm make_request_fn,
but does so using GFP_NOWAIT so that the allocation will not block.

So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

80cd1757

dm: fix comment above dm_accept_partial_bio · c06b3e58

NeilBrown authored Nov 21, 2017

Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
bios.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c06b3e58

dm raid: use rs_is_raid*() · 552aa679

Heinz Mauelshagen authored Dec 13, 2017

Cleanup, no functional change.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

552aa679

dm raid: simplify rs_get_progress() · 7c29744e

Heinz Mauelshagen authored Dec 13, 2017

No need to calculate the reshaping progress because
mddev->curr_resync_completed holds it.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

7c29744e

dm raid: ensure 'a' chars during reshape · dc15b943

Heinz Mauelshagen authored Dec 13, 2017

During reshape, 'A' chars were reported in status rather than 'a'.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dc15b943

dm raid: stop keeping raid set frozen altogether · 11e47232

Heinz Mauelshagen authored Dec 13, 2017

In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared. The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.

Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended thus avoiding
to keep the raid set frozen altogether. This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.

Fixes: d39f0010 ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

11e47232

dm raid: validate current raid sets redundancy · 53bf5384

Heinz Mauelshagen authored Dec 13, 2017

Verifying the current raid sets redundancy based on retrieved
superblock content has to use the superblock's raid level (e.g. raid0),
not the constructor requested one (e.g. raid10).

Using the requested raid level of raid10 lead to a "divide error"
on raid0 which defines data copies divided by to be zero.

Also check for bogus data copies.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

53bf5384

08 Dec, 2017 7 commits

dm raid: bump target version to reflect numerous fixes · b84cf269
Mike Snitzer authored Dec 04, 2017
```
Also update Documentation accordingly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
```
b84cf269

dm raid: small cleanup and remove unsed "struct raid_set" member · 78a75d10

Heinz Mauelshagen authored Dec 02, 2017

Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to
mddev_resume().

Also, remove unused 'bitmap_loaded' member from "struct raid_set".

No functional changes.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

78a75d10

dm raid: fix rs_get_progress() synchronization state/ratio · 4102d9de

Heinz Mauelshagen authored Dec 02, 2017

Fix various sync state issues causing racy/bogus sync ratio,
sync_action ad health chars in dm_status() info output.

Sync ratio could be N/N (i.e. 100%) shortly after raid set
creation, i.e. creating a new RaidLV or upconverting a linear LV to
raid1 thus:
  "0 2097152 raid raid1 2 Aa 2097162/2097152 recover 0 0 -"
instead of:
  "0 2097152 raid raid1 2 Aa 0/2097152 idle 0 0 -"

Sync action could be non-idle, when the MD thread was done with io.

Health chars could be 'A' when they should be 'a' for a short time
before a resynchonization started.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

4102d9de

dm raid: avoid passing array_in_sync variable to raid_status() callees · 242ea5ad

Heinz Mauelshagen authored Dec 02, 2017

The raid_status() function passes the bool array_in_sync variable around
providing synchronization state of the MD array.  Replace it with a
runtime flag.  This will avoid a pattern of having to pass discrete
variables to various functions.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

242ea5ad

dm raid: display a consistent copy of the MD status via raid_status() · 67143510

Heinz Mauelshagen authored Dec 02, 2017

The MD sync thread updates recovery flags providing state of any
running, idle, frozen, recovering, reshaping, ... activity it performs
and updates respective flags asynchronously versus dm processing
raid_status().  To close that race window, take a single copy of the
flags and pass it into its callees.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

67143510

dm raid: fix raid_resume() to keep raid set frozen as needed · d39f0010

Heinz Mauelshagen authored Dec 02, 2017

During a reshape request: if userspace reloads a "raid" table multiple
times, resulting in multiple superblock reads, the raid set needs to
stay frozen until all config changes (chunk size, layout data_offset,
delta_disks) have been stored in the superblocks and respective flags
cleared.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

d39f0010

dm raid: add component device size checks to avoid runtime failure · 188a212d

Heinz Mauelshagen authored Dec 02, 2017

Check all component data device sizes versus calculated size.
Reject if device(s) are too small.  Otherwise, MD will fail the
operation by accessing beyond the end of the data device.

An example use-case is that growing bitmap won't fit any more and the MD
runtime will report an error when DM raid should catch this earlier.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

188a212d