Commits · 3627896a4b12ea6bb9e0ff77724a24f53726db2d · Kirill Smelkov / linux

13 Oct, 2017 22 commits

lightnvm: pblk: use constant for GC max inflight · 3627896a

Javier González authored Oct 13, 2017

Use a constant to set the maximum number of inflight GC requests
allowed.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: remove checks on mempool alloc. · 2942f50f

Javier González authored Oct 13, 2017

As part of the mempool audit on pblk, remove unnecessary checks on
mempool allocations.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: do not use a mempool for line bitmaps · e72ec1d3

Javier González authored Oct 13, 2017

pblk holds two sector bitmaps: one to keep track of the mapped sectors
while the line is active and another one to keep track of the invalid
sectors. The latter is kept during the whole life of the line, until it
is recycled. Since we cannot guarantee forward progress for the mempool
in this case, get rid of the mempool and simply allocate memory through
kmalloc.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: decouple read/erase mempools · 0d880398

Javier González authored Oct 13, 2017

Since read and erase paths offer different guarantees for inflight I/Os,
separate the mempools to set the right min_nr for each on creation.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: simplify work_queue mempool · b84ae4a8

Javier González authored Oct 13, 2017

In pblk, we have a mempool to allocate a generic structure that we
pass along workqueues. This is heavily used in the GC path in order
to have enough inflight reads and fully utilize the GC bandwidth.

However, the current GC path copies data to the host memory and puts it
back into the write buffer. This requires a vmalloc allocation for the
data and a memory copy. Thus, guaranteeing the allocation by using a
mempool for the structure in itself does not give us much. Until we
implement support for vector copy to avoid moving data through the host,
just allocate the workqueue structure using kmalloc.

This allows us to have a much smaller mempool.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: fix min size for page mempool · bd432417

Javier González authored Oct 13, 2017

pblk uses an internal page mempool for allocating pages on internal
bios. The main two users of this memory pool are partial reads (reads
with some sectors in cache and some on media) and padded writes, which
need to add dummy pages to an existing bio already containing valid
data (and with a large enough bioset allocated). In both cases, the
maximum number of pages per bio is defined by the maximum number of
physical sectors supported by the underlying device.

This patch fixes a bad mempool allocation, where the min_nr of elements
on the pool was fixed (to 16), which is lower than the maximum number
of sectors supported by NVMe (as of the time for this patch). Instead,
use the maximum number of allowed sectors reported by the device.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: avoid deadlock on low LUN config · da67e68f

Javier González authored Oct 13, 2017

On low LUN configurations, make sure not to send bios that are bigger
than the buffer size.

Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: fix write I/O sync stat · e0e12a70

Javier González authored Oct 13, 2017

Fix stat counter to collect the right number of I/Os being synced on the
completion path.

Fixes: 0880a9aa ("lightnvm: pblk: delete redundant buffer pointer")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: free padded entries in write buffer · cd8ddbf7

Javier González authored Oct 13, 2017

When a REQ_FLUSH reaches pblk, the bio cannot be directly completed.
Instead, data on the write buffer is flushed and the bio is completed on
the completion path. This might require some sectors to be padded in
order to guarantee a successful write.

This patch fixes a memory leak on the padded pages. A consequence of
this bad free was that internal bios not containing data (only a flush)
were not being completed.

Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: use right flag for GC allocation · 7d327a9e

Javier González authored Oct 13, 2017

The data buffer for the GC path allocates virtual memory through
vmalloc. When this change was introduced, a flag signaling kmalloc'ed
memory was wrongly introduced. Use the right flag when creating a bio
from this buffer.

Fixes: de54e703 ("lightnvm: pblk: use vmalloc for GC data buffer")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: initialize debug stat counter · a1121176

Javier González authored Oct 13, 2017

Initialize the stat counter for garbage collected reads.

Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: reuse pblk_gc_should_kick · 32825ebb

Rakesh Pandit authored Oct 13, 2017

This is a trivial change which reuses pblk_gc_should_kick instead of
repeating it again in pblk_rl_free_lines_inc.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Made it apply to the common case.
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: print incompatible line version correctly · c79819bc

Rakesh Pandit authored Oct 13, 2017

Correct it by converting little endian to cpu endian and also define
a macro for line version so that maintenance is easy.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
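
The conversion described above can be illustrated with a small userspace sketch. It assumes, purely for illustration, a 16-bit version field stored little endian; the names sketch_le16_to_cpu, sketch_version_ok and SKETCH_LINE_VERSION (and its value) are hypothetical and are not taken from the pblk sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical expected on-media line version (value chosen for the example). */
#define SKETCH_LINE_VERSION 2

/* Decode a little-endian 16-bit value byte by byte, independent of host endianness. */
uint16_t sketch_le16_to_cpu(const uint8_t raw[2])
{
	assert(raw != NULL);
	return (uint16_t)(raw[0] | ((uint16_t)raw[1] << 8));
}

/* Compare in CPU byte order, so mismatches are checked and printed correctly. */
int sketch_version_ok(const uint8_t raw[2])
{
	return sketch_le16_to_cpu(raw) == SKETCH_LINE_VERSION;
}
```

Comparing the raw on-media bytes directly would only work on a little-endian host, which is why the value is converted before it is checked or printed.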

lightnvm: pblk: improve error message if down_timeout fails · c5493845

Rakesh Pandit authored Oct 13, 2017

The two pr_err messages are useless as they don't differentiate
error code.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: fix message if L2P MAP is in device · 4e76af53

Rakesh Pandit authored Oct 13, 2017

This usually happens if we are developing with qemu and ll2pmode has
default value. Improve description.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: protect line bitmap while submitting meta io · e57903fd

Rakesh Pandit authored Oct 13, 2017

It seems pblk_dealloc_page would race against pblk_alloc_pages for
line bitmap for sector allocation. The chances are very low but might
as well protect the bitmap properly.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: include NVM Express driver if OCSSD is selected for build · 32c662c5

Rakesh Pandit authored Oct 13, 2017

Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM
in config file before building kernel.  Also append PCI to depends as
select doesn't automatically add dependencies.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: pblk: fix error path in pblk_lines_alloc_metadata · c9d84b35

Rakesh Pandit authored Oct 13, 2017

Use appropriate memory free calls based on allocation type used and
also fix number of times free is called if kmalloc fails.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: remove already calculated nr_chnls · a96d50fa

Rakesh Pandit authored Oct 13, 2017

Remove repeated calculation for number of channels while creating a
target device.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: protect target type list with correct locks · 88d31ea2

Rakesh Pandit authored Oct 13, 2017

nvm_tgt_types list was protected by wrong lock for NVM_INFO ioctl call
and can race with addition or removal of target types.  Also
unregistering target type was not protected correctly.

Fixes: 5cd90785 ("lightnvm: remove nested lock conflict with mm")
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

lightnvm: prevent bd removal if busy · bb6aa6f0

Rakesh Pandit authored Oct 13, 2017

When a virtual block device is formatted and mounted after creating
with "nvme lnvm create... -t pblk", a removal from "nvme lnvm remove"
would result in this:

[446416.309757] bdi-block not registered
[446416.309773] ------------[ cut here ]------------
[446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159
  __mark_inode_dirty+0x268/0x340

Ideally removal should return -EBUSY, as the block device is mounted
after formatting.  This patch addresses this by checking, before
removal, whether the whole device or any partition of it is mounted.

The whole device is checked using the "bd_super" member of the block
device.  This member is always set once the block device has been
mounted using a filesystem.  Another member, "bd_part_count", takes
care of checking whether any partitions are in use.  "bd_part_count"
is only updated under locks when partitions are opened or closed
(first open and last release).  This at least takes care of returning
-EBUSY if removal is attempted while the whole block device or any
partition is mounted.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
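
The busy check described above can be modelled in userspace roughly as follows. This is a minimal sketch, not the patch itself: struct sketch_bdev is a hypothetical stand-in that keeps only the two members named in the message, and sketch_remove_allowed is an invented helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for a block device. */
struct sketch_bdev {
	const void *bd_super;	/* non-NULL once a filesystem is mounted on the whole device */
	int bd_part_count;	/* partitions currently held open */
};

/* Return 0 if the target may be removed, -EBUSY if the device is in use. */
int sketch_remove_allowed(const struct sketch_bdev *bdev)
{
	assert(bdev != NULL);

	if (bdev->bd_super != NULL)	/* whole device mounted */
		return -EBUSY;
	if (bdev->bd_part_count > 0)	/* some partition in use */
		return -EBUSY;
	return 0;
}
```

In the kernel the counter is read under the appropriate lock, as the message notes; the sketch leaves locking out.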

lightnvm: prevent target type module removal when in use · 90014829

Rakesh Pandit authored Oct 13, 2017

If a target type module (e.g. pblk) is unloaded (rmmod) while it is
in use (after creating a target), the system crashes.  We fix this by
using the module API refcnt.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

12 Oct, 2017 2 commits

mtip32xx: Clean up unused variables · 47bc227d

Christos Gkekas authored Oct 12, 2017

Remove variables that are set but never used.
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs/block_dev: remove vfs_msg() interface · 7f66721a

Rakesh Pandit authored Oct 12, 2017

Replaced by pr_err usage in commit ef510424 ("block, dax: move
"select DAX" from BLOCK to FS_DAX")
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

10 Oct, 2017 5 commits

block: set request_list for request · 85acb3ba

Shaohua Li authored Oct 06, 2017

Legacy queue sets request's request_list, mq doesn't. This makes mq do
the same thing, so we can find the cgroup of a request. Note, we really
only use the blkg field of request_list, so it's pointless to allocate
a mempool for request_list in the mq case.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-stat: delete useless code · eca8b53a

Shaohua Li authored Oct 06, 2017

Fix two issues:
- the per-cpu stat flush is unnecessary, nobody uses per-cpu stat except
  to sum it into the global stat. We can do the calculation there. The
  flush just wastes cpu time.
- some fields are signed int/s64. I don't see the point.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-throttle: fix null pointer dereference while throttling writeback IOs · 53cfdc10

Jiufei Xue authored Oct 10, 2017

A null pointer dereference can occur when blkcg is removed manually
with writeback IOs inflight. This is caused by the following case:

The writeback kworker submits the bio and sets bio->bi_cg_private to tg
in blk_throtl_assoc_bio.
Then we remove the block cgroup manually; the blkg and tg are
freed if there is no request inflight.
When the submitted bio comes back, blk_throtl_bio_endio() fetches the
tg, which was already freed.

Fix this by increasing the refcount of blkg in function
blk_throtl_assoc_bio() so that the blkg will not be freed until
bio_endio is called.
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
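
The lifetime rule behind this fix can be shown with a small userspace model. It is a sketch under stated assumptions: the sketch_ types and functions are hypothetical, a plain integer stands in for the blkg reference count, and "freed" is a flag rather than a real free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the throttle group and the bio. */
struct sketch_group {
	int refcnt;
	bool freed;
};

struct sketch_bio {
	struct sketch_group *cg_private;
};

void sketch_group_get(struct sketch_group *g)
{
	g->refcnt++;
}

void sketch_group_put(struct sketch_group *g)
{
	assert(g->refcnt > 0);
	if (--g->refcnt == 0)
		g->freed = true;	/* the last reference is gone */
}

/* Associating the bio takes a reference so the group outlives the I/O. */
void sketch_assoc_bio(struct sketch_group *g, struct sketch_bio *bio)
{
	sketch_group_get(g);
	bio->cg_private = g;
}

/* Completion uses the group, then drops the reference taken at association. */
void sketch_bio_endio(struct sketch_bio *bio)
{
	struct sketch_group *g = bio->cg_private;

	bio->cg_private = NULL;
	sketch_group_put(g);
}
```

With the extra reference, removing the cgroup while the bio is inflight only drops the owner's reference; the group is released when the completion path drops the last one.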

blkcg: check pol->cpd_free_fn before free cpd · 58a9edce

weiping zhang authored Oct 10, 2017

Check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when freeing cpd.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

writeback: merge try_to_writeback_inodes_sb_nr() into caller · 8264c321

Rakesh Pandit authored Oct 09, 2017

Since commit 925a6efb ("Btrfs: stop using
try_to_writeback_inodes_sb_nr to flush delalloc") this function hasn't
been used outside so stop exporting it.

In addition we merge it into try_to_writeback_inodes_sb() which is the
only caller.  Also change return type of try_to_writeback_inodes_sb to
void as the only user ext4 doesn't care.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09 Oct, 2017 3 commits

block, bfq: fix unbalanced decrements of burst size · 99fead8d

Paolo Valente authored Oct 09, 2017

The commit "block, bfq: decrease burst size when queues in burst
exit" introduced the decrement of burst_size on the removal of a
bfq_queue from the burst list. Unfortunately, this decrement can
happen to be performed even when burst size is already equal to 0,
because of unbalanced decrements. A description follows of the cause
of these unbalanced decrements, namely a wrong assumption, and of the
way how this wrong assumption leads to unbalanced decrements.

The wrong assumption is that a bfq_queue can exit only if the process
associated with the bfq_queue has exited. This is false, because a
bfq_queue, say Q, may exit also as a consequence of a merge with
another bfq_queue. In this case, Q exits because the I/O of its
associated process has been redirected to another bfq_queue.

The decrement unbalance occurs because Q may then be re-created after
a split, and added back to the current burst list, *without*
incrementing burst_size. burst_size is not incremented because Q is
not a new bfq_queue added to the burst list, but a bfq_queue only
temporarily removed from the list, and, before the commit "bfq-sq,
bfq-mq: decrease burst size when queues in burst exit", burst_size was
not decremented when Q was removed.

This commit addresses this issue by just checking whether the exiting
bfq_queue is a merged bfq_queue, and, in that case, not decrementing
burst_size. Unfortunately, this still leaves room for unbalanced
decrements, in the following rarer case: on a split, the bfq_queue
happens to be inserted into a different burst list than the one it was
removed from when merged. If this happens, the number of elements in
the new burst list becomes higher than burst_size (by one). When the
bfq_queue then exits, it is of course not in a merged state any
longer, thus burst_size is decremented, which results in an unbalanced
decrement. To handle this sporadic, unlucky case in a simple way,
this commit also checks that burst_size is larger than 0 before
decrementing it.

Finally, this commit removes a useless, extra check: the check that
the bfq_queue is sync, performed before checking whether the bfq_queue
is in the burst list. This extra check is redundant, because only sync
bfq_queues can be inserted into the burst list.

Fixes: 7cb04004 ("block, bfq: decrease burst size when queues in burst exit")
Reported-by: Philip Müller <philm@manjaro.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Philip Müller <philm@manjaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
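
The two guards described in this message can be sketched in userspace C as below. The sketch_ structures are hypothetical simplifications: a boolean replaces burst-list membership, and another boolean marks a queue that exits because it was merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state: number of queues counted in the current burst. */
struct sketch_bfq_data {
	int burst_size;
};

/* Hypothetical per-queue state. */
struct sketch_bfq_queue {
	bool in_burst_list;	/* queue is linked into the burst list */
	bool merged;		/* queue exits because it was merged into another */
};

/* Called when a queue exits: unlink it, and decrement only when that is balanced. */
void sketch_queue_exit(struct sketch_bfq_data *bd, struct sketch_bfq_queue *q)
{
	assert(bd != NULL && q != NULL);

	if (!q->in_burst_list)
		return;
	q->in_burst_list = false;

	/* Guard 1: a merged queue may come back after a split, so do not count it out. */
	/* Guard 2: never go below zero in the rare different-burst-list case. */
	if (!q->merged && bd->burst_size > 0)
		bd->burst_size--;
}
```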

block,bfq: Disable writeback throttling · b5dc5d4d

Luca Miccio authored Oct 09, 2017

Similarly to CFQ, BFQ has its write-throttling heuristics, and it
is better not to combine them with further write-throttling
heuristics of a different nature.
So this commit disables write-back throttling for a device if BFQ
is used as I/O scheduler for that device.
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

writeback: schedule periodic writeback with sysctl · 94af5846

Yafang Shao authored Oct 10, 2017

After disabling periodic writeback by writing 0 to
dirty_writeback_centisecs, the handler wb_workfn() will not be
entered again until the dirty background limit is reached, a
sync syscall is executed, there is not enough free memory available,
or vmscan is triggered.

So periodic writeback can't be re-enabled by writing a non-zero
value to dirty_writeback_centisecs.
As it can be disabled by sysctl, it should be possible to enable it by
sysctl as well.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

06 Oct, 2017 3 commits

block/bio: Remove null checks before mempool_destroy in bioset_free · 4078def8

Tim Hansen authored Oct 06, 2017

This patch removes redundant checks for null values on bio_pool and
bvec_pool.

Found using make coccicheck M=block/ on linux-next tree on the
next-20170929 tag.
Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove unnecessary NULL checks in bioset_integrity_free() · 4b14a5c5

Tim Hansen authored Oct 05, 2017

mempool_destroy() already checks for a NULL value being passed in, this
eliminates duplicate checks.

This was caught by running make coccicheck M=block/ on Linus' tree on
commit 77ede3a0 (current head as of this patch).
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

backing-dev: kill unused pdflush_proc_obsolete() · 775d3a35

Jens Axboe authored Oct 06, 2017

After commit b35bd0d9, pdflush_proc_obsolete() is no longer
used. Kill the function and declaration.
Reported-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05 Oct, 2017 1 commit

block: remove QUEUE_FLAG_STACKABLE · 5fdee212

Christoph Hellwig authored Oct 05, 2017

We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

04 Oct, 2017 4 commits

sysctl: remove /proc/sys/vm/nr_pdflush_threads · b35bd0d9

Jens Axboe authored Sep 30, 2017

This tunable has been obsolete since 2.6.32, and writes to the
file have been failing and complaining in dmesg since then:

nr_pdflush_threads exported in /proc is scheduled for removal

That was 8 years ago. Remove the file ABI obsolete notice, and
the sysfs file.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

writeback: eliminate work item allocation in bd_start_writeback() · 85009b4f

Jens Axboe authored Sep 30, 2017

Handle start-all writeback like we do periodic or kupdate
style writeback - by marking the bdi_writeback as needing a full
flush, and simply waking the thread. This eliminates the need to
allocate and queue a specific work item just for this purpose.

After this change, we truly only ever have one of them running at
any point in time. We mark the need to start all flushes, and the
writeback thread will clear it once it has processed the request.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: document the need to have STARTED and COMPLETED share a byte · fc13457f

Jens Axboe authored Oct 04, 2017

For memory ordering guarantees on stores, we need to ensure that
these two bits share the same byte of storage in the unsigned
long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
we don't violate this requirement.
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
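
A compile-time check of this kind could look roughly like the following in plain C11. The bit numbers and the SKETCH_ names are hypothetical placeholders rather than the kernel's definitions, and _Static_assert stands in for BUILD_BUG_ON().

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical bit positions inside one unsigned long of request flags. */
enum {
	SKETCH_ATOM_COMPLETE = 0,
	SKETCH_ATOM_STARTED = 1,
};

/* Nonzero when bit numbers a and b fall into the same byte of the word. */
int sketch_bits_share_byte(unsigned int a, unsigned int b)
{
	assert(a < sizeof(unsigned long) * CHAR_BIT);
	assert(b < sizeof(unsigned long) * CHAR_BIT);
	return a / CHAR_BIT == b / CHAR_BIT;
}

/* Userspace stand-in for BUILD_BUG_ON(): the build fails if the two bits are split. */
_Static_assert(SKETCH_ATOM_COMPLETE / CHAR_BIT == SKETCH_ATOM_STARTED / CHAR_BIT,
	       "STARTED and COMPLETE must share a byte");
```

Dividing the bit number by CHAR_BIT gives the index of the byte that holds the bit, so the assertion fails as soon as someone renumbers one flag across a byte boundary.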

blk-mq: attempt to fix atomic flag memory ordering · a7af0af3

Peter Zijlstra authored Sep 06, 2017

Attempt to untangle the ordering in blk-mq. The patch introducing the
single smp_mb__before_atomic() is obviously broken in that it doesn't
clearly specify a pairing barrier and an obtained guarantee.

The comment is further misleading in that it hints that the
deadline store and the COMPLETE store also need to be ordered, but
AFAICT there is no such dependency. However what does appear to be
important is the clear happening _after_ the store, and that worked by
pure accident.

This clarifies blk_mq_start_request() -- we should not get there with
STARTING set -- this simplifies the code and makes the barrier usage
sane (the old code could be read to allow not having _any_ atomic after
the barrier, in which case the barrier hasn't got anything to order). We
then also introduce the missing pairing barrier for it.

Also down-grade the barrier to smp_wmb(), this is cheaper for
PowerPC/ARM and doesn't cost anything extra on x86.

And it documents the STARTING vs COMPLETE ordering. Although I've not
been entirely successful in reverse engineering the blk-mq state
machine so there might still be more funnies around timeout vs
requeue.

If I got anything wrong, feel free to educate me by adding comments to
clarify things ;-)

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Fixes: 538b7534 ("blk-mq: request deadline must be visible before marking rq as started")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
