- 20 Dec, 2018 12 commits
-
-
Luc Van Oostenryck authored
print_st_err() is defined with its 4th argument taking an 'enum drbd_state_rv' but its prototype use an int for it. Fix this by using 'enum drbd_state_rv' in the prototype too. Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
If peers are "simultaneously" told to disconnect from each other, either explicitly, or implicitly by taking down the resource, with bad timing, one side may see its disconnect "fail" with a result of "state change failed by peer", and interpret this as "please oudate yourself". Try to catch this by checking for current connection status, and possibly retry as local-only state change instead. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
"suspending" IO is overloaded. It can mean "do not allow new requests" (obviously), but it also may mean "must not complete pending IO", for example while the fencing handlers do their arbitration. When adjusting disk options, we suspend io (disallow new requests), then wait for the activity-log to become unused (drain all IO completions), and possibly replace it with a new activity log of different size. If the other "suspend IO" aspect is active, pending IO completions won't happen, and we would block forever (unkillable drbdsetup process). Fix this by skipping the activity log adjustment if the "al-extents" setting did not change. Also, in case it did change, fail early without blocking if it looks like we would block forever. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
Multiple failure scenario: a) all good Connected Primary/Secondary UpToDate/UpToDate b) lose disk on Primary, Connected Primary/Secondary Diskless/UpToDate c) continue to write to the device, changes only make it to the Secondary storage. d) lose disk on Secondary, Connected Primary/Secondary Diskless/Diskless e) now try to re-attach on Primary This would have succeeded before, even though that is clearly the wrong data set to attach to (missing the modifications from c). Because we only compared our "effective" and the "to-be-attached" data generation uuid tags if (device->state.conn < C_CONNECTED). Fix: change that constraint to (device->state.pdsk != D_UP_TO_DATE) compare the uuids, and reject the attach. This patch also tries to improve the reverse scenario: first lose Secondary, then Primary disk, then try to attach the disk on Secondary. Before this patch, the attach on the Secondary succeeds, but since commit drbd: disconnect, if the wrong UUIDs are attached on a connected peer the Primary will notice unsuitable data, and drop the connection hard. Though unfortunately at a point in time during the handshake where we cannot easily abort the attach on the peer without more refactoring of the handshake. We now reject any attach to "unsuitable" uuids, as long as we can see a Primary role, unless we already have access to "good" data. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
If we would reject a new handshake, if the peer had attached first, and then connected, we should force disconnect if the peer first connects, and only then attaches. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
If we attach a (consistent) backing device, which knows about a last-agreed effective size, and that effective size is *larger* than the currently requested size, we refused to attach with ERR_DISK_TOO_SMALL Failure: (111) Low.dev. smaller than requested DRBD-dev. size. which is confusing to say the least. This patch changes the error code in that case to ERR_IMPLICIT_SHRINK Failure: (170) Implicit device shrinking not allowed. See kernel log. additional info from kernel: To-be-attached device has last effective > current size, and is consistent (9999 > 7777 sectors). Refusing to attach. It also allows to attach with an explicit size. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
With "on-no-data-accessible suspend-io", DRBD requires the next attach or connect to be to the very same data generation uuid tag it lost last. If we first lost connection to the peer, then later lost connection to our own disk, we would usually refuse to re-connect to the peer, because it presents the wrong data set. However, if the peer first connects without a disk, and then attached its disk, we accepted that same wrong data set, which would be "unexpected" by any user of that DRBD and cause "undefined results" (read: very likely data corruption). The fix is to forcefully disconnect as soon as we notice that the peer attached to the "wrong" dataset. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
During handshake, if we are diskless ourselves, we used to accept any size presented by the peer. Which could be zero if that peer was just brought up and connected to us without having a disk attached first, in which case both peers would just "flip" their volume sizes. Now, even a diskless node will ignore "zero" sizes presented by a diskless peer. Also a currently Diskless Primary will refuse to shrink during handshake: it may be frozen, and waiting for a "suitable" local disk or peer to re-appear (on-no-data-accessible suspend-io). If the peer is smaller than what we used to be, it is not suitable. The logic for a diskless node during handshake is now supposed to be: believe the peer, if - I don't have a current size myself - we agree on the size anyways - I do have a current size, am Secondary, and he has the only disk - I do have a current size, am Primary, and he has the only disk, which is larger than my current size Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
Previously, some implicit resizes that happend during handshake have not been reported as prominently as explicit resize. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Lars Ellenberg authored
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Roland Kammerer authored
So far there was the possibility that we called genlmsg_new(GFP_NOIO)/mutex_lock() while holding an rcu_read_lock(). This included cases like: drbd_sync_handshake (acquire the RCU lock) drbd_asb_recover_1p drbd_khelper drbd_bcast_event genlmsg_new(GFP_NOIO) --> may sleep drbd_sync_handshake (acquire the RCU lock) drbd_asb_recover_1p drbd_khelper notify_helper genlmsg_new(GFP_NOIO) --> may sleep drbd_sync_handshake (acquire the RCU lock) drbd_asb_recover_1p drbd_khelper notify_helper mutex_lock --> may sleep While using GFP_ATOMIC whould have been possible in the first two cases, the real fix is to narrow the rcu_read_lock. Reported-by: Jia-Ju Bai <baijiaju1990@163.com> Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 19 Dec, 2018 4 commits
-
-
Ming Lei authored
blkg_lookup_create() may be called from pool_map() in which irq state is saved, so we have to do that in blkg_lookup_create(). Otherwise, the following lockdep warning can be triggered: [ 104.258537] ================================ [ 104.259129] WARNING: inconsistent lock state [ 104.259725] 4.20.0-rc6+ #545 Not tainted [ 104.260268] -------------------------------- [ 104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.263747] {SOFTIRQ-ON-W} state was registered at: [ 104.264417] _raw_spin_unlock_irq+0x29/0x4c [ 104.265014] blkg_lookup_create+0xdc/0xe6 [ 104.265609] bio_associate_blkg_from_css+0xd3/0x13f [ 104.266312] bio_associate_blkg+0x15a/0x1bb [ 104.266913] pool_map+0xe8/0x103 [dm_thin_pool] [ 104.267572] __map_bio+0x98/0x29c [dm_mod] [ 104.268162] __split_and_process_non_flush+0x29e/0x306 [dm_mod] [ 104.269003] __split_and_process_bio+0x16a/0x25b [dm_mod] [ 104.269971] __dm_make_request.isra.14+0xdc/0x124 [dm_mod] [ 104.270973] generic_make_request+0x3f5/0x68b [ 104.271676] process_prepared_mapping+0x166/0x1ef [dm_thin_pool] [ 104.272531] schedule_zero+0x239/0x273 [dm_thin_pool] [ 104.273245] process_cell+0x60c/0x6f1 [dm_thin_pool] [ 104.273967] do_worker+0x60c/0xca8 [dm_thin_pool] [ 104.274635] process_one_work+0x4eb/0x834 [ 104.275203] worker_thread+0x318/0x484 [ 104.275740] kthread+0x1d1/0x1e1 [ 104.276203] ret_from_fork+0x3a/0x50 [ 104.276714] irq event stamp: 170003 [ 104.277201] hardirqs last enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55 [ 104.280273] softirqs last enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73 [ 104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d [ 104.282744] [ 104.282744] other info that might help us debug this: [ 104.283640] Possible unsafe locking scenario: [ 104.283640] [ 104.284452] CPU0 [ 104.284803] ---- [ 104.285150] lock(&(&pool->lock)->rlock#3); [ 104.285762] <Interrupt> [ 104.286130] lock(&(&pool->lock)->rlock#3); [ 104.286750] [ 104.286750] *** DEADLOCK *** [ 104.286750] [ 104.287564] no locks held by swapper/49/0. [ 104.288129] [ 104.288129] stack backtrace: [ 104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545 [ 104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014 [ 104.290858] Call Trace: [ 104.291204] <IRQ> [ 104.291502] dump_stack+0x9a/0xe6 [ 104.291968] mark_lock+0x56c/0x7a6 [ 104.292442] ? check_usage_backwards+0x209/0x209 [ 104.293086] __lock_acquire+0x400/0x15bf [ 104.293662] ? check_chain_key+0x150/0x1aa [ 104.294236] lock_acquire+0x1a6/0x1e3 [ 104.294768] ? thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.295444] ? _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.296143] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.297031] _raw_spin_lock_irqsave+0x46/0x55 [ 104.297659] ? thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.298335] thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.298997] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.299886] ? check_flags+0x20a/0x20a [ 104.300408] ? lock_acquire+0x1a6/0x1e3 [ 104.300954] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.301865] clone_endio+0x1bb/0x22d [dm_mod] [ 104.302491] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.303200] ? bio_disassociate_blkg+0xc6/0x15f [ 104.303836] ? bio_endio+0x2b2/0x2da [ 104.304349] clone_endio+0x1f3/0x22d [dm_mod] [ 104.304978] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.305709] ? bio_disassociate_blkg+0xc6/0x15f [ 104.306333] ? bio_endio+0x2b2/0x2da [ 104.306853] clone_endio+0x1f3/0x22d [dm_mod] [ 104.307476] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.308185] ? bio_disassociate_blkg+0xc6/0x15f [ 104.308817] ? bio_endio+0x2b2/0x2da [ 104.309319] blk_update_request+0x2de/0x4cc [ 104.309927] blk_mq_end_request+0x2a/0x183 [ 104.310498] blk_done_softirq+0x16a/0x1a6 [ 104.311051] ? blk_softirq_cpu_dead+0xe2/0xe2 [ 104.311653] ? __lock_is_held+0x2a/0x87 [ 104.312186] __do_softirq+0x250/0x4e8 [ 104.312705] irq_exit+0x7e/0x11d [ 104.313157] call_function_single_interrupt+0xf/0x20 [ 104.313860] </IRQ> [ 104.314163] RIP: 0010:native_safe_halt+0x2/0x3 [ 104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8 [ 104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04 [ 104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7 [ 104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c [ 104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001 [ 104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf [ 104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000 [ 104.323307] ? lockdep_hardirqs_on+0x26b/0x278 [ 104.323927] default_idle+0xd9/0x1a8 [ 104.324427] do_idle+0x162/0x2b2 [ 104.324891] ? arch_cpu_idle_exit+0x28/0x28 [ 104.325467] ? mark_held_locks+0x28/0x7f [ 104.326031] ? _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.326719] cpu_startup_entry+0x1d/0x1f [ 104.327261] start_secondary+0x2cb/0x308 [ 104.327806] ? set_cpu_sibling_map+0x8a3/0x8a3 [ 104.328421] secondary_startup_64+0xa4/0xb0 Fixes: b978962a ("blkcg: update blkg_lookup_create() to do locking") Cc: Mike Snitzer <snitzer@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
DM currently has a statically allocated bio that it uses to issue empty flushes. It doesn't submit this bio, it just uses it for maintaining state while setting up clones. Multiple users can access this bio at the same time. This wasn't previously an issue, even if it was a bit iffy, but with the blkg associations it can become one. We setup the blkg association, then clone bio's and submit, then remove the blkg assocation again. But since we can have multiple tasks doing this at the same time, against multiple blkg's, then we can either lose references to a blkg, or put it twice. The latter causes complaints on the percpu ref being <= 0 when released, and can cause use-after-free as well. Ming reports that xfstest generic/475 triggers this: ------------[ cut here ]------------ percpu ref (blkg_release) <= 0 (0) after switching to atomic WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0 Switch to just using an on-stack bio for this, and get rid of the embedded bio. Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device") Reported-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
git://git.infradead.org/nvmeJens Axboe authored
Pull last batch of NVMe updates for 4.21 from Christoph: "This contains a series from Sagi to restore poll support for nvme-rdma, a new tracepoint from yupeng and various fixes." * 'nvme-4.21' of git://git.infradead.org/nvme: nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1
-
yupeng authored
Export the disk name, queue id, sq_head, sq_tail to a trace event in completion handling. Usage example: cd /sys/kernel/debug/tracing/events/nvme/nvme_sq echo 'disk=="nvme1n1"' > filter echo 1 > enable cat /sys/kernel/debug/tracing/trace_pipe Signed-off-by: yupeng <yupeng0921@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> [hch: slight formatting tweaks, use standard nvme tracepoint conventions] Signed-off-by: Christoph Hellwig <hch@lst.de> wip
-
- 18 Dec, 2018 14 commits
-
-
Sagi Grimberg authored
When passed with nr_poll_queues setup additional queues with cq polling context IB_POLL_DIRECT (no interrupts) and make sure to set QUEUE_FLAG_POLL on the connect_q. In addition add the third queue mapping for polling queues. nvmf connect on this queue is polled for like all other requests so make nvmf_connect_io_queue poll for polling queues. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
This argument will specify how many polling I/O queues to connect when creating the controller. These I/O queues will host I/O that is set with REQ_HIPRI. Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
Preparation for polling support for fabrics. Polling support means that our completion queues are not generating any interrupts which means we need to poll for the nvmf io queue connect as well. Reviewed by Steve Wise <swise@opengridcomputing.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
Pass poll bool to indicate that we need it to poll. This prepares us for polling support in nvmf since connect is an I/O that will be queued and has to be polled in order to complete. If poll is passed, we call nvme_execute_rq_polled which sends the requests and polls for its completion. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
block consumers will need it for polling requests that are sent with blk_execute_rq_nowait. Also, get rid of blk_tag_to_qc_t and open-code it instead. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Colin Ian King authored
There is a spelling mistake in a dev_info message, fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me>
-
Christoph Hellwig authored
By duplicating the nvme_process_cq in both branches we keep the sparse lock context checking happy, so do it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Christoph Hellwig authored
The block layer now enables polling support on a queue if nr_maps includes the poll map, so we should only set that if we actually support poll queues. Fixes: 6544d229 ("block: enable polling by default if a poll map is initalized") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Chaitanya Kulkarni authored
This patch defines a new macro NVMET_NO_ERROR_LOC to represent the default error location value in the nvme-error-log-page. This is a pure cleanup patch and it does not change any functionality. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Colin Ian King authored
Currently the u16 req->error_loc is being compared to -1 which will always be false. Fix this by casting -1 to u16 to fix this. Detected by clang: warning: result of comparison of constant -1 with expression of type 'u16' (aka 'unsigned short') is always false [-Wtautological-constant-out-of-range-compare] Fixes: 76574f37 ("nvmet: add interface to update error-log page") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Ming Lei authored
The queue mapping of type poll only exists when set->map[HCTX_TYPE_POLL].nr_queues is bigger than zero, so enhance the constraint by checking .nr_queues of type poll before enabling IO poll. Otherwise IO race & timeout can be observed when running block/007. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
There's a single user of this function, dm, and dm just wants to check if IO is inflight, not that it's just allocated. This fixes a hang with srp/002 in blktests with dm, where it tries to suspend but waits for inflight IO to finish first. As it checks for just allocated requests, this fails. Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 17 Dec, 2018 10 commits
-
-
Ming Lei authored
From 7e849dd9 ("nvme-pci: don't share queue maps"), the mapping table won't be initialized actually if map->nr_queues is zero, so we can't use blk_mq_map_queue_type() to retrieve hctx any more. This way still may cause broken mapping, fix it by skipping zero-queues maps in blk_mq_map_swqueue(). Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dennis Zhou authored
The blk-iolatency controller measures the time from rq_qos_throttle() to rq_qos_done_bio() and attributes this time to the first bio that needs to create the request. This means if a bio is plug-mergeable or bio-mergeable, it gets to bypass the blk-iolatency controller. The recent series [1], to tag all bios w/ blkgs undermined how iolatency was determining which bios it was charging and should process in rq_qos_done_bio(). Because all bios are being tagged, this caused the atomic_t for the struct rq_wait inflight count to underflow and result in a stall. This patch adds a new flag BIO_TRACKED to let controllers know that a bio is going through the rq_qos path. blk-iolatency now checks if this flag is set to see if it should process the bio in rq_qos_done_bio(). Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing. BIO_THROTTLED was another candidate, but the flag is set for all bios that have gone through blk-throttle code. Overloading a flag comes with the burden of making sure that when either implementation changes, a change in setting rules for one doesn't cause a bug in the other. So here, we unfortunately opt for adding a new flag. [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/ Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device") Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
When a request is added to rq list of sw queue(ctx), the rq may be from a different type of hctx, especially after multi queue mapping is introduced. So when dispach request from sw queue via blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), one request belonging to other queue type of hctx can be dispatched to current hctx in case that read queue or poll queue is enabled. This patch fixes this issue by introducing per-queue-type list. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Changed by me to not use separately cacheline aligned lists, just place them all in the same cacheline where we had just the one list and lock before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
For a zoned block device using mq-deadline, if a write request for a zone is received while another write was already dispatched for the same zone, dd_dispatch_request() will return NULL and the newly inserted write request is kept in the scheduler queue waiting for the ongoing zone write to complete. With this behavior, when no other request has been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to __blk_mq_free_request() call of blk_mq_sched_restart() to not run the queue when the already dispatched write request completes. The newly dispatched request stays stuck in the scheduler queue until eventually another request is submitted. This problem does not affect SCSI disk as the SCSI stack handles queue restart on request completion. However, this problem is can be triggered the nullblk driver with zoned mode enabled. Fix this by always requesting a queue restart in dd_dispatch_request() if no request was dispatched while WRITE requests are queued. Fixes: 5700f691 ("mq-deadline: Introduce zone locking support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Add missing export of blk_mq_sched_restart() Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Now that the block layer checks if a queue map has any queues inside it there is no more reason to duplicate the maps for the non-default types. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
We should check if a given queue map actually has queues enabled before dispatching to it. This allows drivers to not initialize optional but not used map types, which subsequently will allow fixing problems with queue map rebuilds for that case. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Now we only export hctx->type via sysfs, and there isn't such info in hctx entry under debugfs. We often use debugfs only to diagnose queue mapping issue, so add the support in debugfs. Queue mapping becomes a bit more complicated after multiple queue mapping is supported, we may write blktest to verify if queue mapping is valid based on blk-mq-debugfs. Given not necessary to export hctx->type twice, so remove the export from sysfs. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Type of each element in queue mapping table is 'unsigned int, intead of 'struct blk_mq_queue_map)', so fix it. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
This information is helpful to either investigate issues, or understand wbt's internal behaviour. Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
blk-mq-debugfs has been proved as very helpful for debug some tough issues, such as IO hang. We have seen blk-wbt related IO hang several times, even inside Red Hat BZ, there is such report not sovled yet, so this patch adds support debugfs on rq_qos. Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-