- 14 Jan, 2021 2 commits
-
-
Sagi Grimberg authored
When a bio merges, we can get a request that spans multiple bios, and the overall request payload size is the sum of all bios. When we calculate how much we need to send from the existing bio (and bvec), we did not take into account the iov_iter byte count cap. Since the introduction of multipage bvec support, a bvec can be split in the middle, which means that when we account for the last bvec send we should also apply the iov_iter byte count cap, as it might be lower than the last bvec size.

Reported-by: Hao Wang <pkuwangh@gmail.com>
Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
Tested-by: Hao Wang <pkuwangh@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
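As an illustration only, a minimal sketch of the capping described above; the helper name and parameters are hypothetical, while min_t() and iov_iter_count() are real kernel APIs:

#include <linux/kernel.h>
#include <linux/bvec.h>
#include <linux/uio.h>

/* Hypothetical helper: how much of the current bvec may be sent.
 * Bound the length both by what is left in the bvec and by what the
 * iov_iter still accounts for, which can be smaller after a bio split. */
static size_t cur_send_len(const struct bio_vec *bv, size_t sent_in_bvec,
			   const struct iov_iter *iter)
{
	return min_t(size_t, bv->bv_len - sent_in_bvec, iov_iter_count(iter));
}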
-
Sagi Grimberg authored
We shouldn't call smp_processor_id() in a preemptible context, but this is advisory at best, so instead call __smp_processor_id().

Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
Reported-by: Or Gerlitz <gerlitz.or@gmail.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 06 Jan, 2021 1 commit
-
-
Sagi Grimberg authored
We may send a request (with or without its data) from two paths:

1. From our I/O context nvme_tcp_io_work, which is triggered from:
   - queue_rq
   - r2t reception
   - socket data_ready and write_space callbacks
2. Directly from queue_rq if the send_list is empty (because we want to save the context switch associated with scheduling our io_work).

However, given that we now have the send_mutex, we may run into a race condition where none of these contexts will send the pending payload to the controller. Both the io_work send path and the queue_rq send path opportunistically attempt to acquire the send_mutex, however queue_rq only attempts to send a single request, and if the io_work context fails to acquire the send_mutex it will complete without rescheduling itself.

The race can trigger with the following sequence:

1. queue_rq sends a request (no in-capsule data) and blocks
2. the RX path receives an r2t - prepares a data PDU to send, adds the h2cdata PDU to the send_list and schedules io_work
3. io_work triggers and cannot acquire the send_mutex - because of (1), it ends without rescheduling itself
4. queue_rq completes the send and returns ==> no context will send the h2cdata - timeout.

Fix this by having queue_rq send as much as it can from the send_list, such that if anything is left, it is because the socket buffer is full and the socket write_space callback will trigger, thus guaranteeing that a context will be scheduled to send the h2cdata PDU.

Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reported-by: Samuel Jones <sjones@kalrayinc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
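A rough sketch of the queue_rq send path after the fix; the try-send stub and parameters are illustrative, not the exact driver code, while mutex_trylock(), queue_work_on() and system_wq are real APIs:

#include <linux/mutex.h>
#include <linux/workqueue.h>

static int demo_try_send(void)
{
	return 0;	/* stub: >0 means some payload was sent */
}

/* Illustrative: after the fix, queue_rq keeps sending until the send
 * list drains; anything left behind implies a full socket buffer, and
 * the write_space callback will schedule io_work to finish it. */
static void demo_queue_rq_send(struct mutex *send_mutex,
			       struct work_struct *io_work,
			       int io_cpu, bool last)
{
	if (mutex_trylock(send_mutex)) {
		while (demo_try_send() > 0)
			;
		mutex_unlock(send_mutex);
	} else if (last) {
		queue_work_on(io_cpu, system_wq, io_work);
	}
}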
-
- 01 Dec, 2020 1 commit
-
-
Chaitanya Kulkarni authored
This is purely a cleanup patch: add the NVME prefix to ADMIN_TIMEOUT to make it consistent with NVME_IO_TIMEOUT.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 03 Nov, 2020 2 commits
-
-
Sagi Grimberg authored
The request may be executed asynchronously, and rq->state may be changed to IDLE. To avoid a repeated request completion, only the MQ_RQ_COMPLETE value of rq->state is checked in nvme_tcp_complete_timed_out. This is not safe, so the check also needs to cover the IDLE state.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
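One way to express that check, as a sketch; blk_mq_request_started(), blk_mq_request_completed() and blk_mq_complete_request() are real blk-mq APIs, and the surrounding function is abbreviated:

#include <linux/blk-mq.h>

/* Sketch: only complete the timed-out request if it was started and has
 * not already been completed, i.e. skip both IDLE and MQ_RQ_COMPLETE. */
static void demo_complete_timed_out(struct request *rq)
{
	if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
		blk_mq_complete_request(rq);
}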
-
Chao Leng authored
Currently teardown_lock is used to serialize timeout and teardown. This can lead to an abnormal situation: teardown first cancels all requests, then a timeout may complete a request again, even though the request may already have been freed or restarted. To avoid the race between timeout and teardown, in the teardown process we first quiesce the queue, and then delete the timer and cancel the timeout work for the queue. With that, teardown_lock can be removed.

Signed-off-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
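A sketch of the teardown ordering described above; blk_mq_quiesce_queue() and blk_sync_queue() are real block layer APIs (the latter deletes the queue timeout timer and flushes the timeout work), and the rest of the flow is elided:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/* Sketch: stop dispatch first, then make sure no timeout handler can
 * still run for this queue before cancelling outstanding requests. */
static void demo_teardown_queue(struct request_queue *q)
{
	blk_mq_quiesce_queue(q);	/* no further dispatch */
	blk_sync_queue(q);		/* del_timer_sync + cancel timeout work */
	/* ... now cancel/complete outstanding requests safely ... */
}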
-
- 02 Oct, 2020 1 commit
-
-
Coly Li authored
Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to send slab pages. But pages allocated by __get_free_pages() without __GFP_COMP, which also have a refcount of 0, are still sent to the remote end by kernel_sendpage(), which is problematic.

The newly introduced helper sendpage_ok() checks both the PageSlab tag and the page_count counter, and returns true if the page is OK to be sent by kernel_sendpage(). This patch fixes the page checking issue in nvme_tcp_try_send_data() with sendpage_ok(): if sendpage_ok() returns true, send the page by kernel_sendpage(), otherwise use sock_no_sendpage() to handle it.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
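The check boils down to the following pattern; sendpage_ok(), kernel_sendpage() and sock_no_sendpage() are real kernel APIs, and the surrounding function is illustrative:

#include <linux/net.h>
#include <linux/mm.h>
#include <net/sock.h>

/* Illustrative send step: only hand the page to kernel_sendpage() when
 * sendpage_ok() confirms it is neither a slab page nor refcount 0. */
static int demo_send_one_page(struct socket *sock, struct page *page,
			      int offset, size_t len, int flags)
{
	if (sendpage_ok(page))
		return kernel_sendpage(sock, page, offset, len, flags);

	/* fall back to the regular copying send path */
	return sock_no_sendpage(sock, page, offset, len, flags);
}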
-
- 08 Sep, 2020 1 commit
-
-
David Milburn authored
Cancel the async event work in case an async event has been queued up, so that nvme_tcp_submit_async_event() cannot run after the event has been freed.

Signed-off-by: David Milburn <dmilburn@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
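A sketch of the idea; cancel_work_sync() is the real API, and the surrounding teardown step is abbreviated:

#include <linux/workqueue.h>

/* Sketch: cancel a possibly-queued async event work item before the
 * resources it would touch (e.g. the AEN request) are freed. */
static void demo_stop_async_events(struct work_struct *async_event_work)
{
	cancel_work_sync(async_event_work);	/* waits if it is running */
	/* ... now it is safe to tear down the admin queue resources ... */
}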
-
- 28 Aug, 2020 3 commits
-
-
Sagi Grimberg authored
If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands that are inflight holding the q_usage_counter, and we can't blindly fail requests that time out. So give a timeout, and if we cannot wait for the queue freeze before unfreezing, fail and have the error handling take care of how to proceed (either schedule a reconnect or remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
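A hedged sketch of the bounded wait, assuming nvme_wait_freeze_timeout() returns non-zero when the freeze completed in time; these nvme core helpers are declared in drivers/nvme/host/nvme.h in this era of the driver, and the error handling here is simplified:

/* Sketch: bound the wait for queue freeze during reset so an
 * unresponsive controller cannot hang the reset path forever. */
static int demo_wait_freeze_bounded(struct nvme_ctrl *ctrl)
{
	nvme_start_queues(ctrl);
	if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
		/* frozen state never drained: let error handling decide
		 * whether to reconnect or remove the controller */
		return -ENODEV;
	}
	nvme_unfreeze(ctrl);
	return 0;
}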
-
Sagi Grimberg authored
When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation. However, when a request times out in a non-LIVE state, we make sure to complete it immediately, as it might block controller setup or teardown and prevent forward progress. But tearing down the entire set of I/O and admin queues causes a freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill for what we actually need, which is to just fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling an extra queue freeze on controller namespaces, causing unfreeze to not complete correctly.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
-
Sagi Grimberg authored
In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
-
- 23 Aug, 2020 1 commit
-
-
Gustavo A. R. Silva authored
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough [1]. Also, remove unnecessary fall-through markings where that is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
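A minimal illustration of the conversion; the switch contents are made up, and the fallthrough macro comes from <linux/compiler_attributes.h>:

#include <linux/compiler_attributes.h>

/* Illustrative switch: the comment-style annotation becomes the
 * fallthrough pseudo-keyword, which the compiler can check. */
static int demo_classify(int code)
{
	switch (code) {
	case 0:
		code |= 0x10;
		fallthrough;	/* previously a "fall through" comment */
	case 1:
		code |= 0x01;
		break;
	default:
		break;
	}
	return code;
}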
-
- 21 Aug, 2020 1 commit
-
-
Christoph Hellwig authored
nvme_end_request is a bit misnamed, as it wraps around the blk_mq_complete_* API. Its semantics are also non-trivial, so give it a more descriptive name and add a comment explaining the semantics.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 29 Jul, 2020 2 commits
-
-
Sagi Grimberg authored
commit fe35ec58 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues, which in the case of multiple queue maps (which we now have for default/read/poll) attempts to freeze the queue. However, we never started a queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue and that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows how the PCI driver handles resets.

Fixes: fe35ec58 ("block: update hctx map when use multiple maps")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
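A sketch of the ordering described above; nvme_start_freeze(), nvme_stop_queues(), nvme_start_queues(), nvme_wait_freeze(), nvme_unfreeze() and blk_mq_update_nr_hw_queues() are real APIs of that era, and the connection steps are elided:

/* Sketch of the reset flow ordering for a fabrics controller with
 * multiple queue maps. */
static void demo_reset_io_queues(struct nvme_ctrl *ctrl,
				 struct blk_mq_tag_set *set, int nr_hw_queues)
{
	/* teardown: start the freeze before quiescing */
	nvme_start_freeze(ctrl);
	nvme_stop_queues(ctrl);
	/* ... tear down and re-connect the I/O queues ... */

	/* setup: resize hctx maps only once the queues are frozen */
	nvme_start_queues(ctrl);
	nvme_wait_freeze(ctrl);
	blk_mq_update_nr_hw_queues(set, nr_hw_queues);
	nvme_unfreeze(ctrl);
}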
-
Sagi Grimberg authored
A deadlock happens in the following scenario with multipath:

1) scan_work(nvme0) detects a new nsid while nvme0 is an optimized path to it; path nvme1 happens to be inaccessible.
2) Before scan_work is complete, nvme0 disconnect is initiated. nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING.
3) scan_work(1) attempts to submit IO, but nvme_path_is_optimized() observes nvme0 is not LIVE. Since nvme1 is a possible path, IO is requeued and scan_work hangs.

--
Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: io_schedule+0x16/0x40
kernel: do_read_cache_page+0x438/0x830
kernel: read_cache_page+0x12/0x20
kernel: read_dev_sector+0x27/0xc0
kernel: read_lba+0xc1/0x220
kernel: efi_partition+0x1e6/0x708
kernel: check_partition+0x154/0x244
kernel: rescan_partitions+0xae/0x280
kernel: __blkdev_get+0x40f/0x560
kernel: blkdev_get+0x3d/0x140
kernel: __device_add_disk+0x388/0x480
kernel: device_add_disk+0x13/0x20
kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x249/0x400
kernel: kthread+0x104/0x140
--

4) Delete also hangs in flush_work(ctrl->scan_work) from nvme_remove_namespaces().

Similarly a deadlock with ana_work may happen: if ana_work has started and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O. When we trigger disconnect, I/O will block because our accessible (optimized) path is disconnecting, but the alternate path is inaccessible, so I/O blocks. Then disconnect tries to flush the ana_work and hangs.

[ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
[ 605.552087] Call Trace:
[ 605.552683] __schedule+0x2b9/0x6c0
[ 605.553507] schedule+0x42/0xb0
[ 605.554201] io_schedule+0x16/0x40
[ 605.555012] do_read_cache_page+0x438/0x830
[ 605.556925] read_cache_page+0x12/0x20
[ 605.557757] read_dev_sector+0x27/0xc0
[ 605.558587] amiga_partition+0x4d/0x4c5
[ 605.561278] check_partition+0x154/0x244
[ 605.562138] rescan_partitions+0xae/0x280
[ 605.563076] __blkdev_get+0x40f/0x560
[ 605.563830] blkdev_get+0x3d/0x140
[ 605.564500] __device_add_disk+0x388/0x480
[ 605.565316] device_add_disk+0x13/0x20
[ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core]
[ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
[ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core]
[ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core]
[ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core]
[ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core]
[ 605.573330] process_one_work+0x1db/0x380
[ 605.574144] worker_thread+0x4d/0x400
[ 605.574896] kthread+0x104/0x140
[ 605.577205] ret_from_fork+0x35/0x40
[ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
[ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830
[ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.582320] nvme D 0 14044 14043 0x00000000
[ 605.583424] Call Trace:
[ 605.583935] __schedule+0x2b9/0x6c0
[ 605.584625] schedule+0x42/0xb0
[ 605.585290] schedule_timeout+0x203/0x2f0
[ 605.588493] wait_for_completion+0xb1/0x120
[ 605.590066] __flush_work+0x123/0x1d0
[ 605.591758] __cancel_work_timer+0x10e/0x190
[ 605.593542] cancel_work_sync+0x10/0x20
[ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core]
[ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core]
[ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
[ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core]
[ 605.598320] dev_attr_store+0x17/0x30

Fix this by introducing a new state: NVME_CTRL_DELETING_NOIO, which indicates the phase of controller deletion where I/O cannot be allowed to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to be issued to the bottom device, and only after we flush the ana_work and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces) do we change the state to NVME_CTRL_DELETING_NOIO. Also, we prevent ana_work from re-firing by aborting early if we are not LIVE, so we should be safe here. In addition, change the transport drivers to follow the updated state machine.

Fixes: 0d0b660f ("nvme: add ANA support")
Reported-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 26 Jul, 2020 1 commit
-
-
Sagi Grimberg authored
If the controller dies exactly when we are receiving the icresp we hang because the icresp may never return. Make sure to set a high but finite limit.

Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 08 Jul, 2020 3 commits
-
-
Sagi Grimberg authored
If we have a long list of requests to send, signal the network stack that more is coming (MSG_MORE). If we have nothing else, signal MSG_EOR.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
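A sketch of the flag selection per send; MSG_MORE, MSG_EOR and MSG_DONTWAIT are real socket flags, and the predicate is a stand-in for the driver's queue check:

#include <linux/socket.h>

/* Illustrative: tell the stack whether more payload is queued right
 * behind the current send so it can build larger segments. */
static int demo_msg_flags(bool more_queued)
{
	int flags = MSG_DONTWAIT;

	if (more_queued)
		flags |= MSG_MORE;	/* more coming: allow coalescing */
	else
		flags |= MSG_EOR;	/* nothing else pending: push it out */

	return flags;
}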
-
Sagi Grimberg authored
blk-mq request plugging can improve the execution of our pipeline. When we queue a request we actually trigger our I/O worker thread, yielding a context switch by definition. However, if we know that there are more requests coming in the pipe, we are better off not triggering our I/O worker and only doing so for the last request in the batch (bd->last). By doing so, we improve efficiency by amortizing context switches over a batch of requests.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
The queue processing will splice to a queue-local list; this should alleviate some contention on the send_list lock, but it also prepares us for the next patch, where we look at these lists for the network stack flag optimization. Remove the queue lock as it is not used anymore.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
[hch: simplified a loop]
Signed-off-by: Christoph Hellwig <hch@lst.de>
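A generic sketch of the splice pattern; llist_del_all() and llist_entry() on the consumer side are real lockless-list APIs, while the structures are illustrative:

#include <linux/llist.h>
#include <linux/list.h>

struct demo_req {
	struct llist_node lentry;	/* producers add here locklessly */
	struct list_head entry;		/* consumer-private send list */
};

/* Illustrative consumer: atomically grab everything the producers queued
 * and splice it onto a list only the I/O context ever touches. */
static void demo_splice_to_local(struct llist_head *req_list,
				 struct list_head *send_list)
{
	struct llist_node *node = llist_del_all(req_list);

	while (node) {
		struct demo_req *req = llist_entry(node, struct demo_req, lentry);

		node = node->next;
		list_add_tail(&req->entry, send_list);
	}
}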
-
- 24 Jun, 2020 2 commits
-
-
Max Gurtovoy authored
Both the admin and I/O tagsets should be set according to the NUMA node of the controller.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Use the new blk_mq_complete_request_remote helper to avoid an indirect function call in the completion fast path.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 Jun, 2020 1 commit
-
-
Rikard Falkeborn authored
nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops are never modified and can be made const to allow the compiler to put them in read-only memory.

Before:
   text    data     bss     dec     hex filename
  53102    6885     576   60563    ec93 drivers/nvme/host/tcp.o

After:
   text    data     bss     dec     hex filename
  53422    6565     576   60563    ec93 drivers/nvme/host/tcp.o

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
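The change amounts to adding const to the ops definitions, roughly as below; the handler is a stub rather than the driver's real callbacks:

#include <linux/blk-mq.h>

static blk_status_t demo_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	return BLK_STS_OK;	/* stub */
}

/* const: never written at runtime, so the compiler can place the ops
 * table in read-only memory (the data-to-text shift shown above). */
static const struct blk_mq_ops demo_mq_ops = {
	.queue_rq = demo_queue_rq,
};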
-
- 28 May, 2020 5 commits
-
-
Christoph Hellwig authored
Add a helper to directly set the IP_TOS sockopt from kernel space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christoph Hellwig authored
Add a helper to directly set the TCP_SYNCNT sockopt from kernel space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christoph Hellwig authored
Add a helper to directly set the TCP_NODELAY sockopt from kernel space without going through a fake uaccess. Clean up the callers to avoid pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christoph Hellwig authored
Add a helper to directly set the SO_PRIORITY sockopt from kernel space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christoph Hellwig authored
Add a helper to directly set the SO_LINGER sockopt from kernel space, with onoff set to true and a linger time of 0, without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
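Taken together, the five helpers above let a kernel TCP user configure its socket without any fake uaccess; a hedged sketch of how a driver might call them (the option values are illustrative):

#include <linux/tcp.h>
#include <net/ip.h>
#include <net/sock.h>

/* Illustrative socket setup using the new kernel-space setters. */
static void demo_configure_socket(struct socket *sock)
{
	struct sock *sk = sock->sk;

	tcp_sock_set_syncnt(sk, 1);	/* give up quickly on dead targets */
	tcp_sock_set_nodelay(sk);	/* no Nagle batching, lower latency */
	sock_no_linger(sk);		/* close resets instead of lingering */
	sock_set_priority(sk, 0);	/* illustrative priority value */
	ip_sock_set_tos(sk, 0);		/* illustrative TOS value */
}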
-
- 27 May, 2020 1 commit
-
-
Sagi Grimberg authored
We can signal the stack that this is not the last page coming and the stack can build a larger tso segment, so go ahead and use it. Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- 09 May, 2020 3 commits
-
-
Sagi Grimberg authored
Today, nvme-tcp automatically schedules a send request to a workqueue context, which is one more context switch than we need in case the socket buffer is wide open. However, because we have async send activity (as a result of r2t, or write_space callbacks), we need to synchronize sends from possibly multiple contexts (ideally all running on the same cpu though). Thus, we only try to send directly from queue_rq if:

1. the send_list is empty
2. we can send it synchronously (i.e. not from the RX path)
3. we run on the same cpu as the queue->io_cpu, to avoid contention on the send operation.

Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
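A hedged sketch of that condition check; mutex_trylock(), raw_smp_processor_id(), queue_work_on() and system_wq are real APIs, while the parameters stand in for the driver's queue state:

#include <linux/mutex.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

/* Illustrative: send inline only when it is cheap and uncontended,
 * otherwise defer to the per-queue io_work context. */
static void demo_queue_request(struct mutex *send_mutex,
			       struct work_struct *io_work,
			       int io_cpu, bool sync, bool empty)
{
	if (io_cpu == raw_smp_processor_id() && sync && empty &&
	    mutex_trylock(send_mutex)) {
		/* ... send this one request directly on this CPU ... */
		mutex_unlock(send_mutex);
	} else {
		queue_work_on(io_cpu, system_wq, io_work);
	}
}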
-
Sagi Grimberg authored
When the user runs polled I/O, we shouldn't have to trigger the workqueue to generate the receive work upon the .data_ready upcall. This prevents a redundant context switch when the application is already polling for completions.

Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Sagi Grimberg authored
data_ready may be invoked from send context or from softirq, so we need bh locking for it.

Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
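A sketch of a bh-safe data_ready callback; sk_callback_lock, sk_user_data and the _bh lock variants are real, while the queue-to-work mapping is illustrative of the nvme-tcp pattern rather than the exact code:

#include <net/sock.h>
#include <linux/workqueue.h>

/* Illustrative .sk_data_ready callback: it may run from softirq or from
 * process (send) context, so take the callback lock with _bh variants. */
static void demo_data_ready(struct sock *sk)
{
	struct work_struct *io_work;

	read_lock_bh(&sk->sk_callback_lock);
	io_work = sk->sk_user_data;	/* illustrative: the queue's io_work */
	if (likely(io_work))
		schedule_work(io_work);
	read_unlock_bh(&sk->sk_callback_lock);
}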
-
- 01 Apr, 2020 1 commit
-
-
Sagi Grimberg authored
If the target misbehaves and sends us unexpected payload, we need to make sure to fail the controller and stop processing the input stream. We clear the rd_enabled flag and stop the io_work, but we may still requeue it if we still have pending sends, and then in the next invocation we will process the input stream, as the check is only in the .data_ready upcall. To fix this, we need to make sure not to self-requeue io_work upon a recv flow error.

This fixes the crash:

nvme nvme2: receive failed: -22
BUG: unable to handle page fault for address: ffffbeb5816c3b48
nvme_ns_head_make_request: 29 callbacks suppressed
block nvme0n5: no usable path - requeuing I/O
block nvme0n5: no usable path - requeuing I/O
block nvme0n7: no usable path - requeuing I/O
block nvme0n7: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n7: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1039157067 P4D 1039157067 PUD 103915a067 PMD 102719f067 PTE 0
Oops: 0000 [#1] SMP PTI
CPU: 8 PID: 411 Comm: kworker/8:1H Not tainted 5.3.0-40-generic #32~18.04.1-Ubuntu
Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015
Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
RIP: 0010:nvme_tcp_recv_skb+0x2ae/0xb50 [nvme_tcp]
RSP: 0018:ffffbeb5806cfd10 EFLAGS: 00010246
RAX: ffffbeb5816c3b48 RBX: 00000000000003d0 RCX: 0000000000000008
RDX: 00000000000003d0 RSI: 0000000000000001 RDI: ffff9a3040684b40
RBP: ffffbeb5806cfd90 R08: 0000000000000000 R09: ffffffff946e6900
R10: ffffbeb5806cfce0 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9a2ff86501c0 R14: 00000000000003d0 R15: ffff9a30b85f2798
FS: 0000000000000000(0000) GS:ffff9a30bf800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffbeb5816c3b48 CR3: 000000088400a006 CR4: 00000000003626e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tcp_read_sock+0x8c/0x290
 ? __release_sock+0x9d/0xe0
 ? nvme_tcp_write_space+0xb0/0xb0 [nvme_tcp]
 nvme_tcp_io_work+0x4b4/0x830 [nvme_tcp]
 ? finish_task_switch+0x163/0x270
 process_one_work+0x1fd/0x380
 worker_thread+0x34/0x410
 kthread+0x121/0x140
 ? process_one_work+0x3f0/0x3f0
 ? kthread_park+0xb0/0xb0
 ret_from_fork+0x35/0x40

Reported-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 31 Mar, 2020 2 commits
-
-
Sagi Grimberg authored
In error recovery we might be removing the queue so check we can actually poll before we do.

Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
We cannot look at blk_rq_payload_bytes without first checking that the request has mappable physical segments (i.e. blk_rq_nr_phys_segments(rq) != 0), and only then take the request payload bytes. This caused us to send a wrong SGL to the target, or even dereference a non-existing buffer in case we actually got to the data send sequence (if it was in-capsule).

Reported-by: Tony Asleson <tasleson@redhat.com>
Suggested-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
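The guard amounts to the following; blk_rq_nr_phys_segments() and blk_rq_payload_bytes() are real block layer helpers:

#include <linux/blkdev.h>

/* Only consult the payload size when the request actually has mappable
 * physical segments; otherwise treat it as a zero-length command. */
static unsigned int demo_request_data_len(struct request *rq)
{
	if (!blk_rq_nr_phys_segments(rq))
		return 0;

	return blk_rq_payload_bytes(rq);
}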
-
- 25 Mar, 2020 6 commits
-
-
Israel Rukshin authored
The transition to the LIVE state should not fail in case of a new controller. Moving to the DELETING state before nvme_tcp_create_ctrl() allocates all the resources may lead to a NULL dereference in the teardown flow (e.g., IO tagset, admin_q, connect_q).

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-
Israel Rukshin authored
Put the ctrl reference count in nvme_uninit_ctrl, as opposed to nvme_init_ctrl which takes it. This decreases the reference count at the core layer instead of decreasing it in each transport separately. Also move the call to nvme_uninit_ctrl in the PCI driver to after the calls to nvme_release_prp_pools and nvme_dev_unmap, in order to put the reference count after using the dev. This is safe because those functions use nvme_dev, which is freed only later in nvme_pci_free_ctrl.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-
Israel Rukshin authored
In case nvme_sysfs_delete() is called by the user before the ctrl reference count is taken, the ctrl may be freed during creation and cause the bug. Take the reference as soon as the controller is externally visible, which is done by cdev_device_add() in nvme_init_ctrl(). Also take the reference count at the core layer instead of taking it in each transport separately.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-
Sagi Grimberg authored
If we failed to receive data from the socket, don't try to further process it; we will for sure be handling a queue error at this point. While no issue was seen with the current behavior thus far, it's safer to cease socket processing if we detect an error.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-
Sagi Grimberg authored
Consolidate the request failure handling code to where it is being fetched (nvme_tcp_try_send).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-
Sagi Grimberg authored
Currently, queue io_cpu assignment is done sequentially for default, read and poll queues based on queue id. This causes misalignment between the context of the CPU initiating I/O and the I/O worker thread processing queued requests or completions. Modify the queue io_cpu assignment to take the queue map offsets into account. Each queue's io_cpu will start at zero for each queue map. This essentially aligns read/poll queues to start over the same range as default queues.

Testing performed by Mark with:
- ram device (nvmet)
- single CPU core (pinned)
- 100% 4k reads
- engine io_uring (not using sq_thread option)
- hipri flag set

Micro-benchmark results show a net gain of:
- increase of 18%-29% in IOPs
- reduction of 16%-22% in average latency
- reduction of 7%-23% in 99.99% latency

Baseline:
=========
QDepth/Batch | IOPs [k] | Avg. Lat [us] | 99.99% Lat [us]
-----------------------------------------------------------------
1/1          | 32.4     | 30.11         | 50.94
32/8         | 179      | 168.20        | 371

CPU alignment:
==============
QDepth/Batch | IOPs [k] | Avg. Lat [us] | 99.99% Lat [us]
-----------------------------------------------------------------
1/1          | 38.5     | 25.18         | 39.16
32/8         | 231      | 130.75        | 343

Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-