- 19 Dec, 2018 1 commit
yupeng authored
Export the disk name, queue id, sq_head, and sq_tail to a trace event in completion handling.

Usage example:

    cd /sys/kernel/debug/tracing/events/nvme/nvme_sq
    echo 'disk=="nvme1n1"' > filter
    echo 1 > enable
    cat /sys/kernel/debug/tracing/trace_pipe

Signed-off-by: yupeng <yupeng0921@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
[hch: slight formatting tweaks, use standard nvme tracepoint conventions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 18 Dec, 2018 2 commits
Christoph Hellwig authored
By duplicating the nvme_process_cq call in both branches we keep sparse's lock context checking happy, so do it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Christoph Hellwig authored
The block layer now enables polling support on a queue if nr_maps includes the poll map, so we should only set that if we actually support poll queues.

Fixes: 6544d229 ("block: enable polling by default if a poll map is initalized")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
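A rough sketch of the resulting logic; the io_queues bookkeeping and HCTX_TYPE_POLL index follow the driver and block layer conventions of that era, but treat the exact hunk as illustrative rather than a quote of the patch:

    /* Sketch: advertise the poll map only when poll queues were created. */
    static void nvme_set_nr_maps_sketch(struct nvme_dev *dev)
    {
        dev->tagset.nr_maps = 2;        /* HCTX_TYPE_DEFAULT + HCTX_TYPE_READ */
        if (dev->io_queues[HCTX_TYPE_POLL])
            dev->tagset.nr_maps++;      /* + HCTX_TYPE_POLL */
    }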
-
- 17 Dec, 2018 1 commit
Christoph Hellwig authored
Now that the block layer checks if a queue map has any queues inside it there is no more reason to duplicate the maps for the non-default types.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 Dec, 2018 1 commit
Jens Axboe authored
Guenter reported a boot hang issue on HPPA after we default to 0 poll queues. We have two issues in the queue count calculations:

1) We don't separate the poll queues from the read/write queues. This is important, since the former don't need interrupts.

2) The adjust logic is broken.

Adjust the poll queue count before doing nvme_calc_io_queues(). The poll queue count is only limited by the IO queue count we were able to get from the controller, not failures in the IRQ allocation loop. This leaves nvme_calc_io_queues() just adjusting the read/write queue map.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
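A simplified, self-contained model of the accounting described above; the function name and array layout are illustrative, not the driver's nvme_calc_io_queues():

    /* counts[0] = default (write) queues, counts[1] = read, counts[2] = poll */
    static void calc_io_queues_sketch(unsigned int nr_io_queues,
                                      unsigned int want_poll,
                                      unsigned int want_write,
                                      unsigned int counts[3])
    {
        /* Poll queues need no interrupts: carve them off first, but always
         * leave at least one interrupt-driven queue behind. */
        counts[2] = want_poll;
        if (counts[2] >= nr_io_queues)
            counts[2] = nr_io_queues ? nr_io_queues - 1 : 0;
        nr_io_queues -= counts[2];

        /* Whatever remains is split between the default and read maps. */
        if (want_write && want_write < nr_io_queues) {
            counts[0] = want_write;
            counts[1] = nr_io_queues - want_write;
        } else {
            counts[0] = nr_io_queues;
            counts[1] = 0;
        }
    }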
-
- 04 Dec, 2018 9 commits
Christoph Hellwig authored
This avoids having to have different mq_ops for different setups with or without poll queues.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Now that we can't poll regular, interrupt driven I/O queues there is almost nothing that can race with an interrupt. The only possible other contexts polling a CQ are the error handler and queue shutdown, and both are so far off in the slow path that we can simply use the big hammer of disabling interrupts. With that we can stop taking the cq_lock for normal queues.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
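A hedged sketch of the "big hammer" for those slow-path callers; nvme_process_cq() is shown with a simplified signature, and the real helper also has to handle poll queues that have no IRQ vector at all:

    static void nvme_poll_irqdisable_sketch(struct nvme_queue *nvmeq)
    {
        struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);

        /* Mask this queue's vector so the IRQ handler cannot race with us... */
        disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
        nvme_process_cq(nvmeq);     /* ...reap completions without the cq_lock... */
        enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
    }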
-
Christoph Hellwig authored
This is the last place outside of nvme_irq that handles CQEs from interrupt context, and thus is in the way of removing the cq_lock for normal queues, and avoiding lockdep warnings on the poll queues, for which we already take it without IRQ disabling.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Pass the opcode for the delete SQ/CQ command as an argument instead of the somewhat confusing pass loop.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
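Roughly what the new calling convention looks like; submission and error handling are elided and the helper name is illustrative:

    static void nvme_del_queue_sketch(struct nvme_queue *nvmeq, u8 opcode)
    {
        struct nvme_command cmd = { };

        /* opcode is nvme_admin_delete_sq or nvme_admin_delete_cq */
        cmd.delete_queue.opcode = opcode;
        cmd.delete_queue.qid = cpu_to_le16(nvmeq->qid);

        /* ...submit cmd on the admin queue... */
    }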
-
Christoph Hellwig authored
We have three places that can poll for I/O completions on a normal interrupt-enabled queue. All of them are in slow path code, so consolidate them to a single helper that uses spin_lock_irqsave and removes the fast path cqe_pending check.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
This will allow us to simplify both the regular NVMe interrupt handler and the upcoming aio poll code. In addition to that the separate queues are generally a good idea for performance reasons.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use a bit flag to mark if the SQ was allocated from the CMB, and clean up the surrounding code a bit.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
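A sketch of the idea; the flag bit and helper names follow the commit text rather than quoting the upstream hunk, so treat them as illustrative:

    #define NVMEQ_SQ_CMB_SKETCH 1   /* set if the SQ memory came from the CMB */

    static void nvme_free_sq_sketch(struct nvme_queue *nvmeq)
    {
        /* CMB-backed SQs are not ours to dma_free_coherent(). */
        if (test_bit(NVMEQ_SQ_CMB_SKETCH, &nvmeq->flags))
            return;

        dma_free_coherent(nvmeq->dev->dev,
                          nvmeq->q_depth * sizeof(struct nvme_command),
                          nvmeq->sq_cmds, nvmeq->sq_dma_addr);
    }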
-
Christoph Hellwig authored
This gets rid of all the messing with cq_vector and the ->polled field by using an atomic bitop to mark the queue enabled or not.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Having another indirect call in the fast path doesn't really help in our post-spectre world. Also having too many queue types is just going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
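For reference, the queue-type ordering this refers to lives in the block layer (include/linux/blk-mq.h in kernels of that era) and looks roughly like this; the comments are paraphrased:

    enum hctx_type {
        HCTX_TYPE_DEFAULT,  /* everything not explicitly marked otherwise */
        HCTX_TYPE_READ,     /* queues used only for reads */
        HCTX_TYPE_POLL,     /* polled, interrupt-less I/O */

        HCTX_MAX_TYPES,
    };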
-
- 29 Nov, 2018 1 commit
Jens Axboe authored
Split the command submission and the SQ doorbell ring, and add the doorbell ring as our ->commit_rqs() hook. This allows a list of requests to be issued, with nvme only writing the SQ update when it's necessary. This is more efficient if we have lists of requests to issue, particularly on virtualized hardware, where writing the SQ doorbell is more expensive than on real hardware. For those cases, performance increases of 2-3x have been observed. The use case for this is plugged IO, where blk-mq flushes a batch of requests at a time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
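A sketch of the split; the field names (sq_lock, sq_tail, q_db) follow the driver's conventions, but the last-request handling in the submission path is left out, so this is the shape of the change rather than the actual hunk:

    static void nvme_write_sq_db_sketch(struct nvme_queue *nvmeq)
    {
        writel(nvmeq->sq_tail, nvmeq->q_db);    /* ring the SQ doorbell */
    }

    /* ->commit_rqs(): called by blk-mq once a plugged batch has been queued */
    static void nvme_commit_rqs_sketch(struct blk_mq_hw_ctx *hctx)
    {
        struct nvme_queue *nvmeq = hctx->driver_data;

        spin_lock(&nvmeq->sq_lock);
        nvme_write_sq_db_sketch(nvmeq);
        spin_unlock(&nvmeq->sq_lock);
    }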
-
- 26 Nov, 2018 2 commits
Jens Axboe authored
We always pass in -1 now and none of the callers use the tag value, so remove the parameter.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we want to support async IO polling, then we have to allow finding completions that aren't just for the one we are looking for. Always pass in -1 to the mq_ops->poll() helper, and have that return how many events were found in this poll loop.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 19 Nov, 2018 1 commit
Jens Axboe authored
We need a better way of configuring this, and given that polling is (still) a bit niche, let's default to using 0 poll queues. That way we'll have the same read/write/poll behavior as 4.20, and users that want to test/use polling are required to do manual configuration of the number of poll queues.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 Nov, 2018 1 commit
Jens Axboe authored
If we have separate poll queues, we know that they aren't using interrupts. Hence we don't need to disable interrupts around finding completions. Provide a separate set of blk_mq_ops for such devices.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 Nov, 2018 1 commit
Jens Axboe authored
At least on SPARC, if MSI/MSI-X isn't supported, we get EINVAL if we ask for more than one vector. This isn't covered by our ENOSPC check. If we get EINVAL, decrease our ask to just one vector, instead of bailing out in error.

Fixes: 3b6592f7 ("nvme: utilize two queue maps, one for reads and one for writes")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
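A sketch of the fallback; the real driver goes through pci_alloc_irq_vectors_affinity() with queue-map affinity descriptors, so take this as the shape of the retry rather than the actual code:

    static int nvme_setup_irqs_sketch(struct pci_dev *pdev, int nr_vecs)
    {
        int ret;

        ret = pci_alloc_irq_vectors(pdev, 1, nr_vecs,
                                    PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
        if (ret == -EINVAL && nr_vecs > 1) {
            /* No MSI/MSI-X at all (e.g. some SPARC systems): a multi-vector
             * request fails with -EINVAL, so fall back to a single shared
             * vector instead of erroring out. */
            ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
        }
        return ret;     /* number of vectors obtained, or a negative errno */
    }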
-
- 14 Nov, 2018 1 commit
Jens Axboe authored
NVMe always asks for io_queues + 1 worth of IRQ vectors, which means that even when we scale all the way down, we still ask for 2 vectors and get -ENOSPC in return if the system can't support more than 1. Getting just 1 vector is fine, it just means that we'll have 1 IO queue and 1 admin queue, with a shared vector between them. Check for this case and don't add our + 1 if it happens.

Fixes: 3b6592f7 ("nvme: utilize two queue maps, one for reads and one for writes")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 07 Nov, 2018 3 commits
Jens Axboe authored
Adds support for defining a variable number of poll queues, currently configurable with the 'poll_queues' module parameter. Defaults to a single poll queue. And now we finally have poll support without triggering interrupts!

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
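Wiring for such a knob typically looks like the following; the parameter name comes from the commit text, while the permission bits and the absence of a validation callback are assumptions for this sketch:

    static unsigned int poll_queues = 1;
    module_param(poll_queues, uint, 0644);
    MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");

It would then be set at load time, e.g. with "modprobe nvme poll_queues=4".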
-
Jens Axboe authored
NVMe does round-robin between queues by default, which means that sharing a queue map for both reads and writes can be problematic in terms of read servicing. It's much easier to flood the queue with writes and reduce the read servicing. Implement two queue maps, one for reads and one for writes. The write queue count is configurable through the 'write_queues' parameter.

By default, we retain the previous behavior of having a single queue set, shared between reads and writes. Setting 'write_queues' to a non-zero value will create two queue sets, one for reads and one for writes, the latter using the configurable number of queues (hardware queue counts permitting).

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
This is in preparation for allowing multiple sets of maps per queue, if so desired.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 02 Nov, 2018 1 commit
Keith Busch authored
The nvme pci driver had been adding its CMB resource to the P2P DMA subsystem every time on a controller reset. This results in the following warning:

    ------------[ cut here ]------------
    nvme 0000:00:03.0: Conflicting mapping in same section
    WARNING: CPU: 7 PID: 81 at kernel/memremap.c:155 devm_memremap_pages+0xa6/0x380
    ...
    Call Trace:
     pci_p2pdma_add_resource+0x153/0x370
     nvme_reset_work+0x28c/0x17b1 [nvme]
     ? add_timer+0x107/0x1e0
     ? dequeue_entity+0x81/0x660
     ? dequeue_entity+0x3b0/0x660
     ? pick_next_task_fair+0xaf/0x610
     ? __switch_to+0xbc/0x410
     process_one_work+0x1cf/0x350
     worker_thread+0x215/0x3d0
     ? process_one_work+0x350/0x350
     kthread+0x107/0x120
     ? kthread_park+0x80/0x80
     ret_from_fork+0x1f/0x30
    ---[ end trace f7ea76ac6ee72727 ]---
    nvme nvme0: failed to register the CMB

This patch fixes this by registering the CMB with P2P only once.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 Oct, 2018 1 commit
Chaitanya Kulkarni authored
This is a cleanup patch and doesn't change any functionality. It removes the duplicate call to blk_integrity_rq() in nvme_map_data().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 17 Oct, 2018 4 commits
Logan Gunthorpe authored
For P2P requests, we must use the pci_p2pmem_map_sg() function instead of the dma_map_sg functions. With that, we can then indicate PCI_P2P support in the request queue. For this, we create an NVME_F_PCI_P2P flag which tells the core to set QUEUE_FLAG_PCI_P2P in the request queue.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
-
Logan Gunthorpe authored
Register the CMB buffer as p2pmem and use the appropriate allocation functions to create and destroy the IO submission queues. If the CMB supports WDS and RDS, publish it for use as P2P memory by other devices.

Kernels without CONFIG_PCI_P2PDMA will also no longer support NVMe CMB. However, seeing that the main use-case for the CMB is P2P operations, this seems like a reasonable dependency.

We drop the __iomem safety on the buffer seeing that, by convention, it's safe to directly access memory mapped by memremap()/devm_memremap_pages(). Architectures where this is not safe will not be supported by memremap() and therefore will not support PCI P2P and will have no support for the CMB.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
A removal waits for the reset_work to complete. If a surprise removal occurs around the same time as an error triggered controller reset, and reset work happened to dispatch a command to the removed controller, the command won't be recovered since the timeout work doesn't do anything during error recovery. We wouldn't want to wait for timeout handling anyway, so this patch fixes this by disabling the controller and killing admin queues prior to syncing with the reset_work.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Bart Van Assche authored
This patch avoids a complaint from the kernel-doc tool about the nvme_suspend_queue() function header when building with W=1.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 02 Oct, 2018 1 commit
Oza Pawandeep authored
After bfcb79fc ("PCI/ERR: Run error recovery callbacks for all affected devices"), AER errors are always cleared by the PCI core and drivers don't need to do it themselves. Remove calls to pci_cleanup_aer_uncorrect_error_status() from device driver error recovery functions.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
[bhelgaas: changelog, remove PCI core changes, remove unused variables]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
-
- 28 Aug, 2018 1 commit
Michal Wnukowski authored
In many architectures loads may be reordered with older stores to different locations. In the nvme driver the following two operations could be reordered:

 - Write shadow doorbell (dbbuf_db) into memory.
 - Read EventIdx (dbbuf_ei) from memory.

This can result in a potential race condition between the driver and the VM host processing requests (if the given virtual NVMe controller has support for the shadow doorbell). If that occurs, then the NVMe controller may decide to wait for an MMIO doorbell from the guest operating system, and the guest driver may decide not to issue an MMIO doorbell on any of the subsequent commands.

This issue is a purely timing-dependent one, so there is no easy way to reproduce it. Currently the easiest known approach is to run "Oracle IO Numbers" (orion), which is shipped with Oracle DB:

    orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
        concat -write 40 -duration 120 -matrix row -testname nvme_test

Where nvme_test is a .lun file that contains a list of NVMe block devices to run the test against. Limiting the number of vCPUs assigned to a given VM instance seems to increase the chances of this bug occurring. On a test environment with a VM that got 4 NVMe drives and 1 vCPU assigned, the virtual NVMe controller hang could be observed within 10-20 minutes. That corresponds to about 400-500k IO operations processed (or about 100GB of IO reads/writes). The orion tool was used as validation and set to run in a loop for 36 hours (equivalent to pushing 550M IO operations). No issues were observed. That suggests that the patch fixes the issue.

Fixes: f9f38e33 ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <wnukowski@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
[hch: updated changelog and comment a bit]
Signed-off-by: Christoph Hellwig <hch@lst.de>
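The core of the fix is a full memory barrier between the shadow-doorbell store and the EventIdx load; a simplified sketch (the real helper additionally decides whether the MMIO doorbell is still required):

    static u32 nvme_dbbuf_update_sketch(u32 *dbbuf_db, const u32 *dbbuf_ei,
                                        u32 new_db)
    {
        *dbbuf_db = new_db;     /* publish the new doorbell value in memory */

        /* Order the store above before the EventIdx load below; without the
         * barrier the load can be satisfied with a stale value, and the
         * guest driver and the virtual controller end up waiting on each
         * other forever. */
        mb();

        return *dbbuf_ei;       /* caller compares against this EventIdx */
    }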
-
- 30 Jul, 2018 1 commit
Max Gurtovoy authored
Also moved the logic of the remapping to the nvme core driver instead of implementing it in the nvme pci driver. This way all the other nvme transport drivers will benefit from it (in case they'll implement metadata support).

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 23 Jul, 2018 1 commit
Sagi Grimberg authored
We will need to reference the controller at setup and completion time for tracing and future traffic based keep alive support.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 12 Jul, 2018 1 commit
Keith Busch authored
The nvme driver specific structures need to be initialized prior to enabling the generic controller so we can unwind on failure without using the reference counting callbacks so that 'probe' and 'remove' can be symmetric. The newly added iod_mempool is the only resource that was being allocated out of order, and a failure there would leak the generic controller memory. This patch just moves that allocation above the controller initialization.

Fixes: 943e942e ("nvme-pci: limit max IO size and segments to avoid high order allocations")
Reported-by: Weiping Zhang <zwp10758@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 21 Jun, 2018 2 commits
Jens Axboe authored
nvme requires an sg table allocation for each request. If the request is large, then the allocation can become quite large. For instance, with our default software settings of 1280KB IO size, we'll need 10248 bytes of sg table. That turns into a 2nd order allocation, which we can't always guarantee. If we fail the allocation, blk-mq will retry it later. But there's no guarantee that we'll EVER be able to allocate that much contiguous memory.

Limit the IO size such that we never need more than a single page of memory. That's a lot faster and more reliable. Then back that allocation with a mempool, so that we know we'll always be able to succeed the allocation at some point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
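A sketch of the backstop; the segment limit and mempool setup are simplified (the real driver sizes the pool from its per-request iod layout), so treat the constants and names as assumptions:

    #include <linux/mempool.h>
    #include <linux/scatterlist.h>

    /* Enough scatterlist entries to fill one page, and no more. */
    #define NVME_SG_PER_PAGE_SKETCH (PAGE_SIZE / sizeof(struct scatterlist))

    static mempool_t *iod_mempool_sketch;

    static int nvme_create_iod_mempool_sketch(void)
    {
        size_t alloc_size = NVME_SG_PER_PAGE_SKETCH * sizeof(struct scatterlist);

        /* A single-element reserve is enough to guarantee forward progress. */
        iod_mempool_sketch = mempool_create_kmalloc_pool(1, alloc_size);
        return iod_mempool_sketch ? 0 : -ENOMEM;
    }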
-
Jianchao Wang authored
There is a race between nvme_remove and nvme_reset_work that can lead to an io hang.

    nvme_remove                     nvme_reset_work
                                    -> nvme_remove_dead_ctrl
                                      -> nvme_dev_disable
                                        -> quiesce request_queue
                                      -> queue remove_work
    -> cancel_work_sync reset_work
    -> nvme_remove_namespaces
      -> splice ctrl->namespaces
                                    nvme_remove_dead_ctrl_work
                                    -> nvme_kill_queues
    -> nvme_ns_remove                  do nothing
      -> blk_cleanup_queue
        -> blk_freeze_queue

Finally, the request_queue is in a quiesced state when we wait for the freeze, so we get an io hang here. To fix it, move nvme_kill_queues from nvme_remove_dead_ctrl_work to nvme_remove_dead_ctrl.

Suggested-by: Keith Busch <keith.busch@linux.intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 08 Jun, 2018 3 commits
Keith Busch authored
A controller reset after a run time change of the CMB module parameter breaks the driver. An 'on -> off' will have the driver use NULL for the host memory queue, and 'off -> on' will use a mismatched queue depth between the device and the host. We could fix both, but there isn't really a good reason to change this at run time anyway, compared to at module load time, so this patch makes the parameter read-only after modprobe.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
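The change boils down to the sysfs permission bits on the module parameter; a sketch, with the parameter name and default as used by the NVMe PCI driver:

    static bool use_cmb_sqes = true;
    module_param(use_cmb_sqes, bool, 0444); /* read-only; was 0644 (writable at runtime) */
    MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes");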
-
Keith Busch authored
This patch ensures the nvme namespace request queues are not quiesced on a surprise removal. It's possible the queues were previously killed in a failed reset, so the queues need to be unquiesced to ensure all requests are flushed to completion.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
The controller is required to disable its host memory buffer use on controller reset. We don't need to submit an admin command to delete it, so this patch skips sending that command so we don't need to worry about handling a timeout.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-