- 06 May, 2019 9 commits
-
-
Igor Konopko authored
Currently in case of read errors, bi_status is not set properly which leads to returning inproper data to layers above. This patch fix that by setting proper status in case of read errors. Also remove unnecessary warn_once(), which does not make sense in that place, since user bio is not used for interation with drive and thus bi_status will not be set here. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
L2P table can be huge in many cases, since it typically requires 1GB of DRAM for 1TB of drive. When there is not enough memory available, OOM killer turns on and kills random processes, which can be very annoying for users. This patch changes the flag for L2P table allocation on order to handle this situation in more user friendly way. GFP_KERNEL and __GPF_HIGHMEM are default flags used in parameterless vmalloc() calls, so they are also keeped in that patch. Additionally __GFP_NOWARN flag is added in order to hide very long dmesg warn in case of the allocation failures. The most important flag introduced in that patch is __GFP_RETRY_MAYFAIL, which would cause allocator to try use free memory and if not available to drop caches, but not to run OOM killer. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
The sector bits in the erase command may be uninitialized are uninitialized, causing the erase LBA to be unaligned to the chunk size. This is unexpected situation, since erase shall always be chunk aligned based on OCSSD the 2.0 specification. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
In the pblk_put_line_back function, a race condition with __pblk_map_invalidate can make a line not part of any lists. Fix gc_list by resetting it to null fixes the above issue. Fixes: a4bd217b ("lightnvm: physical block device (pblk) target") Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
Currently when we fail on rq data allocation in gc, it skips moving active data and moves line straigt to its free state. Losing user data in the process. Move the data allocation to an earlier phase of GC, where we can still fail gracefully by moving line back to the closed state. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
smeta_ssec field in pblk_line is never used after it was replaced by the function pblk_line_smeta_start(). Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
Currently L2P map size is calculated based on the total number of available sectors, which is redundant, since it contains mapping for overprovisioning as well (11% by default). Change this size to the real capacity and thus reduce the memory footprint significantly - with default op value it is approx. 110MB of DRAM less for every 1TB of media. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
A line is left unsigned to the blocks lists in case pblk_gc_line returns an error. This moves the line back to be appropriate list, which can then be picked up by the garbage collector. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
Fixes the GC error case when moving a line back to closed state while releasing additional references. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 04 May, 2019 7 commits
-
-
Ming Lei authored
Now freeing hw queue resource is moved to hctx's release handler, we don't need to worry about the race between blk_cleanup_queue and run queue any more. So don't drain in-progress dispatch in blk_cleanup_queue(). This is basically revert of c2856ae2 ("blk-mq: quiesce queue before freeing queue"). Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
hctx is always released after requeue is freed. With holding queue's kobject refcount, it is safe for driver to run queue, so one run queue might be scheduled after blk_sync_queue() is done. So moving the cancel of hctx->run_work into blk_mq_hw_sysfs_release() for avoiding run released queue. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
In normal queue cleanup path, hctx is released after request queue is freed, see blk_mq_release(). However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because of hw queues shrinking. This way is easy to cause use-after-free, because: one implicit rule is that it is safe to call almost all block layer APIs if the request queue is alive; and one hctx may be retrieved by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(); finally use-after-free is triggered. Fixes this issue by always freeing hctx after releasing request queue. If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce a per-queue list to hold them, then try to resuse these hctxs if numa node is matched. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Split blk_mq_alloc_and_init_hctx into two parts, and one is blk_mq_alloc_hctx() for allocating all hctx resources, another is blk_mq_init_hctx() for initializing hctx, which serves as counter-part of blk_mq_exit_hctx(). Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org Cc: Martin K . Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit 45a9c9d9 ("blk-mq: Fix a use-after-free") fixes this issue exactly. However, that commit introduces another issue. Before 45a9c9d9, we are allowed to run queue during cleaning up queue if the queue's kobj refcount is held. After that commit, queue can't be run during queue cleaning up, otherwise oops can be triggered easily because some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue(). We have invented ways for addressing this kind of issue before, such as: 8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done") c2856ae2 ("blk-mq: quiesce queue before freeing queue") But still can't cover all cases, recently James reports another such kind of issue: https://marc.info/?l=linux-scsi&m=155389088124782&w=2 This issue can be quite hard to address by previous way, given scsi_run_queue() may run requeues for other LUNs. Fixes the above issue by freeing hctx's resources in its release handler, and this way is safe becasue tags isn't needed for freeing such hctx resource. This approach follows typical design pattern wrt. kobject's release handler. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reported-by: James Smart <james.smart@broadcom.com> Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free") Cc: stable@vger.kernel.org Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
With holding queue's kobject refcount, it is safe for driver to schedule requeue. However, blk_mq_kick_requeue_list() may be called after blk_sync_queue() is done because of concurrent requeue activities, then requeue work may not be completed when freeing queue, and kernel oops is triggered. So moving the cancel of requeue_work into blk_mq_release() for avoiding race between requeue and freeing queue. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Just like aio/io_uring, we need to grab 2 refcount for queuing one request, one is for submission, another is for completion. If the request isn't queued from plug code path, the refcount grabbed in generic_make_request() serves for submission. In theroy, this refcount should have been released after the sumission(async run queue) is done. blk_freeze_queue() works with blk_sync_queue() together for avoiding race between cleanup queue and IO submission, given async run queue activities are canceled because hctx->run_work is scheduled with the refcount held, so it is fine to not hold the refcount when running the run queue work function for dispatch IO. However, if request is staggered into plug list, and finally queued from plug code path, the refcount in submission side is actually missed. And we may start to run queue after queue is removed because the queue's kobject refcount isn't guaranteed to be grabbed in flushing plug list context, then kernel oops is triggered, see the following race: blk_mq_flush_plug_list(): blk_mq_sched_insert_requests() insert requests to sw queue or scheduler queue blk_mq_run_hw_queue Because of concurrent run queue, all requests inserted above may be completed before calling the above blk_mq_run_hw_queue. Then queue can be freed during the above blk_mq_run_hw_queue(). Fixes the issue by grab .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is safe because the queue is absolutely alive before inserting request. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 02 May, 2019 2 commits
-
-
git://git.infradead.org/nvmeJens Axboe authored
Pull NVMe updates from Christoph. * 'nvme-5.2' of git://git.infradead.org/nvme: nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting nvme-tcp: fix possible null deref on a timed out io queue connect
-
Raul E Rangel authored
The comment was out of date. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Raul E Rangel <rrangel@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 01 May, 2019 13 commits
-
-
Sagi Grimberg authored
When we iterate on the discovery subsystem controllers we need to protect against concurrent mutations to it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
Most command aren't PCIe specific, so move the size checking for them to core.c Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Minwoo Im authored
struct common_command provides a common structure for NVMe-oF command format. It also needs to be checked for unintended size growth. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Minwoo Im authored
All the NVMe command has 64bytes fixed size so that it has been assured with BUILD_BUG_ON(). The remaining command structures in linux/nvme.h also need to be checked here. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Minwoo Im authored
Variable "n" will be assigned once kstrtoint() succeeds, otherwise it will not be referred because kstrtoint() will return an error which means go out from this function. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
Just like IO queues, the admin queue also will not be restarted after a controller shutdown. Unquiesce this queue so that we do not block request dispatch on a permanently disabled controller. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
We do not restart a controller in a deleting state for timeout errors. When in this state, unblock potential request dispatchers with failed completions by shutting down the controller on timeout detection. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Klaus Birkelund Jensen authored
The shortcut for single segment SGL requests did not set the PSDT field to mark the request as using SGLs. Fixes: 29791057 ("nvme-pci: optimize mapping single segment requests using SGLs") Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
If the bio is moved to a different queue via blk_steal_bios() and the original queue is destroyed in nvme_remove_ns() we'll be ending with a crash in bio_endio() as the mempool for the split bio bvecs had already been destroyed. So split the bio using the original queue (which will remain during the lifetime of the bio) before sending it down to the underlying device. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
If I/O queue connect times out, we might have freed the queue socket already, so check for that on the error path in nvme_tcp_start_queue. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Jens Axboe authored
It's not used outside this file. Fixes: 63120731 ("bcache: fix failure in journal relplay") Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 30 Apr, 2019 9 commits
-
-
Christoph Hellwig authored
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Various block layer files do not have any licensing information at all. Add SPDX tags for the default kernel GPLv2 license to those. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
This file has no copyright notice, but was added as part of a commit adding another file using the default kernel GPLv2 license. Add a matching SPDX tag. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The file already has the correct SPDX header. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
All these files have some form of the usual GPLv2 or later boilerplate. Switch them to use SPDX tags instead. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
All these files have some form of the usual GPLv2 boilerplate. Switch them to use SPDX tags instead. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Share the bi_size update by moving the done label up, and duplicate the bv_len update in the two callers to get rid of the bvec_merge label. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
We are never called with file system pages by defintions for the passthrough interface, and we also never undo any addition later these days. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The same page optimization is a rather odd corner case, which is not used outside bio.c and which really should not be used outside of bio.c either - we have better highlevel helpers like the rq/bio mapping helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-