- 30 Jun, 2017 10 commits
-
-
Javier González authored
Do bitmap checks only when debug mode is enabled. The line bitmap used for mapping to physical addresses is fairly large (~512KB), and it is expensive to do these checks on the fast path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
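A minimal sketch of the idea (function, field, and config names are assumptions, not the actual pblk code): the bitmap walk is compiled in only for debug builds, so production builds skip it on the fast path.

	static inline void pblk_check_line_map(struct pblk_line *line, u64 paddr)
	{
	#ifdef CONFIG_NVM_DEBUG
		/* expensive ~512KB bitmap consistency check, debug builds only */
		WARN_ON(!test_bit(paddr, line->map_bitmap));
	#endif
	}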
-
Javier González authored
When a read is directed to the cache, we risk that the lba has been updated between the time we made the L2P table lookup and the time we actually read from the cache. We intentionally do not hold the L2P lock so as not to block other threads. While strict ordering is not a guarantee at this level (unless REQ_FLUSH has been previously issued), we have seen that some databases that have recently implemented direct I/O support issue metadata reads very close to the writes, without issuing an fsync in between. An easy way to support them is to make an extra effort and check the L2P map right before reading from the cache. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
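A hedged sketch of the re-check (the helper names are assumptions): after deciding to serve the read from the write cache, look the lba up in the L2P table once more and only copy from the cache if the mapping still points to the same cache entry.

	static bool pblk_cache_read_still_valid(struct pblk *pblk, sector_t lba,
						struct ppa_addr cache_ppa)
	{
		/* re-read the L2P entry; it may have moved since the first lookup */
		struct ppa_addr cur = pblk_lookup_l2p(pblk, lba);	/* assumed helper */

		/* serve from cache only if the lba still maps to the same cache entry */
		return pblk_addr_in_cache(cur) && pblk_ppa_comp(cur, cache_ppa);	/* assumed helpers */
	}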
-
Javier González authored
Add a sanity check to the pblk initialization sequence in order to ensure that enough LUNs have been allocated to store the line metadata. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
When removing a pblk instance, pad the current line using asynchronous I/O. This reduces the removal time from ~1 minute in the worst case to a couple of seconds. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
For now, we allocate a per-I/O buffer for GC data. Since the potential size of the buffer is 256KB and GC is not in the fast path, do this allocation with vmalloc. This puts less pressure on the memory allocator at no performance cost. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
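A sketch of the allocation change (the buffer size name is an assumption): since GC runs off the fast path, a vmalloc'ed, virtually contiguous buffer is sufficient and avoids large contiguous page allocations.

	#include <linux/vmalloc.h>

		void *gc_data;

		gc_data = vmalloc(gc_rq_size);		/* up to ~256KB, size name assumed */
		if (!gc_data)
			return -ENOMEM;

		/* ... read the victim line's valid sectors into gc_data and rewrite them ... */

		vfree(gc_data);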
-
Javier González authored
Fix bad metadata buffer assignments introduced when refactoring the metadata write path. Fixes: dd2a4343 ("lightnvm: pblk: sched. metadata on write thread") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
When user threads place data into the write buffer, they reserve space and do the memory copy outside of the lock. As a consequence, when the write thread starts persisting data, there is a chance that the data has not been copied yet. In this case, avoid polling, and schedule before retrying. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
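A sketch of the retry policy in the write thread (flag and field names are assumptions): if the reserved ring-buffer entry has not been fully copied yet, give up the CPU instead of busy-polling the flag.

		/* user thread reserved the entry but may not have copied data yet */
		while (!(READ_ONCE(entry->w_ctx.flags) & PBLK_WRITTEN_DATA)) {
			io_schedule();	/* yield instead of spinning on the flag */
		}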
-
Javier González authored
Remove unused variable. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Prevent pblk->lines from being double freed in case of an error during pblk initialization. Fixes: dd2a4343 ("lightnvm: pblk: sched. metadata on write thread") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Use the right types and conversions on le64 variables. Reported by sparse. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
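For illustration, the pattern sparse expects (the struct shown is illustrative, not the actual pblk on-media layout): keep on-media fields typed as __le64 and convert explicitly at the boundaries.

	struct line_smeta_sketch {			/* illustrative layout */
		__le64 crc;
		__le64 seq_nr;
	};

		smeta->seq_nr = cpu_to_le64(line->seq_nr);	/* CPU -> on-media */
		line->seq_nr  = le64_to_cpu(smeta->seq_nr);	/* on-media -> CPU */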
-
- 29 Jun, 2017 2 commits
-
-
Valentin Rothberg authored
Remove the dead build rule for drivers/nvme/host/scsi.c, which was removed by commit ("nvme: Remove SCSI translations"). Signed-off-by: Valentin Rothberg <vrothberg@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Max Gurtovoy authored
This patch performs sequential mapping between CPUs and queues. In case the system has more CPUs than HWQs, there are still CPUs left to map to HWQs. On a hyperthreaded system, map the unmapped CPUs and their siblings to the same HWQ. This actually fixes a bug where HWQs were left unmapped on a system with 2 sockets, 18 cores per socket, and 2 threads per core (72 CPUs in total) running NVMEoF (which opens up to a maximum of 64 HWQs).

Performance results running fio (72 jobs, 128 iodepth) using null_blk, reported as with patch / without patch:

bs     IOPS(read submit_queues=72)  IOPS(write submit_queues=72)  IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
-----  ---------------------------  ----------------------------  ---------------------------  ----------------------------
512    4890.4K/4723.5K              4524.7K/4324.2K               4280.2K/4264.3K              3902.4K/3909.5K
1k     4910.1K/4715.2K              4535.8K/4309.6K               4296.7K/4269.1K              3906.8K/3914.9K
2k     4906.3K/4739.7K              4526.7K/4330.6K               4301.1K/4262.4K              3890.8K/3900.1K
4k     4918.6K/4730.7K              4556.1K/4343.6K               4297.6K/4264.5K              3886.9K/3893.9K
8k     4906.4K/4748.9K              4550.9K/4346.7K               4283.2K/4268.8K              3863.4K/3858.2K
16k    4903.8K/4782.6K              4501.5K/4233.9K               4292.3K/4282.3K              3773.1K/3773.5K
32k    4885.8K/4782.4K              4365.9K/4184.2K               4307.5K/4289.4K              3780.3K/3687.3K
64k    4822.5K/4762.7K              2752.8K/2675.1K               4308.8K/4312.3K              2651.5K/2655.7K
128k   2388.5K/2313.8K              1391.9K/1375.7K               2142.8K/2152.2K              1395.5K/1374.2K

Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
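A deliberately simplified sketch of the sequential mapping (it omits the thread-sibling grouping described above and uses assumed names): the first nr_queues CPUs each get their own hardware queue, and any remaining CPUs wrap back onto existing queues so none are left unmapped.

	static void map_cpus_to_queues_sequential(unsigned int *mq_map,
						  unsigned int nr_cpus,
						  unsigned int nr_queues)
	{
		unsigned int cpu;

		for (cpu = 0; cpu < nr_cpus; cpu++)
			mq_map[cpu] = cpu % nr_queues;	/* fold leftover CPUs onto existing HWQs */
	}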
-
- 28 Jun, 2017 19 commits
-
-
Sagi Grimberg authored
We can deadlock if we get a device removal event on a queue which is already in the process of destroying its cm_id, as this destruction blocks until all events on the cm_id have drained. On the other hand, we cannot guarantee that rdma_destroy_id has been invoked, as we only have an indication that the queue disconnect flow has been queued (the queue state is updated before the release work is queued). So, leave all the queue removal to a separate ib_client to avoid this deadlock, as ib_client device removal runs in a different context than the cm_id itself. Reported-by: Shiraz Saleem <shiraz.saleem@intel.com> Tested-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
Currently, the fc transport invokes nvme_fc_error_recovery() on every io in which the transport detects an error. Which means: a) it's really noisy on large io loads that all get hit by a link down; b) we repeatedly call nvme_stop_queues() even though queues are stopped upon the first error or as one of the first steps of reset_work. Correct by: errors are only meaningful if the controller is in the LIVE state, so enact the reset_work only if LIVE; if called repeatedly, the state will have already transitioned. There's no need to stop the queues here; let the first steps of reset_work do the queue stopping. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
If an nvme command is issued with an opcode that is not supported by the target (example: opcode 21 - detach namespace), the target crashes due to a null pointer. nvmet_req_init() detects the bad opcode and immediately calls the nvme command done routine with an error status, allowing the transport to send the response. However, the FC transport was aborting the command on error, so the abort freed the lldd pointer, but the rsp transmit path referenced it after the free. Fix by removing the abort call on nvmet_req_init() failure. The completion response will be sent with an error status code. As the completion path will terminate the io, ensure the data_sg lists show an unused state so that teardown paths are successful. Signed-off-by: Paul Ely <Paul.Ely@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
If a controller connection is attempted (say to a subsystem that does not exist), the first attempt errors out. If another connect is attempted, it crashes. The issue is that the prior controller has yet to execute its final put, so it is still on the lists; however, the opts pointers on it have been cleared, causing the crash when they are referenced. The fix is to add the missing put after the nvme_uninit_ctrl() call on attachment failure. Signed-off-by: Paul Ely <Paul.Ely@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
Per the recommendation by Sagi on http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html, the wait for io aborts to complete was converted from an msleep loop to using a struct completion. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
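A sketch of the conversion (field names are assumptions): the teardown path blocks on a completion that the last aborted I/O signals, instead of polling with msleep().

		bool wait;

		/* teardown path: wait for outstanding aborted I/Os, if any */
		spin_lock_irq(&ctrl->lock);
		wait = (ctrl->iocnt != 0);
		spin_unlock_irq(&ctrl->lock);
		if (wait)
			wait_for_completion(&ctrl->ioabort_done);

		/* per-I/O completion path: the last aborted I/O wakes the waiter */
		spin_lock_irq(&ctrl->lock);
		if (--ctrl->iocnt == 0)
			complete(&ctrl->ioabort_done);
		spin_unlock_irq(&ctrl->lock);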
-
James Smart authored
The current fc transport code, on io termination, calls nvme_cleanup_cmd() followed by the transport dma unmap routine, which also calls nvme_cleanup_cmd(). This means two kfrees occur on the same address, wreaking havoc and resulting in odd data errors, effectively corruption. Fix by removing the extraneous double call; the call now occurs only in teardown paths, as part of the dma unmap routine. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the Identify controller data structures. Use this NQN for the subsysnqn sysfs attribute by storing it in the nvme_ctrl structure after verifying it. For older controllers we generate a "fake" NQN per non-normative text in the NVMe 1.3 spec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
While an NVMe Namespace is somewhat similar to a SCSI Logical Unit (and not a Logical Unit Number anyway), there are subtle differences. Remove the misleading comment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grmberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Kai-Heng Feng authored
A user reports that APST is enabled even when the NVMe device is quirked or when the option "default_ps_max_latency_us=0" is given. The current logic will not set APST if the device is quirked, but the NVMe device in question will enable APST automatically. Separate the logic for "APST is supported" from "APST should be enabled", so we can use the latter to explicitly disable APST at initialization. BugLink: https://bugs.launchpad.net/bugs/1699004 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
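A rough sketch of the split logic (treat the field and quirk names as assumptions): "supported" only reflects the controller capability, while "enable" also honours the quirk and the latency limit, and the Set Features command is issued either way so a device that powered up with APST on gets it explicitly turned off.

		bool apst_supported = ctrl->apsta;	/* controller implements APST */
		bool apst_enable = apst_supported &&
				   !(ctrl->quirks & NVME_QUIRK_NO_APST) &&
				   ctrl->ps_max_latency_us != 0;

		if (!apst_supported)
			return;

		/* when apst_enable is false, still issue Set Features (APST) with an
		 * empty table and the enable bit clear to explicitly disable APST */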
-
Sagi Grimberg authored
No need to differentiate fabrics from pci/loop, also lower it to 32 as we don't really need 256 inflight admin commands. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Johannes Thumshirn authored
Currently we have no way to define a stable host-id; we always use the one which is randomly generated when we add the host, or use the default host. Provide a "hostid=%s" option for user-space to pass in a persistent host-id which overrides the randomly generated one. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
The SCSI-to-NVMe translations were added to assist storage applications utilizing SG_IO transitioning to NVMe. It was always recommended, however, to use native NVMe for device management as too much is lost in translation and the maintenance burden in keeping this kludgey layer around has been neglected such that much of the translations are completely broken. This patch removes SG_IO handling from NVMe to avoid any confusion regarding maintenance support for this interface. The config option for NVMe SCSI emulation has been disabled by default since 4.5. The driver has supported native nvme user commands since the beginning, and native tooling is publicly available for use or as reference for anyone writing their own tools, so there's no excuse for hanging onto a broken crutch. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Sagi Grimberg authored
Given that the code is simple enough, it seems better than passing a tag by reference for each call site; also, we can now get rid of __nvme_process_cq. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Sagi Grimberg authored
Also, maintain a consumed counter to rely on for doorbell and cqe_seen update instead of directly relying on the cq head and phase. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Sagi Grimberg authored
Makes the code slightly more readable. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Sagi Grimberg authored
This is a nice abstraction of the actual mechanics involved. Note the change: we now call it after assigning nvmeq->cq_head, to avoid passing it as an argument. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Some architectures (at least PPC) don't like get/put_user with 64-bit types on a 32-bit system. Use the variably sized copy to/from user variants instead. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: c75b1d94 ("fs: add fcntl() interface for setting/getting write life time hints") Signed-off-by: Jens Axboe <axboe@kernel.dk>
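A sketch of the fix pattern for the write-hint fcntl path (variable names assumed): the size-parameterized copy helpers work for 64-bit quantities on every 32-bit architecture.

		u64 hint;

		/* instead of: if (get_user(hint, (u64 __user *)argp)) ... */
		if (copy_from_user(&hint, argp, sizeof(hint)))
			return -EFAULT;

		/* instead of: if (put_user(hint, (u64 __user *)argp)) ... */
		if (copy_to_user(argp, &hint, sizeof(hint)))
			return -EFAULT;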
-
- 27 Jun, 2017 9 commits
-
-
Julia Lawall authored
Drop static on a local variable, when the variable is initialized before any use on every possible execution path through the function. The static has no benefit, and dropping it reduces the code size.

The semantic patch that fixes this problem is as follows (http://coccinelle.lip6.fr/):

// <smpl>
@bad exists@
position p;
identifier x;
type T;
@@
static T x@p;
...
x = <+...x...+>

@@
identifier x;
expression e;
type T;
position p != bad.p;
@@
-static T x@p;
... when != x
    when strict
?x = e;
// </smpl>

The change in code size is indicated by the following output from the size command.

before:
   text    data     bss     dec     hex filename
  67299    2291    1056   70646   113f6 drivers/block/drbd/drbd_nl.o

after:
   text    data     bss     dec     hex filename
  67283    2291    1056   70630   113e6 drivers/block/drbd/drbd_nl.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
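For illustration (not the drbd code itself), this is the transformation the semantic patch performs on a local variable that is always written before it is read; do_work() and handle() are hypothetical helpers.

	/* before: the static is pointless for a variable assigned on every path */
	static void example(void)
	{
		static int err;

		err = do_work();
		handle(err);
	}

	/* after: a plain automatic variable; the generated code is smaller */
	static void example(void)
	{
		int err;

		err = do_work();
		handle(err);
	}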
-
Paolo Valente authored
This commit fixes a bug triggered by a non-trivial sequence of events. These events are briefly described in the next two paragraphs. The impatient, or those who are familiar with queue merging and splitting, can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a bfq_queue that is the result of the merge of two or more bfq_queues, BFQ checks whether the shared bfq_queue has become seeky (i.e., whether too many random I/O requests have arrived for the bfq_queue; if the device is non-rotational, then random requests must also be small for the bfq_queue to be tagged as seeky). If the shared bfq_queue is actually detected as seeky, then a split occurs: the bfq I/O context of the process that has issued the request is redirected from the shared bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the shared bfq_queue actually happens to be shared only by one process (because of previous splits), then no new bfq_queue is created: the state of the shared bfq_queue is just changed from shared to non-shared.

Regardless of whether a brand new non-shared bfq_queue is created, or the pre-existing shared bfq_queue is just turned into a non-shared bfq_queue, several parameters of the non-shared bfq_queue are set (restored) to the original values they had when the bfq_queue associated with the bfq I/O context of the process (that has just issued an I/O request) was merged with the shared bfq_queue. One of these parameters is the weight-raising state.

If, on the split of a shared bfq_queue, 1) a pre-existing shared bfq_queue is turned into a non-shared bfq_queue; 2) the previously shared bfq_queue happens to be busy; 3) the weight-raising state of the previously shared bfq_queue happens to change; then the number of weight-raised busy queues changes. The field wr_busy_queues must then be updated accordingly, but such an update was missing. This commit adds the missing update. Reported-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
BLK_BOUNCE_ANY is the default now, so the call is superfluous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Now that all queue allocators come without a bounce limit by default, dm doesn't have to override this anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Instead, move it to the callers; those that either don't use bio_data() or page_address(), or are specific to architectures that do not support highmem, are skipped. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
And just move it into scsi_transport_sas, which needs it because low-level drivers directly dereference bio_data, and into blk_init_queue_node, which will need a further push into the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
For historical reasons we default to bouncing highmem pages for all block queues. But the blk-mq drivers are easy to audit to ensure that we don't need this - scsi and mtip32xx set explicit limits and everyone else doesn't have any particular ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
We only call blk_queue_bounce for request-based drivers, so stop messing with it for make_request based drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
It is only used inside the bounce code, and open-coding it makes it more obvious what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-