- 05 Jan, 2018 33 commits
-
-
Damien Le Moal authored
Introduce zone write locking to avoid write request reordering with zoned block devices. This is achieved using a finer selection of the next request to dispatch: 1) Any non-write request is always allowed to proceed. 2) Any write to a conventional zone is always allowed to proceed. 3) For a write to a sequential zone, the zone lock is first checked. a) If the zone is not locked, the write is allowed to proceed after its target zone is locked. b) If the zone is locked, the write request is skipped and the next request in the dispatch queue tested (back to step 1). For a write request that has locked its target zone, the zone is unlocked either when the request completes and the method deadline_request_completed() is called, or when the request is requeued using the method deadline_add_request(). Requests targeting a locked zone are always left in the scheduler queue to preserve the initial write order. If no write request can be dispatched, allow reads to be dispatched even if the write batch is not done. If the device used is not a zoned block device, or if zoned block device support is disabled, this patch does not modify deadline behavior. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Avoid directly referencing the next_rq and fifo_list arrays using the helper functions deadline_next_request() and deadline_fifo_request() to facilitate changes in the dispatch request selection in deadline_dispatch_requests() for zoned block devices. While at it, also remove the unnecessary forward declaration of the function deadline_move_request(). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Introduce zone write locking to avoid write request reordering with zoned block devices. This is achieved using a finer selection of the next request to dispatch: 1) Any non-write request is always allowed to proceed. 2) Any write to a conventional zone is always allowed to proceed. 3) For a write to a sequential zone, the zone lock is first checked. a) If the zone is not locked, the write is allowed to proceed after its target zone is locked. b) If the zone is locked, the write request is skipped and the next request in the dispatch queue tested (back to step 1). For a write request that has locked its target zone, the zone is unlocked either when the request completes with a call to the method deadline_request_completed() or when the request is requeued using dd_insert_request(). Requests targeting a locked zone are always left in the scheduler queue to preserve the lba ordering for write requests. If no write request can be dispatched, allow reads to be dispatched even if the write batch is not done. If the device used is not a zoned block device, or if zoned block device support is disabled, this patch does not modify mq-deadline behavior. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Avoid directly referencing the next_rq and fifo_list arrays using the helper functions deadline_next_request() and deadline_fifo_request() to facilitate changes in the dispatch request selection in __dd_dispatch_request() for zoned block devices. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Components relying only on the request_queue structure for accessing block devices (e.g. I/O schedulers) have a limited knowledged of the device characteristics. In particular, the device capacity cannot be easily discovered, which for a zoned block device also result in the inability to easily know the number of zones of the device (the zone size is indicated by the chunk_sectors field of the queue limits). Introduce the nr_zones field to the request_queue structure to simplify access to this information. Also, add the bitmap seq_zone_bitmap which indicates which zones of the device are sequential zones (write preferred or write required) and the bitmap seq_zones_wlock which indicates if a zone is write locked, that is, if a write request targeting a zone was dispatched to the device. These fields are initialized by the low level block device driver (sd.c for ZBC/ZAC disks). They are not initialized by stacking drivers (device mappers) handling zoned block devices (e.g. dm-linear). Using this, I/O schedulers can introduce zone write locking to control request dispatching to a zoned block device and avoid write request reordering by limiting to at most a single write request per zone outside of the scheduler at any time. Based on previous patches from Damien Le Moal. Signed-off-by: Christoph Hellwig <hch@lst.de> [Damien] * Fixed comments and identation in blkdev.h * Changed helper functions * Fixed this commit message Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Call bdev_get_queue(bdev) after bdev->bd_disk has been initialized instead of just before that pointer has been initialized. This patch avoids that the following command pktsetup 1 /dev/sr0 triggers the following kernel crash: BUG: unable to handle kernel NULL pointer dereference at 0000000000000548 IP: pkt_setup_dev+0x2db/0x670 [pktcdvd] CPU: 2 PID: 724 Comm: pktsetup Not tainted 4.15.0-rc4-dbg+ #1 Call Trace: pkt_ctl_ioctl+0xce/0x1c0 [pktcdvd] do_vfs_ioctl+0x8e/0x670 SyS_ioctl+0x3c/0x70 entry_SYSCALL_64_fastpath+0x23/0x9a Reported-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Fixes: commit ca18d6f7 ("block: Make most scsi_req_init() calls implicit") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Commit 523e1d39 ("block: make gendisk hold a reference to its queue") modified add_disk() and disk_release() but did not update any of the error paths that trigger a put_disk() call after disk->queue has been assigned. That introduced the following behavior in the pktcdvd driver if pkt_new_dev() fails: Kernel BUG at 00000000e98fd882 [verbose debug info unavailable] Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL, fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev() error path. Fixes: commit 523e1d39 ("block: make gendisk hold a reference to its queue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Cc: <stable@vger.kernel.org> # v3.2 Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
Shorten function to simply return the value of the if statement. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Since pblk registers its own block device, the iostat accounting is not automatically done for us. Therefore, add the necessary accounting logic to satisfy the iostat interface. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Add the instance name to the information printed out on target creation. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Refactor the way we free the write buffer to ensure that all entries get freed in case of an error on the init sequence. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
When creating the write thread, ensure that the kthread has been created before initializing the timer responsible from kicking it. Otherwise, if the kthread creation fails or gets killed from used space, we risk kicking an empty thread structure. Also, since the kthread creation can be interrupted form user space, adapt the error path to not report an error when this happens, since it is intentional that the instance creation is aborted. Signed-off-by: Javier González <javier@cnexlabs.com> Updated source to reflect the new timer_setup API. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
On scan recovery, reads can fail. This happens because the first page for each line is read in order to determined if the line has been used (and thus needs to be recovered), or not. This can lead to "empty page" read errors. Since these errors are normal, do not log them, as they are confusing when reviewing the logs. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
On recovery, do not stop L2P recovery if reads report high ECC error as the data is still available. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Allow to set the over-provision percentage on target creation. In case that the value is not provided, fall back to the default value set by the target. In pblk, set the default OP to 11% of the total size of the device Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Until now, pblk's rate-limiter has used a heuristic to reserve space for GC I/O given that the over-provision area was fixed. In preparation for allowing to define the over-provision area on target creation, define a dedicated free_block counter in the rate-limiter to track the number of blocks being used for user data. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
pblk_gc_stop just sets pblk->gc->gc_active to zero, ignoring the flush parameter. This is plain confusing, so remove the function and set the gc active flag at the call points instead. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Unless we protect flush pointer updates with a lock, we risk resetting new flush points before we've synced all sectors up to that point. This patch protects new flush points with the same spin lock that is being held when advancing the sync pointer and resetting completed flush points. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Move completion of syncs and clearing of flush points to the write completion path - this ensures that the data has been comitted to the media before completing bios containing syncs. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Sync point is a really confusing name for keeping track of the last entry that needs to be flushed so change the name to to flush_point instead. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Currently pblk_recov_get_lba list does two separate things: it checks the consistency of the emeta and extracts the lba list. This patch separates the consistency check to make the code easier to read and to prepare for version checks of the line emeta persistent data format version. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Through time, we have generated some redundant helper functions. Refactor them to eliminate redundant and unnecessary code. Also, reorder them to improve readability Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Until now, target unique naming is only guaranteed per device. This is ok from a lightnvm perspective, but not from a sysfs one, since groups will collide regardless of the underlying device. Check that names are unique across all lightnvm-capable devices. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Refactor target type lookup to use/not use locks explicitly instead of using a hidden parameter to make the function locking. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
Prepare for the 2.0 revision by adapting the geometry structures to coexist with the 1.2 revision. Signed-off-by: Matias Bjørling <m@bjorling.me> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
The lower page table is unused. All page tables reported by 1.2 devices are all reporting a sequential 1:1 page mapping. This is also not used going forward with the 2.0 revision. Signed-off-by: Matias Bjørling <m@bjorling.me> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Remove the wait filed in nvm_rq. It is not used anymore, as targets rely on the functionality provided by the LightNVM subsystem when sending sync I/O. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
Now that rrpc have been removed. Also remove the hybrid 1.2 support from the core. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
Now that rrpc has been removed, the only users of the ppa helpers is pblk. However, pblk already defines similar functions. Switch pblk to use the internal ones, and remove the generic ppa helpers. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
The hybrid mode for 1.2 revision was deprecated, and have no users. Remove to make it easier to move to the 2.0 revision. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
With rrpc to be removed, the null_blk lightnvm support is no longer functional. Remove the lightnvm implementation and maybe add it to another module in the future if someone takes on the challenge. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Liu Bo authored
Commit de148297 ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops") changes the function to return bool type, and then commit 1f460b63 ("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE") changes it back to void, but the comment remains. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 22 Dec, 2017 1 commit
-
-
Jens Axboe authored
Even with a number of waitqueues, we can get into a situation where we are heavily contended on the waitqueue lock. I got a report on spc1 where we're spending seconds doing this. Arguably the use case is nasty, I reproduce it with one device and 1000 threads banging on the device. But that doesn't mean we shouldn't be handling it better. What ends up happening is that a thread will fail to get a tag, add itself to the waitqueue, and subsequently get woken up when a tag is freed - only to find itself going back to sleep on the waitqueue. Instead of waking all threads, use an exclusive wait and wake up our sbitmap batch count instead. This seems to work well for me (massive improvement for this use case), and it survives basic testing. But I haven't fully verified it yet. An additional improvement is running the queue and checking for a new tag BEFORE needing to add ourselves to the waitqueue. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 Dec, 2017 1 commit
-
-
Linus Torvalds authored
-
- 17 Dec, 2017 5 commits
-
-
Kees Cook authored
This reverts commit 04e35f44. SELinux runs with secureexec for all non-"noatsecure" domain transitions, which means lots of processes end up hitting the stack hard-limit change that was introduced in order to fix a race with prlimit(). That race fix will need to be redesigned. Reported-by: Laura Abbott <labbott@redhat.com> Reported-by: Tomáš Trnka <trnka@scm.com> Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull Page Table Isolation (PTI) v4.14 backporting base tree from Ingo Molnar: "This tree contains the v4.14 PTI backport preparatory tree, which consists of four merges of upstream trees and 7 cherry-picked commits, which the upcoming PTI work depends on" NOTE! The resulting tree is exactly the same as the original base tree (ie the diff between this commit and its immediate first parent is empty). The only reason for this merge is literally to have a common point for the actual PTI changes so that the commits can be shared in both the 4.15 and 4.14 trees. * 'WIP.x86-pti.base-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/kasan: Don't use vmemmap_populate() to initialize shadow locking/barriers: Convert users of lockless_dereference() to READ_ONCE() locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE() bpf: fix build issues on um due to mising bpf_perf_event.h perf/x86: Enable free running PEBS for REGS_USER/INTR x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD x86/cpufeature: Add User-Mode Instruction Prevention definitions
-
Linus Torvalds authored
Merge branch 'WIP.x86-pti.base.prep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull Page Table Isolation (PTI) preparatory tree from Ingo Molnar: "This does a rename to free up linux/pti.h to be used by the upcoming page table isolation feature" * 'WIP.x86-pti.base.prep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: drivers/misc/intel/pti: Rename the header file to free up the namespace
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull timer fix from Thomas Gleixner: "A single bugfix which prevents arbitrary sigev_notify values in posix-timers" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-timer: Properly check sigevent->sigev_notify
-
git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds authored
Pull dmaengine fixes from Vinod Koul: "This time consisting of fixes in a bunch of drivers and the dmatest module: - Fix for disable clk on error path in fsl-edma driver - Disable clk fail fix in jz4740 driver - Fix long pending bug in dmatest driver for dangling pointer - Fix potential NULL pointer dereference in at_hdmac driver - Error handling path in ioat driver" * tag 'dmaengine-fix-4.15-rc4' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: fsl-edma: disable clks on all error paths dmaengine: jz4740: disable/unprepare clk if probe fails dmaengine: dmatest: move callback wait queue to thread context dmaengine: at_hdmac: fix potential NULL pointer dereference in atc_prep_dma_interleaved dmaengine: ioat: Fix error handling path
-