1. 06 Jan, 2018 19 commits
  2. 05 Jan, 2018 21 commits
    • Paolo Valente's avatar
      block, bfq: remove batches of confusing ifdefs · 9b25bd03
      Paolo Valente authored
      Commit a33801e8 ("block, bfq: move debug blkio stats behind
      CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
      one reported in [1], plus a similar one in another function. This
      commit removes both batches, in the way suggested in [1].
      
      [1] https://www.spinics.net/lists/linux-block/msg20043.html
      
      Fixes: a33801e8 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Tested-by: default avatarLuca Miccio <lucmiccio@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9b25bd03
    • Paolo Valente's avatar
      block, bfq: consider also past I/O in soft real-time detection · a34b0244
      Paolo Valente authored
      BFQ privileges the I/O of soft real-time applications, such as video
      players, to guarantee to these application a high bandwidth and a low
      latency. In this respect, it is not easy to correctly detect when an
      application is soft real-time. A particularly nasty false positive is
      that of an I/O-bound application that occasionally happens to meet all
      requirements to be deemed as soft real-time. After being detected as
      soft real-time, such an application monopolizes the device. Fortunately,
      BFQ will realize soon that the application is actually not soft
      real-time and suspend every privilege. Yet, the application may happen
      again to be wrongly detected as soft real-time, and so on.
      
      As highlighted by our tests, this problem causes BFQ to occasionally
      fail to guarantee a high responsiveness, in the presence of heavy
      background I/O workloads. The reason is that the background workload
      happens to be detected as soft real-time, more or less frequently,
      during the execution of the interactive task under test. To give an
      idea, because of this problem, Libreoffice Writer occasionally takes 8
      seconds, instead of 3, to start up, if there are sequential reads and
      writes in the background, on a Kingston SSDNow V300.
      
      This commit addresses this issue by leveraging the following facts.
      
      The reason why some applications are detected as soft real-time despite
      all BFQ checks to avoid false positives, is simply that, during high
      CPU or storage-device load, I/O-bound applications may happen to do
      I/O slowly enough to meet all soft real-time requirements, and pass
      all BFQ extra checks. Yet, this happens only for limited time periods:
      slow-speed time intervals are usually interspersed between other time
      intervals during which these applications do I/O at a very high speed.
      To exploit these facts, this commit introduces a little change, in the
      detection of soft real-time behavior, to systematically consider also
      the recent past: the higher the speed was in the recent past, the
      later next I/O should arrive for the application to be considered as
      soft real-time. At the beginning of a slow-speed interval, the minimum
      arrival time allowed for the next I/O usually happens to still be so
      high, to fall *after* the end of the slow-speed period itself. As a
      consequence, the application does not risk to be deemed as soft
      real-time during the slow-speed interval. Then, during the next
      high-speed interval, the application cannot, evidently, be deemed as
      soft real-time (exactly because of its speed), and so on.
      
      This extra filtering proved to be rather effective: in the above test,
      the frequency of false positives became so low that the start-up time
      was 3 seconds in all iterations (apart from occasional outliers,
      caused by page-cache-management issues, which are out of the scope of
      this commit, and cannot be solved by an I/O scheduler).
      Tested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a34b0244
    • Angelo Ruocco's avatar
      block, bfq: remove superfluous check in queue-merging setup · 4403e4e4
      Angelo Ruocco authored
      When two or more processes do I/O in a way that the their requests are
      sequential in respect to one another, BFQ merges the bfq_queues associated
      with the processes. This way the overall I/O pattern becomes sequential,
      and thus there is a boost in througput.
      These cooperating processes usually start or restart to do I/O shortly
      after each other. So, in order to avoid merging non-cooperating processes,
      BFQ ensures that none of these queues has been in weight raising for too
      long.
      
      In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged
      only shortly after being created", BFQ checks whether any queue (and not
      only weight-raised ones) is doing I/O continuously from too long to be
      merged.
      
      This new additional check makes the first one useless: a queue doing
      I/O from long enough, if being weight-raised, is also a queue in
      weight raising for too long to be merged. Accordingly, this commit
      removes the first check.
      Signed-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4403e4e4
    • Paolo Valente's avatar
      block, bfq: let a queue be merged only shortly after starting I/O · 7b8fa3b9
      Paolo Valente authored
      In BFQ and CFQ, two processes are said to be cooperating if they do
      I/O in such a way that the union of their I/O requests yields a
      sequential I/O pattern. To get such a sequential I/O pattern out of
      the non-sequential pattern of each cooperating process, BFQ and CFQ
      merge the queues associated with these processes. In more detail,
      cooperating processes, and thus their associated queues, usually
      start, or restart, to do I/O shortly after each other. This is the
      case, e.g., for the I/O threads of KVM/QEMU and of the dump
      utility. Basing on this assumption, this commit allows a bfq_queue to
      be merged only during a short time interval (100ms) after it starts,
      or re-starts, to do I/O.  This filtering provides two important
      benefits.
      
      First, it greatly reduces the probability that two non-cooperating
      processes have their queues merged by mistake, if they just happen to
      do I/O close to each other for a short time interval. These spurious
      merges cause loss of service guarantees. A low-weight bfq_queue may
      unjustly get more than its expected share of the throughput: if such a
      low-weight queue is merged with a high-weight queue, then the I/O for
      the low-weight queue is served as if the queue had a high weight. This
      may damage other high-weight queues unexpectedly.  For instance,
      because of this issue, lxterminal occasionally took 7.5 seconds to
      start, instead of 6.5 seconds, when some sequential readers and
      writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
      reason is that the bfq_queues associated with some of the readers or
      the writers were merged with the high-weight queues of some processes
      that had to do some urgent but little I/O. The readers then exploited
      the inherited high weight for all or most of their I/O, during the
      start-up of terminal. The filtering introduced by this commit
      eliminated any outlier caused by spurious queue merges in our start-up
      time tests.
      
      This filtering also provides a little boost of the throughput
      sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
      once a bfq_queue cannot be merged any longer, this commit makes BFQ
      stop updating the data needed to handle merging for the queue.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7b8fa3b9
    • Angelo Ruocco's avatar
      block, bfq: check low_latency flag in bfq_bfqq_save_state() · 1be6e8a9
      Angelo Ruocco authored
      A just-created bfq_queue will certainly be deemed as interactive on
      the arrival of its first I/O request, if the low_latency flag is
      set. Yet, if the queue is merged with another queue on the arrival of
      its first I/O request, it will not have the chance to be flagged as
      interactive. Nevertheless, if the queue is then split soon enough, it
      has to be flagged as interactive after the split.
      
      To handle this early-merge scenario correctly, BFQ saves the state of
      the queue, on the merge, as if the latter had already been deemed
      interactive. So, if the queue is split soon, it will get
      weight-raised, because the previous state of the queue is resumed on
      the split.
      
      Unfortunately, in the act of saving the state of the newly-created
      queue, BFQ doesn't check whether the low_latency flag is set, and this
      causes early-merged queues to be then weight-raised, on queue splits,
      even if low_latency is off. This commit addresses this problem by
      adding the missing check.
      Signed-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1be6e8a9
    • Paolo Valente's avatar
      block, bfq: add missing rq_pos_tree update on rq removal · 05e90283
      Paolo Valente authored
      If two processes do I/O close to each other, then BFQ merges the
      bfq_queues associated with these processes, to get a more sequential
      I/O, and thus a higher throughput.  In this respect, to detect whether
      two processes are doing I/O close to each other, BFQ keeps a list of
      the head-of-line I/O requests of all active bfq_queues.  The list is
      ordered by initial sectors, and implemented through a red-black tree
      (rq_pos_tree).
      
      Unfortunately, the update of the rq_pos_tree was incomplete, because
      the tree was not updated on the removal of the head-of-line I/O
      request of a bfq_queue, in case the queue did not remain empty. This
      commit adds the missing update.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      05e90283
    • Paolo Valente's avatar
      block, bfq: increase threshold to deem I/O as random · f0ba5ea2
      Paolo Valente authored
      If two processes do I/O close to each other, i.e., are cooperating
      processes in BFQ (and CFQ'S) nomenclature, then BFQ merges their
      associated bfq_queues, so as to get sequential I/O from the union of
      the I/O requests of the processes, and thus reach a higher
      throughput. A merged queue is then split if its I/O stops being
      sequential. In this respect, BFQ deems the I/O of a bfq_queue as
      (mostly) sequential only if less than 4 I/O requests are random, out
      of the last 32 requests inserted into the queue.
      
      Unfortunately, extensive testing (with the interleaved_io benchmark of
      the S suite [1], and with real applications spawning cooperating
      processes) has clearly shown that, with such a low threshold, only a
      rather low I/O throughput may be reached when several cooperating
      processes do I/O. In particular, the outcome of each test run was
      bimodal: if queue merging occurred and was stable during the test,
      then the throughput was close to the peak rate of the storage device,
      otherwise the throughput was arbitrarily low (usually around 1/10 of
      the peak rate with a rotational device). The probability to get the
      unlucky outcomes grew with the number of cooperating processes: it was
      already significant with 5 processes, and close to one with 7 or more
      processes.
      
      The cause of the low throughput in the unlucky runs was that the
      merged queues containing the I/O of these cooperating processes were
      soon split, because they contained more random I/O requests than those
      tolerated by the 4/32 threshold, but
      - that I/O would have however allowed the storage device to reach
        peak throughput or almost peak throughput;
      - in contrast, the I/O of these processes, if served individually
        (from separate queues) yielded a rather low throughput.
      
      So we repeated our tests with increasing values of the threshold,
      until we found the minimum value (19) for which we obtained maximum
      throughput, reliably, with at least up to 9 cooperating
      processes. Then we checked that the use of that higher threshold value
      did not cause any regression for any other benchmark in the suite [1].
      This commit raises the threshold to such a higher value.
      
      [1] https://github.com/Algodev-github/SSigned-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f0ba5ea2
    • Damien Le Moal's avatar
      deadline-iosched: Introduce zone locking support · 8dc8146f
      Damien Le Moal authored
      Introduce zone write locking to avoid write request reordering with
      zoned block devices. This is achieved using a finer selection of the
      next request to dispatch:
      1) Any non-write request is always allowed to proceed.
      2) Any write to a conventional zone is always allowed to proceed.
      3) For a write to a sequential zone, the zone lock is first checked.
         a) If the zone is not locked, the write is allowed to proceed after
            its target zone is locked.
         b) If the zone is locked, the write request is skipped and the next
            request in the dispatch queue tested (back to step 1).
      
      For a write request that has locked its target zone, the zone is
      unlocked either when the request completes and the method
      deadline_request_completed() is called, or when the request is requeued
      using the method deadline_add_request().
      
      Requests targeting a locked zone are always left in the scheduler queue
      to preserve the initial write order. If no write request can be
      dispatched, allow reads to be dispatched even if the write batch is not
      done.
      
      If the device used is not a zoned block device, or if zoned block device
      support is disabled, this patch does not modify deadline behavior.
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8dc8146f
    • Damien Le Moal's avatar
      deadline-iosched: Introduce dispatch helpers · c117bac7
      Damien Le Moal authored
      Avoid directly referencing the next_rq and fifo_list arrays using the
      helper functions deadline_next_request() and deadline_fifo_request() to
      facilitate changes in the dispatch request selection in
      deadline_dispatch_requests() for zoned block devices.
      
      While at it, also remove the unnecessary forward declaration of the
      function deadline_move_request().
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c117bac7
    • Damien Le Moal's avatar
      mq-deadline: Introduce zone locking support · 5700f691
      Damien Le Moal authored
      Introduce zone write locking to avoid write request reordering with
      zoned block devices. This is achieved using a finer selection of the
      next request to dispatch:
      1) Any non-write request is always allowed to proceed.
      2) Any write to a conventional zone is always allowed to proceed.
      3) For a write to a sequential zone, the zone lock is first checked.
         a) If the zone is not locked, the write is allowed to proceed after
            its target zone is locked.
         b) If the zone is locked, the write request is skipped and the next
            request in the dispatch queue tested (back to step 1).
      
      For a write request that has locked its target zone, the zone is
      unlocked either when the request completes with a call to the method
      deadline_request_completed() or when the request is requeued using
      dd_insert_request().
      
      Requests targeting a locked zone are always left in the scheduler queue
      to preserve the lba ordering for write requests. If no write request
      can be dispatched, allow reads to be dispatched even if the write batch
      is not done.
      
      If the device used is not a zoned block device, or if zoned block device
      support is disabled, this patch does not modify mq-deadline behavior.
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5700f691
    • Damien Le Moal's avatar
      mq-deadline: Introduce dispatch helpers · bf09ce56
      Damien Le Moal authored
      Avoid directly referencing the next_rq and fifo_list arrays using the
      helper functions deadline_next_request() and deadline_fifo_request() to
      facilitate changes in the dispatch request selection in
      __dd_dispatch_request() for zoned block devices.
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarBart Van Assche <Bart.VanAssche@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bf09ce56
    • Christoph Hellwig's avatar
      block: introduce zoned block devices zone write locking · 6cc77e9c
      Christoph Hellwig authored
      Components relying only on the request_queue structure for accessing
      block devices (e.g. I/O schedulers) have a limited knowledged of the
      device characteristics. In particular, the device capacity cannot be
      easily discovered, which for a zoned block device also result in the
      inability to easily know the number of zones of the device (the zone
      size is indicated by the chunk_sectors field of the queue limits).
      
      Introduce the nr_zones field to the request_queue structure to simplify
      access to this information. Also, add the bitmap seq_zone_bitmap which
      indicates which zones of the device are sequential zones (write
      preferred or write required) and the bitmap seq_zones_wlock which
      indicates if a zone is write locked, that is, if a write request
      targeting a zone was dispatched to the device. These fields are
      initialized by the low level block device driver (sd.c for ZBC/ZAC
      disks). They are not initialized by stacking drivers (device mappers)
      handling zoned block devices (e.g. dm-linear).
      
      Using this, I/O schedulers can introduce zone write locking to control
      request dispatching to a zoned block device and avoid write request
      reordering by limiting to at most a single write request per zone
      outside of the scheduler at any time.
      
      Based on previous patches from Damien Le Moal.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      [Damien]
      * Fixed comments and identation in blkdev.h
      * Changed helper functions
      * Fixed this commit message
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6cc77e9c
    • Bart Van Assche's avatar
      pktcdvd: Fix a recently introduced NULL pointer dereference · 882d4171
      Bart Van Assche authored
      Call bdev_get_queue(bdev) after bdev->bd_disk has been initialized
      instead of just before that pointer has been initialized. This patch
      avoids that the following command
      
      pktsetup 1 /dev/sr0
      
      triggers the following kernel crash:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
      IP: pkt_setup_dev+0x2db/0x670 [pktcdvd]
      CPU: 2 PID: 724 Comm: pktsetup Not tainted 4.15.0-rc4-dbg+ #1
      Call Trace:
       pkt_ctl_ioctl+0xce/0x1c0 [pktcdvd]
       do_vfs_ioctl+0x8e/0x670
       SyS_ioctl+0x3c/0x70
       entry_SYSCALL_64_fastpath+0x23/0x9a
      Reported-by: default avatarMaciej S. Szmigiero <mail@maciej.szmigiero.name>
      Fixes: commit ca18d6f7 ("block: Make most scsi_req_init() calls implicit")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Tested-by: default avatarMaciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: <stable@vger.kernel.org> # v4.13
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      882d4171
    • Bart Van Assche's avatar
      pktcdvd: Fix pkt_setup_dev() error path · 5a0ec388
      Bart Van Assche authored
      Commit 523e1d39 ("block: make gendisk hold a reference to its queue")
      modified add_disk() and disk_release() but did not update any of the
      error paths that trigger a put_disk() call after disk->queue has been
      assigned. That introduced the following behavior in the pktcdvd driver
      if pkt_new_dev() fails:
      
      Kernel BUG at 00000000e98fd882 [verbose debug info unavailable]
      
      Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL,
      fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev()
      error path.
      
      Fixes: commit 523e1d39 ("block: make gendisk hold a reference to its queue")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: <stable@vger.kernel.org> # v3.2
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5a0ec388
    • Matias Bjørling's avatar
      lightnvm: pblk: refactor pblk_ppa_comp function · 8b7bc849
      Matias Bjørling authored
      Shorten function to simply return the value of the if statement.
      Signed-off-by: default avatarMatias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8b7bc849
    • Javier González's avatar
      lightnvm: pblk: add iostat support · 998ba629
      Javier González authored
      Since pblk registers its own block device, the iostat accounting is
      not automatically done for us. Therefore, add the necessary
      accounting logic to satisfy the iostat interface.
      Signed-off-by: default avatarJavier González <javier@cnexlabs.com>
      Signed-off-by: default avatarMatias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      998ba629
    • Javier González's avatar
      lightnvm: pblk: print instance name on instance info · 30d82a86
      Javier González authored
      Add the instance name to the information printed out on target creation.
      Signed-off-by: default avatarJavier González <javier@cnexlabs.com>
      Signed-off-by: default avatarMatias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      30d82a86
    • Javier González's avatar
      lightnvm: pblk: free write buffer on init failure · c6847e4e
      Javier González authored
      Refactor the way we free the write buffer to ensure that all entries get
      freed in case of an error on the init sequence.
      Signed-off-by: default avatarJavier González <javier@cnexlabs.com>
      Signed-off-by: default avatarMatias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c6847e4e
    • Javier González's avatar
      lightnvm: pblk: ensure kthread alloc. before kicking it · cc4f5ba1
      Javier González authored
      When creating the write thread, ensure that the kthread has been created
      before initializing the timer responsible from kicking it. Otherwise, if
      the kthread creation fails or gets killed from used space, we risk
      kicking an empty thread structure.
      
      Also, since the kthread creation can be interrupted form user space,
      adapt the error path to not report an error when this happens, since it
      is intentional that the instance creation is aborted.
      Signed-off-by: default avatarJavier González <javier@cnexlabs.com>
      Updated source to reflect the new timer_setup API.
      Signed-off-by: default avatarMatias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cc4f5ba1
    • Javier González's avatar
      lightnvm: pblk: do not log recovery read errors · 8f554597
      Javier González authored
      On scan recovery, reads can fail. This happens because the first page
      for each line is read in order to determined if the line has been used
      (and thus needs to be recovered), or not. This can lead to "empty page"
      read errors.
      
      Since these errors are normal, do not log them, as they are confusing
      when reviewing the logs.
      Signed-off-by: default avatarJavier González <javier@cnexlabs.com>
      Signed-off-by: default avatarMatias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8f554597
    • Javier González's avatar
      lightnvm: pblk: ignore high ecc errors on recovery · 5d201f07
      Javier González authored
      On recovery, do not stop L2P recovery if reads report high ECC error
      as the data is still available.
      Signed-off-by: default avatarJavier González <javier@cnexlabs.com>
      Signed-off-by: default avatarMatias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5d201f07