1. 20 Apr, 2017 19 commits
  2. 19 Apr, 2017 21 commits
    • Bart Van Assche's avatar
      block: Optimize ioprio_best() · 9a87182c
      Bart Van Assche authored
      Since ioprio_best() translates IOPRIO_CLASS_NONE into IOPRIO_CLASS_BE
      and since lower numerical priority values represent a higher priority
      a simple numerical comparison is sufficient.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarAdam Manzanares <adam.manzanares@wdc.com>
      Tested-by: default avatarAdam Manzanares <adam.manzanares@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Matias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      9a87182c
    • Bart Van Assche's avatar
      block: Inline blk_rq_set_prio() · 0be0dee6
      Bart Van Assche authored
      Since only a single caller remains, inline blk_rq_set_prio(). Initialize
      req->ioprio even if no I/O priority has been set in the bio nor in the
      I/O context.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarAdam Manzanares <adam.manzanares@wdc.com>
      Tested-by: default avatarAdam Manzanares <adam.manzanares@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Matias Bjørling <m@bjorling.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0be0dee6
    • Bart Van Assche's avatar
      lightnvm: Use blk_init_request_from_bio() instead of open-coding it · 9460e280
      Bart Van Assche authored
      This patch changes the behavior of the lightnvm driver as follows:
      * REQ_FAILFAST_MASK is set for read-ahead requests.
      * If no I/O priority has been set in the bio, the I/O priority is
        copied from the I/O context.
      * The rq_disk member is initialized if bio->bi_bdev != NULL.
      * The bio sector offset is copied into req->__sector instead of
        retaining the value -1 set by blk_mq_alloc_request().
      * req->errors is initialized to zero.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Matias Bjørling <m@bjorling.me>
      Cc: Adam Manzanares <adam.manzanares@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      9460e280
    • Bart Van Assche's avatar
      null_blk: Use blk_init_request_from_bio() instead of open-coding it · 2644a3cc
      Bart Van Assche authored
      This patch changes the behavior of the null_blk driver for the
      LightNVM mode as follows:
      * REQ_FAILFAST_MASK is set for read-ahead requests.
      * If no I/O priority has been set in the bio, the I/O priority is
        copied from the I/O context.
      * The rq_disk member is initialized if bio->bi_bdev != NULL.
      * req->errors is initialized to zero.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Matias Bjørling <m@bjorling.me>
      Cc: Adam Manzanares <adam.manzanares@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      2644a3cc
    • Bart Van Assche's avatar
      block: Export blk_init_request_from_bio() · da8d7f07
      Bart Van Assche authored
      Export this function such that it becomes available to block
      drivers.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Matias Bjørling <m@bjorling.me>
      Cc: Adam Manzanares <adam.manzanares@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      da8d7f07
    • Arnd Bergmann's avatar
      lightnvm: assume 64-bit lba numbers · ef697902
      Arnd Bergmann authored
      The driver uses both u64 and sector_t to refer to offsets, and assigns between the
      two. This causes one harmless warning when sector_t is 32-bit:
      
      drivers/lightnvm/pblk-rb.c: In function 'pblk_rb_write_entry_gc':
      include/linux/lightnvm.h:215:20: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
      drivers/lightnvm/pblk-rb.c:324:22: note: in expansion of macro 'ADDR_EMPTY'
      
      As the driver is already doing this inconsistently, changing the type
      won't make it worse and is an easy way to avoid the warning.
      
      Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ef697902
    • Christoph Hellwig's avatar
      block: make __blk_end_bidi_request private · d0fac025
      Christoph Hellwig authored
      blk_insert_flush should be using __blk_end_request to start with.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d0fac025
    • Christoph Hellwig's avatar
      block: remove blk_end_request_cur · fa1a15c0
      Christoph Hellwig authored
      This function is not used anywhere in the kernel.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      fa1a15c0
    • Christoph Hellwig's avatar
      block: remove blk_end_request_err and __blk_end_request_err · 314fe91b
      Christoph Hellwig authored
      Both functions are entirely unused.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      314fe91b
    • Christoph Hellwig's avatar
      block: remove the osdblk driver · 10081552
      Christoph Hellwig authored
      This was just a proof of concept user for the SCSI OSD library, and
      never had any real users.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarBoaz Harrosh <ooo@electrozaur.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      10081552
    • Jan Kara's avatar
      block: Make writeback throttling defaults consistent for SQ devices · 8330cdb0
      Jan Kara authored
      When CFQ is used as an elevator, it disables writeback throttling
      because they don't play well together. Later when a different elevator
      is chosen for the device, writeback throttling doesn't get enabled
      again as it should. Make sure CFQ enables writeback throttling (if it
      should be enabled by default) when we switch from it to another IO
      scheduler.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      8330cdb0
    • Paolo Valente's avatar
      block, bfq: split bfq-iosched.c into multiple source files · ea25da48
      Paolo Valente authored
      The BFQ I/O scheduler features an optimal fair-queuing
      (proportional-share) scheduling algorithm, enriched with several
      mechanisms to boost throughput and reduce latency for interactive and
      real-time applications. This makes BFQ a large and complex piece of
      code. This commit addresses this issue by splitting BFQ into three
      main, independent components, and by moving each component into a
      separate source file:
      1. Main algorithm: handles the interaction with the kernel, and
      decides which requests to dispatch; it uses the following two further
      components to achieve its goals.
      2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
      computes the schedule, using weights and budgets provided by the above
      component.
      3. cgroups support: handles group operations (creation, destruction,
      move, ...).
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ea25da48
    • Paolo Valente's avatar
      block, bfq: remove all get and put of I/O contexts · 6fa3e8d3
      Paolo Valente authored
      When a bfq queue is set in service and when it is merged, a reference
      to the I/O context associated with the queue is taken. This reference
      is then released when the queue is deselected from service or
      split. More precisely, the release of the reference is postponed to
      when the scheduler lock is released, to avoid nesting between the
      scheduler and the I/O-context lock. In fact, such nesting would lead
      to deadlocks, because of other code paths that take the same locks in
      the opposite order. This postponing of I/O-context releases does
      complicate code.
      
      This commit addresses these issue by modifying involved operations in
      such a way to not need to get the above I/O-context references any
      more. Then it also removes any get and release of these references.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      6fa3e8d3
    • Arianna Avanzini's avatar
      block, bfq: handle bursts of queue activations · e1b2324d
      Arianna Avanzini authored
      Many popular I/O-intensive services or applications spawn or
      reactivate many parallel threads/processes during short time
      intervals. Examples are systemd during boot or git grep.  These
      services or applications benefit mostly from a high throughput: the
      quicker the I/O generated by their processes is cumulatively served,
      the sooner the target job of these services or applications gets
      completed. As a consequence, it is almost always counterproductive to
      weight-raise any of the queues associated to the processes of these
      services or applications: in most cases it would just lower the
      throughput, mainly because weight-raising also implies device idling.
      
      To address this issue, an I/O scheduler needs, first, to detect which
      queues are associated with these services or applications. In this
      respect, we have that, from the I/O-scheduler standpoint, these
      services or applications cause bursts of activations, i.e.,
      activations of different queues occurring shortly after each
      other. However, a shorter burst of activations may be caused also by
      the start of an application that does not consist in a lot of parallel
      I/O-bound threads (see the comments on the function bfq_handle_burst
      for details).
      
      In view of these facts, this commit introduces:
      1) an heuristic to detect (only) bursts of queue activations caused by
         services or applications consisting in many parallel I/O-bound
         threads;
      2) the prevention of device idling and weight-raising for the queues
         belonging to these bursts.
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e1b2324d
    • Paolo Valente's avatar
      block, bfq: boost the throughput with random I/O on NCQ-capable HDDs · e01eff01
      Paolo Valente authored
      This patch is basically the counterpart, for NCQ-capable rotational
      devices, of the previous patch. Exactly as the previous patch does on
      flash-based devices and for any workload, this patch disables device
      idling on rotational devices, but only for random I/O. In fact, only
      with these queues disabling idling boosts the throughput on
      NCQ-capable rotational devices. To not break service guarantees,
      idling is disabled for NCQ-enabled rotational devices only when the
      same symmetry conditions considered in the previous patches hold.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e01eff01
    • Paolo Valente's avatar
      block, bfq: boost the throughput on NCQ-capable flash-based devices · bf2b79e7
      Paolo Valente authored
      This patch boosts the throughput on NCQ-capable flash-based devices,
      while still preserving latency guarantees for interactive and soft
      real-time applications. The throughput is boosted by just not idling
      the device when the in-service queue remains empty, even if the queue
      is sync and has a non-null idle window. This helps to keep the drive's
      internal queue full, which is necessary to achieve maximum
      performance. This solution to boost the throughput is a port of
      commits a68bbdd and f7d7b7a7 for CFQ.
      
      As already highlighted in a previous patch, allowing the device to
      prefetch and internally reorder requests trivially causes loss of
      control on the request service order, and hence on service guarantees.
      Fortunately, as discussed in detail in the comments on the function
      bfq_bfqq_may_idle(), if every process has to receive the same
      fraction of the throughput, then the service order enforced by the
      internal scheduler of a flash-based device is relatively close to that
      enforced by BFQ. In particular, it is close enough to let service
      guarantees be substantially preserved.
      
      Things change in an asymmetric scenario, i.e., if not every process
      has to receive the same fraction of the throughput. In this case, to
      guarantee the desired throughput distribution, the device must be
      prevented from prefetching requests. This is exactly what this patch
      does in asymmetric scenarios.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      bf2b79e7
    • Arianna Avanzini's avatar
      block, bfq: reduce idling only in symmetric scenarios · 1de0c4cd
      Arianna Avanzini authored
      A seeky queue (i..e, a queue containing random requests) is assigned a
      very small device-idling slice, for throughput issues. Unfortunately,
      given the process associated with a seeky queue, this behavior causes
      the following problem: if the process, say P, performs sync I/O and
      has a higher weight than some other processes doing I/O and associated
      with non-seeky queues, then BFQ may fail to guarantee to P its
      reserved share of the throughput. The reason is that idling is key
      for providing service guarantees to processes doing sync I/O [1].
      
      This commit addresses this issue by allowing the device-idling slice
      to be reduced for a seeky queue only if the scenario happens to be
      symmetric, i.e., if all the queues are to receive the same share of
      the throughput.
      
      [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
          Scheduler", Proceedings of the First Workshop on Mobile System
          Technologies (MST-2015), May 2015.
          http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdfSigned-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarRiccardo Pizzetti <riccardo.pizzetti@gmail.com>
      Signed-off-by: default avatarSamuele Zecchini <samuele.zecchini92@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      1de0c4cd
    • Arianna Avanzini's avatar
      block, bfq: add Early Queue Merge (EQM) · 36eca894
      Arianna Avanzini authored
      A set of processes may happen to perform interleaved reads, i.e.,
      read requests whose union would give rise to a sequential read pattern.
      There are two typical cases: first, processes reading fixed-size chunks
      of data at a fixed distance from each other; second, processes reading
      variable-size chunks at variable distances. The latter case occurs for
      example with QEMU, which splits the I/O generated by a guest into
      multiple chunks, and lets these chunks be served by a pool of I/O
      threads, iteratively assigning the next chunk of I/O to the first
      available thread. CFQ denotes as 'cooperating' a set of processes that
      are doing interleaved I/O, and when it detects cooperating processes,
      it merges their queues to obtain a sequential I/O pattern from the union
      of their I/O requests, and hence boost the throughput.
      
      Unfortunately, in the following frequent case, the mechanism
      implemented in CFQ for detecting cooperating processes and merging
      their queues is not responsive enough to handle also the fluctuating
      I/O pattern of the second type of processes. Suppose that one process
      of the second type issues a request close to the next request to serve
      of another process of the same type. At that time the two processes
      would be considered as cooperating. But, if the request issued by the
      first process is to be merged with some other already-queued request,
      then, from the moment at which this request arrives, to the moment
      when CFQ controls whether the two processes are cooperating, the two
      processes are likely to be already doing I/O in distant zones of the
      disk surface or device memory.
      
      CFQ uses however preemption to get a sequential read pattern out of
      the read requests performed by the second type of processes too.  As a
      consequence, CFQ uses two different mechanisms to achieve the same
      goal: boosting the throughput with interleaved I/O.
      
      This patch introduces Early Queue Merge (EQM), a unified mechanism to
      get a sequential read pattern with both types of processes. The main
      idea is to immediately check whether a newly-arrived request lets some
      pair of processes become cooperating, both in the case of actual
      request insertion and, to be responsive with the second type of
      processes, in the case of request merge. Both types of processes are
      then handled by just merging their queues.
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarMauro Andreolini <mauro.andreolini@unimore.it>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      36eca894
    • Paolo Valente's avatar
      block, bfq: reduce latency during request-pool saturation · cfd69712
      Paolo Valente authored
      This patch introduces an heuristic that reduces latency when the
      I/O-request pool is saturated. This goal is achieved by disabling
      device idling, for non-weight-raised queues, when there are weight-
      raised queues with pending or in-flight requests. In fact, as
      explained in more detail in the comment on the function
      bfq_bfqq_may_idle(), this reduces the rate at which processes
      associated with non-weight-raised queues grab requests from the pool,
      thereby increasing the probability that processes associated with
      weight-raised queues get a request immediately (or at least soon) when
      they need one. Along the same line, if there are weight-raised queues,
      then this patch halves the service rate of async (write) requests for
      non-weight-raised queues.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      cfd69712
    • Paolo Valente's avatar
      block, bfq: preserve a low latency also with NCQ-capable drives · bcd56426
      Paolo Valente authored
      I/O schedulers typically allow NCQ-capable drives to prefetch I/O
      requests, as NCQ boosts the throughput exactly by prefetching and
      internally reordering requests.
      
      Unfortunately, as discussed in detail and shown experimentally in [1],
      this may cause fairness and latency guarantees to be violated. The
      main problem is that the internal scheduler of an NCQ-capable drive
      may postpone the service of some unlucky (prefetched) requests as long
      as it deems serving other requests more appropriate to boost the
      throughput.
      
      This patch addresses this issue by not disabling device idling for
      weight-raised queues, even if the device supports NCQ. This allows BFQ
      to start serving a new queue, and therefore allows the drive to
      prefetch new requests, only after the idling timeout expires. At that
      time, all the outstanding requests of the expired queue have been most
      certainly served.
      
      [1] P. Valente and M. Andreolini, "Improving Application
          Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
          the 5th Annual International Systems and Storage Conference
          (SYSTOR '12), June 2012.
          Slightly extended version:
          http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
      							results.pdf
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      bcd56426
    • Paolo Valente's avatar
      block, bfq: reduce I/O latency for soft real-time applications · 77b7dcea
      Paolo Valente authored
      To guarantee a low latency also to the I/O requests issued by soft
      real-time applications, this patch introduces a further heuristic,
      which weight-raises (in the sense explained in the previous patch)
      also the queues associated to applications deemed as soft real-time.
      
      To be deemed as soft real-time, an application must meet two
      requirements.  First, the application must not require an average
      bandwidth higher than the approximate bandwidth required to playback
      or record a compressed high-definition video. Second, the request
      pattern of the application must be isochronous, i.e., after issuing a
      request or a batch of requests, the application must stop issuing new
      requests until all its pending requests have been completed. After
      that, the application may issue a new batch, and so on.
      
      As for the second requirement, it is critical to require also that,
      after all the pending requests of the application have been completed,
      an adequate minimum amount of time elapses before the application
      starts issuing new requests. This prevents also greedy (i.e.,
      I/O-bound) applications from being incorrectly deemed, occasionally,
      as soft real-time. In fact, if *any amount of time* is fine, then even
      a greedy application may, paradoxically, meet both the above
      requirements, if: (1) the application performs random I/O and/or the
      device is slow, and (2) the CPU load is high. The reason is the
      following.  First, if condition (1) is true, then, during the service
      of the application, the throughput may be low enough to let the
      application meet the bandwidth requirement.  Second, if condition (2)
      is true as well, then the application may occasionally behave in an
      apparently isochronous way, because it may simply stop issuing
      requests while the CPUs are busy serving other processes.
      
      To address this issue, the heuristic leverages the simple fact that
      greedy applications issue *all* their requests as quickly as they can,
      whereas soft real-time applications spend some time processing data
      after each batch of requests is completed. In particular, the
      heuristic works as follows. First, according to the above isochrony
      requirement, the heuristic checks whether an application may be soft
      real-time, thereby giving to the application the opportunity to be
      deemed as such, only when both the following two conditions happen to
      hold: 1) the queue associated with the application has expired and is
      empty, 2) there is no outstanding request of the application.
      
      Suppose that both conditions hold at time, say, t_c and that the
      application issues its next request at time, say, t_i. At time t_c the
      heuristic computes the next time instant, called soft_rt_next_start in
      the code, such that, only if t_i >= soft_rt_next_start, then both the
      next conditions will hold when the application issues its next
      request: 1) the application will meet the above bandwidth requirement,
      2) a given minimum time interval, say Delta, will have elapsed from
      time t_c (so as to filter out greedy application).
      
      The current value of Delta is a little bit higher than the value that
      we have found, experimentally, to be adequate on a real,
      general-purpose machine. In particular we had to increase Delta to
      make the filter quite precise also in slower, embedded systems, and in
      KVM/QEMU virtual machines (details in the comments on the code).
      
      If the application actually issues its next request after time
      soft_rt_next_start, then its associated queue will be weight-raised
      for a relatively short time interval. If, during this time interval,
      the application proves again to meet the bandwidth and isochrony
      requirements, then the end of the weight-raising period for the queue
      is moved forward, and so on. Note that an application whose associated
      queue never happens to be empty when it expires will never have the
      opportunity to be deemed as soft real-time.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      77b7dcea