1. 05 Apr, 2019 15 commits
  2. 04 Apr, 2019 1 commit
  3. 01 Apr, 2019 24 commits
    • NeilBrown's avatar
      md: batch flush requests. · 2bc13b83
      NeilBrown authored
      Currently if many flush requests are submitted to an md device is quick
      succession, they are serialized and can take a long to process them all.
      We don't really need to call flush all those times - a single flush call
      can satisfy all requests submitted before it started.
      So keep track of when the current flush started and when it finished,
      allow any pending flush that was requested before the flush started
      to complete without waiting any more.
      
      Test results from Xiao:
      
      Test is done on a raid10 device which is created by 4 SSDs. The tool is
      dbench.
      
      1. The latest linux stable kernel
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768    10.509    78.305
        Flush                  2078376     0.013    10.094
        Close                  21787697     0.019    18.821
        LockX                    96580     0.007     3.184
        Mkdir                      384     0.008     0.062
        Rename                 1255883     0.191    23.534
        ReadX                  46495589     0.020    14.230
        WriteX                 14790591     7.123    60.706
        Unlink                 5989118     0.440    54.551
        UnlockX                  96580     0.005     2.736
        FIND_FIRST             10393845     0.042    12.079
        SET_FILE_INFORMATION   2415558     0.129    10.088
        QUERY_FILE_INFORMATION 4711725     0.005     8.462
        QUERY_PATH_INFORMATION 26883327     0.032    21.715
        QUERY_FS_INFORMATION   4929409b     0.010     8.238
        NTCreateX              29660080     0.100    53.268
      
      Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.712 ms
      
      2. With patch1 "Revert "MD: fix lock contention for flush bios""
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    256     8.326    36.761
        Flush                   693291     3.974   180.269
        Close                  7266404     0.009    36.929
        LockX                    32160     0.006     0.840
        Mkdir                      128     0.008     0.021
        Rename                  418755     0.063    29.945
        ReadX                  15498708     0.007     7.216
        WriteX                 4932310    22.482   267.928
        Unlink                 1997557     0.109    47.553
        UnlockX                  32160     0.004     1.110
        FIND_FIRST             3465791     0.036     7.320
        SET_FILE_INFORMATION    805825     0.015     1.561
        QUERY_FILE_INFORMATION 1570950     0.005     2.403
        QUERY_PATH_INFORMATION 8965483     0.013    14.277
        QUERY_FS_INFORMATION   1643626     0.009     3.314
        NTCreateX              9892174     0.061    41.278
      
      Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
      max_latency=267.939 m
      
      3. With patch1 and patch2
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768     9.570    54.588
        Flush                  2061354     0.666    15.102
        Close                  21604811     0.012    25.697
        LockX                    95770     0.007     1.424
        Mkdir                      384     0.008     0.053
        Rename                 1245411     0.096    12.263
        ReadX                  46103198     0.011    12.116
        WriteX                 14667988     7.375    60.069
        Unlink                 5938936     0.173    30.905
        UnlockX                  95770     0.005     4.147
        FIND_FIRST             10306407     0.041    11.715
        SET_FILE_INFORMATION   2395987     0.048     7.640
        QUERY_FILE_INFORMATION 4672371     0.005     9.291
        QUERY_PATH_INFORMATION 26656735     0.018    19.719
        QUERY_FS_INFORMATION   4887940     0.010     7.654
        NTCreateX              29410811     0.059    28.551
      
      Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.075 ms
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2bc13b83
    • NeilBrown's avatar
      Revert "MD: fix lock contention for flush bios" · 4bc034d3
      NeilBrown authored
      This reverts commit 5a409b4f.
      
      This patch has two problems.
      
      1/ it make multiple calls to submit_bio() from inside a make_request_fn.
       The bios thus submitted will be queued on current->bio_list and not
       submitted immediately.  As the bios are allocated from a mempool,
       this can theoretically result in a deadlock - all the pool of requests
       could be in various ->bio_list queues and a subsequent mempool_alloc
       could block waiting for one of them to be released.
      
      2/ It aims to handle a case when there are many concurrent flush requests.
        It handles this by submitting many requests in parallel - all of which
        are identical and so most of which do nothing useful.
        It would be more efficient to just send one lower-level request, but
        allow that to satisfy multiple upper-level requests.
      
      Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4bc034d3
    • Nigel Croxon's avatar
      Don't jump to compute_result state from check_result state · 4f4fd7c5
      Nigel Croxon authored
      Changing state from check_state_check_result to
      check_state_compute_result not only is unsafe but also doesn't
      appear to serve a valid purpose.  A raid6 check should only be
      pushing out extra writes if doing repair and a mis-match occurs.
      The stripe dev management will already try and do repair writes
      for failing sectors.
      
      This patch makes the raid6 check_state_check_result handling
      work more like raid5's.  If somehow too many failures for a
      check, just quit the check operation for the stripe.  When any
      checks pass, don't try and use check_state_compute_result for
      a purpose it isn't needed for and is unsafe for.  Just mark the
      stripe as in sync for passing its parity checks and let the
      stripe dev read/write code and the bad blocks list do their
      job handling I/O errors.
      
      Repro steps from Xiao:
      
      These are the steps to reproduce this problem:
      1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
      2. insmod scsi_debug.ko dev_size_mb=11000  max_luns=1 num_tgts=1
      3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
      sde is the disk created by scsi_debug
      4. echo "2" >/sys/module/scsi_debug/parameters/opts
      5. raid-check
      
      It panic:
      [ 4854.730899] md: data-check of RAID array md127
      [ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
      [ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
      [ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
      [ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
      [ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
      [ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
      [ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
      [ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
      [ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
      [ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
      [ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
      [ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
      [ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
      [ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
      [ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
      [ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
      [ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
      [ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
      [ 4854.906201] ------------[ cut here ]------------
      [ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!
      
      raid5.c:4190 above is this BUG_ON:
      
          handle_parity_checks6()
              ...
              BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
      
      Cc: <stable@vger.kernel.org> # v3.16+
      OriginalAuthor: David Jeffery <djeffery@redhat.com>
      Cc: Xiao Ni <xni@redhat.com>
      Tested-by: default avatarDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: default avatarDavid Jeffy <djeffery@redhat.com>
      Signed-off-by: default avatarNigel Croxon <ncroxon@redhat.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4f4fd7c5
    • Ming Lei's avatar
      block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF · 81ba6abd
      Ming Lei authored
      loop is one block device, for any bio submitted to this device,
      the upper layer does guarantee that pages added to loop's bio won't
      go away when the bio is in-flight.
      
      So mark loop's bvec as ITER_BVEC_FLAG_NO_REF then get_page/put_page
      can be saved for serving loop's IO.
      
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      81ba6abd
    • Ming Lei's avatar
      block: don't check if adjacent bvecs in one bio can be mergeable · f6970f83
      Ming Lei authored
      Now both passthrough and FS IO have supported multi-page bvec, and
      bvec merging has been handled actually when adding page to bio, then
      adjacent bvecs won't be mergeable any more if they belong to same bio.
      
      So only try to merge bvecs if they are from different bios.
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f6970f83
    • Ming Lei's avatar
      block: reuse __blk_bvec_map_sg() for mapping page sized bvec · 16e3e418
      Ming Lei authored
      Inside __blk_segment_map_sg(), page sized bvec mapping is optimized
      a bit with one standalone branch.
      
      So reuse __blk_bvec_map_sg() to do that.
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      16e3e418
    • Ming Lei's avatar
      block: remove argument of 'request_queue' from __blk_bvec_map_sg · cae6c2e5
      Ming Lei authored
      The argument of 'request_queue' isn't used by __blk_bvec_map_sg(),
      so remove it.
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cae6c2e5
    • Ming Lei's avatar
      block: enable multi-page bvec for passthrough IO · 489fbbcb
      Ming Lei authored
      Now block IO stack is basically ready for supporting multi-page bvec,
      however it isn't enabled on passthrough IO.
      
      One reason is that passthrough IO is dispatched to LLD directly and bio
      split is bypassed, so the bio has to be built correctly for dispatch to
      LLD from the beginning.
      
      Implement multi-page support for passthrough IO by limitting each bvec
      as block device's segment and applying all kinds of queue limit in
      blk_add_pc_page(). Then we don't need to calculate segments any more for
      passthrough IO any more, turns out code is simplified much.
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      489fbbcb
    • Ming Lei's avatar
      block: put the same page when adding it to bio · 19047087
      Ming Lei authored
      When the added page is merged to last same page in bio_add_pc_page(),
      the user may need to put this page for avoiding page leak.
      
      bio_map_user_iov() needs this kind of handling, and now it deals with
      it by itself in hack style.
      
      Moves the handling of put page into __bio_add_pc_page(), so
      bio_map_user_iov() may be simplified a bit, and maybe more users
      can benefit from this change.
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      19047087
    • Ming Lei's avatar
      block: check if page is mergeable in one helper · 5919482e
      Ming Lei authored
      Now the check for deciding if one page is mergeable to current bvec
      becomes a bit complicated, and we need to reuse the code before
      adding pc page.
      
      So move the check in one dedicated helper.
      
      No function change.
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5919482e
    • Ming Lei's avatar
      block: cleanup bio_add_pc_page · 5a8ce240
      Ming Lei authored
      REQ_PC is out of date, so replace it with passthrough IO.
      
      Also remove the local variable of 'prev' since we can reuse
      the top local variable of 'bvec'.
      
      No function change.
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5a8ce240
    • Ming Lei's avatar
      block: don't merge adjacent bvecs to one segment in bio blk_queue_split · fd7d8d42
      Ming Lei authored
      For normal filesystem IO, each page is added via blk_add_page(),
      in which bvec(page) merge has been handled already, and basically
      not possible to merge two adjacent bvecs in one bio.
      
      So not try to merge two adjacent bvecs in blk_queue_split().
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fd7d8d42
    • Ming Lei's avatar
      block: avoid to break XEN by multi-page bvec · db5ebd6e
      Ming Lei authored
      XEN has special page merge requirement, see xen_biovec_phys_mergeable().
      We can't merge pages into one bvec simply for XEN.
      
      So move XEN's specific check on page merge into __bio_try_merge_page(),
      then abvoid to break XEN by multi-page bvec.
      
      Cc: ris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      db5ebd6e
    • Ming Lei's avatar
      block: pass page to xen_biovec_phys_mergeable · 0383ad43
      Ming Lei authored
      xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec
      for checking if the two bvecs can be merged, so pass page to
      xen_biovec_phys_mergeable() directly.
      
      No function change.
      
      Cc: ris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0383ad43
    • Holger Hoffstätte's avatar
      loop: properly observe rotational flag of underlying device · 56a85fd8
      Holger Hoffstätte authored
      The loop driver always declares the rotational flag of its device as
      rotational, even when the device of the mapped file is nonrotational,
      as is the case with SSDs or on tmpfs. This can confuse filesystem tools
      which are SSD-aware; in my case I frequently forget to tell mkfs.btrfs
      that my loop device on tmpfs is nonrotational, and that I really don't
      need any automatic metadata redundancy.
      
      The attached patch fixes this by introspecting the rotational flag of the
      mapped file's underlying block device, if it exists. If the mapped file's
      filesystem has no associated block device - as is the case on e.g. tmpfs -
      we assume nonrotational storage. If there is a better way to identify such
      non-devices I'd love to hear them.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: holger@applied-asynchrony.com
      Signed-off-by: default avatarHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: default avatarGwendal Grignou <gwendal@chromium.org>
      Signed-off-by: default avatarBenjamin Gordon <bmgordon@chromium.org>
      Reviewed-by: default avatarGuenter Roeck <groeck@chromium.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      56a85fd8
    • Paolo Valente's avatar
      doc, block, bfq: add information on bfq execution time · 4438cf50
      Paolo Valente authored
      The execution time of BFQ has been slightly lowered. Report the new
      execution time in BFQ documentation.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4438cf50
    • Francesco Pollicino's avatar
      block, bfq: save & resume weight on a queue merge/split · fffca087
      Francesco Pollicino authored
      bfq saves the state of a queue each time a merge occurs, to be
      able to resume such a state when the queue is associated again
      with its original process, on a split.
      
      Unfortunately bfq does not save & restore also the weight of the
      queue. If the weight is not correctly resumed when the queue is
      recycled, then the weight of the recycled queue could differ
      from the weight of the original queue.
      
      This commit adds the missing save & resume of the weight.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fffca087
    • Francesco Pollicino's avatar
      block, bfq: print SHARED instead of pid for shared queues in logs · 1e66413c
      Francesco Pollicino authored
      The function "bfq_log_bfqq" prints the pid of the process
      associated with the queue passed as input.
      
      Unfortunately, if the queue is shared, then more than one process
      is associated with the queue. The pid that gets printed in this
      case is the pid of one of the associated processes.
      Which process gets printed depends on the exact sequence of merge
      events the queue underwent. So printing such a pid is rather
      useless and above all is often rather confusing because it
      reports a random pid between those of the associated processes.
      
      This commit addresses this issue by printing SHARED instead of a pid
      if the queue is shared.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1e66413c
    • Paolo Valente's avatar
      block, bfq: always protect newly-created queues from existing active queues · 84a74689
      Paolo Valente authored
      If many bfq_queues belonging to the same group happen to be created
      shortly after each other, then the processes associated with these
      queues have typically a common goal. In particular, bursts of queue
      creations are usually caused by services or applications that spawn
      many parallel threads/processes. Examples are systemd during boot, or
      git grep. If there are no other active queues, then, to help these
      processes get their job done as soon as possible, the best thing to do
      is to reach a high throughput. To this goal, it is usually better to
      not grant either weight-raising or device idling to the queues
      associated with these processes. And this is exactly what BFQ
      currently does.
      
      There is however a drawback: if, in contrast, some other queues are
      already active, then the newly created queues must be protected from
      the I/O flowing through the already existing queues. In this case, the
      best thing to do is the opposite as in the other case: it is much
      better to grant weight-raising and device idling to the newly-created
      queues, if they deserve it. This commit addresses this issue by doing
      so if there are already other active queues.
      
      This change also helps eliminating false positives, which occur when
      the newly-created queues do not belong to an actual large burst of
      creations, but some background task (e.g., a service) happens to
      trigger the creation of new queues in the middle, i.e., very close to
      when the victim queues are created. These false positive may cause
      total loss of control on process latencies.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      84a74689
    • Paolo Valente's avatar
      block, bfq: do not tag totally seeky queues as soft rt · 7074f076
      Paolo Valente authored
      Sync random I/O is likely to be confused with soft real-time I/O,
      because it is characterized by limited throughput and apparently
      isochronous arrival pattern. To avoid false positives, this commits
      prevents bfq_queues containing only random (seeky) I/O from being
      tagged as soft real-time.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7074f076
    • Paolo Valente's avatar
      block, bfq: do not merge queues on flash storage with queueing · 8cacc5ab
      Paolo Valente authored
      To boost throughput with a set of processes doing interleaved I/O
      (i.e., a set of processes whose individual I/O is random, but whose
      merged cumulative I/O is sequential), BFQ merges the queues associated
      with these processes, i.e., redirects the I/O of these processes into a
      common, shared queue. In the shared queue, I/O requests are ordered by
      their position on the medium, thus sequential I/O gets dispatched to
      the device when the shared queue is served.
      
      Queue merging costs execution time, because, to detect which queues to
      merge, BFQ must maintain a list of the head I/O requests of active
      queues, ordered by request positions. Measurements showed that this
      costs about 10% of BFQ's total per-request processing time.
      
      Request processing time becomes more and more critical as the speed of
      the underlying storage device grows. Yet, fortunately, queue merging
      is basically useless on the very devices that are so fast to make
      request processing time critical. To reach a high throughput, these
      devices must have many requests queued at the same time. But, in this
      configuration, the internal scheduling algorithms of these devices do
      also the job of queue merging: they reorder requests so as to obtain
      as much as possible a sequential I/O pattern. As a consequence, with
      processes doing interleaved I/O, the throughput reached by one such
      device is likely to be the same, with and without queue merging.
      
      In view of this fact, this commit disables queue merging, and all
      related housekeeping, for non-rotational devices with internal
      queueing. The total, single-lock-protected, per-request processing
      time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
      (time measured with simple code instrumentation, and using the
      throughput-sync.sh script of the S suite [1], in performance-profiling
      mode). To put this result into context, the total,
      single-lock-protected, per-request execution time of the lightest I/O
      scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
      ~800 LOC, against ~10500 LOC for BFQ).
      
      Disabling merging provides a further, remarkable benefit in terms of
      throughput. Merging tends to make many workloads artificially more
      uneven, mainly because of shared queues remaining non empty for
      incomparably more time than normal queues. So, if, e.g., one of the
      queues in a set of merged queues has a higher weight than a normal
      queue, then the shared queue may inherit such a high weight and, by
      staying almost always active, may force BFQ to perform I/O plugging
      most of the time. This evidently makes it harder for BFQ to let the
      device reach a high throughput.
      
      As a practical example of this problem, and of the benefits of this
      commit, we measured again the throughput in the nasty scenario
      considered in previous commit messages: dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes. With
      this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
      PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
      of the other I/O schedulers. As such, this is also likely to be the
      maximum possible throughput reachable with this workload on this
      device, because I/O is mostly random, and the other schedulers
      basically just pass I/O requests to the drive as fast as possible.
      
      [1] https://github.com/Algodev-github/STested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: default avatarFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: default avatarAlessio Masola <alessio.masola@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8cacc5ab
    • Paolo Valente's avatar
      block, bfq: tune service injection basing on request service times · 2341d662
      Paolo Valente authored
      The processes associated with a bfq_queue, say Q, may happen to
      generate their cumulative I/O at a lower rate than the rate at which
      the device could serve the same I/O. This is rather probable, e.g., if
      only one process is associated with Q and the device is an SSD. It
      results in Q becoming often empty while in service. If BFQ is not
      allowed to switch to another queue when Q becomes empty, then, during
      the service of Q, there will be frequent "service holes", i.e., time
      intervals during which Q gets empty and the device can only consume
      the I/O already queued in its hardware queues. This easily causes
      considerable losses of throughput.
      
      To counter this problem, BFQ implements a request injection mechanism,
      which tries to fill the above service holes with I/O requests taken
      from other bfq_queues. The hard part in this mechanism is finding the
      right amount of I/O to inject, so as to both boost throughput and not
      break Q's bandwidth and latency guarantees. To this goal, the current
      version of this mechanism measures the bandwidth enjoyed by Q while it
      is being served, and tries to inject the maximum possible amount of
      extra service that does not cause Q's bandwidth to decrease too
      much.
      
      This solution has an important shortcoming. For bandwidth measurements
      to be stable and reliable, Q must remain in service for a much longer
      time than that needed to serve a single I/O request. Unfortunately,
      this does not hold with many workloads. This commit addresses this
      issue by changing the way the amount of injection allowed is
      dynamically computed. It tunes injection as a function of the service
      times of single I/O requests of Q, instead of Q's
      bandwidth. Single-request service times are evidently meaningful even
      if Q gets very few I/O requests completed while it is in service.
      
      As a testbed for this new solution, we measured the throughput reached
      by BFQ for one of the nastiest workloads and configurations for this
      scheduler: the workload generated by the dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes.
      With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
      a PLEXTOR PX-256M5.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: default avatarFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2341d662
    • Paolo Valente's avatar
      block, bfq: do not idle for lowest-weight queues · fb53ac6c
      Paolo Valente authored
      In most cases, it is detrimental for throughput to plug I/O dispatch
      when the in-service bfq_queue becomes temporarily empty (plugging is
      performed to wait for the possible arrival, soon, of new I/O from the
      in-service queue). There is however a case where plugging is needed
      for service guarantees. If a bfq_queue, say Q, has a higher weight
      than some other active bfq_queue, and is sync, i.e., contains sync
      I/O, then, to guarantee that Q does receive a higher share of the
      throughput than other lower-weight queues, it is necessary to plug I/O
      dispatch when Q remains temporarily empty while being served.
      
      For this reason, BFQ performs I/O plugging when some active bfq_queue
      has a higher weight than some other active bfq_queue. But this is
      unnecessarily overkill. In fact, if the in-service bfq_queue actually
      has a weight lower than or equal to the other queues, then the queue
      *must not* be guaranteed a higher share of the throughput than the
      other queues. So, not plugging I/O cannot cause any harm to the
      queue. And can boost throughput.
      
      Taking advantage of this fact, this commit does not plug I/O for sync
      bfq_queues with a weight lower than or equal to the weights of the
      other queues. Here is an example of the resulting throughput boost
      with the dbench workload, which is particularly nasty for BFQ. With
      the dbench test in the Phoronix suite, BFQ reaches its lowest total
      throughput with 6 clients on a filesystem with journaling, in case the
      journaling daemon has a higher weight than normal processes. Before
      this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
      after this commit it is ~100 MB/sec.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fb53ac6c
    • Paolo Valente's avatar
      block, bfq: increase idling for weight-raised queues · 778c02a2
      Paolo Valente authored
      If a sync bfq_queue has a higher weight than some other queue, and
      remains temporarily empty while in service, then, to preserve the
      bandwidth share of the queue, it is necessary to plug I/O dispatching
      until a new request arrives for the queue. In addition, a timeout
      needs to be set, to avoid waiting for ever if the process associated
      with the queue has actually finished its I/O.
      
      Even with the above timeout, the device is however not fed with new
      I/O for a while, if the process has finished its I/O. If this happens
      often, then throughput drops and latencies grow. For this reason, the
      timeout is kept rather low: 8 ms is the current default.
      
      Unfortunately, such a low value may cause, on the opposite end, a
      violation of bandwidth guarantees for a process that happens to issue
      new I/O too late. The higher the system load, the higher the
      probability that this happens to some process. This is a problem in
      scenarios where service guarantees matter more than throughput. One
      important case are weight-raised queues, which need to be granted a
      very high fraction of the bandwidth.
      
      To address this issue, this commit lower-bounds the plugging timeout
      for weight-raised queues to 20 ms. This simple change provides
      relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
      gnome-terminal starts in 0.6 seconds if there is no other I/O in
      progress, the same applications starts in
      - 0.8 seconds, instead of 1.2 seconds, if ten files are being read
        sequentially in parallel
      - 1 second, instead of 2 seconds, if, in parallel, five files are
        being read sequentially, and five more files are being written
        sequentially
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      778c02a2