1. 07 Apr, 2017 13 commits
    • Omar Sandoval's avatar
      blk-mq-sched: provide hooks for initializing hardware queue data · ee056f98
      Omar Sandoval authored
      Schedulers need to be informed when a hardware queue is added or removed
      at runtime so they can allocate/free per-hardware queue data. So,
      replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
      at init time, with .init_hctx() and .exit_hctx() hooks.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ee056f98
    • Jens Axboe's avatar
      Merge branch 'for-linus' into for-4.12/block · 65f619d2
      Jens Axboe authored
      We've added a considerable amount of fixes for stalls and issues
      with the blk-mq scheduling in the 4.11 series since forking
      off the for-4.12/block branch. We need to do improvements on
      top of that for 4.12, so pull in the previous fixes to make
      our lives easier going forward.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      65f619d2
    • Bart Van Assche's avatar
      blk-mq: Restart a single queue if tag sets are shared · 6d8c6c0f
      Bart Van Assche authored
      To improve scalability, if hardware queues are shared, restart
      a single hardware queue in round-robin fashion. Rename
      blk_mq_sched_restart_queues() to reflect the new semantics.
      Remove blk_mq_sched_mark_restart_queue() because this function
      has no callers. Remove flag QUEUE_FLAG_RESTART because this
      patch removes the code that uses this flag.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      6d8c6c0f
    • Bart Van Assche's avatar
      dm rq: Avoid that request processing stalls sporadically · 6077c2d7
      Bart Van Assche authored
      While running the srp-test software I noticed that request
      processing stalls sporadically at the beginning of a test, namely
      when mkfs is run against a dm-mpath device. Every time when that
      happened the following command was sufficient to resume request
      processing:
      
          echo run >/sys/kernel/debug/block/dm-0/state
      
      This patch avoids that such request processing stalls occur. The
      test I ran is as follows:
      
          while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      6077c2d7
    • Bart Van Assche's avatar
      scsi: Avoid that SCSI queues get stuck · 36e3cf27
      Bart Van Assche authored
      If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
      driver that implements that function is responsible for rerunning the
      hardware queue once requests can be queued again successfully.
      
      commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped
      queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
      for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions
      that are intended to rerun a busy queue such that these examine all
      hardware queues instead of only stopped queues.
      
      Since no other functions than scsi_internal_device_block() and
      scsi_internal_device_unblock() should ever stop or restart a SCSI
      queue, change the blk_mq_delay_queue() call into a
      blk_mq_delay_run_hw_queue() call.
      
      Fixes: commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped queues")
      Fixes: commit 7e79dadc ("blk-mq: stop hardware queue in blk_mq_delay_queue()")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Long Li <longli@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      36e3cf27
    • Bart Van Assche's avatar
      blk-mq: Introduce blk_mq_delay_run_hw_queue() · 7587a5ae
      Bart Van Assche authored
      Introduce a function that runs a hardware queue unconditionally
      after a delay. Note: there is already a function that stops and
      restarts a hardware queue after a delay, namely blk_mq_delay_queue().
      
      This function will be used in the next patch in this series.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Long Li <longli@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      7587a5ae
    • NeilBrown's avatar
      block: trace completion of all bios. · fbbaf700
      NeilBrown authored
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio level event would be wrong
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handle directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this was can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either even if it re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bi_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      fbbaf700
    • NeilBrown's avatar
      block: simple improvements for bio->flags · dbde775c
      NeilBrown authored
      The comment for the 'flags' field of 'bio' mentions
      "command" which is no longer stored there, and doesn't
      mention the bvec pool number, which is.
      
      BIO_RESET_BITS is set in such a way that it would need to be
      updated if new bits were added, which is easy to miss.
      
      BVEC_POOL_BITS is larger than needed.  The BVEC_POOL_IDX()
      ranges from 0 to 6, so 3 bits are sufficient.
      
      This patch make improvements in each of these areas.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      dbde775c
    • Omar Sandoval's avatar
      blk-mq: remap queues when adding/removing hardware queues · ebe8bddb
      Omar Sandoval authored
      blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
      behavior that drivers expect. However, commit 4e68a011 changed
      blk_mq_queue_reinit() to not remap queues for the case of CPU
      hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
      queues as well. This breaks, for example, NBD's multi-connection mode,
      leaving the added hardware queues unused. Fix it by making
      blk_mq_update_nr_hw_queues() explicitly remap the queues.
      
      Fixes: 4e68a011 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ebe8bddb
    • Omar Sandoval's avatar
      blk-mq-sched: fix crash in switch error path · 54d5329d
      Omar Sandoval authored
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback like the legacy elevator path is much harder for mq, so fix
      it by just falling back to none, instead.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      54d5329d
    • Omar Sandoval's avatar
      blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Omar Sandoval authored
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      93252632
    • Omar Sandoval's avatar
      blk-mq-sched: refactor scheduler initialization · 6917ff0b
      Omar Sandoval authored
      Preparation cleanup for the next couple of fixes, push
      blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      6917ff0b
    • Omar Sandoval's avatar
      blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Omar Sandoval authored
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on a
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      81380ca1
  2. 05 Apr, 2017 9 commits
    • Jens Axboe's avatar
      block: move timeout field in struct request to pack better · 1dd5198b
      Jens Axboe authored
      After commit 64c7f1d1, we went from 1 to 2 holes in my
      test setup. If we move the timeout field a bit, we remove
      both of those holes and shrink struct request by 8 bytes.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      1dd5198b
    • Christoph Hellwig's avatar
      block, scsi: move the retries field to struct scsi_request · 64c7f1d1
      Christoph Hellwig authored
      Instead of bloating the generic struct request with it.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      64c7f1d1
    • Christoph Hellwig's avatar
      nvme: move the retries count to struct nvme_request · 44e44b29
      Christoph Hellwig authored
      The way NVMe uses this field is entirely different from the older
      SCSI/BLOCK_PC usage, so move it into struct nvme_request.
      
      Also reduce the size of the file to a unsigned char so that we leave
      space for additional smaller fields that will appear soon.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      44e44b29
    • Christoph Hellwig's avatar
      83f3aeb3
    • Christoph Hellwig's avatar
      nvme: cleanup nvme_req_needs_retry · f6324b1b
      Christoph Hellwig authored
      Don't pass the status explicitly but derive it from the requeust,
      and unwind the complex condition to be more readable.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      f6324b1b
    • Christoph Hellwig's avatar
      nvme: move ->retries setup to nvme_setup_cmd · 987f699a
      Christoph Hellwig authored
      ->retries is counting the number of times a command is resubmitted, and
      be cleared on the first time we see the command.  We currently don't do
      that for non-PCIe command, which is easily fixed by moving the setup
      to common code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      987f699a
    • Christoph Hellwig's avatar
      remove the obsolete hd driver · 8e14be53
      Christoph Hellwig authored
      This driver is for pre-IDE hardisk that are only found in PC from the
      stoneage of personal computing, and which we don't support elsewhere
      in the kernel these days.
      
      It's also been marked broken forever.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      8e14be53
    • Bart Van Assche's avatar
      blk-mq: Remove blk_mq_queue_data.list · f2fbc9dd
      Bart Van Assche authored
      The block layer core sets blk_mq_queue_data.list but no block
      drivers read that member. Hence remove it and also the code that
      is used to set this member.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      f2fbc9dd
    • Jan Kara's avatar
      cfq: Disable writeback throttling by default · 142bbdfc
      Jan Kara authored
      Writeback throttling does not play well with CFQ since that also tries
      to throttle async writes. As a result async writeback can get starved in
      presence of readers. As an example take a benchmark simulating
      postgreSQL database running over a standard rotating SATA drive. There
      are 16 processes doing random reads from a huge file (2*machine memory),
      1 process doing random writes to the huge file and calling fsync once
      per 50000 writes and 1 process doing sequential 8k writes to a
      relatively small file wrapping around at the end of the file and calling
      fsync every 5 writes. Under this load read latency easily exceeds the
      target latency of 75 ms (just because there are so many reads happening
      against a relatively slow disk) and thus writeback is throttled to a
      point where only 1 write request is allowed at a time. Blktrace data
      then looks like:
      
        8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
        8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
        8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
        8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
        8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
        8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
        8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
        8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
        8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
        8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
        8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
        8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
        8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
        8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
        8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
        8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
        8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
        8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
        8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
        8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
        8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
      ...
        8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
        8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
        8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
        8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
        8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
        8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
        8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
        8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
        8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
        8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
        8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
        8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
        8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
        8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
        8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
        8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
        8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
        8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
        8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1
      
      So we can see a Q event for a write request, then IO is blocked by
      writeback throttling and G and I events for the request happen only once
      other writeback IO is completed. Thus CFQ always sees only one write
      request. When it sees it, it queues the async queue behind all the read
      queues and the async queue gets scheduled after about one second. When
      it is scheduled, that one request gets dispatched and async queue is
      expired as it has no more requests to submit. Overall we submit about
      one write request per second.
      
      Although this scheduling is beneficial for read latency, writes are
      heavily starved and this causes large delays all over the system (due to
      processes blocking on page lock, transaction starts, etc.). When
      writeback throttling is disabled, write throughput is about one fifth of
      a read throughput which roughly matches readers/writers ratio and
      overall the system stalls are much shorter.
      
      Mixing writeback throttling logic with CFQ throttling logic is always a
      recipe for surprises as CFQ assumes it sees the big part of the picture
      which is not necessarily true when writeback throttling is blocking
      requests. So disable writeback throttling logic by default when CFQ is
      used as an IO scheduler.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      142bbdfc
  3. 04 Apr, 2017 18 commits