  1. 29 May, 2018 3 commits
    • blk-mq: Remove generation sequence · 12f5b931
      Keith Busch authored
      This patch simplifies the timeout handling by relying on the request
      reference counting to ensure the iterator is operating on an inflight
      and truly timed out request. Since the reference counting prevents the
      tag from being reallocated, the block layer no longer needs to prevent
      drivers from completing their requests while the timeout handler is
      operating on it: a driver completing a request is allowed to proceed to
      the next state without additional synchronization with the block layer.
      
      This also removes any need for generation sequence numbers, since the
      request cannot be reallocated as a new sequence while timeout handling
      is operating on it.
      
      To enable this, a refcount is added to struct request so that request
      users can be sure they're operating on the same request without it
      changing while they're processing it.  The request's tag won't be
      released for reuse until both the timeout handler and the completion
      are done with it.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
       for completions]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
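The reference-counting idea in the message above can be sketched in user space. This is a minimal model, not the kernel code: `struct ref_request`, `ref_start()`, `ref_get_unless_zero()` and `ref_put()` are hypothetical names, and C11 atomics stand in for the kernel's refcount helpers. The point it illustrates is that the tag is only released once both the completion and the timeout handler have dropped their references.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct request; only what the sketch needs. */
struct ref_request {
    atomic_int ref;     /* 0 means nobody holds the request any more */
    bool tag_released;  /* set when the last reference is dropped */
};

/* Starting a request hands one reference to the submission/completion side. */
static void ref_start(struct ref_request *rq)
{
    rq->tag_released = false;
    atomic_store(&rq->ref, 1);
}

/* Timeout side: pin the request only if it is still live (ref != 0). */
static bool ref_get_unless_zero(struct ref_request *rq)
{
    int old = atomic_load(&rq->ref);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&rq->ref, &old, old + 1))
            return true;
        /* A failed compare-exchange reloaded 'old'; retry with it. */
    }
    return false;
}

/* Drop a reference; whoever drops the last one releases the tag. */
static void ref_put(struct ref_request *rq)
{
    if (atomic_fetch_sub(&rq->ref, 1) == 1)
        rq->tag_released = true;
}
```

In this model a timeout handler calls `ref_get_unless_zero()` before inspecting the request and `ref_put()` afterwards, so a concurrent completion can proceed without the tag being reused underneath the handler.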
    • blk-mq: Fix timeout and state order · ad103e79
      Keith Busch authored
      The block layer had been setting the state to in-flight prior to updating
      the timer. This is the wrong order since the timeout handler could observe
      the in-flight state with the older timeout, believing the request had
      expired when in fact it is just getting started.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
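The ordering described above can be illustrated with a small user-space sketch. The names (`struct timed_request`, `timed_start()`, `timed_expired()`) are hypothetical, C11 atomics stand in for the kernel primitives, and timer wraparound is ignored for brevity. The deadline is written first and the in-flight state is published afterwards, so a scanner that observes the in-flight state also observes the fresh deadline.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum timed_state { TIMED_IDLE, TIMED_IN_FLIGHT };

/* Hypothetical request: a deadline plus a state word read by the timeout scan. */
struct timed_request {
    atomic_ulong deadline;
    atomic_int state;
};

/* Start path: update the deadline first, then publish the in-flight state. */
static void timed_start(struct timed_request *rq, unsigned long now,
                        unsigned long timeout)
{
    atomic_store_explicit(&rq->deadline, now + timeout, memory_order_relaxed);
    /* Release store: a reader that sees IN_FLIGHT also sees the new deadline. */
    atomic_store_explicit(&rq->state, TIMED_IN_FLIGHT, memory_order_release);
}

/* Timeout scan: only an in-flight request past its deadline counts as expired. */
static bool timed_expired(struct timed_request *rq, unsigned long now)
{
    if (atomic_load_explicit(&rq->state, memory_order_acquire) != TIMED_IN_FLIGHT)
        return false;
    return now >= atomic_load_explicit(&rq->deadline, memory_order_relaxed);
}
```

With the two stores swapped, the scan could pair the in-flight state with a stale deadline left over from the request's previous use and report a request that has only just started as expired.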
    • libata: remove ata_scsi_timed_out · 01fc27d9
      Christoph Hellwig authored
      As far as I can tell this function can't even be called any more, given
      that ATA implements its own eh_strategy_handler with ata_scsi_error, which
      never calls ->eh_timed_out.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 28 May, 2018 4 commits
  3. 25 May, 2018 1 commit
  4. 24 May, 2018 2 commits
    • block drivers/block: Use octal not symbolic permissions · 5657a819
      Joe Perches authored
      Convert the S_<FOO> symbolic permissions to their octal equivalents as
      using octal and not symbolic permissions is preferred by many as more
      readable.
      
      see: https://lkml.org/lkml/2016/8/2/1945
      
      Done with automated conversion via:
      $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...>
      
      Miscellanea:
      
      o Wrapped modified multi-line calls to a single line where appropriate
      o Realigned modified multi-line calls to open parenthesis
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
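As a small illustration of what the conversion means, the following user-space sketch compares a symbolic mode with its octal spelling. `S_IRUGO` is a kernel-only shorthand, so it is rebuilt here from the POSIX bits as `DEMO_S_IRUGO`; that macro and the two helper functions are assumptions of this example, not part of the patch.

```c
#include <assert.h>
#include <sys/stat.h>

/* Kernel-style shorthand rebuilt from POSIX bits so this compiles in user space. */
#define DEMO_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Symbolic spelling of "owner may read and write, everyone else may read". */
static unsigned int perms_symbolic(void)
{
    return DEMO_S_IRUGO | S_IWUSR;
}

/* Octal spelling of the same mode, the form the conversion prefers. */
static unsigned int perms_octal(void)
{
    return 0644;
}
```

Both functions return the same value; the octal form can be read directly as owner, group and other digits.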
    • blk-mq: avoid starving tag allocation after allocating process migrates · e6fc4649
      Ming Lei authored
      When the allocating process is scheduled back in and its mapped hw queue
      has changed, fake one extra wakeup on the previous queue to compensate
      for the missed wakeup, so other allocations on the previous queue won't
      be starved.
      
      This patch fixes a request allocation hang, which can be triggered
      easily when nr_requests is very low.
      
      The race is as follows:
      
      1) 2 hw queues, nr_requests is 2, and wake_batch is one
      
      2) there are 3 waiters on hw queue 0
      
      3) two in-flight requests in hw queue 0 are completed, and only two
         of the 3 waiters are woken up because of wake_batch, but both
         woken waiters can be scheduled to another CPU and so switch to hw
         queue 1
      
      4) then the 3rd waiter will wait forever, since no in-flight request
         is in hw queue 0 any more.
      
      5) this patch fixes it with a fake wakeup when a waiter is scheduled to
         another hw queue
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
      Modified commit message to make it clearer, and make it apply on
      top of the 4.18 branch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
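The race in steps 1) to 5) above can be replayed with a toy model. Everything here is hypothetical (`struct tag_hwq` and the `tag_*` helpers are not kernel names), and the model simplifies wake_batch to "one wakeup per completed request". It only tracks how many waiters are still asleep on a queue and whether anything remains in flight to wake them.

```c
#include <assert.h>

/* Hypothetical model of one hardware queue's tag wait state. */
struct tag_hwq {
    int free_tags;  /* tags available for allocation */
    int sleepers;   /* waiters still asleep on this queue */
};

/* Completing n requests frees n tags and wakes at most n sleepers. */
static int tag_complete(struct tag_hwq *q, int n)
{
    int woken = n < q->sleepers ? n : q->sleepers;

    q->free_tags += n;
    q->sleepers -= woken;
    return woken;
}

/* A woken waiter that migrated leaves its old queue without taking a tag.
 * With 'compensate' set it passes its wakeup on to one remaining sleeper. */
static void tag_migrate_away(struct tag_hwq *old_q, int compensate)
{
    if (compensate && old_q->sleepers > 0)
        old_q->sleepers--;
}

/* Sleepers on a queue with nothing in flight will never be woken again. */
static int tag_stuck_sleepers(const struct tag_hwq *q, int nr_requests)
{
    int in_flight = nr_requests - q->free_tags;

    return in_flight == 0 ? q->sleepers : 0;
}
```

Replaying the scenario with three sleepers and two completions leaves one stuck sleeper when the migrating waiters do not compensate, and none when they do.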
  5. 23 May, 2018 2 commits
    • bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue · f1834646
      Tejun Heo authored
      From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
      From: Tejun Heo <tj@kernel.org>
      Date: Wed, 23 May 2018 10:29:00 -0700
      
      cgwb_release() punts the actual release to cgwb_release_workfn() on
      system_wq.  Depending on the number of cgroups or block devices, there
      can be a lot of cgwb_release_workfn() in flight at the same time.
      
      We're periodically seeing close to 256 kworkers getting stuck with the
      following stack trace, and over time the entire system gets stuck.
      
        [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
        [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
        [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
        [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
        [<ffffffff811cd645>] wb_exit+0x85/0xa0
        [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
        [<ffffffff810a68d0>] process_one_work+0x150/0x410
        [<ffffffff810a71fd>] worker_thread+0x6d/0x520
        [<ffffffff810ad3dc>] kthread+0x12c/0x160
        [<ffffffff81969019>] ret_from_fork+0x29/0x40
        [<ffffffffffffffff>] 0xffffffffffffffff
      
      The events leading to the lockup are...
      
      1. Many cgwb_release_workfn() work items are queued at the same time and
         all system_wq kworkers are assigned to execute them.
      
      2. They all end up calling synchronize_rcu_expedited().  One of them
         wins and tries to perform the expedited synchronization.
      
      3. However, that involves queueing rcu_exp_work to system_wq and
         waiting for it.  Because #1 is holding all available kworkers on
         system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
         is waiting for synchronize_rcu_expedited() which in turn is waiting
         for cgwb_release_workfn() to free up some of the kworkers.
      
      We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
      same time.  There's nothing to be gained from that.  This patch
      updates cgwb release path to use a dedicated percpu workqueue with
      @max_active of 1.
      
      While this resolves the problem at hand, it might be a good idea to
      isolate rcu_exp_work to its own workqueue too as it can be used from
      various paths and is prone to this sort of indirect A-A deadlock.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
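The lockup described above can be reduced to a counting argument, sketched here as a toy model in user space. The function names are hypothetical and no workqueue API is used; the model only asks whether the shared pool has a worker left for the helper item (rcu_exp_work in the message) once the release items have taken theirs.

```c
#include <assert.h>
#include <stdbool.h>

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* How many release items occupy shared-pool workers, depending on where
 * they are queued. A dedicated queue keeps them off the shared pool. */
static int releases_on_shared_pool(int queued_releases, bool dedicated_queue)
{
    return dedicated_queue ? 0 : queued_releases;
}

/* On a dedicated queue, at most max_active release items run at once. */
static int releases_running_dedicated(int queued_releases, int max_active)
{
    return min_int(queued_releases, max_active);
}

/* The helper item can only run if the shared pool still has an idle worker.
 * Each running release item blocks its worker until the helper finishes. */
static bool helper_can_run(int shared_workers, int releases_on_shared)
{
    int busy = min_int(shared_workers, releases_on_shared);

    return busy < shared_workers;
}
```

With 256 shared workers and 300 queued release items, the helper cannot run when the items share the pool, which is the circular wait from the message. Moving them to a dedicated queue with a max_active of 1 leaves the shared pool free and runs one release item at a time.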
    • nbd: set discard granularity properly · 6df133a1
      Josef Bacik authored
      For some reason we always had discard granularity set to 512, even when
      discards were disabled.  Fix this by having the default be 0, and then
      if we turn discards on, set the discard granularity to the blocksize.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
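The rule stated above is small enough to show as a sketch. `struct demo_limits` and `demo_config_discard()` are hypothetical names for this example and do not mirror the nbd driver's actual structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of a block device's queue limits. */
struct demo_limits {
    unsigned int discard_granularity;
};

/* Granularity defaults to 0 and follows the block size only when
 * discards are enabled. */
static void demo_config_discard(struct demo_limits *lim, bool discard_enabled,
                                unsigned int blocksize)
{
    lim->discard_granularity = discard_enabled ? blocksize : 0;
}
```

A device with discards disabled therefore reports a granularity of 0 rather than a fixed 512.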
  6. 22 May, 2018 3 commits
  7. 21 May, 2018 2 commits
    • nvme-pci: fix race between poll and IRQ completions · 68fa9dbe
      Jens Axboe authored
      If polling completions are racing with the IRQ triggered by a
      completion, the IRQ handler will find no work and return IRQ_NONE.
      This can trigger complaints about spurious interrupts:
      
      [  560.169153] irq 630: nobody cared (try booting with the "irqpoll" option)
      [  560.175988] CPU: 40 PID: 0 Comm: swapper/40 Not tainted 4.17.0-rc2+ #65
      [  560.175990] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.0010.010920180151 01/09/2018
      [  560.175991] Call Trace:
      [  560.175994]  <IRQ>
      [  560.176005]  dump_stack+0x5c/0x7b
      [  560.176010]  __report_bad_irq+0x30/0xc0
      [  560.176013]  note_interrupt+0x235/0x280
      [  560.176020]  handle_irq_event_percpu+0x51/0x70
      [  560.176023]  handle_irq_event+0x27/0x50
      [  560.176026]  handle_edge_irq+0x6d/0x180
      [  560.176031]  handle_irq+0xa5/0x110
      [  560.176036]  do_IRQ+0x41/0xc0
      [  560.176042]  common_interrupt+0xf/0xf
      [  560.176043]  </IRQ>
      [  560.176050] RIP: 0010:cpuidle_enter_state+0x9b/0x2b0
      [  560.176052] RSP: 0018:ffffa0ed4659fe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
      [  560.176055] RAX: ffff9527beb20a80 RBX: 000000826caee491 RCX: 000000000000001f
      [  560.176056] RDX: 000000826caee491 RSI: 00000000335206ee RDI: 0000000000000000
      [  560.176057] RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000008
      [  560.176059] R10: ffffa0ed4659fe78 R11: 0000000000000001 R12: ffff9527beb29358
      [  560.176060] R13: ffffffffa235d4b8 R14: 0000000000000000 R15: 000000826caed593
      [  560.176065]  ? cpuidle_enter_state+0x8b/0x2b0
      [  560.176071]  do_idle+0x1f4/0x260
      [  560.176075]  cpu_startup_entry+0x6f/0x80
      [  560.176080]  start_secondary+0x184/0x1d0
      [  560.176085]  secondary_startup_64+0xa5/0xb0
      [  560.176088] handlers:
      [  560.178387] [<00000000efb612be>] nvme_irq [nvme]
      [  560.183019] Disabling IRQ #630
      
      A previous commit removed ->cqe_seen that was handling this case,
      but we need to handle this a bit differently due to completions
      now running outside the queue lock. Return IRQ_HANDLED from the
      IRQ handler, if the completion ring head was moved since we last
      saw it.
      
      Fixes: 5cb525c8 ("nvme-pci: handle completions outside of the queue lock")
      Reported-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Tested-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
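The fix described above, reporting the interrupt as handled when the completion ring head has moved since the handler last looked, can be modelled in user space. This is a sketch under stated assumptions: `struct demo_cq`, `demo_poll()` and `demo_irq()` are hypothetical, locking is omitted, and the ring is reduced to a head counter plus a count of pending entries.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_irqreturn { DEMO_IRQ_NONE, DEMO_IRQ_HANDLED };

/* Hypothetical completion queue state. */
struct demo_cq {
    unsigned int cq_head;        /* advanced by whoever reaps completions */
    unsigned int pending;        /* completions posted but not yet reaped */
    unsigned int last_seen_head; /* head value when the IRQ handler last ran */
};

/* Reap everything pending; returns how many entries were consumed. */
static unsigned int demo_reap(struct demo_cq *cq)
{
    unsigned int n = cq->pending;

    cq->cq_head += n;
    cq->pending = 0;
    return n;
}

/* Poll path: reaps completions but leaves last_seen_head alone. */
static unsigned int demo_poll(struct demo_cq *cq)
{
    return demo_reap(cq);
}

/* IRQ path: handled if we reaped something, or if a poller moved the head
 * since the previous interrupt and so consumed the work on our behalf. */
static enum demo_irqreturn demo_irq(struct demo_cq *cq)
{
    bool head_moved = cq->cq_head != cq->last_seen_head;
    unsigned int reaped = demo_reap(cq);

    cq->last_seen_head = cq->cq_head;
    return (reaped || head_moved) ? DEMO_IRQ_HANDLED : DEMO_IRQ_NONE;
}
```

When a poller reaps the completion before the interrupt arrives, `demo_irq()` finds nothing to reap but still reports the interrupt as handled, because the head moved. A second interrupt with no intervening work is reported as not handled.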
    • Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block · 81b1dab4
      Jens Axboe authored
      Pull NVMe changes from Keith:
      
      "This is just the first nvme pull request for 4.18. There are several
      fabrics and target patches that I missed, so there will be more to
      come."
      
      * 'nvme-4.18' of git://git.infradead.org/nvme:
        nvme-pci: drop IRQ disabling on submission queue lock
        nvme-pci: split the nvme queue lock into submission and completion locks
        nvme-pci: handle completions outside of the queue lock
        nvme-pci: move ->cq_vector == -1 check outside of ->q_lock
        nvme-pci: remove cq check after submission
        nvme-pci: simplify nvme_cqe_valid
        nvme: mark the result argument to nvme_complete_async_event volatile
        nvme/pci: Sync controller reset for AER slot_reset
        nvme/pci: Hold controller reference during async probe
        nvme: only reconfigure discard if necessary
        nvme/pci: Use async_schedule for initial reset work
        nvme: lightnvm: add granby support
        NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage
        nvme: change order of qid and cmdid in completion trace
        nvme: fc: provide a descriptive error
  8. 18 May, 2018 8 commits
  9. 16 May, 2018 8 commits
  10. 15 May, 2018 1 commit
  11. 14 May, 2018 6 commits