1. 12 Sep, 2019 10 commits
  2. 11 Sep, 2019 3 commits
  3. 10 Sep, 2019 6 commits
    • iocost_monitor: Report debt · 7c1ee704
      Tejun Heo authored
      Report debt and rename del_ms row to delay for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iocost_monitor: Report more info with higher accuracy · b06f2d35
      Tejun Heo authored
      When outputting json:
      
      * Don't truncate numbers.
      
      * Report address of iocg to ease drilling down further.
      
      When outputting table:
      
      * Use math.ceil() for delay_ms so that small delays don't read as 0.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
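      The table-output change can be illustrated with a minimal sketch. The
      sample delays are made up for illustration and the variable names are
      not taken from iocost_monitor itself.

```python
import math

# Hypothetical delays in milliseconds, some of them below 1 ms.
delays_ms = [0.0, 0.2, 0.9, 3.1]

# Truncating makes small non-zero delays read as 0.
truncated = [int(d) for d in delays_ms]

# math.ceil() reports any non-zero delay as at least 1.
rounded_up = [math.ceil(d) for d in delays_ms]

print(truncated)
print(rounded_up)
```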
    • iocost_monitor: Always use strings for json values · e742bd5c
      Tejun Heo authored
      Json has limited accuracy for numbers and can silently truncate 64bit
      values, which can be extremely confusing.  Let's consistently use
      string encapsulated values for json output.
      
      While at it, convert an unnecessary f-string to str().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
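      A minimal sketch of the truncation problem, assuming a consumer that
      stores json numbers as IEEE 754 doubles (as JavaScript does). The
      "vtime" key is only an illustrative field name.

```python
import json

value = 2**64 - 1  # largest unsigned 64-bit value

# A double keeps only 53 bits of mantissa; float() models such a consumer.
as_double = int(float(value))

# A string-encapsulated value survives the round trip unchanged.
encoded = json.dumps({"vtime": str(value)})
decoded = int(json.loads(encoded)["vtime"])

print(as_double == value)  # precision was lost
print(decoded == value)    # exact
```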
    • blk-iocost: Don't let merges push vtime into the future · e1518f63
      Tejun Heo authored
      Merges have the same problem that forced-bios had which is fixed by
      the previous patch.  The cost of a merge is calculated at the time of
      issue and force-advances vtime into the future.  Until global vtime
      catches up, how the cgroup's hweight changes in the meantime doesn't
      matter and it often leads to situations where the cost is calculated
      at one hweight and paid at a very different one.  See the previous
      patch for more details.
      
      Fix it by never advancing vtime into the future for merges.  If budget
      is available, vtime is advanced.  Otherwise, the cost is charged as
      debt.
      
      This brings merge cost handling in line with issue cost handling in
      ioc_rqos_throttle().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
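      The rule described in this commit can be sketched as a toy model. This
      is not the kernel code: the Iocg class, charge_merge() and the percent
      based hweight are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Iocg:
    vtime: int          # local vtime of the cgroup
    abs_vdebt: int = 0  # debt remembered in absolute vtime


def charge_merge(iocg, vnow, abs_cost, hweight_pct):
    """Charge a merge without ever pushing iocg.vtime past vnow."""
    cost = abs_cost * 100 // hweight_pct  # cost scaled by the current hweight
    if iocg.vtime + cost <= vnow:
        iocg.vtime += cost                # budget available: advance vtime
    else:
        iocg.abs_vdebt += abs_cost        # otherwise charge the cost as debt
```

      In this model a merge that fits in the remaining budget moves vtime
      forward, while one that does not leaves vtime untouched and only grows
      abs_vdebt.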
    • blk-iocost: Account force-charged overage in absolute vtime · 36a52481
      Tejun Heo authored
      Currently, when a bio needs to be force-charged and there isn't enough
      budget, vtime is simply pushed into the future.  This means that the
      cost of the whole bio is scaled using the current hweight and then
      charged immediately.  Until the global vtime advances beyond this
      future vtime, the cgroup won't be allowed to issue normal IOs.
      
      This is incorrect and can lead to, for example, exploding vrate or
      extended stalls if vrate range is constrained.  Consider the following
      scenario.
      
      1. A cgroup with a very low hweight runs out of budget.
      
      2. A storm of swap-out happens on it.  All of them are scaled
         according to the current low hweight and charged to vtime pushing
         it to a far future.
      
      3. All other cgroups go idle and now the above cgroup has access to
         the whole device.  However, because vtime is already wound using
         the past low hweight, what its current hweight is doesn't matter
         until global vtime catches up to the local vtime.
      
      4. As a result, either vrate gets ramped up extremely or the IOs stall
         while the underlying device is idle.
      
      This is because the hweight the overage is calculated at is different
      from the hweight that it's being paid at.
      
      Fix it by remembering the overage in absolute vtime and continuously
      paying with the actual budget according to the current hweight at each
      period.
      
      Note that non-forced bios which wait already remember the cost in
      absolute vtime.  This brings forced-bio accounting in line.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
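      The scenario in this commit can be put into numbers with a small
      sketch. All constants are hypothetical, hweight is expressed in percent,
      and vtime_cost() is an illustrative helper rather than a kernel
      function.

```python
def vtime_cost(abs_cost, hweight_pct):
    """Scale an absolute cost by the inverse of hweight (given in percent)."""
    return abs_cost * 100 // hweight_pct


ABS_COST = 100            # hypothetical absolute cost of one swap-out bio
NR_BIOS = 50              # size of the swap-out storm
LOW_HW, FULL_HW = 1, 100  # hweight before and after the other cgroups go idle

# Old behavior: every bio is scaled at the low hweight and vtime is pushed
# into the future by the total. A later hweight change cannot shrink it.
old_wait = sum(vtime_cost(ABS_COST, LOW_HW) for _ in range(NR_BIOS))

# New behavior: the overage is kept in absolute vtime and paid at whatever
# the hweight is at payment time, here after the cgroup owns the device.
abs_debt = ABS_COST * NR_BIOS
new_wait = vtime_cost(abs_debt, FULL_HW)

print(old_wait, new_wait)
```

      Under these assumptions the old accounting leaves the cgroup 100 times
      further ahead of global vtime than the absolute accounting does, which
      matches the ratio between the two hweights.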
    • blk-iocost: Fix incorrect operation order during iocg free · e036c4ca
      Tejun Heo authored
      ioc_pd_free() first cancels the hrtimers and then deactivates the
      iocg.  However, the iocg timer can run in between and reschedule the
      hrtimers, which will end up running after the iocg is freed, leading
      to crashes like the following.
      
        general protection fault: 0000 [#1] SMP
        ...
        RIP: 0010:iocg_kick_delay+0xbe/0x1b0
        RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
        RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
        RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
        R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
        FS:  0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         iocg_delay_timer_fn+0x3d/0x60
         __hrtimer_run_queues+0xfe/0x270
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x5e/0x120
         apic_timer_interrupt+0xf/0x20
         </IRQ>
      
      Fix it by canceling hrtimers after deactivating the iocg.
      
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
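      The ordering problem can be replayed deterministically in a toy model.
      It assumes, for illustration only, that the timer re-arms the hrtimers
      of every iocg that is still on an active list. The class and function
      names are made up and do not mirror the kernel structures.

```python
class Ioc:
    """Toy device: its timer re-arms hrtimers of every active iocg."""

    def __init__(self):
        self.active = set()

    def timer_fn(self):
        for iocg in self.active:
            iocg.hrtimer_armed = True


class Iocg:
    def __init__(self, ioc):
        self.ioc = ioc
        self.hrtimer_armed = True
        ioc.active.add(self)


def cancel_hrtimers(iocg):
    iocg.hrtimer_armed = False


def deactivate(iocg):
    iocg.ioc.active.discard(iocg)


def free_iocg(iocg, cancel_first):
    """Return True if an hrtimer is still armed when the iocg is freed."""
    steps = [cancel_hrtimers, deactivate]
    if not cancel_first:
        steps.reverse()
    steps[0](iocg)
    iocg.ioc.timer_fn()  # worst case: the timer fires between the two steps
    steps[1](iocg)
    return iocg.hrtimer_armed
```

      With cancel_first=True the timer re-arms the hrtimer after it was
      cancelled, so it is still pending at free time. Deactivating first
      removes the iocg from the timer's view, and the later cancel is final.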
  4. 06 Sep, 2019 11 commits
    • bfq: Add per-device weight · 795fe54c
      Fam Zheng authored
      This adds to BFQ the missing per-device weight interfaces:
      blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
      implementation pretty closely resembles what we had in CFQ and the parsing code
      is basically reused.
      
      Tests
      =====
      
      Using two cgroups and three block devices, with weights set up as:
      
      Cgroup          test1           test2
      ============================================
      default         100             500
      sda             500             100
      sdb             default         default
      sdc             200             200
      
      cgroup v1 runs
      --------------
      
          sda.test1.out:   READ: bw=913MiB/s
          sda.test2.out:   READ: bw=183MiB/s
      
          sdb.test1.out:   READ: bw=213MiB/s
          sdb.test2.out:   READ: bw=1054MiB/s
      
          sdc.test1.out:   READ: bw=650MiB/s
          sdc.test2.out:   READ: bw=650MiB/s
      
      cgroup v2 runs
      --------------
      
          sda.test1.out:   READ: bw=915MiB/s
          sda.test2.out:   READ: bw=184MiB/s
      
          sdb.test1.out:   READ: bw=216MiB/s
          sdb.test2.out:   READ: bw=1069MiB/s
      
          sdc.test1.out:   READ: bw=621MiB/s
          sdc.test2.out:   READ: bw=622MiB/s
      Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
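      The test results in this commit can be checked against the configured
      weights with a short sketch. It assumes that a per-device weight
      overrides the cgroup default and that bandwidth is shared in proportion
      to weight. The helper names are illustrative and not part of BFQ.

```python
# Weights from the table in this commit; None means "no per-device weight".
default_weight = {"test1": 100, "test2": 500}
device_weight = {
    "sda": {"test1": 500, "test2": 100},
    "sdb": {"test1": None, "test2": None},
    "sdc": {"test1": 200, "test2": 200},
}


def effective_weight(cgroup, dev):
    """Per-device weight if set, otherwise the cgroup's default weight."""
    w = device_weight.get(dev, {}).get(cgroup)
    return default_weight[cgroup] if w is None else w


def expected_ratio(dev):
    """Expected test1:test2 bandwidth ratio under proportional sharing."""
    return effective_weight("test1", dev) / effective_weight("test2", dev)


# Measured cgroup v1 bandwidths (MiB/s) from the runs in this commit.
measured = {"sda": (913, 183), "sdb": (213, 1054), "sdc": (650, 650)}
for dev, (bw1, bw2) in measured.items():
    print(dev, expected_ratio(dev), round(bw1 / bw2, 2))
```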
    • bfq: Extract bfq_group_set_weight from bfq_io_set_weight_legacy · 5ff047e3
      Fam Zheng authored
      This function will be useful when we update weight from the soon-coming
      per-device interface.
      Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bfq: Fix the missing barrier in __bfq_entity_update_weight_prio · e9d3c866
      Fam Zheng authored
      The comment of bfq_group_set_weight says the reading of prio_changed
      should happen before the reading of weight, but a memory barrier is
      missing here. Add it now, to match the smp_wmb() there.
      Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix elevator_get_by_features() · a2614255
      Jens Axboe authored
      The lookup logic is broken - 'e' will never be NULL, even if the
      list is empty. Maintain lookup hit in a separate variable instead.
      
      Fixes: a0958ba7 ("block: Improve default elevator selection")
      Reported-by: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
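      The pattern behind this fix can be shown with a rough Python analogy.
      It only covers the case of a non-empty list with no match. In the C
      code the cursor is non-NULL even for an empty list, which Python cannot
      reproduce. The Elevator tuple and the feature bit are illustrative.

```python
from collections import namedtuple

Elevator = namedtuple("Elevator", "name features")


def lookup_broken(elv_list, required):
    """Return the loop cursor: on a miss this is the last entry, not None."""
    e = None
    for e in elv_list:
        if (e.features & required) == required:
            break
    return e


def lookup_fixed(elv_list, required):
    """Keep the lookup hit in a separate variable, as the fix does."""
    found = None
    for e in elv_list:
        if (e.features & required) == required:
            found = e
            break
    return found
```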
    • sd: Set ELEVATOR_F_ZBD_SEQ_WRITE for ZBC disks · ebddd2a1
      Damien Le Moal authored
      Using the helper blk_queue_required_elevator_features(), set the
      elevator feature ELEVATOR_F_ZBD_SEQ_WRITE as required for the request
      queue of SCSI ZBC disks.
      
      This feature requirement can always be satisfied as the mq-deadline
      elevator is always selected for in-kernel compilation when
      CONFIG_BLK_DEV_ZONED (zoned block device support) is enabled.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Set ELEVATOR_F_ZBD_SEQ_WRITE for nullblk zoned disks · 780d97a9
      Damien Le Moal authored
      Using the helper blk_queue_required_elevator_features(), set the
      elevator feature ELEVATOR_F_ZBD_SEQ_WRITE as required for the request
      queue of null_blk devices created with zoned mode enabled.
      
      This feature requirement can always be satisfied as the mq-deadline
      elevator is always selected for in-kernel compilation when
      CONFIG_BLK_DEV_ZONED (zoned block device support) is enabled.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Delay default elevator initialization · 737eb78e
      Damien Le Moal authored
      When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
      the only information known about the device is the number of hardware
      queues as the block device scan by the device driver is not completed
      yet for most drivers. The device type and elevator required features
      are not set yet, preventing the correct selection of the default
      elevator most suitable for the device.
      
      This currently affects all multi-queue zoned block devices which default
      to the "none" elevator instead of the required "mq-deadline" elevator.
      These drives currently include host-managed SMR disks connected to a
      smartpqi HBA and null_blk block devices with zoned mode enabled.
      Upcoming NVMe Zoned Namespace devices will also be affected.
      
      Fix this by adding the boolean elevator_init argument to
      blk_mq_init_allocated_queue() to control the execution of
      elevator_init_mq(). Two cases exist:
      1) elevator_init = false is used for calls to
         blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
         case, a call to elevator_init_mq() is added to __device_add_disk(),
         resulting in the delayed initialization of the queue elevator
         after the device driver finished probing the device information. This
         effectively allows elevator_init_mq() access to more information
         about the device.
      2) elevator_init = true preserves the current behavior of initializing
         the elevator directly from blk_mq_init_allocated_queue(). This case
         is used for the special request based DM devices where the device
         gendisk is created before the queue initialization and device
         information (e.g. queue limits) is already known when the queue
         initialization is executed.
      
      Additionally, to make sure that the elevator initialization is never
      done while requests are in-flight (there should be none when the device
      driver calls device_add_disk()), freeze and quiesce the device request
      queue before calling blk_mq_init_sched() in elevator_init_mq().
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Improve default elevator selection · a0958ba7
      Damien Le Moal authored
      For block devices that do not specify required features, preserve the
      current default elevator selection (mq-deadline for single queue
      devices, none for multi-queue devices). However, for devices specifying
      required features (e.g. zoned block devices ELEVATOR_F_ZBD_SEQ_WRITE
      feature), select the first available elevator providing the required
      features.
      
      In all cases, default to "none" if no elevator is available or if the
      initialization of the default elevator fails.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
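      The selection policy described in this commit can be sketched as a
      plain function. The function name, the list-of-tuples representation
      and the feature bit value are assumptions for illustration. The fallback
      on initialization failure is not modeled.

```python
def default_elevator(nr_hw_queues, required_features, available):
    """Pick a default elevator name.

    available is an ordered list of (name, features) pairs standing in for
    the registered elevators.
    """
    if required_features:
        # First available elevator providing all required features.
        for name, features in available:
            if (features & required_features) == required_features:
                return name
        return "none"
    # No required features: keep the previous defaults.
    if nr_hw_queues == 1 and any(name == "mq-deadline" for name, _ in available):
        return "mq-deadline"
    return "none"
```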
    • block: Introduce elevator features · 68c43f13
      Damien Le Moal authored
      Introduce the definition of elevator features through the
      elevator_features flags in the elevator_type structure. Each flag can
      represent a feature supported by an elevator. The first feature defined
      by this patch is support for zoned block device sequential write
      constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
      by the mq-deadline elevator using zone write locking.
      
      Other possible features are IO priorities, write hints, latency targets
      or single-LUN dual-actuator disks (for which the elevator could maintain
      one LBA ordered list per actuator).
      
      The required_elevator_features field is also added to the request_queue
      structure to allow a device driver to specify elevator feature flags
      that an elevator must support for the correct operation of the device
      (e.g. device drivers for zoned block devices can have the
      ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
      The helper function blk_queue_required_elevator_features() is
      defined for setting this new field.
      
      With these two new fields in place, the elevator functions
      elevator_match() and elevator_find() are modified to allow a user to set
      only an elevator with a set of features that satisfies the device
      required features. Elevators not matching the device requirements are
      not shown in the device sysfs queue/scheduler file to prevent their use.
      
      The "none" elevator can always be selected as before.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
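      How feature flags restrict the schedulers offered for a queue can be
      sketched with bit flags. The ElevatorFeature enum is a stand-in for the
      kernel flag, its numeric value is illustrative, and selectable() is a
      hypothetical helper rather than a kernel function.

```python
from enum import IntFlag


class ElevatorFeature(IntFlag):
    ZBD_SEQ_WRITE = 1  # stand-in for ELEVATOR_F_ZBD_SEQ_WRITE


def selectable(elevators, required):
    """Names that may be selected for a queue; "none" is always allowed."""
    names = [
        name
        for name, features in elevators.items()
        if (features & required) == required
    ]
    return ["none"] + names
```

      An elevator qualifies only when it provides every required bit, so a
      queue with no required features sees all registered elevators.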
    • block: Change elevator_init_mq() to always succeed · 954b4a5c
      Damien Le Moal authored
      If the default elevator chosen is mq-deadline, elevator_init_mq() may
      return an error if mq-deadline initialization fails, leading to
      blk_mq_init_allocated_queue() returning an error, which in turn will
      cause the block device initialization to fail and the device not to be
      exposed.
      
      Instead of taking such an extreme measure, handle mq-deadline
      initialization failures in the same manner as when mq-deadline is not
      available (no module to load), that is, default to the "none" scheduler.
      With this change, elevator_init_mq() return type can be changed to void.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Cleanup elevator_init_mq() use · 61db437d
      Damien Le Moal authored
      Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
      calling elevator_init_mq() to make sure that the queue supports IO
      scheduling, use the elevator.c function elv_support_iosched() in
      elevator_init_mq(). This does not introduce any functional change but
      ensures that elevator_init_mq() does the right thing based on the queue
      settings.
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 05 Sep, 2019 2 commits
  6. 04 Sep, 2019 4 commits
  7. 03 Sep, 2019 4 commits
    • md/raid5: use bio_end_sector to calculate last_sector · b0f01ecf
      Guoqing Jiang authored
      Use the common way to get last_sector.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md/raid1: fail run raid1 array when active disk less than one · 07f1a685
      Yufen Yu authored
      When running the following test case:
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      the mdadm run fails with the following kernel messages:
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      
      In fact, when the number of active disks in a raid1 array is less than
      one, raid1_run() needs to return failure.
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone · 62f7b198
      Guilherme G. Piccoli authored
      Currently md raid0/linear are not provided with any mechanism to validate
      if an array member got removed or failed. The driver keeps sending BIOs
      regardless of the state of array members, and kernel shows state 'clean'
      in the 'array_state' sysfs attribute. This leads to the following
      situation: if a raid0/linear array member is removed and the array is
      mounted, some user writing to this array won't realize that errors are
      happening unless they check dmesg or perform one fsync per written file.
      Despite udev signaling the member device is gone, 'mdadm' cannot issue the
      STOP_ARRAY ioctl successfully, given the array is mounted.
      
      In other words, no -EIO is returned and writes (except direct ones) appear
      normal. This means the user might think the written data is correctly
      stored in the array, but instead garbage was written, given that raid0
      does striping (and so requires all its members to be working in order
      not to corrupt data). For md/linear, writes to the available members
      will work fine, but if the writes go to the missing member(s), files get
      corrupted because the portion of the writes directed at the missing
      devices is never actually written.
      
      This patch changes this behavior: we check if the block device's gendisk
      is UP when submitting the BIO to the array member, and if it isn't, we flag
      the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
      request to the array requiring data from a valid member is still completed.
      While flagging the device as MD_BROKEN, we also show a rate-limited warning
      in the kernel log.
      
      A new array state 'broken' was added too: it mimics the state 'clean' in
      every aspect, being useful only to distinguish if the array has some member
      missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
      state. This state cannot be written to 'array_state': it only indicates
      that one or more members of the array are missing and otherwise acts
      like 'clean', so writing it wouldn't make sense.
      
      With this patch, the filesystem reacts much faster to a missing
      array member: after some I/O errors, ext4 for instance aborts the journal
      and prevents corruption. Without this change, we're able to keep writing
      to the disk and after a machine reboot, e2fsck shows some severe fs errors
      that demand fixing. This patch was tested in ext4 and xfs filesystems, and
      requires a 'mdadm' counterpart to handle the 'broken' state.
      
      Cc: Song Liu <songliubraving@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
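      The behavior change in this commit can be sketched with a toy model. It
      assumes a per-member liveness check at submission time. The Member and
      Array classes are illustrative and do not mirror the md data
      structures, and the rate-limited warning is left out.

```python
EIO = 5


class Member:
    def __init__(self, up=True):
        self.up = up  # stands in for "the member's gendisk is UP"


class Array:
    def __init__(self, members):
        self.members = members
        self.broken = False  # stands in for the MD_BROKEN flag

    @property
    def array_state(self):
        return "broken" if self.broken else "clean"

    def submit(self, member_idx):
        """Return 0 on success or a negative errno, like a completed BIO."""
        if not self.members[member_idx].up:
            self.broken = True
            return -EIO
        return 0
```

      In this model I/O that only needs a valid member still completes after
      the array has been flagged, while I/O to the missing member fails and
      the reported state switches from 'clean' to 'broken'.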
    • closures: fix a race on wakeup from closure_sync · a22a9602
      Kent Overstreet authored
      The race was when a thread using closure_sync() notices cl->s->done == 1
      before the thread calling closure_put() calls wake_up_process(). Then,
      it's possible for that thread to return and exit just before
      wake_up_process() is called - so we're trying to wake up a process that
      no longer exists.
      
      rcu_read_lock() is sufficient to protect against this, as there's an rcu
      barrier somewhere in the process teardown path.
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Acked-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>