1. 07 Sep, 2020 1 commit
    • fs: Don't invalidate page buffers in block_write_full_page() · 6dbf7bb5
      Jan Kara authored
      If block_write_full_page() is called for a page that is beyond current
      inode size, it will truncate page buffers for the page and return 0.
      This logic was added in 2.5.62 by commit 81eb6906 ("fix ext3
      BUG due to race with truncate") in the history.git tree to fix a problem
      with ext3 in data=ordered mode. This particular problem doesn't exist
      anymore because ext3 is long gone and ext4 handles ordered data
      differently. Also normally buffers are invalidated by truncate code and
      there's no need to specially handle this in ->writepage() code.
      
      This invalidation of page buffers in block_write_full_page() is causing
      issues for filesystems (e.g. ext4 or ocfs2) when the block device is
      shrunk behind the filesystem's back and metadata buffers get discarded
      while being tracked by the journalling layer. Although this is obviously
      "not supported", it can cause kernel crashes like:
      
      [ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [ 7986.697197] PGD 0 P4D 0
      [ 7986.699724] Oops: 0002 [#1] SMP PTI
      [ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
      [ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
      [ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
      ...
      [ 7986.810150] Call Trace:
      [ 7986.812595]  __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
      [ 7986.818408]  jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
      [ 7986.836467]  kjournald2+0xbd/0x270 [jbd2]
      
      which is not great. The crash happens because bh->b_private is suddenly
      NULL although BH_JBD flag is still set (this is because
      block_invalidatepage() cleared BH_Mapped flag and subsequent bh lookup
      found buffer without BH_Mapped set, called init_page_buffers() which has
      rewritten bh->b_private). So just remove the invalidation in
      block_write_full_page().
      
      Note that the buffer cache invalidation when a block device changes size
      is already careful to avoid similar problems by using
      invalidate_mapping_pages(), which skips busy buffers, so it was only this
      odd block_write_full_page() behavior that could tear down bdev buffers
      behind the filesystem's back.
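The EOF check behind the removed invalidation is simple page arithmetic. A minimal userspace sketch (assuming 4 KiB pages; the function name is illustrative, not the kernel's):

```python
PAGE_SIZE = 4096

def page_beyond_eof(page_index: int, i_size: int) -> bool:
    # A page is entirely past EOF when its first byte starts at or after
    # i_size. This is the case in which block_write_full_page() used to
    # truncate the page's buffers and which it now handles without
    # invalidating anything.
    return page_index * PAGE_SIZE >= i_size

# A 10000-byte file occupies pages 0..2; page 3 is fully beyond EOF.
assert not page_beyond_eof(2, 10000)
assert page_beyond_eof(3, 10000)
```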
      Reported-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      CC: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 03 Sep, 2020 10 commits
  3. 02 Sep, 2020 29 commits
    • block: remove revalidate_disk() · de09077c
      Christoph Hellwig authored
      Remove the now unused helper.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Acked-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvdimm: simplify revalidate_disk handling · 32f61d67
      Christoph Hellwig authored
      The nvdimm block driver abuses revalidate_disk in a strange way that is
      totally unrelated to what other drivers do.  Simplify this by just
      calling nvdimm_revalidate_disk (which seems rather misnamed) from the
      probe routines, as the additional bdev size revalidation is pointless
      at this point, and remove the revalidate_disk methods, given that
      they can only be triggered from add_disk, which runs right before the
      manual calls.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • sd: open code revalidate_disk · 033a1b98
      Christoph Hellwig authored
      Instead of calling revalidate_disk just do the work directly by
      calling sd_revalidate_disk, and revalidate_disk_size where needed.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: opencode revalidate_disk in nvme_validate_ns · b55d3d21
      Christoph Hellwig authored
      Keep control in the NVMe driver instead of going through an indirect
      call back into ->revalidate_disk.  Also reorder the function a bit to be
      easier to follow with the additional code.
      
      And now that we have removed all callers of revalidate_disk() in the nvme
      code, ->revalidate_disk is only called from the open code when first
      opening the device.  Which is of course totally pointless as we have
      a valid size since the initial scan, and will get an updated view
      through the asynchronous notification every time the size changes.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: use revalidate_disk_size in set_capacity_revalidate_and_notify · b8086d3f
      Christoph Hellwig authored
      Only virtio_blk and xen-blkfront set the revalidate argument to true,
      and both do not implement the ->revalidate_disk method.  So switch
      to the helper that just updates the size instead.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add a new revalidate_disk_size helper · 659e56ba
      Christoph Hellwig authored
      revalidate_disk is a relatively awkward helper for driver use, as it first
      calls an optional driver method and then updates the block device size,
      while most callers either don't need the method call at all, or want to
      keep state between the caller and the called method.
      
      Add a revalidate_disk_size helper that just performs the update of the
      block device size from the gendisk one, and switch all drivers that do
      not implement ->revalidate_disk to use the new helper instead of
      revalidate_disk().
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Acked-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: rename bd_invalidated · f4ad06f2
      Christoph Hellwig authored
      Replace bd_invalidated with a new BDEV_NEED_PART_SCAN flag in a bd_flags
      variable to better describe the condition.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't clear bd_invalidated in check_disk_size_change · 6540fbf6
      Christoph Hellwig authored
      bd_invalidated is set by check_disk_change or in add_disk to initiate a
      partition scan.  Move the clearing of it from check_disk_size_change(),
      which is called from both revalidate_disk() and bdev_disk_changed(), to
      only the latter,
      as that is what is called from the block device open code (and nbd) to
      deal with the bd_invalidated event.  revalidate_disk() on the other hand
      is mostly used to propagate a size update from the gendisk to the block
      device, which is entirely unrelated.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Documentation/filesystems/locking.rst: remove an incorrect sentence · 653cd534
      Christoph Hellwig authored
      unlock_native_capacity is never called from check_disk_change(), and
      while revalidate_disk can be called from check_disk_change(), it can
      also be called from two other places at the moment.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Remove a duplicative condition · 265600b7
      Baolin Wang authored
      Remove a duplicative condition to resolve the cppcheck warning below:
      
      "warning: Redundant condition: sched_allow_merge. '!A || (A && B)' is
      equivalent to '!A || B' [redundantCondition]"
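The boolean identity behind the warning is easy to verify exhaustively:

```python
from itertools import product

# cppcheck's claim: '!A || (A && B)' is equivalent to '!A || B',
# so the inner 'A &&' part of the condition is redundant.
for a, b in product([False, True], repeat=2):
    assert ((not a) or (a and b)) == ((not a) or b)
```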
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: better deal with the delayed not supported case in blk_cloned_rq_check_limits · 8327cce5
      Ritika Srivastava authored
      If a WRITE_ZEROES/WRITE_SAME operation is not supported by the storage,
      blk_cloned_rq_check_limits() will return IO error which will cause
      device-mapper to fail the paths.
      
      Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP.
      BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the
      paths.
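The decision above reduces to a small limit check. A userspace sketch (status names mirror the kernel's blk_status_t values, but the function name and string encodings here are illustrative):

```python
BLK_STS_OK = "ok"
BLK_STS_NOTSUPP = "notsupp"   # device-mapper ignores this; paths survive
BLK_STS_IOERR = "ioerr"       # a hard I/O error would fail the paths

def check_zeroing_limit(request_sectors: int, max_write_zeroes_sectors: int) -> str:
    # A queue limit of 0 means the device never supported the operation
    # at all, so report "not supported" rather than an I/O error that
    # would take the multipath path down.
    if max_write_zeroes_sectors == 0:
        return BLK_STS_NOTSUPP
    if request_sectors > max_write_zeroes_sectors:
        return BLK_STS_IOERR
    return BLK_STS_OK

assert check_zeroing_limit(8, 0) == BLK_STS_NOTSUPP
assert check_zeroing_limit(8, 128) == BLK_STS_OK
```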
      Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Return blk_status_t instead of errno codes · 143d2600
      Ritika Srivastava authored
      Replace returning legacy errno codes with blk_status_t in
      blk_cloned_rq_check_limits().
      Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE · 9d3a39a5
      Khazhismel Kumykov authored
      CAP_SYS_ADMIN is too broad, and ionice fits into CAP_SYS_NICE's grouping.
      
      Retain CAP_SYS_ADMIN permission for backwards compatibility.
      Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: update iocost_monitor.py · a7863b34
      Tejun Heo authored
      iocost went through significant internal changes. Update iocost_monitor.py
      accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: add three debug stats - cost.wait, indebt and indelay · f0bf84a5
      Tejun Heo authored
      These are really cheap to collect and can be useful in debugging iocost
      behavior. Add them as debug stats for now.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: restore inuse update tracepoints · 04603755
      Tejun Heo authored
      Update and restore the inuse update tracepoints.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: implement vtime loss compensation · ac33e91e
      Tejun Heo authored
      When an iocg accumulates too much vtime or gets deactivated, we throw away
      some vtime, which lowers the overall device utilization. As the exact amount
      which is being thrown away is known, we can compensate by accelerating the
      vrate accordingly so that the extra vtime generated in the current period
      matches what got lost.
      
      This significantly improves work conservation when involving high weight
      cgroups with intermittent and bursty IO patterns.
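The compensation described above is simple rate arithmetic. A hedged sketch (function and variable names are illustrative, not the kernel's):

```python
def compensated_vrate(base_vrate: float, lost_vtime: float, period_us: float) -> float:
    # Accelerate vrate just enough that the extra vtime generated over
    # the coming period equals the vtime that was thrown away.
    return base_vrate + lost_vtime / period_us

# Losing 50k vtime units over a 100k-usec period raises vrate by 0.5,
# so the period generates exactly 50k extra vtime.
vrate = compensated_vrate(1.0, 50_000, 100_000)
assert (vrate - 1.0) * 100_000 == 50_000
```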
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: halve debts if device stays idle · dda1315f
      Tejun Heo authored
      A low weight iocg can amass a large amount of debt, for example, when
      anonymous memory gets reclaimed aggressively. If the system has a lot of
      memory paired with a slow IO device, the debt can span multiple seconds or
      more. If there are no other subsequent IO issuers, the in-debt iocg may end
      up blocked paying its debt while the IO device is idle.
      
      This patch implements a mechanism to protect against such pathological
      cases. If the device has been sufficiently idle for a substantial amount of
      time, the debts are halved. The criteria are on the conservative side as we
      want to resolve the rare extreme cases without impacting regular operation
      by forgiving debts too readily.
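The protection reduces to a conservative conditional halving. A minimal sketch (the threshold value and names are illustrative assumptions, not the kernel's):

```python
IDLE_THRESHOLD_US = 5_000_000  # illustrative stand-in for "substantial" idle time

def maybe_forgive_debt(abs_debt: int, idle_us: int) -> int:
    # Halve the debt only after the device has stayed idle long enough;
    # a conservative threshold avoids forgiving debts too readily
    # during regular operation.
    if idle_us >= IDLE_THRESHOLD_US:
        return abs_debt // 2
    return abs_debt

assert maybe_forgive_debt(1000, 6_000_000) == 500
assert maybe_forgive_debt(1000, 100_000) == 1000
```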
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: implement delay adjustment hysteresis · 5160a5a5
      Tejun Heo authored
      Currently, iocost syncs the delay duration to the outstanding debt amount,
      which seemed enough to protect the system from anon memory hogs. However,
      that was mostly because the delay calculation used hweight_inuse, which
      quickly converges towards zero under debt, often punishing debtors overly
      harshly for longer than deserved.
      
      The previous patch fixed the delay calculation, and now the protection against
      anonymous memory hogs isn't enough because the effect of delay is indirect
      and non-linear and a huge amount of future debt can accumulate abruptly
      while unthrottled.
      
      This patch implements delay hysteresis so that delay is decayed
      exponentially over time instead of getting cleared immediately as debt is
      paid off. While the overall behavior is similar to the blk-cgroup
      implementation used by blk-iolatency, a lot of the details are different and
      due to the empirical nature of the mechanism, it's challenging to adapt the
      mechanism for one controller without negatively impacting the other.
      
      As the delay is gradually decayed now, there's no point in running it from
      its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
      the dedicated hrtimer is removed.
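The decay described above can be sketched as exponential decay per timer period rather than an immediate clear (names and a half-life of one period are illustrative assumptions):

```python
def decay_delay(delay_us: float, periods: int, factor: float = 0.5) -> float:
    # Run from the periodic timer (the commit moves this work into
    # ioc_timer_fn()): each period multiplies the remaining delay by a
    # decay factor instead of zeroing it the moment debt is paid off.
    return delay_us * factor ** periods

assert decay_delay(800.0, 1) == 400.0
assert decay_delay(800.0, 3) == 100.0
```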
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: revamp debt handling · c421a3eb
      Tejun Heo authored
      Debt handling had several issues.
      
      * How much inuse a debtor carries wasn't clearly defined. inuse would be
        driven down over time from not issuing IOs but it'd be better to clamp it
        to minimum immediately once in debt.
      
      * How much can be paid off was determined by hweight_inuse. As inuse was
        driven down, the payment amount would fall together regardless of the
        debtor's active weight. This means that the debtors were punished harshly.
      
      * ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after
        iocg_kick_delay().
      
      This patch revamps debt handling so that
      
      * Debt handling owns inuse for iocgs in debt and keeps them at zero.
      
      * Payment amount is determined by hweight_active. This is more deterministic
        and safer than hweight_inuse but still far from ideal in that it doesn't
        factor in possible donations from other iocgs for debt payments. This
        likely needs further improvements in the future.
      
      * ioc_rqos_merge() now calls blkcg_schedule_throttle() as necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: revamp in-period donation snapbacks · b0853ab4
      Tejun Heo authored
      When the margin drops below the minimum on a donating iocg, donation is
      immediately canceled in full. There are a couple shortcomings with the
      current behavior.
      
      * It's abrupt. A small temporary budget deficit can lead to a wide swing in
        weight allocation and a large surplus.
      
      * It's open coded in the issue path but not implemented for the merge path.
        A series of merges at a low inuse can make the iocg incur debts and stall
        incorrectly.
      
      This patch reimplements in-period donation snapbacks so that
      
      * inuse adjustment and cost calculations are factored into
        adjust_inuse_and_calc_cost() which is called from both the issue and merge
        paths.
      
      * Snapbacks are more gradual, occurring in quarter steps.
      
      * A snapback triggers if the margin goes below the low threshold and is
        lower than the budget at the time of the last adjustment.
      
      * For the above, __propagate_weights() stores the margin in
        iocg->saved_margin. Move iocg->last_inuse storing together into
        __propagate_weights() for consistency.
      
      * Full snapback is guaranteed when there are waiters.
      
      * With precise donation and gradual snapbacks, inuse adjustments are now a
        lot more effective and the value of scaling inuse on weight changes isn't
        clear. Remove inuse scaling from weight_update().
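A quarter-step snapback can be sketched as follows (a hypothetical helper; the kernel works in fixed-point hierarchical weights rather than plain integers):

```python
def snapback_inuse(inuse: int, active: int) -> int:
    # Restore a quarter of the remaining donated weight per step instead
    # of cancelling the donation in full; the max() guarantees forward
    # progress even when the remaining gap is small.
    step = max((active - inuse) // 4, 1)
    return min(active, inuse + step)

# Converges gradually from 40 back toward the active weight of 100.
assert snapback_inuse(40, 100) == 55
assert snapback_inuse(100, 100) == 100
```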
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: revamp donation amount determination · f1de2439
      Tejun Heo authored
      iocost has various safety nets to combat inuse adjustment calculation
      inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
      adjustment calculations are now accurate and we can make donation amount
      determinations accurate too.
      
      * Stop keeping track of past usage history and using the maximum. Act on the
        immediate usage information.
      
      * Remove donation constraints defined by SURPLUS_* constants. Donate
        whatever isn't used.
      
      * Determine the donation amount so that the iocg will end up with
        MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
        usage as the previous period. TARGET is set at 50% of period, which is the
        previous maximum. This provides smooth convergence for most repetitive IO
        patterns.
      
      * Apply donation logic early at 20% budget. There's no risk in doing so as
        the calculation is based on the delta between the current budget and the
        target budget at the end of the coming period.
      
      * Remove preemptive iocg activation for zero cost IOs. As donation can reach
        near zero now, the mere activation doesn't provide any protection anymore.
        In the unlikely case that this becomes a problem, the right solution is
        assigning appropriate costs for such IOs.
      
      This significantly improves the donation determination logic while also
      simplifying it. Now all donations are immediate, exact and smooth.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: implement Andy's method for donation weight updates · e08d02aa
      Tejun Heo authored
      iocost implements work conservation by reducing iocg->inuse and propagating
      the adjustment upwards proportionally. However, while I knew the target
      absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure
      out how to determine the iocg->inuse adjustment to achieve that and
      approximated the adjustment by scaling iocg->inuse using the proportion of
      the needed hweight_inuse changes.
      
      When nested, these scalings aren't accurate even when adjusting a single
      node as the donating node also receives the benefit of the donated portion.
      When multiple nodes are donating as they often do, they can be wildly wrong.
      
      iocost employed various safety nets to combat the inaccuracies. There are
      ample buffers in determining how much to donate, the adjustments are
      conservative and gradual. While it can achieve a reasonable level of work
      conservation in simple scenarios, the inaccuracies can easily add up leading
      to significant loss of total work. This in turn makes it difficult to
      closely cap vrate as vrate adjustment is needed to compensate for the loss
      of work. The combination of inaccurate donation calculations and vrate
      adjustments can lead to wide fluctuations and clunky overall behaviors.
      
      Andy Newell devised a method to calculate the needed ->inuse updates to
      achieve the target hweight_inuse's. The method is compatible with the
      proportional inuse adjustment propagation which allows all hot path
      operations to be local to each iocg.
      
      To roughly summarize, Andy's method divides the tree into donating and
      non-donating parts, calculates global donation rate which is used to
      determine the target hweight_inuse for each node, and then derives per-level
      proportions. There's a non-trivial amount of math involved. Please refer to
      the following pdfs for detailed descriptions.
      
        https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
        https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
        https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
      
      This patch implements Andy's method in transfer_surpluses(). This makes the
      donation calculations accurate per cycle and enables further improvements in
      other parts of the donation logic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: restructure surplus donation logic · 93f7d2db
      Tejun Heo authored
      The way the surplus donation logic is structured isn't great. There are two
      separate paths for starting/increasing donations and decreasing them, making
      the logic harder to follow and prone to unnecessary behavior differences.
      
      In preparation for improved donation handling, this patch restructures the
      code so that
      
      * All donors - new, increasing and decreasing - are funneled through the
        same code path.
      
      * The target donation calculation is factored into hweight_after_donation()
        which is called once from the same spot for all possible donors.
      
      * Actual inuse adjustment is factored into transfer_surpluses().
      
      This change introduces a few behavior differences - e.g. donation amount
      reduction now uses the max usage of the recent three periods just like new
      and increasing donations, and inuse now gets adjusted upwards the same way
      it gets downwards. These differences are unlikely to have severely negative
      implications and the whole logic will be revamped soon.
      
      This patch also removes two tracepoints. The existing TPs don't quite fit
      the new implementation. A later patch will update and reinstate them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: decouple vrate adjustment from surplus transfers · 065655c8
      Tejun Heo authored
      Budget donations are inaccurate and could take multiple periods to converge.
      To prevent triggering vrate adjustments while surplus transfers were
      catching up, vrate adjustment was suppressed if donations were increasing,
      which was indicated by non-zero nr_surpluses.
      
      This entangling won't be necessary with the scheduled rewrite of the
      donation mechanism, which will make donations precise and immediate.
      Let's decouple the two
      in preparation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: replace iocg->has_surplus with ->surplus_list · 8692d2db
      Tejun Heo authored
      Instead of marking iocgs with surplus with a flag and filtering for them
      while walking all active iocgs, build a surpluses list. This doesn't make
      much difference now but will help implement the improved donation logic,
      which will iterate iocgs with surplus multiple times.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us · 1aa50d02
      Tejun Heo authored
      Currently, iocg->usages[] which are used to guide inuse adjustments are
      calculated from vtime deltas. This, however, assumes that the hierarchical
      inuse weight at the time of calculation held for the entire period, which
      often isn't true and can lead to significant errors.
      
      Now that we have absolute usage information collected, we can derive
      iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
      decisions are made based on actual absolute usage. The calculated usage is
      clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
      saturation regardless of the current hierarchical inuse weight.
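The derivation above is a ratio of absolute usage time to period length, clamped to the weight range. A sketch assuming iocost's WEIGHT_ONE fixed-point scale is 1 << 16 (the function name is illustrative):

```python
WEIGHT_ONE = 1 << 16  # assumed fixed-point representation of 100%

def calc_usage(usage_us: int, period_us: int) -> int:
    # Derive the usage fraction from absolute usage time and clamp it to
    # [1, WEIGHT_ONE]; WEIGHT_ONE doubles as the saturation signal
    # regardless of the current hierarchical inuse weight.
    raw = usage_us * WEIGHT_ONE // period_us
    return max(1, min(raw, WEIGHT_ONE))

assert calc_usage(50_000, 100_000) == WEIGHT_ONE // 2  # 50% busy
assert calc_usage(200_000, 100_000) == WEIGHT_ONE      # saturated
```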
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: add absolute usage stat · 97eb1975
      Tejun Heo authored
      Currently, iocost doesn't collect or expose any statistics, punting all
      monitoring duties off to the drgn-based iocost_monitor.py. While that
      works for some scenarios, there are some usability and data availability
      challenges. For example, accurate per-cgroup usage information can't be
      tracked by vtime progression at all, and the numbers available in
      iocg->usages[] are really short-term snapshots used for control
      heuristics with possibly significant errors.
      
      This patch implements a per-cgroup absolute usage stat counter and
      exposes it through io.stat along with the current vrate. Usage stat
      collection and flushing employ the same method as cgroup rstat on the
      active iocgs, and the only hot path overhead is preemption toggling and
      adding to a percpu counter.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: grab ioc->lock for debt handling · da437b95
      Tejun Heo authored
      Currently, debt handling requires only iocg->waitq.lock. In the future, we
      want to adjust and propagate inuse changes depending on debt status. Let's
      grab ioc->lock in debt handling paths in preparation.
      
      * Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
        ioc->lock needs to be made before entering the critical sections.
      
      * Add and use iocg_[un]lock() which handles the conditional double locking.
      
      * Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
        the caller grabbed both locks.
      
      This patch is preparatory and the comments contain references to future
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>