1. 24 Jul, 2011 1 commit
    • Wu Fengguang's avatar
      writeback: don't busy retry writeback on new/freeing inodes · fcc5c222
      Wu Fengguang authored
      Fix a system hang bug introduced by commit b7a2441f ("writeback:
      remove writeback_control.more_io") and e8dfc305 ("writeback: elevate
      queue_io() into wb_writeback()") easily reproducible with high memory
      pressure and lots of file creation/deletions, for example, a kernel
      build in limited memory.
      
      It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE 
      state, the flusher will get stuck busy retrying that inode, never
      releasing wb->list_lock. The lock in turn blocks all kinds of other
      tasks when they are trying to grab it.
      
      As put by Jan, it's a safe change regarding data integrity. I_FREEING or
      I_WILL_FREE inodes are written back by iput_final() and it is reclaim
      code that is responsible for eventually removing them. So writeback code
      can safely ignore them. I_NEW inodes should move out of this state when
      they are fully set up and in the writeback round following that, we will
      consider them for writeback. So the change makes sense.                                                         
      
      CC: Jan Kara <jack@suse.cz> 
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      fcc5c222
  2. 10 Jul, 2011 9 commits
    • Wu Fengguang's avatar
      writeback: scale IO chunk size up to half device bandwidth · 1a12d8bd
      Wu Fengguang authored
      Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
      concern of not holding I_SYNC for too long.  (At least, that was the
      comment previously.)  This doesn't make sense now because the only
      time we wait for I_SYNC is if we are calling sync or fsync, and in
      that case we need to write out all of the data anyway.  Previously
      there may have been other code paths that waited on I_SYNC, but not
      any more.					    -- Theodore Ts'o
      
      So remove the MAX_WRITEBACK_PAGES constraint. The writeback pages
      will adapt to as large as the storage device can write within 500ms.
      
      XFS is observed to do IO completions in a batch, and the batch size is
      equal to the write chunk size. To avoid dirty pages to suddenly drop
      out of balance_dirty_pages()'s dirty control scope and create large
      fluctuations, the chunk size is also limited to half the control scope.
      
      The balance_dirty_pages() control scrope is
      
      	[(background_thresh + dirty_thresh) / 2, dirty_thresh]
      
      which is by default [15%, 20%] of global dirty pages, whose range size
      is dirty_thresh / DIRTY_FULL_SCOPE.
      
      The adpative write chunk size will be rounded to the nearest 4MB
      boundary.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=13930
      
      CC: Theodore Ts'o <tytso@mit.edu>
      CC: Dave Chinner <david@fromorbit.com>
      CC: Chris Mason <chris.mason@oracle.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      1a12d8bd
    • Wu Fengguang's avatar
      writeback: trace global_dirty_state · e1cbe236
      Wu Fengguang authored
      Add trace event balance_dirty_state for showing the global dirty page
      counts and thresholds at each global_dirty_limits() invocation.  This
      will cover the callers throttle_vm_writeout(), over_bground_thresh()
      and each balance_dirty_pages() loop.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e1cbe236
    • Wu Fengguang's avatar
      writeback: introduce max-pause and pass-good dirty limits · ffd1f609
      Wu Fengguang authored
      The max-pause limit helps to keep the sleep time inside
      balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
      per task rate limit of 8pages/200ms=160KB/s when dirty exceeded, which
      normally is enough to stop dirtiers from continue pushing the dirty
      pages high, unless there are a sufficient large number of slow dirtiers
      (eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the
      write bandwidth of a slow disk and hence accumulating more and more dirty
      pages).
      
      The pass-good limit helps to let go of the good bdi's in the presence of
      a blocked bdi (ie. NFS server not responding) or slow USB disk which for
      some reason build up a large number of initial dirty pages that refuse
      to go away anytime soon.
      
      For example, given two bdi's A and B and the initial state
      
      	bdi_thresh_A = dirty_thresh / 2
      	bdi_thresh_B = dirty_thresh / 2
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      Then A get blocked, after a dozen seconds
      
      	bdi_thresh_A = 0
      	bdi_thresh_B = dirty_thresh
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
      will be effectively throttled by condition (nr_dirty < dirty_thresh).
      This has two problems:
      (1) we lose the protections for light dirtiers
      (2) balance_dirty_pages() effectively becomes IO-less because the
          (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
          for IO, but balance_dirty_pages() loses an important way to break
          out of the loop which leads to more spread out throttle delays.
      
      DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is,
      DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above
      example while this patch uses the more conservative value 8 so as not to
      surprise people with too many dirty pages than expected.
      
      The max-pause limit won't noticeably impact the speed dirty pages are
      knocked down when there is a sudden drop of global/bdi dirty thresholds.
      Because the heavy dirties will be throttled below 160KB/s which is slow
      enough. It does help to avoid long dirty throttle delays and especially
      will make light dirtiers more responsive.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      ffd1f609
    • Wu Fengguang's avatar
      writeback: introduce smoothed global dirty limit · c42843f2
      Wu Fengguang authored
      The start of a heavy weight application (ie. KVM) may instantly knock
      down determine_dirtyable_memory() if the swap is not enabled or full.
      global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
      dirty thresholds that are _much_ lower than the global/bdi dirty pages.
      
      balance_dirty_pages() will then heavily throttle all dirtiers including
      the light ones, until the dirty pages drop below the new dirty thresholds.
      During this _deep_ dirty-exceeded state, the system may appear rather
      unresponsive to the users.
      
      About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
      threshold to heavy dirtiers than light ones, and the dirty pages will
      be throttled around the heavy dirtiers' dirty threshold and reasonably
      below the light dirtiers' dirty threshold. In this state, only the heavy
      dirtiers will be throttled and the dirty pages are carefully controlled
      to not exceed the light dirtiers' dirty threshold. However if the
      threshold itself suddenly drops below the number of dirty pages, the
      light dirtiers will get heavily throttled.
      
      So introduce global_dirty_limit for tracking the global dirty threshold
      with policies
      
      - follow downwards slowly
      - follow up in one shot
      
      global_dirty_limit can effectively mask out the impact of sudden drop of
      dirtyable memory. It will be used in the next patch for two new type of
      dirty limits. Note that the new dirty limits are not going to avoid
      throttling the light dirtiers, but could limit their sleep time to 200ms.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      c42843f2
    • Wu Fengguang's avatar
      writeback: consolidate variable names in balance_dirty_pages() · 7762741e
      Wu Fengguang authored
      Introduce
      
      	nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS
      
      in order to simplify many tests in the following patches.
      
      balance_dirty_pages() will eventually care only about the dirty sums
      besides nr_writeback.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      7762741e
    • Wu Fengguang's avatar
      writeback: show bdi write bandwidth in debugfs · 00821b00
      Wu Fengguang authored
      Add a "BdiWriteBandwidth" entry and indent others in /debug/bdi/*/stats.
      
      btw, increase digital field width to 10, for keeping the possibly
      huge BdiWritten number aligned at least for desktop systems.
      
      Impact: this could break user space tools if they are dumb enough to
      depend on the number of white spaces.
      
      CC: Theodore Ts'o <tytso@mit.edu>
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      00821b00
    • Wu Fengguang's avatar
      writeback: bdi write bandwidth estimation · e98be2d5
      Wu Fengguang authored
      The estimation value will start from 100MB/s and adapt to the real
      bandwidth in seconds.
      
      It tries to update the bandwidth only when disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth will be reflecting how fast the device can
      writeout when _fully utilized_, and won't drop to 0 when it goes idle.
      The value will remain constant at disk idle time. At busy write time, if
      not considering fluctuations, it will also remain high unless be knocked
      down by possible concurrent reads that compete for the disk time and
      bandwidth with async writes.
      
      The estimation is not done purely in the flusher because there is no
      guarantee for write_cache_pages() to return timely to update bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective for filtering
      out sudden spikes, however may be a little biased in long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable, because it's not possible to get
      the real bandwidth for the instance at all, due to large fluctuations.
      
      The NFS commits can be as large as seconds worth of data. One XFS
      completion may be as large as half second worth of data if we are going
      to increase the write chunk to half second worth of data. In ext4,
      fluctuations with time period of around 5 seconds is observed. And there
      is another pattern of irregular periods of up to 20 seconds on SSD tests.
      
      That's why we are not only doing the estimation at 200ms intervals, but
      also averaging them over a period of 3 seconds and then go further to do
      another level of smoothing in avg_write_bandwidth.
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e98be2d5
    • Jan Kara's avatar
      writeback: account per-bdi accumulated written pages · f7d2b1ec
      Jan Kara authored
      Introduce the BDI_WRITTEN counter. It will be used for estimating the
      bdi's write bandwidth.
      
      Peter Zijlstra <a.p.zijlstra@chello.nl>:
      Move BDI_WRITTEN accounting into __bdi_writeout_inc().
      This will cover and fix fuse, which only calls bdi_writeout_inc().
      
      CC: Michael Rubin <mrubin@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      f7d2b1ec
    • Wu Fengguang's avatar
      writeback: make writeback_control.nr_to_write straight · d46db3d5
      Wu Fengguang authored
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abuse it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
      work->nr_pages starts to make sense, and instead of saving and restoring
      pages_skipped in writeback_sb_inodes it can always start with a clean
      zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      remained quota in wbc->nr_to_write.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Proposed-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  3. 19 Jun, 2011 1 commit
  4. 08 Jun, 2011 15 commits
    • Wu Fengguang's avatar
      writeback: trace event writeback_queue_io · e84d0a4f
      Wu Fengguang authored
      Note that it adds a little overheads to account the moved/enqueued
      inodes from b_dirty to b_io. The "moved" accounting may be later used to
      limit the number of inodes that can be moved in one shot, in order to
      keep spinlock hold time under control.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e84d0a4f
    • Wu Fengguang's avatar
      writeback: trace event writeback_single_inode · 251d6a47
      Wu Fengguang authored
      It is valuable to know how the dirty inodes are iterated and their IO size.
      
      "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
      
      - "state" reflects inode->i_state at the end of writeback_single_inode()
      - "index" reflects mapping->writeback_index after the ->writepages() call
      - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
      - "wrote" is the number of pages actually written
      
      v2: add trace event writeback_single_inode_requeue as proposed by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      251d6a47
    • Wu Fengguang's avatar
      writeback: remove .nonblocking and .encountered_congestion · 846d5a09
      Wu Fengguang authored
      Remove two unused struct writeback_control fields:
      
      	.encountered_congestion	(completely unused)
      	.nonblocking		(never set, checked/showed in XFS,NFS/btrfs)
      
      The .for_background check in nfs_write_inode() is also removed btw,
      as .for_background implies WB_SYNC_NONE.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Proposed-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      846d5a09
    • Wu Fengguang's avatar
      writeback: remove writeback_control.more_io · b7a2441f
      Wu Fengguang authored
      When wbc.more_io was first introduced, it indicates whether there are
      at least one superblock whose s_more_io contains more IO work. Now with
      the per-bdi writeback, it can be replaced with a simple b_more_io test.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      b7a2441f
    • Wu Fengguang's avatar
      writeback: skip balance_dirty_pages() for in-memory fs · 3efaf0fa
      Wu Fengguang authored
      This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
      
      Notes about the tmpfs/ramfs behavior changes:
      
      As for 2.6.36 and older kernels, the tmpfs writes will sleep inside
      balance_dirty_pages() as long as we are over the (dirty+background)/2
      global throttle threshold.  This is because both the dirty pages and
      threshold will be 0 for tmpfs/ramfs. Hence this test will always
      evaluate to TRUE:
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                              || (nr_reclaimable + nr_writeback >= dirty_thresh);
      
      For 2.6.37, someone complained that the current logic does not allow the
      users to set vm.dirty_ratio=0.  So commit 4cbec4c8 changed the test to
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                              || (nr_reclaimable + nr_writeback > dirty_thresh);
      
      So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
      throttled unless the global dirty threshold is exceeded (which is very
      unlikely to happen; once happen, will block many tasks).
      
      I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
      for a busy writing server, tmpfs write()s may get livelocked! The
      "inadvertent" throttling can hardly bring help to any workload because
      of its "either no throttling, or get throttled to death" property.
      
      So based on 2.6.37, this patch won't bring more noticeable changes.
      
      CC: Hugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      3efaf0fa
    • Wu Fengguang's avatar
      writeback: add bdi_dirty_limit() kernel-doc · 6f718656
      Wu Fengguang authored
      Clarify the bdi_dirty_limit() comment.
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      6f718656
    • Wu Fengguang's avatar
      writeback: avoid extra sync work at enqueue time · e185dda8
      Wu Fengguang authored
      This removes writeback_control.wb_start and does more straightforward
      sync livelock prevention by setting .older_than_this to prevent extra
      inodes from being enqueued in the first place.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e185dda8
    • Wu Fengguang's avatar
      writeback: elevate queue_io() into wb_writeback() · e8dfc305
      Wu Fengguang authored
      Code refactor for more logical code layout.
      No behavior change.
      
      - remove the mis-named __writeback_inodes_sb()
      
      - wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
        before calling __writeback_inodes_wb()
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e8dfc305
    • Christoph Hellwig's avatar
      writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig authored
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contentions to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      f758eeab
    • Wu Fengguang's avatar
      writeback: refill b_io iff empty · 424b351f
      Wu Fengguang authored
      There is no point to carry different refill policies between for_kupdate
      and other type of works. Use a consistent "refill b_io iff empty" policy
      which can guarantee fairness in an easy to understand way.
      
      A b_io refill will setup a _fixed_ work set with all currently eligible
      inodes and start a new round of walk through b_io. The "fixed" work set
      means no new inodes will be added to the work set during the walk.
      Only when a complete walk over b_io is done, new inodes that are
      eligible at the time will be enqueued and the walk be started over.
      
      This procedure provides fairness among the inodes because it guarantees
      each inode to be synced once and only once at each round. So all inodes
      will be free from starvations.
      
      This change relies on wb_writeback() to keep retrying as long as we made
      some progress on cleaning some pages and/or inodes. Without that ability,
      the old logic on background works relies on aggressively queuing all
      eligible inodes into b_io at every time. But that's not a guarantee.
      
      The below test script completes a slightly faster now:
      
                   2.6.39-rc3	  2.6.39-rc3-dyn-expire+
      ------------------------------------------------
      all elapsed     256.043      252.367
      stddev           24.381       12.530
      
      tar elapsed      30.097       28.808
      dd  elapsed      13.214       11.782
      
      	#!/bin/zsh
      
      	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
      
      	umount /dev/sda7
      	mkfs.xfs -f /dev/sda7
      	mount /dev/sda7 /fs
      
      	echo 3 > /proc/sys/vm/drop_caches
      
      	tic=$(cat /proc/uptime|cut -d' ' -f2)
      
      	cd /fs
      	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
      	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
      
      	wait
      	sync
      	tac=$(cat /proc/uptime|cut -d' ' -f2)
      	echo elapsed: $((tac - tic))
      
      It maintains roughly the same small vs. large file writeout shares, and
      offers large files better chances to be written in nice 4M chunks.
      
      Analyzes from Dave Chinner in great details:
      
      Let's say we have lots of inodes with 100 dirty pages being created,
      and one large writeback going on. We expire 8 new inodes for every
      1024 pages we write back.
      
      With the old code, we do:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      
      	writeback  8 small inodes 800 pages
      		   1 large inode 224 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      	.....
      
      Your new code:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 8s)
      	writeback  8 small inodes 800 pages
      
      	b_io empty: (1800 pages written)
      		b_more_io (large inode) -> b_io (1l)
      		14 newly expired inodes -> b_io (1l, 14s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 14s)
      	writeback  10 small inodes 1000 pages
      		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
      	writeback  5 small inodes 500 pages
      	b_io empty: (2548 pages written)
      		b_more_io (large inode) -> b_io (1l, 1s(24))
      		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
      	......
      
      Rough progression of pages written at b_io refill:
      
      Old code:
      
      	total	large file	% of writeback
      	1024	224		21.9% (fixed)
      
      New code:
      	total	large file	% of writeback
      	1800	1024		~55%
      	2550	1024		~40%
      	3050	1024		~33%
      	3500	1024		~29%
      	3950	1024		~26%
      	4250	1024		~24%
      	4500	1024		~22.7%
      	4700	1024		~21.7%
      	4800	1024		~21.3%
      	4800	1024		~21.3%
      	(pretty much steady state from here)
      
      Ok, so the steady state is reached with a similar percentage of
      writeback to the large file as the existing code. Ok, that's good,
      but providing some evidence that is doesn't change the shared of
      writeback to the large should be in the commit message ;)
      
      The other advantage to this is that we always write 1024 page chunks
      to the large file, rather than smaller "whatever remains" chunks.
      
      CC: Jan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      424b351f
    • Wu Fengguang's avatar
      writeback: the kupdate expire timestamp should be a moving target · ba9aa839
      Wu Fengguang authored
      Dynamically compute the dirty expire timestamp at queue_io() time.
      
      writeback_control.older_than_this used to be determined at entrance to
      the kupdate writeback work. This _static_ timestamp may go stale if the
      kupdate work runs on and on. The flusher may then stuck with some old
      busy inodes, never considering newly expired inodes thereafter.
      
      This has two possible problems:
      
      - It is unfair for a large dirty inode to delay (for a long time) the
        writeback of small dirty inodes.
      
      - As time goes by, the large and busy dirty inode may contain only
        _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
        delaying the expired dirty pages to the end of LRU lists, triggering
        the evil pageout(). Nevertheless this patch merely addresses part
        of the problem.
      
      v2: keep policy changes inside wb_writeback() and keep the
      wbc.older_than_this visibility as suggested by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarItaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      ba9aa839
    • Wu Fengguang's avatar
      writeback: try more writeback as long as something was written · e6fb6da2
      Wu Fengguang authored
      writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
      they only populate possibly a subset of eligible inodes into b_io at
      entrance time. When the queued set of inodes are all synced, they just
      return, possibly with all queued inode pages written but still
      wbc.nr_to_write > 0.
      
      For kupdate and background writeback, there may be more eligible inodes
      sitting in b_dirty when the current set of b_io inodes are completed. So
      it is necessary to try another round of writeback as long as we made some
      progress in this round. When there are no more eligible inodes, no more
      inodes will be enqueued in queue_io(), hence nothing could/will be
      synced and we may safely bail.
      
      For example, imagine 100 inodes
      
              i0, i1, i2, ..., i90, i91, i99
      
      At queue_io() time, i90-i99 happen to be expired and moved to s_io for
      IO. When finished successfully, if their total size is less than
      MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
      quit the background work (w/o this patch) while it's still over
      background threshold. This will be a fairly normal/frequent case I guess.
      
      Now that we do tagged sync and update inode->dirtied_when after the sync,
      this change won't livelock sync(1).  I actually tried to write 1 page
      per 1ms with this command
      
      	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
      
      and do sync(1) at the same time. The sync completes quickly on ext4,
      xfs, btrfs.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e6fb6da2
    • Wu Fengguang's avatar
      writeback: introduce writeback_control.inodes_written · cb9bd115
      Wu Fengguang authored
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the batch of inodes happen to be metadata-only dirtied: in this case
      wbc->nr_to_write won't be decreased at all, which stands for "no pages
      written" but also mis-interpreted as "no progress".
      
      So introduce writeback_control.inodes_written to count the inodes get
      cleaned from VFS POV.  A non-zero value means there are some progress on
      writeback, in which case more writeback can be tried.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      cb9bd115
    • Wu Fengguang's avatar
      writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb
      Wu Fengguang authored
      Explicitly update .dirtied_when on synced inodes, so that they are no
      longer considered for writeback in the next round.
      
      It can prevent both of the following livelock schemes:
      
      - while true; do echo data >> f; done
      - while true; do touch f;        done (in theory)
      
      The exact livelock condition is, during sync(1):
      
      (1) no new inodes are dirtied
      (2) an inode being actively dirtied
      
      On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
      When finished, it will be redirty_tail()ed because it's still dirty
      and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
      on condition (1). The sync work will then revisit it on the next
      queue_io() and find it eligible again because its old ->dirtied_when
      predates the sync work start time.
      
      We'll do more aggressive "keep writeback as long as we wrote something"
      logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
      b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
      no longer be enough to stop sync livelock.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      94c3dcbb
    • Wu Fengguang's avatar
      writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Wu Fengguang authored
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      it may due to some "redirty_tail() after pages_skipped" effect which
      is by no means a guarantee for _all_ the file systems.
      
      Note that writeback_inodes_sb() is called by not only sync(), they are
      treated the same because the other callers also need livelock prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      6e6938b6
  5. 06 Jun, 2011 5 commits
  6. 04 Jun, 2011 9 commits