1. 08 Jun, 2011 14 commits
    • writeback: trace event writeback_single_inode · 251d6a47
      Wu Fengguang authored
      It is valuable to know how the dirty inodes are iterated and their IO size.
      
      "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
      
      - "state" reflects inode->i_state at the end of writeback_single_inode()
      - "index" reflects mapping->writeback_index after the ->writepages() call
      - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
      - "wrote" is the number of pages actually written
      
      v2: add trace event writeback_single_inode_requeue as proposed by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: remove .nonblocking and .encountered_congestion · 846d5a09
      Wu Fengguang authored
      Remove two unused struct writeback_control fields:
      
      	.encountered_congestion	(completely unused)
      	.nonblocking		(never set, checked/shown in XFS, NFS, btrfs)
      
      The .for_background check in nfs_write_inode() is also removed btw,
      as .for_background implies WB_SYNC_NONE.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Proposed-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: remove writeback_control.more_io · b7a2441f
      Wu Fengguang authored
      When wbc.more_io was first introduced, it indicated whether there was
      at least one superblock whose s_more_io contained more IO work. Now with
      the per-bdi writeback, it can be replaced with a simple b_more_io test.
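
      A minimal userspace sketch of what a "simple b_more_io test" can look
      like. The list helpers and struct toy_wb are stand-ins written for this
      illustration; they only mimic the shape of the kernel's list_head and
      are not the kernel code.

```c
#include <assert.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static int list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_to_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical stand-in for the per-bdi writeback structure. */
struct toy_wb { struct list_head b_more_io; };

/* "More IO pending" is just "b_more_io is not empty"; no extra flag needed. */
static int wb_has_more_io(const struct toy_wb *wb)
{
	assert(wb);
	return !list_is_empty(&wb->b_more_io);
}
```

      Because the state lives in the list itself, there is no separate flag
      that can go stale.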
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: skip balance_dirty_pages() for in-memory fs · 3efaf0fa
      Wu Fengguang authored
      This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
      
      Notes about the tmpfs/ramfs behavior changes:
      
      In 2.6.36 and older kernels, tmpfs writes will sleep inside
      balance_dirty_pages() as long as we are over the (dirty+background)/2
      global throttle threshold.  This is because both the dirty pages and
      threshold will be 0 for tmpfs/ramfs. Hence this test will always
      evaluate to TRUE:
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                              || (nr_reclaimable + nr_writeback >= dirty_thresh);
      
      For 2.6.37, someone complained that the current logic does not allow the
      users to set vm.dirty_ratio=0.  So commit 4cbec4c8 changed the test to
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                              || (nr_reclaimable + nr_writeback > dirty_thresh);
      
      So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
      throttled unless the global dirty threshold is exceeded (which is very
      unlikely to happen; once it happens, it will block many tasks).
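
      The effect of the two comparisons on tmpfs/ramfs can be sketched in
      userspace C. This is an illustration only: struct dirty_counts and the
      two helper names are made up here, and any numbers fed to them are
      arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the counters used in the two tests above. */
struct dirty_counts {
	unsigned long bdi_nr_reclaimable, bdi_nr_writeback, bdi_thresh;
	unsigned long nr_reclaimable, nr_writeback, dirty_thresh;
};

/* 2.6.36 and older: ">=" comparison. */
static bool dirty_exceeded_ge(const struct dirty_counts *c)
{
	assert(c);
	return (c->bdi_nr_reclaimable + c->bdi_nr_writeback >= c->bdi_thresh)
		|| (c->nr_reclaimable + c->nr_writeback >= c->dirty_thresh);
}

/* 2.6.37: ">" comparison. */
static bool dirty_exceeded_gt(const struct dirty_counts *c)
{
	assert(c);
	return (c->bdi_nr_reclaimable + c->bdi_nr_writeback > c->bdi_thresh)
		|| (c->nr_reclaimable + c->nr_writeback > c->dirty_thresh);
}
```

      With the three bdi values all 0, the ">=" form reports the limit as
      exceeded (0 >= 0), while the ">" form does not, as long as the global
      counts stay under dirty_thresh.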
      
      I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
      for a busy writing server, tmpfs write()s may get livelocked! The
      "inadvertent" throttling can hardly bring help to any workload because
      of its "either no throttling, or get throttled to death" property.
      
      So based on 2.6.37, this patch won't bring more noticeable changes.
      
      CC: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: add bdi_dirty_limit() kernel-doc · 6f718656
      Wu Fengguang authored
      Clarify the bdi_dirty_limit() comment.
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: avoid extra sync work at enqueue time · e185dda8
      Wu Fengguang authored
      This removes writeback_control.wb_start and does more straightforward
      sync livelock prevention by setting .older_than_this to prevent extra
      inodes from being enqueued in the first place.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: elevate queue_io() into wb_writeback() · e8dfc305
      Wu Fengguang authored
      Code refactor for more logical code layout.
      No behavior change.
      
      - remove the mis-named __writeback_inodes_sb()
      
      - wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
        before calling __writeback_inodes_wb()
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig authored
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contention to about 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
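
      Taking two per-bdi list locks at once raises the usual ordering
      question, plus the "new the same as old" case mentioned in the note
      above. The following is a generic sketch of address-ordered locking in
      userspace C. It is not the patch's code: struct toy_lock only detects
      double acquisition, and how the real fix handles the same-bdi case is
      not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

/* Toy lock: not a real spinlock, it only detects double acquisition. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* a recursive acquire would deadlock a real spinlock */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical stand-in for a per-bdi writeback structure. */
struct toy_wb { struct toy_lock list_lock; };

/* Take both list locks in address order; take it only once if a == b. */
static void toy_lock_two(struct toy_wb *a, struct toy_wb *b)
{
	if (a == b) {
		toy_lock_acquire(&a->list_lock);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_wb *tmp = a;
		a = b;
		b = tmp;
	}
	toy_lock_acquire(&a->list_lock);
	toy_lock_acquire(&b->list_lock);
}

static void toy_unlock_two(struct toy_wb *a, struct toy_wb *b)
{
	toy_lock_release(&a->list_lock);
	if (a != b)
		toy_lock_release(&b->list_lock);
}
```

      A fixed acquisition order means two CPUs switching inodes in opposite
      directions cannot each hold one lock while waiting for the other.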
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: refill b_io iff empty · 424b351f
      Wu Fengguang authored
      There is no point in carrying different refill policies between for_kupdate
      and other types of work. Use a consistent "refill b_io iff empty" policy
      which can guarantee fairness in an easy to understand way.
      
      A b_io refill will set up a _fixed_ work set with all currently eligible
      inodes and start a new round of walk through b_io. The "fixed" work set
      means no new inodes will be added to the work set during the walk.
      Only when a complete walk over b_io is done will new inodes that are
      eligible at the time be enqueued and the walk started over.
      
      This procedure provides fairness among the inodes because it guarantees
      each inode to be synced once and only once at each round. So all inodes
      will be free from starvation.
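
      The "refill iff empty" policy can be sketched with two small arrays
      standing in for b_dirty and b_io. Everything here (struct toy_wb, the
      helper names, the array-based queues) is made up for illustration; the
      kernel uses linked lists and expiry checks that this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INODES 16

/* Toy stand-in for bdi_writeback: inode ids only, no real lists. */
struct toy_wb {
	int b_dirty[MAX_INODES];	/* eligible, not yet queued */
	size_t nr_dirty;
	int b_io[MAX_INODES];		/* current fixed work set */
	size_t nr_io;
};

static void mark_dirty(struct toy_wb *wb, int ino)
{
	assert(wb->nr_dirty < MAX_INODES);
	wb->b_dirty[wb->nr_dirty++] = ino;
}

/* Refill b_io only when it is empty; returns 1 if a refill happened. */
static int refill_iff_empty(struct toy_wb *wb)
{
	if (wb->nr_io != 0)
		return 0;
	for (size_t i = 0; i < wb->nr_dirty; i++)
		wb->b_io[wb->nr_io++] = wb->b_dirty[i];
	wb->nr_dirty = 0;
	return 1;
}

/* Take the next inode from the work set, or -1 when the round is over. */
static int next_inode(struct toy_wb *wb)
{
	if (wb->nr_io == 0)
		return -1;
	int ino = wb->b_io[0];
	for (size_t i = 1; i < wb->nr_io; i++)
		wb->b_io[i - 1] = wb->b_io[i];
	wb->nr_io--;
	return ino;
}
```

      An inode dirtied in the middle of a round waits in b_dirty until the
      round finishes, so every inode in the work set is visited exactly once
      per round.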
      
      This change relies on wb_writeback() to keep retrying as long as we made
      some progress on cleaning some pages and/or inodes. Without that ability,
      the old logic for background work relied on aggressively queuing all
      eligible inodes into b_io every time. But that's not a guarantee.
      
      The test script below completes slightly faster now:

                       2.6.39-rc3   2.6.39-rc3-dyn-expire+
      ----------------------------------------------------
      all elapsed         256.043                  252.367
      stddev               24.381                   12.530

      tar elapsed          30.097                   28.808
      dd  elapsed          13.214                   11.782
      
      	#!/bin/zsh
      
      	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
      
      	umount /dev/sda7
      	mkfs.xfs -f /dev/sda7
      	mount /dev/sda7 /fs
      
      	echo 3 > /proc/sys/vm/drop_caches
      
      	tic=$(cat /proc/uptime|cut -d' ' -f2)
      
      	cd /fs
      	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
      	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
      
      	wait
      	sync
      	tac=$(cat /proc/uptime|cut -d' ' -f2)
      	echo elapsed: $((tac - tic))
      
      It maintains roughly the same small vs. large file writeout shares, and
      offers large files better chances to be written in nice 4M chunks.
      
      Analysis from Dave Chinner in great detail:
      
      Let's say we have lots of inodes with 100 dirty pages being created,
      and one large writeback going on. We expire 8 new inodes for every
      1024 pages we write back.
      
      With the old code, we do:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      
      	writeback  8 small inodes 800 pages
      		   1 large inode 224 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      	.....
      
      Your new code:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 8s)
      	writeback  8 small inodes 800 pages
      
      	b_io empty: (1800 pages written)
      		b_more_io (large inode) -> b_io (1l)
      		14 newly expired inodes -> b_io (1l, 14s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 14s)
      	writeback  10 small inodes 1000 pages
      		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
      	writeback  5 small inodes 500 pages
      	b_io empty: (2548 pages written)
      		b_more_io (large inode) -> b_io (1l, 1s(24))
      		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
      	......
      
      Rough progression of pages written at b_io refill:
      
      Old code:
      
      	total	large file	% of writeback
      	1024	224		21.9% (fixed)
      
      New code:
      	total	large file	% of writeback
      	1800	1024		~55%
      	2550	1024		~40%
      	3050	1024		~33%
      	3500	1024		~29%
      	3950	1024		~26%
      	4250	1024		~24%
      	4500	1024		~22.7%
      	4700	1024		~21.7%
      	4800	1024		~21.3%
      	4800	1024		~21.3%
      	(pretty much steady state from here)
      
      Ok, so the steady state is reached with a similar percentage of
      writeback to the large file as the existing code. Ok, that's good,
      but providing some evidence that it doesn't change the share of
      writeback to the large file should be in the commit message ;)
      
      The other advantage to this is that we always write 1024 page chunks
      to the large file, rather than smaller "whatever remains" chunks.
      
      CC: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: the kupdate expire timestamp should be a moving target · ba9aa839
      Wu Fengguang authored
      Dynamically compute the dirty expire timestamp at queue_io() time.
      
      writeback_control.older_than_this used to be determined at entrance to
      the kupdate writeback work. This _static_ timestamp may go stale if the
      kupdate work runs on and on. The flusher may then get stuck with some old
      busy inodes, never considering newly expired inodes thereafter.
      
      This has two possible problems:
      
      - It is unfair for a large dirty inode to delay (for a long time) the
        writeback of small dirty inodes.
      
      - As time goes by, the large and busy dirty inode may contain only
        _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
        delaying the expired dirty pages to the end of LRU lists, triggering
        the evil pageout(). Nevertheless this patch merely addresses part
        of the problem.
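
      The difference between a static and a moving expire timestamp can be
      sketched with plain integer ticks. The helper names are invented for
      this illustration, and the sketch ignores the counter wraparound that
      real jiffies comparisons have to handle.

```c
#include <assert.h>

/*
 * Cutoff for "old enough to write back". Called once at work start, this
 * gives the old static behaviour; called at every queue_io(), it gives the
 * moving target. Times are plain ticks, not jiffies.
 */
static unsigned long expire_cutoff(unsigned long now,
				   unsigned long expire_interval)
{
	assert(now >= expire_interval);	/* avoid unsigned underflow in this toy */
	return now - expire_interval;
}

/* An inode is expired if it was dirtied at or before the cutoff. */
static int inode_expired(unsigned long dirtied_when, unsigned long cutoff)
{
	return dirtied_when <= cutoff;
}
```

      For example, with an interval of 3000 ticks and work starting at tick
      10000, an inode dirtied at tick 8000 is never picked up with the static
      cutoff of 7000, but becomes eligible once the cutoff is recomputed at
      tick 12000.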
      
      v2: keep policy changes inside wb_writeback() and keep the
      wbc.older_than_this visibility as suggested by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: try more writeback as long as something was written · e6fb6da2
      Wu Fengguang authored
      writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
      they only populate possibly a subset of eligible inodes into b_io at
      entrance time. When the queued set of inodes are all synced, they just
      return, possibly with all queued inode pages written but still
      wbc.nr_to_write > 0.
      
      For kupdate and background writeback, there may be more eligible inodes
      sitting in b_dirty when the current set of b_io inodes are completed. So
      it is necessary to try another round of writeback as long as we made some
      progress in this round. When there are no more eligible inodes, no more
      inodes will be enqueued in queue_io(), hence nothing could/will be
      synced and we may safely bail.
      
      For example, imagine 100 inodes
      
              i0, i1, i2, ..., i90, i91, ..., i99
      
      At queue_io() time, i90-i99 happen to be expired and moved to s_io for
      IO. When finished successfully, if their total size is less than
      MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
      quit the background work (w/o this patch) while it's still over
      background threshold. This will be a fairly normal/frequent case I guess.
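
      The example above can be modelled as a toy loop. This is a sketch under
      simplifying assumptions: struct toy_state and the function names are
      invented, "still over the background threshold" is reduced to
      dirty_pages > 0, and each queue_io() is assumed to expose at most
      `batch` pages worth of inodes.

```c
#include <assert.h>

#define MAX_WRITEBACK_PAGES 1024

/* Toy model of one flusher: how much is dirty and how much each queue_io() sees. */
struct toy_state {
	long dirty_pages;
	long batch;
	int rounds;
};

/* One round: queue a batch, write it, return the number of pages written. */
static long writeback_round(struct toy_state *s)
{
	assert(s->batch > 0);
	long queued = s->dirty_pages < s->batch ? s->dirty_pages : s->batch;
	long wrote = queued < MAX_WRITEBACK_PAGES ? queued : MAX_WRITEBACK_PAGES;

	s->dirty_pages -= wrote;
	s->rounds++;
	return wrote;
}

/* Old behaviour: quit as soon as a round leaves nr_to_write > 0. */
static void writeback_old(struct toy_state *s)
{
	while (s->dirty_pages > 0) {
		if (writeback_round(s) < MAX_WRITEBACK_PAGES)
			break;
	}
}

/* New behaviour: keep trying as long as the last round wrote something. */
static void writeback_new(struct toy_state *s)
{
	while (s->dirty_pages > 0) {
		if (writeback_round(s) == 0)
			break;
	}
}
```

      With 5000 dirty pages and batches of 100, the old loop stops after one
      round with 4900 pages still dirty, while the new loop runs until
      nothing is left.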
      
      Now that we do tagged sync and update inode->dirtied_when after the sync,
      this change won't livelock sync(1).  I actually tried to write 1 page
      per 1ms with this command
      
      	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
      
      and do sync(1) at the same time. The sync completes quickly on ext4,
      xfs, btrfs.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: introduce writeback_control.inodes_written · cb9bd115
      Wu Fengguang authored
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the batch of inodes happens to be metadata-only dirtied: in this case
      wbc->nr_to_write won't be decreased at all, which stands for "no pages
      written" but is also misinterpreted as "no progress".

      So introduce writeback_control.inodes_written to count the inodes that
      get cleaned from the VFS point of view.  A non-zero value means there is
      some progress on writeback, in which case more writeback can be tried.
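
      A sketch of the two progress tests, assuming a cut-down struct toy_wbc
      with just the two fields discussed here. The helper names and the
      `budget` parameter (the nr_to_write value the round started with) are
      illustrative, not taken from the patch.

```c
#include <assert.h>

/* Hypothetical cut-down writeback_control with only the fields used here. */
struct toy_wbc {
	long nr_to_write;	/* remaining page budget after the round */
	long inodes_written;	/* inodes cleaned during the round */
};

/* Old test: progress only if some of the page budget was consumed. */
static int progress_by_pages(const struct toy_wbc *wbc, long budget)
{
	assert(wbc);
	return wbc->nr_to_write < budget;
}

/* New test: cleaned inodes also count as progress. */
static int made_progress(const struct toy_wbc *wbc, long budget)
{
	assert(wbc);
	return wbc->nr_to_write < budget || wbc->inodes_written > 0;
}
```

      A batch of metadata-only inodes leaves nr_to_write untouched, so only
      the second test sees that work was done.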
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb
      Wu Fengguang authored
      Explicitly update .dirtied_when on synced inodes, so that they are no
      longer considered for writeback in the next round.
      
      It can prevent both of the following livelock schemes:
      
      - while true; do echo data >> f; done
      - while true; do touch f;        done (in theory)
      
      The exact livelock condition is, during sync(1):
      
      (1) no new inodes are dirtied
      (2) an inode being actively dirtied
      
      On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
      When finished, it will be redirty_tail()ed because it's still dirty
      and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
      on condition (1). The sync work will then revisit it on the next
      queue_io() and find it eligible again because its old ->dirtied_when
      predates the sync work start time.
      
      We'll do more aggressive "keep writeback as long as we wrote something"
      logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
      b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
      no longer be enough to stop sync livelock.
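
      The livelock condition and the fix can be sketched as follows. The
      struct and helpers are invented for illustration and use plain integer
      ticks; the real code works on inode->dirtied_when and jiffies.

```c
#include <assert.h>

/* Hypothetical cut-down inode: when it was first dirtied, and whether it is dirty. */
struct toy_inode {
	unsigned long dirtied_when;
	int dirty;
};

/* A sync work only queues inodes dirtied at or before the time it started. */
static int eligible_for_sync(const struct toy_inode *inode,
			     unsigned long sync_start)
{
	assert(inode);
	return inode->dirty && inode->dirtied_when <= sync_start;
}

/*
 * Called after the inode was synced but is dirty again because a writer
 * keeps dirtying it. With update_dirtied_when set, the timestamp moves
 * past the sync start time.
 */
static void after_sync(struct toy_inode *inode, unsigned long now,
		       int update_dirtied_when)
{
	assert(inode);
	if (inode->dirty && update_dirtied_when)
		inode->dirtied_when = now;
}
```

      Without the update, the inode keeps its old timestamp and is found
      eligible on every queue_io(); with it, the same sync work does not
      visit the inode again.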
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Wu Fengguang authored
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      it may be due to some "redirty_tail() after pages_skipped" effect which
      is by no means a guarantee for _all_ the file systems.

      Note that writeback_inodes_sb() is called not only by sync(). All callers
      are treated the same because the other callers also need livelock prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
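
      The idea behind tagged writeback can be sketched on an array of toy
      pages: tag what is dirty at the start, then write only what was tagged.
      The struct and function names are made up for this illustration and do
      not reflect the kernel's page cache tagging interface.

```c
#include <assert.h>

/* Hypothetical page state: dirty bit plus a "to write" tag. */
struct toy_page {
	int dirty;
	int towrite;
};

/* Stage 1: tag every page that is dirty right now. Returns the number tagged. */
static int toy_tag_dirty_pages(struct toy_page *pages, int n)
{
	int tagged = 0;

	assert(pages && n >= 0);
	for (int i = 0; i < n; i++) {
		if (pages[i].dirty) {
			pages[i].towrite = 1;
			tagged++;
		}
	}
	return tagged;
}

/* Stage 2: write only tagged pages; pages dirtied later are left alone. */
static int toy_write_tagged_pages(struct toy_page *pages, int n)
{
	int written = 0;

	assert(pages && n >= 0);
	for (int i = 0; i < n; i++) {
		if (pages[i].towrite) {
			pages[i].towrite = 0;
			pages[i].dirty = 0;
			written++;
		}
	}
	return written;
}
```

      Because the set of tagged pages is fixed before writing starts, a
      process that keeps dirtying new pages cannot extend the amount of work
      for the current pass.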
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  2. 06 Jun, 2011 5 commits
  3. 04 Jun, 2011 17 commits
  4. 03 Jun, 2011 4 commits