    btrfs: balance dirty metadata pages in btrfs_finish_ordered_io · e73e81b6
    Ethan Lien authored
    [Problem description and how we fix it]
    We should balance dirty metadata pages at the end of
    btrfs_finish_ordered_io, since a small, unmergeable random write can
    potentially produce dirty metadata which is multiple times larger than
    the data itself. For example, a small, unmergeable 4KiB write may
    produce:
    
        16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
        16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
        16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
    
    Although we do balance dirty pages on the write side, in the buffered
    write path most metadata is dirtied only after we reach the dirty
    background limit (which by far only counts dirty data pages) and wake
    up the flusher thread. If there are many small, unmergeable random
    writes spread across a large btree, we will see a burst of dirty pages
    exceeding the dirty_bytes limit after the flusher thread wakes up,
    which is not what we expect. On our machine it caused an out-of-memory
    problem, since a page cannot be dropped while it is marked dirty.
    
    One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay,
    but since we run btrfs_finish_ordered_io in a separate worker, it does
    not stop the flusher from consuming dirty pages. Also, since we use a
    different worker for metadata writeback endio, sleeping in
    btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
    pages.
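The change described above can be sketched as follows; this is only an illustration of where the call lands at the tail of btrfs_finish_ordered_io() in fs/btrfs/inode.c, with all unrelated code elided as "...", using the btrfs_btree_balance_dirty_nodelay helper named in this commit message:

```c
/* Sketch only, not a complete function. */
static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
{
	struct btrfs_fs_info *fs_info = ...;
	int ret;
	...
out:
	...
	/* Balance dirty metadata pages here, in worker context, so a burst
	 * of small unmergeable writes cannot pile dirty btree pages past
	 * the dirty_bytes limit.  Sleeping here is safe: the flusher and
	 * the metadata writeback endio run in other workers. */
	btrfs_btree_balance_dirty_nodelay(fs_info);
	return ret;
}
```

The nodelay variant is used so the call throttles only when dirty metadata actually needs writeback, rather than pausing every ordered extent completion.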
    
    [Reproduce steps]
    To reproduce the problem, we need to do 4KiB writes randomly spread
    across a large btree. On our 2GiB RAM machine:
    
    1) Create 4 subvolumes.
    2) Run fio on each subvolume:
    
       [global]
       direct=0
       rw=randwrite
       ioengine=libaio
       bs=4k
       iodepth=16
       numjobs=1
       group_reporting
       size=128G
       runtime=1800
       norandommap
       time_based
       randrepeat=0
    
    3) Take snapshot on each subvolume and repeat fio on existing files.
    4) Repeat step (3) until we get large btrees.
       In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
       metadata in each subvolume tree and 12GiB of metadata in extent tree.
    5) Stop all fio, take snapshot again, and wait until all delayed work is
       completed.
    6) Start all fio again. A few seconds later we hit OOM when the flusher
       starts to work.
    
    It can be reproduced even with nocow writes.
    Signed-off-by: Ethan Lien <ethanlien@synology.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ add comment ]
    Signed-off-by: David Sterba <dsterba@suse.com>