    btrfs: add a shrinker for extent maps · 956a17d9
    Filipe Manana authored
    Extent maps are used either to represent existing file extent items, or to
    represent new extents that are going to be written and the respective file
    extent items are created when the ordered extent completes.
    
    We currently don't have any limit on how many extent maps we can have,
    neither per inode nor globally. Most of the time this is not too noticeable
    because extent maps are removed in the following situations:
    
    1) When evicting an inode;
    
    2) When releasing folios (pages) through the btrfs_release_folio() address
       space operation callback.
    
       However we won't release extent maps in the folio range if the folio is
       either dirty or under writeback or if the inode's i_size is less than
       or equal to 16M (see try_release_extent_mapping()), as the sketch after
       this list illustrates. This 16M i_size constraint was added back in 2008
       with commit 70dec807 ("Btrfs: extent_io and extent_state optimizations"),
       but there's no explanation about why we have it or why the 16M value.
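
    As a rough illustration only (not the exact kernel code), the gate
    described in item 2 above has roughly this shape:

       /*
        * Simplified sketch of the condition described above. The real code
        * in btrfs_release_folio() / try_release_extent_mapping() is spread
        * across more checks; this only illustrates the dirty/writeback and
        * 16M i_size gate.
        */
       if (folio_test_dirty(folio) || folio_test_writeback(folio) ||
           i_size_read(folio->mapping->host) <= SZ_16M)
               return false;   /* keep the extent maps in this folio's range */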
    
    This means that for buffered IO we can reach an OOM situation due to too
    many extent maps if either of the following happens:
    
    1) There's a set of tasks constantly doing IO on many files with a size
       not larger than 16M, especially if they keep the files open for very
       long periods, therefore preventing inode eviction.
    
       This requires a really high number of such files, and having many non
       mergeable extent maps (due to random 4K writes for example) and a
       machine with very little memory;
    
    2) There's a set of tasks constantly doing random write IO (therefore
       creating many non mergeable extent maps) on files and keeping them
       open for long periods of time, so inode eviction doesn't happen and
       there's always a lot of dirty pages or pages under writeback,
       preventing btrfs_release_folio() from releasing the respective extent
       maps.
    
    This second case was actually reported in the thread pointed to by the
    Link tag below, and it requires a very large file under heavy IO and a
    machine with a very small amount of RAM, which is unlikely to happen in
    practice in a real world use case.
    
    However when using direct IO this is much easier to trigger, because the
    page cache is not used and therefore btrfs_release_folio() is never
    called. This means extent maps are dropped only when evicting the inode,
    so if we have tasks that keep a file descriptor open and keep doing IO on
    a very large file (or files), we can exhaust memory due to an unbounded
    number of extent maps. This is especially easy to hit if we have a huge
    file with millions of small extents whose extent maps are not mergeable
    (non contiguous offsets and disk locations).
    This was reported in that thread with the following fio test:
    
       $ cat test.sh
       #!/bin/bash
    
       DEV=/dev/sdj
       MNT=/mnt/sdj
       MOUNT_OPTIONS="-o ssd"
       MKFS_OPTIONS=""
    
       cat <<EOF > /tmp/fio-job.ini
       [global]
       name=fio-rand-write
       filename=$MNT/fio-rand-write
       rw=randwrite
       bs=4K
       direct=1
       numjobs=16
       fallocate=none
       time_based
       runtime=90000
    
       [file1]
       size=300G
       ioengine=libaio
       iodepth=16
    
       EOF
    
       umount $MNT &> /dev/null
       mkfs.btrfs -f $MKFS_OPTIONS $DEV
       mount $MOUNT_OPTIONS $DEV $MNT
    
       fio /tmp/fio-job.ini
       umount $MNT
    
    Monitoring the btrfs_extent_map slab while running the test with:
    
       $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                            /sys/kernel/slab/btrfs_extent_map/total_objects'
    
    shows the number of active and total extent maps skyrocketing to tens of
    millions, and on systems with a small amount of memory it's easy and quick
    to get into an OOM situation, as reported in that thread.
    
    So to avoid this issue, add a shrinker that will remove extent maps, as
    long as they are not pinned, and that takes proper care with any
    concurrent fsync to avoid missing extents (by setting the full sync flag
    on the inode if extent maps are removed while a fast fsync is in
    progress). This shrinker is triggered through the callbacks
    nr_cached_objects and free_cached_objects of struct super_operations.
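
    For illustration, a minimal sketch of how those callbacks can be wired up
    in the btrfs super_operations. The callback signatures are the real ones
    from struct super_operations (include/linux/fs.h); the helper
    btrfs_free_extent_maps() and the counter of evictable extent maps are
    assumptions used only for this sketch:

       /*
        * Minimal sketch: callback signatures match struct super_operations.
        * The helper and counter names below are illustrative assumptions,
        * not necessarily the exact ones added by this patch.
        */
       static long btrfs_nr_cached_objects(struct super_block *sb,
                                           struct shrink_control *sc)
       {
               struct btrfs_fs_info *fs_info = btrfs_sb(sb);

               /* Report how many extent maps could currently be evicted. */
               return number_of_evictable_extent_maps(fs_info); /* illustrative */
       }

       static long btrfs_free_cached_objects(struct super_block *sb,
                                             struct shrink_control *sc)
       {
               /* Try to drop up to sc->nr_to_scan unpinned extent maps. */
               return btrfs_free_extent_maps(btrfs_sb(sb), sc->nr_to_scan);
       }

       static const struct super_operations btrfs_super_ops = {
               /* ... existing callbacks ... */
               .nr_cached_objects      = btrfs_nr_cached_objects,
               .free_cached_objects    = btrfs_free_cached_objects,
       };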
    
    The shrinker will iterate over all roots and over all inodes of each
    root, and it keeps track of the last scanned root and inode, so that the
    next time it runs, it starts from that root and from the next inode.
    This is similar to what xfs does for its inode reclaim (it implements
    those callbacks and cycles through inodes, starting from where it ended
    last time).
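
    Conceptually, the resume logic looks something like the sketch below.
    The structure, the field names and the helpers find_next_fs_root() and
    scan_root_inodes() are made up for illustration and are not necessarily
    the names used by the patch:

       /*
        * Illustrative sketch of the "remember where we stopped" cursor.
        * All names here are hypothetical, used only to show the idea.
        */
       struct em_shrinker_progress {
               u64 last_root_id;   /* objectid of the root to scan next */
               u64 last_ino;       /* resume scanning inodes after this one */
       };

       static long shrink_extent_maps(struct btrfs_fs_info *fs_info,
                                      long nr_to_scan,
                                      struct em_shrinker_progress *p)
       {
               long freed = 0;

               while (freed < nr_to_scan) {
                       struct btrfs_root *root;

                       /* Hypothetical helper: find the fs root with the
                        * smallest objectid >= p->last_root_id. */
                       root = find_next_fs_root(fs_info, p->last_root_id);
                       if (!root) {
                               /* Scanned all roots: wrap around for the next run. */
                               p->last_root_id = 0;
                               p->last_ino = 0;
                               break;
                       }
                       p->last_root_id = root->root_key.objectid;

                       /* Hypothetical helper: drop unpinned extent maps from
                        * inodes of this root with ino > p->last_ino, updating
                        * p->last_ino as it goes. */
                       freed += scan_root_inodes(root, p, nr_to_scan - freed);

                       if (freed < nr_to_scan) {
                               /* Finished this root, move on to the next one. */
                               p->last_root_id++;
                               p->last_ino = 0;
                       }
               }
               return freed;
       }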
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>