    [PATCH] direct-to-BIO I/O for swapcache pages · 88c4650a
    Andrew Morton authored
    This patch changes the swap I/O handling.  The objectives are:
    
    - Remove swap special-casing
    - Stop using buffer_heads -> direct-to-BIO
    - Make S_ISREG swapfiles more robust.
    
    I've spent quite some time with swap.  The first patches converted swap to
    use block_read/write_full_page().  These were discarded because they are
    still using buffer_heads, and a reasonable amount of otherwise unnecessary
    infrastructure had to be added to the swap code just to make it look like a
    regular fs.  So this code just has a custom direct-to-BIO path for swap,
    which seems to be the most comfortable approach.
    
    A significant thing here is the introduction of "swap extents".  A swap
    extent is a simple data structure which maps a range of swap pages onto a
    range of disk sectors.  It is simply:
    
    	struct swap_extent {
    		struct list_head list;
    		pgoff_t start_page;	/* first swap page covered */
    		pgoff_t nr_pages;	/* number of contiguous pages */
    		sector_t start_block;	/* first disk block (PAGE_SIZE units) */
    	};
    
    At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
    and the block numbers are parsed to generate the device's swap extent list.
    This extent list is quite compact - a 512 megabyte swapfile generates about
    130 nodes in the list.  That's about 4 kbytes of storage.  The conversion
    from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
    time.
    
    At swapon time (for an S_ISBLK swapfile), we install a single swap extent
    which describes the entire device.
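
    In sketch form, the list builder is tiny (sketch only, not the patch
    verbatim: the name add_swap_extent() and the sis->extent_list field are
    assumptions here).  If a new run of pages continues the tail extent both
    in swap-page space and on disk, the tail extent is simply grown;
    otherwise a new node is appended.  The S_ISBLK case is then just one
    call covering the whole device:

    	/*
    	 * Sketch: grow the tail extent when the new run is contiguous
    	 * with it, else append a fresh node to the list.
    	 */
    	static int add_swap_extent(struct swap_info_struct *sis,
    			unsigned long start_page, unsigned long nr_pages,
    			sector_t start_block)
    	{
    		struct swap_extent *se, *new_se;

    		if (!list_empty(&sis->extent_list)) {
    			se = list_entry(sis->extent_list.prev,
    					struct swap_extent, list);
    			if (se->start_page + se->nr_pages == start_page &&
    			    se->start_block + se->nr_pages == start_block) {
    				se->nr_pages += nr_pages;	/* merge */
    				return 0;
    			}
    		}

    		new_se = kmalloc(sizeof(*new_se), GFP_KERNEL);
    		if (new_se == NULL)
    			return -ENOMEM;
    		new_se->start_page = start_page;
    		new_se->nr_pages = nr_pages;
    		new_se->start_block = start_block;
    		list_add_tail(&new_se->list, &sis->extent_list);
    		return 0;
    	}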
    
    The advantages of the swap extents are:
    
    1: We never have to run bmap() (i.e., read from disk) at swapout time.  So
       S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
    
    2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
       handled at swapon time.  During normal operation, we just don't care.
       Both types of swapfiles are handled the same way.
    
    3: The extent lists always operate in PAGE_SIZE units.  So the problems of
       going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
       operating code doesn't need to care.
    
    4: Because we don't have to fiddle with different blocksizes, we can go
       direct-to-BIO for swap_readpage() and swap_writepage() (sketched below,
       after this list).  This introduces the kernel-wide invariant "anonymous
       pages never have buffers attached", which cleans some things up nicely.
       All those block_flushpage() calls in the swap code simply go away.
    
    5: The kernel no longer has to allocate both buffer_heads and BIOs to
       perform swapout.  Just a BIO.
    
    6: It permits us to perform swapcache writeout and throttling for
       GFP_NOFS allocations (a later patch).
    
    (Well, there is one sort of anon page which can have buffers: the pages which
    are cast adrift in truncate_complete_page() because do_invalidatepage()
    failed.  But these pages are never added to swapcache, and nobody except the
    VM LRU has to deal with them).
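
    To make point 4 concrete, here is a minimal sketch of the writeout path,
    assuming 2.5-era BIO fields; get_swap_bio() is an illustrative helper,
    and get_swap_info_struct(), sis->bdev and the map_swap_page() signature
    are assumptions.  One page, one BIO, no buffer_heads anywhere:

    	static struct bio *get_swap_bio(int gfp_flags, struct page *page,
    			bio_end_io_t end_io)
    	{
    		struct bio *bio = bio_alloc(gfp_flags, 1);

    		if (bio) {
    			swp_entry_t entry;
    			struct swap_info_struct *sis;

    			/* swapcache pages are indexed by swp_entry_t */
    			entry.val = page->index;
    			sis = get_swap_info_struct(swp_type(entry));
    			/* map_swap_page() speaks PAGE_SIZE-sized blocks;
    			   BIOs speak 512-byte sectors */
    			bio->bi_sector = map_swap_page(sis, swp_offset(entry))
    						* (PAGE_SIZE >> 9);
    			bio->bi_bdev = sis->bdev;
    			bio->bi_io_vec[0].bv_page = page;
    			bio->bi_io_vec[0].bv_len = PAGE_SIZE;
    			bio->bi_io_vec[0].bv_offset = 0;
    			bio->bi_vcnt = 1;
    			bio->bi_size = PAGE_SIZE;
    			bio->bi_end_io = end_io;
    		}
    		return bio;
    	}

    	int swap_writepage(struct page *page)
    	{
    		struct bio *bio = get_swap_bio(GFP_NOIO, page,
    						end_swap_bio_write);

    		if (bio == NULL) {
    			set_page_dirty(page);	/* try again later */
    			unlock_page(page);
    			return -ENOMEM;
    		}
    		SetPageWriteback(page);
    		unlock_page(page);
    		submit_bio(WRITE, bio);
    		return 0;
    	}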
    
    The swapfile parser in setup_swap_extents() will attempt to extract the
    largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
    disk from the S_ISREG swapfile.  Any stray blocks (due to file
    discontiguities) are simply discarded - we never swap to those.
    
    If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
    the swapon attempt will fail.
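
    Roughly, the parser loop looks like this (sketch only; names and the
    return convention are illustrative):

    	/*
    	 * Walk the file in fs-blocksize units.  An unmapped block (a
    	 * hole) fails the swapon; a page-sized run which is
    	 * discontiguous or misaligned on disk is simply skipped.
    	 */
    	static int setup_swap_extents(struct swap_info_struct *sis,
    			struct inode *inode, unsigned long last_block)
    	{
    		unsigned blkbits = inode->i_blkbits;
    		unsigned long blocks_per_page = PAGE_SIZE >> blkbits;
    		unsigned long probe_block = 0;
    		unsigned long page_no = 0;

    		while (probe_block + blocks_per_page <= last_block) {
    			unsigned long block_in_page;
    			sector_t first_block;

    			first_block = bmap(inode, probe_block);
    			if (first_block == 0)
    				return -EINVAL;		/* hole: fail */

    			/* each block backing this page must directly
    			   follow the previous one on disk */
    			for (block_in_page = 1;
    			     block_in_page < blocks_per_page;
    			     block_in_page++) {
    				sector_t block = bmap(inode,
    						probe_block + block_in_page);
    				if (block == 0)
    					return -EINVAL;	/* hole: fail */
    				if (block != first_block + block_in_page)
    					goto reprobe;	/* stray blocks */
    			}

    			/* the run must start PAGE_SIZE-aligned on disk */
    			if (first_block & (blocks_per_page - 1))
    				goto reprobe;

    			add_swap_extent(sis, page_no++, 1,
    					first_block >> (PAGE_SHIFT - blkbits));
    			probe_block += blocks_per_page;
    			continue;
    reprobe:
    			probe_block++;
    		}
    		return page_no;
    	}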
    
    The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
    swapfile).  It needs to be consulted once for each page within
    swap_readpage() and swap_writepage().  Hence there is a risk that we could
    blow significant amounts of CPU walking that list.  However I have
    implemented a "where we found the last block" cache, which is used as the
    starting point for the next search.  Empirical testing indicates that this is
    wildly effective - the average length of the list walk in map_swap_page() is
    0.3 iterations per page, with a 130-element list.
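
    In sketch form, the lookup with its cache is just (the curr_swap_extent
    field name is an assumption here):

    	sector_t map_swap_page(struct swap_info_struct *sis, pgoff_t offset)
    	{
    		struct swap_extent *start_se = sis->curr_swap_extent;
    		struct swap_extent *se = start_se;

    		for ( ; ; ) {
    			struct list_head *lh;

    			if (se->start_page <= offset &&
    			    offset < se->start_page + se->nr_pages) {
    				sis->curr_swap_extent = se;	/* cache the hit */
    				/* PAGE_SIZE-sized block on disk */
    				return se->start_block +
    						(offset - se->start_page);
    			}
    			lh = se->list.next;
    			if (lh == &sis->extent_list)	/* skip the list head */
    				lh = lh->next;
    			se = list_entry(lh, struct swap_extent, list);
    			BUG_ON(se == start_se);		/* must be mapped */
    		}
    	}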
    
    It _could_ be that some workloads do start suffering long walks in that code,
    and perhaps a tree would be needed there.  But I doubt that, and if this is
    happening then it means that we're seeking all over the disk for swap I/O,
    and the list walk is the least of our problems.
    
    rw_swap_page_nolock() now takes a page*, not a kernel virtual address.  It
    has been renamed to rw_swap_page_sync() and it takes care of locking and
    unlocking the page itself, which is altogether a better interface.
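
    A sketch of the resulting interface (assuming 2.5-era page flags and the
    swapper_space trick; the exact body may differ):

    	int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
    	{
    		int ret;

    		lock_page(page);
    		/* dress the page up as a swapcache page for the I/O paths */
    		page->mapping = &swapper_space;
    		page->index = entry.val;

    		if (rw == READ) {
    			ret = swap_readpage(NULL, page);
    			wait_on_page_locked(page);	/* completion unlocks */
    		} else {
    			ret = swap_writepage(page);	/* unlocks the page */
    			wait_on_page_writeback(page);
    		}
    		page->mapping = NULL;
    		if (ret == 0 && (!PageUptodate(page) || PageError(page)))
    			ret = -EIO;
    		return ret;
    	}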
    
    Support for type 0 swap has been removed.  Current versions of mkswap(8) seem
    to never produce v0 swap unless you explicitly ask for it, so I doubt if this
    will affect anyone.  If you _do_ have a type 0 swapfile, swapon will fail and
    the message
    
    	version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3
    
    is printed.  We can remove that code for real later on.  Really, all that
    swapfile header parsing should be pushed out to userspace.
    
    This code always uses single-page BIOs for swapin and swapout.  I have an
    additional patch which converts swap to use mpage_writepages(), so we swap
    out in 16-page BIOs.  It works fine, but I don't intend to submit that.
    There just doesn't seem to be any significant advantage to it.
    
    I can't see anything in sys_swapon()/sys_swapoff() which needs the
    lock_kernel() calls, so I deleted them.
    
    If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
    subsequent swapout will destroy the filesystem.  It was always thus, but it
    is much, much easier to do now.  Not really a kernel problem, but swapon(8)
    should not be allowing the kernel to use swapfiles which are modifiable by
    unprivileged users.