mm/swapfile.c · 88c4650a9ece8fef2be042fbbec2dde2d0afa1a4 · nexedi / linux

[PATCH] direct-to-BIO I/O for swapcache pages · 88c4650a
Andrew Morton authored Jun 17, 2002
This patch changes the swap I/O handling.  The objectives are:

- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.

I've spent quite some time with swap.  The first patches converted swap to
use block_read/write_full_page().  These were discarded because they are
still using buffer_heads, and a reasonable amount of otherwise unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs.  So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.

A significant thing here is the introduction of "swap extents".  A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors.  It is simply:

	struct swap_extent {
		struct list_head list;
		pgoff_t start_page;
		pgoff_t nr_pages;
		sector_t start_block;
	};

At swapon time (for an S_ISREG swapfile), each block in the file is bmap()ed
and the block numbers are parsed to generate the device's swap extent list.
This extent list is quite compact - a 512 megabyte swapfile generates about
130 nodes in the list.  That's about 4 kbytes of storage.  The conversion
from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
time.

At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.
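
As a rough illustration of how such a list stays compact, here is a userspace
sketch (not the code in this patch) of appending page runs to an extent list
and merging each run into the previous extent when it is contiguous on disk.
The flat array, MAX_EXTENTS, struct extent_list and the add_swap_extent()
signature are assumptions made for the example, as is the reading that
start_block is counted in PAGE_SIZE units.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long pgoff_t;
typedef unsigned long long sector_t;

/* Userspace stand-in for struct swap_extent; a flat array replaces the
 * kernel's list_head linkage (assumption for this sketch). */
struct swap_extent {
	pgoff_t start_page;
	pgoff_t nr_pages;
	sector_t start_block;	/* assumed to be in PAGE_SIZE units */
};

#define MAX_EXTENTS 64		/* hypothetical limit */

struct extent_list {
	struct swap_extent ext[MAX_EXTENTS];
	size_t nr;
};

/*
 * Record that nr_pages swap pages starting at start_page live at
 * start_block.  Runs must be added in increasing page order.  A run that
 * directly follows the last extent on disk extends that extent instead of
 * creating a new node.  Returns 0 on success, -1 if the array is full.
 */
static int add_swap_extent(struct extent_list *el, pgoff_t start_page,
			   pgoff_t nr_pages, sector_t start_block)
{
	if (el->nr > 0) {
		struct swap_extent *last = &el->ext[el->nr - 1];

		assert(last->start_page + last->nr_pages == start_page);
		if (last->start_block + last->nr_pages == start_block) {
			last->nr_pages += nr_pages;
			return 0;
		}
	}
	if (el->nr == MAX_EXTENTS)
		return -1;
	el->ext[el->nr].start_page = start_page;
	el->ext[el->nr].nr_pages = nr_pages;
	el->ext[el->nr].start_block = start_block;
	el->nr++;
	return 0;
}
```

With this helper the S_ISBLK case is a single call covering the whole device,
while the S_ISREG case calls it once per usable page and lets the merging
collapse contiguous pages into one node.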

The advantages of the swap extents are:

1: We never have to run bmap() (ie: read from disk) at swapout time.  So
   S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.

2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
   handled at swapon time.  During normal operation, we just don't care.
   Both types of swapfiles are handled the same way.

3: The extent lists always operate in PAGE_SIZE units.  So the problems of
   going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
   operating code doesn't need to care.

4: Because we don't have to fiddle with different blocksizes, we can go
   direct-to-BIO for swap_readpage() and swap_writepage().  This introduces
   the kernel-wide invariant "anonymous pages never have buffers attached",
   which cleans some things up nicely.  All those block_flushpage() calls in
   the swap code simply go away.

5: The kernel no longer has to allocate both buffer_heads and BIOs to
   perform swapout.  Just a BIO.

6: It permits us to perform swapcache writeout and throttling for
   GFP_NOFS allocations (a later patch).

(Well, there is one sort of anon page which can have buffers: the pages which
are cast adrift in truncate_complete_page() because do_invalidatepage()
failed.  But these pages are never added to swapcache, and nobody except the
VM LRU has to deal with them).

The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile.  Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.

If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
the swapon attempt will fail.
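
The parsing rules above can be sketched in userspace roughly as follows.  This
is an illustration under stated assumptions, not setup_swap_extents() itself:
block_of[] is a hypothetical array standing in for the results of bmap() on
each file block (0 meaning a hole), and usable_pages() is a made-up name.

```c
#include <stddef.h>

/*
 * Walk the file's block map and pick out every run of blocks_per_page
 * blocks that is contiguous on disk and starts on a PAGE_SIZE-aligned
 * disk block.  The page-sized block number of each usable chunk is stored
 * in page_start[].  Blocks that fail either test are skipped one at a
 * time, so they are never swapped to.
 *
 * Returns the number of usable pages, or -1 if a hole is encountered.
 */
static long usable_pages(const unsigned long *block_of, size_t nr_blocks,
			 unsigned blocks_per_page, unsigned long *page_start)
{
	size_t i = 0;
	long nr_pages = 0;

	while (i + blocks_per_page <= nr_blocks) {
		unsigned long first = block_of[i];
		unsigned j;

		if (first == 0)
			return -1;	/* file hole: refuse the swapfile */
		if (first % blocks_per_page != 0) {
			i++;		/* not PAGE_SIZE-aligned: stray block */
			continue;
		}
		for (j = 1; j < blocks_per_page; j++) {
			unsigned long block = block_of[i + j];

			if (block == 0)
				return -1;	/* file hole */
			if (block != first + j)
				break;		/* discontiguity */
		}
		if (j < blocks_per_page) {
			i++;		/* discard the stray block, reprobe */
			continue;
		}
		page_start[nr_pages++] = first / blocks_per_page;
		i += blocks_per_page;
	}
	return nr_pages;
}
```

For example, with 1k filesystem blocks and 4k pages (blocks_per_page = 4), a
file mapped to disk blocks 8-11, 13, 16-19 yields two usable pages and the
lone block 13 is dropped.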

The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
swapfile).  It needs to be consulted once for each page within
swap_readpage() and swap_writepage().  Hence there is a risk that we could
blow significant amounts of CPU walking that list.  However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search.  Empirical testing indicates that this is
wildly effective - the average length of the list walk in map_swap_page() is
0.3 iterations per page, with a 130-element list.

It _could_ be that some workloads do start suffering long walks in that code,
and perhaps a tree would be needed there.  But I doubt that, and if this is
happening then it means that we're seeking all over the disk for swap I/O,
and the list walk is the least of our problems.
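
To make the "where we found the last block" idea concrete, here is a small
userspace sketch of such a lookup.  It is an assumption-laden illustration
rather than the map_swap_page() in this patch: the extents sit in an array
instead of a list, struct swap_dev and its curr and walks fields are invented
for the example, and an uncovered offset returns a sentinel.

```c
#include <stddef.h>

struct swap_extent {
	unsigned long start_page;
	unsigned long nr_pages;
	unsigned long long start_block;	/* assumed PAGE_SIZE units */
};

struct swap_dev {
	const struct swap_extent *ext;	/* extents in page order */
	size_t nr_ext;
	size_t curr;		/* extent that satisfied the last lookup */
	unsigned long walks;	/* extents stepped past, for measurement */
};

#define NO_BLOCK ((unsigned long long)-1)

/*
 * Translate a swap page offset into a page-sized disk block.  The search
 * starts at the extent which matched last time and wraps around, so
 * sequential or clustered swap I/O usually hits on the first probe.
 */
static unsigned long long map_swap_page(struct swap_dev *sd,
					unsigned long offset)
{
	size_t i = sd->curr;
	size_t n;

	for (n = 0; n < sd->nr_ext; n++) {
		const struct swap_extent *se = &sd->ext[i];

		if (offset >= se->start_page &&
		    offset < se->start_page + se->nr_pages) {
			sd->curr = i;	/* remember for the next search */
			return se->start_block + (offset - se->start_page);
		}
		sd->walks++;
		i = (i + 1) % sd->nr_ext;
	}
	return NO_BLOCK;	/* offset not covered by any extent */
}
```

Dividing walks by the number of lookups gives the kind of
iterations-per-page figure quoted above.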

rw_swap_page_nolock() now takes a page*, not a kernel virtual address.  It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself, which is a much better interface.

Support for type 0 swap has been removed.  Current versions of mkswap(8) seem
to never produce v0 swap unless you explicitly ask for it, so I doubt if this
will affect anyone.  If you _do_ have a type 0 swapfile, swapon will fail and
the message

	version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3

is printed.  We can remove that code for real later on.  Really, all that
swapfile header parsing should be pushed out to userspace.
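
As a hint of how little a userspace check would need, here is a minimal sketch
that tells the two formats apart.  It relies on the signature occupying the
last 10 bytes of the first page ("SWAP-SPACE" for version 0, "SWAPSPACE2" for
version 1); the enum and the swap_header_version() name are invented for the
example and nothing here parses the rest of the header.

```c
#include <stddef.h>
#include <string.h>

enum swap_version { SWAP_BAD = -1, SWAP_V0 = 0, SWAP_V1 = 1 };

/*
 * Classify a swap header from the first page of the swap area.
 * page points at page_size bytes read from offset 0.
 */
static enum swap_version swap_header_version(const unsigned char *page,
					     size_t page_size)
{
	const unsigned char *magic;

	if (page_size < 10)
		return SWAP_BAD;
	magic = page + page_size - 10;	/* signature is the final 10 bytes */
	if (memcmp(magic, "SWAP-SPACE", 10) == 0)
		return SWAP_V0;		/* old format: no longer supported */
	if (memcmp(magic, "SWAPSPACE2", 10) == 0)
		return SWAP_V1;
	return SWAP_BAD;
}
```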

This code always uses single-page BIOs for swapin and swapout.  I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs.  It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.

I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.

If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem.  It was always thus, but it
is much, much easier to do now.  Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.