locking 6.15 KB
Newer Older
1
Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>
Linus Torvalds's avatar
Linus Torvalds committed
2 3 4 5 6

The intent of this file is to have an uptodate, running commentary 
from different people about how locking and synchronization is done 
in the Linux vm code.

Linus Torvalds's avatar
Linus Torvalds committed
7
page_table_lock & mmap_sem
Linus Torvalds's avatar
Linus Torvalds committed
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
--------------------------------------

Page stealers pick processes out of the process pool and scan for 
the best process to steal pages from. To guarantee the existence 
of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
Page stealers hold kernel_lock to protect against a bunch of races.
The vma list of the victim mm is also scanned by the stealer, 
and the page_table_lock is used to preserve list sanity against the
process adding/deleting to the list. This also guarantees existence
of the vma. Vma existence is not guaranteed once try_to_swap_out() 
drops the page_table_lock. To guarantee the existence of the underlying 
file structure, a get_file is done before the swapout() method is 
invoked. The page passed into swapout() is guaranteed not to be reused
for a different purpose because the page reference count due to being
present in the user's pte is not released till after swapout() returns.

Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent 
Linus Torvalds's avatar
Linus Torvalds committed
26
kswapd from looking at the chain.
Linus Torvalds's avatar
Linus Torvalds committed
27 28

The rules are:
Linus Torvalds's avatar
Linus Torvalds committed
29 30 31 32 33 34 35 36 37 38 39
1. To scan the vmlist (look but don't touch) you must hold the
   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem)
2. To modify the vmlist you need to hold the mmap_sem with
   read&write bias, i.e. down_write(&mm->mmap_sem)  *AND*
   you need to take the page_table_lock.
3. The swapper takes _just_ the page_table_lock, this is done
   because the mmap_sem can be an extremely long lived lock
   and the swapper just cannot sleep on that.
4. The exception to this rule is expand_stack, which just
   takes the read lock and the page_table_lock, this is ok
   because it doesn't really modify fields anybody relies on.
Linus Torvalds's avatar
Linus Torvalds committed
40
5. You must be able to guarantee that while holding page_table_lock
Linus Torvalds's avatar
Linus Torvalds committed
41 42
   or page_table_lock of mm A, you will not try to get either lock
   for mm B.
Linus Torvalds's avatar
Linus Torvalds committed
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68

The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (page stealer can race with other code
that invokes find_vma with mmap_sem held), but that is okay, since it 
is a hint. This can be fixed, if desired, by having find_vma grab the
page_table_lock.


Code that add/delete elements from the vmlist chain are
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap

It is advisable that changes to vm_start/vm_end be protected, although 
in some cases it is not really needed. Eg, vm_start is modified by 
expand_stack(), it is hard to come up with a destructive scenario without 
having the vmlist protection in this case.

69 70 71
The page_table_lock nests with the inode i_shared_sem and the kmem cache
c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
Linus Torvalds's avatar
Linus Torvalds committed
72 73 74 75 76 77 78
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
held.

The page_table_lock is grabbed while holding the kernel_lock spinning monitor.

The page_table_lock is a spin lock.

79 80 81 82
Note: PTL can also be used to guarantee that no new clones using the
mm start up ... this is a loose form of stability on mm_users. For
example, it is used in copy_mm to protect against a racing tlb_gather_mmu
single address space optimization, so that the zap_page_range (from
83
vmtruncate) does not lose sending ipi's to cloned threads that might 
84 85
be spawned underneath it and go to user mode to drag in pte's into tlbs.

Linus Torvalds's avatar
Linus Torvalds committed
86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
swap_list_lock/swap_device_lock
-------------------------------
The swap devices are chained in priority order from the "swap_list" header. 
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The #free swaphandles is maintained in "nr_swap_pages". These two together
are protected by the swap_list_lock. 

The swap_device_lock, which is per swap device, protects the reference 
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from intr level. The
locking hierarchy is swap_list_lock -> swap_device_lock.

To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is being used, ie worthy of being read in
from disk, and an unmap -> swap_free making the handle unused, the swap
delete and readahead code grabs a temp reference on the swaphandle to
prevent warning messages from swap_duplicate <- read_swap_cache_async.

Swap cache locking
------------------
Pages are added into the swap cache with kernel_lock held, to make sure
that multiple pages are not being added (and hence lost) by associating
all of them with the same swaphandle.

Pages are guaranteed not to be removed from the scache if the page is 
"shared": ie, other processes hold reference on the page or the associated 
swap handle. The only code that does not follow this rule is shrink_mmap,
which deletes pages from the swap cache if no process has a reference on 
the page (multiple processes might have references on the corresponding
swap handle though). lookup_swap_cache() races with shrink_mmap, when
establishing a reference on a scache page, so, it must check whether the
page it located is still in the swapcache, or shrink_mmap deleted it.
(This race is due to the fact that shrink_mmap looks at the page ref
count with pagecache_lock, but then drops pagecache_lock before deleting
the page from the scache).

do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared", by looking at the page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired before
calling is_page_shared (else processes might switch their swap_count refs
to the page count refs, after the page count ref has been snapshotted).

Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither mmap_sem nor page_table_lock.