- 22 May, 2004 40 commits
-
-
Roland McGrath authored
There is a longstanding bug in the rt_sigreturn system call. This exists in both 2.4 and 2.6, and for almost every platform. I am referring to this code in sys_rt_sigreturn (arch/i386/kernel/signal.c):

	if (__copy_from_user(&st, &frame->uc.uc_stack, sizeof(st)))
		goto badframe;
	/* It is more difficult to avoid calling this function than to
	   call it and ignore errors. */
	/*
	 * THIS CANNOT WORK! "&st" is a kernel address, and "do_sigaltstack()"
	 * takes a user address (and verifies that it is a user address). End
	 * result: it does exactly _nothing_.
	 */
	do_sigaltstack(&st, NULL, regs->esp);

As the comment says, this is bogus. On vanilla i386 kernels, this is just harmlessly stupid--do_sigaltstack always does nothing and returns -EFAULT. However this code actually bites users on kernels using Ingo Molnar's 4G/4G address space layout changes. There, some kernel stack address might very well be a lovely and readable user address as well. When that happens, we make a sigaltstack call with some random buffer, and then the fun begins.

To my knowledge, this has produced trouble in the real world only for 4G i386 kernels (RHEL and Fedora "hugemem" kernels) on machines that actually have several GB of physical memory (and in programs that are actually using sigaltstack and handling a lot of signals). However, the same clearly broken code has been blindly copied to most other architecture ports, and off hand I don't know the address space details of any other well enough to know if real kernel stack addresses and real user addresses are in fact disjoint as they are on i386 when not using the nonstandard 4GB address space layout.

The obvious intent of the call being there in the first place is to permit a signal handler to diddle its ucontext_t.uc_stack before returning, and have this effect a sigaltstack call on the signal handler return. This is not only an optimization vs doing the extra system call, but makes it possible to make a sigaltstack change when that handler itself was running on the signal stack. AFAICT this has never actually worked before, so certainly no one depends on it. But the code certainly suggests that someone intended at one time for that to be the behavior. Thus I am inclined to fix it so it works in that way, though it has not done so before. It would also be reasonable enough to simply rip out the bogus call and not have this functionality.

From the current state of code in both 2.4 and 2.6, there is no fathoming how this broken code came about. It's actually much simpler to just make it work! I can only presume that at some point in the past the sigaltstack implementation functions were different such that this made sense. Of the few ports I've looked at briefly, only the ppc/ppc64 porters (go paulus!) actually tried to understand what the i386 code was doing and implemented it correctly rather than just carefully transliterating the bug.

The patch below fixes only the i386 and x86_64 versions. The x86_64 patches I have not actually tested. I think each and every arch (except ppc and ppc64) needs to make the corresponding fixes as well. Note that there is a function to fix for each native arch, and then one for each emulation flavor. The details differ minutely for getting the calls right in each emulation flavor, but I think that most or all of the arches with biarch/emulation support have similar enough code that each emulation flavor's fix will look very much like the arch/x86_64/ia32/ia32_signal.c patch here.
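The failure mode described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct layout`, `model_user_ptr_ok` and `model_do_sigaltstack` are hypothetical stand-ins, not kernel functions, and the address limits are illustrative numbers rather than the real 4G/4G values.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: addresses below user_limit count as user addresses. */
struct layout {
    uint64_t user_limit;
};

/* Stand-in for the kernel's user-pointer range check (not the real one). */
bool model_user_ptr_ok(const struct layout *l, uint64_t addr, uint64_t len)
{
    return addr < l->user_limit && len <= l->user_limit - addr;
}

/* Stand-in for do_sigaltstack(): it rejects pointers that fail the check. */
int model_do_sigaltstack(const struct layout *l, uint64_t uss_addr)
{
    /* 12 bytes: pointer + int + size_t, the size of stack_t on i386 */
    if (!model_user_ptr_ok(l, uss_addr, 12))
        return -EFAULT;   /* conventional split: the call does nothing */
    return 0;             /* overlapping layout: the buffer would be read */
}
```

With a conventional split the kernel-stack address is rejected, so the call is a harmless no-op; once kernel and user address ranges overlap, the same address passes the check and whatever sits there is treated as a stack_t.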
-
Andrew Morton authored
From: Rajesh Venkatasubramanian <vrajesh@umich.edu> This patch adds prefetches for walking a vm_set.list. Adding prefetches for prio tree traversals is tricky and may lead to cache thrashing. So this patch adds prefetches only when walking a vm_set.list. I haven't done any benchmarks to show that this patch improves performance. However, this patch should help to improve performance when vm_set.lists are long, e.g., libc. Since we only prefetch vmas that are guaranteed to be used in the near future, this patch should not result in cache thrashing, theoretically. I didn't add any NULL checks before prefetching because prefetch.h clearly says prefetch(0) is okay.
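A user-space sketch of the same idea, assuming a GCC or Clang compiler: `__builtin_prefetch` plays the role of the kernel's prefetch(), and `struct node` and `sum_with_prefetch` are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Sum a NULL-terminated list, hinting the next node into cache first. */
long sum_with_prefetch(const struct node *head)
{
    long total = 0;

    for (const struct node *n = head; n != NULL; n = n->next) {
        /* n->next may be NULL: a prefetch hint on address 0 does not fault,
           so no NULL check is needed before issuing it. */
        __builtin_prefetch(n->next);
        total += n->value;
    }
    return total;
}
```

Only the node that the loop is certain to visit next is prefetched, which mirrors the argument that such hints should not pollute the cache.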
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> anon_vma rmap will always necessarily be more restrictive about vma merging than before: according to the history of the vmas in an mm, they are liable to be allocated different anon_vma heads, and from that point on be unmergeable. Most of the time this doesn't matter at all; but in two cases it may matter. One case is that mremap refuses (-EFAULT) to span more than a single vma: so it is conceivable that some app has relied on vma merging prior to mremap in the past, and will now fail with anon_vma. Conceivable but unlikely, let's cross that bridge if we come to it: and the right answer would be to extend mremap, which should not be exporting the kernel's implementation detail of vma to user interface. The other case that matters is when a reasonable repetitive sequence of syscalls and faults ends up with a large number of separate unmergeable vmas, instead of the single merged vma it could have. Andrea's mprotect-vma-merging patch fixed some such instances, but left other plausible cases unmerged. There is no perfect solution, and the harder you try to allow vmas to be merged, the less efficient anon_vma becomes, in the extreme there being one to span the whole address space, from which hangs every private vma; but anonmm rmap is clearly superior to that extreme. Andrea's principle was that neighbouring vmas which could be mprotected into mergeable vmas should be allowed to share anon_vma: good insight. His implementation was to arrange this sharing when trying vma merge, but that seems to be too early. This patch sticks to the principle, but implements it in anon_vma_prepare, when handling the first write fault on a private vma: with better results. The drawback is that this first write fault needs an extra find_vma_prev (whereas prev was already to hand when implementing anon_vma sharing at try-to-merge time).
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous pages. Instead of tracking anonymous pages by pte_chains or by mm, this tracks them by vma. But because vmas are frequently split and merged (particularly by mprotect), a page cannot point directly to its vma(s), but instead to an anon_vma list of those vmas likely to contain the page - a list on which vmas can easily be linked and unlinked as they come and go. The vmas on one list are all related, either by forking or by splitting. This has three particular advantages over anonmm: that it can cope effortlessly with mremap moves; and no longer needs page_table_lock to protect an mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma -> vma instead of using find_vma; and should use less cpu for swapout since it can locate its anonymous vmas more quickly. It does have disadvantages too: a lot more change in mmap.c to deal with anon_vmas, though small straightforward additions now that the vma merging has been refactored there; more lowmem needed for each anon_vma and vma structure; an additional restriction on the merging of vmas (cannot be merged if already assigned different anon_vmas, since then their pages will be pointing to different heads). (There would be no need to enlarge the vma structure if anonymous pages belonged only to anonymous vmas; but private file mappings accumulate anonymous pages by copy-on-write, so need to be listed in both anon_vma and prio_tree at the same time. A different implementation could avoid that by using anon_vmas only for purely anonymous vmas, and use the existing prio_tree to locate cow pages - but that would involve a long search for each single private copy, probably not a good idea.) 
Where before the vm_pgoff of a purely anonymous (not file-backed) vma was meaningless, now it represents the virtual start address at which that vma is mapped - which the standard file pgoff manipulations treat linearly as vmas are split and merged. But if mremap moves the vma, then it generally carries its original vm_pgoff to the new location, so pages shared with the old location can still be found. Magic. Hugh has massaged it somewhat: building on the earlier rmap patches, this patch is a fifth of the size of Andrea's original anon_vma patch. Please note that this posting will be his first sight of this patch, which he may or may not approve.
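The vm_pgoff arithmetic described above can be sketched as follows. This is a toy model assuming 4 KiB pages; `struct toy_vma`, `toy_vma_address` and `toy_move` are hypothetical names, not the kernel's own helpers.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */

struct toy_vma {
    unsigned long vm_start;   /* current virtual start address */
    unsigned long vm_end;     /* current virtual end address (exclusive) */
    unsigned long vm_pgoff;   /* anonymous vma: original start >> PAGE_SHIFT */
};

/* Virtual address inside vma at which the page with index pgoff is mapped. */
unsigned long toy_vma_address(const struct toy_vma *vma, unsigned long pgoff)
{
    return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

/* Model an mremap move: new start address, vm_pgoff carried over unchanged. */
struct toy_vma toy_move(struct toy_vma vma, unsigned long new_start)
{
    unsigned long len = vma.vm_end - vma.vm_start;

    vma.vm_start = new_start;
    vma.vm_end = new_start + len;
    return vma;
}
```

Because the page remembers its index and the moved vma keeps its vm_pgoff, the same linear calculation locates the page at its new address without touching the page itself.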
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Before moving on to anon_vma rmap, remove now what's peculiar to anonmm rmap: the anonmm handling and the mremap move cows. Temporarily reduce page_referenced_anon and try_to_unmap_anon to stubs, so a kernel built with this patch will not swap anonymous at all.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Silly final patch for anonmm rmap: change page_add_anon_rmap's mm arg to vma arg like anon_vma rmap, to smooth the transition between them.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Earlier on, in 2.6.6, we took the vma merging code out of mremap.c and let it rely on vma_merge instead (via copy_vma). Now take the vma merging code out of mprotect.c and let it rely on vma_merge too: so vma_merge becomes the sole vma merging engine. The fruit of this consolidation is that mprotect now merges file-backed vmas naturally. Make this change now because anon_vma will complicate the vma merging rules, let's keep them all in one place. vma_merge remains where the decisions are made, whether to merge with prev and/or next; but now [addr,end) may be the latter part of prev, or first part or whole of next, whereas before it was always a new area. vma_adjust carries out vma_merge's decision, but when sliding the boundary between vma and next, must temporarily remove next from the prio_tree too. And it turned out (by oops) to have a surer idea of whether next needs to be removed than vma_merge, so the fput and freeing moves into vma_adjust. Too much decipherment of what's going on at the start of vma_adjust? Yes, and there's a delicate assumption that you may use vma_adjust in sliding a boundary, or splitting in two, or growing a vma (mremap uses it in that way), but not for simply shrinking a vma. Which is so, and must be so (how could pages mapped in the part to go, be zapped without first splitting?), but would feel better with some protection. __vma_unlink can then be moved from mm.h to mmap.c, and mm.h's more misleading than helpful can_vma_merge is deleted.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com>

Before some real vma_merge work in mmap.c in the next patch, a patch of miscellaneous cleanups to cut down the noise:

- remove rb_parent arg from vma_merge: mm->mmap can do that case
- scatter pgoff_t around to ingratiate myself with the boss
- reorder is_mergeable_vma tests, vm_ops->close is least likely
- can_vma_merge_before take combined pgoff+pglen arg (from Andrea)
- rearrange do_mmap_pgoff's ever-confusing anonymous flags switch
- comment do_mmap_pgoff's mysterious (vm_flags & VM_SHARED) test
- fix ISO C90 warning on browse_rb if building with DEBUG_MM_RB
- stop that long MNT_NOEXEC line wrapping

Yes, buried in amidst these is indeed one pgoff replaced by "next->vm_pgoff - pglen" (reverting a mod of mine which took pgoff supplied by user too seriously in the anon case), and another pgoff replaced by 0 (reverting anon_vma mod which crept in with NUMA API): neither of them really matters, except perhaps in /proc/pid/maps.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com>

First of a batch of seven rmap patches, based on 2.6.6-mm3. Probably the final batch: remaining issues outstanding can have isolated patches. The first half of the batch is good for anonmm or anon_vma, the second half of the batch replaces my anonmm rmap by Andrea's anon_vma rmap.

Judge for yourselves which you prefer. I do think I was wrong to call anon_vma more complex than anonmm (its lists are easier to understand than my refcounting), and I'm happy with its vma merging after the last patch. It just comes down to whether we can spare the extra 24 bytes (maximum, on 32-bit) per vma for its advantages in swapout and mremap.

rmap 34 vm_flags page_table_lock

Why do we guard vm_flags mods with page_table_lock when it's already down_write guarded by mmap_sem? There's probably a historical reason, but no sign of any need for it now. Andrea added a comment and removed the instance from mprotect.c, Hugh plagiarized his comment and removed the instances from madvise.c and mlock.c. Huge leap in scalability... not expected; but this should stop people asking why those spinlocks.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> anon_vma will need to pass vma to put_dirty_page, so change it and its various callers (setup_arg_pages and its 32-on-64-bit arch variants); and please, let's rename it to install_arg_page. Earlier attempt to do this (rmap 26 __setup_arg_pages) tried to clean up those callers instead, but failed to boot: so now apply rmap 27's memset initialization of vmas to these callers too; which relieves them from needing the recently included linux/mempolicy.h. While there, moved install_arg_page's flush_dcache_page up before page_table_lock - doesn't in fact matter at all, just saves one worry when researching flush_dcache_page locking constraints.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> From: Andrea Arcangeli <andrea@suse.de> zap_pmd_range, alone of all those page_range loops, lacks the check for whether address wrapped. Hugh is in doubt as to whether this makes any difference to any config on any arch, but eager to fix the odd one out.
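Why the wrap check matters can be seen in a 32-bit toy model of such a loop. This is a sketch only: `TOY_PMD_SIZE`, `TOY_PMD_MASK` and `count_steps` are hypothetical, and the step size is illustrative rather than taken from any particular architecture.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PMD_SIZE 0x400000u              /* 4 MiB per step, illustrative */
#define TOY_PMD_MASK (~(TOY_PMD_SIZE - 1))

/* Count the steps taken over [address, end) in a 32-bit address space. */
unsigned count_steps(uint32_t address, uint32_t end)
{
    unsigned steps = 0;

    do {
        steps++;
        /* Advance to the next boundary; at the top of the address
           space this sum wraps around to 0. */
        address = (address + TOY_PMD_SIZE) & TOY_PMD_MASK;
    } while (address && address < end);   /* "address &&" catches the wrap */

    return steps;
}
```

Without the `address &&` term, a range ending near the top of the address space would see `address` wrap to 0, compare as less than `end`, and keep walking from the bottom.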
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> From: Andrea Arcangeli <andrea@suse.de> Sprinkle unlikelys throughout mm/memory.c, wherever we see a pgd_bad or a pmd_bad; likely or unlikely on pte_same or !pte_same. Put the jump in the error return from do_no_page, not in the fast path.
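For readers unfamiliar with the annotation, here is a minimal user-space sketch, assuming GCC or Clang. The two macros follow the commonly used `__builtin_expect` form; `checked_sum` is a hypothetical function made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints: tell the compiler which outcome to lay out as the fast path. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Return the sum of buf[0..len), or -1 for the rare invalid-argument case. */
long checked_sum(const int *buf, size_t len)
{
    long total = 0;

    if (unlikely(buf == NULL))   /* error path kept out of the hot code */
        return -1;

    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

The hint changes only code layout, never the result, which is why sprinkling it on error checks such as bad page-table entries is low risk.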
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> From: Andrea Arcangeli <andrea@suse.de> page_alloc.c's bad_page routine should reset a bad mapcount; and it's more revealing to show the bad mapcount than just the boolean mapped.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> From: Andrea Arcangeli <andrea@suse.de> Set VM_RESERVED in videobuf_mmap_mapper, to warn do_no_page and swapout not to worry about its pages. Set VM_RESERVED in ia64_elf32_init, it too provides an unusual nopage which might surprise higher level checks. Future safety: they don't actually pose a problem in this current tree.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> The callers of remove_shared_vm_struct then proceed to do several more identical things: gather them together in remove_vm_struct.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> We're NULLifying more and more fields when initializing a vma (mpol_set_vma_default does that too, if configured to do anything). Now use memset to avoid specifying fields, and save a little code too. (Yes, I realize anon_vma will want to set vm_pgoff non-0, but I think that will be better handled at the core, since anon vm_pgoff is negotiable up until an anon_vma is actually assigned.)
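A sketch of the initialization pattern, using a cut-down stand-in structure: `struct toy_vma` and `toy_vma_init` are hypothetical, and the field list is only a sample, not the real vm_area_struct.

```c
#include <assert.h>
#include <string.h>

struct toy_vma {
    void *vm_mm;
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_flags;
    unsigned long vm_pgoff;
    void *vm_file;
    void *vm_ops;
    void *vm_policy;
};

/* Zero everything once, then set only the fields that are not 0/NULL. */
void toy_vma_init(struct toy_vma *vma, unsigned long start, unsigned long end)
{
    memset(vma, 0, sizeof(*vma));
    vma->vm_start = start;
    vma->vm_end = end;
}
```

A field added to the structure later is then zeroed automatically, instead of needing another explicit assignment at every initialization site.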
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> I like CONFIG_REGPARM, even when it's forced on: because it's easy to force off for debugging - easier than editing out scattered fastcalls. Plus I've never understood why we make function foo a fastcall, but function bar not. Remove fastcall directives from rmap. And fix comment about mremap_moved race: it only applies to anon pages.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Most architectures (like i386) do nothing in flush_dcache_page, or don't scan i_mmap in flush_dcache_page, so don't need flush_dcache_mmap_lock to do anything: define it and flush_dcache_mmap_unlock away. Noticed arm26, cris, h8300 still defining flush_page_to_ram: delete it again.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> arm and parisc __flush_dcache_page have been scanning the i_mmap(_shared) list without locking or disabling preemption. That may be even more unsafe now it's a prio tree instead of a list. It looks like we cannot use i_shared_lock for this protection: most uses of flush_dcache_page are okay, and only one would need lock ordering fixed (get_user_pages holds page_table_lock across flush_dcache_page); but there's a few (e.g. in net and ntfs) which look as if they're using it in I/O completion - and it would be restrictive to disallow it there. So, on arm and parisc only, define flush_dcache_mmap_lock(mapping) as spin_lock_irq(&(mapping)->tree_lock); on i386 (and other arches left to the next patch) define it away to nothing; and use where needed. While updating locking hierarchy in filemap.c, remove two layers of the fossil record from add_to_page_cache comment: no longer used for swap. I believe all the #includes will work out, but have only built i386. I can see several things about this patch which might cause revulsion: the name flush_dcache_mmap_lock? the reuse of the page radix_tree's tree_lock for this different purpose? spin_lock_irqsave instead? can't we somehow get i_shared_lock to handle the problem?
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Why should try_to_unmap_anon and try_to_unmap_file take a copy of page->mapcount and pass it down for try_to_unmap_one to decrement? why not just check page->mapcount itself? asks akpm. Perhaps there used to be a good reason, but not any more: remove the mapcount arg.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Why should struct address_space have separate i_mmap and i_mmap_shared prio_trees (separating !VM_SHARED and VM_SHARED vmas)? No good reason, the same processing is usually needed on both. Merge i_mmap_shared into i_mmap, but keep i_mmap_writable count of VM_SHARED vmas (those capable of dirtying the underlying file) for the mapping_writably_mapped test. The VM_MAYSHARE test in the arm and parisc loops is not necessarily what they will want to use in the end: it's provided as a harmless example of what might be appropriate, but maintainers are likely to revise it later (that parisc loop is currently being changed in the parisc tree anyway). On the way, remove the now out-of-date comments on vm_area_struct size.
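The bookkeeping can be sketched with a toy model in which a plain counter stands in for the merged tree. All names here (`struct toy_mapping`, `toy_link_vma`, `toy_unlink_vma`, `toy_mapping_writably_mapped`, `TOY_VM_SHARED`) are hypothetical, and the flag value is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_VM_SHARED 0x1u   /* illustrative flag value, not the kernel's */

struct toy_mapping {
    unsigned int nr_vmas;          /* stands in for the single i_mmap tree */
    unsigned int i_mmap_writable;  /* how many linked vmas are VM_SHARED */
};

/* Link a vma: one container for all vmas, plus a count of the shared ones. */
void toy_link_vma(struct toy_mapping *m, unsigned int vm_flags)
{
    m->nr_vmas++;
    if (vm_flags & TOY_VM_SHARED)
        m->i_mmap_writable++;
}

/* Unlink a vma, undoing exactly what toy_link_vma did for it. */
void toy_unlink_vma(struct toy_mapping *m, unsigned int vm_flags)
{
    m->nr_vmas--;
    if (vm_flags & TOY_VM_SHARED)
        m->i_mmap_writable--;
}

/* True while at least one vma could dirty the underlying file. */
bool toy_mapping_writably_mapped(const struct toy_mapping *m)
{
    return m->i_mmap_writable != 0;
}
```

The counter preserves the one question the separate i_mmap_shared tree used to answer cheaply, so the two trees can be merged without losing it.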
-
Andrew Morton authored
From: Christoph Hellwig <hch@lst.de>
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Missed comment on the size of vm_area_struct: it is no longer 64 bytes on ia32.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> The previous patches of this prio_tree batch have been to generic only. Now the arm and parisc __flush_dcache_page are converted to using vma_prio_tree_next, and benefit from its selection of relevant vmas. They're still accessing the tree without i_shared_lock or any other, that's not forgotten but still under investigation. Include pagemap.h for the definition of PAGE_CACHE_SHIFT. s390 and x86_64 no longer initialize vma's shared field (whose type has changed), done later.
-
Andrew Morton authored
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> The prio_tree is of no use to nonlinear vmas: currently we're having to search the tree in the most inefficient way to find all its nonlinears. At the very least we need an indication of the unlikely case when there are some nonlinears; but really, we'd do best to take them out of the prio_tree altogether, into a list of their own - i_mmap_nonlinear.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Rajesh Venkatasubramanian's implementation of a radix priority search tree of vmas, to handle object-based reverse mapping corner cases well. Amongst the objections to object-based rmap were test cases by akpm and by mingo, in which large numbers of vmas mapping disjoint or overlapping parts of a file showed strikingly poor performance of the i_mmap lists. Perhaps those tests are irrelevant in the real world? We cannot be too sure: the prio_tree is well-suited to solving precisely that problem, so unless it turns out to bring too much overhead, let's include it. Why is this prio_tree.c placed in mm rather than lib? See GET_INDEX: this implementation is geared throughout to use with vmas, though the first half of the file appears more general than the second half. Each node of the prio_tree is itself (contained within) a vma: might save memory by allocating distinct nodes from which to hang vmas, but wouldn't save much, and would complicate the usage with preallocations. Off each node of the prio_tree itself hangs a list of like vmas, if any. The connection from node to list is a little awkward, but probably the best compromise: it would be more straightforward to list likes directly from the tree node, but that would use more memory per vma, for the list_head and to identify that head. Instead, node's shared.vm_set.head points to next vma (whose shared.vm_set.head points back to node vma), and that next contains the list_head from which the rest hang - reusing fields already used in the prio_tree node itself. Currently lacks prefetch: Rajesh hopes to add some soon.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Pave the way for prio_tree by switching over to its interfaces, but actually still implement them with the same old lists as before. Most of the vma_prio_tree interfaces are straightforward. The interesting one is vma_prio_tree_next, used to search the tree for all vmas which overlap the given range: unlike the list_for_each_entry it replaces, it does not find every vma, just those that match. But this does leave handling of nonlinear vmas in a very unsatisfactory state: for now we have to search again over the maximum range to find all the nonlinear vmas which might contain a page, which of course takes away the point of the tree. Fixed in later patch of this batch. There is no need to initialize vma linkage all over, just do it before inserting the vma in list or tree. /proc/pid/statm had an odd test for its shared count: simplified to an equivalent test on vm_file.
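The "only those that match" behaviour can be sketched over a plain array, much as this patch still implements the interface over the old lists. The names `struct toy_file_vma`, `toy_overlaps` and `toy_next_match` are hypothetical and the signature is not that of vma_prio_tree_next.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file_vma {
    unsigned long vm_pgoff;   /* first file page mapped */
    unsigned long nr_pages;   /* length in pages, assumed nonzero */
};

/* Does vma map any file page in the inclusive range [first, last]? */
int toy_overlaps(const struct toy_file_vma *vma,
                 unsigned long first, unsigned long last)
{
    unsigned long vma_last = vma->vm_pgoff + vma->nr_pages - 1;

    return vma->vm_pgoff <= last && vma_last >= first;
}

/* Return the next overlapping vma at or after *pos, or NULL when done. */
const struct toy_file_vma *toy_next_match(const struct toy_file_vma *vmas,
                                          size_t n, size_t *pos,
                                          unsigned long first,
                                          unsigned long last)
{
    while (*pos < n) {
        const struct toy_file_vma *v = &vmas[(*pos)++];

        if (toy_overlaps(v, first, last))
            return v;
    }
    return NULL;
}
```

A caller loops until NULL and never sees the non-overlapping vmas; a real tree would also avoid visiting them, which this linear sketch does not attempt.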
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> If file-based vmas are to be kept in a tree, according to the file offsets they map, then adjusting the vma's start pgoff or its end involves repositioning in the tree, while holding i_shared_lock (and page_table_lock). We used to avoid that if possible, e.g. when just moving end; but if we're heading that way, let's now tidy up vma_merge and split_vma, and do all the locking and adjustment in a new helper vma_adjust. And please, let's call the next vma in vma_merge "next" rather than "prev". Since these patches are diffed over 2.6.6-rc2-mm2, they include the NUMA mpolicy mods which you'll have to remove to go earlier in the series, sorry for that nuisance. I have intentionally changed the one vma_mpol_equal to mpol_equal, to make the merge cases more alike.
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> This fixes a user-triggerable crash in mbind() in the NUMA API. It would oops when running into the end of memory. Actually not really oops, because an oops with the mm sem held for writing always deadlocks.
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> Change to core VM to use alloc_page_vma() instead of alloc_page(). Change the swap readahead to follow the policy of the VMA.
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> Add NUMA hit/miss statistics to page allocation and display them in sysfs. This is not 100% required for NUMA API, but without this it is very The overhead is quite low because all counters are per-CPU, and it is only incurred when CONFIG_NUMA is defined.
-
Andrew Morton authored
From: Christoph Hellwig <hch@lst.de>

- don't include mempolicy.h in sched.h and mm.h when a forward declaration is enough. Andi argued against that in the past, but I'd really hate to add another header to two of the includes used in basically every driver when we can include it in the six files actually needing it instead (that number is for my ppc32 system, maybe other arches need more includes in their directories)
- make numa api fields in task_struct conditional on CONFIG_NUMA, this gives us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> Add support to tmpfs and hugetlbfs to support NUMA API. Shared memory is a bit of a special case for NUMA policy. Normally policy is associated to VMAs or to processes, but for a shared memory segment you really want to share the policy. The core NUMA API has code for that, this patch adds the necessary changes to tmpfs and hugetlbfs. First it changes the custom swapping code in tmpfs to follow the policy set via VMAs. It is also useful to have a "backing store" of policy that saves the policy even when nobody has the shared memory segment mapped. This allows command line tools to pre configure policy, which is then later used by programs. Note that hugetlbfs needs more changes - it is also required to switch it to lazy allocation, otherwise the prefault prevents mbind() from working.
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> NUMA API adds a policy to each VMA. During VMA creation, merging and splitting these policies must be handled properly. This patch adds the calls to this. It is a nop when CONFIG_NUMA is not defined.
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> This patch re-adds the sysfs output of the NUMA API statistics. All my test scripts need this and it is very useful to check if the policy actually works. This got lost when the huge page NUMA API changes got dropped. I decided to not resend the huge pages NUMA API changes for now. Instead I will wait for this area to settle when demand-paged large pages are merged.
-
Andrew Morton authored
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> I think Andi missed the copy_vma I recently added for mremap, and it'll need something like below.... (Doesn't look like it'll optimize away when it's not needed - rather bloaty.)
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de>

The following patches add support for configurable NUMA memory policy for user processes. It is based on the proposal from last kernel summit with feedback from various people.

This NUMA API does not attempt to implement page migration or anything else complicated: all it does is to police the allocation when a page is first allocated or when a page is reallocated after swapping. Currently only support for shared memory and anonymous memory is there; policy for file based mappings is not implemented yet (although they get implicitly policied by the default process policy).

It adds three new system calls: mbind to change the policy of a VMA, set_mempolicy to change the policy of a process, get_mempolicy to retrieve memory policy. User tools (numactl, libnuma, test programs, manpages) can be found in ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz

For details on the system calls see the manpages
http://www.firstfloor.org/~andi/mbind.html
http://www.firstfloor.org/~andi/set_mempolicy.html
http://www.firstfloor.org/~andi/get_mempolicy.html

Most user programs should actually not use the system calls directly, but use the higher level functions in libnuma (http://www.firstfloor.org/~andi/numa.html) or the command line tools (http://www.firstfloor.org/~andi/numactl.html).

The system calls allow user programs and administrators to set various NUMA memory policies for putting memory on specific nodes.

Here is a short description of the policies copied from the kernel patch:

 * NUMA policy allows the user to give hints in which node(s) memory should
 * be allocated.
 *
 * Support four policies per VMA and per process:
 *
 * The VMA policy has priority over the process policy for a page fault.
 *
 * interleave     Allocate memory interleaved over a set of nodes,
 *                with normal fallback if it fails.
 *                For VMA based allocations this interleaves based on the
 *                offset into the backing object or offset into the mapping
 *                for anonymous memory. For process policy a process counter
 *                is used.
 * bind           Only allocate memory on a specific set of nodes,
 *                no fallback.
 * preferred      Try a specific node first before normal fallback.
 *                As a special case node -1 here means do the allocation
 *                on the local CPU. This is normally identical to default,
 *                but useful to set in a VMA when you have a non default
 *                process policy.
 * default        Allocate on the local node first, or when on a VMA
 *                use the process policy. This is what Linux always did
 *                in a NUMA aware kernel and still does by, ahem, default.
 *
 * The process policy is applied for most non interrupt memory allocations
 * in that process' context. Interrupts ignore the policies and always
 * try to allocate on the local CPU. The VMA policy is only applied for memory
 * allocations for a VMA in the VM.
 *
 * Currently there are a few corner cases in swapping where the policy
 * is not applied, but the majority should be handled. When process policy
 * is used it is not remembered over swap outs/swap ins.
 *
 * Only the highest zone in the zone hierarchy gets policied. Allocations
 * requesting a lower zone just use default policy. This implies that
 * on systems with highmem kernel lowmem allocation don't get policied.
 * Same with GFP_DMA allocations.
 *
 * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
 * all users and remembered even when nobody has memory mapped.

This patch:

This is the core NUMA API code. This includes NUMA policy aware wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels these are defined away.

The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html), get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are implemented here.

Adds a vm_policy field to the VMA and to the process. The process also has a field for interleaving. VMA interleaving uses the offset into the VMA, but that's not possible for process allocations.
From: Andi Kleen <ak@muc.de>

> Andi, how come policy_vma() calls ->set_policy under i_shared_sem?

I think this can be actually dropped now. In an earlier version I did walk the vma shared list to change the policies of other mappings to the same shared memory region. This turned out too complicated with all the corner cases, so I eventually gave in and added ->get_policy to the fast path. Also there is still the mmap_sem which prevents races in the same MM.

Patch to remove it attached. Also adds documentation and removes the bogus __alloc_page_vma() prototype noticed by hch.

From: Andi Kleen <ak@suse.de>

A few incremental fixes for NUMA API.

- Fix a few comments
- Add a compat_ function for get_mem_policy. I considered changing the ABI to avoid this, but that would have made the API too ugly. I put it directly into the file because a mm/compat.c didn't seem worth it just for this.
- Fix the algorithm for VMA interleave.

From: Matthew Dobson <colpatch@us.ibm.com>

1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA. The only references to the function are in NUMA code in mempolicy.c
2) Remove the definitions of __alloc_page_vma(). They aren't used.
3) Move forward declaration of struct vm_area_struct to top of file.
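The interleave policy described above can be illustrated with a toy node picker. This is a sketch under assumptions, not the algorithm in mempolicy.c: `toy_interleave_node` is a hypothetical name, the node set is a plain bitmask, and `n` stands in for either the page offset (VMA policy) or a per-process counter (process policy).

```c
#include <assert.h>

/* Pick the (n mod weight)-th set bit of nodemask; -1 if the mask is empty. */
int toy_interleave_node(unsigned long nodemask, unsigned long n)
{
    const int bits = (int)(8 * sizeof(nodemask));
    unsigned long weight = 0;

    /* Count how many nodes are allowed. */
    for (int node = 0; node < bits; node++)
        if (nodemask & (1UL << node))
            weight++;
    if (weight == 0)
        return -1;

    /* Walk the allowed nodes until the target position is reached. */
    unsigned long target = n % weight;
    for (int node = 0; node < bits; node++) {
        if (nodemask & (1UL << node)) {
            if (target == 0)
                return node;
            target--;
        }
    }
    return -1;   /* not reached when weight > 0 */
}
```

Deriving the node from an offset makes the placement of a given page repeatable, whereas a counter only spreads successive allocations evenly; that difference is why the two policy levels use different inputs.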
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> Add NUMA API system calls on IA64 and one bug fix required for it.
-