- 22 May, 2004 40 commits
-
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> First of batch of six patches which introduce Rajesh Venkatasubramanian's implementation of a radix priority search tree of vmas, to handle object-based reverse mapping corner cases well. rmap 14 i_shared_lock fixes Start the sequence with a couple of outstanding i_shared_lock fixes. Since i_shared_sem became i_shared_lock, we've had to shift and then temporarily remove mremap move's protection of concurrent truncation - if mremap moves ptes while unmap_mapping_range_list is making its way through the vmas, there's a danger we'd move a pte from an area yet to be cleaned back into an area already cleared. Now site the i_shared_lock with the page_table_lock in move_one_page. Replace page_table_present by get_one_pte_map, so we know when it's necessary to allocate a new page table: in which case have to drop i_shared_lock, trylock and perhaps reorder locks on the way back. Yet another fix: must check for NULL dst before pte_unmap(dst). And over in rmap.c, try_to_unmap_file's cond_resched amidst its lengthy nonlinear swapping was now causing might_sleep warnings: moved to a rather unsatisfactory and less frequent cond_resched_lock on i_shared_lock when we reach the end of the list; and one before starting on the nonlinears too: the "cursor" may become out-of-date if we do schedule, but I doubt it's worth bothering about.
-
Andrew Morton authored
Having a semaphore in there causes modest performance regressions on heavily mmap-intensive workloads on some hardware. Specifically, up to 30% in SDET on NUMAQ and big PPC64. So switch it back to being a spinlock. This does mean that unmap_vmas() needs to be told whether or not it is allowed to schedule away; that's simple to do via the zap_details structure. This change means that there will be high scheuling latencies when someone truncates a large file which is currently mmapped, but nobody does that anyway. The scheduling points in unmap_vmas() are mainly for munmap() and exit(), and they still will work OK for that. From: Hugh Dickins <hugh@veritas.com> Sorry, my premature optimizations (trying to pass down NULL zap_details except when needed) have caught you out doubly: unmap_mapping_range_list was NULLing the details even though atomic was set; and if it hadn't, then zap_pte_range would have missed free_swap_and_cache and pte_clear when pte not present. Moved the optimization into zap_pte_range itself. Plus massive documentation update. From: Hugh Dickins <hugh@veritas.com> Here's a second patch to add to the first: mremap's cows can't come home without releasing the i_mmap_lock, better move the whole "Subtle point" locking from move_vma into move_page_tables. And it's possible for the file that was behind an anonymous page to be truncated while we drop that lock, don't want to abort mremap because of VM_FAULT_SIGBUS. (Eek, should we be checking do_swap_page of a vm_file area against the truncate_count sequence? Technically yes, but I doubt we need bother.) - We cannot hold i_mmap_lock across move_one_page() because move_one_page() needs to perform __GFP_WAIT allocations of pagetable pages. - Move the cond_resched() out so we test it once per page rather than only when move_one_page() returns -EAGAIN.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Delete include/asm*/rmap.h Delete pte_addr_t typedef from include/asm*/pgtable.h Delete KM_PTE2 from subset of include/asm*/kmap_types.h Beware when 4G/4G returns to -mm: i386 may need KM_FILLER for 8K stack.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Remove the support for pte_chain rmap from page table initialization, just continue to maintain nr_page_table_pages (but only for user page tables - it also counted vmalloc page tables before, little need, and I'm unsure if per-cpu stats are safe early enough on all arches). mm/memory.c is the only core file affected. But ppc and ppc64 have found the old rmap page table initialization useful to support their ptep_test_and_clear_young: so transfer rmap's initialization to them (even on kernel page tables? well, okay).
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> A weakness of the anonmm scheme is its difficulty in tracking pages shared between two or more mms (one being an ancestor of the other), when mremap has been used to move a range of pages in one of those mms. mremap move is not very common anyway, and it's more often used on a page range exclusive to the mm; but uncommon though it may be, we must not allow unlocked pages to become unswappable. This patch follows Linus' suggestion, simply to take a private copy of the page in such a case: early C-O-W. My previous implementation was daft with respect to pages currently on swap: it insisted on swapping them in to copy them. No need for that: just take the copy when a page is brought in from swap, and its intended address is found to clash with what rmap has already noted. If do_swap_page has to make this copy in the mremap moved case (simply a call to do_wp_page), might as well do so also in the case when it's a write access but the page not exclusive, it's always seemed a little odd that swapin needed a second fault for that. A bug even: get_user_pages force imagines that a single call to handle_mm_fault must break C-O-W. Another bugfix: swapoff's unuse_process didn't check is_vm_hugetlb_page. Andrea's anon_vma has no such problem with mremap moved pages, handling them with elegant use of vm_pgoff - though at some cost to vma merging. How important is it to handle them efficiently? For now there's a msg printk(KERN_WARNING "%s: mremap moved %d cows\n", current->comm, cows);
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Hugh's anonmm object-based reverse mapping scheme for anonymous pages. We have not yet decided whether to adopt this scheme, or Andrea's more advanced anon_vma scheme. anonmm is easier for me to merge quickly, to replace the pte_chain rmap taken out in the previous patch; a patch to install Andrea's anon_vma will follow in due course. Why build up and tear down chains of pte pointers for anonymous pages, when a page can only appear at one particular address, in a restricted group of mms that might share it? (Except: see next patch on mremap.) Introduce struct anonmm per mm to track anonymous pages, all forks from one exec sharing the same bundle of linked anonmms. Anonymous pages originate in one mm, but may be forked into another mm of the bundle later on. Callouts from fork.c to allocate, dup and exit the anonmm structure private to rmap.c. From: Hugh Dickins <hugh@veritas.com> Two concurrent exits (of the last two mms sharing the anonhd). First exit_rmap brings anonhd->count down to 2, gets preempted (at the spin_unlock) by second, which brings anonhd->count down to 1, sees it's 1 and frees the anonhd (without making any change to anonhd->count itself), cpu goes on to do something new which reallocates the old anonhd as a new struct anonmm (probably not a head, in which case count will start at 1), first resumes after the spin_unlock and sees anonhd->count 1, frees "anonhd" again, it's used for something else, a later exit_rmap list_del finds list corrupt.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Lots of deletions: the next patch will put in the new anon rmap, which should look clearer if first we remove all of the old pte-pointer-based rmap from the core in this patch - which therefore leaves anonymous rmap totally disabled, anon pages locked in memory until process frees them. Leave arch files (and page table rmap) untouched for now, clean them up in a later batch. A few constructive changes amidst all the deletions: Choose names (e.g. page_add_anon_rmap) and args (e.g. no more pteps) now so we need not revisit so many files in the next patch. Inline function page_dup_rmap for fork's copy_page_range, simply bumps mapcount under lock. cond_resched_lock in copy_page_range. Struct page rearranged: no pte union, just mapcount moved next to atomic count, so two ints can occupy one long on 64-bit; i386 struct page now 32 bytes even with PAE. Never pass PageReserved to page_remove_rmap, only do_wp_page did so. From: Hugh Dickins <hugh@veritas.com> Move page_add_anon_rmap's BUG_ON(page_mapping(page)) inside the rmap_lock (well, might as well just check mapping if !mapcount then): if this page is being mapped or unmapped on another cpu at the same time, page_mapping's PageAnon(page) and page->mapping are volatile. But page_mapping(page) is used more widely: I've a nasty feeling that clear_page_anon, page_add_anon_rmap and/or page_mapping need barriers added (also in 2.6.6 itself),
-
Andrew Morton authored
Many places do: if (kmem_cache_create(...) == NULL) panic(...); We can consolidate all that by passing another flag to kmem_cache_create() which says "panic if it doesn't work".
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> The previous patch let the ptes of file pages be located via page ->mapping->i_mmap and i_mmap_shared lists of vmas; which works well unless the vma is VM_NONLINEAR - one in which sys_remap_file_pages has been used to place pages in unexpected places, to avoid an explosion of distinct unmergable vmas. Such pages were effectively locked in memory. page_referenced_file is already skipping nonlinear vmas, they'd just waste its time, and age unfairly any pages in their proper positions. Now extend try_to_unmap_file, to persuade it to swap from nonlinears. Ignoring the page requested, try to unmap cluster of 32 neighbouring ptes (in worst case all empty slots) in a nonlinear vma, then move on to the next vma; stopping when we've unmapped at least as many maps as the requested page had (vague guide of how hard to try), or have reached the end. With large sparse nonlinear vmas, this could take a long time: inserted a cond_resched while no locks are held, unusual at this level but I think okay, shrink_list does so. Use vm_private_data a little like the old mm->swap_address, as a cursor recording how far we got, so we don't attack the same ptes next time around (earlier tried inserting an empty marker vma in the list, but that got messy). How well this will work on real- life nonlinear vmas remains to be seen, but should work better than locking them all in memory, or swapping everything out all the time. Existing users of vm_private_data have either VM_RESERVED or VM_DONTEXPAND set, both of which are in the VM_SPECIAL category where we never try to merge vmas: so removed the vm_private_data test from is_mergeable_vma, so we can still merge VM_NONLINEARs. Of course, we could instead add another field to vm_area_struct.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Dave McCracken's object-based reverse mapping scheme for file pages: why build up and tear down chains of pte pointers for file pages, when page->mapping has i_mmap and i_mmap_shared lists of all the vmas which might contain that page, and it appears at one deterministic position within the vma (unless vma is nonlinear - see next patch)? Has some drawbacks: more work to locate the ptes from page_referenced and try_to_unmap, especially if the i_mmap lists contain a lot of vmas covering different ranges; has to down_trylock the i_shared_sem, and hope that doesn't fail too often. But attractive in that it uses less lowmem, and shifts the rmap burden away from the hot paths, to swapout. Hybrid scheme for the moment: carry on with pte_chains for anonymous pages, that's unchanged; but file pages keep mapcount in the pte union of struct page, where anonymous pages keep chain pointer or direct pte address: so page_mapped(page) works on both. Hugh massaged it a little: distinct page_add_file_rmap entry point; list searches check rss so as not to waste time on mms fully swapped out; check mapcount to terminate once all ptes have been found; and a WARN_ON if page_referenced should have but couldn't find all the ptes.
-
Andrew Morton authored
Running __mark_inode_dirty() against a swapcache page is illegal and will oops. I see a race in set_page_dirty() wherein it can be called with a PageSwapCache page, but if the page is removed from swapcache after __set_page_dirty_nobuffers() drops tree_lock(), we have the situation where PageSwapCache() is false, but local variable `mapping' points at swapcache. Handle that by checking for non-null mapping->host. We don't care about the page state at this point - we're only interested in the inode. There is a converse case: what if a page is added to swapcache as we are running set_page_dirty() against it? In this case the page gets its PG_dirty flag set but it is not tagged as dirty in the swapper_space radix tree. The swap writeout code will handle this OK and test_clear_page_dirty()'s call to radix_tree_tag_clear(PAGECACHE_TAG_DIRTY) will silently have no effect. The only downside is that future radix-tree-based writearound won't notice that such pages are dirty and swap IO scheduling will be a teensy bit worse. The patch also fixes the (silly) testing of local variable `mapping' to see if the page was truncated. We should test page_mapping() for that.
-
Andrew Morton authored
Revert recent changes to sync_page(). Now that page_mapping() returns &swapper_space for swapcache pages we don't need to test for PageSwapCache in sync_page().
-
Andrew Morton authored
Fix up the "may we call writepage" logic for the swapcache changes.
-
Andrew Morton authored
Go back to the 2.6.5 concepts, with rmap additions. In particular: - Implement Andrea's flavour of page_mapping(). This function opaquely does the right thing for pagecache pages, anon pages and for swapcache pages. The critical thing here is that page_mapping() returns &swapper_space for swapcache pages without actually requiring the storage at page->mapping. This frees page->mapping for the anonmm/anonvma metadata. - Andrea and Hugh placed the pagecache index of swapcache pages into page->private rather than page->index. So add new page_index() function which hides this. - Make swapper_space.set_page_dirty() again point at __set_page_dirty_buffers(). If we don't do that, a bare set_page_dirty() will fall through to __set_page_dirty_buffers(), which is silly. This way, __set_page_dirty_buffers() can continue to use page->mapping. It should never go near anon or swapcache pages. - Give swapper_space a ->set_page_dirty address_space_operation method, so that set_page_dirty() will not fall through to __set_page_dirty_buffers() for swapcache pages. That function is not set up to handle them. The main effect of these changes is that swapcache pages are treated more similarly to pagecache pages. And we are again tagging swapcache pages as dirty in their radix tree, which is a requirement if we later wish to implement swapcache writearound based on tagged radix-tree walks.
-
Andrew Morton authored
Simplify the logic in there a bit.
-
Andrew Morton authored
->tree_lock is supposed to be IRQ-safe. Hugh worked out that with his changes, we never actually take it from interrupt context, so spin_lock() is sufficient. Apart from kinda freaking me out, the analysis which led to this decision becomes untrue with later patches. So make it irq-safe.
-
bk://kernel.bkbits.net/davem/net-2.6Linus Torvalds authored
into ppc970.osdl.org:/home/torvalds/v2.6/linux
-
Linus Torvalds authored
(The difference between two pointers is a "size_t", while MAX_LEN and the result here are "int"s).
-
Stephen Hemminger authored
Forwarding table paranoia: * Solve some potential problems if a device changes address and one or more device has the same address. * Warn if new device added to a bridge matches a entry that has shown up on the network. * Also don't put static entries in the timer list, they don't time out so shouldn't be there.
-
Stephen Hemminger authored
Replacement 64 bit compatibility code for the new ioctl's. The new ioctl's all pass through clean, but for the old style ioctl's it uses the mis-feature of the earlier bridge-utils that they check the API version. So if an old 32bit version of brctl is run on a 64bit platform it will report bridge utilities not compatible with kernel version Tested on Itanium 1; but should solve issue for sparc, ppc, and x86_64
-
Stephen Hemminger authored
Add four new ioctl's for the operations that can't be done through sysfs. The existing bridge ioctl's are multiplexed, and most go through SIOCDEVPRIVATE so they won't work in a mixed 32/64bit environment. The new release of bridge-utils will use these if possible, and fall back to the old interface.
-
Stephen Hemminger authored
-
Stephen Hemminger authored
Move the local function timer_residue to br_timer_value so it can be used by both ioctl and sysfs code.
-
Stephen Hemminger authored
Change how the read of forwarding table works. Instead of copying entries to user one at a time, use an intermediate kernel buffer and do up to a page at a chunk. This gets rid of some awkward code dealing with entries getting deleted during the copy. And allows same function to be used by later sysfs hook.
-
Stephen Hemminger authored
Fix a deadlock where deleting a device call br_del_if with lock held. br_del_if doesn't want to be called under lock anymore.
-
Stephen Hemminger authored
Merge the ioctl stub calls that just end up calling the sub-function to do the actual ioctl. Move br_get_XXX_ifindices into the ioctl file as well where they can be static.
-
Stephen Hemminger authored
Relax the locking on add/delete interfaces to a bridge. Since these operations are already called with RTNL semaphore, only need to hold the bridge lock while doing operations related to STP and processing path. This is necessary for later sysfs support where those operations might sleep.
-
Stephen Hemminger authored
Minor cleanup (lead in to later sysfs support). Change new_nb to new_bridge_dev and return the net_device rather than bridge because that is what the caller wants anyway.
-
Stephen Hemminger authored
This fixes the issue discovered when removing bluetooth devices from a bridge. Need to add special case code when forwarding table is being cleaned up to handle the case where several devices share the same hardware address.
-
Herbert Xu authored
-
Bartlomiej Zolnierkiewicz authored
Also remove unused EOL define from ide.h. This trivial patch makes grepping a lot easier.
-
Bartlomiej Zolnierkiewicz authored
- initializing needs to be set to 1 before calling ide_arm_init() - ide_default_io_ctl() should be 0 on arm26
-
Bartlomiej Zolnierkiewicz authored
This driver was partially merged in 2.5.32 and never compiled in 2.5/2.6. It was fixed in linux-mips CVS but has been broken again about 5 months ago. Just remove it for now (it is in wrong directory anyway).
-
Adrian Bunk authored
The patch below removes the MAINTAINERS entry for the removed comx driver. Additionally, the following comx header files could be removed: drivers/net/wan/mixcom.h drivers/net/wan/hscx.h drivers/net/wan/munich32x.h drivers/net/wan/falc-lh.h I've double-checked that none of them are used by any other driver.
-
Adrian Bunk authored
The case of CONFIG_JFFS2_FS_NAND=y got broken recently. The bug is obvious, and the fix is trivial:
-
Andrew Morton authored
From: Ian Kent <raven@themaw.net> This changes the autofs4 maintainer to me. Recommended by Joe Perches and OKed with Jeremy.
-
Andrew Morton authored
From: Ian Kent <raven@themaw.net> This is a patch contributed by Joe Perches to automatically include the function name in the dprintk statements.
-
Andrew Morton authored
From: Francois Romieu <romieu@fr.zoreil.com> Missing cache size format for Intel P4E (p.26 of doc. 241618-025, "Intel Processor Identification and the CPUID Instruction").
-
Andrew Morton authored
From: Armin Schindler <armin@melware.de> Fixes a compiler warning about unused Eicon ISDN driver function if hotplug is disabled.
-
Andrew Morton authored
From: Pavel Machek <pavel@ucw.cz> This fixes bad interaction between devfs and swsusp. Check whether the swap device is the specified resume device, irrespective of whether they are specified by identical names. (Thus, device inode aliasing is allowed. You can say /dev/hda4 instead of /dev/ide/host0/bus0/target0/lun0/part4 [if using devfs] and they'll be considered the same device. This is *necessary* for devfs, since the resume code can only recognize the form /dev/hda4, but the suspend code would like the long name [as shown in 'cat /proc/mounts'].) [Thanks to devfs hero whose name I forgot.]
-