- 27 Sep, 2002 1 commit
-
-
Ingo Molnar authored
Implement a "mapping change" notification for virtual lookup caches, and make the futex code use that to keep the futex page pinning consistent across copy-on-write events in the VM space.
-
- 18 Sep, 2002 1 commit
-
-
Kai Germaschewski authored
It's gone almost everywhere else already, and will eventually make for a nicer top-level Makefile.
-
- 17 Sep, 2002 1 commit
-
-
Andrew Morton authored
Patch from Christoph Hellwig moves the madvise implementation out of filemap.c and into its own .c file. No other changes are made.
-
- 19 Jul, 2002 1 commit
-
-
Andrew Morton authored
This is the "minimal rmap" patch, writen by Rik, ported to 2.5 by Craig Kulsea. Basically, before: When the page reclaim code decides that is has scanned too many unreclaimable pages on the LRU it does a scan of process virtual address spaces for pages to add to swapcache. ptes pointing at the page are unmapped as the scan proceeds. When all ptes referring to a page have been unmapped and it has been written to swap the page is reclaimable. after: When an anonymous page is encountered on the tail of the LRU we use the rmap to see if it hasn't been referenced lately. If so then add it to swapcache. When the page is again encountered on the LRU, if it is still unreferenced then try to unmap all ptes which refer to it in one hit, and if it is clean (ie: on swap) then free it. The rest of the VM - list management, the classzone concept, etc remains unchanged. There are a number of things which the per-page pte chain could be used for. Bill Irwin has identified the following. (1) page replacement no longer goes around randomly unmapping things (2) referenced bits are more accurate because there aren't several ms or even seconds between find the multiple pte's mapping a page (3) reduces page replacement from O(total virtually mapped) to O(physical) (4) enables defragmentation of physical memory (5) enables cooperative offlining of memory for friendly guest instance behavior in UML and/or LPAR settings (6) demonstrable benefit in performance of swapping which is common in end-user interactive workstation workloads (I don't like the word "desktop"). c.f. Craig Kulesa's post wrt. swapping performance (7) evidence from 2.4-based rmap trees indicates approximate parity with mainline in kernel compiles with appropriate locking bits (8) partitioning of physical memory can reduce the complexity of page replacement searches by scanning only the "interesting" zones implemented and merged in 2.4-based rmap (9) partitioning of physical memory can increase the parallelism of page replacement searches by independently processing different zones implemented, but not merged in 2.4-based rmap (10) the reverse mappings may be used for efficiently keeping pte cache attributes coherent (11) they may be used for virtual cache invalidation (with changes) (12) the reverse mappings enable proper RSS limit enforcement implemented and merged in 2.4-based rmap The code adds a pointer to struct page, consumes additional storage for the pte chains and adds computational expense to the page reclaim code (I measured it at 3% additional load during streaming I/O). The benefits which we get back for all this are, I must say, theoretical and unproven. If it has real advantages (or, indeed, disadvantages) then why has nobody demonstrated them? There are a number of things remaining to be done: 1: Demonstrate the above advantages. 2: Make it work with pte-highmem (Bill Irwin is signed up for this) 3: Don't add pte_chains to non-shared pages optimisation (Dave McCracken's patch does this) 4: Move the pte_chains into highmem too (Bill, I guess) 5: per-cpu pte_chain freelists (Rik?) 6: maybe GC the pte_chain backing pages. (Seems unavoidable. Rik?) 7: multithread the page reclaim code. (I have patches). 8: clustered add-to-swap. Not sure if I buy this. anon pages are often well-ordered-by-virtual-address on the LRU, so it "just works" for benchmarky loads. But there may be some other loads... 9: Fix bad IO latency in page reclaim (I have lame patches) 10: Develop tuning tools, use them. 
11: The nightly updatedb run is still evicting everything.
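Purely as an illustration of the data structure being discussed - the names, helpers and locking here are hypothetical, not taken from the patch - each physical page carries a chain of the ptes that map it, which reclaim can walk in one pass instead of scanning whole address spaces:

    /* Hypothetical sketch only: page->pte_chain heads a singly linked list
     * of every pte that currently maps the page. */
    struct pte_chain {
        struct pte_chain *next;
        pte_t *ptep;                    /* one pte mapping this page */
    };

    /* Reclaim can now drop every mapping of an unreferenced page in one hit. */
    static void unmap_all_ptes(struct page *page)
    {
        struct pte_chain *pc, *next;

        for (pc = page->pte_chain; pc; pc = next) {
            next = pc->next;
            clear_pte_and_flush(pc->ptep);  /* hypothetical helper */
            free_pte_chain(pc);             /* hypothetical helper */
        }
        page->pte_chain = NULL;
    }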
-
- 30 Apr, 2002 1 commit
-
-
Andrew Morton authored
[ I reversed the order in which writeback walks the superblock's dirty inodes. It sped up dbench's unlink phase greatly. I'm such a sleaze ]

The core writeback patch. Switches file writeback from the dirty buffer LRU over to address_space.dirty_pages.

- The buffer LRU is removed.
- The buffer hash is removed (uses blockdev pagecache lookups).
- The bdflush and kupdate functions are implemented against address_spaces, via pdflush.
- The relationship between pages and buffers is changed:
  - If a page has dirty buffers, it is marked dirty.
  - If a page is marked dirty, it *may* have dirty buffers.
  - A dirty page may be "partially dirty". block_write_full_page discovers this.
- A bunch of consistency checks of the form if (!something_which_should_be_true()) buffer_error(); have been introduced. These fog the code up but are important for ensuring that the new buffer/page code is working correctly.
- New locking (inode.i_bufferlist_lock) is introduced for exclusion from try_to_free_buffers(). This is needed because set_page_dirty is called under spinlock, so it cannot lock the page. But it needs access to page->buffers to set them all dirty. i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
- fs/inode.c has been split: all the code related to file data writeback has been moved into fs/fs-writeback.c.
- Code related to file data writeback at the address_space level is in the new mm/page-writeback.c.
- try_to_free_buffers() is now non-blocking.
- Switches vmscan.c over to understand that all pages with dirty data are now marked dirty.
- Introduces a new a_op for VM writeback: ->vm_writeback(struct page *page, int *nr_to_write). This is a bit half-baked at present. The intent is that the address_space is given the opportunity to perform clustered writeback: to opportunistically write out disk-contiguous dirty data which may be in other zones, and to allow delayed-allocate filesystems to get good disk layout. (A sketch of this operation follows after this list.)
- Added address_space.io_pages: pages which are being prepared for writeback. This is here for two reasons:
  1: It will be needed later, when BIOs are assembled direct against the pagecache, bypassing the buffer layer. It avoids a deadlock which would occur if someone moved the page back onto the dirty_pages list after it was added to the BIO, but before it was submitted. (hmm. This may not be a problem with PG_writeback logic.)
  2: It avoids a livelock which would occur if some other thread is continually redirtying pages.
- There are two known performance problems in this code:
  1: Pages which are locked for writeback cause undesirable blocking when they are being overwritten. A patch which leaves pages unlocked during writeback comes later in the series.
  2: While inodes are under writeback, they are locked. This causes namespace lookups against the file to get unnecessarily blocked in wait_on_inode(). This is a fairly minor problem. I don't have a fix for this at present - I'll fix this when I attach dirty address_spaces direct to super_blocks.
- The patch vastly increases the amount of dirty data which the kernel permits highmem machines to maintain. This is because the balancing decisions are made against the amount of memory in the machine, not against the amount of buffercache-allocatable memory. This may be very wrong, although it works fine for me (2.5 gigs). We can trivially go back to the old-style throttling with s/nr_free_pagecache_pages/nr_free_buffer_pages/ in balance_dirty_pages(). But better would be to allow blockdev mappings to use highmem (I'm thinking about this one, slowly), and to move writer-throttling and writeback decisions into the VM (modulo the file-overwriting problem).
- Drops 24 bytes from struct buffer_head. More to come.
- There's some gunk like super_block.flags:MS_FLUSHING which needs to be killed. We need a better way of providing collision avoidance between pdflush threads, to prevent more than one pdflush thread working a disk at the same time. The correct way to do that is to put a flag in the request queue to say "there's a pdflush thread working this disk". This is easy to do: just generalise the "ra_pages" pointer to point at a struct which includes ra_pages and the new collision-avoidance flag.
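As a sketch of the new operation mentioned above - the prototype is the one given in the message, but the body and the helper it calls are purely illustrative, not the patch's implementation:

    /* ->vm_writeback(struct page *page, int *nr_to_write)
     * Illustrative body only: start from the page the VM handed us and
     * opportunistically push out disk-contiguous dirty neighbours,
     * charging each one against *nr_to_write. */
    static int example_vm_writeback(struct page *page, int *nr_to_write)
    {
        struct address_space *mapping = page->mapping;

        while (*nr_to_write > 0 &&
               write_next_contiguous_dirty_page(mapping, page))    /* hypothetical */
            (*nr_to_write)--;

        return 0;
    }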
-
- 10 Apr, 2002 2 commits
-
-
Andrew Morton authored
This patch implements a gang-of-threads which are designed to be used for dirty data writeback. "pdflush" -> dirty page flush, or something. The number of threads is dynamically managed by a simple demand-driven algorithm. "Oh no, more kernel threads". Don't worry, kupdate and bdflush disappear later.

The intent is that no two pdflush threads are ever performing writeback against the same request queue at the same time. It would be wasteful to do that. My current patches don't quite achieve this; I need to move the state into the request queue itself...

The driver for implementing the thread pool was to avoid the possibility where bdflush gets stuck on one device's get_request_wait() queue while lots of other disks sit idle. Also generality, abstraction, and the need to have something in place to perform the address_space-based writeback when the buffer_head-based writeback disappears.

There is no provision inside the pdflush code itself to prevent many threads from working against the same device. That's the responsibility of the caller.

The main API function, `pdflush_operation()', attempts to find a thread to do some work for you. It is not reliable - it may return -1 and say "sorry, I didn't do that". This happens if all threads are busy. One _could_ extend pdflush_operation() to queue the work so that it is guaranteed to happen. If there's a need, that additional minor complexity can be added.
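A sketch of how a caller might use this, assuming a prototype along the lines of int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0) and a negative return when no thread is free; the worker functions are hypothetical:

    /* If a pdflush thread is available it runs the work; otherwise the
     * caller falls back to doing the writeback itself. */
    static void background_writeback(unsigned long arg)
    {
        struct super_block *sb = (struct super_block *)arg;

        writeback_sb_inodes(sb);            /* hypothetical worker */
    }

    static void kick_writeback(struct super_block *sb)
    {
        if (pdflush_operation(background_writeback, (unsigned long)sb) < 0)
            writeback_sb_inodes(sb);        /* all threads busy: do it ourselves */
    }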
-
Andrew Morton authored
I'd like to be able to claim amazing speedups, but the best benchmark I could find was diffing two 256 megabyte files, which is about 10% quicker. And that is probably due to the window size being effectively 50% larger. Fact is, any disk worth owning nowadays has a segmented 2-megabyte cache, and OS-level readahead mainly seems to save on CPU cycles rather than overall throughput. Once you start reading more streams than there are segments in the disk cache we start to win.

Still. The main motivation for this work is to clean the code up, and to create a central point at which many pages are marshalled together so that they can all be encapsulated into the smallest possible number of BIOs, and injected into the request layer.

A number of filesystems were poking around inside the readahead state variables. I'm not really sure what they were up to, but I took all that out. The readahead code manages its own state autonomously and should not need any hints.

- Unifies the current three readahead functions (mmap reads, read(2) and sys_readahead) into a single implementation.
- More aggressive in building up the readahead windows.
- More conservative in tearing them down.
- Special start-of-file heuristics.
- Preallocates the readahead pages, to avoid the (never demonstrated, but potentially catastrophic) scenario where allocation of readahead pages causes the allocator to perform VM writeout.
- Gets all the readahead pages gathered together in one spot, so they can be marshalled into big BIOs.
- Reinstates the readahead ioctls, so hdparm(8) and blockdev(8) are working again. The readahead settings are now per-request-queue, and the drivers never have to know about it. I use blockdev(8). It works in units of 512 bytes. (An example of driving these ioctls directly follows after this list.)
- Identifies readahead thrashing, and attempts to handle it. Certainly the changes here delay the onset of catastrophic readahead thrashing by quite a lot, and decrease its seriousness as we get more deeply into it, but it's still pretty bad.
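For reference, the per-queue readahead setting can also be inspected and changed from a small C program through the BLKRAGET/BLKRASET block-device ioctls (values in 512-byte units), which is essentially what blockdev(8) does under the hood:

    /* Print a block device's readahead window and optionally set a new one.
     * Usage: ra <device> [sectors] -- values are in 512-byte units. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* BLKRAGET, BLKRASET */

    int main(int argc, char **argv)
    {
        long ra;
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <device> [sectors]\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || ioctl(fd, BLKRAGET, &ra) < 0) {
            perror(argv[1]);
            return 1;
        }
        printf("%s: readahead = %ld sectors (%ld KB)\n", argv[1], ra, ra / 2);

        if (argc > 2 && ioctl(fd, BLKRASET, (unsigned long)atol(argv[2])) < 0) {
            perror("BLKRASET");
            return 1;
        }
        close(fd);
        return 0;
    }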
-
- 08 Mar, 2002 2 commits
-
-
Linus Torvalds authored
-
Linus Torvalds authored
changes.
-
- 19 Feb, 2002 1 commit
-
-
Rik van Riel authored
The patch has been changed as you wanted, with page->zone shoved into page->flags. I've also pulled the thing up to your latest changes from linux.bkbits.net, so you should be able to just pull it into your tree from:

Rik
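A minimal sketch of the encoding being referred to - the bit position, lookup table and helper names here are illustrative only: the zone is stored as a small index in the upper bits of page->flags and recovered through a table when needed:

    /* Illustrative only: keep a zone index in the top bits of page->flags
     * rather than a full pointer in struct page. */
    #define ZONE_SHIFT  (BITS_PER_LONG - 8) /* top bits hold the zone index */

    extern struct zone *zone_table[];       /* indexed by zone number */

    static inline void set_page_zone(struct page *page, unsigned long zone_num)
    {
        page->flags &= ~(~0UL << ZONE_SHIFT);   /* clear any old index */
        page->flags |= zone_num << ZONE_SHIFT;
    }

    static inline struct zone *page_zone(struct page *page)
    {
        return zone_table[page->flags >> ZONE_SHIFT];
    }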
-
- 05 Feb, 2002 5 commits
-
-
Linus Torvalds authored
- me: revert the "kill(-1..)" change. POSIX isn't that clear on the issue anyway, and the new behaviour breaks things.
- Jens Axboe: more bio updates
- Al Viro: rd_load cleanups. hpfs mount fix, mount cleanups
- Ingo Molnar: more raid updates
- Jakub Jelinek: fix Linux/x86 confusion about arg passing of "save_v86_state" and "do_signal"
- Trond Myklebust: fix NFS client race conditions
-
Linus Torvalds authored
- Jens Axboe: more bio stuff
- Ingo Molnar: mempool for bio
- Niibe Yutaka: Super-H update
-
Linus Torvalds authored
- Michael Warfield: computone serial driver update
- Alexander Viro: cdrom module race fixes
- David Miller: Acenic driver fix
- Andrew Grover: ACPI update
- Kai Germaschewski: ISDN update
- Tim Waugh: parport update
- David Woodhouse: JFFS garbage collect sleep
-
Linus Torvalds authored
- Andreas Dilger: various ext2 cleanups
- Richard Gooch: devfs update
- Johannes Erdfelt: USB updates
- Alan Cox: merges
- David Miller: fix SMP pktsched bootup deadlock (CONFIG_NET_SCHED)
- Roman Zippel: AFFS update
- Anton Altaparmakov: NTFS update
- me: fix races in vfork() (semaphores are not good completion handlers)
- Jeff Garzik: net driver updates, sysvfs update
-
Linus Torvalds authored
-