- 30 Aug, 2002 40 commits
-
-
Andrew Morton authored
mpage_writepages() does a lock_page() on pages to be written back, even when it is being used for page reclaim writeback. This is normally OK, because the page is unlocked quickly - pages are unlocked during writeback and nobody should be performing __GFP_FS allocations inside lock_page().

But it has introduced a ranking problem in ext3:

    generic_file_write
    ->lock_page
    ->ext3_prepare_write
    ->journal_start (waits for a commit)

versus

    ext3_create()
    ->journal_start()
    ->ext3_new_inode(GFP_KERNEL)
    ->page reclaim
    ->mpage_writepages
    ->lock_page (locks up, transaction is held open)

Maybe sometime, I'll have to turn mpage_writepages' lock_page into a trylock if the caller is PF_MEMALLOC. But for now, let's make ext3's inside-transaction allocations use GFP_NOFS. There is only one of them.
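
The effect of switching that allocation to GFP_NOFS can be illustrated with a small userspace model. The flag values and the helper reclaim_may_enter_fs() are hypothetical stand-ins, not the kernel's definitions. The only property borrowed from the kernel is that GFP_NOFS is GFP_KERNEL without __GFP_FS, so reclaim triggered by such an allocation should not call back into filesystem writeback.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; the kernel's real values differ. */
#define MODEL_GFP_WAIT 0x1u  /* allocator may sleep */
#define MODEL_GFP_IO   0x2u  /* allocator may start physical I/O */
#define MODEL_GFP_FS   0x4u  /* allocator may call into filesystem code */

#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_WAIT | MODEL_GFP_IO)

/*
 * In this model, page reclaim may only enter filesystem writeback (and so
 * reach lock_page() via mpage_writepages) when the allocation that triggered
 * reclaim permits filesystem re-entry.
 */
static bool reclaim_may_enter_fs(unsigned int gfp_mask)
{
    return (gfp_mask & MODEL_GFP_FS) != 0;
}
```

With MODEL_GFP_KERNEL the second call chain above can reach lock_page() while the transaction is open; with MODEL_GFP_NOFS it cannot.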
-
Andrew Morton authored
This is a performance and correctness fix against the writeback paths.

The writeback code has competing requirements. Sometimes it is used for "memory cleansing": kupdate, bdflush, writer throttling, page allocator writeback, etc. And sometimes this same code is used for data integrity purposes: fsync, msync, fdatasync, sync, umount, various other kernel-internal uses.

The problem is: how to handle a dirty buffer or page which is currently under writeback.

For memory cleansing, we just want to skip that buffer/page and go on to the next one. But for sync, we must wait on the old writeback and then start new writeback.

mpage_writepages() is currently correct for cleansing, but incorrect for sync. block_write_full_page() is currently correct for sync, but inefficient for cleansing.

The fix is fairly simple.

- In mpage_writepages(), don't skip the page if it's a sync operation.

- In block_write_full_page(), skip the buffer if it is not a sync operation. And return -EAGAIN to tell the caller that the writeout didn't work out. The caller must then set the page dirty again and move it onto mapping->dirty_pages.

  This is an extension of the writepage API: writepage can now return -EAGAIN. There are only three callers, and they have been updated.

  fail_writepage() and ext3_writepage() were actually doing this by hand. They have been changed to return -EAGAIN. NTFS will want to be able to return -EAGAIN from its writepage as well.

- A sticky question is: how to tell the writeout code which mode it is operating in? Cleansing or sync?

  It's such a tiny code change that I didn't have the heart to go and propagate a `mode' argument down every instance of writepages() and writepage() in the kernel. So I passed it in via current->flags.

  Incidentally, the occurrence of a locked-and-dirty buffer in block_write_full_page() is fairly rare: normally the collision avoidance happens at the address_space level, via PageWriteback. But some mappings (blockdevs, ext3 files, etc) have their dirty buffers written out via submit_bh(). It is these buffers which can stall block_write_full_page().

  This wart will be pretty intrusive to fix. ext3 needs to become fully page-based (ugh. It's a block-based journalling filesystem, and pages are unnatural). blockdev mappings are still written out by buffers because that's how filesystems use them. Putting _all_ metadata (indirects, inodes, superblocks, etc) into standalone address_spaces would fix that up.

- filemap_fdatawrite() sets PF_SYNC. So filemap_fdatawrite() is the kernel function which will start writeback against a mapping for "data integrity" purposes, whereas the unexported, internal-only do_writepages() is the writeback function which is used for memory cleansing. This difference is the reason why I didn't consolidate those functions ages ago...

- Lots of code paths had a bogus extra call to filemap_fdatawait(), which I previously added in a moment of weak-headedness. They have all been removed.
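
The two modes and the -EAGAIN convention described above can be sketched as a userspace model. This is not the kernel code: struct wb_page, MODEL_PF_SYNC and both functions are hypothetical names, and clearing the dirty flag before writeout is an assumption of the model.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_PF_SYNC 0x1u  /* hypothetical stand-in for the PF_SYNC task flag */

struct wb_page {
    bool dirty;           /* page still needs to be written */
    bool buffer_locked;   /* a buffer is already under I/O */
    int  writes_started;  /* number of writeouts this model started */
    int  waits;           /* number of times we waited on old I/O */
};

/* Model of the writepage rule: sync waits on old writeback, cleansing skips. */
static int model_writepage(struct wb_page *page, unsigned int task_flags)
{
    if (page->buffer_locked) {
        if (!(task_flags & MODEL_PF_SYNC))
            return -EAGAIN;           /* cleansing: do not stall, report it */
        page->waits++;                /* sync: wait for the old I/O first */
        page->buffer_locked = false;
    }
    page->writes_started++;
    return 0;
}

/* Caller side: on -EAGAIN the page is set dirty again for a later pass. */
static int model_write_one(struct wb_page *page, unsigned int task_flags)
{
    int ret;

    page->dirty = false;              /* assumption: cleared before writeout */
    ret = model_writepage(page, task_flags);
    if (ret == -EAGAIN)
        page->dirty = true;           /* redirty; a real caller would requeue it */
    return ret;
}
```

Passing the mode as a flag word here mirrors the choice of carrying it in current->flags instead of adding an argument to every writepage() implementation.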
-
Andrew Morton authored
A reworked version of the batched page freeing and lock amortisation for VMA teardown. It walks the existing 507-page list in the mmu_gather_t in 16-page chunks, drops their refcounts in 16-page chunks, and de-LRUs and frees any resulting zero-count pages in up-to-16 page chunks.
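
The chunked walk can be sketched in userspace roughly as follows. The 16-page chunk size comes from the message above; struct batch_page, the function names, and the model_lock_rounds counter (standing in for lock acquisitions) are hypothetical and only illustrate how the locking cost is amortised over a chunk.

```c
#include <stddef.h>

#define MODEL_CHUNK 16  /* chunk size taken from the commit message */

struct batch_page {
    int count;  /* reference count */
    int freed;  /* set to 1 once the model has "freed" the page */
};

/* Hypothetical counter: how many times the (imaginary) LRU lock was taken. */
static int model_lock_rounds;

/* Drop one reference on each page in a chunk, then free the zero-count ones. */
static void model_release_chunk(struct batch_page **pages, size_t nr)
{
    struct batch_page *zero[MODEL_CHUNK];
    size_t nr_zero = 0;
    size_t i;

    for (i = 0; i < nr; i++)
        if (--pages[i]->count == 0)
            zero[nr_zero++] = pages[i];

    if (nr_zero == 0)
        return;

    model_lock_rounds++;  /* one lock round covers up to 16 pages */
    for (i = 0; i < nr_zero; i++)
        zero[i]->freed = 1;
}

/* Walk the whole list in chunks of at most MODEL_CHUNK pages. */
static void model_release_pages(struct batch_page **pages, size_t nr)
{
    size_t off;

    for (off = 0; off < nr; off += MODEL_CHUNK) {
        size_t n = nr - off;

        if (n > MODEL_CHUNK)
            n = MODEL_CHUNK;
        model_release_chunk(pages + off, n);
    }
}
```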
-
Andrew Morton authored
Clean up put_page() and page_cache_release(). It's pretty simple now:

    #define page_cache_get(page)      get_page(page)
    #define page_cache_release(page)  put_page(page)
-
Andrew Morton authored
it was only being used in invalidate_inode_pages(), and from there, pagevec_release() does the same thing.
-
Andrew Morton authored
As suggested by Daniel - it's a bug to run put_page_testzero against a zero-ref page.
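
A minimal userspace sketch of that rule, using C11 atomics and assert() in place of the kernel's atomic operations and BUG check. struct ref_page and model_put_page_testzero() are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ref_page {
    atomic_int count;  /* reference count */
};

/* Drop one reference; return true if this was the last one. */
static bool model_put_page_testzero(struct ref_page *page)
{
    /* Dropping a reference that does not exist is a bug, so trap it. */
    assert(atomic_load(&page->count) != 0);

    /* fetch_sub returns the old value: old == 1 means the count hit zero. */
    return atomic_fetch_sub(&page->count, 1) == 1;
}
```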
-
Ingo Molnar authored
please apply this patch (Robert ACK-ed it). While there is a preemptible kernel entry already, I think listing this at the scheduler entry is justified: preemption has a number of scheduler interactions.
-
Ingo Molnar authored
this is an updated version of the LDT fixes. It fixes the following kinds of problems:

- fix a possible gcc optimization causing a race causing the loading of a corrupt LDT descriptor upon context switch. [this fix got simplified over previous versions.]

- remove an unconditional OOM printk, and there's no need to set ->size in the OOM path.

- fix preemption bugs, load_LDT()/clear_LDT() was not preemption-safe, when it was used outside of spinlocks.

the context-switch race is the following. 'LDT modification' is the following operation: the seg->ldt pointer is modified, then seg->size is modified. In theory gcc is free to reschedule the two modifications, and first modify ->size, then ->ldt. Thus if this modification is not synchronized with context-switches, another thread might see a temporary state of the new ->size [which was increased], but still the old pointer. Ie.:

    CPU0                         CPU1

    pc->size = newsize;
                                 load_LDT(); // (oldptr, newsize)
    pc->ldt = newptr;

the corrupt LDT is loaded until the SMP cross-call is sent, leaving the window open for many usecs.

the fix is to put a wmb() after ->ldt modifications. [this is also in preparation of not-write-ordered SMP x86 designs.]
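
The ordering requirement can be sketched in portable C11, with a release fence standing in for wmb(). struct model_ctx and both functions are hypothetical, and the acquire fence on the reader side is an assumption of this model, not something the message states about load_LDT().

```c
#include <stdatomic.h>
#include <stddef.h>

struct model_ctx {
    _Atomic(void *) ldt;   /* pointer to the descriptor table */
    atomic_int      size;  /* number of entries in the table */
};

/* Writer: publish the new pointer first, then the (larger) size. */
static void model_set_ldt(struct model_ctx *pc, void *newptr, int newsize)
{
    atomic_store_explicit(&pc->ldt, newptr, memory_order_relaxed);
    /* Stand-in for wmb(): the pointer store is ordered before the size store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&pc->size, newsize, memory_order_relaxed);
}

/* Reader: a reader that observes the new size also observes the new pointer. */
static void model_read_ldt(struct model_ctx *pc, void **ptr, int *size)
{
    *size = atomic_load_explicit(&pc->size, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    *ptr = atomic_load_explicit(&pc->ldt, memory_order_relaxed);
}
```

A reader may still see the new pointer with the old size, which is harmless in this model because the new table is at least as large as the old one.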
-
bk://linux-input.bkbits.net/linux-input
Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
Vojtech Pavlik authored
Some mainboards (Andrew Morton's Dell) report an error even when everything is okay with AUX. Also remove a check for very old AMI i8042's, which could generate false positives on modern buggy mainboards.
-
bk://jfs.bkbits.net/linux-2.5
Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Peter Wächtler authored
-
Dave Kleikamp authored
jfs_get_blocks should return up to the number of blocks in the extent rather than limiting itself to one block, as the initial, trivial implementation did. This greatly reduces the overhead of O_DIRECT reads and writes. Submitted by Badari Pulavarty (pbadari@us.ibm.com)
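
The idea of mapping as many blocks as the extent allows can be sketched as below. This is a simplified model, not the JFS code: struct model_extent, model_get_blocks() and its signature are hypothetical.

```c
struct model_extent {
    unsigned long start_lblk;  /* first logical block covered by the extent */
    unsigned long len;         /* number of blocks in the extent */
    unsigned long start_pblk;  /* physical block backing start_lblk */
};

/*
 * Map up to max_blocks blocks starting at logical block lblk.
 * Returns the number of contiguous blocks mapped (0 if lblk lies outside
 * the extent) and stores the first physical block in *pblk.
 */
static unsigned long model_get_blocks(const struct model_extent *xt,
                                      unsigned long lblk,
                                      unsigned long max_blocks,
                                      unsigned long *pblk)
{
    unsigned long off, avail;

    if (lblk < xt->start_lblk || lblk - xt->start_lblk >= xt->len)
        return 0;

    off = lblk - xt->start_lblk;
    avail = xt->len - off;      /* blocks left in the extent from lblk on */
    *pblk = xt->start_pblk + off;

    /* The trivial version would return 1 here; return the whole run instead. */
    return avail < max_blocks ? avail : max_blocks;
}
```

A caller asking for 16 blocks inside an 8-block extent gets the rest of the extent in one call instead of one block per call, which is where the O_DIRECT saving would come from.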
-
http://linuxusb.bkbits.net/pci-2.5
Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
Dave Kleikamp authored
Submitted by Steve Best.
-
Greg Kroah-Hartman authored
-
David Brownell authored
This patch exposes basic allocation statistics for pci pools, very much like /proc/slabinfo but applying to DMA-consistent memory. A file "pools" is created in the driverfs directory for the relevant pci device when the first pool is created, and removed when the last pool is destroyed. Please merge to 2.5.latest. If it matters, DaveM said it looks fine. It produces sane output for all the 2.5.30 USB host controller drivers.
-