- 14 Dec, 2002 22 commits
-
-
Andrew Morton authored
This allows us to control the aggressiveness of the lower-zone defense algorithm. The `incremental min'. For workloads which are using a serious amount of mlocked memory, a few megabytes is not enough. So the `lower_zone_protection' tunable allows the administrator to increase the amount of protection which lower zones receive against allocations which _could_ use higher zones. The default value of lower_zone_protection is zero, giving unchanged behaviour. We should not normally make large amounts of memory unavailable for pagecache just in case someone mlocks many hundreds of megabytes.
-
Andrew Morton authored
I've revisited all the superblock->inode->page writeback paths. There were several silly things in there, and things were not as clear as they could be.

scenario 1: create and dirty a MAP_SHARED segment over a sparse file, then exit. All the memory turns into dirty pagecache, but the kupdate function only writes it out at a trickle - 4 megabytes every thirty seconds. We should sync it all within 30 seconds.

What's happening is that when writeback tries to write those pages, the filesystem needs to instantiate new blocks for them (they're over holes). The filesystem runs mark_inode_dirty() within the writeback function. This redirtying of the inode while we're writing it out triggers some livelock avoidance code in __sync_single_inode(). That function says "ah, someone redirtied the file while I was writing it. Let's move the file to the new end of the superblock dirty list and write it out later." Problem is, writeback dirtied the inode itself.

(It is rather silly that mark_inode_dirty() sets I_DIRTY_PAGES when clearly no pages have been dirtied. Fixing that up would be a largish work, so work around it here).

So this patch just removes the livelock avoidance from __sync_single_inode(). It is no longer needed anyway - writeback livelock is now avoided (in all writeback paths) by writing a finite number of pages.

scenario 2: an application is continuously dirtying a 200 megabyte file, and your disk has a bandwidth of less than 40 megabytes/sec. What happens is that once 30 seconds passes, pdflush starts writing out the file. And because that writeout will take more than five seconds (a `kupdate' interval), pdflush just keeps writing it out forever - continuous I/O.

What we _want_ to happen is that the 200 megabytes gets written, and then IO stops for thirty seconds (minus the writeout period). So the file is fully synced every thirty seconds.

The patch solves this by using mapping->io_pages more intelligently. When the time comes to write the file out, move all the dirty pages onto io_pages. That is a "batch of pages for this kupdate round". When io_pages is empty, we know we're done.

The address_space_operations.writepages() API is changed! It now only needs to write the pages which the caller placed on mapping->io_pages. This conceptually cleans things up a bit, by more clearly defining the role of ->io_pages, and the motion between the various mapping lists.

The treatment of sb->s_dirty and sb->s_io is now conceptually identical to mapping->dirty_pages and mapping->io_pages: move the items-to-be-written onto ->s_io/io_pages, and walk that list. As inodes (or pages) are written, move them over to the clean/locked/dirty lists.

Oh, scenario 3: start an app which continuously overwrites a 5 meg file. Wait five seconds, start another, wait 5 seconds, start another. What we _should_ see is three 5-meg writes, five seconds apart, every thirty seconds. That did all sorts of odd things. It now does the right thing.
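The batching idea for ->io_pages can be sketched as a toy model outside the kernel. Everything named here (toy_mapping, start_round, write_round) is hypothetical, and counters stand in for the real page lists; only the motion between the lists follows the description above.

```c
#include <stddef.h>

/* Toy model only: counters stand in for the kernel's page lists. */
struct toy_mapping {
    size_t dirty_pages; /* pages dirtied since the current round began */
    size_t io_pages;    /* the batch selected for this kupdate round */
    size_t clean_pages; /* pages already written out */
};

/* Start a round: everything dirty right now becomes the batch. */
void start_round(struct toy_mapping *m)
{
    m->io_pages += m->dirty_pages;
    m->dirty_pages = 0;
}

/*
 * Write up to `budget` pages from the batch and return how many were
 * written. `new_dirty_per_write` models an application that keeps
 * dirtying fresh pages while writeback is running.
 */
size_t write_round(struct toy_mapping *m, size_t budget,
                   size_t new_dirty_per_write)
{
    size_t written = 0;

    while (m->io_pages > 0 && written < budget) {
        m->io_pages--;
        m->clean_pages++;
        /* New dirt lands on the dirty list, never on the batch. */
        m->dirty_pages += new_dirty_per_write;
        written++;
    }
    return written;
}
```

Because newly dirtied pages never join the current batch, the round ends once io_pages drains, even when the application dirties pages as fast as they are written.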
-
Andrew Morton authored
From Rohit

1) hugetlbfs_zero_setup returns ENOMEM in case the request size cannot be easily handled.
2) Preference is given to LOW_MEM while freeing the pages from hugetlbpage free list.
-
Andrew Morton authored
- /proc/vmstat:pageoutrun and /proc/vmstat:allocstall are always identical. Rework this so that
  - "allocstall" is the number of times a page allocator ran direct reclaim
  - "pageoutrun" is the number of times kswapd ran page reclaim
- Add a new stat: "pgrotated". The number of pages which were rotated to the tail of the LRU for immediate reclaim by rotate_reclaimable_page().
- Document things a bit.
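/proc/vmstat exposes each counter as a "name value" line, so a reader can pull out one of these stats with a few lines of C. This is a minimal sketch: the helper name vmstat_lookup is hypothetical, and it works on text already read into memory rather than opening the file itself.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find `name` in text made of "name value\n" lines.
 * Returns 1 and stores the value in *out on success, 0 if not found.
 */
int vmstat_lookup(const char *text, const char *name,
                  unsigned long long *out)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Require the whole name followed by a space, not just a prefix. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%llu", out) == 1;
        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next line */
    }
    return 0;
}
```

Checking for the trailing space keeps a short name from matching a longer counter that merely starts with the same letters.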
-
Andrew Morton authored
Check for usercopy faults in filldir().
-
Andrew Morton authored
ext3_sync_fs will start a commit and will wait on that commit. This means that on its return, all journalled file data has been dirtied and exposed to sync_inodes_sb(). Which is sufficient to fix the umount data loss problem.
-
Andrew Morton authored
This is infrastructure for fixing the journalled-data ext3 unmount data loss problem. It was sent for comment to linux-fsdevel a week ago; there was none.

Add a `sync_fs' superblock operation whose mandate is to perform filesystem-specific operations to ensure a successful sync. It is called in two places:

1: fsync_super() - for umount.
2: sys_sync() - for global sync.

In the sys_sync() case we call all the ->write_super() methods first. write_super() is an async flushing operation. It should not block. After that, we call all the ->sync_fs functions. This is independent of the state of s_dirt! That was all confused up before, and in this patch ->write_super() and ->sync_fs() are quite separate.

With ext3 as an example, the initial ->write_super() will start a transaction, but will not wait on it. (But only if s_dirt was set!) The first ->sync_fs() call will get the IO underway. The second ->sync_fs() call will wait on the IO.

And we really do need to be this elaborate, because all the testing of s_dirt in there makes ->write_super() an unreliable way of detecting when the VFS is trying to sync the filesystem.
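The call ordering described for sys_sync() can be illustrated with a toy model. All names here (toy_sb, toy_write_super, toy_sync_fs, toy_sys_sync) are hypothetical, and the `wait` argument that tells the two ->sync_fs() passes apart is an assumption made for this sketch, not something the message above states.

```c
#include <stddef.h>
#include <string.h>

/* Toy superblock: fields are illustrative, not the kernel's layout. */
struct toy_sb {
    int s_dirt;   /* set when the superblock itself is dirty */
    char log[32]; /* records the order of calls for inspection */
};

/* Async flush: starts work, never waits. */
void toy_write_super(struct toy_sb *sb)
{
    strcat(sb->log, "W");
}

/* `wait` (assumed): 0 starts the IO, 1 waits for it to finish. */
void toy_sync_fs(struct toy_sb *sb, int wait)
{
    strcat(sb->log, wait ? "S1" : "S0");
}

void toy_sys_sync(struct toy_sb *sbs, size_t n)
{
    size_t i;

    /* Pass 1: ->write_super(), only where s_dirt is set. */
    for (i = 0; i < n; i++)
        if (sbs[i].s_dirt)
            toy_write_super(&sbs[i]);

    /* Pass 2: ->sync_fs() on every superblock, regardless of s_dirt. */
    for (i = 0; i < n; i++)
        toy_sync_fs(&sbs[i], 0);

    /* Pass 3: ->sync_fs() again, this time waiting on the IO. */
    for (i = 0; i < n; i++)
        toy_sync_fs(&sbs[i], 1);
}
```

A superblock with s_dirt clear skips the first pass but still sees both ->sync_fs() calls, which is the separation the message argues for.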
-
Andrew Morton authored
Fix a radix-tree bug spotted by Vladimir Saveliev <vs@namesys.com>.

Each step in the radix tree spans six address bits. So a height=6 tree spans 36 bits' worth of nodes. On 32-bit machines radix_tree_gang_lookup() doesn't handle this right - at the 16TB mark it wraps back to zero, and returns pages at quite wrong indices.

The patch fixes all that up, and tidies a couple of things.

A user-space test harness was developed so that the code can be sanely tested. It is at http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz
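The arithmetic behind this bug: six levels of six bits each is 36 bits, which is wider than a 32-bit index, so any calculation that shifts by the full span overflows. A minimal sketch of a clamped calculation follows; the names MAP_SHIFT and toy_maxindex are hypothetical and this is not the code from the patch.

```c
#include <stdint.h>

#define MAP_SHIFT 6 /* six index bits per tree level, as described above */

/*
 * Largest index a tree of `height` levels can address when the index
 * type is 32 bits wide. Clamps instead of shifting past the type width,
 * which would be undefined behaviour in C.
 */
uint32_t toy_maxindex(unsigned int height)
{
    unsigned int bits = height * MAP_SHIFT;

    if (bits >= 32)
        return UINT32_MAX;
    return ((uint32_t)1 << bits) - 1;
}
```

With this clamp a height-5 tree covers 30 bits and a height-6 tree covers the whole 32-bit index range rather than wrapping.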
-
Andrew Morton authored
Patch from Martin Bligh and Dave Hansen

If a PAE machine has 1G of memory and you set PAGE_OFFSET to 2G, the kernel will only instantiate a PMD to cover the 2G-3G region. But another PMD is needed for the 3G-4G region for the APIC and possibly an extended vmalloc region.

So the patch changes the code to instantiate PMDs out to the end of physical memory. It's a no-op for PAGE_OFFSET=3G, and _could_ be part of the CONFIG_PAGE_OFFSET patch. But it seems a reasonable generalisation anyway.
-
Andrew Morton authored
Ancient patch from Bill Irwin

The patch is intended to show improved information about where the memory went during OOM-killing events.

- when the OOM killer fails and the system panics, call show_free_areas()
- reorganize show_free_areas() to use for_each_zone()
- add per-cpu stats to show_free_areas()
- tag output from show_free_areas() with node and zone information
-
Andrew Morton authored
fail_writepage() does not work. Its activate_page() call cannot activate the page because it is not on the LRU.

So perform that function (more efficiently) in the VM. Remove fail_writepage() and, if the filesystem does not implement ->writepage() then activate the page from shrink_list().

A special case is tmpfs, which does have a writepage, but which sometimes wants to activate the pages anyway. The most important case is when there is no swap online and we don't want to keep all those pages on the inactive list. So just as a tmpfs special-case, allow writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.

Also, the whole idea of allowing ->writepage() to return -EAGAIN, and handling that in the caller has been reverted. If a writepage() implementation wants to back out and not write the page, it must redirty the page, unlock it and return zero. (This is Hugh's preferred way).

And remove the now-unneeded shmem_writepages() - shmem inodes are marked as `memory backed' so it will not be called.

And remove the test for non-null ->writepage() in generic_file_mmap(). Memory-backed files _are_ mmappable, and they do not have a writepage(). It just isn't called.

So the locking rules for writepage() are unchanged. They are:

- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written.

But there is a new, special, hidden, undocumented, secret hack for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move the page to the active list. The page must be kept locked in this one case.
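The writepage() contract above can be modelled in a few lines of user-space C. This is only a sketch: the toy_ names are hypothetical, TOY_WRITEPAGE_ACTIVATE uses a placeholder value rather than the kernel's constant, and having the caller unlock the page after activating it is an assumption about what "handle that in the VM" involves.

```c
#include <stddef.h>

#define TOY_WRITEPAGE_ACTIVATE 1 /* placeholder, not the kernel's value */

struct toy_page {
    int locked;
    int dirty;
    int active; /* 1 once the page sits on the active list */
};

typedef int (*toy_writepage_t)(struct toy_page *);

/* Backing out of a write: redirty the page, unlock it, return zero. */
int toy_backout_writepage(struct toy_page *p)
{
    p->dirty = 1;
    p->locked = 0;
    return 0;
}

/* tmpfs-style special case: ask for activation, keep the page locked. */
int toy_tmpfs_writepage(struct toy_page *p)
{
    (void)p;
    return TOY_WRITEPAGE_ACTIVATE;
}

/* Caller side, loosely standing in for the VM's reclaim path. */
void toy_shrink_one(struct toy_page *p, toy_writepage_t writepage)
{
    if (writepage == NULL) {
        p->active = 1; /* no ->writepage(): activate the page */
        return;
    }
    p->locked = 1; /* writepage() is called with the page locked */
    if (writepage(p) == TOY_WRITEPAGE_ACTIVATE) {
        p->active = 1;
        p->locked = 0; /* assumed: the caller unlocks in this one case */
    }
}
```

The point of the model is that only the caller touches the active list; a writepage() implementation either finishes with the page unlocked or hands the still-locked page back with the special return value.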
-
Andrew Morton authored
There's no point in walking through a lot of tmpfs or ramdisk pages when we're trying to clean memory. So if a memory-backed inode is discovered during writeback, skip the entire superblock.
-
Andrew Morton authored
Patch from Mark Fasheh <mark.fasheh@oracle.com> (plus a few cleanups and a speedup from yours truly)

Adds the semtimedop() function - semop with a timeout. Solaris has this. It's apparently worth a couple of percent to Oracle throughput and given the simplicity, that is sufficient benefit for inclusion IMO.

This patch hooks up semtimedop() only for ia64 and ia32.
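From user space, semtimedop() looks like semop() with one extra timespec argument. The sketch below uses the interface as glibc exposes it on present-day Linux (it needs _GNU_SOURCE), which may differ in detail from what this patch wired up; the wrapper name wait_with_timeout is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc declares semtimedop() only with this set */
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

/* Callers must define union semun themselves (see semctl(2)). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Try to decrement a zero-valued semaphore, giving up after timeout_ms.
 * Returns 0 if the operation succeeded, the errno from semtimedop() if
 * it failed, or -1 if the semaphore could not be set up.
 */
int wait_with_timeout(long timeout_ms)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec ts = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };
    union semun arg = { .val = 0 };
    int id, rc, err;

    id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }

    /* Nobody will post the semaphore, so this can only time out. */
    rc = semtimedop(id, &op, 1, &ts);
    err = (rc == 0) ? 0 : errno;

    semctl(id, 0, IPC_RMID); /* always remove the private set */
    return err;
}
```

When the time limit expires, semtimedop() fails with EAGAIN, so a caller can tell a timeout apart from an interrupted wait (EINTR) without keeping its own timer.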
-
Andrew Morton authored
The pte_chain_unlock() needs to be outside the ifdef.
-
Andrew Morton authored
The read_zero() implementation for !CONFIG_MMU was very inefficient. This sped-up version has been tested and acked by Greg Ungerer.
-
Andrew Morton authored
Back out the sys_syslog()-based printk-from-userspace and replace it with Ben's /proc/kmsg version. Requires a `mknod /dev/kmsg c 1 11'.
-
Andrew Morton authored
Patch from Robert Love <rml@tech9.net>

We can never get rid of it if we do not deprecate it - so do so and print a stern warning to those who still run bdflush daemons.
-
Andrew Morton authored
The PF_MEMALLOC handling got broken somewhere, and it is now possible for a PF_MEMALLOC process to reenter page reclaim. Change it to fail the allocation if we're PF_MEMALLOC and there are zero pages free.
-
François Romieu authored
This removes calls to a function which disappeared during the last Iphase driver update. Since that update, the Iphase driver has been using plain modern PCI-style init. The problem wasn't noticed until Adrian Bunk tried to build a non-modular kernel (I only tested the modularized driver). Everybody else seemed happy :o)
-
James Simmons authored
scrup is using memcpy even when the memory areas src, dest overlap. The key is to use memmove which handles overlapping memory gracefully.
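Scrolling a text buffer up is a typical case where source and destination overlap. The sketch below is a user-space illustration with a hypothetical function name (toy_scroll_up), not the console code itself.

```c
#include <string.h>

/*
 * Scroll a rows x cols character buffer up by n rows and blank the
 * rows that open up at the bottom. The source and destination regions
 * overlap whenever n < rows - n, so memcpy() would be undefined here;
 * memmove() is specified to handle the overlap.
 */
void toy_scroll_up(char *buf, size_t rows, size_t cols, size_t n)
{
    if (n > rows)
        n = rows;
    memmove(buf, buf + n * cols, (rows - n) * cols);
    memset(buf + (rows - n) * cols, ' ', n * cols);
}
```

For a 3x4 buffer holding "AAAABBBBCCCC", scrolling up by one row moves eight bytes from offset 4 to offset 0, and those two ranges share bytes 4 to 7.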
-
bk://thebsh.namesys.com/bk/reiser3-linux-2.5-fixes
Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
Anton Blanchard authored
2.5 currently tries to register disk sda twice. Not nice, and now that we use sysfs to do name to dev_t mapping, I couldn't mount my root filesystem.
-
- 13 Dec, 2002 13 commits
-
-
Matthew Wilcox authored
This driver does not need to use atomic operations on local variables.
-
Robert Love authored
Trivial but annoying: two printk() calls in drivers/scsi/hosts.c are missing '\n'.

Also, for some reason I have not yet investigated, shost_tp->name is NULL here. This should not be, eh? If it can be, we should do something to tidy up the printing of it.
-
Robert Love authored
This error message is uber annoying and needs to go. Non-root can flood the console with this junk on invalid SCSI CD-ROM ioctl(), and that is exactly what gnome-cd does. An illegal ioctl() returns an error to the program. That is sufficient - we do not need KERN_ERROR warnings all over the place. Especially when any user can cause them at any rate.
-
Rusty Russell authored
In some configurations, parport and bttv request a module inside their module_init function. Drop the lock around mod->init(), change module->live to module->state so we can detect modules which are in init.
-
Rusty Russell authored
module-init-tools 0.9 and newer supply a replacement depmod, so it's safe to run again. Also, some external programs like PCMCIA and mkinitrd really want the directory hierarchy in /lib/modules back again: it makes no difference to the tools (since 0.9), so revert it.
-
bk://linuxusb.bkbits.net/linus-2.5
Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
Linus Torvalds authored
-
Peter Braam authored
Relatively straightforward fixes for intermezzo problems in 2.5.50. I think all of them are related to:

- two missing headers
- use of timespec instead of time_t.
-
bk://are.twiddle.net/axp-2.5
Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
Linus Torvalds authored
-
http://jfs.bkbits.net/linux-2.5
Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
Andy Grover authored
-
Petko Manolov authored
I made the changes to the set/get_registers code.
-
- 12 Dec, 2002 5 commits
-
-
Richard Henderson authored
Cset exclude: rth@dorothy.sfbay.redhat.com|ChangeSet|20021207231352|30637
-
Andy Grover authored
- remove NATIVE_CHAR typedef
- remove ACPI_{GET,VALID}_ADDRESS macros
- fix memory corruption in deletion of a static AML buffer
- fix fault caused by 0-length AML
- fix user-buffer overwrite/corruption if buffer is too small
- fix buffer-to-string conversion
-
Greg Kroah-Hartman authored
This allowed a lock to be removed. Also removed the MOD_* functions, and some remove logic was cleaned up by Oliver Neukum.
-
Greg Kroah-Hartman authored
-
Andy Grover authored
into groveronline.com:/root/bk/linux-acpi
-