- 28 Jul, 2002 25 commits
-
-
Andrew Morton authored
This patch is a performance and correctness update to the direct-IO code: O_DIRECT and the raw driver. It mainly affects IO against blockdevs.

The direct_io code was returning -EINVAL for a filesystem hole. Change it to clear the userspace page instead.

There were a few restrictions and weirdnesses wrt blocksize and alignments. The code has been reworked so we now lay out maximum-sized BIOs at any sector alignment. Because of this, the raw driver has been altered to set the blockdev's soft blocksize to the minimum possible at open() time. Typically, 512 bytes. There are now no performance disadvantages to using small blocksizes, and this gives the finest possible alignment.

There is no API here for setting or querying the soft blocksize of the raw driver (there never was, really), which could conceivably be a problem. If it is, we can permit BLKBSZSET and BLKBSZGET against the fd which /dev/raw/rawN returned, but that would require that blk_ioctl() be exported to modules again.

This code is wickedly quick. Here's an oprofile of a single 500MHz PIII reading from four (old) scsi disks (two aic7xxx controllers) via the raw driver. Aggregate throughput is 72 megabytes/second:

c013363c 24    0.0896492  __set_page_dirty_buffers
c021b8cc 24    0.0896492  ahc_linux_isr
c012b5dc 25    0.0933846  kmem_cache_free
c014d894 26    0.09712    dio_bio_complete
c01cc78c 26    0.09712    number
c0123bd4 40    0.149415   follow_page
c01eed8c 46    0.171828   end_that_request_first
c01ed410 49    0.183034   blk_recount_segments
c01ed574 65    0.2428     blk_rq_map_sg
c014db38 85    0.317508   do_direct_IO
c021b090 90    0.336185   ahc_linux_run_device_queue
c010bb78 236   0.881551   timer_interrupt
c01052d8 25354 94.707     poll_idle

A testament to the efficiency of the 2.5 block layer.

And against four IDE disks on an HPT374 controller. Throughput is 120 megabytes/sec:

c01eed8c 80    0.292462   end_that_request_first
c01fe850 87    0.318052   hpt3xx_intrproc
c01ed574 123   0.44966    blk_rq_map_sg
c01f8f10 141   0.515464   ata_select
c014db38 153   0.559333   do_direct_IO
c010bb78 235   0.859107   timer_interrupt
c01f9144 281   1.02727    ata_irq_enable
c01ff990 290   1.06017    udma_pci_init
c01fe878 308   1.12598    hpt3xx_maskproc
c02006f8 379   1.38554    idedisk_do_request
c02356a0 609   2.22637    pci_conf1_read
c01ff8dc 611   2.23368    udma_pci_start
c01ff950 922   3.37062    udma_pci_irq_status
c01f8fac 1002  3.66308    ata_status
c01ff26c 1059  3.87146    ata_start_dma
c01feb70 1141  4.17124    hpt374_udma_stop
c01f9228 3072  11.2305    ata_out_regfile
c01052d8 15193 55.5422    poll_idle

Not so good.

One problem which has been identified with O_DIRECT is the cost of repeated calls into the mapping's get_block() callback. Not a big problem with ext2, but other filesystems have more complex get_block implementations.

So what I have done is to require that callers of generic_direct_IO() implement the new `get_blocks()' interface. This is a small extension to get_block(). It gets passed another argument which indicates the maximum number of blocks which should be mapped, and it returns the number of blocks which it did map in bh_result->b_size. This allows the fs to map up to 4G of disk (or of hole) in a single get_block() invocation.

There are some other caveats and requirements of get_blocks() which are documented in the comment block over fs/direct_io.c:get_more_blocks().

Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block mapping. It certainly allows good speedups. But it doesn't allow the fs to return a scatter list of blocks - it only understands linear chunks of disk. I think that's really all it _should_ do.

I'll let get_blocks() sit for a while and wait for some feedback. If it is sufficient and nobody objects too much, I shall convert all get_block() instances in the kernel to be get_blocks() instances. And I'll teach readahead (at least) to use the get_blocks() extension.

Delayed allocate writeback could use get_blocks(). As could block_prepare_write() for blocksize < PAGE_CACHE_SIZE. There's no mileage using it in mpage_writepages() because all our filesystems are syncalloc, and nobody uses MAP_SHARED for much.

It will be tricky to use get_blocks() for writes, because if a ton of blocks have been mapped into the file and then something goes wrong, the kernel needs to either remove those blocks from the file or zero them out. The direct_io code zeroes them out.

btw, some time ago you mentioned that some drivers and/or hardware may get upset if there are multiple simultaneous IOs in progress against the same block. Well, the raw driver has always allowed that to happen. O_DIRECT writes to blockdevs do as well now.

todo:

1) The driver will probably explode if someone runs BLKBSZSET while IO is in progress. Need to use bdclaim() somewhere.

2) readv() and writev() need to become direct_io-aware. At present we're doing stop-and-wait for each segment when performing readv/writev against the raw driver and O_DIRECT blockdevs.
-
Andrew Morton authored
Convert ext3 to the C99 initialiser format. From Rusty.
-
Andrew Morton authored
Alan's overcommit patch, brought to 2.5 by Robert Love. Can't say I've tested its functionality at all, but it doesn't crash, it has been in -ac and RH kernels for some time and I haven't observed any of its functions on profiles. "So what is strict VM overcommit? We introduce new overcommit policies that attempt to never succeed an allocation that can not be fulfilled by the backing store and consequently never OOM. This is achieved through strict accounting of the committed address space and a policy to allow/refuse allocations based on that accounting. In the strictest of modes, it should be impossible to allocate more memory than available and impossible to OOM. All memory failures should be pushed down to the allocation routines -- malloc, mmap, etc. The new modes are available via sysctl (same as before). See Documentation/vm/overcommit-accounting for more information."
-
Andrew Morton authored
Patch from Robert Love. Attached patch implements for_each_zone(zone_t *) which is a helper macro to cleanup code of the form:

	for (pgdat = pgdat_list; pgdat; pgdat = pgdat->node_next) {
		for (i = 0; i < MAX_NR_ZONES; ++i) {
			zone_t *z = pgdat->node_zones + i;
			/* ... */
		}
	}

and replace it with:

	for_each_zone(zone) {
		/* ... */
	}

This patch only replaces one use of the above loop with the new macro. Pending code, however, currently in the full rmap patch uses for_each_zone more extensively.
-
Andrew Morton authored
Patch from Robert Love. This patch implements for_each_pgdat(pg_data_t *) which is a helper macro to cleanup code that does a loop of the form:

	pgdat = pgdat_list;
	while (pgdat) {
		/* ... */
		pgdat = pgdat->node_next;
	}

and replace it with:

	for_each_pgdat(pgdat) {
		/* ... */
	}

This code is from Rik's 2.4-rmap patch and is by William Irwin.
-
Andrew Morton authored
Reorganise the members of struct page.

- Place ->flags at the start so the compiler can generate indirect addressing rather than indirect+indexed for this commonly-accessed field. Shrinks the kernel by ~100 bytes.
- Keep ->count with ->flags so they have the best chance of being in the same cacheline.
-
Andrew Morton authored
ifdef out some operations in pte_chain_lock() which are not necessary on uniprocessor.
-
Andrew Morton authored
Cleanup to show_free_areas() from Bill Irwin: show_free_areas() and show_free_areas_core() is a mess.

(1) it uses a bizarre and ugly form of list iteration to walk buddy lists - use standard list functions instead
(2) it prints the same information repeatedly, once per node - rationalize the braindamaged iteration logic
(3) show_free_areas_node() is useless and not called anywhere - remove it entirely
(4) show_free_areas() itself just calls show_free_areas_core() - remove show_free_areas_core() and do the stuff directly
(5) SWAP_CACHE_INFO is always #defined - remove it
(6) INC_CACHE_INFO() doesn't use the do { } while (0) construct

This patch also includes Matthew Dobson's patch which removes mm/numa.c:node_lock. The consensus is that it doesn't do anything now that show_free_areas_node() isn't there.
-
Andrew Morton authored
Patch from Bill Irwin. It removes the custom pte_chain allocator in mm/rmap.c and replaces it with a slab cache.

"This patch
(1) eliminates the pte_chain_freelist_lock and all contention on it
(2) gives the VM the ability to recover unused pte_chain pages

Anton Blanchard has reported (1) from prior incarnations of this patch. Craig Kulesa has reported (2) in combination with slab-on-LRU patches.

I've left OOM detection out of this patch entirely as upcoming patches will do real OOM handling for pte_chains and all the code changed anyway."
-
Andrew Morton authored
There are a few VM-related patches in this series. Mainly fixes; feature work is on hold.

We have some fairly serious locking contention problems with the reverse mapping's pte_chains. Until we have a clear way out of that I believe that it is best to not merge code which has a lot of rmap dependency. It is apparent that these problems will not be solved by tweaking - some redesign is needed. In the 2.5 timeframe the only practical solution appears to be page table sharing, based on Daniel's February work. Daniel and Dave McCracken are working on that.

Some bits and pieces here:

- list_splice() has an open-coded list_empty() in it. Use list_empty() instead.
- in shrink_cache() we have a local `nr_pages' which shadows another local. Rename the inner one. (Nikita Danilov)
- Add a BUG() on a can't-happen code path in page_remove_rmap().
- Tighten up the bug checks in the BH completion handlers - if the buffer is still under IO then it must be locked, because we unlock it inside the page_uptodate_lock.
-
Linus Torvalds authored
in the page cache, it needs to use page_cache_release() instead of plain "put_page()".
-
Ingo Molnar authored
the attached patch fixes two things:

- a TLS related bug noticed by Arjan van de Ven: apm_init() should set up all CPUs' gdt entries - just in case some code happens to call into the APM BIOS on the wrong CPU. This should also handle the case when some APM code gets triggered (by suspend or power button or something).
- a compilation problem
-
Trond Myklebust authored
Add support for the glibc 'd_type' field in cases where we have the READDIRPLUS file attribute information available to us in nfs_do_filldir().
-
Trond Myklebust authored
Add support for positive lookups using the READDIRPLUS cached information. Both new lookups and lookup revalidation are supported. Use READDIRPLUS instead of READDIR on NFSv3 directories with lengths shorter than 8*PAGE_SIZE. Note that inode attribute information is only updated if it is seen to be more recent than any existing cached information.
-
Trond Myklebust authored
Cache the information about whether or not the server supports READDIRPLUS.
-
Trond Myklebust authored
Cleanup for readdirplus. Allow the file attribute struct to set the NFS_READTIME(inode) to some value other than 'jiffies'.
-
Trond Myklebust authored
Cleanup for the readdirplus code. Make struct nfs_entry take pointers to the filehandle and file attributes.
-
Trond Myklebust authored
A patch by Charles Lever (Charles.Lever@netapp.com) that ensures the PG_uptodate bit gets set if an entire page gets written by nfs_writepage_sync().
-
Muli Ben-Yehuda authored
This patch replaces the cli/sti calls in the trident.c driver with spin_lock_irqsave/spin_unlock_irqrestore.
-
Muli Ben-Yehuda authored
This patch (1/2) brings the sound/oss/trident.c driver up to date with the driver in the 2.4-ac tree. It fixes the following bugs:

* fix wrong cast in suspend/resume (Eric Lemar via Ian Soboroff)
* fix bug where we would free with free_pages() memory allocated via pci_alloc_consistent()
* add a missing unlock on an error path
* rewrite the code to read/write registers of audio codecs for Ali5451 (Lei Hu)

It also does various cleanups so that the code conforms to Documentation/CodingStyle and is nicer to work with.
-
Linus Torvalds authored
Merge bk://bkbits.ras.ucalgary.ca/rgooch-2.5 into home.transmeta.com:/home/torvalds/v2.5/linux
-
Richard Gooch authored
-
Oleg Nesterov authored
The gdt entry is consulted only while loading its index into the segment register. So load_TLS_desc(next, cpu) must be called before loading next->fs,next->gs in __switch_to().
-
Linus Torvalds authored
-
Christoph Hellwig authored
generic spinlock implementation for downgrade_write().
-
- 27 Jul, 2002 15 commits
-
-
Paul Mackerras authored
into samba.org:/home/paulus/kernel/for-linus-ppc
-
Paul Mackerras authored
into samba.org:/home/paulus/kernel/for-linus-ppc
-
Paul Mackerras authored
Since we have a signed 32-bit time_t, the fact that y % 4 == 0 will get it wrong in 2100 is irrelevant.
-
Linus Torvalds authored
Merge bk://bkbits.ras.ucalgary.ca/rgooch-2.5 into home.transmeta.com:/home/torvalds/v2.5/linux
-
Linus Torvalds authored
test, since it is needed regardless.
-
Alan Cox authored
We use the PCI host lock so that we lock config space at a portable layer while handling the CMD640 config space via our own routines, to avoid the PCI BIOS tripping CMD640 hardware bugs. Also add the 2.4.19 fixes for avoiding wrong probes, and the fix noted on the list.
-
Alan Cox authored
This should do the trick for pnpbios - we load the initial gdt into each CPU's gdt, and we load the parameters into the gdt of the CPU making the call, relying on the spinlock to avoid bouncing between CPUs due to preemption.
-
Matthew Wilcox authored
- Remove third argument from file_lock security op. Whether the lock is blocking or not cannot make any difference to a security module!
- Fix the call in sys_flock to pass the translated lock command, not the original.
- Add a call in fcntl_setlease. If they're going to know about two types of lock, let's tell them about the third too.
-
Andries E. Brouwer authored
The patch below does two things:

(i) fixes a small bug in the new partition code. This is the final chunk s/n/slot/. I'll refrain from giving a vi script. This is uncontroversial.

(ii) removes ancient garbage concerning disk managers. This may well be controversial.

(Long ago, when disks became larger than 500 MB, lots of tricks were invented to keep DOS happy. Both hardware tricks and software tricks. One of the software tricks was the invention of boot managers. There have been many of those. The Linux kernel has had support for two of them: OnTrack Disk Manager and EZdrive. More precisely: there have been many versions of both OnTrack Disk Manager and EZdrive, and the kernel had support for a few of these versions. I think the time has come to remove the automatic support - every now and then it bites some innocent user, and the support is not really needed any longer, and the support is for outdated versions of these boot managers. No doubt it will turn out that users still exist that use some form of this stuff, but I would prefer to support them by explicit kernel boot parameters, rather than by code that guesses what might be the right thing to do. The patch below just rips out the old stuff. Depending on the screams this might provoke I expect to add some boot parameters.)
-
Ingo Molnar authored
This fixes a synchronize_irq() bug: if the interrupt is freed while an IRQ handler is running (irq state is IRQ_INPROGRESS) then synchronize_irq() will return early, which is incorrect.

There was another do_IRQ() bug that in fact necessitated the bad code that caused the synchronize_irq() bug - we kept the IRQ_INPROGRESS bit set for inactive interrupt sources, after they happen for the first time. Now the only effect this has is on i8259A irq handling - we used to keep these irqs disabled after the first 'spurious' interrupt happened.

Now what the i8259A code really wants to do IMO is to keep the interrupt disabled if there is no handler defined for that interrupt source. The patch adds exactly this. I don't remember why this was needed in the first place (irq probing? avoidance of interrupt storms?), but with the patch the behavior should be equivalent.
-
William Lee Irwin III authored
Fix PMD typo
-
Anders Gustafsson authored
Added the irq argument to synchronize_irq() to make sound/oss/cs46xx.c compile again.
-
Peter Osterlund authored
-
Ingo Molnar authored
the attached patch fixes the scheduler's migration thread startup bug that got unearthed by Rusty's recent CPU-startup enhancements.

the fix is to let a startup-helper thread migrate the migration thread, instead of the migration thread calling set_cpus_allowed() itself. Migrating a non-running thread is a simple and robust thing, and needs no cooperation from migration threads - thus the catch-22 problem of how to migrate the migration threads is finally solved.

the patch is against Rusty's initcall fix/hack which calls migration_init() before other CPUs are brought up - this ordering is clearly the clean way of doing migration init. [the patch also fixes a UP compilation bug in Rusty's hack.]
-
Rusty Russell authored
As pointed out by Andrew Morton, this fixes:

softirq.c: In function `spawn_ksoftirqd':
softirq.c:416: warning: statement with no effect
-