- 06 Jul, 2003 15 commits
-
-
Greg Ungerer authored
This patch adds shared library support to the MMUless application loader, binfmt_flat. This is not new: it is a forward port of the same support in 2.4.x kernels with MMUless support, and it has been running for well over a year now. The code is conditionally compiled on CONFIG_BINFMT_FLAT_SHARED. This change also abstracts a bit more architecture-dependent code into the separate flat.h includes. Basically, relocations within an application also carry a tag to identify what they refer to (this code or which shared library). This is patched as before at load/run-time with an appropriate address.
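The tagged-relocation idea described above can be sketched as follows. This is a minimal illustration, not the binfmt_flat code: the split into an 8-bit id in the top byte and a 24-bit offset, and the names `MAX_LIBS`, `reloc_lib_id`, `reloc_offset` and `resolve_reloc`, are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LIBS 4u	/* hypothetical: slot 0 stands for the application itself */

/* Assumed layout: top 8 bits name the object referred to, low 24 bits are an offset. */
static unsigned reloc_lib_id(uint32_t r)
{
	return r >> 24;
}

static uint32_t reloc_offset(uint32_t r)
{
	return r & 0x00ffffffu;
}

/* Turn a tagged relocation into an absolute address using per-object load bases. */
static uint32_t resolve_reloc(uint32_t r, const uint32_t load_base[MAX_LIBS])
{
	unsigned id = reloc_lib_id(r);

	assert(id < MAX_LIBS);	/* a real loader would fail the exec instead */
	return load_base[id] + reloc_offset(r);
}
```

With these assumptions, a relocation value of `0x02000010` refers to offset `0x10` inside the object loaded in slot 2.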
-
Greg Ungerer authored
Unify access_ok for all m68knommu targets. All targets use the common linker script and have common end symbols. So now we can just use a simple check.
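A "simple check" of this kind might look like the sketch below. The function name `range_ok` and the explicit `start`/`end` parameters are hypothetical; in the kernel the bounds would come from the common linker-script symbols mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [addr, addr + size) lies inside [start, end).
 * Written so that addr + size is never computed and cannot overflow.
 */
static bool range_ok(uintptr_t addr, uintptr_t size,
		     uintptr_t start, uintptr_t end)
{
	assert(start <= end);
	return addr >= start && addr <= end && size <= end - addr;
}
```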
-
Greg Ungerer authored
Remove "%d0" register from clobber list of down_trylock() for m68knommu. It is not used by the asm code here at all.
-
Greg Ungerer authored
Force PAGE_SIZE for the m68knommu architecture to be an unsigned long. This makes it consistent with all other architectures and cleans up a load of compiler warnings.
-
Greg Ungerer authored
Conditionally copy the ROMfs filesystem on the Motorola M5307C3 target board only if using a ROMfs.
-
Greg Ungerer authored
Allow setting boot time parameters at configuration for Motorola 5282 targets.
-
Linus Torvalds authored
This improves cold-cache program startup noticeably for me, and simplifies the read-ahead logic at the same time. The rules for read-ahead are:
  - if the vma is marked random, we just do the regular one-page case. Obvious.
  - if the vma is marked "linear access", we use the regular readahead code. No change in behaviour there (well, we also only consider it a _miss_ if it was marked linear access - the "readahead" and "readaround" things are now totally independent of each other)
  - otherwise, we look at how many hits/misses we've had for this particular file open for mmap, and if we've had noticeably more misses than hits, we don't bother with read-around.
In particular, this means that the "real" read-ahead logic literally only needs to worry about finding sequential accesses, and does not have to worry about the common executable mmap access patterns that have very different behaviour. Some constant tweaking may be a good idea.
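The three rules can be read as a small decision function. The sketch below is only an illustration: the enum names, `choose_action` and the `MISS_MARGIN` threshold are hypothetical, and the real constants are exactly the part the message says may need tweaking.

```c
#include <assert.h>

enum vma_hint { VMA_NORMAL, VMA_RANDOM, VMA_LINEAR };		/* hypothetical names */
enum fault_action { READ_ONE_PAGE, READ_AHEAD, READ_AROUND };

#define MISS_MARGIN 100L	/* hypothetical stand-in for "noticeably more misses" */

/* Pick a page-fault read strategy from the vma hint and the per-open-file counters. */
static enum fault_action choose_action(enum vma_hint hint, long hits, long misses)
{
	assert(hits >= 0 && misses >= 0);

	if (hint == VMA_RANDOM)
		return READ_ONE_PAGE;		/* rule 1: plain one-page fault */
	if (hint == VMA_LINEAR)
		return READ_AHEAD;		/* rule 2: regular readahead code */
	if (misses > hits + MISS_MARGIN)
		return READ_ONE_PAGE;		/* rule 3: read-around is not paying off */
	return READ_AROUND;
}
```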
-
Ingo Molnar authored
In add_timer_internal() we simply leave the timer pending forever if the expiry is more than 0xffffffff jiffies away. This means more than 48 days on e.g. ia64, which is not an unrealistic timeout. IIRC crond is happy to use extremely large timeouts. It's better to time out early (if you can call 48 days "early") than to not time out at all.
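The "time out early" behaviour amounts to clamping the distance to the expiry. The sketch below assumes 64-bit jiffies counters and uses hypothetical names (`MAX_TIMER_DELTA`, `clamp_expiry`); it is not the add_timer_internal() code. As a sanity check on the figure quoted above, and assuming a tick rate of 1024 Hz, 2^32 / 1024 = 4194304 seconds, which is roughly 48.5 days.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TIMER_DELTA 0xffffffffULL	/* largest distance the sketch will queue */

/* Return the expiry that actually gets queued; both arguments are in jiffies. */
static uint64_t clamp_expiry(uint64_t now, uint64_t expires)
{
	assert(expires >= now);	/* the sketch ignores already-expired timers */

	if (expires - now > MAX_TIMER_DELTA)
		return now + MAX_TIMER_DELTA;	/* fire early rather than never */
	return expires;
}
```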
-
Bernardo Innocenti authored
This offers a generic do_div64() that actually does the right thing, unlike some architectures that "optimized" the 64-by-32 divide into just a 32-bit divide. Both ppc and sh were already providing an assembly optimized __div64_32(). I called my function the same, so that their optimized versions will automatically override mine in lib.a. I've only tested extensively on m68knommu (uClinux) and made sure generated code is reasonably short. Should be ok also on parisc, since it's the same algorithm they were using before.
  - add generic C implementations of do_div() for 32bit and 64bit archs in asm-generic/div64.h;
  - add generic library support function __div64_32() to handle the full 64/32 case on 32bit archs;
  - kill multiple copies of generic do_div() in architecture specific subdirs. Most copies were either buggy or not doing what they were supposed to do;
  - ensure all surviving instances of do_div() have their parameters correctly parenthesized to avoid funny side-effects.
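For readers unfamiliar with the do_div() contract (the dividend is replaced by the quotient and the remainder is returned), a minimal shift-and-subtract sketch of the full 64/32 case follows. It is not the function from this patch: the name `div64_32_sketch` is hypothetical and the bit-at-a-time loop was chosen for clarity, not speed. It deliberately avoids the 64-bit `/` and `%` operators, which is the point of such a helper on 32-bit targets.

```c
#include <assert.h>
#include <stdint.h>

/* Divide *n by base in place and return the remainder. */
static uint32_t div64_32_sketch(uint64_t *n, uint32_t base)
{
	uint64_t rem = 0, quot = 0;
	int i;

	assert(base != 0);

	for (i = 63; i >= 0; i--) {
		/* Bring down the next dividend bit; rem stays below 2^33. */
		rem = (rem << 1) | ((*n >> i) & 1);
		if (rem >= base) {
			rem -= base;
			quot |= (uint64_t)1 << i;
		}
	}
	*n = quot;
	return (uint32_t)rem;
}
```

A caller that needs both results passes the dividend by pointer, reads the quotient back from it, and uses the return value as the remainder.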
-
Paul Fulghum authored
Fix arbitration between net open and tty open. Clean up missed bits of CUA device removal changes.
-
Paul Fulghum authored
Fix arbitration between net open and tty open. Clean up unused locals resulting from latest tty changes.
-
Paul Fulghum authored
Fix arbitration between net open and tty open. Clean up unused local resulting from latest tty changes.
-
Benjamin Herrenschmidt authored
From Mikael Petterson: Booting kernel 2.5.74 on a PowerMac with CONFIG_BLK_DEV_IDE_PMAC=y results in an oops during IDE init, and the box then reboots. The patch below updates drivers/ide/ppc/pmac.c to also set up the hwif->ide_dma_queued_off and hwif->ide_dma_queued_on function pointers, which fixes the oops. Tested on my ancient PM4400.
-
Pavel Machek authored
I no longer have the time/interest in nbd, and Paul agreed to take it over.
-
Anton Blanchard authored
The compat ioctls for device mapper were not being enabled due to an incorrect config option.
-
- 05 Jul, 2003 25 commits
-
-
Andrew Morton authored
This tweaks the mmap read-ahead behaviour so that the prefaulting is largely pointless. - double the minimum readaround chunksize in page_cache_readaround(). - when a seek is detected, collapse the window more slowly.
-
Krzysztof Halasa authored
-
Andrew Morton authored
i2o_scsi.c now needs pci.h.
-
Andrew Morton authored
From: ilmari@ilmari.org (Dagfinn Ilmari Mannsaker) It turns out that net/bluetooth/rfcomm/sock.c (and net/bluetooth/hci_sock.c) had been left out when net_proto_family gained an owner field, here's a patch that fixes them both.
-
Andrew Morton authored
From: junkio@cox.net Sigh. Is there a gcc option to tell it to not accept this incompatible C99 extension?
-
Andrew Morton authored
From: Arvind Kandhare <arvind.kan@wipro.com> When switch_uid is called, the reference count of the new user is incremented twice. I think the increment in the switch_uid is done because of the reparent_to_init() function which does not increase the __count for root user. But if switch_uid is called from any other function, the reference count is already incremented by the caller by calling alloc_uid for the new user. Hence the count is incremented twice. The user struct will not be deleted even when there are no processes holding a reference count for it. This does not cause any problem currently because nothing is dependent on timely deletion of the user struct.
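The double increment can be reproduced with a toy reference counter. Everything in the sketch below is hypothetical (`user_sketch`, `alloc_uid_sketch`, `switch_uid_old`, `switch_uid_fixed`), and the "fixed" variant, in which switch_uid simply consumes the reference its caller already took, is only one plausible resolution assumed for illustration; the message above does not spell out the actual change.

```c
#include <assert.h>

struct user_sketch {
	int count;	/* number of references held on this user */
};

/* Like alloc_uid() in the description: hands the caller one reference. */
static struct user_sketch *alloc_uid_sketch(struct user_sketch *u)
{
	u->count++;
	return u;
}

static void free_uid_sketch(struct user_sketch *u)
{
	assert(u->count > 0);
	u->count--;
}

/* Behaviour described above: takes a second reference on top of the caller's. */
static void switch_uid_old(struct user_sketch **cur, struct user_sketch *new_user)
{
	struct user_sketch *old = *cur;

	new_user->count++;	/* the extra increment */
	*cur = new_user;
	free_uid_sketch(old);
}

/* Assumed fix: consume the caller's reference instead of taking another. */
static void switch_uid_fixed(struct user_sketch **cur, struct user_sketch *new_user)
{
	struct user_sketch *old = *cur;

	*cur = new_user;
	free_uid_sketch(old);
}
```

With the old variant a single holder leaves the new user's count at 2, so the count can never reach zero when that holder goes away.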
-
Andrew Morton authored
From: Davide Libenzi <davidel@xmailserver.org> - Inline eventpoll_release() so that __fput() does not need to call into epoll code if the file itself is not registered inside an epoll fd - Add <linux/types.h> inclusion due to __u32 and __u64 usage - Fix debug printf that would otherwise panic if enabled with the new epoll code
-
Andrew Morton authored
From: Davide Libenzi <davidel@xmailserver.org> - Remove a couple of impossible debug checks (unsigneds cannot be negative!) - If __alloc_bootmem_core() fails with a goal and unaligned node_boot_start it'll loop forever.
-
Andrew Morton authored
If de_thread() fails in flush_old_exec() then we try to fail the execve(). That is a bad move, because exec_mmap() has already switched the current process over to the new mm. The new process is not yet sufficiently set up to handle the error and the kernel doublefaults and dies. exec_mmap() is the point of no return. Change flush_old_exec() to call de_thread() before running exec_mmap() so the execing program sees the error. I added fault injection to both de_thread() and exec_mmap() - everything now survives OK.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> Add some comments to the request allocation code.
-
Andrew Morton authored
- pass gfp_flags to get_io_context(): not all callers are forced to use GFP_ATOMIC. - fix locking in get_io_context(): bump the refcount while in the exclusive region. - don't go oops in get_io_context() if the kmalloc failed. - in as_get_io_context(): fail the whole thing if we were unable to allocate the AS-specific part. - as_remove_queued_request() cleanup
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> The following patch gets batching working how it should be. After a process is woken up, it is allowed to allocate up to 32 requests for 20ms. It does not stop other processes submitting requests if it isn't submitting though. This should allow fewer context switches, and allow batches of requests from each process to be sent to the io scheduler instead of 1 request from each process. tiobench sequential writes are more than tripled, random writes are nearly doubled over mm1. In earlier tests I generally saw better CPU efficiency but it doesn't show here. There is still debug to be taken out. It's also only on UP.

                                Avg      Maximum    Lat%    Lat%   CPU
Identifier      Rate   (CPU%)  Latency   Latency     >2s    >10s   Eff
------------------- ------ --------- ---------- ------- ------ ----
-2.5.71-mm1    11.13   3.783%    46.10   24668.01   0.84    0.02   294
+2.5.71-mm1    13.21   4.489%    37.37    5691.66   0.76    0.00   294

Random Reads
------------------- ------ --------- ---------- ------- ------ ----
-2.5.71-mm1     0.97   0.582%   519.86    6444.66  11.93    0.00   167
+2.5.71-mm1     1.01   0.604%   484.59    6604.93  10.73    0.00   167

Sequential Writes
------------------- ------ --------- ---------- ------- ------ ----
-2.5.71-mm1     4.85   4.456%    77.80   99359.39   0.18    0.13   109
+2.5.71-mm1    14.11   14.19%    10.07   22805.47   0.09    0.04    99

Random Writes
------------------- ------ --------- ---------- ------- ------ ----
-2.5.71-mm1     0.46   0.371%    14.48    6173.90   0.23    0.00   125
+2.5.71-mm1     0.86   0.744%    24.08    8753.66   0.31    0.00   115

It decreases context switch rate on IBM's 8-way on ext2 tiobench 64 threads from ~2500/s to ~140/s on their regression tests.
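The "up to 32 requests for 20ms" rule can be pictured as a small per-process batch window. The sketch below is an assumption-laden illustration rather than the block-layer code: `struct batcher`, `batch_start` and `batch_may_allocate` are hypothetical names, time is passed in as plain milliseconds, and the window is treated as closed at exactly 20 ms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BATCH_REQUESTS	32u	/* from the description above */
#define BATCH_TIME_MS	20u	/* from the description above */

struct batcher {
	uint64_t woken_at_ms;	/* when the process was woken */
	unsigned allocated;	/* requests handed out in this window */
	bool active;
};

/* Called when a sleeping process is woken: open a fresh batch window. */
static void batch_start(struct batcher *b, uint64_t now_ms)
{
	b->woken_at_ms = now_ms;
	b->allocated = 0;
	b->active = true;
}

/* May this process take one more request as part of its batch? */
static bool batch_may_allocate(struct batcher *b, uint64_t now_ms)
{
	if (!b->active)
		return false;
	assert(now_ms >= b->woken_at_ms);

	if (b->allocated >= BATCH_REQUESTS ||
	    now_ms - b->woken_at_ms >= BATCH_TIME_MS) {
		b->active = false;	/* window used up or timed out */
		return false;
	}
	b->allocated++;
	return true;
}
```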
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> Generalise the AS-specific per-process IO context so that other IO schedulers could use it.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This patch fixes the request batching fairness/starvation issue. It's not clear what is going on with 2.4, but it seems that it's a problem around this area. Anyway, previously:
  * request queue fills up
  * process 1 calls get_request, sleeps
  * a couple of requests are freed
  * process 2 calls get_request, proceeds
  * a couple of requests are freed
  * process 2 calls get_request...
Now as unlikely as it seems, it could be a problem. It's a fairness problem that process 2 can skip ahead of process 1 anyway. With the patch:
  * request queue fills up
  * any process calling get_request will sleep
  * once the queue gets below the batch watermark, processes start being woken, and may allocate.
This patch includes Chris Mason's fix to only clear queue_full when all tasks have been woken. Previously I think starvation and unfairness could still occur. With this change to the blk-fair-batches patch, Chris is showing some much improved numbers for 2.4 - 170 ms max wait vs 2700ms without blk-fair-batches for a dbench 90 run. He didn't indicate how much difference his patch alone made, but it is an important fix I think.
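The "with the patch" sequence can be modelled with a sticky full flag. The sketch below is a simplification under stated assumptions: `struct rq_pool`, `get_request_sketch` and `put_request_sketch` are hypothetical, sleeping is reduced to a waiter counter, and all sleepers are woken in one go once the pool drops below the batch watermark, at which point the flag is cleared.

```c
#include <assert.h>
#include <stdbool.h>

struct rq_pool {
	unsigned in_use;	/* requests currently allocated */
	unsigned limit;		/* queue depth */
	unsigned batch_mark;	/* wake sleepers once in_use drops below this */
	unsigned waiters;	/* processes sleeping in get_request */
	bool full;		/* sticky "queue full" flag */
};

/* Returns true if a request was allocated, false if the caller has to sleep. */
static bool get_request_sketch(struct rq_pool *q)
{
	if (q->full || q->in_use >= q->limit) {
		q->full = true;		/* from now on every caller queues up */
		q->waiters++;
		return false;
	}
	q->in_use++;
	return true;
}

/* Frees one request and returns how many sleepers get woken. */
static unsigned put_request_sketch(struct rq_pool *q)
{
	unsigned woken;

	assert(q->in_use > 0);
	q->in_use--;

	if (!q->full || q->in_use >= q->batch_mark)
		return 0;

	woken = q->waiters;
	q->waiters = 0;
	q->full = false;	/* cleared only once every sleeper has been woken */
	return woken;
}
```

The behaviour to notice is that a late caller cannot grab a freshly freed slot while earlier callers are still asleep.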
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> If there are no requests in flight against the target device and get_request() fails, nothing will wake us up. Fix.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This patch implements a hint so that AS can tell the request allocator to allocate a request even if there are none left (the accounting is quite flexible and easily handles overallocations). elv_may_queue semantics have changed from "the elevator does _not_ want another request allocated" to "the elevator _insists_ that another request is allocated". I couldn't see any harm ;) Now in practice, AS will only allow _1_ request over the limit, because as soon as the request is sent to AS, it stops anticipating.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> Now that we are counting requests (not requests free), this patch changes the congested & batch watermarks to be more logical. Also a minor fix to the sysfs code.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This gets rid of the global queue_nr_requests and usage of BLKDEV_MAX_RQ (the latter is now only used to set the queues' defaults). The queue depth becomes per-queue, controlled by a sysfs entry.
-
Andrew Morton authored
Using keventd for running request_fns is risky because keventd itself can block on disk I/O. Use the new kblockd kernel threads for the generic unplugging.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This is the core anticipatory IO scheduler. There are nearly 100 changesets in this and five months' work. I really cannot describe it fully here. Major points:
  - It works by recognising that reads are dependent: we don't know where the next read will occur, but it's probably close by the previous one. So once a read has completed we leave the disk idle, anticipating that a request for a nearby read will come in.
  - There is read batching and write batching logic.
    - when we're servicing a batch of writes we will refuse to seek away for a read for some tens of milliseconds. Then the write stream is preempted.
    - when we're servicing a batch of reads (via anticipation) we'll do that for some tens of milliseconds, then preempt.
  - There are request deadlines, for latency and fairness. The oldest outstanding request is examined at regular intervals. If this request is older than a specific deadline, it will be the next one dispatched. This gives a good fairness heuristic while being simple because processes tend to have localised IO.
Just about all of the rest of the complexity involves an array of fixups which prevent most of the obvious failure modes with anticipation: trying to not leave the disk head pointlessly idle. Some of these algorithms are:
  - Process tracking. If the process whose read we are anticipating submits a write, abandon anticipation.
  - Process exit tracking. If the process whose read we are anticipating exits, abandon anticipation.
  - Process IO history. We accumulate statistical info on the process's recent IO patterns to aid in making decisions about how long to anticipate new reads. Currently thinktime and seek distance are tracked. Thinktime is the time between when a process's last request has completed and when it submits another one. Seek distance is simply the number of sectors between each read request. If either statistic becomes too high, then it isn't anticipated that the process will submit another read. The above all means that we need a per-process "io context". This is a fully refcounted structure. In this patch it is AS-only. Later we generalise it a little so other IO schedulers could use the same framework.
  - Requests are grouped as synchronous and asynchronous whereas the deadline scheduler groups requests as reads and writes. This can provide better sync write performance, and may give better responsiveness with journalling filesystems (although we haven't done that yet). We currently detect synchronous writes by nastily setting PF_SYNCWRITE in current->flags. The plan is to remove this later, and to propagate the sync hint from writeback_control.sync_mode into bio->bi_flags thence into request->flags. Once that is done, direct-io needs to set the BIO sync hint as well.
  - There is also quite a bit of complexity gone into bashing TCQ into submission. Timing for a read batch is not started until the first read request actually completes. A read batch also does not start until all outstanding writes have completed.
AS is the default IO scheduler. deadline may be chosen by booting with "elevator=deadline". There are a few reasons for retaining deadline:
  - AS is often slower than deadline in random IO loads with large TCQ windows. The usual real world task here is OLTP database loads.
  - deadline is presumably more stable.
  - deadline is much simpler.
The tunable per-queue entries under /sys/block/*/iosched/ are all in milliseconds:
  * read_expire: Controls how long until a request becomes "expired". It also controls the interval between which expired requests are served, so set to 50, a request might take anywhere < 100ms to be serviced _if_ it is the next on the expired list. Obviously it can't make the disk go faster. Result is basically the timeslice a reader gets in the presence of other IO. 100*((seek time / read_expire) + 1) is very roughly the % streaming read efficiency your disk should get in the presence of multiple readers.
  * read_batch_expire: Controls how much time a batch of reads is given before pending writes are served. Higher value is more efficient. Shouldn't really be below read_expire.
  * write_ versions of the above
  * antic_expire: Controls the maximum amount of time we can anticipate a good read before giving up. Many other factors may cause anticipation to be stopped early, or some processes will not be "anticipated" at all. Should be a bit higher for big seek time devices though not a linear correspondence - most processes have only a few ms thinktime.
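The per-process IO history mentioned in this message (thinktime and seek distance, with anticipation skipped when either gets too high) could be tracked along the lines of the sketch below. This is a guess at the general shape only: the 1/8 decay weight, the thresholds passed in by the caller, and the names `io_history`, `decay_mean`, `io_history_update` and `should_anticipate` are all hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct io_history {
	uint64_t mean_thinktime_ms;	/* completion of last request to next submit */
	uint64_t mean_seek_sectors;	/* distance between successive reads */
};

/* Fold a new sample into a running mean with weight 1/8 (arbitrary choice). */
static uint64_t decay_mean(uint64_t mean, uint64_t sample)
{
	assert(mean <= (UINT64_MAX - sample) / 7);	/* guard against overflow */
	return (mean * 7 + sample) / 8;
}

static void io_history_update(struct io_history *h,
			      uint64_t thinktime_ms, uint64_t seek_sectors)
{
	h->mean_thinktime_ms = decay_mean(h->mean_thinktime_ms, thinktime_ms);
	h->mean_seek_sectors = decay_mean(h->mean_seek_sectors, seek_sectors);
}

/* Anticipate only if the process tends to come back quickly and nearby. */
static bool should_anticipate(const struct io_history *h,
			      uint64_t max_thinktime_ms, uint64_t max_seek_sectors)
{
	return h->mean_thinktime_ms <= max_thinktime_ms &&
	       h->mean_seek_sectors <= max_seek_sectors;
}
```

In such a scheme the thinktime threshold would naturally be tied to antic_expire, since waiting longer than that is never allowed anyway.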
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> Introduces an elevator_completed_req() callback with which the generic queueing layer may tell an IO scheduler that a particular request has finished.
-
Andrew Morton authored
Introduces the elv_may_queue() predicate with which the IO scheduler may tell the generic request layer that we may add another request to this queue. It is used by the CFQ elevator.
-
Andrew Morton authored
keventd is inappropriate for running block request queues because keventd itself can get blocked on disk I/O, via call_usermodehelper()'s vfork and, presumably, GFP_KERNEL allocations. So create a new gang of kernel threads whose mandate is running low-level disk operations. It must never block on disk IO, so any memory allocations should be GFP_NOIO. We mainly use it for running unplug operations from interrupt context.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> The batch_requests function got lost during the merge of the dynamic request allocation patch. We need it for the anticipatory scheduler - when the number of threads exceeds the number of requests, the anticipated-upon task will undesirably sleep in get_request_wait(). And apparently some block devices which use small requests need it so they string a decent number together. Jens has acked this patch.
-
Andrew Morton authored
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com> This patch proposes a performance fix for the current IPC semaphore implementation. There are two shortcomings in the current implementation. First, try_atomic_semop() is called twice to wake up a blocked process: once from update_queue() (executed by the process that wakes up the sleeping process) and once in the retry part of the blocked process (executed by the blocked process that gets woken up). Second, when several sleeping processes are eligible for wake-up, they wake up in daisy-chain formation, each one in turn waking up the next process in line. However, every time a process wakes up, it starts scanning the wait queue from the beginning, not from where it was last scanned. This causes a large amount of unnecessary scanning when the wait queue is deep. Blocked processes come and go, but chances are there are still quite a few blocked processes sitting at the beginning of that queue. What we are proposing here is to merge the portion of the code in the bottom part of sys_semtimedop() (code that gets executed when a sleeping process gets woken up) into the update_queue() function. The benefit is twofold: (1) it reduces redundant calls to try_atomic_semop() and (2) it increases the efficiency of finding eligible processes to wake up and allows higher concurrency for multiple wake-ups. We have measured that this patch improves throughput for a large application significantly on an industry standard benchmark. This patch is relative to 2.5.72. Any feedback is very much appreciated. Some kernel profile data attached:

Kernel profile before optimization:
-----------------------------------------------
                0.05    0.14   40805/529060      sys_semop [133]
                0.55    1.73  488255/529060      ia64_ret_from_syscall [2]
[52]     2.5    0.59    1.88  529060             sys_semtimedop [52]
                0.05    0.83  477766/817966      schedule_timeout [62]
                0.34    0.46  529064/989340      update_queue [61]
                0.14    0.00 1006740/6473086     try_atomic_semop [75]
                0.06    0.00  529060/989336      ipcperms [149]
-----------------------------------------------
                0.30    0.40  460276/989340      semctl_main [68]
                0.34    0.46  529064/989340      sys_semtimedop [52]
[61]     1.5    0.64    0.87  989340             update_queue [61]
                0.75    0.00 5466346/6473086     try_atomic_semop [75]
                0.01    0.11  477676/576698      wake_up_process [146]
-----------------------------------------------
                0.14    0.00 1006740/6473086     sys_semtimedop [52]
                0.75    0.00 5466346/6473086     update_queue [61]
[75]     0.9    0.89    0.00 6473086             try_atomic_semop [75]
-----------------------------------------------

Kernel profile with optimization:
-----------------------------------------------
                0.03    0.05   26139/503178      sys_semop [155]
                0.46    0.92  477039/503178      ia64_ret_from_syscall [2]
[61]     1.2    0.48    0.97  503178             sys_semtimedop [61]
                0.04    0.79  470724/784394      schedule_timeout [62]
                0.05    0.00  503178/3301773     try_atomic_semop [109]
                0.05    0.00  503178/930934      ipcperms [149]
                0.00    0.03   32454/460210      update_queue [99]
-----------------------------------------------
                0.00    0.03   32454/460210      sys_semtimedop [61]
                0.06    0.36  427756/460210      semctl_main [75]
[99]     0.4    0.06    0.39  460210             update_queue [99]
                0.30    0.00 2798595/3301773     try_atomic_semop [109]
                0.00    0.09  470630/614097      wake_up_process [146]
-----------------------------------------------
                0.05    0.00  503178/3301773     sys_semtimedop [61]
                0.30    0.00 2798595/3301773     update_queue [99]
[109]    0.3    0.35    0.00 3301773             try_atomic_semop [109]
-----------------------------------------------

The number of calls to both try_atomic_semop() and update_queue() is reduced by 50% as a result of the merge. Execution time of sys_semtimedop is reduced because of the reduction in the low level functions.
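The idea of letting the waker complete the operation on the sleeper's behalf, in a single pass over the wait queue, can be illustrated with a toy model. The sketch below is not the ipc/sem.c code: a semaphore is reduced to one integer, a pending operation to a signed `delta`, and `struct waiter`, `try_op` and `update_queue_sketch` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct waiter {
	int delta;	/* negative: wants to take that many units */
	bool asleep;
};

/* Apply the operation if it would not drive the semaphore negative. */
static bool try_op(int *semval, int delta)
{
	if (*semval + delta < 0)
		return false;
	*semval += delta;
	return true;
}

/*
 * One pass by the waking process: each sleeper's operation is attempted once,
 * and if it succeeds the sleeper is marked runnable with its operation already
 * done, so it has no need to retry or rescan after waking.
 * Returns the number of sleepers woken; *tries counts try_op() calls.
 */
static unsigned update_queue_sketch(int *semval, struct waiter *q, size_t n,
				    unsigned *tries)
{
	unsigned woken = 0;
	size_t i;

	assert(semval != NULL && q != NULL && tries != NULL);

	for (i = 0; i < n; i++) {
		if (!q[i].asleep)
			continue;
		(*tries)++;
		if (try_op(semval, q[i].delta)) {
			q[i].asleep = false;
			woken++;
		}
	}
	return woken;
}
```

In this model each sleeper costs one attempt per pass, which is the kind of reduction in try_atomic_semop() calls the profile above reports.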
-