- 05 Mar, 2005 40 commits
-
Olof Johansson authored
Abstract most manual mask checks of cpu_features with cpu_has_feature() Signed-off-by: Olof Johansson <olof@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
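The conversion is mechanical; a hedged sketch of the before/after shape (the feature flag and callee names here are illustrative, not lifted from the patch):

    /* Before: open-coded test against the ppc64 CPU feature mask. */
    if (cur_cpu_spec->cpu_features & CPU_FTR_SLB)
            do_slb_setup();     /* illustrative callee */

    /* After: the same check through the accessor. */
    if (cpu_has_feature(CPU_FTR_SLB))
            do_slb_setup();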
-
bill.irwin@oracle.com authored
Convert mapping->tree_lock to an rwlock.

with:

 dd if=/dev/zero of=foo bs=1 count=2M  0.80s user 4.15s system  99% cpu 4.961 total
 dd if=/dev/zero of=foo bs=1 count=2M  0.73s user 4.26s system 100% cpu 4.987 total
 dd if=/dev/zero of=foo bs=1 count=2M  0.79s user 4.25s system 100% cpu 5.034 total
 dd if=foo of=/dev/null bs=1           0.80s user 3.12s system  99% cpu 3.928 total
 dd if=foo of=/dev/null bs=1           0.77s user 3.15s system 100% cpu 3.914 total
 dd if=foo of=/dev/null bs=1           0.92s user 3.02s system 100% cpu 3.935 total

 (3.926: 1.87 usecs)

without:

 dd if=/dev/zero of=foo bs=1 count=2M  0.85s user 3.92s system  99% cpu 4.780 total
 dd if=/dev/zero of=foo bs=1 count=2M  0.78s user 4.02s system 100% cpu 4.789 total
 dd if=/dev/zero of=foo bs=1 count=2M  0.82s user 3.94s system  99% cpu 4.763 total
 dd if=/dev/zero of=foo bs=1 count=2M  0.71s user 4.10s system  99% cpu 4.810 total
 dd if=foo of=/dev/null bs=1           0.76s user 2.68s system 100% cpu 3.438 total
 dd if=foo of=/dev/null bs=1           0.74s user 2.72s system  99% cpu 3.465 total
 dd if=foo of=/dev/null bs=1           0.67s user 2.82s system 100% cpu 3.489 total
 dd if=foo of=/dev/null bs=1           0.70s user 2.62s system  99% cpu 3.326 total

 (3.430: 1.635 usecs)

So on a P4, the additional cost of the rwlock is ~240 nsecs per one-byte write(). On the other hand:

From: Peter Chubb <peterc@gelato.unsw.edu.au>

As part of the Gelato scalability focus group, we've been running OSDL's Re-AIM7 benchmark with an I/O-intensive load with varying numbers of processors. The current kernel shows severe contention on the tree_lock in the address_space structure when running on tmpfs or ext2 on a RAM disk. Lockstat output for a 12-way:

 SPINLOCKS        HOLD           WAIT
  UTIL   CON   MEAN(  MAX )   MEAN(  MAX )(% CPU)      TOTAL NOWAIT  SPIN RJECT NAME
   5.5%        0.4us(3177us)   28us(  20ms)(44.2%) 131821954  94.5%  5.5% 0.00% *TOTAL*
  72.3% 13.1%  0.5us( 9.5us)   29us(  20ms)(42.5%)  50542055  86.9% 13.1%    0% find_lock_page+0x30
  23.8%    0%  385us(3177us)    0us                    23235   100%    0%    0% exit_mmap+0x50
  11.5% 0.82%  0.1us( 101us)   17us(5670us)( 1.6%)  50665658  99.2% 0.82%    0% dnotify_parent+0x70

Replacing the spinlock with a multi-reader lock fixes this problem without unduly affecting anything else. Here are the benchmark results (jobs per minute at a 50-client level, average of 5 runs, standard deviation in parens) on an HP Olympia with 3 cells, 12 processors, and dnotify turned off (after this spinlock, the spinlock in dnotify_parent is the worst contended for this workload):

              tmpfs.....................    ext2......................
 #CPUs   spinlock      rwlock             spinlock      rwlock
   1     7556(15)      7588(17)   +0.42%   3744(20)     3791(16)  +1.25%
   2    13743(31)     13791(33)   +0.35%   6405(30)     6413(24)  +0.12%
   4    23334(111)    22881(154)  -2%      9648(51)     9595(50)  -0.55%
   8    33580(240)    36163(190)  +7.7%   13183(63)    13070(68)  -0.85%
  12    28748(170)    44064(238) +53%     12681(49)    14504(105) +14%

And on a single-processor pentium3:

   1     4177(4)       4169(2)    -0.2%    3811(4)      3820(3)   +0.23%

I'm not sure what's happening in the 4-processor case. The important thing to note is that with a spinlock, the benchmark shows worse performance for a 12-way than for an 8-way box; with the patch, the 12-way performs better, as expected. We've done some runs with a 16-way as well; without the patch below, the 16-way performs worse than the 12-way.

It's a tricky tradeoff, but large-SMP is hurt a lot more by the spinlocks than small-SMP is by the rwlocks. And I don't think we really want to implement compile-time either-or locks.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
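The pattern behind the numbers, as a hedged sketch rather than the literal patch: radix-tree lookups take mapping->tree_lock shared, while insertion and removal take it exclusive (surrounding function bodies abbreviated):

    /* Lookup path, e.g. find_lock_page(): readers share the lock. */
    read_lock_irq(&mapping->tree_lock);
    page = radix_tree_lookup(&mapping->page_tree, offset);
    read_unlock_irq(&mapping->tree_lock);

    /* Insert/remove, e.g. add_to_page_cache(): writers are exclusive. */
    write_lock_irq(&mapping->tree_lock);
    error = radix_tree_insert(&mapping->page_tree, offset, page);
    write_unlock_irq(&mapping->tree_lock);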
-
Zach Brown authored
Now that we're only invalidating the pages that intersected a direct-IO write, we might as well only unmap the intersecting bytes as well. This passed a light fsx load with page-cache, direct, and mmap IO. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Zach Brown authored
This adds filemap_write_and_wait_range(mapping, lstart, lend) which starts writeback and waits on a range of pages. We call this from __blkdev_direct_IO with just the range that is going to be read by the direct_IO read. It was lightly tested with fsx and ext3 and passed. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
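A hedged sketch of the call as described, assuming the usual filemap convention that lend is the offset of the last byte, inclusive:

    /* Flush and wait on just the bytes the direct-IO read will touch. */
    if (mapping->nrpages) {
            retval = filemap_write_and_wait_range(mapping, offset,
                                                  offset + count - 1);
            if (retval)
                    goto out;
    }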
-
Zach Brown authored
Presently we invalidate all of a file's pages when writing to any part of that file with direct IO. After a direct-IO write, only invalidate the pages that the write intersected. invalidate_inode_pages2_range(mapping, pgoff start, pgoff end) is added and called from generic_file_direct_IO(). While we're in there: invalidate_inode_pages2() was calling unmap_mapping_range() with the wrong convention in the single-page case. It was providing the byte offset of the final page rather than the length of the hole being unmapped. This is also fixed.

This was lightly tested with a 10k-op fsx run with O_DIRECT on a 16MB file in ext3 on a junky old IDE drive. Totaling vmstat columns of blocks read and written during the runs shows that read traffic drops significantly. The run time seems to have gone down a little. Two runs before the patch gave the following user/real/sys times and total blocks in and out:

 0m28.029s 0m20.093s 0m3.166s  16673 125107
 0m27.949s 0m20.068s 0m3.227s  18426 126094

and after the patch:

 0m26.775s 0m19.996s 0m3.060s   3505 124982
 0m26.856s 0m19.935s 0m3.052s   3505 125279

akpm:
 - Don't look up more pages than we're going to use
 - Don't test page->index until we've locked the page
 - Check for the cursor wrapping at the end of the mapping

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
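A hedged sketch of both points (the offset arithmetic is illustrative): the new range call takes first and last page indices, and unmap_mapping_range() takes a byte length for the hole, not an end offset:

    /* Invalidate only the pages the direct-IO write touched. */
    invalidate_inode_pages2_range(mapping,
                                  offset >> PAGE_CACHE_SHIFT,
                                  (offset + count - 1) >> PAGE_CACHE_SHIFT);

    /* The fixed convention: the third argument is the hole's length
     * in bytes, not the offset of its final page. */
    unmap_mapping_range(mapping,
                        (loff_t)page->index << PAGE_CACHE_SHIFT,
                        PAGE_CACHE_SIZE, 0);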
-
Christoph Lameter authored
In the 2.6.11 development cycle, function calls were added to lots of hot vm paths to do accounting. I think these should not go into the final 2.6.11 release, because these statistics can be collected in a different way that does not require updating counters from frequently used vm code paths and is consistent with the methods used elsewhere in the kernel to obtain statistics. The function calls are:

 acct_update_integrals -> accounts for processes based on stime changes
 update_mem_hiwater    -> takes rss and total_vm hiwater marks

acct_update_integrals is only useful to call if stime changes; otherwise it simply returns. It is therefore best to relocate the call to acct_update_integrals into the function that updates stime, which is account_system_time, and remove it from the vm code paths.

update_mem_hiwater finds the rss hiwater mark. We call that from timer context as well. This means that processes' high-water marks are now sampled statistically, at timer-interrupt time, rather than deterministically, which may or may not be a problem. The hiwater mark is no longer updated on every rss increase and is thus less accurate. But the benefit is that the rss checks no longer pollute the vm paths, and that this is consistent with the rss-limit check.

The following patch removes acct_update_integrals and update_mem_hiwater from the hot vm paths.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

From: Jay Lan <jlan@sgi.com>

The new "move-accounting-function-calls-out-of-critical-vm-code-paths" patch in 2.6.11-rc3-mm2 was different from the code I tested. In particular, it mistakenly dropped the accounting routine calls in fs/exec.c. The calls in do_execve() are needed to properly initialize accounting fields; specifically, tsk->acct_stimexpd needs to be initialized to tsk->stime. I have discussed this with Christoph Lameter and he gave me full blessings to bring the calls back.

Signed-off-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
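The relocation amounts to hanging both calls off the one place stime actually changes; a hedged sketch using the era's signature, with the surrounding bookkeeping abbreviated:

    void account_system_time(struct task_struct *p, int hardirq_offset,
                             cputime_t cputime)
    {
            p->stime = cputime_add(p->stime, cputime);
            /* ... cpustat bookkeeping ... */

            /* Sampled here, at timer-tick granularity, instead of in
             * the hot vm paths: */
            acct_update_integrals(p);
            update_mem_hiwater(p);
    }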
-
Arjan van de Ven authored
In addition to randomisation of the stack pointer within the stack, the stack itself should be randomised too. We need both approaches: the stack itself can only be randomised in page-size increments, while randomising over a large range with the stack pointer alone eats a huge chunk of the stack rlimit, which is undesirable. So we do both. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Arjan van de Ven authored
Introduce a personality that disables randomisation, so that users can use setarch and related commands to run specific applications without randomisation. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
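A hedged sketch, assuming the flag ends up as an ADDR_NO_RANDOMIZE bit in the personality word (the exact value is illustrative):

    /* include/linux/personality.h: a flag bit in the personality word */
    ADDR_NO_RANDOMIZE = 0x0040000,  /* disable randomisation of VA space */

    /* Any randomisation site can then honour it: */
    if (!(current->personality & ADDR_NO_RANDOMIZE))
            sp -= get_random_int() % 8192;

With this in place, something like `setarch i386 -R ./app` can flip the bit for one process tree without touching the global setting.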
-
Arjan van de Ven authored
Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Arjan van de Ven authored
The patch below randomizes the starting point of the mmap area. This has the effect that all non-prelinked shared libraries and all bigger malloc()s will be randomized between invocations of the binary. Prelinked binaries get an address hint from ld.so in their mmap and are thus exempt from this randomisation, in order not to break the prelink advantage. The randomisation range is 1 megabyte (bigger than the stack randomisation, since the stack randomisation only needs 16-byte alignment while the mmap needs page alignment; a 64kb range would not have given enough entropy to be effective). Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
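A hedged sketch of the idea, not the literal patch: pick a page-aligned offset within 1 MB and add it to the mmap base (the helper name is made up for illustration):

    static unsigned long mmap_rnd(void)
    {
            unsigned long rnd = 0;

            if (current->flags & PF_RANDOMIZE)
                    /* 1 MB range, in page-size steps */
                    rnd = (get_random_int() % (1024 * 1024 / PAGE_SIZE))
                            << PAGE_SHIFT;
            return rnd;
    }

    mm->mmap_base = TASK_UNMAPPED_BASE + mmap_rnd();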
-
Arjan van de Ven authored
The patch below replaces the existing 8Kb randomisation of the userspace stack pointer (currently only done for Hyperthreaded P-IVs) with a more general randomisation over a 64Kb range. 64Kb is not a lot, but it's a start, and once the dust settles we can increase this value to something more aggressive. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
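A hedged sketch of the shape this takes in arch_align_stack() (the constant and the gating condition are illustrative): subtract a random amount within the 64Kb range, then restore 16-byte alignment.

    unsigned long arch_align_stack(unsigned long sp)
    {
            if (current->flags & PF_RANDOMIZE)
                    sp -= get_random_int() & 0xffff;  /* 64Kb range */
            return sp & ~0xf;                         /* 16-byte aligned */
    }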
-
Arjan van de Ven authored
Even though there is a global flag to disable randomisation, it's useful to have a per-process flag too; the patch below introduces this per-process flag and automatically sets it for "new" binaries. Eventually we will want to tie this to the legacy-va-space personality. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Arjan van de Ven authored
The patch below introduces get_random_int() and randomize_range(), two helpers used in later patches in the series. get_random_int() shares the tcp/ip random-number infrastructure, so the CONFIG_INET ifdef needs to move slightly, and to reduce the damage from that, secure_ip_id() moves inside random.c.

From: Frank Sorenson <frank@tuxrocks.com>
Acked-By: Jeff Dike <jdike@addtoit.com>

The stack randomization patches that went into 2.6.11-rc3-mm1 broke compilation of ARCH=um. This patch fixes compiling by adding arch_align_stack back in.

Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frank Sorenson <frank@tuxrocks.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
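A hedged sketch of randomize_range() built on get_random_int(): pick a page-aligned start such that a block of len bytes beginning there still fits below end, or return 0 when there is no room.

    unsigned long
    randomize_range(unsigned long start, unsigned long end, unsigned long len)
    {
            unsigned long range = end - len - start;

            if (end <= start + len)
                    return 0;       /* no room to randomize */
            return PAGE_ALIGN(get_random_int() % range + start);
    }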
-
Arjan van de Ven authored
This first patch of the series introduces a sysctl (default off) that enables/disables the randomisation feature globally. Since randomisation may make really tricky situations harder to debug (reproducibility goes down), the sysadmin needs a way to disable it globally. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
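A hedged sketch of the knob as a standard integer sysctl in the era's table format (everything beyond the randomize_va_space name itself is an assumption):

    int randomize_va_space = 0;     /* default off, per this patch */

    /* kernel/sysctl.c */
    {
            .ctl_name       = KERN_RANDOMIZE,
            .procname       = "randomize_va_space",
            .data           = &randomize_va_space,
            .maxlen         = sizeof(int),
            .mode           = 0644,
            .proc_handler   = &proc_dointvec,
    },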
-
Kumar Gala authored
Statically initialize spinlocks that are otherwise accessed prior to initialization. Signed-off-by: Jaka Mocnik <jaka@activetools.si> Signed-off-by: Kumar Gala <kumar.gala@freescale.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
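The fix is the era's standard idiom; a hedged sketch with a made-up lock name:

    /* Before: runtime init that can race with early users. */
    spinlock_t foo_lock;
    /* ... much later ... */
    spin_lock_init(&foo_lock);

    /* After: initialized at compile time, valid from the first access. */
    static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;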
-
Sean Hefty authored
Modify ib_cancel_mad() to invoke a user's send completion callback from a different thread context than that used by the caller. This allows a caller to hold a lock while calling cancel that is also acquired from their send handler. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Michael S. Tsirkin authored
Set device_cap_flags field in mthca's query_device method. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Michael S. Tsirkin authored
1. Split the QP spinlock into separate send and receive locks. The only place where we have to lock both is upon modify_qp, and that is not on the data path.

2. Avoid taking any QP locks when polling the CQ.

The second part is achieved by getting rid of the cur field in mthca_wq and calculating the number of outstanding WQEs by comparing the head and tail fields: head is only updated by post, tail is only updated by poll. In the rare case where an overrun is detected, the CQ is locked and the overrun condition is re-tested, to avoid any potential for stale tail values.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@topspin.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
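A hedged sketch of the head/tail accounting (field names per the description above): unsigned subtraction gives the outstanding count even across wraparound, and only the suspicious case takes the CQ lock.

    /* post (producer) advances head; poll (consumer) advances tail. */
    cur = wq->head - wq->tail;      /* outstanding WQEs, wrap-safe */

    if (unlikely(cur + nreq >= wq->max)) {
            /* Possible overrun: take the CQ lock and re-read tail to
             * rule out a stale value before failing the post. */
            spin_lock(&cq->lock);
            cur = wq->head - wq->tail;
            spin_unlock(&cq->lock);
            if (cur + nreq >= wq->max)
                    return -ENOMEM; /* genuinely full */
    }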
-
Roland Dreier authored
Tie up one last loose end by mapping enough context memory to cover the whole multicast table during initialization, and then enable mem-free mode. mthca now supports enough of mem-free mode so that IPoIB works with a mem-free HCA. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Implement posting send and receive work requests for mem-free mode. Also tidy up a few things in send/receive posting for Tavor mode (fix smp_wmb()s that should really be just wmb()s, annotate tests in the fast path with likely()/unlikely()). Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Update address vector handling to support mem-free mode. In mem-free mode, the address vector (in hardware format) is copied by the driver into each send work queue entry, so our address handle creation can become pretty trivial: we just kmalloc() a buffer to hold the formatted address vector. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Update QP initialization and cleanup to handle mem-free mode. In mem-free mode, work queue sizes have to be rounded up to a power of 2, we need to allocate doorbells, there must be memory mapped for the entries in the QP and extended QP context table that we use, and the entries of the receive queue must be initialized. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Add support for CQ data path operations (request notification, update consumer index) in mem-free mode. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Update CQ initialization and cleanup to handle mem-free mode: we need to make sure the HCA has memory mapped for the entry in the CQ context table we will use and also allocate doorbell records. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Factor the allocation and freeing of completion queue buffers into mthca_alloc_cq_buf() and mthca_free_cq_buf(). This makes the code more readable and will eventually make handling userspace CQs simpler (the kernel doesn't have to allocate a buffer at all). Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Add mthca_write_db_rec() to wrap writing doorbell records. On 64-bit archs this is just a 64-bit write, while on 32-bit archs it splits the write into two 32-bit writes with a memory barrier between them to make sure the two halves of the record are written in the correct order. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
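A hedged sketch matching the description (types abbreviated): one atomic 64-bit store on 64-bit machines, two ordered 32-bit stores elsewhere.

    static inline void mthca_write_db_rec(u32 val[2], u32 *db)
    {
    #if BITS_PER_LONG == 64
            *(u64 *) db = *(u64 *) val;     /* single atomic 64-bit write */
    #else
            db[0] = val[0];
            wmb();          /* first half visible before the second */
            db[1] = val[1];
    #endif
    }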
-
Roland Dreier authored
Mem-free mode requires the driver to allocate additional doorbell pages for each user access region. Add support for this in mthca_memfree.c, and have the driver allocate a table in db_tab for kernel use. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Have the MAP_ICM_page() firmware command assume pages are always the HCA-native 4K size rather than using the kernel's page size. This will make handling doorbell pages for mem-free mode simpler. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Slightly improve debugging output for UNMAP_ICM and MODIFY_QP firmware commands. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Update interrupt handling code to handle mem-free mode. While we're at it, improve the Tavor interrupt handling to avoid an extra MMIO read of the event cause register. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Add code to initialize EQ context properly in both Tavor and mem-free mode. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Add support for mem-free mode to memory region code. This mostly amounts to properly munging between keys and indices. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Add support for mapping more memory into HCA's context to cover context tables when new objects are allocated. Pass the object size into mthca_alloc_icm_table(), reference count the ICM chunks, and add new mthca_table_get() and mthca_table_put() functions to handle mapping memory when allocating or destroying objects. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Add support for allocating user access regions (UARs). Use this to allocate a region for the kernel at driver init instead of using the hard-coded MTHCA_KAR_PAGE index. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Move the request/ioremap of regions related to event handling into mthca_eq.c. Map the correct regions depending on whether we're in Tavor or native mem-free mode. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Michael S. Tsirkin authored
Remove support for unsignaled receive requests. This is a non-standard extension to the IB spec that is not used by any known applications or protocols, and is not supported by newer hardware. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Roland Dreier authored
Simplify some of the code for CQ handling slightly. Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Michael S. Tsirkin authored
Locking during the poll CQ operation can be reduced by locking the CQ while the QP is being removed from the QP array. This also avoids an extra atomic operation for reference counting. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Michael S. Tsirkin authored
Avoid taking the CQ table lock in the fast path by using synchronize_irq() after removing a CQ from the table to make sure that no completion events are still in progress. This gets a nice speedup (about 4%) in IP-over-IB on my hardware. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
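A hedged sketch of the teardown ordering this relies on (the lock, array, and IRQ names are illustrative):

    /* Unpublish the CQ so no new events can find it... */
    spin_lock_irq(&dev->cq_table.lock);
    mthca_array_clear(&dev->cq_table.cq, cq->cqn);
    spin_unlock_irq(&dev->cq_table.lock);

    /* ...then wait out any completion handler already running, so the
     * event path itself never needs to take the table lock. */
    synchronize_irq(dev->pdev->irq);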
-
Michael S. Tsirkin authored
Clean up CQ code so that we only calculate the address of a CQ entry once when using it. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <roland@topspin.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-