- 17 Jun, 2009 30 commits
-
-
Mel Gorman authored
It is possible with __GFP_THISNODE that no zones are suitable. This patch makes sure the check is only made once. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
Callers of alloc_pages_node() can optionally specify -1 as a node to mean "allocate from the current node". However, a number of the callers in fast paths know for a fact their node is valid. To avoid a comparison and branch, this patch adds alloc_pages_exact_node() that only checks the nid with VM_BUG_ON(). Callers that know their node is valid are then converted. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
No user of the allocator API should be passing in an order >= MAX_ORDER but we check for it on each and every allocation. Delete this check and make it a VM_BUG_ON check further down the call path. [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
The start of a large patch series to clean up and optimise the page allocator. The performance improvements are in a wide range depending on the exact machine but the results I've seen so fair are approximately; kernbench: 0 to 0.12% (elapsed time) 0.49% to 3.20% (sys time) aim9: -4% to 30% (for page_test and brk_test) tbench: -1% to 4% hackbench: -2.5% to 3.45% (mostly within the noise though) netperf-udp -1.34% to 4.06% (varies between machines a bit) netperf-tcp -0.44% to 5.22% (varies between machines a bit) I haven't sysbench figures at hand, but previously they were within the -0.5% to 2% range. On netperf, the client and server were bound to opposite number CPUs to maximise the problems with cache line bouncing of the struct pages so I expect different people to report different results for netperf depending on their exact machine and how they ran the test (different machines, same cpus client/server, shared cache but two threads client/server, different socket client/server etc). I also measured the vmlinux sizes for a single x86-based config with CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config is based on the Debian Lenny kernel config so I expect it to be reasonably typical. This patch: __alloc_pages_internal is the core page allocator function but essentially it is an alias of __alloc_pages_nodemask. Naming a publicly available and exported function "internal" is also a big ugly. This patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old nodemask function. Warning - This patch renames an exported symbol. No kernel driver is affected by external drivers calling __alloc_pages_internal() should change the call to __alloc_pages_nodemask() without any alteration of parameters. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(), to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on order >= MAX_ORDER - it's hoping for order 11. alloc_large_system_hash() had better make its own check on the order. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: David Miller <davem@davemloft.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miao Xie authored
Fix allocating page cache/slab object on the unallowed node when memory spread is set by updating tasks' mems_allowed after its cpuset's mems is changed. In order to update tasks' mems_allowed in time, we must modify the code of memory policy. Because the memory policy is applied in the process's context originally. After applying this patch, one task directly manipulates anothers mems_allowed, and we use alloc_lock in the task_struct to protect mems_allowed and memory policy of the task. But in the fast path, we didn't use lock to protect them, because adding a lock may lead to performance regression. But if we don't add a lock,the task might see no nodes when changing cpuset's mems_allowed to some non-overlapping set. In order to avoid it, we set all new allowed nodes, then clear newly disallowed ones. [lee.schermerhorn@hp.com: The rework of mpol_new() to extract the adjusting of the node mask to apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind() with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local allocation. Fix this by adding the check for MPOL_PREFERRED and empty node mask to mpol_new_mpolicy(). Remove the now unneeded 'nodes = NULL' from mpol_new(). Note that mpol_new_mempolicy() is always called with a non-NULL 'nodes' parameter now that it has been removed from mpol_new(). Therefore, we don't need to test nodes for NULL before testing it for 'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to verify this assumption.] [lee.schermerhorn@hp.com: I don't think the function name 'mpol_new_mempolicy' is descriptive enough to differentiate it from mpol_new(). This function applies cpuset set context, usually constraining nodes to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag is set, it also translates the nodes. So I settled on 'mpol_set_nodemask()', because the comment block for mpol_new() mentions that we need to call this function to "set nodes". Some additional minor line length, whitespace and typo cleanup.] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miao Xie authored
Fix the bug that the kernel didn't spread page cache/slab object evenly over all the allowed nodes when spread flags were set by updating tasks' page/slab spread flags in time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miao Xie authored
The kernel still allocates the page caches on old node after modifying its cpuset's mems when 'memory_spread_page' was set, or it didn't spread the page cache evenly over all the nodes that faulting task is allowed to usr after memory_spread_page was set. it is caused by the old mem_allowed and flags of the task, the current kernel doesn't updates them unless some function invokes cpuset_update_task_memory_state(), it is too late sometimes.We must update the mem_allowed and the flags of the tasks in time. Slab has the same problem. The following patches fix this bug by updating tasks' mem_allowed and spread flag after its cpuset's mems or spread flag is changed. This patch: Extract a function from cpuset_update_task_memory_state(). It will be used later for update tasks' page/slab spread flags after its cpuset's flag is set Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
H Hartley Sweeten authored
get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit() with variable pbdi_dirty as one of the arguments. This variable is an unsigned long * but both functions expect it to be a long *. This causes the following sparse warnings: warning: incorrect type in argument 3 (different signedness) expected long *pbdi_dirty got unsigned long *pbdi_dirty warning: incorrect type in argument 2 (different signedness) expected long *pdirty got unsigned long *pbdi_dirty Fix the warnings by changing the long * to unsigned long * in both functions. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KOSAKI Motohiro authored
Commit 33c120ed ("more aggressively use lumpy reclaim") increased how aggressive lumpy reclaim was by isolating both active and inactive pages for asynchronous lumpy reclaim on costly-high-order pages and for cheap-high-order when memory pressure is high. However, if the system is under heavy pressure and there are dirty pages, asynchronous IO may not be sufficient to reclaim a suitable page in time. This patch causes the caller to enter synchronous lumpy reclaim for costly-high-order pages and for cheap-high-order pages when under memory pressure. Minchan.kim@gmail.com said: Andy added synchronous lumpy reclaim with c661b078. At that time, lumpy reclaim is not agressive. His intension is just for high-order users.(above PAGE_ALLOC_COSTLY_ORDER). After some time, Rik added aggressive lumpy reclaim with 33c120ed. His intention was to do lumpy reclaim when high-order users and trouble getting a small set of contiguous pages. So we also have to add synchronous pageout for small set of contiguous pages. Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <Minchan.kim@gmail.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nick Piggin authored
Move more documentation for get_user_pages_fast into the new kerneldoc comment. Add some comments for get_user_pages as well. Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't there initially because it was once a static inline function. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Andy Grover <andy.grover@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Now that we do readahead for sequential mmap reads, here is a simple evaluation of the impacts, and one further optimization. It's an NFS-root debian desktop system, readahead size = 60 pages. The numbers are grabbed after a fresh boot into console. approach pgmajfault RA miss ratio mmap IO count avg IO size(pages) A 383 31.6% 383 11 B 225 32.4% 390 11 C 224 32.6% 307 13 case A: mmap sync/async readahead disabled case B: mmap sync/async readahead enabled, with enforced full async readahead size case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size or: A = vanilla 2.6.30-rc1 B = A plus mmap readahead C = B plus this patch The numbers show that - there are good possibilities for random mmap reads to trigger readahead - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead - case C can further reduce IO count by 1/4 - readahead miss ratios are not quite affected The theory is - readahead is _good_ for clustered random reads, and can perform _better_ than readaround because they could be _async_. - async readahead size is guaranteed to be larger than readaround size, and they are _async_, hence will mostly behave better However for B - sync readahead size could be smaller than readaround size, hence may make things worse by produce more smaller IOs which will be fixed by this patch. Final conclusion: - mmap readahead reduced major faults by 1/3 and no obvious overheads; - mmap io can be further reduced by 1/4 with this patch. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Introduce page cache context based readahead algorithm. This is to better support concurrent read streams in general. RATIONALE --------- The current readahead algorithm detects interleaved reads in a _passive_ way. Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,... By checking for (offset == prev_offset + 1), it will discover the sequentialness between 3,4 and between 1004,1005, and start doing sequential readahead for the individual streams since page 4 and page 1005. The context readahead algorithm guarantees to discover the sequentialness no matter how the streams are interleaved. For the above example, it will start sequential readahead since page 2 and 1002. The trick is to poke for page @offset-1 in the page cache when it has no other clues on the sequentialness of request @offset: if the current requenst belongs to a sequential stream, that stream must have accessed page @offset-1 recently, and the page will still be cached now. So if page @offset-1 is there, we can take request @offset as a sequential access. BENEFICIARIES ------------- - strictly interleaved reads i.e. 1,1001,2,1002,3,1003,... the current readahead will take them as silly random reads; the context readahead will take them as two sequential streams. - cooperative IO processes i.e. NFS and SCST They create a thread pool, farming off (sequential) IO requests to different threads which will be performing interleaved IO. It was not easy(or possible) to reliably tell from file->f_ra all those cooperative processes working on the same sequential stream, since they will have different file->f_ra instances. And NFSD's file->f_ra is particularly unusable, since their file objects are dynamically created for each request. The nfsd does have code trying to restore the f_ra bits, but not satisfactory. The new scheme is to detect the sequential pattern via looking up the page cache, which provides one single and consistent view of the pages recently accessed. That makes sequential detection for cooperative processes possible. USER REPORT ----------- Vladislav recommends the addition of context readahead as a result of his SCST benchmarks. It leads to 6%~40% performance gains in various cases and achieves equal performance in others. http://lkml.org/lkml/2009/3/19/239 OVERHEADS --------- In theory, it introduces one extra page cache lookup per random read. However the below benchmark shows context readahead to be slightly faster, wondering.. Randomly reading 200MB amount of data on a sparse file, repeat 20 times for each block size. The average throughputs are: original ra context ra gain 4K random reads: 65.561MB/s 65.648MB/s +0.1% 16K random reads: 124.767MB/s 124.951MB/s +0.1% 64K random reads: 162.123MB/s 162.278MB/s +0.1% Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Tested-by: Vladislav Bolkhovitin <vst@vlnb.net> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Split all readahead cases, and move the random one to bottom. No behavior changes. This is to prepare for the introduction of context readahead, and make it easy for inserting accounting/tracing points for each case. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
The counterpart of radix_tree_next_hole(). To be used by context readahead. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Mmap read-around now shares the same code style and data structure with readahead code. This also removes do_page_cache_readahead(). Its last user, mmap read-around, has been changed to call ra_submit(). The no-readahead-if-congested logic is dumped by the way. Users will be pretty sensitive about the slow loading of executables. So it's unfavorable to disabled mmap read-around on a congested queue. [akpm@linux-foundation.org: coding-style fixes] Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
We need this in one particular case and two more general ones. Now we do async readahead for sequential mmap reads, and do it with the help of PG_readahead. For normal reads, PG_readahead is the sufficient condition to do a sequential readahead. But unfortunately, for mmap reads, there is a tiny nuisance: [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4 [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17 [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32 [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~ An unfavorably small readahead. The original dumb read-around size could be more efficient. That happened because ld-linux.so does a read(832) in L1 before mmap(), which triggers a 4-page readahead, with the second page tagged PG_readahead. L0: open("/lib/libc.so.6", O_RDONLY) = 3 L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0 L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000 L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0 L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000 L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000 L7: close(3) = 0 In general, the PG_readahead flag will also be hit in cases - sequential reads - clustered random reads A full readahead size is desirable in both cases. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Auto-detect sequential mmap reads and do readahead for them. The sequential mmap readahead will be triggered when - sync readahead: it's a major fault and (prev_offset == offset-1); - async readahead: minor fault on PG_readahead page with valid readahead state. The benefits of doing readahead instead of read-around: - less I/O wait thanks to async readahead - double real I/O size and no more cache hits The single stream case is improved a little. For 100,000 sequential mmap reads: user system cpu total (1-1) plain -mm, 128KB readaround: 3.224 2.554 48.40% 11.838 (1-2) plain -mm, 256KB readaround: 3.170 2.392 46.20% 11.976 (2) patched -mm, 128KB readahead: 3.117 2.448 47.33% 11.607 The patched (2) has smallest total time, since it has no cache hit overheads and less I/O block time(thanks to async readahead). Here the I/O size makes no much difference, since there's only one single stream. Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB, since the half of the read-around pages will be readahead cache hits. This is going to make _real_ differences for _concurrent_ IO streams. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Linus Torvalds authored
This shouldn't really change behavior all that much, but the single rather complex function with read-ahead inside a loop etc is broken up into more manageable pieces. The behaviour is also less subtle, with the read-ahead being done up-front rather than inside some subtle loop and thus avoiding the now unnecessary extra state variables (ie "did_readaround" is gone). Fengguang: the code split in fact fixed a bug reported by Pavel Levshin: the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in which case the original code will directly jump to no_cached_page reading. Cc: Pavel Levshin <lpk@581.spb.su> Cc: <wli@movementarian.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
The readahead call scheme is error-prone in that it expects the call sites to check for async readahead after doing a sync one. I.e. if (!page) page_cache_sync_readahead(); page = find_get_page(); if (page && PageReadahead(page)) page_cache_async_readahead(); This is because PG_readahead could be set by a sync readahead for the _current_ newly faulted in page, and the readahead code simply expects one more callback on the same page to start the async readahead. If the caller fails to do so, it will miss the PG_readahead bits and never able to start an async readahead. Eliminate this insane constraint by piggy-backing the async part into the current readahead window. Now if an async readahead should be started immediately after a sync one, the readahead logic itself will do it. So the following code becomes valid: (the 'else' in particular) if (!page) page_cache_sync_readahead(); else if (PageReadahead(page)) page_cache_async_readahead(); Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Make sure interleaved readahead size is larger than request size. This also makes the readahead window grow up more quickly. Reported-by: Xu Chenfeng <xcf@ustc.edu.cn> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
(hit_readahead_marker != 0) means the page at @offset is present, so we can search for non-present page starting from @offset+1. Reported-by: Xu Chenfeng <xcf@ustc.edu.cn> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Just in case someone aggressively sets a huge readahead size. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
Impact: code simplification. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
This makes the performance impact of possible mmap_miss wrap around to be temporary and tolerable: i.e. MMAP_LOTSAMISS=100 extra readarounds. Otherwise if ever mmap_miss wraps around to negative, it takes INT_MAX cache misses to bring it back to normal state. During the time mmap readaround will be _enabled_ for whatever wild random workload. That's almost permanent performance impact. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Alexey Dobriyan authored
* create mm/init-mm.c, move init_mm there * remove INIT_MM, initialize init_mm with C99 initializer * unexport init_mm on all arches: init_mm is already unexported on x86. One strange place is some OMAP driver (drivers/video/omap/) which won't build modular, but it's already wants get_vm_area() export. Somebody should look there. [akpm@linux-foundation.org: add missing #includes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Frysinger <vapier.adi@gmail.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yinghai Lu authored
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13484 Peer reported: | The bug is introduced from kernel 2.6.27, if E820 table reserve the memory | above 4G in 32bit OS(BIOS-e820: 00000000fff80000 - 0000000120000000 | (reserved)), system will report Int 6 error and hang up. The bug is caused by | the following code in drivers/firmware/memmap.c, the resource_size_t is 32bit | variable in 32bit OS, the BUG_ON() will be invoked to result in the Int 6 | error. I try the latest 32bit Ubuntu and Fedora distributions, all hit this | bug. |====== |static int firmware_map_add_entry(resource_size_t start, resource_size_t end, | const char *type, | struct firmware_map_entry *entry) and it only happen with CONFIG_PHYS_ADDR_T_64BIT is not set. it turns out we need to pass u64 instead of resource_size_t for that. [akpm@linux-foundation.org: add comment] Reported-and-tested-by: Peer Chen <pchen@nvidia.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Roel Kluin authored
Do not take the size of a pointer to determine the size of the pointed-to type. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Acked-by: Anton Vorontsov <avorontsov@ru.mvista.com> Cc: David Brownell <david-b@pacbell.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Kumar Gala <galak@gate.crashing.org> Acked-by: Grant Likely <grant.likely@secretlab.ca> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Arnd Bergmann authored
PIT_TICK_RATE is currently defined in four architectures, but in three different places. While linux/timex.h is not the perfect place for it, it is still a reasonable replacement for those drivers that traditionally use asm/timex.h to get CLOCK_TICK_RATE and expect it to be the PIT frequency. Note that for Alpha, the actual value changed from 1193182UL to 1193180UL. This is unlikely to make a difference, and probably can only improve accuracy. There was a discussion on the correct value of CLOCK_TICK_RATE a few years ago, after which every existing instance was getting changed to 1193182. According to the specification, it should be 1193181.818181... Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 15 Jun, 2009 10 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6Linus Torvalds authored
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] fix compile error in arch/ia64/mm/extable.c
-
Linus Torvalds authored
Merge branch 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: Logic to move non pinned timers timers: /proc/sys sysctl hook to enable timer migration timers: Identifying the existing pinned timers timers: Framework for identifying pinned timers timers: allow deferrable timers for intervals tv2-tv5 to be deferred Fix up conflicts in kernel/sched.c and kernel/timer.c manually
-
Linus Torvalds authored
Merge branch 'timers-for-linus-clockevents' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-clockevents' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevent: export register_device and delta2ns clockevents: tick_broadcast_device can become static
-
Linus Torvalds authored
Merge branch 'timers-for-linus-clocksource' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-clocksource' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: prevent selection of low resolution clocksourse also for nohz=on clocksource: sanity check sysfs clocksource changes
-
Linus Torvalds authored
Merge branch 'timers-for-linus-ntp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-ntp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ntp: fix comment typos ntp: adjust SHIFT_PLL to improve NTP convergence
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds authored
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1244 commits) pkt_sched: Rename PSCHED_US2NS and PSCHED_NS2US ipv4: Fix fib_trie rebalancing Bluetooth: Fix issue with uninitialized nsh.type in DTL-1 driver Bluetooth: Fix Kconfig issue with RFKILL integration PIM-SM: namespace changes ipv4: update ARPD help text net: use a deferred timer in rt_check_expire ieee802154: fix kconfig bool/tristate muckup bonding: initialization rework bonding: use is_zero_ether_addr bonding: network device names are case sensative bonding: elminate bad refcount code bonding: fix style issues bonding: fix destructor bonding: remove bonding read/write semaphore bonding: initialize before registration bonding: bond_create always called with default parameters x_tables: Convert printk to pr_err netfilter: conntrack: optional reliable conntrack event delivery list_nulls: add hlist_nulls_add_head and hlist_nulls_del ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpcLinus Torvalds authored
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (103 commits) powerpc: Fix bug in move of altivec code to vector.S powerpc: Add support for swiotlb on 32-bit powerpc/spufs: Remove unused error path powerpc: Fix warning when printing a resource_size_t powerpc/xmon: Remove unused variable in xmon.c powerpc/pseries: Fix warnings when printing resource_size_t powerpc: Shield code specific to 64-bit server processors powerpc: Separate PACA fields for server CPUs powerpc: Split exception handling out of head_64.S powerpc: Introduce CONFIG_PPC_BOOK3S powerpc: Move VMX and VSX asm code to vector.S powerpc: Set init_bootmem_done on NUMA platforms as well powerpc/mm: Fix a AB->BA deadlock scenario with nohash MMU context lock powerpc/mm: Fix some SMP issues with MMU context handling powerpc: Add PTRACE_SINGLEBLOCK support fbdev: Add PLB support and cleanup DCR in xilinxfb driver. powerpc/virtex: Add ml510 reference design device tree powerpc/virtex: Add Xilinx ML510 reference design support powerpc/virtex: refactor intc driver and add support for i8259 cascading powerpc/virtex: Add support for Xilinx PCI host bridge ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6Linus Torvalds authored
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6: regulator/max1586: fix V3 gain calculation integer overflow regulator/max1586: support increased V3 voltage range regulator: lp3971 - fix driver link error when built-in. LP3971 PMIC regulator driver (updated and combined version) regulator: remove driver_data direct access of struct device regulator: Set MODULE_ALIAS for regulator drivers regulator: Support list_voltage for fixed voltage regulator regulator: Move regulator drivers to subsys_initcall() regulator: build fix for powerpc - renamed show_state regulator: add userspace-consumer driver Maxim 1586 regulator driver
-
Rusty Russell authored
ad6561df ("module: trim exception table on init free.") put a bogus trim_init_extable() function into ia64 which didn't compile. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Tony Luck <tony.luck@intel.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2Linus Torvalds authored
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (22 commits) nilfs2: support contiguous lookup of blocks nilfs2: add sync_page method to page caches of meta data nilfs2: use device's backing_dev_info for btree node caches nilfs2: return EBUSY against delete request on snapshot nilfs2: modify list of unsupported features in caveats nilfs2: enable sync_page method nilfs2: set bio unplug flag for the last bio in segment nilfs2: allow future expansion of metadata read out via get info ioctl NILFS2: Pagecache usage optimization on NILFS2 nilfs2: remove nilfs_btree_operations from btree mapping nilfs2: remove nilfs_direct_operations from direct mapping nilfs2: remove bmap pointer operations nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function nilfs2: move get block functions in bmap.c into btree codes nilfs2: remove nilfs_bmap_delete_block nilfs2: remove nilfs_bmap_put_block nilfs2: remove header file for segment list operations nilfs2: eliminate removal list of segments nilfs2: add sufile function that can modify multiple segment usages ...
-