Commits · ee926b71e222306ffa09ef4120a26afc972b0c02 · nexedi / linux

18 Jun, 2002 40 commits

[PATCH] Make SMP/APIC config option earlier · ee926b71

Zwane Mwaikambo authored Jun 17, 2002

Patch to reorder the APIC configuration so that dependencies are
determined beforehand for MCE. Keith Owens pointed this out a while back
actually.

ee926b71

[PATCH] drivers/char/rio/func.h needs linux/kdev_t.h · 288aa740

Adrian Bunk authored Jun 17, 2002

It seems func.h needs to include linux/kdev_t.h:

288aa740

[PATCH] missing tag blkdev.h stuff · 813bef20

Jens Axboe authored Jun 17, 2002

For some odd reason, the blkdev.h changes did not get patched into your
tree from the patch I sent?! Anyways, here's that change:

813bef20

[PATCH] make file leases work as they should · 23b9a9a3

Stephen Rothwell authored Jun 17, 2002

This patch fixes the following problems in the file lease code:

- when there are multiple shared leases on a file, all the
  lease holders get notified when someone opens the
  file for writing (used to be only the first).
- when a nonblocking open breaks a lease, it will time out
  as it should (used to never time out).

This should make the leases code more usable (hopefully).

23b9a9a3

[PATCH] Make copy_siginfo_to_user mode explicit · 9d33a271

Stephen Rothwell authored Jun 17, 2002

This patch makes copy_siginfo_to_user explicitly copy the correct
union member.  Previously we were getting the correct result but
really by accident.

9d33a271

[PATCH] 2.5.22 compile fixes · e784b458

Stephen Rothwell authored Jun 17, 2002

I needed these to make 2.5.22 build for me.

e784b458

[PATCH] remove getname32 · 64088985

Stephen Rothwell authored Jun 17, 2002

arch/ppc64/kernel/sys_ppc32.c has a getname32 function.  The only
difference between it and getname() is that it calls do_getname32()
instead of do_getname() (see fs/namei.c).  The difference between
do_getname and do_getname32 is that the former checks to make sure that
the pointer it is passed is less than TASK_SIZE and restricts the length
copied to the lesser of PATH_MAX and (TASK_SIZE - pointer).
do_getname32 uses PAGE_SIZE instead of PATH_MAX.
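
The length restriction described above can be sketched in userspace as follows. This is only an illustration, not the kernel code: `copy_limit()`, `PATH_MAX_DEMO` and `TASK_SIZE_DEMO` are hypothetical names, and the values are made up.

```c
/* Illustrative values only; the real limits are architecture-specific. */
#define PATH_MAX_DEMO   4096UL
#define TASK_SIZE_DEMO  0x80000000UL

/*
 * Return how many bytes may be copied starting at user address `ptr`:
 * 0 if the pointer is at or above the task size limit, otherwise the
 * lesser of PATH_MAX_DEMO and the room left below the limit.
 */
static inline unsigned long copy_limit(unsigned long ptr)
{
    if (ptr >= TASK_SIZE_DEMO)
        return 0;
    unsigned long room = TASK_SIZE_DEMO - ptr;
    return room < PATH_MAX_DEMO ? room : PATH_MAX_DEMO;
}
```

A do_getname32-style variant would, under the same assumptions, only swap PATH_MAX_DEMO for a page-size constant.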

Anton Blanchard says it is OK to remove getname32.

arch/ia64/ia32/sys_ia32.c defined a getname32(), but nothing used it.

This patch removes both.

64088985

Merge master.kernel.org:/home/mingo/bk-sched · 1f60ade2

Linus Torvalds authored Jun 17, 2002

into home.transmeta.com:/home/torvalds/v2.5/linux

1f60ade2

- sync wakeup affinity fix: do not fast-migrate threads · 8509486a

Ingo Molnar authored Jun 18, 2002

  without making sure that the target CPU is allowed.

8509486a

- comment and coding style fixes. · f042243c

Ingo Molnar authored Jun 18, 2002

f042243c

sched_yield() is misbehaving. · 18cb13a6

Ingo Molnar authored Jun 18, 2002

  the current implementation does the following to 'give up' the CPU:

   - it decreases its priority by 1 until it reaches the lowest level
   - it queues the task to the end of the priority queue

  this scheme works fine in most cases, but if sched_yield()-active tasks
  are mixed with CPU-using processes then it's quite likely that the
  CPU-using process is in the expired array. In that case the yield()-ing
  process only requeues itself in the active array - a true context-switch
  to the expired process will only occur once the timeslice of the
  yield()-ing process has expired: in ~150 msecs. As a result, the
  yield()-ing and CPU-using processes use up roughly the same amount of
  CPU-time, which is arguably deficient.

  i've fixed this problem by extending sched_yield() the following way:

  +        * There are three levels of how a yielding task will give up
  +        * the current CPU:
  +        *
  +        *  #1 - it decreases its priority by one. This priority loss is
  +        *       temporary, it's recovered once the current timeslice
  +        *       expires.
  +        *
  +        *  #2 - once it has reached the lowest priority level,
  +        *       it will give up timeslices one by one. (We do not
  +        *       want to give them up all at once, it's gradual,
  +        *       to protect the casual yield()er.)
  +        *
  +        *  #3 - once all timeslices are gone we put the process into
  +        *       the expired array.
  +        *
  +        *  (special rule: RT tasks do not lose any priority, they just
  +        *  roundrobin on their current priority level.)
  +        */
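
  The three levels in the comment above can be sketched as a small state machine. This is a simplified userspace model, not the scheduler code: `struct demo_task`, `demo_yield()` and `LOWEST_PRIO_DEMO` are hypothetical, and the recovery of priority at timeslice expiry is left out.

```c
#include <stdbool.h>

#define LOWEST_PRIO_DEMO 139   /* hypothetical: larger number = lower priority */

struct demo_task {
    int  prio;        /* current dynamic priority */
    int  timeslice;   /* remaining timeslices */
    bool realtime;    /* RT tasks never lose priority */
    bool expired;     /* true once moved to the expired array */
};

/* Apply one yield step following the three levels described above. */
static inline void demo_yield(struct demo_task *t)
{
    if (t->realtime)
        return;                     /* special rule: just round-robin */
    if (t->prio < LOWEST_PRIO_DEMO)
        t->prio++;                  /* level 1: drop priority by one */
    else if (t->timeslice > 0)
        t->timeslice--;             /* level 2: give up one timeslice */
    else
        t->expired = true;          /* level 3: move to the expired array */
}
```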

18cb13a6

Merge master.kernel.org:/home/mingo/bk-misc · 3986594c

Linus Torvalds authored Jun 17, 2002

into home.transmeta.com:/home/torvalds/v2.5/linux

3986594c

- sti() preemption fix. · 5567614b

Ingo Molnar authored Jun 18, 2002

5567614b

- fix preemption bug in cli(). · d76513b3

Ingo Molnar authored Jun 18, 2002

d76513b3

[PATCH] Push BKL into ->permission() calls · 73769d9b

Paul Menage authored Jun 17, 2002

This patch (against 2.5.22) removes the BKL from around the call
to i_op->permission() in fs/namei.c, and pushes the BKL into those
filesystems that have permission() methods that require it.

73769d9b

Merge home.transmeta.com:/home/torvalds/v2.5/scsi-tape · 67aa3988

Linus Torvalds authored Jun 17, 2002

into home.transmeta.com:/home/torvalds/v2.5/linux

67aa3988

[PATCH] 2.5.22 SCSI tape buffering changes · f179b6ce

Kai Mäkisara authored Jun 17, 2002

This contains the following changes to the SCSI tape driver:

- one buffer is used for each tape (no buffer pool)
- buffers allocated when needed and freed when device closed
- common code from read and write moved to a function
- default maximum number of scatter/gather segments increased to 64
- tape status set to "no tape" after successful unload

f179b6ce

[PATCH] Remove sync_timers · 94173f68

Matthew Wilcox authored Jun 17, 2002

Nobody's using it any more, kill:

94173f68

[PATCH] remove tqueue.h from sched.h · 4f9d90c4

Matthew Wilcox authored Jun 17, 2002

This is actually part of the work I've been doing to remove BHs, but it
stands by itself.

4f9d90c4

[PATCH] poll/select fast path · 30724dcd

Andi Kleen authored Jun 17, 2002

This patch streamlines poll and select by adding fast paths for a
small number of descriptors passed. The majority of polls/selects
seem to be of this nature. The main saving comes from not allocating
two pages for wait queue and table, but from using stack allocation
(up to 256 bytes) when only a few descriptors are needed. This makes
it as fast again as 2.0 and even a bit faster because the wait queue
page allocation is avoided too (except when the drivers overflow it).
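
The stack-versus-heap choice behind the fast path can be sketched like this. It is a userspace analogy under stated assumptions, not the patch itself: `demo_poll()` and `STACK_BYTES_DEMO` are hypothetical, and the "readiness check" is a stand-in.

```c
#include <stdlib.h>
#include <string.h>

#define STACK_BYTES_DEMO 256   /* mirrors the 256-byte figure quoted above */

/*
 * Count the non-negative descriptors in fds[0..n). Small requests use an
 * on-stack work buffer; larger ones fall back to the heap. *used_heap
 * reports which path was taken. Returns -1 if the allocation fails.
 */
static inline int demo_poll(const int *fds, size_t n, int *used_heap)
{
    int stack_buf[STACK_BYTES_DEMO / sizeof(int)];
    int *work = stack_buf;
    size_t bytes = n * sizeof(int);

    *used_heap = 0;
    if (bytes > sizeof(stack_buf)) {      /* slow path: too big for the stack */
        work = malloc(bytes);
        if (!work)
            return -1;
        *used_heap = 1;
    }
    memcpy(work, fds, bytes);

    int ready = 0;
    for (size_t i = 0; i < n; i++)
        if (work[i] >= 0)                 /* stand-in for a real readiness check */
            ready++;

    if (*used_heap)
        free(work);
    return ready;
}
```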

select also skips a lot faster over big holes and avoids the separate
pass of determining the max. number of descriptors in the bitmap.

A typical linux system saves a considerable amount of unswappable memory
with this patch, because it usually has 10+ daemons hanging around in poll or
select, each with two pages allocated for data and wait queue.

Some other cleanups.

30724dcd

[PATCH] Move jiffies_64 down into architectures · 86403107

Andi Kleen authored Jun 17, 2002

x86-64 needs its own special declaration of jiffies_64.

Prepare for this by moving the jiffies_64 declaration from
kernel/timer.c down into each architecture.

86403107

[PATCH] x86-64 merge · b068ec41

Andi Kleen authored Jun 17, 2002

x86_64 core updates.

 - Make it compile again (switch_to macros etc., add dummy suspend.h)
 - reenable strength reduce optimization
 - Fix ramdisk (patch from Mikael Pettersson)
 - Some merges from i386
 - Reimplement lazy iobitmap allocation.  I reimplemented it based
   on bcrl's idea.
 - Fix IPC 32bit emulation to actually work and move into own file
 - New fixed mtrr.c from DaveJ ported from 2.4 and reenable it.
 - Move tlbstate into PDA.
 - Add some changes that got lost during the last merge.
 - new memset that seems to actually work.
 - Align signal handler stack frames to 16 bytes.
 - Some more minor bugfixes.

b068ec41

[PATCH] msync(bad address) should return -ENOMEM · 9343c8e2

Andrew Morton authored Jun 17, 2002

Heaven knows why, but that's what the opengroup say, and returning
-EFAULT causes 2.5 to fail one of the Linux Test Project tests.

[ENOMEM]
          The addresses in the range starting at addr and continuing
          for len bytes are outside the range allowed for the address
          space of a process or specify one or more pages that are not
          mapped.

2.4 has it right, but 2.5 doesn't.
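
A small userspace check of the behaviour might look like the sketch below. It assumes a Linux system with anonymous mmap support; `demo_msync_unmapped_errno()` is a hypothetical helper, and with this fix the reported errno is expected to be ENOMEM.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page, unmap it again, then msync() the now-unmapped range.
 * Returns the errno reported by msync(), 0 if msync() unexpectedly
 * succeeded, or -1 if the setup itself failed.
 */
static inline int demo_msync_unmapped_errno(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (munmap(p, (size_t)page) != 0)
        return -1;

    errno = 0;
    if (msync(p, (size_t)page, MS_SYNC) == 0)
        return 0;
    return errno;
}
```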

9343c8e2

[PATCH] Reduce the radix tree nodes to 64 slots · df01cd17

Andrew Morton authored Jun 17, 2002

Reduce the radix tree nodes from 128 slots to 64.

- The main reason for this is that on 64-bit/4k page machines, the
  slab allocator has decided that radix tree nodes will require an
  order-1 allocation.  Shrinking the nodes to 64 slots pulls that back
  to an order-0 allocation.

- On x86 we get fifteen 64-slot nodes per page rather than seven
  128-slot nodes, for a modest memory saving.

- Halving the node size will approximately halve the memory use in
  the worrisome really-large, really-sparse file case.

Of course, the downside is longer tree walks.  Each level of the tree
covers six bits of pagecache index rather than seven.  As ever, I am
guided by Anton's profiling on the 12- and 32-way PPC boxes.
radix_tree_lookup() is currently down in the noise floor.
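
The cost in tree depth is simple arithmetic, sketched here with a hypothetical helper `demo_tree_levels()`. For example, a 4 GiB file with 4 KiB pages has 2^20 pages, so 20 bits of index must be resolved.

```c
/*
 * Number of tree levels needed to cover `index_bits` bits of pagecache
 * index when each level resolves `bits_per_level` bits (6 for 64-slot
 * nodes, 7 for 128-slot nodes). This is a ceiling division.
 */
static inline unsigned demo_tree_levels(unsigned index_bits,
                                        unsigned bits_per_level)
{
    return (index_bits + bits_per_level - 1) / bits_per_level;
}
```

With 20 index bits this gives three levels for 128-slot nodes and four levels for 64-slot nodes.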

Now, there is one special case: one file which is really big and which
is accessed in a random manner and which is accessed very heavily: the
blockdev mapping.  We _are_ showing some locking cost in
__find_get_block (used to be __get_hash_table) and in its call to
find_get_page().  I have a bunch of patches which introduce a generic
per-cpu buffer LRU, and which remove ext2's private bitmap buffer LRUs.
I expect these patches to wipe the blockdev mapping lookup lock contention
off the map,  but I'm awaiting test results from Anton before deciding
whether those patches are worth submitting.

df01cd17

[PATCH] rename get_hash_table() to find_get_block() · 3fb3b749

Andrew Morton authored Jun 17, 2002

Renames the buffer_head lookup function `get_hash_table' to
`find_get_block'.

get_hash_table() is too generic a name. Plus it doesn't even use a hash
any more.

3fb3b749

[PATCH] allow GFP_NOFS allocators to perform swapcache writeout · 493f4988

Andrew Morton authored Jun 17, 2002

One weakness which was introduced when the buffer LRU went away was
that GFP_NOFS allocations became equivalent to GFP_NOIO, because all
writeback goes via writepage/writepages, which requires entry into the
filesystem.

However now that swapout no longer calls bmap(), we can honour
GFP_NOFS's intent for swapcache pages.  So if the allocation request
specifies __GFP_IO and !__GFP_FS, we can wait on swapcache pages and we
can perform swapcache writeout.
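
One way to read that rule is the following sketch. The flag values and `demo_may_writeout()` are hypothetical, and the handling of the other flag combinations is my assumption rather than something this message spells out.

```c
#include <stdbool.h>

/* Hypothetical bit values; the kernel's real definitions differ. */
#define DEMO_GFP_IO  0x1u
#define DEMO_GFP_FS  0x2u

/*
 * May an allocator with this mask start writeout of the given page?
 *  - without the IO bit: never
 *  - IO and FS bits set: yes, for any page
 *  - IO bit but no FS bit: only for swapcache pages, since writing
 *    those no longer requires entering the filesystem
 */
static inline bool demo_may_writeout(unsigned gfp_mask, bool page_is_swapcache)
{
    if (!(gfp_mask & DEMO_GFP_IO))
        return false;
    if (gfp_mask & DEMO_GFP_FS)
        return true;
    return page_is_swapcache;
}
```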

This should strengthen the VM somewhat.

493f4988

[PATCH] remove set_page_buffers() and clear_page_buffers() · 38cb52ca

Andrew Morton authored Jun 17, 2002

The set_page_buffers() and clear_page_buffers() macros are each used in
only one place.  Fold them into their callers.

38cb52ca

[PATCH] take bio.h out of highmem.h · a28b4d4e

Andrew Morton authored Jun 17, 2002

highmem.h includes bio.h, so just about every compilation unit in the
kernel gets to process bio.h.

The patch moves the BIO-related functions out of highmem.h and into
bio-related headers.  The nested include is removed and all files which
need to include bio.h now do so.

a28b4d4e

[PATCH] clean up alloc_buffer_head() · c67b85b0

Andrew Morton authored Jun 17, 2002

alloc_buffer_head() does not need the additional argument - GFP_NOFS is
always correct.

c67b85b0

[PATCH] ext3: clean up journal_try_to_free_buffers() · 1704566f

Andrew Morton authored Jun 17, 2002

Clean up ext3's journal_try_to_free_buffers().  Now that the
releasepage() a_op is non-blocking and need not perform I/O, this
function becomes much simpler.

1704566f

[PATCH] kmap_atomic fix in bio_copy() · 9d8e6506

Andrew Morton authored Jun 17, 2002

bio_copy is doing

	vfrom = kmap_atomic(bv->bv_page, KM_BIO_IRQ);
	vto = kmap_atomic(bbv->bv_page, KM_BIO_IRQ);

which, if I understand atomic kmaps, is incorrect.  Both source and
dest will get the same pte.

The patch creates a separate atomic kmap member for the destination and
source of this copy.
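
The aliasing can be modelled with a toy slot table. This is only an analogy for how a fixed mapping slot per type behaves: `demo_kmap_atomic()`, the `demo_km_type` values and the slot array are all hypothetical.

```c
/* Hypothetical slot types: one fixed mapping slot per type. */
enum demo_km_type { DEMO_KM_SRC, DEMO_KM_DST, DEMO_KM_NR };

static const char *demo_km_slot[DEMO_KM_NR];

/* "Map" a page into the slot for `type` and return a handle to that slot. */
static inline const char **demo_kmap_atomic(const char *page,
                                            enum demo_km_type type)
{
    demo_km_slot[type] = page;   /* overwrites any earlier mapping in this slot */
    return &demo_km_slot[type];
}
```

Mapping source and destination with the same type hands back the same slot twice, so both handles see the second page. Using two distinct types keeps them apart.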

9d8e6506

[PATCH] fix loop driver for large BIOs · 8504e479

Andrew Morton authored Jun 17, 2002

Fix the loop driver for loop-on-blockdev setups.

When presented with a multipage BIO, loop_make_request overindexes the
first page and corrupts kernel memory.  Fix it to walk the individual
pages.

BTW, I suspect the IV handling in loop may be incorrect for multipage
BIOs.  Should we not be recalculating the IV for each page in the BIOs,
or incrementing the offset by the size of the preceding pages, or such?

8504e479

[PATCH] direct-to-BIO I/O for swapcache pages · 88c4650a

Andrew Morton authored Jun 17, 2002

This patch changes the swap I/O handling.  The objectives are:

- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.

I've spent quite some time with swap.  The first patches converted swap to
use block_read/write_full_page().  These were discarded because they are
still using buffer_heads, and a reasonable amount of otherwise unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs.  So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.

A significant thing here is the introduction of "swap extents".  A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors.  It is simply:

	struct swap_extent {
		struct list_head list;
		pgoff_t start_page;
		pgoff_t nr_pages;
		sector_t start_block;
	};

At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
and the block numbers are parsed to generate the device's swap extent list.
This extent list is quite compact - a 512 megabyte swapfile generates about
130 nodes in the list.  That's about 4 kbytes of storage.  The conversion
from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
time.
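
The parsing step can be sketched as a run-length pass over per-page block numbers. This is a simplified model, not setup_swap_extents(): it assumes the filesystem block size equals PAGE_SIZE, stores extents in an array rather than a list, and `struct demo_extent` and `demo_build_extents()` are hypothetical names.

```c
#include <stddef.h>

struct demo_extent {
    unsigned long start_page;   /* first swap page covered */
    unsigned long nr_pages;     /* number of pages in the run */
    unsigned long start_block;  /* disk block backing start_page */
};

/*
 * Coalesce blocks[i] (the disk block backing swap page i) into extents.
 * Returns the number of extents written, or 0 if `out` (capacity `max`)
 * is too small.
 */
static inline size_t demo_build_extents(const unsigned long *blocks, size_t n,
                                        struct demo_extent *out, size_t max)
{
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        if (used > 0 &&
            out[used - 1].start_block + out[used - 1].nr_pages == blocks[i]) {
            out[used - 1].nr_pages++;   /* contiguous: extend the current run */
            continue;
        }
        if (used == max)
            return 0;
        out[used].start_page = i;       /* discontiguity: start a new extent */
        out[used].nr_pages = 1;
        out[used].start_block = blocks[i];
        used++;
    }
    return used;
}
```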

At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.

The advantages of the swap extents are:

1: We never have to run bmap() (ie: read from disk) at swapout time.  So
   S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.

2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
   handled at swapon time.  During normal operation, we just don't care.
   Both types of swapfiles are handled the same way.

3: The extent lists always operate in PAGE_SIZE units.  So the problems of
   going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
   operating code doesn't need to care.

4: Because we don't have to fiddle with different blocksizes, we can go
   direct-to-BIO for swap_readpage() and swap_writepage().  This introduces
   the kernel-wide invariant "anonymous pages never have buffers attached",
   which cleans some things up nicely.  All those block_flushpage() calls in
   the swap code simply go away.

5: The kernel no longer has to allocate both buffer_heads and BIOs to
   perform swapout.  Just a BIO.

6: It permits us to perform swapcache writeout and throttling for
   GFP_NOFS allocations (a later patch).

(Well, there is one sort of anon page which can have buffers: the pages which
are cast adrift in truncate_complete_page() because do_invalidatepage()
failed.  But these pages are never added to swapcache, and nobody except the
VM LRU has to deal with them).

The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile.  Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.

If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
the swapon attempt will fail.

The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
swapfile).  It needs to be consulted once for each page within
swap_readpage() and swap_writepage().  Hence there is a risk that we could
blow significant amounts of CPU walking that list.  However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search.  Empirical testing indicates that this is
wildly effective - the average length of the list walk in map_swap_page() is
0.3 iterations per page, with a 130-element list.
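
The "where we found the last block" idea can be sketched as below. It is a userspace model, not map_swap_page() itself: the extents live in an array, the search wraps around from the cached position, and `struct demo_swap_map`, `demo_map_swap_page()` and the `steps` counter are hypothetical.

```c
#include <stddef.h>

struct demo_swap_extent {
    unsigned long start_page;
    unsigned long nr_pages;
    unsigned long start_block;
};

struct demo_swap_map {
    const struct demo_swap_extent *ext;
    size_t nr;            /* number of extents */
    size_t cur;           /* extent that satisfied the last lookup */
    unsigned long steps;  /* extents skipped so far, to measure walk length */
};

/* Map a swap page to a disk block; returns 0 on success, -1 if unmapped. */
static inline int demo_map_swap_page(struct demo_swap_map *m,
                                     unsigned long page, unsigned long *block)
{
    size_t i = m->cur;    /* start from the cached position */

    for (size_t tried = 0; tried < m->nr; tried++) {
        const struct demo_swap_extent *e = &m->ext[i];

        if (page >= e->start_page && page - e->start_page < e->nr_pages) {
            m->cur = i;   /* remember the hit for the next lookup */
            *block = e->start_block + (page - e->start_page);
            return 0;
        }
        i = (i + 1) % m->nr;
        m->steps++;
    }
    return -1;
}
```

Sequential or clustered swap I/O keeps hitting the cached extent or its neighbour, which is why the average walk stays short.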

It _could_ be that some workloads do start suffering long walks in that code,
and perhaps a tree would be needed there.  But I doubt that, and if this is
happening then it means that we're seeking all over the disk for swap I/O,
and the list walk is the least of our problems.

rw_swap_page_nolock() now takes a page*, not a kernel virtual address.  It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself.  Which is all a much better interface.

Support for type 0 swap has been removed.  Current versions of mkswap(8) seem
to never produce v0 swap unless you explicitly ask for it, so I doubt if this
will affect anyone.  If you _do_ have a type 0 swapfile, swapon will fail and
the message

	version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3

is printed.  We can remove that code for real later on.  Really, all that
swapfile header parsing should be pushed out to userspace.

This code always uses single-page BIOs for swapin and swapout.  I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs.  It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.

I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.

If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem.  It was always thus, but it
is much, much easier to do now.  Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.

88c4650a

[PATCH] leave swapcache pages unlocked during writeout · 3ab86fb0

Andrew Morton authored Jun 17, 2002

Convert swap pages so that they are PageWriteback and !PageLocked while
under writeout, like all other block-backed pages.  (Network
filesystems aren't doing this yet - their pages are still locked while
under writeout)

3ab86fb0

[PATCH] mark_buffer_dirty_inode() speedup · 43967af3

Andrew Morton authored Jun 17, 2002

buffer_insert_list() is showing up on Anton's graphs. It'll be via
ext2's mark_buffer_dirty_inode() against indirect blocks. If the
buffer is already on an inode queue, we know that it is on the correct
inode's queue so we don't need to re-add it.

43967af3

[PATCH] go back to 256 requests per queue · 374cac7a

Andrew Morton authored Jun 17, 2002

The request queue was increased from 256 slots to 512 in 2.5.20.  The
throughput of `dbench 128' on Randy's 384 megabyte machine fell 40%.

We do need to understand why that happened, and what we can learn from
it.  But in the meanwhile I'd suggest that we go back to 256 slots so
that this known problem doesn't impact people's evaluation and tuning
of 2.5 performance.

374cac7a

[PATCH] mark_buffer_dirty() speedup · 7a1a7f5b

Andrew Morton authored Jun 17, 2002

mark_buffer_dirty() is showing up on Anton's graphs.  Avoiding the
buslocked RMW if the buffer is already dirty should fix that up.
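
The test-before-set pattern can be sketched with C11 atomics. This is a userspace analogy, not the buffer_head code: `demo_mark_dirty()` is a hypothetical helper operating on a bare flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Set the dirty flag; return true if this call made the transition from
 * clean to dirty. The plain load lets the common already-dirty case skip
 * the atomic read-modify-write entirely.
 */
static inline bool demo_mark_dirty(atomic_bool *dirty)
{
    if (atomic_load_explicit(dirty, memory_order_relaxed))
        return false;                       /* already dirty: no locked RMW */
    return !atomic_exchange(dirty, true);   /* RMW only when it looked clean */
}
```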

7a1a7f5b

[PATCH] grab_cache_page_nowait deadlock fix · 85bfa7dc

Andrew Morton authored Jun 17, 2002

- If grab_cache_page_nowait() is to be called while holding a lock on
  a different page, it must perform memory allocations with GFP_NOFS.
  Otherwise it could come back onto the locked page (if it's dirty) and
  deadlock.

  Also tidy this function up a bit - the checks in there were overly
  paranoid.

- In a few places, look to see if we can avoid a buslocked cycle
  and dirtying of a cacheline.

85bfa7dc

[PATCH] update_atime cleanup · 386b1f74

Andrew Morton authored Jun 17, 2002

Remove unneeded do_update_atime(), and convert update_atime() to C.

386b1f74

[PATCH] ext3 corruption fix · afb51f81

Andrew Morton authored Jun 17, 2002

Stephen and Neil Brown recently worked this out.  It's a
rare situation which only affects data=journal mode.

Fix problem in data=journal mode where writeback could be left pending on a
journaled, deleted disk block.  If that block then gets reallocated, we can
end up with an alias in which the old data can be written back to disk over
the new.  Thanks to Neil Brown for spotting this and coming up with the
initial fix.

afb51f81