Commits · 55478b6c1b34f7096e3a3e09d2055350625d72df · nexedi / linux

14 Dec, 2002 40 commits

[PATCH] remove vm_area_struct.vm_raend · 55478b6c

Andrew Morton authored Dec 14, 2002

Remove the unused vm_area_struct.vm_raend.

If someone wants to tune per-VMA readaround then they can alter
vma->vm_file->f_ra.ra_pages.

55478b6c

[PATCH] ext3: fix error-path bh leak · 5e352342
Andrew Morton authored Dec 14, 2002
```
It is missing a brelse() on an error path.
```
5e352342
[PATCH] Add prefetching to get_page_state() · 351419e2
Andrew Morton authored Dec 14, 2002
```
Fetch the next cacheline as we're counting up the fields in this one.
```
351419e2

[PATCH] ext2 synchronous mount fix · 7cc9ee3d

Andrew Morton authored Dec 14, 2002

The optimisation for synchronous mounts was only correct for S_ISREG
files. Directories do not pass through generic_osync_inode() and we
still need to synchronously write out their indirect blocks.

7cc9ee3d

[PATCH] pad pte_chains out to a cacheline · c566bb56
Andrew Morton authored Dec 14, 2002
```
In PAE mode there is a 4-byte gap and they're not aligning correctly.
```
c566bb56
[PATCH] Fix off-by-one in the page allocator · 90b3b976
Andrew Morton authored Dec 14, 2002
```
From Hugh.

Be consistent in deciding when we are below the zone allocation
thresholds.
```
90b3b976

[PATCH] tidier atomic check in mempool_alloc() · 36aed1f9

Andrew Morton authored Dec 14, 2002

From Hugh.

Be more explicit in the "can we sleep" test.  It doesn't change
anything unless someone is performing __GFP_IO && !__GFP_WAIT
allocations, which is nonsensical.

36aed1f9

[PATCH] provide a default super_block_operations · b88f83d5

Andrew Morton authored Dec 14, 2002

A little cleanup suggested by Chris Mason or Al Viro.

Quite a number of codepaths are testing whether a superblock has a
non-null ->s_op pointer.  We can remove all those by making sure that
all superblocks have a valid ->s_op.
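
The cleanup can be pictured with a small sketch. The types below are hypothetical, heavily simplified stand-ins (not the kernel's real `struct super_block`), and `init_super()` and `call_write_super()` are illustrative names only.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sb_sketch;

struct sb_ops_sketch {
    void (*write_super)(struct sb_sketch *sb);
};

struct sb_sketch {
    const struct sb_ops_sketch *s_op;
};

/* Shared table with every method left NULL. */
static const struct sb_ops_sketch default_op = { NULL };

/* Install the filesystem's table, or the default one if it has none. */
static void init_super(struct sb_sketch *sb, const struct sb_ops_sketch *ops)
{
    sb->s_op = ops ? ops : &default_op;
}

/*
 * Callers test only the method, never the table pointer itself.
 * Returns 1 if the method was called, 0 if the filesystem has none.
 */
static int call_write_super(struct sb_sketch *sb)
{
    if (!sb->s_op->write_super)
        return 0;
    sb->s_op->write_super(sb);
    return 1;
}
```

With the default table in place, `sb->s_op` can be dereferenced unconditionally and only the per-method NULL test remains.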

b88f83d5

[PATCH] madvise_willneed() maximum readahead checking · 654107b9

Andrew Morton authored Dec 14, 2002

madvise_willneed() currently has a very strange check on how much readahead
it is prepared to do.

  It is based on the user's rss limit.  But this is usually enormous, and
  the user isn't necessarily going to map all that memory at the same time
  anyway.

  And the logic is wrong - it is comparing rss (which is in bytes) with
  `end - start', which is in pages.

  And it returns -EIO on error, which is not mentioned in the Open Group
  spec and doesn't make sense.


This patch takes it all out and applies the same upper limit as is used in
sys_readahead() - half the inactive list.
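
A minimal sketch of the new limit, assuming the request is already expressed in page indices. The helper names are hypothetical and the inactive-list size is passed in as a plain number.

```c
/* Hypothetical helper: the upper bound described above, in pages. */
static unsigned long max_readahead_pages(unsigned long nr_inactive_pages)
{
    return nr_inactive_pages / 2;
}

/*
 * Clamp a WILLNEED request.  start_page and end_page are page indices,
 * so the request size and the limit are both in pages and can be
 * compared directly (the old check mixed bytes and pages).
 */
static unsigned long willneed_pages(unsigned long start_page,
                                    unsigned long end_page,
                                    unsigned long nr_inactive_pages)
{
    unsigned long want = end_page - start_page;
    unsigned long limit = max_readahead_pages(nr_inactive_pages);

    return want < limit ? want : limit;
}
```

An oversized request is trimmed rather than failed, so no error code is involved in this sketch.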

654107b9

[PATCH] remove a vm debug check · d8259d09

Andrew Morton authored Dec 14, 2002

This ad-hoc assertion is no longer true.  It can trigger if all zones
are in the `all unreclaimable' state, for example when testing with a
tiny amount of physical memory.

d8259d09

[PATCH] limit pinned memory due to readahead · 234931ab

Andrew Morton authored Dec 14, 2002

readahead allocates all the pages before starting I/O. Potentially bad
if someone is performing huge reads with madvise or sys_readahead().

So the patch just busts that up into two-megabyte units.
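
The chunking can be sketched as a simple loop. This assumes 4 KiB pages, and `submit_fn` is a hypothetical callback standing in for "allocate these pages and start I/O on them".

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096UL  /* assumption: 4 KiB pages */
#define CHUNK_PAGES ((2UL * 1024 * 1024) / PAGE_SIZE_SKETCH)  /* 512 pages */

/* Hypothetical callback: allocate and submit one batch of pages. */
typedef void (*submit_fn)(unsigned long first_page, unsigned long nr_pages);

/*
 * Split a readahead request into two-megabyte units so that at most
 * one unit's worth of pages is allocated before its I/O is started.
 * Returns the number of units submitted.
 */
static unsigned long readahead_in_chunks(unsigned long first_page,
                                         unsigned long nr_pages,
                                         submit_fn submit)
{
    unsigned long chunks = 0;

    while (nr_pages) {
        unsigned long this_chunk =
            nr_pages < CHUNK_PAGES ? nr_pages : CHUNK_PAGES;

        if (submit)
            submit(first_page, this_chunk);
        first_page += this_chunk;
        nr_pages -= this_chunk;
        chunks++;
    }
    return chunks;
}
```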

234931ab

[PATCH] don't apply file size rlimits to blockdevs · 67de87c5

Andrew Morton authored Dec 14, 2002

generic_file_write()'s rlimit checks are preventing writes to large
offsets into blockdevs:

# ulimit -f 10000
# dd if=/dev/zero of=/dev/sde5 bs=1k count=1 seek=1000000
zsh: file size limit exceeded

So don't apply that check if it's a blockdev.

The patch also caches the S_ISBLK result in a local.
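
A sketch of the check, using a hypothetical `inode_sketch` type in which `is_blockdev` stands in for the `S_ISBLK()` test. The limit is given in bytes for simplicity.

```c
/* Hypothetical stand-in for an inode; is_blockdev models S_ISBLK(i_mode). */
struct inode_sketch {
    int is_blockdev;
};

/*
 * Returns 1 if a write of `count` bytes at `pos` should be refused
 * because of the file size limit, 0 if it may proceed.
 */
static int write_exceeds_limit(const struct inode_sketch *inode,
                               unsigned long long pos,
                               unsigned long long count,
                               unsigned long long limit_bytes)
{
    /* Cache the block-device test in a local, as the patch does. */
    const int isblk = inode->is_blockdev;

    if (isblk)
        return 0;  /* the limit does not apply to block devices */
    return pos + count > limit_bytes;
}
```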

67de87c5

[PATCH] ext2/ext3_free_blocks() extra check · db0d232c

Andrew Morton authored Dec 14, 2002

From Andreas Dilger.

Additional sanity checks in the ext2 and ext3 block allocators: if
someone tries to free a negative number of blocks, detect and handle
that rather than wrecking the fs.
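
One way such a check can look, as a sketch only: it assumes the block count is held in an unsigned type, so that a "negative" count shows up as a huge value that makes `block + count` wrap. The function and parameter names are hypothetical.

```c
/*
 * Returns 1 if the range [block, block + count) is plausible,
 * 0 if it should be rejected and reported as an error.
 */
static int free_range_is_sane(unsigned long block, unsigned long count,
                              unsigned long first_data_block,
                              unsigned long blocks_count)
{
    if (block < first_data_block)
        return 0;               /* starts before the data area */
    if (block + count < block)
        return 0;               /* wrapped: count was "negative" */
    if (block + count > blocks_count)
        return 0;               /* runs past the end of the fs */
    return 1;
}
```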

db0d232c

[PATCH] bootmem allocator merging fix · 344391c7

Andrew Morton authored Dec 14, 2002

Patch from "Juan M. de la Torre" <jmtorre@gmx.net>

If the requested alignment is PAGE_SIZE, it is impossible to merge with the
previous allocation request, because the allocated area must begin on a
page boundary.

344391c7

[PATCH] Don't inherit mm->def_flags across forks · 4d840923
Andrew Morton authored Dec 14, 2002
```
Prevents children from inheriting mlockall(MCL_FUTURE).
Standards-friendly, and 2.4 has it.
```
4d840923

[PATCH] remove PF_SYNC · 577c516f

Andrew Morton authored Dec 14, 2002

current->flags:PF_SYNC was a hack I added because I didn't want to
change all ->writepage implementations.

It's foul.  And it means that if someone happens to run direct page
reclaim within the context of (say) sys_sync, the writepage invocations
from the VM will be treated as "data integrity" operations, not "memory
cleansing" operations, which would cause latency.

So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
It is the `writeback_control' structure which contains the full context
information about why writepage was called.

The initial version of this patch just passed in a bare `int sync', but
the XFS team need more info so they can perform writearound from within
page reclaim.

The patch also adds writeback_control.for_reclaim, so writepage
implementations can inspect that to work out the call context rather
than peeking at current->flags:PF_MEMALLOC.

577c516f

[PATCH] Reserve an additional transaction block in · 8725c3fc

Andrew Morton authored Dec 14, 2002

Under rare conditions (filesystem corruption, really) it is possible
for ext3_dirty_inode() to require _two_ blocks for the transaction: one
for the inode and one to update the superblock - to set
EXT3_FEATURE_RO_COMPAT_LARGE_FILE.  This causes the filesystem to go
BUG.

So reserve an additional block for that eventuality.

8725c3fc

[PATCH] Set a minimum hash table size for wait_on_page() · e4406863

Andrew Morton authored Dec 14, 2002

Fixes the problem identified by Miles Bader on extremely small zones:
calling hash_long with `bits = 0' is treated as `bits = 32'.

So don't permit the zone to have a one-slot waitqueue hashtable.
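
The effect can be modelled in portable C. Shifting a 32-bit value by 32 is undefined in C, so the sketch below models a CPU that masks the shift count to five bits. The multiplier constant and both function names are assumptions made for illustration.

```c
#include <stdint.h>

#define GOLDEN_RATIO_PRIME_32 0x9e370001u  /* assumption: sketch multiplier */

/*
 * Model of a multiplicative hash that keeps the top `bits` bits.
 * The "& 31u" models hardware that masks the shift count, which is
 * why bits == 0 ends up behaving like bits == 32.
 */
static uint32_t hash_long_model(uint32_t val, unsigned int bits)
{
    uint32_t hash = val * GOLDEN_RATIO_PRIME_32;

    return hash >> ((32u - bits) & 31u);
}

/* Hypothetical guard: never size the table with zero hash bits. */
static unsigned int wait_table_bits_clamped(unsigned int bits)
{
    return bits ? bits : 1;
}
```

With zero bits the model returns the full 32-bit hash, which is far outside a one-slot table. With the clamp the result is always a valid index into a two-slot table.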

e4406863

[PATCH] Add /proc/sys/vm/lower_zone_protection · c1859213

Andrew Morton authored Dec 14, 2002

This allows us to control the aggressiveness of the lower-zone defense
algorithm.  The `incremental min'.  For workloads which are using a
serious amount of mlocked memory, a few megabytes is not enough.

So the `lower_zone_protection' tunable allows the administrator to
increase the amount of protection which lower zones receive against
allocations which _could_ use higher zones.

The default value of lower_zone_protection is zero, giving unchanged
behaviour.  We should not normally make large amounts of memory
unavailable for pagecache just in case someone mlocks many hundreds of
megabytes.

c1859213

[PATCH] fs-writeback rework. · 20b96b52

Andrew Morton authored Dec 14, 2002

I've revisited all the superblock->inode->page writeback paths.  There
were several silly things in there, and things were not as clear as they
could be.

scenario 1: create and dirty a MAP_SHARED segment over a sparse file,
then exit.

  All the memory turns into dirty pagecache, but the kupdate function
  only writes it out at a trickle - 4 megabytes every thirty seconds.
  We should sync it all within 30 seconds.

  What's happening is that when writeback tries to write those pages,
  the filesystem needs to instantiate new blocks for them (they're over
  holes).  The filesystem runs mark_inode_dirty() within the writeback
  function.

  This redirtying of the inode while we're writing it out triggers
  some livelock avoidance code in __sync_single_inode().  That function
  says "ah, someone redirtied the file while I was writing it.  Let's
  move the file to the new end of the superblock dirty list and write
  it out later." Problem is, writeback dirtied the inode itself.

  (It is rather silly that mark_inode_dirty() sets I_DIRTY_PAGES when
  clearly no pages have been dirtied.  Fixing that up would be a
  largish work, so work around it here).

  So this patch just removes the livelock avoidance from
  __sync_single_inode().  It is no longer needed anyway - writeback
  livelock is now avoided (in all writeback paths) by writing a finite
  number of pages.

scenario 2: an application is continuously dirtying a 200 megabyte
file, and your disk has a bandwidth of less than 40 megabytes/sec.

  What happens is that once 30 seconds passes, pdflush starts writing
  out the file.  And because that writeout will take more than five
  seconds (a `kupdate' interval), pdflush just keeps writing it out
  forever - continuous I/O.

  What we _want_ to happen is that the 200 megabytes gets written,
  and then IO stops for thirty seconds (minus the writeout period).  So
  the file is fully synced every thirty seconds.

The patch solves this by using mapping->io_pages more intelligently.
When the time comes to write the file out, move all the dirty pages
onto io_pages.  That is a "batch of pages for this kupdate round".
When io_pages is empty, we know we're done.

The address_space_operations.writepages() API is changed!  It now only
needs to write the pages which the caller placed on mapping->io_pages.

This conceptually cleans things up a bit, by more clearly defining the
role of ->io_pages, and the motion between the various mapping lists.

The treatment of sb->s_dirty and sb->s_io is now conceptually identical
to mapping->dirty_pages and mapping->io_pages: move the items-to-be
written onto ->s_io/io_pages, and walk that list.  As inodes (or pages)
are written, move them over to the clean/locked/dirty lists.
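
The list motion described above can be sketched with plain singly linked lists. The structures and function names are hypothetical simplifications, and the order of pages within a list is not preserved here.

```c
#include <stddef.h>

/* Hypothetical, simplified page and mapping. */
struct pg {
    struct pg *next;
};

struct mapping_sketch {
    struct pg *dirty_pages;
    struct pg *io_pages;
    struct pg *clean_pages;
};

static void list_push(struct pg **head, struct pg *p)
{
    p->next = *head;
    *head = p;
}

static struct pg *list_pop(struct pg **head)
{
    struct pg *p = *head;

    if (p)
        *head = p->next;
    return p;
}

/* Start a round: everything dirty right now becomes this round's batch. */
static void start_writeback_round(struct mapping_sketch *m)
{
    struct pg *p;

    while ((p = list_pop(&m->dirty_pages)) != NULL)
        list_push(&m->io_pages, p);
}

/* Write the batch.  When io_pages is empty, the round is done. */
static unsigned long write_io_pages(struct mapping_sketch *m)
{
    unsigned long written = 0;
    struct pg *p;

    while ((p = list_pop(&m->io_pages)) != NULL) {
        /* real code would start I/O on the page here */
        list_push(&m->clean_pages, p);
        written++;
    }
    return written;
}
```

A page dirtied after `start_writeback_round()` stays on `dirty_pages` and waits for the next round, which is what bounds the work done per round.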

Oh, scenario 3: start an app which continuously overwrites a 5 meg
file.  Wait five seconds, start another, wait 5 seconds, start another.
What we _should_ see is three 5-meg writes, five seconds apart, every
thirty seconds.  That did all sorts of odd things.  It now does the
right thing.

20b96b52

[PATCH] hugetlb fixes · 21c2baef

Andrew Morton authored Dec 14, 2002

From Rohit

1) hugetlbfs_zero_setup returns ENOMEM in case the request size can
   not be easily handled.

2) Preference is given to LOW_MEM while freeing the pages from
   hugetlbpage free list.

21c2baef

[PATCH] vm accounting fixes and addition · c720c50a

Andrew Morton authored Dec 14, 2002

- /proc/vmstat:pageoutrun and /proc/vmstat:allocstall are always
  identical.  Rework this so that

  - "allocstall" is the number of times a page allocator ran direct reclaim

  - "pageoutrun" is the number of times kswapd ran page reclaim

- Add a new stat: "pgrotated".  The number of pages which were
  rotated to the tail of the LRU for immediate reclaim by
  rotate_reclaimable_page().

- Document things a bit.

c720c50a

[PATCH] copy_user checks in filldir() · 54cbdcfd
Andrew Morton authored Dec 14, 2002
```
Check for usercopy faults in filldir().
```
54cbdcfd

[PATCH] implement ext3_sync_fs · 012af46c

Andrew Morton authored Dec 14, 2002

ext3_sync_fs will start a commit and will wait on that commit.  This
means that on its return, all journalled file data has been dirtied and
exposed to sync_inodes_sb().  Which is sufficient to fix the umount
data loss problem.

012af46c

[PATCH] Add a sync_fs super_block operation · 75f19a40

Andrew Morton authored Dec 14, 2002

This is infrastructure for fixing the journalled-data ext3 unmount data
loss problem. It was sent for comment to linux-fsdevel a week ago; there
was none.

Add a `sync_fs' superblock operation whose mandate is to perform
filesystem-specific operations to ensure a successful sync.

It is called in two places:

1: fsync_super() - for umount.

2: sys_sync() - for global sync.

In the sys_sync() case we call all the ->write_super() methods first.
write_super() is an async flushing operation.  It should not block.

After that, we call all the ->sync_fs functions.  This is independent
of the state of s_dirt!  That was all confused up before, and in this
patch ->write_super() and ->sync_fs() are quite separate.

With ext3 as an example, the initial ->write_super() will start a
transaction, but will not wait on it.  (But only if s_dirt was set!)

The first ->sync_fs() call will get the IO underway.

The second ->sync_fs() call will wait on the IO.

And we really do need to be this elaborate, because all the testing of
s_dirt in there makes ->write_super() an unreliable way of detecting
when the VFS is trying to sync the filesystem.
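
The sys_sync() ordering can be sketched as three passes over the superblocks. The types are hypothetical simplifications, and passing `wait == 0` on the first ->sync_fs() pass and `wait == 1` on the second is an assumption about how the two calls are told apart. The demo callbacks only record the call order.

```c
#include <stddef.h>

/* Hypothetical, simplified superblock and operations table. */
struct sbs;

struct sbs_ops {
    void (*write_super)(struct sbs *sb);
    int (*sync_fs)(struct sbs *sb, int wait);
};

struct sbs {
    const struct sbs_ops *s_op;
    int s_dirt;
    struct sbs *next;
};

static void sync_filesystems_sketch(struct sbs *head)
{
    struct sbs *sb;

    /* Pass 1: async flush, only for superblocks marked dirty. */
    for (sb = head; sb; sb = sb->next)
        if (sb->s_dirt && sb->s_op->write_super)
            sb->s_op->write_super(sb);

    /* Pass 2: get the I/O underway, independent of s_dirt. */
    for (sb = head; sb; sb = sb->next)
        if (sb->s_op->sync_fs)
            sb->s_op->sync_fs(sb, 0);

    /* Pass 3: wait on the I/O. */
    for (sb = head; sb; sb = sb->next)
        if (sb->s_op->sync_fs)
            sb->s_op->sync_fs(sb, 1);
}

/* Demo callbacks: append one character per call to a trace string. */
static char trace[32];
static size_t trace_len;

static void record(char c)
{
    if (trace_len + 1 < sizeof(trace)) {
        trace[trace_len++] = c;
        trace[trace_len] = '\0';
    }
}

static void demo_write_super(struct sbs *sb)
{
    sb->s_dirt = 0;
    record('w');
}

static int demo_sync_fs(struct sbs *sb, int wait)
{
    (void)sb;
    record(wait ? 'S' : 's');
    return 0;
}

static const struct sbs_ops demo_ops = { demo_write_super, demo_sync_fs };
```

In this sketch a clean superblock skips ->write_super() but still receives both ->sync_fs() calls, which mirrors the point that ->sync_fs() does not depend on s_dirt.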

75f19a40

[PATCH] handle overflows in radix_tree_gang_lookup() · 7404e32c

Andrew Morton authored Dec 14, 2002

Fix a radix-tree bug spotted by Vladimir Saveliev <vs@namesys.com>.

Each step in the radix tree spans six address bits.  So a height=6 tree
spans 36-bits worth of nodes.

On 32-bit machines radix_tree_gang_lookup() doesn't handle this right -
at the 12TB mark it wraps back to zero, and returns pages at quite
wrong indices.
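
The class of overflow involved can be shown with a small sketch. This is not the patch itself: `maxindex_sketch()` is a hypothetical helper that computes the largest index a tree of a given height can hold and saturates when the tree spans more bits than an `unsigned long` has.

```c
#include <limits.h>

#define RADIX_TREE_MAP_SHIFT 6  /* six address bits per level */

/*
 * Largest index addressable by a tree of the given height.
 * Without the guard, a shift of 36 on a 32-bit unsigned long is
 * undefined and in practice wraps, producing a bogus small limit.
 */
static unsigned long maxindex_sketch(unsigned int height)
{
    unsigned int shift = height * RADIX_TREE_MAP_SHIFT;

    if (shift >= sizeof(unsigned long) * CHAR_BIT)
        return ULONG_MAX;  /* tree covers the whole index space */
    return (1UL << shift) - 1;
}
```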

The patch fixes all that up, and tidies a couple of things.

A user-space test harness was developed so that the code can be sanely
tested.  It is at

	http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz

7404e32c

[PATCH] make sure all PMDs are allocated under PAE mode · 2134c937

Andrew Morton authored Dec 14, 2002

Patch from Martin Bligh and Dave Hansen

If a PAE machine has 1G of memory and you set PAGE_OFFSET to 2G, the
kernel will only instantiate a PMD to cover the 2G-3G region.  But
another PMD is needed for the 3G-4G region for the APIC and possibly an
extended vmalloc region.

So the patch changes the code to instantiate PMDs out to the end of
physical memory.
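
The arithmetic behind the example can be sketched as follows, assuming the usual PAE layout in which each of the four top-level entries points to one PMD page covering 1 GiB. Both helper names are hypothetical, and the second helper reflects one reading of the fix: cover all of kernel virtual space rather than only the part that maps RAM.

```c
#include <stdint.h>

#define PGDIR_SHIFT_PAE 30    /* each top-level entry maps 1 GiB */
#define PTRS_PER_PGD_PAE 4u   /* four entries cover 4 GiB */

/* PMDs needed if only the directly mapped RAM is covered. */
static unsigned int pmds_covering_ram(uint64_t page_offset, uint64_t ram_bytes)
{
    uint64_t first = page_offset >> PGDIR_SHIFT_PAE;
    uint64_t last = (page_offset + ram_bytes - 1) >> PGDIR_SHIFT_PAE;

    return (unsigned int)(last - first + 1);
}

/* PMDs needed to cover everything from PAGE_OFFSET up to 4 GiB. */
static unsigned int pmds_covering_kernel_space(uint64_t page_offset)
{
    return PTRS_PER_PGD_PAE - (unsigned int)(page_offset >> PGDIR_SHIFT_PAE);
}
```

For PAGE_OFFSET=2G and 1G of memory the first helper gives one PMD and the second gives two. For PAGE_OFFSET=3G both give one, consistent with the change being a no-op there.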

It's a no-op for PAGE_OFFSET=3G, and _could_ be part of the
CONFIG_PAGE_OFFSET patch.  But it seems a reasonable generalisation
anyway.

2134c937

[PATCH] show_free_areas extensions · b7fdef78

Andrew Morton authored Dec 14, 2002

Ancient patch from Bill Irwin

The patch is intended to show improved information about where the
memory went during OOM-killing events.

- when the OOM killer fails and the system panics, call
  show_free_areas()

- reorganize show_free_areas() to use for_each_zone()

- add per-cpu stats to show_free_areas()

- tag output from show_free_areas() with node and zone information

b7fdef78

[PATCH] Remove fail_writepage, redux · 3e9afe4c

Andrew Morton authored Dec 14, 2002

fail_writepage() does not work.  Its activate_page() call cannot
activate the page because it is not on the LRU.

So perform that function (more efficiently) in the VM.  Remove
fail_writepage() and, if the filesystem does not implement
->writepage() then activate the page from shrink_list().

A special case is tmpfs, which does have a writepage, but which
sometimes wants to activate the pages anyway.  The most important case
is when there is no swap online and we don't want to keep all those
pages on the inactive list.  So just as a tmpfs special-case, allow
writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.

Also, the whole idea of allowing ->writepage() to return -EAGAIN, and
handling that in the caller has been reverted.  If a writepage()
implementation wants to back out and not write the page, it must
redirty the page, unlock it and return zero.  (This is Hugh's preferred
way).

And remove the now-unneeded shmem_writepages() - shmem inodes are
marked as `memory backed' so it will not be called.

And remove the test for non-null ->writepage() in generic_file_mmap().
Memory-backed files _are_ mmappable, and they do not have a
writepage().  It just isn't called.

So the locking rules for writepage() are unchanged.  They are:

- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written.

But there is a new, special, hidden, undocumented, secret hack for
tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move
the page to the active list.  The page must be kept locked in this one
case.
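
The contract can be sketched from both sides. Everything here is a hypothetical simplification: `wp_page` is not the kernel's `struct page`, and `WRITEPAGE_ACTIVATE_SKETCH` uses an arbitrary value rather than the kernel's constant.

```c
#define WRITEPAGE_ACTIVATE_SKETCH 1  /* hypothetical value for the sketch */

/* Hypothetical page state flags. */
struct wp_page {
    int locked;
    int dirty;
    int active;
};

/*
 * tmpfs-like writepage.  Called with the page locked.
 * Normal case: write the page, unlock it, return 0.
 * No swap online: redirty the page, keep it locked, and ask the
 * VM to move it to the active list.
 */
static int tmpfs_like_writepage(struct wp_page *page, int swap_available)
{
    if (!swap_available) {
        page->dirty = 1;
        return WRITEPAGE_ACTIVATE_SKETCH;
    }
    page->dirty = 0;   /* pretend the page was written */
    page->locked = 0;
    return 0;
}

/* Caller side, roughly what the VM does with the return value. */
static void vm_after_writepage(struct wp_page *page, int ret)
{
    if (ret == WRITEPAGE_ACTIVATE_SKETCH) {
        page->active = 1;
        page->locked = 0;  /* the VM unlocks, since writepage did not */
    }
}
```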

3e9afe4c

[PATCH] skip memory-backed filesystems in writeback · 660282aa

Andrew Morton authored Dec 14, 2002

There's no point in walking through a lot of tmpfs or ramdisk pages when
we're trying to clean memory. So if a memory-backed inode is
discovered during writeback, skip the entire superblock.

660282aa

[PATCH] semtimedop - semop() with a timeout · f99a1a55

Andrew Morton authored Dec 14, 2002

Patch from Mark Fasheh <mark.fasheh@oracle.com> (plus a few cleanups
and a speedup from yours truly)

Adds the semtimedop() function - semop with a timeout.  Solaris has
this.  It's apparently worth a couple of percent to Oracle throughput
and given the simplicity, that is sufficient benefit for inclusion IMO.

This patch hooks up semtimedop() only for ia64 and ia32.

f99a1a55

[PATCH] Fix rmap locking for CONFIG_SWAP=n · c7d7f43a
Andrew Morton authored Dec 14, 2002
```
The pte_chain_unlock() needs to be outside the ifdef.
```
c7d7f43a

[PATCH] speed up read_zero() for !CONFIG_MMU · 849696bb

Andrew Morton authored Dec 14, 2002

The read_zero() implementation for !CONFIG_MMU was very inefficient.
This sped-up version has been tested and acked by Greg Ungerer.

849696bb

[PATCH] create /proc/kmsg, remove sys_syslog()-based · 48a789a9

Andrew Morton authored Dec 14, 2002

Back out the sys_syslog()-based printk-from-userspace and replace
it with Ben's /proc/kmsg version.

Requires a `mknod /dev/kmsg c 1 11'.

48a789a9

[PATCH] deprecate use of bdflush() · 2f268ee8

Andrew Morton authored Dec 14, 2002

Patch from Robert Love <rml@tech9.net>

We can never get rid of it if we do not deprecate it - so do so and
print a stern warning to those who still run bdflush daemons.

2f268ee8

[PATCH] Avoid recursion in the page allocator · 2a6c8678

Andrew Morton authored Dec 14, 2002

The PF_MEMALLOC handling got broken somewhere, and it is now possible
for a PF_MEMALLOC process to reenter page reclaim.

Change it to fail the allocation if we're PF_MEMALLOC and there are
zero pages free.

2a6c8678

[PATCH] missing piece of Iphase atm driver update · ee842908

François Romieu authored Dec 14, 2002

This removes calls to a function which disappeared during the last Iphase
driver update.  Since that update, the Iphase driver has been using plain
modern pci style init.

The problem wasn't noticed until Adrian Bunk tried to build a non-modular kernel
(I only tested the modularized driver). Everybody else seemed happy :o)

ee842908

[PATCH] VT scrolling fix · 6faa9cfc

James Simmons authored Dec 14, 2002

scrup is using memcpy even when the memory areas src, dest overlap.  The
key is to use memmove which handles overlapping memory gracefully.
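
A minimal sketch of scrolling a text buffer up, where source and destination overlap whenever fewer than half the rows are scrolled. `scroll_up()` and its parameters are hypothetical and are not the console driver's real interface.

```c
#include <string.h>

/*
 * Scroll a rows x cols character buffer up by nr rows and blank the
 * vacated rows.  The source and destination ranges overlap, so
 * memmove() is required; memcpy() has undefined behaviour here.
 */
static void scroll_up(unsigned short *screen, unsigned int rows,
                      unsigned int cols, unsigned int nr,
                      unsigned short blank)
{
    unsigned int i;

    if (nr > rows)
        nr = rows;
    memmove(screen, screen + (size_t)nr * cols,
            (size_t)(rows - nr) * cols * sizeof(*screen));
    for (i = (rows - nr) * cols; i < rows * cols; i++)
        screen[i] = blank;
}
```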

6faa9cfc

Merge bk://thebsh.namesys.com/bk/reiser3-linux-2.5-fixes · 36db78e3
Linus Torvalds authored Dec 14, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
36db78e3

[PATCH] 2.5 fix for > 25 disks · 49b0c2a3

Anton Blanchard authored Dec 14, 2002

2.5 currently tries to register disk sda twice. That is not nice, and now
that we use sysfs to do name to dev_t mapping, I couldn't mount my root filesystem.

49b0c2a3