- 27 Sep, 2002 1 commit
-
-
Ingo Molnar authored
Implement a "mapping change" notification for virtual lookup caches, and make the futex code use that to keep the futex page pinning consistent across copy-on-write events in the VM space.
-
- 18 Sep, 2002 1 commit
-
-
Kai Germaschewski authored
It's gone almost everywhere else already, and will eventually make for a nicer top-level Makefile.
-
- 17 Sep, 2002 1 commit
-
-
Andrew Morton authored
Patch from Christoph Hellwig moves the madvise implementation out of filemap.c and into its own .c file. No other changes are made.
-
- 19 Jul, 2002 1 commit
-
-
Andrew Morton authored
This is the "minimal rmap" patch, writen by Rik, ported to 2.5 by Craig Kulsea. Basically, before: When the page reclaim code decides that is has scanned too many unreclaimable pages on the LRU it does a scan of process virtual address spaces for pages to add to swapcache. ptes pointing at the page are unmapped as the scan proceeds. When all ptes referring to a page have been unmapped and it has been written to swap the page is reclaimable. after: When an anonymous page is encountered on the tail of the LRU we use the rmap to see if it hasn't been referenced lately. If so then add it to swapcache. When the page is again encountered on the LRU, if it is still unreferenced then try to unmap all ptes which refer to it in one hit, and if it is clean (ie: on swap) then free it. The rest of the VM - list management, the classzone concept, etc remains unchanged. There are a number of things which the per-page pte chain could be used for. Bill Irwin has identified the following. (1) page replacement no longer goes around randomly unmapping things (2) referenced bits are more accurate because there aren't several ms or even seconds between find the multiple pte's mapping a page (3) reduces page replacement from O(total virtually mapped) to O(physical) (4) enables defragmentation of physical memory (5) enables cooperative offlining of memory for friendly guest instance behavior in UML and/or LPAR settings (6) demonstrable benefit in performance of swapping which is common in end-user interactive workstation workloads (I don't like the word "desktop"). c.f. Craig Kulesa's post wrt. swapping performance (7) evidence from 2.4-based rmap trees indicates approximate parity with mainline in kernel compiles with appropriate locking bits (8) partitioning of physical memory can reduce the complexity of page replacement searches by scanning only the "interesting" zones implemented and merged in 2.4-based rmap (9) partitioning of physical memory can increase the parallelism of page replacement searches by independently processing different zones implemented, but not merged in 2.4-based rmap (10) the reverse mappings may be used for efficiently keeping pte cache attributes coherent (11) they may be used for virtual cache invalidation (with changes) (12) the reverse mappings enable proper RSS limit enforcement implemented and merged in 2.4-based rmap The code adds a pointer to struct page, consumes additional storage for the pte chains and adds computational expense to the page reclaim code (I measured it at 3% additional load during streaming I/O). The benefits which we get back for all this are, I must say, theoretical and unproven. If it has real advantages (or, indeed, disadvantages) then why has nobody demonstrated them? There are a number of things remaining to be done: 1: Demonstrate the above advantages. 2: Make it work with pte-highmem (Bill Irwin is signed up for this) 3: Don't add pte_chains to non-shared pages optimisation (Dave McCracken's patch does this) 4: Move the pte_chains into highmem too (Bill, I guess) 5: per-cpu pte_chain freelists (Rik?) 6: maybe GC the pte_chain backing pages. (Seems unavoidable. Rik?) 7: multithread the page reclaim code. (I have patches). 8: clustered add-to-swap. Not sure if I buy this. anon pages are often well-ordered-by-virtual-address on the LRU, so it "just works" for benchmarky loads. But there may be some other loads... 9: Fix bad IO latency in page reclaim (I have lame patches) 10: Develop tuning tools, use them. 
11: The nightly updatedb run is still evicting everything.
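Purely as an illustration of the data structure being discussed - the names, helpers and locking here are hypothetical, not taken from the patch - each physical page carries a chain of the ptes that map it, which reclaim can walk in one pass instead of scanning whole address spaces:

    /* Hypothetical sketch only: page->pte_chain heads a singly linked list
     * of every pte that currently maps the page. */
    struct pte_chain {
        struct pte_chain *next;
        pte_t *ptep;                    /* one pte mapping this page */
    };

    /* Reclaim can now drop every mapping of an unreferenced page in one hit. */
    static void unmap_all_ptes(struct page *page)
    {
        struct pte_chain *pc, *next;

        for (pc = page->pte_chain; pc; pc = next) {
            next = pc->next;
            clear_pte_and_flush(pc->ptep);  /* hypothetical helper */
            free_pte_chain(pc);             /* hypothetical helper */
        }
        page->pte_chain = NULL;
    }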
-
- 30 Apr, 2002 1 commit
-
-
Andrew Morton authored
[ I reversed the order in which writeback walks the superblock's dirty inodes. It sped up dbench's unlink phase greatly. I'm such a sleaze ]

The core writeback patch. Switches file writeback from the dirty buffer LRU over to address_space.dirty_pages.

- The buffer LRU is removed.
- The buffer hash is removed (uses blockdev pagecache lookups).
- The bdflush and kupdate functions are implemented against address_spaces, via pdflush.
- The relationship between pages and buffers is changed:
  - If a page has dirty buffers, it is marked dirty.
  - If a page is marked dirty, it *may* have dirty buffers.
  - A dirty page may be "partially dirty". block_write_full_page discovers this.
- A bunch of consistency checks of the form if (!something_which_should_be_true()) buffer_error(); have been introduced. These fog the code up but are important for ensuring that the new buffer/page code is working correctly.
- New locking (inode.i_bufferlist_lock) is introduced for exclusion from try_to_free_buffers(). This is needed because set_page_dirty is called under spinlock, so it cannot lock the page. But it needs access to page->buffers to set them all dirty. i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
- fs/inode.c has been split: all the code related to file data writeback has been moved into fs/fs-writeback.c.
- Code related to file data writeback at the address_space level is in the new mm/page-writeback.c.
- try_to_free_buffers() is now non-blocking.
- Switches vmscan.c over to understand that all pages with dirty data are now marked dirty.
- Introduces a new a_op for VM writeback: ->vm_writeback(struct page *page, int *nr_to_write). This is a bit half-baked at present. The intent is that the address_space is given the opportunity to perform clustered writeback: to opportunistically write out disk-contiguous dirty data which may be in other zones, and to allow delayed-allocate filesystems to get good disk layout. (A sketch of this operation follows after this list.)
- Added address_space.io_pages: pages which are being prepared for writeback. This is here for two reasons:
  1: It will be needed later, when BIOs are assembled direct against the pagecache, bypassing the buffer layer. It avoids a deadlock which would occur if someone moved the page back onto the dirty_pages list after it was added to the BIO, but before it was submitted. (hmm. This may not be a problem with PG_writeback logic.)
  2: It avoids a livelock which would occur if some other thread is continually redirtying pages.
- There are two known performance problems in this code:
  1: Pages which are locked for writeback cause undesirable blocking when they are being overwritten. A patch which leaves pages unlocked during writeback comes later in the series.
  2: While inodes are under writeback, they are locked. This causes namespace lookups against the file to get unnecessarily blocked in wait_on_inode(). This is a fairly minor problem. I don't have a fix for this at present - I'll fix this when I attach dirty address_spaces direct to super_blocks.
- The patch vastly increases the amount of dirty data which the kernel permits highmem machines to maintain. This is because the balancing decisions are made against the amount of memory in the machine, not against the amount of buffercache-allocatable memory. This may be very wrong, although it works fine for me (2.5 gigs). We can trivially go back to the old-style throttling with s/nr_free_pagecache_pages/nr_free_buffer_pages/ in balance_dirty_pages(). But better would be to allow blockdev mappings to use highmem (I'm thinking about this one, slowly), and to move writer-throttling and writeback decisions into the VM (modulo the file-overwriting problem).
- Drops 24 bytes from struct buffer_head. More to come.
- There's some gunk like super_block.flags:MS_FLUSHING which needs to be killed. We need a better way of providing collision avoidance between pdflush threads, to prevent more than one pdflush thread working a disk at the same time. The correct way to do that is to put a flag in the request queue to say "there's a pdflush thread working this disk". This is easy to do: just generalise the "ra_pages" pointer to point at a struct which includes ra_pages and the new collision-avoidance flag.
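As a sketch of the new operation mentioned above - the prototype is the one given in the message, but the body and the helper it calls are purely illustrative, not the patch's implementation:

    /* ->vm_writeback(struct page *page, int *nr_to_write)
     * Illustrative body only: start from the page the VM handed us and
     * opportunistically push out disk-contiguous dirty neighbours,
     * charging each one against *nr_to_write. */
    static int example_vm_writeback(struct page *page, int *nr_to_write)
    {
        struct address_space *mapping = page->mapping;

        while (*nr_to_write > 0 &&
               write_next_contiguous_dirty_page(mapping, page))    /* hypothetical */
            (*nr_to_write)--;

        return 0;
    }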
-
- 10 Apr, 2002 2 commits
-
-
Andrew Morton authored
This patch implements a gang-of-threads which are designed to be used for dirty data writeback. "pdflush" -> dirty page flush, or something. The number of threads is dynamically managed by a simple demand-driven algorithm. "Oh no, more kernel threads". Don't worry, kupdate and bdflush disappear later.

The intent is that no two pdflush threads are ever performing writeback against the same request queue at the same time. It would be wasteful to do that. My current patches don't quite achieve this; I need to move the state into the request queue itself...

The driver for implementing the thread pool was to avoid the possibility where bdflush gets stuck on one device's get_request_wait() queue while lots of other disks sit idle. Also generality, abstraction, and the need to have something in place to perform the address_space-based writeback when the buffer_head-based writeback disappears.

There is no provision inside the pdflush code itself to prevent many threads from working against the same device. That's the responsibility of the caller.

The main API function, `pdflush_operation()', attempts to find a thread to do some work for you. It is not reliable - it may return -1 and say "sorry, I didn't do that". This happens if all threads are busy. One _could_ extend pdflush_operation() to queue the work so that it is guaranteed to happen. If there's a need, that additional minor complexity can be added.
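A sketch of how a caller might use this, assuming a prototype along the lines of int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0) and a negative return when no thread is free; the worker functions are hypothetical:

    /* If a pdflush thread is available it runs the work; otherwise the
     * caller falls back to doing the writeback itself. */
    static void background_writeback(unsigned long arg)
    {
        struct super_block *sb = (struct super_block *)arg;

        writeback_sb_inodes(sb);            /* hypothetical worker */
    }

    static void kick_writeback(struct super_block *sb)
    {
        if (pdflush_operation(background_writeback, (unsigned long)sb) < 0)
            writeback_sb_inodes(sb);        /* all threads busy: do it ourselves */
    }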
-
Andrew Morton authored
I'd like to be able to claim amazing speedups, but the best benchmark I could find was diffing two 256 megabyte files, which is about 10% quicker. And that is probably due to the window size being effectively 50% larger. Fact is, any disk worth owning nowadays has a segmented 2-megabyte cache, and OS-level readahead mainly seems to save on CPU cycles rather than overall throughput. Once you start reading more streams than there are segments in the disk cache we start to win.

Still. The main motivation for this work is to clean the code up, and to create a central point at which many pages are marshalled together so that they can all be encapsulated into the smallest possible number of BIOs, and injected into the request layer.

A number of filesystems were poking around inside the readahead state variables. I'm not really sure what they were up to, but I took all that out. The readahead code manages its own state autonomously and should not need any hints.

- Unifies the current three readahead functions (mmap reads, read(2) and sys_readahead) into a single implementation.
- More aggressive in building up the readahead windows.
- More conservative in tearing them down.
- Special start-of-file heuristics.
- Preallocates the readahead pages, to avoid the (never demonstrated, but potentially catastrophic) scenario where allocation of readahead pages causes the allocator to perform VM writeout.
- Gets all the readahead pages gathered together in one spot, so they can be marshalled into big BIOs.
- Reinstates the readahead ioctls, so hdparm(8) and blockdev(8) are working again. The readahead settings are now per-request-queue, and the drivers never have to know about it. I use blockdev(8). It works in units of 512 bytes. (An example of driving these ioctls directly follows after this list.)
- Identifies readahead thrashing, and attempts to handle it. Certainly the changes here delay the onset of catastrophic readahead thrashing by quite a lot, and decrease its seriousness as we get more deeply into it, but it's still pretty bad.
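For reference, the per-queue readahead setting can also be inspected and changed from a small C program through the BLKRAGET/BLKRASET block-device ioctls (values in 512-byte units), which is essentially what blockdev(8) does under the hood:

    /* Print a block device's readahead window and optionally set a new one.
     * Usage: ra <device> [sectors] -- values are in 512-byte units. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* BLKRAGET, BLKRASET */

    int main(int argc, char **argv)
    {
        long ra;
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <device> [sectors]\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || ioctl(fd, BLKRAGET, &ra) < 0) {
            perror(argv[1]);
            return 1;
        }
        printf("%s: readahead = %ld sectors (%ld KB)\n", argv[1], ra, ra / 2);

        if (argc > 2 && ioctl(fd, BLKRASET, (unsigned long)atol(argv[2])) < 0) {
            perror("BLKRASET");
            return 1;
        }
        close(fd);
        return 0;
    }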
-
- 08 Mar, 2002 2 commits
-
-
Linus Torvalds authored
-
Linus Torvalds authored
changes.
-
- 19 Feb, 2002 1 commit
-
-
Rik van Riel authored
The patch has been changed as you wanted, with page->zone shoved into page->flags. I've also pulled the thing up to your latest changes from linux.bkbits.net, so you should be able to just pull it into your tree from:

Rik
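A minimal sketch of the encoding being referred to - the bit position, lookup table and helper names here are illustrative only: the zone is stored as a small index in the upper bits of page->flags and recovered through a table when needed:

    /* Illustrative only: keep a zone index in the top bits of page->flags
     * rather than a full pointer in struct page. */
    #define ZONE_SHIFT  (BITS_PER_LONG - 8) /* top bits hold the zone index */

    extern struct zone *zone_table[];       /* indexed by zone number */

    static inline void set_page_zone(struct page *page, unsigned long zone_num)
    {
        page->flags &= ~(~0UL << ZONE_SHIFT);   /* clear any old index */
        page->flags |= zone_num << ZONE_SHIFT;
    }

    static inline struct zone *page_zone(struct page *page)
    {
        return zone_table[page->flags >> ZONE_SHIFT];
    }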
-
- 05 Feb, 2002 5 commits
-
-
Linus Torvalds authored
- me: revert the "kill(-1..)" change. POSIX isn't that clear on the issue anyway, and the new behaviour breaks things.
- Jens Axboe: more bio updates
- Al Viro: rd_load cleanups. hpfs mount fix, mount cleanups
- Ingo Molnar: more raid updates
- Jakub Jelinek: fix Linux/x86 confusion about arg passing of "save_v86_state" and "do_signal"
- Trond Myklebust: fix NFS client race conditions
-
Linus Torvalds authored
- Jens Axboe: more bio stuff
- Ingo Molnar: mempool for bio
- Niibe Yutaka: Super-H update
-
Linus Torvalds authored
- Michael Warfield: computone serial driver update
- Alexander Viro: cdrom module race fixes
- David Miller: Acenic driver fix
- Andrew Grover: ACPI update
- Kai Germaschewski: ISDN update
- Tim Waugh: parport update
- David Woodhouse: JFFS garbage collect sleep
-
Linus Torvalds authored
- Andreas Dilger: various ext2 cleanups
- Richard Gooch: devfs update
- Johannes Erdfelt: USB updates
- Alan Cox: merges
- David Miller: fix SMP pktsched bootup deadlock (CONFIG_NET_SCHED)
- Roman Zippel: AFFS update
- Anton Altaparmakov: NTFS update
- me: fix races in vfork() (semaphores are not good completion handlers)
- Jeff Garzik: net driver updates, sysvfs update
-
Linus Torvalds authored
-