1. 12 Apr, 2004 40 commits
    • [PATCH] ppc64: NUMA fix for 16MB LMBs · 4c886627
      Andrew Morton authored
      From: Olof Johansson <olof@austin.ibm.com>
      
      As discussed on the ppc64 list yesterday and today:
      
      On some ppc64 systems, Open Firmware will give memory device nodes that are
      only 16MB in size, instead of the 256MB that our NUMA code currently
      expects (see MEMORY_INCREMENT in mmzone.h).
      
      Just changing the defines from 256MB to 16MB makes the table blow up from
      32KB to 512KB, so this patch also makes it dynamically allocated based on
      actual memory size.  Since all this is done before (well, during) bootmem
      init, we need to use lmb_alloc().
      
      Finally, there's no need to use a full int for node ID. Current max is 16
      nodes, so a signed char still leaves plenty of room to grow.
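
      The table-size arithmetic above can be sketched as follows.  This is an
      illustration only: the macro and function names are hypothetical (the real
      definitions live in mmzone.h), and the 2TB address range used in the usage
      note is inferred from the 32KB figure and a 4-byte int, not stated in the
      patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the identifiers used in mmzone.h. */
#define OLD_INCREMENT_SHIFT 28	/* one table entry per 256MB */
#define NEW_INCREMENT_SHIFT 24	/* one table entry per 16MB */

/* Number of table entries needed to cover mem_bytes at the given granularity. */
static size_t numa_table_entries(uint64_t mem_bytes, unsigned shift)
{
	uint64_t increment = (uint64_t)1 << shift;

	return (size_t)((mem_bytes + increment - 1) >> shift);
}

/* Table size in bytes for a given per-entry width (int vs. signed char). */
static size_t numa_table_bytes(uint64_t mem_bytes, unsigned shift,
			       size_t entry_size)
{
	return numa_table_entries(mem_bytes, shift) * entry_size;
}
```

      For an assumed 2TB range, int entries give 32768 bytes at 256MB granularity
      and 524288 bytes at 16MB granularity; sizing from actual memory with
      signed char entries, a 4GB machine needs only 256 bytes.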
    • [PATCH] procfs LoadAVG/load_avg scaling fix · 9b8696f2
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      Dave reported that /proc/*/status sometimes shows 101% as LoadAVG, which
      makes no sense.
      
      The reason for the bug is slightly incorrect scaling of the load_avg value.
      The patch below fixes this.
    • [PATCH] ia32: 4Kb stacks (and irqstacks) patch · 95f238ea
      Andrew Morton authored
      From: Arjan van de Ven <arjanv@redhat.com>
      
      Below is a patch to enable 4Kb stacks for x86. The goal of this is to
      
      1) Reduce footprint per thread so that systems can run many more threads
         (for the java people)
      
      2) Reduce the pressure on the VM for order > 0 allocations. We see real life
         workloads (granted with 2.4 but the fundamental fragmentation issue isn't
         solved in 2.6 and isn't solvable in theory) where this can be a problem.
         In addition order > 0 allocations can make the VM "stutter" and give more
         latency due to having to do much much more work trying to defragment
      
      The first 2 bits of the patch actually affect compiler options in a generic
      way: I propose to disable the -funit-at-a-time feature from gcc.  With this
      enabled (and it's the default with -O2), gcc will very aggressively inline
      functions, which is nice and all for userspace, but for the kernel this makes
      us suffer a gcc deficiency more: gcc is extremely bad at sharing stackslots,
      for example a situation like this:
      
      if (some_condition)
      	function_A();
      else
      	function_B();
      
      with -funit-at-a-time, both function_A() and _B() might get inlined, however
      the stack usage of the parent function then grows by the stack usage of both
      functions COMBINED instead of the maximum of the two.  Even
      with the normal 8Kb stacks this is a danger since we see some functions grow
      3Kb to 4Kb of stack use this way.  With 4Kb stacks, 4Kb of stack usage growth
      obviously is deadly ;-( but even with 8Kb stacks it's pure lottery.
      
      Disabling -funit-at-a-time also exposes another thing in the -mm tree; the
      attribute always_inline is considered harmful by gcc folks in that when gcc
      makes a decision to NOT inline a function marked this way, it throws an
      error.  Disabling -funit-at-a-time disables some of the aggressive inlining
      (eg of large functions that come later in the .c file) so this would make
      your tree not compile.
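
      The stack-slot arithmetic described earlier for the function_A()/function_B()
      case can be sketched like this.  The helper names are hypothetical and the
      sizes are made-up inputs; this only models the sum-versus-maximum
      difference, not how gcc actually lays out frames.

```c
/* Hypothetical helpers for illustration; all sizes are in bytes. */

/* Parent frame when each inlined callee gets its own stack slots. */
static unsigned frame_unshared(unsigned parent, unsigned a, unsigned b)
{
	return parent + a + b;
}

/* Parent frame if the two mutually exclusive branches shared their slots. */
static unsigned frame_shared(unsigned parent, unsigned a, unsigned b)
{
	return parent + (a > b ? a : b);
}
```

      With a 256-byte parent and two 2048-byte callees, the unshared layout needs
      4352 bytes, which already exceeds a 4Kb stack, while the shared layout
      would need 2304 bytes.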
      
      The 4k stackness of the kernel is included in modversions, so people don't
      load 4k-stack modules into 8k-stack kernels.
      
      At present 4k stacks are selectable in config.  When the feature has settled
      in we should remove the 8k option.  This will break the nvidia modules.  But
      Fedora uses 4k stacks so a new nvidia driver is expected soon.
    • [PATCH] acpi printk fix · 124187e5
      Andrew Morton authored
      drivers/acpi/events/evmisc.c: In function `acpi_ev_queue_notify_request':
      drivers/acpi/events/evmisc.c:143: warning: too many arguments for format
    • [PATCH] Fix logic in filemap_nopage() · 94721b16
      Andrew Morton authored
      The filemap_nopage() logic will go into a tight loop if
      do_page_cache_readahead() doesn't actually start I/O against the target page.
      This can happen if the disk is read-congested, or if the filesystem doesn't
      want to read that part of the file for some reason.
      
      We will accidentally break out of the loop because
      
      	 (ra->mmap_miss > ra->mmap_hit + MMAP_LOTSAMISS)
      
      will eventually become true.
      
      Fix that up.
    • [PATCH] Honour the readahead tunable in filemap_nopage() · 80b1573f
      Andrew Morton authored
      Remove the hardwired pagefault readaround distance in filemap_nopage() and
      use the per-file readahead setting.
      
      The main reason for this is in fact laptop-mode.  If you want to prevent the
      disk from spinning up then you want all of your application's pages to be
      pulled into memory in one hit.  Otherwise the disk will spin up each time you
      use a new part of whatever application(s) you are running.
    • [PATCH] Add commit=0 to ext3, meaning "set commit to default". · 26f14a57
      Andrew Morton authored
      From: Bart Samwel <bart@samwel.tk>
      
      Add support for the value "0" to ext3's "commit" option.  When this value
      is given, ext3 replaces it with the default commit interval.  Introduce a
      constant JBD_DEFAULT_MAX_COMMIT_AGE for this.
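
      A minimal sketch of the substitution, assuming a default of 5 seconds.  The
      helper name is hypothetical and the value of the constant is an assumption
      made for illustration; the patch text only names the constant.

```c
/* Assumed value for illustration: default commit interval in seconds. */
#define JBD_DEFAULT_MAX_COMMIT_AGE 5

/* Hypothetical helper: map the value given to "commit=" to an interval. */
static unsigned long commit_interval_secs(unsigned long option)
{
	if (option == 0)
		return JBD_DEFAULT_MAX_COMMIT_AGE;	/* "0" means "use the default" */
	return option;
}
```

      Treating 0 as a sentinel lets a script restore the default without having to
      know what the default is.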
    • [PATCH] laptop mode · 93d33a48
      Andrew Morton authored
      From: Bart Samwel <bart@samwel.tk>
      
      Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop".
      In this mode the kernel will attempt to avoid spinning disks up.
      
      Algorithm: the idea is to hold dirty data in memory for a long time, but to
      flush everything which has been accumulated if the disk happens to spin up
      for other reasons.
      
      - Whenever a disk request completes (read or write), schedule a timer a few
        seconds hence.  If the timer was already pending, reset it to a few seconds
        hence.
      
      - When the timer expires, write back the whole world.  We use
        sync_filesystems() for this because it will force ext3 journal commits as
        well.
      
      - In balance_dirty_pages(), kick off background writeback when we hit the
        high threshold (dirty_ratio), not when we hit the low threshold.  This has
        the effect of causing "lumpy" writeback which is something I spent a year
        fixing, but in laptop mode, it is desirable.
      
      - In try_to_free_pages(), only kick pdflush if the VM is getting into
        distress: we want to keep scanning for clean pages, deferring writeback.
      
      - In page reclaim, avoid writing back the odd random dirty page off the
        LRU: only start I/O if the scanning is working harder.
      
      The effect is to perform a sync() a few seconds after all I/O has ceased.
      
      The value which was written into /proc/sys/vm/laptop-mode determines, in
      seconds, the delay between the final I/O and the flush.
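
      The timer behaviour described above can be modelled roughly as below.  This
      is a userspace sketch with hypothetical names and a plain counter standing
      in for sync_filesystems(); it is not the kernel's timer code.

```c
#include <stdbool.h>

/* Hypothetical state for illustration; not the kernel's data structures. */
struct laptop_timer {
	unsigned long delay;	/* seconds, as written to the laptop-mode knob */
	unsigned long expires;	/* absolute time at which to flush */
	bool pending;
	unsigned long flushes;	/* number of whole-world writebacks performed */
};

/* A disk request completed (read or write): arm or re-arm the timer. */
static void laptop_io_completion(struct laptop_timer *t, unsigned long now)
{
	t->expires = now + t->delay;
	t->pending = true;
}

/* Clock tick: flush once the disk has been idle for `delay` seconds. */
static void laptop_tick(struct laptop_timer *t, unsigned long now)
{
	if (t->pending && now >= t->expires) {
		t->pending = false;
		t->flushes++;	/* stands in for sync_filesystems() */
	}
}
```

      With a delay of 5, completions at t=0 and t=3 push the flush out to t=8;
      a tick at t=5 does nothing and a tick at t=8 performs the single flush.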
      
      Additionally, the patch adds tools which help answer the question "why the
      heck does my disk spin up all the time?".  The user may set
      /proc/sys/vm/block_dump to a non-zero value and the kernel will print out
      information which will identify the process which is performing disk reads or
      which is dirtying pagecache.
      
      The user should probably disable syslogd before setting block_dump.
    • [PATCH] kswapd: remove pages_scanned local · 77fe0a19
      Andrew Morton authored
      This is always equal to constant zero.
    • [PATCH] Fix rmap comment · 2d47875a
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      rmap's try_to_unmap_one comments on find_vma failure, that a page may
      temporarily be absent from a vma during mremap: no longer, though it is still
      possible for this find_vma to fail, while unmap_vmas drops page_table_lock
      (but that is no problem for file truncation).
    • [PATCH] mremap: check map_count · 59da95c4
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      mremap's move_vma should think ahead to lessen the chance of failure during
      its rewind on failure: running out of memory always possible, but it's silly
      for it to embark when it's near the map_count limit.
    • [PATCH] mremap: vma_relink_file race fix · 2039e7b5
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Subtle point from Rajesh Venkatasubramanian: when mremap's move_vma fails and
      so rewinds, before moving the file-based ptes back, we must move new_vma
      before old vma in the i_mmap or i_mmap_shared list, so that when racing
      against vmtruncate we cannot propagate pages to be truncated back from
      new_vma into the just cleaned old_vma.
    • [PATCH] mremap: move_vma fixes and cleanup · e2ea8374
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Partial rewrite of mremap's move_vma.  Rajesh Venkatasubramanian has pointed
      out that vmtruncate could miss ptes, leaving orphaned pages, because move_vma
      only made the new vma visible after filling it.  We see no good reason for
      that, and it is time to make move_vma more robust.
      
      Removed all its vma merging decisions, leave them to mmap.c's vma_merge, with
      copy_vma added.  Removed duplicated is_mergeable_vma test from vma_merge, and
      duplicated validate_mm from insert_vm_struct.
      
      move_vma moves from old to new, then unmaps old; but on error it moves back
      from new to old and unmaps new.  Don't unwind within move_page_tables, let move_vma
      call it explicitly to unwind, with the right source vma.  Get the
      VM_ACCOUNTing right even when the final do_munmap fails.
    • [PATCH] mremap: copy_one_pte cleanup · 209b450c
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Clean up mremap move's copy_one_pte:
      
      - get_one_pte_map_nested already weeded out the pte_none case,
        now don't even call copy_one_pte if it has nothing to do.
      
      - check pfn_valid before passing page to page_remove_rmap.
    • [PATCH] fork vma ordering during fork · 424e44d1
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      First of six patches against 2.6.5-rc3, cleaning up mremap's move_vma, and
      fixing truncation orphan issues raised by Rajesh Venkatasubramanian. 
      Originally done as part of the anonymous objrmap work on mremap move, but
      useful fixes now extracted for mainline.  The mremap changes need some
      exposure in the -mm tree first, but the first (fork one-liner) is safe enough
      to go straight into 2.6.5.
      
      
      
      From: Rajesh Venkatasubramanian.  Despite the comment that child vma should
      be inserted just after parent vma, 2.5.6 did exactly the reverse: thus a
      racing vmtruncate may free the child's ptes, then advance to the parent, and
      meanwhile copy_page_range has propagated more ptes from the parent to the
      child, leaving file pages still mapped after truncation.
    • [PATCH] use compound pages for hugetlb pages only · 3c7011b3
      Andrew Morton authored
      The compound page logic is a little fragile - it relies on additional
      metadata in the pageframes which some other kernel code likes to stomp on
      (xfs was doing this).
      
      Also, because we're treating all higher-order pages as compound pages it is
      no longer possible to free individual lower-order pages from the middle of
      higher-order pages.  At least one ARM driver insists on doing this.
      
      We only really need the compound page logic for higher-order pages which can
      be mapped into user pagetables and placed under direct-io.  This covers
      hugetlb pages and, conceivably, soundcard DMA buffers which were allocated
      with a higher-order allocation but which weren't marked PageReserved.
      
      The patch arranges for the hugetlb implementations to allocate their pages with
      compound page metadata, and all other higher-order allocations go back to the
      old way.
      
      (Andrea supplied the GFP_LEVEL_MASK fix)
    • [PATCH] mpage_writepages() cleanup · 60af4464
      Andrew Morton authored
      Rework the code layout a bit.  No logic change.
    • [PATCH] Add mpage_writepages() scheduling point · 082825b6
      Andrew Morton authored
      From: Jens Axboe <axboe@suse.de>
      
      Takashi did some nice latency testing of the current kernel (with -mm
      writeback changes), and the biggest offender in general core is
      mpage_writepages().
    • [PATCH] writeback efficiency and QoS improvements · 9672a337
      Andrew Morton authored
      The radix-tree walk for writeback has a couple of problems:
      
      a) It always scans a file from its first dirty page, so if someone
         is repeatedly dirtying the front part of a file, pages near the end
         may be starved of writeout.  (Well, not completely: the `kupdate'
         function will write an entire file once the file's dirty timestamp
         has expired).  
      
      b) When the disk queues are huge (10000 requests), there can be a
         very large number of locked pages.  Scanning past these in writeback
         consumes quite some CPU time.
      
      So in each address_space we record the index at which the last batch of
      writeout terminated and start the next batch of writeback from that
      point.
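
      The resume-from-last-position idea can be sketched as follows.  This is a
      simplified userspace model: a flat dirty bitmap stands in for the radix
      tree, all names are hypothetical, and wrapping back to index 0 at the end of
      the file is an assumption of the sketch rather than something the patch
      description states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address_space: dirty flags plus a cursor. */
struct mapping {
	bool *dirty;
	size_t nr_pages;
	size_t writeback_index;	/* where the previous batch terminated */
};

/*
 * Write up to `batch` dirty pages, starting where the previous batch
 * stopped and scanning each page at most once.  Returns pages written.
 */
static size_t writeback_batch(struct mapping *m, size_t batch)
{
	size_t written = 0;
	size_t scanned;
	size_t idx = m->writeback_index;

	for (scanned = 0; scanned < m->nr_pages && written < batch; scanned++) {
		if (idx >= m->nr_pages)
			idx = 0;	/* assumed wrap-around */
		if (m->dirty[idx]) {
			m->dirty[idx] = false;	/* stands in for starting I/O */
			written++;
		}
		idx++;
	}
	m->writeback_index = idx;
	return written;
}
```

      If the front of the file is redirtied between batches, the second batch
      still begins at the saved index, so later pages are not starved.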
    • [PATCH] don't allow background writes to hide dirty buffers · bd134f27
      Andrew Morton authored
      If pdflush hits a locked-and-clean buffer in __block_write_full_page() it
      will just pass over the buffer.  Typically the buffer is an ext3 data=ordered
      buffer which is being written by kjournald, but a similar thing can happen
      with blockdev buffers and ll_rw_block().
      
      This is bad because the buffer is still under I/O and a subsequent fsync's
      fdatawait() needs to know about it.
      
      It is not practical to tag the page for writeback - only the submitter of the
      I/O can do that, because the submitter has control of the end_io handler.
      
      So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on
      the underway I/O.
      
      There is a risk that pdflush::background_writeout() will lock up, repeatedly
      trying and failing to write the same page.  This is prevented by ensuring
      that background_writeout() always throttles when it made no progress.
    • [PATCH] fdatasync integrity fix · d3eb546e
      Andrew Morton authored
      fdatasync can fail to wait on some pages due to a race.
      
      If some task (eg pdflush) is flushing the same mapping it can remove a page's
      dirty tag but not then mark that page as being under writeback, because
      pdflush hit a locked buffer in __block_write_full_page().  This will happen
      because kjournald is writing the buffer.  In this situation
      __block_write_full_page() will redirty the page so that fsync notices it, but
      there is a window where the page eludes the radix tree dirty page walk.
      
      Consequently a concurrent fsync will fail to notice the page when walking the
      radix tree's dirty pages.
      
      The approach taken by this patch is to leave the page marked as dirty in the
      radix tree while ->writepage is working out what to do with it.  This ensures
      that a concurrent write-for-sync will successfully locate the page and will
      then block in lock_page() until the non-write-for-sync code has finished
      altering the page state.
    • [PATCH] remove page.list · be5ceb40
      Andrew Morton authored
      Remove the now-unneeded page.list field.
    • [PATCH] switch the m68k pointer-table code over to page->lru · 67817afb
      Andrew Morton authored
      Switch the m68k pointer-table code over to page->lru.
    • [PATCH] arm: stop using page->list · de894013
      Andrew Morton authored
      Switch the ARM `small_page' code over to page->lru.
    • [PATCH] stop using page->lru in compound pages · 0fcb51fd
      Andrew Morton authored
      The compound page logic is using page->lru, and these will get scribbled on
      in various places so switch the compound page logic over to using ->mapping
      and ->private.
    • [PATCH] stop using page.list in readahead · bd64f049
      Andrew Morton authored
      The address_space.readpages() function currently takes a list of pages,
      strung together via page->list.  Switch it to using page->lru.
      
      This changes the API into filesystems.
    • [PATCH] stop using page.list in pageattr.c · 90687aa1
      Andrew Morton authored
      Switch it to ->lru
    • [PATCH] stop using page->list in the hugetlbpage implementations · c41bb9c4
      Andrew Morton authored
      Switch them over to page.lru
    • [PATCH] stop using page.list in the page allocator · 62e52945
      Andrew Morton authored
      Switch the page allocator over to using page.lru for the buddy lists.
    • [PATCH] slab: stop using page.list · 02979dcb
      Andrew Morton authored
      slab.c is using page->list.  Switch it over to using page->lru so we can
      remove page.list.
    • [PATCH] revert the slabification of i386 pgd's and pmd's · c33c9e78
      Andrew Morton authored
      This code is playing with page->lru from pages which came from slab.  But to
      remove page->list we need to convert slab over to using page->lru.  So we
      cannot allow the i386 pagetable code to go scribbling on the ->lru field of
      active slab pages.
      
      This optimisation was pretty thin, and it is more important to shrink the
      pageframe (on all architectures).
    • [PATCH] stop using address_space.clean_pages · d672c382
      Andrew Morton authored
      Remove remaining references to address_space.clean_pages.
    • [PATCH] Stop using address_space.locked_pages · a1513309
      Andrew Morton authored
      Instead, use a radix-tree walk of the pages which are tagged as being under
      writeback.
      
      The new function wait_on_page_writeback_range() was generalised out of
      filemap_fdatawait().  We can later use this to provide concurrent fsync of
      just a section of a file.
    • [PATCH] remove address_space.io_pages · 3c1ed9b2
      Andrew Morton authored
      Now remove address_space.io_pages.
    • [PATCH] fix the kupdate function · b79a8408
      Andrew Morton authored
      Juggle dirty pages and dirty inodes and dirty superblocks and various
      different writeback modes and livelock avoidance and fairness to recover from
      the loss of mapping->io_pages.
    • [PATCH] stop using the address_space dirty_pages list · 1d7d3304
      Andrew Morton authored
      Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY
      tag.  Remove address_space.dirty_pages.
    • [PATCH] tag writeback pages as such in their radix tree · 40c8348e
      Andrew Morton authored
      Arrange for under-writeback pages to be marked thus in their pagecache radix
      tree.
    • [PATCH] tag dirty pages as such in the radix tree · 8ece6262
      Andrew Morton authored
      Arrange for all dirty pagecache pages to be tagged as dirty within their
      radix tree.
    • [PATCH] make the pagecache lock irq-safe. · 89261aab
      Andrew Morton authored
      Intro to these patches:
      
      - Major surgery against the pagecache, radix-tree and writeback code.  This
        work is to address the O_DIRECT-vs-buffered data exposure horrors which
        we've been struggling with for months.
      
        As a side-effect, 32 bytes are saved from struct inode and eight bytes
        are removed from struct page.  At a cost of approximately 2.5 bits per page
        in the radix tree nodes on 4k pagesize, assuming the pagecache is densely
        populated.  Not all pages are pagecache; other pages gain the full 8 byte
        saving.
      
        This change will break any arch code which is using page->list and will
        also break any arch code which is using page->lru of memory which was
        obtained from slab.
      
        The basic problem which we (mainly Daniel McNeil) have been struggling
        with is in getting a really reliable fsync() across the page lists while
        other processes are performing writeback against the same file.  It's like
        juggling four bars of wet soap with your eyes shut while someone is
        whacking you with a baseball bat.  Daniel pretty much has the problem
        plugged but I suspect that's just because we don't have testcases to
        trigger the remaining problems.  The complexity and additional locking
        which those patches add is worrisome.
      
        So the approach taken here is to remove the page lists altogether and
        replace the list-based writeback and wait operations with in-order
        radix-tree walks.
      
        The radix-tree code has been enhanced to support "tagging" of pages, for
        later searches for pages which have a particular tag set.  This means that
        we can ask the radix tree code "find me the next 16 dirty pages starting at
        pagecache index N" and it will do that in O(log64(N)) time.
      
        This affects I/O scheduling potentially quite significantly.  It is no
        longer the case that the kernel will submit pages for I/O in the order in
        which the application dirtied them.  We instead submit them in file-offset
        order all the time.
      
        This is likely to be advantageous when applications are seeking all over
        a large file randomly writing small amounts of data.  I haven't performed
        much benchmarking, but tiobench random write throughput seems to be
        increased by 30%.  Other tests appear to be unaltered.  dbench may have got
        10-20% quicker, but it's variable.
      
        There is one large file which everyone seeks all over randomly writing
        small amounts of data: the blockdev mapping which caches filesystem
        metadata.  The kernel's IO submission patterns for this are now ideal.
      
      
        Because writeback and wait-for-writeback use a tree walk instead of a
        list walk they are no longer livelockable.  This probably means that we no
        longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
        fdatasync().  This may be beneficial for databases: multiple processes
        writing and syncing different parts of the same file at the same time can
        now all submit and wait upon writes to just their own little bit of the
        file, so we can get a lot more data into the queues.
      
        It is trivial to implement a part-file-fdatasync() as well, so
        applications can say "sync the file from byte N to byte M", and multiple
        applications can do this concurrently.  This is easy for ext2 filesystems,
        but probably needs lots of work for data-journalled filesystems and XFS and
        it probably doesn't offer much benefit over an i_semless O_SYNC write.
      
      
        These patches can end up making ext3 (even) slower:
      
      	for i in 1 2 3 4
      	do
      		dd if=/dev/zero of=$i bs=1M count=2000 &
      	done          
      
        runs awfully slow on SMP.  This is, yet again, because all the file
        blocks are jumbled up and the per-file linear writeout causes tons of
        seeking.  The above test runs sweetly on UP because on UP we don't
        allocate blocks to different files in parallel.
      
        Mingming and Badari are working on getting block reservation working for
        ext3 (preallocation on steroids).  That should fix ext3 up.
      
      
      This patch:
      
      - Later, we'll need to access the radix trees from inside disk I/O
        completion handlers.  So make mapping->page_lock irq-safe.  And rename it
        to tree_lock to reliably break any missed conversions.
    • [PATCH] radix-tree tags for selective lookup · 8691fb83
      Andrew Morton authored
      Add radix-tree tagging so we can look up dirty or writeback pages in
      O(log64(n)) time.
      
      Each radix-tree node gains two bits for each slot: one for page dirtiness and
      one for page writebackness.
      
      If a tag bit is set on a leaf node, it indicates that the item at the
      corresponding slot is tagged (say, a dirty page).
      
      If a tag bit is set in a non-leaf node it indicates that the same tag bit is
      set in the subtree which lies under the corresponding slot.  ie: "there is a
      dirty page under here somewhere, but you need to search down further to find
      it".
      
      A gang lookup function is provided which can walk the radix tree in
      logarithmic time looking for items which are tagged, starting from a
      specified offset.  We use this for in-order searches for dirty or writeback
      pages.
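
      The tagging scheme can be illustrated with a much-simplified model.  The
      sketch below is not the kernel's radix tree: it uses a fixed two-level tree
      of 64-slot nodes covering indices 0..4095, tracks a single "dirty" tag
      rather than two, and every name in it is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS 64	/* fan-out per node, matching the log64 in the text */

struct leaf {
	void *item[SLOTS];
	uint64_t dirty;		/* bit i set: item[i] is tagged dirty */
};

struct root {
	struct leaf *child[SLOTS];
	uint64_t dirty;		/* bit i set: child[i] holds a dirty item */
};

/* Tag the item at `index` and propagate the tag up to the root. */
static int tag_set(struct root *r, unsigned index)
{
	struct leaf *l;

	if (index >= SLOTS * SLOTS)
		return -1;
	l = r->child[index / SLOTS];
	if (!l || !l->item[index % SLOTS])
		return -1;	/* nothing stored there to tag */
	l->dirty |= (uint64_t)1 << (index % SLOTS);
	r->dirty |= (uint64_t)1 << (index / SLOTS);
	return 0;
}

/* Clear the tag; drop the root bit once the leaf has no tagged items. */
static void tag_clear(struct root *r, unsigned index)
{
	struct leaf *l;

	if (index >= SLOTS * SLOTS || !(l = r->child[index / SLOTS]))
		return;
	l->dirty &= ~((uint64_t)1 << (index % SLOTS));
	if (!l->dirty)
		r->dirty &= ~((uint64_t)1 << (index / SLOTS));
}

/* Gang lookup: up to `max` tagged items with index >= start, in order. */
static size_t gang_lookup_tag(struct root *r, unsigned start,
			      void **out, size_t max)
{
	size_t n = 0;
	unsigned hi, lo;

	for (hi = start / SLOTS; hi < SLOTS && n < max; hi++) {
		if (!(r->dirty & ((uint64_t)1 << hi)))
			continue;	/* clean subtree: skip 64 indices at once */
		lo = (hi == start / SLOTS) ? start % SLOTS : 0;
		for (; lo < SLOTS && n < max; lo++)
			if (r->child[hi]->dirty & ((uint64_t)1 << lo))
				out[n++] = r->child[hi]->item[lo];
	}
	return n;
}
```

      The bit in the non-leaf node is what lets the lookup skip whole clean
      subtrees instead of visiting every slot.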
      
      There is a userspace test harness for this code at
      
      http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz