1. 22 May, 2004 · 40 commits
    • [PATCH] rmap 19: arch prio_tree · afd81431
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      The previous patches of this prio_tree batch have touched generic code only.
      Now the arm and parisc __flush_dcache_page are converted to use
      vma_prio_tree_next, and benefit from its selection of relevant vmas.  They're
      still accessing the tree without i_shared_lock or any other lock: that's not
      forgotten but still under investigation.  Include pagemap.h for the definition
      of PAGE_CACHE_SHIFT.  s390 and x86_64 no longer initialize vma's shared field
      (whose type has changed): that is done later.
      afd81431
    • [PATCH] unmap_mapping_range: add comment · 4e44e085
      Andrew Morton authored
      4e44e085
    • [PATCH] rmap 18: i_mmap_nonlinear · 068258f7
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      The prio_tree is of no use to nonlinear vmas: currently we're having to search
      the tree in the most inefficient way to find all its nonlinears.  At the very
      least we need an indication of the unlikely case when there are some
      nonlinears; but really, we'd do best to take them out of the prio_tree
      altogether, into a list of their own - i_mmap_nonlinear.
      068258f7
    • [PATCH] rmap 17: real prio_tree · 2fe9c14c
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Rajesh Venkatasubramanian's implementation of a radix priority search tree of
      vmas, to handle object-based reverse mapping corner cases well.
      
      Amongst the objections to object-based rmap were test cases by akpm and by
      mingo, in which large numbers of vmas mapping disjoint or overlapping parts of
      a file showed strikingly poor performance of the i_mmap lists.  Perhaps those
      tests are irrelevant in the real world?  We cannot be too sure: the prio_tree
      is well-suited to solving precisely that problem, so unless it turns out to
      bring too much overhead, let's include it.
      
      Why is this prio_tree.c placed in mm rather than lib?  See GET_INDEX: this
      implementation is geared throughout to use with vmas, though the first half of
      the file appears more general than the second half.
      
      Each node of the prio_tree is itself (contained within) a vma: might save
      memory by allocating distinct nodes from which to hang vmas, but wouldn't save
      much, and would complicate the usage with preallocations.  Off each node of
      the prio_tree itself hangs a list of like vmas, if any.
      
      The connection from node to list is a little awkward, but probably the best
      compromise: it would be more straightforward to list likes directly from the
      tree node, but that would use more memory per vma, for the list_head and to
      identify that head.  Instead, node's shared.vm_set.head points to next vma
      (whose shared.vm_set.head points back to node vma), and that next contains the
      list_head from which the rest hang - reusing fields already used in the
      prio_tree node itself.
      
      Currently lacks prefetch: Rajesh hopes to add some soon.
      2fe9c14c
    • [PATCH] rmap 16: pretend prio_tree · fc96c90f
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Pave the way for prio_tree by switching over to its interfaces, but actually
      still implement them with the same old lists as before.
      
      Most of the vma_prio_tree interfaces are straightforward.  The interesting one
      is vma_prio_tree_next, used to search the tree for all vmas which overlap the
      given range: unlike the list_for_each_entry it replaces, it does not find
      every vma, just those that match.
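
      The difference between visiting every vma and visiting only the matching
      ones can be sketched in user space.  This is a minimal illustration, not the
      kernel interface: struct range, ranges_overlap and next_overlap are
      hypothetical names, and the scan is still linear, much as this patch keeps
      the old lists behind the new interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a file-backed vma: maps file pages [first, last]. */
struct range {
    unsigned long first;  /* first file page index mapped */
    unsigned long last;   /* last file page index mapped (inclusive) */
};

/* Two inclusive ranges overlap iff each one starts no later than the other ends. */
static int ranges_overlap(const struct range *r,
                          unsigned long begin, unsigned long end)
{
    return r->first <= end && begin <= r->last;
}

/*
 * Return the index of the next entry after `prev` that overlaps [begin, end],
 * or n when there are no more.  Pass prev == (size_t)-1 to start a search.
 */
static size_t next_overlap(const struct range *v, size_t n, size_t prev,
                           unsigned long begin, unsigned long end)
{
    assert(begin <= end);
    for (size_t i = prev + 1; i < n; i++)
        if (ranges_overlap(&v[i], begin, end))
            return i;
    return n;
}
```

      A caller loops on next_overlap until it returns n, and so only ever handles
      entries that can contain the pages of interest.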
      
      But this does leave handling of nonlinear vmas in a very unsatisfactory state:
      for now we have to search again over the maximum range to find all the
      nonlinear vmas which might contain a page, which of course takes away the
      point of the tree.  Fixed in later patch of this batch.
      
      There is no need to initialize vma linkage all over, just do it before
      inserting the vma in list or tree.  /proc/pid/statm had an odd test for its
      shared count: simplified to an equivalent test on vm_file.
      fc96c90f
    • [PATCH] rmap 15: vma_adjust · fb41b417
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      If file-based vmas are to be kept in a tree, according to the file offsets
      they map, then adjusting the vma's start pgoff or its end involves
      repositioning in the tree, while holding i_shared_lock (and page_table_lock). 
      We used to avoid that if possible, e.g.  when just moving end; but if we're
      heading that way, let's now tidy up vma_merge and split_vma, and do all the
      locking and adjustment in a new helper vma_adjust.  And please, let's call the
      next vma in vma_merge "next" rather than "prev".
      
      Since these patches are diffed over 2.6.6-rc2-mm2, they include the NUMA
      mpolicy mods which you'll have to remove to go earlier in the series, sorry
      for that nuisance.  I have intentionally changed the one vma_mpol_equal to
      mpol_equal, to make the merge cases more alike.
      fb41b417
    • [PATCH] numa api: fix end of memory handling in mbind · e1e71f9b
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      This fixes a user-triggerable crash in mbind() in the NUMA API.  It would oops
      when running into the end of memory.  Actually not really an oops, because an
      oops with the mm sem held for writing always deadlocks.
      e1e71f9b
    • [PATCH] numa api: Add policy support to anonymous memory · e8a2ef16
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      Change to core VM to use alloc_page_vma() instead of alloc_page().
      
      Change the swap readahead to follow the policy of the VMA.
      e8a2ef16
    • [PATCH] numa api: Add statistics · 6633401d
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      Add NUMA hit/miss statistics to page allocation and display them in sysfs.
      
      This is not 100% required for NUMA API, but without this it is very
      
      The overhead is quite low because all counters are per CPU, and it is only
      incurred when CONFIG_NUMA is defined.
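
      Why per-CPU counters are cheap can be shown with a small user-space sketch.
      All names here (NR_CPUS as a fixed constant, numa_counters,
      count_numa_event, sum_numa_event) are assumptions made for illustration and
      are not taken from the patch.

```c
#include <assert.h>

#define NR_CPUS 4  /* hypothetical fixed CPU count for the sketch */

enum numa_event { NUMA_HIT, NUMA_MISS, NR_NUMA_EVENTS };

/* One private row of counters per CPU: the hot path touches only its own row. */
static unsigned long numa_counters[NR_CPUS][NR_NUMA_EVENTS];

/* Allocation path: a plain increment, with no lock and no shared counter. */
static void count_numa_event(int cpu, enum numa_event ev)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    numa_counters[cpu][ev]++;
}

/* Rare read path, e.g. when a statistics file is read: sum over all CPUs. */
static unsigned long sum_numa_event(enum numa_event ev)
{
    unsigned long total = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        total += numa_counters[cpu][ev];
    return total;
}
```

      The cost of combining the rows is paid only by the reader, which is the
      uncommon case.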
      6633401d
    • [PATCH] small numa api fixups · e52c02f7
      Andrew Morton authored
      From: Christoph Hellwig <hch@lst.de>
      
      - don't include mempolicy.h in sched.h and mm.h when a forward declaration
        is enough.  Andi argued against that in the past, but I'd really hate to add
        another header to two of the includes used in basically every driver when we
        can include it in the six files actually needing it instead (that number is
        for my ppc32 system, maybe other arches need more includes in their
        directories)
      
      - make numa api fields in task_struct conditional on CONFIG_NUMA, this gives
        us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.
      e52c02f7
    • [PATCH] numa api: Add shared memory support · d31d7a18
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      Add NUMA API support to tmpfs and hugetlbfs.  Shared memory is a bit of a
      special case for NUMA policy.  Normally policy is associated with VMAs or
      with processes, but for a shared memory segment you really want to share the
      policy.  The core NUMA API has code for that; this patch adds the necessary
      changes to tmpfs and hugetlbfs.
      
      First it changes the custom swapping code in tmpfs to follow the policy set
      via VMAs.
      
      It is also useful to have a "backing store" of policy that saves the policy
      even when nobody has the shared memory segment mapped.  This allows command
      line tools to preconfigure policy, which is then later used by programs.
      
      Note that hugetlbfs needs more changes - it is also required to switch it to
      lazy allocation, otherwise the prefault prevents mbind() from working.
      d31d7a18
    • [PATCH] numa api: Add VMA hooks for policy · c78b023f
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      NUMA API adds a policy to each VMA.  During VMA creation, merging and
      splitting these policies must be handled properly.  This patch adds the calls
      that do so.
      
      It is a nop when CONFIG_NUMA is not defined.
      c78b023f
    • [PATCH] Re-add NUMA API statistics · 490b582a
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      This patch re-adds the sysfs output of the NUMA API statistics.  All my test
      scripts need this and it is very useful for checking whether the policy
      actually works.
      
      This got lost when the huge page numa api changes got dropped.
      
      I decided not to resend the huge pages NUMA API changes for now.  Instead I
      will wait for this area to settle when demand paged large pages are merged.
      490b582a
    • [PATCH] numa api core: use SLAB_PANIC · 0aa6e336
      Andrew Morton authored
      0aa6e336
    • [PATCH] mpol in copy_vma · e6ac361f
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      I think Andi missed the copy_vma I recently added for mremap, and it'll
      need something like below....  (Doesn't look like it'll optimize away when
      it's not needed - rather bloaty.)
      e6ac361f
    • [PATCH] numa api: Core NUMA API code · d3b8924a
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      The following patches add support for configurable NUMA memory policy
      for user processes. It is based on the proposal from last kernel summit
      with feedback from various people.
      
      This NUMA API does not attempt to implement page migration or anything
      else complicated: all it does is to police the allocation when a page
      is first allocated or when a page is reallocated after swapping. Currently
      only support for shared memory and anonymous memory is there; policy for
      file based mappings is not implemented yet (although they get implicitly
      policied by the default process policy)
      
      It adds three new system calls: mbind to change the policy of a VMA,
      set_mempolicy to change the policy of a process, get_mempolicy to retrieve
      memory policy. User tools (numactl, libnuma, test programs, manpages) can be
      found in  ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz
      
      For details on the system calls see the manpages
      http://www.firstfloor.org/~andi/mbind.html
      http://www.firstfloor.org/~andi/set_mempolicy.html
      http://www.firstfloor.org/~andi/get_mempolicy.html
      Most user programs should actually not use the system calls directly,
      but use the higher level functions in libnuma
      (http://www.firstfloor.org/~andi/numa.html) or the command line tools
      (http://www.firstfloor.org/~andi/numactl.html).
      
      The system calls allow user programs and administrators to set various NUMA
      memory policies for putting memory on specific nodes. Here is a short
      description of the policies copied from the kernel patch:
      
       * NUMA policy allows the user to give hints in which node(s) memory should
       * be allocated.
       *
       * Support four policies per VMA and per process:
       *
       * The VMA policy has priority over the process policy for a page fault.
       *
       * interleave     Allocate memory interleaved over a set of nodes,
       *                with normal fallback if it fails.
       *                For VMA based allocations this interleaves based on the
       *                offset into the backing object or offset into the mapping
       *                for anonymous memory. For process policy a process counter
       *                is used.
       * bind           Only allocate memory on a specific set of nodes,
       *                no fallback.
       * preferred      Try a specific node first before normal fallback.
       *                As a special case node -1 here means do the allocation
       *                on the local CPU. This is normally identical to default,
       *                but useful to set in a VMA when you have a non default
       *                process policy.
       * default        Allocate on the local node first, or when on a VMA
       *                use the process policy. This is what Linux always did
       *                in a NUMA aware kernel and still does by, ahem, default.
       *
       * The process policy is applied for most non interrupt memory allocations
       * in that process' context. Interrupts ignore the policies and always
       * try to allocate on the local CPU. The VMA policy is only applied for memory
       * allocations for a VMA in the VM.
       *
       * Currently there are a few corner cases in swapping where the policy
       * is not applied, but the majority should be handled. When process policy
       * is used it is not remembered over swap outs/swap ins.
       *
       * Only the highest zone in the zone hierarchy gets policied. Allocations
       * requesting a lower zone just use default policy. This implies that
       * on systems with highmem kernel lowmem allocations don't get policied.
       * Same with GFP_DMA allocations.
       *
       * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
       * all users and remembered even when nobody has memory mapped.
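
      The interleave policy for a VMA can be illustrated with a short sketch that
      maps a page offset to a node.  This is an assumption-laden simplification:
      the node set is a plain unsigned bitmask limited to 32 nodes, and
      count_nodes and interleave_node are hypothetical names, not the functions
      in the patch.

```c
#include <assert.h>

/* Count the set bits in a node mask (one bit per node, at most 32 nodes here). */
static unsigned count_nodes(unsigned mask)
{
    unsigned n = 0;
    for (; mask; mask &= mask - 1)  /* clear the lowest set bit each round */
        n++;
    return n;
}

/*
 * Choose the node for page `offset`: take offset modulo the number of allowed
 * nodes and walk to that many-th set bit of `mask`.
 */
static int interleave_node(unsigned mask, unsigned long offset)
{
    assert(mask != 0);
    unsigned target = (unsigned)(offset % count_nodes(mask));
    for (int node = 0; ; node++) {
        if (mask & (1u << node)) {
            if (target == 0)
                return node;
            target--;
        }
    }
}
```

      With nodes 1 and 3 allowed (mask 0xA), consecutive offsets alternate
      between node 1 and node 3, so the same offset always lands on the same
      node regardless of which process faults it in.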
      
      
      
      
      This patch:
      
      This is the core NUMA API code. This includes NUMA policy aware
      wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels
      these are defined away.
      
      The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
      get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
      set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
      implemented here.
      
      Adds a vm_policy field to the VMA and to the process. The process
      also has a field for interleaving. VMA interleaving uses the offset
      into the VMA, but that's not possible for process allocations.
      
      From: Andi Kleen <ak@muc.de>
      
        > Andi, how come policy_vma() calls ->set_policy under i_shared_sem?
      
        I think this can actually be dropped now.  In an earlier version I did
        walk the vma shared list to change the policies of other mappings to the
        same shared memory region.  This turned out too complicated with all the
        corner cases, so I eventually gave in and added ->get_policy to the fast
        path.  Also there is still the mmap_sem which prevents races in the same MM.
         
      
        Patch to remove it attached.  Also adds documentation and removes the
        bogus __alloc_page_vma() prototype noticed by hch.
      
      From: Andi Kleen <ak@suse.de>
      
        A few incremental fixes for NUMA API.
      
        - Fix a few comments
      
        - Add a compat_ function for get_mempolicy.  I considered changing the
          ABI to avoid this, but that would have made the API too ugly.  I put it
          directly into the file because a mm/compat.c didn't seem worth it just for
          this.
      
        - Fix the algorithm for VMA interleave.
      
      From: Matthew Dobson <colpatch@us.ibm.com>
      
        1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA.
          The only references to the function are in NUMA code in mempolicy.c
      
        2) Remove the definitions of __alloc_page_vma().  They aren't used.
      
        3) Move forward declaration of struct vm_area_struct to top of file.
      d3b8924a
    • [PATCH] numa api: Add IA64 support · 8a8e5a38
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      Add NUMA API system calls on IA64 and one bug fix required for it.
      8a8e5a38
    • [PATCH] numa api: Add i386 support · 5a6d8244
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      Add NUMA API system calls for i386
      5a6d8244
    • [PATCH] numa api: x86_64 support · 6f4944c5
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      Add NUMA API system calls on x86-64
      
      This includes a bugfix to prevent miscompilation on gcc 3.2 of bitmap.h
      6f4944c5
    • [PATCH] rmap 14: i_shared_lock fixes · d3f42511
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      First of a batch of six patches which introduce Rajesh Venkatasubramanian's
      implementation of a radix priority search tree of vmas, to handle object-based
      reverse mapping corner cases well.
      
      rmap 14 i_shared_lock fixes
      
      Start the sequence with a couple of outstanding i_shared_lock fixes.
      
      Since i_shared_sem became i_shared_lock, we've had to shift and then
      temporarily remove mremap move's protection of concurrent truncation - if
      mremap moves ptes while unmap_mapping_range_list is making its way through the
      vmas, there's a danger we'd move a pte from an area yet to be cleaned back
      into an area already cleared.
      
      Now site the i_shared_lock with the page_table_lock in move_one_page.  Replace
      page_table_present by get_one_pte_map, so we know when it's necessary to
      allocate a new page table: in which case have to drop i_shared_lock, trylock
      and perhaps reorder locks on the way back.  Yet another fix: must check for
      NULL dst before pte_unmap(dst).
      
      And over in rmap.c, try_to_unmap_file's cond_resched amidst its lengthy
      nonlinear swapping was now causing might_sleep warnings: moved to a rather
      unsatisfactory and less frequent cond_resched_lock on i_shared_lock when we
      reach the end of the list; and one before starting on the nonlinears too: the
      "cursor" may become out-of-date if we do schedule, but I doubt it's worth
      bothering about.
      d3f42511
    • [PATCH] Convert i_shared_sem back to a spinlock · c0868962
      Andrew Morton authored
      Having a semaphore in there causes modest performance regressions on heavily
      mmap-intensive workloads on some hardware.  Specifically, up to 30% in SDET on
      NUMAQ and big PPC64.
      
      So switch it back to being a spinlock.  This does mean that unmap_vmas() needs
      to be told whether or not it is allowed to schedule away; that's simple to do
      via the zap_details structure.
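
      The idea of telling a long-running loop whether it may schedule can be
      sketched as follows.  This is illustrative only: struct zap_hint,
      maybe_resched and unmap_range are hypothetical stand-ins, and the counter
      merely records where a real implementation would yield the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down analogue of the details structure named above. */
struct zap_hint {
    int atomic;   /* nonzero: the caller holds a spinlock, so do not sleep */
};

static unsigned long resched_points;  /* counts would-be scheduling points */

static void maybe_resched(void)
{
    resched_points++;  /* a real implementation would yield the CPU here */
}

/* Unmap `npages` pages in blocks, yielding between blocks only when allowed. */
static void unmap_range(unsigned long npages, unsigned long block,
                        const struct zap_hint *hint)
{
    assert(hint != NULL && block > 0);
    for (unsigned long done = 0; done < npages; done += block) {
        /* ... tear down up to `block` page table entries here ... */
        if (!hint->atomic)
            maybe_resched();
    }
}
```

      A caller under a spinlock passes atomic = 1 and accepts the latency; other
      callers pass atomic = 0 and keep their scheduling points.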
      
      This change means that there will be high scheduling latencies when someone
      truncates a large file which is currently mmapped, but nobody does that
      anyway.  The scheduling points in unmap_vmas() are mainly for munmap() and
      exit(), and they will still work OK for that.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Sorry, my premature optimizations (trying to pass down NULL zap_details
        except when needed) have caught you out doubly: unmap_mapping_range_list was
        NULLing the details even though atomic was set; and if it hadn't, then
        zap_pte_range would have missed free_swap_and_cache and pte_clear when pte
        not present.  Moved the optimization into zap_pte_range itself.  Plus
        massive documentation update.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Here's a second patch to add to the first: mremap's cows can't come home
        without releasing the i_mmap_lock, better move the whole "Subtle point"
        locking from move_vma into move_page_tables.  And it's possible for the file
        that was behind an anonymous page to be truncated while we drop that lock,
        don't want to abort mremap because of VM_FAULT_SIGBUS.
      
        (Eek, should we be checking do_swap_page of a vm_file area against the
        truncate_count sequence?  Technically yes, but I doubt we need bother.)
      
      
      - We cannot hold i_mmap_lock across move_one_page() because
        move_one_page() needs to perform __GFP_WAIT allocations of pagetable pages.
      
      - Move the cond_resched() out so we test it once per page rather than only
        when move_one_page() returns -EAGAIN.
      c0868962
    • [PATCH] rmap 13 include/asm deletions · 71a18745
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Delete include/asm*/rmap.h
      Delete pte_addr_t typedef from include/asm*/pgtable.h
      Delete KM_PTE2 from subset of include/asm*/kmap_types.h
      Beware when 4G/4G returns to -mm: i386 may need KM_FILLER for 8K stack.
      71a18745
    • [PATCH] rmap 12 pgtable remove rmap · 865fadf0
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Remove the support for pte_chain rmap from page table initialization, just
      continue to maintain nr_page_table_pages (but only for user page tables -
      it also counted vmalloc page tables before, little need, and I'm unsure if
      per-cpu stats are safe early enough on all arches).  mm/memory.c is the
      only core file affected.
      
      But ppc and ppc64 have found the old rmap page table initialization useful
      to support their ptep_test_and_clear_young: so transfer rmap's
      initialization to them (even on kernel page tables?  well, okay).
      865fadf0
    • [PATCH] rmap 11 mremap moves · 70b671f8
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      A weakness of the anonmm scheme is its difficulty in tracking pages shared
      between two or more mms (one being an ancestor of the other), when mremap has
      been used to move a range of pages in one of those mms.  mremap move is not
      very common anyway, and it's more often used on a page range exclusive to the
      mm; but uncommon though it may be, we must not allow unlocked pages to become
      unswappable.
      
      This patch follows Linus' suggestion, simply to take a private copy of the
      page in such a case: early C-O-W.  My previous implementation was daft with
      respect to pages currently on swap: it insisted on swapping them in to copy
      them.  No need for that: just take the copy when a page is brought in from
      swap, and its intended address is found to clash with what rmap has already
      noted.
      
      If do_swap_page has to make this copy in the mremap moved case (simply a call
      to do_wp_page), might as well do so also in the case when it's a write access
      but the page not exclusive, it's always seemed a little odd that swapin needed
      a second fault for that.  A bug even: get_user_pages force imagines that a
      single call to handle_mm_fault must break C-O-W.  Another bugfix: swapoff's
      unuse_process didn't check is_vm_hugetlb_page.
      
      Andrea's anon_vma has no such problem with mremap moved pages, handling them
      with elegant use of vm_pgoff - though at some cost to vma merging.  How
      important is it to handle them efficiently?  For now there's a msg
      printk(KERN_WARNING "%s: mremap moved %d cows\n", current->comm, cows);
      70b671f8
    • [PATCH] rmap 10 add anonmm rmap · 6bccf794
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Hugh's anonmm object-based reverse mapping scheme for anonymous pages.  We
      have not yet decided whether to adopt this scheme, or Andrea's more advanced
      anon_vma scheme.  anonmm is easier for me to merge quickly, to replace the
      pte_chain rmap taken out in the previous patch; a patch to install Andrea's
      anon_vma will follow in due course.
      
      Why build up and tear down chains of pte pointers for anonymous pages, when a
      page can only appear at one particular address, in a restricted group of mms
      that might share it?  (Except: see next patch on mremap.)
      
      Introduce struct anonmm per mm to track anonymous pages, all forks from one
      exec sharing the same bundle of linked anonmms.  Anonymous pages originate in
      one mm, but may be forked into another mm of the bundle later on.  Callouts
      from fork.c to allocate, dup and exit the anonmm structure private to rmap.c.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Two concurrent exits (of the last two mms sharing the anonhd).  First
        exit_rmap brings anonhd->count down to 2, gets preempted (at the
        spin_unlock) by second, which brings anonhd->count down to 1, sees it's 1
        and frees the anonhd (without making any change to anonhd->count itself),
        cpu goes on to do something new which reallocates the old anonhd as a new
        struct anonmm (probably not a head, in which case count will start at 1),
        first resumes after the spin_unlock and sees anonhd->count 1, frees "anonhd"
        again, it's used for something else, a later exit_rmap list_del finds list
        corrupt.
      6bccf794
    • [PATCH] rmap 9 remove pte_chains · 123e4df7
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Lots of deletions: the next patch will put in the new anon rmap, which
      should look clearer if first we remove all of the old pte-pointer-based
      rmap from the core in this patch - which therefore leaves anonymous rmap
      totally disabled, anon pages locked in memory until process frees them.
      
      Leave arch files (and page table rmap) untouched for now, clean them up in
      a later batch.  A few constructive changes amidst all the deletions:
      
      Choose names (e.g.  page_add_anon_rmap) and args (e.g.  no more pteps) now
      so we need not revisit so many files in the next patch.  Inline function
      page_dup_rmap for fork's copy_page_range, simply bumps mapcount under lock.
      cond_resched_lock in copy_page_range.  Struct page rearranged: no pte
      union, just mapcount moved next to atomic count, so two ints can occupy one
      long on 64-bit; i386 struct page now 32 bytes even with PAE.  Never pass
      PageReserved to page_remove_rmap, only do_wp_page did so.
      
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Move page_add_anon_rmap's BUG_ON(page_mapping(page)) inside the rmap_lock
        (well, might as well just check mapping if !mapcount then): if this page is
        being mapped or unmapped on another cpu at the same time, page_mapping's
        PageAnon(page) and page->mapping are volatile.
      
        But page_mapping(page) is used more widely: I've a nasty feeling that
        clear_page_anon, page_add_anon_rmap and/or page_mapping need barriers added
        (also in 2.6.6 itself),
      123e4df7
    • [PATCH] slab: consolidate panic code · b33a7bad
      Andrew Morton authored
      Many places do:
      
      	if (kmem_cache_create(...) == NULL)
      		panic(...);
      
      We can consolidate all that by passing another flag to kmem_cache_create()
      which says "panic if it doesn't work".
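
      The consolidation can be sketched in user space.  This is an analogue under
      stated assumptions, not the slab allocator: struct cache, cache_create and
      the CACHE_PANIC flag are hypothetical names, and abort() stands in for
      panic().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_PANIC 0x1u  /* hypothetical flag: abort instead of returning NULL */

struct cache {
    size_t obj_size;
};

/* Create a cache; on failure either return NULL or, with CACHE_PANIC, abort. */
static struct cache *cache_create(const char *name, size_t obj_size,
                                  unsigned flags)
{
    assert(name != NULL);
    struct cache *c = obj_size ? malloc(sizeof(*c)) : NULL;
    if (c == NULL) {
        if (flags & CACHE_PANIC) {
            fprintf(stderr, "cannot create cache %s\n", name);
            abort();  /* stands in for panic() in this sketch */
        }
        return NULL;
    }
    c->obj_size = obj_size;
    return c;
}
```

      Callers that cannot continue without the cache pass the flag and drop
      their own NULL check; callers that can recover leave it out.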
      b33a7bad
    • [PATCH] rmap 8 unmap nonlinear · 108e3158
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      The previous patch let the ptes of file pages be located via page
      ->mapping->i_mmap and i_mmap_shared lists of vmas; which works well unless
      the vma is VM_NONLINEAR - one in which sys_remap_file_pages has been used
      to place pages in unexpected places, to avoid an explosion of distinct
      unmergeable vmas.  Such pages were effectively locked in memory.
      
      page_referenced_file is already skipping nonlinear vmas, they'd just waste
      its time, and age unfairly any pages in their proper positions.  Now extend
      try_to_unmap_file, to persuade it to swap from nonlinears.
      
      Ignoring the page requested, try to unmap cluster of 32 neighbouring ptes
      (in worst case all empty slots) in a nonlinear vma, then move on to the
      next vma; stopping when we've unmapped at least as many maps as the
      requested page had (vague guide of how hard to try), or have reached the
      end.  With large sparse nonlinear vmas, this could take a long time:
      inserted a cond_resched while no locks are held, unusual at this level but
      I think okay, shrink_list does so.
      
      Use vm_private_data a little like the old mm->swap_address, as a cursor
      recording how far we got, so we don't attack the same ptes next time around
      (earlier tried inserting an empty marker vma in the list, but that got
      messy).  How well this will work on real-life nonlinear vmas remains to be
      seen, but should work better than locking them all in memory, or swapping
      everything out all the time.
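
      The cursor idea can be sketched with a flat array standing in for the ptes
      of a nonlinear vma.  Everything here is an assumption made for
      illustration: struct nl_area and unmap_cluster are hypothetical, and
      wrapping the cursor to zero at the end is a choice made for the sketch.

```c
#include <assert.h>

#define CLUSTER 32  /* slots examined per visit, as in the description above */

/* Hypothetical stand-in for a nonlinear vma: present[i] marks a mapped slot. */
struct nl_area {
    unsigned char *present;
    unsigned long nslots;
    unsigned long cursor;   /* where the next visit resumes */
};

/* Unmap up to one cluster starting at the cursor; return how many were mapped. */
static unsigned long unmap_cluster(struct nl_area *a)
{
    assert(a->nslots > 0 && a->cursor < a->nslots);
    unsigned long end = a->cursor + CLUSTER;
    if (end > a->nslots)
        end = a->nslots;
    unsigned long unmapped = 0;
    for (unsigned long i = a->cursor; i < end; i++) {
        if (a->present[i]) {
            a->present[i] = 0;
            unmapped++;
        }
    }
    /* Remember progress so the next visit does not rescan the same slots. */
    a->cursor = (end == a->nslots) ? 0 : end;
    return unmapped;
}
```

      Each call does a bounded amount of work, and repeated calls make their way
      through the whole area instead of hitting the same slots every time.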
      
      Existing users of vm_private_data have either VM_RESERVED or VM_DONTEXPAND
      set, both of which are in the VM_SPECIAL category where we never try to
      merge vmas: so removed the vm_private_data test from is_mergeable_vma, so
      we can still merge VM_NONLINEARs.  Of course, we could instead add another
      field to vm_area_struct.
      108e3158
    • [PATCH] rmap 7 object-based rmap · cab971db
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Dave McCracken's object-based reverse mapping scheme for file pages: why
      build up and tear down chains of pte pointers for file pages, when
      page->mapping has i_mmap and i_mmap_shared lists of all the vmas which
      might contain that page, and it appears at one deterministic position
      within the vma (unless vma is nonlinear - see next patch)?
      
      Has some drawbacks: more work to locate the ptes from page_referenced and
      try_to_unmap, especially if the i_mmap lists contain a lot of vmas covering
      different ranges; has to down_trylock the i_shared_sem, and hope that
      doesn't fail too often.  But attractive in that it uses less lowmem, and
      shifts the rmap burden away from the hot paths, to swapout.
      
      Hybrid scheme for the moment: carry on with pte_chains for anonymous pages,
      that's unchanged; but file pages keep mapcount in the pte union of struct
      page, where anonymous pages keep chain pointer or direct pte address: so
      page_mapped(page) works on both.
      
      Hugh massaged it a little: distinct page_add_file_rmap entry point; list
      searches check rss so as not to waste time on mms fully swapped out; check
      mapcount to terminate once all ptes have been found; and a WARN_ON if
      page_referenced should have but couldn't find all the ptes.
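
      The two search shortcuts described above (skip mms with nothing resident,
      stop once mapcount mappings are found) can be sketched as a loop.  The
      names struct cand and find_ptes are hypothetical, and the maps_page flag
      stands in for the real page table lookup.

```c
#include <assert.h>
#include <stddef.h>

struct cand {
    unsigned long rss;  /* resident pages of the owning mm; 0 = fully swapped out */
    int maps_page;      /* nonzero if this candidate really maps the page */
};

/*
 * Search candidates for mappings of one page.  Returns how many were found and
 * stores in *examined how many candidates needed a (costly) lookup.
 */
static unsigned find_ptes(const struct cand *v, size_t n, unsigned mapcount,
                          size_t *examined)
{
    assert(examined != NULL);
    unsigned found = 0;
    *examined = 0;
    for (size_t i = 0; i < n && found < mapcount; i++) {
        if (v[i].rss == 0)
            continue;       /* nothing resident: skip without looking */
        (*examined)++;
        if (v[i].maps_page)
            found++;
    }
    return found;
}
```

      A result smaller than mapcount after a full pass is the situation in which
      a warning would be appropriate.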
      cab971db
    • [PATCH] __set_page_dirty_nobuffers race fix · d61ae266
      Andrew Morton authored
      Running __mark_inode_dirty() against a swapcache page is illegal and will
      oops.
      
      I see a race in set_page_dirty() wherein it can be called with a PageSwapCache
      page, but if the page is removed from swapcache after
      __set_page_dirty_nobuffers() drops tree_lock, we have the situation where
      PageSwapCache() is false, but local variable `mapping' points at swapcache.
      
      Handle that by checking for non-null mapping->host.  We don't care about the
      page state at this point - we're only interested in the inode.
      
      
      
      There is a converse case: what if a page is added to swapcache as we are
      running set_page_dirty() against it?
      
      In this case the page gets its PG_dirty flag set but it is not tagged as dirty
      in the swapper_space radix tree.  The swap writeout code will handle this OK
      and test_clear_page_dirty()'s call to
      radix_tree_tag_clear(PAGECACHE_TAG_DIRTY) will silently have no effect.  The
      only downside is that future radix-tree-based writearound won't notice that
      such pages are dirty and swap IO scheduling will be a teensy bit worse.
      
      
      The patch also fixes the (silly) testing of local variable `mapping' to see if
      the page was truncated.  We should test page_mapping() for that.
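The mapping->host check can be sketched as below. The types and the function name `mark_host_dirty_sketch` are hypothetical stand-ins for illustration. The sketch assumes, as the message implies, that a swapcache mapping has no host inode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for struct inode and struct address_space. */
struct inode_sketch {
    int dirty;
};

struct mapping_sketch {
    struct inode_sketch *host; /* NULL for a swapcache-like mapping */
};

/*
 * Mark the owning inode dirty only when the mapping has one.
 * Returns 1 if an inode was dirtied, 0 if the step was skipped.
 * The decision looks only at the mapping, not at the page state.
 */
int mark_host_dirty_sketch(struct mapping_sketch *mapping)
{
    if (mapping == NULL || mapping->host == NULL)
        return 0; /* no inode behind this mapping: nothing to dirty */
    mapping->host->dirty = 1;
    return 1;
}
```

Testing the mapping rather than PageSwapCache() is what closes the window: the page flag may change after the lock is dropped, but the local `mapping' pointer still says whether an inode exists.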
      d61ae266
    • [PATCH] Make sync_page use swapper_space again · e6dfd92e
      Andrew Morton authored
      Revert recent changes to sync_page().  Now that page_mapping() returns
      &swapper_space for swapcache pages we don't need to test for PageSwapCache in
      sync_page().
      e6dfd92e
    • [PATCH] vmscan: revert may_enter_fs changes · 8ea360d4
      Andrew Morton authored
      Fix up the "may we call writepage" logic for the swapcache changes.
      8ea360d4
    • [PATCH] revert recent swapcache handling changes · e74193ad
      Andrew Morton authored
      Go back to the 2.6.5 concepts, with rmap additions.  In particular:
      
      - Implement Andrea's flavour of page_mapping().  This function opaquely does
        the right thing for pagecache pages, anon pages and for swapcache pages.
      
        The critical thing here is that page_mapping() returns &swapper_space for
        swapcache pages without actually requiring the storage at page->mapping. 
        This frees page->mapping for the anonmm/anonvma metadata.
      
      - Andrea and Hugh placed the pagecache index of swapcache pages into
        page->private rather than page->index.  So add new page_index() function
        which hides this.
      
      - Make swapper_space.set_page_dirty() again point at
        __set_page_dirty_nobuffers().  If we don't do that, a bare set_page_dirty()
        will fall through to __set_page_dirty_buffers(), which is silly.
      
        This way, __set_page_dirty_buffers() can continue to use page->mapping.
        It should never go near anon or swapcache pages.
      
      - Give swapper_space a ->set_page_dirty address_space_operation method, so
        that set_page_dirty() will not fall through to __set_page_dirty_buffers()
        for swapcache pages.  That function is not set up to handle them.
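The page_mapping() and page_index() accessors described in the first two items can be sketched as follows. Everything here is a hypothetical user-space stand-in: the flag bits, the struct layouts and the `_sketch` names are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define PG_SWAPCACHE_SKETCH 0x1u /* hypothetical flag bits */
#define PG_ANON_SKETCH      0x2u

struct space_sketch {
    const char *name;
};

struct page_sketch {
    unsigned int flags;
    void *mapping;          /* address space, or anon metadata if PG_ANON_SKETCH */
    unsigned long index;    /* pagecache index for file pages */
    unsigned long private_; /* swapcache index for swapcache pages */
};

struct space_sketch swapper_space_sketch = { "swapper_space" };

/* Return the address space a page belongs to, hiding the swapcache case. */
struct space_sketch *page_mapping_sketch(const struct page_sketch *page)
{
    assert(page != NULL);
    if (page->flags & PG_SWAPCACHE_SKETCH)
        return &swapper_space_sketch; /* page->mapping is not consulted */
    if (page->flags & PG_ANON_SKETCH)
        return NULL;                  /* mapping holds anon metadata */
    return (struct space_sketch *)page->mapping;
}

/* Return the pagecache index, hiding where swapcache pages keep it. */
unsigned long page_index_sketch(const struct page_sketch *page)
{
    assert(page != NULL);
    if (page->flags & PG_SWAPCACHE_SKETCH)
        return page->private_;
    return page->index;
}
```

Callers that go through the two accessors never look at the mapping field of a swapcache page, which is what leaves that field free for the anon metadata.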
      
      
      The main effect of these changes is that swapcache pages are treated more
      similarly to pagecache pages.  And we are again tagging swapcache pages as
      dirty in their radix tree, which is a requirement if we later wish to
      implement swapcache writearound based on tagged radix-tree walks.
      e74193ad
    • [PATCH] __add_to_swap_cache and add_to_pagecache() simplification · 7379e302
      Andrew Morton authored
      Simplify the logic in there a bit.
      7379e302
    • [PATCH] Make swapper_space tree_lock irq-safe · b6c418dc
      Andrew Morton authored
      ->tree_lock is supposed to be IRQ-safe.  Hugh worked out that with his
      changes, we never actually take it from interrupt context, so spin_lock() is
      sufficient.
      
      Apart from kinda freaking me out, the analysis which led to this decision
      becomes untrue with later patches.  So make it irq-safe.
      b6c418dc
    • Merge bk://kernel.bkbits.net/davem/net-2.6 · a20a9dee
      Linus Torvalds authored
      into ppc970.osdl.org:/home/torvalds/v2.6/linux
      a20a9dee
    • Avoid type warning in comparison by making it explicit. · 4d4aaa67
      Linus Torvalds authored
      (The difference between two pointers is a "ptrdiff_t", while
      MAX_LEN and the result here are "int"s).
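A minimal sketch of the idea, not the patched code itself: the function name, the buffer arguments and the value of `MAX_LEN` below are all assumptions, since the diff is not shown here. In standard C, subtracting two pointers yields a `ptrdiff_t`, so the comparison against an `int` limit is written with explicit conversions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEN 16 /* hypothetical limit; the real value is not shown */

/*
 * Length of the range [start, end), clamped to MAX_LEN.
 * Both sides of the comparison are given the same type explicitly,
 * so no implicit mixed-type comparison is left for the compiler to flag.
 */
int clamped_len_sketch(const char *start, const char *end)
{
    ptrdiff_t diff;

    assert(start != NULL && end != NULL && start <= end);
    diff = end - start;
    if (diff < (ptrdiff_t)MAX_LEN)
        return (int)diff; /* safe: diff is known to be below MAX_LEN */
    return MAX_LEN;
}
```

Comparing in the wider type first and narrowing afterwards avoids truncating a large difference before the clamp is applied.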
      4d4aaa67
    • [BRIDGE]: Forwarding table sanity checks. · 649f71c5
      Stephen Hemminger authored
      Forwarding table paranoia:
      * Solve some potential problems if a device changes address and one or
        more devices have the same address.
      * Warn if a new device added to a bridge matches an entry that has shown
        up on the network.
      * Also don't put static entries in the timer list; they don't time
        out, so they shouldn't be there.
      649f71c5
    • [BRIDGE]: Compat hooks for new-ioctl interface. · d6bd6619
      Stephen Hemminger authored
      Replacement 64-bit compatibility code for the new ioctls.  The new
      ioctls all pass through clean, but for the old-style ioctls it relies on
      the misfeature of the earlier bridge-utils that they check the API version.
      
      So if an old 32-bit version of brctl is run on a 64-bit platform it will
      report
      	bridge utilities not compatible with kernel version
      
      Tested on Itanium 1, but should solve the issue for sparc, ppc, and x86_64.
      d6bd6619
    • [BRIDGE]: New ioctl interface for 32/64 compatibility. · 5075405c
      Stephen Hemminger authored
      Add four new ioctls for the operations that can't be done through sysfs.
      The existing bridge ioctls are multiplexed, and most go through SIOCDEVPRIVATE,
      so they won't work in a mixed 32/64-bit environment.
      
      The new release of bridge-utils will use these if possible, and fall
      back to the old interface.
      5075405c