1. 29 Jul, 2002 16 commits
    • 08f9788a (Jens Axboe)
    • [PATCH] page table page->index · f6c2354a
      Paul Mackerras authored
      I found a situation where page->index for a pagetable page can be set
      to 0 instead of the correct value.  This means that ptep_to_address
      will return the wrong answer.  The problem occurs when remap_pmd_range
      calls pte_alloc_map and pte_alloc_map needs to allocate a new pte
      page, because remap_pmd_range has masked off the top bits of the
      address (to avoid overflow in the computation of `end'), and it passes
      the masked address to pte_alloc_map.
      
      Now we presumably don't need to get from the physical pages mapped by
      remap_page_range back to the ptes mapping them.  But we could easily
      map some normal pages using ptes in that pagetable page subsequently,
      and when we call ptep_to_address on their ptes it will give the wrong
      answer.
      
      The patch below fixes the problem.
      
      There is a more general question this brings up - some of the
      procedures which iterate over ranges of ptes will do the wrong thing
      if the end of the address range is too close to ~0UL, while others are
      OK.  Is this a problem in practice?  On i386, ppc, and the 64-bit
      architectures it isn't since user addresses can't go anywhere near
      ~0UL, but what about arm or m68k for instance?
      
      And BTW, being able to go from a pte pointer to the mm and virtual
      address that that pte maps is an extremely useful thing on ppc, since
      it will enable me to do MMU hash-table management at set_pte (and
      ptep_*) time and thus avoid the extra traversal of the pagetables that
      I am currently doing in flush_tlb_*.  So if you do decide to back out
      rmap, please leave in the hooks for setting page->mapping and
      page->index on pagetable pages.
    • [PATCH] fix include/linux/timer.h compile · 07611a33
      Paul Mackerras authored
      include/linux/timer.h needs to include <linux/stddef.h>
      to get the definition of NULL.
    • [PATCH] fix do_open() interaction with rd.c · bac5bcac
      Adam J. Richter authored
      	linux-2.5.28/fs/block_dev.c has a new do_open that broke
      initial ramdisk support, because it now requires devices that "manually"
      set bdev->bd_openers to set bdev->bd_inode->i_size as well.  The
      following single line patch, suggested by Russell King, fixes the
      problem.
      
      	There does not appear to be anyone acting as maintainer for
      rd.c, so I posted to lkml yesterday to ask if anyone objected to my
      submitting the patch to you, and I also emailed the message to Russell
      King and Al Viro.  Nobody has complained.  I have been running the
      patch for almost a day without problems.
    • Merge bk://bk.arm.linux.org.uk:14691 · 0cd3455f
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
    • [SERIAL] Cleanup includes. · 29363ba0
      Russell King authored
      Al Viro pointed out there was a fair bit of redundancy here.  We
      remove many include files from the serial layer, leaving those
      which are necessary for it to build.  This has been posted to lkml,
      no one complained.
      
      This cset also adds a missing include of asm/io.h in 8250_pci.c
      (unfortunately I've lost the name of the reporter, sorry.)
    • 81059697 (Russell King)
    • Merge bk://jfs.bkbits.net/linux-2.5 · fd4588d0
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
    • Remove d_delete call from jfs_rmdir and jfs_unlink · cb36441a
      Dave Kleikamp authored
      jfs_rmdir and jfs_unlink have always called d_delete, but it hasn't
      caused a problem until 2.5.28.  The call is an artifact of the 2.2
      kernel, which had gone unnoticed in 2.4 and 2.5.
    • Automerge · db469c8d
      Linus Torvalds authored
    • VM: remove unused /proc/sys/vm/kswapd and swapctl.h · a074f680
      Christoph Hellwig authored
      These were totally unused for a long time.  It's interesting how
      many files include swapctl.h, though..
    • [PATCH] Remove cli() from R3964 line discipline. · aa1190a2
      David Woodhouse authored
      I did this ages ago but never submitted it because I never got round to
      testing it. I still haven't tested it, but it ought to work, and the code
      is definitely broken without it...
    • Leftover from trident cli/sti removal. · 86560be4
      Linus Torvalds authored
      Noticed by Zwane Mwaikambo.
    • [SERIAL] Remove some old compatibility cruft from 8250_pci.c · b3a1d183
      Russell King authored
      8250_pci.c contains some old compatibility cruft for when __devexit
      wasn't defined by the generic kernel.  It is now, so it's gone.
  2. 28 Jul, 2002 24 commits
    • [PATCH] restore lru_cache_del() in truncate_complete_page · ff42067b
      Andrew Morton authored
      This restores lru_cache_del() in truncate_complete_page(), fixing
      buffercache leaks.  I removed the PF_INVALIDATE debug check, too.
      It's non-functional - the flag should have been set across
      truncate_inode_pages(), not invalidate_inode_pages().
    • Make "cpu_relax()" imply a barrier, since that's how it is · 3f0c2c5b
      Linus Torvalds authored
      used.
      
      This fixes a lockup in synchronize_irq() on x86.
    • Automerge · 984e13d3
      Linus Torvalds authored
    • Merge · 621f5626
      Linus Torvalds authored
    • [PATCH] sched-2.5.29-B1 · 8e77485f
      Ingo Molnar authored
      the attached patch is a comment update of sched.c and it also does a small
      cleanup in migration_thread().
    • [PATCH] SCSI MODE_SENSE transfer length fix · c5155e55
      Matthew Dharm authored
      Modified the MODE_SENSE write-protect test in sd.c to issue a SCSI
      request whose request_bufflen matches the transfer length that the
      issued MODE_SENSE command requests.
    • [PATCH] SCSI INQUIRY transfer length fix · d7cdb541
      Matthew Dharm authored
      Fixed one of the INQUIRY commands used for probing SCSI devices.  This
      badly-formed command was trapped by the usb-storage driver BUG_ON()
      which is designed to stop commands with a badly-formed transfer_length
      field.
    • [PATCH] put_page() uses audited · 06829ded
      Andrew Morton authored
      Audit put_page() uses of pages that may be in the page cache.
      
      Use page_cache_release() instead.
    • [PATCH] Re: Limit in set_thread_area · 686d6649
      Ingo Molnar authored
      the attached patch does the set_thread_area parameter simplification - it
      also cleans up some other TLS issues, it removes the tls_* fields from the
      thread_struct, and removes the now unused page-granularity flag.
    • [PATCH] permit modular build of raw driver · 603e29ca
      Andrew Morton authored
      This patch allows the raw driver to be built as a kernel module.
      
      It also cleans up a bunch of stuff, C99ifies the initialisers, gives
      lots of symbols static scope, etc.
      
      The module is unloadable when there are zero bindings.  The current
      ioctl() interface has no way of undoing a binding - it only allows
      bindings to be overwritten.  So I overloaded a bind to major=0,minor=0
      to mean "undo the binding".  I'll update the raw(8) manpage for that.
      
      generic_file_direct_IO has been exported to modules.
      
      The call to invalidate_inode_pages2() has been moved from all
      generic_file_direct_IO() callers into generic_file_direct_IO() itself,
      mainly to avoid exporting invalidate_inode_pages2() to modules.
    • [PATCH] direct IO updates · 0d85f8bf
      Andrew Morton authored
      This patch is a performance and correctness update to the direct-IO
      code: O_DIRECT and the raw driver.  It mainly affects IO against
      blockdevs.
      
      The direct_io code was returning -EINVAL for a filesystem hole.  Change
      it to clear the userspace page instead.
      
      There were a few restrictions and weirdnesses wrt blocksize and
      alignments.  The code has been reworked so we now lay out maximum-sized
      BIOs at any sector alignment.
      
      Because of this, the raw driver has been altered to set the blockdev's
      soft blocksize to the minimum possible at open() time.  Typically, 512
      bytes.  There are now no performance disadvantages to using small
      blocksizes, and this gives the finest possible alignment.
      
      There is no API here for setting or querying the soft blocksize of the
      raw driver (there never was, really), which could conceivably be a
      problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
      fd which /dev/raw/rawN returned, but that would require that
      blk_ioctl() be exported to modules again.
      
      This code is wickedly quick.  Here's an oprofile of a single 500MHz
      PIII reading from four (old) scsi disks (two aic7xxx controllers) via
      the raw driver.  Aggregate throughput is 72 megabytes/second:
      
      c013363c 24       0.0896492   __set_page_dirty_buffers
      c021b8cc 24       0.0896492   ahc_linux_isr
      c012b5dc 25       0.0933846   kmem_cache_free
      c014d894 26       0.09712     dio_bio_complete
      c01cc78c 26       0.09712     number
      c0123bd4 40       0.149415    follow_page
      c01eed8c 46       0.171828    end_that_request_first
      c01ed410 49       0.183034    blk_recount_segments
      c01ed574 65       0.2428      blk_rq_map_sg
      c014db38 85       0.317508    do_direct_IO
      c021b090 90       0.336185    ahc_linux_run_device_queue
      c010bb78 236      0.881551    timer_interrupt
      c01052d8 25354    94.707      poll_idle
      
      A testament to the efficiency of the 2.5 block layer.
      
      And against four IDE disks on an HPT374 controller.  Throughput is 120
      megabytes/sec:
      
      c01eed8c 80       0.292462    end_that_request_first
      c01fe850 87       0.318052    hpt3xx_intrproc
      c01ed574 123      0.44966     blk_rq_map_sg
      c01f8f10 141      0.515464    ata_select
      c014db38 153      0.559333    do_direct_IO
      c010bb78 235      0.859107    timer_interrupt
      c01f9144 281      1.02727     ata_irq_enable
      c01ff990 290      1.06017     udma_pci_init
      c01fe878 308      1.12598     hpt3xx_maskproc
      c02006f8 379      1.38554     idedisk_do_request
      c02356a0 609      2.22637     pci_conf1_read
      c01ff8dc 611      2.23368     udma_pci_start
      c01ff950 922      3.37062     udma_pci_irq_status
      c01f8fac 1002     3.66308     ata_status
      c01ff26c 1059     3.87146     ata_start_dma
      c01feb70 1141     4.17124     hpt374_udma_stop
      c01f9228 3072     11.2305     ata_out_regfile
      c01052d8 15193    55.5422     poll_idle
      
      Not so good.
      
      One problem which has been identified with O_DIRECT is the cost of
      repeated calls into the mapping's get_block() callback.  Not a big
      problem with ext2 but other filesystems have more complex get_block
      implementations.
      
      So what I have done is to require that callers of generic_direct_IO()
      implement the new `get_blocks()' interface.  This is a small extension
      to get_block().  It gets passed another argument which indicates the
      maximum number of blocks which should be mapped, and it returns the
      number of blocks which it did map in bh_result->b_size.  This allows
      the fs to map up to 4G of disk (or of hole) in a single get_block()
      invocation.
      
      There are some other caveats and requirements of get_blocks() which are
      documented in the comment block over fs/direct_io.c:get_more_blocks().
      
      Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
      mapping.  It certainly allows good speedups.  But it doesn't allow the
      fs to return a scatter list of blocks - it only understands linear
      chunks of disk.  I think that's really all it _should_ do.
      
      I'll let get_blocks() sit for a while and wait for some feedback.  If
      it is sufficient and nobody objects too much, I shall convert all
      get_block() instances in the kernel to be get_blocks() instances.  And
      I'll teach readahead (at least) to use the get_blocks() extension.
      
      Delayed allocate writeback could use get_blocks().  As could
      block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
      mileage using it in mpage_writepages() because all our filesystems are
      syncalloc, and nobody uses MAP_SHARED for much.
      
      It will be tricky to use get_blocks() for writes, because if a ton of
      blocks have been mapped into the file and then something goes wrong,
      the kernel needs to either remove those blocks from the file or zero
      them out.  The direct_io code zeroes them out.
      
      btw, some time ago you mentioned that some drivers and/or hardware may
      get upset if there are multiple simultaneous IOs in progress against
      the same block.  Well, the raw driver has always allowed that to
      happen.  O_DIRECT writes to blockdevs do as well now.
      
      todo:
      
      1) The driver will probably explode if someone runs BLKBSZSET while
         IO is in progress.  Need to use bdclaim() somewhere.
      
      2) readv() and writev() need to become direct_io-aware.  At present
         we're doing stop-and-wait for each segment when performing
         readv/writev against the raw driver and O_DIRECT blockdevs.
    • [PATCH] use c99 initialisers in ext3 · 62b52f5c
      Andrew Morton authored
      Convert ext3 to the C99 initialiser format.  From Rusty.
    • [PATCH] strict overcommit · 502bff06
      Andrew Morton authored
      Alan's overcommit patch, brought to 2.5 by Robert Love.
      
      Can't say I've tested its functionality at all, but it doesn't crash,
      it has been in -ac and RH kernels for some time and I haven't observed
      any of its functions on profiles.
      
      "So what is strict VM overcommit?  We introduce new overcommit
       policies that attempt to never succeed an allocation that can not be
       fulfilled by the backing store and consequently never OOM.  This is
       achieved through strict accounting of the committed address space and
       a policy to allow/refuse allocations based on that accounting.
      
       In the strictest of modes, it should be impossible to allocate more
       memory than available and impossible to OOM.  All memory failures
       should be pushed down to the allocation routines -- malloc, mmap, etc.
      
       The new modes are available via sysctl (same as before).  See
       Documentation/vm/overcommit-accounting for more information."
    • [PATCH] for_each_zone macro · a4b065fa
      Andrew Morton authored
      Patch from Robert Love.
      
      Attached patch implements for_each_zone(zone_t *) which is a helper
      macro to cleanup code of the form:
      
              for (pgdat = pgdat_list; pgdat; pgdat = pgdat->node_next) {
                      for (i = 0; i < MAX_NR_ZONES; ++i) {
                              zone_t * z = pgdat->node_zones + i;
                              /* ... */
                      }
              }
      
      and replace it with:
      
              for_each_zone(zone) {
                      /* ... */
              }
      
      This patch only replaces one use of the above loop with the new macro.
      Pending code, however, currently in the full rmap patch uses
      for_each_zone more extensively.
    • [PATCH] for_each_pgdat macro · f183c478
      Andrew Morton authored
      Patch from Robert Love.
      
      This patch implements for_each_pgdat(pg_data_t *) which is a helper
      macro to cleanup code that does a loop of the form:
      
              pgdat = pgdat_list;
              while (pgdat) {
                      /* ... */
                      pgdat = pgdat->node_next;
              }
      
      and replace it with:
      
              for_each_pgdat(pgdat) {
                      /* ... */
              }
      
      This code is from Rik's 2.4-rmap patch and is by William Irwin.
    • [PATCH] optimise struct page layout · a854c11b
      Andrew Morton authored
      Reorganise the members of struct page.
      
      - Place ->flags at the start so the compiler can generate indirect
        addressing rather than indirect+indexed for this commonly-accessed
        field.  Shrinks the kernel by ~100 bytes.
      
      - Keep ->count with ->flags so they have the best chance of
        being in the same cacheline.
    • [PATCH] speed up pte_chain locking on uniprocessors · ab35295d
      Andrew Morton authored
      ifdef out some operations in pte_chain_lock() which are not necessary
      on uniprocessor.
    • [PATCH] show_free_areas() cleanup · c1ab3459
      Andrew Morton authored
      Cleanup to show_free_areas() from Bill Irwin:
      
      show_free_areas() and show_free_areas_core() are a mess.
      (1) it uses a bizarre and ugly form of list iteration to walk buddy lists
              use standard list functions instead
      (2) it prints the same information repeatedly once per-node
              rationalize the braindamaged iteration logic
      (3) show_free_areas_node() is useless and not called anywhere
              remove it entirely
      (4) show_free_areas() itself just calls show_free_areas_core()
              remove show_free_areas_core() and do the stuff directly
      (5) SWAP_CACHE_INFO is always #defined, remove it
      (6) INC_CACHE_INFO() doesn't use the do { } while (0) construct
      
      This patch also includes Matthew Dobson's patch which removes
      mm/numa.c:node_lock.  The consensus is that it doesn't do anything now
      that show_free_areas_node() isn't there.
    • [PATCH] use a slab cache for pte_chains · cbb6e8ec
      Andrew Morton authored
      Patch from Bill Irwin.
      
      It removes the custom pte_chain allocator in mm/rmap.c and replaces it
      with a slab cache.
      
      "This patch
       (1) eliminates the pte_chain_freelist_lock and all contention on it
       (2) gives the VM the ability to recover unused pte_chain pages
      
       Anton Blanchard has reported (1) from prior incarnations of this patch.
       Craig Kulesa has reported (2) in combination with slab-on-LRU patches.
      
       I've left OOM detection out of this patch entirely as upcoming patches
       will do real OOM handling for pte_chains and all the code changed anyway."
    • [PATCH] misc fixes · 1a40868e
      Andrew Morton authored
      There are a few VM-related patches in this series.  Mainly fixes;
      feature work is on hold.
      
      We have some fairly serious locking contention problems with the reverse
      mapping's pte_chains.  Until we have a clear way out of that I believe
      that it is best to not merge code which has a lot of rmap dependency.
      
      It is apparent that these problems will not be solved by tweaking -
      some redesign is needed.  In the 2.5 timeframe the only practical
      solution appears to be page table sharing, based on Daniel's February
      work.  Daniel and Dave McCracken are working on that.
      
      Some bits and pieces here:
      
      - list_splice() has an open-coded list_empty() in it.  Use
        list_empty() instead.
      
      - in shrink_cache() we have a local `nr_pages' which shadows another
        local.  Rename the inner one.  (Nikita Danilov)
      
      - Add a BUG() on a can't-happen code path in page_remove_rmap().
      
      - Tighten up the bug checks in the BH completion handlers - if the
        buffer is still under IO then it must be locked, because we unlock it
        inside the page_uptodate_lock.
    • Since "access_process_vm()" releases pages that can be · 4e3663d7
      Linus Torvalds authored
      in the page cache, it needs to use page_cache_release()
      instead of plain "put_page()".
    • [PATCH] APM fixes, 2.5.29 · 06ba030a
      Ingo Molnar authored
      the attached patch fixes two things:
      
       - a TLS related bug noticed by Arjan van de Ven: apm_init() should set up
         all CPUs' gdt entries - just in case some code happens to call in the
         APM BIOS on the wrong CPU. This should also handle the case when some
         APM code gets triggered (by suspend or power button or something).
      
       - compilation problem
    • [PATCH] Support for cached lookups via readdirplus [6/6] · ab12b34b
      Trond Myklebust authored
      Add support for the glibc 'd_type' field in cases where we have the
      READDIRPLUS file attribute information available to us in
      nfs_do_filldir().