1. 19 Jul, 2002 40 commits
    • [PATCH] direct_io mopup · e3339bee
      Andrew Morton authored
      Some cleanup from the surprise direct-to-bio for O_DIRECT merge.
      
      - Remove bits and pieces from the kiobuf implementation
      
      - Replace the waitqueue in struct dio with just a task_struct pointer
        and use wake_up_process.  (Ben).
      
      - Only take mmap_sem around the individual calls to get_user_pages().
         (It pins the vmas, yes?)
      
      - Remove some debug code.
      
      - Fix JFS.
    • [PATCH] alloc_pages cleanup · 4504a57e
      Andrew Morton authored
      Cleanup patch from Martin Bligh: convert some loops which want to be
      `for' loops into that, and add some commentary.
    • [PATCH] inline generic_writepages() · 15a37ba2
      Andrew Morton authored
      generic_writepages() is just a wrapper around mpage_writepages(), so
      inline it.
    • [PATCH] restore CHECK_EMERGENCY_SYNC. Again. · 3d4ed856
      Andrew Morton authored
      Put the CHECK_EMERGENCY_SYNC back into the kupdate function.  I seem to
      keep removing it.
    • [PATCH] O_DIRECT open check · 7d0be429
      Andrew Morton authored
      Updated forward-port of Andrea's O_DIRECT open() checks.  If the user
      asked for O_DIRECT and the inode has no mapping or no a_ops then fail
      the open up-front.
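      
      Illustration (not part of the patch): a minimal userspace sketch of
      the up-front check described above.  The model_* structures, the flag
      value and the -EINVAL return code are assumptions made for the
      example; they are not the kernel's definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
#define MODEL_O_DIRECT 0x4000u          /* illustrative flag value only */

struct model_aops    { int unused; };
struct model_mapping { const struct model_aops *a_ops; };
struct model_inode   { const struct model_mapping *i_mapping; };

/*
 * Return 0 if the open may proceed, or -EINVAL (an assumed error code)
 * when O_DIRECT was requested but the inode has no mapping or no a_ops.
 */
int model_open_check(const struct model_inode *inode, unsigned int flags)
{
    if (!(flags & MODEL_O_DIRECT))
        return 0;                       /* not a direct-I/O open: nothing to check */
    if (inode->i_mapping == NULL || inode->i_mapping->a_ops == NULL)
        return -EINVAL;                 /* fail up-front instead of at I/O time */
    return 0;
}
```

      The point of checking at open() time is that the caller learns
      immediately that direct I/O is unavailable, rather than on the first
      read or write.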
    • [PATCH] VM instrumentation · e177ea28
      Andrew Morton authored
      A patch from Rik which adds some operational statistics to the VM.
      
      In /proc/meminfo:
      
      PageTables:	Amount of memory used for process pagetables
      PteChainTot:	Amount of memory allocated for pte_chain objects
      PteChainUsed:	Amount of memory currently in use for pte chains.
      
      In /proc/stat:
      
      pageallocs:	Number of pages allocated in the page allocator
      pagefrees:	Number of pages returned to the page allocator
      
      		(These can be used to measure the allocation rate)
      
      pageactiv:	Number of pages activated (moved to the active list)
      pagedeact:	Number of pages deactivated (moved to the inactive list)
      pagefault:	Total pagefaults
      majorfault:	Major pagefaults
      pagescan:	Number of pages which shrink_cache looked at
      pagesteal:	Number of pages which shrink_cache freed
      pageoutrun:	Number of calls to try_to_free_pages()
      allocstall:	Number of calls to balance_classzone()
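      
      Illustration (not part of the patch): one way the pageallocs and
      pagefrees counters could be turned into an allocation rate, given
      two samples taken some time apart.  The struct and function names
      are hypothetical, and parsing /proc/stat is left out.

```c
#include <stdint.h>

/* One reading of the two counters plus the time it was taken, in seconds. */
struct model_vm_sample {
    uint64_t pageallocs;
    uint64_t pagefrees;
    double   t;
};

/* Pages allocated per second between two samples (0.0 if no time elapsed). */
double model_alloc_rate(const struct model_vm_sample *a,
                        const struct model_vm_sample *b)
{
    double dt = b->t - a->t;

    if (dt <= 0.0)
        return 0.0;
    return (double)(b->pageallocs - a->pageallocs) / dt;
}

/* Net change in allocated pages between two samples (allocs minus frees). */
int64_t model_net_pages(const struct model_vm_sample *a,
                        const struct model_vm_sample *b)
{
    return (int64_t)(b->pageallocs - a->pageallocs)
         - (int64_t)(b->pagefrees - a->pagefrees);
}
```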
      
      
      Rik will be writing a userspace app which interprets these things.
      
      The /proc/meminfo stats are efficient, but the /proc/stat accumulators
      will cause undesirable cacheline bouncing.  We need to break the disk
      statistics out of struct kernel_stat and make everything else in there
      per-cpu.  If that doesn't happen in time for 2.6 then we disable
      KERNEL_STAT_INC().
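      
      Illustration (not part of the patch): a userspace sketch of the
      per-cpu idea.  Each CPU increments its own cacheline-sized slot, and
      a reader sums the slots, so writers never share a cacheline.  The
      CPU count, the 64-byte line size and all names are assumptions.

```c
#include <stddef.h>

#define MODEL_NR_CPUS   4       /* assumed CPU count for the example */
#define MODEL_CACHELINE 64      /* assumed cacheline size in bytes */

/* One counter per CPU, padded so that no two CPUs share a cacheline. */
struct model_stat_slot {
    unsigned long count;
    char pad[MODEL_CACHELINE - sizeof(unsigned long)];
};

struct model_percpu_stat {
    struct model_stat_slot slot[MODEL_NR_CPUS];
};

/* Increment the counter belonging to the given CPU only. */
void model_stat_inc(struct model_percpu_stat *s, unsigned int cpu)
{
    s->slot[cpu % MODEL_NR_CPUS].count++;
}

/* Readers pay the cost instead: sum every CPU's slot. */
unsigned long model_stat_read(const struct model_percpu_stat *s)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < MODEL_NR_CPUS; i++)
        sum += s->slot[i].count;
    return sum;
}
```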
    • [PATCH] avoid allocating pte_chains for unshared pages · 6a2ea338
      Andrew Morton authored
      Patch from David McCracken.  It is an optimisation to the rmap
      pte_chains.
      
      In the common case where a page is mapped by only a single pte, we
      don't need to allocate a pte_chain structure.  Just make the page's
      pte_chain pointer point straight at that pte and flag this with
      PG_direct.
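      
      Illustration (not part of the patch): a userspace sketch of the
      direct-pointer idea.  A flag bit says whether the page's pointer is
      a single pte or the head of a chain.  The model_* types, the flag
      value and the helper names are hypothetical.

```c
#include <stddef.h>

typedef unsigned long model_pte_t;

/* Chain node, needed only when more than one pte maps the page. */
struct model_pte_chain {
    struct model_pte_chain *next;
    model_pte_t *ptep;
};

#define MODEL_PG_DIRECT 0x1u    /* illustrative flag bit */

struct model_page {
    unsigned int flags;
    union {
        struct model_pte_chain *chain;  /* valid when MODEL_PG_DIRECT is clear */
        model_pte_t *direct;            /* valid when MODEL_PG_DIRECT is set */
    } pte;
};

/* Count the ptes that map this page. */
size_t model_page_mapcount(const struct model_page *page)
{
    const struct model_pte_chain *c;
    size_t n = 0;

    if (page->flags & MODEL_PG_DIRECT)
        return 1;
    for (c = page->pte.chain; c != NULL; c = c->next)
        n++;
    return n;
}

/* Record the first mapping of an unmapped page without allocating a chain. */
int model_add_first_rmap(struct model_page *page, model_pte_t *ptep)
{
    if ((page->flags & MODEL_PG_DIRECT) || page->pte.chain != NULL)
        return -1;              /* already mapped: a chain would be needed */
    page->pte.direct = ptep;
    page->flags |= MODEL_PG_DIRECT;
    return 0;
}
```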
    • [PATCH] leave truncate's orphaned pages on the LRU · fa08cc83
      Andrew Morton authored
      Fix to the page reclaim code from Rik.
      
      Anonymous pages which have buffers arise when
      truncate_complete_page()'s call to ->releasepage() failed.  Those pages
      may still be mapped into process address spaces.
      
      We should not remove them from the LRU, because that makes them
      unswappable and they hang around until process exit.
    • [PATCH] minimal rmap · c48c43e6
      Andrew Morton authored
      This is the "minimal rmap" patch, written by Rik, ported to 2.5 by Craig
      Kulesa.
      
      Basically,
      
      before: When the page reclaim code decides that it has scanned too many
      unreclaimable pages on the LRU it does a scan of process virtual
      address spaces for pages to add to swapcache.  ptes pointing at the
      page are unmapped as the scan proceeds.  When all ptes referring to a
      page have been unmapped and it has been written to swap the page is
      reclaimable.
      
      after: When an anonymous page is encountered on the tail of the LRU we
      use the rmap to see if it hasn't been referenced lately.  If so then
      add it to swapcache.  When the page is again encountered on the LRU, if
      it is still unreferenced then try to unmap all ptes which refer to it
      in one hit, and if it is clean (ie: on swap) then free it.
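      
      Illustration (not part of the patch): the "after" behaviour written
      as a decision function over a simplified page state.  The struct,
      the enum and the handling of a page that is not yet clean are
      assumptions; the real code must also cope with unmap failures.

```c
#include <stdbool.h>

/* Simplified view of an anonymous page found at the tail of the LRU. */
struct model_anon_page {
    bool referenced;    /* some pte referenced it recently (found via the rmap) */
    bool in_swapcache;  /* already added to swapcache on an earlier pass */
    bool clean;         /* contents are already on swap */
};

enum model_action {
    MODEL_KEEP,                 /* recently used: leave it alone */
    MODEL_ADD_TO_SWAPCACHE,     /* first encounter while unreferenced */
    MODEL_UNMAP_AND_FREE,       /* second encounter, clean: reclaim it */
    MODEL_UNMAP_ONLY            /* second encounter, not clean yet (assumption) */
};

enum model_action model_reclaim_anon(const struct model_anon_page *p)
{
    if (p->referenced)
        return MODEL_KEEP;
    if (!p->in_swapcache)
        return MODEL_ADD_TO_SWAPCACHE;
    /* Still unreferenced on the next encounter: unmap every pte in one hit. */
    return p->clean ? MODEL_UNMAP_AND_FREE : MODEL_UNMAP_ONLY;
}
```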
      
      The rest of the VM - list management, the classzone concept, etc
      remains unchanged.
      
      There are a number of things which the per-page pte chain could be
      used for.  Bill Irwin has identified the following.
      
      
      (1)  page replacement no longer goes around randomly unmapping things
      
      (2)  referenced bits are more accurate because there aren't several ms
              or even seconds between finding the multiple ptes mapping a page
      
      (3)  reduces page replacement from O(total virtually mapped) to O(physical)
      
      (4)  enables defragmentation of physical memory
      
      (5)  enables cooperative offlining of memory for friendly guest instance
              behavior in UML and/or LPAR settings
      
      (6)  demonstrable benefit in performance of swapping which is common in
              end-user interactive workstation workloads (I don't like the word
              "desktop"). c.f. Craig Kulesa's post wrt. swapping performance
      
      (7)  evidence from 2.4-based rmap trees indicates approximate parity
              with mainline in kernel compiles with appropriate locking bits
      
      (8)  partitioning of physical memory can reduce the complexity of page
              replacement searches by scanning only the "interesting" zones
              implemented and merged in 2.4-based rmap
      
      (9)  partitioning of physical memory can increase the parallelism of page
              replacement searches by independently processing different zones
              implemented, but not merged in 2.4-based rmap
      
      (10) the reverse mappings may be used for efficiently keeping pte cache
              attributes coherent
      
      (11) they may be used for virtual cache invalidation (with changes)
      
      (12) the reverse mappings enable proper RSS limit enforcement
              implemented and merged in 2.4-based rmap
      
      
      
      The code adds a pointer to struct page, consumes additional storage for
      the pte chains and adds computational expense to the page reclaim code
      (I measured it at 3% additional load during streaming I/O).  The
      benefits which we get back for all this are, I must say, theoretical
      and unproven.  If it has real advantages (or, indeed, disadvantages)
      then why has nobody demonstrated them?
      
      
      
      There are a number of things remaining to be done:
      
      1: Demonstrate the above advantages.
      
      2: Make it work with pte-highmem  (Bill Irwin is signed up for this)
      
      3: Don't add pte_chains to non-shared pages optimisation (Dave McCracken's
         patch does this)
      
      4: Move the pte_chains into highmem too (Bill, I guess)
      
      5: per-cpu pte_chain freelists (Rik?)
      
      6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)
      
      7: multithread the page reclaim code.  (I have patches).
      
      8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
         often well-ordered-by-virtual-address on the LRU, so it "just
         works" for benchmarky loads.  But there may be some other loads...
      
      9: Fix bad IO latency in page reclaim (I have lame patches)
      
      10: Develop tuning tools, use them.
      
      11: The nightly updatedb run is still evicting everything.
    • Merge bk://lsm.bkbits.net/linus-2.5 · b15d45bf
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
    • Merge http://linuxusb.bkbits.net/agpgart-2.5 · faaab2cf
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
    • [PATCH] IDE 100 · 84f4a1c4
      Martin Dalecki authored
        Trivia time:
      
        - C99 conforming initializations by Rusty.
      
        - ide__sti() -> local_irq_enable() and its friends.
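      
      Illustration (not part of the patch): what a C99 designated
      initializer looks like, next to the older GNU "field:" form shown
      in a comment.  The model_* names are hypothetical and that the
      converted code used the GNU form is an assumption.

```c
#include <stddef.h>

struct model_ops {
    int (*open)(void);
    int (*release)(void);
    const char *name;
};

static int model_open(void)
{
    return 0;
}

/*
 * Older GNU extension syntax, for comparison:
 *     open:  model_open,
 *     name:  "model",
 */
const struct model_ops model_driver_ops = {
    .open = model_open,         /* C99 designated initializer */
    .name = "model",
    /* .release is not named, so it is initialized to NULL */
};
```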
    • [PATCH] 2.5.26 IDE 99 · e9356da8
      Martin Dalecki authored
      Most noticeable in the patch:
      
      1. We now handle IRQ sharing better than ever.
      
      2. It survives quite a lot of testing by a few people. For example,
      cat /dev/hdb > /dev/null, where /dev/hdb contains a CD-ROM
      with a big scratch on the surface to make sure it's broken :-).
      BTW, it's amazing how wide the scratch had to be before errors
      occurred.
      
      3. Doesn't play with rq_rdev and friends
      
      Fri Jul 12 05:04:32 CEST 2002 ide-clean-99
      
      - Push nIEN disabling down at the place where we are finished with a particular
         request.
      
      - First round of command line parser cleanups by Gerald Champagne.
      
      - Unfold the drive eviction functions in do_request(). This allowed us to
         realize that we don't have to re-get the major/minor numbers of the device we
         are acting on from the raw device field of the currently running request. One
         significant place fewer in the kernel where major/minor data gets manipulated.
      
      - Move the big IDE_BUSY loop out of do_request to do_ide_request().  This makes
         us realize that we don't have to clear the IDE_BUSY bit just before
         reentering do_request to look for more requests still pending on the queue
         and set it immediately again.
      
         This fixes a tiny race on the code path from the IRQ or timer function,
         where we had a tiny window between the clearing of the IDE_BUSY bit and
         reentering the request queue for completely unrelated requests to get in
         our way.
      
      - Don't return any value in do_reset1(). It's always ATA_OP_CONTINUES. Split it
         up into two functions, one for disks (well, in fact channels) and one for
         ATAPI devices. It turns out that they can be moved to the places where they
         are used to clarify the code flow. The only function remaining is
         do_reset_channel() now.
      
      - Duplicate code from ide_do_drive_cmd() explicitly in ide_raw_taskfile().
         Simplify ide_raw_taskfile() thereafter. Realize that ide_do_drive_cmd()
         is now only used by ATAPI devices. Move it therefore to atapi.c.
      
      - Do busy polling for ATAPI reset operations. This is much safer than the
         previous timer games played there. It simply doesn't make sense to give the
         bus up during such a subtle operation. We don't have to disable IRQs here as
         well, since we are already under the protection of the do_request mechanisms.
         (Well hopefully...)
      
      - Remove the no longer used reset_poll() function. poll_timeout and friends are
         now used only in pdc4030 code. Those functions were not called from IRQ context
         but they were set as handlers and not as expiry functions.
      
      - Return ATA_OP_CONTINUES instead of ATA_OP_FINISHED in ata_error(), to signal
         that we are willing to retry the operation until the maximal number of retry
         attempts is exceeded. Returning ATA_OP_FINISHED without prior end_request()
         hangs the system.
      
      - Apply trivia from DJ patch set.
      
      - Apply small configuration fix to ide-pci.c from Muli Ben-Yehuda.
      
      - Feed add_blkdev_randomness with information we already have in struct
         ata_channel *ch->major, instead of using the major(macro) on the request in
         question.
      
      - Make ide_raw_taskfile use the same request submission mechanism as
         tcq_invalidate_queue(). Something similar would be ideal for ioctl() code as
         well.
      
      - Implement actual device reset. Realize that the recalibration procedure is
         doomed by the standard. Therefore, don't try to recover by recalibrating
         devices: our retry mechanism alone should work in those cases. And suddenly
         the error handling code is IRQ safe.
      
      - Reinvent the ATA reset operation, since it is apparently needed. We still
         have to do the whole transfer timing reconfiguration there.
      
      - Move drive_is_ready(), which is in reality an attempt to check for IRQ
         requesters without clearing the IRQ line, over to the place where it belongs:
         device.c, which is the direct device access abstraction place.  Rename it to
         ata_status_irq() to prevent global name space pollution.
      
      - Updates to the pdc202xxx host chip controller setup code by Bartłomiej
         Żołnierkiewicz:
      
         Forward port 2.4 patch by Hank Yang from Promise:
      
      	- Add PDC20271 support
      	- Disable LBA48 support on PDC20262
      	- Fix ATAPI UDMA port value
      	- Add new quirk drive
      	- Adjust timings for all drives when using ATA133
      	- Update pdc202xx_reset() waiting time
      
      - Mark TCQ as dangerous and add some bits about it to the help.
      
      - Add some missing exports.
      
      - Some small ide-scsi.c host allocation fixes by sullivan.
    • [PATCH] MD - Get rid of dev in rdev and use bdev exclusively. · 3ec59360
      Neil Brown authored
      Get rid of dev in rdev and use bdev exclusively.
      
      There is an awkwardness here in that userspace sometimes
      passes down a dev_t (e.g. hot_add_disk) and sometimes
      a major and a minor (e.g. add_new_disk).  Should we convert
      both to kdev_t as the uniform standard?
      That is what was being done, but it seemed very clumsy and
      things were getting converted back and forth a lot.
      
      As bdget used a dev_t, I felt safe in staying with dev_t once I
      had one rather than converting to kdev_t and back.
    • [PATCH] MD - Change partition_name calls to bdev_partition_name where possible. · c4909782
      Neil Brown authored
      Change partition_name calls to bdev_partition_name where possible.
      
      All part of decreasing reliance on device numbers... at least in
      appearance.
    • [PATCH] MD - Remove the sb from the mddev · 43fb3e86
      Neil Brown authored
      Remove the sb from the mddev
      
      Now that all the important information is in mddev, we don't need
      to have an sb off the mddev.  We only keep the per-device ones.
      
      Previously we determined if "set_array_info" had been run by checking
      mddev->sb.  Now we check mddev->raid_disks on the assumption that
      any valid array MUST have a non-zero number of devices.
    • [PATCH] MD - Remove dependence on superblock · bab5d712
      Neil Brown authored
      Remove dependence on superblock
      
      All the remaining fields of interest in the superblock
      get duplicated in the mddev structure and this is treated as
      authoritative.  The superblock gets completely generated at
      write time, and all useful information extracted at read time.
      
      This means that we can slot in different superblock formats
      without affecting the bulk of the code.
    • [PATCH] MD - Move persistent from superblock to mddev · 5e601b35
      Neil Brown authored
      Move persistent from superblock to mddev
      
      Tidy up calc_dev_sboffset and calc_dev_size on the way
    • [PATCH] MD - Remove number and raid_disk from personality arrays · 9f3b0380
      Neil Brown authored
      Remove number and raid_disk from personality arrays
      
      These are redundant.  number is not needed any more, and
      raid_disk never was, as that is the index.
    • [PATCH] MD - nr_disks is gone from multipath/raid1 · 4395b447
      Neil Brown authored
      nr_disks is gone from multipath/raid1
      
      Never used.
    • [PATCH] MD - Remove old_dev field. · f2421da3
      Neil Brown authored
      Remove old_dev field.
      
      We used to monitor the previous device number of a
      component device for superblock maintenance.  This is
      not needed any more.
    • [PATCH] MD - Don't maintain disc status in superblock. · d109d34c
      Neil Brown authored
      Don't maintain disc status in superblock.
      
      The state is now in rdev so we don't maintain it
      in superblock any more.
      We also no longer test the content of the superblock for
      disk status.
      mddev->spare is now an rdev and not a superblock fragment.
    • [PATCH] MD - when writing superblock, generate from mddev/rdev info. · 1b114450
      Neil Brown authored
      when writing superblock, generate from mddev/rdev info.
      
      Rather than relying on the superblock info being kept up-to-date,
      we regenerate the superblock from mddev/rdev info before
      each write.
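      
      Illustration (not part of the patch): a userspace sketch of
      "generate at write time".  The in-memory structures are the source
      of truth and the on-disk image is rebuilt from scratch before each
      write.  Field names echo those mentioned in this series (raid_disks,
      desc_nr, raid_disk, in_sync), but the layouts and the function name
      are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Authoritative in-memory state (hypothetical, much reduced). */
struct model_mddev { int32_t level; uint32_t raid_disks; };
struct model_rdev  { uint32_t desc_nr; uint32_t raid_disk; int in_sync; };

/* On-disk image: never trusted in memory, only produced for writing. */
struct model_sb {
    int32_t  level;
    uint32_t raid_disks;
    uint32_t this_desc_nr;
    uint32_t this_raid_disk;
    uint32_t this_in_sync;
};

/* Rebuild the whole superblock image from mddev/rdev just before a write. */
void model_generate_sb(struct model_sb *sb,
                       const struct model_mddev *mddev,
                       const struct model_rdev *rdev)
{
    memset(sb, 0, sizeof(*sb));         /* discard whatever was there before */
    sb->level          = mddev->level;
    sb->raid_disks     = mddev->raid_disks;
    sb->this_desc_nr   = rdev->desc_nr;
    sb->this_raid_disk = rdev->raid_disk;
    sb->this_in_sync   = rdev->in_sync ? 1u : 0u;
}
```

      Because nothing reads the image back between writes, a stale or
      differently formatted superblock cannot leak into the running state.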
    • [PATCH] MD - Add "degraded" field to md device · d58aa811
      Neil Brown authored
      Add "degraded" field to md device
      
      This is used to determine if a spare should be added
      without relying on the superblock.
    • [PATCH] MD - Add in_sync flag to each rdev · 8ee83145
      Neil Brown authored
      Add in_sync flag to each rdev
      
      This currently mirrors the MD_DISK_SYNC superblock flag,
      but soon it will be authoritative and the superblock will
      only be consulted at start time.
    • [PATCH] MD - Add raid_disk field to rdev · 9347ddf5
      Neil Brown authored
      Add raid_disk field to rdev
      
      Also change find_rdev_nr to find based on position
      in array (raid_disk) not position in superblock (number).
    • [PATCH] MD - Improve handling of spares in md · 82081640
      Neil Brown authored
      Improve handling of spares in md
      
      - hot_remove_disk is given the raid_disk rather than descriptor number
        so that it can find the device in internal array directly, no search.
      - spare_inactive now uses mddev->spare->raid_disk instead of
        mddev->spare->number so it can find the device directly without searching
      - spare_write does not need number.  It can use mddev->spare->raid_disk as above.
      - spare_active does not need &mddev->spare.  It finds the descriptor directly
        and fixes it without this pointer
    • [PATCH] MD - Remove concept of 'spare' drive for multipath. · 03aa5c1c
      Neil Brown authored
      Remove concept of 'spare' drive for multipath.
      
      Multipath now treats all working devices as
      active and does I/O to the first working one.
    • [PATCH] MD - Set desc_nr more sanely. · 999a2029
      Neil Brown authored
      Set desc_nr more sanely.
      
      Currently rdev->desc_nr is set in sync_sbs which is typically
      called just before writing out the superblocks, which is an
      odd place to set it.
      It is also called when a new disk is added (which is sane) and
      when an old disc is imported ... which is questionable.
      
      With this patch it is set when a new disk is added, and when
      the superblocks are being analysed, which makes lots of sense.
      
      MULTIPATH is particularly an issue here.  The old code tried
      to figure the desc_nr for an rdev by matching device numbers in
      the superblock.  This doesn't make a lot of sense as
      device numbers can change.  Now MULTIPATH components
      get sequential desc_nrs.
    • [PATCH] MD - Move md_update_sb calls · 6f42312c
      Neil Brown authored
      Move md_update_sb calls
      
      When a change which requires a superblock update happens
      at interrupt time, we currently set a flag (sb_dirty) and
      wake up the per-array thread (raid1/raid5d/multipathd) to
      do the actual update.
      
      This patch centralises this.  The sb_update is now done
      by the mdrecoveryd thread.  As this is always woken up after
      the error handler is called, we don't need the call to wake up
      the local thread any more.
      
      With this, we don't need "md_update_sb" to lock the array
      any more and only use __md_update_sb which is local to md.c
      So we rename __md_update_sb back to md_update_sb and stop
      exporting it.
    • [PATCH] MD - Pass the correct bdev to md_error · a15b60a2
      Neil Brown authored
      Pass the correct bdev to md_error
      
      After a call to generic_make_request, bio->bi_bdev can have changed
      (e.g. by a re-mapper like raid0).  So we cannot trust it for reporting
      the source of an error.  This patch takes care to find the correct
      bdev.
    • [PATCH] MD - Rdev list cleanups. · 2a9400e9
      Neil Brown authored
      Rdev list cleanups.
      
      An "rdev" can be on three different lists.
       - the list of all rdevs
       - the list of pending rdevs
       - the list of rdevs for a given mddev
      
      The first list is now only used to list "unused" devices in
      /proc/mdstat, and only pending rdevs can be unused, so this list
      isn't necessary.
      An rdev cannot be both pending and in an mddev, so we know rdev will
      only be on one list at a time.
      
      This patch discards  the all_raid_disks list, and changes the
      pending list to use "same_set" in the rdev.  It also changes
      /proc/mdstat to iterate through pending devices, rather than through
      all devices.
      
      So now an rdev is only on one list, either the pending list
      or the list of rdevs for a given mddev.  This means that
      ITERATE_RDEV_GENERIC doesn't need to be told which field
      to walk down: there is only one.
    • [PATCH] MD - Get rid of find_rdev_all · 70e96bef
      Neil Brown authored
      Get rid of find_rdev_all
      
      find_rdev_all is now only used to check if a device is already
      used in an md array.
      
      We change lock_rdev so that it claims the bdev for
      the specific rdev rather than for rdevs in general.
      Now lock_rdev will check if the bdev is in use by another array
      or not, so the find_rdev_all check isn't needed and is removed,
      along with find_rdev_all itself.
      
      We also make sure that the error code from lock_rdev is
      propagated up properly.
    • [PATCH] MD - Use symbolic names for multipath (-4) and linear (-1) · a0f86742
      Neil Brown authored
      Use symbolic names for multipath (-4) and linear (-1)
      
      Also, a variable called "level" was being used to store a
      "level" and a "personality" number.  This is potentially
      confusing, so it is now two variables.
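      
      Illustration (not part of the patch): symbolic names for the two
      magic level numbers given in the title.  The macro and function
      names below are hypothetical; only the values -4 and -1 come from
      the commit.

```c
/* Values from the commit title; the macro names are illustrative only. */
#define MODEL_LEVEL_MULTIPATH (-4)
#define MODEL_LEVEL_LINEAR    (-1)

/* Map a level number to a printable name without sprinkling -4/-1 around. */
const char *model_level_name(int level)
{
    switch (level) {
    case MODEL_LEVEL_MULTIPATH:
        return "multipath";
    case MODEL_LEVEL_LINEAR:
        return "linear";
    default:
        return level >= 0 ? "raid" : "unknown";
    }
}
```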
    • [PATCH] MD - Don't "analyze_sb" when creating new array. · 376163df
      Neil Brown authored
      Don't "analyze_sb" when creating new array.
      
      When creating a new array (and we have an mddev->sb),
      don't bother to analyze the superblocks.  There is no point.
      Also, this means we always allocate the array sb in
      analyze_sbs, rather than conditionally.
    • [PATCH] MD - Embed bio in mp_bh rather than separate allocation. · e3de153e
      Neil Brown authored
      Embed bio in mp_bh rather than separate allocation.
      
      multipath currently allocates an mp_bh and a bio for each
      request.  With this patch, the bio is made to be part of the
      mp_bh so there is only one allocation, and it comes from a private
      pool (the bio was allocated from a shared pool).
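      
      Illustration (not part of the patch): a userspace sketch of
      embedding one structure in another so that a single allocation
      covers both.  malloc() stands in for the private pool, and the
      model_* names and the container_of-style macro are assumptions
      about how the outer structure might be recovered from the inner
      one.

```c
#include <stddef.h>
#include <stdlib.h>

struct model_bio {
    unsigned long sector;
};

/* The bio lives inside the per-request structure: one allocation, not two. */
struct model_mp_bh {
    int path;
    struct model_bio bio;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Allocate request and bio together; malloc() stands in for a private pool. */
struct model_mp_bh *model_mp_bh_alloc(int path, unsigned long sector)
{
    struct model_mp_bh *mp_bh = malloc(sizeof(*mp_bh));

    if (mp_bh == NULL)
        return NULL;
    mp_bh->path = path;
    mp_bh->bio.sector = sector;
    return mp_bh;
}

/* Given only the bio (e.g. in a completion handler), find its mp_bh. */
struct model_mp_bh *model_bio_to_mp_bh(struct model_bio *bio)
{
    return model_container_of(bio, struct model_mp_bh, bio);
}
```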
      
      Also remove "remaining" and "cmd" from mp_bh which aren't used.
      And remove spare (unused) from multipath_private_data.
    • [PATCH] MD - 27 - Remove state field from multipath mp_bh structure. · 8e2a19e7
      Neil Brown authored
      Remove state field from multipath mp_bh structure.
      
      The MPBH_Uptodate flag is set but never used.
      The MPBH_SyncPhase flag was never used.
      These are both legacy from the copying of raid1.c.
      
      MPBH_PreAlloc is no longer needed due to the use of
      mempools, so the state field can go...
    • [PATCH] MD - Get multipath to use mempool · e18a7e5c
      Neil Brown authored
      Get multipath to use mempool
      
      ... rather than maintaining its own memory pool
    • [PATCH] MD - Remove dead consistency checking code from multipath. · 663c6269
      Neil Brown authored
      Remove dead consistency checking code from multipath.
      
      This "consistancy_check" is carried over from raid1 on which multipath
      was based, and was not used in raid1 and has since been removed.  Now
      it gets removed from multipath too.
    • [PATCH] MD - Remove bdput calls from raid personalities. · 82b0fad1
      Neil Brown authored
      Remove bdput calls from raid personalities.
      
      Some of the md personalities currently hold a counted reference
      on a bdev.  This is not necessary as the main md module will always
      hold a counted reference in the rdev.
      This patch removes the code to take and drop these unnecessary
      references.