1. 20 May, 2002 26 commits
    • Jack Hammer's avatar
      [PATCH] ips for 2.5 · 99baa651
      Jack Hammer authored
      ips driver update ( version 5.10.13-BETA )
      99baa651
    • Linus Torvalds's avatar
      Update kernel version to 2.5.17 · 97e87c96
      Linus Torvalds authored
      97e87c96
    • Christoph Hellwig's avatar
      [PATCH] get rid of <linux/locks.h> · bd2b0c85
      Christoph Hellwig authored
      The locks.h header contained some hand-crafted locking routines from
      the pre-SMP days.  In 2.5 only lock_super/unlock_super are left,
      guarded by a number of completely unrelated (!) includes.
      
      This patch moves lock_super/unlock_super to fs.h, which defines the
      struct super_block that they operate on, removes locks.h, and updates
      all callers to stop including it, adding the missing, previously
      nested includes where needed.
      bd2b0c85
    • Linus Torvalds's avatar
      Merge quota update from Jan Kara · 43a3a37b
      Linus Torvalds authored
      43a3a37b
    • Jan Kara's avatar
      [PATCH] [13/13] quota-13-ioctl · ad447df3
      Jan Kara authored
      This patch implements ioctl() for getting space used by file.
      I agree it's ioctl() abuse, it doesn't work on links and has
      other ugly properties. Better would be to change 'struct stat'
      but changing it just due to this is overkill and it will take
      some time before there will be enough changes which will provoke
      yet another struct stat :). So this is a temporary solution...
      If you don't like it, simply reject it. The function it provides
      is not fundamental...
      
      So that should be all patches. Any comments (or decision about
      including/not including) welcome.
      								Honza
      ad447df3
    • Jan Kara's avatar
      [PATCH] [12/13] quota-12-compat · 1c5bbffe
      Jan Kara authored
      This patch implements a configurable backward-compatible quota interface.
      Maybe this isn't needed in 2.5, but as some people want to use the patches
      in 2.4, where it is necessary, I have implemented it.
      1c5bbffe
    • Jan Kara's avatar
      [PATCH] [11/13] quota-11-sync · 736e690e
      Jan Kara authored
      Implemented proper syncing of dquots, i.e. global information about
      quota files is now synced as well. We find the info to sync by walking
      through all superblocks...
      736e690e
    • Jan Kara's avatar
      [PATCH] [10/13] quota-10-inttype · 0c532315
      Jan Kara authored
      Remove use of 'short' in parameters of functions. 'int' is used instead.
      0c532315
    • Jan Kara's avatar
      [PATCH] [9/13] quota-9-format2 · 8ea6f99a
      Jan Kara authored
      Implementation of the new quota format. The code is almost the same
      as in the -ac versions of the kernel. All the code for the new format
      is in quota_v2.c
      8ea6f99a
    • Jan Kara's avatar
      [PATCH] [8/13] quota-8-format1 · dcfb8111
      Jan Kara authored
      Implementation of old quota format. All the code for old format is now in
      quota_v1.c. Code mostly remained the same as in older kernels (just minor
      changes were needed to bind it with quota interface).
      dcfb8111
    • Jan Kara's avatar
      [PATCH] [7/13] quota-7-quotactl · b5abbc1f
      Jan Kara authored
      This is probably the largest chunk of the quota patches. It removes the old quotactl
      interface and implements a new one. The new interface should not need arch-specific
      conversions, so they are removed. All quota interface code is moved to quota.c so we can
      easily separate out the parts which must be compiled even if quota is disabled (mainly
      because XFS needs some of the interface even if standard VFS quota is disabled).
      Callbacks to the filesystem on quota_on() and quota_off() are implemented (needed by Ext3).
      Quota operation callbacks are now set in super.c on superblock initialization and
      not on quota_on(). This way it starts to make sense to have callbacks on alloc_space(),
      alloc_inode() etc., as the filesystem can override them on read_super(). This will be used
      later for implementing journalled quota.
      b5abbc1f
    • Jan Kara's avatar
      [PATCH] [6/13] quota-6-bytes · ce9fb139
      Jan Kara authored
      This patch implements counting of used space in inodes in bytes.
      New field i_bytes is added and used space modulo 512 is kept in
      it (rest is still kept in i_blocks). Functions manipulating both
      i_blocks and i_bytes are implemented (inode_add_bytes(), inode_sub_bytes()
      and inode_set_bytes()). Filesystems allocating only in whole blocks
      can safely ignore i_bytes field and continue using i_blocks...
      ce9fb139
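The split between i_blocks and i_bytes described in quota-6-bytes can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct inode_usage` and the `usage_*` helpers are hypothetical names, not the kernel's inode_add_bytes(), inode_sub_bytes() or inode_set_bytes(), and the kernel versions may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two inode fields named in the commit. */
struct inode_usage {
    uint64_t i_blocks; /* used space in whole 512-byte units */
    uint32_t i_bytes;  /* used space modulo 512, always < 512 */
};

/* Total used space in bytes, recombined from both fields. */
static uint64_t usage_get_bytes(const struct inode_usage *u)
{
    return (u->i_blocks << 9) + u->i_bytes;
}

/* Store a byte count as whole 512-byte units plus a remainder. */
static void usage_set_bytes(struct inode_usage *u, uint64_t bytes)
{
    u->i_blocks = bytes >> 9;
    u->i_bytes = (uint32_t)(bytes & 511);
}

static void usage_add_bytes(struct inode_usage *u, uint64_t bytes)
{
    usage_set_bytes(u, usage_get_bytes(u) + bytes);
}

static void usage_sub_bytes(struct inode_usage *u, uint64_t bytes)
{
    assert(bytes <= usage_get_bytes(u)); /* cannot free more than is used */
    usage_set_bytes(u, usage_get_bytes(u) - bytes);
}
```

A filesystem that allocates only whole blocks would always leave the remainder at zero here, which matches the commit's note that such filesystems can ignore i_bytes.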
    • Jan Kara's avatar
      [PATCH] [5/13] quota-5-space · f0071c7b
      Jan Kara authored
      This patch implements accounting of used space in bytes.
      f0071c7b
    • Jan Kara's avatar
      [PATCH] [4/13] quota-4-getstats · f48acc23
      Jan Kara authored
        This patch moves reporting of quota statistics from Q_GETSTATS call to
      /proc/fs/quota. Also reporting of registered quota formats is added.
      f48acc23
    • Jan Kara's avatar
      [PATCH] [3/13] quota-3-register · 48c39f24
      Jan Kara authored
        This patch implements list 'quota_formats' with registered quota formats
      and functions register_quota_format() and unregister_quota_format() for
      manipulating the list.
      48c39f24
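A registry like the 'quota_formats' list from quota-3-register could look roughly like the sketch below. The structure layout, the `fmt_id` field and the function names are assumptions for illustration, not the kernel's actual definitions, and the sketch has no locking, so it is single-threaded only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical format descriptor; the real kernel structure differs. */
struct quota_format_sketch {
    int fmt_id;                        /* illustrative format identifier */
    struct quota_format_sketch *next;  /* singly linked registry */
};

static struct quota_format_sketch *format_list;

/* Add a format at the head of the list. */
static void register_format(struct quota_format_sketch *fmt)
{
    assert(fmt != NULL);
    fmt->next = format_list;
    format_list = fmt;
}

/* Unlink a format if it is currently registered; otherwise do nothing. */
static void unregister_format(struct quota_format_sketch *fmt)
{
    struct quota_format_sketch **link = &format_list;

    while (*link != NULL && *link != fmt)
        link = &(*link)->next;
    if (*link != NULL)
        *link = fmt->next;
}

/* Look a format up by identifier; returns NULL when not registered. */
static struct quota_format_sketch *find_format(int fmt_id)
{
    struct quota_format_sketch *fmt;

    for (fmt = format_list; fmt != NULL; fmt = fmt->next)
        if (fmt->fmt_id == fmt_id)
            return fmt;
    return NULL;
}
```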
    • Jan Kara's avatar
      [PATCH] [2/13] quota-2-formats · b80d2549
      Jan Kara authored
      This patch removes most format dependent code from dquot.c and quota.h
      and puts calls of callback functions instead.
      b80d2549
    • Jan Kara's avatar
      [PATCH] [1/13] quota-1-newlocks · 61d681d6
      Jan Kara authored
      This patch adds dq_dup_ref to struct dquot. Functions that only alter quota usage
      take just this duplicated reference; inodes and quotactl() helpers take a real
      dq_count reference. dqput() blocks if there are duplicated references outstanding and
      the reference being put is the last 'real' one. This ensures that quota I/O is not
      done from functions altering quota usage (the quota structure is written on the last dqput()).
      61d681d6
    • Linus Torvalds's avatar
      Merge · 6033f024
      Linus Torvalds authored
      6033f024
    • Jan Harkes's avatar
      [PATCH] iget_locked [6/6] · 9b406173
      Jan Harkes authored
      As of the last patch the inode_hashtable doesn't really need to be
      indexed by i_ino anymore; the only reason we still have to keep the
      hashvalue and i_ino identical is insert_inode_hash.
      
      If at some point a FS specific getattr method is implemented it will be
      possible to completely remove any use of i_ino by the VFS.
      9b406173
    • Jan Harkes's avatar
      [PATCH] iget_locked [5/6] · aa624c8d
      Jan Harkes authored
      This patch starts taking i_ino dependencies out of the VFS. The FS
      provided test and set callbacks become responsible for testing and
      setting inode->i_ino.
      
      Because most filesystems are based on 32-bit unique inode numbers
      several functions are duplicated to keep iget_locked as a fast path. We
      can avoid unnecessary pointer dereferences and function calls for this
      specific case.
      aa624c8d
    • Jan Harkes's avatar
      [PATCH] iget_locked [4/6] · 16fb4ea3
      Jan Harkes authored
      Now that we have no more users of iget4 we can kill the function and the
      associated read_inode2 callback (i.e. the 'reiserfs specific hack').
      
      Document iget5_locked as the replacement for iget4 in filesystems/porting.
      16fb4ea3
    • Jan Harkes's avatar
      [PATCH] iget_locked [3/6] · 77d1ac9b
      Jan Harkes authored
      Convert existing filesystems (Coda/NFS/ReiserFS) that currently use
      iget4 to iget5_locked.
      77d1ac9b
    • Jan Harkes's avatar
      [PATCH] iget_locked [2/6] · 85b640c5
      Jan Harkes authored
      Now we introduce iget_locked and iget5_locked. These are similar to
      iget, but return a locked inode and read_inode has not been called. So
      the FS has to call read_inode to initialize the inode and then unlock
      it with unlock_new_inode().
      
      This patch is based on the icreate patch from the XFS group, i.e.
      it is pretty much identical except for function naming.
      85b640c5
    • Jan Harkes's avatar
      [PATCH] iget_locked [1/6] · 7a24f1a6
      Jan Harkes authored
      Fix a race in iget4. The fs specific data that is used to find an inode
      should be initialized while still holding the inode lock.
      
      It adds a 'set' callback function that should be a non-blocking FS
      provided function which initializes the private parts of the inode so
      that the 'test' callback function can correctly match new inodes.
      
      Touches all filesystems that use iget4 (Coda/NFS/ReiserFS).
      7a24f1a6
    • Linus Torvalds's avatar
      Merge http://kernel-acme.bkbits.net:8080/char-copy_tofrom_user-2.5 · 80be2217
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
      80be2217
  2. 19 May, 2002 14 commits
    • Arnaldo Carvalho de Melo's avatar
      drivers/char/* · 2d7d1c4e
      Arnaldo Carvalho de Melo authored
        - fix copy_{to,from}_user error handling, thanks to Rusty for pointing this out on lkml
      2d7d1c4e
    • Arnaldo Carvalho de Melo's avatar
      drivers/block/*.c · 20131c10
      Arnaldo Carvalho de Melo authored
        - fix copy_{to,from}_user error handling, thanks to Rusty for
          pointing this out on lkml
      20131c10
    • Andrew Morton's avatar
      [PATCH] remove PG_launder · a2536452
      Andrew Morton authored
      Removal of PG_launder.
      
      It's not obvious (to me) why this ever existed.  If it's to prevent
      deadlocks then I'd like to know who was performing __GFP_FS allocations
      while holding a page lock?
      
      But in 2.5, the only memory allocations which are performed when the
      caller holds PG_writeback against an unsubmitted page are those which
      occur inside submit_bh().  There will be no __GFP_FS allocations in
      that call chain.
      
      Removing PG_launder means that memory allocators can block on any
      PageWriteback() page at all, which reduces the risk of very long list
      walks inside pagemap_lru_lock in shrink_cache().
      a2536452
    • Andrew Morton's avatar
      [PATCH] fix ext3 race with writeback · 5409c2b5
      Andrew Morton authored
      The ext3-no-steal patch has exposed a long-standing race in ext3.  It
      has been there all the time in 2.4, but never triggered until some
      timing change in the ext3-no-steal patch exposed it.  The race was not
      present in 2.2 because 2.2's bdflush runs inside lock_kernel().
      
      The problem is that when ext3 is shuffling a buffer between journalling
      lists there is a small window where the buffer is marked BH_dirty.
      Another CPU can grab it, mark it clean and write it out.  Then ext3
      puts the buffer onto a list of buffers which are expected to be dirty,
      and gets confused later on when the buffer turns out to be clean.
      
      The patch from Stephen records the expected dirtiness of the buffer in
      a local variable, so BH_dirty is not transiently set while ext3
      shuffles.
      5409c2b5
    • Andrew Morton's avatar
      [PATCH] fix ext3 buffer-stealing · d9ae0cee
      Andrew Morton authored
      Patch from sct fixes a long-standing (I did it!) and rather complex
      problem with ext3.
      
      The problem is to do with buffers which are continually being dirtied
      by an external agent.  I had code in there (for easily-triggerable
      livelock avoidance) which steals the buffer from checkpoint mode and
      reattaches it to the running transaction.  This violates ext3 ordering
      requirements - it can permit journal space to be reclaimed before the
      relevant data has really been written out.
      
      Also, we do have to reliably get a lock on the buffer when moving it
      between lists and inspecting its internal state.  Otherwise a competing
      read from the underlying block device can trigger an assertion failure,
      and a competing write to the underlying block device can confuse ext3
      journalling state completely.
      d9ae0cee
    • Andrew Morton's avatar
      [PATCH] improved I/O scheduling for indirect blocks · 799391cc
      Andrew Morton authored
      Fixes a performance problem with many-small-file writeout.
      
      At present, files are written out via their mapping and their indirect
      blocks are written out via the blockdev mapping.  As we know that
      indirects are disk-adjacent to the data it is better to start I/O
      against the indirects at the same time as the data.
      
      The delalloc paths have code in ext2_writepage() which recognises when
      the target page->index was at an indirect boundary and does an explicit
      hunt-and-write against the neighbouring indirect block.  Which is
      ideal.  (Unless the file was dirtied seekily and the page which is next
      to the indirect was not dirtied).
      
      This patch does it the other way: when we start writeback against a
      mapping, also start writeback against any dirty buffers which are
      attached to mapping->private_list.  Let the elevator take care of the
      rest.
      
      The patch makes a number of tuning changes to the writeback path in
      fs-writeback.c.  This is very fiddly code: getting the throughput
      tuned, getting the data-integrity "sync" operations right, avoiding
      most of the livelock opportunities, getting the `kupdate' function
      working efficiently, keeping it all at least somewhat comprehensible.
      
      An important intent here is to ensure that metadata blocks for inodes
      are marked dirty before writeback starts working the blockdev mapping,
      so all the inode blocks are efficiently written back.
      
      The patch removes try_to_writeback_unused_inodes(), which became
      unreferenced in vm-writeback.patch.
      
      The patch has a tweak in ext2_put_inode() to prevent ext2 from
      incorrectly dropping its preallocation window in response to a random
      iput().
      
      
      Generally, many-small-file writeout is a lot faster than 2.5.7 (which
      is linux-before-I-futzed-with-it).  The workload which was optimised was
      
      	tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync
      
      on mem=128M and mem=2048M.
      
      With these patches, 2.5.15 is completing in about 2/3 of the time of
      2.5.7.  But it is only a shade faster than 2.4.19-pre7.  Why is 2.5.7
      so much slower than 2.4.19?  Not sure yet.
      
      Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than
      2.5.7 and significantly slower than 2.4.19.  It appears that the cause
      is poor read throughput at the later stages of the run, because there
      are background writeback threads operating at the same time.
      
      The 2.4.19-pre8 write scheduling manages to stop writeback during the
      latter stages of the dbench run in a way which I haven't been able to
      sanely emulate yet.  It may not be desirable to do this anyway - it's
      optimising for the case where the files are about to be deleted.  But
      it would be good to find a way of "pausing" the writeback for a few
      seconds to allow readers to get an interval of decent bandwidth.
      
      tiobench throughput is basically the same across all recent kernels.
      CPU load on writes is down maybe 30% in 2.5.15.
      799391cc
    • Andrew Morton's avatar
      [PATCH] ext2: preread inode backing blocks · a9f525e6
      Andrew Morton authored
      When ext2 creates a new inode, perform an asynchronous preread against
      its backing block.
      
      Without this patch, many-file writeout gets stalled by having to read
      many individual inode table blocks in the middle of writeback.
      
      It's worth about a 20% gain in writeback bandwidth for the many-file
      writeback case.
      
      ext3 already reads the inode's backing block in
      ext3_new_inode->ext3_mark_inode_dirty, so no change is needed there.
      
      A backport to 2.4 would make sense.
      a9f525e6
    • Andrew Morton's avatar
      [PATCH] writeback tuning · acb5f6f9
      Andrew Morton authored
      Tune up the VM-based writeback a bit.
      
      - Always use the multipage clustered-writeback function from within
        shrink_cache(), even if the page's mapping has a NULL ->vm_writeback().  So
        clustered writeback is turned on for all address_spaces, not just ext2.
      
        Subtle effect of this change: it is now the case that *all* writeback
        proceeds along the mapping->dirty_pages list.  The orderedness of the page
        LRUs no longer has an impact on disk scheduling.  So we only have one list
        to keep well-sorted rather than two, and churning pages around on the LRU
        will no longer damage write bandwidth - it's all up to the filesystem.
      
      - Decrease the clustered writeback from 1024 pages(!) to 32 pages.
      
        (1024 was a leftover from when this code was always dispatching writeback
        to a pdflush thread).
      
      - Fix wakeup_bdflush() so that it actually does write something (duh).
      
        do_wp_page() needs to call balance_dirty_pages_ratelimited(), so we
        throttle mmap page-dirtiers in the same way as write(2) page-dirtiers.
        This may make wakeup_bdflush() obsolete, but it doesn't hurt.
      
      - Converts generic_vm_writeback() to directly call ->writeback_mapping(),
        rather than going through writeback_single_inode().  This prevents memory
        allocators from blocking on the inode's I_LOCK.  But it does mean that two
        processes can be writing pages from the same mapping at the same time.  If
        filesystems care about this (for layout reasons) then they should serialise
        in their ->writeback_mapping a_op.
      
        This means that memory-allocators will writeback only pages, not pages
        and inodes.  There are no locks in that writeback path (except for request
        queue exhaustion).  Reduces memory allocation latency.
      
      - Implement new background_writeback function, which when kicked off
        will perform writeback until dirty memory falls below the background
        threshold.
      
      - Put written-back pages onto the remote end of the page LRU.  It
        does this in the slow-and-stupid way at present.  pagemap_lru_lock
        stress-relief is planned...
      
      - Remove the funny writeback_unused_inodes() stuff from prune_icache().
        Writeback from wakeup_bdflush() and the `kupdate' function now just
        naturally cleanses the oldest inodes so we don't need to do anything
        there.
      
      - Dirty memory balancing is still using magic numbers: "after you
        dirtied your 1,000th page, go write 1,500".  Obviously, this needs
        more work.
      acb5f6f9
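The "magic numbers" rule quoted in the last bullet ("after you dirtied your 1,000th page, go write 1,500") amounts to a per-dirtier counter. The sketch below shows that idea only; the names and the structure are hypothetical and do not reproduce balance_dirty_pages_ratelimited().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RATELIMIT_PAGES 1000UL /* pages dirtied before throttling */
#define SKETCH_WRITE_CHUNK     1500UL /* pages to write back when throttled */

/* Hypothetical per-dirtier state. */
struct dirtier_sketch {
    unsigned long dirtied; /* pages dirtied since the last writeback request */
};

/*
 * Record one newly dirtied page. Returns the number of pages the caller
 * should write back now: 0 until the threshold is reached, then
 * SKETCH_WRITE_CHUNK, after which the counter starts over.
 */
static unsigned long note_page_dirtied(struct dirtier_sketch *d)
{
    assert(d != NULL);
    if (++d->dirtied < SKETCH_RATELIMIT_PAGES)
        return 0;
    d->dirtied = 0;
    return SKETCH_WRITE_CHUNK;
}
```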
    • Andrew Morton's avatar
      [PATCH] pdflush exclusion · 17a74e88
      Andrew Morton authored
      Use the pdflush exclusion infrastructure to ensure that only one
      pdflush thread is ever performing writeback against a particular
      request_queue.
      
      This works rather well.  It requires a lot of activity against a lot of
      disks to cause more pdflush threads to start up.  Possibly the
      thread-creation logic is a little weak: it starts more threads when a
      pdflush thread goes back to sleep.  It may be better to start new
      threads within pdflush_operation().
      
      All non-request_queue-backed address_spaces share the global
      default_backing_dev_info structure.  So at present only a single
      pdflush instance will be available for background writeback of *all*
      NFS filesystems (for example).
      
      If there is benefit in concurrent background writeback for multiple NFS
      mounts then NFS would need to create per-mount backing_dev_info
      structures and install those into new inode's address_spaces in some
      manner.
      17a74e88
    • Andrew Morton's avatar
      [PATCH] pdflush exclusion infrastructure · 1f6acea0
      Andrew Morton authored
      Collision avoidance for pdflush threads.
      
      Turns the request_queue-based `unsigned long ra_pages' into a structure
      which contains ra_pages as well as a longword.
      
      That longword is used to record the fact that a pdflush thread is
      currently writing something back against this request_queue.
      
      Avoids the situation where several pdflush threads are sleeping on the
      same request_queue.
      
      This patch provides only the infrastructure for the pdflush exclusion.
      This infrastructure gets used in pdflush-single.patch
      1f6acea0
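The exclusion idea, a longword stored next to ra_pages that records whether a writeback thread already owns the queue, can be approximated in user space with C11 atomics. This is a sketch under assumptions: the structure, the bit name and the acquire/release helpers are invented for illustration, and the kernel would use its own bit operations rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_BDI_PDFLUSH 1UL /* bit 0: a writeback thread owns this queue */

/* Hypothetical analogue of the structure that replaces the bare ra_pages. */
struct backing_dev_sketch {
    unsigned long ra_pages; /* readahead window, in pages */
    atomic_ulong state;     /* the "longword" holding the ownership bit */
};

/* Try to claim the queue. Returns true only for the thread that set the bit. */
static bool writeback_acquire(struct backing_dev_sketch *bdi)
{
    unsigned long old = atomic_fetch_or(&bdi->state, SKETCH_BDI_PDFLUSH);

    return (old & SKETCH_BDI_PDFLUSH) == 0;
}

/* Give the queue back so another thread may claim it. */
static void writeback_release(struct backing_dev_sketch *bdi)
{
    unsigned long old = atomic_fetch_and(&bdi->state, ~SKETCH_BDI_PDFLUSH);

    assert(old & SKETCH_BDI_PDFLUSH); /* releasing an unclaimed queue is a bug */
    (void)old;
}
```

A thread that fails to acquire simply moves on to other work instead of sleeping on the same queue, which is the situation the commit sets out to avoid.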
    • Andrew Morton's avatar
      [PATCH] dirty inode management · 610c5ab8
      Andrew Morton authored
      Fix the "race with umount" in __sync_list().  __sync_list() no longer
      puts inodes onto a local list while writing them out.
      
      The super_block.s_dirty list is kept time-ordered.  Mappings which
      have the "oldest" ->dirtied_when are kept at sb->s_dirty.prev.
      
      So the time-based writeback (kupdate) can just bale out when it
      encounters a not-old-enough mapping, rather than walking the entire
      list.
      
      dirtied_when is set on the *first* dirtying of a mapping.  So once the
      mapping is marked dirty it strictly retains its place on s_dirty until
      it reaches the oldest end and is written out.  So frequently-dirtied
      mappings don't stay dirty at the head of the list for all time.
      
      That local inode list was there for livelock avoidance.  Livelock is
      instead avoided by looking at each mapping's ->dirtied_when.  If we
      encounter one which was dirtied after this invocation of __sync_list(),
      then just bale out - the sync functions are only required to write out
      data which was dirty at the time when they were called.
      
      Keeping the s_dirty list in time-order is the right thing to do anyway
      - so all the various writeback callers always work against the oldest
      data.
      610c5ab8
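Because the list is time-ordered, a time-based pass can stop at the first entry that is too new. The sketch below models that bail-out with a plain array ordered oldest-first; the types and names are hypothetical, the timestamps are plain integers, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dirty mapping; only the timestamp matters. */
struct dirty_entry_sketch {
    unsigned long dirtied_when; /* time of the first dirtying */
};

/*
 * entries[] is ordered oldest-first. Returns how many entries a pass would
 * write out: it stops at the first entry dirtied after `cutoff`, since every
 * later entry is newer still.
 */
static size_t count_writable(const struct dirty_entry_sketch *entries,
                             size_t n, unsigned long cutoff)
{
    size_t written = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (entries[i].dirtied_when > cutoff)
            break;
        written++;
    }
    return written;
}
```

The same loop covers both uses described in the commit: for kupdate the cutoff is "now minus the age limit", and for the sync functions it is the time at which the sync was invoked.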
    • Andrew Morton's avatar
      [PATCH] larger b_size, and misc fixlets · 2d8f24d0
      Andrew Morton authored
      Miscellany.
      
      - make the printk in buffer_io_error() sector_t-aware.
      
      - Some buffer.c cleanups from AntonA: remove a couple of !uptodate
        checks, and set a new buffer's b_blocknr to -1 in a more sensible
        place.
      
      - Make buffer_head.b_size a 32-bit quantity.  Needed for 64k pagesize
        on ia64.  Does not increase sizeof(struct buffer_head).
      2d8f24d0
    • Andrew Morton's avatar
      [PATCH] reiserfs locking fix · 943acef9
      Andrew Morton authored
      reiserfs is using b_inode_buffers and fsync_buffers_list() for
      attaching dependent buffers to its journal.  For writeout prior to
      commit.
      
      This worked OK when a global lock was used everywhere, but the locking
      is currently incorrect - try_to_free_buffers() is taking a different
      lock when detaching buffers from their "foreign" inode.  So list_head
      corruption could occur on SMP.
      
      The patch implements a reiserfs_releasepage() which holds the
      journal-wide buffer lock while it runs try_to_free_buffers(), so all
      those list_heads are protected.  The lock is held across the
      try_to_free_buffers() call as well, so nobody will attach one of this
      page's buffers to a list while try_to_free_buffers() is running.
      943acef9
    • Andrew Morton's avatar
      [PATCH] fix dirty page management · 0f9268b8
      Andrew Morton authored
      This fixes a bug in ext3 - when ext3 decides that it wants to fail its
      writepage(), it is running SetPageDirty().  But ->writepage has just put
      the page on ->clean_pages.  The page ends up dirty, on ->clean_pages
      and the normal writeback paths don't know about it any more.
      
      So run set_page_dirty() instead, to place the page back on the dirty
      list.
      
      And in move_from_swap_cache(), shuffle the page across to ->dirty_pages
      so that it's eligible for writeout.  ___add_to_page_cache() forgets to
      look at the page state when deciding which list to attach it to.
      
      All SetPageDirty() callers otherwise look OK.
      0f9268b8