1. 03 Jan, 2005 24 commits
    • [PATCH] alloc_large_system_hash: NUMA interleaving · dcee73c4
      Brent Casavant authored
      NUMA systems running current Linux kernels suffer from substantial inequities
      in the amount of memory allocated from each NUMA node during boot.  In
      particular, several large hashes are allocated using alloc_bootmem, and as
      such are allocated contiguously from a single node each.
      
      This becomes a problem for certain workloads that are relatively common on
      big-iron HPC NUMA systems.  In particular, a number of MPI and OpenMP
      applications which require nearly all available processors in the system and
      nearly all the memory on each node run into difficulties.  Due to the uneven
      memory distribution onto a few nodes, any thread on those nodes will require a
      portion of its memory be allocated from remote nodes.  Any access to those
      memory locations will be slower than local accesses, and thereby slows down
      the effective computation rate for the affected CPUs/threads.  This problem is
      further amplified if the application is tightly synchronized between threads
      (as is often the case), as the entire job can run only at the speed of the
      slowest thread.
      
      Additionally, since these hashes are usually accessed by all CPUs in the
      system, the NUMA network link on the node which hosts the hash experiences
      disproportionate traffic levels, thereby reducing the memory bandwidth
      available to that node's CPUs, and further penalizing performance of the
      threads executed thereupon.
      
      As such, it is desired to find a way to distribute these large hash
      allocations more evenly across NUMA nodes.  Fortunately current kernels do
      perform allocation interleaving for vmalloc() during boot, which provides a
      stepping stone to a solution.
      
      This series of patches enables (but does not require) the kernel to allocate
      several boot time hashes using vmalloc rather than alloc_bootmem, thereby
      causing the hashes to be interleaved amongst NUMA nodes.  In particular the
      dentry cache, inode cache, TCP ehash, and TCP bhash have been changed to be
      allocated in this manner.  Due to the limited vmalloc space on architectures
      such as i386, this behavior is turned on by default only for IA64 NUMA systems
      (though there is no reason other interested architectures could not enable it
      if desired).  Non-IA64 and non-NUMA systems continue to use the existing
      alloc_bootmem() allocation mechanism.  A boot line parameter "hashdist" can be
      set to override the default behavior.
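
      The difference between the two placement strategies can be pictured with
      a small toy model.  This is only a sketch: the function names, node count
      and page count are illustrative assumptions, and strict round-robin is an
      assumed stand-in for whatever interleaving policy vmalloc actually applies
      during boot.

```python
# Toy model: place the pages of one large hash table on NUMA nodes.
# Names and numbers here are illustrative assumptions, not kernel code.

def place_contiguous(num_pages, num_nodes, home_node=0):
    """alloc_bootmem-style: every page comes from a single node."""
    usage = [0] * num_nodes
    usage[home_node] = num_pages
    return usage

def place_interleaved(num_pages, num_nodes):
    """Interleaved placement: pages are handed out round-robin (assumed)."""
    usage = [0] * num_nodes
    for page in range(num_pages):
        usage[page % num_nodes] += 1
    return usage
```

      With 4096 pages on 16 nodes, the first function charges all 4096 pages to
      one node while the second charges 256 pages to each node.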
      
      The following two sets of example output show the uneven distribution just
      after boot, using init=/bin/sh to eliminate as much non-kernel allocation as
      possible.
      
      Without the boot hash distribution patches:
      
       Nid  MemTotal   MemFree   MemUsed      (in kB)
         0   3870656   3697696    172960
         1   3882992   3866656     16336
         2   3883008   3866784     16224
         3   3882992   3866464     16528
         4   3883008   3866592     16416
         5   3883008   3866720     16288
         6   3882992   3342176    540816
         7   3883008   3865440     17568
         8   3882992   3866560     16432
         9   3883008   3866400     16608
        10   3882992   3866592     16400
        11   3883008   3866400     16608
        12   3882992   3866400     16592
        13   3883008   3866432     16576
        14   3883008   3866528     16480
        15   3864768   3848256     16512
       ToT  62097440  61152096    945344
      
      Notice that nodes 0 and 6 have a substantially larger memory utilization
      than all other nodes.
      
      With the boot hash distribution patch:
      
       Nid  MemTotal   MemFree   MemUsed      (in kB)
         0   3870656   3789792     80864
         1   3882992   3843776     39216
         2   3883008   3843808     39200
         3   3882992   3843904     39088
         4   3883008   3827488     55520
         5   3883008   3843712     39296
         6   3882992   3843936     39056
         7   3883008   3844096     38912
         8   3882992   3843712     39280
         9   3883008   3844000     39008
        10   3882992   3843872     39120
        11   3883008   3843872     39136
        12   3882992   3843808     39184
        13   3883008   3843936     39072
        14   3883008   3843712     39296
        15   3864768   3825760     39008
       ToT  62097440  61413184    684256
      
      While not perfectly even, we can see that there is a substantial improvement
      in the spread of memory allocated by the kernel during boot.  The remaining
      unevenness may be due in part to further boot time allocations that could be
      addressed in a similar manner, but some difference is due to the somewhat
      special nature of node 0 during boot.  However the unevenness has fallen to a
      much more acceptable level (at least to a level that SGI isn't concerned
      about).
      
      The astute reader will also notice that in this example, with this patch
      approximately 256 MB less memory was allocated during boot.  This is due to
      the size limits of a single vmalloc.  More specifically, this is because the
      automatically computed size of the TCP ehash exceeds the maximum size which a
      single vmalloc can accommodate.  However this is of little practical concern as
      the vmalloc size limit simply reduces one ridiculously large allocation
      (512MB) to a slightly less ridiculously large allocation (256MB).  In practice
      machines with large memory configurations are using the thash_entries setting
      to limit the size of the TCP ehash _much_ lower than either of the
      automatically computed values.  Illustrative of the exceedingly large nature
      of the automatically computed size, SGI currently recommends that customers
      boot with thash_entries=2097152, which works out to a 32MB allocation.  In any
      case, setting hashdist=0 will allow for allocations in excess of vmalloc
      limits, if so desired.
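
      The figures quoted above can be checked with a few lines of arithmetic.
      The sketch below assumes 1 MB = 1024 kB and that the 32MB allocation is
      exactly thash_entries times a fixed per-entry size; the per-entry size is
      derived here, not stated in the text.

```python
# Figures copied from the two "ToT" rows and from the thash_entries remark.
KB_PER_MB = 1024

used_without_kb = 945344   # total MemUsed without the patches
used_with_kb = 684256      # total MemUsed with the patches

saved_mb = (used_without_kb - used_with_kb) / KB_PER_MB

thash_entries = 2097152
ehash_bytes = 32 * 1024 * 1024            # the quoted 32MB allocation
bytes_per_entry = ehash_bytes // thash_entries

print(f"boot-time difference: {saved_mb:.1f} MB")
print(f"implied bytes per ehash entry: {bytes_per_entry}")
```

      The difference comes out at roughly 255 MB, consistent with the
      "approximately 256 MB" above, and the implied entry size is 16 bytes.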
      
      Other than the vmalloc limit, great care was taken to ensure that the size of
      TCP hash allocations was not altered by this patch.  Due to slightly different
      computation techniques between the existing TCP code and
      alloc_large_system_hash (which is now utilized), some of the magic constants
      in the TCP hash allocation code were changed.  On all sizes of system (128MB
      through 64GB) that I had access to, the patched code preserves the previous
      hash size, as long as the vmalloc limit (256MB on IA64) is not encountered.
      
      There was concern that changing the TCP-related hashes to use vmalloc space
      may adversely impact network performance.  To this end the netperf set of
      benchmarks was run.  Some individual tests seemed to benefit slightly, some
      seemed to be harmed slightly, but in all cases the average difference with and
      without these patches was well within the variability I would see from run to
      run.
      
      The following are the overall netperf averages (30 10-second runs each) against
      an older kernel with these same patches.  These tests were run over loopback
      as GigE results were so inconsistent run to run both with and without these
      patches that they provided no meaningful comparison that I could discern.  I
      used the same kernel (IA64 generic) for each run, simply varying the new
      "hashdist" boot parameter to turn on or off the new allocation behavior.  In
      all cases the thash_entries value was manually specified as discussed
      previously to eliminate any variability that might result from that size
      difference.
      
      HP ZX1, hashdist=0
      ==================
      TCP_RR = 19389
      TCP_MAERTS = 6561 
      TCP_STREAM = 6590 
      TCP_CC = 9483
      TCP_CRR = 8633 
      
      HP ZX1, hashdist=1
      ==================
      TCP_RR = 19411
      TCP_MAERTS = 6559 
      TCP_STREAM = 6584 
      TCP_CC = 9454
      TCP_CRR = 8626 
      
      SGI Altix, hashdist=0
      =====================
      TCP_RR = 16871
      TCP_MAERTS = 3925 
      TCP_STREAM = 4055 
      TCP_CC = 8438
      TCP_CRR = 7750 
      
      SGI Altix, hashdist=1
      =====================
      TCP_RR = 17040
      TCP_MAERTS = 3913 
      TCP_STREAM = 4044 
      TCP_CC = 8367
      TCP_CRR = 7538 
      
      I believe the TCP_CC and TCP_CRR are the tests most sensitive to this
      particular change.  But again, I want to emphasize that even the differences
      you see above are _well_ within the variability I saw from run to run of any
      given test.
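
      To put numbers on "well within the variability", the relative change
      between hashdist=0 and hashdist=1 can be computed from the averages
      listed above.  The helper names below are illustrative; the data is
      copied from the four result blocks.

```python
# netperf averages from the tables above, keyed by machine and hashdist value.
results = {
    "HP ZX1": {
        0: {"TCP_RR": 19389, "TCP_MAERTS": 6561, "TCP_STREAM": 6590,
            "TCP_CC": 9483, "TCP_CRR": 8633},
        1: {"TCP_RR": 19411, "TCP_MAERTS": 6559, "TCP_STREAM": 6584,
            "TCP_CC": 9454, "TCP_CRR": 8626},
    },
    "SGI Altix": {
        0: {"TCP_RR": 16871, "TCP_MAERTS": 3925, "TCP_STREAM": 4055,
            "TCP_CC": 8438, "TCP_CRR": 7750},
        1: {"TCP_RR": 17040, "TCP_MAERTS": 3913, "TCP_STREAM": 4044,
            "TCP_CC": 8367, "TCP_CRR": 7538},
    },
}

def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

def compare(machine):
    """Percent change per test when going from hashdist=0 to hashdist=1."""
    base, dist = results[machine][0], results[machine][1]
    return {test: percent_change(base[test], dist[test]) for test in base}

for machine in results:
    for test, delta in compare(machine).items():
        print(f"{machine:10s} {test:11s} {delta:+.2f}%")
```

      The largest swing is TCP_CRR on the Altix, at about -2.7%; every other
      result moves by roughly 1% or less.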
      
      In addition, Jose Santos at IBM has run specSFS, which has been particularly
      sensitive to TLB issues, against these patches and saw no performance
      degradation (differences down in the noise).
      
      
      
      This patch:
      
      Modifies alloc_large_system_hash to enable the use of vmalloc to alleviate
      boottime allocation imbalances on NUMA systems.
      
      Due to limited vmalloc space on some architectures (e.g. x86), the use of
      vmalloc is enabled by default only on NUMA IA64 kernels.  There should be
      no problem enabling this change for any other interested NUMA architecture.
      Signed-off-by: Brent Casavant <bcasavan@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] collect page_states only from online cpus · d841f01f
      Alex Williamson authored
      I noticed the function __read_page_state() curiously high in a q-tools
      profile of a write to a software raid0 device.  Seems this is because we're
      checking page_states for all possible cpus and we have NR_CPUS possible
      when CONFIG_HOTPLUG_CPU=y.  The default config for ia64 is now NR_CPUS=512,
      so on a little 8-way box, this is a significant waste of time.  The patch
      below updates __read_page_state() and __get_page_state() to only count
      page_state info for online cpus.  To keep the stats consistent, the
      page_alloc notifier is updated to move page_states off of the cpu going
      offline.  On my profile, this dropped __read_page_state() back into the
      noise and boosted block write performance by 5% (as measured by spew -
      http://spew.berlios.de).
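
      The idea can be sketched with a toy model of per-CPU counters.  The
      class, its method names and the single "nr_dirty" field are illustrative
      assumptions, not the kernel's data structures; only NR_CPUS=512 is taken
      from the text above.

```python
NR_CPUS = 512  # possible CPUs, as in the default ia64 config mentioned above

class PageStates:
    """Toy per-CPU counters; not the kernel's struct page_state."""

    def __init__(self, online):
        self.online = set(online)
        self.per_cpu = [{"nr_dirty": 0} for _ in range(NR_CPUS)]

    def read_all_possible(self, field):
        """Old behaviour: walk every possible CPU."""
        return sum(cpu[field] for cpu in self.per_cpu)

    def read_online(self, field):
        """Patched behaviour: walk only the online CPUs."""
        return sum(self.per_cpu[c][field] for c in self.online)

    def cpu_offline(self, dead, target):
        """Fold the dead CPU's counters into an online CPU so totals stay consistent."""
        assert dead != target and target in self.online
        for field, value in list(self.per_cpu[dead].items()):
            self.per_cpu[target][field] += value
            self.per_cpu[dead][field] = 0
        self.online.discard(dead)
```

      On an 8-way box the online walk touches 8 entries instead of 512, and
      the fold in cpu_offline() keeps read_online() equal to the old total.
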
      Signed-off-by: Alex Williamson <alex.williamson@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab: Add more arch overrides to control object alignment · d32d6f8a
      Manfred Spraul authored
      Add ARCH_SLAB_MINALIGN and document ARCH_KMALLOC_MINALIGN: The flags allow
      the arch code to override the default minimum object alignment
      (BYTES_PER_WORD).
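
      As a rough illustration of what a minimum-alignment override does to
      object sizes, consider the sketch below.  It assumes an 8-byte word and
      assumes that the stricter of the default and the arch value wins; the
      function names are hypothetical and this is not the slab allocator's
      actual logic.

```python
BYTES_PER_WORD = 8  # assumption: a 64-bit machine word

def align_up(size, align):
    """Round size up to the next multiple of align (align is a power of two)."""
    return (size + align - 1) & ~(align - 1)

def object_size(size, arch_minalign=None):
    """Apply the default alignment, or a stricter arch override if one is given."""
    align = BYTES_PER_WORD
    if arch_minalign is not None:
        align = max(align, arch_minalign)  # assumed policy: stricter value wins
    return align_up(size, align)
```

      Under these assumptions a 17-byte object occupies 24 bytes by default and
      32 bytes when the arch asks for 16-byte alignment.
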
      Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] do_anonymous_page() use SetPageReferenced · a161d268
      Andrew Morton authored
      mark_page_accessed() is more heavyweight than we need: the page is already
      headed for the active list, so setting the software-referenced bit is
      equivalent.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mark_page_accessed() for read()s on non-page boundaries · 21adf7ac
      Miquel van Smoorenburg authored
      When reading a (partial) page from disk using read(), the kernel only marks
      the page as "accessed" if the read started at a page boundary.  This means
      that files that are accessed randomly at non-page boundaries (usually
      database style files) will not be cached properly.
      
      The patch below uses the readahead state instead.  If a page is read(), it
      is marked as "accessed" if the previous read() was for a different page,
      whatever the offset in the page.
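
      The two rules can be compared with a small model.  This is a sketch
      under simplifying assumptions: a 4096-byte page, every read fitting
      inside one page, and hypothetical function names; it only mirrors the
      rule as described here, not the kernel code.

```python
PAGE_SIZE = 4096  # assumption

def accessed_marks_old(reads):
    """Old rule: a page is marked only when the read starts on a page boundary."""
    return [pos // PAGE_SIZE for pos, _length in reads if pos % PAGE_SIZE == 0]

def accessed_marks_new(reads):
    """New rule: mark the page whenever the previous read was for a different page."""
    marks = []
    prev_page = None
    for pos, _length in reads:
        page = pos // PAGE_SIZE
        if page != prev_page:
            marks.append(page)
        prev_page = page
    return marks
```

      For reads at offsets 100, 5000, 5200 and 8192, the old rule marks only
      page 2, while the new rule marks pages 0, 1 and 2.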
      
      Testing results:
      
      
      - Boot kernel with mem=128M
      
      - create a testfile of size 8 MB on a partition. Unmount/mount.
      
      - then generate about 10 MB/sec streaming writes
      
      	for i in `seq 1 1000`
      	do
      		dd if=/dev/zero of=junkfile.$i bs=1M count=10
      		sync
      		cat junkfile.$i > /dev/null
      		sleep 1
      	done
      
      - use an application that reads 128 bytes 64000 times from a
        random offset in the 64 MB testfile.
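
      The reader application itself is not included in this message.  A
      minimal stand-in, assuming it does nothing more than seek to a random
      offset and read a fixed number of bytes, could look like this (the
      function name and parameters are hypothetical):

```python
import os
import random

def random_reads(path, count=64000, size=128, seed=None):
    """Read `size` bytes from `count` random offsets; return total bytes read."""
    rng = random.Random(seed)
    file_size = os.path.getsize(path)
    if file_size < size:
        raise ValueError("file is smaller than one read")
    total = 0
    # buffering=0 keeps Python's own buffer out of the way, so each call
    # results in a plain read() of `size` bytes.
    with open(path, "rb", buffering=0) as f:
        for _ in range(count):
            f.seek(rng.randrange(0, file_size - size + 1))
            total += len(f.read(size))
    return total
```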
      
      1. Linux 2.6.10-rc3 vanilla, no streaming writes:
      
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.22s system 5% cpu 4.456 total
      
      2. Linux 2.6.10-rc3 vanilla, streaming writes:
      
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.16s system 2% cpu 7.667 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.37s system 1% cpu 23.294 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.02s user 0.99s system 1% cpu 1:11.52 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.21s system 2% cpu 10.273 total
      
      3. Linux 2.6.10-rc3 with read-page-access.patch , streaming writes:
      
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.02s user 0.21s system 3% cpu 7.634 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.04s user 0.22s system 2% cpu 9.588 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.02s user 0.12s system 24% cpu 0.563 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.13s system 98% cpu 0.163 total
      
      With the read-page-access.patch, the kernel keeps the 8 MB
      testfile cached as expected, while without it, it doesn't.
      
      So this is useful for workloads where one smallish (wrt RAM) file is read
      randomly over and over again (like heavily used database indexes), while
      other I/O is going on.  Plain 2.6 caches those files poorly, if the app
      uses plain read().
      Signed-Off-By: Miquel van Smoorenburg <miquels@cistron.nl>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] make sure ioremap only tests valid addresses · bbd4c45d
      Dave Hansen authored
      When CONFIG_HIGHMEM=y, but ZONE_NORMAL isn't quite full, there is, of
      course, no actual memory at *high_memory.  This isn't a problem with normal
      virt<->phys translations because it's never dereferenced, but
      CONFIG_NONLINEAR is a bit more finicky.  So, don't do virt_to_phys() to
      non-existent addresses.
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kill off highmem_start_page · 422e43d4
      Dave Hansen authored
      People love to do comparisons with highmem_start_page.  However, where
      CONFIG_HIGHMEM=y and there is no actual highmem, there's no real page at
      *highmem_start_page.
      
      That's usually not a problem, but CONFIG_NONLINEAR is a bit more strict and
      catches the bogus address translations.
      
      There are about a gillion different ways to find out if a 'struct page' is
      highmem or not.  Why not just check page_flags?  Just use PageHighMem()
      wherever there used to be a highmem_start_page comparison.  Then, kill off
      highmem_start_page.
      
      This removes more code than it adds, and gets rid of some nasty
      #ifdefs in .c files.
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: overcommit updates · ea86630e
      Andries E. Brouwer authored
      Alan made overcommit mode 2 and it doesn't work at all.  A process passing
      the limit often does so at a moment of stack extension, and is killed by a
      segfault, which is no better than being OOM-killed.
      
      Another problem is that close to the edge no other processes can be
      started, so that a sysadmin has problems logging in and investigating.
      
      Below is a patch that does 3 things:
      
      (1) It reserves a reasonable amount of virtual stack space (amount
          randomly chosen, no guarantees given) when the process is started, so
          that the common utilities will not be killed by segfault on stack
          extension.
      
      (2) It reserves a reasonable amount of virtual memory for root, so that
          root can do things when the system is out-of-memory
      
      (3) It limits a single process to 97% of what is left, so that also an
          ordinary user is able to use getty, login, bash, ps, kill and similar
          things when one of her processes got out of control.
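
      Points (2) and (3) can be pictured as a simple admission check.  The
      sketch below is a toy: the function, its parameters and the size of the
      root reserve are hypothetical, and it treats a single request as the
      process's whole demand, which is a simplification.

```python
def may_allocate(request, committed, commit_limit, is_root, root_reserve):
    """Return True if `request` pages fit under strict overcommit (mode 2)."""
    left = commit_limit - committed
    if not is_root:
        left -= root_reserve        # (2) keep a slice back for root
        left = left * 97 // 100     # (3) cap one process at 97% of what is left
    return request <= max(left, 0)
```

      With a limit of 1100 pages, nothing committed and a 100-page root
      reserve, an ordinary user may take up to 970 pages, while root may still
      take all 1100.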
      
      Since the current overcommit mode 2 is not really useful, I did not give
      this a new number.
      
      The patch is just for playing, not to be applied by Linus.  But, Andrew, I
      hope that you would be willing to put this in -mm so that people can
      experiment.  Of course it only does something if one sets overcommit mode
      to 2.
      
      The past month I have pressured people asking for feedback, and now have
      about a dozen reports, mostly positive, one very positive.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mempolicy optimisation · 182e0eba
      Andrea Arcangeli authored
      Some optimizations in mempolicy.c: avoid rebalancing the tree while
      destroying it, break out of loops early, and skip checks for invariant
      conditions in the replace operation.
      Signed-off-by: Andrea Arcangeli <andrea@novell.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Simplified readahead congestion control · 250c01d0
      Ram Pai authored
      Reinstate the feature wherein readahead will be bypassed if the underlying
      queue is read-congested.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Simplified readahead · 6f734a1a
      Steven Pratt authored
      With Ram Pai <linuxram@us.ibm.com>
      
      - request size is now passed into page_cache_readahead.  This allows the
        removal of the size averaging code in the current readahead logic.
      
      - readahead rampup is now faster  (especially for larger request sizes)
      
      - No longer "slow read path".  Readahead is turned off at first random access,
        turned back on at first sequential access.
      
      - Code now handles thrashing, slowly reducing readahead window until
        thrashing stops, or min size reached.
      
      - Returned to old behavior where first access is assumed sequential only if
        at offset 0.
      
      - designed to handle larger (1M or above) window sizes efficiently
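
      The on/off behaviour described in the list above can be sketched as a
      tiny state machine.  The class and its fields are hypothetical, and the
      model deliberately ignores window sizing, ramp-up and thrashing
      handling.

```python
class ReadaheadState:
    """Toy model: tracks only whether readahead is currently switched on."""

    def __init__(self):
        self.enabled = False
        self.next_expected = None  # page a sequential read would start at next

    def access(self, page, nr_pages):
        """Record a read of nr_pages starting at page; return whether readahead is on."""
        if self.next_expected is None:
            # First access is assumed sequential only if it starts at offset 0.
            self.enabled = (page == 0)
        else:
            # Off at the first random access, back on at the first sequential one.
            self.enabled = (page == self.next_expected)
        self.next_expected = page + nr_pages
        return self.enabled
```

      A read at page 0 followed by one at page 4 keeps readahead on, a jump to
      page 100 turns it off, and a following read at page 101 turns it back on.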
      
      
      Benchmark results:
      
      machine 1: 8 way pentiumIV 1GB memory, tests run to 36GB SCSI disk
      (Similar results were seen on a 1 way 866MHz box with IDE disk.)
      
      TioBench:
      
      tiobench.pl --dir /mnt/tmp --block 4096 --size 4000 --numruns 2 --threads 1(4,16,64)
      
      4k request size sequential read results in MB/sec
      
        Threads         2.6.9    w/patches    %diff         diff
    • [PATCH] mm: teach kswapd about higher order areas · d4cf1012
      Nick Piggin authored
      Teach kswapd to free memory on behalf of higher order allocators.  This
      could be important for higher order atomic allocations because they
      otherwise have no means to free the memory themselves.
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: higher order watermarks · 206ca74e
      Nick Piggin authored
      Move the watermark checking code into a single function.  Extend it to
      account for the order of the allocation and the number of free pages that
      could satisfy such a request.
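
      One plausible shape for an order-aware watermark check is sketched
      below.  The function name, the per-order free list layout and, in
      particular, the halving of the threshold at each order are assumptions
      made for illustration; the message above does not spell out the exact
      rule.

```python
def watermark_ok(nr_free, order, min_pages):
    """nr_free[o] is the number of free blocks of order o in the zone."""
    # Free pages that would remain, pessimistically, after the request.
    free_pages = sum(count << o for o, count in enumerate(nr_free))
    free_pages -= (1 << order) - 1
    if free_pages <= min_pages:
        return False
    for o in range(order):
        # Blocks smaller than the request cannot satisfy it, so discount them.
        free_pages -= nr_free[o] << o
        # Assumed policy: require fewer spare pages at each higher order.
        min_pages >>= 1
        if free_pages <= min_pages:
            return False
    return True
```

      With 100 free order-0 pages and a threshold of 50, an order-0 request
      passes but an order-2 request fails, because none of the free memory
      comes in large enough blocks.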
      
      From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
      
      Fix typo in Nick's kswapd-high-order awareness patch
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: keep count of free areas · f86789bc
      Nick Piggin authored
      Keep track of the number of free pages of each order in the buddy allocator.
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] CS461x gameport code isn't being included in build · 7aee2fc8
      Ron Murray authored
      With Cal Peake <cp@absolutedigital.net>
      
      I've found a typo in drivers/input/gameport/Makefile in kernel 2.6.9 which
      effectively prevents the CS461x gameport code from being included.
      Signed-off-by: Ron Murray <rjmx@rjmx.net>
      Signed-off-by: Cal Peake <cp@absolutedigital.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] vmscan: total_scanned fix · aa0baf35
      Andrew Morton authored
      We haven't been incrementing local variable total_scanned since the
      scan_control stuff went in.  That broke kswapd throttling.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Allow disabling quota messages to console · cdd39d34
      Jan Kara authored
      Allow disabling of quota messages to console (they can disturb other
      output).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix of quota deadlock on pagelock: reiserfs · 04a6c897
      Jan Kara authored
      Implement quota journaling and quota reading and writing functions for
      reiserfs.  Also solves several other deadlocks possible for reiserfs due to
      the lock inversion on journal_begin and quota locks.
      
      From: Vladimir Saveliev <vs@namesys.com>
      
      When CONFIG_QUOTA is defined reiserfs's finish_unfinished sets and clears
      MS_ACTIVE bit in s_flags field of super block.  If that bit was set already
      it should not be set.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix of quota deadlock on pagelock: ext3 · 98887122
      Jan Kara authored
      Implementation of quota reading and writing functions for ext3.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix of quota deadlock on pagelock: ext2 · 6b394613
      Jan Kara authored
      Implementation of quota reading and writing functions for ext2.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] quota umount race fix · 84f308c2
      Jan Kara authored
      Fix possible races between umount and quota on/off.
      
      Finally I decided to take a reference to vfsmount during vfs_quota_on() and
      to drop it after the final cleanup in the vfs_quota_off().  This way we
      should be guarded against umount at all times.  The old code, which used
      filp_open() for opening quota files, was protected the same way.  I was also
      thinking about other ways of protection, but there would always be a window
      (provided I don't want to play much with namespace locks) where
      vfs_quota_on() could be called while umount() is in progress resulting in
      the "Busy inodes after unmount" messages...
      
      Get a reference to vfsmount during quotaon() so that we are guarded against
      umount (as was the old code using filp_open()).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix of quota deadlock on pagelock: quota core · cf684334
      Jan Kara authored
      The four patches in this series fix quota deadlocks on PageLock (the
      problem was lock inversion on PageLock and transaction start - quota code
      needed to first start a transaction and then write the data which subsequently
      needed acquisition of PageLock while the standard ordering - PageLock first
      and transaction start later - was used e.g.  by pdflush).  They implement a
      new way of quota access to disk: Every filesystem that would like to implement
      quotas now has to provide quota_read() and quota_write() functions.  These
      functions must obey quota lock ordering (in particular they should not take
      PageLock inside a transaction).
      
      The first patch implements the changes in the quota core, the other three
      patches implement needed functions in ext2, ext3 and reiserfs.  The patch for
      reiserfs also fixes several other lock inversion problems (similar to those ext3
      had) and implements the journaled quota functionality (which comes almost for
      free after the locking fixes...).
      
      The quota core patch makes quota support in other filesystems (except XFS
      which implements everything on its own ;)) nonfunctional (quotaon() will refuse
      to turn on quotas on them).  When the patches get reasonably wide testing and
      it seems that no major changes will be needed, I can make fixes also for
      the other filesystems (JFS, UDF, UFS).
      
      This patch:
      
      The patch implements the new way of quota io in the quota core.  Every
      filesystem wanting to support quotas has to provide functions quota_read()
      and quota_write() obeying quota locking rules.  As the writes and reads
      bypass the pagecache there is some ugly stuff ensuring that userspace can
      see all the data after quotaoff() (or Q_SYNC quotactl).  In future I plan
      to make quota files inaccessible from userspace (with the exception of
      quotacheck(8) which will take care about the cache flushing and such stuff
      itself) so that this synchronization stuff can be removed...
      
      The rewrite of the quota core. Quota no longer uses the filesystem read() and
      write() functions, to avoid possible deadlocks on PageLock. From now on every
      filesystem supporting quotas must provide functions quota_read() and
      quota_write() which obey the quota locking rules (e.g. they cannot acquire the
      PageLock).
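
      The contract between the quota core and a filesystem can be sketched as
      follows.  Only the hook names quota_read() and quota_write() and the
      rule that quotaon() refuses unconverted filesystems come from the text
      above; the classes, signatures and in-memory "quota file" are
      hypothetical stand-ins.

```python
class QuotaUnsupported(Exception):
    """Raised when a filesystem lacks the hooks the quota core requires."""

class ConvertedFs:
    """Hypothetical filesystem that provides both required hooks."""

    def __init__(self):
        self.quota_file = bytearray()

    def quota_read(self, off, length):
        return bytes(self.quota_file[off:off + length])

    def quota_write(self, off, data):
        end = off + len(data)
        if len(self.quota_file) < end:
            # Grow the file with zeros, as a sparse write would.
            self.quota_file.extend(b"\0" * (end - len(self.quota_file)))
        self.quota_file[off:end] = data
        return len(data)

class LegacyFs:
    """Hypothetical filesystem that has not been converted yet."""

def quota_on(fs):
    """Refuse to enable quotas unless the filesystem supplies both hooks."""
    for hook in ("quota_read", "quota_write"):
        if not callable(getattr(fs, hook, None)):
            raise QuotaUnsupported(f"missing {hook}()")
    return True
```

      In this model quota_on(ConvertedFs()) succeeds, while quota_on(LegacyFs())
      raises QuotaUnsupported, mirroring the refusal described above.
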
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix reiserfs quota debug messages · 6ffc2881
      Jan Kara authored
      Attached patch fixes debug messages of quota code in reiserfs so that they
      compile.  Chris Mason agreed to the patch.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Expose reiserfs_sync_fs() · 3bc5bf4e
      Jan Kara authored
      Attached patch exposes reiserfs_sync_fs().  This call is needed by the new
      quota code to write data to disk on quotaoff so that userspace can see them
      afterwards.  Chris Mason agrees with the patch.
      
      Make reiserfs provide the sync_fs() function so that the quota code
      has a way to reliably force a transaction to disk.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 02 Jan, 2005 16 commits