Commits · bd15d114048e2b9ef632ba5771f15b61de969a56 · Kirill Smelkov / linux

06 Feb, 2003 40 commits

Merge http://linux-scsi.bkbits.net/scsi-for-linus-2.5 · bd15d114
Linus Torvalds authored Feb 05, 2003
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
bd15d114
Merge raven.il.steeleye.com:/home/jejb/BK/scsi-misc-2.5 · 35766eb7
James Bottomley authored Feb 05, 2003
```
into raven.il.steeleye.com:/home/jejb/BK/scsi-for-linus-2.5
```
35766eb7

[PATCH] coding style updates for scsi_lib.c · 78ef52ec

Christoph Hellwig authored Feb 05, 2003

I just couldn't see the mess anymore..  Nuke the ifdefs and use sane
variable names.  Some more small nitpicks but no behaviour changes at
all.

78ef52ec

[PATCH] 2.5.59 add two help texts to drivers_scsi_Kconfig · baaf76dd

Rusty Russell authored Feb 05, 2003

From:  Steven Cole <elenstev@mesatop.com>

  Here are some help texts from 2.4.21-pre3 Configure.help which are
  needed in 2.5.59 drivers/scsi/Kconfig.

  Steven

baaf76dd

[PATCH] [patch, 2.5] scsi_qla1280.c free on error path · f8646d20

Rusty Russell authored Feb 05, 2003

From:  Marcus Alanen <maalanen@ra.abo.fi>

  Remove check_region in favour of request_region. Free resources
  properly on error path. Horribly subtle ioremap/iounmap lurks here I
  think, in qla1280_pci_config(), which the below patch should take care
  of.

  I'm wondering if there couldn't / shouldn't be a better way to
  allocate resources. Obviously lots of drivers have broken error paths.
  Is this even necessary?

  Marcus


  #
  # create_patch: qla1280_release_on_error_path-2002-12-08-A.patch
  # Date: Sun Dec  8 22:32:33 EET 2002
  #

f8646d20

[SCSI] Remove host_active · d30a24be
Christoph Hellwig authored Feb 05, 2003
```
It isn't used anywhere anymore
```
d30a24be
Merge http://linux-acpi.bkbits.net/linux-acpi · 26d987a7
Linus Torvalds authored Feb 05, 2003
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
26d987a7
ACPI: Enable compilation w/o cpufreq · 32dbc81b
Andy Grover authored Feb 05, 2003

32dbc81b
[PATCH] quota memleak · a2dd1464
Randy Dunlap authored Feb 05, 2003
```
The Stanford Checker found a memleak.
```
a2dd1464
Merge bk://kernel.bkbits.net/vojtech/x86-64 · d0d3f1f0
Linus Torvalds authored Feb 05, 2003
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
d0d3f1f0
x86-64: Minor fixes to make the kernel compile and remove warnings. · 4a69c79b
Vojtech Pavlik authored Feb 06, 2003

4a69c79b

[PATCH] Fix signed use of i_blocks in ext3 truncate · 9a3e1a96

Andrew Morton authored Feb 05, 2003

Patch from "Stephen C. Tweedie" <sct@redhat.com>

Fix "h_buffer_credits<0" assert failure during truncate.

The bug occurs when the "i_blocks" count in the file's inode overflows
past 2^31. That works fine most of the time, because i_blocks is an
unsigned long, and should go up to 2^32; but there's a place in truncate
where ext3 calculates the size of the next transaction chunk for the
delete, and that mistakenly uses a signed long instead. Because the
huge i_blocks gets cast to a negative value, ext3 does not reserve
enough credits for the transaction and the above error results.

This is usually only possible on filesystems corrupted for other
reasons, but it is reproducible if you create a single, non-sparse file
larger than 1TB on ext3 and then try to delete it.
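
The sign flip can be reproduced outside the kernel. The following is a minimal sketch of the arithmetic only, not the ext3 code: it assumes a 32-bit `long` (as on ia32) and that `i_blocks` counts 512-byte units, and the helper name `as_signed_long32` is hypothetical.

```python
import ctypes

SECTOR_SIZE = 512  # assumption: i_blocks counts 512-byte units


def as_signed_long32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit long."""
    return ctypes.c_int32(value).value


# A fully allocated (non-sparse) file just over 1 TB needs more than 2^31 units.
one_tb = 1 << 40
i_blocks = one_tb // SECTOR_SIZE + 8   # 2^31 + 8, still fits an unsigned 32-bit long

print(i_blocks)                    # 2147483656
print(as_signed_long32(i_blocks))  # -2147483640
```

Any credit calculation that starts from the second value sees a negative block count, which is consistent with the too-small reservation described above.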

9a3e1a96

[PATCH] CPU Hotplug mm/slab.c CPU_UP_CANCELED fix · 4f1cb3ff

Andrew Morton authored Feb 05, 2003

Patch from Manfred Spraul.

Fixes a bug which was exposed by Zwane's hotplug CPU work.  The
cache_cache.array pointer is initially given a temp bootstrap area, which is
later converted over to the final value after the CPU is brought up.

But if slab is enhanced to permit cancellation of a CPU bringup, this pointer
ends up pointing at stale memory.  So reinitialise it by hand when
kmem_cache_init() is run.

4f1cb3ff

[PATCH] spinlock debugging on uniprocessors · ecd2d220

Andrew Morton authored Feb 05, 2003

Patch from Manfred Spraul <manfred@colorfullife.com>

This enables spinlock debugging on uniprocessor builds, under
CONFIG_DEBUG_SPINLOCK.

The reason I want this is that one day we'll need to pull out the debugging
support from the timer code which detects uninitialised timers.  And once
that has gone, uniprocessor developers and testers have no way of detecting
uninitialised timers - there will be mysterious deadlocks on SMP machines.
And there will surely be more uninitialised timers.

The patch also removes the last pieces of the support for including
<asm/spinlock.h> directly.  Doesn't work since (IIRC) 2.3.x

ecd2d220

[PATCH] mm/mremap.c whitespace cleanup · 32738fbf
Andrew Morton authored Feb 05, 2003
```
- Not everyone uses 160-column xterms.

- Coding style consistency
```
32738fbf

[PATCH] hugetlb mremap fix · df79ea40

Andrew Morton authored Feb 05, 2003

If you attempt to perform a relocating 4k-aligned mremap and the new address
for the map lands on top of a hugepage VMA, do_mremap() will attempt to
perform a 4k-aligned unmap inside the hugetlb VMA.  The hugetlb layer goes
BUG.

Fix that by trapping the poorly-aligned unmap attempt in do_munmap().
do_mremap() will then fall through without having done anything to the place
where it tests for a hugetlb VMA.

It would be neater to perform these checks on entry to do_mremap(), but that
would incur another VMA lookup.

Also, if you attempt to perform a 4k-aligned and/or sized munmap() inside a
hugepage VMA the same BUG happens.  This patch fixes that too.

This all means that an mremap attempt against a hugetlb area will fail, but
only after having unmapped the source pages.  That's a bit messy, but
supporting hugetlb mremap doesn't seem worth it, and completely disallowing
it will add overhead to normal mremaps.

df79ea40

[PATCH] Fix hugetlb_vmtruncate_list() · 8a1335e9

Andrew Morton authored Feb 05, 2003

This function is quite wrong - has an "=" where it should have a "-" and
confuses PAGE_SIZE and HPAGE_SIZE in its address and file offset arithmetic.

8a1335e9

[PATCH] ia32 hugetlb cleanup · a20d5200
Andrew Morton authored Feb 05, 2003
```
- whitespace

- remove unneeded spinlocking no-op.
```
a20d5200

[PATCH] Fix hugetlbfs faults · 8b5111ec

Andrew Morton authored Feb 05, 2003

If the underlying mapping was truncated and someone references the
now-unmapped memory the kernel will enter handle_mm_fault() and will start
instantiating PAGE_SIZE pte's inside the hugepage VMA.  Everything goes
generally pear-shaped.

So trap this in handle_mm_fault().  It adds no overhead to non-hugepage
builds.

Another possible fix would be to not unmap the huge pages at all in truncate
- just anonymise them.

But I think we want full ftruncate semantics for hugepages for management
purposes.

8b5111ec

[PATCH] Give all architectures a hugetlb_nopage(). · 08a1cc4e

Andrew Morton authored Feb 05, 2003

If someone maps a hugetlbfs file, then truncates it, then references the part
of the mapping outside the truncation point, they take a pagefault and we end
up hitting hugetlb_nopage().

We want to prevent this from ever happening. This patch just makes sure that
all architectures have a goes-BUG hugetlb_nopage() to trap it.

08a1cc4e

[PATCH] hugetlbfs cleanups · 3cc33271

Andrew Morton authored Feb 05, 2003

- Remove quota code.

- Remove extraneous copy-n-paste code from truncate: that's only for
  physically-backed filesystems.

- Whitespace changes.

3cc33271

[PATCH] hugetlbfs i_size fixes · 05732657

Andrew Morton authored Feb 05, 2003

We're expanding hugetlbfs i_size in the wrong place.  If someone attempts to
mmap more pages than are available, i_size is updated to reflect the
attempted mapping size.

So set i_size only when pages are successfully added to the mapping.

i_size handling at truncate time is still a bit wrong - if the mapping has
pages at (say) page offset 100-200 and the mapping is truncated to (say) page
offset 50, i_size should be set to zero.  But it is instead set to
50*HPAGE_SIZE.  That's harmless.

05732657

[PATCH] hugetlbfs: fix truncate · 136963d1

Andrew Morton authored Feb 05, 2003

- Opening a hugetlbfs file O_TRUNC calls the generic vmtruncate() functions
  and nukes the kernel.

  Give S_ISREG hugetlbfs files an inode_operations, and hence a setattr
  which knows how to handle these files.

- Don't permit the user to truncate hugetlbfs files to sizes which are not
  a multiple of HPAGE_SIZE.

- We don't support expanding in ftruncate(), so remove that code.
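
The size rules in the last two items can be sketched as a single check. This is an illustrative model, not the hugetlbfs setattr code: the function name `hugetlbfs_truncate` is hypothetical, `HPAGE_SIZE` is assumed to be 4 MB (ia32 without PAE), and returning `-EINVAL` for both rejections is an assumption.

```python
import errno

HPAGE_SIZE = 4 * 1024 * 1024   # assumption: 4 MB huge pages


def hugetlbfs_truncate(current_size, new_size):
    """Return the new file size, or a negative errno if the request is rejected."""
    if new_size & (HPAGE_SIZE - 1):
        return -errno.EINVAL    # not a multiple of HPAGE_SIZE
    if new_size > current_size:
        return -errno.EINVAL    # expanding through ftruncate() is not supported
    return new_size


print(hugetlbfs_truncate(8 * HPAGE_SIZE, 2 * HPAGE_SIZE) // HPAGE_SIZE)   # 2
print(hugetlbfs_truncate(8 * HPAGE_SIZE, 4096) == -errno.EINVAL)          # True
```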

136963d1

[PATCH] get_unmapped_area for hugetlbfs · 8ca8cd5b

Andrew Morton authored Feb 05, 2003

Having to specify the mapping address is a pain.  Give hugetlbfs files a
file_operations.get_unmapped_area().

The implementation is in hugetlbfs rather than in arch code because it's
probably common to several architectures.  If the architecture has special
needs it can define HAVE_ARCH_HUGETLB_UNMAPPED_AREA and go it alone.  Just
like HAVE_ARCH_UNMAPPED_AREA.

8ca8cd5b

[PATCH] convert hugetlb code to use compound pages · b3a656b6

Andrew Morton authored Feb 05, 2003

The odd thing about hugetlb is that it maintains its own freelist of pages.
And it has to do that, else it would trivially run out of pages due to buddy
fragmentation.

So we don't want callers of put_page() to be passing those pages
to __free_pages_ok() on the final put().

So hugetlb installs a destructor in the compound pages to point at
free_huge_page(), which knows how to put these pages back onto the free list.

Also, don't mark hugepages as all PageReserved any more. That's preventing
callers from doing proper refcounting. Any code which does a user pagetable
walk and hits part of a hugepage will now handle it transparently.

b3a656b6

[PATCH] Infrastructure for correct hugepage refcounting · eefb08ee

Andrew Morton authored Feb 05, 2003

We currently have a problem when things like ptrace, futexes and direct-io
try to pin user pages.  If the user's address is in a huge page we're
elevating the refcount of a constituent 4k page, not the head page of the
high-order allocation unit.

To solve this, a generic way of handling higher-order pages has been
implemented:

- A higher-order page is called a "compound page".  Chose this because
  "huge page", "large page", "super page", etc all seem to mean different
  things to different people.

- The first (controlling) 4k page of a compound page is referred to as the
  "head" page.

- The remaining pages are tail pages.

All pages have PG_compound set.  All pages have their lru.next pointing at
the head page (even the head page has this).

The head page's lru.prev, if non-zero, holds the address of the compound
page's put_page() function.

The order of the allocation is stored in the first tail page's lru.prev.
This is only for debug at present.  This usage means that zero-order pages
may not be compound.
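
The head/tail relationships described above can be modelled in a few lines. This is a toy model rather than kernel code: the `Page` class and the `make_compound()`, `head_of()`, `get_page()` and `put_page()` functions below are simplified stand-ins, and `lru_next`/`lru_prev` are plain attributes instead of list pointers.

```python
class Page:
    def __init__(self):
        self.count = 0          # reference count
        self.compound = False   # models PG_compound
        self.lru_next = None    # points at the head page when compound
        self.lru_prev = None    # head: destructor; first tail: allocation order


def make_compound(order, destructor=None):
    """Build 2**order pages wired up as the commit message describes."""
    assert order > 0, "zero-order pages may not be compound"
    pages = [Page() for _ in range(1 << order)]
    head = pages[0]
    for page in pages:
        page.compound = True
        page.lru_next = head        # even the head page points at itself
    head.lru_prev = destructor      # optional destructor run on the final put
    pages[1].lru_prev = order       # debug only
    return pages


def head_of(page):
    return page.lru_next if page.compound else page


def get_page(page):
    head_of(page).count += 1        # pin the head, not the 4k constituent


def put_page(page):
    head = head_of(page)
    head.count -= 1
    if head.count == 0 and head.lru_prev is not None:
        head.lru_prev(head)         # e.g. return the page to a private freelist


freed = []
pages = make_compound(order=2, destructor=freed.append)
get_page(pages[3])                       # pin through a tail page
print(pages[0].count, pages[3].count)    # 1 0
put_page(pages[3])
print(freed == [pages[0]])               # True
```

The point of the indirection is that a caller holding any constituent page ends up adjusting one shared count on the head page.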

The above relationships are established for _all_ higher-order pages in the
page allocator.  Which has some cost, but not much - another atomic op during
fork(), mainly.

This functionality is only enabled if CONFIG_HUGETLB_PAGE, although it could
be turned on permanently.  There's a little extra cost in get_page/put_page.

These changes do not preclude adding compound pages to the LRU in the future
- we can add a new page flag to the head page and then move all the
additional data to the first tail page's lru.next, lru.prev, list.next,
list.prev, index, private, etc.

eefb08ee

[PATCH] give hugetlbfs a set_page_dirty a_op · 6725839b

Andrew Morton authored Feb 05, 2003

Seems that nobody has tested direct IO into hugetlb pages yet.  The VFS gets
upset about running set_page_dirty() against a non-uptodate page.

So give hugetlbfs inodes a private no-op ->set_page_dirty() to isolate them
from all that.

6725839b

[PATCH] pte_chain_alloc fixes · afcde6ef

Andrew Morton authored Feb 05, 2003

There are several places in which the return value from pte_chain_alloc() is
not being checked, and one place in which a GFP_KERNEL allocation is
happening inside a spinlock.

afcde6ef

[PATCH] loop inefficiency fix · a1329fe8

Andrew Morton authored Feb 05, 2003

Patch from Hugh Dickins <hugh@veritas.com>

The loop driver's loop over elements of bi_io_vec is in lo_send and
lo_receive: iterating that same transfer bi_vcnt times at the level above is,
er, excessive.  (And no need to increment bi_idx here.)

a1329fe8

[PATCH] default_idle micro-optimisation · 87afb5f6

Andrew Morton authored Feb 05, 2003

Patch from rwhron@earthlink.net

Micro-optimization of default_idle from -aa.  current_cpu_data.hlt_works_ok
is only false for some old 386/486 pcs.

87afb5f6

[PATCH] Optimise follow_page() for page-table-based hugepages · 1f1921fc

Andrew Morton authored Feb 05, 2003

ia32 and others can determine a page's hugeness by inspecting the pmd's value
directly.  No need to perform a VMA lookup against the user's virtual
address.

This patch ifdef's away the VMA-based implementation of
hugepage-aware-follow_page for ia32 and replaces it with a pmd-based
implementation.

The intent is that architectures will implement one or the other.  So the architecture either:

1: Implements hugepage_vma()/follow_huge_addr(), and stubs out
   pmd_huge()/follow_huge_pmd() or

2: Implements pmd_huge()/follow_huge_pmd(), and stubs out
   hugepage_vma()/follow_huge_addr()

1f1921fc

[PATCH] Fix futexes in huge pages · f93fcfa9

Andrew Morton authored Feb 05, 2003

Using a futex in a large page causes a kernel lockup in __pin_page() -
because __pin_page's page revalidation uses follow_page(), and follow_page()
doesn't work for hugepages.

The patch fixes up follow_page() to return the appropriate 4k page for
hugepages.

This incurs a vma lookup for each follow_page(), which is considerable
overhead in some situations.  We only _need_ to do this if the architecture
cannot determine a page's hugeness from the contents of the PMD.

So this patch is a "reference" implementation for, say, PPC BAT-based
hugepages.

f93fcfa9

[PATCH] ia32 IRQ distribution rework · 08f16f8f

Andrew Morton authored Feb 05, 2003

Patch from "Kamble, Nitin A" <nitin.a.kamble@intel.com>

Hello All,

  We were looking at the performance impact of the IRQ routing from
the 2.5.52 Linux kernel. This email includes some of our findings
about the way the interrupts are getting moved in the 2.5.52 kernel.
Also there is discussion and a patch for a new implementation. Let
me know what you think at nitin.a.kamble@intel.com

Current implementation:
======================
We have found that the existing implementation works well on IA32
SMP systems with light load of interrupts. Also we noticed that it
is not working that well under heavy interrupt load conditions on
these SMP systems. The observations are:

* Interrupt load of each IRQ is getting balanced on CPUs independent
of load of other IRQs. Also the current implementation moves the
IRQs randomly. This works well when the interrupt load is light. But
we start seeing imbalance of interrupt load with existence of
multiple heavy interrupt sources. Frequently multiple heavily loaded
IRQs get moved to a single CPU while other CPUs stay very lightly
loaded. To achieve a good interrupts load balance, it is important to
consider the load of all the interrupts together.
    This further can be explained with an example of 4 CPUs and 4
heavy interrupt sources. With the existing random movement approach,
the chance of each of these heavy interrupt sources moving to separate
CPUs is: (4/4)*(3/4)*(2/4)*(1/4) = 3/32. It means 29/32 = 90.6% of
the time the situation is, some CPUs are very lightly loaded and some
are loaded with multiple heavy interrupts. This causes the interrupt
load imbalance and results in less performance. In a case of 2 CPUs
and 2 heavily loaded interrupt sources, this imbalance happens
1/2 = 50% of the times. This issue becomes more and more severe with
increasing number of heavy interrupt sources.

* Another interesting observation is: We cannot see the imbalance
of the interrupt load from /proc/interrupts. (/proc/interrupts shows
the cumulative load of interrupts on all CPUs.) If the interrupt load
is imbalanced and this imbalance is getting rotated among CPUs
continuously, then /proc/interrupts will still show that the interrupt
load is going to processors very evenly. Currently at the frequency
(HZ/50) at which IRQs are moved across CPUs, it is not possible to
see any interrupt load imbalance happening.

* We have also found that, in certain cases the static IRQ binding
performs better than the existing kernel distribution of interrupt
load. The reason is, in well-balanced interrupt load situations,
these interrupts are unnecessarily getting frequently moved across
CPUs. This adds an extra overhead; also it takes off the CPU cache
warmth benefits.
  This came out from the performance measurements done on a 4-way HT
(8 logical processors) Pentium 4 Xeon system running 8 copies of
netperf. The 4 NICs in the system taking different IRQs generated
sizable interrupt load with the help of connected clients.

Here the netperf transactions/sec throughput numbers observed are:

IRQs nicely manually bound to CPUs: 56.20K
The current kernel implementation of IRQ movement: 50.05K
 -----------------------
 The static binding of IRQs has performed 12.28% better than the
current IRQ movement implemented in the kernel.

* The current implementation does not distinguish siblings from the
HT (Hyper-Threading(tm)) enabled CPUs. It will be beneficial to
balance the interrupt load with respect to processor packages first,
and then among logical CPUs inside processor packages.
  For example if we have 2 heavy interrupt sources and 2 processor
packages (4 logical CPUs); Assigning both the heavy interrupt sources
in different processor packages is better, it will use different
execution resources from the different processor packages.
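
The random-placement odds from the 4-CPU and 2-CPU examples in the first observation can be checked with exact fractions. This is a small sketch under the stated model (each heavy source lands on a CPU uniformly at random, independently of the others); the function name `p_all_separate` is hypothetical.

```python
from fractions import Fraction


def p_all_separate(num_cpus, num_irqs):
    """Probability that num_irqs sources, each placed uniformly at random
    and independently, all land on different CPUs."""
    p = Fraction(1)
    for i in range(num_irqs):
        p *= Fraction(num_cpus - i, num_cpus)
    return p


print(p_all_separate(4, 4))       # 3/32
print(1 - p_all_separate(4, 4))   # 29/32
print(1 - p_all_separate(2, 2))   # 1/2
```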



New revised implementation:
==========================
We also have been working on a new implementation. The following
points are in main focus.

* At any moment heavily loaded IRQs are distributed to different
CPUs to achieve as much balance as possible.

* Lightly loaded interrupt sources are ignored from the load
balancing, as they do not cause considerable imbalance.

* When the heavy interrupt sources are balanced, they are not moved
around. This also helps in keeping the CPU caches warm.

* It has been made HT aware. While distributing the load, the load
on a processor package to which the logical CPUs belong to is also
considered.

* In situations with few (fewer than num_cpus) heavy interrupt
sources, it is not possible to balance them evenly. In such case
the existing code has been reused to move the interrupts. The
randomness from the original code has been removed.

* The time interval for redistribution has been made flexible. It
varies as the system interrupt load changes.

* A new kernel_thread is introduced to do the load balancing
calculations for all the interrupt sources. It keeps the balanace_maps
ready for interrupt handlers, keeping the overhead in the interrupt
handling to minimum.

* It allows the disabling of the IRQ distribution from the boot loader
command line, if anybody wants to do it for any reason.

* The algorithm also takes into account the static binding of
interrupts to CPUs that user imposes from the
/proc/irq/{n}/smp_affinity interface.
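
To make the first few goals concrete, here is a greedy placement sketch. It is not the algorithm in the patch, only an illustration of "ignore light sources, spread heavy ones, consider packages before logical CPUs"; the function `balance`, its parameters and the threshold value are all hypothetical.

```python
def balance(irq_loads, packages, heavy_threshold):
    """Place each heavy IRQ on the least-loaded logical CPU of the
    least-loaded package; light IRQs are left alone.

    irq_loads: dict irq -> load; packages: list of lists of logical CPU ids.
    Returns dict irq -> cpu for the heavy IRQs only.
    """
    cpu_load = {cpu: 0 for pkg in packages for cpu in pkg}
    placement = {}
    heavy = sorted(
        (irq for irq, load in irq_loads.items() if load >= heavy_threshold),
        key=lambda irq: irq_loads[irq],
        reverse=True,
    )
    for irq in heavy:
        # Pick the package first, then a logical CPU inside it.
        pkg = min(packages, key=lambda p: sum(cpu_load[c] for c in p))
        cpu = min(pkg, key=lambda c: cpu_load[c])
        cpu_load[cpu] += irq_loads[irq]
        placement[irq] = cpu
    return placement


# Two packages with two HT siblings each; IRQs 24 and 25 are heavy.
packages = [[0, 1], [2, 3]]
print(balance({24: 900, 25: 800, 9: 3, 14: 5}, packages, heavy_threshold=100))
# {24: 0, 25: 2}
```

With two heavy sources and two packages, the sketch puts one source in each package instead of on two siblings of the same package.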


Throughput numbers with the netperf setup for the new implementation:

Current kernel IRQ balance implementation: 50.02K transactions/sec
The new IRQ balance implementation: 56.01K transactions/sec
 ---------------------
  The performance improvement on P4 Xeon of 11.9% is observed.

The new IRQ balance implementation also shows little performance
improvement on P6 (Pentium II, III) systems.

On a P6 system the netperf throughput numbers are:
Current kernel IRQ balance implementation: 36.96K transactions/sec
The new IRQ balance implementation: 37.65K transactions/sec
 ---------------------
Here the performance improvement on P6 system of about 2% is observed.
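
The relative gains quoted for the new implementation follow directly from the throughput pairs. A short check, using only the numbers reported above (the helper name `improvement_pct` is hypothetical):

```python
def improvement_pct(baseline, new):
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0


print(round(improvement_pct(50.02, 56.01), 2))  # P4 Xeon: 11.98
print(round(improvement_pct(36.96, 37.65), 2))  # P6: 1.87
```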


 ---------------------

Andrew Theurer <habanero@us.ibm.com> did some testing of this patch on a quad
P4:


I got a chance to run the NetBench benchmark with your patch on 2.5.54-mjb2
kernel.  NetBench measures SMB/CIFS performance by using several SMB
clients (in this case 44 Windows 2000 systems), sending SMB requests to a
Linux server running Samba 2.2.3a+sendfile.  Result is in throughput,
Mbps.  Generally the network traffic on the server is 60% recv, 40% tx.

I believe we have very similar systems.  Mine is a 4 x 1.6 GHz, 1 MB L3 P4
Xeon with 4 GB DDR memory (3.2 GB/sec I believe).  The chipset is "Summit".
I also have more than one Intel e1000 adapter.

I decided to run a few configurations, first with just one adapter, with
and without HT support in the kernel (acpi=off), then add another adapter
and test again with/without HT.

Here are the results:

4P, no HT, 1 x e1000, no kirq:	1214 Mbps, 4% idle
4P, no HT, 1 x e1000, kirq:		1223 Mbps, 4% idle,		+0.74%

I suppose we didn't see much of an improvement here because we never run
into the situation where more than one interrupt with a high rate is
routed to a single CPU on irq_balance.

4P, HT, 1 x e1000, no kirq:	1214 Mbps, 25% idle
4P, HT, 1 x e1000, kirq:	1220 Mbps, 30% idle,			+0.49%

Again, not much of a difference just yet, but lots of idle time.  We may
have reached the limit at which one logical CPU can process interrupts for
an e1000 adapter.  There are other things I can probably do to help this,
like int delay, and NAPI, which I will get to eventually.

4P, HT, 2 x e1000, no kirq:	1269 Mbps, 23% idle
4P, HT, 2 x e1000, kirq:	1329 Mbps, 18% idle			+4.7%

OK, almost 5% better!  Probably has to do with a couple of things; the fact
that your code does not route two different interrupts to the same
core/different logical cpus (quite obvious by looking at /proc/interrupts),
and that more than one interrupt does not go to the same cpu if possible.
I suspect irq_balance did some of those [bad] things some of the time, and
we observed a bottleneck in int processing that was lower than with kirq.

I don't think all of the idle time is because of an int processing
bottleneck.  I'm just not sure what it is yet :)  Hopefully something will
become obvious to me...

Overall I like the way it works, and I believe it can be tweaked to work
with NUMA when necessary.  I hope to have access to a specweb system on a
NUMA box soon, so we can verify that.

08f16f8f

[PATCH] Fix SMP race between __sync_single_inode and · 50d49a05

Andrew Morton authored Feb 05, 2003

Patch from Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>

There's an SMP race condition between __sync_single_inode (or __sync_one on
2.4.20) and __mark_inode_dirty. __mark_inode_dirty doesn't take inode
spinlock. As we know -- unless you take a spinlock or use barrier,
processor can change order of instructions.

CPU 1

modify inode
(but modifications are in cpu-local
buffer and do not go to bus)

calls
__mark_inode_dirty
it sees I_DIRTY and exits immediately
					CPU 2
					takes spinlock
					calls __sync_single_inode
					inode->i_state &= ~I_DIRTY
					writes the inode (but does not see
					modifications by CPU 1 yet)

CPU 1 flushes its write buffer to the bus
inode is already written, clean, modifications
done by CPU1 are lost

The easiest fix would be to move the test inside spinlock in
__mark_inode_dirty; if you do not want to suffer from performance loss,
use the attached patches that use memory barriers to ensure ordering of
reads and writes.

50d49a05

[PATCH] Restore LSM hook calls to sendfile · 0b316620

Andrew Morton authored Feb 05, 2003

Patch from "Stephen D. Smalley" <sds@epoch.ncsc.mil>

This patch restores the LSM hook calls in sendfile to 2.5.59. The hook was
previously added as of 2.5.29 but the hook calls in sendfile were
subsequently lost as a result of the sendfile rewrite as of 2.5.30.

0b316620

[PATCH] JBD Documentation · b573296a

Andrew Morton authored Feb 05, 2003

Patch from Roger Gammans <roger@computer-surgery.co.uk>

Adds lots of API documentation to the JBD layer.

b573296a

[PATCH] Updated Documentation/kernel-parameters.txt · 7260b084

Andrew Morton authored Feb 05, 2003

Patch from Petr Baudis <pasky@ucw.cz>

This patch (against 2.5.59) updates Documentation/kernel-parameters.txt to
the (more-or-less; I certainly missed some parameters) current state of
the kernel.  Note also that I will probably send another update after a few
further kernel releases.

7260b084

[PATCH] Remove most of the blk_run_queues() calls · 418f398e

Andrew Morton authored Feb 05, 2003

We don't need these with self-unplugging queues.

The patch also contains a couple of microopts suggested by Andrea: we
don't need to run sync_page() if the page just came unlocked.

418f398e

[PATCH] self-unplugging request queues · 00c8e791

Andrew Morton authored Feb 05, 2003

The patch teaches a queue to unplug itself:

a) if it has four requests OR
b) if it has had plugged requests for 3 milliseconds.
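
The two unplug conditions can be modelled with an explicit clock. This is a toy sketch, not the block layer implementation: the `PluggedQueue` class, its method names and the `tick()` stand-in for the unplug timer are all hypothetical.

```python
UNPLUG_THRESHOLD = 4   # requests
UNPLUG_DELAY_MS = 3    # milliseconds


class PluggedQueue:
    """Queue that unplugs itself on a request count or a plug-time limit."""

    def __init__(self):
        self.pending = []
        self.plugged_since = None
        self.dispatched = []

    def submit(self, request, now_ms):
        if not self.pending:
            self.plugged_since = now_ms     # first request plugs the queue
        self.pending.append(request)
        if len(self.pending) >= UNPLUG_THRESHOLD:
            self.unplug()

    def tick(self, now_ms):
        """Stand-in for the unplug timer firing."""
        if self.pending and now_ms - self.plugged_since >= UNPLUG_DELAY_MS:
            self.unplug()

    def unplug(self):
        self.dispatched.extend(self.pending)
        self.pending.clear()
        self.plugged_since = None


q = PluggedQueue()
q.submit("r1", now_ms=0)
q.tick(now_ms=2)
print(q.dispatched)    # []
q.tick(now_ms=3)
print(q.dispatched)    # ['r1']
for i in range(4):
    q.submit(f"w{i}", now_ms=10)
print(q.dispatched)    # ['r1', 'w0', 'w1', 'w2', 'w3']
```

Only the queue that received the requests is kicked, which is the property that distinguishes this from a global unplug.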

These numbers may need to be tuned, although doing so doesn't seem to
make much difference.  10 msecs works OK, so HZ=100 machines will be
fine.

Instrumentation shows that about 5-10% of requests were started due to
the three millisecond timeout (during a kernel compile).  That's
somewhat significant.  It means that the kernel is leaving stuff in the
queue, plugged, for too long.  This testing was with a uniprocessor
preemptible kernel, which is particularly vulnerable to unplug latency
(submit some IO, get preempted before the unplug).

This patch permits the removal of a lot of rather lame unplugging in
page reclaim and in the writeback code, which kicks the queues
(globally!) every four megabytes to get writeback underway.

This patch doesn't use blk_run_queues().  It is able to kick just the
particular queue.

The patch is not expected to make much difference really, except for
AIO.  AIO needs a blk_run_queues() in its io_submit() call.  For each
request.  This means that AIO has to disable plugging altogether,
unless something like this patch does it for it.  It means that AIO
will unplug *all* queues in the machine for every io_submit().  Even
against a socket!

This patch was tested by disabling blk_run_queues() completely.  The
system ran OK.

The 3 milliseconds may be too long.  It's OK for the heavy writeback
code, but AIO may want less.  Or maybe AIO really wants zero (ie:
disable plugging).  If that is so, we need new code paths by which AIO
can communicate the "immediate unplug" information - a global unplug is
not good.


To minimise unplug latency due to user CPU load, this patch gives keventd
`nice -10'.  This is of course completely arbitrary.  Really, I think keventd
should be SCHED_RR/MAX_RT_PRIO-1, as it has been in -aa kernels for ages.

00c8e791

[PATCH] reiserfs v3 readpages support · c5070032

Andrew Morton authored Feb 05, 2003

Patch from Chris Mason <mason@suse.com>

The patch below is against 2.5.59, various forms have been floating
around for a while, and Andrew recently included this fixed version in
2.5.55-mm. The end result is faster reads and writes for reiserfs.

This adds reiserfs support for readpages, along with a support func in
fs/mpage.c to deal with the reiserfs_get_block call sending back up to
date buffers with packed tails copied into them.

Most of the changes are to reiserfs_writepage, which still had many
2.4isms in the way it started io, dealt with errors and handled the bh
state bits. I've also added an optimization so it only starts
transactions when we need to copy a packed tail into the btree or fill a
hole, instead of any time reiserfs_writepage hits an unmapped buffer.

c5070032