Commits · d55249d351bc96f49e30bc3e5dfa1dad5034cc28 · nexedi / linux

19 Oct, 2004 24 commits

[PATCH] convert jiffies <-> msecs for io schedulers · d55249d3

Jens Axboe authored Oct 18, 2004

The various io schedulers don't convert to and from jiffies and ms in their
sysfs exported values.  This patch adds that.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
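
The conversion described above can be sketched outside the kernel roughly as follows. This is an illustration only: the tick rate `HZ = 1000` is an assumption, the helper names echo the kernel's `jiffies_to_msecs()` and `msecs_to_jiffies()`, and the rounding shown is a simplification rather than the kernel's exact arithmetic. `sysfs_show` and `sysfs_store` are hypothetical names for the two directions of a sysfs attribute.

```python
HZ = 1000  # assumed tick rate; kernels are built with other values too


def msecs_to_jiffies(ms, hz=HZ):
    """Milliseconds -> ticks, rounded up so a timeout is never shortened."""
    return (ms * hz + 999) // 1000


def jiffies_to_msecs(j, hz=HZ):
    """Ticks -> milliseconds, rounded down."""
    return (j * 1000) // hz


def sysfs_show(internal_jiffies, hz=HZ):
    """What a user would read: the internal tick count expressed in ms."""
    return f"{jiffies_to_msecs(internal_jiffies, hz)}\n"


def sysfs_store(text, hz=HZ):
    """What the scheduler would keep: the user's ms value expressed in ticks."""
    return msecs_to_jiffies(int(text.strip()), hz)
```

With this shape, a value written as milliseconds reads back as milliseconds regardless of the tick rate, as long as it is a whole number of ticks.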

d55249d3

[PATCH] cfq-v2 I/O scheduler update · f9887e4a

Jens Axboe authored Oct 18, 2004

Here is the next incarnation of the CFQ io scheduler, so far known as
CFQ v2 locally. It attempts to address some of the limitations of the
original CFQ io scheduler (henceforth known as CFQ v1). Some of the
problems with CFQ v1 are:

- It does accounting for the lifetime of the cfq_queue, which is setup
  and torn down for the time when a process has io in flight. For a fork
  heavy work load (such as a kernel compile, for instance), new
  processes can effectively starve io of running processes. This is in
  part due to the fact that CFQ v1 gives preference to new processes
  to get better latency numbers. Removing that heuristic is not an
  option exactly because of that.

- It makes no attempts to address inter-cfq_queue fairness.

- It makes no attempt to limit the upper latency bound of a single request.

- It only provides per-tgid grouping. You need to change the source to
  group on a different criterion.

- It uses a mempool for the cfq_queues. Theoretically this could
  deadlock if io bound processes never exit.

- The may_queue() logic can be unfair since it fluctuates quickly, thus
  leaving processes sleeping while new processes are allowed to allocate
  a request.

CFQ v2 attempts to fix these issues. It uses the process io_context
logic to maintain a cfq_queue lifetime of the duration of the process
(and its io). This means we can now be a lot more clever in deciding
which process is allowed to queue or dispatch io to the device. The
cfq_io_context is per-process per-queue; this is an extension to what AS
currently does in that we truly do have a unique per-process identifier
for io grouping. Busy queues are sorted by service time used, sub sorted
by in_flight requests. Queues that have no io in flight are also
preferred at dispatch time.
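
The ordering of busy queues described above can be sketched as a two-level sort. This is a minimal illustration, not the scheduler's data structure: the `Queue` class and its field names are hypothetical, and the ascending direction (least service used first, then fewest requests in flight) is an assumption drawn from the remark that idle queues are preferred.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str          # hypothetical label for the process or group
    service_used: int  # service time consumed so far (arbitrary units)
    in_flight: int     # requests dispatched but not yet completed


def dispatch_order(busy):
    """Least service used first; ties broken by fewest requests in flight."""
    return sorted(busy, key=lambda q: (q.service_used, q.in_flight))
```

For example, two queues with equal service time are ordered so that the one with nothing in flight is considered first.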

Accounting is done on completion time of a request, or with a fixed cost
for tagged command queueing. Requests are fifo'ed like with deadline, to
make sure that a single request doesn't stay in the io scheduler for
ages.

Process grouping is selectable at runtime. I provide 4 grouping
criteria: process group, thread group id, user id, and group id.

As usual, settings are sysfs tweakable in /sys/block/<dev>/queue/iosched

axboe@apu:[.]s/block/hda/queue/iosched $ ls
back_seek_max      fifo_batch_expire  find_best_crq  queued
back_seek_penalty  fifo_expire_async  key_type       show_status
clear_elapsed      fifo_expire_sync   quantum        tagged

In order, each of these settings controls:

back_seek_max
back_seek_penalty:
	Useful logic stolen from AS that allows small backwards seeks in
	the io stream if we deem them useful. CFQ uses a strict
	ascending elevator otherwise. _max controls the maximum allowed
	backwards seek, defaulting to 16MiB. _penalty denotes how
	expensive we account a backwards seek compared to a forward
	seek. Default is 2, meaning it's twice as expensive.

clear_elapsed:
	Really a debug switch, will go away in the future. It clears the
	maximum values for completion and dispatch time, shown in
	show_status.

fifo_batch_expire
fifo_expire_async
fifo_expire_sync:
	The settings for the expiry fifo. batch_expire is how often we
	allow the fifo expire to control which request to select.
	Default is 125ms. _async is the deadline for async requests
	(typically writes), _sync is the deadline for sync requests
	(reads and sync writes). Defaults are, respectively, 5 seconds
	and 0.5 seconds.

key_type:
	The grouping key. Can be set to pgid, tgid, uid, or gid. The
	current value is shown bracketed:

	axboe@apu:[.]s/block/hda/queue/iosched $ cat key_type
	[pgid] tgid uid gid

	Default is tgid. To set, simply echo any of the 4 words into the
	file.

quantum:
	The number of requests we select for dispatch when the driver
	asks for work to do and the current pending list is empty.
	Default is 4.

queued:
	The minimum number of requests a group is allowed to queue.
	Default is 8.

show_status:
	Debug output showing the current state of the queues.

tagged:
	Set this to 1 if the device is using tagged command queueing.
	This cannot be reliably detected by CFQ yet, since most drivers
	don't use the block layer (well, it could, by looking at the number
	of requests between dispatch and completion, but not
	completely reliably). Default is 0.
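
As an illustration of how back_seek_max and back_seek_penalty might interact, here is a small sketch. It is one plausible reading of the description above, not the scheduler's code: positions are treated as byte offsets, and `seek_cost` and `pick` are hypothetical helper names.

```python
BACK_SEEK_MAX = 16 * 1024 * 1024  # bytes; the 16MiB default from the text
BACK_SEEK_PENALTY = 2             # default from the text


def seek_cost(head, target, back_max=BACK_SEEK_MAX, penalty=BACK_SEEK_PENALTY):
    """Accounted cost of moving from head to target, or None if not allowed."""
    if target >= head:
        return target - head       # forward seek: plain distance
    back = head - target
    if back > back_max:
        return None                # further back than we are willing to go
    return back * penalty          # backward seek: distance times penalty


def pick(head, candidates, **kw):
    """Choose the candidate position with the lowest accounted cost."""
    costs = [(seek_cost(head, c, **kw), c) for c in candidates]
    allowed = [(cost, c) for cost, c in costs if cost is not None]
    return min(allowed)[1] if allowed else None
```

With the default penalty of 2, a request 100 units behind the head beats one 300 units ahead, but loses to one 150 units ahead.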

The patch is a little big, but works reliably here on my laptop. There
are a number of other changes and fixes in there (like converting to
hlist for hashes). The code is commented a lot better; CFQ v1 has
basically no comments (reflecting that it was written in one go, not
touched or tuned much since then). This is of course only done to
increase the AAF, akpm acceptance factor. Since I'm on the road, I
cannot provide any really good numbers of CFQ v1 compared to v2, maybe
someone will help me out there.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

f9887e4a

[PATCH] switchable and modular io schedulers · df02202c

Jens Axboe authored Oct 18, 2004

This patch modularizes the io schedulers completely, allowing them to be
built as modules.  Additionally it enables online switching of io schedulers.  See
also http://lwn.net/Articles/102593/ .


There's a scheduler file in the sysfs directory for the block device
queue:

axboe@router:/sys/block/hda/queue> ls
iosched            max_sectors_kb  read_ahead_kb
max_hw_sectors_kb  nr_requests     scheduler

If you list the contents of the file, it will show available schedulers
and the active one:

axboe@router:/sys/block/hda/queue> cat scheduler
[cfq]

Let's load a few more.

router:/sys/block/hda/queue # modprobe deadline-iosched
router:/sys/block/hda/queue # modprobe as-iosched
router:/sys/block/hda/queue # cat scheduler
[cfq] deadline anticipatory

Changing is done with

router:/sys/block/hda/queue # echo deadline > scheduler
router:/sys/block/hda/queue # cat scheduler
cfq [deadline] anticipatory

deadline is now the new active io scheduler for hda.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
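
The bracketed format shown by the scheduler file (and by key_type in the CFQ v2 commit) is easy to parse from userspace. The following is a small sketch; `parse_selector` is a hypothetical helper name, and it works on the text of the file rather than opening any sysfs path itself.

```python
def parse_selector(text):
    """Split '[a] b c' style sysfs content into (active, available)."""
    available, active = [], None
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]      # strip the brackets marking the active entry
            active = word
        available.append(word)
    return active, available
```

Given the output shown above, `parse_selector("cfq [deadline] anticipatory\n")` reports deadline as the active entry and lists all three schedulers as available.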

df02202c

[PATCH] unreachable code in ext3_direct_IO() · 3d3d8747

Andrew Morton authored Oct 18, 2004

davej points out that in this code local variable `ret' is already known to be
positive non-zero, so this test is meaningless.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

3d3d8747

[PATCH] jbd wakeup fix · 91cd0c2b

Andrew Morton authored Oct 18, 2004

Processes can sleep in do_get_write_access(), waiting for buffers to be
removed from the BJ_Shadow state. We did this by doing a wake_up_buffer() in
the commit path and sleeping on the buffer in do_get_write_access().

With the filtered bit-level wakeup code this doesn't work properly any more -
the wake_up_buffer() accidentally wakes up tasks which are sleeping in
lock_buffer() as well. Those tasks now implicitly assume that the buffer came
unlocked. Net effect: Bogus I/O errors when reading journal blocks, because
the buffer isn't up to date yet. Hence the recent spate of journal_bmap()
failure reports.

The patch creates a new jbd-private BH flag purely for this wakeup function.
So a wake_up_bit(..., BH_Unshadow) doesn't wake up someone who is waiting for
a wake_up_bit(BH_Lock).

JBD was the only user of wake_up_buffer(), so remove it altogether.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
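
The idea of a filtered, bit-level wakeup, and why a private bit avoids the stray wakeups described above, can be sketched with a toy model. Nothing here is kernel code: `BitWaitQueue` is a hypothetical class, the bit numbers are made up for the example, and real waiters are tasks rather than strings.

```python
BH_LOCK = 2       # hypothetical bit numbers, chosen only for this sketch
BH_UNSHADOW = 9


class BitWaitQueue:
    """One shared queue; each waiter records which (word, bit) it waits on."""

    def __init__(self):
        self.waiters = []  # list of (task_name, word_id, bit)

    def wait_on_bit(self, task, word_id, bit):
        self.waiters.append((task, word_id, bit))

    def wake_up_bit(self, word_id, bit):
        """Wake and remove only the waiters whose key matches (word_id, bit)."""
        key = (word_id, bit)
        woken = [w[0] for w in self.waiters if (w[1], w[2]) == key]
        self.waiters = [w for w in self.waiters if (w[1], w[2]) != key]
        return woken
```

In this model a wakeup on the unshadow bit leaves a task waiting on the lock bit of the same buffer asleep, which is the separation the new flag provides.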

91cd0c2b

[PATCH] document wake_up_bit()'s requirement for preceding memory barriers · a8589849

William Lee Irwin III authored Oct 18, 2004

Document the requirement to use a memory barrier prior to wake_up_bit().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

a8589849

[PATCH] reduce number of parameters to __wait_on_bit() and __wait_on_bit_lock() · 9659cc89

William Lee Irwin III authored Oct 18, 2004

Some of the parameters to __wait_on_bit() and __wait_on_bit_lock() are
redundant, as the wait_bit_queue parameter holds the flags word and the bit
number. This patch updates __wait_on_bit() and __wait_on_bit_lock() to
fetch that information from the wait_bit_queue passed to them and so reduce
the number of parameters so that -mregparm may be more effective.

Incremental atop the complete out-of-lining of the contention cases and the
fastcall and wait_on_bit_lock()/test_and_set_bit() fixes.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

9659cc89

[PATCH] move wait ops' contention case completely out of line · bc341c61

William Lee Irwin III authored Oct 18, 2004

Move the slow paths of wait_on_bit() and wait_on_bit_lock() out of line.
Also uninline wake_up_bit() to reduce the number of callsites generated,
and adjust loop startup in __wait_on_bit_lock() to properly reflect its
usage in the contention case.

Incremental atop the fastcall and wait_on_bit_lock()/test_and_set_bit()
fixes.  Successfully tested on x86-64.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

bc341c61

[PATCH] eliminate inode waitqueue hashtable · 493267b6

William Lee Irwin III authored Oct 18, 2004

Eliminate the inode waitqueue hashtable using bit_waitqueue() via
wait_on_bit() and wake_up_bit() to locate the waitqueue head associated
with a bit.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

493267b6

[PATCH] eliminate bh waitqueue hashtable · 525b64cd

William Lee Irwin III authored Oct 18, 2004

Eliminate the bh waitqueue hashtable using bit_waitqueue() via
wait_on_bit() and wake_up_bit() to locate the waitqueue head associated
with a bit.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

525b64cd

[PATCH] consolidate bit waiting code patterns · baa896b3

William Lee Irwin III authored Oct 18, 2004

Consolidate bit waiting code patterns for page waitqueues using
__wait_on_bit() and __wait_on_bit_lock().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

baa896b3

[PATCH] standardize bit waiting data type · fd4d36bf

William Lee Irwin III authored Oct 18, 2004

Eliminate specialized page and bh waitqueue hashing structures in favor of
a standardized structure, using wake_up_bit() to wake waiters using the
standardized wait_bit_key structure.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

fd4d36bf

[PATCH] move waitqueue functions to kernel/wait.c · d7988992

William Lee Irwin III authored Oct 18, 2004

The following patch series consolidates the various instances of waitqueue
hashing to use a uniform structure and share the per-zone hashtable among all
waitqueue hashers. This is expected to increase the number of hashtable
buckets available for waiting on bh's and inodes and eliminate statically
allocated kernel data structures for greater node locality and reduced kernel
image size. Some attempt was made to look similar to Oleg Nesterov's
suggested API in order to provide some kind of credit for independent
invention of something very similar (the original versions of these patches
predated my public postings on the subject of filtered waitqueues).

These patches have the further benefit and intention of enabling aio to use
filtered wakeups by standardizing the data structure passed to wake functions
so that embedded waitqueue elements in aio structures may be successfully
passed to the filtered wakeup wake functions, though this patch series doesn't
implement that particular functionality.

Successfully stress-tested on x86-64, and ia64 in recent prior versions.

This patch:

Move waitqueue-related functions not needing static functions in sched.c
to kernel/wait.c
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

d7988992

[PATCH] TIOCCONS security · d05dd6d0

Olaf Dabrunz authored Oct 18, 2004

The ioctl TIOCCONS allows any user to redirect console output to another
tty.  This allows anyone to suppress messages to the console at will.

AFAIK nowadays not many programs write to /dev/console, except for start
scripts and the kernel (printk() above console log level).

Still, I believe that administrators and operators would not like any user
to be able to hijack messages that were written to the console.

The only user of TIOCCONS that I am aware of is bootlogd/blogd, which runs
as root.  Please comment if there are other users.

Is there any reason why normal users should be able to use TIOCCONS?

Otherwise I would suggest to restrict access to root (CAP_SYS_ADMIN), e.g. 
with this patch.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

d05dd6d0

[PATCH] kallsyms data size reduction / lookup speedup · e1039211

Paulo Marques authored Oct 18, 2004

This patch is an improvement over my first kallsyms speedup patch posted about
2 weeks ago.

It changes scripts/kallsyms to produce a different format for
kallsyms_names and extra data to speed up lookups.  The compression algorithm
is quite simple: it uses all the char codes not actually used in symbols to
build a lookup table that translates these codes into small strings.  For
instance, in my test runs the code 0xFE was being translated into "acpi_"
giving a 4-byte saving on every translation.

The advantage of this algorithm is that to translate a symbol we only require
information that is stored on that symbol position, and never need to go back
on the compressed stream to get information from other symbols.
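
The token-table idea can be sketched as follows. This is a toy version under stated assumptions, not scripts/kallsyms: the tokens are supplied by hand rather than chosen by any optimization pass, symbols are assumed to be ASCII, and `build_table`, `compress` and `expand` are hypothetical names.

```python
def build_table(symbols, tokens):
    """Assign each token a byte code that never occurs in any symbol."""
    used = {b for s in symbols for b in s.encode("ascii")}
    free = [c for c in range(255, 0, -1) if c not in used]
    if len(tokens) > len(free):
        raise ValueError("not enough unused codes")
    return dict(zip(free, tokens))  # code -> token string


def compress(symbol, table):
    """Replace every occurrence of a token by its one-byte code."""
    out = symbol.encode("ascii")
    # Longest tokens first so a short token does not break up a longer one.
    for code, tok in sorted(table.items(), key=lambda kv: -len(kv[1])):
        out = out.replace(tok.encode("ascii"), bytes([code]))
    return out


def expand(data, table):
    """Decode one symbol using only its own bytes and the table."""
    return "".join(table.get(b, chr(b)) for b in data)
```

Replacing the five characters of "acpi_" with a single code byte gives the 4-byte saving mentioned above, and `expand` needs no other symbol to decode the result.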

To give an idea about the benefits of this algorithm here are some benchmark
results on a P4 2.8GHz with a symbol table with 10000 entries:

kallsyms_lookup average time:
  vanilla           1346.0 us
  speedup             14.4 us
  with this patch      0.5 us

total data produced by scripts/kallsyms:
  uncompressed         169 Kb
  vanilla              134 Kb
  with this patch       91 Kb

(speedup was my latest patch, that only changed the way kallsyms_lookup worked
and not the data format)

I removed a cond_resched() from the proc/kallsyms handling code path, because
using stem compression, if the current position went backwards, the whole
stream would be uncompressed up to the current position.  It seemed that by
removing this loop it would be safe to remove the conditional reschedule
altogether.

There is just one catch with this patch: the time it takes to compile the
kernel goes up just a bit (about 0.8s on a P4 2.8GHz with defconfig).  If this
delay is not acceptable I can change the compression algorithm so that it can
use the previous table (calculating a new table is what consumes most of the
time, and not doing the actual compression) and check to see if it obtains a
similar compression ratio.  If it does, then this is a sign that the symbol
patterns haven't changed that much and this table is still good to use.  This
would not only cut the time down to half on any compilation (because of the 2
pass symbol build method), but in frequent cases where a developer is
compiling a single file and linking everything over and over again, the table
optimization process would never run.

I'm CC'ing Brent Casavant on this email, because last June he sent a patch
trying a different approach that used a 32 entry symbol cache, because there
was a problem with the time "top" took to read "proc/<pid>/wchan".  I was
hoping he would be willing to test this patch and comment on the results.
Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

e1039211

[PATCH] implement in-kernel keys & keyring management · e4262f59

David Howells authored Oct 18, 2004

The feature set the patch includes:

 - Key attributes:
   - Key type
   - Description (by which a key of a particular type can be selected)
   - Payload
   - UID, GID and permissions mask
   - Expiry time
 - Keyrings (just a type of key that holds links to other keys)
 - User-defined keys
 - Key revocation
 - Access controls
 - Per user key-count and key-memory consumption quota
 - Three std keyrings per task: per-thread, per-process, session
 - Two std keyrings per user: per-user and default-user-session
 - prctl() functions for key and keyring creation and management
 - Kernel interfaces for filesystem, blockdev, net stack access
 - JIT key creation by usermode helper

There are also two utility programs available:

 (*) http://people.redhat.com/~dhowells/keys/keyctl.c

     A comprehensive key management tool, permitting all the interfaces
     available to userspace to be exercised.

 (*) http://people.redhat.com/~dhowells/keys/request-key

     An example shell script (to be installed in /sbin) for instantiating a
     key.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

e4262f59

[PATCH] keys: new error codes for Alpha, MIPS, PA-RISC, Sparc & Sparc64 · 322f317d

David Howells authored Oct 18, 2004

The attached patch adds the new error codes I added for key-related errors to
those archs that don't make use of <asm-generic/errno.h>, including Alpha,
MIPS, PA-RISC, Sparc and Sparc64.  This is required to compile with
CONFIG_KEYS on those platforms.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

322f317d

[PATCH] Add some key management specific error codes · c6ac5ab1

David Howells authored Oct 18, 2004

Here's a patch to add some new error codes specific to key management.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

c6ac5ab1

[PATCH] reiserfs: rename struct key · 6f1afa77

Andrew Morton authored Oct 18, 2004

Rename reiserfs's `struct key' to `struct reiserfs_key' to avoid namespace
clashes.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

6f1afa77

[PATCH] Create nodemask_t · 59356466

Matthew Dobson authored Oct 18, 2004

The idea behind this patch is to create a nodemask_t as a node analog of
cpumask_t.  As NUMA machines become more common, the need for a standard,
cross-platform bitmap of both online & possible nodes becomes more
apparent.  We believe we've worked out most of the kinks of the variable
length bitmap types with the recent cpumask_t patches.  Nodemasks are also
currently far less widespread than cpumasks.  Further, inclusion at this
point in the kernel would mean consistency in node handling between 2.6 and
2.7.

Future goals would be to get rid of the 'numnodes' variable used to count
the number of online nodes, and replace with node_online_map.  This would
allow arbitrary node numbering and facilitate node hotplugging.

(Nothing actually uses this yet, but several projects need it, and it does
model a well-defined physical grouping).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

59356466

[PATCH] cdrom: buffer sizing fix · 66d5cab9

Peter Osterlund authored Oct 18, 2004

The problem is that some drives fail the "GET CONFIGURATION" command when
asked to only return 8 bytes.  This happens for example on my drive, which
is identified as:

        hdc: HL-DT-ST DVD+RW GCA-4040N, ATAPI CD/DVD-ROM drive

Since the cdrom_mmc3_profile() function already allocates 32 bytes for the
reply buffer, this patch is enough to make the command succeed on my drive.
Signed-off-by: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

66d5cab9

[PATCH] CDRW packet writing support · 2f8e2dc8

Peter Osterlund authored Oct 18, 2004

This patch implements CDRW packet writing as a kernel block device.  Usage
instructions are in the packet-writing.txt file.

A hint: If you don't want to wait for a complete disc format, you can
format just a part of the disc.  For example:

        cdrwtool -d /dev/hdc -m 10240

This will format 10240 blocks, i.e. 20MB.
Signed-off-by: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
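
The arithmetic behind the formatting hint can be checked with a couple of lines. The 2048-byte block size is an assumption implied by the figures in the text (10240 blocks giving 20MB); `format_size_mib` is a hypothetical helper name.

```python
BLOCK_SIZE = 2048  # bytes; assumed, implied by "10240 blocks, 20MB"


def format_size_mib(blocks, block_size=BLOCK_SIZE):
    """Size in MiB covered by formatting the given number of blocks."""
    return blocks * block_size / (1024 * 1024)
```

Under that assumption, `format_size_mib(10240)` gives 20.0.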

2f8e2dc8

[PATCH] packet-writing: add credits · a7cbd7da

Peter Osterlund authored Oct 18, 2004

Nigel pointed out that the earlier patches contained attributions that
are not present in this patch. The 2.4 patch contains:

  Nov 5 2001, Aug 8 2002. Modified by Andy Polyakov
  <appro@fy.chalmers.se> to support MMC-3 compliant DVD+RW units.

and Nigel changed it to this in his 2.6 patch:

  Modified by Nigel Kukard <nkukard@lbsd.net> - support DVD+RW
  2.4.x patch by Andy Polyakov <appro@fy.chalmers.se>

The patch I sent you deleted most of the earlier work and moved the
rest to cdrom.c, but the comments were not moved over, since the
earlier authors didn't modify cdrom.c.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

a7cbd7da

[PATCH] DVD+RW support · ed594d2d

Peter Osterlund authored Oct 18, 2004

This patch adds support for using DVD+RW drives as writable block devices.

The patch is based on work from:

        Andy Polyakov <appro@fy.chalmers.se> - Wrote the 2.4 patch
        Nigel Kukard <nkukard@lbsd.net> - Initial porting to 2.6.x

It works for me using an Iomega Super DVD 8x USB drive.


  Nov 5 2001, Aug 8 2002. Modified by Andy Polyakov
  <appro@fy.chalmers.se> to support MMC-3 compliant DVD+RW units.

  Modified by Nigel Kukard <nkukard@lbsd.net> - support DVD+RW
  2.4.x patch by Andy Polyakov <appro@fy.chalmers.se>

This patch implements CDRW packet writing as a kernel block device.  Usage
instructions are in the packet-writing.txt file.

A hint: If you don't want to wait for a complete disc format, you can
format just a part of the disc.  For example:

        cdrwtool -d /dev/hdc -m 10240

This will format 10240 blocks, i.e. 20MB.
Signed-off-by: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

ed594d2d

18 Oct, 2004 16 commits

Fix pci config syscall definitions. · a4946826

Linus Torvalds authored Oct 18, 2004

Including the proper header file showed that they didn't
match the declared prototypes.

a4946826

Don't use obsolete gcc named initializer syntax. · c1e08b84

Linus Torvalds authored Oct 18, 2004

The proper C99 syntax is much preferred.

c1e08b84

Fix old-style fn declaration. · e3303e02

Linus Torvalds authored Oct 18, 2004

e3303e02

[PATCH] return full SCSI status byte in SG_IO · fa0e5198

Jens Axboe authored Oct 18, 2004

This has been around for a while. Return the full scsi result byte in
rq->errors for SG_IO generated requests.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

fa0e5198

[PATCH] fix & clean up zombie/dead task handling & preemption · d3069b4d

Ingo Molnar authored Oct 18, 2004

This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had.  Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could] our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING, and we trigger this assert:

        schedule();
        BUG();
        /* Avoid "noreturn function does return".  */
        for (;;) ;

I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task.  Dead-task freeing is done via PF_DEAD.

Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

d3069b4d

[PATCH] sched: fix SCHED_SMT & numa=fake=2 lockup · 7a9a3e86

Ingo Molnar authored Oct 18, 2004

This patch fixes an interaction between the numa=fake=<domains> feature,
the domain setup code and cpu_siblings_map[].  The bug leads to a bootup
crash when using numa=fake=2 on a 2-way/4-way SMP+HT box.

When SCHED_SMT is turned on the domains-setup code relies on siblings not
spanning multiple domains (which makes perfect sense).  But numa=fake=2
creates an asymmetric 1101/0010 splitup between CPUs, which results in two
siblings being on different nodes.

The patch adds a check_siblings_map() function that checks the sibling maps
and fixes them up if they violate this rule.  (it also prints a warning in
that case.)

The patch also turns SCHED_DOMAIN_DEBUG back on - had this been enabled
we'd have noticed this bug much earlier.

From: Badari Pulavarty <pbadari@us.ibm.com>

  arch/x86_64/mm/numa.c: In function `numa_setup':
  arch/x86_64/mm/numa.c:332: error: `numa_fake' undeclared (first use in this function)
  arch/x86_64/mm/numa.c:332: error: (Each undeclared identifier is reported only once
  arch/x86_64/mm/numa.c:332: error: for each function it appears in.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

7a9a3e86

[PATCH] sched: remove NODE_BALANCE_RATE definitions · ad5f30c4

Matthew Dobson authored Oct 18, 2004

NODE_BALANCE_RATE is defined all over the place, but used nowhere.  Let's
remove it.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

ad5f30c4

[PATCH] sched_domains: Make SD_NODE_INIT per-arch #2 · 7b2565a8

Matthew Dobson authored Oct 18, 2004

Here's yet another version of a patch to implement per-arch SD_*_INITs. 
This follows the same basic idea of my last patch, but

1) defines an arch-specific SD_NODE_INIT for the 4 NUMA arches (i386,
   x86_64, IA64 & PPC64),

2) defines *default* SD_CPU_INIT & SD_SIBLING_INIT for *all* arches,
   with the possibility of them being overridden by simply defining an
   arch-specific version in include/asm/topology.h.

The motivation behind the third version of this patch is that Martin feels
that there should be no "default" NUMA initializer because NUMA
characteristics are *very* arch/platform specific, and hence a "default"
NUMA initializer can only lead to confusion.  I agree with most of that,
but don't quite see as much harm in having a default as he does.
Nevertheless, to keep him quiet, I've run up this version of the patch. 
Martin, please run this through your magic test suite and make sure I
didn't break anything trivial.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

7b2565a8

[PATCH] CPU Scheduler: fix potential error in runqueue nr_uninterruptible count · 95a2f6d7

Peter Williams authored Oct 18, 2004

Problem:

In the function try_to_wake_up(), when the runqueue's nr_uninterruptible
field is decremented it's possible (on SMP systems) that the pointer no
longer points to the runqueue that the task being woken was on when it went
to sleep.  This would cause the wrong runqueue's field to be decremented
and the correct one to remain unchanged.

Fix:

Save a pointer to the old runqueue at the beginning of the function and use
it when decrementing nr_uninterruptible.
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
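
The shape of the fix can be shown with a toy model. This is a sketch only: `RunQueue`, `Task` and the `migrate_to` parameter are hypothetical, and the real try_to_wake_up() involves locking and load balancing that are omitted here.

```python
class RunQueue:
    def __init__(self, cpu):
        self.cpu = cpu
        self.nr_uninterruptible = 0


class Task:
    def __init__(self, rq):
        self.rq = rq  # runqueue the task is currently associated with


def sleep_uninterruptible(task):
    """Going to sleep bumps the counter of the task's current runqueue."""
    task.rq.nr_uninterruptible += 1


def try_to_wake_up(task, migrate_to=None):
    old_rq = task.rq                # remember where the task went to sleep
    if migrate_to is not None:
        task.rq = migrate_to        # the wakeup may move the task elsewhere
    old_rq.nr_uninterruptible -= 1  # undo the increment on the same runqueue
    return task.rq
```

Decrementing through `task.rq` after the move would instead leave the old runqueue at 1 and push the new one to -1.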

95a2f6d7

[PATCH] sched: print preempt count · c0799016

Andrew Morton authored Oct 18, 2004

Better debugging output when the CPU scheduler detects atomicity errors.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

c0799016

[PATCH] sched: fixes for ia64 domain setup · c80f2dee

Nick Piggin authored Oct 18, 2004

Still having some trouble with ia64 domain setup on the Altixes.  Jesse
hasn't had much time to look into it, and I'm lacking an Altix, so I'm not
sure if this is right or not...

Anyway, it again does the right thing on the NUMAQ, and fixes some real
bugs, so can you include it please?

* Increase SD_NODES_PER_DOMAIN to 6 from 4 to better match Altix's
   topology. A setting of 4 will include this node, the other one
   in the brick, and the 2 nodes in the next closest brick, while 6
   will catch 2 other bricks. Probably it could be increased even
   more.

* Work correctly with sparse and not completely full node maps.

* Nasty typo fixed in find_next_best_node:
	-               val = node_distance(node, i);
	+               val = node_distance(node, n);

* Ensure all nodes are themselves a member of their numa balancing
   domain. This is more a precaution against creative implementations
   of node_distance.. but it makes the setup easier to verify without
   having to look at a table of node_distance's, which is possibly
   generated at runtime.
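
To see why the one-character fix to find_next_best_node matters, here is a simplified sketch of such a search. It is not the kernel function: the distance table is a plain list of lists, and the rotation `n = (node + i) % num` is an assumption about the surrounding loop, inferred from the i/n distinction in the diff above.

```python
def find_next_best_node(node, used, distance):
    """Return the unused node closest to `node` (marking it used), or None."""
    num = len(distance)
    best, min_val = None, None
    for i in range(num):
        n = (node + i) % num        # candidate node, starting at `node` itself
        if n in used:
            continue
        val = distance[node][n]     # distance to the candidate n, not to i
        if min_val is None or val < min_val:
            min_val, best = val, n
    if best is not None:
        used.add(best)
    return best
```

Called repeatedly with the same `used` set, it yields nodes in order of increasing distance from the starting node. Using `i` in place of `n` in the distance lookup would measure the wrong node whenever the starting node is not 0.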

So again, I'm not too sure if this will fix the Altix setup or not.  But if
you do a release, it will surely be less broken than it was before.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

c80f2dee

[PATCH] sched: use CPU_DOWN_FAILED notifier · 5e641eba

Nick Piggin authored Oct 18, 2004

Use CPU_DOWN_FAILED notifier in the sched-domains hotplug code.  This goes
with 4/8 "integrate cpu hotplug and sched domains"
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

5e641eba

[PATCH] sched: hotplug add a CPU_DOWN_FAILED notifier · 71da3667

Nick Piggin authored Oct 18, 2004

Introduce CPU_DOWN_FAILED notifier, so we can cope with a failure after a
CPU_DOWN_PREPARE notice.

This fixes 3/8 "add CPU_DOWN_PREPARE notifier" to be useful
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

71da3667

[PATCH] sched: enable SD_LOAD_BALANCE · 0298ac9c

Nick Piggin authored Oct 18, 2004

Actually turn on SD_LOAD_BALANCE for the regular domains.  Introduced by
5/8 "sched add load balance flag".
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

0298ac9c

[PATCH] sched: fix domain debug for isolcpus · 8dac7706

Nick Piggin authored Oct 18, 2004

Fix an oops in the domain debug code when isolated CPUs are specified.
Introduced by 5/8 "sched add load balance flag"
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

8dac7706

[PATCH] sched: IA64 add disjoint NUMA domain support · 293643f4

Nick Piggin authored Oct 18, 2004

Implement disjoint NUMA domain setup for IA64 architecture.  Most of the code
was what was ripped out of kernel/sched.c, which was written by Jesse Barnes
<jbarnes@sgi.com>.  I fixed up the tricky NUMA groups initialisation.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

293643f4