Commit f221af36 authored by Andrew Morton, committed by Linus Torvalds

[PATCH] scheduler infrastructure

From: Ingo Molnar <mingo@elte.hu>

The attached scheduler patch (against test2-mm2) adds the scheduling
infrastructure items discussed on lkml. I got good feedback - and while I
don't expect it to solve all problems, it does solve a number of bad ones:

 - test_starve.c code from David Mosberger

 - thud.c making the system unusable due to unfairness

 - fair/accurate sleep average based on a fine-grained clock

 - audio skipping way too easily

Other changes in sched-test2-mm2-A3:

 - ia64 sched_clock() code, from David Mosberger.

 - migration thread startup without relying on implicit scheduling
   behavior. The current 2.6 code is correct (because the cpu-up code
   adds CPUs one by one), but it's also fragile - and it cannot be
   carried over into the 2.4 backports. Making the startup explicit
   cleans up the startup path and makes 2.4 backports easier; a sketch
   of the pattern follows below.
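
   (Illustration only, not the patch's actual code: a minimal sketch of
   an explicit startup handshake, assuming a completion-based signal -
   all identifiers below are hypothetical.)

	#include <linux/completion.h>
	#include <linux/sched.h>

	static DECLARE_COMPLETION(example_thread_ready); /* hypothetical */

	static int example_migration_thread(void *unused)
	{
		/* ... bind to the target CPU, do per-CPU setup ... */
		complete(&example_thread_ready); /* explicit "I am running" */
		/* ... enter the migration loop ... */
		return 0;
	}

	static void example_migration_startup(void)
	{
		kernel_thread(example_migration_thread, NULL, CLONE_KERNEL);
		/*
		 * Wait for the explicit handshake instead of assuming the
		 * scheduler has already given the new thread CPU time.
		 */
		wait_for_completion(&example_thread_ready);
	}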

And here's the original changelog for the scheduler changes:

 - cycle accuracy (nanosec resolution) timekeeping within the scheduler.
   This fixes a number of audio artifacts (skipping) I've reproduced. I
   don't think we can get away without cycle accuracy - reading the
   cycle counter adds some overhead, but it's acceptable. The first
   nanosec-accuracy patch was done by Mike Galbraith - this patch is
   different but similar in nature. I went further in also changing the
   sleep_avg to be of nanosec resolution.
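
   (For reference, a sketch of the fixed-point cycles-to-nanoseconds
   conversion, modelled on the 2.6-era i386 cycles_2_ns() helper; the
   scale setup here is simplified and its call site is omitted.)

	static unsigned long cyc2ns_scale;
	#define CYC2NS_SCALE_FACTOR 10	/* keep 2^10 fractional bits */

	static inline void set_cyc2ns_scale(unsigned long cpu_mhz)
	{
		/* ns per cycle = 1000/cpu_mhz, scaled by 2^10 */
		cyc2ns_scale = (1000 << CYC2NS_SCALE_FACTOR) / cpu_mhz;
	}

	static inline unsigned long long cycles_2_ns(unsigned long long cyc)
	{
		return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
	}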

 - more fine-grained timeslices: there's now a timeslice 'sub unit' of 50
   usecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level
   will round-robin with this unit. This change is intended to make gaming
   latencies shorter; see the sketch below.
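
   (A simplified sketch of the round-robin mechanism, not the patch's
   exact code - the guard conditions in the real scheduler_tick() differ:)

	/* in scheduler_tick(), for a task with timeslice remaining: */
	if (p->array == rq->active &&
	    !(p->time_slice % TIMESLICE_GRANULARITY) &&
	    p->time_slice >= MIN_TIMESLICE) {
		dequeue_task(p, rq->active);
		set_tsk_need_resched(p);
		/* requeue at the tail of its own priority list, so
		 * same-priority hogs round-robin at this granularity */
		enqueue_task(p, rq->active);
	}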

 - include scheduling latency in the sleep bonus calculation. This change
   extends the sleep-average calculation to the period of time a task
   spends on the runqueue but doesn't get scheduled yet, right after
   wakeup. Note that tasks that were preempted (i.e. not woken up) and are
   still on the runqueue do not get this benefit. This change closes one
   of the last holes in the dynamic priority estimation; it should result
   in interactive tasks getting more priority under heavy load. This
   change also fixes the test-starve.c testcase from David Mosberger. (A
   rough sketch of the bookkeeping follows below.)
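
   (A rough sketch of how the credit could be applied when the woken task
   is finally picked in schedule() - simplified; the cap name
   NS_MAX_SLEEP_AVG and the exact bookkeeping are assumptions here:)

	/* once "next" has been chosen in schedule(): */
	unsigned long long now = sched_clock();

	if (next->activated) {
		/* time spent runnable-but-waiting since wakeup */
		unsigned long long delta = now - next->timestamp;

		next->sleep_avg += delta;	/* the wait counts as sleep */
		if (next->sleep_avg > NS_MAX_SLEEP_AVG)
			next->sleep_avg = NS_MAX_SLEEP_AVG;
		next->activated = 0;
	}
	next->timestamp = now;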


The TSC-based scheduler clock is disabled on ia32 NUMA platforms (i.e.
platforms that are known to have unsynchronized TSCs). Those platforms
should provide the proper code to rely on the TSC in a global way. (No such
infrastructure exists at the moment - the monotonic TSC-based clock doesn't
deal with TSC offsets either, as far as I can tell.)
parent 1dffaaf7
arch/i386/kernel/smpboot.c:
@@ -915,13 +915,13 @@ static void smp_tune_scheduling (void)
 		cacheflush_time = (cpu_khz>>10) * (cachesize<<10) / bandwidth;
 	}
-	cache_decay_ticks = (long)cacheflush_time/cpu_khz * HZ / 1000;
+	cache_decay_ticks = (long)cacheflush_time/cpu_khz + 1;
 	printk("per-CPU timeslice cutoff: %ld.%02ld usecs.\n",
 		(long)cacheflush_time/(cpu_khz/1000),
 		((long)cacheflush_time*100/(cpu_khz/1000)) % 100);
 	printk("task migration cache decay timeout: %ld msecs.\n",
-		(cache_decay_ticks + 1) * 1000 / HZ);
+		cache_decay_ticks);
 }

 /*
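
(Worked numbers for the hunk above, with illustrative values: at cpu_khz =
1000000, i.e. a 1 GHz CPU, and cacheflush_time = 2000000 cycles, the cutoff
printk shows "2000.00 usecs", and cache_decay_ticks becomes 2000000/1000000
+ 1 = 3, now printed directly as "3 msecs" - ticks and msecs coincide at
HZ=1000.)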
arch/i386/kernel/timers/timer_tsc.c:
@@ -127,6 +127,30 @@ static unsigned long long monotonic_clock_tsc(void)
 	return base + cycles_2_ns(this_offset - last_offset);
 }

+/*
+ * Scheduler clock - returns current time in nanosec units.
+ */
+unsigned long long sched_clock(void)
+{
+	unsigned long long this_offset;
+
+	/*
+	 * In the NUMA case we dont use the TSC as they are not
+	 * synchronized across all CPUs.
+	 */
+#ifndef CONFIG_NUMA
+	if (unlikely(!cpu_has_tsc))
+#endif
+		return (unsigned long long)jiffies * (1000000000 / HZ);
+
+	/* Read the Time Stamp Counter */
+	rdtscll(this_offset);
+
+	/* return the value in ns */
+	return cycles_2_ns(this_offset);
+}
+
 static void mark_offset_tsc(void)
 {
 	unsigned long lost,delay;
fs/proc/array.c:
@@ -154,13 +154,16 @@ static inline char * task_state(struct task_struct *p, char *buffer)
 	read_lock(&tasklist_lock);
 	buffer += sprintf(buffer,
 		"State:\t%s\n"
+		"SleepAVG:\t%lu%%\n"
 		"Tgid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
 		"TracerPid:\t%d\n"
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
-		get_task_state(p), p->tgid,
+		get_task_state(p),
+		(p->sleep_avg/1024)*100/(1000000000/1024),
+		p->tgid,
 		p->pid, p->pid ? p->real_parent->pid : 0,
 		p->pid && p->ptrace ? p->parent->pid : 0,
 		p->uid, p->euid, p->suid, p->fsuid,
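
(A note on the SleepAVG expression above: sleep_avg is held in nanoseconds
and is bounded by roughly 10^9, so multiplying by 100 directly could
overflow a 32-bit unsigned long; pre-dividing both operands by 1024 keeps
the intermediate below 2^32. A standalone check with an illustrative value:

	unsigned long sleep_avg = 500000000UL;	/* 0.5 s of sleep credit */

	/* naive sleep_avg*100 = 5*10^10 would overflow a 32-bit ulong */
	unsigned long pct = (sleep_avg/1024)*100/(1000000000/1024);
	/* 488281*100 / 976562 == 50, i.e. "SleepAVG: 50%" */
)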
include/linux/sched.h:
@@ -342,7 +342,8 @@ struct task_struct {
 	prio_array_t *array;

 	unsigned long sleep_avg;
-	unsigned long last_run;
+	unsigned long long timestamp;
+	int activated;

 	unsigned long policy;
 	cpumask_t cpus_allowed;
@@ -506,6 +507,8 @@ static inline int set_cpus_allowed(task_t *p, cpumask_t new_mask)
 }
 #endif

+extern unsigned long long sched_clock(void);
+
 #ifdef CONFIG_NUMA
 extern void sched_balance_exec(void);
 extern void node_nr_running_init(void);
kernel/fork.c:
@@ -925,7 +925,7 @@ struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	p->first_time_slice = 1;
 	current->time_slice >>= 1;
-	p->last_run = jiffies;
+	p->timestamp = sched_clock();
 	if (!current->time_slice) {
 		/*
 		 * This case is rare, it happens when the parent has only
(The rest of the diff is collapsed.)