Commit cb251765 authored by Mel Gorman, committed by Ingo Molnar

sched/debug: Make schedstats a runtime tunable that is disabled by default

schedstats is very useful during debugging and performance tuning but it
incurs overhead to calculate the stats. As such, even though it can be
disabled at build time, it is often enabled as the information is useful.

This patch adds a kernel command-line and sysctl tunable to enable or
disable schedstats on demand (when it's built in). It is disabled
by default as someone who knows they need it can also learn to enable
it when necessary.
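
For reference, the two knobs added below are the schedstats= kernel
command-line parameter (accepting "enable" or "disable") and the
kernel.sched_schedstats sysctl (0 or 1, i.e. /proc/sys/kernel/sched_schedstats);
both are documented in the hunks that follow.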

The benefit depends on how scheduler-intensive the workload is. If it is,
the patch reduces the number of cycles spent calculating the stats, with a
small additional benefit from reducing the cache footprint of the scheduler.

These measurements were taken on a 2-socket, 48-core machine with
Xeon(R) E5-2670 v3 CPUs, although the patch was also tested on a
single-socket, 8-core machine with an Intel i7-3770 processor.

netperf-tcp
                           4.5.0-rc1             4.5.0-rc1
                             vanilla          nostats-v3r1
Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)

Small gains here; UDP_STREAM showed nothing interesting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.

tbench4
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)

Small gains of 2-4% at low thread counts, and otherwise flat. The
gains on the 8-core machine were slightly different:

tbench4 on the 8-core i7-3770 single-socket machine
Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)

In contrast, this shows a relatively steady 2-3% gain at higher thread
counts. Given the nature of the patch and the type of workload, it is not
a surprise that the result depends on the CPU used.

hackbench-pipes
                         4.5.0-rc1             4.5.0-rc1
                           vanilla          nostats-v3r1
Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)

Some small gains and losses; while the variance data is not included, the
results are close to the noise. The UMA machine did not show anything
particularly different.

pipetest
                             4.5.0-rc1             4.5.0-rc1
                               vanilla          nostats-v2r2
Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)

A small improvement; similar gains were seen on the UMA machine.

The gain is small, but it stands to reason that doing less work in the
scheduler is a good thing. The downside is that the lack of schedstats and
tracepoints may surprise experts doing performance analysis until they
discover the schedstats= kernel parameter or the schedstats sysctl. To
alleviate this, schedstats is automatically activated for latencytop and
sleep profiling. For tracepoints, a simple warning is emitted instead,
because it is not safe to activate schedstats in the context where it
becomes known that a tracepoint may be wanted but is unavailable.
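
For example, enabling sleep profiling at boot (see the profile_setup() hunk
below) calls force_schedstat_enabled(), as does turning on latencytop via its
sysctl, while check_schedstat_required() prints a one-time warning if the
sched_stat_* tracepoints are enabled while schedstats is off.
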
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
parent a6e4491c
@@ -3528,6 +3528,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
+	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
+			Allowed values are enable and disable. This feature
+			incurs a small amount of overhead in the scheduler
+			but is useful for debugging and performance tuning.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
...
@@ -760,6 +760,14 @@ rtsig-nr shows the number of RT signals currently queued.
 
 ==============================================================
 
+sched_schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
 sg-big-buff:
 
 This file shows the size of the generic SCSI (sg) buffer.
...
@@ -37,6 +37,9 @@ account_scheduler_latency(struct task_struct *task, int usecs, int inter)
 
 void clear_all_latency_tracing(struct task_struct *p);
 
+extern int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos);
+
 #else
 
 static inline void
...
@@ -920,6 +920,10 @@ static inline int sched_info_on(void)
 #endif
 }
 
+#ifdef CONFIG_SCHEDSTATS
+void force_schedstat_enabled(void);
+#endif
+
 enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,
...
@@ -95,4 +95,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 					 void __user *buffer, size_t *lenp,
 					 loff_t *ppos);
 
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+					 void __user *buffer, size_t *lenp,
+					 loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
@@ -47,12 +47,12 @@
  * of times)
  */
 
+#include <linux/latencytop.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
 #include <linux/spinlock.h>
 #include <linux/proc_fs.h>
-#include <linux/latencytop.h>
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/list.h>
@@ -289,4 +289,16 @@ static int __init init_lstats_procfs(void)
 	proc_create("latency_stats", 0644, NULL, &lstats_fops);
 	return 0;
 }
+
+int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	err = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (latencytop_enabled)
+		force_schedstat_enabled();
+
+	return err;
+}
 device_initcall(init_lstats_procfs);
@@ -59,6 +59,7 @@ int profile_setup(char *str)
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
+		force_schedstat_enabled();
 		prof_on = SLEEP_PROFILING;
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
...
@@ -2093,7 +2093,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	ttwu_queue(p, cpu);
 stat:
-	ttwu_stat(p, cpu, wake_flags);
+	if (schedstat_enabled())
+		ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -2141,7 +2142,8 @@ static void try_to_wake_up_local(struct task_struct *p)
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_do_wakeup(rq, p, 0);
-	ttwu_stat(p, smp_processor_id(), 0);
+	if (schedstat_enabled())
+		ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
 }
@@ -2210,6 +2212,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
+	/* Even if schedstat is disabled, there should not be garbage */
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
@@ -2281,6 +2284,69 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
 #endif
 #endif
 
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+static void set_schedstats(bool enabled)
+{
+	if (enabled)
+		static_branch_enable(&sched_schedstats);
+	else
+		static_branch_disable(&sched_schedstats);
+}
+
+void force_schedstat_enabled(void)
+{
+	if (!schedstat_enabled()) {
+		pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
+		static_branch_enable(&sched_schedstats);
+	}
+}
+
+static int __init setup_schedstats(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+
+	if (!strcmp(str, "enable")) {
+		set_schedstats(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_schedstats(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		pr_warn("Unable to parse schedstats=\n");
+
+	return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_schedstats);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_schedstats(state);
+	return err;
+}
+#endif
+#endif
+
 /*
  * fork()/clone()-time setup:
  */
...
@@ -75,16 +75,18 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
 #ifdef CONFIG_SCHEDSTATS
-	PN(se->statistics.wait_start);
-	PN(se->statistics.sleep_start);
-	PN(se->statistics.block_start);
-	PN(se->statistics.sleep_max);
-	PN(se->statistics.block_max);
-	PN(se->statistics.exec_max);
-	PN(se->statistics.slice_max);
-	PN(se->statistics.wait_max);
-	PN(se->statistics.wait_sum);
-	P(se->statistics.wait_count);
+	if (schedstat_enabled()) {
+		PN(se->statistics.wait_start);
+		PN(se->statistics.sleep_start);
+		PN(se->statistics.block_start);
+		PN(se->statistics.sleep_max);
+		PN(se->statistics.block_max);
+		PN(se->statistics.exec_max);
+		PN(se->statistics.slice_max);
+		PN(se->statistics.wait_max);
+		PN(se->statistics.wait_sum);
+		P(se->statistics.wait_count);
+	}
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
@@ -122,10 +124,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
-		SPLIT_NS(p->se.statistics.wait_sum),
-		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	if (schedstat_enabled()) {
+		SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+			SPLIT_NS(p->se.statistics.wait_sum),
+			SPLIT_NS(p->se.sum_exec_runtime),
+			SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	}
 #else
 	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
 		0LL, 0L,
...
@@ -313,17 +317,18 @@ do { \
 #define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);
 #define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);
 
-	P(yld_count);
-	P(sched_count);
-	P(sched_goidle);
 #ifdef CONFIG_SMP
 	P64(avg_idle);
 	P64(max_idle_balance_cost);
 #endif
-	P(ttwu_count);
-	P(ttwu_local);
+
+	if (schedstat_enabled()) {
+		P(yld_count);
+		P(sched_count);
+		P(sched_goidle);
+		P(ttwu_count);
+		P(ttwu_local);
+	}
 
 #undef P
 #undef P64
...
@@ -569,38 +574,39 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	nr_switches = p->nvcsw + p->nivcsw;
 
 #ifdef CONFIG_SCHEDSTATS
-	PN(se.statistics.sum_sleep_runtime);
-	PN(se.statistics.wait_start);
-	PN(se.statistics.sleep_start);
-	PN(se.statistics.block_start);
-	PN(se.statistics.sleep_max);
-	PN(se.statistics.block_max);
-	PN(se.statistics.exec_max);
-	PN(se.statistics.slice_max);
-	PN(se.statistics.wait_max);
-	PN(se.statistics.wait_sum);
-	P(se.statistics.wait_count);
-	PN(se.statistics.iowait_sum);
-	P(se.statistics.iowait_count);
 	P(se.nr_migrations);
-	P(se.statistics.nr_migrations_cold);
-	P(se.statistics.nr_failed_migrations_affine);
-	P(se.statistics.nr_failed_migrations_running);
-	P(se.statistics.nr_failed_migrations_hot);
-	P(se.statistics.nr_forced_migrations);
-	P(se.statistics.nr_wakeups);
-	P(se.statistics.nr_wakeups_sync);
-	P(se.statistics.nr_wakeups_migrate);
-	P(se.statistics.nr_wakeups_local);
-	P(se.statistics.nr_wakeups_remote);
-	P(se.statistics.nr_wakeups_affine);
-	P(se.statistics.nr_wakeups_affine_attempts);
-	P(se.statistics.nr_wakeups_passive);
-	P(se.statistics.nr_wakeups_idle);
 
-	{
+	if (schedstat_enabled()) {
 		u64 avg_atom, avg_per_cpu;
 
+		PN(se.statistics.sum_sleep_runtime);
+		PN(se.statistics.wait_start);
+		PN(se.statistics.sleep_start);
+		PN(se.statistics.block_start);
+		PN(se.statistics.sleep_max);
+		PN(se.statistics.block_max);
+		PN(se.statistics.exec_max);
+		PN(se.statistics.slice_max);
+		PN(se.statistics.wait_max);
+		PN(se.statistics.wait_sum);
+		P(se.statistics.wait_count);
+		PN(se.statistics.iowait_sum);
+		P(se.statistics.iowait_count);
+		P(se.statistics.nr_migrations_cold);
+		P(se.statistics.nr_failed_migrations_affine);
+		P(se.statistics.nr_failed_migrations_running);
+		P(se.statistics.nr_failed_migrations_hot);
+		P(se.statistics.nr_forced_migrations);
+		P(se.statistics.nr_wakeups);
+		P(se.statistics.nr_wakeups_sync);
+		P(se.statistics.nr_wakeups_migrate);
+		P(se.statistics.nr_wakeups_local);
+		P(se.statistics.nr_wakeups_remote);
+		P(se.statistics.nr_wakeups_affine);
+		P(se.statistics.nr_wakeups_affine_attempts);
+		P(se.statistics.nr_wakeups_passive);
+		P(se.statistics.nr_wakeups_idle);
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
...
@@ -20,8 +20,8 @@
  * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 
+#include <linux/latencytop.h>
 #include <linux/sched.h>
-#include <linux/latencytop.h>
 #include <linux/cpumask.h>
 #include <linux/cpuidle.h>
 #include <linux/slab.h>
@@ -755,7 +755,9 @@ static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct task_struct *p;
-	u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+	u64 delta;
+
+	delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -776,22 +778,12 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	se->statistics.wait_sum += delta;
 	se->statistics.wait_start = 0;
 }
-#else
-static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-
-static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-#endif
 
 /*
  * Task is being enqueued - update stats:
  */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	/*
 	 * Are we enqueueing a waiting task? (for current tasks
@@ -802,7 +794,7 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	/*
 	 * Mark the end of the wait period if dequeueing a
@@ -810,7 +802,40 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (se != cfs_rq->curr)
 		update_stats_wait_end(cfs_rq, se);
+
+	if (flags & DEQUEUE_SLEEP) {
+		if (entity_is_task(se)) {
+			struct task_struct *tsk = task_of(se);
+
+			if (tsk->state & TASK_INTERRUPTIBLE)
+				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
+			if (tsk->state & TASK_UNINTERRUPTIBLE)
+				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
+		}
+	}
 }
+
+#else
+static inline void
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+}
+#endif
 
 /*
  * We are picking a new current task - update its stats:
@@ -3102,6 +3127,26 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 
+static inline void check_schedstat_required(void)
+{
+#ifdef CONFIG_SCHEDSTATS
+	if (schedstat_enabled())
+		return;
+
+	/* Force schedstat enabled if a dependent tracepoint is active */
+	if (trace_sched_stat_wait_enabled()    ||
+			trace_sched_stat_sleep_enabled()   ||
+			trace_sched_stat_iowait_enabled()  ||
+			trace_sched_stat_blocked_enabled() ||
+			trace_sched_stat_runtime_enabled())  {
+		pr_warn_once("Scheduler tracepoints stat_sleep, stat_iowait, "
+			     "stat_blocked and stat_runtime require the "
+			     "kernel parameter schedstats=enabled or "
+			     "kernel.sched_schedstats=1\n");
+	}
+#endif
+}
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -3122,11 +3167,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
-		enqueue_sleeper(cfs_rq, se);
+		if (schedstat_enabled())
+			enqueue_sleeper(cfs_rq, se);
 	}
 
-	update_stats_enqueue(cfs_rq, se);
-	check_spread(cfs_rq, se);
+	check_schedstat_required();
+	if (schedstat_enabled()) {
+		update_stats_enqueue(cfs_rq, se);
+		check_spread(cfs_rq, se);
+	}
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -3193,19 +3242,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	update_curr(cfs_rq);
 	dequeue_entity_load_avg(cfs_rq, se);
 
-	update_stats_dequeue(cfs_rq, se);
-	if (flags & DEQUEUE_SLEEP) {
-#ifdef CONFIG_SCHEDSTATS
-		if (entity_is_task(se)) {
-			struct task_struct *tsk = task_of(se);
-
-			if (tsk->state & TASK_INTERRUPTIBLE)
-				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
-			if (tsk->state & TASK_UNINTERRUPTIBLE)
-				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
-		}
-#endif
-	}
+	if (schedstat_enabled())
+		update_stats_dequeue(cfs_rq, se, flags);
 
 	clear_buddies(cfs_rq, se);
@@ -3279,7 +3317,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 * a CPU. So account for the time it spent waiting on the
 		 * runqueue.
 		 */
-		update_stats_wait_end(cfs_rq, se);
+		if (schedstat_enabled())
+			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(se, 1);
 	}
@@ -3292,7 +3331,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 * least twice that of our own weight (i.e. dont track it
 	 * when there are only lesser-weight tasks around):
 	 */
-	if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
+	if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
 		se->statistics.slice_max = max(se->statistics.slice_max,
 			se->sum_exec_runtime - se->prev_sum_exec_runtime);
 	}
@@ -3375,9 +3414,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
+	if (schedstat_enabled()) {
+		check_spread(cfs_rq, prev);
+		if (prev->on_rq)
+			update_stats_wait_start(cfs_rq, prev);
+	}
+
 	if (prev->on_rq) {
-		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
...
@@ -1022,6 +1022,7 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
 extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false sched_schedstats;
 
 static inline u64 global_rt_period(void)
 {
...
@@ -29,9 +29,10 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
 	if (rq)
 		rq->rq_sched_info.run_delay += delta;
 }
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
-# define schedstat_set(var, val)	do { var = (val); } while (0)
+# define schedstat_enabled()		static_branch_unlikely(&sched_schedstats)
+# define schedstat_inc(rq, field)	do { if (schedstat_enabled()) { (rq)->field++; } } while (0)
+# define schedstat_add(rq, field, amt)	do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0)
+# define schedstat_set(var, val)	do { if (schedstat_enabled()) { var = (val); } } while (0)
 #else /* !CONFIG_SCHEDSTATS */
 static inline void
 rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
@@ -42,6 +43,7 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
 static inline void
 rq_sched_info_depart(struct rq *rq, unsigned long long delta)
 {}
+# define schedstat_enabled()	0
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
 # define schedstat_set(var, val)	do { } while (0)
...
@@ -350,6 +350,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHEDSTATS
+	{
+		.procname	= "sched_schedstats",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_schedstats,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif /* CONFIG_SCHEDSTATS */
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
@@ -505,7 +516,7 @@ static struct ctl_table kern_table[] = {
 		.data		= &latencytop_enabled,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= sysctl_latencytop,
 	},
 #endif
 #ifdef CONFIG_BLK_DEV_INITRD
...