sched/debug: Make schedstats a runtime tunable that is disabled by default
schedstats is very useful during debugging and performance tuning but it
incurs overhead to calculate the stats. As such, even though it can be
disabled at build time, it is often enabled as the information is useful.
This patch adds a kernel command-line and sysctl tunable to enable or
disable schedstats on demand (when it's built in). It is disabled
by default as someone who knows they need it can also learn to enable
it when necessary.
The benefits are dependent on how scheduler-intensive the workload is.
If it is then the patch reduces the number of cycles spent calculating
the stats with a small benefit from reducing the cache footprint of the
scheduler.
These measurements were taken from a 48-core 2-socket
machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
single socket machine 8-core machine with Intel i7-3770 processors.
netperf-tcp
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Hmean 64 560.45 ( 0.00%) 575.98 ( 2.77%)
Hmean 128 766.66 ( 0.00%) 795.79 ( 3.80%)
Hmean 256 950.51 ( 0.00%) 981.50 ( 3.26%)
Hmean 1024 1433.25 ( 0.00%) 1466.51 ( 2.32%)
Hmean 2048 2810.54 ( 0.00%) 2879.75 ( 2.46%)
Hmean 3312 4618.18 ( 0.00%) 4682.09 ( 1.38%)
Hmean 4096 5306.42 ( 0.00%) 5346.39 ( 0.75%)
Hmean 8192 10581.44 ( 0.00%) 10698.15 ( 1.10%)
Hmean 16384 18857.70 ( 0.00%) 18937.61 ( 0.42%)
Small gains here, UDP_STREAM showed nothing intresting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.
tbench4
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Hmean mb/sec-1 500.85 ( 0.00%) 522.43 ( 4.31%)
Hmean mb/sec-2 984.66 ( 0.00%) 1018.19 ( 3.41%)
Hmean mb/sec-4 1827.91 ( 0.00%) 1847.78 ( 1.09%)
Hmean mb/sec-8 3561.36 ( 0.00%) 3611.28 ( 1.40%)
Hmean mb/sec-16 5824.52 ( 0.00%) 5929.03 ( 1.79%)
Hmean mb/sec-32 10943.10 ( 0.00%) 10802.83 ( -1.28%)
Hmean mb/sec-64 15950.81 ( 0.00%) 16211.31 ( 1.63%)
Hmean mb/sec-128 15302.17 ( 0.00%) 15445.11 ( 0.93%)
Hmean mb/sec-256 14866.18 ( 0.00%) 15088.73 ( 1.50%)
Hmean mb/sec-512 15223.31 ( 0.00%) 15373.69 ( 0.99%)
Hmean mb/sec-1024 14574.25 ( 0.00%) 14598.02 ( 0.16%)
Hmean mb/sec-2048 13569.02 ( 0.00%) 13733.86 ( 1.21%)
Hmean mb/sec-3072 12865.98 ( 0.00%) 13209.23 ( 2.67%)
Small gains of 2-4% at low thread counts and otherwise flat. The
gains on the 8-core machine were slightly different
tbench4 on 8-core i7-3770 single socket machine
Hmean mb/sec-1 442.59 ( 0.00%) 448.73 ( 1.39%)
Hmean mb/sec-2 796.68 ( 0.00%) 794.39 ( -0.29%)
Hmean mb/sec-4 1322.52 ( 0.00%) 1343.66 ( 1.60%)
Hmean mb/sec-8 2611.65 ( 0.00%) 2694.86 ( 3.19%)
Hmean mb/sec-16 2537.07 ( 0.00%) 2609.34 ( 2.85%)
Hmean mb/sec-32 2506.02 ( 0.00%) 2578.18 ( 2.88%)
Hmean mb/sec-64 2511.06 ( 0.00%) 2569.16 ( 2.31%)
Hmean mb/sec-128 2313.38 ( 0.00%) 2395.50 ( 3.55%)
Hmean mb/sec-256 2110.04 ( 0.00%) 2177.45 ( 3.19%)
Hmean mb/sec-512 2072.51 ( 0.00%) 2053.97 ( -0.89%)
In constract, this shows a relatively steady 2-3% gain at higher thread
counts. Due to the nature of the patch and the type of workload, it's
not a surprise that the result will depend on the CPU used.
hackbench-pipes
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Amean 1 0.0637 ( 0.00%) 0.0660 ( -3.59%)
Amean 4 0.1229 ( 0.00%) 0.1181 ( 3.84%)
Amean 7 0.1921 ( 0.00%) 0.1911 ( 0.52%)
Amean 12 0.3117 ( 0.00%) 0.2923 ( 6.23%)
Amean 21 0.4050 ( 0.00%) 0.3899 ( 3.74%)
Amean 30 0.4586 ( 0.00%) 0.4433 ( 3.33%)
Amean 48 0.5910 ( 0.00%) 0.5694 ( 3.65%)
Amean 79 0.8663 ( 0.00%) 0.8626 ( 0.43%)
Amean 110 1.1543 ( 0.00%) 1.1517 ( 0.22%)
Amean 141 1.4457 ( 0.00%) 1.4290 ( 1.16%)
Amean 172 1.7090 ( 0.00%) 1.6924 ( 0.97%)
Amean 192 1.9126 ( 0.00%) 1.9089 ( 0.19%)
Some small gains and losses and while the variance data is not included,
it's close to the noise. The UMA machine did not show anything particularly
different
pipetest
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v2r2
Min Time 4.13 ( 0.00%) 3.99 ( 3.39%)
1st-qrtle Time 4.38 ( 0.00%) 4.27 ( 2.51%)
2nd-qrtle Time 4.46 ( 0.00%) 4.39 ( 1.57%)
3rd-qrtle Time 4.56 ( 0.00%) 4.51 ( 1.10%)
Max-90% Time 4.67 ( 0.00%) 4.60 ( 1.50%)
Max-93% Time 4.71 ( 0.00%) 4.65 ( 1.27%)
Max-95% Time 4.74 ( 0.00%) 4.71 ( 0.63%)
Max-99% Time 4.88 ( 0.00%) 4.79 ( 1.84%)
Max Time 4.93 ( 0.00%) 4.83 ( 2.03%)
Mean Time 4.48 ( 0.00%) 4.39 ( 1.91%)
Best99%Mean Time 4.47 ( 0.00%) 4.39 ( 1.91%)
Best95%Mean Time 4.46 ( 0.00%) 4.38 ( 1.93%)
Best90%Mean Time 4.45 ( 0.00%) 4.36 ( 1.98%)
Best50%Mean Time 4.36 ( 0.00%) 4.25 ( 2.49%)
Best10%Mean Time 4.23 ( 0.00%) 4.10 ( 3.13%)
Best5%Mean Time 4.19 ( 0.00%) 4.06 ( 3.20%)
Best1%Mean Time 4.13 ( 0.00%) 4.00 ( 3.39%)
Small improvement and similar gains were seen on the UMA machine.
The gain is small but it stands to reason that doing less work in the
scheduler is a good thing. The downside is that the lack of schedstats and
tracepoints may be surprising to experts doing performance analysis until
they find the existence of the schedstats= parameter or schedstats sysctl.
It will be automatically activated for latencytop and sleep profiling to
alleviate the problem. For tracepoints, there is a simple warning as it's
not safe to activate schedstats in the context when it's known the tracepoint
may be wanted but is unavailable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 551ecf0..ed47b60 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3528,6 +3528,11 @@
sched_debug [KNL] Enables verbose scheduler debug messages.
+ schedstats= [KNL,X86] Enable or disable scheduled statistics.
+ Allowed values are enable and disable. This feature
+ incurs a small amount of overhead in the scheduler
+ but is useful for debugging and performance tuning.
+
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a93b414..87119dc 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -760,6 +760,14 @@
==============================================================
+sched_schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
sg-big-buff:
This file shows the size of the generic SCSI (sg) buffer.
diff --git a/include/linux/latencytop.h b/include/linux/latencytop.h
index e23121f..59ccab2 100644
--- a/include/linux/latencytop.h
+++ b/include/linux/latencytop.h
@@ -37,6 +37,9 @@
void clear_all_latency_tracing(struct task_struct *p);
+extern int sysctl_latencytop(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
#else
static inline void
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..a292c4b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -920,6 +920,10 @@
#endif
}
+#ifdef CONFIG_SCHEDSTATS
+void force_schedstat_enabled(void);
+#endif
+
enum cpu_idle_type {
CPU_IDLE,
CPU_NOT_IDLE,
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..4f080ab 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -95,4 +95,8 @@
void __user *buffer, size_t *lenp,
loff_t *ppos);
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
#endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/latencytop.c b/kernel/latencytop.c
index a028127..b5c30d9 100644
--- a/kernel/latencytop.c
+++ b/kernel/latencytop.c
@@ -47,12 +47,12 @@
* of times)
*/
-#include <linux/latencytop.h>
#include <linux/kallsyms.h>
#include <linux/seq_file.h>
#include <linux/notifier.h>
#include <linux/spinlock.h>
#include <linux/proc_fs.h>
+#include <linux/latencytop.h>
#include <linux/export.h>
#include <linux/sched.h>
#include <linux/list.h>
@@ -289,4 +289,16 @@
proc_create("latency_stats", 0644, NULL, &lstats_fops);
return 0;
}
+
+int sysctl_latencytop(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ err = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (latencytop_enabled)
+ force_schedstat_enabled();
+
+ return err;
+}
device_initcall(init_lstats_procfs);
diff --git a/kernel/profile.c b/kernel/profile.c
index 99513e11..5136969 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -59,6 +59,7 @@
if (!strncmp(str, sleepstr, strlen(sleepstr))) {
#ifdef CONFIG_SCHEDSTATS
+ force_schedstat_enabled();
prof_on = SLEEP_PROFILING;
if (str[strlen(sleepstr)] == ',')
str += strlen(sleepstr) + 1;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24fcdbf..7e548bd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2093,7 +2093,8 @@
ttwu_queue(p, cpu);
stat:
- ttwu_stat(p, cpu, wake_flags);
+ if (schedstat_enabled())
+ ttwu_stat(p, cpu, wake_flags);
out:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -2141,7 +2142,8 @@
ttwu_activate(rq, p, ENQUEUE_WAKEUP);
ttwu_do_wakeup(rq, p, 0);
- ttwu_stat(p, smp_processor_id(), 0);
+ if (schedstat_enabled())
+ ttwu_stat(p, smp_processor_id(), 0);
out:
raw_spin_unlock(&p->pi_lock);
}
@@ -2210,6 +2212,7 @@
#endif
#ifdef CONFIG_SCHEDSTATS
+ /* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -2281,6 +2284,69 @@
#endif
#endif
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+static void set_schedstats(bool enabled)
+{
+ if (enabled)
+ static_branch_enable(&sched_schedstats);
+ else
+ static_branch_disable(&sched_schedstats);
+}
+
+void force_schedstat_enabled(void)
+{
+ if (!schedstat_enabled()) {
+ pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
+ static_branch_enable(&sched_schedstats);
+ }
+}
+
+static int __init setup_schedstats(char *str)
+{
+ int ret = 0;
+ if (!str)
+ goto out;
+
+ if (!strcmp(str, "enable")) {
+ set_schedstats(true);
+ ret = 1;
+ } else if (!strcmp(str, "disable")) {
+ set_schedstats(false);
+ ret = 1;
+ }
+out:
+ if (!ret)
+ pr_warn("Unable to parse schedstats=\n");
+
+ return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table t;
+ int err;
+ int state = static_branch_likely(&sched_schedstats);
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ t = *table;
+ t.data = &state;
+ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+ if (err < 0)
+ return err;
+ if (write)
+ set_schedstats(state);
+ return err;
+}
+#endif
+#endif
+
/*
* fork()/clone()-time setup:
*/
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6415117..7cfa87b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -75,16 +75,18 @@
PN(se->vruntime);
PN(se->sum_exec_runtime);
#ifdef CONFIG_SCHEDSTATS
- PN(se->statistics.wait_start);
- PN(se->statistics.sleep_start);
- PN(se->statistics.block_start);
- PN(se->statistics.sleep_max);
- PN(se->statistics.block_max);
- PN(se->statistics.exec_max);
- PN(se->statistics.slice_max);
- PN(se->statistics.wait_max);
- PN(se->statistics.wait_sum);
- P(se->statistics.wait_count);
+ if (schedstat_enabled()) {
+ PN(se->statistics.wait_start);
+ PN(se->statistics.sleep_start);
+ PN(se->statistics.block_start);
+ PN(se->statistics.sleep_max);
+ PN(se->statistics.block_max);
+ PN(se->statistics.exec_max);
+ PN(se->statistics.slice_max);
+ PN(se->statistics.wait_max);
+ PN(se->statistics.wait_sum);
+ P(se->statistics.wait_count);
+ }
#endif
P(se->load.weight);
#ifdef CONFIG_SMP
@@ -122,10 +124,12 @@
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
- SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
- SPLIT_NS(p->se.statistics.wait_sum),
- SPLIT_NS(p->se.sum_exec_runtime),
- SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+ if (schedstat_enabled()) {
+ SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+ SPLIT_NS(p->se.statistics.wait_sum),
+ SPLIT_NS(p->se.sum_exec_runtime),
+ SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+ }
#else
SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
0LL, 0L,
@@ -313,17 +317,18 @@
#define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);
#define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);
- P(yld_count);
-
- P(sched_count);
- P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
P64(max_idle_balance_cost);
#endif
- P(ttwu_count);
- P(ttwu_local);
+ if (schedstat_enabled()) {
+ P(yld_count);
+ P(sched_count);
+ P(sched_goidle);
+ P(ttwu_count);
+ P(ttwu_local);
+ }
#undef P
#undef P64
@@ -569,38 +574,39 @@
nr_switches = p->nvcsw + p->nivcsw;
#ifdef CONFIG_SCHEDSTATS
- PN(se.statistics.sum_sleep_runtime);
- PN(se.statistics.wait_start);
- PN(se.statistics.sleep_start);
- PN(se.statistics.block_start);
- PN(se.statistics.sleep_max);
- PN(se.statistics.block_max);
- PN(se.statistics.exec_max);
- PN(se.statistics.slice_max);
- PN(se.statistics.wait_max);
- PN(se.statistics.wait_sum);
- P(se.statistics.wait_count);
- PN(se.statistics.iowait_sum);
- P(se.statistics.iowait_count);
P(se.nr_migrations);
- P(se.statistics.nr_migrations_cold);
- P(se.statistics.nr_failed_migrations_affine);
- P(se.statistics.nr_failed_migrations_running);
- P(se.statistics.nr_failed_migrations_hot);
- P(se.statistics.nr_forced_migrations);
- P(se.statistics.nr_wakeups);
- P(se.statistics.nr_wakeups_sync);
- P(se.statistics.nr_wakeups_migrate);
- P(se.statistics.nr_wakeups_local);
- P(se.statistics.nr_wakeups_remote);
- P(se.statistics.nr_wakeups_affine);
- P(se.statistics.nr_wakeups_affine_attempts);
- P(se.statistics.nr_wakeups_passive);
- P(se.statistics.nr_wakeups_idle);
- {
+ if (schedstat_enabled()) {
u64 avg_atom, avg_per_cpu;
+ PN(se.statistics.sum_sleep_runtime);
+ PN(se.statistics.wait_start);
+ PN(se.statistics.sleep_start);
+ PN(se.statistics.block_start);
+ PN(se.statistics.sleep_max);
+ PN(se.statistics.block_max);
+ PN(se.statistics.exec_max);
+ PN(se.statistics.slice_max);
+ PN(se.statistics.wait_max);
+ PN(se.statistics.wait_sum);
+ P(se.statistics.wait_count);
+ PN(se.statistics.iowait_sum);
+ P(se.statistics.iowait_count);
+ P(se.statistics.nr_migrations_cold);
+ P(se.statistics.nr_failed_migrations_affine);
+ P(se.statistics.nr_failed_migrations_running);
+ P(se.statistics.nr_failed_migrations_hot);
+ P(se.statistics.nr_forced_migrations);
+ P(se.statistics.nr_wakeups);
+ P(se.statistics.nr_wakeups_sync);
+ P(se.statistics.nr_wakeups_migrate);
+ P(se.statistics.nr_wakeups_local);
+ P(se.statistics.nr_wakeups_remote);
+ P(se.statistics.nr_wakeups_affine);
+ P(se.statistics.nr_wakeups_affine_attempts);
+ P(se.statistics.nr_wakeups_passive);
+ P(se.statistics.nr_wakeups_idle);
+
avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
avg_atom = div64_ul(avg_atom, nr_switches);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b7d4b..51a4550 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -20,8 +20,8 @@
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
*/
-#include <linux/latencytop.h>
#include <linux/sched.h>
+#include <linux/latencytop.h>
#include <linux/cpumask.h>
#include <linux/cpuidle.h>
#include <linux/slab.h>
@@ -755,7 +755,9 @@
update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct task_struct *p;
- u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+ u64 delta;
+
+ delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
if (entity_is_task(se)) {
p = task_of(se);
@@ -776,22 +778,12 @@
se->statistics.wait_sum += delta;
se->statistics.wait_start = 0;
}
-#else
-static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-
-static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-#endif
/*
* Task is being enqueued - update stats:
*/
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/*
* Are we enqueueing a waiting task? (for current tasks
@@ -802,7 +794,7 @@
}
static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Mark the end of the wait period if dequeueing a
@@ -810,7 +802,40 @@
*/
if (se != cfs_rq->curr)
update_stats_wait_end(cfs_rq, se);
+
+ if (flags & DEQUEUE_SLEEP) {
+ if (entity_is_task(se)) {
+ struct task_struct *tsk = task_of(se);
+
+ if (tsk->state & TASK_INTERRUPTIBLE)
+ se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
+ if (tsk->state & TASK_UNINTERRUPTIBLE)
+ se->statistics.block_start = rq_clock(rq_of(cfs_rq));
+ }
+ }
+
}
+#else
+static inline void
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+}
+#endif
/*
* We are picking a new current task - update its stats:
@@ -3102,6 +3127,26 @@
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+static inline void check_schedstat_required(void)
+{
+#ifdef CONFIG_SCHEDSTATS
+ if (schedstat_enabled())
+ return;
+
+ /* Force schedstat enabled if a dependent tracepoint is active */
+ if (trace_sched_stat_wait_enabled() ||
+ trace_sched_stat_sleep_enabled() ||
+ trace_sched_stat_iowait_enabled() ||
+ trace_sched_stat_blocked_enabled() ||
+ trace_sched_stat_runtime_enabled()) {
+ pr_warn_once("Scheduler tracepoints stat_sleep, stat_iowait, "
+ "stat_blocked and stat_runtime require the "
+ "kernel parameter schedstats=enabled or "
+ "kernel.sched_schedstats=1\n");
+ }
+#endif
+}
+
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -3122,11 +3167,15 @@
if (flags & ENQUEUE_WAKEUP) {
place_entity(cfs_rq, se, 0);
- enqueue_sleeper(cfs_rq, se);
+ if (schedstat_enabled())
+ enqueue_sleeper(cfs_rq, se);
}
- update_stats_enqueue(cfs_rq, se);
- check_spread(cfs_rq, se);
+ check_schedstat_required();
+ if (schedstat_enabled()) {
+ update_stats_enqueue(cfs_rq, se);
+ check_spread(cfs_rq, se);
+ }
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;
@@ -3193,19 +3242,8 @@
update_curr(cfs_rq);
dequeue_entity_load_avg(cfs_rq, se);
- update_stats_dequeue(cfs_rq, se);
- if (flags & DEQUEUE_SLEEP) {
-#ifdef CONFIG_SCHEDSTATS
- if (entity_is_task(se)) {
- struct task_struct *tsk = task_of(se);
-
- if (tsk->state & TASK_INTERRUPTIBLE)
- se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
- if (tsk->state & TASK_UNINTERRUPTIBLE)
- se->statistics.block_start = rq_clock(rq_of(cfs_rq));
- }
-#endif
- }
+ if (schedstat_enabled())
+ update_stats_dequeue(cfs_rq, se, flags);
clear_buddies(cfs_rq, se);
@@ -3279,7 +3317,8 @@
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
- update_stats_wait_end(cfs_rq, se);
+ if (schedstat_enabled())
+ update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
update_load_avg(se, 1);
}
@@ -3292,7 +3331,7 @@
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
- if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
+ if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
se->statistics.slice_max = max(se->statistics.slice_max,
se->sum_exec_runtime - se->prev_sum_exec_runtime);
}
@@ -3375,9 +3414,13 @@
/* throttle cfs_rqs exceeding runtime */
check_cfs_rq_runtime(cfs_rq);
- check_spread(cfs_rq, prev);
+ if (schedstat_enabled()) {
+ check_spread(cfs_rq, prev);
+ if (prev->on_rq)
+ update_stats_wait_start(cfs_rq, prev);
+ }
+
if (prev->on_rq) {
- update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..1d58387 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,7 @@
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false sched_schedstats;
static inline u64 global_rt_period(void)
{
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index b0fbc76..70b3b6a 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -29,9 +29,10 @@
if (rq)
rq->rq_sched_info.run_delay += delta;
}
-# define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
-# define schedstat_set(var, val) do { var = (val); } while (0)
+# define schedstat_enabled() static_branch_unlikely(&sched_schedstats)
+# define schedstat_inc(rq, field) do { if (schedstat_enabled()) { (rq)->field++; } } while (0)
+# define schedstat_add(rq, field, amt) do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0)
+# define schedstat_set(var, val) do { if (schedstat_enabled()) { var = (val); } } while (0)
#else /* !CONFIG_SCHEDSTATS */
static inline void
rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
@@ -42,6 +43,7 @@
static inline void
rq_sched_info_depart(struct rq *rq, unsigned long long delta)
{}
+# define schedstat_enabled() 0
# define schedstat_inc(rq, field) do { } while (0)
# define schedstat_add(rq, field, amt) do { } while (0)
# define schedstat_set(var, val) do { } while (0)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd..f5102fab 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -350,6 +350,17 @@
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHEDSTATS
+ {
+ .procname = "sched_schedstats",
+ .data = NULL,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_schedstats,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif /* CONFIG_SCHEDSTATS */
#endif /* CONFIG_SMP */
#ifdef CONFIG_NUMA_BALANCING
{
@@ -505,7 +516,7 @@
.data = &latencytop_enabled,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = sysctl_latencytop,
},
#endif
#ifdef CONFIG_BLK_DEV_INITRD