Commit 0e94682b authored by Suren Baghdasaryan's avatar Suren Baghdasaryan Committed by Linus Torvalds

psi: introduce psi monitor

Psi monitor aims to provide a low-latency short-term pressure detection
mechanism configurable by users.  It allows users to monitor psi metrics
growth and trigger events whenever a metric raises above user-defined
threshold within user-defined time window.

Time window and threshold are both expressed in usecs.  Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state.  While system
is in the stall state psi signal growth is monitored at a rate of 10
times per tracking window.  Min window size is 500ms, therefore the min
monitoring interval is 50ms.  Max window size is 10s with monitoring
interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.comSigned-off-by: default avatarSuren Baghdasaryan <surenb@google.com>
Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
parent 8af0c18a
...@@ -63,6 +63,110 @@ as well as medium and long term trends. The total absolute stall time ...@@ -63,6 +63,110 @@ as well as medium and long term trends. The total absolute stall time
spikes which wouldn't necessarily make a dent in the time averages, spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames. or to average trends over custom time frames.
Monitoring for pressure thresholds
==================================
Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.
A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.
To register a trigger user has to open psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used:
<some|full> <stall amount in us> <time window in us>
For example writing "some 150000 1000000" into /proc/pressure/memory
would add 150ms threshold for partial memory stall measured within
1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add 50ms threshold for full io stall measured within 1sec time window.
Triggers can be set on more than one psi metric and more than one trigger
for the same psi metric can be specified. However for each trigger a separate
file descriptor is required to be able to poll it separately from others,
therefore for each trigger a separate open() syscall should be made even
when opening the same psi interface file.
Monitors activate only when system enters stall state for the monitored
psi metric and deactivates upon exit from the stall state. While system is
in the stall state psi signal growth is monitored at a rate of 10 times per
tracking window.
The kernel accepts window sizes ranging from 500ms to 10s, therefore min
monitoring update interval is 50ms and max is 1s. Min limit is set to
prevent overly frequent polling. Max limit is chosen as a high enough number
after which monitors are most likely not needed and psi averages can be used
instead.
When activated, psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when system is
bouncing in and out of the stall state.
Notifications to the userspace are rate-limited to one per tracking window.
The trigger will de-register when the file descriptor used to define the
trigger is closed.
Userspace monitor usage example
===============================
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>
/*
* Monitor memory partial stall with 1s tracking window size
* and 150ms threshold.
*/
int main() {
const char trig[] = "some 150000 1000000";
struct pollfd fds;
int n;
fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
if (fds.fd < 0) {
printf("/proc/pressure/memory open error: %s\n",
strerror(errno));
return 1;
}
fds.events = POLLPRI;
if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
printf("/proc/pressure/memory write error: %s\n",
strerror(errno));
return 1;
}
printf("waiting for events...\n");
while (1) {
n = poll(&fds, 1, -1);
if (n < 0) {
printf("poll error: %s\n", strerror(errno));
return 1;
}
if (fds.revents & POLLERR) {
printf("got POLLERR, event source is gone\n");
return 0;
}
if (fds.revents & POLLPRI) {
printf("event triggered!\n");
} else {
printf("unknown event received: 0x%x\n", fds.revents);
return 1;
}
}
return 0;
}
Cgroup2 interface Cgroup2 interface
================= =================
...@@ -71,3 +175,6 @@ mounted, pressure stall information is also tracked for tasks grouped ...@@ -71,3 +175,6 @@ mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files. the same as the /proc/pressure/ files.
Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
...@@ -4,6 +4,7 @@ ...@@ -4,6 +4,7 @@
#include <linux/jump_label.h> #include <linux/jump_label.h>
#include <linux/psi_types.h> #include <linux/psi_types.h>
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/poll.h>
struct seq_file; struct seq_file;
struct css_set; struct css_set;
...@@ -26,6 +27,13 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res); ...@@ -26,6 +27,13 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
int psi_cgroup_alloc(struct cgroup *cgrp); int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp); void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to); void cgroup_move_task(struct task_struct *p, struct css_set *to);
struct psi_trigger *psi_trigger_create(struct psi_group *group,
char *buf, size_t nbytes, enum psi_res res);
void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *t);
__poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
poll_table *wait);
#endif #endif
#else /* CONFIG_PSI */ #else /* CONFIG_PSI */
......
#ifndef _LINUX_PSI_TYPES_H #ifndef _LINUX_PSI_TYPES_H
#define _LINUX_PSI_TYPES_H #define _LINUX_PSI_TYPES_H
#include <linux/kthread.h>
#include <linux/seqlock.h> #include <linux/seqlock.h>
#include <linux/types.h> #include <linux/types.h>
#include <linux/kref.h>
#include <linux/wait.h>
#ifdef CONFIG_PSI #ifdef CONFIG_PSI
...@@ -44,6 +47,12 @@ enum psi_states { ...@@ -44,6 +47,12 @@ enum psi_states {
NR_PSI_STATES = 6, NR_PSI_STATES = 6,
}; };
enum psi_aggregators {
PSI_AVGS = 0,
PSI_POLL,
NR_PSI_AGGREGATORS,
};
struct psi_group_cpu { struct psi_group_cpu {
/* 1st cacheline updated by the scheduler */ /* 1st cacheline updated by the scheduler */
...@@ -65,7 +74,55 @@ struct psi_group_cpu { ...@@ -65,7 +74,55 @@ struct psi_group_cpu {
/* 2nd cacheline updated by the aggregator */ /* 2nd cacheline updated by the aggregator */
/* Delta detection against the sampling buckets */ /* Delta detection against the sampling buckets */
u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp; u32 times_prev[NR_PSI_AGGREGATORS][NR_PSI_STATES]
____cacheline_aligned_in_smp;
};
/* PSI growth tracking window */
struct psi_window {
/* Window size in ns */
u64 size;
/* Start time of the current window in ns */
u64 start_time;
/* Value at the start of the window */
u64 start_value;
/* Value growth in the previous window */
u64 prev_growth;
};
struct psi_trigger {
/* PSI state being monitored by the trigger */
enum psi_states state;
/* User-spacified threshold in ns */
u64 threshold;
/* List node inside triggers list */
struct list_head node;
/* Backpointer needed during trigger destruction */
struct psi_group *group;
/* Wait queue for polling */
wait_queue_head_t event_wait;
/* Pending event flag */
int event;
/* Tracking window */
struct psi_window win;
/*
* Time last event was generated. Used for rate-limiting
* events to one per window
*/
u64 last_event_time;
/* Refcounting to prevent premature destruction */
struct kref refcount;
}; };
struct psi_group { struct psi_group {
...@@ -79,11 +136,32 @@ struct psi_group { ...@@ -79,11 +136,32 @@ struct psi_group {
u64 avg_total[NR_PSI_STATES - 1]; u64 avg_total[NR_PSI_STATES - 1];
u64 avg_last_update; u64 avg_last_update;
u64 avg_next_update; u64 avg_next_update;
/* Aggregator work control */
struct delayed_work avgs_work; struct delayed_work avgs_work;
/* Total stall times and sampled pressure averages */ /* Total stall times and sampled pressure averages */
u64 total[NR_PSI_STATES - 1]; u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1];
unsigned long avg[NR_PSI_STATES - 1][3]; unsigned long avg[NR_PSI_STATES - 1][3];
/* Monitor work control */
atomic_t poll_scheduled;
struct kthread_worker __rcu *poll_kworker;
struct kthread_delayed_work poll_work;
/* Protects data used by the monitor */
struct mutex trigger_lock;
/* Configured polling triggers */
struct list_head triggers;
u32 nr_triggers[NR_PSI_STATES - 1];
u32 poll_states;
u64 poll_min_period;
/* Total stall times at the start of monitor activation */
u64 polling_total[NR_PSI_STATES - 1];
u64 polling_next_update;
u64 polling_until;
}; };
#else /* CONFIG_PSI */ #else /* CONFIG_PSI */
......
...@@ -3550,7 +3550,65 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) ...@@ -3550,7 +3550,65 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
{ {
return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU); return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
} }
#endif
static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, enum psi_res res)
{
struct psi_trigger *new;
struct cgroup *cgrp;
cgrp = cgroup_kn_lock_live(of->kn, false);
if (!cgrp)
return -ENODEV;
cgroup_get(cgrp);
cgroup_kn_unlock(of->kn);
new = psi_trigger_create(&cgrp->psi, buf, nbytes, res);
if (IS_ERR(new)) {
cgroup_put(cgrp);
return PTR_ERR(new);
}
psi_trigger_replace(&of->priv, new);
cgroup_put(cgrp);
return nbytes;
}
static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
}
static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
}
static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
return psi_trigger_poll(&of->priv, of->file, pt);
}
static void cgroup_pressure_release(struct kernfs_open_file *of)
{
psi_trigger_replace(&of->priv, NULL);
}
#endif /* CONFIG_PSI */
static int cgroup_freeze_show(struct seq_file *seq, void *v) static int cgroup_freeze_show(struct seq_file *seq, void *v)
{ {
...@@ -4745,18 +4803,27 @@ static struct cftype cgroup_base_files[] = { ...@@ -4745,18 +4803,27 @@ static struct cftype cgroup_base_files[] = {
.name = "io.pressure", .name = "io.pressure",
.flags = CFTYPE_NOT_ON_ROOT, .flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_io_pressure_show, .seq_show = cgroup_io_pressure_show,
.write = cgroup_io_pressure_write,
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
}, },
{ {
.name = "memory.pressure", .name = "memory.pressure",
.flags = CFTYPE_NOT_ON_ROOT, .flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_memory_pressure_show, .seq_show = cgroup_memory_pressure_show,
.write = cgroup_memory_pressure_write,
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
}, },
{ {
.name = "cpu.pressure", .name = "cpu.pressure",
.flags = CFTYPE_NOT_ON_ROOT, .flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_cpu_pressure_show, .seq_show = cgroup_cpu_pressure_show,
.write = cgroup_cpu_pressure_write,
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
}, },
#endif #endif /* CONFIG_PSI */
{ } /* terminate */ { } /* terminate */
}; };
......
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment