Commit 92c1d652 authored by Linus Torvalds

Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "Documentation updates and the addition of cgroup_parse_float() which
  will be used by new controllers including blk-iocost"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: convert docs to ReST and rename to *.rst
  cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
  cgroup: add cgroup_parse_float()
parents df2a40f5 99c8b231
@@ -705,6 +705,12 @@ Conventions
 informational files on the root cgroup which end up showing global
 information available elsewhere shouldn't exist.
+- The default time unit is microseconds. If a different unit is ever
+used, an explicit unit suffix must be present.
+- A parts-per quantity should use a percentage decimal with at least
+two digit fractional part - e.g. 13.40.
 - If a controller implements weight based resource distribution, its
 interface file should be named "weight" and have the range [1,
 10000] with 100 as the default. The values are chosen to allow
......
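The parts-per convention above (a percentage decimal with a two-digit fraction, such as 13.40) maps naturally onto fixed-point integers. The following is a minimal Python sketch of that idea, not the kernel's cgroup_parse_float(); the function names and the `dec_shift` parameter are assumptions made for illustration.

```python
import re

# Plain decimal: optional sign, digits, optional fractional digits.
_DECIMAL = re.compile(r"([+-]?)(\d+)(?:\.(\d+))?", re.ASCII)


def parse_fixed_point(text: str, dec_shift: int = 2) -> int:
    """Parse "13.40" into the integer 1340 (value scaled by 10**dec_shift).

    Hypothetical helper: extra fractional digits are truncated, missing
    ones are padded with zeros.
    """
    match = _DECIMAL.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a plain decimal number: {text!r}")
    sign, whole, frac = match.groups()
    frac = ((frac or "") + "0" * dec_shift)[:dec_shift]
    magnitude = int(whole + frac)
    return -magnitude if sign == "-" else magnitude


def format_fixed_point(value: int, dec_shift: int = 2) -> str:
    """Render a scaled integer back as a decimal with dec_shift digits."""
    scale = 10 ** dec_shift
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), scale)
    return f"{sign}{whole}.{frac:0{dec_shift}d}"
```

Keeping the value as a scaled integer avoids floating-point arithmetic while still round-tripping the two-digit fraction.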
@@ -241,7 +241,7 @@ Guest mitigation mechanisms
 For further information about confining guests to a single or to a group
 of cores consult the cpusets documentation:
-https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
+https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.rst
 .. _interrupt_isolation:
......
@@ -4084,7 +4084,7 @@
 relax_domain_level=
 [KNL, SMP] Set scheduler's default relax_domain_level.
-See Documentation/cgroup-v1/cpusets.txt.
+See Documentation/cgroup-v1/cpusets.rst.
 reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory
 Format: <base1>,<size1>[,<base2>,<size2>,...]
@@ -4594,7 +4594,7 @@
 swapaccount=[0|1]
 [KNL] Enable accounting of swap in memory resource
 controller if no parameter or 1 is given or disable
-it if 0 is given (See Documentation/cgroup-v1/memory.txt)
+it if 0 is given (See Documentation/cgroup-v1/memory.rst)
 swiotlb= [ARM,IA-64,PPC,MIPS,X86]
 Format: { <int> | force | noforce }
......
@@ -15,7 +15,7 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy
 support.
 Memory policies should not be confused with cpusets
-(``Documentation/cgroup-v1/cpusets.txt``)
+(``Documentation/cgroup-v1/cpusets.rst``)
 which is an administrative mechanism for restricting the nodes from which
 memory may be allocated by a set of processes. Memory policies are a
 programming interface that a NUMA-aware application can take advantage of. When
......
@@ -539,7 +539,7 @@ As for cgroups-v1 (blkio controller), the exact set of stat files
 created, and kept up-to-date by bfq, depends on whether
 CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
 the stat files documented in
-Documentation/cgroup-v1/blkio-controller.txt. If, instead,
+Documentation/cgroup-v1/blkio-controller.rst. If, instead,
 CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
 blkio.bfq.io_service_bytes
 blkio.bfq.io_service_bytes_recursive
......
-Block IO Controller
-===================
+===================
+Block IO Controller
+===================
 Overview
 ========
 cgroup subsys "blkio" implements the block io controller. There seems to be
@@ -17,24 +19,27 @@ HOWTO
 =====
 Throttling/Upper Limit policy
 -----------------------------
-- Enable Block IO controller
+- Enable Block IO controller::
 CONFIG_BLK_CGROUP=y
-- Enable throttling in block layer
+- Enable throttling in block layer::
 CONFIG_BLK_DEV_THROTTLING=y
-- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
+- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
 mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
 - Specify a bandwidth rate on particular device for root group. The format
-for policy is "<major>:<minor> <bytes_per_second>".
+for policy is "<major>:<minor> <bytes_per_second>"::
 echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
 Above will put a limit of 1MB/second on reads happening for root group
 on device having major/minor number 8:16.
-- Run dd to read a file and see if rate is throttled to 1MB/s or not.
+- Run dd to read a file and see if rate is throttled to 1MB/s or not::
 # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
 1024+0 records in
@@ -51,7 +56,7 @@ throttling's hierarchy support is enabled iff "sane_behavior" is
 enabled from cgroup side, which currently is a development option and
 not publicly available.
-If somebody created a hierarchy like as follows.
+If somebody created a hierarchy like as follows::
 root
 / \
@@ -66,7 +71,7 @@ directly generated by tasks in that cgroup.
 Throttling without "sane_behavior" enabled from cgroup side will
 practically treat all groups at same level as if it looks like the
-following.
+following::
 pivot
 / / \ \
@@ -99,23 +104,27 @@ Proportional weight policy files
 These rules override the default value of group weight as specified
 by blkio.weight.
-Following is the format.
+Following is the format::
 # echo dev_maj:dev_minor weight > blkio.weight_device
-Configure weight=300 on /dev/sdb (8:16) in this cgroup
+Configure weight=300 on /dev/sdb (8:16) in this cgroup::
 # echo 8:16 300 > blkio.weight_device
 # cat blkio.weight_device
 dev weight
 8:16 300
-Configure weight=500 on /dev/sda (8:0) in this cgroup
+Configure weight=500 on /dev/sda (8:0) in this cgroup::
 # echo 8:0 500 > blkio.weight_device
 # cat blkio.weight_device
 dev weight
 8:0 500
 8:16 300
-Remove specific weight for /dev/sda in this cgroup
+Remove specific weight for /dev/sda in this cgroup::
 # echo 8:0 0 > blkio.weight_device
 # cat blkio.weight_device
 dev weight
@@ -244,28 +253,28 @@ Throttling/Upper limit policy files
 - blkio.throttle.read_bps_device
 - Specifies upper limit on READ rate from the device. IO rate is
 specified in bytes per second. Rules are per device. Following is
-the format.
+the format::
 echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
 - blkio.throttle.write_bps_device
 - Specifies upper limit on WRITE rate to the device. IO rate is
 specified in bytes per second. Rules are per device. Following is
-the format.
+the format::
 echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
 - blkio.throttle.read_iops_device
 - Specifies upper limit on READ rate from the device. IO rate is
 specified in IO per second. Rules are per device. Following is
-the format.
+the format::
 echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
 - blkio.throttle.write_iops_device
 - Specifies upper limit on WRITE rate to the device. IO rate is
 specified in io per second. Rules are per device. Following is
-the format.
+the format::
 echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
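The throttle files above all take rules of the form "<major>:<minor> <rate>". As a rough illustration, the Python sketch below builds such a rule string, optionally looking up the major and minor numbers of a block device node. The helper names are hypothetical, and writing the resulting string to a blkio.throttle.* file is left to the caller.

```python
import os
import stat


def throttle_rule(major: int, minor: int, rate: int) -> str:
    """Format a rule as "<major>:<minor> <rate>" (bytes or IOs per second)."""
    if min(major, minor, rate) < 0:
        raise ValueError("major, minor and rate must be non-negative")
    return f"{major}:{minor} {rate}"


def throttle_rule_for(device_path: str, rate: int) -> str:
    """Build a rule for a block device node such as /dev/sdb.

    Hypothetical convenience wrapper: it only reads the device numbers
    from the node and does not touch any cgroup file.
    """
    info = os.stat(device_path)
    if not stat.S_ISBLK(info.st_mode):
        raise ValueError(f"{device_path} is not a block device")
    return throttle_rule(os.major(info.st_rdev), os.minor(info.st_rdev), rate)
```

For example, `throttle_rule(8, 16, 1024 * 1024)` gives the same "8:16 1048576" string used in the HOWTO for a 1MB/second read limit.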
......
+=========================
 CPU Accounting Controller
--------------------------
+=========================
 The CPU accounting controller is used to group tasks using cgroups and
 account the CPU usage of these groups of tasks.
@@ -8,9 +9,9 @@ The CPU accounting controller supports multi-hierarchy groups. An accounting
 group accumulates the CPU usage of all of its child groups and the tasks
 directly present in its group.
-Accounting groups can be created by first mounting the cgroup filesystem.
+Accounting groups can be created by first mounting the cgroup filesystem::
 # mount -t cgroup -ocpuacct none /sys/fs/cgroup
 With the above step, the initial or the parent accounting group becomes
 visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
@@ -19,11 +20,11 @@ the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
 by this group which is essentially the CPU time obtained by all the tasks
 in the system.
-New accounting groups can be created under the parent group /sys/fs/cgroup.
+New accounting groups can be created under the parent group /sys/fs/cgroup::
 # cd /sys/fs/cgroup
 # mkdir g1
 # echo $$ > g1/tasks
 The above steps create a new group g1 and move the current shell
 process (bash) into it. CPU time consumed by this bash and its children
......
+===========================
 Device Whitelist Controller
+===========================
-1. Description:
+1. Description
+==============
 Implement a cgroup to track and enforce open and mknod restrictions
 on device files. A device cgroup associates a device access
@@ -16,24 +19,26 @@ devices from the whitelist or add new entries. A child cgroup can
 never receive a device access which is denied by its parent.
 2. User Interface
+=================
 An entry is added using devices.allow, and removed using
-devices.deny. For instance
+devices.deny. For instance::
 echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
 allows cgroup 1 to read and mknod the device usually known as
-/dev/null. Doing
+/dev/null. Doing::
 echo a > /sys/fs/cgroup/1/devices.deny
-will remove the default 'a *:* rwm' entry. Doing
+will remove the default 'a *:* rwm' entry. Doing::
 echo a > /sys/fs/cgroup/1/devices.allow
 will add the 'a *:* rwm' entry to the whitelist.
 3. Security
+===========
 Any task can move itself between cgroups. This clearly won't
 suffice, but we can decide the best way to adequately restrict
@@ -50,6 +55,7 @@ A cgroup may not be granted more permissions than the cgroup's
 parent has.
 4. Hierarchy
+============
 device cgroups maintain hierarchy by making sure a cgroup never has more
 access permissions than its parent. Every time an entry is written to
@@ -58,7 +64,8 @@ from their whitelist and all the locally set whitelist entries will be
 re-evaluated. In case one of the locally set whitelist entries would provide
 more access than the cgroup's parent, it'll be removed from the whitelist.
-Example:
+Example::
 A
 / \
 B
@@ -67,10 +74,12 @@ Example:
 A allow "b 8:* rwm", "c 116:1 rw"
 B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
-If a device is denied in group A:
+If a device is denied in group A::
 # echo "c 116:* r" > A/devices.deny
 it'll propagate down and after revalidating B's entries, the whitelist entry
-"c 116:2 rwm" will be removed:
+"c 116:2 rwm" will be removed::
 group whitelist entries denied devices
 A all "b 8:* rwm", "c 116:* rw"
@@ -79,7 +88,8 @@ it'll propagate down and after revalidating B's entries, the whitelist entry
 In case parent's exceptions change and local exceptions are not allowed
 anymore, they'll be deleted.
-Notice that new whitelist entries will not be propagated:
+Notice that new whitelist entries will not be propagated::
 A
 / \
 B
@@ -88,24 +98,30 @@ Notice that new whitelist entries will not be propagated:
 A "c 1:3 rwm", "c 1:5 r" all the rest
 B "c 1:3 rwm", "c 1:5 r" all the rest
-when adding "c *:3 rwm":
+when adding ``c *:3 rwm``::
 # echo "c *:3 rwm" >A/devices.allow
-the result:
+the result::
 group whitelist entries denied devices
 A "c *:3 rwm", "c 1:5 r" all the rest
 B "c 1:3 rwm", "c 1:5 r" all the rest
-but now it'll be possible to add new entries to B:
+but now it'll be possible to add new entries to B::
 # echo "c 2:3 rwm" >B/devices.allow
 # echo "c 50:3 r" >B/devices.allow
-or even
+or even::
 # echo "c *:3 rwm" >B/devices.allow
 Allowing or denying all by writing 'a' to devices.allow or devices.deny will
 not be possible once the device cgroups has children.
 4.1 Hierarchy (internal implementation)
+---------------------------------------
 device cgroups is implemented internally using a behavior (ALLOW, DENY) and a
 list of exceptions. The internal state is controlled using the same user
......
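The device whitelist entries shown above have the shape "type major:minor access", and a child cgroup may not hold more access than its parent. The Python sketch below illustrates that containment rule on single entries. It is a simplified model under stated assumptions (the `DeviceEntry`, `parse_entry` and `covers` names are hypothetical), not the kernel's behavior-plus-exceptions implementation.

```python
from typing import NamedTuple


class DeviceEntry(NamedTuple):
    dev_type: str        # "a" (all), "b" (block) or "c" (char)
    major: str           # decimal number or "*"
    minor: str           # decimal number or "*"
    access: frozenset    # subset of {"r", "w", "m"}


def parse_entry(text: str) -> DeviceEntry:
    """Parse an entry such as 'c 1:3 mr'; a bare 'a' means 'a *:* rwm'."""
    parts = text.split()
    if parts == ["a"]:
        return DeviceEntry("a", "*", "*", frozenset("rwm"))
    if len(parts) != 3:
        raise ValueError(f"malformed entry: {text!r}")
    dev_type, numbers, access = parts
    major, sep, minor = numbers.partition(":")
    if dev_type not in ("a", "b", "c") or not sep:
        raise ValueError(f"malformed entry: {text!r}")
    for field in (major, minor):
        if field != "*" and not field.isdigit():
            raise ValueError(f"bad device number in {text!r}")
    if not access or not set(access) <= set("rwm"):
        raise ValueError(f"bad access string in {text!r}")
    return DeviceEntry(dev_type, major, minor, frozenset(access))


def covers(parent: DeviceEntry, child: DeviceEntry) -> bool:
    """Return True if the parent entry grants everything the child asks for."""
    type_ok = parent.dev_type in ("a", child.dev_type)
    major_ok = parent.major in ("*", child.major)
    minor_ok = parent.minor in ("*", child.minor)
    return type_ok and major_ok and minor_ok and child.access <= parent.access
```

With the parent holding "c *:3 rwm", entries such as "c 2:3 rwm" or "c 50:3 r" are covered, which matches the example where new entries become possible in B.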
+==============
+Cgroup Freezer
+==============
 The cgroup freezer is useful to batch job management system which start
 and stop sets of tasks in order to schedule the resources of a machine
 according to the desires of a system administrator. This sort of program
@@ -23,7 +27,7 @@ blocked, or ignored it can be seen by waiting or ptracing parent tasks.
 SIGCONT is especially unsuitable since it can be caught by the task. Any
 programs designed to watch for SIGSTOP and SIGCONT could be broken by
 attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
-demonstrate this problem using nested bash shells:
+demonstrate this problem using nested bash shells::
 $ echo $$
 16644
@@ -93,19 +97,19 @@ The following cgroupfs files are created by cgroup freezer.
 The root cgroup is non-freezable and the above interface files don't
 exist.
-* Examples of usage :
+* Examples of usage::
 # mkdir /sys/fs/cgroup/freezer
 # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
 # mkdir /sys/fs/cgroup/freezer/0
 # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
-to get status of the freezer subsystem :
+to get status of the freezer subsystem::
 # cat /sys/fs/cgroup/freezer/0/freezer.state
 THAWED
-to freeze all tasks in the container :
+to freeze all tasks in the container::
 # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
 # cat /sys/fs/cgroup/freezer/0/freezer.state
@@ -113,7 +117,7 @@ to freeze all tasks in the container :
 # cat /sys/fs/cgroup/freezer/0/freezer.state
 FROZEN
-to unfreeze all tasks in the container :
+to unfreeze all tasks in the container::
 # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
 # cat /sys/fs/cgroup/freezer/0/freezer.state
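The echo/cat sequence above can also be scripted. The Python sketch below writes the requested state to freezer.state and re-reads the file until it reports that state, since the file may briefly show an intermediate value while tasks are still being frozen. The function name, timeout and polling interval are assumptions for illustration.

```python
import time
from pathlib import Path


def set_freezer_state(cgroup_dir: str, state: str,
                      timeout: float = 5.0, poll: float = 0.1) -> str:
    """Write FROZEN or THAWED to <cgroup_dir>/freezer.state and wait for it.

    Hypothetical helper: cgroup_dir would be something like
    /sys/fs/cgroup/freezer/0 on a system with the v1 freezer mounted.
    """
    if state not in ("FROZEN", "THAWED"):
        raise ValueError(f"unsupported state: {state!r}")
    state_file = Path(cgroup_dir) / "freezer.state"
    state_file.write_text(state + "\n")

    deadline = time.monotonic() + timeout
    while True:
        current = state_file.read_text().strip()
        if current == state:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {current!r} after {timeout} seconds")
        time.sleep(poll)
```

Polling rather than assuming immediate success mirrors the repeated `cat` of freezer.state in the example.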
......
+==================
 HugeTLB Controller
--------------------
+==================
 The HugeTLB controller allows to limit the HugeTLB usage per control group and
 enforces the controller limit during page fault. Since HugeTLB doesn't
@@ -16,16 +17,16 @@ With the above step, the initial or the parent HugeTLB group becomes
 visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
 the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
-New groups can be created under the parent group /sys/fs/cgroup.
+New groups can be created under the parent group /sys/fs/cgroup::
 # cd /sys/fs/cgroup
 # mkdir g1
 # echo $$ > g1/tasks
 The above steps create a new group g1 and move the current shell
 process (bash) into it.
-Brief summary of control files
+Brief summary of control files::
 hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
 hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
@@ -33,17 +34,17 @@ Brief summary of control files
 hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit
 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
-files include:
+files include::
 hugetlb.1GB.limit_in_bytes
 hugetlb.1GB.max_usage_in_bytes
 hugetlb.1GB.usage_in_bytes
 hugetlb.1GB.failcnt
 hugetlb.64KB.limit_in_bytes
 hugetlb.64KB.max_usage_in_bytes
 hugetlb.64KB.usage_in_bytes
 hugetlb.64KB.failcnt
 hugetlb.32MB.limit_in_bytes
 hugetlb.32MB.max_usage_in_bytes
 hugetlb.32MB.usage_in_bytes
 hugetlb.32MB.failcnt
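The listing above follows a regular hugetlb.<hugepagesize>.<name> pattern. As an illustration, the Python sketch below derives the file names for a given huge page size in bytes. The helper names are hypothetical, and the KB/MB/GB labelling rule is an assumption inferred from the three sizes in the example.

```python
# Control file suffixes taken from the listing above.
SUFFIXES = ("limit_in_bytes", "max_usage_in_bytes", "usage_in_bytes", "failcnt")


def hugepage_label(size_bytes: int) -> str:
    """Label a huge page size as e.g. "64KB", "32MB" or "1GB"."""
    if size_bytes >= 1 << 30:
        return f"{size_bytes >> 30}GB"
    if size_bytes >= 1 << 20:
        return f"{size_bytes >> 20}MB"
    return f"{size_bytes >> 10}KB"


def control_files(size_bytes: int) -> list:
    """List the hugetlb control file names for one huge page size."""
    label = hugepage_label(size_bytes)
    return [f"hugetlb.{label}.{suffix}" for suffix in SUFFIXES]
```

Calling `control_files` for 64 KiB, 32 MiB and 1 GiB reproduces the twelve names listed for the three-size system.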
+:orphan:
+
+========================
+Control Groups version 1
+========================
+
+.. toctree::
+    :maxdepth: 1
+
+    cgroups
+    blkio-controller
+    cpuacct
+    cpusets
+    devices
+    freezer-subsystem
+    hugetlb
+    memcg_test
+    memory
+    net_cls
+    net_prio
+    pids
+    rdma
+
+.. only:: subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
-Memory Resource Controller(Memcg) Implementation Memo.
+=====================================================
+Memory Resource Controller(Memcg) Implementation Memo
+=====================================================
 Last Updated: 2010/2
 Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
 is complex. This is a document for memcg's internal behavior.
 Please note that implementation details can be changed.
-(*) Topics on API should be in Documentation/cgroup-v1/memory.txt)
+(*) Topics on API should be in Documentation/cgroup-v1/memory.rst)
 0. How to record usage ?
+========================
 2 objects are used.
 page_cgroup ....an object per page.
 Allocated at boot or memory hotplug. Freed at memory hot removal.
 swap_cgroup ... an entry per swp_entry.
 Allocated at swapon(). Freed at swapoff().
 The page_cgroup has USED bit and double count against a page_cgroup never
 occurs. swap_cgroup is used only when a charged page is swapped-out.
 1. Charge
+=========
 a page/swp_entry may be charged (usage += PAGE_SIZE) at
 mem_cgroup_try_charge()
 2. Uncharge
+===========
 a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 mem_cgroup_uncharge()
@@ -37,9 +48,12 @@ Please note that implementation details can be changed.
 disappears.
 3. charge-commit-cancel
+=======================
 Memcg pages are charged in two steps:
-mem_cgroup_try_charge()
-mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
+- mem_cgroup_try_charge()
+- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
 At try_charge(), there are no flags to say "this page is charged".
 at this point, usage += PAGE_SIZE.
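The two-step charge described above can be pictured with a toy counter. The Python sketch below is only a model for illustration: the class and its fields are hypothetical, the 4096-byte PAGE_SIZE is an assumption, and the commit and cancel behaviour (commit records the page as charged, cancel gives the reserved usage back) is assumed rather than quoted from the excerpt.

```python
PAGE_SIZE = 4096  # assumed page size for the toy model


class ToyMemcg:
    """Toy model of try_charge / commit_charge / cancel_charge accounting."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.usage = 0
        self.charged_pages = set()

    def try_charge(self) -> bool:
        # Step 1: reserve usage; no page is marked as charged yet.
        if self.usage + PAGE_SIZE > self.limit:
            return False
        self.usage += PAGE_SIZE
        return True

    def commit_charge(self, page: str) -> None:
        # Step 2a (assumed): record that this page now holds the charge.
        self.charged_pages.add(page)

    def cancel_charge(self) -> None:
        # Step 2b (assumed): give the reserved usage back.
        self.usage -= PAGE_SIZE

    def uncharge(self, page: str) -> None:
        # Release a committed charge: usage -= PAGE_SIZE.
        self.charged_pages.remove(page)
        self.usage -= PAGE_SIZE
```

The point of the split is that usage is reserved before the caller knows whether the page will actually be used, so every successful try_charge() must be followed by exactly one commit or cancel.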
@@ -51,6 +65,8 @@ Please note that implementation details can be changed.
 Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 4. Anonymous
+============
 Anonymous page is newly allocated at
 - page fault into MAP_ANONYMOUS mapping.
 - Copy-On-Write.
@@ -78,34 +94,45 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
 5. Page Cache
+=============
 Page Cache is charged at
 - add_to_page_cache_locked().
 The logic is very clear. (About migration, see below)
-Note: __remove_from_page_cache() is called by remove_from_page_cache()
+Note:
+__remove_from_page_cache() is called by remove_from_page_cache()
 and __remove_mapping().
 6. Shmem(tmpfs) Page Cache
+===========================
 The best way to understand shmem's page state transition is to read
 mm/shmem.c.
 But brief explanation of the behavior of memcg around shmem will be
 helpful to understand the logic.
 Shmem's page (just leaf page, not direct/indirect block) can be on
 - radix-tree of shmem's inode.
 - SwapCache.
 - Both on radix-tree and SwapCache. This happens at swap-in
 and swap-out,
 It's charged when...
 - A new page is added to shmem's radix-tree.
 - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
 7. Page Migration
+=================
 mem_cgroup_migrate()
 8. LRU
+======
 Each memcg has its own private LRU. Now, its handling is under global
 VM's control (means that it's handled under global pgdat->lru_lock).
 Almost all routines around memcg's LRU is called by global LRU's
@@ -114,29 +141,38 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 A special function is mem_cgroup_isolate_pages(). This scans
 memcg's private LRU and call __isolate_lru_page() to extract a page
 from LRU.
 (By __isolate_lru_page(), the page is removed from both of global and
 private LRU.)
 9. Typical Tests.
+=================
 Tests for racy cases.
 9.1 Small limit to memcg.
+-------------------------
 When you do test to do racy case, it's good test to set memcg's limit
 to be very small rather than GB. Many races found in the test under
 xKB or xxMB limits.
 (Memory behavior under GB and Memory behavior under MB shows very
 different situation.)
 9.2 Shmem
+---------
 Historically, memcg's shmem handling was poor and we saw some amount
 of troubles here. This is because shmem is page-cache but can be
 SwapCache. Test with shmem/tmpfs is always good test.
 9.3 Migration
+-------------
 For NUMA, migration is an another special case. To do easy test, cpuset
-is useful. Following is a sample script to do migration.
+is useful. Following is a sample script to do migration::
 mount -t cgroup -o cpuset none /opt/cpuset
...@@ -151,7 +187,8 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. ...@@ -151,7 +187,8 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
In above set, when you moves a task from 01 to 02, page migration to In above set, when you moves a task from 01 to 02, page migration to
node 0 to node 1 will occur. Following is a script to migrate all node 0 to node 1 will occur. Following is a script to migrate all
under cpuset. under cpuset.::
-- --
move_task() move_task()
{ {
...@@ -168,16 +205,25 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. ...@@ -168,16 +205,25 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
G2_TASK=`cat ${G2}/tasks` G2_TASK=`cat ${G2}/tasks`
move_task "${G1_TASK}" ${G2} & move_task "${G1_TASK}" ${G2} &
-- --
9.4 Memory hotplug.
9.4 Memory hotplug
------------------
memory hotplug test is one of good test. memory hotplug test is one of good test.
to offline memory, do following.
to offline memory, do following::
# echo offline > /sys/devices/system/memory/memoryXXX/state # echo offline > /sys/devices/system/memory/memoryXXX/state
(XXX is the place of memory) (XXX is the place of memory)
This is an easy way to test page migration, too. This is an easy way to test page migration, too.
9.5 mkdir/rmdir 9.5 mkdir/rmdir
---------------
When using hierarchy, mkdir/rmdir test should be done. When using hierarchy, mkdir/rmdir test should be done.
Use tests like the following. Use tests like the following::
echo 1 >/opt/cgroup/01/memory/use_hierarchy echo 1 >/opt/cgroup/01/memory/use_hierarchy
mkdir /opt/cgroup/01/child_a mkdir /opt/cgroup/01/child_a
...@@ -187,35 +233,46 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. ...@@ -187,35 +233,46 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
add limit to 01/child_b add limit to 01/child_b
run jobs under child_a and child_b run jobs under child_a and child_b
create/delete following groups at random while jobs are running. create/delete following groups at random while jobs are running::
/opt/cgroup/01/child_a/child_aa /opt/cgroup/01/child_a/child_aa
/opt/cgroup/01/child_b/child_bb /opt/cgroup/01/child_b/child_bb
/opt/cgroup/01/child_c /opt/cgroup/01/child_c
running new jobs in new group is also good. running new jobs in new group is also good.
9.6 Mount with other subsystems. 9.6 Mount with other subsystems
-------------------------------
Mounting with other subsystems is a good test because there is a Mounting with other subsystems is a good test because there is a
race and lock dependency with other cgroup subsystems. race and lock dependency with other cgroup subsystems.
example) example::
# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
and do task move, mkdir, rmdir etc...under this. and do task move, mkdir, rmdir etc...under this.
9.7 swapoff. 9.7 swapoff
-----------
Besides management of swap is one of complicated parts of memcg, Besides management of swap is one of complicated parts of memcg,
call path of swap-in at swapoff is not same as usual swap-in path.. call path of swap-in at swapoff is not same as usual swap-in path..
It's worth to be tested explicitly. It's worth to be tested explicitly.
For example, test like following is good. For example, test like following is good:
(Shell-A)
(Shell-A)::
# mount -t cgroup none /cgroup -o memory # mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/test # mkdir /cgroup/test
# echo 40M > /cgroup/test/memory.limit_in_bytes # echo 40M > /cgroup/test/memory.limit_in_bytes
# echo 0 > /cgroup/test/tasks # echo 0 > /cgroup/test/tasks
Run malloc(100M) program under this. You'll see 60M of swaps. Run malloc(100M) program under this. You'll see 60M of swaps.
(Shell-B)
(Shell-B)::
# move all tasks in /cgroup/test to /cgroup # move all tasks in /cgroup/test to /cgroup
# /sbin/swapoff -a # /sbin/swapoff -a
# rmdir /cgroup/test # rmdir /cgroup/test
...@@ -223,51 +280,69 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. ...@@ -223,51 +280,69 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
Of course, tmpfs v.s. swapoff test should be tested, too. Of course, tmpfs v.s. swapoff test should be tested, too.
9.8 OOM-Killer 9.8 OOM-Killer
--------------
Out-of-memory caused by memcg's limit will kill tasks under Out-of-memory caused by memcg's limit will kill tasks under
the memcg. When hierarchy is used, a task under hierarchy the memcg. When hierarchy is used, a task under hierarchy
will be killed by the kernel. will be killed by the kernel.
In this case, panic_on_oom shouldn't be invoked and tasks In this case, panic_on_oom shouldn't be invoked and tasks
in other groups shouldn't be killed. in other groups shouldn't be killed.
It's not difficult to cause OOM under memcg as following. It's not difficult to cause OOM under memcg as following.
Case A) when you can swapoff
Case A) when you can swapoff::
#swapoff -a #swapoff -a
#echo 50M > /memory.limit_in_bytes #echo 50M > /memory.limit_in_bytes
run 51M of malloc run 51M of malloc
Case B) when you use mem+swap limitation. Case B) when you use mem+swap limitation::
#echo 50M > memory.limit_in_bytes #echo 50M > memory.limit_in_bytes
#echo 50M > memory.memsw.limit_in_bytes #echo 50M > memory.memsw.limit_in_bytes
run 51M of malloc run 51M of malloc
9.9 Move charges at task migration 9.9 Move charges at task migration
----------------------------------
Charges associated with a task can be moved along with task migration. Charges associated with a task can be moved along with task migration.
(Shell-A) (Shell-A)::
#mkdir /cgroup/A #mkdir /cgroup/A
#echo $$ >/cgroup/A/tasks #echo $$ >/cgroup/A/tasks
run some programs which uses some amount of memory in /cgroup/A. run some programs which uses some amount of memory in /cgroup/A.
(Shell-B) (Shell-B)::
#mkdir /cgroup/B #mkdir /cgroup/B
#echo 1 >/cgroup/B/memory.move_charge_at_immigrate #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
#echo "pid of the program running in group A" >/cgroup/B/tasks #echo "pid of the program running in group A" >/cgroup/B/tasks
You can see charges have been moved by reading *.usage_in_bytes or You can see charges have been moved by reading ``*.usage_in_bytes`` or
memory.stat of both A and B. memory.stat of both A and B.
See 8.2 of Documentation/cgroup-v1/memory.txt to see what value should be
written to move_charge_at_immigrate.
9.10 Memory thresholds See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should
be written to move_charge_at_immigrate.
9.10 Memory thresholds
----------------------
Memory controller implements memory thresholds using cgroups notification Memory controller implements memory thresholds using cgroups notification
API. You can use tools/cgroup/cgroup_event_listener.c to test it. API. You can use tools/cgroup/cgroup_event_listener.c to test it.
(Shell-A) Create cgroup and run event listener (Shell-A) Create cgroup and run event listener::
# mkdir /cgroup/A # mkdir /cgroup/A
# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
(Shell-B) Add task to cgroup and try to allocate and free memory (Shell-B) Add task to cgroup and try to allocate and free memory::
# echo $$ >/cgroup/A/tasks # echo $$ >/cgroup/A/tasks
# a="$(dd if=/dev/zero bs=1M count=10)" # a="$(dd if=/dev/zero bs=1M count=10)"
# a= # a=
......
=========================
Network classifier cgroup Network classifier cgroup
------------------------- =========================
The Network classifier cgroup provides an interface to The Network classifier cgroup provides an interface to
tag network packets with a class identifier (classid). tag network packets with a class identifier (classid).
...@@ -17,23 +18,27 @@ values is 0xAAAABBBB; AAAA is the major handle number and BBBB ...@@ -17,23 +18,27 @@ values is 0xAAAABBBB; AAAA is the major handle number and BBBB
is the minor handle number. is the minor handle number.
Reading net_cls.classid yields a decimal result. Reading net_cls.classid yields a decimal result.
-Example:
-mkdir /sys/fs/cgroup/net_cls
-mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
-mkdir /sys/fs/cgroup/net_cls/0
-echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
-- setting a 10:1 handle.
-cat /sys/fs/cgroup/net_cls/0/net_cls.classid
-1048577
-configuring tc:
-tc qdisc add dev eth0 root handle 10: htb
-tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
-- creating traffic class 10:1
-tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
-configuring iptables, basic example:
-iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
+Example::
+
+	mkdir /sys/fs/cgroup/net_cls
+	mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
+	mkdir /sys/fs/cgroup/net_cls/0
+	echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
+
+- setting a 10:1 handle::
+
+	cat /sys/fs/cgroup/net_cls/0/net_cls.classid
+	1048577
+
+- configuring tc::
+
+	tc qdisc add dev eth0 root handle 10: htb
+	tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
+
+- creating traffic class 10:1::
+
+	tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
+
+configuring iptables, basic example::
+
+	iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
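The relationship between the hexadecimal value written to net_cls.classid, the decimal value read back, and the tc handle can be checked with a little shell arithmetic. The sketch below assumes the 0xAAAABBBB layout described above (AAAA is the major handle, BBBB the minor handle); the two function names are hypothetical helpers for illustration and are not part of the kernel, tc, or iptables.

```shell
# Hypothetical helpers, not part of the kernel tree or of tc/iptables.
# Assumed layout (from the text above): classid = 0xAAAABBBB,
# AAAA = major handle, BBBB = minor handle, both hexadecimal.

# Build the classid for a tc-style "major:minor" handle.
classid_from_handle() {
    major=${1%%:*}
    minor=${1##*:}
    printf '0x%x\n' $(( (0x$major << 16) | 0x$minor ))
}

# Split the decimal value read from net_cls.classid back into "major:minor".
handle_from_classid() {
    printf '%x:%x\n' $(( $1 >> 16 )) $(( $1 & 0xffff ))
}

classid_from_handle 10:1      # prints 0x100001
handle_from_classid 1048577   # prints 10:1
```

This is why the example writes 0x100001 for handle 10:1 and reads back 1048577.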
=======================
Network priority cgroup Network priority cgroup
------------------------- =======================
The Network priority cgroup provides an interface to allow an administrator to The Network priority cgroup provides an interface to allow an administrator to
dynamically set the priority of network traffic generated by various dynamically set the priority of network traffic generated by various
...@@ -14,9 +15,9 @@ SO_PRIORITY socket option. This however, is not always possible because: ...@@ -14,9 +15,9 @@ SO_PRIORITY socket option. This however, is not always possible because:
This cgroup allows an administrator to assign a process to a group which defines This cgroup allows an administrator to assign a process to a group which defines
the priority of egress traffic on a given interface. Network priority groups can the priority of egress traffic on a given interface. Network priority groups can
be created by first mounting the cgroup filesystem. be created by first mounting the cgroup filesystem::
# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
With the above step, the initial group acting as the parent accounting group With the above step, the initial group acting as the parent accounting group
becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in
...@@ -25,17 +26,18 @@ the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup. ...@@ -25,17 +26,18 @@ the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
Each net_prio cgroup contains two files that are subsystem specific Each net_prio cgroup contains two files that are subsystem specific
net_prio.prioidx net_prio.prioidx
This file is read-only, and is simply informative. It contains a unique integer This file is read-only, and is simply informative. It contains a unique
value that the kernel uses as an internal representation of this cgroup. integer value that the kernel uses as an internal representation of this
cgroup.
net_prio.ifpriomap net_prio.ifpriomap
This file contains a map of the priorities assigned to traffic originating from This file contains a map of the priorities assigned to traffic originating
processes in this group and egressing the system on various interfaces. It from processes in this group and egressing the system on various interfaces.
contains a list of tuples in the form <ifname priority>. Contents of this file It contains a list of tuples in the form <ifname priority>. Contents of this
can be modified by echoing a string into the file using the same tuple format. file can be modified by echoing a string into the file using the same tuple
for example: format. For example::
echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap
This command would force any traffic originating from processes belonging to the This command would force any traffic originating from processes belonging to the
iscsi net_prio cgroup and egressing on interface eth0 to have the priority of iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
......
Process Number Controller =========================
========================= Process Number Controller
=========================
Abstract Abstract
-------- --------
...@@ -34,55 +35,58 @@ pids.current tracks all child cgroup hierarchies, so parent/pids.current is a ...@@ -34,55 +35,58 @@ pids.current tracks all child cgroup hierarchies, so parent/pids.current is a
superset of parent/child/pids.current. superset of parent/child/pids.current.
The pids.events file contains event counters: The pids.events file contains event counters:
- max: Number of times fork failed because limit was hit. - max: Number of times fork failed because limit was hit.
Example Example
------- -------
-First, we mount the pids controller:
-# mkdir -p /sys/fs/cgroup/pids
-# mount -t cgroup -o pids none /sys/fs/cgroup/pids
+First, we mount the pids controller::
+
+	# mkdir -p /sys/fs/cgroup/pids
+	# mount -t cgroup -o pids none /sys/fs/cgroup/pids

-Then we create a hierarchy, set limits and attach processes to it:
-# mkdir -p /sys/fs/cgroup/pids/parent/child
-# echo 2 > /sys/fs/cgroup/pids/parent/pids.max
-# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
-# cat /sys/fs/cgroup/pids/parent/pids.current
-2
-#
+Then we create a hierarchy, set limits and attach processes to it::
+
+	# mkdir -p /sys/fs/cgroup/pids/parent/child
+	# echo 2 > /sys/fs/cgroup/pids/parent/pids.max
+	# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
+	# cat /sys/fs/cgroup/pids/parent/pids.current
+	2
+	#
It should be noted that attempts to overcome the set limit (2 in this case) will It should be noted that attempts to overcome the set limit (2 in this case) will
fail: fail::
# cat /sys/fs/cgroup/pids/parent/pids.current # cat /sys/fs/cgroup/pids/parent/pids.current
2 2
# ( /bin/echo "Here's some processes for you." | cat ) # ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporary unavailable sh: fork: Resource temporary unavailable
# #
Even if we migrate to a child cgroup (which doesn't have a set limit), we will Even if we migrate to a child cgroup (which doesn't have a set limit), we will
not be able to overcome the most stringent limit in the hierarchy (in this case, not be able to overcome the most stringent limit in the hierarchy (in this case,
parent's): parent's)::
# echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
# cat /sys/fs/cgroup/pids/parent/pids.current # cat /sys/fs/cgroup/pids/parent/pids.current
2 2
# cat /sys/fs/cgroup/pids/parent/child/pids.current # cat /sys/fs/cgroup/pids/parent/child/pids.current
2 2
# cat /sys/fs/cgroup/pids/parent/child/pids.max # cat /sys/fs/cgroup/pids/parent/child/pids.max
max max
# ( /bin/echo "Here's some processes for you." | cat ) # ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporary unavailable sh: fork: Resource temporary unavailable
# #
We can set a limit that is smaller than pids.current, which will stop any new We can set a limit that is smaller than pids.current, which will stop any new
processes from being forked at all (note that the shell itself counts towards processes from being forked at all (note that the shell itself counts towards
pids.current): pids.current)::
# echo 1 > /sys/fs/cgroup/pids/parent/pids.max # echo 1 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now." # /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporary unavailable sh: fork: Resource temporary unavailable
# echo 0 > /sys/fs/cgroup/pids/parent/pids.max # echo 0 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now." # /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporary unavailable sh: fork: Resource temporary unavailable
# #
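The rule that a child cannot escape the most stringent limit in its hierarchy can be made concrete by walking up the directory tree and taking the smallest pids.max found. The sketch below is only an illustration: `effective_pids_max` is a hypothetical helper, and the demo runs on a scratch directory that mimics the parent/child layout above, so it needs neither root nor a mounted pids controller.

```shell
# Hypothetical helper, not part of the kernel tree.
# effective_pids_max DIR ROOT: walk from DIR up to ROOT and print the smallest
# pids.max seen on the way; "max" means no limit was set at that level.
effective_pids_max() {
    dir=$1; root=$2; best=max
    while :; do
        if [ -r "$dir/pids.max" ]; then
            cur=$(cat "$dir/pids.max")
            if [ "$cur" != max ]; then
                if [ "$best" = max ] || [ "$cur" -lt "$best" ]; then
                    best=$cur
                fi
            fi
        fi
        [ "$dir" = "$root" ] && break
        parent=$(dirname "$dir")
        [ "$parent" = "$dir" ] && break   # reached the top without meeting ROOT
        dir=$parent
    done
    echo "$best"
}

# Demo on a scratch directory that mimics the layout from the example above.
demo=$(mktemp -d)
mkdir -p "$demo/parent/child"
echo 2   > "$demo/parent/pids.max"
echo max > "$demo/parent/child/pids.max"
effective_pids_max "$demo/parent/child" "$demo"   # prints 2
rm -rf "$demo"
```

On a real system the same call would take the mounted hierarchy as ROOT, for example /sys/fs/cgroup/pids.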
RDMA Controller ===============
---------------- RDMA Controller
===============
Contents .. Contents
--------
1. Overview 1. Overview
1-1. What is RDMA controller? 1-1. What is RDMA controller?
1-2. Why RDMA controller needed? 1-2. Why RDMA controller needed?
1-3. How is RDMA controller implemented? 1-3. How is RDMA controller implemented?
2. Usage Examples 2. Usage Examples
1. Overview 1. Overview
===========
1-1. What is RDMA controller? 1-1. What is RDMA controller?
----------------------------- -----------------------------
...@@ -83,27 +84,34 @@ what is configured by user for a given cgroup and what is supported by ...@@ -83,27 +84,34 @@ what is configured by user for a given cgroup and what is supported by
IB device. IB device.
Following resources can be accounted by rdma controller. Following resources can be accounted by rdma controller.
========== =============================
hca_handle Maximum number of HCA Handles hca_handle Maximum number of HCA Handles
hca_object Maximum number of HCA Objects hca_object Maximum number of HCA Objects
========== =============================
2. Usage Examples 2. Usage Examples
----------------- =================
-(a) Configure resource limit:
-echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
-echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
+(a) Configure resource limit::
+
+	echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
+	echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max

-(b) Query resource limit:
-cat /sys/fs/cgroup/rdma/2/rdma.max
-#Output:
-mlx4_0 hca_handle=2 hca_object=2000
-ocrdma1 hca_handle=3 hca_object=max
+(b) Query resource limit::
+
+	cat /sys/fs/cgroup/rdma/2/rdma.max
+	#Output:
+	mlx4_0 hca_handle=2 hca_object=2000
+	ocrdma1 hca_handle=3 hca_object=max

-(c) Query current usage:
-cat /sys/fs/cgroup/rdma/2/rdma.current
-#Output:
-mlx4_0 hca_handle=1 hca_object=20
-ocrdma1 hca_handle=1 hca_object=23
+(c) Query current usage::
+
+	cat /sys/fs/cgroup/rdma/2/rdma.current
+	#Output:
+	mlx4_0 hca_handle=1 hca_object=20
+	ocrdma1 hca_handle=1 hca_object=23

-(d) Delete resource limit:
-echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
+(d) Delete resource limit::
+
+	echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
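The rdma.max and rdma.current files shown above share one line format: a device name followed by key=value pairs. A minimal sketch of pulling a single value out of that format follows; `rdma_value` is a hypothetical helper, and the sample text is copied from the query output in (b) rather than read from a live system.

```shell
# Hypothetical helper, not part of the kernel tree.
# rdma_value DEVICE KEY: read "<device> <key>=<value> ..." lines on stdin and
# print the value of KEY for DEVICE.
rdma_value() {
    awk -v dev="$1" -v key="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) { print kv[2]; exit }
            }
        }'
}

# Sample text copied from the "Query resource limit" output above.
sample='mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max'

printf '%s\n' "$sample" | rdma_value ocrdma1 hca_object   # prints max
printf '%s\n' "$sample" | rdma_value mlx4_0 hca_handle    # prints 2
```

On a configured system the input would come from `cat /sys/fs/cgroup/rdma/2/rdma.max` instead of the sample variable.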
...@@ -98,7 +98,7 @@ A memory policy with a valid NodeList will be saved, as specified, for ...@@ -98,7 +98,7 @@ A memory policy with a valid NodeList will be saved, as specified, for
use at file creation time. When a task allocates a file in the file use at file creation time. When a task allocates a file in the file
system, the mount option memory policy will be applied with a NodeList, system, the mount option memory policy will be applied with a NodeList,
if any, modified by the calling task's cpuset constraints if any, modified by the calling task's cpuset constraints
[See Documentation/cgroup-v1/cpusets.txt] and any optional flags, listed [See Documentation/cgroup-v1/cpusets.rst] and any optional flags, listed
below. If the resulting NodeLists is the empty set, the effective memory below. If the resulting NodeLists is the empty set, the effective memory
policy for the file will revert to "default" policy. policy for the file will revert to "default" policy.
......
...@@ -652,7 +652,7 @@ CONTENTS ...@@ -652,7 +652,7 @@ CONTENTS
-deadline tasks cannot have an affinity mask smaller that the entire -deadline tasks cannot have an affinity mask smaller that the entire
root_domain they are created on. However, affinities can be specified root_domain they are created on. However, affinities can be specified
through the cpuset facility (Documentation/cgroup-v1/cpusets.txt). through the cpuset facility (Documentation/cgroup-v1/cpusets.rst).
5.1 SCHED_DEADLINE and cpusets HOWTO 5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------ ------------------------------------
......
...@@ -215,7 +215,7 @@ SCHED_BATCH) tasks. ...@@ -215,7 +215,7 @@ SCHED_BATCH) tasks.
These options need CONFIG_CGROUPS to be defined, and let the administrator These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroup-v1/cgroups.txt for more information about this filesystem. Documentation/cgroup-v1/cgroups.rst for more information about this filesystem.
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create group created using the pseudo filesystem. See example steps below to create
......
...@@ -133,7 +133,7 @@ This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us" ...@@ -133,7 +133,7 @@ This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group. to control the CPU time reserved for each control group.
For more information on working with control groups, you should read For more information on working with control groups, you should read
Documentation/cgroup-v1/cgroups.txt as well. Documentation/cgroup-v1/cgroups.rst as well.
Group settings are checked against the following limits in order to keep the Group settings are checked against the following limits in order to keep the
configuration schedulable: configuration schedulable:
......
...@@ -67,7 +67,7 @@ nodes. Each emulated node will manage a fraction of the underlying cells' ...@@ -67,7 +67,7 @@ nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emluation is useful for testing NUMA kernel and physical memory. NUMA emluation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets. management mechanism when used together with cpusets.
[see Documentation/cgroup-v1/cpusets.txt] [see Documentation/cgroup-v1/cpusets.rst]
For each node with memory, Linux constructs an independent memory management For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage subsystem, complete with its own free page lists, in-use page lists, usage
...@@ -114,7 +114,7 @@ allocation behavior using Linux NUMA memory policy. [see ...@@ -114,7 +114,7 @@ allocation behavior using Linux NUMA memory policy. [see
System administrators can restrict the CPUs and nodes' memories that a non- System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions privileged user can specify in the scheduling or NUMA commands and functions
using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.txt] using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.rst]
On architectures that do not hide memoryless nodes, Linux will include only On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless zones [nodes] with memory in the zonelists. This means that for a memoryless
......
...@@ -41,7 +41,7 @@ locations. ...@@ -41,7 +41,7 @@ locations.
Larger installations usually partition the system using cpusets into Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset (See move pages when a task is moved to another cpuset (See
Documentation/cgroup-v1/cpusets.txt). Documentation/cgroup-v1/cpusets.rst).
Cpusets allows the automation of process locality. If a task is moved to Cpusets allows the automation of process locality. If a task is moved to
a new cpuset then also all its pages are moved with it so that the a new cpuset then also all its pages are moved with it so that the
performance of the process does not sink dramatically. Also the pages performance of the process does not sink dramatically. Also the pages
......
...@@ -98,7 +98,7 @@ Memory Control Group Interaction ...@@ -98,7 +98,7 @@ Memory Control Group Interaction
-------------------------------- --------------------------------
The unevictable LRU facility interacts with the memory control group [aka The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/cgroup-v1/memory.txt] by extending the memory controller; see Documentation/cgroup-v1/memory.rst] by extending the
lru_list enum. lru_list enum.
The memory controller data structure automatically gets a per-zone unevictable The memory controller data structure automatically gets a per-zone unevictable
......
...@@ -15,7 +15,7 @@ assign them to cpusets and their attached tasks. This is a way of limiting the ...@@ -15,7 +15,7 @@ assign them to cpusets and their attached tasks. This is a way of limiting the
amount of system memory that are available to a certain class of tasks. amount of system memory that are available to a certain class of tasks.
For more information on the features of cpusets, see For more information on the features of cpusets, see
Documentation/cgroup-v1/cpusets.txt. Documentation/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs. For There are a number of different configurations you can use for your needs. For
more information on the numa=fake command line option and its various ways of more information on the numa=fake command line option and its various ways of
configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.
...@@ -40,7 +40,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:: ...@@ -40,7 +40,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
On node 3 totalpages: 131072 On node 3 totalpages: 131072
Now following the instructions for mounting the cpusets filesystem from Now following the instructions for mounting the cpusets filesystem from
Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory Documentation/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory
address spaces) to individual cpusets:: address spaces) to individual cpusets::
[root@xroads /]# mkdir exampleset [root@xroads /]# mkdir exampleset
......
...@@ -4122,7 +4122,7 @@ W: http://www.bullopensource.org/cpuset/ ...@@ -4122,7 +4122,7 @@ W: http://www.bullopensource.org/cpuset/
W: http://oss.sgi.com/projects/cpusets/ W: http://oss.sgi.com/projects/cpusets/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
S: Maintained S: Maintained
F: Documentation/cgroup-v1/cpusets.txt F: Documentation/cgroup-v1/cpusets.rst
F: include/linux/cpuset.h F: include/linux/cpuset.h
F: kernel/cgroup/cpuset.c F: kernel/cgroup/cpuset.c
......
...@@ -89,7 +89,7 @@ config BLK_DEV_THROTTLING ...@@ -89,7 +89,7 @@ config BLK_DEV_THROTTLING
one needs to mount and use blkio cgroup controller for creating one needs to mount and use blkio cgroup controller for creating
cgroups and specifying per device IO rate policies. cgroups and specifying per device IO rate policies.
See Documentation/cgroup-v1/blkio-controller.txt for more information. See Documentation/cgroup-v1/blkio-controller.rst for more information.
config BLK_DEV_THROTTLING_LOW config BLK_DEV_THROTTLING_LOW
bool "Block throttling .low limit interface support (EXPERIMENTAL)" bool "Block throttling .low limit interface support (EXPERIMENTAL)"
......
...@@ -624,7 +624,7 @@ struct cftype { ...@@ -624,7 +624,7 @@ struct cftype {
/* /*
* Control Group subsystem type. * Control Group subsystem type.
* See Documentation/cgroup-v1/cgroups.txt for details * See Documentation/cgroup-v1/cgroups.rst for details
*/ */
struct cgroup_subsys { struct cgroup_subsys {
struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css); struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);
......
...@@ -131,6 +131,8 @@ void cgroup_free(struct task_struct *p); ...@@ -131,6 +131,8 @@ void cgroup_free(struct task_struct *p);
int cgroup_init_early(void); int cgroup_init_early(void);
int cgroup_init(void); int cgroup_init(void);
int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v);
/* /*
* Iteration helpers and macros. * Iteration helpers and macros.
*/ */
......
...@@ -785,7 +785,7 @@ union bpf_attr { ...@@ -785,7 +785,7 @@ union bpf_attr {
* based on a user-provided identifier for all traffic coming from * based on a user-provided identifier for all traffic coming from
* the tasks belonging to the related cgroup. See also the related * the tasks belonging to the related cgroup. See also the related
* kernel documentation, available from the Linux sources in file * kernel documentation, available from the Linux sources in file
* *Documentation/cgroup-v1/net_cls.txt*. * *Documentation/cgroup-v1/net_cls.rst*.
* *
* The Linux kernel has two versions for cgroups: there are * The Linux kernel has two versions for cgroups: there are
* cgroups v1 and cgroups v2. Both are available to users, who can * cgroups v1 and cgroups v2. Both are available to users, who can
......
...@@ -850,7 +850,7 @@ config BLK_CGROUP ...@@ -850,7 +850,7 @@ config BLK_CGROUP
CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
CONFIG_BLK_DEV_THROTTLING=y. CONFIG_BLK_DEV_THROTTLING=y.
See Documentation/cgroup-v1/blkio-controller.txt for more information. See Documentation/cgroup-v1/blkio-controller.rst for more information.
config DEBUG_BLK_CGROUP config DEBUG_BLK_CGROUP
bool "IO controller debugging" bool "IO controller debugging"
......
@@ -6240,6 +6240,48 @@ struct cgroup *cgroup_get_from_fd(int fd)
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_fd);
 
+static u64 power_of_ten(int power)
+{
+	u64 v = 1;
+	while (power--)
+		v *= 10;
+	return v;
+}
+
+/**
+ * cgroup_parse_float - parse a floating number
+ * @input: input string
+ * @dec_shift: number of decimal digits to shift
+ * @v: output
+ *
+ * Parse a decimal floating point number in @input and store the result in
+ * @v with decimal point right shifted @dec_shift times. For example, if
+ * @input is "12.3456" and @dec_shift is 3, *@v will be set to 12345.
+ * Returns 0 on success, -errno otherwise.
+ *
+ * There's nothing cgroup specific about this function except that it's
+ * currently the only user.
+ */
+int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
+{
+	s64 whole, frac = 0;
+	int fstart = 0, fend = 0, flen;
+
+	if (!sscanf(input, "%lld.%n%lld%n", &whole, &fstart, &frac, &fend))
+		return -EINVAL;
+	if (frac < 0)
+		return -EINVAL;
+
+	flen = fend > fstart ? fend - fstart : 0;
+	if (flen < dec_shift)
+		frac *= power_of_ten(dec_shift - flen);
+	else
+		frac = DIV_ROUND_CLOSEST_ULL(frac, power_of_ten(flen - dec_shift));
+
+	*v = whole * power_of_ten(dec_shift) + frac;
+	return 0;
+}
+
 /*
  * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
  * definition in cgroup-defs.h.
@@ -6402,4 +6444,5 @@ static int __init cgroup_sysfs_init(void)
 	return sysfs_create_group(kernel_kobj, &cgroup_sysfs_attr_group);
 }
 subsys_initcall(cgroup_sysfs_init);
+
 #endif /* CONFIG_SYSFS */
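
To make the fixed-point arithmetic in the cgroup_parse_float() hunk above easier to follow, here is a minimal userspace sketch of the same steps. It is an approximation, not the kernel code: the name parse_fixed_point() is hypothetical, the kernel types are replaced with stdint ones, DIV_ROUND_CLOSEST_ULL() is stood in for by an explicit add-half-then-divide, and the sscanf() result is checked with "< 1" because a libc sscanf() is assumed to be able to return EOF on empty input.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Return 10^power; callers keep power small enough to fit in 64 bits. */
static uint64_t power_of_ten(unsigned power)
{
	uint64_t v = 1;

	while (power--)
		v *= 10;
	return v;
}

/*
 * Userspace approximation of cgroup_parse_float() (hypothetical name).
 * Stores the parsed value, scaled by 10^dec_shift, in *v.
 */
static int parse_fixed_point(const char *input, unsigned dec_shift, int64_t *v)
{
	long long whole = 0, frac = 0;
	int fstart = 0, fend = 0;
	unsigned flen;

	assert(input && v);

	/* Assumption: libc sscanf() may return EOF, so test for "< 1". */
	if (sscanf(input, "%lld.%n%lld%n", &whole, &fstart, &frac, &fend) < 1)
		return -EINVAL;
	if (frac < 0)
		return -EINVAL;

	/* The %n offsets give the fraction's width, leading zeros included. */
	flen = fend > fstart ? (unsigned)(fend - fstart) : 0;
	if (flen < dec_shift) {
		/* Too few digits: pad the fraction with trailing zeros. */
		frac *= (long long)power_of_ten(dec_shift - flen);
	} else {
		/* Too many digits: round to nearest at the requested precision. */
		uint64_t d = power_of_ten(flen - dec_shift);

		frac = (long long)(((uint64_t)frac + d / 2) / d);
	}

	*v = whole * (long long)power_of_ten(dec_shift) + frac;
	return 0;
}
```

Under these assumptions, parse_fixed_point("13.40", 2, &v) stores 1340 and parse_fixed_point("7", 2, &v) stores 700. Because the surplus digits are rounded to nearest rather than truncated, "12.3456" with a shift of 3 comes out as 12346 in this sketch.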
@@ -729,7 +729,7 @@ static inline int nr_cpusets(void)
  * load balancing domains (sched domains) as specified by that partial
  * partition.
  *
- * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.txt
+ * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.rst
  * for a background explanation of this.
  *
  * Does not return errors, on the theory that the callers of this
@@ -509,7 +509,7 @@ static inline int may_allow_all(struct dev_cgroup *parent)
  * This is one of the three key functions for hierarchy implementation.
  * This function is responsible for re-evaluating all the cgroup's active
  * exceptions due to a parent's exception change.
- * Refer to Documentation/cgroup-v1/devices.txt for more details.
+ * Refer to Documentation/cgroup-v1/devices.rst for more details.
  */
 static void revalidate_active_exceptions(struct dev_cgroup *devcg)
 {
@@ -785,7 +785,7 @@ union bpf_attr {
  * based on a user-provided identifier for all traffic coming from
  * the tasks belonging to the related cgroup. See also the related
  * kernel documentation, available from the Linux sources in file
- * *Documentation/cgroup-v1/net_cls.txt*.
+ * *Documentation/cgroup-v1/net_cls.rst*.
  *
  * The Linux kernel has two versions for cgroups: there are
  * cgroups v1 and cgroups v2. Both are available to users, who can