Commit ac9a1974 authored by Jens Axboe

Merge branch 'blkcg-cfq-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.9/core

Tejun writes:

Hello, Jens.

Please consider pulling from the following branch to receive cfq blkcg
hierarchy support.  The branch is based on top of v3.8-rc2.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy

The patchset was reviewed in the following thread.

  http://thread.gmane.org/gmane.linux.kernel.cgroups/5571
parents 422765c2 43114018
...@@ -102,6 +102,64 @@ processing of request. Therefore, increasing the value can improve the
performance although this can cause the latency of some I/O to increase due
to more number of requests.
CFQ Group scheduling
====================
CFQ supports blkio cgroup and has "blkio." prefixed files in each
blkio cgroup directory. It is weight-based and there are four knobs
for configuration - weight[_device] and leaf_weight[_device].
Internal cgroup nodes (the ones with children) can also have tasks in
them, so the former two configure how much proportion the cgroup as a
whole is entitled to at its parent's level while the latter two
configure how much proportion the tasks in the cgroup have compared to
its direct children.
Another way to think about it is assuming that each internal node has
an implicit leaf child node which hosts all the tasks whose weight is
configured by leaf_weight[_device]. Let's assume a blkio hierarchy
composed of five cgroups - root, A, B, AA and AB - with the following
weights where the names represent the hierarchy.
        weight leaf_weight
 root :  125    125
 A    :  500    750
 B    :  250    500
 AA   :  500    500
 AB   : 1000    500
root never has a parent, which makes its weight meaningless. For backward
compatibility, weight is always kept in sync with leaf_weight. B, AA
and AB have no children, and thus their tasks have no child cgroups to
compete with. They always get 100% of what the cgroup won at the
parent level. Considering only the weights which matter, the hierarchy
looks like the following.
          root
       /    |   \
      A     B    leaf
     500   250   125
   /  |  \
  AA  AB  leaf
 500 1000 750
If all cgroups have active IOs and are competing with each other, disk
time will be distributed like the following.
Distribution below root. The total active weight at this level is
A:500 + B:250 + root-leaf:125 = 875.
 root-leaf :   125 /  875      =~ 14%
 A         :   500 /  875      =~ 57%
 B(-leaf)  :   250 /  875      =~ 29%
A has children and further distributes its 57% among the children and
the implicit leaf node. The total active weight at this level is
AA:500 + AB:1000 + A-leaf:750 = 2250.
 A-leaf    : ( 750 / 2250) * A =~ 19%
 AA(-leaf) : ( 500 / 2250) * A =~ 13%
 AB(-leaf) : (1000 / 2250) * A =~ 25%
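The arithmetic in this example can be reproduced with a small user-space sketch. It is not CFQ code: `struct entry`, `level_share()` and `print_example()` are hypothetical names introduced here, and the sketch assumes every cgroup and every implicit leaf is active at the same time.

```c
#include <assert.h>
#include <stdio.h>

/* One competitor at a given level: a child cgroup or the implicit leaf. */
struct entry {
	const char *name;
	unsigned int weight;
};

/*
 * Share of total disk time that entry @idx receives, given the share
 * @parent_share won by its parent and assuming all @nr entries at this
 * level are active.
 */
double level_share(const struct entry *level, int nr, int idx,
		   double parent_share)
{
	unsigned int total = 0;
	int i;

	for (i = 0; i < nr; i++)
		total += level[i].weight;
	assert(total > 0);

	return parent_share * level[idx].weight / total;
}

/* Print the shares for the five-cgroup example hierarchy. */
void print_example(void)
{
	const struct entry below_root[] = {
		{ "root-leaf", 125 }, { "A", 500 }, { "B", 250 },
	};
	const struct entry below_a[] = {
		{ "A-leaf", 750 }, { "AA", 500 }, { "AB", 1000 },
	};
	double a_share = level_share(below_root, 3, 1, 1.0);
	int i;

	for (i = 0; i < 3; i++)
		printf("%-10s %5.1f%%\n", below_root[i].name,
		       100.0 * level_share(below_root, 3, i, 1.0));
	for (i = 0; i < 3; i++)
		printf("%-10s %5.1f%%\n", below_a[i].name,
		       100.0 * level_share(below_a, 3, i, a_share));
}
```

Calling `print_example()` from a `main()` should print roughly 14.3%, 57.1% and 28.6% for the level below root, then 19.0%, 12.7% and 25.4% for the level below A, which is close to the figures listed above.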
CFQ IOPS Mode for group scheduling CFQ IOPS Mode for group scheduling
=================================== ===================================
Basic CFQ design is to provide priority based time slices. Higher priority Basic CFQ design is to provide priority based time slices. Higher priority
......
...@@ -94,13 +94,11 @@ Throttling/Upper Limit policy ...@@ -94,13 +94,11 @@ Throttling/Upper Limit policy
Hierarchical Cgroups Hierarchical Cgroups
==================== ====================
-- Currently none of the IO control policy supports hierarchical groups. But
-  cgroup interface does allow creation of hierarchical cgroups and internally
-  IO policies treat them as flat hierarchy.
-  So this patch will allow creation of cgroup hierarchy but at the backend
-  everything will be treated as flat. So if somebody created a hierarchy like
-  as follows.
+- Currently only CFQ supports hierarchical groups. For throttling,
+  cgroup interface does allow creation of hierarchical cgroups and
+  internally it treats them as flat hierarchy.
+  If somebody created a hierarchy like as follows.
root root
/ \ / \
...@@ -108,16 +106,20 @@ Hierarchical Cgroups ...@@ -108,16 +106,20 @@ Hierarchical Cgroups
| |
test3 test3
-CFQ and throttling will practically treat all groups at same level.
+CFQ will handle the hierarchy correctly but throttling will
+practically treat all groups at same level. For details on CFQ
+hierarchy support, refer to Documentation/block/cfq-iosched.txt.
+Throttling will treat the hierarchy as if it looks like the
+following.
                 pivot
              /  /   \  \
         root  test1 test2  test3
-Down the line we can implement hierarchical accounting/control support
-and also introduce a new cgroup file "use_hierarchy" which will control
-whether cgroup hierarchy is viewed as flat or hierarchical by the policy.
-This is how memory controller also has implemented the things.
+Nesting cgroups, while allowed, isn't officially supported and blkio
+generates a warning when cgroups nest. Once throttling implements
+hierarchy support, hierarchy will be supported and the warning will
+be removed.
Various user visible config options Various user visible config options
=================================== ===================================
...@@ -172,6 +174,12 @@ Proportional weight policy files ...@@ -172,6 +174,12 @@ Proportional weight policy files
dev weight dev weight
8:16 300 8:16 300
- blkio.leaf_weight[_device]
- Equivalents of blkio.weight[_device] for the purpose of
deciding how much weight tasks in the given cgroup have while
competing with the cgroup's child cgroups. For details,
please refer to Documentation/block/cfq-iosched.txt.
- blkio.time - blkio.time
- disk time allocated to cgroup per device in milliseconds. First - disk time allocated to cgroup per device in milliseconds. First
two fields specify the major and minor number of the device and two fields specify the major and minor number of the device and
...@@ -279,6 +287,11 @@ Proportional weight policy files ...@@ -279,6 +287,11 @@ Proportional weight policy files
and minor number of the device and third field specifies the number and minor number of the device and third field specifies the number
of times a group was dequeued from a particular device. of times a group was dequeued from a particular device.
- blkio.*_recursive
- Recursive version of various stats. These files show the
same information as their non-recursive counterparts but
include stats from all the descendant cgroups.
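As a rough illustration of what a recursive stat reports, the sketch below sums a counter over a group and all of its descendants. It is a user-space toy, not the kernel's blkg_stat_recursive_sum(): `struct toy_cgroup`, `MAX_CHILDREN` and `toy_stat_recursive_sum()` are hypothetical, and no locking or RCU is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4	/* arbitrary limit chosen for this sketch */

struct toy_cgroup {
	uint64_t stat;		/* this group's own (non-recursive) counter */
	struct toy_cgroup *children[MAX_CHILDREN];
	size_t nr_children;
};

/* Return @cg's own counter plus the counters of all its descendants. */
uint64_t toy_stat_recursive_sum(const struct toy_cgroup *cg)
{
	uint64_t sum;
	size_t i;

	assert(cg != NULL);
	sum = cg->stat;
	for (i = 0; i < cg->nr_children; i++)
		sum += toy_stat_recursive_sum(cg->children[i]);

	return sum;
}
```

For a leaf group the recursive and non-recursive values coincide; they differ only for groups that have children.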
Throttling/Upper limit policy files Throttling/Upper limit policy files
----------------------------------- -----------------------------------
- blkio.throttle.read_bps_device - blkio.throttle.read_bps_device
......
This diff is collapsed.
...@@ -54,6 +54,7 @@ struct blkcg { ...@@ -54,6 +54,7 @@ struct blkcg {
/* TODO: per-policy storage in blkcg */ /* TODO: per-policy storage in blkcg */
unsigned int cfq_weight; /* belongs to cfq */ unsigned int cfq_weight; /* belongs to cfq */
unsigned int cfq_leaf_weight;
}; };
struct blkg_stat { struct blkg_stat {
...@@ -80,8 +81,9 @@ struct blkg_rwstat { ...@@ -80,8 +81,9 @@ struct blkg_rwstat {
* beginning and pd_size can't be smaller than pd. * beginning and pd_size can't be smaller than pd.
*/ */
struct blkg_policy_data { struct blkg_policy_data {
-	/* the blkg this per-policy data belongs to */
+	/* the blkg and policy id this per-policy data belongs to */
struct blkcg_gq *blkg; struct blkcg_gq *blkg;
int plid;
/* used during policy activation */ /* used during policy activation */
struct list_head alloc_node; struct list_head alloc_node;
...@@ -94,17 +96,27 @@ struct blkcg_gq { ...@@ -94,17 +96,27 @@ struct blkcg_gq {
struct list_head q_node; struct list_head q_node;
struct hlist_node blkcg_node; struct hlist_node blkcg_node;
struct blkcg *blkcg; struct blkcg *blkcg;
/* all non-root blkcg_gq's are guaranteed to have access to parent */
struct blkcg_gq *parent;
/* request allocation list for this blkcg-q pair */ /* request allocation list for this blkcg-q pair */
struct request_list rl; struct request_list rl;
/* reference count */ /* reference count */
int refcnt; int refcnt;
/* is this blkg online? protected by both blkcg and q locks */
bool online;
struct blkg_policy_data *pd[BLKCG_MAX_POLS]; struct blkg_policy_data *pd[BLKCG_MAX_POLS];
struct rcu_head rcu_head; struct rcu_head rcu_head;
}; };
typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg); typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg); typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg);
typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg); typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
...@@ -117,6 +129,8 @@ struct blkcg_policy { ...@@ -117,6 +129,8 @@ struct blkcg_policy {
/* operations */ /* operations */
blkcg_pol_init_pd_fn *pd_init_fn; blkcg_pol_init_pd_fn *pd_init_fn;
blkcg_pol_online_pd_fn *pd_online_fn;
blkcg_pol_offline_pd_fn *pd_offline_fn;
blkcg_pol_exit_pd_fn *pd_exit_fn; blkcg_pol_exit_pd_fn *pd_exit_fn;
blkcg_pol_reset_pd_stats_fn *pd_reset_stats_fn; blkcg_pol_reset_pd_stats_fn *pd_reset_stats_fn;
}; };
...@@ -150,6 +164,10 @@ u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off); ...@@ -150,6 +164,10 @@ u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd, u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
int off); int off);
u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
int off);
struct blkg_conf_ctx { struct blkg_conf_ctx {
struct gendisk *disk; struct gendisk *disk;
struct blkcg_gq *blkg; struct blkcg_gq *blkg;
...@@ -180,6 +198,19 @@ static inline struct blkcg *bio_blkcg(struct bio *bio) ...@@ -180,6 +198,19 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
return task_blkcg(current); return task_blkcg(current);
} }
/**
 * blkcg_parent - get the parent of a blkcg
 * @blkcg: blkcg of interest
 *
 * Return the parent blkcg of @blkcg.  Can be called anytime.
 */
static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
{
	struct cgroup *pcg = blkcg->css.cgroup->parent;

	return pcg ? cgroup_to_blkcg(pcg) : NULL;
}
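A parent accessor that returns NULL at the root makes upward walks straightforward. The following user-space sketch shows the pattern; `struct toy_cg`, `toy_parent()` and `toy_depth()` are hypothetical stand-ins and do not use the kernel's cgroup structures.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cg {
	const char *name;
	struct toy_cg *parent;	/* NULL for the root */
};

/* Stand-in for a parent accessor: returns NULL when @cg is the root. */
struct toy_cg *toy_parent(struct toy_cg *cg)
{
	return cg->parent;
}

/* Count how many ancestors @cg has by walking up to the root. */
int toy_depth(struct toy_cg *cg)
{
	int depth = 0;

	assert(cg != NULL);
	while ((cg = toy_parent(cg)) != NULL)
		depth++;

	return depth;
}
```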
/** /**
* blkg_to_pdata - get policy private data * blkg_to_pdata - get policy private data
* @blkg: blkg of interest * @blkg: blkg of interest
...@@ -386,6 +417,18 @@ static inline void blkg_stat_reset(struct blkg_stat *stat) ...@@ -386,6 +417,18 @@ static inline void blkg_stat_reset(struct blkg_stat *stat)
stat->cnt = 0; stat->cnt = 0;
} }
/**
 * blkg_stat_merge - merge a blkg_stat into another
 * @to: the destination blkg_stat
 * @from: the source
 *
 * Add @from's count to @to.
 */
static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
{
	blkg_stat_add(to, blkg_stat_read(from));
}
/** /**
* blkg_rwstat_add - add a value to a blkg_rwstat * blkg_rwstat_add - add a value to a blkg_rwstat
* @rwstat: target blkg_rwstat * @rwstat: target blkg_rwstat
...@@ -434,14 +477,14 @@ static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat) ...@@ -434,14 +477,14 @@ static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
} }
/** /**
- * blkg_rwstat_sum - read the total count of a blkg_rwstat
+ * blkg_rwstat_total - read the total count of a blkg_rwstat
* @rwstat: blkg_rwstat to read * @rwstat: blkg_rwstat to read
* *
* Return the total count of @rwstat regardless of the IO direction. This * Return the total count of @rwstat regardless of the IO direction. This
* function can be called without synchronization and takes care of u64 * function can be called without synchronization and takes care of u64
* atomicity. * atomicity.
*/ */
-static inline uint64_t blkg_rwstat_sum(struct blkg_rwstat *rwstat)
+static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
{ {
struct blkg_rwstat tmp = blkg_rwstat_read(rwstat); struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
...@@ -457,6 +500,25 @@ static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat) ...@@ -457,6 +500,25 @@ static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
memset(rwstat->cnt, 0, sizeof(rwstat->cnt)); memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
} }
/**
 * blkg_rwstat_merge - merge a blkg_rwstat into another
 * @to: the destination blkg_rwstat
 * @from: the source
 *
 * Add @from's counts to @to.
 */
static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
				     struct blkg_rwstat *from)
{
	struct blkg_rwstat v = blkg_rwstat_read(from);
	int i;

	u64_stats_update_begin(&to->syncp);
	for (i = 0; i < BLKG_RWSTAT_NR; i++)
		to->cnt[i] += v.cnt[i];
	u64_stats_update_end(&to->syncp);
}
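The merge helper above takes a snapshot of the source and then adds it element by element into the destination. A simplified user-space sketch of that shape follows. The `toy_` names and the four-slot layout are assumptions made for illustration, and the u64_stats synchronization used by the kernel version is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter slots; the real enum lives in the blkcg headers. */
enum { TOY_READ, TOY_WRITE, TOY_SYNC, TOY_ASYNC, TOY_NR };

struct toy_rwstat {
	uint64_t cnt[TOY_NR];
};

/* Add @from's counts to @to, working from a snapshot of @from. */
void toy_rwstat_merge(struct toy_rwstat *to, const struct toy_rwstat *from)
{
	struct toy_rwstat v;
	int i;

	assert(to != NULL && from != NULL);
	v = *from;	/* snapshot first, so @to == @from still works */
	for (i = 0; i < TOY_NR; i++)
		to->cnt[i] += v.cnt[i];
}
```

Copying the source before the loop mirrors the read-then-add order of the kernel helper; in this toy it mainly keeps the result well defined when both arguments point at the same object.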
#else /* CONFIG_BLK_CGROUP */ #else /* CONFIG_BLK_CGROUP */
struct cgroup; struct cgroup;
......
...@@ -497,6 +497,13 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr, ...@@ -497,6 +497,13 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
return res; return res;
} }
static void blk_free_queue_rcu(struct rcu_head *rcu_head)
{
	struct request_queue *q = container_of(rcu_head, struct request_queue,
					       rcu_head);
	kmem_cache_free(blk_requestq_cachep, q);
}
/** /**
* blk_release_queue: - release a &struct request_queue when it is no longer needed * blk_release_queue: - release a &struct request_queue when it is no longer needed
* @kobj: the kobj belonging to the request queue to be released * @kobj: the kobj belonging to the request queue to be released
...@@ -538,7 +545,7 @@ static void blk_release_queue(struct kobject *kobj) ...@@ -538,7 +545,7 @@ static void blk_release_queue(struct kobject *kobj)
bdi_destroy(&q->backing_dev_info); bdi_destroy(&q->backing_dev_info);
ida_simple_remove(&blk_queue_ida, q->id); ida_simple_remove(&blk_queue_ida, q->id);
-	kmem_cache_free(blk_requestq_cachep, q);
+	call_rcu(&q->rcu_head, blk_free_queue_rcu);
} }
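The change in this hunk replaces an immediate free with a callback that recovers the queue from an embedded head and frees it later. The user-space sketch below illustrates only that embedding and deferral pattern. It does not implement RCU or any grace period, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Recover the enclosing object from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_head {
	struct toy_head *next;
	void (*func)(struct toy_head *head);
};

struct toy_queue {
	int id;
	struct toy_head head;	/* embedded, like rcu_head in request_queue */
};

static struct toy_head *pending;	/* callbacks waiting to run */
int toy_freed;				/* number of queues freed so far */

/* Queue @func to run on @head later instead of running it now. */
void toy_defer(struct toy_head *head, void (*func)(struct toy_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Run all deferred callbacks; stands in for "after the grace period". */
void toy_run_pending(void)
{
	while (pending) {
		struct toy_head *head = pending;

		pending = head->next;
		assert(head->func != NULL);
		head->func(head);
	}
}

/* Deferred callback: map the head back to its queue and free it. */
void toy_free_queue(struct toy_head *head)
{
	struct toy_queue *q = toy_container_of(head, struct toy_queue, head);

	free(q);
	toy_freed++;
}
```

A caller would allocate a `struct toy_queue`, pass `&q->head` and `toy_free_queue` to `toy_defer()`, and the memory stays valid until `toy_run_pending()` is invoked.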
static const struct sysfs_ops queue_sysfs_ops = { static const struct sysfs_ops queue_sysfs_ops = {
......
This diff is collapsed.
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
#include <linux/gfp.h> #include <linux/gfp.h>
#include <linux/bsg.h> #include <linux/bsg.h>
#include <linux/smp.h> #include <linux/smp.h>
#include <linux/rcupdate.h>
#include <asm/scatterlist.h> #include <asm/scatterlist.h>
...@@ -437,6 +438,7 @@ struct request_queue { ...@@ -437,6 +438,7 @@ struct request_queue {
/* Throttle data */ /* Throttle data */
struct throtl_data *td; struct throtl_data *td;
#endif #endif
struct rcu_head rcu_head;
}; };
#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */ #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
......