Commit 54a611b6 authored by Liam R. Howlett's avatar Liam R. Howlett Committed by Andrew Morton

Maple Tree: add new data structure

Patch series "Introducing the Maple Tree"

The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses.  The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.

The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
long term goal is to reduce or remove the mmap_lock contention.

The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers.  A single write operation will be
allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.

Davidlor said

: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage.  Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complimentary data structures very
: much worth it considering performance does not take a hit.  This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred in prohibitive overhead.  Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.

A similar work has been discovered in the academic press

	https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf

Sheer coincidence.  We designed our tree with the intention of solving the
hardest problem first.  Upon settling on a b-tree variant and a rough
outline, we researched ranged based b-trees and RCU b-trees and did find
that article.  So it was nice to find reassurances that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.

This patch (of 70):

The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses.  The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.

The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
long term goal is to reduce or remove the mmap_lock contention.

The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers.  A single write operation will be
allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.

There is additional BUG_ON() calls added within the tree, most of which
are in debug code.  These will be replaced with a WARN_ON() call in the
future.  There is also additional BUG_ON() calls within the code which
will also be reduced in number at a later date.  These exist to catch
things such as out-of-range accesses which would crash anyways.

Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.comSigned-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: default avatarDavid Howells <dhowells@redhat.com>
Tested-by: default avatarSven Schnelle <svens@linux.ibm.com>
Tested-by: default avatarYu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
parent 9832fb87
......@@ -36,6 +36,7 @@ Library functionality that is used throughout the kernel.
kref
assoc_array
xarray
maple_tree
idr
circular-buffers
rbtree
......
.. SPDX-License-Identifier: GPL-2.0+
==========
Maple Tree
==========
:Author: Liam R. Howlett
Overview
========
The Maple Tree is a B-Tree data type which is optimized for storing
non-overlapping ranges, including ranges of size 1. The tree was designed to
be simple to use and does not require a user written search method. It
supports iterating over a range of entries and going to the previous or next
entry in a cache-efficient manner. The tree can also be put into an RCU-safe
mode of operation which allows reading and writing concurrently. Writers must
synchronize on a lock, which can be the default spinlock, or the user can set
the lock to an external lock of a different type.
The Maple Tree maintains a small memory footprint and was designed to use
modern processor cache efficiently. The majority of the users will be able to
use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex
scenarios. The most important usage of the Maple Tree is the tracking of the
virtual memory areas.
The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple
Tree reserves values with the bottom two bits set to '10' which are below 4096
(ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved
entries then the users can convert the entries using xa_mk_value() and convert
them back by calling xa_to_value(). If the user needs to use a reserved
value, then the user can convert the value when using the
:ref:`maple-tree-advanced-api`, but are blocked by the normal API.
The Maple Tree can also be configured to support searching for a gap of a given
size (or larger).
Pre-allocating of nodes is also supported using the
:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a
successful store operation within a given
code segment when allocating cannot be done. Allocations of nodes are
relatively small at around 256 bytes.
.. _maple-tree-normal-api:
Normal API
==========
Start by initialising a maple tree, either with DEFINE_MTREE() for statically
allocated maple trees or mt_init() for dynamically allocated ones. A
freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0``
- ``ULONG_MAX``. There are currently two types of maple trees supported: the
allocation tree and the regular tree. The regular tree has a higher branching
factor for internal nodes. The allocation tree has a lower branching factor
but allows the user to search for a gap of a given size or larger from either
``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by
passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree.
You can then set entries using mtree_store() or mtree_store_range().
mtree_store() will overwrite any entry with the new entry and return 0 on
success or an error code otherwise. mtree_store_range() works in the same way
but takes a range. mtree_load() is used to retrieve the entry stored at a
given index. You can use mtree_erase() to erase an entire range by only
knowing one value within that range, or mtree_store() call with an entry of
NULL may be used to partially erase a range or many ranges at once.
If you want to only store a new entry to a range (or index) if that range is
currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which
return -EEXIST if the range is not empty.
You can search for an entry from an index upwards by using mt_find().
You can walk each entry within a range by calling mt_for_each(). You must
provide a temporary variable to store a cursor. If you want to walk each
element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
the caller is going to hold the lock for the duration of the walk then it is
worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api`
section.
Sometimes it is necessary to ensure the next call to store to a maple tree does
not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.
Finally, you can remove all entries from a maple tree by calling
mtree_destroy(). If the maple tree entries are pointers, you may wish to free
the entries first.
Allocating Nodes
----------------
The allocations are handled by the internal tree code. See
:ref:`maple-tree-advanced-alloc` for other options.
Locking
-------
You do not have to worry about locking. See :ref:`maple-tree-advanced-locks`
for other options.
The Maple Tree uses RCU and an internal spinlock to synchronise access:
Takes RCU read lock:
* mtree_load()
* mt_find()
* mt_for_each()
* mt_next()
* mt_prev()
Takes ma_lock internally:
* mtree_store()
* mtree_store_range()
* mtree_insert()
* mtree_insert_range()
* mtree_erase()
* mtree_destroy()
* mt_set_in_rcu()
* mt_clear_in_rcu()
If you want to take advantage of the internal lock to protect the data
structures that you are storing in the Maple Tree, you can call mtree_lock()
before calling mtree_load(), then take a reference count on the object you
have found before calling mtree_unlock(). This will prevent stores from
removing the object from the tree between looking up the object and
incrementing the refcount. You can also use RCU to avoid dereferencing
freed memory, but an explanation of that is beyond the scope of this
document.
.. _maple-tree-advanced-api:
Advanced API
============
The advanced API offers more flexibility and better performance at the
cost of an interface which can be harder to use and has fewer safeguards.
You must take care of your own locking while using the advanced API.
You can use the ma_lock, RCU or an external lock for protection.
You can mix advanced and normal operations on the same array, as long
as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented
in terms of the advanced API.
The advanced API is based around the ma_state, this is where the 'mas'
prefix originates. The ma_state struct keeps track of tree operations to make
life easier for both internal and external tree users.
Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
Please see above.
The maple state keeps track of the range start and end in mas->index and
mas->last, respectively.
mas_walk() will walk the tree to the location of mas->index and set the
mas->index and mas->last according to the range for the entry.
You can set entries using mas_store(). mas_store() will overwrite any entry
with the new entry and return the first existing entry that is overwritten.
The range is passed in as members of the maple state: index and last.
You can use mas_erase() to erase an entire range by setting index and
last of the maple state to the desired range to erase. This will erase
the first range that is found in that range, set the maple state index
and last as the range that was erased and return the entry that existed
at that location.
You can walk each entry within a range by using mas_for_each(). If you want
to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
the range. If the lock needs to be periodically dropped, see the locking
section mas_pause().
Using a maple state allows mas_next() and mas_prev() to function as if the
tree was a linked list. With such a high branching factor the amortized
performance penalty is outweighed by cache optimization. mas_next() will
return the next entry which occurs after the entry at index. mas_prev()
will return the previous entry which occurs before the entry at index.
mas_find() will find the first entry which exists at or above index on
the first call, and the next entry from every subsequent calls.
mas_find_rev() will find the fist entry which exists at or below the last on
the first call, and the previous entry from every subsequent calls.
If the user needs to yield the lock during an operation, then the maple state
must be paused using mas_pause().
There are a few extra interfaces provided when using an allocation tree.
If you wish to search for a gap within a range, then mas_empty_area()
or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
starting at the lowest index given up to the maximum of the range.
mas_empty_area_rev() searches for a gap starting at the highest index given
and continues downward to the lower bound of the range.
.. _maple-tree-advanced-alloc:
Advanced Allocating Nodes
-------------------------
Allocations are usually handled internally to the tree, however if allocations
need to occur before a write occurs then calling mas_expected_entries() will
allocate the worst-case number of needed nodes to insert the provided number of
ranges. This also causes the tree to enter mass insertion mode. Once
insertions are complete calling mas_destroy() on the maple state will free the
unused allocations.
.. _maple-tree-advanced-locks:
Advanced Locking
----------------
The maple tree uses a spinlock by default, but external locks can be used for
tree updates as well. To use an external lock, the tree must be initialized
with the ``MT_FLAGS_LOCK_EXTERN flag``, this is usually done with the
MTREE_INIT_EXT() #define, which takes an external lock as an argument.
Functions and structures
========================
.. kernel-doc:: include/linux/maple_tree.h
.. kernel-doc:: lib/maple_tree.c
......@@ -12092,6 +12092,18 @@ L: linux-man@vger.kernel.org
S: Maintained
W: http://www.kernel.org/doc/man-pages
MAPLE TREE
M: Liam R. Howlett <Liam.Howlett@oracle.com>
L: linux-mm@kvack.org
S: Supported
F: Documentation/core-api/maple_tree.rst
F: include/linux/maple_tree.h
F: include/trace/events/maple_tree.h
F: lib/maple_tree.c
F: lib/test_maple_tree.c
F: tools/testing/radix-tree/linux/maple_tree.h
F: tools/testing/radix-tree/maple.c
MARDUK (CREATOR CI40) DEVICE TREE SUPPORT
M: Rahul Bedarkar <rahulbedarkar89@gmail.com>
L: linux-mips@vger.kernel.org
......
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM maple_tree
#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MM_H
#include <linux/tracepoint.h>
struct ma_state;
TRACE_EVENT(ma_op,
TP_PROTO(const char *fn, struct ma_state *mas),
TP_ARGS(fn, mas),
TP_STRUCT__entry(
__field(const char *, fn)
__field(unsigned long, min)
__field(unsigned long, max)
__field(unsigned long, index)
__field(unsigned long, last)
__field(void *, node)
),
TP_fast_assign(
__entry->fn = fn;
__entry->min = mas->min;
__entry->max = mas->max;
__entry->index = mas->index;
__entry->last = mas->last;
__entry->node = mas->node;
),
TP_printk("%s\tNode: %p (%lu %lu) range: %lu-%lu",
__entry->fn,
(void *) __entry->node,
(unsigned long) __entry->min,
(unsigned long) __entry->max,
(unsigned long) __entry->index,
(unsigned long) __entry->last
)
)
TRACE_EVENT(ma_read,
TP_PROTO(const char *fn, struct ma_state *mas),
TP_ARGS(fn, mas),
TP_STRUCT__entry(
__field(const char *, fn)
__field(unsigned long, min)
__field(unsigned long, max)
__field(unsigned long, index)
__field(unsigned long, last)
__field(void *, node)
),
TP_fast_assign(
__entry->fn = fn;
__entry->min = mas->min;
__entry->max = mas->max;
__entry->index = mas->index;
__entry->last = mas->last;
__entry->node = mas->node;
),
TP_printk("%s\tNode: %p (%lu %lu) range: %lu-%lu",
__entry->fn,
(void *) __entry->node,
(unsigned long) __entry->min,
(unsigned long) __entry->max,
(unsigned long) __entry->index,
(unsigned long) __entry->last
)
)
TRACE_EVENT(ma_write,
TP_PROTO(const char *fn, struct ma_state *mas, unsigned long piv,
void *val),
TP_ARGS(fn, mas, piv, val),
TP_STRUCT__entry(
__field(const char *, fn)
__field(unsigned long, min)
__field(unsigned long, max)
__field(unsigned long, index)
__field(unsigned long, last)
__field(unsigned long, piv)
__field(void *, val)
__field(void *, node)
),
TP_fast_assign(
__entry->fn = fn;
__entry->min = mas->min;
__entry->max = mas->max;
__entry->index = mas->index;
__entry->last = mas->last;
__entry->piv = piv;
__entry->val = val;
__entry->node = mas->node;
),
TP_printk("%s\tNode %p (%lu %lu) range:%lu-%lu piv (%lu) val %p",
__entry->fn,
(void *) __entry->node,
(unsigned long) __entry->min,
(unsigned long) __entry->max,
(unsigned long) __entry->index,
(unsigned long) __entry->last,
(unsigned long) __entry->piv,
(void *) __entry->val
)
)
#endif /* _TRACE_MM_H */
/* This part must be outside protection */
#include <trace/define_trace.h>
......@@ -117,6 +117,7 @@ static int kernel_init(void *);
extern void init_IRQ(void);
extern void radix_tree_init(void);
extern void maple_tree_init(void);
/*
* Debug helper: via this flag we know that we are in 'early bootup code'
......@@ -1005,6 +1006,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
"Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
radix_tree_init();
maple_tree_init();
/*
* Set up housekeeping before setting up workqueues to allow the unbound
......
......@@ -820,6 +820,13 @@ config DEBUG_VM_VMACACHE
can cause significant overhead, so only enable it in non-production
environments.
config DEBUG_VM_MAPLE_TREE
bool "Debug VM maple trees"
depends on DEBUG_VM
select DEBUG_MAPLE_TREE
help
Enable VM maple tree debugging information and extra validations.
If unsure, say N.
config DEBUG_VM_RB
......@@ -1635,6 +1642,14 @@ config BUG_ON_DATA_CORRUPTION
If unsure, say N.
config DEBUG_MAPLE_TREE
bool "Debug maple trees"
depends on DEBUG_KERNEL
help
Enable maple tree debugging information and extra validations.
If unsure, say N.
endmenu
config DEBUG_CREDENTIALS
......
......@@ -29,7 +29,7 @@ endif
lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o timerqueue.o xarray.o \
idr.o extable.o irq_regs.o argv_split.o \
maple_tree.o idr.o extable.o irq_regs.o argv_split.o \
flex_proportions.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
......
This diff is collapsed.
......@@ -6,3 +6,5 @@ main
multiorder
radix-tree.c
xarray
maple
ma_xa_benchmark
#define CONFIG_XARRAY_MULTI 1
#define CONFIG_64BIT 1
/* SPDX-License-Identifier: GPL-2.0+ */
#define atomic_t int32_t
#include "../../../../include/linux/maple_tree.h"
#define atomic_inc(x) uatomic_inc(x)
#define atomic_read(x) uatomic_read(x)
#define atomic_set(x, y) do {} while (0)
#define U8_MAX UCHAR_MAX
// SPDX-License-Identifier: GPL-2.0+
/*
* maple_tree.c: Userspace shim for maple tree test-suite
* Copyright (c) 2018 Liam R. Howlett <Liam.Howlett@Oracle.com>
*/
#define CONFIG_DEBUG_MAPLE_TREE
#define CONFIG_MAPLE_SEARCH
#include "test.h"
#define module_init(x)
#define module_exit(x)
#define MODULE_AUTHOR(x)
#define MODULE_LICENSE(x)
#define dump_stack() assert(0)
#include "../../../lib/maple_tree.c"
#undef CONFIG_DEBUG_MAPLE_TREE
#include "../../../lib/test_maple_tree.c"
void farmer_tests(void)
{
struct maple_node *node;
DEFINE_MTREE(tree);
mt_dump(&tree);
tree.ma_root = xa_mk_value(0);
mt_dump(&tree);
node = mt_alloc_one(GFP_KERNEL);
node->parent = (void *)((unsigned long)(&tree) | 1);
node->slot[0] = xa_mk_value(0);
node->slot[1] = xa_mk_value(1);
node->mr64.pivot[0] = 0;
node->mr64.pivot[1] = 1;
node->mr64.pivot[2] = 0;
tree.ma_root = mt_mk_node(node, maple_leaf_64);
mt_dump(&tree);
ma_free_rcu(node);
}
void maple_tree_tests(void)
{
farmer_tests();
maple_tree_seed();
maple_tree_harvest();
}
int __weak main(void)
{
maple_tree_init();
maple_tree_tests();
rcu_barrier();
if (nr_allocated)
printf("nr_allocated = %d\n", nr_allocated);
return 0;
}
/* SPDX-License-Identifier: GPL-2.0+ */
#define trace_ma_op(a, b) do {} while (0)
#define trace_ma_read(a, b) do {} while (0)
#define trace_ma_write(a, b, c, d) do {} while (0)
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment