Commit 36baa6c5 authored by John Esmet, committed by Yoni Fogel

[t:4372] couple more updates to brt comment


git-svn-id: file:///svn/toku/tokudb@39466 c7de825b-a66e-492c-adef-691d508d4ae1
parent 66d77edb
@@ -10,13 +10,28 @@ Managing the tree shape: How insertion, deletion, and querying work
When we insert a message into the BRT, here's what happens.
to insert a message at the root
- find the root node
- capture the next msn of the root node and assign it to the message
- split the root if it needs to be split
- insert the message into the root buffer
- if the root is too full, then flush_some_child() of the root on a flusher thread
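The root insertion steps above can be sketched in C as follows. This is a minimal illustration, not the code in the tree: `struct root_node`, `struct message`, `insert_at_root()`, `split_root()`, and both constants are hypothetical names and values invented for the example. Instead of enqueueing work on a flusher thread, the sketch returns a flag telling the caller that a flush should be scheduled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only to keep the example small. */
enum { BUFFER_SLOTS = 8, FLUSH_THRESHOLD = 4 };

struct message {
    uint64_t msn;      /* message sequence number, assigned at the root */
    int payload;
};

struct root_node {
    uint64_t max_msn;                     /* highest MSN handed out so far */
    size_t n_buffered;                    /* messages currently in the root buffer */
    struct message buffer[BUFFER_SLOTS];
    bool needs_split;                     /* assumed to be set by size tracking elsewhere */
    int times_split;                      /* counts calls to the split stub */
};

/* Stub: a real split would create a new root above two halves. */
void split_root(struct root_node *root)
{
    root->times_split++;
    root->needs_split = false;
}

/*
 * Insert a message at the root.
 * Returns true when the root is "too full", meaning the caller should
 * schedule flush_some_child() of the root on a flusher thread.
 */
bool insert_at_root(struct root_node *root, struct message *msg)
{
    assert(root->n_buffered < BUFFER_SLOTS);

    /* capture the next msn of the root node and assign it to the message */
    msg->msn = ++root->max_msn;

    /* split the root if it needs to be split */
    if (root->needs_split) {
        split_root(root);
    }

    /* insert the message into the root buffer */
    root->buffer[root->n_buffered++] = *msg;

    /* tell the caller whether the root is now too full */
    return root->n_buffered >= FLUSH_THRESHOLD;
}
```

With a zero-initialized root, the first three calls return false and the fourth returns true, and the messages receive MSNs 1 through 4 in insertion order.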
flusher functions use an advice struct, which provides functions to
call that tell them what to do based on the context of the flush. see brt-flusher.h
to flush some child, given a parent and some advice
- pick the child using advice->pick_child()
- remove that child's buffer from the parent
- flush the buffer to the child
- if the child has stable reactivity and
advice->should_recursively_flush() is true, then
flush_some_child() of the child
- otherwise split the child if it needs to be split
- otherwise maybe merge the child if it needs to be merged
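The flush steps above can be sketched in C with an advice struct of function pointers. Only `pick_child`, `should_recursively_flush`, and `flush_some_child` come from the comment; the struct layouts, `get_reactivity()`, `deliver()`, the two example policies, and the constants are assumptions made for illustration and do not reflect the declarations in brt-flusher.h. Buffers are reduced to message counts, and split and merge are reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHILDREN 4
#define LEAF_CAPACITY 2   /* hypothetical: a leaf holding more than this wants to split */

enum reactivity { RE_STABLE, RE_FISSIBLE, RE_FUSIBLE };

struct node {
    int height;                           /* 0 means leaf */
    int n_children;
    int buffered[MAX_CHILDREN];           /* messages buffered here, per child */
    struct node *children[MAX_CHILDREN];
    int applied;                          /* leaf only: messages applied so far */
};

/* The advice tells the flusher what to do in the current context. */
struct flusher_advice {
    int  (*pick_child)(const struct node *parent, void *extra);
    bool (*should_recursively_flush)(const struct node *child, void *extra);
    void *extra;
};

struct flush_counts { int splits; int merges; };

int total_buffered(const struct node *n)
{
    int total = 0;
    for (int i = 0; i < n->n_children; i++) total += n->buffered[i];
    return total;
}

/* Simplification: internal nodes are always treated as stable here. */
enum reactivity get_reactivity(const struct node *n)
{
    if (n->height > 0) return RE_STABLE;
    if (n->applied > LEAF_CAPACITY) return RE_FISSIBLE;
    if (n->applied == 0) return RE_FUSIBLE;
    return RE_STABLE;
}

/* Move msgs messages into child: apply at a leaf, else spread over its buffers. */
void deliver(struct node *child, int msgs)
{
    if (child->height == 0) { child->applied += msgs; return; }
    for (int i = 0; i < msgs; i++) child->buffered[i % child->n_children]++;
}

/* Example policy: flush the child with the fullest buffer. */
int pick_fullest_child(const struct node *parent, void *extra)
{
    (void)extra;
    int best = 0;
    for (int i = 1; i < parent->n_children; i++)
        if (parent->buffered[i] > parent->buffered[best]) best = i;
    return best;
}

/* Example policy: recurse while the child buffers at least *extra messages. */
bool flush_if_over_threshold(const struct node *child, void *extra)
{
    return total_buffered(child) >= *(const int *)extra;
}

void flush_some_child(struct node *parent, const struct flusher_advice *fa,
                      struct flush_counts *counts)
{
    assert(parent->height > 0 && parent->n_children > 0);

    /* pick the child using the advice */
    int childnum = fa->pick_child(parent, fa->extra);
    struct node *child = parent->children[childnum];

    /* remove that child's buffer from the parent, then flush it to the child */
    int msgs = parent->buffered[childnum];
    parent->buffered[childnum] = 0;
    deliver(child, msgs);

    enum reactivity re = get_reactivity(child);
    if (re == RE_STABLE) {
        if (child->height > 0 && fa->should_recursively_flush(child, fa->extra))
            flush_some_child(child, fa, counts);
    } else if (re == RE_FISSIBLE) {
        counts->splits++;    /* stand-in for splitting the child */
    } else {
        counts->merges++;    /* stand-in for maybe merging the child */
    }
}
```

Keeping the policy in the advice struct means the same `flush_some_child()` body can serve different callers, each supplying its own child selection and recursion rule through the function pointers and `extra`.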
flusher threads:
flusher threads are created on demand as the result of internal nodes
becoming gorged by insertions. this allows flushing to be done somewhere
other than the client thread. these work items are enqueued onto
@@ -31,7 +46,7 @@ cleaner threads:
the cleaner thread need not actually do a flush when awoken, so only
nodes that have sufficient cache pressure are flushed.
-checkpoingting:
+checkpointing:
the checkpoint thread wakes up every minute to checkpoint dirty nodes
to disk. at the time of this writing, nodes during checkpoint are
@@ -41,39 +56,6 @@ cleaner threads:
many nodes and preventing other threads from traversing down the tree,
for a query or otherwise.
Flusher functions use an advice struct, which provides functions to
call that tell them what to do based on the context of the flush. see brt-flusher.h
to flush some child, given a parent and some advice
- pick the child using advice->pick_child()
- remove that child's buffer from the parent
- flush the buffer to the child
- if the child has stable reactivity and
advice->should_recursively_flush() is true, then
flush_some_child() of the child
- otherwise split the child if it needs to be split
- otherwise maybe merge the child if it needs to be merged
background flattener
Its state is a height, a key, and a child number
Repeat:
sleep (say 1s)
grab the ydb lock
descend the tree to find the height and key
while the node is not empty:
bring the child into memory (possibly causing a TRY_AGAIN)
move all messages from the node into the child
if the child needs to be split or merged then split or merge the child
set the state to operate on the next relevant node in the depth-first order
That is: if there are more children, increment the child number, and return.
if there are no more children, then return with an error code that says "next". At the first point where the descent is not to the last
child, set the state to visit that node and that child.
if we get back up to the root then the state goes to "root" and "child 0" so the whole background flattener can run again on the next BRT.
Probably only open BRTs get flattened.
It may be important for the flattener not to run if there've been no message insertions since the last time it ran.
The background flattener should also garbage collect MVCC versions. The flattener should remember the MVCC versions it has encountered
so that if any of those are no longer live, it can run again.
To shrink a file: Let X be the size of the reachable data.
We define an acceptable bloat constant of C. For example we set C=2 if we are willing to allow the file to be as much as 2X in size.
The goal is to find the smallest amount of stuff we can move to get the file down to size CX.
@@ -117,7 +99,11 @@ Split_or_merge (node, childnum) {
algorithm on insertions.
- when a node is brought into memory, we apply ancestor messages above it.
- for point queries, we do not read the entire node into memory. instead,
-only the required basement node is read
+we only read in the required basement node
- for range queries, cursors may return cursor continue in their callback
to take a shortcut path to the next row in the basement node.
- for range queries, cursors that prelock a range benefit from
internal prefetching of nodes within that range.
*/