Commit d9e44c32 authored by Alexei Starovoitov's avatar Alexei Starovoitov

Merge branch ' docs/bpf: Add description of register liveness tracking algorithm'

Eduard Zingerman says:

====================

An overview of the register tracking liveness algorithm.
Previous versions posted here: [1], [2], [3].
- Changes from RFC to v2 (suggested by Andrii Nakryiko):
  - wording corrected to use term "stack slot" instead of "stack spill";
  - parentage chain diagram updated to show nil links for frame #1;
  - added example for non-BPF_DW writes behavior;
  - explanation in "Read marks propagation for cache hits" is reworked.
- Changes from v2 to v3:
  - lot's of grammatical / wording fixes as suggested by David Vernet;
  - "Register parentage chains" section is fixed to reflect what
    happens to r1-r5 when function call is processed (as suggested by David and Alexei);
  - Example in "Liveness marks tracking" section updated to explain
    why partial writes should not lead to REG_LIVE_WRITTEN marks
    (suggested by David);
  - "Read marks propagation for cache hits" section updates:
    - Explanation updated to hint why read marks should be propagated
      before jumping to example (suggested by David);
    - Removed box around B/D in the diagram updated (suggested by Alexei).
- Changes from v3 to v4 (suggested by Edward Cree):
  - register parentage chain diagram updated to explain why r6 mark is
    not propagated;
  - read mark propagation algorithm pseudo-code fixed to correctly
    show "if state->live & REG_LIVE_WRITTEN" stop condition;
  - general wording improvements in section "Liveness marks tracking".

[1] https://lore.kernel.org/bpf/20230124220343.2942203-1-eddyz87@gmail.com/
[2] https://lore.kernel.org/bpf/20230130182400.630997-1-eddyz87@gmail.com/
[3] https://lore.kernel.org/bpf/20230131181118.733845-1-eddyz87@gmail.com/
====================
Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
parents 354bb4a0 cb601848
......@@ -316,6 +316,301 @@ Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().
Some technical details about state pruning implementation could be found below.
Register liveness tracking
--------------------------
In order to make state pruning effective, liveness state is tracked for each
register and stack slot. The basic idea is to track which registers and stack
slots are actually used during subseqeuent execution of the program, until
program exit is reached. Registers and stack slots that were never used could be
removed from the cached state thus making more states equivalent to a cached
state. This could be illustrated by the following program::
0: call bpf_get_prandom_u32()
1: r1 = 0
2: if r0 == 0 goto +1
3: r0 = 1
--- checkpoint ---
4: r0 = r1
5: exit
Suppose that a state cache entry is created at instruction #4 (such entries are
also called "checkpoints" in the text below). The verifier could reach the
instruction with one of two possible register states:
* r0 = 1, r1 = 0
* r0 = 0, r1 = 0
However, only the value of register ``r1`` is important to successfully finish
verification. The goal of the liveness tracking algorithm is to spot this fact
and figure out that both states are actually equivalent.
Data structures
~~~~~~~~~~~~~~~
Liveness is tracked using the following data structures::
enum bpf_reg_liveness {
REG_LIVE_NONE = 0,
REG_LIVE_READ32 = 0x1,
REG_LIVE_READ64 = 0x2,
REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
REG_LIVE_WRITTEN = 0x4,
REG_LIVE_DONE = 0x8,
};
struct bpf_reg_state {
...
struct bpf_reg_state *parent;
...
enum bpf_reg_liveness live;
...
};
struct bpf_stack_state {
struct bpf_reg_state spilled_ptr;
...
};
struct bpf_func_state {
struct bpf_reg_state regs[MAX_BPF_REG];
...
struct bpf_stack_state *stack;
}
struct bpf_verifier_state {
struct bpf_func_state *frame[MAX_CALL_FRAMES];
struct bpf_verifier_state *parent;
...
}
* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
verifier state creation;
* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
defined by some instruction verified between this verifier state's parent and
verifier state itself;
* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
is read by a some child state of this verifier state;
* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
processing same verifier state multiple times and for some sanity checks;
* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
values using bitwise or.
Register parentage chains
~~~~~~~~~~~~~~~~~~~~~~~~~
In order to propagate information between parent and child states, a *register
parentage chain* is established. Each register or stack slot is linked to a
corresponding register or stack slot in its parent state via a ``->parent``
pointer. This link is established upon state creation in ``is_state_visited()``
and might be modified by ``set_callee_state()`` called from
``__check_func_call()``.
The rules for correspondence between registers / stack slots are as follows:
* For the current stack frame, registers and stack slots of the new state are
linked to the registers and stack slots of the parent state with the same
indices.
* For the outer stack frames, only caller saved registers (r6-r9) and stack
slots are linked to the registers and stack slots of the parent state with the
same indices.
* When function call is processed a new ``struct bpf_func_state`` instance is
allocated, it encapsulates a new set of registers and stack slots. For this
new frame, parent links for r6-r9 and stack slots are set to nil, parent links
for r1-r5 are set to match caller r1-r5 parent links.
This could be illustrated by the following diagram (arrows stand for
``->parent`` pointers)::
... ; Frame #0, some instructions
--- checkpoint #0 ---
1 : r6 = 42 ; Frame #0
--- checkpoint #1 ---
2 : call foo() ; Frame #0
... ; Frame #1, instructions from foo()
--- checkpoint #2 ---
... ; Frame #1, instructions from foo()
--- checkpoint #3 ---
exit ; Frame #1, return from foo()
3 : r1 = r6 ; Frame #0 <- current state
+-------------------------------+-------------------------------+
| Frame #0 | Frame #1 |
Checkpoint +-------------------------------+-------------------------------+
#0 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+
^ ^ ^ ^
| | | |
Checkpoint +-------------------------------+
#1 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+
^ ^ ^
|_______|_______|_______________
| | |
nil nil | | | nil nil
| | | | | | |
Checkpoint +-------------------------------+-------------------------------+
#2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+-------------------------------+
^ ^ ^ ^ ^
nil nil | | | | |
| | | | | | |
Checkpoint +-------------------------------+-------------------------------+
#3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+-------------------------------+
^ ^
nil nil | |
| | | |
Current +-------------------------------+
state | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+
\
r6 read mark is propagated via these links
all the way up to checkpoint #1.
The checkpoint #1 contains a write mark for r6
because of instruction (1), thus read propagation
does not reach checkpoint #0 (see section below).
Liveness marks tracking
~~~~~~~~~~~~~~~~~~~~~~~
For each processed instruction, the verifier tracks read and written registers
and stack slots. The main idea of the algorithm is that read marks propagate
back along the state parentage chain until they hit a write mark, which 'screens
off' earlier states from the read. The information about reads is propagated by
function ``mark_reg_read()`` which could be summarized as follows::
mark_reg_read(struct bpf_reg_state *state, ...):
parent = state->parent
while parent:
if state->live & REG_LIVE_WRITTEN:
break
if parent->live & REG_LIVE_READ64:
break
parent->live |= REG_LIVE_READ64
state = parent
parent = state->parent
Notes:
* The read marks are applied to the **parent** state while write marks are
applied to the **current** state. The write mark on a register or stack slot
means that it is updated by some instruction in the straight-line code leading
from the parent state to the current state.
* Details about REG_LIVE_READ32 are omitted.
* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
might override the first parent link. Please refer to the comments in the
``propagate_liveness()`` and ``mark_reg_read()`` source code for further
details.
Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are
applied conservatively: stack slots are marked as written only if write size
corresponds to the size of the register, e.g. see function ``save_register_state()``.
Consider the following example::
0: (*u64)(r10 - 8) = 0 ; define 8 bytes of fp-8
--- checkpoint #0 ---
1: (*u32)(r10 - 8) = 1 ; redefine lower 4 bytes
2: r1 = (*u32)(r10 - 8) ; read lower 4 bytes defined at (1)
3: r2 = (*u32)(r10 - 4) ; read upper 4 bytes defined at (0)
As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should
it be otherwise, the algorithm above wouldn't be able to propagate the read mark
from (3) to checkpoint #0.
Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is
called to update the ``->branches`` counter for each verifier state in a chain
of parent verifier states. When the ``->branches`` counter reaches zero the
verifier state becomes a valid entry in a set of cached verifier states.
Each entry of the verifier states cache is post-processed by a function
``clean_live_states()``. This function marks all registers and stack slots
without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
Registers/stack slots marked in this way are ignored in function ``stacksafe()``
called from ``states_equal()`` when a state cache entry is considered for
equivalence with a current state.
Now it is possible to explain how the example from the beginning of the section
works::
0: call bpf_get_prandom_u32()
1: r1 = 0
2: if r0 == 0 goto +1
3: r0 = 1
--- checkpoint[0] ---
4: r0 = r1
5: exit
* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }``
is pushed to states processing queue (pc stands for program counter).
* At instruction #4:
* ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
* ``checkpoint[0].r0`` is marked as written;
* ``checkpoint[0].r1`` is marked as read;
* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
by ``clean_live_states()``. After this processing ``checkpoint[0].r0`` has a
read mark and all other registers and stack slots are marked as ``NOT_INIT``
or ``STACK_INVALID``
* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states
are considered equivalent.
.. _read_marks_for_cache_hits:
Read marks propagation for cache hits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Another point is the handling of read marks when a previously verified state is
found in the states cache. Upon cache hit verifier must behave in the same way
as if the current state was verified to the program exit. This means that all
read marks, present on registers and stack slots of the cached state, must be
propagated over the parentage chain of the current state. Example below shows
why this is important. Function ``propagate_liveness()`` handles this case.
Consider the following state parentage chain (S is a starting state, A-E are
derived states, -> arrows show which state is derived from which)::
r1 read
<------------- A[r1] == 0
C[r1] == 0
S ---> A ---> B ---> exit E[r1] == 1
|
` ---> C ---> D
|
` ---> E ^
|___ suppose all these
^ states are at insn #Y
|
suppose all these
states are at insn #X
* Chain of states ``S -> A -> B -> exit`` is verified first.
* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
propagated up to state ``A``.
* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
equivalent to state ``B``.
* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
``C`` might get mistakenly marked as equivalent to state ``E`` even though
values for register ``r1`` differ between ``C`` and ``E``.
Understanding eBPF verifier messages
====================================
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment