• Wang Nan's avatar
    perf/core: Add ::write_backward attribute to perf event · 9ecda41a
    Wang Nan authored
    This patch introduces 'write_backward' bit to perf_event_attr, which
    controls the direction of a ring buffer. After set, the corresponding
    ring buffer is written from end to beginning. This feature is design to
    support reading from overwritable ring buffer.
    
    Ring buffer can be created by mapping a perf event fd. Kernel puts event
    records into ring buffer, user tooling like perf fetch them from
    address returned by mmap(). To prevent racing between kernel and tooling,
    they communicate to each other through 'head' and 'tail' pointers.
    Kernel maintains 'head' pointer, points it to the next free area (tail
    of the last record). Tooling maintains 'tail' pointer, points it to the
    tail of last consumed record (record has already been fetched). Kernel
    determines the available space in a ring buffer using these two
    pointers to avoid overwrite unfetched records.
    
    By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
    Different from normal ring buffer, tooling is unable to maintain 'tail'
    pointer because writing is forbidden. Therefore, for this type of ring
    buffers, kernel overwrite old records unconditionally, works like flight
    recorder. This feature would be useful if reading from overwritable ring
    buffer were as easy as reading from normal ring buffer. However,
    there's an obscure problem.
    
    The following figure demonstrates a full overwritable ring buffer. In
    this figure, the 'head' pointer points to the end of last record, and a
    long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
    would have pointed to position (X), so kernel knows there's no more
    space in the ring buffer. However, for an overwritable ring buffer,
    kernel ignore the 'tail' pointer.
    
       (X)                              head
        .                                |
        .                                V
        +------+-------+----------+------+---+
        |A....A|B.....B|C........C|D....D|   |
        +------+-------+----------+------+---+
    
    Record 'A' is overwritten by event 'E':
    
          head
           |
           V
        +--+---+-------+----------+------+---+
        |.E|..A|B.....B|C........C|D....D|E..|
        +--+---+-------+----------+------+---+
    
    Now tooling decides to read from this ring buffer. However, none of these
    two natural positions, 'head' and the start of this ring buffer, are
    pointing to the head of a record. Even the full ring buffer can be
    accessed by tooling, it is unable to find a position to start decoding.
    
    The first attempt tries to solve this problem AFAIK can be found from
    [1]. It makes kernel to maintain 'tail' pointer: updates it when ring
    buffer is half full. However, this approach introduces overhead to
    fast path. Test result shows a 1% overhead [2]. In addition, this method
    utilizes no more tham 50% records.
    
    Another attempt can be found from [3], which allows putting the size of
    an event at the end of each record. This approach allows tooling to find
    records in a backward manner from 'head' pointer by reading size of a
    record from its tail. However, because of alignment requirement, it
    needs 8 bytes to record the size of a record, which is a huge waste. Its
    performance is also not good, because more data need to be written.
    This approach also introduces some extra branch instructions to fast
    path.
    
    'write_backward' is a better solution to this problem.
    
    Following figure demonstrates the state of the overwritable ring buffer
    when 'write_backward' is set before overwriting:
    
           head
            |
            V
        +---+------+----------+-------+------+
        |   |D....D|C........C|B.....B|A....A|
        +---+------+----------+-------+------+
    
    and after overwriting:
                                         head
                                          |
                                          V
        +---+------+----------+-------+---+--+
        |..E|D....D|C........C|B.....B|A..|E.|
        +---+------+----------+-------+---+--+
    
    In each situation, 'head' points to the beginning of the newest record.
    From this record, tooling can iterate over the full ring buffer and fetch
    records one by one.
    
    The only limitation that needs to be considered is back-to-back reading.
    Due to the non-deterministic of user programs, it is impossible to ensure
    the ring buffer keeps stable during reading. Consider an extreme situation:
    tooling is scheduled out after reading record 'D', then a burst of events
    come, eat up the whole ring buffer (one or multiple rounds). When the
    tooling process comes back, reading after 'D' is incorrect now.
    
    To prevent this problem, we need to find a way to ensure the ring buffer
    is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
    suggested because its overhead is lower than
    ioctl(PERF_EVENT_IOC_ENABLE).
    
    By carefully verifying 'header' pointer, reader can avoid pausing the
    ring-buffer. For example:
    
        /* A union of all possible events */
        union perf_event event;
    
        p = head = perf_mmap__read_head();
        while (true) {
            /* copy header of next event */
            fetch(&event.header, p, sizeof(event.header));
    
            /* read 'head' pointer */
            head = perf_mmap__read_head();
    
            /* check overwritten: is the header good? */
            if (!verify(sizeof(event.header), p, head))
                break;
    
            /* copy the whole event */
            fetch(&event, p, event.header.size);
    
            /* read 'head' pointer again */
            head = perf_mmap__read_head();
    
            /* is the whole event good? */
            if (!verify(event.header.size, p, head))
                break;
            p += event.header.size;
        }
    
    However, the overhead is high because:
    
     a) In-place decoding is not safe.
        Copying-verifying-decoding is required.
     b) Fetching 'head' pointer requires additional synchronization.
    
    (From Alexei Starovoitov:
    
    Even when this trick works, pause is needed for more than stability of
    reading. When we collect the events into overwrite buffer we're waiting
    for some other trigger (like all cpu utilization spike or just one cpu
    running and all others are idle) and when it happens the buffer has
    valuable info from the past. At this point new events are no longer
    interesting and buffer should be paused, events read and unpaused until
    next trigger comes.)
    
    This patch utilizes event's default overflow_handler introduced
    previously. perf_event_output_backward() is created as the default
    overflow handler for backward ring buffers. To avoid extra overhead to
    fast path, original perf_event_output() becomes __perf_event_output()
    and marked '__always_inline'. In theory, there's no extra overhead
    introduced to fast path.
    
    Performance testing:
    
    Calling 3000000 times of 'close(-1)', use gettimeofday() to check
    duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
    system calls. In ns.
    
    Testing environment:
    
      CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
      Kernel : v4.5.0
                        MEAN         STDVAR
     BASE            800214.950    2853.083
     PRE1           2253846.700    9997.014
     PRE2           2257495.540    8516.293
     POST           2250896.100    8933.921
    
    Where 'BASE' is pure performance without capturing. 'PRE1' is test
    result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
    patch. 'POST' is test result after this patch. See [4] for the detailed
    experimental setup.
    
    Considering the stdvar, this patch doesn't introduce performance
    overhead to the fast path.
    
     [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
     [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
     [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
     [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.comSigned-off-by: default avatarWang Nan <wangnan0@huawei.com>
    Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
    Cc: <acme@kernel.org>
    Cc: <pi3orama@163.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
    Cc: He Kuang <hekuang@huawei.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Cc: Zefan Li <lizefan@huawei.com>
    Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
    [ Fixed the changelog some more. ]
    Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
    Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
    9ecda41a
perf_event.h 34.6 KB