Commit 2a56bb59 authored by Linus Torvalds

Merge tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "New features:

   - Tom Zanussi's extended histogram work.

     This adds the synthetic events to have histograms from multiple
     event data.

     Adds triggers "onmatch" and "onmax" to call the synthetic events.

     Several updates to the histogram code from this.

   - Allow a way to nest ring buffer calls in the same context

   - Allow absolute time stamps in ring buffer

   - Rewrite of filter code parsing based on Al Viro's suggestions

   - Setting of trace_clock to global if TSC is unstable (on boot)

   - Better OOM handling when allocating large ring buffers

   - Added initcall tracepoints (consolidated initcall_debug code with
     them)

  And other various fixes and clean ups"

* tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
  init: Have initcall_debug still work without CONFIG_TRACEPOINTS
  init, tracing: Have printk come through the trace events for initcall_debug
  init, tracing: instrument security and console initcall trace events
  init, tracing: Add initcall trace events
  tracing: Add rcu dereference annotation for test func that touches filter->prog
  tracing: Add rcu dereference annotation for filter->prog
  tracing: Fixup logic inversion on setting trace_global_clock defaults
  tracing: Hide global trace clock from lockdep
  ring-buffer: Add set/clear_current_oom_origin() during allocations
  ring-buffer: Check if memory is available before allocation
  lockdep: Add print_irqtrace_events() to __warn
  vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
  tracing: Uninitialized variable in create_tracing_map_fields()
  tracing: Make sure variable string fields are NULL-terminated
  tracing: Add action comparisons when testing matching hist triggers
  tracing: Don't add flag strings when displaying variable references
  tracing: Fix display of hist trigger expressions containing timestamps
  ftrace: Drop a VLA in module_exists()
  tracing: Mention trace_clock=global when warning about unstable clocks
  tracing: Default to using trace_global_clock if sched_clock is unstable
  ...
parents 9f3a0941 b0dc52f1
@@ -520,1550 +520,4 @@ The following commands are supported:
totals derived from one or more trace event format fields and/or
event counts (hitcount).
See Documentation/trace/histogram.txt for details and examples.
The format of a hist trigger is as follows::
hist:keys=<field1[,field2,...]>[:values=<field1[,field2,...]>]
[:sort=<field1[,field2,...]>][:size=#entries][:pause][:continue]
[:clear][:name=histname1] [if <filter>]
When a matching event is hit, an entry is added to a hash table
using the key(s) and value(s) named. Keys and values correspond to
fields in the event's format description. Values must correspond to
numeric fields - on an event hit, the value(s) will be added to a
sum kept for that field. The special string 'hitcount' can be used
in place of an explicit value field - this is simply a count of
event hits. If 'values' isn't specified, an implicit 'hitcount'
value will be automatically created and used as the only value.
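As a quick illustration (a hypothetical sketch, using the kmem:kmalloc
event whose fields also appear in the examples further below), the
following would create a histogram keyed on call_site whose only
value is the implicit 'hitcount', since no 'values' parameter is
given::
# echo 'hist:keys=call_site' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger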
Keys can be any field, or the special string 'stacktrace', which
will use the event's kernel stacktrace as the key. The keywords
'keys' or 'key' can be used to specify keys, and the keywords
'values', 'vals', or 'val' can be used to specify values. Compound
keys consisting of up to two fields can be specified by the 'keys'
keyword. Hashing a compound key produces a unique entry in the
table for each unique combination of component keys, and can be
useful for providing more fine-grained summaries of event data.
Additionally, sort keys consisting of up to two fields can be
specified by the 'sort' keyword. If more than one field is
specified, the result will be a 'sort within a sort': the first key
is taken to be the primary sort key and the second the secondary
key. If a hist trigger is given a name using the 'name' parameter,
its histogram data will be shared with other triggers of the same
name, and trigger hits will update this common data. Only triggers
with 'compatible' fields can be combined in this way; triggers are
'compatible' if the fields named in the trigger share the same
number and type of fields and those fields also have the same names.
Note that any two events always share the compatible 'hitcount' and
'stacktrace' fields and can therefore be combined using those
fields, however pointless that may be.
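Putting those pieces together, a hypothetical trigger (the name
'syshist' is made up purely for illustration; it is not one of the
examples below) combining a two-field compound key, a secondary sort
key and a shared name might look like::
# echo 'hist:name=syshist:keys=common_pid,id:vals=hitcount:sort=id,hitcount' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger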
'hist' triggers add a 'hist' file to each event's subdirectory.
Reading the 'hist' file for the event will dump the hash table in
its entirety to stdout. If there are multiple hist triggers
attached to an event, there will be a table for each trigger in the
output. The table displayed for a named trigger will be the same as
any other instance having the same name. Each printed hash table
entry is a simple list of the keys and values comprising the entry;
keys are printed first and are delineated by curly braces, and are
followed by the set of value fields for the entry. By default,
numeric fields are displayed as base-10 integers. This can be
modified by appending any of the following modifiers to the field
name:
- .hex display a number as a hex value
- .sym display an address as a symbol
- .sym-offset display an address as a symbol and offset
- .syscall display a syscall id as a system call name
- .execname display a common_pid as a program name
Note that in general the semantics of a given field aren't
interpreted when applying a modifier to it, but there are some
restrictions to be aware of in this regard:
- only the 'hex' modifier can be used for values (because values
are essentially sums, and the other modifiers don't make sense
in that context).
- the 'execname' modifier can only be used on a 'common_pid'. The
reason for this is that the execname is simply the 'comm' value
saved for the 'current' process when an event was triggered,
which is the same as the common_pid value saved by the event
tracing code. Trying to apply that comm value to other pid
values wouldn't be correct, and typically events that care save
pid-specific comm fields in the event itself.
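For example (an illustrative sketch along the lines of the syscall
examples later in this document), the .execname and .syscall
modifiers can be applied to the key fields of the
raw_syscalls:sys_enter event::
# echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger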
A typical usage scenario would be the following to enable a hist
trigger, read its current contents, and then turn it off::
# echo 'hist:keys=skbaddr.hex:vals=len' > \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger
# cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
# echo '!hist:keys=skbaddr.hex:vals=len' > \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger
The trigger file itself can be read to show the details of the
currently attached hist trigger. This information is also displayed
at the top of the 'hist' file when read.
By default, the size of the hash table is 2048 entries. The 'size'
parameter can be used to specify more or fewer than that. The units
are in terms of hashtable entries - if a run uses more entries than
specified, the results will show the number of 'drops', the number
of hits that were ignored. The size should be a power of 2 between
128 and 131072 (any non-power-of-2 number specified will be rounded
up).
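For instance, reusing the netif_rx trigger shown above (an
illustrative sketch, not one of the examples below), a larger table
could be requested with::
# echo 'hist:keys=skbaddr.hex:vals=len:size=8192' > \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger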
The 'sort' parameter can be used to specify a value field to sort
on. The default if unspecified is 'hitcount' and the default sort
order is 'ascending'. To sort in the opposite direction, append
'.descending' to the sort key.
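For example (again an illustrative sketch based on the netif_rx
trigger above), the entries could be sorted by total length with the
largest entries last removed and the largest first by using::
# echo 'hist:keys=skbaddr.hex:vals=len:sort=len.descending' > \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger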
The 'pause' parameter can be used to pause an existing hist trigger
or to start a hist trigger but not log any events until told to do
so. 'continue' or 'cont' can be used to start or restart a paused
hist trigger.
The 'clear' parameter will clear the contents of a running hist
trigger and leave its current paused/active state unchanged.
Note that the 'pause', 'cont', and 'clear' parameters should be
applied using the 'append' shell operator ('>>') if applied to an
existing trigger, rather than via the '>' operator, which will cause
the trigger to be removed through truncation.
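For example, assuming the netif_rx trigger shown above is still
attached (an illustrative sketch), it could be paused and later
resumed with::
# echo 'hist:keys=skbaddr.hex:vals=len:pause' >> \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger
# echo 'hist:keys=skbaddr.hex:vals=len:cont' >> \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger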
- enable_hist/disable_hist
The enable_hist and disable_hist triggers can be used to have one
event conditionally start and stop another event's already-attached
hist trigger. Any number of enable_hist and disable_hist triggers
can be attached to a given event, allowing that event to kick off
and stop aggregations on a host of other events.
The format is very similar to the enable/disable_event triggers::
enable_hist:<system>:<event>[:count]
disable_hist:<system>:<event>[:count]
Instead of enabling or disabling the tracing of the target event
into the trace buffer as the enable/disable_event triggers do, the
enable/disable_hist triggers enable or disable the aggregation of
the target event into a hash table.
A typical usage scenario for the enable_hist/disable_hist triggers
would be to first set up a paused hist trigger on some event,
followed by an enable_hist/disable_hist pair that turns the hist
aggregation on and off when conditions of interest are hit::
# echo 'hist:keys=skbaddr.hex:vals=len:pause' > \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
# echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
/sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
# echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
/sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
The above sets up an initially paused hist trigger which is unpaused
and starts aggregating events when a given program is executed, and
which stops aggregating when the process exits and the hist trigger
is paused again.
The examples below provide a more concrete illustration of the
concepts and typical usage patterns discussed above.
6.2 'hist' trigger examples
---------------------------
The first set of examples creates aggregations using the kmalloc
event. The fields that can be used for the hist trigger are listed
in the kmalloc event's format file::
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/format
name: kmalloc
ID: 374
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long call_site; offset:8; size:8; signed:0;
field:const void * ptr; offset:16; size:8; signed:0;
field:size_t bytes_req; offset:24; size:8; signed:0;
field:size_t bytes_alloc; offset:32; size:8; signed:0;
field:gfp_t gfp_flags; offset:40; size:4; signed:0;
We'll start by creating a hist trigger that generates a simple table
that lists the total number of bytes requested for each function in
the kernel that made one or more calls to kmalloc::
# echo 'hist:key=call_site:val=bytes_req' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
This tells the tracing system to create a 'hist' trigger using the
call_site field of the kmalloc event as the key for the table, which
just means that each unique call_site address will have an entry
created for it in the table. The 'val=bytes_req' parameter tells
the hist trigger that for each unique entry (call_site) in the
table, it should keep a running total of the number of bytes
requested by that call_site.
We'll let it run for awhile and then dump the contents of the 'hist'
file in the kmalloc event's subdirectory (for readability, a number
of entries have been omitted)::
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
{ call_site: 18446744072106379007 } hitcount: 1 bytes_req: 176
{ call_site: 18446744071579557049 } hitcount: 1 bytes_req: 1024
{ call_site: 18446744071580608289 } hitcount: 1 bytes_req: 16384
{ call_site: 18446744071581827654 } hitcount: 1 bytes_req: 24
{ call_site: 18446744071580700980 } hitcount: 1 bytes_req: 8
{ call_site: 18446744071579359876 } hitcount: 1 bytes_req: 152
{ call_site: 18446744071580795365 } hitcount: 3 bytes_req: 144
{ call_site: 18446744071581303129 } hitcount: 3 bytes_req: 144
{ call_site: 18446744071580713234 } hitcount: 4 bytes_req: 2560
{ call_site: 18446744071580933750 } hitcount: 4 bytes_req: 736
.
.
.
{ call_site: 18446744072106047046 } hitcount: 69 bytes_req: 5576
{ call_site: 18446744071582116407 } hitcount: 73 bytes_req: 2336
{ call_site: 18446744072106054684 } hitcount: 136 bytes_req: 140504
{ call_site: 18446744072106224230 } hitcount: 136 bytes_req: 19584
{ call_site: 18446744072106078074 } hitcount: 153 bytes_req: 2448
{ call_site: 18446744072106062406 } hitcount: 153 bytes_req: 36720
{ call_site: 18446744071582507929 } hitcount: 153 bytes_req: 37088
{ call_site: 18446744072102520590 } hitcount: 273 bytes_req: 10920
{ call_site: 18446744071582143559 } hitcount: 358 bytes_req: 716
{ call_site: 18446744072106465852 } hitcount: 417 bytes_req: 56712
{ call_site: 18446744072102523378 } hitcount: 485 bytes_req: 27160
{ call_site: 18446744072099568646 } hitcount: 1676 bytes_req: 33520
Totals:
Hits: 4610
Entries: 45
Dropped: 0
The output displays a line for each entry, beginning with the key
specified in the trigger, followed by the value(s) also specified in
the trigger. At the beginning of the output is a line that displays
the trigger info, which can also be displayed by reading the
'trigger' file::
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
At the end of the output are a few lines that display the overall
totals for the run. The 'Hits' field shows the total number of
times the event trigger was hit, the 'Entries' field shows the total
number of used entries in the hash table, and the 'Dropped' field
shows the number of hits that were dropped because the number of
used entries for the run exceeded the maximum number of entries
allowed for the table (normally 0, but if not a hint that you may
want to increase the size of the table using the 'size' parameter).
Notice in the above output that there's an extra field, 'hitcount',
which wasn't specified in the trigger. Also notice that in the
trigger info output, there's a parameter, 'sort=hitcount', which
wasn't specified in the trigger either. The reason for that is that
every trigger implicitly keeps a count of the total number of hits
attributed to a given entry, called the 'hitcount'. That hitcount
information is explicitly displayed in the output, and in the
absence of a user-specified sort parameter, is used as the default
sort field.
The value 'hitcount' can be used in place of an explicit value in
the 'values' parameter if you don't really need to have any
particular field summed and are mainly interested in hit
frequencies.
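For instance (an illustrative variant of the trigger above, not part
of the example sequence), a histogram that simply counts kmalloc
calls per call_site, without summing any field, could be set up as::
# echo 'hist:key=call_site:val=hitcount' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger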
To turn the hist trigger off, simply call up the trigger in the
command history and re-execute it with a '!' prepended::
# echo '!hist:key=call_site:val=bytes_req' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
Finally, notice that the call_site as displayed in the output above
isn't really very useful. It's an address, but normally addresses
are displayed in hex. To have a numeric field displayed as a hex
value, simply append '.hex' to the field name in the trigger::
# echo 'hist:key=call_site.hex:val=bytes_req' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.hex:vals=bytes_req:sort=hitcount:size=2048 [active]
{ call_site: ffffffffa026b291 } hitcount: 1 bytes_req: 433
{ call_site: ffffffffa07186ff } hitcount: 1 bytes_req: 176
{ call_site: ffffffff811ae721 } hitcount: 1 bytes_req: 16384
{ call_site: ffffffff811c5134 } hitcount: 1 bytes_req: 8
{ call_site: ffffffffa04a9ebb } hitcount: 1 bytes_req: 511
{ call_site: ffffffff8122e0a6 } hitcount: 1 bytes_req: 12
{ call_site: ffffffff8107da84 } hitcount: 1 bytes_req: 152
{ call_site: ffffffff812d8246 } hitcount: 1 bytes_req: 24
{ call_site: ffffffff811dc1e5 } hitcount: 3 bytes_req: 144
{ call_site: ffffffffa02515e8 } hitcount: 3 bytes_req: 648
{ call_site: ffffffff81258159 } hitcount: 3 bytes_req: 144
{ call_site: ffffffff811c80f4 } hitcount: 4 bytes_req: 544
.
.
.
{ call_site: ffffffffa06c7646 } hitcount: 106 bytes_req: 8024
{ call_site: ffffffffa06cb246 } hitcount: 132 bytes_req: 31680
{ call_site: ffffffffa06cef7a } hitcount: 132 bytes_req: 2112
{ call_site: ffffffff8137e399 } hitcount: 132 bytes_req: 23232
{ call_site: ffffffffa06c941c } hitcount: 185 bytes_req: 171360
{ call_site: ffffffffa06f2a66 } hitcount: 185 bytes_req: 26640
{ call_site: ffffffffa036a70e } hitcount: 265 bytes_req: 10600
{ call_site: ffffffff81325447 } hitcount: 292 bytes_req: 584
{ call_site: ffffffffa072da3c } hitcount: 446 bytes_req: 60656
{ call_site: ffffffffa036b1f2 } hitcount: 526 bytes_req: 29456
{ call_site: ffffffffa0099c06 } hitcount: 1780 bytes_req: 35600
Totals:
Hits: 4775
Entries: 46
Dropped: 0
Even that's only marginally more useful - while hex values do look
more like addresses, what users are typically more interested in
when looking at text addresses are the corresponding symbols
instead. To have an address displayed as a symbolic value instead,
simply append '.sym' or '.sym-offset' to the field name in the
trigger::
# echo 'hist:key=call_site.sym:val=bytes_req' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=hitcount:size=2048 [active]
{ call_site: [ffffffff810adcb9] syslog_print_all } hitcount: 1 bytes_req: 1024
{ call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8
{ call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7
{ call_site: [ffffffff8154acbe] usb_alloc_urb } hitcount: 1 bytes_req: 192
{ call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7
{ call_site: [ffffffff811e3a25] __seq_open_private } hitcount: 1 bytes_req: 40
{ call_site: [ffffffff8109524a] alloc_fair_sched_group } hitcount: 2 bytes_req: 128
{ call_site: [ffffffff811febd5] fsnotify_alloc_group } hitcount: 2 bytes_req: 528
{ call_site: [ffffffff81440f58] __tty_buffer_request_room } hitcount: 2 bytes_req: 2624
{ call_site: [ffffffff81200ba6] inotify_new_group } hitcount: 2 bytes_req: 96
{ call_site: [ffffffffa05e19af] ieee80211_start_tx_ba_session [mac80211] } hitcount: 2 bytes_req: 464
{ call_site: [ffffffff81672406] tcp_get_metrics } hitcount: 2 bytes_req: 304
{ call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128
{ call_site: [ffffffff81089b05] sched_create_group } hitcount: 2 bytes_req: 1424
.
.
.
{ call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 1185 bytes_req: 123240
{ call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl [drm] } hitcount: 1185 bytes_req: 104280
{ call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 1402 bytes_req: 190672
{ call_site: [ffffffff812891ca] ext4_find_extent } hitcount: 1518 bytes_req: 146208
{ call_site: [ffffffffa029070e] drm_vma_node_allow [drm] } hitcount: 1746 bytes_req: 69840
{ call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 2021 bytes_req: 792312
{ call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 2592 bytes_req: 145152
{ call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 2629 bytes_req: 378576
{ call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 2629 bytes_req: 3783248
{ call_site: [ffffffff81325607] apparmor_file_alloc_security } hitcount: 5192 bytes_req: 10384
{ call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 5529 bytes_req: 110584
{ call_site: [ffffffff8131ebf7] aa_alloc_task_context } hitcount: 21943 bytes_req: 702176
{ call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 55759 bytes_req: 5074265
Totals:
Hits: 109928
Entries: 71
Dropped: 0
Because the default sort key above is 'hitcount', the above shows the
list of call_sites by increasing hitcount, so that at the bottom
we see the functions that made the most kmalloc calls during the
run. If instead we wanted to see the top kmalloc callers in
terms of the number of bytes requested rather than the number of
calls, and we wanted the top caller to appear at the top, we can use
the 'sort' parameter, along with the 'descending' modifier::
# echo 'hist:key=call_site.sym:val=bytes_req:sort=bytes_req.descending' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
{ call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 2186 bytes_req: 3397464
{ call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 1790 bytes_req: 712176
{ call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 8132 bytes_req: 513135
{ call_site: [ffffffff811e2a1b] seq_buf_alloc } hitcount: 106 bytes_req: 440128
{ call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 2186 bytes_req: 314784
{ call_site: [ffffffff812891ca] ext4_find_extent } hitcount: 2174 bytes_req: 208992
{ call_site: [ffffffff811ae8e1] __kmalloc } hitcount: 8 bytes_req: 131072
{ call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 859 bytes_req: 116824
{ call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 1834 bytes_req: 102704
{ call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 972 bytes_req: 101088
{ call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl [drm] } hitcount: 972 bytes_req: 85536
{ call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 3333 bytes_req: 66664
{ call_site: [ffffffff8137e559] sg_kmalloc } hitcount: 209 bytes_req: 61632
.
.
.
{ call_site: [ffffffff81095225] alloc_fair_sched_group } hitcount: 2 bytes_req: 128
{ call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128
{ call_site: [ffffffff812d8406] copy_semundo } hitcount: 2 bytes_req: 48
{ call_site: [ffffffff81200ba6] inotify_new_group } hitcount: 1 bytes_req: 48
{ call_site: [ffffffffa027121a] drm_getmagic [drm] } hitcount: 1 bytes_req: 48
{ call_site: [ffffffff811e3a25] __seq_open_private } hitcount: 1 bytes_req: 40
{ call_site: [ffffffff811c52f4] bprm_change_interp } hitcount: 2 bytes_req: 16
{ call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8
{ call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7
{ call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7
Totals:
Hits: 32133
Entries: 81
Dropped: 0
To display the offset and size information in addition to the symbol
name, just use 'sym-offset' instead::
# echo 'hist:key=call_site.sym-offset:val=bytes_req:sort=bytes_req.descending' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym-offset:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
{ call_site: [ffffffffa046041c] i915_gem_execbuffer2+0x6c/0x2c0 [i915] } hitcount: 4569 bytes_req: 3163720
{ call_site: [ffffffffa0489a66] intel_ring_begin+0xc6/0x1f0 [i915] } hitcount: 4569 bytes_req: 657936
{ call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23+0x694/0x1020 [i915] } hitcount: 1519 bytes_req: 472936
{ call_site: [ffffffffa045e646] i915_gem_do_execbuffer.isra.23+0x516/0x1020 [i915] } hitcount: 3050 bytes_req: 211832
{ call_site: [ffffffff811e2a1b] seq_buf_alloc+0x1b/0x50 } hitcount: 34 bytes_req: 148384
{ call_site: [ffffffffa04a580c] intel_crtc_page_flip+0xbc/0x870 [i915] } hitcount: 1385 bytes_req: 144040
{ call_site: [ffffffff811ae8e1] __kmalloc+0x191/0x1b0 } hitcount: 8 bytes_req: 131072
{ call_site: [ffffffffa0287592] drm_mode_page_flip_ioctl+0x282/0x360 [drm] } hitcount: 1385 bytes_req: 121880
{ call_site: [ffffffffa02911f2] drm_modeset_lock_crtc+0x32/0x100 [drm] } hitcount: 1848 bytes_req: 103488
{ call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state+0x2c/0xa0 [i915] } hitcount: 461 bytes_req: 62696
{ call_site: [ffffffffa029070e] drm_vma_node_allow+0x2e/0xd0 [drm] } hitcount: 1541 bytes_req: 61640
{ call_site: [ffffffff815f8d7b] sk_prot_alloc+0xcb/0x1b0 } hitcount: 57 bytes_req: 57456
.
.
.
{ call_site: [ffffffff8109524a] alloc_fair_sched_group+0x5a/0x1a0 } hitcount: 2 bytes_req: 128
{ call_site: [ffffffffa027b921] drm_vm_open_locked+0x31/0xa0 [drm] } hitcount: 3 bytes_req: 96
{ call_site: [ffffffff8122e266] proc_self_follow_link+0x76/0xb0 } hitcount: 8 bytes_req: 96
{ call_site: [ffffffff81213e80] load_elf_binary+0x240/0x1650 } hitcount: 3 bytes_req: 84
{ call_site: [ffffffff8154bc62] usb_control_msg+0x42/0x110 } hitcount: 1 bytes_req: 8
{ call_site: [ffffffffa00bf6fe] hidraw_send_report+0x7e/0x1a0 [hid] } hitcount: 1 bytes_req: 7
{ call_site: [ffffffffa00bf1ca] hidraw_report_event+0x8a/0x120 [hid] } hitcount: 1 bytes_req: 7
Totals:
Hits: 26098
Entries: 64
Dropped: 0
We can also add multiple fields to the 'values' parameter. For
example, we might want to see the total number of bytes allocated
alongside bytes requested, and display the result sorted by bytes
allocated in a descending order::
# echo 'hist:keys=call_site.sym:values=bytes_req,bytes_alloc:sort=bytes_alloc.descending' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym:vals=bytes_req,bytes_alloc:sort=bytes_alloc.descending:size=2048 [active]
{ call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 7403 bytes_req: 4084360 bytes_alloc: 5958016
{ call_site: [ffffffff811e2a1b] seq_buf_alloc } hitcount: 541 bytes_req: 2213968 bytes_alloc: 2228224
{ call_site: [ffffffffa0489a66] intel_ring_begin [i915] } hitcount: 7404 bytes_req: 1066176 bytes_alloc: 1421568
{ call_site: [ffffffffa045e7c4] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 1565 bytes_req: 557368 bytes_alloc: 1037760
{ call_site: [ffffffff8125847d] ext4_htree_store_dirent } hitcount: 9557 bytes_req: 595778 bytes_alloc: 695744
{ call_site: [ffffffffa045e646] i915_gem_do_execbuffer.isra.23 [i915] } hitcount: 5839 bytes_req: 430680 bytes_alloc: 470400
{ call_site: [ffffffffa04c4a3c] intel_plane_duplicate_state [i915] } hitcount: 2388 bytes_req: 324768 bytes_alloc: 458496
{ call_site: [ffffffffa02911f2] drm_modeset_lock_crtc [drm] } hitcount: 3911 bytes_req: 219016 bytes_alloc: 250304
{ call_site: [ffffffff815f8d7b] sk_prot_alloc } hitcount: 235 bytes_req: 236880 bytes_alloc: 240640
{ call_site: [ffffffff8137e559] sg_kmalloc } hitcount: 557 bytes_req: 169024 bytes_alloc: 221760
{ call_site: [ffffffffa00b7c06] hid_report_raw_event [hid] } hitcount: 9378 bytes_req: 187548 bytes_alloc: 206312
{ call_site: [ffffffffa04a580c] intel_crtc_page_flip [i915] } hitcount: 1519 bytes_req: 157976 bytes_alloc: 194432
.
.
.
{ call_site: [ffffffff8109bd3b] sched_autogroup_create_attach } hitcount: 2 bytes_req: 144 bytes_alloc: 192
{ call_site: [ffffffff81097ee8] alloc_rt_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
{ call_site: [ffffffff8109524a] alloc_fair_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
{ call_site: [ffffffff81095225] alloc_fair_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
{ call_site: [ffffffff81097ec2] alloc_rt_sched_group } hitcount: 2 bytes_req: 128 bytes_alloc: 128
{ call_site: [ffffffff81213e80] load_elf_binary } hitcount: 3 bytes_req: 84 bytes_alloc: 96
{ call_site: [ffffffff81079a2e] kthread_create_on_node } hitcount: 1 bytes_req: 56 bytes_alloc: 64
{ call_site: [ffffffffa00bf6fe] hidraw_send_report [hid] } hitcount: 1 bytes_req: 7 bytes_alloc: 8
{ call_site: [ffffffff8154bc62] usb_control_msg } hitcount: 1 bytes_req: 8 bytes_alloc: 8
{ call_site: [ffffffffa00bf1ca] hidraw_report_event [hid] } hitcount: 1 bytes_req: 7 bytes_alloc: 8
Totals:
Hits: 66598
Entries: 65
Dropped: 0
Finally, to finish off our kmalloc example, instead of simply having
the hist trigger display symbolic call_sites, we can have the hist
trigger additionally display the complete set of kernel stack traces
that led to each call_site. To do that, we simply use the special
value 'stacktrace' for the key parameter::
# echo 'hist:keys=stacktrace:values=bytes_req,bytes_alloc:sort=bytes_alloc' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
The above trigger will use the kernel stack trace in effect when an
event is triggered as the key for the hash table. This allows the
enumeration of every kernel callpath that led up to a particular
event, along with a running total of any of the event fields for
that event. Here we tally bytes requested and bytes allocated for
every callpath in the system that led up to a kmalloc (in this case
every callpath to a kmalloc for a kernel compile)::
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
{ stacktrace:
__kmalloc_track_caller+0x10b/0x1a0
kmemdup+0x20/0x50
hidraw_report_event+0x8a/0x120 [hid]
hid_report_raw_event+0x3ea/0x440 [hid]
hid_input_report+0x112/0x190 [hid]
hid_irq_in+0xc2/0x260 [usbhid]
__usb_hcd_giveback_urb+0x72/0x120
usb_giveback_urb_bh+0x9e/0xe0
tasklet_hi_action+0xf8/0x100
__do_softirq+0x114/0x2c0
irq_exit+0xa5/0xb0
do_IRQ+0x5a/0xf0
ret_from_intr+0x0/0x30
cpuidle_enter+0x17/0x20
cpu_startup_entry+0x315/0x3e0
rest_init+0x7c/0x80
} hitcount: 3 bytes_req: 21 bytes_alloc: 24
{ stacktrace:
__kmalloc_track_caller+0x10b/0x1a0
kmemdup+0x20/0x50
hidraw_report_event+0x8a/0x120 [hid]
hid_report_raw_event+0x3ea/0x440 [hid]
hid_input_report+0x112/0x190 [hid]
hid_irq_in+0xc2/0x260 [usbhid]
__usb_hcd_giveback_urb+0x72/0x120
usb_giveback_urb_bh+0x9e/0xe0
tasklet_hi_action+0xf8/0x100
__do_softirq+0x114/0x2c0
irq_exit+0xa5/0xb0
do_IRQ+0x5a/0xf0
ret_from_intr+0x0/0x30
} hitcount: 3 bytes_req: 21 bytes_alloc: 24
{ stacktrace:
kmem_cache_alloc_trace+0xeb/0x150
aa_alloc_task_context+0x27/0x40
apparmor_cred_prepare+0x1f/0x50
security_prepare_creds+0x16/0x20
prepare_creds+0xdf/0x1a0
SyS_capset+0xb5/0x200
system_call_fastpath+0x12/0x6a
} hitcount: 1 bytes_req: 32 bytes_alloc: 32
.
.
.
{ stacktrace:
__kmalloc+0x11b/0x1b0
i915_gem_execbuffer2+0x6c/0x2c0 [i915]
drm_ioctl+0x349/0x670 [drm]
do_vfs_ioctl+0x2f0/0x4f0
SyS_ioctl+0x81/0xa0
system_call_fastpath+0x12/0x6a
} hitcount: 17726 bytes_req: 13944120 bytes_alloc: 19593808
{ stacktrace:
__kmalloc+0x11b/0x1b0
load_elf_phdrs+0x76/0xa0
load_elf_binary+0x102/0x1650
search_binary_handler+0x97/0x1d0
do_execveat_common.isra.34+0x551/0x6e0
SyS_execve+0x3a/0x50
return_from_execve+0x0/0x23
} hitcount: 33348 bytes_req: 17152128 bytes_alloc: 20226048
{ stacktrace:
kmem_cache_alloc_trace+0xeb/0x150
apparmor_file_alloc_security+0x27/0x40
security_file_alloc+0x16/0x20
get_empty_filp+0x93/0x1c0
path_openat+0x31/0x5f0
do_filp_open+0x3a/0x90
do_sys_open+0x128/0x220
SyS_open+0x1e/0x20
system_call_fastpath+0x12/0x6a
} hitcount: 4766422 bytes_req: 9532844 bytes_alloc: 38131376
{ stacktrace:
__kmalloc+0x11b/0x1b0
seq_buf_alloc+0x1b/0x50
seq_read+0x2cc/0x370
proc_reg_read+0x3d/0x80
__vfs_read+0x28/0xe0
vfs_read+0x86/0x140
SyS_read+0x46/0xb0
system_call_fastpath+0x12/0x6a
} hitcount: 19133 bytes_req: 78368768 bytes_alloc: 78368768
Totals:
Hits: 6085872
Entries: 253
Dropped: 0
If you key a hist trigger on common_pid, for example in order to
gather and display sorted totals for each process, you can use the
special .execname modifier to display the executable names for the
processes in the table rather than raw pids. The example below
keeps a per-process sum of total bytes read::
# echo 'hist:key=common_pid.execname:val=count:sort=count.descending' > \
/sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/hist
# trigger info: hist:keys=common_pid.execname:vals=count:sort=count.descending:size=2048 [active]
{ common_pid: gnome-terminal [ 3196] } hitcount: 280 count: 1093512
{ common_pid: Xorg [ 1309] } hitcount: 525 count: 256640
{ common_pid: compiz [ 2889] } hitcount: 59 count: 254400
{ common_pid: bash [ 8710] } hitcount: 3 count: 66369
{ common_pid: dbus-daemon-lau [ 8703] } hitcount: 49 count: 47739
{ common_pid: irqbalance [ 1252] } hitcount: 27 count: 27648
{ common_pid: 01ifupdown [ 8705] } hitcount: 3 count: 17216
{ common_pid: dbus-daemon [ 772] } hitcount: 10 count: 12396
{ common_pid: Socket Thread [ 8342] } hitcount: 11 count: 11264
{ common_pid: nm-dhcp-client. [ 8701] } hitcount: 6 count: 7424
{ common_pid: gmain [ 1315] } hitcount: 18 count: 6336
.
.
.
{ common_pid: postgres [ 1892] } hitcount: 2 count: 32
{ common_pid: postgres [ 1891] } hitcount: 2 count: 32
{ common_pid: gmain [ 8704] } hitcount: 2 count: 32
{ common_pid: upstart-dbus-br [ 2740] } hitcount: 21 count: 21
{ common_pid: nm-dispatcher.a [ 8696] } hitcount: 1 count: 16
{ common_pid: indicator-datet [ 2904] } hitcount: 1 count: 16
{ common_pid: gdbus [ 2998] } hitcount: 1 count: 16
{ common_pid: rtkit-daemon [ 2052] } hitcount: 1 count: 8
{ common_pid: init [ 1] } hitcount: 2 count: 2
Totals:
Hits: 2116
Entries: 51
Dropped: 0
Similarly, if you key a hist trigger on syscall id, for example to
gather and display a list of systemwide syscall hits, you can use
the special .syscall modifier to display the syscall names rather
than raw ids. The example below keeps a running total of syscall
counts for the system during the run::
# echo 'hist:key=id.syscall:val=hitcount' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
# cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
# trigger info: hist:keys=id.syscall:vals=hitcount:sort=hitcount:size=2048 [active]
{ id: sys_fsync [ 74] } hitcount: 1
{ id: sys_newuname [ 63] } hitcount: 1
{ id: sys_prctl [157] } hitcount: 1
{ id: sys_statfs [137] } hitcount: 1
{ id: sys_symlink [ 88] } hitcount: 1
{ id: sys_sendmmsg [307] } hitcount: 1
{ id: sys_semctl [ 66] } hitcount: 1
{ id: sys_readlink [ 89] } hitcount: 3
{ id: sys_bind [ 49] } hitcount: 3
{ id: sys_getsockname [ 51] } hitcount: 3
{ id: sys_unlink [ 87] } hitcount: 3
{ id: sys_rename [ 82] } hitcount: 4
{ id: unknown_syscall [ 58] } hitcount: 4
{ id: sys_connect [ 42] } hitcount: 4
{ id: sys_getpid [ 39] } hitcount: 4
.
.
.
{ id: sys_rt_sigprocmask [ 14] } hitcount: 952
{ id: sys_futex [202] } hitcount: 1534
{ id: sys_write [ 1] } hitcount: 2689
{ id: sys_setitimer [ 38] } hitcount: 2797
{ id: sys_read [ 0] } hitcount: 3202
{ id: sys_select [ 23] } hitcount: 3773
{ id: sys_writev [ 20] } hitcount: 4531
{ id: sys_poll [ 7] } hitcount: 8314
{ id: sys_recvmsg [ 47] } hitcount: 13738
{ id: sys_ioctl [ 16] } hitcount: 21843
Totals:
Hits: 67612
Entries: 72
Dropped: 0
The syscall counts above provide a rough overall picture of system
call activity on the system; we can see for example that the most
popular system call on this system was the 'sys_ioctl' system call.
We can use 'compound' keys to refine that number and provide some
further insight as to which processes exactly contribute to the
overall ioctl count.
The command below keeps a hitcount for every unique combination of
system call id and pid - the end result is essentially a table
that keeps a per-pid sum of system call hits. The results are
sorted using the system call id as the primary key, and the
hitcount sum as the secondary key::
# echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
# cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
# trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 [active]
{ id: sys_read [ 0], common_pid: rtkit-daemon [ 1877] } hitcount: 1
{ id: sys_read [ 0], common_pid: gdbus [ 2976] } hitcount: 1
{ id: sys_read [ 0], common_pid: console-kit-dae [ 3400] } hitcount: 1
{ id: sys_read [ 0], common_pid: postgres [ 1865] } hitcount: 1
{ id: sys_read [ 0], common_pid: deja-dup-monito [ 3543] } hitcount: 2
{ id: sys_read [ 0], common_pid: NetworkManager [ 890] } hitcount: 2
{ id: sys_read [ 0], common_pid: evolution-calen [ 3048] } hitcount: 2
{ id: sys_read [ 0], common_pid: postgres [ 1864] } hitcount: 2
{ id: sys_read [ 0], common_pid: nm-applet [ 3022] } hitcount: 2
{ id: sys_read [ 0], common_pid: whoopsie [ 1212] } hitcount: 2
.
.
.
{ id: sys_ioctl [ 16], common_pid: bash [ 8479] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: bash [ 3472] } hitcount: 12
{ id: sys_ioctl [ 16], common_pid: gnome-terminal [ 3199] } hitcount: 16
{ id: sys_ioctl [ 16], common_pid: Xorg [ 1267] } hitcount: 1808
{ id: sys_ioctl [ 16], common_pid: compiz [ 2994] } hitcount: 5580
.
.
.
{ id: sys_waitid [247], common_pid: upstart-dbus-br [ 2690] } hitcount: 3
{ id: sys_waitid [247], common_pid: upstart-dbus-br [ 2688] } hitcount: 16
{ id: sys_inotify_add_watch [254], common_pid: gmain [ 975] } hitcount: 2
{ id: sys_inotify_add_watch [254], common_pid: gmain [ 3204] } hitcount: 4
{ id: sys_inotify_add_watch [254], common_pid: gmain [ 2888] } hitcount: 4
{ id: sys_inotify_add_watch [254], common_pid: gmain [ 3003] } hitcount: 4
{ id: sys_inotify_add_watch [254], common_pid: gmain [ 2873] } hitcount: 4
{ id: sys_inotify_add_watch [254], common_pid: gmain [ 3196] } hitcount: 6
{ id: sys_openat [257], common_pid: java [ 2623] } hitcount: 2
{ id: sys_eventfd2 [290], common_pid: ibus-ui-gtk3 [ 2760] } hitcount: 4
{ id: sys_eventfd2 [290], common_pid: compiz [ 2994] } hitcount: 6
Totals:
Hits: 31536
Entries: 323
Dropped: 0
The above list does give us a breakdown of the ioctl syscall by
pid, but it also gives us quite a bit more than that, which we
don't really care about at the moment. Since we know the syscall
id for sys_ioctl (16, displayed next to the sys_ioctl name), we
can use that to filter out all the other syscalls::
# echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount if id == 16' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
# cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
# trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 if id == 16 [active]
{ id: sys_ioctl [ 16], common_pid: gmain [ 2769] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: evolution-addre [ 8571] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: gmain [ 3003] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: gmain [ 2781] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: gmain [ 2829] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: bash [ 8726] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: bash [ 8508] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: gmain [ 2970] } hitcount: 1
{ id: sys_ioctl [ 16], common_pid: gmain [ 2768] } hitcount: 1
.
.
.
{ id: sys_ioctl [ 16], common_pid: pool [ 8559] } hitcount: 45
{ id: sys_ioctl [ 16], common_pid: pool [ 8555] } hitcount: 48
{ id: sys_ioctl [ 16], common_pid: pool [ 8551] } hitcount: 48
{ id: sys_ioctl [ 16], common_pid: avahi-daemon [ 896] } hitcount: 66
{ id: sys_ioctl [ 16], common_pid: Xorg [ 1267] } hitcount: 26674
{ id: sys_ioctl [ 16], common_pid: compiz [ 2994] } hitcount: 73443
Totals:
Hits: 101162
Entries: 103
Dropped: 0
The above output shows that 'compiz' and 'Xorg' are far and away
the heaviest ioctl callers (which might lead to questions about
whether they really need to be making all those calls and to
possible avenues for further investigation.)
The compound key examples used a key and a sum value (hitcount) to
sort the output, but we can just as easily use two keys instead.
Here's an example where we use a compound key composed of the
common_pid and size event fields. Sorting with pid as the primary
key and 'size' as the secondary key allows us to display an
ordered summary of the recvfrom sizes, with counts, received by
each process::
# echo 'hist:key=common_pid.execname,size:val=hitcount:sort=common_pid,size' > \
/sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/hist
# trigger info: hist:keys=common_pid.execname,size:vals=hitcount:sort=common_pid.execname,size:size=2048 [active]
{ common_pid: smbd [ 784], size: 4 } hitcount: 1
{ common_pid: dnsmasq [ 1412], size: 4096 } hitcount: 672
{ common_pid: postgres [ 1796], size: 1000 } hitcount: 6
{ common_pid: postgres [ 1867], size: 1000 } hitcount: 10
{ common_pid: bamfdaemon [ 2787], size: 28 } hitcount: 2
{ common_pid: bamfdaemon [ 2787], size: 14360 } hitcount: 1
{ common_pid: compiz [ 2994], size: 8 } hitcount: 1
{ common_pid: compiz [ 2994], size: 20 } hitcount: 11
{ common_pid: gnome-terminal [ 3199], size: 4 } hitcount: 2
{ common_pid: firefox [ 8817], size: 4 } hitcount: 1
{ common_pid: firefox [ 8817], size: 8 } hitcount: 5
{ common_pid: firefox [ 8817], size: 588 } hitcount: 2
{ common_pid: firefox [ 8817], size: 628 } hitcount: 1
{ common_pid: firefox [ 8817], size: 6944 } hitcount: 1
{ common_pid: firefox [ 8817], size: 408880 } hitcount: 2
{ common_pid: firefox [ 8822], size: 8 } hitcount: 2
{ common_pid: firefox [ 8822], size: 160 } hitcount: 2
{ common_pid: firefox [ 8822], size: 320 } hitcount: 2
{ common_pid: firefox [ 8822], size: 352 } hitcount: 1
.
.
.
{ common_pid: pool [ 8923], size: 1960 } hitcount: 10
{ common_pid: pool [ 8923], size: 2048 } hitcount: 10
{ common_pid: pool [ 8924], size: 1960 } hitcount: 10
{ common_pid: pool [ 8924], size: 2048 } hitcount: 10
{ common_pid: pool [ 8928], size: 1964 } hitcount: 4
{ common_pid: pool [ 8928], size: 1965 } hitcount: 2
{ common_pid: pool [ 8928], size: 2048 } hitcount: 6
{ common_pid: pool [ 8929], size: 1982 } hitcount: 1
{ common_pid: pool [ 8929], size: 2048 } hitcount: 1
Totals:
Hits: 2016
Entries: 224
Dropped: 0
The above example also illustrates the fact that although a compound
key is treated as a single entity for hashing purposes, the sub-keys
it's composed of can be accessed independently.
The next example uses a string field as the hash key and
demonstrates how you can manually pause and continue a hist trigger.
In this example, we'll aggregate fork counts and don't expect a
large number of entries in the hash table, so we'll drop it to a
much smaller number, say 256::
# echo 'hist:key=child_comm:val=hitcount:size=256' > \
/sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
# trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
{ child_comm: dconf worker } hitcount: 1
{ child_comm: ibus-daemon } hitcount: 1
{ child_comm: whoopsie } hitcount: 1
{ child_comm: smbd } hitcount: 1
{ child_comm: gdbus } hitcount: 1
{ child_comm: kthreadd } hitcount: 1
{ child_comm: dconf worker } hitcount: 1
{ child_comm: evolution-alarm } hitcount: 2
{ child_comm: Socket Thread } hitcount: 2
{ child_comm: postgres } hitcount: 2
{ child_comm: bash } hitcount: 3
{ child_comm: compiz } hitcount: 3
{ child_comm: evolution-sourc } hitcount: 4
{ child_comm: dhclient } hitcount: 4
{ child_comm: pool } hitcount: 5
{ child_comm: nm-dispatcher.a } hitcount: 8
{ child_comm: firefox } hitcount: 8
{ child_comm: dbus-daemon } hitcount: 8
{ child_comm: glib-pacrunner } hitcount: 10
{ child_comm: evolution } hitcount: 23
Totals:
Hits: 89
Entries: 20
Dropped: 0
If we want to pause the hist trigger, we can simply append :pause to
the command that started the trigger. Notice that the trigger info
displays as [paused]::
# echo 'hist:key=child_comm:val=hitcount:size=256:pause' >> \
/sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
# trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [paused]
{ child_comm: dconf worker } hitcount: 1
{ child_comm: kthreadd } hitcount: 1
{ child_comm: dconf worker } hitcount: 1
{ child_comm: gdbus } hitcount: 1
{ child_comm: ibus-daemon } hitcount: 1
{ child_comm: Socket Thread } hitcount: 2
{ child_comm: evolution-alarm } hitcount: 2
{ child_comm: smbd } hitcount: 2
{ child_comm: bash } hitcount: 3
{ child_comm: whoopsie } hitcount: 3
{ child_comm: compiz } hitcount: 3
{ child_comm: evolution-sourc } hitcount: 4
{ child_comm: pool } hitcount: 5
{ child_comm: postgres } hitcount: 6
{ child_comm: firefox } hitcount: 8
{ child_comm: dhclient } hitcount: 10
{ child_comm: emacs } hitcount: 12
{ child_comm: dbus-daemon } hitcount: 20
{ child_comm: nm-dispatcher.a } hitcount: 20
{ child_comm: evolution } hitcount: 35
{ child_comm: glib-pacrunner } hitcount: 59
Totals:
Hits: 199
Entries: 21
Dropped: 0
To manually continue having the trigger aggregate events, append
:cont instead. Notice that the trigger info displays as [active]
again, and the data has changed::
# echo 'hist:key=child_comm:val=hitcount:size=256:cont' >> \
/sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
# trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
{ child_comm: dconf worker } hitcount: 1
{ child_comm: dconf worker } hitcount: 1
{ child_comm: kthreadd } hitcount: 1
{ child_comm: gdbus } hitcount: 1
{ child_comm: ibus-daemon } hitcount: 1
{ child_comm: Socket Thread } hitcount: 2
{ child_comm: evolution-alarm } hitcount: 2
{ child_comm: smbd } hitcount: 2
{ child_comm: whoopsie } hitcount: 3
{ child_comm: compiz } hitcount: 3
{ child_comm: evolution-sourc } hitcount: 4
{ child_comm: bash } hitcount: 5
{ child_comm: pool } hitcount: 5
{ child_comm: postgres } hitcount: 6
{ child_comm: firefox } hitcount: 8
{ child_comm: dhclient } hitcount: 11
{ child_comm: emacs } hitcount: 12
{ child_comm: dbus-daemon } hitcount: 22
{ child_comm: nm-dispatcher.a } hitcount: 22
{ child_comm: evolution } hitcount: 35
{ child_comm: glib-pacrunner } hitcount: 59
Totals:
Hits: 206
Entries: 21
Dropped: 0
The previous example showed how to start and stop a hist trigger by
appending 'pause' and 'continue' to the hist trigger command. A
hist trigger can also be started in a paused state by initially
starting the trigger with ':pause' appended. This allows you to
start the trigger only when you're ready to start collecting data
and not before. For example, you could start the trigger in a
paused state, then unpause it and do something you want to measure,
then pause the trigger again when done.
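Using the sched_process_fork trigger from the previous example, such
a manual sequence might look like this (an illustrative sketch)::
# echo 'hist:key=child_comm:val=hitcount:size=256:pause' > \
/sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
(get ready, then unpause just before the workload of interest)
# echo 'hist:key=child_comm:val=hitcount:size=256:cont' >> \
/sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
(run the workload, then pause again when done)
# echo 'hist:key=child_comm:val=hitcount:size=256:pause' >> \
/sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger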
Of course, doing this manually can be difficult and error-prone, but
it is possible to automatically start and stop a hist trigger based
on some condition, via the enable_hist and disable_hist triggers.
For example, suppose we wanted to take a look at the relative
weights in terms of skb length for each callpath that leads to a
netif_receive_skb event when downloading a decent-sized file using
wget.
First we set up an initially paused stacktrace trigger on the
netif_receive_skb event::
# echo 'hist:key=stacktrace:vals=len:pause' > \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
Next, we set up an 'enable_hist' trigger on the sched_process_exec
event, with an 'if filename==/usr/bin/wget' filter. The effect of
this new trigger is that it will 'unpause' the hist trigger we just
set up on netif_receive_skb if and only if it sees a
sched_process_exec event with a filename of '/usr/bin/wget'. When
that happens, all netif_receive_skb events are aggregated into a
hash table keyed on stacktrace::
# echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
/sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
The aggregation continues until the netif_receive_skb hist trigger is
paused again, which is what the following disable_hist trigger does by
creating a similar setup on the sched_process_exit event, using the
filter 'comm==wget'::
# echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
/sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
Whenever a process exits and the comm field of the disable_hist
trigger filter matches 'comm==wget', the netif_receive_skb hist
trigger is disabled.
The overall effect is that netif_receive_skb events are aggregated
into the hash table for only the duration of the wget. Executing a
wget command and then listing the 'hist' file will display the
output generated by the wget command::
$ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
# cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
# trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
{ stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
netif_receive_skb_internal+0x23/0x90
napi_gro_receive+0xc8/0x100
ieee80211_deliver_skb+0xd6/0x270 [mac80211]
ieee80211_rx_handlers+0xccf/0x22f0 [mac80211]
ieee80211_prepare_and_rx_handle+0x4e7/0xc40 [mac80211]
ieee80211_rx+0x31d/0x900 [mac80211]
iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
iwl_pcie_irq_handler+0xe3c/0x12f0 [iwlwifi]
irq_thread_fn+0x20/0x50
irq_thread+0x11f/0x150
kthread+0xd2/0xf0
ret_from_fork+0x42/0x70
} hitcount: 85 len: 28884
{ stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
netif_receive_skb_internal+0x23/0x90
napi_gro_complete+0xa4/0xe0
dev_gro_receive+0x23a/0x360
napi_gro_receive+0x30/0x100
ieee80211_deliver_skb+0xd6/0x270 [mac80211]
ieee80211_rx_handlers+0xccf/0x22f0 [mac80211]
ieee80211_prepare_and_rx_handle+0x4e7/0xc40 [mac80211]
ieee80211_rx+0x31d/0x900 [mac80211]
iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
iwl_pcie_irq_handler+0xe3c/0x12f0 [iwlwifi]
irq_thread_fn+0x20/0x50
irq_thread+0x11f/0x150
kthread+0xd2/0xf0
} hitcount: 98 len: 664329
{ stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
process_backlog+0xa8/0x150
net_rx_action+0x15d/0x340
__do_softirq+0x114/0x2c0
do_softirq_own_stack+0x1c/0x30
do_softirq+0x65/0x70
__local_bh_enable_ip+0xb5/0xc0
ip_finish_output+0x1f4/0x840
ip_output+0x6b/0xc0
ip_local_out_sk+0x31/0x40
ip_send_skb+0x1a/0x50
udp_send_skb+0x173/0x2a0
udp_sendmsg+0x2bf/0x9f0
inet_sendmsg+0x64/0xa0
sock_sendmsg+0x3d/0x50
} hitcount: 115 len: 13030
{ stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
netif_receive_skb_internal+0x23/0x90
napi_gro_complete+0xa4/0xe0
napi_gro_flush+0x6d/0x90
iwl_pcie_irq_handler+0x92a/0x12f0 [iwlwifi]
irq_thread_fn+0x20/0x50
irq_thread+0x11f/0x150
kthread+0xd2/0xf0
ret_from_fork+0x42/0x70
} hitcount: 934 len: 5512212
Totals:
Hits: 1232
Entries: 4
Dropped: 0
The above shows all the netif_receive_skb callpaths and their total
lengths for the duration of the wget command.
The 'clear' hist trigger param can be used to clear the hash table.
Suppose we wanted to try another run of the previous example but
this time also wanted to see the complete list of events that went
into the histogram. In order to avoid having to set everything up
again, we can just clear the histogram first::
# echo 'hist:key=stacktrace:vals=len:clear' >> \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
Just to verify that it is in fact cleared, here's what we now see in
the hist file::
# cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
# trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
Totals:
Hits: 0
Entries: 0
Dropped: 0
Since we want to see the detailed list of every netif_receive_skb
event occurring during the new run, which are in fact the same
events being aggregated into the hash table, we add 'enable_event'
and 'disable_event' triggers to the triggering sched_process_exec and
sched_process_exit events as such::
# echo 'enable_event:net:netif_receive_skb if filename==/usr/bin/wget' > \
/sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
# echo 'disable_event:net:netif_receive_skb if comm==wget' > \
/sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
If you read the trigger files for the sched_process_exec and
sched_process_exit triggers, you should see two triggers for each:
one enabling/disabling the hist aggregation and the other
enabling/disabling the logging of events::
# cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
enable_event:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
enable_hist:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
# cat /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
disable_event:net:netif_receive_skb:unlimited if comm==wget
disable_hist:net:netif_receive_skb:unlimited if comm==wget
In other words, whenever either of the sched_process_exec or
sched_process_exit events is hit and matches 'wget', it enables or
disables both the histogram and the event log, and what you end up
with is a hash table and set of events just covering the specified
duration. Run the wget command again::
$ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
Displaying the 'hist' file should show something similar to what you
saw in the last run, but this time you should also see the
individual events in the trace file::
# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 183/1426 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
wget-15108 [000] ..s1 31769.606929: netif_receive_skb: dev=lo skbaddr=ffff88009c353100 len=60
wget-15108 [000] ..s1 31769.606999: netif_receive_skb: dev=lo skbaddr=ffff88009c353200 len=60
dnsmasq-1382 [000] ..s1 31769.677652: netif_receive_skb: dev=lo skbaddr=ffff88009c352b00 len=130
dnsmasq-1382 [000] ..s1 31769.685917: netif_receive_skb: dev=lo skbaddr=ffff88009c352200 len=138
##### CPU 2 buffer started ####
irq/29-iwlwifi-559 [002] ..s. 31772.031529: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433d00 len=2948
irq/29-iwlwifi-559 [002] ..s. 31772.031572: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432200 len=1500
irq/29-iwlwifi-559 [002] ..s. 31772.032196: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433100 len=2948
irq/29-iwlwifi-559 [002] ..s. 31772.032761: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433000 len=2948
irq/29-iwlwifi-559 [002] ..s. 31772.033220: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432e00 len=1500
....
The following example demonstrates how multiple hist triggers can be
attached to a given event. This capability can be useful for
creating a set of different summaries derived from the same set of
events, or for comparing the effects of different filters, among
other things.
::
# echo 'hist:keys=skbaddr.hex:vals=len if len < 0' >> \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=skbaddr.hex:vals=len if len > 4096' >> \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=skbaddr.hex:vals=len if len == 256' >> \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=skbaddr.hex:vals=len' >> \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=len:vals=common_preempt_count' >> \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
The above set of commands creates four triggers differing only in
their filters, along with a completely different though fairly
nonsensical trigger. Note that in order to append multiple hist
triggers to the same file, you should use the '>>' operator to
append them ('>' will also add the new hist trigger, but will remove
any existing hist triggers beforehand).
Displaying the contents of the 'hist' file for the event shows the
contents of all five histograms::
# cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
# event histogram
#
# trigger info: hist:keys=len:vals=hitcount,common_preempt_count:sort=hitcount:size=2048 [active]
#
{ len: 176 } hitcount: 1 common_preempt_count: 0
{ len: 223 } hitcount: 1 common_preempt_count: 0
{ len: 4854 } hitcount: 1 common_preempt_count: 0
{ len: 395 } hitcount: 1 common_preempt_count: 0
{ len: 177 } hitcount: 1 common_preempt_count: 0
{ len: 446 } hitcount: 1 common_preempt_count: 0
{ len: 1601 } hitcount: 1 common_preempt_count: 0
.
.
.
{ len: 1280 } hitcount: 66 common_preempt_count: 0
{ len: 116 } hitcount: 81 common_preempt_count: 40
{ len: 708 } hitcount: 112 common_preempt_count: 0
{ len: 46 } hitcount: 221 common_preempt_count: 0
{ len: 1264 } hitcount: 458 common_preempt_count: 0
Totals:
Hits: 1428
Entries: 147
Dropped: 0
# event histogram
#
# trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
#
{ skbaddr: ffff8800baee5e00 } hitcount: 1 len: 130
{ skbaddr: ffff88005f3d5600 } hitcount: 1 len: 1280
{ skbaddr: ffff88005f3d4900 } hitcount: 1 len: 1280
{ skbaddr: ffff88009fed6300 } hitcount: 1 len: 115
{ skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 115
{ skbaddr: ffff88008cdb1900 } hitcount: 1 len: 46
{ skbaddr: ffff880064b5ef00 } hitcount: 1 len: 118
{ skbaddr: ffff880044e3c700 } hitcount: 1 len: 60
{ skbaddr: ffff880100065900 } hitcount: 1 len: 46
{ skbaddr: ffff8800d46bd500 } hitcount: 1 len: 116
{ skbaddr: ffff88005f3d5f00 } hitcount: 1 len: 1280
{ skbaddr: ffff880100064700 } hitcount: 1 len: 365
{ skbaddr: ffff8800badb6f00 } hitcount: 1 len: 60
.
.
.
{ skbaddr: ffff88009fe0be00 } hitcount: 27 len: 24677
{ skbaddr: ffff88009fe0a400 } hitcount: 27 len: 23052
{ skbaddr: ffff88009fe0b700 } hitcount: 31 len: 25589
{ skbaddr: ffff88009fe0b600 } hitcount: 32 len: 27326
{ skbaddr: ffff88006a462800 } hitcount: 68 len: 71678
{ skbaddr: ffff88006a463700 } hitcount: 70 len: 72678
{ skbaddr: ffff88006a462b00 } hitcount: 71 len: 77589
{ skbaddr: ffff88006a463600 } hitcount: 73 len: 71307
{ skbaddr: ffff88006a462200 } hitcount: 81 len: 81032
Totals:
Hits: 1451
Entries: 318
Dropped: 0
# event histogram
#
# trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len == 256 [active]
#
Totals:
Hits: 0
Entries: 0
Dropped: 0
# event histogram
#
# trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len > 4096 [active]
#
{ skbaddr: ffff88009fd2c300 } hitcount: 1 len: 7212
{ skbaddr: ffff8800d2bcce00 } hitcount: 1 len: 7212
{ skbaddr: ffff8800d2bcd700 } hitcount: 1 len: 7212
{ skbaddr: ffff8800d2bcda00 } hitcount: 1 len: 21492
{ skbaddr: ffff8800ae2e2d00 } hitcount: 1 len: 7212
{ skbaddr: ffff8800d2bcdb00 } hitcount: 1 len: 7212
{ skbaddr: ffff88006a4df500 } hitcount: 1 len: 4854
{ skbaddr: ffff88008ce47b00 } hitcount: 1 len: 18636
{ skbaddr: ffff8800ae2e2200 } hitcount: 1 len: 12924
{ skbaddr: ffff88005f3e1000 } hitcount: 1 len: 4356
{ skbaddr: ffff8800d2bcdc00 } hitcount: 2 len: 24420
{ skbaddr: ffff8800d2bcc200 } hitcount: 2 len: 12996
Totals:
Hits: 14
Entries: 12
Dropped: 0
# event histogram
#
# trigger info: hist:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 if len < 0 [active]
#
Totals:
Hits: 0
Entries: 0
Dropped: 0
Named triggers can be used to have triggers share a common set of
histogram data. This capability is mostly useful for combining the
output of events generated by tracepoints contained inside inline
functions, but names can be used in a hist trigger on any event.
For example, these two triggers when hit will update the same 'len'
field in the shared 'foo' histogram data::
# echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
/sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger
You can see that they're updating common histogram data by reading
each event's hist files at the same time::
# cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist;
cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
# event histogram
#
# trigger info: hist:name=foo:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
#
{ skbaddr: ffff88000ad53500 } hitcount: 1 len: 46
{ skbaddr: ffff8800af5a1500 } hitcount: 1 len: 76
{ skbaddr: ffff8800d62a1900 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bccb00 } hitcount: 1 len: 468
{ skbaddr: ffff8800d3c69900 } hitcount: 1 len: 46
{ skbaddr: ffff88009ff09100 } hitcount: 1 len: 52
{ skbaddr: ffff88010f13ab00 } hitcount: 1 len: 168
{ skbaddr: ffff88006a54f400 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bcc500 } hitcount: 1 len: 260
{ skbaddr: ffff880064505000 } hitcount: 1 len: 46
{ skbaddr: ffff8800baf24e00 } hitcount: 1 len: 32
{ skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 46
{ skbaddr: ffff8800d3edff00 } hitcount: 1 len: 44
{ skbaddr: ffff88009fe0b400 } hitcount: 1 len: 168
{ skbaddr: ffff8800a1c55a00 } hitcount: 1 len: 40
{ skbaddr: ffff8800d2bcd100 } hitcount: 1 len: 40
{ skbaddr: ffff880064505f00 } hitcount: 1 len: 174
{ skbaddr: ffff8800a8bff200 } hitcount: 1 len: 160
{ skbaddr: ffff880044e3cc00 } hitcount: 1 len: 76
{ skbaddr: ffff8800a8bfe700 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bcdc00 } hitcount: 1 len: 32
{ skbaddr: ffff8800a1f64800 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bcde00 } hitcount: 1 len: 988
{ skbaddr: ffff88006a5dea00 } hitcount: 1 len: 46
{ skbaddr: ffff88002e37a200 } hitcount: 1 len: 44
{ skbaddr: ffff8800a1f32c00 } hitcount: 2 len: 676
{ skbaddr: ffff88000ad52600 } hitcount: 2 len: 107
{ skbaddr: ffff8800a1f91e00 } hitcount: 2 len: 92
{ skbaddr: ffff8800af5a0200 } hitcount: 2 len: 142
{ skbaddr: ffff8800d2bcc600 } hitcount: 2 len: 220
{ skbaddr: ffff8800ba36f500 } hitcount: 2 len: 92
{ skbaddr: ffff8800d021f800 } hitcount: 2 len: 92
{ skbaddr: ffff8800a1f33600 } hitcount: 2 len: 675
{ skbaddr: ffff8800a8bfff00 } hitcount: 3 len: 138
{ skbaddr: ffff8800d62a1300 } hitcount: 3 len: 138
{ skbaddr: ffff88002e37a100 } hitcount: 4 len: 184
{ skbaddr: ffff880064504400 } hitcount: 4 len: 184
{ skbaddr: ffff8800a8bfec00 } hitcount: 4 len: 184
{ skbaddr: ffff88000ad53700 } hitcount: 5 len: 230
{ skbaddr: ffff8800d2bcdb00 } hitcount: 5 len: 196
{ skbaddr: ffff8800a1f90000 } hitcount: 6 len: 276
{ skbaddr: ffff88006a54f900 } hitcount: 6 len: 276
Totals:
Hits: 81
Entries: 42
Dropped: 0
# event histogram
#
# trigger info: hist:name=foo:keys=skbaddr.hex:vals=hitcount,len:sort=hitcount:size=2048 [active]
#
{ skbaddr: ffff88000ad53500 } hitcount: 1 len: 46
{ skbaddr: ffff8800af5a1500 } hitcount: 1 len: 76
{ skbaddr: ffff8800d62a1900 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bccb00 } hitcount: 1 len: 468
{ skbaddr: ffff8800d3c69900 } hitcount: 1 len: 46
{ skbaddr: ffff88009ff09100 } hitcount: 1 len: 52
{ skbaddr: ffff88010f13ab00 } hitcount: 1 len: 168
{ skbaddr: ffff88006a54f400 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bcc500 } hitcount: 1 len: 260
{ skbaddr: ffff880064505000 } hitcount: 1 len: 46
{ skbaddr: ffff8800baf24e00 } hitcount: 1 len: 32
{ skbaddr: ffff88009fe0ad00 } hitcount: 1 len: 46
{ skbaddr: ffff8800d3edff00 } hitcount: 1 len: 44
{ skbaddr: ffff88009fe0b400 } hitcount: 1 len: 168
{ skbaddr: ffff8800a1c55a00 } hitcount: 1 len: 40
{ skbaddr: ffff8800d2bcd100 } hitcount: 1 len: 40
{ skbaddr: ffff880064505f00 } hitcount: 1 len: 174
{ skbaddr: ffff8800a8bff200 } hitcount: 1 len: 160
{ skbaddr: ffff880044e3cc00 } hitcount: 1 len: 76
{ skbaddr: ffff8800a8bfe700 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bcdc00 } hitcount: 1 len: 32
{ skbaddr: ffff8800a1f64800 } hitcount: 1 len: 46
{ skbaddr: ffff8800d2bcde00 } hitcount: 1 len: 988
{ skbaddr: ffff88006a5dea00 } hitcount: 1 len: 46
{ skbaddr: ffff88002e37a200 } hitcount: 1 len: 44
{ skbaddr: ffff8800a1f32c00 } hitcount: 2 len: 676
{ skbaddr: ffff88000ad52600 } hitcount: 2 len: 107
{ skbaddr: ffff8800a1f91e00 } hitcount: 2 len: 92
{ skbaddr: ffff8800af5a0200 } hitcount: 2 len: 142
{ skbaddr: ffff8800d2bcc600 } hitcount: 2 len: 220
{ skbaddr: ffff8800ba36f500 } hitcount: 2 len: 92
{ skbaddr: ffff8800d021f800 } hitcount: 2 len: 92
{ skbaddr: ffff8800a1f33600 } hitcount: 2 len: 675
{ skbaddr: ffff8800a8bfff00 } hitcount: 3 len: 138
{ skbaddr: ffff8800d62a1300 } hitcount: 3 len: 138
{ skbaddr: ffff88002e37a100 } hitcount: 4 len: 184
{ skbaddr: ffff880064504400 } hitcount: 4 len: 184
{ skbaddr: ffff8800a8bfec00 } hitcount: 4 len: 184
{ skbaddr: ffff88000ad53700 } hitcount: 5 len: 230
{ skbaddr: ffff8800d2bcdb00 } hitcount: 5 len: 196
{ skbaddr: ffff8800a1f90000 } hitcount: 6 len: 276
{ skbaddr: ffff88006a54f900 } hitcount: 6 len: 276
Totals:
Hits: 81
Entries: 42
Dropped: 0
And here's an example that shows how to combine histogram data from
any two events even if they don't share any 'compatible' fields
other than 'hitcount' and 'stacktrace'. These commands create a
couple of triggers named 'bar' using those fields::
# echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
/sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
# echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
/sys/kernel/debug/tracing/events/net/netif_rx/trigger
Displaying either event's hist file shows some interesting if
somewhat confusing output::
# cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
# cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
# event histogram
#
# trigger info: hist:name=bar:keys=stacktrace:vals=hitcount:sort=hitcount:size=2048 [active]
#
{ stacktrace:
_do_fork+0x18e/0x330
kernel_thread+0x29/0x30
kthreadd+0x154/0x1b0
ret_from_fork+0x3f/0x70
} hitcount: 1
{ stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx_ni+0x20/0x70
dev_loopback_xmit+0xaa/0xd0
ip_mc_output+0x126/0x240
ip_local_out_sk+0x31/0x40
igmp_send_report+0x1e9/0x230
igmp_timer_expire+0xe9/0x120
call_timer_fn+0x39/0xf0
run_timer_softirq+0x1e1/0x290
__do_softirq+0xfd/0x290
irq_exit+0x98/0xb0
smp_apic_timer_interrupt+0x4a/0x60
apic_timer_interrupt+0x6d/0x80
cpuidle_enter+0x17/0x20
call_cpuidle+0x3b/0x60
cpu_startup_entry+0x22d/0x310
} hitcount: 1
{ stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx_ni+0x20/0x70
dev_loopback_xmit+0xaa/0xd0
ip_mc_output+0x17f/0x240
ip_local_out_sk+0x31/0x40
ip_send_skb+0x1a/0x50
udp_send_skb+0x13e/0x270
udp_sendmsg+0x2bf/0x980
inet_sendmsg+0x67/0xa0
sock_sendmsg+0x38/0x50
SYSC_sendto+0xef/0x170
SyS_sendto+0xe/0x10
entry_SYSCALL_64_fastpath+0x12/0x6a
} hitcount: 2
{ stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx+0x1c/0x60
loopback_xmit+0x6c/0xb0
dev_hard_start_xmit+0x219/0x3a0
__dev_queue_xmit+0x415/0x4f0
dev_queue_xmit_sk+0x13/0x20
ip_finish_output2+0x237/0x340
ip_finish_output+0x113/0x1d0
ip_output+0x66/0xc0
ip_local_out_sk+0x31/0x40
ip_send_skb+0x1a/0x50
udp_send_skb+0x16d/0x270
udp_sendmsg+0x2bf/0x980
inet_sendmsg+0x67/0xa0
sock_sendmsg+0x38/0x50
___sys_sendmsg+0x14e/0x270
} hitcount: 76
{ stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx+0x1c/0x60
loopback_xmit+0x6c/0xb0
dev_hard_start_xmit+0x219/0x3a0
__dev_queue_xmit+0x415/0x4f0
dev_queue_xmit_sk+0x13/0x20
ip_finish_output2+0x237/0x340
ip_finish_output+0x113/0x1d0
ip_output+0x66/0xc0
ip_local_out_sk+0x31/0x40
ip_send_skb+0x1a/0x50
udp_send_skb+0x16d/0x270
udp_sendmsg+0x2bf/0x980
inet_sendmsg+0x67/0xa0
sock_sendmsg+0x38/0x50
___sys_sendmsg+0x269/0x270
} hitcount: 77
{ stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx+0x1c/0x60
loopback_xmit+0x6c/0xb0
dev_hard_start_xmit+0x219/0x3a0
__dev_queue_xmit+0x415/0x4f0
dev_queue_xmit_sk+0x13/0x20
ip_finish_output2+0x237/0x340
ip_finish_output+0x113/0x1d0
ip_output+0x66/0xc0
ip_local_out_sk+0x31/0x40
ip_send_skb+0x1a/0x50
udp_send_skb+0x16d/0x270
udp_sendmsg+0x2bf/0x980
inet_sendmsg+0x67/0xa0
sock_sendmsg+0x38/0x50
SYSC_sendto+0xef/0x170
} hitcount: 88
{ stacktrace:
_do_fork+0x18e/0x330
SyS_clone+0x19/0x20
entry_SYSCALL_64_fastpath+0x12/0x6a
} hitcount: 244
Totals:
Hits: 489
Entries: 7
Dropped: 0
...@@ -543,6 +543,30 @@ of ftrace. Here is a list of some of the key files: ...@@ -543,6 +543,30 @@ of ftrace. Here is a list of some of the key files:
See events.txt for more information. See events.txt for more information.
timestamp_mode:
Certain tracers may change the timestamp mode used when
logging trace events into the event buffer. Events with
different modes can coexist within a buffer but the mode in
effect when an event is logged determines which timestamp mode
is used for that event. The default timestamp mode is
'delta'.
Usual timestamp modes for tracing:
# cat timestamp_mode
[delta] absolute
The timestamp mode with the square brackets around it is the
one in effect.
delta: Default timestamp mode - timestamp is a delta against
a per-buffer timestamp.
absolute: The timestamp is a full timestamp, not a delta
against some other value. As such it takes up more
space and is less efficient.
hwlat_detector: hwlat_detector:
Directory for the Hardware Latency Detector. Directory for the Hardware Latency Detector.
......
...@@ -34,10 +34,12 @@ struct ring_buffer_event { ...@@ -34,10 +34,12 @@ struct ring_buffer_event {
* array[0] = time delta (28 .. 59) * array[0] = time delta (28 .. 59)
* size = 8 bytes * size = 8 bytes
* *
* @RINGBUF_TYPE_TIME_STAMP: Sync time stamp with external clock * @RINGBUF_TYPE_TIME_STAMP: Absolute timestamp
* array[0] = tv_nsec * Same format as TIME_EXTEND except that the
* array[1..2] = tv_sec * value is an absolute timestamp, not a delta
* size = 16 bytes * event.time_delta contains bottom 27 bits
* array[0] = top (28 .. 59) bits
* size = 8 bytes
* *
* <= @RINGBUF_TYPE_DATA_TYPE_LEN_MAX: * <= @RINGBUF_TYPE_DATA_TYPE_LEN_MAX:
* Data record * Data record
...@@ -54,12 +56,12 @@ enum ring_buffer_type { ...@@ -54,12 +56,12 @@ enum ring_buffer_type {
RINGBUF_TYPE_DATA_TYPE_LEN_MAX = 28, RINGBUF_TYPE_DATA_TYPE_LEN_MAX = 28,
RINGBUF_TYPE_PADDING, RINGBUF_TYPE_PADDING,
RINGBUF_TYPE_TIME_EXTEND, RINGBUF_TYPE_TIME_EXTEND,
/* FIXME: RINGBUF_TYPE_TIME_STAMP not implemented */
RINGBUF_TYPE_TIME_STAMP, RINGBUF_TYPE_TIME_STAMP,
}; };
unsigned ring_buffer_event_length(struct ring_buffer_event *event); unsigned ring_buffer_event_length(struct ring_buffer_event *event);
void *ring_buffer_event_data(struct ring_buffer_event *event); void *ring_buffer_event_data(struct ring_buffer_event *event);
u64 ring_buffer_event_time_stamp(struct ring_buffer_event *event);
/* /*
* ring_buffer_discard_commit will remove an event that has not * ring_buffer_discard_commit will remove an event that has not
...@@ -115,6 +117,9 @@ int ring_buffer_unlock_commit(struct ring_buffer *buffer, ...@@ -115,6 +117,9 @@ int ring_buffer_unlock_commit(struct ring_buffer *buffer,
int ring_buffer_write(struct ring_buffer *buffer, int ring_buffer_write(struct ring_buffer *buffer,
unsigned long length, void *data); unsigned long length, void *data);
void ring_buffer_nest_start(struct ring_buffer *buffer);
void ring_buffer_nest_end(struct ring_buffer *buffer);
struct ring_buffer_event * struct ring_buffer_event *
ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts, ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts,
unsigned long *lost_events); unsigned long *lost_events);
...@@ -178,6 +183,8 @@ void ring_buffer_normalize_time_stamp(struct ring_buffer *buffer, ...@@ -178,6 +183,8 @@ void ring_buffer_normalize_time_stamp(struct ring_buffer *buffer,
int cpu, u64 *ts); int cpu, u64 *ts);
void ring_buffer_set_clock(struct ring_buffer *buffer, void ring_buffer_set_clock(struct ring_buffer *buffer,
u64 (*clock)(void)); u64 (*clock)(void));
void ring_buffer_set_time_stamp_abs(struct ring_buffer *buffer, bool abs);
bool ring_buffer_time_stamp_abs(struct ring_buffer *buffer);
size_t ring_buffer_page_len(void *page); size_t ring_buffer_page_len(void *page);
......
...@@ -430,11 +430,13 @@ enum event_trigger_type { ...@@ -430,11 +430,13 @@ enum event_trigger_type {
extern int filter_match_preds(struct event_filter *filter, void *rec); extern int filter_match_preds(struct event_filter *filter, void *rec);
extern enum event_trigger_type event_triggers_call(struct trace_event_file *file, extern enum event_trigger_type
void *rec); event_triggers_call(struct trace_event_file *file, void *rec,
extern void event_triggers_post_call(struct trace_event_file *file, struct ring_buffer_event *event);
enum event_trigger_type tt, extern void
void *rec); event_triggers_post_call(struct trace_event_file *file,
enum event_trigger_type tt,
void *rec, struct ring_buffer_event *event);
bool trace_event_ignore_this_pid(struct trace_event_file *trace_file); bool trace_event_ignore_this_pid(struct trace_event_file *trace_file);
...@@ -454,7 +456,7 @@ trace_trigger_soft_disabled(struct trace_event_file *file) ...@@ -454,7 +456,7 @@ trace_trigger_soft_disabled(struct trace_event_file *file)
if (!(eflags & EVENT_FILE_FL_TRIGGER_COND)) { if (!(eflags & EVENT_FILE_FL_TRIGGER_COND)) {
if (eflags & EVENT_FILE_FL_TRIGGER_MODE) if (eflags & EVENT_FILE_FL_TRIGGER_MODE)
event_triggers_call(file, NULL); event_triggers_call(file, NULL, NULL);
if (eflags & EVENT_FILE_FL_SOFT_DISABLED) if (eflags & EVENT_FILE_FL_SOFT_DISABLED)
return true; return true;
if (eflags & EVENT_FILE_FL_PID_FILTER) if (eflags & EVENT_FILE_FL_PID_FILTER)
......
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM initcall
#if !defined(_TRACE_INITCALL_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_INITCALL_H
#include <linux/tracepoint.h>
TRACE_EVENT(initcall_level,
TP_PROTO(const char *level),
TP_ARGS(level),
TP_STRUCT__entry(
__string(level, level)
),
TP_fast_assign(
__assign_str(level, level);
),
TP_printk("level=%s", __get_str(level))
);
TRACE_EVENT(initcall_start,
TP_PROTO(initcall_t func),
TP_ARGS(func),
TP_STRUCT__entry(
__field(initcall_t, func)
),
TP_fast_assign(
__entry->func = func;
),
TP_printk("func=%pS", __entry->func)
);
TRACE_EVENT(initcall_finish,
TP_PROTO(initcall_t func, int ret),
TP_ARGS(func, ret),
TP_STRUCT__entry(
__field(initcall_t, func)
__field(int, ret)
),
TP_fast_assign(
__entry->func = func;
__entry->ret = ret;
),
TP_printk("func=%pS ret=%d", __entry->func, __entry->ret)
);
#endif /* if !defined(_TRACE_INITCALL_H) || defined(TRACE_HEADER_MULTI_READ) */
/* This part must be outside protection */
#include <trace/define_trace.h>
...@@ -97,6 +97,9 @@ ...@@ -97,6 +97,9 @@
#include <asm/sections.h> #include <asm/sections.h>
#include <asm/cacheflush.h> #include <asm/cacheflush.h>
#define CREATE_TRACE_POINTS
#include <trace/events/initcall.h>
static int kernel_init(void *); static int kernel_init(void *);
extern void init_IRQ(void); extern void init_IRQ(void);
...@@ -491,6 +494,17 @@ void __init __weak thread_stack_cache_init(void) ...@@ -491,6 +494,17 @@ void __init __weak thread_stack_cache_init(void)
void __init __weak mem_encrypt_init(void) { } void __init __weak mem_encrypt_init(void) { }
bool initcall_debug;
core_param(initcall_debug, initcall_debug, bool, 0644);
#ifdef TRACEPOINTS_ENABLED
static void __init initcall_debug_enable(void);
#else
static inline void initcall_debug_enable(void)
{
}
#endif
/* /*
* Set up kernel memory allocators * Set up kernel memory allocators
*/ */
...@@ -612,6 +626,9 @@ asmlinkage __visible void __init start_kernel(void) ...@@ -612,6 +626,9 @@ asmlinkage __visible void __init start_kernel(void)
/* Trace events are available after this */ /* Trace events are available after this */
trace_init(); trace_init();
if (initcall_debug)
initcall_debug_enable();
context_tracking_init(); context_tracking_init();
/* init some links before init_ISA_irqs() */ /* init some links before init_ISA_irqs() */
early_irq_init(); early_irq_init();
...@@ -728,9 +745,6 @@ static void __init do_ctors(void) ...@@ -728,9 +745,6 @@ static void __init do_ctors(void)
#endif #endif
} }
bool initcall_debug;
core_param(initcall_debug, initcall_debug, bool, 0644);
#ifdef CONFIG_KALLSYMS #ifdef CONFIG_KALLSYMS
struct blacklist_entry { struct blacklist_entry {
struct list_head next; struct list_head next;
...@@ -800,37 +814,71 @@ static bool __init_or_module initcall_blacklisted(initcall_t fn) ...@@ -800,37 +814,71 @@ static bool __init_or_module initcall_blacklisted(initcall_t fn)
#endif #endif
__setup("initcall_blacklist=", initcall_blacklist); __setup("initcall_blacklist=", initcall_blacklist);
static int __init_or_module do_one_initcall_debug(initcall_t fn) static __init_or_module void
trace_initcall_start_cb(void *data, initcall_t fn)
{ {
ktime_t calltime, delta, rettime; ktime_t *calltime = (ktime_t *)data;
unsigned long long duration;
int ret;
printk(KERN_DEBUG "calling %pF @ %i\n", fn, task_pid_nr(current)); printk(KERN_DEBUG "calling %pF @ %i\n", fn, task_pid_nr(current));
calltime = ktime_get(); *calltime = ktime_get();
ret = fn(); }
static __init_or_module void
trace_initcall_finish_cb(void *data, initcall_t fn, int ret)
{
ktime_t *calltime = (ktime_t *)data;
ktime_t delta, rettime;
unsigned long long duration;
rettime = ktime_get(); rettime = ktime_get();
delta = ktime_sub(rettime, calltime); delta = ktime_sub(rettime, *calltime);
duration = (unsigned long long) ktime_to_ns(delta) >> 10; duration = (unsigned long long) ktime_to_ns(delta) >> 10;
printk(KERN_DEBUG "initcall %pF returned %d after %lld usecs\n", printk(KERN_DEBUG "initcall %pF returned %d after %lld usecs\n",
fn, ret, duration); fn, ret, duration);
}
return ret; static ktime_t initcall_calltime;
#ifdef TRACEPOINTS_ENABLED
static void __init initcall_debug_enable(void)
{
int ret;
ret = register_trace_initcall_start(trace_initcall_start_cb,
&initcall_calltime);
ret |= register_trace_initcall_finish(trace_initcall_finish_cb,
&initcall_calltime);
WARN(ret, "Failed to register initcall tracepoints\n");
} }
# define do_trace_initcall_start trace_initcall_start
# define do_trace_initcall_finish trace_initcall_finish
#else
static inline void do_trace_initcall_start(initcall_t fn)
{
if (!initcall_debug)
return;
trace_initcall_start_cb(&initcall_calltime, fn);
}
static inline void do_trace_initcall_finish(initcall_t fn, int ret)
{
if (!initcall_debug)
return;
trace_initcall_finish_cb(&initcall_calltime, fn, ret);
}
#endif /* !TRACEPOINTS_ENABLED */
int __init_or_module do_one_initcall(initcall_t fn) int __init_or_module do_one_initcall(initcall_t fn)
{ {
int count = preempt_count(); int count = preempt_count();
int ret;
char msgbuf[64]; char msgbuf[64];
int ret;
if (initcall_blacklisted(fn)) if (initcall_blacklisted(fn))
return -EPERM; return -EPERM;
if (initcall_debug) do_trace_initcall_start(fn);
ret = do_one_initcall_debug(fn); ret = fn();
else do_trace_initcall_finish(fn, ret);
ret = fn();
msgbuf[0] = 0; msgbuf[0] = 0;
...@@ -874,7 +922,7 @@ static initcall_t *initcall_levels[] __initdata = { ...@@ -874,7 +922,7 @@ static initcall_t *initcall_levels[] __initdata = {
/* Keep these in sync with initcalls in include/linux/init.h */ /* Keep these in sync with initcalls in include/linux/init.h */
static char *initcall_level_names[] __initdata = { static char *initcall_level_names[] __initdata = {
"early", "pure",
"core", "core",
"postcore", "postcore",
"arch", "arch",
...@@ -895,6 +943,7 @@ static void __init do_initcall_level(int level) ...@@ -895,6 +943,7 @@ static void __init do_initcall_level(int level)
level, level, level, level,
NULL, &repair_env_string); NULL, &repair_env_string);
trace_initcall_level(initcall_level_names[level]);
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++) for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
do_one_initcall(*fn); do_one_initcall(*fn);
} }
...@@ -929,6 +978,7 @@ static void __init do_pre_smp_initcalls(void) ...@@ -929,6 +978,7 @@ static void __init do_pre_smp_initcalls(void)
{ {
initcall_t *fn; initcall_t *fn;
trace_initcall_level("early");
for (fn = __initcall_start; fn < __initcall0_start; fn++) for (fn = __initcall_start; fn < __initcall0_start; fn++)
do_one_initcall(*fn); do_one_initcall(*fn);
} }
......
...@@ -554,6 +554,8 @@ void __warn(const char *file, int line, void *caller, unsigned taint, ...@@ -554,6 +554,8 @@ void __warn(const char *file, int line, void *caller, unsigned taint,
else else
dump_stack(); dump_stack();
print_irqtrace_events(current);
print_oops_end_marker(); print_oops_end_marker();
/* Just a warning, don't kill lockdep. */ /* Just a warning, don't kill lockdep. */
......
...@@ -51,6 +51,7 @@ ...@@ -51,6 +51,7 @@
#include <linux/uaccess.h> #include <linux/uaccess.h>
#include <asm/sections.h> #include <asm/sections.h>
#include <trace/events/initcall.h>
#define CREATE_TRACE_POINTS #define CREATE_TRACE_POINTS
#include <trace/events/printk.h> #include <trace/events/printk.h>
...@@ -2780,6 +2781,7 @@ EXPORT_SYMBOL(unregister_console); ...@@ -2780,6 +2781,7 @@ EXPORT_SYMBOL(unregister_console);
*/ */
void __init console_init(void) void __init console_init(void)
{ {
int ret;
initcall_t *call; initcall_t *call;
/* Setup the default TTY line discipline. */ /* Setup the default TTY line discipline. */
...@@ -2790,8 +2792,11 @@ void __init console_init(void) ...@@ -2790,8 +2792,11 @@ void __init console_init(void)
* inform about problems etc.. * inform about problems etc..
*/ */
call = __con_initcall_start; call = __con_initcall_start;
trace_initcall_level("console");
while (call < __con_initcall_end) { while (call < __con_initcall_end) {
(*call)(); trace_initcall_start((*call));
ret = (*call)();
trace_initcall_finish((*call), ret);
call++; call++;
} }
} }
......
...@@ -606,7 +606,10 @@ config HIST_TRIGGERS ...@@ -606,7 +606,10 @@ config HIST_TRIGGERS
event activity as an initial guide for further investigation event activity as an initial guide for further investigation
using more advanced tools. using more advanced tools.
See Documentation/trace/events.txt. Inter-event tracing of quantities such as latencies is also
supported using hist triggers under this option.
See Documentation/trace/histogram.txt.
If in doubt, say N. If in doubt, say N.
config MMIOTRACE_TEST config MMIOTRACE_TEST
......
...@@ -3902,14 +3902,13 @@ static bool module_exists(const char *module) ...@@ -3902,14 +3902,13 @@ static bool module_exists(const char *module)
{ {
/* All modules have the symbol __this_module */ /* All modules have the symbol __this_module */
const char this_mod[] = "__this_module"; const char this_mod[] = "__this_module";
const int modname_size = MAX_PARAM_PREFIX_LEN + sizeof(this_mod) + 1; char modname[MAX_PARAM_PREFIX_LEN + sizeof(this_mod) + 2];
char modname[modname_size + 1];
unsigned long val; unsigned long val;
int n; int n;
n = snprintf(modname, modname_size + 1, "%s:%s", module, this_mod); n = snprintf(modname, sizeof(modname), "%s:%s", module, this_mod);
if (n > modname_size) if (n > sizeof(modname) - 1)
return false; return false;
val = module_kallsyms_lookup_name(modname); val = module_kallsyms_lookup_name(modname);
......
...@@ -22,6 +22,7 @@ ...@@ -22,6 +22,7 @@
#include <linux/hash.h> #include <linux/hash.h>
#include <linux/list.h> #include <linux/list.h>
#include <linux/cpu.h> #include <linux/cpu.h>
#include <linux/oom.h>
#include <asm/local.h> #include <asm/local.h>
...@@ -41,6 +42,8 @@ int ring_buffer_print_entry_header(struct trace_seq *s) ...@@ -41,6 +42,8 @@ int ring_buffer_print_entry_header(struct trace_seq *s)
RINGBUF_TYPE_PADDING); RINGBUF_TYPE_PADDING);
trace_seq_printf(s, "\ttime_extend : type == %d\n", trace_seq_printf(s, "\ttime_extend : type == %d\n",
RINGBUF_TYPE_TIME_EXTEND); RINGBUF_TYPE_TIME_EXTEND);
trace_seq_printf(s, "\ttime_stamp : type == %d\n",
RINGBUF_TYPE_TIME_STAMP);
trace_seq_printf(s, "\tdata max type_len == %d\n", trace_seq_printf(s, "\tdata max type_len == %d\n",
RINGBUF_TYPE_DATA_TYPE_LEN_MAX); RINGBUF_TYPE_DATA_TYPE_LEN_MAX);
...@@ -140,12 +143,15 @@ int ring_buffer_print_entry_header(struct trace_seq *s) ...@@ -140,12 +143,15 @@ int ring_buffer_print_entry_header(struct trace_seq *s)
enum { enum {
RB_LEN_TIME_EXTEND = 8, RB_LEN_TIME_EXTEND = 8,
RB_LEN_TIME_STAMP = 16, RB_LEN_TIME_STAMP = 8,
}; };
#define skip_time_extend(event) \ #define skip_time_extend(event) \
((struct ring_buffer_event *)((char *)event + RB_LEN_TIME_EXTEND)) ((struct ring_buffer_event *)((char *)event + RB_LEN_TIME_EXTEND))
#define extended_time(event) \
(event->type_len >= RINGBUF_TYPE_TIME_EXTEND)
static inline int rb_null_event(struct ring_buffer_event *event) static inline int rb_null_event(struct ring_buffer_event *event)
{ {
return event->type_len == RINGBUF_TYPE_PADDING && !event->time_delta; return event->type_len == RINGBUF_TYPE_PADDING && !event->time_delta;
...@@ -209,7 +215,7 @@ rb_event_ts_length(struct ring_buffer_event *event) ...@@ -209,7 +215,7 @@ rb_event_ts_length(struct ring_buffer_event *event)
{ {
unsigned len = 0; unsigned len = 0;
if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) { if (extended_time(event)) {
/* time extends include the data event after it */ /* time extends include the data event after it */
len = RB_LEN_TIME_EXTEND; len = RB_LEN_TIME_EXTEND;
event = skip_time_extend(event); event = skip_time_extend(event);
...@@ -231,7 +237,7 @@ unsigned ring_buffer_event_length(struct ring_buffer_event *event) ...@@ -231,7 +237,7 @@ unsigned ring_buffer_event_length(struct ring_buffer_event *event)
{ {
unsigned length; unsigned length;
if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) if (extended_time(event))
event = skip_time_extend(event); event = skip_time_extend(event);
length = rb_event_length(event); length = rb_event_length(event);
...@@ -248,7 +254,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_length); ...@@ -248,7 +254,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_length);
static __always_inline void * static __always_inline void *
rb_event_data(struct ring_buffer_event *event) rb_event_data(struct ring_buffer_event *event)
{ {
if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) if (extended_time(event))
event = skip_time_extend(event); event = skip_time_extend(event);
BUG_ON(event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX); BUG_ON(event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX);
/* If length is in len field, then array[0] has the data */ /* If length is in len field, then array[0] has the data */
...@@ -275,6 +281,27 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data); ...@@ -275,6 +281,27 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data);
#define TS_MASK ((1ULL << TS_SHIFT) - 1) #define TS_MASK ((1ULL << TS_SHIFT) - 1)
#define TS_DELTA_TEST (~TS_MASK) #define TS_DELTA_TEST (~TS_MASK)
/**
* ring_buffer_event_time_stamp - return the event's extended timestamp
* @event: the event to get the timestamp of
*
* Returns the extended timestamp associated with a data event.
* An extended time_stamp is a 64-bit timestamp represented
* internally in a special way that makes the best use of space
* contained within a ring buffer event. This function decodes
* it and maps it to a straight u64 value.
*/
u64 ring_buffer_event_time_stamp(struct ring_buffer_event *event)
{
u64 ts;
ts = event->array[0];
ts <<= TS_SHIFT;
ts += event->time_delta;
return ts;
}
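The helper above simply undoes the split described in the ring_buffer.h comment earlier in this merge: the low bits of the 64-bit value sit in event.time_delta and the remaining high bits in array[0]. A standalone sketch of that round trip, with the 27-bit split taken from the header comment (the struct and constants below are illustrative stand-ins, not the kernel definitions)::

    #include <stdint.h>
    #include <stdio.h>
    #include <assert.h>

    /* Assumed split: bottom 27 bits in time_delta, the rest in array[0]. */
    #define TS_SHIFT 27
    #define TS_MASK  ((1ULL << TS_SHIFT) - 1)

    struct fake_event {             /* stand-in for struct ring_buffer_event */
            uint32_t time_delta;
            uint32_t array[1];
    };

    static void encode_ts(struct fake_event *e, uint64_t ts)
    {
            e->time_delta = ts & TS_MASK;   /* bottom 27 bits */
            e->array[0]   = ts >> TS_SHIFT; /* remaining high bits */
    }

    static uint64_t decode_ts(const struct fake_event *e)
    {
            /* Mirrors ring_buffer_event_time_stamp() above. */
            return ((uint64_t)e->array[0] << TS_SHIFT) + e->time_delta;
    }

    int main(void)
    {
            struct fake_event e;
            uint64_t ts = 31772032196000ULL;  /* arbitrary nanosecond value */

            encode_ts(&e, ts);
            assert(decode_ts(&e) == ts);
            printf("delta=%u array[0]=%u -> %llu\n", e.time_delta, e.array[0],
                   (unsigned long long)decode_ts(&e));
            return 0;
    }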
/* Flag when events were overwritten */ /* Flag when events were overwritten */
#define RB_MISSED_EVENTS (1 << 31) #define RB_MISSED_EVENTS (1 << 31)
/* Missed count stored at end */ /* Missed count stored at end */
...@@ -451,6 +478,7 @@ struct ring_buffer_per_cpu { ...@@ -451,6 +478,7 @@ struct ring_buffer_per_cpu {
struct buffer_page *reader_page; struct buffer_page *reader_page;
unsigned long lost_events; unsigned long lost_events;
unsigned long last_overrun; unsigned long last_overrun;
unsigned long nest;
local_t entries_bytes; local_t entries_bytes;
local_t entries; local_t entries;
local_t overrun; local_t overrun;
...@@ -488,6 +516,7 @@ struct ring_buffer { ...@@ -488,6 +516,7 @@ struct ring_buffer {
u64 (*clock)(void); u64 (*clock)(void);
struct rb_irq_work irq_work; struct rb_irq_work irq_work;
bool time_stamp_abs;
}; };
struct ring_buffer_iter { struct ring_buffer_iter {
...@@ -1134,30 +1163,60 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) ...@@ -1134,30 +1163,60 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu) static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
{ {
struct buffer_page *bpage, *tmp; struct buffer_page *bpage, *tmp;
bool user_thread = current->mm != NULL;
gfp_t mflags;
long i; long i;
/*
* Check if the available memory is there first.
* Note, si_mem_available() only gives us a rough estimate of available
* memory. It may not be accurate. But we don't care, we just want
* to prevent doing any allocation when it is obvious that it is
* not going to succeed.
*/
i = si_mem_available();
if (i < nr_pages)
return -ENOMEM;
/*
* __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
* gracefully without invoking oom-killer and the system is not
* destabilized.
*/
mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
/*
	 * If a user thread allocates too much and si_mem_available()
	 * reports there's enough memory even though there is not,
	 * make sure the OOM killer kills this thread. This can happen
* even with RETRY_MAYFAIL because another task may be doing
* an allocation after this task has taken all memory.
* This is the task the OOM killer needs to take out during this
* loop, even if it was triggered by an allocation somewhere else.
*/
if (user_thread)
set_current_oom_origin();
for (i = 0; i < nr_pages; i++) { for (i = 0; i < nr_pages; i++) {
struct page *page; struct page *page;
/*
* __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
* gracefully without invoking oom-killer and the system is not
* destabilized.
*/
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
GFP_KERNEL | __GFP_RETRY_MAYFAIL, mflags, cpu_to_node(cpu));
cpu_to_node(cpu));
if (!bpage) if (!bpage)
goto free_pages; goto free_pages;
list_add(&bpage->list, pages); list_add(&bpage->list, pages);
page = alloc_pages_node(cpu_to_node(cpu), page = alloc_pages_node(cpu_to_node(cpu), mflags, 0);
GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
if (!page) if (!page)
goto free_pages; goto free_pages;
bpage->page = page_address(page); bpage->page = page_address(page);
rb_init_page(bpage->page); rb_init_page(bpage->page);
if (user_thread && fatal_signal_pending(current))
goto free_pages;
} }
if (user_thread)
clear_current_oom_origin();
return 0; return 0;
...@@ -1166,6 +1225,8 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu) ...@@ -1166,6 +1225,8 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
list_del_init(&bpage->list); list_del_init(&bpage->list);
free_buffer_page(bpage); free_buffer_page(bpage);
} }
if (user_thread)
clear_current_oom_origin();
return -ENOMEM; return -ENOMEM;
} }
...@@ -1382,6 +1443,16 @@ void ring_buffer_set_clock(struct ring_buffer *buffer, ...@@ -1382,6 +1443,16 @@ void ring_buffer_set_clock(struct ring_buffer *buffer,
buffer->clock = clock; buffer->clock = clock;
} }
void ring_buffer_set_time_stamp_abs(struct ring_buffer *buffer, bool abs)
{
buffer->time_stamp_abs = abs;
}
bool ring_buffer_time_stamp_abs(struct ring_buffer *buffer)
{
return buffer->time_stamp_abs;
}
static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer); static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
static inline unsigned long rb_page_entries(struct buffer_page *bpage) static inline unsigned long rb_page_entries(struct buffer_page *bpage)
...@@ -2206,12 +2277,15 @@ rb_move_tail(struct ring_buffer_per_cpu *cpu_buffer, ...@@ -2206,12 +2277,15 @@ rb_move_tail(struct ring_buffer_per_cpu *cpu_buffer,
/* Slow path, do not inline */ /* Slow path, do not inline */
static noinline struct ring_buffer_event * static noinline struct ring_buffer_event *
rb_add_time_stamp(struct ring_buffer_event *event, u64 delta) rb_add_time_stamp(struct ring_buffer_event *event, u64 delta, bool abs)
{ {
event->type_len = RINGBUF_TYPE_TIME_EXTEND; if (abs)
event->type_len = RINGBUF_TYPE_TIME_STAMP;
else
event->type_len = RINGBUF_TYPE_TIME_EXTEND;
/* Not the first event on the page? */ /* Not the first event on the page, or not delta? */
if (rb_event_index(event)) { if (abs || rb_event_index(event)) {
event->time_delta = delta & TS_MASK; event->time_delta = delta & TS_MASK;
event->array[0] = delta >> TS_SHIFT; event->array[0] = delta >> TS_SHIFT;
} else { } else {
...@@ -2254,7 +2328,9 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer, ...@@ -2254,7 +2328,9 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
* add it to the start of the resevered space. * add it to the start of the resevered space.
*/ */
if (unlikely(info->add_timestamp)) { if (unlikely(info->add_timestamp)) {
event = rb_add_time_stamp(event, delta); bool abs = ring_buffer_time_stamp_abs(cpu_buffer->buffer);
event = rb_add_time_stamp(event, info->delta, abs);
length -= RB_LEN_TIME_EXTEND; length -= RB_LEN_TIME_EXTEND;
delta = 0; delta = 0;
} }
...@@ -2442,7 +2518,7 @@ static __always_inline void rb_end_commit(struct ring_buffer_per_cpu *cpu_buffer ...@@ -2442,7 +2518,7 @@ static __always_inline void rb_end_commit(struct ring_buffer_per_cpu *cpu_buffer
static inline void rb_event_discard(struct ring_buffer_event *event) static inline void rb_event_discard(struct ring_buffer_event *event)
{ {
if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) if (extended_time(event))
event = skip_time_extend(event); event = skip_time_extend(event);
/* array[0] holds the actual length for the discarded event */ /* array[0] holds the actual length for the discarded event */
...@@ -2486,10 +2562,11 @@ rb_update_write_stamp(struct ring_buffer_per_cpu *cpu_buffer, ...@@ -2486,10 +2562,11 @@ rb_update_write_stamp(struct ring_buffer_per_cpu *cpu_buffer,
cpu_buffer->write_stamp = cpu_buffer->write_stamp =
cpu_buffer->commit_page->page->time_stamp; cpu_buffer->commit_page->page->time_stamp;
else if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) { else if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) {
delta = event->array[0]; delta = ring_buffer_event_time_stamp(event);
delta <<= TS_SHIFT;
delta += event->time_delta;
cpu_buffer->write_stamp += delta; cpu_buffer->write_stamp += delta;
} else if (event->type_len == RINGBUF_TYPE_TIME_STAMP) {
delta = ring_buffer_event_time_stamp(event);
cpu_buffer->write_stamp = delta;
} else } else
cpu_buffer->write_stamp += event->time_delta; cpu_buffer->write_stamp += event->time_delta;
} }
...@@ -2581,10 +2658,10 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer) ...@@ -2581,10 +2658,10 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
bit = pc & NMI_MASK ? RB_CTX_NMI : bit = pc & NMI_MASK ? RB_CTX_NMI :
pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ; pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ;
if (unlikely(val & (1 << bit))) if (unlikely(val & (1 << (bit + cpu_buffer->nest))))
return 1; return 1;
val |= (1 << bit); val |= (1 << (bit + cpu_buffer->nest));
cpu_buffer->current_context = val; cpu_buffer->current_context = val;
return 0; return 0;
...@@ -2593,7 +2670,57 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer) ...@@ -2593,7 +2670,57 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
static __always_inline void static __always_inline void
trace_recursive_unlock(struct ring_buffer_per_cpu *cpu_buffer) trace_recursive_unlock(struct ring_buffer_per_cpu *cpu_buffer)
{ {
cpu_buffer->current_context &= cpu_buffer->current_context - 1; cpu_buffer->current_context &=
cpu_buffer->current_context - (1 << cpu_buffer->nest);
}
/* The recursive locking above uses 4 bits */
#define NESTED_BITS 4
/**
 * ring_buffer_nest_start - Allow tracing while nested
 * @buffer: The ring buffer to modify
 *
 * The ring buffer has a safety mechanism to prevent recursion.
 * But there may be a case where a trace needs to be done while
 * tracing something else. In this case, calling this function
 * allows another ring_buffer_lock_reserve() to nest within the
 * one that is currently active.
*
* Call this function before calling another ring_buffer_lock_reserve() and
* call ring_buffer_nest_end() after the nested ring_buffer_unlock_commit().
*/
void ring_buffer_nest_start(struct ring_buffer *buffer)
{
struct ring_buffer_per_cpu *cpu_buffer;
int cpu;
/* Enabled by ring_buffer_nest_end() */
preempt_disable_notrace();
cpu = raw_smp_processor_id();
cpu_buffer = buffer->buffers[cpu];
	/* This is the shift value for the above recursive locking */
cpu_buffer->nest += NESTED_BITS;
}
/**
 * ring_buffer_nest_end - Allow tracing while nested
* @buffer: The ring buffer to modify
*
* Must be called after ring_buffer_nest_start() and after the
* ring_buffer_unlock_commit().
*/
void ring_buffer_nest_end(struct ring_buffer *buffer)
{
struct ring_buffer_per_cpu *cpu_buffer;
int cpu;
/* disabled by ring_buffer_nest_start() */
cpu = raw_smp_processor_id();
cpu_buffer = buffer->buffers[cpu];
	/* This is the shift value for the above recursive locking */
cpu_buffer->nest -= NESTED_BITS;
preempt_enable_notrace();
} }
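No caller of the pair appears in this hunk, so the intended pattern is easiest to see in a sketch. A hypothetical nested write, using only the ring buffer calls declared earlier in this merge (example_nested_write and its 8-byte payloads are made up for illustration)::

    #include <linux/ring_buffer.h>
    #include <linux/string.h>

    /* Hypothetical sketch: log a nested event while an outer reserve is open. */
    static void example_nested_write(struct ring_buffer *buffer)
    {
            struct ring_buffer_event *outer, *inner;

            outer = ring_buffer_lock_reserve(buffer, 8);
            if (!outer)
                    return;
            memset(ring_buffer_event_data(outer), 0, 8);  /* outer payload */

            /*
             * A second reserve from the same context would normally trip
             * the recursion protection; shifting the recursion bits first
             * allows the nested reserve/commit pair below.
             */
            ring_buffer_nest_start(buffer);
            inner = ring_buffer_lock_reserve(buffer, 8);
            if (inner) {
                    memset(ring_buffer_event_data(inner), 0, 8);  /* nested payload */
                    ring_buffer_unlock_commit(buffer, inner);
            }
            ring_buffer_nest_end(buffer);

            ring_buffer_unlock_commit(buffer, outer);
    }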
/** /**
...@@ -2637,7 +2764,8 @@ rb_handle_timestamp(struct ring_buffer_per_cpu *cpu_buffer, ...@@ -2637,7 +2764,8 @@ rb_handle_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
sched_clock_stable() ? "" : sched_clock_stable() ? "" :
"If you just came from a suspend/resume,\n" "If you just came from a suspend/resume,\n"
"please switch to the trace global clock:\n" "please switch to the trace global clock:\n"
" echo global > /sys/kernel/debug/tracing/trace_clock\n"); " echo global > /sys/kernel/debug/tracing/trace_clock\n"
"or add trace_clock=global to the kernel command line\n");
info->add_timestamp = 1; info->add_timestamp = 1;
} }
...@@ -2669,7 +2797,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, ...@@ -2669,7 +2797,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
* If this is the first commit on the page, then it has the same * If this is the first commit on the page, then it has the same
* timestamp as the page itself. * timestamp as the page itself.
*/ */
if (!tail) if (!tail && !ring_buffer_time_stamp_abs(cpu_buffer->buffer))
info->delta = 0; info->delta = 0;
/* See if we shot pass the end of this buffer page */ /* See if we shot pass the end of this buffer page */
...@@ -2746,8 +2874,11 @@ rb_reserve_next_event(struct ring_buffer *buffer, ...@@ -2746,8 +2874,11 @@ rb_reserve_next_event(struct ring_buffer *buffer,
/* make sure this diff is calculated here */ /* make sure this diff is calculated here */
barrier(); barrier();
/* Did the write stamp get updated already? */ if (ring_buffer_time_stamp_abs(buffer)) {
if (likely(info.ts >= cpu_buffer->write_stamp)) { info.delta = info.ts;
rb_handle_timestamp(cpu_buffer, &info);
} else /* Did the write stamp get updated already? */
if (likely(info.ts >= cpu_buffer->write_stamp)) {
info.delta = diff; info.delta = diff;
if (unlikely(test_time_stamp(info.delta))) if (unlikely(test_time_stamp(info.delta)))
rb_handle_timestamp(cpu_buffer, &info); rb_handle_timestamp(cpu_buffer, &info);
...@@ -3429,14 +3560,13 @@ rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer, ...@@ -3429,14 +3560,13 @@ rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
return; return;
case RINGBUF_TYPE_TIME_EXTEND: case RINGBUF_TYPE_TIME_EXTEND:
delta = event->array[0]; delta = ring_buffer_event_time_stamp(event);
delta <<= TS_SHIFT;
delta += event->time_delta;
cpu_buffer->read_stamp += delta; cpu_buffer->read_stamp += delta;
return; return;
case RINGBUF_TYPE_TIME_STAMP: case RINGBUF_TYPE_TIME_STAMP:
/* FIXME: not implemented */ delta = ring_buffer_event_time_stamp(event);
cpu_buffer->read_stamp = delta;
return; return;
case RINGBUF_TYPE_DATA: case RINGBUF_TYPE_DATA:
...@@ -3460,14 +3590,13 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter, ...@@ -3460,14 +3590,13 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
return; return;
case RINGBUF_TYPE_TIME_EXTEND: case RINGBUF_TYPE_TIME_EXTEND:
delta = event->array[0]; delta = ring_buffer_event_time_stamp(event);
delta <<= TS_SHIFT;
delta += event->time_delta;
iter->read_stamp += delta; iter->read_stamp += delta;
return; return;
case RINGBUF_TYPE_TIME_STAMP: case RINGBUF_TYPE_TIME_STAMP:
/* FIXME: not implemented */ delta = ring_buffer_event_time_stamp(event);
iter->read_stamp = delta;
return; return;
case RINGBUF_TYPE_DATA: case RINGBUF_TYPE_DATA:
...@@ -3691,6 +3820,8 @@ rb_buffer_peek(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts, ...@@ -3691,6 +3820,8 @@ rb_buffer_peek(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts,
struct buffer_page *reader; struct buffer_page *reader;
int nr_loops = 0; int nr_loops = 0;
if (ts)
*ts = 0;
again: again:
/* /*
* We repeat when a time extend is encountered. * We repeat when a time extend is encountered.
...@@ -3727,12 +3858,17 @@ rb_buffer_peek(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts, ...@@ -3727,12 +3858,17 @@ rb_buffer_peek(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts,
goto again; goto again;
case RINGBUF_TYPE_TIME_STAMP: case RINGBUF_TYPE_TIME_STAMP:
/* FIXME: not implemented */ if (ts) {
*ts = ring_buffer_event_time_stamp(event);
ring_buffer_normalize_time_stamp(cpu_buffer->buffer,
cpu_buffer->cpu, ts);
}
/* Internal data, OK to advance */
rb_advance_reader(cpu_buffer); rb_advance_reader(cpu_buffer);
goto again; goto again;
case RINGBUF_TYPE_DATA: case RINGBUF_TYPE_DATA:
if (ts) { if (ts && !(*ts)) {
*ts = cpu_buffer->read_stamp + event->time_delta; *ts = cpu_buffer->read_stamp + event->time_delta;
ring_buffer_normalize_time_stamp(cpu_buffer->buffer, ring_buffer_normalize_time_stamp(cpu_buffer->buffer,
cpu_buffer->cpu, ts); cpu_buffer->cpu, ts);
...@@ -3757,6 +3893,9 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) ...@@ -3757,6 +3893,9 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
struct ring_buffer_event *event; struct ring_buffer_event *event;
int nr_loops = 0; int nr_loops = 0;
if (ts)
*ts = 0;
cpu_buffer = iter->cpu_buffer; cpu_buffer = iter->cpu_buffer;
buffer = cpu_buffer->buffer; buffer = cpu_buffer->buffer;
...@@ -3809,12 +3948,17 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) ...@@ -3809,12 +3948,17 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
goto again; goto again;
case RINGBUF_TYPE_TIME_STAMP: case RINGBUF_TYPE_TIME_STAMP:
/* FIXME: not implemented */ if (ts) {
*ts = ring_buffer_event_time_stamp(event);
ring_buffer_normalize_time_stamp(cpu_buffer->buffer,
cpu_buffer->cpu, ts);
}
/* Internal data, OK to advance */
rb_advance_iter(iter); rb_advance_iter(iter);
goto again; goto again;
case RINGBUF_TYPE_DATA: case RINGBUF_TYPE_DATA:
if (ts) { if (ts && !(*ts)) {
*ts = iter->read_stamp + event->time_delta; *ts = iter->read_stamp + event->time_delta;
ring_buffer_normalize_time_stamp(buffer, ring_buffer_normalize_time_stamp(buffer,
cpu_buffer->cpu, ts); cpu_buffer->cpu, ts);
......
...@@ -41,6 +41,7 @@ ...@@ -41,6 +41,7 @@
#include <linux/nmi.h> #include <linux/nmi.h>
#include <linux/fs.h> #include <linux/fs.h>
#include <linux/trace.h> #include <linux/trace.h>
#include <linux/sched/clock.h>
#include <linux/sched/rt.h> #include <linux/sched/rt.h>
#include "trace.h" #include "trace.h"
...@@ -1168,6 +1169,14 @@ static struct { ...@@ -1168,6 +1169,14 @@ static struct {
ARCH_TRACE_CLOCKS ARCH_TRACE_CLOCKS
}; };
bool trace_clock_in_ns(struct trace_array *tr)
{
if (trace_clocks[tr->clock_id].in_ns)
return true;
return false;
}
/* /*
* trace_parser_get_init - gets the buffer for trace parser * trace_parser_get_init - gets the buffer for trace parser
*/ */
...@@ -2269,7 +2278,7 @@ trace_event_buffer_lock_reserve(struct ring_buffer **current_rb, ...@@ -2269,7 +2278,7 @@ trace_event_buffer_lock_reserve(struct ring_buffer **current_rb,
*current_rb = trace_file->tr->trace_buffer.buffer; *current_rb = trace_file->tr->trace_buffer.buffer;
if ((trace_file->flags & if (!ring_buffer_time_stamp_abs(*current_rb) && (trace_file->flags &
(EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) && (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) &&
(entry = this_cpu_read(trace_buffered_event))) { (entry = this_cpu_read(trace_buffered_event))) {
/* Try to use the per cpu buffer first */ /* Try to use the per cpu buffer first */
...@@ -4515,6 +4524,9 @@ static const char readme_msg[] = ...@@ -4515,6 +4524,9 @@ static const char readme_msg[] =
#ifdef CONFIG_X86_64 #ifdef CONFIG_X86_64
" x86-tsc: TSC cycle counter\n" " x86-tsc: TSC cycle counter\n"
#endif #endif
"\n timestamp_mode\t-view the mode used to timestamp events\n"
" delta: Delta difference against a buffer-wide timestamp\n"
" absolute: Absolute (standalone) timestamp\n"
"\n trace_marker\t\t- Writes into this file writes into the kernel buffer\n" "\n trace_marker\t\t- Writes into this file writes into the kernel buffer\n"
"\n trace_marker_raw\t\t- Writes into this file writes binary data into the kernel buffer\n" "\n trace_marker_raw\t\t- Writes into this file writes binary data into the kernel buffer\n"
" tracing_cpumask\t- Limit which CPUs to trace\n" " tracing_cpumask\t- Limit which CPUs to trace\n"
...@@ -4691,8 +4703,9 @@ static const char readme_msg[] = ...@@ -4691,8 +4703,9 @@ static const char readme_msg[] =
"\t .sym display an address as a symbol\n" "\t .sym display an address as a symbol\n"
"\t .sym-offset display an address as a symbol and offset\n" "\t .sym-offset display an address as a symbol and offset\n"
"\t .execname display a common_pid as a program name\n" "\t .execname display a common_pid as a program name\n"
"\t .syscall display a syscall id as a syscall name\n\n" "\t .syscall display a syscall id as a syscall name\n"
"\t .log2 display log2 value rather than raw number\n\n" "\t .log2 display log2 value rather than raw number\n"
"\t .usecs display a common_timestamp in microseconds\n\n"
"\t The 'pause' parameter can be used to pause an existing hist\n" "\t The 'pause' parameter can be used to pause an existing hist\n"
"\t trigger or to start a hist trigger but not log any events\n" "\t trigger or to start a hist trigger but not log any events\n"
"\t until told to do so. 'continue' can be used to start or\n" "\t until told to do so. 'continue' can be used to start or\n"
...@@ -6202,7 +6215,7 @@ static int tracing_clock_show(struct seq_file *m, void *v) ...@@ -6202,7 +6215,7 @@ static int tracing_clock_show(struct seq_file *m, void *v)
return 0; return 0;
} }
static int tracing_set_clock(struct trace_array *tr, const char *clockstr) int tracing_set_clock(struct trace_array *tr, const char *clockstr)
{ {
int i; int i;
...@@ -6282,6 +6295,71 @@ static int tracing_clock_open(struct inode *inode, struct file *file) ...@@ -6282,6 +6295,71 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
return ret; return ret;
} }
static int tracing_time_stamp_mode_show(struct seq_file *m, void *v)
{
struct trace_array *tr = m->private;
mutex_lock(&trace_types_lock);
if (ring_buffer_time_stamp_abs(tr->trace_buffer.buffer))
seq_puts(m, "delta [absolute]\n");
else
seq_puts(m, "[delta] absolute\n");
mutex_unlock(&trace_types_lock);
return 0;
}
static int tracing_time_stamp_mode_open(struct inode *inode, struct file *file)
{
struct trace_array *tr = inode->i_private;
int ret;
if (tracing_disabled)
return -ENODEV;
if (trace_array_get(tr))
return -ENODEV;
ret = single_open(file, tracing_time_stamp_mode_show, inode->i_private);
if (ret < 0)
trace_array_put(tr);
return ret;
}
int tracing_set_time_stamp_abs(struct trace_array *tr, bool abs)
{
int ret = 0;
mutex_lock(&trace_types_lock);
if (abs && tr->time_stamp_abs_ref++)
goto out;
if (!abs) {
if (WARN_ON_ONCE(!tr->time_stamp_abs_ref)) {
ret = -EINVAL;
goto out;
}
if (--tr->time_stamp_abs_ref)
goto out;
}
ring_buffer_set_time_stamp_abs(tr->trace_buffer.buffer, abs);
#ifdef CONFIG_TRACER_MAX_TRACE
if (tr->max_buffer.buffer)
ring_buffer_set_time_stamp_abs(tr->max_buffer.buffer, abs);
#endif
out:
mutex_unlock(&trace_types_lock);
return ret;
}
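Note that the enable path is reference counted: the first caller flips the per-buffer timestamp mode, later callers only bump tr->time_stamp_abs_ref, and delta mode is restored when the last reference is dropped. A hypothetical in-kernel user pairing the calls (the example_* functions are illustrative only; the declaration of tracing_set_time_stamp_abs() lives in kernel/trace/trace.h)::

    #include <linux/types.h>
    #include "trace.h"

    /* Hypothetical feature that needs absolute timestamps for its lifetime. */
    static int example_feature_init(struct trace_array *tr)
    {
            int ret;

            ret = tracing_set_time_stamp_abs(tr, true);  /* takes a reference */
            if (ret)
                    return ret;

            /* ... set up whatever consumes ring_buffer_event_time_stamp() ... */
            return 0;
    }

    static void example_feature_exit(struct trace_array *tr)
    {
            /* ... tear down the feature ... */

            /* Drop the reference; delta mode resumes when the count hits zero. */
            tracing_set_time_stamp_abs(tr, false);
    }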
struct ftrace_buffer_info { struct ftrace_buffer_info {
struct trace_iterator iter; struct trace_iterator iter;
void *spare; void *spare;
...@@ -6529,6 +6607,13 @@ static const struct file_operations trace_clock_fops = { ...@@ -6529,6 +6607,13 @@ static const struct file_operations trace_clock_fops = {
.write = tracing_clock_write, .write = tracing_clock_write,
}; };
static const struct file_operations trace_time_stamp_mode_fops = {
.open = tracing_time_stamp_mode_open,
.read = seq_read,
.llseek = seq_lseek,
.release = tracing_single_release_tr,
};
#ifdef CONFIG_TRACER_SNAPSHOT #ifdef CONFIG_TRACER_SNAPSHOT
static const struct file_operations snapshot_fops = { static const struct file_operations snapshot_fops = {
.open = tracing_snapshot_open, .open = tracing_snapshot_open,
...@@ -7699,6 +7784,7 @@ static int instance_mkdir(const char *name) ...@@ -7699,6 +7784,7 @@ static int instance_mkdir(const char *name)
INIT_LIST_HEAD(&tr->systems); INIT_LIST_HEAD(&tr->systems);
INIT_LIST_HEAD(&tr->events); INIT_LIST_HEAD(&tr->events);
INIT_LIST_HEAD(&tr->hist_vars);
if (allocate_trace_buffers(tr, trace_buf_size) < 0) if (allocate_trace_buffers(tr, trace_buf_size) < 0)
goto out_free_tr; goto out_free_tr;
...@@ -7851,6 +7937,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) ...@@ -7851,6 +7937,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("tracing_on", 0644, d_tracer, trace_create_file("tracing_on", 0644, d_tracer,
tr, &rb_simple_fops); tr, &rb_simple_fops);
trace_create_file("timestamp_mode", 0444, d_tracer, tr,
&trace_time_stamp_mode_fops);
create_trace_options_dir(tr); create_trace_options_dir(tr);
#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) #if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
...@@ -8446,6 +8535,7 @@ __init static int tracer_alloc_buffers(void) ...@@ -8446,6 +8535,7 @@ __init static int tracer_alloc_buffers(void)
INIT_LIST_HEAD(&global_trace.systems); INIT_LIST_HEAD(&global_trace.systems);
INIT_LIST_HEAD(&global_trace.events); INIT_LIST_HEAD(&global_trace.events);
INIT_LIST_HEAD(&global_trace.hist_vars);
list_add(&global_trace.list, &ftrace_trace_arrays); list_add(&global_trace.list, &ftrace_trace_arrays);
apply_trace_boot_options(); apply_trace_boot_options();
...@@ -8507,3 +8597,21 @@ __init static int clear_boot_tracer(void) ...@@ -8507,3 +8597,21 @@ __init static int clear_boot_tracer(void)
fs_initcall(tracer_init_tracefs); fs_initcall(tracer_init_tracefs);
late_initcall_sync(clear_boot_tracer); late_initcall_sync(clear_boot_tracer);
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
__init static int tracing_set_default_clock(void)
{
/* sched_clock_stable() is determined in late_initcall */
if (!trace_boot_clock && !sched_clock_stable()) {
printk(KERN_WARNING
"Unstable clock detected, switching default tracing clock to \"global\"\n"
"If you want to keep using the local clock, then add:\n"
" \"trace_clock=local\"\n"
"on the kernel command line\n");
tracing_set_clock(&global_trace, "global");
}
return 0;
}
late_initcall_sync(tracing_set_default_clock);
#endif
...@@ -273,6 +273,8 @@ struct trace_array { ...@@ -273,6 +273,8 @@ struct trace_array {
/* function tracing enabled */ /* function tracing enabled */
int function_enabled; int function_enabled;
#endif #endif
int time_stamp_abs_ref;
struct list_head hist_vars;
}; };
enum { enum {
...@@ -286,6 +288,11 @@ extern struct mutex trace_types_lock; ...@@ -286,6 +288,11 @@ extern struct mutex trace_types_lock;
extern int trace_array_get(struct trace_array *tr); extern int trace_array_get(struct trace_array *tr);
extern void trace_array_put(struct trace_array *tr); extern void trace_array_put(struct trace_array *tr);
extern int tracing_set_time_stamp_abs(struct trace_array *tr, bool abs);
extern int tracing_set_clock(struct trace_array *tr, const char *clockstr);
extern bool trace_clock_in_ns(struct trace_array *tr);
/* /*
* The global tracer (top) should be the first trace array added, * The global tracer (top) should be the first trace array added,
* but we check the flag anyway. * but we check the flag anyway.
...@@ -1209,12 +1216,11 @@ struct ftrace_event_field { ...@@ -1209,12 +1216,11 @@ struct ftrace_event_field {
int is_signed; int is_signed;
}; };
struct prog_entry;
struct event_filter { struct event_filter {
int n_preds; /* Number assigned */ struct prog_entry __rcu *prog;
int a_preds; /* allocated */ char *filter_string;
struct filter_pred __rcu *preds;
struct filter_pred __rcu *root;
char *filter_string;
}; };
struct event_subsystem { struct event_subsystem {
...@@ -1291,7 +1297,7 @@ __event_trigger_test_discard(struct trace_event_file *file, ...@@ -1291,7 +1297,7 @@ __event_trigger_test_discard(struct trace_event_file *file,
unsigned long eflags = file->flags; unsigned long eflags = file->flags;
if (eflags & EVENT_FILE_FL_TRIGGER_COND) if (eflags & EVENT_FILE_FL_TRIGGER_COND)
*tt = event_triggers_call(file, entry); *tt = event_triggers_call(file, entry, event);
if (test_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags) || if (test_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags) ||
(unlikely(file->flags & EVENT_FILE_FL_FILTERED) && (unlikely(file->flags & EVENT_FILE_FL_FILTERED) &&
...@@ -1328,7 +1334,7 @@ event_trigger_unlock_commit(struct trace_event_file *file, ...@@ -1328,7 +1334,7 @@ event_trigger_unlock_commit(struct trace_event_file *file,
trace_buffer_unlock_commit(file->tr, buffer, event, irq_flags, pc); trace_buffer_unlock_commit(file->tr, buffer, event, irq_flags, pc);
if (tt) if (tt)
event_triggers_post_call(file, tt, entry); event_triggers_post_call(file, tt, entry, event);
} }
/** /**
...@@ -1361,7 +1367,7 @@ event_trigger_unlock_commit_regs(struct trace_event_file *file, ...@@ -1361,7 +1367,7 @@ event_trigger_unlock_commit_regs(struct trace_event_file *file,
irq_flags, pc, regs); irq_flags, pc, regs);
if (tt) if (tt)
event_triggers_post_call(file, tt, entry); event_triggers_post_call(file, tt, entry, event);
} }
#define FILTER_PRED_INVALID ((unsigned short)-1) #define FILTER_PRED_INVALID ((unsigned short)-1)
...@@ -1406,12 +1412,8 @@ struct filter_pred { ...@@ -1406,12 +1412,8 @@ struct filter_pred {
unsigned short *ops; unsigned short *ops;
struct ftrace_event_field *field; struct ftrace_event_field *field;
int offset; int offset;
int not; int not;
int op; int op;
unsigned short index;
unsigned short parent;
unsigned short left;
unsigned short right;
}; };
static inline bool is_string_field(struct ftrace_event_field *field) static inline bool is_string_field(struct ftrace_event_field *field)
@@ -1543,6 +1545,8 @@ extern void pause_named_trigger(struct event_trigger_data *data);
extern void unpause_named_trigger(struct event_trigger_data *data); extern void unpause_named_trigger(struct event_trigger_data *data);
extern void set_named_trigger_data(struct event_trigger_data *data, extern void set_named_trigger_data(struct event_trigger_data *data,
struct event_trigger_data *named_data); struct event_trigger_data *named_data);
extern struct event_trigger_data *
get_named_trigger_data(struct event_trigger_data *data);
extern int register_event_command(struct event_command *cmd); extern int register_event_command(struct event_command *cmd);
extern int unregister_event_command(struct event_command *cmd); extern int unregister_event_command(struct event_command *cmd);
extern int register_trigger_hist_enable_disable_cmds(void); extern int register_trigger_hist_enable_disable_cmds(void);
@@ -1586,7 +1590,8 @@ extern int register_trigger_hist_enable_disable_cmds(void);
*/ */
struct event_trigger_ops { struct event_trigger_ops {
void (*func)(struct event_trigger_data *data, void (*func)(struct event_trigger_data *data,
void *rec); void *rec,
struct ring_buffer_event *rbe);
int (*init)(struct event_trigger_ops *ops, int (*init)(struct event_trigger_ops *ops,
struct event_trigger_data *data); struct event_trigger_data *data);
void (*free)(struct event_trigger_ops *ops, void (*free)(struct event_trigger_ops *ops,
...
@@ -96,7 +96,7 @@ u64 notrace trace_clock_global(void)
int this_cpu; int this_cpu;
u64 now; u64 now;
local_irq_save(flags); raw_local_irq_save(flags);
this_cpu = raw_smp_processor_id(); this_cpu = raw_smp_processor_id();
now = sched_clock_cpu(this_cpu); now = sched_clock_cpu(this_cpu);
@@ -122,7 +122,7 @@ u64 notrace trace_clock_global(void)
arch_spin_unlock(&trace_clock_struct.lock); arch_spin_unlock(&trace_clock_struct.lock);
out: out:
local_irq_restore(flags); raw_local_irq_restore(flags);
return now; return now;
} }
...
@@ -33,163 +33,595 @@
"# Only events with the given fields will be affected.\n" \ "# Only events with the given fields will be affected.\n" \
"# If no events are modified, an error message will be displayed here" "# If no events are modified, an error message will be displayed here"
enum filter_op_ids /* Due to token parsing '<=' must be before '<' and '>=' must be before '>' */
{ #define OPS \
OP_OR, C( OP_GLOB, "~" ), \
OP_AND, C( OP_NE, "!=" ), \
OP_GLOB, C( OP_EQ, "==" ), \
OP_NE, C( OP_LE, "<=" ), \
OP_EQ, C( OP_LT, "<" ), \
OP_LT, C( OP_GE, ">=" ), \
OP_LE, C( OP_GT, ">" ), \
OP_GT, C( OP_BAND, "&" ), \
OP_GE, C( OP_MAX, NULL )
OP_BAND,
OP_NOT,
OP_NONE,
OP_OPEN_PAREN,
};
struct filter_op { #undef C
int id; #define C(a, b) a
char *string;
int precedence;
};
/* Order must be the same as enum filter_op_ids above */ enum filter_op_ids { OPS };
static struct filter_op filter_ops[] = {
{ OP_OR, "||", 1 },
{ OP_AND, "&&", 2 },
{ OP_GLOB, "~", 4 },
{ OP_NE, "!=", 4 },
{ OP_EQ, "==", 4 },
{ OP_LT, "<", 5 },
{ OP_LE, "<=", 5 },
{ OP_GT, ">", 5 },
{ OP_GE, ">=", 5 },
{ OP_BAND, "&", 6 },
{ OP_NOT, "!", 6 },
{ OP_NONE, "OP_NONE", 0 },
{ OP_OPEN_PAREN, "(", 0 },
};
enum { #undef C
FILT_ERR_NONE, #define C(a, b) b
FILT_ERR_INVALID_OP,
FILT_ERR_UNBALANCED_PAREN,
FILT_ERR_TOO_MANY_OPERANDS,
FILT_ERR_OPERAND_TOO_LONG,
FILT_ERR_FIELD_NOT_FOUND,
FILT_ERR_ILLEGAL_FIELD_OP,
FILT_ERR_ILLEGAL_INTVAL,
FILT_ERR_BAD_SUBSYS_FILTER,
FILT_ERR_TOO_MANY_PREDS,
FILT_ERR_MISSING_FIELD,
FILT_ERR_INVALID_FILTER,
FILT_ERR_IP_FIELD_ONLY,
FILT_ERR_ILLEGAL_NOT_OP,
};
static char *err_text[] = { static const char * ops[] = { OPS };
"No error",
"Invalid operator",
"Unbalanced parens",
"Too many operands",
"Operand too long",
"Field not found",
"Illegal operation for field type",
"Illegal integer value",
"Couldn't find or set field in one of a subsystem's events",
"Too many terms in predicate expression",
"Missing field name and/or value",
"Meaningless filter expression",
"Only 'ip' field is supported for function trace",
"Illegal use of '!'",
};
struct opstack_op { /*
enum filter_op_ids op; * pred functions are OP_LE, OP_LT, OP_GE, OP_GT, and OP_BAND
struct list_head list; * pred_funcs_##type below must match the order of them above.
}; */
#define PRED_FUNC_START OP_LE
#define PRED_FUNC_MAX (OP_BAND - PRED_FUNC_START)
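For reference (not part of the patch): the comment above says the pred_funcs_<type> tables further down must be laid out in the same order as OP_LE..OP_BAND, because an operator id is turned into a table slot by subtracting PRED_FUNC_START. A minimal stand-alone sketch of that indexing scheme, with made-up names and enum values:

#include <stdio.h>

enum { XOP_LE = 3, XOP_LT, XOP_GE, XOP_GT, XOP_BAND };	/* same relative order as OPS above */
#define XPRED_FUNC_START	XOP_LE

static int pred_le(long a, long b)   { return a <= b; }
static int pred_lt(long a, long b)   { return a <  b; }
static int pred_ge(long a, long b)   { return a >= b; }
static int pred_gt(long a, long b)   { return a >  b; }
static int pred_band(long a, long b) { return (a & b) != 0; }

/* Must stay in the same order as the enum, just like pred_funcs_##type */
static int (* const xpred_funcs[])(long, long) = {
	pred_le, pred_lt, pred_ge, pred_gt, pred_band,
};

int main(void)
{
	int op = XOP_GE;

	/* op - XPRED_FUNC_START == 2, which selects pred_ge(); prints 1 */
	printf("%d\n", xpred_funcs[op - XPRED_FUNC_START](7, 5));
	return 0;
}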
#define ERRORS \
C(NONE, "No error"), \
C(INVALID_OP, "Invalid operator"), \
C(TOO_MANY_OPEN, "Too many '('"), \
C(TOO_MANY_CLOSE, "Too few '('"), \
C(MISSING_QUOTE, "Missing matching quote"), \
C(OPERAND_TOO_LONG, "Operand too long"), \
C(EXPECT_STRING, "Expecting string field"), \
C(EXPECT_DIGIT, "Expecting numeric field"), \
C(ILLEGAL_FIELD_OP, "Illegal operation for field type"), \
C(FIELD_NOT_FOUND, "Field not found"), \
C(ILLEGAL_INTVAL, "Illegal integer value"), \
C(BAD_SUBSYS_FILTER, "Couldn't find or set field in one of a subsystem's events"), \
C(TOO_MANY_PREDS, "Too many terms in predicate expression"), \
C(INVALID_FILTER, "Meaningless filter expression"), \
C(IP_FIELD_ONLY, "Only 'ip' field is supported for function trace"), \
C(INVALID_VALUE, "Invalid value (did you forget quotes)?"),
#undef C
#define C(a, b) FILT_ERR_##a
enum { ERRORS };
#undef C
#define C(a, b) b
static char *err_text[] = { ERRORS };
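For reference (not part of the patch): the OPS and ERRORS lists above use the "X-macro" idiom, where one C(a, b) list is expanded twice, once to build the enum and once to build the matching string table, so the two can never drift out of sync. A minimal stand-alone sketch with made-up names, which also shows why "<=" has to be listed before "<" when operators are matched with strncmp():

#include <stdio.h>
#include <string.h>

#define MY_OPS \
	C( MY_OP_LE, "<=" ), \
	C( MY_OP_LT, "<"  ), \
	C( MY_OP_MAX, NULL )

#undef C
#define C(a, b) a
enum my_op_ids { MY_OPS };		/* MY_OP_LE, MY_OP_LT, MY_OP_MAX */

#undef C
#define C(a, b) b
static const char *my_ops[] = { MY_OPS };	/* "<=", "<", NULL */

int main(void)
{
	const char *str = "<= 5";
	int op;

	/* The longer token "<=" wins only because it is listed first. */
	for (op = 0; my_ops[op]; op++) {
		if (strncmp(str, my_ops[op], strlen(my_ops[op])) == 0)
			break;
	}
	printf("matched op %d (%s)\n", op, my_ops[op]);
	return 0;
}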
/* Called after a '!' character but "!=" and "!~" are not "not"s */
static bool is_not(const char *str)
{
switch (str[1]) {
case '=':
case '~':
return false;
}
return true;
}
struct postfix_elt { /**
enum filter_op_ids op; * prog_entry - a single entry in the filter program
char *operand; * @target: Index to jump to on a branch (actually one minus the index)
struct list_head list; * @when_to_branch: The value of the result of the predicate to do a branch
* @pred: The predicate to execute.
*/
struct prog_entry {
int target;
int when_to_branch;
struct filter_pred *pred;
}; };
struct filter_parse_state { /**
struct filter_op *ops; * update_preds - assign a program entry a label target
struct list_head opstack; * @prog: The program array
struct list_head postfix; * @N: The index of the current entry in @prog
* @invert: What to assign a program entry for its branch condition
*
* The program entry at @N has a target that points to the index of a program
* entry that can have its target and when_to_branch fields updated.
* Update the target field of the current program entry (index @N) to
* the old target of the updated entry. This will denote the entry to update
* if we are processing an "||" after an "&&".
*/
static void update_preds(struct prog_entry *prog, int N, int invert)
{
int t, s;
t = prog[N].target;
s = prog[t].target;
prog[t].when_to_branch = invert;
prog[t].target = N;
prog[N].target = s;
}
struct filter_parse_error {
int lasterr; int lasterr;
int lasterr_pos; int lasterr_pos;
struct {
char *string;
unsigned int cnt;
unsigned int tail;
} infix;
struct {
char string[MAX_FILTER_STR_VAL];
int pos;
unsigned int tail;
} operand;
}; };
struct pred_stack { static void parse_error(struct filter_parse_error *pe, int err, int pos)
struct filter_pred **preds; {
int index; pe->lasterr = err;
pe->lasterr_pos = pos;
}
typedef int (*parse_pred_fn)(const char *str, void *data, int pos,
struct filter_parse_error *pe,
struct filter_pred **pred);
enum {
INVERT = 1,
PROCESS_AND = 2,
PROCESS_OR = 4,
}; };
/* If not of not match is equal to not of not, then it is a match */ /*
* Without going into a formal proof, this explains the method that is used in
* parsing the logical expressions.
*
* For example, if we have: "a && !(!b || (c && g)) || d || e && !f"
* The first pass will convert it into the following program:
*
* n1: r=a; l1: if (!r) goto l4;
* n2: r=b; l2: if (!r) goto l4;
* n3: r=c; r=!r; l3: if (r) goto l4;
* n4: r=g; r=!r; l4: if (r) goto l5;
* n5: r=d; l5: if (r) goto T
* n6: r=e; l6: if (!r) goto l7;
* n7: r=f; r=!r; l7: if (!r) goto F
* T: return TRUE
* F: return FALSE
*
* To do this, we use a data structure to represent each of the above
* predicate and conditions that has:
*
* predicate, when_to_branch, invert, target
*
* The "predicate" will hold the function to determine the result "r".
* The "when_to_branch" denotes what "r" should be if a branch is to be taken
* "&&" would contain "!r" or (0) and "||" would contain "r" or (1).
* The "invert" holds whether the value should be reversed before testing.
* The "target" contains the label "l#" to jump to.
*
* A stack is created to hold values when parentheses are used.
*
* To simplify the logic, the labels will start at 0 and not 1.
*
* The possible invert values are 1 and 0. The number of "!"s that are in scope
* before the predicate determines the invert value, if the number is odd then
* the invert value is 1 and 0 otherwise. This means the invert value only
* needs to be toggled when a new "!" is introduced compared to what is stored
* on the stack, where parentheses were used.
*
* The top of the stack and "invert" are initialized to zero.
*
* ** FIRST PASS **
*
* #1 A loop through all the tokens is done:
*
* #2 If the token is an "(", the stack is pushed, and the current stack value
* gets the current invert value, and the loop continues to the next token.
* The top of the stack saves the "invert" value to keep track of what
* the current inversion is. As "!(a && !b || c)" would require all
* predicates being affected separately by the "!" before the parentheses.
* And that would end up being equivalent to "(!a || b) && !c"
*
* #3 If the token is an "!", the current "invert" value gets inverted, and
* the loop continues. Note, if the next token is a predicate, then
* this "invert" value is only valid for the current program entry,
* and does not affect other predicates later on.
*
* The only other acceptable token is the predicate string.
*
* #4 A new entry into the program is added saving: the predicate and the
* current value of "invert". The target is currently assigned to the
* previous program index (this will not be its final value).
*
* #5 We now enter another loop and look at the next token. The only valid
* tokens are ")", "&&", "||" or end of the input string "\0".
*
* #6 The invert variable is reset to the current value saved on the top of
* the stack.
*
* #7 The top of the stack holds not only the current invert value, but also
* if a "&&" or "||" needs to be processed. Note, the "&&" takes higher
* precedence than "||". That is "a && b || c && d" is equivalent to
* "(a && b) || (c && d)". Thus the first thing to do is to see if "&&" needs
* to be processed. This is the case if an "&&" was the last token. If it was
* then we call update_preds(). This takes the program, the current index in
* the program, and the current value of "invert". More will be described
* below about this function.
*
* #8 If the next token is "&&" then we set a flag in the top of the stack
* that denotes that "&&" needs to be processed, break out of this loop
* and continue with the outer loop.
*
* #9 Otherwise, if a "||" needs to be processed then update_preds() is called.
* This is called with the program, the current index in the program, but
* this time with an inverted value of "invert" (that is !invert). This is
* because the value taken will become the "when_to_branch" value of the
* program.
* Note, this is called when the next token is not an "&&". As stated before,
* "&&" takes higher precedence, and "||" should not be processed yet if the
* next logical operation is "&&".
*
* #10 If the next token is "||" then we set a flag in the top of the stack
* that denotes that "||" needs to be processed, break out of this loop
* and continue with the outer loop.
*
* #11 If this is the end of the input string "\0" then we break out of both
* loops.
*
* #12 Otherwise, the next token is ")", where we pop the stack and continue
* this inner loop.
*
* Now to discuss the update_preds() function, as that is key to the setting up
* of the program. Remember the "target" of the program is initialized to the
* previous index and not the "l" label. The target holds the index into the
* program that gets affected by the operand. Thus if we have something like
* "a || b && c", when we process "a" the target will be "-1" (undefined).
* When we process "b", its target is "0", which is the index of "a", as that's
* the predicate that is affected by "||". But because the next token after "b"
* is "&&" we don't call update_preds(). Instead continue to "c". As the
* next token after "c" is not "&&" but the end of input, we first process the
* "&&" by calling update_preds() for the "&&" then we process the "||" by
* calling update_preds() with the values for processing "||".
*
* What does that mean? What update_preds() does is to first save the "target"
* of the program entry indexed by the current program entry's "target"
* (remember the "target" is initialized to previous program entry), and then
* sets that "target" to the current index which represents the label "l#".
* That entry's "when_to_branch" is set to the value passed in (the "invert"
* or "!invert"). Then it sets the current program entry's target to the saved
* "target" value (the old value of the program that had its "target" updated
* to the label).
*
* Looking back at "a || b && c", we have the following steps:
* "a" - prog[0] = { "a", X, -1 } // pred, when_to_branch, target
* "||" - flag that we need to process "||"; continue outer loop
* "b" - prog[1] = { "b", X, 0 }
* "&&" - flag that we need to process "&&"; continue outer loop
* (Notice we did not process "||")
* "c" - prog[2] = { "c", X, 1 }
* update_preds(prog, 2, 0); // invert = 0 as we are processing "&&"
* t = prog[2].target; // t = 1
* s = prog[t].target; // s = 0
* prog[t].target = 2; // Set target to "l2"
* prog[t].when_to_branch = 0;
* prog[2].target = s;
* update_preds(prog, 2, 1); // invert = 1 as we are now processing "||"
* t = prog[2].target; // t = 0
* s = prog[t].target; // s = -1
* prog[t].target = 2; // Set target to "l2"
* prog[t].when_to_branch = 1;
* prog[2].target = s;
*
* #13 Which brings us to the final step of the first pass, which is to set
* the last program entry's when_to_branch and target, which will be
* when_to_branch = 0; target = N; ( the label after the program entry after
* the last program entry processed above).
*
* If we denote "TRUE" to be the entry after the last program entry processed,
* and "FALSE" the program entry after that, we are now done with the first
* pass.
*
* Making the above "a || b && c" have a program of:
* prog[0] = { "a", 1, 2 }
* prog[1] = { "b", 0, 2 }
* prog[2] = { "c", 0, 3 }
*
* Which translates into:
* n0: r = a; l0: if (r) goto l2;
* n1: r = b; l1: if (!r) goto l2;
* n2: r = c; l2: if (!r) goto l3; // Which is the same as "goto F;"
* T: return TRUE; l3:
* F: return FALSE
*
* Although, after the first pass, the program is correct, it is
* inefficient. The simple sample of "a || b && c" could easily be
* converted into:
* n0: r = a; if (r) goto T
* n1: r = b; if (!r) goto F
* n2: r = c; if (!r) goto F
* T: return TRUE;
* F: return FALSE;
*
* The First Pass is over the input string. The next two passes are over
* the program itself.
*
* ** SECOND PASS **
*
* Which brings us to the second pass. If a jump to a label has the
* same condition as that label, it can instead jump to its target.
* The original example of "a && !(!b || (c && g)) || d || e && !f"
* where the first pass gives us:
*
* n1: r=a; l1: if (!r) goto l4;
* n2: r=b; l2: if (!r) goto l4;
* n3: r=c; r=!r; l3: if (r) goto l4;
* n4: r=g; r=!r; l4: if (r) goto l5;
* n5: r=d; l5: if (r) goto T
* n6: r=e; l6: if (!r) goto l7;
* n7: r=f; r=!r; l7: if (!r) goto F:
* T: return TRUE;
* F: return FALSE
*
* We can see that "l3: if (r) goto l4;" and at l4, we have "if (r) goto l5;".
* And "l5: if (r) goto T", we could optimize this by converting l3 and l4
* to go directly to T. To accomplish this, we start from the last
* entry in the program and work our way back. If the target of the entry
* has the same "when_to_branch" then we could use that entry's target.
* Doing this, the above would end up as:
*
* n1: r=a; l1: if (!r) goto l4;
* n2: r=b; l2: if (!r) goto l4;
* n3: r=c; r=!r; l3: if (r) goto T;
* n4: r=g; r=!r; l4: if (r) goto T;
* n5: r=d; l5: if (r) goto T;
* n6: r=e; l6: if (!r) goto F;
* n7: r=f; r=!r; l7: if (!r) goto F;
* T: return TRUE
* F: return FALSE
*
* In that same pass, if the "when_to_branch" doesn't match, we can simply
* go to the program entry after the label. That is, "l2: if (!r) goto l4;"
* where "l4: if (r) goto T;", then we can convert l2 to be:
* "l2: if (!r) goto n5;".
*
* This will have the second pass give us:
* n1: r=a; l1: if (!r) goto n5;
* n2: r=b; l2: if (!r) goto n5;
* n3: r=c; r=!r; l3: if (r) goto T;
* n4: r=g; r=!r; l4: if (r) goto T;
* n5: r=d; l5: if (r) goto T
* n6: r=e; l6: if (!r) goto F;
* n7: r=f; r=!r; l7: if (!r) goto F
* T: return TRUE
* F: return FALSE
*
* Notice, all the "l#" labels are no longer used, and they can now
* be discarded.
*
* ** THIRD PASS **
*
* For the third pass we deal with the inverts. As they simply
* make the "when_to_branch" get inverted, a simple loop over the
* program that does: "when_to_branch ^= invert;" will do the
* job, leaving us with:
* n1: r=a; if (!r) goto n5;
* n2: r=b; if (!r) goto n5;
* n3: r=c: if (!r) goto T;
* n4: r=g; if (!r) goto T;
* n5: r=d; if (r) goto T
* n6: r=e; if (!r) goto F;
* n7: r=f; if (r) goto F
* T: return TRUE
* F: return FALSE
*
* As "r = a; if (!r) goto n5;" is obviously the same as
* "if (!a) goto n5;" without doing anything we can interperate the
* program as:
* n1: if (!a) goto n5;
* n2: if (!b) goto n5;
* n3: if (!c) goto T;
* n4: if (!g) goto T;
* n5: if (d) goto T
* n6: if (!e) goto F;
* n7: if (f) goto F
* T: return TRUE
* F: return FALSE
*
* Since the inverts are discarded at the end, there's no reason to store
* them in the program array (and waste memory). A separate array to hold
* the inverts is used and freed at the end.
*/
static struct prog_entry *
predicate_parse(const char *str, int nr_parens, int nr_preds,
parse_pred_fn parse_pred, void *data,
struct filter_parse_error *pe)
{
struct prog_entry *prog_stack;
struct prog_entry *prog;
const char *ptr = str;
char *inverts = NULL;
int *op_stack;
int *top;
int invert = 0;
int ret = -ENOMEM;
int len;
int N = 0;
int i;
nr_preds += 2; /* For TRUE and FALSE */
op_stack = kmalloc(sizeof(*op_stack) * nr_parens, GFP_KERNEL);
if (!op_stack)
return ERR_PTR(-ENOMEM);
prog_stack = kmalloc(sizeof(*prog_stack) * nr_preds, GFP_KERNEL);
if (!prog_stack) {
parse_error(pe, -ENOMEM, 0);
goto out_free;
}
inverts = kmalloc(sizeof(*inverts) * nr_preds, GFP_KERNEL);
if (!inverts) {
parse_error(pe, -ENOMEM, 0);
goto out_free;
}
top = op_stack;
prog = prog_stack;
*top = 0;
/* First pass */
while (*ptr) { /* #1 */
const char *next = ptr++;
if (isspace(*next))
continue;
switch (*next) {
case '(': /* #2 */
if (top - op_stack > nr_parens)
return ERR_PTR(-EINVAL);
*(++top) = invert;
continue;
case '!': /* #3 */
if (!is_not(next))
break;
invert = !invert;
continue;
}
if (N >= nr_preds) {
parse_error(pe, FILT_ERR_TOO_MANY_PREDS, next - str);
goto out_free;
}
inverts[N] = invert; /* #4 */
prog[N].target = N-1;
len = parse_pred(next, data, ptr - str, pe, &prog[N].pred);
if (len < 0) {
ret = len;
goto out_free;
}
ptr = next + len;
N++;
ret = -1;
while (1) { /* #5 */
next = ptr++;
if (isspace(*next))
continue;
switch (*next) {
case ')':
case '\0':
break;
case '&':
case '|':
if (next[1] == next[0]) {
ptr++;
break;
}
default:
parse_error(pe, FILT_ERR_TOO_MANY_PREDS,
next - str);
goto out_free;
}
invert = *top & INVERT;
if (*top & PROCESS_AND) { /* #7 */
update_preds(prog, N - 1, invert);
*top &= ~PROCESS_AND;
}
if (*next == '&') { /* #8 */
*top |= PROCESS_AND;
break;
}
if (*top & PROCESS_OR) { /* #9 */
update_preds(prog, N - 1, !invert);
*top &= ~PROCESS_OR;
}
if (*next == '|') { /* #10 */
*top |= PROCESS_OR;
break;
}
if (!*next) /* #11 */
goto out;
if (top == op_stack) {
ret = -1;
/* Too few '(' */
parse_error(pe, FILT_ERR_TOO_MANY_CLOSE, ptr - str);
goto out_free;
}
top--; /* #12 */
}
}
out:
if (top != op_stack) {
/* Too many '(' */
parse_error(pe, FILT_ERR_TOO_MANY_OPEN, ptr - str);
goto out_free;
}
prog[N].pred = NULL; /* #13 */
prog[N].target = 1; /* TRUE */
prog[N+1].pred = NULL;
prog[N+1].target = 0; /* FALSE */
prog[N-1].target = N;
prog[N-1].when_to_branch = false;
/* Second Pass */
for (i = N-1 ; i--; ) {
int target = prog[i].target;
if (prog[i].when_to_branch == prog[target].when_to_branch)
prog[i].target = prog[target].target;
}
/* Third Pass */
for (i = 0; i < N; i++) {
invert = inverts[i] ^ prog[i].when_to_branch;
prog[i].when_to_branch = invert;
/* Make sure the program always moves forward */
if (WARN_ON(prog[i].target <= i)) {
ret = -EINVAL;
goto out_free;
}
}
return prog;
out_free:
kfree(op_stack);
kfree(prog_stack);
kfree(inverts);
return ERR_PTR(ret);
}
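For reference (not part of the patch): a stand-alone sketch of the first pass described in the block comment above predicate_parse(), applied by hand to "a || b && c". It mirrors update_preds() and ends with exactly the program the comment derives; the struct and values here are purely illustrative.

#include <stdio.h>

struct entry { char pred; int when_to_branch; int target; };

static void update(struct entry *prog, int N, int invert)
{
	int t = prog[N].target;
	int s = prog[t].target;

	prog[t].when_to_branch = invert;
	prog[t].target = N;
	prog[N].target = s;
}

int main(void)
{
	/* #4: each predicate initially targets the previous entry */
	struct entry prog[3] = {
		{ 'a', -1, -1 }, { 'b', -1, 0 }, { 'c', -1, 1 },
	};
	int i;

	/* After "c": first resolve the pending "&&" (invert = 0) ... */
	update(prog, 2, 0);
	/* ... then the pending "||" (invert = 1). */
	update(prog, 2, 1);

	/* #13: terminate the last entry */
	prog[2].when_to_branch = 0;
	prog[2].target = 3;

	for (i = 0; i < 3; i++)	/* prints {a,1,2} {b,0,2} {c,0,3} */
		printf("prog[%d] = { '%c', %d, %d }\n", i,
		       prog[i].pred, prog[i].when_to_branch, prog[i].target);
	return 0;
}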
#define DEFINE_COMPARISON_PRED(type) \ #define DEFINE_COMPARISON_PRED(type) \
static int filter_pred_LT_##type(struct filter_pred *pred, void *event) \ static int filter_pred_LT_##type(struct filter_pred *pred, void *event) \
{ \ { \
type *addr = (type *)(event + pred->offset); \ type *addr = (type *)(event + pred->offset); \
type val = (type)pred->val; \ type val = (type)pred->val; \
int match = (*addr < val); \ return *addr < val; \
return !!match == !pred->not; \
} \ } \
static int filter_pred_LE_##type(struct filter_pred *pred, void *event) \ static int filter_pred_LE_##type(struct filter_pred *pred, void *event) \
{ \ { \
type *addr = (type *)(event + pred->offset); \ type *addr = (type *)(event + pred->offset); \
type val = (type)pred->val; \ type val = (type)pred->val; \
int match = (*addr <= val); \ return *addr <= val; \
return !!match == !pred->not; \
} \ } \
static int filter_pred_GT_##type(struct filter_pred *pred, void *event) \ static int filter_pred_GT_##type(struct filter_pred *pred, void *event) \
{ \ { \
type *addr = (type *)(event + pred->offset); \ type *addr = (type *)(event + pred->offset); \
type val = (type)pred->val; \ type val = (type)pred->val; \
int match = (*addr > val); \ return *addr > val; \
return !!match == !pred->not; \
} \ } \
static int filter_pred_GE_##type(struct filter_pred *pred, void *event) \ static int filter_pred_GE_##type(struct filter_pred *pred, void *event) \
{ \ { \
type *addr = (type *)(event + pred->offset); \ type *addr = (type *)(event + pred->offset); \
type val = (type)pred->val; \ type val = (type)pred->val; \
int match = (*addr >= val); \ return *addr >= val; \
return !!match == !pred->not; \
} \ } \
static int filter_pred_BAND_##type(struct filter_pred *pred, void *event) \ static int filter_pred_BAND_##type(struct filter_pred *pred, void *event) \
{ \ { \
type *addr = (type *)(event + pred->offset); \ type *addr = (type *)(event + pred->offset); \
type val = (type)pred->val; \ type val = (type)pred->val; \
int match = !!(*addr & val); \ return !!(*addr & val); \
return match == !pred->not; \
} \ } \
static const filter_pred_fn_t pred_funcs_##type[] = { \ static const filter_pred_fn_t pred_funcs_##type[] = { \
filter_pred_LT_##type, \
filter_pred_LE_##type, \ filter_pred_LE_##type, \
filter_pred_GT_##type, \ filter_pred_LT_##type, \
filter_pred_GE_##type, \ filter_pred_GE_##type, \
filter_pred_GT_##type, \
filter_pred_BAND_##type, \ filter_pred_BAND_##type, \
}; };
#define PRED_FUNC_START OP_LT
#define DEFINE_EQUALITY_PRED(size) \ #define DEFINE_EQUALITY_PRED(size) \
static int filter_pred_##size(struct filter_pred *pred, void *event) \ static int filter_pred_##size(struct filter_pred *pred, void *event) \
{ \ { \
@@ -272,44 +704,36 @@ static int filter_pred_strloc(struct filter_pred *pred, void *event)
static int filter_pred_cpu(struct filter_pred *pred, void *event) static int filter_pred_cpu(struct filter_pred *pred, void *event)
{ {
int cpu, cmp; int cpu, cmp;
int match = 0;
cpu = raw_smp_processor_id(); cpu = raw_smp_processor_id();
cmp = pred->val; cmp = pred->val;
switch (pred->op) { switch (pred->op) {
case OP_EQ: case OP_EQ:
match = cpu == cmp; return cpu == cmp;
break; case OP_NE:
return cpu != cmp;
case OP_LT: case OP_LT:
match = cpu < cmp; return cpu < cmp;
break;
case OP_LE: case OP_LE:
match = cpu <= cmp; return cpu <= cmp;
break;
case OP_GT: case OP_GT:
match = cpu > cmp; return cpu > cmp;
break;
case OP_GE: case OP_GE:
match = cpu >= cmp; return cpu >= cmp;
break;
default: default:
break; return 0;
} }
return !!match == !pred->not;
} }
/* Filter predicate for COMM. */ /* Filter predicate for COMM. */
static int filter_pred_comm(struct filter_pred *pred, void *event) static int filter_pred_comm(struct filter_pred *pred, void *event)
{ {
int cmp, match; int cmp;
cmp = pred->regex.match(current->comm, &pred->regex, cmp = pred->regex.match(current->comm, &pred->regex,
pred->regex.field_len); TASK_COMM_LEN);
match = cmp ^ pred->not; return cmp ^ pred->not;
return match;
} }
static int filter_pred_none(struct filter_pred *pred, void *event) static int filter_pred_none(struct filter_pred *pred, void *event)
@@ -366,6 +790,7 @@ static int regex_match_glob(char *str, struct regex *r, int len __maybe_unused)
return 1; return 1;
return 0; return 0;
} }
/** /**
* filter_parse_regex - parse a basic regex * filter_parse_regex - parse a basic regex
* @buff: the raw regex * @buff: the raw regex
@@ -426,10 +851,9 @@ static void filter_build_regex(struct filter_pred *pred)
struct regex *r = &pred->regex; struct regex *r = &pred->regex;
char *search; char *search;
enum regex_type type = MATCH_FULL; enum regex_type type = MATCH_FULL;
int not = 0;
if (pred->op == OP_GLOB) { if (pred->op == OP_GLOB) {
type = filter_parse_regex(r->pattern, r->len, &search, &not); type = filter_parse_regex(r->pattern, r->len, &search, &pred->not);
r->len = strlen(search); r->len = strlen(search);
memmove(r->pattern, search, r->len+1); memmove(r->pattern, search, r->len+1);
} }
@@ -451,210 +875,32 @@ static void filter_build_regex(struct filter_pred *pred)
r->match = regex_match_glob; r->match = regex_match_glob;
break; break;
} }
pred->not ^= not;
}
enum move_type {
MOVE_DOWN,
MOVE_UP_FROM_LEFT,
MOVE_UP_FROM_RIGHT
};
static struct filter_pred *
get_pred_parent(struct filter_pred *pred, struct filter_pred *preds,
int index, enum move_type *move)
{
if (pred->parent & FILTER_PRED_IS_RIGHT)
*move = MOVE_UP_FROM_RIGHT;
else
*move = MOVE_UP_FROM_LEFT;
pred = &preds[pred->parent & ~FILTER_PRED_IS_RIGHT];
return pred;
}
enum walk_return {
WALK_PRED_ABORT,
WALK_PRED_PARENT,
WALK_PRED_DEFAULT,
};
typedef int (*filter_pred_walkcb_t) (enum move_type move,
struct filter_pred *pred,
int *err, void *data);
static int walk_pred_tree(struct filter_pred *preds,
struct filter_pred *root,
filter_pred_walkcb_t cb, void *data)
{
struct filter_pred *pred = root;
enum move_type move = MOVE_DOWN;
int done = 0;
if (!preds)
return -EINVAL;
do {
int err = 0, ret;
ret = cb(move, pred, &err, data);
if (ret == WALK_PRED_ABORT)
return err;
if (ret == WALK_PRED_PARENT)
goto get_parent;
switch (move) {
case MOVE_DOWN:
if (pred->left != FILTER_PRED_INVALID) {
pred = &preds[pred->left];
continue;
}
goto get_parent;
case MOVE_UP_FROM_LEFT:
pred = &preds[pred->right];
move = MOVE_DOWN;
continue;
case MOVE_UP_FROM_RIGHT:
get_parent:
if (pred == root)
break;
pred = get_pred_parent(pred, preds,
pred->parent,
&move);
continue;
}
done = 1;
} while (!done);
/* We are fine. */
return 0;
}
/*
* A series of AND or ORs where found together. Instead of
* climbing up and down the tree branches, an array of the
* ops were made in order of checks. We can just move across
* the array and short circuit if needed.
*/
static int process_ops(struct filter_pred *preds,
struct filter_pred *op, void *rec)
{
struct filter_pred *pred;
int match = 0;
int type;
int i;
/*
* Micro-optimization: We set type to true if op
* is an OR and false otherwise (AND). Then we
* just need to test if the match is equal to
* the type, and if it is, we can short circuit the
* rest of the checks:
*
* if ((match && op->op == OP_OR) ||
* (!match && op->op == OP_AND))
* return match;
*/
type = op->op == OP_OR;
for (i = 0; i < op->val; i++) {
pred = &preds[op->ops[i]];
if (!WARN_ON_ONCE(!pred->fn))
match = pred->fn(pred, rec);
if (!!match == type)
break;
}
/* If not of not match is equal to not of not, then it is a match */
return !!match == !op->not;
}
struct filter_match_preds_data {
struct filter_pred *preds;
int match;
void *rec;
};
static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
int *err, void *data)
{
struct filter_match_preds_data *d = data;
*err = 0;
switch (move) {
case MOVE_DOWN:
/* only AND and OR have children */
if (pred->left != FILTER_PRED_INVALID) {
/* If ops is set, then it was folded. */
if (!pred->ops)
return WALK_PRED_DEFAULT;
/* We can treat folded ops as a leaf node */
d->match = process_ops(d->preds, pred, d->rec);
} else {
if (!WARN_ON_ONCE(!pred->fn))
d->match = pred->fn(pred, d->rec);
}
return WALK_PRED_PARENT;
case MOVE_UP_FROM_LEFT:
/*
* Check for short circuits.
*
* Optimization: !!match == (pred->op == OP_OR)
* is the same as:
* if ((match && pred->op == OP_OR) ||
* (!match && pred->op == OP_AND))
*/
if (!!d->match == (pred->op == OP_OR))
return WALK_PRED_PARENT;
break;
case MOVE_UP_FROM_RIGHT:
break;
}
return WALK_PRED_DEFAULT;
} }
/* return 1 if event matches, 0 otherwise (discard) */ /* return 1 if event matches, 0 otherwise (discard) */
int filter_match_preds(struct event_filter *filter, void *rec) int filter_match_preds(struct event_filter *filter, void *rec)
{ {
struct filter_pred *preds; struct prog_entry *prog;
struct filter_pred *root; int i;
struct filter_match_preds_data data = {
/* match is currently meaningless */
.match = -1,
.rec = rec,
};
int n_preds, ret;
/* no filter is considered a match */ /* no filter is considered a match */
if (!filter) if (!filter)
return 1; return 1;
n_preds = filter->n_preds; prog = rcu_dereference_sched(filter->prog);
if (!n_preds) if (!prog)
return 1;
/*
* n_preds, root and filter->preds are protect with preemption disabled.
*/
root = rcu_dereference_sched(filter->root);
if (!root)
return 1; return 1;
data.preds = preds = rcu_dereference_sched(filter->preds); for (i = 0; prog[i].pred; i++) {
ret = walk_pred_tree(preds, root, filter_match_preds_cb, &data); struct filter_pred *pred = prog[i].pred;
WARN_ON(ret); int match = pred->fn(pred, rec);
return data.match; if (match == prog[i].when_to_branch)
i = prog[i].target;
}
return prog[i].target;
} }
EXPORT_SYMBOL_GPL(filter_match_preds); EXPORT_SYMBOL_GPL(filter_match_preds);
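For reference (not part of the patch): the loop above is the whole interpreter for the compiled program. A stand-alone sketch that evaluates the "a || b && c" program (after all three passes, so the branch after "b" has already been retargeted at FALSE by the second pass) for every input combination; predicates are replaced by plain values and all names are made up. Note that the target field is one less than the entry execution continues at, because the for-loop increment runs after the assignment.

#include <stdio.h>

struct entry { int pred_idx; int when_to_branch; int target; };

int main(void)
{
	static const struct entry prog[] = {
		{  0, 1, 2 },	/* if (a)  -> continue at entry 3 (TRUE)  */
		{  1, 0, 3 },	/* if (!b) -> continue at entry 4 (FALSE) */
		{  2, 0, 3 },	/* if (!c) -> continue at entry 4 (FALSE) */
		{ -1, 0, 1 },	/* TRUE:  no predicate, result in ->target */
		{ -1, 0, 0 },	/* FALSE: no predicate, result in ->target */
	};
	int vals[3];

	for (vals[0] = 0; vals[0] < 2; vals[0]++)
	for (vals[1] = 0; vals[1] < 2; vals[1]++)
	for (vals[2] = 0; vals[2] < 2; vals[2]++) {
		int i;

		for (i = 0; prog[i].pred_idx >= 0; i++) {
			int match = vals[prog[i].pred_idx];

			if (match == prog[i].when_to_branch)
				i = prog[i].target;
		}
		printf("a=%d b=%d c=%d -> %d (expected %d)\n",
		       vals[0], vals[1], vals[2], prog[i].target,
		       vals[0] || (vals[1] && vals[2]));
	}
	return 0;
}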
static void parse_error(struct filter_parse_state *ps, int err, int pos)
{
ps->lasterr = err;
ps->lasterr_pos = pos;
}
static void remove_filter_string(struct event_filter *filter) static void remove_filter_string(struct event_filter *filter)
{ {
if (!filter) if (!filter)
@@ -664,57 +910,44 @@ static void remove_filter_string(struct event_filter *filter)
filter->filter_string = NULL; filter->filter_string = NULL;
} }
static int replace_filter_string(struct event_filter *filter, static void append_filter_err(struct filter_parse_error *pe,
char *filter_string)
{
kfree(filter->filter_string);
filter->filter_string = kstrdup(filter_string, GFP_KERNEL);
if (!filter->filter_string)
return -ENOMEM;
return 0;
}
static int append_filter_string(struct event_filter *filter,
char *string)
{
int newlen;
char *new_filter_string;
BUG_ON(!filter->filter_string);
newlen = strlen(filter->filter_string) + strlen(string) + 1;
new_filter_string = kmalloc(newlen, GFP_KERNEL);
if (!new_filter_string)
return -ENOMEM;
strcpy(new_filter_string, filter->filter_string);
strcat(new_filter_string, string);
kfree(filter->filter_string);
filter->filter_string = new_filter_string;
return 0;
}
static void append_filter_err(struct filter_parse_state *ps,
struct event_filter *filter) struct event_filter *filter)
{ {
int pos = ps->lasterr_pos; struct trace_seq *s;
char *buf, *pbuf; int pos = pe->lasterr_pos;
char *buf;
int len;
if (WARN_ON(!filter->filter_string))
return;
buf = (char *)__get_free_page(GFP_KERNEL); s = kmalloc(sizeof(*s), GFP_KERNEL);
if (!buf) if (!s)
return; return;
trace_seq_init(s);
len = strlen(filter->filter_string);
if (pos > len)
pos = len;
append_filter_string(filter, "\n"); /* indexing is off by one */
memset(buf, ' ', PAGE_SIZE); if (pos)
if (pos > PAGE_SIZE - 128) pos++;
pos = 0;
buf[pos] = '^';
pbuf = &buf[pos] + 1;
sprintf(pbuf, "\nparse_error: %s\n", err_text[ps->lasterr]); trace_seq_puts(s, filter->filter_string);
append_filter_string(filter, buf); if (pe->lasterr > 0) {
free_page((unsigned long) buf); trace_seq_printf(s, "\n%*s", pos, "^");
trace_seq_printf(s, "\nparse_error: %s\n", err_text[pe->lasterr]);
} else {
trace_seq_printf(s, "\nError: (%d)\n", pe->lasterr);
}
trace_seq_putc(s, 0);
buf = kmemdup_nul(s->buffer, s->seq.len, GFP_KERNEL);
if (buf) {
kfree(filter->filter_string);
filter->filter_string = buf;
}
kfree(s);
} }
static inline struct event_filter *event_filter(struct trace_event_file *file) static inline struct event_filter *event_filter(struct trace_event_file *file)
@@ -747,166 +980,44 @@ void print_subsystem_event_filter(struct event_subsystem *system,
mutex_unlock(&event_mutex); mutex_unlock(&event_mutex);
} }
static int __alloc_pred_stack(struct pred_stack *stack, int n_preds) static void free_prog(struct event_filter *filter)
{ {
stack->preds = kcalloc(n_preds + 1, sizeof(*stack->preds), GFP_KERNEL); struct prog_entry *prog;
if (!stack->preds) int i;
return -ENOMEM;
stack->index = n_preds;
return 0;
}
static void __free_pred_stack(struct pred_stack *stack) prog = rcu_access_pointer(filter->prog);
{ if (!prog)
kfree(stack->preds); return;
stack->index = 0;
for (i = 0; prog[i].pred; i++)
kfree(prog[i].pred);
kfree(prog);
} }
static int __push_pred_stack(struct pred_stack *stack, static void filter_disable(struct trace_event_file *file)
struct filter_pred *pred)
{ {
int index = stack->index; unsigned long old_flags = file->flags;
if (WARN_ON(index == 0)) file->flags &= ~EVENT_FILE_FL_FILTERED;
return -ENOSPC;
stack->preds[--index] = pred; if (old_flags != file->flags)
stack->index = index; trace_buffered_event_disable();
return 0;
} }
static struct filter_pred * static void __free_filter(struct event_filter *filter)
__pop_pred_stack(struct pred_stack *stack)
{ {
struct filter_pred *pred; if (!filter)
int index = stack->index; return;
pred = stack->preds[index++];
if (!pred)
return NULL;
stack->index = index; free_prog(filter);
return pred; kfree(filter->filter_string);
kfree(filter);
} }
static int filter_set_pred(struct event_filter *filter, void free_event_filter(struct event_filter *filter)
int idx,
struct pred_stack *stack,
struct filter_pred *src)
{ {
struct filter_pred *dest = &filter->preds[idx]; __free_filter(filter);
struct filter_pred *left; }
struct filter_pred *right;
*dest = *src;
dest->index = idx;
if (dest->op == OP_OR || dest->op == OP_AND) {
right = __pop_pred_stack(stack);
left = __pop_pred_stack(stack);
if (!left || !right)
return -EINVAL;
/*
* If both children can be folded
* and they are the same op as this op or a leaf,
* then this op can be folded.
*/
if (left->index & FILTER_PRED_FOLD &&
((left->op == dest->op && !left->not) ||
left->left == FILTER_PRED_INVALID) &&
right->index & FILTER_PRED_FOLD &&
((right->op == dest->op && !right->not) ||
right->left == FILTER_PRED_INVALID))
dest->index |= FILTER_PRED_FOLD;
dest->left = left->index & ~FILTER_PRED_FOLD;
dest->right = right->index & ~FILTER_PRED_FOLD;
left->parent = dest->index & ~FILTER_PRED_FOLD;
right->parent = dest->index | FILTER_PRED_IS_RIGHT;
} else {
/*
* Make dest->left invalid to be used as a quick
* way to know this is a leaf node.
*/
dest->left = FILTER_PRED_INVALID;
/* All leafs allow folding the parent ops. */
dest->index |= FILTER_PRED_FOLD;
}
return __push_pred_stack(stack, dest);
}
static void __free_preds(struct event_filter *filter)
{
int i;
if (filter->preds) {
for (i = 0; i < filter->n_preds; i++)
kfree(filter->preds[i].ops);
kfree(filter->preds);
filter->preds = NULL;
}
filter->a_preds = 0;
filter->n_preds = 0;
}
static void filter_disable(struct trace_event_file *file)
{
unsigned long old_flags = file->flags;
file->flags &= ~EVENT_FILE_FL_FILTERED;
if (old_flags != file->flags)
trace_buffered_event_disable();
}
static void __free_filter(struct event_filter *filter)
{
if (!filter)
return;
__free_preds(filter);
kfree(filter->filter_string);
kfree(filter);
}
void free_event_filter(struct event_filter *filter)
{
__free_filter(filter);
}
static struct event_filter *__alloc_filter(void)
{
struct event_filter *filter;
filter = kzalloc(sizeof(*filter), GFP_KERNEL);
return filter;
}
static int __alloc_preds(struct event_filter *filter, int n_preds)
{
struct filter_pred *pred;
int i;
if (filter->preds)
__free_preds(filter);
filter->preds = kcalloc(n_preds, sizeof(*filter->preds), GFP_KERNEL);
if (!filter->preds)
return -ENOMEM;
filter->a_preds = n_preds;
filter->n_preds = 0;
for (i = 0; i < n_preds; i++) {
pred = &filter->preds[i];
pred->fn = filter_pred_none;
}
return 0;
}
static inline void __remove_filter(struct trace_event_file *file) static inline void __remove_filter(struct trace_event_file *file)
{ {
@@ -937,800 +1048,467 @@ static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
{ {
struct trace_event_file *file; struct trace_event_file *file;
list_for_each_entry(file, &tr->events, list) { list_for_each_entry(file, &tr->events, list) {
if (file->system != dir) if (file->system != dir)
continue; continue;
__free_subsystem_filter(file); __free_subsystem_filter(file);
}
}
static int filter_add_pred(struct filter_parse_state *ps,
struct event_filter *filter,
struct filter_pred *pred,
struct pred_stack *stack)
{
int err;
if (WARN_ON(filter->n_preds == filter->a_preds)) {
parse_error(ps, FILT_ERR_TOO_MANY_PREDS, 0);
return -ENOSPC;
}
err = filter_set_pred(filter, filter->n_preds, stack, pred);
if (err)
return err;
filter->n_preds++;
return 0;
}
int filter_assign_type(const char *type)
{
if (strstr(type, "__data_loc") && strstr(type, "char"))
return FILTER_DYN_STRING;
if (strchr(type, '[') && strstr(type, "char"))
return FILTER_STATIC_STRING;
return FILTER_OTHER;
}
static bool is_legal_op(struct ftrace_event_field *field, enum filter_op_ids op)
{
if (is_string_field(field) &&
(op != OP_EQ && op != OP_NE && op != OP_GLOB))
return false;
if (!is_string_field(field) && op == OP_GLOB)
return false;
return true;
}
static filter_pred_fn_t select_comparison_fn(enum filter_op_ids op,
int field_size, int field_is_signed)
{
filter_pred_fn_t fn = NULL;
switch (field_size) {
case 8:
if (op == OP_EQ || op == OP_NE)
fn = filter_pred_64;
else if (field_is_signed)
fn = pred_funcs_s64[op - PRED_FUNC_START];
else
fn = pred_funcs_u64[op - PRED_FUNC_START];
break;
case 4:
if (op == OP_EQ || op == OP_NE)
fn = filter_pred_32;
else if (field_is_signed)
fn = pred_funcs_s32[op - PRED_FUNC_START];
else
fn = pred_funcs_u32[op - PRED_FUNC_START];
break;
case 2:
if (op == OP_EQ || op == OP_NE)
fn = filter_pred_16;
else if (field_is_signed)
fn = pred_funcs_s16[op - PRED_FUNC_START];
else
fn = pred_funcs_u16[op - PRED_FUNC_START];
break;
case 1:
if (op == OP_EQ || op == OP_NE)
fn = filter_pred_8;
else if (field_is_signed)
fn = pred_funcs_s8[op - PRED_FUNC_START];
else
fn = pred_funcs_u8[op - PRED_FUNC_START];
break;
}
return fn;
}
static int init_pred(struct filter_parse_state *ps,
struct ftrace_event_field *field,
struct filter_pred *pred)
{
filter_pred_fn_t fn = filter_pred_none;
unsigned long long val;
int ret;
pred->offset = field->offset;
if (!is_legal_op(field, pred->op)) {
parse_error(ps, FILT_ERR_ILLEGAL_FIELD_OP, 0);
return -EINVAL;
}
if (field->filter_type == FILTER_COMM) {
filter_build_regex(pred);
fn = filter_pred_comm;
pred->regex.field_len = TASK_COMM_LEN;
} else if (is_string_field(field)) {
filter_build_regex(pred);
if (field->filter_type == FILTER_STATIC_STRING) {
fn = filter_pred_string;
pred->regex.field_len = field->size;
} else if (field->filter_type == FILTER_DYN_STRING)
fn = filter_pred_strloc;
else
fn = filter_pred_pchar;
} else if (is_function_field(field)) {
if (strcmp(field->name, "ip")) {
parse_error(ps, FILT_ERR_IP_FIELD_ONLY, 0);
return -EINVAL;
}
} else {
if (field->is_signed)
ret = kstrtoll(pred->regex.pattern, 0, &val);
else
ret = kstrtoull(pred->regex.pattern, 0, &val);
if (ret) {
parse_error(ps, FILT_ERR_ILLEGAL_INTVAL, 0);
return -EINVAL;
}
pred->val = val;
if (field->filter_type == FILTER_CPU)
fn = filter_pred_cpu;
else
fn = select_comparison_fn(pred->op, field->size,
field->is_signed);
if (!fn) {
parse_error(ps, FILT_ERR_INVALID_OP, 0);
return -EINVAL;
}
}
if (pred->op == OP_NE)
pred->not ^= 1;
pred->fn = fn;
return 0;
}
static void parse_init(struct filter_parse_state *ps,
struct filter_op *ops,
char *infix_string)
{
memset(ps, '\0', sizeof(*ps));
ps->infix.string = infix_string;
ps->infix.cnt = strlen(infix_string);
ps->ops = ops;
INIT_LIST_HEAD(&ps->opstack);
INIT_LIST_HEAD(&ps->postfix);
}
static char infix_next(struct filter_parse_state *ps)
{
if (!ps->infix.cnt)
return 0;
ps->infix.cnt--;
return ps->infix.string[ps->infix.tail++];
}
static char infix_peek(struct filter_parse_state *ps)
{
if (ps->infix.tail == strlen(ps->infix.string))
return 0;
return ps->infix.string[ps->infix.tail];
}
static void infix_advance(struct filter_parse_state *ps)
{
if (!ps->infix.cnt)
return;
ps->infix.cnt--;
ps->infix.tail++;
}
static inline int is_precedence_lower(struct filter_parse_state *ps,
int a, int b)
{
return ps->ops[a].precedence < ps->ops[b].precedence;
}
static inline int is_op_char(struct filter_parse_state *ps, char c)
{
int i;
for (i = 0; strcmp(ps->ops[i].string, "OP_NONE"); i++) {
if (ps->ops[i].string[0] == c)
return 1;
}
return 0;
}
static int infix_get_op(struct filter_parse_state *ps, char firstc)
{
char nextc = infix_peek(ps);
char opstr[3];
int i;
opstr[0] = firstc;
opstr[1] = nextc;
opstr[2] = '\0';
for (i = 0; strcmp(ps->ops[i].string, "OP_NONE"); i++) {
if (!strcmp(opstr, ps->ops[i].string)) {
infix_advance(ps);
return ps->ops[i].id;
}
}
opstr[1] = '\0';
for (i = 0; strcmp(ps->ops[i].string, "OP_NONE"); i++) {
if (!strcmp(opstr, ps->ops[i].string))
return ps->ops[i].id;
}
return OP_NONE;
}
static inline void clear_operand_string(struct filter_parse_state *ps)
{
memset(ps->operand.string, '\0', MAX_FILTER_STR_VAL);
ps->operand.tail = 0;
}
static inline int append_operand_char(struct filter_parse_state *ps, char c)
{
if (ps->operand.tail == MAX_FILTER_STR_VAL - 1)
return -EINVAL;
ps->operand.string[ps->operand.tail++] = c;
return 0;
}
static int filter_opstack_push(struct filter_parse_state *ps,
enum filter_op_ids op)
{
struct opstack_op *opstack_op;
opstack_op = kmalloc(sizeof(*opstack_op), GFP_KERNEL);
if (!opstack_op)
return -ENOMEM;
opstack_op->op = op;
list_add(&opstack_op->list, &ps->opstack);
return 0;
}
static int filter_opstack_empty(struct filter_parse_state *ps)
{
return list_empty(&ps->opstack);
}
static int filter_opstack_top(struct filter_parse_state *ps)
{
struct opstack_op *opstack_op;
if (filter_opstack_empty(ps))
return OP_NONE;
opstack_op = list_first_entry(&ps->opstack, struct opstack_op, list);
return opstack_op->op;
}
static int filter_opstack_pop(struct filter_parse_state *ps)
{
struct opstack_op *opstack_op;
enum filter_op_ids op;
if (filter_opstack_empty(ps))
return OP_NONE;
opstack_op = list_first_entry(&ps->opstack, struct opstack_op, list);
op = opstack_op->op;
list_del(&opstack_op->list);
kfree(opstack_op);
return op;
}
static void filter_opstack_clear(struct filter_parse_state *ps)
{
while (!filter_opstack_empty(ps))
filter_opstack_pop(ps);
}
static char *curr_operand(struct filter_parse_state *ps)
{
return ps->operand.string;
}
static int postfix_append_operand(struct filter_parse_state *ps, char *operand)
{
struct postfix_elt *elt;
elt = kmalloc(sizeof(*elt), GFP_KERNEL);
if (!elt)
return -ENOMEM;
elt->op = OP_NONE;
elt->operand = kstrdup(operand, GFP_KERNEL);
if (!elt->operand) {
kfree(elt);
return -ENOMEM;
}
list_add_tail(&elt->list, &ps->postfix);
return 0;
}
static int postfix_append_op(struct filter_parse_state *ps, enum filter_op_ids op)
{
struct postfix_elt *elt;
elt = kmalloc(sizeof(*elt), GFP_KERNEL);
if (!elt)
return -ENOMEM;
elt->op = op;
elt->operand = NULL;
list_add_tail(&elt->list, &ps->postfix);
return 0;
}
static void postfix_clear(struct filter_parse_state *ps)
{
struct postfix_elt *elt;
while (!list_empty(&ps->postfix)) {
elt = list_first_entry(&ps->postfix, struct postfix_elt, list);
list_del(&elt->list);
kfree(elt->operand);
kfree(elt);
}
}
static int filter_parse(struct filter_parse_state *ps)
{
enum filter_op_ids op, top_op;
int in_string = 0;
char ch;
while ((ch = infix_next(ps))) {
if (ch == '"') {
in_string ^= 1;
continue;
}
if (in_string)
goto parse_operand;
if (isspace(ch))
continue;
if (is_op_char(ps, ch)) {
op = infix_get_op(ps, ch);
if (op == OP_NONE) {
parse_error(ps, FILT_ERR_INVALID_OP, 0);
return -EINVAL;
}
if (strlen(curr_operand(ps))) {
postfix_append_operand(ps, curr_operand(ps));
clear_operand_string(ps);
}
while (!filter_opstack_empty(ps)) {
top_op = filter_opstack_top(ps);
if (!is_precedence_lower(ps, top_op, op)) {
top_op = filter_opstack_pop(ps);
postfix_append_op(ps, top_op);
continue;
}
break;
}
filter_opstack_push(ps, op);
continue;
}
if (ch == '(') {
filter_opstack_push(ps, OP_OPEN_PAREN);
continue;
}
if (ch == ')') {
if (strlen(curr_operand(ps))) {
postfix_append_operand(ps, curr_operand(ps));
clear_operand_string(ps);
}
top_op = filter_opstack_pop(ps);
while (top_op != OP_NONE) {
if (top_op == OP_OPEN_PAREN)
break;
postfix_append_op(ps, top_op);
top_op = filter_opstack_pop(ps);
}
if (top_op == OP_NONE) {
parse_error(ps, FILT_ERR_UNBALANCED_PAREN, 0);
return -EINVAL;
}
continue;
}
parse_operand:
if (append_operand_char(ps, ch)) {
parse_error(ps, FILT_ERR_OPERAND_TOO_LONG, 0);
return -EINVAL;
}
}
if (strlen(curr_operand(ps)))
postfix_append_operand(ps, curr_operand(ps));
while (!filter_opstack_empty(ps)) {
top_op = filter_opstack_pop(ps);
if (top_op == OP_NONE)
break;
if (top_op == OP_OPEN_PAREN) {
parse_error(ps, FILT_ERR_UNBALANCED_PAREN, 0);
return -EINVAL;
}
postfix_append_op(ps, top_op);
} }
return 0;
} }
static struct filter_pred *create_pred(struct filter_parse_state *ps, int filter_assign_type(const char *type)
struct trace_event_call *call,
enum filter_op_ids op,
char *operand1, char *operand2)
{ {
struct ftrace_event_field *field; if (strstr(type, "__data_loc") && strstr(type, "char"))
static struct filter_pred pred; return FILTER_DYN_STRING;
if (strchr(type, '[') && strstr(type, "char"))
return FILTER_STATIC_STRING;
memset(&pred, 0, sizeof(pred)); return FILTER_OTHER;
pred.op = op; }
if (op == OP_AND || op == OP_OR) static filter_pred_fn_t select_comparison_fn(enum filter_op_ids op,
return &pred; int field_size, int field_is_signed)
{
filter_pred_fn_t fn = NULL;
int pred_func_index = -1;
if (!operand1 || !operand2) { switch (op) {
parse_error(ps, FILT_ERR_MISSING_FIELD, 0); case OP_EQ:
return NULL; case OP_NE:
break;
default:
if (WARN_ON_ONCE(op < PRED_FUNC_START))
return NULL;
pred_func_index = op - PRED_FUNC_START;
if (WARN_ON_ONCE(pred_func_index > PRED_FUNC_MAX))
return NULL;
} }
field = trace_find_event_field(call, operand1); switch (field_size) {
if (!field) { case 8:
parse_error(ps, FILT_ERR_FIELD_NOT_FOUND, 0); if (pred_func_index < 0)
return NULL; fn = filter_pred_64;
else if (field_is_signed)
fn = pred_funcs_s64[pred_func_index];
else
fn = pred_funcs_u64[pred_func_index];
break;
case 4:
if (pred_func_index < 0)
fn = filter_pred_32;
else if (field_is_signed)
fn = pred_funcs_s32[pred_func_index];
else
fn = pred_funcs_u32[pred_func_index];
break;
case 2:
if (pred_func_index < 0)
fn = filter_pred_16;
else if (field_is_signed)
fn = pred_funcs_s16[pred_func_index];
else
fn = pred_funcs_u16[pred_func_index];
break;
case 1:
if (pred_func_index < 0)
fn = filter_pred_8;
else if (field_is_signed)
fn = pred_funcs_s8[pred_func_index];
else
fn = pred_funcs_u8[pred_func_index];
break;
} }
strcpy(pred.regex.pattern, operand2); return fn;
pred.regex.len = strlen(pred.regex.pattern);
pred.field = field;
return init_pred(ps, field, &pred) ? NULL : &pred;
} }
static int check_preds(struct filter_parse_state *ps) /* Called when a predicate is encountered by predicate_parse() */
static int parse_pred(const char *str, void *data,
int pos, struct filter_parse_error *pe,
struct filter_pred **pred_ptr)
{ {
int n_normal_preds = 0, n_logical_preds = 0; struct trace_event_call *call = data;
struct postfix_elt *elt; struct ftrace_event_field *field;
int cnt = 0; struct filter_pred *pred = NULL;
char num_buf[24]; /* Big enough to hold an address */
char *field_name;
char q;
u64 val;
int len;
int ret;
int op;
int s;
int i = 0;
list_for_each_entry(elt, &ps->postfix, list) { /* First find the field to associate to */
if (elt->op == OP_NONE) { while (isspace(str[i]))
cnt++; i++;
continue; s = i;
}
if (elt->op == OP_AND || elt->op == OP_OR) { while (isalnum(str[i]) || str[i] == '_')
n_logical_preds++; i++;
cnt--;
continue; len = i - s;
}
if (elt->op != OP_NOT) if (!len)
cnt--; return -1;
n_normal_preds++;
/* all ops should have operands */
if (cnt < 0)
break;
}
if (cnt != 1 || !n_normal_preds || n_logical_preds >= n_normal_preds) { field_name = kmemdup_nul(str + s, len, GFP_KERNEL);
parse_error(ps, FILT_ERR_INVALID_FILTER, 0); if (!field_name)
return -ENOMEM;
/* Make sure that the field exists */
field = trace_find_event_field(call, field_name);
kfree(field_name);
if (!field) {
parse_error(pe, FILT_ERR_FIELD_NOT_FOUND, pos + i);
return -EINVAL; return -EINVAL;
} }
return 0; while (isspace(str[i]))
} i++;
static int count_preds(struct filter_parse_state *ps) /* Make sure this op is supported */
{ for (op = 0; ops[op]; op++) {
struct postfix_elt *elt; /* This is why '<=' must come before '<' in ops[] */
int n_preds = 0; if (strncmp(str + i, ops[op], strlen(ops[op])) == 0)
break;
}
list_for_each_entry(elt, &ps->postfix, list) { if (!ops[op]) {
if (elt->op == OP_NONE) parse_error(pe, FILT_ERR_INVALID_OP, pos + i);
continue; goto err_free;
n_preds++;
} }
return n_preds; i += strlen(ops[op]);
}
struct check_pred_data { while (isspace(str[i]))
int count; i++;
int max;
};
static int check_pred_tree_cb(enum move_type move, struct filter_pred *pred, s = i;
int *err, void *data)
{
struct check_pred_data *d = data;
if (WARN_ON(d->count++ > d->max)) { pred = kzalloc(sizeof(*pred), GFP_KERNEL);
*err = -EINVAL; if (!pred)
return WALK_PRED_ABORT; return -ENOMEM;
}
return WALK_PRED_DEFAULT;
}
/* pred->field = field;
* The tree is walked at filtering of an event. If the tree is not correctly pred->offset = field->offset;
* built, it may cause an infinite loop. Check here that the tree does pred->op = op;
* indeed terminate.
*/ if (ftrace_event_is_function(call)) {
static int check_pred_tree(struct event_filter *filter,
struct filter_pred *root)
{
struct check_pred_data data = {
/* /*
* The max that we can hit a node is three times. * Perf does things different with function events.
* Once going down, once coming up from left, and * It only allows an "ip" field, and expects a string.
* once coming up from right. This is more than enough * But the string does not need to be surrounded by quotes.
* since leafs are only hit a single time. * If it is a string, the assigned function as a nop,
* (perf doesn't use it) and grab everything.
*/ */
.max = 3 * filter->n_preds, if (strcmp(field->name, "ip") != 0) {
.count = 0, parse_error(pe, FILT_ERR_IP_FIELD_ONLY, pos + i);
}; goto err_free;
}
pred->fn = filter_pred_none;
/*
* Quotes are not required, but if they exist then we need
* to read them till we hit a matching one.
*/
if (str[i] == '\'' || str[i] == '"')
q = str[i];
else
q = 0;
for (i++; str[i]; i++) {
if (q && str[i] == q)
break;
if (!q && (str[i] == ')' || str[i] == '&' ||
str[i] == '|'))
break;
}
/* Skip quotes */
if (q)
s++;
len = i - s;
if (len >= MAX_FILTER_STR_VAL) {
parse_error(pe, FILT_ERR_OPERAND_TOO_LONG, pos + i);
goto err_free;
}
		pred->regex.len = len;
		strncpy(pred->regex.pattern, str + s, len);
		pred->regex.pattern[len] = 0;

	/* This is either a string, or an integer */
	} else if (str[i] == '\'' || str[i] == '"') {
		char q = str[i];

		/* Make sure the op is OK for strings */
		switch (op) {
		case OP_NE:
			pred->not = 1;
			/* Fall through */
		case OP_GLOB:
		case OP_EQ:
			break;
		default:
			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
			goto err_free;
		}

		/* Make sure the field is OK for strings */
		if (!is_string_field(field)) {
			parse_error(pe, FILT_ERR_EXPECT_DIGIT, pos + i);
			goto err_free;
		}

		for (i++; str[i]; i++) {
			if (str[i] == q)
				break;
		}
		if (!str[i]) {
			parse_error(pe, FILT_ERR_MISSING_QUOTE, pos + i);
			goto err_free;
		}

		/* Skip quotes */
		s++;
		len = i - s;
		if (len >= MAX_FILTER_STR_VAL) {
			parse_error(pe, FILT_ERR_OPERAND_TOO_LONG, pos + i);
			goto err_free;
		}

		pred->regex.len = len;
		strncpy(pred->regex.pattern, str + s, len);
		pred->regex.pattern[len] = 0;

		filter_build_regex(pred);

		if (field->filter_type == FILTER_COMM) {
			pred->fn = filter_pred_comm;

		} else if (field->filter_type == FILTER_STATIC_STRING) {
			pred->fn = filter_pred_string;
			pred->regex.field_len = field->size;

		} else if (field->filter_type == FILTER_DYN_STRING)
			pred->fn = filter_pred_strloc;
		else
			pred->fn = filter_pred_pchar;

		/* go past the last quote */
		i++;

	} else if (isdigit(str[i])) {

		/* Make sure the field is not a string */
		if (is_string_field(field)) {
			parse_error(pe, FILT_ERR_EXPECT_STRING, pos + i);
			goto err_free;
		}

		if (op == OP_GLOB) {
			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
			goto err_free;
		}

		/* We allow 0xDEADBEEF */
		while (isalnum(str[i]))
			i++;

		len = i - s;
		/* 0xfeedfacedeadbeef is 18 chars max */
		if (len >= sizeof(num_buf)) {
			parse_error(pe, FILT_ERR_OPERAND_TOO_LONG, pos + i);
			goto err_free;
		}

		strncpy(num_buf, str + s, len);
		num_buf[len] = 0;

		/* Make sure it is a value */
		if (field->is_signed)
			ret = kstrtoll(num_buf, 0, &val);
		else
			ret = kstrtoull(num_buf, 0, &val);
		if (ret) {
			parse_error(pe, FILT_ERR_ILLEGAL_INTVAL, pos + s);
			goto err_free;
		}

		pred->val = val;

		if (field->filter_type == FILTER_CPU)
			pred->fn = filter_pred_cpu;
		else {
			pred->fn = select_comparison_fn(pred->op, field->size,
							field->is_signed);
			if (pred->op == OP_NE)
				pred->not = 1;
		}

	} else {
		parse_error(pe, FILT_ERR_INVALID_VALUE, pos + i);
		goto err_free;
	}

	*pred_ptr = pred;
	return i;

err_free:
	kfree(pred);
	return -EINVAL;
}
enum {
TOO_MANY_CLOSE = -1,
TOO_MANY_OPEN = -2,
MISSING_QUOTE = -3,
};
/*
 * Read the filter string once to calculate the number of predicates
 * as well as how deep the parentheses go.
 *
 * Returns:
 *   0 - everything is fine (err is undefined)
 *  -1 - too many ')'
 *  -2 - too many '('
 *  -3 - No matching quote
 */
static int calc_stack(const char *str, int *parens, int *preds, int *err)
{
	bool is_pred = false;
	int nr_preds = 0;
	int open = 1;	/* Count the expression as "(E)" */
	int last_quote = 0;
	int max_open = 1;
	int quote = 0;
	int i;

	*err = 0;

	for (i = 0; str[i]; i++) {
		if (isspace(str[i]))
			continue;
		if (quote) {
			if (str[i] == quote)
				quote = 0;
			continue;
		}

		switch (str[i]) {
		case '\'':
		case '"':
			quote = str[i];
			last_quote = i;
			break;
		case '|':
		case '&':
			if (str[i+1] != str[i])
				break;
			is_pred = false;
			continue;
		case '(':
			is_pred = false;
			open++;
			if (open > max_open)
				max_open = open;
			continue;
		case ')':
			is_pred = false;
			if (open == 1) {
				*err = i;
				return TOO_MANY_CLOSE;
			}
			open--;
			continue;
		}
		if (!is_pred) {
			nr_preds++;
			is_pred = true;
		}
	}

	if (quote) {
		*err = last_quote;
		return MISSING_QUOTE;
	}

	if (open != 1) {
		int level = open;

		/* find the bad open */
		for (i--; i; i--) {
			if (quote) {
				if (str[i] == quote)
					quote = 0;
				continue;
			}
			switch (str[i]) {
			case '(':
				if (level == open) {
					*err = i;
					return TOO_MANY_OPEN;
				}
				level--;
				break;
			case ')':
				level++;
				break;
			case '\'':
			case '"':
				quote = str[i];
				break;
			}
		}
		/* First character is the '(' with missing ')' */
		*err = 0;
		return TOO_MANY_OPEN;
	}

	/* Set the size of the required stacks */
	*parens = max_open;
	*preds = nr_preds;

	return 0;
}

static int process_preds(struct trace_event_call *call,
			 const char *filter_string,
			 struct event_filter *filter,
			 struct filter_parse_error *pe)
{
	struct prog_entry *prog;
	int nr_parens;
	int nr_preds;
	int index;
	int ret;

	ret = calc_stack(filter_string, &nr_parens, &nr_preds, &index);
	if (ret < 0) {
		switch (ret) {
		case MISSING_QUOTE:
			parse_error(pe, FILT_ERR_MISSING_QUOTE, index);
			break;
		case TOO_MANY_OPEN:
			parse_error(pe, FILT_ERR_TOO_MANY_OPEN, index);
			break;
		default:
			parse_error(pe, FILT_ERR_TOO_MANY_CLOSE, index);
		}
		return ret;
	}

	if (!nr_preds) {
		prog = NULL;
	} else {
		prog = predicate_parse(filter_string, nr_parens, nr_preds,
				       parse_pred, call, pe);
		if (IS_ERR(prog))
			return PTR_ERR(prog);
	}
	rcu_assign_pointer(filter->prog, prog);
	return 0;
}
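The counting pass above only needs quoting, parenthesis depth and a "currently inside a predicate" flag, so its behaviour is easy to reproduce outside the kernel. The following is a simplified userspace sketch of that pass for well-formed filter strings (no error positions are reported); the function and variable names are mine, not the kernel's:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

static void count_filter(const char *str, int *max_depth, int *nr_preds)
{
	bool in_pred = false;
	int depth = 1, quote = 0;	/* whole expression counts as "(E)" */

	*max_depth = 1;
	*nr_preds = 0;

	for (int i = 0; str[i]; i++) {
		char c = str[i];

		if (isspace(c))
			continue;
		if (quote) {			/* skip over quoted operands */
			if (c == quote)
				quote = 0;
			continue;
		}
		switch (c) {
		case '\'': case '"':
			quote = c;
			break;
		case '&': case '|':
			if (str[i + 1] != c)	/* a lone & or | stays inside a predicate */
				break;
			in_pred = false;
			continue;
		case '(':
			in_pred = false;
			if (++depth > *max_depth)
				*max_depth = depth;
			continue;
		case ')':
			in_pred = false;
			depth--;
			continue;
		}
		if (!in_pred) {			/* first character of a new predicate */
			(*nr_preds)++;
			in_pred = true;
		}
	}
}

int main(void)
{
	int depth, preds;

	count_filter("(pid == 1 || pid == 2) && comm != \"bash\"", &depth, &preds);
	printf("max depth %d, %d predicates\n", depth, preds);	/* max depth 2, 3 predicates */
	return 0;
}
```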
static inline void event_set_filtered_flag(struct trace_event_file *file)
@@ -1780,72 +1558,53 @@ struct filter_list {
	struct event_filter	*filter;
};

static int process_system_preds(struct trace_subsystem_dir *dir,
				struct trace_array *tr,
				struct filter_parse_error *pe,
				char *filter_string)
{
	struct trace_event_file *file;
	struct filter_list *filter_item;
	struct event_filter *filter = NULL;
	struct filter_list *tmp;
	LIST_HEAD(filter_list);
	bool fail = true;
	int err;

	list_for_each_entry(file, &tr->events, list) {
		if (file->system != dir)
			continue;

		filter = kzalloc(sizeof(*filter), GFP_KERNEL);
		if (!filter)
			goto fail_mem;

		filter->filter_string = kstrdup(filter_string, GFP_KERNEL);
		if (!filter->filter_string)
			goto fail_mem;

		err = process_preds(file->event_call, filter_string, filter, pe);
		if (err) {
			filter_disable(file);
			parse_error(pe, FILT_ERR_BAD_SUBSYS_FILTER, 0);
			append_filter_err(pe, filter);
		} else
			event_set_filtered_flag(file);

		filter_item = kzalloc(sizeof(*filter_item), GFP_KERNEL);
		if (!filter_item)
			goto fail_mem;
		list_add_tail(&filter_item->list, &filter_list);

		/*
		 * Regardless of if this returned an error, we still
		 * replace the filter for the call.
		 */
		filter_item->filter = event_filter(file);
		event_set_filter(file, filter);
		filter = NULL;

		fail = false;
	}
@@ -1871,9 +1630,10 @@ static int replace_system_preds(struct trace_subsystem_dir *dir,
		list_del(&filter_item->list);
		kfree(filter_item);
	}
	parse_error(pe, FILT_ERR_BAD_SUBSYS_FILTER, 0);
	return -EINVAL;
 fail_mem:
	kfree(filter);
	/* If any call succeeded, we still need to sync */
	if (!fail)
		synchronize_sched();
@@ -1885,47 +1645,42 @@ static int replace_system_preds(struct trace_subsystem_dir *dir,
	return -ENOMEM;
}
static int create_filter_start(char *filter_string, bool set_str,
			       struct filter_parse_error **pse,
			       struct event_filter **filterp)
{
	struct event_filter *filter;
	struct filter_parse_error *pe = NULL;
	int err = 0;

	if (WARN_ON_ONCE(*pse || *filterp))
		return -EINVAL;

	filter = kzalloc(sizeof(*filter), GFP_KERNEL);
	if (filter && set_str) {
		filter->filter_string = kstrdup(filter_string, GFP_KERNEL);
		if (!filter->filter_string)
			err = -ENOMEM;
	}

	pe = kzalloc(sizeof(*pe), GFP_KERNEL);

	if (!filter || !pe || err) {
		kfree(pe);
		__free_filter(filter);
		return -ENOMEM;
	}

	/* we're committed to creating a new filter */
	*filterp = filter;
	*pse = pe;

	return 0;
}

static void create_filter_finish(struct filter_parse_error *pe)
{
	kfree(pe);
}

/**
@@ -1945,24 +1700,20 @@ static void create_filter_finish(struct filter_parse_state *ps)
 * freeing it.
 */
static int create_filter(struct trace_event_call *call,
			 char *filter_string, bool set_str,
			 struct event_filter **filterp)
{
	struct filter_parse_error *pe = NULL;
	struct event_filter *filter = NULL;
	int err;

	err = create_filter_start(filter_string, set_str, &pe, &filter);
	if (err)
		return err;

	err = process_preds(call, filter_string, filter, pe);
	if (err && set_str)
		append_filter_err(pe, filter);

	*filterp = filter;

	return err;
@@ -1989,21 +1740,21 @@ static int create_system_filter(struct trace_subsystem_dir *dir,
				char *filter_str, struct event_filter **filterp)
{
	struct event_filter *filter = NULL;
	struct filter_parse_error *pe = NULL;
	int err;

	err = create_filter_start(filter_str, true, &pe, &filter);
	if (!err) {
		err = process_system_preds(dir, tr, pe, filter_str);
		if (!err) {
			/* System filters just show a default message */
			kfree(filter->filter_string);
			filter->filter_string = NULL;
		} else {
			append_filter_err(pe, filter);
		}
	}
	create_filter_finish(pe);

	*filterp = filter;

	return err;
@@ -2186,66 +1937,80 @@ static int __ftrace_function_set_filter(int filter, char *buf, int len,
	return ret;
}

static int ftrace_function_check_pred(struct filter_pred *pred)
{
	struct ftrace_event_field *field = pred->field;

	/*
	 * Check the predicate for function trace, verify:
	 *  - only '==' and '!=' is used
	 *  - the 'ip' field is used
	 */
	if ((pred->op != OP_EQ) && (pred->op != OP_NE))
		return -EINVAL;

	if (strcmp(field->name, "ip"))
		return -EINVAL;

	return 0;
}

static int ftrace_function_set_filter_pred(struct filter_pred *pred,
					   struct function_filter_data *data)
{
	int ret;

	/* Checking the node is valid for function trace. */
	ret = ftrace_function_check_pred(pred);
	if (ret)
		return ret;

	return __ftrace_function_set_filter(pred->op == OP_EQ,
					    pred->regex.pattern,
					    pred->regex.len,
					    data);
}

static bool is_or(struct prog_entry *prog, int i)
{
	int target;

	/*
	 * Only "||" is allowed for function events, thus,
	 * all true branches should jump to true, and any
	 * false branch should jump to false.
	 */
	target = prog[i].target + 1;
	/* True and false have NULL preds (all prog entries should jump to one */
	if (prog[target].pred)
		return false;

	/* prog[target].target is 1 for TRUE, 0 for FALSE */
	return prog[i].when_to_branch == prog[target].target;
}

static int ftrace_function_set_filter(struct perf_event *event,
				      struct event_filter *filter)
{
	struct prog_entry *prog = rcu_dereference_protected(filter->prog,
						lockdep_is_held(&event_mutex));
	struct function_filter_data data = {
		.first_filter = 1,
		.first_notrace = 1,
		.ops = &event->ftrace_ops,
	};
	int i;

	for (i = 0; prog[i].pred; i++) {
		struct filter_pred *pred = prog[i].pred;

		if (!is_or(prog, i))
			return -EINVAL;

		if (ftrace_function_set_filter_pred(pred, &data) < 0)
			return -EINVAL;
	}
	return 0;
}
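filter->prog now points at a flat array of prog_entry branch instructions rather than a predicate tree, and is_or() above accepts a program only if every taken branch lands directly on the TRUE or FALSE terminator entries. The sketch below is a self-contained userspace analogue of that layout and of the short-circuit walk; the struct name, jump convention and predicate functions are illustrative assumptions, not the kernel's generated program:

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative analogue of a flat branch program. */
struct entry {
	int target;		/* entry to jump to (the loop's i++ lands on target + 1) */
	int when_to_branch;	/* take the branch when the predicate returns this value  */
	bool (*pred)(int rec);	/* NULL marks the TRUE/FALSE terminator entries           */
};

static bool f1(int rec) { return rec == 1; }
static bool f2(int rec) { return rec == 2; }

/* "f1 || f2": predicates at 0..1, TRUE terminator at 2, FALSE terminator at 3 */
static struct entry prog[] = {
	{ .pred = f1, .when_to_branch = 1, .target = 1 },	/* true  -> TRUE  */
	{ .pred = f2, .when_to_branch = 0, .target = 2 },	/* false -> FALSE */
	{ .pred = NULL, .target = 1 },				/* TRUE           */
	{ .pred = NULL, .target = 0 },				/* FALSE          */
};

static int run(struct entry *prog, int rec)
{
	int i;

	for (i = 0; prog[i].pred; i++) {
		int match = prog[i].pred(rec);

		if (match == prog[i].when_to_branch)
			i = prog[i].target;	/* short-circuit jump */
	}
	return prog[i].target;	/* terminator entries encode the result: 1 or 0 */
}

int main(void)
{
	printf("rec=2 -> %d, rec=5 -> %d\n", run(prog, 2), run(prog, 5));	/* 1, 0 */
	return 0;
}
```

Note how both entries satisfy the is_or()-style check: the entry after each jump target has a NULL predicate, and its encoded result matches the entry's branch condition.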
#else
static int ftrace_function_set_filter(struct perf_event *event,
@@ -2388,26 +2153,28 @@ static int test_pred_visited_fn(struct filter_pred *pred, void *event)
	return 1;
}

static void update_pred_fn(struct event_filter *filter, char *fields)
{
	struct prog_entry *prog = rcu_dereference_protected(filter->prog,
						lockdep_is_held(&event_mutex));
	int i;

	for (i = 0; prog[i].pred; i++) {
		struct filter_pred *pred = prog[i].pred;
		struct ftrace_event_field *field = pred->field;

		WARN_ON_ONCE(!pred->fn);

		if (!field) {
			WARN_ONCE(1, "all leafs should have field defined %d", i);
			continue;
		}

		if (!strchr(fields, *field->name))
			continue;

		pred->fn = test_pred_visited_fn;
	}
}
static __init int ftrace_test_event_filter(void)
@@ -2431,20 +2198,22 @@ static __init int ftrace_test_event_filter(void)
			break;
		}

		/* Needed to dereference filter->prog */
		mutex_lock(&event_mutex);
		/*
		 * The preemption disabling is not really needed for self
		 * tests, but the rcu dereference will complain without it.
		 */
		preempt_disable();
		if (*d->not_visited)
			update_pred_fn(filter, d->not_visited);

		test_pred_visited = 0;
		err = filter_match_preds(filter, &d->rec);
		preempt_enable();
		mutex_unlock(&event_mutex);

		__free_filter(filter);

		if (test_pred_visited) {
...
@@ -63,7 +63,8 @@ void trigger_data_free(struct event_trigger_data *data)
 * any trigger that should be deferred, ETT_NONE if nothing to defer.
 */
enum event_trigger_type
event_triggers_call(struct trace_event_file *file, void *rec,
		    struct ring_buffer_event *event)
{
	struct event_trigger_data *data;
	enum event_trigger_type tt = ETT_NONE;
@@ -76,7 +77,7 @@ event_triggers_call(struct trace_event_file *file, void *rec)
		if (data->paused)
			continue;
		if (!rec) {
			data->ops->func(data, rec, event);
			continue;
		}
		filter = rcu_dereference_sched(data->filter);
@@ -86,7 +87,7 @@ event_triggers_call(struct trace_event_file *file, void *rec)
			tt |= data->cmd_ops->trigger_type;
			continue;
		}
		data->ops->func(data, rec, event);
	}
	return tt;
}
@@ -108,7 +109,7 @@ EXPORT_SYMBOL_GPL(event_triggers_call);
void
event_triggers_post_call(struct trace_event_file *file,
			 enum event_trigger_type tt,
			 void *rec, struct ring_buffer_event *event)
{
	struct event_trigger_data *data;
@@ -116,7 +117,7 @@ event_triggers_post_call(struct trace_event_file *file,
		if (data->paused)
			continue;
		if (data->cmd_ops->trigger_type & tt)
			data->ops->func(data, rec, event);
	}
}
EXPORT_SYMBOL_GPL(event_triggers_post_call);
@@ -908,8 +909,15 @@ void set_named_trigger_data(struct event_trigger_data *data,
	data->named_data = named_data;
}

struct event_trigger_data *
get_named_trigger_data(struct event_trigger_data *data)
{
	return data->named_data;
}

static void
traceon_trigger(struct event_trigger_data *data, void *rec,
		struct ring_buffer_event *event)
{
	if (tracing_is_on())
		return;
@@ -918,7 +926,8 @@ traceon_trigger(struct event_trigger_data *data, void *rec)
}

static void
traceon_count_trigger(struct event_trigger_data *data, void *rec,
		      struct ring_buffer_event *event)
{
	if (tracing_is_on())
		return;
@@ -933,7 +942,8 @@ traceon_count_trigger(struct event_trigger_data *data, void *rec)
}

static void
traceoff_trigger(struct event_trigger_data *data, void *rec,
		 struct ring_buffer_event *event)
{
	if (!tracing_is_on())
		return;
@@ -942,7 +952,8 @@ traceoff_trigger(struct event_trigger_data *data, void *rec)
}

static void
traceoff_count_trigger(struct event_trigger_data *data, void *rec,
			struct ring_buffer_event *event)
{
	if (!tracing_is_on())
		return;
@@ -1039,13 +1050,15 @@ static struct event_command trigger_traceoff_cmd = {
#ifdef CONFIG_TRACER_SNAPSHOT
static void
snapshot_trigger(struct event_trigger_data *data, void *rec,
		 struct ring_buffer_event *event)
{
	tracing_snapshot();
}

static void
snapshot_count_trigger(struct event_trigger_data *data, void *rec,
			struct ring_buffer_event *event)
{
	if (!data->count)
		return;
@@ -1053,7 +1066,7 @@ snapshot_count_trigger(struct event_trigger_data *data, void *rec)
	if (data->count != -1)
		(data->count)--;

	snapshot_trigger(data, rec, event);
}

static int
@@ -1141,13 +1154,15 @@ static __init int register_trigger_snapshot_cmd(void) { return 0; }
#endif

static void
stacktrace_trigger(struct event_trigger_data *data, void *rec,
		   struct ring_buffer_event *event)
{
	trace_dump_stack(STACK_SKIP);
}

static void
stacktrace_count_trigger(struct event_trigger_data *data, void *rec,
			 struct ring_buffer_event *event)
{
	if (!data->count)
		return;
@@ -1155,7 +1170,7 @@ stacktrace_count_trigger(struct event_trigger_data *data, void *rec)
	if (data->count != -1)
		(data->count)--;

	stacktrace_trigger(data, rec, event);
}

static int
@@ -1217,7 +1232,8 @@ static __init void unregister_trigger_traceon_traceoff_cmds(void)
}

static void
event_enable_trigger(struct event_trigger_data *data, void *rec,
		     struct ring_buffer_event *event)
{
	struct enable_trigger_data *enable_data = data->private_data;
@@ -1228,7 +1244,8 @@ event_enable_trigger(struct event_trigger_data *data, void *rec)
}

static void
event_enable_count_trigger(struct event_trigger_data *data, void *rec,
			   struct ring_buffer_event *event)
{
	struct enable_trigger_data *enable_data = data->private_data;
@@ -1242,7 +1259,7 @@ event_enable_count_trigger(struct event_trigger_data *data, void *rec)
	if (data->count != -1)
		(data->count)--;

	event_enable_trigger(data, rec, event);
}

int event_enable_trigger_print(struct seq_file *m,
...
@@ -66,6 +66,73 @@ u64 tracing_map_read_sum(struct tracing_map_elt *elt, unsigned int i)
return (u64)atomic64_read(&elt->fields[i].sum);
}
/**
* tracing_map_set_var - Assign a tracing_map_elt's variable field
* @elt: The tracing_map_elt
* @i: The index of the given variable associated with the tracing_map_elt
* @n: The value to assign
*
* Assign n to variable i associated with the specified tracing_map_elt
* instance. The index i is the index returned by the call to
* tracing_map_add_var() when the tracing map was set up.
*/
void tracing_map_set_var(struct tracing_map_elt *elt, unsigned int i, u64 n)
{
atomic64_set(&elt->vars[i], n);
elt->var_set[i] = true;
}
/**
* tracing_map_var_set - Return whether or not a variable has been set
* @elt: The tracing_map_elt
* @i: The index of the given variable associated with the tracing_map_elt
*
* Return true if the variable has been set, false otherwise. The
* index i is the index returned by the call to tracing_map_add_var()
* when the tracing map was set up.
*/
bool tracing_map_var_set(struct tracing_map_elt *elt, unsigned int i)
{
return elt->var_set[i];
}
/**
* tracing_map_read_var - Return the value of a tracing_map_elt's variable field
* @elt: The tracing_map_elt
* @i: The index of the given variable associated with the tracing_map_elt
*
* Retrieve the value of the variable i associated with the specified
* tracing_map_elt instance. The index i is the index returned by the
* call to tracing_map_add_var() when the tracing map was set
* up.
*
* Return: The variable value associated with field i for elt.
*/
u64 tracing_map_read_var(struct tracing_map_elt *elt, unsigned int i)
{
return (u64)atomic64_read(&elt->vars[i]);
}
/**
* tracing_map_read_var_once - Return and reset a tracing_map_elt's variable field
* @elt: The tracing_map_elt
* @i: The index of the given variable associated with the tracing_map_elt
*
* Retrieve the value of the variable i associated with the specified
* tracing_map_elt instance, and reset the variable to the 'not set'
* state. The index i is the index returned by the call to
* tracing_map_add_var() when the tracing map was set up. The reset
* essentially makes the variable a read-once variable if it's only
* accessed using this function.
*
* Return: The variable value associated with field i for elt.
*/
u64 tracing_map_read_var_once(struct tracing_map_elt *elt, unsigned int i)
{
elt->var_set[i] = false;
return (u64)atomic64_read(&elt->vars[i]);
}
int tracing_map_cmp_string(void *val_a, void *val_b)
{
char *a = val_a;
@@ -170,6 +237,28 @@ int tracing_map_add_sum_field(struct tracing_map *map)
return tracing_map_add_field(map, tracing_map_cmp_atomic64);
}
/**
* tracing_map_add_var - Add a field describing a tracing_map var
* @map: The tracing_map
*
* Add a var to the map and return the index identifying it in the map
* and associated tracing_map_elts. This is the index used, for
* instance, to set a var for a particular tracing_map_elt using
* tracing_map_set_var() or to read it via tracing_map_read_var().
*
* Return: The index identifying the var in the map and associated
* tracing_map_elts, or -EINVAL on error.
*/
int tracing_map_add_var(struct tracing_map *map)
{
int ret = -EINVAL;
if (map->n_vars < TRACING_MAP_VARS_MAX)
ret = map->n_vars++;
return ret;
}
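Taken together with the accessors documented earlier, the intended pattern mirrors the existing sum fields: reserve a variable slot when the map is created, then set and read it per element at event time. A minimal kernel-side sketch, assuming a map and elements obtained elsewhere (the example_* wrappers are hypothetical):

```c
/* Sketch only; assumes the tracing_map API from kernel/trace/tracing_map.h. */
static int ts_idx;

/* At map-setup time: reserve a per-element variable slot. */
static int example_setup(struct tracing_map *map)
{
	ts_idx = tracing_map_add_var(map);
	if (ts_idx < 0)
		return ts_idx;	/* -EINVAL: TRACING_MAP_VARS_MAX exceeded */
	return 0;
}

/* At event time: stash a timestamp in one event's handler ... */
static void example_save_ts(struct tracing_map_elt *elt, u64 now)
{
	tracing_map_set_var(elt, ts_idx, now);
}

/* ... and consume it exactly once in another event's handler. */
static u64 example_take_delta(struct tracing_map_elt *elt, u64 now)
{
	if (!tracing_map_var_set(elt, ts_idx))
		return 0;	/* nothing stored yet */

	return now - tracing_map_read_var_once(elt, ts_idx);
}
```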
/**
* tracing_map_add_key_field - Add a field describing a tracing_map key
* @map: The tracing_map
@@ -280,6 +369,11 @@ static void tracing_map_elt_clear(struct tracing_map_elt *elt)
if (elt->fields[i].cmp_fn == tracing_map_cmp_atomic64)
atomic64_set(&elt->fields[i].sum, 0);
for (i = 0; i < elt->map->n_vars; i++) {
atomic64_set(&elt->vars[i], 0);
elt->var_set[i] = false;
}
if (elt->map->ops && elt->map->ops->elt_clear)
elt->map->ops->elt_clear(elt);
}
@@ -306,6 +400,8 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
if (elt->map->ops && elt->map->ops->elt_free)
elt->map->ops->elt_free(elt);
kfree(elt->fields);
kfree(elt->vars);
kfree(elt->var_set);
kfree(elt->key);
kfree(elt);
}
@@ -333,6 +429,18 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
goto free;
}
elt->vars = kcalloc(map->n_vars, sizeof(*elt->vars), GFP_KERNEL);
if (!elt->vars) {
err = -ENOMEM;
goto free;
}
elt->var_set = kcalloc(map->n_vars, sizeof(*elt->var_set), GFP_KERNEL);
if (!elt->var_set) {
err = -ENOMEM;
goto free;
}
tracing_map_elt_init_fields(elt);
if (map->ops && map->ops->elt_alloc) {
@@ -414,7 +522,9 @@ static inline struct tracing_map_elt *
__tracing_map_insert(struct tracing_map *map, void *key, bool lookup_only)
{
u32 idx, key_hash, test_key;
int dup_try = 0;
struct tracing_map_entry *entry;
struct tracing_map_elt *val;
key_hash = jhash(key, map->key_size, 0);
if (key_hash == 0)
@@ -426,11 +536,33 @@ __tracing_map_insert(struct tracing_map *map, void *key, bool lookup_only)
entry = TRACING_MAP_ENTRY(map->map, idx);
test_key = entry->key;
if (test_key && test_key == key_hash) {
val = READ_ONCE(entry->val);
if (val &&
keys_match(key, val->key, map->key_size)) {
if (!lookup_only)
atomic64_inc(&map->hits);
return val;
} else if (unlikely(!val)) {
/*
* The key is present. But, val (pointer to elt
* struct) is still NULL. which means some other
* thread is in the process of inserting an
* element.
*
* On top of that, its key_hash is the same as the
* one being inserted right now. So, it's
* possible that the element has the same
* key as well.
*/
dup_try++;
if (dup_try > map->map_size) {
atomic64_inc(&map->drops);
break;
}
continue;
}
} }
if (!test_key) { if (!test_key) {
...@@ -452,6 +584,13 @@ __tracing_map_insert(struct tracing_map *map, void *key, bool lookup_only) ...@@ -452,6 +584,13 @@ __tracing_map_insert(struct tracing_map *map, void *key, bool lookup_only)
atomic64_inc(&map->hits); atomic64_inc(&map->hits);
return entry->val; return entry->val;
} else {
/*
* cmpxchg() failed. Loop around once
* more to check what key was inserted.
*/
dup_try++;
continue;
} }
}
@@ -816,67 +955,15 @@ create_sort_entry(void *key, struct tracing_map_elt *elt)
return sort_entry;
}
static void detect_dups(struct tracing_map_sort_entry **sort_entries,
		      int n_entries, unsigned int key_size)
{
	unsigned int dups = 0, total_dups = 0;
	int i;
	void *key;

	if (n_entries < 2)
		return;

	sort(sort_entries, n_entries, sizeof(struct tracing_map_sort_entry *),
	     (int (*)(const void *, const void *))cmp_entries_dup, NULL);

@@ -885,30 +972,14 @@ static int merge_dups(struct tracing_map_sort_entry **sort_entries,
	for (i = 1; i < n_entries; i++) {
		if (!memcmp(sort_entries[i]->key, key, key_size)) {
			dups++; total_dups++;
			continue;
		}
		key = sort_entries[i]->key;
		dups = 0;
	}

	WARN_ONCE(total_dups > 0,
		  "Duplicates detected: %d\n", total_dups);
}
static bool is_key(struct tracing_map *map, unsigned int field_idx)
@@ -1034,10 +1105,7 @@ int tracing_map_sort_entries(struct tracing_map *map,
		return 1;
	}

	detect_dups(entries, n_entries, map->key_size);

	if (is_key(map, sort_keys[0].field_idx))
		cmp_entries_fn = cmp_entries_key;
...
@@ -10,6 +10,7 @@
#define TRACING_MAP_VALS_MAX		3
#define TRACING_MAP_FIELDS_MAX		(TRACING_MAP_KEYS_MAX + \
					 TRACING_MAP_VALS_MAX)
#define TRACING_MAP_VARS_MAX		16
#define TRACING_MAP_SORT_KEYS_MAX	2

typedef int (*tracing_map_cmp_fn_t) (void *val_a, void *val_b);
@@ -137,6 +138,8 @@ struct tracing_map_field {
struct tracing_map_elt {
	struct tracing_map *map;
	struct tracing_map_field *fields;
	atomic64_t *vars;
	bool *var_set;
	void *key;
	void *private_data;
};
@@ -192,6 +195,7 @@ struct tracing_map {
	int key_idx[TRACING_MAP_KEYS_MAX];
	unsigned int n_keys;
	struct tracing_map_sort_key sort_key;
	unsigned int n_vars;
	atomic64_t hits;
	atomic64_t drops;
};
@@ -215,11 +219,6 @@ struct tracing_map {
 * Element allocation occurs before tracing begins, when the
 * tracing_map_init() call is made by client code.
 *
 * @elt_free: When a tracing_map_elt is freed, this function is called
 * and allows client-allocated per-element data to be freed.
 *
@@ -233,8 +232,6 @@ struct tracing_map {
 */
struct tracing_map_ops {
	int (*elt_alloc)(struct tracing_map_elt *elt);
	void (*elt_free)(struct tracing_map_elt *elt);
	void (*elt_clear)(struct tracing_map_elt *elt);
	void (*elt_init)(struct tracing_map_elt *elt);
@@ -248,6 +245,7 @@ tracing_map_create(unsigned int map_bits,
extern int tracing_map_init(struct tracing_map *map);
extern int tracing_map_add_sum_field(struct tracing_map *map);
extern int tracing_map_add_var(struct tracing_map *map);
extern int tracing_map_add_key_field(struct tracing_map *map,
				     unsigned int offset,
				     tracing_map_cmp_fn_t cmp_fn);
@@ -267,7 +265,13 @@ extern int tracing_map_cmp_none(void *val_a, void *val_b);
extern void tracing_map_update_sum(struct tracing_map_elt *elt,
				   unsigned int i, u64 n);
extern void tracing_map_set_var(struct tracing_map_elt *elt,
				unsigned int i, u64 n);
extern bool tracing_map_var_set(struct tracing_map_elt *elt, unsigned int i);
extern u64 tracing_map_read_sum(struct tracing_map_elt *elt, unsigned int i);
extern u64 tracing_map_read_var(struct tracing_map_elt *elt, unsigned int i);
extern u64 tracing_map_read_var_once(struct tracing_map_elt *elt, unsigned int i);
extern void tracing_map_set_field_descr(struct tracing_map *map,
					unsigned int i,
					unsigned int key_offset,
...
@@ -2591,6 +2591,8 @@ int vbin_printf(u32 *bin_buf, size_t size, const char *fmt, va_list args)
		case 's':
		case 'F':
		case 'f':
		case 'x':
		case 'K':
			save_arg(void *);
			break;
		default:
@@ -2765,6 +2767,8 @@ int bstr_printf(char *buf, size_t size, const char *fmt, const u32 *bin_buf)
		case 's':
		case 'F':
		case 'f':
		case 'x':
		case 'K':
			process = true;
			break;
		default:
...
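These two case labels matter for trace_printk(), which stores its arguments with vbin_printf() and only renders them via bstr_printf() when the trace file is read; adding 'x' and 'K' keeps %px and %pK arguments stored as raw pointer values instead of being pre-processed at save time. A hypothetical call that relies on this (buf is an illustrative pointer, not from the source):

```c
/* Sketch: the pointer value itself is saved into the ring buffer; %px prints
 * it unmodified and %pK applies kptr_restrict at the time the trace is read. */
trace_printk("mapped buffer at %px (restricted: %pK)\n", buf, buf);
```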
@@ -30,6 +30,8 @@
#include <linux/string.h>
#include <net/flow.h>

#include <trace/events/initcall.h>

#define MAX_LSM_EVM_XATTR	2

/* Maximum number of letters for an LSM name string */
@@ -45,10 +47,14 @@ static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =

static void __init do_security_initcalls(void)
{
	int ret;
	initcall_t *call;
	call = __security_initcall_start;
	trace_initcall_level("security");
	while (call < __security_initcall_end) {
		trace_initcall_start((*call));
		ret = (*call) ();
		trace_initcall_finish((*call), ret);
		call++;
	}
}
...
@@ -59,6 +59,13 @@ disable_events() {
echo 0 > events/enable
}
clear_synthetic_events() { # reset all current synthetic events
grep -v ^# synthetic_events |
while read line; do
echo "!$line" >> synthetic_events
done
}
initialize_ftrace() { # Reset ftrace to initial-state
# As the initial state, ftrace will be set to nop tracer,
# no events, no triggers, no filters, no function filters,
...
#!/bin/sh
# description: event trigger - test extended error support
do_reset() {
reset_trigger
echo > set_event
clear_trace
}
fail() { #msg
do_reset
echo $1
exit_fail
}
if [ ! -f set_event ]; then
echo "event tracing is not supported"
exit_unsupported
fi
if [ ! -f synthetic_events ]; then
echo "synthetic event is not supported"
exit_unsupported
fi
reset_tracer
do_reset
echo "Test extended error support"
echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="ping"' > events/sched/sched_wakeup/trigger
echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="ping"' >> events/sched/sched_wakeup/trigger &>/dev/null
if ! grep -q "ERROR:" events/sched/sched_wakeup/hist; then
fail "Failed to generate extended error in histogram"
fi
do_reset
exit 0
#!/bin/sh
# description: event trigger - test field variable support
do_reset() {
reset_trigger
echo > set_event
clear_trace
}
fail() { #msg
do_reset
echo $1
exit_fail
}
if [ ! -f set_event ]; then
echo "event tracing is not supported"
exit_unsupported
fi
if [ ! -f synthetic_events ]; then
echo "synthetic event is not supported"
exit_unsupported
fi
clear_synthetic_events
reset_tracer
do_reset
echo "Test field variable support"
echo 'wakeup_latency u64 lat; pid_t pid; int prio; char comm[16]' > synthetic_events
echo 'hist:keys=comm:ts0=common_timestamp.usecs if comm=="ping"' > events/sched/sched_waking/trigger
echo 'hist:keys=next_comm:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,sched.sched_waking.prio,next_comm) if next_comm=="ping"' > events/sched/sched_switch/trigger
echo 'hist:keys=pid,prio,comm:vals=lat:sort=pid,prio' > events/synthetic/wakeup_latency/trigger
ping localhost -c 3
if ! grep -q "ping" events/synthetic/wakeup_latency/hist; then
fail "Failed to create inter-event histogram"
fi
if ! grep -q "synthetic_prio=prio" events/sched/sched_waking/hist; then
fail "Failed to create histogram with field variable"
fi
echo '!hist:keys=next_comm:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,sched.sched_waking.prio,next_comm) if next_comm=="ping"' >> events/sched/sched_switch/trigger
if grep -q "synthetic_prio=prio" events/sched/sched_waking/hist; then
fail "Failed to remove histogram with field variable"
fi
do_reset
exit 0
#!/bin/sh
# description: event trigger - test inter-event combined histogram trigger
do_reset() {
reset_trigger
echo > set_event
clear_trace
}
fail() { #msg
do_reset
echo $1
exit_fail
}
if [ ! -f set_event ]; then
echo "event tracing is not supported"
exit_unsupported
fi
if [ ! -f synthetic_events ]; then
echo "synthetic event is not supported"
exit_unsupported
fi
reset_tracer
do_reset
clear_synthetic_events
echo "Test create synthetic event"
echo 'waking_latency u64 lat pid_t pid' > synthetic_events
if [ ! -d events/synthetic/waking_latency ]; then
fail "Failed to create waking_latency synthetic event"
fi
echo "Test combined histogram"
echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="ping"' > events/sched/sched_waking/trigger
echo 'hist:keys=pid:waking_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).waking_latency($waking_lat,pid) if comm=="ping"' > events/sched/sched_wakeup/trigger
echo 'hist:keys=pid,lat:sort=pid,lat' > events/synthetic/waking_latency/trigger
echo 'wakeup_latency u64 lat pid_t pid' >> synthetic_events
echo 'hist:keys=pid:ts1=common_timestamp.usecs if comm=="ping"' >> events/sched/sched_wakeup/trigger
echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts1:onmatch(sched.sched_wakeup).wakeup_latency($wakeup_lat,next_pid) if next_comm=="ping"' > events/sched/sched_switch/trigger
echo 'waking+wakeup_latency u64 lat; pid_t pid' >> synthetic_events
echo 'hist:keys=pid,lat:sort=pid,lat:ww_lat=$waking_lat+$wakeup_lat:onmatch(synthetic.wakeup_latency).waking+wakeup_latency($ww_lat,pid)' >> events/synthetic/wakeup_latency/trigger
echo 'hist:keys=pid,lat:sort=pid,lat' >> events/synthetic/waking+wakeup_latency/trigger
ping localhost -c 3
if ! grep -q "pid:" events/synthetic/waking+wakeup_latency/hist; then
fail "Failed to create combined histogram"
fi
do_reset
exit 0
#!/bin/sh
# description: event trigger - test inter-event histogram trigger onmatch action
do_reset() {
reset_trigger
echo > set_event
clear_trace
}
fail() { #msg
do_reset
echo $1
exit_fail
}
if [ ! -f set_event ]; then
echo "event tracing is not supported"
exit_unsupported
fi
if [ ! -f synthetic_events ]; then
echo "synthetic event is not supported"
exit_unsupported
fi
clear_synthetic_events
reset_tracer
do_reset
echo "Test create synthetic event"
echo 'wakeup_latency u64 lat pid_t pid char comm[16]' > synthetic_events
if [ ! -d events/synthetic/wakeup_latency ]; then
fail "Failed to create wakeup_latency synthetic event"
fi
echo "Test create histogram for synthetic event"
echo "Test histogram variables,simple expression support and onmatch action"
echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="ping"' > events/sched/sched_wakeup/trigger
echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_wakeup).wakeup_latency($wakeup_lat,next_pid,next_comm) if next_comm=="ping"' > events/sched/sched_switch/trigger
echo 'hist:keys=comm,pid,lat:wakeup_lat=lat:sort=lat' > events/synthetic/wakeup_latency/trigger
ping localhost -c 5
if ! grep -q "ping" events/synthetic/wakeup_latency/hist; then
fail "Failed to create onmatch action inter-event histogram"
fi
do_reset
exit 0
#!/bin/sh
# description: event trigger - test inter-event histogram trigger onmatch-onmax action
do_reset() {
reset_trigger
echo > set_event
clear_trace
}
fail() { #msg
do_reset
echo $1
exit_fail
}
if [ ! -f set_event ]; then
echo "event tracing is not supported"
exit_unsupported
fi
if [ ! -f synthetic_events ]; then
echo "synthetic event is not supported"
exit_unsupported
fi
clear_synthetic_events
reset_tracer
do_reset
echo "Test create synthetic event"
echo 'wakeup_latency u64 lat pid_t pid char comm[16]' > synthetic_events
if [ ! -d events/synthetic/wakeup_latency ]; then
fail "Failed to create wakeup_latency synthetic event"
fi
echo "Test create histogram for synthetic event"
echo "Test histogram variables,simple expression support and onmatch-onmax action"
echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="ping"' > events/sched/sched_wakeup/trigger
echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_wakeup).wakeup_latency($wakeup_lat,next_pid,next_comm):onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) if next_comm=="ping"' >> events/sched/sched_switch/trigger
echo 'hist:keys=comm,pid,lat:wakeup_lat=lat:sort=lat' > events/synthetic/wakeup_latency/trigger
ping localhost -c 5
if ! grep -q "ping" events/synthetic/wakeup_latency/hist || ! grep -q "max:" events/sched/sched_switch/hist; then
fail "Failed to create onmatch-onmax action inter-event histogram"
fi
do_reset
exit 0
#!/bin/sh
# description: event trigger - test inter-event histogram trigger onmax action
do_reset() {
reset_trigger
echo > set_event
clear_trace
}
fail() { #msg
do_reset
echo $1
exit_fail
}
if [ ! -f set_event ]; then
echo "event tracing is not supported"
exit_unsupported
fi
if [ ! -f synthetic_events ]; then
echo "synthetic event is not supported"
exit_unsupported
fi
clear_synthetic_events
reset_tracer
do_reset
echo "Test create synthetic event"
echo 'wakeup_latency u64 lat pid_t pid char comm[16]' > synthetic_events
if [ ! -d events/synthetic/wakeup_latency ]; then
fail "Failed to create wakeup_latency synthetic event"
fi
echo "Test onmax action"
echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="ping"' >> events/sched/sched_waking/trigger
echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) if next_comm=="ping"' >> events/sched/sched_switch/trigger
ping localhost -c 3
if ! grep -q "max:" events/sched/sched_switch/hist; then
fail "Failed to create onmax action inter-event histogram"
fi
do_reset
exit 0
#!/bin/sh
# description: event trigger - test synthetic event create remove
do_reset() {
reset_trigger
echo > set_event
clear_trace
}
fail() { #msg
do_reset
echo $1
exit_fail
}
if [ ! -f set_event ]; then
echo "event tracing is not supported"
exit_unsupported
fi
if [ ! -f synthetic_events ]; then
echo "synthetic event is not supported"
exit_unsupported
fi
clear_synthetic_events
reset_tracer
do_reset
echo "Test create synthetic event"
echo 'wakeup_latency u64 lat pid_t pid char comm[16]' > synthetic_events
if [ ! -d events/synthetic/wakeup_latency ]; then
fail "Failed to create wakeup_latency synthetic event"
fi
reset_trigger
echo "Test create synthetic event with an error"
echo 'wakeup_latency u64 lat pid_t pid char' > synthetic_events > /dev/null
if [ -d events/synthetic/wakeup_latency ]; then
fail "Created wakeup_latency synthetic event with an invalid format"
fi
reset_trigger
echo "Test remove synthetic event"
echo '!wakeup_latency u64 lat pid_t pid char comm[16]' > synthetic_events
if [ -d events/synthetic/wakeup_latency ]; then
fail "Failed to delete wakeup_latency synthetic event"
fi
do_reset
exit 0