1. 15 Feb, 2024 2 commits
  2. 14 Feb, 2024 1 commit
    • Steven Rostedt (Google)'s avatar
      tracing: Inform kmemleak of saved_cmdlines allocation · 2394ac41
      Steven Rostedt (Google) authored
      The allocation of the struct saved_cmdlines_buffer structure changed from:
      
              s = kmalloc(sizeof(*s), GFP_KERNEL);
      	s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
      
      to:
      
      	orig_size = sizeof(*s) + val * TASK_COMM_LEN;
      	order = get_order(orig_size);
      	size = 1 << (order + PAGE_SHIFT);
      	page = alloc_pages(GFP_KERNEL, order);
      	if (!page)
      		return NULL;
      
      	s = page_address(page);
      	memset(s, 0, sizeof(*s));
      
      	s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
      
      Where that s->saved_cmdlines allocation looks to be a dangling allocation
      to kmemleak. That's because kmemleak only keeps track of kmalloc()
      allocations. For allocations that use page_alloc() directly, the kmemleak
      needs to be explicitly informed about it.
      
      Add kmemleak_alloc() and kmemleak_free() around the page allocation so
      that it doesn't give the following false positive:
      
      unreferenced object 0xffff8881010c8000 (size 32760):
        comm "swapper", pid 0, jiffies 4294667296
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
        backtrace (crc ae6ec1b9):
          [<ffffffff86722405>] kmemleak_alloc+0x45/0x80
          [<ffffffff8414028d>] __kmalloc_large_node+0x10d/0x190
          [<ffffffff84146ab1>] __kmalloc+0x3b1/0x4c0
          [<ffffffff83ed7103>] allocate_cmdlines_buffer+0x113/0x230
          [<ffffffff88649c34>] tracer_alloc_buffers.isra.0+0x124/0x460
          [<ffffffff8864a174>] early_trace_init+0x14/0xa0
          [<ffffffff885dd5ae>] start_kernel+0x12e/0x3c0
          [<ffffffff885f5758>] x86_64_start_reservations+0x18/0x30
          [<ffffffff885f582b>] x86_64_start_kernel+0x7b/0x80
          [<ffffffff83a001c3>] secondary_startup_64_no_verify+0x15e/0x16b
      
      Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r.fsf@kernel.org/
      Link: https://lore.kernel.org/linux-trace-kernel/20240214112046.09a322d6@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Fixes: 44dc5c41 ("tracing: Fix wasted memory in saved_cmdlines logic")
      Reported-by: default avatarKalle Valo <kvalo@kernel.org>
      Tested-by: default avatarKalle Valo <kvalo@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      2394ac41
  3. 13 Feb, 2024 2 commits
  4. 09 Feb, 2024 2 commits
    • Steven Rostedt (Google)'s avatar
      tracing: Fix wasted memory in saved_cmdlines logic · 44dc5c41
      Steven Rostedt (Google) authored
      While looking at improving the saved_cmdlines cache I found a huge amount
      of wasted memory that should be used for the cmdlines.
      
      The tracing data saves pids during the trace. At sched switch, if a trace
      occurred, it will save the comm of the task that did the trace. This is
      saved in a "cache" that maps pids to comms and exposed to user space via
      the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by
      default 128 comms.
      
      The structure that uses this creates an array to store the pids using
      PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure
      to be of the size of 131104 bytes on 64 bit machines.
      
      In hex: 131104 = 0x20020, and since the kernel allocates generic memory in
      powers of two, the kernel would allocate 0x40000 or 262144 bytes to store
      this structure. That leaves 131040 bytes of wasted space.
      
      Worse, the structure points to an allocated array to store the comm names,
      which is 16 bytes times the amount of names to save (currently 128), which
      is 2048 bytes. Instead of allocating a separate array, make the structure
      end with a variable length string and use the extra space for that.
      
      This is similar to a recommendation that Linus had made about eventfs_inode names:
      
        https://lore.kernel.org/all/20240130190355.11486-5-torvalds@linux-foundation.org/
      
      Instead of allocating a separate string array to hold the saved comms,
      have the structure end with: char saved_cmdlines[]; and round up to the
      next power of two over sizeof(struct saved_cmdline_buffers) + num_cmdlines * TASK_COMM_LEN
      It will use this extra space for the saved_cmdline portion.
      
      Now, instead of saving only 128 comms by default, by using this wasted
      space at the end of the structure it can save over 8000 comms and even
      saves space by removing the need for allocating the other array.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240209063622.1f7b6d5f@rorschach.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Mete Durlu <meted@linux.ibm.com>
      Fixes: 939c7a4f ("tracing: Introduce saved_cmdlines_size file")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      44dc5c41
    • Masami Hiramatsu (Google)'s avatar
      ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default · a8b9cf62
      Masami Hiramatsu (Google) authored
      The commit 60c89718 ("ftrace: Make DIRECT_CALLS work WITH_ARGS
      and !WITH_REGS") changed DIRECT_CALLS to use SAVE_ARGS when there
      are multiple ftrace_ops at the same function, but since the x86 only
      support to jump to direct_call from ftrace_regs_caller, when we set
      the function tracer on the same target function on x86, ftrace-direct
      does not work as below (this actually works on arm64.)
      
      At first, insmod ftrace-direct.ko to put a direct_call on
      'wake_up_process()'.
      
       # insmod kernel/samples/ftrace/ftrace-direct.ko
       # less trace
      ...
                <idle>-0       [006] ..s1.   564.686958: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [007] ..s1.   564.687836: my_direct_func: waking up kcompactd0-63
                <idle>-0       [006] ..s1.   564.690926: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [006] ..s1.   564.696872: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [007] ..s1.   565.191982: my_direct_func: waking up kcompactd0-63
      
      Setup a function filter to the 'wake_up_process' too, and enable it.
      
       # cd /sys/kernel/tracing/
       # echo wake_up_process > set_ftrace_filter
       # echo function > current_tracer
       # less trace
      ...
                <idle>-0       [006] ..s3.   686.180972: wake_up_process <-call_timer_fn
                <idle>-0       [006] ..s3.   686.186919: wake_up_process <-call_timer_fn
                <idle>-0       [002] ..s3.   686.264049: wake_up_process <-call_timer_fn
                <idle>-0       [002] d.h6.   686.515216: wake_up_process <-kick_pool
                <idle>-0       [002] d.h6.   686.691386: wake_up_process <-kick_pool
      
      Then, only function tracer is shown on x86.
      But if you enable 'kprobe on ftrace' event (which uses SAVE_REGS flag)
      on the same function, it is shown again.
      
       # echo 'p wake_up_process' >> dynamic_events
       # echo 1 > events/kprobes/p_wake_up_process_0/enable
       # echo > trace
       # less trace
      ...
                <idle>-0       [006] ..s2.  2710.345919: p_wake_up_process_0: (wake_up_process+0x4/0x20)
                <idle>-0       [006] ..s3.  2710.345923: wake_up_process <-call_timer_fn
                <idle>-0       [006] ..s1.  2710.345928: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [006] ..s2.  2710.349931: p_wake_up_process_0: (wake_up_process+0x4/0x20)
                <idle>-0       [006] ..s3.  2710.349934: wake_up_process <-call_timer_fn
                <idle>-0       [006] ..s1.  2710.349937: my_direct_func: waking up rcu_preempt-17
      
      To fix this issue, use SAVE_REGS flag for multiple ftrace_ops flag of
      direct_call by default.
      
      Link: https://lore.kernel.org/linux-trace-kernel/170484558617.178953.1590516949390270842.stgit@devnote2
      
      Fixes: 60c89718 ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS")
      Cc: stable@vger.kernel.org
      Cc: Florent Revest <revest@chromium.org>
      Signed-off-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Reviewed-by: default avatarMark Rutland <mark.rutland@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      a8b9cf62
  5. 01 Feb, 2024 9 commits
    • Steven Rostedt (Google)'s avatar
      eventfs: Keep all directory links at 1 · ca185770
      Steven Rostedt (Google) authored
      The directory link count in eventfs was somewhat bogus. It was only being
      updated when a directory child was being looked up and not on creation.
      
      One solution would be to update in get_attr() the link count by iterating
      the ei->children list and then adding 2. But that could slow down simple
      stat() calls, especially if it's done on all directories in eventfs.
      
      Another solution would be to add a parent pointer to the eventfs_inode
      and keep track of the number of sub directories it has on creation. But
      this adds overhead for something not really worthwhile.
      
      The solution decided upon is to keep all directory links in eventfs as 1.
      This tells user space not to rely on the hard links of directories. Which
      in this case it shouldn't.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.339968298@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Fixes: c1504e51 ("eventfs: Implement eventfs dir creation functions")
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      ca185770
    • Steven Rostedt (Google)'s avatar
      eventfs: Remove fsnotify*() functions from lookup() · 12d823b3
      Steven Rostedt (Google) authored
      The dentries and inodes are created when referenced in the lookup code.
      There's no reason to call fsnotify_*() functions when they are created by
      a reference. It doesn't make any sense.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.166973329@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Fixes: a3760079 ("eventfs: Implement functions to create files and dirs when accessed");
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      12d823b3
    • Steven Rostedt (Google)'s avatar
      eventfs: Restructure eventfs_inode structure to be more condensed · 264424df
      Steven Rostedt (Google) authored
      Some of the eventfs_inode structure has holes in it. Rework the structure
      to be a bit more condensed, and also remove the no longer used llist
      field.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.002321438@goodmis.org
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      264424df
    • Steven Rostedt (Google)'s avatar
      eventfs: Warn if an eventfs_inode is freed without is_freed being set · 5a49f996
      Steven Rostedt (Google) authored
      There should never be a case where an evenfs_inode is being freed without
      is_freed being set. Add a WARN_ON_ONCE() if it ever happens. That would
      mean there was one too many put_ei()s.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161616.843551963@goodmis.org
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      5a49f996
    • Daniel Bristot de Oliveira's avatar
      tracing/timerlat: Move hrtimer_init to timerlat_fd open() · 1389358b
      Daniel Bristot de Oliveira authored
      Currently, the timerlat's hrtimer is initialized at the first read of
      timerlat_fd, and destroyed at close(). It works, but it causes an error
      if the user program open() and close() the file without reading.
      
      Here's an example:
      
       # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
       # echo timerlat > /sys/kernel/debug/tracing/current_tracer
      
       # cat <<EOF > ./timerlat_load.py
       # !/usr/bin/env python3
      
       timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
       timerlat_fd.close();
       EOF
      
       # ./taskset -c 0 ./timerlat_load.py
      <BOOM>
      
       BUG: kernel NULL pointer dereference, address: 0000000000000010
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
       RIP: 0010:hrtimer_active+0xd/0x50
       Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
       RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
       RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
       RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
       R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
       FS:  00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
       PKRU: 55555554
       Call Trace:
        <TASK>
        ? __die+0x23/0x70
        ? page_fault_oops+0x171/0x4e0
        ? srso_alias_return_thunk+0x5/0x7f
        ? avc_has_extended_perms+0x237/0x520
        ? exc_page_fault+0x7f/0x180
        ? asm_exc_page_fault+0x26/0x30
        ? hrtimer_active+0xd/0x50
        hrtimer_cancel+0x15/0x40
        timerlat_fd_release+0x48/0xe0
        __fput+0xf5/0x290
        __x64_sys_close+0x3d/0x80
        do_syscall_64+0x60/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? __x64_sys_ioctl+0x72/0xd0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? exit_to_user_mode_prepare+0x142/0x1f0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
       RIP: 0033:0x7f2ffb321594
       Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
       RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
       RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
       RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
       R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
       R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
        </TASK>
       CR2: 0000000000000010
       ---[ end trace 0000000000000000 ]---
      
      Move hrtimer_init to timerlat_fd open() to avoid this problem.
      
      Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: stable@vger.kernel.org
      Fixes: e88ed227 ("tracing/timerlat: Add user-space interface")
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      1389358b
    • Linus Torvalds's avatar
      eventfs: Get rid of dentry pointers without refcounts · 43aa6f97
      Linus Torvalds authored
      The eventfs inode had pointers to dentries (and child dentries) without
      actually holding a refcount on said pointer.  That is fundamentally
      broken, and while eventfs tried to then maintain coherence with dentries
      going away by hooking into the '.d_iput' callback, that doesn't actually
      work since it's not ordered wrt lookups.
      
      There were two reasonms why eventfs tried to keep a pointer to a dentry:
      
       - the creation of a 'events' directory would actually have a stable
         dentry pointer that it created with tracefs_start_creating().
      
         And it needed that dentry when tearing it all down again in
         eventfs_remove_events_dir().
      
         This use is actually ok, because the special top-level events
         directory dentries are actually stable, not just a temporary cache of
         the eventfs data structures.
      
       - the 'eventfs_inode' (aka ei) needs to stay around as long as there
         are dentries that refer to it.
      
         It then used these dentry pointers as a replacement for doing
         reference counting: it would try to make sure that there was only
         ever one dentry associated with an event_inode, and keep a child
         dentry array around to see which dentries might still refer to the
         parent ei.
      
      This gets rid of the invalid dentry pointer use, and renames the one
      valid case to a different name to make it clear that it's not just any
      random dentry.
      
      The magic child dentry array that is kind of a "reverse reference list"
      is simply replaced by having child dentries take a ref to the ei.  As
      does the directory dentries.  That makes the broken use case go away.
      
      Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240131185513.280463000@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Fixes: c1504e51 ("eventfs: Implement eventfs dir creation functions")
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      43aa6f97
    • Linus Torvalds's avatar
      eventfs: Clean up dentry ops and add revalidate function · 8dce06e9
      Linus Torvalds authored
      In order for the dentries to stay up-to-date with the eventfs changes,
      just add a 'd_revalidate' function that checks the 'is_freed' bit.
      
      Also, clean up the dentry release to actually use d_release() rather
      than the slightly odd d_iput() function.  We don't care about the inode,
      all we want to do is to get rid of the refcount to the eventfs data
      added by dentry->d_fsdata.
      
      It would probably be cleaner to make eventfs its own filesystem, or at
      least set its own dentry ops when looking up eventfs files.  But as it
      is, only eventfs dentries use d_fsdata, so we don't really need to split
      these things up by use.
      
      Another thing that might be worth doing is to make all eventfs lookups
      mark their dentries as not worth caching.  We could do that with
      d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.
      
      As it is, the dentries are all freeable, but they only tend to get freed
      at memory pressure rather than more proactively.  But that's a separate
      issue.
      
      Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240131185513.124644253@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Fixes: c1504e51 ("eventfs: Implement eventfs dir creation functions")
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      8dce06e9
    • Linus Torvalds's avatar
      eventfs: Remove unused d_parent pointer field · 408600be
      Linus Torvalds authored
      It's never used
      
      Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240131185512.961772428@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Fixes: c1504e51 ("eventfs: Implement eventfs dir creation functions")
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      408600be
    • Linus Torvalds's avatar
      tracefs: dentry lookup crapectomy · 49304c2b
      Linus Torvalds authored
      The dentry lookup for eventfs files was very broken, and had lots of
      signs of the old situation where the filesystem names were all created
      statically in the dentry tree, rather than being looked up dynamically
      based on the eventfs data structures.
      
      You could see it in the naming - how it claimed to "create" dentries
      rather than just look up the dentries that were given it.
      
      You could see it in various nonsensical and very incorrect operations,
      like using "simple_lookup()" on the dentries that were passed in, which
      only results in those dentries becoming negative dentries.  Which meant
      that any other lookup would possibly return ENOENT if it saw that
      negative dentry before the data was then later filled in.
      
      You could see it in the immense amount of nonsensical code that didn't
      actually just do lookups.
      
      Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240131233227.73db55e1@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Fixes: c1504e51 ("eventfs: Implement eventfs dir creation functions")
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      49304c2b
  6. 31 Jan, 2024 4 commits
  7. 28 Jan, 2024 1 commit
  8. 26 Jan, 2024 1 commit
  9. 23 Jan, 2024 1 commit
    • Steven Rostedt (Google)'s avatar
      eventfs: Save directory inodes in the eventfs_inode structure · 834bf76a
      Steven Rostedt (Google) authored
      The eventfs inodes and directories are allocated when referenced. But this
      leaves the issue of keeping consistent inode numbers and the number is
      only saved in the inode structure itself. When the inode is no longer
      referenced, it can be freed. When the file that the inode was representing
      is referenced again, the inode is once again created, but the inode number
      needs to be the same as it was before.
      
      Just making the inode numbers the same for all files is fine, but that
      does not work with directories. The find command will check for loops via
      the inode number and having the same inode number for directories triggers:
      
        # find /sys/kernel/tracing
      find: File system loop detected;
      '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
      '/sys/kernel/debug/tracing/events/initcall'.
      [..]
      
      Linus pointed out that the eventfs_inode structure ends with a single
      32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
      alignment. We can use this hole to store the inode number for the
      eventfs_inode. All directories in eventfs are represented by an
      eventfs_inode and that data structure can hold its inode number.
      
      That last int was also purposely placed at the end of the structure to
      prevent holes from within. Now that there's a 4 byte number to hold the
      inode, both the inode number and the last integer can be moved up in the
      structure for better cache locality, where the llist and rcu fields can be
      moved to the end as they are only used when the eventfs_inode is being
      deleted.
      
      Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Fixes: 53c41052 ("eventfs: Have the inodes all for files and directories all be the same")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      834bf76a
  10. 22 Jan, 2024 1 commit
    • Petr Pavlu's avatar
      tracing: Ensure visibility when inserting an element into tracing_map · 2b447606
      Petr Pavlu authored
      Running the following two commands in parallel on a multi-processor
      AArch64 machine can sporadically produce an unexpected warning about
      duplicate histogram entries:
      
       $ while true; do
           echo hist:key=id.syscall:val=hitcount > \
             /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
           cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
           sleep 0.001
         done
       $ stress-ng --sysbadaddr $(nproc)
      
      The warning looks as follows:
      
      [ 2911.172474] ------------[ cut here ]------------
      [ 2911.173111] Duplicates detected: 1
      [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
      [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
      [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
      [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
      [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
      [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
      [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.185310] sp : ffff8000a1513900
      [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
      [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
      [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
      [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
      [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
      [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
      [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
      [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
      [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
      [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
      [ 2911.194259] Call trace:
      [ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
      [ 2911.195220]  hist_show+0x124/0x800
      [ 2911.195692]  seq_read_iter+0x1d4/0x4e8
      [ 2911.196193]  seq_read+0xe8/0x138
      [ 2911.196638]  vfs_read+0xc8/0x300
      [ 2911.197078]  ksys_read+0x70/0x108
      [ 2911.197534]  __arm64_sys_read+0x24/0x38
      [ 2911.198046]  invoke_syscall+0x78/0x108
      [ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
      [ 2911.199157]  do_el0_svc+0x28/0x40
      [ 2911.199613]  el0_svc+0x40/0x178
      [ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
      [ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
      [ 2911.201115] ---[ end trace 0000000000000000 ]---
      
      The problem appears to be caused by CPU reordering of writes issued from
      __tracing_map_insert().
      
      The check for the presence of an element with a given key in this
      function is:
      
       val = READ_ONCE(entry->val);
       if (val && keys_match(key, val->key, map->key_size)) ...
      
      The write of a new entry is:
      
       elt = get_free_elt(map);
       memcpy(elt->key, key, map->key_size);
       entry->val = elt;
      
      The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
      stores may become visible in the reversed order on another CPU. This
      second CPU might then incorrectly determine that a new key doesn't match
      an already present val->key and subsequently insert a new element,
      resulting in a duplicate.
      
      Fix the problem by adding a write barrier between
      "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
      good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
      element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
      and the "val->key" check which has an address dependency.
      
      The barrier is placed on a path executed when adding an element for
      a new key. Subsequent updates targeting the same key remain unaffected.
      
      From the user's perspective, the issue was introduced by commit
      c193707d ("tracing: Remove code which merges duplicates"), which
      followed commit cbf4100e ("tracing: Add support to detect and avoid
      duplicates"). The previous code operated differently; it inherently
      expected potential races which result in duplicates but merged them
      later when they occurred.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
      
      Fixes: c193707d ("tracing: Remove code which merges duplicates")
      Signed-off-by: default avatarPetr Pavlu <petr.pavlu@suse.com>
      Acked-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      2b447606
  11. 21 Jan, 2024 16 commits