• Steven Rostedt (Google)'s avatar
    ring-buffer: Fix slowpath of interrupted event · b803d7c6
    Steven Rostedt (Google) authored
    To synchronize the timestamps with the ring buffer reservation, there are
    two timestamps that are saved in the buffer meta data.
    
    1. before_stamp
    2. write_stamp
    
    When the two are equal, the write_stamp is considered valid, as in, it may
    be used to calculate the delta of the next event as the write_stamp is the
    timestamp of the previous reserved event on the buffer.
    
    This is done by the following:
    
     /*A*/	w = current position on the ring buffer
    	before = before_stamp
    	after = write_stamp
    	ts = read current timestamp
    
    	if (before != after) {
    		write_stamp is not valid, force adding an absolute
    		timestamp.
    	}
    
     /*B*/	before_stamp = ts
    
     /*C*/	write = local_add_return(event length, position on ring buffer)
    
    	if (w == write - event length) {
    		/* Nothing interrupted between A and C */
     /*E*/		write_stamp = ts;
    		delta = ts - after
    		/*
    		 * If nothing interrupted again,
    		 * before_stamp == write_stamp and write_stamp
    		 * can be used to calculate the delta for
    		 * events that come in after this one.
    		 */
    	} else {
    
    		/*
    		 * The slow path!
    		 * Was interrupted between A and C.
    		 */
    
    This is the place that there's a bug. We currently have:
    
    		after = write_stamp
    		ts = read current timestamp
    
     /*F*/		if (write == current position on the ring buffer &&
    		    after < ts && cmpxchg(write_stamp, after, ts)) {
    
    			delta = ts - after;
    
    		} else {
    			delta = 0;
    		}
    
    The assumption is that if the current position on the ring buffer hasn't
    moved between C and F, then it also was not interrupted, and that the last
    event written has a timestamp that matches the write_stamp. That is the
    write_stamp is valid.
    
    But this may not be the case:
    
    If a task context event was interrupted by softirq between B and C.
    
    And the softirq wrote an event that got interrupted by a hard irq between
    C and E.
    
    and the hard irq wrote an event (does not need to be interrupted)
    
    We have:
    
     /*B*/ before_stamp = ts of normal context
    
       ---> interrupted by softirq
    
    	/*B*/ before_stamp = ts of softirq context
    
    	  ---> interrupted by hardirq
    
    		/*B*/ before_stamp = ts of hard irq context
    		/*E*/ write_stamp = ts of hard irq context
    
    		/* matches and write_stamp valid */
    	  <----
    
    	/*E*/ write_stamp = ts of softirq context
    
    	/* No longer matches before_stamp, write_stamp is not valid! */
    
       <---
    
     w != write - length, go to slow path
    
    // Right now the order of events in the ring buffer is:
    //
    // |-- softirq event --|-- hard irq event --|-- normal context event --|
    //
    
     after = write_stamp (this is the ts of softirq)
     ts = read current timestamp
    
     if (write == current position on the ring buffer [true] &&
         after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
    
    	delta = ts - after  [Wrong!]
    
    The delta is to be between the hard irq event and the normal context
    event, but the above logic made the delta between the softirq event and
    the normal context event, where the hard irq event is between the two. This
    will shift all the remaining event timestamps on the sub-buffer
    incorrectly.
    
    The write_stamp is only valid if it matches the before_stamp. The cmpxchg
    does nothing to help this.
    
    Instead, the following logic can be done to fix this:
    
    	before = before_stamp
    	ts = read current timestamp
    	before_stamp = ts
    
    	after = write_stamp
    
    	if (write == current position on the ring buffer &&
    	    after == before && after < ts) {
    
    		delta = ts - after
    
    	} else {
    		delta = 0;
    	}
    
    The above will only use the write_stamp if it still matches before_stamp
    and was tested to not have changed since C.
    
    As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
    
    This means the 32-bit rb_time_t workaround can finally be removed. But
    that's for a later time.
    
    Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
    Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home
    
    Cc: stable@vger.kernel.org
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Fixes: dd939425 ("ring-buffer: Do not try to put back write_stamp")
    Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
    b803d7c6
ring_buffer.c 164 KB