    sched: Lower chances of cputime scaling overflow · d9a3c982
    Frederic Weisbecker authored
    Some users have reported that after running a process with
    hundreds of threads on intensive CPU-bound loads, the cputime
    of the group started to freeze after a few days.
    
    This is due to how we scale the tick-based cputime against
    the scheduler's precise execution time value.
    
    We add the values of all threads in the group and we multiply
    that against the sum of the scheduler exec runtime of the whole
    group.
    
    This easily overflows after a few days/weeks of execution.
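
    As a rough illustration of the failure mode, the sketch below performs
    the direct multiply-then-divide on unsigned 64-bit values. The helper
    names and the input values are hypothetical and are not taken from the
    kernel source; they only show that the intermediate product can wrap
    even when the final result would fit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a * b does not fit in 64 unsigned bits. */
static bool mul_u64_overflows(uint64_t a, uint64_t b)
{
    return a != 0 && b > UINT64_MAX / a;
}

/*
 * Hypothetical direct scaling: cputime * rtime / total.
 * The intermediate product wraps modulo 2^64 when it overflows,
 * so the returned value is silently wrong in that case.
 */
static uint64_t scale_direct(uint64_t cputime, uint64_t rtime, uint64_t total)
{
    assert(total != 0);
    return (cputime * rtime) / total;
}
```

    With all three inputs equal to 2^32, the exact result would be 2^32,
    but the product 2^64 wraps to 0 and the function returns 0.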
    
    An earlier fix for this was to compute that multiplication
    on stime instead of utime:
       62188451
       ("cputime: Avoid multiplication overflow on utime scaling")
    
    The rationale behind that was that it's easy for a thread to
    spend most of its time in userspace under an intensive CPU-bound
    workload, but much harder to sustain a long, intensive CPU-bound
    run in the kernel.
    
    This assumption was defeated when a user recently reported that he was
    still seeing cputime freezes after the above patch. The issue is
    triggered by intensive networking workloads, where most of the
    cputime is consumed in the kernel.
    
    To further reduce the opportunities for multiplication overflow,
    let's reduce the multiplication factors to the remainders of the division
    between sched exec runtime and cputime. Assuming the difference between
    these should never be very large, this should work in many situations.
    
    This gets the same results as the upstream scaling code except for
    a small difference: the upstream code always rounds the result down,
    to the nearest integer not greater than the precise result.
    The new code may round to the nearest integer in either direction,
    greater or not greater. In practice this difference probably shouldn't
    matter, but it's worth mentioning.
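
    The remainder-based scaling and its rounding behaviour can be sketched
    as follows. This is a minimal sketch under stated assumptions, not the
    code of this patch: the function names are hypothetical, plain
    uint64_t stands in for the kernel's cputime type, and the handling of
    the rtime < total case is one possible way to apply the same idea.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: scale stime by rtime / total while multiplying
 * only by the quotient and remainder of the division between rtime
 * and total, instead of by rtime itself.
 */
static uint64_t scale_stime_sketch(uint64_t stime, uint64_t rtime,
                                   uint64_t total)
{
    uint64_t quot, rem, scaled;

    assert(rtime != 0 && total != 0);

    if (rtime >= total) {
        /*
         * rtime = quot * total + rem, hence
         * stime * rtime / total = stime * quot + stime * rem / total.
         */
        quot = rtime / total;
        rem = rtime % total;
        scaled = stime * quot + (stime * rem) / total;
    } else {
        /*
         * total = quot * rtime + rem: scale down by quot first, then
         * subtract the part of that result attributable to rem.
         */
        quot = total / rtime;
        rem = total % rtime;
        scaled = stime / quot;
        scaled -= (scaled * rem) / total;
    }
    return scaled;
}

/* Reference: the direct formula, which always truncates but can overflow. */
static uint64_t scale_naive(uint64_t stime, uint64_t rtime, uint64_t total)
{
    return (stime * rtime) / total;
}
```

    For stime = 9, rtime = 4, total = 10 the precise result is 3.6: the
    direct formula truncates to 3 while this sketch yields 4, which
    illustrates rounding in the other direction. For stime = 2^40,
    rtime = 2^40 + 5, total = 2^40 the direct product would overflow,
    whereas the sketch returns the exact value 2^40 + 5.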
    
    If this solution turns out not to be enough in the end, we'll
    need to partly revert to the behaviour prior to commit
         0cf55e1e
         ("sched, cputime: Introduce thread_group_times()")
    
    Back then, the scaling was done at exit() time, before adding the cputime
    of an exiting thread to the signal struct. We would then also need to
    scale the cputime of the live threads one by one in
    thread_group_cputime(). The drawback may be slightly slower code at
    exit time.

    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Stanislaw Gruszka <sgruszka@redhat.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>