1. 13 Nov, 2019 1 commit
    • sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f654 · ffff15d4
      Linus Torvalds authored
      CVE-2018-20784
      
      Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
      scheduler under high loads, starting at around the v4.18 time frame,
      and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
      manipulation.
      
      Do a (manual) revert of:
      
        a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      
      It turns out that the list_del_leaf_cfs_rq() introduced by this commit
      has a surprising property that was not considered in followup commits
      such as:
      
        9c2791f9 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")
      
      As Vincent Guittot explains:
      
       "I think that there is a bigger problem with commit a9e7f654 and
        cfs_rq throttling:
      
        Let's take the example of the following topology TG2 --> TG1 --> root:
      
         1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
            cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
            one path because it has never been used and can't be throttled so
            tmp_alone_branch will point to leaf_cfs_rq_list at the end.
      
         2) Then TG1 is throttled
      
         3) and we add TG3 as a new child of TG1.
      
         4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
            cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.
      
        With commit a9e7f654, we can del a cfs_rq from rq->leaf_cfs_rq_list.
        So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
        cfs_rq is removed from the list.
        Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
        but tmp_alone_branch still points to TG3 cfs_rq because its throttled
        parent can't be enqueued when the lock is released.
        tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.
      
        So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
        points to another TG cfs_rq, the next TG cfs_rq that will be added
        will be linked outside rq->leaf_cfs_rq_list - which is bad.
      
        In addition, we can break the ordering of the cfs_rq in
        rq->leaf_cfs_rq_list but this ordering is used to update and
        propagate the update from leaf down to root."
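      
      To make the failure mode Vincent describes concrete, here is a minimal
      user-space sketch (plain C, a toy model with invented names - not the
      kernel code): "cursor" plays the role of rq->tmp_alone_branch and "head"
      the role of rq->leaf_cfs_rq_list, and a cursor left pointing at a node
      that has since been deleted makes the next insertion land outside the
      main list:
      
        #include <stdio.h>
        
        struct node {
        	struct node *prev, *next;
        	const char *name;
        };
        
        static void list_init(struct node *head)
        {
        	head->prev = head->next = head;
        	head->name = "head";
        }
        
        static void insert_after(struct node *n, struct node *pos)
        {
        	n->prev = pos;
        	n->next = pos->next;
        	pos->next->prev = n;
        	pos->next = n;
        }
        
        static void del(struct node *n)
        {
        	n->prev->next = n->next;
        	n->next->prev = n->prev;
        	n->prev = n->next = n;		/* node now only points at itself */
        }
        
        static int on_list(struct node *head, struct node *n)
        {
        	struct node *p;
        
        	for (p = head->next; p != head; p = p->next)
        		if (p == n)
        			return 1;
        	return 0;
        }
        
        int main(void)
        {
        	struct node head, tg3, next_cfs;
        	struct node *cursor;			/* models tmp_alone_branch */
        
        	list_init(&head);			/* models leaf_cfs_rq_list */
        	tg3.name = "TG3";
        	next_cfs.name = "next";
        
        	/* TG3 is enqueued and linked into the list, but its parent is
        	 * throttled, so the branch stays "open": the cursor is left on
        	 * TG3 instead of being reset to &head. */
        	insert_after(&tg3, &head);
        	cursor = &tg3;
        
        	/* TG3 is now removed from the list (its load decayed, or the
        	 * group is destroyed) while the cursor still points at it. */
        	del(&tg3);
        
        	/* The next cfs_rq enqueued is linked relative to the stale
        	 * cursor, i.e. outside the main list. */
        	insert_after(&next_cfs, cursor);
        
        	printf("next on main list? %s\n",
        	       on_list(&head, &next_cfs) ? "yes" : "no (lost)");
        	return 0;
        }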
      
      Instead of trying to work through all these cases and trying to reproduce
      the very high loads that produced the lockup to begin with, simplify
      the code temporarily by reverting a9e7f654 - which change was clearly
      not thought through completely.
      
      This (hopefully) gives us a kernel that doesn't lock up so people
      can continue to enjoy their holidays without worrying about regressions. ;-)
      
      [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]
      Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Reported-by: Sargun Dhillon <sargun@sargun.me>
      Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Tested-by: Sargun Dhillon <sargun@sargun.me>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: <stable@vger.kernel.org> # v4.13+
      Cc: Bin Li <huawei.libin@huawei.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      (backported from commit c40f7d74)
      [ Connor Kuehl: context adjustments were required to remove
        'cfs_rq_is_decayed()' and to merge the changes to
        'update_blocked_averages()'. ]
      Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
      Acked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com>
      Acked-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
  2. 21 Oct, 2019 1 commit
  3. 17 Sep, 2019 1 commit
  4. 25 Jun, 2019 1 commit
  5. 14 May, 2019 2 commits
  6. 07 May, 2019 1 commit
  7. 16 Jan, 2019 2 commits
    • sched/fair: Fix throttle_list starvation with low CFS quota · 31336647
      Phil Auld authored
      BugLink: https://bugs.launchpad.net/bugs/1810807
      
      commit baa9be4f upstream.
      
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve.  Essentially, the same X processes will get pulled
      off the list, given CPU time and then, when expired, get put back on the
      head of the list where distribute_cfs_runtime will give runtime to the same
      set of processes, leaving the rest with nothing.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
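      
      As an illustration of the pattern (a user-space sketch with invented
      names - struct bandwidth, throttle_placement(), distribute() - not the
      actual kernel code), the flag is set and tested under the same lock, and
      its value decides head vs. tail placement:
      
        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        
        struct bandwidth {
        	pthread_mutex_t lock;		/* stands in for cfs_bandwidth->lock */
        	bool distribute_running;	/* set while runtime is being handed out */
        };
        
        enum where { AT_HEAD, AT_TAIL };
        
        /* Called when a runqueue runs out of quota and must be queued for refill. */
        static enum where throttle_placement(struct bandwidth *b)
        {
        	enum where w;
        
        	pthread_mutex_lock(&b->lock);
        	/* Head keeps an already-started distribution pass from seeing the
        	 * new entry; tail keeps later entries from starving otherwise. */
        	w = b->distribute_running ? AT_HEAD : AT_TAIL;
        	pthread_mutex_unlock(&b->lock);
        	return w;
        }
        
        static void distribute(struct bandwidth *b)
        {
        	pthread_mutex_lock(&b->lock);
        	b->distribute_running = true;
        	pthread_mutex_unlock(&b->lock);
        
        	/* ... walk the throttled list and hand out runtime ... */
        
        	pthread_mutex_lock(&b->lock);
        	b->distribute_running = false;
        	pthread_mutex_unlock(&b->lock);
        }
        
        int main(void)
        {
        	struct bandwidth b = { PTHREAD_MUTEX_INITIALIZER, false };
        
        	printf("no pass running -> add at %s\n",
        	       throttle_placement(&b) == AT_TAIL ? "tail" : "head");
        	distribute(&b);
        	return 0;
        }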
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Juerg Haefliger <juergh@canonical.com>
      Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
    • sched/cgroup: Fix cgroup entity load tracking tear-down · 5967ce5c
      Peter Zijlstra authored
      BugLink: https://bugs.launchpad.net/bugs/1810807
      
      [ Upstream commit 6fe1f348 ]
      
      When a cgroup's CPU runqueue is destroyed, it should remove its
      remaining load accounting from its parent cgroup.
      
      The current site for doing so is unsuited because it's far too late and
      unordered against other cgroup removal (->css_free() will be, but we're also
      in an RCU callback).
      
      Put it in the ->css_offline() callback, which is the start of cgroup
      destruction, right after the group has been made unavailable to
      userspace. The ->css_offline() callbacks are called in hierarchical order
      after the following v4.4 commit:
      
        aa226ff4 ("cgroup: make sure a parent css isn't offlined before its children")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Juerg Haefliger <juergh@canonical.com>
      Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
  8. 09 May, 2018 1 commit
  9. 13 Mar, 2018 1 commit
  10. 12 May, 2017 2 commits
  11. 06 Dec, 2016 1 commit
  12. 12 Aug, 2016 1 commit
  13. 09 Aug, 2016 1 commit
    • sched/fair: Fix cfs_rq avg tracking underflow · 9a8d56a7
      Peter Zijlstra authored
      BugLink: http://bugs.launchpad.net/bugs/1607404
      
      commit 89741892 upstream.
      
      As per commit:
      
        b7fa30c9 ("sched/fair: Fix post_init_entity_util_avg() serialization")
      
      > the code generated from update_cfs_rq_load_avg():
      >
      > 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      > 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
      > 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      > 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      > 		removed_load = 1;
      > 	}
      >
      > turns into:
      >
      > ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
      > ffffffff8108706b:       48 85 c0                test   %rax,%rax
      > ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
      > ffffffff81087070:       4c 89 f8                mov    %r15,%rax
      > ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
      > ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
      > ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
      > ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
      > ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
      > ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
      >
      > Which you'll note ends up with sa->load_avg -= r in memory at
      > ffffffff8108707a.
      
      So I _should_ have looked at other unserialized users of ->load_avg,
      but alas. Luckily nikbor reported a similar /0 (divide by zero) from
      task_h_load(), which instantly triggered recollection of this problem.
      
      Aside from the intermediate value hitting memory and causing problems,
      there's another problem: the underflow detection relies on the signed
      bit. This reduces the effective width of the variables, IOW it's
      effectively the same as having these variables be of signed type.
      
      This patch changes to a different means of unsigned underflow
      detection to not rely on the signed bit. This allows the variables to
      use the 'full' unsigned range. And it does so with explicit LOAD -
      STORE to ensure any intermediate value will never be visible in
      memory, allowing these unserialized loads.
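      
      The upstream fix adds a small helper along these lines (shown here
      approximately, as a self-contained user-space version with
      READ_ONCE/WRITE_ONCE stubbed via volatile accesses): the 'res > var'
      test catches unsigned wrap-around without using the sign bit, and only
      the final clamped value is ever stored back:
      
        #include <stdio.h>
        
        #define READ_ONCE(x)	 (*(volatile __typeof__(x) *)&(x))
        #define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))
        
        /* Subtract and clamp at zero; the intermediate value never hits memory. */
        #define sub_positive(_ptr, _val) do {			\
        	__typeof__(_ptr) ptr = (_ptr);			\
        	__typeof__(*ptr) val = (_val);			\
        	__typeof__(*ptr) res, var = READ_ONCE(*ptr);	\
        	res = var - val;				\
        	if (res > var)	/* unsigned wrap-around: underflow */	\
        		res = 0;				\
        	WRITE_ONCE(*ptr, res);				\
        } while (0)
        
        int main(void)
        {
        	unsigned long load_avg = 100;
        
        	sub_positive(&load_avg, 250);		/* would underflow: clamped */
        	printf("load_avg = %lu\n", load_avg);	/* 0 */
        
        	load_avg = 1000;
        	sub_positive(&load_avg, 250);
        	printf("load_avg = %lu\n", load_avg);	/* 750 */
        	return 0;
        }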
      
      Note: GCC generates crap code for this, might warrant a look later.
      
      Note2: I say 'full' above, but if we end up at U*_MAX we'll still explode;
             maybe we should do clamping on add too.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: bsegall@google.com
      Cc: kernel@kyup.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Fixes: 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
  14. 06 Apr, 2016 1 commit
    • sched/numa: Fix use-after-free bug in the task_numa_compare · 75bfdb94
      Gavin Guo authored
      BugLink: http://bugs.launchpad.net/bugs/1527643
      
      The following message can be observed on the Ubuntu v3.13.0-65 kernel with KASan
      backported:
      
        ==================================================================
        BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
        Read of size 8 by task qemu-system-x86/3998900
        =============================================================================
        BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
        -----------------------------------------------------------------------------
      
        INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
      	__slab_alloc+0x4f8/0x560
      	__kmalloc+0x1eb/0x280
      	task_numa_fault+0xc1b/0xed0
      	do_numa_page+0x192/0x200
      	handle_mm_fault+0x808/0x1160
      	__do_page_fault+0x218/0x750
      	do_page_fault+0x1a/0x70
      	page_fault+0x28/0x30
      	SyS_poll+0x66/0x1a0
      	system_call_fastpath+0x1a/0x1f
        INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
      	__slab_free+0x2ab/0x3f0
      	kfree+0x161/0x170
      	task_numa_free+0x1d2/0x200
      	finish_task_switch+0x1d2/0x210
      	__schedule+0x5d4/0xc60
      	schedule_preempt_disabled+0x40/0xc0
      	cpu_startup_entry+0x2da/0x340
      	start_secondary+0x28f/0x360
        Call Trace:
         [<ffffffff81a6ce35>] dump_stack+0x45/0x56
         [<ffffffff81244aed>] print_trailer+0xfd/0x170
         [<ffffffff8124ac36>] object_err+0x36/0x40
         [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
         [<ffffffff8124d260>] kasan_report+0x40/0x50
         [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
         [<ffffffff8124bee9>] __asan_load8+0x69/0xa0
         [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
         [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
         [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
         [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
         [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
         [<ffffffff8120ef02>] do_numa_page+0x192/0x200
         [<ffffffff81211038>] handle_mm_fault+0x808/0x1160
         [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
         [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
         [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
         [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
         [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
         [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
         [<ffffffff81a772e8>] page_fault+0x28/0x30
         [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
         [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
         [<ffffffff810233c9>] ? sched_clock+0x9/0x10
         [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
         [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
         [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
         [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
         [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
         [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
         [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
         [<ffffffff81022c89>] ? read_tsc+0x9/0x20
         [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
         [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
         [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
         [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f
      
      As commit 1effd9f1 ("sched/numa: Fix unsafe get_task_struct() in
      task_numa_assign()") points out, the rcu_read_lock() cannot protect the
      task_struct from being freed in the finish_task_switch(). And the bug
      happens in the process of calculation of imp which requires the access of
      p->numa_faults being freed in the following path:
      
      do_exit()
              current->flags |= PF_EXITING;
          release_task()
              ~~delayed_put_task_struct()~~
          schedule()
          ...
          ...
      rq->curr = next;
          context_switch()
              finish_task_switch()
                  put_task_struct()
                      __put_task_struct()
      		    task_numa_free()
      
      The fix here is to call get_task_struct() early, before the end of the
      dst_rq->lock section, to protect the calculation, and to call
      put_task_struct() at the corresponding point if dst_rq->curr ultimately
      cannot be assigned.
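      
      A minimal sketch of that pattern (user-space analogue with invented
      names - get_task()/put_task(), compare_and_maybe_assign() - not the
      scheduler code itself): pin the task while the lock still guarantees it
      is alive, and drop the reference on every path that does not keep it:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        struct task {
        	int refcount;
        	long numa_faults;
        };
        
        static void get_task(struct task *t) { t->refcount++; }
        static void put_task(struct task *t) { t->refcount--; }
        
        /* 'cur' was read while dst_rq->lock was held; pin it before the lock
         * is dropped and the unlocked comparison work starts. */
        static bool compare_and_maybe_assign(struct task *cur, long *best_imp)
        {
        	get_task(cur);			/* safe: object cannot be freed yet */
        
        	/* ... lock released; cur->numa_faults can be read safely while
        	 * computing the improvement, because we hold a reference ... */
        	if (cur->numa_faults > *best_imp) {
        		*best_imp = cur->numa_faults;
        		return true;		/* reference kept with the assignment */
        	}
        
        	put_task(cur);			/* not assigned: drop the reference */
        	return false;
        }
        
        int main(void)
        {
        	struct task cur = { .refcount = 1, .numa_faults = 42 };
        	long best = 100;
        
        	compare_and_maybe_assign(&cur, &best);	/* loses: reference dropped */
        	printf("refcount=%d best=%ld\n", cur.refcount, best);	/* 1 100 */
        	return 0;
        }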
      
      Additional credit to Liang Chen, who helped fix the error logic and add
      the put_task_struct() call where it was missing.
      Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jay.vosburgh@canonical.com
      Cc: liang.chen@canonical.com
      Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      (cherry picked from commit 1dff76b9)
      Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
  15. 06 Jan, 2016 1 commit
  16. 23 Nov, 2015 1 commit
    • treewide: Remove old email address · 90eec103
      Peter Zijlstra authored
      
      There were still a number of references to my old Red Hat email
      address in the kernel source. Remove these while keeping the
      Red Hat copyright notices intact.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 09 Nov, 2015 1 commit
  18. 20 Oct, 2015 2 commits
  19. 06 Oct, 2015 2 commits
  20. 18 Sep, 2015 3 commits
  21. 13 Sep, 2015 13 commits