1. 12 Nov, 2018 2 commits
    • sched/fair: Fix cpu_util_wake() for 'execl' type workloads · c469933e
      Patrick Bellasi authored
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple: it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
                       capacity_spare_wake()    <==| Functions not called from
                         cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU computed in cpu_util_wake() was
      written under the assumption that the function can be called only from
      the wakeup path. If instead the task is already enqueued, we end up with
      a utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      As a result, we wrongly assume a reduced spare capacity on the current
      CPU, which increases the chance of migrating the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because in that patch we turn on by default the UTIL_EST sched feature.
      However, the real issue is introduced by:
      
       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by always discounting the task's estimated
      utilization from the CPU's estimated utilization when the task is also
      the current one. The same benchmark from the bug report, executed on a
      dual-socket, 40-CPU Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
      reports these "Execl Throughput" figures (higher is better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which corresponds to a 15% speedup.
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really only
      used from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().
      
      While we are at it, let's also improve the existing documentation.
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 (sched/fair: Use util_est in LB and WU paths)
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
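The discounting idea in the commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the `*_model` structs and functions are hypothetical, and only the estimated-utilization part is modelled (the real function also takes other per-CPU signals into account).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for kernel state (not the real structs). */
struct task_model {
	unsigned long util_est;	/* task's estimated utilization */
	int cpu;		/* CPU the task is associated with */
	bool queued;		/* true if enqueued or currently running */
};

struct cpu_model {
	int id;
	unsigned long util_est_enqueued;	/* sum of util_est of enqueued tasks */
	unsigned long capacity;
};

/* Subtract without wrapping below zero. */
static unsigned long sub_positive(unsigned long a, unsigned long b)
{
	return a > b ? a - b : 0;
}

/*
 * Estimated utilization of @cpu as if @p were not running on it.
 * The discount is applied only when @p is actually accounted on @cpu,
 * which is the case the commit message describes for execve().
 */
static unsigned long cpu_util_without_model(const struct cpu_model *cpu,
					    const struct task_model *p)
{
	unsigned long estimated = cpu->util_est_enqueued;

	assert(cpu->capacity > 0);

	if (p->queued && p->cpu == cpu->id)
		estimated = sub_positive(estimated, p->util_est);

	/* Utilization is clamped to the CPU's capacity. */
	return estimated < cpu->capacity ? estimated : cpu->capacity;
}

/* Spare capacity of @cpu once @p's contribution has been discounted. */
static unsigned long capacity_spare_without_model(const struct cpu_model *cpu,
						  const struct task_model *p)
{
	return sub_positive(cpu->capacity, cpu_util_without_model(cpu, p));
}
```

In this model, a task with `util_est` 400 that is already enqueued on a CPU with 600 units of enqueued estimated utilization leaves 200 units once discounted, so the CPU it is already running on looks less busy and the task is less likely to be migrated.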
  2. 05 Nov, 2018 1 commit
  3. 03 Nov, 2018 5 commits
    • sched/core: Introduce set_next_task() helper for better code readability · ff1cdc94
      Muchun Song authored
      When we pick the next task, we will do the following for the task:
      
        1) p->se.exec_start = rq_clock_task(rq);
        2) dequeue_pushable(_dl)_task(rq, p);
      
      When we call set_curr_task(), we also need to do the same thing
      as above. In rt.c, the code at 1) is in _pick_next_task_rt()
      and the code at 2) is in pick_next_task_rt(). It is cleaner to put
      the two operations in one function, so we introduce a new
      function, set_next_task(), which is responsible for doing both.
      
      By introducing this function we no longer need to call
      dequeue_pushable(_dl)_task() directly in pick_next_task() (we can
      call set_next_task() instead), which improves code readability and
      reuse. set_curr_task_rt() can also call set_next_task().
      
      Do these things such that we end up with:
      
        static struct task_struct *pick_next_task(struct rq *rq,
        					    struct task_struct *prev,
        					    struct rq_flags *rf)
        {
        	/* do something else ... */
      
        	put_prev_task(rq, prev);
      
        	/* pick next task p */
      
        	set_next_task(rq, p);
      
        	/* do something else ... */
        }
      
      put_prev_task() is then paired with set_next_task(), which makes the
      code more readable.
      Signed-off-by: Muchun Song <smuchun@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181026131743.21786-1-smuchun@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
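The helper described in the commit above groups two steps (stamping `exec_start` and removing the task from the pushable list) into one function. A minimal user-space sketch of that grouping might look as follows; every `*_sketch` type and function is a hypothetical stand-in, and the real scheduler structures carry far more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced task and runqueue models. */
struct task_sketch {
	unsigned long long exec_start;	/* time the task last started executing */
	bool on_pushable_list;		/* task may be pushed to another CPU */
};

struct rq_sketch {
	unsigned long long clock_task;	/* runqueue task clock */
	int nr_pushable;		/* number of tasks on the pushable list */
	struct task_sketch *curr;
};

/* Step 2 from the commit message: take the task off the pushable list. */
static void dequeue_pushable_task_sketch(struct rq_sketch *rq,
					 struct task_sketch *p)
{
	if (p->on_pushable_list) {
		assert(rq->nr_pushable > 0);
		p->on_pushable_list = false;
		rq->nr_pushable--;
	}
}

/* Both steps in one place, so every caller performs them consistently. */
static void set_next_task_sketch(struct rq_sketch *rq, struct task_sketch *p)
{
	p->exec_start = rq->clock_task;		/* step 1 */
	dequeue_pushable_task_sketch(rq, p);	/* step 2 */
}

/* Picking a task now only needs to call the helper. */
static struct task_sketch *pick_next_task_sketch(struct rq_sketch *rq,
						 struct task_sketch *candidate)
{
	/* put_prev_task() and the actual selection logic are elided here. */
	set_next_task_sketch(rq, candidate);
	rq->curr = candidate;
	return candidate;
}
```

The point of the design is that a path which merely re-installs the current task can call the same helper instead of duplicating the two steps.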
    • sched/fair: Don't increase sd->balance_interval on newidle balance · 3f130a37
      Valentin Schneider authored
      When load_balance() fails to move some load because of task affinity,
      we end up increasing sd->balance_interval to delay the next periodic
      balance in the hopes that next time we look, that annoying pinned
      task(s) will be gone.
      
      However, idle_balance() pays no attention to sd->balance_interval, yet
      it will still lead to an increase in balance_interval in case of
      pinned tasks.
      
      If we're going through several newidle balances (e.g. we have a
      periodic task), this can lead to a huge increase of the
      balance_interval in a very small amount of time.
      
      To prevent that, don't increase the balance interval when going
      through a newidle balance.
      
      This is a similar approach to what is done in commit 58b26c4c
      ("sched: Increment cache_nice_tries only on periodic lb"), where we
      disregard newidle balance and rely on periodic balance for more stable
      results.
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: patrick.bellasi@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
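The behaviour change in the commit above can be sketched as a small pure function. This is an illustration only: the enum, the cap value and the function name are assumptions made for the example, and only the pinned-task back-off is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the balance types relevant here. */
enum idle_kind_sketch {
	SKETCH_PERIODIC,	/* regular periodic load balance */
	SKETCH_NEWLY_IDLE	/* balance triggered by a CPU going idle */
};

/* Assumed upper bound for the pinned-task back-off in this sketch. */
#define SKETCH_MAX_PINNED_INTERVAL 512UL

/*
 * Return the balance interval to use after a balance attempt.
 * A newidle balance never changes the interval, so a burst of them
 * cannot inflate it; only periodic balance backs off on pinned tasks.
 */
static unsigned long update_balance_interval_sketch(unsigned long interval,
						    bool all_pinned,
						    enum idle_kind_sketch idle)
{
	assert(interval > 0);

	if (idle == SKETCH_NEWLY_IDLE)
		return interval;

	if (all_pinned && interval < SKETCH_MAX_PINNED_INTERVAL)
		interval *= 2;

	return interval;
}
```

With this shape, repeated newidle attempts against pinned tasks leave the interval untouched, while a periodic attempt still doubles it up to the cap.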
    • sched/fair: Clean up load_balance() condition · 47b7aee1
      Valentin Schneider authored
      The alignment of the condition is off, clean that up.
      
      Also, logical operators have lower precedence than bitwise/relational
      operators, so remove one layer of parentheses to make the condition a
      bit simpler to follow.
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: patrick.bellasi@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1537974727-30788-1-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
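The precedence argument in the commit above can be checked with two equivalent predicates. The flag value, parameter names and the condition itself are illustrative assumptions and are not copied from load_balance(); the sketch only shows that dropping the inner parentheses does not change the result.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit used only for this illustration. */
#define SKETCH_FLAG_ALL_PINNED 0x08u

/* Every sub-expression parenthesized explicitly. */
static bool cond_verbose(unsigned int flags, unsigned long interval,
			 unsigned long pinned_max, unsigned long max)
{
	return (((flags & SKETCH_FLAG_ALL_PINNED) &&
		 (interval < pinned_max)) ||
		(interval < max));
}

/*
 * One layer of parentheses removed. In C, bitwise '&' and relational '<'
 * both bind tighter than logical '&&', so this parses identically.
 */
static bool cond_compact(unsigned int flags, unsigned long interval,
			 unsigned long pinned_max, unsigned long max)
{
	return (flags & SKETCH_FLAG_ALL_PINNED && interval < pinned_max) ||
	       interval < max;
}
```

The parentheses around the `&&` group are kept on purpose: mixing `&&` and `||` without them is legal but commonly flagged by compilers and harder to read.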
    • sched/core: Take the hotplug lock in sched_init_smp() · 40fa3780
      Valentin Schneider authored
      When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
      for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
      Juno or HiKey960), we get the following report:
      
       [    0.748225] Call trace:
       [    0.750685]  lockdep_assert_cpus_held+0x30/0x40
       [    0.755236]  static_key_enable_cpuslocked+0x20/0xc8
       [    0.760137]  build_sched_domains+0x1034/0x1108
       [    0.764601]  sched_init_domains+0x68/0x90
       [    0.768628]  sched_init_smp+0x30/0x80
       [    0.772309]  kernel_init_freeable+0x278/0x51c
       [    0.776685]  kernel_init+0x10/0x108
       [    0.780190]  ret_from_fork+0x10/0x18
      
      The static_key in question is 'sched_asym_cpucapacity' introduced by
      commit:
      
        df054e84 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
      
      In this particular case, we enable it because smp_prepare_cpus() will
      end up fetching the capacity-dmips-mhz entry from the devicetree,
      so we already have some asymmetry detected when entering sched_init_smp().
      
      This didn't get detected in tip/sched/core because we were missing:
      
        commit cb538267 ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      Calls to build_sched_domains() after sched_init_smp() will hold the
      hotplug lock; it just so happens that this very first call is a
      special case. As stated by a comment in sched_init_smp(), "There's no
      userspace yet to cause hotplug operations", so this is a harmless
      warning.
      
      However, to both respect the semantics of underlying
      callees and make lockdep happy, take the hotplug lock in
      sched_init_smp(). This also satisfies the comment atop
      sched_init_domains() that says "Callers must hold the hotplug lock".
      Reported-by: Sudeep Holla <sudeep.holla@arm.com>
      Tested-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: morten.rasmussen@arm.com
      Cc: quentin.perret@arm.com
      Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
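The locking contract discussed in the commit above (a "_cpuslocked" callee expects its caller to hold the hotplug lock) can be modelled in user space with a depth counter. This is a toy model under stated assumptions: all `*_sketch` names are hypothetical, the "lock" is just a counter, and the warning counter stands in for a lockdep report.

```c
#include <assert.h>
#include <stdbool.h>

static int hotplug_depth_sketch;	/* > 0 means the "hotplug lock" is held */
static int lockdep_warnings_sketch;	/* number of contract violations seen */

static void cpus_read_lock_sketch(void)
{
	hotplug_depth_sketch++;
}

static void cpus_read_unlock_sketch(void)
{
	assert(hotplug_depth_sketch > 0);
	hotplug_depth_sketch--;
}

/* Callee with a "caller must hold the lock" contract. */
static void static_key_enable_cpuslocked_sketch(bool *key)
{
	if (hotplug_depth_sketch == 0)
		lockdep_warnings_sketch++;	/* models the lockdep splat */
	*key = true;
}

/* Intermediate layer that inherits the same contract from its callee. */
static void sched_init_domains_sketch(bool *key)
{
	static_key_enable_cpuslocked_sketch(key);
}

/* Early-boot caller: harmless without the lock, but it breaks the contract. */
static void sched_init_smp_sketch(bool *key, bool take_lock)
{
	if (take_lock)
		cpus_read_lock_sketch();

	sched_init_domains_sketch(key);

	if (take_lock)
		cpus_read_unlock_sketch();
}
```

Calling `sched_init_smp_sketch()` with `take_lock` set to false records one warning; with it set to true the callee's expectation is met and no new warning is recorded, even though the end state of the key is the same in both cases.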
    • sched/topology: Fix off by one bug · 993f0b05
      Peter Zijlstra authored
      With the addition of the NUMA identity level, we increased @level by
      one and will run off the end of the array in the distance sort loop.
      
      Fixed: 051f3ca0 ("sched/topology: Introduce NUMA identity node sched domain")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 29 Oct, 2018 1 commit
  5. 27 Oct, 2018 4 commits
    • i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array · b59dfdae
      Linus Torvalds authored
      Commit 9ee3e066 ("HID: i2c-hid: override HID descriptors for certain
      devices") added a new dmi_system_id quirk table to override certain HID
      report descriptors for some systems that lack them.
      
      But the table wasn't properly terminated, causing the dmi matching to
      walk off into la-la-land, and starting to treat random data as dmi
      descriptor pointers, causing boot-time oopses if you were at all
      unlucky.
      
      Terminate the array.
      
      We really should have some way to just statically check that arrays that
      should be terminated by an empty entry actually are so.  But the HID
      people really should have caught this themselves, rather than have me
      deal with an oops during the merge window.  Tssk, tssk.
      
      Cc: Julian Sax <jsbc@gmx.de>
      Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
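The failure mode described in the commit above is generic to sentinel-terminated tables: the walker stops only when it meets an empty entry. A minimal sketch follows; the `quirk_*` names and table contents are hypothetical and do not reproduce the DMI structures, and the check shown runs at run time rather than being the static check the message wishes for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical quirk entry; a NULL ident marks the end of a table. */
struct quirk_entry {
	const char *ident;
	int value;
};

static const struct quirk_entry quirk_table[] = {
	{ "vendor-a", 1 },
	{ "vendor-b", 2 },
	{ NULL, 0 }	/* terminating entry: omitting it lets walkers run off the end */
};

/* Walk the table until the sentinel; relies entirely on the terminator. */
static const struct quirk_entry *quirk_lookup(const struct quirk_entry *table,
					      const char *ident)
{
	assert(table != NULL);

	for (; table->ident; table++)
		if (strcmp(table->ident, ident) == 0)
			return table;

	return NULL;
}

/* Run-time sanity check usable where the array size is still known. */
static int quirk_table_is_terminated(const struct quirk_entry *table, size_t n)
{
	return n > 0 && table[n - 1].ident == NULL;
}
```

A check like `quirk_table_is_terminated(quirk_table, SKETCH_ARRAY_SIZE(quirk_table))` only works at the definition site, where the array has not yet decayed to a pointer, which is part of why such tables are easy to get wrong.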
    • Merge branch 'akpm' (patches from Andrew) · 345671ea
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - a few misc things
      
       - ocfs2 updates
      
       - most of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
        hugetlbfs: dirty pages as they are added to pagecache
        mm: export add_swap_extent()
        mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
        tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
        mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
        mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
        mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
        mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
        mm/gup: cache dev_pagemap while pinning pages
        Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
        mm: return zero_resv_unavail optimization
        mm: zero remaining unavailable struct pages
        tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
        tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
        tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
        tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
        mm/gup_benchmark.c: add additional pinning methods
        mm/gup_benchmark.c: time put_page()
        mm: don't raise MEMCG_OOM event due to failed high-order allocation
        mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
        ...
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 49040081
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "What better way to start off a weekend than with some networking bug
        fixes:
      
        1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed
           by David Ahern and Bjørn Mork.
      
        2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE
           properly in UDP, from Sean Tranchetti.
      
        3) Remove TCA_OPTIONS from policy validation, it turns out we don't
           consistently use nested attributes for this across all packet
           schedulers. From David Ahern.
      
        4) Fix SKB corruption in cadence driver, from Tristram Ha.
      
        5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit.
      
        6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
        net/neigh: fix NULL deref in pneigh_dump_table()
        net: allow traceroute with a specified interface in a vrf
        bridge: do not add port to router list when receives query with source 0.0.0.0
        net/smc: fix smc_buf_unuse to use the lgr pointer
        ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
        net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
        lan743x: Remove SPI dependency from Microchip group.
        drivers: net: remove <net/busy_poll.h> inclusion when not needed
        net: phy: genphy_10g_driver: Avoid NULL pointer dereference
        r8169: fix broken Wake-on-LAN from S5 (poweroff)
        octeontx2-af: Use GFP_ATOMIC under spin lock
        net: ethernet: cadence: fix socket buffer corruption problem
        net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
        net: sched: Remove TCA_OPTIONS from policy
        ice: Poll for link status change
        ice: Allocate VF interrupts and set queue map
        ice: Introduce ice_dev_onetime_setup
        net: hns3: Fix for warning uninitialized symbol hw_err_lst3
        octeontx2-af: Copy the right amount of memory
        net: udp: fix handling of CHECKSUM_COMPLETE packets
        ...
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · a45dcff7
      Linus Torvalds authored
      Pull sparc fixes from David Miller:
       "Some more sparc fixups, mostly aimed at getting the allmodconfig build
        up and clean again"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc64: Rework xchg() definition to avoid warnings.
        sparc64: Export __node_distance.
        sparc64: Make corrupted user stacks more debuggable.
  6. 26 Oct, 2018 27 commits