1. 21 Feb, 2018 14 commits
    • sched/isolation: Update nohz documentation to explain tick offload · 083c6eea
      Frederic Weisbecker authored
      Update the documentation to reflect the 1Hz tick offload changes.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-8-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/nohz: Remove the 1 Hz tick code · dcdedb24
      Frederic Weisbecker authored
      Now that the 1Hz tick is offloaded to workqueues, we can safely remove
      the residual code that used to handle it locally.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-7-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/isolation: Offload residual 1Hz scheduler tick · d84b3131
      Frederic Weisbecker authored
      When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
      keep the scheduler stats alive. However this residual tick is a burden
      for bare metal tasks that can't stand any interruption at all, or want
      to minimize them.
      
      The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
      outsource these scheduler ticks to the global workqueue so that a
      housekeeping CPU handles those remotely. The sched_class::task_tick()
      implementations have been audited and look safe to be called remotely
      as the target runqueue and its current task are passed as parameters
      and don't seem to be accessed locally.
      
      Note that in the case of using isolcpus, it's still up to the user to
      affine the global workqueues to the housekeeping CPUs through
      /sys/devices/virtual/workqueue/cpumask or domains isolation
      "isolcpus=nohz,domain".
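
A minimal sketch of the manual step described above, assuming the cpumask file accepts a hexadecimal bitmask with large masks split into comma-separated 32-bit words. The helper names and CPU lists are hypothetical; only the sysfs path comes from the text. Writing the file needs root, so the write helper is defined but not called here.

```python
from pathlib import Path

# Path taken from the commit message; everything else here is illustrative.
WQ_CPUMASK = Path("/sys/devices/virtual/workqueue/cpumask")


def cpus_to_hexmask(cpus):
    """Return a hex bitmask string with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    digits = format(mask, "x")
    # Split into 32-bit (8 hex digit) words, most significant word first.
    groups = []
    while digits:
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))


def restrict_workqueues(housekeeping_cpus, path=WQ_CPUMASK):
    """Write the housekeeping mask to the workqueue cpumask file (needs root)."""
    path.write_text(cpus_to_hexmask(housekeeping_cpus) + "\n")


if __name__ == "__main__":
    # Example: CPUs 0 and 1 are housekeepers, the rest are isolated.
    print(cpus_to_hexmask([0, 1]))
```

With `isolcpus=nohz,domain` or `nohz_full=` this step should not be needed, per the text above.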
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/isolation: Isolate workqueues when "nohz_full=" is set · 1bda3f80
      Frederic Weisbecker authored
      As we prepare for offloading the residual 1Hz scheduler ticks to
      workqueues, let's affine those to housekeepers so that they don't
      interrupt the CPUs that don't want to be disturbed.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • nohz: Allow to check if remote CPU tick is stopped · 22ab8bc0
      Frederic Weisbecker authored
      This check is racy but provides a good heuristic to determine whether
      a CPU may need a remote tick or not.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-4-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • nohz: Convert tick_nohz_tick_stopped() to bool · a3642983
      Frederic Weisbecker authored
      It makes this function more self-explanatory about what it does and how
      to use it.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-3-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Rename init_rq_hrtick() to hrtick_rq_init() · 77a021be
      Frederic Weisbecker authored
      Do that rename in order to normalize the hrtick namespace.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-2-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() · 7347fc87
      Mel Gorman authored
      If wake_affine() pulls a task to another node for any reason and the node is
      no longer preferred then temporarily stop automatic NUMA balancing pulling
      the task back. Otherwise, tasks with a strong waker/wakee relationship
      may constantly fight automatic NUMA balancing over where a task should
      be placed.
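
The back-off described above can be modelled roughly as follows. This is a toy sketch, not the kernel code: the function name, the parameters and the fixed `backoff` value are all hypothetical, and time is an abstract counter rather than jiffies.

```python
def next_numa_retry(now, prev_node, target_node, preferred_node, backoff):
    """Return the earliest time NUMA balancing may retry placement.

    Returns None when no delay is imposed. All names are illustrative.
    """
    if preferred_node is None:
        # No preferred node yet: keep gathering data, nothing to delay.
        return None
    if prev_node == target_node:
        # The wakeup did not move the task across nodes.
        return None
    if target_node == preferred_node:
        # The task landed on its preferred node: nothing to pull back.
        return None
    # wake_affine() moved the task off its preferred node: hold off retries.
    return now + backoff


if __name__ == "__main__":
    print(next_numa_retry(now=100, prev_node=0, target_node=1,
                          preferred_node=0, backoff=50))
```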
      
      Once again netperf is interesting here. The performance barely changes
      but automatic NUMA balancing is interesting:
      
       Hmean     send-64         354.67 (   0.00%)      352.15 (  -0.71%)
       Hmean     send-128        702.91 (   0.00%)      693.84 (  -1.29%)
       Hmean     send-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
       Hmean     send-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
       Hmean     send-2048      9687.44 (   0.00%)     9624.45 (  -0.65%)
       Hmean     send-3312     14577.64 (   0.00%)    14514.35 (  -0.43%)
       Hmean     send-4096     16393.62 (   0.00%)    16488.30 (   0.58%)
       Hmean     send-8192     26877.26 (   0.00%)    26431.63 (  -1.66%)
       Hmean     send-16384    38683.43 (   0.00%)    38264.91 (  -1.08%)
       Hmean     recv-64         354.67 (   0.00%)      352.15 (  -0.71%)
       Hmean     recv-128        702.91 (   0.00%)      693.84 (  -1.29%)
       Hmean     recv-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
       Hmean     recv-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
       Hmean     recv-2048      9687.43 (   0.00%)     9624.45 (  -0.65%)
       Hmean     recv-3312     14577.59 (   0.00%)    14514.35 (  -0.43%)
       Hmean     recv-4096     16393.55 (   0.00%)    16488.20 (   0.58%)
       Hmean     recv-8192     26876.96 (   0.00%)    26431.29 (  -1.66%)
       Hmean     recv-16384    38682.41 (   0.00%)    38263.94 (  -1.08%)
      
       NUMA alloc hit                 1465986     1423090
       NUMA alloc miss                      0           0
       NUMA interleave hit                  0           0
       NUMA alloc local               1465897     1423003
       NUMA base PTE updates             1473        1420
       NUMA huge PMD updates                0           0
       NUMA page range updates           1473        1420
       NUMA hint faults                  1383        1312
       NUMA hint local faults             451         124
       NUMA hint local percent             32           9
      
      There is a slight degradation in performance but there are slightly fewer
      NUMA faults. There is a large drop in the percentage of local faults but
      the bulk of migrations for netperf are in small shared libraries so it's
      reflecting the fact that automatic NUMA balancing has backed off. This is
      a case where, despite wake_affine() and automatic NUMA balancing fighting
      for placement, there is a marginal benefit to rescheduling to local
      data quickly. However, it should be noted that wake_affine() and automatic
      NUMA balancing fighting each other constantly is undesirable.
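
As a worked check of the "NUMA hint local percent" row, the reported values appear consistent with local faults divided by total hint faults, truncated to an integer. The sketch below assumes that truncation; the function name is hypothetical.

```python
def hint_local_percent(local_faults, total_faults):
    """Percentage of NUMA hint faults that were local, truncated to an int."""
    if total_faults == 0:
        return 0
    return local_faults * 100 // total_faults


if __name__ == "__main__":
    # Baseline and patched columns from the netperf table above.
    print(hint_local_percent(451, 1383), hint_local_percent(124, 1312))
```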
      
      However, the benefit in other cases is large. This is the result for NAS
      with the D class sizing on a 4-socket machine:
      
       nas-mpi
                                 4.15.0                 4.15.0
                           sdnuma-v1r23       delayretry-v1r23
       Time cg.D      557.00 (   0.00%)      431.82 (  22.47%)
       Time ep.D       77.83 (   0.00%)       79.01 (  -1.52%)
       Time is.D       26.46 (   0.00%)       26.64 (  -0.68%)
       Time lu.D      727.14 (   0.00%)      597.94 (  17.77%)
       Time mg.D      191.35 (   0.00%)      146.85 (  23.26%)
      
                      4.15.0            4.15.0
                sdnuma-v1r23  delayretry-v1r23
       User         75665.20          70413.30
       System       20321.59           8861.67
       Elapsed        766.13            634.92
      
       Minor Faults                  16528502     7127941
       Major Faults                      4553        5068
       NUMA alloc local               6963197     6749135
       NUMA base PTE updates        366409093   107491434
       NUMA huge PMD updates           687556      198880
       NUMA page range updates      718437765   209317994
       NUMA hint faults              13643410     4601187
       NUMA hint local faults         9212593     3063996
       NUMA hint local percent             67          66
      
      Note the massive reduction in system CPU usage even though the percentage
      of local faults is barely affected. There is a massive reduction in the
      number of PTE updates showing that automatic NUMA balancing has backed off.
      A critical observation is also that there is a massive reduction in minor
      faults which is due to far fewer NUMA hinting faults being trapped.
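
The bracketed percentages in these tables look like gains relative to the first (baseline) column, with the sign flipped depending on whether lower or higher is better. A small sketch under that assumption follows; the function name and the `lower_is_better` flag are illustrative.

```python
def gain_percent(baseline, patched, lower_is_better=True):
    """Relative gain of `patched` over `baseline`, in percent.

    Positive means the patched kernel did better.
    """
    if lower_is_better:
        delta = baseline - patched   # e.g. elapsed time, CPU seconds
    else:
        delta = patched - baseline   # e.g. throughput (Hmean)
    return 100.0 * delta / baseline


if __name__ == "__main__":
    # cg.D elapsed time from the nas-mpi table.
    print(round(gain_percent(557.00, 431.82), 2))
    # System CPU time from the same run.
    print(round(gain_percent(20321.59, 8861.67), 2))
```

Applied to the System row, the same formula puts the reduction in system CPU time at a little over half.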
      
      There were questions on NAS OMP and how it behaved related to threads
      being bound to CPUs. First, there are more gains than losses with this
      patch applied and a reduction in system CPU usage:
      
      nas-omp
                            4.16.0-rc1             4.16.0-rc1
                           sdnuma-v2r1        delayretry-v2r1
      Time bt.D      436.71 (   0.00%)      430.05 (   1.53%)
      Time cg.D      201.02 (   0.00%)      180.87 (  10.02%)
      Time ep.D       32.84 (   0.00%)       32.68 (   0.49%)
      Time is.D        9.63 (   0.00%)        9.64 (  -0.10%)
      Time lu.D      331.20 (   0.00%)      304.80 (   7.97%)
      Time mg.D       54.87 (   0.00%)       52.72 (   3.92%)
      Time sp.D     1108.78 (   0.00%)      917.10 (  17.29%)
      Time ua.D      378.81 (   0.00%)      398.83 (  -5.28%)
      
                  4.16.0-rc1        4.16.0-rc1
                 sdnuma-v2r1   delayretry-v2r1
      User         305633.08         296751.91
      System          451.75            357.80
      Elapsed        2595.73           2368.13
      
      However, it does not close the gap between binding and being unbound. There
      is negligible difference between the performance of the baseline and a
      patched kernel when threads are bound so it is not presented here:
      
                            4.16.0-rc1             4.16.0-rc1
                       delayretry-bind     delayretry-unbound
      Time bt.D      385.02 (   0.00%)      430.05 ( -11.70%)
      Time cg.D      144.02 (   0.00%)      180.87 ( -25.59%)
      Time ep.D       32.85 (   0.00%)       32.68 (   0.52%)
      Time is.D       10.52 (   0.00%)        9.64 (   8.37%)
      Time lu.D      285.31 (   0.00%)      304.80 (  -6.83%)
      Time mg.D       43.21 (   0.00%)       52.72 ( -22.01%)
      Time sp.D      820.24 (   0.00%)      917.10 ( -11.81%)
      Time ua.D      337.09 (   0.00%)      398.83 ( -18.32%)
      
                     4.16.0-rc1          4.16.0-rc1
                delayretry-bind  delayretry-unbound
      User            277731.25           296751.91
      System             261.29              357.80
      Elapsed           2100.55             2368.13
      
      Unfortunately, while performance is improved by the patch, there is still
      quite a long way to go before it's equivalent to hard binding.
      
      Other workloads like hackbench, tbench, dbench and schbench are barely
      affected. dbench shows a mix of gains and losses depending on the machine
      although in general, the results are more stable.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on · 2c833627
      Mel Gorman authored
      find_idlest_group() compares a local group with each other group to select
      the one that is most idle. When comparing groups in different NUMA domains,
      a very slight imbalance is enough to select a remote NUMA node even if the
      runnable load on both groups is 0 or close to 0. This ignores the cost of
      remote accesses entirely and is a problem when selecting the CPU for a
      newly forked task to run on. In particular, a forking server
      is almost guaranteed to run on a remote node, incurring numerous remote
      accesses and potentially causing automatic NUMA balancing to try to migrate
      the task back or migrate the data to another node. Similar weirdness is
      observed if a basic shell command pipes output to another as each process
      in the pipeline is likely to start on different nodes and then get adjusted
      later by wake_affine().
      
      This patch adds imbalance to remote domains when considering whether to
      select CPUs from remote domains. If the local domain is selected, imbalance
      will still be used to try to select a CPU from a lower scheduler domain's group
      instead of stacking tasks on the same CPU.
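
The idea of penalising remote NUMA groups by the imbalance margin can be sketched as below. This is a deliberately simplified model and not the find_idlest_group() logic itself: the function name, the scalar "load" values and the non-NUMA fallback comparison are assumptions made for illustration.

```python
def prefer_local_group(local_load, remote_load, imbalance, crosses_numa):
    """Return True if the local group should be kept over the remote one.

    Toy model: loads are plain numbers and `imbalance` is a fixed margin.
    """
    if crosses_numa:
        # Add the margin to the remote side so a very slight imbalance is
        # not enough to justify starting the task on another node.
        return remote_load + imbalance >= local_load
    # Within a node there is no remote-access cost in this model, so any
    # strictly lower load on the other group wins.
    return remote_load >= local_load


if __name__ == "__main__":
    # Slightly idler remote group: stay local across NUMA, move otherwise.
    print(prefer_local_group(10, 8, 5, crosses_numa=True))
    print(prefer_local_group(10, 8, 5, crosses_numa=False))
```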
      
      A variety of workloads and machines were tested and as expected, there is no
      difference on UMA. The difference on NUMA can be dramatic. This is a comparison
      of elapsed times running the git regression test suite. It's fork-intensive with
      short-lived processes:
      
                                        4.15.0                 4.15.0
                                  noexit-v1r23           sdnuma-v1r23
       Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
       Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
       Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
       Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
       Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)
      
                     4.15.0      4.15.0
               noexit-v1r23 sdnuma-v1r23
       User         5434.12     5188.41
       System       4878.77     3467.09
       Elapsed     10259.06     8624.21
      
      That shows a considerable reduction in elapsed times. It's important to
      note that automatic NUMA balancing does not affect this load as processes
      are too short-lived.
      
      There is also a noticeable impact on hackbench such as this example using
      processes and pipes:
      
       hackbench-process-pipes
                                     4.15.0                 4.15.0
                               noexit-v1r23           sdnuma-v1r23
       Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
       Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
       Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
       Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
       Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
       Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
       Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
       Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
       Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
       Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
       Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
       Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
       Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
       Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
       Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)
      
      It's not a universal win as there are occasions when spreading wide and
      quickly is a benefit but it's more of a win than it is a loss. For other
      workloads, there is little difference but netperf is interesting. Without
      the patch, the server and client start on different nodes but quickly get
      migrated due to wake_affine(). Hence, the difference in overall performance
      is marginal but detectable:
      
                                            4.15.0                 4.15.0
                                      noexit-v1r23           sdnuma-v1r23
       Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
       Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
       Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
       Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
       Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
       Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
       Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
       Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
       Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
       Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
       Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
       Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
       Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
       Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
       Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
       Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
       Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
       Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)
      
      However, what is very interesting is how automatic NUMA balancing behaves.
      Each netperf instance runs long enough for balancing to activate:
      
       NUMA base PTE updates             4620        1473
       NUMA huge PMD updates                0           0
       NUMA page range updates           4620        1473
       NUMA hint faults                  4301        1383
       NUMA hint local faults            1309         451
       NUMA hint local percent             30          32
       NUMA pages migrated               1335         491
       AutoNUMA cost                      21%          6%
      
      There is an unfortunate number of remote faults although tracing indicated
      that the vast majority are in shared libraries. However, the tendency to
      start tasks on the same node if there is capacity means that there were
      far fewer PTE updates and faults incurred overall.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Do not migrate due to a sync wakeup on exit · 24d0c1d6
      Peter Zijlstra authored
      When a task exits, it notifies the parent that it has exited. This is a
      sync wakeup and the exiting task may pull the parent towards the waker's
      CPU. For simple workloads like using a shell, it was observed that the
      shell is pulled across nodes by exiting processes. This is daft as the
      parent may be long-lived and properly placed. This patch special-cases a
      sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
      of workloads and machines showed very little difference in performance
      although there was a small 3% boost on some machines running a
      shell-script-intensive workload (git regression test suite).
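
One way to picture the special case is to stop treating the wakeup as "sync" when the waker is exiting, so the parent is not pulled towards a CPU that is about to become irrelevant. The sketch below uses plain booleans and hypothetical names; it is not the kernel's actual flag handling.

```python
def effective_sync(is_sync_wakeup, waker_is_exiting):
    """Treat a wakeup as sync only if the waker is not in the middle of exiting."""
    return is_sync_wakeup and not waker_is_exiting


def pull_towards_waker(is_sync_wakeup, waker_is_exiting, waker_cpu_otherwise_idle):
    """Toy decision: pull the wakee to the waker's CPU only on a real sync wakeup."""
    return (effective_sync(is_sync_wakeup, waker_is_exiting)
            and waker_cpu_otherwise_idle)


if __name__ == "__main__":
    # An exiting child notifying its parent no longer drags the parent over.
    print(pull_towards_waker(True, True, True))
```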
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Do not migrate on wake_affine_weight() if weights are equal · 082f764a
      Mel Gorman authored
      wake_affine_weight() will consider migrating a task to, or near, the current
      CPU if there is a load imbalance. If the CPUs share LLC then either CPU
      is valid as a search-for-idle-sibling target and equally appropriate for
      stacking two tasks on one CPU if an idle sibling is unavailable. If they do
      not share cache then a cross-node migration potentially impacts locality
      so while they are equal from a CPU capacity point of view, they are not
      equal in terms of memory locality. In either case, it's more appropriate
      to migrate only if there is a difference in their effective load.
      
      This patch modifies wake_affine_weight() to only consider migrating a task
      if there is a load imbalance for normal wakeups but will allow potential
      stacking if the loads are equal and it's a sync wakeup.
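
The rule stated above can be sketched as a small decision function. The names are hypothetical, and expressing the sync case as a +1 tie-break on the previous CPU's load is an assumption about one way to encode "allow stacking when loads are equal", not a claim about the patch's exact code.

```python
def wake_affine_weight_target(this_eff_load, prev_eff_load, sync):
    """Return "this_cpu" to migrate towards the waker, or None to stay put.

    Loads are abstract integers in this sketch.
    """
    if sync:
        # On a sync wakeup, let equal loads tip in favour of the waker's CPU.
        prev_eff_load += 1
    if this_eff_load < prev_eff_load:
        return "this_cpu"
    return None


if __name__ == "__main__":
    print(wake_affine_weight_target(5, 5, sync=False))  # equal loads, normal wakeup
    print(wake_affine_weight_target(5, 5, sync=True))   # equal loads, sync wakeup
```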
      
      For the most part, the difference in performance is marginal. For example,
      on a 4-socket server running netperf UDP_STREAM on localhost the differences
      are as follows:
      
                                            4.15.0                 4.15.0
                                             16rc0          noequal-v1r23
       Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
       Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
       Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
       Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
       Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
       Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
       Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
       Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
       Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
       Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
       Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
       Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
       Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
       Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
       Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
       Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
       Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
       Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
       Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
       Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
       Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
       Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
       Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
       Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
       Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
       Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
       Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
       Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
       Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
       Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
       Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
       Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
       Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
       Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
       Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
       Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)
      
      The difference in overall performance is marginal but note that most
      measurements are less variable. There were similar observations for other
      netperf comparisons. hackbench with sockets or threads with processes or
      threads showed minor difference with some reduction of migration. tbench
      showed only marginal differences that were within the noise. dbench,
      regardless of filesystem, showed minor differences all of which are
      within noise. Multiple machines, both UMA and NUMA were tested without
      any regressions showing up.
      
      The biggest risk with a patch like this is affecting wakeup latencies.
      However, the schbench load from Facebook which is very sensitive to wakeup
      latency showed a mixed result with mostly improvements in wakeup latency:
      
                                            4.15.0                 4.15.0
                                             16rc0          noequal-v1r23
       Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
       Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
       Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
       Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
       Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
       Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
       Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
       Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
       Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
       Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
       Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
       Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
       Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
       Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
       Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
       Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
       Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
       Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
       Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
       Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
       Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
       Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
       Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
       Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
       Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
       Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
       Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
       Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
       Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
       Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
       Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
       Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
       Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
       Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
       Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
       Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
       Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
       Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
       Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
       Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
       Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
       Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
       Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
       Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
       Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
       Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
       Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
       Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
       Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      082f764a
    • sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed · eeb60398
      Mel Gorman authored
      On sync wakeups, the previous CPU effective load may not be used so delay
      the calculation until it's needed.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eeb60398
    • sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine · 7ebb66a1
      Mel Gorman authored
      The only caller of wake_affine() knows the CPU ID. Pass it in instead of
      rechecking it.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7ebb66a1
    • Ingo Molnar's avatar
      ed029343
  2. 19 Feb, 2018 1 commit
  3. 18 Feb, 2018 4 commits
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0e06fb5b
      Linus Torvalds authored
      Pull x86 Kconfig fixes from Thomas Gleixner:
       "Three patchlets to correct HIGHMEM64G and CMPXCHG64 dependencies in
        Kconfig when CPU selections are explicitly set to M586 or M686"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/Kconfig: Explicitly enumerate i686-class CPUs in Kconfig
        x86/Kconfig: Exclude i586-class CPUs lacking PAE support from the HIGHMEM64G Kconfig group
        x86/Kconfig: Add missing i586-class CPUs to the X86_CMPXCHG64 Kconfig group
      0e06fb5b
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9ca2c16f
      Linus Torvalds authored
      Pull perf updates from Thomas Gleixner:
       "Perf tool updates and kprobe fixes:
      
         - perf_mmap overwrite mode fixes/overhaul, prep work to get 'perf
           top' using it, making it bearable to use it in large core count
           systems such as Knights Landing/Mill Intel systems (Kan Liang)
      
         - s/390 now uses syscall.tbl, just like x86-64 to generate the
           syscall table id -> string tables used by 'perf trace' (Hendrik
           Brueckner)
      
         - Use strtoull() instead of home grown function (Andy Shevchenko)
      
         - Synchronize kernel ABI headers, v4.16-rc1 (Ingo Molnar)
      
         - Document missing 'perf data --force' option (Sangwon Hong)
      
         - Add perf vendor JSON metrics for ARM Cortex-A53 Processor (William
           Cohen)
      
         - Improve error handling and error propagation of ftrace based
           kprobes so failures when installing kprobes are not silently
           ignored and create dysfunctional tracepoints"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
        kprobes: Propagate error from disarm_kprobe_ftrace()
        kprobes: Propagate error from arm_kprobe_ftrace()
        Revert "tools include s390: Grab a copy of arch/s390/include/uapi/asm/unistd.h"
        perf s390: Rework system call table creation by using syscall.tbl
        perf s390: Grab a copy of arch/s390/kernel/syscall/syscall.tbl
        tools/headers: Synchronize kernel ABI headers, v4.16-rc1
        perf test: Fix test trace+probe_libc_inet_pton.sh for s390x
        perf data: Document missing --force option
        perf tools: Substitute yet another strtoull()
        perf top: Check the latency of perf_top__mmap_read()
        perf top: Switch default mode to overwrite mode
        perf top: Remove lost events checking
        perf hists browser: Add parameter to disable lost event warning
        perf top: Add overwrite fall back
        perf evsel: Expose the perf_missing_features struct
        perf top: Check per-event overwrite term
        perf mmap: Discard legacy interface for mmap read
        perf test: Update mmap read functions for backward-ring-buffer test
        perf mmap: Introduce perf_mmap__read_event()
        perf mmap: Introduce perf_mmap__read_done()
        ...
      9ca2c16f
    • Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2d6c4e40
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "A small set of updates mostly for irq chip drivers:
      
         - MIPS GIC fix for spurious, masked interrupts
      
         - fix for a subtle IPI bug in GICv3
      
         - do not probe GICv3 ITSs that are marked as disabled
      
         - multi-MSI support for GICv2m
      
         - various small cleanups"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro
        irqchip/bcm: Remove hashed address printing
        irqchip/gic-v2m: Add PCI Multi-MSI support
        irqchip/gic-v3: Ignore disabled ITS nodes
        irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq()
        irqchip/gic-v3: Change pr_debug message to pr_devel
        irqchip/mips-gic: Avoid spuriously handling masked interrupts
      2d6c4e40
    • Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 59e47215
      Linus Torvalds authored
      Pull core fix from Thomas Gleixner:
       "A small fix which adds the missing for_each_cpu_wrap() stub for the UP
        case to avoid build failures"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        cpumask: Make for_each_cpu_wrap() available on UP as well
      59e47215
  4. 17 Feb, 2018 11 commits
    • Merge tag 'for-linus-20180217' of git://git.kernel.dk/linux-block · c786427f
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - NVMe pull request from Keith, with fixes all over the map for nvme.
         From various folks.
      
       - Classic polling fix, that avoids a latency issue where we still end
         up waiting for an interrupt in some cases. From Nitesh Shetty.
      
       - Comment typo fix from Minwoo Im.
      
      * tag 'for-linus-20180217' of git://git.kernel.dk/linux-block:
        block: fix a typo in comment of BLK_MQ_POLL_STATS_BKTS
        nvme-rdma: fix sysfs invoked reset_ctrl error flow
        nvmet: Change return code of discard command if not supported
        nvme-pci: Fix timeouts in connecting state
        nvme-pci: Remap CMB SQ entries on every controller reset
        nvme: fix the deadlock in nvme_update_formats
        blk: optimization for classic polling
        nvme: Don't use a stack buffer for keep-alive command
        nvme_fc: cleanup io completion
        nvme_fc: correct abort race condition on resets
        nvme: Fix discard buffer overrun
        nvme: delete NVME_CTRL_LIVE --> NVME_CTRL_CONNECTING transition
        nvme-rdma: use NVME_CTRL_CONNECTING state to mark init process
        nvme: rename NVME_CTRL_RECONNECTING state to NVME_CTRL_CONNECTING
      c786427f
    • Merge tag 'mmc-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · fa2139ef
      Linus Torvalds authored
      Pull MMC fixes from Ulf Hansson:
      
       - meson-gx: Revert to earlier tuning process
      
       - bcm2835: Don't overwrite max frequency unconditionally
      
      * tag 'mmc-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: bcm2835: Don't overwrite max frequency unconditionally
        Revert "mmc: meson-gx: include tx phase in the tuning process"
      fa2139ef
    • Merge tag 'mtd/fixes-for-4.16-rc2' of git://git.infradead.org/linux-mtd · 4b6415f9
      Linus Torvalds authored
      Pull mtd fixes from Boris Brezillon:
      
       - add missing dependency to NAND_MARVELL Kconfig entry
      
       - use the appropriate OOB layout in the VF610 driver
      
      * tag 'mtd/fixes-for-4.16-rc2' of git://git.infradead.org/linux-mtd:
        mtd: nand: MTD_NAND_MARVELL should depend on HAS_DMA
        mtd: nand: vf610: set correct ooblayout
      4b6415f9
    • Merge tag 'powerpc-4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · ee78ad78
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "The main attraction is a fix for a bug in the new drmem code, which
        was causing an oops on boot on some versions of Qemu.
      
        There's also a fix for XIVE (Power9 interrupt controller) on KVM, as
        well as a few other minor fixes.
      
        Thanks to: Corentin Labbe, Cyril Bur, Cédric Le Goater, Daniel Black,
        Nathan Fontenot, Nicholas Piggin"
      
      * tag 'powerpc-4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/pseries: Check for zero filled ibm,dynamic-memory property
        powerpc/pseries: Add empty update_numa_cpu_lookup_table() for NUMA=n
        powerpc/powernv: IMC fix out of bounds memory access at shutdown
        powerpc/xive: Use hw CPU ids when configuring the CPU queues
        powerpc: Expose TSCR via sysfs only on powernv
      ee78ad78
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 74688a02
      Linus Torvalds authored
      Pull arm64 fixes from Catalin Marinas:
       "The bulk of this is the pte accessors annotation to READ/WRITE_ONCE
        (we tried to avoid pushing this during the merge window to avoid
        conflicts)
      
         - Updated the page table accessors to use READ/WRITE_ONCE and prevent
           compiler transformation that could lead to an apparent loss of
           coherency
      
         - Enabled branch predictor hardening for the Falkor CPU
      
         - Fix interaction between kpti enabling and KASan causing the
           recursive page table walking to take a significant time
      
         - Fix some sparse warnings"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: cputype: Silence Sparse warnings
        arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
        arm64: proc: Set PTE_NG for table entries to avoid traversing them twice
        arm64: Add missing Falkor part number for branch predictor hardening
      74688a02
    • Merge tag 'for-linus-4.16a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · f73f047d
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - fixes for the Xen pvcalls frontend driver
      
       - fix for booting Xen pv domains
      
       - fix for the xenbus driver user interface
      
      * tag 'for-linus-4.16a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        pvcalls-front: wait for other operations to return when release passive sockets
        pvcalls-front: introduce a per sock_mapping refcount
        x86/xen: Calculate __max_logical_packages on PV domains
        xenbus: track caller request id
      f73f047d
    • pvcalls-front: wait for other operations to return when release passive sockets · d1a75e08
      Stefano Stabellini authored
      Passive sockets can have ongoing operations on them; specifically, we
      have two wait_event_interruptible calls in pvcalls_front_accept.
      
      Add two wake_up calls in pvcalls_front_release, then wait for the
      potential waiters to return and release the sock_mapping refcount.
      Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
      Acked-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      d1a75e08
    • pvcalls-front: introduce a per sock_mapping refcount · 64d68718
      Stefano Stabellini authored
      Introduce a per sock_mapping refcount, in addition to the existing
      global refcount. Thanks to the sock_mapping refcount, we can safely wait
      for it to be 1 in pvcalls_front_release before freeing an active socket,
      instead of waiting for the global refcount to be 1.
      Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
      Acked-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      64d68718
    • x86/xen: Calculate __max_logical_packages on PV domains · 63e708f8
      Prarit Bhargava authored
      The kernel panics on PV domains because native_smp_cpus_done() is
      only called for HVM domains.
      
      Calculate __max_logical_packages for PV domains.
      
      Fixes: b4c0a732 ("x86/smpboot: Fix __max_logical_packages estimate")
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Tested-and-reported-by: Simon Gaiser <simon@invisiblethingslab.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: xen-devel@lists.xenproject.org
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      63e708f8
    • xenbus: track caller request id · 29fee6ee
      Joao Martins authored
      Commit fd8aa909 ("xen: optimize xenbus driver for multiple concurrent
      xenstore accesses") optimized concurrent xenbus accesses but in doing so
      broke the UABI of /dev/xen/xenbus. Through /dev/xen/xenbus, applications
      are in charge of the xenbus message exchange, supplying the correct header
      and body. Since that commit, the replies an application receives no longer
      have the header req_id echoed back as it was sent in the request (see the
      specification below for reference), because the kernel overwrites that
      field.
      
      struct xsd_sockmsg
      {
        uint32_t type;  /* XS_??? */
        uint32_t req_id;/* Request identifier, echoed in daemon's response.  */
        uint32_t tx_id; /* Transaction id (0 if not related to a transaction). */
        uint32_t len;   /* Length of data following this. */
      
        /* Generally followed by nul-terminated string(s). */
      };
      
      Previously there was only one request at a time, so req_id could simply be
      forwarded back and forth. To allow simultaneous requests, each message
      needs a different req_id, so the kernel keeps a monotonically increasing
      counter for this field and writes it on every request, irrespective of the
      value userspace supplied.
      
      Forwarding the userspace req_id again is not a solution, because it would
      open the possibility of a userspace-generated req_id colliding with
      kernel ones. This patch instead preserves the user req_id artificially
      while keeping the xenbus logic as is: it saves the original req_id before
      xs_send(), uses the private kernel counter as the req_id, and, once the
      reply has arrived and been validated, restores the original req_id.
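
The save-and-restore scheme described in this message can be sketched roughly as follows. This is a minimal userspace illustration, not the kernel's code: `struct pending_req`, `prepare_request()`, `complete_reply()` and `next_req_id` are hypothetical names introduced here, and only the `struct xsd_sockmsg` layout comes from the message itself.

```c
#include <assert.h>
#include <stdint.h>

/* Header layout as quoted in the commit message. */
struct xsd_sockmsg {
    uint32_t type;
    uint32_t req_id;
    uint32_t tx_id;
    uint32_t len;
};

/* Hypothetical per-request bookkeeping (not the kernel's real structure). */
struct pending_req {
    uint32_t user_req_id;   /* value the caller put in the header */
    uint32_t kernel_req_id; /* value actually sent to the daemon */
};

/* Hypothetical private counter; a real driver would protect it with a lock. */
static uint32_t next_req_id;

/* Before sending: remember the caller's req_id and substitute the counter. */
static void prepare_request(struct xsd_sockmsg *msg, struct pending_req *req)
{
    req->user_req_id = msg->req_id;
    req->kernel_req_id = next_req_id++;
    msg->req_id = req->kernel_req_id;
}

/*
 * After receiving a reply: validate it against the id that was sent, then
 * restore the caller's original req_id. Returns 0 on success, -1 on mismatch.
 */
static int complete_reply(struct xsd_sockmsg *reply,
                          const struct pending_req *req)
{
    if (reply->req_id != req->kernel_req_id)
        return -1;
    reply->req_id = req->user_req_id;
    return 0;
}
```

Because the caller's value never reaches the daemon, two callers that happen to pick the same req_id cannot collide, yet each still sees its own value echoed back.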
      
      Cc: <stable@vger.kernel.org> # 4.11
      Fixes: fd8aa909 ("xen: optimize xenbus driver for multiple concurrent xenstore accesses")
      Reported-by: Bhavesh Davda <bhavesh.davda@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      29fee6ee
    • arm64: cputype: Silence Sparse warnings · e1a50de3
      Robin Murphy authored
      Sparse makes a fair bit of noise about our MPIDR mask being implicitly
      long - let's explicitly describe it as such rather than just relying on
      the value forcing automatic promotion.
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      e1a50de3
  5. 16 Feb, 2018 10 commits
    • Merge tag 'dma-mapping-4.16-2' of git://git.infradead.org/users/hch/dma-mapping · 1e3510b2
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
       "A few dma-mapping fixes for the fallout from the changes in rc1"
      
      * tag 'dma-mapping-4.16-2' of git://git.infradead.org/users/hch/dma-mapping:
        powerpc/macio: set a proper dma_coherent_mask
        dma-mapping: fix a comment typo
        dma-direct: comment the dma_direct_free calling convention
        dma-direct: mark as is_phys
        ia64: fix build failure with CONFIG_SWIOTLB
      1e3510b2
    • arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables · 20a004e7
      Will Deacon authored
      In many cases, page tables can be accessed concurrently by either another
      CPU (due to things like fast gup) or by the hardware page table walker
      itself, which may set access/dirty bits. In such cases, it is important
      to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
      entries cannot be torn, merged or subject to apparent loss of coherence
      due to compiler transformations.
      
      Whilst there are some scenarios where this cannot happen (e.g. pinned
      kernel mappings for the linear region), the overhead of using READ_ONCE
      /WRITE_ONCE everywhere is minimal and makes the code an awful lot easier
      to reason about. This patch consistently uses these macros in the arch
      code, as well as explicitly namespacing pointers to page table entries
      from the entries themselves by adopting a 'p' suffix for the former
      (as is sometimes used elsewhere in the kernel source).
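
As a rough illustration of the access pattern and the 'p' suffix convention, the sketch below uses userspace stand-ins built on volatile casts. The macro names `READ_ONCE_U64` and `WRITE_ONCE_U64`, the `pte_t` typedef and `set_flag()` are assumptions made for this example; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a page table entry type. */
typedef uint64_t pte_t;

/*
 * Userspace approximations of READ_ONCE/WRITE_ONCE: the volatile cast makes
 * the compiler emit exactly one access and stops it from re-reading the
 * location or merging accesses.
 */
#define READ_ONCE_U64(x)     (*(const volatile uint64_t *)&(x))
#define WRITE_ONCE_U64(x, v) (*(volatile uint64_t *)&(x) = (v))

/*
 * 'ptep' is the pointer to the entry, 'pte' is a snapshot of the entry.
 * All decisions are made on the snapshot, so a concurrent update cannot
 * change the value between two uses. Note that this is not an atomic
 * read-modify-write; it only constrains compiler transformations.
 */
static pte_t set_flag(pte_t *ptep, pte_t flag)
{
    pte_t pte = READ_ONCE_U64(*ptep);

    pte |= flag;
    WRITE_ONCE_U64(*ptep, pte);
    return pte;
}
```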
      Tested-by: Yury Norov <ynorov@caviumnetworks.com>
      Tested-by: Richard Ruigrok <rruigrok@codeaurora.org>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      20a004e7
    • mm: hide a #warning for COMPILE_TEST · af27d940
      Arnd Bergmann authored
      We get a warning about some slow configurations in randconfig kernels:
      
        mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]
      
      The warning is reasonable by itself, but gets in the way of randconfig
      build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.
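
The kind of guard described here can be sketched as a preprocessor condition. This is an illustration only: the macro `LAST_CPUPID_NOT_IN_PAGE_FLAGS` is assumed to be the existing trigger for the warning (it is not quoted in this message), and `WARNING_ENABLED` and `warning_enabled()` are hypothetical helpers added so the effect can be observed.

```c
#include <assert.h>

/* Simulate a randconfig build: slow layout selected and COMPILE_TEST set. */
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS 1
#define CONFIG_COMPILE_TEST 1

/* The warning fires only when the build is not a compile-test build. */
#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
#define WARNING_ENABLED 1
#else
#define WARNING_ENABLED 0
#endif

/* Hypothetical helper reporting whether the warning branch was compiled. */
static int warning_enabled(void)
{
    return WARNING_ENABLED;
}
```

Removing the `CONFIG_COMPILE_TEST` definition from the sketch brings the warning back, which is the behaviour regular builds keep.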
      
      The warning was added in 2013 in commit 75980e97 ("mm: fold
      page->_last_nid into page->flags where possible").
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af27d940
    • Merge tag 'mips_fixes_4.16_2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips · 78352f18
      Linus Torvalds authored
      Pull MIPS fixes from James Hogan:
       "A few fixes for outstanding MIPS issues:
      
         - an __init section mismatch warning when brcmstb_pm is enabled
      
         - a regression handling multiple mem=X@Y arguments (4.11)
      
         - a USB Kconfig select warning, and related sparc cleanup (4.16)"
      
      * tag 'mips_fixes_4.16_2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips:
        sparc,leon: Select USB_UHCI_BIG_ENDIAN_{MMIO,DESC}
        usb: Move USB_UHCI_BIG_ENDIAN_* out of USB_SUPPORT
        MIPS: Fix incorrect mem=X@Y handling
        MIPS: BMIPS: Fix section mismatch warning
      78352f18
    • Merge tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · da370f1d
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "We have a few assorted fixes, some of them show up during fstests so I
        gave them more testing"
      
      * tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
        Btrfs: fix null pointer dereference when replacing missing device
        btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
        btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
        Btrfs: fix unexpected -EEXIST when creating new inode
        Btrfs: fix use-after-free on root->orphan_block_rsv
        Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
        Btrfs: fix extent state leak from tree log
        Btrfs: fix crash due to not cleaning up tree log block's dirty bits
        Btrfs: fix deadlock in run_delalloc_nocow
      da370f1d
    • Merge tag 'for-4.16/dm-chained-bios-fix' of... · c85b0b14
      Linus Torvalds authored
      Merge tag 'for-4.16/dm-chained-bios-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper fix from Mike Snitzer:
       "Fix for DM core to properly propagate errors (avoids overriding
        non-zero error with 0). This is particularly important given DM core's
        increased use of chained bios"
      
      * tag 'for-4.16/dm-chained-bios-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm: correctly handle chained bios in dec_pending()
      c85b0b14
    • Merge tag 'platform-drivers-x86-v4.16-4' of git://git.infradead.org/linux-platform-drivers-x86 · 5e8639b7
      Linus Torvalds authored
      Pull x86 platform driver fixes from Andy Shevchenko:
      
       - regression fix in keyboard support for Dell laptops
      
       - prevent out-of-boundary write in WMI bus driver
      
       - increase timeout to read functional key status on Lenovo laptops
      
      * tag 'platform-drivers-x86-v4.16-4' of git://git.infradead.org/linux-platform-drivers-x86:
        platform/x86: dell-laptop: Removed duplicates in DMI whitelist
        platform/x86: dell-laptop: fix kbd_get_state's request value
        platform/x86: ideapad-laptop: Increase timeout to wait for EC answer
        platform/x86: wmi: fix off-by-one write in wmi_dev_probe()
      5e8639b7
    • Merge tag 'sound-4.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 1a2a7d3e
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "A collection of usual suspects:
      
         - a handful of USB-audio and HD-audio device-specific quirks
      
         - some trivial fixes for the new AC97 bus stuff
      
         - another race fix in ALSA sequencer core"
      
      * tag 'sound-4.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda/realtek: PCI quirk for Fujitsu U7x7
        ALSA: seq: Fix racy pool initializations
        ALSA: usb: add more device quirks for USB DSD devices
        ALSA: usb-audio: Fix UAC2 get_ctl request with a RANGE attribute
        ALSA: ac97: Fix copy and paste typo in documentation
        ALSA: usb-audio: add implicit fb quirk for Behringer UFX1204
        ALSA: ac97: kconfig: Remove select of undefined symbol AC97
        ALSA: hda/realtek - Enable Thinkpad Dock device for ALC298 platform
        ALSA: hda/realtek - Add headset mode support for Dell laptop
        ALSA: hda - Fix headset mic detection problem for two Dell machines
      1a2a7d3e
    • Merge tag 'drm-fixes-for-v4.16-rc2' of git://people.freedesktop.org/~airlied/linux · bad57539
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "One nouveau regression fix, one AMD quirk and a full set of i915
        fixes.
      
        The i915 fixes are mostly for things caught by their CI system, main
        ones being DSI panel fixes and GEM fixes"
      
      * tag 'drm-fixes-for-v4.16-rc2' of git://people.freedesktop.org/~airlied/linux:
        drm/nouveau: Make clock gate support conditional
        drm/i915: Fix DSI panels with v1 MIPI sequences without a DEASSERT sequence v3
        drm/i915: Free memdup-ed DSI VBT data structures on driver_unload
        drm/i915: Add intel_bios_cleanup() function
        drm/i915/vlv: Add cdclk workaround for DSI
        drm/i915/gvt: fix one typo of render_mmio trace
        drm/i915/gvt: Support BAR0 8-byte reads/writes
        drm/i915/gvt: add 0xe4f0 into gen9 render list
        drm/i915/pmu: Fix building without CONFIG_PM
        drm/i915/pmu: Fix sleep under atomic in RC6 readout
        drm/i915/pmu: Fix PMU enable vs execlists tasklet race
        drm/i915: Lock out execlist tasklet while peeking inside for busy-stats
        drm/i915/breadcrumbs: Ignore unsubmitted signalers
        drm/i915: Don't wake the device up to check if the engine is asleep
        drm/i915: Avoid truncation before clamping userspace's priority value
        drm/i915/perf: Fix compiler warning for string truncation
        drm/i915/perf: Fix compiler warning for string truncation
        drm/amdgpu: add new device to use atpx quirk
      bad57539
    • dm: correctly handle chained bios in dec_pending() · 8dd601fa
      NeilBrown authored
      dec_pending() is given an error status (possibly 0) to be recorded
      against a bio.  It can be called several times on the one 'struct
      dm_io', and it is careful to only assign a non-zero error to
      io->status.  However, when it then assigns io->status to bio->bi_status,
      it is not careful and can overwrite a genuine error status with 0.
      
      This can happen when chained bios are in use.  If a bio is chained
      beneath the bio that this dm_io is handling, the child bio might
      complete and set bio->bi_status before the dm_io completes.
      
      This has been possible since chained bios were introduced in 3.14, and
      has become a lot easier to trigger with commit 18a25da8 ("dm: ensure
      bio submission follows a depth-first tree walk") as that commit caused
      dm to start using chained bios itself.
      
      A particular failure mode is that if a bio spans an 'error' target and a
      working target, the 'error' fragment will complete instantly and set the
      ->bi_status, and the other fragment will normally complete a little
      later, and will clear ->bi_status.
      
      The fix is simply to only assign io_error to bio->bi_status when
      io_error is not zero.
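
The shape of that fix can be sketched as below. This is a simplified stand-in rather than the dm code: the `blk_status_t` typedef, the one-field `struct bio` and the helper name `propagate_status()` are assumptions made for illustration.

```c
#include <assert.h>

/* Stand-in for the kernel's status type; 0 means success. */
typedef unsigned char blk_status_t;

/* Reduced stand-in for struct bio, keeping only the field discussed here. */
struct bio {
    blk_status_t bi_status;
};

/*
 * Copy the dm_io's recorded error into the bio only when there is one, so
 * that an error already set by a chained child bio is not cleared by a
 * later, successful completion.
 */
static void propagate_status(struct bio *bio, blk_status_t io_error)
{
    if (io_error)
        bio->bi_status = io_error;
}
```

With an unconditional assignment, a fragment that completes successfully after the 'error' fragment would reset bi_status to 0, which is the failure mode the message describes.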
      Reported-and-tested-by: Milan Broz <gmazyland@gmail.com>
      Cc: stable@vger.kernel.org (v3.14+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      8dd601fa