1. 10 Sep, 2018 17 commits
    • sched/fair: Kick nohz balance if rq->misfit_task_load · 5fbdfae5
      Valentin Schneider authored
      There already are a few conditions in nohz_kick_needed() to ensure
      a nohz kick is triggered, but they are not enough for some misfit
      task scenarios. Excluding asym packing, those are:
      
       - rq->nr_running >= 2: Not relevant here. A misfit task is running
         and needs to be migrated regardless, potentially through active
         balance.
      
       - sds->nr_busy_cpus > 1: If only the misfit task is being run on a
         group of low capacity CPUs, this evaluates to false.
      
       - rq->cfs.h_nr_running >= 1 && check_cpu_capacity(): Not relevant here.
         The misfit task needs to be migrated regardless of rt/IRQ pressure.
      
      As such, this commit adds an rq->misfit_task_load condition to trigger a
      nohz kick.
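
      A minimal user-space sketch of the resulting decision, under stated
      assumptions: struct rq_model, its fields and nohz_kick_needed_model()
      are hypothetical stand-ins rather than the kernel's code, asym packing
      is left out, and capacity_reduced stands in for the result of
      check_cpu_capacity().

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-CPU runqueue state. */
struct rq_model {
	unsigned int nr_running;        /* all runnable tasks on this CPU */
	unsigned int cfs_h_nr_running;  /* runnable CFS tasks on this CPU */
	unsigned long misfit_task_load; /* non-zero when a misfit task runs here */
	bool capacity_reduced;          /* models check_cpu_capacity() being true */
};

/* Returns true when an idle (nohz) CPU should be kicked to run a balance. */
static bool nohz_kick_needed_model(const struct rq_model *rq,
				   unsigned int nr_busy_cpus_in_group)
{
	if (rq->nr_running >= 2)
		return true;
	if (nr_busy_cpus_in_group > 1)
		return true;
	if (rq->cfs_h_nr_running >= 1 && rq->capacity_reduced)
		return true;
	/* The condition added by this commit: a misfit task is enough. */
	if (rq->misfit_task_load)
		return true;
	return false;
}
```

      With a single misfit task alone in its group and no rt/IRQ pressure,
      only the last check fires, which is the gap the commit describes.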
      
      The idea to kick a nohz balance for misfit tasks originally came from
      Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
      the Android Common Kernel - see:
      
        https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html

      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: gaku.inami.xh@renesas.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Consider misfit tasks when load-balancing · cad68e55
      Morten Rasmussen authored
      On asymmetric CPU capacity systems, load-intensive tasks can end up on
      CPUs that don't suit their compute demand. In these scenarios, 'misfit'
      tasks should be migrated to CPUs with higher compute capacity to ensure
      better throughput. group_misfit_task indicates this scenario, but tweaks
      to the load-balance code are needed to make the migrations happen.
      
      Misfit balancing only makes sense between a source group of lower
      per-CPU capacity and destination group of higher compute capacity.
      Otherwise, misfit balancing is ignored. group_misfit_task has lowest
      priority so any imbalance due to overload is dealt with first.
      
      The modifications are:
      
      1. Only pick a group containing misfit tasks as the busiest group if the
         destination group has higher capacity and has spare capacity.
      2. When the busiest group is a 'misfit' group, skip the usual average
         load and group capacity checks.
      3. Set the imbalance for 'misfit' balancing sufficiently high for a task
         to be pulled ignoring average load.
      4. Pick the CPU with the highest misfit load as the source CPU.
      5. If the misfit task is alone on the source CPU, go for active
         balancing.
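
      The modifications above can be sketched roughly as follows. This is a
      simplified user-space model, not the kernel implementation: struct
      cpu_model and all three helpers are hypothetical names, and the
      dst_has_spare flag is assumed to be computed elsewhere.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU view used only for this sketch. */
struct cpu_model {
	unsigned long capacity;         /* per-CPU compute capacity */
	unsigned long misfit_task_load; /* 0 when no misfit task runs here */
	unsigned int nr_running;        /* runnable tasks on this CPU */
};

/* Modification 1: only balance misfit tasks towards a higher capacity
 * destination that also has spare capacity. */
static bool misfit_pull_allowed(unsigned long src_capacity,
				unsigned long dst_capacity,
				bool dst_has_spare)
{
	return dst_capacity > src_capacity && dst_has_spare;
}

/* Modification 4: index of the CPU with the highest misfit load,
 * or -1 when no CPU in the array has a misfit task. */
static int pick_misfit_source(const struct cpu_model *cpus, size_t n)
{
	int best = -1;
	unsigned long best_load = 0;

	for (size_t i = 0; i < n; i++) {
		if (cpus[i].misfit_task_load > best_load) {
			best_load = cpus[i].misfit_task_load;
			best = (int)i;
		}
	}
	return best;
}

/* Modification 5: a misfit task that is alone on its CPU is currently
 * running, so it can only be moved by active balancing. */
static bool needs_active_balance(const struct cpu_model *src)
{
	return src->misfit_task_load != 0 && src->nr_running == 1;
}
```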

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: gaku.inami.xh@renesas.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Add sched_group per-CPU max capacity · e3d6d0cb
      Morten Rasmussen authored
      The current sg->min_capacity tracks the lowest per-CPU compute capacity
      available in the sched_group when rt/irq pressure is taken into account.
      In some scenarios, minimum capacity isn't the ideal metric for tracking
      whether a sched_group needs offloading to another sched_group, e.g. a
      sched_group with multiple CPUs where only one is under heavy pressure.
      Tracking maximum capacity isn't perfect either, but it is a better choice
      for some situations, as it indicates that the sched_group is definitely
      compute capacity constrained, either due to rt/irq pressure on all CPUs
      or due to asymmetric CPU capacities (e.g. big.LITTLE).
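
      To illustrate the difference between the two metrics, here is a small
      sketch that computes both over an array of per-CPU capacities. The
      struct and function names are hypothetical, and the capacity values
      in the note below are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical container for the two per-group metrics. */
struct group_capacity_model {
	unsigned long min_capacity; /* lowest per-CPU capacity in the group */
	unsigned long max_capacity; /* highest per-CPU capacity in the group */
};

/* Scans the per-CPU capacities of one group; both fields are 0 for an
 * empty group. */
static struct group_capacity_model
group_capacity_range(const unsigned long *capacity, size_t nr_cpus)
{
	struct group_capacity_model gc = { 0, 0 };

	if (nr_cpus == 0)
		return gc;

	gc.min_capacity = capacity[0];
	gc.max_capacity = capacity[0];
	for (size_t i = 1; i < nr_cpus; i++) {
		if (capacity[i] < gc.min_capacity)
			gc.min_capacity = capacity[i];
		if (capacity[i] > gc.max_capacity)
			gc.max_capacity = capacity[i];
	}
	return gc;
}
```

      For a group with capacities {1024, 1024, 1024, 300}, the minimum is
      dragged down by the one pressured CPU while the maximum stays at 1024.
      For a group of {446, 446}, the maximum itself is low, so every CPU in
      the group is constrained.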

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: gaku.inami.xh@renesas.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Add 'group_misfit_task' load-balance type · 3b1baa64
      Morten Rasmussen authored
      To maximize throughput in systems with asymmetric CPU capacities (e.g.
      ARM big.LITTLE), load-balancing has to consider task and CPU utilization
      as well as per-CPU compute capacity, in addition to the current average
      load based load-balancing policy. Tasks with high utilization that are
      scheduled on a lower capacity CPU need to be identified and migrated to
      a higher capacity CPU if possible to maximize throughput.
      
      To implement this additional policy an additional group_type
      (load-balance scenario) is added: 'group_misfit_task'. This represents
      scenarios where a sched_group has one or more tasks that are not
      suitable for its per-CPU capacity. 'group_misfit_task' is only considered
      if the system is not overloaded or imbalanced ('group_imbalanced' or
      'group_overloaded').
      
      Identifying misfit tasks requires the rq lock to be held. To avoid
      taking remote rq locks to examine source sched_groups for misfit tasks,
      each CPU is responsible for tracking its own misfit tasks and updating
      the rq->misfit_task flag. This means checking task utilization when
      tasks are scheduled and on sched_tick.
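
      A rough sketch of such a local check, under loudly stated assumptions:
      the commit message does not give the fitness criterion, so the 1280/1024
      headroom margin below is an assumption of this sketch, and
      struct misfit_rq_model and both functions are hypothetical names.

```c
#include <stdbool.h>

#define CAPACITY_SCALE  1024UL
/* Assumption of this sketch: a task fits only if it leaves about 20%
 * headroom on the CPU (1280 / 1024 = 1.25). */
#define CAPACITY_MARGIN 1280UL

/* Hypothetical stand-in for the per-CPU flag mentioned above. */
struct misfit_rq_model {
	bool misfit_task;
};

/* True when the task's utilization fits the CPU with headroom to spare. */
static bool task_fits_capacity_model(unsigned long task_util,
				     unsigned long cpu_capacity)
{
	return cpu_capacity * CAPACITY_SCALE > task_util * CAPACITY_MARGIN;
}

/* Called by the CPU for its own current task (e.g. at pick time and on
 * the tick), so no remote rq lock is ever needed to keep the flag fresh. */
static void update_misfit_status_model(struct misfit_rq_model *rq,
				       bool has_task,
				       unsigned long task_util,
				       unsigned long cpu_capacity)
{
	rq->misfit_task = has_task &&
			  !task_fits_capacity_model(task_util, cpu_capacity);
}
```

      Remote CPUs then only need to read the flag when they examine a
      source sched_group.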

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: gaku.inami.xh@renesas.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Add static_key for asymmetric CPU capacity optimizations · df054e84
      Morten Rasmussen authored
      The existing asymmetric CPU capacity code should cause minimal overhead
      for others. Putting it behind a static_key, as has been done for SMT
      optimizations, would make it easier to extend and improve without
      causing harm to others moving forward.

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: gaku.inami.xh@renesas.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology, arch/arm: Rebuild sched_domain hierarchy when CPU capacity changes · e1799a80
      Morten Rasmussen authored
      Asymmetric CPU capacity cannot necessarily be determined accurately at
      the time the initial sched_domain hierarchy is built during boot. It is
      therefore necessary to be able to force a full rebuild of the hierarchy
      later, triggered by the arch_topology driver. A full rebuild requires the
      arch code to implement arch_update_cpu_topology(), which isn't yet
      implemented for arm. This patch points the arm implementation to the
      arch_topology driver to ensure that a full hierarchy rebuild happens when
      needed.

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1532093554-30504-5-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology, arch/arm64: Rebuild the sched_domain hierarchy when the CPU capacity changes · 3ba09df4
      Morten Rasmussen authored
      Asymmetric CPU capacity cannot necessarily be determined accurately at
      the time the initial sched_domain hierarchy is built during boot. It is
      therefore necessary to be able to force a full rebuild of the hierarchy
      later, triggered by the arch_topology driver. A full rebuild requires the
      arch code to implement arch_update_cpu_topology(), which isn't yet
      implemented for arm64. This patch points the arm64 implementation to the
      arch_topology driver to ensure that a full hierarchy rebuild happens when
      needed.

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: dietmar.eggemann@arm.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1532093554-30504-4-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology, drivers/base/arch_topology: Rebuild the sched_domain hierarchy... · bb1fbdd3
      Morten Rasmussen authored
      sched/topology, drivers/base/arch_topology: Rebuild the sched_domain hierarchy when capacities change
      
      The setting of SD_ASYM_CPUCAPACITY depends on the per-CPU capacities.
      These might not have their final values when the hierarchy is initially
      built, as the values depend on cpufreq being initialized or on values
      being set through sysfs. To ensure that the flags are set correctly we
      need to rebuild the sched_domain hierarchy whenever the reported per-CPU
      capacity (arch_scale_cpu_capacity()) changes.
      
      This patch ensures that a full sched_domain rebuild happens when CPU
      capacity changes occur.

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1532093554-30504-3-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Add SD_ASYM_CPUCAPACITY flag detection · 05484e09
      Morten Rasmussen authored
      The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
      sched_domain in the hierarchy where all CPU capacities are visible from
      any CPU's point of view on asymmetric CPU capacity systems. The
      scheduler can then take capacity asymmetry into account when
      balancing at this level. It also serves as an indicator for how wide
      task placement heuristics have to search to consider all available CPU
      capacities, as asymmetric systems might often appear symmetric at the
      smallest level(s) of the sched_domain hierarchy.
      
      The flag has been around for a while but has so far only been set by
      out-of-tree code in Android kernels. One solution is to let each
      architecture provide the flag through a custom sched_domain topology
      array and associated mask and flag functions. However,
      SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
      capacity and presence of all CPUs in the system, i.e. when hotplugging
      all CPUs out except those with one particular CPU capacity the flag
      should disappear even if the sched_domains don't collapse. Similarly,
      the flag is affected by cpusets where load-balancing is turned off.
      Detecting when the flags should be set therefore depends not only on
      topology information but also the cpuset configuration and hotplug
      state. The arch code doesn't have easy access to the cpuset
      configuration.
      
      Instead, this patch implements the flag detection in generic code where
      cpusets and hotplug state are already taken care of. All the arch is
      responsible for is to implement arch_scale_cpu_capacity() and force a
      full rebuild of the sched_domain hierarchy if capacities are updated,
      e.g. later in the boot process when cpufreq has initialized.
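
      A heavily simplified sketch of the core test, assuming a flat array of
      per-CPU capacities and a membership mask that already reflects hotplug
      and cpuset state. span_has_asym_capacity() is a hypothetical helper.
      It only answers whether more than one distinct capacity is present in
      a span, and does not model the search for the sched_domain level at
      which all capacities become visible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the CPUs selected by in_span[] do not all share the
 * same capacity. CPUs that are hotplugged out or excluded by cpusets are
 * expected to have in_span[i] == false. */
static bool span_has_asym_capacity(const unsigned long *capacity,
				   const bool *in_span, size_t nr_cpus)
{
	bool seen = false;
	unsigned long first = 0;

	for (size_t i = 0; i < nr_cpus; i++) {
		if (!in_span[i])
			continue;
		if (!seen) {
			first = capacity[i];
			seen = true;
		} else if (capacity[i] != first) {
			return true;
		}
	}
	return false;
}
```

      With capacities {446, 446, 1024, 1024} the full span is asymmetric,
      but once the two 1024 CPUs are hotplugged out the same span reports
      no asymmetry, matching the requirement that the flag should disappear.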

      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com
      [ Fixed 'CPU' capitalization. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix kernel-doc notation warning · 882a78a9
      Randy Dunlap authored
      Fix kernel-doc warning for missing 'flags' parameter description:
      
      ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'

      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ea14b57e ("sched/cpufreq: Provide migration hint")
      Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix load_balance redo for !imbalance · bb3485c8
      Vincent Guittot authored
      It can happen that load_balance() finds a busiest group and then a
      busiest rq but the calculated imbalance is in fact 0.
      
      In such a situation, detach_tasks() returns immediately and leaves the
      flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to
      have pinned tasks and is removed from the load balance mask. Then, we
      redo a load balance without the busiest CPU. This creates a wrong load
      balance situation and generates wrong task migrations.
      
      If the calculated imbalance is 0, it's useless to try to find a
      busiest rq as no task will be migrated and we can return immediately.
      
      This situation can happen with heterogeneous systems or SMP systems when
      RT tasks are decreasing the capacity of some CPUs.

      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: jhugo@codeaurora.org
      Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix scale_rt_capacity() for SMT · 287cdaac
      Vincent Guittot authored
      Since commit:
      
        523e979d ("sched/core: Use PELT for scale_rt_capacity()")
      
      scale_rt_capacity() returns the remaining capacity and not a scale factor
      to apply on cpu_capacity_orig. arch_scale_cpu_capacity() is directly called
      by scale_rt_capacity(), so we must take the sched_domain argument.

      Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 523e979d ("sched/core: Use PELT for scale_rt_capacity()")
      Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix vruntime_normalized() for remote non-migration wakeup · d0cdb3ce
      Steve Muckle authored
      When a task which previously ran on a given CPU is remotely queued to
      wake up on that same CPU, there is a period where the task's state is
      TASK_WAKING and its vruntime is not normalized. This is not accounted
      for in vruntime_normalized() which will cause an error in the task's
      vruntime if it is switched from the fair class during this time.
      
      For example if it is boosted to RT priority via rt_mutex_setprio(),
      rq->min_vruntime will not be subtracted from the task's vruntime but
      it will be added again when the task returns to the fair class. The
      task's vruntime will have been erroneously doubled and the effective
      priority of the task will be reduced.
      
      Note this will also lead to inflation of all vruntimes since the doubled
      vruntime value will become the rq's min_vruntime when other tasks leave
      the rq. This leads to repeated doubling of the vruntime and priority
      penalty.
      
      Fix this by recognizing a WAKING task's vruntime as normalized only if
      sched_remote_wakeup is true. This indicates a migration, in which case
      the vruntime would have been normalized in migrate_task_rq_fair().
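
      The arithmetic of the bug can be sketched as follows. This is an
      illustrative model only: both function names are hypothetical,
      min_vruntime is assumed to stay constant while the task is outside
      the fair class, and the with_fix switch merely selects between the
      old and the new rule described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Is a TASK_WAKING task's vruntime considered already normalized?
 * Old rule: always. New rule: only for a remote wakeup (a migration). */
static bool waking_vruntime_normalized(bool sched_remote_wakeup, bool with_fix)
{
	return with_fix ? sched_remote_wakeup : true;
}

/* Models a task leaving the fair class and later returning to it.
 * Assumes min_vruntime does not change in between. */
static uint64_t leave_and_rejoin_fair(uint64_t vruntime, uint64_t min_vruntime,
				      bool treated_as_normalized)
{
	/* On leaving: subtract min_vruntime unless believed normalized. */
	if (!treated_as_normalized)
		vruntime -= min_vruntime;
	/* On returning: min_vruntime is always added back. */
	return vruntime + min_vruntime;
}
```

      For a non-migrating wakeup with vruntime == min_vruntime == 1000, the
      old rule skips the subtraction and the task comes back with 2000,
      while the new rule yields the original 1000.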
      
      Based on a similar patch from John Dias <joaodias@google.com>.

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Steve Muckle <smuckle@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Redpath <Chris.Redpath@arm.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miguel de Dios <migueldedios@google.com>
      Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
      Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: kernel-team@android.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/pelt: Fix update_blocked_averages() for RT and DL classes · 12b04875
      Vincent Guittot authored
      update_blocked_averages() is called to periodically decay the stalled load
      of idle CPUs and to sync all loads before running load balance.
      
      When the cfs rq is idle, it triggers a load balance during
      pick_next_task_fair() in order to potentially pull tasks and to use this
      newly idle CPU. This load balance happens while the prev task from another
      class has not been put and its utilization has not been updated yet. This
      may lead to wrongly accounting running time as idle time for RT or DL
      classes.
      
      Test that no RT or DL task is running when updating their utilization in
      update_blocked_averages().
      
      We still update RT and DL utilization instead of simply skipping them to
      make sure that all metrics are synced when used during load balance.

      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
      Fixes: 3727e0e1 ("sched/dl: Add dl_rq utilization tracking")
      Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Set correct NUMA topology type · e5e96faf
      Srikar Dronamraju authored
      With the following commit:
      
        051f3ca0 ("sched/topology: Introduce NUMA identity node sched domain")
      
      the scheduler introduced a new NUMA level. However, this means that the
      NUMA topology on 2-node systems is no longer marked as NUMA_DIRECT.
      
      After this commit, it gets reported as NUMA_BACKPLANE, because
      sched_domains_numa_level is now 2 on 2 node systems.
      
      Fix this by allowing setting systems that have up to 2 NUMA levels as
      NUMA_DIRECT.
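
      The classification rule after the fix can be sketched as below. The
      enum and function are hypothetical and collapse every non-direct
      topology into a single value, since only the NUMA_DIRECT threshold is
      described here.

```c
/* Hypothetical, reduced set of topology types for this sketch. */
enum numa_type_model {
	NUMA_TYPE_DIRECT, /* stands in for NUMA_DIRECT */
	NUMA_TYPE_OTHER   /* anything needing further classification */
};

/* After the fix: systems with up to 2 NUMA levels count as direct. */
static enum numa_type_model classify_numa_levels(int numa_levels)
{
	if (numa_levels <= 2)
		return NUMA_TYPE_DIRECT;
	return NUMA_TYPE_OTHER;
}
```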
      
      While here remove code that assumes that level can be 0.

      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andre Wild <wild@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
      Fixes: 051f3ca0 ("sched/topology: Introduce NUMA identity node sched domain")
      Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Fix potential deadlock when writing to sched_features · e73e8197
      Jiada Wang authored
      The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.18.0-rc6-00152-gcd3f77d7-dirty #18 Not tainted
        ------------------------------------------------------
        sh/3358 is trying to acquire lock:
        000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30
        but task is already holding lock:
        00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
        which lock already depends on the new lock.
        the existing dependency chain (in reverse order) is:
        -> #3 (&sb->s_type->i_mutex_key#3){+.+.}:
               lock_acquire+0xb8/0x148
               down_write+0xac/0x140
               start_creating+0x5c/0x168
               debugfs_create_dir+0x18/0x220
               opp_debug_register+0x8c/0x120
               _add_opp_dev+0x104/0x1f8
               dev_pm_opp_get_opp_table+0x174/0x340
               _of_add_opp_table_v2+0x110/0x760
               dev_pm_opp_of_add_table+0x5c/0x240
               dev_pm_opp_of_cpumask_add_table+0x5c/0x100
               cpufreq_init+0x160/0x430
               cpufreq_online+0x1cc/0xe30
               cpufreq_add_dev+0x78/0x198
               subsys_interface_register+0x168/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #2 (opp_table_lock){+.+.}:
               lock_acquire+0xb8/0x148
               __mutex_lock+0x104/0xf50
               mutex_lock_nested+0x1c/0x28
               _of_add_opp_table_v2+0xb4/0x760
               dev_pm_opp_of_add_table+0x5c/0x240
               dev_pm_opp_of_cpumask_add_table+0x5c/0x100
               cpufreq_init+0x160/0x430
               cpufreq_online+0x1cc/0xe30
               cpufreq_add_dev+0x78/0x198
               subsys_interface_register+0x168/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #1 (subsys mutex#6){+.+.}:
               lock_acquire+0xb8/0x148
               __mutex_lock+0x104/0xf50
               mutex_lock_nested+0x1c/0x28
               subsys_interface_register+0xd8/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #0 (cpu_hotplug_lock.rw_sem){++++}:
               __lock_acquire+0x203c/0x21d0
               lock_acquire+0xb8/0x148
               cpus_read_lock+0x58/0x1c8
               static_key_enable+0x14/0x30
               sched_feat_write+0x314/0x428
               full_proxy_write+0xa0/0x138
               __vfs_write+0xd8/0x388
               vfs_write+0xdc/0x318
               ksys_write+0xb4/0x138
               sys_write+0xc/0x18
               __sys_trace_return+0x0/0x4
        other info that might help us debug this:
        Chain exists of:
          cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3
         Possible unsafe locking scenario:
               CPU0                    CPU1
               ----                    ----
          lock(&sb->s_type->i_mutex_key#3);
                                       lock(opp_table_lock);
                                       lock(&sb->s_type->i_mutex_key#3);
          lock(cpu_hotplug_lock.rw_sem);
         *** DEADLOCK ***
        2 locks held by sh/3358:
         #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318
         #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
        stack backtrace:
        CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d7-dirty #18
        Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
        Call trace:
         dump_backtrace+0x0/0x288
         show_stack+0x14/0x20
         dump_stack+0x13c/0x1ac
         print_circular_bug.isra.10+0x270/0x438
         check_prev_add.constprop.16+0x4dc/0xb98
         __lock_acquire+0x203c/0x21d0
         lock_acquire+0xb8/0x148
         cpus_read_lock+0x58/0x1c8
         static_key_enable+0x14/0x30
         sched_feat_write+0x314/0x428
         full_proxy_write+0xa0/0x138
         __vfs_write+0xd8/0x388
         vfs_write+0xdc/0x318
         ksys_write+0xb4/0x138
         sys_write+0xc/0x18
         __sys_trace_return+0x0/0x4
      
      This is because when loading the cpufreq_dt module, we first acquire the
      cpu_hotplug_lock.rw_sem lock and then, in cpufreq_init(), take the
      &sb->s_type->i_mutex_key lock.
      
      But when writing to /sys/kernel/debug/sched_features, the
      cpu_hotplug_lock.rw_sem lock is taken while the &sb->s_type->i_mutex_key
      lock is already held, i.e. in the opposite order.
      
      To fix this bug, reverse the lock acquisition order when writing to
      sched_features, so that cpu_hotplug_lock.rw_sem no longer depends on
      &sb->s_type->i_mutex_key.
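
      The ordering problem can be illustrated with a toy order checker. This
      is a sketch under stated assumptions, not how lockdep works internally:
      the names are hypothetical, only two locks are modelled, and the
      intermediate locks in the reported chain are collapsed into one direct
      dependency.

```c
#include <stdbool.h>

enum {
	LOCK_HOTPLUG, /* stands in for cpu_hotplug_lock.rw_sem */
	LOCK_INODE,   /* stands in for &sb->s_type->i_mutex_key */
	NR_LOCKS
};

/* taken_under[a][b] is true once b has been acquired while holding a. */
struct lock_order_model {
	bool taken_under[NR_LOCKS][NR_LOCKS];
};

/* Records "outer is held, inner is acquired". Returns false when the
 * inverse order was recorded earlier, i.e. a potential deadlock. */
static bool record_nested_acquire(struct lock_order_model *m,
				  int outer, int inner)
{
	if (m->taken_under[inner][outer])
		return false;
	m->taken_under[outer][inner] = true;
	return true;
}
```

      Recording hotplug-then-inode (the cpufreq path) followed by
      inode-then-hotplug (the old sched_features write path) trips the
      check. With the write path reversed to hotplug-then-inode, both paths
      agree and nothing is reported.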
      Tested-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: default avatarJiada Wang <jiada_wang@mentor.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
      Cc: George G. Davis <george_davis@mentor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e73e8197
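      The lock-order inversion described in the commit above can be illustrated
      with a toy model. This is a sketch only, not kernel code and not how
      lockdep is implemented: the lock classes, toy_lock() and the three path
      functions are hypothetical names. It records which lock is acquired while
      another is held and flags the case where two locks have been taken in
      both orders.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes named after the two locks in the report. */
enum lock_class { HOTPLUG_LOCK, INODE_LOCK, NR_LOCK_CLASSES };

static bool held[NR_LOCK_CLASSES];
/* dep[a][b] becomes true once b has been acquired while a was held. */
static bool dep[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

void reset_model(void)
{
    for (int a = 0; a < NR_LOCK_CLASSES; a++) {
        held[a] = false;
        for (int b = 0; b < NR_LOCK_CLASSES; b++)
            dep[a][b] = false;
    }
}

void toy_lock(enum lock_class c)
{
    assert(!held[c]); /* the toy model has no recursive locking */
    for (int a = 0; a < NR_LOCK_CLASSES; a++)
        if (held[a])
            dep[a][c] = true;
    held[c] = true;
}

void toy_unlock(enum lock_class c)
{
    held[c] = false;
}

/* True if some pair of locks has been taken in both orders (AB-BA). */
bool has_inversion(void)
{
    for (int a = 0; a < NR_LOCK_CLASSES; a++)
        for (int b = a + 1; b < NR_LOCK_CLASSES; b++)
            if (dep[a][b] && dep[b][a])
                return true;
    return false;
}

/* Module-load path as described: hotplug lock first, then the inode lock. */
void module_load_path(void)
{
    toy_lock(HOTPLUG_LOCK);
    toy_lock(INODE_LOCK);
    toy_unlock(INODE_LOCK);
    toy_unlock(HOTPLUG_LOCK);
}

/* sched_features write before the fix: inode lock held, then hotplug lock. */
void feature_write_old(void)
{
    toy_lock(INODE_LOCK);
    toy_lock(HOTPLUG_LOCK);
    toy_unlock(HOTPLUG_LOCK);
    toy_unlock(INODE_LOCK);
}

/* After the fix: the same order as the module-load path. */
void feature_write_fixed(void)
{
    toy_lock(HOTPLUG_LOCK);
    toy_lock(INODE_LOCK);
    toy_unlock(INODE_LOCK);
    toy_unlock(HOTPLUG_LOCK);
}
```

      Running module_load_path() followed by feature_write_old() makes
      has_inversion() return true, while pairing it with feature_write_fixed()
      does not, which mirrors the reasoning in the commit message.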
    • Linus Torvalds's avatar
      Linux 4.19-rc3 · 11da3a7f
      Linus Torvalds authored
      11da3a7f
  2. 09 Sep, 2018 7 commits
  3. 08 Sep, 2018 6 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · f8f65382
      Linus Torvalds authored
      Pull KVM fixes from Radim Krčmář:
       "ARM:
         - Fix a VFP corruption in 32-bit guest
         - Add missing cache invalidation for CoW pages
         - Two small cleanups
      
        s390:
         - Fallout from the hugetlbfs support: pfmf interpretation and locking
         - VSIE: fix keywrapping for nested guests
      
        PPC:
         - Fix a bug where pages might not get marked dirty, causing guest
           memory corruption on migration
         - Fix a bug causing reads from guest memory to use the wrong guest
           real address for very large HPT guests (>256G of memory), leading
           to failures in instruction emulation.
      
        x86:
         - Fix out of bound access from malicious pv ipi hypercalls
           (introduced in rc1)
         - Fix delivery of pending interrupts when entering a nested guest,
           preventing arbitrarily late injection
         - Sanitize kvm_stat output after destroying a guest
         - Fix infinite loop when emulating a nested guest page fault and
           improve the surrounding emulation code
         - Two minor cleanups"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
        KVM: LAPIC: Fix pv ipis out-of-bounds access
        KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2
        arm64: KVM: Remove pgd_lock
        KVM: Remove obsolete kvm_unmap_hva notifier backend
        arm64: KVM: Only force FPEXC32_EL2.EN if trapping FPSIMD
        KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoW
        KVM: s390: Properly lock mm context allow_gmap_hpage_1m setting
        KVM: s390: vsie: copy wrapping keys to right place
        KVM: s390: Fix pfmf and conditional skey emulation
        tools/kvm_stat: re-animate display of dead guests
        tools/kvm_stat: indicate dead guests as such
        tools/kvm_stat: handle guest removals more gracefully
        tools/kvm_stat: don't reset stats when setting PID filter for debugfs
        tools/kvm_stat: fix updates for dead guests
        tools/kvm_stat: fix handling of invalid paths in debugfs provider
        tools/kvm_stat: fix python3 issues
        KVM: x86: Unexport x86_emulate_instruction()
        KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction()
        KVM: x86: Do not re-{try,execute} after failed emulation in L2
        KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault
        ...
      f8f65382
    • Linus Torvalds's avatar
      Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 0f3aa48a
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "A few more fixes that have trickled in:
      
         - MMC bus width fixup for some Allwinner platforms
      
         - Fix for NULL deref in ti-aemif when no platform data is passed in
      
         - Fix div by 0 in SCMI code
      
         - Add a missing module alias in a new RPi driver"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        memory: ti-aemif: fix a potential NULL-pointer dereference
        firmware: arm_scmi: fix divide by zero when sustained_perf_level is zero
        hwmon: rpi: add module alias to raspberrypi-hwmon
        arm64: allwinner: dts: h6: fix Pine H64 MMC bus width
      0f3aa48a
    • Olof Johansson's avatar
      Merge tag 'sunxi-fixes-for-4.19' of... · a132bb90
      Olof Johansson authored
      Merge tag 'sunxi-fixes-for-4.19' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into fixes
      
      Allwinner fixes for 4.19
      
      Just one fix for H6 mmc on the Pine H64: the mmc bus width was missing
      from the device tree. This was added in 4.19-rc1.
      
      * tag 'sunxi-fixes-for-4.19' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
        arm64: allwinner: dts: h6: fix Pine H64 MMC bus width
      Signed-off-by: Olof Johansson <olof@lixom.net>
      a132bb90
    • Nadav Amit's avatar
      x86/mm: Use WRITE_ONCE() when setting PTEs · 9bc4f28a
      Nadav Amit authored
      When page-table entries are set, the compiler might optimize their
      assignment by using multiple instructions to set the PTE. This might
      turn into a security hazard if the user somehow manages to use the
      interim PTE. L1TF does not make our lives easier, making even an interim
      non-present PTE a security hazard.
      
      Using WRITE_ONCE() to set PTEs and friends should prevent this potential
      security hazard.
      
      I skimmed the differences in the binary with and without this patch. The
      differences are (obviously) greater when CONFIG_PARAVIRT=n as more
      code optimizations are possible. For better or worse, the impact on the
      binary with this patch is pretty small. Skimming the code did not cause
      anything to jump out as a security hazard, but it seems that at least
      move_soft_dirty_pte() caused set_pte_at() to use multiple writes.
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com
      9bc4f28a
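      The idea behind the commit above can be sketched in plain C. This is a
      simplified illustration, not the kernel's WRITE_ONCE() definition:
      pte_sketch_t, write_once_u64() and the two setters are hypothetical
      names. A plain assignment leaves the compiler free to emit the store in
      several pieces, whereas a store through a volatile-qualified pointer
      asks it to perform exactly one access of that size. That is a
      compiler-level property and not an atomicity guarantee from the C
      standard.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit page-table entry. */
typedef struct {
    uint64_t val;
} pte_sketch_t;

/*
 * Store v through a volatile-qualified pointer so the compiler does not
 * split, merge or repeat the access. This mirrors the intent of the
 * kernel macro but is not its actual implementation.
 */
void write_once_u64(uint64_t *p, uint64_t v)
{
    *(volatile uint64_t *)p = v;
}

/* Plain assignment: the compiler may use multiple instructions here. */
void set_pte_plain(pte_sketch_t *ptep, pte_sketch_t pte)
{
    *ptep = pte;
}

/* Setter in the style of the fix: one volatile store of the whole entry. */
void set_pte_once(pte_sketch_t *ptep, pte_sketch_t pte)
{
    write_once_u64(&ptep->val, pte.val);
}
```

      Both setters leave the same final value in memory; the difference is
      only in which intermediate states the compiler is permitted to make
      visible.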
    • Thomas Gleixner's avatar
      x86/apic/vector: Make error return value negative · 47b7360c
      Thomas Gleixner authored
      activate_managed() returns EINVAL instead of -EINVAL in case of
      error. While this is unlikely to happen, the positive return value would
      cause further malfunction at the call site.
      
      Fixes: 2db1f959 ("x86/vector: Handle managed interrupts proper")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      47b7360c
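      Why the sign matters in the commit above can be shown with a short
      sketch. The function names are hypothetical, and the caller's
      `ret < 0` test is an assumption about how such call sites commonly
      detect failure, not a quote of the actual kernel code.

```c
#include <errno.h>

/* Buggy variant: reports failure with a positive errno value. */
int activate_buggy(int ok)
{
    return ok ? 0 : EINVAL;
}

/* Fixed variant: follows the kernel convention of negative errno values. */
int activate_fixed(int ok)
{
    return ok ? 0 : -EINVAL;
}

/* Assumed caller-side check: only negative values are treated as errors. */
int caller_sees_error(int ret)
{
    return ret < 0;
}
```

      With this kind of check, the positive EINVAL from the buggy variant is
      not recognised as a failure and the caller carries on as if activation
      had succeeded.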
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · d7b686eb
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
      
       - bugfixes for uniphier, i801, and xiic drivers
      
       - ID removal (never produced) for imx
      
       - one MAINTAINER addition
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: xiic: Record xilinx i2c with Zynq fragment
        i2c: xiic: Make the start and the byte count write atomic
        i2c: i801: fix DNV's SMBCTRL register offset
        i2c: imx-lpi2c: Remove mx8dv compatible entry
        dt-bindings: imx-lpi2c: Remove mx8dv compatible entry
        i2c: uniphier-f: issue STOP only for last message or I2C_M_STOP
        i2c: uniphier: issue STOP only for last message or I2C_M_STOP
      d7b686eb
  4. 07 Sep, 2018 10 commits
    • Linus Torvalds's avatar
      Merge tag 'arc-4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 2c34a0e0
      Linus Torvalds authored
      Pull ARC updates from Vineet Gupta:
      
       - Fix for atomic_fetch_#op  [Will Deacon]
      
       - Enable per device IOC [Eugeniy Paltsev]
      
       - Remove redundant gcc version checks [Masahiro Yamada]
      
       - Misc platform config/DT updates [Alexey Brodkin]
      
      * tag 'arc-4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: don't check for HIGHMEM pages in arch_dma_alloc
        ARC: IOC: panic if both IOC and ZONE_HIGHMEM enabled
        ARC: dma [IOC] Enable per device io coherency
        ARC: dma [IOC]: mark DMA devices connected as dma-coherent
        ARC: atomics: unbork atomic_fetch_##op()
        arc: remove redundant GCC version checks
        ARC: sort Kconfig
        ARC: cleanup show_faulting_vma()
        ARC: [plat-axs*]: Enable SWAP
        ARC: [plat-axs*/plat-hsdk]: Allow U-Boot to pass MAC-address to the kernel
        ARC: configs: cleanup
      2c34a0e0
    • David Howells's avatar
      afs: Fix cell specification to permit an empty address list · ecfe951f
      David Howells authored
      Fix the cell specification mechanism to allow cells to be pre-created
      without having to specify at least one address (the addresses will be
      upcalled for).
      
      This allows the cell information preload service to avoid the need to issue
      loads of DNS lookups during boot to get the addresses for each cell (500+
      lookups for the 'standard' cell list[*]).  The lookups can be done later as
      each cell is accessed through the filesystem.
      
      Also remove the print statement that prints a line every time a new cell is
      added.
      
      [*] There are 144 cells in the list.  Each cell is first looked up for an
          SRV record, and if that fails, for an AFSDB record.  These get a list
          of server names, each of which then has to be looked up to get the
          addresses for that server.  E.g.:
      
      	dig srv _afs3-vlserver._udp.grand.central.org
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecfe951f
    • Linus Torvalds's avatar
      Merge tag 'md/4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md · 3d0e7a9e
      Linus Torvalds authored
      Pull MD fixes from Shaohua Li:
      
       - Fix a locking issue for md-cluster (Guoqing)
      
       - Fix a sync crash for raid10 (Ni)
      
       - Fix a reshape bug with raid5 cache enabled (me)
      
      * tag 'md/4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
        md-cluster: release RESYNC lock after the last resync message
        RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0
        md/raid5-cache: disable reshape completely
      3d0e7a9e
    • Linus Torvalds's avatar
      Merge tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client · a12ed06b
      Linus Torvalds authored
      Pull ceph fixes from Ilya Dryomov:
       "Two rbd patches to complete support for images within namespaces that
        went into -rc1 and a use-after-free fix.
      
        The rbd changes have been sitting in a branch for quite a while but
        couldn't be included into the -rc1 pull request because of a pending
        wire protocol backwards compatibility fixup that only got committed
        early this week"
      
      * tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client:
        rbd: support cloning across namespaces
        rbd: factor out get_parent_info()
        ceph: avoid a use-after-free in ceph_destroy_options()
      a12ed06b
    • Linus Torvalds's avatar
      Merge tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · d042a240
      Linus Torvalds authored
      Pull fsnotify fix from Jan Kara:
       "A small fsnotify fix from Amir"
      
      * tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        fsnotify: fix ignore mask logic in fsnotify()
      d042a240
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 4ff8a142
      Linus Torvalds authored
      Pull arm64 fix from Will Deacon:
       "Just one small fix here, preventing a VM_WARN_ON when a !present
        PMD/PUD is "freed" as part of a huge ioremap() operation.
      
        The correct behaviour is to skip the free silently in this case, which
        is a little weird (the function is a bit of a misnomer), but it
        follows the x86 implementation"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: fix erroneous warnings in page freeing functions
      4ff8a142
    • Linus Torvalds's avatar
      Merge tag 'acpi-4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 53937340
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These fix a regression from the 4.18 cycle in the ACPI driver for
        Intel SoCs (LPSS) and prevent dmi_check_system() from being called on
        non-x86 systems in the ACPI core.
      
        Specifics:
      
         - Fix a power management regression in the ACPI driver for Intel SoCs
           (LPSS) introduced by a system-wide suspend/resume fix during the
           4.18 cycle (Zhang Rui).
      
         - Prevent dmi_check_system() from being called on non-x86 systems in
           the ACPI core (Jean Delvare)"
      
      * tag 'acpi-4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI / LPSS: Force LPSS quirks on boot
        ACPI / bus: Only call dmi_check_system() on X86
      53937340
    • Linus Torvalds's avatar
      Merge tag 'sound-4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 69ddce94
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Just a few small fixes:
      
         - a fix for the recursive work cancellation in a specific HD-audio
           operation mode
      
         - a fix for potentially uninitialized memory access via rawmidi
      
         - the register bit access fixes for ASoC HD-audio"
      
      * tag 'sound-4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda: Fix several mismatch for register mask and value
        ALSA: rawmidi: Initialize allocated buffers
        ALSA: hda - Fix cancel_work_sync() stall from jackpoll work
      69ddce94
    • Wanpeng Li's avatar
      KVM: LAPIC: Fix pv ipis out-of-bounds access · bdf7ffc8
      Wanpeng Li authored
      Dan Carpenter reported that the untrusted data returned by kvm_register_read()
      results in the following static checker warning:
        arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
        error: buffer underflow 'map->phys_map' 's32min-s32max'
      
      A KVM guest can easily trigger this by executing the following assembly sequence
      in Ring0:
      
      mov $10, %rax
      mov $0xFFFFFFFF, %rbx
      mov $0xFFFFFFFF, %rdx
      mov $0, %rsi
      vmcall
      
      This causes KVM to execute the following code path:
      vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
      which performs an out-of-bounds access.
      
      This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id,
      ignoring destinations that are not present and delivering the rest. We also check
      whether map->phys_map[min + i] is NULL: max_apic_id is only the highest APIC ID,
      so some phys_map entries may be NULL when APIC IDs are sparse, especially since KVM
      unconditionally sets max_apic_id to 255 to reserve enough space for any xAPIC ID.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      bdf7ffc8
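      The two checks described in the commit above (a range check against
      max_apic_id and a NULL check on each phys_map slot) can be sketched as
      follows. This is an illustrative model under stated assumptions, not
      the code in arch/x86/kvm/lapic.c: the struct layouts, the 8-entry map
      and send_ipi_sketch() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_APIC_ID 7u /* hypothetical; small to keep the model tiny */

/* Hypothetical stand-in for a local APIC; counts delivered IPIs. */
struct apic_sketch {
    int delivered;
};

struct apic_map_sketch {
    uint32_t max_apic_id;
    /* Sparse: slots for APIC IDs that are not present stay NULL. */
    struct apic_sketch *phys_map[SKETCH_MAX_APIC_ID + 1];
};

/*
 * Deliver to every destination min + i whose bit i is set in bitmap,
 * skipping IDs beyond max_apic_id and IDs with no APIC behind them.
 * Returns the number of deliveries.
 */
int send_ipi_sketch(struct apic_map_sketch *map, uint64_t bitmap, uint32_t min)
{
    int count = 0;

    assert(map != NULL);

    /* Guest-controlled min may be far outside the map. */
    if (min > map->max_apic_id)
        return 0;

    for (uint32_t i = 0; i < 64; i++) {
        if (!(bitmap & (1ULL << i)))
            continue;
        /* Written as a subtraction so that min + i cannot wrap around. */
        if (i > map->max_apic_id - min)
            break;
        if (map->phys_map[min + i] != NULL) {
            map->phys_map[min + i]->delivered++;
            count++;
        }
    }
    return count;
}
```

      With the guest-supplied values from the assembly sequence (min set to
      0xFFFFFFFF), this sketch returns early without touching phys_map.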
    • Liran Alon's avatar
      KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2 · b5861e5c
      Liran Alon authored
      Consider the case where L1 had a pending IRQ/NMI event when it executed
      VMLAUNCH/VMRESUME, which wasn't delivered because it was disallowed
      (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
      L0 needs to evaluate whether this pending event should cause an exit from
      L2 to L1 or be delivered directly to L2 (e.g. in case L1 doesn't intercept
      EXTERNAL_INTERRUPT).
      
      Usually this would be handled by L0 requesting an IRQ/NMI window
      by setting the VMCS accordingly. However, this setting was done on
      VMCS01 and now VMCS02 is active instead. Thus, when L1 executes
      VMLAUNCH/VMRESUME we force L0 to perform pending event evaluation by
      requesting a KVM_REQ_EVENT.
      
      Note that the above scenario arises when L1 KVM is about to enter L2 but
      requests an "immediate-exit". In this case, L1 will
      disable interrupts and then send a self-IPI before entering L2.
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      b5861e5c