1. 20 Mar, 2018 5 commits
    • Patrick Bellasi's avatar
      sched/fair: Update util_est only on util_avg updates · d519329f
      Patrick Bellasi authored
      The estimated utilization of a task is currently updated every time the
      task is dequeued. However, to keep overheads under control, PELT signals
      are effectively updated at maximum once every 1ms.
      
      Thus, for really short running tasks, it can happen that their util_avg
      value has not been updates since their last enqueue.  If such tasks are
      also frequently running tasks (e.g. the kind of workload generated by
      hackbench) it can also happen that their util_avg is updated only every
      few activations.
      
      This means that updating util_est at every dequeue potentially introduces
      not necessary overheads and it's also conceptually wrong if the util_avg
      signal has never been updated during a task activation.
      
      Let's introduce a throttling mechanism on task's util_est updates
      to sync them with util_avg updates. To make the solution memory
      efficient, both in terms of space and load/store operations, we encode a
      synchronization flag into the LSB of util_est.enqueued.
      This makes util_est an even values only metric, which is still
      considered good enough for its purpose.
      The synchronization bit is (re)set by __update_load_avg_se() once the
      PELT signal of a task has been updated during its last activation.
      
      Such a throttling mechanism allows to keep under control util_est
      overheads in the wakeup hot path, thus making it a suitable mechanism
      which can be enabled also on high-intensity workload systems.
      Thus, this now switches on by default the estimation utilization
      scheduler feature.
      Suggested-by: default avatarChris Redpath <chris.redpath@arm.com>
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      d519329f
    • Patrick Bellasi's avatar
      sched/cpufreq/schedutil: Use util_est for OPP selection · a07630b8
      Patrick Bellasi authored
      When schedutil looks at the CPU utilization, the current PELT value for
      that CPU is returned straight away. In certain scenarios this can have
      undesired side effects and delays on frequency selection.
      
      For example, since the task utilization is decayed at wakeup time, a
      long sleeping big task newly enqueued does not add immediately a
      significant contribution to the target CPU. This introduces some latency
      before schedutil will be able to detect the best frequency required by
      that task.
      
      Moreover, the PELT signal build-up time is a function of the current
      frequency, because of the scale invariant load tracking support. Thus,
      starting from a lower frequency, the utilization build-up time will
      increase even more and further delays the selection of the actual
      frequency which better serves the task requirements.
      
      In order to reduce these kind of latencies, we integrate the usage
      of the CPU's estimated utilization in the sugov_get_util function.
      
      This allows to properly consider the expected utilization of a CPU which,
      for example, has just got a big task running after a long sleep period.
      Ultimately this allows to select the best frequency to run a task
      right after its wake-up.
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-4-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a07630b8
    • Patrick Bellasi's avatar
      sched/fair: Use util_est in LB and WU paths · f9be3e59
      Patrick Bellasi authored
      When the scheduler looks at the CPU utilization, the current PELT value
      for a CPU is returned straight away. In certain scenarios this can have
      undesired side effects on task placement.
      
      For example, since the task utilization is decayed at wakeup time, when
      a long sleeping big task is enqueued it does not add immediately a
      significant contribution to the target CPU.
      As a result we generate a race condition where other tasks can be placed
      on the same CPU while it is still considered relatively empty.
      
      In order to reduce this kind of race conditions, this patch introduces the
      required support to integrate the usage of the CPU's estimated utilization
      in the wakeup path, via cpu_util_wake(), as well as in the load-balance
      path, via cpu_util() which is used by update_sg_lb_stats().
      
      The estimated utilization of a CPU is defined to be the maximum between
      its PELT's utilization and the sum of the estimated utilization (at
      previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
      This allows to properly represent the spare capacity of a CPU which, for
      example, has just got a big task running since a long sleep period.
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-3-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      f9be3e59
    • Patrick Bellasi's avatar
      sched/fair: Add util_est on top of PELT · 7f65ea42
      Patrick Bellasi authored
      The util_avg signal computed by PELT is too variable for some use-cases.
      For example, a big task waking up after a long sleep period will have its
      utilization almost completely decayed. This introduces some latency before
      schedutil will be able to pick the best frequency to run a task.
      
      The same issue can affect task placement. Indeed, since the task
      utilization is already decayed at wakeup, when the task is enqueued in a
      CPU, this can result in a CPU running a big task as being temporarily
      represented as being almost empty. This leads to a race condition where
      other tasks can be potentially allocated on a CPU which just started to run
      a big task which slept for a relatively long period.
      
      Moreover, the PELT utilization of a task can be updated every [ms], thus
      making it a continuously changing value for certain longer running
      tasks. This means that the instantaneous PELT utilization of a RUNNING
      task is not really meaningful to properly support scheduler decisions.
      
      For all these reasons, a more stable signal can do a better job of
      representing the expected/estimated utilization of a task/cfs_rq.
      Such a signal can be easily created on top of PELT by still using it as
      an estimator which produces values to be aggregated on meaningful
      events.
      
      This patch adds a simple implementation of util_est, a new signal built on
      top of PELT's util_avg where:
      
          util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
      
      This allows to remember how big a task has been reported by PELT in its
      previous activations via f(task::util_avg@dequeue), which is the new
      _task_util_est(struct task_struct*) function added by this patch.
      
      If a task should change its behavior and it runs longer in a new
      activation, after a certain time its util_est will just track the
      original PELT signal (i.e. task::util_avg).
      
      The estimated utilization of cfs_rq is defined only for root ones.
      That's because the only sensible consumer of this signal are the
      scheduler and schedutil when looking for the overall CPU utilization
      due to FAIR tasks.
      
      For this reason, the estimated utilization of a root cfs_rq is simply
      defined as:
      
          util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
      
      where:
      
          cfs_rq::util_est::enqueued = sum(_task_util_est(task))
                                       for each RUNNABLE task on that root cfs_rq
      
      It's worth noting that the estimated utilization is tracked only for
      objects of interests, specifically:
      
       - Tasks: to better support tasks placement decisions
       - root cfs_rqs: to better support both tasks placement decisions as
                       well as frequencies selection
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7f65ea42
    • Ingo Molnar's avatar
      10c18c44
  2. 19 Mar, 2018 11 commits
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 1b5f3ba4
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Two commits to fix the following subtle cgroup2 behavior bugs:
      
         - cpu.max was rejecting config when it shouldn't
      
         - thread mode enable was allowed when it shouldn't"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup: fix rule checking for threaded mode switching
        sched, cgroup: Don't reject lower cpu.max on ancestors
      1b5f3ba4
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · c6256ca9
      Linus Torvalds authored
      Pull workqueue fixes from Tejun Heo:
       "Two low-impact workqueue commits.
      
        One fixes workqueue creation error path and the other removes the
        unused cancel_work()"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: remove unused cancel_work()
        workqueue: use put_device() instead of kfree()
      c6256ca9
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · 0d707a2f
      Linus Torvalds authored
      Pull percpu fixes from Tejun Heo:
       "Late percpu pull request for v4.16-rc6.
      
         - percpu allocator pool replenishing no longer triggers OOM or
           warning messages.
      
           Also, the alloc interface now understands __GFP_NORETRY and
           __GFP_NOWARN. This is to allow avoiding OOMs from userland
           triggered actions like bpf map creation.
      
           Also added cond_resched() in alloc loop.
      
         - perpcu allocation now can be interrupted by kill sigs to avoid
           deadlocking OOM killer.
      
         - Added Dennis Zhou as a co-maintainer.
      
           He has rewritten the area map allocator, understands most of the
           code base and has been responsive for all bug reports"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
        percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
        mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
        percpu: include linux/sched.h for cond_resched()
        percpu: add a schedule point in pcpu_balance_workfn()
        percpu: allow select gfp to be passed to underlying allocators
        percpu: add __GFP_NORETRY semantics to the percpu balancing path
        percpu: match chunk allocator declarations with definitions
        percpu: add Dennis Zhou as a percpu co-maintainer
      0d707a2f
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · efac2483
      Linus Torvalds authored
      Pull libata fixes from Tejun Heo:
       "I sat on them too long and it's quite a few this late, but nothing has
        a wide blast area. The changes are...
      
         - Fix corner cases in SG command handling.
      
         - Recent introduction of default powersaving mode config option
           exposed several devices with broken powersaving behaviors. A number
           of patches to update the blacklist accordingly.
      
         - Fix a kernel panic on SAS hotplug.
      
         - Other misc and device specific updates"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
        libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version
        libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions
        libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs
        libata: Enable queued TRIM for Samsung SSD 860
        PCI: Add function 1 DMA alias quirk for Highpoint RocketRAID 644L
        ahci: Add PCI-id for the Highpoint Rocketraid 644L card
        ata: do not schedule hot plug if it is a sas host
        libata: disable LPM for Crucial BX100 SSD 500GB drive
        libata: Apply NOLPM quirk to Crucial MX100 512GB SSDs
        libata: update documentation for sysfs interfaces
        ata: sata_rcar: Remove unused variable in sata_rcar_init_controller()
        libata: transport: cleanup documentation of sysfs interface
        sata_rcar: Reset SATA PHY when Salvator-X board resumes
        libata: don't try to pass through NCQ commands to non-NCQ devices
        libata: remove WARN() for DMA or PIO command without data
        libata: fix length validation of ATAPI-relayed SCSI commands
        ata: libahci: fix comment indentation
        ahci: Add check for device presence (PCIe hot unplug) in ahci_stop_engine()
        libata: Fix compile warning with ATA_DEBUG enabled
      efac2483
    • Tejun Heo's avatar
      percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods · b3a5d111
      Tejun Heo authored
      percpu_ref internally uses sched-RCU to implement the percpu -> atomic
      mode switching and the documentation suggested that this could be
      depended upon.  This doesn't seem like a good idea.
      
      * percpu_ref uses sched-RCU which has different grace periods regular
        RCU.  Users may combine percpu_ref with regular RCU usage and
        incorrectly believe that regular RCU grace periods are performed by
        percpu_ref.  This can lead to, for example, use-after-free due to
        premature freeing.
      
      * percpu_ref has a grace period when switching from percpu to atomic
        mode.  It doesn't have one between the last put and release.  This
        distinction is subtle and can lead to surprising bugs.
      
      * percpu_ref allows starting in and switching to atomic mode manually
        for debugging and other purposes.  This means that there may not be
        any grace periods from kill to release.
      
      This patch makes it clear that the grace periods are percpu_ref's
      internal implementation detail and can't be depended upon by the
      users.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      b3a5d111
    • Kirill Tkhai's avatar
      mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() · f52ba1fe
      Kirill Tkhai authored
      In case of memory deficit and low percpu memory pages,
      pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
      time (as it makes memory allocations itself and waits
      for memory reclaim). If tasks doing pcpu_alloc() are
      choosen by OOM killer, they can't exit, because they
      are waiting for the mutex.
      
      The patch makes pcpu_alloc() to care about killing signal
      and use mutex_lock_killable(), when it's allowed by GFP
      flags. This guarantees, a task does not miss SIGKILL
      from OOM killer.
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f52ba1fe
    • Tejun Heo's avatar
      percpu: include linux/sched.h for cond_resched() · 71546d10
      Tejun Heo authored
      microblaze build broke due to missing declaration of the
      cond_resched() invocation added recently.  Let's include linux/sched.h
      explicitly.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      71546d10
    • Hans de Goede's avatar
      libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version · d418ff56
      Hans de Goede authored
      When commit 9c7be59f ("libata: Apply NOLPM quirk to Crucial MX100
      512GB SSDs") was added it inherited the ATA_HORKAGE_NO_NCQ_TRIM quirk
      from the existing "Crucial_CT*MX100*" entry, but that entry sets model_rev
      to "MU01", where as the entry adding the NOLPM quirk sets it to NULL.
      
      This means that after this commit we no apply the NO_NCQ_TRIM quirk to
      all "Crucial_CT512MX100*" SSDs even if they have the fixed "MU02"
      firmware. This commit splits the "Crucial_CT512MX100*" quirk into 2
      quirks, one for the "MU01" firmware and one for all other firmware
      versions, so that we once again only apply the NO_NCQ_TRIM quirk to the
      "MU01" firmware version.
      
      Fixes: 9c7be59f ("libata: Apply NOLPM quirk to ... MX100 512GB SSDs")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      d418ff56
    • Hans de Goede's avatar
      libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions · 3bf7b5d6
      Hans de Goede authored
      Commit b17e5729 ("libata: disable LPM for Crucial BX100 SSD 500GB
      drive"), introduced a ATA_HORKAGE_NOLPM quirk for Crucial BX100 500GB SSDs
      but limited this to the MU02 firmware version, according to:
      http://www.crucial.com/usa/en/support-ssd-firmware
      
      MU02 is the last version, so there are no newer possibly fixed versions
      and if the MU02 version has broken LPM then the MU01 almost certainly
      also has broken LPM, so this commit changes the quirk to apply to all
      firmware versions.
      
      Fixes: b17e5729 ("libata: disable LPM for Crucial BX100 SSD 500GB...")
      Cc: stable@vger.kernel.org
      Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      3bf7b5d6
    • Hans de Goede's avatar
      libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs · 62ac3f73
      Hans de Goede authored
      There have been reports of the Crucial M500 480GB model not working
      with LPM set to min_power / med_power_with_dipm level.
      
      It has not been tested with medium_power, but that typically has no
      measurable power-savings.
      
      Note the reporters Crucial_CT480M500SSD3 has a firmware version of MU03
      and there is a MU05 update available, but that update does not mention any
      LPM fixes in its changelog, so the quirk matches all firmware versions.
      
      In my experience the LPM problems with (older) Crucial SSDs seem to be
      limited to higher capacity versions of the SSDs (different firmware?),
      so this commit adds a NOLPM quirk for the 480 and 960GB versions of the
      M500, to avoid LPM causing issues with these SSDs.
      
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      62ac3f73
    • Linus Torvalds's avatar
      Linux 4.16-rc6 · c698ca52
      Linus Torvalds authored
      c698ca52
  3. 18 Mar, 2018 5 commits
    • Linus Torvalds's avatar
      Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9e1909b9
      Linus Torvalds authored
      Pull x86/pti updates from Thomas Gleixner:
       "Another set of melted spectrum updates:
      
         - Iron out the last late microcode loading issues by actually
           checking whether new microcode is present and preventing the CPU
           synchronization to run into a timeout induced hang.
      
         - Remove Skylake C2 from the microcode blacklist according to the
           latest Intel documentation
      
         - Fix the VM86 POPF emulation which traps if VIP is set, but VIF is
           not. Enhance the selftests to catch that kind of issue
      
         - Annotate indirect calls/jumps for objtool on 32bit. This is not a
           functional issue, but for consistency sake its the right thing to
           do.
      
         - Fix a jump label build warning observed on SPARC64 which uses 32bit
           storage for the code location which is casted to 64 bit pointer w/o
           extending it to 64bit first.
      
         - Add two new cpufeature bits. Not really an urgent issue, but
           provides them for both x86 and x86/kvm work. No impact on the
           current kernel"
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/microcode: Fix CPU synchronization routine
        x86/microcode: Attempt late loading only when new microcode is present
        x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
        jump_label: Fix sparc64 warning
        x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32-bit kernels
        x86/vm86/32: Fix POPF emulation
        selftests/x86/entry_from_vm86: Add test cases for POPF
        selftests/x86/entry_from_vm86: Exit with 1 if we fail
        x86/cpufeatures: Add Intel PCONFIG cpufeature
        x86/cpufeatures: Add Intel Total Memory Encryption cpufeature
      9e1909b9
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · df4fe178
      Linus Torvalds authored
      Pull x86 fix from Thomas Gleixner:
       "A single fix for vmalloc_fault() which uses p*d_huge() unconditionally
        whether CONFIG_HUGETLBFS is set or not. In case of CONFIG_HUGETLBFS=n
        this results in a crash as p*d_huge() returns 0 in that case"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mm: Fix vmalloc_fault to use pXd_large
      df4fe178
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d2149e13
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "Three fixes for irq chip drivers:
      
         - Make sure the allocations in the GIC-V3 ITS driver are large enough
           to accomodate the interrupt space
      
         - Fix a misplaced __iomem annotation which causes a splat of 26
           sparse warnings
      
         - Remove an unused function in the IMX GPCV2 driver which causes
           build warnings"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/irq-imx-gpcv2: Remove unused function
        irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
        irqchip/gic-v3-its: Fix misplaced __iomem annotations
      d2149e13
    • Linus Torvalds's avatar
      Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 23fe85ae
      Linus Torvalds authored
      Pull EFI fix from Thomas Gleixner:
       "A single fix to prevent partially initialized pointers in mixed mode
        (64bit kernel on 32bit UEFI)"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/libstub/tpm: Initialize pointer variables to zero for mixed mode
      23fe85ae
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 3cd1d327
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "PPC:
         - fix bug leading to lost IPIs and smp_call_function_many() lockups
           on POWER9
      
        ARM:
         - locking fix
         - reset fix
         - GICv2 multi-source SGI injection fix
         - GICv2-on-v3 MMIO synchronization fix
         - make the console less verbose.
      
        x86:
         - fix device passthrough on AMD SME"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Fix device passthrough when SME is active
        kvm: arm/arm64: vgic-v3: Tighten synchronization for guests using v2 on v3
        KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintid
        KVM: arm/arm64: Reduce verbosity of KVM init log
        KVM: arm/arm64: Reset mapped IRQs on VM reset
        KVM: arm/arm64: Avoid vcpu_load for other vcpu ioctls than KVM_RUN
        KVM: arm/arm64: vgic: Add missing irq_lock to vgic_mmio_read_pending
        KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entry
      3cd1d327
  4. 17 Mar, 2018 1 commit
    • John David Anglin's avatar
      parisc: Handle case where flush_cache_range is called with no context · 9ef0f88f
      John David Anglin authored
      Just when I had decided that flush_cache_range() was always called with
      a valid context, Helge reported two cases where the
      "BUG_ON(!vma->vm_mm->context);" was hit on the phantom buildd:
      
       kernel BUG at /mnt/sdb6/linux/linux-4.15.4/arch/parisc/kernel/cache.c:587!
       CPU: 1 PID: 3254 Comm: kworker/1:2 Tainted: G D 4.15.0-1-parisc64-smp #1 Debian 4.15.4-1+b1
       Workqueue: events free_ioctx
        IAOQ[0]: flush_cache_range+0x164/0x168
        IAOQ[1]: flush_cache_page+0x0/0x1c8
        RP(r2): unmap_page_range+0xae8/0xb88
       Backtrace:
        [<00000000404a6980>] unmap_page_range+0xae8/0xb88
        [<00000000404a6ae0>] unmap_single_vma+0xc0/0x188
        [<00000000404a6cdc>] zap_page_range_single+0x134/0x1f8
        [<00000000404a702c>] unmap_mapping_range+0x1cc/0x208
        [<0000000040461518>] truncate_pagecache+0x98/0x108
        [<0000000040461624>] truncate_setsize+0x9c/0xb8
        [<00000000405d7f30>] put_aio_ring_file+0x80/0x100
        [<00000000405d803c>] aio_free_ring+0x8c/0x290
        [<00000000405d82c0>] free_ioctx+0x80/0x180
        [<0000000040284e6c>] process_one_work+0x21c/0x668
        [<00000000402854c4>] worker_thread+0x20c/0x778
        [<0000000040291d44>] kthread+0x2d4/0x2e0
        [<0000000040204020>] end_fault_vector+0x20/0xc0
      
      This indicates that we need to handle the no context case in
      flush_cache_range() as we do in flush_cache_mm().
      
      In thinking about this, I realized that we don't need to flush the TLB
      when there is no context.  So, I added context checks to the large flush
      cases in flush_cache_mm() and flush_cache_range().  The large flush case
      occurs frequently in flush_cache_mm() and the change should improve fork
      performance.
      
      The v2 version of this change removes the BUG_ON from flush_cache_page()
      by skipping the TLB flush when there is no context.  I also added code
      to flush the TLB in flush_cache_mm() and flush_cache_range() when we
      have a context that's not current.  Now all three routines handle TLB
      flushes in a similar manner.
      Signed-off-by: default avatarJohn David Anglin <dave.anglin@bell.net>
      Cc: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      9ef0f88f
  5. 16 Mar, 2018 17 commits
  6. 15 Mar, 2018 1 commit
    • Eric W. Biederman's avatar
      fs: Teach path_connected to handle nfs filesystems with multiple roots. · 95dd7758
      Eric W. Biederman authored
      On nfsv2 and nfsv3 the nfs server can export subsets of the same
      filesystem and report the same filesystem identifier, so that the nfs
      client can know they are the same filesystem.  The subsets can be from
      disjoint directory trees.  The nfsv2 and nfsv3 filesystems provides no
      way to find the common root of all directory trees exported form the
      server with the same filesystem identifier.
      
      The practical result is that in struct super s_root for nfs s_root is
      not necessarily the root of the filesystem.  The nfs mount code sets
      s_root to the root of the first subset of the nfs filesystem that the
      kernel mounts.
      
      This effects the dcache invalidation code in generic_shutdown_super
      currently called shrunk_dcache_for_umount and that code for years
      has gone through an additional list of dentries that might be dentry
      trees that need to be freed to accomodate nfs.
      
      When I wrote path_connected I did not realize nfs was so special, and
      it's hueristic for avoiding calling is_subdir can fail.
      
      The practical case where this fails is when there is a move of a
      directory from the subtree exposed by one nfs mount to the subtree
      exposed by another nfs mount.  This move can happen either locally or
      remotely.  With the remote case requiring that the move directory be cached
      before the move and that after the move someone walks the path
      to where the move directory now exists and in so doing causes the
      already cached directory to be moved in the dcache through the magic
      of d_splice_alias.
      
      If someone whose working directory is in the move directory or a
      subdirectory and now starts calling .. from the initial mount of nfs
      (where s_root == mnt_root), then path_connected as a heuristic will
      not bother with the is_subdir check.  As s_root really is not the root
      of the nfs filesystem this heuristic is wrong, and the path may
      actually not be connected and path_connected can fail.
      
      The is_subdir function might be cheap enough that we can call it
      unconditionally.  Verifying that will take some benchmarking and
      the result may not be the same on all kernels this fix needs
      to be backported to.  So I am avoiding that for now.
      
      Filesystems with snapshots such as nilfs and btrfs do something
      similar.  But as the directory tree of the snapshots are disjoint
      from one another and from the main directory tree rename won't move
      things between them and this problem will not occur.
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      95dd7758