1. 01 Jul, 2013 23 commits
    • powerpc: Rename and flesh out the facility unavailable exception handler · 021424a1
      Michael Ellerman authored
      The exception at 0xf60 is not the TM (Transactional Memory) unavailable
      exception, it is the "Facility Unavailable Exception", rename it as
      such.
      
      Flesh out the handler to acknowledge the fact that it can be called for
      many reasons, one of which is TM being unavailable.
      
      Use STD_EXCEPTION_COMMON() for the exception body. For some reason we
      had it open-coded; I've checked the generated code is identical.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      CC: <stable@vger.kernel.org> [v3.10]
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Remove KVMTEST from RELON exception handlers · c9f69518
      Michael Ellerman authored
      KVMTEST is a macro which checks whether we are taking an exception from
      guest context, if so we branch out of line and eventually call into the
      KVM code to handle the switch.
      
      When running real guests on bare metal (HV KVM) the hardware ensures
      that we never take a relocation on exception when transitioning from
      guest to host. For PR KVM we disable relocation on exceptions ourselves in
      kvmppc_core_init_vm(), as of commit a413f474 "Disable relocation on
      exceptions whenever PR KVM is active".
      
      So convert all the RELON macros to use NOTEST, and drop the remaining
      KVM_HANDLER() definitions we have for 0xe40 and 0xe80.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      CC: <stable@vger.kernel.org> [v3.9+]
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Remove unreachable relocation on exception handlers · 1d567cb4
      Michael Ellerman authored
      We have relocation on exception handlers defined for h_data_storage and
      h_instr_storage. However we will never take relocation on exceptions for
      these because they can only come from a guest, and we never take
      relocation on exceptions when we transition from guest to host.
      
      We also have a handler for hmi_exception (Hypervisor Maintenance) which
      is defined in the architecture to never be delivered with relocation on,
      see v2.07 Book III-S section 6.5.
      
      So remove the handlers, leaving a branch to self just to be double extra
      paranoid.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      CC: <stable@vger.kernel.org> [v3.9+]
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/numa: Do not update sysfs cpu registration from invalid context · dd023217
      Nathan Fontenot authored
      The topology update code that updates the cpu node registration in sysfs
      should not be called while in stop_machine(). The register/unregister
      calls take a lock and may sleep.
      
      This patch moves these calls outside of the call to stop_machine().
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/eeh: Update MAINTAINERS · ec207dcc
      Gavin Shan authored
      Update MAINTAINERS to reflect recent changes.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/smp: Section mismatch from smp_release_cpus to __initdata spinning_secondaries · 8246aca7
      Chen Gang authored
      smp_release_cpus() is a normal function that is called in normal environments,
      but it references spinning_secondaries, which is marked __initdata.
      Modify spinning_secondaries to match smp_release_cpus().

      The related warning
      (the linker reports boot_paca.33377, but it should be spinning_secondaries):
      
      -----------------------------------------------------------------------------
      
      WARNING: arch/powerpc/kernel/built-in.o(.text+0x23176): Section mismatch in reference from the function .smp_release_cpus() to the variable .init.data:boot_paca.33377
      The function .smp_release_cpus() references
      the variable __initdata boot_paca.33377.
      This is often because .smp_release_cpus lacks a __initdata
      annotation or the annotation of boot_paca.33377 is wrong.
      
      WARNING: arch/powerpc/kernel/built-in.o(.text+0x231fe): Section mismatch in reference from the function .smp_release_cpus() to the variable .init.data:boot_paca.33377
      The function .smp_release_cpus() references
      the variable __initdata boot_paca.33377.
      This is often because .smp_release_cpus lacks a __initdata
      annotation or the annotation of boot_paca.33377 is wrong.
      
      -----------------------------------------------------------------------------
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • macintosh/windfarm: Remove obsolete cleanup for clientdata · 91f5af2e
      Wolfram Sang authored
      A few new i2c-drivers came into the kernel which clear the clientdata-pointer
      on exit or error. This is now obsolete; the core will do it.
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/nvram64: Need return the related error code on failure occurs · 7029705a
      Chen Gang authored
      When an error occurs, return the related error code so that the upper
      caller knows about it.
      
      ppc_md.nvram_size() can return an error code (e.g. core99_nvram_size()
      in 'arch/powerpc/platforms/powermac/nvram.c').
      
      Also set the ret value only when it is needed, which saves instructions in
      the normal case.
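
The error-propagation pattern described above can be sketched in userspace C. This is a minimal illustration, not the nvram driver code: `bounded_read_len()`, the `size_fn` callback type, and the two fake size functions are hypothetical names, and offsets are assumed to be non-negative.

```c
#include <errno.h>

/* Hypothetical stand-in for a platform hook that reports a device size
 * or a negative errno value on failure. */
typedef long (*size_fn)(void);

static long fake_size_ok(void)   { return 8192; }
static long fake_size_fail(void) { return -ENODEV; }

/* Returns how many bytes may be read at 'offset' (assumed >= 0),
 * or a negative errno value if the size query itself failed. */
static long bounded_read_len(size_fn get_size, long offset, long count)
{
    long size = get_size();

    /* Propagate the failure instead of treating it as a size. */
    if (size < 0)
        return size;

    /* Reading at or past the end yields nothing. */
    if (offset >= size)
        return 0;

    /* Clamp the request to the bytes that remain. */
    if (count > size - offset)
        count = size - offset;

    return count;
}
```

Note that no error value is assigned on the success path; the negative code only exists when the size query actually fails.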
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Set cpu sibling mask before online cpu · cce606fe
      Li Zhong authored
      It seems the following race is possible:
      
      	cpu0					cpux
      smp_init->cpu_up->_cpu_up
      	__cpu_up
      		kick_cpu(1)
      -------------------------------------------------------------------------
      		waiting online			...
      		...				notify CPU_STARTING
      							set cpux active
      						set cpux online
      -------------------------------------------------------------------------
      		finish waiting online
      		...
      sched_init_smp
      	init_sched_domains(cpu_active_mask)
      		build_sched_domains
      						set cpux sibling info
      -------------------------------------------------------------------------
      
      Execution of cpu0 and cpux can be concurrent between two separator
      lines.
      
      So if the cpux sibling information is set too late (normally
      impossible, but it can be triggered by adding some delay in
      start_secondary, after setting the cpu online), build_sched_domains()
      running on cpu0 might see cpux active with an empty sibling mask, and
      then access a bad address, as in the following:
      
      [    0.099855] Unable to handle kernel paging request for data at address 0xc00000038518078f
      [    0.099868] Faulting instruction address: 0xc0000000000b7a64
      [    0.099883] Oops: Kernel access of bad area, sig: 11 [#1]
      [    0.099895] PREEMPT SMP NR_CPUS=16 DEBUG_PAGEALLOC NUMA pSeries
      [    0.099922] Modules linked in:
      [    0.099940] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-rc1-00120-gb973425c-dirty #16
      [    0.099956] task: c0000001fed80000 ti: c0000001fed7c000 task.ti: c0000001fed7c000
      [    0.099971] NIP: c0000000000b7a64 LR: c0000000000b7a40 CTR: c0000000000b4934
      [    0.099985] REGS: c0000001fed7f760 TRAP: 0300   Not tainted  (3.10.0-rc1-00120-gb973425c-dirty)
      [    0.099997] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 24272828  XER: 20000003
      [    0.100045] SOFTE: 1
      [    0.100053] CFAR: c000000000445ee8
      [    0.100064] DAR: c00000038518078f, DSISR: 40000000
      [    0.100073]
      GPR00: 0000000000000080 c0000001fed7f9e0 c000000000c84d48 0000000000000010
      GPR04: 0000000000000010 0000000000000000 c0000001fc55e090 0000000000000000
      GPR08: ffffffffffffffff c000000000b80b30 c000000000c962d8 00000003845ffc5f
      GPR12: 0000000000000000 c00000000f33d000 c00000000000b9e4 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
      GPR20: c000000000ccf750 0000000000000000 c000000000c94d48 c0000001fc504000
      GPR24: c0000001fc504000 c0000001fecef848 c000000000c94d48 c000000000ccf000
      GPR28: c0000001fc522090 0000000000000010 c0000001fecef848 c0000001fed7fae0
      [    0.100293] NIP [c0000000000b7a64] .get_group+0x84/0xc4
      [    0.100307] LR [c0000000000b7a40] .get_group+0x60/0xc4
      [    0.100318] Call Trace:
      [    0.100332] [c0000001fed7f9e0] [c0000000000dbce4] .lock_is_held+0xa8/0xd0 (unreliable)
      [    0.100354] [c0000001fed7fa70] [c0000000000bf62c] .build_sched_domains+0x728/0xd14
      [    0.100375] [c0000001fed7fbe0] [c000000000af67bc] .sched_init_smp+0x4fc/0x654
      [    0.100394] [c0000001fed7fce0] [c000000000adce24] .kernel_init_freeable+0x17c/0x30c
      [    0.100413] [c0000001fed7fdb0] [c00000000000ba08] .kernel_init+0x24/0x12c
      [    0.100431] [c0000001fed7fe30] [c000000000009f74] .ret_from_kernel_thread+0x5c/0x68
      [    0.100445] Instruction dump:
      [    0.100456] 38800010 38a00000 4838e3f5 60000000 7c6307b4 2fbf0000 419e0040 3d220001
      [    0.100496] 78601f24 39491590 e93e0008 7d6a002a <7d69582a> f97f0000 7d4a002a e93e0010
      [    0.100559] ---[ end trace 31fd0ba7d8756001 ]---
      
      This patch moves the sibling map updates before
      notify_cpu_starting() and setting the cpu online, and adds a write barrier
      there to make sure the sibling maps are updated before the active and
      online masks.
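
The ordering requirement above (publish the sibling data first, then the online bit) can be sketched with C11 atomics. This is a userspace analogy under stated assumptions, not the kernel change: the release/acquire pair stands in for the kernel's write barrier and the reader's matching ordering, and `cpu_bring_online()`, `cpu_siblings_if_online()`, and the mask layout are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS 16

/* Plain data that must be valid before a CPU is seen as online. */
static uint32_t sibling_mask[MAX_CPUS];

/* One bit per CPU; readers use this to decide whether to trust sibling_mask. */
static _Atomic uint32_t online_mask;

/* Writer side: fill in the sibling data first, then publish the online bit.
 * The release ordering keeps the earlier store from being reordered after it. */
static void cpu_bring_online(unsigned int cpu, uint32_t siblings)
{
    sibling_mask[cpu] = siblings;
    atomic_fetch_or_explicit(&online_mask, 1u << cpu, memory_order_release);
}

/* Reader side: only look at the sibling data once the online bit is visible.
 * The acquire load pairs with the release above. */
static bool cpu_siblings_if_online(unsigned int cpu, uint32_t *out)
{
    uint32_t online = atomic_load_explicit(&online_mask, memory_order_acquire);

    if (!(online & (1u << cpu)))
        return false;

    *out = sibling_mask[cpu];
    return true;
}
```

With this ordering a reader that observes the online bit cannot observe an empty sibling mask for that CPU, which is the property the race description says was missing.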
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • mac: Make cuda_init_via() __init · 330dae19
      Geert Uytterhoeven authored
      cuda_init_via() is called from find_via_cuda() only, which is __init.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Delete __cpuinit usage from all users · 061d19f2
      Paul Gortmaker authored
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the powerpc uses of the __cpuinit macros.  There
      are no __CPUINIT users in assembly files in powerpc.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • macintosh: Convert use of typedef ctl_table to struct ctl_table · 5eb969d0
      Joe Perches authored
      This typedef is unnecessary and should just be removed.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/idle: Convert use of typedef ctl_table to struct ctl_table · cc293bf7
      Joe Perches authored
      This typedef is unnecessary and should just be removed.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/iommu: Remove unused pci_iommu_init() and pci_direct_iommu_init() · 5524f3fc
      Bjorn Helgaas authored
      pci_iommu_init() and pci_direct_iommu_init() are not referenced anywhere,
      so remove them.
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Don't flush/invalidate the d/icache for an unknown relocation type · 348c2298
      Kevin Hao authored
      For an unknown relocation type, r4 holds just the 8-bit
      relocation type, so the sum of r4 and r7 may yield an invalid memory
      address. For example:
          In normal case:
                   r4 = c00xxxxx
                   r7 = 40000000
                   r4 + r7 = 000xxxxx
      
          For an unknown relocation type:
                   r4 = 000000xx
                   r7 = 40000000
                   r4 + r7 = 400000xx
         400000xx is an invalid memory address for a board which has just
         512M memory.
      
      Operations such as dcbst or icbi may cause a bus error for an
      invalid memory address on some platforms and then cause the board
      to reset. So we should skip flushing/invalidating the d/icache for
      an unknown relocation type.
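
The address arithmetic in the example above can be reproduced with 32-bit unsigned addition, which wraps modulo 2^32. This is a minimal sketch rather than the kernel's relocation code: the function names and the RAM-size check are illustrative assumptions, and callers supply concrete values in place of the `xxxxx` digits.

```c
#include <stdbool.h>
#include <stdint.h>

/* 512M of RAM, matching the example board in the commit message. */
#define RAM_SIZE_512M 0x20000000u

/* r4 + r7 as computed in a 32-bit register: the carry out of bit 31 is lost. */
static uint32_t reloc_target(uint32_t r4, uint32_t r7)
{
    return r4 + r7;
}

/* Hypothetical validity check: does the address fall inside RAM? */
static bool address_in_ram(uint32_t addr)
{
    return addr < RAM_SIZE_512M;
}
```

For a normal kernel address such as `0xc0012345`, adding `0x40000000` wraps to `0x00012345`, which is inside RAM. For a bare relocation type such as `0xab`, the same addition gives `0x400000ab`, which lies beyond 512M.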
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Acked-by: Suzuki K. Poulose <suzuki@in.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/windfarm: Fix overtemperature clearing · 4bb29711
      Aaro Koskinen authored
      With pm81/pm91/pm121, when the overtemperature state is entered, and
      when it remains on after skipped ticks, the driver will try to leave
      it too soon (immediately on the next tick). This is because the active
      FAILURE_OVERTEMP state is not visible in "new_failure" variable of the
      current tick. Furthermore, the driver will keep trying to clear the condition
      in subsequent ticks as FAILURE_OVERTEMP remains set in the "last_failure"
      variable. These will start to trigger WARNINGS from windfarm core:
      
      [  100.082735] windfarm: Clamping CPU frequency to minimum !
      [  100.108132] windfarm: Overtemp condition detected !
      [  101.952908] windfarm: Overtemp condition cleared !
      [...]
      [  102.980388] WARNING: at drivers/macintosh/windfarm_core.c:463
      [...]
      [  103.982227] WARNING: at drivers/macintosh/windfarm_core.c:463
      [...]
      [  105.030494] WARNING: at drivers/macintosh/windfarm_core.c:463
      [...]
      [  105.973666] WARNING: at drivers/macintosh/windfarm_core.c:463
      [...]
      [  106.977913] WARNING: at drivers/macintosh/windfarm_core.c:463
      
      Fix by adding a helper global variable. We leave the overtemp state only
      after all failure bits have been cleared.
      
      I saw this error on iMac G5 iSight (pm121). Also pm81/pm91 are fixed
      based on the observation that these are almost identical/copy-pasted code.
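
The idea behind the fix (remember the overtemp state in its own flag and leave it only once every failure bit is clear) can be modelled in a few lines of C. This is a simplified sketch and not the driver code: the bit values, `control_tick()`, and the counters that stand in for the windfarm core's set/clear calls are all hypothetical.

```c
#include <stdbool.h>

#define FAILURE_SENSOR   0x1u  /* hypothetical bit values */
#define FAILURE_OVERTEMP 0x2u

static unsigned int failure_state;  /* failure bits seen on the latest tick */
static bool overtemp_active;        /* helper flag: are we in the overtemp state? */
static int overtemp_set_calls;      /* stand-in for entering overtemp in the core */
static int overtemp_clear_calls;    /* stand-in for leaving overtemp in the core */

/* One control tick; 'current_failures' is what the sensors report now. */
static void control_tick(unsigned int current_failures)
{
    unsigned int last_failure = failure_state;
    unsigned int new_failure;

    failure_state = current_failures;

    /* Only bits that were not already set on the previous tick count as new,
     * so a persisting overtemp condition is not visible here. */
    new_failure = failure_state & ~last_failure;

    if (new_failure & FAILURE_OVERTEMP) {
        overtemp_set_calls++;
        overtemp_active = true;
    } else if (failure_state == 0 && overtemp_active) {
        /* Leave overtemp exactly once, and only when nothing is failing. */
        overtemp_clear_calls++;
        overtemp_active = false;
    }
}
```

Because the clear is gated on the helper flag rather than on the previous tick's bits, repeated ticks with no failures do not try to clear the condition again.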
      Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/powernv: Use dev-node in PCI config accessors · 9bf41be6
      Gavin Shan authored
      Currently, we're using the combination (PCI bus + devfn) in the PCI
      config accessors, and the PCI config accessors in EEH depend on them.
      However, it's not safe to refer to a PCI bus which might have been
      removed during hotplug. So we now use the device node in the PCI
      config accessors, and the corresponding backends just reuse them.
      
      The patch also fixes one potential risk: we possibly have a frozen
      PE during the early PCI probe time, but we haven't set up the PE
      mapping yet. So the errors should be counted against PE#0.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/eeh: Avoid build warnings · eeb6361f
      Gavin Shan authored
      The patch avoids the following build warnings:
      
         The function .pnv_pci_ioda_fixup() references
         the function __init .eeh_init().
         This is often because .pnv_pci_ioda_fixup lacks a __init
      
         The function .pnv_pci_ioda_fixup() references
         the function __init .eeh_addr_cache_build().
         This is often because .pnv_pci_ioda_fixup lacks a __init
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/eeh: Refactor the output message · 56ca4fde
      Gavin Shan authored
      We don't need the whole backtrace, only a one-line message, in
      the error reporting interrupt handler. For errors triggered by
      accessing PCI config space or MMIO, we replace "WARN(1, ...)" with
      pr_err() and dump_stack(). The patch also adds more output messages
      to indicate what the EEH core is doing. Besides, some printk() calls are
      replaced with pr_warning().
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/eeh: Fix address catch for PowerNV · 88b6d14b
      Gavin Shan authored
      On the PowerNV platform, the EEH address cache isn't built correctly
      because we skipped the EEH devices without a bound PE. The patch
      fixes that.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/powernv: Replace variables with flags · 0b9e267d
      Gavin Shan authored
      We have 2 fields in "struct pnv_phb" to trace the states. The patch
      replaces the fields with one and introduces flags for that. The patch
      doesn't impact the logic.
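
Folding several state fields into one flags word usually looks like the sketch below. It is a generic illustration, not the actual "struct pnv_phb" layout: the structure, the flag names, and the helper functions are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical state bits; each former boolean field becomes one bit. */
#define PHB_FLAG_EEH_ENABLED 0x1u
#define PHB_FLAG_EEH_READY   0x2u

/* Illustrative structure with a single flags word in place of two fields. */
struct phb_state {
    unsigned int flags;
};

static void phb_set_flag(struct phb_state *phb, unsigned int flag)
{
    phb->flags |= flag;
}

static void phb_clear_flag(struct phb_state *phb, unsigned int flag)
{
    phb->flags &= ~flag;
}

static bool phb_test_flag(const struct phb_state *phb, unsigned int flag)
{
    return (phb->flags & flag) != 0;
}
```

Each bit can be set, cleared, and tested independently, so the logic that previously read two separate fields stays the same.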
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/eeh: Check PCIe link after reset · 652defed
      Gavin Shan authored
      After a reset (e.g. complete reset) to bring the fenced PHB
      back, the PCIe link might not be ready yet. The patch
      makes sure the PCIe link is ready before accessing its subordinate
      PCI devices. The patch also fixes the wrong values being restored to
      the PCI_COMMAND register for PCI bridges.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/eeh: Don't collect PCI-CFG data on PHB · c35ae179
      Gavin Shan authored
      When the PHB is fenced or dead, it's pointless to collect the data
      from the PCI config space of subordinate PCI devices since it should
      return 0xFF's. The patch also fixes a buffer overwrite while getting
      PCI config data.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  2. 30 Jun, 2013 3 commits
  3. 25 Jun, 2013 8 commits
  4. 21 Jun, 2013 6 commits
    • powerpc: Optimize hugepage invalidate · 1a527286
      Aneesh Kumar K.V authored
      Hugepage invalidate involves invalidating multiple hpte entries.
      Optimize the operation using H_BULK_REMOVE on lpar platforms.
      On native, reduce the number of TLB flushes.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/THP: Enable THP on PPC64 · 437d4964
      Aneesh Kumar K.V authored
      We enable it only if we support the 16MB page size.
      Reviewed-by: David Gibson <dwg@au1.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: split hugepage when using subpage protection · d8e355a2
      Aneesh Kumar K.V authored
      We find all the overlapping vmas and mark them such that we don't allocate
      a hugepage in that range. We also split any existing huge page so that the
      normal page hash can be invalidated and a new page faulted in with the new
      protection bits.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: disable assert_pte_locked for collapse_huge_page · a00e7bea
      Aneesh Kumar K.V authored
      With THP we set the pmd to none before we do pte_clear. Hence we can't
      walk the page table to get the pte lock ptr and verify whether it is locked.
      THP does take the pte lock before calling pte_clear, so we don't change the
      locking rules here. It is just that we can't use page table walking to check
      whether pte locks are held with THP.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Prevent gcc to re-read the pagetables · 7888b4dd
      Aneesh Kumar K.V authored
      GCC is very likely to read the pagetables just once and cache them in
      the local stack or in a register, but it can also decide to re-read
      the pagetables. The problem is that the pagetable in those places can
      change from under gcc.
      
      With THP/hugetlbfs the pmd (and pgd for hugetlbfs giga pages) can
      change under gup_fast. The pages won't be freed until we finish
      gup fast because we have irqs disabled and we free these pages via
      an rcu callback.
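
A common way to stop the compiler from re-reading a shared entry is to load it once through a volatile-qualified pointer and make every later decision on the local copy, in the spirit of the kernel's ACCESS_ONCE() macro. The sketch below is a userspace illustration under stated assumptions: `READ_ONCE_U64`, the entry encoding, and `classify_entry()` are hypothetical and do not reproduce the patch.

```c
#include <stdint.h>

/* Force exactly one load of a 64-bit object; the volatile access cannot be
 * duplicated or elided by the compiler. Hypothetical helper for this sketch. */
#define READ_ONCE_U64(x) (*(const volatile uint64_t *)&(x))

/* Hypothetical encoding: 0 means empty, bit 0 marks a huge mapping. */
#define ENTRY_HUGE_BIT 0x1ull

enum entry_kind { ENTRY_NONE, ENTRY_HUGE, ENTRY_TABLE };

/* Classify a table entry that another context may rewrite at any time.
 * All checks use the same snapshot, so they cannot disagree with each other. */
static enum entry_kind classify_entry(const uint64_t *entryp)
{
    uint64_t entry = READ_ONCE_U64(*entryp);

    if (entry == 0)
        return ENTRY_NONE;
    if (entry & ENTRY_HUGE_BIT)
        return ENTRY_HUGE;
    return ENTRY_TABLE;
}
```

Without the single snapshot, the "is it empty" and "is it huge" tests could each observe a different value of the entry.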
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Make linux pagetable walk safe with THP enabled · 0ac52dd7
      Aneesh Kumar K.V authored
      We need to have irqs disabled to handle all the possible parallel updates to
      the linux page table without holding locks.
      
      Events that we are interested in while walking page tables are
      1) Page fault
      2) unmap
      3) THP split
      4) THP collapse
      
      A) local_irq_disabled:
      ------------------------
      1) page fault:
      A none to valid transition via page fault is not an issue because we
      would either see a none or valid. If it is none, we would error out
      the page table walk. We may need to use on stack values when checking for
      type of page table elements, because if we do
      
      if (!is_hugepd()) {
          if (!pmd_none()) {
             if (pmd_bad()) {
      
      We could take that bad condition because the pmd got converted to a hugepd
      after the !is_hugepd check via a hugetlb fault.
      
      The right way would be to check for pmd_none higher up or use on stack value.
      
      2) A valid to none conversion via unmap:
      We can safely walk the upper level table, because we don't remove the
      page table entries until the rcu grace period. So even if we followed a
      wrong pointer we still have the pointer valid till the grace period.
      
      A PTE pointer returned needs to be atomically checked for _PAGE_PRESENT and
      _PAGE_BUSY. A valid pointer returned could become none later. To prevent
      pte_clear we take _PAGE_BUSY.
      
      3) THP split:
      A valid transparent hugepage is converted to normal pages. Before we split we
      do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING.
      So when walking the page table we need to check for pmd_trans_splitting and
      handle that. The pte returned also needs to be checked for
      _PAGE_SPLITTING before setting _PAGE_BUSY, similar to _PAGE_PRESENT. We save
      the value of the PTE on the stack and check for the flag in the local pte value.
      If we don't have the value set we can safely operate on the local pte value
      and we atomically set _PAGE_BUSY.
      
      4) THP collapse:
      A normal page gets converted to a hugepage. In the collapse path, we
      mark the pmd none early (pmdp_clear_flush). With irqs disabled, if we
      are already walking the page table we would see the pmd_none and won't continue.
      If we see a valid PMD, we should still check for _PAGE_PRESENT before
      setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE.
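
The "check the flags on a local copy, then atomically set the busy bit" step described in items 2 to 4 above can be sketched with a C11 compare-and-swap. This is a userspace illustration, not the kernel implementation: the bit values, `pte_try_lock()`, and the choice to return failure (instead of spinning) when the entry is already busy are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for this sketch. */
#define PTE_PRESENT   0x1ull
#define PTE_BUSY      0x2ull
#define PTE_SPLITTING 0x4ull

/* Try to mark an entry busy. On success the pre-lock value is stored in
 * *snapshot and true is returned. Returns false if the entry is not present,
 * is being split, or is already busy (a real walker might retry instead). */
static bool pte_try_lock(_Atomic uint64_t *ptep, uint64_t *snapshot)
{
    for (;;) {
        /* Work on a local copy so all checks see the same value. */
        uint64_t old = atomic_load(ptep);

        if (!(old & PTE_PRESENT) || (old & PTE_SPLITTING))
            return false;
        if (old & PTE_BUSY)
            return false;

        /* Set the busy bit only if the entry still equals the checked copy;
         * a concurrent change (or a spurious failure) restarts the loop. */
        if (atomic_compare_exchange_weak(ptep, &old, old | PTE_BUSY)) {
            *snapshot = old;
            return true;
        }
    }
}
```

Because the compare-and-swap fails whenever the entry changed after it was read, an entry that was cleared or marked as splitting in between is never locked by mistake.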
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>