    powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR · 87e65ad7
    Michael Ellerman authored
    When we enabled STRICT_KERNEL_RWX we received some reports of boot
    failures when using the Hash MMU and running under phyp. The crashes
    are intermittent, and often manifest as a completely unresponsive
    system, or possibly an oops.
    
    One example, which was caught in xmon:
    
      [   14.068327][    T1] devtmpfs: mounted
      [   14.069302][    T1] Freeing unused kernel memory: 5568K
      [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
      [   14.142063][    T1] Run /sbin/init as init process
      [   14.142074][  T347] Faulting instruction address: 0xc000000000004400
      cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
          pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
          lr: c0000000001862d4: update_rq_clock+0x44/0x110
          sp: c00000000c747880
         msr: 8000000040001031
        current = 0xc00000000c60d380
        paca    = 0xc00000001ec9de80   irqmask: 0x03   irq_happened: 0x01
          pid   = 347, comm = kworker/2:1
      ...
      enter ? for help
      [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
      [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
      [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
      [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
      [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
      [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
      [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
      [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
      [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
      [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
      [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
    
    This shows that CPU 2, which was idle, woke up and then appears to
    randomly take an instruction fault on a completely valid area of
    kernel text.
    
    The cause turns out to be the call to hash__mark_rodata_ro(), late in
    boot. Due to the way we lay out text and rodata, that function actually
    changes the permissions for all of text and rodata to read-only plus
    execute.
    
    To do the permission change we use a hypervisor call, H_PROTECT. On
    phyp that appears to be implemented by briefly removing the mapping of
    the kernel text, before putting it back with the updated permissions.
    If any other CPU is executing during that window, it will see spurious
    faults on the kernel text and/or data, leading to crashes.
    
    To fix it we use stop machine to collect all other CPUs, and then have
    them drop into real mode (MMU off), while we change the mapping. That
    way they are unaffected by the mapping temporarily disappearing.
    
    We don't see this bug on KVM because KVM always uses VPM=1, where
    faults are directed to the hypervisor, and the fault will be
    serialised vs the h_protect() by HPTE_V_HVLOCK.
    
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/20210331003845.216246-5-mpe@ellerman.id.au
hash_pgtable.c 15.4 KB