    mm/hugetlb_vmemmap: fix race with speculative PFN walkers · bd225530
    Yu Zhao authored
    While investigating HVO for THPs [1], it turns out that speculative PFN
    walkers like compaction can race with vmemmap modifications, e.g.,
    
      CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
      -------------------------------  ------------------------------
      Allocates an LRU folio page1
                                       Sees page1
      Frees page1
    
      Allocates a hugeTLB folio page2
      (page1 being a tail of page2)
    
      Updates vmemmap mapping page1
                                       get_page_unless_zero(page1)
    
    Even though page1->_refcount is zero after HVO, get_page_unless_zero() can
    still try to modify this read-only field, resulting in a crash.
    
    An independent report [2] confirmed this race.
    
    There are two discussed approaches to fix this race:
    1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
       triggering a page fault (PF).
    2. Use RCU to make sure get_page_unless_zero() either sees zero
       page->_refcount through the old vmemmap or non-zero page->_refcount
       through the new one.
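
Both approaches rely on the same reader-side property: when the count is zero, the attempt must fail without storing to the field. The block below is a minimal userspace sketch of that "increment unless zero" behaviour using C11 atomics. It is not the kernel's implementation; `struct model_page`, `model_get_unless_zero()` and `model_self_check()` are hypothetical names introduced only for illustration. RCU itself is not modeled. In the kernel, the reader would presumably run inside an RCU read-side critical section, with synchronize_rcu() on the modifier side before the vmemmap is remapped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct model_page {
	atomic_int refcount;
};

/*
 * Take a reference unless the count is zero.
 *
 * Returns true after incrementing the count. Returns false, without
 * performing any store, when the count is observed to be zero.
 */
bool model_get_unless_zero(struct model_page *page)
{
	int old = atomic_load_explicit(&page->refcount, memory_order_acquire);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&page->refcount,
							  &old, old + 1,
							  memory_order_acq_rel,
							  memory_order_acquire))
			return true;
	}
	return false;
}

/* Exercise both outcomes: a live page and a page whose count is zero. */
void model_self_check(void)
{
	struct model_page live, freed;

	atomic_init(&live.refcount, 1);
	atomic_init(&freed.refcount, 0);

	assert(model_get_unless_zero(&live));
	assert(atomic_load(&live.refcount) == 2);

	assert(!model_get_unless_zero(&freed));
	assert(atomic_load(&freed.refcount) == 0);
}
```

In this model the zero case returns before any compare-and-exchange is attempted, so a walker that observes zero never writes to the field.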
    
    The second approach is preferred here because:
    1. It can prevent illegal modifications to struct page[] that has been
       HVO'ed;
    2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
       similar races in other places, e.g., arch_remove_memory() on x86
       [3], which frees vmemmap mapping offlined struct page[].
    
    While adding synchronize_rcu(), the goal is to be surgical, rather than
    optimized.  Specifically, calls to synchronize_rcu() on the error handling
    paths can be coalesced, but it is not done for the sake of simplicity:
    notably, this fix removes ~50% more lines than it adds.
    
    According to the hugetlb_optimize_vmemmap section in
    Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating or
    freeing hugeTLB pages "~2x slower than before".  Having synchronize_rcu()
    on top makes those operations even worse, and this also affects the user
    interface /proc/sys/vm/nr_overcommit_hugepages.
    
    This is *very* hard to trigger:
    
    1. Most hugeTLB use cases I know of are static, i.e., reserved at
       boot time, because allocating at runtime is not reliable at all.
    
    2. On top of that, someone has to be very unlucky to trip over the
       race above, because the race window is so small -- I wasn't able to
       trigger it with a stress test that does nothing but that (with
       THPs though).
    
    [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
    [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
    [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/
    
    Link: https://lkml.kernel.org/r/20240627222705.2974207-1-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Muchun Song <muchun.song@linux.dev>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Frank van der Linden <fvdl@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Yang Shi <yang@os.amperecomputing.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb_vmemmap.c 21.2 KB