• Hugh Dickins's avatar
    mm/khugepaged: fix filemap page_to_pgoff(page) != offset · 033b5d77
    Hugh Dickins authored
    There have been elusive reports of filemap_fault() hitting its
    VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
    with CONFIG_READ_ONLY_THP_FOR_FS=y.
    
    Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
    CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
    without NUMA reuses the same huge page after collapse_file() failed
    (whereas NUMA targets its allocation to the respective node each time).
    And most of us were usually testing with CONFIG_NUMA=y kernels.
    
    collapse_file(old start)
      new_page = khugepaged_alloc_page(hpage)
      __SetPageLocked(new_page)
      new_page->index = start // hpage->index=old offset
      new_page->mapping = mapping
      xas_store(&xas, new_page)
    
                              filemap_fault
                                page = find_get_page(mapping, offset)
                                // if offset falls inside hpage then
                                // compound_head(page) == hpage
                                lock_page_maybe_drop_mmap()
                                  __lock_page(page)
    
      // collapse fails
      xas_store(&xas, old page)
      new_page->mapping = NULL
      unlock_page(new_page)
    
    collapse_file(new start)
      new_page = khugepaged_alloc_page(hpage)
      __SetPageLocked(new_page)
      new_page->index = start // hpage->index=new offset
      new_page->mapping = mapping // mapping becomes valid again
    
                                // since compound_head(page) == hpage
                                // page_to_pgoff(page) got changed
                                VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
    
    An initial patch replaced __SetPageLocked() by lock_page(), which did
    fix the race which Suren illustrates above.  But testing showed that it's
    not good enough: if the racing task's __lock_page() gets delayed long
    after its find_get_page(), then it may follow collapse_file(new start)'s
    successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
    
    It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
    check and retry (as is done for mapping), with similar relaxations in
    find_lock_entry() and pagecache_get_page(): but it's not obvious what
    else might get caught out; and khugepaged non-NUMA appears to be unique
    in exposing a page to page cache, then revoking, without going through
    a full cycle of freeing before reuse.
    
    Instead, non-NUMA khugepaged_prealloc_page() release the old page
    if anyone else has a reference to it (1% of cases when I tested).
    
    Although never reported on huge tmpfs, I believe its find_lock_entry()
    has been at similar risk; but huge tmpfs does not rely on khugepaged
    for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
    Reported-by: default avatarDenis Lisov <dennis.lissov@gmail.com>
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
    Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.orgReported-by: default avatarQian Cai <cai@lca.pw>
    Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pwReported-and-analyzed-by: default avatarSuren Baghdasaryan <surenb@google.com>
    Fixes: 87c460a0 ("mm/khugepaged: collapse_shmem() without freezing new_page")
    Signed-off-by: default avatarHugh Dickins <hughd@google.com>
    Cc: stable@vger.kernel.org # v4.9+
    Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    033b5d77
khugepaged.c 58.1 KB