    mm: hold PTL from the first PTE while reclaiming a large folio · 73bc3287
    Barry Song authored
    Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
    modifications that first clear the PTE.  While iterating over the PTEs of
    a large folio, it only starts acquiring the PTL from the first valid
    (present) PTE.  PTE modifications can temporarily set PTEs to pte_none.
    Consequently, the initial PTEs of a large folio might be skipped in
    try_to_unmap_one().
    
    For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
    still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
    try_to_unmap_one().
    
    So the folio is still mapped, fails to be reclaimed, and is put back on
    the LRU in this round.
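
The partial unmap described above can be replayed deterministically in a
small userspace model.  This is only a sketch under stated assumptions: the
toy_* names and the three-state PTE enum are hypothetical stand-ins, not
kernel types, and the "race" is simulated by starting the walk while PTE0
is transiently none and restoring it afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PTE states (assumption: not the kernel's pte_t encoding). */
enum toy_pte { TOY_NONE, TOY_PRESENT, TOY_SWAP };

/*
 * Model of a walk that only starts "holding the lock" at the first
 * present entry: leading entries that read as TOY_NONE are skipped.
 * Returns the number of entries converted to swap entries.
 */
static size_t toy_unmap_lockless_start(enum toy_pte *ptes, size_t nr)
{
    size_t i = 0, unmapped = 0;

    assert(ptes != NULL);

    /* Skip entries that look empty at the moment they are read. */
    while (i < nr && ptes[i] == TOY_NONE)
        i++;

    /* From here on the entries are treated as stable. */
    for (; i < nr; i++) {
        if (ptes[i] == TOY_PRESENT) {
            ptes[i] = TOY_SWAP;
            unmapped++;
        }
    }
    return unmapped;
}

/* True if any entry still maps the folio. */
static bool toy_still_mapped(const enum toy_pte *ptes, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (ptes[i] == TOY_PRESENT)
            return true;
    return false;
}

/*
 * PTE0 is transiently none while the walker runs; the concurrent
 * modifier then writes it back.  The folio ends up partially unmapped.
 */
static bool toy_race_leaves_folio_mapped(void)
{
    enum toy_pte ptes[4] = { TOY_NONE, TOY_PRESENT, TOY_PRESENT, TOY_PRESENT };
    size_t unmapped = toy_unmap_lockless_start(ptes, 4);

    ptes[0] = TOY_PRESENT;  /* modifier finishes its update */
    return unmapped == 3 && toy_still_mapped(ptes, 4);
}
```

In this model PTE1 ~ PTE3 become swap entries while PTE0 stays present, so
the folio cannot be reclaimed in this round.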
    
    This also breaks up PTE optimizations such as CONT-PTE on this large
    folio and may lead to an accidental folio_split() afterwards.  And since
    some of the PTEs are now swap entries, accessing those parts incurs the
    overhead of do_swap_page().  Although the kernel can withstand all of the
    above issues, the situation is awkward and worth improving.
    
    The same race also occurs with small folios, but they have only one PTE,
    so they cannot be partially unmapped.
    
    This patch holds PTL from PTE0, allowing us to avoid reading PTE values
    that are in the process of being transformed.  With stable PTE values, we
    can ensure that this large folio is either completely reclaimed or that
    all PTEs remain untouched in this round.
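
The all-or-nothing behaviour can be illustrated with a similar userspace
model.  Again a sketch under assumptions: the sim_* names are hypothetical,
the lock is a plain boolean rather than a real spinlock, and a walker that
cannot get the lock simply backs off here, whereas the kernel would wait
for the PTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_NR 4

enum sim_pte { SIM_NONE, SIM_PRESENT, SIM_SWAP };

/* Toy page table: one lock covering all entries (assumption). */
struct sim_pt {
    bool locked;
    enum sim_pte ptes[SIM_NR];
};

static bool sim_trylock(struct sim_pt *pt)
{
    if (pt->locked)
        return false;
    pt->locked = true;
    return true;
}

static void sim_unlock(struct sim_pt *pt)
{
    assert(pt->locked);
    pt->locked = false;
}

/* Modifier: takes the lock, then transiently clears one entry. */
static bool sim_modify_begin(struct sim_pt *pt, size_t idx)
{
    if (!sim_trylock(pt))
        return false;
    pt->ptes[idx] = SIM_NONE;
    return true;
}

/* Modifier: writes the entry back and drops the lock. */
static void sim_modify_end(struct sim_pt *pt, size_t idx)
{
    pt->ptes[idx] = SIM_PRESENT;
    sim_unlock(pt);
}

/* Walker: takes the lock before reading entry 0, so it never sees a
 * transient none.  Returns the number of entries unmapped. */
static size_t sim_unmap_locked_from_first(struct sim_pt *pt)
{
    size_t unmapped = 0;

    if (!sim_trylock(pt))
        return 0;  /* the kernel would wait for the lock instead */
    for (size_t i = 0; i < SIM_NR; i++) {
        if (pt->ptes[i] == SIM_PRESENT) {
            pt->ptes[i] = SIM_SWAP;
            unmapped++;
        }
    }
    sim_unlock(pt);
    return unmapped;
}

/* Either nothing is touched (modifier in flight) or everything is unmapped. */
static bool sim_all_or_nothing(void)
{
    struct sim_pt pt = { false,
        { SIM_PRESENT, SIM_PRESENT, SIM_PRESENT, SIM_PRESENT } };

    if (!sim_modify_begin(&pt, 0))
        return false;
    size_t during = sim_unmap_locked_from_first(&pt);
    bool untouched = pt.ptes[1] == SIM_PRESENT && pt.ptes[3] == SIM_PRESENT;
    sim_modify_end(&pt, 0);
    size_t after = sim_unmap_locked_from_first(&pt);

    return during == 0 && untouched && after == SIM_NR;
}
```

Because the walker synchronizes on the lock before looking at the first
entry, it either leaves every entry alone or converts all of them.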
    
    A corner case is that if we hold the PTL from PTE0 and most of the
    initial PTEs have already been unmapped before that, we may increase the
    duration of holding the PTL.  Thus we only apply this optimization to
    folios which are still entirely mapped (not on the deferred_split list).
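
The condition in the last paragraph amounts to a simple predicate.  The
following is a minimal sketch, not the patch itself: struct demo_folio, its
fields, and DEMO_HOLD_PTL_FROM_FIRST are hypothetical names chosen for
illustration and do not correspond to real kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; only the fields needed here. */
struct demo_folio {
    unsigned int nr_pages;        /* > 1 means a large folio */
    bool on_deferred_split_list;  /* true once partially unmapped */
};

/* Illustrative flag value (assumption, not a kernel constant). */
#define DEMO_HOLD_PTL_FROM_FIRST 0x1u

/*
 * Request "hold the lock from the first PTE" only for large folios
 * that are still entirely mapped; everything else keeps its flags.
 */
static unsigned int demo_unmap_flags(const struct demo_folio *folio,
                                     unsigned int flags)
{
    assert(folio != NULL);

    if (folio->nr_pages > 1 && !folio->on_deferred_split_list)
        flags |= DEMO_HOLD_PTL_FROM_FIRST;
    return flags;
}
```

Small folios and partially unmapped large folios are left on the old path,
which avoids lengthening the lock hold time where it cannot help.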
    
    [akpm@linux-foundation.org: rewrap comment, per Matthew]
    Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com
    Signed-off-by: Barry Song <v-songbaohua@oppo.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Chris Li <chrisl@kernel.org>
    Cc: Chuanhua Han <hanchuanhua@oppo.com>
    Cc: Gao Xiang <xiang@kernel.org>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>