    mm: delay page_remove_rmap() until after the TLB has been flushed · 5df397de
    Linus Torvalds authored
    When we remove a page table entry, we are very careful to only free the
    page after we have flushed the TLB, because other CPUs could still be
    using the page through stale TLB entries until after the flush.
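
    For reference, the ordering in the zap path looks roughly like this
    (a simplified sketch of the mm/memory.c flow, not the literal code):

        pte = ptep_get_and_clear(mm, addr, ptep); /* unmap the page */
        tlb_remove_tlb_entry(tlb, ptep, addr);    /* record address for the flush */
        __tlb_remove_page(tlb, page);             /* queue the page for freeing */
        ...
        tlb_flush_mmu(tlb);                       /* flush the TLB first ... */
        /* ... and only then free_pages_and_swap_cache() frees the batch */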
    
    However, we have removed the rmap entry for that page early, which means
    that functions like folio_mkclean() would end up not serializing with the
    page table lock because the page had already been made invisible to rmap.
    
    And that is a problem, because while the TLB entry exists, we could end up
    with the following situation:
    
     (a) one CPU could come in and clean it, never seeing our mapping of the
         page
    
     (b) another CPU could continue to use the stale and dirty TLB entry and
         continue to write to said page
    
    resulting in a page that has been dirtied, but then marked clean again,
    all while another CPU might have dirtied it some more.
    
    End result: possibly lost dirty data.
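
    Spelled out as a timeline (a sketch; CPU 0 is the one removing the
    page table entry):

        CPU 0 (zapper)           CPU 1 (writer)         CPU 2 (cleaner)
        ptep_get_and_clear()
        page_remove_rmap()
                                 writes via stale,
                                 dirty TLB entry
                                                        folio_mkclean():
                                                        rmap walk finds no
                                                        mapping, folio is
                                                        marked clean
                                 writes some more
        tlb_flush_mmu()

    The folio now looks clean even though CPU 1's writes are still
    landing in it, so those writes can be lost.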
    
    This extends our current TLB gather infrastructure to optionally track
    a "should I do a delayed page_remove_rmap() for this page after
    flushing the TLB" flag.  It uses the newly introduced 'encoded page
    pointer' to do that without having to keep separate data around.
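
    The trick, in brief: the low bits of a 'struct page' pointer are
    always zero, so a flag can be carried in them.  A sketch of the
    helpers, closely following the encoded-page patch earlier in this
    series:

        #define ENCODE_PAGE_BITS 3ul

        /* Opaque type: a page pointer with extra flags in the low bits */
        struct encoded_page;

        static inline struct encoded_page *encode_page(struct page *page,
                                                       unsigned long flags)
        {
                return (struct encoded_page *)(flags | (unsigned long)page);
        }

        static inline unsigned long encoded_page_flags(struct encoded_page *page)
        {
                return ENCODE_PAGE_BITS & (unsigned long)page;
        }

        static inline struct page *encoded_page_ptr(struct encoded_page *page)
        {
                return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
        }

    The TLB gather code can then set one of those low bits on a queued
    page to mean "this page still needs its rmap removed after the
    flush".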
    
    Note, this is complicated by a few issues:
    
     - we want to delay the rmap removal, but not past the page table
       lock, because that simplifies the memcg accounting (see the
       sketch after this list)
    
     - only SMP configurations want to delay TLB flushing, since on UP
       there are obviously no remote TLBs to worry about, and the page
       table lock means there are no preemption issues either
    
     - s390 has its own mmu_gather model that doesn't delay TLB flushing,
       and as a result also does not want the delayed rmap. As such, we can
       treat s390 like the UP case and use a common fallback for the "no
       delays" case.
    
     - we can track an enormous number of pages in our mmu_gather structure,
       with MAX_GATHER_BATCH_COUNT batches of MAX_GATHER_BATCH pages each,
       all set up to be approximately 10k pending pages.
    
       We do not want to have a huge number of batched pages that we then
       need to check for delayed rmap handling inside the page table lock.
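
    Put together, the delayed-rmap pass ends up looking roughly like
    this (a sketch of the new mmu_gather code with details elided;
    tlb_flush_rmaps() runs after the TLB flush but still under the page
    table lock):

        #ifdef CONFIG_SMP
        /* Do the rmap removals we skipped while gathering the active batch */
        void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
        {
                struct mmu_gather_batch *batch = tlb->active;
                int i;

                if (!tlb->delayed_rmap)
                        return;

                for (i = 0; i < batch->nr; i++) {
                        struct encoded_page *enc = batch->encoded_pages[i];

                        if (encoded_page_flags(enc))
                                page_remove_rmap(encoded_page_ptr(enc), vma, false);
                }
                tlb->delayed_rmap = 0;
        }
        #else
        /* UP and s390 never delay the TLB flush, so nothing is pending here */
        static inline void tlb_flush_rmaps(struct mmu_gather *tlb,
                                           struct vm_area_struct *vma) { }
        #endif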
    
    That last point, in particular, results in a noteworthy detail: normal
    page batch gathering is limited once we have delayed rmaps pending, in
    such a way that only the last batch (the so-called "active batch") in
    the mmu_gather structure can have any delayed entries.
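
    One way to enforce that invariant is to refuse to open a new batch
    while delayed rmaps are pending, forcing a flush instead (a sketch,
    assuming the check sits in tlb_next_batch(), where new batches are
    allocated):

        static bool tlb_next_batch(struct mmu_gather *tlb)
        {
                struct mmu_gather_batch *batch;

                /* Limit batching if we have delayed rmaps pending */
                if (tlb->delayed_rmap && tlb->active != &tlb->local)
                        return false;
                ...
        }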
    
    NOTE!  While the "possibly lost dirty data" sounds catastrophic, for this
    all to happen you need to have a user thread doing either madvise() with
    MADV_DONTNEED or a full re-mmap() of the area concurrently with another
    thread continuing to use said mapping.
    
    So arguably this is about user space doing crazy things, but from a VM
    consistency standpoint it's better if we track the dirty bit properly even
    when user space goes off the rails.
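
    For concreteness, the triggering pattern is something like this
    contrived user-space sketch (hypothetical code, not taken from any
    real reproducer; 'map' and 'page_size' describe a shared writable
    mapping):

        /* Thread A keeps dirtying the page through its TLB entry ... */
        for (;;)
                map[0] = 1;

        /* ... while thread B concurrently zaps the mapping */
        madvise(map, page_size, MADV_DONTNEED);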
    
    [akpm@linux-foundation.org: fix UP build, per Linus]
    Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
    Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Reported-by: Nadav Amit <nadav.amit@gmail.com>
    Tested-by: Nadav Amit <nadav.amit@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>