• Linus Torvalds's avatar
    gup: document and work around "COW can break either way" issue · 17839856
    Linus Torvalds authored
    Doing a "get_user_pages()" on a copy-on-write page for reading can be
    ambiguous: the page can be COW'ed at any time afterwards, and the
    direction of a COW event isn't defined.
    
    Yes, whoever writes to it will generally do the COW, but if the thread
    that did the get_user_pages() unmapped the page before the write (and
    that could happen due to memory pressure in addition to any outright
    action), the writer could also just take over the old page instead.
    
    End result: the get_user_pages() call might result in a page pointer
    that is no longer associated with the original VM, and is associated
    with - and controlled by - another VM having taken it over instead.
    
    So when doing a get_user_pages() on a COW mapping, the only really safe
    thing to do would be to break the COW when getting the page, even when
    only getting it for reading.
    
    At the same time, some users simply don't even care.
    
    For example, the perf code wants to look up the page not because it
    cares about the page, but because the code simply wants to look up the
    physical address of the access for informational purposes, and doesn't
    really care about races when a page might be unmapped and remapped
    elsewhere.
    
    This adds logic to force a COW event by setting FOLL_WRITE on any
    copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
    pointer as a result.
    
    The current semantics end up being:
    
     - __get_user_pages_fast(): no change. If you don't ask for a write,
       you won't break COW. You'd better know what you're doing.
    
     - get_user_pages_fast(): the fast-case "look it up in the page tables
       without anything getting mmap_sem" now refuses to follow a read-only
       page, since it might need COW breaking.  Which happens in the slow
       path - the fast path doesn't know if the memory might be COW or not.
    
     - get_user_pages() (including the slow-path fallback for gup_fast()):
       for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
       very similar semantics to FOLL_FORCE.
    
    If it turns out that we want finer granularity (ie "only break COW when
    it might actually matter" - things like the zero page are special and
    don't need to be broken) we might need to push these semantics deeper
    into the lookup fault path.  So if people care enough, it's possible
    that we might end up adding a new internal FOLL_BREAK_COW flag to go
    with the internal FOLL_COW flag we already have for tracking "I had a
    COW".
    
    Alternatively, if it turns out that different callers might want to
    explicitly control the forced COW break behavior, we might even want to
    make such a flag visible to the users of get_user_pages() instead of
    using the above default semantics.
    
    But for now, this is mostly commentary on the issue (this commit message
    being a lot bigger than the patch, and that patch in turn is almost all
    comments), with that minimal "enable COW breaking early" logic using the
    existing FOLL_WRITE behavior.
    
    [ It might be worth noting that we've always had this ambiguity, and it
      could arguably be seen as a user-space issue.
    
      You only get private COW mappings that could break either way in
      situations where user space is doing cooperative things (ie fork()
      before an execve() etc), but it _is_ surprising and very subtle, and
      fork() is supposed to give you independent address spaces.
    
      So let's treat this as a kernel issue and make the semantics of
      get_user_pages() easier to understand. Note that obviously a true
      shared mapping will still get a page that can change under us, so this
      does _not_ mean that get_user_pages() somehow returns any "stable"
      page ]
    Reported-by: default avatarJann Horn <jannh@google.com>
    Tested-by: default avatarChristoph Hellwig <hch@lst.de>
    Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
    Acked-by: default avatarKirill Shutemov <kirill@shutemov.name>
    Acked-by: default avatarJan Kara <jack@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    17839856
huge_memory.c 87.3 KB