    mm: drop the assumption that VM_SHARED always implies writable · e8e17ee9
    Lorenzo Stoakes authored
    Patch series "permit write-sealed memfd read-only shared mappings", v4.
    
    The man page for fcntl() describing memfd file seals states the following
    about F_SEAL_WRITE:-
    
        Furthermore, trying to create new shared, writable memory-mappings via
        mmap(2) will also fail with EPERM.
    
    With emphasis on 'writable'.  It turns out that the kernel currently
    disallows all new shared memory mappings for a memfd with F_SEAL_WRITE
    applied, read-only ones included, rendering this documentation inaccurate.
    
    This matters because users cannot obtain any shared mapping of a memfd
    once it has been write-sealed, which limits the usefulness of write seals.
    This was reported in the discussion thread [1] originating from a bug
    report [2].
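
    The behaviour can be probed from userspace with a small sketch like the
    one below.  The helper names make_write_sealed_memfd() and try_mmap() are
    made up for illustration; the system calls and constants are the standard
    Linux/glibc ones.  A shared PROT_READ | PROT_WRITE mapping is refused with
    EPERM either way, while the result of a shared PROT_READ mapping depends
    on whether the kernel carries this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustrative helper (not part of the series): create a one-page memfd
 * that allows sealing, then apply F_SEAL_WRITE.  Returns the fd or -1.
 */
int make_write_sealed_memfd(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd;

	assert(page > 0);

	fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, page) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Illustrative helper: try to map one page of fd.  Returns 0 on success
 * or the errno reported by mmap().
 */
int try_mmap(int fd, int prot, int flags)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, (size_t)page, prot, flags, fd, 0);

	if (p == MAP_FAILED)
		return errno;

	munmap(p, (size_t)page);
	return 0;
}
```

    Calling try_mmap(fd, PROT_READ, MAP_SHARED) on the sealed fd is the case
    this series is about: it is expected to fail with EPERM on a kernel
    without the series and to succeed on one with it.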
    
    This is a product of both using the struct address_space->i_mmap_writable
    atomic counter to determine whether writing may be permitted, and the
    kernel adjusting this counter when any VM_SHARED mapping is performed and
    more generally implicitly assuming VM_SHARED implies writable.
    
    It seems sensible that we should only update this counter if VM_MAYWRITE
    is specified, i.e. if it is possible that this mapping could at any point
    be written to.
    
    If we do so then all we need to do to permit write seals to function as
    documented is to clear VM_MAYWRITE when mapping read-only.  It turns out
    this functionality already exists for F_SEAL_FUTURE_WRITE - we can
    therefore simply adapt this logic to do the same for F_SEAL_WRITE.
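
    As a rough standalone model of that idea (not the code in this series):
    seal_check_write_model() is a hypothetical name, and the flag values are
    copied from include/linux/mm.h only so that the sketch compiles outside
    the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values mirror include/linux/mm.h; redefined so this builds standalone. */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/*
 * Hypothetical model of the decision made for a new mapping of a
 * write-sealed file:
 *  - private mappings are unaffected by the seal,
 *  - shared writable mappings are refused,
 *  - shared read-only mappings are allowed, but VM_MAYWRITE is cleared
 *    so that a later mprotect(PROT_WRITE) cannot make them writable.
 */
int seal_check_write_model(unsigned long *vm_flags)
{
	assert(vm_flags != NULL);

	if (!(*vm_flags & VM_SHARED))
		return 0;

	if (*vm_flags & VM_WRITE)
		return -EPERM;

	*vm_flags &= ~VM_MAYWRITE;
	return 0;
}
```

    Clearing VM_MAYWRITE rather than only checking VM_WRITE is what closes
    the mprotect() route to a writable mapping.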
    
    We then hit a chicken and egg situation in mmap_region() where the check
    for VM_MAYWRITE occurs before we are able to clear this flag.  To work
    around this, perform this check after we invoke call_mmap(), with careful
    consideration of error paths.
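
    The ordering problem can be illustrated with a toy model.  Everything
    suffixed _model below is invented for the sketch, including the
    check_after_hook switch; only the flag values and the convention that a
    negative i_mmap_writable means "writes denied" are taken from the
    kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror include/linux/mm.h; redefined so this builds standalone. */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* Toy counter: > 0 counts mappings that may write, < 0 denies writers. */
struct mapping_model {
	int i_mmap_writable;
};

bool shared_maywrite_model(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
		(VM_SHARED | VM_MAYWRITE);
}

/* Account one potentially writable mapping, unless writers are denied. */
int map_writable_model(struct mapping_model *m)
{
	if (m->i_mmap_writable < 0)
		return -EPERM;
	m->i_mmap_writable++;
	return 0;
}

/* Stand-in for the mmap hook of a write-sealed file. */
int mmap_hook_model(unsigned long *vm_flags)
{
	if (!(*vm_flags & VM_SHARED))
		return 0;
	if (*vm_flags & VM_WRITE)
		return -EPERM;
	*vm_flags &= ~VM_MAYWRITE;	/* read-only shared: forbid upgrades */
	return 0;
}

/*
 * Toy mmap_region(): check_after_hook selects whether the writable
 * accounting runs before or after the hook has had a chance to clear
 * VM_MAYWRITE.
 */
int mmap_region_model(struct mapping_model *m, unsigned long vm_flags,
		      bool check_after_hook)
{
	bool accounted = false;
	int err;

	assert(m != NULL);

	if (!check_after_hook && shared_maywrite_model(vm_flags)) {
		err = map_writable_model(m);
		if (err)
			return err;
		accounted = true;
	}

	err = mmap_hook_model(&vm_flags);
	if (err) {
		/* Error path: undo any accounting done before the hook. */
		if (accounted)
			m->i_mmap_writable--;
		return err;
	}

	if (check_after_hook && shared_maywrite_model(vm_flags))
		return map_writable_model(m);

	return 0;
}
```

    With the counter negative (sealed), a read-only shared request that
    still carries VM_MAYWRITE is rejected when the check runs first and
    accepted when it runs after the hook, which is the chicken and egg
    situation described above.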
    
    Thanks to Andy Lutomirski for the suggestion!
    
    [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
    [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
    
    
    This patch (of 3):
    
    There is a general assumption that VMAs with the VM_SHARED flag set are
    writable.  If the VM_MAYWRITE flag is not set, then this is simply not the
    case.
    
    Update those checks which affect the struct address_space->i_mmap_writable
    field to explicitly test for this by introducing
    [vma_]is_shared_maywrite() helper functions.
    
    This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
    that the VMA cannot be written to.
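
    A minimal standalone sketch of what such helpers can look like follows.
    struct vma_model is a stand-in for struct vm_area_struct, and the flag
    values are copied from include/linux/mm.h so that the sketch builds
    outside the kernel; the in-tree helpers may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror include/linux/mm.h; redefined so this builds standalone. */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* Stand-in for struct vm_area_struct; only the flags matter here. */
struct vma_model {
	unsigned long vm_flags;
};

/* True only if the mapping is shared AND could ever become writable. */
bool is_shared_maywrite(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
		(VM_SHARED | VM_MAYWRITE);
}

bool vma_is_shared_maywrite(const struct vma_model *vma)
{
	assert(vma != NULL);
	return is_shared_maywrite(vma->vm_flags);
}
```

    Testing both bits together, rather than VM_SHARED alone, is what keeps
    a shared mapping that can never be written to out of the
    i_mmap_writable accounting.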
    
    Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
    Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Suggested-by: Andy Lutomirski <luto@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>