    dax: add struct iomap based DAX PMD support · 642261ac
    Ross Zwisler authored
    DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
    locking.  This patch allows DAX PMDs to participate in the DAX radix tree
    based locking scheme so that they can be re-enabled using the new struct
    iomap based fault handlers.
    
    There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
    mappings that have an associated block allocation, and 4k DAX empty
    entries.  The empty entries exist to provide locking for the duration of a
    given page fault.
    
    This patch adds three equivalent 2 MiB DAX entries: Huge Zero Page (HZP)
    entries, PMD DAX entries that have associated block allocations, and 2 MiB
    DAX empty entries.
    
    Unlike the 4k case where we insert a struct page* into the radix tree for
    4k zero pages, for HZP we insert a DAX exceptional entry with the new
    RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
    every 2 MiB hole mapping, and it doesn't make sense to have that same struct
    page* with multiple entries in multiple trees.  This would cause contention
    on the single page lock for the one Huge Zero Page, and it would break the
    page->index and page->mapping associations that are assumed to be valid in
    many other places in the kernel.
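
As a rough illustration of the entry types described above, the sketch below
classifies a radix tree exceptional entry by its flag bits. It is a userspace
model, not kernel code: every SKETCH_* name and every bit value is invented
for the example. Only the existence of an HZP flag (RADIX_DAX_HZP) comes from
this message, and SKETCH_DAX_HZP merely stands in for it. The 4k zero page
case is deliberately absent because, as noted above, it is stored as a
struct page* rather than as a flagged exceptional entry.

```c
#include <assert.h>

/* Hypothetical flag bits; the real kernel values are not given in this text. */
enum {
    SKETCH_DAX_PMD   = 1u << 0, /* entry covers 2 MiB instead of 4k */
    SKETCH_DAX_HZP   = 1u << 1, /* stands in for RADIX_DAX_HZP */
    SKETCH_DAX_EMPTY = 1u << 2  /* no mapping; entry exists only for locking */
};

enum sketch_kind {
    SKETCH_4K_MAPPING,  /* 4k entry with a block allocation */
    SKETCH_4K_EMPTY,    /* 4k empty entry */
    SKETCH_PMD_MAPPING, /* 2 MiB entry with a block allocation */
    SKETCH_PMD_HZP,     /* 2 MiB Huge Zero Page entry */
    SKETCH_PMD_EMPTY    /* 2 MiB empty entry */
};

/* Map the flag bits of an exceptional entry to one of the entry kinds. */
static enum sketch_kind sketch_classify(unsigned int flags)
{
    /* In this model the HZP flag is only meaningful on a 2 MiB entry. */
    assert(!(flags & SKETCH_DAX_HZP) || (flags & SKETCH_DAX_PMD));

    if (flags & SKETCH_DAX_PMD) {
        if (flags & SKETCH_DAX_HZP)
            return SKETCH_PMD_HZP;
        if (flags & SKETCH_DAX_EMPTY)
            return SKETCH_PMD_EMPTY;
        return SKETCH_PMD_MAPPING;
    }
    if (flags & SKETCH_DAX_EMPTY)
        return SKETCH_4K_EMPTY;
    return SKETCH_4K_MAPPING;
}
```

Because the HZP state lives in the entry itself, each tree gets its own
lockable slot, and no tree has to hold the shared zero page's struct page*.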
    
    One difficult use case is when one thread is trying to use 4k entries in
    the radix tree for a given offset, and another thread is using 2 MiB
    entries for that same offset.  The current code handles this by making the
    2 MiB user fall back to 4k entries in most cases.  This was done because it
    is the simplest solution, and because the use of 2 MiB pages is already
    opportunistic.
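
The fallback rule can be pictured with the small model below. It is only a
sketch under simplifying assumptions: struct sketch_entry and
sketch_pmd_fault() are hypothetical names, the model covers just the simple
"a 4k entry is already there" case, and it ignores locking and the remaining
cases hinted at by "most cases" above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the radix tree slot found for the faulting offset. */
struct sketch_entry {
    bool present; /* false: the slot is empty */
    bool is_pmd;  /* true: the slot holds a 2 MiB entry */
};

enum sketch_fault_result {
    SKETCH_USE_PMD,     /* continue with a 2 MiB entry */
    SKETCH_FALLBACK_4K  /* give up on 2 MiB and retry the fault with 4k entries */
};

/* Decide how a 2 MiB (PMD) fault proceeds, given what is already in the slot. */
static enum sketch_fault_result sketch_pmd_fault(const struct sketch_entry *slot)
{
    assert(slot != NULL);

    /* Another user already owns this range at 4k granularity: do not fight it. */
    if (slot->present && !slot->is_pmd)
        return SKETCH_FALLBACK_4K;

    /* Empty slot or an existing 2 MiB entry: the PMD path can be used. */
    return SKETCH_USE_PMD;
}
```

The 2 MiB user loses nothing but an optimization here, which is why falling
back is the cheap choice.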
    
    If we were to try to upgrade from 4k pages to 2 MiB pages for a given
    range, we would run into the problem of how to lock out 4k page faults for
    the entire 2 MiB range while we clean out the radix tree so we can insert
    the 2 MiB entry.  We can solve this problem if we need to, but I think that
    the cases where both 2 MiB entries and 4k entries are being used for the
    same range will be rare enough and the gain small enough that it probably
    won't be worth the complexity.
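
To give a sense of the scale of that problem, the arithmetic below computes
which 4k page indices share one 2 MiB range. It assumes 4 KiB base pages and
2 MiB PMDs (shifts of 12 and 21, as on x86-64); the SKETCH_* macros and
helper names are made up for the example.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SHIFT    12 /* assumed 4 KiB base page */
#define SKETCH_PMD_SHIFT     21 /* assumed 2 MiB PMD */
#define SKETCH_PAGES_PER_PMD (1UL << (SKETCH_PMD_SHIFT - SKETCH_PAGE_SHIFT))

/* First 4k page index of the 2 MiB range that contains 'index'. */
static unsigned long sketch_pmd_first_index(unsigned long index)
{
    return index & ~(SKETCH_PAGES_PER_PMD - 1);
}

/* Last 4k page index of the 2 MiB range that contains 'index'. */
static unsigned long sketch_pmd_last_index(unsigned long index)
{
    unsigned long first = sketch_pmd_first_index(index);

    /* Guard against wrapping at the very top of the index space. */
    assert(first <= ULONG_MAX - (SKETCH_PAGES_PER_PMD - 1));
    return first + SKETCH_PAGES_PER_PMD - 1;
}
```

Under these assumptions a single 2 MiB entry spans 512 page indices, so page
index 1000 falls in the range 512..1023, and an upgrade would have to keep 4k
faults out of all 512 of those slots at once.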
    Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Dave Chinner <david@fromorbit.com>