• Dan Williams's avatar
    mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703
    Dan Williams authored
    The x86 conversion to the generic GUP code included a small change which causes
    crashes and data corruption in the pmem code - not good.
    
    The root cause is that the /dev/pmem driver code implicitly relies on the x86
    get_user_pages() implementation doing a get_page() on the page refcount, because
    get_page() does a get_zone_device_page() which properly refcounts pmem's separate
    page struct arrays that are not present in the regular page struct structures.
    (The pmem driver does this because it can cover huge memory areas.)
    
    But the x86 conversion to the generic GUP code changed the get_page() to
    page_cache_get_speculative() which is faster but doesn't do the
    get_zone_device_page() call the pmem code relies on.
    
    One way to solve the regression would be to change the generic GUP code to use
    get_page(), but that would slow things down a bit and punish other generic-GUP
    using architectures for an x86-ism they did not care about. (Arguably the pmem
    driver was probably not working reliably for them: but nvdimm is an Intel
    feature, so non-x86 exposure is probably still limited.)
    
    So restructure the pmem code's interface with the MM instead: get rid of the
    get/put_zone_device_page() distinction, integrate put_zone_device_page() into
    __put_page() and and restructure the pmem completion-wait and teardown machinery:
    
    Kirill points out that the calls to {get,put}_dev_pagemap() can be
    removed from the mm fast path if we take a single get_dev_pagemap()
    reference to signify that the page is alive and use the final put of the
    page to drop that reference.
    
    This does require some care to make sure that any waits for the
    percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
    since it now maintains its own elevated reference.
    
    This speeds up things while also making the pmem refcounting more robust going
    forward.
    Suggested-by: default avatarKirill Shutemov <kirill.shutemov@linux.intel.com>
    Tested-by: default avatarKirill Shutemov <kirill.shutemov@linux.intel.com>
    Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    71389703
swap.c 27.1 KB