• Mike Kravetz's avatar
    hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race · 188a3972
    Mike Kravetz authored
    Patch series "hugetlb: Use new vma lock for huge pmd sharing
    synchronization", v2.
    
    hugetlb fault scalability regressions have recently been reported [1]. 
    This is not the first such report, as regressions were also noted when
    commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") was added [2] in v5.7.  At that time, a proposal to
    address the regression was suggested [3] but went nowhere.
    
    The regression and benefit of this patch series is not evident when
    using the vm_scalability benchmark reported in [2] on a recent kernel.
    Results from running,
    "./usemem -n 48 --prealloc --prefault -O -U 3448054972"
    
    			48 sample Avg
    next-20220913		next-20220913			next-20220913
    unmodified	revert i_mmap_sema locking	vma sema locking, this series
    -----------------------------------------------------------------------------
    498150 KB/s		501934 KB/s			504793 KB/s
    
    The recent regression report [1] notes page fault and fork latency of
    shared hugetlb mappings.  To measure this, I created two simple programs:
    1) map a shared hugetlb area, write fault all pages, unmap area
       Do this in a continuous loop to measure faults per second
    2) map a shared hugetlb area, write fault a few pages, fork and exit
       Do this in a continuous loop to measure forks per second
    These programs were run on a 48 CPU VM with 320GB memory.  The shared
    mapping size was 250GB.  For comparison, a single instance of the program
    was run.  Then, multiple instances were run in parallel to introduce
    lock contention.  Changing the locking scheme results in a significant
    performance benefit.
    
    test		instances	unmodified	revert		vma
    --------------------------------------------------------------------------
    faults per sec	1		393043		395680		389932
    faults per sec  24		 71405		 81191		 79048
    forks per sec   1		  2802		  2747		  2725
    forks per sec   24		   439		   536		   500
    Combined faults 24		  1621		 68070		 53662
    Combined forks  24		   358		    67		   142
    
    Combined test is when running both faulting program and forking program
    simultaneously.
    
    Patches 1 and 2 of this series revert c0d0381a and 87bf91d3 which
    depends on c0d0381a.  Acquisition of i_mmap_rwsem is still required in
    the fault path to establish pmd sharing, so this is moved back to
    huge_pmd_share.  With c0d0381a reverted, this race is exposed:
    
    Faulting thread                                 Unsharing thread
    ...                                                  ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                                    i_mmap_lock_write
                                                    lock page table
    ptep invalid   <------------------------        huge_pmd_unshare()
    Could be in a previously                        unlock_page_table
    sharing process or worse                        i_mmap_unlock_write
    ...
    ptl = huge_pte_lock(ptep)
    get/update pte
    set_pte_at(pte, ptep)
    
    Reverting 87bf91d3 exposes races in page fault/file truncation.  When
    the new vma lock is put to use in patch 8, this will handle the fault/file
    truncation races.  This is explained in patch 9 where code associated with
    these races is cleaned up.
    
    Patches 3 - 5 restructure existing code in preparation for using the new
    vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
    this semaphore will be held in read mode for the duration of fault
    processing, and held in write mode for unmap operations which may call
    huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
    synchronize huge pmd sharing.  However it is only required in the fault
    path when setting up sharing, and will be acquired in huge_pmd_share().
    
    Patch 6 adds the new vma lock and all supporting routines, but does not
    actually change code to use the new lock.
    
    Patch 7 refactors code in preparation for using the new lock.  And, patch
    8 finally adds code to make use of this new vma lock.  Unfortunately, the
    fault code and truncate/hole punch code would naturally take locks in the
    opposite order which could lead to deadlock.  Since the performance of
    page faults is more important, the truncation/hole punch code is modified
    to back out and take locks in the correct order if necessary.
    
    [1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
    [2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
    [3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
    
    
    This patch (of 9):
    
    Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") added code to take i_mmap_rwsem in read mode for the
    duration of fault processing.  The use of i_mmap_rwsem to prevent
    fault/truncate races depends on this.  However, this has been shown to
    cause performance/scaling issues.  As a result, that code will be
    reverted.  Since the use i_mmap_rwsem to address page fault/truncate races
    depends on this, it must also be reverted.
    
    In a subsequent patch, code will be added to detect the fault/truncate
    race and back out operations as required.
    
    Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    188a3972
inode.c 41.2 KB