Commit 6d0a07ed authored by Andrea Arcangeli's avatar Andrea Arcangeli Committed by Linus Torvalds

mm: thp: calculate the mapcount correctly for THP pages during WP faults

This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the full accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it strictly needed,
due to its higher runtime cost.

This also provide at practical zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.

Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.

Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.

Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.

Mike Marciniszyn said:

: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit 61f5d698 ("mm: re-enable THP")
:
: The test program registers two rather large MRs (512M) and RDMA
: writes data to a passive peer using the first and RDMA reads it back
: into the second MR and compares that data.  The sizes are chosen randomly
: between 0 and 1024 bytes.
:
: The test will get through a few (<= 4 iterations) and then gets a
: compare error.
:
: Tracing indicates the kernel logical addresses associated with the individual
: pages at registration ARE correct , the data in the "RDMA read response only"
: packets ARE correct.
:
: The "corruption" occurs when the packet crosse two pages that are not physically
: contiguous.   The second page reads back as zero in the program.
:
: It looks like the user VA at the point of the compare error no longer points to
: the same physical address as was registered.
:
: This patch totally resolves the issue!

Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
Reviewed-by: default avatar"Kirill A. Shutemov" <kirill@shutemov.name>
Reviewed-by: default avatarDean Luick <dean.luick@intel.com>
Tested-by: default avatarAlex Williamson <alex.williamson@redhat.com>
Tested-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: default avatarJosh Collier <josh.d.collier@intel.com>
Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
Cc: <stable@vger.kernel.org>	[4.5]
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
parent 7496fea9
...@@ -500,11 +500,20 @@ static inline int page_mapcount(struct page *page) ...@@ -500,11 +500,20 @@ static inline int page_mapcount(struct page *page)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE #ifdef CONFIG_TRANSPARENT_HUGEPAGE
int total_mapcount(struct page *page); int total_mapcount(struct page *page);
int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
#else #else
static inline int total_mapcount(struct page *page) static inline int total_mapcount(struct page *page)
{ {
return page_mapcount(page); return page_mapcount(page);
} }
static inline int page_trans_huge_mapcount(struct page *page,
int *total_mapcount)
{
int mapcount = page_mapcount(page);
if (total_mapcount)
*total_mapcount = mapcount;
return mapcount;
}
#endif #endif
static inline struct page *virt_to_head_page(const void *x) static inline struct page *virt_to_head_page(const void *x)
......
...@@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t); ...@@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t);
extern int page_swapcount(struct page *); extern int page_swapcount(struct page *);
extern int swp_swapcount(swp_entry_t entry); extern int swp_swapcount(swp_entry_t entry);
extern struct swap_info_struct *page_swap_info(struct page *); extern struct swap_info_struct *page_swap_info(struct page *);
extern int reuse_swap_page(struct page *); extern bool reuse_swap_page(struct page *, int *);
extern int try_to_free_swap(struct page *); extern int try_to_free_swap(struct page *);
struct backing_dev_info; struct backing_dev_info;
...@@ -513,8 +513,8 @@ static inline int swp_swapcount(swp_entry_t entry) ...@@ -513,8 +513,8 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0; return 0;
} }
#define reuse_swap_page(page) \ #define reuse_swap_page(page, total_mapcount) \
(!PageTransCompound(page) && page_mapcount(page) == 1) (page_trans_huge_mapcount(page, total_mapcount) == 1)
static inline int try_to_free_swap(struct page *page) static inline int try_to_free_swap(struct page *page)
{ {
......
...@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, ...@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page); VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
/* /*
* We can only reuse the page if nobody else maps the huge page or it's * We can only reuse the page if nobody else maps the huge page or it's
* part. We can do it by checking page_mapcount() on each sub-page, but * part.
* it's expensive.
* The cheaper way is to check page_count() to be equal 1: every
* mapcount takes page reference reference, so this way we can
* guarantee, that the PMD is the only mapping.
* This can give false negative if somebody pinned the page, but that's
* fine.
*/ */
if (page_mapcount(page) == 1 && page_count(page) == 1) { if (page_trans_huge_mapcount(page, NULL) == 1) {
pmd_t entry; pmd_t entry;
entry = pmd_mkyoung(orig_pmd); entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
...@@ -2079,7 +2073,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, ...@@ -2079,7 +2073,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
if (pte_write(pteval)) { if (pte_write(pteval)) {
writable = true; writable = true;
} else { } else {
if (PageSwapCache(page) && !reuse_swap_page(page)) { if (PageSwapCache(page) &&
!reuse_swap_page(page, NULL)) {
unlock_page(page); unlock_page(page);
result = SCAN_SWAP_CACHE_PAGE; result = SCAN_SWAP_CACHE_PAGE;
goto out; goto out;
...@@ -3222,6 +3217,64 @@ int total_mapcount(struct page *page) ...@@ -3222,6 +3217,64 @@ int total_mapcount(struct page *page)
return ret; return ret;
} }
/*
* This calculates accurately how many mappings a transparent hugepage
* has (unlike page_mapcount() which isn't fully accurate). This full
* accuracy is primarily needed to know if copy-on-write faults can
* reuse the page and change the mapping to read-write instead of
* copying them. At the same time this returns the total_mapcount too.
*
* The function returns the highest mapcount any one of the subpages
* has. If the return value is one, even if different processes are
* mapping different subpages of the transparent hugepage, they can
* all reuse it, because each process is reusing a different subpage.
*
* The total_mapcount is instead counting all virtual mappings of the
* subpages. If the total_mapcount is equal to "one", it tells the
* caller all mappings belong to the same "mm" and in turn the
* anon_vma of the transparent hugepage can become the vma->anon_vma
* local one as no other process may be mapping any of the subpages.
*
* It would be more accurate to replace page_mapcount() with
* page_trans_huge_mapcount(), however we only use
* page_trans_huge_mapcount() in the copy-on-write faults where we
* need full accuracy to avoid breaking page pinning, because
* page_trans_huge_mapcount() is slower than page_mapcount().
*/
int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
{
int i, ret, _total_mapcount, mapcount;
/* hugetlbfs shouldn't call it */
VM_BUG_ON_PAGE(PageHuge(page), page);
if (likely(!PageTransCompound(page))) {
mapcount = atomic_read(&page->_mapcount) + 1;
if (total_mapcount)
*total_mapcount = mapcount;
return mapcount;
}
page = compound_head(page);
_total_mapcount = ret = 0;
for (i = 0; i < HPAGE_PMD_NR; i++) {
mapcount = atomic_read(&page[i]._mapcount) + 1;
ret = max(ret, mapcount);
_total_mapcount += mapcount;
}
if (PageDoubleMap(page)) {
ret -= 1;
_total_mapcount -= HPAGE_PMD_NR;
}
mapcount = compound_mapcount(page);
ret += mapcount;
_total_mapcount += mapcount;
if (total_mapcount)
*total_mapcount = _total_mapcount;
return ret;
}
/* /*
* This function splits huge page into normal pages. @page can point to any * This function splits huge page into normal pages. @page can point to any
* subpage of huge page to split. Split doesn't change the position of @page. * subpage of huge page to split. Split doesn't change the position of @page.
......
...@@ -2373,6 +2373,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, ...@@ -2373,6 +2373,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* not dirty accountable. * not dirty accountable.
*/ */
if (PageAnon(old_page) && !PageKsm(old_page)) { if (PageAnon(old_page) && !PageKsm(old_page)) {
int total_mapcount;
if (!trylock_page(old_page)) { if (!trylock_page(old_page)) {
get_page(old_page); get_page(old_page);
pte_unmap_unlock(page_table, ptl); pte_unmap_unlock(page_table, ptl);
...@@ -2387,13 +2388,18 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, ...@@ -2387,13 +2388,18 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
} }
put_page(old_page); put_page(old_page);
} }
if (reuse_swap_page(old_page)) { if (reuse_swap_page(old_page, &total_mapcount)) {
/* if (total_mapcount == 1) {
* The page is all ours. Move it to our anon_vma so /*
* the rmap code will not search our parent or siblings. * The page is all ours. Move it to
* Protected against the rmap code by the page lock. * our anon_vma so the rmap code will
*/ * not search our parent or siblings.
page_move_anon_rmap(old_page, vma, address); * Protected against the rmap code by
* the page lock.
*/
page_move_anon_rmap(compound_head(old_page),
vma, address);
}
unlock_page(old_page); unlock_page(old_page);
return wp_page_reuse(mm, vma, address, page_table, ptl, return wp_page_reuse(mm, vma, address, page_table, ptl,
orig_pte, old_page, 0, 0); orig_pte, old_page, 0, 0);
...@@ -2617,7 +2623,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, ...@@ -2617,7 +2623,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
inc_mm_counter_fast(mm, MM_ANONPAGES); inc_mm_counter_fast(mm, MM_ANONPAGES);
dec_mm_counter_fast(mm, MM_SWAPENTS); dec_mm_counter_fast(mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot); pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) { if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma); pte = maybe_mkwrite(pte_mkdirty(pte), vma);
flags &= ~FAULT_FLAG_WRITE; flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE; ret |= VM_FAULT_WRITE;
......
...@@ -922,18 +922,19 @@ int swp_swapcount(swp_entry_t entry) ...@@ -922,18 +922,19 @@ int swp_swapcount(swp_entry_t entry)
* to it. And as a side-effect, free up its swap: because the old content * to it. And as a side-effect, free up its swap: because the old content
* on disk will never be read, and seeking back there to write new content * on disk will never be read, and seeking back there to write new content
* later would only waste time away from clustering. * later would only waste time away from clustering.
*
* NOTE: total_mapcount should not be relied upon by the caller if
* reuse_swap_page() returns false, but it may be always overwritten
* (see the other implementation for CONFIG_SWAP=n).
*/ */
int reuse_swap_page(struct page *page) bool reuse_swap_page(struct page *page, int *total_mapcount)
{ {
int count; int count;
VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(!PageLocked(page), page);
if (unlikely(PageKsm(page))) if (unlikely(PageKsm(page)))
return 0; return false;
/* The page is part of THP and cannot be reused */ count = page_trans_huge_mapcount(page, total_mapcount);
if (PageTransCompound(page))
return 0;
count = page_mapcount(page);
if (count <= 1 && PageSwapCache(page)) { if (count <= 1 && PageSwapCache(page)) {
count += page_swapcount(page); count += page_swapcount(page);
if (count == 1 && !PageWriteback(page)) { if (count == 1 && !PageWriteback(page)) {
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment