    mm: fix page cache convergence regression · 994f9a52
    Johannes Weiner authored
    commit 7b785645 upstream.
    
    Since a2833486 ("page cache: Finish XArray conversion"), on most
    major Linux distributions, the page cache doesn't correctly transition
    when the hot data set is changing, and leaves the new pages thrashing
    indefinitely instead of kicking out the cold ones.
    
    On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
    running stock Arch Linux:
    
    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120086/153600 workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120268/153600 workingset-b
    
    workingset-b is a 600M file on a 1G host that is otherwise entirely
    idle. No matter how often it's being accessed, it won't get cached.
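
The `N/153600` figures in the trace are resident pages over total pages: a 600M file divided into 4K pages gives 153600 pages. The source of the `./mincore` helper is not shown here, so the following is only a minimal sketch of how such a counter could be written with mmap(2) and mincore(2). The function name `count_resident_pages` and its signature are assumptions made for illustration.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the tool used in the trace): count how many
 * pages of the file at `path` are currently resident in memory.
 * Returns 0 on success and fills *resident and *total, -1 on error.
 */
int count_resident_pages(const char *path, size_t *resident, size_t *total)
{
    struct stat st;
    unsigned char *vec;
    void *map;
    size_t page_size, pages, i;
    int fd, ret = -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0)
        goto out_close;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    pages = ((size_t)st.st_size + page_size - 1) / page_size;
    *resident = 0;
    *total = pages;
    if (pages == 0) {           /* empty file: nothing to map */
        ret = 0;
        goto out_close;
    }

    /* Mapping the file does not read it; it only lets mincore() inspect it. */
    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    vec = malloc(pages);        /* one status byte per page */
    if (!vec)
        goto out_unmap;

    if (mincore(map, (size_t)st.st_size, vec) == 0) {
        for (i = 0; i < pages; i++)
            if (vec[i] & 1)     /* low bit set: page is resident */
                (*resident)++;
        ret = 0;
    }
    free(vec);
out_unmap:
    munmap(map, (size_t)st.st_size);
out_close:
    close(fd);
    return ret;
}
```

A caller would print `resident`, a slash, `total` and the file name to get lines shaped like the ones in the trace above.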
    
    While investigating, I noticed that the non-resident information gets
    aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
    a problem because a workingset transition like this relies on the
    non-resident information tracked in the page cache tree of evicted
    file ranges: when the cache faults are refaults of recently evicted
    cache, we challenge the existing active set, and that allows a new
    workingset to establish itself.
    
    Tracing the shrinker that maintains this memory revealed that all page
    cache tree nodes were allocated to the root cgroup. This is a problem,
    because 1) the shrinker sizes the amount of non-resident information
    it keeps to the size of the cgroup's other memory and 2) on most major
    Linux distributions, only kernel threads live in the root cgroup and
    everything else gets put into services or session groups:
    
    [root@ham ~]# cat /proc/self/cgroup
    0::/user.slice/user-0.slice/session-c1.scope
    
    As a result, we basically maintain no non-resident information for the
    workloads running on the system, thus breaking the caching algorithm.
    
    Looking through the code, I found the culprit in the above-mentioned
    patch: when switching from the radix tree to xarray, it dropped the
    __GFP_ACCOUNT flag from the tree node allocations - the flag that
    makes sure the allocated memory gets charged to and tracked by the
    cgroup of the calling process - in this case, the one doing the fault.
    
    To fix this, allow xarray users to specify a per-tree flag that makes
    xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
    tree annotation to request such cgroup tracking for the cache nodes.
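
The shape of the fix can be modelled in userspace: a flag stored on the tree decides whether node allocations carry the accounting bit. This is a toy sketch only, not the kernel code. `TREE_FLAG_ACCOUNT`, `ALLOC_FLAG_ACCOUNT`, `struct toy_tree` and `toy_node_alloc_flags` are hypothetical stand-ins, and their values do not match the kernel's definitions.

```c
/* Stand-in flag values for illustration; not the kernel's definitions. */
#define TREE_FLAG_ACCOUNT   0x1u    /* per-tree: "account my nodes" */
#define ALLOC_FLAG_ACCOUNT  0x100u  /* per-allocation: charge to the cgroup */

/* Toy tree that only remembers the flags it was initialized with. */
struct toy_tree {
    unsigned int flags;
};

/*
 * Compute the allocation flags for a new tree node. The caller's base
 * flags are kept, and the accounting bit is added only when the tree
 * itself asked for accounted nodes at init time.
 */
unsigned int toy_node_alloc_flags(const struct toy_tree *tree,
                                  unsigned int base_flags)
{
    unsigned int flags = base_flags;

    if (tree->flags & TREE_FLAG_ACCOUNT)
        flags |= ALLOC_FLAG_ACCOUNT;
    return flags;
}
```

In this model the page cache tree would be initialized with `TREE_FLAG_ACCOUNT` set, while other trees leave it clear and keep allocating unaccounted nodes.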
    
    With this patch applied, the page cache correctly converges on new
    workingsets again after just a few iterations:
    
    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    124607/153600 workingset-a
    87876/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    81313/153600 workingset-a
    133321/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    63036/153600 workingset-a
    153600/153600 workingset-b
    
    Cc: stable@vger.kernel.org # 4.20+
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>