1. 05 Jul, 2024 7 commits
    • mm: memcg: introduce memcontrol-v1.c · 1b1e1344
      Roman Gushchin authored
      Patch series "mm: memcg: separate legacy cgroup v1 code and put under
      config option", v2.
      
      Cgroups v2 have been around for a while and many users have fully adopted
      them, so they never use cgroup v1 features and functionality.  Yet they
      have to "pay" for the cgroup v1 support anyway:
      1) the kernel binary contains unused cgroup v1 code,
      2) some code paths have additional checks which are not needed,
      3) some common structures like task_struct and mem_cgroup contain unused
         cgroup v1-specific members.
      
      Cgroup v1's memory controller has a number of features that are not
      supported by cgroup v2, and their implementation is pretty much
      self-contained.  Most notably, these features are: soft limit reclaim,
      oom handling in userspace, the complicated event notification system,
      and charge migration.  Cgroup v1-specific code in memcontrol.c is close
      to 4k lines in size and is intertwined with generic and cgroup
      v2-specific code.  It's a burden on developers and maintainers.
      
      This patchset aims to solve these problems by:
      1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
      2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
         mm/memcontrol-v1.h header,
      3) introducing the CONFIG_MEMCG_V1 config option, turned off by default,
      4) compiling memcontrol-v1.c only if CONFIG_MEMCG_V1 is set.
      
      If CONFIG_MEMCG_V1 is not set, the cgroup v1 memory controller is still
      available for mounting; however, no memory-specific control knobs are
      present.
      
      This patch (of 14):
      
      
      This patch introduces the mm/memcontrol-v1.c source file which will be
      used for all legacy (cgroup v1) memory cgroup code.  It also introduces
      mm/memcontrol-v1.h to keep declarations shared between mm/memcontrol.c and
      mm/memcontrol-v1.c.
      
      As of now, let's compile it if CONFIG_MEMCG is set, similar to
      mm/memcontrol.c.  Later on it can be switched to use a separate config
      option, so that the legacy code won't be compiled if not required.
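      The build wiring this implies can be sketched as follows (a sketch only:
      the exact mm/Makefile and Kconfig text may differ, and CONFIG_MEMCG_V1
      itself is only introduced later in the series):

```makefile
# mm/Makefile (sketch): for now the new file is built whenever
# CONFIG_MEMCG is set, exactly like memcontrol.o
obj-$(CONFIG_MEMCG) += memcontrol.o memcontrol-v1.o

# later in the series this becomes a separate, default-off option:
# obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
```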
      
      Link: https://lkml.kernel.org/r/20240625005906.106920-1-roman.gushchin@linux.dev
      Link: https://lkml.kernel.org/r/20240625005906.106920-2-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1b1e1344
    • mm/ksm: optimize the chain()/chain_prune() interfaces · a0b856b6
      Chengming Zhou authored
      The current implementation of stable_node_dup() makes the
      chain()/chain_prune() interfaces and their usage overcomplicated.

      Why?  stable_node_dup() only finds and returns a candidate stable_node
      for sharing, so the users have to recheck using stable_node_dup_any()
      whether any non-candidate stable_node exists, and try ksm_get_folio()
      from it again.

      Actually, stable_node_dup() can just return the best stable_node it can
      find, and the users can then check whether it's a candidate for sharing
      or not.

      This also simplifies the code and leaves fewer corner cases: for
      example, stable_node and stable_node_dup can't be NULL if the returned
      tree_folio is not NULL.
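      The interface change can be sketched with a toy model (plain Python;
      the names, the dict representation, and the sharing limit below are all
      illustrative, not the kernel code): the lookup returns the best dup node
      it found, and the caller, not the lookup, decides whether that node is a
      sharing candidate.

```python
# Toy model of the simplified contract: return the best stable_node dup
# (here: fewest sharers) and let the caller check the sharing limit itself,
# instead of forcing a re-search via stable_node_dup_any().
def stable_node_dup(dups):
    """Return (node, folio) for the least-shared dup, or (None, None)."""
    best = min(dups, key=lambda n: n["refcount"], default=None)
    if best is None:
        return None, None
    return best, best["folio"]   # non-None folio implies non-None node

dups = [{"refcount": 3, "folio": "A"}, {"refcount": 1, "folio": "B"}]
node, folio = stable_node_dup(dups)
# Caller-side check replacing the old recheck-and-retry dance:
is_candidate = node is not None and node["refcount"] < 2  # hypothetical limit
```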
      
      Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-3-1c328aa9e30b@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a0b856b6
    • mm/ksm: don't waste time searching stable tree for fast changing page · d58a361b
      Chengming Zhou authored
      The code flow in cmp_and_merge_page() is suboptimal for handling the ksm
      page and non-ksm page at the same time.  For example:

      - ksm page
       1. Mostly we just return if this ksm page has not been migrated and
          this rmap_item is already on the rmap hlist; otherwise we have to
          fix up the rmap_item mapping.
       2. We absolutely don't need to checksum this ksm page, since it can't
          change.

      - non-ksm page
       1. First, don't waste time searching the stable tree if the page is
          changing fast.
       2. Try to merge with the zero page before searching the stable tree.
       3. Then search the stable tree to find a mergeable ksm page.

      This patch optimizes the code flow so the handling differences between
      ksm pages and non-ksm pages become clearer and more efficient.
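      The reordered flow can be sketched as a toy model (plain Python, not the
      kernel implementation; every helper is a stand-in that just records
      which steps run):

```python
# Toy sketch of the reordered cmp_and_merge_page() flow.
calls = []

def cmp_and_merge(page, old_csum):
    if page["is_ksm"]:
        calls.append("fixup_mapping")   # ksm pages are immutable: no checksum
        return old_csum
    calls.append("checksum")            # non-ksm anon page: checksum first
    new_csum = page["csum"]
    if new_csum != old_csum:
        return new_csum                 # fast-changing: skip the stable tree
    calls.append("try_zero_page")       # zero page before the stable tree
    calls.append("search_stable_tree")
    return new_csum

cmp_and_merge({"is_ksm": True, "csum": 7}, 0)    # ksm page: fixup only
cmp_and_merge({"is_ksm": False, "csum": 7}, 0)   # checksum changed: bail out
cmp_and_merge({"is_ksm": False, "csum": 7}, 7)   # stable: zero page, then tree
```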
      
      Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-2-1c328aa9e30b@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d58a361b
    • mm/ksm: refactor out try_to_merge_with_zero_page() · ac90c56b
      Chengming Zhou authored
      Patch series "mm/ksm: cmp_and_merge_page() optimizations and cleanup", v2.
      
      This series mainly optimizes cmp_and_merge_page() to have more efficient
      separate code flow for ksm page and non-ksm anon page.
      
      - ksm page: obviously doesn't need the checksum calculated.
      - anon page: doesn't need to search the stable tree if changing fast,
        and tries to merge with the zero page before searching the stable
        tree for a ksm page.
      
      Please see patch 2 for details.

      Patch 3 is a cleanup and also a small optimization of the
      chain()/chain_prune() interfaces, which made
      stable_tree_search()/stable_tree_insert() overly complex.
      
      I have done simple testing using "hackbench -g 1 -l 300000" (maybe a
      better workload is needed) on my machine, and have seen a small decrease
      in ksmd CPU usage and some improvement in cmp_and_merge_page() latency;
      in particular, the latency of cmp_and_merge_page() when handling non-ksm
      anon pages has been improved.
      
      
      This patch (of 3):
      
      In preparation for later changes, refactor out a new function,
      try_to_merge_with_zero_page(), which tries to merge the page with the
      zero page.
      
      Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-0-1c328aa9e30b@linux.dev
      Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-1-1c328aa9e30b@linux.dev
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac90c56b
    • hugetlb: force allocating surplus hugepages on mempolicy allowed nodes · 003af997
      Aristeu Rozanski authored
      When trying to allocate a hugepage with no reserved ones free, the
      allocation may still be allowed if a number of overcommit hugepages was
      configured (using /proc/sys/vm/nr_overcommit_hugepages) and that number
      wasn't reached.  This allows extra hugepages to be allocated
      dynamically, if there are resources for it.  Some sysadmins even prefer
      not reserving any hugepages and setting a big number of overcommit
      hugepages.
      
      But while attempting to allocate overcommit hugepages on a multi-node
      system (either NUMA or mempolicy/cpuset), said allocations might
      randomly fail even when there are resources available for the
      allocation.

      This happens because allowed_mems_nr() only accounts for the number of
      free hugepages on the nodes the current process belongs to, while the
      surplus hugepage allocation can be satisfied from any node.  In case
      one or more of the requested surplus hugepages are allocated on a
      different node, the whole allocation fails because allowed_mems_nr()
      returns a lower value.

      So allocate surplus hugepages on one of the nodes the current process
      belongs to.
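      The accounting mismatch can be sketched with a toy model (plain Python,
      not kernel code; the per-node bookkeeping below is illustrative):
      allowed_mems_nr() counts free surplus pages only on the nodes the task
      is allowed to use, while the surplus page could previously land on any
      node.

```python
# Toy model of the bug and the fix.
def allowed_mems_nr(free_per_node, allowed_nodes):
    """Count free surplus huge pages only on the task's allowed nodes."""
    return sum(free_per_node[n] for n in allowed_nodes)

free = {0: 0, 1: 0}
allowed = {0}                 # task bound to node 0, as with `numactl -m0`

free[1] += 1                  # before the fix: surplus page lands on node 1
broken = allowed_mems_nr(free, allowed)   # 0 -> the allocation check fails

free[1] -= 1
free[0] += 1                  # after the fix: allocate on an allowed node
fixed = allowed_mems_nr(free, allowed)    # 1 -> the check passes
```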
      
      An easy way to reproduce this issue is to use a system with 2+ NUMA
      nodes:
      
      	# echo 0 >/proc/sys/vm/nr_hugepages
      	# echo 1 >/proc/sys/vm/nr_overcommit_hugepages
      	# numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2
      
      Repeating the execution of the map_hugetlb test application will
      eventually fail when the hugepage ends up allocated on a different node.
      
      [aris@ruivo.org: v2]
        Link: https://lkml.kernel.org/r/20240701212343.GG844599@cathedrallabs.org
      Link: https://lkml.kernel.org/r/20240621190050.mhxwb65zn37doegp@redhat.com
      Signed-off-by: Aristeu Rozanski <aris@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Vishal Moola <vishal.moola@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      003af997
    • mm/damon/paddr: initialize nr_succeeded in __damon_pa_migrate_folio_list() · 64548bc5
      SeongJae Park authored
      The variable is supposed to be set by the later migrate_pages() call.
      However, the function does not do that when CONFIG_MIGRATION is unset.
      Initialize the variable to zero.
      
      Link: https://lkml.kernel.org/r/20240701165332.47495-1-sj@kernel.org
      Fixes: 5311c0a2eee3 ("mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/r/202406251102.GE07hqfQ-lkp@intel.com/
      Cc: Honggyu Kim <honggyu.kim@sk.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      64548bc5
    • mm: refactor folio_undo_large_rmappable() · 593a10da
      Kefeng Wang authored
      Folios of order <= 1 are not kept on the deferred list.  The order
      check was added to folio_undo_large_rmappable() in commit 8897277a
      ("mm: support order-1 folios in the page cache"), but each call still
      repeats the check for small (order-0) folios, so keep only the
      folio_order() check inside the function.

      In addition, move all the checks into the header file to save a
      function call for folios that are not large-rmappable or have an empty
      deferred_list.
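      The shape of the split can be sketched with a toy model (plain Python,
      not the kernel code; the dict fields are illustrative): a cheap check,
      which in C becomes a static inline in the header, decides whether the
      out-of-line function needs to run at all, saving a function call on the
      fast path.

```python
# Toy sketch: fast inline-style check guarding the expensive path.
slow_calls = 0

def __folio_undo_large_rmappable(folio):
    global slow_calls
    slow_calls += 1            # expensive path: locking, deferred-list removal

def folio_undo_large_rmappable(folio):
    # header-side fast path: only order > 1, large-rmappable folios with a
    # non-empty deferred_list need the real work
    if folio["order"] <= 1:
        return
    if not folio["large_rmappable"] or not folio["deferred_list"]:
        return
    __folio_undo_large_rmappable(folio)

folio_undo_large_rmappable({"order": 0, "large_rmappable": False, "deferred_list": []})
folio_undo_large_rmappable({"order": 2, "large_rmappable": True, "deferred_list": []})
folio_undo_large_rmappable({"order": 2, "large_rmappable": True, "deferred_list": ["d"]})
```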
      
      Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      593a10da
  2. 04 Jul, 2024 33 commits