Commits · ffcb5f5262b756a598eefb11e340bbd027cde037 · Kirill Smelkov / linux

09 Jun, 2023 15 commits

workingset: refactor LRU refault to expose refault recency check · ffcb5f52

Nhat Pham authored May 02, 2023

Patch series "cachestat: a new syscall for page cache state of files",
v13.

There is currently no good way to query the page cache statistics of large
files and directory trees.  There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case.  The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.

Some use cases where this information could come in handy:
  * Allowing database to decide whether to perform an index scan or direct
    table queries based on the in-memory cache state of the index.
  * Visibility into the writeback algorithm, for performance issues
    diagnostic.
  * Workload-aware writeback pacing: estimating IO fulfilled by page cache
    (and IO to be done) within a range of a file, allowing for more
    frequent syncing when and where there is IO capacity, and batching
    when there is not.
  * Computing memory usage of large files/directory trees, analogous to
    the du tool for disk usage.

More information about these use cases could be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/

This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes.  It also include a selftest suite that tests some typical
usage.  Currently, the syscall is only wired in for x86 architecture.

This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore.  Relevant
links:

https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html


I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility.  User can choose between
mincore or cachestat (with cachestat exporting more information than
mincore).  To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta's server machine, each for five
runs:

Using cachestat
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689

Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046

I also ran both syscalls on a 2TB sparse file:

Using cachestat:
real    0m0.009s
user    0m0.000s
sys     0m0.009s

Using mincore:
real    0m37.510s
user    0m2.934s
sys     0m34.558s

Very large files like this are the pathological case for mincore.  In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree!  This could
easily happen inadvertently when we run it on subdirectories.  Mincore is
clearly not suitable for a general-purpose command line tool.

Regarding security concerns, cachestat() should not pose any additional
issues.  The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat).  This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.

The latest API change (in v13 of the patch series) is suggested by Jens
Axboe.  It allows for 64-bit length argument, even on 32-bit architecture
(which is previously not possible due to the limit on the number of
syscall arguments).  Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.


This patch (of 4):

In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
would test if an evicted page is recently evicted.

[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
  Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.comSigned-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ffcb5f52

memcg, oom: remove explicit wakeup in mem_cgroup_oom_synchronize() · 18b1d18b

Haifeng Xu authored Apr 19, 2023

Before commit 29ef680a ("memcg, oom: move out_of_memory back to the
charge path"), all memcg oom killers were delayed to page fault path.  And
the explicit wakeup is used in this case:

thread A:
        ...
        if (locked) {           // complete oom-kill, hold the lock
                mem_cgroup_oom_unlock(memcg);
                ...
        }
        ...

thread B:
        ...

        if (locked && !memcg->oom_kill_disable) {
                ...
        } else {
                schedule();     // can't acquire the lock
                ...
        }
        ...

The reason is that thread A kicks off the OOM-killer, which leads to
wakeups from the uncharges of the exiting task.  But thread B is not
guaranteed to see them if it enters the OOM path after the OOM kills but
before thread A releases the lock.

Now only oom_kill_disable case is handled from the #PF path.  In that case
it is userspace to trigger the wake up not the #PF path itself.  All
potential paths to free some charges are responsible to call
memcg_oom_recover() , so the explicit wakeup is not needed in the
mem_cgroup_oom_synchronize() path which doesn't release any memory itself.

Link: https://lkml.kernel.org/r/20230419030739.115845-2-haifeng.xu@shopee.comSigned-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

18b1d18b

memcg, oom: remove unnecessary check in mem_cgroup_oom_synchronize() · 857f2139

Haifeng Xu authored Apr 19, 2023

mem_cgroup_oom_synchronize() is only used when the memcg oom handling is
handed over to the edge of the #PF path.  Since commit 29ef680a
("memcg, oom: move out_of_memory back to the charge path") this is the
case only when the kernel memcg oom killer is disabled
(current->memcg_in_oom is only set if memcg->oom_kill_disable).  Therefore
a check for oom_kill_disable in mem_cgroup_oom_synchronize() is not
required.

Link: https://lkml.kernel.org/r/20230419030739.115845-1-haifeng.xu@shopee.comSigned-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

857f2139

cgroup: remove cgroup_rstat_flush_atomic() · 0a2dc6ac

Yosry Ahmed authored Apr 21, 2023

Previous patches removed the only caller of cgroup_rstat_flush_atomic(). 
Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/20230421174020.2994750-6-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

0a2dc6ac

memcg: remove mem_cgroup_flush_stats_atomic() · 35822fda

Yosry Ahmed authored Apr 21, 2023

Previous patches removed all callers of mem_cgroup_flush_stats_atomic(). 
Remove the function and simplify the code.

Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

35822fda

memcg: calculate root usage from global state · f82a7a86

Yosry Ahmed authored Apr 21, 2023

Currently, we approximate the root usage by adding the memcg stats for
anon, file, and conditionally swap (for memsw).  To read the memcg stats
we need to invoke an rstat flush.  rstat flushes can be expensive, they
scale with the number of cpus and cgroups on the system.

mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold()
with irqs disabled, so such an expensive operation with irqs disabled can
cause problems.

Instead, approximate the root usage from global state.  This is not 100%
accurate, but the root usage has always been ill-defined anyway.

Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

f82a7a86

memcg: flush stats non-atomically in mem_cgroup_wb_stats() · 190409ca

Yosry Ahmed authored Apr 21, 2023

The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats()
code path in wb_writeback() outside the lock section.  We no longer need
to flush the stats atomically.  Flush the stats non-atomically.

Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

190409ca

writeback: move wb_over_bg_thresh() call outside lock section · 2816ea2a

Yosry Ahmed authored Apr 21, 2023

Patch series "cgroup: eliminate atomic rstat flushing", v5.

A previous patch series [1] changed most atomic rstat flushing contexts to
become non-atomic.  This was done to avoid an expensive operation that
scales with # cgroups and # cpus to happen with irqs disabled and
scheduling not permitted.  There were two remaining atomic flushing
contexts after that series.  This series tries to eliminate them as well,
eliminating atomic rstat flushing completely.

The two remaining atomic flushing contexts are:
(a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
(b) mem_cgroup_threshold()->mem_cgroup_usage()

For (a), flushing needs to be atomic as wb_writeback() calls
wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
this series proposes a refactoring that moves the call outside the lock
criticial section and makes the stats flushing in mem_cgroup_wb_stats()
non-atomic.

For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
with irqs disabled.  We only flush the stats when calculating the root
usage, as it is approximated as the sum of some memcg stats (file, anon,
and optionally swap) instead of the conventional page counter.  This
series proposes changing this calculation to use the global stats instead,
eliminating the need for a memcg stat flush.

After these 2 contexts are eliminated, we no longer need
mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
remove them and simplify the code.

[1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/


This patch (of 5):

wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
flush, which can be expensive on large systems. Currently,
wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
have to do the rstat flush atomically. On systems with a lot of
cpus and/or cgroups, this can cause us to disable irqs for a long time,
potentially causing problems.

Move the call to wb_over_bg_thresh() outside the lock section in
preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) check should be okay outside the lock
section of wb->list_lock as it is protected by a separate lock
(wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
modifying any of wb->b_* lists the wb->list_lock is protecting.
Also, the loop seems to be already releasing and reacquring the
lock, so this refactoring looks safe.

Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2816ea2a

mm/page_alloc: drop the unnecessary pfn_valid() for start pfn · 3c4322c9

Baolin Wang authored Apr 24, 2023

__pageblock_pfn_to_page() currently performs both pfn_valid check and
pfn_to_online_page().  The former one is redundant because the latter is a
stronger check.  Drop pfn_valid().

Link: https://lkml.kernel.org/r/c3868b58c6714c09a43440d7d02c7b4eed6e03f6.1682342634.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

3c4322c9

mm: compaction: optimize compact_memory to comply with the admin-guide · 8b9167cd

Wen Yang authored Apr 25, 2023

For the /proc/sys/vm/compact_memory file, the admin-guide states: When 1
is written to the file, all zones are compacted such that free memory is
available in contiguous blocks where possible.  This can be important for
example in the allocation of huge pages although processes will also
directly compact memory as required

But it was not strictly followed, writing any value would cause all zones
to be compacted.

It has been slightly optimized to comply with the admin-guide.  Enforce
the 1 on the unlikely chance that the sysctl handler is ever extended to
do something different.

Commit ef498438 ("mm/compaction: remove unused variable
sysctl_compact_memory") has also been optimized a bit here, as the
declaration in the external header file has been eliminated, and
sysctl_compact_memory also needs to be verified.

[akpm@linux-foundation.org: add __read_mostly, per Mel]
Link: https://lkml.kernel.org/r/tencent_DFF54DB2A60F3333F97D3F6B5441519B050A@qq.comSigned-off-by: Wen Yang <wenyang.linux@foxmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: William Lam <william.lam@bytedance.com>
Cc: Pintu Kumar <pintu@codeaurora.org>
Cc: Fu Wei <wefu@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

8b9167cd

memcg: dump memory.stat during cgroup OOM for v1 · dddb44ff

Yosry Ahmed authored Apr 28, 2023

Patch series "memcg: OOM log improvements", v2.

This short patch series brings back some cgroup v1 stats in OOM logs
that were unnecessarily changed before. It also makes memcg OOM logs
less reliant on printk() internals.


This patch (of 2):

Commit c8713d0b ("mm: memcontrol: dump memory.stat during cgroup OOM")
made sure we dump all the stats in memory.stat during a cgroup OOM, but it
also introduced a slight behavioral change.  The code used to print the
non-hierarchical v1 cgroup stats for the entire cgroup subtree, now it
only prints the v2 cgroup stats for the cgroup under OOM.

For cgroup v1 users, this introduces a few problems:

(a) The non-hierarchical stats of the memcg under OOM are no longer
    shown.

(b) A couple of v1-only stats (e.g.  pgpgin, pgpgout) are no longer
    shown.

(c) We show the list of cgroup v2 stats, even in cgroup v1.  This list
    of stats is not tracked with v1 in mind.  While most of the stats seem
    to be working on v1, there may be some stats that are not fully or
    correctly tracked.

Although OOM log is not set in stone, we should not change it for no
reason.  When upgrading the kernel version to a version including commit
c8713d0b ("mm: memcontrol: dump memory.stat during cgroup OOM"), these
behavioral changes are noticed in cgroup v1.

The fix is simple.  Commit c8713d0b ("mm: memcontrol: dump memory.stat
during cgroup OOM") separated stats formatting from stats display for v2,
to reuse the stats formatting in the OOM logs.  Do the same for v1.

Move the v2 specific formatting from memory_stat_format() to
memcg_stat_format(), add memcg1_stat_format() for v1, and make
memory_stat_format() select between them based on cgroup version.  Since
memory_stat_show() now works for both v1 & v2, drop memcg_stat_show().

Link: https://lkml.kernel.org/r/20230428132406.2540811-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230428132406.2540811-3-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

dddb44ff

memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo() · 5b42360c

Yosry Ahmed authored Apr 28, 2023

Currently, we format all the memcg stats into a buffer in
mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs. 
However, this buffer is large in size.  Although it is currently working
as intended, ther is a dependency between the memcg stats buffer and the
printk record size limit.

If we add more stats in the future and the buffer becomes larger than the
printk record size limit, or if the prink record size limit is reduced,
the logs may be truncated.

It is safer to use seq_buf_do_printk(), which will automatically break up
the buffer at line breaks and issue small printk() calls.

Refactor the code to move the seq_buf from memory_stat_format() to its
callers, and use seq_buf_do_printk() to print the seq_buf in
mem_cgroup_print_oom_meminfo().

Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

5b42360c

migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT · 4bb6dc79

Douglas Anderson authored Apr 28, 2023

The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
finish quickly but not for things that will take a long time. Exactly how
long is too long is not well defined, but waits of tens of milliseconds is
likely non-ideal.

When putting a Chromebook under memory pressure (opening over 90 tabs on a
4GB machine) it was fairly easy to see delays waiting for some locks in
the kcompactd code path of > 100 ms. While the laptop wasn't amazingly
usable in this state, it was still limping along and this state isn't
something artificial. Sometimes we simply end up with a lot of memory
pressure.

Putting the same Chromebook under memory pressure while it was running
Android apps (though not stressing them) showed a much worse result (NOTE:
this was on a older kernel but the codepaths here are similar). Android
apps on ChromeOS currently run from a 128K-block, zlib-compressed,
loopback-mounted squashfs disk. If we get a page fault from something
backed by the squashfs filesystem we could end up holding a folio lock
while reading enough from disk to decompress 128K (and then decompressing
it using the somewhat slow zlib algorithms). That reading goes through
the ext4 subsystem (because it's a loopback mount) before eventually
ending up in the block subsystem. This extra jaunt adds extra overhead.
Without much work I could see cases where we ended up blocked on a folio
lock for over a second. With more extreme memory pressure I could see up
to 25 seconds.

We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
two locks that were seen to be slow [1] and that generated much
discussion. After discussion, it was decided that we should avoid waiting
for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
IO. We'll continue with the unbounded wait for the more full SYNC modes.

With this change, I couldn't see any slow waits on these locks with my
previous testcases.

NOTE: The reason I stated digging into this originally isn't because some
benchmark had gone awry, but because we've received in-the-field crash
reports where we have a hung task waiting on the page lock (which is the
equivalent code path on old kernels). While the root cause of those
crashes is likely unrelated and won't be fixed by this patch, analyzing
those crash reports did point out these very long waits seemed like
something good to fix. With this patch we should no longer hang waiting
on these locks, but presumably the system will still be in a bad shape and
hang somewhere else.

[1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid

Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeidSigned-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

4bb6dc79

mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->cached · f785a8f2

Roman Gushchin authored May 02, 2023

A memcg pointer in the percpu stock can be accessed by drain_all_stock()
from another cpu in a lockless way. In theory it might lead to an issue,
similar to the one which has been discovered with stock->cached_objcg,
where the pointer was zeroed between the check for being NULL and
dereferencing. In this case the issue is unlikely a real problem, but to
make it bulletproof and similar to stock->cached_objcg, let's annotate all
accesses to stock->cached with READ_ONCE()/WTRITE_ONCE().

Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.devSigned-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

f785a8f2

mm: kmem: fix a NULL pointer dereference in obj_stock_flush_required() · 3b8abb32

Roman Gushchin authored May 02, 2023

KCSAN found an issue in obj_stock_flush_required():
stock->cached_objcg can be reset between the check and dereference:

==================================================================
BUG: KCSAN: data-race in drain_all_stock / drain_obj_stock

write to 0xffff888237c2a2f8 of 8 bytes by task 19625 on cpu 0:
 drain_obj_stock+0x408/0x4e0 mm/memcontrol.c:3306
 refill_obj_stock+0x9c/0x1e0 mm/memcontrol.c:3340
 obj_cgroup_uncharge+0xe/0x10 mm/memcontrol.c:3408
 memcg_slab_free_hook mm/slab.h:587 [inline]
 __cache_free mm/slab.c:3373 [inline]
 __do_kmem_cache_free mm/slab.c:3577 [inline]
 kmem_cache_free+0x105/0x280 mm/slab.c:3602
 __d_free fs/dcache.c:298 [inline]
 dentry_free fs/dcache.c:375 [inline]
 __dentry_kill+0x422/0x4a0 fs/dcache.c:621
 dentry_kill+0x8d/0x1e0
 dput+0x118/0x1f0 fs/dcache.c:913
 __fput+0x3bf/0x570 fs/file_table.c:329
 ____fput+0x15/0x20 fs/file_table.c:349
 task_work_run+0x123/0x160 kernel/task_work.c:179
 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
 exit_to_user_mode_loop+0xcf/0xe0 kernel/entry/common.c:171
 exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:203
 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
 syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:296
 do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff888237c2a2f8 of 8 bytes by task 19632 on cpu 1:
 obj_stock_flush_required mm/memcontrol.c:3319 [inline]
 drain_all_stock+0x174/0x2a0 mm/memcontrol.c:2361
 try_charge_memcg+0x6d0/0xd10 mm/memcontrol.c:2703
 try_charge mm/memcontrol.c:2837 [inline]
 mem_cgroup_charge_skmem+0x51/0x140 mm/memcontrol.c:7290
 sock_reserve_memory+0xb1/0x390 net/core/sock.c:1025
 sk_setsockopt+0x800/0x1e70 net/core/sock.c:1525
 udp_lib_setsockopt+0x99/0x6c0 net/ipv4/udp.c:2692
 udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2817
 sock_common_setsockopt+0x61/0x70 net/core/sock.c:3668
 __sys_setsockopt+0x1c3/0x230 net/socket.c:2271
 __do_sys_setsockopt net/socket.c:2282 [inline]
 __se_sys_setsockopt net/socket.c:2279 [inline]
 __x64_sys_setsockopt+0x66/0x80 net/socket.c:2279
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0xffff8881382d52c0 -> 0xffff888138893740

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 19632 Comm: syz-executor.0 Not tainted 6.3.0-rc2-syzkaller-00387-g53429336 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023

Fix it by using READ_ONCE()/WRITE_ONCE() for all accesses to
stock->cached_objcg.

Link: https://lkml.kernel.org/r/20230502160839.361544-1-roman.gushchin@linux.dev
Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: syzbot+774c29891415ab0fd29d@syzkaller.appspotmail.com
Reported-by: Dmitry Vyukov <dvyukov@google.com>
  Link: https://lore.kernel.org/linux-mm/CACT4Y+ZfucZhM60YPphWiCLJr6+SGFhT+jjm8k1P-a_8Kkxsjg@mail.gmail.com/T/#tReviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

3b8abb32

28 May, 2023 8 commits

Linux 6.4-rc4 · 7877cb91
Linus Torvalds authored May 28, 2023

7877cb91

Merge tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f8b2507c

Linus Torvalds authored May 28, 2023

Pull x86 cpu fix from Thomas Gleixner:
 "A single fix for x86:

   - Prevent a bogus setting for the number of HT siblings, which is
     caused by the CPUID evaluation trainwreck of X86. That recomputes
     the value for each CPU, so the last CPU "wins". That can cause
     completely bogus sibling values"

* tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms

f8b2507c

Merge tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2d5438f4

Linus Torvalds authored May 28, 2023

Pull perf fixes from Thomas Gleixner:
 "A small set of perf fixes:

   - Make the MSR-readout based CHA discovery work around broken
     discovery tables in some SPR firmwares.

   - Prevent saving PEBS configuration which has software bits set that
     cause a crash when restored into the relevant MSR"

* tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/uncore: Correct the number of CHAs on SPR
  perf/x86/intel: Save/restore cpuc->active_pebs_data_cfg when using guest PEBS

2d5438f4

Merge tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · abbf7fa1

Linus Torvalds authored May 28, 2023

Pull unwinder fixes from Thomas Gleixner:
 "A set of unwinder and tooling fixes:

   - Ensure that the stack pointer on x86 is aligned again so that the
     unwinder does not read past the end of the stack

   - Discard .note.gnu.property section which has a pointlessly
     different alignment than the other note sections. That confuses
     tooling of all sorts including readelf, libbpf and pahole"

* tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/show_trace_log_lvl: Ensure stack pointer is aligned, again
  vmlinux.lds.h: Discard .note.gnu.property section

abbf7fa1

Merge tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d8f14b84

Linus Torvalds authored May 28, 2023

Pull debugobjects fixes from Thomas Gleixner:
 "Two fixes for debugobjects:

   - Prevent the allocation path from waking up kswapd.

     That's a long standing issue due to the GFP_ATOMIC allocation flag.
     As debug objects can be invoked from pretty much any context waking
     kswapd can end up in arbitrary lock chains versus the waitqueue
     lock

   - Correct the explicit lockdep wait-type violation in
     debug_object_fill_pool()"

* tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Don't wake up kswapd from fill_pool()
  debugobjects,locking: Annotate debug_object_fill_pool() wait type violation

d8f14b84

Merge tag 'irq-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9bd5386c

Linus Torvalds authored May 28, 2023

Pull irq fixes from Thomas Gleixner:
 "A set of fixes for interrupt chip drivers:

   - Prevent loss of state in the MIPS GIC interrupt controller

   - Disable pseudo NMIs on Mediatek based Chromebooks as they have
     firmware issues which cause instantenous chrashes and freezes wen
     pseudo NMIs are used

   - Fix the error handling path in the MBIGEN driver and a defined but
     not used warning in the meson-gpio interrupt chip driver"

* tag 'irq-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mbigen: Unify the error handling in mbigen_of_create_domain()
  irqchip/meson-gpio: Mark OF related data as maybe unused
  irqchip/mips-gic: Use raw spinlock for gic_lock
  irqchip/mips-gic: Don't touch vl_map if a local interrupt is not routable
  irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues
  dt-bindings: interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/ broken FW

9bd5386c

Merge tag 'mips-fixes_6.4_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 045049cb

Linus Torvalds authored May 28, 2023

Pull MIPS fixes from Thomas Bogendoerfer:

 - fixes to get alchemy platform back in shape

 - fix for initrd detection

* tag 'mips-fixes_6.4_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  mips: Move initrd_start check after initrd address sanitisation.
  MIPS: Alchemy: fix dbdma2
  MIPS: Restore Au1300 support
  MIPS: unhide PATA_PLATFORM

045049cb

Merge tag 'powerpc-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 41683902

Linus Torvalds authored May 27, 2023

Pull powerpc fix from Michael Ellerman:

 - Reinstate ARCH_FORCE_MAX_ORDER ranges to fix various breakage

* tag 'powerpc-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Reinstate ARCH_FORCE_MAX_ORDER ranges

41683902

27 May, 2023 3 commits

Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 4e893b5a

Linus Torvalds authored May 27, 2023

Pull xen fixes from Juergen Gross:

 - a double free fix in the Xen pvcalls backend driver

 - a fix for a regression causing the MSI related sysfs entries to not
   being created in Xen PV guests

 - a fix in the Xen blkfront driver for handling insane input data
   better

* tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/pci/xen: populate MSI sysfs entries
  xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
  xen/blkfront: Only check REQ_FUA for writes

4e893b5a

Merge tag 'char-misc-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 957f3f8e

Linus Torvalds authored May 27, 2023

Pull char/misc fixes from Greg KH:
 "Here are some small driver fixes for 6.4-rc4. They are just two
  different types:

   - binder fixes and reverts for reported problems and regressions in
     the binder "driver".

   - coresight driver fixes for reported problems.

  All of these have been in linux-next for over a week with no reported
  problems"

* tag 'char-misc-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  binder: fix UAF of alloc->vma in race with munmap()
  binder: add lockless binder_alloc_(set|get)_vma()
  Revert "android: binder: stop saving a pointer to the VMA"
  Revert "binder_alloc: add missing mmap_lock calls when using the VMA"
  binder: fix UAF caused by faulty buffer cleanup
  coresight: perf: Release Coresight path when alloc trace id failed
  coresight: Fix signedness bug in tmc_etr_buf_insert_barrier_packet()

957f3f8e

Merge tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 49572d53

Linus Torvalds authored May 26, 2023

Pull compute express link fixes from Dan Williams:
 "The 'media ready' series prevents the driver from acting on bad
  capacity information, and it moves some checks earlier in the init
  sequence which impacts topics in the queue for 6.5.

  Additional hotplug testing uncovered a missing enable for memory
  decode. A debug crash fix is also included.

  Summary:

   - Stop trusting capacity data before the "media ready" indication

   - Add missing HDM decoder capability enable for the cold-plug case

   - Fix a debug message induced crash"

* tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl: Explicitly initialize resources when media is not ready
  cxl/port: Fix NULL pointer access in devm_cxl_add_port()
  cxl: Move cxl_await_media_ready() to before capacity info retrieval
  cxl: Wait Memory_Info_Valid before access memory related info
  cxl/port: Enable the HDM decoder capability for switch ports

49572d53

26 May, 2023 14 commits

Merge tag 'arm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 18713e8a

Linus Torvalds authored May 26, 2023

Pull ARM SoC fixes from Arnd Bergmann:
 "There have not been a lot of fixes for for the soc tree in 6.4, but
  these have been sitting here for too long.

  For the devicetree side, there is one minor warning fix for vexpress,
  the rest all all for the the NXP i.MX platforms: SoC specific bugfixes
  for the iMX8 clocks and its USB-3.0 gadget device, as well as board
  specific fixes for regulators and the phy on some of the i.MX boards.

  The microchip risc-v and arm32 maintainers now also add a shared
  maintainer file entry for the arm64 parts.

  The remaining fixes are all for firmware drivers, addressing mistakes
  in the optee, scmi and ff-a firmware driver implementation, mostly in
  the error handling code, incorrect use of the alloc_workqueue()
  interface in SCMI, and compatibility with corner cases of the firmware
  implementation"

* tag 'arm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  MAINTAINERS: update arm64 Microchip entries
  arm64: dts: imx8: fix USB 3.0 Gadget Failure in QM & QXPB0 at super speed
  dt-binding: cdns,usb3: Fix cdns,on-chip-buff-size type
  arm64: dts: colibri-imx8x: delete adc1 and dsp
  arm64: dts: colibri-imx8x: fix iris pinctrl configuration
  arm64: dts: colibri-imx8x: move pinctrl property from SoM to eval board
  arm64: dts: colibri-imx8x: fix eval board pin configuration
  arm64: dts: imx8mp: Fix video clock parents
  ARM: dts: imx6qdl-mba6: Add missing pvcie-supply regulator
  ARM: dts: imx6ull-dhcor: Set and limit the mode for PMIC buck 1, 2 and 3
  arm64: dts: imx8mn-var-som: fix PHY detection bug by adding deassert delay
  arm64: dts: imx8mn: Fix video clock parents
  firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors
  firmware: arm_ffa: Fix FFA device names for logical partitions
  firmware: arm_ffa: Fix usage of partition info get count flag
  firmware: arm_ffa: Check if ffa_driver remove is present before executing
  arm64: dts: arm: add missing cache properties
  ARM: dts: vexpress: add missing cache properties
  firmware: arm_scmi: Fix incorrect alloc_workqueue() invocation
  optee: fix uninited async notif value

18713e8a

Merge tag 'pci-v6.4-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci · 96f15fc6

Linus Torvalds authored May 26, 2023

Pull PCI fix from Bjorn Helgaas:

 - Quirk Ice Lake Root Ports to work around DPC log size issue (Mika
   Westerberg)

* tag 'pci-v6.4-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI/DPC: Quirk PIO log size for Intel Ice Lake Root Ports

96f15fc6

Merge tag 'vfio-v6.4-rc4' of https://github.com/awilliam/linux-vfio · 8846af75

Linus Torvalds authored May 26, 2023

Pull VFIO fix from Alex Williamson:

 - Test for and return error for invalid pfns through the pin pages
   interface (Yan Zhao)

* tag 'vfio-v6.4-rc4' of https://github.com/awilliam/linux-vfio:
  vfio/type1: check pfn valid before converting to struct page

8846af75

Merge tag 'block-6.4-2023-05-26' of git://git.kernel.dk/linux · a92c9ab6

Linus Torvalds authored May 26, 2023

Pull block fixes from Jens Axboe:
 "A few fixes for the storage side of things:

   - Fix bio caching condition for passthrough IO (Anuj)

   - end-of-device check fix for zero sized devices (Christoph)

   - Update Paolo's email address

   - NVMe pull request via Keith with a single quirk addition

   - Fix regression in how wbt enablement is done (Yu)

   - Fix race in active queue accounting (Tian)"

* tag 'block-6.4-2023-05-26' of git://git.kernel.dk/linux:
  NVMe: Add MAXIO 1602 to bogus nid list.
  block: make bio_check_eod work for zero sized devices
  block: fix bio-cache for passthru IO
  block, bfq: update Paolo's address in maintainer list
  blk-mq: fix race condition in active queue accounting
  blk-wbt: fix that wbt can't be disabled by default

a92c9ab6

Merge tag 'io_uring-6.4-2023-05-26' of git://git.kernel.dk/linux · 6fae9129

Linus Torvalds authored May 26, 2023

Pull io_uring fix from Jens Axboe:
 "Just a single fix for the conditional schedule with the SQPOLL thread,
  dropping the uring_lock if we do need to reschedule"

* tag 'io_uring-6.4-2023-05-26' of git://git.kernel.dk/linux:
  io_uring: unlock sqd->lock before sq thread release CPU

6fae9129

Merge tag 'thermal-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 77af1f2b

Linus Torvalds authored May 26, 2023

Pull thermal control fix from Rafael Wysocki:
 "Fix a regression introduced inadvertently during the 6.3 cycle by a
  commit making the Intel int340x thermal driver use sysfs_emit_at()
  instead of scnprintf() (Srinivas Pandruvada)"

* tag 'thermal-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: intel: int340x: Add new line for UUID display

77af1f2b

Merge tag 'pm-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c551afcd

Linus Torvalds authored May 26, 2023

Pull power management fixes from Rafael Wysocki:
 "Fix three issues related to the ->fast_switch callback in the AMD
  P-state cpufreq driver (Gautham R. Shenoy and Wyes Karny)"

* tag 'pm-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf()
  cpufreq: amd-pstate: Remove fast_switch_possible flag from active driver
  cpufreq: amd-pstate: Add ->fast_switch() callback

c551afcd

cxl: Explicitly initialize resources when media is not ready · 793a539a

Dave Jiang authored May 25, 2023

When media is not ready do not assume that the capacity information from
the identify command is valid, i.e. ->total_bytes
->partition_align_bytes ->{volatile,persistent}_only_bytes. Explicitly
zero out the capacity resources and exit early.

Given zero-init of those fields this patch is functionally equivalent to
the prior state, but it improves readability and robustness going
forward.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/168506118166.3004974.13523455340007852589.stgit@djiang5-mobl3Signed-off-by: Dan Williams <dan.j.williams@intel.com>

793a539a

Merge tag 'gpio-fixes-for-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 91a30434

Linus Torvalds authored May 26, 2023

Pull gpio fixes from Bartosz Golaszewski:

 - fix incorrect output in in-tree gpio tools

 - fix a shell coding issue in gpio-sim selftests

 - correctly set the permissions for debugfs attributes exposed by
   gpio-mockup

 - fix chip name and pin count in gpio-f7188x for one of the supported
   models

 - fix numberspace pollution when using dynamically and statically
   allocated GPIOs together

* tag 'gpio-fixes-for-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio-f7188x: fix chip name and pin count on Nuvoton chip
  gpiolib: fix allocation of mixed dynamic/static GPIOs
  gpio: mockup: Fix mode of debugfs files
  selftests: gpio: gpio-sim: Fix BUG: test FAILED due to recent change
  tools: gpio: fix debounce_period_us output of lsgpio

91a30434

Merge tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · b158dd94

Linus Torvalds authored May 26, 2023

Pull btrfs fixes from David Sterba:

 - handle memory allocation error in checksumming helper (reported by
   syzbot)

 - fix lockdep splat when aborting a transaction, add NOFS protection
   around invalidate_inode_pages2 that could allocate with GFP_KERNEL

 - reduce chances to hit an ENOSPC during scrub with RAID56 profiles

* tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use nofs when cleaning up aborted transactions
  btrfs: handle memory allocation failure in btrfs_csum_one_bio
  btrfs: scrub: try harder to mark RAID56 block groups read-only

b158dd94

Merge tag 'drm-fixes-2023-05-26' of git://anongit.freedesktop.org/drm/drm · b83ac44e

Linus Torvalds authored May 26, 2023

Pull drm fixes from Dave Airlie:
 "This week's collection is pretty spread out, accel/qaic has a bunch of
  fixes, amdgpu, then lots of single fixes across a bunch of places.

  core:
   - fix drmm_mutex_init lock class

  mgag200:
   - fix gamma lut initialisation

  pl111:
   - fix FB depth on IMPD-1 framebuffer

  amdgpu:
   - Fix missing BO unlocking in KIQ error path
   - Avoid spurious secure display error messages
   - SMU13 fix
   - Fix an OD regression
   - GPU reset display IRQ warning fix
   - MST fix

  radeon:
   - Fix a DP regression

  i915:
   - PIPEDMC disabling fix for bigjoiner config

  panel:
   - fix aya neo air plus quirk

  sched:
   - remove redundant NULL check

  qaic:
   - fix NNC message corruption
   - Grab ch_lock during QAIC_ATTACH_SLICE_BO
   - Flush the transfer list again
   - Validate if BO is sliced before slicing
   - Validate user data before grabbing any lock
   - initialize ret variable to 0
   - silence some uninitialized variable warnings"

* tag 'drm-fixes-2023-05-26' of git://anongit.freedesktop.org/drm/drm:
  drm/amd/display: Have Payload Properly Created After Resume
  drm/amd/display: Fix warning in disabling vblank irq
  drm/amd/pm: Fix output of pp_od_clk_voltage
  drm/amd/pm: add missing NotifyPowerSource message mapping for SMU13.0.7
  drm/radeon: reintroduce radeon_dp_work_func content
  drm/amdgpu: don't enable secure display on incompatible platforms
  drm:amd:amdgpu: Fix missing buffer object unlock in failure path
  accel/qaic: Fix NNC message corruption
  accel/qaic: Grab ch_lock during QAIC_ATTACH_SLICE_BO
  accel/qaic: Flush the transfer list again
  accel/qaic: Validate if BO is sliced before slicing
  accel/qaic: Validate user data before grabbing any lock
  accel/qaic: initialize ret variable to 0
  drm/i915: Fix PIPEDMC disabling for a bigjoiner configuration
  drm: fix drmm_mutex_init()
  drm/sched: Remove redundant check
  drm: panel-orientation-quirks: Change Air's quirk to support Air Plus
  accel/qaic: silence some uninitialized variable warnings
  drm/pl111: Fix FB depth on IMPD-1 framebuffer
  drm/mgag200: Fix gamma lut not initialized.

b83ac44e

x86: re-introduce support for ERMS copies for user space accesses · 47ee3f1d

Linus Torvalds authored May 26, 2023

I tried to streamline our user memory copy code fairly aggressively in
commit adfcf423 ("x86: don't use REP_GOOD or ERMS for user memory
copies"), in order to then be able to clean up the code and inline the
modern FSRM case in commit 577e6a7f ("x86: inline the 'rep movs' in
user copies for the FSRM case").

We had reports [1] of that causing regressions earlier with blogbench,
but that turned out to be a horrible benchmark for that case, and not a
sufficient reason for re-instating "rep movsb" on older machines.

However, now Eric Dumazet reported [2] a regression in performance that
seems to be a rather more real benchmark, where due to the removal of
"rep movs" a TCP stream over a 100Gbps network no longer reaches line
speed.

And it turns out that with the simplified the calling convention for the
non-FSRM case in commit 427fda2c ("x86: improve on the non-rep
'copy_user' function"), re-introducing the ERMS case is actually fairly
simple.

Of course, that "fairly simple" is glossing over several missteps due to
having to fight our assembler alternative code. This code really wanted
to rewrite a conditional branch to have two different targets, but that
made objtool sufficiently unhappy that this instead just ended up doing
a choice between "jump to the unrolled loop, or use 'rep movsb'
directly".

Let's see if somebody finds a case where the kernel memory copies also
care (see commit 68674f94: "x86: don't use REP_GOOD or ERMS for
small memory copies"). But Eric does argue that the user copies are
special because networking tries to copy up to 32KB at a time, if
order-3 pages allocations are possible.

In-kernel memory copies are typically small, unless they are the special
"copy pages at a time" kind that still use "rep movs".

Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1]
Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2]
Reported-and-tested-by: Eric Dumazet <edumazet@google.com>
Fixes: adfcf423 ("x86: don't use REP_GOOD or ERMS for user memory copies")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

47ee3f1d

Merge tag 'nvme-6.4-2023-05-26' of git://git.infradead.org/nvme into block-6.4 · 9491d01f

Jens Axboe authored May 26, 2023

Pull NVMe fix from Keith:

"nvme fixes for 6.4

 One nvme quirk (Tatsuki)"

* tag 'nvme-6.4-2023-05-26' of git://git.infradead.org/nvme:
  NVMe: Add MAXIO 1602 to bogus nid list.

9491d01f

NVMe: Add MAXIO 1602 to bogus nid list. · a3a9d63d

Tatsuki Sugiura authored May 20, 2023

HIKSEMI FUTURE M.2 SSD uses the same dummy nguid and eui64.
I confirmed it with my two devices.

This patch marks the controller as NVME_QUIRK_BOGUS_NID.

---------------------------------------------------------
sugi@tempest:~% sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x1e4b
ssvid     : 0x1e4b
sn        : 30096022612
mn        : HS-SSD-FUTURE 2048G
fr        : SN10542
rab       : 0
ieee      : 000000
cmic      : 0
mdts      : 7
cntlid    : 0
ver       : 0x10400
rtd3r     : 0x7a120
rtd3e     : 0x1e8480
oaes      : 0x200
ctratt    : 0x2
rrls      : 0
cntrltype : 1
fguid     : 00000000-0000-0000-0000-000000000000
<snip...>
---------------------------------------------------------

---------------------------------------------------------
sugi@tempest:~% sudo nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
<snip...>
nguid   : 00000000000000000000000000000000
eui64   : 0000000000000002
lbaf  0 : ms:0   lbads:9  rp:0 (in use)
---------------------------------------------------------
Signed-off-by: Tatsuki Sugiura <sugi@nemui.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

a3a9d63d