1. 09 Jun, 2023 11 commits
    • memcg: remove mem_cgroup_flush_stats_atomic() · 35822fda
      Yosry Ahmed authored
      Previous patches removed all callers of mem_cgroup_flush_stats_atomic(). 
      Remove the function and simplify the code.
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg: calculate root usage from global state · f82a7a86
      Yosry Ahmed authored
      Currently, we approximate the root usage by adding the memcg stats for
      anon, file, and conditionally swap (for memsw).  To read the memcg stats
      we need to invoke an rstat flush.  rstat flushes can be expensive: they
      scale with the number of cpus and cgroups on the system.
      
      mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold()
      with irqs disabled, so such an expensive operation with irqs disabled can
      cause problems.
      
      Instead, approximate the root usage from global state.  This is not 100%
      accurate, but the root usage has always been ill-defined anyway.
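
      As a rough illustration of the idea (a userspace sketch, not the
      kernel change itself), the approximation can be modeled as a sum over
      counters that are already maintained system-wide.  The struct, its
      fields, and root_usage_pages() are hypothetical names; which global
      counters the patch actually reads is not spelled out in this message.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for system-wide counters, all in pages. */
struct global_counters {
	unsigned long file_pages;       /* page cache pages */
	unsigned long anon_pages;       /* anonymous pages */
	unsigned long total_swap_pages; /* configured swap */
	unsigned long free_swap_pages;  /* swap still unused */
};

/*
 * Approximate the root usage from global counters only, so that no
 * per-cgroup stats flush is needed. Swap is added only for memsw.
 */
unsigned long root_usage_pages(const struct global_counters *g, bool memsw)
{
	unsigned long val = g->file_pages + g->anon_pages;

	if (memsw)
		val += g->total_swap_pages - g->free_swap_pages;
	return val;
}
```

      Nothing in this sketch walks per-cpu or per-cgroup state, which is
      the property that makes such a calculation cheap enough for a
      context with irqs disabled.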
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg: flush stats non-atomically in mem_cgroup_wb_stats() · 190409ca
      Yosry Ahmed authored
      The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats()
      code path in wb_writeback() outside the lock section.  We no longer need
      to flush the stats atomically.  Flush the stats non-atomically.
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • writeback: move wb_over_bg_thresh() call outside lock section · 2816ea2a
      Yosry Ahmed authored
      Patch series "cgroup: eliminate atomic rstat flushing", v5.
      
      A previous patch series [1] changed most atomic rstat flushing contexts to
      become non-atomic.  This was done to avoid an expensive operation that
      scales with # cgroups and # cpus to happen with irqs disabled and
      scheduling not permitted.  There were two remaining atomic flushing
      contexts after that series.  This series tries to eliminate them as well,
      eliminating atomic rstat flushing completely.
      
      The two remaining atomic flushing contexts are:
      (a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
      (b) mem_cgroup_threshold()->mem_cgroup_usage()
      
      For (a), flushing needs to be atomic as wb_writeback() calls
      wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
      to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
      this series proposes a refactoring that moves the call outside the lock
      critical section and makes the stats flushing in mem_cgroup_wb_stats()
      non-atomic.
      
      For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
      with irqs disabled.  We only flush the stats when calculating the root
      usage, as it is approximated as the sum of some memcg stats (file, anon,
      and optionally swap) instead of the conventional page counter.  This
      series proposes changing this calculation to use the global stats instead,
      eliminating the need for a memcg stat flush.
      
      After these 2 contexts are eliminated, we no longer need
      mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
      remove them and simplify the code.
      
      [1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/
      
      
      This patch (of 5):
      
      wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
      flush, which can be expensive on large systems. Currently,
      wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
      have to do the rstat flush atomically. On systems with a lot of
      cpus and/or cgroups, this can cause us to disable irqs for a long time,
      potentially causing problems.
      
      Move the call to wb_over_bg_thresh() outside the lock section in
      preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
      The list_empty(&wb->work_list) check should be okay outside the lock
      section of wb->list_lock as it is protected by a separate lock
      (wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
      modifying any of wb->b_* lists the wb->list_lock is protecting.
      Also, the loop seems to be already releasing and reacquiring the
      lock, so this refactoring looks safe.
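
      The shape of the refactoring can be sketched with a small
      single-threaded model.  This is only an illustration under stated
      assumptions: struct worker, over_threshold() and writeback_loop() are
      hypothetical names, and the boolean 'locked' merely stands in for a
      spinlock such as wb->list_lock.

```c
#include <stdbool.h>

/* Hypothetical model; 'locked' stands in for a spinlock. */
struct worker {
	bool locked;
	int pending;           /* work items left, modified under the lock */
	int checks;            /* number of threshold checks performed */
	int checks_under_lock; /* checks that ran while the lock was held */
};

/* Stand-in for an expensive check that should not run under the lock. */
bool over_threshold(struct worker *w)
{
	w->checks++;
	if (w->locked)
		w->checks_under_lock++;
	return w->pending > 0;
}

/* Run the expensive check first, then take the lock only for list work. */
int writeback_loop(struct worker *w)
{
	int written = 0;

	for (;;) {
		if (!over_threshold(w)) /* evaluated with the lock dropped */
			break;

		w->locked = true;       /* lock */
		w->pending--;           /* short critical section */
		written++;
		w->locked = false;      /* unlock */
	}
	return written;
}
```

      Because the loop already drops and retakes the lock on every
      iteration, hoisting the check above the lock acquisition does not
      change how long each critical section lasts.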
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: drop the unnecessary pfn_valid() for start pfn · 3c4322c9
      Baolin Wang authored
      __pageblock_pfn_to_page() currently performs both pfn_valid check and
      pfn_to_online_page().  The former one is redundant because the latter is a
      stronger check.  Drop pfn_valid().
      
      Link: https://lkml.kernel.org/r/c3868b58c6714c09a43440d7d02c7b4eed6e03f6.1682342634.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: compaction: optimize compact_memory to comply with the admin-guide · 8b9167cd
      Wen Yang authored
      For the /proc/sys/vm/compact_memory file, the admin-guide states: When 1
      is written to the file, all zones are compacted such that free memory is
      available in contiguous blocks where possible.  This can be important for
      example in the allocation of huge pages although processes will also
      directly compact memory as required.
      
      But this was not strictly followed: writing any value would cause all zones
      to be compacted.
      
      It has been slightly optimized to comply with the admin-guide.  Enforce
      the 1 on the unlikely chance that the sysctl handler is ever extended to
      do something different.
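
      A userspace sketch of the stricter behaviour, assuming a handler
      that receives the written text: only the value 1 triggers the
      action.  compact_memory_write() and its counter argument are
      hypothetical, and returning -EINVAL for other input is an assumption
      made for this sketch, not something this message states.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical model of a write handler for a trigger file such as
 * /proc/sys/vm/compact_memory: only the value 1 starts the action.
 * Rejecting other input with -EINVAL is an assumption of this sketch.
 */
int compact_memory_write(const char *buf, int *compactions)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;         /* not a number at all */
	if (*end == '\n')
		end++;                  /* tolerate one trailing newline */
	if (*end != '\0')
		return -EINVAL;         /* trailing garbage */
	if (val != 1)
		return -EINVAL;         /* only 1 is accepted */

	(*compactions)++;               /* stand-in for compacting all zones */
	return 0;
}
```

      With this shape, a write of "1" (as produced by echo) is accepted,
      while "2" or non-numeric input leaves the system untouched.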
      
      Commit ef498438 ("mm/compaction: remove unused variable
      sysctl_compact_memory") has also been optimized a bit here, as the
      declaration in the external header file has been eliminated, and
      sysctl_compact_memory also needs to be verified.
      
      [akpm@linux-foundation.org: add __read_mostly, per Mel]
      Link: https://lkml.kernel.org/r/tencent_DFF54DB2A60F3333F97D3F6B5441519B050A@qq.com
      Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: William Lam <william.lam@bytedance.com>
      Cc: Pintu Kumar <pintu@codeaurora.org>
      Cc: Fu Wei <wefu@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg: dump memory.stat during cgroup OOM for v1 · dddb44ff
      Yosry Ahmed authored
      Patch series "memcg: OOM log improvements", v2.
      
      This short patch series brings back some cgroup v1 stats in OOM logs
      that were unnecessarily changed before. It also makes memcg OOM logs
      less reliant on printk() internals.
      
      
      This patch (of 2):
      
      Commit c8713d0b ("mm: memcontrol: dump memory.stat during cgroup OOM")
      made sure we dump all the stats in memory.stat during a cgroup OOM, but it
      also introduced a slight behavioral change.  The code used to print the
      non-hierarchical v1 cgroup stats for the entire cgroup subtree; now it
      only prints the v2 cgroup stats for the cgroup under OOM.
      
      For cgroup v1 users, this introduces a few problems:
      
      (a) The non-hierarchical stats of the memcg under OOM are no longer
          shown.
      
      (b) A couple of v1-only stats (e.g.  pgpgin, pgpgout) are no longer
          shown.
      
      (c) We show the list of cgroup v2 stats, even in cgroup v1.  This list
          of stats is not tracked with v1 in mind.  While most of the stats seem
          to be working on v1, there may be some stats that are not fully or
          correctly tracked.
      
      Although OOM log is not set in stone, we should not change it for no
      reason.  When upgrading the kernel version to a version including commit
      c8713d0b ("mm: memcontrol: dump memory.stat during cgroup OOM"), these
      behavioral changes are noticed in cgroup v1.
      
      The fix is simple.  Commit c8713d0b ("mm: memcontrol: dump memory.stat
      during cgroup OOM") separated stats formatting from stats display for v2,
      to reuse the stats formatting in the OOM logs.  Do the same for v1.
      
      Move the v2 specific formatting from memory_stat_format() to
      memcg_stat_format(), add memcg1_stat_format() for v1, and make
      memory_stat_format() select between them based on cgroup version.  Since
      memory_stat_show() now works for both v1 & v2, drop memcg_stat_show().
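
      The resulting structure, one entry point that selects a
      version-specific formatter, can be sketched in userspace as follows.
      The function names format_v1(), format_v2() and format_stats() are
      hypothetical stand-ins, and each formatter prints a single sample
      stat rather than the real memory.stat contents.

```c
#include <stdbool.h>
#include <stdio.h>

/* Example v2-style output; "anon" is used only as a sample stat name. */
void format_v2(char *buf, size_t len, unsigned long pages)
{
	snprintf(buf, len, "anon %lu\n", pages);
}

/* Example v1-style output; "rss" is used only as a sample stat name. */
void format_v1(char *buf, size_t len, unsigned long pages)
{
	snprintf(buf, len, "rss %lu\n", pages);
}

/* One entry point that picks the formatter from the hierarchy version. */
void format_stats(char *buf, size_t len, bool on_v2, unsigned long pages)
{
	if (on_v2)
		format_v2(buf, len, pages);
	else
		format_v1(buf, len, pages);
}
```

      Callers that display the stats and callers that log them during OOM
      can then share the same entry point and get version-appropriate
      output.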
      
      Link: https://lkml.kernel.org/r/20230428132406.2540811-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230428132406.2540811-3-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo() · 5b42360c
      Yosry Ahmed authored
      Currently, we format all the memcg stats into a buffer in
      mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs. 
      However, this buffer is large in size.  Although it is currently working
      as intended, there is a dependency between the memcg stats buffer and the
      printk record size limit.
      
      If we add more stats in the future and the buffer becomes larger than the
      printk record size limit, or if the printk record size limit is reduced,
      the logs may be truncated.
      
      It is safer to use seq_buf_do_printk(), which will automatically break up
      the buffer at line breaks and issue small printk() calls.
      
      Refactor the code to move the seq_buf from memory_stat_format() to its
      callers, and use seq_buf_do_printk() to print the seq_buf in
      mem_cgroup_print_oom_meminfo().
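
      The line-splitting behaviour described above can be approximated in
      userspace as below.  This is a sketch of the technique only:
      print_by_line() and emit_stdout() are hypothetical helpers, and
      printf() stands in for printk().  It does not reproduce the actual
      seq_buf_do_printk() implementation.

```c
#include <stdio.h>
#include <string.h>

/*
 * Emit a buffer one line at a time through 'emit', so that no single
 * call carries more than one line. Returns the number of calls made.
 */
int print_by_line(const char *buf, size_t len,
		  void (*emit)(const char *line, size_t n))
{
	size_t start = 0;
	int calls = 0;

	while (start < len) {
		const char *nl = memchr(buf + start, '\n', len - start);
		size_t n = nl ? (size_t)(nl - (buf + start)) : len - start;

		emit(buf + start, n);
		calls++;
		start += n + (nl ? 1 : 0); /* skip the newline if present */
	}
	return calls;
}

/* Example sink: one printf() per line, standing in for printk(). */
void emit_stdout(const char *line, size_t n)
{
	printf("%.*s\n", (int)n, line);
}
```

      Each call carries a single line, so the size of any one record is
      bounded by the longest line rather than by the whole stats buffer.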
      
      Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT · 4bb6dc79
      Douglas Anderson authored
      The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
      finish quickly but not for things that will take a long time.  Exactly how
      long is too long is not well defined, but waits of tens of milliseconds are
      likely non-ideal.
      
      When putting a Chromebook under memory pressure (opening over 90 tabs on a
      4GB machine) it was fairly easy to see delays waiting for some locks in
      the kcompactd code path of > 100 ms.  While the laptop wasn't amazingly
      usable in this state, it was still limping along and this state isn't
      something artificial.  Sometimes we simply end up with a lot of memory
      pressure.
      
      Putting the same Chromebook under memory pressure while it was running
      Android apps (though not stressing them) showed a much worse result (NOTE:
      this was on an older kernel but the codepaths here are similar).  Android
      apps on ChromeOS currently run from a 128K-block, zlib-compressed,
      loopback-mounted squashfs disk.  If we get a page fault from something
      backed by the squashfs filesystem we could end up holding a folio lock
      while reading enough from disk to decompress 128K (and then decompressing
      it using the somewhat slow zlib algorithms).  That reading goes through
      the ext4 subsystem (because it's a loopback mount) before eventually
      ending up in the block subsystem.  This extra jaunt adds extra overhead. 
      Without much work I could see cases where we ended up blocked on a folio
      lock for over a second.  With more extreme memory pressure I could see up
      to 25 seconds.
      
      We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
      two locks that were seen to be slow [1] and that generated much
      discussion.  After discussion, it was decided that we should avoid waiting
      for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
      IO.  We'll continue with the unbounded wait for the more full SYNC modes.
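
      The resulting policy can be summarized as a small decision function.
      This is a sketch under stated assumptions, not the patch itself: the
      enum and may_wait_for_lock() are hypothetical names, 'held_for_io'
      abstracts however the kernel detects that a lock is held for IO, and
      the asynchronous case is included only as background on a mode that
      never blocks.

```c
#include <stdbool.h>

/* Hypothetical mirror of the migration modes discussed above. */
enum sketch_migrate_mode {
	SKETCH_ASYNC,      /* never blocks */
	SKETCH_SYNC_LIGHT, /* may block, but not on IO */
	SKETCH_SYNC        /* may block without bound */
};

/*
 * Decide whether a migration attempt may sleep on a lock that is
 * currently held by someone else.
 */
bool may_wait_for_lock(enum sketch_migrate_mode mode, bool held_for_io)
{
	switch (mode) {
	case SKETCH_ASYNC:
		return false;
	case SKETCH_SYNC_LIGHT:
		return !held_for_io; /* skip locks held for IO */
	case SKETCH_SYNC:
		return true;
	}
	return false;
}
```

      Only the middle case changes behaviour: a light synchronous attempt
      gives up on a lock held for IO instead of waiting for the IO to
      finish.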
      
      With this change, I couldn't see any slow waits on these locks with my
      previous testcases.
      
      NOTE: The reason I started digging into this originally isn't because some
      benchmark had gone awry, but because we've received in-the-field crash
      reports where we have a hung task waiting on the page lock (which is the
      equivalent code path on old kernels).  While the root cause of those
      crashes is likely unrelated and won't be fixed by this patch, analyzing
      those crash reports did point out these very long waits seemed like
      something good to fix.  With this patch we should no longer hang waiting
      on these locks, but presumably the system will still be in a bad shape and
      hang somewhere else.
      
      [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid
      
      Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->cached · f785a8f2
      Roman Gushchin authored
      A memcg pointer in the percpu stock can be accessed by drain_all_stock()
      from another cpu in a lockless way.  In theory it might lead to an issue,
      similar to the one which has been discovered with stock->cached_objcg,
      where the pointer was zeroed between the check for being NULL and
      dereferencing.  In this case the issue is unlikely to be a real problem, but to
      make it bulletproof and similar to stock->cached_objcg, let's annotate all
      accesses to stock->cached with READ_ONCE()/WRITE_ONCE().
      
      Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: kmem: fix a NULL pointer dereference in obj_stock_flush_required() · 3b8abb32
      Roman Gushchin authored
      KCSAN found an issue in obj_stock_flush_required():
      stock->cached_objcg can be reset between the check and dereference:
      
      ==================================================================
      BUG: KCSAN: data-race in drain_all_stock / drain_obj_stock
      
      write to 0xffff888237c2a2f8 of 8 bytes by task 19625 on cpu 0:
       drain_obj_stock+0x408/0x4e0 mm/memcontrol.c:3306
       refill_obj_stock+0x9c/0x1e0 mm/memcontrol.c:3340
       obj_cgroup_uncharge+0xe/0x10 mm/memcontrol.c:3408
       memcg_slab_free_hook mm/slab.h:587 [inline]
       __cache_free mm/slab.c:3373 [inline]
       __do_kmem_cache_free mm/slab.c:3577 [inline]
       kmem_cache_free+0x105/0x280 mm/slab.c:3602
       __d_free fs/dcache.c:298 [inline]
       dentry_free fs/dcache.c:375 [inline]
       __dentry_kill+0x422/0x4a0 fs/dcache.c:621
       dentry_kill+0x8d/0x1e0
       dput+0x118/0x1f0 fs/dcache.c:913
       __fput+0x3bf/0x570 fs/file_table.c:329
       ____fput+0x15/0x20 fs/file_table.c:349
       task_work_run+0x123/0x160 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop+0xcf/0xe0 kernel/entry/common.c:171
       exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:203
       __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
       syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:296
       do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff888237c2a2f8 of 8 bytes by task 19632 on cpu 1:
       obj_stock_flush_required mm/memcontrol.c:3319 [inline]
       drain_all_stock+0x174/0x2a0 mm/memcontrol.c:2361
       try_charge_memcg+0x6d0/0xd10 mm/memcontrol.c:2703
       try_charge mm/memcontrol.c:2837 [inline]
       mem_cgroup_charge_skmem+0x51/0x140 mm/memcontrol.c:7290
       sock_reserve_memory+0xb1/0x390 net/core/sock.c:1025
       sk_setsockopt+0x800/0x1e70 net/core/sock.c:1525
       udp_lib_setsockopt+0x99/0x6c0 net/ipv4/udp.c:2692
       udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2817
       sock_common_setsockopt+0x61/0x70 net/core/sock.c:3668
       __sys_setsockopt+0x1c3/0x230 net/socket.c:2271
       __do_sys_setsockopt net/socket.c:2282 [inline]
       __se_sys_setsockopt net/socket.c:2279 [inline]
       __x64_sys_setsockopt+0x66/0x80 net/socket.c:2279
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0xffff8881382d52c0 -> 0xffff888138893740
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 19632 Comm: syz-executor.0 Not tainted 6.3.0-rc2-syzkaller-00387-g53429336 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
      
      Fix it by using READ_ONCE()/WRITE_ONCE() for all accesses to
      stock->cached_objcg.
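
      The check-then-dereference pattern and its fix can be illustrated in
      userspace as follows.  This is a sketch only: the READ_ONCE() and
      WRITE_ONCE() macros below are simplified approximations built on a
      volatile access (using the GNU C __typeof__ extension), and struct
      objcg, struct stock, cached_id() and clear_cached() are hypothetical
      names.  Object lifetime, which the kernel handles separately, is not
      modeled.

```c
#include <stddef.h>

/* Simplified userspace approximations of the kernel macros. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct objcg { int id; };
struct stock { struct objcg *cached; };

/*
 * Racy shape (shown as a comment only): the pointer is loaded twice,
 * so another CPU may clear it between the check and the dereference.
 *
 *	if (s->cached)
 *		return s->cached->id;
 */

/* Fixed shape: load the pointer once, then test and use the local copy. */
int cached_id(struct stock *s)
{
	struct objcg *c = READ_ONCE(s->cached);

	return c ? c->id : -1;
}

/* Writers pair with the reader by storing through WRITE_ONCE(). */
void clear_cached(struct stock *s)
{
	WRITE_ONCE(s->cached, NULL);
}
```

      The local copy guarantees that the value tested for NULL is the same
      value that is dereferenced, regardless of concurrent stores.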
      
      Link: https://lkml.kernel.org/r/20230502160839.361544-1-roman.gushchin@linux.dev
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reported-by: syzbot+774c29891415ab0fd29d@syzkaller.appspotmail.com
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
        Link: https://lore.kernel.org/linux-mm/CACT4Y+ZfucZhM60YPphWiCLJr6+SGFhT+jjm8k1P-a_8Kkxsjg@mail.gmail.com/T/#t
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 28 May, 2023 8 commits
  3. 27 May, 2023 3 commits
    • Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 4e893b5a
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - a double free fix in the Xen pvcalls backend driver
      
       - a fix for a regression causing the MSI related sysfs entries to not
         be created in Xen PV guests
      
       - a fix in the Xen blkfront driver for handling insane input data
         better
      
      * tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        x86/pci/xen: populate MSI sysfs entries
        xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
        xen/blkfront: Only check REQ_FUA for writes
    • Merge tag 'char-misc-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 957f3f8e
      Linus Torvalds authored
      Pull char/misc fixes from Greg KH:
       "Here are some small driver fixes for 6.4-rc4. They are just two
        different types:
      
         - binder fixes and reverts for reported problems and regressions in
           the binder "driver".
      
         - coresight driver fixes for reported problems.
      
        All of these have been in linux-next for over a week with no reported
        problems"
      
      * tag 'char-misc-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        binder: fix UAF of alloc->vma in race with munmap()
        binder: add lockless binder_alloc_(set|get)_vma()
        Revert "android: binder: stop saving a pointer to the VMA"
        Revert "binder_alloc: add missing mmap_lock calls when using the VMA"
        binder: fix UAF caused by faulty buffer cleanup
        coresight: perf: Release Coresight path when alloc trace id failed
        coresight: Fix signedness bug in tmc_etr_buf_insert_barrier_packet()
    • Merge tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 49572d53
      Linus Torvalds authored
      Pull compute express link fixes from Dan Williams:
       "The 'media ready' series prevents the driver from acting on bad
        capacity information, and it moves some checks earlier in the init
        sequence which impacts topics in the queue for 6.5.
      
        Additional hotplug testing uncovered a missing enable for memory
        decode. A debug crash fix is also included.
      
        Summary:
      
         - Stop trusting capacity data before the "media ready" indication
      
         - Add missing HDM decoder capability enable for the cold-plug case
      
         - Fix a debug message induced crash"
      
      * tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl: Explicitly initialize resources when media is not ready
        cxl/port: Fix NULL pointer access in devm_cxl_add_port()
        cxl: Move cxl_await_media_ready() to before capacity info retrieval
        cxl: Wait Memory_Info_Valid before access memory related info
        cxl/port: Enable the HDM decoder capability for switch ports
  4. 26 May, 2023 18 commits
    • Merge tag 'arm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 18713e8a
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "There have not been a lot of fixes for the soc tree in 6.4, but
        these have been sitting here for too long.
      
        For the devicetree side, there is one minor warning fix for vexpress,
        the rest are all for the NXP i.MX platforms: SoC specific bugfixes
        for the iMX8 clocks and its USB-3.0 gadget device, as well as board
        specific fixes for regulators and the phy on some of the i.MX boards.
      
        The microchip risc-v and arm32 maintainers now also add a shared
        maintainer file entry for the arm64 parts.
      
        The remaining fixes are all for firmware drivers, addressing mistakes
        in the optee, scmi and ff-a firmware driver implementation, mostly in
        the error handling code, incorrect use of the alloc_workqueue()
        interface in SCMI, and compatibility with corner cases of the firmware
        implementation"
      
      * tag 'arm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        MAINTAINERS: update arm64 Microchip entries
        arm64: dts: imx8: fix USB 3.0 Gadget Failure in QM & QXPB0 at super speed
        dt-binding: cdns,usb3: Fix cdns,on-chip-buff-size type
        arm64: dts: colibri-imx8x: delete adc1 and dsp
        arm64: dts: colibri-imx8x: fix iris pinctrl configuration
        arm64: dts: colibri-imx8x: move pinctrl property from SoM to eval board
        arm64: dts: colibri-imx8x: fix eval board pin configuration
        arm64: dts: imx8mp: Fix video clock parents
        ARM: dts: imx6qdl-mba6: Add missing pvcie-supply regulator
        ARM: dts: imx6ull-dhcor: Set and limit the mode for PMIC buck 1, 2 and 3
        arm64: dts: imx8mn-var-som: fix PHY detection bug by adding deassert delay
        arm64: dts: imx8mn: Fix video clock parents
        firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors
        firmware: arm_ffa: Fix FFA device names for logical partitions
        firmware: arm_ffa: Fix usage of partition info get count flag
        firmware: arm_ffa: Check if ffa_driver remove is present before executing
        arm64: dts: arm: add missing cache properties
        ARM: dts: vexpress: add missing cache properties
        firmware: arm_scmi: Fix incorrect alloc_workqueue() invocation
        optee: fix uninited async notif value
    • Merge tag 'pci-v6.4-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci · 96f15fc6
      Linus Torvalds authored
      Pull PCI fix from Bjorn Helgaas:
      
       - Quirk Ice Lake Root Ports to work around DPC log size issue (Mika
         Westerberg)
      
      * tag 'pci-v6.4-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
        PCI/DPC: Quirk PIO log size for Intel Ice Lake Root Ports
    • Merge tag 'vfio-v6.4-rc4' of https://github.com/awilliam/linux-vfio · 8846af75
      Linus Torvalds authored
      Pull VFIO fix from Alex Williamson:
      
       - Test for and return error for invalid pfns through the pin pages
         interface (Yan Zhao)
      
      * tag 'vfio-v6.4-rc4' of https://github.com/awilliam/linux-vfio:
        vfio/type1: check pfn valid before converting to struct page
    • Merge tag 'block-6.4-2023-05-26' of git://git.kernel.dk/linux · a92c9ab6
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few fixes for the storage side of things:
      
         - Fix bio caching condition for passthrough IO (Anuj)
      
         - end-of-device check fix for zero sized devices (Christoph)
      
         - Update Paolo's email address
      
         - NVMe pull request via Keith with a single quirk addition
      
         - Fix regression in how wbt enablement is done (Yu)
      
         - Fix race in active queue accounting (Tian)"
      
      * tag 'block-6.4-2023-05-26' of git://git.kernel.dk/linux:
        NVMe: Add MAXIO 1602 to bogus nid list.
        block: make bio_check_eod work for zero sized devices
        block: fix bio-cache for passthru IO
        block, bfq: update Paolo's address in maintainer list
        blk-mq: fix race condition in active queue accounting
        blk-wbt: fix that wbt can't be disabled by default
    • Merge tag 'io_uring-6.4-2023-05-26' of git://git.kernel.dk/linux · 6fae9129
      Linus Torvalds authored
      Pull io_uring fix from Jens Axboe:
       "Just a single fix for the conditional schedule with the SQPOLL thread,
        dropping the uring_lock if we do need to reschedule"
      
      * tag 'io_uring-6.4-2023-05-26' of git://git.kernel.dk/linux:
        io_uring: unlock sqd->lock before sq thread release CPU
      6fae9129
    • Linus Torvalds's avatar
      Merge tag 'thermal-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 77af1f2b
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Fix a regression introduced inadvertently during the 6.3 cycle by a
        commit making the Intel int340x thermal driver use sysfs_emit_at()
        instead of scnprintf() (Srinivas Pandruvada)"
      
      * tag 'thermal-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: intel: int340x: Add new line for UUID display
      77af1f2b
    • Linus Torvalds's avatar
      Merge tag 'pm-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c551afcd
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "Fix three issues related to the ->fast_switch callback in the AMD
        P-state cpufreq driver (Gautham R. Shenoy and Wyes Karny)"
      
      * tag 'pm-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf()
        cpufreq: amd-pstate: Remove fast_switch_possible flag from active driver
        cpufreq: amd-pstate: Add ->fast_switch() callback
      c551afcd
    • Dave Jiang's avatar
      cxl: Explicitly initialize resources when media is not ready · 793a539a
      Dave Jiang authored
      When media is not ready do not assume that the capacity information from
      the identify command is valid, i.e. ->total_bytes
      ->partition_align_bytes ->{volatile,persistent}_only_bytes. Explicitly
      zero out the capacity resources and exit early.
      
      Given zero-init of those fields this patch is functionally equivalent to
      the prior state, but it improves readability and robustness going
      forward.
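
The early-exit pattern described above can be sketched roughly as follows. This is a simplified model, not the driver code: the `Range` and `MemDev` classes, the `create_range_info` name, and the volatile-then-persistent layout are assumptions made for illustration; only the field names `total_bytes`, `volatile_only_bytes`, and `persistent_only_bytes` come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    """Hypothetical stand-in for a capacity resource: start offset and size."""
    start: int = 0
    size: int = 0


@dataclass
class MemDev:
    """Hypothetical device state; field names mirror the commit message."""
    media_ready: bool
    total_bytes: int = 0
    volatile_only_bytes: int = 0
    persistent_only_bytes: int = 0
    dpa: Range = field(default_factory=Range)
    ram: Range = field(default_factory=Range)
    pmem: Range = field(default_factory=Range)


def create_range_info(dev: MemDev) -> MemDev:
    """Populate capacity ranges, trusting identify data only if media is ready."""
    if not dev.media_ready:
        # Do not trust the identify data: zero the resources and exit early.
        dev.dpa = Range()
        dev.ram = Range()
        dev.pmem = Range()
        return dev

    # Assumed layout for illustration: volatile capacity first, then persistent.
    dev.dpa = Range(0, dev.total_bytes)
    dev.ram = Range(0, dev.volatile_only_bytes)
    dev.pmem = Range(dev.volatile_only_bytes, dev.persistent_only_bytes)
    return dev
```

With `media_ready=False`, every range comes back empty regardless of what the capacity fields contain, which is the behavior the commit makes explicit.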
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/168506118166.3004974.13523455340007852589.stgit@djiang5-mobl3
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      793a539a
    • Linus Torvalds's avatar
      Merge tag 'gpio-fixes-for-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 91a30434
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
      
       - fix incorrect output in in-tree gpio tools
      
       - fix a shell coding issue in gpio-sim selftests
      
       - correctly set the permissions for debugfs attributes exposed by
         gpio-mockup
      
       - fix chip name and pin count in gpio-f7188x for one of the supported
         models
      
       - fix numberspace pollution when using dynamically and statically
         allocated GPIOs together
      
      * tag 'gpio-fixes-for-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        gpio-f7188x: fix chip name and pin count on Nuvoton chip
        gpiolib: fix allocation of mixed dynamic/static GPIOs
        gpio: mockup: Fix mode of debugfs files
        selftests: gpio: gpio-sim: Fix BUG: test FAILED due to recent change
        tools: gpio: fix debounce_period_us output of lsgpio
      91a30434
    • Linus Torvalds's avatar
      Merge tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · b158dd94
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - handle memory allocation error in checksumming helper (reported by
         syzbot)
      
       - fix lockdep splat when aborting a transaction, add NOFS protection
         around invalidate_inode_pages2 that could allocate with GFP_KERNEL
      
       - reduce chances to hit an ENOSPC during scrub with RAID56 profiles
      
      * tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: use nofs when cleaning up aborted transactions
        btrfs: handle memory allocation failure in btrfs_csum_one_bio
        btrfs: scrub: try harder to mark RAID56 block groups read-only
      b158dd94
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2023-05-26' of git://anongit.freedesktop.org/drm/drm · b83ac44e
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "This week's collection is pretty spread out, accel/qaic has a bunch of
        fixes, amdgpu, then lots of single fixes across a bunch of places.
      
        core:
         - fix drmm_mutex_init lock class
      
        mgag200:
         - fix gamma lut initialisation
      
        pl111:
         - fix FB depth on IMPD-1 framebuffer
      
        amdgpu:
         - Fix missing BO unlocking in KIQ error path
         - Avoid spurious secure display error messages
         - SMU13 fix
         - Fix an OD regression
         - GPU reset display IRQ warning fix
         - MST fix
      
        radeon:
         - Fix a DP regression
      
        i915:
         - PIPEDMC disabling fix for bigjoiner config
      
        panel:
         - fix aya neo air plus quirk
      
        sched:
         - remove redundant NULL check
      
        qaic:
         - fix NNC message corruption
         - Grab ch_lock during QAIC_ATTACH_SLICE_BO
         - Flush the transfer list again
         - Validate if BO is sliced before slicing
         - Validate user data before grabbing any lock
         - initialize ret variable to 0
         - silence some uninitialized variable warnings"
      
      * tag 'drm-fixes-2023-05-26' of git://anongit.freedesktop.org/drm/drm:
        drm/amd/display: Have Payload Properly Created After Resume
        drm/amd/display: Fix warning in disabling vblank irq
        drm/amd/pm: Fix output of pp_od_clk_voltage
        drm/amd/pm: add missing NotifyPowerSource message mapping for SMU13.0.7
        drm/radeon: reintroduce radeon_dp_work_func content
        drm/amdgpu: don't enable secure display on incompatible platforms
        drm:amd:amdgpu: Fix missing buffer object unlock in failure path
        accel/qaic: Fix NNC message corruption
        accel/qaic: Grab ch_lock during QAIC_ATTACH_SLICE_BO
        accel/qaic: Flush the transfer list again
        accel/qaic: Validate if BO is sliced before slicing
        accel/qaic: Validate user data before grabbing any lock
        accel/qaic: initialize ret variable to 0
        drm/i915: Fix PIPEDMC disabling for a bigjoiner configuration
        drm: fix drmm_mutex_init()
        drm/sched: Remove redundant check
        drm: panel-orientation-quirks: Change Air's quirk to support Air Plus
        accel/qaic: silence some uninitialized variable warnings
        drm/pl111: Fix FB depth on IMPD-1 framebuffer
        drm/mgag200: Fix gamma lut not initialized.
      b83ac44e
    • Linus Torvalds's avatar
      x86: re-introduce support for ERMS copies for user space accesses · 47ee3f1d
      Linus Torvalds authored
      I tried to streamline our user memory copy code fairly aggressively in
      commit adfcf423 ("x86: don't use REP_GOOD or ERMS for user memory
      copies"), in order to then be able to clean up the code and inline the
      modern FSRM case in commit 577e6a7f ("x86: inline the 'rep movs' in
      user copies for the FSRM case").
      
      We had reports [1] of that causing regressions earlier with blogbench,
      but that turned out to be a horrible benchmark for that case, and not a
      sufficient reason for re-instating "rep movsb" on older machines.
      
      However, now Eric Dumazet reported [2] a regression in performance that
      seems to be a rather more real benchmark, where due to the removal of
      "rep movs" a TCP stream over a 100Gbps network no longer reaches line
      speed.
      
      And it turns out that with the simplified calling convention for the
      non-FSRM case in commit 427fda2c ("x86: improve on the non-rep
      'copy_user' function"), re-introducing the ERMS case is actually fairly
      simple.
      
      Of course, that "fairly simple" is glossing over several missteps due to
      having to fight our assembler alternative code.  This code really wanted
      to rewrite a conditional branch to have two different targets, but that
      made objtool sufficiently unhappy that this instead just ended up doing
      a choice between "jump to the unrolled loop, or use 'rep movsb'
      directly".
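
As a rough model of the choice described above, the sketch below picks a copy path from CPU feature flags. It is an illustration under assumptions, not the kernel's assembly: the function names and the returned labels are hypothetical, the `fsrm` and `erms` strings are the flag names as they appear in `/proc/cpuinfo` on x86 Linux, and the real code may also take the copy length into account, which is not modeled here.

```python
def pick_user_copy_strategy(cpu_flags):
    """Return an illustrative label for the user-copy path given CPU flags."""
    flags = set(cpu_flags)
    if "fsrm" in flags:
        # Modern case: the 'rep movs' is inlined at the call site.
        return "inline rep movs"
    if "erms" in flags:
        # Case re-introduced by this commit: use 'rep movsb' directly.
        return "rep movsb"
    # Neither feature present: fall back to the unrolled loop.
    return "unrolled loop"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Collect the flags of the first CPU listed; empty set if unavailable."""
    flags = set()
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    flags.update(line.partition(":")[2].split())
                    break
    except OSError:
        pass
    return flags


if __name__ == "__main__":
    print(pick_user_copy_strategy(read_cpu_flags()))
```

Running it on an x86 Linux machine prints the label that matches the local CPU; on other systems it falls back to the unrolled-loop label because no flags are found.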
      
      Let's see if somebody finds a case where the kernel memory copies also
      care (see commit 68674f94: "x86: don't use REP_GOOD or ERMS for
      small memory copies").  But Eric does argue that the user copies are
      special because networking tries to copy up to 32KB at a time, if
      order-3 page allocations are possible.
      
      In-kernel memory copies are typically small, unless they are the special
      "copy pages at a time" kind that still use "rep movs".
      
      Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1]
      Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2]
      Reported-and-tested-by: Eric Dumazet <edumazet@google.com>
      Fixes: adfcf423 ("x86: don't use REP_GOOD or ERMS for user memory copies")
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47ee3f1d
    • Jens Axboe's avatar
      Merge tag 'nvme-6.4-2023-05-26' of git://git.infradead.org/nvme into block-6.4 · 9491d01f
      Jens Axboe authored
      Pull NVMe fix from Keith:
      
      "nvme fixes for 6.4
      
       One nvme quirk (Tatsuki)"
      
      * tag 'nvme-6.4-2023-05-26' of git://git.infradead.org/nvme:
        NVMe: Add MAXIO 1602 to bogus nid list.
      9491d01f
    • Tatsuki Sugiura's avatar
      NVMe: Add MAXIO 1602 to bogus nid list. · a3a9d63d
      Tatsuki Sugiura authored
      HIKSEMI FUTURE M.2 SSD uses the same dummy nguid and eui64.
      I confirmed it with my two devices.
      
      This patch marks the controller as NVME_QUIRK_BOGUS_NID.
      
      ---------------------------------------------------------
      sugi@tempest:~% sudo nvme id-ctrl /dev/nvme0
      NVME Identify Controller:
      vid       : 0x1e4b
      ssvid     : 0x1e4b
      sn        : 30096022612
      mn        : HS-SSD-FUTURE 2048G
      fr        : SN10542
      rab       : 0
      ieee      : 000000
      cmic      : 0
      mdts      : 7
      cntlid    : 0
      ver       : 0x10400
      rtd3r     : 0x7a120
      rtd3e     : 0x1e8480
      oaes      : 0x200
      ctratt    : 0x2
      rrls      : 0
      cntrltype : 1
      fguid     : 00000000-0000-0000-0000-000000000000
      <snip...>
      ---------------------------------------------------------
      
      ---------------------------------------------------------
      sugi@tempest:~% sudo nvme id-ns /dev/nvme0n1
      NVME Identify Namespace 1:
      <snip...>
      nguid   : 00000000000000000000000000000000
      eui64   : 0000000000000002
      lbaf  0 : ms:0   lbads:9  rp:0 (in use)
      ---------------------------------------------------------
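
The check behind the two dumps above can be sketched as a small script that extracts `nguid` and `eui64` from `nvme id-ns` text output and reports whether two namespaces carry the same pair. This is an illustrative sketch only: the function names are hypothetical, and it assumes the `key : value` line layout shown in the output above.

```python
def parse_namespace_ids(id_ns_output):
    """Extract the nguid and eui64 fields from 'nvme id-ns' text output."""
    ids = {}
    for line in id_ns_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("nguid", "eui64"):
            ids[key.strip()] = value.strip()
    return ids


def share_namespace_ids(outputs):
    """Return True if any two dumps report the same (nguid, eui64) pair."""
    seen = set()
    for output in outputs:
        ids = parse_namespace_ids(output)
        pair = (ids.get("nguid"), ids.get("eui64"))
        if pair in seen:
            return True
        seen.add(pair)
    return False


SAMPLE = """NVME Identify Namespace 1:
nguid   : 00000000000000000000000000000000
eui64   : 0000000000000002
lbaf  0 : ms:0   lbads:9  rp:0 (in use)
"""

if __name__ == "__main__":
    # Two devices returning the same dump share their identifiers.
    print(share_namespace_ids([SAMPLE, SAMPLE]))
```

Identifiers that repeat across physically distinct devices cannot be used to tell the devices apart, which is the situation the quirk is meant to flag.
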
      Signed-off-by: Tatsuki Sugiura <sugi@nemui.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      a3a9d63d
    • Arnd Bergmann's avatar
      Merge tag 'ffa-fixes-6.4' of... · abf5422e
      Arnd Bergmann authored
      Merge tag 'ffa-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
      
      Arm FF-A fixes for v6.4
      
      Quite a few fixes to address a set of assorted issues:
      1. NULL pointer dereference if the ffa driver doesn't provide remove()
         callback as it is currently executed unconditionally
      2. FF-A core probe failure on systems with v1.0 firmware as the new
         partition info get count flag is used unconditionally
      3. Failure to register more than one logical partition or service within
         the same physical partition, as the device name contains only the VM
         ID, which is the same for all of them even though each has a unique
         UUID.
      4. Rejection of certain memory interface transmissions by the receivers
         (secure partitions) as a few MBZ fields are non-zero due to lack of
         explicit re-initialization of those fields
      
      * tag 'ffa-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
        firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors
        firmware: arm_ffa: Fix FFA device names for logical partitions
        firmware: arm_ffa: Fix usage of partition info get count flag
        firmware: arm_ffa: Check if ffa_driver remove is present before executing
      
      Link: https://lore.kernel.org/r/20230509143453.1188753-1-sudeep.holla@arm.com
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      abf5422e
    • Dave Airlie's avatar
      Merge tag 'drm-misc-fixes-2023-05-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · 5502d1fa
      Dave Airlie authored
      drm-misc-fixes for v6.4-rc4:
      - A few non-trivial fixes to qaic.
      - Fix drmm_mutex_init always using same lock class.
      - Fix pl111 fb depth.
      - Fix uninitialised gamma lut in mgag200.
      - Add Aya Neo Air Plus quirk.
      - Trivial null check removal in scheduler.
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/d19f748c-2c5b-8140-5b05-a8282dfef73e@linux.intel.com
      5502d1fa
    • Dave Airlie's avatar
      Merge tag 'amd-drm-fixes-6.4-2023-05-24' of... · 13aa38f8
      Dave Airlie authored
      Merge tag 'amd-drm-fixes-6.4-2023-05-24' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
      
      amd-drm-fixes-6.4-2023-05-24:
      
      amdgpu:
      - Fix missing BO unlocking in KIQ error path
      - Avoid spurious secure display error messages
      - SMU13 fix
      - Fix an OD regression
      - GPU reset display IRQ warning fix
      - MST fix
      
      radeon:
      - Fix a DP regression
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Alex Deucher <alexander.deucher@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20230524211238.7749-1-alexander.deucher@amd.com
      13aa38f8
    • Dave Airlie's avatar
      Merge tag 'drm-intel-fixes-2023-05-25' of... · 94d39d01
      Dave Airlie authored
      Merge tag 'drm-intel-fixes-2023-05-25' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
      
      PIPEDMC disabling fix for bigjoiner config
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/ZG9aROGyc947/J1l@jlahtine-mobl.ger.corp.intel.com
      94d39d01