- 24 Sep, 2019 40 commits
-
-
Song Liu authored
In preparation for non-shmem THP, this patch adds a few stats and exposes them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and /proc/<pid>/task/<tid>/smaps. This patch is mostly a rewrite of Kirill A. Shutemov's earlier version: https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/ Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.comSigned-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Song Liu authored
With THP, current check of offset: VM_BUG_ON_PAGE(page->index != offset, page); is no longer accurate. Update it to: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); Link: http://lkml.kernel.org/r/20190801184244.3169074-4-songliubraving@fb.comSigned-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Song Liu authored
Similar to previous patch, pagecache_get_page() avoids race condition with truncate by checking page->mapping == mapping. This does not work for compound pages. This patch let it check compound_head(page)->mapping instead. Link: http://lkml.kernel.org/r/20190801184244.3169074-3-songliubraving@fb.comSigned-off-by: Song Liu <songliubraving@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Song Liu authored
Patch series "Enable THP for text section of non-shmem files", v10; This patchset follows up discussion at LSF/MM 2019. The motivation is to put text section of an application in THP, and thus reduces iTLB miss rate and improves performance. Both Facebook and Oracle showed strong interests to this feature. To make reviews easier, this set aims a mininal valid product. Current version of the work does not have any changes to file system specific code. This comes with some limitations (discussed later). This set enables an application to "hugify" its text section by simply running something like: madvise(0x600000, 0x80000, MADV_HUGEPAGE); Before this call, the /proc/<pid>/maps looks like: 00400000-074d0000 r-xp 00000000 00:27 2006927 app After this call, part of the text section is split out and mapped to THP: 00400000-00425000 r-xp 00000000 00:27 2006927 app 00600000-00e00000 r-xp 00200000 00:27 2006927 app <<< on THP 00e00000-074d0000 r-xp 00a00000 00:27 2006927 app Limitations: 1. This only works for text section (vma with VM_DENYWRITE). 2. Original limitation #2 is removed in v3. We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS. Once we get better support on the write path, we can remove the config and enable it by default. Tested cases: 1. Tested with btrfs and ext4. 2. Tested with real work application (memcache like caching service). 3. Tested with "THP aware uprobe": https://patchwork.kernel.org/project/linux-mm/list/?series=131339 This patch (of 7): Currently, filemap_fault() avoids race condition with truncate by checking page->mapping == mapping. This does not work for compound pages. This patch let it check compound_head(page)->mapping instead. Link: http://lkml.kernel.org/r/20190801184244.3169074-2-songliubraving@fb.comSigned-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kefeng Wang authored
According to 78ddc534 ("thp: rename split_huge_page_pmd() to split_huge_pmd()"), update related comment. Link: http://lkml.kernel.org/r/20190731033406.185285-1-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Pingfan Liu authored
Remove unused 'pfn' variable. Link: http://lkml.kernel.org/r/1565167272-21453-1-git-send-email-kernelfans@gmail.comSigned-off-by: Pingfan Liu <kernelfans@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Kravetz authored
When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages, the pages will be interleaved between all nodes of the system. If nodes are not equal, it is quite possible for one node to fill up before the others. When this happens, the code still attempts to allocate pages from the full node. This results in calls to direct reclaim and compaction which slow things down considerably. When allocating pool pages, note the state of the previous allocation for each node. If previous allocation failed, do not use the aggressive retry algorithm on successive attempts. The allocation will still succeed if there is memory available, but it will not try as hard to free up memory. Link: http://lkml.kernel.org/r/20190806014744.15446-5-mike.kravetz@oracle.comSigned-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
Mike Kravetz reports that "hugetlb allocations could stall for minutes or hours when should_compact_retry() would return true more often then it should. Specifically, this was in the case where compact_result was COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being made." The problem is that the compaction_withdrawn() test in should_compact_retry() includes compaction outcomes that are only possible on low compaction priority, and results in a retry without increasing the priority. This may result in furter reclaim, and more incomplete compaction attempts. With this patch, compaction priority is raised when possible, or should_compact_retry() returns false. The COMPACT_SKIPPED result doesn't really fit together with the other outcomes in compaction_withdrawn(), as that's a result caused by insufficient order-0 pages, not due to low compaction priority. With this patch, it is moved to a new compaction_needs_reclaim() function, and for that outcome we keep the current logic of retrying if it looks like reclaim will be able to help. Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.comReported-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
After commit "mm, reclaim: make should_continue_reclaim perform dryrun detection", closer look at the function shows, that nr_reclaimed == 0 means the function will always return false. And since non-zero nr_reclaimed implies non_zero nr_scanned, testing nr_scanned serves no purpose, and so does the testing for __GFP_RETRY_MAYFAIL. This patch thus cleans up the function to test only !nr_reclaimed upfront, and remove the __GFP_RETRY_MAYFAIL test and nr_scanned parameter completely. Comment is also updated, explaining that approximating "full LRU list has been scanned" with nr_scanned == 0 didn't really work. Link: http://lkml.kernel.org/r/20190806014744.15446-3-mike.kravetz@oracle.comSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hillf Danton authored
Patch series "address hugetlb page allocation stalls", v2. Allocation of hugetlb pages via sysctl or procfs can stall for minutes or hours. A simple example on a two node system with 8GB of memory is as follows: echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages echo 4096 > /proc/sys/vm/nr_hugepages Obviously, both allocation attempts will fall short of their 8GB goal. However, one or both of these commands may stall and not be interruptible. The issues were initially discussed in mail thread [1] and RFC code at [2]. This series addresses the issues causing the stalls. There are two distinct fixes, a cleanup, and an optimization. The reclaim patch by Hillf and compaction patch by Vlasitmil address corner cases in their respective areas. hugetlb page allocation could stall due to either of these issues. Vlasitmil added a cleanup patch after Hillf's modifications. The hugetlb patch by Mike is an optimization suggested during the debug and development process. [1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com [2] http://lkml.kernel.org/r/20190724175014.9935-1-mike.kravetz@oracle.com This patch (of 4): Address the issue of should_continue_reclaim returning true too often for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned. This was observed during hugetlb page allocation causing stalls for minutes or hours. We can stop reclaiming pages if compaction reports it can make a progress. There might be side-effects for other high-order allocations that would potentially benefit from reclaiming more before compaction so that they would be faster and less likely to stall. However, the consequences of premature/over-reclaim are considered worse. We can also bail out of reclaiming pages if we know that there are not enough inactive lru pages left to satisfy the costly allocation. We can give up reclaiming pages too if we see dryrun occur, with the certainty of plenty of inactive pages. IOW with dryrun detected, we are sure we have reclaimed as many pages as we could. Link: http://lkml.kernel.org/r/20190806014744.15446-2-mike.kravetz@oracle.comSigned-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
Cgroup v1 memcg controller has exposed a dedicated kmem limit to users which turned out to be really a bad idea because there are paths which cannot shrink the kernel memory usage enough to get below the limit (e.g. because the accounted memory is not reclaimable). There are cases when the failure is even not allowed (e.g. __GFP_NOFAIL). This means that the kmem limit is in excess to the hard limit without any way to shrink and thus completely useless. OOM killer cannot be invoked to handle the situation because that would lead to a premature oom killing. As a result many places might see ENOMEM returning from kmalloc and result in unexpected errors. E.g. a global OOM killer when there is a lot of free memory because ENOMEM is translated into VM_FAULT_OOM in #PF path and therefore pagefault_out_of_memory would result in OOM killer. Please note that the kernel memory is still accounted to the overall limit along with the user memory so removing the kmem specific limit should still allow to contain kernel memory consumption. Unlike the kmem one, though, it invokes memory reclaim and targeted memcg oom killing if necessary. Start the deprecation process by crying to the kernel log. Let's see whether there are relevant usecases and simply return to EINVAL in the second stage if nobody complains in few releases. [akpm@linux-foundation.org: tweak documentation text] Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.czSigned-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Thomas Lindroth <thomas.lindroth@gmail.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Qian Cai authored
mem_cgroup_id_get() was introduced in commit 73f576c0 ("mm:memcontrol: fix cgroup creation failure after many small jobs"). Later, it no longer has any user since the commits, 1f47b61f ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup") 58fa2a55 ("mm: memcontrol: add sanity checks for memcg->id.ref on get/put") so safe to remove it. Link: http://lkml.kernel.org/r/1568648453-5482-1-git-send-email-cai@lca.pwSigned-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
constrained_alloc() calculates the size of the oom domain by using node_spanned_pages which is incorrect because this is the full range of the physical memory range that the numa node occupies rather than the memory that backs that range which is represented by node_present_pages. Sparsely populated nodes (e.g. after memory hot remove or simply sparse due to memory layout) can have really a large difference between the two. This shouldn't really cause any real user observable problems because the oom calculates a ratio against totalpages and used memory cannot exceed present pages but it is confusing and wrong from code point of view. Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.orgSigned-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yi Wang authored
Commit ac311a14 ("oom: decouple mems_allowed from oom_unkillable_task") changed has_intersects_mems_allowed() to oom_cpuset_eligible(), but didn't change the comment. Link: http://lkml.kernel.org/r/1566959929-10638-1-git-send-email-wang.yi59@zte.com.cnSigned-off-by: Yi Wang <wang.yi59@zte.com.cn> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Edward Chron authored
For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.comSigned-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tetsuo Handa authored
Masoud Sharbiani noticed that commit 29ef680a ("memcg, oom: move out_of_memory back to the charge path") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3 ("mm, oom: move GFP_NOFS check to out_of_memory"). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680a, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680a, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680a] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joel Savitz authored
In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.comSigned-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kefeng Wang authored
1) task_nodes = cpuset_mems_allowed(current); -> cpuset_mems_allowed() guaranteed to return some non-empty subset of node_states[N_MEMORY]. 2) nodes_and(*new, *new, task_nodes); -> after nodes_and(), the 'new' should be empty or appropriate nodemask(online node and with memory). After 1) and 2), we could remove unnecessary check whether the 'new' AND node_states[N_MEMORY] is empty. Link: http://lkml.kernel.org/r/20190806023634.55356-1-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Pengfei Li authored
Like commit 40cacbcb ("mm, compaction: remove unnecessary zone parameter in some instances"), remove unnecessary zone parameter. No functional change. Link: http://lkml.kernel.org/r/20190806151616.21107-1-lpf.vector@gmail.comSigned-off-by: Pengfei Li <lpf.vector@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yafang Shao authored
total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and COMPACTFREE_SCANNED in compact_zone(). We should clear them before scanning a new zone. In the proc triggered compaction, we forgot clearing them. [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()] Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko] [vbabka@suse.cz: squash compact_zone() list_head init as well] Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone] Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com Fixes: 7f354a54 ("mm, compaction: add vmstats for kcompactd work") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Yafang Shao <shaoyafang@didiglobal.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vitaly Wool authored
Currently there is a leak in init_z3fold_page() -- it allocates handles from kmem cache even for headless pages, but then they are never used and never freed, so eventually kmem cache may get exhausted. This patch provides a fix for that. Link: http://lkml.kernel.org/r/20190917185352.44cf285d3ebd9e64548de5de@gmail.comSigned-off-by: Vitaly Wool <vitalywool@gmail.com> Reported-by: Markus Linnala <markus.linnala@gmail.com> Tested-by: Markus Linnala <markus.linnala@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Qian Cai authored
When compiling a kernel with W=1, there are several of those warnings due to arm64 overriding a field on purpose. Just disable those warnings for both GCC and Clang of this file, so it will help dig "gems" hidden in the W=1 warnings by reducing some noises. mm/init-mm.c:39:2: warning: initializer overrides prior initialization of this subobject [-Winitializer-overrides] INIT_MM_CONTEXT(init_mm) ^~~~~~~~~~~~~~~~~~~~~~~~ ./arch/arm64/include/asm/mmu.h:133:9: note: expanded from macro 'INIT_MM_CONTEXT' .pgd = init_pg_dir, ^~~~~~~~~~~ mm/init-mm.c:30:10: note: previous initialization is here .pgd = swapper_pg_dir, ^~~~~~~~~~~~~~ Note: there is a side project trying to support explicitly allowing specific initializer overrides in Clang, but there is no guarantee it will happen or not. https://github.com/ClangBuiltLinux/linux/issues/639 Link: http://lkml.kernel.org/r/1566920867-27453-1-git-send-email-cai@lca.pwSigned-off-by: Qian Cai <cai@lca.pw> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Rapoport authored
Replace open-coded bitmap array initialization of init_mm.cpu_bitmask with neat CPU_BITS_NONE macro. And, since init_mm.cpu_bitmask is statically set to zero, there is no way to clear it again in start_kernel(). Link: http://lkml.kernel.org/r/1565703815-8584-1-git-send-email-rppt@linux.ibm.comSigned-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Austin Kim authored
If !area->pages statement is true where memory allocation fails, area is freed. In this case 'area->pages = pages' should not executed. So move 'area->pages = pages' after if statement. [akpm@linux-foundation.org: give area->pages the same treatment] Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15Signed-off-by: Austin Kim <austindh.kim@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Pengfei Li authored
Objective --------- The current implementation of struct vmap_area wasted space. After applying this commit, sizeof(struct vmap_area) has been reduced from 11 words to 8 words. Description ----------- 1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem because A) "subtree_max_size" is only used when vmap_area is in "free" tree B) "vm" is only used when vmap_area is in "busy" tree C) "purge_list" is only used when vmap_area is in vmap_purge_list 2) Eliminate "flags". ;Since only one flag VM_VM_AREA is being used, and the same thing can be done by judging whether "vm" is NULL, then the "flags" can be eliminated. Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.comSigned-off-by: Pengfei Li <lpf.vector@gmail.com> Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Roman Gushchin <guro@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Uladzislau Rezki (Sony) authored
The busy tree can be quite big, even though the area is freed or unmapped it still stays there until "purge" logic removes it. 1) Optimize and reduce the size of "busy" tree by removing a node from it right away as soon as user triggers free paths. It is possible to do so, because the allocation is done using another augmented tree. The vmalloc test driver shows the difference, for example the "fix_size_alloc_test" is ~11% better comparing with default configuration: sudo ./test_vmalloc.sh performance <default> Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec <default> <this patch> Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec <this patch> 2) Since the busy tree now contains allocated areas only and does not interfere with lazily free nodes, introduce the new function show_purge_info() that dumps "unpurged" areas that is propagated through "/proc/vmallocinfo". 3) Eliminate VM_LAZY_FREE flag. Link: http://lkml.kernel.org/r/20190716152656.12255-2-lpf.vector@gmail.comSigned-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Pengfei Li <lpf.vector@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Alastair D'Silva authored
There is no possibility for memmap to be NULL in the current codebase. This check was added in commit 95a4774d ("memory-hotplug: update mce_bad_pages when removing the memory") where memmap was originally inited to NULL, and only conditionally given a value. The code that could have passed a NULL has been removed by commit ba72b4c8 ("mm/sparsemem: support sub-section hotplug"), so there is no longer a possibility that memmap can be NULL. Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.orgSigned-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Baoquan He <bhe@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Alastair D'Silva authored
Use the function written to do it instead. Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.comSigned-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wei Yang authored
__pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)). Since we already get section_nr, it is not necessary to get mem_section from start_pfn. By doing so, we reduce one redundant operation. Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.comSigned-off-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Lecopzer Chen authored
The size argument passed into sparse_buffer_alloc() has already been aligned with PAGE_SIZE or PMD_SIZE. If the size after aligned is not power of 2 (e.g. 0x480000), the PTR_ALIGN() will return wrong value. Use roundup to round sparsemap_buf up to next multiple of size. Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.comSigned-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Lecopzer Chen authored
sparse_buffer_alloc(xsize) gets the size of memory from sparsemap_buf after being aligned with the size. However, the size is at least PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger than PAGE_SIZE. Also, sparse_buffer_fini() only frees memory between sparsemap_buf and sparsemap_buf_end, since sparsemap_buf may be changed by PTR_ALIGN() first, the aligned space before sparsemap_buf is wasted and no one will touch it. In our ARM32 platform (without SPARSEMEM_VMEMMAP) Sparse_buffer_init Reserve d359c000 - d3e9c000 (9M) Sparse_buffer_alloc Alloc d3a00000 - d3E80000 (4.5M) Sparse_buffer_fini Free d3e80000 - d3e9c000 (~=100k) The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed. In ARM64 platform (with SPARSEMEM_VMEMMAP) sparse_buffer_init Reserve ffffffc07d623000 - ffffffc07f623000 (32M) Sparse_buffer_alloc Alloc ffffffc07d800000 - ffffffc07f600000 (30M) Sparse_buffer_fini Free ffffffc07f600000 - ffffffc07f623000 (140K) The reserved memory between ffffffc07d623000 - ffffffc07d800000 (~=1.9M) is unfreed. Let's explicit free redundant aligned memory. [arnd@arndb.de: mark sparse_buffer_free as __meminit] Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.comSigned-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Souptick Joarder authored
Correct typo in comment. Link: http://lkml.kernel.org/r/1568233954-3913-1-git-send-email-jrdr.linux@gmail.comSigned-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
walk_system_ram_range() will fail with -EINVAL in case online_pages_range() was never called (== no resource applicable in the range). Otherwise, we will always call online_pages_range() with nr_pages > 0 and, therefore, have online_pages > 0. Remove that special handling. Link: http://lkml.kernel.org/r/20190814154109.3448-6-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
Commit a9cd410a ("mm/page_alloc.c: memory hotplug: free pages as higher order") assumed that any PFN we get via memory resources is aligned to to MAX_ORDER - 1, I am not convinced that is always true. Let's play safe, check the alignment and fallback to single pages. akpm: warn in this situation so we get to find out if and why this ever occurs. [akpm@linux-foundation.org: add WARN_ON_ONCE()] Link: http://lkml.kernel.org/r/20190814154109.3448-5-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
online_pages always corresponds to nr_pages. Simplify the code, getting rid of online_pages_blocks(). Add some comments. Link: http://lkml.kernel.org/r/20190814154109.3448-4-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
move_pfn_range_to_zone() will set all pages to PG_reserved via memmap_init_zone(). The only way a page could no longer be reserved would be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not done (the online_page callback is used for that purpose by e.g., Hyper-V instead). walk_system_ram_range() will never call online_pages_range() with duplicate PFNs, so drop the PageReserved() check. This seems to be a leftover from ancient times where the memmap was initialized when adding memory and we wanted to check for already onlined memory. Link: http://lkml.kernel.org/r/20190814154109.3448-3-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
Patch series "mm/memory_hotplug: online_pages() cleanups", v2. Some cleanups (+ one fix for a special case) in the context of online_pages(). This patch (of 5): This makes it clearer that we will never call func() with duplicate PFNs in case we have multiple sub-page memory resources. All unaligned parts of PFNs are completely discarded. Link: http://lkml.kernel.org/r/20190814154109.3448-2-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Arun KS <arunks@codeaurora.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wei Yang authored
When offlining a node in try_offline_node(), pgdat is not released. So that pgdat could be reused in hotadd_new_pgdat(). While we reallocate pgdat->per_cpu_nodestats if this pgdat is reused. This patch prevents the memory leak by just allocating per_cpu_nodestats when it is a new pgdat. Link: http://lkml.kernel.org/r/20190813020608.10194-1-richardw.yang@linux.intel.comSigned-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <OSalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
Each memory block spans the same amount of sections/pages/bytes. The size is determined before the first memory block is created. No need to store what we can easily calculate - and the calculations even look simpler now. Michal brought up the idea of variable-sized memory blocks. However, if we ever implement something like this, we will need an API compatibility switch and reworks at various places (most code assumes a fixed memory block size). So let's cleanup what we have right now. While at it, fix the variable naming in register_mem_sect_under_node() - we no longer talk about a single section. Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
Let's validate the memory block size early, when initializing the memory device infrastructure. Fail hard in case the value is not suitable. As nobody checks the return value of memory_dev_init(), turn it into a void function and fail with a panic in all scenarios instead. Otherwise, we'll crash later during boot when core/drivers expect that the memory device infrastructure (including memory_block_size_bytes()) works as expected. I think long term, we should move the whole memory block size configuration (set_memory_block_size_order() and memory_block_size_bytes()) into drivers/base/memory.c. Link: http://lkml.kernel.org/r/20190806090142.22709-1-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-