- 17 Mar, 2016 40 commits
-
-
Sergey Senozhatsky authored
console_unlock() allows to cond_resched() if its caller has set `console_may_schedule' to 1 (this functionality is present since 8d91f8b1 ("printk: do cond_resched() between lines while outputting to consoles"). The rules are: -- console_lock() always sets `console_may_schedule' to 1 -- console_trylock() always sets `console_may_schedule' to 0 printk() calls console_unlock() with preemption desabled, which basically can lead to RCU stalls, watchdog soft lockups, etc. if something is simultaneously calling printk() frequent enough (IOW, console_sem owner always has new data to send to console divers and can't leave console_unlock() for a long time). printk()->console_trylock() callers do not necessarily execute in atomic contexts, and some of them can cond_resched() in console_unlock(). console_trylock() can set `console_may_schedule' to 1 (allow cond_resched() later in consoe_unlock()) when it's safe. This patch (of 3): vprintk_emit() disables preemption around console_trylock_for_printk() and console_unlock() calls for a strong reason -- can_use_console() check. The thing is that vprintl_emit() can be called on a CPU that is not fully brought up yet (!cpu_online()), which potentially can cause problems if console driver wants to access per-cpu data. A console driver can explicitly state that it's safe to call it from !online cpu by setting CON_ANYTIME bit in console ->flags. That's why for !cpu_online() can_use_console() iterates all the console to find out if there is a CON_ANYTIME console, otherwise console_unlock() must be avoided. can_use_console() ensures that console_unlock() call is safe in vprintk_emit() only; console_lock() and console_trylock() are not covered by this check. Even though call_console_drivers(), invoked from console_cont_flush() and console_unlock(), tests `!cpu_online() && CON_ANYTIME' for_each_console(), it may be too late, which can result in messages loss. Assume that we have 2 cpus -- CPU0 is online, CPU1 is !online, and no CON_ANYTIME consoles available. CPU0 online CPU1 !online console_trylock() ... console_unlock() console_cont_flush spin_lock logbuf_lock if (!cont.len) { spin_unlock logbuf_lock return } for (;;) { vprintk_emit spin_lock logbuf_lock log_store spin_unlock logbuf_lock spin_lock logbuf_lock !console_trylock_for_printk msg_print_text return console_idx = log_next() console_seq++ console_prev = msg->flags spin_unlock logbuf_lock call_console_drivers() for_each_console(con) { if (!cpu_online() && !(con->flags & CON_ANYTIME)) continue; } /* * no message printed, we lost it */ vprintk_emit spin_lock logbuf_lock log_store spin_unlock logbuf_lock !console_trylock_for_printk return /* * go to the beginning of the loop, * find out there are new messages, * lose it */ } console_trylock()/console_lock() call on CPU1 may come from cpu notifiers registered on that CPU. Since notifiers are not getting unregistered when CPU is going DOWN, all of the notifiers receive notifications during CPU UP. For example, on my x86_64, I see around 50 notification sent from offline CPU to itself [swapper/2] from cpu:2 to:2 action:CPU_STARTING hotplug_hrtick [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_main_cpu_notify [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_queue_reinit_notify [swapper/2] from cpu:2 to:2 action:CPU_STARTING console_cpu_notify while doing echo 0 > /sys/devices/system/cpu/cpu2/online echo 1 > /sys/devices/system/cpu/cpu2/online So grabbing the console_sem lock while CPU is !online is possible, in theory. This patch moves can_use_console() check out of console_trylock_for_printk(). Instead it calls it in console_unlock(), so now console_lock()/console_unlock() are also 'protected' by can_use_console(). This also means that console_trylock_for_printk() is not really needed anymore and can be removed. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Jan Kara <jack@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kyle McMartin <kyle@kernel.org> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Calvin Owens <calvinowens@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Rob Landley authored
The v850 port was removed by commits f606ddf4 and 07a887d3 in 2008. These #defines are not used in the current kernel. Signed-off-by: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
There are various email addresses for me throughout the kernel. Use the one that will always be valid. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Steven Rostedt authored
This has hit me a couple of times already. I would be debugging code and the system would simply hang and then reboot. Finally, I found that the problem was caused by WARN_ON_ONCE() and friends. The macro WARN_ON_ONCE(condition) is defined as: static bool __section(.data.unlikely) __warned; int __ret_warn_once = !!(condition); if (unlikely(__ret_warn_once)) if (WARN_ON(!__warned)) __warned = true; unlikely(__ret_warn_once); Which looks great and all. But what I have hit, is an issue when WARN_ON() itself hits the same WARN_ON_ONCE() code. Because, the variable __warned is not yet set. Then it too calls WARN_ON() and that triggers the warning again. It keeps doing this until the stack is overflowed and the system crashes. By setting __warned first before calling WARN_ON() makes the original WARN_ON_ONCE() really only warn once, and not an infinite amount of times if the WARN_ON() also triggers the warning. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrew Morton authored
arch/mn10300/kernel/fpu-nofpu.c:27:36: error: unknown type name 'elf_fpregset_t' int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpreg) Reported-by: kbuild test robot <lkp@intel.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrew Morton authored
CONFIG_BUG=n && CONFIG_GENERIC_BUG=y make no sense and things break: In file included from include/linux/page-flags.h:9:0, from kernel/bounds.c:9: include/linux/bug.h:91:47: warning: 'struct bug_entry' declared inside parameter list static inline int is_warning_bug(const struct bug_entry *bug) ^ include/linux/bug.h:91:47: warning: its scope is only this definition or declaration, which is probably not what you want include/linux/bug.h: In function 'is_warning_bug': >> include/linux/bug.h:93:12: error: dereferencing pointer to incomplete type return bug->flags & BUGFLAG_WARNING; Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dave Young authored
On i686 PAE enabled machine the contiguous physical area could be large and it can cause trimming down variables in below calculation in read_vmcore() and mmap_vmcore(): tsz = min_t(size_t, m->offset + m->size - *fpos, buflen); That is, the types being used is like below on i686: m->offset: unsigned long long int m->size: unsigned long long int *fpos: loff_t (long long int) buflen: size_t (unsigned int) So casting (m->offset + m->size - *fpos) by size_t means truncating a given value by 4GB. Suppose (m->offset + m->size - *fpos) being truncated to 0, buflen >0 then we will get tsz = 0. It is of course not an expected result. Similarly we could also get other truncated values less than buflen. Then the real size passed down is not correct any more. If (m->offset + m->size - *fpos) is above 4GB, read_vmcore or mmap_vmcore use the min_t result with truncated values being compared to buflen. Then, fpos proceeds with the wrong value so that we reach below bugs: 1) read_vmcore will refuse to continue so makedumpfile fails. 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range(). Use unsigned long long in min_t instead so that the variables in are not truncated. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dave Young <dyoung@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Cc: Minfei Huang <mhuang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Minfei Huang authored
It is not elegant that prompt shell does not start from new line after executing "cat /proc/$pid/wchan". Make prompt shell start from new line. Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Eric Engestrom authored
`proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is enabled. Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
John Stultz authored
This patch provides a proc/PID/timerslack_ns interface which exposes a task's timerslack value in nanoseconds and allows it to be changed. This allows power/performance management software to set timer slack for other threads according to its policy for the thread (such as when the thread is designated foreground vs. background activity) If the value written is non-zero, slack is set to that value. Otherwise sets it to the default for the thread. This interface checks that the calling task has permissions to to use PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure arbitrary apps do not change the timer slack for other apps. Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
John Stultz authored
This patchset introduces a /proc/<pid>/timerslack_ns interface which would allow controlling processes to be able to set the timerslack value on other processes in order to save power by avoiding wakeups (Something Android currently does via out-of-tree patches). The first patch tries to fix the internal timer_slack_ns usage which was defined as a long, which limits the slack range to ~4 seconds on 32bit systems. It converts it to a u64, which provides the same basically unlimited slack (500 years) on both 32bit and 64bit machines. The second patch introduces the /proc/<pid>/timerslack_ns interface which allows the full 64bit slack range for a task to be read or set on both 32bit and 64bit machines. With these two patches, on a 32bit machine, after setting the slack on bash to 10 seconds: $ time sleep 1 real 0m10.747s user 0m0.001s sys 0m0.005s The first patch is a little ugly, since I had to chase the slack delta arguments through a number of functions converting them to u64s. Let me know if it makes sense to break that up more or not. Other than that things are fairly straightforward. This patch (of 2): The timer_slack_ns value in the task struct is currently a unsigned long. This means that on 32bit applications, the maximum slack is just over 4 seconds. However, on 64bit machines, its much much larger (~500 years). This disparity could make application development a little (as well as the default_slack) to a u64. This means both 32bit and 64bit systems have the same effective internal slack range. Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify the interface as a unsigned long, so we preserve that limitation on 32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is actually larger then what can be stored by an unsigned long. This patch also modifies hrtimer functions which specified the slack delta as a unsigned long. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Kees Cook <keescook@chromium.org> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tetsuo Handa authored
After the OOM killer is disabled during suspend operation, any !__GFP_NOFAIL && __GFP_FS allocations are forced to fail. Thus, any !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tetsuo Handa authored
While oom_killer_disable() is called by freeze_processes() after all user threads except the current thread are frozen, it is possible that kernel threads invoke the OOM killer and sends SIGKILL to the current thread due to sharing the thawed victim's memory. Therefore, checking for SIGKILL is preferable than TIF_MEMDIE. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sergey Senozhatsky authored
Add a new column to pool stats, which will tell how many pages ideally can be freed by class compaction, so it will be easier to analyze zsmalloc fragmentation. At the moment, we have only numbers of FULL and ALMOST_EMPTY classes, but they don't tell us how badly the class is fragmented internally. The new /sys/kernel/debug/zsmalloc/zramX/classes output look as follows: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable [..] 12 224 0 2 146 5 8 4 4 13 240 0 0 0 0 0 1 0 14 256 1 13 1840 1672 115 1 10 15 272 0 0 0 0 0 1 0 [..] 49 816 0 3 745 735 149 1 2 51 848 3 4 361 306 76 4 8 52 864 12 14 378 268 81 3 21 54 896 1 12 117 57 26 2 12 57 944 0 0 0 0 0 3 0 [..] Total 26 131 12709 10994 1071 134 For example, from this particular output we can easily conclude that class-896 is heavily fragmented -- it occupies 26 pages, 12 can be freed by compaction. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
YiPing Xu authored
When unmapping a huge class page in zs_unmap_object, the page will be unmapped by kmap_atomic. the "!area->huge" branch in __zs_unmap_object is alway true, and no code set "area->huge" now, so we can drop it. Signed-off-by: YiPing Xu <xuyiping@huawei.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Shawn Lin authored
We have PAGE_ALIGNED() in mm.h, so let's use it instead of IS_ALIGNED() for checking PAGE_SIZE aligned case. Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Davydov authored
mem_cgroup_print_oom_info is always called under oom_lock, so oom_info_lock is redundant. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
uncharge_list() does an unusual list walk because the function can take regular lists with dedicated list_heads as well as singleton lists where a single page is passed via the page->lru list node. This can sometimes lead to confusion as well as suggestions to replace the loop with a list_for_each_entry(), which wouldn't work. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
Setting the original memory.limit_in_bytes hardlimit is subject to a race condition when the desired value is below the current usage. The code tries a few times to first reclaim and then see if the usage has dropped to where we would like it to be, but there is no locking, and the workload is free to continue making new charges up to the old limit. Thus, attempting to shrink a workload relies on pure luck and hope that the workload happens to cooperate. To fix this in the cgroup2 memory.max knob, do it the other way round: set the limit first, then try enforcement. And if reclaim is not able to succeed, trigger OOM kills in the group. Keep going until the new limit is met, we run out of OOM victims and there's only unreclaimable memory left, or the task writing to memory.max is killed. This allows users to shrink groups reliably, and the behavior is consistent with what happens when new charges are attempted in excess of memory.max. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
When setting memory.high below usage, nothing happens until the next charge comes along, and then it will only reclaim its own charge and not the now potentially huge excess of the new memory.high. This can cause groups to stay in excess of their memory.high indefinitely. To fix that, when shrinking memory.high, kick off a reclaim cycle that goes after the delta. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
I found that page-types is very slow and my testing shows many timeout errors. Here's an example with a simple program allocating 1000 thps. $ time ./page-types -p $(pgrep -f test_alloc) ... real 0m17.201s user 0m16.889s sys 0m0.312s Most of time is spent in memset(). Currently memset() clears over whole buffer for every walk_pfn() call, which is inefficient when walk_pfn() is called from walk_vma(), because in that case walk_pfn() is called for each pfn. So this patch limits the zero initialization only for the first element. $ time ./page-types.patched -p $(pgrep -f test_alloc) ... real 0m0.182s user 0m0.046s sys 0m0.135s Fixes: 954e95584579 ("tools/vm/page-types.c: add memory cgroup dumping and filtering") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zhang authored
Parallel initialisation has been enabled for X86, boot time is improved greatly. On Power8, it is improved greatly for small memory. Here is the result from my test on Power8 platform: For 4GB of memory, boot time is improved by 59%, from 24.5s to 10s. For 50GB memory, boot time is improved by 22%, from 56.8s to 43.8s. Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zhang authored
Upstream has supported page parallel initialisation for X86 and the boot time is improved greately. Some tests have been done for Power. Here is the result I have done with different memory size. * 4GB memory: boot time is as the following: with patch vs without patch: 10.4s vs 24.5s boot time is improved 57% * 200GB memory: boot time looks the same with and without patches. boot time is about 38s * 32TB memory: boot time looks the same with and without patches boot time is about 160s. The boot time is much shorter than X86 with 24TB memory. From community discussion, it costs about 694s for X86 24T system. Parallel initialisation improves the performance by deferring memory initilisation to kswap with N kthreads, it should improve the performance therotically. In testing on X86, performance is improved greatly with huge memory. But on Power platform, it is improved greatly with less than 100GB memory. For huge memory, it is not improved greatly. But it saves the time with several threads at least, as the following information shows(32TB system log): [ 22.648169] node 9 initialised, 16607461 pages in 280ms [ 22.783772] node 3 initialised, 23937243 pages in 410ms [ 22.858877] node 6 initialised, 29179347 pages in 490ms [ 22.863252] node 2 initialised, 29179347 pages in 490ms [ 22.907545] node 0 initialised, 32049614 pages in 540ms [ 22.920891] node 15 initialised, 32212280 pages in 550ms [ 22.923236] node 4 initialised, 32306127 pages in 550ms [ 22.923384] node 12 initialised, 32314319 pages in 550ms [ 22.924754] node 8 initialised, 32314319 pages in 550ms [ 22.940780] node 13 initialised, 33353677 pages in 570ms [ 22.940796] node 11 initialised, 33353677 pages in 570ms [ 22.941700] node 5 initialised, 33353677 pages in 570ms [ 22.941721] node 10 initialised, 33353677 pages in 570ms [ 22.941876] node 7 initialised, 33353677 pages in 570ms [ 22.944946] node 14 initialised, 33353677 pages in 570ms [ 22.946063] node 1 initialised, 33345485 pages in 580ms It saves the time about 550*16 ms at least, although it can be ignore to compare the boot time about 160 seconds. What's more, the boot time is much shorter on Power even without patches than x86 for huge memory machine. So this patchset is still necessary to be enabled for Power. This patch (of 2): This patch is based on Mel Gorman's old patch in the mailing list, https://lkml.org/lkml/2015/5/5/280 which is discussed but it is fixed with a completion to wait for all memory initialised in page_alloc_init_late(). It is to fix the OOM problem on X86 with 24TB memory which allocates memory in late initialisation. But for Power platform with 32TB memory, it causes a call trace in vfs_caches_init->inode_init() and inode hash table needs more memory. So this patch allocates 1GB for 0.25TB/node for large system as it is mentioned in https://lkml.org/lkml/2015/5/1/627 This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes. Currently, it only allocates 2GB*16=32GB for early initialisation. But Dentry cache hash table needes 16GB and Inode cache hash table needs 16GB. So the system have no enough memory for it. The log from dmesg as the following: Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes) vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes swapper/0: page allocation failure: order:0,mode:0x2080020 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64 Call Trace: .dump_stack+0xb4/0xb664 (unreliable) .warn_alloc_failed+0x114/0x160 .__vmalloc_area_node+0x1a4/0x2b0 .__vmalloc_node_range+0xe4/0x110 .__vmalloc_node+0x40/0x50 .alloc_large_system_hash+0x134/0x2a4 .inode_init+0xa4/0xf0 .vfs_caches_init+0x80/0x144 .start_kernel+0x40c/0x4e0 start_here_common+0x20/0x4a4 Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
split_huge_pmd() tries to munlock page with munlock_vma_page(). That requires the page to locked. If the is locked by caller, we would get a deadlock: Unable to find swap-space signature INFO: task trinity-c85:1907 blocked for more than 120 seconds. Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. trinity-c85 D ffff88084d997608 0 1907 309 0x00000000 Call Trace: schedule+0x9f/0x1c0 schedule_timeout+0x48e/0x600 io_schedule_timeout+0x1c3/0x390 bit_wait_io+0x29/0xd0 __wait_on_bit_lock+0x94/0x140 __lock_page+0x1d4/0x280 __split_huge_pmd+0x5a8/0x10f0 split_huge_pmd_address+0x1d9/0x230 try_to_unmap_one+0x540/0xc70 rmap_walk_anon+0x284/0x810 rmap_walk_locked+0x11e/0x190 try_to_unmap+0x1b1/0x4b0 split_huge_page_to_list+0x49d/0x18a0 follow_page_mask+0xa36/0xea0 SyS_move_pages+0xaf3/0x1570 entry_SYSCALL_64_fastpath+0x12/0x6b 2 locks held by trinity-c85/1907: #0: (&mm->mmap_sem){++++++}, at: SyS_move_pages+0x933/0x1570 #1: (&anon_vma->rwsem){++++..}, at: split_huge_page_to_list+0x402/0x18a0 I don't think the deadlock is triggerable without split_huge_page() simplifilcation patchset. But munlock_vma_page() here is wrong: we want to munlock the page unconditionally, no need in rmap lookup, that munlock_vma_page() does. Let's use clear_page_mlock() instead. It can be called under ptl. Fixes: e90309c9 ("thp: allow mlocked THP again") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
freeze_page() and unfreeze_page() helpers evolved in rather complex beasts. It would be nice to cut complexity of this code. This patch rewrites freeze_page() using standard try_to_unmap(). unfreeze_page() is rewritten with remove_migration_ptes(). The result is much simpler. But the new variant is somewhat slower for PTE-mapped THPs. Current helpers iterates over VMAs the compound page is mapped to, and then over ptes within this VMA. New helpers iterates over small page, then over VMA the small page mapped to, and only then find relevant pte. We have short cut for PMD-mapped THP: we directly install migration entries on PMD split. I don't think the slowdown is critical, considering how much simpler result is and that split_huge_page() is quite rare nowadays. It only happens due memory pressure or migration. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
Make remove_migration_ptes() available to be used in split_huge_page(). New parameter 'locked' added: as with try_to_umap() we need a way to indicate that caller holds rmap lock. We also shouldn't try to mlock() pte-mapped huge pages: pte-mapeed THP pages are never mlocked. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
Add support for two ttu_flags: - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to unmap page; - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock; Also, change rwc->done to !page_mapcount() instead of !page_mapped(). try_to_unmap() works on pte level, so we are really interested in the mappedness of this small page rather than of the compound page it's a part of. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
This patchset rewrites freeze_page() and unfreeze_page() using try_to_unmap() and remove_migration_ptes(). Result is much simpler, but somewhat slower. Migration 8GiB worth of PMD-mapped THP: Baseline 20.21 +/- 0.393 Patched 20.73 +/- 0.082 Slowdown 1.03x It's 3% slower, comparing to 14% in v1. I don't it should be a stopper. Splitting of PTE-mapped pages slowed more. But this is not a common case. Migration 8GiB worth of PMD-mapped THP: Baseline 20.39 +/- 0.225 Patched 22.43 +/- 0.496 Slowdown 1.10x rmap_walk_locked() is the same as rmap_walk(), but the caller takes care of the relevant rmap lock. This is preparation for switching THP splitting from custom rmap walk in freeze_page()/unfreeze_page() to the generic one. There is no support for KSM pages for now: not clear which lock is implied. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dan Williams authored
The primary use case for devm_memremap_pages() is to allocate an memmap array from persistent memory. That capabilty requires vmem_altmap which requires SPARSEMEM_VMEMMAP. Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands ZONES_WIDTH and triggers the: "Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid." ...warning in mm/memory.c. SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not a configuration we should worry about supporting. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
The define has a comment from Nick Piggin from 2007: /* For backwards compat. Remove me quickly. */ I guess 9 years should not be too hurried sense of 'quickly' even for kernel measures. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joe Perches authored
Use the normal mechanism to make the logging output consistently "percpu:" instead of a mix of "PERCPU:" and "percpu:" Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joe Perches authored
Most of the mm subsystem uses pr_<level> so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joe Perches authored
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joe Perches authored
There are a mixture of pr_warning and pr_warn uses in mm. Use pr_warn consistently. Miscellanea: - Coalesce formats - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dan Williams authored
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new mm zones that are bumping up against the current maximum limit of 4 zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE. The GFP_ZONE_TABLE poses an interesting constraint since include/linux/gfp.h gets included by the 32-bit portion of a 64-bit build. We need to be careful to only build the table for zones that have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this purpose. This patch does not attempt to solve the problem of adding a new zone that also has a corresponding GFP_ flag. Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero. In other words even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to consume another bit in page->flags (expand ZONES_WIDTH) with room to spare. Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931 Fixes: 033fbae9 ("mm: ZONE_DEVICE for "device memory"") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Mark <markk@clara.co.uk> Reported-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Davydov authored
- Do not take memcg_limit_mutex for resetting limits - the cgroup cannot be altered from userspace anymore, so no need to protect them. - Use plain page_counter_limit() for resetting ->memory and ->memsw limits instead of mem_cgrouop_resize_* helpers - we enlarge the limits, so no need in special handling. - Reset ->swap and ->tcpmem limits as well. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Chen Yucong authored
online_pages() simply returns an error value if memory_notify(MEM_GOING_ONLINE, &arg) return a value that is not what we want for successfully onlining target pages. This patch arms to print more failure information like offline_pages() in online_pages. This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and moves __offline_pages() to not print failure information with KERN_INFO according to David Rientjes's suggestion[1]. [1] https://lkml.org/lkml/2016/2/24/1094Signed-off-by: Chen Yucong <slaoub@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
Commit 64775719 ("mm: clarify __GFP_NOFAIL deprecation status") was incomplete and didn't remove the comment about __GFP_NOFAIL being deprecated in buffered_rmqueue. Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1 because we should really discourage from using __GFP_NOFAIL with higher order allocations because those are just too subtle. Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
CMA allocation should be guaranteed to succeed by definition, but, unfortunately, it would be failed sometimes. It is hard to track down the problem, because it is related to page reference manipulation and we don't have any facility to analyze it. This patch adds tracepoints to track down page reference manipulation. With it, we can find exact reason of failure and can fix the problem. Following is an example of tracepoint output. (note: this example is stale version that printing flags as the number. Recent version will print it as human readable string.) <...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1 <...>-9018 [004] 92.678378: kernel_stack: => get_page_from_freelist (ffffffff81176659) => __alloc_pages_nodemask (ffffffff81176d22) => alloc_pages_vma (ffffffff811bf675) => handle_mm_fault (ffffffff8119e693) => __do_page_fault (ffffffff810631ea) => trace_do_page_fault (ffffffff81063543) => do_async_page_fault (ffffffff8105c40a) => async_page_fault (ffffffff817581d8) [snip] <...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1 [snip] ... ... <...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail [snip] <...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1 => release_pages (ffffffff8117c9e4) => free_pages_and_swap_cache (ffffffff811b0697) => tlb_flush_mmu_free (ffffffff81199616) => tlb_finish_mmu (ffffffff8119a62c) => exit_mmap (ffffffff811a53f7) => mmput (ffffffff81073f47) => do_exit (ffffffff810794e9) => do_group_exit (ffffffff81079def) => SyS_exit_group (ffffffff81079e74) => entry_SYSCALL_64_fastpath (ffffffff817560b6) This output shows that problem comes from exit path. In exit path, to improve performance, pages are not freed immediately. They are gathered and processed by batch. During this process, migration cannot be possible and CMA allocation is failed. This problem is hard to find without this page reference tracepoint facility. Enabling this feature bloat kernel text 30 KB in my configuration. text data bss dec hex filename 12127327 2243616 1507328 15878271 f2487f vmlinux_disabled 12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabled Note that, due to header file dependency problem between mm.h and tracepoint.h, this feature has to open code the static key functions for tracepoints. Proposed by Steven Rostedt in following link. https://lkml.org/lkml/2015/12/9/699 [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()] [iamjoonsoo.kim@lge.com: fix build failure for xtensa] [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
The success of CMA allocation largely depends on the success of migration and key factor of it is page reference count. Until now, page reference is manipulated by direct calling atomic functions so we cannot follow up who and where manipulate it. Then, it is hard to find actual reason of CMA allocation failure. CMA allocation should be guaranteed to succeed so finding offending place is really important. In this patch, call sites where page reference is manipulated are converted to introduced wrapper function. This is preparation step to add tracepoint to each page reference manipulation function. With this facility, we can easily find reason of CMA allocation failure. There is no functional change in this patch. In addition, this patch also converts reference read sites. It will help a second step that renames page._count to something else and prevents later attempt to direct access to it (Suggested by Andrew). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-