- 30 Sep, 2016 16 commits
-
-
Peter Zijlstra authored
Now that the ia64 only set_curr_task() symbol is gone, provide a helper just like put_prev_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Rename the ia64 only set_curr_task() function to free up the name. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Vincent Guittot authored
When a task switches to fair scheduling class, the period between now and the last update of its utilization is accounted as running time whatever happened during this period. This incorrect accounting applies to the task and also to the task group branch. When changing the property of a running task like its list of allowed CPUs or its scheduling class, we follow the sequence: - dequeue task - put task - change the property - set task as current task - enqueue task The end of the sequence doesn't follow the normal sequence (as per __schedule()) which is: - enqueue a task - then set the task as current task. This incorrectordering is the root cause of incorrect utilization accounting. Update the sequence to follow the right one: - dequeue task - put task - change the property - enqueue task - set task as current task Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: linaro-kernel@lists.linaro.org Cc: pjt@google.com Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1473666472-13749-8-git-send-email-vincent.guittot@linaro.orgSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Avoid pointless SCHED_SMT code when running on !SMT hardware. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
select_idle_siblings() is a known pain point for a number of workloads; it either does too much or not enough and sometimes just does plain wrong. This rewrite attempts to address a number of issues (but sadly not all). The current code does an unconditional sched_domain iteration; with the intent of finding an idle core (on SMT hardware). The problems which this patch tries to address are: - its pointless to look for idle cores if the machine is real busy; at which point you're just wasting cycles. - it's behaviour is inconsistent between SMT and !SMT hardware in that !SMT hardware ends up doing a scan for any idle CPU in the LLC domain, while SMT hardware does a scan for idle cores and if that fails, falls back to a scan for idle threads on the 'target' core. The new code replaces the sched_domain scan with 3 explicit scans: 1) search for an idle core in the LLC 2) search for an idle CPU in the LLC 3) search for an idle thread in the 'target' core where 1 and 3 are conditional on SMT support and 1 and 2 have runtime heuristics to skip the step. Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT siblings of the CPU going idle. Similarly, we clear sd_llc_shared->has_idle_cores when we fail to find an idle core. Step 2) tracks the average cost of the scan and compares this to the average idle time guestimate for the CPU doing the wakeup. There is a significant fudge factor involved to deal with the variability of the averages. Esp. hackbench was sensitive to this. Step 3) is unconditional; we assume (also per step 1) that scanning all SMT siblings in a core is 'cheap'. With this; SMT systems gain step 2, which cures a few benchmarks -- notably one from Facebook. One 'feature' of the sched_domain iteration, which we preserve in the new code, is that it would start scanning from the 'target' CPU, instead of scanning the cpumask in cpu id order. This avoids multiple CPUs in the LLC scanning for idle to gang up and find the same CPU quite as much. The down side is that tasks can end up hopping across the LLC for no apparent reason. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc location into the much more natural sched_domain_shared location. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Since struct sched_domain is strictly per cpu; introduce a structure that is shared between all 'identical' sched_domains. Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it for shared cache state; if another use comes up later we can easily relax this. While the sched_group's are normally shared between CPUs, these are not natural to use when we need some shared state on a domain level -- since that would require the domain to have a parent, which is not a given. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
There is no point in doing a call_rcu() for each domain, only do a callback for the root sched domain and clean up the entire set in one go. Also make the entire call chain be called destroy_sched_domain*() to remove confusion with the free_sched_domains() call, which does an entirely different thing. Both cpu_attach_domain() callers of destroy_sched_domain() can live without the call_rcu() because at those points the sched_domain hasn't been published yet. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Small cleanup; nothing uses the @cpu argument so make it go away. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
The partial initialization of wait_queue_t in prepare_to_wait_event() looks ugly. This was done to shrink .text, but we can simply add the new helper which does the full initialization and shrink the compiled code a bit more. And. This way prepare_to_wait_event() can have more users. In particular we are ready to remove the signal_pending_state() checks from wait_bit_action_f helpers and change __wait_on_bit_lock() to use prepare_to_wait_event(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160906140055.GA6167@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
__wait_on_bit_lock() doesn't need abort_exclusive_wait() too. Right now it can't use prepare_to_wait_event() (see the next change), but it can do the additional finish_wait() if action() fails. abort_exclusive_wait() no longer has callers, remove it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160906140053.GA6164@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
___wait_event() doesn't really need abort_exclusive_wait(), we can simply change prepare_to_wait_event() to remove the waiter from q->task_list if it was interrupted. This simplifies the code/logic, and this way prepare_to_wait_event() can have more users, see the next change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160908164815.GA18801@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org> -- include/linux/wait.h | 7 +------ kernel/sched/wait.c | 35 +++++++++++++++++++++++++---------- 2 files changed, 26 insertions(+), 16 deletions(-)
-
Oleg Nesterov authored
Otherwise this logic only works if mode is "compatible" with another exclusive waiter. If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters, abort_exclusive_wait() won't wait an uninterruptible waiter. The main user is __wait_on_bit_lock() and currently it is fine but only because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have lock_page_interruptible() yet. Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait(). Yes, this means that (say) wake_up_interruptible() can wake up the non- interruptible waiter(s), but I think this is fine. And in fact I think that abort_exclusive_wait() must die, see the next change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Dietmar Eggemann authored
Since commit: 2159197d ("sched/core: Enable increased load resolution on 64-bit kernels") we now have two different fixed point units for load: - 'shares' in calc_cfs_shares() has 20 bit fixed point unit on 64-bit kernels. Therefore use scale_load() on MIN_SHARES. - 'wl' in effective_load() has 10 bit fixed point unit. Therefore use scale_load_down() on tg->shares which has 20 bit fixed point unit on 64-bit kernels. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1471874441-24701-1-git-send-email-dietmar.eggemann@arm.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Tim Chen authored
Current code can call set_cpu_sibling_map() and invoke sched_set_topology() more than once (e.g. on CPU hot plug). When this happens after sched_init_smp() has been called, we lose the NUMA topology extension to sched_domain_topology in sched_init_numa(). This results in incorrect topology when the sched domain is rebuilt. This patch fixes the bug and issues warning if we call sched_set_topology() after sched_init_smp(). Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/1474485552-141429-2-git-send-email-srinivas.pandruvada@linux.intel.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Ingo Molnar authored
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 28 Sep, 2016 7 commits
-
-
Linus Torvalds authored
Merge fixes from Andrew Morton: "4 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mem-hotplug: use nodes that contain memory as mask in new_node_page() scripts/recordmcount.c: account for .softirqentry.text dma-mapping.h: preserve unmap info for CONFIG_DMA_API_DEBUG mm,ksm: fix endless looping in allocating memory when ksm enable
-
Li Zhong authored
9bb627be ("mem-hotplug: don't clear the only node in new_node_page()") prevents allocating from an empty nodemask, but as David points out, it is still wrong. As node_online_map may include memoryless nodes, only allocating from these nodes is meaningless. This patch uses node_states[N_MEMORY] mask to prevent the above case. Fixes: 9bb627be ("mem-hotplug: don't clear the only node in new_node_page()") Fixes: 394e31d2 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline") Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: John Allen <jallen@linux.vnet.ibm.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dmitry Vyukov authored
be7635e7 ("arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections") added .softirqentry.text section, but it was not added to recordmcount. So functions in the section are untracable. Add the section to scripts/recordmcount.c and scripts/recordmcount.pl. Fixes: be7635e7 ("arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections") Link: http://lkml.kernel.org/r/1474902626-73468-1-git-send-email-dvyukov@google.comSigned-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Steve Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Smirnov authored
When CONFIG_DMA_API_DEBUG is enabled we need to preserve unmapping address even if "unmap" is a no-op for our architecutre because we need debug_dma_unmap_page() to correctly cleanup all of the debug bookkeeping. Failing to do so results in a false positive warnings about previously mapped areas never being unmapped. Link: http://lkml.kernel.org/r/1474387125-3713-1-git-send-email-andrew.smirnov@gmail.comSigned-off-by: Andrey Smirnov <andrew.smirnov@gmail.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: "Luis R. Rodriguez" <mcgrof@suse.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Geliang Tang <geliangtang@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
zhong jiang authored
I hit the following hung task when runing a OOM LTP test case with 4.1 kernel. Call trace: [<ffffffc000086a88>] __switch_to+0x74/0x8c [<ffffffc000a1bae0>] __schedule+0x23c/0x7bc [<ffffffc000a1c09c>] schedule+0x3c/0x94 [<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350 [<ffffffc000a1e32c>] down_write+0x64/0x80 [<ffffffc00021f794>] __ksm_exit+0x90/0x19c [<ffffffc0000be650>] mmput+0x118/0x11c [<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74 [<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4 [<ffffffc0000d0f34>] get_signal+0x444/0x5e0 [<ffffffc000089fcc>] do_signal+0x1d8/0x450 [<ffffffc00008a35c>] do_notify_resume+0x70/0x78 The oom victim cannot terminate because it needs to take mmap_sem for write while the lock is held by ksmd for read which loops in the page allocator ksm_do_scan scan_get_next_rmap_item down_read get_next_rmap_item alloc_rmap_item #ksmd will loop permanently. There is no way forward because the oom victim cannot release any memory in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve this problem because it would release the memory asynchronously. Nevertheless we can relax alloc_rmap_item requirements and use __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan would just retry later after the lock got dropped. Such a patch would be also easy to backport to older stable kernels which do not have oom_reaper. While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed by the allocation failure. Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.comSigned-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Hugh Dickins <hughd@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://git.infradead.org/linux-mtdLinus Torvalds authored
Pull late MTD fixes from Brian Norris: "Another round of MTD fixes for v4.8 My apologies for sending this so late. I've been fairly absent as a maintainer this cycle, but I did queue these up weeks ago. In the meantime, Richard was able to handle some other fixes (thanks!) but didn't pick these up. On the bright side, these are very simple changes that should carry little risk. Summary: - Davinci NAND: fix a long-standing bug in how we clear/prep 4-bit ECC - OMAP NAND: an error-handling fix that made it into v4.8-rc1 caused error-handling cases in other configurations/code-paths; this fixes the fix" * tag 'for-linus-20160928' of git://git.infradead.org/linux-mtd: mtd: nand: davinci: Reinitialize the HW ECC engine in 4bit hwctl mtd: nand: omap2: Don't call dma_release_channel() if dma_request_chan() failed
-
Mark Fasheh authored
I will be starting employment at Versity next week and would like to update my MAINTAINERS e-mail to reflect that change. My versity e-mail is already activated so I shouldn't get any bounces on the new one. My ability to help with Ocfs2 kernel maintenance won't change as a result of the new job. Signed-off-by: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 27 Sep, 2016 1 commit
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds authored
Pull cgroup fixes from Tejun Heo: "Three late fixes for cgroup: Two cpuset ones, one trivial and the other pretty obscure, and a cgroup core fix for a bug which impacts cgroup v2 namespace users" * 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix invalid controller enable rejections with cgroup namespace cpuset: fix non static symbol warning cpuset: handle race between CPU hotplug and cpuset_hotplug_work
-
- 26 Sep, 2016 3 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds authored
Pull tracefs fixes from Steven Rostedt: "Al Viro has been looking at the tracefs code, and has pointed out some issues. This contains one fix by me and one by Al. I'm sure that he'll come up with more but for now I tested these patches and they don't appear to have any negative impact on tracing" * tag 'trace-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: fix memory leaks in tracing_buffers_splice_read() tracing: Move mutex to protect against resetting of seq data
-
Dave Chinner authored
When building XFS with -Werror, it now fails with: include/linux/pagemap.h: In function 'fault_in_multipages_readable': include/linux/pagemap.h:602:16: error: variable 'c' set but not used [-Werror=unused-but-set-variable] volatile char c; ^ This is a regression caused by commit e23d4159 ("fix fault_in_multipages_...() on architectures with no-op access_ok()"). Fix it by re-adding the "(void)c" trick taht was previously used to make the compiler think the variable is used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 25 Sep, 2016 7 commits
-
-
Lorenzo Stoakes authored
The NUMA balancing logic uses an arch-specific PROT_NONE page table flag defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page PMDs respectively as requiring balancing upon a subsequent page fault. User-defined PROT_NONE memory regions which also have this flag set will not normally invoke the NUMA balancing code as do_page_fault() will send a segfault to the process before handle_mm_fault() is even called. However if access_remote_vm() is invoked to access a PROT_NONE region of memory, handle_mm_fault() is called via faultin_page() and __get_user_pages() without any access checks being performed, meaning the NUMA balancing logic is incorrectly invoked on a non-NUMA memory region. A simple means of triggering this problem is to access PROT_NONE mmap'd memory using /proc/self/mem which reliably results in the NUMA handling functions being invoked when CONFIG_NUMA_BALANCING is set. This issue was reported in bugzilla (issue 99101) which includes some simple repro code. There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page() added at commit c0e7cad9 to avoid accidentally provoking strange behaviour by attempting to apply NUMA balancing to pages that are in fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro. This patch moves the PROT_NONE check into mm/memory.c rather than invoking BUG_ON() as faulting in these pages via faultin_page() is a valid reason for reaching the NUMA check with the PROT_NONE page table flag set and is therefore not always a bug. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101Reported-by: Trevor Saunders <tbsaunde@tbsaunde.org> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://git.linux-mips.org/pub/scm/ralf/upstream-linusLinus Torvalds authored
Pull MIPS fixes from Ralf Baechle: "A round of 4.8 fixes: MIPS generic code: - Add a missing ".set pop" in an early commit - Fix memory regions reaching top of physical - MAAR: Fix address alignment - vDSO: Fix Malta EVA mapping to vDSO page structs - uprobes: fix incorrect uprobe brk handling - uprobes: select HAVE_REGS_AND_STACK_ACCESS_API - Avoid a BUG warning during PR_SET_FP_MODE prctl - SMP: Fix possibility of deadlock when bringing CPUs online - R6: Remove compact branch policy Kconfig entries - Fix size calc when avoiding IPIs for small icache flushes - Fix pre-r6 emulation FPU initialisation - Fix delay slot emulation count in debugfs ATH79: - Fix test for error return of clk_register_fixed_factor. Octeon: - Fix kernel header to work for VDSO build. - Fix initialization of platform device probing. paravirt: - Fix undefined reference to smp_bootstrap" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: MIPS: Fix delay slot emulation count in debugfs MIPS: SMP: Fix possibility of deadlock when bringing CPUs online MIPS: Fix pre-r6 emulation FPU initialisation MIPS: vDSO: Fix Malta EVA mapping to vDSO page structs MIPS: Select HAVE_REGS_AND_STACK_ACCESS_API MIPS: Octeon: Fix platform bus probing MIPS: Octeon: mangle-port: fix build failure with VDSO code MIPS: Avoid a BUG warning during prctl(PR_SET_FP_MODE, ...) MIPS: c-r4k: Fix size calc when avoiding IPIs for small icache flushes MIPS: Add a missing ".set pop" in an early commit MIPS: paravirt: Fix undefined reference to smp_bootstrap MIPS: Remove compact branch policy Kconfig entries MIPS: MAAR: Fix address alignment MIPS: Fix memory regions reaching top of physical MIPS: uprobes: fix incorrect uprobe brk handling MIPS: ath79: Fix test for error return of clk_register_fixed_factor().
-
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds authored
Pull one more powerpc fix from Michael Ellerman: "powernv/pci: Fix m64 checks for SR-IOV and window alignment from Russell Currey" * tag 'powerpc-4.8-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/powernv/pci: Fix m64 checks for SR-IOV and window alignment
-
Linus Torvalds authored
The fixes to the radix tree test suite show that the multi-order case is broken. The basic reason is that the radix tree code uses tagged pointers with the "internal" bit in the low bits, and calculating the pointer indices was supposed to mask off those bits. But gcc will notice that we then use the index to re-create the pointer, and will avoid doing the arithmetic and use the tagged pointer directly. This cleans the code up, using the existing is_sibling_entry() helper to validate the sibling pointer range (instead of open-coding it), and using entry_to_node() to mask off the low tag bit from the pointer. And once you do that, you might as well just use the now cleaned-up pointer directly. [ Side note: the multi-order code isn't actually ever used in the kernel right now, and the only reason I didn't just delete all that code is that Kirill Shutemov piped up and said: "Well, my ext4-with-huge-pages patchset[1] uses multi-order entries. It also converts shmem-with-huge-pages and hugetlb to them. I'm okay with converting it to other mechanism, but I need something. (I looked into Konstantin's RFC patchset[2]. It looks okay, but I don't feel myself qualified to review it as I don't know much about radix-tree internals.)" [1] http://lkml.kernel.org/r/20160915115523.29737-1-kirill.shutemov@linux.intel.com [2] http://lkml.kernel.org/r/147230727479.9957.1087787722571077339.stgit@zurg ] Reported-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Cedric Blancher <cedric.blancher@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox authored
When we replace a multiorder entry, check that all indices reflect the new value. Also, compile the test suite with -O2, which shows other problems with the code due to some dodgy pointer operations in the radix tree code. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Al Viro authored
Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Steven Rostedt (Red Hat) authored
The iter->seq can be reset outside the protection of the mutex. So can reading of user data. Move the mutex up to the beginning of the function. Fixes: d7350c3f ("tracing/core: make the read callbacks reentrants") Cc: stable@vger.kernel.org # 2.6.30+ Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-
- 24 Sep, 2016 6 commits
-
-
Paul Burton authored
Commit 432c6bac ("MIPS: Use per-mm page to execute branch delay slot instructions") accidentally removed use of the MIPS_FPU_EMU_INC_STATS macro from do_dsemulret, leading to the ds_emul file in debugfs always returning zero even though we perform delay slot emulations. Fix this by re-adding the use of the MIPS_FPU_EMU_INC_STATS macro. Signed-off-by: Paul Burton <paul.burton@imgtec.com> Fixes: 432c6bac ("MIPS: Use per-mm page to execute branch delay slot instructions") Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/14301/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
-
Matt Redfearn authored
This patch fixes the possibility of a deadlock when bringing up secondary CPUs. The deadlock occurs because the set_cpu_online() is called before synchronise_count_slave(). This can cause a deadlock if the boot CPU, having scheduled another thread, attempts to send an IPI to the secondary CPU, which it sees has been marked online. The secondary is blocked in synchronise_count_slave() waiting for the boot CPU to enter synchronise_count_master(), but the boot cpu is blocked in smp_call_function_many() waiting for the secondary to respond to it's IPI request. Fix this by marking the CPU online in cpu_callin_map and synchronising counters before declaring the CPU online and calculating the maps for IPIs. Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com> Reported-by: Justin Chen <justinpopo6@gmail.com> Tested-by: Justin Chen <justinpopo6@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: stable@vger.kernel.org # v4.1+ Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/14302/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull perf fixes from Thomas Gleixner: "Three fixlets for perf: - add a missing NULL pointer check in the intel BTS driver - make BTS an exclusive PMU because BTS can only handle one event at a time - ensure that exclusive events are limited to one PMU so that several exclusive events can be scheduled on different PMU instances" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Limit matching exclusive events to one PMU perf/x86/intel/bts: Make it an exclusive PMU perf/x86/intel/bts: Make sure debug store is valid
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull locking fixes from Thomas Gleixner: "Two smallish fixes: - use the proper asm constraint in the Super-H atomic_fetch_ops - a trivial typo fix in the Kconfig help text" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/hung_task: Fix typo in CONFIG_DETECT_HUNG_TASK help text locking/atomic, arch/sh: Fix ATOMIC_FETCH_OP()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull EFI fixes from Thomas Gleixner: "Two fixes for EFI/PAT: - a 32bit overflow bug in the PAT code which was unearthed by the large EFI mappings - prevent a boot hang on large systems when EFI mixed mode is enabled but not used" * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/efi: Only map RAM into EFI page tables if in mixed-mode x86/mm/pat: Prevent hang during boot when mapping pages
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull irq fixes from Thomas Gleixner: "Three fixes for irq core and irq chip drivers: - Do not set the irq type if type is NONE. Fixes a boot regression on various SoCs - Use the proper cpu for setting up the GIC target list. Discovered by the cpumask debugging code. - A rather large fix for the MIPS-GIC so per cpu local interrupts work again. This was discovered late because the code falls back to slower timers which use normal device interrupts" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/mips-gic: Fix local interrupts irqchip/gicv3: Silence noisy DEBUG_PER_CPU_MAPS warning genirq: Skip chained interrupt trigger setup if type is IRQ_TYPE_NONE
-