1. 28 Jun, 2019 3 commits
    • Takeshi Misawa's avatar
      tracing: Fix memory leak in tracing_err_log_open() · d122ed62
      Takeshi Misawa authored
      When tracing_err_log_open() calls seq_open(), allocated memory is not freed.
      
      kmemleak report:
      
      unreferenced object 0xffff92c0781d1100 (size 128):
        comm "tail", pid 15116, jiffies 4295163855 (age 22.704s)
        hex dump (first 32 bytes):
          00 f0 08 e5 c0 92 ff ff 00 10 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000000d0687d5>] kmem_cache_alloc+0x11f/0x1e0
          [<000000003e3039a8>] seq_open+0x2f/0x90
          [<000000008dd36b7d>] tracing_err_log_open+0x67/0x140
          [<000000005a431ae2>] do_dentry_open+0x1df/0x3a0
          [<00000000a2910603>] vfs_open+0x2f/0x40
          [<0000000038b0a383>] path_openat+0x2e8/0x1690
          [<00000000fe025bda>] do_filp_open+0x9b/0x110
          [<00000000483a5091>] do_sys_open+0x1ba/0x260
          [<00000000c558b5fd>] __x64_sys_openat+0x20/0x30
          [<000000006881ec07>] do_syscall_64+0x5a/0x130
          [<00000000571c2e94>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by calling seq_release() in tracing_err_log_fops.release().
      
      Link: http://lkml.kernel.org/r/20190628105640.GA1863@DESKTOP
      
      Fixes: 8a062902 ("tracing: Add tracing error log")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Signed-off-by: default avatarTakeshi Misawa <jeliantsurux@gmail.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      d122ed62
    • Steven Rostedt (VMware)'s avatar
      ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare() · 39611265
      Steven Rostedt (VMware) authored
      Taking the text_mutex in ftrace_arch_code_modify_prepare() is to fix a
      race against module loading and live kernel patching that might try to
      change the text permissions while ftrace has it as read/write. This
      really needs to be documented in the code. Add a comment that does such.
      
      Link: http://lkml.kernel.org/r/20190627211819.5a591f52@gandalf.local.homeSuggested-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: default avatarPetr Mladek <pmladek@suse.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      39611265
    • Petr Mladek's avatar
      ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() · d5b844a2
      Petr Mladek authored
      The commit 9f255b63 ("module: Fix livepatch/ftrace module text
      permissions race") causes a possible deadlock between register_kprobe()
      and ftrace_run_update_code() when ftrace is using stop_machine().
      
      The existing dependency chain (in reverse order) is:
      
      -> #1 (text_mutex){+.+.}:
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             __mutex_lock+0x88/0x908
             mutex_lock_nested+0x32/0x40
             register_kprobe+0x254/0x658
             init_kprobes+0x11a/0x168
             do_one_initcall+0x70/0x318
             kernel_init_freeable+0x456/0x508
             kernel_init+0x22/0x150
             ret_from_fork+0x30/0x34
             kernel_thread_starter+0x0/0xc
      
      -> #0 (cpu_hotplug_lock.rw_sem){++++}:
             check_prev_add+0x90c/0xde0
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             cpus_read_lock+0x62/0xd0
             stop_machine+0x2e/0x60
             arch_ftrace_update_code+0x2e/0x40
             ftrace_run_update_code+0x40/0xa0
             ftrace_startup+0xb2/0x168
             register_ftrace_function+0x64/0x88
             klp_patch_object+0x1a2/0x290
             klp_enable_patch+0x554/0x980
             do_one_initcall+0x70/0x318
             do_init_module+0x6e/0x250
             load_module+0x1782/0x1990
             __s390x_sys_finit_module+0xaa/0xf0
             system_call+0xd8/0x2d0
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(text_mutex);
                                     lock(cpu_hotplug_lock.rw_sem);
                                     lock(text_mutex);
        lock(cpu_hotplug_lock.rw_sem);
      
      It is similar problem that has been solved by the commit 2d1e38f5
      ("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
      To be on the safe side, text_mutex must become a low level lock taken
      after cpu_hotplug_lock.rw_sem.
      
      This can't be achieved easily with the current ftrace design.
      For example, arm calls set_all_modules_text_rw() already in
      ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
      This functions is called:
      
        + outside stop_machine() from ftrace_run_update_code()
        + without stop_machine() from ftrace_module_enable()
      
      Fortunately, the problematic fix is needed only on x86_64. It is
      the only architecture that calls set_all_modules_text_rw()
      in ftrace path and supports livepatching at the same time.
      
      Therefore it is enough to move text_mutex handling from the generic
      kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:
      
         ftrace_arch_code_modify_prepare()
         ftrace_arch_code_modify_post_process()
      
      This patch basically reverts the ftrace part of the problematic
      commit 9f255b63 ("module: Fix livepatch/ftrace module
      text permissions race"). And provides x86_64 specific-fix.
      
      Some refactoring of the ftrace code will be needed when livepatching
      is implemented for arm or nds32. These architectures call
      set_all_modules_text_rw() and use stop_machine() at the same time.
      
      Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com
      
      Fixes: 9f255b63 ("module: Fix livepatch/ftrace module text permissions race")
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reported-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Reviewed-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarPetr Mladek <pmladek@suse.com>
      [
        As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
        ftrace_run_update_code() as it is a void function.
      ]
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      d5b844a2
  2. 14 Jun, 2019 7 commits
    • Wei Li's avatar
      ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper() · 04e03d9a
      Wei Li authored
      The mapper may be NULL when called from register_ftrace_function_probe()
      with probe->data == NULL.
      
      This issue can be reproduced as follow (it may be covered by compiler
      optimization sometime):
      
      / # cat /sys/kernel/debug/tracing/set_ftrace_filter
      #### all functions enabled ####
      / # echo foo_bar:dump > /sys/kernel/debug/tracing/set_ftrace_filter
      [  206.949100] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  206.952402] Mem abort info:
      [  206.952819]   ESR = 0x96000006
      [  206.955326]   Exception class = DABT (current EL), IL = 32 bits
      [  206.955844]   SET = 0, FnV = 0
      [  206.956272]   EA = 0, S1PTW = 0
      [  206.956652] Data abort info:
      [  206.957320]   ISV = 0, ISS = 0x00000006
      [  206.959271]   CM = 0, WnR = 0
      [  206.959938] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000419f3a000
      [  206.960483] [0000000000000000] pgd=0000000411a87003, pud=0000000411a83003, pmd=0000000000000000
      [  206.964953] Internal error: Oops: 96000006 [#1] SMP
      [  206.971122] Dumping ftrace buffer:
      [  206.973677]    (ftrace buffer empty)
      [  206.975258] Modules linked in:
      [  206.976631] Process sh (pid: 281, stack limit = 0x(____ptrval____))
      [  206.978449] CPU: 10 PID: 281 Comm: sh Not tainted 5.2.0-rc1+ #17
      [  206.978955] Hardware name: linux,dummy-virt (DT)
      [  206.979883] pstate: 60000005 (nZCv daif -PAN -UAO)
      [  206.980499] pc : free_ftrace_func_mapper+0x2c/0x118
      [  206.980874] lr : ftrace_count_free+0x68/0x80
      [  206.982539] sp : ffff0000182f3ab0
      [  206.983102] x29: ffff0000182f3ab0 x28: ffff8003d0ec1700
      [  206.983632] x27: ffff000013054b40 x26: 0000000000000001
      [  206.984000] x25: ffff00001385f000 x24: 0000000000000000
      [  206.984394] x23: ffff000013453000 x22: ffff000013054000
      [  206.984775] x21: 0000000000000000 x20: ffff00001385fe28
      [  206.986575] x19: ffff000013872c30 x18: 0000000000000000
      [  206.987111] x17: 0000000000000000 x16: 0000000000000000
      [  206.987491] x15: ffffffffffffffb0 x14: 0000000000000000
      [  206.987850] x13: 000000000017430e x12: 0000000000000580
      [  206.988251] x11: 0000000000000000 x10: cccccccccccccccc
      [  206.988740] x9 : 0000000000000000 x8 : ffff000013917550
      [  206.990198] x7 : ffff000012fac2e8 x6 : ffff000012fac000
      [  206.991008] x5 : ffff0000103da588 x4 : 0000000000000001
      [  206.991395] x3 : 0000000000000001 x2 : ffff000013872a28
      [  206.991771] x1 : 0000000000000000 x0 : 0000000000000000
      [  206.992557] Call trace:
      [  206.993101]  free_ftrace_func_mapper+0x2c/0x118
      [  206.994827]  ftrace_count_free+0x68/0x80
      [  206.995238]  release_probe+0xfc/0x1d0
      [  206.995555]  register_ftrace_function_probe+0x4a8/0x868
      [  206.995923]  ftrace_trace_probe_callback.isra.4+0xb8/0x180
      [  206.996330]  ftrace_dump_callback+0x50/0x70
      [  206.996663]  ftrace_regex_write.isra.29+0x290/0x3a8
      [  206.997157]  ftrace_filter_write+0x44/0x60
      [  206.998971]  __vfs_write+0x64/0xf0
      [  206.999285]  vfs_write+0x14c/0x2f0
      [  206.999591]  ksys_write+0xbc/0x1b0
      [  206.999888]  __arm64_sys_write+0x3c/0x58
      [  207.000246]  el0_svc_common.constprop.0+0x408/0x5f0
      [  207.000607]  el0_svc_handler+0x144/0x1c8
      [  207.000916]  el0_svc+0x8/0xc
      [  207.003699] Code: aa0003f8 a9025bf5 aa0103f5 f946ea80 (f9400303)
      [  207.008388] ---[ end trace 7b6d11b5f542bdf1 ]---
      [  207.010126] Kernel panic - not syncing: Fatal exception
      [  207.011322] SMP: stopping secondary CPUs
      [  207.013956] Dumping ftrace buffer:
      [  207.014595]    (ftrace buffer empty)
      [  207.015632] Kernel Offset: disabled
      [  207.017187] CPU features: 0x002,20006008
      [  207.017985] Memory Limit: none
      [  207.019825] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Link: http://lkml.kernel.org/r/20190606031754.10798-1-liwei391@huawei.comSigned-off-by: default avatarWei Li <liwei391@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      04e03d9a
    • Josh Poimboeuf's avatar
      module: Fix livepatch/ftrace module text permissions race · 9f255b63
      Josh Poimboeuf authored
      It's possible for livepatch and ftrace to be toggling a module's text
      permissions at the same time, resulting in the following panic:
      
        BUG: unable to handle page fault for address: ffffffffc005b1d9
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0003) - permissions violation
        PGD 3ea0c067 P4D 3ea0c067 PUD 3ea0e067 PMD 3cc13067 PTE 3b8a1061
        Oops: 0003 [#1] PREEMPT SMP PTI
        CPU: 1 PID: 453 Comm: insmod Tainted: G           O  K   5.2.0-rc1-a188339c #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
        RIP: 0010:apply_relocate_add+0xbe/0x14c
        Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
        RSP: 0018:ffffb223c00dbb10 EFLAGS: 00010246
        RAX: ffffffffc005b1d9 RBX: 0000000000000000 RCX: ffffffff8b200060
        RDX: 000000000000000b RSI: 0000004b0000000b RDI: ffff96bdfcd33000
        RBP: ffffb223c00dbb38 R08: ffffffffc005d040 R09: ffffffffc005c1f0
        R10: ffff96bdfcd33c40 R11: ffff96bdfcd33b80 R12: 0000000000000018
        R13: ffffffffc005c1f0 R14: ffffffffc005e708 R15: ffffffff8b2fbc74
        FS:  00007f5f447beba8(0000) GS:ffff96bdff900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffffffc005b1d9 CR3: 000000003cedc002 CR4: 0000000000360ea0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         klp_init_object_loaded+0x10f/0x219
         ? preempt_latency_start+0x21/0x57
         klp_enable_patch+0x662/0x809
         ? virt_to_head_page+0x3a/0x3c
         ? kfree+0x8c/0x126
         patch_init+0x2ed/0x1000 [livepatch_test02]
         ? 0xffffffffc0060000
         do_one_initcall+0x9f/0x1c5
         ? kmem_cache_alloc_trace+0xc4/0xd4
         ? do_init_module+0x27/0x210
         do_init_module+0x5f/0x210
         load_module+0x1c41/0x2290
         ? fsnotify_path+0x3b/0x42
         ? strstarts+0x2b/0x2b
         ? kernel_read+0x58/0x65
         __do_sys_finit_module+0x9f/0xc3
         ? __do_sys_finit_module+0x9f/0xc3
         __x64_sys_finit_module+0x1a/0x1c
         do_syscall_64+0x52/0x61
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above panic occurs when loading two modules at the same time with
      ftrace enabled, where at least one of the modules is a livepatch module:
      
      CPU0					CPU1
      klp_enable_patch()
        klp_init_object_loaded()
          module_disable_ro()
          					ftrace_module_enable()
      					  ftrace_arch_code_modify_post_process()
      				    	    set_all_modules_text_ro()
            klp_write_object_relocations()
              apply_relocate_add()
      	  *patches read-only code* - BOOM
      
      A similar race exists when toggling ftrace while loading a livepatch
      module.
      
      Fix it by ensuring that the livepatch and ftrace code patching
      operations -- and their respective permissions changes -- are protected
      by the text_mutex.
      
      Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.comReported-by: default avatarJohannes Erdfelt <johannes@erdfelt.com>
      Fixes: 444d13ff ("modules: add ro_after_init support")
      Acked-by: default avatarJessica Yu <jeyu@kernel.org>
      Reviewed-by: default avatarPetr Mladek <pmladek@suse.com>
      Reviewed-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      9f255b63
    • Eiichi Tsukata's avatar
      tracing/uprobe: Fix obsolete comment on trace_uprobe_create() · a4158345
      Eiichi Tsukata authored
      Commit 0597c49c ("tracing/uprobes: Use dyn_event framework for
      uprobe events") cleaned up the usage of trace_uprobe_create(), and the
      function has been no longer used for removing uprobe/uretprobe.
      
      Link: http://lkml.kernel.org/r/20190614074026.8045-2-devel@etsukata.comReviewed-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarEiichi Tsukata <devel@etsukata.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      a4158345
    • Eiichi Tsukata's avatar
      tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create() · f01098c7
      Eiichi Tsukata authored
      Just like the case of commit 8b05a3a7 ("tracing/kprobes: Fix NULL
      pointer dereference in trace_kprobe_create()"), writing an incorrectly
      formatted string to uprobe_events can trigger NULL pointer dereference.
      
      Reporeducer:
      
        # echo r > /sys/kernel/debug/tracing/uprobe_events
      
      dmesg:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 8000000079d12067 P4D 8000000079d12067 PUD 7b7ab067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1903 Comm: bash Not tainted 5.2.0-rc3+ #15
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        RIP: 0010:strchr+0x0/0x30
        Code: c0 eb 0d 84 c9 74 18 48 83 c0 01 48 39 d0 74 0f 0f b6 0c 07 3a 0c 06 74 ea 19 c0 83 c8 01 c3 31 c0 c3 0f 1f 84 00 00 00 00 00 <0f> b6 07 89 f2 40 38 f0 75 0e eb 13 0f b6 47 01 48 83 c
        RSP: 0018:ffffb55fc0403d10 EFLAGS: 00010293
      
        RAX: ffff993ffb793400 RBX: 0000000000000000 RCX: ffffffffa4852625
        RDX: 0000000000000000 RSI: 000000000000002f RDI: 0000000000000000
        RBP: ffffb55fc0403dd0 R08: ffff993ffb793400 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff993ff9cc1668 R14: 0000000000000001 R15: 0000000000000000
        FS:  00007f30c5147700(0000) GS:ffff993ffda00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000007b628000 CR4: 00000000000006f0
        Call Trace:
         trace_uprobe_create+0xe6/0xb10
         ? __kmalloc_track_caller+0xe6/0x1c0
         ? __kmalloc+0xf0/0x1d0
         ? trace_uprobe_create+0xb10/0xb10
         create_or_delete_trace_uprobe+0x35/0x90
         ? trace_uprobe_create+0xb10/0xb10
         trace_run_command+0x9c/0xb0
         trace_parse_run_command+0xf9/0x1eb
         ? probes_open+0x80/0x80
         __vfs_write+0x43/0x90
         vfs_write+0x14a/0x2a0
         ksys_write+0xa2/0x170
         do_syscall_64+0x7f/0x200
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Link: http://lkml.kernel.org/r/20190614074026.8045-1-devel@etsukata.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0597c49c ("tracing/uprobes: Use dyn_event framework for uprobe events")
      Reviewed-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarEiichi Tsukata <devel@etsukata.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      f01098c7
    • YueHaibing's avatar
      tracing: Make two symbols static · ff585c5b
      YueHaibing authored
      Fix sparse warnings:
      
      kernel/trace/trace.c:6927:24: warning:
       symbol 'get_tracing_log_err' was not declared. Should it be static?
      kernel/trace/trace.c:8196:15: warning:
       symbol 'trace_instance_dir' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20190614153210.24424-1-yuehaibing@huawei.comAcked-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      ff585c5b
    • Vasily Gorbik's avatar
      tracing: avoid build warning with HAVE_NOP_MCOUNT · cbdaeaf0
      Vasily Gorbik authored
      Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
      and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
      testing CC_USING_* defines) to avoid conditional compilation and fix
      the following gcc 9 warning on s390:
      
      kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
      but not used [-Wunused-function]
      
      Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours
      
      Fixes: 2f4df001 ("tracing: Add -mcount-nop option support")
      Signed-off-by: default avatarVasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      cbdaeaf0
    • Eiichi Tsukata's avatar
      tracing: Fix out-of-range read in trace_stack_print() · becf33f6
      Eiichi Tsukata authored
      Puts range check before dereferencing the pointer.
      
      Reproducer:
      
        # echo stacktrace > trace_options
        # echo 1 > events/enable
        # cat trace > /dev/null
      
      KASAN report:
      
        ==================================================================
        BUG: KASAN: use-after-free in trace_stack_print+0x26b/0x2c0
        Read of size 8 at addr ffff888069d20000 by task cat/1953
      
        CPU: 0 PID: 1953 Comm: cat Not tainted 5.2.0-rc3+ #5
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        Call Trace:
         dump_stack+0x8a/0xce
         print_address_description+0x60/0x224
         ? trace_stack_print+0x26b/0x2c0
         ? trace_stack_print+0x26b/0x2c0
         __kasan_report.cold+0x1a/0x3e
         ? trace_stack_print+0x26b/0x2c0
         kasan_report+0xe/0x20
         trace_stack_print+0x26b/0x2c0
         print_trace_line+0x6ea/0x14d0
         ? tracing_buffers_read+0x700/0x700
         ? trace_find_next_entry_inc+0x158/0x1d0
         s_show+0xea/0x310
         seq_read+0xaa7/0x10e0
         ? seq_escape+0x230/0x230
         __vfs_read+0x7c/0x100
         vfs_read+0x16c/0x3a0
         ksys_read+0x121/0x240
         ? kernel_write+0x110/0x110
         ? perf_trace_sys_enter+0x8a0/0x8a0
         ? syscall_slow_exit_work+0xa9/0x410
         do_syscall_64+0xb7/0x390
         ? prepare_exit_to_usermode+0x165/0x200
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f867681f910
        Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 04
        RSP: 002b:00007ffdabf23488 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f867681f910
        RDX: 0000000000020000 RSI: 00007f8676cde000 RDI: 0000000000000003
        RBP: 00007f8676cde000 R08: ffffffffffffffff R09: 0000000000000000
        R10: 0000000000000871 R11: 0000000000000246 R12: 00007f8676cde000
        R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000ec0
      
        Allocated by task 1214:
         save_stack+0x1b/0x80
         __kasan_kmalloc.constprop.0+0xc2/0xd0
         kmem_cache_alloc+0xaf/0x1a0
         getname_flags+0xd2/0x5b0
         do_sys_open+0x277/0x5a0
         do_syscall_64+0xb7/0x390
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 1214:
         save_stack+0x1b/0x80
         __kasan_slab_free+0x12c/0x170
         kmem_cache_free+0x8a/0x1c0
         putname+0xe1/0x120
         do_sys_open+0x2c5/0x5a0
         do_syscall_64+0xb7/0x390
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff888069d20000
         which belongs to the cache names_cache of size 4096
        The buggy address is located 0 bytes inside of
         4096-byte region [ffff888069d20000, ffff888069d21000)
        The buggy address belongs to the page:
        page:ffffea0001a74800 refcount:1 mapcount:0 mapping:ffff88806ccd1380 index:0x0 compound_mapcount: 0
        flags: 0x100000000010200(slab|head)
        raw: 0100000000010200 dead000000000100 dead000000000200 ffff88806ccd1380
        raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff888069d1ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffff888069d1ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        >ffff888069d20000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                           ^
         ffff888069d20080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff888069d20100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      Link: http://lkml.kernel.org/r/20190610040016.5598-1-devel@etsukata.com
      
      Fixes: 4285f2fc ("tracing: Remove the ULONG_MAX stack trace hackery")
      Signed-off-by: default avatarEiichi Tsukata <devel@etsukata.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      becf33f6
  3. 28 May, 2019 1 commit
  4. 26 May, 2019 1 commit
    • Miguel Ojeda's avatar
      tracing: Silence GCC 9 array bounds warning · 0c97bf86
      Miguel Ojeda authored
      Starting with GCC 9, -Warray-bounds detects cases when memset is called
      starting on a member of a struct but the size to be cleared ends up
      writing over further members.
      
      Such a call happens in the trace code to clear, at once, all members
      after and including `seq` on struct trace_iterator:
      
          In function 'memset',
              inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
          ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
          [8505, 8560] from the object at 'iter' is out of the bounds of
          referenced subobject 'seq' with type 'struct trace_seq' at offset
          4368 [-Warray-bounds]
            344 |  return __builtin_memset(p, c, size);
                |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      In order to avoid GCC complaining about it, we compute the address
      ourselves by adding the offsetof distance instead of referring
      directly to the member.
      
      Since there are two places doing this clear (trace.c and trace_kdb.c),
      take the chance to move the workaround into a single place in
      the internal header.
      
      Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.comSigned-off-by: default avatarMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      [ Removed unnecessary parenthesis around "iter" ]
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      0c97bf86
  5. 22 May, 2019 1 commit
  6. 21 May, 2019 3 commits
    • Tom Zanussi's avatar
      tracing: Add a check_val() check before updating cond_snapshot() track_val · 9b2ca371
      Tom Zanussi authored
      Without this check a snapshot is taken whenever a bucket's max is hit,
      rather than only when the global max is hit, as it should be.
      
      Before:
      
        In this example, we do a first run of the workload (cyclictest),
        examine the output, note the max ('triggering value') (347), then do
        a second run and note the max again.
      
        In this case, the max in the second run (39) is below the max in the
        first run, but since we haven't cleared the histogram, the first max
        is still in the histogram and is higher than any other max, so it
        should still be the max for the snapshot.  It isn't however - the
        value should still be 347 after the second run.
      
        # echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
        # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_prio,next_comm,prev_pid,prev_prio,prev_comm):onmax($wakeup_lat).snapshot() if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2143 } hitcount:        199
          max:         44  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2145 } hitcount:       1325
          max:         38  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2144 } hitcount:       1982
          max:        347  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        Snapshot taken (see tracing/snapshot).  Details:
            triggering value { onmax($wakeup_lat) }:        347
            triggered by event with key: { next_pid:       2144 }
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2143 } hitcount:        199
          max:         44  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2148 } hitcount:        199
          max:         16  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/1
      
        { next_pid:       2145 } hitcount:       1325
          max:         38  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2150 } hitcount:       1326
          max:         39  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2144 } hitcount:       1982
          max:        347  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        { next_pid:       2149 } hitcount:       1983
          max:        130  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/0
      
        Snapshot taken (see tracing/snapshot).  Details:
          triggering value { onmax($wakeup_lat) }:    39
          triggered by event with key: { next_pid:       2150 }
      
      After:
      
        In this example, we do a first run of the workload (cyclictest),
        examine the output, note the max ('triggering value') (375), then do
        a second run and note the max again.
      
        In this case, the max in the second run is still 375, the highest in
        any bucket, as it should be.
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2072 } hitcount:        200
          max:         28  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/5
      
        { next_pid:       2074 } hitcount:       1323
          max:        375  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2073 } hitcount:       1980
          max:        153  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        Snapshot taken (see tracing/snapshot).  Details:
          triggering value { onmax($wakeup_lat) }:        375
          triggered by event with key: { next_pid:       2074 }
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2101 } hitcount:        199
          max:         49  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        { next_pid:       2072 } hitcount:        200
          max:         28  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/5
      
        { next_pid:       2074 } hitcount:       1323
          max:        375  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2103 } hitcount:       1325
          max:         74  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2073 } hitcount:       1980
          max:        153  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        { next_pid:       2102 } hitcount:       1981
          max:         84  next_prio:         19  next_comm: cyclictest
          prev_pid:         12  prev_prio:        120  prev_comm: kworker/0:1
      
        Snapshot taken (see tracing/snapshot).  Details:
          triggering value { onmax($wakeup_lat) }:        375
          triggered by event with key: { next_pid:       2074 }
      
      Link: http://lkml.kernel.org/r/95958351329f129c07504b4d1769c47a97b70d65.1555597045.git.tom.zanussi@linux.intel.com
      
      Cc: stable@vger.kernel.org
      Fixes: a3785b7e ("tracing: Add hist trigger snapshot() action")
      Signed-off-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      9b2ca371
    • Tom Zanussi's avatar
      tracing: Check keys for variable references in expressions too · c8d94a18
      Tom Zanussi authored
      There's an existing check for variable references in keys, but it
      doesn't go far enough.  It checks whether a key field is a variable
      reference but doesn't check whether it's an expression containing
      variable references, which can cause the same problems for callers.
      
      Use the existing field_has_hist_vars() function rather than a direct
      top-level flag check to catch all possible variable references.
      
      Link: http://lkml.kernel.org/r/e8c3d3d53db5ca90ceea5a46e5413103a6902fc7.1555597045.git.tom.zanussi@linux.intel.com
      
      Cc: stable@vger.kernel.org
      Fixes: 067fe038 ("tracing: Add variable reference handling to hist triggers")
      Reported-by: default avatarVincent Bernat <vincent@bernat.ch>
      Signed-off-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      c8d94a18
    • Tom Zanussi's avatar
      tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts · 55267c88
      Tom Zanussi authored
      hist_field_var_ref() is an implementation of hist_field_fn_t(), which
      can be called with a null tracing_map_elt elt param when assembling a
      key in event_hist_trigger().
      
      In the case of hist_field_var_ref() this doesn't make sense, because a
      variable can only be resolved by looking it up using an already
      assembled key i.e. a variable can't be used to assemble a key since
      the key is required in order to access the variable.
      
      Upper layers should prevent the user from constructing a key using a
      variable in the first place, but in case one slips through, it
      shouldn't cause a NULL pointer dereference.  Also if one does slip
      through, we want to know about it, so emit a one-time warning in that
      case.
      
      Link: http://lkml.kernel.org/r/64ec8dc15c14d305295b64cdfcc6b2b9dd14753f.1555597045.git.tom.zanussi@linux.intel.comReported-by: default avatarVincent Bernat <vincent@bernat.ch>
      Signed-off-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      55267c88
  7. 19 May, 2019 17 commits
    • Linus Torvalds's avatar
      Linux 5.2-rc1 · a188339c
      Linus Torvalds authored
      a188339c
    • Linus Torvalds's avatar
      Merge tag 'upstream-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs · 2e2c1220
      Linus Torvalds authored
      Pull UBIFS fixes from Richard Weinberger:
      
       - build errors wrt xattrs
      
       - mismerge which lead to a wrong Kconfig ifdef
      
       - missing endianness conversion
      
      * tag 'upstream-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
        ubifs: Convert xattr inum to host order
        ubifs: Use correct config name for encryption
        ubifs: Fix build error without CONFIG_UBIFS_FS_XATTR
      2e2c1220
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · cb6f8739
      Linus Torvalds authored
      Merge yet more updates from Andrew Morton:
       "A few final bits:
      
         - large changes to vmalloc, yielding large performance benefits
      
         - tweak the console-flush-on-panic code
      
         - a few fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        panic: add an option to replay all the printk message in buffer
        initramfs: don't free a non-existent initrd
        fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
        mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
        mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
        mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
        mm/vmalloc.c: keep track of free blocks for vmap allocation
      cb6f8739
    • Linus Torvalds's avatar
      Merge tag 'kbuild-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · ff8583d6
      Linus Torvalds authored
      Pull more Kbuild updates from Masahiro Yamada:
      
       - remove unneeded use of cc-option, cc-disable-warning, cc-ldoption
      
       - exclude tracked files from .gitignore
      
       - re-enable -Wint-in-bool-context warning
      
       - refactor samples/Makefile
      
       - stop building immediately if syncconfig fails
      
       - do not sprinkle error messages when $(CC) does not exist
      
       - move arch/alpha/defconfig to the configs subdirectory
      
       - remove crappy header search path manipulation
      
       - add comment lines to .config to clarify the end of menu blocks
      
       - check uniqueness of module names (adding new warnings intentionally)
      
      * tag 'kbuild-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (24 commits)
        kconfig: use 'else ifneq' for Makefile to improve readability
        kbuild: check uniqueness of module names
        kconfig: Terminate menu blocks with a comment in the generated config
        kbuild: add LICENSES to KBUILD_ALLDIRS
        kbuild: remove 'addtree' and 'flags' magic for header search paths
        treewide: prefix header search paths with $(srctree)/
        media: prefix header search paths with $(srctree)/
        media: remove unneeded header search paths
        alpha: move arch/alpha/defconfig to arch/alpha/configs/defconfig
        kbuild: terminate Kconfig when $(CC) or $(LD) is missing
        kbuild: turn auto.conf.cmd into a mandatory include file
        .gitignore: exclude .get_maintainer.ignore and .gitattributes
        kbuild: add all Clang-specific flags unconditionally
        kbuild: Don't try to add '-fcatch-undefined-behavior' flag
        kbuild: add some extra warning flags unconditionally
        kbuild: add -Wvla flag unconditionally
        arch: remove dangling asm-generic wrappers
        samples: guard sub-directories with CONFIG options
        kbuild: re-enable int-in-bool-context warning
        MAINTAINERS: kbuild: Add pattern for scripts/*vmlinux*
        ...
      ff8583d6
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · f23d8719
      Linus Torvalds authored
      Pull i2c updates from Wolfram Sang:
       "Some I2C core API additions which are kind of simple but enhance error
        checking for users a lot, especially by returning errno now.
      
        There are wrappers to still support the old API but it will be removed
        once all users are converted"
      
      * 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: core: add device-managed version of i2c_new_dummy
        i2c: core: improve return value handling of i2c_new_device and i2c_new_dummy
      f23d8719
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · c4d36b63
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Some bug fixes, and an update to the URL's for the final version of
        Unicode 12.1.0"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: avoid panic during forced reboot due to aborted journal
        ext4: fix block validity checks for journal inodes using indirect blocks
        unicode: update to Unicode 12.1.0 final
        unicode: add missing check for an error return from utf8lookup()
        ext4: fix miscellaneous sparse warnings
        ext4: unsigned int compared against zero
        ext4: fix use-after-free in dx_release()
        ext4: fix data corruption caused by overlapping unaligned and aligned IO
        jbd2: fix potential double free
        ext4: zero out the unused memory region in the extent tree block
      c4d36b63
    • Linus Torvalds's avatar
      Merge tag '5.2-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · d8848eef
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Minor cleanup and fixes, one for stable, four rdma (smbdirect)
        related. Also adds SEEK_HOLE support"
      
      * tag '5.2-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: add support for SEEK_DATA and SEEK_HOLE
        Fixed https://bugzilla.kernel.org/show_bug.cgi?id=202935 allow write on the same file
        cifs: Allocate memory for all iovs in smb2_ioctl
        cifs: Don't match port on SMBDirect transport
        cifs:smbd Use the correct DMA direction when sending data
        cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs have been called
        cifs: use the right include for signal_pending()
        smb3: trivial cleanup to smb2ops.c
        cifs: cleanup smb2ops.c and normalize strings
        smb3: display session id in debug data
      d8848eef
    • Linus Torvalds's avatar
      Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1ba3b5dc
      Linus Torvalds authored
      Pull perf tooling updates from Ingo Molnar:
       "perf.data:
      
         - Streaming compression of perf ring buffer into
           PERF_RECORD_COMPRESSED user space records, resulting in ~3-5x
           perf.data file size reduction on variety of tested workloads what
           saves storage space on larger server systems where perf.data size
           can easily reach several tens or even hundreds of GiBs, especially
           when profiling with DWARF-based stacks and tracing of context
           switches.
      
        perf record:
      
         - Improve -user-regs/intr-regs suggestions to overcome errors
      
        perf annotate:
      
         - Remove hist__account_cycles() from callback, speeding up branch
           processing (perf record -b)
      
        perf stat:
      
         - Add a 'percore' event qualifier, e.g.: -e
           cpu/event=0,umask=0x3,percore=1/, that sums up the event counts for
           both hardware threads in a core.
      
           We can already do this with --per-core, but it's often useful to do
           this together with other metrics that are collected per hardware
           thread.
      
           I.e. now its possible to do this per-event, and have it mixed with
           other events not aggregated by core.
      
        arm64:
      
         - Map Brahma-B53 CPUID to cortex-a53 events.
      
         - Add Cortex-A57 and Cortex-A72 events.
      
        csky:
      
         - Add DWARF register mappings for libdw, allowing --call-graph=dwarf
           to work on the C-SKY arch.
      
        x86:
      
         - Add support for recording and printing XMM registers, available,
           for instance, on Icelake.
      
         - Add uncore_upi (Intel's "Ultra Path Interconnect" events) JSON
           support. UPI replaced the Intel QuickPath Interconnect (QPI) in
           Xeon Skylake-SP.
      
        Intel PT:
      
         - Fix instructions sampling rate.
      
         - Timestamp fixes.
      
         - Improve exported-sql-viewer GUI, allowing, for instance, to
           copy'n'paste the trees, useful for e-mailing"
      
      * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
        perf stat: Support 'percore' event qualifier
        perf stat: Factor out aggregate counts printing
        perf tools: Add a 'percore' event qualifier
        perf docs: Add description for stderr
        perf intel-pt: Fix sample timestamp wrt non-taken branches
        perf intel-pt: Fix improved sample timestamp
        perf intel-pt: Fix instructions sampling rate
        perf regs x86: Add X86 specific arch__intr_reg_mask()
        perf parse-regs: Add generic support for arch__intr/user_reg_mask()
        perf parse-regs: Split parse_regs
        perf vendor events arm64: Add Cortex-A57 and Cortex-A72 events
        perf vendor events arm64: Map Brahma-B53 CPUID to cortex-a53 events
        perf vendor events arm64: Remove [[:xdigit:]] wildcard
        perf jevents: Remove unused variable
        perf test zstd: Fixup verbose mode output
        perf tests: Implement Zstd comp/decomp integration test
        perf inject: Enable COMPRESSED record decompression
        perf report: Implement perf.data record decompression
        perf record: Implement -z,--compression_level[=<n>] option
        perf report: Add stub processing of compressed events for -D
        ...
      1ba3b5dc
    • Linus Torvalds's avatar
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a13f950e
      Linus Torvalds authored
      Pull clocksource updates from Ingo Molnar:
       "Misc clocksource/clockevent driver updates that came in a bit late but
        are ready for v5.2"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        misc: atmel_tclib: Do not probe already used TCBs
        clocksource/drivers/timer-atmel-tcb: Convert tc_clksrc_suspend|resume() to static
        clocksource/drivers/tcb_clksrc: Rename the file for consistency
        clocksource/drivers/timer-atmel-pit: Rework Kconfig option
        clocksource/drivers/tcb_clksrc: Move Kconfig option
        ARM: at91: Implement clocksource selection
        clocksource/drivers/tcb_clksrc: Use tcb as sched_clock
        clocksource/drivers/tcb_clksrc: Stop depending on atmel_tclib
        ARM: at91: move SoC specific definitions to SoC folder
        clocksource/drivers/timer-milbeaut: Cleanup common register accesses
        clocksource/drivers/timer-milbeaut: Add shutdown function
        clocksource/drivers/timer-milbeaut: Fix to enable one-shot timer
        clocksource/drivers/tegra: Rework for compensation of suspend time
        clocksource/drivers/sp804: Add COMPILE_TEST to CONFIG_ARM_TIMER_SP804
        clocksource/drivers/sun4i: Add a compatible for suniv
        dt-bindings: timer: Add Allwinner suniv timer
      a13f950e
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d9351ea1
      Linus Torvalds authored
      Pull IRQ chip updates from Ingo Molnar:
       "A late irqchips update:
      
         - New TI INTR/INTA set of drivers
      
         - Rewrite of the stm32mp1-exti driver as a platform driver
      
         - Update the IOMMU MSI mapping API to be RT friendly
      
         - A number of cleanups and other low impact fixes"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
        iommu/dma-iommu: Remove iommu_dma_map_msi_msg()
        irqchip/gic-v3-mbi: Don't map the MSI page in mbi_compose_m{b, s}i_msg()
        irqchip/ls-scfg-msi: Don't map the MSI page in ls_scfg_msi_compose_msg()
        irqchip/gic-v3-its: Don't map the MSI page in its_irq_compose_msi_msg()
        irqchip/gicv2m: Don't map the MSI page in gicv2m_compose_msi_msg()
        iommu/dma-iommu: Split iommu_dma_map_msi_msg() in two parts
        genirq/msi: Add a new field in msi_desc to store an IOMMU cookie
        arm64: arch_k3: Enable interrupt controller drivers
        irqchip/ti-sci-inta: Add msi domain support
        soc: ti: Add MSI domain bus support for Interrupt Aggregator
        irqchip/ti-sci-inta: Add support for Interrupt Aggregator driver
        dt-bindings: irqchip: Introduce TISCI Interrupt Aggregator bindings
        irqchip/ti-sci-intr: Add support for Interrupt Router driver
        dt-bindings: irqchip: Introduce TISCI Interrupt router bindings
        gpio: thunderx: Use the default parent apis for {request,release}_resources
        genirq: Introduce irq_chip_{request,release}_resource_parent() apis
        firmware: ti_sci: Add helper apis to manage resources
        firmware: ti_sci: Add RM mapping table for am654
        firmware: ti_sci: Add support for IRQ management
        firmware: ti_sci: Add support for RM core ops
        ...
      d9351ea1
    • Linus Torvalds's avatar
      Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 39feaa3f
      Linus Torvalds authored
      Pull EFI fix from Ingo Molnar:
       "Fix an EFI-fb regression that affects certain x86 systems"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        fbdev/efifb: Ignore framebuffer memmap entries that lack any memory types
      39feaa3f
    • Linus Torvalds's avatar
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1335d9a1
      Linus Torvalds authored
      Pull core fixes from Ingo Molnar:
       "This fixes a particularly thorny munmap() bug with MPX, plus fixes a
        host build environment assumption in objtool"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        objtool: Allow AR to be overridden with HOSTAR
        x86/mpx, mm/core: Fix recursive munmap() corruption
      1335d9a1
    • Linus Torvalds's avatar
      Merge tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 4c4a5c99
      Linus Torvalds authored
      Pull ARM SoC late updates from Olof Johansson:
       "This is some material that we picked up into our tree late. Most of it
        are smaller fixes and additions, some defconfig updates due to recent
        development, etc.
      
        Code-wise the largest portion is a series of PM updates for the at91
        platform, and those have been in linux-next a while through the at91
        tree before we picked them up"
      
      * tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (29 commits)
        arm64: dts: sprd: Add clock properties for serial devices
        Opt out of scripts/get_maintainer.pl
        ARM: ixp4xx: Remove duplicated include from common.c
        soc: ixp4xx: qmgr: Fix an NULL vs IS_ERR() check in probe
        arm64: tegra: Disable XUSB support on Jetson TX2
        arm64: tegra: Enable SMMU translation for PCI on Tegra186
        arm64: tegra: Fix insecure SMMU users for Tegra186
        arm64: tegra: Select ARM_GIC_PM
        amba: tegra-ahb: Mark PM functions as __maybe_unused
        ARM: dts: logicpd-som-lv: Fix MMC1 card detect
        ARM: mvebu: drop return from void function
        ARM: mvebu: prefix coprocessor operand with p
        ARM: mvebu: drop unnecessary label
        ARM: mvebu: fix a leaked reference by adding missing of_node_put
        ARM: socfpga_defconfig: enable LTC2497
        ARM: mvebu: kirkwood: remove error message when retrieving mac address
        ARM: at91: sama5: make ov2640 as a module
        ARM: OMAP1: ams-delta: fix early boot crash when LED support is disabled
        ARM: at91: remove HAVE_FB_ATMEL for sama5 SoC as they use DRM
        soc/fsl/qe: Fix an error code in qe_pin_request()
        ...
      4c4a5c99
    • Linus Torvalds's avatar
      Merge tag 'powerpc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 86a78a8b
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "One fix going back to stable, for a bug on 32-bit introduced when we
        added support for THREAD_INFO_IN_TASK.
      
        A fix for a typo in a recent rework of our hugetlb code that leads to
        crashes on 64-bit when using hugetlbfs with a 4K PAGE_SIZE.
      
        Two fixes for our recent rework of the address layout on 64-bit hash
        CPUs, both only triggered when userspace tries to access addresses
        outside the user or kernel address ranges.
      
        Finally a fix for a recently introduced double free in an error path
        in our cacheinfo code.
      
        Thanks to: Aneesh Kumar K.V, Christophe Leroy, Sachin Sant, Tobin C.
        Harding"
      
      * tag 'powerpc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/cacheinfo: Remove double free
        powerpc/mm/hash: Fix get_region_id() for invalid addresses
        powerpc/mm: Drop VM_BUG_ON in get_region_id()
        powerpc/mm: Fix crashes with hugepages & 4K pages
        powerpc/32s: fix flush_hash_pages() on SMP
      86a78a8b
    • Linus Torvalds's avatar
      Merge tag 'mips_5.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · bcd17397
      Linus Torvalds authored
      Pull a few more MIPS updates from Paul Burton:
       "Some SGI IP27 specific PCI rework and a batch of fixes:
      
         - A build fix for BMIPS5000 configurations with
           CONFIG_HW_PERF_EVENTS=y, which also neatly removes some #ifdefery.
      
         - A fix to report supported ISAs correctly on older Ingenic SoCs
           which incorrectly indicate MIPSr2 support in their cop0 Config
           register.
      
         - Some PCI modernization for SGI IP27 systems as part of ongoing work
           to support some other SGI systems.
      
         - A fix allowing use of appended DTB files with generic kernels.
      
         - DMA mask fixes for SGI IP22 & Alchemy systems"
      
      * tag 'mips_5.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: Alchemy: add DMA masks for on-chip ethernet
        MIPS: SGI-IP22: provide missing dma_mask/coherent_dma_mask
        generic: fix appended dtb support
        MIPS: SGI-IP27: abstract chipset irq from bridge
        MIPS: SGI-IP27: use generic PCI driver
        MIPS: Fix Ingenic SoCs sometimes reporting wrong ISA
        MIPS: perf: Fix build with CONFIG_CPU_BMIPS5000 enabled
      bcd17397
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.2-mw2' of... · b0bb1269
      Linus Torvalds authored
      Merge tag 'riscv-for-linus-5.2-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
      
      Pull RISC-V updates from Palmer Dabbelt:
       "This contains an assortment of RISC-V related patches that I'd like to
        target for the 5.2 merge window. Most of the patches are cleanups, but
        there are a handful of user-visible changes:
      
         - The nosmp and nr_cpus command-line arguments are now supported,
           which work like normal.
      
         - The SBI console no longer installs itself as a preferred console,
           we rely on standard mechanisms (/chosen, command-line, hueristics)
           instead.
      
         - sfence_remove_sfence_vma{,_asid} now pass their arguments along to
           the SBI call.
      
         - Modules now support BUG().
      
         - A missing sfence.vma during boot has been added. This bug only
           manifests during boot.
      
         - The arch/riscv support for SiFive's L2 cache controller has been
           merged, which should un-block the EDAC framework work.
      
        I've only tested this on QEMU again, as I didn't have time to get
        things running on the Unleashed. The latest master from this morning
        merges in cleanly and passes the tests as well"
      
      * tag 'riscv-for-linus-5.2-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux: (31 commits)
        riscv: fix locking violation in page fault handler
        RISC-V: sifive_l2_cache: Add L2 cache controller driver for SiFive SoCs
        RISC-V: Add DT documentation for SiFive L2 Cache Controller
        RISC-V: Avoid using invalid intermediate translations
        riscv: Support BUG() in kernel module
        riscv: Add the support for c.ebreak check in is_valid_bugaddr()
        riscv: support trap-based WARN()
        riscv: fix sbi_remote_sfence_vma{,_asid}.
        riscv: move switch_mm to its own file
        riscv: move flush_icache_{all,mm} to cacheflush.c
        tty: Don't force RISCV SBI console as preferred console
        RISC-V: Access CSRs using CSR numbers
        RISC-V: Add interrupt related SCAUSE defines in asm/csr.h
        RISC-V: Use tabs to align macro values in asm/csr.h
        RISC-V: Fix minor checkpatch issues.
        RISC-V: Support nr_cpus command line option.
        RISC-V: Implement nosmp commandline option.
        RISC-V: Add RISC-V specific arch_match_cpu_phys_id
        riscv: vdso: drop unnecessary cc-ldoption
        riscv: call pm_power_off from machine_halt / machine_power_off
        ...
      b0bb1269
    • Masahiro Yamada's avatar
      kconfig: use 'else ifneq' for Makefile to improve readability · fc2694ec
      Masahiro Yamada authored
      'ifeq ... else ifneq ... endif' notation is supported by GNU Make 3.81
      or later, which is the requirement for building the kernel since
      commit 37d69ee3 ("docs: bump minimal GNU Make version to 3.81").
      
      Use it to improve the readability.
      Signed-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      fc2694ec
  8. 18 May, 2019 7 commits
    • Feng Tang's avatar
      panic: add an option to replay all the printk message in buffer · de6da1e8
      Feng Tang authored
      Currently on panic, kernel will lower the loglevel and print out pending
      printk msg only with console_flush_on_panic().
      
      Add an option for users to configure the "panic_print" to replay all
      dmesg in buffer, some of which they may have never seen due to the
      loglevel setting, which will help panic debugging .
      
      [feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()]
        Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com
      [feng.tang@intel.com: use logbuf lock to protect the console log index]
        Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com
      Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.comSigned-off-by: default avatarFeng Tang <feng.tang@intel.com>
      Reviewed-by: default avatarPetr Mladek <pmladek@suse.com>
      Cc: Aaro Koskinen <aaro.koskinen@nokia.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de6da1e8
    • Steven Price's avatar
      initramfs: don't free a non-existent initrd · 5d59aa8f
      Steven Price authored
      Since commit 54c7a891 ("initramfs: free initrd memory if opening
      /initrd.image fails"), the kernel has unconditionally attempted to free
      the initrd even if it doesn't exist.
      
      In the non-existent case this causes a boot-time splat if
      CONFIG_DEBUG_VIRTUAL is enabled due to a call to virt_to_phys() with a
      NULL address.
      
      Instead we should check that the initrd actually exists and only attempt
      to free it if it does.
      
      Link: http://lkml.kernel.org/r/20190516143125.48948-1-steven.price@arm.com
      Fixes: 54c7a891 ("initramfs: free initrd memory if opening /initrd.image fails")
      Signed-off-by: default avatarSteven Price <steven.price@arm.com>
      Reported-by: default avatarMark Rutland <mark.rutland@arm.com>
      Tested-by: default avatarMark Rutland <mark.rutland@arm.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d59aa8f
    • Jiufei Xue's avatar
      fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount · ec084de9
      Jiufei Xue authored
      synchronize_rcu() didn't wait for call_rcu() callbacks, so inode wb
      switch may not go to the workqueue after synchronize_rcu().  Thus
      previous scheduled switches was not finished even flushing the
      workqueue, which will cause a NULL pointer dereferenced followed below.
      
        VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds.  Have a nice day...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
          evict+0xb3/0x180
          iput+0x1b0/0x230
          inode_switch_wbs_work_fn+0x3c0/0x6a0
          worker_thread+0x4e/0x490
          ? process_one_work+0x410/0x410
          kthread+0xe6/0x100
          ret_from_fork+0x39/0x50
      
      Replace the synchronize_rcu() call with a rcu_barrier() to wait for all
      pending callbacks to finish.  And inc isw_nr_in_flight after call_rcu()
      in inode_switch_wbs() to make more sense.
      
      Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.comSigned-off-by: default avatarJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec084de9
    • Mel Gorman's avatar
      mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock · 60fce36a
      Mel Gorman authored
      syzbot reported the following error from a tree with a head commit of
      baf76f0c ("slip: make slhc_free() silently accept an error pointer")
      
        BUG: unable to handle kernel paging request at ffffea0003348000
        #PF error: [normal kernel read fault]
        PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
        RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
        RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
        Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
        ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
        c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
        RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
        RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
        RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
        RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
        R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
        FS:  00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
        Call Trace:
         fast_isolate_around mm/compaction.c:1243 [inline]
         fast_isolate_freepages mm/compaction.c:1418 [inline]
         isolate_freepages mm/compaction.c:1438 [inline]
         compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550
      
      There is no reproducer and it is difficult to hit -- 1 crash every few
      days.  The issue is very similar to the fix in commit 6b0868c8
      ("mm/compaction.c: correct zone boundary handling when resetting pageblock
      skip hints").  When isolating free pages around a target pageblock, the
      boundary handling is off by one and can stray into the next pageblock.
      Triggering the syzbot error requires that the end of pageblock is section
      or zone aligned, and that the next section is unpopulated.
      
      A more subtle consequence of the bug is that pageblocks were being
      improperly used as migration targets which potentially hurts fragmentation
      avoidance in the long-term one page at a time.
      
      A debugging patch revealed that it's definitely possible to stray outside
      of a pageblock which is not intended.  While syzbot cannot be used to
      verify this patch, it was confirmed that the debugging warning no longer
      triggers with this patch applied.  It has also been confirmed that the THP
      allocation stress tests are not degraded by this patch.
      
      Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
      Fixes: e332f741 ("mm, compaction: be selective about what pageblocks to clear skip hints")
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org> # v5.1+
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60fce36a
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro · a6cf4e0f
      Uladzislau Rezki (Sony) authored
      This macro adds some debug code to check that vmap allocations are
      happened in ascending order.
      
      By default this option is set to 0 and not active.  It requires
      recompilation of the kernel to activate it.  Set to 1, compile the
      kernel.
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6cf4e0f
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro · bb850f4d
      Uladzislau Rezki (Sony) authored
      This macro adds some debug code to check that the augment tree is
      maintained correctly, meaning that every node contains valid
      subtree_max_size value.
      
      By default this option is set to 0 and not active.  It requires
      recompilation of the kernel to activate it.  Set to 1, compile the
      kernel.
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb850f4d
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc.c: keep track of free blocks for vmap allocation · 68ad4a33
      Uladzislau Rezki (Sony) authored
      Patch series "improve vmap allocation", v3.
      
      Objective
      ---------
      
      Please have a look for the description at:
      
        https://lkml.org/lkml/2018/10/19/786
      
      but let me also summarize it a bit here as well.
      
      The current implementation has O(N) complexity. Requests with different
      permissive parameters can lead to long allocation time. When i say
      "long" i mean milliseconds.
      
      Description
      -----------
      
      This approach organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range, i.e.  an allocation is done over free areas lookups,
      instead of finding a hole between two busy blocks.  It allows to have
      lower number of objects which represent the free space, therefore to have
      less fragmented memory allocator.  Because free blocks are always as large
      as possible.
      
      It uses the augment tree where all free areas are sorted in ascending
      order of va->va_start address in pair with linked list that provides
      O(1) access to prev/next elements.
      
      Since the tree is augment, we also maintain the "subtree_max_size" of VA
      that reflects a maximum available free block in its left or right
      sub-tree.  Knowing that, we can easily traversal toward the lowest (left
      most path) free area.
      
      Allocation: ~O(log(N)) complexity.  It is sequential allocation method
      therefore tends to maximize locality.  The search is done until a first
      suitable block is large enough to encompass the requested parameters.
      Bigger areas are split.
      
      I copy paste here the description of how the area is split, since i
      described it in https://lkml.org/lkml/2018/10/19/786
      
      <snip>
      
      A free block can be split by three different ways.  Their names are
      FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e.  they
      correspond to how requested size and alignment fit to a free block.
      
      FL_FIT_TYPE - in this case a free block is just removed from the free
      list/tree because it fully fits.  Comparing with current design there is
      an extra work with rb-tree updating.
      
      LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit.  In this case what we do
      is just cutting a free block.  It is as fast as a current design.  Most of
      the vmalloc allocations just end up with this case, because the edge is
      always aligned to 1.
      
      NE_FIT_TYPE - Is much less common case.  Basically it happens when
      requested size and alignment does not fit left nor right edges, i.e.  it
      is between them.  In this case during splitting we have to build a
      remaining left free area and place it back to the free list/tree.
      
      Comparing with current design there are two extra steps.  First one is we
      have to allocate a new vmap_area structure.  Second one we have to insert
      that remaining free block to the address sorted list/tree.
      
      In order to optimize a first case there is a cache with free_vmap objects.
      Instead of allocating from slab we just take an object from the cache and
      reuse it.
      
      Second one is pretty optimized.  Since we know a start point in the tree
      we do not do a search from the top.  Instead a traversal begins from a
      rb-tree node we split.
      <snip>
      
      De-allocation.  ~O(log(N)) complexity.  An area is not inserted straight
      away to the tree/list, instead we identify the spot first, checking if it
      can be merged around neighbors.  The list provides O(1) access to
      prev/next, so it is pretty fast to check it.  Summarizing.  If merged then
      large coalesced areas are created, if not the area is just linked making
      more fragments.
      
      There is one more thing that i should mention here.  After modification of
      VA node, its subtree_max_size is updated if it was/is the biggest area in
      its left or right sub-tree.  Apart of that it can also be populated back
      to upper levels to fix the tree.  For more details please have a look at
      the __augment_tree_propagate_from() function and the description.
      
      Tests and stressing
      -------------------
      
      I use the "test_vmalloc.sh" test driver available under
      "tools/testing/selftests/vm/" since 5.1-rc1 kernel.  Just trigger "sudo
      ./test_vmalloc.sh" to find out how to deal with it.
      
      Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
      Regarding last one, i do not have any physical access to NUMA system,
      therefore i emulated it.  The time of stressing is days.
      
      If you run the test driver in "stress mode", you also need the patch that
      is in Andrew's tree but not in Linux 5.1-rc1.  So, please apply it:
      
      http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c
      
      After massive testing, i have not identified any problems like memory
      leaks, crashes or kernel panics.  I find it stable, but more testing would
      be good.
      
      Performance analysis
      --------------------
      
      I have used two systems to test.  One is i5-3320M CPU @ 2.60GHz and
      another is HiKey960(arm64) board.  i5-3320M runs on 4.20 kernel, whereas
      Hikey960 uses 4.15 kernel.  I have both system which could run on 5.1-rc1
      as well, but the results have not been ready by time i an writing this.
      
      Currently it consist of 8 tests.  There are three of them which correspond
      to different types of splitting(to compare with default).  We have 3
      ones(see above).  Another 5 do allocations in different conditions.
      
      a) sudo ./test_vmalloc.sh performance
      
      When the test driver is run in "performance" mode, it runs all available
      tests pinned to first online CPU with sequential execution test order.  We
      do it in order to get stable and repeatable results.  Take a look at time
      difference in "long_busy_list_alloc_test".  It is not surprising because
      the worst case is O(N).
      
      # i5-3320M
      How many cycles all tests took:
      CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
      
      # Hikey960 8x CPUs
      How many cycles all tests took:
      CPU0=3478683207 cycles vs CPU0=463767978 cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
      
      b) time sudo ./test_vmalloc.sh test_repeat_count=1
      
      With this configuration, all tests are run on all available online CPUs.
      Before running each CPU shuffles its tests execution order.  It gives
      random allocation behaviour.  So it is rough comparison, but it puts in
      the picture for sure.
      
      # i5-3320M
      <default>            vs            <patched>
      real    101m22.813s                real    0m56.805s
      user    0m0.011s                   user    0m0.015s
      sys     0m5.076s                   sys     0m0.023s
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
      
      # Hikey960 8x CPUs
      <default>            vs            <patched>
      real    unknown                    real    4m25.214s
      user    unknown                    user    0m0.011s
      sys     unknown                    sys     0m0.670s
      
      I did not manage to complete this test on "default Hikey960" kernel
      version.  After 24 hours it was still running, therefore i had to cancel
      it.  That is why real/user/sys are "unknown".
      
      This patch (of 3):
      
      Currently an allocation of the new vmap area is done over busy list
      iteration(complexity O(n)) until a suitable hole is found between two busy
      areas.  Therefore each new allocation causes the list being grown.  Due to
      over fragmented list and different permissive parameters an allocation can
      take a long time.  For example on embedded devices it is milliseconds.
      
      This patch organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range.  It uses an augment red-black tree that keeps blocks
      sorted by their offsets in pair with linked list keeping the free space in
      order of increasing addresses.
      
      Nodes are augmented with the size of the maximum available free block in
      its left or right sub-tree.  Thus, that allows to take a decision and
      traversal toward the block that will fit and will have the lowest start
      address, i.e.  it is sequential allocation.
      
      Allocation: to allocate a new block a search is done over the tree until a
      suitable lowest(left most) block is large enough to encompass: the
      requested size, alignment and vstart point.  If the block is bigger than
      requested size - it is split.
      
      De-allocation: when a busy vmap area is freed it can either be merged or
      inserted to the tree.  Red-black tree allows efficiently find a spot
      whereas a linked list provides a constant-time access to previous and next
      blocks to check if merging can be done.  In case of merging of
      de-allocated memory chunk a large coalesced area is created.
      
      Complexity: ~O(log(N))
      
      [urezki@gmail.com: v3]
        Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68ad4a33