1. 15 Jun, 2019 8 commits
    • Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 6a71398c
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Out of range read of stack trace output
      
       - Fix for NULL pointer dereference in trace_uprobe_create()
      
       - Fix to a livepatching / ftrace permission race in the module code
      
       - Fix for NULL pointer dereference in free_ftrace_func_mapper()
      
       - A couple of build warning clean ups
      
      * tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
        module: Fix livepatch/ftrace module text permissions race
        tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
        tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
        tracing: Make two symbols static
        tracing: avoid build warning with HAVE_NOP_MCOUNT
        tracing: Fix out-of-range read in trace_stack_print()
    • Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 0011572c
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "This has an unusually high density of tricky fixes:
      
         - task_get_css() could deadlock when it races against a dying cgroup.
      
         - cgroup.procs didn't list thread group leaders with live threads.
      
           This could mislead readers to think that a cgroup is empty when
           it's not. Fixed by making PROCS iterator include dead tasks. I made
           a couple mistakes making this change and this pull request contains
           a couple follow-up patches.
      
         - When a cpuset runs out of online cpus, it updates cpumasks of member
           tasks in bizarre ways. Joel improved the behavior significantly"
      
      * 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cpuset: restore sanity to cpuset_cpus_allowed_fallback()
        cgroup: Fix css_task_iter_advance_css_set() cset skip condition
        cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
        cgroup: Include dying leaders with live threads in PROCS iterations
        cgroup: Implement css_task_iter_skip()
        cgroup: Call cgroup_release() before __exit_signal()
        docs cgroups: add another example size for hugetlb
        cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
    • Merge tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm · 6aa7a22b
      Linus Torvalds authored
      Pull drm fixes from Daniel Vetter:
       "Nothing unsettling here, also not aware of anything serious still
        pending.
      
        The edid override regression fix took a bit longer since this seems to
        be an area with an overabundance of bad options. But the fix we have
        now seems like a good path forward.
      
        Next week it should be back to Dave.
      
        Summary:
      
         - fix regression on amdgpu on SI
      
         - fix edid override regression
      
         - driver fixes: amdgpu, i915, mediatek, meson, panfrost
      
         - fix writecombine for vmap in gem-shmem helper (used by panfrost)
      
         - add more panel quirks"
      
      * tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm: (25 commits)
        drm/amdgpu: return 0 by default in amdgpu_pm_load_smu_firmware
        drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()
        drm: add fallback override/firmware EDID modes workaround
        drm/edid: abstract override/firmware EDID retrieval
        drm/i915/perf: fix whitelist on Gen10+
        drm/i915/sdvo: Implement proper HDMI audio support for SDVO
        drm/i915: Fix per-pixel alpha with CCS
        drm/i915/dmc: protect against reading random memory
        drm/i915/dsi: Use a fuzzy check for burst mode clock check
        drm/amdgpu/{uvd,vcn}: fetch ring's read_ptr after alloc
        drm/panfrost: Require the simple_ondemand governor
        drm/panfrost: make devfreq optional again
        drm/gem_shmem: Use a writecombine mapping for ->vaddr
        drm: panel-orientation-quirks: Add quirk for GPD MicroPC
        drm: panel-orientation-quirks: Add quirk for GPD pocket2
        drm/meson: fix G12A primary plane disabling
        drm/meson: fix primary plane disabling
        drm/meson: fix G12A HDMI PLL settings for 4K60 1000/1001 variations
        drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable()
        drm/mediatek: clear num_pipes when unbind driver
        ...
    • Merge tag 'gfs2-v5.2.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · 40665244
      Linus Torvalds authored
      Pull gfs2 fix from Andreas Gruenbacher:
       "Fix rounding error in gfs2_iomap_page_prepare"
      
      * tag 'gfs2-v5.2.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: Fix rounding error in gfs2_iomap_page_prepare
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 1ed1fa5f
      Linus Torvalds authored
      Pull SCSI fix from James Bottomley:
       "A single bug fix for hpsa.
      
        The user visible consequences aren't clear, but the ioaccel2 raid
        acceleration may misfire on the malformed request assuming the payload
        is big enough to require chaining (more than 31 sg entries)"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: hpsa: correct ioaccel2 chaining
    • Merge tag 'for-linus-20190614' of git://git.kernel.dk/linux-block · 7b103151
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Remove references to old schedulers for the scheduler switching and
         blkio controller documentation (Andreas)
      
       - Kill duplicate check for report zone for null_blk (Chaitanya)
      
       - Two bcache fixes (Coly)
      
       - Ensure that mq-deadline is selected if zoned block device is enabled,
         as we need that to support them (Damien)
      
       - Fix io_uring memory leak (Eric)
      
       - ps3vram fallout from LBDAF removal (Geert)
      
       - Clean up redundant blk-mq debugfs_create return value checks (Greg)
      
       - Extend NOLPM quirk for ST1000LM024 drives (Hans)
      
       - Remove error path warning that can now trigger after the queue
         removal/addition fixes (Ming)
      
      * tag 'for-linus-20190614' of git://git.kernel.dk/linux-block:
        block/ps3vram: Use %llu to format sector_t after LBDAF removal
        libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk
        bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
        bcache: fix stack corruption by PRECEDING_KEY()
        blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests
        blkio-controller.txt: Remove references to CFQ
        block/switching-sched.txt: Update to blk-mq schedulers
        null_blk: remove duplicate check for report zone
        blk-mq: no need to check return value of debugfs_create functions
        io_uring: fix memory leak of UNIX domain socket inode
        block: force select mq-deadline for zoned block devices
    • Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 5dcedf46
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "I2C has two simple but wanted driver fixes for you"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: pca-platform: Fix GPIO lookup code
        i2c: acorn: fix i2c warning
    • Smack: Restore the smackfsdef mount option and add missing prefixes · 6e7739fc
      Casey Schaufler authored
      The 5.1 mount system rework changed the smackfsdef mount option to
      smackfsdefault.  This fixes the regression by treating smackfsdef the
      same way as smackfsdefault.
      
      Also fix the smack_param_specs[] to have "smack" prefixes on all the
      names.  This isn't visible to a user unless they either:
      
       (a) Try to mount a filesystem that's converted to the internal mount API
           and that implements the ->parse_monolithic() context operation - and
           only then if they call security_fs_context_parse_param() rather than
           security_sb_eat_lsm_opts().
      
           There are no examples of this upstream yet, but nfs will probably want
           to do this for nfs2 or nfs3.
      
       (b) Use fsconfig() to configure the filesystem - in which case
           security_fs_context_parse_param() will be called.
      
      The issue is that smack_sb_eat_lsm_opts() checks for the "smack" prefix
      on the options, but smack_fs_context_parse_param() does not.
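
      The aliasing described above can be illustrated with a small userspace
      lookup table in which both spellings resolve to the same option. This is
      only a sketch: the enum, the table, and lookup_opt() are hypothetical
      names for this example and are not the kernel's smack_param_specs[] or
      its parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option identifiers for this sketch. */
enum opt { OPT_ERROR = -1, OPT_FSDEFAULT, OPT_FSFLOOR, OPT_FSHAT };

struct opt_spec {
	const char *name;
	enum opt    opt;
};

/*
 * Every name carries the "smack" prefix, and the old and new spellings
 * of the default-label option map to the same identifier.
 */
static const struct opt_spec specs[] = {
	{ "smackfsdef",     OPT_FSDEFAULT },	/* pre-5.1 spelling */
	{ "smackfsdefault", OPT_FSDEFAULT },
	{ "smackfsfloor",   OPT_FSFLOOR },
	{ "smackfshat",     OPT_FSHAT },
};

/* Return the option for an exact name match, or OPT_ERROR. */
static enum opt lookup_opt(const char *name)
{
	for (size_t i = 0; i < sizeof(specs) / sizeof(specs[0]); i++)
		if (strcmp(specs[i].name, name) == 0)
			return specs[i].opt;
	return OPT_ERROR;
}
```

      With a table like this, a caller that passes either "smackfsdef" or
      "smackfsdefault" gets the same result, while an unprefixed name such as
      "fsdefault" is rejected.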
      
      Fixes: c3300aaf ("smack: get rid of match_token()")
      Fixes: 2febd254 ("smack: Implement filesystem context security hooks")
      Cc: stable@vger.kernel.org
      Reported-by: Jose Bollo <jose.bollo@iot.bzh>
      Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 14 Jun, 2019 32 commits
    • ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper() · 04e03d9a
      Wei Li authored
      The mapper may be NULL when called from register_ftrace_function_probe()
      with probe->data == NULL.
      
      This issue can be reproduced as follows (compiler optimization may
      sometimes mask it):
      
      / # cat /sys/kernel/debug/tracing/set_ftrace_filter
      #### all functions enabled ####
      / # echo foo_bar:dump > /sys/kernel/debug/tracing/set_ftrace_filter
      [  206.949100] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  206.952402] Mem abort info:
      [  206.952819]   ESR = 0x96000006
      [  206.955326]   Exception class = DABT (current EL), IL = 32 bits
      [  206.955844]   SET = 0, FnV = 0
      [  206.956272]   EA = 0, S1PTW = 0
      [  206.956652] Data abort info:
      [  206.957320]   ISV = 0, ISS = 0x00000006
      [  206.959271]   CM = 0, WnR = 0
      [  206.959938] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000419f3a000
      [  206.960483] [0000000000000000] pgd=0000000411a87003, pud=0000000411a83003, pmd=0000000000000000
      [  206.964953] Internal error: Oops: 96000006 [#1] SMP
      [  206.971122] Dumping ftrace buffer:
      [  206.973677]    (ftrace buffer empty)
      [  206.975258] Modules linked in:
      [  206.976631] Process sh (pid: 281, stack limit = 0x(____ptrval____))
      [  206.978449] CPU: 10 PID: 281 Comm: sh Not tainted 5.2.0-rc1+ #17
      [  206.978955] Hardware name: linux,dummy-virt (DT)
      [  206.979883] pstate: 60000005 (nZCv daif -PAN -UAO)
      [  206.980499] pc : free_ftrace_func_mapper+0x2c/0x118
      [  206.980874] lr : ftrace_count_free+0x68/0x80
      [  206.982539] sp : ffff0000182f3ab0
      [  206.983102] x29: ffff0000182f3ab0 x28: ffff8003d0ec1700
      [  206.983632] x27: ffff000013054b40 x26: 0000000000000001
      [  206.984000] x25: ffff00001385f000 x24: 0000000000000000
      [  206.984394] x23: ffff000013453000 x22: ffff000013054000
      [  206.984775] x21: 0000000000000000 x20: ffff00001385fe28
      [  206.986575] x19: ffff000013872c30 x18: 0000000000000000
      [  206.987111] x17: 0000000000000000 x16: 0000000000000000
      [  206.987491] x15: ffffffffffffffb0 x14: 0000000000000000
      [  206.987850] x13: 000000000017430e x12: 0000000000000580
      [  206.988251] x11: 0000000000000000 x10: cccccccccccccccc
      [  206.988740] x9 : 0000000000000000 x8 : ffff000013917550
      [  206.990198] x7 : ffff000012fac2e8 x6 : ffff000012fac000
      [  206.991008] x5 : ffff0000103da588 x4 : 0000000000000001
      [  206.991395] x3 : 0000000000000001 x2 : ffff000013872a28
      [  206.991771] x1 : 0000000000000000 x0 : 0000000000000000
      [  206.992557] Call trace:
      [  206.993101]  free_ftrace_func_mapper+0x2c/0x118
      [  206.994827]  ftrace_count_free+0x68/0x80
      [  206.995238]  release_probe+0xfc/0x1d0
      [  206.995555]  register_ftrace_function_probe+0x4a8/0x868
      [  206.995923]  ftrace_trace_probe_callback.isra.4+0xb8/0x180
      [  206.996330]  ftrace_dump_callback+0x50/0x70
      [  206.996663]  ftrace_regex_write.isra.29+0x290/0x3a8
      [  206.997157]  ftrace_filter_write+0x44/0x60
      [  206.998971]  __vfs_write+0x64/0xf0
      [  206.999285]  vfs_write+0x14c/0x2f0
      [  206.999591]  ksys_write+0xbc/0x1b0
      [  206.999888]  __arm64_sys_write+0x3c/0x58
      [  207.000246]  el0_svc_common.constprop.0+0x408/0x5f0
      [  207.000607]  el0_svc_handler+0x144/0x1c8
      [  207.000916]  el0_svc+0x8/0xc
      [  207.003699] Code: aa0003f8 a9025bf5 aa0103f5 f946ea80 (f9400303)
      [  207.008388] ---[ end trace 7b6d11b5f542bdf1 ]---
      [  207.010126] Kernel panic - not syncing: Fatal exception
      [  207.011322] SMP: stopping secondary CPUs
      [  207.013956] Dumping ftrace buffer:
      [  207.014595]    (ftrace buffer empty)
      [  207.015632] Kernel Offset: disabled
      [  207.017187] CPU features: 0x002,20006008
      [  207.017985] Memory Limit: none
      [  207.019825] ---[ end Kernel panic - not syncing: Fatal exception ]---
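
      A crash of this kind is usually avoided by having the free routine
      tolerate a NULL argument. The sketch below shows that pattern in
      userspace C. The struct and both functions are hypothetical stand-ins,
      not the kernel's ftrace data structures, and the actual patch may differ
      in detail.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a mapper object that owns some entries. */
struct mapper {
	int    nr_entries;
	void **entries;
};

/* Allocate a mapper with n one-byte entries; returns NULL on failure. */
static struct mapper *alloc_mapper(int n)
{
	struct mapper *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	m->entries = calloc((size_t)n, sizeof(*m->entries));
	if (!m->entries) {
		free(m);
		return NULL;
	}
	for (int i = 0; i < n; i++)
		m->entries[i] = malloc(1);
	m->nr_entries = n;
	return m;
}

/*
 * Free a mapper and everything it owns. A NULL mapper means nothing was
 * ever allocated, so return early instead of dereferencing it.
 * Returns the number of entries released.
 */
static int free_mapper(struct mapper *m)
{
	int freed = 0;

	if (!m)
		return 0;

	for (int i = 0; i < m->nr_entries; i++) {
		free(m->entries[i]);
		freed++;
	}
	free(m->entries);
	free(m);
	return freed;
}
```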
      
      Link: http://lkml.kernel.org/r/20190606031754.10798-1-liwei391@huawei.com
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • module: Fix livepatch/ftrace module text permissions race · 9f255b63
      Josh Poimboeuf authored
      It's possible for livepatch and ftrace to be toggling a module's text
      permissions at the same time, resulting in the following panic:
      
        BUG: unable to handle page fault for address: ffffffffc005b1d9
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0003) - permissions violation
        PGD 3ea0c067 P4D 3ea0c067 PUD 3ea0e067 PMD 3cc13067 PTE 3b8a1061
        Oops: 0003 [#1] PREEMPT SMP PTI
        CPU: 1 PID: 453 Comm: insmod Tainted: G           O  K   5.2.0-rc1-a188339c #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
        RIP: 0010:apply_relocate_add+0xbe/0x14c
        Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
        RSP: 0018:ffffb223c00dbb10 EFLAGS: 00010246
        RAX: ffffffffc005b1d9 RBX: 0000000000000000 RCX: ffffffff8b200060
        RDX: 000000000000000b RSI: 0000004b0000000b RDI: ffff96bdfcd33000
        RBP: ffffb223c00dbb38 R08: ffffffffc005d040 R09: ffffffffc005c1f0
        R10: ffff96bdfcd33c40 R11: ffff96bdfcd33b80 R12: 0000000000000018
        R13: ffffffffc005c1f0 R14: ffffffffc005e708 R15: ffffffff8b2fbc74
        FS:  00007f5f447beba8(0000) GS:ffff96bdff900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffffffc005b1d9 CR3: 000000003cedc002 CR4: 0000000000360ea0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         klp_init_object_loaded+0x10f/0x219
         ? preempt_latency_start+0x21/0x57
         klp_enable_patch+0x662/0x809
         ? virt_to_head_page+0x3a/0x3c
         ? kfree+0x8c/0x126
         patch_init+0x2ed/0x1000 [livepatch_test02]
         ? 0xffffffffc0060000
         do_one_initcall+0x9f/0x1c5
         ? kmem_cache_alloc_trace+0xc4/0xd4
         ? do_init_module+0x27/0x210
         do_init_module+0x5f/0x210
         load_module+0x1c41/0x2290
         ? fsnotify_path+0x3b/0x42
         ? strstarts+0x2b/0x2b
         ? kernel_read+0x58/0x65
         __do_sys_finit_module+0x9f/0xc3
         ? __do_sys_finit_module+0x9f/0xc3
         __x64_sys_finit_module+0x1a/0x1c
         do_syscall_64+0x52/0x61
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above panic occurs when loading two modules at the same time with
      ftrace enabled, where at least one of the modules is a livepatch module:
      
      CPU0                                    CPU1
      klp_enable_patch()
        klp_init_object_loaded()
          module_disable_ro()
                                              ftrace_module_enable()
                                                ftrace_arch_code_modify_post_process()
                                                  set_all_modules_text_ro()
            klp_write_object_relocations()
              apply_relocate_add()
                *patches read-only code* - BOOM
      
      A similar race exists when toggling ftrace while loading a livepatch
      module.
      
      Fix it by ensuring that the livepatch and ftrace code patching
      operations -- and their respective permissions changes -- are protected
      by the text_mutex.
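
      The idea of covering both the permission change and the write with one
      lock can be modelled in userspace with pthreads. This is a rough analogy
      under stated assumptions, not kernel code: text_lock stands in for
      text_mutex, a boolean stands in for the page permissions, and every name
      here is hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t text_lock = PTHREAD_MUTEX_INITIALIZER;
static bool text_writable;	/* models the module text permissions */
static long text_word;		/* models the code being patched */
static long overlaps;		/* times a patcher found another mid-update */

/*
 * One patcher: make the text writable, patch it, restore read-only.
 * The whole sequence runs under text_lock, so the other patcher can
 * never flip the permission back in the middle of it.
 */
static void *patcher(void *arg)
{
	long rounds = *(long *)arg;

	for (long i = 0; i < rounds; i++) {
		pthread_mutex_lock(&text_lock);
		if (text_writable)	/* someone else is mid-update */
			overlaps++;
		text_writable = true;
		text_word++;		/* the "patch" */
		text_writable = false;
		pthread_mutex_unlock(&text_lock);
	}
	return NULL;
}

/* Run two patchers concurrently and report how many overlaps occurred. */
static long run_two_patchers(long rounds)
{
	pthread_t a, b;

	overlaps = 0;
	text_word = 0;
	pthread_create(&a, NULL, patcher, &rounds);
	pthread_create(&b, NULL, patcher, &rounds);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return overlaps;
}
```

      Because each toggle-patch-restore sequence is a single critical section,
      the overlap counter stays at zero and no update is lost.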
      
      Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.com
      Reported-by: Johannes Erdfelt <johannes@erdfelt.com>
      Fixes: 444d13ff ("modules: add ro_after_init support")
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Reviewed-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing/uprobe: Fix obsolete comment on trace_uprobe_create() · a4158345
      Eiichi Tsukata authored
      Commit 0597c49c ("tracing/uprobes: Use dyn_event framework for
      uprobe events") cleaned up the usage of trace_uprobe_create(), and the
      function is no longer used for removing uprobe/uretprobe.
      
      Link: http://lkml.kernel.org/r/20190614074026.8045-2-devel@etsukata.com
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create() · f01098c7
      Eiichi Tsukata authored
      Just like the case of commit 8b05a3a7 ("tracing/kprobes: Fix NULL
      pointer dereference in trace_kprobe_create()"), writing an incorrectly
      formatted string to uprobe_events can trigger a NULL pointer dereference.
      
      Reproducer:
      
        # echo r > /sys/kernel/debug/tracing/uprobe_events
      
      dmesg:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 8000000079d12067 P4D 8000000079d12067 PUD 7b7ab067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1903 Comm: bash Not tainted 5.2.0-rc3+ #15
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        RIP: 0010:strchr+0x0/0x30
        Code: c0 eb 0d 84 c9 74 18 48 83 c0 01 48 39 d0 74 0f 0f b6 0c 07 3a 0c 06 74 ea 19 c0 83 c8 01 c3 31 c0 c3 0f 1f 84 00 00 00 00 00 <0f> b6 07 89 f2 40 38 f0 75 0e eb 13 0f b6 47 01 48 83 c
        RSP: 0018:ffffb55fc0403d10 EFLAGS: 00010293
      
        RAX: ffff993ffb793400 RBX: 0000000000000000 RCX: ffffffffa4852625
        RDX: 0000000000000000 RSI: 000000000000002f RDI: 0000000000000000
        RBP: ffffb55fc0403dd0 R08: ffff993ffb793400 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff993ff9cc1668 R14: 0000000000000001 R15: 0000000000000000
        FS:  00007f30c5147700(0000) GS:ffff993ffda00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000007b628000 CR4: 00000000000006f0
        Call Trace:
         trace_uprobe_create+0xe6/0xb10
         ? __kmalloc_track_caller+0xe6/0x1c0
         ? __kmalloc+0xf0/0x1d0
         ? trace_uprobe_create+0xb10/0xb10
         create_or_delete_trace_uprobe+0x35/0x90
         ? trace_uprobe_create+0xb10/0xb10
         trace_run_command+0x9c/0xb0
         trace_parse_run_command+0xf9/0x1eb
         ? probes_open+0x80/0x80
         __vfs_write+0x43/0x90
         vfs_write+0x14a/0x2a0
         ksys_write+0xa2/0x170
         do_syscall_64+0x7f/0x200
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
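
      The register dump shows strchr() entered with a NULL string (RDI) and
      the character 0x2f, '/' (RSI), which is what happens when a parser looks
      at an argument that was never supplied. A minimal userspace sketch of
      the usual guard follows. parse_probe() and its return values are
      hypothetical and are not the kernel function or its error codes.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical parser: argv[0] is the probe type and argv[1] is
 * expected to contain a path. Returns 0 on success or a negative errno.
 */
static int parse_probe(int argc, const char *const argv[])
{
	/* Input such as "r" alone yields argc == 1, so argv[1] is absent. */
	if (argc < 2)
		return -EINVAL;

	/* Safe now: argv[1] is known to exist before it is scanned. */
	if (!strchr(argv[1], '/'))
		return -EINVAL;

	return 0;
}
```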
      
      Link: http://lkml.kernel.org/r/20190614074026.8045-1-devel@etsukata.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0597c49c ("tracing/uprobes: Use dyn_event framework for uprobe events")
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Make two symbols static · ff585c5b
      YueHaibing authored
      Fix sparse warnings:
      
      kernel/trace/trace.c:6927:24: warning:
       symbol 'get_tracing_log_err' was not declared. Should it be static?
      kernel/trace/trace.c:8196:15: warning:
       symbol 'trace_instance_dir' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20190614153210.24424-1-yuehaibing@huawei.com
      Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: avoid build warning with HAVE_NOP_MCOUNT · cbdaeaf0
      Vasily Gorbik authored
      Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
      and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
      testing CC_USING_* defines) to avoid conditional compilation and fix
      the following gcc 9 warning on s390:
      
      kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
      but not used [-Wunused-function]
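
      The preprocessor trick behind __is_defined can be reproduced in a few
      lines. The sketch below uses the same expansion scheme under
      non-reserved names, with FEATURE_ON, FEATURE_OFF, helper() and
      use_helper() as hypothetical examples. Because the test is an ordinary
      C expression, helper() stays referenced even when the feature is off,
      so the compiler has no unused function to warn about.

```c
/*
 * is_defined(x) expands to 1 when x is defined to 1 and to 0 when x is
 * not defined at all. A defined macro turns the placeholder into "0,",
 * which shifts the literal 1 into the second argument position.
 */
#define ARG_PLACEHOLDER_1 0,
#define take_second_arg(ignored, val, ...) val
#define is_defined(x)              is_defined_1(x)
#define is_defined_1(val)          is_defined_2(ARG_PLACEHOLDER_##val)
#define is_defined_2(arg1_or_junk) take_second_arg(arg1_or_junk 1, 0)

#define FEATURE_ON 1	/* hypothetical; FEATURE_OFF is left undefined */

static int helper(void)
{
	return 42;
}

/* helper() is compiled and referenced, then discarded as dead code. */
static int use_helper(void)
{
	if (is_defined(FEATURE_OFF))
		return helper();
	return 0;
}
```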
      
      Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours
      
      Fixes: 2f4df001 ("tracing: Add -mcount-nop option support")
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix out-of-range read in trace_stack_print() · becf33f6
      Eiichi Tsukata authored
      Put the range check before dereferencing the pointer.
      
      Reproducer:
      
        # echo stacktrace > trace_options
        # echo 1 > events/enable
        # cat trace > /dev/null
      
      KASAN report:
      
        ==================================================================
        BUG: KASAN: use-after-free in trace_stack_print+0x26b/0x2c0
        Read of size 8 at addr ffff888069d20000 by task cat/1953
      
        CPU: 0 PID: 1953 Comm: cat Not tainted 5.2.0-rc3+ #5
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        Call Trace:
         dump_stack+0x8a/0xce
         print_address_description+0x60/0x224
         ? trace_stack_print+0x26b/0x2c0
         ? trace_stack_print+0x26b/0x2c0
         __kasan_report.cold+0x1a/0x3e
         ? trace_stack_print+0x26b/0x2c0
         kasan_report+0xe/0x20
         trace_stack_print+0x26b/0x2c0
         print_trace_line+0x6ea/0x14d0
         ? tracing_buffers_read+0x700/0x700
         ? trace_find_next_entry_inc+0x158/0x1d0
         s_show+0xea/0x310
         seq_read+0xaa7/0x10e0
         ? seq_escape+0x230/0x230
         __vfs_read+0x7c/0x100
         vfs_read+0x16c/0x3a0
         ksys_read+0x121/0x240
         ? kernel_write+0x110/0x110
         ? perf_trace_sys_enter+0x8a0/0x8a0
         ? syscall_slow_exit_work+0xa9/0x410
         do_syscall_64+0xb7/0x390
         ? prepare_exit_to_usermode+0x165/0x200
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f867681f910
        Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 04
        RSP: 002b:00007ffdabf23488 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f867681f910
        RDX: 0000000000020000 RSI: 00007f8676cde000 RDI: 0000000000000003
        RBP: 00007f8676cde000 R08: ffffffffffffffff R09: 0000000000000000
        R10: 0000000000000871 R11: 0000000000000246 R12: 00007f8676cde000
        R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000ec0
      
        Allocated by task 1214:
         save_stack+0x1b/0x80
         __kasan_kmalloc.constprop.0+0xc2/0xd0
         kmem_cache_alloc+0xaf/0x1a0
         getname_flags+0xd2/0x5b0
         do_sys_open+0x277/0x5a0
         do_syscall_64+0xb7/0x390
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 1214:
         save_stack+0x1b/0x80
         __kasan_slab_free+0x12c/0x170
         kmem_cache_free+0x8a/0x1c0
         putname+0xe1/0x120
         do_sys_open+0x2c5/0x5a0
         do_syscall_64+0xb7/0x390
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff888069d20000
         which belongs to the cache names_cache of size 4096
        The buggy address is located 0 bytes inside of
         4096-byte region [ffff888069d20000, ffff888069d21000)
        The buggy address belongs to the page:
        page:ffffea0001a74800 refcount:1 mapcount:0 mapping:ffff88806ccd1380 index:0x0 compound_mapcount: 0
        flags: 0x100000000010200(slab|head)
        raw: 0100000000010200 dead000000000100 dead000000000200 ffff88806ccd1380
        raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff888069d1ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffff888069d1ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        >ffff888069d20000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                           ^
         ffff888069d20080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff888069d20100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
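
      The ordering issue can be shown with a short loop over a terminated
      array: the bounds test has to come before the dereference in the loop
      condition. This is a sketch only. count_entries() is a hypothetical
      helper, and the use of ULONG_MAX as the terminator is an assumption
      taken from the title of the commit named in the Fixes tag.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Count entries up to a ULONG_MAX terminator without reading past end.
 * Because && short-circuits, *p is only evaluated when p < end holds.
 * Writing "*p != ULONG_MAX && p < end" instead would read one element
 * beyond the buffer when no terminator is present.
 */
static size_t count_entries(const unsigned long *p, const unsigned long *end)
{
	size_t n = 0;

	for (; p && p < end && *p != ULONG_MAX; p++)
		n++;
	return n;
}
```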
      
      Link: http://lkml.kernel.org/r/20190610040016.5598-1-devel@etsukata.com
      
      Fixes: 4285f2fc ("tracing: Remove the ULONG_MAX stack trace hackery")
      Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • gfs2: Fix rounding error in gfs2_iomap_page_prepare · 2741b672
      Andreas Gruenbacher authored
      The pos and len arguments to the iomap page_prepare callback are not
      block aligned, so we need to take that into account when computing the
      number of blocks.
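
      One way to account for an unaligned start is to add the offset of pos
      within its block before rounding up. The helper below is a standalone
      sketch of that arithmetic. blocks_spanned() and its parameters are
      hypothetical names, and the actual patch may be written differently.

```c
#include <assert.h>
#include <stdint.h>

/* Number of blocks touched by the byte range [pos, pos + len). */
static uint64_t blocks_spanned(uint64_t pos, uint64_t len, unsigned int blkbits)
{
	uint64_t blockmask;

	assert(blkbits < 32);
	blockmask = (UINT64_C(1) << blkbits) - 1;

	/* Add the offset of pos inside its block, then round up. */
	return ((pos & blockmask) + len + blockmask) >> blkbits;
}
```

      For example, with 4096-byte blocks a 2-byte write at offset 4095 touches
      two blocks, whereas rounding len alone would report one.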
      
      Fixes: d0a22a4b ("gfs2: Fix iomap write page reclaim deadlock")
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 72a20cee
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Here are some arm64 fixes for -rc5.
      
        The only non-trivial change (in terms of the diffstat) is fixing our
        SVE ptrace API for big-endian machines, but the majority of this is
        actually the addition of much-needed comments and updates to the
        documentation to try to avoid this mess biting us again in future.
      
        There are still a couple of small things on the horizon, but nothing
        major at this point.
      
        Summary:
      
         - Fix broken SVE ptrace API when running in a big-endian configuration
      
         - Fix performance regression due to off-by-one in TLBI range checking
      
         - Fix build regression when using Clang"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64/sve: Fix missing SVE/FPSIMD endianness conversions
        arm64: tlbflush: Ensure start/end of address range are aligned to stride
        arm64: Don't unconditionally add -Wno-psabi to KBUILD_CFLAGS
    • Merge branch 'akpm' (patches from Andrew) · fd6b99fa
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "16 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/devm_memremap_pages: fix final page put race
        PCI/P2PDMA: track pgmap references per resource, not globally
        lib/genalloc: introduce chunk owners
        PCI/P2PDMA: fix the gen_pool_add_virt() failure path
        mm/devm_memremap_pages: introduce devm_memunmap_pages
        drivers/base/devres: introduce devm_release_action()
        mm/vmscan.c: fix trying to reclaim unevictable LRU page
        coredump: fix race condition between collapse_huge_page() and core dumping
        mm/mlock.c: change count_mm_mlocked_page_nr return type
        mm: mmu_gather: remove __tlb_reset_range() for force flush
        fs/ocfs2: fix race in ocfs2_dentry_attach_lock()
        mm/vmscan.c: fix recent_rotated history
        mm/mlock.c: mlockall error for flag MCL_ONFAULT
        scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE
        mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node
        mm: memcontrol: don't batch updates of local VM stats and events
    • Merge branch 'drm-fixes-5.2' of git://people.freedesktop.org/~agd5f/linux into drm-fixes · e14c5873
      Daniel Vetter authored
      Fixes for 5.2:
      - Extend previous vce fix for resume to uvd and vcn
      - Fix bounds checking in ras debugfs interface
      - Fix a regression on SI using amdgpu
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      From: Alex Deucher <alexdeucher@gmail.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190613021856.3307-1-alexander.deucher@amd.com
    • Merge tag 'iommu-fixes-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · c78ad1be
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - three fixes for Intel VT-d to fix a potential dead-lock, a formatting
         fix and a bit setting fix
      
       - one fix for the ARM-SMMU to make it work on some platforms with
         sub-optimal SMMU emulation
      
      * tag 'iommu-fixes-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/arm-smmu: Avoid constant zero in TLBI writes
        iommu/vt-d: Set the right field for Page Walk Snoop
        iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock
        iommu: Add missing new line for dma type
      c78ad1be
    • Linus Torvalds's avatar
      Merge tag 'gpio-v5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · 7617c9a0
      Linus Torvalds authored
      Pull GPIO fix from Linus Walleij:
       "A single fix for the PCA953x driver affecting some fringe variants of
        the chip"
      
      * tag 'gpio-v5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
        gpio: pca953x: hack to fix 24 bit gpio expanders
      7617c9a0
    • Linus Torvalds's avatar
      Merge tag 'sound-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · bcb46a0e
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "It might feel like deja vu to receive a bulk of changes at rc5, and it
        happens again; we've got a collection of fixes for ASoC. Most of the fixes
        are targeted for the newly merged SOF (Sound Open Firmware) stuff and
        the relevant fixes for Intel platforms.
      
        Other than that, there are a few regression fixes for the recent ASoC
        core changes and HD-audio quirk, as well as a couple of FireWire fixes
        and for other ASoC codecs"
      
      * tag 'sound-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (54 commits)
        Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops"
        ALSA: ice1712: Check correct return value to snd_i2c_sendbytes (EWS/DMX 6Fire)
        ALSA: oxfw: allow PCM capture for Stanton SCS.1m
        ALSA: firewire-motu: fix destruction of data for isochronous resources
        ASoC: Intel: sst: fix kmalloc call with wrong flags
        ASoC: core: Fix deadlock in snd_soc_instantiate_card()
        SoC: rt274: Fix internal jack assignment in set_jack callback
        ALSA: hdac: fix memory release for SST and SOF drivers
        ASoC: SOF: Intel: hda: use the defined ppcap functions
        ASoC: core: move DAI pre-links initiation to snd_soc_instantiate_card
        ASoC: Intel: cht_bsw_rt5672: fix kernel oops with platform_name override
        ASoC: Intel: cht_bsw_nau8824: fix kernel oops with platform_name override
        ASoC: Intel: bytcht_es8316: fix kernel oops with platform_name override
        ASoC: Intel: cht_bsw_max98090: fix kernel oops with platform_name override
        ASoC: sun4i-i2s: Add offset to RX channel select
        ASoC: sun4i-i2s: Fix sun8i tx channel offset mask
        ASoC: max98090: remove 24-bit format support if RJ is 0
        ASoC: da7219: Fix build error without CONFIG_I2C
        ASoC: SOF: Intel: hda: Fix COMPILE_TEST build error
        ASoC: SOF: fix DSP oops definitions in FW ABI
        ...
      bcb46a0e
    • Daniel Vetter's avatar
      Merge tag 'drm-misc-fixes-2019-06-13' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · 744ed8cb
      Daniel Vetter authored
      Sean writes:
      
      meson: A few G12A fixes across the driver (Neil)
      quirks: A couple quirks for GPD devices (Hans)
      gem_shmem: Use writecombine when vmapping non-dmabuf BOs (Boris)
      panfrost: A couple tweaks to requiring devfreq (Neil & Ezequiel)
      edid: Ensure we return the override mode when ddc probe fails (Jani)
      
      Cc: Hans de Goede <hdegoede@redhat.com>
      Cc: Neil Armstrong <narmstrong@baylibre.com>
      Cc: Boris Brezillon <boris.brezillon@collabora.com>
      Cc: Ezequiel Garcia <ezequiel@collabora.com>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      From: Sean Paul <sean@poorly.run>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190613143946.GA24233@art_vandelay
      744ed8cb
    • Hui Wang's avatar
      Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops" · 17d30460
      Hui Wang authored
      This reverts commit 9cb40eb1.
      
      This patch introduces noise and headphone playback issue after
      rebooting or suspending/resuming. Let us revert it.
      
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=203831
      Fixes: 9cb40eb1 ("ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Hui Wang <hui.wang@canonical.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      17d30460
    • Dan Williams's avatar
      mm/devm_memremap_pages: fix final page put race · 50f44ee7
      Dan Williams authored
      Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
      drops all the page references that were acquired at init, and then
      immediately proceeds to unplug, arch_remove_memory(), the backing pages
      for the pagemap.  If for some reason device shutdown actually collides
      with a busy / elevated-ref-count page then arch_remove_memory() should
      be deferred until after that reference is dropped.
      
      As it stands the "wait for last page ref drop" happens *after*
      devm_memremap_pages_release() returns, which is obviously too late and
      can lead to crashes.
      
      Fix this situation by assigning the responsibility to wait for the
      percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
      callback.  Implement the new cleanup callback for all
      devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.
      
      Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50f44ee7
    • Dan Williams's avatar
      PCI/P2PDMA: track pgmap references per resource, not globally · 1570175a
      Dan Williams authored
      In preparation for fixing a race between devm_memremap_pages_release()
      and the final put of a page from the device-page-map, allocate a
      percpu-ref per p2pdma resource mapping.
      
      Link: http://lkml.kernel.org/r/155727338646.292046.9922678317501435597.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1570175a
    • Dan Williams's avatar
      lib/genalloc: introduce chunk owners · 795ee306
      Dan Williams authored
      The p2pdma facility enables a provider to publish a pool of dma
      addresses for a consumer to allocate.  A genpool is used internally by
      p2pdma to collect dma resources, 'chunks', to be handed out to
      consumers.  Whenever a consumer allocates a resource it needs to pin the
      'struct dev_pagemap' instance that backs the chunk selected by
      pci_alloc_p2pmem().
      
      Currently that reference is taken globally on the entire provider
      device.  That sets up a lifetime mismatch whereby the p2pdma core needs
      to maintain hacks to make sure the percpu_ref is not released twice.
      
      This lifetime mismatch also stands in the way of a fix to
      devm_memremap_pages() whereby devm_memremap_pages_release() must wait for
      the percpu_ref ->release() callback to complete before it can proceed to
      teardown pages.
      
      So, towards fixing this situation, introduce the ability to store a 'chunk
      owner' at gen_pool_add() time, and a facility to retrieve the owner at
      gen_pool_{alloc,free}() time.  For p2pdma this will be used to store and
      recall individual dev_pagemap reference counter instances per-chunk.
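
      The chunk-owner idea can be illustrated with a toy pool in plain C.
      This is only a sketch under stated assumptions: every sk_* name is
      hypothetical, the allocator is a trivial bump allocator, and none of
      it is the genalloc implementation or its API. It shows an opaque
      owner pointer stored when a chunk is added and handed back at
      allocation and free-time lookup.

```c
#include <stddef.h>

#define SK_MAX_CHUNKS 4 /* arbitrary limit for this sketch */

/* One contiguous address range plus the opaque owner stored with it. */
struct sk_chunk {
	unsigned long start;
	size_t size;
	size_t used;  /* bump-allocation cursor */
	void *owner;  /* e.g. a per-chunk reference counter (hypothetical) */
};

struct sk_pool {
	struct sk_chunk chunks[SK_MAX_CHUNKS];
	size_t nr;
};

/* Add a chunk and remember its owner. Returns 0 on success, -1 if full. */
static int sk_pool_add_owner(struct sk_pool *pool, unsigned long start,
			     size_t size, void *owner)
{
	struct sk_chunk *c;

	if (pool->nr == SK_MAX_CHUNKS)
		return -1;
	c = &pool->chunks[pool->nr++];
	c->start = start;
	c->size = size;
	c->used = 0;
	c->owner = owner;
	return 0;
}

/*
 * Allocate from the first chunk with enough room and report that chunk's
 * owner through *owner. Returns the address, or 0 on failure.
 */
static unsigned long sk_pool_alloc_owner(struct sk_pool *pool, size_t size,
					 void **owner)
{
	size_t i;

	for (i = 0; i < pool->nr; i++) {
		struct sk_chunk *c = &pool->chunks[i];

		if (c->size - c->used >= size) {
			unsigned long addr = c->start + c->used;

			c->used += size;
			if (owner)
				*owner = c->owner;
			return addr;
		}
	}
	if (owner)
		*owner = NULL;
	return 0;
}

/* Free-time lookup: owner of the chunk covering addr, or NULL if none. */
static void *sk_pool_owner_of(const struct sk_pool *pool, unsigned long addr)
{
	size_t i;

	for (i = 0; i < pool->nr; i++) {
		const struct sk_chunk *c = &pool->chunks[i];

		if (addr >= c->start && addr - c->start < c->size)
			return c->owner;
	}
	return NULL;
}
```

      A caller that adds two chunks with different owners gets the
      matching owner back for each allocation, which is what lets a
      reference be taken and dropped per chunk instead of per device.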
      
      Link: http://lkml.kernel.org/r/155727338118.292046.13407378933221579644.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      795ee306
    • Dan Williams's avatar
      PCI/P2PDMA: fix the gen_pool_add_virt() failure path · e615a191
      Dan Williams authored
      The pci_p2pdma_add_resource() implementation immediately frees the pgmap
      if gen_pool_add_virt() fails.  However, that means that when @dev
      triggers a devres release devm_memremap_pages_release() will crash
      trying to access the freed @pgmap.
      
      Use the new devm_memunmap_pages() to manually free the mapping in the
      error path.
      
      Link: http://lkml.kernel.org/r/155727337603.292046.13101332703665246702.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Fixes: 52916982 ("PCI/P2PDMA: Support peer-to-peer memory")
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e615a191
    • Dan Williams's avatar
      mm/devm_memremap_pages: introduce devm_memunmap_pages · 2e3f139e
      Dan Williams authored
      Use the new devm_release_action() facility to allow
      devm_memremap_pages_release() to be manually triggered.
      
      Link: http://lkml.kernel.org/r/155727337088.292046.5774214552136776763.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e3f139e
    • Dan Williams's avatar
      drivers/base/devres: introduce devm_release_action() · 2374b682
      Dan Williams authored
      Patch series "mm/devm_memremap_pages: Fix page release race", v2.
      
      Logan audited the devm_memremap_pages() shutdown path and noticed that
      it was possible to proceed to arch_remove_memory() before all potential
      page references have been reaped.
      
      Introduce a new ->cleanup() callback to do the work of waiting for any
      straggling page references and then perform the percpu_ref_exit() in
      devm_memremap_pages_release() context.
      
      For p2pdma this involves some deeper reworks to reference count
      resources on a per-instance basis rather than a per pci-device basis.  A
      modified genalloc api is introduced to convey a driver-private pointer
      through gen_pool_{alloc,free}() interfaces.  Also, a
      devm_memunmap_pages() api is introduced since p2pdma does not
      auto-release resources on a setup failure.
      
      The dax and pmem changes pass the nvdimm unit tests, and the p2pdma
      changes should now pass testing with the pci_p2pdma_release() fix.
      Jérôme, how does this look for HMM?
      
      This patch (of 6):
      
      The devm_add_action() facility allows a resource allocation routine to
      add custom devm semantics.  One such user is devm_memremap_pages().
      
      There is now a need to manually trigger
      devm_memremap_pages_release().  Introduce devm_release_action() so the
      release action can be triggered via a new devm_memunmap_pages() api in a
      follow-on change.
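
      The difference between letting an action run at device teardown and
      triggering it by hand can be sketched with a toy action list. This
      is an illustration only, assuming a fixed-size array and hypothetical
      sk_* names; it is not the devres implementation. The point is that
      the manual release both runs the callback and unlinks it, so
      teardown does not run it a second time.

```c
#include <stddef.h>

#define SK_MAX_ACTIONS 8 /* arbitrary limit for this sketch */

typedef void (*sk_action_fn)(void *data);

struct sk_action {
	sk_action_fn fn;
	void *data;
};

/* Stand-in for a device carrying a list of deferred release actions. */
struct sk_device {
	struct sk_action actions[SK_MAX_ACTIONS];
	size_t nr;
};

/* Register fn(data) to run at teardown. Returns 0, or -1 if full. */
static int sk_add_action(struct sk_device *dev, sk_action_fn fn, void *data)
{
	if (dev->nr == SK_MAX_ACTIONS)
		return -1;
	dev->actions[dev->nr].fn = fn;
	dev->actions[dev->nr].data = data;
	dev->nr++;
	return 0;
}

/*
 * Manual trigger: find the matching action, unlink it, and run it now.
 * Returns 0 if the action was found, -1 otherwise.
 */
static int sk_release_action(struct sk_device *dev, sk_action_fn fn,
			     void *data)
{
	size_t i, j;

	for (i = 0; i < dev->nr; i++) {
		if (dev->actions[i].fn == fn && dev->actions[i].data == data) {
			for (j = i + 1; j < dev->nr; j++)
				dev->actions[j - 1] = dev->actions[j];
			dev->nr--;
			fn(data);
			return 0;
		}
	}
	return -1;
}

/* Teardown: run whatever is still registered, newest first. */
static void sk_release_all(struct sk_device *dev)
{
	while (dev->nr > 0) {
		struct sk_action a = dev->actions[--dev->nr];

		a.fn(a.data);
	}
}

/* Example action: counts how many times it has been run. */
static void sk_count_release(void *data)
{
	++*(int *)data;
}
```

      In an error path, a caller would invoke sk_release_action() for the
      one mapping that failed to set up, leaving the other registered
      actions to run at normal teardown.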
      
      Link: http://lkml.kernel.org/r/155727336530.292046.2926860263201336366.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2374b682
    • Minchan Kim's avatar
      mm/vmscan.c: fix trying to reclaim unevictable LRU page · a58f2cef
      Minchan Kim authored
      There was the below bug report from Wu Fangsuo.
      
      On the CMA allocation path, isolate_migratepages_range() could isolate
      unevictable LRU pages and reclaim_clean_pages_from_list() can try to
      reclaim them if they are clean file-backed pages.
      
        page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
        flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
        raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
        raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
        page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
        page->mem_cgroup:ffffffc09bf3ee80
        ------------[ cut here ]------------
        kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S              4.14.81 #3
        Hardware name: ASR AQUILAC EVB (DT)
        task: ffffffc00a54cd00 task.stack: ffffffc009b00000
        PC is at shrink_page_list+0x1998/0x3240
        LR is at shrink_page_list+0x1998/0x3240
        pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045
        sp : ffffffc009b05940
        ..
           shrink_page_list+0x1998/0x3240
           reclaim_clean_pages_from_list+0x3c0/0x4f0
           alloc_contig_range+0x3bc/0x650
           cma_alloc+0x214/0x668
           ion_cma_allocate+0x98/0x1d8
           ion_alloc+0x200/0x7e0
           ion_ioctl+0x18c/0x378
           do_vfs_ioctl+0x17c/0x1780
           SyS_ioctl+0xac/0xc0
      
      Wu found it's due to commit ad6b6704 ("mm: remove SWAP_MLOCK in
      ttu").  Before that, unevictable pages went to cull_mlocked, so we
      could not reach the VM_BUG_ON_PAGE line.
      
      To fix the issue, this patch filters out unevictable LRU pages from the
      reclaim_clean_pages_from_list in CMA.
      
      Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
      Fixes: ad6b6704 ("mm: remove SWAP_MLOCK in ttu")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a58f2cef
    • Andrea Arcangeli's avatar
      coredump: fix race condition between collapse_huge_page() and core dumping · 59ea6d06
      Andrea Arcangeli authored
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem to the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59ea6d06
    • swkhack's avatar
      mm/mlock.c: change count_mm_mlocked_page_nr return type · 0874bb49
      swkhack authored
      On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may be
      negative when using 32 bit ints and the "count >> PAGE_SHIFT"'s result
      will be wrong.  So change the local variable and return value to
      unsigned long to fix the problem.
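
      The truncation described above can be reproduced in user space. The
      following is a minimal sketch, not the kernel function: the sk_*
      names are hypothetical, a 4 KiB page size is assumed, and the
      VMA is reduced to a start/end pair. It contrasts a byte total kept
      in an int with one kept in an unsigned long.

```c
#include <stddef.h>

#define SK_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Reduced stand-in for a VMA: only the address range matters here. */
struct sk_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Old shape: the byte total lives in an int. A range larger than INT_MAX
 * bytes does not fit, so the stored value (implementation-defined,
 * typically negative) yields a wrong page count after the shift.
 */
static int sk_count_pages_int(const struct sk_vma *vmas, size_t n)
{
	int count = 0;
	size_t i;

	for (i = 0; i < n; i++)
		count += vmas[i].vm_end - vmas[i].vm_start;
	return count >> SK_PAGE_SHIFT;
}

/* Fixed shape: both the accumulator and the return type are unsigned long. */
static unsigned long sk_count_pages_ulong(const struct sk_vma *vmas, size_t n)
{
	unsigned long count = 0;
	size_t i;

	for (i = 0; i < n; i++)
		count += vmas[i].vm_end - vmas[i].vm_start;
	return count >> SK_PAGE_SHIFT;
}
```

      For a single 3 GiB range (0 to 0xC0000000), the unsigned long
      version returns 786432 pages, while the int version cannot
      represent the byte total and returns a different value.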
      
      Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com
      Fixes: 0cf2f6f6 ("mm: mlock: check against vma for actual mlock() size")
      Signed-off-by: swkhack <swkhack@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0874bb49
    • Yang Shi's avatar
      mm: mmu_gather: remove __tlb_reset_range() for force flush · 7a30df49
      Yang Shi authored
      A few new fields were added to mmu_gather to make TLB flushing smarter
      for huge pages by recording which level of the page table changed.
      
      __tlb_reset_range() resets all of this page table state to unchanged.
      It is called by the TLB flush when mapping changes for the same range
      run in parallel under a non-exclusive lock (i.e.  read mmap_sem).
      
      Before commit dd2283f2 ("mm: mmap: zap pages with read mmap_sem in
      munmap"), the syscalls (e.g.  MADV_DONTNEED, MADV_FREE) which may update
      PTEs in parallel didn't remove page tables.  But the aforementioned
      commit may do munmap() under read mmap_sem and free page tables.  This
      may result in a program hang on aarch64, as reported by Jan Stancek.
      The problem can be reproduced with his test program, slightly modified,
      below.
      
      ---8<---
      
      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      
      static int map_size = 4096;
      static int num_iter = 500;
      static long threads_total;
      
      static void *distant_area;
      
      void *map_write_unmap(void *ptr)
      {
      	int *fd = ptr;
      	unsigned char *map_address;
      	int i, j = 0;
      
      	for (i = 0; i < num_iter; i++) {
      		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
      			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (map_address == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      
      		for (j = 0; j < map_size; j++)
      			map_address[j] = 'b';
      
      		if (munmap(map_address, map_size) == -1) {
      			perror("munmap");
      			exit(1);
      		}
      	}
      
      	return NULL;
      }
      
      void *dummy(void *ptr)
      {
      	return NULL;
      }
      
      int main(void)
      {
      	pthread_t thid[2];
      
      	/* hint for mmap in map_write_unmap() */
      	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
      			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
      	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
      	distant_area += DISTANT_MMAP_SIZE / 2;
      
      	while (1) {
      		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
      		pthread_create(&thid[1], NULL, dummy, NULL);
      
      		pthread_join(thid[0], NULL);
      		pthread_join(thid[1], NULL);
      	}
      }
      ---8<---
      
      The program may bring in parallel execution like below:
      
              t1                                        t2
      munmap(map_address)
        downgrade_write(&mm->mmap_sem);
        unmap_region()
        tlb_gather_mmu()
          inc_tlb_flush_pending(tlb->mm);
        free_pgtables()
          tlb->freed_tables = 1
          tlb->cleared_pmds = 1
      
                                              pthread_exit()
                                              madvise(thread_stack, 8M, MADV_DONTNEED)
                                                zap_page_range()
                                                  tlb_gather_mmu()
                                                    inc_tlb_flush_pending(tlb->mm);
      
        tlb_finish_mmu()
          if (mm_tlb_flush_nested(tlb->mm))
            __tlb_reset_range()
      
      __tlb_reset_range() would reset the freed_tables and cleared_* bits, but
      this may cause inconsistency for munmap(), which does free page tables.
      As a result, some architectures, e.g.  aarch64, may not flush the TLB
      completely as expected, leaving stale TLB entries behind.
      
      Use fullmm flush since it yields much better performance on aarch64 and
      non-fullmm doesn't yield a significant difference on x86.
      
      The originally proposed fix came from Jan Stancek, who mainly debugged
      this issue; I just wrapped everything up together.
      
      Jan's testing results:
      
      v5.2-rc2-24-gbec7550c
      --------------------------
               mean     stddev
      real    37.382   2.780
      user     1.420   0.078
      sys     54.658   1.855
      
      v5.2-rc2-24-gbec7550c + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
      ---------------------------------------------------------------------------------------
               mean     stddev
      real    37.119   2.105
      user     1.548   0.087
      sys     55.698   1.357
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd2283f2 ("mm: mmap: zap pages with read mmap_sem in munmap")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Jan Stancek <jstancek@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a30df49
    • Wengang Wang's avatar
      fs/ocfs2: fix race in ocfs2_dentry_attach_lock() · be99ca27
      Wengang Wang authored
      ocfs2_dentry_attach_lock() can be executed in parallel threads against the
      same dentry.  Make that race safe.  The race is like this:
      
                  thread A                               thread B
      
      (A1) enter ocfs2_dentry_attach_lock,
      seeing dentry->d_fsdata is NULL,
      and no alias found by
      ocfs2_find_local_alias, so kmalloc
      a new ocfs2_dentry_lock structure
      to local variable "dl", dl1
      
                     .....
      
                                          (B1) enter ocfs2_dentry_attach_lock,
                                          seeing dentry->d_fsdata is NULL,
                                          and no alias found by
                                          ocfs2_find_local_alias so kmalloc
                                          a new ocfs2_dentry_lock structure
                                          to local variable "dl", dl2.
      
                                                         ......
      
      (A2) set dentry->d_fsdata with dl1,
      call ocfs2_dentry_lock() and increase
      dl1->dl_lockres.l_ro_holders to 1 on
      success.
                    ......
      
                                          (B2) set dentry->d_fsdata with dl2,
                                          call ocfs2_dentry_lock() and increase
                                          dl2->dl_lockres.l_ro_holders to 1 on
                                          success.
      
                                                        ......
      
      (A3) call ocfs2_dentry_unlock()
      and decrease
      dl2->dl_lockres.l_ro_holders to 0
      on success.
                   ....
      
                                          (B3) call ocfs2_dentry_unlock(),
                                          decreasing
                                          dl2->dl_lockres.l_ro_holders, but
                                          see it's zero now, panic
      
      Link: http://lkml.kernel.org/r/20190529174636.22364-1-wen.gang.wang@oracle.com
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Reported-by: Daniel Sobe <daniel.sobe@nxp.com>
      Tested-by: Daniel Sobe <daniel.sobe@nxp.com>
      Reviewed-by: Changwei Ge <gechangwei@live.cn>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be99ca27
    • Kirill Tkhai's avatar
      mm/vmscan.c: fix recent_rotated history · b17f18af
      Kirill Tkhai authored
      Johannes pointed out that after commit 886cf190 ("mm: move
      recent_rotated pages calculation to shrink_inactive_list()") we lost all
      zone_reclaim_stat::recent_rotated history.
      
      This fixes it.
      
      Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain
      Fixes: 886cf190 ("mm: move recent_rotated pages calculation to shrink_inactive_list()")
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b17f18af
    • Potyra, Stefan's avatar
      mm/mlock.c: mlockall error for flag MCL_ONFAULT · dedca635
      Potyra, Stefan authored
      If mlockall() is called with only MCL_ONFAULT as flag, it removes any
      previously applied lockings and does nothing else.
      
      This behavior is counter-intuitive and doesn't match the Linux man page.
      
        For mlockall():
      
        EINVAL Unknown flags were specified or MCL_ONFAULT was specified
        without either MCL_FUTURE or MCL_CURRENT.
      
      Consequently, return the error EINVAL, if only MCL_ONFAULT is passed.
      That way, applications will at least detect that they are calling
      mlockall() incorrectly.
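
      The flag rule quoted from the man page can be written as a small
      validity check. This is a user-space sketch under stated
      assumptions, not the kernel code: the function name and the SK_MCL_*
      constants are hypothetical stand-ins for the real MCL_* flags, and
      only the validation step is modeled.

```c
#include <errno.h>

/* Hypothetical stand-ins for MCL_CURRENT, MCL_FUTURE and MCL_ONFAULT. */
#define SK_MCL_CURRENT 1
#define SK_MCL_FUTURE  2
#define SK_MCL_ONFAULT 4

#define SK_MCL_ALL (SK_MCL_CURRENT | SK_MCL_FUTURE | SK_MCL_ONFAULT)

/*
 * Return 0 if the flag combination is acceptable, -EINVAL otherwise.
 * Rejected: no flags, unknown flags, and SK_MCL_ONFAULT on its own,
 * since on-fault locking only modifies CURRENT and/or FUTURE locking.
 */
static int sk_check_mlockall_flags(int flags)
{
	if (!flags || (flags & ~SK_MCL_ALL))
		return -EINVAL;
	if (flags == SK_MCL_ONFAULT)
		return -EINVAL;
	return 0;
}
```

      With this check, SK_MCL_ONFAULT alone is reported as an error
      rather than silently dropping existing locks, while combinations
      such as SK_MCL_CURRENT | SK_MCL_ONFAULT remain valid.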
      
      Link: http://lkml.kernel.org/r/20190527075333.GA6339@er01809n.ebgroup.elektrobit.com
      Fixes: b0f205c2 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
      Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dedca635
    • Manuel Traut's avatar
      scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE · c04e32e9
      Manuel Traut authored
      At least for ARM64 kernels compiled with the cross toolchain from
      Debian/stretch or with the toolchain from kernel.org, the line number is
      not decoded correctly by 'decode_stacktrace.sh':
      
        $ echo "[  136.513051]  f1+0x0/0xc [kcrash]" | \
          CROSS_COMPILE=/opt/gcc-8.1.0-nolibc/aarch64-linux/bin/aarch64-linux- \
          ./scripts/decode_stacktrace.sh /scratch/linux-arm64/vmlinux \
                                        /scratch/linux-arm64 \
                                        /nfs/debian/lib/modules/4.20.0-devel
        [  136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:68) kcrash
      
      If addr2line from the toolchain is used, the decoded line number is correct:
      
        [  136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:57) kcrash
      
      Link: http://lkml.kernel.org/r/20190527083425.3763-1-manut@linutronix.de
      Signed-off-by: Manuel Traut <manut@linutronix.de>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c04e32e9
    • Shakeel Butt's avatar
      mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node · 3510955b
      Shakeel Butt authored
      Syzbot reported the following memory leak:
      
      ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79
      BUG: memory leak
      unreferenced object 0xffff888114f26040 (size 32):
        comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
        hex dump (first 32 bytes):
          40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff  @`......@`......
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           slab_post_alloc_hook mm/slab.h:439 [inline]
           slab_alloc mm/slab.c:3326 [inline]
           kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
           kmalloc include/linux/slab.h:547 [inline]
           __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
           memcg_init_list_lru_node mm/list_lru.c:375 [inline]
           memcg_init_list_lru mm/list_lru.c:459 [inline]
           __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
           alloc_super+0x2e0/0x310 fs/super.c:269
           sget_userns+0x94/0x2a0 fs/super.c:609
           sget+0x8d/0xb0 fs/super.c:660
           mount_nodev+0x31/0xb0 fs/super.c:1387
           fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
           legacy_get_tree+0x27/0x80 fs/fs_context.c:661
           vfs_get_tree+0x2e/0x120 fs/super.c:1476
           do_new_mount fs/namespace.c:2790 [inline]
           do_mount+0x932/0xc50 fs/namespace.c:3110
           ksys_mount+0xab/0x120 fs/namespace.c:3319
           __do_sys_mount fs/namespace.c:3333 [inline]
           __se_sys_mount fs/namespace.c:3330 [inline]
           __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
           do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is a simple off-by-one bug on the error path.
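
      As a generic illustration of this class of bug (not the code in
      mm/list_lru.c), the sketch below populates a range of slots and
      unwinds on allocation failure. All names (fake_alloc, init_range,
      destroy_range, the buggy switch) are hypothetical. The cleanup helper
      takes a half-open range, so passing i - 1 instead of i as the upper
      bound leaves the last successfully allocated slot unreleased.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping so the sketch can observe leaks. */
int live_objects;       /* objects currently allocated */
int fail_at = -1;       /* index at which fake_alloc() fails; -1 = never */

void *fake_alloc(int index)
{
    void *p;

    if (index == fail_at)
        return NULL;
    p = malloc(32);
    if (p)
        live_objects++;
    return p;
}

/* Release the slots in the half-open range [begin, end). */
void destroy_range(void **slots, int begin, int end)
{
    for (int i = begin; i < end; i++) {
        if (slots[i]) {
            free(slots[i]);
            live_objects--;
        }
        slots[i] = NULL;
    }
}

/* Populate slots [begin, end). On failure, undo what was allocated.
 * With buggy != 0 the unwind uses the off-by-one upper bound. */
int init_range(void **slots, int begin, int end, int buggy)
{
    int i;

    for (i = begin; i < end; i++) {
        slots[i] = fake_alloc(i);
        if (!slots[i])
            goto fail;
    }
    return 0;
fail:
    /* Slots begin..i-1 are populated, so the exclusive bound is i. */
    destroy_range(slots, begin, buggy ? i - 1 : i);
    return -1;
}
```

      With fail_at set to 5, the correct unwind leaves live_objects at 0,
      while the buggy variant leaves one object (slot 4) allocated.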
      
      Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3510955b
    • Johannes Weiner's avatar
      mm: memcontrol: don't batch updates of local VM stats and events · 815744d7
      Johannes Weiner authored
      The kernel test robot noticed a 26% will-it-scale pagefault regression
      from commit 42a30035 ("mm: memcontrol: fix recursive statistics
      correctness & scalabilty").  This appears to be caused by bouncing the
      additional cachelines from the new hierarchical statistics counters.
      
      We can fix this by getting rid of the batched local counters instead.
      
      Originally, there were *only* group-local counters, and they were fully
      maintained per cpu.  A reader of a stats file high up in the cgroup tree
      would have to walk the entire subtree and collect each level's per-cpu
      counters to get the recursive view.  This was prohibitively expensive,
      and so we switched to per-cpu batched updates of the local counters
      during a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting"), reducing the complexity from nr_subgroups *
      nr_cpus to nr_subgroups.
      
      With growing machines and cgroup trees, the tree walk itself became too
      expensive for monitoring top-level groups, and this is when the culprit
      patch added hierarchy counters on each cgroup level.  When the per-cpu
      batch size would be reached, both the local and the hierarchy counters
      would get batch-updated from the per-cpu delta simultaneously.
      
      This makes local and hierarchical counter reads blazingly fast, but it
      unfortunately makes the write side too cacheline-intensive.
      
      Since local counter reads were never a problem - we only centralized
      them to accelerate the hierarchy walk - and use of the local counters
      is becoming rarer due to replacement with hierarchical views (ongoing
      rework in the page reclaim and workingset code), we can make those local
      counters unbatched per-cpu counters again.
      
      The scheme will then be as such:
      
         when a memcg statistic changes, the writer will:
         - update the local counter (per-cpu)
         - update the batch counter (per-cpu). If the batch is full:
           - spill the batch into the group's atomic_t
           - spill the batch into all ancestors' atomic_ts
           - empty out the batch counter (per-cpu)
      
         when a local memcg counter is read, the reader will:
         - collect the local counter from all cpus
      
         when a hierarchy memcg counter is read, the reader will:
         - read the atomic_t
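
      The scheme above can be modelled in a few lines of single-threaded C.
      This is a sketch under stated assumptions, not the memcontrol code:
      struct group, stat_mod(), read_local() and read_hier() are
      hypothetical names, NR_CPUS and BATCH are made-up values, "full" is
      taken to mean the pending delta exceeds BATCH in magnitude, and a
      plain long stands in for the atomic_t.

```c
#include <stdlib.h>

#define NR_CPUS 4       /* hypothetical CPU count for the model */
#define BATCH   32      /* hypothetical batch threshold */

struct group {
    struct group *parent;
    long local[NR_CPUS];    /* unbatched per-cpu local counter */
    long batch[NR_CPUS];    /* per-cpu delta not yet spilled */
    long hier;              /* stands in for the group's atomic_t */
};

/* Writer side: called when a statistic changes on the given cpu. */
void stat_mod(struct group *g, int cpu, long delta)
{
    g->local[cpu] += delta;             /* update the local counter */
    g->batch[cpu] += delta;             /* update the batch counter */

    if (labs(g->batch[cpu]) > BATCH) {  /* batch is full */
        long pending = g->batch[cpu];

        /* Spill into the group itself and all of its ancestors. */
        for (struct group *p = g; p; p = p->parent)
            p->hier += pending;
        g->batch[cpu] = 0;              /* empty out the batch */
    }
}

/* Local read: collect the local counter from all cpus. */
long read_local(const struct group *g)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += g->local[cpu];
    return sum;
}

/* Hierarchy read: a single load of the shared counter. */
long read_hier(const struct group *g)
{
    return g->hier;
}
```

      In this model the local view is always exact, while the hierarchical
      view can lag by at most BATCH per cpu until the next spill.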
      
      We might be able to simplify this further and make the recursive
      counters unbatched per-cpu counters as well (batch upward propagation,
      but leave per-cpu collection to the readers), but that will require a
      more in-depth analysis and testing of all the callsites.  Deal with the
      immediate regression for now.
      
      Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Tested-by: kernel test robot <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      815744d7