1. 25 Feb, 2016 34 commits
    • klist: fix starting point removed bug in klist iterators · 5b27adfa
      James Bottomley authored
      commit 00cd29b7 upstream.
      
      The starting node for a klist iteration is often passed in from
      somewhere way above the klist infrastructure, meaning there's no
      guarantee the node is still on the list.  We've seen this in SCSI where
      we use bus_find_device() to iterate through a list of devices.  In the
      face of heavy hotplug activity, the last device returned by
      bus_find_device() can be removed before the next call.  This leads to
      
      Dec  3 13:22:02 localhost kernel: WARNING: CPU: 2 PID: 28073 at include/linux/kref.h:47 klist_iter_init_node+0x3d/0x50()
      Dec  3 13:22:02 localhost kernel: Modules linked in: scsi_debug x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel joydev iTCO_wdt dcdbas ipmi_devintf acpi_power_meter iTCO_vendor_support ipmi_si imsghandler pcspkr wmi acpi_cpufreq tpm_tis tpm shpchp lpc_ich mfd_core nfsd nfs_acl lockd grace sunrpc tg3 ptp pps_core
      Dec  3 13:22:02 localhost kernel: CPU: 2 PID: 28073 Comm: cat Not tainted 4.4.0-rc1+ #2
      Dec  3 13:22:02 localhost kernel: Hardware name: Dell Inc. PowerEdge R320/08VT7V, BIOS 2.0.22 11/19/2013
      Dec  3 13:22:02 localhost kernel: ffffffff81a20e77 ffff880613acfd18 ffffffff81321eef 0000000000000000
      Dec  3 13:22:02 localhost kernel: ffff880613acfd50 ffffffff8107ca52 ffff88061176b198 0000000000000000
      Dec  3 13:22:02 localhost kernel: ffffffff814542b0 ffff880610cfb100 ffff88061176b198 ffff880613acfd60
      Dec  3 13:22:02 localhost kernel: Call Trace:
      Dec  3 13:22:02 localhost kernel: [<ffffffff81321eef>] dump_stack+0x44/0x55
      Dec  3 13:22:02 localhost kernel: [<ffffffff8107ca52>] warn_slowpath_common+0x82/0xc0
      Dec  3 13:22:02 localhost kernel: [<ffffffff814542b0>] ? proc_scsi_show+0x20/0x20
      Dec  3 13:22:02 localhost kernel: [<ffffffff8107cb4a>] warn_slowpath_null+0x1a/0x20
      Dec  3 13:22:02 localhost kernel: [<ffffffff8167225d>] klist_iter_init_node+0x3d/0x50
      Dec  3 13:22:02 localhost kernel: [<ffffffff81421d41>] bus_find_device+0x51/0xb0
      Dec  3 13:22:02 localhost kernel: [<ffffffff814545ad>] scsi_seq_next+0x2d/0x40
      [...]
      
      And an eventual crash. It can actually occur in any hotplug system
      which has a device finder and a starting device.
      
      We can fix this globally by making sure the starting node for
      klist_iter_init_node() is actually a member of the list before using it
      (and by starting from the beginning if it isn't).
      
      Reported-by: Ewan D. Milne <emilne@redhat.com>
      Tested-by: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
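The idea of validating the starting node can be sketched in plain C. This is a minimal userspace model, not the klist code: `struct node`, `struct list_iter`, `get_unless_zero()`, `iter_init_node()` and `iter_next()` are hypothetical names, and a reference count of zero is assumed to mean "already removed from the list".

```c
#include <stddef.h>

/* Hypothetical list node: refcount == 0 is assumed to mean "removed". */
struct node {
    int refcount;
    struct node *next;
};

/* Hypothetical iterator: cur == NULL means "start from the head". */
struct list_iter {
    struct node *head;
    struct node *cur;
};

/* Take a reference only if the node is still alive. */
static int get_unless_zero(struct node *n)
{
    if (n->refcount == 0)
        return 0;
    n->refcount++;
    return 1;
}

/* Use the caller's starting node only if it is still a list member;
 * otherwise fall back to iterating from the beginning. */
static void iter_init_node(struct list_iter *it, struct node *head,
                           struct node *start)
{
    it->head = head;
    it->cur = NULL;
    if (start && get_unless_zero(start))
        it->cur = start;
}

/* Advance to the next live node, moving the iterator's reference. */
static struct node *iter_next(struct list_iter *it)
{
    struct node *prev = it->cur;
    struct node *next = prev ? prev->next : it->head;

    while (next && next->refcount == 0)   /* skip removed nodes */
        next = next->next;
    if (prev)
        prev->refcount--;                 /* drop the old reference */
    if (next)
        next->refcount++;                 /* pin the new position */
    it->cur = next;
    return next;
}
```

With a removed node passed as `start`, `iter_init_node()` leaves `cur` as NULL and the first `iter_next()` call returns the head of the list instead of touching the stale node.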
    • tracepoints: Do not trace when cpu is offline · 152fb022
      Steven Rostedt (Red Hat) authored
      commit f3775549 upstream.
      
      The tracepoint infrastructure uses RCU sched protection to enable and
      disable tracepoints safely. There are some instances where tracepoints are
      used in infrastructure code (like kfree()) that get called after a CPU is
      going offline, and perhaps when it is coming back online but hasn't been
      registered yet.
      
      This can produce the following warning:
      
       [ INFO: suspicious RCU usage. ]
       4.4.0-00006-g0fe53e8-dirty #34 Tainted: G S
       -------------------------------
       include/trace/events/kmem.h:141 suspicious rcu_dereference_check() usage!
      
       other info that might help us debug this:
      
       RCU used illegally from offline CPU!  rcu_scheduler_active = 1, debug_locks = 1
       no locks held by swapper/8/0.
      
       stack backtrace:
        CPU: 8 PID: 0 Comm: swapper/8 Tainted: G S              4.4.0-00006-g0fe53e8-dirty #34
        Call Trace:
        [c0000005b76c78d0] [c0000000008b9540] .dump_stack+0x98/0xd4 (unreliable)
        [c0000005b76c7950] [c00000000010c898] .lockdep_rcu_suspicious+0x108/0x170
        [c0000005b76c79e0] [c00000000029adc0] .kfree+0x390/0x440
        [c0000005b76c7a80] [c000000000055f74] .destroy_context+0x44/0x100
        [c0000005b76c7b00] [c0000000000934a0] .__mmdrop+0x60/0x150
        [c0000005b76c7b90] [c0000000000e3ff0] .idle_task_exit+0x130/0x140
        [c0000005b76c7c20] [c000000000075804] .pseries_mach_cpu_die+0x64/0x310
        [c0000005b76c7cd0] [c000000000043e7c] .cpu_die+0x3c/0x60
        [c0000005b76c7d40] [c0000000000188d8] .arch_cpu_idle_dead+0x28/0x40
        [c0000005b76c7db0] [c000000000101e6c] .cpu_startup_entry+0x50c/0x560
        [c0000005b76c7ed0] [c000000000043bd8] .start_secondary+0x328/0x360
        [c0000005b76c7f90] [c000000000008a6c] start_secondary_prolog+0x10/0x14
      
      This warning is not a false positive either. RCU is not protecting code that
      is being executed while the CPU is offline.
      
      Instead of playing "whack-a-mole(TM)" and adding conditional statements to
      the tracepoints we find that are used in this instance, simply add a
      cpu_online() test to the tracepoint code where the tracepoint will be
      ignored if the CPU is offline.
      
      Use of raw_smp_processor_id() is fine, as there should never be a case where
      the tracepoint code goes from running on a CPU that is online and suddenly
      gets migrated to a CPU that is offline.
      
      Link: http://lkml.kernel.org/r/1455387773-4245-1-git-send-email-kda@linux-powerpc.org
      Reported-by: Denis Kirjanov <kda@linux-powerpc.org>
      Fixes: 97e1c18e ("tracing: Kernel Tracepoints")
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
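A rough userspace sketch of the "ignore the tracepoint when the CPU is offline" approach follows. Everything here is a stand-in: `sketch_cpu_online_map`, `sketch_current_cpu` (playing the role of raw_smp_processor_id()), `sketch_probe()` and `sketch_trace_event()` are hypothetical names, not the kernel's tracepoint macros.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 8

/* Hypothetical per-CPU online state and "current CPU" stand-in. */
static bool sketch_cpu_online_map[SKETCH_NR_CPUS];
static int sketch_current_cpu;
static int sketch_events_recorded;

static bool sketch_cpu_online(int cpu)
{
    return cpu >= 0 && cpu < SKETCH_NR_CPUS && sketch_cpu_online_map[cpu];
}

/* The probe that would normally run under RCU protection. */
static void sketch_probe(void)
{
    sketch_events_recorded++;
}

/* One central check: skip the probe entirely on an offline CPU,
 * rather than adding a condition to each individual call site. */
static void sketch_trace_event(void)
{
    if (!sketch_cpu_online(sketch_current_cpu))
        return;
    sketch_probe();
}
```

Putting the test in the common path is the design choice the commit message argues for: callers such as kfree() stay unchanged, and every tracepoint gets the same protection.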
    • tracing: Fix freak link error caused by branch tracer · 2fa82bbb
      Arnd Bergmann authored
      commit b33c8ff4 upstream.
      
      In my randconfig tests, I came across a bug that involves several
      components:
      
      * gcc-4.9 through at least 5.3
      * CONFIG_GCOV_PROFILE_ALL enabling -fprofile-arcs for all files
      * CONFIG_PROFILE_ALL_BRANCHES overriding every if()
      * The optimized implementation of do_div() that tries to
        replace a library call with a division by multiplication
      * code in drivers/media/dvb-frontends/zl10353.c doing
      
              u32 adc_clock = 450560; /* 45.056 MHz */
              if (state->config.adc_clock)
                      adc_clock = state->config.adc_clock;
              do_div(value, adc_clock);
      
      In this case, gcc fails to determine whether the divisor
      in do_div() is __builtin_constant_p(). In particular, it
      concludes that __builtin_constant_p(adc_clock) is false, while
      __builtin_constant_p(!!adc_clock) is true.
      
      That in turn throws off the logic in do_div(), which also uses
      __builtin_constant_p() to pick the constant-optimized division,
      and the code in ilog2(), which uses __builtin_constant_p() to
      figure out whether it knows the answer at compile time. The result
      is a link error from failing to find multiple symbols that should
      never have been called based on the __builtin_constant_p():
      
      dvb-frontends/zl10353.c:138: undefined reference to `____ilog2_NaN'
      dvb-frontends/zl10353.c:138: undefined reference to `__aeabi_uldivmod'
      ERROR: "____ilog2_NaN" [drivers/media/dvb-frontends/zl10353.ko] undefined!
      ERROR: "__aeabi_uldivmod" [drivers/media/dvb-frontends/zl10353.ko] undefined!
      
      This patch avoids the problem by changing __trace_if() to check
      whether the condition is known at compile-time to be nonzero, rather
      than checking whether it is actually a constant.
      
      I see this one link error in roughly one out of 1600 randconfig builds
      on ARM, and the patch fixes all known instances.
      
      Link: http://lkml.kernel.org/r/1455312410-1058841-1-git-send-email-arnd@arndb.de
      Acked-by: Nicolas Pitre <nico@linaro.org>
      Fixes: ab3c9c68 ("branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
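The shape of a branch-profiling condition that asks whether the truth value of the condition is known at compile time might look like the sketch below. It assumes a GNU C compiler (gcc or clang) for `__builtin_constant_p()`; `SKETCH_TRACED_COND()`, `sketch_record_branch()` and the counters are hypothetical and are not the kernel's `__trace_if()` definition.

```c
/* Hypothetical branch counters, for illustration only. */
static int sketch_branch_hits;
static int sketch_branch_misses;

/* Record whether a runtime condition was taken and pass it through. */
static int sketch_record_branch(int taken)
{
    if (taken)
        sketch_branch_hits++;
    else
        sketch_branch_misses++;
    return taken;
}

/* Ask whether the *truth value* (!!cond) is a compile-time constant.
 * If so, use it directly; otherwise evaluate the condition once and
 * record the outcome. */
#define SKETCH_TRACED_COND(cond)                                   \
    (__builtin_constant_p(!!(cond)) ? !!(cond)                     \
                                    : sketch_record_branch(!!(cond)))

/* A volatile read can never be folded, so this is always recorded. */
static int sketch_check_runtime(volatile int *p)
{
    if (SKETCH_TRACED_COND(*p))
        return 1;
    return 0;
}

/* A literal condition is known at compile time and is not recorded. */
static int sketch_check_constant(void)
{
    if (SKETCH_TRACED_COND(1))
        return 1;
    return 0;
}
```

The point of the sketch is only the argument passed to `__builtin_constant_p()`: the question asked of the compiler and the value then used in the `if` are the same expression, so the two cannot disagree.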
    • perf tools: tracepoint_error() can receive e=NULL, robustify it · 6fa74f50
      Adrian Hunter authored
      commit ec183d22 upstream.
      
      Fixes segmentation fault using, for instance:
      
        (gdb) run record -I -e intel_pt/tsc=1,noretcomp=1/u /bin/ls
        Starting program: /home/acme/bin/perf record -I -e intel_pt/tsc=1,noretcomp=1/u /bin/ls
        Missing separate debuginfos, use: dnf debuginfo-install glibc-2.22-7.fc23.x86_64
        [Thread debugging using libthread_db enabled]
        Using host libthread_db library "/lib64/libthread_db.so.1".
      
        Program received signal SIGSEGV, Segmentation fault.
        0x00000000004b9ea5 in tracepoint_error (e=0x0, err=13, sys=0x19b1370 "sched", name=0x19a5d00 "sched_switch") at util/parse-events.c:410
        (gdb) bt
        #0  0x00000000004b9ea5 in tracepoint_error (e=0x0, err=13, sys=0x19b1370 "sched", name=0x19a5d00 "sched_switch") at util/parse-events.c:410
        #1  0x00000000004b9fc5 in add_tracepoint (list=0x19a5d20, idx=0x7fffffffb8c0, sys_name=0x19b1370 "sched", evt_name=0x19a5d00 "sched_switch", err=0x0, head_config=0x0)
            at util/parse-events.c:433
        #2  0x00000000004ba334 in add_tracepoint_event (list=0x19a5d20, idx=0x7fffffffb8c0, sys_name=0x19b1370 "sched", evt_name=0x19a5d00 "sched_switch", err=0x0, head_config=0x0)
            at util/parse-events.c:498
        #3  0x00000000004bb699 in parse_events_add_tracepoint (list=0x19a5d20, idx=0x7fffffffb8c0, sys=0x19b1370 "sched", event=0x19a5d00 "sched_switch", err=0x0, head_config=0x0)
            at util/parse-events.c:936
        #4  0x00000000004f6eda in parse_events_parse (_data=0x7fffffffb8b0, scanner=0x19a49d0) at util/parse-events.y:391
        #5  0x00000000004bc8e5 in parse_events__scanner (str=0x663ff2 "sched:sched_switch", data=0x7fffffffb8b0, start_token=258) at util/parse-events.c:1361
        #6  0x00000000004bca57 in parse_events (evlist=0x19a5220, str=0x663ff2 "sched:sched_switch", err=0x0) at util/parse-events.c:1401
        #7  0x0000000000518d5f in perf_evlist__can_select_event (evlist=0x19a3b90, str=0x663ff2 "sched:sched_switch") at util/record.c:253
        #8  0x0000000000553c42 in intel_pt_track_switches (evlist=0x19a3b90) at arch/x86/util/intel-pt.c:364
        #9  0x00000000005549d1 in intel_pt_recording_options (itr=0x19a2c40, evlist=0x19a3b90, opts=0x8edf68 <record+232>) at arch/x86/util/intel-pt.c:664
        #10 0x000000000051e076 in auxtrace_record__options (itr=0x19a2c40, evlist=0x19a3b90, opts=0x8edf68 <record+232>) at util/auxtrace.c:539
        #11 0x0000000000433368 in cmd_record (argc=1, argv=0x7fffffffde60, prefix=0x0) at builtin-record.c:1264
        #12 0x000000000049bec2 in run_builtin (p=0x8fa2a8 <commands+168>, argc=5, argv=0x7fffffffde60) at perf.c:390
        #13 0x000000000049c12a in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:451
        #14 0x000000000049c278 in run_argv (argcp=0x7fffffffdcbc, argv=0x7fffffffdcb0) at perf.c:495
        #15 0x000000000049c60a in main (argc=5, argv=0x7fffffffde60) at perf.c:618
        (gdb)
      
      Intel PT attempts to find the sched:sched_switch tracepoint but that seg
      faults if tracefs is not readable, because the error reporting structure
      is null, as errors are not reported when automatically adding
      tracepoints.  Fix by checking before using.
      
      Committer note:
      
      This doesn't happen on a kernel that supports
      perf_event_attr.context_switch, which is the default way of tracking
      context switches. It only happens on older kernels, like 4.2, on a
      machine with Intel PT (e.g. Broadwell) for non-privileged users.
      
      Further info from a similar patch by Wang:
      
      The error is in tracepoint_error: it assumes the 'e' parameter is valid.
      
      However, there are many situations in which parse_events() can be
      called without a parse_events_error. See the result of
      
        $ grep 'parse_events(.*NULL)' ./tools/perf/ -r
      
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Tong Zhang <ztong@vt.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Fixes: 19658171 ("perf tools: Enhance parsing events tracepoint error output")
      Link: http://lkml.kernel.org/r/1453809921-24596-2-git-send-email-adrian.hunter@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
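"Checking before using" can be illustrated with a small stand-alone function. This is a sketch under assumptions: `struct parse_error_sketch` and `report_tracepoint_error()` are hypothetical, and the message text is invented for the example rather than taken from perf.

```c
#include <stdio.h>

/* Hypothetical error-reporting structure; callers may pass NULL
 * when they do not want errors reported. */
struct parse_error_sketch {
    char msg[128];
    int set;
};

static void report_tracepoint_error(struct parse_error_sketch *e, int err,
                                    const char *sys, const char *name)
{
    /* Nothing to fill in when the caller passed no error structure. */
    if (!e)
        return;

    snprintf(e->msg, sizeof(e->msg),
             "cannot access tracepoint %s:%s (error %d)", sys, name, err);
    e->set = 1;
}
```

Calling `report_tracepoint_error(NULL, 13, "sched", "sched_switch")` then returns quietly instead of dereferencing a null pointer, which is the situation shown in the backtrace (`e=0x0`).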
    • tools lib traceevent: Fix output of %llu for 64 bit values read on 32 bit machines · 6e50ddaf
      Steven Rostedt authored
      commit 32abc2ed upstream.
      
      When a long value is read on 32 bit machines for 64 bit output, the
      parsing needs to change "%lu" into "%llu", as the value is read
      natively.
      
      Unfortunately, if "%llu" is already there, the code will add another "l"
      to it and fail to parse it properly.
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Link: http://lkml.kernel.org/r/20151116172516.4b79b109@gandalf.local.home
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
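The format rewrite described above can be sketched as a small string helper. `widen_long_format()` is a hypothetical function written for this example, not the libtraceevent code, and it assumes the format contains a single conversion.

```c
#include <stddef.h>
#include <string.h>

/* Rewrite "%l<conv>" as "%ll<conv>", but leave a format that already
 * contains "ll" (or no 'l' at all) unchanged.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int widen_long_format(const char *fmt, char *out, size_t out_len)
{
    const char *l = strchr(fmt, 'l');
    size_t len = strlen(fmt);
    int need_extra = l && l[1] != 'l';
    size_t prefix;

    if (len + (size_t)need_extra + 1 > out_len)
        return -1;

    if (!need_extra) {
        memcpy(out, fmt, len + 1);        /* copy as-is, including NUL */
        return 0;
    }

    prefix = (size_t)(l - fmt) + 1;       /* up to and including first 'l' */
    memcpy(out, fmt, prefix);
    out[prefix] = 'l';                    /* insert the second 'l' */
    memcpy(out + prefix + 1, fmt + prefix, len - prefix + 1);
    return 0;
}
```

The check on `l[1]` is the part that matters here: without it, "%llu" would grow a third "l" and no longer parse.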
    • ptrace: use fsuid, fsgid, effective creds for fs access checks · 969624b7
      Jann Horn authored
      commit caaee623 upstream.
      
      By checking the effective credentials instead of the real UID / permitted
      capabilities, ensure that the calling process actually intended to use its
      credentials.
      
      To ensure that all ptrace checks use the correct caller credentials (e.g.
      in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
      flag), use two new flags and require one of them to be set.
      
      The problem was that when a privileged task had temporarily dropped its
      privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
      perform following syscalls with the credentials of a user, it still passed
      ptrace access checks that the user would not be able to pass.
      
      While an attacker should not be able to convince the privileged task to
      perform a ptrace() syscall, this is a problem because the ptrace access
      check is reused for things in procfs.
      
      In particular, the following somewhat interesting procfs entries only rely
      on ptrace access checks:
      
       /proc/$pid/stat - uses the check for determining whether pointers
           should be visible, useful for bypassing ASLR
       /proc/$pid/maps - also useful for bypassing ASLR
       /proc/$pid/cwd - useful for gaining access to restricted
           directories that contain files with lax permissions, e.g. in
           this scenario:
           lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
           drwx------ root root /root
           drwxr-xr-x root root /root/foobar
           -rw-r--r-- root root /root/foobar/secret
      
      Therefore, on a system where a root-owned mode 6755 binary changes its
      effective credentials as described and then dumps a user-specified file,
      this could be used by an attacker to reveal the memory layout of root's
      processes or reveal the contents of files he is not allowed to access
      (through /proc/$pid/cwd).
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
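A simplified model of "two flags, one of which must be set" is sketched below. The flag names and values (`SKETCH_MODE_FSCREDS`, `SKETCH_MODE_REALCREDS`), `struct creds_sketch` and `sketch_access_check()` are hypothetical, the uid comparison stands in for the full credential check, and rejecting the case where both flags are set is an assumption of this sketch.

```c
#include <errno.h>

/* Hypothetical flag values, chosen only for this example. */
#define SKETCH_MODE_FSCREDS   0x08u
#define SKETCH_MODE_REALCREDS 0x10u

/* Hypothetical, heavily reduced credential set. */
struct creds_sketch {
    unsigned int uid;     /* real UID */
    unsigned int fsuid;   /* filesystem UID, follows the effective UID */
};

static int sketch_access_check(const struct creds_sketch *caller,
                               unsigned int target_uid, unsigned int mode)
{
    int fs = !!(mode & SKETCH_MODE_FSCREDS);
    int real = !!(mode & SKETCH_MODE_REALCREDS);
    unsigned int caller_uid;

    /* The caller must state which credentials it means. */
    if (fs == real)
        return -EPERM;

    caller_uid = fs ? caller->fsuid : caller->uid;
    return caller_uid == target_uid ? 0 : -EPERM;
}
```

For a task with real UID 0 that has switched its filesystem UID to 1000, a check against a root-owned target fails with `SKETCH_MODE_FSCREDS` and passes with `SKETCH_MODE_REALCREDS`, which mirrors the setreuid(0, user_uid) scenario in the message.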
    • Btrfs: fix direct IO requests not reporting IO error to user space · ba6d9280
      Filipe Manana authored
      commit 1636d1d7 upstream.
      
      If a bio for a direct IO request fails, we were not setting the error in
      the parent bio (the main DIO bio), making us not return the error to
      user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
      return the number of bytes issued for IO and not the error a bio created
      and submitted by btrfs_submit_direct() got from the block layer.
      This essentially happens because when we call:
      
         dio_end_io(dio_bio, bio->bi_error);
      
      It does not set dio_bio->bi_error to the value of the second argument.
      So just add this missing assignment in endio callbacks, just as we do in
      the error path at btrfs_submit_direct() when we fail to clone the dio bio
      or allocate its private object. This follows the convention of what is
      done with other similar APIs such as bio_endio() where the caller is
      responsible for setting the bi_error field in the bio it passes as an
      argument to bio_endio().
      
      This was detected by the new generic test cases in xfstests: 271, 272,
      276 and 278, which essentially set up a dm error target, then load the
      error table, do a direct IO write and unload the error table. They
      expect the write to fail with -EIO, which was not getting reported
      when testing against btrfs.
      
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
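The caller-sets-the-error convention can be modelled with a few lines of plain C. `struct bio_sketch`, `sketch_end_io()`, `sketch_endio_callback()` and `sketch_direct_io_result()` are hypothetical stand-ins; in particular `sketch_end_io()` only mimics the described behaviour of a completion helper that does not store its error argument in the bio.

```c
#include <errno.h>

/* Hypothetical, reduced bio with only the error field. */
struct bio_sketch {
    int bi_error;
};

/* Stand-in completion helper: it receives an error argument but,
 * like the behaviour described above, does not store it in the bio. */
static void sketch_end_io(struct bio_sketch *dio_bio, int error)
{
    (void)dio_bio;
    (void)error;
}

/* End-io callback for a child bio: record the error in the parent
 * DIO bio first, then complete it. */
static void sketch_endio_callback(struct bio_sketch *dio_bio,
                                  const struct bio_sketch *bio)
{
    dio_bio->bi_error = bio->bi_error;    /* the missing assignment */
    sketch_end_io(dio_bio, bio->bi_error);
}

/* What the submitter reports: the error if one was recorded,
 * otherwise the number of bytes issued. */
static long sketch_direct_io_result(const struct bio_sketch *dio_bio,
                                    long bytes_issued)
{
    return dio_bio->bi_error ? dio_bio->bi_error : bytes_issued;
}
```

Without the assignment in `sketch_endio_callback()`, the parent's error field would stay 0 and the result would be the byte count, which is the misreporting the xfstests cases caught.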
    • Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl · e8eced78
      Filipe Manana authored
      commit 0c0fe3b0 upstream.
      
      While doing some tests I ran into a hang on an extent buffer's rwlock
      that produced the following trace:
      
      [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
      [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
      [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [39389.800016] irq event stamp: 0
      [39389.800016] hardirqs last  enabled at (0): [<          (null)>]           (null)
      [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800016] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800016] softirqs last disabled at (0): [<          (null)>]           (null)
      [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
      [39389.800016] RIP: 0010:[<ffffffff810902af>]  [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
      [39389.800016] RSP: 0018:ffff8800a185fb80  EFLAGS: 00000202
      [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
      [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
      [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
      [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
      [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
      [39389.800016] FS:  00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
      [39389.800016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
      [39389.800016] Stack:
      [39389.800016]  ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
      [39389.800016]  ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
      [39389.800016]  ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
      [39389.800016] Call Trace:
      [39389.800016]  [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
      [39389.800016]  [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
      [39389.800016]  [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
      [39389.800016]  [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
      [39389.800016]  [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
      [39389.800016]  [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
      [39389.800016]  [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
      [39389.800016]  [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
      [39389.800016]  [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
      [39389.800016]  [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
      [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800016]  [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
      [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800016]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [39389.800016]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [39389.800016]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [39389.800016]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [39389.800016]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
      [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [39389.800012] irq event stamp: 0
      [39389.800012] hardirqs last  enabled at (0): [<          (null)>]           (null)
      [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800012] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800012] softirqs last disabled at (0): [<          (null)>]           (null)
      [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G             L  4.4.0-rc6-btrfs-next-18+ #1
      [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
      [39389.800012] RIP: 0010:[<ffffffff81091e8d>]  [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
      [39389.800012] RSP: 0018:ffff880034a639f0  EFLAGS: 00000206
      [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
      [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
      [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
      [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
      [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
      [39389.800012] FS:  00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
      [39389.800012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
      [39389.800012] Stack:
      [39389.800012]  ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
      [39389.800012]  ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
      [39389.800012]  ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
      [39389.800012] Call Trace:
      [39389.800012]  [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
      [39389.800012]  [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
      [39389.800012]  [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
      [39389.800012]  [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
      [39389.800012]  [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
      [39389.800012]  [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
      [39389.800012]  [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
      [39389.800012]  [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
      [39389.800012]  [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
      [39389.800012]  [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
      [39389.800012]  [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
      [39389.800012]  [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
      [39389.800012]  [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
      [39389.800012]  [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
      [39389.800012]  [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
      [39389.800012]  [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
      [39389.800012]  [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
      [39389.800012]  [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
      [39389.800012]  [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
      [39389.800012]  [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
      [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
      [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
      [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800012]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [39389.800012]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [39389.800012]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [39389.800012]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [39389.800012]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00
      
      This happens because in the code path executed by the inode_paths ioctl we
      end up nesting two calls to read lock a leaf's rwlock when after the first
      call to read_lock() and before the second call to read_lock(), another
      task (running the delayed items as part of a transaction commit) has
      already called write_lock() against the leaf's rwlock. This situation is
      illustrated by the following diagram:
      
               Task A                       Task B
      
        btrfs_ref_to_path()               btrfs_commit_transaction()
          read_lock(&eb->lock);
      
                                            btrfs_run_delayed_items()
                                              __btrfs_commit_inode_delayed_items()
                                                __btrfs_update_delayed_inode()
                                                  btrfs_lookup_inode()
      
                                                    write_lock(&eb->lock);
                                                      --> task waits for lock
      
          read_lock(&eb->lock);
          --> makes this task hang
              forever (and task B too
              of course)
      
      So fix this by avoiding doing the nested read lock, which is easily
      avoidable. This issue does not happen if task B calls write_lock() after
      task A does the second call to read_lock(), however there does not seem
      to exist anything in the documentation that mentions what is the expected
      behaviour for recursive locking of rwlocks (leaving the idea that doing
      so is not a good usage of rwlocks).
      
      Also, as a side effect necessary for this fix, make sure we do not
      needlessly read lock extent buffers when the input path has skip_locking
      set (used when called from send).
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
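The interleaving in the diagram can be replayed with a toy, single-threaded model of a lock in which a queued writer blocks new readers. This is an illustration only: `struct rwlock_model` and the `model_*()` functions are hypothetical, "try" calls return false where a real lock would spin, and the model is assumed to capture the relevant behaviour of the rwlock seen in the trace.

```c
#include <stdbool.h>

/* Toy state of a reader/writer lock. */
struct rwlock_model {
    int readers;           /* readers currently holding the lock */
    bool writer_waiting;   /* a writer has queued behind the readers */
};

/* A new reader is admitted only while no writer is queued. */
static bool model_try_read_lock(struct rwlock_model *l)
{
    if (l->writer_waiting)
        return false;      /* a real lock would spin here */
    l->readers++;
    return true;
}

/* A writer must wait for all readers and queues itself meanwhile. */
static bool model_try_write_lock(struct rwlock_model *l)
{
    if (l->readers > 0) {
        l->writer_waiting = true;
        return false;      /* a real lock would spin here */
    }
    return true;
}

/* The avoidance pattern: do not take the read lock a second time
 * when the caller already holds it. */
static bool model_read_lock_unless_held(struct rwlock_model *l,
                                        bool already_held)
{
    return already_held ? true : model_try_read_lock(l);
}
```

Replaying the diagram: task A's first read lock succeeds, task B's write lock queues, and task A's nested read lock can then never be granted, while task B waits for task A to release. Skipping the nested acquisition removes the cycle.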
    • Btrfs: fix page reading in extent_same ioctl leading to csum errors · be1232bc
      Filipe Manana authored
      commit 31314002 upstream.
      
      In the extent_same ioctl, we were grabbing the pages (locked) and
      attempting to read them without bothering about any concurrent IO
      against them. That is, we were not checking for any ongoing ordered
      extents nor waiting for them to complete, which leads to a race where
      the extent_same() code gets a checksum verification error when it
      reads the pages, producing a message like the following in dmesg
      and making the operation fail to user space with -ENOMEM:
      
      [18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868
      
      Fix this by reading the pages with btrfs_readpage(), which waits for
      any concurrent ordered extents to complete and locks the io range,
      instead of extent_read_full_page_nolock(). Also do better error
      handling and don't treat all failures as -ENOMEM, as that's clearly
      misleading; the checks and operation become identical to those of
      prepare_uptodate_page().
      
      The use of extent_read_full_page_nolock() was required before
      commit f4414602 ("btrfs: fix deadlock with extent-same and readpage"),
      as we had the range locked in an inode's io tree before attempting to
      read the pages.
      
      Fixes: f4414602 ("btrfs: fix deadlock with extent-same and readpage")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Filipe Manana's avatar
      Btrfs: fix invalid page accesses in extent_same (dedup) ioctl · df567e6d
      Filipe Manana authored
      commit e0bd70c6 upstream.
      
      In the extent_same ioctl we are getting the pages for the source and
      target ranges and unlocking them immediately after, which is incorrect
      because later we attempt to map them (with kmap_atomic) and access their
      contents at btrfs_cmp_data(). When we do such access the pages might have
      been relocated or removed from memory, which leads to an invalid memory
      access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
      which produces a trace like the following:
      
      [186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
      parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
      crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
      unloaded: btrfs]
      [186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G        W       4.4.0-rc6-btrfs-next-18+ #1
      [186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
      [186736.681319] RIP: 0010:[<ffffffff81264d00>]  [<ffffffff81264d00>] memcmp+0xb/0x22
      [186736.681319] RSP: 0018:ffff880362287d70  EFLAGS: 00010287
      [186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
      [186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
      [186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
      [186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
      [186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
      [186736.681319] FS:  00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
      [186736.681319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
      [186736.681319] Stack:
      [186736.681319]  ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
      [186736.681319]  0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
      [186736.681319]  00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
      [186736.681319] Call Trace:
      [186736.681319]  [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
      [186736.681319]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [186736.681319]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [186736.681319]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [186736.681319]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [186736.681319]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [186736.681319]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6
      04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0
      
      (gdb) list *(btrfs_ioctl+0x24cb)
      0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
      2967                    dst_addr = kmap_atomic(dst_page);
      2968
      2969                    flush_dcache_page(src_page);
      2970                    flush_dcache_page(dst_page);
      2971
      2972                    if (memcmp(addr, dst_addr, cmp_len))
      2973                            ret = BTRFS_SAME_DATA_DIFFERS;
      2974
      2975                    kunmap_atomic(addr);
      2976                    kunmap_atomic(dst_addr);
      
      So fix this by making sure we keep the pages locked and respect the same
      locking order as everywhere else: get and lock the pages first and then
      lock the range in the inode's io tree (like for example at
      __btrfs_buffered_write() and extent_readpages()). If an ordered extent
      is found after locking the range in the io tree, unlock the range,
      unlock the pages, wait for the ordered extent to complete and repeat the
      entire locking process until no overlapping ordered extents are found.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • David Sterba's avatar
      btrfs: properly set the termination value of ctx->pos in readdir · b58081d4
      David Sterba authored
      commit bc4ef759 upstream.
      
      The value of ctx->pos in the last readdir call is supposed to be set to
      INT_MAX due to 32bit compatibility, unless 'pos' is intentionally set to a
      larger value, then it's LLONG_MAX.
      
      There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
      overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
      64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
      the increment.
      
      We can get into that situation like this:
      
      * emit all regular readdir entries
      * still in the same call to readdir, bump the last pos to INT_MAX
      * the next call to readdir will not emit any entries, but will reach the
        bump code again, find pos to be INT_MAX and set it to LLONG_MAX
      
      Normally this is not a problem, but if we call readdir again, we'll find
      'pos' set to LLONG_MAX and the unconditional increment will overflow.
      
      The report from Victor at
      (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
      print shows that pattern:
      
       Overflow: e
       Overflow: 7fffffff
       Overflow: 7fffffffffffffff
       PAX: size overflow detected in function btrfs_real_readdir
         fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
         context: dir_context;
       CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
       Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
        ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
        ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
        ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
       Call Trace:
        [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
        [<ffffffff811cb706>] report_size_overflow+0x36/0x40
        [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
        [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
        [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
        [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
       Overflow: 1a
        [<ffffffff811db070>] ? iterate_dir+0x150/0x150
        [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
      
      The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
      are not yet synced and are processed from the delayed list. Then the code
      could go to the bump section again even though it might not emit any new
      dir entries from the delayed list.
      
      The fix avoids entering the "bump" section again once we've finished
      emitting the entries, both for synced and delayed entries.
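
The guard described above can be sketched in user-space C. This is only an illustration of the idea, not the kernel code: `struct dir_ctx`, `finish_readdir()` and the `emitted` flag are hypothetical names, and the real function's increment of `pos` is left out for brevity.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's directory iteration context. */
struct dir_ctx {
    long long pos;
};

/*
 * Called at the end of one readdir pass. If this pass emitted nothing,
 * pos already holds a termination value from an earlier call and must
 * be left alone; otherwise move it to the next termination value.
 */
void finish_readdir(struct dir_ctx *ctx, bool emitted)
{
    if (!emitted)
        return;                 /* do not touch a terminal pos again */

    if (ctx->pos >= INT_MAX)
        ctx->pos = LLONG_MAX;   /* pos was deliberately set past INT_MAX */
    else
        ctx->pos = INT_MAX;     /* normal end-of-directory marker */
}
```

With this shape, a second call that emits no entries leaves `pos` at INT_MAX, and a third call can never find LLONG_MAX followed by an increment.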
      
      References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      Reported-by: Victor <services@swwu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • David Sterba's avatar
      Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" · dfd2961a
      David Sterba authored
      commit 80ad623e upstream.
      
      This reverts commit 69624913. The
      cleaner thread can block freezing when there's a snapshot cleaning in
      progress and the other threads get suspended first. From the logs
      provided by Martin we're waiting for reading extent pages:
      
      kernel: PM: Syncing filesystems ... done.
      kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
      kernel: Freezing remaining freezable tasks ...
      kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
      kernel: btrfs-cleaner   D ffff88033dd13bc0     0   152      2 0x00000000
      kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
      kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
      kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
      kernel: Call Trace:
      kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c
      kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c
      kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
      kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
      kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
      kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
      kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
      kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
      kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
      kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
      kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
      kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
      kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
      kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
      kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
      kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
      kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
      kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
      kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
      kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
      kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
      kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
      kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
      kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      
      As this affects a released kernel (4.4) we need a minimal fix for
      stable kernels.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361
      Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
      CC: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Filipe Manana's avatar
      Btrfs: fix fitrim discarding device area reserved for boot loader's use · 4e694390
      Filipe Manana authored
      commit 8cdc7c5b upstream.
      
      As of the 4.3 kernel release, the fitrim ioctl can now discard any region
      of a disk that is not allocated to any chunk/block group, including the
      first megabyte which is used for our primary superblock and by the boot
      loader (grub for example).
      
      Fix this by not allowing to trim/discard any region in the device starting
      with an offset not greater than min(alloc_start_mount_option, 1Mb), just
      as it was not possible before 4.3.
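
As a rough illustration of keeping a discard request out of a reserved area, the sketch below clips a byte range so that it never starts below a boundary. The names `struct range`, `clamp_discard()` and `RESERVED_END` are made up for this example, and how the kernel derives the boundary from the alloc_start mount option is not modelled here.

```c
#include <stdint.h>

/* Assumed boundary for the example: the first megabyte of the device. */
#define RESERVED_END (1024ULL * 1024ULL)

struct range {
    uint64_t start;
    uint64_t len;   /* len == 0 means there is nothing left to discard */
};

/* Clip [start, start + len) so it never covers bytes below reserved_end. */
struct range clamp_discard(uint64_t start, uint64_t len, uint64_t reserved_end)
{
    struct range r = { start, len };

    if (start >= reserved_end)
        return r;                       /* already outside the reserved area */

    uint64_t skip = reserved_end - start;

    r.start = reserved_end;
    r.len = (skip >= len) ? 0 : len - skip;
    return r;
}
```

A request covering the first 2 MiB would be reduced to the second megabyte only, and a request that lies entirely inside the reserved area becomes empty.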
      
      A reproducer test case for xfstests follows.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
      
        # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
        # reserved for a boot loader to use (GRUB for example) and btrfs should never
        # use them - neither for allocating metadata/data nor should trim/discard them.
        # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
        $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
        $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
      
        # Now mount the filesystem and perform a fitrim against it.
        _scratch_mount
        _require_batched_discard $SCRATCH_MNT
        $FSTRIM_PROG $SCRATCH_MNT
      
        # Now unmount the filesystem and verify the content of the ranges was not
        # modified (no trim/discard happened on them).
        _scratch_unmount
        echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
        od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
        od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
      
        status=0
        exit
      Reported-by: Vincent Petry <PVince81@yahoo.fr>
      Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
      Fixes: 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • David Sterba's avatar
      btrfs: handle invalid num_stripes in sys_array · c57e49b5
      David Sterba authored
      commit f5cdedd7 upstream.
      
      We can handle the special case of num_stripes == 0 directly inside
      btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
      catch other unhandled cases where we fail to validate external data.
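
The general pattern, rejecting a bad on-disk value with an error instead of reaching an assertion, might look like the following sketch. The function name and both size constants are hypothetical and chosen only for the example; they are not the btrfs structure sizes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up sizes for illustration only. */
#define CHUNK_HDR_SIZE 48u
#define STRIPE_SIZE    32u

/*
 * Validate num_stripes read from untrusted data before using it.
 * Returns 0 and stores the item size, or -EIO when the value is invalid.
 */
int chunk_item_size_checked(uint16_t num_stripes, size_t *size_out)
{
    if (num_stripes == 0)
        return -EIO;    /* corrupted or crafted image: fail the mount */

    *size_out = CHUNK_HDR_SIZE + (size_t)num_stripes * STRIPE_SIZE;
    return 0;
}
```

The caller can then propagate the error and abort the mount cleanly, while an assertion further down still catches paths that skipped validation.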
      
      A crafted or corrupted image crashes at mount time:
      
      BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
      BTRFS info (device loop0): disk space caching is enabled
      BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
      Kernel panic - not syncing: BUG!
      CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
      Stack:
       637af890 60062489 602aeb2e 604192ba
       60387961 00000011 637af8a0 6038a835
       637af9c0 6038776b 634ef32b 00000000
      Call Trace:
       [<6001c86d>] show_stack+0xfe/0x15b
       [<6038a835>] dump_stack+0x2a/0x2c
       [<6038776b>] panic+0x13e/0x2b3
       [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
       [<601cfbbe>] open_ctree+0x192d/0x27af
       [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<600d710b>] do_mount+0xa35/0xbc9
       [<600d7557>] SyS_mount+0x95/0xc8
       [<6001e884>] handle_syscall+0x6b/0x8e
      Reported-by: Jiri Slaby <jslaby@suse.com>
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eryu Guan's avatar
      ext4: don't read blocks from disk after extents being swapped · bbfe21c8
      Eryu Guan authored
      commit bcff2488 upstream.
      
      I notice ext4/307 fails occasionally on ppc64 host, reporting md5
      checksum mismatch after moving data from original file to donor file.
      
      The reason is that move_extent_per_page() calls __block_write_begin()
      and block_commit_write() to write saved data from original inode blocks
      to donor inode blocks, but __block_write_begin() not only maps buffer
      heads but also reads block content from disk if the size is not block
      size aligned.  At this time the physical block number in the mapped buffer
      head is pointing to the donor file, not the original file, and that
      results in reading the wrong data into the page, which then gets written
      to disk in the following block_commit_write() call.
      
      This also can be reproduced by the following script on 1k block size ext4
      on x86_64 host:
      
          mnt=/mnt/ext4
          donorfile=$mnt/donor
          testfile=$mnt/testfile
          e4compact=~/xfstests/src/e4compact
      
          rm -f $donorfile $testfile
      
          # reserve space for donor file, written by 0xaa and sync to disk to
          # avoid EBUSY on EXT4_IOC_MOVE_EXT
          xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
      
          # create test file written by 0xbb
          xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
      
          # compute initial md5sum
          md5sum $testfile | tee md5sum.txt
          # drop cache, force e4compact to read data from disk
          echo 3 > /proc/sys/vm/drop_caches
      
          # test defrag
          echo "$testfile" | $e4compact -i -v -f $donorfile
          # check md5sum
          md5sum -c md5sum.txt
      
      Fix it by creating & mapping buffer heads only but not reading blocks
      from disk, because all the data in page is guaranteed to be up-to-date
      in mext_page_mkuptodate().
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Insu Yun's avatar
      ext4: fix potential integer overflow · 600d41f4
      Insu Yun authored
      commit 46901760 upstream.
      
      Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
      an integer overflow could happen.
      Therefore, fix the integer overflow sanitization.
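
The underlying point is that an overflow bound derived from a smaller element size is too loose for an array of larger elements. A generic way to avoid that, shown here as a sketch with a hypothetical helper name rather than the ext4 code, is to check each multiplication against the element size actually used:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute count * elem_size without wrapping.
 * Returns true and stores the product, or false if it would overflow.
 */
bool alloc_size_checked(size_t count, size_t elem_size, size_t *bytes)
{
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return false;

    *bytes = count * elem_size;
    return true;
}
```

Calling the helper once per array, each time with that array's own element size, keeps the bound tied to the allocation it protects.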
      Signed-off-by: Insu Yun <wuninsu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jan Kara's avatar
      ext4: fix scheduling in atomic on group checksum failure · 33f48f8a
      Jan Kara authored
      commit 05145bd7 upstream.
      
      When block group checksum is wrong, we call ext4_error() while holding
      group spinlock from ext4_init_block_bitmap() or
      ext4_init_inode_bitmap() which results in scheduling while in atomic.
      Fix the issue by calling ext4_error() later after dropping the spinlock.
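
The pattern of remembering a failure under the lock and reporting it only after unlocking can be sketched as follows. Everything here is a user-space stand-in: the toy spinlock, `report_error()` (playing the role of a function that may sleep) and the bookkeeping variables are assumptions made for the demo, and the bookkeeping is only meaningful in a single-threaded run.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_flag group_lock = ATOMIC_FLAG_INIT;   /* toy spinlock */

/* Demo bookkeeping only (not thread-safe): detects reports made under the lock. */
bool lock_held;
int reports_under_lock;

/* Stand-in for an error routine that must not run in atomic context. */
void report_error(const char *msg)
{
    if (lock_held)
        reports_under_lock++;
    fprintf(stderr, "error: %s\n", msg);
}

/* Returns 0 on success, -1 if the checksum was bad. */
int init_bitmap(bool csum_ok)
{
    bool bad = false;

    while (atomic_flag_test_and_set(&group_lock))
        ;                       /* spin until the lock is ours */
    lock_held = true;

    if (!csum_ok)
        bad = true;             /* only remember the failure here */

    lock_held = false;
    atomic_flag_clear(&group_lock);

    if (bad) {
        report_error("bad group checksum");   /* safe: lock already dropped */
        return -1;
    }
    return 0;
}
```

The critical section stays short and free of anything that may sleep, while the caller still sees the failure through the return value.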
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Peter Hurley's avatar
      serial: omap: Prevent DoS using unprivileged ioctl(TIOCSRS485) · 5859b907
      Peter Hurley authored
      commit 308bbc9a upstream.
      
      The omap-serial driver emulates RS485 delays using software timers,
      but neglects to clamp the input values from the unprivileged
      ioctl(TIOCSRS485). Because the software implementation busy-waits,
      malicious userspace could stall the cpu for ~49 days.
      
      Clamp the input values to < 100ms.
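
For scale, a 32-bit millisecond value can reach 4294967295 ms, which is roughly 49.7 days. A clamp of the kind described can be sketched as below; the helper name and the constant are hypothetical, and treating 100 ms as an inclusive upper bound is an assumption of this example.

```c
#include <stdint.h>

/* Assumed upper bound for the example, in milliseconds. */
#define MAX_RS485_DELAY_MS 100u

/* Limit a user-supplied delay before it is used in a busy-wait. */
uint32_t clamp_delay_ms(uint32_t requested)
{
    return requested > MAX_RS485_DELAY_MS ? MAX_RS485_DELAY_MS : requested;
}
```

Applying the clamp where the ioctl arguments are copied in means every later user of the stored value sees a bounded delay.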
      
      Fixes: 4a0ac0f5 ("OMAP: add RS485 support")
      Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Mika Westerberg's avatar
      serial: 8250_pci: Add Intel Broadwell ports · 76e88140
      Mika Westerberg authored
      commit 6c55d9b9 upstream.
      
      Some recent (early 2015) MacBooks have Intel Broadwell where LPSS UARTs are
      PCI enumerated instead of ACPI. The LPSS UART block is pretty much the same
      as the one used on Intel Baytrail so we can reuse the existing Baytrail
      setup code.
      
      Add both Broadwell LPSS UART ports to the list of supported devices.
      Signed-off-by: Leif Liddy <leif.liddy@gmail.com>
      Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jeremy McNicoll's avatar
      tty: Add support for PCIe WCH382 2S multi-IO card · 124efa9f
      Jeremy McNicoll authored
      commit 7dde5578 upstream.
      
      WCH382 2S board is a PCIe card with 2 DB9 COM ports detected as
      Serial controller: Device 1c00:3253 (rev 10) (prog-if 05 [16850])
      Signed-off-by: Jeremy McNicoll <jmcnicol@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Herton R. Krzesinski's avatar
      pty: make sure super_block is still valid in final /dev/tty close · 1bdf1602
      Herton R. Krzesinski authored
      commit 1f55c718 upstream.
      
      Considering the current pty code and multiple devpts instances, it's
      possible to umount a devpts file system while a program still has /dev/tty
      opened pointing to a previously closed pty pair in that instance. In the
      case all ptmx and pts/N files are closed, umount can be done. If the
      program closes /dev/tty after umount is done, devpts_kill_index will now
      use an invalid super_block, which was already destroyed in the umount
      operation after running ->kill_sb. This is another "use after free" type
      of issue, but now related to the allocated super_block instance.
      
      To avoid the problem (warning at ida_remove and potential crashes) for
      this specific case, I added two functions in devpts which grab additional
      references to the super_block, which pty code now uses so it makes sure
      the super block structure is still valid until pty shutdown is done.
      I also moved the additional inode references to the same functions, which
      also covers the similar case of the inode being freed before the final
      /dev/tty close/shutdown.
      Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
      Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Herton R. Krzesinski's avatar
      pty: fix possible use after free of tty->driver_data · 3ceeb564
      Herton R. Krzesinski authored
      commit 2831c89f upstream.
      
      This change fixes a bug for a corner case where we have the last
      release from a pty master/slave coming from a previously opened /dev/tty
      file. When this happens, the tty->driver_data can be stale, due to all
      ptmx or pts/N files having already been closed before (and thus the inode
      related to these files, which tty->driver_data points to, being already
      freed/destroyed).
      
      The fix here is to keep a reference on the opened master ptmx inode.
      We maintain the inode referenced until the final pty_unix98_shutdown,
      and only pass this inode to devpts_kill_index.
      Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
      Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Peter Hurley's avatar
      staging/speakup: Use tty_ldisc_ref() for paste kworker · a45f23ed
      Peter Hurley authored
      commit f4f9edcf upstream.
      
      As the function documentation for tty_ldisc_ref_wait() notes, it is
      only callable from a tty file_operations routine; otherwise there
      is no guarantee the ref won't be NULL.
      
      The key difference with the VT's paste_selection() is that it is an ioctl,
      whereas __speakup_paste_selection() is a completely async kworker, kicked
      off from interrupt context.
      
      Fixes: 28a821c3 ("Staging: speakup: Update __speakup_paste_selection()
             tty (ab)usage to match vt")
      Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tony Lindgren's avatar
      phy: twl4030-usb: Fix unbalanced pm_runtime_enable on module reload · 3375ee8b
      Tony Lindgren authored
      commit 58a66dba upstream.
      
      If we reload phy-twl4030-usb, we get a warning about unbalanced
      pm_runtime_enable. Let's fix the issue and also fix idling of the
      device on unload before we attempt to shut it down.
      
      If we don't properly idle the PHY before shutting it down on removal,
      the twl4030 ends up consuming about 62mW of extra power compared to
      running idle with the module loaded.
      
      Cc: Bin Liu <b-liu@ti.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Kishon Vijay Abraham I <kishon@ti.com>
      Cc: NeilBrown <neil@brown.name>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tony Lindgren's avatar
      phy: twl4030-usb: Relase usb phy on unload · a90e66cb
      Tony Lindgren authored
      commit b241d31e upstream.
      
      Otherwise rmmod omap2430; rmmod phy-twl4030-usb; modprobe omap2430
      will try to use a non-existing phy and oops:
      
      Unable to handle kernel paging request at virtual address b6f7c1f0
      ...
      [<c048a284>] (devm_usb_get_phy_by_node) from [<bf0758ac>]
      (omap2430_musb_init+0x44/0x2b4 [omap2430])
      [<bf0758ac>] (omap2430_musb_init [omap2430]) from [<bf055ec0>]
      (musb_init_controller+0x194/0x878 [musb_hdrc])
      
      Cc: Bin Liu <b-liu@ti.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Kishon Vijay Abraham I <kishon@ti.com>
      Cc: NeilBrown <neil@brown.name>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Takashi Iwai's avatar
      ALSA: seq: Fix double port list deletion · a40efb85
      Takashi Iwai authored
      commit 13d5e5d4 upstream.
      
      The commit [7f0973e9: ALSA: seq: Fix lockdep warnings due to
      double mutex locks] split the management of two linked lists (source
      and destination) into two individual calls for avoiding the AB/BA
      deadlock.  However, this may leave the possible double deletion of one
      of two lists when the counterpart is being deleted concurrently.
      It ends up with a list corruption, as revealed by syzkaller fuzzer.
      
      This patch fixes it by checking the list emptiness and skipping the
      deletion and the following process.
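
The "check emptiness, then skip" idea can be illustrated with a minimal circular doubly linked list. This is a user-space sketch with hypothetical names; in a concurrent setting the check and the unlink would both have to happen under the lock that protects the list.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list node; a lone node points at itself. */
struct node {
    struct node *prev, *next;
};

void node_init(struct node *n)
{
    n->prev = n->next = n;
}

bool node_unlinked(const struct node *n)
{
    return n->next == n;
}

void node_insert_after(struct node *head, struct node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Unlink n at most once. Returns false if n was already off the list,
 * so the caller can skip the follow-up work as well.
 */
bool unlink_once(struct node *n)
{
    if (node_unlinked(n))
        return false;

    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);           /* a second call now sees "already unlinked" */
    return true;
}
```

Re-initialising the node after unlinking is what makes the second deletion attempt detectable instead of corrupting the neighbours.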
      
      BugLink: http://lkml.kernel.org/r/CACT4Y+bay9qsrz6dQu31EcGaH9XwfW7o3oBzSQUG9fMszoh=Sg@mail.gmail.com
      Fixes: 7f0973e9 ('ALSA: seq: Fix lockdep warnings due to double mutex locks')
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Takashi Iwai's avatar
      ALSA: seq: Fix leak of pool buffer at concurrent writes · 6bb345ac
      Takashi Iwai authored
      commit d99a36f4 upstream.
      
      When multiple concurrent writes happen on the ALSA sequencer device
      right after the open, it may try to allocate vmalloc buffer for each
      write and leak some of them.  It's because the presence check and the
      assignment of the buffer is done outside the spinlock for the pool.
      
      The fix is to move the check and the assignment into the spinlock.
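
A sketch of that arrangement in user-space C follows: the buffer is allocated before taking the lock, the presence check and the assignment happen together inside it, and a caller that finds the pool already populated frees its spare buffer. `struct pool`, `pool_init_buf()` and the toy spinlock are hypothetical stand-ins, with `malloc()` in place of vmalloc.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct pool {
    atomic_flag lock;   /* toy spinlock */
    void *buf;          /* NULL until the first successful assignment */
    size_t size;
};

/* Returns 0 on success, -1 if the allocation failed. */
int pool_init_buf(struct pool *p)
{
    void *mine = malloc(p->size);   /* allocate outside the lock */

    if (!mine)
        return -1;

    while (atomic_flag_test_and_set(&p->lock))
        ;                           /* spin until the lock is ours */

    if (p->buf == NULL) {           /* check and assign under the same lock */
        p->buf = mine;
        mine = NULL;
    }

    atomic_flag_clear(&p->lock);

    free(mine);     /* spare buffer if someone else won; free(NULL) is a no-op */
    return 0;
}
```

As the message notes for the real code, this can still allocate more than once, but no buffer is leaked because each loser releases its own.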
      
      (The current implementation is suboptimal, as there can be multiple
       unnecessary vmallocs because the allocation is done before the check
       in the spinlock.  But the pool size is already checked beforehand, so
       this isn't a big problem; that is, the only possible path is the
       multiple writes before any pool assignment, and practically seen, the
       current coverage should be "good enough".)
      
      The issue was triggered by syzkaller fuzzer.
      
      BugLink: http://lkml.kernel.org/r/CACT4Y+bSzazpXNvtAr=WXaL8hptqjHwqEyFA+VN2AWEx=aurkg@mail.gmail.com
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Takashi Iwai's avatar
      ALSA: pcm: Fix rwsem deadlock for non-atomic PCM stream · ef0ca961
      Takashi Iwai authored
      commit 67ec1072 upstream.
      
      A non-atomic PCM stream may take snd_pcm_link_rwsem rw semaphore twice
      in the same code path, e.g. one in snd_pcm_action_nonatomic() and
      another in snd_pcm_stream_lock().  Usually this is OK, but when a
      write lock is issued between these two read locks, the problem
      happens: the write lock is blocked due to the first read lock, and
      the second read lock is also blocked by the write lock.  This
      eventually deadlocks.
      
      The reason is the way rwsem manages waiters; it's queued like FIFO, so
      even if the writer itself doesn't take the lock yet, it blocks all the
      waiters (including reads) queued after it.
      
      As a workaround, in this patch, we replace the standard down_write()
      with a spinning loop.  This is far from optimal, but it's good
      enough, as the spinning time is supposed to be relatively short for
      normal PCM operations, and the code paths requiring the write lock
      aren't called so often.
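
Why a spinning writer avoids the deadlock can be seen with a toy reader/writer lock: a writer that only retries a trylock never sits in a wait queue, so a nested read lock taken in the meantime is not blocked behind it. The lock below is a hypothetical user-space sketch built on C11 atomics, not the kernel rwsem, and `sched_yield()` stands in for whatever rescheduling call the real loop uses.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock state: 0 = free, > 0 = number of readers, -1 = one writer. */
struct rw {
    atomic_int state;
};

bool read_trylock(struct rw *l)
{
    int s = atomic_load(&l->state);

    while (s >= 0) {
        /* On failure, s is reloaded with the current state and we retry. */
        if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
            return true;
    }
    return false;               /* a writer holds the lock */
}

void read_unlock(struct rw *l)
{
    atomic_fetch_sub(&l->state, 1);
}

bool write_trylock(struct rw *l)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

void write_unlock(struct rw *l)
{
    atomic_store(&l->state, 0);
}

/* Spin instead of queueing: readers arriving meanwhile are never blocked. */
void write_lock_spin(struct rw *l)
{
    while (!write_trylock(l))
        sched_yield();
}
```

The trade-off matches the message's own caveat: readers can keep the writer spinning, which is acceptable only when write locking is rare and read sections are short.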
      Reported-by: Vinod Koul <vinod.koul@intel.com>
      Tested-by: Ramesh Babu <ramesh.babu@intel.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Takashi Iwai's avatar
      ALSA: hda - Cancel probe work instead of flush at remove · 434e26d6
      Takashi Iwai authored
      commit 0b8c8219 upstream.
      
      The commit [991f86d7: ALSA: hda - Flush the pending probe work at
      remove] introduced the sync of async probe work at remove for fixing
      the race.  However, this may lead to another hangup when the module
      removal is performed quickly before starting the probe work, because
      it issues flush_work() and it's blocked forever.
      
      The workaround is to use cancel_work_sync() instead of flush_work()
      there.
      
      Fixes: 991f86d7 ('ALSA: hda - Flush the pending probe work at remove')
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      434e26d6
    • Toshi Kani's avatar
      x86/mm: Fix vmalloc_fault() to handle large pages properly · 6deb0ec9
      Toshi Kani authored
      commit f4eafd8b upstream.
      
      A kernel page fault oops with the callstack below was observed
      when a read syscall was made to a pmem device after a huge amount
      (>512GB) of vmalloc ranges was allocated by ioremap() on an x86_64
      system:
      
           BUG: unable to handle kernel paging request at ffff880840000ff8
           IP: vmalloc_fault+0x1be/0x300
           PGD c7f03a067 PUD 0
           Oops: 0000 [#1] SM
           Call Trace:
              __do_page_fault+0x285/0x3e0
              do_page_fault+0x2f/0x80
              ? put_prev_entity+0x35/0x7a0
              page_fault+0x28/0x30
              ? memcpy_erms+0x6/0x10
              ? schedule+0x35/0x80
              ? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
              ? schedule_timeout+0x183/0x240
              btt_log_read+0x63/0x140 [nd_btt]
               :
              ? __symbol_put+0x60/0x60
              ? kernel_read+0x50/0x80
              SyS_finit_module+0xb9/0xf0
              entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      Since v4.1, ioremap() supports large page (pud/pmd) mappings in
      x86_64 and PAE.  vmalloc_fault() however assumes that the vmalloc
      range is limited to pte mappings.
      
      vmalloc faults do not normally happen in ioremap'd ranges since
      ioremap() sets up the kernel page tables, which are shared by
      user processes.  pgd_ctor() copies the kernel's PGD entries into
      the user process's PGD during fork().  When allocation of the
      vmalloc ranges crosses a 512GB boundary, ioremap() allocates a new
      pud table and updates the kernel PGD entry to point to it.  If a
      user process's PGD entry does not have this update yet, a
      read/write syscall
      to the range will cause a vmalloc fault, which hits the Oops
      above as it does not handle a large page properly.
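
      The 512GB boundary follows from the page-table geometry: with 4-level
      paging on x86_64, each PGD entry spans 2^39 bytes.  A minimal sketch,
      assuming PGDIR_SHIFT is 39 and a PGD holds 512 entries; the helper name
      pgd_index_of() and the example base address are illustrative
      assumptions, not taken from the patch:

```c
#include <stdint.h>

#define PGDIR_SHIFT  39   /* each PGD entry spans 2^39 bytes = 512GB */
#define PTRS_PER_PGD 512  /* entries in one PGD with 4-level paging */

/* Index of the PGD entry that covers a given virtual address. */
static unsigned int pgd_index_of(uint64_t addr)
{
    return (unsigned int)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}
```

      Two addresses on either side of a 512GB boundary yield different
      indices, so a mapping that crosses the boundary needs a PGD entry that
      an already-forked process may not have copied yet.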
      
      The following changes are made to vmalloc_fault().
      
      64-bit:
      
       - No change for the PGD sync operation as it handles large
         pages already.
       - Add pud_huge() and pmd_huge() to the validation code to
         handle large pages.
       - Change pud_page_vaddr() to pud_pfn() since an ioremap range
         is not directly mapped (while the if-statement still works
         with a bogus addr).
       - Change pmd_page() to pmd_pfn() since an ioremap range is not
         backed by struct page (while the if-statement still works
         with a bogus addr).
      
      32-bit:
       - No change for the sync operation since the index3 PGD entry
         covers the entire vmalloc range, which is always valid.
         (A separate change to sync PGD entry is necessary if this
          memory layout is changed regardless of the page size.)
       - Add pmd_huge() to the validation code to handle large pages.
         This is for completeness since vmalloc_fault() won't happen
         in ioremap'd ranges as its PGD entry is always valid.
      Reported-by: Henning Schild <henning.schild@siemens.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-mm@kvack.org
      Cc: linux-nvdimm@lists.01.org
      Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6deb0ec9
    • Toshi Kani's avatar
      x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache() · e0c89043
      Toshi Kani authored
      commit a82eee74 upstream.
      
      Data corruption issues were observed in tests which initiated
      a system crash/reset while accessing BTT devices.  This problem
      is reproducible.
      
      The BTT driver calls pmem_rw_bytes() to update data in pmem
      devices.  This interface calls __copy_user_nocache(), which
      uses non-temporal stores so that the stores to pmem are
      persistent.
      
      __copy_user_nocache() uses non-temporal stores when a request
      size is 8 bytes or larger (and is aligned by 8 bytes).  The
      BTT driver updates the BTT map table, whose entry size is
      4 bytes.  Therefore, updates to the map table entries remain
      cached, and are not written to pmem after a crash.
      
      Change __copy_user_nocache() to use non-temporal store when
      a request size is 4 bytes.  The change extends the current
      byte-copy path for a less-than-8-bytes request, and does not
      add any overhead to the regular path.
      Reported-and-tested-by: Micah Parrish <micah.parrish@hpe.com>
      Reported-and-tested-by: Brian Boylston <brian.boylston@hpe.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: linux-nvdimm@lists.01.org
      Link: http://lkml.kernel.org/r/1455225857-12039-3-git-send-email-toshi.kani@hpe.com
      [ Small readability edits. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e0c89043
    • Toshi Kani's avatar
      x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable · 1e2e0ad1
      Toshi Kani authored
      commit ee9737c9 upstream.
      
      Add comments to __copy_user_nocache() to clarify its procedures
      and alignment requirements.
      
      Also change numeric branch target labels to named local labels.
      
      No code changed:
      
       arch/x86/lib/copy_user_64.o:
      
          text    data     bss     dec     hex filename
          1239       0       0    1239     4d7 copy_user_64.o.before
          1239       0       0    1239     4d7 copy_user_64.o.after
      
       md5:
          58bed94c2db98c1ca9a2d46d0680aaae  copy_user_64.o.before.asm
          58bed94c2db98c1ca9a2d46d0680aaae  copy_user_64.o.after.asm
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: brian.boylston@hpe.com
      Cc: dan.j.williams@intel.com
      Cc: linux-nvdimm@lists.01.org
      Cc: micah.parrish@hpe.com
      Cc: ross.zwisler@linux.intel.com
      Cc: vishal.l.verma@intel.com
      Link: http://lkml.kernel.org/r/1455225857-12039-2-git-send-email-toshi.kani@hpe.com
      [ Small readability edits and added object file comparison. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1e2e0ad1
    • Matt Fleming's avatar
      x86/mm/pat: Avoid truncation when converting cpa->numpages to address · 4f298c10
      Matt Fleming authored
      commit 74256377 upstream.
      
      There are a couple of nasty truncation bugs lurking in the pageattr
      code that can be triggered when mapping EFI regions, e.g. when we pass
      a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting
      left by PAGE_SHIFT will truncate the resultant address to 32-bits.
      
      Viorel-Cătălin managed to trigger this bug on his Dell machine that
      provides a ~5GB EFI region which requires 1236992 pages to be mapped.
      When calling populate_pud() the end of the region gets calculated
      incorrectly in the following buggy expression,
      
        end = start + (cpa->numpages << PAGE_SHIFT);
      
      And only 188416 pages are mapped. Next, populate_pud() gets invoked
      for a second time because of the loop in __change_page_attr_set_clr(),
      only this time no pages get mapped because shifting the remaining
      number of pages (1048576) by PAGE_SHIFT is zero. At which point the
      loop in __change_page_attr_set_clr() spins forever because we fail to
      make progress.
      
      Hitting this bug depends very much on the virtual address we pick to
      map the large region at and how many pages we map on the initial run
      through the loop. This explains why this issue was only recently hit
      with the introduction of commit
      
        a5caa209 ("x86/efi: Fix boot crash by mapping EFI memmap
         entries bottom-up at runtime, instead of top-down")
      
      It's interesting to note that safe uses of cpa->numpages do exist in
      the pageattr code. If instead of shifting ->numpages we multiply by
      PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and
      so the result is unsigned long.
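
      The arithmetic above can be checked with a small sketch.  It assumes
      PAGE_SHIFT is 12 (4KB pages); the function names are hypothetical.
      Because overflowing a signed shift is undefined in ISO C, the sketch
      models the 32-bit truncation with uint32_t rather than reproducing the
      kernel's 'int' expression literally.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4KB pages */

/* Models the buggy expression: the shift result is kept in 32 bits. */
static uint64_t bytes_truncated(uint32_t numpages)
{
    return (uint32_t)(numpages << PAGE_SHIFT);
}

/* Models the fix: a 64-bit page count shifts without losing high bits. */
static uint64_t bytes_widened(uint64_t numpages)
{
    return numpages << PAGE_SHIFT;
}
```

      With these helpers, bytes_truncated(1236992) >> PAGE_SHIFT gives the
      188416 pages quoted above, and bytes_truncated(1048576) is zero, which
      is why the second pass maps nothing.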
      
      To avoid surprises when users try to convert very large cpa->numpages
      values to addresses, change the data type from 'int' to 'unsigned
      long', thereby making it suitable for shifting by PAGE_SHIFT without
      any type casting.
      
      The alternative would be to make liberal use of casting, but that is
      far more likely to cause problems in the future when someone adds more
      code and fails to cast properly; this bug was difficult enough to
      track down in the first place.
      Reported-and-tested-by: Viorel-Cătălin Răpițeanu <rapiteanu.catalin@gmail.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131
      Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f298c10
    • Jan Beulich's avatar
      x86/mm: Fix types used in pgprot cacheability flags translations · 75a101ba
      Jan Beulich authored
      commit 3625c2c2 upstream.
      
      For PAE kernels "unsigned long" is not suitable to hold page protection
      flags, since _PAGE_NX doesn't fit there. This is the reason for quite a
      few W+X pages getting reported as insecure during boot (observed namely
      for the entire initrd range).
      
      Fixes: 281d4078 ("x86: Make page cache mode a real type")
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Juergen Gross <JGross@suse.com>
      Link: http://lkml.kernel.org/r/56A7635602000078000CAFF1@prv-mh.provo.novell.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75a101ba
  2. 17 Feb, 2016 6 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.4.2 · 1cb8570b
      Greg Kroah-Hartman authored
      1cb8570b
    • Benjamin Tissoires's avatar
      HID: multitouch: fix input mode switching on some Elan panels · bedd67e9
      Benjamin Tissoires authored
      commit 73e7d63e upstream.
      
      as reported by https://bugzilla.kernel.org/show_bug.cgi?id=108481
      
      This bug report mentions 6d4f5440 ("HID: multitouch: Fetch feature
      reports on demand for Win8 devices") as the origin of the problem but this
      commit actually masked 2 firmware bugs that were cancelling each other out:
      
      The report descriptor declares two features in reports 3 and 5:
      
      0x05, 0x0d,                    // Usage Page (Digitizers)             318
      0x09, 0x0e,                    // Usage (Device Configuration)        320
      0xa1, 0x01,                    // Collection (Application)            322
      0x85, 0x03,                    //  Report ID (3)                      324
      0x09, 0x22,                    //  Usage (Finger)                     326
      0xa1, 0x00,                    //  Collection (Physical)              328
      0x09, 0x52,                    //   Usage (Inputmode)                 330
      0x15, 0x00,                    //   Logical Minimum (0)               332
      0x25, 0x0a,                    //   Logical Maximum (10)              334
      0x75, 0x08,                    //   Report Size (8)                   336
      0x95, 0x02,                    //   Report Count (2)                  338
      0xb1, 0x02,                    //   Feature (Data,Var,Abs)            340
      0xc0,                          //  End Collection                     342
      0x09, 0x22,                    //  Usage (Finger)                     343
      0xa1, 0x00,                    //  Collection (Physical)              345
      0x85, 0x05,                    //   Report ID (5)                     347
      0x09, 0x57,                    //   Usage (Surface Switch)            349
      0x09, 0x58,                    //   Usage (Button Switch)             351
      0x15, 0x00,                    //   Logical Minimum (0)               353
      0x75, 0x01,                    //   Report Size (1)                   355
      0x95, 0x02,                    //   Report Count (2)                  357
      0x25, 0x03,                    //   Logical Maximum (3)               359
      0xb1, 0x02,                    //   Feature (Data,Var,Abs)            361
      0x95, 0x0e,                    //   Report Count (14)                 363
      0xb1, 0x03,                    //   Feature (Cnst,Var,Abs)            365
      0xc0,                          //  End Collection                     367
      
      The report ID 3 presents 2 input mode features, while only the first one
      is handled by the device. Given that we did not check whether one was
      previously assigned, we were dealing with the ignored feature and we
      should never have been able to switch this panel into the multitouch mode.
      
      However, the firmware presents another bug which allowed 6d4f5440
      to counteract the faulty report descriptor. When we request the values
      of the feature 5, the firmware answers "03 03 00". The fields are correct
      but the report id is wrong. Before 6d4f5440, we retrieved all the features
      and injected them in the system. So when we called report 5, we injected
      in the system the report 3 with the values "03 00".
      Setting the second input mode to 03 in this report changed it to "03 03"
      and the touchpad switched to the mt mode. We could have set anything
      in the second field because the actual value (the first 03 in this report)
      was given by the query of report ID 5.
      
      To sum up: 2 bugs in the firmware were hiding that we were accessing the
      wrong feature.
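
      The "check whether one was previously assigned" idea can be sketched as
      below.  This is a simplified model, not the driver's code: the struct,
      its fields, and the function name are hypothetical, and -1 is assumed
      to mean "no input mode feature recorded yet".

```c
/* Hypothetical, simplified view of the per-device state. */
struct mt_state {
    int inputmode;        /* report ID holding the input mode, -1 if unset */
    int inputmode_index;  /* index of the usage inside that report */
};

/* Called once per Inputmode usage found while parsing the descriptor. */
static void record_inputmode(struct mt_state *td, int report_id,
                             int usage_index)
{
    /* Keep only the first occurrence; later duplicates are ignored. */
    if (td->inputmode < 0) {
        td->inputmode = report_id;
        td->inputmode_index = usage_index;
    }
}
```

      With the descriptor above, the two Inputmode fields of report 3 arrive
      as indices 0 and 1, and the guard leaves index 0 selected.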
      Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bedd67e9
    • Tetsuo Handa's avatar
      mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress · fe5c164e
      Tetsuo Handa authored
      commit 564e81a5 upstream.
      
      Jan Stancek has reported that the system occasionally hangs after the
      "oom01" testcase from LTP triggers OOM.  Judging from the fact that a
      kworker thread is doing memory allocation and that the values of "Node 0
      Normal free:" and "Node 0 Normal:" differ when hanging, vmstat is not
      up-to-date for some reason.
      
      Commit 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory
      reclaim doesn't make any progress") meant to force the kworker thread
      to take a short sleep, but it mistakenly used schedule_timeout(1).
      We missed that schedule_timeout() in state TASK_RUNNING doesn't
      sleep at all.
      
      Fix it by using schedule_timeout_uninterruptible(1) which forces the
      kworker thread to take a short sleep in order to make sure that vmstat
      is up-to-date.
      
      Fixes: 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fe5c164e
    • Junil Lee's avatar
      zsmalloc: fix migrate_zspage-zs_free race condition · bddaf791
      Junil Lee authored
      commit c102f07c upstream.
      
      record_obj() in migrate_zspage() does not preserve handle's
      HANDLE_PIN_BIT, set by find_alloced_obj()->trypin_tag(), and implicitly
      (accidentally) un-pins the handle, while migrate_zspage() still performs
      an explicit unpin_tag() on that handle.  This additional explicit
      unpin_tag() introduces a race condition with zs_free(), which can pin
      that handle by this time, so the handle becomes un-pinned.
      
      Schematically, it goes like this:
      
        CPU0                                        CPU1
        migrate_zspage
          find_alloced_obj
            trypin_tag
              set HANDLE_PIN_BIT                    zs_free()
                                                      pin_tag()
        obj_malloc() -- new object, no tag
        record_obj() -- remove HANDLE_PIN_BIT           set HANDLE_PIN_BIT
        unpin_tag()  -- remove zs_free's HANDLE_PIN_BIT
      
      The race condition may result in a NULL pointer dereference:
      
        Unable to handle kernel NULL pointer dereference at virtual address 00000000
        CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
        PC is at get_zspage_mapping+0x0/0x24
        LR is at obj_free.isra.22+0x64/0x128
        Call trace:
           get_zspage_mapping+0x0/0x24
           zs_free+0x88/0x114
           zram_free_page+0x64/0xcc
           zram_slot_free_notify+0x90/0x108
           swap_entry_free+0x278/0x294
           free_swap_and_cache+0x38/0x11c
           unmap_single_vma+0x480/0x5c8
           unmap_vmas+0x44/0x60
           exit_mmap+0x50/0x110
           mmput+0x58/0xe0
           do_exit+0x320/0x8dc
           do_group_exit+0x44/0xa8
           get_signal+0x538/0x580
           do_signal+0x98/0x4b8
           do_notify_resume+0x14/0x5c
      
      This patch keeps the lock bit in the migration path and updates the
      value atomically.
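
      A minimal sketch of the difference, assuming the handle is a single
      word whose lowest bit is the pin bit and whose remaining bits hold the
      object value (so object values never use bit 0).  The helper names are
      hypothetical, and a plain store stands in for the kernel's atomic
      update:

```c
#define HANDLE_PIN_BIT 0
#define HANDLE_PIN_MASK (1UL << HANDLE_PIN_BIT)

/* Overwrites the whole word, including the pin bit. */
static void record_obj_model(unsigned long *handle, unsigned long obj)
{
    *handle = obj;
}

/* Buggy migration step: the new object value silently clears the pin. */
static void migrate_record_buggy(unsigned long *handle, unsigned long new_obj)
{
    record_obj_model(handle, new_obj);
}

/* Fixed migration step: the pin bit travels with the new object value. */
static void migrate_record_fixed(unsigned long *handle, unsigned long new_obj)
{
    record_obj_model(handle, new_obj | HANDLE_PIN_MASK);
}
```

      In the fixed variant the handle stays pinned until the single explicit
      unpin, so a concurrent zs_free() cannot take the pin in between.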
      Signed-off-by: Junil Lee <junil0814.lee@lge.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bddaf791
    • Jerome Marchand's avatar
      zram: don't call idr_remove() from zram_remove() · aee0848f
      Jerome Marchand authored
      commit 17ec4cd9 upstream.
      
      The use of idr_remove() is forbidden in the callback functions of
      idr_for_each().  It is therefore unsafe to call idr_remove() in
      zram_remove().
      
      This patch moves the call to idr_remove() from zram_remove() to
      hot_remove_store().  In the destroy_devices() path, idrs are removed by
      idr_destroy().  This solves a use-after-free detected by KASan.
      
      [akpm@linux-foundation.org: fix coding style, per Sergey]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aee0848f
    • Kyeongdon Kim's avatar
      zram: try vmalloc() after kmalloc() · 73f67360
      Kyeongdon Kim authored
      commit d913897a upstream.
      
      When using LZ4 multi compression streams for zram swap, we saw page
      allocation failure messages while running a system test.  This happened
      not just once but several times (2 - 5 times per test).  Also, some
      failures kept occurring when trying order-3 allocations.
      
      In order to allocate the private data for parallel compression, we have
      to call kzalloc() with order 2/3 at runtime (lzo/lz4).  But if there is
      no order 2/3 size memory to allocate at that time, the page allocation
      fails.  This patch uses vmalloc() as a fallback for kmalloc(), which
      prevents the page allocation failure warning.
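
      The fallback pattern itself is small.  The sketch below is a userspace
      model only: the allocator function pointers and always_fail() are
      hypothetical stand-ins for the contiguous and non-contiguous kernel
      allocators, and it does not show the gfp flags the real patch has to
      choose.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t size);

/* Stand-in for a high-order allocation that cannot be satisfied. */
static void *always_fail(size_t size)
{
    (void)size;
    return NULL;
}

/* Try the preferred allocator first, then fall back to the second one. */
static void *alloc_with_fallback(size_t size, alloc_fn primary,
                                 alloc_fn fallback)
{
    void *p = primary(size);

    if (!p)
        p = fallback(size);
    return p;
}
```

      A caller that mixes two allocators this way also needs a release path
      that copes with memory from either of them.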
      
      After applying this, we never saw the warning message while running the
      test; it also reduced process startup latency by about 60-120ms in each
      case.
      
      For reference, a call trace:
      
          Binder_1: page allocation failure: order:3, mode:0x10c0d0
          CPU: 0 PID: 424 Comm: Binder_1 Tainted: GW 3.10.49-perf-g991d02b-dirty #20
          Call trace:
            dump_backtrace+0x0/0x270
            show_stack+0x10/0x1c
            dump_stack+0x1c/0x28
            warn_alloc_failed+0xfc/0x11c
            __alloc_pages_nodemask+0x724/0x7f0
            __get_free_pages+0x14/0x5c
            kmalloc_order_trace+0x38/0xd8
            zcomp_lz4_create+0x2c/0x38
            zcomp_strm_alloc+0x34/0x78
            zcomp_strm_multi_find+0x124/0x1ec
            zcomp_strm_find+0xc/0x18
            zram_bvec_rw+0x2fc/0x780
            zram_make_request+0x25c/0x2d4
            generic_make_request+0x80/0xbc
            submit_bio+0xa4/0x15c
            __swap_writepage+0x218/0x230
            swap_writepage+0x3c/0x4c
            shrink_page_list+0x51c/0x8d0
            shrink_inactive_list+0x3f8/0x60c
            shrink_lruvec+0x33c/0x4cc
            shrink_zone+0x3c/0x100
            try_to_free_pages+0x2b8/0x54c
            __alloc_pages_nodemask+0x514/0x7f0
            __get_free_pages+0x14/0x5c
            proc_info_read+0x50/0xe4
            vfs_read+0xa0/0x12c
            SyS_read+0x44/0x74
          DMA: 3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
               0*512kB 0*1024kB 0*2048kB 0*4096kB = 13796kB
      
      [minchan@kernel.org: change vmalloc gfp and adding comment about gfp]
      [sergey.senozhatsky@gmail.com: tweak comments and styles]
      Signed-off-by: Kyeongdon Kim <kyeongdon.kim@lge.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      73f67360