1. 26 Jan, 2019 40 commits
    • Kevin Barnett's avatar
      scsi: smartpqi: correct lun reset issues · ca8ad9bc
      Kevin Barnett authored
      [ Upstream commit 2ba55c98 ]
      
      Problem:
      The Linux kernel takes a logical volume offline after a LUN reset.  This is
      generally accompanied by this message in the dmesg output:
      
      Device offlined - not ready after error recovery
      
      Root Cause:
      The root cause is a "quirk" in the timeout handling in the Linux SCSI
      layer. The Linux kernel places a 30-second timeout on most media access
      commands (reads and writes) that it send to device drivers.  When a media
      access command times out, the Linux kernel goes into error recovery mode
      for the LUN that was the target of the command that timed out. Every
      command that timed out is kept on a list inside of the Linux kernel to be
      retried later. The kernel attempts to recover the command(s) that timed out
      by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and
      TEST UNIT READY commands are successful, the kernel retries the command(s)
      that timed out.
      
      Each SCSI command issued by the kernel has a result field associated with
      it. This field indicates the final result of the command (success or
      error). When a command times out, the kernel places a value in this result
      field indicating that the command timed out.
      
      The "quirk" is that after the LUN reset and TEST UNIT READY commands are
      completed, the kernel checks each command on the timed-out command list
      before retrying it. If the result field is still "timed out", the kernel
      treats that command as not having been successfully recovered for a
      retry. If the number of commands that are in this state are greater than
      two, the kernel takes the LUN offline.
      
      Fix:
      When our RAIDStack receives a LUN reset, it simply waits until all
      outstanding commands complete. Generally, all of these outstanding commands
      complete successfully. Therefore, the fix in the smartpqi driver is to
      always set the command result field to indicate success when a request
      completes successfully. This normally isn’t necessary because the result
      field is always initialized to success when the command is submitted to the
      driver. So when the command completes successfully, the result field is
      left untouched. But in this case, the kernel changes the result field
      behind the driver’s back and then expects the field to be changed by the
      driver as the commands that timed-out complete.
      Reviewed-by: default avatarDave Carroll <david.carroll@microsemi.com>
      Reviewed-by: default avatarScott Teel <scott.teel@microsemi.com>
      Signed-off-by: default avatarKevin Barnett <kevin.barnett@microsemi.com>
      Signed-off-by: default avatarDon Brace <don.brace@microsemi.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      ca8ad9bc
    • Stephan Günther's avatar
      scsi: mpt3sas: fix memory ordering on 64bit writes · 868152e4
      Stephan Günther authored
      [ Upstream commit 23c3828a ]
      
      With commit 09c2f95a ("scsi: mpt3sas: Swap I/O memory read value back
      to cpu endianness"), 64bit writes in _base_writeq() were rewritten to use
      __raw_writeq() instad of writeq().
      
      This introduced a bug apparent on powerpc64 systems such as the Raptor
      Talos II that causes the HBA to drop from the PCIe bus under heavy load and
      being reinitialized after a couple of seconds.
      
      It can easily be triggered on affacted systems by using something like
      
        fio --name=random-write --iodepth=4 --rw=randwrite --bs=4k --direct=0 \
          --size=128M --numjobs=64 --end_fsync=1
        fio --name=random-write --iodepth=4 --rw=randwrite --bs=64k --direct=0 \
          --size=128M --numjobs=64 --end_fsync=1
      
      a couple of times. In my case I tested it on both a ZFS raidz2 and a btrfs
      raid6 using LSI 9300-8i and 9400-8i controllers.
      
      The fix consists in resembling the write ordering of writeq() by adding a
      mandatory write memory barrier before device access and a compiler barrier
      afterwards. The additional MMIO barrier is superfluous.
      Signed-off-by: default avatarStephan Günther <moepi@moepi.net>
      Reported-by: default avatarMatt Corallo <linux@bluematt.me>
      Acked-by: default avatarSreekanth Reddy <Sreekanth.Reddy@broadcom.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      868152e4
    • Parvi Kaustubhi's avatar
      IB/usnic: Fix potential deadlock · 6fa75685
      Parvi Kaustubhi authored
      [ Upstream commit 8036e90f ]
      
      Acquiring the rtnl lock while holding usdev_lock could result in a
      deadlock.
      
      For example:
      
      usnic_ib_query_port()
      | mutex_lock(&us_ibdev->usdev_lock)
       | ib_get_eth_speed()
        | rtnl_lock()
      
      rtnl_lock()
      | usnic_ib_netdevice_event()
       | mutex_lock(&us_ibdev->usdev_lock)
      
      This commit moves the usdev_lock acquisition after the rtnl lock has been
      released.
      
      This is safe to do because usdev_lock is not protecting anything being
      accessed in ib_get_eth_speed(). Hence, the correct order of holding locks
      (rtnl -> usdev_lock) is not violated.
      Signed-off-by: default avatarParvi Kaustubhi <pkaustub@cisco.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      6fa75685
    • Daniel Vetter's avatar
      sysfs: Disable lockdep for driver bind/unbind files · a13daf03
      Daniel Vetter authored
      [ Upstream commit 4f4b3743 ]
      
      This is the much more correct fix for my earlier attempt at:
      
      https://lkml.org/lkml/2018/12/10/118
      
      Short recap:
      
      - There's not actually a locking issue, it's just lockdep being a bit
        too eager to complain about a possible deadlock.
      
      - Contrary to what I claimed the real problem is recursion on
        kn->count. Greg pointed me at sysfs_break_active_protection(), used
        by the scsi subsystem to allow a sysfs file to unbind itself. That
        would be a real deadlock, which isn't what's happening here. Also,
        breaking the active protection means we'd need to manually handle
        all the lifetime fun.
      
      - With Rafael we discussed the task_work approach, which kinda works,
        but has two downsides: It's a functional change for a lockdep
        annotation issue, and it won't work for the bind file (which needs
        to get the errno from the driver load function back to userspace).
      
      - Greg also asked why this never showed up: To hit this you need to
        unregister a 2nd driver from the unload code of your first driver. I
        guess only gpus do that. The bug has always been there, but only
        with a recent patch series did we add more locks so that lockdep
        built a chain from unbinding the snd-hda driver to the
        acpi_video_unregister call.
      
      Full lockdep splat:
      
      [12301.898799] ============================================
      [12301.898805] WARNING: possible recursive locking detected
      [12301.898811] 4.20.0-rc7+ #84 Not tainted
      [12301.898815] --------------------------------------------
      [12301.898821] bash/5297 is trying to acquire lock:
      [12301.898826] 00000000f61c6093 (kn->count#39){++++}, at: kernfs_remove_by_name_ns+0x3b/0x80
      [12301.898841] but task is already holding lock:
      [12301.898847] 000000005f634021 (kn->count#39){++++}, at: kernfs_fop_write+0xdc/0x190
      [12301.898856] other info that might help us debug this:
      [12301.898862]  Possible unsafe locking scenario:
      [12301.898867]        CPU0
      [12301.898870]        ----
      [12301.898874]   lock(kn->count#39);
      [12301.898879]   lock(kn->count#39);
      [12301.898883] *** DEADLOCK ***
      [12301.898891]  May be due to missing lock nesting notation
      [12301.898899] 5 locks held by bash/5297:
      [12301.898903]  #0: 00000000cd800e54 (sb_writers#4){.+.+}, at: vfs_write+0x17f/0x1b0
      [12301.898915]  #1: 000000000465e7c2 (&of->mutex){+.+.}, at: kernfs_fop_write+0xd3/0x190
      [12301.898925]  #2: 000000005f634021 (kn->count#39){++++}, at: kernfs_fop_write+0xdc/0x190
      [12301.898936]  #3: 00000000414ef7ac (&dev->mutex){....}, at: device_release_driver_internal+0x34/0x240
      [12301.898950]  #4: 000000003218fbdf (register_count_mutex){+.+.}, at: acpi_video_unregister+0xe/0x40
      [12301.898960] stack backtrace:
      [12301.898968] CPU: 1 PID: 5297 Comm: bash Not tainted 4.20.0-rc7+ #84
      [12301.898974] Hardware name: Hewlett-Packard HP EliteBook 8460p/161C, BIOS 68SCF Ver. F.01 03/11/2011
      [12301.898982] Call Trace:
      [12301.898989]  dump_stack+0x67/0x9b
      [12301.898997]  __lock_acquire+0x6ad/0x1410
      [12301.899003]  ? kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899010]  ? find_held_lock+0x2d/0x90
      [12301.899017]  ? mutex_spin_on_owner+0xe4/0x150
      [12301.899023]  ? find_held_lock+0x2d/0x90
      [12301.899030]  ? lock_acquire+0x90/0x180
      [12301.899036]  lock_acquire+0x90/0x180
      [12301.899042]  ? kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899049]  __kernfs_remove+0x296/0x310
      [12301.899055]  ? kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899060]  ? kernfs_name_hash+0xd/0x80
      [12301.899066]  ? kernfs_find_ns+0x6c/0x100
      [12301.899073]  kernfs_remove_by_name_ns+0x3b/0x80
      [12301.899080]  bus_remove_driver+0x92/0xa0
      [12301.899085]  acpi_video_unregister+0x24/0x40
      [12301.899127]  i915_driver_unload+0x42/0x130 [i915]
      [12301.899160]  i915_pci_remove+0x19/0x30 [i915]
      [12301.899169]  pci_device_remove+0x36/0xb0
      [12301.899176]  device_release_driver_internal+0x185/0x240
      [12301.899183]  unbind_store+0xaf/0x180
      [12301.899189]  kernfs_fop_write+0x104/0x190
      [12301.899195]  __vfs_write+0x31/0x180
      [12301.899203]  ? rcu_read_lock_sched_held+0x6f/0x80
      [12301.899209]  ? rcu_sync_lockdep_assert+0x29/0x50
      [12301.899216]  ? __sb_start_write+0x13c/0x1a0
      [12301.899221]  ? vfs_write+0x17f/0x1b0
      [12301.899227]  vfs_write+0xb9/0x1b0
      [12301.899233]  ksys_write+0x50/0xc0
      [12301.899239]  do_syscall_64+0x4b/0x180
      [12301.899247]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [12301.899253] RIP: 0033:0x7f452ac7f7a4
      [12301.899259] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 00 8b 05 aa f0 2c 00 48 63 ff 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 55 53 48 89 d5 48 89 f3 48 83
      [12301.899273] RSP: 002b:00007ffceafa6918 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [12301.899282] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f452ac7f7a4
      [12301.899288] RDX: 000000000000000d RSI: 00005612a1abf7c0 RDI: 0000000000000001
      [12301.899295] RBP: 00005612a1abf7c0 R08: 000000000000000a R09: 00005612a1c46730
      [12301.899301] R10: 000000000000000a R11: 0000000000000246 R12: 000000000000000d
      [12301.899308] R13: 0000000000000001 R14: 00007f452af4a740 R15: 000000000000000d
      
      Looking around I've noticed that usb and i2c already handle similar
      recursion problems, where a sysfs file can unbind the same type of
      sysfs somewhere else in the hierarchy. Relevant commits are:
      
      commit 356c05d5
      Author: Alan Stern <stern@rowland.harvard.edu>
      Date:   Mon May 14 13:30:03 2012 -0400
      
          sysfs: get rid of some lockdep false positives
      
      commit e9b526fe
      Author: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Date:   Fri May 17 14:56:35 2013 +0200
      
          i2c: suppress lockdep warning on delete_device
      
      Implement the same trick for driver bind/unbind.
      
      v2: Put the macro into bus.c (Greg).
      Reviewed-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Ramalingam C <ramalingam.c@intel.com>
      Cc: Arend van Spriel <aspriel@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Bartosz Golaszewski <brgl@bgdev.pl>
      Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com>
      Cc: Vivek Gautam <vivek.gautam@codeaurora.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: default avatarDaniel Vetter <daniel.vetter@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      a13daf03
    • Takashi Sakamoto's avatar
      ALSA: bebob: fix model-id of unit for Apogee Ensemble · 959bf5c1
      Takashi Sakamoto authored
      [ Upstream commit 644b2e97 ]
      
      This commit fixes hard-coded model-id for an unit of Apogee Ensemble with
      a correct value. This unit uses DM1500 ASIC produced ArchWave AG (formerly
      known as BridgeCo AG).
      
      I note that this model supports three modes in the number of data channels
      in tx/rx streams; 8 ch pairs, 10 ch pairs, 18 ch pairs. The mode is
      switched by Vendor-dependent AV/C command, like:
      
      $ cd linux-firewire-utils
      $ ./firewire-request /dev/fw1 fcp 0x00ff000003dbeb0600000000 (8ch pairs)
      $ ./firewire-request /dev/fw1 fcp 0x00ff000003dbeb0601000000 (10ch pairs)
      $ ./firewire-request /dev/fw1 fcp 0x00ff000003dbeb0602000000 (18ch pairs)
      
      When switching between different mode, the unit disappears from IEEE 1394
      bus, then appears on the bus with different combination of stream formats.
      In a mode of 18 ch pairs, available sampling rate is up to 96.0 kHz, else
      up to 192.0 kHz.
      
      $ ./hinawa-config-rom-printer /dev/fw1
      { 'bus-info': { 'adj': False,
                      'bmc': True,
                      'chip_ID': 21474898341,
                      'cmc': True,
                      'cyc_clk_acc': 100,
                      'generation': 2,
                      'imc': True,
                      'isc': True,
                      'link_spd': 2,
                      'max_ROM': 1,
                      'max_rec': 512,
                      'name': '1394',
                      'node_vendor_ID': 987,
                      'pmc': False},
        'root-directory': [ ['HARDWARE_VERSION', 19],
                            [ 'NODE_CAPABILITIES',
                              { 'addressing': {'64': True, 'fix': True, 'prv': False},
                                'misc': {'int': False, 'ms': False, 'spt': True},
                                'state': { 'atn': False,
                                           'ded': False,
                                           'drq': True,
                                           'elo': False,
                                           'init': False,
                                           'lst': True,
                                           'off': False},
                                'testing': {'bas': False, 'ext': False}}],
                            ['VENDOR', 987],
                            ['DESCRIPTOR', 'Apogee Electronics'],
                            ['MODEL', 126702],
                            ['DESCRIPTOR', 'Ensemble'],
                            ['VERSION', 5297],
                            [ 'UNIT',
                              [ ['SPECIFIER_ID', 41005],
                                ['VERSION', 65537],
                                ['MODEL', 126702],
                                ['DESCRIPTOR', 'Ensemble']]],
                            [ 'DEPENDENT_INFO',
                              [ ['SPECIFIER_ID', 2037],
                                ['VERSION', 1],
                                [(58, 'IMMEDIATE'), 16777159],
                                [(59, 'IMMEDIATE'), 1048576],
                                [(60, 'IMMEDIATE'), 16777159],
                                [(61, 'IMMEDIATE'), 6291456]]]]}
      Signed-off-by: default avatarTakashi Sakamoto <o-takashi@sakamocchi.jp>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      959bf5c1
    • Raghuram Hegde's avatar
      Bluetooth: btusb: Add support for Intel bluetooth device 8087:0029 · c5e68453
      Raghuram Hegde authored
      [ Upstream commit 2da711bc ]
      
      Include the new USB product ID for Intel Bluetooth device 22260
      family(CcPeak)
      
      The /sys/kernel/debug/usb/devices portion for this device is:
      
      T:  Bus=01 Lev=01 Prnt=01 Port=02 Cnt=02 Dev#=  2 Spd=12   MxCh= 0
      D:  Ver= 2.00 Cls=e0(wlcon) Sub=01 Prot=01 MxPS=64 #Cfgs=  1
      P:  Vendor=8087 ProdID=0029 Rev= 0.01
      C:* #Ifs= 2 Cfg#= 1 Atr=e0 MxPwr=100mA
      I:* If#= 0 Alt= 0 #EPs= 3 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=81(I) Atr=03(Int.) MxPS=  64 Ivl=1ms
      E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
      E:  Ad=82(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
      I:* If#= 1 Alt= 0 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=   0 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=   0 Ivl=1ms
      I:  If#= 1 Alt= 1 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=   9 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=   9 Ivl=1ms
      I:  If#= 1 Alt= 2 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  17 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  17 Ivl=1ms
      I:  If#= 1 Alt= 3 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  25 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  25 Ivl=1ms
      I:  If#= 1 Alt= 4 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  33 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  33 Ivl=1ms
      I:  If#= 1 Alt= 5 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  49 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  49 Ivl=1ms
      I:  If#= 1 Alt= 6 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  63 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  63 Ivl=1ms
      Signed-off-by: default avatarRaghuram Hegde <raghuram.hegde@intel.com>
      Signed-off-by: default avatarChethan T N <chethan.tumkur.narayan@intel.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      c5e68453
    • Milan Broz's avatar
      dm: Check for device sector overflow if CONFIG_LBDAF is not set · 887b1c9a
      Milan Broz authored
      [ Upstream commit ef87bfc2 ]
      
      Reference to a device in device-mapper table contains offset in sectors.
      
      If the sector_t is 32bit integer (CONFIG_LBDAF is not set), then
      several device-mapper targets can overflow this offset and validity
      check is then performed on a wrong offset and a wrong table is activated.
      
      See for example (on 32bit without CONFIG_LBDAF) this overflow:
      
        # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
        # dmsetup table test
        0 2048 linear 8:96 1
      
      This patch adds explicit check for overflow if the offset is sector_t type.
      Signed-off-by: default avatarMilan Broz <gmazyland@gmail.com>
      Reviewed-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      887b1c9a
    • Yangtao Li's avatar
      clocksource/drivers/integrator-ap: Add missing of_node_put() · decca9bc
      Yangtao Li authored
      [ Upstream commit 5eb73c83 ]
      
      The function of_find_node_by_path() acquires a reference to the node
      returned by it and that reference needs to be dropped by its caller.
      
      integrator_ap_timer_init_of() doesn't do that.  The pri_node and the
      sec_node are used as an identifier to compare against the current
      node, so we can directly drop the refcount after getting the node from
      the path as it is not used as pointer.
      
      By dropping the refcount right after getting it, a single variable is
      needed instead of two.
      
      Fix this by use a single variable and drop the refcount right after
      of_find_node_by_path().
      Signed-off-by: default avatarYangtao Li <tiny.windzz@gmail.com>
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      decca9bc
    • Javier Barrio's avatar
      quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. · 876b79b9
      Javier Barrio authored
      [ Upstream commit 41c4f85c ]
      
      Commit 1fa5efe3 (ext4: Use generic helpers for quotaon
      and quotaoff) made possible to call quotactl(Q_XQUOTAON/OFF) on ext4 filesystems
      with sysfile quota support. This leads to calling dquot_enable/disable without s_umount
      held in excl. mode, because quotactl_cmd_onoff checks only for Q_QUOTAON/OFF.
      
      The following WARN_ON_ONCE triggers (in this case for dquot_enable, ext4, latest Linus' tree):
      
      [  117.807056] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: quota,prjquota
      
      [...]
      
      [  155.036847] WARNING: CPU: 0 PID: 2343 at fs/quota/dquot.c:2469 dquot_enable+0x34/0xb9
      [  155.036851] Modules linked in: quota_v2 quota_tree ipv6 af_packet joydev mousedev psmouse serio_raw pcspkr i2c_piix4 intel_agp intel_gtt e1000 ttm drm_kms_helper drm agpgart fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_core input_leds kvm_intel kvm irqbypass qemu_fw_cfg floppy evdev parport_pc parport button crc32c_generic dm_mod ata_generic pata_acpi ata_piix libata loop ext4 crc16 mbcache jbd2 usb_storage usbcore sd_mod scsi_mod
      [  155.036901] CPU: 0 PID: 2343 Comm: qctl Not tainted 4.20.0-rc6-00025-gf5d58277 #9
      [  155.036903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [  155.036911] RIP: 0010:dquot_enable+0x34/0xb9
      [  155.036915] Code: 41 56 41 55 41 54 55 53 4c 8b 6f 28 74 02 0f 0b 4d 8d 7d 70 49 89 fc 89 cb 41 89 d6 89 f5 4c 89 ff e8 23 09 ea ff 85 c0 74 0a <0f> 0b 4c 89 ff e8 8b 09 ea ff 85 db 74 6a 41 8b b5 f8 00 00 00 0f
      [  155.036918] RSP: 0018:ffffb09b00493e08 EFLAGS: 00010202
      [  155.036922] RAX: 0000000000000001 RBX: 0000000000000008 RCX: 0000000000000008
      [  155.036924] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9781b67cd870
      [  155.036926] RBP: 0000000000000002 R08: 0000000000000000 R09: 61c8864680b583eb
      [  155.036929] R10: ffffb09b00493e48 R11: ffffffffff7ce7d4 R12: ffff9781b7ee8d78
      [  155.036932] R13: ffff9781b67cd800 R14: 0000000000000004 R15: ffff9781b67cd870
      [  155.036936] FS:  00007fd813250b88(0000) GS:ffff9781ba000000(0000) knlGS:0000000000000000
      [  155.036939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.036942] CR2: 00007fd812ff61d6 CR3: 000000007c882000 CR4: 00000000000006b0
      [  155.036951] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  155.036953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  155.036955] Call Trace:
      [  155.037004]  dquot_quota_enable+0x8b/0xd0
      [  155.037011]  kernel_quotactl+0x628/0x74e
      [  155.037027]  ? do_mprotect_pkey+0x2a6/0x2cd
      [  155.037034]  __x64_sys_quotactl+0x1a/0x1d
      [  155.037041]  do_syscall_64+0x55/0xe4
      [  155.037078]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  155.037105] RIP: 0033:0x7fd812fe1198
      [  155.037109] Code: 02 77 0d 48 89 c1 48 c1 e9 3f 75 04 48 8b 04 24 48 83 c4 50 5b c3 48 83 ec 08 49 89 ca 48 63 d2 48 63 ff b8 b3 00 00 00 0f 05 <48> 89 c7 e8 c1 eb ff ff 5a c3 48 63 ff b8 bb 00 00 00 0f 05 48 89
      [  155.037112] RSP: 002b:00007ffe8cd7b050 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
      [  155.037116] RAX: ffffffffffffffda RBX: 00007ffe8cd7b148 RCX: 00007fd812fe1198
      [  155.037119] RDX: 0000000000000000 RSI: 00007ffe8cd7cea9 RDI: 0000000000580102
      [  155.037121] RBP: 00007ffe8cd7b0f0 R08: 000055fc8eba8a9d R09: 0000000000000000
      [  155.037124] R10: 00007ffe8cd7b074 R11: 0000000000000206 R12: 00007ffe8cd7b168
      [  155.037126] R13: 000055fc8eba8897 R14: 0000000000000000 R15: 0000000000000000
      [  155.037131] ---[ end trace 210f864257175c51 ]---
      
      and then the syscall proceeds without s_umount locking.
      
      This patch locks the superblock ->s_umount sem. in exclusive mode for all Q_XQUOTAON/OFF
      quotactls too in addition to Q_QUOTAON/OFF.
      
      AFAICT, other than ext4, only xfs and ocfs2 are affected by this change.
      The VFS will now call in xfs_quota_* functions with s_umount held, which wasn't the case
      before. This looks good to me but I can not say for sure. Ext4 and ocfs2 where already
      beeing called with s_umount exclusive via quota_quotaon/off which is basically the same.
      Signed-off-by: default avatarJavier Barrio <javier.barrio.mart@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      876b79b9
    • Arnaldo Carvalho de Melo's avatar
      perf tools: Add missing open_memstream() prototype for systems lacking it · 77f14a49
      Arnaldo Carvalho de Melo authored
      [ Upstream commit d7a8c4a6 ]
      
      There are systems such as the Android NDK API level 24 has the
      open_memstream() function but doesn't provide a prototype, adding noise
      to the build:
      
        builtin-timechart.c: In function 'cat_backtrace':
        builtin-timechart.c:486:2: warning: implicit declaration of function 'open_memstream' [-Wimplicit-function-declaration]
          FILE *f = open_memstream(&p, &p_len);
          ^
        builtin-timechart.c:486:2: warning: nested extern declaration of 'open_memstream' [-Wnested-externs]
        builtin-timechart.c:486:12: warning: initialization makes pointer from integer without a cast
          FILE *f = open_memstream(&p, &p_len);
                    ^
      
      Define a LACKS_OPEN_MEMSTREAM_PROTOTYPE define so that code needing that
      can get a prototype.
      
      Checked in the bionic git repo to be available since level 23:
      
      https://android.googlesource.com/platform/bionic/+/master/libc/include/stdio.h#241
      
        FILE* open_memstream(char** __ptr, size_t* __size_ptr) __INTRODUCED_IN(23);
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-343ashae97e5bq6vizusyfno@git.kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      77f14a49
    • Arnaldo Carvalho de Melo's avatar
      perf tools: Add missing sigqueue() prototype for systems lacking it · e2a1f8d6
      Arnaldo Carvalho de Melo authored
      [ Upstream commit 748fe088 ]
      
      There are systems such as the Android NDK API level 24 has the
      sigqueue() function but doesn't provide a prototype, adding noise to the
      build:
      
        util/evlist.c: In function 'perf_evlist__prepare_workload':
        util/evlist.c:1494:4: warning: implicit declaration of function 'sigqueue' [-Wimplicit-function-declaration]
            if (sigqueue(getppid(), SIGUSR1, val))
            ^
        util/evlist.c:1494:4: warning: nested extern declaration of 'sigqueue' [-Wnested-externs]
      
      Define a LACKS_SIGQUEUE_PROTOTYPE define so that code needing that can
      get a prototype.
      
      Checked in the bionic git repo to be available since level 23:
      
      https://android.googlesource.com/platform/bionic/+/master/libc/include/signal.h#123
      
        int sigqueue(pid_t __pid, int __signal, const union sigval __value) __INTRODUCED_IN(23);
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-lmhpev1uni9kdrv7j29glyov@git.kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      e2a1f8d6
    • Leo Yan's avatar
      perf cs-etm: Correct packets swapping in cs_etm__flush() · 4bc4b575
      Leo Yan authored
      [ Upstream commit 43fd5666 ]
      
      The structure cs_etm_queue uses 'prev_packet' to point to previous
      packet, this can be used to combine with new coming packet to generate
      samples.
      
      In function cs_etm__flush() it swaps packets only when the flag
      'etm->synth_opts.last_branch' is true, this means that it will not swap
      packets if without option '--itrace=il' to generate last branch entries;
      thus for this case the 'prev_packet' doesn't point to the correct
      previous packet and the stale packet still will be used to generate
      sequential sample.  Thus if dump trace with 'perf script' command we can
      see the incorrect flow with the stale packet's address info.
      
      This patch corrects packets swapping in cs_etm__flush(); except using
      the flag 'etm->synth_opts.last_branch' it also checks the another flag
      'etm->sample_branches', if any flag is true then it swaps packets so can
      save correct content to 'prev_packet'.  Finally this can fix the wrong
      program flow dumping issue.
      
      The patch has a minor refactoring to use 'etm->synth_opts.last_branch'
      instead of 'etmq->etm->synth_opts.last_branch' for condition checking,
      this is consistent with that is done in cs_etm__sample().
      Signed-off-by: default avatarLeo Yan <leo.yan@linaro.org>
      Reviewed-by: default avatarMathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Leach <mike.leach@linaro.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Robert Walker <robert.walker@arm.com>
      Cc: coresight@lists.linaro.org
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1544513908-16805-2-git-send-email-leo.yan@linaro.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      4bc4b575
    • Nikos Tsironis's avatar
      dm snapshot: Fix excessive memory usage and workqueue stalls · 9e5be33b
      Nikos Tsironis authored
      [ Upstream commit 721b1d98 ]
      
      kcopyd has no upper limit to the number of jobs one can allocate and
      issue. Under certain workloads this can lead to excessive memory usage
      and workqueue stalls. For example, when creating multiple dm-snapshot
      targets with a 4K chunk size and then writing to the origin through the
      page cache. Syncing the page cache causes a large number of BIOs to be
      issued to the dm-snapshot origin target, which itself issues an even
      larger (because of the BIO splitting taking place) number of kcopyd
      jobs.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
      
      , with 8 active snapshots, results in the kcopyd job slab cache growing
      to 10G. Depending on the available system RAM this can lead to the OOM
      killer killing user processes:
      
      [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
                    nodemask=(null), order=1, oom_score_adj=0
      [463.492894] kthreadd cpuset=/ mems_allowed=0
      [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
      [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [463.492952] Call Trace:
      [463.492964]  dump_stack+0x7d/0xbb
      [463.492973]  dump_header+0x6b/0x2fc
      [463.492987]  ? lockdep_hardirqs_on+0xee/0x190
      [463.493012]  oom_kill_process+0x302/0x370
      [463.493021]  out_of_memory+0x113/0x560
      [463.493030]  __alloc_pages_slowpath+0xf40/0x1020
      [463.493055]  __alloc_pages_nodemask+0x348/0x3c0
      [463.493067]  cache_grow_begin+0x81/0x8b0
      [463.493072]  ? cache_grow_begin+0x874/0x8b0
      [463.493078]  fallback_alloc+0x1e4/0x280
      [463.493092]  kmem_cache_alloc_node+0xd6/0x370
      [463.493098]  ? copy_process.part.31+0x1c5/0x20d0
      [463.493105]  copy_process.part.31+0x1c5/0x20d0
      [463.493115]  ? __lock_acquire+0x3cc/0x1550
      [463.493121]  ? __switch_to_asm+0x34/0x70
      [463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
      [463.493135]  ? finish_task_switch+0x90/0x280
      [463.493165]  _do_fork+0xe0/0x6d0
      [463.493191]  ? kthreadd+0x19f/0x220
      [463.493233]  kernel_thread+0x25/0x30
      [463.493235]  kthreadd+0x1bf/0x220
      [463.493242]  ? kthread_create_on_cpu+0x90/0x90
      [463.493248]  ret_from_fork+0x3a/0x50
      [463.493279] Mem-Info:
      [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
      [463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
      [463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
      [463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
      [463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
      [463.493285]  free:33623 free_pcp:2392 free_cma:0
      ...
      [463.493489] Unreclaimable slab info:
      [463.493513] Name                      Used          Total
      [463.493522] bio-6                   1028KB       1028KB
      [463.493525] bio-5                   1028KB       1028KB
      [463.493528] dm_snap_pending_exception     236783KB     243789KB
      [463.493531] dm_exception              41KB         42KB
      [463.493534] bio-4                   1216KB       1216KB
      [463.493537] bio-3                 439396KB     439396KB
      [463.493539] kcopyd_job           6973427KB    6973427KB
      ...
      [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
      [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
      [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Moreover, issuing a large number of kcopyd jobs results in kcopyd
      hogging the CPU, while processing them. As a result, processing of work
      items, queued for execution on the same CPU as the currently running
      kcopyd thread, is stalled for long periods of time, hurting performance.
      Running the aforementioned test we get, in dmesg, messages like the
      following:
      
      [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
      [67501.195586] Showing busy workqueues and worker pools:
      [67501.195591] workqueue events: flags=0x0
      [67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195611]     pending: cache_reap
      [67501.195641] workqueue mm_percpu_wq: flags=0x8
      [67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195656]     pending: vmstat_update
      [67501.195682] workqueue kblockd: flags=0x18
      [67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
      [67501.195698]     pending: blk_timeout_work
      [67501.195753] workqueue kcopyd: flags=0x8
      [67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195768]     pending: do_work [dm_mod]
      [67501.195802] workqueue kcopyd: flags=0x8
      [67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195817]     pending: do_work [dm_mod]
      [67501.195834] workqueue kcopyd: flags=0x8
      [67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195848]     pending: do_work [dm_mod]
      [67501.195881] workqueue kcopyd: flags=0x8
      [67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195896]     pending: do_work [dm_mod]
      [67501.195920] workqueue kcopyd: flags=0x8
      [67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
      [67501.195935]     in-flight: 67:do_work [dm_mod]
      [67501.195945]     pending: do_work [dm_mod]
      [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
      
      The root cause for these issues is the way dm-snapshot uses kcopyd. In
      particular, the lack of an explicit or implicit limit to the maximum
      number of in-flight COW jobs. The merging path is not affected because
      it implicitly limits the in-flight kcopyd jobs to one.
      
      Fix these issues by using a semaphore to limit the maximum number of
      in-flight kcopyd jobs. We grab the semaphore before allocating a new
      kcopyd job in start_copy() and start_full_bio() and release it after the
      job finishes in copy_callback().
      
      The initial semaphore value is configurable through a module parameter,
      to allow fine tuning the maximum number of in-flight COW jobs. Setting
      this parameter to zero initializes the semaphore to INT_MAX.
      
      A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
      value was decided experimentally as a trade-off between memory
      consumption, stalling the kernel's workqueues and maintaining a high
      enough throughput.
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * kcopyd's job slab cache uses a maximum of 130MB
        * The time taken by the test to write to the snapshot-origin target is
          reduced from 05m20.48s to 03m26.38s
      
      [1] https://github.com/jthornber/device-mapper-test-suiteSigned-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      9e5be33b
    • Arnaldo Carvalho de Melo's avatar
      tools lib subcmd: Don't add the kernel sources to the include path · d9513fdb
      Arnaldo Carvalho de Melo authored
      [ Upstream commit ece98049 ]
      
      At some point we decided not to directly include kernel sources files
      when building tools/perf/, but when tools/lib/subcmd/ was forked from
      tools/perf it somehow ended up adding it via these two lines in its
      Makefile:
      
        CFLAGS += -I$(srctree)/include/uapi
        CFLAGS += -I$(srctree)/include
      
      As $(srctree) points to the kernel sources.
      
      Removing those lines and keeping just:
      
        CFLAGS += -I$(srctree)/tools/include/
      
      Is enough to build tools/perf and tools/objtool.
      
      This fixes the build when building from the sources in environments such
      as the Android NDK crossbuilding from a fedora:26 system:
      
        subcmd-util.h:11:15: error: expected ',' or ';' before 'void'
         static inline void report(const char *prefix, const char *err, va_list params)
                       ^
        In file included from /git/perf/include/uapi/linux/stddef.h:2:0,
                         from /git/perf/include/uapi/linux/posix_types.h:5,
                         from /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/sys/types.h:36,
                         from /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/unistd.h:33,
                         from run-command.c:2:
        subcmd-util.h:18:17: error: '__no_instrument_function__' attribute applies only to functions
      
      The /opt/android-ndk-r12b/platforms/android-24/arch-arm/usr/include/sys/types.h
      file that includes linux/posix_types.h ends up getting the one in the kernel
      sources causing the breakage. Fix it.
      
      Test built tools/objtool/ too.
      Reported-by: default avatarJiri Olsa <jolsa@kernel.org>
      Tested-by: default avatarJiri Olsa <jolsa@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Fixes: 4b6ab94e ("perf subcmd: Create subcmd library")
      Link: https://lkml.kernel.org/n/tip-5lhaoecrj12t0bqwvpiu14sm@git.kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      d9513fdb
    • Michael Petlan's avatar
      perf stat: Avoid segfaults caused by negated options · 8603cac2
      Michael Petlan authored
      [ Upstream commit 51433ead ]
      
      Some 'perf stat' options do not make sense to be negated (event,
      cgroup), some do not have negated path implemented (metrics). Due to
      that, it is better to disable the "no-" prefix for them, since
      otherwise, the later opt-parsing segfaults.
      
      Before:
      
        $ perf stat --no-metrics -- ls
        Segmentation fault (core dumped)
      
      After:
      
        $ perf stat --no-metrics -- ls
         Error: option `no-metrics' isn't available
         Usage: perf stat [<options>] [<command>]
      Signed-off-by: default avatarMichael Petlan <mpetlan@redhat.com>
      Tested-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      LPU-Reference: 1485912065.62416880.1544457604340.JavaMail.zimbra@redhat.com
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8603cac2
    • Nikos Tsironis's avatar
      dm kcopyd: Fix bug causing workqueue stalls · cbd257f3
      Nikos Tsironis authored
      [ Upstream commit d7e6b8df ]
      
      When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
      submitting copy jobs with a source size of 0, the jobs are pushed
      directly to the complete_jobs list, which could be under processing by
      the kcopyd thread. As a result, the kcopyd thread can continue running
      completed jobs indefinitely, without releasing the CPU, as long as
      someone keeps submitting new completed jobs through the aforementioned
      paths. Processing of work items, queued for execution on the same CPU as
      the currently running kcopyd thread, is thus stalled for excessive
      amounts of time, hurting performance.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n parallel_io_to_many_snaps_N
      
      , with 8 active snapshots, we get, in dmesg, messages like the
      following:
      
      [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
      [68899.949282] Showing busy workqueues and worker pools:
      [68899.949288] workqueue events: flags=0x0
      [68899.949295]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949306]     pending: vmstat_shepherd, cache_reap
      [68899.949331] workqueue mm_percpu_wq: flags=0x8
      [68899.949337]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949345]     pending: vmstat_update
      [68899.949387] workqueue dm_bufio_cache: flags=0x8
      [68899.949392]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
      [68899.949400]     pending: work_fn [dm_bufio]
      [68899.949423] workqueue kcopyd: flags=0x8
      [68899.949429]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949437]     pending: do_work [dm_mod]
      [68899.949452] workqueue kcopyd: flags=0x8
      [68899.949458]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949466]     in-flight: 13:do_work [dm_mod]
      [68899.949474]     pending: do_work [dm_mod]
      [68899.949487] workqueue kcopyd: flags=0x8
      [68899.949493]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949501]     pending: do_work [dm_mod]
      [68899.949515] workqueue kcopyd: flags=0x8
      [68899.949521]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949529]     pending: do_work [dm_mod]
      [68899.949541] workqueue kcopyd: flags=0x8
      [68899.949547]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949555]     pending: do_work [dm_mod]
      [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084
      
      Fix this by splitting the complete_jobs list into two parts: A user
      facing part, named callback_jobs, and one used internally by kcopyd,
      retaining the name complete_jobs. dm_kcopyd_do_callback() and
      dispatch_job() now push their jobs to the callback_jobs list, which is
      spliced to the complete_jobs list once, every time the kcopyd thread
      wakes up. This prevents kcopyd from hogging the CPU indefinitely and
      causing workqueue stalls.
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * The maximum writing time among all targets is reduced from 09m37.10s
          to 06m04.85s and the total run time of the test is reduced from
          10m43.591s to 7m19.199s
      
      [1] https://github.com/jthornber/device-mapper-test-suiteSigned-off-by: default avatarNikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: default avatarIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      cbd257f3
    • AliOS system security's avatar
      dm crypt: use u64 instead of sector_t to store iv_offset · 4e26ee31
      AliOS system security authored
      [ Upstream commit 8d683dcd ]
      
      The iv_offset in the mapping table of crypt target is a 64bit number when
      IV algorithm is plain64, plain64be, essiv or benbi. It will be assigned to
      iv_offset of struct crypt_config, cc_sector of struct convert_context and
      iv_sector of struct dm_crypt_request. These structures members are defined
      as a sector_t. But sector_t is 32bit when CONFIG_LBDAF is not set in 32bit
      kernel. In this situation sector_t is not big enough to store the 64bit
      iv_offset.
      
      Here is a reproducer.
      Prepare test image and device (loop is automatically allocated by cryptsetup):
      
        # dd if=/dev/zero of=tst.img bs=1M count=1
        # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
        --skip 500000000000000000 tst.img test
      
      On 32bit system (use IV offset value that overflows to 64bit; CONFIG_LBDAF if off)
      and device checksum is wrong:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0
      
        # sha256sum /dev/mapper/test
        533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test
      
      On 64bit system (and on 32bit system with the patch), table and checksum is now correct:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0
      
        # sha256sum /dev/mapper/test
        5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test
      Signed-off-by: default avatarAliOS system security <alios_sys_security@linux.alibaba.com>
      Tested-and-Reviewed-by: default avatarMilan Broz <gmazyland@gmail.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      4e26ee31
    • Hui Wang's avatar
      x86/topology: Use total_cpus for max logical packages calculation · a4772e8b
      Hui Wang authored
      [ Upstream commit aa02ef09 ]
      
      nr_cpu_ids can be limited on the command line via nr_cpus=. This can break the
      logical package management because it results in a smaller number of packages
      while in kdump kernel.
      
      Check below case:
      There is a two sockets system, each socket has 8 cores, which has 16 logical
      cpus while HT was turn on.
      
       0  1  2  3  4  5  6  7     |    16 17 18 19 20 21 22 23
       cores on socket 0               threads on socket 0
       8  9 10 11 12 13 14 15     |    24 25 26 27 28 29 30 31
       cores on socket 1               threads on socket 1
      
      While starting the kdump kernel with command line option nr_cpus=16 panic
      was triggered on one of the cpus 24-31 eg. 26, then online cpu will be
      1-15, 26(cpu 0 was disabled in kdump), ncpus will be 16 and
      __max_logical_packages will be 1, but actually two packages were booted on.
      
      This issue can reproduced by set kdump option nr_cpus=<real physical core
      numbers>, and then trigger panic on last socket's thread, for example:
      
      taskset -c 26 echo c > /proc/sysrq-trigger
      
      Use total_cpus which will not be limited by nr_cpus command line to calculate
      the value of __max_logical_packages.
      Signed-off-by: default avatarHui Wang <john.wanghui@huawei.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: <guijianfeng@huawei.com>
      Cc: <wencongyang2@huawei.com>
      Cc: <douliyang1@huawei.com>
      Cc: <qiaonuohan@huawei.com>
      Link: https://lkml.kernel.org/r/20181107023643.22174-1-john.wanghui@huawei.comSigned-off-by: default avatarSasha Levin <sashal@kernel.org>
      a4772e8b
    • Taehee Yoo's avatar
      netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine · 9d51378a
      Taehee Yoo authored
      [ Upstream commit 5a86d68b ]
      
      When network namespace is destroyed, cleanup_net() is called.
      cleanup_net() holds pernet_ops_rwsem then calls each ->exit callback.
      So that clusterip_tg_destroy() is called by cleanup_net().
      And clusterip_tg_destroy() calls unregister_netdevice_notifier().
      
      But both cleanup_net() and clusterip_tg_destroy() hold same
      lock(pernet_ops_rwsem). hence deadlock occurrs.
      
      After this patch, only 1 notifier is registered when module is inserted.
      And all of configs are added to per-net list.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.809674] ============================================
      [  341.809674] WARNING: possible recursive locking detected
      [  341.809674] 4.19.0-rc5+ #16 Tainted: G        W
      [  341.809674] --------------------------------------------
      [  341.809674] kworker/u4:2/87 is trying to acquire lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x8c/0x460
      [  341.809674]
      [  341.809674] but task is already holding lock:
      [  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] other info that might help us debug this:
      [  341.809674]  Possible unsafe locking scenario:
      [  341.809674]
      [  341.809674]        CPU0
      [  341.809674]        ----
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]   lock(pernet_ops_rwsem);
      [  341.809674]
      [  341.809674]  *** DEADLOCK ***
      [  341.809674]
      [  341.809674]  May be due to missing lock nesting notation
      [  341.809674]
      [  341.809674] 3 locks held by kworker/u4:2/87:
      [  341.809674]  #0: 00000000d9df6c92 ((wq_completion)"%s""netns"){+.+.}, at: process_one_work+0xafe/0x1de0
      [  341.809674]  #1: 00000000c2cbcee2 (net_cleanup_work){+.+.}, at: process_one_work+0xb60/0x1de0
      [  341.809674]  #2: 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
      [  341.809674]
      [  341.809674] stack backtrace:
      [  341.809674] CPU: 1 PID: 87 Comm: kworker/u4:2 Tainted: G        W         4.19.0-rc5+ #16
      [  341.809674] Workqueue: netns cleanup_net
      [  341.809674] Call Trace:
      [ ... ]
      [  342.070196]  down_write+0x93/0x160
      [  342.070196]  ? unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? down_read+0x1e0/0x1e0
      [  342.070196]  ? sched_clock_cpu+0x126/0x170
      [  342.070196]  ? find_held_lock+0x39/0x1c0
      [  342.070196]  unregister_netdevice_notifier+0x8c/0x460
      [  342.070196]  ? register_netdevice_notifier+0x790/0x790
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.070196]  ? trace_hardirqs_on+0x93/0x210
      [  342.070196]  ? __bpf_trace_preemptirq_template+0x10/0x10
      [  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
      [  342.123094]  clusterip_tg_destroy+0x3ad/0x650 [ipt_CLUSTERIP]
      [  342.123094]  ? clusterip_net_init+0x3d0/0x3d0 [ipt_CLUSTERIP]
      [  342.123094]  ? cleanup_match+0x17d/0x200 [ip_tables]
      [  342.123094]  ? xt_unregister_table+0x215/0x300 [x_tables]
      [  342.123094]  ? kfree+0xe2/0x2a0
      [  342.123094]  cleanup_entry+0x1d5/0x2f0 [ip_tables]
      [  342.123094]  ? cleanup_match+0x200/0x200 [ip_tables]
      [  342.123094]  __ipt_unregister_table+0x9b/0x1a0 [ip_tables]
      [  342.123094]  iptable_filter_net_exit+0x43/0x80 [iptable_filter]
      [  342.123094]  ops_exit_list.isra.10+0x94/0x140
      [  342.123094]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 202f59af ("netfilter: ipt_CLUSTERIP: do not hold dev")
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      9d51378a
    • Taehee Yoo's avatar
      netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine · bb7b6c49
      Taehee Yoo authored
      [ Upstream commit b12f7bad ]
      
      When network namespace is destroyed, both clusterip_tg_destroy() and
      clusterip_net_exit() are called. and clusterip_net_exit() is called
      before clusterip_tg_destroy().
      Hence cleanup check code in clusterip_net_exit() doesn't make sense.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 bash
         %ip link set lo up
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	-j CLUSTERIP --new --hashmode sourceip \
      	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %exit
         %ip netns del vm1
      
      splat looks like:
      [  341.184508] WARNING: CPU: 1 PID: 87 at net/ipv4/netfilter/ipt_CLUSTERIP.c:840 clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.184850] Modules linked in: ipt_CLUSTERIP nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter ip_tables x_tables
      [  341.184850] CPU: 1 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc5+ #16
      [  341.227509] Workqueue: netns cleanup_net
      [  341.227509] RIP: 0010:clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
      [  341.227509] Code: 0f 85 7f fe ff ff 48 c7 c2 80 64 2c c0 be a8 02 00 00 48 c7 c7 a0 63 2c c0 c6 05 18 6e 00 00 01 e8 bc 38 ff f5 e9 5b fe ff ff <0f> 0b e9 33 ff ff ff e8 4b 90 50 f6 e9 2d fe ff ff 48 89 df e8 de
      [  341.227509] RSP: 0018:ffff88011086f408 EFLAGS: 00010202
      [  341.227509] RAX: dffffc0000000000 RBX: 1ffff1002210de85 RCX: 0000000000000000
      [  341.227509] RDX: 1ffff1002210de85 RSI: ffff880110813be8 RDI: ffffed002210de58
      [  341.227509] RBP: ffff88011086f4d0 R08: 0000000000000000 R09: 0000000000000000
      [  341.227509] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1002210de81
      [  341.227509] R13: ffff880110625a48 R14: ffff880114cec8c8 R15: 0000000000000014
      [  341.227509] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
      [  341.227509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  341.227509] CR2: 00007f11fd38e000 CR3: 000000013ca16000 CR4: 00000000001006e0
      [  341.227509] Call Trace:
      [  341.227509]  ? __clusterip_config_find+0x460/0x460 [ipt_CLUSTERIP]
      [  341.227509]  ? default_device_exit+0x1ca/0x270
      [  341.227509]  ? remove_proc_entry+0x1cd/0x390
      [  341.227509]  ? dev_change_net_namespace+0xd00/0xd00
      [  341.227509]  ? __init_waitqueue_head+0x130/0x130
      [  341.227509]  ops_exit_list.isra.10+0x94/0x140
      [  341.227509]  cleanup_net+0x45b/0x900
      [ ... ]
      
      Fixes: 613d0776 ("netfilter: exit_net cleanup check added")
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      bb7b6c49
    • Taehee Yoo's avatar
      netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set · 744383c8
      Taehee Yoo authored
      [ Upstream commit 06aa151a ]
      
      If same destination IP address config is already existing, that config is
      just used. MAC address also should be same.
      However, there is no MAC address checking routine.
      So that MAC address checking routine is added.
      
      test commands:
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
         %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
      	   -j CLUSTERIP --new --hashmode sourceip \
      	   --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1
      
      After this patch, above commands are disallowed.
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      744383c8
    • Andi Kleen's avatar
      perf vendor events intel: Fix Load_Miss_Real_Latency on SKL/SKX · bd1040e6
      Andi Kleen authored
      [ Upstream commit 91b2b970 ]
      
      Fix incorrect event names for the Load_Miss_Real_Latency metric for
      Skylake and Skylake Server.
      
      Fixes https://github.com/andikleen/pmu-tools/issues/158
      
      Before:
      
        % perf stat -M Load_Miss_Real_Latency true
        event syntax error: '..ss.pending,mem_load_retired.l1_miss_ps,mem_load_retired.fb_hit_ps}:W'
                                          \___ parser error
      
         Usage: perf stat [<options>] [<command>]
      
            -M, --metrics <metric/metric group list>
                                  monitor specified metrics or metric groups (separated by ,)
      
      After:
      
        % perf stat -M Load_Miss_Real_Latency true
      
         Performance counter stats for 'true':
      
                   279,204      l1d_pend_miss.pending     #     14.0 Load_Miss_Real_Latency
                     4,784      mem_load_uops_retired.l1_miss
                    15,188      mem_load_uops_retired.hit_lfb
      
               0.000899640 seconds time elapsed
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Tested-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Link: http://lkml.kernel.org/r/20181120050635.4215-1-andi@firstfloor.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      bd1040e6
    • Arnaldo Carvalho de Melo's avatar
      perf parse-events: Fix unchecked usage of strncpy() · 58c67a0b
      Arnaldo Carvalho de Melo authored
      [ Upstream commit bd8d57fb ]
      
      The strncpy() function may leave the destination string buffer
      unterminated, better use strlcpy() that we have a __weak fallback
      implementation for systems without it.
      
      This fixes this warning on an Alpine Linux Edge system with gcc 8.2:
      
        util/parse-events.c: In function 'print_symbol_events':
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        In function 'print_symbol_events.constprop',
            inlined from 'print_events' at util/parse-events.c:2508:2:
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        In function 'print_symbol_events.constprop',
            inlined from 'print_events' at util/parse-events.c:2511:2:
        util/parse-events.c:2465:4: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation]
            strncpy(name, syms->symbol, MAX_NAME_LEN);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        cc1: all warnings being treated as errors
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Fixes: 947b4ad1 ("perf list: Fix max event string size")
      Link: https://lkml.kernel.org/n/tip-b663e33bm6x8hrkie4uxh7u2@git.kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      58c67a0b
    • Arnaldo Carvalho de Melo's avatar
      perf svghelper: Fix unchecked usage of strncpy() · b332b4cd
      Arnaldo Carvalho de Melo authored
      [ Upstream commit 2f530253 ]
      
      The strncpy() function may leave the destination string buffer
      unterminated, better use strlcpy() that we have a __weak fallback
      implementation for systems without it.
      
      In this specific case this would only happen if fgets() was buggy, as
      its man page states that it should read one less byte than the size of
      the destination buffer, so that it can put the nul byte at the end of
      it, so it would never copy 255 non-nul chars, as fgets reads into the
      orig buffer at most 254 non-nul chars and terminates it. But lets just
      switch to strlcpy to keep the original intent and silence the gcc 8.2
      warning.
      
      This fixes this warning on an Alpine Linux Edge system with gcc 8.2:
      
        In function 'cpu_model',
            inlined from 'svg_cpu_box' at util/svghelper.c:378:2:
        util/svghelper.c:337:5: error: 'strncpy' output may be truncated copying 255 bytes from a string of length 255 [-Werror=stringop-truncation]
             strncpy(cpu_m, &buf[13], 255);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Fixes: f48d55ce ("perf: Add a SVG helper library file")
      Link: https://lkml.kernel.org/n/tip-xzkoo0gyr56gej39ltivuh9g@git.kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      b332b4cd
    • Florian Fainelli's avatar
      perf tests ARM: Disable breakpoint tests 32-bit · f54fc4c2
      Florian Fainelli authored
      [ Upstream commit 24f96733 ]
      
      The breakpoint tests on the ARM 32-bit kernel are broken in several
      ways.
      
      The breakpoint length requested does not necessarily match whether the
      function address has the Thumb bit (bit 0) set or not, and this does
      matter to the ARM kernel hw_breakpoint infrastructure. See [1] for
      background.
      
      [1]: https://lkml.org/lkml/2018/11/15/205
      
      As Will indicated, the overflow handling would require single-stepping
      which is not supported at the moment. Just disable those tests for the
      ARM 32-bit platforms and update the comment above to explain these
      limitations.
      Co-developed-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarJiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20181203191138.2419-1-f.fainelli@gmail.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      f54fc4c2
    • Adrian Hunter's avatar
      perf intel-pt: Fix error with config term "pt=0" · c3e8c335
      Adrian Hunter authored
      [ Upstream commit 1c6f709b ]
      
      Users should never use 'pt=0', but if they do it may give a meaningless
      error:
      
      	$ perf record -e intel_pt/pt=0/u uname
      	Error:
      	The sys_perf_event_open() syscall returned with 22 (Invalid argument) for
      	event (intel_pt/pt=0/u).
      
      Fix that by forcing 'pt=1'.
      
      Committer testing:
      
        # perf record -e intel_pt/pt=0/u uname
        Error:
        The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (intel_pt/pt=0/u).
        /bin/dmesg | grep -i perf may provide additional information.
      
        # perf record -e intel_pt/pt=0/u uname
        pt=0 doesn't make sense, forcing pt=1
        Linux
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.020 MB perf.data ]
        #
      Signed-off-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Tested-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/b7c5b4e5-9497-10e5-fd43-5f3e4a0fe51d@intel.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      c3e8c335
    • Sergey Senozhatsky's avatar
      tty/serial: do not free trasnmit buffer page under port lock · f74fc96e
      Sergey Senozhatsky authored
      [ Upstream commit d7240214 ]
      
      LKP has hit yet another circular locking dependency between uart
      console drivers and debugobjects [1]:
      
           CPU0                                    CPU1
      
                                                  rhltable_init()
                                                   __init_work()
                                                    debug_object_init
           uart_shutdown()                          /* db->lock */
            /* uart_port->lock */                    debug_print_object()
             free_page()                              printk()
                                                       call_console_drivers()
              debug_check_no_obj_freed()                /* uart_port->lock */
               /* db->lock */
                debug_print_object()
      
      So there are two dependency chains:
      	uart_port->lock -> db->lock
      And
      	db->lock -> uart_port->lock
      
      This particular circular locking dependency can be addressed in several
      ways:
      
      a) One way would be to move debug_print_object() out of db->lock scope
         and, thus, break the db->lock -> uart_port->lock chain.
      b) Another one would be to free() transmit buffer page out of db->lock
         in UART code; which is what this patch does.
      
      It makes sense to apply a) and b) independently: there are too many things
      going on behind free(), none of which depend on uart_port->lock.
      
      The patch fixes transmit buffer page free() in uart_shutdown() and,
      additionally, in uart_port_startup() (as was suggested by Dmitry Safonov).
      
      [1] https://lore.kernel.org/lkml/20181211091154.GL23332@shao2-debian/T/#uSigned-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: default avatarPetr Mladek <pmladek@suse.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      f74fc96e
    • Johannes Thumshirn's avatar
      btrfs: improve error handling of btrfs_add_link · 310f8296
      Johannes Thumshirn authored
      [ Upstream commit 1690dd41 ]
      
      In the error handling block, err holds the return value of either
      btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
      since it's introduction with commit fe66a05a (Btrfs: improve error
      handling for btrfs_insert_dir_item callers) in 2012.
      
      If the error handling in the error handling fails, there's not much left
      to do and the abort either happened earlier in the callees or is
      necessary here.
      
      So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
      the transaction, but still return the original code of the failure
      stored in 'ret' as this will be reported to the user.
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      310f8296
    • Anand Jain's avatar
      btrfs: fix use-after-free due to race between replace start and cancel · 38b17eee
      Anand Jain authored
      [ Upstream commit d189dd70 ]
      
      The device replace cancel thread can race with the replace start thread
      and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
      fail to stop the scrub thread.
      
      The scrub thread continues with the scrub for replace which then will
      try to write to the target device and which is already freed by the
      cancel thread.
      
      scrub_setup_ctx() warns as tgtdev is NULL.
      
        struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
        {
        ...
      	  if (is_dev_replace) {
      		  WARN_ON(!fs_info->dev_replace.tgtdev);  <===
      		  sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
      		  sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
      		  sctx->flush_all_writes = false;
      	  }
      
        [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
        [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
        [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.432970] Call Trace:
        [ 6852.433202]  btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
        [ 6852.433471]  btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
        [ 6852.433800]  btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
        [ 6852.434097]  btrfs_ioctl+0x2476/0x2d20 [btrfs]
        [ 6852.434365]  ? do_sigaction+0x7d/0x1e0
        [ 6852.434623]  do_vfs_ioctl+0xa9/0x6c0
        [ 6852.434865]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435124]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435387]  ksys_ioctl+0x60/0x90
        [ 6852.435663]  __x64_sys_ioctl+0x16/0x20
        [ 6852.435907]  do_syscall_64+0x50/0x180
        [ 6852.436150]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Further, as the replace thread enters scrub_write_page_to_dev_replace()
      without the target device it panics:
      
        static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
      				      struct scrub_page *spage)
        {
        ...
      	bio_set_dev(bio, sbio->dev->bdev); <======
      
        [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
        ..
        [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
        [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
        [btrfs]
        ..
        [ 6929.721430] Call Trace:
        [ 6929.721663]  scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
        [ 6929.721975]  scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
        [ 6929.722277]  normal_work_helper+0xf0/0x4c0 [btrfs]
        [ 6929.722552]  process_one_work+0x1f4/0x520
        [ 6929.722805]  ? process_one_work+0x16e/0x520
        [ 6929.723063]  worker_thread+0x46/0x3d0
        [ 6929.723313]  kthread+0xf8/0x130
        [ 6929.723544]  ? process_one_work+0x520/0x520
        [ 6929.723800]  ? kthread_delayed_work_timer_fn+0x80/0x80
        [ 6929.724081]  ret_from_fork+0x3a/0x50
      
      Fix this by letting the btrfs_dev_replace_finishing() to do the job of
      cleaning after the cancel, including freeing of the target device.
      btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
      along with the scrub return status.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      38b17eee
    • Hans van Kranenburg's avatar
      btrfs: alloc_chunk: fix more DUP stripe size handling · 720b86a5
      Hans van Kranenburg authored
      [ Upstream commit baf92114 ]
      
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed calculating the stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB, (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that the stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want to have the previous value back, then the logic to
      compute it was copy pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
      Signed-off-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      720b86a5
    • Qu Wenruo's avatar
      btrfs: volumes: Make sure there is no overlap of dev extents at mount time · bb5717a4
      Qu Wenruo authored
      [ Upstream commit 5eb19381 ]
      
      Enhance btrfs_verify_dev_extents() to remember previous checked dev
      extents, so it can verify no dev extents can overlap.
      
      Analysis from Hans:
      
      "Imagine allocating a DATA|DUP chunk.
      
       In the chunk allocator, we first set...
         max_stripe_size = SZ_1G;
         max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
       ... which is 10GiB.
      
       Then...
         /* we don't want a chunk larger than 10% of writeable space */
         max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
             		 max_chunk_size);
      
       Imagine we only have one 7880MiB block device in this filesystem. Now
       max_chunk_size is down to 788MiB.
      
       The next step in the code is to search for max_stripe_size * dev_stripes
       amount of free space on the device, which is in our example 1GiB * 2 =
       2GiB. Imagine the device has exactly 1578MiB free in one contiguous
       piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
      
       Next we recalculate the stripe_size (which is actually the device extent
       length), based on the actual maximum amount of available raw disk space:
         stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
      
       stripe_size is now 789MiB
      
       Next we do...
         data_stripes = num_stripes / ncopies
       ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
       of device extents we're going to have), and DUP has ncopies 2.
      
       Next there's a check...
         if (stripe_size * data_stripes > max_chunk_size)
       ...which matches because 789MiB * 1 > 788MiB.
      
       We go into the if code, and next is...
         stripe_size = div_u64(max_chunk_size, data_stripes);
       ...which resets stripe_size to max_chunk_size: 788MiB
      
       Next is a fun one...
         /* bump the answer up to a 16MB boundary */
         stripe_size = round_up(stripe_size, SZ_16M);
       ...which changes stripe_size from 788MiB to 800MiB.
      
       We're not done changing stripe_size yet...
         /* But don't go higher than the limits we found while searching
          * for free extents
          */
         stripe_size = min(devices_info[ndevs - 1].max_avail,
             	      stripe_size);
      
       This is bad. max_avail is twice the stripe_size (we need to fit 2 device
       extents on the same device for DUP).
      
       The result here is that 800MiB < 1578MiB, so it's unchanged. However,
       the resulting DUP chunk will need 1600MiB disk space, which isn't there,
       and the second dev_extent might extend into the next thing (next
       dev_extent? end of device?) for 22MiB.
      
       The last shown line of code relies on a situation where there's twice
       the value of stripe_size present as value for the variable stripe_size
       when it's DUP. This was actually the case before commit 92e222df
       "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
         "[...] in the meantime there's a check to see if the stripe_size does
       not exceed max_chunk_size. Since during this check stripe_size is twice
       the amount as intended, the check will reduce the stripe_size to
       max_chunk_size if the actual correct to be used stripe_size is more than
       half the amount of max_chunk_size."
      
       In the previous version of the code, the 16MiB alignment (why is this
       done, by the way?) would result in a 50% chance that it would actually
       do an 8MiB alignment for the individual dev_extents, since it was
       operating on double the size. Does this matter?
      
       Does it matter that stripe_size can be set to anything which is not
       16MiB aligned because of the amount of remaining available disk space
       which is just taken?
      
       What is the main purpose of this round_up?
      
       The most straightforward thing to do seems something like...
         stripe_size = min(
             div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
             stripe_size
         )
       ..just putting half of the max_avail into stripe_size."
      
      Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/Reported-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [ add analysis from report ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      bb5717a4
    • Jonas Danielsson's avatar
      mmc: atmel-mci: do not assume idle after atmci_request_end · c21991ed
      Jonas Danielsson authored
      [ Upstream commit ae460c11 ]
      
      On our AT91SAM9260 board we use the same sdio bus for wifi and for the
      sd card slot. This caused the atmel-mci to give the following splat on
      the serial console:
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 538 at drivers/mmc/host/atmel-mci.c:859 atmci_send_command+0x24/0x44
        Modules linked in:
        CPU: 0 PID: 538 Comm: mmcqd/0 Not tainted 4.14.76 #14
        Hardware name: Atmel AT91SAM9
        [<c000fccc>] (unwind_backtrace) from [<c000d3dc>] (show_stack+0x10/0x14)
        [<c000d3dc>] (show_stack) from [<c0017644>] (__warn+0xd8/0xf4)
        [<c0017644>] (__warn) from [<c0017704>] (warn_slowpath_null+0x1c/0x24)
        [<c0017704>] (warn_slowpath_null) from [<c033bb9c>] (atmci_send_command+0x24/0x44)
        [<c033bb9c>] (atmci_send_command) from [<c033e984>] (atmci_start_request+0x1f4/0x2dc)
        [<c033e984>] (atmci_start_request) from [<c033f3b4>] (atmci_request+0xf0/0x164)
        [<c033f3b4>] (atmci_request) from [<c0327108>] (mmc_start_request+0x280/0x2d0)
        [<c0327108>] (mmc_start_request) from [<c032800c>] (mmc_start_areq+0x230/0x330)
        [<c032800c>] (mmc_start_areq) from [<c03366f8>] (mmc_blk_issue_rw_rq+0xc4/0x310)
        [<c03366f8>] (mmc_blk_issue_rw_rq) from [<c03372c4>] (mmc_blk_issue_rq+0x118/0x5ac)
        [<c03372c4>] (mmc_blk_issue_rq) from [<c033781c>] (mmc_queue_thread+0xc4/0x118)
        [<c033781c>] (mmc_queue_thread) from [<c002daf8>] (kthread+0x100/0x118)
        [<c002daf8>] (kthread) from [<c000a580>] (ret_from_fork+0x14/0x34)
        ---[ end trace 594371ddfa284bd6 ]---
      
      This is:
        WARN_ON(host->cmd);
      
      This was fixed on our board by letting atmci_request_end determine what
      state we are in. Instead of unconditionally setting it to STATE_IDLE on
      STATE_END_REQUEST.
      Signed-off-by: default avatarJonas Danielsson <jonas@orbital-systems.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      c21991ed
    • Masahiro Yamada's avatar
      kconfig: fix memory leak when EOF is encountered in quotation · 46199110
      Masahiro Yamada authored
      [ Upstream commit fbac5977 ]
      
      An unterminated string literal followed by new line is passed to the
      parser (with "multi-line strings not supported" warning shown), then
      handled properly there.
      
      On the other hand, an unterminated string literal at end of file is
      never passed to the parser, then results in memory leak.
      
      [Test Code]
      
        ----------(Kconfig begin)----------
        source "Kconfig.inc"
      
        config A
                bool "a"
        -----------(Kconfig end)-----------
      
        --------(Kconfig.inc begin)--------
        config B
                bool "b\No new line at end of file
        ---------(Kconfig.inc end)---------
      
      [Summary from Valgrind]
      
        Before the fix:
      
          LEAK SUMMARY:
             definitely lost: 16 bytes in 1 blocks
             ...
      
        After the fix:
      
          LEAK SUMMARY:
             definitely lost: 0 bytes in 0 blocks
             ...
      
      Eliminate the memory leak path by handling this case. Of course, such
      a Kconfig file is wrong already, so I will add an error message later.
      Signed-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      46199110
    • Masahiro Yamada's avatar
      kconfig: fix file name and line number of warn_ignored_character() · ba8efcdc
      Masahiro Yamada authored
      [ Upstream commit 77c1c0fa ]
      
      Currently, warn_ignore_character() displays invalid file name and
      line number.
      
      The lexer should use current_file->name and yylineno, while the parser
      should use zconf_curname() and zconf_lineno().
      
      This difference comes from that the lexer is always going ahead
      of the parser. The parser needs to look ahead one token to make a
      shift/reduce decision, so the lexer is requested to scan more text
      from the input file.
      
      This commit fixes the warning message from warn_ignored_character().
      
      [Test Code]
      
        ----(Kconfig begin)----
        /
        -----(Kconfig end)-----
      
      [Output]
      
        Before the fix:
      
        <none>:0:warning: ignoring unsupported character '/'
      
        After the fix:
      
        Kconfig:1:warning: ignoring unsupported character '/'
      Signed-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      ba8efcdc
    • Jiong Wang's avatar
      bpf: relax verifier restriction on BPF_MOV | BPF_ALU · 344b51e7
      Jiong Wang authored
      [ Upstream commit e434b8cd ]
      
      Currently, the destination register is marked as unknown for 32-bit
      sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
      SCALAR_VALUE.
      
      This is too conservative that some valid cases will be rejected.
      Especially, this may turn a constant scalar value into unknown value that
      could break some assumptions of verifier.
      
      For example, test_l4lb_noinline.c has the following C code:
      
          struct real_definition *dst
      
      1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
      2:    return TC_ACT_SHOT;
      3:
      4:  if (dst->flags & F_IPV6) {
      
      get_packet_dst is responsible for initializing "dst" into valid pointer and
      return true (1), otherwise return false (0). The compiled instruction
      sequence using alu32 will be:
      
        412: (54) (u32) r7 &= (u32) 1
        413: (bc) (u32) r0 = (u32) r7
        414: (95) exit
      
      insn 413, a BPF_MOV | BPF_ALU, however will turn r0 into unknown value even
      r7 contains SCALAR_VALUE 1.
      
      This causes trouble when verifier is walking the code path that hasn't
      initialized "dst" inside get_packet_dst, for which case 0 is returned and
      we would then expect verifier concluding line 1 in the above C code pass
      the "if" check, therefore would skip fall through path starting at line 4.
      Now, because r0 returned from callee has became unknown value, so verifier
      won't skip analyzing path starting at line 4 and "dst->flags" requires
      dereferencing the pointer "dst" which actually hasn't be initialized for
      this path.
      
      This patch relaxed the code marking sub-register move destination. For a
      SCALAR_VALUE, it is safe to just copy the value from source then truncate
      it into 32-bit.
      
      A unit test also included to demonstrate this issue. This test will fail
      before this patch.
      
      This relaxation could let verifier skipping more paths for conditional
      comparison against immediate. It also let verifier recording a more
      accurate/strict value for one register at one state, if this state end up
      with going through exit without rejection and it is used for state
      comparison later, then it is possible an inaccurate/permissive value is
      better. So the real impact on verifier processed insn number is complex.
      But in all, without this fix, valid program could be rejected.
      
      >From real benchmarking on kernel selftests and Cilium bpf tests, there is
      no impact on processed instruction number when tests ares compiled with
      default compilation options. There is slightly improvements when they are
      compiled with -mattr=+alu32 after this patch.
      
      Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is
      rejected before this fix.
      
      Insn processed before/after this patch:
      
                              default     -mattr=+alu32
      
      Kernel selftest
      
      ===
      test_xdp.o              371/371      369/369
      test_l4lb.o             6345/6345    5623/5623
      test_xdp_noinline.o     2971/2971    rejected/2727
      test_tcp_estates.o      429/429      430/430
      
      Cilium bpf
      ===
      bpf_lb-DLB_L3.o:        2085/2085     1685/1687
      bpf_lb-DLB_L4.o:        2287/2287     1986/1982
      bpf_lb-DUNKNOWN.o:      690/690       622/622
      bpf_lxc.o:              95033/95033   N/A
      bpf_netdev.o:           7245/7245     N/A
      bpf_overlay.o:          2898/2898     3085/2947
      
      NOTE:
        - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
          verifier due to another issue inside verifier on supporting alu32
          binary.
        - Each cilium bpf program could generate several processed insn number,
          above number is sum of them.
      
      v1->v2:
       - Restrict the change on SCALAR_VALUE.
       - Update benchmark numbers on Cilium bpf tests.
      Signed-off-by: default avatarJiong Wang <jiong.wang@netronome.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      344b51e7
    • Will Deacon's avatar
      arm64: Fix minor issues with the dcache_by_line_op macro · dfbf8c98
      Will Deacon authored
      [ Upstream commit 33309ecd ]
      
      The dcache_by_line_op macro suffers from a couple of small problems:
      
      First, the GAS directives that are currently being used rely on
      assembler behavior that is not documented, and probably not guaranteed
      to produce the correct behavior going forward. As a result, we end up
      with some undefined symbols in cache.o:
      
      $ nm arch/arm64/mm/cache.o
               ...
               U civac
               ...
               U cvac
               U cvap
               U cvau
      
      This is due to the fact that the comparisons used to select the
      operation type in the dcache_by_line_op macro are comparing symbols
      not strings, and even though it seems that GAS is doing the right
      thing here (undefined symbols by the same name are equal to each
      other), it seems unwise to rely on this.
      
      Second, when patching in a DC CVAP instruction on CPUs that support it,
      the fallback path consists of a DC CVAU instruction which may be
      affected by CPU errata that require ARM64_WORKAROUND_CLEAN_CACHE.
      
      Solve these issues by unrolling the various maintenance routines and
      using the conditional directives that are documented as operating on
      strings. To avoid the complexity of nested alternatives, we move the
      DC CVAP patching to __clean_dcache_area_pop, falling back to a branch
      to __clean_dcache_area_poc if DCPOP is not supported by the CPU.
      Reported-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Suggested-by: default avatarRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      dfbf8c98
    • Lucas Stach's avatar
      clk: imx6q: reset exclusive gates on init · 73f0b2e3
      Lucas Stach authored
      [ Upstream commit f7542d81 ]
      
      The exclusive gates may be set up in the wrong way by software running
      before the clock driver comes up. In that case the exclusive setup is
      locked in its initial state, as the complementary function can't be
      activated without disabling the initial setup first.
      
      To avoid this lock situation, reset the exclusive gates to the off
      state and allow the kernel to provide the proper setup.
      Signed-off-by: default avatarLucas Stach <l.stach@pengutronix.de>
      Reviewed-by: default avatarDong Aisheng <Aisheng.dong@nxp.com>
      Signed-off-by: default avatarStephen Boyd <sboyd@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      73f0b2e3
    • Qian Cai's avatar
      arm64: kasan: Increase stack size for KASAN_EXTRA · 8f183b33
      Qian Cai authored
      [ Upstream commit 6e883067 ]
      
      If the kernel is configured with KASAN_EXTRA, the stack size is
      increased significantly due to setting the GCC -fstack-reuse option to
      "none" [1]. As a result, it can trigger a stack overrun quite often with
      32k stack size compiled using GCC 8. For example, this reproducer
      
        https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise06.c
      
      can trigger a "corrupted stack end detected inside scheduler" very
      reliably with CONFIG_SCHED_STACK_END_CHECK enabled. There are other
      reports at:
      
        https://lore.kernel.org/lkml/1542144497.12945.29.camel@gmx.us/
        https://lore.kernel.org/lkml/721E7B42-2D55-4866-9C1A-3E8D64F33F9C@gmx.us/
      
      There are just too many functions that could have a large stack with
      KASAN_EXTRA due to large local variables that have been called over and
      over again without being able to reuse the stacks. Some noticiable ones
      are,
      
      size
      7536 shrink_inactive_list
      7440 shrink_page_list
      6560 fscache_stats_show
      3920 jbd2_journal_commit_transaction
      3216 try_to_unmap_one
      3072 migrate_page_move_mapping
      3584 migrate_misplaced_transhuge_page
      3920 ip_vs_lblcr_schedule
      4304 lpfc_nvme_info_show
      3888 lpfc_debugfs_nvmestat_data.constprop
      
      There are other 49 functions over 2k in size while compiling kernel with
      "-Wframe-larger-than=" on this machine. Hence, it is too much work to
      change Makefiles for each object to compile without
      -fsanitize-address-use-after-scope individually.
      
      [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8f183b33
    • Dmitry V. Levin's avatar
      selftests: do not macro-expand failed assertion expressions · 656257cf
      Dmitry V. Levin authored
      [ Upstream commit b708a3cc ]
      
      I've stumbled over the current macro-expand behaviour of the test
      harness:
      
      $ gcc -Wall -xc - <<'__EOF__'
      TEST(macro) {
      	int status = 0;
      	ASSERT_TRUE(WIFSIGNALED(status));
      }
      TEST_HARNESS_MAIN
      __EOF__
      $ ./a.out
      [==========] Running 1 tests from 1 test cases.
      [ RUN      ] global.macro
      <stdin>:4:global.macro:Expected 0 (0) != (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) (0)
      global.macro: Test terminated by assertion
      [     FAIL ] global.macro
      [==========] 0 / 1 tests passed.
      [  FAILED  ]
      
      With this change the output of the same test looks much more
      comprehensible:
      
      [==========] Running 1 tests from 1 test cases.
      [ RUN      ] global.macro
      <stdin>:4:global.macro:Expected 0 (0) != WIFSIGNALED(status) (0)
      global.macro: Test terminated by assertion
      [     FAIL ] global.macro
      [==========] 0 / 1 tests passed.
      [  FAILED  ]
      
      The issue is very similar to the bug fixed in glibc assert(3)
      three years ago:
      https://sourceware.org/bugzilla/show_bug.cgi?id=18604
      
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: linux-kselftest@vger.kernel.org
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarShuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      656257cf
    • Bart Van Assche's avatar
      scsi: target/core: Make sure that target_wait_for_sess_cmds() waits long enough · 3ad8148c
      Bart Van Assche authored
      [ Upstream commit ad669505 ]
      
      A session must only be released after all code that accesses the session
      structure has finished. Make sure that this is the case by introducing a
      new command counter per session that is only decremented after the
      .release_cmd() callback has finished. This patch fixes the following crash:
      
      BUG: KASAN: use-after-free in do_raw_spin_lock+0x1c/0x130
      Read of size 4 at addr ffff8801534b16e4 by task rmdir/14805
      CPU: 16 PID: 14805 Comm: rmdir Not tainted 4.18.0-rc2-dbg+ #5
      Call Trace:
      dump_stack+0xa4/0xf5
      print_address_description+0x6f/0x270
      kasan_report+0x241/0x360
      __asan_load4+0x78/0x80
      do_raw_spin_lock+0x1c/0x130
      _raw_spin_lock_irqsave+0x52/0x60
      srpt_set_ch_state+0x27/0x70 [ib_srpt]
      srpt_disconnect_ch+0x1b/0xc0 [ib_srpt]
      srpt_close_session+0xa8/0x260 [ib_srpt]
      target_shutdown_sessions+0x170/0x180 [target_core_mod]
      core_tpg_del_initiator_node_acl+0xf3/0x200 [target_core_mod]
      target_fabric_nacl_base_release+0x25/0x30 [target_core_mod]
      config_item_release+0x9c/0x110 [configfs]
      config_item_put+0x26/0x30 [configfs]
      configfs_rmdir+0x3b8/0x510 [configfs]
      vfs_rmdir+0xb3/0x1e0
      do_rmdir+0x262/0x2c0
      do_syscall_64+0x77/0x230
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Cc: Nicholas Bellinger <nab@linux-iscsi.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      3ad8148c