1. 08 Apr, 2017 40 commits
    • Kinglong Mee's avatar
      nfsd: map the ENOKEY to nfserr_perm for avoiding warning · 3967cf7e
      Kinglong Mee authored
      commit c952cd4e upstream.
      
      Now that Ext4 and f2fs filesystems support encrypted directories and
      files, attempts to access those files may return ENOKEY, resulting in
      the following WARNING.
      
      Map ENOKEY to nfserr_perm instead of nfserr_io.
      
      [ 1295.411759] ------------[ cut here ]------------
      [ 1295.411787] WARNING: CPU: 0 PID: 12786 at fs/nfsd/nfsproc.c:796 nfserrno+0x74/0x80 [nfsd]
      [ 1295.411806] nfsd: non-standard errno: -126
      [ 1295.411816] Modules linked in: nfsd nfs_acl auth_rpcgss nfsv4 nfs lockd fscache tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event coretemp crct10dif_pclmul crc32_generic crc32_pclmul snd_ens1371 gameport ghash_clmulni_intel snd_ac97_codec f2fs intel_rapl_perf ac97_bus snd_seq ppdev snd_pcm snd_rawmidi snd_timer vmw_balloon snd_seq_device snd joydev soundcore parport_pc parport nfit acpi_cpufreq tpm_tis vmw_vmci tpm_tis_core tpm shpchp i2c_piix4 grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel e1000 mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: nfs_acl]
      [ 1295.412522] CPU: 0 PID: 12786 Comm: nfsd Tainted: G        W       4.11.0-rc1+ #521
      [ 1295.412959] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [ 1295.413814] Call Trace:
      [ 1295.414252]  dump_stack+0x63/0x86
      [ 1295.414666]  __warn+0xcb/0xf0
      [ 1295.415087]  warn_slowpath_fmt+0x5f/0x80
      [ 1295.415502]  ? put_filp+0x42/0x50
      [ 1295.415927]  nfserrno+0x74/0x80 [nfsd]
      [ 1295.416339]  nfsd_open+0xd7/0x180 [nfsd]
      [ 1295.416746]  nfs4_get_vfs_file+0x367/0x3c0 [nfsd]
      [ 1295.417182]  ? security_inode_permission+0x41/0x60
      [ 1295.417591]  nfsd4_process_open2+0x9b2/0x1200 [nfsd]
      [ 1295.418007]  nfsd4_open+0x481/0x790 [nfsd]
      [ 1295.418409]  nfsd4_proc_compound+0x395/0x680 [nfsd]
      [ 1295.418812]  nfsd_dispatch+0xb8/0x1f0 [nfsd]
      [ 1295.419233]  svc_process_common+0x4d9/0x830 [sunrpc]
      [ 1295.419631]  svc_process+0xfe/0x1b0 [sunrpc]
      [ 1295.420033]  nfsd+0xe9/0x150 [nfsd]
      [ 1295.420420]  kthread+0x101/0x140
      [ 1295.420802]  ? nfsd_destroy+0x60/0x60 [nfsd]
      [ 1295.421199]  ? kthread_park+0x90/0x90
      [ 1295.421598]  ret_from_fork+0x2c/0x40
      [ 1295.421996] ---[ end trace 0d5a969cd7852e1f ]---
      Signed-off-by: default avatarKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3967cf7e
    • Olga Kornievskaia's avatar
      NFSv4.1 fix infinite loop on IO BAD_STATEID error · 461bbb90
      Olga Kornievskaia authored
      commit 0e3d3e5d upstream.
      
      Commit 63d63cbf "NFSv4.1: Don't recheck delegations that
      have already been checked" introduced a regression where when a
      client received BAD_STATEID error it would not send any TEST_STATEID
      and instead go into an infinite loop of resending the IO that caused
      the BAD_STATEID.
      
      Fixes: 63d63cbf ("NFSv4.1: Don't recheck delegations that have already been checked")
      Signed-off-by: default avatarOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      461bbb90
    • Ludovic Desroches's avatar
      mmc: sdhci-of-at91: fix MMC_DDR_52 timing selection · 80df2b3e
      Ludovic Desroches authored
      commit d0918764 upstream.
      
      The controller has different timings for MMC_TIMING_UHS_DDR50 and
      MMC_TIMING_MMC_DDR52. Configuring the controller with SDHCI_CTRL_UHS_DDR50,
      when MMC_TIMING_MMC_DDR52 timings are requested, is not correct and can
      lead to unexpected behavior.
      Signed-off-by: default avatarLudovic Desroches <ludovic.desroches@microchip.com>
      Fixes: bb5f8ea4 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC")
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      80df2b3e
    • Hans de Goede's avatar
      mmc: sdhci: Disable runtime pm when the sdio_irq is enabled · fa3b4f4f
      Hans de Goede authored
      commit 923713b3 upstream.
      
      SDIO cards may need clock to send the card interrupt to the host.
      
      On a cherrytrail tablet with a RTL8723BS wifi chip, without this patch
      pinging the tablet results in:
      
      PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data.
      64 bytes from 192.168.1.14: icmp_seq=1 ttl=64 time=78.6 ms
      64 bytes from 192.168.1.14: icmp_seq=2 ttl=64 time=1760 ms
      64 bytes from 192.168.1.14: icmp_seq=3 ttl=64 time=753 ms
      64 bytes from 192.168.1.14: icmp_seq=4 ttl=64 time=3.88 ms
      64 bytes from 192.168.1.14: icmp_seq=5 ttl=64 time=795 ms
      64 bytes from 192.168.1.14: icmp_seq=6 ttl=64 time=1841 ms
      64 bytes from 192.168.1.14: icmp_seq=7 ttl=64 time=810 ms
      64 bytes from 192.168.1.14: icmp_seq=8 ttl=64 time=1860 ms
      64 bytes from 192.168.1.14: icmp_seq=9 ttl=64 time=812 ms
      64 bytes from 192.168.1.14: icmp_seq=10 ttl=64 time=48.6 ms
      
      Where as with this patch I get:
      
      PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data.
      64 bytes from 192.168.1.14: icmp_seq=1 ttl=64 time=3.96 ms
      64 bytes from 192.168.1.14: icmp_seq=2 ttl=64 time=1.97 ms
      64 bytes from 192.168.1.14: icmp_seq=3 ttl=64 time=17.2 ms
      64 bytes from 192.168.1.14: icmp_seq=4 ttl=64 time=2.46 ms
      64 bytes from 192.168.1.14: icmp_seq=5 ttl=64 time=2.83 ms
      64 bytes from 192.168.1.14: icmp_seq=6 ttl=64 time=1.40 ms
      64 bytes from 192.168.1.14: icmp_seq=7 ttl=64 time=2.10 ms
      64 bytes from 192.168.1.14: icmp_seq=8 ttl=64 time=1.40 ms
      64 bytes from 192.168.1.14: icmp_seq=9 ttl=64 time=2.04 ms
      64 bytes from 192.168.1.14: icmp_seq=10 ttl=64 time=1.40 ms
      
      Cc: Dong Aisheng <b29396@freescale.com>
      Cc: Ian W MORRISON <ianwmorrison@gmail.com>
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Acked-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Acked-by: default avatarDong Aisheng <aisheng.dong@nxp.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa3b4f4f
    • Aaron Armstrong Skomra's avatar
      HID: wacom: Don't add ghost interface as shared data · 8d6c3322
      Aaron Armstrong Skomra authored
      commit 8b407359 upstream.
      
      A previous commit (below) adds a check for already probed interfaces to
      Wacom's matching heuristic. Unfortunately this causes the Bamboo Pen
      (CTL-460) to match itself to its 'ghost' touch interface. After
      subsequent changes to the driver this match to the ghost causes the
      kernel to crash. This patch avoids calling wacom_add_shared_data()
      for the BAMBOO_PEN's ghost touch interface.
      
      Fixes: 41372d5d ("HID: wacom: Augment 'oVid' and 'oPid' with heuristics for HID_GENERIC")
      Signed-off-by: default avatarAaron Armstrong Skomra <aaron.skomra@wacom.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d6c3322
    • Takashi Sakamoto's avatar
      ASoC: Intel: Skylake: fix invalid memory access due to wrong reference of pointer · e5a13473
      Takashi Sakamoto authored
      commit d1a6fe41 upstream.
      
      In 'skl_tplg_set_module_init_data()', a pointer to 'params' member of
      'struct skl_algo_data' is calculated, then casted to (u32 *) and assigned
      to a member of configuration data. The configuration data is passed to the
      other functions and used to process intel IPC. In this processing, the
      value of member is used to get message data, however this can bring invalid
      memory access in 'skl_set_module_params()' as a result of calculation of
      a pointer for actual message data.
      
      (sound/soc/intel/skylake/skl-topology.c)
      skl_tplg_init_pipe_modules()
      ->skl_tplg_set_module_init_data() (has this bug)
      ->skl_tplg_set_module_params()
        (sound/soc/intel/skylake/skl-messages.c)
        ->skl_set_module_params()
          ((char *)param) + data_offset
      
      This commit fixes the bug.
      
      Fixes: abb74003 ("ASoC: Intel: Skylake: Add support to configure module params")
      Signed-off-by: default avatarTakashi Sakamoto <takashi.sakamoto@miraclelinux.com>
      Acked-by: default avatarVinod Koul <vinod.koul@intel.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5a13473
    • Songjun Wu's avatar
      ASoC: atmel-classd: fix audio clock rate · 7a042a4e
      Songjun Wu authored
      commit cd3ac9af upstream.
      
      Fix the audio clock rate according to the datasheet.
      Reported-by: default avatarDushara Jayasinghe <dushara@successful.com.au>
      Signed-off-by: default avatarSongjun Wu <songjun.wu@microchip.com>
      Acked-by: default avatarNicolas Ferre <nicolas.ferre@microchip.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a042a4e
    • Hui Wang's avatar
      ALSA: hda - fix a problem for lineout on a Dell AIO machine · 8aabccdc
      Hui Wang authored
      commit 2f726aec upstream.
      
      On this Dell AIO machine, the lineout jack does not work.
      
      We found the pin 0x1a is assigned to lineout on this machine, and in
      the past, we applied ALC298_FIXUP_DELL1_MIC_NO_PRESENCE to fix the
      heaset-set mic problem for this machine, this fixup will redefine
      the pin 0x1a to headphone-mic, as a result the lineout doesn't
      work anymore.
      
      After consulting with Dell, they told us this machine doesn't support
      microphone via headset jack, so we add a new fixup which only defines
      the pin 0x18 as the headset-mic.
      
      [rearranged the fixup insertion position by tiwai in order to make the
       merge with other branches easier -- tiwai]
      
      Fixes: 59ec4b57 ("ALSA: hda - Fix headset mic detection problem for two dell machines")
      Signed-off-by: default avatarHui Wang <hui.wang@canonical.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8aabccdc
    • Takashi Iwai's avatar
      ALSA: seq: Fix race during FIFO resize · 74a2c1ff
      Takashi Iwai authored
      commit 2d7d5400 upstream.
      
      When a new event is queued while processing to resize the FIFO in
      snd_seq_fifo_clear(), it may lead to a use-after-free, as the old pool
      that is being queued gets removed.  For avoiding this race, we need to
      close the pool to be deleted and sync its usage before actually
      deleting it.
      
      The issue was spotted by syzkaller.
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      74a2c1ff
    • Bjorn Helgaas's avatar
      PCI: iproc: Save host bridge window resource in struct iproc_pcie · 0dd5b335
      Bjorn Helgaas authored
      commit 6e347b5e upstream.
      
      The host bridge memory window resource is inserted into the iomem_resource
      tree and cannot be deallocated until the host bridge itself is removed.
      
      Previously, the window was on the stack, which meant the iomem_resource
      entry pointed into the stack and was corrupted as soon as the probe
      function returned, which caused memory corruption and errors like this:
      
        pcie_iproc_bcma bcma0:8: resource collision: [mem 0x40000000-0x47ffffff] conflicts with PCIe MEM space [mem 0x40000000-0x47ffffff]
      
      Move the memory window resource from the stack into struct iproc_pcie so
      its lifetime matches that of the host bridge.
      
      Fixes: c3245a56 ("PCI: iproc: Request host bridge window resources")
      Reported-and-tested-by: default avatarRafał Miłecki <zajec5@gmail.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0dd5b335
    • Bart Van Assche's avatar
      scsi: scsi_dh_alua: Ensure that alua_activate() calls the completion function · 8f915598
      Bart Van Assche authored
      commit 7cb689fe upstream.
      
      Callers of scsi_dh_activate(), e.g. dm-mpath, assume that this function
      either returns an error code or calls the completion function. Make
      alua_activate() call the completion function even if scsi_device_get()
      fails.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8f915598
    • Bart Van Assche's avatar
      scsi: scsi_dh_alua: Check scsi_device_get() return value · 68b275b7
      Bart Van Assche authored
      commit 625fe857 upstream.
      
      Do not queue ALUA work nor call scsi_device_put() if the
      scsi_device_get() call fails. This patch fixes the following crash:
      
      general protection fault: 0000 [#1] SMP
      RIP: 0010:scsi_device_put+0xb/0x30
      Call Trace:
       scsi_disk_put+0x2d/0x40
       sd_release+0x3d/0xb0
       __blkdev_put+0x29e/0x360
       blkdev_put+0x49/0x170
       dm_put_table_device+0x58/0xc0 [dm_mod]
       dm_put_device+0x70/0xc0 [dm_mod]
       free_priority_group+0x92/0xc0 [dm_multipath]
       free_multipath+0x70/0xc0 [dm_multipath]
       multipath_dtr+0x19/0x20 [dm_multipath]
       dm_table_destroy+0x67/0x120 [dm_mod]
       dev_suspend+0xde/0x240 [dm_mod]
       ctl_ioctl+0x1f5/0x520 [dm_mod]
       dm_ctl_ioctl+0xe/0x20 [dm_mod]
       do_vfs_ioctl+0x8f/0x700
       SyS_ioctl+0x3c/0x70
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      Fixes: commit 03197b61 ("scsi_dh_alua: Use workqueue for RTPG")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68b275b7
    • John Garry's avatar
      scsi: libsas: fix ata xfer length · cf31d6d2
      John Garry authored
      commit 9702c67c upstream.
      
      The total ata xfer length may not be calculated properly, in that we do
      not use the proper method to get an sg element dma length.
      
      According to the code comment, sg_dma_len() should be used after
      dma_map_sg() is called.
      
      This issue was found by turning on the SMMUv3 in front of the hisi_sas
      controller in hip07. Multiple sg elements were being combined into a
      single element, but the original first element length was being use as
      the total xfer length.
      
      Fixes: ff2aeb1e ("libata: convert to chained sg")
      Signed-off-by: default avatarJohn Garry <john.garry@huawei.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf31d6d2
    • peter chang's avatar
      scsi: sg: check length passed to SG_NEXT_CMD_LEN · c2a86952
      peter chang authored
      commit bf33f87d upstream.
      
      The user can control the size of the next command passed along, but the
      value passed to the ioctl isn't checked against the usable max command
      size.
      Signed-off-by: default avatarPeter Chang <dpf@google.com>
      Acked-by: default avatarDouglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c2a86952
    • Christoph Hellwig's avatar
      xfs: try any AG when allocating the first btree block when reflinking · d5dbd1c9
      Christoph Hellwig authored
      commit 2fcc319d upstream.
      
      When a reflink operation causes the bmap code to allocate a btree block
      we're currently doing single-AG allocations due to having ->firstblock
      set and then try any higher AG due a little reflink quirk we've put in
      when adding the reflink code.  But given that we do not have a minleft
      reservation of any kind in this AG we can still not have any space in
      the same or higher AG even if the file system has enough free space.
      To fix this use a XFS_ALLOCTYPE_FIRST_AG allocation in this fall back
      path instead.
      
      [And yes, we need to redo this properly instead of piling hacks over
       hacks.  I'm working on that, but it's not going to be a small series.
       In the meantime this fixes the customer reported issue]
      
      Also add a warning for failing allocations to make it easier to debug.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5dbd1c9
    • Brian Foster's avatar
      xfs: use iomap new flag for newly allocated delalloc blocks · da617af8
      Brian Foster authored
      commit f65e6fad upstream.
      
      Commit fa7f138a ("xfs: clear delalloc and cache on buffered write
      failure") fixed one regression in the iomap error handling code and
      exposed another. The fundamental problem is that if a buffered write
      is a rewrite of preexisting delalloc blocks and the write fails, the
      failure handling code can punch out preexisting blocks with valid
      file data.
      
      This was reproduced directly by sub-block writes in the LTP
      kernel/syscalls/write/write03 test. A first 100 byte write allocates
      a single block in a file. A subsequent 100 byte write fails and
      punches out the block, including the data successfully written by
      the previous write.
      
      To address this problem, update the ->iomap_begin() handler to
      distinguish newly allocated delalloc blocks from preexisting
      delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
      ->iomap_end() handler to decide when a failed or short write should
      punch out delalloc blocks.
      
      This introduces the subtle requirement that ->iomap_begin() should
      never combine newly allocated delalloc blocks with existing blocks
      in the resulting iomap descriptor. This can occur when a new
      delalloc reservation merges with a neighboring extent that is part
      of the current write, for example. Therefore, drop the
      post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
      just return the record inserted into the fork. This ensures only new
      blocks are returned and thus that preexisting delalloc blocks are
      always handled as "found" blocks and not punched out on a failed
      rewrite.
      Reported-by: default avatarXiong Zhou <xzhou@redhat.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      da617af8
    • Chandan Rajendra's avatar
      xfs: Use xfs_icluster_size_fsb() to calculate inode alignment mask · 77aedb0c
      Chandan Rajendra authored
      commit d5825712 upstream.
      
      When block size is larger than inode cluster size, the call to
      XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
      would have set xfs_sb->sb_inoalignmt to 0. Hence in
      xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to
      -1 instead of 0. However, xfs_mount->m_sinoalign would get correctly
      intialized to 0 because for every positive value of xfs_mount->m_dalign,
      the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to
      false.
      
      Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having
      -1 as the value because blks_per_cluster variable would have the value 1
      and hence we would never have a need to use xfs_mount->m_inoalign_mask
      to compute the inode chunk's agbno and offset within the chunk.
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      77aedb0c
    • Christoph Hellwig's avatar
      xfs: fix and streamline error handling in xfs_end_io · d07b5855
      Christoph Hellwig authored
      commit 787eb485 upstream.
      
      There are two different cases of buffered I/O errors:
      
       - first we can have an already shutdown fs.  In that case we should skip
         any on-disk operations and just clean up the appen transaction if
         present and destroy the ioend
       - a real I/O error.  In that case we should cleanup any lingering COW
         blocks.  This gets skipped in the current code and is fixed by this
         patch.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d07b5855
    • Christoph Hellwig's avatar
      xfs: only reclaim unwritten COW extents periodically · 3b83a02a
      Christoph Hellwig authored
      commit 3802a345 upstream.
      
      We only want to reclaim preallocations from our periodic work item.
      Currently this is archived by looking for a dirty inode, but that check
      is rather fragile.  Instead add a flag to xfs_reflink_cancel_cow_* so
      that the caller can ask for just cancelling unwritten extents in the COW
      fork.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: fix typos in commit message]
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b83a02a
    • Christoph Hellwig's avatar
      xfs: tune down agno asserts in the bmap code · a2402936
      Christoph Hellwig authored
      commit 410d17f6 upstream.
      
      In various places we currently assert that xfs_bmap_btalloc allocates
      from the same as the firstblock value passed in, unless it's either
      NULLAGNO or the dop_low flag is set.  But the reflink code does not
      fully follow this convention as it passes in firstblock purely as
      a hint for the allocator without actually having previous allocations
      in the transaction, and without having a minleft check on the current
      AG, leading to the assert firing on a very full and heavily used
      file system.  As even the reflink code only allocates from equal or
      higher AGs for now we can simply the check to always allow for equal
      or higher AGs.
      
      Note that we need to eventually split the two meanings of the firstblock
      value.  At that point we can also allow the reflink code to allocate
      from any AG instead of limiting it in any way.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2402936
    • Chandan Rajendra's avatar
      xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment · 9559c48c
      Chandan Rajendra authored
      commit 8ee9fdbe upstream.
      
      On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace,
      
      XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026
      
      kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113!
      Oops: Exception in kernel mode, sig: 5 [#1]
      SMP NR_CPUS=2048
      DEBUG_PAGEALLOC
      NUMA
      pSeries
      Modules linked in:
      CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58
      task: c000000102606d80 task.stack: c0000001026b8000
      NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290
      REGS: c0000001026bb090 TRAP: 0700   Not tainted  (4.10.0-rc5)
      MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>
      CR: 28004428  XER: 00000000
      CFAR: c0000000004ef180 SOFTE: 1
      GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea
      GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0
      GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990
      GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000
      GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0
      GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001
      GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000
      GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0
      NIP [c0000000004ef798] .assfail+0x28/0x30
      LR [c0000000004ef798] .assfail+0x28/0x30
      Call Trace:
      [c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable)
      [c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0
      [c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480
      [c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90
      [c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820
      [c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320
      [c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610
      [c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0
      [c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790
      [c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410
      [c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230
      [c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120
      [c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc
      Instruction dump:
      4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78
      38600000 f8010010 f821ff91 4bfff94d <0fe00000> 60000000 7c0802a6 7c892378
      
      When block size is larger than inode cluster size, the call to
      XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
      would have set xfs_sb->sb_inoalignmt to 0. This causes
      xfs_ialloc_cluster_alignment() to return 0.  Due to this
      args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned
      equivalent of -1 assigned to it. This later causes alloc_len in
      xfs_alloc_space_available() to have a value of 0. In such a scenario
      when args.total is also 0, the assert statement "ASSERT(args->maxlen >
      0);" fails.
      
      This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in
      xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb().
      Suggested-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9559c48c
    • Brian Foster's avatar
      xfs: don't reserve blocks for right shift transactions · 5db7b41b
      Brian Foster authored
      commit 48af96ab upstream.
      
      The block reservation for the transaction allocated in
      xfs_shift_file_space() is an artifact of the original collapse range
      support. It exists to handle the case where a collapse range occurs,
      the initial extent is left shifted into a location that forms a
      contiguous boundary with the previous extent and thus the extents
      are merged. This code was subsequently refactored and reused for
      insert range (right shift) support.
      
      If an insert range occurs under low free space conditions, the
      extent at the starting offset is split before the first shift
      transaction is allocated. If the block reservation fails, this
      leaves separate, but contiguous extents around in the inode. While
      not a fatal problem, this is unexpected and will flag a warning on
      subsequent insert range operations on the inode. This problem has
      been reproduce intermittently by generic/270 running against a
      ramdisk device.
      
      Since right shift does not create new extent boundaries in the
      inode, a block reservation for extent merge is unnecessary. Update
      xfs_shift_file_space() to conditionally reserve fs blocks for left
      shift transactions only. This avoids the warning reproduced by
      generic/270.
      Reported-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5db7b41b
    • Darrick J. Wong's avatar
      xfs: fix uninitialized variable in _reflink_convert_cow · e5e2e56f
      Darrick J. Wong authored
      commit 93aaead5 upstream.
      
      Fix an uninitialize variable.
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5e2e56f
    • Brian Foster's avatar
      xfs: split indlen reservations fairly when under reserved · c251c6c2
      Brian Foster authored
      commit 75d65361 upstream.
      
      Certain workoads that punch holes into speculative preallocation can
      cause delalloc indirect reservation splits when the delalloc extent is
      split in two. If further splits occur, an already short-handed extent
      can be split into two in a manner that leaves zero indirect blocks for
      one of the two new extents. This occurs because the shortage is large
      enough that the xfs_bmap_split_indlen() algorithm completely drains the
      requested indlen of one of the extents before it honors the existing
      reservation.
      
      This ultimately results in a warning from xfs_bmap_del_extent(). This
      has been observed during file copies of large, sparse files using 'cp
      --sparse=always.'
      
      To avoid this problem, update xfs_bmap_split_indlen() to explicitly
      apply the reservation shortage fairly between both extents. This smooths
      out the overall indlen shortage and defers the situation where we end up
      with a delalloc extent with zero indlen reservation to extreme
      circumstances.
      Reported-by: default avatarPatrick Dung <mpatdung@gmail.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c251c6c2
    • Brian Foster's avatar
      xfs: handle indlen shortage on delalloc extent merge · 2d7c1c7f
      Brian Foster authored
      commit 0e339ef8 upstream.
      
      When a delalloc extent is created, it can be merged with pre-existing,
      contiguous, delalloc extents. When this occurs,
      xfs_bmap_add_extent_hole_delay() merges the extents along with the
      associated indirect block reservations. The expectation here is that the
      combined worst case indlen reservation is always less than or equal to
      the indlen reservation for the individual extents.
      
      This is not always the case, however, as existing extents can less than
      the expected indlen reservation if the extent was previously split due
      to a hole punch. If a new extent merges with such an extent, the total
      indlen requirement may be larger than the sum of the indlen reservations
      held by both extents.
      
      xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
      reservation is always available and assigns it to the merged extent
      without consideration for the indlen held by the pre-existing extent. As
      a result, the subsequent xfs_mod_fdblocks() call can attempt an
      unintentional allocation rather than a free (indicated by an ASSERT()
      failure). Further, if the allocation happens to fail in this context,
      the failure goes unhandled and creates a filesystem wide block
      accounting inconsistency.
      
      Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
      indlen reservation assigned to the merged extent to the sum of the
      indlen reservations held by each of the individual extents.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d7c1c7f
    • Christoph Hellwig's avatar
      xfs: don't fail xfs_extent_busy allocation · 47d7d1ea
      Christoph Hellwig authored
      commit 5e30c23d upstream.
      
      We don't just need the structure to track busy extents which can be
      avoided with a synchronous transaction, but also to keep track of
      pending discard.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      47d7d1ea
    • Christoph Hellwig's avatar
      xfs: reject all unaligned direct writes to reflinked files · 5bbf5ba6
      Christoph Hellwig authored
      commit 54a4ef8a upstream.
      
      We currently fall back from direct to buffered writes if we detect a
      remaining shared extent in the iomap_begin callback.  But by the time
      iomap_begin is called for the potentially unaligned end block we might
      have already written most of the data to disk, which we'd now write
      again using buffered I/O.  To avoid this reject all writes to reflinked
      files before starting I/O so that we are guaranteed to only write the
      data once.
      
      The alternative would be to unshare the unaligned start and/or end block
      before doing the I/O. I think that's doable, and will actually be
      required to support reflinks on DAX file system.  But it will take a
      little more time and I'd rather get rid of the double write ASAP.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [slight changes in context due to the new direct I/O code in 4.10+]
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5bbf5ba6
    • Christoph Hellwig's avatar
      xfs: update ctime and mtime on clone destinatation inodes · 67eb7bf8
      Christoph Hellwig authored
      commit c5ecb423 upstream.
      
      We're changing both metadata and data, so we need to update the
      timestamps for clone operations.  Dedupe on the other hand does
      not change file data, and only changes invisible metadata so the
      timestamps should not be updated.
      
      This follows existing btrfs behavior.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: remove redundant is_dedupe test]
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67eb7bf8
    • Hou Tao's avatar
      xfs: reset b_first_retry_time when clear the retry status of xfs_buf_t · e060f488
      Hou Tao authored
      commit 4dd2eb63 upstream.
      
      After successful IO or permanent error, b_first_retry_time also
      needs to be cleared, else the invalid first retry time will be
      used by the next retry check.
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e060f488
    • Darrick J. Wong's avatar
      xfs: mark speculative prealloc CoW fork extents unwritten · e02f0ff2
      Darrick J. Wong authored
      commit 5eda4300 upstream.
      
      Christoph Hellwig pointed out that there's a potentially nasty race when
      performing simultaneous nearby directio cow writes:
      
      "Thread 1 writes a range from B to c
      
      "                    B --------- C
                                 p
      
      "a little later thread 2 writes from A to B
      
      "        A --------- B
                     p
      
      [editor's note: the 'p' denote cowextsize boundaries, which I added to
      make this more clear]
      
      "but the code preallocates beyond B into the range where thread
      "1 has just written, but ->end_io hasn't been called yet.
      "But once ->end_io is called thread 2 has already allocated
      "up to the extent size hint into the write range of thread 1,
      "so the end_io handler will splice the unintialized blocks from
      "that preallocation back into the file right after B."
      
      We can avoid this race by ensuring that thread 1 cannot accidentally
      remap the blocks that thread 2 allocated (as part of speculative
      preallocation) as part of t2's write preparation in t1's end_io handler.
      The way we make this happen is by taking advantage of the unwritten
      extent flag as an intermediate step.
      
      Recall that when we begin the process of writing data to shared blocks,
      we create a delayed allocation extent in the CoW fork:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------DDDDDDD---------
      
      When a thread prepares to CoW some dirty data out to disk, it will now
      convert the delalloc reservation into an /unwritten/ allocated extent in
      the cow fork.  The da conversion code tries to opportunistically
      allocate as much of a (speculatively prealloc'd) extent as possible, so
      we may end up allocating a larger extent than we're actually writing
      out:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UUUUUUU---------
      
      Next, we convert only the part of the extent that we're actively
      planning to write to normal (i.e. not unwritten) status:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UURRUUU---------
      
      If the write succeeds, the end_cow function will now scan the relevant
      range of the CoW fork for real extents and remap only the real extents
      into the data fork:
      
      D: --RRRRRRRRSRRRRRRRR---
      U: ------UU--UUU---------
      
      This ensures that we never obliterate valid data fork extents with
      unwritten blocks from the CoW fork.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e02f0ff2
    • Darrick J. Wong's avatar
      xfs: allow unwritten extents in the CoW fork · 8370826f
      Darrick J. Wong authored
      commit 05a630d7 upstream.
      
      In the data fork, we only allow extents to perform the following state
      transitions:
      
      delay -> real <-> unwritten
      
      There's no way to move directly from a delalloc reservation to an
      /unwritten/ allocated extent.  However, for the CoW fork we want to be
      able to do the following to each extent:
      
      delalloc -> unwritten -> written -> remapped to data fork
      
      This will help us to avoid a race in the speculative CoW preallocation
      code between a first thread that is allocating a CoW extent and a second
      thread that is remapping part of a file after a write.  In order to do
      this, however, we need two things: first, we have to be able to
      transition from da to unwritten, and second the function that converts
      between real and unwritten has to be made aware of the cow fork.  Do
      both of those things.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8370826f
    • Darrick J. Wong's avatar
      xfs: verify free block header fields · 3d2bd2fd
      Darrick J. Wong authored
      commit de14c5f5 upstream.
      
      Perform basic sanity checking of the directory free block header
      fields so that we avoid hanging the system on invalid data.
      
      (Granted that just means that now we shutdown on directory write,
      but that seems better than hanging...)
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d2bd2fd
    • Darrick J. Wong's avatar
      xfs: check for obviously bad level values in the bmbt root · 4056a74a
      Darrick J. Wong authored
      commit b3bf607d upstream.
      
      We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
      no such thing as a zero-level bmbt (for that we have extents format),
      so if we see this, send back an error code.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4056a74a
    • Darrick J. Wong's avatar
      xfs: filter out obviously bad btree pointers · efab3ae2
      Darrick J. Wong authored
      commit d5a91bae upstream.
      
      Don't let anybody load an obviously bad btree pointer.  Since the values
      come from disk, we must return an error, not just ASSERT.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      efab3ae2
    • Darrick J. Wong's avatar
      xfs: fail _dir_open when readahead fails · 7e2dd1fb
      Darrick J. Wong authored
      commit 7a652bbe upstream.
      
      When we open a directory, we try to readahead block 0 of the directory
      on the assumption that we're going to need it soon.  If the bmbt is
      corrupt, the directory will never be usable and the readahead fails
      immediately, so we might as well prevent the directory from being opened
      at all.  This prevents a subsequent read or modify operation from
      hitting it and taking the fs offline.
      
      NOTE: We're only checking for early failures in the block mapping, not
      the readahead directory block itself.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e2dd1fb
    • Darrick J. Wong's avatar
      xfs: fix toctou race when locking an inode to access the data map · 0a6844ab
      Darrick J. Wong authored
      commit 4b5bd5bf upstream.
      
      We use di_format and if_flags to decide whether we're grabbing the ilock
      in btree mode (btree extents not loaded) or shared mode (anything else),
      but the state of those fields can be changed by other threads that are
      also trying to load the btree extents -- IFEXTENTS gets set before the
      _bmap_read_extents call and cleared if it fails.
      
      We don't actually need to have IFEXTENTS set until after the bmbt
      records are successfully loaded and validated, which will fix the race
      between multiple threads trying to read the same directory.  The next
      patch strengthens directory bmbt validation by refusing to open the
      directory if reading the bmbt to start directory readahead fails.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a6844ab
    • Brian Foster's avatar
      xfs: fix eofblocks race with file extending async dio writes · 4127a5d9
      Brian Foster authored
      commit e4229d6b upstream.
      
      It's possible for post-eof blocks to end up being used for direct I/O
      writes. dio write performs an upfront unwritten extent allocation, sends
      the dio and then updates the inode size (if necessary) on write
      completion. If a file release occurs while a file extending dio write is
      in flight, it is possible to mistake the post-eof blocks for speculative
      preallocation and incorrectly truncate them from the inode. This means
      that the resulting dio write completion can discover a hole and allocate
      new blocks rather than perform unwritten extent conversion.
      
      This requires a strange mix of I/O and is thus not likely to reproduce
      in real world workloads. It is intermittently reproduced by generic/299.
      The error manifests as an assert failure due to transaction overrun
      because the aforementioned write completion transaction has only
      reserved enough blocks for btree operations:
      
        XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
         file: fs/xfs//xfs_trans.c, line: 309
      
      The root cause is that xfs_free_eofblocks() uses i_size to truncate
      post-eof blocks from the inode, but async, file extending direct writes
      do not update i_size until write completion, long after inode locks are
      dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
      to the incorrect size.
      
      Update xfs_free_eofblocks() to serialize against dio similar to how
      extending writes are serialized against i_size updates before post-eof
      block zeroing. Specifically, wait on dio while under the iolock. This
      ensures that dio write completions have updated i_size before post-eof
      blocks are processed.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4127a5d9
    • Brian Foster's avatar
      xfs: sync eofblocks scans under iolock are livelock prone · 4d725d74
      Brian Foster authored
      commit c3155097 upstream.
      
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      4d725d74
    • Brian Foster's avatar
      xfs: pull up iolock from xfs_free_eofblocks() · 798b1dc5
      Brian Foster authored
      commit a36b9261 upstream.
      
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well as the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      798b1dc5
    • Christoph Hellwig's avatar
      xfs: use per-AG reservations for the finobt · 08a2a268
      Christoph Hellwig authored
      commit 76d771b4 upstream.
      
      Currently we try to rely on the global reserved block pool for block
      allocations for the free inode btree, but I have customer reports
      (fairly complex workload, need to find an easier reproducer) where that
      is not enough as the AG where we free an inode that requires a new
      finobt block is entirely full.  This causes us to cancel a dirty
      transaction and thus a file system shutdown.
      
      I think the right way to guard against this is to treat the finot the same
      way as the refcount btree and have a per-AG reservations for the possible
      worst case size of it, and the patch below implements that.
      
      Note that this could increase mount times with large finobt trees.  In
      an ideal world we would have added a field for the number of finobt
      fields to the AGI, similar to what we did for the refcount blocks.
      We should do add it next time we rev the AGI or AGF format by adding
      new fields.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08a2a268