1. 25 May, 2019 5 commits
  2. 22 May, 2019 35 commits
    • Greg Kroah-Hartman's avatar
      Linux 5.1.4 · e0e8106a
      Greg Kroah-Hartman authored
      e0e8106a
    • Martin Schwidefsky's avatar
      s390/mm: convert to the generic get_user_pages_fast code · ee4c3e28
      Martin Schwidefsky authored
      commit 1a42010c upstream.
      
      Define the gup_fast_permitted to check against the asce_limit of the
      mm attached to the current task, then replace the s390 specific gup
      code with the generic implementation in mm/gup.c.
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee4c3e28
    • Martin Schwidefsky's avatar
      s390/mm: make the pxd_offset functions more robust · 8b066d00
      Martin Schwidefsky authored
      commit d1874a0c upstream.
      
      Change the way how pgd_offset, p4d_offset, pud_offset and pmd_offset
      walk the page tables. pgd_offset now always calculates the index for
      the top-level page table and adds it to the pgd, this is either a
      segment table offset for a 2-level setup, a region-3 offset for 3-levels,
      region-2 offset for 4-levels, or a region-1 offset for a 5-level setup.
      The other three functions p4d_offset, pud_offset and pmd_offset will
      only add the respective offset if they dereference the passed pointer.
      
      With the new way of walking the page tables a sequence like this from
      mm/gup.c now works:
      
           pgdp = pgd_offset(current->mm, addr);
           pgd = READ_ONCE(*pgdp);
           p4dp = p4d_offset(&pgd, addr);
           p4d = READ_ONCE(*p4dp);
           pudp = pud_offset(&p4d, addr);
           pud = READ_ONCE(*pudp);
           pmdp = pmd_offset(&pud, addr);
           pmd = READ_ONCE(*pmdp);
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b066d00
    • Dan Williams's avatar
      libnvdimm/namespace: Fix label tracking error · 550f5ba1
      Dan Williams authored
      commit c4703ce1 upstream.
      
      Users have reported intermittent occurrences of DIMM initialization
      failures due to duplicate allocations of address capacity detected in
      the labels, or errors of the form below, both have the same root cause.
      
          nd namespace1.4: failed to track label: 0
          WARNING: CPU: 17 PID: 1381 at drivers/nvdimm/label.c:863
      
          RIP: 0010:__pmem_label_update+0x56c/0x590 [libnvdimm]
          Call Trace:
           ? nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
           nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
           uuid_store+0x17e/0x190 [libnvdimm]
           kernfs_fop_write+0xf0/0x1a0
           vfs_write+0xb7/0x1b0
           ksys_write+0x57/0xd0
           do_syscall_64+0x60/0x210
      
      Unfortunately those reports were typically with a busy parallel
      namespace creation / destruction loop making it difficult to see the
      components of the bug. However, Jane provided a simple reproducer using
      the work-in-progress sub-section implementation.
      
      When ndctl is reconfiguring a namespace it may take an existing defunct
      / disabled namespace and reconfigure it with a new uuid and other
      parameters. Critically namespace_update_uuid() takes existing address
      resources and renames them for the new namespace to use / reconfigure as
      it sees fit. The bug is that this rename only happens in the resource
      tracking tree. Existing labels with the old uuid are not reaped leading
      to a scenario where multiple active labels reference the same span of
      address range.
      
      Teach namespace_update_uuid() to flag any references to the old uuid for
      reaping at the next label update attempt.
      
      Cc: <stable@vger.kernel.org>
      Fixes: bf9bccc1 ("libnvdimm: pmem label sets and namespace instantiation")
      Link: https://github.com/pmem/ndctl/issues/91Reported-by: default avatarJane Chu <jane.chu@oracle.com>
      Reported-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Reported-by: default avatarErwin Tsaur <erwin.tsaur@oracle.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      550f5ba1
    • Christophe Leroy's avatar
      powerpc/32s: fix flush_hash_pages() on SMP · fda49aec
      Christophe Leroy authored
      commit 397d2300 upstream.
      
      flush_hash_pages() runs with data translation off, so current
      task_struct has to be accesssed using physical address.
      
      Fixes: f7354cca ("powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU")
      Cc: stable@vger.kernel.org # v5.1+
      Reported-by: default avatarErhard F. <erhard_f@mailbox.org>
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fda49aec
    • Roger Pau Monne's avatar
      xen/pvh: correctly setup the PV EFI interface for dom0 · 4d9ec162
      Roger Pau Monne authored
      commit 72813bfb upstream.
      
      This involves initializing the boot params EFI related fields and the
      efi global variable.
      
      Without this fix a PVH dom0 doesn't detect when booted from EFI, and
      thus doesn't support accessing any of the EFI related data.
      Reported-by: default avatarPGNet Dev <pgnet.dev@gmail.com>
      Signed-off-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d9ec162
    • Roger Pau Monne's avatar
      xen/pvh: set xen_domain_type to HVM in xen_pvh_init · 97f06047
      Roger Pau Monne authored
      commit c9f804d6 upstream.
      
      Or else xen_domain() returns false despite xen_pvh being set.
      Signed-off-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      97f06047
    • Masahiro Yamada's avatar
      kbuild: turn auto.conf.cmd into a mandatory include file · fdea76e5
      Masahiro Yamada authored
      commit d2f8ae0e upstream.
      
      syncconfig is responsible for keeping auto.conf up-to-date, so if it
      fails for any reason, the build must be terminated immediately.
      
      However, since commit 9390dff6 ("kbuild: invoke syncconfig if
      include/config/auto.conf.cmd is missing"), Kbuild continues running
      even after syncconfig fails.
      
      You can confirm this by intentionally making syncconfig error out:
      
      #  diff --git a/scripts/kconfig/confdata.c b/scripts/kconfig/confdata.c
      #  index 08ba146..307b9de 100644
      #  --- a/scripts/kconfig/confdata.c
      #  +++ b/scripts/kconfig/confdata.c
      #  @@ -1023,6 +1023,9 @@ int conf_write_autoconf(int overwrite)
      #          FILE *out, *tristate, *out_h;
      #          int i;
      #
      #  +       if (overwrite)
      #  +               return 1;
      #  +
      #          if (!overwrite && is_present(autoconf_name))
      #                  return 0;
      
      Then, syncconfig fails, but Make would not stop:
      
        $ make -s mrproper allyesconfig defconfig
        $ make
        scripts/kconfig/conf  --syncconfig Kconfig
      
        *** Error during sync of the configuration.
      
        make[2]: *** [scripts/kconfig/Makefile;69: syncconfig] Error 1
        make[1]: *** [Makefile;557: syncconfig] Error 2
        make: *** [include/config/auto.conf.cmd] Deleting file 'include/config/tristate.conf'
        make: Failed to remake makefile 'include/config/auto.conf'.
          SYSTBL  arch/x86/include/generated/asm/syscalls_32.h
          SYSHDR  arch/x86/include/generated/asm/unistd_32_ia32.h
          SYSHDR  arch/x86/include/generated/asm/unistd_64_x32.h
          SYSTBL  arch/x86/include/generated/asm/syscalls_64.h
        [ continue running ... ]
      
      The reason is in the behavior of a pattern rule with multi-targets.
      
        %/auto.conf %/auto.conf.cmd %/tristate.conf: $(KCONFIG_CONFIG)
                $(Q)$(MAKE) -f $(srctree)/Makefile syncconfig
      
      GNU Make knows this rule is responsible for making all the three files
      simultaneously. As far as examined, auto.conf.cmd is the target in
      question when this rule is invoked. It is probably because auto.conf.cmd
      is included below the inclusion of auto.conf.
      
      The inclusion of auto.conf is mandatory, while that of auto.conf.cmd
      is optional. GNU Make does not care about the failure in the process
      of updating optional include files.
      
      I filed this issue (https://savannah.gnu.org/bugs/?56301) in case this
      behavior could be improved somehow in future releases of GNU Make.
      Anyway, it is quite easy to fix our Makefile.
      
      Given that auto.conf is already a mandatory include file, there is no
      reason to stick auto.conf.cmd optional. Make it mandatory as well.
      
      Cc: linux-stable <stable@vger.kernel.org> # 5.0+
      Fixes: 9390dff6 ("kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing")
      Signed-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      [commented out diff above to keep patch happy - gregkh]
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fdea76e5
    • Steve French's avatar
      smb3: display session id in debug data · 404d72bd
      Steve French authored
      commit b63a9de0 upstream.
      
      Displaying the session id in /proc/fs/cifs/DebugData
      is needed in order to correlate Linux client information
      with network and server traces for many common support
      scenarios.  Turned out to be very important for debugging.
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Reviewed-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      404d72bd
    • Sean Christopherson's avatar
      KVM: lapic: Busy wait for timer to expire when using hv_timer · 9488cacd
      Sean Christopherson authored
      commit ee66e453 upstream.
      
      ...now that VMX's preemption timer, i.e. the hv_timer, also adjusts its
      programmed time based on lapic_timer_advance_ns.  Without the delay, a
      guest can see a timer interrupt arrive before the requested time when
      KVM is using the hv_timer to emulate the guest's interrupt.
      
      Fixes: c5ce8235 ("KVM: VMX: Optimize tscdeadline timer latency")
      Cc: <stable@vger.kernel.org>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9488cacd
    • Sean Christopherson's avatar
      KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes · 971a62fb
      Sean Christopherson authored
      commit 11988499 upstream.
      
      KVM allows userspace to violate consistency checks related to the
      guest's CPUID model to some degree.  Generally speaking, userspace has
      carte blanche when it comes to guest state so long as jamming invalid
      state won't negatively affect the host.
      
      Currently this is seems to be a non-issue as most of the interesting
      EFER checks are missing, e.g. NX and LME, but those will be added
      shortly.  Proactively exempt userspace from the CPUID checks so as not
      to break userspace.
      
      Note, the efer_reserved_bits check still applies to userspace writes as
      that mask reflects the host's capabilities, e.g. KVM shouldn't allow a
      guest to run with NX=1 if it has been disabled in the host.
      
      Fixes: d8017474 ("KVM: SVM: Only allow setting of EFER_SVME when CPUID SVM is set")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      971a62fb
    • Peter Xu's avatar
      KVM: Fix the bitmap range to copy during clear dirty · 12b014bd
      Peter Xu authored
      commit 4ddc9204 upstream.
      
      kvm_dirty_bitmap_bytes() will return the size of the dirty bitmap of
      the memslot rather than the size of bitmap passed over from the ioctl.
      Here for KVM_CLEAR_DIRTY_LOG we should only copy exactly the size of
      bitmap that covers kvm_clear_dirty_log.num_pages.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 2a31b9dbSigned-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      12b014bd
    • Sean Christopherson's avatar
      Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" · 2f6ef23f
      Sean Christopherson authored
      commit f93f7ede upstream.
      
      The RDPMC-exiting control is dependent on the existence of the RDPMC
      instruction itself, i.e. is not tied to the "Architectural Performance
      Monitoring" feature.  For all intents and purposes, the control exists
      on all CPUs with VMX support since RDPMC also exists on all VCPUs with
      VMX supported.  Per Intel's SDM:
      
        The RDPMC instruction was introduced into the IA-32 Architecture in
        the Pentium Pro processor and the Pentium processor with MMX technology.
        The earlier Pentium processors have performance-monitoring counters, but
        they must be read with the RDMSR instruction.
      
      Because RDPMC-exiting always exists, KVM requires the control and refuses
      to load if it's not available.  As a result, hiding the PMU from a guest
      breaks nested virtualization if the guest attemts to use KVM.
      
      While it's not explicitly stated in the RDPMC pseudocode, the VM-Exit
      check for RDPMC-exiting follows standard fault vs. VM-Exit prioritization
      for privileged instructions, e.g. occurs after the CPL/CR0.PE/CR4.PCE
      checks, but before the counter referenced in ECX is checked for validity.
      
      In other words, the original KVM behavior of injecting a #GP was correct,
      and the KVM unit test needs to be adjusted accordingly, e.g. eat the #GP
      when the unit test guest (L3 in this case) executes RDPMC without
      RDPMC-exiting set in the unit test host (L2).
      
      This reverts commit e51bfdb6.
      
      Fixes: e51bfdb6 ("KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU")
      Reported-by: default avatarDavid Hill <hilld@binarystorm.net>
      Cc: Saar Amar <saaramar@microsoft.com>
      Cc: Mihai Carabas <mihai.carabas@oracle.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f6ef23f
    • Chengguang Xu's avatar
      jbd2: fix potential double free · b8bd6257
      Chengguang Xu authored
      commit 0d52154b upstream.
      
      When failing from creating cache jbd2_inode_cache, we will destroy the
      previously created cache jbd2_handle_cache twice.  This patch fixes
      this by moving each cache initialization/destruction to its own
      separate, individual function.
      Signed-off-by: default avatarChengguang Xu <cgxu519@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8bd6257
    • Michał Wadowski's avatar
      ALSA: hda/realtek - Fix for Lenovo B50-70 inverted internal microphone bug · 8e9dbdd6
      Michał Wadowski authored
      commit 56df90b6 upstream.
      
      Add patch for realtek codec in Lenovo B50-70 that fixes inverted
      internal microphone channel.
      Device IdeaPad Y410P has the same PCI SSID as Lenovo B50-70,
      but first one is about fix the noise and it didn't seem help in a
      later kernel version.
      So I replaced IdeaPad Y410P device description with B50-70 and apply
      inverted microphone fix.
      
      Bugzilla: https://bugs.launchpad.net/ubuntu/+source/alsa-driver/+bug/1524215Signed-off-by: default avatarMichał Wadowski <wadosm@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e9dbdd6
    • Kailang Yang's avatar
      ALSA: hda/realtek - Fixup headphone noise via runtime suspend · d018003e
      Kailang Yang authored
      commit dad3197d upstream.
      
      Dell platform with ALC298.
      system enter to runtime suspend. Headphone had noise.
      Let Headset Mic not shutup will solve this issue.
      
      [ Fixed minor coding style issues by tiwai ]
      Signed-off-by: default avatarKailang Yang <kailang@realtek.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d018003e
    • Jeremy Soller's avatar
      ALSA: hda/realtek - Corrected fixup for System76 Gazelle (gaze14) · f15d4a25
      Jeremy Soller authored
      commit 891afcf2 upstream.
      
      A mistake was made in the identification of the four variants of the
      System76 Gazelle (gaze14). This patch corrects the PCI ID of the
      17-inch, GTX 1660 Ti variant from 0x8560 to 0x8551. This patch also
      adds the correct fixups for the 15-inch and 17-inch GTX 1650 variants
      with PCI IDs 0x8560 and 0x8561.
      
      Tests were done on all four variants ensuring full audio capability.
      
      Fixes: 80a5052d ("ALSA: hdea/realtek - Headset fixup for System76 Gazelle (gaze14)")
      Signed-off-by: default avatarJeremy Soller <jeremy@system76.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f15d4a25
    • Jan Kara's avatar
      ext4: avoid panic during forced reboot due to aborted journal · e6d47828
      Jan Kara authored
      commit 2c1d0e36 upstream.
      
      Handling of aborted journal is a special code path different from
      standard ext4_error() one and it can call panic() as well. Commit
      1dc1097f ("ext4: avoid panic during forced reboot") forgot to update
      this path so fix that omission.
      
      Fixes: 1dc1097f ("ext4: avoid panic during forced reboot")
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 5.1
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6d47828
    • Sahitya Tummala's avatar
      ext4: fix use-after-free in dx_release() · fc0d59b2
      Sahitya Tummala authored
      commit 08fc98a4 upstream.
      
      The buffer_head (frames[0].bh) and it's corresping page can be
      potentially free'd once brelse() is done inside the for loop
      but before the for loop exits in dx_release(). It can be free'd
      in another context, when the page cache is flushed via
      drop_caches_sysctl_handler(). This results into below data abort
      when accessing info->indirect_levels in dx_release().
      
      Unable to handle kernel paging request at virtual address ffffffc17ac3e01e
      Call trace:
       dx_release+0x70/0x90
       ext4_htree_fill_tree+0x2d4/0x300
       ext4_readdir+0x244/0x6f8
       iterate_dir+0xbc/0x160
       SyS_getdents64+0x94/0x174
      Signed-off-by: default avatarSahitya Tummala <stummala@codeaurora.org>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fc0d59b2
    • Lukas Czerner's avatar
      ext4: fix data corruption caused by overlapping unaligned and aligned IO · 0dc4c612
      Lukas Czerner authored
      commit 57a0da28 upstream.
      
      Unaligned AIO must be serialized because the zeroing of partial blocks
      of unaligned AIO can result in data corruption in case it's overlapping
      another in flight IO.
      
      Currently we wait for all unwritten extents before we submit unaligned
      AIO which protects data in case of unaligned AIO is following overlapping
      IO. However if a unaligned AIO is followed by overlapping aligned AIO we
      can still end up corrupting data.
      
      To fix this, we must make sure that the unaligned AIO is the only IO in
      flight by waiting for unwritten extents conversion not just before the
      IO submission, but right after it as well.
      
      This problem can be reproduced by xfstest generic/538
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0dc4c612
    • Sriram Rajagopalan's avatar
      ext4: zero out the unused memory region in the extent tree block · b02ae56d
      Sriram Rajagopalan authored
      commit 592acbf1 upstream.
      
      This commit zeroes out the unused memory region in the buffer_head
      corresponding to the extent metablock after writing the extent header
      and the corresponding extent node entries.
      
      This is done to prevent random uninitialized data from getting into
      the filesystem when the extent block is synced.
      
      This fixes CVE-2019-11833.
      Signed-off-by: default avatarSriram Rajagopalan <sriramr@arista.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b02ae56d
    • Anup Patel's avatar
      tty: Don't force RISCV SBI console as preferred console · 9e897e34
      Anup Patel authored
      commit f91253a3 upstream.
      
      The Linux kernel will auto-disables all boot consoles whenever it
      gets a preferred real console.
      
      Currently on RISC-V systems, if we have a real console which is not
      RISCV SBI console then boot consoles (such as earlycon=sbi) are not
      auto-disabled when a real console (ttyS0 or ttySIF0) is available.
      This results in duplicate prints at boot-time after kernel starts
      using real console (i.e. ttyS0 or ttySIF0) if "earlycon=" kernel
      parameter was passed by bootloader.
      
      The reason for above issue is that RISCV SBI console always adds
      itself as preferred console which is causing other real consoles
      to be not used as preferred console.
      
      Ideally "console=" kernel parameter passed by bootloaders should
      be the one selecting a preferred real console.
      
      This patch fixes above issue by not forcing RISCV SBI console as
      preferred console.
      
      Fixes: afa6b1cc ("tty: New RISC-V SBI console driver")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAnup Patel <anup.patel@wdc.com>
      Reviewed-by: default avatarAtish Patra <atish.patra@wdc.com>
      Signed-off-by: default avatarPalmer Dabbelt <palmer@sifive.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9e897e34
    • Jiufei Xue's avatar
      fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount · 0b9d5347
      Jiufei Xue authored
      commit ec084de9 upstream.
      
      synchronize_rcu() didn't wait for call_rcu() callbacks, so inode wb
      switch may not go to the workqueue after synchronize_rcu().  Thus
      previous scheduled switches was not finished even flushing the
      workqueue, which will cause a NULL pointer dereferenced followed below.
      
        VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds.  Have a nice day...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
          evict+0xb3/0x180
          iput+0x1b0/0x230
          inode_switch_wbs_work_fn+0x3c0/0x6a0
          worker_thread+0x4e/0x490
          ? process_one_work+0x410/0x410
          kthread+0xe6/0x100
          ret_from_fork+0x39/0x50
      
      Replace the synchronize_rcu() call with a rcu_barrier() to wait for all
      pending callbacks to finish.  And inc isw_nr_in_flight after call_rcu()
      in inode_switch_wbs() to make more sense.
      
      Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.comSigned-off-by: default avatarJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0b9d5347
    • Mel Gorman's avatar
      mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock · 0e294894
      Mel Gorman authored
      commit 60fce36a upstream.
      
      syzbot reported the following error from a tree with a head commit of
      baf76f0c ("slip: make slhc_free() silently accept an error pointer")
      
        BUG: unable to handle kernel paging request at ffffea0003348000
        #PF error: [normal kernel read fault]
        PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
        RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
        RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
        Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
        ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
        c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
        RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
        RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
        RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
        RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
        R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
        FS:  00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
        Call Trace:
         fast_isolate_around mm/compaction.c:1243 [inline]
         fast_isolate_freepages mm/compaction.c:1418 [inline]
         isolate_freepages mm/compaction.c:1438 [inline]
         compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550
      
      There is no reproducer and it is difficult to hit -- 1 crash every few
      days.  The issue is very similar to the fix in commit 6b0868c8
      ("mm/compaction.c: correct zone boundary handling when resetting pageblock
      skip hints").  When isolating free pages around a target pageblock, the
      boundary handling is off by one and can stray into the next pageblock.
      Triggering the syzbot error requires that the end of pageblock is section
      or zone aligned, and that the next section is unpopulated.
      
      A more subtle consequence of the bug is that pageblocks were being
      improperly used as migration targets which potentially hurts fragmentation
      avoidance in the long-term one page at a time.
      
      A debugging patch revealed that it's definitely possible to stray outside
      of a pageblock which is not intended.  While syzbot cannot be used to
      verify this patch, it was confirmed that the debugging warning no longer
      triggers with this patch applied.  It has also been confirmed that the THP
      allocation stress tests are not degraded by this patch.
      
      Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
      Fixes: e332f741 ("mm, compaction: be selective about what pageblocks to clear skip hints")
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org> # v5.1+
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e294894
    • Fabio Estevam's avatar
      ARM: dts: imx: Fix the AR803X phy-mode · b05c5e79
      Fabio Estevam authored
      commit 0672d22a upstream.
      
      Commit 6d4cd041 ("net: phy: at803x: disable delay only for RGMII mode")
      exposed an issue on imx DTS files using AR8031/AR8035 PHYs.
      
      The end result is that the boards can no longer obtain an IP address
      via UDHCP, for example.
      
      Quoting Andrew Lunn:
      
      "The problem here is, all the DTs were broken since day 0. However,
      because the PHY driver was also broken, nobody noticed and it
      worked. Now that the PHY driver has been fixed, all the bugs in the
      DTs now become an issue"
      
      To fix this problem, the phy-mode property needs to be "rgmii-id",  which
      has the following meaning as per
      Documentation/devicetree/bindings/net/ethernet.txt:
      
      "RGMII with internal RX and TX delays provided by the PHY, the MAC should
      not add the RX or TX delays in this case)"
      
      Tested on imx6-sabresd, imx6sx-sdb and imx7d-pico boards with
      successfully restored networking.
      
      Based on the initial submission from Steve Twiss for the
      imx6qdl-sabresd.
      Signed-off-by: default avatarFabio Estevam <festevam@gmail.com>
      Tested-by: default avatarBaruch Siach <baruch@tkos.co.il>
      Tested-by: default avatarSoeren Moch <smoch@web.de>
      Tested-by: default avatarSteve Twiss <stwiss.opensource@diasemi.com>
      Tested-by: default avatarAdam Thomson <Adam.Thomson@diasemi.com>
      Signed-off-by: default avatarSteve Twiss <stwiss.opensource@diasemi.com>
      Tested-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarShawn Guo <shawnguo@kernel.org>
      Cc: "George G. Davis" <george_davis@mentor.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b05c5e79
    • Kamlakant Patel's avatar
      ipmi:ssif: compare block number correctly for multi-part return messages · 14c7aae7
      Kamlakant Patel authored
      commit 55be8658 upstream.
      
      According to ipmi spec, block number is a number that is incremented,
      starting with 0, for each new block of message data returned using the
      middle transaction.
      
      Here, the 'blocknum' is data[0] which always starts from zero(0) and
      'ssif_info->multi_pos' starts from 1.
      So, we need to add +1 to blocknum while comparing with multi_pos.
      
      Fixes: 7d6380cd ("ipmi:ssif: Fix handling of multi-part return messages").
      Reported-by: default avatarKiran Kolukuluru <kirank@ami.com>
      Signed-off-by: default avatarKamlakant Patel <kamlakantp@marvell.com>
      Message-Id: <1556106615-18722-1-git-send-email-kamlakantp@marvell.com>
      [Also added a debug log if the block numbers don't match.]
      Signed-off-by: default avatarCorey Minyard <cminyard@mvista.com>
      Cc: stable@vger.kernel.org # 4.4
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      14c7aae7
    • Corey Minyard's avatar
      ipmi: Add the i2c-addr property for SSIF interfaces · ebad3ccb
      Corey Minyard authored
      commit d7323638 upstream.
      
      This is required for SSIF to work.
      
      There was no way to know if the interface being added was SI
      or SSIF from the platform data, but that was required so the
      i2c-addr is only added for SSIF interfaces.  So add a field
      for that.
      
      Also rework the logic a bit so that ipmi-type is not set
      for SSIF interfaces, as it is not necessary for that.
      
      Fixes: 3cd83bac ("ipmi: Consolidate the adding of platform devices")
      Reported-by: default avatarKamlakant Patel <kamlakantp@marvell.com>
      Signed-off-by: default avatarCorey Minyard <cminyard@mvista.com>
      Cc: stable@vger.kernel.org # 5.1
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ebad3ccb
    • Coly Li's avatar
      bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() · 9c52c4cf
      Coly Li authored
      commit 1bee2add upstream.
      
      In journal_reclaim() ja->cur_idx of each cache will be update to
      reclaim available journal buckets. Variable 'int n' is used to count how
      many cache is successfully reclaimed, then n is set to c->journal.key
      by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
      loop will write the jset data onto each cache.
      
      The problem is, if all jouranl buckets on each cache is full, the
      following code in journal_reclaim(),
      
      529 for_each_cache(ca, c, iter) {
      530       struct journal_device *ja = &ca->journal;
      531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
      532
      533       /* No space available on this device */
      534       if (next == ja->discard_idx)
      535               continue;
      536
      537       ja->cur_idx = next;
      538       k->ptr[n++] = MAKE_PTR(0,
      539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
      540                         ca->sb.nr_this_dev);
      541 }
      542
      543 bkey_init(k);
      544 SET_KEY_PTRS(k, n);
      
      If there is no available bucket to reclaim, the if() condition at line
      534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
      will set KEY_PTRS field of c->journal.key to 0.
      
      Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
      journal_write_unlocked() the journal data is written in following loop,
      
      649	for (i = 0; i < KEY_PTRS(k); i++) {
      650-671		submit journal data to cache device
      672	}
      
      If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
      won't be written to cache device here. If system crahed or rebooted
      before bkeys of the lost journal entries written into btree nodes, data
      corruption will be reported during bcache reload after rebooting the
      system.
      
      Indeed there is only one cache in a cache set, there is no need to set
      KEY_PTRS field in journal_reclaim() at all. But in order to keep the
      for_each_cache() logic consistent for now, this patch fixes the above
      problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
      available to reclaim.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c52c4cf
    • Liang Chen's avatar
      bcache: fix a race between cache register and cacheset unregister · 41f891ee
      Liang Chen authored
      commit a4b732a2 upstream.
      
      There is a race between cache device register and cache set unregister.
      For an already registered cache device, register_bcache will call
      bch_is_open to iterate through all cachesets and check every cache
      there. The race occurs if cache_set_free executes at the same time and
      clears the caches right before ca is dereferenced in bch_is_open_cache.
      To close the race, let's make sure the clean up work is protected by
      the bch_register_lock as well.
      
      This issue can be reproduced as follows,
      while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
      while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &
      
      and results in the following oops,
      
      [  +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
      [  +0.000457] #PF error: [normal kernel read fault]
      [  +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
      [  +0.000388] Oops: 0000 [#1] SMP PTI
      [  +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
      [  +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
      [  +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
      [  +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
      [  +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
      [  +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
      [  +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
      [  +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
      [  +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
      [  +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
      [  +0.000384] FS:  00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
      [  +0.000420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
      [  +0.000837] Call Trace:
      [  +0.000682]  ? _cond_resched+0x10/0x20
      [  +0.000691]  ? __kmalloc+0x131/0x1b0
      [  +0.000710]  kernfs_fop_write+0xfa/0x170
      [  +0.000733]  __vfs_write+0x2e/0x190
      [  +0.000688]  ? inode_security+0x10/0x30
      [  +0.000698]  ? selinux_file_permission+0xd2/0x120
      [  +0.000752]  ? security_file_permission+0x2b/0x100
      [  +0.000753]  vfs_write+0xa8/0x1a0
      [  +0.000676]  ksys_write+0x4d/0xb0
      [  +0.000699]  do_syscall_64+0x3a/0xf0
      [  +0.000692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: default avatarLiang Chen <liangchen.linux@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      41f891ee
    • Filipe Manana's avatar
      Btrfs: fix race between send and deduplication that lead to failures and crashes · 5a3d1d4d
      Filipe Manana authored
      commit 62d54f3a upstream.
      
      Send operates on read only trees and expects them to never change while it
      is using them. This is part of its initial design, and this expection is
      due to two different reasons:
      
      1) When it was introduced, no operations were allowed to modifiy read-only
         subvolumes/snapshots (including defrag for example).
      
      2) It keeps send from having an impact on other filesystem operations.
         Namely send does not need to keep locks on the trees nor needs to hold on
         to transaction handles and delay transaction commits. This ends up being
         a consequence of the former reason.
      
      However the deduplication feature was introduced later (on September 2013,
      while send was introduced in July 2012) and it allowed for deduplication
      with destination files that belong to read-only trees (subvolumes and
      snapshots).
      
      That means that having a send operation (either full or incremental) running
      in parallel with a deduplication that has the destination inode in one of
      the trees used by the send operation, can result in tree nodes and leaves
      getting freed and reused while send is using them. This problem is similar
      to the problem solved for the root nodes getting freed and reused when a
      snapshot is made against one tree that is currenly being used by a send
      operation, fixed in commits [1] and [2]. These commits explain in detail
      how the problem happens and the explanation is valid for any node or leaf
      that is not the root of a tree as well. This problem was also discussed
      and explained recently in a thread [3].
      
      The problem is very easy to reproduce when using send with large trees
      (snapshots) and just a few concurrent deduplication operations that target
      files in the trees used by send. A stress test case is being sent for
      fstests that triggers the issue easily. The most common error to hit is
      the send ioctl return -EIO with the following messages in dmesg/syslog:
      
       [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
       [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
      
      The first one is very easy to hit while the second one happens much less
      frequently, except for very large trees (in that test case, snapshots
      with 100000 files having large xattrs to get deep and wide trees).
      Less frequently, at least one BUG_ON can be hit:
      
       [1631742.130080] ------------[ cut here ]------------
       [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
       [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
       [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
       [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
       [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
       [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
       [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
       [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
       [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
       [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
       [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [1631742.141272] Call Trace:
       [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
       [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
       [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
       [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
       [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
       [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
       [1631742.146662]  ? reweight_entity+0x154/0x1a0
       [1631742.147135]  ? update_curr+0x20/0x2a0
       [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
       [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
       [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
       [1631742.149361]  ? __fget+0x113/0x200
       [1631742.149767]  ksys_ioctl+0x70/0x80
       [1631742.150159]  __x64_sys_ioctl+0x16/0x20
       [1631742.150543]  do_syscall_64+0x60/0x1b0
       [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [1631742.151326] RIP: 0033:0x7f4cd9f5add7
       (...)
       [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
       [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
       [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
       [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
       [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
      
      That BUG_ON happens because while send is using a node, that node is COWed
      by a concurrent deduplication, gets freed and gets reused as a leaf (because
      a transaction commit happened in between), so when it attempts to read a
      slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
      contents were wiped out and it now matches a leaf (which can even belong to
      some other tree now), hitting the BUG_ON(level == 0).
      
      Fix this concurrency issue by not allowing send and deduplication to run
      in parallel if both operate on the same readonly trees, returning EAGAIN
      to user space and logging an exlicit warning in dmesg/syslog.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
      [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a3d1d4d
    • Filipe Manana's avatar
      Btrfs: do not start a transaction at iterate_extent_inodes() · e0791fa4
      Filipe Manana authored
      commit bfc61c36 upstream.
      
      When finding out which inodes have references on a particular extent, done
      by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
      v1 and v2) ioctl and from scrub we use the transaction join API to grab a
      reference on the currently running transaction, since in order to give
      accurate results we need to inspect the delayed references of the currently
      running transaction.
      
      However, if there is currently no running transaction, the join operation
      will create a new transaction. This is inefficient as the transaction will
      eventually be committed, doing unnecessary IO and introducing a potential
      point of failure that will lead to a transaction abort due to -ENOSPC, as
      recently reported [1].
      
      That's because the join, creates the transaction but does not reserve any
      space, so when attempting to update the root item of the root passed to
      btrfs_join_transaction(), during the transaction commit, we can end up
      failling with -ENOSPC. Users of a join operation are supposed to actually
      do some filesystem changes and reserve space by some means, which is not
      the case of iterate_extent_inodes(), it is a read-only operation for all
      contextes from which it is called.
      
      The reported [1] -ENOSPC failure stack trace is the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      So fix that by using the attach API, which does not create a transaction
      when there is currently no running transaction.
      
      [1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e0791fa4
    • Filipe Manana's avatar
      Btrfs: do not start a transaction during fiemap · c0f9698f
      Filipe Manana authored
      commit 03628cdb upstream.
      
      During fiemap, for regular extents (non inline) we need to check if they
      are shared and if they are, set the shared bit. Checking if an extent is
      shared requires checking the delayed references of the currently running
      transaction, since some reference might have not yet hit the extent tree
      and be only in the in-memory delayed references.
      
      However we were using a transaction join for this, which creates a new
      transaction when there is no transaction currently running. That means
      that two more potential failures can happen: creating the transaction and
      committing it. Further, if no write activity is currently happening in the
      system, and fiemap calls keep being done, we end up creating and
      committing transactions that do nothing.
      
      In some extreme cases this can result in the commit of the transaction
      created by fiemap to fail with ENOSPC when updating the root item of a
      subvolume tree because a join does not reserve any space, leading to a
      trace like the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
      a transaction join to avoid the overhead of creating a new transaction (if
      there is currently no running transaction) and introducing a potential
      point of failure when the new transaction gets committed, instead use a
      transaction attach to grab a handle for the currently running transaction
      if any.
      Reported-by: default avatarChristoph Anton Mitterer <calestyo@scientia.net>
      Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
      Fixes: afce772e ("btrfs: fix check_shared for fiemap ioctl")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c0f9698f
    • Filipe Manana's avatar
      Btrfs: send, flush dellaloc in order to avoid data loss · 0bfc335f
      Filipe Manana authored
      commit 9f89d5de upstream.
      
      When we set a subvolume to read-only mode we do not flush dellaloc for any
      of its inodes (except if the filesystem is mounted with -o flushoncommit),
      since it does not affect correctness for any subsequent operations - except
      for a future send operation. The send operation will not be able to see the
      delalloc data since the respective file extent items, inode item updates,
      backreferences, etc, have not hit yet the subvolume and extent trees.
      
      Effectively this means data loss, since the send stream will not contain
      any data from existing delalloc. Another problem from this is that if the
      writeback starts and finishes while the send operation is in progress, we
      have the subvolume tree being being modified concurrently which can result
      in send failing unexpectedly with EIO or hitting runtime errors, assertion
      failures or hitting BUG_ONs, etc.
      
      Simple reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ btrfs subvolume create /mnt/sv
        $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
      
        $ btrfs property set /mnt/sv ro true
        $ btrfs send -f /tmp/send.stream /mnt/sv
      
        $ od -t x1 -A d /mnt/sv/foo
        0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
        *
        0110592
      
        $ umount /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ btrfs receive -f /tmp/send.stream /mnt
        $ echo $?
        0
        $ od -t x1 -A d /mnt/sv/foo
        0000000
        # ---> empty file
      
      Since this a problem that affects send only, fix it in send by flushing
      dellaloc for all the roots used by the send operation before send starts
      to process the commit roots.
      
      This is a problem that affects send since it was introduced (commit
      31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
      but backporting it to older kernels has some dependencies:
      
      - For kernels between 3.19 and 4.20, it depends on commit 3cd24c69
        ("btrfs: use tagged writepage to mitigate livelock of snapshot") because
        the function btrfs_start_delalloc_snapshot() does not exist before that
        commit. So one has to either pick that commit or replace the calls to
        btrfs_start_delalloc_snapshot() in this patch with calls to
        btrfs_start_delalloc_inodes().
      
      - For kernels older than 3.19 it also requires commit e5fa8f86
        ("Btrfs: ensure send always works on roots without orphans") because
        it depends on the function ensure_commit_roots_uptodate() which that
        commits introduced.
      
      - No dependencies for 5.0+ kernels.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 3.19+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0bfc335f
    • Nikolay Borisov's avatar
      btrfs: Honour FITRIM range constraints during free space trim · eb432217
      Nikolay Borisov authored
      commit c2d1b3aa upstream.
      
      Up until now trimming the freespace was done irrespective of what the
      arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
      will be entirely ignored. Fix it by correctly handling those paramter.
      This requires breaking if the found freespace extent is after the end of
      the passed range as well as completing trim after trimming
      fstrim_range::len bytes.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb432217
    • Nikolay Borisov's avatar
      btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails · 6806833b
      Nikolay Borisov authored
      commit 537f38f0 upstream.
      
      If a an eb fails to be read for whatever reason - it's corrupted on disk
      and parent transid/key validations fail or IO for eb pages fail then
      this buffer must be removed from the buffer cache. Currently the code
      calls free_extent_buffer if an error occurs. Unfortunately this doesn't
      achieve the desired behavior since btrfs_find_create_tree_block returns
      with eb->refs == 2.
      
      On the other hand free_extent_buffer will only decrement the refs once
      leaving it added to the buffer cache radix tree.  This enables later
      code to look up the buffer from the cache and utilize it potentially
      leading to a crash.
      
      The correct way to free the buffer is call free_extent_buffer_stale.
      This function will correctly call atomic_dec explicitly for the buffer
      and subsequently call release_extent_buffer which will decrement the
      final reference thus correctly remove the invalid buffer from buffer
      cache. This change affects only newly allocated buffers since they have
      eb->refs == 2.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755Reported-by: default avatarJungyeon <jungyeon@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6806833b