1. 09 Aug, 2021 17 commits
    • xfs: allow setting and clearing of log incompat feature flags · 908ce71e
      Darrick J. Wong authored
      Log incompat feature flags in the superblock exist for one purpose: to
      protect the contents of a dirty log from replay on a kernel that isn't
      prepared to handle those dirty contents.  This means that they can be
      cleared if (a) we know the log is clean and (b) we know that there
      aren't any other threads in the system that might be setting or relying
      upon a log incompat flag.
      
      Therefore, clear the log incompat flags when we've finished recovering
      the log, when we're unmounting cleanly, remounting read-only, or
      freezing; and provide a function so that subsequent patches can start
      using this.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
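The clearing rule in the commit message above can be modelled in a few lines. This is a toy Python sketch, not the kernel implementation: the class and method names are hypothetical, and it only encodes conditions (a) and (b) from the message.

```python
from dataclasses import dataclass


@dataclass
class LogIncompatState:
    """Toy model of a superblock's log incompat feature bits (hypothetical)."""

    flags: int = 0            # currently set log incompat bits
    log_is_clean: bool = True
    active_users: int = 0     # threads setting or relying on a flag

    def use_feature(self, bit: int) -> None:
        """Set a flag before writing log items that depend on it."""
        self.flags |= bit
        self.active_users += 1
        self.log_is_clean = False

    def drop_user(self) -> None:
        """A thread no longer relies on any log incompat flag."""
        self.active_users -= 1

    def try_clear(self) -> bool:
        """Clear all flags only if conditions (a) and (b) both hold."""
        if self.log_is_clean and self.active_users == 0:
            self.flags = 0
            return True
        return False
```

In this model, a clean unmount, a read-only remount, a freeze, or the end of log recovery would each be a point where `log_is_clean` becomes true and `try_clear()` is attempted.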
    • xfs: replace kmem_alloc_large() with kvmalloc() · d634525d
      Dave Chinner authored
      There is no reason for this wrapper existing anymore. All the places
      that use KM_NOFS allocation are within transaction contexts and
      hence covered by memalloc_nofs_save/restore contexts. Hence we don't
      need any special handling of vmalloc for large IOs anymore and
      so special casing this code isn't necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove kmem_alloc_io() · 98fe2c3c
      Dave Chinner authored
      Since commit 59bb4798 ("mm, sl[aou]b: guarantee natural alignment
      for kmalloc(power-of-two)"), the core slab code guarantees slab
      alignment in all situations sufficient for IO purposes (i.e. minimum
      of 512 byte alignment of >= 512 byte sized heap allocations), so we
      no longer need the workaround in the XFS code to provide this
      guarantee.
      
      Replace the use of kmem_alloc_io() with kmem_alloc() or
      kmem_alloc_large() appropriately, and remove the kmem_alloc_io()
      interface altogether.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • mm: Add kvrealloc() · de2860f4
      Dave Chinner authored
      During log recovery of an XFS filesystem with 64kB directory
      buffers, rebuilding a buffer split across two log records results
      in a memory allocation warning from krealloc like this:
      
      xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
      XFS (dm-0): Unmounting Filesystem
      XFS (dm-0): Mounting V5 Filesystem
      XFS (dm-0): Starting recovery (logdev: internal)
      ------------[ cut here ]------------
      WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
      .....
      RIP: 0010:get_page_from_freelist+0xdee/0xe40
      Call Trace:
       ? complete+0x3f/0x50
       __alloc_pages+0x16f/0x300
       alloc_pages+0x87/0x110
       kmalloc_order+0x2c/0x90
       kmalloc_order_trace+0x1d/0x90
       __kmalloc_track_caller+0x215/0x270
       ? xlog_recover_add_to_cont_trans+0x63/0x1f0
       krealloc+0x54/0xb0
       xlog_recover_add_to_cont_trans+0x63/0x1f0
       xlog_recovery_process_trans+0xc1/0xd0
       xlog_recover_process_ophdr+0x86/0x130
       xlog_recover_process_data+0x9f/0x160
       xlog_recover_process+0xa2/0x120
       xlog_do_recovery_pass+0x40b/0x7d0
       ? __irq_work_queue_local+0x4f/0x60
       ? irq_work_queue+0x3a/0x50
       xlog_do_log_recovery+0x70/0x150
       xlog_do_recover+0x38/0x1d0
       xlog_recover+0xd8/0x170
       xfs_log_mount+0x181/0x300
       xfs_mountfs+0x4a1/0x9b0
       xfs_fs_fill_super+0x3c0/0x7b0
       get_tree_bdev+0x171/0x270
       ? suffix_kstrtoint.constprop.0+0xf0/0xf0
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x2f5/0xaf0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x3a/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Essentially, we are taking a multi-order allocation from kmem_alloc()
      (which has an open coded no fail, no warn loop) and then
      reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
      then triggering the above warning.
      
      This is a regression caused by converting this code from an open
      coded no fail/no warn reallocation loop to using __GFP_NOFAIL.
      
      What we actually need here is kvrealloc(), so that if contiguous
      page allocation fails we fall back to vmalloc() and we don't
      get nasty warnings happening in XFS.
      
      Fixes: 771915c4 ("xfs: remove kmem_realloc()")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
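The fallback idea behind kvrealloc() can be illustrated with a small userspace model. This is a Python sketch under stated assumptions, not the kernel code: the `*_model` function names and the `CONTIG_LIMIT` cutoff are hypothetical, and allocations are represented as tagged bytearrays.

```python
CONTIG_LIMIT = 32 * 1024  # hypothetical: largest request the contiguous allocator serves


def kmalloc_model(size):
    """Physically contiguous allocation; returns None when the request is too large."""
    if size > CONTIG_LIMIT:
        return None
    return ("kmalloc", bytearray(size))


def vmalloc_model(size):
    """Virtually contiguous allocation; always succeeds in this model."""
    return ("vmalloc", bytearray(size))


def kvmalloc_model(size):
    """Try the contiguous allocator first, then fall back."""
    return kmalloc_model(size) or vmalloc_model(size)


def kvrealloc_model(buf, old_size, new_size):
    """Grow buf to new_size, preserving the first old_size bytes."""
    if old_size >= new_size:
        return buf  # already large enough, nothing to do
    kind, data = kvmalloc_model(new_size)
    data[:old_size] = buf[1][:old_size]  # copy the old contents across
    return (kind, data)
```

Growing a 16kB buffer to 64kB in this model lands in the `vmalloc` path instead of failing or warning, which is the behaviour the commit message asks for during log recovery.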
    • xfs: dump log intent items that cannot be recovered due to corruption · 43059d54
      Darrick J. Wong authored
      If we try to recover a log intent item and the operation fails due to
      filesystem corruption, dump the contents of the item to the log for
      further analysis.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: grab active perag ref when reading AG headers · 48c6615c
      Darrick J. Wong authored
      This patch prepares scrub to deal with the possibility of tearing down
      entire AGs by changing the order of resource acquisition to match the
      rest of the XFS codebase.  In other words, scrub now grabs AG resources
      in order of: perag structure, then AGI/AGF/AGFL buffers, then btree
      cursors; and releases them in reverse order.
      
      This requires us to distinguish xchk_ag_init callers -- some are
      responding to a user request to check AG metadata, in which case we can
      return ENOENT to userspace; but other callers have an ondisk reference
      to an AG that they're trying to cross-reference.  In this second case,
      the lack of an AG means there's ondisk corruption, since ondisk metadata
      cannot point into nonexistent space.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
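The acquire-then-release-in-reverse ordering described in the commit message above can be sketched with a context-manager stack. This is an illustrative Python model only: `resource` and `scrub_ag_model` are hypothetical names, and the resources are plain strings, not real perag structures, buffers, or cursors.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def resource(name, events):
    """Record when a (hypothetical) resource is grabbed and released."""
    events.append(f"grab {name}")
    try:
        yield name
    finally:
        events.append(f"release {name}")


def scrub_ag_model():
    """Grab AG resources in the order the commit message describes."""
    events = []
    with ExitStack() as stack:
        for name in ("perag", "AGI/AGF/AGFL buffers", "btree cursors"):
            stack.enter_context(resource(name, events))
        # Cross-referencing work would run here, with everything held.
    # ExitStack unwinds in reverse order of acquisition.
    return events
```

The returned event list shows the btree cursors released first and the perag reference last, mirroring the order the message gives.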
    • xfs: drop experimental warnings for bigtime and inobtcount · f19ee6bb
      Darrick J. Wong authored
      These two features were merged a year ago, userspace tooling has been
      merged, and no serious errors have been reported by the developers.
      Drop the experimental tag to encourage wider testing.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: fix silly whitespace problems with kernel libxfs · b7df7630
      Darrick J. Wong authored
      Fix a few whitespace errors such as spaces at the end of the line, etc.
      This gets us back to something more closely resembling parity.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: throttle inode inactivation queuing on memory reclaim · 40b1de00
      Darrick J. Wong authored
      Now that we defer inode inactivation, we've decoupled the process of
      unlinking or closing an inode from the process of inactivating it.  In
      theory this should lead to better throughput since we now inactivate the
      queued inodes in batches instead of one at a time.
      
      Unfortunately, one of the primary risks with this decoupling is the loss
      of rate control feedback between the frontend and background threads.
      In other words, a rm -rf /* thread can run the system out of memory if
      it can queue inodes for inactivation and jump to a new CPU faster than
      the background threads can actually clear the deferred work.  The
      workers can get scheduled off the CPU if they have to do IO, etc.
      
      To solve this problem, we configure a shrinker so that it will activate
      the /second/ time the shrinkers are called.  The custom shrinker will
      queue all percpu deferred inactivation workers immediately and set a
      flag to force frontend callers who are releasing a vfs inode to wait for
      the inactivation workers.
      
      On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
      most of the OOMing problem when deleting 10 million inodes.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: avoid buffer deadlocks when walking fs inodes · a6343e4d
      Darrick J. Wong authored
      When we're servicing an INUMBERS or BULKSTAT request or running
      quotacheck, grab an empty transaction so that we can use its inherent
      recursive buffer locking abilities to detect inode btree cycles without
      hitting ABBA buffer deadlocks.  This patch requires the deferred inode
      inactivation patchset because xfs_irele cannot directly call
      xfs_inactive when the iwalk itself has an (empty) transaction.
      
      Found by fuzzing an inode btree pointer to introduce a cycle into the
      tree (xfs/365).
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: use background worker pool when transactions can't get free space · e8d04c2a
      Darrick J. Wong authored
      In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
      call xfs_blockgc_free_space with a NULL icwalk structure to try to free
      space.  Each frontend thread that encounters this situation starts its
      own walk of the inode cache to see if it can find anything, which is
      wasteful since we don't have any additional selection criteria.  For
      this one common case, create a function that reschedules all pending
      background work immediately and flushes the workqueue so that the scan
      can run in parallel.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't run speculative preallocation gc when fs is frozen · 6f649091
      Darrick J. Wong authored
      Now that we have the infrastructure to switch background workers on and
      off at will, fix the block gc worker code so that we don't actually run
      the worker when the filesystem is frozen, same as we do for deferred
      inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: flush inode inactivation work when compiling usage statistics · 01e8f379
      Darrick J. Wong authored
      Users have come to expect that the space accounting information in
      statfs and getquota reports is fairly accurate.  Now that we inactivate
      inodes from a background queue, these numbers can be thrown off by
      whatever resources are singly-owned by the inodes in the queue.  Flush
      the pending inactivations when userspace asks for a space usage report.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: inactivate inodes any time we try to free speculative preallocations · 2eb66502
      Darrick J. Wong authored
      Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
      to try to free speculative preallocations when space is tight.  This
      means that file writes, transaction reservation failures, quota limit
      enforcement, and the EOFBLOCKS ioctl all call this function to free
      space when things are tight.
      
      Since inode inactivation is now a background task, this means that the
      filesystem can be hanging on to unlinked but not yet freed space.  Add
      this to the list of things that xfs_blockgc_free_* makes writer threads
      scan for when they cannot reserve space.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: queue inactivation immediately when free realtime extents are tight · 65f03d86
      Darrick J. Wong authored
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Similar to the patch doing this for free space on the data device, if
      the file being inactivated is a realtime file and the realtime volume is
      running low on free extents, we want to run the worker ASAP so that the
      realtime allocator can make better decisions.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: queue inactivation immediately when quota is nearing enforcement · 108523b8
      Darrick J. Wong authored
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Specifically, if the dquots attached to the inode being inactivated are
      nearing any kind of enforcement boundary, we want to queue that
      inactivation work immediately so that users don't get EDQUOT/ENOSPC
      errors even after they deleted a bunch of files to stay within quota.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: queue inactivation immediately when free space is tight · 7d6f07d2
      Darrick J. Wong authored
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      On a mostly empty filesystem, the risk of the allocator making poor
      decisions due to fragmentation of the free space on account of a lengthy
      delay in background updates is minimal because there's plenty of space.
      However, if free space is tight, we want to deallocate unlinked inodes
      as quickly as possible to avoid fallocate ENOSPC and to give the
      allocator the best shot at optimal allocations for new writes.
      
      Therefore, queue the percpu worker immediately if the filesystem is more
      than 95% full.  This follows the same principle that XFS becomes less
      aggressive about speculative allocations and lazy cleanup (and more
      precise about accounting) when nearing full.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
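The "more than 95% full" rule from the commit message above reduces to a single comparison. The following Python sketch is an assumption about how such a check could be written; the function name and the block-count parameters are hypothetical, and the kernel's actual test is not reproduced here.

```python
FULL_THRESHOLD_PCT = 95  # threshold taken from the commit message


def want_immediate_inactivation(free_blocks: int, total_blocks: int) -> bool:
    """Return True when the filesystem is more than 95% full."""
    used = total_blocks - free_blocks
    # Integer cross-multiplication avoids floating point: used/total > 95/100.
    return used * 100 > total_blocks * FULL_THRESHOLD_PCT
```

With 100 blocks in total, 4 free blocks (96% used) would queue the worker immediately, while 5 free blocks (exactly 95% used) would not.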
  2. 06 Aug, 2021 9 commits
  3. 02 Aug, 2021 1 commit
  4. 01 Aug, 2021 3 commits
    • Merge tag 'perf-tools-fixes-for-v5.14-2021-08-01' of... · d4affd6b
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.14-2021-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Revert "perf map: Fix dso->nsinfo refcounting", this makes 'perf top'
         abort, uncovering a design flaw on how namespace information is kept.
         The fix for that is more than we can do right now, leave it for the
         next merge window.
      
       - Split --dump-raw-trace by AUX records for ARM's CoreSight, fixing up
         the decoding of some records.
      
       - Fix PMU alias matching.
      
      Thanks to James Clark and John Garry for these fixes.
      
      * tag 'perf-tools-fixes-for-v5.14-2021-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        Revert "perf map: Fix dso->nsinfo refcounting"
        perf pmu: Fix alias matching
        perf cs-etm: Split --dump-raw-trace by AUX records
    • Merge tag 'powerpc-5.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · c82357a7
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Don't use r30 in VDSO code, to avoid breaking existing Go lang
         programs.
      
       - Change an export symbol to allow non-GPL modules to use spinlocks
         again.
      
      Thanks to Paul Menzel, and Srikar Dronamraju.
      
      * tag 'powerpc-5.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/vdso: Don't use r30 to avoid breaking Go lang
        powerpc/pseries: Fix regression while building external modules
    • Merge tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · aa660326
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
       "This contains a bunch of bug fixes in XFS.
      
        Dave and I have been busy the last couple of weeks to find and fix as
        many log recovery bugs as we can find; here are the results so far. Go
        fstests -g recoveryloop! ;)
      
         - Fix a number of coordination bugs relating to cache flushes for
           metadata writeback, cache flushes for multi-buffer log writes, and
           FUA writes for single-buffer log writes
      
         - Fix a bug with incorrect replay of attr3 blocks
      
         - Fix unnecessary stalls when flushing logs to disk
      
         - Fix spoofing problems when recovering realtime bitmap blocks"
      
      * tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: prevent spoofing of rtbitmap blocks when recovering buffers
        xfs: limit iclog tail updates
        xfs: need to see iclog flags in tracing
        xfs: Enforce attr3 buffer recovery order
        xfs: logging the on disk inode LSN can make it go backwards
        xfs: avoid unnecessary waits in xfs_log_force_lsn()
        xfs: log forces imply data device cache flushes
        xfs: factor out forced iclog flushes
        xfs: fix ordering violation between cache flushes and tail updates
        xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog
        xfs: external logs need to flush data device
        xfs: flush data dev on external log write
  5. 31 Jul, 2021 1 commit
  6. 30 Jul, 2021 9 commits
    • Merge tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c7d10223
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
        (mac80211) and netfilter trees.
      
        Current release - regressions:
      
         - mac80211: fix starting aggregation sessions on mesh interfaces
      
        Current release - new code bugs:
      
         - sctp: send pmtu probe only if packet loss in Search Complete state
      
         - bnxt_en: add missing periodic PHC overflow check
      
         - devlink: fix phys_port_name of virtual port and merge error
      
         - hns3: change the method of obtaining default ptp cycle
      
         - can: mcba_usb_start(): add missing urb->transfer_dma initialization
      
        Previous releases - regressions:
      
         - set true network header for ECN decapsulation
      
         - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
           LRO
      
         - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
           PHY
      
         - sctp: fix return value check in __sctp_rcv_asconf_lookup
      
        Previous releases - always broken:
      
         - bpf:
             - more spectre corner case fixes, introduce a BPF nospec
               instruction for mitigating Spectre v4
             - fix OOB read when printing XDP link fdinfo
             - sockmap: fix cleanup related races
      
         - mac80211: fix enabling 4-address mode on a sta vif after assoc
      
         - can:
             - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
             - j1939: j1939_session_deactivate(): clarify lifetime of session
               object, avoid UAF
             - fix number of identical memory leaks in USB drivers
      
         - tipc:
             - do not blindly write skb_shinfo frags when doing decryption
             - fix sleeping in tipc accept routine"
      
      * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
        gve: Update MAINTAINERS list
        can: esd_usb2: fix memory leak
        can: ems_usb: fix memory leak
        can: usb_8dev: fix memory leak
        can: mcba_usb_start(): add missing urb->transfer_dma initialization
        can: hi311x: fix a signedness bug in hi3110_cmd()
        MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
        bpf: Fix leakage due to insufficient speculative store bypass mitigation
        bpf: Introduce BPF nospec instruction for mitigating Spectre v4
        sis900: Fix missing pci_disable_device() in probe and remove
        net: let flow have same hash in two directions
        nfc: nfcsim: fix use after free during module unload
        tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
        sctp: fix return value check in __sctp_rcv_asconf_lookup
        nfc: s3fwrn5: fix undefined parameter values in dev_err()
        net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
        net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
        net/mlx5: Unload device upon firmware fatal error
        net/mlx5e: Fix page allocation failure for ptp-RQ over SF
        net/mlx5e: Fix page allocation failure for trap-RQ over SF
        ...
    • Merge tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · e1dab4c0
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These revert a recent IRQ resources handling modification that turned
        out to be problematic, fix suspend-to-idle handling on AMD platforms
        to take upcoming systems into account properly and fix the retrieval
        of the DPTF attributes of the PCH FIVR.
      
        Specifics:
      
         - Revert recent change of the ACPI IRQ resources handling that
           attempted to improve the ACPI IRQ override selection logic, but
           introduced serious regressions on some systems (Hui Wang).
      
         - Fix up quirks for AMD platforms in the suspend-to-idle support code
           so as to take upcoming systems using uPEP HID AMDI007 into account
           as appropriate (Mario Limonciello).
      
         - Fix the code retrieving DPTF attributes of the PCH FIVR so that it
           agrees on the return data type with the ACPI control method
           evaluated for this purpose (Srinivas Pandruvada)"
      
      * tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: DPTF: Fix reading of attributes
        Revert "ACPI: resources: Add checks for ACPI IRQ override"
        ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007
    • pipe: make pipe writes always wake up readers · 3a34b13a
      Linus Torvalds authored
      Since commit 1b6b26ae ("pipe: fix and clarify pipe write wakeup
      logic") we have sanitized the pipe write logic, and would only try to
      wake up readers if they needed it.
      
      In particular, if the pipe already had data in it before the write,
      there was no point in trying to wake up a reader, since any existing
      readers must have been aware of the pre-existing data already.  Doing
      extraneous wakeups will only cause potential thundering herd problems.
      
      However, it turns out that some Android libraries have misused the EPOLL
      interface, and expected "edge triggered" to be "any new write will
      trigger it".  Even if there was no edge in sight.
      
      Quoting Sandeep Patil:
       "The commit 1b6b26ae ('pipe: fix and clarify pipe write wakeup
        logic') changed pipe write logic to wakeup readers only if the pipe
        was empty at the time of write. However, there are libraries that
        relied upon the older behavior for notification scheme similar to
        what's described in [1]
      
        One such library 'realm-core'[2] is used by numerous Android
        applications. The library uses a similar notification mechanism as GNU
        Make but it never drains the pipe until it is full. When Android moved
        to v5.10 kernel, all applications using this library stopped working.
      
        The library has since been fixed[3] but it will be a while before all
        applications incorporate the updated library"
      
      Our regression rule for the kernel is that if applications break from
      new behavior, it's a regression, even if it was because the application
      did something patently wrong.  Also note the original report [4] by
      Michael Kerrisk about a test for this epoll behavior - but at that point
      we didn't know of any actual broken use case.
      
      So add the extraneous wakeup, to approximate the old behavior.
      
      [ I say "approximate", because the exact old behavior was to do a wakeup
        not for each write(), but for each pipe buffer chunk that was filled
        in. The behavior introduced by this change is not that - this is just
        "every write will cause a wakeup, whether necessary or not", which
        seems to be sufficient for the broken library use. ]
      
      It's worth noting that this adds the extraneous wakeup only for the
      write side, while the read side still considers the "edge" to be purely
      about reading enough from the pipe to allow further writes.
      
      See commit f467a6a6 ("pipe: fix and clarify pipe read wakeup logic")
      for the pipe read case, which remains that "only wake up if the pipe was
      full, and we read something from it".
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
      Link: https://github.com/realm/realm-core [2]
      Link: https://github.com/realm/realm-core/issues/4666 [3]
      Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
      Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
      Reported-by: Sandeep Patil <sspatil@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
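The difference between the two write-side wakeup policies discussed in the commit message above can be shown with a small counting model. This is a Python sketch, not kernel code: the enum and function names are hypothetical, the reader never drains the pipe (as the message says of the affected library), and pipe capacity is ignored.

```python
from enum import Enum


class WritePolicy(Enum):
    ONLY_IF_EMPTY = "wake readers only if the pipe was empty before the write"
    EVERY_WRITE = "wake readers on every write, whether necessary or not"


def count_reader_wakeups(policy: WritePolicy, num_writes: int) -> int:
    """Count wakeups for writes into a pipe that the reader never drains."""
    pipe_has_data = False
    wakeups = 0
    for _ in range(num_writes):
        was_empty = not pipe_has_data
        pipe_has_data = True  # the write leaves data behind
        if policy is WritePolicy.EVERY_WRITE or was_empty:
            wakeups += 1
    return wakeups
```

Under `ONLY_IF_EMPTY`, only the first of several writes produces a wakeup, so a reader that treats each write as a fresh edge-triggered event stalls; under `EVERY_WRITE`, each write produces one.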
    • Revert "perf map: Fix dso->nsinfo refcounting" · 9bac1bd6
      Arnaldo Carvalho de Melo authored
      This makes 'perf top' abort in some cases, and the right fix will
      involve surgery that is too much to do at this stage, so revert for now
      and fix it in the next merge window.
      
      This reverts commit 2d6b74ba.
      
      Cc: Riccardo Mancini <rickyman7@gmail.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Krister Johansen <kjlx@templeofstupid.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • Merge branches 'acpi-resources' and 'acpi-dptf' · e83f54ea
      Rafael J. Wysocki authored
      * acpi-resources:
        Revert "ACPI: resources: Add checks for ACPI IRQ override"
      
      * acpi-dptf:
        ACPI: DPTF: Fix reading of attributes
    • Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block · 4669e13c
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - gendisk freeing fix (Christoph)
      
       - blk-iocost wake ordering fix (Tejun)
      
       - tag allocation error handling fix (John)
      
       - loop locking fix. While this isn't the prettiest fix in the world,
         nobody has any good alternatives for 5.14. Something to likely
         revisit for 5.15. (Tetsuo)
      
      * tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
        block: delay freeing the gendisk
        blk-iocost: fix operation ordering in iocg_wake_fn()
        blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
        loop: reintroduce global lock for safe loop_validate_file() traversal
    • Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block · 27eb687b
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - A fix for block backed reissue (me)
      
       - Reissue context hardening (me)
      
       - Async link locking fix (Pavel)
      
      * tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
        io_uring: fix poll requests leaking second poll entries
        io_uring: don't block level reissue off completion path
        io_uring: always reissue from task_work context
        io_uring: fix race in unified task_work running
        io_uring: fix io_prep_async_link locking
    • Merge tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block · f6c5971b
      Linus Torvalds authored
      Pull libata fixlets from Jens Axboe:
      
       - A fix for PIO highmem (Christoph)
      
       - Kill HAVE_IDE as it's now unused (Lukas)
      
      * tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
        arch: Kconfig: clean up obsolete use of HAVE_IDE
        libata: fix ata_pio_sector for CONFIG_HIGHMEM
    • Merge tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 051df241
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fix -Warray-bounds warning, to help external patchset to make it
         default treewide
      
       - fix writeable device accounting (syzbot report)
      
       - fix fsync and log replay after a rename and inode eviction
      
       - fix potentially lost error code when submitting multiple bios for
         compressed range
      
      * tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: calculate number of eb pages properly in csum_tree_block
        btrfs: fix rw device counting in __btrfs_free_extra_devids
        btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
        btrfs: mark compressed range uptodate only if all bio succeed