1. 03 Jan, 2022 4 commits
    • btrfs: only copy dir index keys when logging a directory · 339d0354
      Filipe Manana authored
      Currently, when logging a directory, we copy both dir items and dir index
      items from the fs/subvolume tree to the log tree. Both items have exactly
      the same data (same struct btrfs_dir_item), the difference lies in the key
      values, where a dir index key contains the index number of a directory
      entry while the dir item key does not, as it's used for doing fast lookups
      of an entry by name, while the former is used for sorting entries when
      listing a directory.
      
      We can exploit that and log only the dir index items, since they contain
      all the information needed to correctly add, replace and delete directory
      entries when replaying a log tree. Logging only the dir index items is
      also backward and forward compatible: an unpatched kernel (without this
      change) can correctly replay a log tree generated by a patched kernel
      (with this patch), and a patched kernel can correctly replay a log tree
      generated by an unpatched kernel.
      
      The backward compatibility is ensured because:
      
      1) For inserting a new dentry: a dentry is only inserted when we find a
         new dir index key - we can only insert if we know the dir index offset,
         which is encoded in the dir index key's offset;
      
      2) For deleting dentries: during log replay, before adding or replacing
         dentries, we first replay dentry deletions. Whenever we find a dir item
         key or a dir index key in the subvolume/fs tree that is not logged in
         a range for which the log tree is authoritative, we do the unlink of
         the dentry, which removes both the existing dir item key and the dir
         index key. Therefore logging just dir index keys is enough to ensure
         dentry deletions are correctly replayed;
      
      3) For dentry replacements: they work when we log only dir index keys
         and this is mostly due to a combination of 1) and 2). If we replace a
         dentry with name "foobar" to point from inode A to inode B, then we
         know the dir index key for the new dentry is different from the old
         one, as it has an index number (key offset) larger than the old one.
         This results in replaying a deletion, through replay_dir_deletes(),
         that causes the old dentry to be removed, both the dir item key and
         the dir index key, as mentioned at 2). Then when processing the new
         dir index key, we add the new dentry, adding both a new dir item key
         and a new index key pointing to inode B, as stated in 1).
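The three points above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: directories are plain dictionaries mapping a dir index number to a (name, inode) pair, and the function names `replay_dir_log` and `name_lookup` are hypothetical, not kernel functions.

```python
def replay_dir_log(subvol_index_items, logged_index_items, authoritative_range):
    """Toy replay: both mappings are {dir index number: (name, inode)}."""
    low, high = authoritative_range
    replayed = dict(subvol_index_items)

    # Point 2): deletions are replayed first. An existing entry whose index
    # lies in the authoritative range but is absent from the log is unlinked
    # (in the kernel this drops both the dir item key and the dir index key).
    for index in list(replayed):
        if low <= index <= high and index not in logged_index_items:
            del replayed[index]

    # Point 1): every logged dir index key (re)creates its dentry.
    for index, entry in logged_index_items.items():
        replayed[index] = entry

    return replayed


def name_lookup(index_items):
    """Derive the by-name view that dir item keys would provide."""
    return {name: inode for name, inode in index_items.values()}


# Point 3): "foobar" pointed to inode 257 at index 2 and was replaced by a
# dentry pointing to inode 258, which received the larger index 5.
before = {2: ("foobar", 257), 3: ("other", 300)}
log = {3: ("other", 300), 5: ("foobar", 258)}
after = replay_dir_log(before, log, authoritative_range=(2, 5))
print(name_lookup(after))
```

In this model the by-name view is rebuilt entirely from the index entries, which is the property that makes logging dir item keys redundant.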
      
      The forward compatibility, the ability for a patched kernel to replay a
      log created by an older, unpatched kernel, comes from the changes required
      for making sure we are able to replay a log that only contains dir index
      keys - we simply ignore every dir item key we find.
      
      So modify directory logging to log only dir index items, and modify the
      log replay process to ignore dir item keys, from log trees created by an
      unpatched kernel, and process only dir index keys. This reduces the
      amount of logged metadata by about half, and therefore the time spent
      logging or fsyncing large directories (less CPU time and less IO).
      
      The following test script was used to measure this change:
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         NUM_NEW_FILES=1000000
         NUM_FILE_DELETES=10000
      
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         mkdir $MNT/testdir
      
         for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
                 echo -n > $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
         # sync to force transaction commit and wipe out the log.
         sync
      
         del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
         for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
                 rm -f $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
         echo
      
         umount $MNT
      
      The tests were run on a physical machine, with a non-debug kernel (Debian's
      default kernel config), for different values of $NUM_NEW_FILES and
      $NUM_FILE_DELETES, and the results were the following:
      
      ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 8412 ms after adding 1000000 files
      dir fsync took 500 ms after deleting 10000 files
      
      ** After patch, NUM_NEW_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 4252 ms after adding 1000000 files   (-49.5%)
      dir fsync took 269 ms after deleting 10000 files    (-46.2%)
      
      ** Before patch, NUM_NEW_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 745 ms after adding 100000 files
      dir fsync took 59 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 404 ms after adding 100000 files   (-45.8%)
      dir fsync took 31 ms after deleting 1000 files    (-47.5%)
      
      ** Before patch, NUM_NEW_FILES = 10 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 67 ms after adding 10000 files
      dir fsync took 9 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 10 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 36 ms after adding 10000 files   (-46.3%)
      dir fsync took 5 ms after deleting 1000 files   (-44.4%)
      
      ** Before patch, NUM_NEW_FILES = 1 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 9 ms after adding 1000 files
      dir fsync took 4 ms after deleting 100 files
      
      ** After patch, NUM_NEW_FILES = 1 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 7 ms after adding 1000 files     (-22.2%)
      dir fsync took 3 ms after deleting 100 files    (-25.0%)
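The percentages in parentheses follow from the before/after timings listed above. A short sketch of the arithmetic, using only the numbers reported in this message (the helper name `percent_change` is made up for illustration):

```python
def percent_change(before_ms, after_ms):
    """Relative change in percent, rounded to one decimal place."""
    return round((after_ms - before_ms) / before_ms * 100, 1)


# (before, after) fsync durations in milliseconds, taken from the results.
measurements = {
    "add 1000000 files": (8412, 4252),
    "delete 10000 files": (500, 269),
    "add 100000 files": (745, 404),
    "delete 1000 files (of 100000)": (59, 31),
    "add 10000 files": (67, 36),
    "delete 1000 files (of 10000)": (9, 5),
    "add 1000 files": (9, 7),
    "delete 100 files": (4, 3),
}

for label, (before_ms, after_ms) in measurements.items():
    print(f"{label}: {percent_change(before_ms, after_ms)}%")
```

The gain shrinks for the smallest runs, where the absolute times are only a few milliseconds and a 1 ms difference already amounts to more than 10%.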
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove spurious unlock/lock of unused_bgs_lock · 17130a65
      Nikolay Borisov authored
      Since both the unused block groups list and the reclaim bgs list are
      protected by unused_bgs_lock, free them in the same critical section
      without doing an extra unlock/lock pair.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock between quota enable and other quota operations · 232796df
      Filipe Manana authored
      When enabling quotas, we attempt to commit a transaction while holding the
      mutex fs_info->qgroup_ioctl_lock. This can result in a deadlock with other
      quota operations such as:
      
      - qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE;
      
      - adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN.
      
      This is because these operations join a transaction and after that they
      attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex
      after joining or starting a transaction is a pattern followed everywhere
      in qgroups, so the quota enablement operation is the one at fault here,
      and should not commit a transaction while holding that mutex.
      
      Fix this by making the transaction commit while not holding the mutex.
      We are safe from two concurrent tasks trying to enable quotas because
      we are serialized by the rw semaphore fs_info->subvol_sem at
      btrfs_ioctl_quota_ctl(), which is the only call site for enabling
      quotas.
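The ordering problem can be sketched with a deliberately simplified, single-threaded model. The assumption here is that a commit cannot finish while any task still holds the mutex, because tasks that joined the transaction need that mutex to make progress. All function names below are hypothetical stand-ins, not the kernel API.

```python
import threading

qgroup_ioctl_lock = threading.Lock()
events = []


def commit_transaction():
    # Toy rule: committing while the mutex is held can never complete,
    # because transaction joiners are blocked waiting for that mutex.
    if qgroup_ioctl_lock.locked():
        raise RuntimeError("would deadlock: commit while holding the mutex")
    events.append("commit")


def quota_enable_buggy():
    with qgroup_ioctl_lock:
        events.append("setup under lock")
        commit_transaction()  # commit attempted with the mutex still held


def quota_enable_fixed():
    with qgroup_ioctl_lock:
        events.append("setup under lock")
    commit_transaction()  # commit only after the mutex was released


try:
    quota_enable_buggy()
except RuntimeError as exc:
    events.append(str(exc))

quota_enable_fixed()
print(events)
```

The fixed variant keeps the same work under the mutex and only moves the commit outside the critical section, which mirrors the approach described above.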
      
      When this deadlock happens, it produces a trace like the following:
      
        INFO: task syz-executor:25604 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201
        btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120
        btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline]
        btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f86920b2c4d
        RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d
        RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008
        RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0
        R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0
        INFO: task syz-executor:25628 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
        __mutex_lock_common kernel/locking/mutex.c:669 [inline]
        __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
        btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548
        btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline]
        btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/
      Fixes: 340f1aa2 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
      CC: stable@vger.kernel.org # 4.19
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range · f0bfa76a
      Filipe Manana authored
      When doing a direct IO write against a file range that either has
      preallocated extents in that range or has regular extents and the file
      has the NOCOW attribute set, the write fails with -ENOSPC when all of
      the following conditions are met:
      
      1) There are no data block groups with enough free space matching
         the size of the write;
      
      2) There's not enough unallocated space for allocating a new data block
         group;
      
      3) The extents in the target file range are not shared, neither through
         snapshots nor through reflinks.
      
      This is wrong because a NOCOW write can be done in such a case, and in fact
      it's possible to do it using a buffered IO write, since when failing to
      allocate data space, the buffered IO path checks if a NOCOW write is
      possible.
      
      The failure in the direct IO write path comes from the fact that early on,
      at btrfs_dio_iomap_begin(), we try to allocate data space for the write
      and if that fails we return the error and stop - we never check if we
      can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
      if we can do a NOCOW write into the range, or a subset of the range, and
      then release the previously reserved data space.
      
      Fix this by doing the data reservation only if needed, when we must COW,
      at btrfs_get_blocks_direct_write() instead of doing it at
      btrfs_dio_iomap_begin(). This also simplifies the logic a bit and removes
      the inefficiency of doing unnecessary data reservations.
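The change in ordering can be sketched as follows. This is a toy model, not the kernel code path: data space is a single number, the NOCOW check is a boolean passed in by the caller, and both function names are hypothetical.

```python
import errno


def dio_write_before_fix(can_nocow, free_data_space, write_len):
    # Old order: reserve data space first and fail early, so the NOCOW
    # check below is never reached when the reservation fails.
    if free_data_space < write_len:
        raise OSError(errno.ENOSPC, "No space left on device")
    return "nocow" if can_nocow else "cow"


def dio_write_after_fix(can_nocow, free_data_space, write_len):
    # New order: a NOCOW write overwrites existing extents in place and
    # needs no new data space, so it is decided before any reservation.
    if can_nocow:
        return "nocow"
    if free_data_space < write_len:
        raise OSError(errno.ENOSPC, "No space left on device")
    return "cow"


ten_mib = 10 * 1024 * 1024
print(dio_write_after_fix(can_nocow=True, free_data_space=0, write_len=ten_mib))
```

With no free data space, the old ordering fails even for a range that could be overwritten in place, while the new ordering only fails when a COW write is really required.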
      
      The following example test script reproduces the problem:
      
        $ cat dio-nocow-enospc.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        # Use a small fixed size (1G) filesystem so that it's quick to fill
        # it up.
        # Make sure the mixed block groups feature is not enabled because we
        # later want to not have more space available for allocating data
        # extents but still have enough metadata space free for the file writes.
        mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
        mount $DEV $MNT
      
        # Create our test file with the NOCOW attribute set.
        touch $MNT/foobar
        chattr +C $MNT/foobar
      
        # Now fill in all unallocated space with data for our test file.
        # This will allocate a data block group that will be full and leave
        # no (or a very small amount of) unallocated space in the device, so
        # that it will not be possible to allocate a new block group later.
        echo
        echo "Creating test file with initial data..."
        xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
      
        # Now try a direct IO write against file range [0, 10M[.
        # This should succeed since this is a NOCOW file and an extent for the
        # range was previously allocated.
        echo
        echo "Trying direct IO write over allocated space..."
        xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
      
        umount $MNT
      
      When running the test:
      
        $ ./dio-nocow-enospc.sh
        (...)
      
        Creating test file with initial data...
        wrote 943718400/943718400 bytes at offset 0
        900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
      
        Trying direct IO write over allocated space...
        pwrite: No space left on device
      
      A test case for fstests will follow, testing both this direct IO write
      scenario as well as the buffered IO write scenario to make it less likely
      to get future regressions on the buffered IO case.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 02 Jan, 2022 6 commits
    • Linux 5.16-rc8 · c9e6606c
      Linus Torvalds authored
    • Merge tag 'perf-tools-fixes-for-v5.16-2022-01-02' of... · 24a0b220
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.16-2022-01-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix TUI exit screen refresh race condition in 'perf top'.
      
       - Fix parsing of Intel PT VM time correlation arguments.
      
       - Honour CPU filtering command line request of a script's switch events
         in 'perf script'.
      
       - Fix printing of switch events in Intel PT python script.
      
       - Fix duplicate alias events list printing in 'perf list', noticed on
         heterogeneous arm64 systems.
      
       - Fix return value of ids__new(), users expect NULL for failure, not
         ERR_PTR(-ENOMEM).
      
      * tag 'perf-tools-fixes-for-v5.16-2022-01-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf top: Fix TUI exit screen refresh race condition
        perf pmu: Fix alias events list
        perf scripts python: intel-pt-events.py: Fix printing of switch events
        perf script: Fix CPU filtering of a script's switch events
        perf intel-pt: Fix parsing of VM time correlation arguments
        perf expr: Fix return value of ids__new()
    • Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 859431ac
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Better input validation for compat ioctls and a documentation bugfix
        for 5.16"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        Docs: Fixes link to I2C specification
        i2c: validate user data in compat ioctl
    • Merge tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1286cc48
      Linus Torvalds authored
      Pull x86 fix from Borislav Petkov:
      
       - Use the proper CONFIG symbol in a preprocessor check.
      
      * tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/build: Use the proper name CONFIG_FW_LOADER
    • perf top: Fix TUI exit screen refresh race condition · 64f18d2d
      yaowenbin authored
      When the following command is executed several times, a coredump file is
      generated.
      
      	$ timeout -k 9 5 perf top -e task-clock
      	*******
      	*******
      	*******
      	0.01%  [kernel]                  [k] __do_softirq
      	0.01%  libpthread-2.28.so        [.] __pthread_mutex_lock
      	0.01%  [kernel]                  [k] __ll_sc_atomic64_sub_return
      	double free or corruption (!prev) perf top --sort comm,dso
      	timeout: the monitored command dumped core
      
      When we terminate "perf top" by sending a signal, SLsmg_reset_smg() is
      called. SLsmg_reset_smg() resets the SLsmg screen management routines by
      freeing all memory allocated while it was active.
      
      However, SLsmg_reinit_smg() may be called by another thread.
      
      SLsmg_reinit_smg() will free the same memory accessed by
      SLsmg_reset_smg(), thus it results in a double free.
      
      SLsmg_reinit_smg() is already called with ui__lock held, so we fix the
      problem by adding a pthread_mutex_trylock() of ui__lock when calling
      SLsmg_reset_smg().
      Signed-off-by: Wenyu Liu <liuwenyu7@huawei.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: wuxu.wu@huawei.com
      Link: http://lore.kernel.org/lkml/a91e3943-7ddc-f5c0-a7f5-360f073c20e6@huawei.com
      Signed-off-by: Hewenliang <hewenliang4@huawei.com>
      Signed-off-by: yaowenbin <yaowenbin1@huawei.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf pmu: Fix alias events list · e0257a01
      John Garry authored
      Commit 0e0ae874 ("perf list: Display hybrid PMU events with cpu
      type") changes the event list for uncore PMUs or arm64 heterogeneous CPU
      systems, such that duplicate aliases are incorrectly listed per PMU
      (which they should not be), like:
      
        # perf list
        ...
        unc_cbo_cache_lookup.any_es
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in E or S-state]
        unc_cbo_cache_lookup.any_es
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in E or S-state]
        unc_cbo_cache_lookup.any_i
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in I-state]
        unc_cbo_cache_lookup.any_i
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in I-state]
        ...
      
      Notice how the events are listed twice.
      
      The named commit changed how we remove duplicate events, in that events
      for different PMUs are not treated as duplicates. I suppose this is to
      handle how "Each hybrid pmu event has been assigned with a pmu name".
      
      Fix PMU alias listing by restoring behaviour to remove duplicates for
      non-hybrid PMUs.
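A rough sketch of the intended de-duplication rule, under the assumption that each alias carries its event name, the PMU it belongs to, and a flag saying whether that PMU is a hybrid one. The data layout, the PMU names and the function `dedupe_aliases` are illustrative only and do not reflect the perf source.

```python
def dedupe_aliases(aliases):
    """Drop duplicate aliases; hybrid PMU events stay distinct per PMU."""
    seen = set()
    unique = []
    for alias in aliases:
        if alias["hybrid"]:
            # Hybrid events are keyed by name and PMU, so each PMU is listed.
            key = (alias["name"], alias["pmu"])
        else:
            # Everything else is keyed by name only, so repeats are dropped.
            key = (alias["name"],)
        if key not in seen:
            seen.add(key)
            unique.append(alias)
    return unique


aliases = [
    {"name": "unc_cbo_cache_lookup.any_es", "pmu": "uncore_cbox_0", "hybrid": False},
    {"name": "unc_cbo_cache_lookup.any_es", "pmu": "uncore_cbox_1", "hybrid": False},
    {"name": "example_hybrid_event", "pmu": "cpu_core", "hybrid": True},
    {"name": "example_hybrid_event", "pmu": "cpu_atom", "hybrid": True},
]
for alias in dedupe_aliases(aliases):
    print(alias["name"], alias["pmu"])
```

Here the uncore event is printed once, while the hybrid event is still printed once per PMU.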
      
      Fixes: 0e0ae874 ("perf list: Display hybrid PMU events with cpu type")
      Signed-off-by: John Garry <john.garry@huawei.com>
      Tested-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/1640103090-140490-1-git-send-email-john.garry@huawei.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  3. 01 Jan, 2022 1 commit
  4. 31 Dec, 2021 12 commits
    • mm: vmscan: reduce throttling due to a failure to make progress -fix · 80082938
      Mel Gorman authored
      Hugh Dickins reported the following
      
      	My tmpfs swapping load (tweaked to use huge pages more heavily
      	than in real life) is far from being a realistic load: but it was
      	notably slowed down by your throttling mods in 5.16-rc, and this
      	patch makes it well again - thanks.
      
      	But: it very quickly hit NULL pointer until I changed that last
      	line to
      
              if (first_pgdat)
                      consider_reclaim_throttle(first_pgdat, sc);
      
      The likely issue is that huge pages are a major component of the test
      workload.  When this is the case, first_pgdat may never get set if
      compaction is ready to continue due to this check
      
              if (IS_ENABLED(CONFIG_COMPACTION) &&
                  sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                  compaction_ready(zone, sc)) {
                      sc->compaction_ready = true;
                      continue;
              }
      
      If this was true for every zone in the zonelist, first_pgdat would never
      get set resulting in a NULL pointer exception.
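The failure mode and the guard can be modelled in a few lines. This is a simplified sketch: zones are dictionaries, the throttle call is replaced by appending to a list, and the function name `shrink_zones_model` is hypothetical.

```python
def shrink_zones_model(zones, costly_order_request):
    """Return the list of pgdats that would be considered for throttling."""
    first_pgdat = None
    throttled = []

    for zone in zones:
        # Mirrors the quoted check: a zone that is ready for compaction is
        # skipped before first_pgdat has any chance of being recorded.
        if costly_order_request and zone["compaction_ready"]:
            continue
        if first_pgdat is None:
            first_pgdat = zone["pgdat"]

    # The guard from the report: only throttle if some zone was recorded.
    if first_pgdat is not None:
        throttled.append(first_pgdat)
    return throttled


all_ready = [
    {"pgdat": "node0", "compaction_ready": True},
    {"pgdat": "node1", "compaction_ready": True},
]
print(shrink_zones_model(all_ready, costly_order_request=True))
```

When every zone is skipped, the model returns an empty list instead of dereferencing an unset value, which corresponds to the NULL check suggested in the report.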
      
      Link: https://lkml.kernel.org/r/20211209095453.GM3366@techsingularity.net
      Fixes: 1b4e3f26 ("mm: vmscan: Reduce throttling due to a failure to make progress")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: Reduce throttling due to a failure to make progress · 1b4e3f26
      Mel Gorman authored
      Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
      problems due to reclaim throttling for excessive lengths of time.  In
      Alexey's case, a memory hog that should go OOM quickly stalls for
      several minutes before going OOM.  In Mike and Darrick's cases, a small
      memcg environment stalled excessively even though the system had enough
      memory overall.
      
      Commit 69392a40 ("mm/vmscan: throttle reclaim when no progress is
      being made") introduced the problem although commit a19594ca
      ("mm/vmscan: increase the timeout if page reclaim is not making
      progress") made it worse.  Systems at or near an OOM state that cannot
      be recovered must reach OOM quickly and memcg should kill tasks if a
      memcg is near OOM.
      
      To address this, only stall for the first zone in the zonelist, reduce
      the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
      the scan control nr_reclaimed is 0, kswapd is still active and there
      were excessive pages pending for writeback.  If kswapd has stopped
      reclaiming due to excessive failures, do not stall at all so that OOM
      triggers relatively quickly.  Similarly, if an LRU is simply congested,
      only lightly throttle similar to NOPROGRESS.
      
      Alexey's original case was the most straightforward:
      
      	for i in {1..3}; do tail /dev/zero; done
      
      On vanilla 5.16-rc1, this test stalled heavily; after the patch the test
      completes in a few seconds similar to 5.15.
      
      Alexey's second test case added watching a youtube video while tail runs
      10 times.  On 5.15, playback only jitters slightly, 5.16-rc1 stalls a
      lot with lots of frames missing and numerous audio glitches.  With this
      patch applied, the video plays similarly to 5.15.
      
      [lkp@intel.com: Fix W=1 build warning]
      
      Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
      Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
      Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
      Link: https://lore.kernel.org/r/20211202150614.22440-1-mgorman@techsingularity.net
      Link: https://linux-regtracking.leemhuis.info/regzbot/regression/20211124011954.7cab9bb4@mail.inbox.lv/
      Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
      Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
      Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Tracked-by: Thorsten Leemhuis <regressions@leemhuis.info>
      Fixes: 69392a40 ("mm/vmscan: throttle reclaim when no progress is being made")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'akpm' (patches from Andrew) · f87bcc88
      Linus Torvalds authored
      Merge misc mm fixes from Andrew Morton:
       "2 patches.
      
        Subsystems affected by this patch series: mm (userfaultfd and damon)"
      
      * akpm:
        mm/damon/dbgfs: fix 'struct pid' leaks in 'dbgfs_target_ids_write()'
        userfaultfd/selftests: fix hugetlb area allocations
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · e46227bf
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Three fixes, all in drivers. The lpfc one doesn't look exploitable,
        but nasty things could happen in string operations if mybuf ends up
        with an on stack unterminated string"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: vmw_pvscsi: Set residual data length conditionally
        scsi: libiscsi: Fix UAF in iscsi_conn_get_param()/iscsi_conn_teardown()
        scsi: lpfc: Terminate string in lpfc_debugfs_nvmeio_trc_write()
    • mm/damon/dbgfs: fix 'struct pid' leaks in 'dbgfs_target_ids_write()' · ebb3f994
      SeongJae Park authored
      DAMON debugfs interface increases the reference counts of 'struct pid's
      for targets from the 'target_ids' file write callback
      ('dbgfs_target_ids_write()'), but decreases the counts only in DAMON
      monitoring termination callback ('dbgfs_before_terminate()').
      
      Therefore, when 'target_ids' file is repeatedly written without DAMON
      monitoring start/termination, the reference count is not decreased and
      therefore memory for the 'struct pid' cannot be freed.  This commit
      fixes this issue by decreasing the reference counts when 'target_ids' is
      written.
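The leak and its fix can be illustrated with a minimal reference-counting model. It is a sketch only: `Pid` and `TargetIdsFile` are hypothetical stand-ins for 'struct pid' and the debugfs file, and the counter manipulation stands in for taking and dropping references.

```python
class Pid:
    """Stand-in for 'struct pid' with an explicit reference count."""

    def __init__(self, nr):
        self.nr = nr
        self.refcount = 0


class TargetIdsFile:
    """Stand-in for the 'target_ids' debugfs file."""

    def __init__(self):
        self.targets = []

    def write(self, pids):
        # The fix: drop the references held for the previous targets before
        # replacing them. Without this loop the old counts never reach zero.
        for pid in self.targets:
            pid.refcount -= 1
        self.targets = []
        for pid in pids:
            pid.refcount += 1
            self.targets.append(pid)


first, second = Pid(100), Pid(200)
target_ids = TargetIdsFile()
target_ids.write([first])
target_ids.write([second])  # repeated write without starting monitoring
print(first.refcount, second.refcount)
```

After the second write only the current target still holds a reference, so the memory for the earlier 'struct pid' could be freed.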
      
      Link: https://lkml.kernel.org/r/20211229124029.23348-1-sj@kernel.org
      Fixes: 4bc05954 ("mm/damon: implement a debugfs-based user space interface")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: fix hugetlb area allocations · f5c73297
      Mike Kravetz authored
      Currently, userfaultfd selftest for hugetlb as run from run_vmtests.sh
      or any environment where there are 'just enough' hugetlb pages will
      always fail with:
      
        testing events (fork, remap, remove):
      		ERROR: UFFDIO_COPY error: -12 (errno=12, line=616)
      
      The ENOMEM error code implies there are not enough hugetlb pages.
      However, there are free hugetlb pages but they are all reserved.  There
      is a basic problem with the way the test allocates hugetlb pages which
      has existed since the test was originally written.
      
      Due to the way 'cleanup' was done between different phases of the test,
      this issue was masked until recently.  The issue was uncovered by commit
      8ba6e864 ("userfaultfd/selftests: reinitialize test context in each
      test").
      
      For the hugetlb test, src and dst areas are allocated as PRIVATE
      mappings of a hugetlb file.  This means that at mmap time, pages are
      reserved for the src and dst areas.  At the start of event testing (and
      other tests) the src area is populated which results in allocation of
      huge pages to fill the area and consumption of reserves associated with
      the area.  Then, a child is forked to fault in the dst area.  Note that
      the dst area was allocated in the parent and hence the parent owns the
      reserves associated with the mapping.  The child has normal access to
      the dst area, but cannot use the reserves created/owned by the parent.
      Thus, if there are no other huge pages available, allocation of a page
      for the dst by the child will fail.
      
      Fix by not creating reserves for the dst area.  In this way the child
      can use free (non-reserved) pages.
      
      Also, MAP_PRIVATE of a file only makes sense if you are interested in
      the contents of the file before making a COW copy.  The test does not do
      this.  So, just use MAP_ANONYMOUS | MAP_HUGETLB to create an anonymous
      hugetlb mapping.  There is no need to create a hugetlb file in the
      non-shared case.
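
      A minimal userspace sketch of the allocation described above, not the
      selftest's actual code. The helper names (hugetlb_private_flags,
      alloc_hugetlb_area) are hypothetical, and it assumes MAP_NORESERVE is
      the flag used to skip reserves for the dst area:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: mmap flags for a private (non-shared) hugetlb area.
 * No hugetlb file is involved; the mapping is anonymous.
 * Assumption: the dst area adds MAP_NORESERVE so that no huge pages are
 * reserved at mmap time and a forked child can fault in free pages.
 */
static int hugetlb_private_flags(bool is_dst)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;

	if (is_dst)
		flags |= MAP_NORESERVE;
	return flags;
}

/*
 * Hypothetical helper: map len bytes of anonymous hugetlb memory.
 * Returns NULL on failure (for example, when no huge pages are configured).
 */
static void *alloc_hugetlb_area(size_t len, bool is_dst)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 hugetlb_private_flags(is_dst), -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

      With these helpers, the src area would be mapped with is_dst == false
      (reserves are created and later consumed by the parent) and the dst
      area with is_dst == true.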
      
      Link: https://lkml.kernel.org/r/20211217172919.7861-1-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5c73297
    • Deep Majumder's avatar
      Docs: Fixes link to I2C specification · c116fe1e
      Deep Majumder authored
      The link to the I2C specification is broken. Although
      "https://www.nxp.com" hosts Rev 7 (2021) of this specification, it is
      behind a login-wall. Thus, an additional link has been added (which
      doesn't require a login) and the NXP official docs link has been
      updated.
      Signed-off-by: Deep Majumder <deep@fastmail.in>
      [wsa: minor updates to text and commit message]
      Signed-off-by: Wolfram Sang <wsa@kernel.org>
      c116fe1e
    • Pavel Skripkin's avatar
      i2c: validate user data in compat ioctl · bb436283
      Pavel Skripkin authored
      Invalid user data may trigger a warning in i2c_transfer(), e.g. zero
      msgs. Userspace should not be able to trigger warnings, so this patch
      adds validation checks for user data in the compat ioctl to prevent the
      reported warnings.
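
      A minimal standalone sketch of this kind of check, not the patch
      itself. The struct and function names are hypothetical, and the upper
      bound of 42 is an assumption meant to mirror the kernel's
      I2C_RDWR_IOCTL_MAX_MSGS limit:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: mirrors the kernel's I2C_RDWR_IOCTL_MAX_MSGS limit. */
#define SKETCH_MAX_MSGS 42

/* Hypothetical stand-in for the user-supplied I2C_RDWR argument. */
struct sketch_rdwr_arg {
	const void *msgs;
	uint32_t nmsgs;
};

/*
 * Reject message counts that must never reach the transfer path:
 * zero messages (the case named in the commit message) and counts
 * above the fixed upper bound. Returns 0 if the count is acceptable.
 */
static int sketch_validate_rdwr(const struct sketch_rdwr_arg *arg)
{
	assert(arg != NULL);

	if (arg->nmsgs == 0)
		return -EINVAL;
	if (arg->nmsgs > SKETCH_MAX_MSGS)
		return -EINVAL;
	return 0;
}
```

      The point of the design is that the check runs before any transfer is
      attempted, so bad input produces an error code rather than a warning.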
      
      Reported-and-tested-by: syzbot+e417648b303855b91d8a@syzkaller.appspotmail.com
      Fixes: 7d5cb456 ("i2c compat ioctls: move to ->compat_ioctl()")
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: Wolfram Sang <wsa@kernel.org>
      bb436283
    • Leo L. Schwab's avatar
      Input: spaceball - fix parsing of movement data packets · bc7ec917
      Leo L. Schwab authored
      The spaceball.c module was not properly parsing the movement reports
      coming from the device.  The code read axis data as signed 16-bit
      little-endian values starting at offset 2.
      
      In fact, axis data in Spaceball movement reports are signed 16-bit
      big-endian values starting at offset 3.  This was determined first by
      visually inspecting the data packets, and later verified by consulting:
      http://spacemice.org/pdf/SpaceBall_2003-3003_Protocol.pdf
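
      The corrected decoding can be sketched in plain C as follows. This is
      an illustration, not the driver's code: the function names are
      hypothetical, and the count of six axes (three translations, three
      rotations) is an assumption not stated in the commit message:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_AXIS_OFFSET 3	/* axis data starts at offset 3 */
#define SKETCH_NUM_AXES    6	/* assumption: six axes per report */

/* Read one signed 16-bit big-endian value: high byte first. */
static int16_t sketch_read_be16s(const uint8_t *p)
{
	return (int16_t)(uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/*
 * Decode all axes from a movement report.
 * Returns 0 on success, -1 if the packet is too short.
 */
static int sketch_parse_axes(const uint8_t *pkt, size_t len,
			     int16_t out[SKETCH_NUM_AXES])
{
	size_t i;

	assert(pkt != NULL && out != NULL);

	if (len < SKETCH_AXIS_OFFSET + 2 * SKETCH_NUM_AXES)
		return -1;

	for (i = 0; i < SKETCH_NUM_AXES; i++)
		out[i] = sketch_read_be16s(pkt + SKETCH_AXIS_OFFSET + 2 * i);
	return 0;
}
```

      Reading the same bytes as little-endian from offset 2 would pair each
      low byte with the wrong neighbour, which is why the old parsing
      produced nonsense values.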
      
      If this ever worked properly, it was in the time before Git...
      Signed-off-by: Leo L. Schwab <ewhac@ewhac.org>
      Link: https://lore.kernel.org/r/20211221101630.1146385-1-ewhac@ewhac.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      bc7ec917
    • Pavel Skripkin's avatar
      Input: appletouch - initialize work before device registration · 9f3ccdc3
      Pavel Skripkin authored
      Syzbot has reported a warning in __flush_work(). This warning is caused
      by work->func == NULL, which means the work was never initialized.
      
      This can happen because input_dev->close() calls
      cancel_work_sync(&dev->work), but dev->work initialization happens
      _after_ the input_register_device() call.
      
      So this patch moves dev->work initialization before registering the
      input device.
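
      The ordering problem can be modelled in a few lines of userspace C.
      This is a toy model, not kernel code: every name below is
      hypothetical, and "registration" simply invokes the close path at once
      to stand in for the fact that close() may run any time after
      registration:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a work item: func is NULL until initialized. */
struct sketch_work {
	void (*func)(struct sketch_work *);
};

struct sketch_dev {
	struct sketch_work work;
	int registered;
};

static void sketch_work_fn(struct sketch_work *w)
{
	(void)w;
}

static void sketch_init_work(struct sketch_work *w,
			     void (*fn)(struct sketch_work *))
{
	w->func = fn;
}

/* Models the cancel path: returns -1 where the real code would warn. */
static int sketch_cancel_work(struct sketch_work *w)
{
	return w->func ? 0 : -1;
}

/* Models registration: the close path may run as soon as this is called. */
static int sketch_register(struct sketch_dev *d)
{
	d->registered = 1;
	return sketch_cancel_work(&d->work);
}

/* Fixed ordering: initialize the work item, then register. */
static int sketch_probe_fixed(struct sketch_dev *d)
{
	assert(d != NULL);
	sketch_init_work(&d->work, sketch_work_fn);
	return sketch_register(d);
}

/* Buggy ordering: register first, initialize afterwards. */
static int sketch_probe_buggy(struct sketch_dev *d)
{
	int rc;

	assert(d != NULL);
	rc = sketch_register(d);
	sketch_init_work(&d->work, sketch_work_fn);
	return rc;
}
```

      In this model sketch_probe_fixed() never sees an uninitialized work
      item, while sketch_probe_buggy() hits the NULL func case.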
      
      Fixes: 5a6eb676 ("Input: appletouch - improve powersaving for Geyser3 devices")
      Reported-and-tested-by: syzbot+b88c5eae27386b252bbd@syzkaller.appspotmail.com
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Link: https://lore.kernel.org/r/20211230141151.17300-1-paskripkin@gmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      9f3ccdc3
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2021-12-31' of git://anongit.freedesktop.org/drm/drm · 4f3d93c6
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "This is a bit bigger than I'd like, however it has two weeks of amdgpu
        fixes in it, since they missed last week, which was very small.
      
        The nouveau regression is probably the biggest fix in here, and it
        needs to go into 5.15 as well, two i915 fixes, and then a scattering
        of amdgpu fixes. The biggest fix in there is for a fencing NULL
        pointer dereference, the rest are pretty minor.
      
        For the misc team, I've pulled the two misc fixes manually since I'm
        not sure what is happening at this time of year!
      
        The amdgpu maintainers have the outstanding runpm regression to fix
        still, they are just working through the last bits of it now.
      
        Summary:
      
        nouveau:
         - fencing regression fix
      
        i915:
         - Fix possible uninitialized variable
         - Fix composite fence seqno increment on each fence creation
      
        amdgpu:
         - Fencing fix
         - XGMI fix
         - VCN regression fix
         - IP discovery regression fixes
         - Fix runpm documentation
         - Suspend/resume fixes
         - Yellow Carp display fixes
         - MCLK power management fix
         - dma-buf fix"
      
      * tag 'drm-fixes-2021-12-31' of git://anongit.freedesktop.org/drm/drm:
        drm/amd/display: Changed pipe split policy to allow for multi-display pipe split
        drm/amd/display: Fix USB4 null pointer dereference in update_psp_stream_config
        drm/amd/display: Set optimize_pwr_state for DCN31
        drm/amd/display: Send s0i2_rdy in stream_count == 0 optimization
        drm/amd/display: Added power down for DCN10
        drm/amd/display: fix B0 TMDS deepcolor no dislay issue
        drm/amdgpu: no DC support for headless chips
        drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform
        drm/amdgpu: always reset the asic in suspend (v2)
        drm/amd/pm: skip setting gfx cgpg in the s0ix suspend-resume
        drm/i915: Increment composite fence seqno
        drm/i915: Fix possible uninitialized variable in parallel extension
        drm/amdgpu: fix runpm documentation
        drm/nouveau: wait for the exclusive fence after the shared ones v2
        drm/amdgpu: add support for IP discovery gc_info table v2
        drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled
        drm/amd/pm: Fix xgmi link control on aldebaran
        drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence
        drm/amdgpu: fix dropped backing store handling in amdgpu_dma_buf_move_notify
      4f3d93c6
    • Dave Airlie's avatar
      Merge branch 'drm-misc-fixes' of ssh://git.freedesktop.org/git/drm/drm-misc into drm-fixes · ce9b333c
      Dave Airlie authored
      This merges two fixes that haven't been sent to me yet, but I wanted to get in.
      
      One amdgpu fix and one nouveau regression fix.
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      ce9b333c
  5. 30 Dec, 2021 15 commits
  6. 29 Dec, 2021 2 commits