1. 07 May, 2022 2 commits
  2. 06 May, 2022 16 commits
  3. 05 May, 2022 22 commits
    • Dave Airlie's avatar
      Merge tag 'amd-drm-fixes-5.18-2022-05-04' of... · ebbc04bd
      Dave Airlie authored
      Merge tag 'amd-drm-fixes-5.18-2022-05-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
      
      amd-drm-fixes-5.18-2022-05-04:
      
      amdgpu:
      - Fix a xen dom0 regression on APUs
      - Fix a potential array overflow if a receiver were to
        send an erroneous audio channel count
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      From: Alex Deucher <alexander.deucher@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20220504190439.5723-1-alexander.deucher@amd.com
      ebbc04bd
    • Linus Torvalds's avatar
      Merge tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache · fe27d189
      Linus Torvalds authored
      Pull folio fixes from Matthew Wilcox:
       "Two folio fixes for 5.18.
      
        Darrick and Brian have done amazing work debugging the race I created
        in the folio BIO iterator. The readahead problem was deterministic, so
        easy to fix.
      
         - Fix a race when we were calling folio_next() in the BIO folio iter
           without holding a reference, meaning the folio could be split or
           freed, and we'd jump to the next page instead of the intended next
           folio.
      
         - Fix readahead creating single-page folios instead of the intended
           large folios when doing reads that are not a power of two in size"
      
      * tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache:
        mm/readahead: Fix readahead with large folios
        block: Do not call folio_next() on an unreferenced folio
      fe27d189
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · f47c960e
      Linus Torvalds authored
      Pull devicetree fixes from Rob Herring:
      
       - Drop unused 'max-link-speed' in Apple PCIe
      
       - More redundant 'maxItems/minItems' schema fixes
      
       - Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
      
       - Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
      
       - Add missing 'power-domains' property to Cadence UFSHC
      
      * tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        dt-bindings: pci: apple,pcie: Drop max-link-speed from example
        dt-bindings: Drop redundant 'maxItems/minItems' in if/then schemas
        dt-bindings: pinctrl: Allow values for drive-push-pull and drive-open-drain
        dt-bindings: leds-mt6360: Drop redundant 'unevaluatedProperties'
        dt-bindings: ufs: cdns,ufshc: Add power-domains
      f47c960e
    • David Sterba's avatar
      btrfs: sysfs: export the balance paused state of exclusive operation · 3e1ad196
      David Sterba authored
      The new state allowing device addition with paused balance is not
      exported to user space so it can't recognize it and actually start the
      operation.
      
      Fixes: efc0e69c ("btrfs: introduce exclusive operation BALANCE_PAUSED state")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e1ad196
    • Filipe Manana's avatar
      btrfs: fix assertion failure when logging directory key range item · 750ee454
      Filipe Manana authored
      When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
      a directory, we don't expect the insertion to fail with -EEXIST, because
      we are holding the directory's log_mutex and we have dropped all existing
      BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
      the directory. However it's possible that during the logging we attempt
      to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
      happen we need to race with insertions of items from other inodes in the
      subvolume's tree while we are logging a directory. Here's how this can
      happen:
      
      1) We are logging a directory with inode number 1000 that has its items
         spread across 3 leaves in the subvolume's tree:
      
         leaf A - has index keys from the range 2 to 20 for example. The last
         item in the leaf corresponds to a dir item for index number 20. All
         these dir items were created in a past transaction.
      
         leaf B - has index keys from the range 22 to 100 for example. It has
         no keys from other inodes, all its keys are dir index keys for our
         directory inode number 1000. Its first key is for the dir item with
         a sequence number of 22. All these dir items were also created in a
         past transaction.
      
         leaf C - has index keys for our directory for the range 101 to 120 for
         example. This leaf also has items from other inodes, and its first
         item corresponds to the dir item for index number 101 for our directory
         with inode number 1000;
      
      2) When we finish processing the items from leaf A at log_dir_items(),
         we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
         offset of 21, meaning the log is authoritative for the index range
         from 21 to 21 (a single sequence number). At this point leaf B was
         not yet modified in the current transaction;
      
      3) When we return from log_dir_items() we have released our read lock on
         leaf B, and have set *last_offset_ret to 21 (index number of the first
         item on leaf B minus 1);
      
      4) Some other task inserts an item for other inode (inode number 1001 for
         example) into leaf C. That resulted in pushing some items from leaf C
         into leaf B, in order to make room for the new item, so now leaf B
         has dir index keys for the sequence number range from 22 to 102 and
         leaf C has the dir items for the sequence number range 103 to 120;
      
      5) At log_directory_changes() we call log_dir_items() again, passing it
         a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
         plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
         of leaf B, since leaf B was modified in the current transaction.
      
         We have also initialized 'last_old_dentry_offset' to 20 after calling
         btrfs_previous_item() at log_dir_items(), as it left us at the last
         item of leaf A, which refers to the dir item with sequence number 20;
      
      6) We then call process_dir_items_leaf() to process the dir items of
         leaf B, and when we process the first item, corresponding to slot 0,
         sequence number 22, we notice the dir item was created in a past
         transaction and its sequence number is greater than the value of
         *last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
         BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
         of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
         error from insert_dir_log_key(), as we have already inserted that
         key at step 2, triggering the assertion at process_dir_items_leaf().
      
      The trace produced in dmesg is like the following:
      
      assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
      [198255.980839][ T7460] ------------[ cut here ]------------
      [198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
      [198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      [198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
      [198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
      [198255.989465][ T7460] Code: 8b 4c 89 (...)
      [198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
      [198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
      [198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
      [198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
      [198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
      [198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
      [198256.007082][ T7460] FS:  00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
      [198256.009939][ T7460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
      [198256.015239][ T7460] Call Trace:
      [198256.015674][ T7460]  <TASK>
      [198256.016313][ T7460]  log_dir_items.cold+0x16/0x2c
      [198256.018858][ T7460]  ? replay_one_extent+0xbf0/0xbf0
      [198256.025932][ T7460]  ? release_extent_buffer+0x1d2/0x270
      [198256.029658][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.031114][ T7460]  ? lock_acquired+0xbe/0x660
      [198256.032633][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.034386][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.036152][ T7460]  log_directory_changes+0xf9/0x170
      [198256.036993][ T7460]  ? log_dir_items+0xba0/0xba0
      [198256.037661][ T7460]  ? do_raw_write_unlock+0x7d/0xe0
      [198256.038680][ T7460]  btrfs_log_inode+0x233b/0x26d0
      [198256.041294][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.042864][ T7460]  ? btrfs_attach_transaction_barrier+0x60/0x60
      [198256.045130][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.046568][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.047504][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.048712][ T7460]  ? ilookup5_nowait+0x81/0xa0
      [198256.049747][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.050652][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.051618][ T7460]  ? __might_resched+0x128/0x1c0
      [198256.052511][ T7460]  ? __might_sleep+0x66/0xc0
      [198256.053442][ T7460]  ? __kasan_check_read+0x11/0x20
      [198256.054251][ T7460]  ? iget5_locked+0xbd/0x150
      [198256.054986][ T7460]  ? run_delayed_iput_locked+0x110/0x110
      [198256.055929][ T7460]  ? btrfs_iget+0xc7/0x150
      [198256.056630][ T7460]  ? btrfs_orphan_cleanup+0x4a0/0x4a0
      [198256.057502][ T7460]  ? free_extent_buffer+0x13/0x20
      [198256.058322][ T7460]  btrfs_log_inode+0x2654/0x26d0
      [198256.059137][ T7460]  ? log_directory_changes+0x170/0x170
      [198256.060020][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.060930][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.061905][ T7460]  ? lock_contended+0x770/0x770
      [198256.062682][ T7460]  ? btrfs_log_inode_parent+0xd04/0x1750
      [198256.063582][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.064432][ T7460]  ? preempt_count_sub+0x18/0xc0
      [198256.065550][ T7460]  ? __mutex_lock+0x580/0xdc0
      [198256.066654][ T7460]  ? stack_trace_save+0x94/0xc0
      [198256.068008][ T7460]  ? __kasan_check_write+0x14/0x20
      [198256.072149][ T7460]  ? __mutex_unlock_slowpath+0x12a/0x430
      [198256.073145][ T7460]  ? mutex_lock_io_nested+0xcd0/0xcd0
      [198256.074341][ T7460]  ? wait_for_completion_io_timeout+0x20/0x20
      [198256.075345][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.076142][ T7460]  ? lock_contended+0x770/0x770
      [198256.076939][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.078401][ T7460]  ? btrfs_sync_file+0x5e6/0xa40
      [198256.080598][ T7460]  btrfs_log_inode_parent+0x523/0x1750
      [198256.081991][ T7460]  ? wait_current_trans+0xc8/0x240
      [198256.083320][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.085450][ T7460]  ? btrfs_end_log_trans+0x70/0x70
      [198256.086362][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
      [198256.087544][ T7460]  ? lock_release+0xcf/0x8a0
      [198256.088305][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.090375][ T7460]  ? dget_parent+0x8e/0x300
      [198256.093538][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
      [198256.094918][ T7460]  ? lock_downgrade+0x420/0x420
      [198256.097815][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
      [198256.101822][ T7460]  ? dget_parent+0xb7/0x300
      [198256.103345][ T7460]  btrfs_log_dentry_safe+0x48/0x60
      [198256.105052][ T7460]  btrfs_sync_file+0x629/0xa40
      [198256.106829][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.109655][ T7460]  ? __fget_files+0x161/0x230
      [198256.110760][ T7460]  vfs_fsync_range+0x6d/0x110
      [198256.111923][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
      [198256.113556][ T7460]  __x64_sys_fsync+0x45/0x70
      [198256.114323][ T7460]  do_syscall_64+0x5c/0xc0
      [198256.115084][ T7460]  ? syscall_exit_to_user_mode+0x3b/0x50
      [198256.116030][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.116768][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.117555][ T7460]  ? do_syscall_64+0x69/0xc0
      [198256.118324][ T7460]  ? sysvec_call_function_single+0x57/0xc0
      [198256.119308][ T7460]  ? asm_sysvec_call_function_single+0xa/0x20
      [198256.120363][ T7460]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
      [198256.122067][ T7460] Code: 0f 05 48 (...)
      [198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
      [198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
      [198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
      [198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
      [198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
      [198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
      [198256.133403][ T7460]  </TASK>
      
      Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
      it update the item with an end offset corresponding to the maximum between
      the previously logged end offset and the new requested end offset. The end
      offsets may be different due to dir index key deletions that happened as
      part of unlink operations while we are logging a directory (triggered when
      fsyncing some other inode parented by the directory) or during renames
      which always attempt to log a single dir index deletion.
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
      Fixes: 732d591a ("btrfs: stop copying old dir items when logging a directory")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      750ee454
    • Naohiro Aota's avatar
      btrfs: zoned: activate block group properly on unlimited active zone device · ceb4f608
      Naohiro Aota authored
      btrfs_zone_activate() checks if it activated all the underlying zones in
      the loop. However, that check never hit on an unlimited activate zone
      device (max_active_zones == 0).
      
      Fortunately, it still works without ENOSPC because btrfs_zone_activate()
      returns true in the end, even if block_group->zone_is_active == 0. But, it
      is confusing to have non zone_is_active block group still usable for
      allocation. Also, we are wasting CPU time to iterate the loop every time
      btrfs_zone_activate() is called for the blog groups.
      
      Since error case in the loop is handled by out_unlock, we can just set
      zone_is_active and do the list stuff after the loop.
      
      Fixes: f9a912a3 ("btrfs: zoned: make zone activation multi stripe capable")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ceb4f608
    • Naohiro Aota's avatar
      btrfs: zoned: move non-changing condition check out of the loop · 54957712
      Naohiro Aota authored
      btrfs_zone_activate() checks if block_group->alloc_offset ==
      block_group->zone_capacity every time it iterates the loop. But, it is
      not depending on the index. Move out the check and do it only once.
      
      Fixes: f9a912a3 ("btrfs: zoned: make zone activation multi stripe capable")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54957712
    • Qu Wenruo's avatar
      btrfs: force v2 space cache usage for subpage mount · 9f73f1ae
      Qu Wenruo authored
      [BUG]
      For a 4K sector sized btrfs with v1 cache enabled and only mounted on
      systems with 4K page size, if it's mounted on subpage (64K page size)
      systems, it can cause the following warning on v1 space cache:
      
       BTRFS error (device dm-1): csum mismatch on free space cache
       BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now
      
      Although not a big deal, as kernel can rebuild it without problem, such
      warning will bother end users, especially if they want to switch the
      same btrfs seamlessly between different page sized systems.
      
      [CAUSE]
      V1 free space cache is still using fixed PAGE_SIZE for various bitmap,
      like BITS_PER_BITMAP.
      
      Such hard-coded PAGE_SIZE usage will cause various mismatch, from v1
      cache size to checksum.
      
      Thus kernel will always reject v1 cache with a different PAGE_SIZE with
      csum mismatch.
      
      [FIX]
      Although we should fix v1 cache, it's already going to be marked
      deprecated soon.
      
      And we have v2 cache based on metadata (which is already fully subpage
      compatible), and it has almost everything superior than v1 cache.
      
      So just force subpage mount to use v2 cache on mount.
      Reported-by: default avatarMatt Corallo <blnxfsl@bluematt.me>
      CC: stable@vger.kernel.org # 5.15+
      Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f73f1ae
    • Linus Torvalds's avatar
      Merge tag 's390-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 0f5d752b
      Linus Torvalds authored
      Pull s390 fixes from Heiko Carstens:
      
       - Disable -Warray-bounds warning for gcc12, since the only known way to
         workaround false positive warnings on lowcore accesses would result
         in worse code on fast paths.
      
       - Avoid lockdep_assert_held() warning in kvm vm memop code.
      
       - Reduce overhead within gmap_rmap code to get rid of long latencies
         when e.g. shutting down 2nd level guests.
      
      * tag 's390-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        KVM: s390: vsie/gmap: reduce gmap_rmap overhead
        KVM: s390: Fix lockdep issue in vm memop
        s390: disable -Warray-bounds
      0f5d752b
    • Linus Torvalds's avatar
      Merge tag 'mips-fixes_5.18_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 905a6537
      Linus Torvalds authored
      Pull MIPS fix from Thomas Bogendoerfer:
       "Extend R4000/R4400 CPU erratum workaround to all revisions"
      
      * tag 'mips-fixes_5.18_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: Fix CP0 counter erratum detection for R4k CPUs
      905a6537
    • Linus Torvalds's avatar
      Merge tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 68533eb1
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from can, rxrpc and wireguard.
      
        Previous releases - regressions:
      
         - igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
      
         - mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
      
         - rds: acquire netns refcount on TCP sockets
      
         - rxrpc: enable IPv6 checksums on transport socket
      
         - nic: hinic: fix bug of wq out of bound access
      
         - nic: thunder: don't use pci_irq_vector() in atomic context
      
         - nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS
           flag
      
         - nic: mlx5e:
            - lag, fix use-after-free in fib event handler
            - fix deadlock in sync reset flow
      
        Previous releases - always broken:
      
         - tcp: fix insufficient TCP source port randomness
      
         - can: grcan: grcan_close(): fix deadlock
      
         - nfc: reorder destructive operations in to avoid bugs
      
        Misc:
      
         - wireguard: improve selftests reliability"
      
      * tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
        NFC: netlink: fix sleep in atomic bug when firmware download timeout
        selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
        tcp: drop the hash_32() part from the index calculation
        tcp: increase source port perturb table to 2^16
        tcp: dynamically allocate the perturb table used by source ports
        tcp: add small random increments to the source port
        tcp: resalt the secret every 10 seconds
        tcp: use different parts of the port_offset for index and offset
        secure_seq: use the 64 bits of the siphash for port offset calculation
        wireguard: selftests: set panic_on_warn=1 from cmdline
        wireguard: selftests: bump package deps
        wireguard: selftests: restore support for ccache
        wireguard: selftests: use newer toolchains to fill out architectures
        wireguard: selftests: limit parallelism to $(nproc) tests at once
        wireguard: selftests: make routing loop test non-fatal
        net/mlx5: Fix matching on inner TTC
        net/mlx5: Avoid double clear or set of sync reset requested
        net/mlx5: Fix deadlock in sync reset flow
        net/mlx5e: Fix trust state reset in reload
        net/mlx5e: Avoid checking offload capability in post_parse action
        ...
      68533eb1
    • Duoming Zhou's avatar
      NFC: netlink: fix sleep in atomic bug when firmware download timeout · 4071bf12
      Duoming Zhou authored
      There are sleep in atomic bug that could cause kernel panic during
      firmware download process. The root cause is that nlmsg_new with
      GFP_KERNEL parameter is called in fw_dnld_timeout which is a timer
      handler. The call trace is shown below:
      
      BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
      Call Trace:
      kmem_cache_alloc_node
      __alloc_skb
      nfc_genl_fw_download_done
      call_timer_fn
      __run_timers.part.0
      run_timer_softirq
      __do_softirq
      ...
      
      The nlmsg_new with GFP_KERNEL parameter may sleep during memory
      allocation process, and the timer handler is run as the result of
      a "software interrupt" that should not call any other function
      that could sleep.
      
      This patch changes allocation mode of netlink message from GFP_KERNEL
      to GFP_ATOMIC in order to prevent sleep in atomic bug. The GFP_ATOMIC
      flag makes memory allocation operation could be used in atomic context.
      
      Fixes: 9674da87 ("NFC: Add firmware upload netlink command")
      Fixes: 9ea7187c ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
      Signed-off-by: default avatarDuoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: default avatarKrzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cnSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      4071bf12
    • Matthew Wilcox (Oracle)'s avatar
      mm/readahead: Fix readahead with large folios · b9ff43dd
      Matthew Wilcox (Oracle) authored
      Reading 100KB chunks from a big file (eg dd bs=100K) leads to poor
      readahead behaviour.  Studying the traces in detail, I noticed two
      problems.
      
      The first is that we were setting the readahead flag on the folio which
      contains the last byte read from the block.  This is wrong because we
      will trigger readahead at the end of the read without waiting to see
      if a subsequent read is going to use the pages we just read.  Instead,
      we need to set the readahead flag on the first folio _after_ the one
      which contains the last byte that we're reading.
      
      The second is that we were looking for the index of the folio with the
      readahead flag set to exactly match the start + size - async_size.
      If we've rounded this, either down (as previously) or up (as now),
      we'll think we hit a folio marked as readahead by a different read,
      and try to read the wrong pages.  So round the expected index to the
      order of the folio we hit.
      Reported-by: default avatarGuo Xuenan <guoxuenan@huawei.com>
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      b9ff43dd
    • Matthew Wilcox (Oracle)'s avatar
      block: Do not call folio_next() on an unreferenced folio · 170f37d6
      Matthew Wilcox (Oracle) authored
      It is unsafe to call folio_next() on a folio unless you hold a reference
      on it that prevents it from being split or freed.  After returning
      from the iterator, iomap calls folio_end_writeback() which may drop
      the last reference to the page, or allow the page to be split.  If that
      happens, the iterator will not advance far enough through the bio_vec,
      leading to assertion failures like the BUG() in folio_end_writeback()
      that checks we're not trying to end writeback on a page not currently
      under writeback.  Other assertion failures were also seen, but they're
      all explained by this one bug.
      
      Fix the bug by remembering where the next folio starts before returning
      from the iterator.  There are other ways of fixing this bug, but this
      seems the simplest.
      Reported-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Tested-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reported-by: default avatarBrian Foster <bfoster@redhat.com>
      Tested-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      170f37d6
    • Vladimir Oltean's avatar
      selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer · 5a7c5f70
      Vladimir Oltean authored
      As discussed here with Ido Schimmel:
      https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/
      
      the default conform-exceed action is "reclassify", for a reason we don't
      really understand.
      
      The point is that hardware can't offload that police action, so not
      specifying "conform-exceed" was always wrong, even though the command
      used to work in hardware (but not in software) until the kernel started
      adding validation for it.
      
      Fix the command used by the selftest by making the policer drop on
      exceed, and pass the packet to the next action (goto) on conform.
      
      Fixes: 8cd6b020 ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5a7c5f70
    • Jakub Kicinski's avatar
      Merge branch 'insufficient-tcp-source-port-randomness' · ef562489
      Jakub Kicinski authored
      Willy Tarreau says:
      
      ====================
      insufficient TCP source port randomness
      
      In a not-yet published paper, Moshe Kol, Amit Klein, and Yossi Gilad
      report being able to accurately identify a client by forcing it to emit
      only 40 times more connections than the number of entries in the
      table_perturb[] table, which is indexed by hashing the connection tuple.
      The current 2^8 setting allows them to perform that attack with only 10k
      connections, which is not hard to achieve in a few seconds.
      
      Eric, Amit and I have been working on this for a few weeks now imagining,
      testing and eliminating a number of approaches that Amit and his team were
      still able to break or that were found to be too risky or too expensive,
      and ended up with the simple improvements in this series that resists to
      the attack, doesn't degrade the performance, and preserves a reliable port
      selection algorithm to avoid connection failures, including the odd/even
      port selection preference that allows bind() to always find a port quickly
      even under strong connect() stress.
      
      The approach relies on several factors:
        - resalting the hash secret that's used to choose the table_perturb[]
          entry every 10 seconds to eliminate slow attacks and force the
          attacker to forget everything that was learned after this delay.
          This already eliminates most of the problem because if a client
          stays silent for more than 10 seconds there's no link between the
          previous and the next patterns, and 10s isn't yet frequent enough
          to cause too frequent repetition of a same port that may induce a
          connection failure ;
      
        - adding small random increments to the source port. Previously, a
          random 0 or 1 was added every 16 ports. Now a random 0 to 7 is
          added after each port. This means that with the default 32768-60999
          range, a worst case rollover happens after 1764 connections, and
          an average of 3137. This doesn't stop statistical attacks but
          requires significantly more iterations of the same attack to
          confirm a guess.
      
        - increasing the table_perturb[] size from 2^8 to 2^16, which Amit
          says will require 2.6 million connections to be attacked with the
          changes above, making it pointless to get a fingerprint that will
          only last 10 seconds. Due to the size, the table was made dynamic.
      
        - a few minor improvements on the bits used from the hash, to eliminate
          some unfortunate correlations that may possibly have been exploited
          to design future attack models.
      
      These changes were tested under the most extreme conditions, up to
      1.1 million connections per second to one and a few targets, showing no
      performance regression, and only 2 connection failures within 13 billion,
      which is less than 2^-32 and perfectly within usual values.
      
      The series is split into small reviewable changes and was already reviewed
      by Amit and Eric.
      ====================
      
      Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.euSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ef562489
    • Willy Tarreau's avatar
      tcp: drop the hash_32() part from the index calculation · e8161345
      Willy Tarreau authored
      In commit 190cc824 ("tcp: change source port randomizarion at
      connect() time"), the table_perturb[] array was introduced and an
      index was taken from the port_offset via hash_32(). But it turns
      out that hash_32() performs a multiplication while the input here
      comes from the output of SipHash in secure_seq, that is well
      distributed enough to avoid the need for yet another hash.
      Suggested-by: default avatarAmit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e8161345
    • Willy Tarreau's avatar
      tcp: increase source port perturb table to 2^16 · 4c2c8f03
      Willy Tarreau authored
      Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
      identify a client by forcing it to emit only 40 times more connections
      than there are entries in the table_perturb[] table. The previous two
      improvements consisting in resalting the secret every 10s and adding
      randomness to each port selection only slightly improved the situation,
      and the current value of 2^8 was too small as it's not very difficult
      to make a client emit 10k connections in less than 10 seconds.
      
      Thus we're increasing the perturb table from 2^8 to 2^16 so that the
      same precision now requires 2.6M connections, which is more difficult in
      this time frame and harder to hide as a background activity. The impact
      is that the table now uses 256 kB instead of 1 kB, which could mostly
      affect devices making frequent outgoing connections. However such
      components usually target a small set of destinations (load balancers,
      database clients, perf assessment tools), and in practice only a few
      entries will be visited, like before.
      
      A live test at 1 million connections per second showed no performance
      difference from the previous value.
      Reported-by: default avatarMoshe Kol <moshe.kol@mail.huji.ac.il>
      Reported-by: default avatarYossi Gilad <yossi.gilad@mail.huji.ac.il>
      Reported-by: default avatarAmit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4c2c8f03
    • Willy Tarreau's avatar
      tcp: dynamically allocate the perturb table used by source ports · e9261476
      Willy Tarreau authored
      We'll need to further increase the size of this table and it's likely
      that at some point its size will not be suitable anymore for a static
      table. Let's allocate it on boot from inet_hashinfo2_init(), which is
      called from tcp_init().
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e9261476
    • Willy Tarreau's avatar
      tcp: add small random increments to the source port · ca7af040
      Willy Tarreau authored
      Here we're randomly adding between 0 and 7 random increments to the
      selected source port in order to add some noise in the source port
      selection that will make the next port less predictable.
      
      With the default port range of 32768-60999 this means a worst case
      reuse scenario of 14116/8=1764 connections between two consecutive
      uses of the same port, with an average of 14116/4.5=3137. This code
      was stressed at more than 800000 connections per second to a fixed
      target with all connections closed by the client using RSTs (worst
      condition) and only 2 connections failed among 13 billion, despite
      the hash being reseeded every 10 seconds, indicating a perfectly
      safe situation.
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ca7af040
    • Eric Dumazet's avatar
      tcp: resalt the secret every 10 seconds · 4dfa9b43
      Eric Dumazet authored
      In order to limit the ability for an observer to recognize the source
      ports sequence used to contact a set of destinations, we should
      periodically shuffle the secret. 10 seconds looks effective enough
      without causing particular issues.
      
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Tested-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4dfa9b43
    • Willy Tarreau's avatar
      tcp: use different parts of the port_offset for index and offset · 9e9b70ae
      Willy Tarreau authored
      Amit Klein suggests that we use different parts of port_offset for the
      table's index and the port offset so that there is no direct relation
      between them.
      
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      9e9b70ae