Commits · 91f098704c25106d88706fc9f8bcfce01fdb97df · Kirill Smelkov / linux

24 Apr, 2024 2 commits

workqueue: Fix divide error in wq_update_node_max_active() · 91f09870

Lai Jiangshan authored Apr 24, 2024

Yue Sun and xingwei lee reported a divide error bug in
wq_update_node_max_active():

divide error: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.9.0-rc5 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:wq_update_node_max_active+0x369/0x6b0 kernel/workqueue.c:1605
Code: 24 bf 00 00 00 80 44 89 fe e8 83 27 33 00 41 83 fc ff 75 0d 41
81 ff 00 00 00 80 0f 84 68 01 00 00 e8 fb 22 33 00 44 89 f8 99 <41> f7
fc 89 c5 89 c7 44 89 ee e8 a8 24 33 00 89 ef 8b 5c 24 04 89
RSP: 0018:ffffc9000018fbb0 EFLAGS: 00010293
RAX: 00000000000000ff RBX: 0000000000000001 RCX: ffff888100ada500
RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000080000000
RBP: 0000000000000001 R08: ffffffff815b1fcd R09: 1ffff1100364ad72
R10: dffffc0000000000 R11: ffffed100364ad73 R12: 0000000000000000
R13: 0000000000000100 R14: 0000000000000000 R15: 00000000000000ff
FS:  0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb8c06ca6f8 CR3: 000000010d6c6000 CR4: 0000000000750ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 workqueue_offline_cpu+0x56f/0x600 kernel/workqueue.c:6525
 cpuhp_invoke_callback+0x4e1/0x870 kernel/cpu.c:194
 cpuhp_thread_fun+0x411/0x7d0 kernel/cpu.c:1092
 smpboot_thread_fn+0x544/0xa10 kernel/smpboot.c:164
 kthread+0x2ed/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

After analysis, it happens when all of the CPUs in a workqueue's affinity
go offline.

The problem can be easily reproduced by:

 # echo 8 > /sys/devices/virtual/workqueue/<any-wq-name>/cpumask
 # echo 0 > /sys/devices/system/cpu/cpu3/online

Use the default max_actives for nodes when all of the CPUs in the
workqueue's affinity get offline to fix the problem.
Reported-by: Yue Sun <samsun1006219@gmail.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Link: https://lore.kernel.org/lkml/CAEkJfYPGS1_4JqvpSo0=FM0S1ytB8CEbyreLTtWpR900dUZymw@mail.gmail.com/
Fixes: 5797b1c1 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Cc: stable@vger.kernel.org
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
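
A minimal userspace sketch of the guard described above; it is not the kernel code. It assumes per-node limits are computed by splitting max_active in proportion to each node's online CPUs, which is where a zero total would become a zero divisor. The names update_node_max() and clamp_int(), and the choice of min_active as the fallback value, are assumptions made for illustration.

```c
#include <stddef.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical model: online_cpus[i] is the number of online CPUs of node i
 * that are in the workqueue's affinity; node_max[i] receives that node's
 * share of max_active.
 */
static void update_node_max(const int *online_cpus, size_t nr_nodes,
			    int min_active, int max_active, int *node_max)
{
	int total_cpus = 0;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		total_cpus += online_cpus[i];

	/*
	 * Every CPU in the affinity is offline: use a default value
	 * (assumed here to be min_active) instead of dividing by zero.
	 */
	if (total_cpus == 0) {
		for (i = 0; i < nr_nodes; i++)
			node_max[i] = min_active;
		return;
	}

	for (i = 0; i < nr_nodes; i++)
		node_max[i] = clamp_int(DIV_ROUND_UP(max_active * online_cpus[i],
						     total_cpus),
					min_active, max_active);
}
```

In the reproducer above, the cpumask value 8 selects only CPU 3, so taking CPU 3 offline leaves the total at zero, which is the case the early return covers.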

91f09870

workqueue: The default node_nr_active should have its max set to max_active · d40f9202

Tejun Heo authored Apr 22, 2024

The default nna (node_nr_active) is used when the pool isn't tied to a
specific NUMA node. This can happen in the following cases:

1. On NUMA, if per-node pwq init fails and the fallback pwq is used.
2. On NUMA, if a pool is configured to span multiple nodes.
3. On single node setups.

5797b1c1 ("workqueue: Implement system-wide nr_active enforcement for
unbound workqueues") set the default nna->max to min_active because only #1
was being considered. For #2 and #3, using min_active means that the max
concurrency in normal operation is pushed down to min_active which is
currently 8, which can obviously lead to performance issues.

For #1, the exact value nna->max is set to doesn't really matter. #2 can only happen if
the workqueue is intentionally configured to ignore NUMA boundaries and
there's no good way to distribute max_active in this case. #3 is the default
behavior on single node machines.

Let's set the default nna->max to max_active. This fixes the artificially
lowered concurrency problem on single node machines and shouldn't hurt
anything for other cases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 5797b1c1 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Link: https://lore.kernel.org/dm-devel/20240410084531.2134621-1-shinichiro.kawasaki@wdc.com/
Signed-off-by: Tejun Heo <tj@kernel.org>

d40f9202

23 Apr, 2024 1 commit

workqueue: Fix selection of wake_cpu in kick_pool() · 57a01eaf

Sven Schnelle authored Apr 23, 2024

With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following
kernel oops was observed:

smp: Bringing up secondary CPUs ...
smp: Brought up 1 node, 8 CPUs
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000803
[..]
 Call Trace:
arch_vcpu_is_preempted+0x12/0x80
select_idle_sibling+0x42/0x560
select_task_rq_fair+0x29a/0x3b0
try_to_wake_up+0x38e/0x6e0
kick_pool+0xa4/0x198
__queue_work.part.0+0x2bc/0x3a8
call_timer_fn+0x36/0x160
__run_timers+0x1e2/0x328
__run_timer_base+0x5a/0x88
run_timer_softirq+0x40/0x78
__do_softirq+0x118/0x388
irq_exit_rcu+0xc0/0xd8
do_ext_irq+0xae/0x168
ext_int_handler+0xbe/0xf0
psw_idle_exit+0x0/0xc
default_idle_call+0x3c/0x110
do_idle+0xd4/0x158
cpu_startup_entry+0x40/0x48
rest_init+0xc6/0xc8
start_kernel+0x3c4/0x5e0
startup_continue+0x3c/0x50

The crash is caused by calling arch_vcpu_is_preempted() for an offline
CPU. To avoid this, select the cpu with cpumask_any_and_distribute()
to mask __pod_cpumask with cpu_online_mask. In case no cpu is left in
the pool, skip the assignment.

tj: This doesn't fully fix the bug as CPUs can still go down between picking
the target CPU and the wake call. Fixing that likely requires adding
cpu_online() test to either the sched or s390 arch code. However, regardless
of how that is fixed, workqueue shouldn't be picking a CPU which isn't
online as that would result in unpredictable and worse behavior.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Fixes: 8639eceb ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
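
A minimal userspace sketch of the selection rule described above, assuming CPU sets fit in a 64-bit mask (bit n stands for CPU n). pick_wake_cpu() is a hypothetical stand-in; it only imitates the idea of intersecting the pod mask with the online mask and rotating through the candidates, and it is not the kernel's cpumask_any_and_distribute().

```c
#include <stdint.h>

/*
 * Return an online CPU from pod_mask, starting the search just after *prev
 * so repeated calls spread across the candidates. *prev should start at -1.
 * Returns -1 when no CPU of the pod is online.
 */
static int pick_wake_cpu(uint64_t pod_mask, uint64_t online_mask, int *prev)
{
	uint64_t candidates = pod_mask & online_mask;
	int i;

	if (!candidates)
		return -1;	/* caller skips the wake_cpu assignment */

	for (i = 1; i <= 64; i++) {
		int cpu = (*prev + i) % 64;

		if (candidates & (UINT64_C(1) << cpu)) {
			*prev = cpu;
			return cpu;
		}
	}
	return -1;	/* not reached: candidates is non-zero */
}
```

As the tj note says, a check like this narrows the window but cannot close it, because a CPU can still go offline between the pick and the wakeup.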

57a01eaf

03 Apr, 2024 2 commits

docs/zh_CN: core-api: Update translation of workqueue.rst to 6.9-rc1 · a1d34930

Xingyou Chen authored Apr 03, 2024

Significant changes have been made to workqueue, and work is in progress
to move users over from tasklets, but the current translation doesn't
describe WQ_BH, so an update seems helpful.

Synchronize translation from upstream commit 3bc1e711
("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Signed-off-by: Xingyou Chen <rockrush@rockwork.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

a1d34930

Documentation/core-api: Update events_freezable_power references. · 2c534f2f

Audra Mitchell authored Apr 03, 2024

Due to commit 8318d6a6 ("workqueue: Shorten
events_freezable_power_efficient name") we now have some stale
references in the workqueue documentation, so update those
references accordingly.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2c534f2f

02 Apr, 2024 2 commits

Merge tag 'docs-6.9-fixes' of git://git.lwn.net/linux · b1e6ec0a

Linus Torvalds authored Apr 02, 2024

Pull documentation fixes from Jonathan Corbet:
 "Four small documentation fixes"

* tag 'docs-6.9-fixes' of git://git.lwn.net/linux:
  docs: zswap: fix shell command format
  tracing: Fix documentation on tp_printk cmdline option
  docs: Fix bitfield handling in kernel-doc
  Documentation: dev-tools: Add link to RV docs

b1e6ec0a

Merge tag 'bcachefs-2024-04-01' of https://evilpiepirate.org/git/bcachefs · 67199a47

Linus Torvalds authored Apr 02, 2024

Pull bcachefs fixes from Kent Overstreet:
 "Lots of fixes for situations with extreme filesystem damage.

  One fix ("Fix journal pins in btree write buffer") applicable to
  normal usage; also a dio performance fix.

  New repair/construction code is in the final stages, should be ready
  in about a week. Anyone that lost btree interior nodes (or a variety
  of other damage) as a result of the splitbrain bug will be able to
  repair then"

* tag 'bcachefs-2024-04-01' of https://evilpiepirate.org/git/bcachefs: (32 commits)
  bcachefs: On emergency shutdown, print out current journal sequence number
  bcachefs: Fix overlapping extent repair
  bcachefs: Fix remove_dirent()
  bcachefs: Logged op errors should be ignored
  bcachefs: Improve -o norecovery; opts.recovery_pass_limit
  bcachefs: bch2_run_explicit_recovery_pass_persistent()
  bcachefs: Ensure bch_sb_field_ext always exists
  bcachefs: Flush journal immediately after replay if we did early repair
  bcachefs: Resume logged ops after fsck
  bcachefs: Add error messages to logged ops fns
  bcachefs: Split out recovery_passes.c
  bcachefs: fix backpointer for missing alloc key msg
  bcachefs: Fix bch2_btree_increase_depth()
  bcachefs: Kill bch2_bkey_ptr_data_type()
  bcachefs: Fix use after free in check_root_trans()
  bcachefs: Fix repair path for missing indirect extents
  bcachefs: Fix use after free in bch2_check_fix_ptrs()
  bcachefs: Fix btree node keys accounting in topology repair path
  bcachefs: Check btree ptr min_key in .invalid
  bcachefs: add REQ_SYNC and REQ_IDLE in write dio
  ...

67199a47

01 Apr, 2024 33 commits

Merge tag 'pwm/for-6.9-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux · 026e680b

Linus Torvalds authored Apr 01, 2024

Pull pwm fix from Uwe Kleine-König:
 "This fixes a regression introduced by an off-by-one in v6.9-rc1 making
  the pwm-pxa and the pwm driver in ti-sn65dsi86 unusable for most
  consumer drivers because the default period wasn't set"

* tag 'pwm/for-6.9-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
  pwm: Fix setting period with #pwm-cells = <1> and of_pwm_single_xlate()

026e680b

bcachefs: On emergency shutdown, print out current journal sequence number · b3c7fd35

Kent Overstreet authored Mar 30, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b3c7fd35

bcachefs: Fix overlapping extent repair · eab3a3ce

Kent Overstreet authored Mar 30, 2024

overlapping extent repair was colliding with extent past end of inode
checks - don't update "extent ends at" until we know we have an extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

eab3a3ce

bcachefs: Fix remove_dirent() · 8ce1db80

Kent Overstreet authored Apr 01, 2024

We were missing an iter_traverse().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8ce1db80

bcachefs: Logged op errors should be ignored · cecfed9b

Kent Overstreet authored Mar 31, 2024

If something is wrong with a logged op, we just want to delete it -
there's nothing to repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

cecfed9b

bcachefs: Improve -o norecovery; opts.recovery_pass_limit · 13c1e583

Kent Overstreet authored Mar 28, 2024

This adds opts.recovery_pass_limit, and redoes -o norecovery to make use
of it; this fixes some issues with -o norecovery so it can be safely
used for data recovery.

Norecovery means "don't do journal replay"; it's an important data
recovery tool when we're getting stuck in journal replay.

When using it this way we need to make sure we don't free journal keys
after startup, so we continue to overlay them: thus it needs to imply
retain_recovery_info, as well as nochanges.

recovery_pass_limit is an explicit option for telling recovery to exit
after a specific recovery pass; this is a much cleaner way of
implementing -o norecovery, as well as being a useful debug feature in
its own right.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
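
A minimal sketch of the "exit after a specific recovery pass" idea, assuming passes run in a fixed order and are identified by their index. run_recovery_passes(), pass_fn and count_pass() are hypothetical names, and treating a negative limit as "no limit" is an assumption of this sketch rather than bcachefs behavior.

```c
typedef int (*pass_fn)(void *ctx);

/* Example pass used for demonstration: counts how many passes ran. */
static int count_pass(void *ctx)
{
	++*(int *)ctx;
	return 0;
}

/*
 * Run passes[0..nr) in order and stop once the pass at index `limit` has
 * completed. Returns the number of passes run, or the first negative error
 * returned by a pass.
 */
static int run_recovery_passes(const pass_fn *passes, int nr, int limit,
			       void *ctx)
{
	int i;

	for (i = 0; i < nr; i++) {
		int ret = passes[i](ctx);

		if (ret < 0)
			return ret;
		if (i == limit)
			return i + 1;	/* limit reached: skip later passes */
	}
	return nr;
}
```

With a limit in place, an option like -o norecovery reduces to choosing the last pass that comes before journal replay.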

13c1e583

bcachefs: bch2_run_explicit_recovery_pass_persistent() · 060ff30a

Kent Overstreet authored Mar 29, 2024

Flag that we need to run a recovery pass and run it - persistently, so if
we crash it'll still get run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

060ff30a

bcachefs: Ensure bch_sb_field_ext always exists · 0a34c058

Kent Overstreet authored Mar 30, 2024

This makes bch_sb_field_ext more consistent with the rest of -o
nochanges - we don't want to be varying other codepaths based on -o
nochanges, since it's used for testing in dry run mode; also fixes some
potential null ptr derefs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

0a34c058

bcachefs: Flush journal immediately after replay if we did early repair · 4fe0eeea

Kent Overstreet authored Mar 28, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

4fe0eeea

bcachefs: Resume logged ops after fsck · af855a5f

Kent Overstreet authored Mar 23, 2024

Finishing logged ops requires the filesystem to be in a reasonably
consistent state - and other fsck passes don't require it to have
completed, so just run it last.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

af855a5f

bcachefs: Add error messages to logged ops fns · e5aa8046

Kent Overstreet authored Mar 23, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e5aa8046

bcachefs: Split out recovery_passes.c · d2554263

Kent Overstreet authored Mar 23, 2024

We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.

So it's worth splitting out into its own file; this code is pretty
different from the code in recovery.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

d2554263

bcachefs: fix backpointer for missing alloc key msg · 11d5568d

Kent Overstreet authored Mar 28, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

11d5568d

bcachefs: Fix bch2_btree_increase_depth() · 7f9e5080

Kent Overstreet authored Mar 14, 2024

When we haven't yet allocated any btree nodes for a given btree, we
first need to call the regular split path to allocate one.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

7f9e5080

bcachefs: Kill bch2_bkey_ptr_data_type() · 47d2080e

Kent Overstreet authored Mar 25, 2024

Remove some duplication, and inconsistency between check_fix_ptrs and
the main ptr marking paths
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

47d2080e

bcachefs: Fix use after free in check_root_trans() · dcc1c045

Kent Overstreet authored Mar 26, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

dcc1c045

bcachefs: Fix repair path for missing indirect extents · 83bb5853

Kent Overstreet authored Mar 26, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

83bb5853

bcachefs: Fix use after free in bch2_check_fix_ptrs() · 6f5869ff

Kent Overstreet authored Mar 26, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

6f5869ff

bcachefs: Fix btree node keys accounting in topology repair path · 812a9297

Kent Overstreet authored Mar 26, 2024

When dropping keys now outside a node because we're changing the node
min/max, we need to redo the node's accounting as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

812a9297

bcachefs: Check btree ptr min_key in .invalid · 805b535a

Kent Overstreet authored Mar 25, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

805b535a

bcachefs: add REQ_SYNC and REQ_IDLE in write dio · bb660099

zhuxiaohui authored Mar 26, 2024

When writing a file with direct I/O on bcachefs, performance is much
lower than on other filesystems due to writeback throttling in the block layer:

        wbt_wait+1
        __rq_qos_throttle+32
        blk_mq_submit_bio+394
        submit_bio_noacct_nocheck+649
        bch2_submit_wbio_replicas+538
        __bch2_write+2539
        bch2_direct_write+1663
        bch2_write_iter+318
        aio_write+355
        io_submit_one+1224
        __x64_sys_io_submit+169
        do_syscall_64+134
        entry_SYSCALL_64_after_hwframe+110

Set REQ_SYNC and REQ_IDLE in bio->bi_opf, as standard direct I/O does.
Signed-off-by: zhuxiaohui <zhuxiaohui.400@bytedance.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
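
A simplified model of why the two flags matter, assuming the writeback throttle exempts writes that carry both of them (that is, it treats them as direct I/O). The MODEL_REQ_* bit values and model_write_is_throttled() are illustrative only; the real REQ_* definitions and the throttle logic live in the kernel's block layer.

```c
#include <stdbool.h>

/* Illustrative flag bits, not the kernel's actual REQ_* values. */
#define MODEL_REQ_SYNC	(1u << 0)
#define MODEL_REQ_IDLE	(1u << 1)

/*
 * Decide whether a write with the given flags would be throttled in this
 * model: only writes carrying both flags are exempt.
 */
static bool model_write_is_throttled(unsigned int opf)
{
	unsigned int odirect = MODEL_REQ_SYNC | MODEL_REQ_IDLE;

	return (opf & odirect) != odirect;
}
```

Under this model a direct write submitted without the flags waits in the throttle like buffered writeback, which matches the wbt_wait frame at the top of the trace above.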

bb660099

bcachefs: Improved topology repair checks · 79032b07

Kent Overstreet authored Mar 23, 2024

Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().

This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.

Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree that's indicative of a topology error
and should be flagged as such, not a null ptr deref.

Some checks in btree_update_interior.c remain BUG_ON()s, because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be existing on-disk corruption; they
indicate an actual algorithmic bug.

In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
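
A minimal sketch of the "children span the full range of the parent with no overlaps" check, using plain integer keys instead of btree positions. struct key_range and children_span_parent() are hypothetical, and treating an interior node with no children as an error is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct key_range {
	uint64_t min;	/* inclusive */
	uint64_t max;	/* inclusive */
};

/*
 * Children must be in key order, start at the parent's min, end at the
 * parent's max, and each child must begin exactly one past the previous
 * child's max (no gaps, no overlaps).
 */
static bool children_span_parent(const struct key_range *parent,
				 const struct key_range *child, size_t nr)
{
	size_t i;

	if (nr == 0)
		return false;
	if (child[0].min != parent->min || child[nr - 1].max != parent->max)
		return false;

	for (i = 0; i < nr; i++) {
		if (child[i].min > child[i].max)
			return false;
		if (i > 0 && (child[i - 1].max == UINT64_MAX ||
			      child[i].min != child[i - 1].max + 1))
			return false;
	}
	return true;
}
```

A false return corresponds to the case the commit wants reported as a topology error rather than hit later as a null pointer dereference.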

79032b07

bcachefs: Be careful about btree node splits during journal replay · 40cb2623

Kent Overstreet authored Mar 26, 2024

Don't pick a pivot that's going to be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

40cb2623

bcachefs: btree_and_journal_iter now respects trans->journal_replay_not_finished · 048f47e8

Kent Overstreet authored Mar 25, 2024

btree_and_journal_iter is now safe to use at runtime, not just during
recovery before journal keys have been freed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

048f47e8

bcachefs: fix trans->mem realloc in __bch2_trans_kmalloc · 36f9ef10

Hongbo Li authored Mar 25, 2024

The old code doesn't account for memory allocated from the mempool when
calling krealloc() on trans->mem. Also, in bch2_trans_put(), deciding to
call mempool_free() on trans->mem based on the condition
"trans->mem_bytes == BTREE_TRANS_MEM_MAX" is inaccurate when trans->mem
was allocated by krealloc(). Instead, use used_mempool to record where
the memory came from, and realloc or free trans->mem accordingly.

Also, after krealloc() fails in __bch2_trans_kmalloc(), the old data
should be copied to the new buffer when allocating it with mempool_alloc().

Fixes: 31403dca ("bcachefs: optimize __bch2_trans_get(), kill DEBUG_TRANSACTIONS")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
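
A userspace sketch of the bookkeeping described above, with malloc()/realloc() standing in for the kernel allocator and a single static buffer standing in for the mempool. struct trans_model, trans_mem_grow(), trans_mem_free() and POOL_BUF_SIZE are hypothetical; only the used_mempool flag idea comes from the commit message.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define POOL_BUF_SIZE 4096	/* illustrative size of the reserve buffer */

static unsigned char pool_buf[POOL_BUF_SIZE];	/* stand-in for a mempool element */

struct trans_model {
	void	*mem;
	size_t	mem_bytes;
	bool	used_mempool;	/* true when mem points at pool_buf */
};

/* Grow t->mem to new_bytes, preserving its contents; returns 0 on success. */
static int trans_mem_grow(struct trans_model *t, size_t new_bytes)
{
	void *new_mem;

	if (new_bytes <= t->mem_bytes)
		return 0;

	/* Pool memory must never be handed to realloc(). */
	new_mem = t->used_mempool ? malloc(new_bytes)
				  : realloc(t->mem, new_bytes);
	if (new_mem) {
		if (t->used_mempool) {
			memcpy(new_mem, t->mem, t->mem_bytes);
			t->used_mempool = false;
		}
	} else if (!t->used_mempool && new_bytes <= POOL_BUF_SIZE) {
		/* Heap allocation failed: fall back to the reserve and
		 * carry the old data over before releasing it. */
		if (t->mem_bytes)
			memcpy(pool_buf, t->mem, t->mem_bytes);
		free(t->mem);
		new_mem = pool_buf;
		t->used_mempool = true;
	} else {
		return -1;
	}

	t->mem = new_mem;
	t->mem_bytes = new_bytes;
	return 0;
}

static void trans_mem_free(struct trans_model *t)
{
	/* Decide by the flag, not by comparing mem_bytes to a fixed size. */
	if (!t->used_mempool)
		free(t->mem);
	t->mem = NULL;
	t->mem_bytes = 0;
	t->used_mempool = false;
}
```

The point of the flag is that a buffer's size says nothing reliable about which allocator it came from, so both the grow path and the free path consult the flag instead.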

36f9ef10

bcachefs: Don't do extent merging before journal replay is finished · 57339b24

Kent Overstreet authored Mar 23, 2024

We don't normally do extent updates this early in recovery, but some of
the repair paths have to and when we do, we don't want to do anything
that requires the snapshots table.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

57339b24

bcachefs: Add checks for invalid snapshot IDs · ec9cc18f

Kent Overstreet authored Mar 22, 2024

Previously, we assumed that keys were consistent with the snapshots
btree - but that's not correct as fsck may not have been run or may not
be complete.

This adds checks and error handling when using the in-memory snapshots
table (that mirrors the snapshots btree).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ec9cc18f

bcachefs: Move snapshot table size to struct snapshot_table · 63332394

Kent Overstreet authored Mar 21, 2024

We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
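
A minimal sketch of why keeping the size inside the table helps with bounds checking: the lookup can compare the index against the same object it is about to index. The layout and the names snapshot_table_model, snapshot_table_alloc() and snapshot_lookup() are hypothetical and do not reflect the real bcachefs structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct snapshot_entry {
	uint32_t parent;
};

/* The table carries its own entry count next to the entries. */
struct snapshot_table_model {
	size_t			nr;
	struct snapshot_entry	s[];
};

static struct snapshot_table_model *snapshot_table_alloc(size_t nr)
{
	struct snapshot_table_model *t;

	t = calloc(1, sizeof(*t) + nr * sizeof(t->s[0]));
	if (t)
		t->nr = nr;
	return t;
}

/* Bounds-checked lookup: NULL for an index outside the table. */
static const struct snapshot_entry *
snapshot_lookup(const struct snapshot_table_model *t, size_t idx)
{
	return idx < t->nr ? &t->s[idx] : NULL;
}
```

Callers then have to handle a NULL result, which is the kind of error handling needed when keys may reference snapshot IDs that fsck has not yet validated.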

63332394

bcachefs: Add an assertion for trying to evict btree root · aa6e130e

Kent Overstreet authored Mar 24, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

aa6e130e

bcachefs: fix mount error path · 4bd02d3f

Kent Overstreet authored Mar 28, 2024

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

4bd02d3f

bcachefs: fix misplaced newline in __bch2_inode_unpacked_to_text() · 688d750d

Thomas Bertschinger authored Mar 20, 2024

before:

u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0:   mode=40700
  flags= (15300000)
  journal_seq=4
  bi_size=0
  bi_sectors=0

  bi_version=0bi_atime=227064388944
  ...

after:

u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0:   mode=40700
  flags= (15300000)
  journal_seq=4
  bi_size=0
  bi_sectors=0
  bi_version=0
  bi_atime=227064388944
  ...
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

688d750d

bcachefs: Fix journal pins in btree write buffer · 8aad8e1f

Kent Overstreet authored Mar 22, 2024

btree write buffer flush has two phases
 - in natural key order, which is more efficient but may fail
 - then in journal order

The journal order flush was assuming that keys were still correctly
ordered by journal sequence number - but due to coalescing by the
previous phase, we need an additional sort.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
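
A minimal sketch of the additional sort, assuming each buffered key records the journal sequence number it was written under. struct wb_key and the function names are hypothetical; the real write buffer uses its own key layout and sort routine.

```c
#include <stdint.h>
#include <stdlib.h>

struct wb_key {
	uint64_t journal_seq;
	uint64_t pos;	/* stand-in for the btree key position */
};

static int cmp_journal_seq(const void *l, const void *r)
{
	const struct wb_key *a = l, *b = r;

	return (a->journal_seq > b->journal_seq) -
	       (a->journal_seq < b->journal_seq);
}

/*
 * Keys left over after the key-order phase may no longer be in journal
 * order, so sort them again before the journal-order phase.
 */
static void sort_for_journal_order_flush(struct wb_key *keys, size_t nr)
{
	qsort(keys, nr, sizeof(*keys), cmp_journal_seq);
}
```

Without the re-sort, the second phase could process a later journal sequence before an earlier one, which is what breaks the journal pin handling.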

8aad8e1f

bcachefs: Fix assert in bch2_backpointer_invalid() · a5e3dce4

Kent Overstreet authored Mar 22, 2024

Backpointers that point to invalid devices are caught by fsck, not
.key_invalid; so .key_invalid needs to check for them instead of hitting
asserts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a5e3dce4