    dm snapshot: Fix excessive memory usage and workqueue stalls · 721b1d98
    Nikos Tsironis authored
    kcopyd has no upper limit on the number of jobs one can allocate and
    issue. Under certain workloads this can lead to excessive memory usage
    and workqueue stalls. One such workload is creating multiple dm-snapshot
    targets with a 4K chunk size and then writing to the origin through the
    page cache. Syncing the page cache causes a large number of BIOs to be
    issued to the dm-snapshot origin target, which itself issues an even
    larger number of kcopyd jobs because of the BIO splitting taking place.
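
    As a rough illustration of why the job count outgrows the BIO count,
    the sketch below computes an upper bound on the copy jobs one origin
    write could trigger. It assumes (this is not stated above) that every
    chunk touched by the write still needs a copy in every snapshot, so
    the bound is chunks touched times snapshots. The function name
    max_cow_jobs() is hypothetical and is not part of dm-snapshot.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: upper bound on copy jobs for one
 * write of `len` bytes at byte `offset`, assuming every touched chunk must
 * still be copied once per snapshot.
 */
static uint64_t max_cow_jobs(uint64_t offset, uint64_t len,
                             uint64_t chunk_size, unsigned int snapshots)
{
    if (len == 0 || chunk_size == 0)
        return 0;

    /* Index of the first and last chunk the write overlaps. */
    uint64_t first = offset / chunk_size;
    uint64_t last = (offset + len - 1) / chunk_size;

    return (last - first + 1) * snapshots;
}
```

    Under these assumptions, a single aligned 512K write with a 4K chunk
    size and 8 snapshots touches 128 chunks and could trigger up to 1024
    copy jobs.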
    
    Running the following test from the device mapper test suite [1] with
    8 active snapshots:
    
      dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
    
    results in the kcopyd job slab cache growing to 10G. Depending on the
    available system RAM, this can lead to the OOM killer killing user
    processes:
    
    [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
                  nodemask=(null), order=1, oom_score_adj=0
    [463.492894] kthreadd cpuset=/ mems_allowed=0
    [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
    [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [463.492952] Call Trace:
    [463.492964]  dump_stack+0x7d/0xbb
    [463.492973]  dump_header+0x6b/0x2fc
    [463.492987]  ? lockdep_hardirqs_on+0xee/0x190
    [463.493012]  oom_kill_process+0x302/0x370
    [463.493021]  out_of_memory+0x113/0x560
    [463.493030]  __alloc_pages_slowpath+0xf40/0x1020
    [463.493055]  __alloc_pages_nodemask+0x348/0x3c0
    [463.493067]  cache_grow_begin+0x81/0x8b0
    [463.493072]  ? cache_grow_begin+0x874/0x8b0
    [463.493078]  fallback_alloc+0x1e4/0x280
    [463.493092]  kmem_cache_alloc_node+0xd6/0x370
    [463.493098]  ? copy_process.part.31+0x1c5/0x20d0
    [463.493105]  copy_process.part.31+0x1c5/0x20d0
    [463.493115]  ? __lock_acquire+0x3cc/0x1550
    [463.493121]  ? __switch_to_asm+0x34/0x70
    [463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
    [463.493135]  ? finish_task_switch+0x90/0x280
    [463.493165]  _do_fork+0xe0/0x6d0
    [463.493191]  ? kthreadd+0x19f/0x220
    [463.493233]  kernel_thread+0x25/0x30
    [463.493235]  kthreadd+0x1bf/0x220
    [463.493242]  ? kthread_create_on_cpu+0x90/0x90
    [463.493248]  ret_from_fork+0x3a/0x50
    [463.493279] Mem-Info:
    [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
    [463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
    [463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
    [463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
    [463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
    [463.493285]  free:33623 free_pcp:2392 free_cma:0
    ...
    [463.493489] Unreclaimable slab info:
    [463.493513] Name                      Used          Total
    [463.493522] bio-6                   1028KB       1028KB
    [463.493525] bio-5                   1028KB       1028KB
    [463.493528] dm_snap_pending_exception     236783KB     243789KB
    [463.493531] dm_exception              41KB         42KB
    [463.493534] bio-4                   1216KB       1216KB
    [463.493537] bio-3                 439396KB     439396KB
    [463.493539] kcopyd_job           6973427KB    6973427KB
    ...
    [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
    [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
    [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    
    Moreover, issuing a large number of kcopyd jobs results in kcopyd
    hogging the CPU while processing them. As a result, work items queued
    for execution on the same CPU as the currently running kcopyd thread
    are stalled for long periods of time, hurting performance. Running the
    aforementioned test produces dmesg messages like the following:
    
    [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
    [67501.195586] Showing busy workqueues and worker pools:
    [67501.195591] workqueue events: flags=0x0
    [67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195611]     pending: cache_reap
    [67501.195641] workqueue mm_percpu_wq: flags=0x8
    [67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195656]     pending: vmstat_update
    [67501.195682] workqueue kblockd: flags=0x18
    [67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
    [67501.195698]     pending: blk_timeout_work
    [67501.195753] workqueue kcopyd: flags=0x8
    [67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195768]     pending: do_work [dm_mod]
    [67501.195802] workqueue kcopyd: flags=0x8
    [67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195817]     pending: do_work [dm_mod]
    [67501.195834] workqueue kcopyd: flags=0x8
    [67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195848]     pending: do_work [dm_mod]
    [67501.195881] workqueue kcopyd: flags=0x8
    [67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195896]     pending: do_work [dm_mod]
    [67501.195920] workqueue kcopyd: flags=0x8
    [67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
    [67501.195935]     in-flight: 67:do_work [dm_mod]
    [67501.195945]     pending: do_work [dm_mod]
    [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
    
    The root cause of these issues is the way dm-snapshot uses kcopyd:
    there is neither an explicit nor an implicit limit on the number of
    in-flight COW jobs. The merging path is not affected because it
    implicitly limits the in-flight kcopyd jobs to one.
    
    Fix these issues by using a semaphore to limit the maximum number of
    in-flight kcopyd jobs. We grab the semaphore before allocating a new
    kcopyd job in start_copy() and start_full_bio() and release it after the
    job finishes in copy_callback().
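
    The acquire/release pattern described above can be sketched in
    userspace with a POSIX counting semaphore. This is only a model, not
    the kernel code: struct cow_throttle and the cow_* helpers are
    hypothetical names, and sem_trywait() is used so the example never
    blocks, whereas a submitter in the real path would wait for a free
    slot.

```c
#include <assert.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the in-flight job limit. */
struct cow_throttle {
    sem_t slots; /* number of copy jobs that may still be started */
};

static int cow_throttle_init(struct cow_throttle *t, unsigned int limit)
{
    return sem_init(&t->slots, 0, limit);
}

/*
 * Take a slot before allocating a job. Returns false when the limit is
 * reached; this models the point where a blocking submitter would wait.
 */
static bool cow_job_try_start(struct cow_throttle *t)
{
    return sem_trywait(&t->slots) == 0;
}

/* Give the slot back once the job has finished (the completion callback). */
static void cow_job_done(struct cow_throttle *t)
{
    int rc = sem_post(&t->slots);

    assert(rc == 0);
    (void)rc;
}

/* Number of jobs that could be started right now without waiting. */
static int cow_throttle_free_slots(struct cow_throttle *t)
{
    int value = 0;

    sem_getvalue(&t->slots, &value);
    return value;
}
```

    With a limit of 2, two calls to cow_job_try_start() succeed, a third
    fails until cow_job_done() returns a slot, so the number of allocated
    jobs never exceeds the limit.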
    
    The initial semaphore value is configurable through a module parameter,
    to allow fine-tuning of the maximum number of in-flight COW jobs.
    Setting this parameter to zero initializes the semaphore to INT_MAX.
    
    The default is a maximum of 2048 in-flight kcopyd jobs. This value was
    determined experimentally as a trade-off between memory consumption,
    kernel workqueue stalls, and sufficiently high throughput.
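
    The mapping from the module parameter to the initial semaphore value
    can be sketched as follows. The names DEFAULT_COW_LIMIT and
    initial_cow_slots() are hypothetical; only the zero-to-INT_MAX rule
    and the default of 2048 come from the description above. Treating
    negative values like zero is an assumption of this sketch.

```c
#include <limits.h>

/* Default limit stated in this commit message; the macro name is made up. */
#define DEFAULT_COW_LIMIT 2048

/*
 * Hypothetical helper: translate the parameter into the semaphore's
 * initial count. Zero means "effectively unlimited" (INT_MAX). Negative
 * values are handled the same way, which is an assumption of this sketch.
 */
static int initial_cow_slots(int param)
{
    return param > 0 ? param : INT_MAX;
}
```

    Mapping zero to INT_MAX means the same acquire/release code can run
    unconditionally, since a count that large is never exhausted in
    practice.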
    
    Re-running the aforementioned test:
    
      * Workqueue stalls are eliminated
      * kcopyd's job slab cache uses a maximum of 130MB
      * The time taken by the test to write to the snapshot-origin target is
        reduced from 05m20.48s to 03m26.38s
    
    [1] https://github.com/jthornber/device-mapper-test-suite
    
    Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
    Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>