• Darrick J. Wong's avatar
    xfs: fix per-cpu CIL structure aggregation racing with dying cpus · ecd49f7a
    Darrick J. Wong authored
    In commit 7c8ade21 ("xfs: implement percpu cil space used
    calculation"), the XFS committed (log) item list code was converted to
    use per-cpu lists and space tracking to reduce cpu contention when
    multiple threads are modifying different parts of the filesystem and
    hence end up contending on the log structures during transaction commit.
    Each CPU tracks its own commit items and space usage, and these do not
    have to be merged into the main CIL until either someone wants to push
    the CIL items, or we run over a soft threshold and switch to slower (but
    more accurate) accounting with atomics.
    
    Unfortunately, the for_each_cpu iteration suffers from the same race
    with cpu dying problem that was identified in commit 8b57b11c
    ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
    cpu_online_mask before the CPUHP_XFS_DEAD callback gets called.  As a
    result, both CIL percpu structure aggregation functions fail to collect
    the items and accounted space usage at the correct point in time.
    
    If we're lucky, the items that are collected from the online cpus exceed
    the space given to those cpus, and the log immediately shuts down in
    xlog_cil_insert_items due to the (apparent) log reservation overrun.
    This happens periodically with generic/650, which exercises cpu hotplug
    vs. the filesystem code:
    
    smpboot: CPU 3 is now offline
    XFS (sda3): ctx ticket reservation ran out. Need to up reservation
    XFS (sda3): ticket reservation summary:
    XFS (sda3):   unit res    = 9268 bytes
    XFS (sda3):   current res = -40 bytes
    XFS (sda3):   original count  = 1
    XFS (sda3):   remaining count = 1
    XFS (sda3): Filesystem has been shut down due to log error (0x2).
    
    Applying the same sort of fix from 8b57b11c to the CIL code seems
    to make the generic/650 problem go away, but I've been told that tglx
    was not happy when he saw:
    
    "...the only thing we actually need to care about is that
    percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
    there are no CPUs dying, it has no addition overhead except for a
    cpumask_or() operation."
    
    The CPU hotplug code is rather complex and difficult to understand and I
    don't want to try to understand the cpu hotplug locking well enough to
    use cpu_dying mask.  Furthermore, there's a performance improvement that
    could be had here.  Attach a private cpu mask to the CIL structure so
    that we can track exactly which cpus have accessed the percpu data at
    all.  It doesn't matter if the cpu has since gone offline; log item
    aggregation will still find the items.  Better yet, we skip cpus that
    have not recently logged anything.
    
    Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU
    hot remove racing with an xfs mount can crash if the cpu_dead notifier
    tries to access the log but the mount hasn't yet set up the log.
    
    Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/
    Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/
    Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
    Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/
    Cc: tglx@linutronix.de
    Cc: peterz@infradead.org
    Reported-by: ritesh.list@gmail.com
    Reported-by: sandeen@sandeen.net
    Fixes: af1c2146 ("xfs: introduce per-cpu CIL tracking structure")
    Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
    Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
    ecd49f7a
xfs_log_priv.h 25.7 KB