    iomap: skip pages past eof in iomap_do_writepage() · d58562ca
    Chris Mason authored
    iomap_do_writepage() sends pages past i_size through
    folio_redirty_for_writepage(), which normally isn't a problem because
    truncate and friends clean them very quickly.
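
    As a rough userspace sketch of the check involved (not the kernel
    code itself), the position of a page relative to i_size can be
    classified as below.  The enum, the function name and the fixed 4K
    page size are assumptions made for illustration.  This patch changes
    what happens in the MODEL_PAST_EOF case: the page is skipped instead
    of being redirtied.

```c
#include <stddef.h>

/* Assumed page size for this model; the kernel uses PAGE_SHIFT. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Hypothetical outcomes, named for illustration only. */
enum model_action {
	MODEL_WRITE_FULL,	/* page lies entirely inside i_size */
	MODEL_WRITE_ZERO_TAIL,	/* page straddles i_size: zero the tail, then write */
	MODEL_PAST_EOF,		/* page lies entirely past i_size */
};

/* Classify the page at the given index against a file size of isize bytes. */
static enum model_action model_classify(unsigned long index,
					unsigned long long isize)
{
	unsigned long long end_pos =
		((unsigned long long)index + 1) << MODEL_PAGE_SHIFT;
	unsigned long end_index = isize >> MODEL_PAGE_SHIFT;
	size_t poff = isize & (MODEL_PAGE_SIZE - 1);

	if (end_pos <= isize)
		return MODEL_WRITE_FULL;
	if (index > end_index || (index == end_index && poff == 0))
		return MODEL_PAST_EOF;
	return MODEL_WRITE_ZERO_TAIL;
}
```

    With i_size at 0, as it is right after ftruncate(fd, 0), a page at
    byte offset 8192 (index 2) lands in MODEL_PAST_EOF.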
    
    When the system has cgroups configured, we can end up in situations
    where one cgroup has almost no dirty pages at all, and other cgroups
    consume the entire background dirty limit.  This is especially common in
    our XFS workloads in production because they have cgroups using O_DIRECT
    for almost all of the IO mixed in with cgroups that do more traditional
    buffered IO work.
    
    We've hit storms where the redirty path hits millions of times in a few
    seconds, all on a single file that's only ~40 pages long.  This leads to
    long tail latencies for file writes because the pdflush workers are
    hogging the CPU from some kworkers bound to the same CPU.
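
    A toy model of why the count climbs so quickly, assuming truncate
    has not yet cleaned the pages and writeback keeps revisiting the
    file.  The function name and the pass counts are made up for
    illustration and are not measurements:

```c
#include <stdbool.h>

/*
 * Toy model: count redirty calls over repeated writeback passes on a
 * file whose dirty pages all sit past i_size.
 */
static unsigned long model_redirty_hits(unsigned int pages_past_eof,
					unsigned long passes,
					bool redirty_past_eof)
{
	unsigned long hits = 0;
	unsigned int dirty = pages_past_eof;

	for (unsigned long pass = 0; pass < passes && dirty > 0; pass++) {
		if (redirty_past_eof)
			hits += dirty;	/* every page is marked dirty again */
		else
			dirty = 0;	/* pages are left clean, nothing to revisit */
	}
	return hits;
}
```

    In this model, 40 pages revisited over 100000 passes give 4000000
    redirty calls, while skipping the pages instead gives none.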
    
    Reproducing this on 5.18 was tricky because 869ae85d ("xfs: flush new
    eof page on truncate...") ends up writing and waiting on most of these dirty pages
    before truncate gets a chance to wait on them.
    
    The actual repro looks like this:
    
    /*
     * run me in a cgroup all alone.  Start a second cgroup with dd
     * streaming IO into the block device.
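     */
    #define _GNU_SOURCE		/* needed for sync_file_range() */
    #include <err.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* assumed value: the definition is missing from this message */
    #define BUFFER_SIZE (1024 * 1024)

    int main(int ac, char **av) {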
    	int fd;
    	int ret;
    	char buf[BUFFER_SIZE];
    	char *filename = av[1];
    
    	memset(buf, 0, BUFFER_SIZE);
    
    	if (ac != 2) {
    		fprintf(stderr, "usage: looper filename\n");
    		exit(1);
    	}
    	fd = open(filename, O_WRONLY | O_CREAT, 0600);
    	if (fd < 0) {
    		err(errno, "failed to open");
    	}
    	fprintf(stderr, "looping on %s\n", filename);
    	while(1) {
    		/*
    		 * skip past page 0 so truncate doesn't write and wait
    		 * on our extent before changing i_size
    		 */
    		ret = lseek(fd, 8192, SEEK_SET);
    		if (ret < 0)
    			err(errno, "lseek");
    		ret = write(fd, buf, BUFFER_SIZE);
    		if (ret != BUFFER_SIZE)
    			err(errno, "write failed");
    		/* start IO so truncate has to wait after i_size is 0 */
    		ret = sync_file_range(fd, 16384, 4095, SYNC_FILE_RANGE_WRITE);
    		if (ret < 0)
    			err(errno, "sync_file_range");
    		ret = ftruncate(fd, 0);
    		if (ret < 0)
    			err(errno, "truncate");
    		usleep(1000);
    	}
    }
    
    And this bpftrace script will show when you've hit a redirty storm:
    
    kretprobe:xfs_vm_writepages {
        delete(@dirty[pid]);
    }
    
    kprobe:xfs_vm_writepages {
        @dirty[pid] = 1;
    }
    
    kprobe:folio_redirty_for_writepage /@dirty[pid] > 0/ {
        $inode = ((struct folio *)arg1)->mapping->host->i_ino;
        @inodes[$inode] = count();
        @redirty++;
        if (@redirty > 90000) {
            printf("inode %d redirty was %d\n", $inode, @redirty);
            exit();
        }
    }
    
    This patch has the same number of failures on xfstests as unpatched 5.18:
    Failures: generic/648 xfs/019 xfs/050 xfs/168 xfs/299 xfs/348 xfs/506
    xfs/543
    
    I also ran it through a long stress of multiple fsx processes hammering.
    
    (Johannes Weiner did significant tracing and debugging on this as well)
    Signed-off-by: Chris Mason <clm@fb.com>
    Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
    Reported-by: Domas Mituzas <domas@fb.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
buffered-io.c 43.5 KB