    readahead: make sure sync readahead reads needed page · 8051b82a
    Jan Kara authored
    Patch series "mm: Fix various readahead quirks".
    
    When we were internally testing performance of recent kernels, we
    noticed quite variable performance of readahead arising from various
    quirks in the readahead code.  So I went on a cleaning spree there.
    This is a batch of patches resulting from that.  A quick test in my
    test VM with the following fio job file:
    
    [global]
    direct=0
    ioengine=sync
    invalidate=1
    blocksize=4k
    size=10g
    readwrite=read
    
    [reader]
    numjobs=1
    
    shows that this patch series improves the throughput from a variable
    310-340 MB/s to a stable 350 MB/s.  As a side effect
    these cleanups also address the issue noticed by Bruz Zhang [1].
    
    [1] https://lore.kernel.org/all/20240618114941.5935-1-zhangpengpeng0808@gmail.com/
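
    For reference, the fio job file above is self-contained: saved as e.g.
    reader.fio (the name is arbitrary), running 'fio reader.fio' lays out a
    10g test file in the current directory and reads it back sequentially
    through the page cache, reporting the achieved bandwidth.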
    
    Zhang Peng reported:
    
    : I tested this batch of patches with fio, and it indeed shows a huge
    : speedup in sequential reads when block size is 4KiB.  The results are
    : as follows; for async reads, iodepth is set to 128, and other settings
    : are self-evident.
    : 
    : casename                upstream   withFix    speedup
    : ----------------        --------   --------   -------
    : randread-4k-sync        48991      47773      -2.4%
    : seqread-4k-sync         1162758    1422955    22.3%
    : seqread-1024k-sync      1460208    1452522    -0.5%
    : randread-4k-libaio      47467      47309      -0.3%
    : randread-4k-posixaio    49190      49512      0.6%
    : seqread-4k-libaio       1085932    1234635    13.6%
    : seqread-1024k-libaio    1423341    1402214    -1.4%
    : seqread-4k-posixaio     1165084    1369613    17.5%
    : seqread-1024k-posixaio  1435422    1408808    -1.8%
    
    
    This patch (of 10):
    
    page_cache_sync_ra() is called when a folio we want to read is not in the
    page cache.  It is expected that it creates the folio (and perhaps the
    following folios as well) and submits reads for them unless some error
    happens.  However if index == ra->start + ra->size, ondemand_readahead()
    will treat the call as another async readahead hit.  Thus ra->start will
    be advanced and we create pages and queue reads from ra->start + ra->size
    onwards.  Consequently the page at 'index' is not created and
    filemap_get_pages() always has to go through the filemap_create_folio()
    path.
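
    For illustration, the check in question looks roughly like this (a
    simplified sketch of the pre-fix ondemand_readahead() logic, not the
    verbatim mm/readahead.c code):

        expected = round_down(ra->start + ra->size - ra->async_size,
                              1UL << order);
        if (index == expected || index == (ra->start + ra->size)) {
                /*
                 * Assumed to be an async readahead hit: push the window
                 * forward.  On a sync call this skips past 'index', so the
                 * folio at 'index' itself is never created here.
                 */
                ra->start += ra->size;
                ra->size = get_next_ra_size(ra, max_pages);
                ra->async_size = ra->size;
                goto readit;
        }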
    
    This behavior has particularly unfortunate consequences when we have two
    IO threads sequentially reading from a shared file (as is the case when
    NFS serves sequential reads).  In that case what can happen is:
    
    suppose ra->size == ra->async_size == 128, ra->start = 512
    
    T1					T2
    reads 128 pages at index 512
      - hits async readahead mark
        filemap_readahead()
          ondemand_readahead()
            if (index == expected ...)
    	  ra->start = 512 + 128 = 640
              ra->size = 128
    	  ra->async_size = 128
    	page_cache_ra_order()
    	  blocks in ra_alloc_folio()
    					reads 128 pages at index 640
    					  - no page found
    					  page_cache_sync_readahead()
    					    ondemand_readahead()
    					      if (index == expected ...)
    						ra->start = 640 + 128 = 768
    						ra->size = 128
    						ra->async_size = 128
    					    page_cache_ra_order()
    					      submits reads from 768
    					  - still no page found at index 640
    					    filemap_create_folio()
    					  - goes on to index 641
    					  page_cache_sync_readahead()
    					    ondemand_readahead()
    					      - finds ra is confused,
    					        trims it to small size
      	  finds pages were already inserted
    
    And as a result read performance suffers.
    
    Fix the problem by triggering the async readahead case in
    ondemand_readahead() only if we are calling the function because we hit
    the readahead marker.
    In any other case we need to read the folio at 'index' and thus we cannot
    really use the current ra state.
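
    Schematically (again a paraphrase of the logic rather than the exact
    patch), only a genuine readahead-marker hit now advances the window,
    and the 'index == ra->start + ra->size' half of the check goes away:

        expected = round_down(ra->start + ra->size - ra->async_size,
                              1UL << order);
        if (index == expected) {
                /* Only a real async readahead hit pushes the window. */
                ra->start += ra->size;
                ra->size = get_next_ra_size(ra, max_pages);
                ra->async_size = ra->size;
                goto readit;
        }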
    
    Note that the above situation could be viewed as a special case of
    file->f_ra state corruption.  In fact two threads reading using the
    shared file can also seemingly corrupt file->f_ra in interesting ways
    due to concurrent access.  I have never seen that in practice and the
    fix is going to be much more complex, so for now at least fix this
    practical problem while we ponder the theoretically correct solution.
    
    Link: https://lkml.kernel.org/r/20240625100859.15507-1-jack@suse.cz
    Link: https://lkml.kernel.org/r/20240625101909.12234-1-jack@suse.cz
    Signed-off-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Tested-by: Zhang Peng <zhangpengpeng0808@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>