- 14 Apr, 2023 8 commits
-
-
Jan Kara authored
The handling of journalled data in ext4_zero_range() is incomplete. We do not need to commit running transaction but we rather need to checkpoint pages with journalled data. If we don't, journal tail can be advanced beyond transaction containing the journalled data and if we then crash before committing the transaction doing the zeroing we will have inconsistent (too old) data in the file. Make sure file pages with journalled data are properly checkpointed before removing them from the page cache. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-8-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Jan Kara authored
Now that filemap_write_and_wait() makes sure pages with journalled data are safely on disk, ext4_collapse_range() and ext4_insert_range() do not need special handling of journalled data. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-7-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Jan Kara authored
Now that ext4_writepages() make sure all pages with journalled data are stable on disk, we don't need special handling of journalled data in ext4_sync_file(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-6-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Jan Kara authored
When journalling data we currently just walk over pages, journal those that are marked for delayed dirtying (only pinned pages dirtied behing our back these days) and checkpoint other dirty pages. Because some pages may be part of running transaction the result is that after filemap_write_and_wait() we are not guaranteed pages are stable on disk. Thus places that want to flush current pagecache content need to jump through hoops to make sure journalled data is not lost. This is manageable in cases completely controlled by ext4 (such as extent shifting operations or inode eviction) but it gets ugly for stuff like fsverity. Furthermore it is rather error prone as people often do not realize journalled data needs special handling. So change ext4_writepages() to commit transaction with inode's data before going through the writeback loop in WB_SYNC_ALL mode. As a result filemap_write_and_wait() is now really getting pages to stable storage and makes pagecache pages safe to reclaim. Consequently we can remove the special handling of journalled data from several places in follow up patches. Note that this will make fsync(2) for journalled data more expensive as we will end up not only committing the transaction we need but also checkpointing the data (which we may have previously skipped if the data was part of the running transaction). If we really cared, we would need to introduce special VFS function for writing out & invalidating page cache for a range, use ->launder_page callback to perform checkpointing, and use it from all the places that need this functionality. But at this point I'm not convinced the complexity is worth it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-5-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Jan Kara authored
With journalled data it can happen that checkpointing code will write out page contents without clearing the page dirty bit. The logic in ext4_page_nomap_can_writeout() then results in us never calling mpage_submit_page() and thus clearing the dirty bit. Drop the optimization with ext4_page_nomap_can_writeout() and just always call to mpage_submit_page(). ext4_bio_write_page() knows when to redirty the page and the additional clearing & setting of page dirty bit for ordered mode writeout is not that expensive to jump through the hoops for it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-4-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Jan Kara authored
Currently we clear page dirty bit when we checkpoint some buffers from a page with journalled data or when we perform delayed dirtying of a page in ext4_writepages(). In a quest to simplify handling of journalled data we want to keep page dirty as long as it has either buffers to checkpoint or journalled dirty data. So make sure to keep page dirty in ext4_writepages() if it still has journalled data attached to it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-3-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Jan Kara authored
Currently pages with journalled data written by write(2) or modified by block zeroing during truncate(2) are not marked as dirty. They are dirtied only once the transaction commits. This however makes writeback code think inode has no pages to write and so ext4_writepages() is not called to make pages with journalled data persistent. Mark pages with journalled data dirty (similarly as it happens for writes through mmap) so that writeback code knows about them and ext4_writepages() can do what it needs to to the inode. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-2-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Jan Kara authored
When invalidating buffers under the partial tail page, jbd2_journal_invalidate_folio() returns -EBUSY if the buffer is part of the committing transaction as we cannot safely modify buffer state. However if the buffer is already invalidated (due to previous invalidation attempts from ext4_wait_for_tail_page_commit()), there's nothing to do and there's no point in returning -EBUSY. This fixes occasional warnings from ext4_journalled_invalidate_folio() triggered by generic/051 fstest when blocksize < pagesize. Fixes: 53e87268 ("ext4: fix deadlock in journal_unmap_buffer()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-1-jack@suse.czSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
- 06 Apr, 2023 32 commits
-
-
Matthew Wilcox authored
This is an implementation of fsverity_operations read_merkle_tree_page, so it must still return the precise page asked for, but we can use the folio API to reduce the number of conversions between folios & pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-30-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Use the folio API and support folios of arbitrary sizes. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-29-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Use a folio throughout. Does not support large folios due to an array sized for MAX_BUF_PER_PAGE, but it does remove a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-28-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Iterate once per folio, not once per page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-27-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Convert to the folio API, saving a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-26-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
All the callers now have a folio, so pass that in and operate on folios. Removes four calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230324180129.1220691-25-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
This definitely doesn't include support for large folios; there are all kinds of assumptions about the number of buffers attached to a folio. But it does remove several calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-24-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Remove a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-23-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Its one caller already uses a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-22-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Use folio APIs throughout. Saves many calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230324180129.1220691-21-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Remove a call to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-20-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Convert the incoming page to a folio to remove a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-19-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Convert the incoming struct page to a folio. Replaces two implicit calls to compound_head() with one explicit call. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-18-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Remove a lot of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-17-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Convert the incoming page to a folio so that we call compound_head() only once instead of seven times. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-16-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
All callers now have a folio, so pass it and use it. The folio may be large, although I doubt we'll want to use a large folio for an inline file. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-15-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-14-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-13-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-12-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-11-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Use the folio API in this function, saves a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-10-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
The only caller now has a folio so pass it in directly and avoid the call to page_folio() at the beginning. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-9-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
All callers now have a folio so we can pass one in and use the folio APIs to support large folios as well as save instructions by eliminating a call to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-8-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
All callers now have a folio so we can pass one in and use the folio APIs to support large folios as well as save instructions by eliminating calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-7-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
The page/folio is only used to extract the buffers, so this is a simple change. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-6-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Prepare ext4 to support large folios in the page writeback path. Also set the actual error in the mapping, not just -EIO. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-5-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
Remove several calls to compound_head() and the last caller of set_page_writeback_keepwrite(), so remove the wrapper too. Also export bio_add_folio() as this is the first caller from a module. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-4-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
fscrypt_is_bounce_folio() is the equivalent of fscrypt_is_bounce_page() and fscrypt_pagecache_folio() is the equivalent of fscrypt_pagecache_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230324180129.1220691-3-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Matthew Wilcox authored
This particular combination of flags is used by most filesystems in their ->write_begin method, although it does find use in a few other places. Before folios, it warranted its own function (grab_cache_page_write_begin()), but I think that just having specialised flags is enough. It certainly helps the few places that have been converted from grab_cache_page_write_begin() to __filemap_get_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-2-willy@infradead.orgSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
Earlier, inode PAs were stored in a linked list. This caused a need to periodically trim the list down inorder to avoid growing it to a very large size, as this would severly affect performance during list iteration. Recent patches changed this list to an rbtree, and since the tree scales up much better, we no longer need to have the trim functionality, hence remove it. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/c409addceaa3ade4b40328e28e3b54b2f259689e.1679731817.git.ojaswin@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
Currently, the kernel uses i_prealloc_list to hold all the inode preallocations. This is known to cause degradation in performance in workloads which perform large number of sparse writes on a single file. This is mainly because functions like ext4_mb_normalize_request() and ext4_mb_use_preallocated() iterate over this complete list, resulting in slowdowns when large number of PAs are present. Patch 27bc446e partially fixed this by enforcing a limit of 512 for the inode preallocation list and adding logic to continually trim the list if it grows above the threshold, however our testing revealed that a hardcoded value is not suitable for all kinds of workloads. To optimize this, add an rbtree to the inode and hold the inode preallocations in this rbtree. This will make iterating over inode PAs faster and scale much better than a linked list. Additionally, we also had to remove the LRU logic that was added during trimming of the list (in ext4_mb_release_context()) as it will add extra overhead in rbtree. The discards now happen in the lowest-logical-offset-first order. ** Locking notes ** With the introduction of rbtree to maintain inode PAs, we can't use RCU to walk the tree for searching since it can result in partial traversals which might miss some nodes(or entire subtrees) while discards happen in parallel (which happens under a lock). Hence this patch converts the ei->i_prealloc_lock spin_lock to rw_lock. Almost all the codepaths that read/modify the PA rbtrees are protected by the higher level inode->i_data_sem (except ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the only place we need lock protection is when one thread is reading "searching" the PA rbtree (earlier protected under rcu_read_lock()) and another is "deleting" the PAs in ext4_mb_discard_group_preallocations() function (which iterates all the PAs using the grp->bb_prealloc_list and deletes PAs from the tree without taking any inode lock (i_data_sem)). So, this patch converts all rcu_read_lock/unlock() paths for inode list PA to use read_lock() and all places where we were using ei->i_prealloc_lock spinlock will now be using write_lock(). Note that this makes the fast path (searching of the right PA e.g. ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use read_lock() instead of rcu_read_lock/unlock(). Ths also will now block due to slow discard path (ext4_mb_discard_group_preallocations()) which uses write_lock(). But this is not as bad as it looks. This is because - 1. The slow path only occurs when the normal allocation failed and we can say that we are low on disk space. One can argue this scenario won't be much frequent. 2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock for deleting every individual PA. This gives enough opportunity for the fast path to acquire the read_lock for searching the PA inode list. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/4137bce8f6948fedd8bae134dabae24acfe699c6.1679731817.git.ojaswin@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-
Ojaswin Mujoo authored
** Splitting pa->pa_inode_list ** Currently, we use the same pa->pa_inode_list to add a pa to either the inode preallocation list or the locality group preallocation list. For better clarity, split this list into a union of 2 list_heads and use either of the them based on the type of pa. ** Splitting pa->pa_obj_lock ** Currently, pa->pa_obj_lock is either assigned &ei->i_prealloc_lock for inode PAs or lg_prealloc_lock for lg PAs, and is then used to lock the lists containing these PAs. Make the distinction between the 2 PA types clear by changing this lock to a union of 2 locks. Explicitly use the pa_lock_node.inode_lock for inode PAs and pa_lock_node.lg_lock for lg PAs. This patch is required so that the locality group preallocation code remains the same as in upcoming patches we are going to make changes to inode preallocation code to move from list to rbtree based implementation. This patch also makes it easier to review the upcoming patches. There are no functional changes in this patch. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1d7ac0557e998c3fc7eef422b52e4bc67bdef2b0.1679731817.git.ojaswin@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>
-