- 12 Sep, 2024 16 commits
-
-
Dennis Lam authored
Fixed the word "aren't" to "isn't" to agree with the singular word "bufferage". Signed-off-by: Dennis Lam <dennis.lamerice@gmail.com> Link: https://lore.kernel.org/r/20240912012550.13748-2-dennis.lamerice@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
-
Christian Brauner authored
Merge branch 'netfs-writeback' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into vfs.netfs Merge patch series "netfs: Read/write improvements" from David Howells <dhowells@redhat.com>. * 'netfs-writeback' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (25 commits) cifs: Don't support ITER_XARRAY cifs: Switch crypto buffer to use a folio_queue rather than an xarray cifs: Use iterate_and_advance*() routines directly for hashing netfs: Cancel dirty folios that have no storage destination cachefiles, netfs: Fix write to partial block at EOF netfs: Remove fs/netfs/io.c netfs: Speed up buffered reading afs: Make read subreqs async netfs: Simplify the writeback code netfs: Provide an iterator-reset function netfs: Use new folio_queue data type and iterator instead of xarray iter cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs iov_iter: Provide copy_folio_from_iter() mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios netfs: Use bh-disabling spinlocks for rreq->lock netfs: Set the request work function upon allocation netfs: Remove NETFS_COPY_TO_CACHE netfs: Reserve netfs_sreq_source 0 as unset/unknown netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream netfs, cifs: Move CIFS_INO_MODIFIED_ATTR to netfs_inode ... Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
There's now no need to support ITER_XARRAY in cifs as netfslib hands down ITER_FOLIOQ instead - and that's simpler to use with iterate_and_advance() as it doesn't hold the RCU read lock over the step function. This is part of the process of phasing out ITER_XARRAY. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Tom Talpey <tom@talpey.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: linux-cifs@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-26-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Switch cifs from using an xarray to hold the transport crypto buffer to using a folio_queue and use ITER_FOLIOQ rather than ITER_XARRAY. This is part of the process of phasing out ITER_XARRAY. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Tom Talpey <tom@talpey.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: linux-cifs@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-25-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Replace the bespoke cifs iterators of ITER_BVEC and ITER_KVEC to do hashing with iterate_and_advance_kernel() - a variant on iterate_and_advance() that only supports kernel-internal ITER_* types and not UBUF/IOVEC types. The bespoke ITER_XARRAY is left because we don't really want to be calling crypto_shash_update() under the RCU read lock for large amounts of data; besides, ITER_XARRAY is going to be phased out. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Tom Talpey <tom@talpey.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: linux-cifs@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-24-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Kafs wants to be able to cache the contents of directories (and symlinks), but whilst these are downloaded from the server with the FS.FetchData RPC op and similar, the same as for regular files, they can't be updated by FS.StoreData, but rather have special operations (FS.MakeDir, etc.). Now, rather than redownloading a directory's content after each change made to that directory, kafs modifies the local blob. This blob can be saved out to the cache, and since it's using netfslib, kafs just marks the folios dirty and lets ->writepages() on the directory take care of it, as for a regular file. This is fine as long as there's a cache since, although the upload stream is disabled, there's a cache stream to drive the procedure. But if the cache goes away in the meantime, suddenly there's no way to do any writes and the code gets confused, complains "R=%x: No submit" to dmesg and leaves the dirty folio hanging. Fix this by just cancelling the store of the folio if neither stream is active. (If there's no cache at the time of dirtying, we should just not mark the folio dirty). Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-23-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Because it uses DIO writes, cachefiles is unable to make a write to the backing file if that write is not aligned to and sized according to the backing file's DIO block alignment. This makes it tricky to handle a write to the cache where the EOF on the network file is not correctly aligned. To get around this, netfslib attempts to tell the driver it is calling how much more data there is available beyond the EOF that it can use to pad the write (netfslib preclears the part of the folio above the EOF). However, it tries to tell the cache what the maximum length is, but doesn't calculate this correctly; and, in any case, cachefiles actually ignores the value and just skips the block. Fix this by: (1) Change the value passed to indicate the amount of extra data that can be added to the operation (now ->submit_extendable_to). This is much simpler to calculate as it's just the end of the folio minus the top of the data within the folio - rather than having to account for data spread over multiple folios. (2) Make cachefiles add some of this data if the subrequest it is given ends at the network file's i_size if the extra data is sufficient to pad out to a whole block. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-22-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
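A minimal sketch of the padding decision described in (1) and (2) above; the function and parameter names are invented for illustration and are not the actual cachefiles interface, and it assumes the write start is already DIO-aligned:

    #include <linux/math.h>
    #include <linux/types.h>

    /*
     * Hedged sketch, not the real cachefiles code: pad a cache write that
     * stops at the network file's i_size out to a whole DIO block, but only
     * if the precleared space advertised by the issuer (the new
     * ->submit_extendable_to value) is big enough to cover the gap.
     */
    static size_t pad_write_to_dio_block(loff_t start, size_t len, loff_t i_size,
                                         size_t dio_block, size_t extendable_to)
    {
        size_t padded;

        if (start + len != i_size)
            return len;                     /* only pad writes ending at EOF */

        padded = round_up(len, dio_block);  /* dio_block is a power of two */
        if (padded - len <= extendable_to)
            return padded;                  /* enough precleared folio space */
        return len;                         /* otherwise the block is skipped */
    }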
-
David Howells authored
Remove fs/netfs/io.c as it is no longer used. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-21-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Improve the efficiency of buffered reads in a number of ways: (1) Overhaul the algorithm in general so that it's a lot more compact and split the read submission code between buffered and unbuffered versions. The unbuffered version can be vastly simplified. (2) Read-result collection is handed off to a work queue rather than being done in the I/O thread. Multiple subrequests can be processed simultaneously. (3) When a subrequest is collected, any folios it fully spans are collected and "spare" data on either side is donated to either the previous or the next subrequest in the sequence. Notes: (*) Readahead expansion massively slows down fio, presumably because it causes a load of extra allocations, both folio and xarray, up front before RPC requests can be transmitted. (*) RDMA with cifs does appear to work, both with SIW and RXE. (*) PG_private_2-based reading and copy-to-cache is split out into its own file and altered to use folio_queue. Note that the copy to the cache now creates a new write transaction against the cache and adds the folios to be copied into it. This allows it to use part of the writeback I/O code. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Perform AFS read subrequests in a work item rather than in the calling thread. For normal buffered reads, this will allow the calling thread to copy data from the pagecache to the application at the same time as the demarshalling thread is shovelling data from skbuffs into the pagecache. This will also allow the RA mark to trigger a new read before we've finished shovelling the data from the current one. Note: This would be a bit safer if the FS.FetchData RPC ops returned the metadata (including the data version number) before returning the data. This would allow me to flush the pagecache before installing the new data. In future, it may be possible to asynchronously flush the pagecache either side of the region being read. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-19-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
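The general shape of moving the subrequest off the issuing thread is the standard workqueue pattern; a hedged sketch with made-up names (my_read_subreq and friends), not the actual afs code:

    #include <linux/workqueue.h>
    #include <linux/container_of.h>

    struct my_read_subreq {
        struct work_struct work;
        /* ... per-subrequest state (file, offset, length) ... */
    };

    static void my_read_subreq_worker(struct work_struct *work)
    {
        struct my_read_subreq *subreq =
            container_of(work, struct my_read_subreq, work);

        /* Issue the FetchData RPC and unmarshal the reply into the pagecache. */
        (void)subreq;
    }

    static void my_issue_read_subreq(struct my_read_subreq *subreq)
    {
        INIT_WORK(&subreq->work, my_read_subreq_worker);
        /* Run asynchronously so the caller can keep copying to userspace. */
        queue_work(system_unbound_wq, &subreq->work);
    }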
-
David Howells authored
Use the new folio_queue structures to simplify the writeback code. The problem with referring to the i_pages xarray directly is that we may have gaps in the sequence of folios we're writing from that we need to skip when we're removing the writeback mark from the folios we're writing back from. At the moment the code tries to deal with this by carefully tracking the gaps in each writeback stream (eg. write to server and write to cache) and divining when there's a gap that spans folios (something that's not helped by folios not being a consistent size). Instead, the folio_queue buffer contains pointers only to the folios we're dealing with, has them in ascending order and indicates a gap by placing non-consecutive folios next to each other. This makes it possible to track where we need to clean up to by just keeping track of where we've processed to on each stream and taking the minimum. Note that the I/O iterator is always rounded up to the end of the folio, even if that is beyond the EOF position, so that the cache can do DIO from the page. The excess space is cleared, though mmapped writes clobber it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-18-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
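The "take the minimum" step reduces to a trivial scan over the streams' progress; a hedged sketch with illustrative names, not the netfs collector itself:

    #include <linux/minmax.h>
    #include <linux/limits.h>

    /*
     * Hedged sketch: with folios held in ascending order in the folio_queue,
     * the point up to which writeback state can be cleaned is the minimum of
     * how far each active stream (upload, cache) has been collected.
     */
    static loff_t writeback_cleaned_to(const loff_t *stream_collected_to,
                                       unsigned int nr_streams)
    {
        loff_t cleaned_to = LLONG_MAX;
        unsigned int i;

        for (i = 0; i < nr_streams; i++)
            cleaned_to = min(cleaned_to, stream_collected_to[i]);
        return cleaned_to;
    }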
-
David Howells authored
Provide a function to reset the iterator on a subrequest. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-17-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Make the netfs write-side routines use the new folio_queue struct to hold a rolling buffer of folios, with the issuer adding folios at the tail and the collector removing them from the head as they're processed instead of using an xarray. This will allow a subsequent patch to simplify the write collector. The primary mark (as tested by folioq_is_marked()) is used to note if the corresponding folio needs putting. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-16-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
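On the collection side, the primary mark drives the put; a hedged sketch using the folioq_* accessors introduced by this series (the surrounding function is illustrative, not the actual netfs collector):

    #include <linux/folio_queue.h>
    #include <linux/mm.h>

    /* Sketch: put only the folios whose primary mark is set. */
    static void collect_folioq_segment(struct folio_queue *folioq)
    {
        unsigned int slot;

        for (slot = 0; slot < folioq_count(folioq); slot++) {
            struct folio *folio = folioq_folio(folioq, slot);

            if (folioq_is_marked(folioq, slot))
                folio_put(folio);   /* reference was taken by the issuer */
        }
    }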
-
David Howells authored
Make smb_extract_iter_to_rdma() extract page fragments from an ITER_FOLIOQ iterator into RDMA SGEs. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Tom Talpey <tom@talpey.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: linux-cifs@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-15-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Provide a copy_folio_from_iter() wrapper. Signed-off-by: David Howells <dhowells@redhat.com> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christian Brauner <christian@brauner.io> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20240814203850.2240469-14-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
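A hedged guess at the shape of such a wrapper (see include/linux/uio.h for the real definition; this sketch is renamed to avoid claiming it is the actual code):

    #include <linux/uio.h>
    #include <linux/mm_types.h>

    /* Sketch: a folio-typed front end onto copy_page_from_iter(). */
    static inline size_t sketch_copy_folio_from_iter(struct folio *folio,
                                                     size_t offset, size_t bytes,
                                                     struct iov_iter *i)
    {
        return copy_page_from_iter(&folio->page, offset, bytes, i);
    }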
-
David Howells authored
Define a data structure, struct folio_queue, to represent a sequence of folios and a kernel-internal I/O iterator type, ITER_FOLIOQ, to allow a list of folio_queue structures to be used to provide a buffer to iov_iter-taking functions, such as sendmsg and recvmsg.

The folio_queue structure looks like:

    struct folio_queue {
        struct folio_batch vec;
        u8 orders[PAGEVEC_SIZE];
        struct folio_queue *next;
        struct folio_queue *prev;
        unsigned long marks;
        unsigned long marks2;
    };

It does not use a list_head so that next and/or prev can be set to NULL at the ends of the list, allowing iov_iter-handling routines to determine that they *are* the ends without needing to store a head pointer in the iov_iter struct.

A folio_batch struct is used to hold the folio pointers which allows the batch to be passed to batch handling functions. Two mark bits are available per slot. The intention is to use at least one of them to mark folios that need putting, but that might not be ultimately necessary. Accessor functions are used to access the slots to do the masking and an additional accessor function is used to indicate the size of the array. The order of each folio is also stored in the structure to avoid the need for iov_iter_advance() and iov_iter_revert() to have to query each folio to find its size.

With careful barriering, this can be used as an extending buffer with new folios inserted and new folio_queue structs added without the need for a lock. Further, provided we always keep at least one struct in the buffer, we can also remove consumed folios and consumed structs from the head end as we go, without the need for locks.

[Questions/thoughts]

(1) To manage this, I need a head pointer, a tail pointer, a tail slot number (assuming insertion happens at the tail end and the next pointers point from head to tail). Should I put these into a struct of their own, say "folio_queue_head" or "rolling_buffer"? I will end up with two of these in netfs_io_request eventually, one keeping track of the pagecache I'm dealing with for buffered I/O and the other to hold a bounce buffer when we need one.

(2) Should I make the slots {folio,off,len} or bio_vec?

(3) This is intended to replace ITER_XARRAY eventually. Using an xarray in I/O iteration requires the taking of the RCU read lock, doing copying under the RCU read lock, walking the xarray (which may change under us), handling retries and dealing with special values. The advantage of ITER_XARRAY is that when we're dealing with the pagecache directly, we don't need any allocation - but if we're doing encrypted comms, there's a good chance we'd be using a bounce buffer anyway. This will require afs, erofs, cifs, orangefs and fscache to be converted to not use this. afs still uses it for dirs and symlinks; some of erofs usages should be easy to change, but there's one which won't be so easy; ceph's use via fscache can be fixed by porting ceph to netfslib; cifs is using xarray as a bounce buffer - that can be moved to use sheaves instead; and orangefs has a similar problem to erofs - maybe orangefs could use netfslib?
Signed-off-by: David Howells <dhowells@redhat.com> cc: Matthew Wilcox <willy@infradead.org> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Ilya Dryomov <idryomov@gmail.com> cc: Gao Xiang <xiang@kernel.org> cc: Mike Marshall <hubcap@omnibond.com> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: linux-erofs@lists.ozlabs.org cc: devel@lists.orangefs.org Link: https://lore.kernel.org/r/20240814203850.2240469-13-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
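To make the NULL-terminated next/prev design concrete, here is a hedged sketch of appending a segment at the tail of a chain; the helper name is invented and folioq_init() is assumed to take just the queue pointer (its exact signature may differ):

    #include <linux/folio_queue.h>
    #include <linux/slab.h>

    /* Sketch: grow the rolling buffer by one folio_queue segment at the tail. */
    static struct folio_queue *append_folioq_segment(struct folio_queue *tail)
    {
        struct folio_queue *fq = kmalloc(sizeof(*fq), GFP_KERNEL);

        if (!fq)
            return NULL;
        folioq_init(fq);        /* clears the batch, orders and marks */
        fq->prev = tail;        /* ends of the chain stay NULL-terminated */
        if (tail)
            tail->next = fq;
        return fq;
    }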
-
- 05 Sep, 2024 10 commits
-
-
David Howells authored
Use bh-disabling spinlocks when accessing rreq->lock because, in the future, it may be twiddled from softirq context when cleanup is driven from cache backend DIO completion. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-12-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
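The pattern this prepares for is the usual one for locks shared with softirq context; a minimal, illustrative sketch:

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(example_rreq_lock);

    static void process_context_user(void)
    {
        /*
         * Disable bottom halves while holding the lock so a softirq-driven
         * completion path taking the same lock cannot deadlock against us.
         */
        spin_lock_bh(&example_rreq_lock);
        /* ... manipulate the request's subrequest lists ... */
        spin_unlock_bh(&example_rreq_lock);
    }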
-
David Howells authored
Set the work function in the netfs_io_request work_struct when we allocate the request rather than doing this later. This reduces the number of places we need to set it in future code. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-11-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Remove NETFS_COPY_TO_CACHE as it isn't used anymore. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-10-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Reserve the 0-valued netfs_sreq_source to mean unset or unknown so that it can be seen in the trace as such rather than appearing as download-from-server when it's going to get switched to something else. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-9-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Move max_len/max_nr_segs from struct netfs_io_subrequest to struct netfs_io_stream as we only issue one subreq at a time and then don't need these values again for that subreq unless and until we have to retry it - in which case we want to renegotiate them. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-8-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Move CIFS_INO_MODIFIED_ATTR to netfs_inode as NETFS_ICTX_MODIFIED_ATTR and then make netfs_perform_write() set it. This means that cifs doesn't need to implement the ->post_modify() hook. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-7-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Reduce the number of conditional branches in netfs_perform_write() by merging in netfs_how_to_modify() and then creating a separate if-statement for each way we might modify a folio. Note that this means replicating the data copy in each path. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-6-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Record statistics for contention upon the writeback serialisation lock that prevents racing writeback calls from causing each other to interleave their writebacks. These can be viewed in /proc/fs/netfs/stats on the WbLock line, with skip=N indicating the number of non-SYNC writebacks skipped and wait=N indicating the number of SYNC writebacks that waited. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-5-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Adjust the labels in /proc/fs/netfs/stats that refer to netfs-specific counters. These currently all begin with "Netfs", but change them to begin with more specific labels. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-4-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
-
David Howells authored
Unlike other vfs_xxxx() calls, vfs_setxattr() and vfs_removexattr() don't take the sb_writers lock, so the caller should do it for them. Fix cachefiles to do this. Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem") Signed-off-by: David Howells <dhowells@redhat.com> cc: Christian Brauner <brauner@kernel.org> cc: Gao Xiang <xiang@kernel.org> cc: netfs@lists.linux.dev cc: linux-erofs@lists.ozlabs.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-3-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
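The caller-side pattern being fixed looks roughly like the following; this is a hedged sketch (mnt_want_write() acquires sb_writers on the backing mount), not the literal cachefiles diff:

    #include <linux/mount.h>
    #include <linux/xattr.h>
    #include <linux/mnt_idmap.h>

    /* Sketch: hold write access (and thus sb_writers) around vfs_setxattr(). */
    static int set_backing_xattr(struct vfsmount *mnt, struct dentry *dentry,
                                 const char *name, const void *value, size_t len)
    {
        int ret = mnt_want_write(mnt);

        if (ret < 0)
            return ret;
        ret = vfs_setxattr(&nop_mnt_idmap, dentry, name, value, len, 0);
        mnt_drop_write(mnt);
        return ret;
    }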
-
- 04 Sep, 2024 5 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Linus Torvalds authored
Pull vfs fixes from Christian Brauner: "Two netfs fixes for this merge window: - Ensure that fscache_cookie_lru_time is deleted when the fscache module is removed to prevent UAF - Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range() Before it used truncate_inode_pages_partial() which causes copy_file_range() to fail on cifs" * tag 'vfs-6.11-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fscache: delete fscache_cookie_lru_timer when fscache exits to avoid UAF mm: Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()
-
git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux Linus Torvalds authored
Pull ARM fix from Russell King: - Fix a build issue with older binutils with LD dead code elimination disabled * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux: ARM: 9414/1: Fix build issue with LD_DEAD_CODE_DATA_ELIMINATION
-
Linus Torvalds authored
Merge tag 'parisc-for-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc architecture fix from Helge Deller: - Fix boot issue where boot memory is marked read-only too early * tag 'parisc-for-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Delay write-protection until mark_rodata_ro() call
-
Linus Torvalds authored
Merge tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes, 15 of which are cc:stable. Mostly MM, no identifiable theme. And a few nilfs2 fixups" * tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area mailmap: update entry for Jan Kuliga codetag: debug: mark codetags for poisoned page as empty mm/memcontrol: respect zswap.writeback setting from parent cg too scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum Revert "mm: skip CMA pages when they are not available" maple_tree: remove rcu_read_lock() from mt_validate() kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y mm/slub: add check for s->flags in the alloc_tagging_slab_free_hook nilfs2: fix state management in error path of log writing function nilfs2: fix missing cleanup on rollforward recovery error nilfs2: protect references to superblock parameters exposed in sysfs userfaultfd: don't BUG_ON() if khugepaged yanks our page table userfaultfd: fix checks for huge PMDs mm: vmalloc: ensure vmap_block is initialised before adding to queue selftests: mm: fix build errors on armhf
-
Yuntao Liu authored
There is a build issue in which LD gets a segmentation fault while CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is not enabled, as below. scripts/link-vmlinux.sh: line 49: 3796 Segmentation fault (core dumped) ${ld} ${ldflags} -o ${output} ${wl}--whole-archive ${objs} ${wl}--no-whole-archive ${wl}--start-group ${libs} ${wl}--end-group ${kallsymso} ${btf_vmlinux_bin_o} ${ldlibs} The error occurs in versions of GNU ld earlier than 2.36. It makes most sense to have a minimum LD version as a dependency for HAVE_LD_DEAD_CODE_DATA_ELIMINATION and eliminate the impact of ".reloc .text, R_ARM_NONE, ." when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is not enabled. Fixes: ed0f9410 ("ARM: 9404/1: arm32: enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION") Reported-by: Harith George <mail2hgg@gmail.com> Tested-by: Harith George <mail2hgg@gmail.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com> Link: https://lore.kernel.org/all/14e9aefb-88d1-4eee-8288-ef15d4a9b059@gmail.com/ Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
-
- 03 Sep, 2024 2 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Linus Torvalds authored
Pull fuse fixes from Miklos Szeredi: - Fix EIO if splice and page stealing are enabled on the fuse device - Disable problematic combination of passthrough and writeback-cache - Other bug fixes found by code review * tag 'fuse-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: disable the combination of passthrough and writeback cache fuse: update stats for pages in dropped aux writeback list fuse: clear PG_uptodate when using a stolen page fuse: fix memory leak in fuse_create_open fuse: check aborted connection before adding requests to pending list for resending fuse: use unsigned type for getxattr/listxattr size truncation
-
Helge Deller authored
Do not write-protect the kernel read-only and __ro_after_init sections before mark_rodata_ro() is called. This fixes a boot issue on parisc which is triggered by commit 91a1d97e ("jump_label,module: Don't alloc static_key_mod for __ro_after_init keys"). That commit may modify static key contents in the __ro_after_init section at bootup, so this section needs to be writable at least until mark_rodata_ro() is called. Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk> Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Link: https://lore.kernel.org/linux-parisc/096cad5aada514255cd7b0b9dbafc768@matoro.tk/#r Fixes: 91a1d97e ("jump_label,module: Don't alloc static_key_mod for __ro_after_init keys") Cc: stable@vger.kernel.org # v6.10+
-
- 02 Sep, 2024 7 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux Linus Torvalds authored
Pull ata fix from Damien Le Moal: - Fix a potential memory leak in the ata host initialization code (from Zheng) * tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux: ata: libata: Fix memory leak for error path in ata_host_alloc()
-
Suren Baghdasaryan authored
codetag_module_init() is used to initialize sections containing allocation tags. This function is used to initialize module sections as well as core kernel sections, in which case the module parameter is set to NULL. This function has to be called even when CONFIG_MODULES=n to initialize core kernel allocation tag sections. When CONFIG_MODULES=n, this function is a NOP, which is wrong. This leads to /proc/allocinfo reported as empty. Fix this by making it independent of CONFIG_MODULES. Link: https://lkml.kernel.org/r/20240828231536.1770519-1-surenb@google.com Fixes: 916cc516 ("lib: code tagging framework") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Adrian Huang authored
When running the vmalloc stress on a 448-core system, observe the average latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc 'funclatency.py' tool [1]. # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1 usecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 29 | | 4 -> 7 : 19 | | 8 -> 15 : 56 | | 16 -> 31 : 483 |**** | 32 -> 63 : 1548 |************ | 64 -> 127 : 2634 |********************* | 128 -> 255 : 2535 |********************* | 256 -> 511 : 1776 |************** | 512 -> 1023 : 1015 |******** | 1024 -> 2047 : 573 |**** | 2048 -> 4095 : 488 |**** | 4096 -> 8191 : 1091 |********* | 8192 -> 16383 : 3078 |************************* | 16384 -> 32767 : 4821 |****************************************| 32768 -> 65535 : 3318 |*************************** | 65536 -> 131071 : 1718 |************** | 131072 -> 262143 : 2220 |****************** | 262144 -> 524287 : 1147 |********* | 524288 -> 1048575 : 1179 |********* | 1048576 -> 2097151 : 822 |****** | 2097152 -> 4194303 : 906 |******* | 4194304 -> 8388607 : 2148 |***************** | 8388608 -> 16777215 : 4497 |************************************* | 16777216 -> 33554431 : 289 |** | avg = 2041714 usecs, total: 78381401772 usecs, count: 38390 The worst case is over 16-33 seconds, so soft lockup is triggered [2]. [Root Cause] 1) Each purge_list has the long list. The following shows the number of vmap_area is purged. crash> p vmap_nodes vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000 crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged nr_purged = 663070 ... nr_purged = 821670 nr_purged = 692214 nr_purged = 726808 ... 2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic operation when purging each vmap_area. However, the iteration is over 600000 vmap_area (See 'nr_purged' above). Here is objdump output: $ objdump -D vmlinux ffffffff813e8c80 <purge_vmap_node>: ... ffffffff813e8d70: f0 48 29 2d 68 0c bb lock sub %rbp,0x2bb0c68(%rip) ... Quote from "Instruction tables" pdf file [3]: Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. That's why the latency of purge_vmap_node() dramatically increases on a many-core system: One core is busy on purging each vmap_area of the *long* purge_list and executing atomic_long_sub() for each vmap_area, while other cores free vmalloc allocations and execute atomic_long_add_return() in free_vmap_area_noflush(). [Solution] Employ a local variable to record the total purged pages, and execute atomic_long_sub() after the traversal of the purge_list is done. The experiment result shows the latency improvement is 99%. [Experiment Result] 1) System Configuration: Three servers (with HT-enabled) are tested. 
* 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
* 192-core server: 5th Gen Intel Xeon Scalable Processor*2
* 448-core server: AMD Zen 4 Processor*2
2) Kernel Config
* CONFIG_KASAN is disabled
3) The data in columns "w/o patch" and "w/ patch"
* Unit: microseconds (us)
* Each value is the average of three measurements

    System            w/o patch (us)   w/ patch (us)   Improvement (%)
    ---------------   --------------   -------------   ---------------
    72-core server              2194              14            99.36%
    192-core server           143799            1139            99.21%
    448-core server          1992122            6883            99.65%

[1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
[2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
[3] https://www.agner.org/optimize/instruction_tables.pdf
Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.com Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
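The fix amounts to hoisting the atomic out of the per-vmap_area loop; a hedged sketch of the pattern (illustrative, not the literal mm/vmalloc.c change):

    #include <linux/atomic.h>
    #include <linux/list.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* Sketch: accumulate locally, then do one LOCK-prefixed subtraction. */
    static void purge_list_sketch(struct list_head *purge_list,
                                  atomic_long_t *lazy_nr)
    {
        struct vmap_area *va, *next;
        unsigned long nr_purged = 0;

        list_for_each_entry_safe(va, next, purge_list, list) {
            nr_purged += (va->va_end - va->va_start) >> PAGE_SHIFT;
            /* ... detach and free the vmap_area ... */
        }
        atomic_long_sub(nr_purged, lazy_nr);    /* once, not once per area */
    }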
-
Jan Kuliga authored
Soon I won't be able to use my current email address. Link: https://lkml.kernel.org/r/20240830095658.1203198-1-jankul@alatek.krakow.pl Signed-off-by: Jan Kuliga <jankul@alatek.krakow.pl> Cc: David S. Miller <davem@davemloft.net> Cc: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hao Ge authored
When PG_hwpoison pages are freed they are treated differently in free_pages_prepare() and instead of being released they are isolated. Page allocation tag counters are decremented at this point since the page is considered not in use. Later on when such pages are released by unpoison_memory(), the allocation tag counters will be decremented again and the following warning gets reported: [ 113.930443][ T3282] ------------[ cut here ]------------ [ 113.931105][ T3282] alloc_tag was not set [ 113.931576][ T3282] WARNING: CPU: 2 PID: 3282 at ./include/linux/alloc_tag.h:130 pgalloc_tag_sub.part.66+0x154/0x164 [ 113.932866][ T3282] Modules linked in: hwpoison_inject fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_man4 [ 113.941638][ T3282] CPU: 2 UID: 0 PID: 3282 Comm: madvise11 Kdump: loaded Tainted: G W 6.11.0-rc4-dirty #18 [ 113.943003][ T3282] Tainted: [W]=WARN [ 113.943453][ T3282] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 [ 113.944378][ T3282] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 113.945319][ T3282] pc : pgalloc_tag_sub.part.66+0x154/0x164 [ 113.946016][ T3282] lr : pgalloc_tag_sub.part.66+0x154/0x164 [ 113.946706][ T3282] sp : ffff800087093a10 [ 113.947197][ T3282] x29: ffff800087093a10 x28: ffff0000d7a9d400 x27: ffff80008249f0a0 [ 113.948165][ T3282] x26: 0000000000000000 x25: ffff80008249f2b0 x24: 0000000000000000 [ 113.949134][ T3282] x23: 0000000000000001 x22: 0000000000000001 x21: 0000000000000000 [ 113.950597][ T3282] x20: ffff0000c08fcad8 x19: ffff80008251e000 x18: ffffffffffffffff [ 113.952207][ T3282] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800081746210 [ 113.953161][ T3282] x14: 0000000000000000 x13: 205d323832335420 x12: 5b5d353031313339 [ 113.954120][ T3282] x11: ffff800087093500 x10: 000000000000005d x9 : 00000000ffffffd0 [ 113.955078][ T3282] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008236ba90 x6 : c0000000ffff7fff [ 113.956036][ T3282] x5 : ffff000b34bf4dc8 x4 : ffff8000820aba90 x3 : 0000000000000001 [ 113.956994][ T3282] x2 : ffff800ab320f000 x1 : 841d1e35ac932e00 x0 : 0000000000000000 [ 113.957962][ T3282] Call trace: [ 113.958350][ T3282] pgalloc_tag_sub.part.66+0x154/0x164 [ 113.959000][ T3282] pgalloc_tag_sub+0x14/0x1c [ 113.959539][ T3282] free_unref_page+0xf4/0x4b8 [ 113.960096][ T3282] __folio_put+0xd4/0x120 [ 113.960614][ T3282] folio_put+0x24/0x50 [ 113.961103][ T3282] unpoison_memory+0x4f0/0x5b0 [ 113.961678][ T3282] hwpoison_unpoison+0x30/0x48 [hwpoison_inject] [ 113.962436][ T3282] simple_attr_write_xsigned.isra.34+0xec/0x1cc [ 113.963183][ T3282] simple_attr_write+0x38/0x48 [ 113.963750][ T3282] debugfs_attr_write+0x54/0x80 [ 113.964330][ T3282] full_proxy_write+0x68/0x98 [ 113.964880][ T3282] vfs_write+0xdc/0x4d0 [ 113.965372][ T3282] ksys_write+0x78/0x100 [ 113.965875][ T3282] __arm64_sys_write+0x24/0x30 [ 113.966440][ T3282] invoke_syscall+0x7c/0x104 [ 113.966984][ T3282] el0_svc_common.constprop.1+0x88/0x104 [ 113.967652][ T3282] do_el0_svc+0x2c/0x38 [ 113.968893][ T3282] el0_svc+0x3c/0x1b8 [ 113.969379][ T3282] el0t_64_sync_handler+0x98/0xbc [ 113.969980][ T3282] el0t_64_sync+0x19c/0x1a0 [ 113.970511][ T3282] ---[ end trace 0000000000000000 ]--- To fix this, clear the page tag reference after the page got isolated and accounted for. 
Link: https://lkml.kernel.org/r/20240825163649.33294-1-hao.ge@linux.dev Fixes: d224eb02 ("codetag: debug: mark codetags for reserved pages as empty") Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hao Ge <gehao@kylinos.cn> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> [6.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Yuan authored
Currently, the behavior of zswap.writeback wrt. the cgroup hierarchy seems a bit odd. Unlike zswap.max, it doesn't honor the value from parent cgroups. This surfaced when people tried to globally disable zswap writeback, i.e. reserve physical swap space only for hibernation [1] - disabling zswap.writeback only for the root cgroup results in subcgroups with zswap.writeback=1 still performing writeback. The inconsistency became more noticeable after I introduced the MemoryZSwapWriteback= systemd unit setting [2] for controlling the knob. The patch assumed that the kernel would enforce the value of parent cgroups. It could probably be worked around from systemd's side, by going up the slice unit tree and inheriting the value. Yet I think it's more sensible to make it behave consistently with zswap.max and friends. [1] https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate#Disable_zswap_writeback_to_use_the_swap_space_only_for_hibernation [2] https://github.com/systemd/systemd/pull/31734 Link: https://lkml.kernel.org/r/20240823162506.12117-1-me@yhndnzj.com Fixes: 501a06fe ("zswap: memcontrol: implement zswap writeback disabling") Signed-off-by: Mike Yuan <me@yhndnzj.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
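The consistent behaviour described here boils down to checking every ancestor; a hedged sketch of the intended semantics (the field and helper names follow the zswap writeback-disabling feature, but this is not the literal mm/memcontrol.c change):

    #include <linux/memcontrol.h>

    /* Sketch: writeback is allowed only if no ancestor has disabled it. */
    static bool zswap_writeback_allowed(struct mem_cgroup *memcg)
    {
        for (; memcg; memcg = parent_mem_cgroup(memcg))
            if (!READ_ONCE(memcg->zswap_writeback))
                return false;
        return true;
    }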
-
Marc Zyngier authored
Richard reports that since 772dd034 ("mm: enumerate all gfp flags"), gfp-translate is broken, as the bit numbers are implicit, leaving the shell script unable to extract them. Even more, some bits are now at a variable location, making it double extra hard to parse using a simple shell script. Use a brute-force approach to the problem by generating a small C stub that will use the enum to dump the interesting bits. As an added bonus, we are now able to identify invalid bits for a given configuration. As an added drawback, we cannot parse include files that predate this change anymore. Tough luck. Link: https://lkml.kernel.org/r/20240823163850.3791201-1-maz@kernel.org Fixes: 772dd034 ("mm: enumerate all gfp flags") Signed-off-by: Marc Zyngier <maz@kernel.org> Reported-by: Richard Weinberger <richard@nod.at> Cc: Petr Tesařík <petr@tesarici.cz> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-