- 22 Aug, 2023 2 commits
-
-
Jeff Layton authored
When creating a new inode, we need to determine the crypto context before we can transmit the RPC. The fscrypt API has a routine for getting a crypto context before a create occurs, but it requires an inode. Change the ceph code to preallocate an inode in advance of a create of any sort (open(), mknod(), symlink(), etc). Move the existing code that generates the ACL and SELinux blobs into this routine since that's mostly common across all the different codepaths. In most cases, we just want to allow ceph_fill_trace to use that inode after the reply comes in, so add a new field to the MDS request for it (r_new_inode). The async create codepath is a bit different though. In that case, we want to hash the inode in advance of the RPC so that it can be used before the reply comes in. If the call subsequently fails with -EJUKEBOX, then just put the references and clean up the as_ctx. Note that with this change, we now need to regenerate the as_ctx when this occurs, but it's quite rare for it to happen. Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by:
Luís Henriques <lhenriques@suse.de> Reviewed-by:
Milind Changire <mchangir@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
Add a new mount option that has the client issue sparse reads instead of normal ones. The callers now preallocate an sparse extent buffer that the libceph receive code can populate and hand back after the operation completes. After a successful sparse read, we can't use the req->r_result value to determine the amount of data "read", so instead we set the received length to be from the end of the last extent in the buffer. Any interstitial holes will have been filled by the receive code. [ xiubli: fix a double free on req reported by Ilya ] Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by:
Luís Henriques <lhenriques@suse.de> Reviewed-by:
Milind Changire <mchangir@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 30 Jun, 2023 1 commit
-
-
Xiubo Li authored
For write requests the parent's mtime will be updated correspondingly. And if the 'Xx' caps is issued and when releasing other caps together with the write requests the MDS Locker will try to eval the xattr lock, which need to change the locker state excl --> sync and need to do Xx caps revocation. Just voluntarily dropping CEPH_CAP_XATTR_EXCL caps to avoid a cap revoke message, which could cause the mtime will be overwrote by stale one. [ idryomov: break unnecessarily long lines ] Link: https://tracker.ceph.com/issues/61584 Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Milind Changire <mchangir@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 09 Jun, 2023 2 commits
-
-
Christoph Hellwig authored
All callers of generic_perform_write need to updated ki_pos, move it into common code. Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Theodore Ts'o <tytso@mit.edu> Acked-by:
Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Patch series "cleanup the filemap / direct I/O interaction", v4. This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to an use iomap based buffered write path in the block layer. This patch (of 12): The last user of current->backing_dev_info disappeared in commit b9b1335e ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it. Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Darrick J. Wong <djwong@kernel.org> Acked-by:
Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- 24 May, 2023 1 commit
-
-
David Howells authored
Provide a splice_read wrapper for Ceph. This does the inode shutdown check before proceeding and jumps to copy_splice_read() if the file has inline data or is a synchronous file. We try and get FILE_RD and either FILE_CACHE and/or FILE_LAZYIO caps and hold them across filemap_splice_read(). If we fail to get FILE_CACHE or FILE_LAZYIO capabilities, we use copy_splice_read() instead. Signed-off-by:
David Howells <dhowells@redhat.com> Reviewed-by:
Xiubo Li <xiubli@redhat.com> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Ilya Dryomov <idryomov@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-17-dhowells@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 26 Feb, 2023 1 commit
-
-
Xiubo Li authored
The fallocate will try to clear the suid/sgid if a unprevileged user changed the file. There is no POSIX item requires that we should clear the suid/sgid in fallocate code path but this is the default behaviour for most of the filesystems and the VFS layer. And also the same for the write code path, which have already support it. And also we need to update the time stamps since the fallocate will change the file contents. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/58054 Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 03 Feb, 2023 1 commit
-
-
Christoph Hellwig authored
Use the bvec_set_page helper to initialize a bvec. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230203150634.3199647-13-hch@lst.de Reviewed-by:
Ilya Dryomov <idryomov@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 02 Feb, 2023 1 commit
-
-
Xiubo Li authored
When received corrupted snap trace we don't know what exactly has happened in MDS side. And we shouldn't continue IOs and metadatas access to MDS, which may corrupt or get incorrect contents. This patch will just block all the further IO/MDS requests immediately and then evict the kclient itself. The reason why we still need to evict the kclient just after blocking all the further IOs is that the MDS could revoke the caps faster. Link: https://tracker.ceph.com/issues/57686 Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Venky Shankar <vshankar@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 12 Dec, 2022 2 commits
-
-
Xiubo Li authored
We should call the check_caps() again immediately after the async creating finishes in case the MDS is waiting for caps revocation to finish. Link: https://tracker.ceph.com/issues/46904 Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Ilya Dryomov <idryomov@gmail.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Xiubo Li authored
The session parameter makes no sense any more. Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Ilya Dryomov <idryomov@gmail.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 25 Nov, 2022 1 commit
-
-
Al Viro authored
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- 09 Aug, 2022 2 commits
-
-
Al Viro authored
Most of the users immediately follow successful iov_iter_get_pages() with advancing by the amount it had returned. Provide inline wrappers doing that, convert trivial open-coded uses of those. BTW, iov_iter_get_pages() never returns more than it had been asked to; such checks in cifs ought to be removed someday... Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(), checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC ones. We are going to expose the things like ->write_iter() et.al. to those in subsequent commits. New predicate (user_backed_iter()) that is true for ITER_IOVEC and ITER_UBUF; places like direct-IO handling should use that for checking that pages we modify after getting them from iov_iter_get_pages() would need to be dirtied. DO NOT assume that replacing iter_is_iovec() with user_backed_iter() will solve all problems - there's code that uses iter_is_iovec() to decide how to poke around in iov_iter guts and for that the predicate replacement obviously won't suffice. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- 03 Aug, 2022 1 commit
-
-
Jeff Layton authored
This function always returns 0, and ignores the nofail boolean. Drop the nofail argument, make the function void return and fix up the callers. Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Ilya Dryomov <idryomov@gmail.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 02 Aug, 2022 6 commits
-
-
Hu Weiwen authored
Clear O_TRUNC from the flags sent in the MDS create request. `atomic_open' is called before permission check. We should not do any modification to the file here. The caller will do the truncation afterward. Fixes: 124e68e7 ("ceph: file operations") Signed-off-by:
Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Xiubo Li authored
When the quota is approaching we need to notify it to the MDS as soon as possible, or the client could write to the directory more than expected. This will flush the dirty caps without delaying after each write, though this couldn't prevent the real size of a directory exceed the quota but could prevent it as soon as possible. Link: https://tracker.ceph.com/issues/56180 Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Luís Henriques <lhenriques@suse.de> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Xiubo Li authored
If the 'i_inline_version' is 1, that means the file is just new created and there shouldn't have any inline data in it, we should skip retrieving the inline data from MDS. This also could help reduce possiblity of dead lock issue introduce by the inline data and Fcr caps. Gradually we will remove the inline feature from kclient after ceph's scrub too have support to unline the inline data, currently this could help reduce the teuthology test failures. This is possiblly could also fix a bug that for some old clients if they couldn't explictly uninline the inline data when writing, the inline version will keep as 1 always. We may always reading non-exist data from inline data. Signed-off-by:
Xiubo Li <xiubli@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Xiubo Li authored
For async create we will always try to choose the auth MDS of frag the dentry belonged to of the parent directory to send the request and ususally this works fine, but if the MDS migrated the directory to another MDS before it could be handled the request will be forwarded. And then the auth cap will be changed. We need to update the auth cap in this case before the request is forwarded. Link: https://tracker.ceph.com/issues/55857 Signed-off-by:
Xiubo Li <xiubli@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
There's no reason we need to lock the inode for write in order to handle an llseek. I suspect this should have been dropped in 2013 when we stopped doing vmtruncate in llseek. With that gone, ceph_llseek is functionally equivalent to generic_file_llseek, so just call that after getting the size. Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Luís Henriques <lhenriques@suse.de> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Xiubo Li authored
In async unlink case the kclient won't wait for the first reply from MDS and just drop all the links and unhash the dentry and then succeeds immediately. For any new create/link/rename,etc requests followed by using the same file names we must wait for the first reply of the inflight unlink request, or the MDS possibly will fail these following requests with -EEXIST if the inflight async unlink request was delayed for some reasons. And the worst case is that for the none async openc request it will successfully open the file if the CDentry hasn't been unlinked yet, but later the previous delayed async unlink request will remove the CDenty. That means the just created file is possiblly deleted later by accident. We need to wait for the inflight async unlink requests to finish when creating new files/directories by using the same file names. Link: https://tracker.ceph.com/issues/55332 Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 21 Jul, 2022 1 commit
-
-
Yang Xu authored
Now that we finished moving setgid stripping for regular files in setgid directories into the vfs, individual filesystem don't need to manually strip the setgid bit anymore. Drop the now unneeded code from ceph. Link: https://lore.kernel.org/r/1657779088-2242-4-git-send-email-xuyang2018.jy@fujitsu.com Reviewed-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by: Christian Brauner (Microsoft)<brauner@kernel.org> Reviewed-and-Tested-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Yang Xu <xuyang2018.jy@fujitsu.com> Signed-off-by:
Christian Brauner (Microsoft) <brauner@kernel.org>
-
- 09 Jun, 2022 1 commit
-
-
David Howells authored
While randstruct was satisfied with using an open-coded "void *" offset cast for the netfs_i_context <-> inode casting, __builtin_object_size() as used by FORTIFY_SOURCE was not as easily fooled. This was causing the following complaint[1] from gcc v12: In file included from include/linux/string.h:253, from include/linux/ceph/ceph_debug.h:7, from fs/ceph/inode.c:2: In function 'fortify_memset_chk', inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2, inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2: include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] 242 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fix this by embedding a struct inode into struct netfs_i_context (which should perhaps be renamed to struct netfs_inode). The struct inode vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode structs and vfs_inode is then simply changed to "netfs.inode" in those filesystems. Further, rename netfs_i_context to netfs_inode, get rid of the netfs_inode() function that converted a netfs_i_context pointer to an inode pointer (that can now be done with &ctx->inode) and rename the netfs_i_context() function to netfs_inode() (which is now a wrapper around container_of()). Most of the changes were done with: perl -p -i -e 's/vfs_inode/netfs.inode/'g \ `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]` Kees suggested doing it with a pair structure[2] and a special declarator to insert that into the network filesystem's inode wrapper[3], but I think it's cleaner to embed it - and then it doesn't matter if struct randomisation reorders things. Dave Chinner suggested using a filesystem-specific VFS_I() function in each filesystem to convert that filesystem's own inode wrapper struct into the VFS inode struct[4]. Version #2: - Fix a couple of missed name changes due to a disabled cifs option. - Rename nfs_i_context to nfs_inode - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper structs. [ This also undoes commit 507160f4 ("netfs: gcc-12: temporarily disable '-Wattribute-warning' for now") that is no longer needed ] Fixes: bc899ee1 ("netfs: Add a netfs inode context") Reported-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
David Howells <dhowells@redhat.com> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Xiubo Li <xiubli@redhat.com> cc: Jonathan Corbet <corbet@lwn.net> cc: Eric Van Hensbergen <ericvh@gmail.com> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Steve French <smfrench@gmail.com> cc: William Kucharski <william.kucharski@oracle.com> cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> cc: Dave Chinner <david@fromorbit.com> cc: linux-doc@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: samba-technical@lists.samba.org cc: linux-fsdevel@vger.kernel.org cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1] Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2] Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3] Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4] Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2 Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 10 May, 2022 1 commit
-
-
Jeff Layton authored
Currently when we create a file, we spin up an xattr buffer to send along with the create request. If we end up doing an async create however, then we currently pass down a zero-length xattr buffer. Fix the code to send down the xattr buffer in req->r_pagelist. If the xattrs span more than a page, however give up and don't try to do an async create. Cc: stable@vger.kernel.org URL: https://bugzilla.redhat.com/show_bug.cgi?id=2063929 Fixes: 9a8d03ca ("ceph: attempt to do async create when possible") Reported-by:
John Fortin <fortinj66@gmail.com> Reported-by:
Sri Ramanujam <sri@ramanujam.io> Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 01 Apr, 2022 1 commit
-
-
Matthew Wilcox (Oracle) authored
We can extract both the file pointer and the pos from the iocb. This simplifies each caller as well as allowing generic_perform_write() to see more of the iocb contents in the future. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Reviewed-by:
Al Viro <viro@zeniv.linux.org.uk> Acked-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- 01 Mar, 2022 2 commits
-
-
Jeff Layton authored
Currently, we only wake the waiters if we got a req->r_target_inode out of the reply. In the case where the create fails though, we may not have one. If there is a non-zero result from the create, then ensure that we wake anything waiting on the inode after we shut it down. URL: https://tracker.ceph.com/issues/54067 Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
David Howells authored
If a ceph file is made up of inline data, uninline that in the ceph_open() rather than in ceph_page_mkwrite(), ceph_write_iter(), ceph_fallocate() or ceph_write_begin(). This makes it easier to convert to using the netfs library for VM write hooks. Should this also take the inode lock for the duration on uninlining to prevent a race with truncation? [ jlayton: fix up folio locking, update i_inline_version after write ] Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 26 Jan, 2022 2 commits
-
-
Jeff Layton authored
Dan reported that he was unable to write to files that had been asynchronously created when the client's OSD caps are restricted to a particular namespace. The issue is that the layout for the new inode is only partially being filled. Ensure that we populate the pool_ns_data and pool_ns_len in the iinfo before calling ceph_fill_inode. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/54013 Fixes: 9a8d03ca ("ceph: attempt to do async create when possible") Reported-by:
Dan van der Ster <dan@vanderster.com> Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Ilya Dryomov <idryomov@gmail.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
The reference acquired by try_prep_async_create is currently leaked. Ensure we put it. Cc: stable@vger.kernel.org Fixes: 9a8d03ca ("ceph: attempt to do async create when possible") Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Ilya Dryomov <idryomov@gmail.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 13 Jan, 2022 1 commit
-
-
Jeff Layton authored
CephFS is a bit unlike most other filesystems in that it only conditionally does buffered I/O based on the caps that it gets from the MDS. In most cases, unless there is contended access for an inode the MDS does give Fbc caps to the client, so the unbuffered codepaths are only infrequently traveled and are difficult to test. At one time, the "-o sync" mount option would give you this behavior, but that was removed in commit 7ab9b380 ("ceph: Don't use ceph-sync-mode for synchronous-fs."). Add a new mount option to tell the client to ignore Fbc caps when doing I/O, and to use the synchronous codepaths exclusively, even on non-O_DIRECT file descriptors. We already have an ioctl that forces this behavior on a per-file basis, so we can just always set the CEPH_F_SYNC flag in the file description on such mounts. Additionally, this patch also changes the client to not request Fbc when doing direct I/O. We aren't using the cache with O_DIRECT so we don't have any need for those caps. Signed-off-by:
Jeff Layton <jlayton@kernel.org> Acked-by:
Greg Farnum <gfarnum@redhat.com> Reviewed-by:
Venky Shankar <vshankar@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 11 Jan, 2022 1 commit
-
-
Jeff Layton authored
Now that the fscache API has been reworked and simplified, change ceph over to use it. With the old API, we would only instantiate a cookie when the file was open for reads. Change it to instantiate the cookie when the inode is instantiated and call use/unuse when the file is opened/closed. Also, ensure we resize the cached data on truncates, and invalidate the cache in response to the appropriate events. This will allow us to plumb in write support later. Signed-off-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20211129162907.149445-2-jlayton@kernel.org/ # v1 Link: https://lore.kernel.org/r/20211207134451.66296-2-jlayton@kernel.org/ # v2 Link: https://lore.kernel.org/r/163906984277.143852.14697110691303589000.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967188351.1823006.5065634844099079351.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021581427.640689.14128682147127509264.stgit@warthog.procyon.org.uk/ # v4
-
- 01 Dec, 2021 2 commits
-
-
Christian Brauner authored
Ceph always inherits the SGID bit if it is set on the parent inode, while the generic inode_init_owner does not do this in a few cases where it can create a possible security problem (cf. [1]). Update ceph to strip the SGID bit just as inode_init_owner would. This bug was detected by the mapped mount testsuite in [3]. The testsuite tests all core VFS functionality and semantics with and without mapped mounts. That is to say it functions as a generic VFS testsuite in addition to a mapped mount testsuite. While working on mapped mount support for ceph, SIGD inheritance was the only failing test for ceph after the port. The same bug was detected by the mapped mount testsuite in XFS in January 2021 (cf. [2]). [1]: commit 0fa3ecd8 ("Fix up non-directory creation in SGID directories") [2]: commit 01ea173e ("xfs: fix up non-directory creation in SGID directories") [3]: https://git.kernel.org/fs/xfs/xfstests-dev.git Cc: stable@vger.kernel.org Signed-off-by:
Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
Newer compilers seem to determine that this variable being uninitialized isn't a problem, but older compilers (from the RHEL8 era) seem to choke on it and complain that it could be used uninitialized. Go ahead and initialize the variable at declaration time to silence potential compiler warnings. Fixes: c3d8e0b5 ("ceph: return the real size read when it hits EOF") Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 08 Nov, 2021 5 commits
-
-
Luís Henriques authored
This patch adds latency and size metrics for remote object copies operations ("copyfrom"). For now, these metrics will be available on the client only, they won't be sent to the MDS. Signed-off-by:
Luís Henriques <lhenriques@suse.de> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Luís Henriques authored
This patch moves ceph_osdc_copy_from() function out of libceph code into cephfs. There are no other users for this function, and there is the need (in another patch) to access internal ceph_osd_request struct members. Signed-off-by:
Luís Henriques <lhenriques@suse.de> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Xiubo Li authored
Currently, if the sync read handler ends up reading more from the last object in the file than the i_size indicates, then it'll end up returning the wrong length. Ensure that we cap the returned length and pos at the EOF. Signed-off-by:
Xiubo Li <xiubli@redhat.com> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
Add proper error handling for when an async create fails. The inode never existed, so any dirty caps or data are now toast. We already d_drop the dentry in that case, but the now-stale inode may still be around. We want to shut down access to these inodes, and ensure that they can't harbor any more dirty data, which can cause problems at umount time. When this occurs, flag such inodes as being SHUTDOWN, and trash any caps and cap flushes that may be in flight for them, and invalidate the pagecache for the inode. Add a new helper that can check whether an inode or an entire mount is now shut down, and call it instead of accessing the mount_state directly in places where we test that now. URL: https://tracker.ceph.com/issues/51279 Signed-off-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Jeff Layton authored
We have a lot of log messages that print inode pointer values. This is of dubious utility. Switch a random assortment of the ones I've found most useful to use ceph_vinop to print the snap:inum tuple instead. [ idryomov: use . as a separator, break unnecessarily long lines ] Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Ilya Dryomov <idryomov@gmail.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- 25 Oct, 2021 1 commit
-
-
Jens Axboe authored
The second argument was only used by the USB gadget code, yet everyone pays the overhead of passing a zero to be passed into aio, where it ends up being part of the aio res2 value. Now that everybody is passing in zero, kill off the extra argument. Reviewed-by:
Darrick J. Wong <djwong@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 19 Oct, 2021 1 commit
-
-
Jeff Layton authored
Currently, we check the wb_err too early for directories, before all of the unsafe child requests have been waited on. In order to fix that we need to check the mapping->wb_err later nearer to the end of ceph_fsync. We also have an overly-complex method for tracking errors after blocklisting. The errors recorded in cleanup_session_requests go to a completely separate field in the inode, but we end up reporting them the same way we would for any other error (in fsync). There's no real benefit to tracking these errors in two different places, since the only reporting mechanism for them is in fsync, and we'd need to advance them both every time. Given that, we can just remove i_meta_err, and convert the places that used it to instead just use mapping->wb_err instead. That also fixes the original problem by ensuring that we do a check_and_advance of the wb_err at the end of the fsync op. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52864 Reported-by:
Patrick Donnelly <pdonnell@redhat.com> Signed-off-by:
Jeff Layton <jlayton@kernel.org> Reviewed-by:
Xiubo Li <xiubli@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-