Commits · 87bc5b895d94a0f40fe170d4cf5771c8e8f85d15 · Kirill Smelkov / linux

08 Jul, 2019 40 commits

ceph: use ceph_evict_inode to cleanup inode's resource · 87bc5b89

Yan, Zheng authored Jun 02, 2019

remove_session_caps() relies on __wait_on_freeing_inode(), to wait for
freeing inode to remove its caps. But VFS wakes freeing inode waiters
before calling destroy_inode().

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/40102Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

87bc5b89

ceph: initialize superblock s_time_gran to 1 · 0f7cf80a

Luis Henriques authored Jun 27, 2019

Having granularity set to 1us results in having inode timestamps with a
accurancy different from the fuse client (i.e. atime, ctime and mtime will
always end with '000').  This patch normalizes this behaviour and sets the
granularity to 1.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

0f7cf80a

MAINTAINERS: take over for Zheng as CephFS kernel client maintainer · 1edd1fec

Jeff Layton authored Jun 26, 2019

Zheng wants to be able to spend more time working on the MDS, so I've
volunteered to take over for him as the CephFS kernel client maintainer.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

1edd1fec

rbd: setallochint only if object doesn't exist · 8b5bec5c

Ilya Dryomov authored Jun 19, 2019

setallochint is really only useful on object creation.  Continue
hinting unconditionally if object map cannot be used.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

8b5bec5c

rbd: support for object-map and fast-diff · 22e8bd51

Ilya Dryomov authored Jun 05, 2019

Speed up reads, discards and zeroouts through RBD_OBJ_FLAG_MAY_EXIST
and RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT based on object map.

Invalid object maps are not trusted, but still updated.  Note that we
never iterate, resize or invalidate object maps.  If object-map feature
is enabled but object map fails to load, we just fail the requester
(either "rbd map" or I/O, by way of post-acquire action).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

22e8bd51

rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe() · da5ef6be

Ilya Dryomov authored Jun 17, 2019

Snapshot object map will be loaded in rbd_dev_image_probe(), so we need
to know snapshot's size (as opposed to HEAD's size) sooner.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

da5ef6be

libceph: export osd_req_op_data() macro · 4cf3e6df

Ilya Dryomov authored Jun 14, 2019

We already have one exported wrapper around it for extent.osd_data and
rbd_object_map_update_finish() needs another one for cls.request_data.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

4cf3e6df

libceph: change ceph_osdc_call() to take page vector for response · 68ada915

Ilya Dryomov authored Jun 14, 2019

This will be used for loading object map.  rbd_obj_read_sync() isn't
suitable because object map must be accessed through class methods.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

68ada915

libceph: bump CEPH_MSG_MAX_DATA_LEN (again) · ef83171b

Ilya Dryomov authored Apr 08, 2019

This time for rbd object map.  Object maps are limited in size to
256000000 objects, two bits per object.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

ef83171b

rbd: new exclusive lock wait/wake code · 637cd060

Ilya Dryomov authored Jun 06, 2019

rbd_wait_state_locked() is built around rbd_dev->lock_waitq and blocks
rbd worker threads while waiting for the lock, potentially impacting
other rbd devices. There is no good way to pass an error code into
image request state machines when acquisition fails, hence the use of
RBD_DEV_FLAG_BLACKLISTED for everything and various other issues.

Introduce rbd_dev->acquiring_list and move acquisition into image
request state machine. Use rbd_img_schedule() for kicking and passing
error codes. No blocking occurs while waiting for the lock, but
rbd_dev->lock_rwsem is still held across lock, unlock and set_cookie
calls.

Always acquire the lock on "rbd map" to avoid associating the latency
of acquiring the lock with the first I/O request.

A slight regression is that lock_timeout is now respected only if lock
acquisition is triggered by "rbd map" and not by I/O. This is somewhat
compensated by the fact that we no longer block if the peer refuses to
release lock -- I/O is failed with EROFS right away.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

637cd060

rbd: quiescing lock should wait for image requests · e1fddc8f

Ilya Dryomov authored May 30, 2019

Syncing OSD requests doesn't really work. A single image request may
be comprised of multiple object requests, each of which can go through
a series of OSD requests (original, copyups, etc). On top of that, the
OSD cliest may be shared with other rbd devices.

What we want is to ensure that all in-flight image requests complete.
Introduce rbd_dev->running_list and block in RBD_LOCK_STATE_RELEASING
until that happens. New OSD requests may be started during this time.

Note that __rbd_img_handle_request() acquires rbd_dev->lock_rwsem only
if need_exclusive_lock() returns true. This avoids a deadlock similar
to the one outlined in the previous commit between unlock and I/O that
doesn't require lock, such as a read with object-map feature disabled.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

e1fddc8f

rbd: lock should be quiesced on reacquire · a2b1da09

Ilya Dryomov authored May 30, 2019

Quiesce exclusive lock at the top of rbd_reacquire_lock() instead
of only when ceph_cls_set_cookie() fails.  This avoids a deadlock on
rbd_dev->lock_rwsem.

If rbd_dev->lock_rwsem is needed for I/O completion, set_cookie can
hang ceph-msgr worker thread if set_cookie reply ends up behind an I/O
reply, because, like lock and unlock requests, set_cookie is sent and
waited upon with rbd_dev->lock_rwsem held for write.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

a2b1da09

rbd: introduce copyup state machine · 793333a3

Ilya Dryomov authored Jun 13, 2019

Both write and copyup paths will get more complex with object map.
Factor copyup code out into a separate state machine.

While at it, take advantage of obj_req->osd_reqs list and issue empty
and current snapc OSD requests together, one after another.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

793333a3

rbd: rename rbd_obj_setup_*() to rbd_obj_init_*() · ea9b743c

Ilya Dryomov authored May 31, 2019

These functions don't allocate and set up OSD requests anymore.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

ea9b743c

rbd: move OSD request allocation into object request state machines · a086a1b8

Ilya Dryomov authored Jun 12, 2019

Following submission, move initial OSD request allocation into object
request state machines.  Everything that has to do with OSD requests is
now handled inside the state machine, all __rbd_img_fill_request() has
left is initialization.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

a086a1b8

rbd: factor out __rbd_osd_setup_discard_ops() · 27bbd911

Ilya Dryomov authored May 29, 2019

With obj_req->xferred removed, obj_req->ex.oe_off and obj_req->ex.oe_len
can be updated if required for alignment. Previously the new offset and
length weren't stored anywhere beyond rbd_obj_setup_discard().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

27bbd911

rbd: factor out rbd_osd_setup_copyup() · b5ae8cbc

Ilya Dryomov authored May 29, 2019

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

b5ae8cbc

rbd: introduce obj_req->osd_reqs list · bcbab1db

Ilya Dryomov authored May 27, 2019

Since the dawn of time it had been assumed that a single object request
spawns a single OSD request. This is already impacting copyup: instead
of sending empty and current snapc copyups together, we wait for empty
snapc OSD request to complete in order to reassign obj_req->osd_req
with current snapc OSD request. Looking further, updating potentially
hundreds of snapshot object maps serially is a non-starter.

Replace obj_req->osd_req pointer with obj_req->osd_reqs list. Use
osd_req->r_private_item for linkage.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

bcbab1db

libceph: rename r_unsafe_item to r_private_item · 94e85771

Ilya Dryomov authored Jul 08, 2019

This list item remained from when we had safe and unsafe replies
(commit vs ack).  It has since become a private list item for use by
clients.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

94e85771

rbd: introduce image request state machine · 0192ce2e

Ilya Dryomov authored May 16, 2019

Make it possible to schedule image requests on a workqueue.  This fixes
parent chain recursion added in the previous commit and lays the ground
for exclusive lock wait/wake improvements.

The "wait for pending subrequests and report first nonzero result" code
is generalized to be used by object request state machine.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

0192ce2e

rbd: move OSD request submission into object request state machines · 85b5e6d1

Ilya Dryomov authored May 14, 2019

Start eliminating asymmetry where the initial OSD request is allocated
and submitted from outside the state machine, making error handling and
restarts harder than they could be.  This commit deals with submission,
a commit that deals with allocation will follow.

Note that this commit adds parent chain recursion on the submission
side:

  rbd_img_request_submit
    rbd_obj_handle_request
      __rbd_obj_handle_request
        rbd_obj_handle_read
          rbd_obj_handle_write_guard
            rbd_obj_read_from_parent
              rbd_img_request_submit

This will be fixed in the next commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

85b5e6d1

rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD} · 0ad5d953

Ilya Dryomov authored May 14, 2019

In preparation for moving OSD request allocation and submission into
object request state machines, get rid of RBD_OBJ_WRITE_{FLAT,GUARD}.
We would need to start in a new state, whether the request is guarded
or not.  Unify them into RBD_OBJ_WRITE_OBJECT and pass guard info
through obj_req->flags.

While at it, make our ENOENT handling a little more precise: only hide
ENOENT when it is actually expected, that is on delete.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

0ad5d953

rbd: replace obj_req->tried_parent with obj_req->read_state · a9b67e69

Ilya Dryomov authored May 08, 2019

Make rbd_obj_handle_read() look like a state machine and get rid of
the necessity to patch result in rbd_obj_handle_request(), completing
the removal of obj_req->xferred and img_req->xferred.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

a9b67e69

rbd: get rid of obj_req->xferred, obj_req->result and img_req->xferred · 54ab3b24

Ilya Dryomov authored May 11, 2019

obj_req->xferred and img_req->xferred don't bring any value. The
former is used for short reads and has to be set to obj_req->ex.oe_len
after that and elsewhere. The latter is just an aggregate.

Use result for short reads (>=0 - number of bytes read, <0 - error) and
pass it around explicitly. No need to store it in obj_req.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

54ab3b24

ceph: don't NULL terminate virtual xattrs · 26350535

Jeff Layton authored Jun 24, 2019

The convention with xattrs is to not store the termination with string
data, given that it returns the length. This is how setfattr/getfattr
operate.

Most of ceph's virtual xattr routines use snprintf to plop the string
directly into the destination buffer, but snprintf always NULL
terminates the string. This means that if we send the kernel a buffer
that is the exact length needed to hold the string, it'll end up
truncated.

Add a ceph_fmt_xattr helper function to format the string into an
on-stack buffer that should always be large enough to hold the whole
thing and then memcpy the result into the destination buffer. If it does
turn out that the formatted string won't fit in the on-stack buffer,
then return -E2BIG and do a WARN_ONCE().

Change over most of the virtual xattr routines to use the new helper. A
couple of the xattrs are sourced from strings however, and it's
difficult to know how long they'll be. Just have those memcpy the result
in place after verifying the length.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

26350535

ceph: return -ERANGE if virtual xattr value didn't fit in buffer · 3b421018

Jeff Layton authored Jun 13, 2019

The getxattr manpage states that we should return ERANGE if the
destination buffer size is too small to hold the value.
ceph_vxattrcb_layout does this internally, but we should be doing
this for all vxattrs.

Fix the only caller of getxattr_cb to check the returned size
against the buffer length and return -ERANGE if it doesn't fit.
Drop the same check in ceph_vxattrcb_layout and just rely on the
caller to handle it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

3b421018

ceph: make getxattr_cb return ssize_t · f1d1b51d

Jeff Layton authored Jun 24, 2019

The getxattr_cb functions return size_t, which is unsigned and then
cast that value to int and then ssize_t before returning it. While all
of this works, it relies on implicit casting rules for signed/unsigned
conversions.

Change getxattr_cb to return ssize_t to better conform with what the
caller actually wants. Also, remove some suspicious casts.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

f1d1b51d

ceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAP · 49ada6e8

Yan, Zheng authored Jun 20, 2019

Client uses this flag to tell mds if there is more cap snap need to
flush. It's mainly for the case that client needs to re-send cap/snap
flushes after mds failover, but CEPH_CAP_ANY_FILE_WR on corresponding
inodes are all released before mds failover.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

49ada6e8

ceph: kick flushing and flush snaps before sending normal cap message · d6cee9db

Yan, Zheng authored Jun 20, 2019

Otherwise client may send cap flush messages in wrong order.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d6cee9db

ceph: clear CEPH_I_KICK_FLUSH flag inside __kick_flushing_caps() · 054f8d41

Yan, Zheng authored Jun 20, 2019

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

054f8d41

ceph: increment change_attribute on local changes · 5c308356

Jeff Layton authored Jun 06, 2019

We don't set SB_I_VERSION on ceph since we need to manage it ourselves,
so we must increment it whenever we update the file times.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

5c308356

ceph: handle change_attr in cap messages · 176c77c9

Jeff Layton authored Jun 06, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

176c77c9

ceph: add change_attr field to ceph_inode_info · a35ead31

Jeff Layton authored Jun 06, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a35ead31

iversion: add a routine to update a raw value with a larger one · 441d3676

Jeff Layton authored Jun 05, 2019

Under ceph, clients can be independently updating iversion themselves,
while working under comprehensive sets of caps on an inode. In that
situation we always want to prefer the largest value of a change
attribute. Add a new function that will update a raw value with a larger
one, but otherwise leave it alone.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

441d3676

ceph: allow querying of STATX_BTIME in ceph_getattr · 58981784

Jeff Layton authored Jun 05, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

58981784

libceph: turn on CEPH_FEATURE_MSG_ADDR2 · 6adaaafd

Jeff Layton authored May 31, 2019

Now that the client can handle either address formatting, advertise to
the peer that we can support it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

6adaaafd

ceph: handle btime in cap messages · ec62b894

Jeff Layton authored May 29, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ec62b894

ceph: add btime field to ceph_inode_info · 245ce991

Jeff Layton authored May 29, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

245ce991

libceph: rename ceph_encode_addr to ceph_encode_banner_addr · 2c66de56

Jeff Layton authored Jun 17, 2019

...ditto for the decode function. We only use these functions to fix
up banner addresses now, so let's name them more appropriately.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2c66de56

libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE · d3c3c0a8

Jeff Layton authored Jun 17, 2019

Going forward, we'll have different address types so let's use
the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE.

Also, make ceph_pr_addr print the address type value as well.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d3c3c0a8