Commits · 235a09821c2bc71d9d07f12217ce2ac00db99eba · Kirill Smelkov / linux

25 May, 2016 · 40 commits

ceph: multiple filesystem support · 235a0982

Yan, Zheng authored Mar 30, 2016

To access a non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for the mds
namespace id.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: support for subscribing to "mdsmap.<id>" maps · 737cc81e

Ilya Dryomov authored May 26, 2016

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: replace ceph_monc_request_next_osdmap() · 7cca78c9

Ilya Dryomov authored Apr 28, 2016

... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: take osdc->lock in osdmap_show() and dump flags in hex · b4f34795

Ilya Dryomov authored Apr 28, 2016

There are now about a dozen CEPH_OSDMAP_* flags.  This is a debugging
interface, so just dump in hex instead of spelling each flag out.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: pool deletion detection · 4609245e

Ilya Dryomov authored Apr 28, 2016

This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: async MON client generic requests · d0b19705

Ilya Dryomov authored Apr 28, 2016

For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion.  Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version().  ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: support for checking on status of watch · b07d3c4b

Ilya Dryomov authored Apr 28, 2016

Implement ceph_osdc_watch_check() to be able to check on status of
watch.  Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: support for sending notifies · 19079203

Ilya Dryomov authored Apr 28, 2016

Implement ceph_osdc_notify() for sending notifies.

Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph, rbd: ceph_osd_linger_request, watch/notify v2 · 922dab61

Ilya Dryomov authored May 26, 2016

This adds support and switches rbd to a new, more reliable version of
watch/notify protocol.  As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed.  watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().

A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggyback on ceph_osd_request.  ceph_osd_event has been merged into
ceph_osd_linger_request.

All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.

ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.

Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

rbd: rbd_dev_header_unwatch_sync() variant · c525f036

Ilya Dryomov authored Apr 28, 2016

Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks.  This is for the new rados_watcherrcb_t, which would be
called from a notify callback.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: wait_request_timeout() · 42b06965

Ilya Dryomov authored Apr 28, 2016

The unwatch timeout is currently implemented in rbd.  With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: request_init() and request_release_checks() · 3540bfdb

Ilya Dryomov authored Apr 28, 2016

These are going to be used by request_reinit() code.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: a major OSD client update · 5aea3dcd

Ilya Dryomov authored Apr 28, 2016

This is a major sync up, up to ~Jewel.  The highlights are:

- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support

The switchover is incomplete: lingering requests can be set up and
torn down but aren't ever reestablished.  This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: protect osdc->osd_lru list with a spinlock · 9dd2845c

Ilya Dryomov authored Apr 28, 2016

OSD client is getting moved from the big per-client lock to a set of
per-session locks. The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: allocate ceph_osd with GFP_NOFAIL · 7a28f59b

Ilya Dryomov authored Apr 28, 2016

create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.

If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: osd_init() and osd_cleanup() · 0247a0cf

Ilya Dryomov authored Apr 28, 2016

These are going to be used by homeless OSD sessions code.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: handle_one_map() · 42c1b124

Ilya Dryomov authored Apr 28, 2016

Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message.  This sets up the scene for the updated OSD
client.

Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.

We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
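
The "did this pool go full -> not-full in this map?" question can be illustrated with a minimal userspace sketch. Everything here is hypothetical: `demo_pool_info`, `demo_apply_map()` and `demo_pool_cleared_full()` are made-up names, and the way the real code derives the full state from a map is not shown in this commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-pool info; only the two
 * fields needed for the illustration are modelled. */
struct demo_pool_info {
	bool full;      /* full state according to the current map */
	bool was_full;  /* full state according to the previous map */
};

/* Assumption: on each new map, remember the old state first. */
static void demo_apply_map(struct demo_pool_info *pi, bool full_in_new_map)
{
	pi->was_full = pi->full;
	pi->full = full_in_new_map;
}

/* True only for the full -> not-full transition in the latest map. */
static bool demo_pool_cleared_full(const struct demo_pool_info *pi)
{
	return pi->was_full && !pi->full;
}
```

Requests that were held back because a pool hit its quota could be resubmitted when such a helper returns true, which is presumably why the commit calls this the key bit for pool quotas.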

libceph: allocate dummy osdmap in ceph_osdc_init() · e5253a7b

Ilya Dryomov authored Apr 28, 2016

This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: schedule tick from ceph_osdc_init() · fbca9635

Ilya Dryomov authored Apr 28, 2016

Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: move schedule_delayed_work() in ceph_osdc_init() · b37ee1b9

Ilya Dryomov authored Apr 28, 2016

ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails.  Call schedule_delayed_work() after everything is
set up, just before returning.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: redo callbacks and factor out MOSDOpReply decoding · fe5da05e

Ilya Dryomov authored Apr 28, 2016

If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
very confusing.  Redo this so that only one of them is called:

    ->r_unsafe_callback(true), on ack
    ->r_unsafe_callback(false), on commit

or

    ->r_callback, on ack|commit

Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
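
The "only one of them is called" rule can be sketched in userspace C. This is not the libceph code: `demo_request`, `demo_handle_reply()` and the counting callbacks are hypothetical, and firing `r_callback` at most once per request is an assumption made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for ceph_osd_request. */
struct demo_request {
	void (*r_callback)(struct demo_request *req);
	void (*r_unsafe_callback)(struct demo_request *req, bool unsafe);
	bool completed;          /* set once r_callback has fired */
	int callback_calls;      /* demo bookkeeping only */
	int unsafe_true_calls;
	int unsafe_false_calls;
};

enum demo_reply { DEMO_ACK, DEMO_COMMIT };

static void demo_count_cb(struct demo_request *req)
{
	req->callback_calls++;
}

static void demo_count_unsafe_cb(struct demo_request *req, bool unsafe)
{
	if (unsafe)
		req->unsafe_true_calls++;
	else
		req->unsafe_false_calls++;
}

static void demo_handle_reply(struct demo_request *req, enum demo_reply kind)
{
	if (req->r_unsafe_callback) {
		/* true on ack, false on commit; r_callback is not called */
		req->r_unsafe_callback(req, kind == DEMO_ACK);
		return;
	}
	/* Assumption: r_callback fires once, on the first ack or commit. */
	if (req->r_callback && !req->completed) {
		req->completed = true;
		req->r_callback(req);
	}
}
```

With both callbacks set, an ack followed by a commit yields one `unsafe == true` call and one `unsafe == false` call, and `r_callback` stays silent.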

libceph: drop msg argument from ceph_osdc_callback_t · 85e084fe

Ilya Dryomov authored Apr 28, 2016

finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: switch to calc_target(), part 2 · bb873b53

Ilya Dryomov authored May 26, 2016

The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().

Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.

Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all calls to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: switch to calc_target(), part 1 · a66dd383

Ilya Dryomov authored Apr 28, 2016

Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.

ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid.  This is fixed in
"libceph: switch to calc_target(), part 2".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1

Ilya Dryomov authored Apr 28, 2016

Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: pi->min_size, pi->last_force_request_resend · 04812acf

Ilya Dryomov authored Apr 28, 2016

Add and decode pi->min_size and pi->last_force_request_resend.  These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: make pgid_cmp() global · f984cb76

Ilya Dryomov authored Apr 28, 2016

calc_target() code is going to need to know how to compare PGs.  Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
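
For illustration, a PG comparison taking both sides by `const *` might look like the sketch below. The `demo_pg` layout (a pool id plus a seed) and the ordering (pool first, then seed) are assumptions, not taken from this commit message.

```c
#include <stdint.h>

/* Hypothetical PG id: pool and seed are assumed field names. */
struct demo_pg {
	uint64_t pool;
	uint32_t seed;
};

/* Returns <0, 0 or >0, ordering by pool first and then by seed. */
static int demo_pgid_cmp(const struct demo_pg *lhs, const struct demo_pg *rhs)
{
	if (lhs->pool < rhs->pool)
		return -1;
	if (lhs->pool > rhs->pool)
		return 1;
	if (lhs->seed < rhs->seed)
		return -1;
	if (lhs->seed > rhs->seed)
		return 1;
	return 0;
}
```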

libceph: rename ceph_calc_pg_primary() · f81f1633

Ilya Dryomov authored Apr 28, 2016

Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: ceph_osds, ceph_pg_to_up_acting_osds() · 6f3bfd45

Ilya Dryomov authored Apr 28, 2016

Knowing just the acting set isn't enough; we need to be able to record
the up set as well to detect interval changes.  This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around.  Introduce and switch to ceph_osds to help with that.

Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
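
Bundling (osds[], len, primary) into one structure makes it cheap to compare an old and a new mapping. The following is a rough userspace sketch under stated assumptions: `demo_osds`, `DEMO_PG_MAX_SIZE` and both helpers are hypothetical names, and the real interval-change check very likely looks at more than these two sets.

```c
#include <stdbool.h>

#define DEMO_PG_MAX_SIZE 32	/* hypothetical upper bound on set size */

/* Hypothetical stand-in for ceph_osds: one up or acting set. */
struct demo_osds {
	int osds[DEMO_PG_MAX_SIZE];
	int size;
	int primary;
};

static bool demo_osds_equal(const struct demo_osds *a,
			    const struct demo_osds *b)
{
	int i;

	if (a->size != b->size || a->primary != b->primary)
		return false;
	for (i = 0; i < a->size; i++) {
		if (a->osds[i] != b->osds[i])
			return false;
	}
	return true;
}

/* Simplified: a change in either the up or the acting set counts. */
static bool demo_interval_changed(const struct demo_osds *old_up,
				  const struct demo_osds *old_acting,
				  const struct demo_osds *new_up,
				  const struct demo_osds *new_acting)
{
	return !demo_osds_equal(old_up, new_up) ||
	       !demo_osds_equal(old_acting, new_acting);
}
```

The point of the sketch is that an acting set can stay the same while the up set changes, which a check based on the acting set alone would miss.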

libceph: rename ceph_oloc_oid_to_pg() · d9591f5e

Ilya Dryomov authored Apr 28, 2016

Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
that the returned PG is raw and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: fix ceph_eversion encoding · 985c1673

Ilya Dryomov authored Apr 28, 2016

eversion_t is version+epoch in userspace and is encoded in that order.
ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
in __send_request().  Reorder ceph_eversion fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
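
Why field order matters when a struct is memcpy'd into a message can be shown with a small userspace sketch. `demo_eversion` and `demo_encode_eversion()` are hypothetical; the sketch uses host-endian integers and the GCC/Clang `packed` attribute, whereas a wire format would need explicit little-endian types.

```c
#include <stdint.h>
#include <string.h>

/* Fields declared in encoding order: version first, then epoch. */
struct demo_eversion {
	uint64_t version;
	uint32_t epoch;
} __attribute__((packed));

/* Copies the struct as-is, so the buffer layout follows the
 * declaration order above. Returns the number of bytes written. */
static size_t demo_encode_eversion(uint8_t *buf, uint64_t version,
				   uint32_t epoch)
{
	struct demo_eversion ev = { .version = version, .epoch = epoch };

	memcpy(buf, &ev, sizeof(ev));
	return sizeof(ev);
}
```

With the fields declared as epoch+version instead, the same memcpy would place the 4-byte epoch at offset 0 and a decoder expecting version+epoch would read garbage.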

libceph: DEFINE_RB_FUNCS macro · fcd00b68

Ilya Dryomov authored Apr 28, 2016

Given

    struct foo {
        u64 id;
        struct rb_node bar_node;
    };

generate insert_bar(), erase_bar() and lookup_bar() functions with

    DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)

The key is assumed to be an integer (u64, int, etc), compared with
< and >.  nodefld has to be initialized with RB_CLEAR_NODE().

Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
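
The idea of generating per-type tree helpers from one macro can be tried out in userspace. The sketch below is not DEFINE_RB_FUNCS: `DEMO_DEFINE_TREE_FUNCS`, `demo_node` and `demo_entry` are hypothetical, the tree is a plain unbalanced binary search tree rather than a kernel rbtree, the key is assumed to fit in a `uint64_t`, and the erase helper is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal intrusive tree node, standing in for struct rb_node. */
struct demo_node {
	struct demo_node *left, *right;
};

/* Recover the containing struct from a pointer to its node member. */
#define demo_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define DEMO_DEFINE_TREE_FUNCS(name, type, keyfld, nodefld)		\
static void insert_##name(struct demo_node **root, type *t)		\
{									\
	struct demo_node **n = root;					\
									\
	while (*n) {							\
		type *cur = demo_entry(*n, type, nodefld);		\
									\
		if (t->keyfld < cur->keyfld)				\
			n = &(*n)->left;				\
		else if (t->keyfld > cur->keyfld)			\
			n = &(*n)->right;				\
		else							\
			return;						\
	}								\
	t->nodefld.left = t->nodefld.right = NULL;			\
	*n = &t->nodefld;						\
}									\
									\
static type *lookup_##name(struct demo_node *root, uint64_t key)	\
{									\
	struct demo_node *n = root;					\
									\
	while (n) {							\
		type *cur = demo_entry(n, type, nodefld);		\
									\
		if (key < cur->keyfld)					\
			n = n->left;					\
		else if (key > cur->keyfld)				\
			n = n->right;					\
		else							\
			return cur;					\
	}								\
	return NULL;							\
}

struct foo {
	uint64_t id;
	struct demo_node bar_node;
};

/* Generates insert_bar() and lookup_bar() for struct foo. */
DEMO_DEFINE_TREE_FUNCS(bar, struct foo, id, bar_node)
```

In this sketch a duplicate key is silently ignored on insert; how the real macro treats duplicates is not stated in the commit message.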

libceph: open-code remove_{all,old}_osds() · 42a2c09f

Ilya Dryomov authored Apr 28, 2016

They are called only once, from ceph_osdc_stop() and
handle_osds_timeout() respectively.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: nuke unused fields and functions · 0c0a8de1

Ilya Dryomov authored Apr 28, 2016

Either unused or useless:

    osdmap->mkfs_epoch
    osd->o_marked_for_keepalive
    monc->num_generic_requests
    osdc->map_waiters
    osdc->last_requested_map
    osdc->timeout_tid

    osd_req_op_cls_response_data()

    osdmap_apply_incremental() @msgr arg
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

rbd: use header_oid instead of header_name · c41d13a3

Ilya Dryomov authored Apr 29, 2016

Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
const char *.  This reduces noise in rbd_dev_header_name().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: variable-sized ceph_object_id · d30291b9

Ilya Dryomov authored Apr 29, 2016

Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
except one - long rbd image names:

- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"

We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh.  (A format 2 header name is
a small system-generated string.)

Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string.  Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
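
The inline-buffer-or-external-string scheme can be sketched in userspace C as follows. All names are hypothetical (`demo_object_id`, `DEMO_OID_INLINE_LEN`, the `demo_oid_*()` helpers), the inline capacity of 52 bytes is an assumption, and `malloc()` stands in for whatever allocator the kernel code uses.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_OID_INLINE_LEN 52	/* hypothetical inline capacity */

struct demo_object_id {
	char *name;	/* points at inline_name or at a heap string */
	char inline_name[DEMO_OID_INLINE_LEN];
	int name_len;
};

static void demo_oid_init(struct demo_object_id *oid)
{
	oid->name = oid->inline_name;
	oid->inline_name[0] = '\0';
	oid->name_len = 0;
}

/* Frees an external name, if any, and resets to the empty state. */
static void demo_oid_destroy(struct demo_object_id *oid)
{
	if (oid->name != oid->inline_name)
		free(oid->name);
	demo_oid_init(oid);
}

/* Short names live in the inline buffer; long ones are allocated.
 * The oid must have been initialised with demo_oid_init() first.
 * Returns 0 on success, -1 if the allocation fails. */
static int demo_oid_set(struct demo_object_id *oid, const char *s)
{
	size_t len = strlen(s);
	char *buf = oid->inline_name;

	demo_oid_destroy(oid);
	if (len + 1 > sizeof(oid->inline_name)) {
		buf = malloc(len + 1);
		if (!buf)
			return -1;
	}
	memcpy(buf, s, len + 1);
	oid->name = buf;
	oid->name_len = (int)len;
	return 0;
}
```

On a typical 64-bit build this struct comes to 8 + 52 + 4 = 64 bytes, which is in line with the size quoted in the commit message, though the exact layout of the real structure is not shown there.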

libceph: change how osd_op_reply message size is calculated · 711da55d

Ilya Dryomov authored Apr 27, 2016

For a message pool message, preallocate a page, just like we do for
osd_op.  For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: move message allocation out of ceph_osdc_alloc_request() · 13d1ad16

Ilya Dryomov authored Apr 27, 2016

The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed.  Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:

    req = ceph_osdc_alloc_request(...);
    <fill in req->r_base_oid>
    <fill in req->r_base_oloc>
    ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: grab snapc in ceph_osdc_alloc_request() · 84127282

Ilya Dryomov authored Apr 26, 2016

ceph_osdc_build_request() is going away.  Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: make ceph_osdc_put_request() accept NULL · 3ed97d63

Ilya Dryomov authored Apr 26, 2016

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>