Commits · d3002b974cefbb7c1e325cc296966f768ff76b06 · Kirill Smelkov / linux

22 Mar, 2012 40 commits

Alex Elder authored Feb 14, 2012

This gathers a number of very minor changes:
    - use %hu when formatting the a socket address's address family
    - null out the ceph_msgr_wq pointer after the queue has been
      destroyed
    - drop a needless cast in ceph_write_space()
    - add a WARN() call in ceph_state_change() in the event an
      unrecognized socket state is encountered
    - rearrange the logic in ceph_con_get() and ceph_con_put() so
      that:
        - the reference counts are only atomically read once
	- the values displayed via dout() calls are known to
	  be meaningful at the time they are formatted
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

d3002b97

libceph: make ceph_tcp_connect() return int · 41617d0c

Alex Elder authored Feb 14, 2012

There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller.  Instead, have it return an error code,
which tidies things up a bit.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

41617d0c

libceph: encapsulate some messenger cleanup code · 6173d1f0

Alex Elder authored Feb 14, 2012

Define a helper function to perform various cleanup operations.  Use
it both in the exit routine and in the init routine in the event of
an error.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

6173d1f0

libceph: make ceph_msgr_wq private · e0f43c94

Alex Elder authored Feb 14, 2012

The messenger workqueue has no need to be public.  So give it static
scope.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

e0f43c94

libceph: encapsulate connection kvec operations · 859eb799

Alex Elder authored Feb 14, 2012

Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array.  Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.

Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

859eb799

libceph: move prepare_write_banner() · 963be4d7

Alex Elder authored Feb 14, 2012

One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.

This was split out from the next patch to highlight this change in
logic.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

963be4d7

rbd: don't drop the rbd_id too early · 32eec68d

Alex Elder authored Feb 08, 2012

Currently an rbd device's id is released when it is removed, but it
is done before the code is run to clean up sysfs-related files (such
as /sys/bus/rbd/devices/1).

It's possible that an rbd is still in use after the rbd_remove()
call has been made.  It's essentially the same as an active inode
that stays around after it has been removed--until its final close
operation.  This means that the id shows up as free for reuse at a
time it should not be.

The effect of this was seen by Jens Rehpoehler, who:
    - had a filesystem mounted on an rbd device
    - unmapped that filesystem (without unmounting)
    - found that the mount still worked properly
    - but hit a panic when he attempted to re-map a new rbd device

This re-map attempt found the previously-unmapped id available.
The subsequent attempt to reuse it was met with a panic while
attempting to (re-)install the sysfs entry for the new mapped
device.

Fix this by holding off "putting" the rbd id, until the rbd_device
release function is called--when the last reference is finally
dropped.

Note: This fixes: http://tracker.newdream.net/issues/1907Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

32eec68d

rbd: small changes · 593a9e7b

Alex Elder authored Feb 07, 2012

Here is another set of small code tidy-ups:
    - Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
      names throughout.  Tell the blk_queue system our physical
      block size, in the (unlikely) event we want to use something
      other than the default.
    - Delete the definition of struct rbd_info, which is never used.
    - Move the definition of dev_to_rbd() down in its source file,
      just above where it gets first used, and change its name to
      dev_to_rbd_dev().
    - Replace an open-coded operation in rbd_dev_release() to use
      dev_to_rbd_dev() instead.
    - Calculate the segment size for a given rbd_device just once in
      rbd_init_disk().
    - Use the '%zd' conversion specifier in rbd_snap_size_show(),
      since the value formatted is a size_t.
    - Switch to the '%llu' conversion specifier in rbd_snap_id_show().
      since the value formatted is unsigned.
Signed-off-by: Alex Elder <elder@dreamhost.com>

593a9e7b

rbd: do some refactoring · 00f1f36f

Alex Elder authored Feb 07, 2012

A few blocks of code are rearranged a bit here:
    - In rbd_header_from_disk():
	- Don't bother computing snap_count until we're sure the
	  on-disk header starts with a good signature.
	- Move a few independent lines of code so they are *after* a
	  check for a failed memory allocation.
	- Get rid of unnecessary local variable "ret".
    - Make a few other changes in rbd_read_header(), similar to the
      above--just moving things around a bit while preserving the
      functionality.
    - In rbd_rq_fn(), just assign rq in the while loop's controlling
      expression rather than duplicating it before and at the end of
      the loop body.  This allows the use of "continue" rather than
      "goto next" in a number of spots.
    - Rearrange the logic in snap_by_name().  End result is the same.
Signed-off-by: Alex Elder <elder@dreamhost.com>

00f1f36f

rbd: fix module sysfs setup/teardown code · fed4c143

Alex Elder authored Feb 07, 2012

Once rbd_bus_type is registered, it allows an "add" operation via
the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
way establishes a connection between the device and rbd_root_dev.
But rbd_root_dev is not registered until after the rbd_bus_type
registration is complete.  This could (in principle anyway) result
in an invalid state.

Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
two initializations and never be faced with this scenario.

In addition, unregister the device in the event the bus registration
fails at module init time.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

fed4c143

rbd: don't allocate mon_addrs buffer in rbd_add() · 7ef3214a

Alex Elder authored Feb 02, 2012

The mon_addrs buffer in rbd_add is used to hold a copy of the
monitor IP addresses supplied via /sys/bus/rbd/add.  That is
passed to rbd_get_client(), which never modifies it (nor do
any of the functions it gets passed to thereafter)--the mon_addr
parameter to rbd_get_client() is a pointer to constant data, so it
can't be modifed.  Furthermore, rbd_get_client() has the length of
the mon_addrs buffer and that is used to ensure nothing goes beyond
its end.

Based on all this, there is no reason that a buffer needs to
be used to hold a copy of the mon_addrs provided via
/sys/bus/rbd/add.   Instead, the location within that passed-in
buffer can be provided, along with the length of the "token"
therein which represents the monitor IP's.

A small change to rbd_add_parse_args() allows the address within the
buffer to be passed back, and the length is already returned.  This
now means that, at least from the perspective of this interface,
there is no such thing as a list of monitor addresses that is too
long.
Signed-off-by: Alex Elder <elder@dreamhost.com>

7ef3214a

rbd: have rbd_parse_args() report found mon_addrs size · 5214ecc4

Alex Elder authored Feb 02, 2012

The argument parsing routine already computes the size of the
mon_addrs buffer it extracts from the "command."  Pass it to the
caller so it can use it to provide the length to rbd_get_client().
Signed-off-by: Alex Elder <elder@dreamhost.com>

5214ecc4

rbd: do a few checks at build time · 81a89793

Alex Elder authored Feb 02, 2012

This is a bit gratuitous, but there are a few things that can be
verified at build time rather than run time, so do that.
Signed-off-by: Alex Elder <elder@dreamhost.com>

81a89793

rbd: don't use sscanf() in rbd_add_parse_args() · e28fff26

Alex Elder authored Feb 02, 2012

Make use of a few simple helper routines to parse the arguments
rather than sscanf().  This will treat both missing and too-long
arguments as invalid input (rather than silently truncating the
input in the too-long case).  In time this can also be used by
rbd_add() to use the passed-in buffer in place, rather than copying
its contents into new buffers.

It appears to me that the sscanf() previously used would not
correctly handle a supplied snapshot--the two final "%s" conversion
specifications were not separated by a space, and I'm not sure
how sscanf() handles that situation.  It may not be well-defined.
So that may be a bug this change fixes (but I didn't verify that).

The sizes of the mon_addrs and options buffers are now passed to
rbd_add_parse_args(), so they can be supplied to copy_token().
Signed-off-by: Alex Elder <elder@dreamhost.com>

e28fff26

rbd: encapsulate argument parsing for rbd_add() · a725f65e

Alex Elder authored Feb 02, 2012

Move the code that parses the arguments provided to rbd_add() (which
are supplied via /sys/bus/rbd/add) into a separate function.

Also rename the "mon_dev_name" variable in rbd_add() to be
"mon_addrs".   The variable represents a list of one or more
comma-separated monitor IP addresses, each with an optional port
number.  I think "mon_addrs" captures that notion a little better.
Signed-off-by: Alex Elder <elder@dreamhost.com>

a725f65e

rbd: simplify error handling in rbd_add() · 27cc2594

Alex Elder authored Feb 02, 2012

If a couple pointers are initialized to NULL then a single
"out_nomem" label can be used for all of the memory allocation
failure cases in rbd_add().

Also, get rid of the "irc" local variable there.  There is no
real need for "rc" to be type ssize_t, and it can be used in
the spot "irc" was.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

27cc2594

rbd: reduce memory used for rbd_dev fields · 60571c7d

Alex Elder authored Feb 02, 2012

The length of the string containing the monitor address
specification(s) will never exceed the length of the string passed
in to rbd_add().  The same holds true for the ceph + rbd options
string.  So reduce the amount of memory allocated for these to
that length rather than the maximum (1024 bytes).
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

60571c7d

rbd: have rbd_get_client() return a rbd_client · d720bcb0

Alex Elder authored Feb 02, 2012

Since rbd_get_client() currently returns an error code.  It assigns
the rbd_client field of the rbd_device structure it is passed if
successful.  Instead, have it return the created rbd_client
structure and return a pointer-coded error if there is an error.
This makes the assignment of the client pointer more obvious at the
call site.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

d720bcb0

rbd: a few simple changes · f0f8cef5

Alex Elder authored Jan 29, 2012

Here are a few very simple cleanups:
    - Add a "RBD_" prefix to the two driver name string definitions.
    - Move the definition of struct rbd_request below struct rbd_req_coll
      to avoid the need for an empty declaration of the latter.
    - Move and group the definitions of rbd_root_dev_release() and
      rbd_root_dev, as well as rbd_bus_type and rbd_bus_attrs[],
      close to the top of the file.  Arrange the latter so
      rbd_bus_type.bus_attrs can be initialized statically.
    - Get rid of an unnecessary local variable in rbd_open().
    - Rework some hokey logic in rbd_bus_add_dev(), so the value of
      "ret" at the end is either 0 or -ENOENT to avoid the need for
      the code duplication that was there.
    - Rename a goto target in rbd_add().
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

f0f8cef5

rbd: rename "node_lock" · 432b8587

Alex Elder authored Jan 29, 2012

The spinlock used to protect rbd_client_list is named "node_lock".
Rename it to "rbd_client_list_lock" to make it more obvious what
it's for.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

432b8587

rbd: move ctl_mutex lock inside rbd_client_create() · bc534d86

Alex Elder authored Jan 29, 2012

Since rbd_client_create() is only called in one place, move the
acquisition of the mutex around that call inside that function.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

bc534d86

rbd: move ctl_mutex lock inside rbd_get_client() · d97081b0

Alex Elder authored Jan 29, 2012

Since rbd_get_client() is only called in one place, move the
acquisition of the mutex around that call inside that function.

Furthermore, within rbd_get_client(), it appears the mutex only
needs to be held while calling rbd_client_create().  (Moving
the lock inside that function will wait for the next patch.)
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

d97081b0

rbd: release client list lock sooner · e6994d3d

Alex Elder authored Jan 29, 2012

In rbd_get_client(), if a client is reused, a number of things
get done while still holding the list lock unnecessarily.

This just moves a few things that need no lock protection outside
the lock.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

e6994d3d

rbd: restore previous rbd id sequence behavior · d184f6bf

Alex Elder authored Jan 29, 2012

It used to be that selecting a new unique identifier for an added
rbd device required searching all existing ones to find the highest
id is used.  A recent change made that unnecessary, but made it
so that id's used were monotonically non-decreasing.  It's a bit
more pleasant to have smaller rbd id's though, and this change
makes ids get allocated as they were before--each new id is one more
than the maximum currently in use.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

d184f6bf

rbd: tie rbd_dev_list changes to rbd_id operations · 499afd5b

Alex Elder authored Feb 02, 2012

The only time entries are added to or removed from the global
rbd_dev_list is exactly when a "put" or "get" operation is being
performed on a rbd_dev's id.  So just move the list management code
into get/put routines.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

499afd5b

rbd: protect the rbd_dev_list with a spinlock · e124a82f

Alex Elder authored Jan 29, 2012

The rbd_dev_list is just a simple list of all the current
rbd_devices.  Using the ctl_mutex as a concurrency guard is
overkill.  Instead, use a spinlock for that specific purpose.

This also reduces the window that the ctl_mutex needs to be held in
rbd_add().
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

e124a82f

rbd: rework calculation of new rbd id's · 1ddbe94e

Alex Elder authored Jan 29, 2012

In order to select a new unique identifier for an added rbd device,
the list of all existing ones is searched and a value one greater
than the highest id is used.

The list search can be avoided by using an atomic variable that
keeps track of the current highest id.  Using a get/put model for
id's we can limit the boundless growth of id numbers a bit by
arranging to reuse the current highest id once it gets released.
Add these calls to "put" the id when an rbd is getting removed.

Note that this changes the pattern of device id's used--new values
will never be below the highest one seen so far (even if there
exists an unused lower one).  I assert this is OK because the key
property of an rbd id is its uniqueness, not its magnitude.

Regardless, a follow-on patch will restore the old way of doing
things, I just think this commit just makes the incremental change
to atomics a little easier to understand.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

1ddbe94e

rbd: encapsulate new rbd id selection · b7f23c36

Alex Elder authored Jan 29, 2012

Move the loop that finds a new unique rbd id to use into
its own helper function.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

b7f23c36

rbd: use a single value of snap_name to mean no snap · cc9d734c

Josh Durgin authored Nov 21, 2011

There's already a constant for this anyway.

Since rbd_header_set_snap() is only used to set the rbd device
snap_name field, just do that within that function rather than
having it take the snap_name as an argument.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

v2: Changed interface rbd_header_set_snap() so it explicitly updates
    the snap_name in the rbd_device.  Also added a BUILD_BUG_ON()
    to verify the size of the snap_name field is sufficient for
    SNAP_HEAD_NAME.

cc9d734c

rbd: do not duplicate ceph_client pointer in rbd_device · 1dbb4399

Alex Elder authored Jan 24, 2012

The rbd_device structure maintains a duplicate copy of the
ceph_client pointer maintained in its rbd_client structure.  There
appears to be no good reason for this, and its presence presents a
risk of them getting out of synch or otherwise misused.  So kill it
off, and use the rbd_client copy only.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

1dbb4399

rbd: make ceph_parse_options() return a pointer · ee57741c

Alex Elder authored Jan 24, 2012

ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

ee57741c

rbd: a few small cleanups · 21079786

Alex Elder authored Jan 24, 2012

Some minor cleanups in "drivers/block/rbd.c:
    - Use the more meaningful "RBD_MAX_OBJ_NAME_LEN" in place if "96"
      in the definition of RBD_MAX_MD_NAME_LEN.
    - Use DEFINE_SPINLOCK() to define and initialize node_lock.
    - Drop a needless (char *) cast in parse_rbd_opts_token().
    - Make a few minor formatting changes.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

21079786

ceph: make ceph_setxattr() and ceph_removexattr() more alike · 18fa8b3f

Alex Elder authored Jan 23, 2012

This patch just rearranges a few bits of code to make more
portions of ceph_setxattr() and ceph_removexattr() identical.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

18fa8b3f

ceph: avoid repeatedly computing the size of constant vxattr names · 3ce6cd12

Alex Elder authored Jan 23, 2012

All names defined in the directory and file virtual extended
attribute tables are constant, and the size of each is known at
compile time.  So there's no need to compute their length every
time any file's attribute is listed.

Record the length of each string and use it when needed to determine
the space need to represent them.  In addition, compute the
aggregate size of strings in each table just once at initialization
time.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

3ce6cd12

ceph: encode type in vxattr callback routines · aa4066ed

Alex Elder authored Jan 23, 2012

The names of the callback functions used for virtual extended
attributes are based only on the last component of the attribute
name.  Because of the way these are defined, this precludes allowing
a single (lowest) attribute name for different callbacks, dependent
on the type of file being operated on.  (For example, it might be
nice to support both "ceph.dir.layout" and "ceph.file.layout".)

Just change the callback names to avoid this problem.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

aa4066ed

ceph: drop "_cb" from name of struct ceph_vxattr_cb · 881a5fa2

Alex Elder authored Jan 23, 2012

A struct ceph_vxattr_cb does not represent a callback at all, but
rather a virtual extended attribute itself.  Drop the "_cb" suffix
from its name to reflect that.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

881a5fa2

ceph: use macros to normalize vxattr table definitions · eb788084

Alex Elder authored Jan 23, 2012

Entries in the ceph virtual extended attribute tables all follow a
distinct pattern in their definition.  Enforce this pattern through
the use of a macro.

Also, a null name field signals the end of the table, so make that
be the first field in the ceph_vxattr_cb structure.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

eb788084

ceph: use a symbolic name for "ceph." extended attribute namespace · 22891907

Alex Elder authored Jan 23, 2012

Use symbolic constants to define the top-level prefix for "ceph."
extended attribute names.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

22891907

ceph: pass inode rather than table to ceph_match_vxattr() · 06476a69

Alex Elder authored Jan 23, 2012

All callers of ceph_match_vxattr() determine what to pass as the
first argument by calling ceph_inode_vxattrs(inode).  Just do that
inside ceph_match_vxattr() itself, changing it to take an inode
rather than the vxattr pointer as its first argument.

Also ensure the function works correctly for an empty table (i.e.,
containing only a terminating null entry).
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

06476a69

ceph: don't null-terminate xattr values · b829c195

Alex Elder authored Jan 23, 2012

For some reason, ceph_setxattr() allocates an extra byte in which a
'\0' is stored past the end of an extended attribute value.  This is
not needed, and is potentially misleading, so get rid of it.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

b829c195