Commits · 3105c19c450ac7c18ab28c19d364b588767261b3 · Kirill Smelkov / linux

18 Nov, 2010 1 commit

ceph: fix readdir EOVERFLOW on 32-bit archs · 3105c19c

Sage Weil authored Nov 18, 2010

One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
callback.  Fix this by calling the ceph_vino_to_ino() helper to do the
conversion.
Reported-by: Jan Smets <jan.smets@alcatel-lucent.com>
Tested-by: Jan Smets <jan.smets@alcatel-lucent.com>
Signed-off-by: Sage Weil <sage@newdream.net>
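
The conversion described above can be pictured as folding a 64-bit inode number into 32 bits. The following is a minimal user-space sketch, not the kernel helper itself: the name fold_ino64() and the XOR-fold scheme are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: fold a 64-bit inode number into 32 bits by
 * XORing the high half into the low half.  A result of zero is mapped
 * to 1 so the folded value is never 0.
 */
static uint32_t fold_ino64(uint64_t ino64)
{
	uint32_t ino32 = (uint32_t)(ino64 ^ (ino64 >> 32));

	return ino32 ? ino32 : 1;
}
```

A value that already fits in 32 bits passes through unchanged, so only large inode numbers are affected by the fold.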


12 Nov, 2010 1 commit

ceph: fix frag offset for non-leftmost frags · 7b88dadc

Sage Weil authored Nov 11, 2010

We start at offset 2 for the leftmost frag, and 0 for subsequent frags.
When we reach the end (rightmost), we go back to 2.  This fixes readdir on
fragmented (large) directories.
Signed-off-by: Sage Weil <sage@newdream.net>


11 Nov, 2010 1 commit

ceph: fix dangling pointer · a1629c3b

Sage Weil authored Nov 11, 2010

Clear fi->last_name when it's freed.  The only caller is rewinddir() (or
equivalent lseek).
Signed-off-by: Sage Weil <sage@newdream.net>
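
The pattern behind this fix is to clear a pointer at the point where its target is freed. A minimal user-space sketch follows; struct dir_state and dir_state_reset() are hypothetical stand-ins, not the kernel's structures.

```c
#include <stdlib.h>

/* Hypothetical stand-in for per-open-file readdir state. */
struct dir_state {
	char *last_name;	/* name of the last entry returned, or NULL */
};

/*
 * Release the cached name and clear the pointer, so a later reset or
 * close cannot free or read the stale allocation a second time.
 */
static void dir_state_reset(struct dir_state *st)
{
	free(st->last_name);
	st->last_name = NULL;
}
```

Because free(NULL) is a no-op, calling the reset twice in a row is safe once the pointer is cleared.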


09 Nov, 2010 3 commits

ceph: explicitly specify page alignment in network messages · c5c6b19d

Sage Weil authored Nov 09, 2010

The alignment used for reading data into or out of pages used to be taken
from the data_off field in the message header. This only worked as long
as the page alignment matched the object offset, breaking direct io to
non-page aligned offsets.

Instead, explicitly specify the page alignment next to the page vector
in the ceph_msg struct, and use that instead of the message header (which
probably shouldn't be trusted). The alloc_msg callback is responsible for
filling in this field properly when it sets up the page vector.
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: make page alignment explicit in osd interface · b7495fc2

Sage Weil authored Nov 09, 2010

We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched. This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.

Explicitly specify the page alignment when setting up OSD IO requests.
Signed-off-by: Sage Weil <sage@newdream.net>
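
To see why the file offset and the buffer can disagree, consider the arithmetic involved. The sketch below assumes 4 KiB pages; page_alignment() and pages_spanned() are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12				/* assumes 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Offset of a byte position (file offset or buffer address) within its page. */
static uint32_t page_alignment(uint64_t pos)
{
	return (uint32_t)(pos & (SKETCH_PAGE_SIZE - 1));
}

/* Number of pages touched by the byte range [start, start + len). */
static uint64_t pages_spanned(uint64_t start, uint64_t len)
{
	return ((start + len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT) -
	       (start >> SKETCH_PAGE_SHIFT);
}
```

With direct IO, a page-aligned file offset (alignment 0) can be paired with a buffer that starts 512 bytes into a page (alignment 512). A 4096-byte transfer then spans two buffer pages rather than one, which is why the alignment has to be passed explicitly rather than inferred from the file offset.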


ceph: fix comment, remove extraneous args · e98b6fed

Sage Weil authored Nov 09, 2010

The offset/length arguments aren't used.
Signed-off-by: Sage Weil <sage@newdream.net>


08 Nov, 2010 4 commits

ceph: fix update of ctime from MDS · d8672d64

Sage Weil authored Nov 08, 2010

The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.

This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: fix version check on racing inode updates · 8bd59e01

Sage Weil authored Nov 08, 2010

We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have.  The
exception is when an MDS sends unstable information, in which case we
always update.

The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied.  Fixed and clarified the comment.
Signed-off-by: Sage Weil <sage@newdream.net>
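
The off-by-one described above can be reproduced with a small model. This sketch assumes that odd versions mark unstable information and that the comparison is made against our version with the low bit masked off; both the convention and the function names are assumptions for illustration, not code taken from the commit.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed convention: odd versions carry unstable info, even versions
 * carry stable info.  Masking the low bit yields our last stable version.
 */

/* Old-style check: skip the update only if our stable version is strictly newer. */
static bool apply_update_old(uint64_t ours, uint64_t reply)
{
	return !((ours & ~1ULL) > reply);
}

/* Fixed check: also skip when the reply merely matches our stable version. */
static bool apply_update_fixed(uint64_t ours, uint64_t reply)
{
	return !((ours & ~1ULL) >= reply);
}
```

With ours = 3 and reply = 2, the old form applies the stale v2 data, while the fixed form skips it. An unstable reply (for example v3 when we hold v2 or v3) is still applied under the fixed form.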


ceph: fix uid/gid on resent mds requests · cb4276cc

Sage Weil authored Nov 08, 2010

MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid.  Put that information in the
request struct on request setup.

This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: fix rdcache_gen usage and invalidate · cd045cb4

Sage Weil authored Nov 04, 2010

We used to use rdcache_gen to indicate whether we "might" have cached
pages. Now we just look at the mapping to determine that. However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages. That can happen
at any time (presumably when we carry FILE_CACHE). We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate. If they are
equal, we should invalidate. On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen. Similarly, if we succeed
in doing a sync invalidate, set revoking = gen - 1. (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)
Signed-off-by: Sage Weil <sage@newdream.net>


07 Nov, 2010 4 commits

ceph: re-request max_size if cap auth changes · feb4cc9b

Sage Weil authored Nov 07, 2010

If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests.  This fixes potential
hangs on writes that extend a file and race with a cap migration between
MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: only let auth caps update max_size · 912a9b03

Sage Weil authored Nov 07, 2010

Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap. Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.

Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.
Signed-off-by: Sage Weil <sage@newdream.net>
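
The two max_size rules (accept max_size only from the auth MDS, and forget the pending request when the auth cap migrates) can be modelled in a few lines. This is a simplified sketch; struct inode_state and both functions are hypothetical and do not mirror the kernel's data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical client-side view of one inode's size limits. */
struct inode_state {
	uint64_t max_size;		/* largest size we may write up to */
	uint64_t requested_max_size;	/* what we last asked the MDS for */
};

/* Accept a max_size only when it arrives with an auth cap. */
static void update_max_size(struct inode_state *st, uint64_t new_max,
			    bool from_auth_mds)
{
	if (!from_auth_mds)
		return;		/* a non-auth value is not meaningful */
	st->max_size = new_max;
}

/* The auth cap moved to another MDS: forget the old request so it is resent. */
static void auth_cap_migrated(struct inode_state *st)
{
	st->requested_max_size = 0;
}
```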


ceph: fix open for write on clustered mds · 7421ab80

Sage Weil authored Nov 07, 2010

Normally when we open a file we already have a cap, and simply update the
wanted set. However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS. Only
reuse existing caps if we are opening for read or the existing cap is auth.
Signed-off-by: Sage Weil <sage@newdream.net>
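
The reuse rule in the last sentence reduces to a single predicate. The sketch below uses a hypothetical function name and boolean inputs in place of the real cap structures.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: an existing cap can be reused on open only if
 * we are opening for read, or the cap we hold comes from the auth MDS.
 */
static bool can_reuse_cap(bool have_cap, bool cap_is_auth, bool open_for_write)
{
	return have_cap && (!open_for_write || cap_is_auth);
}
```

When the predicate is false, the open has to go to the auth MDS to obtain a new cap.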


ceph: fix bad pointer dereference in ceph_fill_trace · d8b16b3d

Sage Weil authored Nov 06, 2010

We dereference *in a few lines down, but only set it on rename.  It is
apparently pretty rare for this to trigger, but I have been hitting it
with clustered MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>


01 Nov, 2010 1 commit

ceph: fix small seq message skipping · df9f86fa

Sage Weil authored Nov 01, 2010

If the client gets out of sync with the server message sequence number, we
normally skip low seq messages (ones we already received).  The skip code
was also incrementing the expected seq, such that all subsequent messages
also appeared old and got skipped, eventually causing a timeout on the osd
connection.  This resulted in some lagging requests and console messages
like

[233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017
[233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018
[233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019
[233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020
[233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021
[233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022
[233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023
[233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024
[233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025
[233500.534614] ceph:  tid 6023602 timed out on osd22, will reset osd
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
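
The log above shows the expected value advancing on every skipped message. A minimal model of the intended behaviour is sketched below; struct conn_state and check_seq() are hypothetical, and the handling of a gap is an assumption added to make the example complete.

```c
#include <stdint.h>

/* Hypothetical per-connection receive state. */
struct conn_state {
	uint64_t in_seq;	/* seq of the last message we accepted */
};

/*
 * Returns 1 if the message is accepted, 0 if it is an old message that
 * is skipped, and -1 if a message is missing in between.  Only the
 * accept path advances in_seq; skipping must leave it untouched, or
 * every later message would look old as well.
 */
static int check_seq(struct conn_state *con, uint64_t seq)
{
	uint64_t expected = con->in_seq + 1;

	if (seq < expected)
		return 0;	/* already received: skip, do not advance */
	if (seq > expected)
		return -1;	/* gap in the sequence */
	con->in_seq = seq;
	return 1;
}
```

With in_seq at 2016, a replayed seq 2016 is skipped and the following seq 2017 is still accepted, instead of triggering the cascade seen in the log.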


28 Oct, 2010 1 commit

Revert "ceph: update issue_seq on cap grant" · 2f56f56a

Sage Weil authored Oct 27, 2010

This reverts commit d91f2438.

The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths.  By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.

The larger question is what workload/problem made me think it should be
updated here...
Signed-off-by: Sage Weil <sage@newdream.net>


20 Oct, 2010 22 commits

ceph: do not carry i_lock for readdir from dcache · efa4c120

Sage Weil authored Oct 18, 2010

We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicating the vfs locking
scalability work.  Since we don't actually need it here anyway, remove
it.

We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.
Signed-off-by: Sage Weil <sage@newdream.net>


fs/ceph/xattr.c: Use kmemdup · 61413c2f

Julia Lawall authored Oct 17, 2010

Convert a sequence of kmalloc and memcpy to use kmemdup.

The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@

  a =
-  \(kmalloc\|kzalloc\)(len,flag)
+  kmemdup(arg,len,flag)
  <... when != a
  if (a == NULL || ...) S
  ...>
- memcpy(a,arg,len+1);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
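
For readers outside the kernel, the effect of the transformation can be shown with a user-space analogue. sketch_memdup() below is a hypothetical helper built on malloc() and memcpy(); it is not the kernel's kmemdup() and takes no allocation flags.

```c
#include <stdlib.h>
#include <string.h>

/*
 * User-space analogue of the allocate-then-copy pair that the semantic
 * patch collapses: allocate len bytes and copy src into them.
 * Returns NULL if the allocation fails.
 */
static void *sketch_memdup(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}
```

Folding the two steps into one call removes the chance of the allocation size and the copy length drifting apart.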


rbd: passing wrong variable to bvec_kunmap_irq() · 85b5aaa6

Dan Carpenter authored Oct 11, 2010

We should be passing "buf" here instead of "bv".  This is tricky because
it's not the same as kmap() and kunmap().  GCC does warn about it if you
compile on i386 with CONFIG_HIGHMEM.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


rbd: null vs ERR_PTR · b8d0638a

Dan Carpenter authored Oct 11, 2010

ceph_alloc_page_vector() returns ERR_PTR(-ENOMEM) on errors.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: fix num_pages_free accounting in pagelist · 240634e9

Sage Weil authored Oct 05, 2010

Decrement the free page counter when removing a page from the free_list.
Signed-off-by: Sage Weil <sage@newdream.net>
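
The invariant being restored is that the counter always matches the length of the free list. A simplified sketch with hypothetical types (a singly linked list rather than the kernel's list_head) looks like this:

```c
#include <stddef.h>

/* Hypothetical singly linked free list of pages. */
struct page_node {
	struct page_node *next;
};

struct pagelist {
	struct page_node *free_list;
	size_t num_pages_free;
};

/* Pop one page from the free list, keeping the counter in step. */
static struct page_node *take_free_page(struct pagelist *pl)
{
	struct page_node *pg = pl->free_list;

	if (!pg)
		return NULL;
	pl->free_list = pg->next;
	pg->next = NULL;
	pl->num_pages_free--;	/* the decrement that must not be forgotten */
	return pg;
}
```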


ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl. · 571dba52

Greg Farnum authored Sep 24, 2010

Signed-off-by: Sage Weil <sage@newdream.net>


ceph: don't crash when passed bad mount options · 010e3b48

Yehuda Sadeh authored Sep 30, 2010

This only happened when parse_extra_token was not passed
to ceph_parse_option() (hence, only happened in rbd).
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


ceph: fix debugfs warnings · 6f453ed6

Randy Dunlap authored Sep 28, 2010

Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning:

fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list
fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want
fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


block: rbd: removing unnecessary test · f4cf3dee

Yehuda Sadeh authored Sep 27, 2010

rbd_get_segment() can't return a negative value, so we don't need to check
the return value.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


block: rbd: fixed may leaks · 28f259b7

Vasiliy Kulikov authored Sep 26, 2010

rbd_client_create() doesn't free rbdc, this leads to many leaks.

seg_len in rbd_do_op() is unsigned, so (seg_len < 0) makes no sense.
Also, if the fixed check fails, seg_name is leaked.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


ceph: switch from BKL to lock_flocks() · 496e5955

Sage Weil authored Sep 22, 2010

Switch from using the BKL explicitly to the new lock_flocks() interface.
Eventually this will turn into a spinlock.
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: preallocate flock state without locks held · fca4451a

Greg Farnum authored Sep 17, 2010

When the lock_kernel() turns into lock_flocks() and a spinlock, we won't
be able to do allocations with the lock held. Preallocate space without
the lock, and retry if the lock state changes out from underneath us.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
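
The preallocate-and-retry pattern can be illustrated in user space with a mutex standing in for the lock. Everything in this sketch is hypothetical (struct lock_table, snapshot_locks(), the fixed-size ids array); it shows only the shape of the technique: size the buffer unlocked, then recheck under the lock and retry if the state grew.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of lock ids protected by a mutex. */
struct lock_table {
	pthread_mutex_t mu;
	size_t num_locks;
	int ids[8];
};

/* Copy the current ids into a buffer that is allocated with the mutex dropped. */
static int *snapshot_locks(struct lock_table *t, size_t *out_n)
{
	for (;;) {
		size_t want;
		int *buf;

		pthread_mutex_lock(&t->mu);
		want = t->num_locks;		/* how much space we need now */
		pthread_mutex_unlock(&t->mu);

		buf = calloc(want ? want : 1, sizeof(*buf));	/* no lock held */
		if (!buf)
			return NULL;

		pthread_mutex_lock(&t->mu);
		if (t->num_locks <= want) {	/* still fits: fill and return */
			memcpy(buf, t->ids, t->num_locks * sizeof(*buf));
			*out_n = t->num_locks;
			pthread_mutex_unlock(&t->mu);
			return buf;
		}
		pthread_mutex_unlock(&t->mu);
		free(buf);			/* state grew underneath us: retry */
	}
}
```

The allocation never happens with the lock held, which is the property needed once the lock becomes a spinlock.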


ceph: add pagelist_reserve, pagelist_truncate, pagelist_set_cursor · ac0b74d8

Greg Farnum authored Sep 17, 2010

These facilitate preallocation of pages so that we can encode into the pagelist
in an atomic context.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: use mapping->nrpages to determine if mapping is empty · 18a38193

Sage Weil authored Sep 17, 2010

This is simpler and faster.
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: only invalidate on check_caps if we actually have pages · 93afd449

Sage Weil authored Sep 17, 2010

The i_rdcache_gen value only implies we MAY have cached pages; actually
check the mapping to see if it's worth bothering with an invalidate.
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: do not hide .snap in root directory · 4c32f5dd

Sage Weil authored Aug 24, 2010

Snaps in the root directory are now supported by the MDS, and harmless on
older versions.
Signed-off-by: Sage Weil <sage@newdream.net>


rbd: introduce rados block device (rbd), based on libceph · 602adf40

Yehuda Sadeh authored Aug 12, 2010

The rados block device (rbd), based on osdblk, creates a block device
that is backed by objects stored in the Ceph distributed object storage
cluster.  Each device consists of a single metadata object and data
striped over many data objects.

The rbd driver supports read-only snapshots.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
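
Striping a block device over many objects comes down to mapping a device offset to an object index and an offset inside that object. The sketch below assumes fixed power-of-two object sizes laid out sequentially; struct seg_extent, map_extent() and the obj_order parameter are hypothetical names for illustration.

```c
#include <stdint.h>

/* One piece of an IO that falls entirely inside a single data object. */
struct seg_extent {
	uint64_t seg;	/* index of the data object */
	uint64_t off;	/* byte offset inside that object */
	uint64_t len;	/* bytes of the IO that land in that object */
};

/*
 * Map the start of the device range [ofs, ofs + len) onto one object of
 * size 2^obj_order bytes.  The length is clipped at the object boundary;
 * the caller continues with the remainder in the next object.
 */
static struct seg_extent map_extent(uint64_t ofs, uint64_t len,
				    unsigned int obj_order)
{
	uint64_t obj_size = 1ULL << obj_order;
	struct seg_extent e;

	e.seg = ofs >> obj_order;
	e.off = ofs & (obj_size - 1);
	e.len = (len < obj_size - e.off) ? len : obj_size - e.off;
	return e;
}
```

With obj_order = 22 (4 MiB objects), an IO starting 100 bytes into the second object maps to object index 1 at offset 100.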


ceph: factor out libceph from Ceph file system · 3d14c5d2

Yehuda Sadeh authored Apr 06, 2010

This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph.  This
is mostly a matter of moving files around.  However, a few key pieces
of the interface change as well:

 - ceph_client becomes ceph_fs_client and ceph_client, where the latter
   captures the mon and osd clients, and the fs_client gets the mds client
   and file system specific pieces.
 - Mount option parsing and debugfs setup are correspondingly broken into
   two pieces.
 - The mon client gets a generic handler callback for otherwise unknown
   messages (mds map, in this case).
 - The basic supported/required feature bits can be expanded (and are by
   ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.
Signed-off-by: Sage Weil <sage@newdream.net>


ceph-rbd: osdc support for osd call and rollback operations · ae1533b6

Yehuda Sadeh authored May 18, 2010

This will be used for rbd snapshot administration.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


ceph: messenger and osdc changes for rbd · 68b4476b

Yehuda Sadeh authored Apr 06, 2010

Allow the messenger to send/receive data in a bio.  This is added
so that we wouldn't need to copy the data into pages or some other buffer
when doing IO for an rbd block device.

We can now have trailing variable sized data for osd
ops.  Also osd ops encoding is more modular.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: refactor osdc requests creation functions · 3499e8a5

Yehuda Sadeh authored Apr 06, 2010

OSD request creation is being decoupled from the
vino parameter, allowing clients using the osd to use
other arbitrary object names that are not necessarily
vino based. Also, calc_raw_layout now takes a snap id.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


ceph: lookup pool in osdmap by name · 7669a2c9

Yehuda Sadeh authored May 17, 2010

Implement a pool lookup by name.  This will be used by rbd.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
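
A lookup by name over a set of pools can be as simple as the linear scan sketched here. The flat array, struct pool_entry and lookup_pool_by_name() are assumptions for illustration; the commit message does not say which data structure the osdmap uses.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pool record: a numeric id and a name. */
struct pool_entry {
	int id;
	const char *name;
};

/* Return the id of the pool called name, or -ENOENT if there is none. */
static int lookup_pool_by_name(const struct pool_entry *pools, size_t n,
			       const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(pools[i].name, name) == 0)
			return pools[i].id;
	return -ENOENT;
}
```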


14 Oct, 2010 2 commits

Linux 2.6.36-rc8 · cd07202c

Linus Torvalds authored Oct 14, 2010


Un-inline the core-dump helper functions · 3aa0ce82

Linus Torvalds authored Oct 14, 2010

Tony Luck reports that the addition of the access_ok() check in commit
0eead9ab ("Don't dump task struct in a.out core-dumps") broke the
ia64 compile due to missing the necessary header file includes.

Rather than add yet another include (<asm/unistd.h>) to make everything
happy, just uninline the silly core dump helper functions and move the
bodies to fs/exec.c where they make a lot more sense.

dump_seek() in particular was too big to be an inline function anyway,
and none of them are in any way performance-critical.  And we really
don't need to mess up our include file headers more than they already
are.
Reported-and-tested-by: Tony Luck <tony.luck@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
