1. 06 Jun, 2014 19 commits
    • ceph: include time stamp in every MDS request · b8e69066
      Sage Weil authored
      We recently modified the client/MDS protocol to include a timestamp in the
      client request.  This allows ctime updates to follow the client's clock
      in most cases, which avoids subtle problems when clocks are out of sync
      and timestamps are updated sometimes by the MDS clock (for most requests)
      and sometimes by the client clock (for cap writeback).
      Signed-off-by: Sage Weil <sage@inktank.com>
    • rbd: fix ida/idr memory leak · ffe312cf
      Ilya Dryomov authored
      ida_destroy() needs to be called on module exit to release ida caches.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
    • rbd: use reference counts for image requests · 0f2d5be7
      Alex Elder authored
      Each image request contains a reference count, but to date it has
      not actually been used.  (I think this was just an oversight.) A
      recent report involving rbd failing an assertion shed light on why
      and where we need to use these reference counts.
      
      Every OSD request associated with an object request uses
      rbd_osd_req_callback() as its callback function.  That function will
      call a helper function (dependent on the type of OSD request) that
      will set the object request's "done" flag if appropriate.  If that
      "done" flag is set, the object request is passed to
      rbd_obj_request_complete().
      
      In rbd_obj_request_complete(), requests are processed in sequential
      order.  So if an object request completes before one of its
      predecessors in the image request, the completion is deferred.
      Otherwise, if it's a completing object's "turn" to be completed, it
      is passed to rbd_img_obj_end_request(), which records the result of
      the operation, accumulates transferred bytes, and so on.  Next, the
      successor to this request is checked and if it is marked "done",
      (deferred) completion processing is performed on that request, and
      so on.  If the last object request in an image request is completed,
      rbd_img_request_complete() is called, which (typically) destroys
      the image request.
      
      There is a race here, however.  The instant an object request is
      marked "done" it can be provided (by a thread handling completion of
      one of its predecessor operations) to rbd_img_obj_end_request(),
      which (for the last request) can then lead to the image request
      getting torn down.  And this can happen *before* that object has
      itself entered rbd_img_obj_end_request().  As a result, once it
      *does* enter that function, the image request (and even the object
      request itself) may have been freed and become invalid.
      
      All that's necessary to avoid this is to properly count references
      to the image requests.  We tear down an image request's object
      requests all at once--only when the entire image request has
      completed.  So there's no need for an image request to count
      references for its object requests.  However, we don't want an
      image request to go away until the last of its object requests
      has passed through rbd_img_obj_callback().  In other words,
      we don't want rbd_img_request_complete() to necessarily
      result in the image request being destroyed, because it may
      get called before we've finished processing on all of its
      object requests.
      
      So the fix is to add a reference to an image request for
      each of its object requests.  The reference can be viewed
      as representing an object request that has not yet finished
      its call to rbd_img_obj_callback().  That is emphasized by
      getting the reference right after assigning that as the image
      object's callback function.  The corresponding release of that
      reference is done at the end of rbd_img_obj_callback(), which
      every image object request passes through exactly once.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
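      The reference scheme described in this commit message can be modeled in
      userspace. This is only a sketch under stated assumptions: img_request,
      img_create(), img_put(), obj_end_request() and obj_callback_exit() are
      hypothetical names, not the rbd driver's functions, and a plain integer
      stands in for the kernel's reference counter. The image request holds one
      reference for the submitter plus one per object request, so completing the
      image request does not free it while an object callback is still running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names exist in rbd. */
struct img_request {
    int refs;           /* one for the submitter + one per object request */
    int obj_total;      /* number of object requests in this image request */
    int obj_completed;  /* object requests whose result has been recorded */
    bool *destroyed;    /* set to true when the image request is freed */
};

static struct img_request *img_create(int obj_total, bool *destroyed)
{
    struct img_request *img = malloc(sizeof(*img));

    if (!img)
        return NULL;
    /* Take one reference per object request up front. */
    img->refs = 1 + obj_total;
    img->obj_total = obj_total;
    img->obj_completed = 0;
    img->destroyed = destroyed;
    *destroyed = false;
    return img;
}

/* Drop one reference; free the image request when the last one goes. */
static void img_put(struct img_request *img)
{
    if (--img->refs == 0) {
        *img->destroyed = true;
        free(img);
    }
}

/*
 * Record one object request's result.  This may be driven by a thread
 * completing a predecessor, before the object's own callback has run.
 * Completing the last object drops only the submitter's reference.
 */
static void obj_end_request(struct img_request *img)
{
    if (++img->obj_completed == img->obj_total)
        img_put(img);
}

/* End of an object request's callback: release its reference. */
static void obj_callback_exit(struct img_request *img)
{
    img_put(img);
}
```

      With two object requests, both results can be recorded (and the image
      request "completed") before either callback has exited; in this model the
      image request is freed only after the second obj_callback_exit() call.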
    • rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() · b30a01f2
      Ilya Dryomov authored
      osd_request, along with the r_request and r_reply messages attached to
      it, is leaked in __rbd_dev_header_watch_sync() if the requested image
      doesn't exist.  This is because lingering requests are special and get
      an extra ref in the reply path.  Fix it by unregistering the linger
      request on the error path, and split __rbd_dev_header_watch_sync() into
      two functions to make it maintainable.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    • rbd: make sure we have latest osdmap on 'rbd map' · 30ba1f02
      Ilya Dryomov authored
      Given an existing idle mapping (img1), mapping an image (img2) in
      a newly created pool (pool2) fails:
      
          $ ceph osd pool create pool1 8 8
          $ rbd create --size 1000 pool1/img1
          $ sudo rbd map pool1/img1
          $ ceph osd pool create pool2 8 8
          $ rbd create --size 1000 pool2/img2
          $ sudo rbd map pool2/img2
          rbd: sysfs write failed
          rbd: map failed: (2) No such file or directory
      
      This is because client instances are shared by default and we don't
      request an osdmap update when bumping a ref on an existing client.  The
      fix is to use the mon_get_version request to see if the osdmap we have
      is the latest, and block until the requested update is received if it's
      not.
      
      Fixes: http://tracker.ceph.com/issues/8184
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
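      The check-then-wait logic of this fix can be sketched in userspace. This
      is a model only: client_model, osdmap_is_stale() and wait_for_epoch() are
      hypothetical names and not libceph functions, and an array of incoming
      epochs stands in for blocking on the monitor, with the end of the array
      playing the role of a timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a shared client; not libceph's data structures. */
struct client_model {
    uint32_t osdmap_epoch;  /* epoch of the osdmap the client currently holds */
};

/* True if the client's osdmap is older than the newest known epoch. */
static bool osdmap_is_stale(const struct client_model *c, uint32_t newest_epoch)
{
    return c->osdmap_epoch < newest_epoch;
}

/*
 * Apply incoming osdmap epochs until the wanted epoch is reached.
 * Returns 0 once the wanted epoch is held, or -1 if the updates run
 * out first (the stand-in for a timeout).
 */
static int wait_for_epoch(struct client_model *c, uint32_t wanted,
                          const uint32_t *incoming, int n)
{
    for (int i = 0; i < n && c->osdmap_epoch < wanted; i++) {
        if (incoming[i] > c->osdmap_epoch)
            c->osdmap_epoch = incoming[i];
    }
    return c->osdmap_epoch >= wanted ? 0 : -1;
}
```

      In this model, a client reusing an existing instance would call
      osdmap_is_stale() with the epoch reported by the monitor and only call
      wait_for_epoch() when it returns true.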
    • libceph: add ceph_monc_wait_osdmap() · 6044cde6
      Ilya Dryomov authored
      Add ceph_monc_wait_osdmap(), which will block until the osdmap with the
      specified epoch is received or a timeout occurs.
      
      Export both of these as they are going to be needed by rbd.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • libceph: mon_get_version request infrastructure · 513a8243
      Ilya Dryomov authored
      Add support for mon_get_version requests to libceph.  This reuses much
      of the ceph_mon_generic_request infrastructure, with one exception.
      Older OSDs don't set mon_get_version reply hdr->tid even if the
      original request had a non-zero tid, which makes it impossible to
      look up ceph_mon_generic_request contexts by tid in get_generic_reply()
      for such replies.  As a workaround, we allocate a reply message on the
      reply path.  This can probably interfere with revoke, but I don't see
      a better way.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • libceph: recognize poolop requests in debugfs · 002b36ba
      Ilya Dryomov authored
      Recognize poolop requests in debugfs monc dump, fix printk format
      specifiers - tid is unsigned.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
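      Why the signedness of the format specifier matters for a 64-bit tid can
      be shown with a small userspace illustration. It uses snprintf() and the
      <inttypes.h> macros rather than the kernel's printk; format_tid_signed()
      and format_tid_unsigned() are hypothetical helpers written for this
      example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a tid as if it were signed: large values come out negative. */
static void format_tid_signed(char *buf, size_t len, uint64_t tid)
{
    snprintf(buf, len, "%" PRId64, (int64_t)tid);
}

/* Format a tid with an unsigned specifier, matching its actual type. */
static void format_tid_unsigned(char *buf, size_t len, uint64_t tid)
{
    snprintf(buf, len, "%" PRIu64, tid);
}
```

      For small tids both helpers print the same digits; they diverge only once
      the top bit is set, which is why a mismatch like this can go unnoticed.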
    • ceph: refactor readpage_nounlock() to make the logic clearer · 23cd573b
      Zhang Zhen authored
      If the return value of ceph_osdc_readpages() is not negative,
      it is certainly greater than or equal to zero.
      
      Remove the redundant condition check and the redundant braces.
      Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
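      The kind of simplification this commit describes can be sketched as
      follows. This is not the code of readpage_nounlock(): zero_tail_before(),
      zero_tail_after() and PAGE_SIZE_MODEL are hypothetical, chosen only to
      show a check that is always true once the negative case has been handled.

```c
#include <assert.h>

#define PAGE_SIZE_MODEL 4096  /* assumed page size for this illustration */

/* Before: re-tests err >= 0 although err < 0 has already returned. */
static int zero_tail_before(int err)
{
    if (err < 0) {
        return -1;
    } else {
        if (err >= 0) {  /* always true on this path */
            if (err < PAGE_SIZE_MODEL) {
                return PAGE_SIZE_MODEL - err;
            }
        }
    }
    return 0;
}

/* After: same behavior without the redundant test and braces. */
static int zero_tail_after(int err)
{
    if (err < 0)
        return -1;
    if (err < PAGE_SIZE_MODEL)
        return PAGE_SIZE_MODEL - err;
    return 0;
}
```

      Both functions return -1 on error, the number of bytes left to zero-fill
      after a short read, and 0 after a full-page read.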
    • mds: check cap ID when handling cap export message · ca665e02
      Yan, Zheng authored
      Handle the following sequence of events:
      - mds0 exports an inode to mds1. The client receives the cap import
        message from mds1. Caps from mds0 are removed while handling
        the cap import message.
      - mds1 exports the inode to mds0. The client receives the cap export
        message from mds1. handle_cap_export() adds placeholder caps
        for mds0.
      - The client receives the first cap export message (for exporting
        the inode from mds0 to mds1).
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: remember subtree root dirfrag's auth MDS · 8d08503c
      Yan, Zheng authored
      Remember a dirfrag's auth MDS when it's different from its parent
      inode's auth MDS.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: introduce ceph_fill_fragtree() · 3e7fbe9c
      Yan, Zheng authored
      Move the code that updates the i_fragtree into a separate function.
      Also add a simple probabilistic test to decide whether the i_fragtree
      should be updated.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: handle cap import atomically · 2cd698be
      Yan, Zheng authored
      Cap import messages are processed by both handle_cap_import() and
      handle_cap_grant(). These two functions are not executed in the same
      atomic context, so they can race with cap release.
      
      The fix is to make handle_cap_import() not release the i_ceph_lock when
      it returns. Let handle_cap_grant() release the lock after it finishes
      its job.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: pre-allocate ceph_cap struct for ceph_add_cap() · d9df2783
      Yan, Zheng authored
      So that ceph_add_cap() can be used while i_ceph_lock is locked.
      This simplifies the code that handles cap import/export.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: update inode fields according to issued caps · f98a128a
      Yan, Zheng authored
      Cap messages and request replies from a non-auth MDS may carry stale
      information (the corresponding locks are in LOCK states) even if they
      have the newest inode version. So the client should update inode fields
      according to issued caps.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · 461f758a
      Duan Jiong authored
      This patch fixes a coccinelle error regarding the use of IS_ERR and
      PTR_ERR instead of PTR_ERR_OR_ZERO.
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
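      The equivalence behind this replacement can be checked in userspace. The
      helpers below are simplified re-creations of the kernel's error-pointer
      helpers, written for this example; the real definitions live in the
      kernel's include/linux/err.h and may differ in detail. check_open_coded()
      and check_helper() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095  /* errno values are encoded in the top 4095 addresses */

/* Simplified userspace stand-ins for the kernel helpers (assumption). */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
    return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

/* The open-coded pattern that coccinelle flags. */
static int check_open_coded(const void *ptr)
{
    if (IS_ERR(ptr))
        return (int)PTR_ERR(ptr);
    return 0;
}

/* The same check expressed with the single helper. */
static int check_helper(const void *ptr)
{
    return PTR_ERR_OR_ZERO(ptr);
}
```

      Both functions return 0 for a valid pointer and the negative errno for an
      error pointer, so the replacement is a readability change only.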
    • ceph: queue vmtruncate if necessary when handling cap grant/revoke · c6bcda6f
      Yan, Zheng authored
      A cap grant/revoke message from a non-auth MDS can update the inode's
      size and truncate_seq/truncate_size (the message arrives before the
      auth MDS's cap trunc message).
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: remove useless ACL check · 979d4c18
      Zhang Zhen authored
      posix_acl_xattr_set() already does the check, and it's the only
      way to feed in an ACL from userspace.
      So the check here is useless; remove it.
      Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com>
      Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
    • Fengguang Wu's avatar
      e84be11c
  2. 21 May, 2014 21 commits