1. 06 Oct, 2009 14 commits
    • ceph: capability management · a8599bd8
      Sage Weil authored
      The Ceph metadata servers control client access to inode metadata and
      file data by issuing capabilities, granting clients permission to read
      and/or write both inode fields and file data to OSDs (storage nodes).
      Each capability consists of a set of bits indicating which operations
      are allowed.
      
      If the client holds a *_SHARED cap, the client has a coherent value
      that can be safely read from the cached inode.
      
      If the client holds a *_EXCL (exclusive) or FILE_WR capability, it is
      allowed to change inode attributes (e.g., file size, mtime), note its
      dirty state in the ceph_cap, and asynchronously flush that metadata
      change to the MDS.
      
      In the event of a conflicting operation (perhaps by another client),
      the MDS will revoke the conflicting client capabilities.
      
      In order for a client to cache an inode, it must hold a capability
      from at least one MDS.  When inodes are released, release
      notifications are batched and periodically sent en masse to the MDS
      cluster to release server state.
      Signed-off-by: Sage Weil <sage@newdream.net>
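The capability rules described in this commit message can be sketched as a small bitmask model. This is only an illustration under stated assumptions: the names `CAP_SHARED`, `CAP_EXCL`, `CAP_FILE_WR`, `struct cap_state`, and the three helper functions are hypothetical and are not the constants or functions of the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; the real kernel constants differ. */
enum {
    CAP_SHARED  = 1u << 0,  /* may read coherent values from the cached inode */
    CAP_EXCL    = 1u << 1,  /* may change inode attributes locally */
    CAP_FILE_WR = 1u << 2,  /* may write file data and update size/mtime */
};

struct cap_state {
    uint32_t issued;  /* bits currently granted by the MDS */
    uint32_t dirty;   /* bits with local changes not yet flushed to the MDS */
};

/* A *_SHARED cap means the cached inode value is safe to read. */
static bool cap_can_read_cached(const struct cap_state *c)
{
    return (c->issued & CAP_SHARED) != 0;
}

/* Record a local attribute change; allowed only with EXCL or FILE_WR. */
static bool cap_mark_dirty(struct cap_state *c)
{
    uint32_t held = c->issued & (CAP_EXCL | CAP_FILE_WR);

    if (!held)
        return false;
    c->dirty |= held;
    return true;
}

/*
 * The MDS revokes some bits. Returns the dirty bits among them, which
 * the client would have to flush as part of giving the caps back.
 */
static uint32_t cap_revoke(struct cap_state *c, uint32_t revoked)
{
    uint32_t must_flush = c->dirty & revoked;

    c->issued &= ~revoked;
    c->dirty &= ~revoked;
    return must_flush;
}
```

In this model a client holding `CAP_SHARED | CAP_EXCL` can read cached values and mark attributes dirty; once `CAP_EXCL` is revoked, further local changes are refused while cached reads remain valid.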
    • ceph: monitor client · ba75bb98
      Sage Weil authored
      The monitor cluster is responsible for managing cluster membership
      and state.  The monitor client handles what minimal interaction
      the Ceph client has with it: checking for updated versions of the
      MDS and OSD maps, getting statfs() information, and unmounting.
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: CRUSH mapping algorithm · 5ecc0a0f
      Sage Weil authored
      CRUSH is a pseudorandom data distribution function designed to map
      inputs onto a dynamic hierarchy of devices, while minimizing the
      extent to which inputs are remapped when devices are added or
      removed.  It includes some features that are specifically useful for
      storage, most notably the ability to map each input onto a set of N
      devices that are separated across administrator-defined failure
      domains.  CRUSH is used to distribute data across the cluster of Ceph
      storage nodes.
      
      More information about CRUSH can be found in this paper:
      
          http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf
      Signed-off-by: Sage Weil <sage@newdream.net>
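The two properties named above (limited remapping when devices change, and replicas spread across failure domains) can be illustrated with a much simpler stand-in. The sketch below uses rendezvous (highest-random-weight) hashing over a flat device list; it is not the CRUSH algorithm, which walks a weighted hierarchy. `struct device`, `mix64`, `score`, `place`, and `MAX_REPLICAS` are all hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_REPLICAS 8  /* hypothetical upper bound on replicas */

struct device {
    int id;
    int domain;  /* failure domain, e.g. a rack number */
};

/* 64-bit mixing function (the splitmix64 finalizer). */
static uint64_t mix64(uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Pseudorandom score for an (input, device) pair. */
static uint64_t score(uint64_t input, int dev_id)
{
    return mix64(input ^ mix64((uint64_t)dev_id));
}

/*
 * Choose up to n devices for `input`, each in a different failure
 * domain. Each round picks the highest-scoring device whose domain is
 * still unused. Returns the number of devices written to out_ids.
 */
static size_t place(uint64_t input, const struct device *devs, size_t ndevs,
                    int *out_ids, size_t n)
{
    int used[MAX_REPLICAS];
    size_t chosen = 0;

    if (n > MAX_REPLICAS)
        n = MAX_REPLICAS;
    while (chosen < n) {
        const struct device *best = NULL;
        uint64_t best_score = 0;

        for (size_t i = 0; i < ndevs; i++) {
            int taken = 0;

            for (size_t j = 0; j < chosen; j++)
                if (used[j] == devs[i].domain)
                    taken = 1;
            if (taken)
                continue;
            uint64_t s = score(input, devs[i].id);
            if (!best || s > best_score) {
                best = &devs[i];
                best_score = s;
            }
        }
        if (!best)
            break;  /* fewer distinct domains than requested replicas */
        used[chosen] = best->domain;
        out_ids[chosen++] = best->id;
    }
    return chosen;
}
```

Because every device is scored independently of the others, removing a device that was not selected for an input leaves that input's placement unchanged, which is the "minimal remapping" idea in miniature.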
    • ceph: OSD client · f24e9980
      Sage Weil authored
      The OSD client is responsible for reading and writing data from/to the
      object storage pool.  This includes determining where objects are
      stored in the cluster, and ensuring that requests are retried or
      redirected in the event of a node failure or data migration.
      
      If an OSD does not respond before a timeout expires, keepalive
      messages are sent across the lossless, ordered communications channel
      to ensure that any break in the TCP connection is discovered.  If the session
      does reset, a reconnection is attempted and affected requests are
      resent (by the message transport layer).
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: MDS client · 2f2dc053
      Sage Weil authored
      The MDS (metadata server) client is responsible for submitting
      requests to the MDS cluster and parsing the response.  We decide which
      MDS to submit each request to based on cached information about the
      current partition of the directory hierarchy across the cluster.  A
      stateful session is opened with each MDS before we submit requests to
      it, and a mutex is used to control the ordering of messages within
      each session.
      
      An MDS request may generate two responses.  The first indicates the
      operation was a success and returns any result.  A second reply is
      sent when the operation commits to disk.  Note that locking on the MDS
      ensures that the results of updates are visible only to the updating
      client before the operation commits.  Requests are linked to the
      containing directory so that an fsync will wait for them to commit.
      
      If an MDS fails and/or recovers, we resubmit requests as needed.  We
      also reconnect existing capabilities to a recovering MDS to
      reestablish that shared session state.  Old dentry leases are
      invalidated.
      Signed-off-by: Sage Weil <sage@newdream.net>
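The two-reply behaviour described in this commit message can be modelled as a small state machine. This is a minimal sketch under assumptions: `enum req_state`, `struct mds_request`, `handle_reply`, and `dir_fsync_must_wait` are hypothetical names, and the real client tracks considerably more state per request.

```c
#include <stdbool.h>
#include <stddef.h>

enum req_state {
    REQ_SENT,    /* submitted, no reply yet */
    REQ_UNSAFE,  /* first reply seen: result known, not yet on disk */
    REQ_SAFE,    /* commit reply seen: operation is durable */
};

struct mds_request {
    enum req_state state;
    int result;
};

/*
 * Apply a reply to a request. `committed` is true for the reply that
 * reports the operation has committed to disk. The result is taken
 * from whichever reply arrives first.
 */
static void handle_reply(struct mds_request *req, int result, bool committed)
{
    if (req->state == REQ_SENT)
        req->result = result;
    if (committed)
        req->state = REQ_SAFE;
    else if (req->state == REQ_SENT)
        req->state = REQ_UNSAFE;
}

/*
 * Requests are linked to their containing directory; an fsync on that
 * directory has to wait while any of them is still uncommitted.
 */
static bool dir_fsync_must_wait(const struct mds_request *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (reqs[i].state != REQ_SAFE)
            return true;
    return false;
}
```

A caller can act on the result as soon as a request reaches `REQ_UNSAFE`, while anything that needs durability waits for `REQ_SAFE`.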
    • ceph: address space operations · 1d3576fd
      Sage Weil authored
      The ceph address space methods are concerned primarily with managing
      the dirty page accounting in the inode, which (among other things)
      must keep track of which snapshot context each page was dirtied in,
      and ensure that dirty data is written out to the OSDs in snapshot
      order.
      
      A writepage() on a page that is not currently writeable due to
      snapshot writeback ordering constraints is ignored (it was presumably
      called from kswapd).
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: file operations · 124e68e7
      Sage Weil authored
      File open and close operations, and read and write methods that ensure
      we have obtained the proper capabilities from the MDS cluster before
      performing IO on a file.  We take references on held capabilities for
      the duration of the read/write to avoid prematurely releasing them
      back to the MDS.
      
      We implement two main paths for read and write: one that is buffered
      (and uses generic_aio_{read,write}), and one that is fully synchronous
      and blocking (operating either on a __user pointer or, if O_DIRECT,
      directly on user pages).
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: directory operations · 2817b000
      Sage Weil authored
      Directory operations, including lookup, are defined here.  We take
      advantage of lookup intents when possible.  For the most part, we just
      need to build the proper requests for the metadata server(s) and
      pass things off to the mds_client.
      
      The results of most operations are normally incorporated into the
      client's cache when the reply is parsed by ceph_fill_trace().
      However, if the MDS replies without a trace (e.g., when retrying an
      update after an MDS failure recovery), some operation-specific cleanup
      may be needed.
      
      We can validate cached dentries in two ways.  A per-dentry lease may
      be issued by the MDS, or a per-directory cap may be issued that acts
      as a lease on the entire directory.  In the latter case, a 'gen' value
      is used to determine which dentries belong to the currently leased
      directory contents.
      
      We normally prepopulate the dcache and icache with readdir results.
      This makes subsequent lookups and getattrs avoid any server
      interaction.  It also lets us satisfy readdir operations by peeking at
      the dcache if and only if we hold the per-directory cap/lease, previously
      performed a readdir, and haven't dropped any of the resulting
      dentries.
      Signed-off-by: Sage Weil <sage@newdream.net>
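The two dentry validation paths described in this commit message can be sketched as one predicate. The structures and field names below (`struct dir_state`, `struct dentry_state`, `dentry_is_valid`) are hypothetical, and the rule for when the directory's 'gen' value changes is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

struct dir_state {
    bool have_dir_cap;  /* per-directory cap (lease on all contents) held */
    uint32_t gen;       /* assumed to change whenever that cap is lost */
};

struct dentry_state {
    uint64_t lease_expires;  /* per-dentry lease expiry time, 0 if none */
    uint32_t dir_gen;        /* directory gen recorded when cached */
};

/*
 * A cached dentry is usable if its own lease is still running, or if
 * the directory-wide cap is held and the dentry belongs to the
 * currently leased generation of the directory contents.
 */
static bool dentry_is_valid(const struct dentry_state *d,
                            const struct dir_state *dir, uint64_t now)
{
    if (d->lease_expires > now)
        return true;
    return dir->have_dir_cap && d->dir_gen == dir->gen;
}
```

Comparing generations lets a single counter change invalidate every dentry cached under an earlier directory lease, without visiting each dentry.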
    • ceph: inode operations · 355da1eb
      Sage Weil authored
      Inode cache and inode operations.  We also include routines to
      incorporate metadata structures returned by the MDS into the client
      cache, and some helpers to deal with file capabilities and metadata
      leases.  The bulk of that work is done by fill_inode() and
      fill_trace().
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: super.c · 16725b9d
      Sage Weil authored
      Mount option parsing, client setup and teardown, and a few odds and
      ends (e.g., statfs).
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: ref counted buffer · c30dbb9c
      Sage Weil authored
      struct ceph_buffer is a simple ref-counted buffer.  We transparently
      choose between kmalloc for small buffers and vmalloc for large ones.
      
      This is currently used only for allocating memory for xattr data.
      Signed-off-by: Sage Weil <sage@newdream.net>
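A userspace sketch of such a ref-counted buffer might look like the following. It is not the kernel's struct ceph_buffer: `struct rc_buffer`, `SMALL_LIMIT`, and the helper names are hypothetical, `malloc` stands in for both kmalloc and vmalloc, and the plain counter would have to be atomic in kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SMALL_LIMIT 4096  /* hypothetical small/large threshold */

struct rc_buffer {
    unsigned refs;  /* reference count; not atomic in this sketch */
    size_t len;
    bool large;     /* records which allocator path was chosen */
    void *data;
};

/* Allocate a buffer holding one reference, or return NULL on failure. */
static struct rc_buffer *rc_buffer_new(size_t len)
{
    struct rc_buffer *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->large = len > SMALL_LIMIT;
    /* Stand-in: kernel code would pick kmalloc or vmalloc here. */
    b->data = malloc(len ? len : 1);
    if (!b->data) {
        free(b);
        return NULL;
    }
    b->len = len;
    b->refs = 1;
    return b;
}

/* Take an additional reference. */
static struct rc_buffer *rc_buffer_get(struct rc_buffer *b)
{
    b->refs++;
    return b;
}

/* Drop a reference; returns true if this call freed the buffer. */
static bool rc_buffer_put(struct rc_buffer *b)
{
    if (--b->refs > 0)
        return false;
    /* Kernel code would call kfree or vfree depending on b->large. */
    free(b->data);
    free(b);
    return true;
}
```

Recording the allocator choice in the buffer itself is what lets callers stay unaware of it: the release path can pick the matching free routine on its own.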
    • ceph: client types · de57606c
      Sage Weil authored
      We first define constants, types, and prototypes for the kernel client
      proper.
      
      A few subsystems are defined separately later: the MDS, OSD, and
      monitor clients, and the messaging layer.
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: on-wire types · 0dee3c28
      Sage Weil authored
      These headers describe the types used to exchange messages between the
      Ceph client and various servers.  All types are little-endian and
      packed.  These headers are shared between the kernel and userspace, so
      all types are in terms of e.g. __u32.
      
      Additionally, we define a few magic values to identify the current
      version of the protocol(s) in use, so that discrepancies can be
      detected on mount.
      Signed-off-by: Sage Weil <sage@newdream.net>
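To illustrate what little-endian on-wire types and a protocol magic value imply for a sender and receiver, here is a minimal sketch that encodes and checks a fixed header byte by byte. The header layout, `WIRE_MAGIC`, `WIRE_VERSION`, and the function names are hypothetical and do not come from the actual Ceph headers.

```c
#include <stdint.h>

#define WIRE_MAGIC     0x43455048u  /* hypothetical magic value */
#define WIRE_VERSION   1u           /* hypothetical protocol version */
#define WIRE_HELLO_LEN 8            /* 4-byte magic + 4-byte version */

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Load a 32-bit little-endian value regardless of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Fill `out` (WIRE_HELLO_LEN bytes) with the magic and version. */
static void hello_encode(uint8_t *out)
{
    put_le32(out, WIRE_MAGIC);
    put_le32(out + 4, WIRE_VERSION);
}

/* Return 0 if the peer speaks our protocol, -1 on any mismatch. */
static int hello_check(const uint8_t *in)
{
    if (get_le32(in) != WIRE_MAGIC)
        return -1;
    if (get_le32(in + 4) != WIRE_VERSION)
        return -1;
    return 0;
}
```

Encoding byte by byte sidesteps both host endianness and struct padding, which is the same guarantee that packed little-endian types give on the wire.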
    • ceph: documentation · 7ad920b5
      Sage Weil authored
      Mount options, syntax.
      Signed-off-by: Sage Weil <sage@newdream.net>
  2. 27 Sep, 2009 14 commits
  3. 26 Sep, 2009 12 commits