Maria: first version of checkpoint (WL#3071), least-recently-dirtied page...

Maria: first version of checkpoint (WL#3071), least-recently-dirtied page flushing (WL#3261), recovery (WL#3072), control file (WL#3234), to serve as a detailed LLD.

It looks like C code, but does not compile (no point in making it compile, as other modules on which I depend are not yet fully specified or written); some pieces are not coded and just marked in comments. File organization (names, directories of C files) does not matter at this point. I don't think I had to commit so early, but it feels good to publish something, gives me the impression of moving forward :)

storage/maria/checkpoint.c: WL#3071 Maria checkpoint, implementation
storage/maria/checkpoint.h: WL#3071 Maria checkpoint, interface
storage/maria/control_file.c: WL#3234 Maria control file, implementation
storage/maria/control_file.h: WL#3234 Maria control file, interface
storage/maria/least_recently_dirtied.c: WL#3261 Maria background flushing of least-recently-dirtied pages, implementation
storage/maria/least_recently_dirtied.h: WL#3261 Maria background flushing of least-recently-dirtied pages, interface
storage/maria/recovery.c: WL#3072 Maria recovery, implementation
storage/maria/recovery.h: WL#3072 Maria recovery, interface

Commit 06f7675b authored Apr 27, 2006 by unknown
8 changed files
--- a/storage/maria/checkpoint.c
+++ b/storage/maria/checkpoint.c
--- a/storage/maria/checkpoint.h
+++ b/storage/maria/checkpoint.h
+/*
+  WL#3071 Maria checkpoint
+  First version written by Guilhem Bichot on 2006-04-27.
+  Does not compile yet.
+*/
+
+/* This is the interface of this module. */
+
+typedef enum enum_checkpoint_level {
+  NONE=-1,
+  INDIRECT, /* just write dirty_pages, transactions table and sync files */
+  MEDIUM, /* also flush all dirty pages which were already dirty at previous checkpoint */
+  FULL /* also flush all dirty pages */
+} CHECKPOINT_LEVEL;
+
+/*
+  Call this when you want to request a checkpoint.
+  In real life it will be called by log_write_record() and by a client thread
+  which explicitly wants to do a checkpoint (ALTER ENGINE CHECKPOINT
+  checkpoint_level).
+*/
+int request_checkpoint(CHECKPOINT_LEVEL level, my_bool wait_for_completion);
+/* that's all that's needed in the interface */
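
The implementation of request_checkpoint() is not part of this commit (checkpoint.c is listed without content). As a minimal sketch only, assuming the request is a single pending level that the background thread later picks up, a stronger request could simply override a weaker one. All names prefixed with `sketch_` or `SK_` are hypothetical, and the locking the real code would need (the background thread tests `checkpoint_request` under `log_mutex`) is left out:

```c
#include <assert.h>

/* Hypothetical mirror of CHECKPOINT_LEVEL, renamed to avoid clashes. */
typedef enum sketch_checkpoint_level {
  SK_NONE= -1,
  SK_INDIRECT,
  SK_MEDIUM,
  SK_FULL
} SKETCH_CHECKPOINT_LEVEL;

/* Pending request; SK_NONE means nobody asked for a checkpoint. */
static SKETCH_CHECKPOINT_LEVEL sketch_pending_level= SK_NONE;

/* Record a request; an already pending stronger level is never downgraded. */
static void sketch_request_checkpoint(SKETCH_CHECKPOINT_LEVEL level)
{
  if (level > sketch_pending_level)
    sketch_pending_level= level;
}

/* What the background thread would call: fetch the request and reset it. */
static SKETCH_CHECKPOINT_LEVEL sketch_take_request(void)
{
  SKETCH_CHECKPOINT_LEVEL level= sketch_pending_level;
  sketch_pending_level= SK_NONE;
  return level;
}
```

Requesting MEDIUM and then INDIRECT leaves MEDIUM pending; once taken, the pending level is back to "none". The `wait_for_completion` flag of the real interface is not modelled here.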
--- a/storage/maria/control_file.c
+++ b/storage/maria/control_file.c
+/*
+  WL#3234 Maria control file
+  First version written by Guilhem Bichot on 2006-04-27.
+  Does not compile yet.
+*/
+
+/* Here is the implementation of this module */
+
+/* Control file is 512 bytes (a disk sector), to be as atomic as possible */
+
+int control_file_fd;
+
+/*
+  Looks for the control file. If absent, it's a fresh start, create file.
+  If present, read it to find out last checkpoint's LSN and last log.
+  Called at engine's start.
+*/
+int control_file_create_or_open()
+{
+  char buffer[8]; /* must hold the 8-byte LSN read below */
+  /* name is concatenation of Maria's home dir and "control" */
+  if ((control_file_fd= my_open(name, O_RDWR)) < 0)
+  {
+    /* failure, try to create it */
+    if ((control_file_fd= my_create(name, O_RDWR)) < 0)
+      return 1;
+    /*
+      So this is a start from scratch. To be safer we should make sure that
+      there are no logs or data/index files around (indeed it could be that
+      the control file alone was deleted or not restored, in which case we
+      should not go on with life at this point).
+      For now we trust (this is alpha version), but for beta it would be great
+      to verify.
+
+      We could have a tool which can rebuild the control file, by reading the
+      directory of logs, finding the newest log, reading it to find last
+      checkpoint... Slow but can save your db.
+    */
+    last_checkpoint_lsn_at_startup= 0;
+    last_log_name_at_startup= NULL;
+    return 0;
+  }
+  /* Already existing file, read it */
+  if (my_read(control_file_fd, buffer, 8, MYF(MY_FNABP)))
+    return 1;
+  last_checkpoint_lsn_at_startup= uint8korr(buffer);
+  if (!(last_log_name_at_startup= my_malloc(512-8+1))) /* fail if no memory */
+    return 1;
+  if (my_read(control_file_fd, last_log_name_at_startup, 512-8, MYF(MY_FNABP)))
+    return 1;
+  last_log_name_at_startup[512-8]= 0; /* end zero to be nice */
+  return 0;
+}
+
+/*
+  Write information durably to the control file.
+  Called when we have created a new log (after syncing this log's creation)
+  and when we have written a checkpoint (after syncing this log record).
+*/
+int control_file_write_and_force(LSN lsn, char *log_name)
+{
+  char buffer[512];
+  uint start=8,end=8;
+  if (lsn != 0) /* LSN was specified */
+  {
+    start= 0;
+    int8store(buffer, lsn);
+  }
+  if (log_name != NULL) /* log name was specified */
+  {
+    end= 512;
+    memcpy(buffer+8, log_name, 512-8);
+  }
+  DBUG_ASSERT(start != end);
+  /* write only the part which was filled in: it starts at buffer+start */
+  return (my_pwrite(control_file_fd, buffer+start, end-start, start,
+                    MYF(MY_FNABP)) ||
+          my_sync(control_file_fd));
+}
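
To make the 512-byte layout above concrete, here is a small self-contained sketch of the control file image: the LSN in bytes 0..7 and the log name, zero-padded, in bytes 8..511. It only builds and parses the in-memory image and does no file I/O. The `control_*` helper names are hypothetical, and the byte order (least significant byte first) is an assumption about what int8store()/uint8korr() do:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CONTROL_FILE_SIZE 512   /* one disk sector, as in the commit */
#define CONTROL_LSN_SIZE  8
#define CONTROL_NAME_SIZE (CONTROL_FILE_SIZE - CONTROL_LSN_SIZE)

/* Store the LSN in bytes 0..7, least significant byte first (assumed). */
static void control_encode_lsn(unsigned char *image, uint64_t lsn)
{
  int i;
  for (i= 0; i < CONTROL_LSN_SIZE; i++)
    image[i]= (unsigned char) (lsn >> (8 * i));
}

/* Read back the LSN stored by control_encode_lsn(). */
static uint64_t control_decode_lsn(const unsigned char *image)
{
  uint64_t lsn= 0;
  int i;
  for (i= 0; i < CONTROL_LSN_SIZE; i++)
    lsn|= (uint64_t) image[i] << (8 * i);
  return lsn;
}

/* Store the log name in bytes 8..511, zero-padded; returns 1 if too long. */
static int control_encode_log_name(unsigned char *image, const char *log_name)
{
  size_t length= strlen(log_name);
  if (length > CONTROL_NAME_SIZE)
    return 1;
  memset(image + CONTROL_LSN_SIZE, 0, CONTROL_NAME_SIZE);
  memcpy(image + CONTROL_LSN_SIZE, log_name, length);
  return 0;
}

/* Copy the name out; "out" must have room for CONTROL_NAME_SIZE + 1 bytes. */
static void control_decode_log_name(const unsigned char *image, char *out)
{
  memcpy(out, image + CONTROL_LSN_SIZE, CONTROL_NAME_SIZE);
  out[CONTROL_NAME_SIZE]= 0;   /* end zero, as the reader above does */
}
```

Padding the name with zeros means the writer never reads past the end of a short log name, which the plain memcpy of 512-8 bytes in the pseudocode above would do.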
--- a/storage/maria/control_file.h
+++ b/storage/maria/control_file.h
+/*
+  WL#3234 Maria control file
+  First version written by Guilhem Bichot on 2006-04-27.
+  Does not compile yet.
+*/
+
+/* Here is the interface of this module */
+
+LSN last_checkpoint_lsn_at_startup;
+char *last_log_name_at_startup;
+
+/*
+  Looks for the control file. If absent, it's a fresh start, create file.
+  If present, read it to find out last checkpoint's LSN and last log.
+  Called at engine's start.
+*/
+int control_file_create_or_open();
+
+/*
+  Write information durably to the control file.
+  Called when we have created a new log (after syncing this log's creation)
+  and when we have written a checkpoint (after syncing this log record).
+*/
+int control_file_write_and_force(LSN lsn, char *log_name);
--- a/storage/maria/least_recently_dirtied.c
+++ b/storage/maria/least_recently_dirtied.c
+/*
+  WL#3261 Maria - background flushing of the least-recently-dirtied pages
+  First version written by Guilhem Bichot on 2006-04-27.
+  Does not compile yet.
+*/
+
+/*
+  To be part of the page cache.
+  The pseudocode below is dependent on the page cache
+  which is being designed in WL#3134. It is not clear if I need to do page
+  copies, as the page cache already keeps page copies.
+  So, this code will move to the page cache and take inspiration from its
+  methods. Below is just to give the idea of what could be done.
+  And I should compare my imaginations to WL#3134.
+*/
+
+/* Here is the implementation of this module */
+
+#include "page_cache.h"
+#include "least_recently_dirtied.h"
+
+/*
+  When we flush a page, we should pin page.
+  This "pin" is to protect against the following scenario:
+  I make copy,
+  you modify in memory and flush to disk and remove from LRD and from cache,
+  I write copy to disk,
+  checkpoint happens.
+  result: old page is on disk, page is absent from LRD, your REDO will be
+  wrongly ignored.
+
+  Pin: there can be multiple pins, flushing imposes that there are zero pins.
+  For example, pin could be a uint counter protected by the page's latch.
+
+  Maybe it's ok if when there is a page replacement, the replacer does not
+  remove page from the LRD (it would save global mutex); for that, background
+  flusher should be prepared to see pages in the LRD which are not in the page
+  cache (then just ignore them). However checkpoint will contain superfluous
+  entries and so do more work.
+*/
+
+#define PAGE_SIZE (16*1024) /* just as an example */
+/*
+  Optimization:
+  LRD flusher should not flush pages one by one: to be fast, it flushes a
+  group of pages in sequential disk order if possible; a group of pages is just
+  FLUSH_GROUP_SIZE pages.
+  Monty said the key cache already does some grouping (investigate that).
+*/
+#define FLUSH_GROUP_SIZE 512 /* 8 MB */
+
+/*
+  This thread does background flush of pieces of the LRD, and all checkpoints.
+  Just launch it when engine starts.
+*/
+pthread_handler_decl background_flush_and_checkpoint_thread()
+{
+  char *flush_group_buffer= my_malloc(PAGE_SIZE*FLUSH_GROUP_SIZE);
+  while (this_thread_not_killed)
+  {
+    lock(log_mutex);
+    if (checkpoint_request)
+      checkpoint(); /* will unlock mutex */
+    else
+    {
+      unlock(log_mutex);
+      lock(global_LRD_mutex);
+      flush_one_group_from_LRD();
+      safe_mutex_assert_not_owner(global_LRD_mutex);
+    }
+    my_sleep(1000000); /* one second ? */
+  }
+  my_free(flush_group_buffer);
+}
+
+/*
+  Flushes only the first FLUSH_GROUP_SIZE pages of the LRD.
+  Must be called with global_LRD_mutex locked; unlocks it.
+*/
+int flush_one_group_from_LRD()
+{
+  char *ptr;
+  safe_mutex_assert_owner(global_LRD_mutex);
+
+  for (page= 0; page<FLUSH_GROUP_SIZE; page++)
+  {
+    copy_element_to_array;
+  }
+  /*
+    One rule we had better observe is "page must be flushed to disk before it
+    is removed from LRD" (otherwise checkpoint is incomplete info, corruption).
+  */
+  unlock(global_LRD_mutex);
+  /* page id is concatenation of "file id" and "number of page in file" */
+  /* qsort() takes the number of elements first, then the element size */
+  qsort(array, FLUSH_GROUP_SIZE, sizeof(*element), by_page_id);
+  for (scan_array)
+  {
+    if (page_cache_latch(page_id, READ) == PAGE_ABSENT)
+    {
+      /*
+        page disappeared since we made the copy (it was flushed to be
+        replaced), remove from array (memcpy tail of array over it)...
+      */
+      continue;
+    }
+    memcpy(flush_group_buffer+..., page->data, PAGE_SIZE);
+    pin_page;
+    page_cache_unlatch(page_id, KEEP_PINNED); /* but keep pinned */
+  }
+  for (scan_array)
+  {
+    /*
+      As an optimization, we try to identify contiguous-in-the-file segments (to
+      issue one big write()).
+      In non-optimized version, contiguous segment is always only one page.
+    */
+    if ((next_page.page_id - this_page.page_id) == 1)
+    {
+      /*
+        this page and next page are in same file and are contiguous in the
+        file: add page to contiguous segment...
+      */
+      continue; /* defer write() to next pages */
+    }
+    /* contiguous segment ends */
+    my_pwrite(file, contiguous_segment_start_offset, contiguous_segment_size);
+
+    /*
+      note that if we had doublewrite, doublewrite buffer may prevent us from
+      doing this write() grouping (if doublewrite space is shorter).
+    */
+  }
+  /*
+    Now remove pages from LRD. As we have pinned them, all pages that we
+    managed to pin are still in the LRD, in the same order, we can just cut
+    the LRD at the last element of "array". This is more efficient than
+    removing element by element (which would take LRD mutex many times) in the
+    loop above.
+  */
+  lock(global_LRD_mutex);
+  /* cut LRD by bending LRD->first, free cut portion... */
+  unlock(global_LRD_mutex);
+  for (scan_array)
+  {
+    /*
+      if the page has a property "modified since last flush" (i.e. which is
+      redundant with the presence of the page in the LRD, this property can
+      just be a pointer to the LRD element) we should reset it
+      (note that then the property would live slightly longer than
+      the presence in LRD).
+    */
+    page_cache_unpin(page_id);
+    /*
+      order between unpin and removal from LRD is not clear, depends on what
+      pin actually is.
+    */
+  }
+  free(array);
+  return 0; /* callers treat non-zero as failure */
+}
+
+/* Flushes all pages from the LRD until approximately rec_lsn >= max_lsn */
+int flush_all_LRD_to_lsn(LSN max_lsn)
+{
+  lock(global_LRD_mutex);
+  if (max_lsn == MAX_LSN) /* don't want to flush forever, so make it fixed: */
+    max_lsn= LRD->first->prev->rec_lsn;
+  while (LRD->first->rec_lsn < max_lsn)
+  {
+    if (flush_one_group_from_LRD()) /* will unlock mutex */
+      return 1;
+    /* scheduler may preempt us here so that we don't take full CPU */
+    lock(global_LRD_mutex);
+  }
+  unlock(global_LRD_mutex);
+  return 0;
+}
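
The sort-then-coalesce step of flush_one_group_from_LRD() can be illustrated in isolation. The sketch below is not the page cache code: it assumes page ids are plain 64-bit integers (the commit describes them as the concatenation of file id and page number, so consecutive ids are assumed to mean adjacent pages of the same file), and `SKETCH_SEGMENT` and `sketch_build_segments` are hypothetical names. It sorts the ids and groups them into runs, each of which could then be written with one big write():

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct sketch_segment {
  uint64_t first_page_id;  /* id of the first page of the run */
  size_t   page_count;     /* number of consecutive pages in the run */
} SKETCH_SEGMENT;

/* qsort() comparator: ascending page id, without overflow-prone subtraction. */
static int by_page_id(const void *a, const void *b)
{
  uint64_t x= *(const uint64_t *) a, y= *(const uint64_t *) b;
  return (x > y) - (x < y);
}

/*
  Sorts page_ids in place and fills segments[] with maximal runs of
  consecutive ids. segments[] must have room for "count" entries.
  Returns the number of segments.
*/
static size_t sketch_build_segments(uint64_t *page_ids, size_t count,
                                    SKETCH_SEGMENT *segments)
{
  size_t i, nsegments= 0;
  qsort(page_ids, count, sizeof(*page_ids), by_page_id);
  for (i= 0; i < count; i++)
  {
    if (nsegments > 0 &&
        page_ids[i] == segments[nsegments-1].first_page_id +
                       segments[nsegments-1].page_count)
      segments[nsegments-1].page_count++;      /* extends the current run */
    else
    {
      segments[nsegments].first_page_id= page_ids[i]; /* starts a new run */
      segments[nsegments].page_count= 1;
      nsegments++;
    }
  }
  return nsegments;
}
```

For the ids 5, 3, 4, 10, 11, 20 this yields three segments (3..5, 10..11 and 20), so three writes instead of six.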
--- a/storage/maria/least_recently_dirtied.h
+++ b/storage/maria/least_recently_dirtied.h
+/*
+  WL#3261 Maria - background flushing of the least-recently-dirtied pages
+  First version written by Guilhem Bichot on 2006-04-27.
+  Does not compile yet.
+*/
+
+/* This is the interface of this module. */
+
+/* Flushes all pages from the LRD until approximately rec_lsn >= max_lsn */
+int flush_all_LRD_to_lsn(LSN max_lsn);
--- a/storage/maria/recovery.c
+++ b/storage/maria/recovery.c
+/*
+  WL#3072 Maria recovery
+  First version written by Guilhem Bichot on 2006-04-27.
+  Does not compile yet.
+*/
+
+/* Here is the implementation of this module */
+
+#include "page_cache.h"
+#include "least_recently_dirtied.h"
+#include "transaction.h"
+#include "share.h"
+#include "log.h"
+
+typedef struct st_record_type_properties {
+ /* used for debug error messages or "maria_read_log" command-line tool: */
+  char *name;
+  my_bool record_ends_group;
+  int (*record_execute)(RECORD *); /* param will be record header instead later */
+} RECORD_TYPE_PROPERTIES;
+
+RECORD_TYPE_PROPERTIES all_record_type_properties[]=
+{
+  /* listed here in the order of the "log records type" enumeration */
+  {"REDO_INSERT_HEAD", 0, redo_insert_head_execute},
+  ...,
+  {"UNDO_INSERT"     , 1, undo_insert_execute     },
+  {"COMMIT"          , 1, commit_execute          },
+  ...
+};
+
+int redo_insert_head_execute(RECORD *record)
+{
+  /* write the data to the proper page */
+}
+
+int undo_insert_execute(RECORD *record)
+{
+  trans_table[record->short_trans_id].undo_lsn= record->lsn;
+  /* restore the old version of the row */
+}
+
+int commit_execute(RECORD *record)
+{
+  trans_table[record->short_trans_id].state= COMMITTED;
+  /*
+    and that's all: the delete/update handler should not be woken up! as there
+    may be REDO for purge further in the log.
+  */
+}
+
+#define record_ends_group(R)                                            \
+  (all_record_type_properties[(R)->type].record_ends_group)
+
+#define execute_log_record(R)                                           \
+  (all_record_type_properties[(R)->type].record_execute(R))
+
+
+int recovery()
+{
+  control_file_create_or_open();
+  /*
+    init log handler: tell it that we are going to do large reads of the
+    log, sequential and backward. Log handler could decide to alloc a big
+    read-only IO_CACHE for this, or use its usual page cache.
+  */
+
+  /* read checkpoint log record from log handler */
+  RECORD *checkpoint_record= log_read_record(last_checkpoint_lsn_at_startup);
+
+  /* parse this record, build structs (dirty_pages, transactions table, file_map) */
+  /*
+    read log records (note: sometimes only the header is needed, for ex during
+    REDO phase only the header of UNDO is needed, not the 4G blob in the
+    variable-length part, so I could use that; however for PREPARE (which is a
+    variable-length record) I'll need to read the full record in the REDO
+    phase):
+  */
+
+  record= log_read_record(min(rec_lsn, ...));
+  /*
+    if log handler knows the end LSN of the log, we could print here how many
+    MB of log we have to read (to give an idea of the time), and print
+    progress notes.
+  */
+
+  while (record != NULL)
+  {
+    /*
+      A complete group is a set of log records with an "end mark" record
+      (e.g. a set of REDOs for an operation, terminated by an UNDO for this
+      operation); if there is no "end mark" record the group is incomplete
+      and won't be executed.
+    */
+    if (record_ends_group(record))
+    {
+      /*
+        such end events can always be executed immediately (they don't touch
+        the disk).
+      */
+      execute_log_record(record);
+      if (trans_table[record->short_trans_id].group_start_lsn != 0)
+      {
+        /*
+          There is a complete group for this transaction.
+          We're going to re-read recently read log records:
+          for this log_read_record() to be efficient (not touch the disk),
+          log handler could cache recently read pages
+          (can just use an IO_CACHE of 10 MB to read the log, or the normal
+          log handler page cache).
+          Without it only OS file cache will help.
+        */
+        record2= log_read_record(trans_table[record->short_trans_id].group_start_lsn);
+        while (record2->lsn < record->lsn)
+        {
+          if (record2->short_trans_id == record->short_trans_id)
+            execute_log_record(record2); /* it's in our group */
+          record2= log_read_next_record();
+        }
+        trans_table[record->short_trans_id].group_start_lsn= 0; /* group finished */
+        /* we're now at the UNDO, re-read it to advance log pointer */
+        record2= log_read_next_record(); /* and throw it away */
+      }
+    }
+    else /* record does not end group */
+    {
+      /* just record the fact, can't know if can execute yet */
+      if (trans_table[record->short_trans_id].group_start_lsn == 0) /* group not yet started */
+        trans_table[record->short_trans_id].group_start_lsn= record->lsn;
+    }
+
+    /*
+      Later we can optimize: instead of "execute_log_record(record2)", do
+      copy_record_into_exec_buffer(record2):
+      this will just copy record into a multi-record (10 MB?) memory buffer,
+      and when buffer is full, will do sorting of REDOs per 
+      page id and execute them.
+      This sorting will enable us to do more sequential reads of the
+      data/index pages.
+      Note that updating bitmap pages (when we have executed a REDO for a page
+      we update its bitmap page) may break the sequential read of pages,
+      so maybe we should read and cache bitmap pages in the beginning.
+      Or ok the sequence will be broken, but quickly all bitmap pages will be
+      in memory and so the sequence will not be broken anymore.
+      Sorting could even determine, based on physical device of files
+      ("st_dev" in stat()), that some files should be taken by
+      different threads, if we want to do parallelism.
+    */
+    /*
+      Here's how to read a complete variable-length record if needed:
+      <sanja> read the header, allocate buffer of record length, read whole
+      record.
+    */
+    record= log_read_next_record();
+  }
+
+  /*
+    Earlier or here, create true transactions in TM.
+    If done earlier, note that TM should not wake up the delete/update handler
+    when it receives a commit info, as existing REDO for purge may exist in
+    the log, and so the delete/update handler may do changes which conflict
+    with these REDOs.
+    Even if done here, better to not wake it up now as we're going to free the
+    page cache:
+  */
+
+  /*
+    We want to have two steps:
+    engine->recover_with_max_memory();
+    next_engine->recover_with_max_memory();
+    engine->init_with_normal_memory();
+    next_engine->init_with_normal_memory();
+    So: in recover_with_max_memory() allocate a giant page cache, do REDO
+    phase, then all page cache is flushed and emptied and freed (only retain
+    small structures like TM): take full checkpoint, which is useful if
+    next engine crashes in its recovery the next second.
+    Destroy all shares (maria_close()), then at init_with_normal_memory() we
+    do this:
+  */
+
+  print_information_to_error_log(nb of trans to roll back, nb of prepared trans);
+
+  /*
+    Launch one or more threads to do the background rollback. Don't wait for
+    them to complete their rollback (background rollback; for debugging, we
+    can have an option which waits).
+
+    Note that InnoDB's rollback-in-background works as long as InnoDB is the
+    last engine to recover, otherwise MySQL will refuse new connections until
+    the last engine has recovered so it's not "background" from the user's
+    point of view. InnoDB is near top of sys_table_types so all others
+    (e.g. BDB) recover after it... So it's really "online rollback" only if
+    InnoDB is the only engine.
+  */
+
+  /* wake up delete/update handler */
+  /* tell the TM that it can now accept new transactions */
+
+  /*
+    mark that checkpoint requests are now allowed.
+  */
+  /*
+    when all rollback threads have terminated, somebody should print "rollback
+    finished" to the error log.
+  */
+}
+
+pthread_handler_decl rollback_background_thread()
+{
+  /*
+    execute the normal runtime-rollback code for a bunch of transactions.
+  */
+  while (trans in list_of_trans_to_rollback_by_this_thread)
+  {
+    while (trans->undo_lsn != 0)
+    {
+      /* this is the normal runtime-rollback code: */
+      record= log_read_record(trans->undo_lsn);
+      execute_log_record(record);
+      trans->undo_lsn= record->prev_undo_lsn;
+    }
+    /* remove trans from list */
+  }
+}
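
The group rule used in the REDO loop of recovery() (records of a transaction are executed only once that transaction's "end mark" record has been seen) can be tried out on a toy log. The sketch below is only an illustration under simplifying assumptions: the log is an in-memory array, a record's LSN is its array index plus one so that 0 can mean "no open group", transaction ids are smaller than `SKETCH_MAX_TRANS`, and "executing" a record just sets a flag. All `sketch_`/`SKETCH_` names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_TRANS 4   /* toy limit on short_trans_id values */

typedef struct sketch_record {
  int short_trans_id;
  int ends_group;            /* 1 for an "end mark" (UNDO/COMMIT-like) record */
} SKETCH_RECORD;

/*
  Scans the log once and sets executed[i]= 1 for every record which the
  group rule allows to execute. Returns the number of executed records.
*/
static size_t sketch_replay(const SKETCH_RECORD *log, size_t count,
                            int *executed)
{
  size_t group_start_lsn[SKETCH_MAX_TRANS]= {0};
  size_t i, j, nexecuted= 0;
  for (i= 0; i < count; i++)
    executed[i]= 0;
  for (i= 0; i < count; i++)
  {
    int id= log[i].short_trans_id;
    if (!log[i].ends_group)
    {
      /* cannot know yet if it may run: just remember where the group began */
      if (group_start_lsn[id] == 0)
        group_start_lsn[id]= i + 1;
      continue;
    }
    executed[i]= 1;                      /* end marks run immediately */
    nexecuted++;
    if (group_start_lsn[id] != 0)
    {
      /* complete group: go back and run this transaction's earlier records */
      for (j= group_start_lsn[id] - 1; j < i; j++)
        if (log[j].short_trans_id == id)
        {
          executed[j]= 1;
          nexecuted++;
        }
      group_start_lsn[id]= 0;            /* group finished */
    }
  }
  return nexecuted;
}
```

With a log where transaction 1 writes two REDO-like records followed by an end mark, and transaction 2 writes two records but no end mark, only the three records of transaction 1 are executed; the incomplete group of transaction 2 is skipped.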
--- a/storage/maria/recovery.h
+++ b/storage/maria/recovery.h
+/*
+  WL#3072 Maria recovery
+  First version written by Guilhem Bichot on 2006-04-27.
+  Does not compile yet.
+*/
+
+/* This is the interface of this module. */
+
+/* Performs recovery of the engine at start */
+int recovery();