1. 21 Oct, 2008 13 commits
    • Heinz Mauelshagen's avatar
      dm raid1: separate region_hash interface part1 · 1f965b19
      Heinz Mauelshagen authored
      Separate the region hash code from raid1 so it can be shared by forthcoming
      targets.  Use BUG_ON() for failed async dm_io() calls.
      Signed-off-by: default avatarHeinz Mauelshagen <hjm@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      1f965b19
    • Martin K. Petersen's avatar
      dm: mark split bio as cloned · f3e1d26e
      Martin K. Petersen authored
      When a bio gets split, mark its fragments with the BIO_CLONED flag.
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      f3e1d26e
    • Milan Broz's avatar
      dm crypt: remove waitqueue · 0a4a1047
      Milan Broz authored
      Remove waitqueue no longer needed with the async crypto interface.
      Signed-off-by: default avatarMilan Broz <mbroz@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      0a4a1047
    • Milan Broz's avatar
      dm crypt: fix async split · 393b47ef
      Milan Broz authored
      When writing io, dm-crypt has to allocate a new cloned bio
      and encrypt the data into newly-allocated pages attached to this bio.
      In rare cases, because of hw restrictions (e.g. physical segment limit)
      or memory pressure, sometimes more than one cloned bio has to be used,
      each processing a different fragment of the original.
      
      Currently there is one waitqueue which waits for one fragment to finish
      and continues processing the next fragment.
      
      But when using asynchronous crypto this doesn't work, because several
      fragments may be processed asynchronously or in parallel and there is
      only one crypt context that cannot be shared between the bio fragments.
      The result may be corruption of the data contained in the encrypted bio.
      
      The patch fixes this by allocating new dm_crypt_io structs (with new
      crypto contexts) and running them independently.
      
      The fragments contains a pointer to the base dm_crypt_io struct to
      handle reference counting, so the base one is properly deallocated
      after all the fragments are finished.
      
      In a low memory situation, this only uses one additional object from the
      mempool.  If the mempool is empty, the next allocation simple waits for
      previous fragments to complete.
      Signed-off-by: default avatarMilan Broz <mbroz@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      393b47ef
    • Milan Broz's avatar
      dm crypt: tidy sector · b635b00e
      Milan Broz authored
      Prepare local sector variable (offset) for later patch.
      Do not update io->sector for still-running I/O.
      
      No functional change.
      Signed-off-by: default avatarMilan Broz <mbroz@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      b635b00e
    • Mikulas Patocka's avatar
      dm: remove dm header from targets · 586e80e6
      Mikulas Patocka authored
      Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
      Targets should not need direct access to internal DM structures.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      586e80e6
    • Mikulas Patocka's avatar
      dm: publish array_too_big · d63a5ce3
      Mikulas Patocka authored
      Move array_too_big to include/linux/device-mapper.h because it is
      used by targets.
      
      Remove the test from dm-raid1 as the number of mirror legs is limited
      such that it can never fail.  (Even for stripes it seems rather
      unlikely.)
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      d63a5ce3
    • Mikulas Patocka's avatar
      dm exception store: fix misordered writes · 7acedc5b
      Mikulas Patocka authored
      We must zero the next chunk on disk *before* writing out the current chunk, not
      after.  Otherwise if the machine crashes at the wrong time, the "end of metadata"
      marker may be missing.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      7acedc5b
    • Alasdair G Kergon's avatar
      dm exception store: refactor zero_area · 7c9e6c17
      Alasdair G Kergon authored
      Use a separate buffer for writing zeroes to the on-disk snapshot
      exception store, make the updating of ps->current_area explicit and
      refactor the code in preparation for the fix in the next patch.
      
      No functional change.
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      7c9e6c17
    • Mikulas Patocka's avatar
      dm snapshot: drop unused last_percent · f68d4f3d
      Mikulas Patocka authored
      The last_percent field is unused - remove it.
      (It dates from when events were triggered as each X% filled up.)
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      f68d4f3d
    • Mikulas Patocka's avatar
      dm snapshot: fix primary_pe race · 7c5f78b9
      Mikulas Patocka authored
      Fix a race condition with primary_pe ref_count handling.
      
      put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test
      on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count.
      
      __origin_write does atomic_dec_and_test on primary_pe->ref_count without holding
      dm_snapshot->lock.
      
      This opens the following race condition:
      Assume two CPUs, CPU1 is executing put_pending_exception (and holding
      dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
      primary_pe->ref_count == 2.
      
      CPU1:
      if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
      	origin_bios = bio_list_get(&primary_pe->origin_bios);
      ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios
      
      CPU2:
      if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
      	flush_bios(bio_list_get(&primary_pe->origin_bios));
      	free_pending_exception(primary_pe);
      	/* If we got here, pe_queue is necessarily empty. */
      	return r;
      }
      ... decrements primary_pe->ref_count to 0, submits pending bios, frees
      primary_pe.
      
      CPU1:
      if (!primary_pe || primary_pe != pe)
      	free_pending_exception(pe);
      ... this has no effect.
      if (primary_pe && !atomic_read(&primary_pe->ref_count))
      	free_pending_exception(primary_pe);
      ... sees ref_count == 0 (written by CPU 2), does double free !!
      
      This bug can happen only if someone is simultaneously writing to both the
      origin and the snapshot.
      
      If someone is writing only to the origin, __origin_write will submit kcopyd
      request after it decrements primary_pe->ref_count (so it can't happen that the
      finished copy races with primary_pe->ref_count decrementation).
      
      If someone is writing only to the snapshot, __origin_write isn't invoked at all
      and the race can't happen.
      
      The race happens when someone writes to the snapshot --- this creates
      pending_exception with primary_pe == NULL and starts copying. Then, someone
      writes to the same chunk in the snapshot, and __origin_write races with
      termination of already submitted request in pending_complete (that calls
      put_pending_exception).
      
      This race may be reason for bugs:
        http://bugzilla.kernel.org/show_bug.cgi?id=11636
        https://bugzilla.redhat.com/show_bug.cgi?id=465825
      
      The patch fixes the code to make sure that:
      1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
      must no longer dereference primary_pe (because someone else may free it under
      us).
      2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
      is responsible for freeing primary_pe.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      7c5f78b9
    • Kazuo Ito's avatar
      dm kcopyd: avoid queue shuffle · b673c3a8
      Kazuo Ito authored
      Write throughput to LVM snapshot origin volume is an order
      of magnitude slower than those to LV without snapshots or
      snapshot target volumes, especially in the case of sequential
      writes with O_SYNC on.
      
      The following patch originally written by Kevin Jamieson and
      Jan Blunck and slightly modified for the current RCs by myself
      tries to improve the performance by modifying the behaviour
      of kcopyd, so that it pushes back an I/O job to the head of
      the job queue instead of the tail as process_jobs() currently
      does when it has to wait for free pages. This way, write
      requests aren't shuffled to cause extra seeks.
      
      I tested the patch against 2.6.27-rc5 and got the following results.
      The test is a dd command writing to snapshot origin followed by fsync
      to the file just created/updated.  A couple of filesystem benchmarks
      gave me similar results in case of sequential writes, while random
      writes didn't suffer much.
      
      dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=...
         [conv=notrunc when updating]
      
      1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
      average throughput (MB/s)
                           10M     100M    1000M
      create,dd         511.46   610.72    11.81
      create,dd+fsync     7.10     6.77     8.13
      update,dd         431.63   917.41    12.75
      update,dd+fsync     7.79     7.43     8.12
      
      compared with write throughput to LV without any snapshots,
      all dd+fsync and 1000 MiB writes perform very poorly.
      
                           10M     100M    1000M
      create,dd         555.03   608.98   123.29
      create,dd+fsync   114.27    72.78    76.65
      update,dd         152.34  1267.27   124.04
      update,dd+fsync   130.56    77.81    77.84
      
      2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
      average throughput (MB/s)
      
                           10M     100M    1000M
      create,dd         537.06   589.44    46.21
      create,dd+fsync    31.63    29.19    29.23
      update,dd         487.59   897.65    37.76
      update,dd+fsync    34.12    30.07    26.85
      
      Although still not on par with plain LV performance -
      cannot be avoided because it's copy on write anyway -
      this simple patch successfully improves throughtput
      of dd+fsync while not affecting the rest.
      Signed-off-by: default avatarJan Blunck <jblunck@suse.de>
      Signed-off-by: default avatarKazuo Ito <ito.kazuo@oss.ntt.co.jp>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      b673c3a8
    • Paul Mundt's avatar
      binfmt_elf_fdpic: Update for cputime changes. · 2515ddc6
      Paul Mundt authored
      Commit f06febc9 ("timers: fix itimer/
      many thread hang") introduced a new task_cputime interface and
      subsequently only converted binfmt_elf over to it.  This results in the
      build for binfmt_elf_fdpic blowing up given that p->signal->{u,s}time
      have disappeared from underneath us.
      
      Apply the same trivial fix from binfmt_elf to binfmt_elf_fdpic.
      Signed-off-by: default avatarPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2515ddc6
  2. 20 Oct, 2008 27 commits