1. 19 Sep, 2014 2 commits
    • Filipe Manana's avatar
      Btrfs: fix data corruption after fast fsync and writeback error · 8407f553
      Filipe Manana authored
      When we do a fast fsync, we start all ordered operations and then while
      they're running in parallel we visit the list of modified extent maps
      and construct their matching file extent items and write them to the
      log btree. After that, in btrfs_sync_log() we wait for all the ordered
      operations to finish (via btrfs_wait_logged_extents).
      
      The problem with this is that we were completely ignoring errors that
      can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
      or -EIO errors for example. When such error happens, it means we have parts
      of the on disk extent that weren't written to, and so we end up logging
      file extent items that point to these extents that contain garbage/random
      data - so after a crash/reboot plus log replay, we get our inode's metadata
      pointing to those extents.
      
      This worked in contrast with the full (non-fast) fsync path, where we
      start all ordered operations, wait for them to finish and then write
      to the log btree. In this path, after each ordered operation completes
      we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
      -EIO if so (via btrfs_wait_ordered_range).
      
      So if an error happens with any ordered operation, just return a -EIO
      error to userspace, so that it knows that not all of its previous writes
      were durably persisted and the application can take proper action (like
      redo the writes for e.g.) - and definitely not leave any file extent items
      in the log refer to non fully written extents.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8407f553
    • Filipe Manana's avatar
      Btrfs: fix fsync race leading to invalid data after log replay · 669249ee
      Filipe Manana authored
      When the fsync callback (btrfs_sync_file) starts, it first waits for
      the writeback of any dirty pages to start and finish without holding
      the inode's mutex (to reduce contention). After this it acquires the
      inode's mutex and repeats that process via btrfs_wait_ordered_range
      only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
      is set on the inode).
      
      This is not safe for a non full sync - we need to start and wait for
      writeback to finish for any pages that might have been made dirty
      before acquiring the inode's mutex and after that first step mentioned
      before. Why this is needed is explained by the following comment added
      to btrfs_sync_file:
      
        "Right before acquiring the inode's mutex, we might have new
         writes dirtying pages, which won't immediately start the
         respective ordered operations - that is done through the
         fill_delalloc callbacks invoked from the writepage and
         writepages address space operations. So make sure we start
         all ordered operations before starting to log our inode. Not
         doing this means that while logging the inode, writeback
         could start and invoke writepage/writepages, which would call
         the fill_delalloc callbacks (cow_file_range,
         submit_compressed_extents). These callbacks add first an
         extent map to the modified list of extents and then create
         the respective ordered operation, which means in
         tree-log.c:btrfs_log_inode() we might capture all existing
         ordered operations (with btrfs_get_logged_extents()) before
         the fill_delalloc callback adds its ordered operation, and by
         the time we visit the modified list of extent maps (with
         btrfs_log_changed_extents()), we see and process the extent
         map they created. We then use the extent map to construct a
         file extent item for logging without waiting for the
         respective ordered operation to finish - this file extent
         item points to a disk location that might not have yet been
         written to, containing random data - so after a crash a log
         replay will make our inode have file extent items that point
         to disk locations containing invalid data, as we returned
         success to userspace without waiting for the respective
         ordered operation to finish, because it wasn't captured by
         btrfs_get_logged_extents()."
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      669249ee
  2. 18 Sep, 2014 2 commits
    • Liu Bo's avatar
      Btrfs: fix wrong parse of extent map's tracepoint · 254a2d14
      Liu Bo authored
      The tracepoint of extent map doesn't parse @flag correctly, we set @flag via
      set_bit(), so we need to parse it on a bit bias.
      
      Also add the missing flag, EXTENT_FLAG_FS_MAPPING.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      254a2d14
    • Qu Wenruo's avatar
      btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map · e6c4efd8
      Qu Wenruo authored
      The following commit enhanced the merge_extent_mapping() to reduce
      fragment in extent map tree, but it can't handle case which existing
      lies before map_start:
      51f39 btrfs: Use right extent length when inserting overlap extent map.
      
      [BUG]
      When existing extent map's start is before map_start,
      the em->len will be minus, which will corrupt the extent map and fail to
      insert the new extent map.
      This will happen when someone get a large extent map, but when it is
      going to insert it into extent map tree, some one has already commit
      some write and split the huge extent into small parts.
      
      [REPRODUCER]
      It is very easy to tiger using filebench with randomrw personality.
      It is about 100% to reproduce when using 8G preallocated file in 60s
      randonrw test.
      
      [FIX]
      This patch can now handle any existing extent position.
      Since it does not directly use existing->start, now it will find the
      previous and next extent around map_start.
      So the old existing->start < map_start bug will never happen again.
      
      [ENHANCE]
      This patch will insert the best fitted extent map into extent map tree,
      other than the oldest [map_start, map_start + sectorsize) or the
      relatively newer but not perfect [map_start, existing->start).
      
      The patch will first search existing extent that does not intersects with
      the desired map range [map_start, map_start + len).
      The existing extent will be either before or behind map_start, and based
      on the existing extent, we can find out the previous and next extent
      around map_start.
      
      So the best fitted extent would be [prev->end, next->start).
      For prev or next is not found, em->start would be prev->end and em->end
      wold be next->start.
      
      With this patch, the fragment in extent map tree should be reduced much
      more than the 51f39 commit and reduce an unneeded extent map tree search.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e6c4efd8
  3. 17 Sep, 2014 36 commits