1. 26 Apr, 2016 3 commits
    • Daeho Jeong's avatar
      ext4: fix races between changing inode journal mode and ext4_writepages · c8585c6f
      Daeho Jeong authored
      In ext4, there is a race condition between changing inode journal mode
      and ext4_writepages(). While ext4_writepages() is executed on a
      non-journalled mode inode, the inode's journal mode could be enabled
      by ioctl() and then, some pages dirtied after switching the journal
      mode will be still exposed to ext4_writepages() in non-journaled mode.
      To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
      Kara's suggestion because we don't want to waste ext4_inode_info's
      space for this extra rare case.
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      c8585c6f
    • Daeho Jeong's avatar
      ext4: handle unwritten or delalloc buffers before enabling data journaling · 4c546592
      Daeho Jeong authored
      We already allocate delalloc blocks before changing the inode mode into
      "per-file data journal" mode to prevent delalloc blocks from remaining
      not allocated, but another issue concerned with "BH_Unwritten" status
      still exists. For example, by fallocate(), several buffers' status
      change into "BH_Unwritten", but these buffers cannot be processed by
      ext4_alloc_da_blocks(). So, they still remain in unwritten status after
      per-file data journaling is enabled and they cannot be changed into
      written status any more and, if they are journaled and eventually
      checkpointed, these unwritten buffer will cause a kernel panic by the
      below BUG_ON() function of submit_bh_wbc() when they are submitted
      during checkpointing.
      
      static int submit_bh_wbc(int rw, struct buffer_head *bh,...
      {
              ...
              BUG_ON(buffer_unwritten(bh));
      
      Moreover, when "dioread_nolock" option is enabled, the status of a
      buffer is changed into "BH_Unwritten" after write_begin() completes and
      the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
      if a buffer's status is changed into unwrutten but the buffer's I/O is
      not submitted and completed, it can cause the same problem after
      enabling per-file data journaling. You can easily generate this bug by
      executing the following command.
      
      ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
      
      To resolve these problems and define a boundary between the previous
      mode and per-file data journaling mode, we need to flush and wait all
      the I/O of buffers of a file before enabling per-file data journaling
      of the file.
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      4c546592
    • Theodore Ts'o's avatar
      ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart() · 7b808191
      Theodore Ts'o authored
      The function jbd2_journal_extend() takes as its argument the number of
      new credits to be added to the handle.  We weren't taking into account
      the currently unused handle credits; worse, we would try to extend the
      handle by N credits when it had N credits available.
      
      In the case where jbd2_journal_extend() fails because the transaction
      is too large, when jbd2_journal_restart() gets called, the N credits
      owned by the handle gets returned to the transaction, and the
      transaction commit is asynchronously requested, and then
      start_this_handle() will be able to successfully attach the handle to
      the current transaction since the required credits are now available.
      
      This is mostly harmless, but since ext4_ext_truncate_extend_restart()
      returns EAGAIN, the truncate machinery will once again try to call
      ext4_ext_truncate_extend_restart(), which will do the above sequence
      over and over again until the transaction has committed.
      
      This was found while I was debugging a lockup in caused by running
      xfstests generic/074 in the data=journal case.  I'm still not sure why
      we ended up looping forever, which suggests there may still be another
      bug hiding in the transaction accounting machinery, but this commit
      prevents us from looping in the first place.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      7b808191
  2. 24 Apr, 2016 5 commits
    • Jan Kara's avatar
      ext4: do not ask jbd2 to write data for delalloc buffers · ee0876bc
      Jan Kara authored
      Currently we ask jbd2 to write all dirty allocated buffers before
      committing a transaction when doing writeback of delay allocated blocks.
      However this is unnecessary since we move all pages to writeback state
      before dropping a transaction handle and then submit all the necessary
      IO. We still need the transaction commit to wait for all the outstanding
      writeback before flushing disk caches during transaction commit to avoid
      data exposure issues though. Use the new jbd2 capability and ask it to
      only wait for outstanding writeback during transaction commit when
      writing back data in ext4_writepages().
      Tested-by: default avatar"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ee0876bc
    • Jan Kara's avatar
      jbd2: add support for avoiding data writes during transaction commits · 41617e1a
      Jan Kara authored
      Currently when filesystem needs to make sure data is on permanent
      storage before committing a transaction it adds inode to transaction's
      inode list. During transaction commit, jbd2 writes back all dirty
      buffers that have allocated underlying blocks and waits for the IO to
      finish. However when doing writeback for delayed allocated data, we
      allocate blocks and immediately submit the data. Thus asking jbd2 to
      write dirty pages just unnecessarily adds more work to jbd2 possibly
      writing back other redirtied blocks.
      
      Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
      outstanding data writes before committing a transaction and thus avoid
      unnecessary writes.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      41617e1a
    • Jan Kara's avatar
      ext4: remove EXT4_STATE_ORDERED_MODE · 3957ef53
      Jan Kara authored
      This flag is just duplicating what ext4_should_order_data() tells you
      and is used in a single place. Furthermore it doesn't reflect changes to
      inode data journalling flag so it may be possibly misleading. Just
      remove it.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      3957ef53
    • Jan Kara's avatar
      ext4: fix data exposure after a crash · 06bd3c36
      Jan Kara authored
      Huang has reported that in his powerfail testing he is seeing stale
      block contents in some of recently allocated blocks although he mounts
      ext4 in data=ordered mode. After some investigation I have found out
      that indeed when delayed allocation is used, we don't add inode to
      transaction's list of inodes needing flushing before commit. Originally
      we were doing that but commit f3b59291 removed the logic with a
      flawed argument that it is not needed.
      
      The problem is that although for delayed allocated blocks we write their
      contents immediately after allocating them, there is no guarantee that
      the IO scheduler or device doesn't reorder things and thus transaction
      allocating blocks and attaching them to inode can reach stable storage
      before actual block contents. Actually whenever we attach freshly
      allocated blocks to inode using a written extent, we should add inode to
      transaction's ordered inode list to make sure we properly wait for block
      contents to be written before committing the transaction. So that is
      what we do in this patch. This also handles other cases where stale data
      exposure was possible - like filling hole via mmap in
      data=ordered,nodelalloc mode.
      
      The only exception to the above rule are extending direct IO writes where
      blkdev_direct_IO() waits for IO to complete before increasing i_size and
      thus stale data exposure is not possible. For now we don't complicate
      the code with optimizing this special case since the overhead is pretty
      low. In case this is observed to be a performance problem we can always
      handle it using a special flag to ext4_map_blocks().
      
      CC: stable@vger.kernel.org
      Fixes: f3b59291Reported-by: default avatar"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
      Tested-by: default avatar"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      06bd3c36
    • Theodore Ts'o's avatar
      ext4: allow readdir()'s of large empty directories to be interrupted · 1f60fbe7
      Theodore Ts'o authored
      If a directory has a large number of empty blocks, iterating over all
      of them can take a long time, leading to scheduler warnings and users
      getting irritated when they can't kill a process in the middle of one
      of these long-running readdir operations.  Fix this by adding checks to
      ext4_readdir() and ext4_htree_fill_tree().
      
      This was reverted earlier due to a typo in the original commit where I
      experimented with using signal_pending() instead of
      fatal_signal_pending().  The test was in the wrong place if we were
      going to return signal_pending() since we would end up returning
      duplicant entries.  See 9f2394c9 for a more detailed explanation.
      
      Added fix as suggested by Linus to check for signal_pending() in
      in the filldir() functions.
      Reported-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Google-Bug-Id: 27880676
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      1f60fbe7
  3. 18 Apr, 2016 1 commit
  4. 17 Apr, 2016 5 commits
  5. 16 Apr, 2016 7 commits
  6. 15 Apr, 2016 17 commits
  7. 14 Apr, 2016 2 commits
    • Mike Snitzer's avatar
      dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros · 9567366f
      Mike Snitzer authored
      The READ_LOCK macro was incorrectly returning -EINVAL if
      dm_bm_is_read_only() was true -- it will always be true once the cache
      metadata transitions to read-only by dm_cache_metadata_set_read_only().
      
      Wrap READ_LOCK and WRITE_LOCK multi-statement macros in do {} while(0).
      Also, all accesses of the 'cmd' argument passed to these related macros
      are now encapsulated in parenthesis.
      
      A follow-up patch can be developed to eliminate the use of macros in
      favor of pure C code.  Avoiding that now given that this needs to apply
      to stable@.
      Reported-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Fixes: d14fcf3d ("dm cache: make sure every metadata function checks fail_io")
      Cc: stable@vger.kernel.org
      9567366f
    • Keith Busch's avatar
      NVMe: Always use MSI/MSI-x interrupts · a5229050
      Keith Busch authored
      Multiple users have reported device initialization failure due the driver
      not receiving legacy PCI interrupts. This is not unique to any particular
      controller, but has been observed on multiple platforms.
      
      There have been no issues reported or observed when with message signaled
      interrupts, so this patch attempts to use MSI-x during initialization,
      falling back to MSI. If that fails, legacy would become the default.
      
      The setup_io_queues error handling had to change as a result: the admin
      queue's msix_entry used to be initialized to the legacy IRQ. The case
      where nr_io_queues is 0 would fail request_irq when setting up the admin
      queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving
      the admin queue's msix_entry invalid. Instead, return success immediately.
      Reported-by: default avatarTim Muhlemmer <muhlemmer@gmail.com>
      Reported-by: default avatarJon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: default avatarKeith Busch <keith.busch@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a5229050