1. 05 Jan, 2024 1 commit
  2. 25 Oct, 2023 2 commits
  3. 18 Oct, 2023 1 commit
  4. 06 Oct, 2023 1 commit
    • Ojaswin Mujoo's avatar
      ext4: mark buffer new if it is unwritten to avoid stale data exposure · 2cd8bdb5
      Ojaswin Mujoo authored
      
      ** Short Version **
      
      In ext4 with dioread_nolock, we could have a scenario where the bh returned by
      get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
      UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
      never zero out the range of bh that is not under write, causing whatever stale
      data is present in the folio at that time to be written out to disk. To fix this
      mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().
      
      ** Long Version **
      
      The issue mentioned above was resulting in two different bugs:
      
      1. On block size < page size case in ext4, generic/269 was reliably
      failing with dioread_nolock. The state of the write was as follows:
      
        * The write was extending i_size.
        * The last block of the file was fallocated and had an unwritten extent
        * We were near ENOSPC and hence we were switching to non-delayed alloc
          allocation.
      
      In this case, the back trace that triggers the bug is as follows:
      
        ext4_da_write_begin()
          /* switch to nodelalloc due to low space */
          ext4_write_begin()
            ext4_should_dioread_nolock() // true since mount flags still have delalloc
            __block_write_begin(..., ext4_get_block_unwritten)
              __block_write_begin_int()
                for(each buffer head in page) {
                  /* first iteration, this is bh1 which contains i_size */
                  if (!buffer_mapped)
                    get_block() /* returns bh with only UNWRITTEN and MAPPED */
                  /* second iteration, bh2 */
                    if (!buffer_mapped)
                      get_block() /* we fail here, could be ENOSPC */
                }
                if (err)
                  /*
                   * this would zero out all new buffers and mark them uptodate.
                   * Since bh1 was never marked new, we skip it here which causes
                   * the bug later.
                   */
                  folio_zero_new_buffers();
            /* ext4_wrte_begin() error handling */
            ext4_truncate_failed_write()
              ext4_truncate()
                ext4_block_truncate_page()
                  __ext4_block_zero_page_range()
                    if(!buffer_uptodate())
                      ext4_read_bh_lock()
                        ext4_read_bh() -> ... ext4_submit_bh_wbc()
                          BUG_ON(buffer_unwritten(bh)); /* !!! */
      
      2. The second issue is stale data exposure with page size >= blocksize
      with dioread_nolock. The conditions needed for it to happen are same as
      the previous issue ie dioread_nolock around ENOSPC condition. The issue
      is also similar where in __block_write_begin_int() when we call
      ext4_get_block_unwritten() on the buffer_head and the underlying extent
      is unwritten, we get an unwritten and mapped buffer head. Since it is
      not new, we never zero out the partial range which is not under write,
      thus writing stale data to disk. This can be easily observed with the
      following reproducer:
      
       fallocate -l 4k testfile
       xfs_io -c "pwrite 2k 2k" testfile
       # hexdump output will have stale data in from byte 0 to 2k in testfile
       hexdump -C testfile
      
      NOTE: To trigger this, we need dioread_nolock enabled and write happening via
      ext4_write_begin(), which is usually used when we have -o nodealloc. Since
      dioread_nolock is disabled with nodelalloc, the only alternate way to call
      ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
      ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
      ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
      This might make the issue harder to hit. Hence, for reliable replication, I used
      the below patch to temporarily allow dioread_nolock with nodelalloc and then
      mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
      data issue 100% of times:
      
      @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
        if (ext4_should_journal_data(inode))
          return 0;
        /* temporary fix to prevent generic/422 test failures */
      - if (!test_opt(inode->i_sb, DELALLOC))
      -   return 0;
      + // if (!test_opt(inode->i_sb, DELALLOC))
      + //  return 0;
        return 1;
       }
      
      After applying this patch to mark buffer as NEW, both the above issues are
      fixed.
      Signed-off-by: default avatarOjaswin Mujoo <ojaswin@linux.ibm.com>
      Cc: stable@kernel.org
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      2cd8bdb5
  5. 27 Aug, 2023 2 commits
    • Eric Biggers's avatar
      ext4: reject casefold inode flag without casefold feature · 8216776c
      Eric Biggers authored
      
      It is invalid for the casefold inode flag to be set without the casefold
      superblock feature flag also being set.  e2fsck already considers this
      case to be invalid and handles it by offering to clear the casefold flag
      on the inode.  __ext4_iget() also already considered this to be invalid,
      sort of, but it only got so far as logging an error message; it didn't
      actually reject the inode.  Make it reject the inode so that other code
      doesn't have to handle this case.  This matches what f2fs does.
      
      Note: we could check 's_encoding != NULL' instead of
      ext4_has_feature_casefold().  This would make the check robust against
      the casefold feature being enabled by userspace writing to the page
      cache of the mounted block device.  However, it's unsolvable in general
      for filesystems to be robust against concurrent writes to the page cache
      of the mounted block device.  Though this very particular scenario
      involving the casefold feature is solvable, we should not pretend that
      we can support this model, so let's just check the casefold feature.
      tune2fs already forbids enabling casefold on a mounted filesystem.
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20230814182903.37267-2-ebiggers@kernel.org
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      8216776c
    • Liu Song's avatar
      ext4: do not mark inode dirty every time when appending using delalloc · 03de20be
      Liu Song authored
      
      In the delalloc append write scenario, if inode's i_size is extended due
      to buffer write, there are delalloc writes pending in the range up to
      i_size, and no need to touch i_disksize since writeback will push
      i_disksize up to i_size eventually. Offers significant performance
      improvement in high-frequency append write scenarios.
      
      I conducted tests in my 32-core environment by launching 32 concurrent
      threads to append write to the same file. Each write operation had a
      length of 1024 bytes and was repeated 100000 times. Without using this
      patch, the test was completed in 7705 ms. However, with this patch, the
      test was completed in 5066 ms, resulting in a performance improvement of
      34%.
      
      Moreover, in test scenarios of Kafka version 2.6.2, using packet size of
      2K, with this patch resulted in a 10% performance improvement.
      Signed-off-by: default avatarLiu Song <liusong@linux.alibaba.com>
      Suggested-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20230810154333.84921-1-liusong@linux.alibaba.com
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      03de20be
  6. 18 Aug, 2023 1 commit
  7. 09 Aug, 2023 1 commit
  8. 02 Aug, 2023 1 commit
  9. 29 Jul, 2023 3 commits
  10. 24 Jul, 2023 1 commit
  11. 26 Jun, 2023 4 commits
  12. 19 Jun, 2023 1 commit
  13. 15 Jun, 2023 3 commits
  14. 30 May, 2023 1 commit
  15. 28 May, 2023 1 commit
    • Theodore Ts'o's avatar
      ext4: add EA_INODE checking to ext4_iget() · b3e6bcb9
      Theodore Ts'o authored
      
      Add a new flag, EXT4_IGET_EA_INODE which indicates whether the inode
      is expected to have the EA_INODE flag or not.  If the flag is not
      set/clear as expected, then fail the iget() operation and mark the
      file system as corrupted.
      
      This commit also makes the ext4_iget() always perform the
      is_bad_inode() check even when the inode is already inode cache.  This
      allows us to remove the is_bad_inode() check from the callers of
      ext4_iget() in the ea_inode code.
      
      Reported-by: syzbot+cbb68193bdb95af4340a@syzkaller.appspotmail.com
      Reported-by: syzbot+62120febbd1ee3c3c860@syzkaller.appspotmail.com
      Reported-by: syzbot+edce54daffee36421b4c@syzkaller.appspotmail.com
      Cc: stable@kernel.org
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230524034951.779531-2-tytso@mit.edu
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      b3e6bcb9
  16. 13 May, 2023 2 commits
  17. 28 Apr, 2023 1 commit
    • Zhihao Cheng's avatar
      ext4: fix i_disksize exceeding i_size problem in paritally written case · 1dedde69
      Zhihao Cheng authored
      It is possible for i_disksize can exceed i_size, triggering a warning.
      
      generic_perform_write
       copied = iov_iter_copy_from_user_atomic(len) // copied < len
       ext4_da_write_end
       | ext4_update_i_disksize
       |  new_i_size = pos + copied;
       |  WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize) // update i_disksize
       | generic_write_end
       |  copied = block_write_end(copied, len) // copied = 0
       |   if (unlikely(copied < len))
       |    if (!PageUptodate(page))
       |     copied = 0;
       |  if (pos + copied > inode->i_size) // return false
       if (unlikely(copied == 0))
        goto again;
       if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
        status = -EFAULT;
        break;
       }
      
      We get i_disksize greater than i_size here, which could trigger WARNING
      check 'i_size_read(inode) < EXT4_I(inode)->i_disksize' while doing dio:
      
      ext4_dio_write_iter
       iomap_dio_rw
        __iomap_dio_rw // return err, length is not aligned to 512
       ext4_handle_inode_extension
        WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize) // Oops
      
       WARNING: CPU: 2 PID: 2609 at fs/ext4/file.c:319
       CPU: 2 PID: 2609 Comm: aa Not tainted 6.3.0-rc2
       RIP: 0010:ext4_file_write_iter+0xbc7
       Call Trace:
        vfs_write+0x3b1
        ksys_write+0x77
        do_syscall_64+0x39
      
      Fix it by updating 'copied' value before updating i_disksize just like
      ext4_write_inline_data_end() does.
      
      A reproducer can be found in the buganizer link below.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217209
      Fixes: 64769240
      
       ("ext4: Add delayed allocation support in data=writeback mode")
      Signed-off-by: default avatarZhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20230321013721.89818-1-chengzhihao1@huawei.com
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      1dedde69
  18. 15 Apr, 2023 1 commit
  19. 14 Apr, 2023 7 commits
  20. 06 Apr, 2023 5 commits