1. 01 Oct, 2021 4 commits
    • ext4: flush s_error_work before journal destroy in ext4_fill_super · bb9464e0
      yangerkun authored
      The error path in ext4_fill_super forgets to flush s_error_work before
      journal destroy, which may trigger the following bug, since
      flush_stashed_error_work can run concurrently with journal destroy
      without any protection for sbi->s_journal.
      
      [32031.740193] EXT4-fs (loop66): get root inode failed
      [32031.740484] EXT4-fs (loop66): mount failed
      [32031.759805] ------------[ cut here ]------------
      [32031.759807] kernel BUG at fs/jbd2/transaction.c:373!
      [32031.760075] invalid opcode: 0000 [#1] SMP PTI
      [32031.760336] CPU: 5 PID: 1029268 Comm: kworker/5:1 Kdump: loaded
      4.18.0
      [32031.765112] Call Trace:
      [32031.765375]  ? __switch_to_asm+0x35/0x70
      [32031.765635]  ? __switch_to_asm+0x41/0x70
      [32031.765893]  ? __switch_to_asm+0x35/0x70
      [32031.766148]  ? __switch_to_asm+0x41/0x70
      [32031.766405]  ? _cond_resched+0x15/0x40
      [32031.766665]  jbd2__journal_start+0xf1/0x1f0 [jbd2]
      [32031.766934]  jbd2_journal_start+0x19/0x20 [jbd2]
      [32031.767218]  flush_stashed_error_work+0x30/0x90 [ext4]
      [32031.767487]  process_one_work+0x195/0x390
      [32031.767747]  worker_thread+0x30/0x390
      [32031.768007]  ? process_one_work+0x390/0x390
      [32031.768265]  kthread+0x10d/0x130
      [32031.768521]  ? kthread_flush_work_fn+0x10/0x10
      [32031.768778]  ret_from_fork+0x35/0x40
      
      static int start_this_handle(...)
          BUG_ON(journal->j_flags & JBD2_UNMOUNT); <---- Trigger this
      
      Besides, once fast commit is enabled, ext4_fc_replay can add work to
      s_error_work yet return success, so the later journal destroy in
      ext4_load_journal can trigger this problem too.
      
      Fix this problem in two steps:
      1. Call ext4_commit_super directly in ext4_handle_error when it is
         called from ext4_fc_replay
      2. Since it's hard to pair the init and flush for s_error_work, add
         an extra flush_work before journal destroy in ext4_fill_super
      
      Besides, this patch also calls ext4_commit_super in ext4_handle_error
      for every nojournal case. This seems safe, since the reason we call
      schedule_work is to save error info to the sb through the journal
      if available. Conversely, for the nojournal case, it seems useless to
      delay committing the superblock to s_error_work.
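
The ordering that step 2 relies on can be modeled in user space. This is a minimal sketch, not the ext4 code: `toy_sb`, `toy_journal`, `toy_flush_work`, `toy_journal_destroy` and `toy_fail_mount` are hypothetical stand-ins for the superblock info, the journal, flush_work(), journal destroy and the mount error path. It only shows that deferred work must run while the journal is still usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the journal; not the jbd2 structure. */
struct toy_journal {
    bool unmounting;   /* models the JBD2_UNMOUNT flag */
    int  handles;      /* number of handles started so far */
};

/* Hypothetical stand-in for the superblock info. */
struct toy_sb {
    struct toy_journal *journal;
    bool error_work_pending;   /* models queued s_error_work */
};

/* Models the BUG_ON in the trace: no handle on a journal being torn down. */
static void toy_journal_start(struct toy_journal *j)
{
    assert(!j->unmounting);
    j->handles++;
}

/* Models flush_work(): run the deferred work now if any is queued. */
static void toy_flush_work(struct toy_sb *sb)
{
    if (!sb->error_work_pending)
        return;
    sb->error_work_pending = false;
    if (sb->journal)
        toy_journal_start(sb->journal);
}

/* Models journal destroy: after this, no handle may be started. */
static void toy_journal_destroy(struct toy_sb *sb)
{
    sb->journal->unmounting = true;
    sb->journal = NULL;
}

/* Error path in the fixed order: flush first, then destroy. */
static void toy_fail_mount(struct toy_sb *sb)
{
    toy_flush_work(sb);
    toy_journal_destroy(sb);
}
```

Swapping the two calls in `toy_fail_mount` would leave the work queued against a journal that is already gone, which is the situation the trace above reports.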
      
      Fixes: c92dc856 ("ext4: defer saving error info from atomic context")
      Fixes: 2d01ddc8 ("ext4: save error info to sb through journal if available")
      Cc: stable@kernel.org
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20210924093917.1953239-1-yangerkun@huawei.com
    • ext4: fix loff_t overflow in ext4_max_bitmap_size() · 75ca6ad4
      Ritesh Harjani authored
      We should use unsigned long long rather than loff_t to avoid
      overflow in ext4_max_bitmap_size() in the comparison before returning.
      Without this patch, sbi->s_bitmap_maxbytes became negative due to
      overflow of upper_limit (with has_huge_files set to true).
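
The overflow can be checked with unsigned arithmetic. This is a minimal sketch under stated assumptions: it assumes the block-count limit with has_huge_files is 2^48 - 1 (48-bit i_blocks), it ignores the indirect-block adjustment, and `limit_overflows_s64` and `huge_file_block_limit` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if (blocks << bits) does not fit in a signed 64-bit value
 * such as loff_t, 0 otherwise. The check is done without performing the
 * shift, so it never overflows itself.
 */
static int limit_overflows_s64(uint64_t blocks, unsigned bits)
{
    assert(bits < 63);
    return blocks > ((uint64_t)INT64_MAX >> bits);
}

/* Assumed block-count limit with has_huge_files: 48-bit i_blocks. */
static uint64_t huge_file_block_limit(void)
{
    return (1ULL << 48) - 1;
}
```

With `bits` set to 16 (64KB blocks) the check reports an overflow, while with `bits` set to 12 (4KB blocks) it does not, which matches the commit's note that the problem shows up on a 64KB pagesize system.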
      
      Below is a quick test to trigger it on a 64KB pagesize system.
      
      sudo mkfs.ext4 -b 65536 -O ^has_extents,^64bit /dev/loop2
      sudo mount /dev/loop2 /mnt
      sudo echo "hello" > /mnt/hello 	-> This will error out with
      				"echo: write error: File too large"
      Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/594f409e2c543e90fd836b78188dfa5c575065ba.1622867594.git.riteshh@linux.ibm.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix reserved space counter leakage · 6fed8395
      Jeffle Xu authored
      When ext4_insert_delayed_block() receives and recovers from an error from
      ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
      space it has reserved for that block insertion as it should. One effect
      of this bug is that s_dirtyclusters_counter is not decremented and
      remains incorrectly elevated until the file system has been unmounted.
      This can result in premature ENOSPC returns and apparent loss of free
      space.
      
      Another effect of this bug is that
      /sys/fs/ext4/<dev>/delayed_allocation_blocks can remain non-zero even
      after syncfs has been executed on the filesystem.
      
      Besides, add a check for s_dirtyclusters_counter when the inode is
      about to be evicted and freed. s_dirtyclusters_counter can remain
      non-zero until the inode is written back in .evict_inode(), so the
      check is delayed to .destroy_inode().
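
The shape of the fix, releasing a reservation when the following step fails, can be sketched in user space. Everything here is hypothetical: `toy_fs` stands in for the filesystem state with a single counter in place of s_dirtyclusters_counter, and `toy_es_insert` stands in for the step that can fail with ENOMEM. This is an illustration of the pattern, not the ext4 code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical counter standing in for s_dirtyclusters_counter. */
struct toy_fs {
    long dirty_clusters;
};

/* Reserve space for one block; always succeeds in this sketch. */
static int toy_reserve(struct toy_fs *fs)
{
    fs->dirty_clusters++;
    return 0;
}

/* Give back a reservation taken by toy_reserve(). */
static void toy_release(struct toy_fs *fs)
{
    assert(fs->dirty_clusters > 0);
    fs->dirty_clusters--;
}

/* Hypothetical insert step; fails with -ENOMEM when asked to. */
static int toy_es_insert(int fail)
{
    return fail ? -ENOMEM : 0;
}

/* Reserve, then insert; undo the reservation if the insert fails. */
static int toy_insert_delayed_block(struct toy_fs *fs, int fail)
{
    int ret = toy_reserve(fs);

    if (ret)
        return ret;
    ret = toy_es_insert(fail);
    if (ret)
        toy_release(fs);   /* without this, the counter stays elevated */
    return ret;
}
```

After a failed call the counter is back at its previous value, so repeated failures cannot accumulate into a premature ENOSPC.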
      
      Fixes: 51865fda ("ext4: let ext4 maintain extent status tree")
      Cc: stable@kernel.org
      Suggested-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20210823061358.84473-1-jefflexu@linux.alibaba.com
    • ext4: limit the number of blocks in one ADD_RANGE TLV · a2c2f082
      Hou Tao authored
      Currently EXT4_FC_TAG_ADD_RANGE uses ext4_extent to track the
      newly added blocks, but the limit on the maximum value of the
      ee_len field is ignored, which can lead to a BUG_ON as
      shown below when running the command "fallocate -l 128M file"
      on a fast_commit-enabled fs:
      
        kernel BUG at fs/ext4/ext4_extents.h:199!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 3 PID: 624 Comm: fallocate Not tainted 5.14.0-rc6+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
        RIP: 0010:ext4_fc_write_inode_data+0x1f3/0x200
        Call Trace:
         ? ext4_fc_write_inode+0xf2/0x150
         ext4_fc_commit+0x93b/0xa00
         ? ext4_fallocate+0x1ad/0x10d0
         ext4_sync_file+0x157/0x340
         ? ext4_sync_file+0x157/0x340
         vfs_fsync_range+0x49/0x80
         do_fsync+0x3d/0x70
         __x64_sys_fsync+0x14/0x20
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix it simply by limiting the number of blocks
      in one EXT4_FC_TAG_ADD_RANGE TLV.
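
Limiting the blocks per TLV amounts to splitting a long range into capped chunks. The sketch below assumes a cap of 32767 blocks (the usual maximum length of an unwritten ext4 extent) and 4 KiB blocks, so that 128M is 32768 blocks; `TOY_MAX_LEN` and `toy_split_range` are hypothetical names and this is not the fast-commit code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-chunk cap for this sketch, in blocks. */
#define TOY_MAX_LEN 32767u

/*
 * Split a range of `len` blocks into chunks of at most TOY_MAX_LEN blocks.
 * Chunk lengths are written to out[], which must hold at least `cap`
 * entries. Returns the number of chunks written.
 */
static unsigned toy_split_range(uint32_t len, uint32_t *out, unsigned cap)
{
    unsigned n = 0;

    while (len > 0) {
        uint32_t cur = len > TOY_MAX_LEN ? TOY_MAX_LEN : len;

        assert(n < cap);
        out[n++] = cur;
        len -= cur;
    }
    return n;
}
```

Under these assumptions a 32768-block range becomes two chunks of 32767 and 1 blocks, each of which fits the capped length field.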
      
      Fixes: aa75f4d3 ("ext4: main fast-commit commit path")
      Cc: stable@kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20210820044505.474318-1-houtao1@huawei.com
  2. 09 Sep, 2021 3 commits
  3. 05 Sep, 2021 4 commits
    • ext4: drop unnecessary journal handle in delalloc write · cc883236
      Zhang Yi authored
      After factoring the inline data write procedure out of
      ext4_da_write_end(), we no longer need to start a journal handle for
      either buffer overwrite or append-write. If we need to update
      i_disksize, mark_inode_dirty() starts a handle and updates the inode
      buffer itself. So we can remove all the journal handle code from the
      delalloc write procedure.
      
      This patch improves performance considerably: the Unixbench index
      score below rises from 1186.6 to 2053.0. The comparison was run on my
      machine with an 'Intel Xeon Gold 5120' CPU and an NVMe SSD backend.
      
      Test cmd:
      
        ./Run -c 56 -i 3 fstime fsbuffer fsdisk
      
      Before this patch:
      
        System Benchmarks Partial Index           BASELINE       RESULT   INDEX
        File Copy 1024 bufsize 2000 maxblocks       3960.0     422965.0   1068.1
        File Copy 256 bufsize 500 maxblocks         1655.0     105077.0   634.9
        File Copy 4096 bufsize 8000 maxblocks       5800.0    1429092.0   2464.0
                                                                          ======
        System Benchmarks Index Score (Partial Only)                      1186.6
      
      After this patch:
      
        System Benchmarks Partial Index           BASELINE       RESULT   INDEX
        File Copy 1024 bufsize 2000 maxblocks       3960.0     732716.0   1850.3
        File Copy 256 bufsize 500 maxblocks         1655.0     184940.0   1117.5
        File Copy 4096 bufsize 8000 maxblocks       5800.0    2427152.0   4184.7
                                                                          ======
        System Benchmarks Index Score (Partial Only)                      2053.0
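
The two tables can be cross-checked by hand. The formulas below are inferred from the figures themselves rather than stated in the commit message: each INDEX is consistent with ten times RESULT over BASELINE, and each overall score with the geometric mean of the three indexes.

```latex
\begin{align*}
\text{INDEX} &= 10 \cdot \frac{\text{RESULT}}{\text{BASELINE}},
  \qquad 10 \cdot \frac{422965.0}{3960.0} \approx 1068.1 \\
\text{Score}_{\text{before}} &= \sqrt[3]{1068.1 \cdot 634.9 \cdot 2464.0} \approx 1186.6 \\
\text{Score}_{\text{after}} &= \sqrt[3]{1850.3 \cdot 1117.5 \cdot 4184.7} \approx 2053.0 \\
\frac{\text{Score}_{\text{after}}}{\text{Score}_{\text{before}}} &= \frac{2053.0}{1186.6} \approx 1.73
\end{align*}
```

On this reading the partial index score improves by roughly 73% with the patch applied.
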
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20210716122024.1105856-5-yi.zhang@huawei.com
    • ext4: factor out write end code of inline file · 6984aef5
      Zhang Yi authored
      The inline_data file write-end procedure is currently folded into the
      common write-end functions, which makes the code unclear. Factor it
      out and do some cleanup. This patch also drops
      ext4_da_write_inline_data_end() and switches to
      ext4_write_inline_data_end() instead, because we need to do the same
      error processing if we fail to write data into the inline entry.
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com
    • ext4: correct the error path of ext4_write_inline_data_end() · 55ce2f64
      Zhang Yi authored
      The current error path of ext4_write_inline_data_end() is not correct.

      Firstly, it should pass out the error value if ext4_get_inode_loc()
      fails, or else it could trigger an infinite loop if we inject an error
      there. Secondly, it is better to add the inode to the orphan list if
      ext4_journal_stop() fails, otherwise we cannot restore the inline
      xattr entry after a power failure. Finally, we need to reset the 'ret'
      value if ext4_write_inline_data_end() returns success in
      ext4_write_end() and ext4_journalled_write_end(), otherwise we cannot
      get the error return value of ext4_journal_stop().
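
The last point, resetting 'ret' so that a later error is not masked, can be shown with a small model. This is a sketch under assumptions: `toy_write_end` is a hypothetical function whose two parameters stand in for the return values of the inline write-end step and of stopping the journal handle; it is not the ext4 code.

```c
#include <assert.h>
#include <errno.h>

/*
 * inline_ret: bytes copied (>= 0) or a negative errno from the inline
 *             write-end step.
 * stop_ret:   0 or a negative errno from stopping the journal handle.
 * Returns the first error seen, otherwise the number of bytes copied.
 */
static int toy_write_end(int inline_ret, int stop_ret)
{
    int ret = inline_ret;
    int copied = 0;

    assert(stop_ret <= 0);
    if (ret >= 0) {
        copied = ret;
        ret = 0;   /* reset, so the byte count cannot hide a later error */
    }
    if (!ret)
        ret = stop_ret;
    return ret ? ret : copied;
}
```

Without the reset, a positive byte count in `ret` would make the `if (!ret)` test false and the error from stopping the handle would be dropped.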
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com
    • ext4: check and update i_disksize properly · 4df031ff
      Zhang Yi authored
      After commit 3da40c7b ("ext4: only call ext4_truncate when size <=
      isize"), i_disksize can always be updated to i_size in ext4_setattr(),
      and we can be sure that i_disksize <= i_size since we hold the inode
      lock; if i_disksize < i_size, there are delalloc writes pending in the
      range up to i_size. If the end of the current write is <= i_size,
      there's no need to touch i_disksize since writeback will push
      i_disksize up to i_size eventually. So we can switch to checking
      i_size instead of i_disksize in ext4_da_write_end() when writing to
      the end of the file. We can also remove ext4_mark_inode_dirty()
      because we defer inode dirtying to generic_write_end() or
      ext4_da_write_inline_data_end().
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20210716122024.1105856-2-yi.zhang@huawei.com
  4. 02 Sep, 2021 1 commit
    • ext4: add error checking to ext4_ext_replay_set_iblocks() · 1fd95c05
      Theodore Ts'o authored
      If the call to ext4_map_blocks() fails due to a corrupted file
      system, ext4_ext_replay_set_iblocks() can get stuck in an infinite
      loop.  This could be reproduced by running generic/526 with a file
      system that has inline_data and fast_commit enabled.  The system will
      repeatedly log to the console:
      
      EXT4-fs warning (device dm-3): ext4_block_to_path:105: block 1074800922 > max in inode 131076
      
      and the stack that it gets stuck in is:
      
         ext4_block_to_path+0xe3/0x130
         ext4_ind_map_blocks+0x93/0x690
         ext4_map_blocks+0x100/0x660
         skip_hole+0x47/0x70
         ext4_ext_replay_set_iblocks+0x223/0x440
         ext4_fc_replay_inode+0x29e/0x3b0
         ext4_fc_replay+0x278/0x550
         do_one_pass+0x646/0xc10
         jbd2_journal_recover+0x14a/0x270
         jbd2_journal_load+0xc4/0x150
         ext4_load_journal+0x1f3/0x490
         ext4_fill_super+0x22d4/0x2c00
      
      With this patch, generic/526 still fails, but the system no longer
      locks up in a tight loop.  It's likely the root cause is that
      fast_commit replay is corrupting file systems with inline_data, and we
      probably need to add better error handling in the fast commit replay
      code path beyond what is done here, which essentially just breaks the
      infinite loop without reporting the error to the higher levels of the code.
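
How a missing error check turns into a tight loop, and how checking the return value breaks it, can be sketched with a toy scanner. All names here are hypothetical: `toy_map_fn` stands in for a block-mapping call that returns a positive count when a block is mapped, 0 for a hole, and a negative errno on failure. This is a model of the pattern, not the replay code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mapper: >0 if blk is mapped, 0 for a hole, <0 on error. */
typedef int (*toy_map_fn)(unsigned blk);

/* Example mapper: only block 3 is mapped. */
static int toy_map_at_3(unsigned blk)
{
    return blk == 3 ? 1 : 0;
}

/* Example mapper: every lookup fails, as on a corrupted file system. */
static int toy_map_fail(unsigned blk)
{
    (void)blk;
    return -EIO;
}

/*
 * Advance *cur past holes until a mapped block or `end` is reached.
 * Returns 0 on success or the negative errno from the mapper.
 */
static int toy_skip_hole(toy_map_fn map, unsigned *cur, unsigned end)
{
    assert(map && cur);
    while (*cur < end) {
        int ret = map(*cur);

        if (ret < 0)
            return ret;   /* stop instead of retrying the same block */
        if (ret > 0)
            return 0;     /* found a mapped block */
        (*cur)++;
    }
    return 0;
}
```

A caller that ignored the negative return and simply called the scanner again would see `*cur` unchanged every time, which corresponds to the repeated warning quoted above.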
      
      Fixes: 8016E29F4362 ("ext4: fast commit recovery path")
      Cc: stable@kernel.org
      Cc: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  5. 31 Aug, 2021 17 commits
  6. 12 Aug, 2021 3 commits
  7. 10 Aug, 2021 3 commits
  8. 06 Aug, 2021 1 commit
  9. 23 Jul, 2021 2 commits
  10. 18 Jul, 2021 2 commits
    • Linux 5.14-rc2 · 2734d6c1
      Linus Torvalds authored
    • Merge tag 'perf-tools-fixes-for-v5.14-2021-07-18' of... · 8c25c447
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.14-2021-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Skip invalid hybrid PMU on hybrid systems when the atom (little) CPUs
         are offlined.
      
       - Fix 'perf test' problems related to the recently added hybrid
         (BIG/little) code.
      
       - Split ARM's coresight (hw tracing) decode by aux records to avoid
         fatal decoding errors.
      
       - Fix add event failure in 'perf probe' when running 32-bit perf in a
         64-bit kernel.
      
       - Fix 'perf sched record' failure when CONFIG_SCHEDSTATS is not set.
      
       - Fix memory and refcount leaks detected by ASan when running 'perf
         test', should be clean of warnings now.
      
       - Remove broken definition of __LITTLE_ENDIAN from tools'
         linux/kconfig.h, which was breaking the build in some systems.
      
       - Cast PTHREAD_STACK_MIN to int as it may turn into 'long
         sysconf(__SC_THREAD_STACK_MIN_VALUE)', breaking the build in some
         systems.
      
       - Fix libperf build error with LIBPFM4=1.
      
       - Sync UAPI files changed by the memfd_secret new syscall.
      
      * tag 'perf-tools-fixes-for-v5.14-2021-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (35 commits)
        perf sched: Fix record failure when CONFIG_SCHEDSTATS is not set
        perf probe: Fix add event failure when running 32-bit perf in a 64-bit kernel
        perf data: Close all files in close_dir()
        perf probe-file: Delete namelist in del_events() on the error path
        perf test bpf: Free obj_buf
        perf trace: Free strings in trace__parse_events_option()
        perf trace: Free syscall tp fields in evsel->priv
        perf trace: Free syscall->arg_fmt
        perf trace: Free malloc'd trace fields on exit
        perf lzma: Close lzma stream on exit
        perf script: Fix memory 'threads' and 'cpus' leaks on exit
        perf script: Release zstd data
        perf session: Cleanup trace_event
        perf inject: Close inject.output on exit
        perf report: Free generated help strings for sort option
        perf env: Fix memory leak of cpu_pmu_caps
        perf test maps__merge_in: Fix memory leak of maps
        perf dso: Fix memory leak in dso__new_map()
        perf test event_update: Fix memory leak of unit
        perf test event_update: Fix memory leak of evlist
        ...