1. 30 Aug, 2021 1 commit
  2. 23 Aug, 2021 3 commits
    • f2fs: rebuild nat_bits during umount · 94c821fb
      Chao Yu authored
      If the free_nat_bitmap is fully available, nat_bits can be rebuilt
      entirely from it during umount, which gives another chance to
      re-enable nat_bits for the image.
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: introduce periodic iostat io latency traces · a4b68176
      Daeho Jeong authored
      Whenever we notice sluggishness on our machines, we want to know how
      well each type of I/O in the f2fs filesystem is being handled, but this
      kind of real data is hard to get. First, we need to reproduce the issue
      with a profiling tool such as blktrace turned on, and the issue does not
      recur easily. Second, the intervention of any tool slightly changes the
      overall timing of the issue, which sometimes makes it hard to figure
      out.

      So, I added a feature that prints I/O latency statistics as tracepoint
      events, the minimum needed to understand the filesystem's I/O-related
      behavior, under the F2FS_IOSTAT kernel config. With the "iostat_enable"
      sysfs node turned on, we can get these statistics periodically, and
      doing so should add very little overhead.
      
      [samples]
       f2fs_ckpt-254:1-507     [003] ....  2842.439683: f2fs_iostat_latency:
      dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
      rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
      wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
      wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
      wr_async_node [0/0/0], wr_async_meta [0/0/0]
      
       f2fs_ckpt-254:1-507     [002] ....  2845.450514: f2fs_iostat_latency:
      dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
      rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
      wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
      wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
      wr_async_node [0/0/0], wr_async_meta [0/0/0]
      Signed-off-by: Daeho Jeong <daehojeong@google.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
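The f2fs_iostat_latency samples above follow a regular "name [peak/avg/count]" layout, so they can be post-processed in user space. The following is a minimal sketch, assuming the event text looks exactly like the samples in this commit message; the function name `parse_iostat_latency` and the dictionary keys are hypothetical and not part of the kernel patch.

```python
import re

# One "iotype [peak/avg/count]" triple, e.g. "rd_data [136/1/801]".
# The header "iotype [peak lat.(ms)/avg lat.(ms)/count]" has no digits in the
# brackets, so it is skipped automatically.
ENTRY_RE = re.compile(r"(\w+) \[(\d+)/(\d+)/(\d+)\]")
DEV_RE = re.compile(r"dev = \((\d+),(\d+)\)")


def parse_iostat_latency(event: str) -> dict:
    """Parse one f2fs_iostat_latency event (possibly wrapped over lines)."""
    text = " ".join(event.split())  # undo line wrapping
    dev = DEV_RE.search(text)
    if dev is None:
        raise ValueError("no 'dev = (major,minor)' field found")
    stats = {
        name: {"peak_ms": int(peak), "avg_ms": int(avg), "count": int(count)}
        for name, peak, avg, count in ENTRY_RE.findall(text)
    }
    return {"dev": (int(dev.group(1)), int(dev.group(2))), "stats": stats}


# First sample from the commit message, copied verbatim.
SAMPLE = """
 f2fs_ckpt-254:1-507     [003] ....  2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
"""

parsed = parse_iostat_latency(SAMPLE)
# I/O type with the highest peak latency in this sampling period.
worst = max(parsed["stats"], key=lambda k: parsed["stats"][k]["peak_ms"])
print(parsed["dev"], worst, parsed["stats"][worst])
```

For this sample the I/O type with the highest peak latency is wr_sync_data, with a peak of 164 ms.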
    • f2fs: separate out iostat feature · 52118743
      Daeho Jeong authored
      Add the F2FS_IOSTAT config option to support getting I/O statistics
      through sysfs and printing periodic I/O statistics tracepoint events,
      and move the I/O statistics code into separate files for easier
      maintenance.
      Signed-off-by: Daeho Jeong <daehojeong@google.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      [Jaegeuk Kim: set default=y]
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  3. 17 Aug, 2021 6 commits
  4. 13 Aug, 2021 2 commits
  5. 12 Aug, 2021 1 commit
  6. 06 Aug, 2021 2 commits
  7. 05 Aug, 2021 2 commits
    • f2fs: extent cache: support unaligned extent · 94afd6d6
      Chao Yu authored
      A compressed inode may suffer from poor read performance because it
      cannot use the extent cache, so I propose adding unaligned extent
      support to improve it.

      Currently, it only works on a read-only format f2fs image.

      Unaligned extent: in one compressed cluster, the number of physical
      blocks is less than the number of logical blocks, so we add an extra
      physical block length to the extent info to indicate this extent status.

      The idea is that if all blocks of a cluster are physically contiguous,
      then once its mapping info has been read for the first time, we cache an
      unaligned (or aligned) extent info entry in the extent cache, expecting
      the mapping info to be hit when the cluster is reread.
      
      Merge policy:
      - Aligned extents can be merged.
      - An aligned extent and an unaligned extent cannot be merged.
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
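The merge policy above can be illustrated with a small model. This is only a sketch under stated assumptions, not the kernel implementation: the `Extent` class, its field names (including `c_len` for the extra physical block length) and the `mergeable` helper are hypothetical, a `c_len` of 0 is assumed to mean "no compressed cluster", and any pair that involves an unaligned extent is assumed to be non-mergeable, since the commit message does not cover the unaligned-plus-unaligned case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    fofs: int       # first logical block (file offset, in blocks)
    blk: int        # first physical block
    length: int     # logical length, in blocks
    c_len: int = 0  # physical length, in blocks; 0 = not set (assumption)

    @property
    def unaligned(self) -> bool:
        # Unaligned: fewer physical blocks than logical blocks are recorded.
        return self.c_len != 0 and self.c_len != self.length


def mergeable(back: Extent, front: Extent) -> bool:
    """Return True if `front` can be appended to `back`."""
    if back.unaligned or front.unaligned:
        return False  # assumption: unaligned extents never merge
    # Aligned extents merge only when contiguous both logically and physically.
    return (back.fofs + back.length == front.fofs
            and back.blk + back.length == front.blk)


a = Extent(fofs=0, blk=100, length=4)
b = Extent(fofs=4, blk=104, length=4)
c = Extent(fofs=8, blk=108, length=4, c_len=2)  # 4 logical, 2 physical blocks

print(mergeable(a, b))  # two contiguous aligned extents
print(mergeable(b, c))  # aligned followed by unaligned
```

Here `a` and `b` are contiguous aligned extents and can be merged, while `c` stores four logical blocks in two physical blocks and is therefore kept as its own cache entry.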
    • f2fs: Kconfig: clean up config options about compression · 6b3ba1e7
      Tiezhu Yang authored
      In fs/f2fs/Kconfig, F2FS_FS_LZ4HC depends on F2FS_FS_LZ4, and F2FS_FS_LZ4
      depends on F2FS_FS_COMPRESSION, so there is no need for F2FS_FS_LZ4HC to
      depend on F2FS_FS_COMPRESSION explicitly. Remove the redundant
      "depends on", and do the same for F2FS_FS_LZORLE.

      At the same time, it is better to move F2FS_FS_LZORLE next to F2FS_FS_LZO.
      This looks a little clearer in make menuconfig: "LZO-RLE compression
      support" is then located under "LZO compression support" instead of
      "F2FS compression feature".
      
      Without this patch:
      
      F2FS compression feature
        LZO compression support
        LZ4 compression support
          LZ4HC compression support
        ZSTD compression support
        LZO-RLE compression support
      
      With this patch:
      
      F2FS compression feature
        LZO compression support
          LZO-RLE compression support
        LZ4 compression support
          LZ4HC compression support
        ZSTD compression support
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  8. 04 Aug, 2021 2 commits
    • f2fs: reduce the scope of setting fsck tag when de->name_len is zero · d4bf15a7
      Yangtao Li authored
      I recently found an easily reproduced case where de->name_len is 0 in
      f2fs_fill_dentries(), which ends up setting the fsck flag.
      
      Thread A			Thread B
      - f2fs_readdir
       - f2fs_read_inline_dir
        - ctx->pos = d.max
      				- f2fs_add_dentry
      				 - f2fs_add_inline_entry
      				  - do_convert_inline_dir
      				 - f2fs_add_regular_entry
      - f2fs_readdir
       - f2fs_fill_dentries
        - set_sbi_flag(sbi, SBI_NEED_FSCK)
      
      Process A opens the folder and keeps reading it without closing it.
      During this period, process B creates a file under the folder (occupying
      multiple f2fs_dir_entry slots and exceeding the d.max of the inline dir).
      After the creation, process A uses the d.max of the inline dir to read the
      folder again, and it reads a de->name_len of 0.

      Chao also pointed out that, even without inline conversion, the race
      condition can still happen as below:
      
      dir_entry1: A
      dir_entry2: B
      dir_entry3: C
      free slot: _
      ctx->pos: ^
      
      Thread A is traversing the directory, and
      ctx->pos moves to the position below after readdir() by thread A:
      AAAABBBB___
              ^
      
      Then thread B deletes dir_entry2 and creates dir_entry3.

      Thread A calls readdir() to look up dirents starting from the middle
      of the new dirent slots as below:
      AAAACCCCCC_
              ^

      In these scenarios, the file system is not damaged, and the situation is
      hard to avoid. But we can skip setting the FSCK flag if:
      a) bit_pos (:= ctx->pos % d->max) is non-zero, and
      b) bit_pos has not yet moved to the first valid dir_entry.
      
      Fixes: ddf06b75 ("f2fs: fix to trigger fsck if dirent.name_len is zero")
      Signed-off-by: Yangtao Li <frank.li@vivo.com>
      [Chao: clean up description]
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
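The bypass conditions a) and b) above can be tried out on the AAAACCCCCC_ layout with a toy model. This is a simplified sketch, not the kernel's f2fs_fill_dentries(): the `Slot` class, `make_dir` and `fill_dentries` are hypothetical names, an entry's name_len is assumed to be non-zero only on its first slot, and name_len is set to the slot count purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    in_use: bool    # bit set in the dentry bitmap
    name_len: int   # assumed non-zero only on the first slot of an entry
    slots: int = 1  # slots the entry spans (meaningful on the first slot only)


def make_dir(layout: str) -> list:
    """Build slots from a string such as 'AAAACCCCCC_' ('_' is a free slot)."""
    slots, i = [], 0
    while i < len(layout):
        ch = layout[i]
        if ch == "_":
            slots.append(Slot(False, 0))
            i += 1
            continue
        j = i
        while j < len(layout) and layout[j] == ch:
            j += 1
        run = j - i
        slots.append(Slot(True, name_len=run, slots=run))    # head slot
        slots.extend(Slot(True, 0) for _ in range(run - 1))  # continuation slots
        i = j
    return slots


def fill_dentries(slots: list, start: int):
    """Return (entries_emitted, need_fsck) when reading from slot `start`."""
    need_fsck, found_valid, emitted = False, False, 0
    bit_pos = start
    while bit_pos < len(slots):
        slot = slots[bit_pos]
        if not slot.in_use:
            bit_pos += 1
            continue
        if slot.name_len == 0:
            # Flag only when neither bypass condition holds: we are at
            # position 0, or a valid entry has already been seen.
            if found_valid or bit_pos == 0:
                need_fsck = True
            bit_pos += 1
            continue
        found_valid = True
        emitted += 1
        bit_pos += slot.slots
    return emitted, need_fsck


raced = make_dir("AAAACCCCCC_")
print(fill_dentries(raced, start=8))  # resumes in the middle of entry C
print(fill_dentries(raced, start=0))  # full scan from the beginning
```

Resuming at slot 8 lands in the middle of dir_entry3, where name_len reads as 0, but because bit_pos is non-zero and no valid entry has been seen yet, the model does not request fsck. A zero name_len at position 0, or after a valid entry, is still flagged.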
    • f2fs: fix to stop filesystem update once CP failed · 91803392
      Chao Yu authored
      During f2fs_write_checkpoint(), if f2fs_flush_nat_entries() or
      do_checkpoint() fails, filesystem metadata such as the prefree bitmap
      and the nat/sit version bitmaps will not be recovered, which may leave
      the f2fs image inconsistent. Let's just set the CP error flag to avoid
      further updates until we figure out a scheme to roll back all metadata
      in such a condition.
      Reported-by: Yangtao Li <frank.li@vivo.com>
      Signed-off-by: Yangtao Li <frank.li@vivo.com>
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  9. 03 Aug, 2021 2 commits
    • f2fs: add sysfs node to control ra_pages for fadvise seq file · 0f6b56ec
      Daeho Jeong authored
      fadvise() currently allows the user to double the readahead window with
      POSIX_FADV_SEQUENTIAL. In some use cases, however, that is not
      sufficient, and we need to meet the need in a restricted way. With this
      change, the multiplier applied to the bdi device readahead can be set
      between 2 (the default) and 256 for the POSIX_FADV_SEQUENTIAL advice
      option.
      Signed-off-by: Daeho Jeong <daehojeong@google.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
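The effect of the multiplier can be worked through with a few lines of arithmetic. This is a minimal sketch, assuming the readahead window for a POSIX_FADV_SEQUENTIAL file is simply the bdi readahead size times the multiplier; the names `MIN_RA_MUL`, `MAX_RA_MUL` and `seq_ra_pages` are hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption.

```python
MIN_RA_MUL = 2    # default multiplier stated in the commit message
MAX_RA_MUL = 256  # upper bound stated in the commit message


def seq_ra_pages(bdi_ra_pages: int, multiplier: int = MIN_RA_MUL) -> int:
    """Readahead window, in pages, for a file advised as sequential."""
    if not MIN_RA_MUL <= multiplier <= MAX_RA_MUL:
        # Assumption: out-of-range values are rejected, not clamped.
        raise ValueError(f"multiplier must be in [{MIN_RA_MUL}, {MAX_RA_MUL}]")
    return bdi_ra_pages * multiplier


# With a 32-page (128 KiB at 4 KiB pages) bdi readahead window:
print(seq_ra_pages(32))       # default: doubled to 64 pages
print(seq_ra_pages(32, 256))  # largest allowed multiplier: 8192 pages
```

With 4 KiB pages, the largest multiplier turns a 128 KiB bdi window into a 32 MiB one.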
    • f2fs: introduce discard_unit mount option · 4f993264
      Chao Yu authored
      As James Z reported in bugzilla:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=213877
      
      [1.] One-line summary of the problem:
      Mount multiple SMR block devices exceed certain number cause system non-response
      
      [2.] Full description of the problem/report:
      Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
      Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
      The number of SMR devices with other FS mounted on this system does not interfere with the result above.
      
      [3.] Keywords (i.e., modules, networking, kernel):
      F2FS, SMR, Memory
      
      [4.] Kernel information
      [4.1.] Kernel version (uname -a):
      Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
      
      [4.2.] Kernel .config file:
      Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64
      
      [5.] Most recent kernel version which did not have the bug:
      None
      
      [6.] Output of Oops.. message (if applicable) with symbolic information
           resolved (see Documentation/admin-guide/oops-tracing.rst)
      None
      
      [7.] A small shell script or example program which triggers the
           problem (if possible)
      mount /dev/sdX /mnt/0X
      
      [8.] Memory consumption
      
      With 24 * 14T SMR Block device with F2FS
      free -g
                    total        used        free      shared  buff/cache   available
      Mem:             46          36           0           0          10          10
      Swap:             0           0           0
      
      With 3 * 14T SMR Block device with F2FS
      free -g
                     total        used        free      shared  buff/cache   available
      Mem:               7           5           0           0           1           1
      Swap:              7           0           7
      
      The root cause is, there are three bitmaps:
      - cur_valid_map
      - ckpt_valid_map
      - discard_map
      and each of them costs ~500MB of memory. {cur, ckpt}_valid_map are
      necessary, but discard_map is optional, since this bitmap is only
      useful on a mountpoint where small discard is enabled.

      For a blkzoned device such as an SMR or ZNS device, f2fs only issues
      discard for a section (zone) when all blocks of that section are invalid,
      so such a device does not need small discard functionality at all.

      This patch introduces a new mount option
      "discard_unit=block|segment|section" to support issuing discard with a
      different basic unit, aligned to block, segment or section, so that the
      user can specify "discard_unit=segment" or "discard_unit=section" to
      disable small discard functionality.

      Note that this mount option cannot be changed by remount(), because the
      related metadata needs to be initialized during mount().

      In order to save memory, let's use "discard_unit=section" for blkzoned
      devices by default.
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
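The per-device memory cost described above can be estimated with back-of-the-envelope arithmetic. This is a rough sketch, assuming each of the three bitmaps holds one bit per 4 KiB block and that "14TB" means 14 * 10**12 bytes; the helper names are hypothetical, and the model ignores every other per-mount allocation.

```python
BLOCK_SIZE = 4096  # F2FS block size in bytes


def bitmap_bytes(device_bytes: int) -> int:
    """Size of one bitmap with one bit per block, rounded up to whole bytes."""
    blocks = device_bytes // BLOCK_SIZE
    return (blocks + 7) // 8


def per_mount_bitmap_bytes(device_bytes: int, keep_discard_map: bool) -> int:
    """cur_valid_map + ckpt_valid_map, plus discard_map if it is kept."""
    maps = 3 if keep_discard_map else 2
    return maps * bitmap_bytes(device_bytes)


DEVICE = 14 * 10**12  # assumption: decimal terabytes

one_map = bitmap_bytes(DEVICE)
with_discard = per_mount_bitmap_bytes(DEVICE, keep_discard_map=True)
without_discard = per_mount_bitmap_bytes(DEVICE, keep_discard_map=False)

print(f"one bitmap:            {one_map / 1e6:.0f} MB")
print(f"three bitmaps:         {with_discard / 1e9:.2f} GB")
print(f"without discard_map:   {without_discard / 1e9:.2f} GB")
```

Under these assumptions one bitmap is about 427 MB, which is the same order as the ~500MB quoted above, and the three bitmaps together come to about 1.28 GB per device, close to the roughly 1.5 GB per device observed in the bug report. Dropping discard_map removes one third of that.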
  10. 02 Aug, 2021 8 commits
  11. 25 Jul, 2021 3 commits
  12. 20 Jul, 2021 2 commits
    • f2fs: quota: fix potential deadlock · 9de71ede
      Chao Yu authored
      xfstest generic/587 reports a deadlock issue as below:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc1 #69 Not tainted
      ------------------------------------------------------
      repquota/8606 is trying to acquire lock:
      ffff888022ac9320 (&sb->s_type->i_mutex_key#18){+.+.}-{3:3}, at: f2fs_quota_sync+0x207/0x300 [f2fs]
      
      but task is already holding lock:
      ffff8880084bcde8 (&sbi->quota_sem){.+.+}-{3:3}, at: f2fs_quota_sync+0x59/0x300 [f2fs]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&sbi->quota_sem){.+.+}-{3:3}:
             __lock_acquire+0x648/0x10b0
             lock_acquire+0x128/0x470
             down_read+0x3b/0x2a0
             f2fs_quota_sync+0x59/0x300 [f2fs]
             f2fs_quota_on+0x48/0x100 [f2fs]
             do_quotactl+0x5e3/0xb30
             __x64_sys_quotactl+0x23a/0x4e0
             do_syscall_64+0x3b/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #1 (&sbi->cp_rwsem){++++}-{3:3}:
             __lock_acquire+0x648/0x10b0
             lock_acquire+0x128/0x470
             down_read+0x3b/0x2a0
             f2fs_unlink+0x353/0x670 [f2fs]
             vfs_unlink+0x1c7/0x380
             do_unlinkat+0x413/0x4b0
             __x64_sys_unlinkat+0x50/0xb0
             do_syscall_64+0x3b/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #0 (&sb->s_type->i_mutex_key#18){+.+.}-{3:3}:
             check_prev_add+0xdc/0xb30
             validate_chain+0xa67/0xb20
             __lock_acquire+0x648/0x10b0
             lock_acquire+0x128/0x470
             down_write+0x39/0xc0
             f2fs_quota_sync+0x207/0x300 [f2fs]
             do_quotactl+0xaff/0xb30
             __x64_sys_quotactl+0x23a/0x4e0
             do_syscall_64+0x3b/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        &sb->s_type->i_mutex_key#18 --> &sbi->cp_rwsem --> &sbi->quota_sem
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&sbi->quota_sem);
                                     lock(&sbi->cp_rwsem);
                                     lock(&sbi->quota_sem);
        lock(&sb->s_type->i_mutex_key#18);
      
       *** DEADLOCK ***
      
      3 locks held by repquota/8606:
       #0: ffff88801efac0e0 (&type->s_umount_key#53){++++}-{3:3}, at: user_get_super+0xd9/0x190
       #1: ffff8880084bc380 (&sbi->cp_rwsem){++++}-{3:3}, at: f2fs_quota_sync+0x3e/0x300 [f2fs]
       #2: ffff8880084bcde8 (&sbi->quota_sem){.+.+}-{3:3}, at: f2fs_quota_sync+0x59/0x300 [f2fs]
      
      stack backtrace:
      CPU: 6 PID: 8606 Comm: repquota Not tainted 5.14.0-rc1 #69
      Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      Call Trace:
       dump_stack_lvl+0xce/0x134
       dump_stack+0x17/0x20
       print_circular_bug.isra.0.cold+0x239/0x253
       check_noncircular+0x1be/0x1f0
       check_prev_add+0xdc/0xb30
       validate_chain+0xa67/0xb20
       __lock_acquire+0x648/0x10b0
       lock_acquire+0x128/0x470
       down_write+0x39/0xc0
       f2fs_quota_sync+0x207/0x300 [f2fs]
       do_quotactl+0xaff/0xb30
       __x64_sys_quotactl+0x23a/0x4e0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f883b0b4efe
      
      The root cause is an ABBA deadlock between the inode lock and cp_rwsem.
      Reorder the locks in f2fs_quota_sync() as below to fix this issue:
      - lock inode
      - lock cp_rwsem
      - lock quota_sem
      
      Fixes: db6ec53b ("f2fs: add a rw_sem to cover quota flag changes")
      Signed-off-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
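The lockdep report above boils down to a cycle in the "held lock, then wanted lock" graph. The following is a minimal sketch of that idea, not lockdep itself: `has_cycle` is a hypothetical helper, the lock names are shortened labels, and the edge lists are read off the dependency chain and the fixed ordering described in this commit message.

```python
def has_cycle(edges) -> bool:
    """Detect a cycle in a directed graph of (held_lock, wanted_lock) pairs."""
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, set()).add(wanted)
        graph.setdefault(wanted, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node) -> bool:
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: a lock-order cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Before the fix: f2fs_quota_sync() takes the inode lock while holding
# quota_sem, closing the chain i_mutex -> cp_rwsem -> quota_sem.
before = [
    ("i_mutex", "cp_rwsem"),
    ("cp_rwsem", "quota_sem"),
    ("quota_sem", "i_mutex"),
]

# After the fix: inode lock, then cp_rwsem, then quota_sem.
after = [
    ("i_mutex", "cp_rwsem"),
    ("cp_rwsem", "quota_sem"),
]

print(has_cycle(before))
print(has_cycle(after))
```

With the reordered acquisition, every path takes the locks in the same order, so the graph no longer contains a cycle.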
    • f2fs: let's keep writing IOs on SBI_NEED_FSCK · 1ffc8f5f
      Jaegeuk Kim authored
      SBI_NEED_FSCK indicates that fsck.f2fs needs to be triggered, so it is
      not critical enough to stop all I/O writes. So, let's allow data writes
      instead of reporting EIO forever when SBI_NEED_FSCK is set, but keep
      using OPU (out-of-place update).
      
      Fixes: 95577278 ("f2fs: drop inplace IO if fs status is abnormal")
      Cc: <stable@kernel.org> # v5.13+
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  13. 19 Jul, 2021 1 commit
  14. 13 Jul, 2021 3 commits
  15. 11 Jul, 2021 2 commits