1. 10 Nov, 2020 6 commits
    • tools/bpftool: Add support for in-kernel and named BTF in `btf show` · cecaf4a0
      Andrii Nakryiko authored
      Display vmlinux BTF name and kernel module names when listing available BTFs
      on the system.
      
      In human-readable output mode, module BTFs are reported with "name
      [module-name]", while vmlinux BTF will be reported as "name [vmlinux]".
      Square brackets are added by bpftool and follow kernel convention when
      displaying modules in human-readable text outputs.
      
      [vmuser@archvm bpf]$ sudo ../../../bpf/bpftool/bpftool btf s
      1: name [vmlinux]  size 4082281B
      6: size 2365B  prog_ids 8,6  map_ids 3
      7: name [button]  size 46895B
      8: name [pcspkr]  size 42328B
      9: name [serio_raw]  size 39375B
      10: name [floppy]  size 57185B
      11: name [i2c_core]  size 76186B
      12: name [crc32c_intel]  size 16036B
      13: name [i2c_piix4]  size 50497B
      14: name [irqbypass]  size 14124B
      15: name [kvm]  size 197985B
      16: name [kvm_intel]  size 123564B
      17: name [cryptd]  size 42466B
      18: name [crypto_simd]  size 17187B
      19: name [glue_helper]  size 39205B
      20: name [aesni_intel]  size 41034B
      25: size 36150B
              pids bpftool(2519)
      
      In JSON mode, two fields (boolean "kernel" and string "name") are reported for
      each BTF object. vmlinux BTF is reported with name "vmlinux" (the kernel itself
      returns an empty name for vmlinux BTF).
      
      [vmuser@archvm bpf]$ sudo ../../../bpf/bpftool/bpftool btf s -jp
      [{
              "id": 1,
              "size": 4082281,
              "prog_ids": [],
              "map_ids": [],
              "kernel": true,
              "name": "vmlinux"
          },{
              "id": 6,
              "size": 2365,
              "prog_ids": [8,6
              ],
              "map_ids": [3
              ],
              "kernel": false
          },{
              "id": 7,
              "size": 46895,
              "prog_ids": [],
              "map_ids": [],
              "kernel": true,
              "name": "button"
          },{
      
      ...
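Both output modes carry the same information, so the human-readable line for each object can be rebuilt from its JSON record. The Python sketch below does that for the three records quoted above. `render_plain` is a hypothetical helper, not part of bpftool, and the layout rules it applies (two spaces between fields, empty ID lists omitted) are read off the sample output rather than taken from bpftool's source.

```python
import json

# The three records visible in the truncated `bpftool btf s -jp` output above.
SAMPLE_JSON = """
[
  {"id": 1, "size": 4082281, "prog_ids": [], "map_ids": [],
   "kernel": true, "name": "vmlinux"},
  {"id": 6, "size": 2365, "prog_ids": [8, 6], "map_ids": [3],
   "kernel": false},
  {"id": 7, "size": 46895, "prog_ids": [], "map_ids": [],
   "kernel": true, "name": "button"}
]
"""


def render_plain(obj):
    """Rebuild one human-readable line from a JSON record (hypothetical helper)."""
    fields = []
    # Only named BTFs get a "name [...]" field; the brackets are display-only.
    if obj.get("name"):
        fields.append(f"name [{obj['name']}]")
    fields.append(f"size {obj['size']}B")
    # Empty ID lists do not appear in the sample plain output.
    for key in ("prog_ids", "map_ids"):
        if obj.get(key):
            fields.append(f"{key} " + ",".join(str(i) for i in obj[key]))
    return f"{obj['id']}: " + "  ".join(fields)


btfs = json.loads(SAMPLE_JSON)
# The boolean "kernel" field is what separates in-kernel BTFs from user ones.
kernel_btfs = [b["name"] for b in btfs if b["kernel"]]
plain_lines = [render_plain(b) for b in btfs]

if __name__ == "__main__":
    print("\n".join(plain_lines))
    print("in-kernel BTFs:", kernel_btfs)
```

Scripts that consume bpftool output would normally filter on the "kernel" field in JSON mode rather than parse the bracketed names.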
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Alan Maguire <alan.maguire@oracle.com>
      Link: https://lore.kernel.org/bpf/20201110011932.3201430-6-andrii@kernel.org
    • bpf: Load and verify kernel module BTFs · 36e68442
      Andrii Nakryiko authored
      Add a kernel module listener that loads/validates and unloads module BTF.
      Module BTFs get IDs generated for them, which makes it possible to iterate
      over them with the existing BTF iteration API. They are given their respective
      modules' names, which are reported through the GET_OBJ_INFO API. They are also
      marked as in-kernel BTFs so that tooling can distinguish them from
      user-provided BTFs.
      
      Also, similarly to vmlinux BTF, kernel module BTFs are exposed through
      sysfs as /sys/kernel/btf/<module-name>. This is convenient for user-space
      tools to inspect module BTF contents and dump their types with existing tools:
      
      [vmuser@archvm bpf]$ ls -la /sys/kernel/btf
      total 0
      drwxr-xr-x  2 root root       0 Nov  4 19:46 .
      drwxr-xr-x 13 root root       0 Nov  4 19:46 ..
      
      ...
      
      -r--r--r--  1 root root     888 Nov  4 19:46 irqbypass
      -r--r--r--  1 root root  100225 Nov  4 19:46 kvm
      -r--r--r--  1 root root   35401 Nov  4 19:46 kvm_intel
      -r--r--r--  1 root root     120 Nov  4 19:46 pcspkr
      -r--r--r--  1 root root     399 Nov  4 19:46 serio_raw
      -r--r--r--  1 root root 4094095 Nov  4 19:46 vmlinux
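Because each object under /sys/kernel/btf is a plain read-only file, user space can inspect it without any BPF syscall. As a minimal sketch, the Python below lists the entries of a BTF directory and decodes the fixed 24-byte header at the start of each file. The field layout follows `struct btf_header` from the kernel's UAPI `btf.h`. The function names are made up for this example, and little-endian byte order is assumed.

```python
import os
import struct

BTF_MAGIC = 0xEB9F
# struct btf_header: magic (u16), version (u8), flags (u8), then five u32
# fields: hdr_len, type_off, type_len, str_off, str_len.
# Assumption: the data is little-endian.
BTF_HEADER = struct.Struct("<HBBIIIII")
FIELDS = ("magic", "version", "flags", "hdr_len",
          "type_off", "type_len", "str_off", "str_len")


def read_btf_header(path):
    """Decode the BTF header at the start of one file (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read(BTF_HEADER.size)
    if len(raw) < BTF_HEADER.size:
        raise ValueError(f"{path}: too short to hold a BTF header")
    hdr = dict(zip(FIELDS, BTF_HEADER.unpack(raw)))
    if hdr["magic"] != BTF_MAGIC:
        raise ValueError(f"{path}: bad BTF magic {hdr['magic']:#x}")
    return hdr


def list_btf_objects(btf_dir="/sys/kernel/btf"):
    """Map each file name in btf_dir (vmlinux or a module name) to its header."""
    result = {}
    for name in sorted(os.listdir(btf_dir)):
        path = os.path.join(btf_dir, name)
        if os.path.isfile(path):
            result[name] = read_btf_header(path)
    return result
```

On a kernel built with module BTF support, calling `list_btf_objects()` with no argument should return one entry per file in the listing above. The directory argument exists so the helpers can also be pointed at a copy of those files.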
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/bpf/20201110011932.3201430-5-andrii@kernel.org
    • kbuild: Build kernel module BTFs if BTF is enabled and pahole supports it · 5f9ae91f
      Andrii Nakryiko authored
      Detect whether pahole supports split BTF generation and, if it does, generate
      BTF for each selected kernel module. This is exposed to Makefiles and C code
      as the CONFIG_DEBUG_INFO_BTF_MODULES flag.
      
      Kernel module BTF has to be re-generated if either vmlinux's BTF changes or
      the module's .ko changes. To achieve that, I needed a helper similar to
      if_changed, but one that allows filtering out vmlinux from the list of
      updated dependencies for .ko building. I've put it next to the only place that
      uses and needs it, but it might be a better idea to just add it alongside the
      other if_changed variants in scripts/Kbuild.include.
      
      Each kernel module's BTF deduplication is pretty fast, as it does only
      incremental BTF deduplication on top of the already deduplicated vmlinux BTF.
      To show the added build time, I first ran make on a just-built kernel (to
      establish the baseline) and then forced only BTF re-generation, without
      regenerating .ko files. The build was performed with -j60 parallelization on
      a 56-core machine. The final time also includes bzImage building, so it's not
      pure BTF overhead.
      
      $ time make -j60
      ...
      make -j60  27.65s user 10.96s system 782% cpu 4.933 total
      $ touch ~/linux-build/default/vmlinux && time make -j60
      ...
      make -j60  123.69s user 27.85s system 1566% cpu 9.675 total
      
      So about 4.7 seconds of added real time (9.675s versus 4.933s), with a
      noticeable part spent in compressed vmlinux and bzImage building.
      
      To show size savings, I've built my kernel configuration with about 700 kernel
      modules in two ways: with full BTF for each kernel module (without
      deduplicating against vmlinux) and with split BTF against deduplicated vmlinux
      (the approach in this patch). Below are the top 10 modules with the biggest
      BTF sizes, along with the total size of BTF data across all kernel modules.

      It shows that split BTF "compresses" 115MB down to 5MB total, and the biggest
      kernel modules shrink from 500-570KB down to 200-300KB.
      
      FULL BTF
      ========
      
      $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }'
      115710691
      
      $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10
      ./drivers/gpu/drm/i915/i915.ko 570570
      ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 520240
      ./drivers/gpu/drm/radeon/radeon.ko 503849
      ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 491777
      ./fs/xfs/xfs.ko 411544
      ./drivers/net/ethernet/intel/i40e/i40e.ko 403904
      ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 398754
      ./drivers/infiniband/core/ib_core.ko 397224
      ./fs/cifs/cifs.ko 386249
      ./fs/nfsd/nfsd.ko 379738
      
      SPLIT BTF
      =========
      
      $ for f in $(find . -name '*.ko'); do size -A -d $f | grep BTF | awk '{print $2}'; done | awk '{ s += $1 } END { print s }'
      5194047
      
      $ for f in $(find . -name '*.ko'); do printf "%s %d\n" $f $(size -A -d $f | grep BTF | awk '{print $2}'); done | sort -nr -k2 | head -n10
      ./drivers/gpu/drm/i915/i915.ko 293206
      ./drivers/gpu/drm/radeon/radeon.ko 282103
      ./fs/xfs/xfs.ko 222150
      ./drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko 198503
      ./drivers/infiniband/hw/mlx5/mlx5_ib.ko 198356
      ./drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko 113444
      ./fs/cifs/cifs.ko 109379
      ./arch/x86/kvm/kvm.ko 100225
      ./drivers/gpu/drm/drm.ko 94827
      ./drivers/infiniband/core/ib_core.ko 91188
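The savings quoted above can be recomputed from the two listings. The Python below is only arithmetic on the numbers already shown. The dictionaries cover the eight modules that appear in both top-10 lists, and the names `reduction_factor`, `total_factor` and `per_module` are chosen for this example.

```python
# Total BTF bytes across all modules, copied from the two listings above.
FULL_TOTAL = 115710691
SPLIT_TOTAL = 5194047

# Per-module BTF section sizes in bytes, for modules present in both top-10 lists.
FULL = {
    "i915": 570570, "mlx5_core": 520240, "radeon": 503849, "mlx5_ib": 491777,
    "xfs": 411544, "bnx2x": 398754, "ib_core": 397224, "cifs": 386249,
}
SPLIT = {
    "i915": 293206, "radeon": 282103, "xfs": 222150, "mlx5_core": 198503,
    "mlx5_ib": 198356, "bnx2x": 113444, "cifs": 109379, "ib_core": 91188,
}


def reduction_factor(full_bytes, split_bytes):
    """How many times smaller the split BTF is than the full BTF."""
    return full_bytes / split_bytes


total_factor = reduction_factor(FULL_TOTAL, SPLIT_TOTAL)
per_module = {
    name: reduction_factor(FULL[name], SPLIT[name])
    for name in FULL
    if name in SPLIT
}

if __name__ == "__main__":
    print(f"all modules: {total_factor:.1f}x smaller")
    for name, factor in sorted(per_module.items(), key=lambda kv: kv[1]):
        print(f"{name:10s} {factor:.2f}x smaller")
```

The total shrinks by roughly 22x, while these large modules shrink by roughly 1.8x to 4.4x. One plausible reading is that most of the remaining ~700 modules carried little beyond types already present in vmlinux BTF.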
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201110011932.3201430-4-andrii@kernel.org
    • bpf: Assign ID to vmlinux BTF and return extra info for BTF in GET_OBJ_INFO · 53297220
      Andrii Nakryiko authored
      Allocate ID for vmlinux BTF. This makes it visible when iterating over all BTF
      objects in the system. To allow distinguishing vmlinux BTF (and later kernel
      module BTF) from user-provided BTFs, expose an extra kernel_btf flag, as well
      as the BTF name ("vmlinux" for vmlinux BTF; it will equal the module's name for
      module BTF). We might want to later allow specifying a BTF name for
      user-provided BTFs as well, if that makes sense. But currently this is
      reserved only for in-kernel BTFs.

      Exposing IDs for in-kernel BTFs will allow extending BPF APIs that require an
      in-kernel BTF type with the ability to specify BTF types from kernel modules,
      not just vmlinux BTF. This will be implemented in a follow-up patch set for
      fentry/fexit/fmod_ret/lsm/etc.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20201110011932.3201430-3-andrii@kernel.org
    • bpf: Add in-kernel split BTF support · 951bb646
      Andrii Nakryiko authored
      Adjust the in-kernel BTF implementation to support a split BTF mode of
      operation. Changes mostly mirror the libbpf split BTF changes, with the
      exception of start_id being 0 for the in-kernel implementation due to its
      simpler read-only mode.

      Otherwise, most of the split BTF logic of jumping to the base BTF, where
      necessary, is encapsulated in a few helper functions. Type numbering and
      string offsets in a split BTF logically continue where the base BTF ends, so
      most of the high-level logic is kept without changes.

      Type verification and size resolution only perform the added resolution of
      new split BTF types and rely on the already cached size and type resolution
      results in the base BTF.
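The ID scheme described above can be modelled in a few lines. The Python below is a toy model, not the kernel code: `ToyBTF`, `type_by_id` and the sample type names are invented for illustration. It assumes the base BTF starts at ID 0 and that a split BTF's IDs start where its base ends, and it shows a lookup that walks to the base whenever an ID falls below the split BTF's own range.

```python
class ToyBTF:
    """Toy stand-in for a BTF object holding an ordered list of types."""

    def __init__(self, types, base=None):
        self.base = base
        self.types = list(types)
        # Assumption: a base BTF starts at ID 0; a split BTF continues
        # numbering right after the last type of its base.
        if base is None:
            self.start_id = 0
        else:
            self.start_id = base.start_id + len(base.types)

    def type_by_id(self, type_id):
        """Resolve an ID, jumping to the base BTF when the ID belongs to it."""
        if type_id < 0:
            return None
        btf = self
        while type_id < btf.start_id:
            btf = btf.base
        index = type_id - btf.start_id
        if index >= len(btf.types):
            return None
        return btf.types[index]


# Hypothetical contents: three types in the base, two more in a module.
vmlinux = ToyBTF(["void", "int", "struct task_struct"])
module = ToyBTF(["struct kvm", "struct kvm_vcpu"], base=vmlinux)

if __name__ == "__main__":
    for type_id in range(6):
        print(type_id, module.type_by_id(type_id))
```

Since one lookup covers both ranges, callers do not need to know whether a type came from the base or the split BTF, which matches the remark that most high-level logic stays unchanged.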
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20201110011932.3201430-2-andrii@kernel.org
    • bpf: selftest: Use static globals in tcp_hdr_options and btf_skc_cls_ingress · f52b8fd3
      Martin KaFai Lau authored
      Some globals in the tcp_hdr_options test and btf_skc_cls_ingress test
      are not using static scope.  This patch fixes it.
      
      Targeting bpf-next branch as an improvement since it currently does not
      break the build.
      
      Fixes: ad2f8eb0 ("bpf: selftests: Tcp header options")
      Fixes: 9a856cae ("bpf: selftest: Add test_btf_skc_cls_ingress")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20201106225402.4135741-1-kafai@fb.com
  2. 09 Nov, 2020 2 commits
  3. 06 Nov, 2020 23 commits
  4. 04 Nov, 2020 6 commits
  5. 02 Nov, 2020 1 commit
    • bpf: Fix error path in htab_map_alloc() · 8aaeed81
      Eric Dumazet authored
      syzbot was able to trigger a use-after-free in htab_map_alloc() [1].

      htab_map_alloc() lacks a call to lockdep_unregister_key() in its error path.

      Since lockdep_register_key() and lockdep_unregister_key() cannot fail, it
      seems better to call them right after htab allocation and before htab
      freeing, avoiding more goto/labels in htab_map_alloc().
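The pairing argued for above can be illustrated with a toy model. The Python below is not the kernel code: `ToyHtab`, the step names and the `registered_keys` set (standing in for lockdep's table of registered keys) are all invented for this sketch. It ties registration to allocation and unregistration to freeing, so every error path that frees the object also drops its key.

```python
# Stand-in for lockdep's table of dynamically registered keys.
registered_keys = set()


class ToyHtab:
    """Toy hashtab; in the real structure the lockdep key lives inside it."""

    def __init__(self):
        self.freed = False


def lockdep_register_key(htab):
    registered_keys.add(htab)


def lockdep_unregister_key(htab):
    registered_keys.discard(htab)


def htab_alloc():
    htab = ToyHtab()
    lockdep_register_key(htab)    # right after allocation; cannot fail
    return htab


def htab_free(htab):
    lockdep_unregister_key(htab)  # right before freeing; cannot fail
    htab.freed = True


def htab_map_alloc(fail_step=None):
    """Run hypothetical setup steps; any failure goes through htab_free()."""
    htab = htab_alloc()
    try:
        for step in ("percpu_counters", "buckets", "prealloc"):
            if step == fail_step:
                raise MemoryError(step)
    except MemoryError:
        htab_free(htab)
        return None
    return htab


def stale_keys():
    """Keys still registered for objects that were already freed."""
    return [h for h in registered_keys if h.freed]
```

With this pairing, `htab_map_alloc("buckets")` returns `None` and leaves `stale_keys()` empty. A key that stays registered after its object is freed is the kind of state a later lockdep_register_key() can trip over, as in the report below.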
      
      [1]
      BUG: KASAN: use-after-free in lockdep_register_key+0x356/0x3e0 kernel/locking/lockdep.c:1182
      Read of size 8 at addr ffff88805fa67ad8 by task syz-executor.3/2356
      
      CPU: 1 PID: 2356 Comm: syz-executor.3 Not tainted 5.9.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x107/0x163 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xae/0x4c8 mm/kasan/report.c:385
       __kasan_report mm/kasan/report.c:545 [inline]
       kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
       lockdep_register_key+0x356/0x3e0 kernel/locking/lockdep.c:1182
       htab_init_buckets kernel/bpf/hashtab.c:144 [inline]
       htab_map_alloc+0x6c5/0x14a0 kernel/bpf/hashtab.c:521
       find_and_alloc_map kernel/bpf/syscall.c:122 [inline]
       map_create kernel/bpf/syscall.c:825 [inline]
       __do_sys_bpf+0xa80/0x5180 kernel/bpf/syscall.c:4381
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45deb9
      Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f0eafee1c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
      RAX: ffffffffffffffda RBX: 0000000000001a00 RCX: 000000000045deb9
      RDX: 0000000000000040 RSI: 0000000020000040 RDI: 405a020000000000
      RBP: 000000000118bf60 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
      R13: 00007ffd3cf9eabf R14: 00007f0eafee29c0 R15: 000000000118bf2c
      
      Allocated by task 2053:
       kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
       kasan_set_track mm/kasan/common.c:56 [inline]
       __kasan_kmalloc.constprop.0+0xc2/0xd0 mm/kasan/common.c:461
       kmalloc include/linux/slab.h:554 [inline]
       kzalloc include/linux/slab.h:666 [inline]
       htab_map_alloc+0xdf/0x14a0 kernel/bpf/hashtab.c:454
       find_and_alloc_map kernel/bpf/syscall.c:122 [inline]
       map_create kernel/bpf/syscall.c:825 [inline]
       __do_sys_bpf+0xa80/0x5180 kernel/bpf/syscall.c:4381
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Freed by task 2053:
       kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
       kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
       kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
       __kasan_slab_free+0x102/0x140 mm/kasan/common.c:422
       slab_free_hook mm/slub.c:1544 [inline]
       slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1577
       slab_free mm/slub.c:3142 [inline]
       kfree+0xdb/0x360 mm/slub.c:4124
       htab_map_alloc+0x3f9/0x14a0 kernel/bpf/hashtab.c:549
       find_and_alloc_map kernel/bpf/syscall.c:122 [inline]
       map_create kernel/bpf/syscall.c:825 [inline]
       __do_sys_bpf+0xa80/0x5180 kernel/bpf/syscall.c:4381
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The buggy address belongs to the object at ffff88805fa67800
       which belongs to the cache kmalloc-1k of size 1024
      The buggy address is located 728 bytes inside of
       1024-byte region [ffff88805fa67800, ffff88805fa67c00)
      The buggy address belongs to the page:
      page:000000003c5582c4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fa60
      head:000000003c5582c4 order:3 compound_mapcount:0 compound_pincount:0
      flags: 0xfff00000010200(slab|head)
      raw: 00fff00000010200 ffffea0000bc1200 0000000200000002 ffff888010041140
      raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88805fa67980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88805fa67a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                          ^
       ffff88805fa67b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88805fa67b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: c50eb518 ("bpf: Use separate lockdep class for each hashtab")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20201102114100.3103180-1-eric.dumazet@gmail.com
  6. 30 Oct, 2020 2 commits
    • Merge branch 'bpf: safeguard hashtab locking in NMI context' · cb5dc5b0
      Alexei Starovoitov authored
      Song Liu says:
      
      ====================
      LOCKDEP NMI warning highlighted potential deadlock of hashtab in NMI
      context:
      
      [   74.828971] ================================
      [   74.828972] WARNING: inconsistent lock state
      [   74.828973] 5.9.0-rc8+ #275 Not tainted
      [   74.828974] --------------------------------
      [   74.828975] inconsistent {INITIAL USE} -> {IN-NMI} usage.
      [   74.828976] taskset/1174 [HC2[2]:SC0[0]:HE0:SE1] takes:
      [...]
      [   74.828999]  Possible unsafe locking scenario:
      [   74.828999]
      [   74.829000]        CPU0
      [   74.829001]        ----
      [   74.829001]   lock(&htab->buckets[i].raw_lock);
      [   74.829003]   <Interrupt>
      [   74.829004]     lock(&htab->buckets[i].raw_lock);
      
      Please refer to patch 1/2 for full trace.
      
      This warning is a false alert, as "INITIAL USE" and "IN-NMI" in the tests
      are from different hashtabs. On the other hand, in theory, it is possible
      to deadlock when a hashtab is accessed from both non-NMI and NMI context.
      Patch 1/2 fixes this false alert by assigning a separate lockdep class to
      each hashtab. Patch 2/2 introduces map_locked counters, which are similar to
      the bpf_prog_active counter, to avoid hashtab deadlock in NMI context.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Avoid hashtab deadlock with map_locked · 20b6cc34
      Song Liu authored
      If a hashtab is accessed in both non-NMI and NMI context, the system may
      deadlock on bucket->lock. Fix this issue with the percpu counter map_locked.
      map_locked rejects concurrent access to the same bucket from the same CPU.
      To reduce memory overhead, map_locked is not added per bucket. Instead,
      8 percpu counters are added to each hashtab. Buckets are assigned to these
      counters based on the lower bits of their hash.
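The counter scheme can be sketched as a toy model. The Python below is not the kernel implementation: `ToyHtab`, its method names and the choice of `-EBUSY` as the rejection value are assumptions made for illustration. Per-CPU state is modelled as one list of 8 counters per CPU, and the bucket spinlock itself is left out.

```python
import errno

MAP_LOCK_COUNT = 8                  # counters per hashtab, per CPU
MAP_LOCK_MASK = MAP_LOCK_COUNT - 1  # selects the lower bits of the hash


class ToyHtab:
    """Toy model of the map_locked counters only; no real locking."""

    def __init__(self, nr_cpus):
        self.map_locked = [[0] * MAP_LOCK_COUNT for _ in range(nr_cpus)]

    def lock_bucket(self, cpu, hash_value):
        slot = hash_value & MAP_LOCK_MASK
        self.map_locked[cpu][slot] += 1
        if self.map_locked[cpu][slot] != 1:
            # This CPU already holds a bucket in this slot (for example an
            # NMI interrupted the holder): back off instead of spinning.
            self.map_locked[cpu][slot] -= 1
            return -errno.EBUSY     # assumed error code for the sketch
        return 0                    # real code would take the bucket lock here

    def unlock_bucket(self, cpu, hash_value):
        self.map_locked[cpu][hash_value & MAP_LOCK_MASK] -= 1


if __name__ == "__main__":
    htab = ToyHtab(nr_cpus=2)
    print(htab.lock_bucket(0, 0x13))  # first access on CPU 0, slot 3
    print(htab.lock_bucket(0, 0x1B))  # nested access on CPU 0, also slot 3
    print(htab.lock_bucket(1, 0x13))  # another CPU uses its own counters
```

With only 8 counters, two different buckets whose hashes share their low three bits land in the same slot. In this model a nested access on the same CPU is rejected even when it targets a different bucket, which is the price of not keeping a counter per bucket.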
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201029071925.3103400-3-songliubraving@fb.com