1. 28 Aug, 2024 2 commits
    • bpf, arm64: Avoid blindly saving/restoring all callee-saved registers · 5d4fa9ec
      Xu Kuohai authored
      The arm64 jit blindly saves/restores all callee-saved registers, making
      the jited result look more complicated than necessary. For example, for
      an empty prog, the jited result is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     x19, x20, [sp, #-16]!
        1c:   stp     x21, x22, [sp, #-16]!
        20:   stp     x26, x25, [sp, #-16]!
        24:   mov     x26, #0
        28:   stp     x26, x25, [sp, #-16]!
        2c:   mov     x26, sp
        30:   stp     x27, x28, [sp, #-16]!
        34:   mov     x25, sp
        38:   bti j 		// tailcall target
        3c:   sub     sp, sp, #0
        40:   mov     x7, #0
        44:   add     sp, sp, #0
        48:   ldp     x27, x28, [sp], #16
        4c:   ldp     x26, x25, [sp], #16
        50:   ldp     x26, x25, [sp], #16
        54:   ldp     x21, x22, [sp], #16
        58:   ldp     x19, x20, [sp], #16
        5c:   ldp     fp, lr, [sp], #16
        60:   mov     x0, x7
        64:   autiasp
        68:   ret
      
      Clearly, there is no need to save/restore unused callee-saved registers.
      This patch makes the jited image save/restore only the callee-saved
      registers it actually uses.
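
The idea of saving only the registers a prog uses can be sketched as a scan over the instruction stream. This is a minimal illustration, not the patch's actual code: `struct insn_regs`, `is_callee_saved()` and `used_callee_saved_mask()` are hypothetical names, and treating R10 (the frame pointer) as callee-saved is an assumption about how the JIT maps it to a native register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a BPF instruction: registers only. */
struct insn_regs {
	uint8_t dst_reg;
	uint8_t src_reg;
};

/*
 * R6-R9 are callee-saved in the BPF calling convention. R10 (frame
 * pointer) is included on the assumption that the JIT keeps it in a
 * callee-saved native register.
 */
int is_callee_saved(uint8_t reg)
{
	return (reg >= 6 && reg <= 9) || reg == 10;
}

/* Return a bitmask with bit N set if callee-saved BPF register N is used. */
uint32_t used_callee_saved_mask(const struct insn_regs *insns, size_t cnt)
{
	uint32_t mask = 0;

	assert(insns != NULL || cnt == 0);

	for (size_t i = 0; i < cnt; i++) {
		if (is_callee_saved(insns[i].dst_reg))
			mask |= 1u << insns[i].dst_reg;
		if (is_callee_saved(insns[i].src_reg))
			mask |= 1u << insns[i].src_reg;
	}
	return mask;
}
```

A prologue generator could then emit one store pair for each two bits set in the mask, and the epilogue could restore them in reverse order.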
      
      Now the jited result of empty prog is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     xzr, x26, [sp, #-16]!
        1c:   mov     x26, sp
        20:   bti j		// tailcall target
        24:   mov     x7, #0
        28:   ldp     xzr, x26, [sp], #16
        2c:   ldp     fp, lr, [sp], #16
        30:   mov     x0, x7
        34:   autiasp
        38:   ret
      
      Since a bpf prog saves/restores its own callee-saved registers as needed,
      to make tailcall work correctly, the caller needs to restore its saved
      registers before tailcall, and the callee needs to save its callee-saved
      registers after tailcall. These extra restoring/saving instructions
      increase performance overhead.
      
      [1] provides two benchmarks for tailcall scenarios. Below are the perf
      numbers measured in an arm64 KVM guest. The results indicate that the
      performance difference before and after the patch in typical tailcall
      scenarios is negligible.
      
      - Before:
      
       Performance counter stats for './test_progs -t tailcalls' (5 runs):
      
                 4313.43 msec task-clock                       #    0.874 CPUs utilized               ( +-  0.16% )
                     574      context-switches                 #  133.073 /sec                        ( +-  1.14% )
                       0      cpu-migrations                   #    0.000 /sec
                     538      page-faults                      #  124.727 /sec                        ( +-  0.57% )
             10697772784      cycles                           #    2.480 GHz                         ( +-  0.22% )  (61.19%)
             25511241955      instructions                     #    2.38  insn per cycle              ( +-  0.08% )  (66.70%)
              5108910557      branches                         #    1.184 G/sec                       ( +-  0.08% )  (72.38%)
                 2800459      branch-misses                    #    0.05% of all branches             ( +-  0.51% )  (72.36%)
                              TopDownL1                 #     0.60 retiring                    ( +-  0.09% )  (66.84%)
                                                        #     0.21 frontend_bound              ( +-  0.15% )  (61.31%)
                                                        #     0.12 bad_speculation             ( +-  0.08% )  (50.11%)
                                                        #     0.07 backend_bound               ( +-  0.16% )  (33.30%)
              8274201819      L1-dcache-loads                  #    1.918 G/sec                       ( +-  0.18% )  (33.15%)
                  468268      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +-  4.69% )  (33.16%)
                  385383      LLC-loads                        #   89.345 K/sec                       ( +-  5.22% )  (33.16%)
                   38296      LLC-load-misses                  #    9.94% of all LL-cache accesses    ( +- 42.52% )  (38.69%)
              6886576501      L1-icache-loads                  #    1.597 G/sec                       ( +-  0.35% )  (38.69%)
                 1848585      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  4.52% )  (44.23%)
              9043645883      dTLB-loads                       #    2.097 G/sec                       ( +-  0.10% )  (44.33%)
                  416672      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  5.15% )  (49.89%)
              6925626111      iTLB-loads                       #    1.606 G/sec                       ( +-  0.35% )  (55.46%)
                   66220      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.88% )  (55.50%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                  4.9372 +- 0.0526 seconds time elapsed  ( +-  1.07% )
      
       Performance counter stats for './test_progs -t flow_dissector' (5 runs):
      
                10924.50 msec task-clock                       #    0.945 CPUs utilized               ( +-  0.08% )
                     603      context-switches                 #   55.197 /sec                        ( +-  1.13% )
                       0      cpu-migrations                   #    0.000 /sec
                     566      page-faults                      #   51.810 /sec                        ( +-  0.42% )
             27381270695      cycles                           #    2.506 GHz                         ( +-  0.18% )  (60.46%)
             56996583922      instructions                     #    2.08  insn per cycle              ( +-  0.21% )  (66.11%)
             10321647567      branches                         #  944.816 M/sec                       ( +-  0.17% )  (71.79%)
                 3347735      branch-misses                    #    0.03% of all branches             ( +-  3.72% )  (72.15%)
                              TopDownL1                 #     0.52 retiring                    ( +-  0.13% )  (66.74%)
                                                        #     0.27 frontend_bound              ( +-  0.14% )  (61.27%)
                                                        #     0.14 bad_speculation             ( +-  0.19% )  (50.36%)
                                                        #     0.07 backend_bound               ( +-  0.42% )  (33.89%)
             18740797617      L1-dcache-loads                  #    1.715 G/sec                       ( +-  0.43% )  (33.71%)
                13715669      L1-dcache-load-misses            #    0.07% of all L1-dcache accesses   ( +- 32.85% )  (33.34%)
                 4087551      LLC-loads                        #  374.164 K/sec                       ( +- 29.53% )  (33.26%)
                  267906      LLC-load-misses                  #    6.55% of all LL-cache accesses    ( +- 23.90% )  (38.76%)
             15811864229      L1-icache-loads                  #    1.447 G/sec                       ( +-  0.12% )  (38.73%)
                 2976833      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  9.73% )  (44.22%)
             20138907471      dTLB-loads                       #    1.843 G/sec                       ( +-  0.18% )  (44.15%)
                  732850      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +- 11.18% )  (49.64%)
             15895726702      iTLB-loads                       #    1.455 G/sec                       ( +-  0.15% )  (55.13%)
                  152075      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  4.71% )  (54.98%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                 11.5613 +- 0.0317 seconds time elapsed  ( +-  0.27% )
      
      - After:
      
       Performance counter stats for './test_progs -t tailcalls' (5 runs):
      
                 4278.78 msec task-clock                       #    0.871 CPUs utilized               ( +-  0.15% )
                     569      context-switches                 #  132.982 /sec                        ( +-  0.58% )
                       0      cpu-migrations                   #    0.000 /sec
                     539      page-faults                      #  125.970 /sec                        ( +-  0.43% )
             10588986432      cycles                           #    2.475 GHz                         ( +-  0.20% )  (60.91%)
             25303825043      instructions                     #    2.39  insn per cycle              ( +-  0.08% )  (66.48%)
              5110756256      branches                         #    1.194 G/sec                       ( +-  0.07% )  (72.03%)
                 2719569      branch-misses                    #    0.05% of all branches             ( +-  2.42% )  (72.03%)
                              TopDownL1                 #     0.60 retiring                    ( +-  0.22% )  (66.31%)
                                                        #     0.22 frontend_bound              ( +-  0.21% )  (60.83%)
                                                        #     0.12 bad_speculation             ( +-  0.26% )  (50.25%)
                                                        #     0.06 backend_bound               ( +-  0.17% )  (33.52%)
              8163648527      L1-dcache-loads                  #    1.908 G/sec                       ( +-  0.33% )  (33.52%)
                  694979      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +- 30.53% )  (33.52%)
                 1902347      LLC-loads                        #  444.600 K/sec                       ( +- 48.84% )  (33.69%)
                   96677      LLC-load-misses                  #    5.08% of all LL-cache accesses    ( +- 43.48% )  (39.30%)
              6863517589      L1-icache-loads                  #    1.604 G/sec                       ( +-  0.37% )  (39.17%)
                 1871519      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  6.78% )  (44.56%)
              8927782813      dTLB-loads                       #    2.087 G/sec                       ( +-  0.14% )  (44.37%)
                  438237      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  6.00% )  (49.75%)
              6886906831      iTLB-loads                       #    1.610 G/sec                       ( +-  0.36% )  (55.08%)
                   67568      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  3.27% )  (54.86%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                  4.9114 +- 0.0309 seconds time elapsed  ( +-  0.63% )
      
       Performance counter stats for './test_progs -t flow_dissector' (5 runs):
      
                10948.40 msec task-clock                       #    0.942 CPUs utilized               ( +-  0.05% )
                     615      context-switches                 #   56.173 /sec                        ( +-  1.65% )
                       1      cpu-migrations                   #    0.091 /sec                        ( +- 31.62% )
                     567      page-faults                      #   51.788 /sec                        ( +-  0.44% )
             27334194328      cycles                           #    2.497 GHz                         ( +-  0.08% )  (61.05%)
             56656528828      instructions                     #    2.07  insn per cycle              ( +-  0.08% )  (66.67%)
             10270389422      branches                         #  938.072 M/sec                       ( +-  0.10% )  (72.21%)
                 3453837      branch-misses                    #    0.03% of all branches             ( +-  3.75% )  (72.27%)
                              TopDownL1                 #     0.52 retiring                    ( +-  0.16% )  (66.55%)
                                                        #     0.27 frontend_bound              ( +-  0.09% )  (60.91%)
                                                        #     0.14 bad_speculation             ( +-  0.08% )  (49.85%)
                                                        #     0.07 backend_bound               ( +-  0.16% )  (33.33%)
             18982866028      L1-dcache-loads                  #    1.734 G/sec                       ( +-  0.24% )  (33.34%)
                 8802454      L1-dcache-load-misses            #    0.05% of all L1-dcache accesses   ( +- 52.30% )  (33.31%)
                 2612962      LLC-loads                        #  238.661 K/sec                       ( +- 29.78% )  (33.45%)
                  264107      LLC-load-misses                  #   10.11% of all LL-cache accesses    ( +- 18.34% )  (39.07%)
             15793205997      L1-icache-loads                  #    1.443 G/sec                       ( +-  0.15% )  (39.09%)
                 3930802      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  3.72% )  (44.66%)
             20097828496      dTLB-loads                       #    1.836 G/sec                       ( +-  0.09% )  (44.68%)
                  961757      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  3.32% )  (50.15%)
             15838728506      iTLB-loads                       #    1.447 G/sec                       ( +-  0.09% )  (55.62%)
                  167652      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.28% )  (55.52%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                 11.6173 +- 0.0268 seconds time elapsed  ( +-  0.23% )
      
      [1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/

      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm64: Get rid of fpb · bd737fcb
      Xu Kuohai authored
      A bpf prog accesses the stack using BPF_FP as the base address and a
      negative immediate as the offset. But arm64 ldr/str instructions only
      support a non-negative immediate as the offset. To simplify the jited
      result, commit 5b3d19b9 ("bpf, arm64: Adjust the offset of
      str/ldr(immediate) to positive number") introduced FPB to represent the
      lowest stack address that the bpf prog being jited may access, and with
      this address as the baseline, it converts BPF_FP plus a negative
      immediate offset to FPB plus a non-negative immediate offset.

      For a given bpf prog, the jited stack space is fixed, with A64_SP as
      the lowest address and BPF_FP as the highest address. Thus we can get
      rid of FPB and convert BPF_FP plus a negative immediate offset to
      A64_SP plus a non-negative immediate offset.
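
The conversion described above is simple arithmetic. The sketch below assumes that A64_SP sits exactly `stack_size` bytes below BPF_FP while the prog body runs; `fp_off_to_sp_off()` is a hypothetical helper name, not a function from the patch.

```c
#include <assert.h>

/*
 * Convert a BPF_FP-relative offset (negative) into an SP-relative
 * offset (non-negative), assuming SP == BPF_FP - stack_size.
 */
int fp_off_to_sp_off(int stack_size, int fp_off)
{
	assert(stack_size >= 0);
	assert(fp_off < 0 && -fp_off <= stack_size);

	/* BPF_FP + fp_off == (SP + stack_size) + fp_off */
	return stack_size + fp_off;
}
```

For example, with a 64-byte stack, an access at BPF_FP - 16 becomes an access at SP + 48, which fits the non-negative immediate form of ldr/str.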
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20240826071624.350108-2-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 27 Aug, 2024 1 commit
  3. 23 Aug, 2024 15 commits
  4. 22 Aug, 2024 14 commits
  5. 21 Aug, 2024 8 commits
    • Merge tag 'platform-drivers-x86-v6.11-4' of... · 872cf28b
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
      
      Pull x86 platform driver fixes from Ilpo Järvinen:
      
       - ISST: Fix an error-handling corner case
      
       - platform/surface: aggregator: Minor corner case fix and new HW
         support
      
      * tag 'platform-drivers-x86-v6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
        platform/x86: ISST: Fix return value on last invalid resource
        platform/surface: aggregator: Fix warning when controller is destroyed in probe
        platform/surface: aggregator_registry: Add support for Surface Laptop 6
        platform/surface: aggregator_registry: Add fan and thermal sensor support for Surface Laptop 5
        platform/surface: aggregator_registry: Add support for Surface Laptop Studio 2
        platform/surface: aggregator_registry: Add support for Surface Laptop Go 3
        platform/surface: aggregator_registry: Add Support for Surface Pro 10
        platform/x86: asus-wmi: Add quirk for ROG Ally X
    • Merge tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 5c6154ff
      Linus Torvalds authored
      Pull erofs fixes from Gao Xiang:
       "As I mentioned in the merge window pull request, there is a regression
        which could cause system hang due to page migration. The corresponding
        fix landed upstream through MM tree last week (commit 2e6506e1:
        "mm/migrate: fix deadlock in migrate_pages_batch() on large folios"),
        therefore large folios can be safely allowed for compressed inodes and
        stress tests have been running on my fleet for over 20 days without
        any regression. Users have explicitly requested this for months, so
        let's allow large folios for EROFS full cases now for wider testing.
      
        Additionally, there is a fix which addresses invalid memory accesses
        on a failure path triggered by fault injection and two minor cleanups
        to simplify the codebase.
      
        Summary:
      
         - Allow large folios on compressed inodes
      
         - Fix invalid memory accesses if z_erofs_gbuf_growsize() partially
           fails
      
         - Two minor cleanups"
      
      * tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails
        erofs: allow large folios for compressed files
        erofs: get rid of check_layout_compatibility()
        erofs: simplify readdir operation
    • Merge branch '__jited-test-tag-to-check-disassembly-after-jit' · 1a437d35
      Alexei Starovoitov authored
      Eduard Zingerman says:
      
      ====================
      __jited test tag to check disassembly after jit
      
      Some of the logic in the BPF jits might be non-trivial.
      It might be useful to allow testing this logic by comparing
      generated native code with expected code template.
      This patch set adds a macro __jited() that could be used for
      test_loader based tests in a following manner:
      
          SEC("tp")
          __arch_x86_64
          __jited("   endbr64")
          __jited("   nopl    (%rax,%rax)")
          __jited("   xorq    %rax, %rax")
          ...
          __naked void some_test(void) { ... }
      
      Also add a test for jit code generated for tail calls handling to
      demonstrate the feature.
      
      The feature uses LLVM libraries to do the disassembly.
      At selftests compilation time the Makefile detects if these libraries are
      available. When libraries are not available tests using __jited()
      are skipped.
      Current CI environment does not include llvm development libraries,
      but changes to add these are trivial.
      
      This was previously discussed here:
      https://lore.kernel.org/bpf/20240718205158.3651529-1-yonghong.song@linux.dev/
      
      Patch-set includes a few auxiliary steps:
      - patches #2 and #3 fix a few bugs in test_loader behaviour;
      - patch #4 replaces __regex macro with ability to specify regular
        expressions in __msg and __xlated using "{{" "}}" escapes;
      - patch #8 updates __xlated to match disassembly lines consecutively,
        the same way as __jited does.
      
      Changes v2->v3:
      - changed macro name from __jit_x86 to __jited with __arch_* to
        specify disassembly arch (Yonghong);
      - __jited matches disassembly lines consecutively, with "..."
        allowing some number of lines to be skipped (Andrii);
      - __xlated matches disassembly lines consecutively, same as __jited;
      - "{{...}}" regex brackets instead of __regex macro;
      - bug fixes for old commits.
      
      Changes v1->v2:
      - stylistic changes suggested by Yonghong;
      - fix for -Wformat-truncation related warning when compiled with
        llvm15 (Yonghong).
      
      v1: https://lore.kernel.org/bpf/20240809010518.1137758-1-eddyz87@gmail.com/
      v2: https://lore.kernel.org/bpf/20240815205449.242556-1-eddyz87@gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240820102357.3372779-1-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: validate __xlated same way as __jited · a038eacd
      Eduard Zingerman authored
      Both __xlated and __jited work with disassembly.
      It is logical to have both work in a similar manner.
      
      This commit updates __xlated macro handling in test_loader.c by making
      it expect matches on sequential lines, same way as __jited operates.
      For example:
      
          __xlated("1: *(u64 *)(r10 -16) = r1")      ;; matched on line N
          __xlated("3: r0 = &(void __percpu *)(r0)") ;; matched on line N+1
      
      Also:
      
          __xlated("1: *(u64 *)(r10 -16) = r1")      ;; matched on line N
          __xlated("...")                            ;; not matched
          __xlated("3: r0 = &(void __percpu *)(r0)") ;; matched on any
                                                     ;; line >= N
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240820102357.3372779-10-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: validate jit behaviour for tail calls · e5bdd6a8
      Eduard Zingerman authored
      The test uses a program that calls a sub-program which does a tail call.
      The idea is to verify instructions generated by jit for tail calls:
      - in program and sub-program prologues;
      - for subprogram call instruction;
      - for tail call itself.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240820102357.3372779-9-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: __jited test tag to check disassembly after jit · 7d743e4c
      Eduard Zingerman authored
      Allow verifying jit behaviour by writing tests as below:
      
          SEC("tp")
          __arch_x86_64
          __jited("   endbr64")
          __jited("   nopl    (%rax,%rax)")
          __jited("   xorq    %rax, %rax")
          ...
          __naked void some_test(void)
          {
              asm volatile (... ::: __clobber_all);
          }
      
      Allow regular expressions in patterns, same way as in __msg.
      By default assume that each __jited pattern has to be matched on the
      next consecutive line of the disassembly, e.g.:
      
          __jited("   endbr64")             # matched on line N
          __jited("   nopl    (%rax,%rax)") # matched on line N+1
      
      If a match occurs on the wrong line, an error is reported.
      To override this behaviour use __jited("..."), e.g.:
      
          __jited("   endbr64")             # matched on line N
          __jited("...")                    # not matched
          __jited("   nopl    (%rax,%rax)") # matched on any line >= N
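
The matching rule above can be sketched in a few lines of C. This is a simplified illustration, not the test_loader.c implementation: it uses plain substring matching via `strstr()` instead of regular expressions, it assumes the first pattern may match on any line, and `match_lines()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if all patterns match the lines in order, 0 otherwise.
 * A pattern must match the line right after the previous match,
 * unless it is the first pattern or is preceded by "...".
 */
int match_lines(const char *const *lines, size_t nlines,
		const char *const *pats, size_t npats)
{
	size_t next = 0;	/* first line not yet consumed */
	int may_skip = 1;	/* first pattern may match on any line */

	assert(lines != NULL && pats != NULL);

	for (size_t p = 0; p < npats; p++) {
		size_t i = next;

		if (strcmp(pats[p], "...") == 0) {
			may_skip = 1;
			continue;
		}
		if (may_skip) {
			/* Scan forward for the first matching line. */
			while (i < nlines && !strstr(lines[i], pats[p]))
				i++;
		}
		if (i >= nlines || !strstr(lines[i], pats[p]))
			return 0;
		next = i + 1;
		may_skip = 0;
	}
	return 1;
}
```

With the lines "endbr64", "nopl (%rax,%rax)", "xorq %rax, %rax", the patterns {"endbr64", "xorq"} fail because "xorq" is not on the line right after "endbr64", while {"endbr64", "...", "xorq"} succeed.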
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240820102357.3372779-7-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: utility function to get program disassembly after jit · b991fc52
      Eduard Zingerman authored
      This commit adds a utility function to get disassembled text for jited
      representation of a BPF program designated by file descriptor.
      Function prototype looks as follows:
      
          int get_jited_program_text(int fd, char *text, size_t text_sz)
      
      Where 'fd' is a file descriptor for the program, 'text' and 'text_sz'
      refer to a destination buffer for disassembled text.
      Output format looks as follows:
      
          18:	77 06                               	ja	L0
          1a:	50                                  	pushq	%rax
          1b:	48 89 e0                            	movq	%rsp, %rax
          1e:	eb 01                               	jmp	L1
          20:	50                                  L0:	pushq	%rax
          21:	50                                  L1:	pushq	%rax
           ^  ^^^^^^^^                             ^  ^^^^^^^^^^^^^^^^^^
           |  binary insn                          |  textual insn
           |  representation                       |  representation
           |                                       |
          instruction offset              inferred local label name
      
      The code and makefile changes are inspired by jit_disasm.c from bpftool.
      Use llvm libraries to disassemble BPF program instead of libbfd to avoid
      issues with disassembly output stability pointed out in [1].
      
      Selftests makefile uses Makefile.feature to detect if LLVM libraries
      are available. If that is not the case selftests build proceeds but
      the function returns -EOPNOTSUPP at runtime.
      
      [1] commit eb9d1acf ("bpftool: Add LLVM as default library for disassembling JIT-ed programs")
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240820102357.3372779-6-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: replace __regex macro with "{{...}}" patterns · f8d16175
      Eduard Zingerman authored
      Upcoming changes require a notation to specify regular expression
      matches for regular verifier log messages, disassembly of BPF
      instructions, disassembly of jited instructions.
      
      Neither basic nor extended POSIX regular expressions w/o additional
      escaping are good for this role because of wide use of special
      characters in disassembly, for example:
      
          movq -0x10(%rbp), %rax  ;; () are special characters
          cmpq $0x21, %rax        ;; $ is a special character
      
          *(u64 *)(r10 -16) = r1  ;; * and () are special characters
      
      This commit borrows syntax from LLVM's FileCheck utility.
      It replaces __regex macro with ability to embed regular expressions
      in __msg patterns using "{{" "}}" pairs for escaping.
      Syntax for __msg patterns:
      
          pattern := (<verbatim text> | regex)*
          regex := "{{" <posix extended regular expression> "}}"
      
      For example, pattern "foo{{[0-9]+}}" matches strings like
      "foo0", "foo007", etc.
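
One way to implement this syntax is to translate the pattern into a single POSIX extended regular expression: verbatim text is backslash-escaped, while text between "{{" and "}}" is copied as-is. The sketch below takes that approach under stated assumptions; `pattern_to_ere()` and `pattern_matches()` are hypothetical names, the 256-byte buffer is arbitrary, and test_loader.c may well be structured differently.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Translate a "{{...}}" pattern into a POSIX ERE. Returns 0 on success,
 * -1 if the output buffer is too small or a "{{" is left unclosed.
 * Limitation of this sketch: a regex that itself contains "}}" is not
 * handled.
 */
int pattern_to_ere(const char *pat, char *out, size_t out_sz)
{
	size_t n = 0;
	int in_regex = 0;

	assert(pat != NULL && out != NULL);

	while (*pat) {
		if (!in_regex && strncmp(pat, "{{", 2) == 0) {
			in_regex = 1;
			pat += 2;
			continue;
		}
		if (in_regex && strncmp(pat, "}}", 2) == 0) {
			in_regex = 0;
			pat += 2;
			continue;
		}
		/* Room for an optional backslash, the character, and NUL. */
		if (n + 3 > out_sz)
			return -1;
		/* Escape ERE special characters in verbatim text only. */
		if (!in_regex && strchr(".[\\()*+?{|^$", *pat))
			out[n++] = '\\';
		out[n++] = *pat++;
	}
	if (in_regex || n >= out_sz)
		return -1;
	out[n] = '\0';
	return 0;
}

/* Return 1 if the pattern matches anywhere in text, 0 otherwise. */
int pattern_matches(const char *pat, const char *text)
{
	char ere[256];
	regex_t re;
	int rc;

	if (pattern_to_ere(pat, ere, sizeof(ere)) != 0)
		return 0;
	if (regcomp(&re, ere, REG_EXTENDED | REG_NOSUB) != 0)
		return 0;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With this translation, a disassembly line such as "cmpq $0x21, %rax" can be written verbatim in a pattern, and only the parts inside "{{" "}}" are interpreted as a regular expression.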
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240820102357.3372779-5-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>