1. 02 Nov, 2017 11 commits
  2. 31 Oct, 2017 1 commit
  3. 30 Oct, 2017 2 commits
    • Nick Desaulniers's avatar
      arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27 · fd9dde6a
      Nick Desaulniers authored
      Upon upgrading to binutils 2.27, we found that our lz4 and gzip
      compressed kernel images were significantly larger, resulting is 10ms
      boot time regressions.
      
      As noted by Rahul:
      "aarch64 binaries uses RELA relocations, where each relocation entry
      includes an addend value. This is similar to x86_64.  On x86_64, the
      addend values are also stored at the relocation offset for relative
      relocations. This is an optimization: in the case where code does not
      need to be relocated, the loader can simply skip processing relative
      relocations.  In binutils-2.25, both bfd and gold linkers did this for
      x86_64, but only the gold linker did this for aarch64.  The kernel build
      here is using the bfd linker, which stored zeroes at the relocation
      offsets for relative relocations.  Since a set of zeroes compresses
      better than a set of non-zero addend values, this behavior was resulting
      in much better lz4 compression.
      
      The bfd linker in binutils-2.27 is now storing the actual addend values
      at the relocation offsets. The behavior is now consistent with what it
      does for x86_64 and what gold linker does for both architectures.  The
      change happened in this upstream commit:
      https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=1f56df9d0d5ad89806c24e71f296576d82344613
      Since a bunch of zeroes got replaced by non-zero addend values, we see
      the side effect of lz4 compressed image being a bit bigger.
      
      To get the old behavior from the bfd linker, "--no-apply-dynamic-relocs"
      flag can be used:
      $ LDFLAGS="--no-apply-dynamic-relocs" make
      With this flag, the compressed image size is back to what it was with
      binutils-2.25.
      
      If the kernel is using ASLR, there aren't additional runtime costs to
      --no-apply-dynamic-relocs, as the relocations will need to be applied
      again anyway after the kernel is relocated to a random address.
      
      If the kernel is not using ASLR, then presumably the current default
      behavior of the linker is better. Since the static linker performed the
      dynamic relocs, and the kernel is not moved to a different address at
      load time, it can skip applying the relocations all over again."
      
      Some measurements:
      
      $ ld -v
      GNU ld (binutils-2.25-f3d35cf6) 2.25.51.20141117
                          ^
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300652760 Oct 26 11:57 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 16932627 Oct 26 11:57 Image.lz4-dtb
      
      $ ld -v
      GNU ld (binutils-2.27-53dd00a1) 2.27.0.20170315
                          ^
      pre patch:
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 11:43 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 18159474 Oct 26 11:43 Image.lz4-dtb
      
      post patch:
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 12:06 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 16932466 Oct 26 12:06 Image.lz4-dtb
      
      By Siqi's measurement w/ gzip:
      binutils 2.27 with this patch (with --no-apply-dynamic-relocs):
      Image 41535488
      Image.gz 13404067
      
      binutils 2.27 without this patch (without --no-apply-dynamic-relocs):
      Image 41535488
      Image.gz 14125516
      
      Any compression scheme should be able to get better results from the
      longer runs of zeros, not just GZIP and LZ4.
      
      10ms boot time savings isn't anything to get excited about, but users of
      arm64+compression+bfd-2.27 should not have to pay a penalty for no
      runtime improvement.
      Reported-by: default avatarGopinath Elanchezhian <gelanchezhian@google.com>
      Reported-by: default avatarSindhuri Pentyala <spentyala@google.com>
      Reported-by: default avatarWei Wang <wvw@google.com>
      Suggested-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Suggested-by: default avatarRahul Chaudhry <rahulchaudhry@google.com>
      Suggested-by: default avatarSiqi Lin <siqilin@google.com>
      Suggested-by: default avatarStephen Hines <srhines@google.com>
      Signed-off-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      [will: added comment to Makefile]
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      fd9dde6a
    • Catalin Marinas's avatar
      arm64: Implement arch-specific pte_access_permitted() · 6218f96c
      Catalin Marinas authored
      The generic pte_access_permitted() implementation only checks for
      pte_present() (together with the write permission where applicable).
      However, for both kernel ptes and PROT_NONE mappings pte_present() also
      returns true on arm64 even though such mappings are not user accessible.
      Additionally, arm64 now supports execute-only user permission
      (PROT_EXEC) which is implemented by clearing the PTE_USER bit.
      
      With this patch the arm64 implementation of pte_access_permitted()
      checks for the PTE_VALID and PTE_USER bits together with writable access
      if applicable.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      6218f96c
  4. 27 Oct, 2017 4 commits
  5. 25 Oct, 2017 3 commits
  6. 24 Oct, 2017 4 commits
  7. 19 Oct, 2017 8 commits
  8. 18 Oct, 2017 7 commits
    • Will Deacon's avatar
      drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension · d5d9696b
      Will Deacon authored
      The ARMv8.2 architecture introduces the optional Statistical Profiling
      Extension (SPE).
      
      SPE can be used to profile a population of operations in the CPU pipeline
      after instruction decode. These are either architected instructions (i.e.
      a dynamic instruction trace) or CPU-specific uops and the choice is fixed
      statically in the hardware and advertised to userspace via caps/. Sampling
      is controlled using a sampling interval, similar to a regular PMU counter,
      but also with an optional random perturbation to avoid falling into patterns
      where you continuously profile the same instruction in a hot loop.
      
      After each operation is decoded, the interval counter is decremented. When
      it hits zero, an operation is chosen for profiling and tracked within the
      pipeline until it retires. Along the way, information such as TLB lookups,
      cache misses, time spent to issue etc is captured in the form of a sample.
      The sample is then filtered according to certain criteria (e.g. load
      latency) that can be specified in the event config (described under
      format/) and, if the sample satisfies the filter, it is written out to
      memory as a record, otherwise it is discarded. Only one operation can
      be sampled at a time.
      
      The in-memory buffer is linear and virtually addressed, raising an
      interrupt when it fills up. The PMU driver handles these interrupts to
      give the appearance of a ring buffer, as expected by the AUX code.
      
      The in-memory trace-like format is self-describing (though not parseable
      in reverse) and written as a series of records, with each record
      corresponding to a sample and consisting of a sequence of packets. These
      packets are defined by the architecture, although some have CPU-specific
      fields for recording information specific to the microarchitecture.
      
      As a simple example, a record generated for a branch instruction may
      consist of the following packets:
      
        0 (Address) : Virtual PC of the branch instruction
        1 (Type)    : Conditional direct branch
        2 (Counter) : Number of cycles taken from Dispatch to Issue
        3 (Address) : Virtual branch target + condition flags
        4 (Counter) : Number of cycles taken from Dispatch to Complete
        5 (Events)  : Mispredicted as not-taken
        6 (END)     : End of record
      
      It is also possible to toggle properties such as timestamp packets in
      each record.
      
      This patch adds support for SPE in the form of a new perf driver.
      
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reviewed-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      d5d9696b
    • Will Deacon's avatar
      dt-bindings: Document devicetree binding for ARM SPE · 4b8b77a4
      Will Deacon authored
      This patch documents the devicetree binding in use for ARM SPE.
      
      Cc: Rob Herring <robh@kernel.org>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      4b8b77a4
    • Will Deacon's avatar
      arm64: head: Init PMSCR_EL2.{PA,PCT} when entered at EL2 without VHE · b0c57e10
      Will Deacon authored
      When booting at EL2, ensure that we permit the EL1 host to sample
      physical addresses and physical counter values using SPE.
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      b0c57e10
    • Will Deacon's avatar
      arm64: sysreg: Move SPE registers and PSB into common header files · a173c390
      Will Deacon authored
      SPE is part of the v8.2 architecture, so move its system register and
      field definitions into sysreg.h and the new PSB barrier into barrier.h
      
      Finally, move KVM over to using the generic definitions so that it
      doesn't have to open-code its own versions.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      a173c390
    • Will Deacon's avatar
      perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples · 085b3062
      Will Deacon authored
      The ARM SPE architecture permits an implementation to ignore a sample
      if the sample is due to be taken whilst another sample is already being
      produced. In this case, it is desirable to report the collision to
      userspace, as they may want to lower the sample period.
      
      This patch adds a PERF_AUX_FLAG_COLLISION flag, so that such events can
      be relayed to userspace.
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      085b3062
    • Will Deacon's avatar
      perf/core: Export AUX buffer helpers to modules · bc1d2020
      Will Deacon authored
      Perf PMU drivers using AUX buffers cannot be built as modules unless
      the AUX helpers are exported.
      
      This patch exports perf_aux_output_{begin,end,skip} and perf_get_aux to
      modules.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      bc1d2020
    • Will Deacon's avatar
      genirq: export irq_get_percpu_devid_partition to modules · 5ffeb050
      Will Deacon authored
      Any modular driver using cluster-affine PPIs needs to be able to call
      irq_get_percpu_devid_partition so that it can enable the IRQ on the
      correct subset of CPUs.
      
      This patch exports the symbol so that it can be called from within a
      module.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      5ffeb050