1. 30 Oct, 2017 2 commits
    • Nick Desaulniers's avatar
      arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27 · fd9dde6a
      Nick Desaulniers authored
      Upon upgrading to binutils 2.27, we found that our lz4 and gzip
      compressed kernel images were significantly larger, resulting is 10ms
      boot time regressions.
      
      As noted by Rahul:
      "aarch64 binaries uses RELA relocations, where each relocation entry
      includes an addend value. This is similar to x86_64.  On x86_64, the
      addend values are also stored at the relocation offset for relative
      relocations. This is an optimization: in the case where code does not
      need to be relocated, the loader can simply skip processing relative
      relocations.  In binutils-2.25, both bfd and gold linkers did this for
      x86_64, but only the gold linker did this for aarch64.  The kernel build
      here is using the bfd linker, which stored zeroes at the relocation
      offsets for relative relocations.  Since a set of zeroes compresses
      better than a set of non-zero addend values, this behavior was resulting
      in much better lz4 compression.
      
      The bfd linker in binutils-2.27 is now storing the actual addend values
      at the relocation offsets. The behavior is now consistent with what it
      does for x86_64 and what gold linker does for both architectures.  The
      change happened in this upstream commit:
      https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=1f56df9d0d5ad89806c24e71f296576d82344613
      Since a bunch of zeroes got replaced by non-zero addend values, we see
      the side effect of lz4 compressed image being a bit bigger.
      
      To get the old behavior from the bfd linker, "--no-apply-dynamic-relocs"
      flag can be used:
      $ LDFLAGS="--no-apply-dynamic-relocs" make
      With this flag, the compressed image size is back to what it was with
      binutils-2.25.
      
      If the kernel is using ASLR, there aren't additional runtime costs to
      --no-apply-dynamic-relocs, as the relocations will need to be applied
      again anyway after the kernel is relocated to a random address.
      
      If the kernel is not using ASLR, then presumably the current default
      behavior of the linker is better. Since the static linker performed the
      dynamic relocs, and the kernel is not moved to a different address at
      load time, it can skip applying the relocations all over again."
      
      Some measurements:
      
      $ ld -v
      GNU ld (binutils-2.25-f3d35cf6) 2.25.51.20141117
                          ^
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300652760 Oct 26 11:57 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 16932627 Oct 26 11:57 Image.lz4-dtb
      
      $ ld -v
      GNU ld (binutils-2.27-53dd00a1) 2.27.0.20170315
                          ^
      pre patch:
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 11:43 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 18159474 Oct 26 11:43 Image.lz4-dtb
      
      post patch:
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 12:06 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 16932466 Oct 26 12:06 Image.lz4-dtb
      
      By Siqi's measurement w/ gzip:
      binutils 2.27 with this patch (with --no-apply-dynamic-relocs):
      Image 41535488
      Image.gz 13404067
      
      binutils 2.27 without this patch (without --no-apply-dynamic-relocs):
      Image 41535488
      Image.gz 14125516
      
      Any compression scheme should be able to get better results from the
      longer runs of zeros, not just GZIP and LZ4.
      
      10ms boot time savings isn't anything to get excited about, but users of
      arm64+compression+bfd-2.27 should not have to pay a penalty for no
      runtime improvement.
      Reported-by: default avatarGopinath Elanchezhian <gelanchezhian@google.com>
      Reported-by: default avatarSindhuri Pentyala <spentyala@google.com>
      Reported-by: default avatarWei Wang <wvw@google.com>
      Suggested-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Suggested-by: default avatarRahul Chaudhry <rahulchaudhry@google.com>
      Suggested-by: default avatarSiqi Lin <siqilin@google.com>
      Suggested-by: default avatarStephen Hines <srhines@google.com>
      Signed-off-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      [will: added comment to Makefile]
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      fd9dde6a
    • Catalin Marinas's avatar
      arm64: Implement arch-specific pte_access_permitted() · 6218f96c
      Catalin Marinas authored
      The generic pte_access_permitted() implementation only checks for
      pte_present() (together with the write permission where applicable).
      However, for both kernel ptes and PROT_NONE mappings pte_present() also
      returns true on arm64 even though such mappings are not user accessible.
      Additionally, arm64 now supports execute-only user permission
      (PROT_EXEC) which is implemented by clearing the PTE_USER bit.
      
      With this patch the arm64 implementation of pte_access_permitted()
      checks for the PTE_VALID and PTE_USER bits together with writable access
      if applicable.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      6218f96c
  2. 27 Oct, 2017 4 commits
  3. 25 Oct, 2017 3 commits
  4. 24 Oct, 2017 4 commits
  5. 19 Oct, 2017 8 commits
  6. 18 Oct, 2017 7 commits
    • Will Deacon's avatar
      drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension · d5d9696b
      Will Deacon authored
      The ARMv8.2 architecture introduces the optional Statistical Profiling
      Extension (SPE).
      
      SPE can be used to profile a population of operations in the CPU pipeline
      after instruction decode. These are either architected instructions (i.e.
      a dynamic instruction trace) or CPU-specific uops and the choice is fixed
      statically in the hardware and advertised to userspace via caps/. Sampling
      is controlled using a sampling interval, similar to a regular PMU counter,
      but also with an optional random perturbation to avoid falling into patterns
      where you continuously profile the same instruction in a hot loop.
      
      After each operation is decoded, the interval counter is decremented. When
      it hits zero, an operation is chosen for profiling and tracked within the
      pipeline until it retires. Along the way, information such as TLB lookups,
      cache misses, time spent to issue etc is captured in the form of a sample.
      The sample is then filtered according to certain criteria (e.g. load
      latency) that can be specified in the event config (described under
      format/) and, if the sample satisfies the filter, it is written out to
      memory as a record, otherwise it is discarded. Only one operation can
      be sampled at a time.
      
      The in-memory buffer is linear and virtually addressed, raising an
      interrupt when it fills up. The PMU driver handles these interrupts to
      give the appearance of a ring buffer, as expected by the AUX code.
      
      The in-memory trace-like format is self-describing (though not parseable
      in reverse) and written as a series of records, with each record
      corresponding to a sample and consisting of a sequence of packets. These
      packets are defined by the architecture, although some have CPU-specific
      fields for recording information specific to the microarchitecture.
      
      As a simple example, a record generated for a branch instruction may
      consist of the following packets:
      
        0 (Address) : Virtual PC of the branch instruction
        1 (Type)    : Conditional direct branch
        2 (Counter) : Number of cycles taken from Dispatch to Issue
        3 (Address) : Virtual branch target + condition flags
        4 (Counter) : Number of cycles taken from Dispatch to Complete
        5 (Events)  : Mispredicted as not-taken
        6 (END)     : End of record
      
      It is also possible to toggle properties such as timestamp packets in
      each record.
      
      This patch adds support for SPE in the form of a new perf driver.
      
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reviewed-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      d5d9696b
    • Will Deacon's avatar
      dt-bindings: Document devicetree binding for ARM SPE · 4b8b77a4
      Will Deacon authored
      This patch documents the devicetree binding in use for ARM SPE.
      
      Cc: Rob Herring <robh@kernel.org>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      4b8b77a4
    • Will Deacon's avatar
      arm64: head: Init PMSCR_EL2.{PA,PCT} when entered at EL2 without VHE · b0c57e10
      Will Deacon authored
      When booting at EL2, ensure that we permit the EL1 host to sample
      physical addresses and physical counter values using SPE.
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      b0c57e10
    • Will Deacon's avatar
      arm64: sysreg: Move SPE registers and PSB into common header files · a173c390
      Will Deacon authored
      SPE is part of the v8.2 architecture, so move its system register and
      field definitions into sysreg.h and the new PSB barrier into barrier.h
      
      Finally, move KVM over to using the generic definitions so that it
      doesn't have to open-code its own versions.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      a173c390
    • Will Deacon's avatar
      perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples · 085b3062
      Will Deacon authored
      The ARM SPE architecture permits an implementation to ignore a sample
      if the sample is due to be taken whilst another sample is already being
      produced. In this case, it is desirable to report the collision to
      userspace, as they may want to lower the sample period.
      
      This patch adds a PERF_AUX_FLAG_COLLISION flag, so that such events can
      be relayed to userspace.
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      085b3062
    • Will Deacon's avatar
      perf/core: Export AUX buffer helpers to modules · bc1d2020
      Will Deacon authored
      Perf PMU drivers using AUX buffers cannot be built as modules unless
      the AUX helpers are exported.
      
      This patch exports perf_aux_output_{begin,end,skip} and perf_get_aux to
      modules.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      bc1d2020
    • Will Deacon's avatar
      genirq: export irq_get_percpu_devid_partition to modules · 5ffeb050
      Will Deacon authored
      Any modular driver using cluster-affine PPIs needs to be able to call
      irq_get_percpu_devid_partition so that it can enable the IRQ on the
      correct subset of CPUs.
      
      This patch exports the symbol so that it can be called from within a
      module.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      5ffeb050
  7. 17 Oct, 2017 1 commit
  8. 16 Oct, 2017 8 commits
    • Lorenzo Pieralisi's avatar
      ACPI/IORT: Enable SMMUv3/PMCG IORT MSI domain set-up · 65637901
      Lorenzo Pieralisi authored
      ITS specific mappings for SMMUv3/PMCG components can be retrieved
      through special index mapping entries introduced in IORT revision C.
      
      Introduce a new API iort_set_device_domain() to set the MSI domain for
      SMMUv3/PMCG nodes (extendable to any future IORT node requiring special
      index ITS mapping entries) that represent MSI through special index
      mappings in order to enable MSI support for the devices their nodes
      represent.
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      65637901
    • Hanjun Guo's avatar
      ACPI/IORT: Add SMMUv3 specific special index mapping handling · 86456a3f
      Hanjun Guo authored
      IORT revision C introduced a mapping entry binding to describe ITS
      device ID mapping for SMMUv3 MSI interrupts.
      
      Enable the single mapping flag (ie that is used by SMMUv3 component for
      its special index mappings) for the SMMUv3 node in the IORT mapping API
      and add IORT code to handle special index mapping entry for the SMMUv3
      IORT nodes to enable their MSI interrupts. In case the ACPICA for
      SMMUv3 device ID mapping is not ready, use the ACPICA version as a guard
      for function iort_get_id_mapping_index().
      Signed-off-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      [lorenzo.pieralisi@arm.com: patch split, typos fixing, rewrote the log]
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      86456a3f
    • Hanjun Guo's avatar
      ACPI/IORT: Enable special index ITS group mappings for IORT nodes · 8c8df8dc
      Hanjun Guo authored
      IORT revision C introduced SMMUv3 and PMCG MSI support by adding
      specific mapping entries in the SMMUv3/PMCG subtables to retrieve
      the device ID and the ITS group it maps to for a given SMMUv3/PMCG
      IORT node.
      
      Introduce a mapping function (ie iort_get_id_mapping_index()), that
      for a given IORT node looks up if an ITS specific ID mapping entry
      exists and if so retrieve the corresponding mapping index in the IORT
      node mapping array.
      
      Since an ITS specific index mapping can be present for an IORT
      node that is not a leaf node (eg SMMUv3 - to describe its own
      ITS device ID) special handling is required for two steps mapping
      cases such as PCI/NamedComponent--->SMMUv3--->ITS because the SMMUv3
      ITS specific index mapping entry should be skipped to prevent the
      IORT API from considering the mapping entry as a regular mapping one.
      
      If we take the following IORT topology example:
      
      |----------------------|
      |  Root Complex Node   |
      |----------------------|
      |    map entry[x]      |
      |----------------------|
      |       id value       |
      | output_reference     |
      |---|------------------|
          |
          |   |----------------------|
          |-->|        SMMUv3        |
              |----------------------|
              |     SMMUv3 dev ID    |
              |     mapping index 0  |
              |----------------------|
              |      map entry[0]    |
              |----------------------|
              |       id value       |
              | output_reference-----------> ITS 1 (SMMU MSI domain)
              |----------------------|
              |      map entry[1]    |
              |----------------------|
              |       id value       |
              | output_reference-----------> ITS 2 (PCI MSI domain)
              |----------------------|
      
      where the SMMUv3 ITS specific mapping entry is index 0 and it
      represents the SMMUv3 ITS specific index mapping entry (describing its
      own ITS device ID), we need to skip that mapping entry while carrying
      out the Root Complex Node regular mappings to prevent erroneous
      translations.
      
      Reuse the iort_get_id_mapping_index() function to detect the ITS
      specific mapping index for a specific IORT node and skip it in the IORT
      mapping API (ie iort_node_map_id()) loop to prevent considering it a
      normal PCI/Named Component ID mapping entry.
      Signed-off-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      [lorenzo.pieralisi@arm.com: split patch/rewrote commit log]
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      8c8df8dc
    • Hanjun Guo's avatar
      ACPI/IORT: Look up IORT node through struct fwnode_handle pointer · 0a71d8b9
      Hanjun Guo authored
      Current IORT code provides a function (ie iort_get_fwnode())
      which looks up a struct fwnode_handle pointer through a
      struct acpi_iort_node pointer for SMMU components but it
      lacks a function that implements the reverse look-up, namely
      struct fwnode_handle* -> struct acpi_iort_node*.
      
      Devices that are not IORT named components cannot be retrieved through
      their associated IORT named component scan interface because they just
      are not represented in the ACPI namespace; the reverse look-up is
      therefore required for all platform devices that represent IORT nodes
      (eg SMMUs) so that the struct acpi_iort_node* can be retrieved from the
      struct device->fwnode pointer.
      Signed-off-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      [lorenzo.pieralisi@arm.com: re-indented/rewrote the commit log]
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      0a71d8b9
    • Lorenzo Pieralisi's avatar
      ACPI/IORT: Make platform devices initialization code SMMU agnostic · 896dd2c3
      Lorenzo Pieralisi authored
      The way current IORT code initializes platform devices for SMMU nodes
      is somewhat tied (mostly for naming convention) to the SMMU nodes
      themselves but it need not be in that it is completely generic and
      can easily be made so by structures renaming and code reshuffling.
      
      Rework IORT platform devices initialization code to make the functions
      and data structures SMMU agnostic.
      
      No functional changes intended.
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      896dd2c3
    • Lorenzo Pieralisi's avatar
      ACPI/IORT: Improve functions return type/storage class specifier indentation · e3d49392
      Lorenzo Pieralisi authored
      Some functions definition indentations are using a style that is frowned
      upon with return value type/storage class specifier in a separate line.
      
      Reindent the function definitions to fix them.
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      e3d49392
    • Lorenzo Pieralisi's avatar
      ACPI/IORT: Remove leftover ACPI_IORT_SMMU_V3_PXM_VALID guard · 75808131
      Lorenzo Pieralisi authored
      The conditional ACPI_IORT_SMMU_V3_PXM_VALID guard around
      arm_smmu_v3_set_proximity() was added to manage a cross tree
      ACPICA merge dependency; with ACPICA changes merged in:
      
      commit c9442300 ("ACPICA: iasl: Update to IORT SMMUv3
      disassembling")
      
      the guard has become useless. Remove it.
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
      75808131
    • Arvind Yadav's avatar
      acpi/arm64: pr_err() strings should end with newlines · ee10b9c9
      Arvind Yadav authored
      pr_err() messages should terminated with a new-line to avoid
      other messages being concatenated onto the end.
      Signed-off-by: default avatarArvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: default avatarHanjun Guo <hanjun.guo@linaro.org>
      ee10b9c9
  9. 13 Oct, 2017 2 commits
    • Julien Thierry's avatar
      arm64: use WFE for long delays · 7b77452e
      Julien Thierry authored
      The current delay implementation uses the yield instruction, which is a
      hint that it is beneficial to schedule another thread. As this is a hint,
      it may be implemented as a NOP, causing all delays to be busy loops. This
      is the case for many existing CPUs.
      
      Taking advantage of the generic timer sending periodic events to all
      cores, we can use WFE during delays to reduce power consumption. This is
      beneficial only for delays longer than the period of the timer event
      stream.
      
      If timer event stream is not enabled, delays will behave as yield/busy
      loops.
      Signed-off-by: default avatarJulien Thierry <julien.thierry@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      7b77452e
    • Julien Thierry's avatar
      arm_arch_timer: Expose event stream status · ec5c8e42
      Julien Thierry authored
      The arch timer configuration for a CPU might get reset after suspending
      said CPU.
      
      In order to reliably use the event stream in the kernel (e.g. for delays),
      we keep track of the state where we can safely consider the event stream as
      properly configured. After writing to cntkctl, we issue an ISB to ensure
      that subsequent delay loops can rely on the event stream being enabled.
      Signed-off-by: default avatarJulien Thierry <julien.thierry@arm.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      ec5c8e42
  10. 11 Oct, 2017 1 commit
    • Mark Rutland's avatar
      arm64: docs: describe ELF hwcaps · 611a7bc7
      Mark Rutland authored
      We don't document our ELF hwcaps, leaving developers to interpret them
      according to hearsay, guesswork, or (in exceptional cases) inspection of
      the current kernel code.
      
      This is less than optimal, and it would be far better if we had some
      definitive description of each of the ELF hwcaps that developers could
      refer to.
      
      This patch adds a document describing the (native) arm64 ELF hwcaps.
      
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      [ Updated new hwcap entries in the document ]
      Signed-off-by: default avatarSuzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      611a7bc7