1. 29 Sep, 2020 3 commits
  2. 24 Sep, 2020 11 commits
  3. 10 Sep, 2020 10 commits
    • Kim Phillips's avatar
      arch/x86/amd/ibs: Fix re-arming IBS Fetch · 221bfce5
      Kim Phillips authored
      Stephane Eranian found a bug in that IBS' current Fetch counter was not
      being reset when the driver would write the new value to clear it along
      with the enable bit set, and found that adding an MSR write that would
      first disable IBS Fetch would make IBS Fetch reset its current count.
      
      Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12,
      2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when
      IbsFetchEn is changed from 0 to 1."
      
      Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch,
      so the driver properly resets the internal counter to 0 and IBS
      Fetch starts counting again.
      
      A family 15h machine tested does not have this problem, and the extra
      wrmsr is also not needed on Family 19h, so only do the extra wrmsr on
      families 16h through 18h.
      Reported-by: default avatarStephane Eranian <stephane.eranian@google.com>
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      [peterz: optimized]
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
      221bfce5
    • Kim Phillips's avatar
      perf/x86/rapl: Add AMD Fam19h RAPL support · a77259bd
      Kim Phillips authored
      Family 19h RAPL support did not change from Family 17h; extend
      the existing Fam17h support to work on Family 19h too.
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200908214740.18097-8-kim.phillips@amd.com
      a77259bd
    • Kim Phillips's avatar
      perf/x86/amd/ibs: Support 27-bit extended Op/cycle counter · 8b0bed7d
      Kim Phillips authored
      IBS hardware with the OpCntExt feature gets a 7-bit wider internal
      counter.  Both the maximum and current count bitfields in the
      IBS_OP_CTL register are extended to support reading and writing it.
      
      No changes are necessary to the driver for handling the extra
      contiguous current count bits (IbsOpCurCnt), as the driver already
      passes through 32 bits of that field.  However, the driver has to do
      some extra bit manipulation when converting from a period to the
      non-contiguous (although conveniently aligned) extra bits in the
      IbsOpMaxCnt bitfield.
      
      This decreases IBS Op interrupt overhead when the period is over
      1,048,560 (0xffff0), which would previously activate the driver's
      software counter.  That threshold is now 134,217,712 (0x7fffff0).
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200908214740.18097-7-kim.phillips@amd.com
      8b0bed7d
    • Kim Phillips's avatar
      perf/x86/amd/ibs: Fix raw sample data accumulation · 36e1be8a
      Kim Phillips authored
      Neither IbsBrTarget nor OPDATA4 are populated in IBS Fetch mode.
      Don't accumulate them into raw sample user data in that case.
      
      Also, in Fetch mode, add saving the IBS Fetch Control Extended MSR.
      
      Technically, there is an ABI change here with respect to the IBS raw
      sample data format, but I don't see any perf driver version information
      being included in perf.data file headers, but, existing users can detect
      whether the size of the sample record has reduced by 8 bytes to
      determine whether the IBS driver has this fix.
      
      Fixes: 904cb367 ("perf/x86/amd/ibs: Update IBS MSRs and feature definitions")
      Reported-by: default avatarStephane Eranian <stephane.eranian@google.com>
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200908214740.18097-6-kim.phillips@amd.com
      36e1be8a
    • Kim Phillips's avatar
      perf/x86/amd/ibs: Don't include randomized bits in get_ibs_op_count() · 680d6963
      Kim Phillips authored
      get_ibs_op_count() adds hardware's current count (IbsOpCurCnt) bits
      to its count regardless of hardware's valid status.
      
      According to the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54,
      if the counter rolls over, valid status is set, and the lower 7 bits
      of IbsOpCurCnt are randomized by hardware.
      
      Don't include those bits in the driver's event count.
      
      Fixes: 8b1e1363 ("perf/x86-ibs: Fix usage of IBS op current count")
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
      680d6963
    • Kim Phillips's avatar
      perf/x86/amd: Fix sampling Large Increment per Cycle events · 26e52558
      Kim Phillips authored
      Commit 57388912 ("perf/x86/amd: Add support for Large Increment
      per Cycle Events") mistakenly zeroes the upper 16 bits of the count
      in set_period().  That's fine for counting with perf stat, but not
      sampling with perf record when only Large Increment events are being
      sampled.  To enable sampling, we sign extend the upper 16 bits of the
      merged counter pair as described in the Family 17h PPRs:
      
      "Software wanting to preload a value to a merged counter pair writes the
      high-order 16-bit value to the low-order 16 bits of the odd counter and
      then writes the low-order 48-bit value to the even counter. Reading the
      even counter of the merged counter pair returns the full 64-bit value."
      
      Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
      26e52558
    • Kim Phillips's avatar
      perf/amd/uncore: Set all slices and threads to restore perf stat -a behaviour · c8fe99d0
      Kim Phillips authored
      Commit 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for
      F17h L3 PMCs") inadvertently changed the uncore driver's behaviour
      wrt perf tool invocations with or without a CPU list, specified with
      -C / --cpu=.
      
      Change the behaviour of the driver to assume the former all-cpu (-a)
      case, which is the more commonly desired default.  This fixes
      '-a -A' invocations without explicit cpu lists (-C) to not count
      L3 events only on behalf of the first thread of the first core
      in the L3 domain.
      
      BEFORE:
      
      Activity performed by the first thread of the last core (CPU#43) in
      CPU#40's L3 domain is not reported by CPU#40:
      
      sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
      ...
      CPU36                 21,835      l3_request_g1.caching_l3_cache_accesses
      CPU40                 87,066      l3_request_g1.caching_l3_cache_accesses
      CPU44                 17,360      l3_request_g1.caching_l3_cache_accesses
      ...
      
      AFTER:
      
      The L3 domain activity is now reported by CPU#40:
      
      sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
      ...
      CPU36                354,891      l3_request_g1.caching_l3_cache_accesses
      CPU40              1,780,870      l3_request_g1.caching_l3_cache_accesses
      CPU44                315,062      l3_request_g1.caching_l3_cache_accesses
      ...
      
      Fixes: 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs")
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200908214740.18097-2-kim.phillips@amd.com
      c8fe99d0
    • Kan Liang's avatar
      perf/core: Pull pmu::sched_task() into perf_event_context_sched_out() · 44fae179
      Kan Liang authored
      The pmu::sched_task() is a context switch callback. It passes the
      cpuctx->task_ctx as a parameter to the lower code. To find the
      cpuctx->task_ctx, the current code iterates a cpuctx list.
      The same context will iterated in perf_event_context_sched_out() soon.
      Share the cpuctx->task_ctx can avoid the unnecessary iteration of the
      cpuctx list.
      
      The pmu::sched_task() is also required for the optimization case for
      equivalent contexts.
      
      The task_ctx_sched_out() will eventually disable and reenable the PMU
      when schedule out events. Add perf_pmu_disable() and perf_pmu_enable()
      around task_ctx_sched_out() don't break anything.
      
      Drop the cpuctx->ctx.lock for the pmu::sched_task(). The lock is for
      per-CPU context, which is not necessary for the per-task context
      schedule.
      
      No one uses sched_cb_entry, perf_sched_cb_usages, sched_cb_list, and
      perf_pmu_sched_task() any more.
      Suggested-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200821195754.20159-2-kan.liang@linux.intel.com
      44fae179
    • Kan Liang's avatar
      perf/core: Pull pmu::sched_task() into perf_event_context_sched_in() · 556cccad
      Kan Liang authored
      The pmu::sched_task() is a context switch callback. It passes the
      cpuctx->task_ctx as a parameter to the lower code. To find the
      cpuctx->task_ctx, the current code iterates a cpuctx list.
      
      The same context was just iterated in perf_event_context_sched_in(),
      which is invoked right before the pmu::sched_task().
      
      Reuse the cpuctx->task_ctx from perf_event_context_sched_in() can avoid
      the unnecessary iteration of the cpuctx list.
      
      Both pmu::sched_task and perf_event_context_sched_in() have to disable
      PMU. Pull the pmu::sched_task into perf_event_context_sched_in() can
      also save the overhead from the PMU disable and reenable.
      
      The new and old tasks may have equivalent contexts. The current code
      optimize this case by swapping the context, which avoids the scheduling.
      For this case, pmu::sched_task() is still required, e.g., restore the
      LBR content.
      Suggested-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200821195754.20159-1-kan.liang@linux.intel.com
      556cccad
    • Kan Liang's avatar
      perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS · 35d1ce6b
      Kan Liang authored
      A warning as below may be triggered when sampling with large PEBS.
      
      [  410.411250] perf: interrupt took too long (72145 > 71975), lowering
      kernel.perf_event_max_sample_rate to 2000
      [  410.724923] ------------[ cut here ]------------
      [  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
      x86_pmu_stop+0x95/0xa0
      [  410.933811]  x86_pmu_del+0x50/0x150
      [  410.937304]  event_sched_out.isra.0+0xbc/0x210
      [  410.941751]  group_sched_out.part.0+0x53/0xd0
      [  410.946111]  ctx_sched_out+0x193/0x270
      [  410.949862]  __perf_event_task_sched_out+0x32c/0x890
      [  410.954827]  ? set_next_entity+0x98/0x2d0
      [  410.958841]  __schedule+0x592/0x9c0
      [  410.962332]  schedule+0x5f/0xd0
      [  410.965477]  exit_to_usermode_loop+0x73/0x120
      [  410.969837]  prepare_exit_to_usermode+0xcd/0xf0
      [  410.974369]  ret_from_intr+0x2a/0x3a
      [  410.977946] RIP: 0033:0x40123c
      [  411.079661] ---[ end trace bc83adaea7bb664a ]---
      
      In the non-overflow context, e.g., context switch, with large PEBS, perf
      may stop an event twice. An example is below.
      
        //max_samples_per_tick is adjusted to 2
        //NMI is triggered
        intel_pmu_handle_irq()
           handle_pmi_common()
             drain_pebs()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     hwc->interrupts = 1
                     return 0
        //A context switch happens right after the NMI.
        //In the same tick, the perf_throttled_seq is not changed.
        perf_event_task_sched_out()
           perf_pmu_sched_task()
             intel_pmu_drain_pebs_buffer()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     ++hwc->interrupts >= max_samples_per_tick
                     return 1
                 x86_pmu_stop();  # First stop
           perf_event_context_sched_out()
             task_ctx_sched_out()
               ctx_sched_out()
                 event_sched_out()
                   x86_pmu_del()
                     x86_pmu_stop();  # Second stop and trigger the warning
      
      Perf should only invoke the perf_event_overflow() in the overflow
      context.
      
      Current drain_pebs() is called from:
      - handle_pmi_common()			-- overflow context
      - intel_pmu_pebs_sched_task()		-- non-overflow context
      - intel_pmu_pebs_disable()		-- non-overflow context
      - intel_pmu_auto_reload_read()		-- possible overflow context
        With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be
        invoked in the NMI handler. But, before calling the function, the
        PEBS buffer has already been drained. The __intel_pmu_pebs_event()
        will not be called in the possible overflow context.
      
      To fix the issue, an indicator is required to distinguish between the
      overflow context aka handle_pmi_common() and other cases.
      The dummy regs pointer can be used as the indicator.
      
      In the non-overflow context, perf should treat the last record the same
      as other PEBS records, and doesn't invoke the generic overflow handler.
      
      Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
      Reported-by: default avatarLike Xu <like.xu@linux.intel.com>
      Suggested-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: default avatarLike Xu <like.xu@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com
      35d1ce6b
  4. 18 Aug, 2020 11 commits
  5. 16 Aug, 2020 5 commits
    • Linus Torvalds's avatar
      Linux 5.9-rc1 · 9123e3a7
      Linus Torvalds authored
      9123e3a7
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block · 2cc3c4b3
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "A few differerent things in here.
      
        Seems like syzbot got some more io_uring bits wired up, and we got a
        handful of reports and the associated fixes are in here.
      
        General fixes too, and a lot of them marked for stable.
      
        Lastly, a bit of fallout from the async buffered reads, where we now
        more easily trigger short reads. Some applications don't really like
        that, so the io_read() code now handles short reads internally, and
        got a cleanup along the way so that it's now easier to read (and
        documented). We're now passing tests that failed before"
      
      * tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
        io_uring: short circuit -EAGAIN for blocking read attempt
        io_uring: sanitize double poll handling
        io_uring: internally retry short reads
        io_uring: retain iov_iter state over io_read/io_write calls
        task_work: only grab task signal lock when needed
        io_uring: enable lookup of links holding inflight files
        io_uring: fail poll arm on queue proc failure
        io_uring: hold 'ctx' reference around task_work queue + execute
        fs: RWF_NOWAIT should imply IOCB_NOIO
        io_uring: defer file table grabbing request cleanup for locked requests
        io_uring: add missing REQ_F_COMP_LOCKED for nested requests
        io_uring: fix recursive completion locking on oveflow flush
        io_uring: use TWA_SIGNAL for task_work uncondtionally
        io_uring: account locked memory before potential error case
        io_uring: set ctx sq/cq entry count earlier
        io_uring: Fix NULL pointer dereference in loop_rw_iter()
        io_uring: add comments on how the async buffered read retry works
        io_uring: io_async_buf_func() need not test page bit
      2cc3c4b3
    • Mike Rapoport's avatar
      parisc: fix PMD pages allocation by restoring pmd_alloc_one() · 6f6aea7e
      Mike Rapoport authored
      Commit 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one()
      and pmd_free_one()") converted parisc to use generic version of
      pmd_alloc_one() but it missed the fact that parisc uses order-1 pages for
      PMD.
      
      Restore the original version of pmd_alloc_one() for parisc, just use
      GFP_PGTABLE_KERNEL that implies __GFP_ZERO instead of GFP_KERNEL and
      memset.
      
      Fixes: 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one()")
      Reported-by: default avatarMeelis Roos <mroos@linux.ee>
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Tested-by: default avatarMeelis Roos <mroos@linux.ee>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Link: https://lkml.kernel.org/r/9f2b5ebd-e4a4-0fa1-6cd3-4b9f6892d1ad@linux.eeSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6f6aea7e
    • Linus Torvalds's avatar
      Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block · 4b6c093e
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few fixes on the block side of things:
      
         - Discard granularity fix (Coly)
      
         - rnbd cleanups (Guoqing)
      
         - md error handling fix (Dan)
      
         - md sysfs fix (Junxiao)
      
         - Fix flush request accounting, which caused an IO slowdown for some
           configurations (Ming)
      
         - Properly propagate loop flag for partition scanning (Lennart)"
      
      * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
        block: fix double account of flush request's driver tag
        loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
        rnbd: no need to set bi_end_io in rnbd_bio_map_kern
        rnbd: remove rnbd_dev_submit_io
        md-cluster: Fix potential error pointer dereference in resize_bitmaps()
        block: check queue's limits.discard_granularity in __blkdev_issue_discard()
        md: get sysfs entry after redundancy attr group create
      4b6c093e
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · d84835b1
      Linus Torvalds authored
      Pull RISC-V fix from Palmer Dabbelt:
       "I collected a single fix during the merge window: we managed to break
        the early trap setup on !MMU, this fixes it"
      
      * tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Setup exception vector for nommu platform
      d84835b1