1. 12 Oct, 2016 10 commits
    • Chris Wilson's avatar
      drm/i915: Update debugfs describe_obj() to show fault-mappable · 8baa1f04
      Chris Wilson authored
      The current meaning of whether an object has a GGTT vma is very
      ill-defined (and note we don't check for any partials either), it just
      means that at some point it was in the GGTT but it may not be now. The
      information we really care about here is whether it is taking up
      precious mappable aperture space. This is the obj->fault_mappable flag.
      We have a redundant long form reprinting of this information, so remove
      that in favour of the compact flag.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012114827.17031-2-chris@chris-wilson.co.uk
      8baa1f04
    • Chris Wilson's avatar
      drm/i915: Use fence_write() from rpm resume · 4676dc83
      Chris Wilson authored
      During rpm resume we restore the fences, but we do not have the
      protection of struct_mutex. This rules out updating the activity
      tracking on the fences, and requires us to rely on the rpm as the
      serialisation barrier instead.
      
      [  350.298052] [drm:intel_runtime_resume [i915]] Resuming device
      [  350.308606]
      [  350.310520] ===============================
      [  350.315560] [ INFO: suspicious RCU usage. ]
      [  350.320554] 4.8.0-rc8-bsw-rapl+ #3133 Tainted: G     U  W
      [  350.327208] -------------------------------
      [  350.331977] ../drivers/gpu/drm/i915/i915_gem_request.h:371 suspicious rcu_dereference_protected() usage!
      [  350.342619]
      [  350.342619] other info that might help us debug this:
      [  350.342619]
      [  350.351593]
      [  350.351593] rcu_scheduler_active = 1, debug_locks = 0
      [  350.358952] 3 locks held by Xorg/320:
      [  350.363077]  #0:  (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa030589c>] drm_modeset_lock_all+0x3c/0xd0 [drm]
      [  350.375162]  #1:  (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa03058a6>] drm_modeset_lock_all+0x46/0xd0 [drm]
      [  350.387022]  #2:  (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0305056>] drm_modeset_lock+0x36/0x110 [drm]
      [  350.398236]
      [  350.398236] stack backtrace:
      [  350.403196] CPU: 1 PID: 320 Comm: Xorg Tainted: G     U  W       4.8.0-rc8-bsw-rapl+ #3133
      [  350.412457] Hardware name: Intel Corporation CHERRYVIEW C0 PLATFORM/Braswell CRB, BIOS BRAS.X64.X088.R00.1510270350 10/27/2015
      [  350.425212]  0000000000000000 ffff8801680a78c8 ffffffff81332187 ffff88016c5c5000
      [  350.433611]  0000000000000001 ffff8801680a78f8 ffffffff810ca6da ffff88016cc8b0f0
      [  350.442012]  ffff88016cc80000 ffff88016cc80000 ffff880177ad0000 ffff8801680a7948
      [  350.450409] Call Trace:
      [  350.453165]  [<ffffffff81332187>] dump_stack+0x67/0x90
      [  350.458931]  [<ffffffff810ca6da>] lockdep_rcu_suspicious+0xea/0x120
      [  350.466002]  [<ffffffffa039e8dd>] fence_update+0xbd/0x670 [i915]
      [  350.472766]  [<ffffffffa039efe2>] i915_gem_restore_fences+0x52/0x70 [i915]
      [  350.480496]  [<ffffffffa0368f42>] vlv_resume_prepare+0x72/0x570 [i915]
      [  350.487839]  [<ffffffffa0369802>] intel_runtime_resume+0x102/0x210 [i915]
      [  350.495442]  [<ffffffff8137f26f>] pci_pm_runtime_resume+0x7f/0xb0
      [  350.502274]  [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
      [  350.509883]  [<ffffffff814401c5>] __rpm_callback+0x35/0x70
      [  350.516037]  [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
      [  350.523646]  [<ffffffff81440224>] rpm_callback+0x24/0x80
      [  350.529604]  [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
      [  350.537212]  [<ffffffff814417bd>] rpm_resume+0x4ad/0x740
      [  350.543161]  [<ffffffff81441aa1>] __pm_runtime_resume+0x51/0x80
      [  350.549824]  [<ffffffffa03889c8>] intel_runtime_pm_get+0x28/0x90 [i915]
      [  350.557265]  [<ffffffffa0388a53>] intel_display_power_get+0x23/0x50 [i915]
      [  350.565001]  [<ffffffffa03ef23d>] intel_atomic_commit_tail+0xdfd/0x10b0 [i915]
      [  350.573106]  [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
      [  350.582659]  [<ffffffff81615091>] ? _raw_spin_unlock+0x31/0x50
      [  350.589205]  [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
      [  350.598787]  [<ffffffffa03ef8a5>] intel_atomic_commit+0x3b5/0x500 [i915]
      [  350.606319]  [<ffffffffa03061dc>] ? drm_atomic_set_crtc_for_connector+0xcc/0x100 [drm]
      [  350.615209]  [<ffffffffa0306b49>] drm_atomic_commit+0x49/0x50 [drm]
      [  350.622242]  [<ffffffffa034dee8>] drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper]
      [  350.631419]  [<ffffffffa02f94ac>] drm_mode_set_config_internal+0x6c/0x120 [drm]
      [  350.639623]  [<ffffffffa02fa94c>] drm_mode_setcrtc+0x22c/0x4d0 [drm]
      [  350.646760]  [<ffffffffa02f0f19>] drm_ioctl+0x209/0x460 [drm]
      [  350.653217]  [<ffffffffa02fa720>] ? drm_mode_getcrtc+0x150/0x150 [drm]
      [  350.660536]  [<ffffffff810c984a>] ? __lock_is_held+0x4a/0x70
      [  350.666885]  [<ffffffff81202303>] do_vfs_ioctl+0x93/0x6b0
      [  350.672939]  [<ffffffff8120f843>] ? __fget+0x113/0x200
      [  350.678797]  [<ffffffff8120f735>] ? __fget+0x5/0x200
      [  350.684361]  [<ffffffff81202964>] SyS_ioctl+0x44/0x80
      [  350.690030]  [<ffffffff81001deb>] do_syscall_64+0x5b/0x120
      [  350.696184]  [<ffffffff81615ada>] entry_SYSCALL64_slow_path+0x25/0x25
      
      Note we also have to remember the lesson from commit 4fc788f5
      ("drm/i915: Flush delayed fence releases after reset") where we have to
      flush any changes to the fence on restore.
      
      v2: Replace call to release user mmaps with an assertion that they have
      already been zapped.
      
      Fixes: 49ef5294 ("drm/i915: Move fence tracking from object to vma")
      Reported-by: default avatarVille Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012114827.17031-1-chris@chris-wilson.co.uk
      4676dc83
    • Chris Wilson's avatar
      drm/i915: Compress GPU objects in error state · 0a97015d
      Chris Wilson authored
      Our error states are quickly growing, pinning kernel memory with them.
      The majority of the space is taken up by the error objects. These
      compress well using zlib and without decode are mostly meaningless, so
      encoding them does not hinder quickly parsing the error state for
      familiarity.
      
      v2: Make the zlib dependency optional
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012090522.367-6-chris@chris-wilson.co.uk
      0a97015d
    • Chris Wilson's avatar
      drm/i915: Consolidate error object printing · fc4c79c3
      Chris Wilson authored
      Leave all the pretty printing to userspace and simplify the error
      capture to only have a single common object printer. It makes the kernel
      code more compact, and the refactoring allows us to apply more complex
      transformations like compressing the output.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012090522.367-5-chris@chris-wilson.co.uk
      fc4c79c3
    • Chris Wilson's avatar
      drm/i915: Always use the GTT for error capture · 95374d75
      Chris Wilson authored
      Since the GTT provides universal access to any GPU page, we can use it
      to reduce our plethora of read methods to just one. It also has the
      important characteristic of being exactly what the GPU sees - if there
      are incoherency problems, seeing the batch as executed (rather than as
      trapped inside the cpu cache) is important.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012090522.367-4-chris@chris-wilson.co.uk
      95374d75
    • Chris Wilson's avatar
      drm/i915: Stop the machine whilst capturing the GPU crash dump · 9f267eb8
      Chris Wilson authored
      The error state is purposefully racy as we expect it to be called at any
      time and so have avoided any locking whilst capturing the crash dump.
      However, with multi-engine GPUs and multiple CPUs, those races can
      manifest into OOPSes as we attempt to chase dangling pointers freed on
      other CPUs. Under discussion are lots of ways to slow down normal
      operation in order to protect the post-mortem error capture, but what it
      we take the opposite approach and freeze the machine whilst the error
      capture runs (note the GPU may still running, but as long as we don't
      process any of the results the driver's bookkeeping will be static).
      
      Note that by of itself, this is not a complete fix. It also depends on
      the compiler barriers in list_add/list_del to prevent traversing the
      lists into the void. We also depend that we only require state from
      carefully controlled sources - i.e. all the state we require for
      post-mortem debugging should be reachable from the request itself so
      that we only have to worry about retrieving the request carefully. Once
      we have the request, we know that all pointers from it are intact.
      
      v2: Avoid drm_clflush_pages() inside stop_machine() as it may use
      stop_machine() itself for its wbinvd fallback.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Acked-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012090522.367-3-chris@chris-wilson.co.uk
      9f267eb8
    • Chris Wilson's avatar
      drm/i915: Allow disabling error capture · 98a2f411
      Chris Wilson authored
      We currently capture the GPU state after we detect a hang. This is vital
      for us to both triage and debug hangs in the wild (post-mortem
      debugging). However, it comes at the cost of running some potentially
      dangerous code (since it has to make very few assumption about the state
      of the driver) that is quite resource intensive.
      
      This patch introduces both a method to disable error capture at runtime
      (for users who hit bugs at runtime and need a workaround) and to disable
      error capture at compiletime (for realtime users who want to minimise
      any possible latency, and never require error capture, saving ~30k of
      code). The cost is that we now have to be wary of (and test!) a kconfig
      flag and a module parameter. The effect of the module parameter is easy
      to verify through code inspection and runtime testing, but a kconfig flag
      needs regular compile checking.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Acked-by: default avatarJani Nikula <jani.nikula@linux.intel.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012090522.367-2-chris@chris-wilson.co.uk
      98a2f411
    • Chris Wilson's avatar
      drm/i915: Move common code out of i915_gpu_error.c · 0e704476
      Chris Wilson authored
      In the next patch, I want to conditionally compile i915_gpu_error.c and
      that requires moving the functions used by debug out of
      i915_gpu_error.c!
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161012090522.367-1-chris@chris-wilson.co.uk
      0e704476
    • Joonas Lahtinen's avatar
      drm/i915: Remove unused BSM_MASK causing warning · 40006c43
      Joonas Lahtinen authored
      Remove never used BSM{,_MASK}. BSM_MASK #define also causes a warning.
      
      include/drm/i915_drm.h:96:34: warning: result of ‘65535 << 20’
      requires 37 bits to represent, but ‘int’ only has 32 bits
      [-Wshiftoverflow=]
         #define   INTEL_BSM_MASK (0xFFFF << 20)
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Link: http://patchwork.freedesktop.org/patch/msgid/1476256734-6457-1-git-send-email-joonas.lahtinen@linux.intel.com
      40006c43
    • Daniel Vetter's avatar
      Merge tag 'drm-for-v4.9' into drm-intel-next-queued · c0c8b9ed
      Daniel Vetter authored
      It's been over two months, git definitely lost it's marbles. Conflicts
      resolved by picking our version, plus manually checking the diff with
      the parent in drm-intel-next-queued to make sure git didn't do
      anything stupid. It did, so I removed 2 occasions where it
      double-inserted a bit of code. The diff is now just
      - kernel-doc changes
      - drm format/name changes
      - display-info changes
      so looks all reasonable.
      Signed-off-by: default avatarDaniel Vetter <daniel.vetter@intel.com>
      c0c8b9ed
  2. 11 Oct, 2016 14 commits
  3. 10 Oct, 2016 16 commits