Commits · 6ed6ba32dba14ef851ecb7190597d6bac77618e2 · Kirill Smelkov / linux

19 Dec, 2023 40 commits

drm/xe: Add stepping support for GMD_ID platforms · 6ed6ba32

Matt Roper authored May 24, 2023

For platforms with GMD_ID registers, the IP stepping should be
determined from the 'revid' field of those registers rather than from
the PCI revid.

The hardware teams have indicated that they plan to keep the revid =>
stepping mapping consistent across all GMD_ID platforms, with major
steppings (A0, B0, C0, etc.) having revids that are multiples of 4, and
minor steppings (A1, A2, A3, etc.) taking the intermediate values.  For
now we'll trust that hardware follows through on this plan; if they have
to change direction in the future (e.g., they wind up needing something
like an "A4" that doesn't fit this scheme), we can add a GMD_ID-based
lookup table when the time comes.

v2:
 - Set xe->info.platform before finding stepping; the pre-GMD_ID code
   relies on this value to pick a lookup table.
v3:
 - Also set xe->info.subplatform before picking the stepping for
   pre-GMD_ID lookup.
Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Link: https://lore.kernel.org/r/20230524185952.666158-1-matthew.d.roper@intel.comSigned-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

6ed6ba32

drm/xe: Fail xe_device_create() if wq allocation fails · c93b6de7

Gustavo Sousa authored May 18, 2023

Let's make sure we give the driver a valid workqueue.

While at it, also make sure to call destroy_workqueue() only if the
workqueue is a valid one. That is necessary because xe_device_destroy()
is indirectly called as part of the cleanup process of a failed
xe_device_create().
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230518215651.502159-3-gustavo.sousa@intel.comSigned-off-by: Gustavo Sousa <gustavo.sousa@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

c93b6de7

drm/xe: Call drmm_add_action_or_reset() early in xe_device_create() · b67ece5b

Gustavo Sousa authored May 18, 2023

Otherwise no cleanup is actually done if we branch to err_put.

This works for now: currently we do know that, once inside
xe_device_destroy(), ttm_device_init() was successful so we can safely
call ttm_device_fini(); and, for xe->ordered_wq, there is an upcoming
commit to check its value before calling destroy_workqueue().

However, we might need change this in the future if we have more
initializers called that can fail in a way that we can not know which
one was it once inside xe_device_destroy().
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230518215651.502159-2-gustavo.sousa@intel.comSigned-off-by: Gustavo Sousa <gustavo.sousa@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

b67ece5b

drm/xe: Do not forget to drm_dev_put() in xe_pci_probe() · 6fedf842

Gustavo Sousa authored May 19, 2023

The function drm_dev_put() should also be called if xe_device_probe()
fails.

v2:
  - Improve commit message. (Lucas)
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20230519194802.578182-1-gustavo.sousa@intel.comSigned-off-by: Gustavo Sousa <gustavo.sousa@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

6fedf842

drm/xe: fix kernel-doc issues · 82f428b6

Matthew Auld authored May 22, 2023

drivers/gpu/drm/xe/xe_guc_submit_types.h:47: warning: cannot understand
function prototype: 'struct guc_submit_parallel_scratch '

drivers/gpu/drm/xe/xe_devcoredump_types.h:38: warning: Function
parameter or member 'ct' not described in 'xe_devcoredump_snapshot'

CI doesn't appear to be running BAT anymore, assuming this is caused by
the CI.Hooks now failing due to above warnings.
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

82f428b6

drm/xe: Change GuC interrupt data · 915757a6

Michal Wajdeczko authored May 19, 2023

Both GUC_HOST_INTERRUPT and MED_GUC_HOST_INTERRUPT can pass
additional payload data to the GuC but this capability is not
used by the firmware yet.

Stop using value mandated by legacy GuC interrupt register and
use default notify value (zero) instead.

Bspec: 49813, 63363
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

915757a6

drm/xe: Move Media GuC register definition to regs/ · 5013ad8d

Michal Wajdeczko authored May 11, 2023

This GuC register can be moved together with the rest of the
GuC register definitions and be named in a similar way.

v2: fix placement

Bspec: 63363
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> #v1
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

5013ad8d

drm/xe: Limit CONFIG_DRM_XE_SIMPLE_ERROR_CAPTURE to itself. · 328f3414

Rodrigo Vivi authored May 16, 2023

There are multiple kind of config prints and with the upcoming
devcoredump there will be another layer. Let's limit the config
to the top level functions and leave the clean-up work for the
compilers so we don't create a spider-web of configs.

No functional change. Just a preparation for the devcoredump.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

328f3414

drm/xe: Add HW Engine snapshot to xe_devcoredump. · 01a87f31

Rodrigo Vivi authored May 16, 2023

Let's continue to add our existent simple logs to devcoredump one
by one. Any format change should come on follow-up work.

v2: remove unnecessary, and now duplicated, dma_fence annotation. (Matthew)
v3: avoid for_each with faulty_engine since that can be already freed at
    the time of the read/free. Instead, iterate in the full array of
    hw_engines. (Kasan)

Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>

01a87f31

drm/xe: Convert Xe HW Engine print to snapshot capture and print. · a4db5555

Rodrigo Vivi authored May 16, 2023

The goal is to allow for a snapshot capture to be taken at the time
of the crash, while the print out can happen at a later time through
the exposed devcoredump virtual device.

v2: Addressing these Matthew comments:
    - Handle memory allocation failures.
    - Do not use GFP_ATOMIC on cases like debugfs prints.
    - placement of @reg doc.
    - identation issues.
v3: checkpatch
v4: Rebase and get back to GFP_ATOMIC only.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

a4db5555

drm/xe: Add GuC Submit Engine snapshot to xe_devcoredump. · 3847ec03

Rodrigo Vivi authored May 16, 2023

Let's start to move our existent logs to devcoredump one by
one. Any format change should come on follow-up work.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

3847ec03

drm/xe: Convert GuC Engine print to snapshot capture and print. · bbdf97c1

Rodrigo Vivi authored May 16, 2023

The goal is to allow for a snapshot capture to be taken at the time
of the crash, while the print out can happen at a later time through
the exposed devcoredump virtual device.

v2: Handle memory allocation failures. (Matthew)
Do not use GFP_ATOMIC on cases like debugfs prints. (Matthew)
v3: checkpatch
v4: pending_list allocation needs to be atomic because of the
    spin_lock. (Matthew)
    get back to GFP_ATOMIC only. (lockdep).

Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

bbdf97c1

drm/xe: Introduce guc_submit_types.h with relevant structs. · 1825c492

Rodrigo Vivi authored May 16, 2023

These structs and definitions are only used for the guc_submit
and they were added specifically for the parallel submission.

While doing that also delete the unused struct guc_wq_item.

v2: checkpatch fixes.

Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

1825c492

drm/xe: Add GuC CT snapshot to xe_devcoredump. · 5ed53446

Rodrigo Vivi authored May 16, 2023

Let's start to move our existent logs to devcoredump one by
one. Any format change should come on follow-up work.

v2: Rebase and add the dma_fence locking annotation here.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

5ed53446

drm/xe: Convert GuC CT print to snapshot capture and print. · 513260df

Rodrigo Vivi authored May 16, 2023

The goal is to allow for a snapshot capture to be taken at the time
of the crash, while the print out can happen at a later time through
the exposed devcoredump virtual device.

v2: Handle memory allocation failures. (Matthew)
    Do not use GFP_ATOMIC on cases like debugfs prints. (Matthew)
v3: checkpatch fixes
v4: Do not use atomic in the g2h_worker_func (Matthew)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

513260df

drm/xe: Extract non mapped regions out of GuC CTB into its own struct. · a7ca8157

Rodrigo Vivi authored May 16, 2023

No functional change here. The goal is to have a clear split between
the mapped portions of the CTB and the static information, so we can
easily capture snapshots that will be used for later read out with
the devcoredump infrastructure.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

a7ca8157

drm/xe: Do not take any action if our device was removed. · 656d2950

Rodrigo Vivi authored May 16, 2023

Unfortunately devcoredump infrastructure does not provide and
interface for us to force the device removal upon the pci_remove
time of our device.

The devcoredump is linked at the device level, so when in use
it will prevent the module removal, but it doesn't prevent the
call of the pci_remove callback. This callback cannot fail
anyway and we end up clearing and freeing the entire pci device.

Hence, after we removed the pci device, we shouldn't allow any
read or free operations to avoid segmentation fault.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

656d2950

drm/xe: Introduce the dev_coredump infrastructure. · e7994850

Rodrigo Vivi authored May 18, 2023

The goal is to use devcoredump infrastructure to report error states
captured at the crash time.

The error state will contain useful information for GPU hang debug, such
as INSTDONE registers and the current buffers getting executed, as well
as any other information that helps user space and allow later replays of
the error.

The proposal here is to avoid a Xe only error_state like i915 and use
a standard dev_coredump infrastructure to expose the error state.

For our own case, the data is only useful if it is a snapshot of the
time when the GPU crash has happened, since we reset the GPU immediately
after and the registers might have changed. So the proposal here is to
have an internal snapshot to be printed out later.

Also, usually a subsequent GPU hang can be only a cause of the initial
one. So we only save the 'first' hang. The dev_coredump has a delayed
work queue where it remove the coredump and free all the data within a
few moments of the error. When that happens we also reset our capture
state and allow further snapshots.

Right now this infra only print out the time of the hang. More information
will be migrated here on subsequent work. Also, in order to organize the
dump better, the goal is to propose dev_coredump changes itself to allow
multiple files and different controls. But for now we start Xe usage of
it without any dependency on dev_coredump core changes.

v2: Add dma_fence annotation for capture that might happen during long
    running. (Thomas and Matt)
    Use xe->drm.primary->index on drm_info msg. (Jani)
v3: checkpatch fixes
v4: Fix building and locking issues found by Francois.
    Actually let's kill all of the locking in here. gt_reset serialization
    already guarantee that there will be only one capture at the same time.
    Also, the devcoredump has its own locking to protect the free and reads
    and drivers don't need to duplicate it.
    Besides this, the dma_fence locking was pushed to a following patch
    since it is not needed in this one.
    Fix a use after free identified by KASAN: Do not stash the faulty_engine
    since that will be freed somewhere else.
v5: Fix Uptime - ktime_get_boottime actually returns the Uptime. (Francois)

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

e7994850

drm/xe: Use GT oriented log messages in xe_gt.c · 3e535bd5

Michal Wajdeczko authored Mar 12, 2023

Replace generic log messages with ones dedicated for the GT.
While around replace errno logs from plain %d to pretty %pe.

v2: rebased
v3: unify errno logs
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

3e535bd5

drm/xe: Introduce GT oriented log messages · 75a6aadb

Michal Wajdeczko authored Jan 16, 2023

While debugging GT related problems, it's good to know which GT was
reporting problems. Introduce helper macros to allow prefix GT logs
with GT identifier. We will use them in upcoming patches.

v2: use xe_ prefix (Lucas)
v3: use correct include
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

75a6aadb

drm/xe: Allow dma-fences as in-syncs for compute / faulting VM · 7cabe558

Matthew Brost authored Apr 16, 2023

This is allowed and encouraged by the dma-fencing rules. This along with
allowing compute VMs to export dma-fences on binds will result in a
simpler compute UMD.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

7cabe558

drm/xe: Allow compute VMs to output dma-fences on binds · ed1df989

Matthew Brost authored Apr 16, 2023

Binds are not long running jobs thus we can export dma-fences even if a
VM is in compute mode.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

ed1df989

drm/xe: Call exit functions when xe_register_pci_driver() fails · 9afd4b2d

Gustavo Sousa authored May 11, 2023

Move xe_register_pci_driver() and xe_unregister_pci_driver() to
init_funcs to make sure that exit functions are also called when
xe_register_pci_driver() fails.

Note that this also allows adding init functions to be run after
xe_register_pci_driver().

v2:
 - Move functions to init_funcs instead of having a special case for
   xe_register_pci_driver(). (Jani)

Cc: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

9afd4b2d

drm/xe: Get rid of MAKE_INIT_EXIT_FUNCS · a029aeca

Gustavo Sousa authored May 11, 2023

There is not much of a benefit from using that macro as of now and it
hurts grepability or other ways of cross-referencing.

Cc: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

a029aeca

drm/xe: Remove unused define · d0e96f3d

Francois Dugast authored May 12, 2023

Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

d0e96f3d

drm/xe: Load HuC on Alderlake P · 85635f5d

Lucas De Marchi authored May 12, 2023

Alderlake P uses TGL HuC and it was not added together with ADL-S,
because it was failing for unrelated reasons. Now that those are fixed,
allow it to load HuC.

	# cat /sys/kernel/debug/dri/0/gt0/uc/huc_info
	HuC firmware: i915/tgl_huc.bin
		status: RUNNING
		version: wanted 0.0, found 7.9
		uCode: 589504 bytes
		RSA: 256 bytes

	HuC status: 0x00090001
Reviewed-by: Anusha Srivatsa <anusha.srivatsa@intel.com>
Link: https://lore.kernel.org/r/20230512233649.3218736-1-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

85635f5d

drm/xe/adln: Enable ADL-N · 500f9062

Matt Roper authored Apr 19, 2023

ADL-N is pretty much the same as ADL-P (i.e., Xe_LP graphics + Xe_M
media + Xe_LPD display). However unlike ADL-P, there's no GuC hwconfig
support so the "tgl" GuC firmware should be loaded (i.e., the same
situation as ADL-S).
Acked-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Ravi Kumar Vodapalli <ravi.kumar.vodapalli@intel.com>
Link: https://lore.kernel.org/r/20230419213703.3993439-2-matthew.d.roper@intel.comSigned-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

500f9062

drm/xe/adlp: Add revid => step mapping · 5737f74e

Matt Roper authored Apr 19, 2023

Setup the mapping from PCI revid to IP stepping for ADL-P (and its RPL-P
subplatform) in case this information becomes important for implementing
workarounds.

Bspec: 55376
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20230419213703.3993439-1-matthew.d.roper@intel.comSigned-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

5737f74e

drm/xe: fix tlb_invalidation_seqno_past() · a2db3192

Matthew Auld authored May 05, 2023

Checking seqno_recv >= seqno looks like it will incorrectly report true
when the seqno has wrapped (not unlikely given
TLB_INVALIDATION_SEQNO_MAX). Calling xe_gt_tlb_invalidation_wait() might
then return before the flush has been completed by the GuC.

Fix this by treating a large negative delta as an indication that the
seqno has wrapped around. Similar to how we treat a large positive delta
as an indication that the seqno_recv must have wrapped around, but in
that case the seqno has likely also signalled.

It looks like we could also potentially make the seqno use the full
32bits as supported by the GuC.
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

a2db3192

drm/xe: Fix indent in xe_hw_engine_print_state() · 50f1f059

Lucas De Marchi authored May 08, 2023

Fix the indent to align with open parenthesis, following the coding
style.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230508225322.2692066-5-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

50f1f059

drm/xe: Rename reg field to addr · ee21379a

Lucas De Marchi authored May 08, 2023

Rename the address field to "addr" rather than "reg" so it's easier to
understand what it is.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230508225322.2692066-4-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

ee21379a

drm/xe/mmio: Use struct xe_reg · ce8bf5bd

Lucas De Marchi authored May 08, 2023

Convert all the callers to deal with xe_mmio_*() using struct xe_reg
instead of plain u32. In a few places there was also a rename
s/reg/reg_val/ when dealing with the value returned so it doesn't get
mixed up with the register address.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230508225322.2692066-2-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

ce8bf5bd

drm/xe: Handle -EDEADLK case in exec ioctl · 34f89ac8

Niranjana Vishwanathapura authored May 09, 2023

With multiple active VMs, under memory pressure, it is possible that
ttm_bo_validate() run into -EDEADLK in ttm_mem_evict_wait_busy() and
return -ENOMEM.

Until ttm properly handles locking in such scenarios, best thing the
driver can do is unwind the lock and retry.

Update xe_exec_begin to retry validating BOs with a timeout upon
-ENOMEM.
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

34f89ac8

drm/xe: Handle -EDEADLK case in preempt worker · 9ca14f94

Niranjana Vishwanathapura authored May 08, 2023

With multiple active VMs, under memory pressure, it is possible that
ttm_bo_validate() run into -EDEADLK in ttm_mem_evict_wait_busy() and
return -ENOMEM.

Until ttm properly handles locking in such scenarios, best thing the
driver can do is unwind the lock and retry.

Update preempt worker to retry validating BOs with a timeout upon
-ENOMEM.

v2: revert retry timeout upon -EAGAIN (Thomas)
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

9ca14f94

drm/xe: destroy clients engine and vm xarrays on close · e3e4964d

Mika Kuoppala authored Apr 12, 2023

xe_file_close cleanups the xarrays but forgets
to destroy them causing a memleak in xarray internals.
Found with kmemleak.
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Christoph Manszewski <christoph.manszewski@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

e3e4964d

drm/xe: Print GT info on TLB inv failure · a5cecbac

Nirmoy Das authored May 05, 2023

Print GT info on TLB inv failure for better debugbility.
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

a5cecbac

drm/xe: Do not sleep in atomic · a56d8dab

Nirmoy Das authored May 05, 2023

Set atomic in xe_mmio_wait32() otherwise we would
be scheduling in atomic context.

Fixes: 7dc9b92d ("drm/xe: Remove i915_utils dependency from xe_pcode.")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

a56d8dab

drm/xe/guc: Handle RCU_MODE as masked from definition · 98ce59e9

Lucas De Marchi authored Apr 28, 2023

guc_mmio_regset_write() had a flags for the registers to be added to the
GuC's regset list. The only register actually using that was RCU_MODE,
but it was setting the flags to a bogus value. From
struct xe_guc_fwif.h,

	#define GUC_REGSET_MASKED               BIT(0)
	#define GUC_REGSET_MASKED_WITH_VALUE    BIT(2)
	#define GUC_REGSET_RESTORE_ONLY         BIT(3)

Cross checking with i915, the only flag to set in RCU_MODE is
GUC_REGSET_MASKED. That can be done automatically from the register, as
long as the definition is correct.

Add the XE_REG_OPTION_MASKED annotation to RCU_MODE and kill the "flags"
field in guc_mmio_regset_write(): guc_mmio_regset_write_one() can decide
that based on the register being passed.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230429062332.354139-3-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

98ce59e9

drm/xe/guc: Remove special handling for PVC A* · a31153fc

Lucas De Marchi authored May 04, 2023

The rest of the driver doesn't really support PVC before B0 stepping.
Drop the special handling in xe_guc.c.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230504073250.1436293-4-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

a31153fc

drm/xe: Fix comment on Wa_22013088509 · 215bb2ce

Lucas De Marchi authored May 04, 2023

On i915 the "see comment about Wa_22013088509" referred to the comment
in the graphics version >= 11 branch, where there were more details
about it. From the platforms supported by xe, only PVC needs
Wa_22013088509, but as the comment says, it's simpler to do it for all
platforms as there is no downside. Bring the missing comment over from
i915 and reword it to fit xe better.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20230504073250.1436293-3-lucas.demarchi@intel.comSigned-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

215bb2ce