Commit 92242716 authored by Dave Airlie

Merge tag 'drm-habanalabs-next-2023-12-19' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next

This tag contains habanalabs driver changes for v6.8.

The notable changes are:

- uAPI changes:
  - Add sysfs entry to allow users to identify a device minor id with its
    debugfs path
  - Add sysfs entry to expose the device's module id as given to us from
    the f/w
  - Add signed device information retrieval through the INFO ioctl

- New features and improvements:
  - Update documentation of debugfs paths
  - Add support for Gaudi2C device (new PCI revision number)
  - Add pcie reset prepare/done hooks

- Firmware related fixes and changes:
  - Print version numbers for three instances of the Infineon second stage
  - Assume hard-reset is done by f/w upon PCIe AXI drain

- Bug fixes and code cleanups:
  - Fix information leak in sec_attest_info()
  - Avoid overriding existing undefined opcode data in Gaudi2
  - Multiple Queue Manager (QMAN) fixes for Gaudi2
  - Set hard reset flag if graceful reset is skipped
  - Remove 'get temperature' debug print
  - Fix the new Event Queue heartbeat mechanism
Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Oded Gabbay <ogabbay@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/ZYFpihZscr/fsRRd@ogabbay-vm-u22.habana-labs.com
parents dc83fb6e a9f07790
What: /sys/kernel/debug/accel/<n>/addr What: /sys/kernel/debug/accel/<parent_device>/addr
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -8,34 +8,34 @@ Description: Sets the device address to be used for read or write through ...@@ -8,34 +8,34 @@ Description: Sets the device address to be used for read or write through
only when the IOMMU is disabled. only when the IOMMU is disabled.
The acceptable value is a string that starts with "0x" The acceptable value is a string that starts with "0x"
What: /sys/kernel/debug/accel/<n>/clk_gate What: /sys/kernel/debug/accel/<parent_device>/clk_gate
Date: May 2020 Date: May 2020
KernelVersion: 5.8 KernelVersion: 5.8
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: This setting is now deprecated as clock gating is handled solely by the f/w Description: This setting is now deprecated as clock gating is handled solely by the f/w
What: /sys/kernel/debug/accel/<n>/command_buffers What: /sys/kernel/debug/accel/<parent_device>/command_buffers
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Displays a list with information about the currently allocated Description: Displays a list with information about the currently allocated
command buffers command buffers
What: /sys/kernel/debug/accel/<n>/command_submission What: /sys/kernel/debug/accel/<parent_device>/command_submission
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Displays a list with information about the currently active Description: Displays a list with information about the currently active
command submissions command submissions
What: /sys/kernel/debug/accel/<n>/command_submission_jobs What: /sys/kernel/debug/accel/<parent_device>/command_submission_jobs
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Displays a list with detailed information about each JOB (CB) of Description: Displays a list with detailed information about each JOB (CB) of
each active command submission each active command submission
What: /sys/kernel/debug/accel/<n>/data32 What: /sys/kernel/debug/accel/<parent_device>/data32
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -50,7 +50,7 @@ Description: Allows the root user to read or write directly through the ...@@ -50,7 +50,7 @@ Description: Allows the root user to read or write directly through the
If the IOMMU is disabled, it also allows the root user to read If the IOMMU is disabled, it also allows the root user to read
or write from the host a device VA of a host mapped memory or write from the host a device VA of a host mapped memory
What: /sys/kernel/debug/accel/<n>/data64 What: /sys/kernel/debug/accel/<parent_device>/data64
Date: Jan 2020 Date: Jan 2020
KernelVersion: 5.6 KernelVersion: 5.6
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -65,7 +65,7 @@ Description: Allows the root user to read or write 64 bit data directly ...@@ -65,7 +65,7 @@ Description: Allows the root user to read or write 64 bit data directly
If the IOMMU is disabled, it also allows the root user to read If the IOMMU is disabled, it also allows the root user to read
or write from the host a device VA of a host mapped memory or write from the host a device VA of a host mapped memory
What: /sys/kernel/debug/accel/<n>/data_dma What: /sys/kernel/debug/accel/<parent_device>/data_dma
Date: Apr 2021 Date: Apr 2021
KernelVersion: 5.13 KernelVersion: 5.13
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -83,7 +83,7 @@ Description: Allows the root user to read from the device's internal ...@@ -83,7 +83,7 @@ Description: Allows the root user to read from the device's internal
workloads. workloads.
Only supported on GAUDI at this stage. Only supported on GAUDI at this stage.
What: /sys/kernel/debug/accel/<n>/device What: /sys/kernel/debug/accel/<parent_device>/device
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -91,14 +91,14 @@ Description: Enables the root user to set the device to specific state. ...@@ -91,14 +91,14 @@ Description: Enables the root user to set the device to specific state.
Valid values are "disable", "enable", "suspend", "resume". Valid values are "disable", "enable", "suspend", "resume".
User can read this property to see the valid values User can read this property to see the valid values
What: /sys/kernel/debug/accel/<n>/device_release_watchdog_timeout What: /sys/kernel/debug/accel/<parent_device>/device_release_watchdog_timeout
Date: Oct 2022 Date: Oct 2022
KernelVersion: 6.2 KernelVersion: 6.2
Contact: ttayar@habana.ai Contact: ttayar@habana.ai
Description: The watchdog timeout value in seconds for a device release upon Description: The watchdog timeout value in seconds for a device release upon
certain error cases, after which the device is reset. certain error cases, after which the device is reset.
What: /sys/kernel/debug/accel/<n>/dma_size What: /sys/kernel/debug/accel/<parent_device>/dma_size
Date: Apr 2021 Date: Apr 2021
KernelVersion: 5.13 KernelVersion: 5.13
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -108,7 +108,7 @@ Description: Specify the size of the DMA transaction when using DMA to read ...@@ -108,7 +108,7 @@ Description: Specify the size of the DMA transaction when using DMA to read
When the write is finished, the user can read the "data_dma" When the write is finished, the user can read the "data_dma"
blob blob
What: /sys/kernel/debug/accel/<n>/dump_razwi_events What: /sys/kernel/debug/accel/<parent_device>/dump_razwi_events
Date: Aug 2022 Date: Aug 2022
KernelVersion: 5.20 KernelVersion: 5.20
Contact: fkassabri@habana.ai Contact: fkassabri@habana.ai
...@@ -117,7 +117,7 @@ Description: Dumps all razwi events to dmesg if exist. ...@@ -117,7 +117,7 @@ Description: Dumps all razwi events to dmesg if exist.
the routine will clear the status register. the routine will clear the status register.
Usage: cat dump_razwi_events Usage: cat dump_razwi_events
What: /sys/kernel/debug/accel/<n>/dump_security_violations What: /sys/kernel/debug/accel/<parent_device>/dump_security_violations
Date: Jan 2021 Date: Jan 2021
KernelVersion: 5.12 KernelVersion: 5.12
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -125,14 +125,14 @@ Description: Dumps all security violations to dmesg. This will also ack ...@@ -125,14 +125,14 @@ Description: Dumps all security violations to dmesg. This will also ack
all security violations meanings those violations will not be all security violations meanings those violations will not be
dumped next time user calls this API dumped next time user calls this API
What: /sys/kernel/debug/accel/<n>/engines What: /sys/kernel/debug/accel/<parent_device>/engines
Date: Jul 2019 Date: Jul 2019
KernelVersion: 5.3 KernelVersion: 5.3
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Displays the status registers values of the device engines and Description: Displays the status registers values of the device engines and
their derived idle status their derived idle status
What: /sys/kernel/debug/accel/<n>/i2c_addr What: /sys/kernel/debug/accel/<parent_device>/i2c_addr
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -140,7 +140,7 @@ Description: Sets I2C device address for I2C transaction that is generated ...@@ -140,7 +140,7 @@ Description: Sets I2C device address for I2C transaction that is generated
by the device's CPU, Not available when device is loaded with secured by the device's CPU, Not available when device is loaded with secured
firmware firmware
What: /sys/kernel/debug/accel/<n>/i2c_bus What: /sys/kernel/debug/accel/<parent_device>/i2c_bus
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -148,7 +148,7 @@ Description: Sets I2C bus address for I2C transaction that is generated by ...@@ -148,7 +148,7 @@ Description: Sets I2C bus address for I2C transaction that is generated by
the device's CPU, Not available when device is loaded with secured the device's CPU, Not available when device is loaded with secured
firmware firmware
What: /sys/kernel/debug/accel/<n>/i2c_data What: /sys/kernel/debug/accel/<parent_device>/i2c_data
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -157,7 +157,7 @@ Description: Triggers an I2C transaction that is generated by the device's ...@@ -157,7 +157,7 @@ Description: Triggers an I2C transaction that is generated by the device's
reading from the file generates a read transaction, Not available reading from the file generates a read transaction, Not available
when device is loaded with secured firmware when device is loaded with secured firmware
What: /sys/kernel/debug/accel/<n>/i2c_len What: /sys/kernel/debug/accel/<parent_device>/i2c_len
Date: Dec 2021 Date: Dec 2021
KernelVersion: 5.17 KernelVersion: 5.17
Contact: obitton@habana.ai Contact: obitton@habana.ai
...@@ -165,7 +165,7 @@ Description: Sets I2C length in bytes for I2C transaction that is generated b ...@@ -165,7 +165,7 @@ Description: Sets I2C length in bytes for I2C transaction that is generated b
the device's CPU, Not available when device is loaded with secured the device's CPU, Not available when device is loaded with secured
firmware firmware
What: /sys/kernel/debug/accel/<n>/i2c_reg What: /sys/kernel/debug/accel/<parent_device>/i2c_reg
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -173,35 +173,35 @@ Description: Sets I2C register id for I2C transaction that is generated by ...@@ -173,35 +173,35 @@ Description: Sets I2C register id for I2C transaction that is generated by
the device's CPU, Not available when device is loaded with secured the device's CPU, Not available when device is loaded with secured
firmware firmware
What: /sys/kernel/debug/accel/<n>/led0 What: /sys/kernel/debug/accel/<parent_device>/led0
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Sets the state of the first S/W led on the device, Not available Description: Sets the state of the first S/W led on the device, Not available
when device is loaded with secured firmware when device is loaded with secured firmware
What: /sys/kernel/debug/accel/<n>/led1 What: /sys/kernel/debug/accel/<parent_device>/led1
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Sets the state of the second S/W led on the device, Not available Description: Sets the state of the second S/W led on the device, Not available
when device is loaded with secured firmware when device is loaded with secured firmware
What: /sys/kernel/debug/accel/<n>/led2 What: /sys/kernel/debug/accel/<parent_device>/led2
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Sets the state of the third S/W led on the device, Not available Description: Sets the state of the third S/W led on the device, Not available
when device is loaded with secured firmware when device is loaded with secured firmware
What: /sys/kernel/debug/accel/<n>/memory_scrub What: /sys/kernel/debug/accel/<parent_device>/memory_scrub
Date: May 2022 Date: May 2022
KernelVersion: 5.19 KernelVersion: 5.19
Contact: dhirschfeld@habana.ai Contact: dhirschfeld@habana.ai
Description: Allows the root user to scrub the dram memory. The scrubbing Description: Allows the root user to scrub the dram memory. The scrubbing
value can be set using the debugfs file memory_scrub_val. value can be set using the debugfs file memory_scrub_val.
What: /sys/kernel/debug/accel/<n>/memory_scrub_val What: /sys/kernel/debug/accel/<parent_device>/memory_scrub_val
Date: May 2022 Date: May 2022
KernelVersion: 5.19 KernelVersion: 5.19
Contact: dhirschfeld@habana.ai Contact: dhirschfeld@habana.ai
...@@ -209,7 +209,7 @@ Description: The value to which the dram will be set to when the user ...@@ -209,7 +209,7 @@ Description: The value to which the dram will be set to when the user
scrubs the dram using 'memory_scrub' debugfs file and scrubs the dram using 'memory_scrub' debugfs file and
the scrubbing value when using module param 'memory_scrub' the scrubbing value when using module param 'memory_scrub'
What: /sys/kernel/debug/accel/<n>/mmu What: /sys/kernel/debug/accel/<parent_device>/mmu
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -219,7 +219,7 @@ Description: Displays the hop values and physical address for a given ASID ...@@ -219,7 +219,7 @@ Description: Displays the hop values and physical address for a given ASID
e.g. to display info about VA 0x1000 for ASID 1 you need to do: e.g. to display info about VA 0x1000 for ASID 1 you need to do:
echo "1 0x1000" > /sys/kernel/debug/accel/0/mmu echo "1 0x1000" > /sys/kernel/debug/accel/0/mmu
What: /sys/kernel/debug/accel/<n>/mmu_error What: /sys/kernel/debug/accel/<parent_device>/mmu_error
Date: Mar 2021 Date: Mar 2021
KernelVersion: 5.12 KernelVersion: 5.12
Contact: fkassabri@habana.ai Contact: fkassabri@habana.ai
...@@ -229,7 +229,7 @@ Description: Check and display page fault or access violation mmu errors for ...@@ -229,7 +229,7 @@ Description: Check and display page fault or access violation mmu errors for
echo "0x200" > /sys/kernel/debug/accel/0/mmu_error echo "0x200" > /sys/kernel/debug/accel/0/mmu_error
cat /sys/kernel/debug/accel/0/mmu_error cat /sys/kernel/debug/accel/0/mmu_error
What: /sys/kernel/debug/accel/<n>/monitor_dump What: /sys/kernel/debug/accel/<parent_device>/monitor_dump
Date: Mar 2022 Date: Mar 2022
KernelVersion: 5.19 KernelVersion: 5.19
Contact: osharabi@habana.ai Contact: osharabi@habana.ai
...@@ -243,7 +243,7 @@ Description: Allows the root user to dump monitors status from the device's ...@@ -243,7 +243,7 @@ Description: Allows the root user to dump monitors status from the device's
This interface doesn't support concurrency in the same device. This interface doesn't support concurrency in the same device.
Only supported on GAUDI. Only supported on GAUDI.
What: /sys/kernel/debug/accel/<n>/monitor_dump_trig What: /sys/kernel/debug/accel/<parent_device>/monitor_dump_trig
Date: Mar 2022 Date: Mar 2022
KernelVersion: 5.19 KernelVersion: 5.19
Contact: osharabi@habana.ai Contact: osharabi@habana.ai
...@@ -253,14 +253,14 @@ Description: Triggers dump of monitor data. The value to trigger the operatio ...@@ -253,14 +253,14 @@ Description: Triggers dump of monitor data. The value to trigger the operatio
When the write is finished, the user can read the "monitor_dump" When the write is finished, the user can read the "monitor_dump"
blob blob
What: /sys/kernel/debug/accel/<n>/set_power_state What: /sys/kernel/debug/accel/<parent_device>/set_power_state
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
Description: Sets the PCI power state. Valid values are "1" for D0 and "2" Description: Sets the PCI power state. Valid values are "1" for D0 and "2"
for D3Hot for D3Hot
What: /sys/kernel/debug/accel/<n>/skip_reset_on_timeout What: /sys/kernel/debug/accel/<parent_device>/skip_reset_on_timeout
Date: Jun 2021 Date: Jun 2021
KernelVersion: 5.13 KernelVersion: 5.13
Contact: ynudelman@habana.ai Contact: ynudelman@habana.ai
...@@ -268,7 +268,7 @@ Description: Sets the skip reset on timeout option for the device. Value of ...@@ -268,7 +268,7 @@ Description: Sets the skip reset on timeout option for the device. Value of
"0" means device will be reset in case some CS has timed out, "0" means device will be reset in case some CS has timed out,
otherwise it will not be reset. otherwise it will not be reset.
What: /sys/kernel/debug/accel/<n>/state_dump What: /sys/kernel/debug/accel/<parent_device>/state_dump
Date: Oct 2021 Date: Oct 2021
KernelVersion: 5.15 KernelVersion: 5.15
Contact: ynudelman@habana.ai Contact: ynudelman@habana.ai
...@@ -279,7 +279,7 @@ Description: Gets the state dump occurring on a CS timeout or failure. ...@@ -279,7 +279,7 @@ Description: Gets the state dump occurring on a CS timeout or failure.
Writing an integer X discards X state dumps, so that the Writing an integer X discards X state dumps, so that the
next read would return X+1-st newest state dump. next read would return X+1-st newest state dump.
What: /sys/kernel/debug/accel/<n>/stop_on_err What: /sys/kernel/debug/accel/<parent_device>/stop_on_err
Date: Mar 2020 Date: Mar 2020
KernelVersion: 5.6 KernelVersion: 5.6
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -287,13 +287,13 @@ Description: Sets the stop-on_error option for the device engines. Value of ...@@ -287,13 +287,13 @@ Description: Sets the stop-on_error option for the device engines. Value of
"0" is for disable, otherwise enable. "0" is for disable, otherwise enable.
Relevant only for GOYA and GAUDI. Relevant only for GOYA and GAUDI.
What: /sys/kernel/debug/accel/<n>/timeout_locked What: /sys/kernel/debug/accel/<parent_device>/timeout_locked
Date: Sep 2021 Date: Sep 2021
KernelVersion: 5.16 KernelVersion: 5.16
Contact: obitton@habana.ai Contact: obitton@habana.ai
Description: Sets the command submission timeout value in seconds. Description: Sets the command submission timeout value in seconds.
What: /sys/kernel/debug/accel/<n>/userptr What: /sys/kernel/debug/accel/<parent_device>/userptr
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -301,7 +301,7 @@ Description: Displays a list with information about the current user ...@@ -301,7 +301,7 @@ Description: Displays a list with information about the current user
pointers (user virtual addresses) that are pinned and mapped pointers (user virtual addresses) that are pinned and mapped
to DMA addresses to DMA addresses
What: /sys/kernel/debug/accel/<n>/userptr_lookup What: /sys/kernel/debug/accel/<parent_device>/userptr_lookup
Date: Oct 2021 Date: Oct 2021
KernelVersion: 5.15 KernelVersion: 5.15
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
...@@ -309,7 +309,7 @@ Description: Allows to search for specific user pointers (user virtual ...@@ -309,7 +309,7 @@ Description: Allows to search for specific user pointers (user virtual
addresses) that are pinned and mapped to DMA addresses, and see addresses) that are pinned and mapped to DMA addresses, and see
their resolution to the specific dma address. their resolution to the specific dma address.
What: /sys/kernel/debug/accel/<n>/vm What: /sys/kernel/debug/accel/<parent_device>/vm
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
Contact: ogabbay@kernel.org Contact: ogabbay@kernel.org
......
...@@ -149,6 +149,18 @@ Contact: ogabbay@kernel.org ...@@ -149,6 +149,18 @@ Contact: ogabbay@kernel.org
Description: Displays the current clock frequency, in Hz, of the MME compute Description: Displays the current clock frequency, in Hz, of the MME compute
engine. This property is valid only for the Goya ASIC family engine. This property is valid only for the Goya ASIC family
What: /sys/class/accel/accel<n>/device/module_id
Date: Nov 2023
KernelVersion: not yet upstreamed
Contact: ogabbay@kernel.org
Description: Displays the device's module id
What: /sys/class/accel/accel<n>/device/parent_device
Date: Nov 2023
KernelVersion: 6.8
Contact: ttayar@habana.ai
Description: Displays the name of the parent device of the accel device
What: /sys/class/accel/accel<n>/device/pci_addr What: /sys/class/accel/accel<n>/device/pci_addr
Date: Jan 2019 Date: Jan 2019
KernelVersion: 5.1 KernelVersion: 5.1
......
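The parent_device sysfs attribute documented above is what lets a user tie an accel minor to its debugfs directory. Below is a minimal userspace sketch of that lookup. It assumes the attribute's content is exactly the directory name used under /sys/kernel/debug/accel/. The helper names build_debugfs_path and read_parent_device are hypothetical and are not part of the driver.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: compose
 * "/sys/kernel/debug/accel/<parent_device>/<entry>" into buf.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int build_debugfs_path(char *buf, size_t len,
		       const char *parent_device, const char *entry)
{
	int n = snprintf(buf, len, "/sys/kernel/debug/accel/%s/%s",
			 parent_device, entry);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: read the parent_device attribute of accel<minor>
 * and strip the trailing newline. Returns 0 on success, -1 on failure
 * (for example when the attribute does not exist on this kernel).
 */
int read_parent_device(unsigned int minor, char *name, size_t len)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/accel/accel%u/device/parent_device", minor);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(name, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	name[strcspn(name, "\n")] = '\0';
	return 0;
}
```

A caller would first run read_parent_device() for the minor it opened, then pass the returned name to build_debugfs_path() together with an entry such as "engines". Reading debugfs normally requires root.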
...@@ -853,6 +853,9 @@ static int device_early_init(struct hl_device *hdev) ...@@ -853,6 +853,9 @@ static int device_early_init(struct hl_device *hdev)
gaudi2_set_asic_funcs(hdev); gaudi2_set_asic_funcs(hdev);
strscpy(hdev->asic_name, "GAUDI2B", sizeof(hdev->asic_name)); strscpy(hdev->asic_name, "GAUDI2B", sizeof(hdev->asic_name));
break; break;
case ASIC_GAUDI2C:
gaudi2_set_asic_funcs(hdev);
strscpy(hdev->asic_name, "GAUDI2C", sizeof(hdev->asic_name));
break; break;
default: default:
dev_err(hdev->dev, "Unrecognized ASIC type %d\n", dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
...@@ -1041,18 +1044,21 @@ static bool is_pci_link_healthy(struct hl_device *hdev) ...@@ -1041,18 +1044,21 @@ static bool is_pci_link_healthy(struct hl_device *hdev)
return (vendor_id == PCI_VENDOR_ID_HABANALABS); return (vendor_id == PCI_VENDOR_ID_HABANALABS);
} }
static void hl_device_eq_heartbeat(struct hl_device *hdev) static int hl_device_eq_heartbeat_check(struct hl_device *hdev)
{ {
u64 event_mask = HL_NOTIFIER_EVENT_DEVICE_RESET | HL_NOTIFIER_EVENT_DEVICE_UNAVAILABLE;
struct asic_fixed_properties *prop = &hdev->asic_prop; struct asic_fixed_properties *prop = &hdev->asic_prop;
if (!prop->cpucp_info.eq_health_check_supported) if (!prop->cpucp_info.eq_health_check_supported)
return; return 0;
if (hdev->eq_heartbeat_received) if (hdev->eq_heartbeat_received) {
hdev->eq_heartbeat_received = false; hdev->eq_heartbeat_received = false;
else } else {
hl_device_cond_reset(hdev, HL_DRV_RESET_HARD, event_mask); dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
return -EIO;
}
return 0;
} }
static void hl_device_heartbeat(struct work_struct *work) static void hl_device_heartbeat(struct work_struct *work)
...@@ -1069,10 +1075,9 @@ static void hl_device_heartbeat(struct work_struct *work) ...@@ -1069,10 +1075,9 @@ static void hl_device_heartbeat(struct work_struct *work)
/* /*
* For EQ health check need to check if driver received the heartbeat eq event * For EQ health check need to check if driver received the heartbeat eq event
* in order to validate the eq is working. * in order to validate the eq is working.
* Only if both the EQ is healthy and we managed to send the next heartbeat reschedule.
*/ */
hl_device_eq_heartbeat(hdev); if ((!hl_device_eq_heartbeat_check(hdev)) && (!hdev->asic_funcs->send_heartbeat(hdev)))
if (!hdev->asic_funcs->send_heartbeat(hdev))
goto reschedule; goto reschedule;
if (hl_device_operational(hdev, NULL)) if (hl_device_operational(hdev, NULL))
...@@ -2035,7 +2040,7 @@ int hl_device_cond_reset(struct hl_device *hdev, u32 flags, u64 event_mask) ...@@ -2035,7 +2040,7 @@ int hl_device_cond_reset(struct hl_device *hdev, u32 flags, u64 event_mask)
if (ctx) if (ctx)
hl_ctx_put(ctx); hl_ctx_put(ctx);
return hl_device_reset(hdev, flags); return hl_device_reset(hdev, flags | HL_DRV_RESET_HARD);
} }
static void hl_notifier_event_send(struct hl_notifier_event *notifier_event, u64 event_mask) static void hl_notifier_event_send(struct hl_notifier_event *notifier_event, u64 event_mask)
......
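The heartbeat change above makes the EQ check return a status instead of resetting the device itself, and the work function reschedules only when both the EQ check and the next heartbeat send succeed. The control flow can be modelled in plain C as a rough sketch. struct fake_dev, heartbeat_should_reschedule, send_ok and send_fail are illustrative stand-ins, not driver symbols, and the reset path taken on failure is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-in for the few hl_device fields the check uses. */
struct fake_dev {
	bool eq_health_check_supported;
	bool eq_heartbeat_received;
	int (*send_heartbeat)(struct fake_dev *dev);
};

/*
 * Mirrors the shape of the new check: 0 when the EQ is considered
 * healthy (or the f/w does not support the check), -EIO when no EQ
 * heartbeat event arrived since the previous tick.
 */
int eq_heartbeat_check(struct fake_dev *dev)
{
	if (!dev->eq_health_check_supported)
		return 0;

	if (!dev->eq_heartbeat_received)
		return -EIO;

	/* Consume the event so the next tick needs a fresh one. */
	dev->eq_heartbeat_received = false;
	return 0;
}

/*
 * Reschedule only if the EQ is healthy AND the next heartbeat was
 * sent. The && short-circuits, so no heartbeat is sent after a
 * failed EQ check.
 */
bool heartbeat_should_reschedule(struct fake_dev *dev)
{
	return !eq_heartbeat_check(dev) && !dev->send_heartbeat(dev);
}

/* Stub senders for exercising the logic. */
int send_ok(struct fake_dev *dev)
{
	(void)dev;
	return 0;
}

int send_fail(struct fake_dev *dev)
{
	(void)dev;
	return -EIO;
}
```

In this model, a device that supports the check and has received an event reschedules once; a second tick without a new event does not.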
...@@ -646,39 +646,27 @@ int hl_fw_send_heartbeat(struct hl_device *hdev) ...@@ -646,39 +646,27 @@ int hl_fw_send_heartbeat(struct hl_device *hdev)
return rc; return rc;
} }
static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val, static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val, u32 sts_val)
u32 sts_val)
{ {
bool err_exists = false; bool err_exists = false;
if (!(err_val & CPU_BOOT_ERR0_ENABLED)) if (!(err_val & CPU_BOOT_ERR0_ENABLED))
return false; return false;
if (err_val & CPU_BOOT_ERR0_DRAM_INIT_FAIL) { if (err_val & CPU_BOOT_ERR0_DRAM_INIT_FAIL)
dev_err(hdev->dev, dev_err(hdev->dev, "Device boot error - DRAM initialization failed\n");
"Device boot error - DRAM initialization failed\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_FIT_CORRUPTED) { if (err_val & CPU_BOOT_ERR0_FIT_CORRUPTED)
dev_err(hdev->dev, "Device boot error - FIT image corrupted\n"); dev_err(hdev->dev, "Device boot error - FIT image corrupted\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_TS_INIT_FAIL) { if (err_val & CPU_BOOT_ERR0_TS_INIT_FAIL)
dev_err(hdev->dev, dev_err(hdev->dev, "Device boot error - Thermal Sensor initialization failed\n");
"Device boot error - Thermal Sensor initialization failed\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_BMC_WAIT_SKIPPED) { if (err_val & CPU_BOOT_ERR0_BMC_WAIT_SKIPPED) {
if (hdev->bmc_enable) { if (hdev->bmc_enable) {
dev_err(hdev->dev, dev_err(hdev->dev, "Device boot error - Skipped waiting for BMC\n");
"Device boot error - Skipped waiting for BMC\n");
err_exists = true;
} else { } else {
dev_info(hdev->dev, dev_info(hdev->dev, "Device boot message - Skipped waiting for BMC\n");
"Device boot message - Skipped waiting for BMC\n");
/* This is an info so we don't want it to disable the /* This is an info so we don't want it to disable the
* device * device
*/ */
...@@ -686,48 +674,29 @@ static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val, ...@@ -686,48 +674,29 @@ static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val,
} }
} }
if (err_val & CPU_BOOT_ERR0_NIC_DATA_NOT_RDY) { if (err_val & CPU_BOOT_ERR0_NIC_DATA_NOT_RDY)
dev_err(hdev->dev, dev_err(hdev->dev, "Device boot error - Serdes data from BMC not available\n");
"Device boot error - Serdes data from BMC not available\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_NIC_FW_FAIL) { if (err_val & CPU_BOOT_ERR0_NIC_FW_FAIL)
dev_err(hdev->dev, dev_err(hdev->dev, "Device boot error - NIC F/W initialization failed\n");
"Device boot error - NIC F/W initialization failed\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_SECURITY_NOT_RDY) { if (err_val & CPU_BOOT_ERR0_SECURITY_NOT_RDY)
dev_err(hdev->dev, dev_err(hdev->dev, "Device boot warning - security not ready\n");
"Device boot warning - security not ready\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_SECURITY_FAIL) { if (err_val & CPU_BOOT_ERR0_SECURITY_FAIL)
dev_err(hdev->dev, "Device boot error - security failure\n"); dev_err(hdev->dev, "Device boot error - security failure\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_EFUSE_FAIL) { if (err_val & CPU_BOOT_ERR0_EFUSE_FAIL)
dev_err(hdev->dev, "Device boot error - eFuse failure\n"); dev_err(hdev->dev, "Device boot error - eFuse failure\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_SEC_IMG_VER_FAIL) { if (err_val & CPU_BOOT_ERR0_SEC_IMG_VER_FAIL)
dev_err(hdev->dev, "Device boot error - Failed to load preboot secondary image\n"); dev_err(hdev->dev, "Device boot error - Failed to load preboot secondary image\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_PLL_FAIL) { if (err_val & CPU_BOOT_ERR0_PLL_FAIL)
dev_err(hdev->dev, "Device boot error - PLL failure\n"); dev_err(hdev->dev, "Device boot error - PLL failure\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_TMP_THRESH_INIT_FAIL) { if (err_val & CPU_BOOT_ERR0_TMP_THRESH_INIT_FAIL)
dev_err(hdev->dev, "Device boot error - Failed to set threshold for temperature sensor\n"); dev_err(hdev->dev, "Device boot error - Failed to set threshold for temperature sensor\n");
err_exists = true;
}
if (err_val & CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL) { if (err_val & CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL) {
/* Ignore this bit, don't prevent driver loading */ /* Ignore this bit, don't prevent driver loading */
...@@ -735,52 +704,32 @@ static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val, ...@@ -735,52 +704,32 @@ static bool fw_report_boot_dev0(struct hl_device *hdev, u32 err_val,
err_val &= ~CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL; err_val &= ~CPU_BOOT_ERR0_DEVICE_UNUSABLE_FAIL;
} }
if (err_val & CPU_BOOT_ERR0_BINNING_FAIL) { if (err_val & CPU_BOOT_ERR0_BINNING_FAIL)
dev_err(hdev->dev, "Device boot error - binning failure\n"); dev_err(hdev->dev, "Device boot error - binning failure\n");
err_exists = true;
}
if (sts_val & CPU_BOOT_DEV_STS0_ENABLED) if (sts_val & CPU_BOOT_DEV_STS0_ENABLED)
dev_dbg(hdev->dev, "Device status0 %#x\n", sts_val); dev_dbg(hdev->dev, "Device status0 %#x\n", sts_val);
if (err_val & CPU_BOOT_ERR0_DRAM_SKIPPED)
dev_err(hdev->dev, "Device boot warning - Skipped DRAM initialization\n");
if (err_val & CPU_BOOT_ERR_ENG_ARC_MEM_SCRUB_FAIL)
dev_err(hdev->dev, "Device boot error - ARC memory scrub failed\n");
/* All warnings should go here in order not to reach the unknown error validation */
if (err_val & CPU_BOOT_ERR0_EEPROM_FAIL) { if (err_val & CPU_BOOT_ERR0_EEPROM_FAIL) {
dev_err(hdev->dev, "Device boot error - EEPROM failure detected\n"); dev_err(hdev->dev, "Device boot error - EEPROM failure detected\n");
err_exists = true; err_exists = true;
} }
/* All warnings should go here in order not to reach the unknown error validation */ if (err_val & CPU_BOOT_ERR0_PRI_IMG_VER_FAIL)
if (err_val & CPU_BOOT_ERR0_DRAM_SKIPPED) { dev_warn(hdev->dev, "Device boot warning - Failed to load preboot primary image\n");
dev_warn(hdev->dev,
"Device boot warning - Skipped DRAM initialization\n");
/* This is a warning so we don't want it to disable the
* device
*/
err_val &= ~CPU_BOOT_ERR0_DRAM_SKIPPED;
}
if (err_val & CPU_BOOT_ERR0_PRI_IMG_VER_FAIL) { if (err_val & CPU_BOOT_ERR0_TPM_FAIL)
dev_warn(hdev->dev, dev_warn(hdev->dev, "Device boot warning - TPM failure\n");
"Device boot warning - Failed to load preboot primary image\n");
/* This is a warning so we don't want it to disable the
* device as we have a secondary preboot image
*/
err_val &= ~CPU_BOOT_ERR0_PRI_IMG_VER_FAIL;
}
if (err_val & CPU_BOOT_ERR0_TPM_FAIL) {
dev_warn(hdev->dev,
"Device boot warning - TPM failure\n");
/* This is a warning so we don't want it to disable the
* device
*/
err_val &= ~CPU_BOOT_ERR0_TPM_FAIL;
}
if (!err_exists && (err_val & ~CPU_BOOT_ERR0_ENABLED)) { if (err_val & CPU_BOOT_ERR_FATAL_MASK)
dev_err(hdev->dev,
"Device boot error - unknown ERR0 error 0x%08x\n", err_val);
err_exists = true; err_exists = true;
}
/* return error only if it's in the predefined mask */ /* return error only if it's in the predefined mask */
if (err_exists && ((err_val & ~CPU_BOOT_ERR0_ENABLED) & if (err_exists && ((err_val & ~CPU_BOOT_ERR0_ENABLED) &
...@@ -3295,6 +3244,14 @@ int hl_fw_get_sec_attest_info(struct hl_device *hdev, struct cpucp_sec_attest_in ...@@ -3295,6 +3244,14 @@ int hl_fw_get_sec_attest_info(struct hl_device *hdev, struct cpucp_sec_attest_in
HL_CPUCP_SEC_ATTEST_INFO_TINEOUT_USEC); HL_CPUCP_SEC_ATTEST_INFO_TINEOUT_USEC);
} }
int hl_fw_get_dev_info_signed(struct hl_device *hdev,
struct cpucp_dev_info_signed *dev_info_signed, u32 nonce)
{
return hl_fw_get_sec_attest_data(hdev, CPUCP_PACKET_INFO_SIGNED_GET, dev_info_signed,
sizeof(struct cpucp_dev_info_signed), nonce,
HL_CPUCP_SEC_ATTEST_INFO_TINEOUT_USEC);
}
int hl_fw_send_generic_request(struct hl_device *hdev, enum hl_passthrough_type sub_opcode, int hl_fw_send_generic_request(struct hl_device *hdev, enum hl_passthrough_type sub_opcode,
dma_addr_t buff, u32 *size) dma_addr_t buff, u32 *size)
{ {
......
...@@ -1262,6 +1262,7 @@ struct hl_dec { ...@@ -1262,6 +1262,7 @@ struct hl_dec {
* @ASIC_GAUDI_SEC: Gaudi secured device (HL-2000). * @ASIC_GAUDI_SEC: Gaudi secured device (HL-2000).
* @ASIC_GAUDI2: Gaudi2 device. * @ASIC_GAUDI2: Gaudi2 device.
* @ASIC_GAUDI2B: Gaudi2B device. * @ASIC_GAUDI2B: Gaudi2B device.
* @ASIC_GAUDI2C: Gaudi2C device.
*/ */
enum hl_asic_type { enum hl_asic_type {
ASIC_INVALID, ASIC_INVALID,
...@@ -1270,6 +1271,7 @@ enum hl_asic_type { ...@@ -1270,6 +1271,7 @@ enum hl_asic_type {
ASIC_GAUDI_SEC, ASIC_GAUDI_SEC,
ASIC_GAUDI2, ASIC_GAUDI2,
ASIC_GAUDI2B, ASIC_GAUDI2B,
ASIC_GAUDI2C,
}; };
struct hl_cs_parser; struct hl_cs_parser;
...@@ -3519,6 +3521,9 @@ struct hl_device { ...@@ -3519,6 +3521,9 @@ struct hl_device {
u8 heartbeat; u8 heartbeat;
}; };
/* Retrieve PCI device name in case of a PCI device or dev name in simulator */
#define HL_DEV_NAME(hdev) \
((hdev)->pdev ? dev_name(&(hdev)->pdev->dev) : "NA-DEVICE")
/** /**
* struct hl_cs_encaps_sig_handle - encapsulated signals handle structure * struct hl_cs_encaps_sig_handle - encapsulated signals handle structure
...@@ -3594,6 +3599,14 @@ static inline bool hl_is_fw_sw_ver_below(struct hl_device *hdev, u32 fw_sw_major ...@@ -3594,6 +3599,14 @@ static inline bool hl_is_fw_sw_ver_below(struct hl_device *hdev, u32 fw_sw_major
return false; return false;
} }
static inline bool hl_is_fw_sw_ver_equal_or_greater(struct hl_device *hdev, u32 fw_sw_major,
u32 fw_sw_minor)
{
return (hdev->fw_sw_major_ver > fw_sw_major ||
(hdev->fw_sw_major_ver == fw_sw_major &&
hdev->fw_sw_minor_ver >= fw_sw_minor));
}
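
The helper added above compares a (major, minor) pair lexicographically. A minimal user-space sketch of the same comparison follows; `struct fw_sw_version` and the function names are hypothetical stand-ins for the `fw_sw_major_ver` / `fw_sw_minor_ver` fields of `struct hl_device`, not driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two firmware version fields of struct hl_device. */
struct fw_sw_version {
	uint32_t major;
	uint32_t minor;
};

/* True when the running version is at or above (want_major, want_minor). */
bool fw_sw_ver_equal_or_greater(const struct fw_sw_version *ver,
				uint32_t want_major, uint32_t want_minor)
{
	return ver->major > want_major ||
	       (ver->major == want_major && ver->minor >= want_minor);
}

/* Small self-check showing how the comparison behaves around a 1.13 threshold. */
void fw_sw_ver_self_check(void)
{
	struct fw_sw_version v1_12 = { .major = 1, .minor = 12 };
	struct fw_sw_version v1_13 = { .major = 1, .minor = 13 };
	struct fw_sw_version v2_0 = { .major = 2, .minor = 0 };

	assert(!fw_sw_ver_equal_or_greater(&v1_12, 1, 13));
	assert(fw_sw_ver_equal_or_greater(&v1_13, 1, 13));
	/* A higher major version passes regardless of the minor number. */
	assert(fw_sw_ver_equal_or_greater(&v2_0, 1, 13));
}
```

The 1.13 threshold is the one this diff passes to the helper in `gaudi2_handle_eqe()`; the other versions are illustrative values only.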
/* /*
* Kernel module functions that can be accessed by entire module * Kernel module functions that can be accessed by entire module
*/ */
...@@ -3954,6 +3967,8 @@ long hl_fw_get_max_power(struct hl_device *hdev); ...@@ -3954,6 +3967,8 @@ long hl_fw_get_max_power(struct hl_device *hdev);
void hl_fw_set_max_power(struct hl_device *hdev); void hl_fw_set_max_power(struct hl_device *hdev);
int hl_fw_get_sec_attest_info(struct hl_device *hdev, struct cpucp_sec_attest_info *sec_attest_info, int hl_fw_get_sec_attest_info(struct hl_device *hdev, struct cpucp_sec_attest_info *sec_attest_info,
u32 nonce); u32 nonce);
int hl_fw_get_dev_info_signed(struct hl_device *hdev,
struct cpucp_dev_info_signed *dev_info_signed, u32 nonce);
int hl_set_voltage(struct hl_device *hdev, int sensor_index, u32 attr, long value); int hl_set_voltage(struct hl_device *hdev, int sensor_index, u32 attr, long value);
int hl_set_current(struct hl_device *hdev, int sensor_index, u32 attr, long value); int hl_set_current(struct hl_device *hdev, int sensor_index, u32 attr, long value);
int hl_set_power(struct hl_device *hdev, int sensor_index, u32 attr, long value); int hl_set_power(struct hl_device *hdev, int sensor_index, u32 attr, long value);
......
...@@ -141,6 +141,9 @@ static enum hl_asic_type get_asic_type(struct hl_device *hdev) ...@@ -141,6 +141,9 @@ static enum hl_asic_type get_asic_type(struct hl_device *hdev)
case REV_ID_B: case REV_ID_B:
asic_type = ASIC_GAUDI2B; asic_type = ASIC_GAUDI2B;
break; break;
case REV_ID_C:
asic_type = ASIC_GAUDI2C;
break;
default: default:
break; break;
} }
...@@ -670,6 +673,38 @@ static pci_ers_result_t hl_pci_err_slot_reset(struct pci_dev *pdev) ...@@ -670,6 +673,38 @@ static pci_ers_result_t hl_pci_err_slot_reset(struct pci_dev *pdev)
return PCI_ERS_RESULT_RECOVERED; return PCI_ERS_RESULT_RECOVERED;
} }
static void hl_pci_reset_prepare(struct pci_dev *pdev)
{
struct hl_device *hdev;
hdev = pci_get_drvdata(pdev);
if (!hdev)
return;
hdev->disabled = true;
}
static void hl_pci_reset_done(struct pci_dev *pdev)
{
struct hl_device *hdev;
u32 flags;
hdev = pci_get_drvdata(pdev);
if (!hdev)
return;
/*
* Schedule a thread to trigger hard reset.
* This handler exists for the rare case where the driver is up and an FLR
* occurs. That is valid only when working with no VM, where the FW handles
* the FLR and resets the device. The FW then goes back to the preboot stage,
* so the driver needs to perform a hard reset in order to load the FW fit again.
*/
flags = HL_DRV_RESET_HARD | HL_DRV_RESET_BYPASS_REQ_TO_FW;
hl_device_reset(hdev, flags);
}
static const struct dev_pm_ops hl_pm_ops = { static const struct dev_pm_ops hl_pm_ops = {
.suspend = hl_pmops_suspend, .suspend = hl_pmops_suspend,
.resume = hl_pmops_resume, .resume = hl_pmops_resume,
...@@ -679,6 +714,8 @@ static const struct pci_error_handlers hl_pci_err_handler = { ...@@ -679,6 +714,8 @@ static const struct pci_error_handlers hl_pci_err_handler = {
.error_detected = hl_pci_err_detected, .error_detected = hl_pci_err_detected,
.slot_reset = hl_pci_err_slot_reset, .slot_reset = hl_pci_err_slot_reset,
.resume = hl_pci_err_resume, .resume = hl_pci_err_resume,
.reset_prepare = hl_pci_reset_prepare,
.reset_done = hl_pci_reset_done,
}; };
static struct pci_driver hl_pci_driver = { static struct pci_driver hl_pci_driver = {
......
...@@ -19,6 +19,9 @@ ...@@ -19,6 +19,9 @@
#include <asm/msr.h> #include <asm/msr.h>
/* make sure there is space for all the signed info */
static_assert(sizeof(struct cpucp_info) <= SEC_DEV_INFO_BUF_SZ);
static u32 hl_debug_struct_size[HL_DEBUG_OP_TIMESTAMP + 1] = { static u32 hl_debug_struct_size[HL_DEBUG_OP_TIMESTAMP + 1] = {
[HL_DEBUG_OP_ETR] = sizeof(struct hl_debug_params_etr), [HL_DEBUG_OP_ETR] = sizeof(struct hl_debug_params_etr),
[HL_DEBUG_OP_ETF] = sizeof(struct hl_debug_params_etf), [HL_DEBUG_OP_ETF] = sizeof(struct hl_debug_params_etf),
...@@ -685,7 +688,7 @@ static int sec_attest_info(struct hl_fpriv *hpriv, struct hl_info_args *args) ...@@ -685,7 +688,7 @@ static int sec_attest_info(struct hl_fpriv *hpriv, struct hl_info_args *args)
if (!sec_attest_info) if (!sec_attest_info)
return -ENOMEM; return -ENOMEM;
info = kmalloc(sizeof(*info), GFP_KERNEL); info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info) { if (!info) {
rc = -ENOMEM; rc = -ENOMEM;
goto free_sec_attest_info; goto free_sec_attest_info;
...@@ -719,6 +722,53 @@ static int sec_attest_info(struct hl_fpriv *hpriv, struct hl_info_args *args) ...@@ -719,6 +722,53 @@ static int sec_attest_info(struct hl_fpriv *hpriv, struct hl_info_args *args)
return rc; return rc;
} }
static int dev_info_signed(struct hl_fpriv *hpriv, struct hl_info_args *args)
{
void __user *out = (void __user *) (uintptr_t) args->return_pointer;
struct cpucp_dev_info_signed *dev_info_signed;
struct hl_info_signed *info;
u32 max_size = args->return_size;
int rc;
if ((!max_size) || (!out))
return -EINVAL;
dev_info_signed = kzalloc(sizeof(*dev_info_signed), GFP_KERNEL);
if (!dev_info_signed)
return -ENOMEM;
info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info) {
rc = -ENOMEM;
goto free_dev_info_signed;
}
rc = hl_fw_get_dev_info_signed(hpriv->hdev,
dev_info_signed, args->sec_attest_nonce);
if (rc)
goto free_info;
info->nonce = le32_to_cpu(dev_info_signed->nonce);
info->info_sig_len = dev_info_signed->info_sig_len;
info->pub_data_len = le16_to_cpu(dev_info_signed->pub_data_len);
info->certificate_len = le16_to_cpu(dev_info_signed->certificate_len);
info->dev_info_len = sizeof(struct cpucp_info);
memcpy(&info->info_sig, &dev_info_signed->info_sig, sizeof(info->info_sig));
memcpy(&info->public_data, &dev_info_signed->public_data, sizeof(info->public_data));
memcpy(&info->certificate, &dev_info_signed->certificate, sizeof(info->certificate));
memcpy(&info->dev_info, &dev_info_signed->info, info->dev_info_len);
rc = copy_to_user(out, info, min_t(size_t, max_size, sizeof(*info))) ? -EFAULT : 0;
free_info:
kfree(info);
free_dev_info_signed:
kfree(dev_info_signed);
return rc;
}
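
The function above combines two ideas that also appear in the `sec_attest_info()` fix: the reply buffer is zero-initialised before it is filled, and the copy to the caller is bounded by the smaller of the caller's buffer size and the structure size. A rough user-space sketch of that pattern, assuming a hypothetical `struct reply` and using `calloc()`/`memcpy()` in place of `kzalloc()`/`copy_to_user()`:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reply structure; only `nonce` is filled in by the producer. */
struct reply {
	uint32_t nonce;
	uint8_t payload[32];
};

/*
 * Copy at most `max_size` bytes of a zero-initialised reply into `out`.
 * Returns the number of bytes copied, or 0 on invalid arguments or
 * allocation failure.
 */
size_t fill_reply(void *out, size_t max_size, uint32_t nonce)
{
	struct reply *r;
	size_t n;

	if (!out || !max_size)
		return 0;

	r = calloc(1, sizeof(*r));	/* user-space analogue of kzalloc() */
	if (!r)
		return 0;

	r->nonce = nonce;

	/* Never copy more than the caller asked for or than the struct holds. */
	n = max_size < sizeof(*r) ? max_size : sizeof(*r);
	assert(n <= sizeof(*r));
	memcpy(out, r, n);		/* stands in for copy_to_user() */

	free(r);
	return n;
}
```

Because the allocation is zeroed, any field the producer did not write reaches the caller as zeros instead of stale heap contents.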
static int eventfd_register(struct hl_fpriv *hpriv, struct hl_info_args *args) static int eventfd_register(struct hl_fpriv *hpriv, struct hl_info_args *args)
{ {
int rc; int rc;
...@@ -1089,6 +1139,9 @@ static int _hl_info_ioctl(struct hl_fpriv *hpriv, void *data, ...@@ -1089,6 +1139,9 @@ static int _hl_info_ioctl(struct hl_fpriv *hpriv, void *data,
case HL_INFO_FW_GENERIC_REQ: case HL_INFO_FW_GENERIC_REQ:
return send_fw_generic_request(hdev, args); return send_fw_generic_request(hdev, args);
case HL_INFO_DEV_SIGNED:
return dev_info_signed(hpriv, args);
default: default:
dev_err(dev, "Invalid request %d\n", args->op); dev_err(dev, "Invalid request %d\n", args->op);
rc = -EINVAL; rc = -EINVAL;
......
...@@ -578,10 +578,6 @@ int hl_get_temperature(struct hl_device *hdev, ...@@ -578,10 +578,6 @@ int hl_get_temperature(struct hl_device *hdev,
CPUCP_PKT_CTL_OPCODE_SHIFT); CPUCP_PKT_CTL_OPCODE_SHIFT);
pkt.sensor_index = __cpu_to_le16(sensor_index); pkt.sensor_index = __cpu_to_le16(sensor_index);
pkt.type = __cpu_to_le16(attr); pkt.type = __cpu_to_le16(attr);
dev_dbg(hdev->dev, "get temp, ctl 0x%x, sensor %d, type %d\n",
pkt.ctl, pkt.sensor_index, pkt.type);
rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt), rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
0, &result); 0, &result);
......
...@@ -955,8 +955,8 @@ static int map_phys_pg_pack(struct hl_ctx *ctx, u64 vaddr, ...@@ -955,8 +955,8 @@ static int map_phys_pg_pack(struct hl_ctx *ctx, u64 vaddr,
(i + 1) == phys_pg_pack->npages); (i + 1) == phys_pg_pack->npages);
if (rc) { if (rc) {
dev_err(hdev->dev, dev_err(hdev->dev,
"map failed for handle %u, npages: %llu, mapped: %llu", "map failed (%d) for handle %u, npages: %llu, mapped: %llu\n",
phys_pg_pack->handle, phys_pg_pack->npages, rc, phys_pg_pack->handle, phys_pg_pack->npages,
mapped_pg_cnt); mapped_pg_cnt);
goto err; goto err;
} }
...@@ -1186,7 +1186,8 @@ static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args, u64 *device ...@@ -1186,7 +1186,8 @@ static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args, u64 *device
rc = map_phys_pg_pack(ctx, ret_vaddr, phys_pg_pack); rc = map_phys_pg_pack(ctx, ret_vaddr, phys_pg_pack);
if (rc) { if (rc) {
dev_err(hdev->dev, "mapping page pack failed for handle %u\n", handle); dev_err(hdev->dev, "mapping page pack failed (%d) for handle %u\n",
rc, handle);
mutex_unlock(&hdev->mmu_lock); mutex_unlock(&hdev->mmu_lock);
goto map_err; goto map_err;
} }
......
...@@ -596,6 +596,7 @@ int hl_mmu_if_set_funcs(struct hl_device *hdev) ...@@ -596,6 +596,7 @@ int hl_mmu_if_set_funcs(struct hl_device *hdev)
break; break;
case ASIC_GAUDI2: case ASIC_GAUDI2:
case ASIC_GAUDI2B: case ASIC_GAUDI2B:
case ASIC_GAUDI2C:
/* MMUs in Gaudi2 are always host resident */ /* MMUs in Gaudi2 are always host resident */
hl_mmu_v2_hr_set_funcs(hdev, &hdev->mmu_func[MMU_HR_PGT]); hl_mmu_v2_hr_set_funcs(hdev, &hdev->mmu_func[MMU_HR_PGT]);
break; break;
......
...@@ -8,6 +8,7 @@ ...@@ -8,6 +8,7 @@
#include "habanalabs.h" #include "habanalabs.h"
#include <linux/pci.h> #include <linux/pci.h>
#include <linux/types.h>
static ssize_t clk_max_freq_mhz_show(struct device *dev, struct device_attribute *attr, char *buf) static ssize_t clk_max_freq_mhz_show(struct device *dev, struct device_attribute *attr, char *buf)
{ {
...@@ -80,12 +81,27 @@ static ssize_t vrm_ver_show(struct device *dev, struct device_attribute *attr, c ...@@ -80,12 +81,27 @@ static ssize_t vrm_ver_show(struct device *dev, struct device_attribute *attr, c
{ {
struct hl_device *hdev = dev_get_drvdata(dev); struct hl_device *hdev = dev_get_drvdata(dev);
struct cpucp_info *cpucp_info; struct cpucp_info *cpucp_info;
u32 infineon_second_stage_version;
u32 infineon_second_stage_first_instance;
u32 infineon_second_stage_second_instance;
u32 infineon_second_stage_third_instance;
u32 mask = 0xff;
cpucp_info = &hdev->asic_prop.cpucp_info; cpucp_info = &hdev->asic_prop.cpucp_info;
infineon_second_stage_version = le32_to_cpu(cpucp_info->infineon_second_stage_version);
infineon_second_stage_first_instance = infineon_second_stage_version & mask;
infineon_second_stage_second_instance =
(infineon_second_stage_version >> 8) & mask;
infineon_second_stage_third_instance =
(infineon_second_stage_version >> 16) & mask;
if (cpucp_info->infineon_second_stage_version) if (cpucp_info->infineon_second_stage_version)
return sprintf(buf, "%#04x %#04x\n", le32_to_cpu(cpucp_info->infineon_version), return sprintf(buf, "%#04x %#04x:%#04x:%#04x\n",
le32_to_cpu(cpucp_info->infineon_second_stage_version)); le32_to_cpu(cpucp_info->infineon_version),
infineon_second_stage_first_instance,
infineon_second_stage_second_instance,
infineon_second_stage_third_instance);
else else
return sprintf(buf, "%#04x\n", le32_to_cpu(cpucp_info->infineon_version)); return sprintf(buf, "%#04x\n", le32_to_cpu(cpucp_info->infineon_version));
} }
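
The new `vrm_ver_show()` treats the 32-bit second-stage version as three packed 8-bit instance versions. A minimal user-space sketch of the unpacking and formatting follows; the struct and function names are hypothetical, and unlike the driver it does not fall back to the single-value format when the packed field is zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three per-instance version bytes. */
struct second_stage_versions {
	unsigned int first;
	unsigned int second;
	unsigned int third;
};

/* Split a packed value into its three low bytes, lowest byte first. */
struct second_stage_versions unpack_second_stage(uint32_t packed)
{
	const uint32_t mask = 0xff;
	struct second_stage_versions v = {
		.first = packed & mask,
		.second = (packed >> 8) & mask,
		.third = (packed >> 16) & mask,
	};

	return v;
}

/* Format the first-stage version followed by the three second-stage instances. */
int format_vrm_ver(char *buf, size_t len, uint32_t first_stage, uint32_t packed)
{
	struct second_stage_versions v = unpack_second_stage(packed);

	return snprintf(buf, len, "%#04x %#04x:%#04x:%#04x\n",
			(unsigned int)first_stage, v.first, v.second, v.third);
}

/* Self-check with made-up version numbers. */
void vrm_ver_self_check(void)
{
	char buf[64];

	format_vrm_ver(buf, sizeof(buf), 0x10, 0x00332211);
	assert(strcmp(buf, "0x10 0x11:0x22:0x33\n") == 0);
}
```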
...@@ -251,6 +267,9 @@ static ssize_t device_type_show(struct device *dev, ...@@ -251,6 +267,9 @@ static ssize_t device_type_show(struct device *dev,
case ASIC_GAUDI2B: case ASIC_GAUDI2B:
str = "GAUDI2B"; str = "GAUDI2B";
break; break;
case ASIC_GAUDI2C:
str = "GAUDI2C";
break;
default: default:
dev_err(hdev->dev, "Unrecognized ASIC type %d\n", dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
hdev->asic_type); hdev->asic_type);
...@@ -383,6 +402,21 @@ static ssize_t security_enabled_show(struct device *dev, ...@@ -383,6 +402,21 @@ static ssize_t security_enabled_show(struct device *dev,
return sprintf(buf, "%d\n", hdev->asic_prop.fw_security_enabled); return sprintf(buf, "%d\n", hdev->asic_prop.fw_security_enabled);
} }
static ssize_t module_id_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct hl_device *hdev = dev_get_drvdata(dev);
return sprintf(buf, "%u\n", le32_to_cpu(hdev->asic_prop.cpucp_info.card_location));
}
static ssize_t parent_device_show(struct device *dev, struct device_attribute *attr, char *buf)
{
struct hl_device *hdev = dev_get_drvdata(dev);
return sprintf(buf, "%s\n", HL_DEV_NAME(hdev));
}
static DEVICE_ATTR_RO(armcp_kernel_ver); static DEVICE_ATTR_RO(armcp_kernel_ver);
static DEVICE_ATTR_RO(armcp_ver); static DEVICE_ATTR_RO(armcp_ver);
static DEVICE_ATTR_RO(cpld_ver); static DEVICE_ATTR_RO(cpld_ver);
...@@ -402,6 +436,8 @@ static DEVICE_ATTR_RO(thermal_ver); ...@@ -402,6 +436,8 @@ static DEVICE_ATTR_RO(thermal_ver);
static DEVICE_ATTR_RO(uboot_ver); static DEVICE_ATTR_RO(uboot_ver);
static DEVICE_ATTR_RO(fw_os_ver); static DEVICE_ATTR_RO(fw_os_ver);
static DEVICE_ATTR_RO(security_enabled); static DEVICE_ATTR_RO(security_enabled);
static DEVICE_ATTR_RO(module_id);
static DEVICE_ATTR_RO(parent_device);
static struct bin_attribute bin_attr_eeprom = { static struct bin_attribute bin_attr_eeprom = {
.attr = {.name = "eeprom", .mode = (0444)}, .attr = {.name = "eeprom", .mode = (0444)},
...@@ -427,6 +463,8 @@ static struct attribute *hl_dev_attrs[] = { ...@@ -427,6 +463,8 @@ static struct attribute *hl_dev_attrs[] = {
&dev_attr_uboot_ver.attr, &dev_attr_uboot_ver.attr,
&dev_attr_fw_os_ver.attr, &dev_attr_fw_os_ver.attr,
&dev_attr_security_enabled.attr, &dev_attr_security_enabled.attr,
&dev_attr_module_id.attr,
&dev_attr_parent_device.attr,
NULL, NULL,
}; };
......
...@@ -7858,39 +7858,44 @@ static bool gaudi2_handle_ecc_event(struct hl_device *hdev, u16 event_type, ...@@ -7858,39 +7858,44 @@ static bool gaudi2_handle_ecc_event(struct hl_device *hdev, u16 event_type,
return !!ecc_data->is_critical; return !!ecc_data->is_critical;
} }
static void handle_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base, u64 event_mask) static void handle_lower_qman_data_on_err(struct hl_device *hdev, u64 qman_base, u32 engine_id)
{ {
u32 lo, hi, cq_ptr_size, arc_cq_ptr_size; struct undefined_opcode_info *undef_opcode = &hdev->captured_err_info.undef_opcode;
u64 cq_ptr, arc_cq_ptr, cp_current_inst; u64 cq_ptr, cp_current_inst;
u32 lo, hi, cq_size, cp_sts;
bool is_arc_cq;
lo = RREG32(qman_base + QM_CQ_PTR_LO_4_OFFSET); cp_sts = RREG32(qman_base + QM_CP_STS_4_OFFSET);
hi = RREG32(qman_base + QM_CQ_PTR_HI_4_OFFSET); is_arc_cq = FIELD_GET(PDMA0_QM_CP_STS_CUR_CQ_MASK, cp_sts); /* 0 - legacy CQ, 1 - ARC_CQ */
cq_ptr = ((u64) hi) << 32 | lo;
cq_ptr_size = RREG32(qman_base + QM_CQ_TSIZE_4_OFFSET);
lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_OFFSET); if (is_arc_cq) {
hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_OFFSET); lo = RREG32(qman_base + QM_ARC_CQ_PTR_LO_STS_OFFSET);
arc_cq_ptr = ((u64) hi) << 32 | lo; hi = RREG32(qman_base + QM_ARC_CQ_PTR_HI_STS_OFFSET);
arc_cq_ptr_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_OFFSET); cq_ptr = ((u64) hi) << 32 | lo;
cq_size = RREG32(qman_base + QM_ARC_CQ_TSIZE_STS_OFFSET);
} else {
lo = RREG32(qman_base + QM_CQ_PTR_LO_STS_4_OFFSET);
hi = RREG32(qman_base + QM_CQ_PTR_HI_STS_4_OFFSET);
cq_ptr = ((u64) hi) << 32 | lo;
cq_size = RREG32(qman_base + QM_CQ_TSIZE_STS_4_OFFSET);
}
lo = RREG32(qman_base + QM_CP_CURRENT_INST_LO_4_OFFSET); lo = RREG32(qman_base + QM_CP_CURRENT_INST_LO_4_OFFSET);
hi = RREG32(qman_base + QM_CP_CURRENT_INST_HI_4_OFFSET); hi = RREG32(qman_base + QM_CP_CURRENT_INST_HI_4_OFFSET);
cp_current_inst = ((u64) hi) << 32 | lo; cp_current_inst = ((u64) hi) << 32 | lo;
dev_info(hdev->dev, dev_info(hdev->dev,
"LowerQM. CQ: {ptr %#llx, size %u}, ARC_CQ: {ptr %#llx, size %u}, CP: {instruction %#llx}\n", "LowerQM. %sCQ: {ptr %#llx, size %u}, CP: {instruction %#018llx}\n",
cq_ptr, cq_ptr_size, arc_cq_ptr, arc_cq_ptr_size, cp_current_inst); is_arc_cq ? "ARC_" : "", cq_ptr, cq_size, cp_current_inst);
if (event_mask & HL_NOTIFIER_EVENT_UNDEFINED_OPCODE) { if (undef_opcode->write_enable) {
if (arc_cq_ptr) { memset(undef_opcode, 0, sizeof(*undef_opcode));
hdev->captured_err_info.undef_opcode.cq_addr = arc_cq_ptr; undef_opcode->timestamp = ktime_get();
hdev->captured_err_info.undef_opcode.cq_size = arc_cq_ptr_size; undef_opcode->cq_addr = cq_ptr;
} else { undef_opcode->cq_size = cq_size;
hdev->captured_err_info.undef_opcode.cq_addr = cq_ptr; undef_opcode->engine_id = engine_id;
hdev->captured_err_info.undef_opcode.cq_size = cq_ptr_size; undef_opcode->stream_id = QMAN_STREAMS;
} undef_opcode->write_enable = 0;
hdev->captured_err_info.undef_opcode.stream_id = QMAN_STREAMS;
} }
} }
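
Two patterns in the rewritten function are worth isolating: a 64-bit pointer assembled from two 32-bit register reads, and a record that is written only while `write_enable` is set, so a later error cannot overwrite the first capture. A simplified user-space sketch, assuming a hypothetical trimmed-down record type instead of `struct undefined_opcode_info`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down mirror of the captured undefined-opcode record. */
struct undef_opcode_record {
	uint64_t cq_addr;
	uint32_t cq_size;
	uint32_t engine_id;
	bool write_enable;
};

/* Combine the high and low 32-bit halves of a register pair. */
uint64_t combine_hi_lo(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Returns true if the record was written, false if an earlier capture was kept. */
bool capture_once(struct undef_opcode_record *rec, uint64_t cq_addr,
		  uint32_t cq_size, uint32_t engine_id)
{
	if (!rec->write_enable)
		return false;

	memset(rec, 0, sizeof(*rec));
	rec->cq_addr = cq_addr;
	rec->cq_size = cq_size;
	rec->engine_id = engine_id;
	rec->write_enable = false;	/* lock the record until it is re-armed */
	return true;
}

/* Self-check with made-up addresses and sizes. */
void capture_self_check(void)
{
	struct undef_opcode_record rec = { .write_enable = true };

	assert(combine_hi_lo(0x1, 0x2) == 0x100000002ULL);
	assert(capture_once(&rec, 0x1000, 64, 3));
	assert(!capture_once(&rec, 0x2000, 128, 5));
	assert(rec.cq_addr == 0x1000);
}
```

Who re-arms `write_enable` is outside this diff, so the sketch leaves that step out.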
...@@ -7929,21 +7934,12 @@ static int gaudi2_handle_qman_err_generic(struct hl_device *hdev, u16 event_type ...@@ -7929,21 +7934,12 @@ static int gaudi2_handle_qman_err_generic(struct hl_device *hdev, u16 event_type
error_count++; error_count++;
} }
if (i == QMAN_STREAMS && error_count) { /* Check for undefined opcode error in lower QM */
/* check for undefined opcode */ if ((i == QMAN_STREAMS) &&
if (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK && (glbl_sts_val & PDMA0_QM_GLBL_ERR_STS_CP_UNDEF_CMD_ERR_MASK)) {
hdev->captured_err_info.undef_opcode.write_enable) { handle_lower_qman_data_on_err(hdev, qman_base,
memset(&hdev->captured_err_info.undef_opcode, 0, gaudi2_queue_id_to_engine_id[qid_base]);
sizeof(hdev->captured_err_info.undef_opcode)); *event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE;
hdev->captured_err_info.undef_opcode.write_enable = false;
hdev->captured_err_info.undef_opcode.timestamp = ktime_get();
hdev->captured_err_info.undef_opcode.engine_id =
gaudi2_queue_id_to_engine_id[qid_base];
*event_mask |= HL_NOTIFIER_EVENT_UNDEFINED_OPCODE;
}
handle_lower_qman_data_on_err(hdev, qman_base, *event_mask);
} }
} }
...@@ -10007,6 +10003,8 @@ static void gaudi2_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_ent ...@@ -10007,6 +10003,8 @@ static void gaudi2_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_ent
error_count = gaudi2_handle_pcie_drain(hdev, &eq_entry->pcie_drain_ind_data); error_count = gaudi2_handle_pcie_drain(hdev, &eq_entry->pcie_drain_ind_data);
reset_flags |= HL_DRV_RESET_FW_FATAL_ERR; reset_flags |= HL_DRV_RESET_FW_FATAL_ERR;
event_mask |= HL_NOTIFIER_EVENT_GENERAL_HW_ERR; event_mask |= HL_NOTIFIER_EVENT_GENERAL_HW_ERR;
if (hl_is_fw_sw_ver_equal_or_greater(hdev, 1, 13))
is_critical = true;
break; break;
case GAUDI2_EVENT_PSOC59_RPM_ERROR_OR_DRAIN: case GAUDI2_EVENT_PSOC59_RPM_ERROR_OR_DRAIN:
......
...@@ -242,14 +242,15 @@ ...@@ -242,14 +242,15 @@
#define QM_FENCE2_OFFSET (mmPDMA0_QM_CP_FENCE2_RDATA_0 - mmPDMA0_QM_BASE) #define QM_FENCE2_OFFSET (mmPDMA0_QM_CP_FENCE2_RDATA_0 - mmPDMA0_QM_BASE)
#define QM_SEI_STATUS_OFFSET (mmPDMA0_QM_SEI_STATUS - mmPDMA0_QM_BASE) #define QM_SEI_STATUS_OFFSET (mmPDMA0_QM_SEI_STATUS - mmPDMA0_QM_BASE)
#define QM_CQ_PTR_LO_4_OFFSET (mmPDMA0_QM_CQ_PTR_LO_4 - mmPDMA0_QM_BASE) #define QM_CQ_TSIZE_STS_4_OFFSET (mmPDMA0_QM_CQ_TSIZE_STS_4 - mmPDMA0_QM_BASE)
#define QM_CQ_PTR_HI_4_OFFSET (mmPDMA0_QM_CQ_PTR_HI_4 - mmPDMA0_QM_BASE) #define QM_CQ_PTR_LO_STS_4_OFFSET (mmPDMA0_QM_CQ_PTR_LO_STS_4 - mmPDMA0_QM_BASE)
#define QM_CQ_TSIZE_4_OFFSET (mmPDMA0_QM_CQ_TSIZE_4 - mmPDMA0_QM_BASE) #define QM_CQ_PTR_HI_STS_4_OFFSET (mmPDMA0_QM_CQ_PTR_HI_STS_4 - mmPDMA0_QM_BASE)
#define QM_ARC_CQ_PTR_LO_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_LO - mmPDMA0_QM_BASE) #define QM_ARC_CQ_TSIZE_STS_OFFSET (mmPDMA0_QM_ARC_CQ_TSIZE_STS - mmPDMA0_QM_BASE)
#define QM_ARC_CQ_PTR_HI_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_HI - mmPDMA0_QM_BASE) #define QM_ARC_CQ_PTR_LO_STS_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_LO_STS - mmPDMA0_QM_BASE)
#define QM_ARC_CQ_TSIZE_OFFSET (mmPDMA0_QM_ARC_CQ_TSIZE - mmPDMA0_QM_BASE) #define QM_ARC_CQ_PTR_HI_STS_OFFSET (mmPDMA0_QM_ARC_CQ_PTR_HI_STS - mmPDMA0_QM_BASE)
#define QM_CP_STS_4_OFFSET (mmPDMA0_QM_CP_STS_4 - mmPDMA0_QM_BASE)
#define QM_CP_CURRENT_INST_LO_4_OFFSET (mmPDMA0_QM_CP_CURRENT_INST_LO_4 - mmPDMA0_QM_BASE) #define QM_CP_CURRENT_INST_LO_4_OFFSET (mmPDMA0_QM_CP_CURRENT_INST_LO_4 - mmPDMA0_QM_BASE)
#define QM_CP_CURRENT_INST_HI_4_OFFSET (mmPDMA0_QM_CP_CURRENT_INST_HI_4 - mmPDMA0_QM_BASE) #define QM_CP_CURRENT_INST_HI_4_OFFSET (mmPDMA0_QM_CP_CURRENT_INST_HI_4 - mmPDMA0_QM_BASE)
......
...@@ -25,6 +25,7 @@ enum hl_revision_id { ...@@ -25,6 +25,7 @@ enum hl_revision_id {
REV_ID_INVALID = 0x00, REV_ID_INVALID = 0x00,
REV_ID_A = 0x01, REV_ID_A = 0x01,
REV_ID_B = 0x02, REV_ID_B = 0x02,
REV_ID_C = 0x03
}; };
#endif /* INCLUDE_PCI_GENERAL_H_ */ #endif /* INCLUDE_PCI_GENERAL_H_ */
...@@ -659,6 +659,12 @@ enum pq_init_status { ...@@ -659,6 +659,12 @@ enum pq_init_status {
* number (nonce) provided by the host to prevent replay attacks. * number (nonce) provided by the host to prevent replay attacks.
* public key and certificate also provided as part of the FW response. * public key and certificate also provided as part of the FW response.
* *
* CPUCP_PACKET_INFO_SIGNED_GET -
* Get the device information signed by the Trusted Platform device.
* device info data is also hashed with some unique number (nonce) provided
* by the host to prevent replay attacks. public key and certificate also
* provided as part of the FW response.
*
* CPUCP_PACKET_MONITOR_DUMP_GET - * CPUCP_PACKET_MONITOR_DUMP_GET -
* Get monitors registers dump from the CpuCP kernel. * Get monitors registers dump from the CpuCP kernel.
* The CPU will put the registers dump in the a buffer allocated by the driver * The CPU will put the registers dump in the a buffer allocated by the driver
...@@ -733,7 +739,7 @@ enum cpucp_packet_id { ...@@ -733,7 +739,7 @@ enum cpucp_packet_id {
CPUCP_PACKET_ENGINE_CORE_ASID_SET, /* internal */ CPUCP_PACKET_ENGINE_CORE_ASID_SET, /* internal */
CPUCP_PACKET_RESERVED2, /* not used */ CPUCP_PACKET_RESERVED2, /* not used */
CPUCP_PACKET_SEC_ATTEST_GET, /* internal */ CPUCP_PACKET_SEC_ATTEST_GET, /* internal */
CPUCP_PACKET_RESERVED3, /* not used */ CPUCP_PACKET_INFO_SIGNED_GET, /* internal */
CPUCP_PACKET_RESERVED4, /* not used */ CPUCP_PACKET_RESERVED4, /* not used */
CPUCP_PACKET_MONITOR_DUMP_GET, /* debugfs */ CPUCP_PACKET_MONITOR_DUMP_GET, /* debugfs */
CPUCP_PACKET_RESERVED5, /* not used */ CPUCP_PACKET_RESERVED5, /* not used */
......
...@@ -846,6 +846,7 @@ enum hl_server_type { ...@@ -846,6 +846,7 @@ enum hl_server_type {
#define HL_INFO_HW_ERR_EVENT 36 #define HL_INFO_HW_ERR_EVENT 36
#define HL_INFO_FW_ERR_EVENT 37 #define HL_INFO_FW_ERR_EVENT 37
#define HL_INFO_USER_ENGINE_ERR_EVENT 38 #define HL_INFO_USER_ENGINE_ERR_EVENT 38
#define HL_INFO_DEV_SIGNED 40
#define HL_INFO_VERSION_MAX_LEN 128 #define HL_INFO_VERSION_MAX_LEN 128
#define HL_INFO_CARD_NAME_MAX_LEN 16 #define HL_INFO_CARD_NAME_MAX_LEN 16
...@@ -1256,6 +1257,7 @@ struct hl_info_dev_memalloc_page_sizes { ...@@ -1256,6 +1257,7 @@ struct hl_info_dev_memalloc_page_sizes {
#define SEC_SIGNATURE_BUF_SZ 255 /* (256 - 1) 1 byte used for size */ #define SEC_SIGNATURE_BUF_SZ 255 /* (256 - 1) 1 byte used for size */
#define SEC_PUB_DATA_BUF_SZ 510 /* (512 - 2) 2 bytes used for size */ #define SEC_PUB_DATA_BUF_SZ 510 /* (512 - 2) 2 bytes used for size */
#define SEC_CERTIFICATE_BUF_SZ 2046 /* (2048 - 2) 2 bytes used for size */ #define SEC_CERTIFICATE_BUF_SZ 2046 /* (2048 - 2) 2 bytes used for size */
#define SEC_DEV_INFO_BUF_SZ 5120
/* /*
* struct hl_info_sec_attest - attestation report of the boot * struct hl_info_sec_attest - attestation report of the boot
...@@ -1290,6 +1292,32 @@ struct hl_info_sec_attest { ...@@ -1290,6 +1292,32 @@ struct hl_info_sec_attest {
__u8 pad0[2]; __u8 pad0[2];
}; };
/*
* struct hl_info_signed - device information signed by a secured device.
* @nonce: number only used once. random number provided by host. this also passed to the quote
* command as a qualifying data.
* @pub_data_len: length of the public data (bytes)
* @certificate_len: length of the certificate (bytes)
* @info_sig_len: length of the attestation signature (bytes)
* @public_data: public key info signed info data (outPublic + name + qualifiedName)
* @certificate: certificate for the signing key
* @info_sig: signature of the info + nonce data.
* @dev_info_len: length of device info (bytes)
* @dev_info: device info as byte array.
*/
struct hl_info_signed {
__u32 nonce;
__u16 pub_data_len;
__u16 certificate_len;
__u8 info_sig_len;
__u8 public_data[SEC_PUB_DATA_BUF_SZ];
__u8 certificate[SEC_CERTIFICATE_BUF_SZ];
__u8 info_sig[SEC_SIGNATURE_BUF_SZ];
__u16 dev_info_len;
__u8 dev_info[SEC_DEV_INFO_BUF_SZ];
__u8 pad[2];
};
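
The buffer sizes in this structure (255, 510, 2046) each leave room for a length field, which suggests the layout is meant to pack without implicit padding. The sketch below checks that at compile time against a user-space mirror of the structure; `hl_info_signed_mirror` and `HL_INFO_SIGNED_MEMBER_BYTES` are hypothetical names, and the result is only expected to hold on ABIs with natural alignment for 16-bit and 32-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buffer sizes copied from the uAPI header in this diff. */
#define SEC_SIGNATURE_BUF_SZ	255
#define SEC_PUB_DATA_BUF_SZ	510
#define SEC_CERTIFICATE_BUF_SZ	2046
#define SEC_DEV_INFO_BUF_SZ	5120

/* User-space mirror of struct hl_info_signed using <stdint.h> types. */
struct hl_info_signed_mirror {
	uint32_t nonce;
	uint16_t pub_data_len;
	uint16_t certificate_len;
	uint8_t info_sig_len;
	uint8_t public_data[SEC_PUB_DATA_BUF_SZ];
	uint8_t certificate[SEC_CERTIFICATE_BUF_SZ];
	uint8_t info_sig[SEC_SIGNATURE_BUF_SZ];
	uint16_t dev_info_len;
	uint8_t dev_info[SEC_DEV_INFO_BUF_SZ];
	uint8_t pad[2];
};

/* Sum of the member sizes; equals sizeof() only if no padding was inserted. */
#define HL_INFO_SIGNED_MEMBER_BYTES \
	(4 + 2 + 2 + 1 + SEC_PUB_DATA_BUF_SZ + SEC_CERTIFICATE_BUF_SZ + \
	 SEC_SIGNATURE_BUF_SZ + 2 + SEC_DEV_INFO_BUF_SZ + 2)

static_assert(sizeof(struct hl_info_signed_mirror) == HL_INFO_SIGNED_MEMBER_BYTES,
	      "unexpected implicit padding in hl_info_signed_mirror");
static_assert(offsetof(struct hl_info_signed_mirror, dev_info_len) % 2 == 0,
	      "dev_info_len is not naturally aligned");
```

A caller that sizes its buffer with `sizeof()` on the real uAPI structure receives the whole reply, since the ioctl handler copies the smaller of `return_size` and the structure size.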
/** /**
* struct hl_page_fault_info - page fault information. * struct hl_page_fault_info - page fault information.
* @timestamp: timestamp of page fault. * @timestamp: timestamp of page fault.
......