- 19 Sep, 2022 17 commits
-
-
Dani Liberman authored
In order to get the error cause and the captured address in case of page fault, added pmmu events to eqe handler. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tal Cohen authored
This change is done while there is a problem to use QMAN error for TPC assert async. The problem involves security limitation that exists to generate the assert via QMAN error. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
As a preparation for adding more errors to it, change to more suitable name. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
farah kassabri authored
Get the firmware reset status address from the dynamic registers we read from the firmware instead of using a define. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
The access to the device registers is blocked during hard reset, until preboot runs and allows the access to specific registers, including the PSOC BTM_FSM register which is used to know when the reset is done. Between the reset request and until this register is polled there is a small delay of 500 msec which is not enough for F/W to process the reset and for preboot to run, so the register might be accessed while it is blocked. To avoid it, increase the delay to 2 sec. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Add the dump of the RAZWI information when a PCIe access is blocked by RR. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The code used the mmu mutex to protect access to the context's page tables and invalidation of the MMU cache. Because pgt are per context, the mmu mutex was a member of the context object. The problem is that the device has a single MMU invalidation h/w (per MMU). Therefore, the mmu mutex should not be a property of the context but a property of the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tal Cohen authored
Add new notifier events that inform several device states. General H/W error raised on device general H/W error occurs. User engine error is raised when a device engine informs of an error. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
In case initialization fails after event irq was requested, we need to release that irq. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
Current code does not takes into account the new DRAM region base and so calculated address is wrong and can lead to crush. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Firmware now responds with a more detailed cpucp return codes. Driver can now distinguish between error and debug return codes. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
F/W security status might change after every reset. Add the reading of the preboot status to the hard reset sequence, which among others reads this security indication. As this preboot status reading includes the waiting for the preboot to be ready, it can be removed from the CPU init which is done in a later stage. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Current description is misleading hence we rename it to a more suitable error description. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
farah kassabri authored
cb_map_mem() uses gen_pool_alloc() to get virtual address for mapping a CB. The mapping is done in chunks of page size, so if the CB size is larger, it is possible that the allocated virtual addresses won't be consecutive. User retrieves this device VA which returns the virtual address in the first va_block. If there is a "hole" in the virtual addresses, user can configure a HW block with a bad device VA. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
'Device activity open packet' should be sent outside of mutex as there is no real necessity for a lock. In addition 'device activity close packet' should be sent upon an actual release of the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
farah kassabri authored
As part of the RAS that is done by the f/w, we should send a message to the f/w when a user either acquires or releases the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
In order to improve debuggability, we add all available information when a RAZWI event occur. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
- 18 Sep, 2022 23 commits
-
-
farah kassabri authored
When we have a storm of errors of HBM ECC SERR we can reach a situation where driver start hard reset flow without logging the error cause that caused the hard reset due to logs rate limiting. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
EEPROM errors reported by firmware are basically warnings and should not fail the boot process. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Except Goya, none of our ASICs require context switch flow, hence we enable this flow only where it is needed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dafna Hirschfeld authored
Set the addresses for userspace command buffer dynamically instead of hard-coded. There is no reason for it to be hard-coded. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
This patch add tracepoints in the code for DMA allocation. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. To get better understanding of what happened in the DMA allocations the real allocating function is added to the trace as well. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
This patch utilize the defined tracepoint to trace the MMU's pages map/unmap operations. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
This patch adds trace events for habanalabs driver to gain all the benefits such an infrastructure can supply. The following events were added: - MMU map/unmap: to be able to track driver's memory allocations - DMA alloc/free: to track our DMA allocation the above trace points in conjunction will help us map the device memory usage as well as to be able to track memory violations. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Acked-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Rajarama Manjukody Bhat authored
Assigning 3 PQFs in PDMA1 and 2 PQFs in PDMA0 for ARC firmware usage. Signed-off-by: Rajarama Manjukody Bhat <rmbhat@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
The calculation of the device DRAM base address before setting the relevant PCIe BAR to point at it, has an assumption that this BAR is used to access only the DRAM, and thus the covered DRAM size is a power of 2. In future ASICs it is not necessarily true, so need to update the calculation to support also a non-power-of-2 size. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dafna Hirschfeld authored
The original code tried to unmap a page that was not mapped as part of the map page error path. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The driver is loading firmware to the device and we use the firmware loading functions from the FW_LOADER module. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Omer Shpigelman authored
Instead of recalculating the cdev index, store it in a dedicated data member. This data member is intended to be passed to other drivers using the auxiliary bus infra and hence this new data member is necessary in case that the calculation is changed in the future. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The kernel version field wasn't updated when a few entries were upstreamed. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dafna Hirschfeld authored
the size of a block is always 'block->end - block->start + 1' Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
In order for the user to know if he is running on a secured device or not, we add it also to the hw_ip info ioctl. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
In order for the user to know if he is running on a secured device or not, a sysfs node is added. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Secured PCI ID will not be supported in new asics because the security status can always be read from the f/w. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Several munmap() calls can be done or a mapped H/W block that has a larger size than a page size. Releasing the object should be done only when all mapped range is unmapped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Since hwmon fini code is common for all asics, unified it to common function. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tal Cohen authored
The current flow of halting the engine cores is implemented by command buffers built by the user space and sent towards the Driver. This current flow is broken since the user space does not know when the cores actually halt as sending a workload is async op. Therefore the application can not free the memory that is mapped to the engine cores. This new API allows the user space to control the running mode. The API call is sync (returns after the cores are set to the requested mode). Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
There is some left-over code from the gaudi2 bring-up that wasn't removed so far. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
farah kassabri authored
On Gaudi2 the f/w always configures the PCIe iATU and allows access to scratchpad registers. Therefore, we can know if the f/w is secured by reading a status bit from the f/w registers. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
A common function that is called from multiple places can't be located in degugfs.c because that file is only compiled if debugfs is enabled in the kernel config file. This can lead to undefined symbol compilation error. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-