    habanalabs: increase timeout during reset · 7a65ee04
    Oded Gabbay authored
    During training, the DL framework (e.g. TensorFlow) performs hundreds
    of thousands of memory allocations and mappings. If the driver needs
    to perform a hard reset during training, it kills the application and
    unmaps all of those memory allocations. Unfortunately, because of the
    large number of mappings, the driver can't finish unmapping within the
    current timeout (5 seconds). Therefore, increase the timeout to
    30 seconds to avoid a situation where the driver resets the device with
    active mappings, which can sometimes cause a kernel bug.
    
    Note that this doesn't mean the reset will always take the full
    30 seconds, because the reset thread checks once per second whether
    the unmap operation is done.
    
    Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
    Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
habanalabs.h 60.4 KB
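
The commit message describes a bounded polling wait: the reset thread checks once per second whether the unmap has finished, and gives up after the timeout. A minimal user-space sketch of that pattern follows. It is not the driver's code: `RESET_WAIT_SEC`, `wait_for_unmap()`, and the `fake_*` helpers are hypothetical names chosen for illustration, and only the 30-second limit and the one-second poll interval come from the message above.

```c
#include <stdbool.h>

/* Hypothetical constant; the 30-second value is from the commit message. */
#define RESET_WAIT_SEC 30

typedef bool (*done_fn)(void *ctx);             /* "is the unmap finished?" */
typedef void (*sleep_fn)(unsigned int seconds); /* injected so it can be faked */

/*
 * Poll is_done() once per second until it reports completion or
 * timeout_sec seconds have been spent waiting.
 * Returns the number of seconds waited, or -1 on timeout.
 */
static int wait_for_unmap(done_fn is_done, void *ctx,
                          sleep_fn sleep_sec, int timeout_sec)
{
    int waited = 0;

    while (!is_done(ctx)) {
        if (waited >= timeout_sec)
            return -1;      /* mappings still active after the full timeout */
        sleep_sec(1);
        waited++;
    }
    return waited;          /* finished early: only this much time was spent */
}

/* Illustration-only stand-ins: the work completes after a fixed number of polls. */
struct fake_unmap {
    int polls_left;
};

static bool fake_unmap_done(void *ctx)
{
    struct fake_unmap *u = ctx;

    if (u->polls_left == 0)
        return true;
    u->polls_left--;
    return false;
}

static void fake_sleep(unsigned int seconds)
{
    (void)seconds;          /* no real delay in the sketch */
}
```

With these stand-ins, a workload that needs three polls returns after three simulated seconds rather than thirty, which is the point made in the last paragraph of the message: the larger timeout is an upper bound, not a fixed cost.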