habanalabs/gaudi: increase default cs timeout to 10 minutes
In order to improve scalability and reduce host overhead, it is better
to increase the default TDR timeout of Gaudi1 from 30 seconds to
10 minutes.
This will allow the DL Framework (e.g. PyTorch, TensorFlow) to remove
the host sync they are using now and improve overall performance on
scaleout training.
Note that one can always set the timeout to a custom value via
a kernel module parameter given during driver load.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Showing
Please register or sign in to comment