    habanalabs/gaudi: increase default cs timeout to 10 minutes · cd6b0cea
    Oded Gabbay authored
    In order to improve scalability and reduce host overhead, it is better
    to increase the default TDR timeout of Gaudi1 from 30 seconds to
    10 minutes.
    
    This will allow DL frameworks (e.g. PyTorch, TensorFlow) to remove
    the host sync they currently use and improve overall performance on
    scale-out training.
    
    Note that one can always set the timeout to a custom value via
    a kernel module parameter given during driver load.
    Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
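
The precedence described in the commit message (a value given at driver load overrides the built-in default) can be sketched in plain userspace C. This is a minimal illustration, not the driver's code: the names `GAUDI_DEFAULT_TIMEOUT_SEC`, `TIMEOUT_UNSET`, and `effective_timeout_sec` are hypothetical, and the use of a negative sentinel for "no parameter given" is an assumption made for the example.

```c
/*
 * Hypothetical sketch of the timeout selection described in the commit
 * message. None of these identifiers are taken from the habanalabs driver.
 */

/* New Gaudi default from the commit message: 10 minutes, in seconds. */
#define GAUDI_DEFAULT_TIMEOUT_SEC 600

/* Assumed sentinel meaning "the user did not pass a timeout parameter". */
#define TIMEOUT_UNSET (-1)

/*
 * Return the timeout (in seconds) that would be in effect:
 * a user-supplied value wins, otherwise the built-in default applies.
 */
static int effective_timeout_sec(int user_param_sec)
{
    if (user_param_sec != TIMEOUT_UNSET)
        return user_param_sec;

    return GAUDI_DEFAULT_TIMEOUT_SEC;
}
```

With this sketch, `effective_timeout_sec(TIMEOUT_UNSET)` yields 600, while `effective_timeout_sec(30)` yields 30, which corresponds to restoring the previous 30-second behavior through a custom value.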
habanalabs_drv.c 16.9 KB