QuantumATK Forum

QuantumATK => General Questions and Answers => Topic started by: nils.holle on September 3, 2024, 07:27

Title: CUDA error with GPU acceleration of MTP training
Post by: nils.holle on September 3, 2024, 07:27
Hello, we are trying to use the NVIDIA GPUs (V100, A100) in our cluster to accelerate the MTP training. Unfortunately, I encounter the following error:

Code
CUDA Error: cusolverDnDgesvd failed with status 3
NL.ComputerScienceUtilities.ParallelTools.DynamicTaskScheduler.TaskExecutionError: An exception was raised while executing task "013b829c660c11ef992218c04dbe52d0".
  Traceback (most recent call last):
    File "zipdir/NL/ComputerScienceUtilities/ParallelTools/DynamicTaskScheduler.py", line 940, in __runNextTaskOnDelegatorProcess
    File "zipdir/NL/ComputerScienceUtilities/Workflow/Workflow.py", line 1193, in _runTask
    File "zipdir/NL/ComputerScienceUtilities/Workflow/Workflow.py", line 708, in run
    File "zipdir/NL/Study/MomentTensorPotential/FitMomentTensorPotential.py", line 599, in _execute
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/training.py", line 909, in fitMTPPotential
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/training.py", line 630, in fit_mtp_potential
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/training.py", line 421, in solve_least_squares_problem
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/linear_learning.py", line 55, in fit
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/mathutil.py", line 145, in svd_least_squares
  RuntimeError: CUDA error

The documentation states that it should work with CUDA 11.8. The cluster has CUDA 11.7 and 12.2 installed, and it works with neither of these versions. The driver version is 555. According to the NVIDIA documentation (https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-t-gesvd), the status message is "CUSOLVER_STATUS_ARCH_MISMATCH The device only supports compute capability 5.0 and above."

I think this is a bit odd. Does QuantumATK use CUDA functions that are not available on GPUs with a higher compute capability? Maybe I misinterpret the status message.
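
One way to check the interpretation is to compare the integer in the error line with the numeric values of the cusolverStatus_t enum. The sketch below is a minimal illustration, not part of QuantumATK: the helper name decode_cusolver_error is hypothetical, and the table is assumed to match the values in the cusolver_common.h header shipped with the CUDA toolkit, so it is worth confirming against the installed header.

```python
import re
from enum import IntEnum


class CusolverStatus(IntEnum):
    """Subset of cusolverStatus_t, assumed to match cusolver_common.h."""
    CUSOLVER_STATUS_SUCCESS = 0
    CUSOLVER_STATUS_NOT_INITIALIZED = 1
    CUSOLVER_STATUS_ALLOC_FAILED = 2
    CUSOLVER_STATUS_INVALID_VALUE = 3
    CUSOLVER_STATUS_ARCH_MISMATCH = 4
    CUSOLVER_STATUS_MAPPING_ERROR = 5
    CUSOLVER_STATUS_EXECUTION_FAILED = 6
    CUSOLVER_STATUS_INTERNAL_ERROR = 7


def decode_cusolver_error(line):
    """Hypothetical helper: extract (function, code, status name) from a log line.

    Returns None if the line does not look like a cuSOLVER failure message.
    """
    match = re.search(r"(cusolver\w+) failed with status (\d+)", line)
    if match is None:
        return None
    code = int(match.group(2))
    try:
        name = CusolverStatus(code).name
    except ValueError:
        # Code is outside the subset listed above.
        name = "UNKNOWN"
    return match.group(1), code, name


print(decode_cusolver_error("CUDA Error: cusolverDnDgesvd failed with status 3"))
```

Under that assumption, status 3 corresponds to CUSOLVER_STATUS_INVALID_VALUE rather than CUSOLVER_STATUS_ARCH_MISMATCH (which would be 4). That would point to the arguments passed to gesvd, such as the matrix dimensions, rather than to the compute capability of the GPU.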

Thanks a lot!

Best regards,
Nils
Title: Re: CUDA error with GPU acceleration of MTP training
Post by: Julian Schneider on September 5, 2024, 21:56
Hi Nils,

Which version of QuantumATK are you using?
We have tested it on A100 and V100 GPUs, among others, without problems, so I would say the compute capability is unlikely to be the cause, but we need to investigate further.

Best regards,
Julian
Title: Re: CUDA error with GPU acceleration of MTP training
Post by: Anders Blom on September 5, 2024, 23:40
We might also add that in the W-release, coming out next week, we have accelerated the CPU fitting of MTP by 10-15x, which is basically the speedup we see on GPU. So it might be simpler to use that version on CPU, at least until we figure out where this problem comes from and whether it is fixable.
Title: Re: CUDA error with GPU acceleration of MTP training
Post by: nils.holle on September 9, 2024, 10:34
Hi Julian, hi Anders,

Thank you very much for your replies. We used version V-2023.12-SP1.

The improvements in W-2024.09 sound very interesting; we will try them as soon as possible. Thanks!

Best regards,
Nils