Topic: CUDA error with GPU acceleration of MTP training

nils.holle
« on: September 3, 2024, 07:27 »
Hello, we are trying to use the NVIDIA GPUs (V100, A100) in our cluster to accelerate the MTP training. Unfortunately, I encounter the following error:
Code
CUDA Error: cusolverDnDgesvd failed with status 3
NL.ComputerScienceUtilities.ParallelTools.DynamicTaskScheduler.TaskExecutionError: An exception was raised while executing task "013b829c660c11ef992218c04dbe52d0".
  Traceback (most recent call last):
    File "zipdir/NL/ComputerScienceUtilities/ParallelTools/DynamicTaskScheduler.py", line 940, in __runNextTaskOnDelegatorProcess
    File "zipdir/NL/ComputerScienceUtilities/Workflow/Workflow.py", line 1193, in _runTask
    File "zipdir/NL/ComputerScienceUtilities/Workflow/Workflow.py", line 708, in run
    File "zipdir/NL/Study/MomentTensorPotential/FitMomentTensorPotential.py", line 599, in _execute
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/training.py", line 909, in fitMTPPotential
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/training.py", line 630, in fit_mtp_potential
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/training.py", line 421, in solve_least_squares_problem
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/linear_learning.py", line 55, in fit
    File "build/atkpython/lib/python3.11/site-packages/scaitools/moment_tensor_potentials/mathutil.py", line 145, in svd_least_squares
  RuntimeError: CUDA error
The documentation states that it should work with CUDA 11.8. The cluster has CUDA 11.7 and 12.2 installed, and it does not work with either of these versions. The driver version is 555.

According to the NVIDIA documentation (https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-t-gesvd), the status message is "CUSOLVER_STATUS_ARCH_MISMATCH The device only supports compute capability 5.0 and above." I find this a bit odd. Does QuantumATK use CUDA functions that are not available on GPUs with a higher compute capability? Maybe I am misinterpreting the status message.

Thanks a lot!

Best regards,
Nils
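
A note on reading the status code: the 3 in "failed with status 3" is most likely the raw integer value of cusolverStatus_t, not a position in the documentation's list of return values. The sketch below decodes the error line under that reading. The numeric table is an assumption based on the enum in cusolver_common.h (0 = SUCCESS through 7 = INTERNAL_ERROR) and should be checked against the header of the installed CUDA toolkit; the helper name decode_cusolver_error is hypothetical and not part of QuantumATK or CUDA.

```python
import re

# Assumed values of cusolverStatus_t (from cusolver_common.h as recalled here).
# Verify against the header shipped with your CUDA toolkit before relying on it.
CUSOLVER_STATUS_NAMES = {
    0: "CUSOLVER_STATUS_SUCCESS",
    1: "CUSOLVER_STATUS_NOT_INITIALIZED",
    2: "CUSOLVER_STATUS_ALLOC_FAILED",
    3: "CUSOLVER_STATUS_INVALID_VALUE",
    4: "CUSOLVER_STATUS_ARCH_MISMATCH",
    5: "CUSOLVER_STATUS_MAPPING_ERROR",
    6: "CUSOLVER_STATUS_EXECUTION_FAILED",
    7: "CUSOLVER_STATUS_INTERNAL_ERROR",
}

# Matches lines such as "CUDA Error: cusolverDnDgesvd failed with status 3".
_ERROR_LINE = re.compile(r"(?P<function>cusolver\w+) failed with status (?P<code>\d+)")


def decode_cusolver_error(line):
    """Return (function, code, status name) for a cuSOLVER error line, or None."""
    match = _ERROR_LINE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    name = CUSOLVER_STATUS_NAMES.get(code, "unknown status")
    return match.group("function"), code, name


print(decode_cusolver_error("CUDA Error: cusolverDnDgesvd failed with status 3"))
# prints ('cusolverDnDgesvd', 3, 'CUSOLVER_STATUS_INVALID_VALUE')
```

If this reading is right, the failure would point at the arguments passed to cusolverDnDgesvd (for example the matrix or leading dimensions) and not at the GPU architecture.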

Julian Schneider (QuantumATK Staff)
« Reply #1 on: September 5, 2024, 21:56 »
Hi Nils,

Which version of QuantumATK are you using?
We have tested it on A100 and V100 GPUs, among others, without problems, so I would say the compute capability is unlikely to be the cause, but we need to investigate further.

Best regards,
Julian
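
To confirm what the compute nodes actually report, the compute capability and driver version can be read from nvidia-smi. The following is a minimal sketch, assuming an nvidia-smi recent enough to support the compute_cap query field. The helper names and the sample text in the usage note are illustrative; they are not output from the cluster discussed in this thread.

```python
import csv
import io
import subprocess

# Assumption: this nvidia-smi supports the "compute_cap" query field.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,compute_cap,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_report(text):
    """Parse 'name, compute_cap, driver' CSV lines into a list of dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        name, cap, driver = row
        major, minor = (int(part) for part in cap.split("."))
        gpus.append({"name": name, "compute_cap": (major, minor), "driver": driver})
    return gpus


def meets_minimum(gpus, minimum=(5, 0)):
    """True if every GPU has at least the given compute capability."""
    return all(gpu["compute_cap"] >= minimum for gpu in gpus)


def query_gpus():
    """Run nvidia-smi on the current node and parse its report."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_report(result.stdout)
```

On a GPU node, meets_minimum(query_gpus()) performs the check directly. V100 and A100 cards have compute capability 7.0 and 8.0 respectively, so a report such as "Tesla V100, 7.0, 555.00" would pass the 5.0 threshold named in the quoted status message.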

Anders Blom (QuantumATK Staff)
« Reply #2 on: September 5, 2024, 23:40 »
We might also add that in the W-release, coming out next week, we have accelerated the CPU fitting of MTP by 10-15x, which is roughly the same speedup we see on GPU. So it might be simpler to just use that version on CPU, at least until we figure out where this problem comes from and whether it can be fixed.

nils.holle
« Reply #3 on: September 9, 2024, 10:34 »
Hi Julian, hi Anders,

Thank you very much for your replies. We used version V-2023.12-SP1.

The improvements in W-2024.09 sound very interesting; we will try them as soon as possible. Thanks!

Best regards,
Nils