QuantumATK Forum

QuantumATK => General Questions and Answers => Topic started by: AsifShah on February 2, 2024, 13:51

Title: Error on Cluster
Post by: AsifShah on February 2, 2024, 13:51
Dear Admin,

Any idea what is causing this error? The QuantumATK version is the latest one, released in December 2023 (V12).

I do not face any issue with the previous version of QuantumATK on the same cluster.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 181758 RUNNING AT node10
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================
Title: Re: Error on Cluster
Post by: Anders Blom on February 2, 2024, 20:15
In almost all cases it means the calculation ran out of memory. There are many techniques to reduce memory usage, including different parallelization strategies; see https://docs.quantumatk.com/manual/technicalnotes/advanced_performance/advanced_performance.html
Title: Re: Error on Cluster
Post by: filipr on February 5, 2024, 08:43
I'm not convinced this is a memory issue. More likely an incompatibility between the software and hardware.

Some questions we need answers to in order to investigate this issue:


On top of that, it would be very valuable if you could rerun the script with the environment variable I_MPI_DEBUG=5 set in your submission script before atkpython is executed. Intel MPI will then write various diagnostics to the output log; please send that to us. Please also send the output of 'cat /proc/cpuinfo' from the node that runs the script (you can add this command to the top of the submission script), as well as the name and version of the operating system.
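An "Illegal instruction" (signal 4) is commonly raised when a binary uses CPU instructions that the node's processor does not provide, which is why the flags line of /proc/cpuinfo is worth checking. The sketch below is one possible way to collect the requested information from Python on an x86 Linux node. The helper names and the list of instruction-set extensions are illustrative assumptions, not something QuantumATK provides or is known to require.

```python
import os
import platform

# Instruction-set extensions worth reporting. This list is an illustrative
# assumption; which ones a given QuantumATK build needs is not stated here.
SIMD_FLAGS = ("sse4_2", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text (x86)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_report(flags):
    """Map each extension in SIMD_FLAGS to True/False for this CPU."""
    return {name: name in flags for name in SIMD_FLAGS}


def mpi_debug_env(base_env=None, level=5):
    """Copy an environment and set I_MPI_DEBUG so Intel MPI prints diagnostics."""
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_DEBUG"] = str(level)
    return env


# Linux only: /proc/cpuinfo does not exist on other systems.
if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as handle:
        flags = parse_cpu_flags(handle.read())
    print("OS:", platform.platform())
    for name, present in simd_report(flags).items():
        print(f"{name}: {'yes' if present else 'MISSING'}")
```

The dictionary returned by mpi_debug_env() can be passed as env= to subprocess.run if the job is launched from Python; in a batch script, exporting I_MPI_DEBUG=5 before the atkpython line has the same effect.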
Title: Re: Error on Cluster
Post by: AsifShah on February 9, 2024, 04:51
Dear Filipr,
Thanks for responding.

GPU Response:
1. When I run on a single GPU core, it runs very well and produces the expected output. (See the attached Au_MoS2.py file.)
2. When I run on multiple GPU cores, it shows an error. (See the attached file Error.txt.)
3. Also see the attached Au_MoS2.log file.
4. Also see attached cpuinfoo.txt

CPU Response:
1. When I run on a single CPU core or on multiple CPU cores, the output file shows this:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.8  Build 20221129 (id: 339ec755a1)
MPI startup(): Copyright (C) 2003-2022 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
MPI startup(): libfabric version: 1.13.2rc1-impi


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 805 RUNNING AT ssdnode2
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 806 RUNNING AT ssdnode2
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 807 RUNNING AT ssdnode2
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================

Title: Re: Error on Cluster
Post by: Julian Schneider on February 9, 2024, 17:47
The M3GNet implementation in QuantumATK currently only supports running on a single GPU with a single MPI process (the case that works nicely for you).

When running on CPU only, you should set device='cpu'. We found an issue when running with device='cuda' on a node that does not support CUDA: the automatic fallback to CPU does not work. We will fix that in the upcoming service pack.
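Until then, the device can be chosen explicitly rather than relying on the fallback. The sketch below is a hypothetical helper, not part of QuantumATK: it returns the string to use for the device setting mentioned above. It treats the presence of the nvidia-smi executable as a rough stand-in for "this node supports CUDA", which is an assumption that may not hold on every cluster.

```python
import shutil


def choose_device(requested="cuda", cuda_available=None):
    """Return 'cuda' only when requested and a GPU driver appears to be present.

    cuda_available can be passed explicitly; when it is None, the helper
    guesses by looking for the nvidia-smi executable on PATH (a heuristic).
    """
    if requested not in ("cpu", "cuda"):
        raise ValueError(f"unknown device: {requested!r}")
    if cuda_available is None:
        cuda_available = shutil.which("nvidia-smi") is not None
    if requested == "cuda" and not cuda_available:
        return "cpu"  # explicit fallback instead of relying on the automatic one
    return requested


device = choose_device("cuda")
print("device =", repr(device))
```

The returned value would then be supplied wherever the script currently sets device='cuda'.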