Topic: Error on Cluster

AsifShah
Error on Cluster
« on: February 2, 2024, 13:51 »
Dear Admin,

Any idea what is causing this error? The QuantumATK version is the latest one, released in December 2023 (V-2023.12).

I do not face any issues with the previous version of QuantumATK on the same cluster.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 181758 RUNNING AT node10
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================
« Last Edit: February 2, 2024, 14:01 by AsifShah »

Anders Blom (QuantumATK Staff)
Re: Error on Cluster
« Reply #1 on: February 2, 2024, 20:15 »
In almost all cases it means the calculation ran out of memory. There are many techniques to reduce memory usage, including different parallelization strategies; see https://docs.quantumatk.com/manual/technicalnotes/advanced_performance/advanced_performance.html

filipr (QuantumATK Staff, QuantumATK developer)
Re: Error on Cluster
« Reply #2 on: February 5, 2024, 08:43 »
I'm not convinced this is a memory issue. More likely an incompatibility between the software and hardware.

Some questions we need answers to in order to investigate this issue:

  • What are you trying to run? Please send the script if you can.
  • How far does it get? Please send the log output.

On top of that, it would be very valuable if you could set the environment variable I_MPI_DEBUG=5 in your submission script before the execution of atkpython and rerun the script. Intel MPI will then write various diagnostics to the output log. Send that to us. Also please send the output of 'cat /proc/cpuinfo' from the node that is running the script (you can add this to the top of the submission script), as well as the name and version of the operating system.
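If it is easier than pasting the full cpuinfo dump, a small Python script can summarize the same information. This is only a rough sketch: the list of instruction-set flags it checks is my own assumption about what is worth looking at first for an "Illegal instruction" crash, not a documented QuantumATK requirement, and the function names are made up for illustration.

```python
import os
import platform

# Instruction-set flags to look for first. This short list is an assumption,
# not something QuantumATK documents as a requirement.
FLAGS_OF_INTEREST = ("sse4_2", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the CPU feature flags found in /proc/cpuinfo-style text (x86 layout)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text):
    """Build a short text summary: OS, I_MPI_DEBUG setting, and selected CPU flags."""
    flags = parse_cpu_flags(cpuinfo_text)
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"I_MPI_DEBUG: {os.environ.get('I_MPI_DEBUG', '(not set)')}",
    ]
    for flag in FLAGS_OF_INTEREST:
        status = "yes" if flag in flags else "NO"
        lines.append(f"{flag:8s} {status}")
    return "\n".join(lines)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(report(handle.read()))
    except FileNotFoundError:
        print("/proc/cpuinfo not found (not a Linux node?)")
```

Run it on the compute node itself (not the login node), since the two may have different CPUs. The full 'cat /proc/cpuinfo' output is still the more complete record if you can attach it.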

AsifShah
Re: Error on Cluster
« Reply #3 on: February 9, 2024, 04:51 »
Dear Filipr,
Thanks for responding.

GPU Response:
1. When I run on a single GPU core, it runs very well and gives the output nicely. (See the attached Au_MoS2.py file.)
2. When I run on multiple GPU cores, it shows an error. (See the attached file Error.txt.)
3. Also see the attached Au_MoS2.log file.
4. Also see the attached cpuinfoo.txt.

CPU Response:
1. When I run on single or multiple CPU cores, the output file shows this:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.8  Build 20221129 (id: 339ec755a1)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 805 RUNNING AT ssdnode2
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 806 RUNNING AT ssdnode2
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 807 RUNNING AT ssdnode2
=   KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================


Julian Schneider (QuantumATK Staff)
Re: Error on Cluster
« Reply #4 on: February 9, 2024, 17:47 »
The M3GNet implementation in QuantumATK currently only supports running on a single GPU with 1 MPI process (the case that works nicely for you).

When running on CPU only, you should set device='cpu'. We found an issue: when running with device='cuda' on a node that does not support CUDA, the automatic fallback to CPU does not work. We will fix that in the upcoming service pack.
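Until the fix is released, the device can be chosen explicitly at the top of the script. The following is a minimal sketch under a few assumptions: the helper names are hypothetical, the presence of nvidia-smi on the PATH is used as a rough stand-in for "this node supports CUDA", and the PMI_SIZE variable is assumed to be set by the Intel MPI launcher. None of this is QuantumATK API; the returned string is simply what you would pass to the device argument your script already uses.

```python
import os
import shutil


def pick_device(env=None, has_nvidia_smi=None):
    """Return 'cuda' only if the node looks CUDA-capable, otherwise 'cpu'.

    Heuristic (an assumption): nvidia-smi must be on the PATH and
    CUDA_VISIBLE_DEVICES must not be set to an empty string.
    """
    env = os.environ if env is None else env
    if has_nvidia_smi is None:
        has_nvidia_smi = shutil.which("nvidia-smi") is not None
    gpus_hidden = env.get("CUDA_VISIBLE_DEVICES") == ""
    return "cuda" if has_nvidia_smi and not gpus_hidden else "cpu"


def mpi_process_count(env=None):
    """Number of MPI processes, read from PMI_SIZE (assumed to be set by the launcher)."""
    env = os.environ if env is None else env
    return int(env.get("PMI_SIZE", "1"))


def check_m3gnet_launch(device, n_procs):
    """Fail early if a GPU run was started with more than one MPI process."""
    if device == "cuda" and n_procs != 1:
        raise RuntimeError(
            f"GPU run started with {n_procs} MPI processes; "
            "use a single GPU with 1 MPI process, or set device='cpu'."
        )


if __name__ == "__main__":
    device = pick_device()
    check_m3gnet_launch(device, mpi_process_count())
    print(f"Using device='{device}'")
```

Hard-coding device='cpu' in a CPU-only submission is of course simpler; the helper is only useful if the same script is submitted to both GPU and CPU nodes.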