Author Topic: ATK Error when running on cluster  (Read 41501 times)


Offline AsifShah

  • QuantumATK Guru
  • ****
  • Posts: 216
  • Country: in
  • Reputation: 4
    • View Profile
ATK Error when running on cluster
« on: April 26, 2025, 11:39 »
Dear Admin,

I am running a 500+ atom geometry optimization on 240 cores (60 cores per node). After running for some time, the simulation stopped with the following error:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 175 PID 170903 RUNNING AT n3
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 176 PID 170908 RUNNING AT n3
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 177 PID 170910 RUNNING AT n3
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================


Could you please help me resolve this issue? Otherwise I have to restart the simulation again and again.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5749
  • Country: dk
  • Reputation: 112
    • View Profile
    • QuantumATK at Synopsys
Re: ATK Error when running on cluster
« Reply #1 on: April 29, 2025, 00:58 »
Almost always this is due to running out of memory, but more details are needed to be sure.
To reduce memory usage, use fewer MPI processes and more threads; such a large system doesn't have enough k-points to benefit from many MPI processes anyway.
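As an illustration of the "fewer MPI processes, more threads" advice, here is a hypothetical sketch for the 60-core nodes mentioned above. The choice of 10 ranks per node is an assumption for illustration, not a tested setting; the `mpiexec.hydra` flags follow the Intel MPI launcher already used in this thread.

```shell
# Sketch: trade MPI ranks for threads on a 60-core node (assumed numbers).
CORES_PER_NODE=60
RANKS_PER_NODE=10                              # fewer ranks -> less total memory
THREADS_PER_RANK=$((CORES_PER_NODE / RANKS_PER_NODE))
export OMP_NUM_THREADS=$THREADS_PER_RANK       # threads per MPI rank
export MKL_NUM_THREADS=$THREADS_PER_RANK
echo "$RANKS_PER_NODE ranks x $THREADS_PER_RANK threads = $CORES_PER_NODE cores"
# Example 4-node launch (hypothetical counts):
# mpiexec.hydra -n 40 -ppn 10 atkpython input.py > output.log
```

The total cores per node stay fully used; only the split between ranks and threads changes.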

Offline AsifShah

  • QuantumATK Guru
  • ****
  • Posts: 216
  • Country: in
  • Reputation: 4
    • View Profile
Re: ATK Error when running on cluster
« Reply #2 on: May 1, 2025, 08:19 »
Thanks for your response.
Will decreasing the number of MPI processes and increasing the number of threads reduce the speed?

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5749
  • Country: dk
  • Reputation: 112
    • View Profile
    • QuantumATK at Synopsys
Re: ATK Error when running on cluster
« Reply #3 on: May 1, 2025, 23:05 »
Not necessarily; the thread scaling is pretty good too. But anything is faster than a job that crashes!

Offline AsifShah

  • QuantumATK Guru
  • ****
  • Posts: 216
  • Country: in
  • Reputation: 4
    • View Profile
Re: ATK Error when running on cluster
« Reply #4 on: May 2, 2025, 12:44 »
Dear Anders Blom,

Please confirm whether I am doing this correctly.
For cluster and single node:
export MKL_DYNAMIC=TRUE
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
When running on multiple cores on a cluster, I use: mpiexec.hydra -n 240 -ppn 60 atkpython input.py > output.log
When running on multiple cores on a single node, I use: mpiexec.hydra -n 60 atkpython input.py > output.log

For threads:
export MKL_DYNAMIC=TRUE
export OMP_NUM_THREADS=100
export MKL_NUM_THREADS=100
When running on multiple threads, I use: mpiexec -n 1 atkpython input.py > output.log

Am I doing the threading part correctly, or is there another way to speed things up?
Best regards

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5749
  • Country: dk
  • Reputation: 112
    • View Profile
    • QuantumATK at Synopsys
Re: ATK Error when running on cluster
« Reply #5 on: May 2, 2025, 18:52 »
60 cores per node is a slightly unusual number, but possible of course.

In general there is no need to specify the number of threads; the software will figure out the best number by itself. So leave the NUM_THREADS variables unset, unless you have very, very special reasons.

Few problems in QuantumATK scale well to 200+ MPI processes (at least with LCAO; plane-wave basis sets can go higher, as can NEGF calculations). For memory reasons, and to conserve your hardware resources, with 60-core machines I would start by running on just one node, perhaps with 10 or 20 MPI processes, to test speed and memory consumption.

The point is to combine MPI and threads, not to pick only one or the other. So, again: a single node, a smallish number of MPI processes (chosen for memory, and weighed against the number of k-points), and let the software handle threading by itself. Of course you still need to reserve the full node in your queue submission.
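Putting that advice together, a single-node test run might look like the sketch below. The rank count of 10 is an illustrative assumption; the key points are that the thread variables are left unset so the software can choose, and the whole node is still reserved in the queue.

```shell
# Hypothetical single-node test run: small MPI count, threading left to QuantumATK.
unset OMP_NUM_THREADS MKL_NUM_THREADS   # do not pin thread counts; let the software decide
export MKL_DYNAMIC=TRUE                 # as in the settings quoted earlier in the thread
# Example launch on one 60-core node with 10 ranks (assumed, untested counts):
# mpiexec.hydra -n 10 -ppn 10 atkpython input.py > output.log
echo "OMP=${OMP_NUM_THREADS:-auto} MKL=${MKL_NUM_THREADS:-auto}"
```

If this runs without the SIGNAL 9 kills, scale up the rank count gradually while watching memory use per node.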