Topic: MPI batch script error

Offline Ambika kumari

  • Heavy QuantumATK user
  • ***
  • Posts: 32
  • Country: in
  • Reputation: 0
MPI batch script error
« on: March 17, 2023, 09:09 »
Respected Authority,

I am trying to run QuantumATK on an HPC cluster that uses the SLURM job manager (I have a perpetual license for QuantumATK Q-2019.12-SP1). When I specify the number of CPUs as 240 (our license covers 256), the simulation fails every time. However, when I instead specify the number of nodes as 6 (1 node = 40 CPUs), the simulation runs without any error. The supercomputer we use has rules that allow us to submit jobs only by specifying the number of CPUs, not the number of nodes. Could you please help us with this?
The batch script we used is based on the one provided by QuantumATK; it is attached for your reference.
I have also attached the log file showing the error.

Thanking you in advance.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5576
  • Country: dk
  • Reputation: 96
Re: MPI batch script error
« Reply #1 on: March 17, 2023, 20:44 »
It is possible that you need to run jobs under "srun" rather than mpiexec. This is a common case with SLURM. However, the exact details depend on your cluster setup, so you really need a sysadmin to show you what the proper way to submit a job is, manually.

So, talk to the admins about how the input script (that you attached) should be modified in order to run correctly, e.g. by changing the last line to "srun" and changing some of the resource allocation commands (and, in particular, srun probably doesn't recognize the -ppn option, but also doesn't need it).

After that, we can help you set up the Job Manager to automate this for future jobs.
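
As an illustration of the kind of change described above, a SLURM batch script that requests resources by task count and launches through srun might look like the minimal sketch below. The atkpython path, input file name, job name and time limit are placeholders (assumptions, not taken from the attached script). Whether srun can start the MPI processes directly depends on how SLURM and MPI are set up on the cluster, so the admins should confirm it. Outside a SLURM allocation the script only prints the command it would run.

```shell
#!/bin/bash
#SBATCH --job-name=atk_test      # placeholder job name
#SBATCH --ntasks=240             # total MPI processes, requested by CPU count, not by node
#SBATCH --cpus-per-task=1        # one core per MPI process
#SBATCH --time=24:00:00          # placeholder time limit

# Hypothetical locations; replace with the values for your site.
ATKPYTHON="${ATKPYTHON:-/path/to/QuantumATK/bin/atkpython}"
INPUT="${INPUT:-input.py}"

# SLURM sets SLURM_NTASKS inside a job; fall back to 240 for a dry run.
NTASKS="${SLURM_NTASKS:-240}"

# Launch line without mpiexec and without -ppn (assumes no spaces in the paths).
LAUNCH="srun -n ${NTASKS} ${ATKPYTHON} ${INPUT}"

if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    # Inside a SLURM job: run the calculation and collect all output in one log.
    ${LAUNCH} > output.log 2>&1
else
    # Not inside a SLURM job: just show what would be executed.
    echo "dry run: ${LAUNCH}"
fi
```

Submitted with sbatch, this asks for 240 tasks and leaves the node layout to SLURM (typically 6 nodes when each node has 40 CPUs). If the MPI library is Intel MPI, srun may additionally need the I_MPI_PMI_LIBRARY environment variable to point at the site's PMI library; that path is site-specific and is something to ask the admins about.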

Offline Ambika kumari

Re: MPI batch script error
« Reply #2 on: March 21, 2023, 15:49 »
Respected Dr. Blom,

As per your recommendation, and after getting advice from the HPC admin, we modified the software-generated run script by replacing mpiexec with srun, but without success. I have attached the modified run script and the error log that was generated. The system admin says he cannot fix the script alone because he is not familiar with the internal scripting of QuantumATK. He suggests that he could help us if an expert from your side connects with him directly to resolve the issue. I would be obliged if you could let us know whether this is possible.



Offline Anders Blom

Re: MPI batch script error
« Reply #3 on: March 23, 2023, 05:40 »
Those error messages are not at all related to QuantumATK, since the software does not even start. The error says that wrong parameters are provided to mpiexec, which is launched through your srun.

In general you will first need to figure out how to run ANY application properly in parallel, because there is nothing about "internal scripting of QuantumATK" that is related to how you get your mpiexec to run.

So, just go back to basics: assume you have a binary "a.out" that you want to run in parallel on your cluster, with a certain number of nodes, threads and MPI processes. What is the launch script for that? Based on this, one can work out how to run QuantumATK and set up the Job Manager.
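
A minimal sketch of such a "back to basics" test is shown below, using the standard hostname command in place of "a.out" so that nothing QuantumATK-specific is involved. It launches one process per requested task and counts how many landed on each node. The helper name tasks_per_node and the host names in the dry-run branch are made up for illustration, and the srun launch line is an assumption that the admins should confirm for the cluster.

```shell
#!/bin/bash
#SBATCH --ntasks=240             # same CPU count as the real job
#SBATCH --time=00:05:00          # placeholder time limit

# Summarise how many processes ran on each node.
# Reads one host name per line on stdin, prints "<node> <count>" per node.
tasks_per_node() {
    sort | uniq -c | awk '{print $2, $1}'
}

if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    # Inside a SLURM job: every task prints the name of the node it runs on.
    srun -n "${SLURM_NTASKS}" hostname | tasks_per_node
else
    # Outside SLURM: demonstrate the summary with made-up host names.
    printf 'cn001\ncn001\ncn002\n' | tasks_per_node
fi
```

If this trivial job does not report the expected number of processes per node (for example 40 on each of 6 nodes), the problem lies in the launch setup and not in QuantumATK.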

Offline Ambika kumari

Re: MPI batch script error
« Reply #4 on: March 25, 2023, 20:03 »
Respected Dr. Blom,

After launching through mpiexec, we found that the simulation (ATKPYTHON) started successfully. However, after the first set of eigenvalues and the density matrix had been calculated, the following error appeared:
Calculating Eigenvalues : ==================================================

                            |--------------------------------------------------|
Calculating Density Matrix : ==================================================

cn112:UCM:6d009:b49e0700: 515424974 us(515424974 us!!!): dapl async_event CQ (0x157ed00) ERR 0
cn112:UCM:6d009:b49e0700: 515425014 us(40 us): -- dapl_evd_cq_async_error_callback (0x1538e00, 0x157ee60, 0x2b1db49dfd30, 0x157ed00)
cn112:UCM:6d009:b49e0700: 515425031 us(17 us): dapl async_event QP (0x3f1c660) Event 1
cn112:UCM:6d009:b49e0700: 515425041 us(10 us): dapl async_event QP (0x3f1c660) Event 1

We connected to the affected node (ssh cn112) and found that the ATKPYTHON processes assigned to its cores were still running at almost 100% utilisation. This suggested that the job was still running, albeit in an unknown state, since no progress was being written to the log.

We then reinstalled QuantumATK on the HPC, which resolved the issue.

However, we now face another issue: the simulation runs very slowly. It could not even converge the left electrode calculation, and completed only 8 sets of eigenvalue and density matrix calculations in 24 hours. Could this be due to a compatibility issue? (We asked other HPC users running comparable codes such as VASP on the same system, and they did not face such issues.) If so, could you suggest a way to overcome it?

Thank you very much.

Offline Anders Blom

Re: MPI batch script error
« Reply #5 on: March 27, 2023, 07:21 »
Ok, great that you solved those initial issues.

As for performance, this is very difficult to speculate on without seeing detailed input/output. However, I would suggest not starting with a device, but rather testing that the parallelization behaves correctly for a standard bulk crystal simulation: perhaps not the smallest possible, but, say, a larger crystal with 10-20 atoms and several k-points.

In earlier times it was possible to use "wrong" parallelization and get very slow behavior, in particular if trying to use OpenMPI with QuantumATK, which it is not compatible with. However, newer versions should block that. Still, it will be interesting to analyze the output of the simpler calculation when you have it done. I can also suggest some MPI parallel performance test commands to use after that.
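
One simple way to check the bulk test, sketched below, is to time the same input at several process counts and compare the speedup against a single process. This is a generic timing loop, not one of the MPI performance test commands mentioned above. The atkpython path, the input name bulk.py, the process counts and the srun launch line are all assumptions to adapt to the cluster.

```shell
#!/bin/bash
#SBATCH --ntasks=40              # one full node is enough for a bulk test
#SBATCH --time=02:00:00          # placeholder time limit

# Hypothetical locations; replace with the values for your site.
ATKPYTHON="${ATKPYTHON:-/path/to/QuantumATK/bin/atkpython}"
INPUT="${INPUT:-bulk.py}"

# Speedup = wall time with 1 process / wall time with n processes.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    t1=""
    for n in 1 2 4 8 16 32; do
        start=$(date +%s)
        srun -n "$n" "$ATKPYTHON" "$INPUT" > "bulk_n${n}.log" 2>&1
        elapsed=$(( $(date +%s) - start ))
        [ "$elapsed" -gt 0 ] || elapsed=1     # avoid dividing by zero
        [ -n "$t1" ] || t1=$elapsed           # first run is the 1-process reference
        echo "n=$n time=${elapsed}s speedup=$(speedup "$t1" "$elapsed")"
    done
else
    # Outside SLURM: show the arithmetic with made-up timings.
    echo "example: 1200 s on 1 process and 180 s on 8 gives speedup $(speedup 1200 180)"
fi
```

A speedup that stays close to 1, or falls below it, as the process count grows would point to a problem with the parallel launch itself, and the per-run logs are then the useful thing to post.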

Offline Ambika kumari

Re: MPI batch script error
« Reply #6 on: March 28, 2023, 10:22 »
Respected Dr. Blom,

I am not able to post the input and output files here, as the forum is asking me to start a new topic. So I am going to start a new topic for the same issue.