Topic: Running QATK on Multiple Nodes

Offline AsifShah

  • Regular ATK user
  • **
  • Posts: 36
  • Country: in
  • Reputation: 0
    • View Profile
Running QATK on Multiple Nodes
« on: July 19, 2022, 19:44 »
Dear Admin, I want to run QATK on multiple nodes.
I have added the paths for the license, mpiexec.hydra (from libexec), and atkpython (from bin) to ~/.bashrc.
Next, I ran the following commands in the terminal.

1. mpiexec.hydra -f host_file -n 200 -ppn 40 atkpython filename.py > output.log
(host_file contains the list of nodes.)
I get the following errors:
   i) /etc/tmi.conf: No such file or directory
   ii) DAPL startup: RLIMIT_MEMLOCK too small

2. mpiexec.hydra  -n 200 -ppn 40 atkpython filename.py > output.log
In this case, although I request 200/40 = 5 nodes, the whole calculation runs on a single node instead of the expected 5 nodes.

Kindly provide a solution!
« Last Edit: July 19, 2022, 19:48 by AsifShah »

Offline Anders Blom

  • QuantumATK Staff
  • Supreme ATK Wizard
  • *****
  • Posts: 5072
  • Country: dk
  • Reputation: 85
    • View Profile
    • QuantumATK at Synopsys
Re: Running QATK on Multiple Nodes
« Reply #1 on: July 19, 2022, 23:41 »
Do all 200 MPI processes run on one node? Could you share the top of the log file with the parallelization report? Or are you just not seeing any CPU activity on the other nodes? That might be because there simply are not 200 tasks to be done...
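
One way to answer the placement question directly is to have every MPI process report the host it runs on. The sketch below is a generic Python check, not a QuantumATK feature. It assumes the launcher exports the PMI_RANK environment variable (Hydra-based launchers normally do), and the file name where_am_i.py is hypothetical.

```python
import os
import socket
from collections import Counter


def placement_line(environ=None):
    """Return 'rank hostname' for the current process.

    PMI_RANK is an assumption: Hydra-based launchers normally export it,
    but other launchers may use a different variable.
    """
    environ = os.environ if environ is None else environ
    rank = environ.get("PMI_RANK", "?")
    return f"{rank} {socket.gethostname()}"


def summarize(lines):
    """Count how many ranks reported from each host."""
    hosts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip anything that is not 'rank hostname'
            hosts[parts[1]] += 1
    return dict(hosts)


if __name__ == "__main__":
    print(placement_line())
```

Launching this script with the same mpiexec.hydra options as the real job and passing the output lines to summarize() shows the distribution. With -n 200 -ppn 40 spread over 5 nodes, one would expect five host names with 40 ranks each. A single host name with 200 ranks means all processes landed on one machine, which is generally what happens when the launcher receives no host list from a host file or a scheduler.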

Still, maybe command 1 is the proper way to run on your hardware, but then you need to fix the error messages (which are not really related to QuantumATK itself).
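
The DAPL message refers to the locked-memory limit (RLIMIT_MEMLOCK) of the processes. As a minimal sketch, that limit can be inspected on a Linux node with Python's standard resource module. The function names are illustrative, and raising the limit is a system configuration matter for the cluster administrator that is not shown here.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, where RLIM_INFINITY means 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"  # getrlimit reports this limit in bytes


def memlock_report():
    """Return the soft and hard locked-memory limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return f"RLIMIT_MEMLOCK soft={describe_limit(soft)} hard={describe_limit(hard)}"


if __name__ == "__main__":
    print(memlock_report())
```

The limit seen in an interactive login shell can differ from the one that MPI-launched processes inherit on the compute nodes, so it is worth running the check the same way the job itself is started.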
« Last Edit: July 19, 2022, 23:42 by Anders Blom »

Offline AsifShah

Re: Running QATK on Multiple Nodes
« Reply #2 on: July 20, 2022, 07:03 »
Kindly look at the images below for an illustration of the problem I face:

When I run mpiexec -n 200 -ppn 40 atkpython ted.py > ted.log, all MPI processes run on the master node while the other compute nodes show 0 usage, as you can see in one of the images below.

When I run mpiexec -f host_file -n 200 -ppn 40 atkpython ted.py > ted.log, it gives the error shown in the image.

Offline Ambika kumari

  • Regular ATK user
  • **
  • Posts: 23
  • Country: in
  • Reputation: 0
    • View Profile
Re: Running QATK on Multiple Nodes
« Reply #3 on: July 23, 2022, 06:47 »
I am also facing the same issue.

Offline Anders Blom

Re: Running QATK on Multiple Nodes
« Reply #4 on: July 26, 2022, 22:23 »
This is a bit tricky to solve over a forum, because I don't think it's directly related to QuantumATK itself; it's more about MPI.

One thing I learned from a quick Google search is that the error message about tmi.conf seems to be related to Myrinet, which historically has been trickier to use than InfiniBand (which should be the default).

Are you using "mpd" to boot MPI on multiple machines (normally not needed with hydra)? If so, is there an option -MX passed to mpd?

Offline Ambika kumari

Re: Running QATK on Multiple Nodes
« Reply #6 on: July 27, 2022, 08:30 »
I am facing the following issue. Please suggest a solution.

srun: error: hm010: task 0: Out Of Memory
slurmstepd: error: Detected 432 oom-kill event(s) in StepId=869350.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[mpiexec@hm010] HYDT_bscu_wait_for_completion (../../tools/bootstrap/utils/bscu_wait.c:151): one of the processes terminated badly; aborting
[mpiexec@hm010] HYDT_bsci_wait_for_completion (../../tools/bootstrap/src/bsci_wait.c:36): launcher returned error waiting for completion
[mpiexec@hm010] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:527): launcher returned error waiting for completion
[mpiexec@hm010] main (../../ui/mpich/mpiexec.c:1148): process manager error waiting for completion
slurmstepd: error: Detected 432 oom-kill event(s) in StepId=869350.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Offline Anders Blom

Re: Running QATK on Multiple Nodes
« Reply #7 on: July 27, 2022, 09:03 »
The root cause is obvious: there is not enough memory to run the desired system.

There are many ways to fix that, but the right one depends on the details of the hardware and the input script. I suggest having a look at
https://docs.quantumatk.com/technicalnotes/advanced_performance/advanced_performance.html#running-out-of-memory
to get some ideas.
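
One generic lever, independent of the details in the linked note, is the number of MPI processes per node, because every process needs its own share of memory. The sketch below only does the arithmetic. All numbers are hypothetical and not taken from the cluster in this thread, and under Slurm the relevant budget is the memory granted to the job's cgroup, which may be lower than the node's physical memory.

```python
def memory_per_process_gib(node_memory_gib, processes_per_node):
    """Average memory available to each MPI process on one node."""
    if processes_per_node < 1:
        raise ValueError("processes_per_node must be at least 1")
    return node_memory_gib / processes_per_node


def max_processes_per_node(node_memory_gib, needed_per_process_gib):
    """Largest process count per node that stays within the memory budget."""
    if needed_per_process_gib <= 0:
        raise ValueError("needed_per_process_gib must be positive")
    return max(int(node_memory_gib // needed_per_process_gib), 0)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the cluster in this thread.
    node_memory_gib = 192
    for ppn in (40, 20, 10):
        share = memory_per_process_gib(node_memory_gib, ppn)
        print(f"ppn={ppn}: about {share:.1f} GiB per process")
```

If the per-process share at the current -ppn value is below what the calculation needs, lowering -ppn (and the total -n accordingly) is one option to try before changing the input script.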

Offline Ambika kumari

Re: Running QATK on Multiple Nodes
« Reply #8 on: July 27, 2022, 09:33 »
Even when I use multiple nodes to get enough memory, it still shows the same error. I talked to the engineer at this college about the issue, and he said the problem is related to running on multiple nodes and that only Admin can solve it, because they have their own code for multi-node connections.

Offline AsifShah

Re: Running QATK on Multiple Nodes
« Reply #9 on: August 2, 2022, 13:19 »
Dear Admin,
Yes, it is resolved now. It looks like it was an error on the cluster.