Author Topic: Parallel code  (Read 7452 times)

Offline carbn9

  • Regular QuantumATK user
  • Posts: 24
  • Reputation: 0
Parallel code
« on: February 14, 2009, 19:32 »
Hi friends,

I wanted to see how much ATK's parallelization speeds up calculations. I ran the Li-H2-Li two-probe example on a single core and then on six cores. The single-core run completed in 2:38 min, but the six-core run took 5:51 min. I am very surprised that parallelization made things slower. I watched the system performance during the runs, and in the parallel run most of the time seemed to be spent on MPI communication rather than on actual calculation. I concluded that for small systems MPI parallelization may have a negative effect. Is this true, or am I doing something wrong? Is there a trick that must be used for parallel calculations? I gave the command:

mpiexec -n 6 /ATK dir  /file name

after connecting all the cores using mpd. I used the Li.py given in the manual without modifying the code. Is there anything I need to add to tell ATK to parallelize?

Maresh

Offline Nordland

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • Posts: 812
  • Reputation: 18
Re: Parallel code
« Reply #1 on: February 14, 2009, 20:38 »
The way you are running ATK in parallel is correct, as far as I can see.

For Li-H2-Li there is not much to gain unless you choose parameters that are designed to scale well but are not needed for this system. However, if the system is just a bit more complicated, or extends in three dimensions instead of one, you will see quite nice scaling! :)

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • Posts: 5576
  • Country: dk
  • Reputation: 96
Re: Parallel code
« Reply #2 on: February 16, 2009, 00:30 »
As Nordland writes, this system does not gain much by being run in parallel, while others do (see e.g. the pdf file attached to this post).

However, even Li-H2-Li would scale (a bit) in parallel if the MPI nodes were separate computational nodes. In general, one should be very careful about MPI-parallelizing over multiple cores on a single machine. First of all, each MPI process allocates X MB of RAM, and since all N processes run on the same machine, they require N*X MB from that single machine. This severely limits how large a calculation you can run: even for smallish calculations (that use, say, 400 MB), putting 6 MPI processes on one machine needs 2400 MB and will exhaust the RAM even if you have 2 GB available. At that point the machine starts swapping, and performance degrades terribly.
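
The N*X estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and it assumes that every MPI process allocates the same amount of memory, that "2 GB" means 2048 MB, and that nothing else on the machine uses RAM.

```python
def total_mpi_memory_mb(n_processes, mb_per_process):
    """Memory needed when all N MPI processes share one machine (N * X)."""
    return n_processes * mb_per_process


def fits_in_ram(n_processes, mb_per_process, ram_mb):
    """True if the combined allocation stays within the given RAM.

    Assumption: the operating system and other programs are ignored.
    """
    return total_mpi_memory_mb(n_processes, mb_per_process) <= ram_mb


if __name__ == "__main__":
    ram_mb = 2 * 1024  # assumption: "2 GB" taken as 2048 MB
    for n in (1, 2, 4, 6):
        need = total_mpi_memory_mb(n, 400)
        print(n, "processes:", need, "MB, fits:", fits_in_ram(n, 400, ram_mb))
```

With 400 MB per process, 6 processes no longer fit, while the same 6 processes at roughly 50 MB each (the Li-H2-Li case) fit comfortably.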

You can, however, get poor performance anyway (the Li-H2-Li system probably needs only 50 MB, so memory is not the issue here), since the MPI processes fight for interconnect bandwidth and may also share the L2 cache (on Intel machines).

There is increasing and widespread awareness that parallelizing on multicores doesn't always improve the calculation speed (see e.g. http://www.hpcprojects.com/news/news_story.php?news_id=758).

However, ATK actually has (since version 2008.10) another way to take advantage of multicores, namely threading. MPI parallelization boosts performance when several k-points and energy points are used (energy points are always used for two-probe systems, on the contour integral), provided it is used right, i.e. preferably on a cluster. Threading, by contrast, speeds up the matrix operations on each MPI node.

An important point to note is that threading in ATK is not automatically enabled on Linux. See the manual.

So, does this mean that it is never a good idea to MPI-parallelize on a single machine? Well, it depends. I think this example clearly shows that putting as many as 6 MPI processes on a single node, even if it perhaps has 8 cores available, is not a good idea. 4 might work better, while 2 is probably a practical maximum. Also, the parallel behavior is very different for the SCF loop and for transmission calculations, and the linear scaling seen on clusters might also hold for 2-4 processes on a single machine (if the calculations are not too large).

Note however that by occupying the cores with MPI processes, you effectively kill threading, which can be very detrimental for performance.
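
To see why filling the cores with MPI processes leaves no room for threading, consider the rough sketch below. It assumes one thread per core and an even split of the cores between the MPI processes on a node. This is an illustrative model, not a description of how ATK actually schedules its threads, and the function name is hypothetical.

```python
def threads_per_mpi_process(cores_per_node, mpi_processes_per_node):
    """Cores left for threading inside each MPI process (at least 1).

    Assumption: one thread per core, cores shared evenly between processes.
    """
    if cores_per_node < 1 or mpi_processes_per_node < 1:
        raise ValueError("core and process counts must be positive")
    return max(1, cores_per_node // mpi_processes_per_node)


if __name__ == "__main__":
    for n_mpi in (1, 2, 4, 6, 8):
        print(n_mpi, "MPI processes on 8 cores ->",
              threads_per_mpi_process(8, n_mpi), "thread(s) each")
```

Under these assumptions, 6 MPI processes on an 8-core node leave a single thread per process, whereas 2 processes leave 4 threads each.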

Thus, generally the best strategy for ATK is to MPI-parallelize over as many cluster nodes as you can afford, and leave it to ATK to thread on the cores of each node. If you don't have a cluster, consider getting one :) Seriously, you can make an excellent computational cluster from 3-4 desktop machines, or of course you can invest in some nice blades; they're not that much more expensive. If that's not an option and you're stuck on a single multi-core machine, you should test how ATK behaves for your particular system (and don't use Li-H2-Li; test with the real geometries you are interested in ;) ), but don't expect to be able to use more than 2-4 MPI processes; most likely, the bigger benefit will come from threading instead.
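
When testing on your own machine, it helps to turn the measured wall-clock times into a speedup and a parallel efficiency. The sketch below uses the standard definitions (speedup = T1/TN, efficiency = speedup/N) and the timings reported in the first post of this thread; the helper names are illustrative only.

```python
def to_seconds(mm_ss):
    """Convert a 'minutes:seconds' string such as '2:38' to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel wall-clock time (greater than 1 is a gain)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processes):
    """Speedup per process; 1.0 would be ideal linear scaling."""
    return speedup(t_serial, t_parallel) / n_processes


if __name__ == "__main__":
    t1 = to_seconds("2:38")  # single-core run from the first post
    t6 = to_seconds("5:51")  # six-process run from the first post
    print("speedup: %.2f" % speedup(t1, t6))
    print("efficiency: %.3f" % efficiency(t1, t6, 6))
```

A speedup below 1, as for the Li-H2-Li timings (about 0.45), means the parallel run was slower than the serial one.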
« Last Edit: February 16, 2009, 00:41 by Anders Blom »

Offline xuzhuo06

  • New QuantumATK user
  • Posts: 1
  • Reputation: 0
Re: Parallel code
« Reply #3 on: October 14, 2009, 18:36 »
Hello, I'm new to ATK and quite puzzled about how to run parallel calculations effectively.
I have tried to run parallel calculations on blades in the following three ways ('......' stands for the paths, 'CLUSTERS' for the name of the clusters):

1) use openmp
First I wrote the file openmp.sh:
[
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
export MKL_DYNAMIC=FALSE
....../atk/bin/atk_exec ....../xyz.py
]
Then I executed:
[bsub -W 10:0 -a openmp -n 16 -R "span[hosts=1]" -q CLUSTERS -o xyz.out -e error ./openmp.sh]
It ran, but it was no faster than running on 4 parallel CPUs in a single node.

2) use mpi+openmp
I executed:
[export OMP_NUM_THREADS=4
bsub -W 10:0 -a intelmpi -n 16 -R "span[ptile=4]" -q CLUSTERS -o xyz.out -e error mpirun.lsf ....../atk_exec ....../xyz.py]
However, it failed and could not be run.

3) use mpi+openmp
I used the same openmp.sh file as in 1) and executed:
[bsub -W 10:0 -a intelmpi -n 16 -R "span[ptile=4]" -q CLUSTERS -o xyz.out -e error mpirun.lsf ./openmp.sh]
It failed too.

Should I use OpenMP or MPI+OpenMP? Should I use a .sh file? How should I write the command line to run a parallel calculation?
Any help would be much appreciated. Thank you very much!

XU Zhuo

Offline zh

  • Supreme QuantumATK Wizard
  • Posts: 1141
  • Reputation: 24
Re: Parallel code
« Reply #4 on: October 15, 2009, 06:18 »

The commands in the 2nd and 3rd approaches are not correct for using MPI+OpenMP. Please refer to the manual section on launching MPI jobs:
http://quantumwise.com/documents/manuals/ATK-2008.10/chap.parallel.html#sect1.parallel.launching


mpiexec -n 2 $ATK_BIN_DIR/atk /home/myusername/test_mpi.py

In order to figure out how to submit parallel jobs on your computer system, you should first find out the following from your system administrator:
i) Which MPI implementation is installed on your system? MPICH1, MPICH2, Open MPI, or something else?
ii) Which batch job submission system does it use? PBS, LSF, or something else?
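
As a sketch of how the pieces discussed in this thread fit together, the snippet below assembles the mpiexec command line quoted in this reply together with the OMP_NUM_THREADS and MKL_NUM_THREADS variables from the earlier openmp.sh script. The function name and the /opt/atk/bin directory are hypothetical placeholders. Whether a given ATK version honors these variables, whether your MPI launcher forwards them to all processes, and how the command has to be wrapped for a batch system such as LSF must be checked against the manual and with your administrator.

```python
import os
import posixpath


def build_launch(n_mpi, atk_bin_dir, script, threads_per_process):
    """Return (argv, env) for 'mpiexec -n N <atk_bin_dir>/atk <script>'.

    Assumption: the thread count is controlled through the two environment
    variables used in the openmp.sh script earlier in the thread.
    """
    argv = ["mpiexec", "-n", str(n_mpi),
            posixpath.join(atk_bin_dir, "atk"), script]
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    env["OMP_NUM_THREADS"] = str(threads_per_process)
    env["MKL_NUM_THREADS"] = str(threads_per_process)
    return argv, env


if __name__ == "__main__":
    # /opt/atk/bin is a made-up install location used only for illustration.
    argv, env = build_launch(2, "/opt/atk/bin",
                             "/home/myusername/test_mpi.py", 4)
    print(" ".join(argv))
    # To actually launch: subprocess.run(argv, env=env, check=True)
```

Keeping the command as a list and the environment as a separate dictionary makes it easy to print and inspect both before submitting anything to a queue.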

« Last Edit: October 15, 2009, 06:23 by zh »