Author Topic: How to Do A Parallel Calculation?(v 2010.8)


Offline windysoul

  • Regular QuantumATK user
  • **
  • Posts: 6
  • Reputation: 0
How to Do A Parallel Calculation?(v 2010.8)
« on: July 22, 2010, 11:41 »
I'm quite puzzled! I read the manual "Parallel calculations using ATK" for 2008.10, but I get an error from this test_mpi.py:
Code
from ATK.MPI import *
 
if processIsMaster():
    print '# Master node'
else:
    print '# Slave node'
The output looks like this:
Code
+------------------------------------------------------------------------------+
|                                                                              |
| Atomistix ToolKit 10.8.0                                                     |
|                                                                              |
+------------------------------------------------------------------------------+
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from ATK.MPI import *
ImportError: No module named ATK.MPI
+------------------------------------------------------------------------------+
|                                                                              |
| Atomistix ToolKit 10.8.0                                                     |
|                                                                              |
+------------------------------------------------------------------------------+
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from ATK.MPI import *
ImportError: No module named ATK.MPI
The 2010.8 release notes say it is "Fully parallel for DFT and SE, on two levels: MPI parallelization over clusters, and OpenMP threading for multi-core machines". So I would like to know the differences between the two versions (2008.10 and 2010.8). I have installed ATK 2010.8 on a single node with multiple CPUs; which mode should I choose, and which MPI implementation should I use (OpenMPI or MPICH2)? That is, should I run
Code
[root@node1 ~]# mpiexec -n 2 atkpython test_mpi.py
or
Code
[root@node1 ~]# MKL_NUM_THREADS=2 MKL_DYNAMIC=FALSE atkpython test_mpi.py

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #1 on: July 22, 2010, 12:49 »
The only change needed from 2008.10 to 10.8 is to remove the line "from ATK.MPI import processIsMaster"; see also the upgrade guide. You do not need to choose between MPI and OpenMP, they work together. So ideally, for two dual-core machines (two separate machines!), try
Code
mpiexec -n 2 -env MKL_NUM_THREADS 2 -env MKL_DYNAMIC FALSE atkpython test_mpi.py
Of course, if you only have a single dual-core machine, then yes, you should choose between the two options you have indicated. The first one uses more memory but might be faster for, e.g., transmission calculations, and is suitable for smallish systems, while the latter is better for larger systems (mostly for memory reasons).
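For reference, a minimal sketch of the test script updated along these lines for 10.8, assuming (as stated above) that processIsMaster() is available in atkpython without the old ATK.MPI import:
Code
# test_mpi.py for ATK 10.8: the "from ATK.MPI import ..." line has been removed
if processIsMaster():
    print '# Master node'
else:
    print '# Slave node'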

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #2 on: July 23, 2010, 16:55 »
Hello Mr. Blom,

I have a similar question, and maybe I know the answer already. We have a small Opteron blade cluster here with 4 quad-cores on every node (= 16 cores per node). We use PBS as the job-scheduling system. We managed to write a PBS script where we put something like
#PBS -l nodes=n:ppn=m,walltime=...
to define the resources for the given calculation (n nodes and m cores per node) and also do some other things like starting the mpd daemons. I understand that it is probably a bad idea (but it is what I do right now) to set n=1 and m=16 (one full node) when the calculation needs a lot of memory (by the way, we have 32 GB per node). Until now I have not changed the MKL settings (so threading is probably disabled by default).

My question:
Let's assume we have a big calculation where only 4 jobs per node fit in memory, we want 16 parallel jobs, and we want the best performance (obviously).

- Do we need threading at all or is it better to write:
#PBS -l nodes=4:ppn=4,walltime=...
mpiexec -n 16 $ATK_BIN_DIR/atk $PBS_O_WORKDIR/$PBS_JOBNAME.py

- If we activate threading, should we use
#PBS -l nodes=AAA:ppn=BBB,walltime=...
mpiexec -n CCC -env MKL_NUM_THREADS DDD -env MKL_DYNAMIC FALSE ...

and what would be good choices for AAA, BBB, CCC, DDD? I would use 1, 16, 4, 4. I think the best would be 4 MPI jobs on the same node, each running 4 OpenMP threads on the same quad-core. Is this possible?

Thanks for an answer.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #3 on: July 23, 2010, 23:06 »
I can see that you understand all the important points, especially the one about memory per node. That is, each MPI process uses X amount of RAM, so with n=1 and m=16 you need 16*X on that node, and each process is in fact limited to 2 GB (with 32 GB in total).

However, you don't write how many nodes you have. I will assume that you have 3 nodes available; I hope you can follow the logic and adapt it to your actual configuration.

Let's consider some scenarios.

* BIG JOB, where you can only fit one MPI copy per node in memory (i.e. the job uses, say, 17 GB of RAM)
AAA=3
BBB=16
CCC=3
DDD=4
That gives you CCC=3 MPI processes, each running on a separate node. However, on each node only one CPU will be active. Honestly speaking, I have not had the luxury of running on a 4-CPU quad-core machine, so perhaps DDD=16 is possible and better; please test! :)

* MEDIUM JOB, where you can only fit 4 MPI copies per node in memory (i.e. the job uses, say, 7 GB of RAM)
AAA=3
BBB=16
CCC=12
DDD=4
That gives you CCC=12 MPI processes, running on AAA=3 machines, and still with full threading on each quadcore.

* SMALL JOB, where you can fit as many as 16 MPI copies per node in memory (i.e. the job uses at most 2 GB of RAM)
AAA=3
BBB=16
CCC=48
That gives you CCC=48 MPI processes, running on AAA=3 machines. There is no room for threading, so leave it at default. This would be particularly useful for jobs with many k-points (or transmission calculations), again provided they fit in memory. However, it might be inefficient for jobs with few k-points, since not all parts of ATK are MPI parallelized.

The general rule is that CCC scales with the number of nodes AAA (in the small-job case, AAA*BBB = CCC). So if you actually only have one node available, just rescale AAA to 1 and all of the above will still apply.

Also, BBB=16 is important so that the queue does not attempt to load any other processes onto your nodes!

In the end it comes down to testing for your specific jobs, because how efficient MPI is compared to OpenMP also depends on the k-point sampling, etc.

So, after a long story: your numbers 1,16,4,4 seem reasonable :)
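To make the medium-job scenario concrete, here is a minimal PBS sketch under the assumptions above (MPICH2-style mpiexec, ATK 10.8's atkpython on the PATH; the walltime and script name are just placeholders):
Code
#!/bin/bash
#PBS -l nodes=3:ppn=16,walltime=12:00:00
cd $PBS_O_WORKDIR
# AAA=3 nodes, BBB=16 reserved cores per node, CCC=12 MPI processes, DDD=4 MKL threads each
mpiexec -n 12 -env MKL_NUM_THREADS 4 -env MKL_DYNAMIC FALSE atkpython job.py > job.log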

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #4 on: July 29, 2010, 19:44 »
I tried with OpenMP:

mpiexec -n 4 -env MKL_NUM_THREADS 4 -env MKL_DYNAMIC FALSE $ATK_BIN_DIR/atk $PBS_O_WORKDIR/$PBS_JOBNAME.py

and got an error:

# sc  0 : Fermi Energy =    0.00000 Ry
#-------------------------------------------------------------------------------
# Mulliken Population for sc 0
#-------------------------------------------------------------------------------
  [ here I skip some output ]
#-------------------------------------------------------------------------------
# Total Charge =  297.00000
#-------------------------------------------------------------------------------
# Date: 19:26:10 07-29-2010
rank 1 in job 1  arm05_55875   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

This is the first time I have seen this error. Without the -env settings it works. $PBS_O_WORKDIR is my current directory, where I have an ATK script with the same name as the PBS job (which I set at the beginning of the script). arm05 is the name of the node where the job should run. My mpd node file (called mpd.nodes) has a single line:

arm05.cluster:16

and I use

export MPD_CON_EXT=${PBS_JOBID}
mpdboot -f mpd.nodes -n $NNODES --remcons

to start the mpd daemons (with $NNODES==1 in this case). I can send the whole PBS script if necessary.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #5 on: July 29, 2010, 22:24 »
I'm honestly not sure why the -env settings would have an influence, but the error indicates you are running out of memory. Since you run 4 processes on one machine, as mentioned above you only have 1/4 of the total RAM to play with per process. Is there any chance some other user is running on the same computer? You should reserve all 16 ppn, if you don't already.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #6 on: July 29, 2010, 22:29 »
Also, it's probably a better idea not to keep your node file static (unless you know with 100% certainty that no other user may be running anything on it). Use $PBS_NODEFILE instead (you need to "uniq" it, however; see http://qcd.phys.cmu.edu/QCDcluster/pbs/run_mpich2_job.html for a lot of nice things!).
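A minimal sketch of that idea, assuming the same MPICH2 mpd tools used earlier in this thread (the file and variable names are just examples):
Code
# Build the mpd node file from the nodes PBS actually assigned to this job
sort -u $PBS_NODEFILE > mpd.nodes        # one line per node ("uniq" the per-core entries)
NNODES=$(wc -l < mpd.nodes)
export MPD_CON_EXT=${PBS_JOBID}
mpdboot -f mpd.nodes -n $NNODES --remcons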



Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #7 on: July 30, 2010, 09:17 »
In fact I do create the node file with uniq. And it also works with 16 MPI processes without threading (so memory should not be the problem). The first lines of my PBS script are:

#!/bin/bash
#PBS -N CopperWireTest02
#PBS -l nodes=1:ppn=16,walltime=60:00:00
#PBS -S /bin/bash
#PBS -j oe
#PBS -k o

The whole script is much longer, because it prints lots of information and also contains a loop where I test for a free ATK license every minute until one is free (we only have one master license here). This is convenient when I want to run 2 or more jobs overnight. Anyway, I attached my script (a version without the -env settings). Maybe you will find it helpful or discover mistakes (I used a script I found somewhere on the internet as a starting point).

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #8 on: August 2, 2010, 12:01 »
This is a very good PBS script, I hope it will be useful for other people too.

I think the memory issue is simply related to running too many processes on the same machine.

PS: I note that you run "atk", as relevant for ATK 2008.10. You may want to upgrade to 10.8! ;)

PS2: The nice "--version" trick will unfortunately not work in 10.8, because a "master" license is only checked out when you actually enter the self-consistent loop. This was changed to make it easier for people to use basic functions in ATK, like testing the setup of geometries, without consuming a full master license.

There is an alternative to this, called license queuing, but it is currently not enabled.

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #9 on: August 9, 2010, 20:14 »
Hello,

I moved to the new version of ATK and it works in principle. However, I still have to mention a few things:

1.) The above script has a bug. One should add -machinefile nodes.mpd to the mpiexec command (otherwise, when using more than 1 node, jobs do not get distributed over the nodes as expected); a corrected launch line is sketched at the end of this post. So the machinefile is the same as the one used for mpdboot; I hope this is right (it seems to work). The fact that the --version trick no longer works prevented me from posting a new script. I don't want to run the actual job within the test loop...

2.) It is VERY nice that the master license is only checked out when actually used. I really missed that feature when debugging scripts with the old version. What is the reason that license queuing is not activated?

3.) I still have some problems with OpenMP threading. I very often get those
rank 1 in job 1  arm05_55875   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
errors, and it is NOT a problem of too little memory; I tested it. It really works with e.g. 16 (!!!) jobs on 1 node without threading, but fails with e.g. 4 jobs and 4 threads per job. It does not always produce an error, only sometimes. I watched the activity with the top command in Linux: threading really does take place, because I saw a CPU usage of >1000% per job, which I know is the true sign of threading.

Once, and only once, I got a very long and strange error (see attachment). Maybe this helps. Maybe I should not mix MPI and OpenMP on one node and should only start one job per node with 16 threads? I will test it.

4.) I also got a signal 15 error once:
rank 18 in job 1  arm07_60102   caused collective abort of all ranks
  exit status of rank 18: killed by signal 15

From your experienced point of view: is it possible that this is a problem with ATK? Or would you guess it is a problem with our cluster hardware/software (for example, our PBS system was installed not so long ago...)?

5.) What is the recommended version of MPICH2 for ATK 10.8? Right now I use mpich2-1.0.8, but we also have mpich2-1.2.1p1 here.

Thank you very much for your help.
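For reference, a sketch of the corrected launch line from point 1, reusing the mpd.nodes file and the $ATK_BIN_DIR / $PBS_O_WORKDIR variables from the script above (it assumes $ATK_BIN_DIR now points to the 10.8 atkpython; the process count is just an example):
Code
mpiexec -machinefile mpd.nodes -n 16 $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py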

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #10 on: August 9, 2010, 23:08 »
Dear ziand, glad to hear 10.8 works nicely, and thanks for the feedback.
1. Whether or not to use the machinefile argument depends a bit on your PBS setup; it is possible to have PBS take over the machine allocation. Correct, the --version trick will not work, since it doesn't use any license at all (related to 2.). You can use another (much better!) trick to check for an available license with the new version:
Code
lmxendutil -licstat | grep -i -A5 dftmaster | grep used | cut -f1 -d" "
will return the number of ATK master licenses currently in use. You can compare this number to the number of licenses you have (1?) to see whether there are any free licenses available for running. Like this:
Code
declare -i available_lic
declare -i used_lic
declare -i max_lic
available_lic=0
max_lic=`lmxendutil -licstat | grep -i -A5 dftmaster | grep used | cut -f3 -d" "`
# Stay in this loop until a license is available
while [ "$available_lic" -eq 0 ]; do
    used_lic=`lmxendutil -licstat | grep -i -A5 dftmaster | grep used | cut -f1 -d" "`
    # declare -i makes this a plain integer arithmetic assignment
    available_lic=max_lic-used_lic
    # Check again once a minute rather than hammering the license server in a tight loop
    [ "$available_lic" -eq 0 ] && sleep 60
done
# Now we know there's a license available
2. Yes, it's very useful :) Queuing might come if people request it; we just didn't have a reason to activate it. But with the above script, one can work around it.
3. Of course, Intel recommends using "True" for MKL_DYNAMIC, so that is the safer option. Maybe with False it gets confused regarding the maximum number of threads... Did you also try setting MKL_NUM_THREADS to 4, or did you leave it at the default (unspecified)?
4. That's different from what we usually see (which is, indeed, signal 9, for memory). Hopefully it doesn't appear again!
5. ATK doesn't care too much about the MPICH2 version; 1.0.8 works fine. But the MPICH2 release notes mention some bug fixes which look serious enough to motivate the upgrade (it shouldn't disturb anything on your system; it's a very safe upgrade), so in the case of MPICH2 the recommendation is to always use the latest version.

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #11 on: August 10, 2010, 09:13 »
I used
MKL_NUM_THREADS 4
and
MKL_DYNAMIC FALSE.

If I set MKL_DYNAMIC True
and leave MKL_NUM_THREADS unspecified,
does it really use the right number of threads if I have more than one MPI-job on a node?

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #12 on: August 10, 2010, 11:21 »
According to Intel, using the default values should provide the best performance. In my experience, running in serial, this is indeed true on Windows but not on Linux, where it will simply use only one thread. That's why we recommend setting MKL_DYNAMIC to FALSE.

Whether or not you set MKL_NUM_THREADS seems to matter less, although I have only tested this (extensively) on simple dual-core machines in serial.

MPI operation adds a second dimension to this. Intel claims that

Quote
... if Intel MKL is called in a parallel region, it will use only one thread by default. If you want the library to use nested parallelism, and the thread within a parallel region is compiled with the same OpenMP compiler as Intel MKL is using, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads.

(MKL User's Guide, available from http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/)

I'm honestly not sure what a "parallel region" is...

If you have benchmarks for various configurations of MPI nodes, threading, etc, it would be great if these could be shared with the community.
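If anyone wants to produce such benchmarks, a quick sketch of a timing loop is given below (the script name, process counts, and thread counts are placeholders; it assumes the same mpiexec -env syntax used earlier in this thread):
Code
# Time the same ATK script for a few MPI x OpenMP combinations
for NPROC in 2 4 8; do
    for NTHREADS in 1 2 4; do
        START=$(date +%s)
        mpiexec -n $NPROC -env MKL_NUM_THREADS $NTHREADS -env MKL_DYNAMIC FALSE \
            atkpython benchmark_system.py > bench_${NPROC}x${NTHREADS}.log
        echo "$NPROC MPI processes x $NTHREADS threads: $(( $(date +%s) - START )) s"
    done
done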

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #13 on: August 10, 2010, 18:27 »
Indeed, I made some quick and dirty timing tests a few days ago, but I'm afraid the system is FAR TOO SMALL to be meaningful. I didn't want to spend too much time on it, but rather just see how it performs in principle. I used a "carbon molecule" (a single unit cell of a (5,5) CNT) with some vacuum all around. I really did only a few clicks in VNL and then ran it. I think I will provide a more serious test later (or you can post a script you think is interesting, as long as the whole test is runnable in one night). My script:
Code
###############################################################
# Bulk configuration
###############################################################

# Set up lattice
vector_a = [16.7841067732, 0.0, 0.0]*Angstrom
vector_b = [0.0, 16.7841067732, 0.0]*Angstrom
vector_c = [0.0, 0.0, 2.46100171044]*Angstrom
lattice = UnitCell(vector_a, vector_b, vector_c)

# Define elements
elements = [Carbon, Carbon, Carbon, Carbon, Carbon, Carbon, Carbon, Carbon,
            Carbon, Carbon, Carbon, Carbon, Carbon, Carbon, Carbon, Carbon,
            Carbon, Carbon, Carbon, Carbon]

# Define coordinates
cartesian_coordinates = [[ 11.78410677,   8.39205339,   0.        ],
                         [ 11.49084835,   9.77172579,   0.        ],
                         [ 11.13628222,  10.38585234,   1.23050086],
                         [ 10.08808008,  11.32965779,   1.23050086],
                         [  9.44025553,  11.61808786,   0.        ],
                         [  8.03748726,  11.76552475,   0.        ],
                         [  7.34385124,  11.61808786,   1.23050086],
                         [  6.12232665,  10.91284031,   1.23050086],
                         [  5.64782455,  10.38585234,   0.        ],
                         [  5.0741245 ,   9.09730094,   0.        ],
                         [  5.        ,   8.39205339,   1.23050086],
                         [  5.29325842,   7.01238098,   1.23050086],
                         [  5.64782455,   6.39825443,   0.        ],
                         [  6.69602669,   5.45444898,   0.        ],
                         [  7.34385124,   5.16601891,   1.23050086],
                         [  8.74661951,   5.01858202,   1.23050086],
                         [  9.44025553,   5.16601891,   0.        ],
                         [ 10.66178013,   5.87126646,   0.        ],
                         [ 11.13628222,   6.39825443,   1.23050086],
                         [ 11.70998227,   7.68680583,   1.23050086]]*Angstrom

# Set up configuration
bulk_configuration = BulkConfiguration(
    bravais_lattice=lattice,
    elements=elements,
    cartesian_coordinates=cartesian_coordinates
    )

###############################################################
# Calculator
###############################################################
numerical_accuracy_parameters = NumericalAccuracyParameters(
    electron_temperature=3000.0*Kelvin,
    k_point_sampling=(3, 3, 3),
    )

calculator = LCAOCalculator(
    numerical_accuracy_parameters=numerical_accuracy_parameters,
    )

bulk_configuration.setCalculator(calculator)
nlprint(bulk_configuration)
bulk_configuration.update()
nlsave('CNT_5_5.nc', bulk_configuration)
NOTES: no errors occurred; we have 24 slave licenses. The PBS resources nodes=x:ppn=y define the PBS environment (reserved so that no other users end up on the nodes), while usePPN=z defines how many MPI processes per node I actually use -> important! (the usePPN is handled by some custom PBS scripting). NOTE: when OpenMP is used, MKL_NUM_THREADS is given; otherwise no OpenMP is used.

Results, short story:
Code
nodes   usePPN   MKL_NUM_THREADS   totalTime (seconds)
  -        -            -               133.86    (my dual-core lab notebook)
  2        2        no OpenMP            84.81
  2       12        no OpenMP            48.09
  2        4            4                60.69
  4        4            4                51.84
  4        2            8                64.01
  4        1           16                82.54
  6        4            4                48.17

Long story (full timing breakdown for each configuration):
  • Notebook (Dualcore, Windows):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Valence Density         :      38.73 s       2.98 s      28.94%
Real space integral     :      34.99 s       2.69 s      26.14%
Diagonalization         :      34.45 s       2.87 s      25.74%
Difference Density      :       4.30 s       0.33 s       3.21%
Mixing                  :       3.09 s       0.26 s       2.31%
Hartree Potential       :       1.83 s       0.14 s       1.37%
Core Density            :       1.27 s       0.10 s       0.95%
Basis Set Generation    :       0.88 s       0.88 s       0.65%
Exchange-Correlation    :       0.84 s       0.06 s       0.63%
Real Space Basis        :       0.63 s       0.63 s       0.47%
Neutral Atom Potential  :       0.31 s       0.31 s       0.23%
Setting Density Matrix  :       0.01 s       0.01 s       0.01%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :     133.86 s
  • Cluster (nodes=2:ppn=16, usePPN=2):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Real space integral     :      23.41 s       1.80 s      27.60%
Valence Density         :      19.43 s       1.49 s      22.91%
Diagonalization         :      15.01 s       1.25 s      17.70%
Difference Density      :       6.17 s       0.47 s       7.27%
Mixing                  :       3.85 s       0.32 s       4.54%
Hartree Potential       :       3.00 s       0.23 s       3.54%
Core Density            :       1.87 s       0.14 s       2.20%
Exchange-Correlation    :       1.43 s       0.11 s       1.68%
Real Space Basis        :       1.18 s       1.18 s       1.39%
Basis Set Generation    :       0.80 s       0.80 s       0.94%
Neutral Atom Potential  :       0.46 s       0.46 s       0.55%
Setting Density Matrix  :       0.01 s       0.01 s       0.02%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :      84.81 s
  • Cluster (nodes=2:ppn=16, usePPN=12):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Difference Density      :       6.88 s       0.53 s      14.31%
Valence Density         :       6.80 s       0.52 s      14.15%
Real space integral     :       5.64 s       0.43 s      11.73%
Diagonalization         :       5.47 s       0.46 s      11.37%
Hartree Potential       :       4.22 s       0.32 s       8.77%
Mixing                  :       3.90 s       0.33 s       8.11%
Core Density            :       2.29 s       0.18 s       4.77%
Exchange-Correlation    :       1.44 s       0.11 s       2.99%
Real Space Basis        :       1.15 s       1.15 s       2.40%
Basis Set Generation    :       0.78 s       0.78 s       1.62%
Neutral Atom Potential  :       0.49 s       0.49 s       1.02%
Setting Density Matrix  :       0.01 s       0.01 s       0.03%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :      48.09 s
  • Cluster (OpenMP, nodes=2:ppn=16, usePPN=4, MKL_NUM_THREADS 4):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Real space integral     :      12.51 s       0.96 s      20.61%
Valence Density         :      10.81 s       0.83 s      17.81%
Diagonalization         :       9.71 s       0.81 s      16.00%
Difference Density      :       6.19 s       0.48 s      10.20%
Mixing                  :       3.84 s       0.32 s       6.33%
Hartree Potential       :       2.96 s       0.23 s       4.88%
Core Density            :       1.91 s       0.15 s       3.15%
Exchange-Correlation    :       1.44 s       0.11 s       2.37%
Real Space Basis        :       1.16 s       1.16 s       1.92%
Basis Set Generation    :       0.79 s       0.79 s       1.30%
Neutral Atom Potential  :       0.51 s       0.51 s       0.83%
Setting Density Matrix  :       0.01 s       0.01 s       0.02%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :      60.69 s
  • Cluster (OpenMP, nodes=4:ppn=16, usePPN=4, MKL_NUM_THREADS 4):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Diagonalization         :       8.30 s       0.69 s      16.02%
Valence Density         :       7.29 s       0.56 s      14.07%
Real space integral     :       6.89 s       0.53 s      13.29%
Difference Density      :       6.63 s       0.51 s      12.79%
Mixing                  :       3.90 s       0.32 s       7.52%
Hartree Potential       :       3.56 s       0.27 s       6.87%
Core Density            :       2.10 s       0.16 s       4.04%
Exchange-Correlation    :       1.45 s       0.11 s       2.79%
Real Space Basis        :       1.16 s       1.16 s       2.24%
Basis Set Generation    :       0.80 s       0.80 s       1.54%
Neutral Atom Potential  :       0.51 s       0.51 s       0.98%
Setting Density Matrix  :       0.01 s       0.01 s       0.02%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :      51.84 s
  • Cluster (OpenMP, nodes=4:ppn=16, usePPN=2, MKL_NUM_THREADS 8):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Real space integral     :      12.59 s       0.97 s      19.66%
Valence Density         :      12.25 s       0.94 s      19.13%
Diagonalization         :       9.99 s       0.83 s      15.61%
Difference Density      :       6.40 s       0.49 s      10.00%
Mixing                  :       3.93 s       0.33 s       6.14%
Hartree Potential       :       3.16 s       0.24 s       4.94%
Core Density            :       2.00 s       0.15 s       3.12%
Exchange-Correlation    :       1.43 s       0.11 s       2.24%
Real Space Basis        :       1.17 s       1.17 s       1.82%
Basis Set Generation    :       0.81 s       0.81 s       1.27%
Neutral Atom Potential  :       0.51 s       0.51 s       0.80%
Setting Density Matrix  :       0.01 s       0.01 s       0.02%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :      64.01 s
  • Cluster (OpenMP, nodes=4:ppn=16, usePPN=1, MKL_NUM_THREADS 16):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Real space integral     :      21.15 s       1.63 s      25.62%
Valence Density         :      19.34 s       1.49 s      23.43%
Diagonalization         :      12.86 s       1.07 s      15.58%
Difference Density      :       6.20 s       0.48 s       7.51%
Mixing                  :       4.05 s       0.34 s       4.90%
Hartree Potential       :       2.57 s       0.20 s       3.12%
Core Density            :       1.87 s       0.14 s       2.27%
Exchange-Correlation    :       1.44 s       0.11 s       1.74%
Real Space Basis        :       1.18 s       1.18 s       1.43%
Basis Set Generation    :       0.80 s       0.80 s       0.97%
Neutral Atom Potential  :       0.46 s       0.46 s       0.56%
Setting Density Matrix  :       0.01 s       0.01 s       0.02%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :      82.54 s
  • Cluster (OpenMP, nodes=6:ppn=16, usePPN=4, MKL_NUM_THREADS 4):
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Diagonalization         :       7.44 s       0.62 s      15.45%
Valence Density         :       6.67 s       0.51 s      13.85%
Real space integral     :       6.43 s       0.49 s      13.35%
Difference Density      :       6.37 s       0.49 s      13.22%
Mixing                  :       3.93 s       0.33 s       8.16%
Hartree Potential       :       3.09 s       0.24 s       6.41%
Core Density            :       1.95 s       0.15 s       4.04%
Exchange-Correlation    :       1.44 s       0.11 s       2.99%
Real Space Basis        :       1.17 s       1.17 s       2.43%
Basis Set Generation    :       0.80 s       0.80 s       1.66%
Neutral Atom Potential  :       0.51 s       0.51 s       1.05%
Setting Density Matrix  :       0.01 s       0.01 s       0.03%
Hubbard Term            :       0.00 s       0.00 s       0.00%
--------------------------------------------------------------------------------
Total                   :      48.17 s

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
Re: How to Do A Parallel Calculation?(v 2010.8)
« Reply #14 on: August 17, 2010, 11:13 »
I recently found out that it is in principle not a good idea to use MPI and OpenMP on one node simultaneously. I regularly get strange errors:
Code
+------------------------------------------------------------------------------+
|                                                                              |
| Left Electrode Calculation  [Started Thu Aug 12 19:30:20 2010]               |
|                                                                              |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Eigenvalues    : *** glibc detected *** /opt/QuantumWise/atk-10.8.0/atkpython/bin/atkpython_exec: double free or corruption (!prev): 0x000000002490acf0 ***
*** glibc detected *** /opt/QuantumWise/atk-10.8.0/atkpython/bin/atkpython_exec: double free or corruption (!prev): 0x000000000ad3dd60 ***
*** glibc detected *** /opt/QuantumWise/atk-10.8.0/atkpython/bin/atkpython_exec: double free or corruption (!prev): 0x0000000022b7ccf0 ***
*** glibc detected *** /opt/QuantumWise/atk-10.8.0/atkpython/bin/atkpython_exec: double free or corruption (!prev): 0x000000000b937f90 ***
/opt/QuantumWise/atk-10.8.0/atkpython/bin/atkpython: line 3: 22617 Aborted                 PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
*** glibc detected *** /opt/QuantumWise/atk-10.8.0/atkpython/bin/atkpython_exec: double free or corruption (!prev): 0x000000001e8a1220 ***
*** glibc detected *** /opt/QuantumWise/atk-10.8.0/atkpython/bin/atkpython_exec: free(): invalid next size (normal): 0x000000000f3d3d60 ***
rank 20 in job 1  arm07_37313   caused collective abort of all ranks
  exit status of rank 20: killed by signal 9
rank 7 in job 1  arm07_37313   caused collective abort of all ranks
  exit status of rank 7: killed by signal 9
rank 1 in job 1  arm07_37313   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
My solution for now is:
- For smaller calculations: only use MPI (where I can afford to run lots of jobs on one node).
- For big calculations: here we are limited by RAM per node anyway, so I start one MPI job per node and activate OpenMP with as many threads as I have cores on my nodes.
This seems to work quite well (the small drawback is that we do not have enough nodes to fully utilize our slave licenses). One question: could you comment on the ATK 10.8 timing output?
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Density Matrix (EQ)     :    2690.05 s      26.90 s      39.97% |===================|
Mixing                  :    1344.78 s       8.10 s      19.98% |=========|
Setting Density Matrix  :    1093.89 s     364.63 s      16.25% |=======|
Diagonalization         :     545.68 s       8.27 s       8.11% |===|
Valence Density         :     303.11 s       1.79 s       4.50% |=|
Real Space Integral     :     134.46 s       1.33 s       2.00% ||
Difference Density      :     114.08 s       1.68 s       1.69% ||
Core Density            :      71.58 s       1.05 s       1.06% ||
Real space integral     :      64.87 s       0.95 s       0.96% |
Hartree Potential       :      32.50 s       0.19 s       0.48% |
Exchange-Correlation    :      28.97 s       0.17 s       0.43% |
Self-Energies           :      28.38 s       0.28 s       0.42% |
Real Space Basis        :      14.83 s       4.94 s       0.22% |
Basis Set Generation    :       6.60 s       2.20 s       0.10% |
Neutral Atom Potential  :       3.10 s       1.55 s       0.05% |
Hubbard Term            :       0.00 s       0.00 s       0.00% |
Density Matrix (NEQ)    :       0.00 s       0.00 s       0.00% |
--------------------------------------------------------------------------------
Total                   :    6730.90 s (1h52m10.90s)
Which parts are parallelized, and what kind of parallelization (MPI / OpenMP) do they use? The above timing was done with MPI only (3 nodes, 8 MPI jobs per node = 24 jobs). Below I used 3 MPI jobs (1 on each node) and 16 OpenMP threads per node.
Code
Timing:                          Total     Per Step        %
--------------------------------------------------------------------------------
Density Matrix (EQ)     :   22207.88 s     222.08 s     100.37% |=================================================|
Diagonalization         :    2448.43 s      37.10 s      11.07% |=====|
Setting Density Matrix  :    2035.19 s     678.40 s       9.20% |====|
Mixing                  :    1342.02 s       8.08 s       6.07% |==|
Valence Density         :     771.22 s       4.56 s       3.49% |=|
Real Space Integral     :     418.80 s       4.15 s       1.89% ||
Real space integral     :     252.46 s       3.71 s       1.14% ||
Self-Energies           :     216.34 s       2.16 s       0.98% |
Difference Density      :     128.48 s       1.89 s       0.58% |
Core Density            :      71.83 s       1.06 s       0.32% |
Exchange-Correlation    :      29.27 s       0.17 s       0.13% |
Hartree Potential       :      28.84 s       0.17 s       0.13% |
Real Space Basis        :      15.00 s       5.00 s       0.07% |
Basis Set Generation    :       6.62 s       2.21 s       0.03% |
Neutral Atom Potential  :       3.46 s       1.73 s       0.02% |
Hubbard Term            :       0.00 s       0.00 s       0.00% |
Density Matrix (NEQ)    :       0.00 s       0.00 s       0.00% |
--------------------------------------------------------------------------------
Total                   :   22125.98 s (6h08m45.98s)
The setup I used is the non-converging TwoProbe system described here: http://quantumwise.com/forum/index.php?topic=762.0