Author Topic: Parallel calculation  (Read 1601 times)

Offline Dmitry

  • Regular QuantumATK user
  • **
  • Posts: 6
  • Country: ru
  • Reputation: 0
Parallel calculation
« on: July 13, 2012, 11:07 »
We are calculating a graphene molecule with 92 atoms by DFT on a cluster. The cluster has 3 nodes, each with 4 12-core processors. What type of parallelization should we use for the best acceleration? Our input file is below (the atom xyz coordinates are cut out):
Code
molecule_configuration = MoleculeConfiguration(xyz_format=
"""60
geometry
C         -4.27910    -2.53300     0.00000
…
H          8.21480     2.31410     0.00000
H          4.49440     4.49970    -0.00000
H          6.97860     4.47250    -0.00000""")

# -------------------------------------------------------------
# Calculator
# -------------------------------------------------------------
calculator = LCAOCalculator()

molecule_configuration.setCalculator(calculator)
nlprint(molecule_configuration)
molecule_configuration.update()
nlsave('nh3.nc', molecule_configuration)

# -------------------------------------------------------------
# Molecular energy spectrum
# -------------------------------------------------------------
molecular_energy_spectrum = MolecularEnergySpectrum(
    configuration=molecule_configuration,
    energy_zero_parameter=FermiLevel,
    projection_list=ProjectionList(All)
    )
nlsave('c42h18.nc', molecular_energy_spectrum)
nlprint(molecular_energy_spectrum)
We tried to run it with four MPI processes via MVAPICH (
Code
mpiexec -n 4 atkpython graphen.py > out.out
) on one node, but instead of a speedup the run time increased from 32 to 34 minutes. Could you please advise us?

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5428
  • Country: dk
  • Reputation: 89
Re: Parallel calculation
« Reply #1 on: July 13, 2012, 12:31 »
You should not expect any real speedup from MPI in ATK for molecule calculations, since the main MPI-parallelized parts of the code are the k-point sampling and the energy sampling on the complex contour for devices.

What you can get, however, is a nice threading advantage when diagonalizing large dense matrices. So your best option is to run in serial on one node (skip "mpiexec"), but make sure ATK has access to all 48 cores on the node. In most cases it will figure this out by itself, but if not, you can control it via "export MKL_NUM_THREADS=48".
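For example, a minimal job script could look like this (just a sketch, assuming a bash shell; the script name is taken from your post):
Code
# Run ATK in serial; let MKL thread the dense diagonalization
# across all 48 cores of the node.
export MKL_NUM_THREADS=48
atkpython graphen.py > out.out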

The general rule is to never use more than one MPI process per socket; otherwise the MPI processes will compete for CPU cycles and for cache/RAM bus access. Your nodes have 4 sockets, so -n 4 is fine in theory. The reason you see a slowdown is that there is always a cost for MPI communication, and since the parallel speedup is negligible in your case, all you see is that cost; normally it is outweighed by the parallel performance gain and thus goes unnoticed.
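For calculations that do benefit from MPI (e.g. bulk systems with k-point sampling), a sketch of the one-process-per-socket rule on your hardware would be (assuming your mpiexec propagates exported environment variables to the ranks):
Code
# One MPI process per socket (4 sockets), 12 MKL threads each
export MKL_NUM_THREADS=12
mpiexec -n 4 atkpython graphen.py > out.out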

If you allow ATK to use all the cores for one process, rather than sharing them over 4 MPI processes, it will perhaps give a somewhat shorter calculation time than 30 minutes.