QuantumATK => General Questions and Answers => Topic started by: chp on March 26, 2009, 04:18

Title: Cannot run script in parallel
Post by: chp on March 26, 2009, 04:18

Recently I wrote a small Python script based on ATK. It works correctly when I run it with the command “mpiexec -machinefile mpd.hosts -n 1 $ATK_BIN_DIR/atk $WORK_DIR/script.py”. However, when I run it in parallel with the command “mpiexec -machinefile mpd.hosts -n 8 $ATK_BIN_DIR/atk $WORK_DIR/script.py”, it fails with the following messages:

[ch@console ~]$mpiexec -machinefile mpd.hosts -n 8 $ATK_BIN_DIR/atk $WORK_DIR/script.py
5: [cli_5]: aborting job:
rank 5 in job 133 console_45778 caused collective abort of all ranks
exit status of rank 5: return code 1
5: Fatal error in MPI_Allreduce: Message truncated, error stack:
5: MPI_Allreduce(707).....................: MPI_Allreduce(sbuf=0x57136008, rbuf=0x5763d008, count=658560, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
5: MPIR_Allreduce(385)....................:
5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 and tag 14 truncated; 2814240 bytes received but buffer size is 2634240
[ch@console ~]$

I think it must have something to do with the “buffer size” of mpich2-1.0.5p4.
How can I deal with it? Thanks, everyone!

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 26, 2009, 09:04

In earlier versions of ATK the buffer size was hardcoded, and this resulted in similar error messages in some cases. Recent versions have, as far as I know, a dynamic buffer size, which should work much better. So, assuming you are running the latest version (2008.10), and this error is reproducible (always happens with the same script), it might be worth looking into. Can you post the script?

Title: Re: Cannot run script in parallel
Post by: chp on March 27, 2009, 05:05

Thank you for your attention!

I use version 2008.02 of ATK. The script is attached. Please check it.

To save time, the calculation should preferably run in parallel. I would appreciate any reply.

Regards

Title: Re: Cannot run script in parallel
Post by: Nordland on March 27, 2009, 08:14

Wow!

Nice script - do you see this problem for configurations of any size, or only for the big ones?
I know that a bug in MPI involving buffer sizes was corrected in ATK 2008.10, but it would only matter if your system is a big one.

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 27, 2009, 10:09

We're always very happy to see users make the most of NanoLanguage, and this script is an excellent example of that!

Running in parallel offers less of an advantage for molecules compared to other systems, but there is still a clear performance benefit if the system is large. It is also very important to ensure that ATK runs threaded on all cores if you have a dual- or quad-core CPU. Threading is used for the matrix diagonalization, which may be the primary time-consuming part of your calculation. Note, however, that this functionality is only available from ATK 2008.10.

In general, my primary advice is to upgrade to ATK 2008.10; it will hopefully solve the MPI issue, plus it will offer performance benefits due to threading and some general improvements of the code.

Keep up the good work in NanoLanguage! :) Feel free to post some pictures of your systems, it could be very interesting for others who perhaps can use your script for their own studies.

Title: Re: Cannot run script in parallel
Post by: chp on March 27, 2009, 13:00

Thank you for your advice !

The reported problem always appears (even for a very small system) when I run the script in parallel with ATK 2008.02. Because geometry optimization at the DFT level is very time-consuming, I would like to know whether we can adjust the script or the mpich2-1.0.5p4 environment so that it runs in parallel with ATK 2008.02.

I appreciate your help ! Thanks a lot !

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 27, 2009, 13:43

It seems that the buffer size error was actually fixed in 2008.02 already.

Perhaps you can try to update your MPICH2 to version 1.0.8. ATK runs fine with this version too, perhaps there was some bug fix in MPICH2 that is related to this. A long shot, but simple and worth trying, I think.

Title: Re: Cannot run script in parallel
Post by: chp on March 28, 2009, 14:33

Dear Blom:

Thank you for your suggestions !

I have updated my mpich2 to version 1.0.8; however, the reported problem still appears. Furthermore, I have run it in parallel with both ATK 2008.10 and mpich2-1.0.8, and the same error messages appear (even for a very small molecular system with 3 atoms; please see the attached script). What can we do to fix it?

Regards

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 28, 2009, 14:46

I guess we'll have to go into troubleshooting mode here, and start eliminating suspects.

At what point in the calculation does the error arise? From your initial post it looks like it fails almost immediately, is that the case or did you truncate the log for convenience? Otherwise, post the output log too.

Can you run a simple example from the manual in parallel? Like, just a water molecule. No optimization (as this is one of the suspects!).

What is the machine config, is this a cluster? What is the hardware architecture?

Can you run other codes in parallel (if you have any)?

It fails on 8 nodes, how about on 2 or 4? Have you checked that the mpd.hosts file is correct?

Try as many combinations of various things you can, apparently it's sufficient to run small systems to make it fail, so hopefully we can quickly get some better idea of where exactly the problem occurs.

Title: Re: Cannot run script in parallel
Post by: chp on March 30, 2009, 09:59

Dear Blom:

I have made some further test calculations. Every example in the manual runs well in parallel. However, the above script still does not work in parallel. It fails whenever the command “mpiexec -machinefile mpd.hosts -n CPUS $ATK_BIN_DIR/atk $WORK_DIR/script.py” is used with CPUS ≠ 1.

The following message gives an example.
##########################################################################
[ch@console ~]$ mpiexec -machinefile mpd.hosts -n 16 $ATK_BIN_DIR/atk $WORK_DIR/script.py
0: # Mon Mar 30 8:38:26 2009
0: # Linux-2.6.9-34.ELsmp-x86_64-with-redhat-4-Nahant_Update_3
0:
0: Start study for clusters:
0: C2Si1
0:
0:
0:
0: -----------------------------------------------------------------
0:
0:
5: [cli_5]: aborting job:
5: Fatal error in MPI_Allreduce: Message truncated, error stack:
5: MPI_Allreduce(707).....................: MPI_Allreduce(sbuf=0x2a9e7c7010, rbuf=0x2a9eab6010, count=384345, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
5: MPIR_Allreduce(385)....................:
5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 and tag 14 truncated; 1772160 bytes received but buffer size is 1537416
rank 5 in job 23 console_48274 caused collective abort of all ranks
exit status of rank 5: return code 1
[ch@console ~]$
################################################################

It can be seen that the script fails when it enters the geometry optimization in ATK. What is wrong with it? How can I fix it?

Regards

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 30, 2009, 10:15

One more thing to check: can you run the optimization examples in the manual in parallel?

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 30, 2009, 10:51

Btw, a couple of useful tips for your script :)

The labels "ele1" and "ele2" can be automatically generated, to save you double-editing when you change the elements:

Code

element1 = 'Carbon'
element2 = 'Silicon'
# element1/element2 are strings, so eval() them to get the element objects first
ele1 = eval(element1).symbol()
ele2 = eval(element2).symbol()

Then, you can save yourself some headache in clusterConfigToATK():

Code

        if ( clusterConfig[atom][3] == 1 ):
            clusterElements.append(eval(element1))    #  element1
        else:
            clusterElements.append(eval(element2))    #  element2

Now you only have to change the elements in one place!

Since you do from math import *, you don't need to redefine pi (it's already available from math).

Code

    iFlag = []                                                                   
    for anyAtom in range(N):                                                     
        iFlag.append(1)

could simply be

Code

iFlag = [1,]*N

The role of distance() could be made a bit more obvious by using numpy:

Code

def distance(N, clusterConfig):
    import numpy
    dist = []
    for atom1 in range(N):                                                       
        distAtom = []                                                            
        for atom2 in range(N):                                                   
            r = numpy.sqrt(numpy.sum((numpy.array(clusterConfig[atom1])-numpy.array(clusterConfig[atom2]))**2))
            distAtom.append(r)                                                   
        dist.append(distAtom)                                                    
    return dist
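The same idea can be pushed one step further by letting numpy handle both loops through broadcasting. This is only a sketch: the name distance_matrix is hypothetical, and it assumes that the first three entries of each row of clusterConfig are the x, y, z coordinates. Whether the rows carry extra entries (such as the type flag tested in clusterConfigToATK()) is not shown in this thread, so the slice below is a guess, and it differs from the loop version above, which uses the whole row.

```python
import numpy

def distance_matrix(cluster_config):
    """Return the N x N matrix of pairwise distances between atoms."""
    # Assumption: columns 0..2 are x, y, z; any further columns are ignored.
    xyz = numpy.array(cluster_config, dtype=float)[:, :3]
    # Broadcasting: diff[i, j] is the vector from atom j to atom i.
    diff = xyz[:, None, :] - xyz[None, :, :]
    return numpy.sqrt((diff ** 2).sum(axis=-1))

# Two atoms, 5 length units apart; the fourth column is a made-up type flag.
example = [[0.0, 0.0, 0.0, 1], [3.0, 4.0, 0.0, 2]]
dist = distance_matrix(example)
print(dist[0][1])
```

The result is a numpy array rather than a list of lists; dist.tolist() converts it back if the rest of the script expects nested lists.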

These are minor details that have nothing to do with the problem of running in parallel. I just thought I would take the chance to show some nice NanoLanguage tricks :)

Title: Re: Cannot run script in parallel
Post by: chp on March 30, 2009, 11:58

Thank you for your nice tips.

The optimization examples in the manual can work well in parallel.

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 30, 2009, 12:26

What is your hardware/software config?

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 30, 2009, 12:29

One mistake above (corrected above also). The new code for the function should be

Code

        if ( clusterConfig[atom][3] == 1 ):
            clusterElements.append(eval(element1))    #  element1
        else:
            clusterElements.append(eval(element2))    #  element2

Never copy/paste code!

Title: Re: Cannot run script in parallel
Post by: chp on March 30, 2009, 15:02

I have tried it with different hardware/software configurations; however, similar error messages were obtained.

Software: ATK 2008.02 or ATK 2008.10
mpich2-1.0.5p or mpich2-1.0.7
Linux-2.6.27.7-9-default-x86_64-with-SuSE-11.1-x86_64
Linux-2.6.9-34.ELsmp-x86_64-with-redhat-4-Nahant_Update_3

Hardware: two nodes of a cluster, each with two quad-core Intel CPUs and 12 GB of memory
a PC with a quad-core Intel CPU and 4 GB of memory

Title: Re: Cannot run script in parallel
Post by: Anders Blom on March 30, 2009, 15:05

I will attempt to run your system myself on a cluster and see if I get a similar error. The queue is quite loaded so I cannot say for sure when I'll get a spot, but hopefully within the next 24 hours. I'll use 2 nodes; I assume you get problems as soon as n>1.

Title: Re: Cannot run script in parallel
Post by: Anders Blom on April 3, 2009, 11:52

The plot thickens! I get the same error on our cluster. Definitely need to look further at this...

Title: Re: Cannot run script in parallel
Post by: Anders Blom on April 3, 2009, 13:27

The problem has been located. As often in these cases, it's completely obvious - once you know the answer! :)

The issue lies in the use of random(). This function will give different values on the different nodes, so by the time the self-consistent loop starts, each MPI node will have a different configuration, which obviously makes no sense at all.
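The effect can be reproduced without MPI. The sketch below is plain Python, not ATK code: make_configuration is a hypothetical stand-in for the script's structure generator, and separate random.Random instances with different seeds play the role of the independent processes started by mpiexec.

```python
import random

def make_configuration(rng, n_atoms=3):
    """Hypothetical stand-in for the script's random structure generator."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n_atoms)]

# One generator per simulated rank; distinct seeds mimic independent processes.
per_rank = [make_configuration(random.Random(rank)) for rank in range(4)]
ranks_agree_unseeded = all(cfg == per_rank[0] for cfg in per_rank)

# A seed shared by all ranks makes every rank build the same configuration.
shared = [make_configuration(random.Random(2009)) for _ in range(4)]
ranks_agree_shared = all(cfg == shared[0] for cfg in shared)

print(ranks_agree_unseeded, ranks_agree_shared)
```

The first flag comes out False and the second True: only with a common seed do all simulated ranks end up with the same geometry.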

~~To solve this, you could generate the configuration only on the master node and export it to a VNL file (all encapsulated in processIsMaster()) and then read the VNL file back in on all nodes.~~
Edit: This will not work, since the slave nodes will run ahead and attempt to read the VNL file while the master node is still making it. Use Nordland's method below instead!
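The byte counts in the two error logs posted earlier in the thread are at least consistent with this diagnosis. Reading them as 8-byte MPI_DOUBLE values (the datatype named in the MPI_Allreduce line), the sender and the receiver disagreed about how many numbers were being exchanged. The short calculation below only restates the logged figures; the interpretation that the mismatch comes from differently sized data on each rank is an assumption, not something the logs state.

```python
DOUBLE_BYTES = 8  # an MPI_DOUBLE is 8 bytes

def doubles(n_bytes):
    """Convert a byte count from the error message into a number of doubles."""
    return n_bytes // DOUBLE_BYTES

# Figures copied from the "Message from rank 4 ... truncated" lines.
logs = {
    "first log": {"received": 2814240, "buffer": 2634240},
    "second log": {"received": 1772160, "buffer": 1537416},
}
mismatch = {}
for name, log in logs.items():
    sent, expected = doubles(log["received"]), doubles(log["buffer"])
    mismatch[name] = sent - expected
    print(name, sent, expected, mismatch[name])
```

In both logs rank 4 sent more doubles than rank 5 had room for (22500 more in the first log, 29343 more in the second).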

Simple - and not! ;)

PS: I don't know how large the systems you plan to compute are, but there is actually not very much to be gained by running molecules in parallel, unless they are very, very large (hundreds of atoms). On the other hand, you can really gain a lot by ensuring that ATK runs threaded over the cores on a dual- or quad-core CPU. The reason is that, for molecules, the MPI parallelization can only be used for the matrix element generation, while the threading kicks in for the matrix diagonalization.

Title: Re: Cannot run script in parallel
Post by: Anders Blom on April 3, 2009, 13:30

A side issue: your script potentially ends with an indented statement. This generates error messages in MPI on the slave nodes, and actually I think it prevents the script from running. So, always make sure the last line in the script is completely empty (no spaces at all!) when running in parallel!

Title: Re: Cannot run script in parallel
Post by: Nordland on April 3, 2009, 13:47

Alternatively, you can choose a fixed random seed; then all the MPI nodes will get the same random numbers, and hence you will get the same configuration.

Here is a small example:

Quote

import time; import random
random.seed(time.localtime().tm_min)

print random.random()

Then all calculations started within the same minute will get the same random numbers, and hence your script will work again.

P.S. The reason for choosing the minute over the second is that a slight delay in starting one of the processes will then not bring back the error you got before.

Title: Re: Cannot run script in parallel
Post by: chp on April 3, 2009, 16:56

Thanks a lot ! Blom and Nordland.

I will try it in my script.

Regards !

Title: Re: Cannot run script in parallel
Post by: chp on May 25, 2009, 17:37

Dear Blom and Nordland:

I am sorry to raise this problem again. As you indicated, the original issue lies in the use of random(), which gives different values on the different nodes when the script runs in parallel.

I have tried the example given by Nordland; however, the same error still appeared. After trying some other random seeds, I can run the script in parallel successfully. Because I am a novice in Python, I would still like to use some better seeds. Could you give me one or two examples of a random seed in the form of a function like the one below:

def rand():
    import time; import random
    random.seed(time.localtime().tm_min)
    return random.random()

Thank you very much.

Sincerely yours !

Title: Re: Cannot run script in parallel
Post by: Anders Blom on May 25, 2009, 21:03

My primary suggestion would be to generate the structures first, in serial, on your local machine. Then you read in the geometries when you perform the calculations in parallel. That way you have no randomness in parallel. Generating the geometries anyway will not parallelize, so there is no performance difference.
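A minimal sketch of that two-step workflow, using plain Python and a JSON file rather than any ATK file format. The generator, the file name structures.json and the temporary directory are all hypothetical; in a real run the file would have to sit on a file system that every node can read.

```python
import json
import os
import random
import tempfile

def generate_structures(n_structures, n_atoms, seed=0):
    """Hypothetical generator: random x, y, z coordinates for each atom."""
    rng = random.Random(seed)
    return [[[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(n_atoms)]
            for _ in range(n_structures)]

# Step 1 (serial, run once beforehand): generate the geometries and save them.
path = os.path.join(tempfile.gettempdir(), "structures.json")
with open(path, "w") as handle:
    json.dump(generate_structures(5, 3), handle)

# Step 2 (inside the parallel script): every process reads the same file,
# so no random numbers are drawn while running under mpiexec.
with open(path) as handle:
    structures = json.load(handle)
print(len(structures), "structures loaded")
```

Because step 1 has finished before the parallel job starts, this avoids the race described earlier in the thread, where the other nodes tried to read a file the master was still writing.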

Otherwise it seems you already have a "parallel" random function, the one you posted. I'm not sure how it can be improved, because you need a seed which is common to all nodes, but you have no way of communicating it between the nodes, so it must be taken from an external source. In principle you can use any combination of day of week, year, minute, hour, etc., all of which are common to all nodes.
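As one possible variant, the seed can be built from the calendar date and set once at the top of the script. The function name common_seed is hypothetical, and the sketch assumes the node clocks agree on the date and the job does not start right at midnight. Note that a function which calls random.seed() with the same value before every random.random() returns the same number on every call, so seeding once is the safer pattern if a sequence of different numbers is wanted.

```python
import random
import time

def common_seed(now=None):
    """Build an integer seed from the calendar date (year and day of year)."""
    now = time.localtime() if now is None else now
    return now.tm_year * 1000 + now.tm_yday

# Seed once, at the top of the script, then draw as many numbers as needed.
random.seed(common_seed())
first, second = random.random(), random.random()
print(first != second)
```

Every process that runs this on the same day starts from the same seed and therefore draws the same sequence of numbers.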
