Author Topic: Cannot run script in parallel (Read 18366 times)

chp · « **on:** March 26, 2009, 04:18 »

Recently I wrote a small Python script based on ATK. It works correctly when I run it with the command “mpiexec -machinefile mpd.hosts -n 1 $ATK_BIN_DIR/atk $WORK_DIR/script.py”. However, when I run it in parallel with the command “mpiexec -machinefile mpd.hosts -n 8 $ATK_BIN_DIR/atk $WORK_DIR/script.py”, it fails with the following messages:

[ch@console ~]$mpiexec -machinefile mpd.hosts -n 8 $ATK_BIN_DIR/atk $WORK_DIR/script.py
5: [cli_5]: aborting job:
rank 5 in job 133 console_45778 caused collective abort of all ranks
exit status of rank 5: return code 1
5: Fatal error in MPI_Allreduce: Message truncated, error stack:
5: MPI_Allreduce(707).....................: MPI_Allreduce(sbuf=0x57136008, rbuf=0x5763d008, count=658560, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
5: MPIR_Allreduce(385)....................:
5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 and tag 14 truncated; 2814240 bytes received but buffer size is 2634240
[ch@console ~]$

I think it must have something to do with the “buffer size” of mpich2-1.0.5p4.
How can I deal with it? Thanks, everyone!
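For reference, the numbers in the last line of the error stack can be pulled apart with a few lines of Python. This is only a sketch: the helper name `parse_truncation` is made up for illustration, and the regular expression is fitted to the single message quoted above. In MPI, "Message truncated" generally means a rank received more data than the buffer it had posted, so in a collective call such as MPI_Allreduce one possible reading is that the ranks do not agree on the array size.

```python
import re

# Last line of the error stack quoted above, copied verbatim.
line = ("5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 "
        "and tag 14 truncated; 2814240 bytes received but buffer size is 2634240")


def parse_truncation(text):
    """Return (sender_rank, bytes_received, buffer_bytes) from a truncation line.

    Hypothetical helper: the pattern only covers the message format seen above.
    """
    match = re.search(
        r"Message from rank (\d+) and tag \d+ truncated; "
        r"(\d+) bytes received but buffer size is (\d+)",
        text,
    )
    if match is None:
        raise ValueError("not a truncation message")
    return tuple(int(group) for group in match.groups())


sender, received, buffer_size = parse_truncation(line)
excess = received - buffer_size  # bytes that did not fit into the receive buffer
print("sender rank:", sender)
print("received:", received, "buffer:", buffer_size, "excess:", excess)
```

Running the same helper on the log of every failing run shows quickly whether the size of the mismatch changes from run to run.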

Anders Blom · « **Reply #1 on:** March 26, 2009, 09:04 »

In earlier versions of ATK the buffer size was hardcoded, and this resulted in similar error messages in some cases. Recent versions have, as far as I know, a dynamic buffer size which should work much better. So, assuming you are running the latest version (2008.10), and this error is reproducible (always happens with the same script), it might be worth looking into. Can you post the script?

chp · « **Reply #2 on:** March 27, 2009, 05:05 »

Thank you for your attention!

I use ATK version 2008.02. The script is attached; please check it.

To save time, the calculation should preferably be run in parallel. I appreciate any reply.

Regards

Nordland · « **Reply #3 on:** March 27, 2009, 08:14 »

Wow!

Nice script! Do you see this problem for any size of configuration, or only for the big ones?
I know that a bug in MPI involving buffer sizes was corrected in ATK 2008.10, but that would only be relevant if your system is a big one.

Anders Blom · « **Reply #4 on:** March 27, 2009, 10:09 »

We're always very happy to see users make the most of NanoLanguage, and this script is an excellent example of that!

Running in parallel offers less of an advantage for molecules compared to other systems, but there is still a clear performance benefit if the system is large. It is also very important to ensure that ATK threads on all cores if you have a dual- or quad-core CPU. Threading is used for the matrix diagonalization, which may be the primary time-consuming part of your calculation. Note, however, that this functionality is only available from ATK 2008.10.

In general, my primary advice is to upgrade to ATK 2008.10; it will hopefully solve the MPI issue, and it will also offer performance benefits from threading plus some general improvements of the code.

Keep up the good work in NanoLanguage!

Feel free to post some pictures of your systems, it could be very interesting for others who perhaps can use your script for their own studies.

chp · « **Reply #5 on:** March 27, 2009, 13:00 »

Thank you for your advice!

The reported problem always appears (even for a very small system) when I run the script in parallel with ATK 2008.02. Because geometry optimization at the DFT level is very time-consuming, I would like to know whether we can adjust the script or the mpich2-1.0.5p4 environment so that it runs in parallel with ATK 2008.02.

I appreciate your help ! Thanks a lot !

Anders Blom · « **Reply #6 on:** March 27, 2009, 13:43 »

It seems that the buffer size error was actually fixed in 2008.02 already.

Perhaps you can try to update your MPICH2 to version 1.0.8. ATK runs fine with this version too, and perhaps some bug fix in MPICH2 is related to this. A long shot, but simple and worth trying, I think.

chp · « **Reply #7 on:** March 28, 2009, 14:33 »

Dear Blom:

Thank you for your suggestions!

I have updated my mpich2 to version 1.0.8; however, the reported problem still appears. Furthermore, when I run the script in parallel with both ATK 2008.10 and mpich2-1.0.8, the same error messages appear (even for a very small molecular system with 3 atoms; please see the attached script). What can we do to fix it?

Regards

Anders Blom · « **Reply #8 on:** March 28, 2009, 14:46 »

I guess we'll have to go into troubleshooting mode here, and start eliminating suspects.

At what point in the calculation does the error arise? From your initial post it looks like it fails almost immediately, is that the case or did you truncate the log for convenience? Otherwise, post the output log too.

Can you run a simple example from the manual in parallel? Like, just a water molecule. No optimization (as this is one of the suspects!).

What is the machine config, is this a cluster? What is the hardware architecture?

Can you run other codes in parallel (if you have any)?

It fails on 8 nodes, how about on 2 or 4? Have you checked that the mpd.hosts file is correct?

Try as many combinations of various things as you can. Apparently it's sufficient to run small systems to make it fail, so hopefully we can quickly get a better idea of where exactly the problem occurs.
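The process-count part of this checklist can be scripted. The following is a minimal sketch that only builds and prints the command lines; it does not launch anything. The function name `build_commands` and the default list of counts are illustrative choices, while the `mpiexec` flags, `mpd.hosts` and the `ATK_BIN_DIR` variable are taken from the commands quoted earlier in the thread.

```python
import os


def build_commands(script, counts=(1, 2, 4, 8), machinefile="mpd.hosts"):
    """Build one mpiexec command line per process count (hypothetical helper)."""
    # Fall back to the literal shell variable if ATK_BIN_DIR is not set,
    # so the printed line can still be pasted into a shell.
    atk_dir = os.environ.get("ATK_BIN_DIR", "$ATK_BIN_DIR")
    atk = os.path.join(atk_dir, "atk")
    return [
        ["mpiexec", "-machinefile", machinefile, "-n", str(n), atk, script]
        for n in counts
    ]


for command in build_commands("script.py"):
    print(" ".join(command))
```

Running each printed line in turn, and noting the smallest process count at which the run fails, narrows down whether the failure depends on the number of ranks at all.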

chp · « **Reply #9 on:** March 30, 2009, 09:59 »

Dear Blom:

I have made some further test calculations. Every example in the manual runs correctly in parallel. However, the script above still does not work in parallel: it fails whenever the command “mpiexec -machinefile mpd.hosts -n CPUS $ATK_BIN_DIR/atk $WORK_DIR/script.py” is used with CPUS ≠ 1.

The following message gives an example.
##########################################################################
[ch@console ~]$ mpiexec -machinefile mpd.hosts -n 16 $ATK_BIN_DIR/atk $WORK_DIR/script.py
0: # Mon Mar 30 8:38:26 2009
0: # Linux-2.6.9-34.ELsmp-x86_64-with-redhat-4-Nahant_Update_3
0:
0: Start study for clusters:
0: C2Si1
0:
0:
0:
0: -----------------------------------------------------------------
0:
0:
5: [cli_5]: aborting job:
5: Fatal error in MPI_Allreduce: Message truncated, error stack:
5: MPI_Allreduce(707).....................: MPI_Allreduce(sbuf=0x2a9e7c7010, rbuf=0x2a9eab6010, count=384345, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
5: MPIR_Allreduce(385)....................:
5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 and tag 14 truncated; 1772160 bytes received but buffer size is 1537416
rank 5 in job 23 console_48274 caused collective abort of all ranks
exit status of rank 5: return code 1
[ch@console ~]$
################################################################

It can be seen that the script fails when it reaches the geometry optimization in ATK. What is the matter with it? How can I fix it?

Regards

Anders Blom · « **Reply #10 on:** March 30, 2009, 10:15 »

One more thing to check: can you run the optimization examples in the manual in parallel?

Anders Blom · « **Reply #11 on:** March 30, 2009, 10:51 »

Btw, a couple of useful tips for your script.

The labels "ele1" and "ele2" can be automatically generated, to save you double-editing when you change the elements:

Code

element1 = 'Carbon'
element2 = 'Silicon'
# element1 and element2 are strings, so eval() them to get the element objects first
ele1 = eval(element1).symbol()
ele2 = eval(element2).symbol()

Then, you can save yourself some headache in clusterConfigToATK():

Code

        if ( clusterConfig[atom][3] == 1 ):
            clusterElements.append(eval(element1))    # element1
        else:
            clusterElements.append(eval(element2))    # element2

Now you only have to change the elements in one place! Also, since you do from math import *, you don't need to redefine pi (it's already available from math).

Code

    iFlag = []                                                                   
    for anyAtom in range(N):                                                     
        iFlag.append(1)

could simply be

Code

iFlag = [1,]*N
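A short check that the two forms give the same list, plus one caveat worth knowing. This is plain Python with illustrative variable names; the value of `N` is arbitrary. List multiplication is safe for immutable items such as integers, but with mutable items (for example nested lists) every slot refers to the same object.

```python
N = 5

# Loop version, as in the original script.
loop_flags = []
for any_atom in range(N):
    loop_flags.append(1)

# One-line version using list multiplication.
flags = [1] * N
print(flags == loop_flags)

# Caveat: multiplying a list of mutable items repeats the same object.
rows_shared = [[0, 0]] * 3
rows_shared[0][0] = 9          # changes every row, since all three are one list

# A list comprehension creates an independent inner list per row.
rows_independent = [[0, 0] for _ in range(3)]
rows_independent[0][0] = 9     # changes only the first row

print(rows_shared)
print(rows_independent)
```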

The role of distance() could be made a bit more obvious by using numpy:

Code

def distance(N, clusterConfig):
    import numpy
    dist = []
    for atom1 in range(N):                                                       
        distAtom = []                                                            
        for atom2 in range(N):                                                   
            r = numpy.sqrt(numpy.sum((numpy.array(clusterConfig[atom1])-numpy.array(clusterConfig[atom2]))**2))
            distAtom.append(r)                                                   
        dist.append(distAtom)                                                    
    return dist
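One detail to watch, sketched below without numpy so it runs anywhere: this assumes each entry of clusterConfig holds x, y, z followed by the element flag at index 3, as the clusterConfig[atom][3] test elsewhere in this thread suggests. If that is the layout, the flag should be kept out of the subtraction, otherwise two atoms of different type at the same position would get a non-zero distance. The function name `distance_xyz` and the sample coordinates are made up for illustration.

```python
import math


def distance_xyz(cluster_config):
    """Pairwise distance table using only the first three entries (x, y, z).

    Assumes any further entries (such as an element flag) are not coordinates.
    """
    coords = [atom[:3] for atom in cluster_config]
    table = []
    for a in coords:
        row = []
        for b in coords:
            row.append(math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))))
        table.append(row)
    return table


# Illustrative data: three atoms, flag 1 or 2 in the last column.
config = [
    [0.0, 0.0, 0.0, 1],
    [3.0, 4.0, 0.0, 2],
    [0.0, 0.0, 2.0, 1],
]
dist = distance_xyz(config)
for row in dist:
    print(row)
```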

These are minor details that have nothing to do with the problem of running in parallel. I just thought I would take the chance to show some nice NanoLanguage tricks.

chp · « **Reply #12 on:** March 30, 2009, 11:58 »

Thank you for your nice tips.

The optimization examples in the manual can work well in parallel.

Anders Blom · « **Reply #13 on:** March 30, 2009, 12:26 »

What is your hardware/software config?

Anders Blom · « **Reply #14 on:** March 30, 2009, 12:29 »

One mistake above (corrected above also). The new code for the function should be

Code

        if ( clusterConfig[atom][3] == 1 ):
            clusterElements.append(eval(element1))    # element1
        else:
            clusterElements.append(eval(element2))    # element2

Never copy/paste code!
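As an aside, the same selection can be written without eval() by looking the name up in a dictionary. The sketch below runs outside ATK, so `Element` is a hypothetical stand-in for the NanoLanguage element objects (only the symbol() method used above is mimicked), and `ELEMENTS`, `pick_element` and the sample coordinates are illustrative names and values, not part of the original script.

```python
class Element:
    """Hypothetical stand-in for a NanoLanguage element object."""

    def __init__(self, name, symbol):
        self._name = name
        self._symbol = symbol

    def symbol(self):
        return self._symbol


# Name-to-object table; inside ATK the values would be the real element objects.
ELEMENTS = {
    "Carbon": Element("Carbon", "C"),
    "Silicon": Element("Silicon", "Si"),
}

element1 = "Carbon"
element2 = "Silicon"


def pick_element(flag):
    """Return the element object for flag 1 (element1) or anything else (element2)."""
    return ELEMENTS[element1 if flag == 1 else element2]


# Illustrative configuration: x, y, z, flag.
cluster_config = [[0.0, 0.0, 0.0, 1], [1.8, 0.0, 0.0, 2], [0.0, 1.8, 0.0, 1]]
cluster_elements = [pick_element(atom[3]) for atom in cluster_config]
print([element.symbol() for element in cluster_elements])
```

A dictionary lookup fails with a clear KeyError on a misspelled element name, whereas eval() would execute whatever string it is given.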
