Author Topic: Cannot run script in parallel

Offline chp

  • Heavy QuantumATK user
  • ***
  • Posts: 31
  • Reputation: 0
    • View Profile
Cannot run script in parallel
« on: March 26, 2009, 04:18 »
Recently I wrote a small Python script based on ATK. It works correctly when I run it with the command “mpiexec -machinefile mpd.hosts -n 1 $ATK_BIN_DIR/atk $WORK_DIR/script.py”. However, when I run it in parallel with the command “mpiexec -machinefile mpd.hosts -n 8 $ATK_BIN_DIR/atk $WORK_DIR/script.py”, it fails with the following messages:

[ch@console ~]$mpiexec  -machinefile  mpd.hosts  -n  8  $ATK_BIN_DIR/atk  $WORK_DIR/script.py
5: [cli_5]: aborting job:
rank 5 in job 133  console_45778   caused collective abort of all ranks
  exit status of rank 5: return code 1
5: Fatal error in MPI_Allreduce: Message truncated, error stack:
5: MPI_Allreduce(707).....................: MPI_Allreduce(sbuf=0x57136008, rbuf=0x5763d008, count=658560, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
5: MPIR_Allreduce(385)....................:
5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 and tag 14 truncated; 2814240 bytes received but buffer size is 2634240
[ch@console ~]$

I think it must have something to do with the “buffer size” of mpich2-1.0.5p4.
How can I deal with it? Thanks, everyone!
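
A quick sanity check on the numbers in the error stack above may help narrow things down. The sketch below assumes MPI_DOUBLE is 8 bytes; the helper name `doubles` and the variable names are illustrative only. It converts the two byte counts reported by rank 5 into numbers of doubles and compares them with the `count` that rank 5 passed to MPI_Allreduce.

```python
# Assumption: MPI_DOUBLE occupies 8 bytes on this platform.
BYTES_PER_DOUBLE = 8


def doubles(n_bytes):
    """Convert a byte count from the error stack into a number of doubles."""
    return n_bytes // BYTES_PER_DOUBLE


# Values copied from the first error log (rank 5).
local_count = 658560                  # count passed to MPI_Allreduce by rank 5
expected_doubles = doubles(2634240)   # "buffer size is 2634240"
received_doubles = doubles(2814240)   # "2814240 bytes received" from rank 4

print(expected_doubles)                     # 329280, exactly local_count // 2
print(received_doubles)                     # 351780
print(received_doubles - expected_doubles)  # 22500 doubles too many
```

Rank 5 expected exactly half of its own `count`, while rank 4 sent a longer message. One possible reading (an interpretation, not something the log states) is that the ranks called MPI_Allreduce with different counts, rather than a fixed MPICH2 buffer being too small. If that reading is right, it would be worth checking whether every process builds exactly the same system before the calculation starts.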

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5576
  • Country: dk
  • Reputation: 96
    • View Profile
    • QuantumATK at Synopsys
Re: Cannot run script in parallel
« Reply #1 on: March 26, 2009, 09:04 »
In earlier versions of ATK the buffer size was hardcoded, and this resulted in similar error messages in some cases. Recent versions have, as far as I know, a dynamic buffer size which should work much better. So, assuming you are running the latest version (2008.10), and this error is reproducible (always happens with the same script), it might be worth looking into. Can you post the script?

Offline chp

Re: Cannot run script in parallel
« Reply #2 on: March 27, 2009, 05:05 »
Thank you for your attention!

I am using version 2008.02 of ATK. The script is attached; please check it.

To save time, the calculation should preferably be run in parallel. I appreciate any reply.

Regards
« Last Edit: March 28, 2009, 03:32 by chp »

Offline Nordland

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 812
  • Reputation: 18
    • View Profile
Re: Cannot run script in parallel
« Reply #3 on: March 27, 2009, 08:14 »
Wow!

Nice script! Do you see this problem for configurations of any size, or only for the big ones?
I know that a bug in MPI involving buffer sizes was corrected in ATK 2008.10, but it would only matter if your system is a big one.

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #4 on: March 27, 2009, 10:09 »
We're always very happy to see users make the most of NanoLanguage, and this script is an excellent example of that!

Running in parallel offers less of an advantage for molecules compared to other systems, but there is still a clear performance benefit if the system is large. It is also very important to ensure that ATK threads on all cores if you have a dual- or quad-core CPU. Threading is used for the matrix diagonalization, which may be the primary time-consuming part of your calculation. Note, however, that this functionality is only available from ATK 2008.10.

In general, my primary advice is to upgrade to ATK 2008.10; it will hopefully solve the MPI issue, plus it will offer performance benefits due to threading and some general improvements of the code.

Keep up the good work in NanoLanguage! :) Feel free to post some pictures of your systems, it could be very interesting for others who perhaps can use your script for their own studies.

Offline chp

Re: Cannot run script in parallel
« Reply #5 on: March 27, 2009, 13:00 »
Thank you for your advice!

The reported problem always appears (even for a very small system) when I run the script in parallel with ATK 2008.02. Because geometry optimizations at the DFT level are very time-consuming, I would like to know whether we can adjust the script or the mpich2-1.0.5p4 environment so that it runs in parallel with ATK 2008.02.

I appreciate your help! Thanks a lot!

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #6 on: March 27, 2009, 13:43 »
It seems that the buffer size error was actually fixed in 2008.02 already.

Perhaps you can try updating your MPICH2 to version 1.0.8. ATK runs fine with this version too, and perhaps some bug fix in MPICH2 is related to this. A long shot, but simple and worth trying, I think.

Offline chp

Re: Cannot run script in parallel
« Reply #7 on: March 28, 2009, 14:33 »
Dear Blom:

Thank you for your suggestions!

I have updated my MPICH2 to version 1.0.8; however, the reported problem still appears. Furthermore, when I run the script in parallel with both ATK 2008.10 and mpich2-1.0.8, the same error messages are raised (even for a very small molecular system with 3 atoms; please see the attached script). What can we do to fix it?

Regards

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #8 on: March 28, 2009, 14:46 »
I guess we'll have to go into troubleshooting mode here, and start eliminating suspects.

At what point in the calculation does the error arise? From your initial post it looks like it fails almost immediately; is that the case, or did you truncate the log for convenience? If you truncated it, please post the full output log too.

Can you run a simple example from the manual in parallel? Like, just a water molecule. No optimization (as this is one of the suspects!).

What is the machine config, is this a cluster? What is the hardware architecture?

Can you run other codes in parallel (if you have any)?

It fails on 8 nodes, how about on 2 or 4? Have you checked that the mpd.hosts file is correct?

Try as many combinations of various things as you can. Apparently it's sufficient to run small systems to make it fail, so hopefully we can quickly get a better idea of where exactly the problem occurs.
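
The process-count sweep suggested above (1, 2, 4, 8) could be scripted roughly as follows. This is only a sketch: the function names `build_command` and `sweep_commands` are made up for illustration, and it assumes the `ATK_BIN_DIR` environment variable and the `mpd.hosts` machine file from the original command line.

```python
import os


def build_command(n_procs, script, machinefile="mpd.hosts"):
    """Assemble the mpiexec command line used in this thread for n_procs processes."""
    # Assumption: ATK_BIN_DIR points at the directory containing the atk executable.
    atk = os.path.join(os.environ.get("ATK_BIN_DIR", ""), "atk")
    return ["mpiexec", "-machinefile", machinefile,
            "-n", str(n_procs), atk, script]


def sweep_commands(script, counts=(1, 2, 4, 8)):
    """Return one command per process count, ready to be handed to subprocess.call."""
    return [build_command(n, script) for n in counts]


# Print the commands so they can be inspected before anything is launched.
for command in sweep_commands("script.py"):
    print(" ".join(command))
```

Passing each command to `subprocess.call` and recording the return code for every process count would show the smallest number of processes at which the failure appears.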

Offline chp

Re: Cannot run script in parallel
« Reply #9 on: March 30, 2009, 09:59 »
Dear Blom:

I have run some further test calculations. Every example in the manual runs well in parallel. However, the above script still does not work in parallel: it fails whenever the command “mpiexec -machinefile mpd.hosts -n CPUS $ATK_BIN_DIR/atk $WORK_DIR/script.py” is used with CPUS ≠ 1.

The following output gives an example.
##########################################################################
[ch@console ~]$ mpiexec -machinefile mpd.hosts -n 16 $ATK_BIN_DIR/atk $WORK_DIR/script.py
0: # Mon Mar 30 8:38:26 2009
0: # Linux-2.6.9-34.ELsmp-x86_64-with-redhat-4-Nahant_Update_3
0:
0: Start  study for  clusters:
0: C2Si1
0:
0:
0:
0: -----------------------------------------------------------------
0:
0:
5: [cli_5]: aborting job:
5: Fatal error in MPI_Allreduce: Message truncated, error stack:
5: MPI_Allreduce(707).....................: MPI_Allreduce(sbuf=0x2a9e7c7010, rbuf=0x2a9eab6010, count=384345, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
5: MPIR_Allreduce(385)....................:
5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 and tag 14 truncated; 1772160 bytes received but buffer size is 1537416
rank 5 in job 23  console_48274   caused collective abort of all ranks
  exit status of rank 5: return code 1
[ch@console ~]$
################################################################

It can be seen that the script fails when it enters the ATK geometry optimization. What is the matter with it? How can I fix it?

Regards


Offline Anders Blom

Re: Cannot run script in parallel
« Reply #10 on: March 30, 2009, 10:15 »
One more thing to check: can you run the optimization examples in the manual in parallel?

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #11 on: March 30, 2009, 10:51 »
Btw, a couple of useful tips for your script :) The labels "ele1" and "ele2" can be automatically generated, to save you double-editing when you change the elements:
Code
element1 = 'Carbon'
element2 = 'Silicon'
# element1/element2 are strings, so eval() is needed to get the element objects
ele1 = eval(element1).symbol()
ele2 = eval(element2).symbol()
Then, you can save yourself some headache in clusterConfigToATK():
Code
        if ( clusterConfig[atom][3] == 1 ):
            clusterElements.append(eval(element1))    # element1
        else:
            clusterElements.append(eval(element2))    # element2
Now you only have to change the elements in one place! Since you do from math import *, you don't need to redefine pi (it's already available from math). Also, the loop
Code
    iFlag = []                                                                   
    for anyAtom in range(N):                                                     
        iFlag.append(1)                                                          
could simply be
Code
iFlag = [1,]*N                                                         
The role of distance() could be made a bit more obvious by using numpy:
Code
def distance(N, clusterConfig):
    import numpy
    dist = []
    for atom1 in range(N):                                                       
        distAtom = []                                                            
        for atom2 in range(N):                                                   
            r = numpy.sqrt(numpy.sum((numpy.array(clusterConfig[atom1])-numpy.array(clusterConfig[atom2]))**2))
            distAtom.append(r)                                                   
        dist.append(distAtom)                                                    
    return dist                                                                  
These are minor details that have nothing to do with the problem of running in parallel. I just thought I would take the chance to show some nice NanoLanguage tricks :)
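
For what it's worth, the numpy version of distance() above can also be written without the explicit double loop, using broadcasting. This is a minimal sketch under two assumptions: each entry of clusterConfig starts with the x, y, z coordinates, and any further column (such as the element flag read via clusterConfig[atom][3]) should be ignored. The name `distance_matrix` is only illustrative.

```python
import numpy


def distance_matrix(cluster_config):
    """Return the N x N matrix of pairwise distances between atoms."""
    # Assumption: the first three columns are x, y, z; extra columns are dropped.
    coords = numpy.asarray(cluster_config, dtype=float)[:, :3]
    # diff[i, j, :] is the vector from atom j to atom i.
    diff = coords[:, numpy.newaxis, :] - coords[numpy.newaxis, :, :]
    return numpy.sqrt((diff ** 2).sum(axis=-1))


# Small usage example: three atoms, with an element flag in the fourth column.
example = [[0.0, 0.0, 0.0, 1],
           [3.0, 4.0, 0.0, 2],
           [0.0, 0.0, 1.0, 1]]
print(distance_matrix(example))
```

The result is a numpy array rather than a list of lists, so `dist[i][j]` indexing keeps working; call `.tolist()` on it if a plain nested list is needed.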
« Last Edit: March 30, 2009, 12:30 by Anders Blom »

Offline chp

Re: Cannot run script in parallel
« Reply #12 on: March 30, 2009, 11:58 »
Thank you for your nice tips.

The optimization examples in the manual work well in parallel.

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #13 on: March 30, 2009, 12:26 »
What is your hardware/software config?

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #14 on: March 30, 2009, 12:29 »
One mistake in my post above (now corrected there as well). The new code for the function should be
Code
        if ( clusterConfig[atom][3] == 1 ):
            clusterElements.append(eval(element1))    # element1
        else:
            clusterElements.append(eval(element2))    # element2
Never copy/paste code!