Topic: Cannot run script in parallel

Offline chp

  • Heavy QuantumATK user
  • ***
  • Posts: 31
  • Reputation: 0
    • View Profile
Re: Cannot run script in parallel
« Reply #15 on: March 30, 2009, 15:02 »
I have tried it on different hardware/software configurations, but I get similar error messages each time.

Software:
  • ATK 2008.02 or ATK 2008.10
  • mpich2-1.0.5p or mpich2-1.0.7
  • Linux-2.6.27.7-9-default-x86_64-with-SuSE-11.1-x86_64
  • Linux-2.6.9-34.ELsmp-x86_64-with-redhat-4-Nahant_Update_3

Hardware:
  • two nodes of a cluster, each with two quad-core Intel CPUs and 12 GB of memory
  • a PC with one quad-core Intel CPU and 4 GB of memory

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5576
  • Country: dk
  • Reputation: 96
    • View Profile
    • QuantumATK at Synopsys
Re: Cannot run script in parallel
« Reply #16 on: March 30, 2009, 15:05 »
I will attempt to run your system myself on a cluster and see if I get a similar error. The queue is quite loaded, so I cannot say for sure when I'll get a spot, but hopefully within the next 24 hours. I'll use 2 nodes, since I assume you get problems as soon as n>1.

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #17 on: April 3, 2009, 11:52 »
The plot thickens! I get the same error on our cluster. Definitely need to look further at this...

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #18 on: April 3, 2009, 13:27 »
The problem has been located. As often in these cases, it's completely obvious - once you know the answer! :)

The issue lies in the use of random(). This function will give different values on the different nodes, so by the time the self-consistent loop starts, each MPI node will have a different configuration, which obviously makes no sense at all.
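
To make the failure mode concrete, here is a minimal sketch in plain Python (not ATK code). Two independent generators stand in for two MPI processes, and the function name `make_configuration` and the seeds are hypothetical, chosen only for illustration.

```python
import random

def make_configuration(rng, n_atoms=3):
    """Return n_atoms random (x, y, z) positions drawn from the given generator."""
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n_atoms)]

# Two independent generators stand in for two MPI processes; the different
# seeds mimic each process initialising its own random state.
node0 = random.Random(1)
node1 = random.Random(2)

config0 = make_configuration(node0)
config1 = make_configuration(node1)

# Compare the geometries held by the two "nodes" for what should be one system.
print(config0 == config1)
```

Each process ends up with its own geometry, which is the inconsistency described above.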

To solve this, you could generate the configuration only on the master node and export it to a VNL file (all encapsulated in processIsMaster()) and then read the VNL file back in on all nodes.
Edit: This will not work, since the slave nodes will run ahead and attempt to read the VNL file while the master node is still making it. Use Nordland's method below instead!

Simple - and not! ;)

PS: I don't know how large the systems you plan to compute are, but there is actually not very much to be gained by running molecules in parallel, unless they are very, very large (hundreds of atoms). On the other hand, you can really gain a lot by ensuring that ATK runs threaded over the cores on a dual- or quad-core CPU. The reason is that, for molecules, the MPI parallelization can only be used for the matrix element generation, while the threading kicks in for the matrix diagonalization.
« Last Edit: April 3, 2009, 14:19 by Anders Blom »

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #19 on: April 3, 2009, 13:30 »
A side issue: your script potentially ends with an indented statement. This generates error messages in MPI on the slave nodes, and I think it actually prevents the script from running. So, always make sure the last line in the script is completely empty (no spaces at all!) when running in parallel!
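
One way to tidy a script before submitting it is sketched below. It reads "last line completely empty" as "the file ends with a single newline and nothing after it"; that reading is an assumption, and the helper name `with_clean_ending` is hypothetical.

```python
def with_clean_ending(script_text):
    """Strip trailing blank or whitespace-only lines and end with one newline."""
    return script_text.rstrip() + "\n"

# A script whose final statement is indented and followed by stray spaces.
script = "for i in range(3):\n    print(i)\n    "
cleaned = with_clean_ending(script)

# repr() makes the trailing characters visible.
print(repr(cleaned))
```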

Offline Nordland

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 812
  • Reputation: 18
    • View Profile
Re: Cannot run script in parallel
« Reply #20 on: April 3, 2009, 13:47 »
Alternatively, you can choose a fixed random seed; then all the MPI nodes will get the same random numbers, and hence the same configuration.

Here is a small example:
Quote
import time
import random

# Seed from the current minute so that processes started in the same minute agree
random.seed(time.localtime().tm_min)

print(random.random())

Then all calculations started within the same minute will get the same random numbers, and hence your script will work again.

P.S. The reason for choosing the minute rather than the second is that a slight delay between the processes will then not change the seed, so you will not get the same error as before.
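
If reproducing the same numbers on every run is acceptable, a hard-coded constant is the most literal form of a fixed seed and does not depend on the clock at all. A minimal sketch, where the value of `SEED` and the helper `random_numbers` are hypothetical:

```python
import random

SEED = 12345  # hypothetical constant; any value works if every process uses the same one

def random_numbers(seed, count=5):
    """Return `count` random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, seeded exactly once
    return [rng.random() for _ in range(count)]

# Each call stands in for one MPI process running the same script.
process_a = random_numbers(SEED)
process_b = random_numbers(SEED)

print(process_a == process_b)
```

The trade-off is that every run of the script then produces the same configuration unless the constant is changed by hand.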

Offline chp

Re: Cannot run script in parallel
« Reply #21 on: April 3, 2009, 16:56 »
Thanks a lot, Blom and Nordland!

I will try it in my script.

Regards!

Offline chp

Re: Cannot run script in parallel
« Reply #22 on: May 25, 2009, 17:37 »
Dear Blom and Nordland:

I am sorry to raise this problem again. As you indicated, the original issue lies in the use of random(), which gives different values on the different nodes when the script is run in parallel.

I have tried the example given by Nordland, but the same error still appeared. After trying some other random seeds, I can run the script in parallel successfully. Because I am a novice in Python, I would still like to use better seeds. Could you give me one or two examples of a random seed in the form of a function like the one below:

def rand():
    import time
    import random
    random.seed(time.localtime().tm_min)
    return random.random()

Thank you very much.

Sincerely yours!

Offline Anders Blom

Re: Cannot run script in parallel
« Reply #23 on: May 25, 2009, 21:03 »
My primary suggestion would be to generate the structures first, in serial, on your local machine, and then read in the geometries when you perform the calculations in parallel. That way there is no randomness in the parallel run. Generating the geometries will not parallelize anyway, so there is no performance difference.
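
A rough sketch of that two-step workflow in plain Python follows. It uses a JSON file as a stand-in for the real geometry format, so the file name, the helper functions and the seed are all hypothetical and not part of ATK.

```python
import json
import os
import random
import tempfile

def generate_geometry(n_atoms, seed):
    """Serial step: build random (x, y, z) positions once, on one machine."""
    rng = random.Random(seed)
    return [[rng.random(), rng.random(), rng.random()] for _ in range(n_atoms)]

def save_geometry(path, positions):
    """Write the positions to a JSON file (a stand-in for the real geometry format)."""
    with open(path, "w") as handle:
        json.dump(positions, handle)

def load_geometry(path):
    """Parallel step: every process reads the same file and gets the same positions."""
    with open(path) as handle:
        return json.load(handle)

# Hypothetical file name, placed in a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), "geometry.json")
positions = generate_geometry(n_atoms=4, seed=0)
save_geometry(path, positions)

# Two reads stand in for two MPI processes loading the pre-generated structure.
print(load_geometry(path) == load_geometry(path) == positions)
```

Because the file exists before the parallel job starts, no process can run ahead and read it too early.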

Otherwise, it seems you already have a "parallel" random function: the one you posted. I'm not sure how it can be improved, because you need a seed that is common to all nodes but you have no way of communicating it between the nodes, so it must be taken from an external source. In principle you can use any combination of day of week, year, minute, hour, etc., all of which are common to all nodes.
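
As one possible variant, the sketch below builds the seed from slowly changing clock fields and seeds a generator once at start-up. It assumes the node clocks are synchronised, and the function name `common_seed` and the choice of fields are hypothetical.

```python
import random
import time

def common_seed(now=None):
    """Build an integer seed from clock fields that all nodes should agree on."""
    if now is None:
        now = time.localtime()
    # Year, day of year and hour change slowly, so small start-up delays
    # between processes rarely alter the seed (an hour rollover still would).
    return now.tm_year * 100000 + now.tm_yday * 100 + now.tm_hour

# Seed one generator at start-up, then draw from it; reseeding before every
# draw would make each call return the same first value.
rng = random.Random(common_seed())
values = [rng.random() for _ in range(3)]
print(values)
```

Seeding once matters here: a function that reseeds with the same value on every call returns the same number each time, so all "random" values in the structure would coincide.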