QuantumATK Forum

QuantumATK => General Questions and Answers => Topic started by: wring on December 22, 2009, 01:22

Title: It's memory's question
Post by: wring on December 22, 2009, 01:22
     rank 15 in job 1  cu108-ib_38172   caused collective abort of all ranks
     Is this problem caused by too little memory? Each CPU has about 3 GB of memory.
Title: Re: It's memory's question
Post by: zh on December 22, 2009, 10:03
The reason and solution can be found in the following thread:
http://quantumwise.com/forum/index.php?topic=199.0 (http://quantumwise.com/forum/index.php?topic=199.0)
Title: Re: It's memory's question
Post by: wring on December 23, 2009, 01:35
rank 0 in job 1  cu107-ib_48955   caused collective abort of all ranks
  exit status of rank 0: return code 137


We use Intel MPI on our cluster. Could this be causing the problem?
Title: Re: It's memory's question
Post by: zh on December 23, 2009, 07:12
Quote
We use Intel mpi in our cluster.

This just means that the MPI installed on your cluster was built with Intel's C and Fortran compilers. Please check which MPI implementation and version you are actually using.
Title: Re: It's memory's question
Post by: Anders Blom on December 23, 2009, 11:43
ATK only functions with MPICH2 (and, quite possibly, MVAPICH).

Typical errors when using other MPIs manifest themselves as all processes running as masters, so you don't actually get any parallelization performance benefit, and collisions can occur in the I/O (when writing NetCDF files, for instance).
Title: Re: It's memory's question
Post by: wring on December 24, 2009, 01:34
But my senior labmate's jobs run fine, and when I run only one job on the cluster it also runs fine. The problem only occurs when I submit a second or third job.
    Thanks a lot.
Title: Re: It's memory's question
Post by: Anders Blom on December 24, 2009, 22:39
Not sure exactly what you mean by "put ... in the computer". If you mean running in parallel over more than one node, compared to just running on one node, then the error is to be expected. But, again, under any circumstance, ATK is only supported under MPICH2; anything else is experimental and up to the user.

If you don't have a queue system that controls allocation to individual nodes, then you can certainly run out of memory if you run several calculations simultaneously.

Also, finally, note that to run several calculations simultaneously you need more than one master license. If this is the problem you'll see an error message in your "stderr" file (if you have a queue).
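As a practical aside: you can check how much memory a node has left before launching another calculation. Here is a minimal Python sketch, assuming a Linux node where /proc/meminfo exposes a MemAvailable field (newer kernels; older ones only report MemFree). This is an illustration, not part of ATK:

```python
def available_memory_gb(meminfo_path="/proc/meminfo"):
    """Return the node's available memory in GB, read from /proc/meminfo.

    Assumes a Linux kernel that reports MemAvailable (older kernels
    only report MemFree, so you may need to adapt the field name).
    """
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                # /proc/meminfo reports values in kB
                kilobytes = int(line.split()[1])
                return kilobytes / (1024 * 1024)
    raise RuntimeError("MemAvailable field not found in " + meminfo_path)
```

If the value returned is smaller than what one of your jobs needs, starting a second job on that node will push it into swapping or trigger the out-of-memory killer, which matches the "collective abort" you are seeing.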
Title: It's memory's question
Post by: wring on January 7, 2010, 03:26
  Our new cluster has 8 CPUs per node, and the total memory of each node is 24 GB. How many atoms can we calculate per node? Recently our calculations keep crashing the machine.
Title: Re: It's memory's question
Post by: zh on January 7, 2010, 07:41
Quote
"How many atoms we calculate per node are?"
Judging from the phrasing of your question, it seems you could answer it yourself by looking at your input file.

If you mean: how many atoms in a system can be calculated on each node?
There is usually no definite answer, because the required computing resources also depend on the choice of other parameters, not only the total number of atoms.
Title: Re: It's memory's question
Post by: Anders Blom on January 7, 2010, 11:29
As zh writes, the total memory usage is a complicated function of many parameters, not just the number of atoms. Each element has a different number of valence electrons, and that is what matters rather than the atom count. We also have to consider at least the basis set size (which is what really determines the matrix sizes), the mesh cut-off, and the k-point sampling.

However, I believe another point is most crucial in your case. If your 8 CPUs share this memory and you run 8 MPI processes on the machine, the memory available per process is effectively only 3 GB, because each MPI process uses roughly the same amount of RAM as a serial run would. So, to test how "large" a system you can run, start by using only one CPU and monitor the memory usage. If, for instance, the job takes 10 GB, then you can only use 2 MPI processes per node.
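That rule of thumb can be sketched as a small calculation. The 24 GB and 10 GB figures below are just the example values from this thread, not measurements:

```python
def max_mpi_processes(node_memory_gb, per_process_gb, cpus_per_node):
    """Estimate how many MPI ranks fit on one node.

    Assumes each MPI rank needs roughly the same RAM as a serial run,
    so the node's memory, not its CPU count, is often the real limit.
    """
    by_memory = int(node_memory_gb // per_process_gb)
    return min(by_memory, cpus_per_node)

print(max_mpi_processes(24, 10, 8))  # -> 2: only 2 of the 8 CPUs are usable
print(max_mpi_processes(24, 3, 8))   # -> 8: a smaller job can use all CPUs
```

Measure per_process_gb from a single-CPU test run first; guessing it is exactly what leads to the out-of-memory aborts discussed above.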