Inquiry about memory settings and running multiple tasks on a single cluster

fanjiaping:
Dear sir:
  We are running the ATK package on a cluster with 22 nodes. However, we find that we can run only one task at a time: if we submit a new task, the current job is killed automatically. Furthermore, each of our calculations is usually time-consuming and expensive; that is, running an individual ATK task takes a great amount of memory. Could you kindly modify the default environment settings for us so that we can use the resources properly? Or are there options we can use to monitor the running processes?

    Any suggestions would be greatly appreciated.
 Thanks in advance.

Anders Blom:
Do I understand it correctly that you have a purchased license? If so, I suggest you contact your sales office to discuss these points, as they can help more specifically.

Which version of ATK are you running? 10.8 is very memory hungry, but this has changed a lot in 11.2.

Also, it matters precisely how you submit the job. If you let several MPI processes run on the same node, each one uses the same amount of memory, so the total memory load on the machine is multiplied by the number of processes. You can control this with the flag "-npernode 1" if your mpiexec supports it (if not, it will probably balance the load automatically, but it's best to check carefully).
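To make the memory arithmetic concrete, here is a minimal Python sketch. It is only an illustration: the memory figures are made-up numbers, the helper names are hypothetical, and the command line assumes that the job is started through mpiexec, that your mpiexec accepts "-npernode", and that the ATK executable is called atkpython on your system.

```python
def memory_per_node_gb(mem_per_process_gb, processes_per_node):
    """Every MPI process holds its own copy of the data, so the load
    on one node grows linearly with the number of processes on it."""
    return mem_per_process_gb * processes_per_node


def max_processes_per_node(node_ram_gb, mem_per_process_gb):
    """Largest number of processes that still fits in the node's RAM."""
    return int(node_ram_gb // mem_per_process_gb)


def build_command(total_processes, processes_per_node, script):
    """Assemble the launch command as an argument list.

    Assumption: mpiexec supports -npernode and the ATK binary is
    named atkpython; check both on your own cluster.
    """
    return [
        "mpiexec",
        "-npernode", str(processes_per_node),
        "-n", str(total_processes),
        "atkpython", script,
    ]


if __name__ == "__main__":
    # Hypothetical figures: 6 GB per process, 16 GB of RAM per node.
    print(memory_per_node_gb(6.0, 4))       # load with 4 processes on one node
    print(max_processes_per_node(16, 6.0))  # how many processes actually fit
    print(" ".join(build_command(22, 1, "script.py")))
```

With these example numbers, four processes on one node would need more memory than the node has, while one process per node across 22 nodes stays well within it.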

fanjiaping:
Yes, I have a purchased license! I recently ran into the problem below (output from the cluster):

Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x1e4a9450, rbuf=0x2d55b718, count=613, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Reduce(759)..................:
MPIR_Reduce_redscat_gather(406)...:
MPIDI_CH3U_Receive_data_found(129): Message from rank 5 and tag 11 truncated; 4912 bytes received but buffer size is 4904

and the other message, from the .log file:

+------------------------------------------------------------------------------+
| Optimization step =  0 E = -1.2254e+05 eV Maximum force =  2.0744e+00 eV/Ang |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| Device Calculation  [Started Wed Mar 23 07:44:11 2011]                       |
|                                                                              |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| Device Density Matrix Calculation   [Started Wed Mar 23 07:45:00 2011]       |
|                                                                              |
+------------------------------------------------------------------------------+
| Left electrode chemical potential  = -0.090023 Ha                            |
| Right electrode chemical potential = -0.090023 Ha                            |
+------------------------------------------------------------------------------+
rank 4 in job 1  compute-0-9_56863   caused collective abort of all ranks
  exit status of rank 4: killed by signal 9

I can't handle it by myself! Help me, please!

fanjiaping:
 :'( !
We asked the sales office to upgrade our ATK to the latest version, but they told us that the latest version can't be installed on our cluster! So we have to continue working with ATK 10.8.2. But ATK 10.8.2 keeps causing problems, such as exceeding the buffer, or terminating normally but without the results it is supposed to give (e.g., I want to get ten transmission spectra but only six are obtained, and I'm sure there is no problem in my script)!

fanjiaping:
 :'(!
Can anyone suggest how to solve the issues described above?
