Topic: phonon bandstructure calculation: caused collective abort


Offline Yue-Wen Fang

  • Heavy QuantumATK user
  • ***
  • Posts: 29
  • Country: cn
  • Reputation: 0
Hi all, I ran some phonon (kinetic matrix) calculations for a 20-atom structure on my Windows PC. They worked well but were time-consuming, so I moved to a Linux server to run the same calculations. On Linux, I used 2 cores on two nodes (4 cores per node) for ATK-DFT (energy cutoff: 90 Hartree; 8x8x6 k-grid). However, the job crashed after the self-consistent calculation, with the ATK errors shown below. The ATK version is 2014.3.
Code
|  14 E = -52.5709 dE =  9.491318e-06 dH =  3.159575e-06                       |
+------------------------------------------------------------------------------+
| Calculation Converged in 14 steps                                            |
|                                                                              |
| Fermi Level  = -3.417684 eV                                                  |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| DFT Calculation  [Finished Wed Aug 26 17:31:31 2015]                         |
|                                                                              |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Phonon: Automatically detected repetitions = [5 3 5]                         |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Kinetic Matrix : ===================rank 1 in job 1  a130_36884   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
The job log showed the following:
Code
+ /opt/intel/impi/4.0.1.007/intel64/bin/mpirun --rsh=ssh -env I_MPI_DEVICE rdma:OpenIB-cma -np 2 atkpython ./phonon.py
terminate called after throwing an instance of 'std::bad_alloc'
  what():  St9bad_alloc
Could anyone give some suggestions? Thank you in advance.

Offline Jess Wellendorff

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 933
  • Country: dk
  • Reputation: 29
Re: phonon bandstructure calculation: caused collective abort
« Reply #1 on: August 26, 2015, 16:34 »
Looks like you ran out of memory on your 2 nodes on the cluster. Perhaps you can "borrow" memory from other nodes while executing the job, e.g. use 8 nodes but only run 2 MPI processes?
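
As a rough illustration of that diagnosis, the small Python sketch below scans a log for the two messages quoted in the first post. It is only a heuristic: `std::bad_alloc` means a memory allocation failed, while "killed by signal 9" is consistent with the system killing a process for using too much memory but can have other causes. The function and dictionary names are made up for this example.

```python
import re

# Signatures that commonly accompany memory exhaustion (a heuristic, not proof).
MEMORY_PATTERNS = {
    "bad_alloc": re.compile(r"std::bad_alloc|St9bad_alloc"),
    "sigkill": re.compile(r"killed by signal 9"),
}


def find_memory_hints(log_text):
    """Return the sorted names of the memory-related patterns found in log_text."""
    return sorted(
        name for name, pattern in MEMORY_PATTERNS.items() if pattern.search(log_text)
    )


# Lines copied from the logs in the first post.
log_text = """\
rank 1 in job 1  a130_36884   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
terminate called after throwing an instance of 'std::bad_alloc'
  what():  St9bad_alloc
"""

hints = find_memory_hints(log_text)
print(hints)
```

If both patterns show up, as they do here, memory is the first thing to check before looking at anything else in the job setup.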

Offline Yue-Wen Fang

Re: phonon bandstructure calculation: caused collective abort
« Reply #2 on: August 26, 2015, 17:45 »
Dear Prof. Jess Wellendorff,

Thank you for your suggestion. I'll give it a try and report back.

Offline Jess Wellendorff

Re: phonon bandstructure calculation: caused collective abort
« Reply #3 on: August 26, 2015, 19:05 »
Good. Please note that there is a good reason why calculating the dynamical matrix can be memory-intensive: you are interested in the force response on all atoms in your supercell when atoms in the cell and in neighboring cells are displaced slightly. The cell must therefore be repeated, usually along all directions. Each matrix element of the dynamical matrix is thus calculated from a significantly larger supercell than the original one, and this requires memory.
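
To put numbers on this, here is a minimal Python sketch using the values from the first post: 20 atoms and the automatically detected repetitions [5 3 5]. The displacement count assumes one finite-difference displacement per atom and Cartesian direction; that is an assumption for illustration only, since the actual scheme in ATK may differ (for example, by using symmetry or displacing in both directions). The function name is made up.

```python
def repeated_cell_size(n_atoms, repetitions):
    """Number of atoms in the cell after repeating it along each direction."""
    n_cells = 1
    for r in repetitions:
        n_cells *= r
    return n_atoms * n_cells


n_atoms = 20             # atoms in the original structure (from the first post)
repetitions = (5, 3, 5)  # "Automatically detected repetitions" in the log

n_super = repeated_cell_size(n_atoms, repetitions)

# Assumption: one displacement per atom of the original cell and per
# Cartesian direction (x, y, z); each one needs a force calculation
# on the full repeated cell.
n_displacements = 3 * n_atoms

print("atoms in repeated cell:", n_super)
print("displaced configurations (assumed):", n_displacements)
```

So each force calculation would run on a 1500-atom system rather than the original 20 atoms, which is where the memory goes.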

Offline Yue-Wen Fang

Re: phonon bandstructure calculation: caused collective abort
« Reply #4 on: August 31, 2015, 09:16 »
Dear Prof. Jess,

 Yes, I have repeated the structure, so the supercell is quite large. Following your earlier suggestion, I used several nodes with large memory (24 GB per core), but the calculation still failed. By contrast, the same script runs on my Windows PC, which has 8 GB of memory in total and 4 cores, and it is very fast there. I also noticed that when I run the script directly in a terminal on the server, i.e. atkpython < .py >.log, it is much faster than jobs submitted through job.pbs. This is puzzling, so I suspect the error is caused by my job.pbs.

 Could you have a look at it for me? Please see the attached file.

Many thanks.


Offline Jess Wellendorff

Re: phonon bandstructure calculation: caused collective abort
« Reply #5 on: September 2, 2015, 13:05 »
I'm not really an expert on PBS, but the parallelization strategy can be tricky for phonon calculations. Essentially, ATK will try to parallelize over as many displacements as possible. However, each of these parallel processes works on the fully repeated system, which is memory-intensive. This may be why the cluster calculation did not go well. Please also see the paragraph "Running the phonon bandstructure calculation" in this tutorial: http://quantumwise.com/publications/tutorials/item/836-silicon-phonon-bandstructure.
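
Because every process holds the fully repeated system, the memory needed on a node grows with the number of MPI processes placed on it. A minimal Python sketch of that budgeting is given below. The memory figures are hypothetical placeholders, not measurements from this calculation, and the function name is made up for the example.

```python
def max_processes_per_node(node_memory_gb, memory_per_process_gb, cores_per_node):
    """How many MPI processes fit on one node if each holds the full repeated system."""
    if memory_per_process_gb <= 0:
        raise ValueError("memory_per_process_gb must be positive")
    fit_by_memory = int(node_memory_gb // memory_per_process_gb)
    # Never use more processes than cores, and never fewer than zero.
    return max(0, min(fit_by_memory, cores_per_node))


# Hypothetical numbers for illustration only.
node_memory_gb = 32         # assumed memory available on one node
memory_per_process_gb = 12  # assumed footprint of one process
cores_per_node = 4          # "4 cores per node", as in the first post

n_fit = max_processes_per_node(node_memory_gb, memory_per_process_gb, cores_per_node)
print("processes that fit on one node:", n_fit)
```

With these placeholder numbers only 2 of the 4 cores can be used per node. A result of 0 would mean that a single process does not fit on the node at all, in which case adding more processes cannot help.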

Offline Yue-Wen Fang

Re: phonon bandstructure calculation: caused collective abort
« Reply #6 on: September 2, 2015, 16:43 »
Hi again,

I modified some parameters in my job.pbs and now it works well, although I don't understand why.

It depends strongly on the environment variables. I also reproduced the result with other users on the same server, and the calculation is now very fast.