QuantumATK Forum

QuantumATK => General Questions and Answers => Topic started by: John on March 12, 2011, 02:02

Title: One error occured when calculating the DOS
Post by: John on March 12, 2011, 02:02
Dear Sir,
    An error occurred when I calculated the device DOS of my two-probe system, which has 270 atoms (carbon and hydrogen), using 2x2 CPUs. The error is shown below; can you give any suggestions? It was run with the newest version, 11.2. Thanks.


+------------------------------------------------------------------------------+
|                                                                              |
| Transmission Spectrum Analysis                                               |
|                                                                              |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Transmission   : ==================================================

+----------------------------------------------------------+
| Transmission Spectrum Report                             |
| -------------------------------------------------------- |
| Left electrode Fermi level  = -3.993602e+00 eV           |
| Right electrode Fermi level = -3.993602e+00 eV           |
| Energy zero                 = -3.993602e+00 eV           |
+----------------------------------------------------------+
   energy       T(up)
     eV
 -2.000000e+00   8.812919e-01
 -1.990000e+00   8.925038e-01
 -1.980000e+00   9.001541e-01

...........................................

  1.980000e+00   9.576673e-01
  1.990000e+00   9.638053e-01
  2.000000e+00   9.703618e-01
+----------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| Device Density of States                                                     |
|                                                                              |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating DOS            : =================================================Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(773).......: MPI_Allreduce(sbuf=0x95abc30, rbuf=0x1c820e0, count=4, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Reduce(764).........:
MPIR_Reduce_binomial(172):
do_cts(490)..............: Message truncated; 1405104 bytes received but buffer size is 16
rank 0 in job 17  node18_42298   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
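
A note on reading the error stack above: the failing call uses count=4 with MPI_INT. If MPI_INT is 4 bytes (an assumption, although it is the common case), that gives exactly the 16-byte buffer reported, while the incoming message was 1405104 bytes. A size mismatch like this often points to ranks not all taking part in the same collective call with the same arguments, but the log alone does not prove that. The Python sketch below only does this arithmetic on the log line; the helper name `parse_truncation` is made up for illustration and is not part of ATK or MPICH2.

```python
import re

# Line copied from the error stack in the post above.
LOG_LINE = "do_cts(490)..............: Message truncated; 1405104 bytes received but buffer size is 16"

# Assumption: MPI_INT is a 4-byte C int, which holds on most platforms.
ASSUMED_MPI_INT_BYTES = 4


def parse_truncation(line):
    """Return (bytes_received, buffer_bytes) from a 'Message truncated' line."""
    match = re.search(r"(\d+) bytes received but buffer size is (\d+)", line)
    if match is None:
        raise ValueError("not a 'Message truncated' line")
    return int(match.group(1)), int(match.group(2))


received, buffer_size = parse_truncation(LOG_LINE)
expected_buffer = 4 * ASSUMED_MPI_INT_BYTES  # count=4 in the failing MPI_Allreduce

print("buffer matches count=4 ints:", buffer_size == expected_buffer)
print("received / buffer ratio:", received / buffer_size)
```

The receive buffer is consistent with the four integers named in the call, so the oversized message most likely came from a rank that was sending something else at that point.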
Title: Re: One error occured when calculating the DOS
Post by: ajaramil on March 12, 2011, 03:39
I had the same problem when running an LDOS calculation on a two-probe system on multiple processors. 

This is most likely an MPI coding problem in the LDOS/DOS routines in ATK 11.2.0, so as a quick workaround I just ran the same calculation on a single processor and it worked fine.

Cheers
Title: Re: One error occured when calculating the DOS
Post by: zh on March 12, 2011, 10:30
Which version of MPI are you using? It seems that heavy communication between the nodes caused the problem. Please try MPICH2, and use more nodes.
Title: Re: One error occured when calculating the DOS
Post by: Anders Blom on March 12, 2011, 13:24
We cannot rule out that ATK is to blame, but in almost all cases this error is related to the hardware on the cluster. Different codes require different performance and features from the MPI environment, so it may be that some software packages do not trigger the problem while ATK does.

We'll try to run some tests on our own cluster + an external one next week. If you don't mind posting your offending scripts (all details), it will make it easier for us.

(Btw, don't try to use OpenMPI with ATK, it does not work at all.)
Title: Re: One error occured when calculating the DOS
Post by: John on March 13, 2011, 03:27
Dear Sir,
    My MPI is mpich2.1.0.8.
    I have tested the simple Li-H2-Li system, and almost no errors occurred in parallel calculations for any kind of analysis, including LDOS and DOS calculations.
    Maybe, as ajaramil said, this is most likely an MPI coding problem in ATK 11.2.0 when calculating the LDOS, DOS, and TransmissionPathways of a big system (such as my 200-atom one) in parallel (2x2 nodes). It worked fine when I ran the same calculations (LDOS, DOS, and TransmissionPathways) on a single processor.
Title: Re: One error occured when calculating the DOS
Post by: John on March 24, 2011, 03:22
Dear Sir,
     Fortunately, the problem with the parallel LDOS calculation is solved, but I still have a problem when calculating the DOS with the newest version, 11.2.1, in parallel on 2x2 nodes: the DOS output prints only "nan", as shown below.
PS: My MPI is mpich2.1.0.8. The command is:
"/home/user1/bin/mpich2/bin/mpiexec -machinefile node -n 4 /home/user1/bin/atk112/atkpython/bin/atkpython test.py >& test.txt"
Can you give any advice? Thank you.

output:
                            |--------------------------------------------------|
Calculating DOS            : ==================================================

+----------------------------------------------------------+
| Density of States Report                                 |
| -------------------------------------------------------- |
| Left electrode Fermi level  = -2.244959e+00              |
| Right electrode Fermi level = -2.244959e+00              |
| Energy zero                 = -2.244959e+00              |
| Units = eV    1/eV                                       |
+----------------------------------------------------------+
 -2.000000e+00            nan
 -1.990000e+00            nan
 -1.980000e+00            nan
 -1.970000e+00            nan
 -1.960000e+00            nan
 -1.950000e+00            nan
.......................................
  1.950000e+00            nan
  1.960000e+00            nan
  1.970000e+00            nan
  1.980000e+00            nan
  1.990000e+00            nan
  2.000000e+00            nan
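
A quick way to check a saved log for this symptom, sketched in plain Python: the snippet parses the two-column "energy, DOS" rows of a report like the one above and counts how many DOS values are NaN. The excerpt is copied from the output above; the function name `parse_dos_rows` is hypothetical and not an ATK function.

```python
import math

# Excerpt copied from the Density of States Report above.
REPORT = """\
 -2.000000e+00            nan
 -1.990000e+00            nan
  1.990000e+00            nan
  2.000000e+00            nan
"""


def parse_dos_rows(text):
    """Parse 'energy  dos' rows, skipping anything that is not two numbers."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            rows.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue  # header, border, or separator line
    return rows


rows = parse_dos_rows(REPORT)
nan_count = sum(1 for _, dos in rows if math.isnan(dos))
print(f"{nan_count} of {len(rows)} DOS values are NaN")
```

Running the same check on the serial and the parallel log makes it easy to see whether the NaNs depend on how the job was launched.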
Title: Re: One error occured when calculating the DOS
Post by: Anders Blom on March 24, 2011, 11:45
First of all, let us rule out the possibility that this is because you run it in parallel. Thus: do you get the same NaNs if you run in serial? Beyond that, we would have to see the script in order to troubleshoot.
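
As a minimal sketch of that serial-versus-parallel comparison, the Python snippet below only builds the two command lines and does not launch anything. The paths, the machine file name `node`, and the script name `test.py` are taken from the command quoted earlier in the thread; the helper `build_command` is hypothetical, and it assumes a serial run simply calls atkpython on the script without mpiexec.

```python
import shlex

# Paths and file names as given earlier in the thread; adjust for your installation.
MPIEXEC = "/home/user1/bin/mpich2/bin/mpiexec"
ATKPYTHON = "/home/user1/bin/atk112/atkpython/bin/atkpython"


def build_command(script, processes=1, machinefile=None):
    """Return the argument list for a serial (processes=1) or MPI run."""
    if processes == 1:
        # Assumption: a serial run calls atkpython directly, without mpiexec.
        return [ATKPYTHON, script]
    command = [MPIEXEC]
    if machinefile is not None:
        command += ["-machinefile", machinefile]
    command += ["-n", str(processes), ATKPYTHON, script]
    return command


serial = build_command("test.py")
parallel = build_command("test.py", processes=4, machinefile="node")

print(shlex.join(serial))
print(shlex.join(parallel))
```

Redirecting each run to its own log file keeps the two outputs side by side for comparison.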
Title: Re: One error occured when calculating the DOS
Post by: John on March 25, 2011, 03:15
I still cannot get the DOS in a serial calculation for my three-dimensional Au-molecule-Au system.
Title: Re: One error occured when calculating the DOS
Post by: Anders Blom on March 25, 2011, 11:11
We are working on a solution for this problem, which has been reported by other users as well.
Title: Re: One error occured when calculating the DOS
Post by: Anders Blom on April 7, 2011, 10:07
This issue is fixed in ATK 11.2.2.