QuantumATK Forum

QuantumATK => General Questions and Answers => Topic started by: wring on January 6, 2009, 02:54

Title: MPI error: killed by signal 9
Post by: wring on January 6, 2009, 02:54
Hi everyone!
  When I calculate a cobalt cluster, I get the error: rank 1 in job 34  n7_37676   caused collective abort of all ranks, exit status of rank 1: killed by signal 9. Why does this error occur? Is it because of my computer's hardware?
   Thank you for your replies!

Moderator edit: Updated subject to improve clarity and enable searching
Title: Re: question
Post by: Nordland on January 6, 2009, 13:23
Well, I think one thing is sure: this is not a pure ATK error.
To me it looks like an MPI error. Are you trying to run the calculation in parallel?
Title: Re: question
Post by: wring on January 7, 2009, 09:24
Yes, I run the calculation in parallel on 3 CPUs. The error occurs when the electrode calculation
finishes but the two-probe calculation does not start; I don't even see sc=0.
   Thank you for your replies!
Title: Re: question
Post by: carbn9 on January 7, 2009, 09:57
Hello wring. This error usually occurs when MPI was not installed or started correctly.

1. Use mpich 1.0.5p4.
2. Start MPI using the mpdboot command with port addresses.

Which command do you use to start mpd, and what does your mpd.hosts file contain? If you post them here, we can look for a solution.
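For reference, a minimal mpdboot session might look roughly like the following. The host names are hypothetical, and mpdboot/mpdtrace are the standard MPICH2 mpd utilities; adjust the host file and count to your own cluster:

```shell
# mpd.hosts: one host name per line, for example:
#   node1
#   node2
#   node3

# Start the mpd ring on 3 hosts listed in mpd.hosts
mpdboot -n 3 -f mpd.hosts
# Verify that all hosts have joined the ring
mpdtrace
```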

Kubar
Title: Re: MPI error: killed by signal 9
Post by: Anders Blom on January 7, 2009, 10:38
Just to clarify: The version of MPI to use with ATK is mpich2. ATK itself is compiled using version 1.0.5p4, but the latest edition from the MPICH2 homepage (http://www.mcs.anl.gov/research/projects/mpich2/) (1.0.8 at the time of writing) seems to work fine as well.
Title: Re: MPI error: killed by signal 9
Post by: wring on January 7, 2009, 13:24
I use mpich-1.2.5. The command is mpirun -np 3.
 Thanks for your replies!
Title: Re: MPI error: killed by signal 9
Post by: Anders Blom on January 8, 2009, 10:25
I split this topic. The original question was on a specific MPI issue, and follow-up posts should only address that point. Let's try not to mix different discussions in the same thread :)

MPICH1 (like 1.2.5) works fine - if you are running an ATK version pre-dating 2008.02. As far as I know, it does not work with the latest releases (2008.02 and later, which use MPICH2), and even if it does, it works "by accident" and is not recommended (i.e. it might not be stable, and errors might be hard to detect).

In MPICH2, the corresponding command is

Code
mpiexec -n 3
Title: Re: MPI error: killed by signal 9
Post by: duygu on February 18, 2009, 08:25
I have started to get the same error. At first I thought it was some kind of parallelization problem, so I tried different k-point samplings while changing the number of nodes used for the calculation.
Several tries, with (1,1,1) k-points on 4 nodes, (1,1,3) k-points on 2 nodes, and (1,1,3) k-points on 4 nodes, all give the following error:
rank 2 in job 46  DualQuad_19430   caused collective abort of all ranks
  exit status of rank 2: killed by signal 9

But somehow I had managed to run calculations with (1,1,5) k-points on 3 nodes, and they converged without error until yesterday. Yesterday I started another calculation with (1,1,5) k-points on 3 nodes, and this morning I got the same error. Do you know why?

ps. I am using ATK 2008.10.0 on a Linux-x86_64 machine. My mpich2 version is ATK 2008.10.0 Linux-x86_64
Title: Re: MPI error: killed by signal 9
Post by: Anders Blom on February 18, 2009, 10:34
I assume you meant something like 1.0.5 for the MPICH2 version.

This is a difficult error. In our experience it is not really an ATK error; rather, something has gone wrong in the MPI communication. It might be that the system geometry is too large. Are you parallelizing this calculation over two individual nodes, or over the cores of a dual-core CPU? We generally do not recommend MPI-parallelizing over the cores, as discussed extensively on this forum (see e.g. this post (http://quantumwise.com/forum/index.php?topic=95.msg432#msg432)).
Title: Re: MPI error: killed by signal 9
Post by: duygu on February 19, 2009, 10:39
Oops! Sorry, I copied the wrong line.
Yes, the MPICH2 version is 1.0.5. We have two quad-core Xeon processors running under openSUSE 11.
The command mpiexec -n 4 /usr/local/btk/bin/atk test-mpi.py prints
# Master node
# Slave node
# Slave node
# Slave node
On the other hand, I have a large unit cell (10x10x110 Ang.^3) with 85 atoms.
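If the mpd ring itself is in doubt, MPICH2 also ships small diagnostic utilities that can be run before launching ATK. A quick sanity check (assuming the MPICH2 bin directory is on the PATH) could be:

```shell
# List the hosts currently participating in the mpd ring
mpdtrace
# Pass a message around the ring a number of times to confirm communication works
mpdringtest 100
```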
Title: Re: MPI error: killed by signal 9
Post by: Anders Blom on February 19, 2009, 13:17
I think you should avoid running ATK in parallel on this machine, and see if that solves the problem. Just run it in serial and make sure threading is enabled (see the manual (http://quantumwise.com/documents/manuals/ATK-2008.10/chap.parallel.html#sect1.parallel.mkl_threading)). I don't think the performance will be much affected, but you will definitely use less memory in total.
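A serial run with MKL threading enabled can be sketched as follows. MKL_NUM_THREADS is the standard Intel MKL environment variable for controlling the thread count; the script name is just a placeholder:

```shell
# Allow MKL to use up to 4 threads on this machine
export MKL_NUM_THREADS=4
# Run ATK in serial -- no mpiexec/mpirun wrapper
atk two_probe.py
```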
Title: Re: MPI error: killed by signal 9
Post by: Jahangiri on July 16, 2010, 08:15
Dear Users and Developers

I got approximately the same error:

========================================
caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
========================================

but having read this whole thread, I believe it is caused neither by the atomic system I want to calculate, nor by the machine I am running my jobs on.
I ran one of the tutorials (the gold nanowire) before and got the same result as in the tutorial; later, when I wanted to do almost the same calculation (a bigger molecule between two electrodes), I got the error mentioned above.
In one sentence: it was working properly before, but I don't know why it now gives this error (I only used another molecule, with almost 40 atoms, between the two probes).
Could you please let me know what the problem can be?

Best regards
Akbar