Author Topic: Problem: Fatal error in MPI_Allreduce  (Read 27351 times)


Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Problem: Fatal error in MPI_Allreduce
« on: February 11, 2011, 16:49 »
Hello,

Very recently I have been getting the error below. I'm using ATK 10.8.2 for a parallel calculation on 25 nodes, with 1 MPI process per node and 4 OpenMP threads per MPI process. My feeling is that it is not a problem with ATK; there might be something wrong with the network. However, I am posting here because I would like to hear your opinion about it. (There are 2 similar topics, but they deal with a "Message truncated" error.) Thank you very much.


+------------------------------------------------------------------------------+
|                                                                              |
| Atomistix ToolKit 10.8.0                                                     |
|                                                                              |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| Device Calculation  [Started Fri Feb 11 15:18:12 2011]                       |
|                                                                              |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| Left Electrode Calculation  [Started Fri Feb 11 15:18:39 2011]               |
|                                                                              |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Eigenvalues    : ==================================================
Calculating Density Matrix : ==================================================
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x9ff43c0, rbuf=0x2aaab80aa650, count=2205, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(445)...............:
MPIC_Sendrecv(164)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 9 in job 1  chic2i38_33252   caused collective abort of all ranks
  exit status of rank 9: killed by signal 9

### JOB - END ########################################

MPD: killing mpd daemons
Fri Feb 11 15:34:18 CET 2011

### PBS - END ########################################

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5411
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Re: Problem: Fatal error in MPI_Allreduce
« Reply #1 on: February 12, 2011, 19:48 »
Yes, most likely it's a network issue: one or more nodes were unreachable at a moment when ATK needed to communicate across the nodes. Hopefully it's a spurious error that will not appear if you rerun; if it is recurrent, you may need to troubleshoot the network.

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Re: Problem: Fatal error in MPI_Allreduce
« Reply #2 on: November 25, 2011, 12:22 »
Hello, sorry for bringing this up again. Problems similar to the one mentioned above happen to me from time to time. I am somewhat puzzled and am trying to find a way to figure out the source of the problem. The reason I post now is a set of recent errors, all occurring right after the equivalent bulk part of a device calculation. Is there a strong need for memory at that point? Here are some details about my calculations (I have not attached a script, because it is rather tricky with lots of custom stuff; however, I carefully checked it with VNL and I am sure the geometry is OK):
MPI version: mvapich2-1.0 (other MPIs are available, e.g. MPICH2 1.0.6)
ATK version: 11.2.3
Geometry: 5x5x3 copper bulk electrodes, 5x5x5 copper left/right, and 1 unit cell of a (6,0) CNT in between.
Calculator:
Code
basis_set = GGABasis.DoubleZetaPolarized
electron_temperature = 1200
k_point_sampling = (3, 3, 100)
grid_mesh_cutoff = 75
max_steps = 500
tolerance = 4.0e-5
exchange_correlation = GGA.PBE
#electrode_voltages = [0, 0] * Volt
#trans_energies = numpy.linspace(-2, 2, 401) * eV
#trans_kpoints = (5, 5, 1)

numAcc = NumericalAccuracyParameters(electron_temperature = electron_temperature * Kelvin,
                                     k_point_sampling = k_point_sampling,
                                     grid_mesh_cutoff = grid_mesh_cutoff * Hartree)
itCon = IterationControlParameters(max_steps = max_steps,
                                   tolerance = tolerance)
elCal = LCAOCalculator(basis_set = basis_set,
                       exchange_correlation = exchange_correlation,
                       numerical_accuracy_parameters = numAcc,
                       iteration_control_parameters = itCon,
                       poisson_solver = FastFourierSolver([[PeriodicBoundaryCondition] * 2] * 3),
                       checkpoint_handler = CheckpointHandler('checkpoint.nc', 60*Minute))
devCal = DeviceLCAOCalculator(basis_set = basis_set,
                             exchange_correlation = exchange_correlation,
                             numerical_accuracy_parameters = numAcc,
                             iteration_control_parameters = itCon,
                             poisson_solver = FastFourier2DSolver([[PeriodicBoundaryCondition] * 2,
                                                                   [PeriodicBoundaryCondition] * 2,
                                                                   [DirichletBoundaryCondition] * 2]),
                             electrode_calculators = [elCal, elCal],
                             #electrode_voltages = electrode_voltages,
                             checkpoint_handler = CheckpointHandler('checkpoint.nc', 60*Minute))
The very first run, on 16 GB nodes with no CNT but lots of vacuum between the central electrodes, went without errors, but since the number of 16 GB nodes on our cluster is low, I have now moved to 4 GB nodes, with a CNT in between. Actually, the system gained 24 carbon atoms (1 unit cell) and became smaller in volume (1 CNT unit is shorter than the vacuum I had before). Here are my errors. The first error is very likely caused by insufficient memory: I have 4 GB per quad-core node and wanted one MPI job per node, but I think there might have been two MPI processes on the node below (rank 7 and rank 10???).
Code
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:  1466 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
rank 7 in job 1  chic2g22_34011   caused collective abort of all ranks
  exit status of rank 7: killed by signal 9
rank 10 in job 1  chic2g22_34011   caused collective abort of all ranks
  exit status of rank 10: killed by signal 9
So I changed something in my PBS script. Old:
Code
# Number of nodes needed for mpd-startup
NNODES=$( uniq $PBS_NODEFILE | wc -l )
# Number of CPUs needed for mpiexec
NCPU=$( wc -l < $PBS_NODEFILE )
# Number of CPUs in use
useCPU=$(( $NNODES * $usePPN ))
sort $PBS_NODEFILE | uniq -c | gawk -v ppn=$usePPN '{ printf("%s:%s\n", $2, ppn); }' > mpd.nodes
mpdrun -machinefile mpd.nodes -n $useCPU -env MKL_DYNAMIC FALSE -env MKL_NUM_THREADS 4 $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py
New:
Code
usePPN=1
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=4
mpiexec -npernode $usePPN $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py
I made sure that really every node has only one atkpython_exec process running. OpenMP seems to work as expected, because the load on the quad-cores frequently goes above 25 percent. However, I now got the following error (right after the equivalent bulk finished):
Code
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:  6079 Speicherzugriffsfehler  PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 12284 Speicherzugriffsfehler  PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 31418 Speicherzugriffsfehler  PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x1314dda0, rbuf=0x2ab5400010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(240)...............:
MPIC_Recv(83).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab7ee1010, rbuf=0x2ab444f010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(228)...............:
MPIC_Send(41).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2abb63b010, rbuf=0x2ab83b1010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab9b96010, rbuf=0x2ab6104010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab9609010, rbuf=0x2ab757f010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab7524010, rbuf=0x2ab6820010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
mpiexec: Warning: tasks 0-4,7 exited with status 1.
mpiexec: Warning: tasks 5-6,8 exited with status 139.
In the first lines, "Speicherzugriffsfehler" is German and means memory access error (it is probably just the German wording for a segmentation fault). I have no idea why it appears in German here. The only differences between the first and the second run are that I changed the node type on our cluster (also 4 GB RAM, but the hardware may differ) and that I made sure every node really runs just one job. I suspected from the beginning that 4 GB of memory might be too little, so I recorded the memory usage of both runs via
Code
top | grep --line-buffered 'atkpython_exec' > mem_XXXXX.txt
and used a little script to plot the memory usage percentage reported by top; the result is attached. This looks very nice: you can see the SCF cycles of the electrode, then the cycles of the equivalent bulk, and right after that finishes it breaks! Can I be sure that I have enough memory? Does the two-probe part of the calculation, or perhaps a step right before it, need much more memory than the equivalent bulk? It would help a lot if this possibility could be ruled out. What could I do, or which information should I send, to figure out what is going on here? Maybe I need to ask the cluster admins, but there are lots of users here running lots of simulations, so the cluster seems (?) to be quite stable. Thank you very much.
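For reference, here is a minimal sketch of how such a top log could be parsed and plotted (this is an illustration, not the original script; the %MEM column index assumes the default top layout, and mem_XXXXX.txt is the placeholder file name from the command above):
Code
import matplotlib.pyplot as plt

# Parse the output of: top | grep --line-buffered 'atkpython_exec' > mem_XXXXX.txt
# In the default top layout the %MEM value is the 10th column (index 9).
mem_usage = []
with open('mem_XXXXX.txt') as log:
    for line in log:
        fields = line.split()
        if len(fields) >= 10:
            # handle a possible decimal comma from a German locale
            mem_usage.append(float(fields[9].replace(',', '.')))

plt.plot(mem_usage)
plt.xlabel('sample number')
plt.ylabel('%MEM of atkpython_exec')
plt.savefig('mem_usage.png')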

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5411
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Re: Problem: Fatal error in MPI_Allreduce
« Reply #3 on: November 25, 2011, 13:16 »
A lot of information to digest ;)

First of all, how reproducible is this? Does it occur every time, or only sometimes?

There is no large memory demand after the equivalent bulk; in fact, it is usually the equivalent bulk itself that is the most memory-hungry.

Is there a particular reason for the "old/new" difference, i.e. that you use mpdrun in one case and mpiexec in the other? This suggests a switch of MPI libraries, which would at least perhaps explain why the error messages are formatted differently.

MPICH2 is the most stable (meaning, most tested) MPI used with ATK. However, 1.0.6 is very (very!) old; they are up to 1.4 now and getting ready for 1.5. MVAPICH2 is up to 1.8, so I really can't rule out that you are hitting a bug in this ancient version.

Is there a possible correlation with the checkpoint file? If you have "cd $PBS_O_WORKDIR" in the PBS script it should be fine, but otherwise the checkpoint file may end up in a different, undesired location.

In newer ATK versions you don't need to set the MKL variables, so you can simplify your script a bit.

Now, finally, my main guess is the following: even though you make sure to assign only 1 atkpython_exec to each node, you don't know which other codes run on those nodes. So maybe ATK uses 25% of the memory, but if another code uses 80%...

What you may need to do (I often find myself in the same situation) is to overallocate on the queue and then underpopulate. That is, you ask for nodes=5:ppn=8 to get 5 nodes and block out all other users from those nodes, but then you start only 5 ATK processes (-n 5 to mpiexec), plus -npernode 1 so that they get evenly distributed.

Do you see what I mean?
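For illustration only, a minimal sketch of what such an overallocated but underpopulated PBS job might look like (the node count and ppn value are just the example numbers from above; $ATK_BIN_DIR and the job-name variables are taken from the scripts posted earlier in this thread):
Code
# allocate 5 full nodes (8 cores each) so that no other jobs land on them
#PBS -l nodes=5:ppn=8
cd $PBS_O_WORKDIR
# but start only 5 ATK processes in total, placed 1 per node
mpiexec -n 5 -npernode 1 $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py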

The primary disadvantage of this approach is that if you pay for usage (or somehow get audited for it), you will pay the same as if you used 40 MPI processes, since usage is normally counted on allocation. Of course, running 40 MPI processes on 5 nodes @ 4 GB is certainly not recommended, but if you could run 40 MPI processes on 40 (private) nodes, or on 40 nodes @ 16 GB, you would be really happy (and not pay more!). On large clusters with limited memory per node, however, you also have to pay for the privilege of having a node to yourself.

So the main problem is really that the memory per node is as low as 4 GB; that is too little to share a node with others.
« Last Edit: November 25, 2011, 13:21 by Anders Blom »

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Re: Problem: Fatal error in MPI_Allreduce
« Reply #4 on: November 28, 2011, 10:40 »
Thank you very much for this quick reply.

As I expected, no clear answer to these problems can be given, sadly. However, let me comment on the things you mention.

0.) It is a pity, but I am afraid to say that this is not 100% reproducible; at least I cannot find a pattern. However, I am pretty sure that if I start the same calculation with the same settings (nodes / MPI / OpenMP ...) on the same hardware (node types), then it will always happen.

1.) Your main guess is a good one, but I was always aware of this problem and I ALWAYS grab full nodes (no matter how many cores I actually use).

[[[By the way (but this is surely another story), I was never able to mix MPI and OpenMP in the sense of having more than one MPI job on ONE node and simultaneously more than one OpenMP thread per job. Example: with nodes=5:ppn=4, I could switch off OpenMP and use 20 MPI jobs, OR I could use 5 MPI jobs with 4 OpenMP threads each. Something like 10 MPI * 2 OpenMP always crashed. I think I posted about that a long time ago. However, this does not really hurt me too much, as I need lots of RAM anyway, so it is 1 MPI process per node.]]]

Do I really need the -n 5 option? If I have -npernode 1, then the number of MPI-jobs should be known and fixed (to the number of nodes).

2.) "In newer ATK versions you don't need to set the MKL variables, so you can simplify your script a bit."
So I guess 11.2.3 is new enough for that. (And I will tell the cluster admin to install the newest ATK soon.)
Btw.: I will test what happens if I put 2 MPI processes on one quad-core. Maybe my side comment above is no longer relevant.

3.) Regarding the mpiexec/mpdrun question: actually, I myself am not 100% sure what the difference is. I googled for a while and came to the conclusion that I should use mpiexec. (I found a piece of Python code that may be the source of mpiexec, and in it I found mpdrun calls.) Also, mpdrun is not mentioned very often on the web. But I did NOT switch MPI libraries, because to use them on our cluster one has to "activate" them ("module add mpi/mvapich2/gcc422"; I think "module" is some custom script that sets environment variables), otherwise mpdrun and mpiexec are unknown commands. I think the different output formatting really comes from using different nodes.

4.) Okay, I can test MPICH2 instead of MVAPICH2. However, you are right, it is ancient. Let me see if I can convince the admin to install a newer version.

5.) Checkpoint file: indeed, this caused me some trouble recently. I did not set a checkpoint handler and the default checkpoint file (somewhere in tmp) could not be written. The way I do it now, the output reports a checkpoint file within my home directory, exactly where I want it to be. And yes, I have "cd $PBS_O_WORKDIR".

In summary: I will test the old MPICH2 and will try to get an updated MPICH2 or MVAPICH2.

I will report about what happens.

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Re: Problem: Fatal error in MPI_Allreduce
« Reply #5 on: November 28, 2011, 12:16 »
Update: I made no changes to the device and calculator settings. I used 9 quad-core nodes with 8 GB memory. This time the error occurred later: it finished the left electrode, the equivalent bulk, the two-probe calculation, and a transmission spectrum. After that, a new cycle should have started with two CNT unit cells, but after the transmission spectrum I got the following error.
Code
+------------------------------------------------------------------------------+
|                                                                              |
| Transmission Spectrum Analysis                                               |
|                                                                              |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Transmission   : ==================================================

/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 19339 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 29948 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 17088 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:  2341 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:   715 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 25996 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 13896 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 30131 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 20241 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
mpiexec: Warning: tasks 0-8 exited with status 139.
I have attached a picture of the memory usage. You said the equivalent bulk is the most memory-consuming part; according to this graph, however, the two-probe part uses significantly more memory. Is this possible, or could it be that I am making a mistake somewhere? The transmission spectrum was correctly written to the nc file, which is very nice, so the crash happened later (maybe at the beginning of the next loop). I create my calculator object only once (outside a loop that builds a new device) and attach it to the device inside the loop. I use device.setCalculator(calculator), but I could also make a fresh copy with device.setCalculator(calculator()). Does it make a difference? The next test will be with (an old) MPICH2 instead of (an old) MVAPICH2. (And up-to-date MPI versions will be available to me soon.)

Offline Forum Administrator

  • Forum Administrator
  • Administrator
  • Heavy QuantumATK user
  • *****
  • Posts: 52
  • Country: dk
  • Reputation: 0
    • View Profile
    • QuantumATK at Synopsys
Re: Problem: Fatal error in MPI_Allreduce
« Reply #6 on: November 28, 2011, 12:39 »
I haven't had more than 10 seconds to look at this, but judging from the kind of graph you showed, I would think your labels are shifted by one: the high-memory area is probably the equivalent bulk, and the part you marked as transmission is more likely the two-probe part. You can double-check this by looking at the time used for each segment; this information is in the log file, where each step shows when it started and finished (grep for "2011").

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Re: Problem: Fatal error in MPI_Allreduce
« Reply #7 on: November 28, 2011, 14:10 »
I determined the different phases of the calculation "backwards": the last must be the transmission, before that comes the two-probe part, before that the equivalent bulk, and the first must be the electrode. I started the memory logging quite a while after the calculation had started, so the very first part looks shorter than it actually was, and I stopped the logging a long time after the calculation had crashed, so the length of the flat line at the end is arbitrary. Here is the timing:
Code
| Device Calculation  [Started Fri Nov 25 09:15:55 2011]                       |
| Left Electrode Calculation  [Started Fri Nov 25 09:16:17 2011]               |
| Left Electrode Calculation  [Finished Fri Nov 25 14:30:48 2011]              |
| Device Density Matrix Calculation   [Started Fri Nov 25 14:30:48 2011]       |
| Equivalent Bulk  [Started Fri Nov 25 14:31:51 2011]                          |
| Equivalent Bulk  [Finished Fri Nov 25 19:53:51 2011]                         |
| Device Calculation  [Finished Fri Nov 25 23:44:43 2011]                      |
# Here was the transmission calculation  (most time consuming part, I used 5x5 k-points)
# followed by the error messages
Sat Nov 26 15:07:34 CET 2011  # <- end of output
Also: the equivalent bulk took 24 steps, which matches the number of "memory peaks" in the part that I marked as equivalent bulk.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5411
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Re: Problem: Fatal error in MPI_Allreduce
« Reply #8 on: November 28, 2011, 14:58 »
Ok, I suppose you are right. Anyway, the memory usage is not too large in any part of the calculation, so I don't see how this ("out of memory") could explain any crash.

Ok, so the T(E) is in the NetCDF, yes, very nice.

You are right that mpiexec is just a script that calls mpd(mpi?)run, and whatever works locally is what you should use; it would in any case not be related to the problems discussed here. However, I think in most cases it's more appropriate to call mpiexec to start the calculation.

There is a very clear difference between setCalculator(calculator) and setCalculator(calculator()). In a loop, the latter is the proper way to do it. Without knowing precisely what you loop over, this is a point that could crash the run, or it could have nothing to do with the crash. But the use of setCalculator(calculator) is not "normal behavior" (even though we willingly admit it looks like it should be; Python logic is not always equal to common sense, and there is also some deep-level ATK code that matters for the difference between these two statements).
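To illustrate the recommended pattern, here is a minimal sketch (not the actual script from this thread; build_device is a hypothetical helper for constructing the central region, and devCal is the device calculator defined in the configuration posted above):
Code
for n_cells in [1, 2]:                  # e.g. loop over the number of CNT unit cells
    device = build_device(n_cells)      # hypothetical helper that builds the device configuration
    device.setCalculator(devCal())      # note the (): attach a fresh copy of the calculator
    device.update()                     # run the self-consistent calculation
    nlsave('device_%i.nc' % n_cells, device)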

Side note: yes, Intel very quietly fixed something (for Linux) in the MKL update that is in ATK 11.2, so that threading now works as the manual claims it should. Still, there are clear complications when mixing MPI and OpenMP; it's not for the faint-hearted. At least our basic experience is that on a simple cluster, running with the defaults (no environment variables) gives you threading. But on special clusters the cores are often blocked off, so you can't spread a process over multiple cores. This is a good thing in a sense, because it keeps jobs apart better and somehow also optimizes memory access.

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Re: Problem: Fatal error in MPI_Allreduce
« Reply #9 on: November 28, 2011, 15:23 »
Thanks again for the quick reply.

I will keep you up to date on what I find out about the error. Maybe I can really figure it out (it is quite surely NOT a matter of insufficient memory). I changed my loop to "setCalculator(calculator())".
I hope it is okay to create the device calculator outside the loop and to attach the electrode calculators via "Device...Calculator(electrode_calculators=[elCalculator, elCalculator])", i.e. without the calling parentheses. This is how I saw it in the manual.

By the way, I am doing device calculations that always use the same electrodes. It would be nice to be able to save the electrode state once and then reuse it. Is this possible? I have not checked what happens if I use the initial_state argument, since my central region changes its geometry (different number of atoms).

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5411
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Re: Problem: Fatal error in MPI_Allreduce
« Reply #10 on: November 28, 2011, 15:28 »
Yes, that's the correct way.

Electrodes: http://www.quantumwise.com/publications/tutorials/mini-tutorials/130-reusing-electrodes (generally: see http://www.quantumwise.com/publications/tutorials/mini-tutorials/)

Btw, if your central region changes its number of atoms, your setCalculator(calculator) WILL crash!

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Re: Problem: Fatal error in MPI_Allreduce
« Reply #11 on: December 16, 2011, 11:21 »
It's me again, still having problems with my calculations going down with segmentation fault errors.

I have done some further testing and want to report what I found. (There are 3 figures with notes attached. The time axis is not meaningful, as I changed the sampling delay of my memory logging command, "top -d 0.05 | grep --line-buffered 'Mem:' > mem_XXXXXXX.txt", between runs; "-d" was always different, e.g. fig. 1 used "-d 0.01".)

The first calculations were done on an 8 GB node; my memory log file is shown in the first picture. The very first part is interesting: there you see 3 segments. The first calculation crashed, the second one was interrupted by me, and the third one did not crash with a segmentation fault. Now, as noted in the picture, for the first calculation I used MPI with just one process (but on a full node). The second and third calculations were done in serial (!) mode, without any MPI, i.e. just "$ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py" (not mpiexec ...). So my idea was that there must be something wrong with your MPI.

I repeated the very same calculation (serial) on a 16 GB node, to avoid the out-of-memory problem that occurred at the end of the first calculation. This one also crashed with a segmentation fault, see figure 2.

Then I went back to an 8 GB node and it also crashed, see figure 3.

Now, I noticed that when a run is going to crash, there are 2 memory peaks visible in the log, while this is not the case in the successful calculation (see the notes in fig. 1). Maybe this is somehow helpful. At least the memory log shows regular patterns that are identical at every calculation startup (as they should be!), except for those spikes.


For now, I am not sure what to learn from this. It could even be a hardware problem, because ALL calculations were done on different nodes. However, if you find the time (no hurry!), please have a look at this. At least this time the possibilities are more limited, because it's not MPI. Can you really rule out that this is a problem with ATK? If so, I'll have to talk to a cluster admin here...

(Btw., I will try to create a minimal script of my calculation and send it privately. No hurry if you are busy right now.)


Thank you very much.

Offline ziand

  • Heavy QuantumATK user
  • ***
  • Posts: 78
  • Country: de
  • Reputation: 5
    • View Profile
Re: Problem: Fatal error in MPI_Allreduce
« Reply #12 on: December 16, 2011, 16:01 »
Okay, I created a minimal example to reproduce the error: on a 4 GB node even the electrode calculation breaks with the known segmentation fault. Obviously, this is nothing to worry about, as 4 GB may simply be too little for this calculation. So I moved to some 16 GB nodes with 2 MPI processes per node, which should give me 8 GB per process. Now the electrode calculation went fine, and the device calculation started as well!!! From my point of view, this should not have happened. So maybe the whole thing was my fault, and maybe there is some sort of non-trivial bug in my own script. That would be a bit embarrassing for me. Anyway, when you read this and the previous post, I suppose you should not spend much time on it right now (although the memory log I attached previously is still interesting, as I do not see such a long and steep growth in memory consumption now; so if I made a mistake in my scripts, I have no idea what it could be). A final question that you might be able to answer quickly: there once was a hint that you can or should do
Code
import NL
NL.MAXMEMORY=XXXX
I have had this in my scripts for a long time and often do not think about it very much. The question is: can this be a source of trouble? What happens if I set it (a) too small, e.g. XXXX=3000 on a 16 GB machine, or (b) too large, e.g. XXXX=8000 on a 4 GB machine? Both cases may have happened to me in the past. (Right now, my calculation is running with XXXX=3400 on 16 GB nodes with 2 MPI processes per node.)

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5411
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Re: Problem: Fatal error in MPI_Allreduce
« Reply #13 on: December 16, 2011, 19:51 »
Ok... I'll not look at this right now, but keep my fingers crossed that things run now ;)

Well, I will look at the memlog.

MAXMEMORY sets the threshold at which ATK switches to a slower algorithm that uses less memory. The default corresponds to about 1.5 GB. If you set it higher than the RAM you have, the faster algorithm may use more memory than is available, and if you run out, you run out. If it's too low, you switch to the slower algorithm unnecessarily.

My advice would be to use the default value, especially on 4 GB nodes. Note that MAXMEMORY is not supposed to match the total RAM you have, since the quantity it is measured against is only part of the total memory consumption of a calculation. So there is a clear risk that with MAXMEMORY close to 3500 you will select the faster method, which then uses more than 4 GB of memory in total, and that's why it doesn't work.