Hello,
sorry for bringing this up again. Problems similar to the one mentioned above happen to me from time to time. I am somewhat puzzled and am trying to find a way to figure out the source of the problem. The reason I am posting now is a series of recent errors, all of which occur right after the equivalent-bulk part of a device calculation. Is there a particularly high memory demand at that point?
Here are some details about my calculations (I have not attached a script, because it is rather tricky with lots of custom stuff; however, I checked the geometry carefully in VNL and am sure it is OK):
MPI version: mvapich2-1.0 (other MPIs are available, e.g. MPICH2 1.0.6)
ATK version: 11.2.3
Geometry: 5x5x3 copper bulk electrodes, 5x5x5 copper layers left/right, and 1 unit cell of a (6,0) CNT in between.
Calculator:
basis_set = GGABasis.DoubleZetaPolarized
electron_temperature = 1200
k_point_sampling = (3, 3, 100)
grid_mesh_cutoff = 75
max_steps = 500
tolerance = 4.0e-5
exchange_correlation = GGA.PBE
#electrode_voltages = [0, 0] * Volt
#trans_energies = numpy.linspace(-2, 2, 401) * eV
#trans_kpoints = (5, 5, 1)
numAcc = NumericalAccuracyParameters(
    electron_temperature=electron_temperature * Kelvin,
    k_point_sampling=k_point_sampling,
    grid_mesh_cutoff=grid_mesh_cutoff * Hartree)

itCon = IterationControlParameters(
    max_steps=max_steps,
    tolerance=tolerance)

elCal = LCAOCalculator(
    basis_set=basis_set,
    exchange_correlation=exchange_correlation,
    numerical_accuracy_parameters=numAcc,
    iteration_control_parameters=itCon,
    poisson_solver=FastFourierSolver([[PeriodicBoundaryCondition] * 2] * 3),
    checkpoint_handler=CheckpointHandler('checkpoint.nc', 60 * Minute))

devCal = DeviceLCAOCalculator(
    basis_set=basis_set,
    exchange_correlation=exchange_correlation,
    numerical_accuracy_parameters=numAcc,
    iteration_control_parameters=itCon,
    poisson_solver=FastFourier2DSolver([[PeriodicBoundaryCondition] * 2,
                                        [PeriodicBoundaryCondition] * 2,
                                        [DirichletBoundaryCondition] * 2]),
    electrode_calculators=[elCal, elCal],
    #electrode_voltages=electrode_voltages,
    checkpoint_handler=CheckpointHandler('checkpoint.nc', 60 * Minute))
The very first run, on 16 GB nodes with no CNT but a lot of vacuum between the central electrodes, went through without errors. But since the number of 16 GB nodes on our cluster is low, I moved to 4 GB nodes, now with a CNT in between. The system gained 24 carbon atoms (1 unit cell) but actually shrank in volume (1 CNT unit cell is shorter than the vacuum gap I had before).
Here are my errors:
The first error is very likely caused by insufficient memory. I have 4 GB per quad-core node and wanted one MPI process per node, but I think there may have been two MPI processes on one node below (ranks 7 and 10?).
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 1466 Segmentation fault PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
rank 7 in job 1 chic2g22_34011 caused collective abort of all ranks
exit status of rank 7: killed by signal 9
rank 10 in job 1 chic2g22_34011 caused collective abort of all ranks
exit status of rank 10: killed by signal 9
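To check this in future runs, I could let every process print its host at the top of the ATK script (just a sketch, plain Python, nothing ATK-specific):

# Sanity check at the top of the job script: every MPI process prints
# its hostname and PID, so I can see whether two ranks share a node.
import os
import socket
print('host=%s pid=%d' % (socket.gethostname(), os.getpid()))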
So I changed my PBS script.
Old:
# Number of nodes needed for mpd-startup
NNODES=$( uniq $PBS_NODEFILE | wc -l )
# Number of CPUs needed for mpiexec
NCPU=$( wc -l < $PBS_NODEFILE )
# Number of CPUs in use
useCPU=$(( $NNODES * $usePPN ))
sort $PBS_NODEFILE | uniq -c | gawk -v ppn=$usePPN '{ printf("%s:%s\n", $2, ppn); }' > mpd.nodes
mpdrun -machinefile mpd.nodes -n $useCPU -env MKL_DYNAMIC FALSE -env MKL_NUM_THREADS 4 $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py
New:
# one MPI process per node; MKL/OpenMP threads use the remaining cores
usePPN=1
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=4
mpiexec -npernode $usePPN $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py
I verified that every node really runs only one atkpython_exec process. OpenMP threading seems to work as expected, because the load on the quad-core nodes frequently goes above 25 percent (i.e. more than one core is busy). However, I now got the following error (right after the equivalent bulk finished):
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 6079 Speicherzugriffsfehler PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 12284 Speicherzugriffsfehler PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 31418 Speicherzugriffsfehler PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x1314dda0, rbuf=0x2ab5400010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(240)...............:
MPIC_Recv(83).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab7ee1010, rbuf=0x2ab444f010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(228)...............:
MPIC_Send(41).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2abb63b010, rbuf=0x2ab83b1010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
[Three more MPI_Allreduce error stacks followed, identical to the previous one apart from the buffer addresses.]
mpiexec: Warning: tasks 0-4,7 exited with status 1.
mpiexec: Warning: tasks 5-6,8 exited with status 139.
"Speicherzugriffsfehler" in the lines above is German and means "memory access error" (presumably it is just the German wording for a segmentation fault; I have no idea why it appears in German here). The only differences between the first and the second run are that I changed the node type on our cluster (also 4 GB RAM, but the hardware may differ) and that I made sure every node really runs just one job. I suspected from the beginning that 4 GB might be too little, so I recorded the memory usage during both runs via
top | grep --line-buffered 'atkpython_exec' > mem_XXXXX.txt
and used a little script to plot the memory-usage percentage reported by top. The result is attached.
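In case it matters, here is a minimal sketch of that plotting script (run in an ordinary Python with matplotlib; the column index assumes top's default layout, where %MEM is the tenth field, and the comma handling is only there because of the German locale on our nodes):

import matplotlib.pyplot as plt

mem = []
with open('mem_XXXXX.txt') as f:   # the log from the top/grep pipe above
    for line in f:
        fields = line.split()
        if len(fields) >= 10:
            # field 10 of top's default layout is %MEM; the German
            # locale prints decimals with a comma
            mem.append(float(fields[9].replace(',', '.')))

plt.plot(mem)
plt.xlabel('sample (one per top refresh)')
plt.ylabel('%MEM of atkpython_exec')
plt.savefig('mem_usage.png')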
This looks very reasonable: you can see the SCF cycles of the electrode, then the cycles of the equivalent bulk, and right after that finishes, it breaks! Can I be sure that I have enough memory? Does the two-probe part of the calculation, or perhaps a step right before it, need much more memory than the equivalent bulk? It would help a lot if this possibility could be ruled out.
What could I do, or what information should I send, to figure out what is going on here? Maybe I need to ask the cluster admins, but there are lots of users here running lots of simulations, so the cluster seems (?) to be quite stable.
Thank you very much.
Update:
I made no changes to the device and calculator settings, but used 9 quad-core nodes with 8 GB of memory. This time the error occurred later: the run finished the left electrode, the equivalent bulk, the two-probe part, and a transmission spectrum. After that, a new loop iteration should have started with two CNT unit cells, but right after the transmission spectrum I got the following error.
+------------------------------------------------------------------------------+
| |
| Transmission Spectrum Analysis |
| |
+------------------------------------------------------------------------------+
|--------------------------------------------------|
Calculating Transmission : ==================================================
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 19339 Segmentation fault PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
[Eight more identical Segmentation fault lines followed, one per rank, differing only in the PID.]
mpiexec: Warning: tasks 0-8 exited with status 139.
I have attached a picture of the memory usage. You said the equivalent bulk is the most memory-consuming part, but according to this graph the two-probe part uses significantly more memory. Is that possible, or could I be making a mistake somewhere?
The transmission spectrum was correctly written to the nc file, which is very nice, so the crash happened afterwards (maybe at the beginning of the next loop iteration).
I create my calculator object only once, outside a loop that builds a new device in each iteration, and attach it to the device inside the loop. I use device.setCalculator(calculator), but I could also attach a fresh copy with device.setCalculator(calculator()). Does that make a difference?
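Schematically, the loop looks like this (simplified; makeDevice stands for my own geometry-building code, not an ATK function, and devCal is the calculator defined above):

# Simplified structure of my loop; makeDevice is a placeholder for my
# own geometry builder, devCal is the DeviceLCAOCalculator from above.
for n_cells in [1, 2, 3]:                  # number of CNT unit cells
    device = makeDevice(n_cells)           # builds a fresh DeviceConfiguration
    device.setCalculator(devCal)           # reuses the one calculator object
    # device.setCalculator(devCal())       # ...or would a fresh copy be safer?
    device.update()                        # runs the SCF / device calculation
    nlsave('result_%i.nc' % n_cells, device)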
The next test will be with (an old) MPICH2 instead of (an old) MVAPICH2. (Up-to-date MPI versions will be available to me soon.)
I identified the different phases of the calculation "backwards": the last part must be the transmission, before that the two-probe calculation, before that the equivalent bulk, and the first part must be the electrode. I started the memory logging quite a while after the calculation had begun, so the very first part looks shorter than it actually was, and I stopped the logging long after the calculation had crashed, so the length of the flat line at the end is arbitrary. Here is the timing:
| Device Calculation [Started Fri Nov 25 09:15:55 2011] |
| Left Electrode Calculation [Started Fri Nov 25 09:16:17 2011] |
| Left Electrode Calculation [Finished Fri Nov 25 14:30:48 2011] |
| Device Density Matrix Calculation [Started Fri Nov 25 14:30:48 2011] |
| Equivalent Bulk [Started Fri Nov 25 14:31:51 2011] |
| Equivalent Bulk [Finished Fri Nov 25 19:53:51 2011] |
| Device Calculation [Finished Fri Nov 25 23:44:43 2011] |
# Here was the transmission calculation (most time consuming part, I used 5x5 k-points)
# followed by the error messages
Sat Nov 26 15:07:34 CET 2011 # <- end of output
Also: the equivalent bulk took 24 steps, which matches the number of "memory peaks" in the part I marked as equivalent bulk.
Okay, I created a minimal example to reproduce the error: on a 4 GB node, even the electrode calculation breaks with the familiar segfault. That in itself is nothing to worry about, since 4 GB may simply be too little for this calculation. So I moved to some 16 GB nodes with 2 MPI processes per node, i.e. about 8 GB per process. Now the electrode calculation went fine, and the device calculation started as well!
From my point of view, this should not have happened. So maybe the whole thing was my fault, and maybe there is some non-trivial bug in my own script; that would be a bit embarrassing for me. Anyway, when you read this and the previous post, I suppose you should not spend much time on it right now (though the memory log I attached previously is interesting indeed, since I cannot see such a long and steep growth in memory consumption right now; so if I did make a mistake in my scripts, I have no idea what it could be).
A final question that you might be able to answer quickly: there once was a hint that one can (or should) set
import NL
NL.MAXMEMORY=XXXX
I have had this in my scripts for a long time and rarely think about it. The question is: can this be a source of trouble? What happens if I set it (a) too small, e.g. XXXX=3000 on a 16 GB machine, or (b) too large, e.g. XXXX=8000 on a 4 GB machine? Both cases may have happened to me in the past. (Right now my calculation is running with XXXX=3400 on 16 GB nodes with 2 MPI processes per node.)
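For completeness, this is how it currently looks in my script (the value is, as far as I understand, a per-process limit in MB):

import NL
# 16 GB nodes with 2 MPI processes each -> roughly 8 GB per process;
# I keep the limit well below that. Value in MB (as far as I know).
NL.MAXMEMORY = 3400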