Hello,
sorry for bringing this up again. Problems similar to the one mentioned above happen to me from time to time. I'm somewhat puzzled and am trying to find a way to figure out the source of the problem. The reason I'm posting right now is a set of recent errors, all of which occur right after the equivalent bulk part of a device calculation. Does that step need a lot of memory?
Here are some details about my calculations (I have not attached the script, because it is rather tricky, with lots of custom stuff; however, I carefully checked it with VNL and I am sure the geometry is OK):
MPI-version: mvapich2-1.0 (there are other MPIs available like MPICH2 1.0.6)
ATK-version: 11.2.3
Geometry: 5x5x3 copper bulk electrodes, 5x5x5 copper left/right, and 1 unit cell of a (6,0) CNT in between.
Calculator:
basis_set = GGABasis.DoubleZetaPolarized
electron_temperature = 1200
k_point_sampling = (3, 3, 100)
grid_mesh_cutoff = 75
max_steps = 500
tolerance = 4.0e-5
exchange_correlation = GGA.PBE
#electrode_voltages = [0, 0] * Volt
#trans_energies = numpy.linspace(-2, 2, 401) * eV
#trans_kpoints = (5, 5, 1)
numAcc = NumericalAccuracyParameters(electron_temperature = electron_temperature * Kelvin,
                                     k_point_sampling = k_point_sampling,
                                     grid_mesh_cutoff = grid_mesh_cutoff * Hartree)

itCon = IterationControlParameters(max_steps = max_steps,
                                   tolerance = tolerance)

elCal = LCAOCalculator(basis_set = basis_set,
                       exchange_correlation = exchange_correlation,
                       numerical_accuracy_parameters = numAcc,
                       iteration_control_parameters = itCon,
                       poisson_solver = FastFourierSolver([[PeriodicBoundaryCondition] * 2] * 3),
                       checkpoint_handler = CheckpointHandler('checkpoint.nc', 60*Minute))

devCal = DeviceLCAOCalculator(basis_set = basis_set,
                              exchange_correlation = exchange_correlation,
                              numerical_accuracy_parameters = numAcc,
                              iteration_control_parameters = itCon,
                              poisson_solver = FastFourier2DSolver([[PeriodicBoundaryCondition] * 2,
                                                                    [PeriodicBoundaryCondition] * 2,
                                                                    [DirichletBoundaryCondition] * 2]),
                              electrode_calculators = [elCal, elCal],
                              #electrode_voltages = electrode_voltages,
                              checkpoint_handler = CheckpointHandler('checkpoint.nc', 60*Minute))
The very first run, on 16 GB nodes with no CNT but lots of vacuum between the central electrodes, finished without errors. Since our cluster has only a few 16 GB nodes, I have now switched to 4 GB nodes, with a CNT in between. The system gained 24 carbon atoms (1 unit cell) but actually got smaller in volume (one CNT unit cell is shorter than the vacuum gap I had before).
Here are my errors:
The first error is very likely caused by insufficient memory. I have 4 GB per quad-core node and wanted one MPI process per node, but I think there may have been two MPI processes on one node in the run below (rank 7 and rank 10?).
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 1466 Segmentation fault PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
rank 7 in job 1 chic2g22_34011 caused collective abort of all ranks
exit status of rank 7: killed by signal 9
rank 10 in job 1 chic2g22_34011 caused collective abort of all ranks
exit status of rank 10: killed by signal 9
So I changed something in my PBS script.
Old:
# Number of nodes needed for mpd-startup
NNODES=$( uniq $PBS_NODEFILE | wc -l )
# Number of CPUs needed for mpiexec
NCPU=$( wc -l < $PBS_NODEFILE )
# Number of CPUs in use
useCPU=$(( $NNODES * $usePPN ))
sort $PBS_NODEFILE | uniq -c | gawk -v ppn=$usePPN '{ printf("%s:%s\n", $2, ppn); }' > mpd.nodes
mpdrun -machinefile mpd.nodes -n $useCPU -env MKL_DYNAMIC FALSE -env MKL_NUM_THREADS 4 $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py
New:
usePPN=1
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=4
mpiexec -npernode $usePPN $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py
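(Just to illustrate what I mean by checking the placement: a minimal sketch, not part of my actual run, using only the Python standard library and Linux's /proc; the helper and the log file name are my own invention, nothing ATK-specific. If the same hostname shows up twice within one job, two MPI processes landed on the same node.)

# Placement/memory check sketch: every atkpython process appends one line
# with its hostname, PID and resident memory to a shared log file.
import os
import socket

def log_process_info(tag, logfile='process_placement.log'):
    rss_kb = 0
    # VmRSS from /proc/self/status is the resident set size in kB (Linux only).
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                rss_kb = int(line.split()[1])
                break
    with open(logfile, 'a') as log:
        log.write('%s host=%s pid=%d rss=%.1f MB\n'
                  % (tag, socket.gethostname(), os.getpid(), rss_kb / 1024.0))

log_process_info('startup')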
I verified that every node really has only one atkpython_exec process running. OpenMP threading seems to work as expected, because the load on the quad-cores frequently goes above 25 percent. However, now I get the following error (again right after the equivalent bulk has finished):
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 6079 Speicherzugriffsfehler PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 12284 Speicherzugriffsfehler PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 31418 Speicherzugriffsfehler PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x1314dda0, rbuf=0x2ab5400010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(240)...............:
MPIC_Recv(83).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab7ee1010, rbuf=0x2ab444f010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(228)...............:
MPIC_Send(41).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2abb63b010, rbuf=0x2ab83b1010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab9b96010, rbuf=0x2ab6104010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab9609010, rbuf=0x2ab757f010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab7524010, rbuf=0x2ab6820010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
mpiexec: Warning: tasks 0-4,7 exited with status 1.
mpiexec: Warning: tasks 5-6,8 exited with status 139.
In the lines above, "Speicherzugriffsfehler" is German and means memory access error (presumably it is just the German term for a segmentation fault); I have no idea why it appears in German here. The only differences between the first and the second run are that I changed the type of node on our cluster (also 4 GB RAM, but the hardware may differ) and that I made sure every node really runs just one job. I was doubtful from the beginning whether 4 GB of memory might be too little, so I recorded the memory usage during both runs via
top | grep --line-buffered 'atkpython_exec' > mem_XXXXX.txt
and used a little script (roughly along the lines sketched below) to plot the memory usage in percent as reported by top. The result is attached.
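(The plotting script is nothing fancy; something like the following, assuming top's default column layout where %MEM is the tenth field. The locale handling is just a guess on my side:)

# Sketch of the plotting script: read the lines grepped from top and plot the
# %MEM column over the sample number. Assumes top's default columns
# (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND).
import matplotlib.pyplot as plt

mem_percent = []
with open('mem_XXXXX.txt') as records:
    for line in records:
        fields = line.split()
        if len(fields) >= 10:
            # A German locale may print "0,5" instead of "0.5" for %MEM.
            mem_percent.append(float(fields[9].replace(',', '.')))

plt.plot(mem_percent)
plt.xlabel('top sample')
plt.ylabel('%MEM of atkpython_exec')
plt.savefig('mem_usage.png')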
This looks very nice: you can see the SCF cycles of the electrode, then the cycles of the equivalent bulk, and right after that finishes it breaks! Can I be sure that I have enough memory? Does the two-probe part of the calculation, or perhaps a step right before it, need much more memory than the equivalent bulk? It would help a lot if this possibility could be ruled out.
What could I do, or which information should I send, to figure out what is going on here? Maybe I need to ask the cluster admins, but there are lots of users here running lots of simulations, so the cluster seems (?) to be quite stable.
Thank you very much.