General Questions and Answers / Checkpoint file error and questions
« on: March 11, 2011, 16:06 »
Hello,
I'm looking for some information regarding the checkpoint file ATK uses in version 11.2. I manually set the location of the checkpoint file using:
Code
checkpoint_handler = CheckpointHandler('/home/nanotubes/vitesh/CNT/8.8-K24I24-opt11.2-cp.nc', 30*Minute)
calculator = DeviceLCAOCalculator(
    checkpoint_handler=checkpoint_handler,
    # ... (remaining calculator arguments not shown)
)
When I try to optimise the geometry of a two-probe system, the job crashes with what I think is a memory error just as the checkpoint file is created. The output file shows a standard out-of-memory error message:
Code
Calculating Eigenvalues : ==================================================
rank 0 in job 1 r1i1n11_47968 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
Judging by the timing of the crash (just as an SCF cycle is about to finish), I suspect the memory crash and the checkpoint file creation are linked.
The other error message I get is:
Code
Traceback (most recent call last):
  File "./zipdir/NL/Calculators/BulkCalculatorInterface.py", line 183, in _update
  File "./zipdir/NL/Calculators/LCAOCalculator/LCAOCalculator.py", line 1019, in scfLoop
  File "./zipdir/NL/Calculators/LCAOCalculator/LCAOCalculator.py", line 733, in scfLoopHamiltonian
  File "./zipdir/NL/Calculators/GenericParameters/CheckpointHandler.py", line 117, in _storeIfNecessary
  File "./zipdir/NL/IO/IOUtilityFunctions.py", line 566, in createNetCDFFile
OSError: [Errno 2] No such file or directory: '/home/nanotubes/vitesh/CNT/8.8-K24I24-opt11.2-cp.nc.tmp'
Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(773).......: MPI_Allreduce(sbuf=0x2aaaad19a350, rbuf=0x2aaaad19d4f0, count=2, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Reduce(764).........:
MPIR_Reduce_binomial(172):
do_cts(490)..............: Message truncated; 8632624 bytes received but buffer size is 8
This seems to point to the checkpoint file as the cause of the error. According to the 11.2 manual, the checkpoint file is a .nc file; however, the file the program is trying to write is a .nc.tmp file.
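Since Errno 2 on file creation usually means that a directory in the path is missing (or not visible from the machine running the job), one thing that can be checked independently of ATK is whether a file can be created next to the checkpoint path at all. Below is a minimal sketch in plain Python; the helper name `check_checkpoint_dir` is hypothetical and not part of ATK, and it only tests the file system, not anything ATK does internally.

```python
import os
import tempfile


def check_checkpoint_dir(checkpoint_path):
    """Return (ok, message): can a file be created next to checkpoint_path?

    Hypothetical helper for illustration; it is not part of ATK.
    """
    directory = os.path.dirname(os.path.abspath(checkpoint_path))

    # Errno 2 on file creation typically means this directory is missing.
    if not os.path.isdir(directory):
        return False, "directory does not exist: %s" % directory

    # Write and search permission are both needed to create a file here.
    if not os.access(directory, os.W_OK | os.X_OK):
        return False, "directory is not writable: %s" % directory

    # Final check: actually create and remove a scratch file.
    try:
        fd, scratch = tempfile.mkstemp(suffix=".tmp", dir=directory)
    except OSError as exc:
        return False, "could not create a file in %s: %s" % (directory, exc)
    os.close(fd)
    os.remove(scratch)
    return True, "ok: %s" % directory
```

Calling `check_checkpoint_dir('/home/nanotubes/vitesh/CNT/8.8-K24I24-opt11.2-cp.nc')` from a script submitted through the same queue would show whether the compute nodes see the same directory as the login node. With MPI it would need to pass on every rank, not only on rank 0.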
So my questions are:
1) Is there any difference between the .nc I directed it to write and the .tmp it is trying to write?
2) How large can the checkpoint file get during the course of a calculation? Is it overwritten at each SCF step that completes after the specified time interval has elapsed, or is the new data appended? Will the checkpoint file ever be bigger than the final .nc file?
3) Does saving a checkpoint file require a significant amount of memory during the checkpoint creation procedure?
4) Should I create the (empty) checkpoint file beforehand so the program has a file to write to?
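To make questions 1, 2 and 4 more concrete: my guess is that the .tmp name comes from the common write-then-rename pattern, where the data is first written to a temporary file beside the target and then renamed over it. Whether ATK actually works this way is an assumption on my part, not something the manual states. A minimal sketch of that pattern in plain Python (the function name `write_atomically` is hypothetical):

```python
import os


def write_atomically(final_path, payload):
    """Write payload (bytes) to final_path via a temporary file beside it.

    Illustrates the generic write-then-rename pattern only; it is an
    assumption that ATK's checkpointing does anything similar.
    """
    tmp_path = final_path + ".tmp"

    # Opening the temporary file fails with Errno 2 if the parent
    # directory is missing, whether or not final_path already exists.
    with open(tmp_path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())

    # Replace any previous file in one step, so a reader never sees a
    # half-written file and old data is overwritten rather than appended to.
    os.replace(tmp_path, final_path)
```

If the checkpoint were written this way, creating an empty .nc file beforehand would make no difference, because the failure happens when the .tmp file is opened. The file would also be replaced on each save rather than growing.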
I'd also like to ask a question about geometry optimisation. When you use VNL to create a configuration and add a calculator and a two-probe optimisation, is it valid to go to the editor and delete the device_configuration.update() line if you are not interested in the electronic structure of the initial guess geometry? Or will deleting this line mean that no SCF is performed during the geometry optimisation?