Topic: Checkpoint file error and questions

Offline Vit

  • Regular QuantumATK user
Checkpoint file error and questions
« on: March 11, 2011, 16:06 »
Hello, I'm looking for some information regarding the checkpoint file ATK uses in version 11.2. I manually set the location of the checkpoint file using:
Code
checkpoint_handler = CheckpointHandler('/home/nanotubes/vitesh/CNT/8.8-K24I24-opt11.2-cp.nc', 30*Minute)

calculator = DeviceLCAOCalculator(
    checkpoint_handler=checkpoint_handler,
    # ... remaining calculator parameters omitted in the original post
)
When trying to optimise the geometry of a two-probe system, I get a memory crash, apparently at the moment the checkpoint file is created. The output file shows a standard out-of-memory error message:
Code
Calculating Eigenvalues    : ==================================================
rank 0 in job 1  r1i1n11_47968   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
Judging by the timing of the crash (i.e. just as an SCF cycle is about to finish), I suspect the memory crash and the creation of the checkpoint file are linked. The other error message I get is:
Code
Traceback (most recent call last):
  File "./zipdir/NL/Calculators/BulkCalculatorInterface.py", line 183, in _update
  File "./zipdir/NL/Calculators/LCAOCalculator/LCAOCalculator.py", line 1019, in scfLoop
  File "./zipdir/NL/Calculators/LCAOCalculator/LCAOCalculator.py", line 733, in scfLoopHamiltonian
  File "./zipdir/NL/Calculators/GenericParameters/CheckpointHandler.py", line 117, in _storeIfNecessary
  File "./zipdir/NL/IO/IOUtilityFunctions.py", line 566, in createNetCDFFile
OSError: [Errno 2] No such file or directory: '/home/nanotubes/vitesh/CNT/8.8-K24I24-opt11.2-cp.nc.tmp'
Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(773).......: MPI_Allreduce(sbuf=0x2aaaad19a350, rbuf=0x2aaaad19d4f0, count=2, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Reduce(764).........:
MPIR_Reduce_binomial(172):
do_cts(490)..............: Message truncated; 8632624 bytes received but buffer size is 8
This seems to point to the checkpoint file as the cause of the error. According to the 11.2 manual, the checkpoint file is a .nc file; however, the file it is trying to write is a .tmp. So my questions are:

1) Is there any difference between the .nc file I told it to write and the .tmp file it is trying to write?
2) How large will a checkpoint file get during the course of a calculation? Is it overwritten at each SCF step that follows the specified time interval, or is the new data appended? Will the checkpoint file ever be bigger than the final .nc file?
3) Does saving a checkpoint file require a significant amount of memory?
4) Should I create the (empty) checkpoint file beforehand so the program has a file to write to?

I'd also like to ask a question about geometry optimisation: when you use VNL to create a configuration, add a calculator and a two-probe optimisation, is it valid to go to the editor and delete the device_configuration.update() line if you're not interested in the electronic structure of the initial guess geometry? Or will deleting this line mean that no SCF is run during the geometry optimisation?
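
Regarding question 1, nothing in this thread states why a .tmp file appears next to the requested .nc name. A common reason for such a file, assumed here and not confirmed for ATK, is that a program writes the new data to a temporary sibling file and renames it over the real name once the write has succeeded. The following plain-Python sketch shows that pattern; `write_atomically` is a hypothetical helper and not an ATK function.

```python
import os
import tempfile


def write_atomically(path, payload):
    """Write payload (bytes) to path via a temporary sibling file.

    Hypothetical helper for illustration only; not part of ATK.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # push the data to disk before renaming
    # os.replace swaps the file in one step, so a reader sees either the
    # old complete file or the new complete file, never a partial one.
    os.replace(tmp_path, path)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "example-cp.nc")
    write_atomically(target, b"first snapshot")
    write_atomically(target, b"second snapshot")  # overwrites, does not append
    print(os.listdir(workdir))  # only example-cp.nc remains, no .tmp file
```

Under this assumption, the .tmp file would hold the same kind of data as the .nc file and would only exist for the duration of the write. If several processes shared one .tmp name, one of them could rename the file away while another still expected it to exist, which would be consistent with a "No such file or directory" error.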

Offline Vit

  • Regular QuantumATK user
Re: Checkpoint file error and questions
« Reply #1 on: March 16, 2011, 15:44 »
A little bump for this - any insight into the checkpoint file would be nice (i.e. is it a typical ATK .nc file?)

Offline Anders Blom

  • QuantumATK Staff
Re: Checkpoint file error and questions
« Reply #2 on: March 16, 2011, 23:18 »
Fortunately, we have already fixed this problem, and will release an update (ATK 11.2.1) probably tomorrow that solves it.

For now, if you can't wait, the solution is to NOT specify your own name for the checkpoint file. That way, each node will use its own (random) name, and you're safe. What happens now is that all nodes use the same name, and all try to write to the checkpoint file at the same time, and this doesn't work.
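
To illustrate why per-node names avoid the collision, here is a minimal plain-Python sketch of building a file name that differs for every process. The helper `per_process_checkpoint_path` is hypothetical and is not how ATK generates its own random names; the workaround recommended above remains simply to leave the file name out.

```python
import os
import socket
import uuid


def per_process_checkpoint_path(directory, stem):
    """Return a checkpoint path that differs for every process.

    Hypothetical helper for illustration only: it combines the host name,
    the process ID and a random suffix, so two processes never target the
    same file even when they start from the same directory and stem.
    """
    tag = "{}-{}-{}".format(
        socket.gethostname(), os.getpid(), uuid.uuid4().hex[:8]
    )
    return os.path.join(directory, "{}-{}-cp.nc".format(stem, tag))


# Two calls, standing in for two processes, produce two different paths.
first = per_process_checkpoint_path("/tmp", "example")
second = per_process_checkpoint_path("/tmp", "example")
print(first)
print(second)
```

With one shared name, every process opens, writes and renames the same file; with distinct names, each process only ever touches its own file.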

Yes, the checkpoint file is an ordinary NC file. But I think that once the bug is fixed, you won't have to worry about all those other points.

We apologize for the inconvenience.
« Last Edit: March 16, 2011, 23:20 by Anders Blom »

Offline Anders Blom

  • QuantumATK Staff
Re: Checkpoint file error and questions
« Reply #3 on: March 16, 2011, 23:24 »
Well:

The checkpoint file is written every 30 minutes by default, and it should be comparable in size to the final NC file. You can set it to save at every SCF step, but that is not recommended. The reason for using elapsed time rather than step count to decide when to save is that, for a very small calculation, writing the file at every step introduces an overhead that reduces performance (the write might take as long as the step itself). Losing at most 30 minutes of work is considered acceptable in the worst case of a crash. For a large calculation, each step might take 30 minutes or more, so you get the same effect as saving at every step (or you lose 1-2 steps at most if it crashes).
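
The time-based rule described above can be sketched in plain Python as follows. This is an illustration of the idea only and not ATK's implementation; the class `TimedCheckpointer`, its method name and the injectable `clock` argument are all assumptions made for the example.

```python
import time


class TimedCheckpointer:
    """Save state only when at least interval_seconds have elapsed.

    Hypothetical class for illustration; not part of ATK.
    """

    def __init__(self, interval_seconds, save, clock=time.monotonic):
        self._interval = interval_seconds
        self._save = save          # callable that stores the state
        self._clock = clock        # injectable so the logic can be tested
        self._last_saved = clock()

    def store_if_necessary(self, state):
        """Call once per SCF step; returns True if a save happened."""
        now = self._clock()
        if now - self._last_saved < self._interval:
            return False           # too soon, skip the costly write
        self._save(state)
        self._last_saved = now
        return True


# Demonstration with a fake clock: six steps of 10 minutes each,
# checkpoint interval of 30 minutes (1800 seconds).
fake_times = iter([0, 600, 1200, 1800, 2400, 3000, 3600])
saved_steps = []
checkpointer = TimedCheckpointer(
    1800, saved_steps.append, clock=lambda: next(fake_times)
)
for step in range(1, 7):
    checkpointer.store_if_necessary(step)
print(saved_steps)  # steps 3 and 6 are saved: [3, 6]
```

With short steps, only every third step pays the cost of a write. If each step took 30 minutes or more, the same check would pass at every step, which matches the behaviour described for large calculations.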

Offline Nordland

  • QuantumATK Staff
Re: Checkpoint file error and questions
« Reply #4 on: March 17, 2011, 00:22 »
Quote from: Vit
A little bump for this - any insight into the checkpoint file would be nice (i.e. is it a typical ATK .nc file?)

Yes. It is an ordinary ATK .nc file in every respect and can be used as such.

Offline Vit

  • Regular QuantumATK user
Re: Checkpoint file error and questions
« Reply #5 on: March 17, 2011, 19:06 »
Thanks for the information and update.