Author Topic: Restarting from a checkpoint file  (Read 2661 times)


Offline perfetti

  • QuantumATK Guru
  • ****
  • Posts: 103
  • Country: us
  • Reputation: 2
    • View Profile
Restarting from a checkpoint file
« on: March 1, 2012, 04:05 »
Dear Everyone,
I am restarting a job from a checkpoint file. The job was interrupted before the EquivalentBulk calculation converged, and after I restarted it, it began again from step 0E of the EquivalentBulk, which means another 8 hours to get back to the point where it was interrupted (22E).
Is there a parameter I can adjust so that the job restarts from exactly the point where it was interrupted? That would save me a lot of time.
I checked the tutorial, which seems to say that an SCF calculation is stored in the checkpoint file only once it has converged. Any advice would be appreciated. Thank you!

         
 
« Last Edit: March 1, 2012, 22:38 by perfetti »

Offline perfetti

Re: Restarting from a checkpoint file
« Reply #1 on: March 1, 2012, 22:45 »
My job has been running for a whole day without producing any new output. It reached iteration step 22E, with dE = 1.063183e-02 and dH = 5.849138e-04, and then stalled there, although all CPUs are still busy at more than 90% usage.

This transmission calculation was restarted from a checkpoint file. I am not sure whether the restart changed the nature of the job. I can see that the "device" has been changed to "Bulk" in the Scripter, and I don't know whether this caused the problem.

Thank you very much. I really appreciate it.



Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5411
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Re: Restarting from a checkpoint file
« Reply #2 on: March 2, 2012, 15:24 »
So, the checkpoint file is not a magical solution to this situation, I fear. If you crashed in EquivalentBulk, the system stored in the checkpoint is a bulk system, and it will not be possible to use it to recapture the calculation for the device. This is something we can improve in the future.

You may be able to run without the EquivalentBulk, and use NeutralAtoms instead. That saves you time and makes the calculation parallelize better. The only disadvantage is that usually you need more device iterations (however, these are more parallel, so you can compensate by having more MPI nodes).

Offline perfetti

Re: Restarting from a checkpoint file
« Reply #3 on: March 2, 2012, 17:05 »
I have just restarted it from the beginning. I hope that will help.
Thank you for your hard work.