QuantumATK Forum

QuantumATK => General Questions and Answers => Topic started by: Sukhbir on September 22, 2015, 18:55

Title: How to find checkpoint file and restart calculation in linux redhat system
Post by: Sukhbir on September 22, 2015, 18:55
Hello,
I was running a calculation on a graphene FET as a parallel job with mpiexec, but the calculation stopped because of a power cut, and I am unable to find the checkpoint file to restart it. I am currently using ATK-VNL version 12.8.2 on a Linux operating system (Red Hat).
So, can anyone tell me:
1. How do I find the appropriate checkpoint file?
2. How do I restart the calculation? (If anyone can provide a video, that would help.)
3. Will the restarted calculation provide all the analysis results?
 
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: zh on September 23, 2015, 03:29
Please use a newer version. Support for this very old version is limited.
1. It may be stored in '/tmp'. The name and path of the checkpoint file may be written to the log file of your job.
2. See here: https://www.quantumwise.com/publications/tutorials/item/502-restarting-stopped-calculations
3. It depends on how much information has been stored in the checkpoint file.
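
If many files have accumulated in '/tmp', sorting the candidates by modification time can help narrow down which one belongs to the interrupted job. The following is only a minimal Python sketch, not an ATK feature: the helper name list_recent_files is hypothetical, and the '.nc' suffix is an assumption based on the NetCDF files mentioned in this thread, so adjust it to whatever your version actually writes.

```python
import os
import time


def list_recent_files(directory, suffix=".nc", max_age_days=7.0):
    """Return (path, mtime, size) tuples for matching files, newest first.

    The ".nc" suffix is an assumption; change it if your checkpoint
    files use a different naming scheme.
    """
    cutoff = time.time() - max_age_days * 86400
    found = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not name.endswith(suffix):
            continue
        try:
            # Skip directories and files that vanish while we scan.
            if not os.path.isfile(path):
                continue
            info = os.stat(path)
        except OSError:
            continue
        if info.st_mtime >= cutoff:
            found.append((path, info.st_mtime, info.st_size))
    # Newest files first: the interrupted job's file should be near the top.
    found.sort(key=lambda item: item[1], reverse=True)
    return found


if __name__ == "__main__":
    for path, mtime, size in list_recent_files("/tmp"):
        print(time.ctime(mtime), size, path)
```

A file whose modification time is close to the time of the power cut is the most likely candidate, but this is a heuristic and does not replace the information in the job's log file.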
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: Sukhbir on September 23, 2015, 07:05
Thanks for your reply,
But I am confused, because the tmp folder contains many checkpoint files. I have opened all of them in an editor, but they do not contain the correct input information. Is there any other folder where the checkpoint file is stored by default, and how should I analyse it? Secondly, I am unable to find the log file.
Please guide me.
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: Jess Wellendorff on September 23, 2015, 08:37
If you cannot find the log file of the job (example: mpiexec -n 4 myscript.py > mylog.log), it is going to be hard for you to restart this job. Why not simply redo the calculation from scratch? That is a pretty common consequence of power failures on a supercomputer.
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: Sukhbir on September 23, 2015, 12:44
Thanks for the reply,

I have got the checkpoint file from the tmp folder. Now I want to know: should I keep the name of the output file (analysis.nc) the same as it was in the previous script, given that half of the calculation is already complete? Will the calculation start from where it stopped, and can I then expect all the results?
   
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: zh on September 24, 2015, 06:27
It is better to keep it. 

The restarted calculation does not continue exactly from the point where the last calculation stopped, because the stored information is only written after specific steps. For example, during the self-consistent field (SCF) calculation of a bulk configuration, the charge density may be written to the checkpoint file after every SCF step. If the calculation stops during the i-th step, the restarted job will continue from the saved charge density of the (i-1)-th step.
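
To illustrate the idea, here is a toy Python sketch of a loop that writes a checkpoint after every completed step. It is not ATK code and does not use the ATK checkpoint format: the function run_scf, the JSON file, and the "density" update rule are all made up for illustration. An interruption during step 4 leaves only the state of step 3 on disk, so the restart resumes from there.

```python
import json
import os
import tempfile


def run_scf(checkpoint_path, n_steps, fail_at=None):
    """Toy iteration that checkpoints after every completed step.

    Returns (step_resumed_from, final_state).
    """
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "density": 1.0}

    resumed_from = state["step"]
    for step in range(resumed_from + 1, n_steps + 1):
        if step == fail_at:
            # Simulated power cut in the middle of this step:
            # nothing from this step reaches the checkpoint.
            raise RuntimeError("interrupted during step %d" % step)
        # Made-up update rule standing in for one SCF iteration.
        state = {"step": step, "density": 0.5 * state["density"] + 1.0}
        # Write to a temporary file first so that a crash during the
        # write cannot corrupt the previous checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return resumed_from, state


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "toy_checkpoint.json")

try:
    run_scf(ckpt, n_steps=10, fail_at=4)   # "power cut" during step 4
except RuntimeError:
    pass

resumed_from, final_state = run_scf(ckpt, n_steps=10)
print("resumed from step", resumed_from)   # last completed step, i.e. 3
print("finished at step", final_state["step"])
```

The work done inside the interrupted step is lost, but everything up to the last written checkpoint is reused.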
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: Sukhbir on September 24, 2015, 07:15
Thanks for your kind reply,

My calculation has completed, but it shows an error that I am unable to understand. I am attaching a screenshot of it.
Please guide me on where I am making a mistake.
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: Jess Wellendorff on September 24, 2015, 08:37
As the error message clearly says, your script has called the function "nlsave" with too few arguments.

If you take a look at the ATK reference manual (http://www.quantumwise.com/documents/manuals/latest/ReferenceManual/index.html/ref.nlsave.html) you will see that the correct syntax is like this:

Code
nlsave('file.nc', configuration)

In your script, you have only specified the NetCDF filename, but not the configuration that should be saved. Fix this, and it will work.
Title: Re: How to find checkpoint file and restart calculation in linux redhat system
Post by: Sukhbir on September 24, 2015, 09:04
Thanks a lot for your kind reply.