Author Topic: How to make a checkpoint file after a fixed period of time? (Read 11549 times)

lknife · « **on:** May 20, 2017, 17:40 »

Many computer clusters impose a hard time limit, such as 12 hours, 24 hours, or longer. I want to know whether it is possible to write a checkpoint file for later use before the job is killed at the time limit. That is, if the calculation has not converged before the time limit, how can I save a checkpoint file after a fixed period of time?

Anders Blom · « **Reply #1 on:** May 22, 2017, 07:25 »

I would just lower the interval of the checkpoint file to e.g. 15 minutes. The overhead is very small for saving it anyway.
Remember that the time you lose is at most 1 checkpoint cycle, so even with the default (30 minutes), in the absolutely most unlucky case, the checkpoint file is saved 29 minutes before termination, but perhaps that's not a problem for a calculation that takes > 24 hours.
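To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes checkpoints are written at a fixed wall-clock interval and that at most one full interval of work is lost; the function name and the 24-hour limit are illustrative only, not ATK settings.

```python
def worst_case_fraction(wall_limit_min, interval_min):
    """Upper bound on lost work (one full checkpoint interval),
    expressed as a fraction of the total wall-time limit."""
    if interval_min <= 0 or wall_limit_min <= 0:
        raise ValueError("times must be positive")
    return interval_min / wall_limit_min


limit = 24 * 60  # hypothetical 24-hour queue limit, in minutes
for interval in (15, 30):
    loss = worst_case_fraction(limit, interval)
    print(f"interval {interval} min: at most {100 * loss:.1f}% of the job is lost")
```

For a 24-hour job this comes out at roughly 1% and 2% respectively, which is why shortening the interval matters little for long runs.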

However, I would also ask: which calculation takes > 24 hours and is "checkpointable" on a cluster (where presumably the parallelization gives a great boost to the calculation)? Normally the checkpoint only works well for a single calculation, NOT for I-V curves etc. If your I-V curve, for instance, takes that long, then just break it up into smaller pieces and do a few bias voltages in each job. And if you are indeed running a single job that takes > 24 hours, maybe some settings are not ideal (unless you are trying 10,000 atoms in bulk DFT...), because in almost all of our experience, reasonable bulk systems or even devices take only a few hours per bias point at most (unless you are including spin-orbit and/or electron-phonon interaction etc).
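Splitting an I-V sweep into separate jobs could be sketched like this. It is plain Python with no ATK calls; the helper name, the voltage range, and the chunk size are all hypothetical, and each chunk would go into its own job script.

```python
def split_into_jobs(bias_voltages, per_job):
    """Split a list of bias voltages into consecutive chunks of at most `per_job`."""
    if per_job < 1:
        raise ValueError("per_job must be at least 1")
    return [bias_voltages[i:i + per_job]
            for i in range(0, len(bias_voltages), per_job)]


# Hypothetical sweep: 0.0 V to 1.0 V in 0.1 V steps (11 bias points).
biases = [round(0.1 * i, 1) for i in range(11)]
jobs = split_into_jobs(biases, per_job=3)
for n, chunk in enumerate(jobs):
    print(f"job {n}: bias voltages {chunk}")
```

Keeping the chunks consecutive means each job can start from a converged state at a nearby bias, if you save and reuse one.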

lknife · « **Reply #2 on:** May 22, 2017, 15:55 »

Dear Anders Blom,

Thank you for your reply!

What I want is to write just one checkpoint file instead of saving one every 30 minutes. That is, if the time limit on a cluster is 2 days and my calculation has not finished 47 hours after it started, I want it to save a checkpoint file as its final output and then stop by itself. The checkpoint file can then be used for the next calculation.

As for a big calculation that takes such a long time, I have attached two files. One is a .py file for a calculation, which is an ATK2015 version of the file "electrode1.py" from the tutorial "Spin-orbit transport calculations: Bi2Se3 topological insulator thin-film device". The other is a .sh file used to submit the .py file to our computer cluster; I changed its suffix to txt.

This calculation takes a very long time. On my local computer, it took 4 days and 18 hours to get the result, so I submitted it to our university's computer cluster. So far it has been running on the cluster for 34 hours (the time limit of the cluster is 5 days). Could you please check the files and give me some comments on how to choose ideal settings for this calculation on the cluster?

Here is some information about the cluster:
(1) Maximum number of nodes that can be used: about 16
(2) Maximum memory per node: 64 GB
(3) Maximum number of cores per node: 16 (two sockets/CPUs)
(4) ATK2015 MPI with a permanent license is available on the cluster

I appreciate your kind help!

Lknife

Anders Blom · « **Reply #3 on:** May 22, 2017, 17:33 »

The checkpoint file is the only option for this right now, but as far as I can tell it gives you more or less what you are looking for.

You have forgotten a crucial setting in your cluster input file:
export OMP_NUM_THREADS=1
See the parallel guide for information: http://docs.quantumwise.com/guides/mpi_atk/mpi_atk.html#comments-on-performance
This probably can give you a 10x speedup or more.
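As a rough illustration of the layout this implies, here is a sketch that assumes one single-threaded MPI process per core and uses the cluster figures you quoted (16 nodes, 16 cores and 64 GB per node). The function and variable names are illustrative, and the actual launch command depends on your scheduler.

```python
import os

NODES = 16            # figures taken from the cluster description in this thread
CORES_PER_NODE = 16
MEM_PER_NODE_GB = 64


def mpi_layout(nodes, cores_per_node, mem_per_node_gb, threads_per_process=1):
    """Return (total MPI processes, memory per process in GB) when every core is used."""
    procs_per_node = cores_per_node // threads_per_process
    return nodes * procs_per_node, mem_per_node_gb / procs_per_node


total, mem_each = mpi_layout(NODES, CORES_PER_NODE, MEM_PER_NODE_GB)
print(f"{total} MPI processes, {mem_each} GB each")

# Environment for a child process; this has the same effect as putting
# `export OMP_NUM_THREADS=1` in the job script before the launch line.
env = dict(os.environ, OMP_NUM_THREADS="1")
```

If the memory per process turns out to be too small for the system, running fewer processes per node leaves more memory for each one.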

lknife · « **Reply #4 on:** May 31, 2017, 17:11 »

Thank you very much for your reply.

However, even after I added the line "export OMP_NUM_THREADS=1" to my script, it still did not work. I posted another question on MPI settings: Why the MPI process cannot help me to speedup the calculation? (see https://quantumwise.com/forum/index.php?topic=5096.msg22096#new ). I would appreciate it if you could take a look and help me answer it.

By the way, according to the ATK guide for running in parallel (version 2015.2), we can combine MPI and threading to obtain optimal performance in parallel calculations. However, the parallel guide on the website says that it is better not to do so.

A staff member who works on our computer cluster helped me test a small job with and without combining MPI and threading. With the combination the job took less than 2 hours, while without it the job took 8 hours. Here is what she told me: "I tested a small ATK job using threads and no threads. With threads, the job finished in less than 2hours, and with *NO* threads, the same job took 8 hours!"

I am really confused.

Anders Blom · « **Reply #5 on:** May 31, 2017, 18:53 »

I can't comment without knowing the job settings and type.
