Author Topic: How to restart the calculation because of exceeding walltime?  (Read 5386 times)

Offline wot19920302

  • QuantumATK Guru
  • ****
  • Posts: 124
  • Country: cn
  • Reputation: 0
    • View Profile
Dear staff, I want to restart a calculation that stopped because it exceeded the walltime. I modified the original script as follows:
#----------------------------------------
# Device Calculator
#----------------------------------------
calculator = DeviceLCAOCalculator(
    contour_parameters=contour_parameters,
    electrode_calculators=
        [left_electrode_calculator, right_electrode_calculator],
    )

device_configuration.setCalculator(calculator)
nlprint(device_configuration)
device_configuration = nlread("/tmp/checkpoint48792306.nc")[0]
device_configuration.update(force_restart=True)
nlsave('aIV.nc', device_configuration)
I run the calculation on a cluster, on node65. However, the cluster tells me "/tmp/checkpoint48792306.nc, was not found - please check correct path and name". I don't know what to do.


Offline wot19920302

« Reply #2 on: March 22, 2016, 04:33 »
I have read this chapter before, but it does not give me a method that works. Thank you.

Offline zh

  • QuantumATK Support
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 1141
  • Reputation: 24
    • View Profile
« Reply #3 on: March 22, 2016, 04:46 »
Saving the checkpoint file in the '/tmp' directory is not safe, because the system may delete it at any time. It is better to save the checkpoint file in your working directory.

In the current situation, your checkpoint file seems to have been deleted by the system already, so you cannot restart the calculation from the checkpoint file you specified.
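
One way to guard against this is to check that the checkpoint file still exists and copy it out of '/tmp' into a directory that persists. The sketch below is plain Python and not part of the QuantumATK API; the function name `rescue_checkpoint` and the work directory in the usage comment are hypothetical. On many clusters '/tmp' is local to each compute node, so a copy like this would have to run on the node that wrote the file.

```python
import os
import shutil


def rescue_checkpoint(tmp_path, work_dir):
    """Copy a checkpoint file from a temporary location into work_dir.

    Returns the path of the copy. Raises FileNotFoundError if the
    temporary file is already gone (for example, cleaned up by the system).
    """
    if not os.path.isfile(tmp_path):
        raise FileNotFoundError("checkpoint not found: " + tmp_path)
    os.makedirs(work_dir, exist_ok=True)
    target = os.path.join(work_dir, os.path.basename(tmp_path))
    shutil.copy2(tmp_path, target)  # copy2 also keeps the modification time
    return target


# Hypothetical usage; the work directory is a placeholder:
# rescue_checkpoint("/tmp/checkpoint48792306.nc", "/path/to/work_dir")
```

Failing early with a clear error is preferable to letting the restart script start from scratch without the checkpoint.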

Offline wot19920302

« Reply #4 on: March 22, 2016, 05:53 »
After some trial and error, I copied the checkpoint file from the temporary directory into my work directory. The job status shows "running", so the calculation seems to have restarted. But I want to know whether it starts from the checkpoint or from the beginning.

Offline Ulrik G. Vej-Hansen

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 425
  • Country: dk
  • Reputation: 8
    • View Profile
« Reply #5 on: March 22, 2016, 09:50 »
If you have moved the checkpoint file into a regular directory, and changed your script accordingly, your calculation should start from the checkpoint. However, to be certain, you should (always) inspect the output files.

Offline Ulrik G. Vej-Hansen

« Reply #6 on: March 22, 2016, 14:34 »
Please also provide the script, as you need to point to the specific location of the file.

Offline wot19920302

« Reply #7 on: March 22, 2016, 14:37 »
This is the modified script:

Offline Ulrik G. Vej-Hansen

« Reply #8 on: March 22, 2016, 14:58 »
I do not see the checkpoint file in any of the folders, so I am not sure exactly where you have it. With the path you use in your script, it must be in the same folder as your script.

Also, I believe you need to set the initial state explicitly after line 620, as shown in this tutorial: http://www.quantumwise.com/publications/tutorials/item/502-restarting-stopped-calculations
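
A minimal sketch of that approach, assuming that `setCalculator` accepts an `initial_state` keyword as the tutorial describes; please verify this against the documentation for your version. The helper name `restart_from_checkpoint` and its `read_configuration` argument are hypothetical and only serve to make the order of the steps explicit.

```python
import os


def restart_from_checkpoint(device_configuration, calculator,
                            checkpoint_path, read_configuration):
    """Attach the calculator, using the checkpoint as the initial state.

    read_configuration is a callable that returns the configuration stored
    in checkpoint_path. In an atkpython script it could be, for example:
        lambda path: nlread(path, DeviceConfiguration)[0]
    """
    if not os.path.isfile(checkpoint_path):
        raise FileNotFoundError("checkpoint not found: " + checkpoint_path)
    old_configuration = read_configuration(checkpoint_path)
    # Assumption: setCalculator takes an initial_state keyword argument.
    device_configuration.setCalculator(calculator,
                                       initial_state=old_configuration)
    device_configuration.update()
    return device_configuration
```

Here the configuration built by the script is kept, and the checkpoint only provides the starting point for the self-consistent calculation.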

Offline Ulrik G. Vej-Hansen

« Reply #9 on: March 22, 2016, 15:52 »
Is this also the directory your script is running from?

I would like to take a closer look at your problem, so could you please upload or provide a link to the checkpoint .nc file?

Offline wot19920302

« Reply #10 on: March 22, 2016, 16:20 »
I can't upload this file because it is too big (99.8 MB) :-[ The "wot" folder is not my original work directory. I just want to copy the files of the interrupted calculation from the old folder into a new folder, "wot", put the checkpoint file into "wot" as well, and then qsub the .pbs file to the cluster.

Offline Ulrik G. Vej-Hansen

« Reply #11 on: March 22, 2016, 16:36 »
Can you maybe upload it somewhere else and provide a link?

It is important that the relative path to the checkpoint file is correct. The script you uploaded earlier will look for the checkpoint file in the same directory, so if the checkpoint file is not present, it cannot use it to start from. However, I am not sure if this is really the problem you are experiencing, so getting the .nc file will help me help you.
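
Strictly speaking, Python resolves a relative path against the current working directory of the process, which for a batch job is not always the directory containing the script. A plain-Python sketch for building the path from the script's own location follows; the helper name `path_next_to` and the file name in the usage comment are hypothetical, and it assumes `__file__` is defined when the script runs.

```python
import os


def path_next_to(script_file, file_name):
    """Return the absolute path of file_name in the directory of script_file."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, file_name)


# Hypothetical usage inside the restart script:
# checkpoint_path = path_next_to(__file__, "checkpoint.nc")
# print(os.getcwd(), checkpoint_path)  # compare the two locations in the log
```

Printing both locations to the job log makes it easy to see whether the script and the checkpoint file are where the job expects them.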

Offline wot19920302

« Reply #12 on: March 23, 2016, 15:15 »
This is the link:     http://pan.baidu.com/s/1pKi8MAR 

Offline Ulrik G. Vej-Hansen

« Reply #13 on: March 29, 2016, 11:18 »
Thanks for providing the link, however, I cannot find the option for English, so I have no idea where I should click to access the file. Could you maybe upload it on a site where it is possible to change the language to English?

Offline wot19920302

« Reply #14 on: April 6, 2016, 17:08 »
Quote
Thanks for providing the link, however, I cannot find the option for English, so I have no idea where I should click to access the file. Could you maybe upload it on a site where it is possible to change the language to English?
I am sorry for replying so late; I only just realized that my thread had a second page! :-[ I ran a test to check my idea for solving this problem. Here are two code blocks: the first is the original input script (I changed the checkpoint interval and the storage path), and the second is the restart script. The first block:
Code
#----------------------------------------
# Device Calculator
#----------------------------------------
checkpoint_handler=CheckpointHandler('/nobackup/wangzq/toy/cutoff/checkpoint.nc', 3*Minute)
calculator = DeviceLCAOCalculator(
    iteration_control_parameters=device_iteration_control_parameters,
    contour_parameters=contour_parameters,
    electrode_calculators=
        [left_electrode_calculator, right_electrode_calculator],
    checkpoint_handler=checkpoint_handler,
    )

device_configuration.setCalculator(calculator)
nlprint(device_configuration)
device_configuration.update()
nlsave('toy-iv.nc', device_configuration)
The second block:
Code
#----------------------------------------
# Device Calculator
#----------------------------------------
checkpoint_handler=CheckpointHandler('/nobackup/wangzq/toy/cutoff/checkpoint.nc', 3*Minute)
calculator = DeviceLCAOCalculator(
    iteration_control_parameters=device_iteration_control_parameters,
    contour_parameters=contour_parameters,
    electrode_calculators=
        [left_electrode_calculator, right_electrode_calculator],
    checkpoint_handler=checkpoint_handler,
    )

device_configuration.setCalculator(calculator)
nlprint(device_configuration)
device_configuration = nlread("checkpoint.nc")[0]
device_configuration.update(force_restart=True)
nlsave('toy-iv.nc', device_configuration)
Finally, I got results by submitting the second script to the cluster (checkpoint.nc was, of course, in the same working directory as the second script). Did my method actually restart the calculation from the checkpoint? Sorry again for my carelessness :-[
« Last Edit: April 7, 2016, 08:13 by wot19920302 »