Topic: Avoid calculation from interrupting if you lose ssh connection

Offline Anders Blom

  • QuantumWise Staff
A rather annoying "feature" of running calculations on a remote host, such as a cluster, is that if you are not careful, the calculation will be interrupted if you lose the network connection.

The way I used to get around this was to submit my jobs under "nohup", like

Code: [Select]
nohup atkpython myscript.py > myscript.log &

Note the ampersand "&" at the end, which puts the process into the background.

Now, nohup has certain disadvantages, which I will not go into. One notable quirk is that if you launch "mpiexec" under nohup, you must redirect stdin too:

Code: [Select]
nohup mpiexec atkpython myscript.py > myscript.log < /dev/null &

otherwise your log file will fill up with warning messages real fast.
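
As a minimal sketch of the nohup pattern above, the helper below wraps the redirections in a function and records the process ID so that you can check later whether the job is still alive. The names run_detached and is_running and the ".pid" file convention are hypothetical, made up for this illustration; they are not part of ATK or nohup. Unlike the commands above, the sketch also sends stderr to the log file.

```shell
#!/bin/sh
# run_detached LOGFILE COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND under nohup in the background,
# with stdin taken from /dev/null and stdout+stderr written to LOGFILE.
run_detached() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 < /dev/null &
    # $! holds the PID of the most recently started background job
    echo $! > "$log.pid"
}

# is_running LOGFILE
# Succeeds if a process with the PID recorded for LOGFILE still exists.
is_running() {
    kill -0 "$(cat "$1.pid")" 2>/dev/null
}

# Example (assumes mpiexec and atkpython are on the PATH):
#   run_detached myscript.log mpiexec atkpython myscript.py
#   is_running myscript.log && echo "still running"
```

The PID file lets you check on the job from a later login session without searching through the process list.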

Fortunately, there is an alternative to nohup, called screen. There is a very nice tutorial for it here:
http://www.rackaid.com/resources/linux-screen-tutorial-and-how-to/

To summarize briefly how to run ATK on a detached screen (we assume screen is installed on the remote host; if not, please check with the sysadmin of that system):

1. Log into the remote host where you want to run, e.g. using ssh
2. Use the command "screen" to start a new screen (an information screen pops up; it can be avoided with "screen -q")
3. Start ATK as you would normally, for instance

Code: [Select]
atkpython myscript.py

(this example has no log file; it works fine with one as well, of course)

4. Press Ctrl-A followed by D to detach the screen
5. You will be returned to the login session on the node, from which you can now safely log off ("exit") and the calculation will continue to run in the background.
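
Steps 2 to 4 can also be collapsed into a single command, since screen can start a session that is already detached. This is only a sketch, assuming GNU screen; the function name start_in_screen and the session name atkjob are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# start_in_screen NAME COMMAND [ARGS...]
# Hypothetical helper around GNU screen:
#   -d -m    start the session in detached mode
#   -S NAME  give the session a name you can refer to later
start_in_screen() {
    name=$1
    shift
    screen -d -m -S "$name" "$@"
}

# Example (assumes atkpython is on the PATH); the output goes to a log
# file because the session closes as soon as the command exits:
#   start_in_screen atkjob sh -c 'atkpython myscript.py > myscript.log 2>&1'
# Later, resume by name instead of by pid:
#   screen -r atkjob
```

Naming the session means you do not have to remember which pid belongs to which calculation when several screens are running.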

Some time later, you want to check if the calculation finished.

1. ssh back into the remote host
2. Type "screen -r" to resume the screen (if you have several screens running, you will be prompted to provide the pid, like "screen -r 1234")
3. You are now back in the session where you started ATK, and you can operate as usual. If the calculation has not finished, just detach the screen again with Ctrl-A followed by D. If it has finished and you want to exit permanently, type "exit"

There are more options, like switching between the windows of a screen session (Ctrl-A followed by N), etc. For details, see the tutorial mentioned above.
« Last Edit: August 27, 2010, 10:58 by Anders Blom »

Re: Avoid calculation from interrupting if you lose ssh connection
« Reply #1 on: August 27, 2010, 11:01 »

Below is a partly mocked-up transcript of a session involving screen:

Quote
user@workstation > ssh node1
user@node1's password:
Linux node1 2.6.32-23-generic #37-Ubuntu SMP Fri Jun 11 07:54:58 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
Last login: Fri Jul  2 13:26:11 2010 from 192.168.0.11

user@node1 > screen -q
[detached from 4931.pts-4.node1]

user@node1 > exit
logout
Connection to node1 closed.

user@workstation ~
$ ssh node1
user@node1's password:
Linux node1 2.6.32-23-generic #37-Ubuntu SMP Fri Jun 11 07:54:58 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS

Last login: Fri Aug 27 10:53:38 2010 from 192.168.0.49

user@node1 > screen -r
There are several suitable screens on:
        4931.pts-4.node1       (2010-08-27 10:53:58)   (Detached)
        4747.pts-0.node1       (2010-08-27 10:52:55)   (Detached)
        4579.pts-0.node1       (2010-08-27 10:42:26)   (Detached)
        4542.pts-0.node1       (2010-08-27 10:41:36)   (Detached)
Type "screen [-d] -r [pid.]tty.host" to resume one of them.

user@node1 > screen -r 4931
[screen is terminating]

user@node1 > exit
logout
Connection to node1 closed.

user@workstation ~

What we do is:

1. We work at workstation, and ssh into node1
2. Start a screen with -q option to avoid the start-up message
3. Run ATK (this cannot be seen here, as screen opens a new window)
4. Detach the screen (also cannot be seen)
5. Log out from node1
6. ssh back into node1 (later)
7. Attempt to resume the screen - oops, we notice that several screens are running.
8. Ours was 4931, so we resume that one
9. Calculation finished, so we exit the screen (cannot be seen here)
10. All done, log off the compute node
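
To take the guesswork out of step 7, you can list the detached sessions before trying to resume one. The small filter below is a sketch that assumes the "screen -ls" listing looks like the one in the transcript, with the session identifier in the first column and "(Detached)" on the same line; the function name detached_sessions is hypothetical.

```shell
#!/bin/sh
# detached_sessions
# Reads `screen -ls` output on stdin and prints the identifier
# (pid.tty.host) of every session that is marked "(Detached)".
detached_sessions() {
    awk '/\(Detached\)/ { print $1 }'
}

# Example:
#   screen -ls | detached_sessions
#   screen -r 4931.pts-4.node1
```

Any identifier it prints can be passed directly to "screen -r".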