Topic: Jobs die *instantly* when losing connection to license server

asanchez
When running jobs, the computing nodes sometimes lose access to the license server: a couple of times lately, the machine running the license server has been rebooted.

In this event, jobs running with ATK 2014.2 "survive" the license outage for a few minutes with no problem, whereas jobs running with ATK 2015.b2 die before the license server is back up.

When the license server comes back up, the jobs still show in the PBS queue system but the computing nodes are idle, and of course there's a message in the standard output saying that the job has been "terminated after failing 3 heartbeats".

It would be very useful if ATK 2015.b2 worked like previous versions, which give you a bigger time window to reconnect to the license server (or whatever the difference in behaviour is).

I have no control whatsoever over when the license server reboots (mostly it gets rebooted because someone made a mistake, so it's not planned), and it's very inconvenient to have to restart jobs when this happens, particularly since it doesn't happen with ATK 2014.2 (or any of the previous versions I've used).

Not sure if this is the right subforum for this. Hope so!

Anders Blom (QuantumATK Staff)
« Reply #1 on: August 25, 2015, 13:31 »
I am not aware of any changes to the license system, so this should work precisely the same way in 2015 and 2014, meaning that you have a grace period of 3 heartbeats which are 120 seconds apart, before the connection to the license system is considered lost permanently. In practice, this means you have 4-5 minutes to reboot the server. (Actually, it looks in the code like we intended it to fail only after 5 heartbeats, not 3 - I think we will update this for the final 2015 release.)
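
The timing above can be checked with a little arithmetic. The sketch below assumes the job is terminated at the N-th consecutive failed heartbeat and that the outage can begin at any point between two heartbeats. The function name and this failure model are illustrative assumptions, not the actual ATK license client code.

```python
def grace_window_seconds(max_failed_heartbeats, interval_s=120):
    """Return (shortest, longest) time in seconds before termination.

    Assumed model (not taken from the ATK source): the job is terminated
    when the N-th consecutive heartbeat fails, and the outage may start
    anywhere between two heartbeats.
    """
    # Outage starts just before a heartbeat: the first failure is immediate,
    # so the N-th failure comes (N - 1) intervals later.
    shortest = (max_failed_heartbeats - 1) * interval_s
    # Outage starts just after a heartbeat: the first failure is a full
    # interval away, so the N-th failure comes N intervals later.
    longest = max_failed_heartbeats * interval_s
    return shortest, longest


for n in (3, 5):
    lo, hi = grace_window_seconds(n)
    print(f"{n} heartbeats: {lo // 60}-{hi // 60} minutes")
# 3 heartbeats: 4-6 minutes
# 5 heartbeats: 8-10 minutes
```

Under this model the server has to be back within roughly 4 to 6 minutes with 3 heartbeats, and within 8 to 10 minutes with 5, depending on where in the 120-second cycle the outage begins.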

Can you check the log file carefully, and see if there are lines saying "Connection to server [...] lost - trying to reconnect"? It should appear at least twice. If not, do let me know, and we will check. Also look for any other "server" (or "Server") messages in the log, and let me know what you see.
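
For a long log, this check can be scripted. The sketch below is a minimal example that assumes the log is plain text with one message per line and that the reconnect message has the form quoted above. `scan_license_messages` and the sample lines are hypothetical, and the text between "server" and "lost" is matched loosely because the real content is not shown here.

```python
import re

# Loose pattern for the reconnect message quoted in this thread; the part
# between "server" and "lost" is unknown here, so it is matched with ".*?".
RECONNECT = re.compile(r"Connection to server.*?lost - trying to reconnect")


def scan_license_messages(lines):
    """Return (reconnect_count, server_lines) for an iterable of log lines."""
    reconnect_count = 0
    server_lines = []
    for line in lines:
        text = line.rstrip("\n")
        if RECONNECT.search(text):
            reconnect_count += 1
        # Collect every line mentioning "server" or "Server" for inspection.
        if "server" in text.lower():
            server_lines.append(text)
    return reconnect_count, server_lines


# Hypothetical sample input; a real log will contain other lines as well.
sample = [
    "Connection to server [...] lost - trying to reconnect",
    "placeholder for an unrelated output line",
    "Connection to server [...] lost - trying to reconnect",
]
count, hits = scan_license_messages(sample)
print(count)      # 2
print(len(hits))  # 2
```

To scan a real log, pass an open file object instead of the sample list, for example `with open(path) as f: scan_license_messages(f)`, where `path` is the job's standard output file.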
« Last Edit: August 25, 2015, 13:49 by Anders Blom »

asanchez
« Reply #2 on: August 25, 2015, 14:46 »
The "Connection to server [...] lost" messages are there. Three times, with the last one indicating "terminated after failing 3 heartbeats".

But some jobs (2014.2) kept going even after this. The 2015.b2 jobs did die after showing this message. I also remember getting the three messages with the job not dying a few months ago with 2014.1 (this actually happened a few times IIRC). I guess the job dying is the intended behaviour anyway.

It'd be nice though if the window was a bit longer. Someone has to realise the node running the license server is down and then reboot it, so 4-5 minutes probably wouldn't be enough in most cases. Admittedly I have no idea of the drawbacks of widening the time window to, say, 1 hr. But as a user in my specific situation I can say it would be very useful. I suppose for shorter jobs it wouldn't be a good idea, as you could theoretically start the job and not need a license again if your job takes less than the hypothetical 1 hr.

I'm sure the proper solution is not having the license server shut down; and if it is rebooted that it be done in a controlled way etc. Alas, that is completely outside of my control and since I perceived a change in behaviour in the latest versions I thought I'd ask. I do realise the problem lies within our internal system and not the QW software.

Anders Blom (QuantumATK Staff)
« Reply #3 on: August 25, 2015, 15:08 »
Thanks for that information. Actually, the intention was to have 5 heartbeats and not 3 before failing; that would make it 8-10 minutes, which is some improvement at least. We may implement that change for 2015, but if the observation in 2014 was that the application didn't die even after the heartbeats expired, then I guess that was a bug...

I'm sure users would really appreciate it if we didn't even have a license system ;) But in the end, as you indeed write, the license server should be a stable machine - if needed, I would recommend considering another server, like the machines holding user home directories (?).