Author Topic: CPU still running after MPI job exited  (Read 2461 times)

Qiang Fu

« on: September 4, 2014, 17:18 »
Dear Experts,

We recently ran into a problem when running parallel ATK calculations. When a job does not finish successfully (it is killed, or it exits because of the time limit), it may leave all of its processes running. These leftover processes then overload the CPUs for the next job, whether or not that job is an ATK job.

Could you please give us some advice to fix the problem? Thank you very much!
(Please find the job information in the attachment. An ATK job was killed because of a convergence issue, and then a VASP job started.)

Best regards,
Qiang Fu

Anders Blom

  • QuantumATK Staff
Re: CPU still running after MPI job exited
« Reply #1 on: September 4, 2014, 23:36 »
This is quite common. If the job is killed by the time limit, the kill signal reaches (I think) the master process first (or only the master process), and it is terminated immediately. It therefore has no real chance to signal the child processes to die as well, so there is not much ATK can do about these hanging processes. I don't know if other codes have a better way to terminate children in this situation, but for now with ATK it is the user's responsibility to clean up if the master process is killed.
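
As a rough illustration of what such a manual cleanup could look like, here is a minimal Python sketch. It is not an ATK feature, and it rests on a few assumptions: that the leftover processes have been re-parented to PID 1 after the master died (on some systems a different "subreaper" process adopts orphans instead), that they can be recognized by a fragment of their command name (`atkpython` is used below purely as an example), and that a standard `ps` is available on the node. The helper names `find_orphans` and `kill_orphans` are hypothetical.

```python
import os
import signal
import subprocess


def find_orphans(ps_output, uid, name_fragment):
    """Return PIDs owned by `uid` whose parent is PID 1 and whose
    command name contains `name_fragment`.

    `ps_output` is expected to hold one process per line with the
    columns: pid, ppid, uid, command name.
    """
    orphans = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        pid, ppid, owner, command = fields
        if not (pid.isdigit() and ppid.isdigit() and owner.isdigit()):
            continue  # skip a header line, if one is present
        # Assumption: an orphaned worker has been adopted by PID 1.
        if int(owner) == uid and int(ppid) == 1 and name_fragment in command:
            orphans.append(int(pid))
    return orphans


def kill_orphans(name_fragment, sig=signal.SIGTERM):
    """List all processes with ps and send `sig` to this user's orphans."""
    ps_output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "uid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = find_orphans(ps_output, os.getuid(), name_fragment)
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited between listing and signalling
    return pids
```

Calling `kill_orphans("atkpython")` on each compute node the job used would send SIGTERM to the matching leftovers owned by the current user. Note that it would also hit any process of that name that you deliberately started in the background, so check the list returned by `find_orphans` before relying on it.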

Also, don't run version 13.8.0. At least upgrade to 13.8.2, which has some bug fixes for ATK 13.8, but better yet, get ATK 2014.0, which will be available tomorrow :)