Topic: Device Simulation freezing just before "Left Electrode Calculation"


Offline Kaspar

  • Regular QuantumATK user
  • **
  • Posts: 19
  • Country: dk
  • Reputation: 0
    • View Profile
Hi all, I am having a strange problem with a series of calculations I am running. I have a couple of different systems that differ only in a few details of the geometry from one simulation to the next. Until now I have not had any problems, but suddenly one simulation freezes just after this message in the log file:
Code
 Left Electrode Calculation [Started .... 2013] 
I have looked very carefully and compared it with its "sister" calculations, but I can see nothing that should cause this. The geometry is essentially identical (except for the variations that I am investigating in the first place) and all the calculators are set up identically. But I guess it doesn't even make it to the calculator part. When I compare with the successful calculations, the next lines in the log file should be
Code
Checkpoint Handler
Filename : ...... .nc
Interval : ... h
and then the calculation of eigenvalues should begin. It does not give any errors either; it just stops there. I have tried restarting it, with the same result. I can supply the log file and .py script if necessary.

I am using ATK 12.8.2, running in parallel on 8 nodes. But again, that is also the same for all the calculations that run successfully.

Thank you for any suggestions/ideas.

Best regards,
Kaspar

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5429
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
I have no immediate suspicion; maybe a look at your script would help (you can email it, if you prefer not to share it with the world).

Even though things seem to be identical, on a cluster they rarely are. Different jobs may end up on different nodes, and those nodes may or may not have other jobs running (I'm just speaking generally), which can affect, for instance, the amount of available memory. How are you distributing the MPI processes? Are they mostly on separate nodes?


Offline Kaspar

Thanks for your reply. I have an idea about what it could be; maybe you can confirm whether this could be the reason. Could it be that the maximum number of simultaneous calculations for my research group's license was reached? Normally the simulation is killed with an error message in that case, but could this behavior be related?

The reason I suspect this is that I tried to start a new calculation at the same time, which gave me the license error. Adding to the suspicion, the script later ran fine, probably once simulation slots had been freed up.

About the running parameters, I always use the PBS flag
Code
#PBS -n -l nodes=M
where M is the number of nodes I request, and then I run the job with
Code
mpiexec -np M
For those not familiar with this option, it requests M nodes, each running 1 MPI process, with exclusive access to the whole node.
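
As a rough sketch of how these two settings might fit together in a job file, the helper below prints a PBS script for M nodes. The `atkpython` executable name, the script name `device.py`, and the log file naming are assumptions for illustration, and whether `-n` is accepted depends on the PBS/Torque version installed on the cluster.

```shell
#!/bin/sh
# Sketch: print a PBS job script that requests M whole nodes and starts
# one MPI process per node. "atkpython" and the input script name are
# assumptions; adjust them to the local installation.
make_job_script() {
    nodes="$1"      # number of nodes to request (M in the post above)
    script="$2"     # hypothetical ATK input script, e.g. device.py
    cat <<EOF
#!/bin/sh
#PBS -n -l nodes=${nodes}
cd "\$PBS_O_WORKDIR"
mpiexec -np ${nodes} atkpython ${script} > ${script%.py}.log
EOF
}

# Example: print a job script for 8 nodes.
make_job_script 8 device.py
```

Redirecting the output to a file (for example `make_job_script 8 device.py > device.pbs`) gives something that could be handed to `qsub`.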

Offline Anders Blom

Yes, that's possible. When you say it stops, does it hang or terminate (without errors)? Have you checked stderr for error messages?
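
One way to tell a hang from a silent exit is to look at the job's stderr file and at how recently the log file was written. The sketch below assumes a `find` that supports `-mmin` (GNU and BSD versions do); the file names and the 30-minute threshold in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch: flag a job as possibly hung when its log has not been written
# to for a while. File names and thresholds are assumptions.
log_is_stale() {
    logfile="$1"    # log file written by the running job
    minutes="$2"    # age threshold in minutes
    # find prints the file name only if it was last modified more than
    # $minutes minutes ago, so a non-empty result means "stale".
    [ -n "$(find "$logfile" -mmin +"$minutes" 2>/dev/null)" ]
}

report_job() {
    logfile="$1"; errfile="$2"; minutes="$3"
    if [ ! -e "$logfile" ]; then
        echo "no log file found: $logfile"
        return 1
    fi
    # A non-empty stderr file is the first thing to read.
    if [ -s "$errfile" ]; then
        echo "stderr is not empty:"
        tail -n 20 "$errfile"
    fi
    if log_is_stale "$logfile" "$minutes"; then
        echo "no log output for more than $minutes minutes: possibly hung"
    else
        echo "log updated within the last $minutes minutes"
    fi
}

# Usage (hypothetical file names):
#   report_job device.log device.e12345 30
```

If the queue system still lists the job as running while the log is stale and stderr is empty, that points to a hang rather than a termination.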

When running in parallel, the job can hang if a slave node terminates badly, but it should normally shut down if the master fails. In fact, only the master checks out licenses in ATK, so a license failure should normally terminate the job with an error message.

Thanks for the tip about how to reserve entire nodes. The PBS options are, however, sometimes specific to the cluster you run on, so they may not apply universally. On the cluster I usually run on, you can for instance use "-npernode 2" to place 2 MPI processes on each node, so that you spread out the processes instead of piling them onto the same node when you reserve more cores than you intend to run on.
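
To make the arithmetic concrete, here is a small sketch that only builds the launch line and does not run it. It assumes an Open MPI style `mpiexec` that accepts `-npernode`; other MPI launchers use different option names, and newer Open MPI releases may prefer a different mapping option. The `atkpython` executable and `device.py` script name are placeholders.

```shell
#!/bin/sh
# Sketch: build (but do not run) a launch line that places a fixed number
# of MPI processes on each reserved node. Assumes an Open MPI style
# launcher; "atkpython" and the script name are placeholders.
launch_line() {
    nodes="$1"       # number of nodes reserved through PBS
    per_node="$2"    # MPI processes to place on each node
    script="$3"      # hypothetical ATK input script
    total=$((nodes * per_node))   # total process count passed to -np
    echo "mpiexec -npernode ${per_node} -np ${total} atkpython ${script}"
}

# 4 nodes with 2 processes each gives 8 processes in total.
launch_line 4 2 device.py
# prints: mpiexec -npernode 2 -np 8 atkpython device.py
```

The matching resource request would reserve the 4 nodes through PBS; the exact syntax for that is cluster-specific.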

Offline Kaspar

That sounds like it could have been what happened. It just stopped doing anything, no error messages or anything. Usually I get an error message if I run out of licenses, stating exactly that, but not this time.

I haven't seen the option '-npernode' before; I should check it out. I was looking for a way to run more than one process per node while still using multiple nodes.

If anyone is interested, I recently compiled a guide on PBS, divided into basic and advanced topics. It covers things like running jobs that depend on each other and running array jobs, which most people never learn.
It's freely available here (basic) https://nanohub.org/resources/7496 and (advanced) https://nanohub.org/resources/7498