Author Topic: Calculation do not converge due to nodes number setting?  (Read 2949 times)

Offline faxer92

  • Heavy QuantumATK user
  • ***
  • Posts: 27
  • Country: tw
  • Reputation: 0
Dear Sirs,
Last weekend I hit three issues with the ATK MPI settings. First, when I ran the following script on 36 cores, it converged quickly, at step 37 in each loop. However, after increasing the number of processes from 36 to 44, the calculation did not converge, even after more than 100 steps. The same code gives very different results depending on the number of processes. Could you please explain this?

Second, my cluster consists of node A (44 cores, 256 GB RAM) and node B (14 cores, 128 GB RAM). When I turn off B and run the web example mpi_test.py with 58 MPI processes on A only, it still reports 1 master + 57 slaves (completely ignoring that A has only 44 physical CPUs) and keeps running. Why? Does ATK virtualize CPUs or processes on A? What is the correct relationship between the number of processes and the number of physical CPUs in ATK? Could you give me a hint on where I should improve? Much appreciated, and I look forward to your replies.

--------------
The third strange thing: B has 14 physical CPUs, but when I run the program on the full cluster with 58 MPI processes, the CPUs on B always show 0% performance. Is anything wrong with my QW path setting or the folders on the cluster?
« Last Edit: October 30, 2017, 09:54 by faxer92 »

Offline Jess Wellendorff

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 933
  • Country: dk
  • Reputation: 29
Re: Calculation not converge due to mpi setting?
« Reply #1 on: October 30, 2017, 10:04 »
1) There is a very distinct difference between your *physical* computing cores and the *MPI* processes you launch: when running an ATK job, you choose both the number of cores to use (physical cores) and the number of MPI processes to launch on those cores. It is in general highly recommended that the two numbers be identical! If you launch more MPI processes than the cores you have allocated, some cores will run more than one MPI process. So yes, MPI does not stop you from overloading the physical cores with several MPI processes, which may lead to an unstable job because several MPI processes must then share the RAM available to only one physical core. It is your own responsibility to avoid this by making sure the number of physical cores matches the number of MPI processes. There may of course be situations where it is advantageous to run fewer MPI processes than allocated cores (e.g. to get more memory per MPI process), but this is rare.

2) Not knowing the details of your cluster setup, it's hard to know exactly how you should run your ATK jobs to get 100% performance. You mention that your cluster is composed of 2 different sets of cores (A and B). In such a situation, I would usually recommend running an ATK job on only one of those sub-systems, i.e. run job #1 on A and job #2 on B. But again, this depends on the cluster setup. I believe we already have a support ticket on support@quantumwise.com about this. That is a better channel for discussing details of cluster setups, so I will continue on that one from here.
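
To illustrate point 1), here is a minimal Python sketch of the check. It is not part of ATK: the function names `is_oversubscribed` and `busiest_core_load` are hypothetical helpers made up for this illustration, and the numbers (44 cores on A, 58 MPI processes) are taken from the question above. It assumes the processes are spread as evenly as possible over the cores, which the actual MPI launcher may or may not do.

```python
import math


def is_oversubscribed(n_mpi_processes, n_physical_cores):
    """Return True when more MPI processes are launched than cores exist."""
    if n_mpi_processes < 1 or n_physical_cores < 1:
        raise ValueError("process and core counts must be positive")
    return n_mpi_processes > n_physical_cores


def busiest_core_load(n_mpi_processes, n_physical_cores):
    """Number of MPI processes on the most loaded core.

    Assumes an even spread of processes over cores (an assumption of
    this sketch, not a statement about any particular MPI launcher).
    """
    if n_mpi_processes < 1 or n_physical_cores < 1:
        raise ValueError("process and core counts must be positive")
    return math.ceil(n_mpi_processes / n_physical_cores)


if __name__ == "__main__":
    # Case from the question: 58 MPI processes on node A with 44 cores.
    print(is_oversubscribed(58, 44), busiest_core_load(58, 44))
    # Recommended case: one MPI process per physical core.
    print(is_oversubscribed(44, 44), busiest_core_load(44, 44))
```

With 58 processes on 44 cores, at least some cores have to run two processes that share the memory of one core, which is the overloading situation described in 1).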
 

Offline faxer92

Re: Calculation do not converge due to nodes number setting?
« Reply #2 on: October 30, 2017, 10:23 »
Thanks, Jess. We'll discuss it via support@quantumwise.com.

Offline faxer92

Cluster specific issue in 2017 v2?
« Reply #3 on: November 5, 2017, 16:54 »
Excuse me, it has been silent for the past 5 days, with no further constructive guidance from support@quantumwise.com.

May I know when, by whom, and how the customer's cluster problem will be taken care of, as promised in the mail?

If not, please comment on what our IT should do about this cluster-specific issue. After all, we have invested a lot in ATK.

Thank you so much.

==mail info====
1) I do see from the images you attached that it looks like not all CPU cores are in use, but from this information it is impossible for me to say why.
2) When submitting an ATK job, deciding on which cores to use for the job is entirely a matter of job submission parameters, which is highly cluster dependent; the appropriate job submission method depends entirely on how the cluster is configured. For example, I personally use a cluster with 3 different types of nodes (nodes with 8, 16, and 24 cores on each node). And I also submit jobs that use only 1 type of node, e.g. 3 16-core nodes for a total of 48 cores. Those nodes have the "node property" of "name" xeon16, and that's what I specify upon job submission in order to choose those particular nodes. So, as you see, it's a cluster specific issue.
Perhaps it would be easier if I communicated directly with the customer, e.g. using teamviewer?
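
The arithmetic in the example above (3 nodes with 16 cores each giving 48 cores) can be sketched in Python as follows. This is only an illustration under stated assumptions: `total_cores` and `nodes_needed` are hypothetical helper names, not part of ATK or of any queueing system, and how nodes of one type are actually requested still depends on the cluster.

```python
def total_cores(n_nodes, cores_per_node):
    """Total cores when a job uses n_nodes identical nodes."""
    if n_nodes < 1 or cores_per_node < 1:
        raise ValueError("node and core counts must be positive")
    return n_nodes * cores_per_node


def nodes_needed(target_cores, cores_per_node):
    """Smallest number of identical nodes that provides target_cores."""
    if target_cores < 1 or cores_per_node < 1:
        raise ValueError("core counts must be positive")
    # Ceiling division using integers only.
    return -(-target_cores // cores_per_node)


if __name__ == "__main__":
    # Example from the mail: three 16-core nodes.
    print(total_cores(3, 16))
    # How many 16-core nodes a 48-process job needs at one process per core.
    print(nodes_needed(48, 16))
```

Launching as many MPI processes as `total_cores` returns keeps one process per physical core on a single node type.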
« Last Edit: November 6, 2017, 03:33 by faxer92 »

Offline Jess Wellendorff

Re: Calculation do not converge due to nodes number setting?
« Reply #4 on: November 6, 2017, 09:51 »
There have been issues with getting a TeamViewer session up and running, but I have now received the information needed to establish the session. I will look at this today.
Kind regards,
Jess
« Last Edit: November 6, 2017, 09:55 by Petr Khomyakov »

Offline faxer92

Re: Calculation do not converge due to nodes number setting?
« Reply #5 on: November 7, 2017, 04:22 »
Thank you so much, Jess.