Author Topic: Calculation does not converge due to node number setting?  (Read 4983 times)


Offline faxer92

  • Heavy QuantumATK user
  • ***
  • Posts: 27
  • Country: tw
  • Reputation: 0
Dear Sirs,
Last weekend I hit three issues with ATK MPI settings. When I ran the script below with 36 MPI processes, each loop converged quickly, at about 37 steps. However, when I increased the process count from 36 to 44, the calculation no longer converged, even after more than 100 steps. The same code gives very different results depending only on the process-count setting — could you please explain this? The second strange thing: my cluster consists of machine A (44 cores / 256 GB RAM) and machine B (14 cores / 128 GB RAM). When I shut down B and run the web example mpi_test.py with 58 MPI processes on A alone, it reports 1 master + 57 slaves (completely ignoring that A has only 44 physical CPUs) and keeps running. Why? Does ATK virtualize CPUs/processes on A? What is the correct relationship between MPI processes and physical CPU count in ATK? Could you hint at where I should improve? Much appreciated, and looking forward to your replies.

--------------
The third strange thing: machine B has 14 physical CPUs, but when the program runs with 58 MPI processes, they always show 0% utilization. Is anything wrong with my QuantumWise path setting or the folder layout on the cluster?
« Last Edit: October 30, 2017, 09:54 by faxer92 »

Offline Jess Wellendorff

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 933
  • Country: dk
  • Reputation: 29
Re: Calculation does not converge due to MPI setting?
« Reply #1 on: October 30, 2017, 10:04 »
1) There is a very distinct difference between your *physical* computing cores and the *MPI processes* you launch: when running an ATK job, you choose both the number of physical cores to allocate and the number of MPI processes to launch on those cores. It is in general highly recommended that the two numbers be identical. If you launch more MPI processes than you have allocated cores, some cores will run more than one MPI process. MPI does not stop you from overloading the physical cores in this way, but it may lead to an unstable job, because several MPI processes must then share the RAM available to a single physical core. It is your own responsibility to avoid this by carefully making sure the number of MPI processes matches the number of physical cores. There are of course situations where it is advantageous to run fewer MPI processes than allocated cores (e.g. to get more memory per MPI process), but this is rare.
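The rule of thumb above can be expressed as a tiny sanity check. This is a hypothetical helper, not part of ATK; the core and process counts are those from the question:

```python
def oversubscription_factor(mpi_processes: int, physical_cores: int) -> float:
    """Ratio of MPI processes to physical cores; > 1.0 means some cores
    must time-share several processes (and their RAM)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return mpi_processes / physical_cores

# Machine A from the question has 44 physical cores.
print(oversubscription_factor(36, 44))  # below 1.0: undersubscribed, safe
print(oversubscription_factor(58, 44))  # above 1.0: cores are overloaded
```

With 58 processes on 44 cores the factor is about 1.3, i.e. MPI happily launches the extra processes, but roughly a third of the cores end up running two processes each, which matches the unstable behaviour described.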

2) Not knowing the details of your cluster setup, it is hard to say exactly how you should run your ATK jobs to get 100% performance. You mention that your cluster is composed of two different sets of cores (A and B). In such a situation, I would usually recommend running an ATK job on only one of those sub-systems at a time, i.e. run job #1 on A and job #2 on B. But again, this depends on the cluster setup. I believe we already have a support ticket at support@quantumwise.com about this; that is a better channel for discussing the details of cluster setups, so I will continue there.
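The "job #1 on A, job #2 on B" pattern can be sketched with per-sub-system MPI hostfiles. This is a config sketch assuming Open MPI syntax; the hostnames `nodeA`/`nodeB` are placeholders, the slot counts are the physical core counts from the question, and `atkpython` is ATK's Python launcher:

```shell
# hostfile_A: confine a job to sub-system A, one slot per physical core
#   nodeA slots=44
# hostfile_B: confine a job to sub-system B
#   nodeB slots=14

# Launch exactly one MPI process per physical core on each sub-system:
mpiexec --hostfile hostfile_A -n 44 atkpython job1.py > job1.log
mpiexec --hostfile hostfile_B -n 14 atkpython job2.py > job2.log
```

Because each `-n` matches the `slots` limit of its hostfile, neither sub-system is oversubscribed.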
 

Offline faxer92

Re: Calculation does not converge due to node number setting?
« Reply #2 on: October 30, 2017, 10:23 »
Thanks, Jess. We'll continue the discussion via support@quantumwise.com.

Offline faxer92

Cluster-specific issue in 2017 v2?
« Reply #3 on: November 5, 2017, 16:54 »
Excuse me, it has been silent for the past five days, and there has been no further constructive guidance from support@quantumwise.com.

May I know when, and by whom, the customer's cluster problem will be taken care of, as promised in the mail?

If not, please comment on what our IT team should do about this cluster-specific issue. After all, we have invested a lot in ATK.

Thank you so much.

==mail info====
1) I do see from the images you attached that it looks like not all CPU cores are in use, but from this information it is impossible for me to say why.
2) When submitting an ATK job, deciding which cores to use is entirely a matter of job-submission parameters, which are highly cluster dependent; the appropriate submission method depends entirely on how the cluster is configured. For example, I personally use a cluster with three different types of nodes (nodes with 8, 16, and 24 cores each), and I submit jobs that use only one type of node, e.g. three 16-core nodes for a total of 48 cores. Those nodes carry the node property "xeon16", and that is what I specify upon job submission in order to select those particular nodes. So, as you see, it is a cluster-specific issue.
Perhaps it would be easier if I communicated directly with the customer, e.g. using teamviewer?
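The node-selection step described in the mail can be sketched as a batch-submission fragment. This is a config sketch assuming a PBS/Torque-style scheduler; the node property `xeon16` is taken from the example above, and the script and output names are placeholders — property names and syntax will differ on other clusters and schedulers:

```shell
#!/bin/bash
# Hypothetical PBS/Torque job script: request 3 nodes carrying the
# "xeon16" node property, 16 processes per node (48 MPI processes total),
# so the MPI process count exactly matches the allocated core count.
#PBS -N atk_job
#PBS -l nodes=3:xeon16:ppn=16

cd "$PBS_O_WORKDIR"
# One MPI process per allocated core:
mpiexec -n 48 atkpython my_script.py > my_script.log
```

The key point is that the scheduler directive and the `mpiexec -n` count agree, which is what keeps the job from landing on the wrong node type or oversubscribing cores.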
« Last Edit: November 6, 2017, 03:33 by faxer92 »

Offline Jess Wellendorff

Re: Calculation does not converge due to node number setting?
« Reply #4 on: November 6, 2017, 09:51 »
There have been issues with getting a TeamViewer session up and running, but I have now received the information needed to establish the session. I will look at this today.
Kind regards,
Jess
« Last Edit: November 6, 2017, 09:55 by Petr Khomyakov »

Offline faxer92

Re: Calculation does not converge due to node number setting?
« Reply #5 on: November 7, 2017, 04:22 »
Thank you so much, Jess.