Author Topic: Why does MPI not help me speed up the calculation?  (Read 4069 times)


Offline lknife

  • QuantumATK Guru
  • ****
  • Posts: 214
  • Country: us
  • Reputation: 1
    • View Profile
Dear experts,

I am studying the tutorial "Spin-orbit transport calculations: Bi2Se3 topological insulator thin-film device" and want to reproduce its results. I downloaded the script "electrode1.py" and ran it on my local computer using ATK2016 with the default settings of the Job Manager. It took a long time (nearly 5 days) to get the result. The configuration of my computer is:
   
     Intel Core(TM) i7-6700 CPU, quad-core, 3.4 GHz, 16 GB RAM

Since it took so long, I wanted to run it on our university's computer cluster using MPI to speed up the calculation. Because of the ATK license, only ATK2015 can be used on the cluster. I modified some parts of the script "electrode1.py" for ATK2015 and submitted it to the cluster with these MPI settings:

     Number of Nodes: 8
     Tasks per Node: 2
     Memory per CPU: 16 GB
     Following others' suggestions, I disabled threading with "export OMP_NUM_THREADS=1" and "export OMP_DYNAMIC=FALSE" and used 16 MPI processes for the calculation.
     (Please see the attached files: electrode1gkm_argo5.py is the Python file for the calculation and electrode1gkm_argo5.sh (txt) is the script for submitting the job to the cluster.)

Now the calculation has been running on the cluster for 2 days and 21 hours, which is much slower than I expected.
What I want to ask is: is there anything wrong with my MPI settings or the scripts? Why does MPI hardly speed up the calculation at all?

Thank you very much for your kind reply!

Offline lknife

  • QuantumATK Guru
  • ****
  • Posts: 214
  • Country: us
  • Reputation: 1
    • View Profile
Now the calculation has been running for more than 3 days and 10 hours and is still not finished. I don't know how much longer it will take. It seems that MPI parallelization is not helping, but I don't know why.

I hope somebody here can help me with this problem, since for large systems MPI parallelization is the only practical way to run the calculations.

Thanks a lot to anybody who is willing to help me!

Offline Jess Wellendorff

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 933
  • Country: dk
  • Reputation: 29
    • View Profile
Your SLURM job submission script looks sub-optimal: you allocate 8 nodes with 16 cores each = 128 cores in total, but start only 2 MPI tasks on each node = 16 MPI tasks in total. That setup is only useful if memory is a serious issue, and it hampers performance significantly. Admittedly, spin-orbit calculations with OMX pseudopotentials can be memory hungry, but I suggest you first try to run the calculation in minimum-time-to-result mode, which will use the most memory:

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 16

If you run into memory problems, try doubling the number of nodes and use the "processes_per_kpoint=2" option for the DiagonalizationSolver, such that 2 MPI processes are used for each k-point. They will then share the memory load, but the total calculation time also increases.

Finally, you might try changing the "bands_above_fermi_level" parameter from "All" to a positive integer, for example 20. This limits the number of unoccupied electronic bands included when diagonalizing the Hamiltonian. It should not be too small, though...
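Roughly, such a setup could look like the sketch below. Only processes_per_kpoint and bands_above_fermi_level are taken from the advice above; the surrounding AlgorithmParameters/LCAOCalculator wrapping and the import line are assumed from the standard ATK Python workflow, so please check the manual for your ATK version:

    # Sketch only -- assumes the standard ATK Python workflow (ATK 2016 or later,
    # since processes_per_kpoint is not available in ATK 2015).
    from NanoLanguage import *

    solver = DiagonalizationSolver(
        processes_per_kpoint=2,        # 2 MPI processes share each k-point (lower memory per process)
        bands_above_fermi_level=20,    # a positive integer instead of All
    )

    calculator = LCAOCalculator(
        algorithm_parameters=AlgorithmParameters(density_matrix_method=solver),
        # ... other settings (basis set, k-point sampling, etc.) as in the tutorial script
    )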

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Note, however, that the DiagonalizationSolver with multiple MPI processes per k-point is not available in ATK 2015.

Offline lknife

  • QuantumATK Guru
  • ****
  • Posts: 214
  • Country: us
  • Reputation: 1
    • View Profile
Thank you for your kind reply!

Because the device contains many atoms, the tutorial pays a lot of attention to reducing the memory burden when setting the MPI parameters. Thus, I ran only two tasks on each node, since each node has 16 cores but only 64 GB of memory in total and I am not sure how much memory each task needs. The ATK guide recommends running only one or two tasks per node for large systems that require a lot of memory. I can try 8 nodes with 4 tasks per node, or 16 nodes with 2 tasks per node, to get 32 MPI processes. A calculation with the latter setting (16 nodes with 2 tasks per node) has been running for 1 day and 14 hours.

The problem is that even with 16 MPI processes there should be some speedup. However, compared with the time the calculation took on my local computer, almost no speedup can be observed.

By the way, did you find anything wrong with my scripts apart from the sub-optimal SLURM settings?

Thank you again for your time and kind help!

Offline Jess Wellendorff

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 933
  • Country: dk
  • Reputation: 29
    • View Profile
No, I did not find any obvious errors in your scripts. But I will say this: the case study in question is currently a work in progress and may need updating with respect to the newest ATK versions, especially concerning efficient parallelization schemes for such systems. I can therefore not give you a definite answer about exactly how to run those calculations most efficiently, because no one really knows yet...

Offline lknife

  • QuantumATK Guru
  • ****
  • Posts: 214
  • Country: us
  • Reputation: 1
    • View Profile
That's fine. Thank you again for your kind reply!

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
We don't really support version 2015 anymore. Your job settings seem fine, so I'm leaning more towards other issues. For instance, you run 2 processes on each of 8 nodes, but are these nodes exclusively yours, or can someone else run jobs on the remaining cores? Does your submission environment propagate the environment variables to the job, or should you have set the OMP variables in your .bashrc?

Why not run a small test job, for instance some bulk crystal with many k-points that takes 5 minutes, in parallel over 2, 8, and 32 MPI processes and check whether there is a speedup? If not (and there really should be!), then your problem is local, related to the cluster.
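For example, such a test could look roughly like the sketch below (bulk silicon; the lattice constant, k-point grid, and file names are just illustrative choices, not from the tutorial). Run it with e.g. "mpiexec -n 2 atkpython si_scaling_test.py", then with -n 8 and -n 32, and compare the wall-clock times:

    # si_scaling_test.py -- small bulk test for checking MPI speedup (sketch only).
    from NanoLanguage import *

    # Bulk silicon in the diamond structure (illustrative lattice constant).
    bulk_configuration = BulkConfiguration(
        bravais_lattice=FaceCenteredCubic(5.4306*Angstrom),
        elements=[Silicon, Silicon],
        fractional_coordinates=[[0.00, 0.00, 0.00],
                                [0.25, 0.25, 0.25]],
    )

    # A dense k-point grid, so there is plenty of k-point parallelism to distribute.
    calculator = LCAOCalculator(
        numerical_accuracy_parameters=NumericalAccuracyParameters(
            k_point_sampling=MonkhorstPackGrid(15, 15, 15),
        ),
    )

    bulk_configuration.setCalculator(calculator)
    bulk_configuration.update()          # runs the SCF loop; this is the step to time
    nlsave('si_scaling_test.nc', bulk_configuration)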

Offline lknife

  • QuantumATK Guru
  • ****
  • Posts: 214
  • Country: us
  • Reputation: 1
    • View Profile
I will check my calculation according to what you said. Actually, I have used the cluster for parallel ATK calculations for some time. In many cases it did accelerate the calculation and saved me a lot of time. I don't know why it did not work for this particular job.

Anyway, thank you very much for your reply and kind help!

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
One way to understand this better is to look at the timing report at the end of the calculation log. If the simulation spends most of its time in parts that are not as well parallelized, such as the real-space integration, then there is less advantage in using MPI. Also, 2015 is an older version; 2016 and 2017 have better parallelization.

Offline lknife

  • QuantumATK Guru
  • ****
  • Posts: 214
  • Country: us
  • Reputation: 1
    • View Profile
Thank you very much for your kind help!