OK, so it seems your MPI setup is not the issue.
In fact, I think there may be nothing wrong here at all...
I assume you have (1,1) k-points in the XY directions? The equivalent bulk run is carried out with (Nkx,Nky,1) k-points, where (Nkx,Nky) are those given to the electrodes. The Nkz given to the electrodes is ignored, since the cell is always long enough in Z that no significant k-point sampling is needed in that direction, and the equivalent bulk run is in any case only an approximation.
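Just to illustrate the mapping (plain Python arithmetic, not an ATK script; the numbers are hypothetical):

```python
# Illustration only (not the ATK API): how the electrode k-point grid
# maps onto the grid used for the "equivalent bulk" run.
Nkx, Nky, Nkz = 1, 1, 100              # hypothetical electrode sampling, (1,1) in XY
equivalent_bulk_grid = (Nkx, Nky, 1)   # the electrode Nkz is ignored for this run

n_kpoints = equivalent_bulk_grid[0] * equivalent_bulk_grid[1] * equivalent_bulk_grid[2]
print("k-points in the equivalent bulk run:", n_kpoints)   # -> 1, nothing to distribute
```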
In that case, there isn't really much for ATK to parallelize over. Parallelization in ATK is primarily implemented in three places (see the rough sketch after this list):
- k-point sampling, but with a (1,1,1) grid there is only one k-point, so only one MPI node does useful work
- energy sampling, but this only comes into play in the real two-probe part of the calculation (with open boundary conditions)
- generation of matrix elements, but beyond 400 atoms this is the smaller part of the SCF iteration
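To put rough numbers on this, here is a crude sketch (my own illustration, not ATK internals; the 48 contour energy points and the assumption that the k-point and energy loops distribute jointly are just placeholders):

```python
# Crude upper bound (my assumption, not ATK internals): how many MPI
# processes can do useful work in each stage of the calculation.
def useful_mpi_processes(n_kpoints, n_energy_points, stage):
    if stage == "equivalent_bulk":
        # Only the k-point loop is distributed in the initial equivalent bulk run.
        return n_kpoints
    elif stage == "two_probe":
        # In the open-boundary two-probe part, the contour energy points are
        # distributed as well; assume they combine with the k-point loop.
        return n_kpoints * n_energy_points
    raise ValueError("unknown stage: %s" % stage)

# With (1,1) transverse k-points and a hypothetical 48-point energy contour:
print(useful_mpi_processes(1, 48, "equivalent_bulk"))  # -> 1
print(useful_mpi_processes(1, 48, "two_probe"))        # -> 48
```

So the lack of MPI speed-up you see in the equivalent bulk stage is expected with that sampling.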
There is one thing that could help you in this situation. While the matrix diagonalization does not parallelize over MPI, it does thread on multi-core CPUs using OpenMP. So if you enable threading in ATK (see the manual appendix) and run so that each MPI process gets a multi-core CPU to itself, you should get "double parallelization": MPI parallelization over k-points, energy points, and matrix elements, plus OpenMP threading to speed up the matrix diagonalization.
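If you go that route, a generic sanity check like this (plain Python, nothing ATK-specific; ranks_per_node is an assumption you adjust to your mpirun/queue setup) helps make sure you don't oversubscribe the cores:

```python
# Generic sanity check for hybrid MPI + OpenMP runs (not ATK-specific):
# ranks per node x OpenMP threads per rank should not exceed the cores per node.
import os
import multiprocessing

cores_per_node = multiprocessing.cpu_count()
omp_threads = int(os.environ.get("OMP_NUM_THREADS", "1"))  # OpenMP threads per MPI process
ranks_per_node = 1                                         # assumption: one MPI process per node

if ranks_per_node * omp_threads > cores_per_node:
    print("Oversubscribed: %d rank(s) x %d thread(s) > %d cores"
          % (ranks_per_node, omp_threads, cores_per_node))
else:
    print("OK: %d MPI process(es) x %d thread(s) on %d cores"
          % (ranks_per_node, omp_threads, cores_per_node))
```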