We have found that OpenMP threading interferes with the MPI parallelization of ATK in a highly unexpected way. By using the correct OpenMP settings instead of the defaults, it is possible to improve the parallel performance of ATK by up to 10x (provided you already have enough slave licenses), without upgrading your hardware or buying more licenses.
For years, the conventional wisdom has been that ATK performs best in parallel when running one MPI process per socket. This feels wasteful on 4- or 8-core sockets, but it just wasn't possible to get any real speedup by putting more MPI processes on a socket. We assumed this was due to communication overhead: contention in the L1/L2 caches, on the memory buses, or even on the Ethernet ports.
It turns out, however, that by setting
export OMP_NUM_THREADS=1
export OMP_DYNAMIC=FALSE
you can get fantastic MPI speedup in ATK!
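As a concrete sketch of a full job launch with these settings, assuming ATK's atkpython launcher and a generic mpiexec (the launcher flags and the script name are illustrative and vary between MPI distributions):

# Disable OpenMP threading so it does not interfere with the MPI parallelization.
export OMP_NUM_THREADS=1
export OMP_DYNAMIC=FALSE
# Run one MPI process per core, e.g. 32 processes on a 32-core node.
mpiexec -n 32 atkpython my_calculation.py

Note that some MPI distributions do not automatically propagate exported environment variables to the remote ranks; in that case pass them explicitly, e.g. with -x (Open MPI) or -genv (Intel MPI / MPICH).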
We are still working on systematic benchmarks, but we have already seen a 10x speedup on a heavy device calculation (for the transmission spectrum part). Note that this is not a serial-versus-parallel speedup: it is the same MPI setup in both cases (1 node with 32 cores, running 32 MPI processes), just with OpenMP threading on versus off.
As another example, a (smallish) bulk calculation with MGGA, which takes 1000 s in serial, can be sped up 10x by using 16 MPI processes on just 2 nodes. With threading turned on, the best we could get on 2 nodes was a 2x speedup with 2 MPI processes; adding more processes actually slows the calculation down. In fact, with threading on, using 8 MPI processes on 1 node is 50% slower than running in serial, whereas with threading off we get an 8x speedup!
There are still serial overheads in ATK, and they now show up in a way we never saw before, simply because we could never previously drive the runtime of the parallelized parts low enough for them to matter. The most noticeable case is the Multigrid Poisson solver, used when you run devices with gates or charged systems (these are, in fact, the only cases where you should use Multigrid, which is also a lot slower than FFT2D in serial). We will put some focus on addressing these parts of the code for the next release.
Also, there are still cases where keeping threading ON makes sense, namely when you are running a calculation that ATK cannot parallelize over; the clearest example would be a large supercell with 1x1x1 k-points (or a molecule). In those cases you could run on a single node, without using MPI, and get some benefit from OpenMP threading by leaving OMP_DYNAMIC=TRUE and by not setting OMP_NUM_THREADS (or setting it to the number of cores), as sketched below.
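For such a single-node, non-MPI run, a minimal sketch might look like this (again assuming the atkpython launcher; the script name is illustrative):

# Let the OpenMP runtime manage the thread count dynamically.
export OMP_DYNAMIC=TRUE
# Alternatively, pin the thread count to the number of cores:
# export OMP_NUM_THREADS=$(nproc)
# Run serially, without mpiexec; OpenMP threads the supported kernels.
atkpython large_supercell.py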
Hopefully this new insight means that our customers can get results from ATK faster, both through shorter calculation times and by being able to run more concurrent calculations without tying up all the nodes on the cluster (if you have more than one master license, that is). With a 10x speedup, a calculation may only need 2-3 nodes where it previously required 10-20 full nodes.
If you have further questions on this, don't hesitate to ask us!