A common error is over-parallelization: running one MPI process per core. ATK only benefits from MPI if you limit the number of MPI processes to the number of available machines (or to the total number of sockets, if each machine has more than one). Within each MPI process, ATK then uses OpenMP to thread (parallelize) over the cores, although threading gives a smaller performance improvement than MPI does. If you start too many MPI processes on one machine, they compete for access to memory, and the machine becomes overloaded with communication between the processes. So, again, the simple rule for best performance is: one MPI process per socket.
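
As a rough illustration of the one-process-per-socket rule, the following Python sketch computes how many MPI processes to request and how many threads each process could use for a given cluster. The names `Layout` and `recommended_layout` are hypothetical helpers, not part of ATK, and the hardware numbers are made up for the example. Setting the thread count through the standard OpenMP variable `OMP_NUM_THREADS` is an assumption here; check how your ATK installation and MPI launcher expect it to be configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical container for a suggested job layout (not an ATK type)."""
    mpi_processes: int        # total MPI processes to request from the launcher
    threads_per_process: int  # threads each process may use, one per core of its socket


def recommended_layout(machines: int, sockets_per_machine: int,
                       cores_per_socket: int) -> Layout:
    """Apply the rule of thumb: one MPI process per socket, threads fill the cores."""
    if min(machines, sockets_per_machine, cores_per_socket) < 1:
        raise ValueError("all hardware counts must be at least 1")
    return Layout(
        mpi_processes=machines * sockets_per_machine,
        threads_per_process=cores_per_socket,
    )


def processes_if_one_per_core(machines: int, sockets_per_machine: int,
                              cores_per_socket: int) -> int:
    """The over-parallelized count that the rule of thumb is meant to avoid."""
    return machines * sockets_per_machine * cores_per_socket


if __name__ == "__main__":
    # Assumed example hardware: 4 machines, 2 sockets each, 8 cores per socket.
    layout = recommended_layout(4, 2, 8)
    print("MPI processes to request:", layout.mpi_processes)
    print("Threads per process (e.g. OMP_NUM_THREADS):", layout.threads_per_process)
    print("One process per core would have used:", processes_if_one_per_core(4, 2, 8))
```

For the assumed hardware this suggests 8 MPI processes with 8 threads each, instead of the 64 processes that one process per core would start.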