You should not expect any real speedup from MPI in ATK for molecule calculations, since the main parts of the code that are MPI-parallelized are k-point sampling and, for devices, energy sampling on the complex contour.
What you can get, however, is a nice threading speedup when diagonalizing large dense matrices. So your best option is to run in serial on one node (skip "mpiexec"), but make sure ATK has access to all 48 cores on the node. In most cases it will figure this out by itself, but if it does not, you can control it via "export MKL_NUM_THREADS=48".
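For example, a serial launch with full threading could look something like this (the script name is just a placeholder for your own input file, and the MKL_DYNAMIC line is optional, it only stops MKL from reducing the thread count on its own):

  # Let MKL thread the dense diagonalization over all 48 cores of the node
  export MKL_NUM_THREADS=48
  export MKL_DYNAMIC=FALSE
  atkpython molecule.py > molecule.log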
The general rule is to never run more than one MPI process per socket; otherwise the MPI processes compete for CPU cycles and for cache/memory bus bandwidth. Your nodes have 4 sockets, so -n 4 is in principle fine. The reason you see a slowdown is that MPI communication always has a cost, and since the parallel speedup is negligible in your case, all you see is that cost. Normally it is outweighed by the parallel performance gain and therefore goes unnoticed.
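For reference, the one-process-per-socket layout would be launched roughly like this, assuming OpenMPI (other MPI implementations use different mapping/binding options, and the script name is again a placeholder):

  # 4 MPI ranks, one per socket, each given its share of the 48 cores for MKL threading
  export MKL_NUM_THREADS=12
  mpiexec -n 4 --map-by socket --bind-to socket atkpython molecule.py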
That said, if you allow ATK to use all cores for one process, rather than sharing them over 4 processes, perhaps it will bring the calculation time a bit below the 30 minutes you see now.