The two problems are related, and together they illustrate nicely why it is important to understand how MPI parallelization actually takes place.
The short story is that you really shouldn't put more than 1 MPI process per socket. Depending on your hardware, you have 2 or 3 sockets on each node (depending on whether the CPUs are 4-core or 6-core). Therefore, the maximum recommended MPI parallelization is -n 4 or -n 6 (2 nodes x 2 or 3 sockets). Anything above this is likely to be slower.
And this is what you are seeing in parallel (CUI) vs. serial (GUI). You probably over-parallelize so much that the cores spend most of their time fighting for cache and RAM access, and communicating with each other, rather than actually doing calculations.
The beauty of multicore is that different tasks can run at the same time without disturbing each other as much as they would on a single core; for instance, you can still use your internet browser while a calculation is running. On an old single-core computer, each time the browser needed the CPU it would kick out the calculation for a while, and vice versa, but with a dual-core they can run in tandem.
But it's a myth that cores are independent compute nodes. They share a lot of infrastructure (like the L2 cache), and loading heavy independent processes on all 4 cores of a quad-core will make each of them run slower than it would on its own.
The really proper way to utilize a multi-node/socket/core environment is hybrid parallelization: MPI across the nodes (or sockets), and OpenMP threading across the cores. This way each socket still only solves "one problem" (in the case of ATK, it diagonalizes one k-point), but it can use all of its cores to do so, provided they are not busy with something else at that moment. If, however, you make each socket solve 4 problems simultaneously, then first of all those processes cannot thread, so you lose that advantage, and second, as mentioned above, the cores start to suffer from insufficient memory and network access.
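To make this concrete, a hybrid launch could look roughly like the sketch below. It is only an illustration, not the one right way to do it: it assumes Open MPI-style mpiexec flags (--map-by ppr:1:socket to place one process per socket), that OMP_NUM_THREADS controls the threading in your ATK version (some setups use MKL_NUM_THREADS), and my_calculation.py is just a placeholder script name; adjust the flags to your MPI implementation and queue system.

```python
# Hedged sketch of a hybrid MPI + OpenMP launch: one MPI process per socket,
# OpenMP threads over the cores of that socket.
# Assumes Open MPI-style mpiexec options; "my_calculation.py" is a placeholder.
import os
import subprocess

nodes = 2             # nodes in the job
sockets_per_node = 2  # one MPI process per socket
cores_per_socket = 4  # threads available to each process

# Let each MPI process thread over its own cores
# (use MKL_NUM_THREADS instead if that is what your setup honors).
env = dict(os.environ, OMP_NUM_THREADS=str(cores_per_socket))

cmd = [
    "mpiexec",
    "-n", str(nodes * sockets_per_node),   # 4 MPI processes in total
    "--map-by", "ppr:1:socket",            # place 1 process per socket (Open MPI syntax)
    "atkpython", "my_calculation.py",
]
subprocess.run(cmd, env=env, check=True)
```

With 2 nodes and 2 sockets per node this gives exactly the -n 4 mentioned above, and each process threads over its 4 cores instead of competing with 11 neighbours for them.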
The second important point is that each MPI process is an (almost) complete replica of the calculation. Therefore, if you load 12 MPI processes on a machine with 12 GB of RAM, each process effectively has just 1 GB to work with. So most likely what happens in your case 1 is that a serial version of the calculation would need perhaps 1.5 GB (I'm just using dummy numbers to show the principle), but in parallel with -n 24 over 2 nodes it would need 1.5*12 = 18 GB per node, and this is probably more than your available RAM.
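Here is a minimal sketch of that memory arithmetic, using the same dummy numbers (the 12 GB per node and 1.5 GB serial footprint are placeholders, not measurements of your system):

```python
# Minimal sketch of the memory argument above, with dummy numbers.
# Assumption: every MPI process holds an (almost) complete replica of the calculation.

ram_per_node_gb = 12.0      # physical RAM on one node (placeholder)
processes_per_node = 12     # e.g. -n 24 spread over 2 nodes
serial_footprint_gb = 1.5   # what one replica of the calculation needs (placeholder)

budget_per_process = ram_per_node_gb / processes_per_node    # 1.0 GB per process
needed_per_node = serial_footprint_gb * processes_per_node   # 18 GB per node

print(f"Budget per MPI process : {budget_per_process:.1f} GB")
print(f"Needed per node        : {needed_per_node:.1f} GB of {ram_per_node_gb:.0f} GB available")
print("Fits in RAM?           :", needed_per_node <= ram_per_node_gb)  # False -> swapping or crash
```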
ATK does utilize a hybrid scheme for some parts of the computations, and will do so even more in the future. So, limit your MPI parallelization to 1 process per node (or 1 per socket if you have enough RAM) and let ATK thread over the cores instead.