As Nordland writes, this system does not gain much by being run in parallel, while others do (see e.g. the pdf file attached to
this post).
However, even Li-H2-Li would scale (a bit) in parallel if the MPI processes were placed on separate computational nodes.
In general, one should be very careful about MPI-parallelizing over the cores of a single machine. First of all, each MPI process will allocate X MB of RAM, but since all processes run on the same machine, N processes will together require N*X MB from that single machine. So you become very limited in how large calculations you can run; even for a smallish calculation that uses, say, 400 MB per process, putting 6 MPI processes on one machine (6 x 400 MB = 2400 MB) will exhaust the RAM even if you have 2 GB available. What happens then is that you start swapping, and performance degrades terribly.
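To make the arithmetic concrete, here is a minimal sketch of that RAM-budget check in plain Python. The numbers are placeholders only; the per-process footprint of your own calculation is something you have to measure yourself:

```python
# Rough RAM-budget check for running N MPI processes on one machine.
# The numbers below are placeholders; measure the actual per-process
# memory footprint of your own calculation.

ram_per_process_mb = 400     # memory used by one MPI process (MB)
total_ram_mb = 2048          # physical RAM on the machine (MB)

for n_processes in range(1, 9):
    required_mb = n_processes * ram_per_process_mb
    verdict = "ok" if required_mb <= total_ram_mb else "will swap!"
    print("%d processes -> %5d MB needed (%s)"
          % (n_processes, required_mb, verdict))
```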
You can, however, get poor performance anyway (the Li-H2-Li system probably needs only 50 MB, so memory is not the issue here), since the MPI processes will fight for interconnect bandwidth and may also share the L2 cache (on Intel machines).
There is growing awareness that parallelizing on multicores doesn't always improve calculation speed (see e.g. http://www.hpcprojects.com/news/news_story.php?news_id=758).
However, ATK actually has (since version 2008.10) another way to take advantage of multicores, namely threading. MPI parallelization boosts performance when there are several k-points and energy points to distribute (the energy points on the contour integral are always present for two-probes), and it works best when used right, i.e. preferably on a cluster; threading, on the other hand, speeds up the matrix operations on each MPI node.
An important point to note is that threading in ATK is not automatically enabled in Linux; see the manual for how to turn it on.
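As a quick sanity check before a run, you can print the threading-related environment variables from Python. The variable names below (MKL_NUM_THREADS, OMP_NUM_THREADS, MKL_DYNAMIC) are the usual ones for threaded math libraries and are only an assumption here; the manual has the exact setting that applies to your ATK version:

```python
import os

# Environment variables that commonly control threaded math libraries.
# These names are an assumption; check the ATK manual for the exact
# variable that enables threading in your version.
candidates = ["MKL_NUM_THREADS", "OMP_NUM_THREADS", "MKL_DYNAMIC"]

for name in candidates:
    value = os.environ.get(name, "<not set>")
    print("%s = %s" % (name, value))
```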
So, does this mean that it is never a good idea to MPI-parallelize on a single machine? Well, it depends. I think this example clearly shows that putting as many as 6 MPI processes on a single node, even if it perhaps has 8 cores available, is not a good idea; 4 might work better, but 2 is probably a practical maximum if you want decent scaling. Also, the parallel behavior is very different for the SCF loop and the transmission calculation, and the linear scaling seen on clusters might hold for 2-4 processes also on a single machine (if the calculations are not too large).
Note, however, that by occupying the cores with MPI processes you effectively kill threading, which can be very detrimental to performance.
Thus, generally the best strategy for ATK is to MPI-parallelize over as many cluster nodes as you can afford, and leave it to ATK to thread on the cores of each node. If you don't have a cluster, consider getting one!
Seriously, you can make an excellent computational cluster from 3-4 desktop machines, or of course you can invest in some nice blades; they're not that much more expensive. If that's not an option and you're stuck on a single multi-core machine, you should test how ATK behaves for your particular system (and don't use Li-H2-Li; test with the real geometries you are interested in), but don't expect to be able to use more than 2-4 MPI processes; most likely, the bigger benefit will come from threading instead.
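As a minimal sketch of such a test (plain Python; the wall-clock times are placeholders you would fill in from your own runs with 1, 2, 4, ... MPI processes), you can tabulate speedup and parallel efficiency to see where adding more processes stops paying off:

```python
# Wall-clock times (seconds) measured for the same calculation with
# different numbers of MPI processes on one machine. The values here
# are placeholders; replace them with your own measurements.
timings = {1: 3600.0, 2: 1900.0, 4: 1150.0, 6: 1100.0}

serial_time = timings[1]
for n_processes in sorted(timings):
    speedup = serial_time / timings[n_processes]
    efficiency = speedup / n_processes
    print("%d processes: speedup %.2f, efficiency %.0f%%"
          % (n_processes, speedup, 100.0 * efficiency))
```

If the efficiency drops well below 100% already at 2 or 4 processes, that is a good sign you are better off leaving those cores to threading.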