On a quad-core, you might be better off using MKL over MPI, but it depends.
For the transmission calculation, for instance, you get a linear speedup using MPI (so, 4 times faster for you), but certainly not as much using MKL threading (OpenMP). On the other hand, when you MPI-parallelize (especially the SCF loop) on a single node, each process needs its own X MB of RAM, for a total of 4X MB, so it limits the size of the calculation you can perform, plus the processes will fight for the L2 cache a bit...
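To make the memory point concrete, here is a rough sketch of the arithmetic in Python. It assumes every MPI process holds its own full copy of the data, and the numbers (1500 MB per process, an 8 GB node) are made up for illustration, not measured values.

```python
def mpi_memory_mb(per_process_mb, n_processes):
    """Total RAM used on one node if each MPI process keeps its own copy."""
    return per_process_mb * n_processes


def max_processes(per_process_mb, node_ram_mb):
    """Largest number of MPI processes whose combined footprint fits in RAM."""
    return int(node_ram_mb // per_process_mb)


# Hypothetical example: X = 1500 MB per process on a quad-core with 8 GB RAM.
print(mpi_memory_mb(1500, 4))     # 6000 MB in total for 4 processes
print(max_processes(1500, 8192))  # 5 processes would still fit
```

If the second number drops below 4 for your system, you cannot use all four cores with MPI alone, and threading becomes the only way to keep them busy.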
But the only way to really find out is to try it, because it differs a lot between system sizes, geometries, and parameters like k-point sampling.
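One way to "try it" is a small timing harness like the sketch below. The workload here is only a placeholder so the script runs anywhere; you would swap in your own job command. OMP_NUM_THREADS and MKL_NUM_THREADS are the usual environment variables for controlling threading, but whether your particular build honours them is an assumption you should check.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None):
    """Run cmd once and return the wall-clock time in seconds."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


# Placeholder workload (hypothetical); replace with your real calculation.
workload = [sys.executable, "-c", "sum(range(10**6))"]

configs = {
    "1 process, 4 threads": (workload, {"OMP_NUM_THREADS": "4", "MKL_NUM_THREADS": "4"}),
    "1 process, 1 thread": (workload, {"OMP_NUM_THREADS": "1", "MKL_NUM_THREADS": "1"}),
    # For the MPI case, prefix your real command with e.g. ["mpiexec", "-n", "4"]
    # and keep the thread count at 1 so the processes do not oversubscribe cores.
}

for label, (cmd, env) in configs.items():
    print(f"{label}: {time_command(cmd, env):.2f} s")
```

Run each configuration on the actual system you care about, since the ranking can change with system size and k-point sampling.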