This point has been discussed about a million times in various contexts...
"Their" assumption is that you want to run an MPI process on all the allocated cores. That's a reasonable assumption - if you don't know anything about how ATK parallelizes for performance.
To cut it short (you can read more at http://quantumwise.com/documents/tutorials/latest/ParallelGuide/index.html/), if you put more than 1 MPI process per socket, the competition among the MPI processes for RAM/cache/bus access means the calculation goes slower. But having only one MPI process per socket doesn't mean that ATK leaves the other cores idle - it uses them for threading, but that is not a parameter you specify to "mpiexec".
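As a rough sketch of what that means on a single node with 2 quad-core sockets (the MKL_NUM_THREADS line is just an example of such an environment variable - check the Parallel Guide for what applies to your ATK version):

# One MPI process per socket: 2 processes on this node - recommended
mpiexec -n 2 script.py > run.out
# One MPI process per core: 8 processes fighting for RAM/cache/bus - slower
mpiexec -n 8 script.py > run.out
# The threading that fills the remaining cores is steered by environment
# variables (for example MKL_NUM_THREADS), not by an mpiexec argument
export MKL_NUM_THREADS=4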
So, again - a run with mpiexec/mpirun -n/-np N uses 1 master and N-1 slaves, i.e. it launches N MPI processes in total. How those processes are distributed among your allocated nodes is up to the scheduler/queue system - on many clusters you may need additional arguments like -npernode 2 or some other directive to PBS, which should be documented for your cluster.
Like the -nodes argument you show, but that documentation is for MPICH1, not MPICH2, which is what you should use for ATK. But there should be a similar option.
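One common way to force exactly 2 processes per node on a PBS cluster is to build a machinefile from the node list PBS gives you - treat this as a sketch only, since the right mpiexec flag (-machinefile, -f, -npernode, ...) depends on the MPI version installed on your cluster:

# List each allocated node twice, so mpiexec puts 2 processes (1 per socket) on it
sort -u $PBS_NODEFILE | awk '{print; print}' > machines
mpiexec -machinefile machines -n 8 script.py > run.out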
So in your example case (final post), where I'm assuming the cluster consists of X machines with dual quad-core chips (i.e. 2 sockets/node, 4 cores/socket), you would do
#!/bin/bash -l
#PBS -l walltime=01:00:00,pmem=500mb,nodes=4:ppn=8
#PBS -m abe
cd $HOME
module load intel
# 8 MPI processes over 4 nodes = 2 per node = 1 per socket
mpirun -np 8 -nodes 4 script.py > run.out
NOTE: "module load ompi/intel" is a big no-no! As mentioned above, ATK should be run with MPICH2, not OpenMPI.
In this case ATK will have 1 master and 7 slaves, i.e. 8 MPI processes distributed over the 4 nodes (2 MPI processes/node, one per socket), and at least for parts of the calculation (not a whole lot for devices, I admit) threading will utilize all 8 cores on all 4 nodes.
And - very importantly: since you have requested "full nodes" - all cores on each node - no other process will run on those nodes, and you get the best possible performance because you have each node to yourself.
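By the way, if you want to verify what PBS actually handed you, a one-liner like this in the job script prints each allocated node together with the number of cores you got on it (with nodes=4:ppn=8 you should see 4 node names, each with a count of 8):

sort $PBS_NODEFILE | uniq -c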