1. What command line did you use to start ATK? You may need to provide a machinefile to make sure the jobs are spread out over the nodes. Remember not to use more than 1 MPI process per socket, so assuming your nodes are dual-socket quad-cores, I would recommend running mpiexec -n 8. If your mpiexec supports it, you can use the argument -npernode 2; otherwise you will need a machinefile.
To test if you get the desired MPI allocation, try a small test script first (without any real calculations), containing just
import socket
if processIsMaster():
    print "Master node",
else:
    print "Slave node",
print socket.gethostname()
If you run with -n 8 and two processes per node, each hostname should be printed exactly twice.
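If the output is long, a small helper can do the counting for you. This is only a sketch in plain Python (it does not need ATK): the function names, the sample output, and the hostnames node01 to node03 are made up for illustration, and it assumes each process writes its "Master node"/"Slave node" line intact, with the hostname as the last word.

```python
from collections import Counter

def count_processes_per_host(output_text):
    """Count how many MPI processes reported from each hostname."""
    counts = Counter()
    for line in output_text.splitlines():
        line = line.strip()
        if line.startswith("Master node") or line.startswith("Slave node"):
            # Assumption: the hostname is the last whitespace-separated word.
            counts[line.split()[-1]] += 1
    return dict(counts)

def overloaded_hosts(counts, max_per_node=2):
    """Return, sorted, the hosts running more than max_per_node processes."""
    return sorted(host for host, n in counts.items() if n > max_per_node)

# Hypothetical output captured from a badly distributed -n 8 run.
sample = """Master node node01
Slave node node01
Slave node node01
Slave node node01
Slave node node02
Slave node node02
Slave node node03
Slave node node03
"""

counts = count_processes_per_host(sample)
bad = overloaded_hosts(counts)
```

Here node01 ends up in the overloaded list because it runs four processes instead of two. An empty list, with every host counted twice, is what you are aiming for.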
2. It depends a lot on where and why it stopped, but there is a checkpoint file that you could try. See http://quantumwise.com/publications/tutorials/mini-tutorials/142