Hi everyone,
I have been testing the new version of ATK with some parallel runs, and I have run into a problem when I try to run even the simple mpi_test in the background with mpiexec. Everything works properly if I let the output print to the screen, or if I redirect it to a file and let the job run in the foreground. I am using MPICH2 version 1.3.2, and the calculations are run on a Red Hat Enterprise Linux 5 machine with Xeon processors.
For example, these commands work fine:
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/atk_mpi_test > out.run
However, when I try to run it in the background by adding & at the end:
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py &
I get the following error:
[mpiexec@d1.cnf.cornell.edu] HYDU_sock_read (./utils/sock/sock.c:222): read errno (Input/output error)
[mpiexec@d1.cnf.cornell.edu] control_cb (./pm/pmiserv/pmiserv_cb.c:249): assert (!closed) failed
[mpiexec@d1.cnf.cornell.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@d1.cnf.cornell.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:206): error waiting for event
[mpiexec@d1.cnf.cornell.edu] main (./ui/mpich/mpiexec.c:404): process manager error waiting for completion
After searching through some discussion groups on mpiexec with the Hydra process manager, I found the following workaround for running things in the background:
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py < /dev/null &
With this redirection, you can also put nohup at the beginning of the command.
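As far as I understand it, the error comes from the backgrounded mpiexec still trying to read stdin from the terminal, and pointing stdin at /dev/null gives it an immediate end-of-file instead. If you launch jobs this way often, the workaround can be wrapped in a small shell function. This is only a rough sketch: the names run_detached and detached_pid are made up for illustration and are not part of ATK or MPICH2.

```shell
# Hypothetical helper (not part of ATK or MPICH2): run a command in the
# background with stdin detached from the terminal.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
    log_file=$1
    shift
    # < /dev/null : the launcher sees end-of-file instead of reading the terminal
    # nohup       : the job ignores the hangup signal sent at logout
    # > ... 2>&1  : stdout and stderr are both collected in the log file
    nohup "$@" < /dev/null > "$log_file" 2>&1 &
    # Remember the PID so the caller can wait for or kill the job later.
    detached_pid=$!
}
```

For the test job above, the call would be: run_detached out.run mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py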
The following link discusses this issue in more detail:
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2010-October/008239.html
Best Regards,
Derek