QuantumATK Forum
QuantumATK => General Questions and Answers => Topic started by: Derek Stewart on February 14, 2011, 16:13
-
Hi everyone,
I have been testing the new version of ATK with some parallel runs, and I have run into a problem when I try to run even the simple mpi_test in the background with mpiexec. Everything works properly if I run it in the foreground, either printing to the screen or redirecting the output to a file. I am using mpich2 version 1.3.2, and the calculations are done on a Red Hat Enterprise Linux 5 machine with Xeon processors.
For example, these commands work fine:
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/atk_mpi_test > out.run
However, when I try to run it in the background by adding & at the end:
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py &
I get the following error:
[[email protected]] HYDU_sock_read (./utils/sock/sock.c:222): read errno (Input/output error)
[[email protected]] control_cb (./pm/pmiserv/pmiserv_cb.c:249): assert (!closed) failed
[[email protected]] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[[email protected]] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:206): error waiting for event
[[email protected]] main (./ui/mpich/mpiexec.c:404): process manager error waiting for completion
After searching through some discussion groups on mpiexec with the hydra process manager, I found the following work-around for running things in the background: redirect stdin from /dev/null, so that hydra never tries to read from the (now inaccessible) terminal.
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py < /dev/null &
With this redirection in place, you can also prepend nohup to keep the calculation running after you log out.
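As a concrete sketch of the general pattern, here is a minimal runnable version. Note that `echo` stands in for the actual mpiexec command line, since the ATK paths and host names above are specific to my setup:

```shell
#!/bin/sh
# General pattern for a fully detached job:
#   nohup <command> > logfile 2>&1 < /dev/null &
# - nohup       : ignore the hangup signal when the session ends
# - > logfile   : capture stdout in a file
# - 2>&1        : send stderr to the same file
# - < /dev/null : give the job a valid stdin, so the process manager
#                 never fails trying to read from the detached terminal
nohup sh -c 'echo "job finished"' > out.run 2>&1 < /dev/null &

wait $!        # for this demo only; in real use you would just log out
cat out.run    # prints: job finished
```

In real use you would replace the `sh -c 'echo ...'` stand-in with the full `mpiexec -n 2 -hosts d1,d2 ...` command line shown above.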
The following link discusses this issue in more detail:
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2010-October/008239.html
Best Regards,
Derek
-
Hi everyone,
I just noticed that the final command line I listed had a typo. It should have included the redirect to the output file as follows:
mpiexec -n 2 -hosts d1,d2 /opt/QuantumWise/atk-11.2.b2/atkpython/bin/atkpython /home/derek/atk_mpi_test/test_mpi.py > out.run < /dev/null &
-
Thanks for sharing this trick to solve this little snag, which also appears with mpd (though in that case, unlike with hydra, the program is allowed to run; it just generates tons of garbage output in the log file). We'll make it a "tip" in the parallel guide to make sure people are aware of it!