QuantumATK Forum
QuantumATK => General Questions and Answers => Topic started by: rebacoo on September 14, 2023, 12:51
-
Dear QuantumWise staff:
Recently I installed QuantumATK-2022.03 on a Rocky Linux 8.8 system, and I find that parallel computation cannot be performed. (Single-core computation is fine, i.e. atkpython ***.py > ***.log &.) When I use mpiexec.hydra, it doesn't work. The error message is as follows:
********************************
[atk@cluster ~]$ mpiexec.hydra -np 4 atkpython A5AO2-opt.py
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 64882 RUNNING AT cluster
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
***************************
How can I deal with this problem? Thank you very much.
-
To troubleshoot something like this, I would try a few things:
* Can you run a trivial command in parallel, like mpiexec.hydra -np 4 echo "hello"?
* Is parallelization across nodes set up correctly in general on the cluster? Can you run mpiexec.hydra -np 4 -localonly echo "hello"?
* Is your path set up correctly, so that mpiexec.hydra actually points to our binary? Same for atkpython.
* Always use the latest version of QuantumATK. We released 2023.09 just this month.
* Make sure the test system is really small (start with 1 Au atom, see the sketch after this list) to exclude problems like running out of memory.
* Add the -v option to mpiexec; it will print tons of debug information.
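For the "really small" test, a minimal script could look something like the sketch below (written from memory, so double-check the exact class names against the manual of your QuantumATK version); it runs in seconds and tells you whether the MPI setup itself is the problem:

# au_test.py - minimal 1-atom test (the file name is just an example)
from QuantumATK import *   # older releases use: from NanoLanguage import *

# One gold atom in an FCC bulk cell (approximate lattice constant)
bulk_configuration = BulkConfiguration(
    bravais_lattice=FaceCenteredCubic(4.08*Angstrom),
    elements=[Gold],
    fractional_coordinates=[[0.0, 0.0, 0.0]],
)

# The default LCAO calculator is enough for a smoke test
bulk_configuration.setCalculator(LCAOCalculator())
bulk_configuration.update()

# nlprint only writes from the master MPI process
nlprint(TotalEnergy(bulk_configuration))

If mpiexec.hydra -np 4 atkpython au_test.py also crashes, the problem is in the MPI/installation setup rather than in your actual input file.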
-
Thank you Professor Anders. QuantumATK 2022 is installed on a single-node server, and the paths of mpiexec.hydra and atkpython are correct:
([atk@cluster ~]$ which atkpython
~/software/QuantumATK2022/bin/atkpython
[atk@cluster ~]$ which mpiexec.hydra
~/software/QuantumATK2022/libexec/mpiexec.hydra)
Following your suggestion, when I run mpiexec.hydra -np 4 -localonly echo "hello", the message is as follows:
*********************************
[atk@cluster ~]$ mpiexec.hydra -np 4 -localonly echo "hello"
[mpiexec@cluster] match_arg (../../utils/args/args.c:254): unrecognized argument localonly
[mpiexec@cluster] HYDU_parse_array (../../utils/args/args.c:269): argument matching returned error
[mpiexec@cluster] parse_args (../../ui/mpich/utils.c:4770): error parsing input array
[mpiexec@cluster] HYD_uii_mpx_get_parameters (../../ui/mpich/utils.c:5106): unable to parse user arguments
Usage: ./mpiexec [global opts] [exec1 local opts] : [exec2 local opts] : ...
Global options (passed to all executables):
Global environment options:
-genv {name} {value} environment variable name and value
-genvlist {env1,env2,...} environment variable list to pass
-genvnone do not pass any environment variables
-genvall pass all environment variables not managed
by the launcher (default)
Other global options:
-f {name} | -hostfile {name} file containing the host names
-hosts {host list} comma separated host list
-configfile {name} config file containing MPMD launch options
-machine {name} | -machinefile {name}
file mapping procs to machines
-pmi-connect {nocache|lazy-cache|cache}
set the PMI connections mode to use
-pmi-aggregate aggregate PMI messages
-pmi-noaggregate do not aggregate PMI messages
-trace {<libraryname>} trace the application using <libraryname>
profiling library; default is libVT.so
-trace-imbalance {<libraryname>} trace the application using <libraryname>
imbalance profiling library; default is libVTim.so
-check-mpi {<libraryname>} check the application using <libraryname>
checking library; default is libVTmc.so
-ilp64 Preload ilp64 wrapper library for support default size of
integer 8 bytes
-mps start statistics gathering for MPI Performance Snapshot (MPS)
-aps start statistics gathering for Application Performance Snapshot (APS)
-trace-pt2pt collect information about
Point to Point operations
-trace-collectives collect information about
Collective operations
-tune [<confname>] apply the tuned data produced by
the MPI Tuner utility
-use-app-topology <statfile> perform optimized rank placement based statistics
and cluster topology
-noconf do not use any mpiexec's configuration files
-branch-count {leaves_num} set the number of children in tree
-gwdir {dirname} working directory to use
-gpath {dirname} path to executable to use
-gumask {umask} mask to perform umask
-tmpdir {tmpdir} temporary directory for cleanup input file
-cleanup create input file for clean up
-gtool {options} apply a tool over the mpi application
-gtoolfile {file} apply a tool over the mpi application. Parameters specified in the file
Local options (passed to individual executables):
Local environment options:
-env {name} {value} environment variable name and value
-envlist {env1,env2,...} environment variable list to pass
-envnone do not pass any environment variables
-envall pass all environment variables (default)
Other local options:
-host {hostname} host on which processes are to be run
-hostos {OS name} operating system on particular host
-wdir {dirname} working directory to use
-path {dirname} path to executable to use
-umask {umask} mask to perform umask
-n/-np {value} number of processes
{exec_name} {args} executable name and arguments
Hydra specific options (treated as global):
Bootstrap options:
-bootstrap bootstrap server to use
(ssh rsh pdsh fork slurm srun ll llspawn.stdio lsf blaunch sge qrsh persist service pbsdsh)
-bootstrap-exec executable to use to bootstrap processes
-bootstrap-exec-args additional options to pass to bootstrap server
-prefork use pre-fork processes startup method
-enable-x/-disable-x enable or disable X forwarding
Resource management kernel options:
-rmk resource management kernel to use (user slurm srun ll llspawn.stdio lsf blaunch sge qrsh pbs cobalt)
Processor topology options:
-binding process-to-core binding mode
Extended fabric control options:
-rdma select RDMA-capable network fabric (dapl). Fallback list is ofa,tcp,tmi,ofi
-RDMA select RDMA-capable network fabric (dapl). Fallback is ofa
-dapl select DAPL-capable network fabric. Fallback list is tcp,tmi,ofa,ofi
-DAPL select DAPL-capable network fabric. No fallback fabric is used
-ib select OFA-capable network fabric. Fallback list is dapl,tcp,tmi,ofi
-IB select OFA-capable network fabric. No fallback fabric is used
-tmi select TMI-capable network fabric. Fallback list is dapl,tcp,ofa,ofi
-TMI select TMI-capable network fabric. No fallback fabric is used
-mx select Myrinet MX* network fabric. Fallback list is dapl,tcp,ofa,ofi
-MX select Myrinet MX* network fabric. No fallback fabric is used
-psm select PSM-capable network fabric. Fallback list is dapl,tcp,ofa,ofi
-PSM select PSM-capable network fabric. No fallback fabric is used
-psm2 select Intel* Omni-Path Fabric. Fallback list is dapl,tcp,ofa,ofi
-PSM2 select Intel* Omni-Path Fabric. No fallback fabric is used
-ofi select OFI-capable network fabric. Fallback list is tmi,dapl,tcp,ofa
-OFI select OFI-capable network fabric. No fallback fabric is used
Checkpoint/Restart options:
-ckpoint {on|off} enable/disable checkpoints for this run
-ckpoint-interval checkpoint interval
-ckpoint-prefix destination for checkpoint files (stable storage, typically a cluster-wide file system)
-ckpoint-tmp-prefix temporary/fast/local storage to speed up checkpoints
-ckpoint-preserve number of checkpoints to keep (default: 1, i.e. keep only last checkpoint)
-ckpointlib checkpointing library (blcr)
-ckpoint-logfile checkpoint activity/status log file (appended)
-restart restart previously checkpointed application
-ckpoint-num checkpoint number to restart
Demux engine options:
-demux demux engine (poll select)
Debugger support options:
-tv run processes under TotalView
-tva {pid} attach existing mpiexec process to TotalView
-gdb run processes under GDB
-gdba {pid} attach existing mpiexec process to GDB
-gdb-ia run processes under Intel IA specific GDB
Other Hydra options:
-v | -verbose verbose mode
-V | -version show the version
-info build information
-print-rank-map print rank mapping
-print-all-exitcodes print exit codes of all processes
-iface network interface to use
-help show this message
-perhost <n> place consecutive <n> processes on each host
-ppn <n> stand for "process per node"; an alias to -perhost <n>
-grr <n> stand for "group round robin"; an alias to -perhost <n>
-rr involve "round robin" startup scheme
-s <spec> redirect stdin to all or 1,2 or 2-4,6 MPI processes (0 by default)
-ordered-output avoid data output intermingling
-profile turn on internal profiling
-l | -prepend-rank prepend rank to output
-prepend-pattern prepend pattern to output
-outfile-pattern direct stdout to file
-errfile-pattern direct stderr to file
-localhost local hostname for the launching node
-nolocal avoid running the application processes on the node where mpiexec.hydra started
Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.
****************************
When I run mpiexec.hydra -np 4 echo "hello", it seems to work:
[atk@cluster ~]$ mpiexec.hydra -np 4 echo "hello"
hello
hello
hello
hello
How can I deal with this problem? Thank you.
-
Remove the '-localonly' option from the command; it was only available in older versions of Intel MPI.
-
Thank you Professor filipr. The mpiexec.hydra is the one shipped with QuantumATK 2022. When I run mpiexec.hydra -np 8 echo "Hello" it works well, and when I run atkpython **.py > ***.log it works well,
but when I run mpiexec.hydra -np 4 atkpython A5AO2-opt.py it doesn't work:
***********************************************************************
[atk@cluster ~]$ mpiexec.hydra -np 4 atkpython A5AO2-opt.py
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 108659 RUNNING AT cluster
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
[atk@cluster ~]$ mpiexec.hydra -np 8 echo "Hello"
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
************************************
-
Can you share with us your input script A5AO2-opt.py?
-
A5AO2-opt.py is a test file, and when I run atkpython A5AO2-opt.py, it works well.
-
The -localonly option might be Windows-specific. I use it all the time on Windows when launching from the command line, to avoid having to set up SSH keys or an MPI service, or give my password to each SSH process. Perhaps there is a smarter way to do it, but there is no indication this keyword is deprecated, on Windows at least (https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-windows/2021-10/global-hydra-options.html).
But you use Linux, rebacoo, so indeed you can skip it.
Did you try the -v option (mpiexec.hydra -v -np 4 ...)? Also, you can try
mpiexec.hydra -n 4 -genv I_MPI_HYDRA_DEBUG=1 -genv I_MPI_DEBUG=5 atkpython script.py
to generate a LOT of debug info that might help
-
Thank you Professor Anders, here are the results:
[atk@cluster ~]$ mpiexec.hydra -n 4 -genv I_MPI_HYDRA_DEBUG=1 -genv I_MPI_DEBUG=5 atkpython A5AO2-opt.py
host: cluster
==================================================================================================
mpiexec options:
----------------
Base path: /home/atk/software/QuantumATK2022/libexec/
Launcher: ssh
Debug level: 1
Enable X: -1
Global environment:
-------------------
LD_LIBRARY_PATH=/home/atk/software/QuantumATK2022/lib
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:
SSH_CONNECTION=192.168.0.3 14188 192.168.0.202 22
MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
HOSTNAME=cluster
S_COLORS=auto
which_declare=declare -f
XDG_SESSION_ID=19
MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl
USER=atk
SELINUX_ROLE_REQUESTED=
PWD=/home/atk
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HOME=/home/atk
SSH_CLIENT=192.168.0.3 14188 22
SELINUX_LEVEL_REQUESTED=
XDG_DATA_DIRS=/home/atk/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
LOADEDMODULES=
SSH_TTY=/dev/pts/2
MAIL=/var/spool/mail/atk
TERM=xterm
SHELL=/bin/bash
SELINUX_USE_CURRENT_RANGE=
SHLVL=1
MANPATH=:
GDK_BACKEND=x11
MODULEPATH=/etc/scl/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles
LOGNAME=atk
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-qciD5xJKGf,guid=1da0234fcdd78505b1ab234365039c04
XDG_RUNTIME_DIR=/run/user/1000
MODULEPATH_modshare=/usr/share/Modules/modulefiles:2:/etc/modulefiles:2:/usr/share/modulefiles:2
PATH=/home/atk/software/QuantumATK2022/bin:/home/atk/software/QuantumATK2022/libexec:/home/atk/software/QuantumATK2022/bin:/home/atk/.local/bin:/home/atk/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
DEBUGINFOD_URLS=https://debuginfod.centos.org/
MODULESHOME=/usr/share/Modules
HISTSIZE=1000
LESSOPEN=||/usr/bin/lesspipe.sh %s
BASH_FUNC_which%%=() { ( alias;
eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}
BASH_FUNC_module%%=() { _module_raw "$@" 2>&1
}
BASH_FUNC__module_raw%%=() { unset _mlshdbg;
if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then
case "$-" in
*v*x*)
set +vx;
_mlshdbg='vx'
;;
*v*)
set +v;
_mlshdbg='v'
;;
*x*)
set +x;
_mlshdbg='x'
;;
*)
_mlshdbg=''
;;
esac;
fi;
unset _mlre _mlIFS;
if [ -n "${IFS+x}" ]; then
_mlIFS=$IFS;
fi;
IFS=' ';
for _mlv in ${MODULES_RUN_QUARANTINE:-};
do
if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then
if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then
_mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' ";
fi;
_mlrv="MODULES_RUNENV_${_mlv}";
_mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' ";
fi;
done;
if [ -n "${_mlre:-}" ]; then
eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`;
else
eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`;
fi;
_mlstatus=$?;
if [ -n "${_mlIFS+x}" ]; then
IFS=$_mlIFS;
else
unset IFS;
fi;
unset _mlre _mlv _mlrv _mlIFS;
if [ -n "${_mlshdbg:-}" ]; then
set -$_mlshdbg;
fi;
unset _mlshdbg;
return $_mlstatus
}
BASH_FUNC_switchml%%=() { typeset swfound=1;
if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then
typeset swname='main';
if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then
typeset swfound=0;
unset MODULES_USE_COMPAT_VERSION;
fi;
else
typeset swname='compatibility';
if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then
typeset swfound=0;
MODULES_USE_COMPAT_VERSION=1;
export MODULES_USE_COMPAT_VERSION;
fi;
fi;
if [ $swfound -eq 0 ]; then
echo "Switching to Modules $swname version";
source /usr/share/Modules/init/bash;
else
echo "Cannot switch to Modules $swname version, command not found";
return 1;
fi
}
BASH_FUNC_scl%%=() { if [ "$1" = "load" -o "$1" = "unload" ]; then
eval "module $@";
else
/usr/bin/scl "$@";
fi
}
BASH_FUNC_ml%%=() { module ml "$@"
}
_=/home/atk/software/QuantumATK2022/libexec/mpiexec.hydra
Hydra internal environment:
---------------------------
MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1
GFORTRAN_UNBUFFERED_PRECONNECTED=y
I_MPI_HYDRA_UUID=be640200-572b-9016-7705-060000cac0a8
DAPL_NETWORK_PROCESS_NUM=4
User set environment:
---------------------
I_MPI_HYDRA_DEBUG=1
I_MPI_DEBUG=5
Intel(R) MPI Library specific variables:
----------------------------------------
I_MPI_HYDRA_UUID=be640200-572b-9016-7705-060000cac0a8
I_MPI_HYDRA_DEBUG=1
I_MPI_DEBUG=5
Proxy information:
*********************
[1] proxy: cluster (48 cores)
Exec list: atkpython (4 processes);
==================================================================================================
[mpiexec@cluster] Timeout set to -1 (-1 means infinite)
[mpiexec@cluster] Got a control port string of cluster:45541
Proxy launch args: /home/atk/software/QuantumATK2022/libexec/pmi_proxy --control-port cluster:45541 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 2105394070 --usize -2 --proxy-id
Arguments being passed to proxy 0:
--version 3.2 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname cluster --global-core-map 0,48,48 --pmi-id-map 0,0 --global-process-count 4 --auto-cleanup 1 --pmi-kvsname kvs_156862_0 --pmi-process-mapping (vector,(0,1,48)) --topolib ipl --ckpointlib blcr --ckpoint-prefix /tmp --ckpoint-preserve 1 --ckpoint off --ckpoint-num -1 --global-inherited-env 45 'LD_LIBRARY_PATH=/home/atk/software/QuantumATK2022/lib' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:' 'SSH_CONNECTION=192.168.0.3 14188 192.168.0.202 22' 'MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD' 'LANG=en_US.UTF-8' 'HISTCONTROL=ignoredups' 'HOSTNAME=cluster' 'S_COLORS=auto' 'which_declare=declare -f' 'XDG_SESSION_ID=19' 'MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl' 'USER=atk' 'SELINUX_ROLE_REQUESTED=' 'PWD=/home/atk' 'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'HOME=/home/atk' 'SSH_CLIENT=192.168.0.3 14188 22' 'SELINUX_LEVEL_REQUESTED=' 'XDG_DATA_DIRS=/home/atk/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share' 'LOADEDMODULES=' 'SSH_TTY=/dev/pts/2' 'MAIL=/var/spool/mail/atk' 'TERM=xterm' 'SHELL=/bin/bash' 'SELINUX_USE_CURRENT_RANGE=' 'SHLVL=1' 'MANPATH=:' 'GDK_BACKEND=x11' 'MODULEPATH=/etc/scl/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles' 'LOGNAME=atk' 'DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-qciD5xJKGf,guid=1da0234fcdd78505b1ab234365039c04' 'XDG_RUNTIME_DIR=/run/user/1000' 'MODULEPATH_modshare=/usr/share/Modules/modulefiles:2:/etc/modulefiles:2:/usr/share/modulefiles:2' 'PATH=/home/atk/software/QuantumATK2022/bin:/home/atk/software/QuantumATK2022/libexec:/home/atk/software/QuantumATK2022/bin:/home/atk/.local/bin:/home/atk/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin' 'DEBUGINFOD_URLS=https://debuginfod.centos.org/ ' 'MODULESHOME=/usr/share/Modules' 'HISTSIZE=1000' 'LESSOPEN=||/usr/bin/lesspipe.sh %s' 'BASH_FUNC_which%%=() { ( alias;
eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}' 'BASH_FUNC_module%%=() { _module_raw "$@" 2>&1
}' 'BASH_FUNC__module_raw%%=() { unset _mlshdbg;
if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then
case "$-" in
*v*x*)
set +vx;
_mlshdbg='vx'
;;
*v*)
set +v;
_mlshdbg='v'
;;
*x*)
set +x;
_mlshdbg='x'
;;
*)
_mlshdbg=''
;;
esac;
fi;
unset _mlre _mlIFS;
if [ -n "${IFS+x}" ]; then
_mlIFS=$IFS;
fi;
IFS=' ';
for _mlv in ${MODULES_RUN_QUARANTINE:-};
do
if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then
if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then
_mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' ";
fi;
_mlrv="MODULES_RUNENV_${_mlv}";
_mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' ";
fi;
done;
if [ -n "${_mlre:-}" ]; then
eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`;
else
eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`;
fi;
_mlstatus=$?;
if [ -n "${_mlIFS+x}" ]; then
IFS=$_mlIFS;
else
unset IFS;
fi;
unset _mlre _mlv _mlrv _mlIFS;
if [ -n "${_mlshdbg:-}" ]; then
set -$_mlshdbg;
fi;
unset _mlshdbg;
return $_mlstatus
}' 'BASH_FUNC_switchml%%=() { typeset swfound=1;
if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then
typeset swname='main';
if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then
typeset swfound=0;
unset MODULES_USE_COMPAT_VERSION;
fi;
else
typeset swname='compatibility';
if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then
typeset swfound=0;
MODULES_USE_COMPAT_VERSION=1;
export MODULES_USE_COMPAT_VERSION;
fi;
fi;
if [ $swfound -eq 0 ]; then
echo "Switching to Modules $swname version";
source /usr/share/Modules/init/bash;
else
echo "Cannot switch to Modules $swname version, command not found";
return 1;
fi
}' 'BASH_FUNC_scl%%=() { if [ "$1" = "load" -o "$1" = "unload" ]; then
eval "module $@";
else
/usr/bin/scl "$@";
fi
}' 'BASH_FUNC_ml%%=() { module ml "$@"
}' '_=/home/atk/software/QuantumATK2022/libexec/mpiexec.hydra' --global-user-env 2 'I_MPI_HYDRA_DEBUG=1' 'I_MPI_DEBUG=5' --global-system-env 4 'MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1' 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' 'I_MPI_HYDRA_UUID=be640200-572b-9016-7705-060000cac0a8' 'DAPL_NETWORK_PROCESS_NUM=4' --proxy-core-count 48 --mpi-cmd-env mpiexec.hydra -n 4 -genv I_MPI_HYDRA_DEBUG=1 -genv I_MPI_DEBUG=5 atkpython A5AO2-opt.py --exec --exec-appnum 0 --exec-proc-count 4 --exec-local-env 0 --exec-wdir /home/atk --exec-args 2 atkpython A5AO2-opt.py
[mpiexec@cluster] Launch arguments: /home/atk/software/QuantumATK2022/libexec/pmi_proxy --control-port cluster:45541 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 2105394070 --usize -2 --proxy-id 0
[proxy:0:0@cluster] Start PMI_proxy 0
[proxy:0:0@cluster] STDIN will be redirected to 1 fd(s): 17
[proxy:0:0@cluster] got pmi command (from 16): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 12): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 14): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 21): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 12): get_maxes
[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 14): get_maxes
[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 16): get_maxes
[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 12): barrier_in
[proxy:0:0@cluster] got pmi command (from 14): barrier_in
[proxy:0:0@cluster] got pmi command (from 21): get_maxes
[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 16): barrier_in
[proxy:0:0@cluster] got pmi command (from 21): barrier_in
[proxy:0:0@cluster] forwarding command (cmd=barrier_in) upstream
[mpiexec@cluster] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@cluster] PMI response to fd 8 pid 21: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] got pmi command (from 12): get_ranks2hosts
[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 14): get_ranks2hosts
[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 16): get_ranks2hosts
[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 21): get_ranks2hosts
[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 12): get_appnum
[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 14): get_appnum
[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 16): get_appnum
[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 12): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 14): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 21): get_appnum
[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 12): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 14): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 16): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 21): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
- MPI startup(): Multi-threaded optimized library
[proxy:0:0@cluster] got pmi command (from 16): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 21): get_my_kvsname
[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 156869 RUNNING AT cluster
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
Please help me, thank you
-
More info, but no clue yet. Now please also add the -v option to mpiexec.hydra, but post the results as an attached text file instead, as it will be very long.
-
Thank you Professor Anders, here is the attachment.
-
What actual version of QuantumATK is this? There have been some issues with Intel MPI in the past. With 2022.12 things improved, but ideally you should run 2023.09. Also, is there some other MPI library in your path, or generally recommended/used on this cluster?
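(A quick way to check, for example: which -a mpiexec mpirun mpiexec.hydra and echo $LD_LIBRARY_PATH will show whether another MPI installation shadows the one bundled with QuantumATK.)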
-
Thank you Professor Anders, the version of QuantumATK is 2022.03; it works well on CentOS 7.9.
-
It's a bit "old", but more importantly we have had issues with Intel MPI for a while. I have also seen reports online where people solved issues like this by upgrading their Intel MPI version, which in your case would be easiest done by moving to 2022.12 or even 2023.09, that way you also get access to all the latest features! There is not much we can do to troubleshoot an old version, esp. if this is a bug in Intel MPI as it very well might be.