Author Topic: error in mpiexec.hydra  (Read 6411 times)


Offline rebacoo

  • Regular QuantumATK user
  • **
  • Posts: 11
  • Reputation: 0
    • View Profile
error in mpiexec.hydra
« on: September 14, 2023, 12:51 »
Dear QuantumWise staff:
  Recently, I installed QuantumATK 2022.03 on a Rocky Linux 8.8 system, and I find that parallel computation cannot be performed. (Single-core computation is OK, i.e. atkpython ***.py > ***.log &.) When I use mpiexec.hydra, it does not work. The error message is as follows:
********************************
[atk@cluster ~]$ mpiexec.hydra -np 4 atkpython A5AO2-opt.py
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 64882 RUNNING AT cluster
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
***************************
How can I deal with this problem? Thank you very much.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5589
  • Country: dk
  • Reputation: 102
    • View Profile
    • QuantumATK at Synopsys
Re: error in mpiexec.hydra
« Reply #1 on: September 14, 2023, 19:25 »
To troubleshoot something like this, I would try a few things:

* Can you run a trivial command in parallel, like mpiexec.hydra -np 4 echo "hello"?
* Is parallelization across nodes set up correctly in general on the cluster? Can you run mpiexec.hydra -np 4 -localonly echo "hello"?
* Is your path set up correctly, so that mpiexec.hydra actually points to our binary? The same goes for atkpython.
* Always use the latest version of QuantumATK. We released 2023.09 just this month.
* Make sure the test system is really small (start with 1 Au atom), to exclude problems like running out of memory.
* Add the -v option to mpiexec; it will print tons of debug information (see the command sketch after this list).
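
For reference, these checks could look roughly like the following on the command line (a sketch only; test.py and test.log are placeholders for your own small test script and log file):

# confirm that both binaries come from the QuantumATK installation
which atkpython
which mpiexec.hydra

# trivial parallel test that does not involve QuantumATK at all
mpiexec.hydra -np 4 echo "hello"

# run a very small atkpython job in parallel, with verbose launcher output captured to a file
mpiexec.hydra -v -np 4 atkpython test.py > test.log 2>&1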

Offline rebacoo

  • Regular QuantumATK user
  • **
  • Posts: 11
  • Reputation: 0
    • View Profile
Re: error in mpiexec.hydra
« Reply #2 on: September 15, 2023, 02:07 »
Thank you, Professor Anders. QuantumATK 2022 is installed on a single-node server, and the paths of mpiexec.hydra and atkpython are correct.
 ([atk@cluster ~]$ which atkpython
~/software/QuantumATK2022/bin/atkpython             
 [atk@cluster ~]$ which mpiexec.hydra
~/software/QuantumATK2022/libexec/mpiexec.hydra)

Following your suggestion, when I run mpiexec.hydra -np 4 -localonly echo "hello", the message is as follows:
*********************************
[atk@cluster ~]$ mpiexec.hydra -np 4 -localonly echo "hello"
[mpiexec@cluster] match_arg (../../utils/args/args.c:254): unrecognized argument localonly
[mpiexec@cluster] HYDU_parse_array (../../utils/args/args.c:269): argument matching returned error
[mpiexec@cluster] parse_args (../../ui/mpich/utils.c:4770): error parsing input array
[mpiexec@cluster] HYD_uii_mpx_get_parameters (../../ui/mpich/utils.c:5106): unable to parse user arguments

Usage: ./mpiexec [global opts] [exec1 local opts] : [exec2 local opts] : ...

Global options (passed to all executables):

  Global environment options:
    -genv {name} {value}             environment variable name and value
    -genvlist {env1,env2,...}        environment variable list to pass
    -genvnone                        do not pass any environment variables
    -genvall                         pass all environment variables not managed
                                          by the launcher (default)

  Other global options:
    -f {name} | -hostfile {name}     file containing the host names
    -hosts {host list}               comma separated host list
    -configfile {name}               config file containing MPMD launch options
    -machine {name} | -machinefile {name}
                                     file mapping procs to machines
    -pmi-connect {nocache|lazy-cache|cache}
                                     set the PMI connections mode to use
    -pmi-aggregate                   aggregate PMI messages
    -pmi-noaggregate                 do not  aggregate PMI messages
    -trace {<libraryname>}           trace the application using <libraryname>
                                     profiling library; default is libVT.so
    -trace-imbalance {<libraryname>} trace the application using <libraryname>
                                     imbalance profiling library; default is libVTim.so
    -check-mpi {<libraryname>}       check the application using <libraryname>
                                     checking library; default is libVTmc.so
    -ilp64                           Preload ilp64 wrapper library for support default size of
                                     integer 8 bytes
    -mps                             start statistics gathering for MPI Performance Snapshot (MPS)
    -aps                             start statistics gathering for Application Performance Snapshot (APS)
    -trace-pt2pt                     collect information about
                                     Point to Point operations
    -trace-collectives               collect information about
                                     Collective operations
    -tune [<confname>]               apply the tuned data produced by
                                     the MPI Tuner utility
    -use-app-topology <statfile>     perform optimized rank placement based statistics
                                     and cluster topology
    -noconf                          do not use any mpiexec's configuration files
    -branch-count {leaves_num}       set the number of children in tree
    -gwdir {dirname}                 working directory to use
    -gpath {dirname}                 path to executable to use
    -gumask {umask}                  mask to perform umask
    -tmpdir {tmpdir}                 temporary directory for cleanup input file
    -cleanup                         create input file for clean up
    -gtool {options}                 apply a tool over the mpi application
    -gtoolfile {file}                apply a tool over the mpi application. Parameters specified in the file


Local options (passed to individual executables):

  Local environment options:
    -env {name} {value}              environment variable name and value
    -envlist {env1,env2,...}         environment variable list to pass
    -envnone                         do not pass any environment variables
    -envall                          pass all environment variables (default)

  Other local options:
    -host {hostname}                 host on which processes are to be run
    -hostos {OS name}                operating system on particular host
    -wdir {dirname}                  working directory to use
    -path {dirname}                  path to executable to use
    -umask {umask}                   mask to perform umask
    -n/-np {value}                   number of processes
    {exec_name} {args}               executable name and arguments


Hydra specific options (treated as global):

  Bootstrap options:
    -bootstrap                       bootstrap server to use
     (ssh rsh pdsh fork slurm srun ll llspawn.stdio lsf blaunch sge qrsh persist service pbsdsh)
    -bootstrap-exec                  executable to use to bootstrap processes
    -bootstrap-exec-args             additional options to pass to bootstrap server
    -prefork                         use pre-fork processes startup method
    -enable-x/-disable-x             enable or disable X forwarding

  Resource management kernel options:
    -rmk                             resource management kernel to use (user slurm srun ll llspawn.stdio lsf blaunch sge qrsh pbs cobalt)

  Processor topology options:
    -binding                         process-to-core binding mode
  Extended fabric control options:
    -rdma                            select RDMA-capable network fabric (dapl). Fallback list is ofa,tcp,tmi,ofi
    -RDMA                            select RDMA-capable network fabric (dapl). Fallback is ofa
    -dapl                            select DAPL-capable network fabric. Fallback list is tcp,tmi,ofa,ofi
    -DAPL                            select DAPL-capable network fabric. No fallback fabric is used
    -ib                              select OFA-capable network fabric. Fallback list is dapl,tcp,tmi,ofi
    -IB                              select OFA-capable network fabric. No fallback fabric is used
    -tmi                             select TMI-capable network fabric. Fallback list is dapl,tcp,ofa,ofi
    -TMI                             select TMI-capable network fabric. No fallback fabric is used
    -mx                              select Myrinet MX* network fabric. Fallback list is dapl,tcp,ofa,ofi
    -MX                              select Myrinet MX* network fabric. No fallback fabric is used
    -psm                             select PSM-capable network fabric. Fallback list is dapl,tcp,ofa,ofi
    -PSM                             select PSM-capable network fabric. No fallback fabric is used
    -psm2                            select Intel* Omni-Path Fabric. Fallback list is dapl,tcp,ofa,ofi
    -PSM2                            select Intel* Omni-Path Fabric. No fallback fabric is used
    -ofi                             select OFI-capable network fabric. Fallback list is tmi,dapl,tcp,ofa
    -OFI                             select OFI-capable network fabric. No fallback fabric is used

  Checkpoint/Restart options:
    -ckpoint {on|off}                enable/disable checkpoints for this run
    -ckpoint-interval                checkpoint interval
    -ckpoint-prefix                  destination for checkpoint files (stable storage, typically a cluster-wide file system)
    -ckpoint-tmp-prefix              temporary/fast/local storage to speed up checkpoints
    -ckpoint-preserve                number of checkpoints to keep (default: 1, i.e. keep only last checkpoint)
    -ckpointlib                      checkpointing library (blcr)
    -ckpoint-logfile                 checkpoint activity/status log file (appended)
    -restart                         restart previously checkpointed application
    -ckpoint-num                     checkpoint number to restart

  Demux engine options:
    -demux                           demux engine (poll select)

  Debugger support options:
    -tv                              run processes under TotalView
    -tva {pid}                       attach existing mpiexec process to TotalView
    -gdb                             run processes under GDB
    -gdba {pid}                      attach existing mpiexec process to GDB
    -gdb-ia                          run processes under Intel IA specific GDB

  Other Hydra options:
    -v | -verbose                    verbose mode
    -V | -version                    show the version
    -info                            build information
    -print-rank-map                  print rank mapping
    -print-all-exitcodes             print exit codes of all processes
    -iface                           network interface to use
    -help                            show this message
    -perhost <n>                     place consecutive <n> processes on each host
    -ppn <n>                         stand for "process per node"; an alias to -perhost <n>
    -grr <n>                         stand for "group round robin"; an alias to -perhost <n>
    -rr                              involve "round robin" startup scheme
    -s <spec>                        redirect stdin to all or 1,2 or 2-4,6 MPI processes (0 by default)
    -ordered-output                  avoid data output intermingling
    -profile                         turn on internal profiling
    -l | -prepend-rank               prepend rank to output
    -prepend-pattern                 prepend pattern to output
    -outfile-pattern                 direct stdout to file
    -errfile-pattern                 direct stderr to file
    -localhost                       local hostname for the launching node
    -nolocal                         avoid running the application processes on the node where mpiexec.hydra started

Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.
****************************
When I run mpiexec.hydra -np 4 echo "hello", it seems to work:
[atk@cluster ~]$ mpiexec.hydra -np 4 echo "hello"
hello
hello
hello
hello

How can I deal with this problem? Thank you.
« Last Edit: September 15, 2023, 02:09 by rebacoo »

Offline filipr

  • QuantumATK Staff
  • Heavy QuantumATK user
  • *****
  • Posts: 83
  • Country: dk
  • Reputation: 6
  • QuantumATK developer
    • View Profile
Re: error in mpiexec.hydra
« Reply #3 on: September 15, 2023, 08:32 »
Remove the '-localonly' option from the command; it was only available in older versions of Intel MPI.

Offline rebacoo

  • Regular QuantumATK user
  • **
  • Posts: 11
  • Reputation: 0
    • View Profile
Re: error in mpiexec.hydra
« Reply #4 on: September 15, 2023, 10:59 »
Thank you, Professor filipr. The mpiexec.hydra is the one shipped with QuantumATK 2022. When I run mpiexec.hydra -np 8 echo "Hello", it works well, and when I run atkpython **.py > ***.log, it also works well.
But when I run mpiexec.hydra -np 4 atkpython A5AO2-opt.py, it does not work:
***********************************************************************
[atk@cluster ~]$ mpiexec.hydra -np 4 atkpython A5AO2-opt.py

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 108659 RUNNING AT cluster
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
[atk@cluster ~]$ mpiexec.hydra -np 8 echo "Hello"
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
************************************

Offline filipr

  • QuantumATK Staff
  • Heavy QuantumATK user
  • *****
  • Posts: 83
  • Country: dk
  • Reputation: 6
  • QuantumATK developer
    • View Profile
Re: error in mpiexec.hydra
« Reply #5 on: September 15, 2023, 11:15 »
Can you share with us your input script A5AO2-opt.py?

Offline rebacoo

  • Regular QuantumATK user
  • **
  • Posts: 11
  • Reputation: 0
    • View Profile
Re: error in mpiexec.hydra
« Reply #6 on: September 15, 2023, 13:33 »
A5AO2-opt.py is a test file, and when I run atkpython A5AO2-opt.py (serially), it works well.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5589
  • Country: dk
  • Reputation: 102
    • View Profile
    • QuantumATK at Synopsys
Re: error in mpiexec.hydra
« Reply #7 on: September 15, 2023, 19:27 »
The availability of -localonly might be specific to Windows vs. Linux. I use it all the time on Windows when I launch from the command line, to avoid having to set up SSH keys or an MPI service, or give my password to each SSH process. Perhaps there is a smarter way to do it, but there is no indication this keyword is deprecated on Windows, at least (https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-windows/2021-10/global-hydra-options.html).

But you use Linux, rebacoo, so indeed you can skip it.

Did you try the -v option (mpiexec.hydra -v -np 4 ...)? Also, you can try

mpiexec.hydra -n 4 -genv I_MPI_HYDRA_DEBUG=1 -genv I_MPI_DEBUG=5 atkpython script.py

to generate a LOT of debug info that might help.

Offline rebacoo

  • Regular QuantumATK user
  • **
  • Posts: 11
  • Reputation: 0
    • View Profile
Re: error in mpiexec.hydra
« Reply #8 on: September 16, 2023, 12:13 »
Thank you, Professor Anders, here are the results:
[atk@cluster ~]$ mpiexec.hydra -n 4 -genv I_MPI_HYDRA_DEBUG=1 -genv I_MPI_DEBUG=5 atkpython A5AO2-opt.py
host: cluster

==================================================================================================
mpiexec options:
----------------
  Base path: /home/atk/software/QuantumATK2022/libexec/
  Launcher: ssh
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    LD_LIBRARY_PATH=/home/atk/software/QuantumATK2022/lib
    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:
    SSH_CONNECTION=192.168.0.3 14188 192.168.0.202 22
    MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD
    LANG=en_US.UTF-8
    HISTCONTROL=ignoredups
    HOSTNAME=cluster
    S_COLORS=auto
    which_declare=declare -f
    XDG_SESSION_ID=19
    MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl
    USER=atk
    SELINUX_ROLE_REQUESTED=
    PWD=/home/atk
    SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
    HOME=/home/atk
    SSH_CLIENT=192.168.0.3 14188 22
    SELINUX_LEVEL_REQUESTED=
    XDG_DATA_DIRS=/home/atk/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
    LOADEDMODULES=
    SSH_TTY=/dev/pts/2
    MAIL=/var/spool/mail/atk
    TERM=xterm
    SHELL=/bin/bash
    SELINUX_USE_CURRENT_RANGE=
    SHLVL=1
    MANPATH=:
    GDK_BACKEND=x11
    MODULEPATH=/etc/scl/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles
    LOGNAME=atk
    DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-qciD5xJKGf,guid=1da0234fcdd78505b1ab234365039c04
    XDG_RUNTIME_DIR=/run/user/1000
    MODULEPATH_modshare=/usr/share/Modules/modulefiles:2:/etc/modulefiles:2:/usr/share/modulefiles:2
    PATH=/home/atk/software/QuantumATK2022/bin:/home/atk/software/QuantumATK2022/libexec:/home/atk/software/QuantumATK2022/bin:/home/atk/.local/bin:/home/atk/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
    DEBUGINFOD_URLS=https://debuginfod.centos.org/
    MODULESHOME=/usr/share/Modules
    HISTSIZE=1000
    LESSOPEN=||/usr/bin/lesspipe.sh %s
    BASH_FUNC_which%%=() {  ( alias;
 eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}
    BASH_FUNC_module%%=() {  _module_raw "$@" 2>&1
}
    BASH_FUNC__module_raw%%=() {  unset _mlshdbg;
 if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then
 case "$-" in
 *v*x*)
 set +vx;
 _mlshdbg='vx'
 ;;
 *v*)
 set +v;
 _mlshdbg='v'
 ;;
 *x*)
 set +x;
 _mlshdbg='x'
 ;;
 *)
 _mlshdbg=''
 ;;
 esac;
 fi;
 unset _mlre _mlIFS;
 if [ -n "${IFS+x}" ]; then
 _mlIFS=$IFS;
 fi;
 IFS=' ';
 for _mlv in ${MODULES_RUN_QUARANTINE:-};
 do
 if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then
 if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then
 _mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' ";
 fi;
 _mlrv="MODULES_RUNENV_${_mlv}";
 _mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' ";
 fi;
 done;
 if [ -n "${_mlre:-}" ]; then
 eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`;
 else
 eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`;
 fi;
 _mlstatus=$?;
 if [ -n "${_mlIFS+x}" ]; then
 IFS=$_mlIFS;
 else
 unset IFS;
 fi;
 unset _mlre _mlv _mlrv _mlIFS;
 if [ -n "${_mlshdbg:-}" ]; then
 set -$_mlshdbg;
 fi;
 unset _mlshdbg;
 return $_mlstatus
}
    BASH_FUNC_switchml%%=() {  typeset swfound=1;
 if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then
 typeset swname='main';
 if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then
 typeset swfound=0;
 unset MODULES_USE_COMPAT_VERSION;
 fi;
 else
 typeset swname='compatibility';
 if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then
 typeset swfound=0;
 MODULES_USE_COMPAT_VERSION=1;
 export MODULES_USE_COMPAT_VERSION;
 fi;
 fi;
 if [ $swfound -eq 0 ]; then
 echo "Switching to Modules $swname version";
 source /usr/share/Modules/init/bash;
 else
 echo "Cannot switch to Modules $swname version, command not found";
 return 1;
 fi
}
    BASH_FUNC_scl%%=() {  if [ "$1" = "load" -o "$1" = "unload" ]; then
 eval "module $@";
 else
 /usr/bin/scl "$@";
 fi
}
    BASH_FUNC_ml%%=() {  module ml "$@"
}
    _=/home/atk/software/QuantumATK2022/libexec/mpiexec.hydra

  Hydra internal environment:
  ---------------------------
    MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1
    GFORTRAN_UNBUFFERED_PRECONNECTED=y
    I_MPI_HYDRA_UUID=be640200-572b-9016-7705-060000cac0a8
    DAPL_NETWORK_PROCESS_NUM=4

  User set environment:
  ---------------------
    I_MPI_HYDRA_DEBUG=1
    I_MPI_DEBUG=5

  Intel(R) MPI Library specific variables:
  ----------------------------------------
    I_MPI_HYDRA_UUID=be640200-572b-9016-7705-060000cac0a8
    I_MPI_HYDRA_DEBUG=1
    I_MPI_DEBUG=5


    Proxy information:
    *********************
      [1] proxy: cluster (48 cores)
      Exec list: atkpython (4 processes);


==================================================================================================

[mpiexec@cluster] Timeout set to -1 (-1 means infinite)
[mpiexec@cluster] Got a control port string of cluster:45541

Proxy launch args: /home/atk/software/QuantumATK2022/libexec/pmi_proxy --control-port cluster:45541 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 2105394070 --usize -2 --proxy-id

Arguments being passed to proxy 0:
--version 3.2 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname cluster --global-core-map 0,48,48 --pmi-id-map 0,0 --global-process-count 4 --auto-cleanup 1 --pmi-kvsname kvs_156862_0 --pmi-process-mapping (vector,(0,1,48)) --topolib ipl --ckpointlib blcr --ckpoint-prefix /tmp --ckpoint-preserve 1 --ckpoint off --ckpoint-num -1 --global-inherited-env 45 'LD_LIBRARY_PATH=/home/atk/software/QuantumATK2022/lib' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:' 'SSH_CONNECTION=192.168.0.3 14188 192.168.0.202 22' 'MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD' 'LANG=en_US.UTF-8' 'HISTCONTROL=ignoredups' 'HOSTNAME=cluster' 'S_COLORS=auto' 'which_declare=declare -f' 'XDG_SESSION_ID=19' 'MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl' 'USER=atk' 'SELINUX_ROLE_REQUESTED=' 'PWD=/home/atk' 'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'HOME=/home/atk' 'SSH_CLIENT=192.168.0.3 14188 22' 'SELINUX_LEVEL_REQUESTED=' 'XDG_DATA_DIRS=/home/atk/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share' 'LOADEDMODULES=' 'SSH_TTY=/dev/pts/2' 'MAIL=/var/spool/mail/atk' 'TERM=xterm' 'SHELL=/bin/bash' 'SELINUX_USE_CURRENT_RANGE=' 'SHLVL=1' 'MANPATH=:' 'GDK_BACKEND=x11' 'MODULEPATH=/etc/scl/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles' 'LOGNAME=atk' 'DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-qciD5xJKGf,guid=1da0234fcdd78505b1ab234365039c04' 'XDG_RUNTIME_DIR=/run/user/1000' 'MODULEPATH_modshare=/usr/share/Modules/modulefiles:2:/etc/modulefiles:2:/usr/share/modulefiles:2' 'PATH=/home/atk/software/QuantumATK2022/bin:/home/atk/software/QuantumATK2022/libexec:/home/atk/software/QuantumATK2022/bin:/home/atk/.local/bin:/home/atk/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin' 'DEBUGINFOD_URLS=https://debuginfod.centos.org/ ' 'MODULESHOME=/usr/share/Modules' 'HISTSIZE=1000' 'LESSOPEN=||/usr/bin/lesspipe.sh %s' 'BASH_FUNC_which%%=() {  ( alias;
 eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}' 'BASH_FUNC_module%%=() {  _module_raw "$@" 2>&1
}' 'BASH_FUNC__module_raw%%=() {  unset _mlshdbg;
 if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then
 case "$-" in
 *v*x*)
 set +vx;
 _mlshdbg='vx'
 ;;
 *v*)
 set +v;
 _mlshdbg='v'
 ;;
 *x*)
 set +x;
 _mlshdbg='x'
 ;;
 *)
 _mlshdbg=''
 ;;
 esac;
 fi;
 unset _mlre _mlIFS;
 if [ -n "${IFS+x}" ]; then
 _mlIFS=$IFS;
 fi;
 IFS=' ';
 for _mlv in ${MODULES_RUN_QUARANTINE:-};
 do
 if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then
 if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then
 _mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' ";
 fi;
 _mlrv="MODULES_RUNENV_${_mlv}";
 _mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' ";
 fi;
 done;
 if [ -n "${_mlre:-}" ]; then
 eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`;
 else
 eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`;
 fi;
 _mlstatus=$?;
 if [ -n "${_mlIFS+x}" ]; then
 IFS=$_mlIFS;
 else
 unset IFS;
 fi;
 unset _mlre _mlv _mlrv _mlIFS;
 if [ -n "${_mlshdbg:-}" ]; then
 set -$_mlshdbg;
 fi;
 unset _mlshdbg;
 return $_mlstatus
}' 'BASH_FUNC_switchml%%=() {  typeset swfound=1;
 if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then
 typeset swname='main';
 if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then
 typeset swfound=0;
 unset MODULES_USE_COMPAT_VERSION;
 fi;
 else
 typeset swname='compatibility';
 if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then
 typeset swfound=0;
 MODULES_USE_COMPAT_VERSION=1;
 export MODULES_USE_COMPAT_VERSION;
 fi;
 fi;
 if [ $swfound -eq 0 ]; then
 echo "Switching to Modules $swname version";
 source /usr/share/Modules/init/bash;
 else
 echo "Cannot switch to Modules $swname version, command not found";
 return 1;
 fi
}' 'BASH_FUNC_scl%%=() {  if [ "$1" = "load" -o "$1" = "unload" ]; then
 eval "module $@";
 else
 /usr/bin/scl "$@";
 fi
}' 'BASH_FUNC_ml%%=() {  module ml "$@"
}' '_=/home/atk/software/QuantumATK2022/libexec/mpiexec.hydra' --global-user-env 2 'I_MPI_HYDRA_DEBUG=1' 'I_MPI_DEBUG=5' --global-system-env 4 'MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1' 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' 'I_MPI_HYDRA_UUID=be640200-572b-9016-7705-060000cac0a8' 'DAPL_NETWORK_PROCESS_NUM=4' --proxy-core-count 48 --mpi-cmd-env mpiexec.hydra -n 4 -genv I_MPI_HYDRA_DEBUG=1 -genv I_MPI_DEBUG=5 atkpython A5AO2-opt.py  --exec --exec-appnum 0 --exec-proc-count 4 --exec-local-env 0 --exec-wdir /home/atk --exec-args 2 atkpython A5AO2-opt.py

[mpiexec@cluster] Launch arguments: /home/atk/software/QuantumATK2022/libexec/pmi_proxy --control-port cluster:45541 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 2105394070 --usize -2 --proxy-id 0
[proxy:0:0@cluster] Start PMI_proxy 0
[proxy:0:0@cluster] STDIN will be redirected to 1 fd(s): 17
[proxy:0:0@cluster] got pmi command (from 16): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 12): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 14): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 21): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@cluster] got pmi command (from 12): get_maxes

[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 14): get_maxes

[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 16): get_maxes

[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 12): barrier_in

[proxy:0:0@cluster] got pmi command (from 14): barrier_in

[proxy:0:0@cluster] got pmi command (from 21): get_maxes

[proxy:0:0@cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@cluster] got pmi command (from 16): barrier_in

[proxy:0:0@cluster] got pmi command (from 21): barrier_in

[proxy:0:0@cluster] forwarding command (cmd=barrier_in) upstream
[mpiexec@cluster] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec@cluster] PMI response to fd 8 pid 21: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] PMI response: cmd=barrier_out
[proxy:0:0@cluster] got pmi command (from 12): get_ranks2hosts

[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 14): get_ranks2hosts

[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 16): get_ranks2hosts

[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 21): get_ranks2hosts

[proxy:0:0@cluster] PMI response: put_ranks2hosts 21 1
7 cluster 0,1,2,3,
[proxy:0:0@cluster] got pmi command (from 12): get_appnum

[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 14): get_appnum

[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 16): get_appnum

[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 12): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 14): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 21): get_appnum

[proxy:0:0@cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@cluster] got pmi command (from 12): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 14): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 16): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 21): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
  • MPI startup(): Multi-threaded optimized library
[proxy:0:0@cluster] got pmi command (from 16): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0
[proxy:0:0@cluster] got pmi command (from 21): get_my_kvsname

[proxy:0:0@cluster] PMI response: cmd=my_kvsname kvsname=kvs_156862_0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 156869 RUNNING AT cluster
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
Please help me. Thank you.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5589
  • Country: dk
  • Reputation: 102
    • View Profile
    • QuantumATK at Synopsys
Re: error in mpiexec.hydra
« Reply #9 on: September 18, 2023, 21:13 »
More info, but no clue yet. Now please also add the -v option to mpiexec.hydra, and post the results as an attached text file instead, as the output will be very long.
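
For example, the full command could be run like this (a sketch; mpi_debug.log is just an example file name for the attachment):

# combine -v with the debug environment variables and capture everything to a file
mpiexec.hydra -v -n 4 -genv I_MPI_HYDRA_DEBUG=1 -genv I_MPI_DEBUG=5 atkpython A5AO2-opt.py > mpi_debug.log 2>&1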

Offline rebacoo

  • Regular QuantumATK user
  • **
  • Posts: 11
  • Reputation: 0
    • View Profile
Re: error in mpiexec.hydra
« Reply #10 on: September 19, 2023, 02:00 »
Thank you, Professor Anders, here is the attachment.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5589
  • Country: dk
  • Reputation: 102
    • View Profile
    • QuantumATK at Synopsys
Re: error in mpiexec.hydra
« Reply #11 on: September 19, 2023, 07:12 »
What actual version of QuantumATK is this? There have been some issues with Intel MPI in the past. With 2022.12 things improved, but ideally you should run 2023.09. Also, is there some other MPI library in your path, or generally recommended/used on this cluster?
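
For instance, a quick way to check which MPI launcher the shell actually picks up, and whether anything else MPI-related is set in the environment (a simple sketch):

# list every MPI launcher found on the PATH, in order of precedence
which -a mpiexec mpiexec.hydra mpirun

# show the PATH itself and any MPI-related environment variables
echo $PATH
env | grep -i mpi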

Offline rebacoo

  • Regular QuantumATK user
  • **
  • Posts: 11
  • Reputation: 0
    • View Profile
Re: error in mpiexec.hydra
« Reply #12 on: September 19, 2023, 14:21 »
Thank you, Professor Anders. The version of QuantumATK is 2022.03; it works well on CentOS 7.9.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5589
  • Country: dk
  • Reputation: 102
    • View Profile
    • QuantumATK at Synopsys
Re: error in mpiexec.hydra
« Reply #13 on: September 19, 2023, 19:44 »
It's a bit "old", but more importantly we have had issues with Intel MPI for a while. I have also seen reports online where people solved issues like this by upgrading their Intel MPI version, which in your case would be easiest done by moving to 2022.12 or even 2023.09, that way you also get access to all the latest features! There is not much we can do to troubleshoot an old version, esp. if this is a bug in Intel MPI as it very well might be.