Author Topic: Error when running QuantumATK on multiple nodes  (Read 4665 times)


Offline chem-william

  • Regular QuantumATK user
  • Posts: 5
  • Country: dk
  • Reputation: 0
Error when running QuantumATK on multiple nodes
« on: August 12, 2022, 08:20 »
Hi everyone! At our university, we have access to two different partitions: A and B. We use SLURM as the workload manager. If we submit a multi-node job to A, everything runs fine. If we submit the same job to B, we get the following error:
Code
Fri Aug 12 08:15:30 CEST 2022
node642.cluster:UCM:a1d1:b570b740: 19751 us(19751 us):  create_ah: ERR Invalid argument
node642.cluster:UCM:a1d1:b570b740: 19760 us(9 us): UCM connect: snd ERR -> cm_lid 0 cm_qpn 10e r_psp 8104 p_sz=24
srun: Job step aborted: Waiting up to 602 seconds for job step to finish.
srun: error: node643: task 3: Killed
srun: launch/slurm: _step_signal: Terminating StepId=35302717.0
node642.cluster:UCM:a1d2:2000c740: 20600 us(20600 us):  create_ah: ERR Invalid argument
node642.cluster:UCM:a1d2:2000c740: 20613 us(13 us): UCM connect: snd ERR -> cm_lid 0 cm_qpn 10e r_psp 8104 p_sz=24
node642.cluster:UCM:a1d3:8fc04740: 22407 us(22407 us):  create_ah: ERR Invalid argument
node642.cluster:UCM:a1d3:8fc04740: 22418 us(11 us): UCM connect: snd ERR -> cm_lid 0 cm_qpn 10e r_psp 8104 p_sz=24
[0:node642][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(147)...................: fail failed
dapl_rc_setup_all_connections_20(1434): generic failure with errno = 16
(unknown)(): Internal MPI error!
[1:node642][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(147)...................: fail failed
dapl_rc_setup_all_connections_20(1434): generic failure with errno = 16
(unknown)(): Internal MPI error!
[2:node642][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(147)...................: fail failed
dapl_rc_setup_all_connections_20(1434): generic failure with errno = 16
(unknown)(): Internal MPI error!
slurmstepd: error: *** STEP 35302717.0 ON node642 CANCELLED AT 2022-08-12T08:15:31 ***
srun: error: node642: tasks 0-2: Killed
We have the following minimal reproducing script:
Code
from __future__ import print_function
import socket
if processIsMaster():  # processIsMaster() is provided by the atkpython environment
    print("Master node:", end=' ')
else:
    print("Slave node:", end=' ')
print(socket.gethostname())
It is submitted using the following job script:
Code
#!/bin/bash

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --nodes=2
#SBATCH --time=0:10:00
#SBATCH --partition=B

date

module load kemi
module load ATK

srun -n4 --mpi=pmi2 atkpython test_mpi.py
As far as I've been told, neither partition has InfiniBand, only RoCE (RDMA over Converged Ethernet).
« Last Edit: August 12, 2022, 09:29 by chem-william »

Offline filipr

  • QuantumATK Staff
  • Heavy QuantumATK user
  • Posts: 81
  • Country: dk
  • Reputation: 6
  • QuantumATK developer
Re: Error when running QuantumATK on multiple nodes
« Reply #1 on: August 12, 2022, 10:39 »
I don't think this is a problem in QuantumATK itself, but rather in the MPI configuration on the cluster. QuantumATK 2022 (and many earlier versions) has been built, linked and tested against Intel MPI 2018.1 - but it should work with any Intel MPI version that is ABI-compatible with it. As a first check, you could make sure that SLURM uses Intel MPI 2018; typically that is done by loading the corresponding module, e.g.:
Code
module load intel-mpi/2018.1
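For example, in the submission script from your post this just means loading that module before the srun line. A sketch, with the module name as a cluster-specific placeholder:
Code
module load kemi
module load ATK
module load intel-mpi/2018.1   # the exact module name depends on your cluster

srun -n4 --mpi=pmi2 atkpython test_mpi.py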
You will need to ensure that SLURM actually picks up the Intel MPI provided by that module. If this doesn't work, you can try the Intel MPI that we ship with QuantumATK, i.e. instead of srun use:
Code
export PATH=/path/to/quantumatk/libexec:$PATH   # if it isn't already in PATH
mpiexec.hydra -n 4 atkpython test_mpi.py
You can also try a newer version of Intel MPI, preferably newer than 2019, as that particular version had some serious issues. If neither of these approaches works, there is likely some incompatibility between the Intel MPI configuration and the cluster's network hardware. We can't really help you much there, as it's out of our hands. Instead, I suggest you run your script again with
Code
export I_MPI_DEBUG=5
This will print a lot of MPI debugging information. Copy that output and send it, together with your current question, to your cluster administrators. If they can't solve the problem, I suggest you ask for help on the Intel MPI support forum.
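In the job script, the debug variable just needs to be exported before the srun line; a minimal sketch:
Code
export I_MPI_DEBUG=5                        # verbose Intel MPI startup/fabric information
srun -n4 --mpi=pmi2 atkpython test_mpi.py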

Offline chem-william

  • Regular QuantumATK user
  • Posts: 5
  • Country: dk
  • Reputation: 0
Re: Error when running QuantumATK on multiple nodes
« Reply #2 on: August 12, 2022, 22:42 »
Thanks for the very thorough reply! On our system, I've got a module available named "intelmpi/18.1.163". With that module loaded, 'mpirun --version' gives the following output:
Code
Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.
Running the same script, but with this module loaded, I get the following output:
Code
Fri Aug 12 22:20:19 CEST 2022
node643.cluster:SCM:872d:a39ed740: 2009 us(2009 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 2412 us(2412 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 2455 us(2455 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99c:b32c4740: 2560 us(2560 us):  open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 2274 us(265 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 2786 us(374 us):  open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 2475 us(201 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 2883 us(428 us):  open_hca: device mlx4_0 not found
node643.cluster:CMA:872d:a39ed740: 37 us(37 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node642.cluster:SCM:a99c:b32c4740: 3267 us(707 us):  open_hca: device mlx4_0 not found
node643.cluster:CMA:872d:a39ed740: 83 us(46 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node642.cluster:SCM:a99e:df300740: 3312 us(526 us):  open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 3289 us(814 us):  open_hca: device mthca0 not found
node642.cluster:SCM:a99d:c5027740: 3371 us(488 us):  open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 3487 us(198 us):  open_hca: device mthca0 not found
node642.cluster:SCM:a99c:b32c4740: 3536 us(269 us):  open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 3688 us(201 us):  open_hca: device ipath0 not found
node642.cluster:CMA:a99e:df300740: 35 us(35 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node643.cluster:SCM:872d:a39ed740: 3883 us(195 us):  open_hca: device ipath0 not found
node642.cluster:CMA:a99e:df300740: 73 us(38 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node643.cluster:SCM:872d:a39ed740: 4078 us(195 us):  open_hca: device ehca0 not found
node642.cluster:CMA:a99d:c5027740: 32 us(32 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node643.cluster:CMA:872d:a39ed740: 1130 us(1047 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99d:c5027740: 66 us(34 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node643.cluster:UCM:872d:a39ed740: 200 us(200 us):  open_hca: device mlx4_0 not found
node642.cluster:CMA:a99c:b32c4740: 32 us(32 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node643.cluster:UCM:872d:a39ed740: 412 us(212 us):  open_hca: device mlx4_0 not found
node642.cluster:CMA:a99c:b32c4740: 79 us(47 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node643.cluster:UCM:872d:a39ed740: 611 us(199 us):  open_hca: device mthca0 not found
node642.cluster:SCM:a99e:df300740: 4171 us(859 us):  open_hca: device mthca0 not found
node643.cluster:UCM:872d:a39ed740: 808 us(197 us):  open_hca: device mthca0 not found
node642.cluster:SCM:a99d:c5027740: 4278 us(907 us):  open_hca: device mthca0 not found
node643.cluster:CMA:872d:a39ed740: 2226 us(1096 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:SCM:a99e:df300740: 4732 us(561 us):  open_hca: device mthca0 not found
node643.cluster:CMA:872d:a39ed740: 2255 us(29 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:SCM:a99c:b32c4740: 4807 us(1271 us):  open_hca: device mthca0 not found
node643.cluster:SCM:872d:a39ed740: 5431 us(1353 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 4809 us(531 us):  open_hca: device mthca0 not found
node643.cluster:SCM:872d:a39ed740: 5632 us(201 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 5241 us(509 us):  open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6041 us(409 us):  open_hca: device scif0 not found
node642.cluster:SCM:a99d:c5027740: 5356 us(547 us):  open_hca: device ipath0 not found
node643.cluster:UCM:872d:a39ed740: 1878 us(1070 us):  open_hca: device scif0 not found
node642.cluster:SCM:a99c:b32c4740: 5444 us(637 us):  open_hca: device mthca0 not found
node643.cluster:CMA:872d:a39ed740: 3296 us(1041 us):  open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:SCM:a99e:df300740: 5713 us(472 us):  open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6471 us(430 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 5857 us(501 us):  open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6676 us(205 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99c:b32c4740: 6035 us(591 us):  open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6872 us(196 us):  open_hca: device mlx4_1 not found
node642.cluster:SCM:a99e:df300740: 6208 us(495 us):  open_hca: device ehca0 not found
node643.cluster:SCM:872d:a39ed740: 7065 us(193 us):  open_hca: device mlx4_1 not found
node642.cluster:CMA:a99e:df300740: 2430 us(2357 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node643.cluster:UCM:872d:a39ed740: 2896 us(1018 us):  open_hca: device mlx4_1 not found
node642.cluster:SCM:a99d:c5027740: 6308 us(451 us):  open_hca: device ehca0 not found
node643.cluster:UCM:872d:a39ed740: 3090 us(194 us):  open_hca: device mlx4_1 not found
node642.cluster:CMA:a99d:c5027740: 2478 us(2412 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:SCM:a99c:b32c4740: 6419 us(384 us):  open_hca: device ipath0 not found
node642.cluster:SCM:a99c:b32c4740: 6799 us(380 us):  open_hca: device ehca0 not found
node642.cluster:CMA:a99c:b32c4740: 2764 us(2685 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:UCM:a99e:df300740: 434 us(434 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99d:c5027740: 391 us(391 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 760 us(326 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99d:c5027740: 750 us(359 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99c:b32c4740: 612 us(612 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 1284 us(524 us):  open_hca: device mthca0 not found
node642.cluster:UCM:a99d:c5027740: 1229 us(479 us):  open_hca: device mthca0 not found
node642.cluster:UCM:a99e:df300740: 1865 us(581 us):  open_hca: device mthca0 not found
node642.cluster:UCM:a99d:c5027740: 1794 us(565 us):  open_hca: device mthca0 not found
node642.cluster:UCM:a99c:b32c4740: 1250 us(638 us):  open_hca: device mlx4_0 not found
node642.cluster:CMA:a99e:df300740: 4566 us(2136 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99e:df300740: 4586 us(20 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:CMA:a99d:c5027740: 4539 us(2061 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99d:c5027740: 4561 us(22 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:SCM:a99e:df300740: 8921 us(2713 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 8957 us(2649 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99c:b32c4740: 1872 us(622 us):  open_hca: device mthca0 not found
node642.cluster:SCM:a99e:df300740: 9402 us(481 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 9455 us(498 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99c:b32c4740: 2376 us(504 us):  open_hca: device mthca0 not found
node642.cluster:CMA:a99c:b32c4740: 5470 us(2706 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99c:b32c4740: 5499 us(29 us):  open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:SCM:a99c:b32c4740: 9990 us(3191 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 10119 us(717 us):  open_hca: device scif0 not found
node642.cluster:SCM:a99d:c5027740: 10235 us(780 us):  open_hca: device scif0 not found
node642.cluster:SCM:a99c:b32c4740: 10600 us(610 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 4164 us(2299 us):  open_hca: device scif0 not found
node642.cluster:CMA:a99e:df300740: 6862 us(2276 us):  open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:UCM:a99d:c5027740: 4108 us(2314 us):  open_hca: device scif0 not found
node642.cluster:CMA:a99d:c5027740: 6851 us(2290 us):  open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:SCM:a99e:df300740: 10972 us(853 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 11070 us(835 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99c:b32c4740: 11463 us(863 us):  open_hca: device scif0 not found
node642.cluster:SCM:a99e:df300740: 11479 us(507 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 11566 us(496 us):  open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 12008 us(529 us):  open_hca: device mlx4_1 not found
node642.cluster:UCM:a99c:b32c4740: 4952 us(2576 us):  open_hca: device scif0 not found
node642.cluster:SCM:a99d:c5027740: 12111 us(545 us):  open_hca: device mlx4_1 not found
node642.cluster:CMA:a99c:b32c4740: 8047 us(2548 us):  open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:SCM:a99e:df300740: 12482 us(474 us):  open_hca: device mlx4_1 not found
node642.cluster:SCM:a99d:c5027740: 12606 us(495 us):  open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 12711 us(1248 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 6485 us(2321 us):  open_hca: device mlx4_1 not found
node642.cluster:UCM:a99d:c5027740: 6557 us(2449 us):  open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 13304 us(593 us):  open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 6967 us(482 us):  open_hca: device mlx4_1 not found
node642.cluster:UCM:a99d:c5027740: 6978 us(421 us):  open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 13658 us(354 us):  open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 13859 us(201 us):  open_hca: device mlx4_1 not found
node642.cluster:UCM:a99c:b32c4740: 7187 us(2235 us):  open_hca: device mlx4_1 not found
node642.cluster:UCM:a99c:b32c4740: 7434 us(247 us):  open_hca: device mlx4_1 not found
+------------------------------------------------------------------------------+
|                                                                              |
|                                  QuantumATK®                                 |
|                                                                              |
|          Version: T-2022.03 for Windows and Linux [Build 17f5b1eb610]        |
|                                                                              |
|                      Copyright © 2004-2022 Synopsys, Inc.                    |
|                                                                              |
|       This software and all associated documentation are proprietary to      |
|         Synopsys, Inc. This software may only be used pursuant to the        |
|       terms and conditions of a written license agreement with Synopsys,     |
|       Inc. All other use, reproduction, modification, or distribution of     |
|                     this software is strictly prohibited.                    |
|                                                                              |
+------------------------------------------------------------------------------+
Slave node: node642.cluster
Slave node: node643.cluster
Master node: node642.cluster
Slave node: node642.cluster

Timing:                          Total     Per Step        %

--------------------------------------------------------------------------------

Loading Modules + MPI   :       5.80 s       5.80 s      99.95% |=============|
--------------------------------------------------------------------------------
Total                   :       5.80 s
The QuantumATK output matches the output on the working partition A. I've got no clue about the first part with all the error messages. Can they be ignored, or are they important? If I choose either of the newer modules available ('intelmpi/2020.1.217' or 'intelmpi/2020.4.304'), I'm back to the original error. If I use the Intel MPI shipped with QuantumATK, I get the following error:
Code
Fri Aug 12 22:38:24 CEST 2022
atkpython: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
atkpython: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
atkpython: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
[mpiexec@node642.cluster] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@node642.cluster] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
and then the job never progresses any further, nor does it get shut down. In summary, it seems that using intelmpi/18.1.163 works... ish? At least, if the error/warning messages can be ignored?

Offline filipr

  • QuantumATK Staff
  • Heavy QuantumATK user
  • Posts: 81
  • Country: dk
  • Reputation: 6
  • QuantumATK developer
Re: Error when running QuantumATK on multiple nodes
« Reply #3 on: August 15, 2022, 09:48 »
I'll address the last problem (with the shipped MPI) first: there are actually two atkpython "executables": one in the 'bin' folder, which is really a bootstrap bash script (it sets LD_LIBRARY_PATH and then runs the real binary), and one in the 'libexec' folder, which is the actual executable. When I suggested adding the libexec folder to PATH, that made the raw executable run instead of the bash script, which is why it could not find libpython3.8.so.1.0. So my mistake! Instead of modifying PATH, specify the full path to the MPI launcher:
Code
/path/to/quantumatk/libexec/mpiexec.hydra -n 4 atkpython test_mpi.py
However, I doubt that this will work, as it also fails with the other MPI versions. From the other error messages it appears that the cluster is configured for InfiniBand, but that MPI for some reason can't find the device. Judging from similar problems (e.g. https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/open-hca-device-mlx4-0-not-found/td-p/982635), you may be able to solve this by setting the I_MPI_DAPL_PROVIDER environment variable - but to what exactly, I don't know; that depends on your cluster. Again, I think you will get much better help if you contact your cluster admin/support, as they are the ones responsible for configuring the hardware and the MPI installation. You can also get help on Intel's HPC support forum: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/bd-p/oneapi-hpc-toolkit We can't really provide much more help and support on this issue - I think you will have the same problem running any other MPI executable on the cluster, not just QuantumATK. In fact, Intel ships some MPI benchmark programs that you can use to test this: https://www.intel.com/content/www/us/en/develop/documentation/imb-user-guide/top.html
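As a rough, untested sketch of the kind of check the admins might start from - the DAPL provider name below is only an example copied from the error log, not a known-good value, and the dat.conf location can vary between systems:
Code
# The first field of each line in dat.conf lists the DAPL provider names that
# Intel MPI can be pointed at (often /etc/dat.conf or /etc/rdma/dat.conf).
cat /etc/dat.conf

# Example only: choose a provider that matches the actual RoCE device on the nodes.
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u
export I_MPI_DEBUG=5

# Intel MPI ships the IMB benchmarks; a two-node PingPong is a quick way to
# test MPI on its own, independent of QuantumATK.
srun -n2 --mpi=pmi2 IMB-MPI1 PingPong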

Offline chem-william

  • Regular QuantumATK user
  • Posts: 5
  • Country: dk
  • Reputation: 0
Re: Error when running QuantumATK on multiple nodes
« Reply #4 on: August 15, 2022, 15:59 »
As you suspected, using the shipped MPI doesn't really work. While it technically "runs", it never does anything: after 10+ minutes it hasn't printed anything.

The MPI errors and changing the MPI environment are a bit beyond my expertise.

I appreciate all the help you've given me. I'll circle back to our sysadmins and see if they can help with this extra information.

Offline AsifShah

  • QuantumATK Guru
  • Posts: 173
  • Country: in
  • Reputation: 2
Re: Error when running QuantumATK on multiple nodes
« Reply #5 on: August 16, 2022, 07:01 »
Sometimes the shell picks up atkpython from libexec when it should be the one from bin. This can be one reason why nothing is printed even after 10 minutes.

So try this:

/pathto/QuantumATK/libexec/mpiexec.hydra -n 10 /pathto/QuantumATK/bin/atkpython input.py
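A quick way to check which of the two you are actually getting (paths are placeholders for the real installation directory): the bin one should be a shell script, the libexec one the raw binary.
Code
file /pathto/QuantumATK/bin/atkpython        # expected: a shell script (the bootstrap wrapper)
file /pathto/QuantumATK/libexec/atkpython    # expected: an ELF executable (the real interpreter)
head -n 1 /pathto/QuantumATK/bin/atkpython   # the wrapper should start with a "#!" shebang line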

Offline chem-william

  • Regular QuantumATK user
  • Posts: 5
  • Country: dk
  • Reputation: 0
Re: Error when running QuantumATK on multiple nodes
« Reply #6 on: August 16, 2022, 08:34 »
Unfortunately, it still hangs even when I specify the exact path to both mpiexec.hydra and atkpython :( If I run it on a single node, it's able to make progress.