Thanks for the very thorough reply!
On our system, there is a module available called "intelmpi/18.1.163". With that one loaded, 'mpirun --version' gives the following output:
Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.
Running the same script, but with this module loaded, I get the following output:
Fri Aug 12 22:20:19 CEST 2022
node643.cluster:SCM:872d:a39ed740: 2009 us(2009 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 2412 us(2412 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 2455 us(2455 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99c:b32c4740: 2560 us(2560 us): open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 2274 us(265 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 2786 us(374 us): open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 2475 us(201 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 2883 us(428 us): open_hca: device mlx4_0 not found
node643.cluster:CMA:872d:a39ed740: 37 us(37 us): open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node642.cluster:SCM:a99c:b32c4740: 3267 us(707 us): open_hca: device mlx4_0 not found
node643.cluster:CMA:872d:a39ed740: 83 us(46 us): open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node642.cluster:SCM:a99e:df300740: 3312 us(526 us): open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 3289 us(814 us): open_hca: device mthca0 not found
node642.cluster:SCM:a99d:c5027740: 3371 us(488 us): open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 3487 us(198 us): open_hca: device mthca0 not found
node642.cluster:SCM:a99c:b32c4740: 3536 us(269 us): open_hca: device mlx4_0 not found
node643.cluster:SCM:872d:a39ed740: 3688 us(201 us): open_hca: device ipath0 not found
node642.cluster:CMA:a99e:df300740: 35 us(35 us): open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node643.cluster:SCM:872d:a39ed740: 3883 us(195 us): open_hca: device ipath0 not found
node642.cluster:CMA:a99e:df300740: 73 us(38 us): open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node643.cluster:SCM:872d:a39ed740: 4078 us(195 us): open_hca: device ehca0 not found
node642.cluster:CMA:a99d:c5027740: 32 us(32 us): open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node643.cluster:CMA:872d:a39ed740: 1130 us(1047 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99d:c5027740: 66 us(34 us): open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node643.cluster:UCM:872d:a39ed740: 200 us(200 us): open_hca: device mlx4_0 not found
node642.cluster:CMA:a99c:b32c4740: 32 us(32 us): open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
node643.cluster:UCM:872d:a39ed740: 412 us(212 us): open_hca: device mlx4_0 not found
node642.cluster:CMA:a99c:b32c4740: 79 us(47 us): open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node643.cluster:UCM:872d:a39ed740: 611 us(199 us): open_hca: device mthca0 not found
node642.cluster:SCM:a99e:df300740: 4171 us(859 us): open_hca: device mthca0 not found
node643.cluster:UCM:872d:a39ed740: 808 us(197 us): open_hca: device mthca0 not found
node642.cluster:SCM:a99d:c5027740: 4278 us(907 us): open_hca: device mthca0 not found
node643.cluster:CMA:872d:a39ed740: 2226 us(1096 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:SCM:a99e:df300740: 4732 us(561 us): open_hca: device mthca0 not found
node643.cluster:CMA:872d:a39ed740: 2255 us(29 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:SCM:a99c:b32c4740: 4807 us(1271 us): open_hca: device mthca0 not found
node643.cluster:SCM:872d:a39ed740: 5431 us(1353 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 4809 us(531 us): open_hca: device mthca0 not found
node643.cluster:SCM:872d:a39ed740: 5632 us(201 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 5241 us(509 us): open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6041 us(409 us): open_hca: device scif0 not found
node642.cluster:SCM:a99d:c5027740: 5356 us(547 us): open_hca: device ipath0 not found
node643.cluster:UCM:872d:a39ed740: 1878 us(1070 us): open_hca: device scif0 not found
node642.cluster:SCM:a99c:b32c4740: 5444 us(637 us): open_hca: device mthca0 not found
node643.cluster:CMA:872d:a39ed740: 3296 us(1041 us): open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:SCM:a99e:df300740: 5713 us(472 us): open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6471 us(430 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 5857 us(501 us): open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6676 us(205 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99c:b32c4740: 6035 us(591 us): open_hca: device ipath0 not found
node643.cluster:SCM:872d:a39ed740: 6872 us(196 us): open_hca: device mlx4_1 not found
node642.cluster:SCM:a99e:df300740: 6208 us(495 us): open_hca: device ehca0 not found
node643.cluster:SCM:872d:a39ed740: 7065 us(193 us): open_hca: device mlx4_1 not found
node642.cluster:CMA:a99e:df300740: 2430 us(2357 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node643.cluster:UCM:872d:a39ed740: 2896 us(1018 us): open_hca: device mlx4_1 not found
node642.cluster:SCM:a99d:c5027740: 6308 us(451 us): open_hca: device ehca0 not found
node643.cluster:UCM:872d:a39ed740: 3090 us(194 us): open_hca: device mlx4_1 not found
node642.cluster:CMA:a99d:c5027740: 2478 us(2412 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:SCM:a99c:b32c4740: 6419 us(384 us): open_hca: device ipath0 not found
node642.cluster:SCM:a99c:b32c4740: 6799 us(380 us): open_hca: device ehca0 not found
node642.cluster:CMA:a99c:b32c4740: 2764 us(2685 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:UCM:a99e:df300740: 434 us(434 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99d:c5027740: 391 us(391 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 760 us(326 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99d:c5027740: 750 us(359 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99c:b32c4740: 612 us(612 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 1284 us(524 us): open_hca: device mthca0 not found
node642.cluster:UCM:a99d:c5027740: 1229 us(479 us): open_hca: device mthca0 not found
node642.cluster:UCM:a99e:df300740: 1865 us(581 us): open_hca: device mthca0 not found
node642.cluster:UCM:a99d:c5027740: 1794 us(565 us): open_hca: device mthca0 not found
node642.cluster:UCM:a99c:b32c4740: 1250 us(638 us): open_hca: device mlx4_0 not found
node642.cluster:CMA:a99e:df300740: 4566 us(2136 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99e:df300740: 4586 us(20 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:CMA:a99d:c5027740: 4539 us(2061 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99d:c5027740: 4561 us(22 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:SCM:a99e:df300740: 8921 us(2713 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 8957 us(2649 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99c:b32c4740: 1872 us(622 us): open_hca: device mthca0 not found
node642.cluster:SCM:a99e:df300740: 9402 us(481 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 9455 us(498 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99c:b32c4740: 2376 us(504 us): open_hca: device mthca0 not found
node642.cluster:CMA:a99c:b32c4740: 5470 us(2706 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
node642.cluster:CMA:a99c:b32c4740: 5499 us(29 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
node642.cluster:SCM:a99c:b32c4740: 9990 us(3191 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 10119 us(717 us): open_hca: device scif0 not found
node642.cluster:SCM:a99d:c5027740: 10235 us(780 us): open_hca: device scif0 not found
node642.cluster:SCM:a99c:b32c4740: 10600 us(610 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 4164 us(2299 us): open_hca: device scif0 not found
node642.cluster:CMA:a99e:df300740: 6862 us(2276 us): open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:UCM:a99d:c5027740: 4108 us(2314 us): open_hca: device scif0 not found
node642.cluster:CMA:a99d:c5027740: 6851 us(2290 us): open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:SCM:a99e:df300740: 10972 us(853 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 11070 us(835 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99c:b32c4740: 11463 us(863 us): open_hca: device scif0 not found
node642.cluster:SCM:a99e:df300740: 11479 us(507 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99d:c5027740: 11566 us(496 us): open_hca: device mlx4_0 not found
node642.cluster:SCM:a99e:df300740: 12008 us(529 us): open_hca: device mlx4_1 not found
node642.cluster:UCM:a99c:b32c4740: 4952 us(2576 us): open_hca: device scif0 not found
node642.cluster:SCM:a99d:c5027740: 12111 us(545 us): open_hca: device mlx4_1 not found
node642.cluster:CMA:a99c:b32c4740: 8047 us(2548 us): open_hca: getaddr_netdev ERROR:No such device. Is mic0:ib configured?
node642.cluster:SCM:a99e:df300740: 12482 us(474 us): open_hca: device mlx4_1 not found
node642.cluster:SCM:a99d:c5027740: 12606 us(495 us): open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 12711 us(1248 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 6485 us(2321 us): open_hca: device mlx4_1 not found
node642.cluster:UCM:a99d:c5027740: 6557 us(2449 us): open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 13304 us(593 us): open_hca: device mlx4_0 not found
node642.cluster:UCM:a99e:df300740: 6967 us(482 us): open_hca: device mlx4_1 not found
node642.cluster:UCM:a99d:c5027740: 6978 us(421 us): open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 13658 us(354 us): open_hca: device mlx4_1 not found
node642.cluster:SCM:a99c:b32c4740: 13859 us(201 us): open_hca: device mlx4_1 not found
node642.cluster:UCM:a99c:b32c4740: 7187 us(2235 us): open_hca: device mlx4_1 not found
node642.cluster:UCM:a99c:b32c4740: 7434 us(247 us): open_hca: device mlx4_1 not found
+------------------------------------------------------------------------------+
| |
| QuantumATK® |
| |
| Version: T-2022.03 for Windows and Linux [Build 17f5b1eb610] |
| |
| Copyright © 2004-2022 Synopsys, Inc. |
| |
| This software and all associated documentation are proprietary to |
| Synopsys, Inc. This software may only be used pursuant to the |
| terms and conditions of a written license agreement with Synopsys, |
| Inc. All other use, reproduction, modification, or distribution of |
| this software is strictly prohibited. |
| |
+------------------------------------------------------------------------------+
Slave node: node642.cluster
Slave node: node643.cluster
Master node: node642.cluster
Slave node: node642.cluster
Timing: Total Per Step %
--------------------------------------------------------------------------------
Loading Modules + MPI : 5.80 s 5.80 s 99.95% |=============|
--------------------------------------------------------------------------------
Total : 5.80 s
The QuantumATK output matches the output on the working partition A. The first part, with all the error messages, I have no clue about. Can they be ignored, or are they important?
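To get an overview of what these messages are actually complaining about, the lines can be tallied by device name. This is only a rough sketch: the regular expression is my guess at the line format as it appears in the output above, and `missing_devices` is a name made up for illustration. It does not say whether the messages are harmless; it only condenses them.

```python
import re
from collections import Counter

# Guess at the format of the messages pasted above, e.g.
#   node643.cluster:SCM:872d:a39ed740: 2009 us(2009 us): open_hca: device mlx4_0 not found
#   node643.cluster:CMA:872d:a39ed740: 37 us(37 us): open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
PATTERN = re.compile(
    r"^(?P<host>[^:\s]+):(?P<provider>[A-Z]+):[0-9a-f]+:[0-9a-f]+: .*?open_hca: "
    r"(?:device (?P<dev>\S+) not found"
    r"|getaddr_netdev ERROR:No such device\. Is (?P<netdev>\S+) configured\?)"
)


def missing_devices(lines):
    """Count how often each device or interface name is reported as missing."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            # Exactly one of the two groups is set for a matching line.
            counts[match["dev"] or match["netdev"]] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the job output above.
    sample = [
        "node643.cluster:SCM:872d:a39ed740: 2009 us(2009 us): open_hca: device mlx4_0 not found",
        "node643.cluster:CMA:872d:a39ed740: 37 us(37 us): open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?",
        "node643.cluster:SCM:872d:a39ed740: 3289 us(814 us): open_hca: device mthca0 not found",
    ]
    for name, count in sorted(missing_devices(sample).items()):
        print(name, count)
```

Feeding the whole job output through `missing_devices` should give a short list of device and interface names, which is easier to compare against what is actually installed on the nodes.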
If I choose any of the newer modules available ('intelmpi/2020.1.217' or 'intelmpi/2020.4.304'), I'm back to the original error.
If I use the Intel MPI shipped with QuantumATK, I get the following error:
Fri Aug 12 22:38:24 CEST 2022
atkpython: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
atkpython: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
atkpython: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
[mpiexec@node642.cluster] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@node642.cluster] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
After that, the job neither progresses any further nor gets shut down.
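One thing that could be checked here is whether the dynamic loader can find libpython3.8.so.1.0 at all in the environment the job runs in. The snippet below is a minimal sketch using Python's standard `ctypes` module; `can_load` is a helper name invented for this post. It assumes the problem is a library search path issue, which is only a guess on my part.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name in this environment."""
    try:
        # ctypes.CDLL asks the system loader to open the library, so it
        # follows the same search rules (e.g. LD_LIBRARY_PATH on Linux)
        # that were in effect when this Python process started.
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True
```

Calling `can_load("libpython3.8.so.1.0")` from inside the job script, with the same modules and environment variables set, would show whether the library is visible there. A `False` result would only confirm the symptom, not explain why the path is missing.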
In summary, it seems that using intelmpi/18.1.163 works... ish? At least, it does if the error/warning messages can be ignored.