Hi - I'm trying to make QuantumATK-X-2025-06-SP1 functional on the ORNL cluster CADES.
While I was successful at installing the program, I haven't been able to run parallel jobs so far.
Here is an example of an MPI job script that worked extremely well on CADES with QuantumATK-U-2022.12:
#!/bin/bash
#SBATCH --nodes 2
##SBATCH --ntasks-per-node 4
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH -p high_mem
#SBATCH -A cnms
#SBATCH -t 24:00:00
#SBATCH -o output_%j.out
#SBATCH -e output_%j.err
module purge
module load gcc/8
export SNPSLMD_LICENSE_FILE="[email protected]"
srun -N2 --ntasks-per-node=4 -v --mpi=pmi2 /home/fhagelberg/QuantumATK-2/QuantumATK-U-2022.12-SP1/atkpython/bin/atkpython /lustre/or-scratch/cades-birthright/fhagelberg/Olaf/zWS2+Hbridge-AFM-SGGA-Dojo-2-electrode.py
When I use the same script for QuantumATK-X-2025-06-SP1, cryptic error messages appear (see below). PMI2 no longer seems to work, and PMIx is unavailable on CADES. Is there another way to run the program under MPI in a SLURM environment, ideally still launching with srun, which has turned out to be very effective?
Error message:
srun: defined options
srun: -------------------- --------------------
srun: (null) : or-condo-c[317,364]
srun: jobid : 4348670
srun: job-name : atksub
srun: mpi : pmi2
srun: nodes : 2
srun: ntasks-per-node : 4
srun: oom-kill-step : 0
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 4348670: nodes(2):`or-condo-c[317,364]', cpu counts: 36(x2)
srun: CpuBindType=(null type)
srun: launching StepId=4348670.0 on host or-condo-c317, 4 tasks: [0-3]
srun: launching StepId=4348670.0 on host or-condo-c364, 4 tasks: [4-7]
srun: topology/default: init: topology Default plugin loaded
srun: Node or-condo-c317, 4 tasks started
srun: Node or-condo-c364, 4 tasks started
slurmstepd: error: mpi/pmi2: value not properly terminated in client request
slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd: error: mpi/pmi2: full request is: 00000000000000000000000000000000000000000000000
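A first diagnostic for the PMI2 errors above would be to ask SLURM which MPI plugin types this installation offers. The sketch below wraps `srun --mpi=list` in a small function (the name `list_mpi_plugins` is only illustrative) and prints a message instead on hosts where srun is not on the PATH. It does not tell which of the listed types the MPI library bundled with QuantumATK-X-2025-06-SP1 can actually talk to; that would still need to be checked against the vendor documentation.

```shell
#!/bin/bash
# Hypothetical helper: print the MPI plugin types known to this SLURM installation.
list_mpi_plugins() {
    if command -v srun >/dev/null 2>&1; then
        # --mpi=list is a standard srun option: it lists the available MPI types and exits.
        srun --mpi=list 2>&1
    else
        # Fallback so the script still runs on hosts without SLURM.
        echo "srun not found on this host"
    fi
}

list_mpi_plugins
```

Run on a login node, the output shows whether anything besides pmi2 (for example a PMIx variant) could be passed to `srun --mpi=...` on this cluster.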