I recently wrote a small Python script based on ATK. It runs correctly when I launch it on a single process with the command “mpiexec -machinefile mpd.hosts -n 1 $ATK_BIN_DIR/atk $WORK_DIR/script.py”. However, when I run it in parallel with the command “mpiexec -machinefile mpd.hosts -n 8 $ATK_BIN_DIR/atk $WORK_DIR/script.py”, it aborts with the following error:
[ch@console ~]$mpiexec -machinefile mpd.hosts -n 8 $ATK_BIN_DIR/atk $WORK_DIR/script.py
5: [cli_5]: aborting job:
rank 5 in job 133 console_45778 caused collective abort of all ranks
exit status of rank 5: return code 1
5: Fatal error in MPI_Allreduce: Message truncated, error stack:
5: MPI_Allreduce(707).....................: MPI_Allreduce(sbuf=0x57136008, rbuf=0x5763d008, count=658560, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
5: MPIR_Allreduce(385)....................:
5: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 4 and tag 14 truncated; 2814240 bytes received but buffer size is 2634240
[ch@console ~]$
I think it must have something to do with the “buffer size” of mpich2-1.0.5p4.
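As a rough sanity check on that guess, the byte counts in the error can be converted into numbers of doubles. This is only a sketch: it assumes MPI_DOUBLE is 8 bytes, and the variable names are made up for illustration. The values are copied from the output above.

```python
# Sizes copied from the MPI error output above.
BYTES_PER_DOUBLE = 8  # assumption: MPI_DOUBLE is an 8-byte C double

count = 658560            # element count passed to MPI_Allreduce on rank 5
buffer_bytes = 2634240    # receive buffer size reported on rank 5
received_bytes = 2814240  # size of the message that arrived from rank 4


def to_doubles(nbytes):
    """Convert a byte count to a whole number of doubles."""
    quotient, remainder = divmod(nbytes, BYTES_PER_DOUBLE)
    if remainder:
        raise ValueError(f"{nbytes} bytes is not a whole number of doubles")
    return quotient


buffer_doubles = to_doubles(buffer_bytes)
received_doubles = to_doubles(received_bytes)
excess_doubles = received_doubles - buffer_doubles

print("buffer holds   :", buffer_doubles, "doubles")
print("message carries:", received_doubles, "doubles")
print("excess         :", excess_doubles, "doubles")
print("buffer is half of count:", buffer_doubles * 2 == count)
```

The buffer works out to exactly half of the count given to MPI_Allreduce on rank 5, so it seems to be derived from the call's own arguments rather than from a fixed limit in MPICH2. The message from rank 4 is larger than that, which might mean that rank 4 entered the call with a different (larger) count than rank 5, but I have not confirmed this.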
How can I deal with it? Thanks, everyone!