Author Topic: "dapl async_event" error

Offline Ambika kumari

  • Heavy QuantumATK user
  • ***
  • Posts: 32
  • Country: in
  • Reputation: 0
"dapl async_event" error
« on: May 6, 2023, 05:49 »
Dear Dr. Blom,

I contacted you earlier regarding a peculiar "dapl async_event" error that I was getting while submitting jobs on our HPC cluster.
We were receiving notifications of the following kind in the log, after which the simulation paused and did not proceed any further:
hm010:UCM:dec1:1f7d2700: 781258369 us(781258369 us!!!): dapl async_event CQ (0x1fdb8e0) ERR 0
hm010:UCM:dec1:1f7d2700: 781258411 us(42 us): -- dapl_evd_cq_async_error_callback (0x1f7d700, 0x18171f0, 0x2b351f7d1d30, 0x1fdb8e0)
hm010:UCM:dec1:1f7d2700: 781258429 us(18 us): dapl async_event QP (0x55c0cf0) Event 1

After reinstalling QATK I thought that the problem had been resolved. However, this was not the case. Jobs ran fine only when the simulations were small; as the simulation size increased, the same problem reappeared. I then searched for this issue on the internet and found someone facing the exact same error while running MPI jobs on an HPC system (posted on an Intel forum: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/dapl-async-event-QP/td-p/1133391). There it was suggested that the error occurs because of a CQ overrun (the completion queue is not large enough for data queue processing): IBV_EVENT_CQ_ERR means the CQ is in error (CQ overrun), and IBV_EVENT_QP_FATAL means an error occurred on a QP and it transitioned to the error state. It was also suggested that the problem could be solved by increasing the default CQ (EVD) size.

I also contacted the Intel community and learned that this problem could also be due to an incompatibility between CentOS 7 and the latest version of mpiexec.hydra, but I am unsure whether QATK 2019.12 uses the latest version of mpiexec.

Based on the above, I feel that this problem could probably be resolved by tuning some internal MPI parameters, but being a novice, I have no idea how to proceed. My research is virtually halted because I am unable to run any computationally heavy simulations, and I would be very grateful if you could guide me in resolving this problem.

Eagerly awaiting your reply.

Sincerely,
Ambika

Offline Ambika kumari

Re: "dapl async_event" error
« Reply #1 on: May 11, 2023, 15:56 »
Dear Dr. Blom,
We are using QuantumATK 2019, so mpiexec.hydra may be an old version. That may be why it is causing problems such as a CQ overrun (the completion queue is not large enough for data queue processing). Kindly look into the problem.
We have not received any reply. A reply from anyone would be of great help.

Offline filipr

  • QuantumATK Staff
  • Heavy QuantumATK user
  • *****
  • Posts: 81
  • Country: dk
  • Reputation: 6
  • QuantumATK developer
Re: "dapl async_event" error
« Reply #2 on: May 12, 2023, 09:38 »
The dapl async_event error could be caused by (1) a bug or incompatibility in Intel MPI, (2) a wrongly configured Intel MPI, or (3) a wrongly configured network infrastructure on the cluster. If you can run other MPI software on the cluster, we can disregard (3). QuantumATK 2019.12 ships Intel MPI 2018 Update 1.

For (2), please read the Intel MPI user guide and documentation, which you can find here. I suggest reaching out to your cluster administrator for advice on how to configure Intel MPI to use the cluster network infrastructure.

If you believe that Intel MPI 2018 Update 1 is incompatible with your cluster, it is possible to use a newer version when running QuantumATK. If your cluster already has a newer version of Intel MPI installed in a module system, you can simply put e.g.:
Code
module load intel-mpi
in your submission script. If Intel MPI is not installed, you can install it yourself by downloading the oneAPI installer from Intel's website. Then in your submission script put:
Code
source /path/to/intel/oneapi/mpi/latest/env/vars.sh
You can verify that it finds the correct version by executing
Code
mpirun --version
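As a rough sketch of a slightly fuller check, the snippet below also prints where mpirun resolves from, so that a launcher coming from the QuantumATK installation (QuantumATK/libexec) is easy to spot. The function name launcher_origin and the path pattern are illustrative assumptions; adjust the pattern if your installation directory is named differently.

```shell
#!/bin/sh
# Hypothetical helper: classify a launcher path.
#   "missing"  - empty path (nothing found in PATH)
#   "bundled"  - path lies under a QuantumATK libexec directory (assumed pattern)
#   "external" - anything else, e.g. a module or oneAPI installation
launcher_origin() {
    case "$1" in
        "") echo "missing" ;;
        */QuantumATK*/libexec/*) echo "bundled" ;;
        *) echo "external" ;;
    esac
}

# Resolve the mpirun that the shell would actually run; empty if none is found.
launcher="$(command -v mpirun || true)"
echo "mpirun resolves to: ${launcher:-not found} ($(launcher_origin "$launcher"))"
```

If this reports "bundled", the submission script is still picking up the launcher shipped with QuantumATK rather than the newer installation.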
Be sure that your submission script uses the mpiexec/mpirun executable of the new version of Intel MPI (which should now be in your PATH) and not the hardcoded path to the one in QuantumATK/libexec/mpiexec.hydra. If you use a job scheduler, you may have to use the MPI launcher it provides, e.g. for SLURM use srun.

Now, this will still not work out of the box. The reason is that the QuantumATK/bin/atkpython file is actually not an executable but a launcher script, which sets up environment variables so that third-party libraries like Intel MPI can be found. You can open it in a text editor if you are curious. What matters in this case is that it prepends the path to the QuantumATK/lib directory to LD_LIBRARY_PATH. When the program launches, it looks for the MPI library (libmpi.so) in the directories in LD_LIBRARY_PATH in the order they appear, so it will always find the 2018.1 version in QuantumATK/lib first.

To force it to use the newer version of Intel MPI, you therefore have to delete or rename all files starting with "libmpi" in the QuantumATK/lib directory. That way it will instead find the newer version in e.g. /path/to/intel/oneapi/mpi/latest/lib. You can verify that an atkpython run uses the correct Intel MPI version by setting:
Code
export I_MPI_DEBUG=5
in your submission script. When the program starts it should output debug log messages from Intel MPI, including the version of the library used. Starting from QuantumATK 2022.12 we now ship Intel MPI in a separate folder and append, instead of prepend, the directory to LD_LIBRARY_PATH. This makes it easier for users to use their own installation of Intel MPI without them having to delete/rename files.
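The library lookup and the rename step described above could be sketched roughly as follows. Both helper names and the backup directory name are hypothetical, the lookup only approximates what the dynamic loader does (it matches any libmpi.so* file rather than the exact library name the program requests), and the paths in the comments are placeholders. Moving the files into a subdirectory instead of deleting them keeps the change reversible.

```shell
#!/bin/sh
# Sketch only: helper names and the backup directory name are hypothetical.

# Print the first directory in a colon-separated search path (first argument)
# that contains a libmpi.so* file, walking the entries in the order they appear.
first_libmpi_dir() {
    old_ifs="$IFS"
    IFS=:
    for dir in $1; do
        [ -n "$dir" ] || continue
        for f in "$dir"/libmpi.so*; do
            if [ -e "$f" ]; then
                IFS="$old_ifs"
                echo "$dir"
                return 0
            fi
        done
    done
    IFS="$old_ifs"
    return 1
}

# Move every file starting with "libmpi" from the given lib directory
# (e.g. /path/to/QuantumATK/lib) into a backup subdirectory, which the
# loader does not search.
disable_bundled_mpi() {
    libdir="$1"
    backup="$libdir/bundled-mpi-backup"
    mkdir -p "$backup" || return 1
    for f in "$libdir"/libmpi*; do
        # Skip the unexpanded pattern; keep symlinks whose target was already moved.
        [ -e "$f" ] || [ -L "$f" ] || continue
        mv "$f" "$backup"/
    done
}
```

Given a search path such as /path/to/QuantumATK/lib:/path/to/intel/oneapi/mpi/latest/lib, first_libmpi_dir should print the QuantumATK directory before disable_bundled_mpi is run on it and the oneAPI directory afterwards. Moving the files back from the backup subdirectory restores the original setup.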
« Last Edit: May 12, 2023, 09:47 by filipr »

Offline Ambika kumari

Re: "dapl async_event" error
« Reply #3 on: May 17, 2023, 11:46 »
Thank you very much for your support. I am trying to resolve the issue by following your suggestions. I will contact you again if I am unable to resolve it or if any other problems arise.