Dear Dr. Blom,
I contacted you earlier regarding a peculiar "dapl async_event" error that I was encountering while submitting jobs on our HPC cluster.
We were receiving notifications of the kind shown below in the log, after which the simulation stalled and did not proceed any further:
hm010:UCM:dec1:1f7d2700: 781258369 us(781258369 us!!!): dapl async_event CQ (0x1fdb8e0) ERR 0
hm010:UCM:dec1:1f7d2700: 781258411 us(42 us): -- dapl_evd_cq_async_error_callback (0x1f7d700, 0x18171f0, 0x2b351f7d1d30, 0x1fdb8e0)
hm010:UCM:dec1:1f7d2700: 781258429 us(18 us): dapl async_event QP (0x55c0cf0) Event 1
After reinstalling QATK I thought the problem had been mitigated, but I found that it was not so. Jobs ran fine only when the simulations were small; as the simulation size was increased, the same problem reappeared. I then searched for this issue on the internet and found someone facing the exact same error while running MPI jobs on an HPC cluster (posted on an Intel forum:
https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/dapl-async-event-QP/td-p/1133391). There it was suggested that the error was occurring because of a CQ overrun, i.e. the completion queue is not large enough for the completions being queued (the corresponding verbs events are IBV_EVENT_CQ_ERR, "CQ is in error (CQ overrun)", and IBV_EVENT_QP_FATAL, "an error occurred on a QP and it transitioned to the error state"), and that it could be solved by increasing the default CQ (EVD) size.
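To make sure I understood what "CQ size" means, I also looked at the underlying verbs API. The following is only a minimal sketch based on the public libibverbs documentation (it is not QATK or Intel MPI code, the device handling is simplified, and the file name cq_sketch.c is just for illustration); it shows that the completion-queue depth is fixed when the queue is created via ibv_create_cq(), which I take to be the "default CQ (EVD) size" the forum post suggests increasing:

/* Minimal libibverbs sketch (illustration only): the 'cqe' argument to
 * ibv_create_cq() fixes the completion-queue depth. If completions arrive
 * faster than they are drained and the queue fills up, the device raises
 * IBV_EVENT_CQ_ERR (CQ overrun), which seems to match the DAPL log above.
 * Build (assuming the libibverbs headers are installed):
 *   gcc cq_sketch.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devices[0]);
    if (!ctx) {
        fprintf(stderr, "could not open device\n");
        ibv_free_device_list(devices);
        return 1;
    }

    /* 1024 is the requested CQ depth; a depth too small for the number of
     * outstanding work requests is what leads to the overrun. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 1024, NULL, NULL, 0);
    printf("CQ with 1024 entries: %s\n", cq ? "created" : "failed");

    if (cq)
        ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    return 0;
}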
I also contacted the Intel community and learned that this problem could also be caused by an incompatibility between CentOS 7 and the latest version of mpiexec.hydra, but I am unsure whether QATK 2019.12 ships the latest version of mpiexec.
Based on the above, I feel that this problem could probably be resolved by tuning some internal MPI parameters, but being a novice, I have no idea how to proceed. My research has virtually come to a halt, as I am unable to run any computationally heavy simulations, and I would be much obliged if you could kindly guide me in resolving this problem.
Eagerly awaiting your reply.
Sincerely,
Ambika