Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - Ambika kumari

1
Thank you very much for your support. I am trying to resolve the issue following your comments. I will contact you again if I am unable to resolve it or if any other problems arise.

2
Dear Dr. Blom,
We are using QuantumATK 2019; perhaps the bundled mpiexec.hydra is an old version. That may be the reason it is creating problems such as the CQ overrun ("completion queue is not large enough for data queue processing"). Kindly look into the problem.
We have not received any reply yet. If someone could reply, it would be of great help.

3
General Questions and Answers / "dapl async_event" error
« on: May 6, 2023, 05:49 »
Dear Dr. Blom,

I contacted you earlier regarding a peculiar "dapl async_event" error that I was getting while submitting jobs on the HPC.
We were receiving this kind of notification (shown below) in the log, after which the simulation paused and did not proceed any further:
hm010:UCM:dec1:1f7d2700: 781258369 us(781258369 us!!!): dapl async_event CQ (0x1fdb8e0) ERR 0
hm010:UCM:dec1:1f7d2700: 781258411 us(42 us): -- dapl_evd_cq_async_error_callback (0x1f7d700, 0x18171f0, 0x2b351f7d1d30, 0x1fdb8e0)
hm010:UCM:dec1:1f7d2700: 781258429 us(18 us): dapl async_event QP (0x55c0cf0) Event 1

After reinstalling QATK I thought the problem was mitigated, but it was not. Jobs ran fine only when the simulations were small; as the simulation size increased, the same problem reappeared. I then searched for this issue on the internet and found someone facing the exact same error while running MPI jobs on an HPC (posted on an Intel forum: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/dapl-async-event-QP/td-p/1133391). There it was suggested that the error occurs because of a CQ overrun (the completion queue is not large enough for data queue processing): IBV_EVENT_CQ_ERR means the CQ is in error (CQ overrun), and IBV_EVENT_QP_FATAL means an error occurred on a QP and it transitioned to the error state. It was suggested that this could be solved by increasing the default CQ (EVD) size.

I also contacted the Intel community and learned that this problem could also be due to an incompatibility between CentOS 7 and the latest version of mpiexec.hydra, but I am unsure whether QATK 2019.12 uses the latest version of mpiexec.

Based on the above, I feel this problem could probably be resolved by tuning some internal MPI parameters, but being a novice I have no idea how to proceed. My research is virtually halted because I am unable to run any computationally heavy simulations, and I would be much obliged if you could kindly guide me in resolving this problem.
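For concreteness, the kind of change I have in mind would go into the run script before the mpiexec line. This is only a rough sketch; the variable names, values, script name and core count below are assumptions on my part, not a confirmed fix:

# Sketch only: possible Intel MPI settings in the QuantumATK run script (assumed, not verified)
export I_MPI_DEBUG=5          # print which fabric/provider the run actually selects
export I_MPI_FABRICS=shm:tcp  # fall back from DAPL to TCP to test whether the CQ overrun disappears
mpiexec.hydra -n 240 atkpython input.py > output.log   # input.py is a placeholder for the actual job script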

Eagerly awaiting your reply.

Sincerely,
Ambika

4
Thank you for replying

5
Respected Sir/Madam,
I am trying to calculate the band structure of graphene. As per the QuantumATK tutorial we can take the symmetry points [G, K, M, G].
But when we use these symmetry points for the graphene structure, it shows an error.
For reference we have attached the log file. Kindly help.

Thank you


6
General Questions and Answers / Re: MPI batch script error
« on: March 28, 2023, 10:22 »
 Respected Dr. Blom,

I am not able to post the input and output files because the QuantumATK forum is asking me to start a new topic, so I am going to start a new topic for the same issue.

7
General Questions and Answers / Re: MPI batch script error
« on: March 25, 2023, 20:03 »
Respected Dr. Blom,

After launching through mpiexec we found that the simulation (ATKPYTHON) initialised successfully. However, after calculating the first set of eigenvalues and the density matrix, the following error appeared:
Calculating Eigenvalues : ==================================================

                            |--------------------------------------------------|
Calculating Density Matrix : ==================================================

cn112:UCM:6d009:b49e0700: 515424974 us(515424974 us!!!): dapl async_event CQ (0x157ed00) ERR 0
cn112:UCM:6d009:b49e0700: 515425014 us(40 us): -- dapl_evd_cq_async_error_callback (0x1538e00, 0x157ee60, 0x2b1db49dfd30, 0x157ed00)
cn112:UCM:6d009:b49e0700: 515425031 us(17 us): dapl async_event QP (0x3f1c660) Event 1
cn112:UCM:6d009:b49e0700: 515425041 us(10 us): dapl async_event QP (0x3f1c660) Event 1

We connected to the affected node (ssh cn112) and found that the ATKPYTHON processes assigned to the respective cores were still running at almost 100% CPU. This suggested that the job was still running, albeit in an unknown state, since no progress was being reflected in the log.
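(For reference, the check on the node was of this kind; a rough sketch rather than the exact commands we typed:)

ssh cn112                                                # log in to the affected compute node
ps -u $USER -o pid,pcpu,etime,comm | grep -i atkpython   # confirm the ATKPYTHON ranks are still alive
top -b -n 1 | head -n 20                                 # check that they sit at roughly 100% CPU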

So we reinstalled QuantumATK on the HPC and found that this resolved the issue.

However, we now face another issue: the simulation is running very slowly. It has not even been able to converge the left electrode calculation, computing only 8 sets of eigenvalues and density matrices in a period of 24 hours. Could this be due to a compatibility issue? (We enquired with other HPC users running similar codes, such as VASP, and found that they did not face such issues.) If so, could you suggest any way to overcome it?

Thank you very much.

8
General Questions and Answers / Re: MPI batch script error
« on: March 21, 2023, 15:49 »
Respected Dr. Blom,

As per your recommendation, and after getting advice from the HPC admin, we tried to modify the software-generated run script by replacing mpiexec with srun, but were unsuccessful. I have attached the modified run script and the error log that was generated. The system admin says he cannot rectify the script on his own, as he is unaware of QuantumATK's internal scripting, and he has suggested that he could help if an expert from your side connected with him directly to resolve the issue. I would be obliged if you could advise on this matter.
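Roughly speaking, the change we attempted was of this form (a generic sketch, not the exact attached script; the actual QuantumATK run script sets additional environment variables, and the --mpi plugin name here is an assumption that depends on how SLURM was built on the cluster):

#!/bin/bash
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=40
# The original mpiexec.hydra launch line was replaced by an srun launch:
srun --mpi=pmi2 atkpython input.py > output.log   # input.py stands in for the actual job script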



9
General Questions and Answers / MPI batch script error
« on: March 17, 2023, 09:09 »
Respected Authority,

I am trying to run QuantumATK on an HPC using the SLURM job manager (I have a perpetual license for QuantumATK Q-2019.12-SP1). When I run the simulation by specifying the number of CPUs as 240 (we have a licence for 256), the simulation fails every time. However, when I instead specify the number of nodes as 6 (1 node = 40 CPUs), the simulation does not show any error. The supercomputer we are using has rules that only allow us to run simulations by specifying the number of CPUs rather than the number of nodes, so specifying CPUs is the only option available to us. Can you please provide any help or support regarding this matter?
The batch script that we used to run the simulation is based on the one available from QuantumATK; it is attached herewith for your reference.
I have also attached the log file showing the error that we faced.
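For clarity, the two ways of requesting the resources look roughly like this in the batch script (a sketch using standard SLURM directives, not the exact attached script):

# What the supercomputer's rules require: specify CPUs (tasks) only
#SBATCH --ntasks=240

# What currently runs without error for us: specify whole nodes
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=40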

Thanking you in advance.

10
When I use multiple nodes to get enough memory, it still shows the same error. I talked to the engineer at this college about this issue; he said it is a problem related to multiple nodes that only the admin can solve, as they have their own code for the multi-node connection.

11
I am facing this type of issue. Please suggest a solution:

srun: error: hm010: task 0: Out Of Memory
slurmstepd: error: Detected 432 oom-kill event(s) in StepId=869350.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[mpiexec@hm010] HYDT_bscu_wait_for_completion (../../tools/bootstrap/utils/bscu_wait.c:151): one of the processes terminated badly; aborting
[mpiexec@hm010] HYDT_bsci_wait_for_completion (../../tools/bootstrap/src/bsci_wait.c:36): launcher returned error waiting for completion
[mpiexec@hm010] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:527): launcher returned error waiting for completion
[mpiexec@hm010] main (../../ui/mpich/mpiexec.c:1148): process manager error waiting for completion
slurmstepd: error: Detected 432 oom-kill event(s) in StepId=869350.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
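For reference, the usual first thing to check after an oom-kill like this is the memory request in the batch script, roughly like the following (the directive values are placeholders and depend on the cluster):

#SBATCH --mem-per-cpu=4G   # example value; adjust to the node's RAM and the number of tasks per node
# or, if the site allows requesting a node's full memory:
#SBATCH --mem=0            # ask for all of the memory on each allocated node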

12
I am also facing the same issue.

13
Dear Admin,
Please answer this query. I am not able to proceed.

14
I optimized my device following the tutorial you suggested.
It optimized with the force field but did not converge with LCAO.

15
Dear Admin,
I am getting this error. What does it mean?

Calculating Eigenvalues    : cn113:UCM:22f42:cd0b2700: 686464132 us(686464132 us!!!): dapl async_event CQ (0x1de4450) ERR 0
cn113:UCM:22f42:cd0b2700: 686464166 us(34 us):  -- dapl_evd_cq_async_error_callback (0x1d85f40, 0x1829490, 0x2b2acd0b1d30, 0x1de4450)
cn113:UCM:22f42:cd0b2700: 686464308 us(142 us): dapl async_event QP (0x1f2d5a0) Event 1

Kindly reply
