Topic: Job suddenly stop and Segmentation fault

Offline dmicje12

  • Regular QuantumATK user
  • **
  • Posts: 19
  • Country: tw
  • Reputation: 0
    • View Profile
Job suddenly stop and Segmentation fault
« on: October 30, 2024, 09:55 »
Hi everyone,
My jobs have recently been crashing frequently with a segmentation fault.
I'm fairly sure there is enough RAM during the calculation.
During the SCF loop, the first few steps run stably, but at some step the job stops while calculating the density matrix.
The end of the log file looks like this:
+------------------------------------------------------------------------------+
| Total Density Report                    DM           DD           dQ         |
+------------------------------------------------------------------------------+
| Left Electrode Extension            64.85281    -25.14719    -25.14719       |
| Right Electrode Extension           64.42974    -25.57026    -25.57026       |
| Central Region                    1532.77422    -33.22578    -33.22578       |
+------------------------------------------------------------------------------+
|  15 E = -509.293 dE =  2.042577e+00 dH =  4.867394e-01                       |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Density Matrix : ===========================

MobaXterm displays:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 58 PID 27109 RUNNING AT 512.lab.nycu.edu.tw
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 59 PID 27123 RUNNING AT 512.lab.nycu.edu.tw
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
My ATK version is 2023.12-sp1 and the operating system is CentOS 8.
The system has only 240 atoms (noncollinear, FHI-DZP basis).
I would like to know why this problem occurs and how to solve it.
Thanks!


Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5594
  • Country: dk
  • Reputation: 103
    • View Profile
    • QuantumATK at Synopsys
Re: Job suddenly stop and Segmentation fault
« Reply #1 on: October 30, 2024, 20:20 »
Hard to say without any input script. Also, if you run very many parallel processes on the same node, you can still use quite a lot of memory. In 95% of all cases, this error means out of memory.
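One way to test the out-of-memory hypothesis is to look for entries from the Linux OOM killer in the kernel log, since the OOM killer terminates processes with signal 9 (Killed). The following is a minimal sketch, not an official QuantumATK tool. It assumes kernel messages of the common "Killed process <pid> (<name>)" form, whose exact wording varies between kernel versions, and `find_oom_kills` is a hypothetical helper name.

```python
import re

# Matches kernel log lines such as:
#   "Out of memory: Killed process 12345 (python3.11) total-vm:..."
# Assumption: the wording around this fragment differs between kernel versions.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) for each OOM-killer entry found."""
    kills = []
    for line in kernel_log_text.splitlines():
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

The text can come from `dmesg` or `journalctl -k`, which may need elevated privileges. An empty result does not prove memory was sufficient; it only means no such entry was found in the text that was scanned.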

dmicje12
« Reply #2 on: October 31, 2024, 05:06 »
OK, I will try lowering the number of MPI processes.
Thanks for the reply!

 

dmicje12
« Reply #3 on: December 6, 2024, 09:44 »
Hello!
My jobs suddenly crashed again.
When I run the "coredumpctl list" command, the following line appears:
Fri 2024-12-06 15:38:51 CST  396269 1000 1000 SIGSEGV present  /cad/synopsys/qatk/2023.12-sp1/atkpython/bin/python3.11   1.0G
Then I ran "gdb /cad/synopsys/qatk/2023.12-sp1/atkpython/bin/python3.11 core",
which printed the following error:
gdb: symbol lookup error: /lib/x86_64-linux-gnu/libpython3.12.so.1.0: undefined symbol: XML_SetReparseDeferralEnabled

I'm pretty sure there is enough RAM during the calculation.
My server has 500 GB of RAM, and the peak RAM usage during the calculation is about 200 GB.
Sometimes the calculation runs stably for 2 to 3 days, but sometimes it stops after 4 hours.
I still don't know why the jobs stop suddenly.
My script is attached below.
Thanks!


Anders Blom
« Reply #4 on: December 6, 2024, 20:09 »
Nothing obvious I can point to; this should not be a very heavy calculation memory-wise. Maybe you are not the only user on the machine and some other calculation is taking up a lot of memory? These things are difficult to troubleshoot, so it is best to just resubmit.

dmicje12
« Reply #5 on: December 7, 2024, 05:06 »
Thanks for the reply!
I am the only user of the server.
If the license server cannot be reached because of network instability, will jobs stop suddenly?
I remember, though, that when I ran into network problems before, the word "heartbeat" appeared in the log file.
Finally, I hope ATK can add an automatic restart feature, because my jobs have stopped many times while I was asleep, which is very frustrating.
Anyway, thank you very much for the help!




Anders Blom
« Reply #6 on: December 7, 2024, 21:45 »
Yes, the software will try to reach the license server for about 5 minutes before terminating. I am not sure if it's writing "heartbeat" any more, but definitely some obvious message will be in the log file.

Restart is available for some lengthy calculations, such as IV curves or geometry optimization, but a fully automatic restart is trickier to implement, as it depends on how you submit the job (from the Job Manager, yes, it could be considered).

The main question is whether this happens often, and whether it is always for the same calculation or random. It could also be a hardware problem.
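Until something like that exists, resubmission can be automated outside the software. The sketch below is a generic retry loop, not a QuantumATK feature: `run_with_retries` is a hypothetical helper that reruns a command until it exits with code 0. Each attempt starts the script again from the top, so it only saves work if the script itself reads previously saved results.

```python
import subprocess
import time


def run_with_retries(command, max_attempts=3, wait_seconds=60):
    """Run `command` until it exits with code 0 or the attempts run out.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"Attempt {attempt} failed with exit code {result.returncode}")
        if attempt < max_attempts:
            # Give the machine (or the license server) time to recover.
            time.sleep(wait_seconds)
    return None
```

A call might look like `run_with_retries(["mpiexec", "-n", "32", "atkpython", "script.py"])`, where the launcher, process count, and script name are placeholders for however the job is normally started.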

Offline filipr

  • QuantumATK Staff
  • Heavy QuantumATK user
  • *****
  • Posts: 84
  • Country: dk
  • Reputation: 8
  • QuantumATK developer
    • View Profile
Re: Job suddenly stop and Segmentation fault
« Reply #7 on: December 9, 2024, 10:41 »
Hi dmicje12,

Could you send us the full script, including the atomic configuration, and tell us how you run it, i.e. how many MPI processes and threads you use? To look into this, we need to be able to reproduce the error.

Thanks,
Filip

dmicje12
« Reply #8 on: December 10, 2024, 06:51 »
Hello, my ATK version is 2023.12-sp1, the operating system is Ubuntu 24.04 LTS, and the machine has 96 cores and 192 threads (EPYC 7K62).
I run the calculation with 32 MPI processes and 6 threads each.
The memory usage does not exceed the available memory (512 GB).
Strangely, my friend uses a 20-core Xeon E5 series machine with CentOS 7.9 and ATK 2022.12, and his jobs do not stop suddenly when calculating similar structures (20 MPI processes, with the thread setting left on automatic).
I also have another machine running calculations: a Xeon 4514Y with CentOS 7.9 and ATK 2023.12. When calculating the same structure there, the job also stops suddenly (32 MPI processes and 2 threads).
If you need more information, please let me know.
Thanks!
P.S. I use all threads when calculating. Could this affect the stability of the calculation?
« Last Edit: December 10, 2024, 06:59 by dmicje12 »
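For reference, the thread count in the first setup above works out as follows. This is only a worked arithmetic sketch using the numbers given in the post (32 MPI processes, 6 threads each, 96 cores, 192 hardware threads); `thread_load` is a hypothetical helper, and the result says nothing by itself about whether full occupancy causes the crash.

```python
def thread_load(mpi_processes, threads_per_process, hardware_threads, physical_cores):
    """Summarize how many software threads a run places on the machine."""
    total = mpi_processes * threads_per_process
    return {
        "total_threads": total,
        "per_hardware_thread": total / hardware_threads,
        "per_physical_core": total / physical_cores,
    }


# Numbers taken from the post: 32 MPI x 6 threads on 96 cores / 192 threads.
load = thread_load(32, 6, 192, 96)
print(load)
# total_threads is 192: one software thread per hardware thread,
# i.e. two per physical core.
```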

filipr
« Reply #9 on: December 16, 2024, 11:48 »
Hi again,

I can confirm that I can reproduce a segmentation fault with your script in v2023.12. My suspicion was that it was the same bug as we've seen in some transmission calculations with the same version when using threading, which are due to a bug in Intel MKL. We've made a fix for this in v2024.09-SP1 which has just been released (see forum post https://forum.quantumatk.com/index.php?topic=12114.msg41940#new) and indeed with that version I no longer get the segfault.

Try installing the new v2024.09-SP1 version and see if that fixes the issue. Otherwise, you may be able to work around the problem by disabling OpenMP threading, i.e. setting OMP_NUM_THREADS=1, but note that this increases the memory requirements.
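If the job is started from a Python wrapper rather than a shell, the OMP_NUM_THREADS=1 workaround could be applied as in the sketch below. The helper names `single_thread_env` and `run_without_openmp` are hypothetical, and the command passed in is whatever you normally use to launch the job; only the environment variable itself comes from the advice above.

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_without_openmp(command):
    """Run `command` with OMP_NUM_THREADS=1 and return its exit code."""
    return subprocess.run(command, env=single_thread_env()).returncode
```

For example, `run_without_openmp(["mpiexec", "-n", "32", "atkpython", "script.py"])`, where the launcher, process count, and script name are placeholders. Depending on the MPI launcher and cluster setup, the variable may also need to be forwarded explicitly to remote nodes.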