Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - ziand

Pages: 1 2 [3] 4 5 6
31
This is a good question, so I repeat it. (I use ATK 11.2.3)

I calculated the LDDOS, and when I plot it with VNL the unit is given as 1/Ang**3.
When I load the same LDDOS via lddos = nlread('lddos.nc') and call lddos.evaluate(0*Ang, 0*Ang, 0*Ang),
the unit comes out as 1/Bohr**1.5. This should not happen, no matter what the real unit of the LDDOS is!

In any case, I think the unit of the LDDOS should be (1/LengthUnit**3)*(1/EnergyUnit).
(Integrating it over, e.g., a unit cell would then give the DOS in 1/eV. This in turn could be integrated over all energies, weighted by the Fermi function, to get the total number of electrons within that unit cell.)
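To make the unit question concrete, here is a minimal sketch of what I do (it only assumes the nlread/evaluate calls from above; the integration itself is just indicated in a comment):

Code
# Minimal sketch of the unit check described above; 'lddos.nc' is the file from my
# own calculation, and I simply take the first object stored in it.
lddos = nlread('lddos.nc')[0]

# Evaluate the LDDOS at one point and print the value together with its unit.
value = lddos.evaluate(0.0*Ang, 0.0*Ang, 0.0*Ang)
print value   # currently reported in 1/Bohr**1.5; I would expect 1/Ang**3 * 1/eV

# The integration argument: summing value_i * dV over a real-space grid covering the
# unit cell should give the DOS in 1/eV; integrated over energy and weighted with the
# Fermi function, that should give the number of electrons in the cell.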

Or is there a significant difference between LDDOS and the normal LDOS that may explain this?

32
Future Releases / LDOS
« on: January 20, 2012, 17:25 »
It seems there is no way to obtain the LDOS in ATK.
We have DOS, DDOS, and LDDOS.

Of course, one can create an ideal device and use the LDDOS, but wouldn't it be easier to use a normal BulkConfiguration and have an LDOS?

33
When dealing with large systems, especially systems that are large (maybe only in one dimension) and metallic, one often encounters convergence problems.

An effective way to improve convergence is to increase the electron temperature. However, at some point (very high T_el) this affects the accuracy of the results.
The other approach is mixing. Right now ATK implements a Pulay mixer for the Hamiltonian, which is fine. What I am asking for (maybe it is too complicated and only experience on the user side can help) is some sort of tutorial or guideline on how to tune the various parameters in order to systematically find a way out of a non-converging situation.

One has the parameters:
damping_factor, number_of_history_steps, start_mixing_after_step, linear_dependence_threshold and preconditioner.
The last one (the Kerker preconditioner) itself has the parameters energy_q0, energy_qmax and maximum_damping.
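To make the question more concrete, this is roughly how these knobs appear in my scripts at the moment (only a sketch; the numerical values are examples, not recommendations, and the Kerker line is commented out because I am not sure about its exact signature):

Code
# Sketch only: the convergence knobs listed above, with example values.
itCon = IterationControlParameters(
    tolerance=4.0e-5,
    max_steps=500,
    damping_factor=0.05,                 # smaller -> more cautious (slower) mixing
    number_of_history_steps=12,          # length of the Pulay history
    start_mixing_after_step=0,
    linear_dependence_threshold=1.0e-8,
    # preconditioner=Kerker(energy_q0=..., energy_qmax=..., maximum_damping=...),
)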

I doubt I am the only user who is sometimes a bit puzzled about which knobs are the most effective to turn, or in which order one should change them, to improve convergence (it may become slower, with more tiny steps, but the calculation should stop oscillating and the error should keep going down!).

Thanks.

34
Future Releases / Different smearing methods
« on: January 10, 2012, 12:25 »
Hello,

I know that when dealing with metallic systems, so-called smearing is very important.
ATK uses Fermi-Dirac smearing: an electron temperature is introduced, which softens the step in the occupation of states at the Fermi level.
The advantage of this method (as I understand it) is that the electron temperature in Fermi-Dirac smearing can be interpreted as the real physical temperature of the system.
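Just to illustrate what I mean by softening the step, here is a small standalone sketch (plain numpy/scipy, nothing ATK-specific; the broadening values are arbitrary examples) comparing the Fermi-Dirac occupation with a simple Gaussian-smearing occupation:

Code
import numpy
from scipy.special import erfc

kT = 0.1       # Fermi-Dirac broadening k_B*T_el in eV (example value)
sigma = 0.1    # Gaussian smearing width in eV (example value)
energy = numpy.linspace(-1.0, 1.0, 201)   # E - E_F in eV

# Fermi-Dirac occupation: has a physical meaning as the electron distribution at T_el.
f_fermi = 1.0 / (numpy.exp(energy / kT) + 1.0)

# Gaussian smearing: a purely mathematical broadening of the step function.
f_gauss = 0.5 * erfc(energy / sigma)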

However, looking at other codes and some publications, I noticed that other smearing methods exist. The most prominent are:
- (No smearing (not recommended for metals!))
- Fermi-Dirac
- Gaussian
- Methfessel-Paxton
- Marzari-Vanderbilt
- Tetrahedron method (with Blöchl corrections)

Different codes (I looked at SIESTA, Quantum ESPRESSO and CASTEP) use different smearing methods as their default. A recent SIESTA manual praises the Methfessel-Paxton method (one can use a high electron temperature and a low number of k-points).
Anyway, if you have the time ;-) or don't know what to put into a future release, you could consider implementing one or the other of the above methods.

35
Okay, I created a minimal example to reproduce the error. On a 4Gb node even the electrode calculation breaks with the known segmentation fault. Obviously this is nothing to worry about, as 4Gb might simply be too little for this calculation. So I moved to some 16Gb nodes with 2 MPI processes per node, i.e. I should have 8Gb per process. Now the electrode calculation went fine, and the device calculation started as well!
From my point of view, this should not have happened. So maybe the whole thing was my fault, and maybe there is some sort of non-trivial bug in my own script. That would be a bit embarrassing for me. Anyway, I suppose that when you read this and the previous post you should not spend much time on it right now (although the memory log I attached previously is interesting indeed: I cannot see such a long and steep growth in memory consumption right now, so if I made a mistake in my scripts, I have no idea what it could be).

A final question that you might be able to answer quickly:
there once was a hint that you can or should do
Code
import NL
NL.MAXMEMORY=XXXX
I have had this in my scripts for a long time and usually do not think about it very much. The question is: can this be a source of trouble? What happens if I set it (a) too small, e.g. XXXX=3000 on a 16Gb machine, or (b) too large, e.g. XXXX=8000 on a 4Gb machine? Both cases might have happened to me in the past. (Right now my calculation is running with XXXX=3400 on a 16Gb node with 2 MPI processes per node.)

36
It's me again, still having problems with my calculations going down with segmentation faults.

I have done some further testing and want to report what I found. (There are three figures with notes attached. The time axis is not meaningful, as I changed the delay in my memory-logging command, "top -d 0.05 | grep --line-buffered 'Mem:' > mem_XXXXXXX.txt", between runs; the "-d" value was always different, e.g. fig. 1 used "-d 0.01".)

The first calculations were done on an 8Gb node. My memory log file is shown in the first picture, and the very first part is the interesting one. There you see three segments: the first calculation crashed, the second one was interrupted by me, and the third one did not crash with a segmentation fault. Now, as noted in the picture, for the first calculation I used MPI with just one process (but on a full node). The second and third calculations were done in serial (!) mode, without any MPI, just "$ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py" (and not mpiexec ...). So my idea was that there must be something wrong with your MPI.

I repeated the very same (serial) calculation on a 16Gb node, to avoid the out-of-memory problem that occurred at the end of the first calculation. This one also crashed with a segmentation fault, see figure 2.

Then I went back to an 8Gb node, and it also crashed, see figure 3.

I also noticed that in the runs that crash there are two memory peaks visible in the log, which is not the case in the successful calculation (see notes in fig. 1). Maybe this is somehow helpful. At least the memory log shows regular patterns that are identical at every calculation startup (as they should be!), except for those spikes.


For now, I am not sure what to learn from this. It could even be a hardware problem, because ALL calculations were done on different nodes. However, if you find the time (no hurry!), please have a look at this. At least this time the possibilities are more limited, because MPI is not involved. Can you really rule out that this is a problem of ATK? If so, I will have to talk to a cluster admin here...

(By the way, I will try to create a minimal script of my calculation and send it to you privately. Do not hurry if you are busy right now.)


Thank you very much.

37
After following the procedure explained above, the equivalent bulk calculation was not preserved and started again, but the calculation finished without any problems.

38
Hello,

I have read the Mini-Tutorial about restarting a calculation from a checkpoint file, but I have some minor questions. First: I still use ATK 11.2.3, where no force_restart keyword is available. What does force_restart=True in the new version actually do?

Second question: my calculation broke after the equivalent bulk part of the device calculation, and now the checkpoint file contains only a bulk configuration and not a device configuration. Furthermore, I saved the electrodes to a separate file.

Is the following restart procedure correct (or do I lose something at step 4)? A rough sketch in code follows below the list.
1.) read Electrode
2.) read checkpoint-file  (== equiv. bulk)
3.) create Device from equiv. Bulk and electrodes
4.) attach a new device calculator
5.) update
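For concreteness, this is roughly what I have in mind (only a sketch: the file names are placeholders, devCal is the device calculator from my original script, and I am not sure whether the converged state of the equivalent bulk actually survives step 4):

Code
# Sketch of the restart procedure listed above.
# 1.) read the electrode (my left and right electrodes are identical)
left_electrode = nlread('electrode.nc', BulkConfiguration)[0]

# 2.) read the checkpoint file, which now contains the equivalent bulk only
equivalent_bulk = nlread('checkpoint.nc', BulkConfiguration)[0]

# 3.) create the device from the equivalent bulk and the electrodes
device = DeviceConfiguration(equivalent_bulk, [left_electrode, left_electrode])

# 4.) attach a new device calculator (devCal defined as in my original script)
device.setCalculator(devCal)

# 5.) update
device.update()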

(By the way, I now use mvapich2 version 1.8a1. It has the nice side effect that this new version does not use MPD anymore. Sadly, I still get the initially mentioned segmentation faults from time to time after the equivalent bulk has finished.)

39
Thanks again for the quick reply.

I will keep you up to date on what I find out about the error. Maybe I can really figure it out (it is almost certainly NOT a matter of insufficient memory). I changed my loop to "setCalculator(calculator())".
I hope it is okay to create the device calculator outside the loop and to attach the electrode calculators via "Device...Calculator(electrode_calculators=[elCalculator, elCalculator])", i.e. without the parentheses after elCalculator. This is how I saw it in the manual.

By the way, I am doing device calculations that always use the same electrodes. It would be nice to be able to save the electrode state once and then reuse it. Is this possible? I did not check what happens if I use the initial_state argument, since my central region changes its geometry (different number of atoms).

40
I determined the different phases of the calculation "backwards": the last one must be the transmission, before that comes the two-probe part, before that the equivalent bulk, and the first one must be the electrode. I started the memory logging quite a while after the calculation had started, so the very first part looks shorter than it actually was, and I stopped the logging a long time after the calculation had crashed, so the length of the flat line at the end is arbitrary. Here is the timing:

Code
| Device Calculation  [Started Fri Nov 25 09:15:55 2011]                       |
| Left Electrode Calculation  [Started Fri Nov 25 09:16:17 2011]               |
| Left Electrode Calculation  [Finished Fri Nov 25 14:30:48 2011]              |
| Device Density Matrix Calculation   [Started Fri Nov 25 14:30:48 2011]       |
| Equivalent Bulk  [Started Fri Nov 25 14:31:51 2011]                          |
| Equivalent Bulk  [Finished Fri Nov 25 19:53:51 2011]                         |
| Device Calculation  [Finished Fri Nov 25 23:44:43 2011]                      |
# Here was the transmission calculation  (most time consuming part, I used 5x5 k-points)
# followed by the error messages
Sat Nov 26 15:07:34 CET 2011  # <- end of output

And: the equivalent bulk took 24 steps, which matches the number of "memory peaks" in the part I marked as equivalent bulk.

41
Update:

I made no changes to the device or the calculator settings. I used 9 quad-core nodes with 8Gb memory each. This time the error occurred later: the run finished the left electrode, the equivalent bulk, the two-probe part and a transmission spectrum. After that a new cycle should have started with two CNT unit cells, but right after the transmission spectrum I got the following error.

Code
+------------------------------------------------------------------------------+
|                                                                              |
| Transmission Spectrum Analysis                                               |
|                                                                              |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Calculating Transmission   : ==================================================

/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 19339 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 29948 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 17088 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:  2341 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:   715 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 25996 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 13896 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 30131 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 20241 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
mpiexec: Warning: tasks 0-8 exited with status 139.

I have attached a picture of the memory usage. You said the equivalent bulk is the most memory-consuming part, but according to this graph the two-probe part uses significantly more memory. Is this possible, or could it be that I am making a mistake somewhere?

The transmission spectrum was correctly written to the nc file, which is very nice, so the crash happened later (maybe at the beginning of the next loop iteration).

I create my calculator object only once (outside the loop which builds a new device) and attach it to each device inside the loop. I use device.setCalculator(calculator), but I could also attach a fresh copy with device.setCalculator(calculator()). Does it make a difference?
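To make clear what I mean, the loop looks roughly like this (only a sketch: make_device(n) stands for my own geometry-building code, devCal is the device calculator created once before the loop, and the nlsave line is simply how I store each result):

Code
# Sketch of my loop over CNT lengths; make_device(n) is my own (not shown) geometry setup.
for n in range(1, 4):                      # e.g. 1, 2, 3 CNT unit cells
    device = make_device(n)                # build the new DeviceConfiguration

    # Variant A: attach the one calculator object created outside the loop.
    device.setCalculator(devCal)

    # Variant B: attach a fresh copy instead.
    # device.setCalculator(devCal())

    device.update()
    nlsave('device_%i.nc' % n, device)     # store the result for this geometry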

The next test will be with (an old) MPICH2 instead of (an old) MVAPICH2. (And up-to-date MPI versions will be available to me soon).

42
Thank you very much for this quick reply.

As I expected, no clear answer to these problems can be given. Sadly. However, let me comment on the things you mention.

0.) It is a pity, but I am afraid to say that this is not 100% reproducible. At least I cannot find a pattern. But I am pretty sure that if I start the same calculation with the same settings (nodes / MPI / OpenMP ...) on the same hardware (node types), then it will always happen.

1.) Your main guess is a good one, but I was always aware of this problem and I ALWAYS grab full nodes (no matter how many cores I actually use).

[[[By the way (but this is surely another story), I was never able to mix MPI and OpenMP in the sense of having more than one MPI process on ONE node and simultaneously more than one OpenMP thread. Example: with nodes=5:ppn=4 I could switch off OpenMP and use 20 MPI processes, OR I could use 5 MPI processes with 4 OpenMP threads each. Something like 10 MPI x 2 OpenMP always crashed. I think there was a post from me about that a long time ago. However, this does not really hurt me much, as I need lots of RAM anyway, so I run 1 MPI process per node.]]]

Do I really need the -n 5 option? If I have -npernode 1, then the number of MPI processes should be known and fixed (to the number of nodes).

2.) "In newer ATK versions you don't need to set the MKL variables, so you can simplify your script a bit."
So I guess 11.2.3 is new enough for that. (And I will ask the cluster admin to install the newest ATK soon.)
By the way: I will test what happens if I put 2 MPI processes on one quad-core node. Maybe my side comment above is not relevant anymore.

3.) On the mpiexec/mpdrun question: actually, I am not 100% sure myself what the difference is. I googled for a while and came to the conclusion that I should use mpiexec. (I found a piece of Python code that may be the source of mpiexec, and inside it I found mpdrun calls.) Also, mpdrun is not widely documented on the web. But I did NOT switch MPI libraries, because to use them on our cluster one has to "activate" them ("module add mpi/mvapich2/gcc422"; I think "module" is some custom script which sets environment variables), otherwise mpdrun and mpiexec are unknown commands. I think the different output formatting really comes from using different nodes.

4.) Okay, I can test MPICH2 instead of MVAPICH2. However, you are right, it is ancient. Let me see if I can convince the admin to install a newer version.

5.) Checkpoint file. Indeed, this caused me some trouble recently: I did not set a checkpoint handler, and the default checkpoint file (somewhere in tmp) could not be written. The way I do it now, the output reports a checkpoint file inside my home directory, exactly where I want it to be. And yes, I have "cd $PBS_O_WORKDIR".

In summary: I will test the old MPICH2 and will try to get an updated MPICH2 or MVAPICH2.

I will report what happens.

43
Hello,

sorry for bringing this up again. Problems similar to the one mentioned above happen to me from time to time. I am somewhat puzzled and am trying to find a way to figure out the source of the problem. The reason I post right now is a set of recent errors, all of which occur right after the equivalent bulk of a device calculation. Is there a large demand for memory at that point?

Here are some details about my calculations (I have not attached a script, because it is rather tricky, with lots of custom stuff; however, I carefully checked the geometry with VNL and am sure it is OK):
MPI-version:  mvapich2-1.0   (there are other MPIs available like MPICH2 1.0.6)
ATK-version:  11.2.3
Geometry:      5x5x3 copper bulk electrodes, 5x5x5 copper left/right and 1 unit cell of (6,0) CNT in between.
Calculator:
Code
basis_set = GGABasis.DoubleZetaPolarized
electron_temperature = 1200
k_point_sampling = (3, 3, 100)
grid_mesh_cutoff = 75
max_steps = 500
tolerance = 4.0e-5
exchange_correlation = GGA.PBE
#electrode_voltages = [0, 0] * Volt
#trans_energies = numpy.linspace(-2, 2, 401) * eV
#trans_kpoints = (5, 5, 1)

numAcc = NumericalAccuracyParameters(electron_temperature = electron_temperature * Kelvin,
                                     k_point_sampling = k_point_sampling,
                                     grid_mesh_cutoff = grid_mesh_cutoff * Hartree)
itCon = IterationControlParameters(max_steps = max_steps,
                                   tolerance = tolerance)
elCal = LCAOCalculator(basis_set = basis_set,
                       exchange_correlation = exchange_correlation,
                       numerical_accuracy_parameters = numAcc,
                       iteration_control_parameters = itCon,
                       poisson_solver = FastFourierSolver([[PeriodicBoundaryCondition] * 2] * 3),
                       checkpoint_handler = CheckpointHandler('checkpoint.nc', 60*Minute))
devCal = DeviceLCAOCalculator(basis_set = basis_set,
                             exchange_correlation = exchange_correlation,
                             numerical_accuracy_parameters = numAcc,
                             iteration_control_parameters = itCon,
                             poisson_solver = FastFourier2DSolver([[PeriodicBoundaryCondition] * 2,
                                                                   [PeriodicBoundaryCondition] * 2,
                                                                   [DirichletBoundaryCondition] * 2]),
                             electrode_calculators = [elCal, elCal],
                             #electrode_voltages = electrode_voltages,
                             checkpoint_handler = CheckpointHandler('checkpoint.nc', 60*Minute))
The very first run, on 16Gb nodes with no CNT but lots of vacuum between the central electrodes, went through without errors. But since the number of 16Gb nodes on our cluster is low, I have now moved to 4Gb nodes, with a CNT in between. Compared to before, the system gained 24 carbon atoms (1 unit cell) and became smaller in volume (1 unit cell of CNT is shorter than the vacuum I had before).

Here are my errors:

The first error is very likely caused by insufficient memory. I have 4Gb per quad-core node and wanted to have one MPI process per node, but I think there might have been two MPI processes on the node below (rank 7 and rank 10???).
Code
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:  1466 Segmentation fault      PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
rank 7 in job 1  chic2g22_34011   caused collective abort of all ranks
  exit status of rank 7: killed by signal 9
rank 10 in job 1  chic2g22_34011   caused collective abort of all ranks
  exit status of rank 10: killed by signal 9

So I changed something in my PBS script.

Old
Code
# Number of nodes needed for mpd-startup
NNODES=$( uniq $PBS_NODEFILE | wc -l )
# Number of CPUs needed for mpiexec
NCPU=$( wc -l < $PBS_NODEFILE )
# Number of CPUs in use
useCPU=$(( $NNODES * $usePPN ))
sort $PBS_NODEFILE | uniq -c | gawk -v ppn=$usePPN '{ printf("%s:%s\n", $2, ppn); }' > mpd.nodes
mpdrun -machinefile mpd.nodes -n $useCPU -env MKL_DYNAMIC FALSE -env MKL_NUM_THREADS 4 $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py
New
Code
usePPN=1
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=4
mpiexec -npernode $usePPN $ATK_BIN_DIR/atkpython $PBS_O_WORKDIR/$PBS_JOBNAME.py

I made sure that every node really has only one atkpython_exec process running. OpenMP seems to work as expected, because the load on the quad-core nodes frequently goes above 25 percent. However, now I got the following error (right after the equivalent bulk had finished):

Code
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3:  6079 Speicherzugriffsfehler  PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 12284 Speicherzugriffsfehler  PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
/lustrefs/apps/atk/11.2.3/atkpython/bin/atkpython: line 3: 31418 Speicherzugriffsfehler  PSEUDOPOTENTIALS_PATH=$EXEC_DIR/../share/pseudopotentials GPAW_SETUP_PATH=$EXEC_DIR/../share/gpaw-setups/ PYTHONHOME=$EXEC_DIR/.. PYTHONPATH= LD_LIBRARY_PATH=$EXEC_DIR/../lib $EXEC_DIR/atkpython_exec $*
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x1314dda0, rbuf=0x2ab5400010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(240)...............:
MPIC_Recv(83).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab7ee1010, rbuf=0x2ab444f010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(228)...............:
MPIC_Send(41).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2abb63b010, rbuf=0x2ab83b1010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab9b96010, rbuf=0x2ab6104010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab9609010, rbuf=0x2ab757f010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=0x2ab7524010, rbuf=0x2ab6820010, count=1705725, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(385)...............:
MPIC_Sendrecv(161)................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
mpiexec: Warning: tasks 0-4,7 exited with status 1.
mpiexec: Warning: tasks 5-6,8 exited with status 139.

In the first lines, "Speicherzugriffsfehler" is German and means memory access error (it is probably just the German message for a segmentation fault); I have no idea why it appears in German here. The only differences between the first and the second run are that I changed the type of node on our cluster (also 4Gb RAM, but the hardware may differ) and that I made sure every node really has just one job. I was a bit doubtful from the beginning that 4Gb of memory would be enough, so I recorded the memory usage in both runs via
Code
top | grep --line-buffered 'atkpython_exec' > mem_XXXXX.txt
and used a little script to plot the memory usage (in percent, as reported by top). The result is attached.
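(The plotting script is nothing more than the following kind of sketch; the file name is the one from the top command above, and the column index of %MEM is an assumption about the top output on our nodes:)

Code
# Minimal sketch of the plotting script; the %MEM column is assumed to be the 10th
# field of a default top process line (index 9), which may differ on other systems.
import matplotlib.pyplot as plt

mem_percent = []
for line in open('mem_XXXXX.txt'):
    fields = line.split()
    if len(fields) > 9:
        # %MEM as reported by top; allow for a German decimal comma just in case
        mem_percent.append(float(fields[9].replace(',', '.')))

plt.plot(mem_percent)
plt.xlabel('sample index (one point per top update)')
plt.ylabel('memory usage of atkpython_exec [%]')
plt.savefig('memory_usage.png')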

This looks very nice: you can see the SCF cycles of the electrode, then the cycles of the equivalent bulk, and right after that finishes the run breaks! Can I be sure that I have enough memory? Does the two-probe part of the calculation, or maybe a step right before it, need much more memory than the equivalent bulk? It would help a lot if this possibility could be eliminated.

What could I do, or what information should I send, to figure out what is going on here? Maybe I need to ask the cluster admins, but there are lots of users here running lots of simulations, so the cluster seems (?) to be quite stable.


Thank you very much.

44
General Questions and Answers / Re: Hückel convergence in 0 steps
« on: September 2, 2011, 11:40 »
Oh sorry, that was a mistake on my side. When I inspected dH I saw ...e+00 and overlooked that there was actually nothing but zero in front. This was the first time that happened to me, as I have used DFT in the past, where this is really uncommon...

Regarding the second question, I pretty much expected your answer but wanted to be really sure.

Thank you very much.

45
General Questions and Answers / Re: Hückel convergence in 0 steps
« on: September 1, 2011, 13:36 »
Yes, I know that there should not be much charge transfer and that convergence should be fast. I am just asking about the criterion that is used to determine convergence. Is dM the difference in Mulliken populations? Why does only dM have to be below the given tolerance, and not dE and dH? Could you please comment on the strategy used to determine convergence? In a past version there was a keyword "criterion". In the latest manual (version 11.2) the information on tolerance does not give a clear statement about the criterion used.

Another small question: is there a serious reason why the Hückel method does not support calculating the electrostatic potential or the electron density? Or is it just not implemented yet?

Pages: 1 2 [3] 4 5 6