QuantumATK Forum

QuantumATK => General Questions and Answers => Topic started by: huangshenjie on July 27, 2012, 17:15

Title: a problem about the result of MTJ
Post by: huangshenjie on July 27, 2012, 17:15
Dear all,
I built an MTJ (magnetic tunnel junction) in ATK, but the results confuse me. The results for the anti-parallel spin configuration look fine; those for the parallel configuration do not.
As the I-V curve shows, the current tends to zero as the bias increases, and so does the transmission spectrum. I posted this problem several days ago but got no reply.
I am now posting the calculation part of my script in the hope that someone can help.
Thanks a lot.
 
Title: Re: a problem about the result of MTJ
Post by: nori on July 27, 2012, 18:41
First of all, the k-point sampling for transmission should be increased.
The Gamma-point approximation is too crude for an MTJ.
You should check how dense a k-point sampling is needed to obtain a converged transmission spectrum.

Title: Re: a problem about the result of MTJ
Post by: Anders Blom on July 27, 2012, 21:02
A reasonable number of k-points is 400 in each direction, A and B (x and y), but you should increase it until the results converge.
Title: Re: a problem about the result of MTJ
Post by: huangshenjie on July 28, 2012, 04:17
Hi nori, thanks for your reply. My MTJ is very complicated: even with k-points set to 2, 2, 10, each bias point takes 2 days to calculate (my computer has 4 CPUs). I don't know how to set the k-points to get a good result without wasting time.
I am posting my MTJ here and hope you can help me set the k-points. Thanks a lot.
Title: Re: a problem about the result of MTJ
Post by: huangshenjie on July 28, 2012, 04:19
Hi Anders Blom, thanks for your reply. My MTJ is very complicated: even with k-points set to 2, 2, 10, each bias point takes 2 days to calculate (my computer has 4 CPUs). I don't know how to set the k-points to get a good result without wasting time.
I am posting my MTJ here and hope you can help me set the k-points. Thanks a lot.
Title: Re: a problem about the result of MTJ
Post by: huangshenjie on July 28, 2012, 07:44
Should I reduce the MgO to two layers to save time?
Title: Re: a problem about the result of MTJ
Post by: nori on July 28, 2012, 08:19
For the SCF part, you can reduce the calculation time by passing the state obtained in a previous calculation as initial_state.
(For instance, use the 0.8 V SCF result as initial_state for the 1.0 V SCF calculation.)
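
The restart chain described above can be sketched in plain Python. This is only an illustration of the ordering, not ATK code: `run_scf` is a hypothetical placeholder for whatever call runs the self-consistent calculation at one bias with an optional initial state, and `fake_scf` is a stand-in so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Optional


def ramp_bias(
    biases: List[float],
    run_scf: Callable[[float, Optional[object]], object],
) -> Dict[float, object]:
    """Run SCF at each bias in order, seeding each run with the previous converged state."""
    states: Dict[float, object] = {}
    previous = None  # the first bias point starts without an initial state
    for bias in biases:
        previous = run_scf(bias, previous)
        states[bias] = previous
    return states


def fake_scf(bias, initial_state):
    """Hypothetical stand-in for the real SCF call; records which state seeded the run."""
    seed = None if initial_state is None else initial_state["bias"]
    return {"bias": bias, "seeded_from": seed}


states = ramp_bias([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], fake_scf)
print(states[1.0]["seeded_from"])  # 0.8
```

Ramping in small bias steps keeps each starting guess close to the converged answer, which is where the time saving is expected to come from.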

For the transmission spectrum, the energy range and the number of points can be reduced, for example to linspace(-1.5, 1.5, 31).
If you find sharp, large peaks in the spectrum, add extra points around those peaks.
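
A minimal sketch of such an energy grid, using only the standard library. The peak position 0.32 and the `half_width` and `extra` parameters are made-up illustration values, and `refine_around` is a hypothetical helper, not an ATK function.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (pure-Python stand-in for numpy.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def refine_around(grid, peaks, half_width=0.05, extra=5):
    """Return the grid plus `extra` evenly spaced points in a window around each peak."""
    points = set(round(e, 6) for e in grid)
    for peak in peaks:
        for e in linspace(peak - half_width, peak + half_width, extra):
            points.add(round(e, 6))  # rounding merges points that coincide
    return sorted(points)


coarse = linspace(-1.5, 1.5, 31)                 # spacing of 0.1
energies = refine_around(coarse, peaks=[0.32])   # 0.32 is a hypothetical peak position
print(len(coarse), len(energies))
```

The coarse grid keeps the cost of the first pass low; the extra points are only added where the first pass shows structure.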



Title: Re: a problem about the result of MTJ
Post by: huangshenjie on July 28, 2012, 10:52
Thank you, nori. I get it! Uh, but I still don't know what I should do about the k-points... 
Title: Re: a problem about the result of MTJ
Post by: Anders Blom on July 28, 2012, 11:26
Note that the k-points needed for a converged transmission spectrum (the k-points you give for the TransmissionSpectrum analysis option) are, most likely, very different from what you need for the density matrix (self-consistent loop) to be accurately determined. So, once you have converged all bias points (using Nori's suggestion will help a lot), you will need to experiment with the transmission spectrum calculation to determine the required k-point sampling. For this you don't have to rerun the whole calculation; you can just restore the self-consistent state from the NC file, compute the transmission spectrum for more and more k-points, and see when the results converge.
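
The "more and more k-points until it converges" loop can be sketched generically as follows. `compute` is a hypothetical placeholder for restoring the converged state and evaluating one scalar (for example the current, or the transmission at one energy) with an n x n k-point grid; `toy_transmission` is an artificial function used only so the sketch runs, and the 1% tolerance is an arbitrary choice.

```python
def converge_kpoints(compute, samplings, rel_tol=0.01):
    """Increase the k-point sampling until two successive results agree within rel_tol.

    Returns (sampling, value) for the first sampling whose result agrees with the
    previous one, or (None, last_value) if no pair of results agreed.
    """
    previous = None
    for n in samplings:
        value = compute(n)
        if previous is not None and abs(value - previous) <= rel_tol * abs(previous):
            return n, value
        previous = value
    return None, previous


def toy_transmission(n):
    """Artificial stand-in that approaches 1.0 as the sampling n grows."""
    return 1.0 + 1.0 / n**2


n, t = converge_kpoints(toy_transmission, [5, 10, 20, 40, 80])
print(n, t)
```

Only the analysis step is repeated in such a loop, so each iteration is far cheaper than a new self-consistent run.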
Title: Re: a problem about the result of MTJ
Post by: Anders Blom on July 28, 2012, 11:28
As a general rule, it's a bit the wrong approach to think about how to lower the accuracy of the calculation to save time. If you have decided to compute this particular structure in this model, you either have to accept the long calculation time, or increase your computational capability (this system will benefit enormously from MPI parallelization, probably you can reduce the time by a factor of 5-10 if run on 10-12 nodes), or choose a simpler system. And by simpler I don't mean less MgO for instance, because in this device the functionality is very dependent on the size of the barrier. In fact, for a really serious calculation you should even investigate the dependence on the barrier thickness ;)
Title: Re: a problem about the result of MTJ
Post by: huangshenjie on July 30, 2012, 06:05
Hi Anders Blom. Unfortunately, I have run into another problem. While trying to find adequate k-points, the log shows this:
+------------------------------------------------------------------------------+
|  75 E = -552.675 dE =  4.281446e-04 dH =  2.221776e-03                       |
+------------------------------------------------------------------------------+
| Calculation Converged in 75 steps                                            |
|                                                                              |
| Fermi Level  = -0.182751 Ha                                                  |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
|                                                                              |
| Equivalent Bulk  [Finished Mon Jul 30 00:31:19 2012]                         |
|                                                                              |
+------------------------------------------------------------------------------+

                            |--------------------------------------------------|
Density Matrix Calculation : ===============================Connection to server [ 172.18.203.57 : 6200 ] lost - trying to reconnect.
Connection to server [ 172.18.203.57 : 6200 ] lost - trying to reconnect.
rank 0 in job 5  bogon_45266   caused collective abort of all ranks
  exit status of rank 0: return code 137



I have never seen this problem before, and I'm sure the network is fine since I can still browse the internet…
Do you know what causes it?
Title: Re: a problem about the result of MTJ
Post by: Anders Blom on July 30, 2012, 07:50
Seems it lost the connection to the license server. ATK accepts that the license server goes offline (or the connection to it is broken) for 5 minutes, but longer than that and it shuts down.
Title: Re: a problem about the result of MTJ
Post by: huangshenjie on July 30, 2012, 08:23
OK… Thank you. It seems I have to run the calculation again.
Title: Re: a problem about the result of MTJ
Post by: Anders Blom on July 30, 2012, 08:47
In this case yes, unfortunately it was interrupted before going into the real calculation, so there is no useful checkpoint file.
Title: Re: a problem about the result of MTJ
Post by: huangshenjie on July 31, 2012, 04:12
Uh… I am sorry to say that I tried again, but the problem still occurs!

Density Matrix Calculation : ===============================Connection to server [ 172.18.203.57 : 6200 ] lost - trying to reconnect.
Connection to server [ 172.18.203.57 : 6200 ] lost - trying to reconnect.
rank 0 in job 5  bogon_45266   caused collective abort of all ranks
  exit status of rank 0: return code 137

Is there anything wrong with my script? I am getting discouraged by this problem…
Title: Re: a problem about the result of MTJ
Post by: Anders Blom on July 31, 2012, 14:39
No, this has nothing to do with ATK really, it appears your network is not very reliable. Since you are running this calculation in parallel it's hard to provide any other solution, because you need a network license. For serial operation we could arrange a node-locked license if you run all calculations on the same computer, but then you lose the parallel performance advantage. One solution might be to relocate the license server to a computer closer to the computational machines, thus making it less sensitive to network outages. If you want to run the license server on a different machine please contact us by email with customer ID etc.
Title: Re: a problem about the result of MTJ
Post by: huangshenjie on August 2, 2012, 03:51
Thank you for the detailed replies. In fact, we run the calculation on a single computer, and the license server runs on that same computer, so I don't really think the network is the problem.

We tried again with a good, stable network, and it terminated like this:
Quote
rank 0 in job 5  bogon_45266   caused collective abort of all ranks
  exit status of rank 0: return code 137

There are no connection failures now. What caused the problem?

I use "mpiexec -n 4"; the CPU has 4 cores, 8 processors, and the machine has 8 GB of memory. Could it be caused by the number of processes we use? Or should I recompile a newer MPI? Earlier topics on this forum seem to suggest that.

P.S. We are still using version 10.8, if that matters...

Sorry for my poor English. This is my first time dealing with parallel calculations ^_^
Title: Re: a problem about the result of MTJ
Post by: Anders Blom on August 2, 2012, 13:19
The message from MPI is in some sense irrelevant; it can't be used to troubleshoot the cause, it just shows that all calculations were shut down by order of the master node. The interesting part is the error message from ATK. Earlier, the one you showed came from the license system, so that was clear.

If there is no error message at all from ATK, it's usually because you have run out of memory. Try with fewer nodes just to see how much memory the calculation needs, before running in parallel.
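
As a side note, the number in "return code 137" can be decoded under the common Unix convention that a status above 128 means "terminated by signal (status - 128)". That gives signal 9 (SIGKILL), which is what the Linux kernel sends when it kills a process for using too much memory, so it is at least consistent with the out-of-memory explanation. The helper below is a small illustrative sketch; `describe_exit_status` is a hypothetical name, and it assumes the MPI launcher reports the status in this convention.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status; values above 128 denote death by signal."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"  # signal number not defined on this platform
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


print(describe_exit_status(137))
```

A SIGKILL cannot be caught by the program, which would also explain why no error message from ATK itself appears in the log.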

Is this a single machine with 8 GB? How many sockets/cores? (4 core 8 processors makes no sense :) ). If you run 4 MPI processes on one node, it means you are limited to a problem which in serial requires about 2 GB.
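
The arithmetic behind the 2 GB figure, as a tiny sketch. It assumes, as the paragraph above does, that every MPI process needs roughly as much memory as the serial run; `reserved_gb` is a hypothetical parameter for memory already taken by the operating system and other programs.

```python
def memory_per_process_gb(total_gb, n_processes, reserved_gb=0.0):
    """Memory available to each MPI process when the processes share one machine."""
    return (total_gb - reserved_gb) / n_processes


print(memory_per_process_gb(8, 4))  # 2.0
print(memory_per_process_gb(8, 2))  # 4.0
```

Halving the number of processes on the same machine doubles the memory each one can use, at the price of less parallel speed-up.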
Title: Re: a problem about the result of MTJ
Post by: reverland on August 2, 2012, 15:22
Hi, this is huangshenjie, posting from a different account for now...

There is no error message from ATK.
Yes, it's a single machine with 8 GB.
I don't really understand sockets/cores... I read the parallel guide, but many of the terms still confuse me...
Quote
[atk@bogon ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5323.63
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.05
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.03
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 1
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.01
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.01
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 2
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.04
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5319.95
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5319.70
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

Maybe it is out of memory. I'll try fewer nodes (if that is what it means).

Quote
[atk@bogon ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:          7978       7123        854          0         45        200
-/+ buffers/cache:       6877       1100
Swap:         8001       2230       5770

I have now recompiled MPICH2 and am waiting for the result.
Title: Re: a problem about the result of MTJ
Post by: Anders Blom on August 2, 2012, 15:26
Look at the "Physical ID", it is either 0 or 1, so you have 2 sockets (physical processor units), each of which has 4 cores.

Also - while your machine has 8 GB in total, over 7 GB is currently used, so there is no memory left to do any simulations.
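
The same reading of /proc/cpuinfo can be automated. The sketch below counts distinct "physical id" values (sockets) and distinct (physical id, core id) pairs (cores); `count_sockets_and_cores` is a hypothetical helper, and the short `sample` text is a cut-down excerpt in the same format as the listing above, used only so the example runs anywhere.

```python
def count_sockets_and_cores(cpuinfo_text):
    """Return (number of sockets, number of physical cores) from /proc/cpuinfo text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value  # remembered until the matching "core id" line
        elif key == "core id":
            cores.add((physical_id, value))
    sockets = {pid for pid, _ in cores}
    return len(sockets), len(cores)


sample = """\
processor       : 0
physical id     : 0
core id         : 0

processor       : 1
physical id     : 1
core id         : 0

processor       : 2
physical id     : 0
core id         : 1
"""
print(count_sockets_and_cores(sample))  # (2, 3)
```

On the machine itself one would pass open("/proc/cpuinfo").read() instead of the sample; for the listing in this thread that gives 2 sockets and 8 cores.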
Title: Re: a problem about the result of MTJ
Post by: reverland on August 3, 2012, 01:52
Much appreciated! It looks like my MTJ is so complex that the calculation takes too much memory.
I've reduced the number of processes to 2 and am trying again.
My last attempt was interesting: one of the processes was killed but the other three kept running. Also, there was hardly any memory left.
Thanks for all the replies, they were really helpful.