Topic: a problem about the result of MTJ

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5418
  • Country: dk
  • Reputation: 89
    • QuantumATK at Synopsys
Re: a problem about the result of MTJ
« Reply #15 on: July 31, 2012, 14:39 »
No, this really has nothing to do with ATK; it appears your network is not very reliable. Since you are running this calculation in parallel, it's hard to offer any other solution, because you need a network license. For serial operation we could arrange a node-locked license if you run all calculations on the same computer, but then you lose the parallel performance advantage. One solution might be to relocate the license server to a computer closer to the computational machines, making it less sensitive to network outages. If you want to run the license server on a different machine, please contact us by email with your customer ID etc.
« Last Edit: July 31, 2012, 14:41 by Anders Blom »

Offline huangshenjie

  • Heavy QuantumATK user
  • ***
  • Posts: 40
  • Country: cn
  • Reputation: 0
Re: a problem about the result of MTJ
« Reply #16 on: August 2, 2012, 03:51 »
Thank you for the detailed replies. But in fact we run the calculation on only one computer, and the license server runs on that same computer, so I don't really think it is a network problem.

We tried again with a good, stable network, and it terminated like this:
Quote
rank 0 in job 5  bogon_45266   caused collective abort of all ranks
  exit status of rank 0: return code 137

There is no failed connection now. What caused the problem?

I use "mpiexec -n 4", and the CPU has 4 core 8 processors and 8 GB of memory. Could it be caused by an unsuitable number of processes? Or should I recompile a newer MPI? Judging from earlier topics about this here, that seems possible.

PS: We are still using 10.8, if that matters...

Sorry for my poor English. This is my first time dealing with parallel calculations ^_^

Offline Anders Blom

Re: a problem about the result of MTJ
« Reply #17 on: August 2, 2012, 13:19 »
The message from MPI is in some sense irrelevant; it can't be used to troubleshoot the cause. It just shows that all calculations were shut down by order of the master node. The interesting part is the error message from ATK. Earlier, the message you showed came from the license system, so that was clear.

If there is no error message at all from ATK, it's usually because you have run out of memory. Try with fewer nodes, just to see how much memory the calculation needs, before running in parallel.
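One detail in the MPI output is at least consistent with a memory problem, although it proves nothing on its own. If "return code 137" follows the common shell convention of 128 plus a signal number, it corresponds to signal 9 (SIGKILL), which is the signal the Linux out-of-memory killer sends. A minimal Python sketch of that decoding, where the helper name describe_exit_status is made up for illustration:

```python
import signal


def describe_exit_status(code: int) -> str:
    """Describe a shell-style exit status; values above 128 encode a signal."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


# 137 - 128 = 9, which is SIGKILL on Linux.
print(describe_exit_status(137))
```

SIGKILL can also come from other sources (for example a manual "kill -9"), so this only narrows the possibilities.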

Is this a single machine with 8 GB? How many sockets/cores? (4 core 8 processors makes no sense :) ). If you run 4 MPI processes on one node, you are limited to a problem that requires about 2 GB in serial.
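The 2 GB figure is simple division. As a rough sketch in Python, assuming each MPI process needs about as much memory as the serial run and ignoring what the operating system itself uses (the function name is only illustrative):

```python
def serial_memory_limit_gb(total_gb: float, n_processes: int) -> float:
    """Largest serial memory footprint that fits when every MPI process
    on the node holds roughly a full copy of the problem."""
    if n_processes < 1:
        raise ValueError("need at least one process")
    return total_gb / n_processes


# An 8 GB node shared by 1, 2, or 4 MPI processes.
for n in (1, 2, 4):
    print(f"{n} process(es): about {serial_memory_limit_gb(8.0, n):.1f} GB each")
```

In practice the usable amount is lower, since other programs on the machine also take memory.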
« Last Edit: August 2, 2012, 13:23 by Anders Blom »

Offline reverland

  • New QuantumATK user
  • *
  • Posts: 2
  • Country: cn
  • Reputation: 0
Re: a problem about the result of MTJ
« Reply #18 on: August 2, 2012, 15:22 »
Hi, this is huangshenjie, posting from this account for now.

There is no error message from ATK.
Yes, it's a single machine with 8 GB.
I don't really understand sockets/cores... I read the parallel guide, but many of the terms still confuse me.
Quote
[atk@bogon ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5323.63
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.05
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.03
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 1
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.01
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.01
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 2
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5320.04
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5319.95
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 11
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5319.70
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:
Maybe it is out of memory. I'll try fewer nodes (if that is what it means).

Quote
[atk@bogon ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:          7978       7123        854          0         45        200
-/+ buffers/cache:       6877       1100
Swap:         8001       2230       5770

I have now recompiled mpich2 and am waiting for the result.

Offline Anders Blom

Re: a problem about the result of MTJ
« Reply #19 on: August 2, 2012, 15:26 »
Look at the "physical id" field: it is either 0 or 1, so you have 2 sockets (physical processor units), each of which has 4 cores.
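The same count can be done mechanically. Here is a small Python sketch that counts distinct "physical id" values (sockets) and distinct ("physical id", "core id") pairs (cores). The sample text is an abbreviated copy of the /proc/cpuinfo listing posted earlier, keeping only the relevant fields; the function name is illustrative.

```python
from typing import Tuple


def count_sockets_and_cores(cpuinfo_text: str) -> Tuple[int, int]:
    """Return (sockets, physical cores) from /proc/cpuinfo-style text."""
    sockets = set()
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor entries
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
            sockets.add(physical_id)
        elif key == "core id":
            # A core is identified by its socket plus its id within the socket.
            cores.add((physical_id, value))
    return len(sockets), len(cores)


# (physical id, core id) for processors 0..7, as in the posted listing.
PAIRS = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
SAMPLE = "\n\n".join(
    f"processor       : {n}\nphysical id     : {p}\ncore id         : {c}"
    for n, (p, c) in enumerate(PAIRS)
)

sockets, cores = count_sockets_and_cores(SAMPLE)
print(f"{sockets} sockets, {cores} cores in total")  # 2 sockets, 8 cores in total
```

On the real machine, the text could be read with open("/proc/cpuinfo").read() instead of using the sample.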

Also - while your machine has 8 GB in total, over 7 GB is currently used, so there is no memory left to do any simulations.
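To read those numbers off programmatically, a minimal Python sketch can parse the "free -m" output posted earlier. It assumes the older procps layout that includes a "-/+ buffers/cache" line (newer versions print an "available" column instead); the function name and dictionary keys are illustrative.

```python
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:          7978       7123        854          0         45        200
-/+ buffers/cache:       6877       1100
Swap:         8001       2230       5770
"""


def summarize_free(output: str) -> dict:
    """Extract a few values (in MB) from old-style `free -m` output."""
    result = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            result["total_mb"] = int(fields[1])
            result["used_mb"] = int(fields[2])
        elif fields[0] == "-/+":
            # Last column: memory free once buffers and cache are reclaimed.
            result["available_mb"] = int(fields[-1])
        elif fields[0] == "Swap:":
            result["swap_used_mb"] = int(fields[2])
    return result


summary = summarize_free(FREE_OUTPUT)
print(summary)
```

For the posted output this gives 7123 MB used out of 7978 MB, about 1100 MB available after reclaiming buffers and cache, and 2230 MB of swap already in use.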

Offline reverland

Re: a problem about the result of MTJ
« Reply #20 on: August 3, 2012, 01:52 »
I really appreciate it! It looks like my MTJ is so complex that the calculation takes too much memory.
I've reduced the number of processes to 2 and am trying again.
My last attempt was interesting: one of the processes was killed, but the other three kept running. Also, there was hardly any memory left.
Thanks for all the replies, they were really helpful.