Author Topic: Speed issue in Two probe configuration  (Read 5387 times)

Offline abhola

  • Regular QuantumATK user
  • **
  • Posts: 5
  • Reputation: 0
    • View Profile
Speed issue in Two probe configuration
« on: May 21, 2009, 13:37 »
Hi,

I am benchmarking a two-probe configuration using the ATK software.

We have two systems:
   1) 16 CPUs, with per-CPU configuration
      cpu MHz         : 2400.090
      cache size      : 4096 KB

   2) 8 CPUs, with per-CPU configuration
      cpu MHz         : 3000.116
      cache size      : 6144 KB

The problem is that on the 16 CPU system the running time is approximately 12 hours, but on the 8 CPU system it is much more than 12 * 2 (24) hours.
On the 16 CPU machine one loop takes 3-4 minutes, but on the 8 CPU machine it takes 18-40 minutes.

We are using MPICH2.

Can someone please tell me whether this is expected behaviour?
Or is there some configuration or system problem, or something else we can try?

Regards
Anshu

Offline Nordland

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 812
  • Reputation: 18
    • View Profile
Re: Speed issue in Two probe configuration
« Reply #1 on: May 21, 2009, 22:21 »
Well, it is always nice to see superlinear speedup :)

How different is the hardware in the two clusters? Same memory, same motherboard?

Offline abhola

  • Regular QuantumATK user
Re: Speed issue in Two probe configuration
« Reply #2 on: May 22, 2009, 14:49 »
Thanks for replying.

On the 8 CPU system we have:
# free -m
             total       used       free     shared    buffers     cached
Mem:         14033        741      13292          0          5         96
-/+ buffers/cache:        639      13394
Swap:         1983         65       1918

On the 16 CPU system it is:

             total       used       free     shared    buffers     cached
Mem:         16051        329      15721          0         82        123
-/+ buffers/cache:        123      15928
Swap:         1983         51       1932

The CPU configurations are different, but the 8 CPU system has the better configuration of the two.
Even so, the 8 CPU system is taking much more time.

PER CPU CONFIG in 16 CPU system
------------------------------------
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7340  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.090
cache size      : 4096 KB
physical id     : 6
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 4800.39
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
---------------------------------------------
PER CPU CONFIG in 8 CPU system
---------------------------------------------
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping        : 6
cpu MHz         : 3000.116
cache size      : 6144 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 6000.19
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:
----------------------------------------------------------

The boards are the same.

Regards
Anshu

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
  • *****
  • Posts: 5411
  • Country: dk
  • Reputation: 89
    • View Profile
    • QuantumATK at Synopsys
Re: Speed issue in Two probe configuration
« Reply #3 on: May 22, 2009, 18:29 »
Do you use the exact same parallelization strategy on the two clusters? I'm particularly referring to the loading of MPI nodes vs. CPUs w.r.t. the number of cores/sockets. You should avoid using more MPI nodes than physical nodes (i.e., if you have for instance 2 dual cores, you may not get very good performance using 4 MPI nodes on this system).

Also, sometimes on Linux the OS will not report the correct number of cores available for threading to the application. You can ensure ATK knows how many threads it can start on each node by setting MKL_NUM_THREADS by hand (see the manual!).
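
As a rough sketch of the advice above (not taken from the ATK manual): one way to avoid oversubscribing the cores is to pick the MKL thread count so that MPI processes times threads per process does not exceed the number of physical cores. The helper names, the "atk" executable name and "script.py" are assumptions for illustration only; MKL_NUM_THREADS and "mpiexec -n" are the names mentioned in this thread. Nothing is launched here, the code only builds the environment and command line.

```python
import os


def mkl_threads_per_process(physical_cores, mpi_processes):
    """Largest thread count per MPI process that does not oversubscribe the cores."""
    if physical_cores < 1 or mpi_processes < 1:
        raise ValueError("core and process counts must be positive")
    # Never go below one thread, even if there are more processes than cores.
    return max(1, physical_cores // mpi_processes)


def build_launch(physical_cores, mpi_processes, script="script.py"):
    """Return (env, command) for a hypothetical launch; nothing is executed."""
    env = dict(os.environ)
    env["MKL_NUM_THREADS"] = str(
        mkl_threads_per_process(physical_cores, mpi_processes)
    )
    # "atk" as the executable name is an assumption; substitute your own launcher.
    command = ["mpiexec", "-n", str(mpi_processes), "atk", script]
    return env, command


env, command = build_launch(physical_cores=8, mpi_processes=8)
print("MKL_NUM_THREADS =", env["MKL_NUM_THREADS"])
print(" ".join(command))
```

With 8 MPI processes on 8 cores this gives one MKL thread per process; with fewer MPI processes each one gets proportionally more threads.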

Offline abhola

  • Regular QuantumATK user
Re: Speed issue in Two probe configuration
« Reply #4 on: May 23, 2009, 08:22 »
Yes, I am starting 8 threads on the 8 CPU cluster and 16 threads on the 16 CPU cluster. I also tried setting the MKL variable, but nothing seems to help :(

Regards
Anshu

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
Re: Speed issue in Two probe configuration
« Reply #5 on: May 25, 2009, 15:26 »
It's kind of hard to troubleshoot such things remotely... Is the difference reproducible each time you run? Are there no other jobs running that might load the nodes?

You write threads, but I assume you mean MPI nodes, i.e. on the 8-node cluster you run mpiexec with "-n 8", and "-n 16" on the 16-node cluster?

Offline abhola

  • Regular QuantumATK user
Re: Speed issue in Two probe configuration
« Reply #6 on: May 26, 2009, 05:14 »
Yes, the difference is reproducible in each run. There is nothing running on the machines except ATK.
And yes, I meant MPI nodes.


Regards
Anshu

Offline abhola

  • Regular QuantumATK user
Re: Speed issue in Two probe configuration
« Reply #7 on: May 26, 2009, 06:07 »
One more observation: if I run in serial mode, without using MPI, then

the 16 CPU machine takes 48 minutes per loop.
the 8 CPU machine takes 52 minutes per loop.

And that is even though the 8 CPU machine has faster CPUs and more cache?
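
For reference, the per-loop timings reported in this thread can be turned into a speedup and a parallel efficiency for each machine. This is only a sketch of the arithmetic: the function names are made up for illustration, and it assumes the serial and MPI runs on each machine are for the same calculation.

```python
def speedup(serial_minutes, parallel_minutes):
    """How many times faster the parallel run is than the serial run."""
    return serial_minutes / parallel_minutes


def efficiency(serial_minutes, parallel_minutes, processes):
    """Speedup divided by the number of MPI processes (1.0 means ideal scaling)."""
    return speedup(serial_minutes, parallel_minutes) / processes


# Per-loop timings in minutes, as reported in this thread.
machines = {
    "16 CPU": {"serial": 48, "parallel": (3, 4), "processes": 16},
    "8 CPU": {"serial": 52, "parallel": (18, 40), "processes": 8},
}

for name, m in machines.items():
    fast, slow = m["parallel"]
    low = speedup(m["serial"], slow)
    high = speedup(m["serial"], fast)
    worst = efficiency(m["serial"], slow, m["processes"])
    best = efficiency(m["serial"], fast, m["processes"])
    print(f"{name}: speedup {low:.1f}-{high:.1f}x, efficiency {worst:.0%}-{best:.0%}")
```

This works out to a speedup of 12-16x (75%-100% efficiency) on the 16 CPU machine, but only about 1.3-2.9x (16%-36% efficiency) on the 8 CPU machine. The serial times are close, so the gap shows up almost entirely in the MPI runs.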


Offline Nordland

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
Re: Speed issue in Two probe configuration
« Reply #8 on: May 26, 2009, 07:40 »
A fun puzzle  ;D

Do you know if it is only ATK that shows this problem?

My guess would still be, with something like 99% certainty, that it is a hardware "issue". And I say "issue" :) because it is nice to have superlinear speedup.

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
Re: Speed issue in Two probe configuration
« Reply #9 on: May 26, 2009, 09:20 »
I was just about to suggest running without MPI... :)

Cache size can certainly play a role when you are shuffling huge amounts of data around.

Another thing, do you have the same OS on both machines?

When you measure the time per SCF cycle, is that done by taking the total compute time divided by the number of steps, or by timing each cycle (using verbosity=20)? If you use the total time, note that the calculation may take a different number of steps on the two machines in the MPI case, so take care to normalize by each respective number of steps...
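
To illustrate the normalization: two runs with the same total time can have quite different times per SCF step if they converge in a different number of steps. The step counts below are hypothetical, chosen only to show the arithmetic (720 minutes is the roughly 12 hours mentioned earlier in the thread), and the function name is made up.

```python
def minutes_per_step(total_minutes, scf_steps):
    """Average wall time per SCF step for one run."""
    if scf_steps < 1:
        raise ValueError("a run must have at least one SCF step")
    return total_minutes / scf_steps


# Hypothetical (total minutes, number of SCF steps) for two runs.
runs = {
    "run A": (720, 200),
    "run B": (720, 150),
}

for name, (total, steps) in runs.items():
    print(f"{name}: {minutes_per_step(total, steps):.1f} min/step over {steps} steps")
```

Here both runs take 12 hours in total, yet one averages 3.6 minutes per step and the other 4.8, so comparing total times alone would hide the difference.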

Offline Anders Blom

  • QuantumATK Staff
  • Supreme QuantumATK Wizard
Re: Speed issue in Two probe configuration
« Reply #10 on: May 26, 2009, 10:43 »
The 8 and 16 CPUs, how are they distributed w.r.t. RAM? That is, for the 8 CPUs, are they located in 8 separate boxes, or in 4 machines with two sockets, or 2 machines with 4 sockets? Same for the 16.

Competition for RAM, cache and communication bandwidth will depend critically on these factors.
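
One way to answer the socket question on each box is to group the "physical id" and "core id" fields of /proc/cpuinfo (the same fields that appear in the listings posted earlier). The sketch below is illustrative only: the function name is made up, the sample text is a shortened, invented excerpt, and it assumes "physical id" appears before "core id" within each processor entry, as it does in the posted output.

```python
from collections import defaultdict


def socket_layout(cpuinfo_text):
    """Map each 'physical id' (socket) to the set of 'core id' values seen on it."""
    sockets = defaultdict(set)
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            physical_id = None  # a new logical CPU entry starts here
        elif key == "physical id":
            physical_id = int(value)
        elif key == "core id" and physical_id is not None:
            sockets[physical_id].add(int(value))
    return dict(sockets)


# Invented three-CPU excerpt in /proc/cpuinfo format, for illustration only.
sample = """processor       : 0
physical id     : 0
core id         : 0

processor       : 1
physical id     : 0
core id         : 1

processor       : 2
physical id     : 1
core id         : 0
"""

layout = socket_layout(sample)
for socket_id in sorted(layout):
    print(f"socket {socket_id}: {len(layout[socket_id])} core(s)")
print("sockets in this box:", len(layout))
```

On a real machine, pass the contents of /proc/cpuinfo instead of the sample. Running it on every box shows how many sockets and cores each one has, and therefore how many MPI processes end up sharing the same RAM and cache.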