1
Computer Architectures ... High Performance Computing I
  • Fall 2001
  • MAE609 /Mth667
  • Abani Patra

2
Microprocessor
  • Basic Architecture
  • CISC vs. RISC
  • Superscalar
  • EPIC

3
Performance
  • Measures
  • Floating Point Operations Per Second (FLOPS)
  • 1 MFLOP -- workstations
  • 1 GFLOP -- readily available HPC
  • 1 TFLOP -- BEST NOW!!
  • 1 PFLOP -- 2010??

4
Performance
  • Ttheor: theoretical peak performance, obtained by
    multiplying the clock rate by the no. of CPUs and
    the no. of FPUs/CPU
  • Treal: real performance on some specific operation,
    e.g. vector add and multiply
  • Tsustained: sustained performance on an
    application, e.g. CFD
  • Tsustained << Treal << Ttheor
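A worked example with hypothetical numbers (not from the slides): a
machine with 4 CPUs, 2 FPUs per CPU, and a 500 MHz clock has
Ttheor = 500 x 4 x 2 = 4000 MFLOPS, i.e. 4 GFLOPS; Treal and
Tsustained on real codes will fall well below this.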

5
Performance
  • Performance degrades if the CPU has to wait for
    data to operate on
  • Fast CPU => need adequately fast memory
  • Rule of thumb --
  • Memory in MB = Ttheor in MFLOPS
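For example, under this rule of thumb the hypothetical 4000 MFLOPS
machine above would be configured with roughly 4000 MB (4 GB) of
memory.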

6
Making a Supercomputer Faster
  • Reduce Cycle time
  • Pipelining
  • Instruction Pipelines
  • Vector Pipelines
  • Internal Parallelism
  • Superscalar
  • EPIC
  • External Parallelism

7
Making a SuperComputer Faster
  • Reduce Cycle time
  • increase clock rate
  • Limited by semiconductor manufacture!
  • Current generation 1-2 GHz (immediate future
    10 GHz)
  • Pipelining
  • fine subdivision of an operation into
    sub-operations leading to shorter cycle time but
    larger start-up time

8
Pipelining
  • Instruction Pipelining
  • 4 stage instruction pipeline

4 pipeline stages: (1) Fetch Ins  (2) Fetch Data  (3) Execute  (4) Store
cycle        1   2   3   4   5   6
Fetch Ins    A   B   C
Fetch Data       A   B   C
Execute              A   B   C
Store                    A   B   C
  • 3 instructions A, B, C
  • 4 cycles needed by each instruction
  • one result per cycle after the pipe is full --
    startup time
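A small sketch (an addition, not part of the original slides) of the
cycle count the diagram implies: an s-stage pipeline finishes n
instructions in s + (n - 1) cycles, e.g. 4 + (3 - 1) = 6 cycles for
A, B, C above.

      program pipedemo
c     sketch (not from the slides): an s-stage pipeline needs
c     s cycles to fill, then delivers one result per cycle
      integer s, n
      s = 4
      n = 3
      print *, 'cycles needed: ', s + (n - 1)
      end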

9
Pipelining
  • Almost all current computers use some pipelining
    e.g. IBM RS6000
  • Speedup of instruction pipelining cannot always
    be achieved !!
  • Next instruction may not be known till execution
    -- e.g. a branch
  • Data for execution may not be available

10
Vector Pipelines
  • Effective for operations like
  • do 10 I=1,1000
  • 10 c(I)=a(I)+b(I)
  • same instruction executed 1000 times with
    different data
  • using a vector pipe the whole loop is one
    vector instruction
  • Cray XMP, YMP, T90 ...

11
Vector pipelining
  • For some operations like
  • a(I) = b(I) + c(I)*d(I)
  • the results of the multiply are chained to the
    addition pipeline
  • Disadvantage
  • startup time of the vector pipes
  • code has to be vectorized -- loops have to be
    blocked into vector lengths

12
Internal Parallelism
  • Use multiple Functional Units per processor
  • Cray T90 has 2-track vector units; NEC SX4,
    Fujitsu VPP300 -- 8-track vector units
  • superscalar e.g. IBM RS6000 Power2 uses 2
    arithmetic units
  • EPIC
  • Need to provide data to multiple functional units
    => fast memory access
  • Limiting factor is memory-processor bandwidth

13
External Parallelism
  • Use multiple processors
  • Shared Memory (SMP = Symmetric Multi-Processors)
  • many processors accessing the same memory
  • limited by memory-processor bandwidth
  • SUN Ultra2, SGI Octane, SGI Onyx, Compaq ...

(Diagram: processors connected to shared memory banks)
14
External Parallelism
  • Distributed memory
  • many processors each with local memory and some
    type of high speed interconnect

(Diagram: CPUs, each with local memory, connected by an
interconnection network.)
E.g. IBM SPx, Cray T3E, network of workstations, Beowulf
clusters of Pentium PCs
15
External Parallelism
  • SMP Clusters
  • nodes with multiple processors that share local
    memory; nodes connected by an interconnect
  • best of both ?

16
Classification of Computers
  • Hardware
  • SISD (Single Instruction Single Data)
  • SIMD (Single Instruction Multiple Data)
  • MIMD (Multiple Instruction Multiple Data)
  • Programming Model
  • SPMD (Single Program Multiple Data)
  • MPMD (Multiple Program Multiple Data)

17
Hardware Classification
  • SISD (Single Instruction Single Data)
  • classical scalar/vector computer -- one
    instruction one datum
  • superscalar -- instructions may run in parallel
  • SIMD (Single Instruction Multiple Data)
  • vector computers
  • Data Parallel -- Connection Machine etc., extinct
    now

18
Hardware Classification
  • MIMD (Multiple Instruction Multiple Data)
  • usual parallel computer
  • each processor executes its own instructions on
    different data streams
  • need synchronization to get meaningful results

19
Programming Model
  • SPMD(Single Program Multiple Data)
  • single program is run on all processors with
    different data
  • each processor knows its ID -- thus
      if (procID .eq. N) then
        ...
      else
        ...
      endif
  • constructs like this can be used for program
    control (see the MPI sketch below)
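A minimal SPMD sketch using MPI (MPI is introduced later in these
slides; the program text here is an illustrative assumption, not
taken from the slides): the same program runs on every processor and
the processor ID selects the branch.

      program spmd
      include 'mpif.h'
      integer ierr, myid, nprocs
      call MPI_INIT(ierr)
c     every processor runs this same program; the rank
c     returned here is the processor ID used for control
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      if (myid .eq. 0) then
         print *, 'master of ', nprocs, ' processors'
      else
         print *, 'worker ', myid
      endif
      call MPI_FINALIZE(ierr)
      end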

20
Programming Model
  • MPMD(Multiple Program Multiple Data)
  • Different programs run on different processors
  • usually a master-slave model is used

21
Topologies/Interconnects
  • Hypercube
  • Torus

22
  • Prototype Supercomputers and Bottlenecks

23
Types of Processors/Computers used in HPC
  • Prototype processors
  • Vector Processors
  • Superscalar Processors
  • Prototype Parallel Computers
  • Shared Memory
  • Without Cache
  • With Cache (SMP)
  • Distributed Memory

24
Vector Processors
25
Vector Processors
  • Components
  • Vector registers
  • ADD/Logic pipeline and MULTIPLY Pipelines
  • Load/Store pipelines
  • Scalar registers and pipelines

26
Vector Registers
  • Finite length of vector registers: 32/64/128 etc.
  • Strip mining to operate on longer vectors (see
    the sketch below)
  • Codes often manually restructured into
    vector-length loops
  • Sawtooth performance curve -- maxima at
    multiples of the vector length
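A strip-mining sketch (an illustration added here; the vector length
of 64 is an assumption): the long loop is blocked into pieces no
longer than one vector register.

      program strip
      integer n, lvec
      parameter (n = 1000, lvec = 64)
      real a(n), b(n), c(n)
      integer i, is
      do i = 1, n
         a(i) = i
         b(i) = 2*i
      enddo
c     outer loop walks through strips; each inner loop is at
c     most lvec elements, i.e. one vector register load
      do is = 1, n, lvec
         do i = is, min(is + lvec - 1, n)
            c(i) = a(i) + b(i)
         enddo
      enddo
      print *, c(n)
      end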

27
Vector Processors
  • Memory-processor bandwidth
  • performance depends completely on keeping the
    vector registers supplied with operands from
    memory
  • Size of main memory and extended memory
  • bandwidth of main memory is much higher but main
    memory is more expensive
  • size determines the size of problem that can be
    run
  • scalar registers/scalar processors for scalar
    instructions
  • I/O goes through a special processor --
  • T90 can produce data at 14400 MB/s -- disk only
    20 MB/s (14400/20 = 720). Thus a single word can
    take 720 cycles on the Cray T90!!

28
Superscalar Processor
  • Workstations and nodes of parallel
    supercomputers

29
Superscalar Processor
  • main components are
  • Multiple ALU and FPU
  • data and instruction caches
  • superscalar since the ALU and FPUs can operate
    in parallel producing more than one result per
    cycle
  • e.g. IBM POWER2 -- 2 FPU/ALUs each can operate
    in parallel producing up to 4 results per cycle
    if operands are in registers

30
Superscalar Processor
  • RISC architecture operating at very high clock
    speeds (>1 GHz now -- more in a year)
  • Processor works only on data in registers, which
    come only from and go only to the data cache. If
    data is not in cache -- a cache miss -- the
    processor is idle while another cache line (4-16
    words) is fetched from memory!!

31
Superscalar Processor
  • Large off-chip Level 2 caches help with data
    availability. L1 cache data is accessed in 1 or 2
    cycles, L2 cache in 3 or 4 cycles, and memory
    can be 8 times that!
  • Efficiency directly related to reuse of data in
    cache
  • Remedies
  • Blocked algorithms,
  • contiguous storage,
  • avoid strides and random/non-deterministic
    access

32
Superscalar Processor
  • Remedies
  • Blocked algorithms, e.g. restructure
      do I = 1, 1000
        a(I) = ...
    into
      do j = 1, 20
        do i = (j-1)*50 + 1, j*50
          a(i) = ...
  • contiguous storage,
  • avoid strides and random/non-deterministic
    access, e.g. indirect addressing a(ix(i)) = ...
    (see the sketch below)
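A sketch of the stride remedy (added for illustration, not from the
slides): Fortran stores arrays column by column, so making the first
index the inner loop gives contiguous, cache-line-friendly access.

      program stride
      integer n
      parameter (n = 500)
      real a(n,n), s
      integer i, j
      do j = 1, n
         do i = 1, n
            a(i,j) = 1.0
         enddo
      enddo
      s = 0.0
c     good order: inner loop over the first index walks through
c     memory with stride 1 and reuses each cache line; swapping
c     the loops would stride by n and miss far more often
      do j = 1, n
         do i = 1, n
            s = s + a(i,j)
         enddo
      enddo
      print *, s
      end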

33
Superscalar Processors
  • Memory bandwidth critical to performance
  • Many engineering applications are difficult to
    optimize for cache efficiency
  • Application efficiency is tied to memory bandwidth
  • Size of memory determines size of problem that
    can be solved
  • DMA (direct memory access) channels take
    memory-access duties for external operations (I/O,
    remote processor requests) away from the CPU

34
Shared Memory Parallel Computer
  • Shared Memory Computer
  • Memory in banks is accessed equally through a
    switch (crossbar) by the processors (usually
    vector)
  • Processors run p independent tasks with
    possibly shared data
  • Usually some compilers and preprocessors can
    extract the fine-grained parallelism available

E.g. Cray T90
35
Shared Memory Parallel ...
  • Memory contention and bandwidth limit the number
    of processors that may be connected
  • Memory contention can be reduced by increasing
    the number of banks and reducing the bank busy time (bbt)
  • This type of parallel computer is closest in
    programming model to the general purpose single
    processor computer

36
Symmetric Multiprocessors (SMP)
  • Processors are usually superscalar -- SUN Ultra,
    MIPS R10000 with large cache
  • Bus/crossbar used to connect to memory modules
  • For a bus -- only 1 processor can access memory at a time
  • SMP Computer

(Diagram: processors connected through a bus/crossbar to memory
modules M1, M2, M3.)
Sun Ultra Enterprise 10000, SGI Power Challenge
37
Symmetric Multi-processors
  • With a shared interconnect there will be memory
    contention
  • Data flows from memory to cache to processors
  • Cache coherence
  • If a piece of data is changed in one cache then
    all other caches that contain that data must
    update the value. Hardware and software must take
    care of this.

38
Symmetric Multi-Processors
  • Performance depends dramatically on the reuse of
    data in cache
  • Fetching data from larger memory with potential
    memory contention can be expensive!
  • Caches and cache lines are also bigger
  • A large L2 cache really plays the role of local
    fast memory, while the memory banks are more like
    extended memory accessed in blocks

39
Distributed Memory Parallel Computer
  • Prototype DMP
  • Processors are superscalar RISC with only LOCAL
    memory
  • Each processor can only work on data in local
    memory
  • Communication required for access to remote
    memory

E.g. IBM SP, Intel Paragon, SGI Origin 2000
40
Distributed Memory Parallel Computer
  • Problems need to be broken up into independent
    tasks with independent memory -- naturally
    matches a data-based decomposition of the problem
    using an owner-computes rule
  • Parallelization mostly at high granularity level
    controlled by user -- difficult for compilers/
    automatic parallelization tools
  • Computers are scalable to very large numbers of
    processors

41
Distributed Memory Parallel Computer
  • NUMA (non-uniform memory access) based
    classification
  • Intel Paragon (1st teraflop machine had 4
    Pentiums per node with a bus)
  • HP exemplar has bus at node
  • Hybrid Parallel Computer

42
Distributed Memory Parallel Computer
  • Semi-autonomous memory
  • Processor can access remote memory using
    memory control units (MCUs)
  • CRAY T3E and SGI Origin 2000

(Diagram: processors, each with local memory and an MCU, connected
by a communication network.)
43
Distributed Memory Parallel Computer
  • Fully autonomous memory
  • Memory and processors are equally distributed over
    the network
  • Tera MTA is the only example
  • Latency and data transfer from memory are at the
    speed of the network!

(Diagram: memory modules (M) and processors (P) attached separately
to the communication network.)
44
Accessing Distributed Memory
  • Message Passing
  • User transfers all data using explicit
    send/receive instructions
  • synchronous message passing can be slow
  • Programming with a NEW programming model!
  • User must optimize communication
  • asynchronous/one-sided get and put are faster but
    need more care in programming
  • Codes used to be machine-specific (Intel NEXUS
    etc.) until standardized to PVM (Parallel Virtual
    Machine) and subsequently MPI (Message Passing
    Interface) -- see the sketch below
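A minimal message-passing sketch in MPI (illustrative; the message
size and tag are arbitrary assumptions): processor 0 sends 100 reals
to processor 1 with blocking send/receive.

      program sendrecv
      include 'mpif.h'
      integer ierr, myid, stat(MPI_STATUS_SIZE)
      real buf(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      if (myid .eq. 0) then
         buf(1) = 3.14
c        explicit send: the user moves the data, tag 99
         call MPI_SEND(buf, 100, MPI_REAL, 1, 99,
     &                 MPI_COMM_WORLD, ierr)
      else if (myid .eq. 1) then
c        matching blocking receive on the other processor
         call MPI_RECV(buf, 100, MPI_REAL, 0, 99,
     &                 MPI_COMM_WORLD, stat, ierr)
         print *, 'received ', buf(1)
      endif
      call MPI_FINALIZE(ierr)
      end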

45
Accessing Distributed Memory
  • Global distributed memory
  • Physically distributed and globally addressable
    -- Cray T3E/ SGI Origin 2000
  • User formally accesses remote memory as if it
    were local -- operating system/compilers will
    translate such accesses to fetches/stores over
    the communication network
  • High Performance FORTRAN (HPF) -- a software
    realization of distributed memory -- arrays etc.,
    when declared, can be distributed using compiler
    directives. The compiler translates remote memory
    accesses to appropriate calls (message passing / OS
    calls as supported by the hardware) -- see the
    sketch below
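A sketch of the HPF style (the directive syntax is recalled from the
HPF standard, so treat the details as an assumption): the directives,
not the executable code, tell the compiler how to spread the array
over processors.

      program hpfex
      real a(1000), b(1000)
      integer i
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK) ONTO p
!HPF$ ALIGN b(:) WITH a(:)
c     the loops are written as if memory were global; the
c     compiler turns remote accesses into communication
      forall (i = 1:1000) b(i) = 1.0
      forall (i = 1:1000) a(i) = b(i) + 1.0
      print *, a(1)
      end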

46
Processor interconnects/topologies
  • Buses
  • Lower cost -- but only one pair of devices
    (processors/memories etc.) can communicate at a
    time, e.g. Ethernet used to link workstation
    networks
  • Switches
  • Like the telephone network -- can sustain many
    simultaneous communications at higher cost!
  • Critical measure is bisection bandwidth -- how
    much data can be passed between the two halves of
    the machine (see the example below)
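A quick worked illustration (standard link counts, added here, not
from the slides): bisecting a ring of p processors cuts only 2 links,
a 2-D mesh cuts about sqrt(p) links, and a d-dimensional hypercube
cuts p/2 links -- which is why richer topologies sustain much more
cross-machine traffic.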

47
Processor interconnects/topologies
(No Transcript)

48
Processor interconnects/topologies
(No Transcript)

49
Processor interconnects/topologies
  • Workstation network on ethernet
  • Very high latency -- processors must participate
    in communication

50
Processor interconnects/topologies
  • 1D and 2D meshes and rings/tori

51
Processor interconnects/topologies
  • 3D meshes and rings/tori

52
Processor interconnects/topologies
  • d-dimensional hypercubes

53
Processor Scheduling
  • Space Sharing
  • Processor banks of 4/8/16 etc. assigned to users
    for specific times
  • Time sharing on processor partitions
  • Livermore Gang Scheduling

54
IBM RS/6000 SP
  • Distributed Memory Parallel Computer
  • Assembly of workstations using an HPS (a
    crossbar-type switch)
  • Comes with a choice of processors -- POWER2
    (variants), POWER3, and clusters of PowerPC (also
    used by Apple G3, G4, etc.)

55
POWER 2 Processor
  • Different versions -- with different frequency,
    cache size and bandwidth

56
POWER 2 ARCHITECTURE
57
POWER2
  • Double fixed point/floating point units --
    multiply/add in each
  • Max. 4 Floating Point results/cycle
  • ICU (with 32 KB instruction cache) can execute a
    branch and a condition-register instruction per cycle
  • Per cycle 8 instructions may be issued and
    executed -- truly SUPERSCALAR!

58
Wide Node (77 MHz) Performance
  • Theoretical peak performance
  • 2 x 77 = 154 MFLOPS for dyad
  • 4 x 77 = 308 MFLOPS for triad
  • Cache Effects dominate performance
  • 256 KB Cache and 256 bit path to cache and from
    cache to memory -- 2 words (8 bytes each) may be
    fetched and 2 words stored per cycle

59
Expected Performance
  • Expected Performance
  • For dyad a(i) = b(i)*c(i) or a(i) = b(i)+c(i) --
    needs 2 loads and 1 store, i.e. 6 memory references
    to feed 2 FPUs -- only 4 are available
  • (2 x 77) x (4/6) = 102.7 MFLOPS
  • For linked triad
  • a(i) = b(i) + s*c(i) (2 loads, 1 store)
  • (4 x 77) x (4/6) = 205.3 MFLOPS
  • For vector triad
  • a(i) = b(i) + c(i)*d(i) (3 loads, 1 store)
  • (4 x 77) x (4/8) = 154 MFLOPS
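A sketch of how such numbers can be checked in practice (an addition;
the problem size, repetition count, and the Fortran 95 cpu_time timer
are assumptions): time the linked triad and convert to MFLOPS.

      program triad
      integer n, rep
      parameter (n = 100000, rep = 1000)
      real a(n), b(n), c(n), s, t1, t2, mflops
      integer i, r
      s = 2.0
      do i = 1, n
         b(i) = 1.0
         c(i) = 2.0
      enddo
      call cpu_time(t1)
      do r = 1, rep
         do i = 1, n
            a(i) = b(i) + s*c(i)
         enddo
      enddo
      call cpu_time(t2)
c     the linked triad does 2 floating point operations
c     (one multiply, one add) per element per repetition
      mflops = (2.0*n*rep) / (1.0e6*(t2 - t1))
      print *, a(1), ' MFLOPS ', mflops
      end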

60
Cache Hit/Miss
  • The Performance numbers assumed that data was
    available in cache
  • If data is not in cache it must be fetched in
    cache lines of 256 bytes each from memory at a
    much slower pace

61
(No Transcript)
62
TERM PAPER
  • Based on the analysis of the POWER2 processor
    and the IBM SP presented here, prepare a similar
    analysis (including estimates of performance) for
    the new POWER4 chip in the IBM SP or a cluster of
    Pentium 4s.