Transcript and Presenter's Notes

Title: Computer architecture II


1
Computer architecture II
  • Introduction

2
Recap
  • Importance of parallelism
  • Architecture classification
  • Flynn (SISD, SIMD, MISD, MIMD)
  • Memory access (SM, MP)
  • Clusters
  • Grids
  • Top500

3
Today's plan
  • Parallel architecture convergence (Culler's
    classification)
  • Shared Memory (Single Address Space)
  • Message Passing
  • Data parallel (SIMD)
  • Data flow
  • Systolic

4
Convergence of Architectural Models
  • Culler's classification of parallel architectures
  • Shared Address Space
  • Message Passing
  • Data Parallel
  • Others
  • Dataflow
  • Systolic Arrays
  • For each: examine the programming model, motivation,
    intended applications, and contributions to convergence

5
Where is Parallel Arch Going?
Old view: divergent architectures, no predictable
pattern of growth.
[Figure: divergent stacks, with application software, system software, and architecture tied separately to each model: systolic arrays, SIMD, message passing, dataflow, shared memory]
  • Uncertainty of direction paralyzed parallel
    software development!

6
New view: convergence of parallel architectures
[Figure: systolic arrays, SIMD, message passing, dataflow, and shared memory all converging toward a generic architecture]
7
Parallel computer
  • Last class' definition
  • A parallel computer is a collection of
    processing elements that cooperate to solve
    large problems fast
  • Extends the sequential computer architecture with
    a communication architecture
  • Computer architecture has 2 important aspects
  • Abstractions (hardware/software, user/system)
  • Implementation of these abstractions
  • The communication architecture has both as well
  • Abstractions: communication and synchronization
    operations
  • Implementations of these abstractions
  • So does the programming model
  • Abstractions
  • Implementations of these abstractions

8
Modern Layered Framework
Layers of architectural abstraction
9
Programming Model
  • What the programmer uses in coding applications
  • Specifies communication and synchronization
  • Examples
  • Multiprogramming: no communication or
    synchronization at the program level
  • Shared address space: like a bulletin board
  • Message passing: like letters or phone calls;
    explicit point-to-point
  • Data parallel: global, simultaneous actions on
    data
  • Implemented with shared address space or message
    passing

10
Modern Layered Framework
Layers of architectural abstraction
11
Communication Abstraction
  • The programming model is built on the communication
    abstraction
  • Possibilities
  • Supported directly by hardware
  • OS (sockets)
  • User software
  • Combination of OS and hardware (e.g., a page fault
    handled by an OS handler)
  • Earlier
  • Communication abstraction oriented toward the
    programming model
  • Today
  • Compilers and software play important roles as
    bridges (MPI/OpenMP)

12
Shared Address Space (SAS) Architectures
  • Any processor can directly reference any memory
    location
  • Communication occurs implicitly as a result of
    loads and stores
  • Convenient
  • Location transparency
  • Similar programming model to time-sharing on
    uniprocessors
  • Except processes run on different processors
  • Naturally provided on a wide range of platforms
  • History dates at least to precursors of
    mainframes in the early '60s
  • Wide range of scale: a few to hundreds of
    processors
  • Popularly known as shared memory machines or
    model
  • Memory may be physically distributed among
    processors
  • UMA
  • NUMA

13
SAS-UMA (Uniform Memory Access)
  • Any processor can directly reference any memory
    location
  • In theory, the same access time for all accesses

[Figure: processors P1, P2, ..., Pn connected through an interconnect to memory modules M1, M2, ..., Mk]
14
SAS-NUMA (Non Uniform Memory Access)
  • Any processor can directly reference any memory
    location (including the memory of remote
    processors)
  • NI integrated into the memory system
  • IMPORTANT DIFFERENCE! On message passing
    machines P1 cannot access M2 (the NI is integrated
    into the I/O system)
  • Different access times for local and remote memory

[Figure: processing elements PE1, PE2, ..., PEn, each pairing a processor Pi with a local memory Mi, connected by an interconnect]
15
SAS Memory Model
  • Process: virtual address space plus one or more
    threads of control
  • Portions of the address spaces of processes are shared
  • Writes to a shared address are visible to other
    threads (in other processes too)
  • Natural extension of the uniprocessor model
  • Communication: reads/writes to shared memory
  • Synchronization: special atomic operations (we
    come back to this later)
  • ONE OS uses shared memory to coordinate processes
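
Concretely, this model maps onto threads sharing one address space. A minimal sketch in C with POSIX threads (the shared counter and the two-thread setup are invented for the illustration): communication is an ordinary store to a shared variable, and synchronization is the special atomic operation, here a mutex-protected increment.

#include <pthread.h>
#include <stdio.h>

/* Shared address space: both threads see the same counter. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* synchronization: atomic section */
        counter++;                    /* communication: a plain store    */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* prints 200000 */
    return 0;
}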

16
Communication Hardware
  • Natural extension of the uniprocessor
  • We already have a processor, one or more memory
    modules, and I/O controllers connected by a hardware
    interconnect of some sort
  • Memory capacity increases by adding modules
  • I/O capacity by adding controllers
  • Add processors for processing!

17
History
  • Mainframe approach
  • Motivated by multiprogramming
  • Extends the crossbar used for memory and I/O
    bandwidth
  • Bandwidth scales with the number of processors
  • Originally, processor cost was high
  • Later, the crossbar's cost was; use multistage
    networks
  • Multistage
  • Reduces the incremental cost
  • Increased latency
  • Minicomputer approach
  • Almost all microprocessor systems have a bus
  • Used heavily for parallel computing
  • Called symmetric multiprocessor (SMP)
  • Latency larger than for a uniprocessor
  • The bus is the bandwidth bottleneck
  • Caching is key; coherence becomes the problem
  • Low incremental cost

[Figure: processors P and I/O controllers connected to memory modules M (crossbar for the mainframe approach, shared bus for the minicomputer approach)]
18
SAS UMA Example: Intel Pentium Pro Quad
  • All coherence and multiprocessing in the
    processor module
  • Highly integrated, targeted at high volume
  • Low latency and bandwidth

19
SAS UMA Example: Sun UltraSPARC-based Enterprise
  • 16 cards of either type: processors + memory, or
    I/O
  • All memory accessed over the bus, so symmetric
  • Higher bandwidth, higher latency bus

20
NUMA
[Figure: PE1, PE2, ..., PEn, each with a processor Pi, cache Ci, and local memory Mi, connected by an interconnect]
21
SAS-NUMA Example: Cray T3E
  • Scales up to 1024 processors, 480 MB/s links
  • The memory controller generates a communication
    request for non-local references (no local caching
    of remote data; the SGI Origin has it)
  • No hardware mechanism for coherence (the SGI Origin
    has one)

22
Message Passing Architectures
  • High-level block diagram similar to
    distributed-memory SAS
  • NI integrated into the I/O system; needn't be in the
    memory system
  • Like clusters, but with tighter integration
  • Easier to build than scalable SAS

[Figure: PE1, PE2, ..., PEn, each with a processor Pi and memory Mi, connected by an interconnect through the I/O system]
23
Message Passing Architectures
  • Communication
  • Via explicit I/O operations
  • (In the SAS case: through memory accesses)
  • Programming model
  • Directly access only the private address space (local
    memory)
  • Communication via explicit messages (send/receive)
  • Farther from the hardware operations
  • Library (MPI)
  • OS intervention (e.g., page fault in page-based DSM)

24
Message-Passing Abstraction
  • Send specifies the buffer to be transmitted and the
    receiving process
  • Recv specifies the sending process and a buffer to
    receive into
  • Optional tag on send and matching rule on receive
  • Many overheads: copying, buffer management,
    protection
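
To make the abstraction concrete, a minimal sketch in C with MPI (the buffer contents and the tag value are invented for the example): process 0 names the receiving process, the buffer, and a tag; process 1 names the sending process and a matching tag. Run with two processes, e.g. mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = { 1.0, 2.0, 3.0, 4.0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send: buffer to transmit + receiving process (1) + tag (42). */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Recv: sending process (0), matching tag, buffer to receive into. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got %.1f ... %.1f\n", buf[0], buf[3]);
    }

    MPI_Finalize();
    return 0;
}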

25
Evolution of Message-Passing Machines
  • Early machines
  • Store-and-forward
  • FIFO on each link
  • Hardware close to the programming model:
    synchronous ops
  • Only neighboring nodes could be named!
  • Replaced by DMA, enabling non-blocking ops
  • Buffered by the system at the destination until recv
  • Diminishing role of topology
  • Topology less important
  • All nodes can be named
  • Pipelined wormhole routing (asynchronous MP)
  • The cost is in the node-network interface
  • Simplifies programming (earlier you had to map
    your program onto the topology)

26
Example: IBM SP-2
  • Made out of essentially complete RS6000
    workstations
  • Network interface integrated on the I/O bus
    (bandwidth limited by the I/O bus)
  • 8x8 crossbar switch

27
Example: Intel Paragon
28
SAS & MP Architectural Convergence
  • SAS machines
  • Software: MP send/recv supported via buffers
  • Hardware: at a lower level, even hardware SAS
    passes hardware messages
  • MP machines
  • Software: SAS constructed as a global address space
    on MP (software DSM)
  • Page-based (or finer-grained) shared virtual
    memory
  • Hardware: tighter NI integration even for MP
    (low latency, high bandwidth)
  • Due to the emergence of fast system area networks
    (SANs)
  • Traditionally, the NI is integrated into the memory
    system for SAS NUMA systems
  • Clusters of SMP workstations

29
Data Parallel Systems (SIMD)
  • Architectural model
  • SIMD: array of many simple, cheap processors, each
    with little memory
  • Processors don't sequence through instructions
  • Attached to a control processor that issues
    instructions
  • Specialized and general communication, cheap
    global synchronization
  • Original motivations
  • Matches simple differential equation solvers
  • We'll see the Ocean current simulation
  • Centralizes the high cost of instruction
    fetch/sequencing

30
Data Parallel Systems
  • Programming model
  • Operations performed in parallel on each element
    of a data structure
  • Logically a single thread of control that performs
    sequential or parallel steps
  • Conceptually, a processor is associated with each
    data element
  • After a phase of computation, all the processors
    synchronize

31
Application of Data Parallelism
  • Each PE contains an employee record with his/her
    salary
  • Each PE has a condition flag: execute the
    instruction or not
  • Example: work in parallel on several employee records
  • if salary > 100K then
  •   salary = salary * 1.05
  • else
  •   salary = salary * 1.10
  • Logically, the whole operation is a single step
  • Some processors are enabled for the arithmetic
    operation, others disabled
  • Other examples
  • Differential equations, linear algebra, ...
  • Document searching, graphics, image processing,
    ...
  • Last machines
  • Thinking Machines CM-1, CM-2 (and CM-5)
  • Maspar MP-1 and MP-2
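
The same salary update can be written as a data-parallel loop. A minimal sketch in C with OpenMP (the array contents are invented): logically a single step over all records, with the condition playing the role of the per-element enable flag.

#include <stdio.h>

#define N 8

int main(void)
{
    double salary[N] = { 50e3, 120e3, 80e3, 200e3,
                         95e3, 101e3, 30e3, 150e3 };

    /* One logical step: every element is updated "simultaneously".
       The condition enables/disables each lane, like the PE flag. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        salary[i] *= (salary[i] > 100e3) ? 1.05 : 1.10;

    for (int i = 0; i < N; i++)
        printf("%.0f\n", salary[i]);
    return 0;
}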

32
Data parallel machines evolution
  • The architecture has disappeared today, but the
    programming model is still popular
  • Popular when the cost savings of a centralized
    sequencer were high
  • '60s, when a CPU was a cabinet
  • Replaced by vector processors in the mid-'70s
  • More flexible memory layout and easier to manage
  • No need to map the problem onto the infrastructure
  • Revived in the mid-'80s when a 32-bit datapath fit on
    a chip
  • Modern microprocessors are more attractive today
  • Other reasons for its demise
  • Simple, regular applications have good locality and
    can do well anyway
  • Loss of applicability due to hardwiring data
    parallelism
  • MIMD machines are as effective for data parallelism,
    and more general

33
Convergence
  • Programming model
  • Still exists, separated from the hardware
  • Converges to SPMD (single program, multiple data)
  • Maps local data structures onto the hardware machine
    model
  • HPF, OpenMP
  • Needs fast global synchronization
  • Global address space, implemented with either SAS
    or MP

34
Dataflow Architectures
  • Represent the computation (program) as a graph of
    essential dependences
  • A logical processor at each node, activated by the
    availability of operands
  • Messages (tokens) carrying the tag of the next
    instruction are sent to the next processor
  • The tag is compared with others in the matching
    store; a match fires execution
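
A toy software rendering of the firing rule in C (the node layout and the expression (b + c) * (b - c) are invented for the sketch): each node waits until both operand slots hold a token, fires, and sends a result token to its destination node, which may fire in turn.

#include <stdio.h>

/* Toy dataflow node: fires when both operand slots hold a token. */
typedef struct {
    char op;                 /* '+', '-', '*'                          */
    double operand[2];
    int present[2];          /* "matching store": which slots are filled */
    int dest, dest_slot;     /* where the result token goes (-1 = output) */
} Node;

static void send_token(Node *g, int dest, int slot, double v);

static void try_fire(Node *g, int i)
{
    Node *n = &g[i];
    if (!n->present[0] || !n->present[1]) return;   /* no match yet */
    double r = n->op == '+' ? n->operand[0] + n->operand[1]
             : n->op == '-' ? n->operand[0] - n->operand[1]
             :                n->operand[0] * n->operand[1];
    if (n->dest < 0) printf("result = %g\n", r);
    else send_token(g, n->dest, n->dest_slot, r);
}

static void send_token(Node *g, int dest, int slot, double v)
{
    g[dest].operand[slot] = v;
    g[dest].present[slot] = 1;
    try_fire(g, dest);       /* operand availability activates the node */
}

int main(void)
{
    /* Graph for (b + c) * (b - c): nodes 0 and 1 feed node 2. */
    Node g[3] = {
        { '+', {0, 0}, {0, 0},  2, 0 },
        { '-', {0, 0}, {0, 0},  2, 1 },
        { '*', {0, 0}, {0, 0}, -1, 0 },
    };
    double b = 5.0, c = 3.0;
    send_token(g, 0, 0, b); send_token(g, 0, 1, c);
    send_token(g, 1, 0, b); send_token(g, 1, 1, c);   /* prints result = 16 */
    return 0;
}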

35
Data-flow architectures
  • Key characteristics
  • Name operations anywhere in the machine
  • Support synchronization for independent ops
  • Dynamic scheduling at the machine level
  • The architectures demised
  • Problems
  • Operations have locality across them; it is useful to
    group them together
  • Handling complex data structures like arrays
  • Complexity of the matching store and memory units
  • Exposes too much parallelism
  • Too fine-grained
  • Hurts locality

36
Data-flow architectures convergence
  • Converged to using conventional processors and
    memory
  • Support for a large, dynamic set of threads to map
    to the processors
  • Typically a shared address space as well
  • Separation of the programming model from the hardware
    (like data-parallel)
  • Lasting contributions
  • Integration of communication with thread
    (handler) generation
  • Tightly integrated communication and fine-grained
    synchronization
  • Data flow is a useful concept for software (compilers
    etc.)

37
Systolic Architectures
  • Replace the single processor with an array of regular
    processing elements
  • Orchestrate the data flow for high throughput with
    fewer memory accesses
  • Different from pipelining
  • Nonlinear array structure, multidirectional data
    flow; each PE may have a (small) local instruction
    and data memory
  • Different from SIMD: each PE may do something
    different
  • Initial motivation: VLSI enables inexpensive
    special-purpose chips
  • Represent algorithms directly by chips connected
    in a regular pattern

38
Systolic Arrays (cont'd.)
Example: systolic array for 1-D convolution
  • Practical realizations (e.g. iWARP, CMU-Intel) use
    quite general processors
  • Enable a variety of algorithms on the same hardware
  • Dedicated interconnect channels
  • Data transferred directly from register to register
    across a channel
  • Specialized, with the same problems as SIMD
  • General purpose systems work well for the same
    algorithms (locality etc.)
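
A software sketch in C of the 1-D convolution idea (the weights, sizes, and the broadcast-input variant are choices made for the sketch, not the slide's exact design): each PE holds one weight, partial sums march one PE per cycle, and a finished output leaves the last PE every cycle once the pipeline is full, with no extra memory traffic per term.

#include <stdio.h>

#define K 3   /* number of PEs = number of weights */
#define N 8   /* input length                      */

int main(void)
{
    double w[K] = { 0.25, 0.5, 0.25 };       /* one weight per PE     */
    double x[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    double y[K] = { 0 };                     /* partial-sum registers */

    /* One clock tick per input sample: x[t] enters the array, the
       partial sums shift one PE to the right, and each PE adds its
       w*x term.  Computes y[i] = w[0]x[i] + w[1]x[i+1] + w[2]x[i+2]. */
    for (int t = 0; t < N; t++) {
        for (int k = K - 1; k > 0; k--)
            y[k] = y[k - 1] + w[k] * x[t];   /* shift + accumulate */
        y[0] = w[0] * x[t];
        if (t >= K - 1)                      /* pipeline is full   */
            printf("y[%d] = %g\n", t - K + 1, y[K - 1]);
    }
    return 0;
}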

39
Recap: Generic Multiprocessor Architecture