1
Overview of Parallel Architecture and
Programming Models
2
What is a Parallel Computer?
  • A collection of processing elements that
    cooperate to solve large problems fast
  • Some broad issues that distinguish parallel
    computers
  • Resource Allocation
  • how large a collection?
  • how powerful are the elements?
  • how much memory?
  • Data access, Communication and Synchronization
  • how do the elements cooperate and communicate?
  • how are data transmitted between processors?
  • what are the abstractions and primitives for
    cooperation?
  • Performance and Scalability
  • how does it all translate into performance?
  • how does it scale?

3
Why Parallelism?
  • Provides alternative to faster clock for
    performance
  • Assuming a doubling of effective per-node
    performance every 2 years, a 1024-CPU system can
    deliver today the performance that a single-CPU
    system would take 20 years (ten doublings, since
    2^10 = 1024) to reach
  • Applies at all levels of system design
  • Is increasingly central in information processing
  • Scientific computing: simulation, data analysis,
    data storage and management, etc.
  • Commercial computing: transaction processing,
    databases
  • Internet applications: search. Google operates
    at least 50,000 CPUs, many as part of large
    parallel systems

4
How to Study Parallel Systems
  • History: diverse and innovative organizational
    structures, often tied to novel programming
    models
  • Rapidly matured under strong technological
    constraints
  • The microprocessor is ubiquitous
  • Laptops and supercomputers are fundamentally
    similar!
  • Technological trends cause diverse approaches to
    converge
  • Technological trends make parallel computing
    inevitable
  • In the mainstream
  • Need to understand fundamental principles and
    design tradeoffs, not just taxonomies
  • Naming, Ordering, Replication, Communication
    performance

5
Outline
  • Drivers of Parallel Computing
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture

6
Drivers of Parallel Computing
  • Application Needs: our insatiable need for
    computing cycles
  • Scientific computing: CFD, biology, chemistry,
    physics, ...
  • General-purpose computing: video, graphics, CAD,
    databases, TP...
  • Internet applications: search, e-commerce,
    clustering ...
  • Technology Trends
  • Architecture Trends
  • Economics
  • Current trends
  • All microprocessors have support for external
    multiprocessing
  • Servers and workstations are MP: Sun, SGI, Dell,
    Compaq...
  • Microprocessors are multiprocessors: multicore,
    SMP on a chip

7
Application Trends
  • Demand for cycles fuels advances in hardware, and
    vice-versa
  • This cycle drives exponential increases in
    microprocessor performance
  • It drives parallel architecture even harder:
    the most demanding applications
  • Range of performance demands
  • Need range of system performance with
    progressively increasing cost
  • Platform pyramid
  • Goal of applications in using parallel machines:
    Speedup
  • Speedup (p processors) =
    Performance (p processors) / Performance (1 processor)
  • For a fixed problem size (input data set),
    performance = 1/time
  • Speedup, fixed problem (p processors) =
    Time (1 processor) / Time (p processors)
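As an illustrative example (numbers not from the slides): if a fixed-size
problem takes 100 s on one processor and 8 s on 16 processors, the speedup
is 100/8 = 12.5, i.e. a parallel efficiency of 12.5/16 ≈ 78%.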
8
Scientific Computing Demand
  • Ever increasing demand due to need for more
    accuracy, higher-level modeling and knowledge,
    and analysis of exploding amounts of data
  • Example area 1: Climate and Ecological Modeling
    goals
  • By 2010 or so
  • Simply improving resolution, simulated time, and
    physics increases the computational requirement
    by factors of 10^4 to 10^7. Then:
  • Reliable global warming, natural disaster and
    weather prediction
  • By 2015 or so
  • Predictive models of rainforest destruction,
    forest sustainability, effects of climate change
    on ecosystems and on food webs, global health
    trends
  • By 2020 or so
  • Verifiable global ecosystem and epidemic models
  • Integration of macro-effects with localized and
    then micro-effects
  • Predictive effects of human activities on Earth's
    life support systems
  • Understanding Earth's life support systems

9
Scientific Computing Demand
  • Example area 2: Biology goals
  • By 2010 or so
  • Ex vivo and then in vivo molecular-computer
    diagnosis
  • By 2015 or so
  • Modeling based vaccines
  • Individualized medicine
  • Comprehensive biological data integration (most
    data co-analyzable)
  • Full model of a single cell
  • By 2020 or so
  • Full model of a multi-cellular tissue/organism
  • Purely in-silico developed drugs; personalized
    smart drugs
  • Understanding complex biological systems: cells
    and organisms to ecosystems
  • Verifiable predictive models of biological
    systems

10
Engineering Computing Demand
  • Large parallel machines a mainstay in many
    industries
  • Petroleum (reservoir analysis)
  • Automotive (crash simulation, drag analysis,
    combustion efficiency),
  • Aeronautics (airflow analysis, engine efficiency,
    structural mechanics, electromagnetism),
  • Computer-aided design
  • Pharmaceuticals (molecular modeling)
  • Visualization
  • in all of the above
  • entertainment (movies), architecture
    (walk-throughs, rendering)
  • Financial modeling (yield and derivative
    analysis)
  • etc.

11
Learning Curve for Parallel Applications
  • AMBER molecular dynamics simulation program
  • Starting point was vector code for Cray-1
  • 145 MFLOPS on Cray-90; 406 for the final version
    on a 128-processor Paragon; 891 on a
    128-processor Cray T3D

12
Commercial Computing
  • Also relies on parallelism for high end
  • Scale not so large, but use much more widespread
  • Computational power determines scale of business
    that can be handled
  • Databases, online-transaction processing,
    decision support, data mining, data warehousing
    ...
  • TPC benchmarks (TPC-C order entry, TPC-D decision
    support)
  • Explicit scaling criteria provided: size of
    enterprise scales with system
  • Problem size no longer fixed as p increases, so
    throughput is used as a performance measure
    (transactions per minute or tpm)
  • E-commerce, search and other scalable internet
    services
  • Parallel applications running on clusters
  • Developing new parallel software models and
    primitives
  • Insight from automated analysis of large
    disparate data

13
TPC-C Results for Wintel Systems
6-way Unisys AQ HS6, Pentium Pro 200 MHz: 12,026 tpmC, $39.38/tpmC, avail. 11-30-97, TPC-C v3.3 (withdrawn)
4-way Cpq PL 5000, Pentium Pro 200 MHz: 6,751 tpmC, $89.62/tpmC, avail. 12-1-96, TPC-C v3.2 (withdrawn)
4-way IBM NF 7000, PII Xeon 400 MHz: 18,893 tpmC, $29.09/tpmC, avail. 12-29-98, TPC-C v3.3 (withdrawn)
8-way Cpq PL 8500, PIII Xeon 550 MHz: 40,369 tpmC, $18.46/tpmC, avail. 12-31-99, TPC-C v3.5 (withdrawn)
8-way Dell PE 8450, PIII Xeon 700 MHz: 57,015 tpmC, $14.99/tpmC, avail. 1-15-01, TPC-C v3.5 (withdrawn)
32-way Unisys ES7000, PIII Xeon 900 MHz: 165,218 tpmC, $21.33/tpmC, avail. 3-10-02, TPC-C v5.0
32-way NEC Express5800, Itanium2 1 GHz: 342,746 tpmC, $12.86/tpmC, avail. 3-31-03, TPC-C v5.0
32-way Unisys ES7000, Xeon MP 2 GHz: 234,325 tpmC, $11.59/tpmC, avail. 3-31-03, TPC-C v5.0
  • Parallelism is pervasive
  • Small to moderate scale parallelism very
    important
  • Difficult to obtain snapshot to compare across
    vendor platforms

14
Summary of Application Trends
  • Transition to parallel computing has occurred for
    scientific and engineering computing
  • Has also occurred in commercial computing
  • Database and transactions as well as financial
  • Scalable internet services (at least
    coarse-grained parallelism)
  • Desktop also uses multithreaded programs, which
    are a lot like parallel programs
  • Demand for improving throughput on sequential
    workloads
  • Greatest use of small-scale multiprocessors
  • Solid application demand, keeps increasing with
    time
  • Key challenge throughout is making parallel
    programming easier
  • Taking advantage of pervasive parallelism with
    multi-core systems

15
Drivers of Parallel Computing
  • Application Needs
  • Technology Trends
  • Architecture Trends
  • Economics

16
Technology Trends: Rise of the Micro
The natural building block for multiprocessors is
now also about the fastest!
17
General Technology Trends
  • Microprocessor performance increases 50-100% per
    year
  • Clock frequency doubles every 3 years
  • Transistor count quadruples every 3 years
  • Moore's law: transistors per chip ≈ 1.59^(year-1959)
    (originally 2^(year-1959))
  • Huge investment per generation is carried by huge
    commodity market
  • With every feature size scaling of n
  • we get O(n^2) transistors
  • we get O(n) increase in possible clock frequency
  • We should get O(n^3) increase in processor
    performance.
  • Do we?
  • See architecture trends
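As a rough, illustrative check of the scaling argument above (idealized
numbers, not from the slides): one process generation scales feature size
by n ≈ 1.4, giving roughly n^2 ≈ 2x the transistors and n ≈ 1.4x the clock
frequency, hence ideally n^3 ≈ 2.8x the performance per generation; the
architecture-trends slides examine how much of this ideal is realized.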

18
Die and Feature Size Scaling
  • Die size growing at 7% per year; feature size
    shrinking 25-30%

19
Clock Frequency Growth Rate (Intel family)
  • About 30% per year

20
Transistor Count Growth Rate (Intel family)
  • Transistor count grows much faster than clock
    rate
  • About 40% per year; an order of magnitude more
    contribution over two decades
  • Width/space has greater potential than per-unit
    speed

21
How to Use More Transistors
  • Improve single threaded performance via
    architecture
  • Not keeping up with potential given by technology
    (next)
  • Use transistors for memory structures to improve
    data locality
  • Doesn't give as high returns (2x for 4x cache
    size, up to a point)
  • Use parallelism
  • Instruction-level
  • Thread level
  • Bottom line: not that single-threaded performance
    has plateaued, but that parallelism is the natural
    way to stay on a better curve

22
Microprocessor Performance
23
Similar Story for Storage (Transistor Count)
24
Similar Story for Storage (DRAM Capacity)
25
Similar Story for Storage
  • Divergence between memory capacity and speed more
    pronounced
  • Capacity increased by 1000x from 1980-95, and
    increases 50% per year
  • Latency decreases only 3% per year (only 2x from
    1980-95)
  • Bandwidth per memory chip increases 2x as fast as
    latency reduces
  • Larger memories are slower, while processors get
    faster
  • Need to transfer more data in parallel
  • Need deeper cache hierarchies
  • How to organize caches?

26
Similar Story for Storage
  • Parallelism increases effective size of each
    level of hierarchy, without increasing access
    time
  • Parallelism and locality within memory systems
    too
  • New designs fetch many bits within memory chip
    follow with fast pipelined transfer across
    narrower interface
  • A buffer caches the most recently accessed data
  • Disks too: parallel disks plus caching
  • Overall, the dramatic growth of processor speed,
    storage capacity, and bandwidth relative to
    latency (especially) and clock speed points
    toward parallelism as the desirable architectural
    direction

27
Drivers of Parallel Computing
  • Application Needs
  • Technology Trends
  • Architecture Trends
  • Economics

28
Architectural Trends
  • Architecture translates technology's gifts to
    performance and capability
  • Resolves the tradeoff between parallelism and
    locality
  • Recent microprocessors: 1/3 compute, 1/3 cache,
    1/3 off-chip connect
  • Tradeoffs may change with scale and technology
    advances
  • Four generations of architectural history: tube,
    transistor, IC, VLSI
  • Here focus only on VLSI generation
  • Greatest delineation in VLSI has been in scale
    and type of parallelism exploited

29
Architectural Trends in Parallelism
  • Up to 1985: bit-level parallelism: 4-bit -> 8-bit
    -> 16-bit
  • slows after 32 bit
  • adoption of 64-bit well under way, 128-bit is far
    (not performance issue)
  • great inflection point when 32-bit micro and
    cache fit on a chip
  • Basic pipelining, hardware support for complex
    operations like FP multiply, etc. led to O(N^3)
    growth in performance.
  • Intel 4004 to 386

30
Architectural Trends in Parallelism
  • Mid-80s to mid-90s: instruction-level parallelism
  • Pipelining and simple instruction sets,
    compiler advances (RISC)
  • Larger on-chip caches
  • But quadrupling cache size only halves the miss
    rate
  • More functional units -> superscalar execution
  • But limited performance scaling
  • N^2 growth in performance
  • Intel 486 to Pentium III/IV

31
Architectural Trends in Parallelism
  • After mid-90s
  • Greater sophistication: out-of-order execution,
    speculation, prediction
  • to deal with control transfer and latency
    problems
  • Very wide issue processors
  • Don't help many applications very much
  • Need multiple threads (SMT) to exploit
  • Increased complexity and size leads to slowdown
  • Long global wires
  • Increased access times to data
  • Time to market
  • Next step: thread-level parallelism

32
Can Instruction-Level Parallelism Get Us There?
  • Reported speedups for superscalar processors
  • Horst, Harris, and Jardine [1990]: 1.37
  • Wang and Wu [1988]: 1.70
  • Smith, Johnson, and Horowitz [1989]: 2.30
  • Murakami et al. [1989]: 2.55
  • Chang et al. [1991]: 2.90
  • Jouppi and Wall [1989]: 3.20
  • Lee, Kwok, and Briggs [1991]: 3.50
  • Wall [1991]: 5
  • Melvin and Patt [1991]: 8
  • Butler et al. [1991]: 17
  • Large variance due to difference in
  • application domain investigated (numerical versus
    non-numerical)
  • capabilities of processor modeled

33
ILP Ideal Potential
  • Infinite resources and fetch bandwidth, perfect
    branch prediction and renaming
  • but real caches and non-zero miss latencies

34
Results of ILP Studies
  • Concentrate on parallelism for 4-issue machines
  • Realistic studies show only 2-fold speedup
  • More recent work examines ILP that looks across
    threads for parallelism

35
Architectural Trends Bus-based MPs
  • Micro on a chip makes it natural to connect many
    to shared memory
  • dominates server and enterprise market, moving
    down to desktop
  • Faster processors began to saturate bus, then
    bus technology advanced
  • today, range of sizes for bus-based systems,
    desktop to large servers

No. of processors in fully configured commercial
shared-memory systems
36
Bus Bandwidth
37
Bus Bandwidth: Intel Systems
38
Do Buses Scale?
  • Buses are a convenient way to extend architecture
    to parallelism, but they do not scale
  • bandwidth doesn't grow as CPUs are added
  • Scalable systems use physically distributed memory

39
Drivers of Parallel Computing
  • Application Needs
  • Technology Trends
  • Architecture Trends
  • Economics

40
Finally, Economics
  • Fabrication cost roughly O(1/feature-size)
  • 90nm fabs cost about 1-2 billion dollars
  • So fabrication of processors is expensive
  • Number of designers also O(1/feature-size)
  • 10 micron 4004 processor had 3 designers
  • Recent 90 nm processors had 300
  • New designs very expensive
  • Push toward consolidation of processor types
  • Processor complexity increasingly expensive
  • Cores reused, but tweaks expensive too

41
Design Complexity and Productivity
  • Design complexity outstrips human productivity

42
Economics
  • Commodity microprocessors not only fast but CHEAP
  • Development cost is tens of millions of dollars
  • BUT, many more are sold compared to
    supercomputers
  • Crucial to take advantage of the investment, and
    use the commodity building block
  • Exotic parallel architectures are now no more
    than special-purpose designs
  • Multiprocessors being pushed by software vendors
    (e.g. database) as well as hardware vendors
  • Standardization by Intel makes small, bus-based
    SMPs commodity
  • What about on-chip processor design?

43
What's on a processing chip?
  • Recap
  • Number of transistors growing fast
  • Methods to use them for single-thread performance
    are running out of steam
  • Memory issues argue for parallelism too
  • Instruction-level parallelism limited, need
    thread-level
  • Consolidation is a powerful force
  • All seems to point to many simpler cores rather
    than single bigger complex core
  • Additional key arguments: wires, power, cost

44
Wire Delay
  • Gate delay shrinks while global interconnect
    delay grows: favor short, local wires

45
Power
  • Power dissipation in Intel processors over time

46
Power and Performance
47
Power
  • Power grows with number of transistors and clock
    frequency
  • Power grows with voltage: P = C V^2 f
  • Going from 12V to 1.1V reduced power consumption
    by about 120x over 20 years
  • Voltage projected to go down to 0.7V by 2018, so
    only another ~2.5x
  • Power per chip peaking in designs
  • - Itanium II was 130W, Montecito 100W
  • - Power is first-class design constraint
  • Circuit-level power techniques quite far along
  • - clock gating, multiple thresholds, sleeper
    transistors
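As a rough check of the voltage numbers above using P = C V^2 f (holding C
and f fixed, which is an idealization): (12/1.1)^2 ≈ 119 ≈ 120x, and
(1.1/0.7)^2 ≈ 2.5x.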

48
Power versus Clock Frequency
  • Two processor generations, two feature sizes

49
Architectural Implication of Power
  • Fewer transistors per core: a lot more power
    efficient
  • Narrower issue, shorter pipelines, smaller OOO
    window
  • Get per-processor performance back on the O(n^3)
    curve
  • But lower single thread performance.
  • What complexity to eliminate?
  • Speculation, multithreading, ?
  • All good for some things, but need to be careful
    about power/benefit

50
ITRS Projections
51
ITRS Projections (contd.)
  • Processor count per chip will grow faster than
    individual processor performance

52
Cost of Chip Development
  • Non-recurring engineering costs increasing
    greatly as complexity outstrips productivity

53
Recurring Costs Per Die (1994)
54
Summary: What's on a Chip
  • Beyond arguments for parallelism based on
    commodity processors in general
  • Wire delay, power and economics all argue for
    multiple simpler cores on a chip rather than
    increasingly complex single cores
  • Challenge: SOFTWARE. How to program parallel
    machines?

55
Summary: Why Parallel Architecture?
  • Increasingly attractive
  • Economics, technology, architecture, application
    demand
  • Increasingly central and mainstream
  • Parallelism exploited at many levels
  • Instruction-level parallelism
  • Thread level parallelism and On-chip
    multiprocessing
  • Multiprocessor servers
  • Large-scale multiprocessors (MPPs)
  • Focus of this class: the multiprocessor level of
    parallelism
  • Same story from memory (and storage) system
    perspective
  • Increase bandwidth, reduce average latency with
    many local memories
  • Wide range of parallel architectures make sense
  • Different cost, performance and scalability

56
Outline
  • Drivers of Parallel Computing
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture

57
Scientific Supercomputing
  • Proving ground and driver for innovative
    architecture and techniques
  • Market smaller relative to commercial as MPs
    become mainstream
  • Dominated by vector machines starting in 70s
  • Microprocessors have made huge gains in
    floating-point performance
  • high clock rates
  • pipelined floating point units (e.g. mult-add)
  • instruction-level parallelism
  • effective use of caches
  • Plus economics
  • Large-scale multiprocessors replace vector
    supercomputers

58
Raw Uniprocessor Performance: LINPACK
59
Raw Parallel Performance: LINPACK
  • Even vector Crays became parallel: X-MP (2-4),
    Y-MP (8), C-90 (16), T94 (32)
  • Since 1993, Cray produces MPPs too (T3D, T3E)

60
Another View
61
Top 10 Fastest Computers (Linpack)
  • Rank, Site, Computer, Processors, Year, Rmax
  • 1. DOE/NNSA/LLNL, USA: IBM BlueGene, 131,072,
    2005, 280,600
  • 2. NNSA/Sandia Labs, USA: Cray Red Storm
    (Opteron), 26,544, 2006, 101,400
  • 3. IBM Research, USA: IBM Blue Gene Solution,
    40,960, 2005, 91,290
  • 4. DOE/NNSA/LLNL, USA: ASCI Purple, IBM eServer
    p5, 12,208, 2006, 75,760
  • 5. Barcelona Center, Spain: IBM JS21 Cluster
    (PPC 970), 10,240, 2006, 62,630
  • 6. NNSA/Sandia Labs, USA: Dell Thunderbird
    Cluster, 9,024, 2006, 53,000
  • 7. CEA, France: Bull Tera-10 Itanium2 Cluster,
    9,968, 2006, 52,840
  • 8. NASA/Ames, USA: SGI Altix 1.5 GHz, Infiniband,
    10,160, 2004, 51,870
  • 9. GSIC Center, Japan: NEC/Sun Grid Cluster
    (Opteron), 11,088, 2006, 47,380
  • NEC Earth Simulator (top machine for 5 lists)
    moves down to #14
  • The #10 system has doubled in performance since
    last year

62
Top 500 Architectural Styles
63
Top 500 Processor Type
64
Top 500 Installation Type
65
Top 500 as of Nov 2006: Highlights
  • NEC Earth Simulator (top machine for 5 lists)
    moves down to #14
  • The #10 system has doubled in performance since
    last year
  • The #359 system from six months ago would rank
    only #500 in this list
  • Total performance of the top 500 is up from 2.3
    Pflops a year ago to 3.5 Pflops
  • Clusters are dominant at this scale: 359 of the
    top 500 are labeled as clusters
  • Dual-core processors growing in popularity: 75
    systems use dual-core Opterons, and 31 use Intel
    Woodcrest
  • IBM is top vendor with almost 50% of systems; HP
    is second
  • IBM and HP have 237 out of the 244 commercial and
    industrial installations
  • US has 360 of the top 500 installations, UK 32,
    Japan 30, Germany 19, China 18

66
Top 500 Linpack Performance over Time
67
Another View of Performance Growth
68
Another View of Performance Growth
69
Another View of Performance Growth
70
Another View of Performance Growth
71
Processor Types in Top 500 (2002)
72
Parallel and Distributed Systems
73
Outline
  • Drivers of Parallel Computing
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture

74
History
  • Historically, parallel architectures tied to
    programming models
  • Divergent architectures, with no predictable
    pattern of growth.

(Figure: application software and system software layered atop divergent
architectures: systolic arrays, SIMD, message passing, dataflow, shared
memory)
  • Uncertainty of direction paralyzed parallel
    software development!

75
Today
  • Extension of computer architecture to support
    communication and cooperation
  • OLD: Instruction Set Architecture
  • NEW: Communication Architecture
  • Defines
  • Critical abstractions, boundaries, and primitives
    (interfaces)
  • Organizational structures that implement
    interfaces (hw or sw)
  • Compilers, libraries and OS are important bridges
    between application and architecture today

76
Modern Layered Framework
77
Parallel Programming Model
  • What the programmer uses in writing applications
  • Specifies communication and synchronization
  • Examples
  • Multiprogramming: no communication or synch. at
    program level
  • Shared address space: like a bulletin board
  • Message passing: like letters or phone calls,
    explicit point to point
  • Data parallel: more regimented, global actions on
    data
  • Implemented with shared address space or message
    passing

78
Communication Abstraction
  • User level communication primitives provided by
    system
  • Realizes the programming model
  • Mapping exists between language primitives of
    programming model and these primitives
  • Supported directly by hw, or via OS, or via user
    sw
  • Lot of debate about what to support in sw and gap
    between layers
  • Today
  • Hw/sw interface tends to be flat, i.e. complexity
    roughly uniform
  • Compilers and software play important roles as
    bridges today
  • Technology trends exert strong influence
  • Result is convergence in organizational structure
  • Relatively simple, general purpose communication
    primitives

79
Communication Architecture
  • User/System Interface + Implementation
  • User/System Interface
  • Comm. primitives exposed to user-level by hw and
    system-level sw
  • (May be additional user-level software between
    this and prog model)
  • Implementation
  • Organizational structures that implement the
    primitives: hw or OS
  • How optimized are they? How integrated into
    processing node?
  • Structure of network
  • Goals
  • Performance
  • Broad applicability
  • Programmability
  • Scalability
  • Low Cost

80
Evolution of Architectural Models
  • Historically, machines were tailored to
    programming models
  • Programming model, communication abstraction, and
    machine organization lumped together as the
    architecture
  • Understanding their evolution helps understand
    convergence
  • Identify core concepts
  • Evolution of Architectural Models
  • Shared Address Space (SAS)
  • Message Passing
  • Data Parallel
  • Others (won't discuss): Dataflow, Systolic Arrays
  • Examine programming model, motivation, and
    convergence

81
Shared Address Space Architectures
  • Any processor can directly reference any memory
    location
  • Communication occurs implicitly as result of
    loads and stores
  • Convenient
  • Location transparency
  • Similar programming model to time-sharing on
    uniprocessors
  • Except processes run on different processors
  • Good throughput on multiprogrammed workloads
  • Naturally provided on wide range of platforms
  • History dates at least to precursors of
    mainframes in early 60s
  • Wide range of scale: few to hundreds of
    processors
  • Popularly known as shared memory machines or
    model
  • Ambiguous: memory may be physically distributed
    among processors

82
Shared Address Space Model
  • Process: virtual address space plus one or more
    threads of control
  • Portions of address spaces of processes are shared
  • Writes to shared address visible to other
    threads (in other processes too)
  • Natural extension of uniprocessor model:
    conventional memory operations for communication;
    special atomic operations for synchronization
    (see the sketch after this list)
  • OS uses shared memory to coordinate processes
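A minimal sketch of this model (illustrative only, not from the slides;
names and values are made up), using POSIX threads in C: communication
happens through ordinary stores and loads to a shared variable, while a
mutex and condition variable play the role of the special synchronization
operations.

    /* Shared-address-space sketch: communicate via ordinary loads/stores,
       synchronize via a mutex + condition variable. Compile with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>

    static double shared_data;        /* communicated through shared memory */
    static int ready = 0;             /* synchronization flag */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_data = 3.14;                /* ordinary store, visible to consumer */
        ready = 1;
        pthread_cond_signal(&cond);        /* special synchronization operation */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!ready)
            pthread_cond_wait(&cond, &lock);   /* wait for the producer's write */
        printf("consumer read %f\n", shared_data);  /* ordinary load */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }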

83
Communication Hardware for SAS
  • Also natural extension of uniprocessor
  • Already have processor, one or more memory
    modules and I/O controllers connected by hardware
    interconnect of some sort
  • Memory capacity increased by adding modules, I/O
    by controllers

Add processors for processing!
84
History of SAS Architecture
  • Mainframe approach
  • Motivated by multiprogramming
  • Extends crossbar used for mem bw and I/O
  • Originally, processor cost limited scale to
    small
  • later, the cost of the crossbar did
  • Bandwidth scales with p
  • High incremental cost: use multistage
    interconnect instead
  • Minicomputer approach
  • Almost all microprocessor systems have bus
  • Motivated by multiprogramming, TP
  • Used heavily for parallel computing
  • Called symmetric multiprocessor (SMP)
  • Latency larger than for uniprocessor
  • Bus is bandwidth bottleneck
  • caching is key: coherence problem
  • Low incremental cost

85
Example Intel Pentium Pro Quad
  • All coherence and multiprocessing glue integrated
    in processor module
  • Highly integrated, targeted at high volume
  • Low latency and bandwidth

86
Example SUN Enterprise
  • Memory on processor cards themselves
  • 16 cards of either type: processors + memory, or
    I/O
  • But all memory accessed over bus, so symmetric
  • Higher bandwidth, higher latency bus

87
Scaling Up
  • Problem is interconnect cost (crossbar) or
    bandwidth (bus)
  • Dance-hall: bandwidth still scalable, but at
    lower cost than a crossbar
  • latencies to memory uniform, but uniformly large
  • Distributed memory or non-uniform memory access
    (NUMA)
  • Construct shared address space out of simple
    message transactions across a general-purpose
    network (e.g. read-request, read-response)
  • Caching shared (particularly nonlocal) data?

88
Example Cray T3E
  • Scale up to 1024 processors, 480MB/s links
  • Memory controller generates comm. request for
    nonlocal references
  • Communication architecture tightly integrated
    into node
  • No hardware mechanism for coherence (SGI Origin
    etc. provide this)

89
Caches and Cache Coherence
  • Caches play key role in all cases
  • Reduce average data access time
  • Reduce bandwidth demands placed on shared
    interconnect
  • But private processor caches create a problem
  • Copies of a variable can be present in multiple
    caches
  • A write by one processor may not become visible
    to others
  • They'll keep accessing the stale value in their
    caches
  • Cache coherence problem
  • Need to take actions to ensure visibility

90
Example Cache Coherence Problem
  • Processors see different values for u after event
    3 (see the scenario below)
  • With write back caches, value written back to
    memory depends on happenstance of which cache
    flushes or writes back value when
  • Processes accessing main memory may see very
    stale value
  • Unacceptable to programs, and frequent!
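For concreteness, the scenario behind this slide (values are illustrative):
memory initially holds u = 5; P1 reads u into its cache (event 1), P3 reads
u into its cache (event 2), then P3 writes u = 7 (event 3). With write-back
caches, P1 keeps seeing the stale 5 in its cache, and another processor
reading from main memory may also see 5 until P3's cache happens to write
the new value back.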

91
Cache Coherence
  • Reading a location should return latest value
    written (by any process)
  • Easy in uniprocessors
  • Except for I/O coherence between I/O devices and
    processors
  • But infrequent, so software solutions work
  • Would like same to hold when processes run on
    different processors
  • E.g. as if the processes were interleaved on a
    uniprocessor
  • But coherence problem much more critical in
    multiprocessors
  • Pervasive and performance-critical
  • A very basic design issue in supporting the prog.
    model effectively
  • It's worse than that: what is the latest value
    with independent processes?
  • Memory consistency models

92
SGI Origin2000
  • Hub chip provides memory control, communication
    and cache coherence support
  • Plus I/O communication etc

93
Shared Address Space Machines Today
  • Bus-based, cache coherent at small scale
  • Distributed memory, cache-coherent at larger
    scale
  • Without cache coherence, are essentially (fast)
    message passing systems
  • Clusters of these at even larger scale

94
Message-Passing Programming Model
  • Send specifies data buffer to be transmitted and
    receiving process
  • Recv specifies sending process and application
    storage to receive into
  • Optional tag on send and matching rule on receive
  • Memory to memory copy, but need to name processes
  • User process names only local data and entities
    in process/tag space
  • In simplest form, the send/recv match achieves
    pairwise synch event
  • Other variants too
  • Many overheads: copying, buffer management,
    protection (a minimal send/recv sketch follows)
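A minimal send/recv sketch (illustrative only; it uses MPI as one concrete
message-passing library, and the buffer contents and tag are made up):
process 0 names the receiving process and a tag on the send; process 1
names the sending process, a matching tag, and the application storage to
receive into.

    /* Message-passing sketch: run with at least 2 processes, e.g. mpirun -np 2 */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* send: data buffer, receiving process (1), tag (99) */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* recv: application storage, sending process (0), matching tag */
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %g ... %g\n", buf[0], buf[3]);
        }
        MPI_Finalize();
        return 0;
    }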

95
Message Passing Architectures
  • Complete computer as building block, including
    I/O
  • Communication via explicit I/O operations
  • Programming model: directly access only private
    address space (local memory); communication via
    explicit messages (send/receive)
  • High-level block diagram similar to
    distributed-memory SAS
  • But comm. needn't be integrated into memory
    system, only I/O
  • History of tighter integration, evolving to
    spectrum incl. clusters
  • Easier to build than scalable SAS
  • Can use clusters of PCs or SMPs on a LAN
  • Programming model more removed from basic
    hardware operations
  • Library or OS intervention

96
Evolution of Message-Passing Machines
  • Early machines: FIFO on each link
  • Hw close to programming model: synchronous ops
  • Replaced by DMA, enabling non-blocking ops
  • Buffered by system at destination until recv
  • Diminishing role of topology
  • Store-and-forward routing: topology important
  • Introduction of pipelined routing made it less so
  • Cost is in node-network interface
  • Simplifies programming

97
Example IBM SP-2
  • Made out of essentially complete RS6000
    workstations
  • Network interface integrated in I/O bus (bw
    limited by I/O bus)
  • Doesn't need to see memory references

98
Example Intel Paragon
  • Network interface integrated in memory bus, for
    performance

99
Toward Architectural Convergence
  • Evolution and role of software have blurred
    boundary
  • Send/recv supported on SAS machines via buffers
  • Can construct global address space on MP using
    hashing
  • Software shared memory (e.g. using pages as units
    of comm.)
  • Hardware organization converging too
  • Tighter NI integration even for MP (low-latency,
    high-bandwidth)
  • At lower level, even hardware SAS passes hardware
    messages
  • Hw support for fine-grained comm makes software
    MP faster as well
  • Even clusters of workstations/SMPs are parallel
    systems
  • Fast system area networks (SAN)
  • Programming models distinct, but organizations
    converged
  • Nodes connected by general network and
    communication assists
  • Assists range in degree of integration, all the
    way to clusters

100
Data Parallel Systems
  • Programming model
  • Operations performed in parallel on each element
    of data structure
  • Logically single thread of control, performs
    sequential or parallel steps
  • Conceptually, a processor associated with each
    data element
  • Architectural model
  • Array of many simple, cheap processors with
    little memory each
  • Processors don't sequence through instructions
  • Attached to a control processor that issues
    instructions
  • Specialized and general communication, cheap
    global synchronization
  • Original motivations
  • Matches simple differential equation solvers
  • Centralize high cost of instruction
    fetch/sequencing

101
Application of Data Parallelism
  • Each PE contains an employee record with his/her
    salary
  • If salary > 100K then salary = salary * 1.05,
    else salary = salary * 1.10 (see the sketch after
    this list)
  • Logically, the whole operation is a single step
  • Some processors enabled for arithmetic operation,
    others disabled
  • Other examples
  • Finite differences, linear algebra, ...
  • Document searching, graphics, image processing,
    ...
  • Some recent machines
  • Thinking Machines CM-1, CM-2 (and CM-5)
  • Maspar MP-1 and MP-2,
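A minimal sketch of the salary example above in SPMD style (illustrative
only; the array size and values are made up, and C with OpenMP is just one
possible notation): logically a single step over all employee records, with
the conditional deciding which raise each element gets.

    /* Data-parallel salary update: one logical step over all records.
       Compile with -fopenmp (or equivalent). */
    #include <omp.h>

    #define N 100000
    static double salary[N];

    void raise_salaries(void) {
        #pragma omp parallel for            /* each element handled independently */
        for (int i = 0; i < N; i++) {
            if (salary[i] > 100000.0)
                salary[i] *= 1.05;          /* "enabled" elements: 5% raise */
            else
                salary[i] *= 1.10;          /* remaining elements: 10% raise */
        }
    }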

102
Evolution and Convergence
  • Rigid control structure (SIMD in Flynn taxonomy)
  • SISD = uniprocessor, MIMD = multiprocessor
  • Popular when the cost savings of a centralized
    sequencer were high
  • 60s: when the CPU was a cabinet
  • Replaced by vectors in mid-70s
  • More flexible w.r.t. memory layout and easier to
    manage
  • Revived in mid-80s when 32-bit datapath slices
    just fit on chip
  • No longer true with modern microprocessors
  • Other reasons for demise
  • Simple, regular applications have good locality,
    can do well anyway
  • Loss of applicability due to hardwiring data
    parallelism
  • MIMD machines as effective for data parallelism
    and more general
  • Prog. model converges with SPMD (single program
    multiple data)
  • Contributes need for fast global synchronization
  • Structured global address space, implemented with
    either SAS or MP

103
Convergence: Generic Parallel Architecture
  • A generic modern multiprocessor
  • Node: processor(s), memory system, plus
    communication assist
  • Network interface and communication controller
  • Scalable network
  • Communication assist provides primitives with
    perf profile
  • Build your programming model on this
  • Convergence allows lots of innovation, now within
    framework
  • Integration of assist with node, what operations,
    how efficiently...

104
Outline
  • Drivers of Parallel Computing
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture

105
The Model/System Contract
  • Model specifies an interface (contract) to the
    programmer
  • Naming: How are logically shared data and/or
    processes referenced?
  • Operations: What operations are provided on these
    data?
  • Ordering: How are accesses to data ordered and
    coordinated?
  • Replication: How are data replicated to reduce
    communication?
  • Underlying implementation addresses performance
    issues
  • Communication Cost: latency, bandwidth,
    overhead, occupancy
  • We'll look at the aspects of the contract through
    examples

106
Supporting the Contract
  • A given programming model can be supported in
    various ways at various layers
  • In fact, each layer takes a position on all
    issues (naming, ops, performance etc), and any
    set of positions can be mapped to another by
    software
  • Key issues for supporting programming models are
  • What primitives are provided at comm. abstraction
    layer
  • How efficiently are they supported (hw/sw)
  • How are programming models mapped to them

107
Recap of Parallel Architecture
  • Parallel architecture is an important thread in
    the evolution of architecture
  • At all levels
  • Multiple processor level now in mainstream of
    computing
  • Exotic designs have contributed much, but given
    way to convergence
  • Push of technology, cost and application
    performance
  • Basic processor-memory architecture is the same
  • Key architectural issue is in communication
    architecture
  • How communication is integrated into memory and
    I/O system on node
  • Fundamental design issues
  • Functional: naming, operations, ordering
  • Performance: organization, replication,
    performance characteristics
  • Design decisions driven by workload-driven
    evaluation
  • Integral part of the engineering focus