Transcript and Presenter's Notes

Title: How to Build a Petaflops Computer


1
How to Build a Petaflops Computer
Keynote address to the 3rd Workshop on the Petaflops Frontier
  • Thomas Sterling
  • California Institute of Technology
  • NASA Jet Propulsion Laboratory
  • February 22, 1999

2
(No Transcript)
3
Comparison to Present Technology
4
(No Transcript)
5
The High C's of Crossing to Petaflops Computing
  • Capability
  • Computation rate
  • Capacity of storage
  • Communication bandwidth
  • Cost
  • Component count
  • Connection complexity
  • Consumption of power
  • Concurrency
  • Cycles of latency
  • Customers and Ciller-applications
  • Confidence

6
POWR Workshop Overview
  • Petaflops initiative context
  • Objectives
  • Charter Guidelines
  • 3 Pflops system classes
  • COTS clusters
  • MPP system architecture
  • Hybrid-technology custom architecture
  • Specific group results
  • Summary findings
  • Open issues
  • Recommendations
  • Conclusions

7
MPP Petaflops System
  • COTS chips and industry standard interfaces
  • Custom glue-logic ASICs and SAN
  • New systems architecture
  • Distributed shared memory and cache based latency
    management
  • Algorithm/application methodologies
  • Specialized compile time and runtime software

8
MPP Breakout Group
Rudolf Eigenmann, Jose Fortes, David Frye, Kent Koeninger, Vipin Kumar,
John May, Paul Messina, Merrell Patrick, Paul Smith, Rick Stevens,
Valerie Taylor, Josep Torrellas, Paul Woodward
9
Summary of MPP
  • processor: 3 GHz, 10 Gflops
  • processors: 100,000
  • memory: 32 Tbytes DRAM, 40 ns local access time
  • interconnect: frame-switched, 128 Gbps/channel
  • secondary storage: 1 Pbyte, 1 ms access time
  • distributed shared memory
  • latency management: cache coherence protocol
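
A quick sanity check on how these MPP figures compose into a petaflops; the derived ratios below are my own arithmetic, not from the slide:
```python
# Back-of-envelope check of the MPP design point above.
procs = 100_000                 # processors
flops_each = 10e9               # 10 Gflops per processor
mem_bytes = 32e12               # 32 Tbytes aggregate DRAM

peak_flops = procs * flops_each            # 1.0e15 = 1 petaflops
mem_per_proc_mb = mem_bytes / procs / 1e6  # 320 MB per processor
bytes_per_flop = mem_bytes / peak_flops    # 0.032 bytes per flop/s

print(f"peak: {peak_flops:.2e} flop/s")
print(f"memory per processor: {mem_per_proc_mb:.0f} MB")
print(f"byte-to-flop ratio: {bytes_per_flop:.3f}")
```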

10
COTS Clustered Petaflops System
  • NO specialized hardware
  • Leverages mass market economy of scale
  • Distributed memory model with message passing
  • Incorporates desktop/server mainstream component
    systems
  • Integrated by means of COTS networking technology
  • Augmented by new application algorithm
    methodologies and system software
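
Since the COTS-cluster approach leans entirely on distributed memory with message passing, here is a minimal sketch of that programming style; mpi4py is my choice purely for illustration, not something the slide prescribes:
```python
# Minimal message-passing sketch of the COTS-cluster model (illustrative;
# mpi4py is an assumption of mine, not part of the original proposal).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each node owns only its local slice of the data (distributed memory).
local = [float(rank)] * 4

# Boundary exchange with neighbours is explicit message passing.
right, left = (rank + 1) % size, (rank - 1) % size
halo = comm.sendrecv(local[-1], dest=right, source=left)
print(f"rank {rank}: halo value {halo} received from rank {left}")
```
Run with, for example, `mpiexec -n 4 python cluster_sketch.py`; every byte moved between nodes is an explicit message the user manages.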

11
COTS Cluster Breakout Group
David H. Bailey, James Bieda, Remy Evard, Robert Clay, Al Geist,
Carl Kesselman, David E. Keyes, Andrew Lumsdaine, James R. McGraw,
Piyush Mehrotra, Daniel Savarese, Bob Voigt, Michael S. Warren
12
Summary of COTS Cluster
  • processor: 3 GHz, 10 Gflops
  • processors: 100,000
  • memory: 32 Tbytes DRAM, 40 ns access time
  • interconnect: degree-12 n-cube, 20 Gbps/channel
  • secondary storage: 1 Pbyte, 1 ms access time
  • distributed memory, 3 levels of cache, 1 level of DRAM
  • latency management: software
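
A back-of-envelope look at what the degree-12, 20 Gbps/channel interconnect implies per node and in aggregate (my arithmetic, not the group's):
```python
# Back-of-envelope on the COTS-cluster interconnect above.
nodes = 100_000
degree = 12                      # links per node in the degree-12 n-cube
link_gbps = 20                   # per channel

per_node_gbps = degree * link_gbps            # 240 Gbps injection per node
total_links = nodes * degree // 2             # each link joins two nodes
aggregate_pbps = total_links * link_gbps / 1e6

print(f"per-node link bandwidth: {per_node_gbps} Gbps")
print(f"links: {total_links:,}, aggregate: {aggregate_pbps:.1f} Pbps")
```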

13
Hybrid Technology Petaflops System
  • New device technologies
  • New component designs
  • New subsystem architecture
  • New system architecture
  • New latency management paradigm and mechanisms
  • New algorithms/applications
  • New compile time and runtime software

14
HTMT Breakout Group
Larry Bergman, Nikos Chrisochoides, Vincent Freeh, Guang R. Gao,
Peter Kogge, Phil Merkey, John Van Rosendale, John Salmon, Burton Smith,
Thomas Sterling
15
Summary of HTMT
  • processor: 150 GHz, 600 Gflops
  • processors: 2048
  • memory: 16 Tbytes PIM-DRAM, 80 ns access time
  • interconnect: Data Vortex, 500 Gbps/channel, > 10 Pbps bisection
    bandwidth
  • 3/2 storage: 1 Pbyte, 10 µs access time
  • shared memory, 4-level hierarchy
  • latency management: multithreaded with percolation
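
A quick check of how the HTMT figures compose (my arithmetic, not from the slide):
```python
# Back-of-envelope check of the HTMT design point above.
procs = 2048
gflops_each = 600
clock_ghz = 150
pim_bytes = 16e12                # 16 Tbytes PIM-DRAM

peak_pflops = procs * gflops_each / 1e6       # ~1.23 Pflops
flops_per_cycle = gflops_each / clock_ghz     # 4 flops per clock
pim_gb_each = pim_bytes / procs / 1e9         # ~7.8 GB PIM-DRAM per processor

print(f"peak: {peak_pflops:.2f} Pflops, {flops_per_cycle:.0f} flops/cycle,"
      f" {pim_gb_each:.1f} GB PIM-DRAM per processor")
```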

16
Summary Findings
  • Architecture is important
  • Bandwidth requirements dominate hardware
    structures
  • Latency management determines runtime resource
    management strategy
  • Efficient mechanisms for overhead services
  • Generality of application workload dependent on
    interconnect throughput and response time
  • COTS processors will not hide system latency,
    even if multithreading is adopted
  • More memory than earlier thought may be needed
  • MPP problem is very difficult, unclear which
    direction to take
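
To make the latency point concrete, a rough Little's-law illustration of why a commodity core cannot hide system-scale latency on its own; the latency figure below is an assumption of mine, not a number from the workshop:
```python
# Concurrency needed to hide latency = latency x issue rate (Little's law).
issue_rate_hz = 3e9        # a 3 GHz COTS processor issuing ~1 op per cycle
remote_latency_s = 1e-6    # assumed ~1 microsecond remote-access latency

in_flight = issue_rate_hz * remote_latency_s
print(f"~{in_flight:.0f} operations must be in flight per processor")
# ~3000 -- far more outstanding requests than the few hardware threads or
# cache misses a commodity core of that era could keep in flight.
```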

17
Summary Findings (cont)
  • COTS clusters will provide safe migration path at
    best price-performance but must rely on user
    management of all system resources
  • Inter-process load balancing too expensive on
    clusters
  • New formalism required to expose diverse modes of
    parallelism
  • Compilers can't make all performance decisions; they
    must be combined with collaborative runtime software
  • Critical-path performance decision tree requires
    new internal protocols
  • User must describe application properties, not
    means

18
Open Issues
  • Is a network of processor/memories the best use of
    multi-billion-transistor chips?
  • Is convergence real, or only a point of inflexion?
  • Will semiconductors continue to push beyond 0.15
    micron, and do market costs support it?
  • Can alternative technology fabrication be
    supported/avoided?
  • Can orders-of-magnitude latency be managed?
  • What will the computer languages of the Pflops
    era look like?
  • Processor granularity: fine and many, or fat and
    few?

19
Return to a Single Node (but highly parallel)
  • Emergence of a new class of high end computer
  • Return to a single world API image
  • Eliminate (virtualize) processors from the name
    space
  • Unburden application programs from direct
    resource management
  • Latency management an intrinsic architecture
    responsibility (with compiler assist)
  • Enable adaptive system operation at hyper speeds
  • Leap-frog the conventional price-performance-power
    curves for wide market

20
(No Transcript)
21
HTMT Objectives
  • Scalable architecture with high sustained
    performance in the presence of disparate cycle
    times and latencies
  • Exploit diverse device technologies to achieve
    substantially superior operating point
  • Execution model to simplify parallel system
    programming and expand generality and
    applicability

22
Hybrid Technology MultiThreaded Architecture
23
Summary of HTMT
  • processor: 150 GHz, 600 Gflops
  • processors: 2048
  • memory: 16 Tbytes PIM-DRAM, 80 ns access time
  • interconnect: Data Vortex, 500 Gbps/channel, > 10 Pbps bisection
    bandwidth
  • 3/2 storage: 1 Pbyte, 10 µs access time
  • shared memory, 4-level hierarchy
  • latency management: multithreaded with percolation

24
(No Transcript)
25
Storage Capacity by Subsystem (2007 Design Point)
26
(No Transcript)
27
HTMT Strategy
  • High performance
  • Superconductor RSFQ logic
  • Data Vortex optical interconnect network
  • PIM smart memory
  • Low power
  • Superconductor RSFQ logic
  • Optical holographic storage
  • PIM smart memory

28
HTMT Strategy (cont)
  • Low cost
  • reduce wire count through chip-to-chip fiber
  • reduce processor count through x100 clock speed
  • reduce memory chips through the 3/2 holographic
    memory layer
  • Efficiency
  • processor level multithreading
  • smart memory managed second stage context pushing
    multithreading
  • fine-grain regular and irregular data parallelism
    exploited in memory
  • high memory bandwidth and low latency ops through
    PIM
  • memory to memory interactions without processor
    intervention
  • hardware mechanisms for synchronization,
    scheduling, data/context migration, gather/scatter

29
HTMT Strategy (cont)
  • Programmability
  • Global shared name space
  • hierarchical parallel thread flow control model
  • no explicit processor naming
  • automatic latency management
  • automatic processor load balancing
  • runtime fine grain multithreading
  • automatic context pushing for process migration
    (percolation)
  • configuration transparent, runtime scalable

30
HTMT Organization
  • Sponsors: NSA (G. Cotter, Doc Bedard, W. Carlson (IDA)), NASA (E. Tu,
    W. Johnston), DARPA (J. Munoz)
  • PI: T. Sterling; Project Manager: L. Bergman
  • Steering Committee: P. Messina
  • Project AA: D. Crawford; Project Secretary: A. Smythe
  • System Engineer: (S. Monacos); Tech Publishing: M. MacDonald
  • Co-Is: Princeton (K. Bergman, C. Reed (IDA)), University of Delaware
    (G. Gao), CACR (P. Messina), Notre Dame (P. Kogge), SUNY (K. Likharev),
    Tera (B. Smith), Caltech (D. Psaltis), Argonne (R. Stevens), UCSB
    (M. Rodwell, M. Melliar-Smith), JPL (D. Curkendall, H. Siegel, T. Cwik),
    TI (G. Armstrong), TRW (A. Silver), HYPRES (E. Track), RPI
    (J. McDonald), Univ. of Rochester (M. Feldman)
31
Areas of Accomplishments
  • Concepts and Structures
  • approach strategy
  • device technologies
  • subsystem design
  • efficiency, productivity, generality
  • System Architecture
  • size, cost, complexity, power
  • System Software
  • resource management
  • multiprocessor emulator
  • Applications
  • multithreaded codes
  • scaling models
  • Evaluation
  • feasibility
  • cost
  • performance
  • Future Directions
  • Phase 3 prototype
  • Phase 4 petaflops system
  • Proposals

32
RSFQ Roadmap (VLSI Circuit Clock Frequency)
33
(No Transcript)
34
Advantages
  • X100 clock speeds achievable
  • X100 power efficiency advantage
  • Easier fabrication
  • Leverage semiconductor fabrication tools
  • First technology to encounter ultra-high speed
    operation

35
Superconductor Processor
  • 100 GHz clock, 33 GHz inter-chip
  • 0.8 micron Niobium on Silicon
  • 100K gates per chip
  • 0.05 watts per processor
  • 100 Kwatts per Petaflops
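
Reading the power figures together; the cryocooler overhead ratio below is an assumption of mine, not a number from the slides:
```python
# How 0.05 W per processor can still mean ~100 Kwatts per petaflops.
watts_per_proc = 0.05
procs_per_pflops = 1e15 / 600e9          # ~1667 SPELLs at 600 Gflops each

logic_watts_4k = watts_per_proc * procs_per_pflops   # ~83 W dissipated at 4 K
assumed_cooling_ratio = 1000             # assumed ~1 kW wall power per W at 4 K
wall_kw = logic_watts_4k * assumed_cooling_ratio / 1e3

print(f"{logic_watts_4k:.0f} W at 4 K -> roughly {wall_kw:.0f} kW at the wall")
# In the same ballpark as the 100 Kwatts-per-petaflops figure above.
```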

36
Accomplishments - Processor
  • SPELL Architecture
  • Detailed circuit design for critical paths
  • CRAM Memory design initiated
  • 1st network design and analysis/simulation
  • 750 GHz logic demonstrated
  • Detailed sizing, cost, and power analysis
  • Estimate for fabrication facilities investment
  • Barriers and path to 0.4-0.25 micron regime
  • Sizing for Phase 3 50 Gflops processor

37
Data Vortex Optical Interconnect
38
(No Transcript)
39
Data Vortex Latency Distribution (network height 1024)
40
  • Single-mode rib waveguides on silicon-on-insulator wafers
  • Hybrid sources and detectors
  • Mix of CMOS-like and micromachining-type processes for fabrication
e.g. R. A. Soref, J. Schmidtchen, K. Petermann, IEEE J. Quantum
Electron. 27, p. 1971 (1991); A. Rickman, G. T. Reed, B. L. Weiss,
F. Namavar, IEEE Photonics Technol. Lett. 4, p. 633 (1992); B. Jalali,
P. D. Trinh, S. Yegnanarayanan, F. Coppinger, IEE Proc. Optoelectron.
143, p. 307 (1996)
41
Data Vortex Parameters for Petaflops in 2007
  • Bi-section sustained bandwidth: 4000 Tbps
  • Per-port data rate: 640 Gbps
  • Single-wavelength channel rate: 10 Gbps
  • Level of WDM: 64 colors
  • Number of input ports: 6250
  • Angle nodes: 7
  • Network node height: 4096
  • Number of nodes per cylinder: 28,672
  • Number of cylinders: 13
  • Total node number: 372,736
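
These parameters compose consistently; a quick check (my arithmetic):
```python
# Consistency check of the Data Vortex parameters above.
colors, per_color_gbps = 64, 10
ports = 6250
height, angle_nodes, cylinders = 4096, 7, 13

port_gbps = colors * per_color_gbps           # 640 Gbps per port
bisection_tbps = ports * port_gbps / 1e3      # 4000 Tbps sustained
nodes_per_cylinder = height * angle_nodes     # 28,672
total_nodes = nodes_per_cylinder * cylinders  # 372,736

print(port_gbps, bisection_tbps, nodes_per_cylinder, total_nodes)
```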

42
Accomplishments - Data Vortex
  • Implemented and tested optical device technology
  • Prototyped electro-optical butterfly switch
  • Design study of electro-optic integrated switch
  • Implemented and tested most of end-to-end path
  • Design of topology to size
  • Simulation of network behavior under load
  • Modified structure for ease of packaging
  • Size, complexity, power studies
  • Initial interface design

43
PIM Provides Smart Memory
  • Merge logic and memory
  • Integrate multiple logic/mem stacks on single
    chip
  • Exposes high intrinsic memory bandwidth
  • Reduction of memory access latency
  • Low overhead for memory oriented operations
  • Manages data structure manipulation, context
    coordination and percolation

44
Multithreaded PIM DRAM
  • Multithreaded Control of PIM Functions
  • multiple operation sequences with low context
    switching overhead
  • maximize memory utilization and efficiency
  • maximize processor and I/O utilization
  • multiple banks of row buffers to hold data,
    instructions, and addr
  • data parallel basic operations at row buffer
  • manages shared resources such as FP
  • Direct PIM to PIM Interaction
  • memory communicates with memory by parcels, within
    and across chip boundaries, without external control
    processor intervention
  • exposes fine grain parallelism intrinsic to
    vector and irregular data structures
  • e.g. pointer chasing, block moves,
    synchronization, data balancing
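
A toy sketch of the parcel idea described above: rather than the processor pulling every list node across the network, the chase runs in the memory that holds the data and only the result travels back. The class and method names are hypothetical, purely for illustration:
```python
# Illustrative only: parcel-style pointer chasing handled in smart memory.
class PIMBank:
    """One PIM node: a slice of memory plus simple local sequencing logic."""
    def __init__(self, cells):
        self.cells = cells              # addr -> (value, next_addr or None)

    def handle_parcel(self, addr):
        """Chase a linked structure locally; only the result leaves the chip."""
        hops = 0
        value, nxt = self.cells[addr]
        while nxt is not None and nxt in self.cells:
            value, nxt = self.cells[nxt]
            hops += 1
        # A real parcel would be forwarded onward when nxt lives on another chip.
        return value, hops

bank = PIMBank({0: (10, 4), 4: (20, 8), 8: (30, None)})
print(bank.handle_parcel(0))            # (30, 2): resolved entirely in memory
```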

45
Accomplishments - PIM DRAM
  • Establish operational opportunity and
    requirements
  • Win $12.2M DARPA contract for DIVA
  • USC ISI prime
  • Caltech, Notre Dame, U. of Delaware
  • Deliver 8 Mbyte part in FY01 at 0.25 micron
  • Architecture concept design complete
  • parcel message driven computation
  • multithreaded resource management
  • Analysis of size, power, bandwidth
  • DIVA to be used directly in Phase 3 testbed

46
Holographic 3/2 Memory
Performance Scaling
  • Advantages
  • petabyte memory
  • competitive cost
  • 10 µsec access time
  • low power
  • efficient interface to DRAM
  • Disadvantages
  • recording rate is slower than the readout rate
    for LiNbO3
  • recording must be done in GB chunks
  • long term trend favors DRAM unless new materials
    and lasers are used
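
For scale, the access times quoted across these slides span several orders of magnitude; the ratios below are my arithmetic, and the 1 ms entry is the disk figure quoted for the MPP and COTS designs, since none is given here:
```python
# Relative access times across the storage hierarchy, from the quoted figures.
levels = {
    "PIM-DRAM (16 TB)":        80e-9,   # 80 ns
    "holographic 3/2 (1 PB)":  10e-6,   # 10 microseconds
    "disk (1 PB, MPP/COTS)":   1e-3,    # 1 ms
}
base = levels["PIM-DRAM (16 TB)"]
for name, t in levels.items():
    print(f"{name:26s} {t*1e9:>12,.0f} ns   ({t/base:,.0f}x DRAM)")
```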

47
Accomplishments - HoloStore
  • Detailed study of two optical storage
    technologies
  • photorefractive
  • spectral hole burning
  • Operational photorefractive read/write storage
  • Access approaches explored for the 10 µsec regime
  • pixel array
  • wavelength multiplexing
  • Packaging studies
  • power, size, cost analysis

48
Multilevel Multithreaded Execution Model
  • Extend latency hiding of multithreading
  • Hierarchy of logical threads
  • Delineates threads and thread ensembles
  • Action sequences, state, and precedence
    constraints
  • Fine grain single cycle thread switching
  • Processor level, hides pipeline and time of
    flight latency
  • Coarse grain context "percolation"
  • Memory level, in memory synchronization
  • Ready contexts move toward processors, pending
    contexts towards big memory
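
A toy illustration (not HTMT code) of the fine-grain end of this model: each strand yields at a long-latency operation and the processor simply switches to another ready strand instead of stalling:
```python
# Illustrative only: single-cycle switching among ready strands.
def strand(sid, loads):
    for i in range(loads):
        yield f"strand {sid}: issued load {i}"   # switch point: miss in flight

def interleave(strands):
    ready = list(strands)
    while ready:
        s = ready.pop(0)
        try:
            print(next(s))       # do one unit of work
            ready.append(s)      # rotate: another strand hides the latency
        except StopIteration:
            pass                 # strand finished

interleave([strand(0, 2), strand(1, 2), strand(2, 2)])
```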

49
HTMT Thread Activation State Diagram
Percolation of threads
50
Percolation of Active Tasks
  • Multiple stage latency management methodology
  • Augmented multithreaded resource scheduling
  • Hierarchy of task contexts
  • Coarse-grain contexts coordinate in PIM memory
  • Ready contexts migrate to SRAM under PIM control
    releasing threads for scheduling
  • Threads pushed into SRAM/CRAM frame buffers
  • Strands loaded in register banks on space
    available basis
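
A toy sketch of the percolation flow just described; the staging and capacity limits here are simplified stand-ins of mine, not the real runtime. Ready contexts are pushed inward toward the fast processors while pending ones wait out in PIM memory:
```python
# Illustrative only: contexts percolate from DRAM-PIM toward CRAM frames.
from collections import deque

pending_in_dram = [("ctx-A", False), ("ctx-B", True), ("ctx-C", True)]
sram_buffer, cram_frames = deque(), deque()
CRAM_FRAME_SLOTS = 2                       # space-available scheduling

# PIM-side runtime: push contexts whose data/synchronization is ready to SRAM.
still_pending = []
for name, ready in pending_in_dram:
    if ready:
        sram_buffer.append(name)           # coarse-grain context push
    else:
        still_pending.append(name)         # waits out in big memory

# Cryogenic side loads frames from SRAM only when slots are free.
while sram_buffer and len(cram_frames) < CRAM_FRAME_SLOTS:
    cram_frames.append(sram_buffer.popleft())

print("pending in DRAM-PIM:", still_pending)
print("staged in SRAM:", list(sram_buffer), "| in CRAM frames:", list(cram_frames))
```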

51
HTMT Percolation Model
(Diagram: the run-time system on the SRAM-PIM side exchanges parcels with
the cryogenic area through a parcel dispatcher/dispenser, parcel
assembly/disassembly, and parcel invocation/termination, staged through the
C-Buffer, I-Queue, A-Queue, T-Queue, and D-Queue; contexts move by DMA to
CRAM, by split-phase synchronization to SRAM, and by DMA to DRAM-PIM, with
buffers re-used between start and done.)
52
(Diagram, side view of the cryogenic core: a 4 K stage dissipating 50 W
inside a 77 K enclosure, with fiber/wire interconnects; labeled dimensions
range from 0.3 m to 3 m.)
53
Top Down View of HTMT Machine (2007 Design Point)
54
(Diagram, side view of the full machine: the 4 K / 50 W core inside its
77 K enclosure with helium and nitrogen supplies; fiber/wire interconnects
and a cable tray assembly; a front-end computer server and console; a WDM
source with optical amplifiers and 980 nm pumps (20 cabinets); a hard disk
array (40 cabinets); a tape silo array (400 silos); 220-volt generators;
labeled dimensions of roughly 0.5 m to 3 m.)
55
HTMT Facility (Top View)
56
Floor Area
57
Power Dissipation by Subsystem (Petaflops Design Point)
58
Subsystem Interfaces (2007 Design Point)
  • Same colors indicate a connection between
    subsystems
  • Horizontal lines group interfaces within a
    subsystem

59
Accomplishments - Systems
  • System architecture completed
  • Physical structure design
  • Parts count, power, interconnect complexity
    analysis
  • Infrastructure requirements and impact
  • Feasibility assessment

60
Distributed Isomorphic Simulator
  • Executable Specification
  • subsystem functional/operational description
  • inter-subsystem interface protocol definition
  • Distributed Low-cost Cluster of processors
  • Cluster partitioned and allocated to separate
    subsystems
  • Subsystem development groups own cluster
    partitions, and develop functional specification
  • Subsystem partitions interact by agreed-upon
    interface protocols
  • Runtime percolation and thread scheduling system
    software put on top of emulation software.
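
A minimal sketch of the executable-specification idea; the subsystem and message names are placeholders, purely illustrative. Each group's partition is a functional model that talks to the others only through an agreed message protocol:
```python
# Illustrative only: subsystem models interacting via an agreed protocol.
class Subsystem:
    def __init__(self, name):
        self.name = name

    def handle(self, msg):
        # Each development group supplies its own functional description here.
        return {"from": self.name, "ack": msg["type"]}

class Interconnect:
    """The agreed-upon interface: typed messages routed by subsystem name."""
    def __init__(self, subsystems):
        self.by_name = {s.name: s for s in subsystems}

    def send(self, dest, msg):
        return self.by_name[dest].handle(msg)

net = Interconnect([Subsystem("SPELL"), Subsystem("SRAM-PIM"),
                    Subsystem("DRAM-PIM")])
print(net.send("DRAM-PIM", {"type": "parcel", "payload": "gather"}))
```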

61
(No Transcript)