Project ARIES - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Project ARIES


1
Project ARIES
  • Advanced RAM Integration for Efficiency and
    Scalability
  • Presented by the 8th floor buttheads
  • (pun intended)

2
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

3
Motivation - System Level
  • Many applications have massive memory and/or
    computational requirements
  • graphics, CAD, physical simulation, factoring,
    etc.
  • Current parallel systems have severe limitations
  • difficult to program
  • poor network performance
  • insufficient scalability
  • There is a need for programmable systems scalable
    to 1M processors and beyond

4
Motivation - Architecture
  • Transistor counts have increased 1000x in 20
    years
  • 1978: 29K in Intel 8086
  • 1999: 23M in NVIDIA GeForce 256

8086 Die Photo
  • Suppose someone gave you 100M transistors and you
    had never seen a load-store architecture. What
    would you build?
  • There is a need for designs scalable to 1G
    transistors and beyond

5
Motivation - Area Efficiency
  • In modern processors < 25% of chip area devoted
    to useful work (red areas on die shot)
  • > 75% devoted to making the 25% faster
  • Even the 25% is bloated due to
  • large instruction sets
  • complex superscalar design
  • ...not necessarily a bad thing

K7 Die Photo
6
Motivation - Scalability
  • Today
  • 1-4 processors per die
  • Use available area to make processor fast
  • complicated designs
  • Tomorrow
  • N processors per die
  • Use available area for lots of processors
  • simple designs

7
Motivation - RAM Integration
  • Logic and DRAM on a single die provides
  • lower latency access to memory (2x)
  • much higher bandwidth (10x)
  • What architectural features are enabled by this
    technology?
  • SRAM-tagged DRAM
  • Transactional memory
  • Forwarding pointers
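
A minimal C sketch of the forwarding-pointer idea just listed, assuming a single "forwarded" tag bit per memory word that is checked on every access; the names and encoding are illustrative assumptions, not the ARIES design.

    #include <stdint.h>

    #define TAG_FORWARD 0x1u

    /* Software model of a hardware forwarding pointer: if a word's tag
     * marks it as forwarded, a load follows the stored address instead
     * of returning the word itself, so relocated data stays reachable. */
    typedef struct {
        uint32_t tag;     /* assumed per-word tag bits                      */
        uint64_t value;   /* data, or the new address if TAG_FORWARD is set */
    } mem_word_t;

    /* A load that transparently chases forwarding pointers. */
    uint64_t load(const mem_word_t *mem, uint64_t addr)
    {
        while (mem[addr].tag & TAG_FORWARD)
            addr = mem[addr].value;   /* follow to the word's new home */
        return mem[addr].value;
    }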

8
Research Goals
  • Highly programmable parallel system
  • language
  • compiler
  • operating system
  • architecture
  • Manage huge data sets
  • Area efficient architecture
  • Integration of processor and memory

9
Architectural Goals
  • Fast general-purpose processor
  • 1 GFLOPS per processing node
  • High-performance network
  • high bandwidth (4 GB/s at each node)
  • low latency (10 ns across a 1,000-node system)
  • Compilability
  • Scalability
  • up to 1,000,000 nodes

10
Talk Overview
  • Motivation and research goals
  • Previous work
  • System Description
  • Current work
  • Research Opportunities

11
Previous Work - Processors
  • Bill Dally
  • J-Machine
  • M-Machine
  • Imagine
  • Anant Agarwal
  • Alewife
  • RAW
  • NYU Ultracomputer
  • Hydra
  • VLIW stuff
  • KSR1
  • NUMAchine
  • Tera
  • IBM Power4

12
Previous Work - Networks
  • Tom Knight
  • Metro
  • Bill Dally/John Poulton
  • 4GB/s
  • CrayLink
  • SGI
  • Charles Leiserson
  • Fat-trees
  • SCI
  • HIPPI
  • CM-5
  • Myrinet
  • Arctic

13
Previous Work - Cache
  • SHASTA
  • DASH
  • FLASH
  • LimitLESS

14
Previous Work - Languages
  • NESL

15
Other Previous Work
  • Capabilities
  • Guarded Pointers (Dally)
  • Containment
  • IRAM
  • IStore
  • SimOS
  • Beowulf
  • Active Pages

16
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

17
System Description
  • Distributed shared memory machine
  • No caching of remote data
  • Single shared virtual address space
  • Large collection of processor-memory nodes
  • High-performance fat-tree network

18
System Description
  • OS
  • Memory management
  • threads
  • language
  • ???

19
Top-Level View
  • Bunnie should make this slide... A bunch of cabinets
    containing processor and network boards, 1 disk
    drive per network board, high-speed serial links,
    and enough wire to tether the moon to Tech Square.

20
Chip-Level View
  • array of processor-memory nodes connected by
    on-chip network
  • high speed signaling technology exists (Dally) to
    provide large off-chip bandwidth
  • e.g. 256 pin pairs at 2 Gb/s each = 512 Gb/s

21
Processor-Memory Node
22
Processing element
  • 128-bit multi-context VLIW processor
  • Four functional units (see the sketch after this list)
  • IEEE floating point
  • Hardware enforced capabilities
  • Hardware Macros???
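
One way to picture the instruction encoding mentioned above: a minimal C sketch that simply splits the 128-bit word into one 32-bit slot per functional unit. Only the 128-bit / four-unit split comes from the slide; the slot layout and the 8-bit opcode field are illustrative assumptions, not the ARIES encoding.

    #include <stdint.h>

    /* Hypothetical layout of the 128-bit VLIW word: one 32-bit operation
     * slot per functional unit, all issued in the same cycle. */
    typedef struct {
        uint32_t slot[4];   /* one operation per functional unit */
    } vliw_word_t;

    /* Assumed example decode: opcode held in the top 8 bits of each slot. */
    static inline uint8_t slot_opcode(vliw_word_t w, int fu)
    {
        return (uint8_t)(w.slot[fu] >> 24);
    }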

23
Simple Silicon
  • No dynamic superscalar issue
  • No out-of-order issue
  • No register renaming
  • No branch prediction
  • No speculative execution
  • No remote-data caching

24
Key Features
  • Fast, fault-tolerant network
  • blah blah blah
  • Augmented DRAM
  • SRAM-tagged DRAM (see the sketch after this list)
  • fast searching for marked data
  • high performance GC
  • experimental versioning
  • Hardware page tables
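
A small C model of the SRAM-tagged DRAM idea above, assuming one SRAM tag bit per DRAM word; a search for marked data (e.g. during GC) consults only the fast tags and touches DRAM only for rows that actually contain a marked word. All names and sizes are illustrative assumptions, not the ARIES design.

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ROW 64   /* assumed DRAM row width in words */

    /* One DRAM row plus its SRAM-resident tag bits (one bit per word). */
    typedef struct {
        uint64_t tag_bits;              /* fast SRAM: which words are marked */
        uint64_t data[WORDS_PER_ROW];   /* the DRAM row itself               */
    } tagged_row_t;

    /* Row-level filter: reads only the SRAM tags, never the DRAM. */
    static inline bool row_has_marked_word(const tagged_row_t *row)
    {
        return row->tag_bits != 0;
    }

    /* Index of the first marked word in a row, or -1 if none. */
    static inline int first_marked_word(const tagged_row_t *row)
    {
        for (int i = 0; i < WORDS_PER_ROW; i++)
            if (row->tag_bits & (1ull << i))
                return i;
        return -1;
    }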

25
Key Features
  • Fixed virtual address → node mapping
  • Eliminates global tables
  • Multi-striped address → node mapping (see the sketch after this list)
  • Stripe contiguous addresses across nodes at any
    power of two granularity
  • Memory efficient guarded pointers
  • Power-of-two schemes waste up to 50% of the
    virtual address space
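
A minimal sketch of the striped address → node mapping above, assuming the home node is computed directly from address bits (so no global translation table is needed) at a power-of-two stripe granularity. Field names and bit positions are illustrative assumptions, not the actual ARIES encoding.

    #include <stdint.h>

    typedef struct {
        unsigned log2_stripe;   /* stripe granularity, e.g. 12 for 4 KB stripes   */
        unsigned log2_nodes;    /* node count as a power of two, e.g. 20 for 1M   */
    } stripe_map_t;

    /* Map a virtual address to its home node: drop the within-stripe
     * offset, then take the low node-index bits.  Because the mapping is
     * a fixed function of the address, no global table lookup is needed. */
    static inline uint32_t addr_to_node(uint64_t vaddr, stripe_map_t m)
    {
        return (uint32_t)((vaddr >> m.log2_stripe) & ((1u << m.log2_nodes) - 1));
    }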

26
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

27
Hardware Design Effort
  • Node design
  • processor
  • memory system
  • Network design
  • routing protocol
  • topology
  • Cycle accurate simulator

28
Prototyping System
  • Simulator for complex system issues
  • not practical to do an accurate, pure software
    simulation of a massively parallel processor
  • Evaluation of experimental architectural features
    in a realistic environment
  • network topologies
  • language design
  • hardware support for programming features

29
Prototyping System
  • Modular design
  • easily reconfigure a single design for multiple
    network topologies by recabling and
    redistributing processor and network cards
  • network can be adapted for other
    performance-oriented platforms
  • enable collaboration among multiple research
    groups at different institutions

30
Modular Design
31
Prototype Node
  • Features for fast debug, profiling,
    reconfiguration
  • Architecture to enable experimentation with novel
    memory architectures and domains of execution

32
Prototype Node
33
Language Design
  • Communication intensive programming
  • Leverage data snapshots for migratability
  • e.g. function calls
  • Blah blah blah

34
Garbage Collection
  • Fast area local GC
  • Data migration
  • Paging friendly
  • Blah blah blah

35
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

36
Research Opportunities
  • Processor evaluation
  • hand coded sample apps
  • performance at various design points
  • Language design/evaluation
  • expresses parallelism and communication
  • programmability vs. compilability

37
Research Opportunities 2
  • Compiler Design
  • C compiler
  • PSCHEME compiler
  • VLIW scheduling
  • Operating System Design
  • Thread management, page management, memory
    management, etc. etc. etc.

38
Research Opportunities 3
  • Alternative Execution Domains
  • secure computing
  • efficient dynamic typing
  • transactional/speculative computing
  • dataflow