Project ARIES - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Project ARIES


1
Project ARIES
  • Advanced RAM Integration for Efficiency and
    Scalability
  • Presented by the 8th floor buttheads
  • (pun intended)

2
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

3
Motivation - System Level
  • Many applications have massive memory and/or
    computational requirements
  • graphics, CAD, physical simulation, factoring,
    etc.
  • Current parallel systems have severe limitations
  • difficult to program
  • poor network performance
  • insufficient scalability
  • There is a need for programmable systems scalable
    to 1M processors and beyond

4
Motivation - Architecture
  • Transistor counts have increased 1000x in 20
    years
  • 1978: 29K in Intel 8086
  • 1999: 23M in NVIDIA GeForce 256

8086 Die Photo
  • Suppose someone gave you 100M transistors and you
    had never seen a load-store architecture. What
    would you build?
  • There is a need for designs scalable to 1G
    transistors and beyond

5
Motivation - Area Efficiency
  • In modern processors < 25% of chip area devoted
    to useful work (red areas on die shot)
  • > 75% devoted to making the 25% faster
  • Even the 25% is bloated due to
  • large instruction sets
  • complex superscalar design
  • ...not necessarily a bad thing

K7 Die Photo
6
Motivation - Scalability
  • Today
  • 1-4 processors per die
  • Use available area to make processor fast
  • complicated designs
  • Tomorrow
  • N processors per die
  • Use available area for lots of processors
  • simple designs

7
Motivation - RAM Integration
  • Logic and DRAM on a single die provides
  • lower latency access to memory (2x)
  • much higher bandwidth (10x)
  • What architectural features are enabled by this
    technology?
  • SRAM-tagged DRAM
  • Transactional memory
  • Forwarding pointers
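
A minimal C sketch of the forwarding-pointer idea just listed, assuming a single "forwarded" tag bit per memory word that is checked on every access; the names and encoding are illustrative assumptions, not the ARIES design.

    #include <stdint.h>

    #define TAG_FORWARD 0x1u

    /* Software model of a hardware forwarding pointer: if a word's tag
     * marks it as forwarded, a load follows the stored address instead
     * of returning the word itself, so relocated data stays reachable. */
    typedef struct {
        uint32_t tag;     /* assumed per-word tag bits                      */
        uint64_t value;   /* data, or the new address if TAG_FORWARD is set */
    } mem_word_t;

    /* A load that transparently chases forwarding pointers. */
    uint64_t load(const mem_word_t *mem, uint64_t addr)
    {
        while (mem[addr].tag & TAG_FORWARD)
            addr = mem[addr].value;   /* follow to the word's new home */
        return mem[addr].value;
    }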

8
Research Goals
  • Highly programmable parallel system
  • language
  • compiler
  • operating system
  • architecture
  • Manage huge data sets
  • Area efficient architecture
  • Integration of processor and memory

9
Architectural Goals
  • Fast general-purpose processor
  • 1 GFLOPS per processing node
  • High-performance network
  • high bandwidth (4 GB/s at each node)
  • low latency (10 ns across a 1,000-node system)
  • Compilability
  • Scalability
  • up to 1,000,000 nodes

10
Talk Overview
  • Motivation and research goals
  • Previous work
  • System Description
  • Current work
  • Research Opportunities

11
Previous Work - Processors
  • Bill Dally
  • J-Machine
  • M-Machine
  • Imagine
  • Anant Agarwal
  • Alewife
  • RAW
  • NYU Ultracomputer
  • Hydra
  • VLIW stuff
  • KSR1
  • NUMAchine
  • Tera
  • IBM Power4

12
Previous Work - Networks
  • Tom Knight
  • Metro
  • Bill Dally/John Poulton
  • 4GB/s
  • CrayLink
  • SGI
  • Charles Leiserson
  • Fat-trees
  • SCI
  • HIPPI
  • CM-5
  • Myrinet
  • Arctic

13
Previous Work - Cache
  • SHASTA
  • DASH
  • FLASH
  • LimitLESS

14
Previous Work - Languages
  • NESL

15
Other Previous Work
  • Capabilities
  • Guarded Pointers (Dally)
  • Containment
  • IRAM
  • IStore
  • SimOS
  • Beowulf
  • Active Pages

16
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

17
System Description
  • Distributed shared memory machine
  • No caching of remote data
  • Single shared virtual address space
  • Large collection of processor-memory nodes
  • High-performance fat-tree network

18
System Description
  • OS
  • Memory management
  • threads
  • language
  • ???

19
Top-Level View
  • Bunnie should make this slide... A bunch of cabinets
    containing processor and network boards, 1 disk
    drive per network board, high-speed serial links,
    and enough wire to tether the moon to Tech Square.

20
Chip-Level View
  • array of processor-memory nodes connected by
    on-chip network
  • high speed signaling technology exists (Dally) to
    provide large off-chip bandwidth
  • e.g. 256 pin pairs at 2 Gb/s each = 512 Gb/s

21
Processor-Memory Node
22
Processing element
  • 128-bit multi-context VLIW processor
  • Four functional units (see the sketch after this list)
  • IEEE floating point
  • Hardware enforced capabilities
  • Hardware Macros???
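
One way to picture the instruction encoding mentioned above: a minimal C sketch that simply splits the 128-bit word into one 32-bit slot per functional unit. Only the 128-bit / four-unit split comes from the slide; the slot layout and the 8-bit opcode field are illustrative assumptions, not the ARIES encoding.

    #include <stdint.h>

    /* Hypothetical layout of the 128-bit VLIW word: one 32-bit operation
     * slot per functional unit, all issued in the same cycle. */
    typedef struct {
        uint32_t slot[4];   /* one operation per functional unit */
    } vliw_word_t;

    /* Assumed example decode: opcode held in the top 8 bits of each slot. */
    static inline uint8_t slot_opcode(vliw_word_t w, int fu)
    {
        return (uint8_t)(w.slot[fu] >> 24);
    }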

23
Simple Silicon
  • No dynamic superscalar issue
  • No out-of-order issue
  • No register renaming
  • No branch prediction
  • No speculative execution
  • No remote-data caching

24
Key Features
  • Fast, fault-tolerant network
  • blah blah blah
  • Augmented DRAM
  • SRAM-tagged DRAM (see the sketch after this list)
  • fast searching for marked data
  • high performance GC
  • experimental versioning
  • Hardware page tables
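
A small C model of the SRAM-tagged DRAM idea above, assuming one SRAM tag bit per DRAM word; a search for marked data (e.g. during GC) consults only the fast tags and touches DRAM only for rows that actually contain a marked word. All names and sizes are illustrative assumptions, not the ARIES design.

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ROW 64   /* assumed DRAM row width in words */

    /* One DRAM row plus its SRAM-resident tag bits (one bit per word). */
    typedef struct {
        uint64_t tag_bits;              /* fast SRAM: which words are marked */
        uint64_t data[WORDS_PER_ROW];   /* the DRAM row itself               */
    } tagged_row_t;

    /* Row-level filter: reads only the SRAM tags, never the DRAM. */
    static inline bool row_has_marked_word(const tagged_row_t *row)
    {
        return row->tag_bits != 0;
    }

    /* Index of the first marked word in a row, or -1 if none. */
    static inline int first_marked_word(const tagged_row_t *row)
    {
        for (int i = 0; i < WORDS_PER_ROW; i++)
            if (row->tag_bits & (1ull << i))
                return i;
        return -1;
    }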

25
Key Features
  • Fixed virtual address → node mapping
  • Eliminates global tables
  • Multi-striped address → node mapping (see the sketch after this list)
  • Stripe contiguous addresses across nodes at any
    power of two granularity
  • Memory efficient guarded pointers
  • Power-of-two schemes waste up to 50% of the
    virtual address space
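
A minimal sketch of the striped address → node mapping above, assuming the home node is computed directly from address bits (so no global translation table is needed) at a power-of-two stripe granularity. Field names and bit positions are illustrative assumptions, not the actual ARIES encoding.

    #include <stdint.h>

    typedef struct {
        unsigned log2_stripe;   /* stripe granularity, e.g. 12 for 4 KB stripes   */
        unsigned log2_nodes;    /* node count as a power of two, e.g. 20 for 1M   */
    } stripe_map_t;

    /* Map a virtual address to its home node: drop the within-stripe
     * offset, then take the low node-index bits.  Because the mapping is
     * a fixed function of the address, no global table lookup is needed. */
    static inline uint32_t addr_to_node(uint64_t vaddr, stripe_map_t m)
    {
        return (uint32_t)((vaddr >> m.log2_stripe) & ((1u << m.log2_nodes) - 1));
    }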

26
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

27
Hardware Design Effort
  • Node design
  • processor
  • memory system
  • Network design
  • routing protocol
  • topology
  • Cycle accurate simulator

28
Prototyping System
  • Simulator for complex system issues
  • not practical to do an accurate, pure software
    simulation of a massively parallel processor
  • Evaluation of experimental architectural features
    in a realistic environment
  • network topologies
  • language design
  • hardware support for programming features

29
Prototyping System
  • Modular design
  • easily reconfigure a single design for multiple
    network topologies by recabling and
    redistributing processor and network cards
  • network can be adapted for other
    performance-oriented platforms
  • enable collaboration among multiple research
    groups at different institutions

30
Modular Design
31
Prototype Node
  • Features for fast debug, profiling,
    reconfiguration
  • Architecture to enable experimentation with novel
    memory architectures and domains of execution

32
Prototype Node
33
Language Design
  • Communication intensive programming
  • Leverage data snapshots for migratability
  • e.g. function calls
  • Blah blah blah

34
Garbage Collection
  • Fast area local GC
  • Data migration
  • Paging friendly
  • Blah blah blah

35
Talk Overview
  • Motivation and research goals
  • Previous work
  • System description
  • Current work
  • Research Opportunities

36
Research Opportunities
  • Processor evaluation
  • hand coded sample apps
  • performance at various design points
  • Language design/evaluation
  • expresses parallelism and communication
  • programmability vs. compilability

37
Research Opportunities 2
  • Compiler Design
  • C compiler
  • PSCHEME compiler
  • VLIW scheduling
  • Operating System Design
  • Thread management, page management, memory
    management, etc. etc. etc.

38
Research Opportunities 3
  • Alternative Execution Domains
  • secure computing
  • efficient dynamic typing
  • transactional/speculative computing
  • dataflow