Aries: Integrated Symbolic and Numeric Database Handling with Parallel Distributed Processors

Description:

DARPA DIS Review 5/24/00. Tom Knight, Andrew Huang, Kalman Reti, JP Grossman, Jeremy Brown, John Mallory, Tom Cleary, Norm Margolus, Howie Shrobe, Peggy Chen, Greg Sullivan

Slides: 63

Learn more at: http://www.ai.mit.edu

Transcript and Presenter's Notes

1
Aries: Integrated Symbolic and Numeric Database
Handling with Parallel Distributed Processors
DARPA DIS Review 5/24/00
Tom Knight, Andrew Huang, Norm Margolus, Howie Shrobe, Peggy Chen, Greg
Sullivan, Michael Phillips, Ben Vandiver, Kalman Reti, JP Grossman,
Jeremy Brown, John Mallory, Tom Cleary
2
Aries Project Components

Technology development
Language development
Core Data and Pointer Representations
System Architecture
Experimental vehicle
Benchmark plan

3
Technology Substrate

Fast SRAM-tagged DRAM banks
leverage DRAM integration by allowing fast
access to SRAM tags in parallel with
slow DRAM access
Traps, GC forwarding pointers, reference counts,
timestamps for parallel out-of-order execution
Multiple reads/writes of SRAM data during fetch of
the DRAM line
Microchannel liquid cooling technology
copper-laminate manifold heat sinks
etched silicon microchannels

4
Language Development

Simple modifications to Scheme
parallel array operations from APL
parallel pointer operations (e.g. join, mark)
sophisticated GC techniques
data structures built on new object pointers
multithreading synchronized with combiners and
splitters
Units
Preparation for a much more extensive complete
language rewrite

5
Pointer and Array Representations

New pointer structure allowing
internal object pointers
immediate access to object header
immediate access to object size
little wasted memory (< 3%)
Objects < 32 words have dense, efficient
representation
Add only pointers

[Figure: pointer format diagram. Recoverable labels: B, L, E, Pointer; numbers 5, 5, 6, > 64; caption "Blocks of size 2" (likely truncated)]
6
Address Format

MSB of address determines data stripe layout
index 0 gives single word per processor
index max gives whole objects in processor

[Figure: address format. Recoverable labels: Address, Node, Node index]
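
One way to read the stripe rule on this slide, as a minimal sketch only. It assumes a hypothetical stripe index k meaning that 2**k consecutive words stay on one node before the layout moves to the next node; the function name `locate`, the round-robin placement, and the power-of-two interpretation are illustrative assumptions, not details given in the slides.

```python
def locate(word_addr: int, stripe_index: int, num_nodes: int) -> tuple[int, int]:
    """Map a word address to (node, local word offset).

    Assumption: a stripe holds 2**stripe_index consecutive words on one
    node, and stripes are dealt round-robin across the nodes.
    """
    stripe_words = 1 << stripe_index
    stripe_no, within = divmod(word_addr, stripe_words)
    node = stripe_no % num_nodes
    local = (stripe_no // num_nodes) * stripe_words + within
    return node, local
```

With index 0 every consecutive word lands on a different node; with a large index a whole small object stays on a single node, which matches the two extremes the slide describes.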
7
Squids for Pointer Aliasing Equality

Create a randomized SQUID tag for newly created
objects (hash address, etc.)
Store the SQUID with the pointer to the object
Copy the SQUID with forwarding or sub-component
pointer creation
Memory aliasing check has three outcomes
pointers identical -> insert barrier
pointers different, SQUIDs identical -> insert
barrier (rare)
pointers different, SQUIDs different -> no barrier
needed
GC forwards EQ testing sub-object equality

Short Quasi Unique IDentifiers
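
The three-outcome alias check can be sketched as follows. This is an illustration under stated assumptions: the 8-bit tag width, the `Pointer` class, and the helper names are hypothetical, and the slides do not specify how the random tag is generated.

```python
import secrets
from dataclasses import dataclass

SQUID_BITS = 8  # assumed tag width; the slides do not give one


@dataclass(frozen=True)
class Pointer:
    address: int
    squid: int  # short quasi-unique ID stored alongside the address


def new_object(address: int) -> Pointer:
    """Tag a newly created object with a random SQUID."""
    return Pointer(address, secrets.randbits(SQUID_BITS))


def derive(ptr: Pointer, new_address: int) -> Pointer:
    """Forwarding or sub-component pointer: new address, same SQUID."""
    return Pointer(new_address, ptr.squid)


def needs_barrier(a: Pointer, b: Pointer) -> bool:
    """Conservative alias check with the three outcomes listed above."""
    if a.address == b.address:
        return True   # pointers identical: certainly aliased
    if a.squid == b.squid:
        return True   # possibly the same object, or a rare tag collision
    return False      # different SQUIDs: cannot be the same object
```

Because the tag is copied whenever a pointer is derived, two pointers into the same object always agree on the SQUID, so the check can only err on the side of an unnecessary barrier.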
8
Fast inter-node memory references

Build on the low latency Metro architecture
DeHon et al., ISCA 1994
Bring the network onto the die
H tree network connecting on chip multiprocessor
array
uniform addressing and access across the on/off
chip pins
Processor to processor
both remote memory reference and message passing
Connection rather than packet based
acknowledgement inherent
return path pre-allocated
very simple retry-based routing and error
recovery
provably good message handling

9
Key Metro Characteristics

Connection rather than packet based
No buffering, flow control, emergency handling
Provably good message routing
insensitive to network permutation patterns
One clock pin-pin routing within the component
Scalable bisection bandwidth at configuration
time
Fault tolerant in both wiring and routers
Inherent acknowledgement and reply path

10
Proposed Aries Packaging

32-64 slot active routing backplane (Rack)
Integral/redundant cooling, power
Two interchangeable slot components
Computation Cluster (Processor Box)
Disk
Memory
Processor clusters active RAM
Communication channel (Network Box)
High bandwidth channel to other active backplanes

11
Packaging Overview
12
Cluster Configurations
13
Cluster Configurations
14
Multiple Domains of Execution
Registers
Value, Type, Units, Owner, Integrity
15
Execution Unit Hardware

Multiple execution domains
value, type, units, ownership
Value domain requires special hardware
Strategy: handle others with uniform hash array
associative match techniques and software traps
Same hardware useful for
data cache (default -- N way set associative)
value cache -- long ops and functional
subroutines
tags
ownership
units

16
Uniform Reconfigurable Hardware
[Figure: uniform reconfigurable hardware datapath. Recoverable labels: Operands, Bit Mask (twice), Hash, Table Lookup, Compare, Result insert, Trap]
17
Contexts

Ownership / Pedigree carried with all data
Detect / enforce ownership rights to computed
data
Control access to critical owned data
Only data you have a right to see is visible
Control actuation / authorization of sensitive
tasks
Only data not touched by unauthorized users can
actuate
Automatic propagation of the set of assumptions
We know on what basis a decision has been made
We know how to revoke it
System has access to this data itself
(Introspection)

18
Prototype and Verification Vehicle

Construct a 1-10 uniform speed scaled prototype
Off the shelf components
FPGA based design
Compromise on size, cost, performance, power
Do not compromise on functionality or debug
access
Becomes an experimental architecture vehicle
Allows language software development
Allows inexpensive feature evaluation
Debugging hooks everywhere
Ready to transition to real hardware
real implementation will still retain many of
these ideas

19
Experimental Vehicle Details

High performance disk drives
Xilinx Virtex-E FPGA arrays
fast serial links (LVDS), embedded memory
debug/access paths for all signals
Fast host interface by masquerading as SDRAM
Memory augmented with fast SRAM tags
FPGA implementation of Metro network
debug in a PC cluster environment
off-the-shelf LVDS serializers at 5 Gbps/chip
Cycle by cycle power profiling

20
FPGA SRAM (Moore) Board
21
Open PC System Context
22
Moore Board Details

User can implement an early Pentium-class
processor
Performance scales with Moore's law over time
leverage latest process technologies in the form
of FPGAs, SRAMs, new bus technologies
minimal cost, risk
High-performance host interface (direct
memory-map via PC-100 DIMM interface)

23
Benchmark Plans

Array primitive performance
Network intensive performance
Sparse matrix operations
Parallel Database Join
DIS Data Management benchmark (rewritten)
Persistent data storage
Database operations on out of core data
Database commit performance

24
Summary

Coherent plan for next generation HW and SW
Details will be verified with experimental
vehicle
Manpower and resource limited
the ideas are largely here
We will know what to build when we have the
resources to do so
early prototypes under way

25
(No Transcript)
26
Box Architecture
Processor
Network
27
What's Wrong with this Picture?

80% of die area is cache, LSU and schedulers to
help hide memory latency
PPC 750 die shot below (obtained from IBM website)

28
Our Way

Rendition of what an Aries processor/memory die might
look like

Multi-bank DRAM
16 GB/s, 4-8 cycles latency per DRAM bank,
multiple banks per execution unit
Execution Unit
Total bandwidth between execution units and
DRAM for this chip: 128 GB/s at 4-8 cycle latency
(supposing 1 GHz processors)
die shots of IBM SA-27E DRAM-ASIC process and IBM
PPC750, obtained from IBM website
29
Possible Die Layout of Processor and DRAM
8 processors/chip, 128 MBits DRAM
30
System Architecture
VME 9U card
one node, consisting of 256 MB DRAM, 128
processors, and a network processor
DIMM-style cards with 4 DRAM processor chips
(M), 1 network interface chip (N) (64 MBytes of
memory/card)
lots of wires

Aggregate bandwidth to embedded DRAM processing
elements is in excess of 16 Terabytes/s
example card has 2 GBytes of DRAM-PEs (1024
processing elements)
This architecture can comprehensively search for
a keyword in 2 GB of pre-loaded memory in about
500 ms
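
The capacity and bandwidth figures on slides 28 to 30 can be cross-checked with a few lines of arithmetic. The sketch below takes its inputs from those slides; the 16 GB/s per-processor figure is an assumption derived by dividing the stated 128 GB/s per chip by 8 processors per chip.

```python
PROCS_PER_CHIP = 8      # slide 29
MBIT_PER_CHIP = 128     # slide 29
CHIPS_PER_DIMM = 4      # slide 30
GB_S_PER_PROC = 16      # assumed: 128 GB/s per chip / 8 processors
PES_PER_CARD = 1024     # slide 30, VME 9U card
PROCS_PER_NODE = 128    # slide 30

mbytes_per_chip = MBIT_PER_CHIP // 8                    # 16 MB
mbytes_per_dimm = mbytes_per_chip * CHIPS_PER_DIMM      # 64 MB per DIMM card
chip_bw_gb_s = PROCS_PER_CHIP * GB_S_PER_PROC           # 128 GB/s per chip
node_mbytes = (PROCS_PER_NODE // PROCS_PER_CHIP) * mbytes_per_chip  # 256 MB
chips_per_card = PES_PER_CARD // PROCS_PER_CHIP         # 128 chips
card_mbytes = chips_per_card * mbytes_per_chip          # 2048 MB = 2 GB
card_bw_gb_s = PES_PER_CARD * GB_S_PER_PROC             # 16384 GB/s, about 16 TB/s
```

Under these assumptions the per-DIMM, per-node, and per-card numbers quoted on the slide are mutually consistent.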

31
Dynamically Reconfigurable Pipeline

MATRIX style high level functional blocks
Dynamically reconfigurable interconnect

conditional unit
similar to Rixner, Dally 1998
32
System Integrity

Per object capabilities
Triad Pointer Structures
Garbage Collection
Transactional Semantics
Reliable message transport
Data integrity labelling
Data ownership

33
Ownership / Accessor Labels

Every word of data d is labelled with owner /
accessor information L(d)

Every channel c is labelled with potential
readers L(c)

L(c) must be contained in L(d) to write d to c
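
The write rule amounts to a subset test. A minimal sketch, assuming a label can be modelled as a plain set of principal names (the function name and the example principals are hypothetical):

```python
def may_write(data_label: frozenset[str], channel_readers: frozenset[str]) -> bool:
    """Allow writing d to c only if every potential reader of c may access d."""
    return channel_readers <= data_label


L_d = frozenset({"alice", "bob"})
ok = may_write(L_d, frozenset({"alice"}))              # all readers are permitted
leak = may_write(L_d, frozenset({"alice", "eve"}))     # "eve" may not see d
```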
34
Synthesizing Labels
PC: add r1, r2, r3
L(pc), L(add), L(r1), L(r2) -> Label Intersector -> L(r3)
35
Label Intersector Efficiency

Compute Joins Once
Store results in hash table
Hardware label cache
similar to value or tlb cache
No slowdown in the usual case
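
The "compute joins once" idea can be sketched in software with a memoized pairwise join. This is an illustration only: labels are modelled as frozensets of permitted readers, `functools.lru_cache` stands in for the hardware label cache, and the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)  # stands in for the hardware label cache
def join_pair(a: frozenset, b: frozenset) -> frozenset:
    """Join two labels; with reader sets this is an intersection."""
    return a & b


def result_label(l_pc: frozenset, l_op: frozenset,
                 l_src1: frozenset, l_src2: frozenset) -> frozenset:
    """Label for the destination register of a two-operand instruction."""
    return join_pair(join_pair(l_pc, l_op), join_pair(l_src1, l_src2))
```

Repeating an instruction with the same input labels hits the cache on every pairwise join, which is the software analogue of "no slowdown in the usual case".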

36
PC Restriction
PC: bne r1, foo
L(pc), L(bne), L(r1) -> Label Intersector -> L(pc)
37
Dynamic PC declassification

How can you declassify the program counter?
Choose a definitive return address before the
branch
Halfway to transactional semantics

pushpc endpoint    save a definitive return
bne r1, foo        raise pc security level
poppc              lower and return
foo: poppc         lower and return
38
Efficiency Techniques

Per word labels are costly and awkward
Most components of a compound structure have
identical labels
Reclassification of entire data structures should
be efficient

Put labels only on inter-structure links
39
Domain Representation
[Figure: Domain Representation, Semantic View; nodes labelled L1 and L2]
40
Domain Representation
[Figure: Domain Representation, Implementation View; labels L1, L2, L2]
41
Label Format Details

Ownership and Security model derived from the
work of Myers and Liskov
A label is a set of pairs
each pair consists of an owner
and a set of permitted accessors
The effective set of accessors is the
intersection of the permitted accessor sets
Transitive permission delegation
revocable
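
The label format described above can be sketched as a mapping from each owner to the accessors that owner permits. This is a minimal illustration: the type alias, the example principals, and the choice to reject an empty label are assumptions, and delegation and revocation are not modelled.

```python
Label = dict[str, frozenset[str]]  # owner -> accessors that owner permits


def effective_accessors(label: Label) -> frozenset[str]:
    """Principals that every owner permits: the intersection of all sets."""
    sets = list(label.values())
    if not sets:
        # Assumption: this sketch does not define a universe of principals,
        # so a label with no owners is rejected rather than treated as public.
        raise ValueError("empty label")
    return frozenset.intersection(*sets)


record = {
    "hospital": frozenset({"doctor", "nurse", "patient"}),
    "patient": frozenset({"doctor", "patient"}),
}
allowed = effective_accessors(record)  # only principals both owners accept
```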

42
Ownership Semantics

Fine grained control of ownership of data and
derived data
Safe dynamic control flow declassification
Fine grained control over information
dissemination
Efficient implementation as a parallel execution
domain
Leverages strong capability and transactional
models of data representation and execution

43
Summary

Domain level parallelism within processors
robustness
security
higher level semantics
Processor level parallelism within active RAM
parallelism friendly data structures, operations
network friendly communications
Explicitly parallel transactional programming
environment
Emphasis on the conceptual simplicity of the
programming models

44
(No Transcript)
45
Symbolic Computing

Symbolic data has no inherent local structure
Knowledge Databases
Indices
Higher level vision
target recognition
Architectures must implement efficient non local
communications
Data representations are key
Triad pointer structures
Clean parallel semantics
Transactions as hardware and software primitives

46
Triad Pointer Structures

All pointers have back pointers associated
Fan in trees for parallel data access
Combining networks
Limited fan in and contention -- memoizing
Fan out trees for data operator distribution
Data movement is straightforward (paging)
Garbage collection is no longer an issue
Compactness comes from linearized freelists
Utility in balancing fanin/out trees (local)
Typed pointers and data allow distributed
processors to manipulate data autonomously

47
Semantically Richer Objects and Operators

Sets
Ordered sets
Vectors, Tensors
APL operators
Extension of the type system
Units
Persistent objects
Unguarded objects barriers
Explicit communication between concurrent threads
Combiners only allowed
I/O as an unguarded operation

48
Transactional Processing

We already use transactional semantics
instruction execution on any modern processor
system calls in some operating systems mimic
instruction semantics (e.g. ITS)
Early proposals for parallel execution of
sequential code relied on transactional semantics
Raise the level - Reed
make transactions visible at the language level
provide hardware support for efficient
transactional processing
Timestamp modifications for auditing, rollback
and introspective analysis

49
Computing Models

Sequential
Semantically sequential
deterministic results
static
dynamic
Concurrent independent
Concurrent atomic
nondeterministic results

50
Timestamping Data

Separate virtual and real time - Jefferson
Assign virtual time ranges to data objects
Subranging for nested transactions
Reassignment of virtual time tokens on allocation
failure
Guarded references access the correct version of
data
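
Subranging for nested transactions can be sketched with integer virtual-time intervals. This is an illustration under assumptions: the `TimeRange` class, the equal-width split, and the use of an exception to signal the allocation-failure case are all hypothetical choices, not the scheme from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRange:
    start: int  # inclusive virtual time
    end: int    # exclusive virtual time

    def split(self, parts: int) -> list["TimeRange"]:
        """Carve this range into ordered subranges for nested transactions."""
        width = (self.end - self.start) // parts
        if width == 0:
            # The allocation-failure case: tokens would have to be reassigned.
            raise OverflowError("range too small to subdivide")
        return [TimeRange(self.start + i * width, self.start + (i + 1) * width)
                for i in range(parts)]

    def precedes(self, other: "TimeRange") -> bool:
        """True if every time in self is earlier than every time in other."""
        return self.end <= other.start
```

Each child transaction receives a subrange of its parent, so ordering between siblings follows directly from comparing interval endpoints.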

51
Liquid: Viewing Serial Execution as a Sequence of
Transactions

Instructions read and then modify processor and
memory state -- a transaction
Blocks of instructions can be viewed similarly
Key idea
execute multiple, logically sequential blocks in
parallel
independent threads
use database commit techniques to handle
otherwise intractable problems of aliasing
use cache coherency mechanisms to automatically
detect aliasing problems and back out
optimistic concurrency in executing parallel
threads
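
The key idea can be sketched in software as optimistic execution with in-order commit. This is an illustration only: the block interface (a function that takes a read callback and returns a dict of writes) and the read-set conflict test are assumptions standing in for the cache-coherency mechanism the slide mentions.

```python
def run_optimistically(blocks, memory):
    """Run logically sequential blocks against one snapshot, commit in order.

    A block whose reads overlap an earlier block's committed writes has an
    aliasing conflict; it is backed out and re-executed on current memory.
    Returns the number of blocks that had to be re-executed.
    """
    snapshot = dict(memory)
    speculative = []
    for block in blocks:  # conceptually these executions run in parallel
        reads = set()

        def read(addr, _reads=reads):
            _reads.add(addr)
            return snapshot[addr]

        speculative.append((block, reads, block(read)))

    dirty, retries = set(), 0  # addresses written since the snapshot
    for block, reads, writes in speculative:
        if reads & dirty:      # aliasing detected: back out and redo
            retries += 1
            writes = block(lambda addr: memory[addr])
        memory.update(writes)
        dirty.update(writes)
    return retries
```

Blocks that touch disjoint data commit their speculative results unchanged, and only the conflicting block pays for a second execution, so the final memory state matches a strictly sequential run.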

52
Research Plan

Architect physical simulation model
Locate partner to spin
Language design for symbolic processing
Feature list and architecture strawman for
symbolic applications
Implement symbolic processing in FPGA or Alpha
emulation
Logic level design of network and component
Locate partner to spin design

53
Impact

New generation of data oriented processing
physical
symbolic
Parallel performance
Almost serial programming model
100 - 1000x performance on a wide range of
problems
includes important symbolic processing and
knowledge database retrieval problems as well as
physical simulation techniques

54
Language Innovation

Learn the lessons from decades of AI languages
cleanliness in design
manifest performance costs
simple data structures
trivial syntax
performance counts
pointer allocation is central
type checking at both compile and run time is
essential
security can be enforced by pointer hygiene
side effects are necessary as a programming tool
manifest data types allow distributed data
handling

55
Benchmark Problems

Graphics
polygon rendering
point sample rendering
ray tracing
radiosity
CAD
verilog
hspice
place and route
Simulation
mechanics
n body
fluid flow
static PDEs
EM field solver

AI
neural networks
genetic algorithms
knowledge databases
computer vision
object recognition
database matching
biological sequences
Numerical
factoring
primality testing
Miscellaneous
text searching
chess
Mandelbrot sets
protein folding

56
Enabling Ideas

Physical simulations
SIMD processing
Skip samples
Lookup tables plus multiply array
Symbolic processing
on chip GC
Metro routing
Fat tree packaging
Timestamped memory
Commit operations fundamental

57
Architectural Synthesis

Lisp Machine
CADR, LM-2, 3600, Ivory, Open Genera
Connection Machine
Cross Omega Machine
CAM-8
Abacus
Transit/Metro
Matrix
Terasys
Liquid

58
Technology Opportunity

DRAM Logic
Terabaud access to on chip memory
with state of the art on chip logic and
interconnect
BGA Packaging
600-1000 pins/die
GHz signalling
Small die footprint
Reconfigurable logic
Commodity component opportunity
Delayed binding of architectures

59
Substrate Technology Development

Resonant Clocking Design Tool

Crafted resonant transmission line
60
Garbage Collection Technology

Problem: large, distributed, highly linked,
persistent data structures, partially swapped out
Slow access to large parts of the data
Solution: incremental, distributed, local garbage
collection techniques
Maintenance of in and out vectors of external
pointers
In set used as a root for local garbage
collection
Out set maintained and opportunistically sent out
Object reference safety essential -- sizes from
ptr
Techniques based on Area GC of Bishop (1968) and
Maheshwari and Liskov (1998)
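
A local collection that uses the in set as a root and produces an out set can be sketched as below. This is a simplified illustration: the heap representation (a dict from local object id to referenced ids) and the convention that unknown ids live on other nodes are assumptions, and the sketch is neither incremental nor distributed.

```python
def local_collect(heap, in_set, local_roots=()):
    """Mark from local roots plus the in set; return (live, out_set).

    heap maps a local object id to the list of ids it references. Ids not
    present in heap are treated as objects living on other nodes.
    """
    live, out_set = set(), set()
    stack = list(local_roots) + list(in_set)
    while stack:
        obj = stack.pop()
        if obj not in heap:
            out_set.add(obj)       # external pointer: report, do not follow
        elif obj not in live:
            live.add(obj)
            stack.extend(heap[obj])
    return live, out_set
```

Only external pointers reachable from live local objects end up in the out set, so a remote object referenced solely by local garbage drops out of the set the next time it is sent to its owner.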

61
Garbage Collection Phases

Within processor, within core collection
Among processors following coordinated
distribution of out sets
Distributed marking of out sets from in sets
followed on null updates by global collection
Compaction and maintenance of dense out sets for
swapped out pages
possible because we can afford to spend lots of
time when swapping page data
Layered on top of conventional ephemeral
techniques (temporal reference counting)

62
Cluster Configurations
