Aries: Integrated Symbolic and Numeric Database Handling with Parallel Distributed Processors
(presentation transcript)

1
Aries: Integrated Symbolic and Numeric Database
Handling with Parallel Distributed Processors
DARPA DIS Review, 5/24/00
Tom Knight, Andrew Huang, Norm Margolus, Howie Shrobe,
Peggy Chen, Greg Sullivan, Michael Phillips, Ben Vandiver,
Kalman Reti, JP Grossman, Jeremy Brown, John Mallory, Tom Cleary
2
Aries Project Components
  • Technology development
  • Language development
  • Core Data and Pointer Representations
  • System Architecture
  • Experimental vehicle
  • Benchmark plan

3
Technology Substrate
  • Fast SRAM-tagged DRAM banks
  • leverage DRAM integration by allowing fast
    access to SRAM tags in parallel with the slow
    DRAM access
  • Traps, GC forwarding pointers, reference counts,
    timestamps for parallel out of order execution
  • Multiple read/writes of SRAM data during fetch of
    the DRAM line
  • Microchannel liquid cooling technology
  • copper-laminate manifold heat sinks
  • etched silicon microchannels

4
Language Development
  • Simple modifications to Scheme
  • parallel array operations from APL
  • parallel pointer operations (e.g. join, mark)
  • sophisticated GC techniques
  • data structures built on new object pointers
  • multithreading synchronized with combiners and
    splitters
  • Units
  • Preparation for a much more extensive,
    complete language rewrite

5
Pointer and Array Representations
  • New pointer structure allowing
  • internal object pointers
  • immediate access to object header
  • immediate access to object size
  • little wasted memory (< 3%)
  • Objects < 32 words have dense, efficient
    representation
  • Add-only pointers

[Diagram: pointer formats with B/L/E fields; objects > 64 words
use a separate format; storage allocated in blocks of size 2^n]
6
Address Format
  • MSB of address determines data stripe layout
  • index 0 gives single word per processor
  • index max gives whole objects in processor

[Diagram: address format - node | node index]
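A minimal sketch of the striping idea, assuming a simple power-of-two
layout; the constant and function names are illustrative, not the
Aries hardware decode:

  NUM_NODES = 64          # assumed machine size

  def home_node(word_offset, stripe_index):
      # stripe_index 0: consecutive words round-robin across nodes
      # (one word per processor); a large enough stripe_index keeps
      # a whole object's words on a single node
      return (word_offset >> stripe_index) % NUM_NODES

  # stripe 0 spreads a 4-word object over 4 nodes;
  # stripe 2 keeps the same 4 words on node 0
  assert [home_node(w, 0) for w in range(4)] == [0, 1, 2, 3]
  assert [home_node(w, 2) for w in range(4)] == [0, 0, 0, 0]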
7
SQUIDs for Pointer Aliasing & Equality
  • Create a randomized SQUID tag for newly created
    objects (hash address, etc.)
  • Store the SQUID with the pointer to the object
  • Copy the SQUID with forwarding or sub-component
    pointer creation
  • Memory aliasing check has three outcomes
  • pointers identical => insert barrier
  • pointers different, SQUIDs identical => insert
    barrier (rare)
  • pointers different, SQUIDs different => no barrier
    needed
  • GC forwarding preserves EQ testing and sub-object
    equality

Short Quasi Unique IDentifiers
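A hedged sketch of the three-outcome test above, with illustrative
names (squid_of and the 8-bit tag width are assumptions, not the
Aries encoding):

  from dataclasses import dataclass
  import zlib

  @dataclass(frozen=True)
  class Ptr:
      addr: int            # word address
      squid: int           # short quasi-unique id stored with the pointer

  def squid_of(addr, bits=8):
      # randomized tag for a newly created object, e.g. a hash of
      # its creation address
      return zlib.crc32(addr.to_bytes(8, "little")) & ((1 << bits) - 1)

  def need_barrier(a, b):
      if a.addr == b.addr:
          return True      # pointers identical => insert barrier
      if a.squid == b.squid:
          return True      # SQUIDs collide: possible alias (rare) => barrier
      return False         # SQUIDs differ => provably distinct objects

  p = Ptr(0x1000, squid_of(0x1000))
  q = Ptr(0x2000, squid_of(0x2000))
  # pointers to the same word always get a barrier:
  assert need_barrier(p, p)
  # distinct SQUIDs prove distinct objects: need_barrier(p, q) is
  # False unless the two 8-bit tags happen to collide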
8
Fast inter-node memory references
  • Build on the low latency Metro architecture
  • DeHon et al., ISCA 1994
  • Bring the network onto the die
  • H-tree network connecting an on-chip
    multiprocessor array
  • uniform addressing and access across the on/off
    chip pins
  • Processor to processor
  • both remote memory reference and message passing
  • Connection rather than packet based
  • acknowledgement inherent
  • return path pre-allocated
  • very simple retry-based routing and error
    recovery
  • provably good message handling

9
Key Metro Characteristics
  • Connection rather than packet based
  • No buffering, flow control, emergency handling
  • Provably good message routing
  • insensitive to network permutation patterns
  • One-clock pin-to-pin routing within the component
  • Scalable bisection bandwidth at configuration
    time
  • Fault tolerant in both wiring and routers
  • Inherent acknowledgement and reply path

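To make the connection-based model of the last two slides concrete,
here is a toy software sketch, assuming a hypothetical
path-reservation interface (ToyMetro and its methods are
illustrative, not the Metro protocol):

  class ToyMetro:
      def __init__(self):
          self.busy = set()                # reserved physical links

      def try_reserve(self, path):
          if any(link in self.busy for link in path):
              return False                 # blocked: caller simply retries
          self.busy |= set(path)           # whole path pre-allocated,
          return True                      # return route included

      def release(self, path):
          self.busy -= set(path)

  def send(net, path, payload, deliver):
      # very simple retry-based routing and error recovery:
      # no buffering, no flow control, no emergency handling
      while not net.try_reserve(path):
          pass
      reply = deliver(payload)             # ack/reply inherent: the
      net.release(path)                    # connection is held open
      return reply

  net = ToyMetro()
  assert send(net, ["n0-n1", "n1-n2"], "ping",
              lambda p: "ack:" + p) == "ack:ping"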
10
Proposed Aries Packaging
  • 32-64 slot active routing backplane (Rack)
  • Integral/redundant cooling, power
  • Two interchangeable slot components
  • Computation Cluster (Processor Box)
  • Disk
  • Memory
  • Processor clusters + active RAM
  • Communication channel (Network Box)
  • High bandwidth channel to other active backplanes

11
Packaging Overview
12
Cluster Configurations
13
Cluster Configurations
14
Multiple Domains of Execution
[Diagram: each register carries parallel fields - Value, Type,
Units, Owner - checked by independent integrity domains]
15
Execution Unit Hardware
  • Multiple execution domains
  • value, type, units, ownership
  • Value domain requires special hardware
  • Strategy: handle the others with uniform hash-array
    associative-match techniques and software traps
  • Same hardware useful for
  • data cache (default -- N-way set associative)
  • value cache -- long ops and functional
    subroutines
  • tags
  • ownership
  • units

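A software model of the value-cache use of that structure, under
stated assumptions (the table, handler set, and names are
illustrative):

  SOFTWARE_HANDLERS = {"div": lambda a, b: a // b}   # "long" ops
  value_cache = {}

  def value_cache_lookup(op, a, b):
      key = (op, a, b)
      if key in value_cache:
          return value_cache[key]          # associative match: hit
      result = SOFTWARE_HANDLERS[op](a, b) # miss: trap to software
      value_cache[key] = result            # insert result for next time
      return result

  assert value_cache_lookup("div", 91, 7) == 13   # computed, then cached
  assert value_cache_lookup("div", 91, 7) == 13   # served from the table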
16
Uniform Reconfigurable Hardware
[Diagram: operands pass through bit masks into a hash and table
lookup; a compare then either inserts the result or raises a trap]
17
Contexts
  • Ownership / Pedigree carried with all data
  • Detect / enforce ownership rights to computed
    data
  • Control access to critical owned data
  • Only data you have a right to see is visible
  • Control actuation / authorization of sensitive
    tasks
  • Only data not touched by unauthorized users can
    actuate
  • Automatic propagation of the set of assumptions
  • We know on what basis a decision has been made
  • We know how to revoke it
  • System has access to this data itself
    (Introspection)

18
Prototype and Verification Vehicle
  • Construct a 1/10 uniform-speed scaled prototype
  • Off the shelf components
  • FPGA based design
  • Compromise on size, cost, performance, power
  • Do not compromise on functionality or debug
    access
  • Becomes an experimental architecture vehicle
  • Allows language software development
  • Allows inexpensive feature evaluation
  • Debugging hooks everywhere
  • Ready to transition to real hardware
  • real implementation will still retain many of
    these ideas

19
Experimental Vehicle Details
  • High performance disk drives
  • Xilinx Virtex-E FPGA arrays
  • fast serial links (LVDS), embedded memory
  • debug/access paths for all signals
  • Fast host interface by masquerading as SDRAM
  • Memory augmented with fast SRAM tags
  • FPGA implementation of Metro network
  • debug in a PC cluster environment
  • off-the-shelf LVDS serializers @ 5 Gbps/chip
  • Cycle by cycle power profiling

20
FPGA SRAM (Moore) Board
21
Open PC System Context
22
Moore Board Details
  • User can implement an early Pentium-class
    processor
  • Performance scales with Moore's law over time
  • leverage latest process technologies in the form
    of FPGAs, SRAMs, new bus technologies
  • minimal cost, risk
  • High-performance host interface (direct
    memory-map via PC-100 DIMM interface)

23
Benchmark Plans
  • Array primitive performance
  • Network intensive performance
  • Sparse matrix operations
  • Parallel Database Join
  • DIS Data Management benchmark (rewritten)
  • Persistent data storage
  • Database operations on out of core data
  • Database commit performance

24
Summary
  • Coherent plan for next-generation HW & SW
  • Details will be verified with experimental
    vehicle
  • Manpower and resource limited
  • the ideas are largely here
  • We will know what to build when we have the
    resources to do so
  • early prototypes under way

25
(No Transcript)
26
Box Architecture
[Diagram: processor box and network box]
27
What's Wrong with this Picture?
  • 80% of die area is cache, LSU and schedulers to
    help hide memory latency
  • PPC 750 die shot below (obtained from IBM website)

28
Our Way
  • Rendition of what an Aries proc. + mem die might
    look like

Multi-bank DRAM: 16 GB/s, 4-8 cycles latency per DRAM bank,
multiple banks per execution unit
Execution Unit
Total bandwidth between execution units and DRAM for this chip:
128 GB/s at 4-8 cycle latency (supposing 1 GHz processors)
(die shots of IBM SA-27E DRAM-ASIC process and IBM PPC750,
obtained from IBM website)
29
Possible Die Layout of processor + DRAM
8 processors/chip, 128 Mbits DRAM
30
System Architecture
VME 9U card: one node, consisting of 256 MB DRAM, 128
processors, and a network processor
DIMM-style cards with 4 DRAM+processor chips (M) and 1 network
interface chip (N); 64 MBytes of memory/card
lots of wires
  • Aggregate bandwidth to embedded DRAM processing
    elements is in excess of 16 Terabytes/s
  • example card has 2 GBytes of DRAM-PEs with 1024
    processing elements
  • This architecture can comprehensively search for
    a keyword in 2 GB of pre-loaded memory in about
    500 ms

31
Dynamically Reconfigurable Pipeline
  • MATRIX-style high-level functional blocks
  • Dynamically reconfigurable interconnect

[Diagram: conditional unit, similar to Rixner, Dally 1998]
32
System Integrity
  • Per object capabilities
  • Triad Pointer Structures
  • Garbage Collection
  • Transactional Semantics
  • Reliable message transport
  • Data integrity labelling
  • Data ownership

33
Ownership / Accessor Labels
  • Every word of data d is labelled with owner /
    accessor information L(d)

[Diagram: every word d is stored with its label: L(d) | d]
  • Every channel c is labelled with potential
    readers L(c)

L(c) must be contained in L(d) to write d to c
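A minimal sketch of that write rule, modelling labels simply as
accessor sets (names are illustrative):

  def can_write(label_d, label_c):
      # d may flow to channel c only if every potential reader of c
      # is a permitted accessor of d: L(c) contained in L(d)
      return label_c <= label_d

  assert can_write({"alice", "bob"}, {"alice"})          # allowed
  assert not can_write({"alice"}, {"alice", "mallory"})  # blocked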
34
Synthesizing Labels
[Diagram: executing "add r1, r2, r3", the Label Intersector
combines L(pc), L(add), L(r1), and L(r2) to produce
L(r3) = L(pc) ∩ L(add) ∩ L(r1) ∩ L(r2)]
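A sketch of the intersector with labels again modelled as accessor
sets, so combining labels intersects the sets (illustrative, not the
hardware path):

  def intersect_labels(*labels):
      out = labels[0]
      for l in labels[1:]:
          out = out & l
      return out

  L_pc = L_add = {"alice", "bob"}
  L_r1, L_r2 = {"alice", "bob"}, {"alice"}
  # the result is labelled at least as restrictively as every input
  assert intersect_labels(L_pc, L_add, L_r1, L_r2) == {"alice"}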
35
Label Intersector Efficiency
  • Compute Joins Once
  • Store results in hash table
  • Hardware label cache
  • similar to value or TLB cache
  • No slowdown in the usual case

36
PC Restriction
[Diagram: executing "bne r1, foo", the Label Intersector
restricts the program counter's label to
L(pc) = L(pc) ∩ L(bne) ∩ L(r1)]
37
Dynamic PC declassification
  • How can you declassify the program counter?
  • Choose a definitive return address before the
    branch
  • Halfway to transactional semantics

  pushpc endpoint   ; save a definitive return
  bne r1, foo       ; raise pc security level
  poppc             ; lower and return
foo:
  poppc             ; lower and return
38
Efficiency Techniques
  • Per-word labels are costly and awkward
  • Most components of a compound structure have
    identical labels
  • Reclassification of entire data structures should
    be efficient

Put labels only on inter-structure links
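A toy illustration of that layout (class names are hypothetical):
words inside a structure carry no labels, a word's label is the label
on the link through which its structure was reached, and reclassifying
a whole structure changes a single label.

  class Structure:
      def __init__(self, words):
          self.words = words               # unlabeled interior data

  class Link:
      def __init__(self, target, label):
          self.target = target             # inter-structure pointer
          self.label = label               # labels everything inside

  def read(link, i):
      # a word inherits the label of the link that reached it
      return link.label, link.target.words[i]

  doc = Structure([10, 20, 30])
  view = Link(doc, {"alice"})
  assert read(view, 1) == ({"alice"}, 20)
  view.label = {"alice", "bob"}            # O(1) reclassification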
39
Domain Representation
[Diagram: each element individually labelled L1, L2]
Domain Representation: Semantic View
40
Domain Representation
[Diagram: labels L1, L2 stored once, on the inter-structure links]
Domain Representation: Implementation View
41
Label Format Details
  • Ownership and Security model derived from the
    work of Myers and Liskov
  • A label is a set of pairs
  • each pair consists of an owner
  • and a set of permitted accessors
  • The effective set of accessors is the
    intersection of the permitted accessor sets
  • Transitive permission delegation
  • revocable

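A sketch of this label format, assuming set-valued permissions as in
Myers and Liskov's decentralized label model (the data and helper
names are illustrative):

  from functools import reduce

  # a label: a set of (owner, permitted-accessor-set) pairs
  label = {
      "alice": frozenset({"alice", "bob", "carol"}),
      "bob":   frozenset({"alice", "bob"}),
  }

  def effective_accessors(label):
      # the effective set is the intersection of all permitted sets
      return reduce(lambda x, y: x & y, label.values())

  assert effective_accessors(label) == {"alice", "bob"}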
42
Ownership Semantics
  • Fine grained control of ownership of data and
    derived data
  • Safe dynamic control flow declassification
  • Fine grained control over information
    dissemination
  • Efficient implementation as a parallel execution
    domain
  • Leverages strong capability and transactional
    models of data representation and execution

43
Summary
  • Domain level parallelism within processors
  • robustness
  • security
  • higher level semantics
  • Processor level parallelism within active RAM
  • parallelism friendly data structures, operations
  • network friendly communications
  • Explicitly parallel transactional programming
    environment
  • Emphasis on the conceptual simplicity of the
    programming models

44
(No Transcript)
45
Symbolic Computing
  • Symbolic data has no inherent local structure
  • Knowledge Databases
  • Indices
  • Higher level vision
  • target recognition
  • Architectures must implement efficient non-local
    communications
  • Data representations are key
  • Triad pointer structures
  • Clean parallel semantics
  • Transactions as hardware and software primitives

46
Triad Pointer Structures
  • All pointers have back pointers associated
  • Fan in trees for parallel data access
  • Combining networks
  • Limited fan in and contention -- memoizing
  • Fan out trees for data operator distribution
  • Data movement is straightforward (paging)
  • Garbage collection is no longer an issue
  • Compactness comes from linearized freelists
  • Utility in balancing fanin/out trees (local)
  • Typed pointers and data allow distributed
    processors to manipulate data autonomously

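A toy model of the back-pointer bookkeeping (illustrative classes,
not the hardware triad representation): every forward pointer is
paired with a back pointer, so referrers can always be enumerated for
data movement, and an object with no remaining referrers can be
reclaimed immediately.

  class Obj:
      def __init__(self, name):
          self.name = name
          self.refs = set()        # forward pointers
          self.back = set()        # back pointers: who points at us

  def point(src, dst):
      src.refs.add(dst)
      dst.back.add(src)

  def unpoint(src, dst):
      src.refs.discard(dst)
      dst.back.discard(src)
      if not dst.back:             # unreferenced: reclaim on the spot
          for d in list(dst.refs): # (cycles would still need tracing)
              unpoint(dst, d)

  a, b = Obj("a"), Obj("b")
  point(a, b)
  unpoint(a, b)                    # b is immediately collectable
  assert not b.back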
47
Semantically Richer Objects and Operators
  • Sets
  • Ordered sets
  • Vectors, Tensors
  • APL operators
  • Extension of the type system
  • Units
  • Persistent objects
  • Unguarded objects + barriers
  • Explicit communication between concurrent threads
  • Combiners only allowed
  • I/O as an unguarded operation

48
Transactional Processing
  • We already use transactional semantics
  • instruction execution on any modern processor
  • system calls in some operating systems mimic
    instruction semantics (e.g. ITS)
  • Early proposals for parallel execution of
    sequential code relied on transactional semantics
  • Raise the level - Reed
  • make transactions visible at the language level
  • provide hardware support for efficient
    transactional processing
  • Timestamp modifications for auditing, rollback
    and introspective analysis

49
Computing Models
  • Sequential
  • Semantically sequential
  • deterministic results
  • static
  • dynamic
  • Concurrent independent
  • Concurrent atomic
  • nondeterministic results

50
Timestamping Data
  • Separate virtual and real time - Jefferson
  • Assign virtual time ranges to data objects
  • Subranging for nested transactions
  • Reassignment of virtual time tokens on allocation
    failure
  • Guarded references access the correct version of
    data

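A minimal sketch of virtual-time versioning, under stated assumptions
(one cell, integer virtual times; names are illustrative): each write
opens a version valid from its virtual time, and a guarded read at
time t returns the version whose range covers t.

  import bisect

  class VersionedCell:
      def __init__(self):
          self.times, self.values = [], []     # parallel, sorted by time

      def write(self, vtime, value):
          i = bisect.bisect_left(self.times, vtime)
          self.times.insert(i, vtime)
          self.values.insert(i, value)

      def read(self, vtime):
          # latest version whose virtual-time range contains vtime
          i = bisect.bisect_right(self.times, vtime) - 1
          if i < 0:
              raise KeyError("no version at virtual time %r" % vtime)
          return self.values[i]

  cell = VersionedCell()
  cell.write(10, "old"); cell.write(20, "new")
  assert cell.read(15) == "old" and cell.read(25) == "new"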
51
Liquid: Viewing Serial Execution as a Sequence of
Transactions
  • Instructions read and then modify processor and
    memory state -- a transaction
  • Blocks of instructions can be viewed similarly
  • Key idea
  • execute multiple, logically sequential blocks in
    parallel
  • independent threads
  • use database commit techniques to handle
    otherwise intractable problems of aliasing
  • use cache coherency mechanisms to automatically
    detect aliasing problems and back out
  • optimistic concurrency in executing parallel
    threads

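A single-threaded sketch of the commit idea, with hypothetical names:
a block records what it read (and at which version) and what it
wrote; at commit time it validates its reads, publishing its writes
only if nothing it read was overwritten, otherwise backing out and
re-executing.

  def run_optimistically(block, memory, version_of):
      while True:
          read_versions, writes = {}, {}
          def load(k):
              read_versions[k] = version_of.get(k, 0)
              return writes.get(k, memory[k])
          def store(k, v):
              writes[k] = v
          block(load, store)                   # optimistic execution
          if all(version_of.get(k, 0) == v     # commit check: are all
                 for k, v in read_versions.items()):  # reads still current?
              for k, v in writes.items():
                  memory[k] = v
                  version_of[k] = version_of.get(k, 0) + 1
              return
          # aliasing conflict detected: discard writes, back out, retry

  memory, version_of = {"x": 1}, {}
  run_optimistically(lambda ld, st: st("x", ld("x") + 1),
                     memory, version_of)
  assert memory["x"] == 2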
52
Research Plan
  • Architect physical simulation model
  • Locate partner to spin
  • Language design for symbolic processing
  • Feature list and architecture strawman for
    symbolic applications
  • Implement symbolic processing in FPGA or Alpha
    emulation
  • Logic level design of network and component
  • Locate partner to spin design

53
Impact
  • New generation of data oriented processing
  • physical
  • symbolic
  • Parallel performance
  • Almost serial programming model
  • 100 - 1000x performance on a wide range of
    problems
  • includes important symbolic processing and
    knowledge database retrieval problems as well as
    physical simulation techniques

54
Language Innovation
  • Learn the lessons from decades of AI languages
  • cleanliness in design
  • manifest performance costs
  • simple data structures
  • trivial syntax
  • performance counts
  • pointer allocation is central
  • type checking at both compile and run time is
    essential
  • security can be enforced by pointer hygiene
  • side effects are necessary as a programming tool
  • manifest data types allow distributed data
    handling

55
Benchmark Problems
  • Graphics
  • polygon rendering
  • point sample rendering
  • ray tracing
  • radiosity
  • CAD
  • Verilog
  • HSPICE
  • place and route
  • Simulation
  • mechanics
  • n body
  • fluid flow
  • static PDEs
  • EM field solver
  • AI
  • neural networks
  • genetic algorithms
  • knowledge databases
  • computer vision
  • object recognition
  • database matching
  • biological sequences
  • Numerical
  • factoring
  • primality testing
  • Miscellaneous
  • text searching
  • chess
  • Mandelbrot sets
  • protein folding

56
Enabling Ideas
  • Physical simulations
  • SIMD processing
  • Skip samples
  • Lookup tables plus multiply array
  • Symbolic processing
  • on chip GC
  • Metro routing
  • Fat tree packaging
  • Timestamped memory
  • Commit operations fundamental

57
Architectural Synthesis
  • Lisp Machine
  • CADR, LM-2, 3600, Ivory, Open Genera
  • Connection Machine
  • Cross Omega Machine
  • CAM-8
  • Abacus
  • Transit/Metro
  • Matrix
  • Terasys
  • Liquid

58
Technology Opportunity
  • DRAM + Logic
  • Terabaud access to on-chip memory
  • with state-of-the-art on-chip logic and
    interconnect
  • BGA Packaging
  • 600-1000 pins/die
  • GHz signalling
  • Small die footprint
  • Reconfigurable logic
  • Commodity component opportunity
  • Delayed binding of architectures

59
Substrate Technology Development
  • Resonant Clocking Design Tool

Crafted resonant transmission line
60
Garbage Collection Technology
  • Problem: Large, distributed, highly linked,
    persistent data structures, partially swapped out
  • Slow access to large parts of the data
  • Solution: incremental, distributed, local garbage
    collection techniques
  • Maintenance of in and out vectors of external
    pointers
  • In set used as a root for local garbage
    collection
  • Out set maintained and opportunistically sent out
  • Object reference safety essential -- sizes from
    ptr
  • Techniques based on the Area GC of Bishop (1977)
  • Maheshwari and Liskov (1998)

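A rough sketch of the in/out-set machinery under stated assumptions
(purely local structures, no real messaging; names are illustrative):

  class Area:
      def __init__(self):
          self.objects = {}       # id -> ids of local objects it references
          self.roots = set()      # local roots
          self.in_set = set()     # local ids referenced from other areas
          self.out_set = set()    # (owner_area, id) refs held into others

      def local_collect(self):
          # mark from local roots plus the in set; no other area needed
          live, stack = set(), list(self.roots | self.in_set)
          while stack:
              oid = stack.pop()
              if oid in live or oid not in self.objects:
                  continue
              live.add(oid)
              stack.extend(self.objects[oid])
          self.objects = {o: r for o, r in self.objects.items()
                          if o in live}

  def publish_out_sets(areas):
      # opportunistically ship out sets; owners rebuild their in sets
      for a in areas:
          a.in_set = set()
      for a in areas:
          for owner, oid in a.out_set:
              areas[owner].in_set.add(oid)

  a0, a1 = Area(), Area()
  a0.objects = {"x": []}; a1.out_set = {(0, "x")}
  publish_out_sets([a0, a1])
  a0.local_collect()
  assert "x" in a0.objects     # kept alive by a1's remote reference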
61
Garbage Collection Phases
  • Within processor, within core collection
  • Among processors following coordinated
    distribution of out sets
  • Distributed marking of out sets from in sets
  • followed on null updates by global collection
  • Compaction and maintenance of dense out sets for
    swapped out pages
  • possible because we can afford to spend lots of
    time when swapping page data
  • Layered on top of conventional ephemeral
    techniques (temporal reference counting)

62
Cluster Configurations