Title: Aries: Integrated Symbolic and Numeric Database Handling with Parallel Distributed Processors
1Aries Integrated Symbolic and Numeric Database
Handling with Parallel Distributed Processors
DARPA DIS Review 5/24/00
Tom Knight, Andrew Huang, Norm Margolus, Howie Shrobe, Peggy Chen,
Greg Sullivan, Michael Phillips, Ben Vandiver, Kalman Reti,
JP Grossman, Jeremy Brown, John Mallory, Tom Cleary
2Aries Project Components
- Technology development
- Language development
- Core Data and Pointer Representations
- System Architecture
- Experimental vehicle
- Benchmark plan
3Technology Substrate
- Fast SRAM tagged DRAM banks
- leverage DRAM integration by allowing fast
parallel access to SRAM tags in parallel with
slow DRAM access
- Traps, GC forwarding pointers, reference counts,
timestamps for parallel out of order execution
- Multiple read/writes of SRAM data during fetch of
the DRAM line
- Microchannel liquid cooling technology
- copper-laminate manifold heat sinks
- etched silicon microchannels
4Language Development
- Simple modifications to Scheme
- parallel array operations from APL
- parallel pointer operations (e.g. join, mark)
- sophisticated GC techniques
- data structures built on new object pointers
- multithreading synchronized with combiners and
splitters
- Units
- Preparation for a much more extensive complete
language rewrite
5Pointer and Array Representations
- New pointer structure allowing
- internal object pointers
- immediate access to object header
- immediate access to object size
- little wasted memory (< 3%)
- Objects < 32 words have dense, efficient
representation
- Add only pointers
[Figure: pointer format diagram. Surviving labels: B, L, 5, 5, 6, > 64, L, E, Pointer, B, E, Blocks of size 2]
6Address Format
- MSB of address determines data stripe layout
- index 0 gives single word per processor
- index max gives whole objects in processor
Address
Node
Node index
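The striping rule above can be illustrated with a small sketch. The slide does not give field widths or the exact mapping, so everything here is an assumption: `stripe_index` stands for the value carried in the address MSBs, a stripe is taken to be `2**stripe_index` consecutive words, and stripes are dealt to nodes round-robin. The function name `node_for_word` is hypothetical.

```python
def node_for_word(word_offset: int, stripe_index: int, num_nodes: int) -> int:
    """Return the node that would hold a word under the assumed striping rule.

    Assumption (not from the slides): a stripe is 2**stripe_index consecutive
    words, and successive stripes go to successive nodes, wrapping around.
    """
    if word_offset < 0 or stripe_index < 0 or num_nodes <= 0:
        raise ValueError("offset and index must be >= 0, num_nodes must be > 0")
    stripe_number = word_offset >> stripe_index
    return stripe_number % num_nodes


def nodes_touched(object_words: int, stripe_index: int, num_nodes: int) -> set:
    """Set of nodes holding some word of an object that starts at offset 0."""
    return {node_for_word(w, stripe_index, num_nodes) for w in range(object_words)}
```

Under these assumptions, index 0 spreads a 32-word object across every node, while an index of 5 or more keeps the whole object on one node, which matches the two extremes named on the slide.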
7Squids for Pointer Aliasing Equality
- Create a randomized SQUID tag for newly created
objects (hash address, etc.)
- Store the SQUID with the pointer to the object
- Copy the SQUID with forwarding or sub-component
pointer creation
- Memory aliasing check has three outcomes
- pointers identical -> insert barrier
- pointers different, SQUIDs identical -> insert
barrier (rare)
- pointers different, SQUIDs different -> no barrier
needed
- GC forwards EQ testing sub-object equality
Short Quasi Unique IDentifiers
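The three-outcome aliasing check can be sketched as follows. This is a minimal model, not the hardware design: the 8-bit SQUID width, the `Pointer` class and the function names are all assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass

SQUID_BITS = 8  # assumed width; the slides do not state one


@dataclass(frozen=True)
class Pointer:
    address: int
    squid: int


def new_object_pointer(address: int) -> Pointer:
    """Pointer to a newly created object, tagged with a random SQUID."""
    return Pointer(address, secrets.randbits(SQUID_BITS))


def derive_pointer(base: Pointer, new_address: int) -> Pointer:
    """Forwarding or sub-component pointer: the SQUID is copied from the base."""
    return Pointer(new_address, base.squid)


def needs_barrier(a: Pointer, b: Pointer) -> bool:
    """Conservative aliasing check with the three outcomes listed above."""
    if a.address == b.address:
        return True   # pointers identical: same location
    if a.squid == b.squid:
        return True   # may be the same object, or a rare SQUID collision
    return False      # different SQUIDs: cannot be the same object
```

Two pointers derived from the same object always share a SQUID, so the check never misses a real alias; unrelated objects collide only with probability about `2**-SQUID_BITS`, which costs an unnecessary barrier but not correctness.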
8Fast inter-node memory references
- Build on the low latency Metro architecture
- DeHon et al., ISCA 1994
- Bring the network onto the die
- H tree network connecting on chip multiprocessor
array
- uniform addressing and access across the on/off
chip pins
- Processor to processor
- both remote memory reference and message passing
- Connection rather than packet based
- acknowledgement inherent
- return path pre-allocated
- very simple retry-based routing and error
recovery
- provably good message handling
9Key Metro Characteristics
- Connection rather than packet based
- No buffering, flow control, emergency handling
- Provably good message routing
- insensitive to network permutation patterns
- One clock pin-pin routing within the component
- Scalable bisection bandwidth at configuration
time
- Fault tolerant in both wiring and routers
- Inherent acknowledgement and reply path
10Proposed Aries Packaging
- 32-64 slot active routing backplane (Rack)
- Integral/redundant cooling, power
- Two interchangeable slot components
- Computation Cluster (Processor Box)
- Disk
- Memory
- Processor clusters active RAM
- Communication channel (Network Box)
- High bandwidth channel to other active backplanes
11Packaging Overview
12Cluster Configurations
13Cluster Configurations
14Multiple Domains of Execution
Registers
Value Type Units Owner
Integrity
15Execution Unit Hardware
- Multiple execution domains
- value, type, units, ownership
- Value domain requires special hardware
- Strategy: handle the others with uniform hash array
associative match techniques and software traps
- Same hardware useful for
- data cache (default -- N way set associative)
- value cache -- long ops and functional
subroutines
- tags
- ownership
- units
16Uniform Reconfigurable Hardware
Operands
Bit Mask
Bit Mask
Hash
Table Lookup
Compare
Result insert
Trap
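The stages named above (bit mask, hash, table lookup, compare, result insert, trap) can be modelled in software as a small set-associative table. This is only a sketch: the class `MatchUnit`, its parameters, the FIFO replacement policy and the helper `cached_apply` are assumptions, not details from the slides.

```python
class Trap(Exception):
    """Raised on a miss so that software can handle the operation."""


class MatchUnit:
    def __init__(self, num_sets=64, ways=4, mask=0xFFFFFFFF):
        self.num_sets = num_sets
        self.ways = ways
        self.mask = mask
        self.sets = [[] for _ in range(num_sets)]

    def _locate(self, operands):
        key = tuple(op & self.mask for op in operands)  # bit mask stage
        index = hash(key) % self.num_sets               # hash stage
        return key, self.sets[index]                    # table lookup stage

    def lookup(self, operands):
        key, entries = self._locate(operands)
        for stored_key, result in entries:              # compare stage
            if stored_key == key:
                return result
        raise Trap(key)                                 # miss: trap to software

    def insert(self, operands, result):
        key, entries = self._locate(operands)
        entries[:] = [e for e in entries if e[0] != key]
        if len(entries) >= self.ways:
            entries.pop(0)                              # assumed FIFO eviction
        entries.append((key, result))                   # result insert stage


def cached_apply(unit, func, *operands):
    """Use the unit as a value cache: compute in software only on a trap."""
    try:
        return unit.lookup(operands)
    except Trap:
        result = func(*operands)
        unit.insert(operands, result)
        return result
```

With the default full-width mask, `cached_apply(unit, pow, 2, 10)` computes the value once and serves later calls from the table, which is the "value cache" use listed on the previous slide. The same table could hold tags, ownership or units entries by changing what is stored as the result.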
17Contexts
- Ownership / Pedigree carried with all data
- Detect / enforce ownership rights to computed
data
- Control access to critical owned data
- Only data you have a right to see is visible
- Control actuation / authorization of sensitive
tasks
- Only data not touched by unauthorized users can
actuate
- Automatic propagation of the set of assumptions
- We know on what basis a decision has been made
- We know how to revoke it
- System has access to this data itself
(Introspection)
18Prototype and Verification Vehicle
- Construct a 1-10 uniform speed scaled prototype
- Off the shelf components
- FPGA based design
- Compromise on size, cost, performance, power
- Do not compromise on functionality or debug
access
- Becomes an experimental architecture vehicle
- Allows language software development
- Allows inexpensive feature evaluation
- Debugging hooks everywhere
- Ready to transition to real hardware
- real implementation will still retain many of
these ideas
19Experimental Vehicle Details
- High performance disk drives
- Xilinx Virtex-E FPGA arrays
- fast serial links (LVDS), embedded memory
- debug/access paths for all signals
- Fast host interface by masquerading as SDRAM
- Memory augmented with fast SRAM tags
- FPGA implementation of Metro network
- debug in a PC cluster environment
- off-the-shelf LVDS serializers at 5 Gbps/chip
- Cycle by cycle power profiling
20FPGA SRAM (Moore) Board
21Open PC System Context
22Moore Board Details
- User can implement an early Pentium-class
processor
- Performance scales with Moore's law over time
- leverage latest process technologies in the form
of FPGAs, SRAMs, new bus technologies
- minimal cost, risk
- High-performance host interface (direct
memory-map via PC-100 DIMM interface)
23Benchmark Plans
- Array primitive performance
- Network intensive performance
- Sparse matrix operations
- Parallel Database Join
- DIS Data Management benchmark (rewritten)
- Persistent data storage
- Database operations on out of core data
- Database commit performance
24Summary
- Coherent plan for next generation HW and SW
- Details will be verified with experimental
vehicle
- Manpower and resource limited
- the ideas are largely here
- We will know what to build when we have the
resources to do so
- early prototypes under way
26Box Architecture
Processor
Network
27What's Wrong with this Picture?
- 80% of die area is cache, LSU and schedulers to
help hide memory latency
- PPC 750 die shot below (obtained from IBM website)
28Our Way
- Rendition of what Aries proc. mem die might
look like
Multi-bank DRAM
16 GB/s, 4-8 cycles latency per DRAM bank,
multiple banks per execution unit
Execution Unit
Total bandwidth between execution units and
DRAM for this chip 128 GB/s at 4-8 cycle latency
(supposing 1 GHz processors)
die shots of IBM SA-27E DRAM-ASIC process and IBM
PPC750, obtained from IBM website
29Possible Die Layout of processor DRAM
8 processors/chip 128 MBits DRAM
30System Architecture
VME 9U card
one node, consisting of 256 MB DRAM, 128
processors, and a network processor
DIMM-style cards with 4 DRAM processor chips
(M), 1 network interface chip (N) (64 MBytes of
memory/card)
lots of wires
- Aggregate bandwidth to embedded DRAM processing
elements is in excess of 16 Terabytes/s
- example card has 2 GBytes of DRAM-PEs (1024
processing elements)
- This architecture can comprehensively search for
a keyword in 2 GB of pre-loaded memory in about
500 ms
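The capacity and bandwidth figures on the last few slides are mutually consistent, which a short calculation makes visible. The constants below are taken from the slides; the function names are illustrative, and the chip total assumes each execution unit is credited with one 16 GB/s bank stream (which is what 8 x 16 = 128 GB/s implies).

```python
BANK_BANDWIDTH_GB_S = 16   # per DRAM bank (slide 28)
PROCESSORS_PER_CHIP = 8    # slide 29
DRAM_PER_CHIP_MBIT = 128   # slide 29


def chip_bandwidth_gb_s() -> int:
    """Execution-unit-to-DRAM bandwidth of one chip."""
    return BANK_BANDWIDTH_GB_S * PROCESSORS_PER_CHIP


def chip_dram_mbytes() -> int:
    """DRAM capacity of one chip in MBytes (8 bits per byte)."""
    return DRAM_PER_CHIP_MBIT // 8


def processing_elements(dram_mbytes: int) -> int:
    """Processing elements in a configuration with the given DRAM capacity."""
    return (dram_mbytes // chip_dram_mbytes()) * PROCESSORS_PER_CHIP


def aggregate_bandwidth_gb_s(dram_mbytes: int) -> int:
    """Aggregate bandwidth to all processing elements, in GB/s."""
    return processing_elements(dram_mbytes) * BANK_BANDWIDTH_GB_S
```

Each chip holds 16 MBytes, so a DIMM-style card with 4 chips holds 64 MBytes, a 256 MByte node has 128 processors, and the 2 GByte example has 1024 processing elements with 16384 GB/s of aggregate bandwidth, in line with the "in excess of 16 Terabytes/s" figure.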
31Dynamically Reconfigurable Pipeline
- MATRIX style high level functional blocks
- Dynamically reconfigurable interconnect
conditional unit
similar to Rixner, Dally 1998
32System Integrity
- Per object capabilities
- Triad Pointer Structures
- Garbage Collection
- Transactional Semantics
- Reliable message transport
- Data integrity labelling
- Data ownership
33Ownership / Accessor Labels
- Every word of data d is labelled with owner /
accessor information L(d)
L(d) d
- Every channel c is labelled with potential
readers L(c)
L(c)
contained in L(d) to write d to c
34Synthesizing Labels
PC add r1, r2, r3
L(pc) L(add) L(r1) L(r2)
Label Intersector
L(r3)
35Label Intersector Efficiency
- Compute Joins Once
- Store results in hash table
- Hardware label cache
- similar to value or tlb cache
- No slowdown in the usual case
36PC Restriction
PC bne r1, foo
L(pc) L(bne) L(r1)
Label Intersector
L(pc)
37Dynamic PC declassification
- How can you declassify the program counter?
- Choose a definitive return address before the
branch
- Halfway to transactional semantics
pushpc endpoint    save a definitive return
bne r1, foo        raise pc security level
poppc              lower and return
foo:
poppc              lower and return
38Efficiency Techniques
- Per word labels are costly and awkward
- Most components of a compound structure have
identical labels
- Reclassification of entire data structures should
be efficient
Put labels only on inter-structure links
39Domain Representation
L1
L2
L1
L2
L1
L2
L1
L2
Domain Representation Semantic View
40Domain Representation
L1
L2
L2
Domain Representation Implementation View
41Label Format Details
- Ownership and Security model derived from the
work of Myers and Liskov
- A label is a set of pairs
- each pair consists of an owner
- and a set of permitted accessors
- The effective set of accessors is the
intersection of the permitted accessor sets
- Transitive permission delegation
- revokeable
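The label format above, together with the channel rule from the Ownership / Accessor Labels slide, can be written down directly. This is a minimal sketch: representing a label as a mapping from owner to permitted accessors is an assumption, delegation and revocation are left out, and the function names are illustrative.

```python
from typing import FrozenSet, Mapping

Label = Mapping[str, FrozenSet[str]]  # owner -> permitted accessors


def effective_accessors(label: Label) -> FrozenSet[str]:
    """Intersection of the permitted accessor sets of every owner."""
    accessor_sets = list(label.values())
    if not accessor_sets:
        raise ValueError("a label needs at least one (owner, accessors) pair")
    return frozenset.intersection(*accessor_sets)


def may_write(data_label: Label, channel_readers: FrozenSet[str]) -> bool:
    """Allow writing d to c only if L(c) is contained in L(d)."""
    return channel_readers <= effective_accessors(data_label)
```

For example, if one owner permits `{alice, bob}` and a second owner permits `{bob, carol}`, only `bob` is an effective accessor, so the data may be written to a channel readable by `bob` alone but not to one that `alice` can also read.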
42Ownership Semantics
- Fine grained control of ownership of data and
derived data
- Safe dynamic control flow declassification
- Fine grained control over information
dissemination
- Efficient implementation as a parallel execution
domain
- Leverages strong capability and transactional
models of data representation and execution
43Summary
- Domain level parallelism within processors
- robustness
- security
- higher level semantics
- Processor level parallelism within active RAM
- parallelism friendly data structures, operations
- network friendly communications
- Explicitly parallel transactional programming
environment
- Emphasis on the conceptual simplicity of the
programming models
45Symbolic Computing
- Symbolic data has no inherent local structure
- Knowledge Databases
- Indices
- Higher level vision
- target recognition
- Architectures must implement efficient non local
communications
- Data representations are key
- Triad pointer structures
- Clean parallel semantics
- Transactions as hardware and software primitives
46Triad Pointer Structures
- All pointers have back pointers associated
- Fan in trees for parallel data access
- Combining networks
- Limited fan in and contention -- memoizing
- Fan out trees for data operator distribution
- Data movement is straightforward (paging)
- Garbage collection is no longer an issue
- Compactness comes from linearized freelists
- Utility in balancing fanin/out trees (local)
- Typed pointers and data allow distributed
processors to manipulate data autonomously
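The claim that back pointers make data movement straightforward can be illustrated with a toy heap. The slides do not describe the triad layout itself, so this sketch substitutes a plain side table of back pointers; the `Heap` class and its methods are hypothetical.

```python
class Heap:
    """Toy heap in which every pointer has an associated back pointer."""

    def __init__(self):
        self.fields = {}  # address -> list of target addresses (or None)
        self.back = {}    # address -> set of (referrer address, field index)

    def allocate(self, address, num_fields):
        self.fields[address] = [None] * num_fields
        self.back[address] = set()

    def store(self, src, field, dst):
        """Write a pointer and keep the back pointers in step."""
        old = self.fields[src][field]
        if old is not None:
            self.back[old].discard((src, field))
        self.fields[src][field] = dst
        if dst is not None:
            self.back[dst].add((src, field))

    def move(self, old, new):
        """Relocate an object, fixing every referrer through its back pointers."""
        if new in self.fields:
            raise ValueError("destination address already in use")
        self.fields[new] = self.fields.pop(old)
        self.back[new] = self.back.pop(old)
        for src, field in list(self.back[new]):
            src_now = new if src == old else src  # self-references move too
            self.fields[src_now][field] = new
        for field, dst in enumerate(self.fields[new]):
            if dst is not None:
                self.back[dst].discard((old, field))
                self.back[dst].add((new, field))
```

Moving an object touches only the object, its referrers and its targets; no scan of the rest of memory is needed, which is the property that makes paging and compaction local operations.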
47Semantically Richer Objects and Operators
- Sets
- Ordered sets
- Vectors, Tensors
- APL operators
- Extension of the type system
- Units
- Persistent objects
- Unguarded objects barriers
- Explicit communication between concurrent threads
- Combiners only allowed
- I/O as an unguarded operation
48Transactional Processing
- We already use transactional semantics
- instruction execution on any modern processor
- system calls in some operating systems mimic
instruction semantics (e.g. ITS)
- Early proposals for parallel execution of
sequential code relied on transactional semantics
- Raise the level - Reed
- make transactions visible at the language level
- provide hardware support for efficient
transactional processing
- Timestamp modifications for auditing, rollback
and introspective analysis
49Computing Models
- Sequential
- Semantically sequential
- deterministic results
- static
- dynamic
- Concurrent independent
- Concurrent atomic
- nondeterministic results
50Timestamping Data
- Separate virtual and real time - Jefferson
- Assign virtual time ranges to data objects
- Subranging for nested transactions
- Reassignment of virtual time tokens on allocation
failure
- Guarded references access the correct version of
data
51Liquid Viewing Serial Execution as a Sequence of
Transactions
- Instructions read and then modify processor and
memory state -- a transaction
- Blocks of instructions can be viewed similarly
- Key idea
- execute multiple, logically sequential blocks in
parallel
- independent threads
- use database commit techniques to handle
otherwise intractable problems of aliasing
- use cache coherency mechanisms to automatically
detect aliasing problems and back out
- optimistic concurrency in executing parallel
threads
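The key idea can be sketched as a small optimistic-concurrency simulation. It is an illustration under stated assumptions, not the Liquid design: per-address version counters stand in for the cache coherency mechanism, blocks are run one after another against a shared snapshot rather than on real threads, and all class and function names are hypothetical.

```python
class Conflict(Exception):
    """A block read a location that an earlier block has since modified."""


class Memory:
    def __init__(self, initial=None):
        self.values = dict(initial or {})
        self.versions = {}  # address -> number of committed writes


class Transaction:
    def __init__(self, memory):
        self.memory = memory
        self.read_versions = {}  # address -> version seen at first read
        self.writes = {}         # buffered until commit

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        self.read_versions.setdefault(addr, self.memory.versions.get(addr, 0))
        return self.memory.values.get(addr, 0)

    def write(self, addr, value):
        self.writes[addr] = value

    def commit(self):
        # Validate every read before making any write visible.
        for addr, seen in self.read_versions.items():
            if self.memory.versions.get(addr, 0) != seen:
                raise Conflict(addr)
        for addr, value in self.writes.items():
            self.memory.values[addr] = value
            self.memory.versions[addr] = self.memory.versions.get(addr, 0) + 1


def run_speculatively(memory, blocks):
    """Run all blocks optimistically, commit in program order, redo on conflict.

    Returns the number of blocks that had to be backed out and re-executed.
    """
    speculative = []
    for block in blocks:
        txn = Transaction(memory)
        block(txn)
        speculative.append(txn)
    retries = 0
    for block, txn in zip(blocks, speculative):
        try:
            txn.commit()
        except Conflict:
            retries += 1
            redo = Transaction(memory)
            block(redo)
            redo.commit()
    return retries
```

Because commits happen in program order and a conflicting block is re-executed against the committed state, the final memory matches what serial execution would produce; only blocks that actually alias with an earlier block pay the cost of a retry.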
52Research Plan
- Architect physical simulation model
- Locate partner to spin
- Language design for symbolic processing
- Feature list and architecture strawman for
symbolic applications
- Implement symbolic processing in FPGA or Alpha
emulation
- Logic level design of network and component
- Locate partner to spin design
53Impact
- New generation of data oriented processing
- physical
- symbolic
- Parallel performance
- Almost serial programming model
- 100-1000x performance on a wide range of
problems
- includes important symbolic processing and
knowledge database retrieval problems as well as
physical simulation techniques
54Language Innovation
- Learn the lessons from decades of AI languages
- cleanliness in design
- manifest performance costs
- simple data structures
- trivial syntax
- performance counts
- pointer allocation is central
- type checking at both compile and run time is
essential
- security can be enforced by pointer hygiene
- side effects are necessary as a programming tool
- manifest data types allow distributed data
handling
55Benchmark Problems
- Graphics
- polygon rendering
- point sample rendering
- ray tracing
- radiosity
- CAD
- verilog
- hspice
- place and route
- Simulation
- mechanics
- n body
- fluid flow
- static PDEs
- EM field solver
- AI
- neural networks
- genetic algorithms
- knowledge databases
- computer vision
- object recognition
- database matching
- biological sequences
- Numerical
- factoring
- primality testing
- Miscellaneous
- text searching
- chess
- Mandelbrot sets
- protein folding
56Enabling Ideas
- Physical simulations
- SIMD processing
- Skip samples
- Lookup tables plus multiply array
- Symbolic processing
- on chip GC
- Metro routing
- Fat tree packaging
- Timestamped memory
- Commit operations fundamental
57Architectural Synthesis
- Lisp Machine
- CADR, LM-2, 3600, Ivory, Open Genera
- Connection Machine
- Cross Omega Machine
- CAM-8
- Abacus
- Transit/Metro
- Matrix
- Terasys
- Liquid
58Technology Opportunity
- DRAM Logic
- Terabaud access to on chip memory
- with state of the art on chip logic and
interconnect
- BGA Packaging
- 600-1000 pins/die
- GHz signalling
- Small die footprint
- Reconfigurable logic
- Commodity component opportunity
- Delayed binding of architectures
59Substrate Technology Development
- Resonant Clocking Design Tool
Crafted resonant transmission line
60Garbage Collection Technology
- Problem: large, distributed, highly linked,
persistent data structures, partially swapped out
- Slow access to large parts of the data
- Solution: incremental, distributed, local garbage
collection techniques
- Maintenance of in and out vectors of external
pointers
- In set used as a root for local garbage
collection
- Out set maintained and opportunistically sent out
- Object reference safety essential -- sizes from
pointer
- Techniques based on Area GC of Bishop (1968)
- Maheshwari and Liskov (1998)
61Garbage Collection Phases
- Within processor, within core collection
- Among processors following coordinated
distribution of out sets
- Distributed marking of out sets from in sets
- followed on null updates by global collection
- Compaction and maintenance of dense out sets for
swapped out pages
- possible because we can afford to spend lots of
time when swapping page data
- Layered on top of conventional ephemeral
techniques (temporal reference counting)
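The first phase, collection within a single processor using the in set as a root, can be sketched as a local mark pass. The object graph encoding, the function name `local_collect` and the choice to recompute the out set during marking are assumptions made for illustration; the distributed phases are not modelled.

```python
from typing import Dict, List, Set, Tuple


def local_collect(
    local_objects: Dict[int, List[int]],
    local_roots: Set[int],
    in_set: Set[int],
) -> Tuple[Set[int], Set[int]]:
    """Mark live local objects and gather outgoing external pointers.

    local_objects maps an object id to the ids it references; ids that are
    not keys of local_objects are treated as living on another processor.
    in_set holds local objects referenced from other processors.
    Returns (live local objects, out set of external ids referenced by them).
    """
    live: Set[int] = set()
    out_set: Set[int] = set()
    stack = [obj for obj in (local_roots | in_set) if obj in local_objects]
    while stack:
        obj = stack.pop()
        if obj in live:
            continue
        live.add(obj)
        for ref in local_objects[obj]:
            if ref in local_objects:
                if ref not in live:
                    stack.append(ref)
            else:
                out_set.add(ref)  # external pointer: report, do not follow
    return live, out_set
```

Local objects outside the returned live set can be reclaimed without contacting any other processor, including local cycles. The returned out set is what would be sent out opportunistically so that other processors can update their own in sets.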
62Cluster Configurations