Title: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
1 HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
DARPA DIS PI Meeting, Santa Fe, NM
March 26-27, 2002
2 HiDISC: Hierarchical Decoupled Instruction Set Computer
- New Ideas
  - A dedicated processor for each level of the memory hierarchy
  - Explicit management of each level of the memory hierarchy using instructions generated by the compiler
  - Hide memory latency by converting data access predictability to data access locality
  - Exploit instruction-level parallelism without extensive scheduling hardware
  - Zero-overhead prefetches for maximal computation throughput
- Impact
  - 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  - 7.4x speedup for matrix multiply over an in-order issue superscalar processor
  - 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
  - Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
  - Allows the compiler to solve indexing functions for irregular applications
  - Reduced system cost for high-throughput scientific codes
- Schedule (April 2001 start through March 2002)
  - Defined benchmarks
  - Completed simulator
  - Performed instruction-level simulations on hand-compiled benchmarks
  - Continue simulations of more benchmarks (SAR)
  - Define HiDISC architecture
  - Benchmark results
  - Update simulator
  - Developed and tested a complete decoupling compiler
  - Generated performance statistics and evaluated design
3 Accomplishments
- Design of the HiDISC model
- Compiler development (operational)
- Simulator design (operational)
- Performance evaluation
  - Three DIS benchmarks (Multidimensional Fourier Transform, Method of Moments, Data Management)
  - Five stressmarks (Pointer, Transitive Closure, Neighborhood, Matrix, Field)
- Results
  - Mostly higher performance (some lower)
  - Range of applications
- HiDISC of the future
4 Outline
- Original Technical Objective
- HiDISC Architecture Review
- Benchmark Results
- Conclusion and Future Work
5 HiDISC: Hierarchical Decoupled Instruction Set Computer
- Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
- Domain: benchmarks with large data sets, i.e., symbolic, signal processing, and scientific programs
- Present solutions: larger caches, prefetching (software and hardware), simultaneous multithreading
6 Present Solutions and Their Limitations
- Larger caches
  - Slow
  - Work well only if the working set fits the cache and there is temporal locality
  - Cannot be tailored for each application
- Hardware prefetching
  - Behavior based on past and present execution-time behavior
- Software prefetching
  - Must ensure the overheads of prefetching do not outweigh the benefits -> conservative prefetching
  - Adaptive software prefetching is required to change the prefetch distance at run-time
  - Hard to insert prefetches for irregular access patterns
- Multithreading
  - Solves the throughput problem, not the memory latency problem
7 The HiDISC Approach
- Observation
  - Software prefetching impacts compute performance
  - PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
- Approach
  - Add a processor to manage prefetching -> hide the overhead
  - Compiler explicitly manages the memory hierarchy
  - Prefetch distance adapts to the program's runtime behavior
8 What is HiDISC?
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

[Figure: HiDISC organization. A 2-issue Computation Processor (CP) and a 3-issue Access Processor (AP) communicate through the Load Data Queue, Store Address Queue, Store Data Queue, and Slip Control Queue; a 3-issue Cache Management Processor (CMP) manages the L1 cache.]
9 Slip Control Queue
- The Slip Control Queue (SCQ) adapts dynamically
  - Late prefetches: prefetched data arrived after the load had been issued
  - Useful prefetches: prefetched data arrived before the load had been issued

if (prefetch_buffer_full())
    ;  /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
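The adaptation rule above can be sketched as a small function. The +/-1 step size and the bounds are illustrative assumptions; the slides only specify the decision conditions.

```python
def adjust_scq_size(scq_size, prefetch_buffer_full, late_prefetches,
                    useful_prefetches, min_size=1, max_size=256):
    """Return the new SCQ size under the slide's adaptation policy.

    Step size and min/max bounds are assumed for illustration.
    """
    if prefetch_buffer_full:
        return scq_size                      # don't change size of SCQ
    if 2 * late_prefetches > useful_prefetches:
        return min(scq_size + 1, max_size)   # prefetches arriving late -> grow slip
    return max(scq_size - 1, min_size)       # prefetches comfortably early -> shrink
```

Growing the queue lets the AP slip further ahead of the CP when prefetches are arriving too late; shrinking it limits cache pollution when prefetches are already useful.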
10 Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner-loop convolution:

for (j = 0; j < i; j++)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

while (!EOD)
    y = y + (x * h);
send y to SDQ;

Access Processor code:

for (j = 0; j < i; j++) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token);
send address of y[i] to SAQ;

Cache Management Processor code:

for (j = 0; j < i; j++) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data.
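The access/compute split above can be mimicked in software, with a plain Python queue standing in for the hardware Load Data Queue and `None` standing in for the EOD token (a sketch of the decoupling idea, not of the hardware):

```python
from collections import deque

def access_stream(x, h, i, ldq):
    """AP role: issue the loads ahead of the CP, pushing operand
    pairs (x[j], h[i-j-1]) into the Load Data Queue."""
    for j in range(i):
        ldq.append((x[j], h[i - j - 1]))
    ldq.append(None)  # EOD token

def compute_stream(ldq):
    """CP role: consume operand pairs from the LDQ until EOD,
    accumulating y without computing any addresses itself."""
    y = 0
    while True:
        item = ldq.popleft()
        if item is None:        # EOD reached
            return y
        xv, hv = item
        y += xv * hv

# One inner-loop iteration of the convolution, i = 3:
x, h = [1, 2, 3], [4, 5, 6]
ldq = deque()
access_stream(x, h, 3, ldq)
y = compute_stream(ldq)         # y = x[0]*h[2] + x[1]*h[1] + x[2]*h[0] = 28
```

The point of the split is that `compute_stream` never computes an address: all access logic lives in the access stream, which in hardware can run arbitrarily far ahead.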
11 General View of the Compiler

[Figure: HiDISC compilation overview. Source program -> gcc -> binary code -> disassembler -> stream separator -> computation code, access code, and cache management code.]
12 HiDISC Stream Separator
13 Stream Separation: Backward Load/Store Chasing
- Load/store instructions and their backward slices are included in the Access stream
- All remaining instructions are sent to the Computation stream
- Communication between the two streams is also defined at this point
14 Stream Separation: Creating an Access Stream
- Communication needs to take place via the various hardware queues
- Insert communication instructions
- CMP instructions are copied from access stream instructions, except that load instructions are replaced by prefetch instructions
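The backward load/store chasing of the previous slide can be sketched as a fixed-point slice over a toy instruction list. The `(dest, op, sources)` IR and the iteration scheme are illustrative assumptions, not the actual HiDISC compiler representation:

```python
def separate_streams(instructions):
    """Partition instructions into (access, computation) streams by
    backward-chasing the operands of loads and stores.

    Each instruction is a (dest, op, sources) triple (toy IR).
    """
    # Seed the access stream with every load/store instruction.
    in_access = {i for i, (_, op, _) in enumerate(instructions)
                 if op in ("load", "store")}
    # Backward slice: pull in the producers of any value an access
    # instruction consumes, iterating to a fixed point.
    changed = True
    while changed:
        changed = False
        needed = {s for i in in_access for s in instructions[i][2]}
        for i, (dest, _, _) in enumerate(instructions):
            if i not in in_access and dest in needed:
                in_access.add(i)
                changed = True
    access = [ins for i, ins in enumerate(instructions) if i in in_access]
    compute = [ins for i, ins in enumerate(instructions) if i not in in_access]
    return access, compute

# Address arithmetic ("add") feeds a load, so it joins the access
# stream; the multiply that only consumes loaded data stays in the
# computation stream.
insts = [("a", "add", ("i", "one")),
         ("t", "load", ("a",)),
         ("c", "mul", ("t", "t"))]
acc, comp = separate_streams(insts)
```

A real separator would additionally insert the queue communication instructions and clone the access stream into CMP code with loads rewritten as prefetches, as the slide describes.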
15 DIS Benchmarks
- Application-oriented benchmarks
- Main characteristics: large data sets, non-contiguous memory access, and no temporal locality
16 DIS Benchmarks: Description

Method of Moments (MoM): Computes the electromagnetic scattering from complex objects; computationally complex and demands high memory speed. [Done]
Multidimensional Fourier Transform (FFT): Wide range of applications: image processing, synthesis, convolution/deconvolution, and digital signal filtering. [Done]
Data Management (DM): Traditional DBMS processing, focusing on index algorithms and ad hoc query processing. [Done]
Image Understanding (IU): Contains algorithms that perform spatial filtering and data reduction: a morphological filter component, a region-of-interest (ROI) selection component, and a feature extraction component. [Compiled; input file problem]
SAR Ray Tracing (RAY): A cost-effective alternative to real data collection; the benchmark utilizes an image-domain approach to SAR image simulation. [Done]
17 Atlantic Aerospace Stressmark Suite
- Smaller and more specific procedures
- Seven individual data-intensive benchmarks
- Directly illustrate particular elements of the DIS problem
- Fewer computation operations than the DIS benchmarks
- Focus on irregular memory accesses
18 Stressmark Suite

Pointer: Pointer following. Small blocks at unpredictable locations; can be parallelized. [Done]
Update: Pointer following with memory update. Small blocks at unpredictable locations. [Done]
Matrix: Conjugate-gradient simultaneous equation solver. Access dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse. [Done]
Neighborhood: Calculates image texture measures by finding sum and difference histograms. Regular access to pairs of words at arbitrary distances. [Done]
Field: Collects statistics on a large field of words. Regular access, with little reuse. [Done]
Corner-Turn: Matrix transposition. Block movement between processing nodes with practically nil computation. [N/A]
Transitive Closure: Finds the all-pairs shortest-path solution for a directed graph. Dependent on matrix representation, but requires reads and writes to different matrices concurrently. [Done]

Source: DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division.
19 Pointer Stressmark
- Basic idea: repeatedly follow pointers to randomized locations in memory
- Memory access pattern is unpredictable (input-data dependent)
- Insufficient temporal and spatial locality for conventional cache architectures
- The HiDISC architecture should provide lower average memory access latency
- The pointer chasing in the AP can run ahead without being blocked by the CP
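The access pattern can be illustrated with a minimal pointer-chasing kernel over a random cycle (the actual stressmark's data layout and parameters differ; names here are illustrative):

```python
import random

def build_chain(n, seed=0):
    """Build a random pointer chain: mem[i] holds the index of the
    next block, forming one cycle over n slots. Stands in for the
    stressmark's small blocks at unpredictable locations."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    mem = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        mem[a] = b
    return mem, order[0]

def chase(mem, start, hops):
    """Follow the chain for `hops` steps. Each access depends on the
    value returned by the previous one, so no stride-based prefetcher
    can predict the next address."""
    p = start
    for _ in range(hops):
        p = mem[p]
    return p
```

Because every load's address comes from the previous load, a superscalar core serializes them at full memory latency, while a decoupled access stream can at least run the chain ahead of the computation that consumes it.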
20 Simulation Parameters (SimpleScalar 3.0, 2000)

Branch prediction mode: bimodal
Branch table size: 2048 entries
Issue width: 4
Window size for superscalar: RUU 16, LSQ 8
Slip distance for AP/CP: 50
Data L1 cache configuration: 128 sets, 32-byte blocks, 4-way set associative, LRU
Data L1 cache latency: 1 cycle
Unified L2 cache configuration: 1024 sets, 64-byte blocks, 4-way set associative, LRU
Unified L2 cache latency: 6 cycles
Integer functional units: ALU (x4), MUL/DIV
Floating-point functional units: ALU (x4), MUL/DIV
Number of memory ports: 2
21 Superscalar and HiDISC
- Ability to vary the memory latency
- The superscalar supports out-of-order issue with a 16-entry RUU (Register Update Unit) and an 8-entry LSQ (Load/Store Queue)
- The AP and CP each issue instructions in order
22 DIS Benchmark/Stressmark Results

DIS benchmarks:
FFT: ?  MoM: ?  DM: ?  IU: N/A  RAY: ?

Stressmarks:
Pointer: ?  Tran. Cl.: ?  Field: ?  Matrix: ?  Neighbor: ?  Update: ?  Corner: N/A

(? = better with HiDISC; ? = better with superscalar)
23 DIS Benchmark Results
24 DIS Benchmark Performance
- The DIS benchmarks perform extremely well in general with our decoupled processing, for the following reasons:
  - Many long-latency floating-point operations
  - Robust with longer memory latency (e.g., Method of Moments is a more stream-like process)
- In FFT, the HiDISC architecture also suffers from longer memory latencies, due to the data dependencies between the two streams
- DM is not affected by the longer memory latency in either case
25 Stressmark Results
26 Stressmark Results: Good Performance
- Pointer chasing can be executed far ahead using the decoupled access stream
  - It does not require computation results from the CP
- Transitive closure also produces good results
  - Little in the AP depends on the results of the CP
  - The AP can run ahead
27 Some Weak Performance
28 Performance Bottlenecks
- Too much synchronization causes loss of decoupling
- Unbalanced code size between the two streams
  - The stressmark suite is highly access-heavy
- Application domain for decoupled architectures
  - A balanced ratio between computation and memory access
29 Synthesis
- Synchronization degrades performance
  - When the dependency of the AP on the CP increases, the slip distance of the AP cannot be increased
- More stream-like applications will benefit from using HiDISC
- Multithreading support is needed
- Applications should contain enough computation (a 1:1 ratio) to hide the memory access latency
- The CMP should be simple
  - It executes redundant operations if the data is already in the cache
  - Cache pollution can occur
30 Flexi-DISC
- Fundamental characteristics
  - Inherently highly dynamic at execution time
- Dynamically reconfigurable central computation kernel (CK)
- Multiple levels of caching and processing around the CK
  - Adjustable prefetching
- Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
31 Flexi-DISC
- Partitioning of the computation kernel
  - It can be allocated to different portions of one application or to different applications
- The CK requires separation of the next ring to feed it with data
- The variety of target applications makes the memory accesses unpredictable
- Identical processing units for the outer rings
  - Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved