Title: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

1. HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
- PIs: Alvin M. Despain and Jean-Luc Gaudiot
- University of Southern California
- http://www-pdpc.usc.edu
- October 12th, 2000
2. HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

Impact
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply over an in-order issue superscalar processor
- 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes
Schedule

April 98 (start):
- Defined benchmarks
- Completed simulator
- Performed instruction-level simulations on hand-compiled benchmarks

April 99:
- Continue simulations of more benchmarks (SAR)
- Define HiDISC architecture
- Benchmark results

April 00:
- Develop and test a full decoupling compiler
- Update simulator
- Generate performance statistics and evaluate design
3. HiDISC: Hierarchical Decoupled Instruction Set Computer

- Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
- Domain: benchmarks with large data sets: symbolic, signal processing, and scientific programs
- Present solutions: larger caches, hardware prefetching, software prefetching, multithreading
4. Present Solutions

Solutions and their limitations:
- Larger caches
  - Slow
  - Work well only if the working set fits the cache and there is temporal locality
- Hardware prefetching
  - Cannot be tailored for each application
  - Behavior based on past and present execution-time behavior
- Software prefetching
  - Must ensure the overheads of prefetching do not outweigh the benefits -> conservative prefetching
  - Adaptive software prefetching is required to change the prefetch distance at run-time
  - Hard to insert prefetches for irregular access patterns
- Multithreading
  - Solves the throughput problem, not the memory latency problem
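The software-prefetching limitations above can be made concrete with a small sketch. The following C fragment is a hypothetical illustration (not part of the HiDISC toolchain) using GCC's `__builtin_prefetch` with a fixed prefetch distance; `PF_DIST` is an assumed tuning parameter. A distance that is too small makes prefetches late, while one that is too large evicts useful data, which is why the slide calls for conservative, adaptive prefetching.

```c
#include <stddef.h>

/* Hypothetical example: sum an array with fixed-distance software
   prefetching. PF_DIST is an assumed tuning parameter. */
#define PF_DIST 16

double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* hint: read access, low temporal locality */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);
        s += a[i];
    }
    return s;
}
```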
5. The HiDISC Approach

Observation
- Software prefetching impacts compute performance
- PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach
- Add a processor to manage prefetching -> hide overhead
- Compiler explicitly manages the memory hierarchy
- Prefetch distance adapts to the program's runtime behavior
6. What is HiDISC?

- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

[Figure: the compiler splits the program into three instruction streams: computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Mgmt. Processor (CMP); the CP and AP share registers.]
7. Decoupled Architectures

[Figure: four configurations compared, each built from a Computation Processor (CP) with registers and cache: MIPS (conventional, single CP), CAPP and DEAP (decoupled, CP plus Access Processor (AP)), and HiDISC (new decoupled, CP plus AP plus Cache Mgmt. Processor (CMP)). Issue widths shown range over 2-issue, 3-issue, 5-issue, and 8-issue variants, all backed by a 2nd-level cache and main memory.]

DEAP [Kurian, Hulina, Coraor 94]; PIPE [Goodman 85]; other decoupled processors: ACRI, ZS-1, WA
8. Slip Control Queue

- The Slip Control Queue (SCQ) adapts dynamically
- Late prefetches: prefetched data arrived after the load had been issued
- Useful prefetches: prefetched data arrived before the load had been issued

if (prefetch_buffer_full())
    ; /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
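The SCQ heuristic above can be sketched as runnable C. This is a minimal illustration under assumed details: the struct, the queue-size bounds, and the function name are hypothetical, not from the HiDISC design.

```c
#include <stddef.h>

/* Sketch of the SCQ size-adaptation heuristic from the slide.
   The struct and the bounds (1..max_size) are illustrative assumptions. */
typedef struct {
    size_t size;     /* current slip distance (SCQ entries) */
    size_t max_size; /* capacity bound on the queue          */
} scq_t;

void scq_adapt(scq_t *q, int prefetch_buffer_full,
               unsigned late_prefetches, unsigned useful_prefetches)
{
    if (prefetch_buffer_full) {
        /* don't change the size of the SCQ */
    } else if (2 * late_prefetches > useful_prefetches) {
        if (q->size < q->max_size)
            q->size++;  /* increase slip: prefetch further ahead */
    } else {
        if (q->size > 1)
            q->size--;  /* decrease slip: prefetching too eagerly */
    }
}
```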
9. Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop convolution:
for (j = 0; j < i; j++)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:
while (!EOD) {
    y = y + (x * h);
    send y to SDQ;
}

Access Processor Code:
for (j = 0; j < i; j++) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token);
send address of y[i] to SAQ;

Cache Management Code:
for (j = 0; j < i; j++) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data
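For reference, the computation that the three streams above jointly perform can be written as ordinary sequential C. This is a sketch for illustration only; the `load`/`prefetch`/queue operations are HiDISC-specific and do not appear here.

```c
/* One output point of the discrete convolution inner loop:
   y[i] = sum over j < i of x[j] * h[i-j-1], as in the slide. */
double conv_point(const double *x, const double *h, int i)
{
    double yi = 0.0;
    for (int j = 0; j < i; j++)
        yi += x[j] * h[i - j - 1];
    return yi;
}
```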
10. Benchmarks

Benchmark  Source of Benchmark   Lines of Source Code  Description                                  Data Set Size
LLL1       Livermore Loops [45]  20                    1024-element arrays, 100 iterations          24 KB
LLL2       Livermore Loops       24                    1024-element arrays, 100 iterations          16 KB
LLL3       Livermore Loops       18                    1024-element arrays, 100 iterations          16 KB
LLL4       Livermore Loops       25                    1024-element arrays, 100 iterations          16 KB
LLL5       Livermore Loops       17                    1024-element arrays, 100 iterations          24 KB
Tomcatv    SPECfp95 [68]         190                   33x33-element matrices, 5 iterations         <64 KB
MXM        NAS kernels [5]       113                   Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY    NAS kernels           156                   Cholsky matrix decomposition                 724 KB
VPENTA     NAS kernels           199                   Invert three pentadiagonals simultaneously   128 KB
Qsort      Quicksort [14]        58                    Quicksort sorting algorithm                  128 KB
11. Simulation

Parameter               Value                     Parameter                 Value
L1 cache size           4 KB                      L2 cache size             16 KB
L1 cache associativity  2                         L2 cache associativity    2
L1 cache block size     32 B                      L2 cache block size       32 B
Memory latency          Variable (0-200 cycles)   Memory contention time    Variable
Victim cache size       32 entries                Prefetch buffer size      8 entries
Load queue size         128                       Store address queue size  128
Store data queue size   128                       Total issue width         8
12. Simulation Results
13. Accomplishments

- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
- 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes
14. Work in Progress

- Compiler design
- Data Intensive Systems (DIS) benchmarks analysis
- Simulator update
- Parameterization of silicon space for VLSI implementation
15. Compiler Requirements

- Source language flexibility
- Sequential assembly code for streaming
- Ease of implementation
- Optimality of sequential code
- Portability and upgradability
16. Gcc-2.95 Features

- Localized register spilling; global common subexpression elimination using lazy code motion algorithms
- Enhanced control flow graph analysis: the new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
- Provision to add modules for instruction scheduling and delayed branch execution
- Front-ends for C, C++, and Fortran available
- Support for different environments and platforms; cross-compilation
17. Compiler Organization

[Figure: HiDISC compilation overview. The source program is compiled by GCC into assembly code; a stream separator then splits it into computation assembly code, access assembly code, and cache management assembly code, each of which is assembled into object code for its processor.]
18. HiDISC Stream Separator
19. Compiler Front-End Optimizations

- Jump optimization: simplify jumps to the following instruction, jumps across jumps, and jumps to jumps
- Jump threading: detect a conditional jump that branches to an identical or inverse test
- Delayed branch execution: find instructions that can go into the delay slots of other instructions
- Constant propagation: propagate constants into a conditional loop
20. Compiler Front-End Optimizations (contd.)

- Instruction combination: combine groups of two or three instructions that are related by data flow into a single instruction
- Instruction scheduling: look for instructions whose output will not be available by the time it is used in subsequent instructions
- Loop optimizations: move constant expressions out of loops and perform strength reduction
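The loop optimizations named above can be sketched in C. This is an illustrative before/after pair under assumed names (the functions are hypothetical, not from gcc): the invariant expression `scale * bias` is hoisted out of the loop, and strength reduction replaces the per-iteration index computation with an incremented pointer.

```c
/* Before: the loop-invariant product is recomputed every iteration,
   and each access recomputes a[i] from the index. */
int dot_scaled_naive(const int *a, int n, int scale, int bias)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * (scale * bias);
    return s;
}

/* After: invariant hoisted; pointer induction variable replaces
   the multiply hidden in the a[i] address computation. */
int dot_scaled_optimized(const int *a, int n, int scale, int bias)
{
    int k = scale * bias;                  /* hoisted out of the loop */
    int s = 0;
    for (const int *p = a; p < a + n; p++) /* strength-reduced index */
        s += *p * k;
    return s;
}
```

Both versions compute the same result; the second simply does less work per iteration.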
21. Example of Stressmarks

- Pointer Stressmark
  - Basic idea: repeatedly follow pointers to randomized locations in memory
  - Memory access pattern is unpredictable
  - Randomized memory access pattern
  - Insufficient temporal and spatial locality for conventional cache architectures
  - The HiDISC architecture provides lower memory access latency
22. Decoupling of Pointer Stressmarks

Inner loop:
for (i = j+1; i < w; i++)
    if (field[index+i] > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Computation Processor Code:
while (!EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Access Processor Code:
for (i = j+1; i < w; i++) {
    load (field[index+i]);
    GET_SCQ;
}
send (EOD token);

Cache Management Code:
for (i = j+1; i < w; i++) {
    prefetch (field[index+i]);  /* inner loop for the next indexing */
    PUT_SCQ;
}
23. Stressmarks

- Hand-compile the 7 individual benchmarks
  - Use gcc as the front-end
  - Manually partition each of the three instruction streams and insert synchronizing instructions
- Evaluate architectural trade-offs
  - Updated simulator characteristics, such as out-of-order issue
  - Large L2 cache and enhanced main memory systems, such as RAMBUS and DDR
24. Simulator Update

- Survey current processor architectures
  - Focus on commercial leading-edge technology for implementation
- Analyze the current simulator and previous benchmark results
- Enhance memory hierarchy configurations
- Add out-of-order issue
25. Memory Hierarchy

- Modern processors have increasingly large on-chip L2 caches
  - E.g., 256 KB L2 cache on Pentium and Athlon processors
  - Reduces the L1 cache miss penalty
- New mechanisms in main memory architecture (e.g., RAMBUS) reduce the L2 cache miss penalty
26. Out-of-Order Multiple Issue

- Most current advanced processors are based on the superscalar, multiple-issue paradigm
  - MIPS R10000, PowerPC, UltraSPARC, Alpha, and the Pentium family
- Compare the HiDISC architecture with modern superscalar processors
- Out-of-order instruction issue
  - For precise exception handling, include in-order completion
- New access decoupling paradigm for out-of-order issue
27. HiDISC with Modern DRAM Architecture

- RAMBUS and DDR DRAM improve memory bandwidth
  - Latency does not improve significantly
- The decoupled access processor can fully utilize the enhanced memory bandwidth
  - More requests are generated by the access processor
  - The prefetching mechanism hides memory access latency
28. HiDISC / SMT

- The reduced memory latency of HiDISC can
  - decrease the number of threads needed by an SMT architecture
  - relieve the memory burden of an SMT architecture
  - lessen the complex issue logic of multithreading
- Functional unit utilization can increase with multithreading features on HiDISC
  - More instruction-level parallelism is possible
29. The McDISC System: Memory-Centered Distributed Instruction Set Computer
30. Summary

- Designing a compiler
  - Porting gcc to HiDISC
- Benchmark simulation with new parameters and an updated simulator
- Analysis of architectural trade-offs for equal silicon area
- Hand-compilation of the Stressmark suites and simulation
- DIS benchmarks simulation