Title: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
1. HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
- PIs: Alvin M. Despain and Jean-Luc Gaudiot
- University of Southern California
- http://www-pdpc.usc.edu
- May 2001
2. Outline
- HiDISC Project Description
- Experiments and Accomplishments
- Work in Progress
- Summary
3. HiDISC: Hierarchical Decoupled Instruction Set Computer
- Technological trend: memory latency is getting longer relative to microprocessor speed (by roughly 40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling
- Domain: benchmarks with large data sets -- symbolic, signal processing, and scientific programs
- Present solutions: multithreading (homogeneous), larger caches, hardware prefetching, software prefetching
4. Present Solutions
- Larger caches
  - Slow
  - Work well only if the working set fits the cache and there is temporal locality
  - Cannot be tailored for each application
- Hardware prefetching
  - Behavior based on past and present execution-time behavior
- Software prefetching
  - Must ensure the overheads of prefetching do not outweigh the benefits -> conservative prefetching
  - Adaptive software prefetching is required to change the prefetch distance at run-time
  - Hard to insert prefetches for irregular access patterns
- Multithreading
  - Solves the throughput problem, not the memory latency problem
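To make the software-prefetching limitations above concrete, here is a minimal sketch (not HiDISC code) of the classic fixed-distance software prefetch. The name `sum_with_prefetch` and the `PREFETCH_DISTANCE` value are invented for illustration; `__builtin_prefetch` is a GCC/Clang hint that compiles to a no-op where unsupported.

```c
/* Fixed-distance software prefetching (illustrative sketch).
 * PREFETCH_DISTANCE is a hypothetical tuning knob: too small and the data
 * arrives late, too large and it may be evicted before use -- exactly the
 * "adaptive prefetch distance" problem the slide lists. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]); /* hint only */
        sum += a[i];
    }
    return sum;
}
```

Note that the prefetch instructions occupy issue slots in the compute loop, which is the "software prefetching impacts compute performance" observation HiDISC starts from.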
5. The HiDISC Approach
- Observation
  - Software prefetching impacts compute performance
  - PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
- Approach
  - Add a processor to manage prefetching -> hide overhead
  - Compiler explicitly manages the memory hierarchy
  - Prefetch distance adapts to the program's runtime behavior
6. What is HiDISC?
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

[Figure: the compiler splits the program into three streams -- computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Mgmt. Processor (CMP); the CP and AP share registers.]
7. Decoupled Architectures

[Figure: block diagrams of four architectures -- MIPS (conventional, single 8-issue Computation Processor), CAPP and DEAP (decoupled: Computation Processor plus Access Processor, connected through the SDQ, SAQ, and LQ queues; DEAP adds a Cache Mgmt. Processor), and HiDISC (new decoupled: 2-issue CP, 3-issue AP, and 3-issue CMP, with an SCQ between the AP and CMP) -- each sitting above the cache, 2nd-level cache, and main memory.]

DEAP [Kurian, Hulina, Coraor '94]; PIPE [Goodman '85]; other decoupled processors: ACRI, ZS-1, WM
8. Slip Control Queue
- The Slip Control Queue (SCQ) adapts dynamically
  - Late prefetches: prefetched data arrived after the load had been issued
  - Useful prefetches: prefetched data arrived before the load had been issued

if (prefetch_buffer_full())
    ; /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
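The sizing rule above can be sketched as plain C. This is our own model, not the HiDISC hardware: the struct fields, counter names, and the floor of 1 on the queue size are assumptions made for illustration.

```c
/* Minimal model of the slip control queue sizing rule:
 * grow the SCQ when late prefetches dominate, shrink it otherwise,
 * and leave it alone while the prefetch buffer is full. */
typedef struct {
    int size;              /* current SCQ size (slip distance) */
    int late_prefetches;   /* data arrived after the load issued */
    int useful_prefetches; /* data arrived before the load issued */
    int buffer_full;       /* nonzero if the prefetch buffer is full */
} scq_t;

void scq_adapt(scq_t *q)
{
    if (q->buffer_full)
        return;                                    /* don't change size */
    if (2 * q->late_prefetches > q->useful_prefetches)
        q->size++;                                 /* increase size */
    else if (q->size > 1)
        q->size--;                                 /* decrease size */
}
```

The factor of 2 weights late prefetches more heavily than useful ones, so the queue grows as soon as a significant fraction of prefetches arrive late.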
9. Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop (convolution):
for (j = 0; j < i; j++)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
while (not EOD)
    y = y + (x * h);
send y to SDQ;

Access Processor code:
for (j = 0; j < i; j++) {
    load(x[j]);
    load(h[i-j-1]);
    GET_SCQ;
}
send(EOD token);
send address of y[i] to SAQ;

Cache Management code:
for (j = 0; j < i; j++) {
    prefetch(x[j]);
    prefetch(h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data
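For reference, the undecoupled computation that all three streams above jointly perform is just the ordinary convolution inner loop. The function name and the test data are ours; only the loop body comes from the slide.

```c
/* Plain single-processor version of the convolution inner loop that
 * slide 9 decouples: accumulate one output point y[i] from x and h. */
void convolve_point(double *y, const double *x, const double *h, int i)
{
    for (int j = 0; j < i; j++)
        y[i] = y[i] + x[j] * h[i - j - 1];
}
```

In the decoupled version, the AP issues the loads of x[j] and h[i-j-1], the CMP prefetches them one SCQ slip ahead, and the CP only consumes operands and produces y.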
10. Where We Were
- HiDISC compiler
  - Frontend selection (gcc)
  - Single thread running without conditionals
  - Hand compiling of benchmarks
    - Livermore Loops, Tomcatv, MXM, Cholsky, Vpenta, and Qsort
11. Benchmarks

Benchmark   Source of Benchmark    Lines of Source Code   Description                                  Data Set Size
LLL1        Livermore Loops [45]   20                     1024-element arrays, 100 iterations          24 KB
LLL2        Livermore Loops        24                     1024-element arrays, 100 iterations          16 KB
LLL3        Livermore Loops        18                     1024-element arrays, 100 iterations          16 KB
LLL4        Livermore Loops        25                     1024-element arrays, 100 iterations          16 KB
LLL5        Livermore Loops        17                     1024-element arrays, 100 iterations          24 KB
Tomcatv     SPECfp95 [68]          190                    33x33-element matrices, 5 iterations         <64 KB
MXM         NAS kernels [5]        113                    Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY     NAS kernels            156                    Cholsky matrix decomposition                 724 KB
VPENTA      NAS kernels            199                    Invert three pentadiagonals simultaneously   128 KB
Qsort       Quicksort [14]         58                     Quicksort sorting algorithm                  128 KB
12. Simulation Parameters
13. Simulation Results
14. Accomplishments
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
- 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes
15. Work in Progress
- Silicon Space for VLSI Layout
- Compiler Progress
- Simulator Integration
- Hand-compiling of Benchmarks
- Architectural Enhancement for Data Intensive
Applications
16. VLSI Layout Overhead
- Goal: evaluate the layout effectiveness of the HiDISC architecture
- Cache has become a major portion of the chip area
- Methodology: extrapolate the HiDISC VLSI layout based on the MIPS R10000 processor (0.35 µm, 1996)
- The space overhead is 11.3% over a comparable MIPS processor
17. VLSI Layout Overhead
18. Compiler Progress
- Preprocessor
  - Gnu preprocessor
  - Support for library calls
  - Support for special compiler directives (#include, #define)
- Nested loops
  - Nested loops without data dependencies (for and while)
  - Support for conditional statements where the index variable of an inner loop is not a function of some outer loop computation
- Conditional statements
  - CP performs all computations
  - Need to move the condition to the AP
19. Nested Loops (Assembly)

High-level code (C):                  Assembly code:
/* 6 */ for (i = 0; i < 10; i++)      sw   $0, 12(sp)
                                  $32:
/* 7 */ for (k = 0; k < 10; k++)      sw   $0, 4(sp)
                                  $33:
/* 8 */ j++;                          lw   $15, 8(sp)
                                      addu $24, $15, 1
                                      sw   $24, 8(sp)
                                      lw   $25, 4(sp)
                                      addu $8, $25, 1
                                      sw   $8, 4(sp)
                                      blt  $8, 10, $33
                                      lw   $9, 12(sp)
                                      addu $10, $9, 1
                                      sw   $10, 12(sp)
                                      blt  $10, 10, $32
20. Nested Loops (Three Streams)

CP stream:
/* 6 */ for (i = 0; i < 10; i++)   $32: b_eod loop_i
/* 7 */ for (k = 0; k < 10; k++)   $33: b_eod loop_k
/* 8 */ j++;                            addu $15, LQ, 1
                                        sw   $15, SDQ
                                        b    $33   /* loop_k */
                                        b    $32   /* loop_i */

AP stream:
/* 6 */ for (i = 0; i < 10; i++)   sw   $0, 12(sp)
                               $32:
/* 7 */ for (k = 0; k < 10; k++)   sw   $0, 4(sp)
                               $33:
/* 8 */ j++;                       lw   LQ, 8(sp)
                                   get  SCQ
                                   sw   8(sp), SAQ
                                   lw   $25, 4(sp)
                                   addu $8, $25, 1
                                   sw   $8, 4(sp)
                                   blt  $8, 10, $33
                                   s_eod
                                   lw   $9, 12(sp)
                                   addu $10, $9, 1
                                   sw   $10, 12(sp)
                                   blt  $10, 10, $32

CMP stream:
/* 6 */ for (i = 0; i < 10; i++)   sw   $0, 12(sp)
                               $32:
/* 7 */ for (k = 0; k < 10; k++)   sw   $0, 4(sp)
                               $33:
/* 8 */ j++;                       pref 8(sp)
                                   put  SCQ
                                   lw   $25, 4(sp)
                                   addu $8, $25, 1
                                   sw   $8, 4(sp)
                                   blt  $8, 10, $33
                                   lw   $9, 12(sp)
                                   addu $10, $9, 1
                                   sw   $10, 12(sp)
                                   blt  $10, 10, $32
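For orientation, the high-level source that slides 19-20 compile and split is just two counted loops incrementing j (lines 6-8 in the slides). A plain C rendering, with the wrapper function name ours:

```c
/* The nested loop of slides 19-20 as ordinary C: two counted loops whose
 * body increments j. The AP stream above handles the loads/stores of j,
 * the CMP stream prefetches 8(sp), and the CP stream only does the addu. */
int nested_loop(void)
{
    int j = 0;
    for (int i = 0; i < 10; i++)        /* line 6 */
        for (int k = 0; k < 10; k++)    /* line 7 */
            j++;                        /* line 8 */
    return j;
}
```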
21. HiDISC Compiler
- Gcc backend
  - Use input from the parsing phase for performing loop optimizations
  - Extend the compiler to produce MIPS4
- Handle loops with dependencies
  - Nested loops where the computation depends on the indices of more than one loop
  - e.g. X(i,j) = i*Y(j,i), where i and j are index variables and j is a function of i
22. HiDISC Stream Separator

Previous work:
- Sequential source -> program flow graph
- Classify address registers
- Allocate instructions to streams -> access stream and computation stream

Current work (per stream):
- Fix conditional statements
- Move queue accesses into instructions
- Move loop invariants out of the loop
- Add slip control queue instructions
- Substitute prefetches for loads, remove global stores, and reverse SCQ direction
- Add global data communication and synchronization
- Produce assembly code: computation, access, and cache management assembly code
23. Simulator Integration
- Based on the MIPS RISC pipeline architecture (dsim)
  - Supports the MIPS1 and MIPS2 ISAs
  - Supports dynamic linking of shared libraries
    - Loads shared libraries in the simulator
- Hand compiling
  - Use SGI-cc or gcc as front-end
  - Produce 3 streams of code
  - Use SGI-cc as compiler back-end

Tool flow:
- .c --(cc -mips2 -S)--> .s
- Modify the three .s files into HiDISC assembly (.hs)
- Convert .hs back to .s for the three streams (hs2s): .cp.s, .ap.s, .cmp.s
- cc -mips2 -o, linked with the shared library: .cp, .ap, .cmp
- Run on dsim
24. DIS Benchmarks
- Atlantic Aerospace DIS benchmark suite
  - Application-oriented benchmarks
  - Many defense applications employ large data sets, non-contiguous memory access, and no temporal locality
  - Too large for hand-compiling; wait until the compiler is ready
  - Requires a linker that can handle multiple object files
- Atlantic Aerospace Stressmark suite
  - Smaller and more specific procedures
  - Seven individual data-intensive benchmarks
  - Directly illustrate particular elements of the DIS problem
25. Stressmark Suite
DIS Stressmark Suite, Version 1.0, Atlantic Aerospace Division
26. Example of Stressmarks
- Pointer Stressmark
  - Basic idea: repeatedly follow pointers to randomized locations in memory
  - Memory access pattern is unpredictable
    - Randomized memory access pattern
    - Insufficient temporal and spatial locality for conventional cache architectures
  - The HiDISC architecture provides lower memory access latency
27. Decoupling of Pointer Stressmarks

Inner loop (for the next indexing):
for (i = j+1; i < w; i++) {
    if (field[index+i] > partition) balance++;
}
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else { max = partition; high++; }

Computation Processor code:
while (not EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else { max = partition; high++; }

Access Processor code:
for (i = j+1; i < w; i++) {
    load(field[index+i]);
    GET_SCQ;
}
send(EOD token);

Cache Management code:
for (i = j+1; i < w; i++) {
    prefetch(field[index+i]);
    PUT_SCQ;
}
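The counting step that the AP and CMP streams feed can be sketched as a sequential C helper. The function name and the test data are ours; the loop structure and variable roles follow the slide.

```c
/* Sequential sketch of the pointer stressmark inner loop: count how many
 * of the fields at positions j+1 .. w-1 (offset by index) exceed the
 * partition value. The decoupled version streams field[index+i] through
 * the load queue instead of reading memory directly. */
int count_above(const int *field, int index, int j, int w, int partition)
{
    int balance = 0;
    for (int i = j + 1; i < w; i++)
        if (field[index + i] > partition)
            balance++;
    return balance;
}
```

Because index is data-dependent, the access pattern is unpredictable, which is exactly why the stressmark defeats conventional caches but still lets the AP run ahead of the CP.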
28. Stressmarks
- Hand-compile the 7 individual benchmarks
  - Use gcc as front-end
  - Manually partition each of the three instruction streams and insert synchronizing instructions
- Evaluate architectural trade-offs
  - Updated simulator characteristics such as out-of-order issue
  - Large L2 cache and enhanced main memory systems such as Rambus and DDR
29. Architectural Enhancements for Data Intensive Applications
- Enhanced memory systems such as RAMBUS DRAM and DDR DRAM
  - Provide high memory bandwidth
  - Latency does not improve significantly
- The decoupled access processor can fully utilize the enhanced memory bandwidth
  - More requests issued by the access processor
  - Prefetching mechanism hides memory access latency
30. Flexi-DISC
- Fundamental characteristics
  - Inherently highly dynamic at execution time
- Dynamically reconfigurable central computation kernel (CK)
- Multiple levels of caching and processing around the CK
  - Adjustable prefetching
- Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
31. Flexi-DISC
- Partitioning of the computation kernel
  - It can be allocated to different portions of the application or to different applications
- The CK requires separation of the next ring to feed it with data
  - The variety of target applications makes the memory accesses unpredictable
- Identical processing units for the outer rings
  - Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
32. Summary
- Gcc backend
  - Use the parse tree to extract the load instructions
  - Handle loops with dependencies where the index variable of an inner loop is not a function of some outer loop computation
- A robust compiler is being designed to experiment with and analyze additional benchmarks
  - Eventually extend it to the DIS benchmarks
- Additional architectural enhancements have been introduced to make HiDISC amenable to the DIS benchmarks
33. HiDISC: Hierarchical Decoupled Instruction Set Computer

New ideas:
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

Impact:
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply over an in-order issue superscalar processor
- 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

Schedule (start: April 2001):
- Defined benchmarks
- Completed simulator
- Performed instruction-level simulations on hand-compiled benchmarks
- Continue simulations of more benchmarks (SAR)
- Define HiDISC architecture
- Benchmark results
- Update simulator
- Develop and test a full decoupling compiler
- Generate performance statistics and evaluate design