1
HiDISC A Decoupled Architecture for Applications
in Data Intensive Computing
  • PIs Alvin M. Despain and Jean-Luc Gaudiot
  • University of Southern California
  • http//www-pdpc.usc.edu
  • May 2001

2
Outline
  • HiDISC Project Description
  • Experiments and Accomplishments
  • Work in Progress
  • Summary


3
HiDISC Hierarchical Decoupled Instruction Set
Computer
Technological trend: Memory latency is growing relative to
microprocessor speed (by roughly 40% per year).
Problem: Some SPEC benchmarks spend more than half of their
time stalling on memory.
Domain: benchmarks with large data sets, such as symbolic,
signal processing, and scientific programs.
Present solutions: multithreading (homogeneous), larger
caches, prefetching, software multithreading.
4
Present Solutions
  • Larger Caches
  • Limitations: slow; work well only if the working set
    fits in the cache and there is temporal locality
  • Hardware Prefetching
  • Limitations: cannot be tailored for each application;
    behavior is based on past and present execution-time
    behavior
  • Software Prefetching
  • Limitations: must ensure the overheads of prefetching
    do not outweigh the benefits -> conservative
    prefetching; adaptive software prefetching is required
    to change the prefetch distance during run-time; hard
    to insert prefetches for irregular access patterns
  • Multithreading
  • Limitations: solves the throughput problem, not the
    memory latency problem

5
The HiDISC Approach
  • Observations
  • Software prefetching impacts compute performance
  • PIMs and RAMBUS offer a high-bandwidth memory
    system - useful for speculative prefetching
  • Approach
  • Add a processor to manage prefetching -> hide the
    overhead
  • The compiler explicitly manages the memory hierarchy
  • The prefetch distance adapts to the program's runtime
    behavior

6
What is HiDISC?
  • A dedicated processor for each level of the
    memory hierarchy
  • Explicitly manage each level of the memory
    hierarchy using instructions generated by the
    compiler
  • Hide memory latency by converting data access
    predictability to data access locality (Just in
    Time Fetch)
  • Exploit instruction-level parallelism without
    extensive scheduling hardware
  • Zero overhead prefetches for maximal computation
    throughput

[Figure: the compiler splits a program into computation
instructions for the Computation Processor (CP), access
instructions for the Access Processor (AP), and cache
management instructions for the Cache Management Processor
(CMP); the processors communicate through registers and
queues.]
7
Decoupled Architectures
[Figure: side-by-side comparison of four architectures:
MIPS (conventional), CAPP and DEAP (decoupled), and HiDISC
(new decoupled), with issue widths ranging from 2-issue to
8-issue. The decoupled designs split a Computation Processor
(CP) from an Access Processor (AP), connected through store
data (SDQ), store address (SAQ), and load (LQ) queues in
front of the cache; HiDISC adds a Cache Management Processor
(CMP), coupled through the Slip Control Queue (SCQ), between
the cache and the 2nd-level cache and main memory.]

DEAP: Kurian, Hulina, and Coraor '94. PIPE: Goodman '85.
Other decoupled processors: ACRI, ZS-1, WM.
8
Slip Control Queue
  • The Slip Control Queue (SCQ) adapts dynamically
  • Late prefetches: prefetched data arrived after the
    load had been issued
  • Useful prefetches: prefetched data arrived before the
    load had been issued

if (prefetch_buffer_full())
    ; /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
9
Decoupling Programs for HiDISC (Discrete Convolution -
Inner Loop)

Inner loop convolution:
    for (j = 0; j < i; j++)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:
    for (j = 0; j < i; j++) {
        load(x[j]);
        load(h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management code:
    for (j = 0; j < i; j++) {
        prefetch(x[j]);
        prefetch(h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue; SDQ: Store Data Queue;
SCQ: Slip Control Queue; EOD: End of Data
10
Where We Were
  • HiDISC compiler
  • Frontend selection (Gcc)
  • Single thread running without conditionals
  • Hand compiling of benchmarks
  • Livermore loops, Tomcatv, MXM, Cholsky, Vpenta
    and Qsort

11
Benchmarks
Benchmark  Source of Benchmark   Lines of Code  Description                                 Data Set Size
LLL1       Livermore Loops [45]  20             1024-element arrays, 100 iterations         24 KB
LLL2       Livermore Loops       24             1024-element arrays, 100 iterations         16 KB
LLL3       Livermore Loops       18             1024-element arrays, 100 iterations         16 KB
LLL4       Livermore Loops       25             1024-element arrays, 100 iterations         16 KB
LLL5       Livermore Loops       17             1024-element arrays, 100 iterations         24 KB
Tomcatv    SPECfp95 [68]         190            33x33-element matrices, 5 iterations        <64 KB
MXM        NAS kernels [5]       113            Unrolled matrix multiply, 2 iterations      448 KB
CHOLSKY    NAS kernels           156            Cholsky matrix decomposition                724 KB
VPENTA     NAS kernels           199            Invert three pentadiagonals simultaneously  128 KB
Qsort      Quicksort [14]        58             Quicksort sorting algorithm                 128 KB

12
Simulation Parameters
13
Simulation Results
14
Accomplishments
  • 2x speedup for scientific benchmarks with large
    data sets over an in-order superscalar processor
  • 7.4x speedup for matrix multiply (MXM) over an
    in-order issue superscalar processor - (similar
    operations are used in ATR/SLD)
  • 2.6x speedup for matrix decomposition/substitution
    (Cholsky) over an in-order issue superscalar
    processor
  • Reduced memory latency for systems that have high
    memory bandwidths (e.g. PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions
    for irregular applications
  • Reduced system cost for high-throughput
    scientific codes

15
Work in Progress
  • Silicon Space for VLSI Layout
  • Compiler Progress
  • Simulator Integration
  • Hand-compiling of Benchmarks
  • Architectural Enhancement for Data Intensive
    Applications

16
VLSI Layout Overhead
  • Goal: Evaluate the layout effectiveness of the HiDISC
    architecture
  • Cache has become a major portion of the chip area
  • Methodology: Extrapolate the HiDISC VLSI layout based
    on the MIPS R10000 processor (0.35 µm, 1996)
  • The space overhead is 11.3% over a comparable
    MIPS processor

17
VLSI Layout Overhead
18
Compiler Progress
  • Preprocessor - support for library calls
  • Gnu preprocessor
  • - Support for special compiler directives
  •   (#include, #define)
  • Nested loops
  • Nested loops without data dependencies (for and
    while)
  • Support for conditional statements where the
    index variable of an inner loop is not a function
    of some outer loop computation
  • Conditional statements
  • CP performs all computations
  • Need to move the condition to the AP

19
Nested Loops (Assembly)
High-level C code (numbered source lines) with the
corresponding MIPS assembly:

6   for(i=0; i<10; i++)
        sw   $0, 12($sp)
    $32:
7       for(k=0; k<10; k++)
            sw   $0, 4($sp)
        $33:
8           j++;
            lw   $15, 8($sp)
            addu $24, $15, 1
            sw   $24, 8($sp)
            lw   $25, 4($sp)
            addu $8, $25, 1
            sw   $8, 4($sp)
            blt  $8, 10, $33
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32
20
Nested Loops
CP Stream:

6   for(i=0; i<10; i++)
    $32: b_eod loop_i
7       for(k=0; k<10; k++)
        $33: b_eod loop_k
8           j++;
            addu $15, LQ, 1
            sw   $15, SDQ
            b    $33
    loop_k: b $32
    loop_i:

AP Stream:

6   for(i=0; i<10; i++)
        sw   $0, 12($sp)
    $32:
7       for(k=0; k<10; k++)
            sw   $0, 4($sp)
        $33:
8           j++;
            lw   LQ, 8($sp)
            get  SCQ
            sw   8($sp), SAQ
            lw   $25, 4($sp)
            addu $8, $25, 1
            sw   $8, 4($sp)
            blt  $8, 10, $33
        s_eod
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32

CMP Stream:

6   for(i=0; i<10; i++)
        sw   $0, 12($sp)
    $32:
7       for(k=0; k<10; k++)
            sw   $0, 4($sp)
        $33:
8           j++;
            pref 8($sp)
            put  SCQ
            lw   $25, 4($sp)
            addu $8, $25, 1
            sw   $8, 4($sp)
            blt  $8, 10, $33
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32
21
HiDISC Compiler
  • Gcc backend
  • Use input from the parsing phase to perform
    loop optimizations
  • Extend the compiler to produce MIPS4 code
  • Handle loops with dependencies
  • Nested loops where the computation depends on the
    indices of more than one loop
  • e.g. X(i,j) = i * Y(j,i)
  • where i and j are index variables and j is a
    function of i

22
HiDISC Stream Separator
[Flow chart of the stream separator:
Sequential source -> program flow graph -> classify address
registers -> allocate instructions to streams (previous
work). Current work continues per stream:
 - Computation stream: fix conditional statements; move
   queue accesses into instructions; move loop invariants
   out of the loop; produce computation assembly code.
 - Access stream: add Slip Control Queue instructions; add
   global data communication and synchronization; produce
   access assembly code.
 - Cache management stream: substitute prefetches for
   loads, remove global stores, and reverse SCQ direction;
   produce cache management assembly code.]
23
Simulator Integration
  • Based on the MIPS RISC pipeline architecture (dsim)
  • Supports the MIPS1 and MIPS2 ISAs
  • Supports dynamic linking for shared libraries
  • Loading shared libraries in the simulator
  • Hand compiling
  • Using SGI-cc or gcc as the front-end
  • Producing the 3 instruction streams
  • Using SGI-cc as the compiler back-end
  • Modifying the three .s files into HiDISC assembly
  • Converting .hs to .s for the three streams (hs2s)

[Tool flow: .c --(cc -mips2 -S)--> .s --(hand modification
to HiDISC assembly)--> .cp.s / .ap.s / .cmp.s --(hs2s)-->
cc -mips2 -o, linked against the shared library --> .cp /
.ap / .cmp --> dsim]
24
DIS Benchmarks
  • Atlantic Aerospace DIS benchmark suite
  • Application-oriented benchmarks
  • Many defense applications employ large data sets,
    non-contiguous memory accesses, and no temporal
    locality
  • Too large for hand-compiling; wait until the
    compiler is ready
  • Requires a linker that can handle multiple object
    files
  • Atlantic Aerospace Stressmark suite
  • Smaller and more specific procedures
  • Seven individual data-intensive benchmarks
  • Directly illustrate particular elements of the
    DIS problem

25
Stressmark Suite
DIS Stressmark Suite Version 1.0, Atlantic
Aerospace Division
26
Example of Stressmarks
  • Pointer Stressmark
  • Basic idea: repeatedly follow pointers to
    randomized locations in memory
  • Randomized memory access pattern is unpredictable
  • Insufficient temporal and spatial locality for
    conventional cache architectures
  • HiDISC architecture provides lower memory access
    latency

27
Decoupling of Pointer Stressmarks
while (not EOD)if (field gt partition)
balance if (balancehigh w/2) breakelse
if (balancehigh gt w/2) min
partitionelse max partition
high
for (ij1iltwi) if (fieldindexi gt
partition) balance if (balancehigh w/2)
breakelse if (balancehigh gt w/2) min
partitionelse max partition
high
Computation Processor Code
for (ij1 iltw i) load (fieldindexi) G
ET_SCQ send (EOD token)
Access Processor Code
for (ij1 iltw i) prefetch
(fieldindexi) PUT_SCQ
Inner loop for the next indexing
Cache Management Code
28
Stressmarks
  • Hand-compile the 7 individual benchmarks
  • Use gcc as front-end
  • Manually partition each of the three instruction
    streams and insert synchronizing instructions
  • Evaluate architectural trade-offs
  • Updated simulator characteristics such as
    out-of-order issue
  • Large L2 cache and enhanced main memory system
    such as Rambus and DDR

29
Architectural Enhancement for Data Intensive
Applications
  • Enhanced memory systems such as RAMBUS DRAM and
    DDR DRAM
  • Provide high memory bandwidth
  • Latency does not improve significantly
  • The decoupled access processor can fully utilize the
    enhanced memory bandwidth
  • More requests issued by the access processor
  • The prefetching mechanism hides memory access latency

30
Flexi-DISC
  • Fundamental characteristics
  • Inherently highly dynamic at execution time
  • Dynamically reconfigurable central computation
    kernel (CK)
  • Multiple levels of caching and processing around
    the CK
  • Adjustable prefetching
  • Multiple processors on a chip, providing a flexible
    adaptation from multiple to single processors and
    horizontal sharing of the existing resources

31
Flexi-DISC
  • Partitioning of the Computation Kernel
  • It can be allocated to different portions of the
    application or to different applications
  • The CK requires separation of the next ring to feed
    it with data
  • The variety of target applications makes the
    memory accesses unpredictable
  • Identical processing units for the outer rings
  • Highly efficient dynamic partitioning of the
    resources and their run-time allocation can be
    achieved

32
Summary
  • Gcc backend
  • Use the parse tree to extract the load
    instructions
  • Handle loops with dependencies where the index
    variable of an inner loop is not a function of
    some outer loop computation
  • A robust compiler is being designed to experiment
    with and analyze additional benchmarks
  • Eventually extend it to the DIS benchmarks
  • Additional architectural enhancements have been
    introduced to make HiDISC amenable to the DIS
    benchmarks

33
HiDISC Hierarchical Decoupled Instruction Set
Computer
  • New Ideas
  • A dedicated processor for each level of the
    memory hierarchy
  • Explicitly manage each level of the memory
    hierarchy using instructions generated by the
    compiler
  • Hide memory latency by converting data access
    predictability to data access locality
  • Exploit instruction-level parallelism without
    extensive scheduling hardware
  • Zero overhead prefetches for maximal computation
    throughput
  • Impact
  • 2x speedup for scientific benchmarks with large
    data sets over an in-order superscalar processor
  • 7.4x speedup for matrix multiply over an
    in-order issue superscalar processor
  • 2.6x speedup for matrix decomposition/substitution
    over an in-order issue superscalar processor
  • Reduced memory latency for systems that have high
    memory bandwidths (e.g. PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions
    for irregular applications
  • Reduced system cost for high-throughput
    scientific codes

Schedule
  • Defined benchmarks
  • Completed simulator
  • Performed instruction-level simulations on
    hand-compiled benchmarks
  • Continue simulations of more benchmarks (SAR)
  • Define HiDISC architecture
  • Benchmark results
  • Update simulator
  • Develop and test a full decoupling compiler
  • Generate performance statistics and evaluate
    design

Start: April 2001