Efficient Support for All Levels of Parallelism for Complex Media Applications - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Efficient Support for All Levels of Parallelism for Complex Media Applications


1
Efficient Support for All Levels of Parallelism
for Complex Media Applications
  • Ruchira Sasanka
  • Ph.D. Preliminary Exam
  • Thesis Advisor: Sarita Adve
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

2
Motivation
  • Complex multimedia applications are critical
    workloads
  • Demand high performance
  • Demand high energy efficiency
  • Media apps important for General-purpose
    processors (GPPs)
  • Increasingly run on general-purpose processors
  • Multiple specifications/standards demand high
    programmability
  • ⇒ How can GPPs support multimedia apps
    efficiently?

3
Motivation
  • Is support for Data Level Parallelism (DLP)
    sufficient?
  • VIRAM, Imagine
  • Assume large amounts of similar (regular) work
  • Look at media-kernels
  • Are full-applications highly regular?
  • Similarity/amount limited by decisions (control)
  • Multiple specifications and intelligent
    algorithms
  • Varying amounts of DLP: sub-word, vector, stream
  • Instruction/Data/Thread-level parallelism
    (ILP/DLP/TLP)
  • ⇒ Need to support all levels of parallelism
    efficiently

4
Philosophy: Five Guidelines
  • Supporting All Levels of Parallelism
  • ILP, TLP, sub-word SIMD (SIMDsw), vectors,
    streams
  • Familiar programming model
  • Increases portability and facilitates wide-spread
    adoption
  • Evolutionary hardware
  • Adaptive, partitioned resources for efficiency
  • Limited dedicated systems (seamless integration)
  • Degree/type of parallelism varies with
    application

5
Contributions of Thesis
  • Analyze complex media apps
  • Make the case that media apps need all levels of
    parallelism
  • ALP Support for All Levels of Parallelism
  • Based on a contemporary CMP/SMT processor
  • Careful partitioning for energy efficiency
  • Novel DLP support
  • Supports vectors and streams
  • Seamless integration with CMP/SMT core (no
    dedicated units)
  • Low hardware/programming overhead
  • Speedups of 5X to 49X, EDP improvement of 5X to
    361X
  • with threads + SIMDsw + novel DLP, over a 4-wide
    OOO core

6
Other Contributions
  • CMP/SMT/Hybrid Energy Comparison (ICS04)
  • Evaluates the most energy-efficient TLP support
    for media apps on GPPs
  • Mathematical model to predict energy efficiency
  • Found a hybrid CMP/SMT to be an attractive
    solution
  • Hardware Adaptation for energy for Media Apps
    (ASPLOS02)
  • Control algorithms for hardware adaptation
  • MS Thesis, jointly done with Chris Hughes
  • ⇒ Not discussed here to focus on current work

7
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

8
Supporting TLP
  • 4-core CMP
  • 2 SMT threads for each core
  • Total 8 threads
  • Based on our TLP study
  • Shared L2 cache

[Figure: four cores (Core 0 to Core 3) sharing an L2 cache]
9
Supporting ILP (and SMT)
[Figure: one two-thread ALP core, showing per-thread fetch/decode, branch predictors (BrP), rename (REN), integer and FP/SIMD issue queues, integer and FP/SIMD register files, integer ALUs, 2 SIMD ALUs and 2 SIMD FPUs per partition, load/store queues, TLBs, retirement (RET), and shared L2 sub-banks; per-thread and shared resources are marked]
Key
  • Partitioned for energy/latency and SMT support
  • Sub-word DLP support with SIMDsw units
10
Supporting DLP
  • 128-bit SIMDsw operations
  • like SSE/AltiVec/VIS
  • Novel DLP support
  • Supports vectors/streams of SIMDsw elements
  • 2-dimensional DLP
  • No vector execution unit !
  • No vector masks !!
  • No vector register file !!!

11
Novel DLP Support
  • Vector data and computation decoupled
  • Vector data provides locality and regularity
  • Vector instructions encode multiple operations
  • Vector registers morphed out of L1 cache lines
  • Indexed Vector Registers (IVR)
  • Computation uses SIMDsw units

12
Indexed Vector Registers (IVR)
[Figure: the same core with IVR banks carved out of the L1 data cache (Bank 0/1, Way 0/1), placed next to the SIMDsw units; per-thread and shared resources are marked]
Key
  • IVR allocated in cache ways close to SIMDsw units
  • Each IVR bank partitioned and allocated only on
    demand
13
Indexed Vector Registers (IVR)
[Figure: vector descriptors for V0 and V1, each with Current Element Pointer, Start Element, Available Elements, and Length fields, mapping onto 128-bit elements laid out across L1 cache lines in Bank 0, Ways 0 and 1]
  • No logic in the critical path; in fact, faster
    than a cache access
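The descriptor fields above (start element, length, current element pointer, available elements) can be sketched in Python. This is a hedged illustration with invented names (`VectorDescriptor`, `IVRBank`), not the hardware design; a cache way is modeled as a flat array of elements.

```python
# Sketch of an indexed-vector-register descriptor (illustrative names only).
# A cache way is modeled as a flat list of 128-bit elements; a descriptor
# maps a logical vector onto a contiguous run of those elements.

class VectorDescriptor:
    def __init__(self, start, length):
        self.start = start        # first element index in the cache way
        self.length = length      # number of elements reserved
        self.cur = 0              # current element pointer (for SIMDsw loops)
        self.available = 0        # elements loaded so far (loads/streams)

class IVRBank:
    def __init__(self, n_elements):
        self.elements = [0] * n_elements   # stands in for L1 line storage

    def read_current(self, d):
        # A SIMDsw instruction reads the element the pointer names: no
        # address computation, tag check, or alignment in the access path.
        return self.elements[d.start + d.cur]

    def write_current(self, d, value):
        self.elements[d.start + d.cur] = value

bank = IVRBank(16)
v0 = VectorDescriptor(start=0, length=8)
v1 = VectorDescriptor(start=8, length=8)
bank.elements[0:8] = list(range(8))   # as if a vector_load filled V0
print(bank.read_current(v0))          # element 0 of V0
v0.cur += 1                           # cur_elem_pointer_increment
print(bank.read_current(v0))          # element 1 of V0
```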

14
Novel DLP Support
  • Vector data and computation decoupled
  • Vector registers morphed out of L1 cache lines
  • Indexed Vector Registers (IVR)
  • Computation uses SIMDsw units
  • Vector instructions for memory
  • Vector memory instructions serviced at L2
  • SIMDsw instructions for computation
  • SIMDsw instructions can access vector registers
    (IVR)

15
Vector vs. SIMDsw Loop
  • V2 = k * (V0 + V1)

Vector:
  vector_load → V0
  vector_load → V1
  vector_add V0, V1 → V3              // V3: temporary vector register
  vector_mult V3, simd_reg1 → V2      // k is in simd_reg1
  vector_store V2 →

ALP:
  vector_load → V0
  vector_load → V1
  vector_alloc V2
  do for all elements in vectors V0 and V1
    simd_add V0, V1 → simd_reg0             // adds next elements of V0, V1
    simd_mul simd_reg0, simd_reg1 → V2      // write product into next elem of V2
    cur_elem_pointer_increment V0, V1, V2   // advance pointers to next element
  vector_store V2 →
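The two loop styles above can be mimicked in plain Python to check their equivalence; scalars stand in for 128-bit SIMDsw elements, and the names are illustrative, not the ISA.

```python
# Sketch comparing the two loop styles for V2 = k * (V0 + V1).
# Both produce the same result; the ALP form touches one element per
# iteration and never needs the temporary vector register V3.

k = 3
V0 = [1, 2, 3, 4]
V1 = [10, 20, 30, 40]

# "Vector" style: whole-vector operations with a temporary vector V3.
V3 = [a + b for a, b in zip(V0, V1)]   # vector_add V0, V1 -> V3
V2_vector = [k * x for x in V3]        # vector_mult V3, simd_reg1 -> V2

# "ALP" style: a SIMDsw loop driven by a current-element pointer.
V2_alp = [0] * len(V0)
cur = 0                                # current element pointer
while cur < len(V0):
    simd_reg0 = V0[cur] + V1[cur]      # simd_add V0, V1 -> simd_reg0
    V2_alp[cur] = simd_reg0 * k        # simd_mul -> next elem of V2
    cur += 1                           # cur_elem_pointer_increment

assert V2_vector == V2_alp == [33, 66, 99, 132]
```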
16
SIMDsw Loops in Action
vector_load → V0
vector_load → V1
vector_alloc V2
do for all elements in vectors V0 and V1
  simd_add V0, V1 → simd_reg0
  simd_mul simd_reg0, simd_reg1 → V2
  cur_elem_pointer_increment V0, V1, V2
vector_store V2 →

[Figure: snapshot of the loop in flight. The vector descriptors track each vector's current element pointer; at rename (REN), each simd_add/simd_mul pair is bound to the current elements (e.g., V0[0], V1[0] → V2[8], then V0[1], V1[1] → V2[9]) before entering the issue queue (IQ)]
17
Why SIMDsw Loops?
  • Lower peak memory BW and better memory latency
    tolerance
  • Computation interspersed with memory operand use

[Code: the Vector and ALP versions of V2 = k * (V0 + V1), repeated from slide 15]
18
Why SIMDsw Loops?
  • Fewer vector register ports and no temporary
    vector regs
  • Lower power, fewer L1 invalidations (vector
    allocations)

Vector:
  vector_sub V0, V1 → V2         // V2 = V0 - V1
  vector_mul V2, V2 → V3         // V3 = (V0 - V1) * (V0 - V1)
  vector_add V3, reg0 → reg0     // sum += (V0 - V1) * (V0 - V1)

ALP:
  do for all elements in vectors V0 and V1
    simd_sub V0, V1 → simd_reg0
    simd_mul simd_reg0, simd_reg0 → simd_reg1
    simd_add simd_reg1, simd_reg2 → simd_reg2
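This sum-of-squared-differences kernel can be checked with a hedged Python stand-in (illustrative names; scalars stand in for SIMDsw elements): the SIMDsw-loop form keeps a running accumulator in one register instead of materializing the temporary vectors V2 and V3.

```python
# Sketch of the kernel above: sum of squared differences.
# The vector form needs two full-length temporaries and extra register-file
# ports; the SIMDsw-loop form keeps everything in a few scalar registers.

V0 = [5, 7, 9]
V1 = [1, 2, 3]

# Vector style: two full-length temporaries.
V2 = [a - b for a, b in zip(V0, V1)]   # vector_sub
V3 = [d * d for d in V2]               # vector_mul
total_vector = sum(V3)                 # vector_add into reg0

# SIMDsw-loop style: a running accumulator, no temporary vectors.
simd_reg2 = 0                          # accumulator
for a, b in zip(V0, V1):
    simd_reg0 = a - b                  # simd_sub
    simd_reg1 = simd_reg0 * simd_reg0  # simd_mul
    simd_reg2 = simd_reg1 + simd_reg2  # simd_add into accumulator

assert total_vector == simd_reg2
```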
19
Why SIMDsw Loops?
  • Lower peak memory BW and better memory latency
    tolerance
  • Computation interspersed with memory operand use
  • Fewer vector register ports and no temporary
    vector regs
  • Lower power, fewer L1 invalidations
  • No vector mask registers
  • Compatibility with streams
  • Easy utilization of heterogeneous execution units

20
Drawbacks of SIMDsw Loops
  • Use more dynamic instructions
  • Higher front-end energy
  • Higher front-end width
  • Require static/dynamic unrolling for multi-cycle
    exec units
  • Vectors contain independent elements
  • Static unrolling increases code size
  • Dynamic unrolling limited by issue queue size
  • How to retain the advantages and get rid of the
    drawbacks?
  • ⇒ Future work

21
Supporting DLP - Streams
  • A stream is a long vector
  • Processed using SIMDsw loops similar to vectors,
    except
  • Only a limited number of records allocated in
    the IVR at a time
  • Stream load/allocate (analogous to vector
    load/alloc)
  • When a cur_elem_pointer_increment retires
  • Head element is discarded from IVR (or written to
    memory)
  • New element appended to IVR

22
Streams Example
stream_load addr,stride → V0
stream_load addr,stride → V1
stream_alloc addr,stride → V2   // just allocate some IVR entries
do for all elements in stream
  simd_add V0, V1 → V2
  cur_elem_pointer_inc V0, V1, V2   // optional: can assume auto increment
stream_store V2                     // flushes the remainder of V2

[Figure: IVR windows for streams V0 and V1 advancing over records N2-N7 as head elements are discarded and new elements appended]
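The sliding-window behavior can be sketched in Python; this is an illustration of the mechanism under stated assumptions (equal-length streams at least as long as the window; `stream_add` and the window size are invented for the sketch), not the hardware.

```python
# Sketch of stream processing with a bounded IVR window.
# Only a fixed number of stream records live in the IVR at once; when a
# cur_elem_pointer_increment "retires", the head element is dropped (or
# written back for an output stream) and the fetch logic appends the next.

from collections import deque

def stream_add(stream0, stream1, window=4):
    # Assumes equal-length streams with at least `window` elements.
    w0, w1 = deque(maxlen=window), deque(maxlen=window)
    out = []
    it0, it1 = iter(stream0), iter(stream1)
    for _ in range(window):              # stream_load: prefill the windows
        w0.append(next(it0))
        w1.append(next(it1))
    while w0:
        out.append(w0.popleft() + w1.popleft())  # simd_add; increment retires
        a, b = next(it0, None), next(it1, None)  # fetch logic appends
        if a is not None:
            w0.append(a)
            w1.append(b)
    return out

print(stream_add(range(8), range(8)))    # [0, 2, 4, 6, 8, 10, 12, 14]
```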
23
ALP's Vector Support vs. SIMDsw (e.g., MMX)
  • ALP's vector support different from conventional
    vector
  • Uses SIMDsw loops and IVR
  • How does ALP's vector support differ from SIMDsw
    (MMX)?
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
  • ⇒ What do these really buy?

24
ALP's Vector Support vs. SIMDsw (e.g., MMX)
  • V2 = k * (V0 + V1)

ALP:
  vector_load → V0
  vector_load → V1
  vector_alloc V2
  do for all elements in vectors V0 and V1
    simd_add V0, V1 → simd_reg0        // adds next W elements of V0, V1
    simd_mul simd_reg0, simd_reg1 → V2 // write product into next W elems of V2
  vector_store V2 →

SIMDsw (MMX):
  do for all elements
    simd_load → simd_reg2
    simd_load → simd_reg3
    simd_add simd_reg2, simd_reg3 → simd_reg0
    simd_mul simd_reg0, simd_reg1 → simd_reg4
    simd_store simd_reg4 → ...

More instructions; more SIMDsw register pressure; loads are costly (LdStQ/TLB/cache tag/can miss); indexed stores are really 2 instructions
25
ALP's Vector Support vs. SIMDsw (e.g., MMX)
  • Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block pre-fetching
  • No L1 or SIMDsw register file pollution
  • Low latency/energy IVR access and no alignment

26
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

27
Indexed Vector Registers (IVR)
[Figure: the ALP core annotated with the parameters below; IVR banks occupy ways of the L1 data-cache banks]
  • Frequency: 500 MHz
  • L1 D-Cache: write-through, 2 banks, 16K per
    bank, 4 ways per bank, 2 ports per bank, 32 B
    line size
  • I-Cache: 2 banks, 8K per bank, 4 ways per bank,
    1 port per bank, 32 B line size
  • L2 Unified Cache: write-back, 4 banks, 4
    sub-banks per bank, 64K per sub-bank, 4 ways per
    sub-bank, 1 port per sub-bank, 64 B line size
  • SIMD FPUs/ALUs: 2 per partition, 128-bit
  • Int units: 64-bit, 2 per partition
  • Int issue queues: 2 partitions, 2 banks per
    partition, 12 entries per bank, 2R/2W per bank,
    tag 4R/2W per bank, 2 issues per bank
  • FP/SIMD issue queue: 2 partitions, 16 entries
    per partition, 2R/2W per partition, tag 4R/2W
    per partition, 2 issues per partition
  • Load/store queue: 2 partitions, 16 entries per
    partition, 2R/2W per partition, 2 issues per
    partition
  • Int RegFile: 2 partitions, 32 regs per
    partition, 4R/3W per partition
  • FP/SIMD RegFile: 2 partitions, 16 regs per
    partition, 128-bit, 4R/3W per partition
  • Reorder buffer: 2 partitions, 2 banks per
    partition, 32 entries per bank, 2R/2W per bank,
    2 retires per bank
  • Rename table: 2 partitions, 4 wide per partition
  • Branch predictor: G-select, 2 partitions, 2K per
    partition
28
Methodology
  • Execution driven simulator
  • Functional simulation based on RSIM
  • Entirely new timing simulator for ALP
  • Applications compiled using Sun CC with full
    optimization
  • SIMDsw/Vector instructions in separate assembly
    file
  • Hand-written assembly instructions (like MMX)
  • Used only in a small number of heavily used
    routines
  • Simulated through hooks placed in application
    binary
  • Dynamic power modeled using WATTCH
  • Static power with HotLeakage coming soon

29
Multi-threaded Full Applications
  • MPEG 2 encoding
  • Each thread encodes part of a frame
  • DLP instructions in DCT, SAD, IDCT, QUANT, IQUANT
  • MPEG 2 decoding
  • Each thread decodes a part of a frame
  • DLP instructions in IDCT, CompPred
  • Ray Tracing (Tachyon)
  • Each thread processes independent ray No DLP
  • Speech Recognition (Sphinx 3)
  • Each thread evaluates a Gaussian scoring model
  • DLP instructions in Gaussian vector operations
  • Face Recognition (CSU)
  • Threaded matrix multiplication and distance
    evaluation
  • Streams used for matrix manipulations

30
Systems Simulated
  • 1-thread superscalar: 1T, 1TS, 1TSV
  • 1T: the base system with one thread and no
    SIMDsw
  • 1TS: base + SIMDsw instructions
  • 1TSV: base + SIMDsw/vector instructions + IVR
  • 4-thread CMP: 4T, 4TS, 4TSV
  • Analogous to above but running four threads on a
    4-core CMP
  • 8-thread CMP+SMT: 4x2T, 4x2TS, 4x2TSV
  • Analogous to first three, but running 8 threads
  • 4 core CMP, with each core running 2 SMT threads

31
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

32
Speedup
  • SIMDsw support adds
  • 1.5X to 6.6X over base
  • Vector support adds
  • 2X to 9.7X over base
  • 1.1X to 1.9X over SIMDsw

[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
33
Speedup
  • CMP support adds
  • 3.1X to 3.9X over base
  • 2.5X to 3.8X over 1TSV

[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar 35.9]
34
Speedup
  • ALP achieves
  • 5.0X to 48.8X over base
  • All forms of parallelism essential
  • SMT support adds
  • 1.14X to 1.87X over CMP (SV)
  • 1.03X to 1.29X over CMP (S)

[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars 35.9 and 48.8]
35
Energy Consumption (No DVS)
[Chart: energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
  • SIMDsw savings: 1.4X to 4.8X over base
  • SV savings: 1.1X to 1.4X over SIMDsw
36
Energy Consumption (No DVS)
[Chart: energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
  • CMP savings 1.09X to 1.17X (for SV)
  • 1.08X to 1.16X over base

37
Energy Consumption (No DVS)
[Chart: energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
  • SMT increases energy by 4% (SV) and 14% (S)
  • ALP reduces energy up to 7.4X
38
Energy Delay Product (EDP) Improvement
  • SIMDsw support adds (excluding RayTrace)
  • 2.3X to 30.7X over base
  • Vector support adds (excluding RayTrace)
  • 4.5X to 63X over base
  • 1.3X to 2.5X over SIMDsw

[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar 63]
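Since EDP is energy times delay, an EDP improvement is the product of the speedup and the energy-saving factor. A small illustrative calculation follows; the pairing of the 6.6X speedup with the 4.8X energy saving is an assumption for illustration, not a per-application figure from the slides.

```python
# Sketch of how energy-delay-product (EDP) improvements compose.
# baseline EDP / new EDP = (E * D) / ((E / energy_saving) * (D / speedup))
#                        = speedup * energy_saving

def edp_improvement(speedup, energy_saving):
    return speedup * energy_saving

# Illustrative: a 6.6X speedup combined with a 4.8X energy saving
# yields roughly a 31.7X EDP improvement.
print(edp_improvement(6.6, 4.8))
```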
39
Energy Delay Product (EDP) Improvement
  • CMP adds
  • 4.0X to 4.3X over base
  • 2.5X to 4.6X over 1TSV

[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars 266, 126, 63]
40
Energy Delay Product (EDP) Improvement
  • ALP achieves
  • 5X to 361X over base
  • All forms of parallelism essential

  • SMT support adds
  • 1.1X to 1.9X over CMP (SV)
  • 0.9X to 1.2X over CMP (S)

[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars 361, 266, 159, 126, 63]
41
Analysis: Vector vs. SIMDsw (Recap)
  • Performance due to 3 primary enhancements
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
  • Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block pre-fetching
  • No L1 or SIMD register file pollution
  • Low latency/energy IVR access and no alignment

42
Number of Retired Instructions/Operations
[Chart: retired instructions and operations, normalized to 1T, for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Operations reduced by eliminating overhead;
instructions reduced by less overhead and packing
of operations
43
Vector vs. SIMDsw Retirement Stall Distribution
[Chart: retirement stall distribution for MPGenc, MPGdec, SpeechRec, FaceRec]
  • SIMD memory stalls replaced by fewer vector
    memory stalls
  • Streaming in FaceRec eliminates most of the
    memory stalls

44
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

45
Comparison with Other Architectures
  • Several interesting architectures
  • Imagine, RAW, VIRAM, Tarantula, TRIPS
  • Most do not report performance for full media
    apps
  • Detailed modeling/programming difficult
  • Imagine gives a frames per second number for MPEG
    2 encoder
  • ⇒ Compare with Imagine

46
Comparison With Imagine
  • MPEG 2 encoding on Imagine
  • 138 fps for 360x288 resolution at 200 MHz
  • Does not include
  • B frame encoding
  • at least 30% more work than P frames, twice
    that of I frames
  • 2/3 of all frames are B frames
  • Huffman VLC
  • Only 5% on a single thread
  • Up to 35% when other parts are
    parallelized/vectorized
  • Half-pixel motion estimation
  • Adds 30% to the execution time

Hard to make a fair energy comparison
ALP achieves 79 fps with everything, at the same
frequency
47
Summary
  • Complex media apps need all levels of parallelism
  • Supporting all levels of parallelism is essential
  • No single type of parallelism gives the best
    performance/energy
  • CMP/SMT processors with evolutionary DLP support
    effective
  • ALP supports DLP efficiently
  • Benefits of vectors and streams with low cost
  • Decoupling of vector data from instructions
  • Adaptable L1 cache as a vector register file
  • Evolutionary hardware and familiar programming
    model
  • Overall, speedups of 5X to 49X, EDP gains of 5X
    to 361X

48
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

49
Future Work
  • Eliminating the drawbacks of SIMDsw loops
  • Benefits of ILP
  • Memory system enhancements
  • Scalability of ALP
  • More applications
  • Adaptation for energy

50
Eliminating the Drawbacks of SIMDsw Loops
  • Drawbacks
  • Use more dynamic instructions
  • Require static/dynamic unrolling for multi-cycle
    exec units
  • Solution
  • Loop repetition with in-order issue
  • No renaming
  • SIMDsw registers volatile across SIMDsw code
    blocks
  • Automatic cur_elem_pointer increment

51
Loop Repetition with In-Order Issue
vector_load → V0
vector_load → V1
vector_alloc V2
do for all elements in vectors V0 and V1
  simd_add V0, V1 → simd_reg0
  simd_mul simd_reg0, simd_reg1 → V2
vector_store V2 →

[Figure: loop repetition in action. The renamed loop body issues in order over four consecutive elements at a time: simd_add V0[0-3], V1[0-3] → simd_reg0-3 and simd_mul simd_reg0-3, simd_reg1 → V2[8-11] flow from rename (REN) through the issue queue (IQ) to the functional units, while the vector descriptors advance the current element pointers]
52
Future Work (Contd.)
  • Eliminating the drawbacks of SIMDsw loops
  • Benefits of ILP
  • Memory system enhancements
  • Scalability of ALP
  • Frequency/Threads/Lanes
  • More applications
  • Adaptation for energy

53
Thank You!
54
Related Work
  • Imagine
  • Stream processor with VLIW clusters and stream
    reg file
  • Needs a special stream programming model
  • Targets highly regular computation
  • Conditional streams available
  • Does not preserve order
  • Higher complexity
  • No support for threads

55
Related Work (Contd.)
  • Tarantula
  • Vector unit for an Alpha EV8 core
  • Dedicated vector unit
  • Sits idle when no vector code
  • Thread support unclear
  • If shared among threads, large vector register
    file
  • If duplicated for each thread, large area
  • Enhancements to conventional vector architecture
  • Adding thread support/out-of-order execution
  • Large multi-ported register files and dedicated
    vector units

56
Related Work (Contd.)
  • VIRAM
  • Vector units with embedded memory
  • Does not provide thread support
  • Targeted for highly regular data parallel code
  • Limited ILP
  • SCALE
  • Vectors with dissimilar computation
  • Uses special form of threads to target loop
    iterations
  • A control processor distributes loop iterations
  • Special programming model
  • Limited support for ILP

57
Related Work (Contd.)
  • Smart Memories
  • Reconfigurable processor/memory substrate
  • Can be configured to map different architectures
  • Imagine/Hydra
  • Each mode has different instructions
  • No ILP study

58
Related Work (Contd.)
  • RAW
  • Tiled architecture that supports ILP, DLP and TLP
  • Special programming model to expose wire delay
  • Can operate on values directly from network
  • Requires the communication-exposed programming
    model
  • Must store extra switch instructions in a
    separate cache
  • 64-bit switch instructions
  • Switch cache twice the size of the instruction
    cache
  • Efficient on apps with static memory references
  • Dynamic memory disambiguation is expensive
  • Currently low ILP due to single-issue in-order
    tile processor

59
Related Work (Contd.)
  • TRIPS
  • Grid architecture with different modes for
    ILP/TLP/DLP
  • Uses static mapping
  • Can hinder portability across generations
  • Supports DLP by converting DLP into ILP
  • No explicit DLP support in ISA
  • Can add a software managed cache in L2

60
Saving/Restoring Vector State
  • Minimize need to save/restore
  • ISA has instructions to de-allocate all vectors
  • An RT OS should allow a frame to run
  • Switching to the OS (e.g., on an interrupt)
    doesn't need saving
  • Disallow non-vector apps from accessing vector
    state
  • IBM S/370
  • Sufficient to save only vector descriptors if
    space in cache
  • When saving necessary, can happen in the
    background

61
Efficient TLP Implementation
  • GPPs employ SMT, CMP and CMP/SMT to support
    threads
  • Which gives higher energy efficiency for same
    performance?
  • Same performance can be obtained by changing
  • Core architecture
  • Frequency
  • Should look at multiple performance points
  • ? Explore a large design space

62
TLP Summary of What We Found (ICS04)
  • Energy at equal performance
  • CMP better than SMT
  • Slightly better for 2 threads
  • Significantly better for 4 threads (and more)
  • CMP/SMT is only slightly worse than CMP for 4
    threads or more
  • Has an area advantage
  • Factors responsible
  • Clock gating
  • Steep power vs. complexity curve
  • High CMP speedup
  • Required speedup for SMT (e.g., 79% of CMP
    speedup)

63
Instructions/Operations Per Cycle (IPC/OPC)
[Chart: IPC and OPC per application for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; OPC > IPC with DLP support]
64
Number of Retired Instructions/Operations
[Chart: retired instructions/operations normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
65
Number of Retired Instructions/Operations
[Chart: retired instructions/operations normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
66
ILP vs. SIMDsw
  • Sources of speedup
  • Reduction in instruction/operation count
  • Multiple operations packed into one
  • Reduced overhead e.g., fewer loop iterations
  • Sources of energy savings
  • Reduction of instruction/operation count
  • Reduction of execution time
  • Reduces un-gated energy

67
Benefits of CMP
  • 1TSV vs. 4TSV
  • Speedups from 2.5X to 3.8X (mean 3.3X)
  • Energy reduction from 9% to 17% (mean 12%)
  • EDP improvement of 2.5X to 4.6X (mean 3.7X)
  • Similar results for 1T vs. 4T and 1TS vs. 4TS
  • Individual threads largely independent and
    homogeneous
  • ⇒ CMP is effective

68
Benefits of SMT
  • 4TSV vs. 4x2TSV
  • Speedups from 14% to 87% (mean 40%)
  • Energy within +1% to -4%
  • EDP reduction from 10% to 90% (mean 38%)
  • 4TS vs. 4x2TS
  • Speedup 3% to 29%
  • Energy increase up to 14%
  • EDP reduction up to 20%
  • ⇒ S benefits lower due to higher resource
    contention

69
Benefits of ILP
  • IPC
  • 1.7 to 2.9 for base
  • 1.6 to 3.1 with SIMDsw/Vector
  • For narrower processors
  • Fetch/Decode/Retirement limited to 2
  • Issue width and functional units unchanged
  • IPC degradation from 12% to 55%
  • ILP important even with vector/SIMDsw support
  • SIMDsw loops also need ILP
  • Future work addresses this problem

70
Allocating/Freeing Vectors
[Figure: rename-stage structures for vector allocation: a logical-to-physical vector map (V0 → PV1, ...), vector length registers, a free vector descriptor stack, hole descriptors (start/end/length of free L1 runs), stream fetch logic (addr + stride), and cache scrubbing logic that un-reserves elements on free]
  • Vectors allocated only with VLD/VALLOC
  • Old physical vector (e.g., PV1) freed when the
    overwriting vld retires; can also be freed with
    VFREE
  • Steps for allocating (at REN)
  • Parallel search of the Len column for a hole
  • Record the start and length of the hole in the
    descriptor
  • Update hole descriptors
  • Pop a free vector descriptor and map to it
  • Allocation/freeing outside of the SIMDsw loop
  • Steps for freeing
  • Update hole pointers
  • Send an un-reserve message to L1
  • Push the freed vector descriptor
  • Nothing in the critical path
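The hole-based allocation above can be sketched as a first-fit allocator in Python; the hardware does a parallel search of the hole descriptors, modeled here as a linear scan, and `HoleAllocator` and its method names are invented for this sketch.

```python
# Sketch of hole allocation for vector descriptors over IVR element storage.

class HoleAllocator:
    def __init__(self, total_elements):
        self.holes = [(0, total_elements)]   # (start, length) of free runs

    def alloc(self, length):
        # First fit: scan holes for one large enough (hardware: parallel).
        for i, (start, hole_len) in enumerate(self.holes):
            if hole_len >= length:
                if hole_len == length:
                    self.holes.pop(i)
                else:
                    self.holes[i] = (start + length, hole_len - length)
                return start                 # becomes the descriptor's start
        return None                          # no space in the IVR

    def free(self, start, length):
        # Un-reserve the run (merging adjacent holes omitted for brevity).
        self.holes.append((start, length))

ivr = HoleAllocator(16)
v0 = ivr.alloc(8)        # run starting at 0
v1 = ivr.alloc(4)        # run starting at 8
ivr.free(v0, 8)          # e.g., VFREE, or the overwriting vld retires
v2 = ivr.alloc(8)        # reuses the freed run at 0
print(v0, v1, v2)        # 0 8 0
```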