Efficient Support for All Levels of Parallelism for Complex Media Applications - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Efficient Support for All Levels of Parallelism for Complex Media Applications


1
Efficient Support for All Levels of Parallelism
for Complex Media Applications
  • Ruchira Sasanka
  • Ph.D. Preliminary Exam
  • Thesis Advisor: Sarita Adve
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

2
Motivation
  • Complex multimedia applications are critical
    workloads
  • Demand high performance
  • Demand high energy efficiency
  • Media apps important for General-purpose
    processors (GPPs)
  • Increasingly run on general-purpose processors
  • Multiple specifications/standards demand high
    programmability
  • ⇒ How can GPPs support multimedia apps
    efficiently?

3
Motivation
  • Is support for Data Level Parallelism (DLP)
    sufficient?
  • VIRAM, Imagine
  • Assume large amounts of similar (regular) work
  • Look at media-kernels
  • Are full-applications highly regular?
  • Similarity/amount limited by decisions (control)
  • Multiple specifications and intelligent
    algorithms
  • Varying amounts of DLP: sub-word, vector, stream
  • Instruction/Data/Thread-level parallelism
    (ILP/DLP/TLP)
  • ⇒ Need to support all levels of parallelism
    efficiently

4
Philosophy: Five Guidelines
  • Supporting All Levels of Parallelism
  • ILP, TLP, sub-word SIMD (SIMDsw), vectors,
    streams
  • Familiar programming model
  • Increases portability and facilitates wide-spread
    adoption
  • Evolutionary hardware
  • Adaptive, partitioned resources for efficiency
  • Limited dedicated systems (seamless integration)
  • Degree/type of parallelism varies with
    application

5
Contributions of Thesis
  • Analyze complex media apps
  • Make the case that media apps need all levels of
    parallelism
  • ALP Support for All Levels of Parallelism
  • Based on a contemporary CMP/SMT processor
  • Careful partitioning for energy efficiency
  • Novel DLP support
  • Supports vectors and streams
  • Seamless integration with CMP/SMT core (no
    dedicated units)
  • Low hardware/programming overhead
  • Speedups of 5X to 49X, EDP improvement of 5X to
    361X
  • with threads + SIMDsw + novel DLP, over a 4-wide
    OOO core

6
Other Contributions
  • CMP/SMT/Hybrid Energy Comparison (ICS04)
  • Evaluates the most energy-efficient TLP support
    for media apps on GPPs
  • Mathematical model to predict energy efficiency
  • Found a hybrid CMP/SMT to be an attractive
    solution
  • Hardware Adaptation for energy for Media Apps
    (ASPLOS02)
  • Control algorithms for hardware adaptation
  • MS Thesis, jointly done with Chris Hughes
  • ⇒ Not discussed here to focus on current work

7
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

8
Supporting TLP
  • 4-core CMP
  • 2 SMT threads for each core
  • Total 8 threads
  • Based on our TLP study
  • Shared L2 cache

[Figure: four cores (Core 0 to Core 3) sharing an L2 cache]
9
Supporting ILP (and SMT)
[Figure: one two-thread ALP core, showing per-thread fetch/decode, branch predictors (BrP), rename (REN), integer and FP/SIMD issue queues, integer and FP/SIMD register files, integer ALUs, 2 SIMD ALUs and 2 SIMD FPUs per partition, load/store queues, TLBs, retirement (RET), and shared L2 sub-banks; per-thread and shared resources are marked]
Key
  • Partitioned for energy/latency and SMT support
  • Sub-word DLP support with SIMDsw units
10
Supporting DLP
  • 128-bit SIMDsw operations
  • like SSE/AltiVec/VIS
  • Novel DLP support
  • Supports vectors/streams of SIMDsw elements
  • 2-dimensional DLP
  • No vector execution unit !
  • No vector masks !!
  • No vector register file !!!

11
Novel DLP Support
  • Vector data and computation decoupled
  • Vector data provides locality and regularity
  • Vector instructions encode multiple operations
  • Vector registers morphed out of L1 cache lines
  • Indexed Vector Registers (IVR)
  • Computation uses SIMDsw units

12
Indexed Vector Registers (IVR)
[Figure: the same core with IVR banks carved out of the L1 data cache (Bank 0/1, Way 0/1), placed next to the SIMDsw units; per-thread and shared resources are marked]
Key
  • IVR allocated in cache ways close to SIMDsw units
  • Each IVR bank partitioned and allocated only on
    demand
13
Indexed Vector Registers (IVR)
[Figure: vector descriptors for V0 and V1, each with Current Element Pointer, Start Element, Available Elements, and Length fields, mapping onto 128-bit elements laid out across L1 cache lines in Bank 0, Ways 0 and 1]
  • No logic in the critical path; in fact, faster
    than a cache access
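The descriptor fields above (start element, length, current element pointer, available elements) can be sketched in Python. This is a hedged illustration with invented names (`VectorDescriptor`, `IVRBank`), not the hardware design; a cache way is modeled as a flat array of elements.

```python
# Sketch of an indexed-vector-register descriptor (illustrative names only).
# A cache way is modeled as a flat list of 128-bit elements; a descriptor
# maps a logical vector onto a contiguous run of those elements.

class VectorDescriptor:
    def __init__(self, start, length):
        self.start = start        # first element index in the cache way
        self.length = length      # number of elements reserved
        self.cur = 0              # current element pointer (for SIMDsw loops)
        self.available = 0        # elements loaded so far (loads/streams)

class IVRBank:
    def __init__(self, n_elements):
        self.elements = [0] * n_elements   # stands in for L1 line storage

    def read_current(self, d):
        # A SIMDsw instruction reads the element the pointer names: no
        # address computation, tag check, or alignment in the access path.
        return self.elements[d.start + d.cur]

    def write_current(self, d, value):
        self.elements[d.start + d.cur] = value

bank = IVRBank(16)
v0 = VectorDescriptor(start=0, length=8)
v1 = VectorDescriptor(start=8, length=8)
bank.elements[0:8] = list(range(8))   # as if a vector_load filled V0
print(bank.read_current(v0))          # element 0 of V0
v0.cur += 1                           # cur_elem_pointer_increment
print(bank.read_current(v0))          # element 1 of V0
```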

14
Novel DLP Support
  • Vector data and computation decoupled
  • Vector registers morphed out of L1 cache lines
  • Indexed Vector Registers (IVR)
  • Computation uses SIMDsw units
  • Vector instructions for memory
  • Vector memory instructions serviced at L2
  • SIMDsw instructions for computation
  • SIMDsw instructions can access vector registers
    (IVR)

15
Vector vs. SIMDsw Loop
  • V2 = k * (V0 + V1)

Vector:
  vector_load → V0
  vector_load → V1
  vector_add V0, V1 → V3              // V3: temporary vector register
  vector_mult V3, simd_reg1 → V2      // k is in simd_reg1
  vector_store V2 →

ALP:
  vector_load → V0
  vector_load → V1
  vector_alloc V2
  do for all elements in vectors V0 and V1
    simd_add V0, V1 → simd_reg0             // adds next elements of V0, V1
    simd_mul simd_reg0, simd_reg1 → V2      // write product into next elem of V2
    cur_elem_pointer_increment V0, V1, V2   // advance pointers to next element
  vector_store V2 →
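The two loop styles above can be mimicked in plain Python to check their equivalence; scalars stand in for 128-bit SIMDsw elements, and the names are illustrative, not the ISA.

```python
# Sketch comparing the two loop styles for V2 = k * (V0 + V1).
# Both produce the same result; the ALP form touches one element per
# iteration and never needs the temporary vector register V3.

k = 3
V0 = [1, 2, 3, 4]
V1 = [10, 20, 30, 40]

# "Vector" style: whole-vector operations with a temporary vector V3.
V3 = [a + b for a, b in zip(V0, V1)]   # vector_add V0, V1 -> V3
V2_vector = [k * x for x in V3]        # vector_mult V3, simd_reg1 -> V2

# "ALP" style: a SIMDsw loop driven by a current-element pointer.
V2_alp = [0] * len(V0)
cur = 0                                # current element pointer
while cur < len(V0):
    simd_reg0 = V0[cur] + V1[cur]      # simd_add V0, V1 -> simd_reg0
    V2_alp[cur] = simd_reg0 * k        # simd_mul -> next elem of V2
    cur += 1                           # cur_elem_pointer_increment

assert V2_vector == V2_alp == [33, 66, 99, 132]
```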
16
SIMDsw Loops in Action
vector_load → V0
vector_load → V1
vector_alloc V2
do for all elements in vectors V0 and V1
  simd_add V0, V1 → simd_reg0
  simd_mul simd_reg0, simd_reg1 → V2
  cur_elem_pointer_increment V0, V1, V2
vector_store V2 →

[Figure: snapshot of the loop in flight. The vector descriptors track each vector's current element pointer; at rename (REN), each simd_add/simd_mul pair is bound to the current elements (e.g., V0[0], V1[0] → V2[8], then V0[1], V1[1] → V2[9]) before entering the issue queue (IQ)]
17
Why SIMDsw Loops?
  • Lower peak memory BW and better memory latency
    tolerance
  • Computation interspersed with memory operand use

[Code: the Vector and ALP versions of V2 = k * (V0 + V1), repeated from slide 15]
18
Why SIMDsw Loops?
  • Fewer vector register ports and no temporary
    vector regs
  • Lower power, fewer L1 invalidations (vector
    allocations)

Vector:
  vector_sub V0, V1 → V2         // V2 = V0 - V1
  vector_mul V2, V2 → V3         // V3 = (V0 - V1) * (V0 - V1)
  vector_add V3, reg0 → reg0     // sum += (V0 - V1) * (V0 - V1)

ALP:
  do for all elements in vectors V0 and V1
    simd_sub V0, V1 → simd_reg0
    simd_mul simd_reg0, simd_reg0 → simd_reg1
    simd_add simd_reg1, simd_reg2 → simd_reg2
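This sum-of-squared-differences kernel can be checked with a hedged Python stand-in (illustrative names; scalars stand in for SIMDsw elements): the SIMDsw-loop form keeps a running accumulator in one register instead of materializing the temporary vectors V2 and V3.

```python
# Sketch of the kernel above: sum of squared differences.
# The vector form needs two full-length temporaries and extra register-file
# ports; the SIMDsw-loop form keeps everything in a few scalar registers.

V0 = [5, 7, 9]
V1 = [1, 2, 3]

# Vector style: two full-length temporaries.
V2 = [a - b for a, b in zip(V0, V1)]   # vector_sub
V3 = [d * d for d in V2]               # vector_mul
total_vector = sum(V3)                 # vector_add into reg0

# SIMDsw-loop style: a running accumulator, no temporary vectors.
simd_reg2 = 0                          # accumulator
for a, b in zip(V0, V1):
    simd_reg0 = a - b                  # simd_sub
    simd_reg1 = simd_reg0 * simd_reg0  # simd_mul
    simd_reg2 = simd_reg1 + simd_reg2  # simd_add into accumulator

assert total_vector == simd_reg2
```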
19
Why SIMDsw Loops?
  • Lower peak memory BW and better memory latency
    tolerance
  • Computation interspersed with memory operand use
  • Fewer vector register ports and no temporary
    vector regs
  • Lower power, fewer L1 invalidations
  • No vector mask registers
  • Compatibility with streams
  • Easy utilization of heterogeneous execution units

20
Drawbacks of SIMDsw Loops
  • Use more dynamic instructions
  • Higher front-end energy
  • Higher front-end width
  • Require static/dynamic unrolling for multi-cycle
    exec units
  • Vectors contain independent elements
  • Static unrolling increases code size
  • Dynamic unrolling limited by issue queue size
  • How to retain the advantages and get rid of the
    drawbacks?
  • ⇒ Future work

21
Supporting DLP - Streams
  • A stream is a long vector
  • Processed using SIMDsw loops similar to vectors,
    except
  • Only a limited number of records allocated in
    the IVR at a time
  • Stream load/allocate (analogous to vector
    load/alloc)
  • When a cur_elem_pointer_increment retires
  • Head element is discarded from IVR (or written to
    memory)
  • New element appended to IVR

22
Streams Example
stream_load addr,stride → V0
stream_load addr,stride → V1
stream_alloc addr,stride → V2   // just allocate some IVR entries
do for all elements in stream
  simd_add V0, V1 → V2
  cur_elem_pointer_inc V0, V1, V2   // optional: can assume auto increment
stream_store V2                     // flushes the remainder of V2

[Figure: IVR windows for streams V0 and V1 advancing over records N2-N7 as head elements are discarded and new elements appended]
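The sliding-window behavior can be sketched in Python; this is an illustration of the mechanism under stated assumptions (equal-length streams at least as long as the window; `stream_add` and the window size are invented for the sketch), not the hardware.

```python
# Sketch of stream processing with a bounded IVR window.
# Only a fixed number of stream records live in the IVR at once; when a
# cur_elem_pointer_increment "retires", the head element is dropped (or
# written back for an output stream) and the fetch logic appends the next.

from collections import deque

def stream_add(stream0, stream1, window=4):
    # Assumes equal-length streams with at least `window` elements.
    w0, w1 = deque(maxlen=window), deque(maxlen=window)
    out = []
    it0, it1 = iter(stream0), iter(stream1)
    for _ in range(window):              # stream_load: prefill the windows
        w0.append(next(it0))
        w1.append(next(it1))
    while w0:
        out.append(w0.popleft() + w1.popleft())  # simd_add; increment retires
        a, b = next(it0, None), next(it1, None)  # fetch logic appends
        if a is not None:
            w0.append(a)
            w1.append(b)
    return out

print(stream_add(range(8), range(8)))    # [0, 2, 4, 6, 8, 10, 12, 14]
```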
23
ALP's Vector Support vs. SIMDsw (e.g., MMX)
  • ALP's vector support different from conventional
    vector
  • Uses SIMDsw loops and IVR
  • How does ALP's vector support differ from SIMDsw
    (MMX)?
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
  • ⇒ What do these really buy?

24
ALP's Vector Support vs. SIMDsw (e.g., MMX)
  • V2 = k * (V0 + V1)

ALP:
  vector_load → V0
  vector_load → V1
  vector_alloc V2
  do for all elements in vectors V0 and V1
    simd_add V0, V1 → simd_reg0        // adds next W elements of V0, V1
    simd_mul simd_reg0, simd_reg1 → V2 // write product into next W elems of V2
  vector_store V2 →

SIMDsw (MMX):
  do for all elements
    simd_load → simd_reg2
    simd_load → simd_reg3
    simd_add simd_reg2, simd_reg3 → simd_reg0
    simd_mul simd_reg0, simd_reg1 → simd_reg4
    simd_store simd_reg4 → ...

More instructions; more SIMDsw register pressure; loads are costly (LdStQ/TLB/cache tag/can miss); indexed stores are really 2 instructions
25
ALP's Vector Support vs. SIMDsw (e.g., MMX)
  • Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block pre-fetching
  • No L1 or SIMDsw register file pollution
  • Low latency/energy IVR access and no alignment

26
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

27
Indexed Vector Registers (IVR)
[Figure: the ALP core annotated with the parameters below; IVR banks occupy ways of the L1 data-cache banks]
  • Frequency: 500 MHz
  • L1 D-Cache: write-through, 2 banks, 16K per
    bank, 4 ways per bank, 2 ports per bank, 32 B
    line size
  • I-Cache: 2 banks, 8K per bank, 4 ways per bank,
    1 port per bank, 32 B line size
  • L2 Unified Cache: write-back, 4 banks, 4
    sub-banks per bank, 64K per sub-bank, 4 ways per
    sub-bank, 1 port per sub-bank, 64 B line size
  • SIMD FPUs/ALUs: 2 per partition, 128-bit
  • Int units: 64-bit, 2 per partition
  • Int issue queues: 2 partitions, 2 banks per
    partition, 12 entries per bank, 2R/2W per bank,
    tag 4R/2W per bank, 2 issues per bank
  • FP/SIMD issue queue: 2 partitions, 16 entries
    per partition, 2R/2W per partition, tag 4R/2W
    per partition, 2 issues per partition
  • Load/store queue: 2 partitions, 16 entries per
    partition, 2R/2W per partition, 2 issues per
    partition
  • Int RegFile: 2 partitions, 32 regs per
    partition, 4R/3W per partition
  • FP/SIMD RegFile: 2 partitions, 16 regs per
    partition, 128-bit, 4R/3W per partition
  • Reorder buffer: 2 partitions, 2 banks per
    partition, 32 entries per bank, 2R/2W per bank,
    2 retires per bank
  • Rename table: 2 partitions, 4 wide per partition
  • Branch predictor: G-select, 2 partitions, 2K per
    partition
28
Methodology
  • Execution driven simulator
  • Functional simulation based on RSIM
  • Entirely new timing simulator for ALP
  • Applications compiled using Sun CC with full
    optimization
  • SIMDsw/Vector instructions in separate assembly
    file
  • Hand-written assembly instructions (like MMX)
  • Used only in a small number of heavily used
    routines
  • Simulated through hooks placed in application
    binary
  • Dynamic power modeled using WATTCH
  • Static power with HotLeakage coming soon

29
Multi-threaded Full Applications
  • MPEG 2 encoding
  • Each thread encodes part of a frame
  • DLP instructions in DCT, SAD, IDCT, QUANT, IQUANT
  • MPEG 2 decoding
  • Each thread decodes a part of a frame
  • DLP instructions in IDCT, CompPred
  • Ray Tracing (Tachyon)
  • Each thread processes independent ray No DLP
  • Speech Recognition (Sphinx 3)
  • Each thread evaluates a Gaussian scoring model
  • DLP instructions in Gaussian vector operations
  • Face Recognition (CSU)
  • Threaded matrix multiplication and distance
    evaluation
  • Streams used for matrix manipulations

30
Systems Simulated
  • 1-thread superscalar: 1T, 1TS, 1TSV
  • 1T: the base system with one thread and no
    SIMDsw
  • 1TS: base + SIMDsw instructions
  • 1TSV: base + SIMDsw/vector instructions + IVR
  • 4-thread CMP: 4T, 4TS, 4TSV
  • Analogous to above but running four threads on a
    4-core CMP
  • 8-thread CMP+SMT: 4x2T, 4x2TS, 4x2TSV
  • Analogous to first three, but running 8 threads
  • 4 core CMP, with each core running 2 SMT threads

31
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

32
Speedup
  • SIMDsw support adds
  • 1.5X to 6.6X over base
  • Vector support adds
  • 2X to 9.7X over base
  • 1.1X to 1.9X over SIMDsw

[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
33
Speedup
  • CMP support adds
  • 3.1X to 3.9X over base
  • 2.5X to 3.8X over 1TSV

[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar 35.9]
34
Speedup
  • ALP achieves
  • 5.0X to 48.8X over base
  • All forms of parallelism essential
  • SMT support adds
  • 1.14X to 1.87X over CMP (SV)
  • 1.03X to 1.29X over CMP (S)

[Chart: speedup over 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars 35.9 and 48.8]
35
Energy Consumption (No DVS)
[Chart: energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
  • SIMDsw savings: 1.4X to 4.8X over base
  • SV savings: 1.1X to 1.4X over SIMDsw
36
Energy Consumption (No DVS)
[Chart: energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
  • CMP savings 1.09X to 1.17X (for SV)
  • 1.08X to 1.16X over base

37
Energy Consumption (No DVS)
[Chart: energy normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
  • SMT increases energy by 4% (SV) and 14% (S)
  • ALP reduces energy up to 7.4X
38
Energy Delay Product (EDP) Improvement
  • SIMDsw support adds (excluding RayTrace)
  • 2.3X to 30.7X over base
  • Vector support adds (excluding RayTrace)
  • 4.5X to 63X over base
  • 1.3X to 2.5X over SIMDsw

[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bar 63]
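Since EDP is energy times delay, an EDP improvement is the product of the speedup and the energy-saving factor. A small illustrative calculation follows; the pairing of the 6.6X speedup with the 4.8X energy saving is an assumption for illustration, not a per-application figure from the slides.

```python
# Sketch of how energy-delay-product (EDP) improvements compose.
# baseline EDP / new EDP = (E * D) / ((E / energy_saving) * (D / speedup))
#                        = speedup * energy_saving

def edp_improvement(speedup, energy_saving):
    return speedup * energy_saving

# Illustrative: a 6.6X speedup combined with a 4.8X energy saving
# yields roughly a 31.7X EDP improvement.
print(edp_improvement(6.6, 4.8))
```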
39
Energy Delay Product (EDP) Improvement
  • CMP adds
  • 4.0X to 4.3X over base
  • 2.5X to 4.6X over 1TSV

[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars 266, 126, 63]
40
Energy Delay Product (EDP) Improvement
  • ALP achieves
  • 5X to 361X over base
  • All forms of parallelism essential

  • SMT support adds
  • 1.1X to 1.9X over CMP (SV)
  • 0.9X to 1.2X over CMP (S)

[Chart: EDP improvement normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; tallest bars 361, 266, 159, 126, 63]
41
Analysis: Vector vs. SIMDsw (Recap)
  • Performance due to 3 primary enhancements
  • Vector data (IVR)
  • Vector memory instructions
  • Stream support
  • Sources of performance/energy advantage
  • Reduced instructions (especially loads/stores)
  • Load latency hiding and reduced repeated loading
  • Block pre-fetching
  • No L1 or SIMD register file pollution
  • Low latency/energy IVR access and no alignment

42
Number of Retired Instructions/Operations
[Chart: retired instructions and operations, normalized to 1T, for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
Operations reduced by eliminating overhead;
instructions reduced by less overhead and packing
of operations
43
Vector vs. SIMDsw Retirement Stall Distribution
[Chart: retirement stall distribution for MPGenc, MPGdec, SpeechRec, FaceRec]
  • SIMD memory stalls replaced by fewer vector
    memory stalls
  • Streaming in FaceRec eliminates most of the
    memory stalls

44
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

45
Comparison with Other Architectures
  • Several interesting architectures
  • Imagine, RAW, VIRAM, Tarantula, TRIPS
  • Most do not report performance for full media
    apps
  • Detailed modeling/programming difficult
  • Imagine gives a frames per second number for MPEG
    2 encoder
  • ⇒ Compare with Imagine

46
Comparison With Imagine
  • MPEG 2 encoding on Imagine
  • 138 fps for 360x288 resolution at 200 MHz
  • Does not include
  • B frame encoding
  • at least 30% more work than P frames, twice
    that of I frames
  • 2/3 of all frames are B frames
  • Huffman VLC
  • Only 5% on a single thread
  • Up to 35% when other parts are
    parallelized/vectorized
  • Half-pixel motion estimation
  • Adds 30% to the execution time

Hard to make a fair energy comparison
ALP achieves 79 fps with everything, at the same
frequency
47
Summary
  • Complex media apps need all levels of parallelism
  • Supporting all levels of parallelism is essential
  • No single type of parallelism gives the best
    performance/energy
  • CMP/SMT processors with evolutionary DLP support
    effective
  • ALP supports DLP efficiently
  • Benefits of vectors and streams with low cost
  • Decoupling of vector data from instructions
  • Adaptable L1 cache as a vector register file
  • Evolutionary hardware and familiar programming
    model
  • Overall, speedups of 5X to 49X, EDP gains of 5X
    to 361X

48
Outline
  • ALP: Efficient ILP, TLP and DLP for GPPs
  • Architecture
  • Support for TLP
  • Support for ILP
  • Support for DLP
  • Methodology
  • Results
  • Summary
  • Related work
  • Future work

49
Future Work
  • Eliminating the drawbacks of SIMDsw loops
  • Benefits of ILP
  • Memory system enhancements
  • Scalability of ALP
  • More applications
  • Adaptation for energy

50
Eliminating the Drawbacks of SIMDsw Loops
  • Drawbacks
  • Use more dynamic instructions
  • Require static/dynamic unrolling for multi-cycle
    exec units
  • Solution
  • Loop repetition with in-order issue
  • No renaming
  • SIMDsw registers volatile across SIMDsw code
    blocks
  • Automatic cur_elem_pointer increment

51
Loop Repetition with In-Order Issue
vector_load → V0
vector_load → V1
vector_alloc V2
do for all elements in vectors V0 and V1
  simd_add V0, V1 → simd_reg0
  simd_mul simd_reg0, simd_reg1 → V2
vector_store V2 →

[Figure: loop repetition in action. The renamed loop body issues in order over four consecutive elements at a time: simd_add V0[0-3], V1[0-3] → simd_reg0-3 and simd_mul simd_reg0-3, simd_reg1 → V2[8-11] flow from rename (REN) through the issue queue (IQ) to the functional units, while the vector descriptors advance the current element pointers]
52
Future Work (Contd.)
  • Eliminating the drawbacks of SIMDsw loops
  • Benefits of ILP
  • Memory system enhancements
  • Scalability of ALP
  • Frequency/Threads/Lanes
  • More applications
  • Adaptation for energy

53
Thank You!
54
Related Work
  • Imagine
  • Stream processor with VLIW clusters and stream
    reg file
  • Needs a special stream programming model
  • Targets highly regular computation
  • Conditional streams available
  • Does not preserve order
  • Higher complexity
  • No support for threads

55
Related Work (Contd.)
  • Tarantula
  • Vector unit for an Alpha EV8 core
  • Dedicated vector unit
  • Sits idle when no vector code
  • Thread support unclear
  • If shared among threads, large vector register
    file
  • If duplicated for each thread, large area
  • Enhancements to conventional vector architecture
  • Adding thread support/out-of-order execution
  • Large multi-ported register files and dedicated
    vector units

56
Related Work (Contd.)
  • VIRAM
  • Vector units with embedded memory
  • Does not provide thread support
  • Targeted for highly regular data parallel code
  • Limited ILP
  • SCALE
  • Vectors with dissimilar computation
  • Uses special form of threads to target loop
    iterations
  • A control processor distributes loop iterations
  • Special programming model
  • Limited support for ILP

57
Related Work (Contd.)
  • Smart Memories
  • Reconfigurable processor/memory substrate
  • Can be configured to map different architectures
  • Imagine/Hydra
  • Each mode has different instructions
  • No ILP study

58
Related Work (Contd.)
  • RAW
  • Tiled architecture that supports ILP, DLP and TLP
  • Special programming model to expose wire delay
  • Can operate on values directly from network
  • Requires the communication-exposed programming
    model
  • Must store extra switch instructions in a
    separate cache
  • 64-bit switch instructions
  • Switch cache twice the size of the instruction
    cache
  • Efficient on apps with static memory references
  • Dynamic memory disambiguation is expensive
  • Currently low ILP due to single-issue in-order
    tile processor

59
Related Work (Contd.)
  • TRIPS
  • Grid architecture with different modes for
    ILP/TLP/DLP
  • Uses static mapping
  • Can hinder portability across generations
  • Supports DLP by converting DLP into ILP
  • No explicit DLP support in ISA
  • Can add a software managed cache in L2

60
Saving/Restoring Vector State
  • Minimize need to save/restore
  • ISA has instructions to de-allocate all vectors
  • An RT OS should allow a frame to run
  • Switching to the OS (e.g., on an interrupt)
    doesn't need saving
  • Disallow non-vector apps from accessing vector
    state
  • IBM S/370
  • Sufficient to save only vector descriptors if
    space in cache
  • When saving necessary, can happen in the
    background

61
Efficient TLP Implementation
  • GPPs employ SMT, CMP and CMP/SMT to support
    threads
  • Which gives higher energy efficiency for same
    performance?
  • Same performance can be obtained by changing
  • Core architecture
  • Frequency
  • Should look at multiple performance points
  • ? Explore a large design space

62
TLP Summary of What We Found (ICS04)
  • Energy at equal performance
  • CMP better than SMT
  • Slightly better for 2 threads
  • Significantly better for 4 threads (and more)
  • CMP/SMT is only slightly worse than CMP for 4
    threads or more
  • Has an area advantage
  • Factors responsible
  • Clock gating
  • Steep power vs. complexity curve
  • High CMP speedup
  • Required speedup for SMT (e.g., 79% of CMP
    speedup)

63
Instructions/Operations Per Cycle (IPC/OPC)
[Chart: IPC and OPC per application for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec; OPC > IPC with DLP support]
64
Number of Retired Instructions/Operations
[Chart: retired instructions/operations normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
65
Number of Retired Instructions/Operations
[Chart: retired instructions/operations normalized to 1T for MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
66
ILP vs. SIMDsw
  • Sources of speedup
  • Reduction in instruction/operation count
  • Multiple operations packed into one
  • Reduced overhead e.g., fewer loop iterations
  • Sources of energy savings
  • Reduction of instruction/operation count
  • Reduction of execution time
  • Reduces un-gated energy

67
Benefits of CMP
  • 1TSV vs. 4TSV
  • Speedups from 2.5X to 3.8X (mean 3.3X)
  • Energy reduction from 9% to 17% (mean 12%)
  • EDP improvement of 2.5X to 4.6X (mean 3.7X)
  • Similar results for 1T vs. 4T and 1TS vs. 4TS
  • Individual threads largely independent and
    homogeneous
  • ⇒ CMP is effective

68
Benefits of SMT
  • 4TSV vs. 4x2TSV
  • Speedups from 14% to 87% (mean 40%)
  • Energy within +1% to -4%
  • EDP reduction from 10% to 90% (mean 38%)
  • 4TS vs. 4x2TS
  • Speedup 3% to 29%
  • Energy increase up to 14%
  • EDP reduction up to 20%
  • ⇒ S benefits lower due to higher resource
    contention

69
Benefits of ILP
  • IPC
  • 1.7 to 2.9 for base
  • 1.6 to 3.1 with SIMDsw/Vector
  • For narrower processors
  • Fetch/Decode/Retirement limited to 2
  • Issue width and functional units unchanged
  • IPC degradation from 12% to 55%
  • ILP important even with vector/SIMDsw support
  • SIMDsw loops also need ILP
  • Future work addresses this problem

70
Allocating/Freeing Vectors
[Figure: rename-stage structures for vector allocation: a logical-to-physical vector map (V0 → PV1, ...), vector length registers, a free vector descriptor stack, hole descriptors (start/end/length of free L1 runs), stream fetch logic (addr + stride), and cache scrubbing logic that un-reserves elements on free]
  • Vectors allocated only with VLD/VALLOC
  • Old physical vector (e.g., PV1) freed when the
    overwriting vld retires; can also be freed with
    VFREE
  • Steps for allocating (at REN)
  • Parallel search of the Len column for a hole
  • Record the start and length of the hole in the
    descriptor
  • Update hole descriptors
  • Pop a free vector descriptor and map to it
  • Allocation/freeing outside of the SIMDsw loop
  • Steps for freeing
  • Update hole pointers
  • Send an un-reserve message to L1
  • Push the freed vector descriptor
  • Nothing in the critical path
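The hole-based allocation above can be sketched as a first-fit allocator in Python; the hardware does a parallel search of the hole descriptors, modeled here as a linear scan, and `HoleAllocator` and its method names are invented for this sketch.

```python
# Sketch of hole allocation for vector descriptors over IVR element storage.

class HoleAllocator:
    def __init__(self, total_elements):
        self.holes = [(0, total_elements)]   # (start, length) of free runs

    def alloc(self, length):
        # First fit: scan holes for one large enough (hardware: parallel).
        for i, (start, hole_len) in enumerate(self.holes):
            if hole_len >= length:
                if hole_len == length:
                    self.holes.pop(i)
                else:
                    self.holes[i] = (start + length, hole_len - length)
                return start                 # becomes the descriptor's start
        return None                          # no space in the IVR

    def free(self, start, length):
        # Un-reserve the run (merging adjacent holes omitted for brevity).
        self.holes.append((start, length))

ivr = HoleAllocator(16)
v0 = ivr.alloc(8)        # run starting at 0
v1 = ivr.alloc(4)        # run starting at 8
ivr.free(v0, 8)          # e.g., VFREE, or the overwriting vld retires
v2 = ivr.alloc(8)        # reuses the freed run at 0
print(v0, v1, v2)        # 0 8 0
```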