Motivation

Ruchira Sasanka, Manlap Li, Sarita Adve (Univ. of Illinois) ... Exhibits different forms of DLP - sub-word, vectors, streams. Existing Vector/Stream Processors ... – PowerPoint PPT presentation

Slides: 2

Provided by: danie295

Transcript and Presenter's Notes

ALP: Energy Efficient Support for All Levels of Parallelism for Complex Media Applications
Ruchira Sasanka, Manlap Li, Sarita Adve (Univ. of Illinois), Yen-Kuang Chen, Eric Debes (Intel)
Motivation
Results

Challenges of Complex Media Apps
Real-time Performance
Energy Efficiency
Programmability

Nature of DLP in Complex Media Apps
DLP interspersed with control
Exhibits different forms of DLP
- sub-word, vectors, streams

Existing Vector/Stream Processors
Targeted for large amounts of DLP
Not ideal for code with control
Require new programming paradigms
Cost of new ISA, vector registers, and bandwidth
Forward/backward compatibility

Opportunities
Lots of parallelism (DLP/TLP/ILP)
Existing support on general-purpose processors
- ILP/TLP: CMP/SMT processors
- DLP: SIMD (e.g., MMX, AltiVec)
Already multi-core (and SIMD multi-lane)

[Figure: per-benchmark chart for MPGenc, MPGdec, RayTrace, SpeechRec, and FaceRec]

ALP (All Levels of Parallelism)
Based on CMP/SMT processors with SIMD
Uses Indexed Vectors (vectors of SIMD records)
Only a handful of new instructions
- only vector loads use vector instructions
Vector data stored in L1 cache
Supports both vectors and streams
Familiar SIMD programming and exception model

Indexed Vectors
Indexed Vector Registers (IVR) e.g., V0, V1
Each IVR has a Current Record Pointer (CRP)
An instruction can access only current record
CRPs auto-incremented on use
Computation using SIMD instructions/registers
CRPs allow scalar processing on vector data
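
The CRP behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the ALP hardware or ISA: the class `IndexedVectorRegister`, its `read` method, and the helper `simd_add` are hypothetical names, and each SIMD record is modeled as a Python tuple with one element per sub-word.

```python
class IndexedVectorRegister:
    """Toy model of an IVR: a sequence of SIMD records plus a CRP."""

    def __init__(self, records):
        self.records = [tuple(r) for r in records]
        self.crp = 0  # Current Record Pointer: index of the current record

    def read(self):
        """Return the current record, then auto-increment the CRP."""
        record = self.records[self.crp]
        self.crp += 1
        return record


def simd_add(a, b):
    """Hypothetical SIMD add: element-wise sum of two records."""
    return tuple(x + y for x, y in zip(a, b))


# Two IVRs with two records each; every record holds four sub-words.
v0 = IndexedVectorRegister([(1, 2, 3, 4), (5, 6, 7, 8)])
v1 = IndexedVectorRegister([(10, 20, 30, 40), (50, 60, 70, 80)])

# Each iteration touches only the current record of each IVR;
# the CRPs advance as a side effect of the reads.
sums = []
for _ in range(len(v0.records)):
    sums.append(simd_add(v0.read(), v1.read()))
```

Because only the current record is visible to an instruction, the loop body looks like ordinary SIMD code operating on one record at a time.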

[Figure: results by benchmark: MPGenc, MPGdec, RayTrace, SpeechRec, FaceRec]
1T, 4T, 4x2T: 1 thread, 4 threads (CMP), and 8 threads (CMP/SMT)
S: with SIMD; SV: with indexed vectors (ALP is 4x2T+SV)
ALP over 1T: energy savings 1.5X-15X, EDP savings 7.3X-873X, and speedups 5X-58X.
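
As a rough sanity check on how these three metrics relate: the energy-delay product (EDP) is energy multiplied by execution time, so an EDP improvement factor is the product of the energy savings and the speedup. The function name `edp_savings` below is hypothetical, and pairing the range endpoints is an assumption; the reported ranges span several benchmarks, so the endpoints need not come from the same application and the products only approximate the reported 7.3X and 873X.

```python
def edp_savings(energy_savings, speedup):
    """EDP = energy * delay, so the improvement factors multiply."""
    return energy_savings * speedup


# Assumption: pair the low endpoints together and the high endpoints together.
low = edp_savings(1.5, 5)    # compare with the reported 7.3X
high = edp_savings(15, 58)   # compare with the reported 873X
```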
[Figure: indexed vector register V0 (IVR) holding Record 0, Record 1, ...; each record is made of sub-words (e.g., Sub-word 3)]

Benefits Over SIMD
Reduced load/store and overhead instructions
Increased exposed parallelism
Load latency tolerance and efficient use of L1
Energy efficient IVR access (cf. cache accesses)

[Figure: the IVR extends to Record N; each record is built from packed words (Packed Word 0, Packed Word 1) that are contiguous in memory; the CRP for V0 currently points to Record 1]
Programming Example: V2 = k * (V0 + V1) - 16

Conventional Vector Code
(A) VLD addr:stride:length -> V0
(B) VLD addr:stride:length -> V1
(C) VADD V0, V1 -> V3
(D) VMUL V3, reg1 -> V4
(E) VSUB V4, 16 -> V2
(F) VSTORE addr:stride:length V2

Indexed Vector Code
(1) VLD addr:stride:length -> V0
(2) VLD addr:stride:length -> V1
(3) VALLOCst addr:stride:length -> V2
do for all records in vector
(4) simd_add V0, V1 -> simd_r0
(5) simd_mul simd_r0, simd_r1 -> simd_r2
(6) simd_sub simd_r2, 16 -> V2
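
The two code styles can be compared with a small functional model. This is a sketch under stated assumptions rather than a simulation of either ISA: the function names are hypothetical, records are modeled as Python lists, the value `K = 3` is arbitrary, and the instruction counts cover only the arithmetic instructions (loads, stores, allocation, and loop overhead are ignored).

```python
K = 3  # hypothetical value of the scalar k


def conventional_vector(a, b, k):
    """Each vector instruction processes every record of its operands."""
    v3 = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]  # VADD
    v4 = [[k * x for x in r] for r in v3]                           # VMUL
    v2 = [[x - 16 for x in r] for r in v4]                          # VSUB
    return v2, 3  # three arithmetic instructions in total


def indexed_vector(a, b, k):
    """One SIMD instruction per operation per record, inside a loop."""
    v2, dynamic_instructions = [], 0
    for ra, rb in zip(a, b):                   # do for all records in vector
        r0 = [x + y for x, y in zip(ra, rb)]   # simd_add
        r2 = [k * x for x in r0]               # simd_mul
        v2.append([x - 16 for x in r2])        # simd_sub into current record
        dynamic_instructions += 3
    return v2, dynamic_instructions


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[4, 3, 2, 1], [8, 7, 6, 5]]
vec_result, vec_count = conventional_vector(a, b, K)
ivr_result, ivr_count = indexed_vector(a, b, K)
```

Both versions compute the same records; the indexed vector version executes three SIMD instructions per record rather than three vector instructions overall, which is the "more dynamic instructions" cost the poster lists for this style.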

Benefits/Drawbacks Over Vectors
Few new instructions
Easily handles control-intensive code without masks
Supports streams and while loops
Flexible scheduling and scalar exception model
Can be scaled back (e.g., for legacy support)
Drawback: more dynamic instructions
