Title: System-on-Chip C (SoC-C): Efficient Programming Abstractions for Heterogeneous Multicore Systems on Chip
1 System-on-Chip C (SoC-C): Efficient Programming Abstractions for Heterogeneous Multicore Systems on Chip
- Alastair Reid ARM Ltd
- Yuan Lin University of Michigan
- Krisztian Flautner ARM Ltd
- Edmund Grimley-Evans ARM Ltd
2 Mobile Consumer Electronics Trends
- Mobile Application Requirements Still Growing Rapidly
- Still cameras: 2 Mpixel → 10 Mpixel
- Video cameras: VGA → HD 1080p
- Video players: MPEG-2 → H.264
- 2D Graphics: QVGA → HVGA → VGA → FWVGA
- 3D Gaming: 30 Mtriangles/s, antialiasing, …
- Bandwidth: HSDPA (14.4 Mbps) → WiMax (70 Mbps) → LTE (326 Mbps)
- Feature Convergence
- Phone
- graphics, UI, games
- still camera, video camera
- music
- WiFi, Bluetooth, 3.5G, 3.9G, WiMax, GPS
3 Trends in Mobile Processing Requirements
- Video Encode
- Mobile Phone
[charts: encoding requirements for 30 frames/s VGA video; 3GPP receiver requirements for different channel types]
Source: O. Silven and K. Jyrkkä, "Observations on Power-Efficiency Trends in Mobile Communication Devices", EURASIP Journal on Embedded Systems, 2007
4 Pocket Supercomputers
- The challenge is not processing power
- The challenge is energy efficiency
5 Different Requirements
- Desktop/Laptop/Server
- 1-10 Gop/s
- 10-100 W
- Mobile Electronics
- 10-100 Gop/s
- 100 mW-1 W
10x performance at 1/100 the power consumption
→ 1000x energy efficiency
6 Energy Efficient Systems are Lumpy
- Drop Frequency 10x
- Desktop: 2-4 GHz
- Mobile: 200-400 MHz
- Increase Parallelism 100x
- Desktop: 1-2 cores
- Mobile: 32-way SIMD instruction set, 4-8 cores
- Match Processor Type to Task
- Desktop: homogeneous, general purpose
- Mobile: heterogeneous, specialised
- Keep Memory Local
- Desktop: coherent, shared memory
- Mobile: processor-memory clusters linked by DMA
7 Example Architecture
Artist's impression
SIMD Instruction Set Data Engines
Control Processor
Accelerators
Distributed Memories
8 How do we program AMP systems?
- C doesn't provide language features to support:
- Multiple processors (or multi-ISA systems)
- Distributed memory
- Multiple threads
9 Use Indirection (Strawman 1)
- Add a layer of indirection
- Operating System
- Layer of middleware
- Device drivers
- Hardware support
- All impose a cost in Power/Performance/Area
10 Raise Pain Threshold (Strawman 2)
- Write efficient code at very low level of
abstraction
- Problems
- Hard, slow and expensive to write, test, debug
and maintain
- Design intent drowns in sea of low level detail
- Not portable across different architectures
- Expensive to try different points in design
space
11 Our Response
- Extend C
- Support Asymmetric Multiprocessors
- SoC-C language raises the level of abstraction
- but takes care not to hide expensive operations
12 Overview
- Pocket-Sized Supercomputers
- Energy efficient hardware is lumpy
- and unsupported by C
- but supported by SoC-C
- SoC-C Extensions by Example
- Pipeline Parallelism
- Code Placement
- Data Placement
- SoC-C Compilation
- Conclusion
13 Three Steps in Mapping an Application
- Decide how to parallelize
- Choose processors for each pipeline stage
- Resolve distributed memory issues
14 A Simple Program
- int x[100];
- int y[100];
- int z[100];
- while (1) {
-   get(x);
-   foo(y,x);
-   bar(z,y);
-   baz(z);
-   put(z);
- }
15 Step 1: Decide how to parallelize
- int x[100];
- int y[100];
- int z[100];
- while (1) {
-   get(x);
-   foo(y,x);
-   bar(z,y);
-   baz(z);
-   put(z);
- }
50% of work
50% of work
16 Step 1: Decide how to parallelize
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x);
-     FIFO(y);
-     bar(z,y);
-     baz(z);
-     put(z);
-   }
- }
PIPELINE indicates the region to parallelize
FIFO indicates boundaries between pipeline stages
17 SoC-C Feature 1: Pipeline Parallelism
- Annotations express coarse-grained pipeline
parallelism
- PIPELINE indicates scope of parallelism
- FIFO indicates boundaries between pipeline
stages
- Compiler splits into threads communicating
through FIFOs
18 Step 2: Choose Processors
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x);
-     FIFO(y);
-     bar(z,y);
-     baz(z);
-     put(z);
-   }
- }
19 Step 2: Choose Processors
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x) @ P0;
-     FIFO(y);
-     bar(z,y) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
@ P indicates the processor that executes the function
20 SoC-C Feature 2: RPC Annotations
- Annotations express where code is to execute
- Behaves like a synchronous Remote Procedure Call
- Does not change the meaning of the program
- Bulk data is not implicitly copied to the processor's local memory
21 Step 3: Resolve Memory Issues
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x) @ P0;
-     FIFO(y);
-     bar(z,y) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
P0 uses x → x must be in M0
P1 uses z → z must be in M1
P0 uses y → y must be in M0
P1 uses y → y must be in M1
Conflict?!
22 Hardware Cache Coherency
[diagram: caches of P0 and P1 exchanging "copy x" and "invalidate x" messages as each processor reads and writes x]
23 Step 3: Resolve Memory Issues
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x) @ P0;
-     FIFO(y);
-     bar(z,y) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
Two versions: y@M0 and y@M1
write y@M0 → y@M1 becomes invalid
read y@M1 → coherence error
24 Step 3: Resolve Memory Issues
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x) @ P0;
-     SYNC(y) @ DMA;
-     FIFO(y);
-     bar(z,y) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
SYNC(v) @ P copies data from one version of v to another using processor P
→ y@M1 and y@M0 are both valid
read y@M1 succeeds
25 SoC-C Feature 3: Compile-Time Coherence
- Variables can have multiple coherent versions
- Compiler uses memory topology to determine which
version is being accessed
- Compiler applies cache coherency protocol
- Writing to a version makes it valid and other
versions invalid
- Dataflow analysis propagates validity
- Reading from an invalid version is an error
- SYNC(x) copies from valid version to invalid
version
26 What SoC-C Provides
- SoC-C language features
- PIPELINE to support parallelism
- compile-time coherence to support distributed memory
- RPC to support multiple processors/ISAs
- Non-features
- Does not choose boundaries between pipeline stages
- Does not resolve coherence problems
- Does not allocate processors
- SoC-C is a concise notation to express mapping decisions (not a tool for making them on your behalf)
27 Compiling SoC-C
- Data Placement
- Infer data placement
- Propagate coherence
- Split variables with multiple placements
- Pipeline Parallelism
- Identify maximal threads
- Split into multiple threads
- Apply zero-copy optimization
- RPC (see paper for details)
28 Step 1a: Infer Data Placement
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x) @ P0;
-     SYNC(y) @ DMA;
-     FIFO(y);
-     bar(z,y) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
29 Step 1a: Infer Data Placement
- int x[100];
- int y[100];
- int z[100];
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y,x) @ P0;
-     SYNC(y) @ DMA;
-     FIFO(y);
-     bar(z,y) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
- Solve Set of Constraints
- Memory topology constrains where variables could live
30 Step 1a: Infer Data Placement
- Solve Set of Constraints
- Memory topology constrains where variables could live
- int x[100] @ M0;
- int y[100] @ {M0,M1};
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x@?);
-     foo(y@M0, x@M0) @ P0;
-     SYNC(y,?,?) @ DMA;
-     FIFO(y@?);
-     bar(z@M1, y@M1) @ P1;
-     baz(z@M1) @ P1;
-     put(z@?);
-   }
- }
31 Step 1b: Propagate Coherence
- Solve Set of Constraints
- Memory topology constrains where variables could live
- Forwards dataflow propagates availability of valid versions
- int x[100] @ M0;
- int y[100] @ {M0,M1};
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x@?);
-     foo(y@M0, x@M0) @ P0;
-     SYNC(y,?,?) @ DMA;
-     FIFO(y@?);
-     bar(z@M1, y@M1) @ P1;
-     baz(z@M1) @ P1;
-     put(z@?);
-   }
- }
32 Step 1b: Propagate Coherence
- Solve Set of Constraints
- Memory topology constrains where variables could live
- Forwards dataflow propagates availability of valid versions
- int x[100] @ M0;
- int y[100] @ {M0,M1};
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x@?);
-     foo(y@M0, x@M0) @ P0;
-     SYNC(y,?,M0) @ DMA;
-     FIFO(y@?);
-     bar(z@M1, y@M1) @ P1;
-     baz(z@M1) @ P1;
-     put(z@M1);
-   }
- }
33 Step 1b: Propagate Coherence
- Solve Set of Constraints
- Memory topology constrains where variables could live
- Forwards dataflow propagates availability of valid versions
- Backwards dataflow propagates need for valid versions
- int x[100] @ M0;
- int y[100] @ {M0,M1};
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x@?);
-     foo(y@M0, x@M0) @ P0;
-     SYNC(y,?,M0) @ DMA;
-     FIFO(y@?);
-     bar(z@M1, y@M1) @ P1;
-     baz(z@M1) @ P1;
-     put(z@M1);
-   }
- }
34 Step 1b: Propagate Coherence
- Solve Set of Constraints
- Memory topology constrains where variables could live
- Forwards dataflow propagates availability of valid versions
- Backwards dataflow propagates need for valid versions
- int x[100] @ M0;
- int y[100] @ {M0,M1};
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x@M0);
-     foo(y@M0, x@M0) @ P0;
-     SYNC(y,M1,M0) @ DMA;
-     FIFO(y@M1);
-     bar(z@M1, y@M1) @ P1;
-     baz(z@M1) @ P1;
-     put(z@M1);
-   }
- }
35 Step 1c: Split Variables
- int x[100] @ M0;
- int y0[100] @ M0;
- int y1[100] @ M1;
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y0, x) @ P0;
-     memcpy(y1, y0, …) @ DMA;
-     FIFO(y1);
-     bar(z, y1) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
- Split variables with multiple locations
- Replace SYNC with memcpy
36 Step 2: Implement Pipeline Annotation
- int x[100] @ M0;
- int y0[100] @ M0;
- int y1[100] @ M1;
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y0, x) @ P0;
-     memcpy(y1, y0, …) @ DMA;
-     FIFO(y1);
-     bar(z, y1) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
37 Step 2a: Identify Dependent Operations
- int x[100] @ M0;
- int y0[100] @ M0;
- int y1[100] @ M1;
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y0, x) @ P0;
-     memcpy(y1, y0, …) @ DMA;
-     FIFO(y1);
-     bar(z, y1) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
- Dependency Analysis
- Split use-def chains at FIFOs
38 Step 2b: Identify Maximal Threads
- int x[100] @ M0;
- int y0[100] @ M0;
- int y1[100] @ M1;
- int z[100] @ M1;
- PIPELINE {
-   while (1) {
-     get(x);
-     foo(y0, x) @ P0;
-     memcpy(y1, y0, …) @ DMA;
-     FIFO(y1);
-     bar(z, y1) @ P1;
-     baz(z) @ P1;
-     put(z);
-   }
- }
- Dependency Analysis
- Split use-def chains at FIFOs
- Identify thread operations
39 Step 2b: Split Into Multiple Threads
- int x[100] @ M0;
- int y0[100] @ M0;
- int y1a[100] @ M1;
- int y1b[100] @ M1;
- int z[100] @ M1;
- PARALLEL {
-   SECTION {
-     while (1) {
-       get(x);
-       foo(y0, x) @ P0;
-       memcpy(y1a, y0, …) @ DMA;
-       fifo_put(f, y1a);
-     }
-   }
-   SECTION {
-     while (1) {
-       fifo_get(f, y1b);
-       bar(z, y1b) @ P1;
-       baz(z) @ P1;
-       put(z);
-     }
-   }
- }
- Perform Dataflow Analysis
- Split use-def chains at FIFOs
- Identify thread operations
- Split into threads
40 Step 2c: Zero-Copy Optimization
- int x[100] @ M0;
- int y0[100] @ M0;
- int y1a[100] @ M1;
- int y1b[100] @ M1;
- int z[100] @ M1;
- PARALLEL {
-   SECTION {
-     while (1) {
-       get(x);
-       foo(y0, x) @ P0;
-       memcpy(y1a, y0, …) @ DMA;
-       fifo_put(f, y1a);
-     }
-   }
-   SECTION {
-     while (1) {
-       fifo_get(f, y1b);
-       bar(z, y1b) @ P1;
-       baz(z) @ P1;
-       put(z);
-     }
-   }
- }
- Generate data
- Copy into FIFO
- Copy out of FIFO
- Consume data
41 Step 2c: Zero-Copy Optimization
- int x[100] @ M0;
- int y0[100] @ M0;
- int y1a[100] @ M1;
- int y1b[100] @ M1;
- int z[100] @ M1;
- PARALLEL {
-   SECTION {
-     while (1) {
-       get(x);
-       foo(y0, x) @ P0;
-       memcpy(y1a, y0, …) @ DMA;
-       fifo_put(f, y1a);
-     }
-   }
-   SECTION {
-     while (1) {
-       fifo_get(f, y1b);
-       bar(z, y1b) @ P1;
-       baz(z) @ P1;
-       put(z);
-     }
-   }
- }
- Calculate live ranges of variables passed through FIFOs
- Live range of y1a
- Live range of y1b
42 Step 2c: Zero-Copy Optimization
- int x[100] @ M0;
- int y0[100] @ M0;
- int *py1a;
- int *py1b;
- int z[100] @ M1;
- PARALLEL {
-   SECTION {
-     while (1) {
-       get(x);
-       foo(y0, x) @ P0;
-       fifo_acquireRoom(f, &py1a);
-       memcpy(py1a, y0, …) @ DMA;
-       fifo_releaseData(f, py1a);
-     }
-   }
-   SECTION {
-     while (1) {
-       fifo_acquireData(f, &py1b);
-       bar(z, py1b) @ P1;
-       …
-     }
-   }
- }
- Calculate live ranges of variables passed through FIFOs
- Transform FIFO operations to pass pointers instead of copying data
- Acquire empty buffer
- Generate data directly into buffer
- Pass full buffer to thread 2
- Acquire full buffer from thread 1
- Consume data directly from buffer
- Release empty buffer
43 Order of Transformations
- Dataflow-sensitive transformations go first
- Inferring data placement
- Coherence checking within threads
- Dependency analysis for parallelism
- Parallelism transformations go in the middle
- Splitting into threads obscures data and control flow
- Thread-local optimizations go last
- Zero-copy optimization of FIFO operations
- Continuation-passing thread implementation
44 Related Work
- Language
- OpenMP: SMP data parallelism using C plus annotations
- StreamIt: pipeline parallelism using a dataflow language
- Pipeline parallelism
- J.E. Smith, "Decoupled Access/Execute Computer Architectures", Trans. Computer Systems, 2(4), 1984
- Multiple independent reinventions
- Hardware
- Woh et al., "From SODA to Scotch: The Evolution of a Wireless Baseband Processor", Proc. MICRO-41, Nov. 2008
- (not cited by paper)
45 The SoC-C Model
- Program as if using an SMP system
- Single multithreaded processor: RPCs provide a migrating-thread model
- Single memory: compiler-managed coherence handles the bookkeeping
- Annotations change execution, not semantics
- Avoid the need to restructure code
- Pipeline parallelism
- Compiler-managed coherence
- Efficiency
- Avoid abstracting expensive operations
- → programmer can optimize and reason about them
46 Fin
47 Language Design Meta-Issues
- Compiler only uses simple analyses
- Easier to maintain consistency between different compiler versions/implementations
- Programmer makes the high-level decisions
- Code and data placement
- Inserting SYNC
- Load balancing
- Implementation by many source-to-source transforms
- Programmer can mix high- and low-level features
- 90/10 rule: use high-level features when you can, low-level features when you need to
48 Step 3a: Resolve Overloaded RPC
- int x[100] @ M0;
- int y0[100] @ M0;
- int *py1a;
- int *py1b;
- int z[100] @ M1;
- PARALLEL {
-   SECTION {
-     while (1) {
-       get(x);
-       DE32_foo(0, y0, x);
-       fifo_acquireRoom(f, &py1a);
-       DMA_memcpy(py1a, y0, …);
-       fifo_releaseData(f, py1a);
-     }
-   }
-   SECTION {
-     while (1) {
-       fifo_acquireData(f, &py1b);
-       DE32_bar(1, z, py1b);
-       …
-     }
-   }
- }
- Replace RPC by architecture-specific call
- bar(…) @ P1 → DE32_bar(1, …)
49 Step 3b: Split RPCs
- int x[100] @ M0;
- int y0[100] @ M0;
- int *py1a;
- int *py1b;
- int z[100] @ M1;
- PARALLEL {
-   SECTION {
-     while (1) {
-       get(x);
-       start_DE32_foo(0, y0, x);
-       wait(semaphore_DE32[0]);
-       fifo_acquireRoom(f, &py1a);
-       start_DMA_memcpy(py1a, y0, …);
-       wait(semaphore_DMA);
-       fifo_releaseData(f, py1a);
-     }
-   }
-   SECTION {
-     while (1) {
-       …
- RPCs have two phases
- start the RPC
- wait for the RPC to complete
- DE32_foo(0, …)
- →
- start_DE32_foo(0, …);
- wait(semaphore_DE32[0]);
50 Two Ways to Exploit Parallelism
- Perform twice as much work
- 2 cores can perform 2x more work
- Perform the same work for less energy
- DVFS (reduce current frequency)
- halving frequency and doubling cores saves 50% energy/op
- Shorter pipeline (reduce peak frequency)
- halving frequency and doubling cores saves 30% energy/op
- Techniques can be combined to give a wider range of scaling
- Energy savings require performance almost linear with cores
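The 50% figure follows from the usual dynamic-power model; a sketch, assuming the supply voltage can drop to about 0.7 of nominal when frequency is halved (a typical DVFS operating point, not a number from the slides):

```latex
P_{\text{dyn}} \propto C V^2 f
\qquad\Rightarrow\qquad
E_{\text{op}} \;=\; \frac{P_{\text{dyn}}}{f} \;\propto\; C V^2
```

Halving $f$ and running at $V \approx 0.7\,V_0$ scales energy per operation by $0.7^2 \approx 0.49$, i.e. roughly the quoted 50% saving, while doubling the cores at $f/2$ restores the original throughput. This is why near-linear speedup matters: any parallel inefficiency eats directly into the saving.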
51 Parallel Speedup
- Efficient
- Same performance as hand-written code
- Near-Linear Speedup
- Very efficient use of parallel hardware
- DVB-T Inner Receiver
- More realistic OFDM receiver
- 20 tasks, 500-7000 cycles per function, 29,000 cycles total