Transcript and Presenter's Notes

Title: A Dataflow Approach to Design Low Power Control Paths in CGRAs


1
A Dataflow Approach to Design Low Power Control
Paths in CGRAs
  • Hyunchul Park, Yongjun Park, and Scott Mahlke
  • University of Michigan

2
Coarse-Grained Reconfigurable Architecture (CGRA)
  • Array of PEs connected in a mesh-like
    interconnect
  • High throughput with a large number of resources
  • Distributed hardware offers low cost/power
    consumption
  • High flexibility with dynamic reconfiguration

3
CGRA: An Attractive Alternative to ASICs
  • Suitable for running multimedia applications for
    future embedded systems
  • High throughput, low power consumption, high
    flexibility
  • Morphosys: 8x8 array with RISC processor
  • SiliconHive: hierarchical systolic array
  • ADRES: 4x4 array with tightly coupled VLIW

(Figures: Morphosys, SiliconHive, and ADRES arrays; e.g., Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW)
4
Control Power Explosion
(Figure: a single PE and its instruction word)
  • Large number of configuration signals
  • Distributed interconnect, many resources to control
  • Nearly 1,000 bits each cycle
  • No code compression technique developed for CGRAs
  • Fully decoded instructions are stored in memory
  • The control path consumes 45% of total power

5
Code Compression
  • Huffman encoding
  • High efficiency, but a sequential decoding process
  • Dictionary-based
  • Recurring patterns stored in a dictionary
  • Not many patterns found in CGRAs
  • Instruction-level code compression
  • No-op compression (Itanium, DSPs)
  • Only 17% of instructions are no-ops in CGRAs

6
Fine-grain Code Compression
  • Compress unused fields rather than the whole instruction (sketched below)
  • Opcode, MUX selection, register address
  • Only 35% of fields contain valid information
  • The instruction format must be stored in memory
  • i.e., information about which fields are present
  • Significant overhead: 172 bits (20%) for a 4x4 CGRA
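
A minimal sketch of the idea in Python; the field names, widths, and packing scheme are illustrative assumptions, not the paper's actual encoding:

```python
# Hypothetical field set for one PE instruction; widths are made up.
FIELDS = [("opcode", 4), ("mux_sel", 3), ("reg_addr", 5), ("const", 8)]

def compress(instr: dict) -> tuple[int, int, int]:
    """Pack only the fields the operation uses, plus a presence mask
    (the 'instruction format' that must also be stored)."""
    mask, packed, nbits = 0, 0, 0
    for i, (name, width) in enumerate(FIELDS):
        if name in instr:                      # unused fields are skipped
            mask |= 1 << i
            packed = (packed << width) | (instr[name] & ((1 << width) - 1))
            nbits += width
    return mask, packed, nbits

# An op that needs no constant: the 8-bit const field is simply absent.
mask, bits, n = compress({"opcode": 3, "mux_sel": 1, "reg_addr": 17})
print(f"format={mask:04b}, payload={n} of {sum(w for _, w in FIELDS)} bits")
```

The catch the slide points out: the presence mask itself must live in memory, and across a 4x4 CGRA those masks add up to 172 bits (20%) per cycle.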

7
Dynamic Instruction Format Discovery
(Figure: FU computes dest <- src0, src1; RF performs a register write)
  • Resources need configuration only when data flows through them
  • The instruction format can be discovered by looking at the data flow
  • The token network from dataflow machines can be utilized
  • A token is 1 bit of information indicating incoming data in the next cycle
  • Each PE observes incoming tokens and determines the instruction format (sketched below)
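
A toy model of the discovery step in Python; the port and field names are assumptions for illustration:

```python
def discover_format(tokens: dict[str, bool]) -> list[str]:
    """From the 1-bit tokens seen this cycle, decide which instruction
    fields this PE must fetch for next cycle's operation."""
    fields = []
    if any(tokens.values()):
        fields.append("opcode")              # data is coming: the FU will fire
    for port, arriving in tokens.items():
        if arriving:
            fields.append(f"mux_sel[{port}]")   # select each arriving input
    return fields

# A token on src0 only: fetch the opcode and one MUX selector, nothing else.
print(discover_format({"src0": True, "src1": False}))
```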

8
Dynamic Configuration of PEs
(Figure: dataflow graph, mapping, and resulting configuration)
  • Each cycle, tokens are sent to the consuming PEs
  • Consuming resources collect incoming tokens,
    discover instruction formats, and fetch only
    necessary instruction fields
  • Next cycle, resources can execute the scheduled
    operations

9
Token Generation
  • Tokens are generated at the beginning of dataflow: live-in nodes in RFs
  • Each RF read port needs token generation info: 26 read ports in a 4x4 CGRA
  • 26 bits for token generation vs. 172 bits for instruction format (worked out below)
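
Worked out: one generation bit per read port means 26 bits of per-cycle control metadata instead of the 172-bit explicit format, roughly 15% of the original cost (about a 6.6x reduction).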

10
Token Network
  • Token network between the datapath and the decoder
  • No instruction format in memory, only token generation info
  • Adds 1 cycle between the IF and EX stages
  • Created by cloning the datapath: a 1-bit interconnect with the same topology
  • Each resource is translated to a token-processing module
  • Encode dest fields, not src fields

11
Register File Token Module
(Figure: RF token module with token_gen memory, token receivers, and token senders)
  • Write port MUXes are converted to token receivers, which determine the selection bits
  • Read ports are converted to token senders; tokens are initially generated here
  • Token generation information is stored in a separate memory (sketched below)
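
A minimal sketch of the read-port side, assuming a packed word of per-port generation bits (the layout is my assumption):

```python
def rf_read_port_tokens(token_gen_word: int, num_ports: int) -> list[bool]:
    """One token_gen bit per RF read port, fetched from the separate
    token-generation memory; set bits inject tokens into the network."""
    return [bool((token_gen_word >> p) & 1) for p in range(num_ports)]

# token_gen word 0b0101 on a 4-port RF: ports 0 and 2 start dataflows.
print(rf_read_port_tokens(0b0101, 4))   # [True, False, True, False]
```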

12
FU Token Module
  • Input MUXes are converted to token receivers
  • Opcode processor
  • Fetches the opcode field only if necessary
  • Determines the token type (data/pred) and latency (sketched below)
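
A sketch of the opcode-processor behavior; the latency table and token types are illustrative assumptions:

```python
LATENCY = {"add": 1, "mul": 2, "cmp": 1}     # assumed per-opcode latencies

def fu_token_step(tokens_in: list[bool], fetch_opcode):
    """Fetch the opcode field only when input tokens announce data,
    then emit a typed token carrying the operation's latency."""
    if not any(tokens_in):
        return None                          # idle FU: no opcode fetch at all
    opcode = fetch_opcode()                  # the field is read only here
    kind = "pred" if opcode == "cmp" else "data"
    return {"type": kind, "latency": LATENCY[opcode]}

print(fu_token_step([True, False], lambda: "mul"))  # {'type': 'data', 'latency': 2}
```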

13
System Overview
(Figure: system overview, including the datapath)
14
Experimental Setup
  • Target multimedia applications for embedded
    systems
  • Modulo scheduling for compute intensive loops in
    3D graphics, AAC decoder, AVC decoder (214 loops)
  • Three different control path designs
  • baseline: fully decoded instructions
  • static: fine-grain code compression with the instruction format stored in memory
  • token: fine-grain code compression with the token network

15
Code Size / Performance
  • Fine-grain code compression increases code efficiency
  • The token network further improves code efficiency
  • Performance degradation: sharing of fields and allowing only 2 dests

16
Power / Area
  • SRAM read power is greatly reduced with the token network
  • Introducing the token network slightly increases power and area
  • The area overhead can be mitigated by the reduced SRAM area
  • Hardware overhead for the token network is minimal

17
Staging Predicates Optimization
  • Modulo scheduled loops
  • Prolog (filling the pipeline)
  • Kernel code (steady state)
  • Epilog (draining the pipeline)
  • Only the kernel code is stored in memory
  • Staging predicates control the prolog/epilog phases (sketched below)

(Figure: overlapped execution of iterations i0, i1, i2 with initiation interval II)
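A small Python model of how staging predicates gate stages during fill and drain; the stage and iteration counts are made up:

```python
STAGES, ITERATIONS = 3, 4        # assumed: 3 pipeline stages, 4 iterations

# The same kernel issues every II cycles; a per-stage predicate masks
# stages off while the pipeline fills (prolog) and drains (epilog).
for t in range(ITERATIONS + STAGES - 1):
    active = [0 <= t - s < ITERATIONS for s in range(STAGES)]
    print(f"cycle {t}: stage predicates = {active}")
# Early cycles disable late stages (prolog), late cycles disable early
# stages (epilog); in steady state all predicates are true (kernel).
```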
18
Migrating Staging Predicate
  • Staging predicate: control information, not data dependent
  • 10% of configurations are used for routing staging predicates
  • Move staging predicates into the control path
  • Increases the token by 1 bit: the staging predicate
  • Only top nodes are guarded
  • The staging predicate flows along with the tokens
  • Benefits: code size reduction and performance increase

(Figure: stages 0-3, showing data and staging predicate flow)
19
Code Size / Performance
  • Code size reduction of 9%
  • Migrating staging predicates improves performance by 7%
  • 5% increase over the baseline

20
Power / Area
  • Power/area of the token network increase due to the valid bit
  • The reduced code size decreases SRAM power/area
  • The overall overhead for migrating staging predicates is minimal

21
Overall Power
(Chart: overall power of 226.4 mW for the baseline vs. 170.0 mW with the token network)
  • System power measured for a kernel loop in AVC
  • Introducing the token network reduces overall system power by 25% while achieving a 5% performance gain

22
Conclusion
  • Fine-grain code compression is a good fit for CGRAs
  • The token network can eliminate the instruction format overhead
  • Dynamic discovery of the instruction format
  • Small overhead (< 3%)
  • Migrating staging predicates to the token network improves performance
  • Applicable to other highly distributed architectures

23
Questions?
24
Token Sender
  • Each output port of a resource is converted into a token sender
  • FU outputs, routing MUX outputs, register file read ports
  • Sends tokens only to the consumers specified in the dest fields
  • Allowing only two destinations per output potentially limits performance (sketched below)
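
A sketch of the sender's fan-out rule, assuming consumers are addressed by simple indices (my encoding, not the paper's):

```python
def send_tokens(dest_fields: list, num_consumers: int) -> list[bool]:
    """Raise a token line only for the consumers named in the dest
    fields; the hardware caps each output at two destinations."""
    out = [False] * num_consumers
    for d in dest_fields[:2]:            # at most two dests per output
        if d is not None:
            out[d] = True
    return out

print(send_tokens([3, 0], 6))   # tokens reach consumers 0 and 3 only
```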

25
Token Receiver
  • Input MUXes are converted to token receivers
  • Dest fields are stored in memory, not src fields
  • MUX selection bits are determined from the incoming token position (sketched below)
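
A sketch of recovering the selection bits from token position (my formulation):

```python
def mux_select(token_ports: list[bool]):
    """The MUX selector is not stored anywhere: it is the index of the
    input port whose token arrived, since producers encode dests only."""
    hits = [i for i, t in enumerate(token_ports) if t]
    return hits[0] if hits else None     # None: this MUX is idle this cycle

print(mux_select([False, True, False, False]))   # select input port 1
```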

26
Dynamic Instruction Format Discovery
(Identical to slide 7.)
27
Who Generates Tokens?
  • Tokens are generated at the start of dataflow: live-ins
  • Tokens terminate when they enter a register file
  • Tokens terminated in register files can be regenerated: the read ports of register files generate tokens
  • Token generation information for RF read ports is stored separately: 26 read ports in a 4x4 CGRA

28
Reducing Decoder Complexity
(Figure: configuration memory partitioned into multiple MEM blocks, each with its own decoder, fed by the token network)
  • Partitioning the configuration memory and decoder
  • Trade-off between the number of memories and decoder complexity
  • Design space exploration for memory partitioning: which fields are stored in the same memory?
  • Sharing of field entries in the memory for under-utilized fields

29
Memory Partitioning
  • Bundle fields with the same type: field width uniformity
  • Design space exploration result for a 4x4 CGRA (table below)
  • Sharing degree = total entries / total fields (worked example after the table)
  • Reduces decoder complexity by 33% over naïve partitioning
  • Sharing incurs less than 1% performance degradation

type       fields   memories   entries   total entries   sharing degree
opcode     16       2          8         16              1.0
dest       96       8          8         64              0.75
const      16       2          6         12              0.75
reg addr   48       4          6         24              0.5
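Worked example from the table: the const fields share 2 memories x 6 entries = 12 total entries across 16 fields, for a sharing degree of 12/16 = 0.75.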