Compiler-directed Data Partitioning for Multicluster Processors
Transcript and Presenter's Notes

1
Compiler-directed Data Partitioning for
Multicluster Processors
  • Michael Chu and Scott Mahlke
  • Advanced Computer Architecture Lab
  • University of Michigan
  • March 28, 2006

2
Multicluster Architectures
  • Addresses the register file bottleneck
  • Decentralizes architecture
  • Compilation focuses on partitioning operations
  • Most previous work assumes a unified memory

[Diagram: clustered datapath with per-cluster register files and data memory]
3
Problem: Partitioning of Data
int x[100]
struct foo
int y[100]
  • Determine object placement into data memories
  • Limited by
  • Memory sizes/capacities
  • Computation operations related to data
  • Partitioning relevant to caches and scratchpad
    memories

4
Architectural Model
  • This work focuses on use of scratchpad-like
    static local memories
  • Each cluster has one local memory
  • Each object placed in one specific memory
  • Data object available in the memory throughout
    the lifetime of the program

5
Data Unaware Partitioning
Loses an average of 30% performance by ignoring data
6
Our Objective
  • Goal: Produce efficient code
  • Strategy
  • Partition both data objects and computation
    operations
  • Balance memory size across clusters
  • Improve memory bandwidth
  • Maximize parallelism

int y[100]
int x[100]
struct foo
7
First Try: Greedy Approach
                 Data Partition                       Computation Partition
Data Unaware     None, profile-based placement       Region-view
Greedy           Region-view, greedy, profile-based  Region-view
  • Computation-centric partition of data
  • Place data where computation references it most
    often
  • Greedy approach (see the sketch below)
  • Pass 1: Region-view computation partition with greedy
    data cluster assignment
  • Pass 2: Region-view computation repartition with
    full knowledge of data location
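
A minimal sketch in C of the greedy placement idea, using assumed inputs
(hypothetical per-cluster reference counts from the Pass 1 profile and
per-cluster local-memory capacities); it is illustrative, not the actual
compiler code. Each object is placed in the local memory of the cluster
whose operations reference it most often, subject to capacity.

#include <stddef.h>

#define NUM_CLUSTERS 2

/* Hypothetical per-object summary gathered after the Pass 1
   region-view computation partition.                          */
typedef struct {
    long   refs[NUM_CLUSTERS];  /* profiled references per cluster */
    size_t size;                /* object size in bytes            */
    int    cluster;             /* assigned local memory (output)  */
} obj_stats_t;

/* Greedy, computation-centric placement: pick the cluster with the
   most references whose local memory still has room.              */
static void greedy_place(obj_stats_t *objs, int n,
                         const size_t capacity[NUM_CLUSTERS])
{
    size_t used[NUM_CLUSTERS] = {0};

    for (int i = 0; i < n; i++) {
        int best = -1;
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (used[c] + objs[i].size > capacity[c])
                continue;                   /* respect memory size */
            if (best < 0 || objs[i].refs[c] > objs[i].refs[best])
                best = c;
        }
        objs[i].cluster = best;             /* -1 if nothing fits  */
        if (best >= 0)
            used[best] += objs[i].size;
    }
}

Pass 2 then repartitions the computation with these placements treated as fixed.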

8
Greedy Approach Results
  • 2 Clusters
  • One Integer, Float, Memory, Branch unit per
    cluster
  • Relative to a unified, dual-ported memory
  • Improvement over Data Unaware, still room for
    improvement

9
Second Try: Global Data Partition
  • Data-centric partition of computation
  • Hierarchical technique
  • Pass 1: Global-view for data
  • Consider memory relationships throughout program
  • Locks memory operations to clusters
  • Pass 2: Region-view for computation
  • Partition computation based on data location

10
Pass 1: Global Data Partitioning
  • Determine memory relationships
  • Pointer analysis and memory profiling
  • Build program-level graph representation of all
    operations
  • Perform data object and memory operation merging
  • Respect correctness constraints of the program

11
Global Data Graph Representation
  • Nodes: operations, either memory or non-memory (see
    the sketch below)
  • Memory operations: loads, stores, malloc call sites
  • Edges: data flow between operations
  • Node weight: data object size
  • Sum of data sizes for referenced objects
  • Object size determined by:
  • Globals/locals: pointer analysis
  • Malloc call sites: memory profile

int x[100]
struct foo
malloc site 1
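
A rough sketch in C of how such a program-level graph could be represented;
the type and field names are illustrative assumptions, not the actual
compiler data structures.

#include <stddef.h>

typedef enum { OP_LOAD, OP_STORE, OP_MALLOC_SITE, OP_OTHER } op_kind_t;

/* One data object: a global/local variable or a malloc call site.    */
typedef struct {
    const char *name;    /* e.g. "x", "foo", "malloc site 1"           */
    size_t      size;    /* globals/locals: from pointer analysis;
                            malloc call sites: from the memory profile */
} data_object_t;

/* Graph node: one operation.  Memory operations carry the objects
   they may reference; the node weight is the sum of those sizes.     */
typedef struct op_node {
    op_kind_t        kind;
    data_object_t  **objects;   /* referenced objects (memory ops)     */
    int              n_objects;
    size_t           weight;    /* sum of referenced object sizes      */
    struct op_node **succs;     /* data-flow edges to consumers        */
    int              n_succs;
} op_node_t;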
12
Global Data Partitioning Example
[Example figure: BB1 references 2 objects (80 Kb); BB2 references
2 objects (200 Kb); 1 object referenced (100 Kb)]
13
Pass 2 Computation Partitioning
  • Observation: Global-level data partition is only
    half the answer
  • Doesn't account for operation resource usage
  • Doesn't consider code scheduling regions
  • Second pass of partitioning on each scheduling
    region
  • Memory operations from first phase locked in
    place

14
Experimental Methodology
  • Compared against the configurations in the table below
  • 2 Clusters
  • One Integer, Float, Memory, Branch unit per
    cluster
  • All results relative to a unified, dual-ported
    memory

                 Data Partitioning                          Computation Partition
Global           Global-view, data-centric                  Knows data location
Greedy           Region-view, greedy, computation-centric   Knows data location
Data Unaware     None, assume unified memory                Assumes unified memory
Unified Memory   N/A                                        Unified memory
15
Performance: 1-cycle Remote Access
[Chart: performance relative to the unified-memory baseline]
16
Performance: 10-cycle Remote Access
[Chart: performance relative to the unified-memory baseline]
17
Case Study: rawcaudio
[Figure: schedules under the Greedy Profile-based and Global Data
Partition approaches]
18
Summary
  • Global Data Partitioning
  • Data placement is a first-order design principle
  • Global data-centric partition of computation
  • Phase-ordered approach
  • Global-view for decisions on data
  • Region-view for decisions on computation
  • Achieves 96% of the performance of a unified memory on
    partitioned memories
  • Future work: apply to cache memories

19
Data Partitioning for Multicores
  • Adapt global data partitioning for cache memory
    domain
  • Similar goals
  • Increase data bandwidth
  • Maximize parallel computation
  • Different goals
  • Reducing coherence traffic
  • Keep working set within the cache size

20
Questions?
  • http://cccp.eecs.umich.edu

21
Backup
22
Future Work Cache Memories
  • Adapt global data partitioning for cache memory
    domain
  • Similar goals
  • Increase data bandwidth
  • Maximize parallel computation
  • Different goals
  • Reducing coherence traffic
  • Balancing working set

23
Memory Operation Merging
  • Interprocedural pointer analysis determines
    memory relationships

int x;  int foo[100];  int bar[100];

void main() {
  int *a = malloc(/* size elided on slide */);  int *b;  int *c;
  if (cond) { c = &foo[1]; b = a;       }
  else      { c = &bar[1]; b = &bar[1]; }
  *b = 100;
  foo[0] = *c;
}
malloc
load bar
load foo
store malloc or bar
store foo
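
One way the data object / memory operation merging could be realized is
sketched below as a union-find over the memory operations; the may_ref
table standing in for the pointer-analysis results is an assumption for
illustration, not the actual implementation.

/* Find with path halving. */
static int find(int *parent, int i)
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Merge memory operations whose may-reference sets share a data
   object, so they (and the object) are placed in one local memory.
   may_ref[op * n_objs + obj] is nonzero when operation `op` may
   touch object `obj`.                                              */
static void merge_aliasing_ops(int n_ops, int n_objs,
                               const unsigned char *may_ref, int *parent)
{
    for (int op = 0; op < n_ops; op++)
        parent[op] = op;

    for (int obj = 0; obj < n_objs; obj++) {
        int first = -1;
        for (int op = 0; op < n_ops; op++) {
            if (!may_ref[op * n_objs + obj])
                continue;
            if (first < 0)
                first = op;
            else
                parent[find(parent, op)] = find(parent, first);
        }
    }
}

In the example above, the store through b may reference either the
malloc'd object or bar, so it would end up grouped with the other
operations on those objects.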
24
Multicluster Compilation
  • Previous techniques focused on operation
    partitioning [cite some papers]
  • Ignores the issue of data object placement in
    memory
  • Assumes shared memory accessible from each cluster

25
Phase 2: Computation Partitioning
  • Observation: Global-level data partition is only
    half the solution
  • Doesn't properly account for resource usage
    details
  • Doesn't consider code scheduling regions
  • Second pass of partitioning is done locally on
    each basic block of the program
  • Memory operations locked into specific clusters
  • Uses Region-based Hierarchical Operation
    Partitioner (RHOP)

26
Computation Partitioning Example
  • Memory operations from first phase locked in
    place
  • RHOP performs a detailed resource-cognizant
    computation partition
  • Modified multi-level Kernighan-Lin algorithm using
    schedule estimates (simplified sketch below)
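
A much-simplified, hill-climbing stand-in (not the actual multi-level
Kernighan-Lin implementation) showing how locked memory operations
constrain the computation partition; the gain callback stands in for
RHOP's schedule estimates.

typedef struct {
    int cluster;   /* 0 or 1                                        */
    int locked;    /* nonzero if fixed by Pass 1 data partitioning  */
} op_t;

/* One refinement sweep over a scheduling region.  `gain` estimates
   how much moving operation i to the other cluster would shorten
   the schedule; locked memory operations never move, so the global
   data partition is preserved.                                     */
static void refine_region(op_t *ops, int n,
                          double (*gain)(const op_t *ops, int n, int i))
{
    int moved;
    do {
        moved = 0;
        for (int i = 0; i < n; i++) {
            if (ops[i].locked)
                continue;
            if (gain(ops, n, i) > 0.0) {
                ops[i].cluster ^= 1;
                moved = 1;
            }
        }
    } while (moved);
}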

[Figure: dataflow graph for BB1 with load (L) and store (S)
operations locked to clusters]
