Transcript and Presenter's Notes

Title: Polar Opposites: Next Generation Languages


1
Polar Opposites: Next Generation Languages and Architectures
  • Kathryn S McKinley
  • The University of Texas at Austin

2
Collaborators
  • Faculty
  • Steve Blackburn, Doug Burger, Perry Cheng, Steve
    Keckler, Eliot Moss
  • Graduate Students
  • Xianglong Huang, Sundeep Kushwaha, Aaron Smith,
    Zhenlin Wang (MTU)
  • Research Staff
  • Jim Burrill, Sam Guyer, Bill Yoder

3
Computing in the Twenty-First Century
  • New and changing architectures
  • Hitting the microprocessor wall
  • TRIPS - an architecture for future technology
  • Object-oriented languages
  • Java and C# becoming mainstream
  • Key challenges and approaches
  • Memory gap, parallelism
  • Language runtime implementation efficiency
  • Orchestrating a new software/hardware dance
  • Break down artificial system boundaries

4
Technology Scaling Hitting the Wall
[Figure: "Analytically" and "Qualitatively" panels showing how little of a 20 mm chip edge remains reachable as feature size shrinks from 130 nm through 100 nm and 70 nm to 35 nm.]
Either way: partitioning for on-chip communication is key.
5
End of the Road for Out-of-Order SuperScalars
  • Clock ride is over
  • Wire and pipeline limits
  • Quadratic out-of-order issue logic
  • Power, a first-order constraint
  • Major vendors ending processor lines
  • Problems for any architectural solution
  • ILP - instruction level parallelism
  • Memory latency

6
Where are Programming Languages?
  • High Productivity Languages
  • Java, C#, Matlab, S, Python, Perl
  • High Performance Languages
  • C/C++, Fortran
  • Why not both in one?
  • Interpretation/JIT vs compilation
  • Language representation
  • Pointers, arrays, frequent method calls, etc.
  • Automatic memory management costs
  • Obscure ILP and memory behavior

7
Outline
  • TRIPS
  • Next generation tiled EDGE architecture
  • ILP compilation model
  • Memory system performance
  • Garbage collection influence
  • The GC advantage
  • Locality, locality, locality
  • Online adaptive copying
  • Cooperative software/hardware caching

8
TRIPS
  • Project Goals
  • Fast clock and high ILP in future technologies
  • Architecture sustains 1 TRIPS in 35 nm technology
  • Cost-performance scalability
  • Find the right hardware/software balance
  • New balance reduces hardware complexity and power
  • New compiler responsibilities and challenges
  • Hardware/Software Prototype
  • Proof-of-concept of scalability and
    configurability
  • Technology transfer

9
TRIPS Prototype Architecture
10
Execution Substrate
[Figure: execution substrate: register banks, a 4 x 4 array of execution nodes, global control with branch predictor, banked I-caches 0-3 (plus I-cache H), and D-cache/LSQ banks 0-3.]
  • Interconnect topology and latency are exposed to
    the compiler's scheduler

11
Large Instruction Window
[Figure: an execution node: a router and ALU fed by out-of-order instruction buffers (opcode, src1, src2). The buffers form a logical z-dimension in each node, giving 4 logical frames of 4 x 4 instructions.]
  • Instruction buffers add depth to execution array
  • 2D array of ALUs → 3D volume of instructions
  • Entire 3D volume exposed to compiler

12
Execution Model
  • SPDI - static placement, dynamic issue
  • Dataflow within a block
  • Sequential between blocks
  • TRIPS compiler challenges
  • Create large blocks of instructions
  • Single entry, multiple exit, predication
  • Schedule blocks of instructions on a tile
  • Resource limitations
  • Registers, Memory operations

13
Block Execution Model
  • Program execution
  • Fetch and map block to TRIPS grid
  • Execute block, produce result(s)
  • Commit results
  • Repeat
  • Block dataflow execution
  • Each cycle, execute a ready instruction at every
    node
  • Single read of registers and memory locations
  • Single write of registers and memory locations
  • Update the PC to successor block
  • TRIPS core may speculatively execute multiple
    blocks (as well as instructions)
  • TRIPS uses branch prediction and register
    renaming between blocks, but not within a block
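A minimal sketch of this block-level cycle, using hypothetical Block and
BlockEngine names (illustrative Java, not the TRIPS toolchain):

    // Illustrative only: these are not TRIPS APIs.
    interface Block {
        void mapToGrid();        // fetch and map the block onto the ALU grid
        void executeDataflow();  // instructions fire as their operands arrive
        void commitResults();    // single commit of register and memory writes
        Block successor();       // next block, chosen with branch prediction
    }

    final class BlockEngine {
        void run(Block entry) {
            Block current = entry;
            while (current != null) {       // sequential between blocks
                current.mapToGrid();
                current.executeDataflow();  // dataflow within a block
                current.commitResults();
                current = current.successor();
            }
        }
    }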

[Figure: control-flow graph of blocks A-E between start and end nodes; execution proceeds one block at a time.]
14
Just Right Division of Labor
  • TRIPS architecture
  • Eliminates short-term temporaries
  • Out-of-order execution at every node in grid
  • Exploits ILP, hides unpredictable latencies
  • without superscalar quadratic hardware
  • without VLIW guarantees of completion time
  • Scale compiler - generate ILP
  • Large hyperblocks - predicate, unroll, inline,
    etc.
  • Schedule hyperblocks
  • Map independent instructions to different nodes
  • Map communicating instructions to same or close
    nodes
  • Let hardware deal with unpredictable latencies
    (loads)
  • Exploits Hardware and Compiler Strengths
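A minimal sketch of that placement heuristic, assuming instructions arrive in
dependence order and a hypothetical producer array (illustrative Java, not the
Scale/TRIPS scheduler):

    // Greedy placement: communicating instructions share a node,
    // independent instructions go to the least-loaded node.
    final class GreedyPlacement {
        static final int NODES = 16;  // 4 x 4 execution array

        // producer[i] = instruction feeding i, or -1 if independent
        static int[] place(int[] producer) {
            int[] node = new int[producer.length];
            int[] load = new int[NODES];
            for (int i = 0; i < producer.length; i++) {
                if (producer[i] >= 0) {
                    node[i] = node[producer[i]];   // keep communication local
                } else {
                    int best = 0;
                    for (int n = 1; n < NODES; n++)
                        if (load[n] < load[best]) best = n;
                    node[i] = best;                // spread independent work
                }
                load[node[i]]++;
            }
            return node;
        }
    }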

15
High Productivity Programming Languages
  • Interpretation/JIT vs compilation
  • Language representation
  • Pointers, arrays, frequent method calls, etc.
  • Automatic memory management costs: MMTk in IBM
    Jikes RVM
  • [ICSE '04, SIGMETRICS '04]
  • Memory Management Toolkit for Java
  • High Performance, Extensible, Portable
  • Mark-Sweep, Copying SemiSpace, Reference Counting
  • Generational collection, Beltway, etc.

16
Allocation Choices
Bump-Pointer
Free-List
  • Fast (increment and bounds check)
  • Can't incrementally free and reuse; must free en
    masse
  • Relatively slow (consult list for fit)
  • Can incrementally free and reuse cells
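Minimal sketches of the two disciplines, in illustrative Java rather than
MMTk's actual allocators:

    final class BumpPointer {
        private final byte[] space = new byte[1 << 20];
        private int cursor = 0;

        int alloc(int bytes) {                // fast: increment + bounds check
            if (cursor + bytes > space.length) return -1;  // full
            int addr = cursor;
            cursor += bytes;
            return addr;
        }
        void freeEnMasse() { cursor = 0; }    // cannot free one object at a time
    }

    final class FreeList {
        // one size class for brevity; real free lists segregate by size
        private final java.util.ArrayDeque<Integer> free = new java.util.ArrayDeque<>();

        int alloc() {                         // slower: consult the list for a fit
            Integer cell = free.poll();
            return cell == null ? -1 : cell;
        }
        void free(int cell) { free.push(cell); }  // free and reuse cells one at a time
    }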

17
Allocation Choices
  • Bump pointer
  • 70 bytes of IA32 instructions, 726 MB/s
  • Free list
  • 140 bytes of IA32 instructions, 654 MB/s
  • Bump pointer 11% faster in a tight loop
  • < 1% difference in a practical setting
  • No significant difference (?)
  • Second order effects?
  • Locality??
  • Collection mechanism??

18
Implications for Locality
  • Compare SS (semispace) and MS (mark-sweep) mutator
    performance
  • Mutator time
  • Mutator memory performance: L1, L2, and TLB

19
javac
20
pseudojbb
21
db
22
Locality Architecture
23
MS/SS Crossover 1.6GHz PPC
24
MS/SS Crossover 1.9GHz AMD
25
MS/SS Crossover 2.6GHz P4
26
MS/SS Crossover 3.2GHz P4
27
MS/SS Crossover
[Figure: locality vs. space crossover points for MS/SS on the 1.6GHz, 1.9GHz, 2.6GHz, and 3.2GHz machines.]
28
Locality in Memory Management
  • Explicit memory management on its way out
  • Key GC vs. explicit MM insights are 20 years old
  • Technology has changed and is still changing
  • Generational and Beltway Collectors
  • Significant collection time benefits over
    full heap collectors
  • Collect young objects
  • Infrequently collect old space
  • A copying nursery attains locality effects similar
    to those of a full-heap collector
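What makes young-only collection safe is the write barrier: stores that create
old-to-young pointers are remembered and treated as extra roots. A minimal
sketch with hypothetical names (not the Jikes RVM barrier):

    import java.util.HashSet;
    import java.util.Set;

    final class GenerationalHeap {
        private final Set<Object> nursery = new HashSet<>();        // young objects
        private final Set<Object> rememberedSet = new HashSet<>();  // old-to-young sources

        Object allocate() {                  // new objects start in the nursery
            Object o = new Object();
            nursery.add(o);
            return o;
        }

        void writeBarrier(Object holder, Object target) {  // on holder.field = target
            if (!nursery.contains(holder) && nursery.contains(target))
                rememberedSet.add(holder);   // extra root for the next nursery GC
        }

        void nurseryCollect() {
            // trace from stacks plus rememberedSet, promote survivors (elided)
            nursery.clear();
            rememberedSet.clear();
        }
    }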

29
Where are the Misses?
Generational Copying Collector
30
Copy Order
  • Static copy orders
  • Breadth first - Cheney scan (sketched after this
    list)
  • Depth first, hierarchical
  • Problem: one size does not fit all
  • Static profiling per class
  • Inconsistent with JIT
  • Object sampling
  • Too expensive in our experience
  • OOR - Online Object Reordering
  • OOPSLA04
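For reference, a compact sketch of the breadth-first (Cheney) order, using an
illustrative Obj type; the to-space list doubles as the scan queue:

    import java.util.ArrayList;
    import java.util.List;

    final class CheneyScan {
        static final class Obj {
            Obj[] fields;
            Obj forward;                      // forwarding pointer once copied
            Obj(int n) { fields = new Obj[n]; }
        }

        static List<Obj> collect(List<Obj> roots) {
            List<Obj> toSpace = new ArrayList<>();
            for (Obj r : roots) forward(r, toSpace);
            for (int scan = 0; scan < toSpace.size(); scan++) {
                Obj o = toSpace.get(scan);    // scan pointer chases the copies
                for (int i = 0; i < o.fields.length; i++)
                    if (o.fields[i] != null)
                        o.fields[i] = forward(o.fields[i], toSpace);
            }
            return toSpace;                   // objects now in breadth-first order
        }

        private static Obj forward(Obj o, List<Obj> toSpace) {
            if (o.forward == null) {          // copy once, then reuse
                Obj copy = new Obj(o.fields.length);
                System.arraycopy(o.fields, 0, copy.fields, 0, o.fields.length);
                o.forward = copy;
                toSpace.add(copy);
            }
            return o.forward;
        }
    }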

31
OOR Overview
  • Records object accesses in each method (excludes
    cold basic blocks)
  • Finds hot methods by dynamic sampling
  • Reorders objects with hot fields in higher
    generation during GC
  • Copies hot objects into separate region
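A hedged sketch of that decision, assuming a hypothetical HotFieldAdvice type
rather than the actual Jikes RVM additions: during copying GC, hot fields are
visited first so the objects they reach are copied next to each other.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class HotFieldAdvice {
        private final Set<String> hotFields = new HashSet<>();

        void markHot(String field) { hotFields.add(field); }  // fed by sampling

        // visit hot fields first so their targets land together in to-space
        List<String> copyOrder(List<String> fields) {
            List<String> order = new ArrayList<>();
            for (String f : fields) if (hotFields.contains(f)) order.add(f);
            for (String f : fields) if (!hotFields.contains(f)) order.add(f);
            return order;
        }
    }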

32
Static Analysis Example
  • Method foo:

        void foo() {
            A a;                     // some object of class A
            try {
                ... a.b ...          // hot basic block: compiler collects access info
            } catch (Exception e) {
                ... a.c ...          // cold basic block: compiler ignores it
            }
        }

  • The compiler records an access list for foo: 1. A.b  2. . .
33
Adaptive Sampling
  • The same method foo, now observed at run time
  • Adaptive sampling flags foo as hot
  • Looking up foo's access list (1. A.b  2. . .) marks
    field A.b as hot

[Figure: an instance of class A with hot field b and cold field c; b refers to an object of class B.]
34
Advice Directed Reordering
  • Example
  • Assume (1,4), (4,7) and (2,6) are hot field
    accesses
  • Resulting copy order: 1, 4, 7, 2, 6, then the cold
    objects 3, 5 (see the sketch below)

[Figure: object graph over objects 1-7; the hot edges (1,4), (4,7), and (2,6) are traversed first.]
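A small sketch that reproduces this order, assuming the hot accesses are given
as an edge map (illustrative, not the OOR data structures):

    import java.util.*;

    // Depth-first over hot edges, visiting objects in numeric order.
    final class AdviceOrder {
        static List<Integer> order(List<Integer> objects,
                                   Map<Integer, List<Integer>> hotEdges) {
            List<Integer> result = new ArrayList<>();
            Set<Integer> seen = new HashSet<>();
            for (int o : objects) visit(o, hotEdges, seen, result);
            return result;
        }

        private static void visit(int o, Map<Integer, List<Integer>> hot,
                                  Set<Integer> seen, List<Integer> out) {
            if (!seen.add(o)) return;
            out.add(o);
            for (int child : hot.getOrDefault(o, List.of())) visit(child, hot, seen, out);
        }

        public static void main(String[] args) {
            Map<Integer, List<Integer>> hot =
                Map.of(1, List.of(4), 4, List.of(7), 2, List.of(6));
            System.out.println(order(List.of(1, 2, 3, 4, 5, 6, 7), hot));
            // prints [1, 4, 7, 2, 6, 3, 5]
        }
    }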
35
OOR System Overview
[Figure: OOR system overview. Adaptive sampling finds hot methods; the baseline compiler adds entries to an access-info database built from the source code, and the optimizing compiler registers hot field accesses. The GC looks up this advice while copying objects, which affects the locality of the executing code. The legend distinguishes OOR additions from stock Jikes RVM components and input/output.]
36
Cost of OOR
Benchmark     Default      OOR   Difference (%)
jess             4.39     4.43        0.84
jack             5.79     5.82        0.57
raytrace         4.63     4.61       -0.59
mtrt             4.95     4.99        0.70
javac           12.83    12.70       -1.05
compress         8.56     8.54        0.20
pseudojbb       13.39    13.43        0.36
db              18.88    18.88       -0.03
antlr            0.94     0.91       -2.90
gcold            1.21     1.23        1.49
hsqldb         160.56   158.46       -1.30
ipsixql         41.62    42.43        1.93
jython          37.71    37.16       -1.44
ps-fun         129.24   128.04       -1.03
Mean                                 -0.19
37
Performance db
38
Performance jython
39
Performance javac
40
Software is not enough. Hardware is not enough.
  • Problem: inefficient use of the cache
  • Hardware limitations: set associativity, cannot
    predict the future
  • Cooperative Software/Hardware Caching
  • Combines high level compiler analysis with
    dynamic miss behavior
  • Lightweight ISA support conveys the compiler's
    global view to hardware
  • Compiler-guided cache replacement (evict-me)
  • Compiler-guided region prefetching
  • [ISCA '03, PACT '02]

41
Exciting Times
  • Dramatic architectural changes
  • Execution tiles
  • Cache and memory tiles
  • Next generation system solutions
  • Moving hardware/software boundaries
  • Online optimizations
  • Key compiler challenges (same old)
  • ILP and the cache/memory hierarchy