1
Profile-Based Dynamic Optimization Research for
Future Computer Systems
  • Takanobu Baba
  • Department of Information Science
  • Utsunomiya University, Japan
  • http://aquila.is.utsunomiya-u.ac.jp
  • November 12, 2004

2
Brief history of my research
  • 1970s: The MPG System
    A Machine-Independent Efficient Microprogram Generator
  • 1980s: MUNAP
    A Two-Level Microprogrammed Multiprocessor Computer
  • 1990s: A-NET
    A Language-Architecture Integrated Approach for Parallel Object-Oriented Computation

3
A Two-Level Microprogrammed Multiprocessor Computer - MUNAP
A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
MUNAP
4
A Parallel Object-Oriented Total Architecture A-NET (Actors NETwork)
  • Massively parallel computation
  • Each node consists of a PE and a router.
  • The PE has a language-oriented, typical CISC architecture.
  • The programmable router is topology-independent.
A-NET Multicomputer
5
Current dynamic optimization projects
  • Computation-oriented
  • YAWARA: A meta-level optimizing computer system
  • HAGANE: Binary-level multithreading
  • Communication-oriented
  • Spec-All: Aggressive Read/Write Access Speculation Method for DSM Systems
  • Cross-Line Adaptive Router Using Dynamic Information

6
YAWARA A Meta-Level Optimizing Computer System
7
Background
  • Moore's Law will be maintained by semiconductor technology.
  • How can we utilize the huge number of transistors to speed up program execution?
  • Our idea is to use some of the chip area to dynamically and autonomously tune the configuration of an on-chip multiprocessor.

8
[Diagram: the meta-level processor receives a profile of control and data from the base-level processor and returns the results of optimization; the base-level processor exchanges instructions, data, and results of computation with memory.]
9
Design considerations
  • HW vs. SW reconfiguration
    → SW reconfiguration
  • Static vs. dynamic reconfiguration
    → both static and dynamic reconfiguration capability
  • Homogeneous vs. heterogeneous architecture
    → unified homogeneous structure

10
Basic concepts of thread-level reconfiguration
[Diagram: at the base level, an application runs as Computing Threads (CT) over memory; at the meta level, a Management Thread (MT) coordinates Profiling Threads (PT) that profile the computation and Optimizing Threads (OT) that optimize it.]
MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread
11
Execution model
[Diagram: the Management Thread (MT) activates a Profiling Thread (PT) and a Computing Thread (CT). In the profiling-centric model, the PT collects the profile while the CT runs, sleeps, and wakes up when the optimization-initiation condition is satisfied, whereupon the MT activates an Optimizing Thread (OT). In the computing-centric model, the CT itself collects the profile and the PT sleeps until the optimization-initiation condition is satisfied.]
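The profiling-centric model above can be sketched as a tiny simulation. The class name, the sample-count trigger, and the event representation below are illustrative assumptions, not details from the slides:

```python
class ProfilingThread:
    """Toy stand-in for a PT: collects profile samples and reports
    when the (illustrative) optimization-initiation condition holds."""
    def __init__(self, threshold):
        self.samples = []
        self.threshold = threshold

    def collect(self, sample):
        self.samples.append(sample)
        # condition: enough samples gathered since the last optimization
        return len(self.samples) >= self.threshold


def run_profiling_centric(profile_events, threshold=3):
    """Count how many times an Optimizing Thread would be activated."""
    pt = ProfilingThread(threshold)
    ot_activations = 0
    for event in profile_events:
        if pt.collect(event):      # PT wakes up: condition satisfied
            ot_activations += 1    # MT activates an OT
            pt.samples.clear()     # PT resumes collecting ("sleeps")
    return ot_activations
```

With seven profile events and a threshold of three, the OT would be activated twice; the real system replaces the sample counter with whatever optimization-initiation condition the MT configures.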
12
Change of configurations by meta-level
optimization
[Diagram: snapshots of the meta-level and base-level thread assignments over time. Meta-level optimization changes the mix of MT, OT, PT, and CT threads on the thread engines; as optimization proceeds, profiling and optimizing threads give way to computing threads.]
13
The YAWARA System
  • An implementation of the computation model
  • The SW system consists of static and dynamic optimization systems.
  • The HW system includes uniformly structured thread engines (TEs); each TE can execute base- and meta-level threads.
  • The spirit of YAWARA: "A flexible method prevails where a rigid one fails."

14
Software System
[Diagram: source code (C/C++, Java, Fortran, ...) feeds the SOS (Static Optimization System), which emits an executable image and code-analysis info; the DOS (Dynamic Optimization System) uses the run-time profile as dynamic feedback and returns the execution profile to the SOS as static feedback; the code runs on an array of Thread Engines (TEs), which produce the execution results.]
15
Hardware System
[Diagram: a grid of TEs under feedback-directed resource control. Each Thread Engine (TE) holds threads 0 to N, a register file, a thread-code cache (I) and a thread-data cache (D), network-IN/OUT ports to/from the network, execution control over 4 integer units and 1 FP unit, and a profiling buffer with a profiling controller.]
16
Example application: compress
Speculative multithreading using a path prediction mechanism
[Diagram: at the base level, a hot path / hot loop with phased behavior is executed speculatively (hit / miss on hot path #0). At the meta level: hot-loop / hot-path detection (PT, OT); speculative-multithreading profiling (PT); speculative-multithreading code generation, helper-thread generation, and path-predictor generation (OT); coordinated by the management thread (MT).]
17
Conclusion -YAWARA-
  • We proposed an autonomous reconfiguration mechanism based on dynamic behavior.
  • We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
  • We are now developing the software system and the simulator.

18
Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
YAWARA @ PDCS2004
19
  • Occurrence ratios of the top-two paths

  function              #1 path (%)   #2 path (%)
  compress/compress        54.5          22.4
  ijpeg/forward_DCT        48.2          42.1
  m88ksim/killtime         97.0           3.0
  li/sweep                 80.7          19.3
  (the remainder is taken by other paths)

The top two paths occupy 80-100% of execution.
20
Two-level path prediction
  • Introducing two-level branch prediction
  • The history register keeps the sequence of #1-path executions (1: the #1 path, 0: the other paths).
  • The counter table counts #1-path executions.

Single Path Predictor (SPP)
The history register (e.g. 1101 = 13) indexes the counter table (v0, v1, ..., v15);
if v13 > X (threshold X), predict the #1 path; otherwise predict the #2 path.
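As a concrete sketch, the SPP can be modeled with a k-bit history register indexing a table of saturating counters. The slide does not spell out the counter-update policy, so the saturating increment/decrement scheme below is an assumption:

```python
class SinglePathPredictor:
    """Sketch of the two-level Single Path Predictor (SPP).

    A k-bit history register records recent outcomes (1 = the #1
    path was taken, 0 = any other path). The history value indexes
    a table of saturating counters; the indexed counter is compared
    against a fixed threshold X to predict the next path.
    Update policy (saturating +1/-1) is an illustrative assumption.
    """
    def __init__(self, history_bits=4, threshold=8, max_count=15):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        self.threshold = threshold
        self.max_count = max_count
        self.table = [0] * (1 << history_bits)

    def predict(self):
        # e.g. history 1101 = 13 selects v13; if v13 > X, predict #1
        return 1 if self.table[self.history] > self.threshold else 2

    def update(self, taken_path):
        took_first = 1 if taken_path == 1 else 0
        c = self.table[self.history]
        self.table[self.history] = (min(c + 1, self.max_count)
                                    if took_first else max(c - 1, 0))
        # shift the outcome into the history register
        self.history = ((self.history << 1) | took_first) & self.mask
```

After a run of #1-path executions the counter for the current history exceeds the threshold and the predictor switches to predicting #1.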
21
Another path predictor
Dual Path Predictor (DPP)
The #1-path history register (e.g. 1101 = 13) indexes the #1-path counter table (v0, ..., v15),
and the #2-path history register (e.g. 0010 = 2) indexes the #2-path counter table;
if v13 (of the #1 table) > v2 (of the #2 table), predict the #1 path; otherwise predict the #2 path.
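A matching sketch of the DPP keeps separate history registers and counter tables for the #1 and #2 paths and compares the two indexed counters directly instead of using a fixed threshold. As with the SPP sketch, the counter-update policy is an assumption:

```python
class DualPathPredictor:
    """Sketch of the Dual Path Predictor (DPP): one history register
    and counter table per path; predict #1 when the #1-path counter
    for the current #1 history exceeds the #2-path counter for the
    current #2 history. Saturating updates are illustrative."""
    def __init__(self, history_bits=4, max_count=15):
        self.h1 = self.h2 = 0
        self.mask = (1 << history_bits) - 1
        self.max_count = max_count
        self.t1 = [0] * (1 << history_bits)  # #1-path counter table
        self.t2 = [0] * (1 << history_bits)  # #2-path counter table

    def predict(self):
        # e.g. if v13 (#1 table) > v2 (#2 table), predict #1
        return 1 if self.t1[self.h1] > self.t2[self.h2] else 2

    def _bump(self, table, idx, hit):
        table[idx] = (min(table[idx] + 1, self.max_count)
                      if hit else max(table[idx] - 1, 0))

    def update(self, taken_path):
        b1, b2 = int(taken_path == 1), int(taken_path == 2)
        self._bump(self.t1, self.h1, b1)
        self._bump(self.t2, self.h2, b2)
        self.h1 = ((self.h1 << 1) | b1) & self.mask
        self.h2 = ((self.h2 << 1) | b2) & self.mask
```

Tracking both paths lets the predictor adapt when the dominant path flips between #1 and #2, which a single threshold cannot express.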
22
Single Speculation (SS)
[Diagram: threads speculate on the #1 path. When a thread fails, the recovery process aborts the succeeding threads, executes the failed thread non-speculatively, and then continues speculative execution on the #1 path.]
Speculation failure degrades performance.
23
Double Speculation (DS)
  • Even when the first speculation fails,
  • the secondary choice has a high possibility of success,
  • because the top-two paths are dominant.

Expected #2-path hit ratios (among #1-path misses): 49.2% (compress), 81.3% (forward_DCT), 100% (killtime), 100% (sweep).
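The "expected #2 hit" figures follow directly from the occurrence-ratio table on slide 19: among executions where the #1-path prediction misses, the fraction covered by the #2 path is p2 / (1 - p1). A quick check:

```python
def expected_second_hit(p1, p2):
    """Fraction (in percent) of #1-path misses that the #2 path
    covers, given occurrence ratios p1 and p2 in percent."""
    return round(p2 / (100.0 - p1) * 100.0, 1)


# occurrence ratios from the slide-19 table: (#1 path %, #2 path %)
ratios = {
    "compress/compress": (54.5, 22.4),
    "ijpeg/forward_DCT": (48.2, 42.1),
    "m88ksim/killtime":  (97.0,  3.0),
    "li/sweep":          (80.7, 19.3),
}
```

For example, compress: 22.4 / (100 - 54.5) ≈ 49.2%, matching the figure quoted above.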
24
Double Speculation (DS)
[Diagram: when the #1-path speculation fails, the recovery process starts a secondary speculation on the #2 path instead of falling back to non-speculative execution, then continues speculative execution.]
  • If the secondary speculation succeeds,
  • the performance loss is not so large.
25
Evaluation flow
  • hot-path detection (SIMCA)
  • thread-code generation
  • thread codes: #1-path speculative thread, #2-path speculative thread, non-speculative thread
  • path history acquisition (SIMCA) → path execution history
  • performance estimator → speculation hit ratio, speed-up ratio
26
Prediction success ratio
[Graphs: prediction success ratio (%) vs. history length (1-16) for compress and forward_DCT.]
27
Prediction success ratio
[Graphs: prediction success ratio (%) vs. history length (1-16) for killtime and sweep.]
28
Speed-up ratio
[Graphs: speed-up ratio vs. history length (1-16) for compress (up to 2.0) and forward_DCT (up to 4.0), with "S 100" and "P1 only" reference lines.]
29
Speed-up ratio
[Graphs: speed-up ratio vs. history length (1-16) for killtime and sweep (up to 3.0), with "S 100" and "P1 only" reference lines.]
30
Conclusions - Two-Path-Limited Speculative Multithreading -
  • We proposed
  • - a path prediction method and predictors
  • - speculation methods
  • for path-based speculative multithreading.
  • Preliminary performance estimation results were shown.

31
Current and future works
  • Accurate and detailed evaluation of various applications
    → SPEC 2000, MediaBench, ...
  • Integration into our dynamic optimization framework YAWARA

32
Current dynamic optimization projects
  • Computation-oriented
  • YAWARA: A meta-level optimizing computer system
  • HAGANE: Binary-level multithreading
  • Communication-oriented
  • Spec-All: Aggressive Read/Write Access Speculation Method for DSM Systems
  • Cross-Line Adaptive Router Using Dynamic Information

33
HAGANE: Binary-Level Multithreading
34
Background
  • Multithread programming is not so easy.
    → Automatic multithreading system
  • However, source codes are not always available.
    → Multithreading at the binary level

35
Binary Translator / Optimizer System
[Diagram: the source binary code, execution profile, and analysis info feed the STO (Static Translator Optimizer), which emits statically translated multithreaded binary code; the DTO (Dynamic Translator Optimizer) translates the process memory image at run time into dynamically translated multithreaded binary code; both run on the multithread processor, which feeds execution profile info back.]
36
Thread Pipelining Model
  • Loop iterations are mapped onto threads (Thread i, Thread i+1, Thread i+2, ...), whose stages overlap in a pipelined fashion.
  • TSAG: Target Store Address Generation
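As a rough illustration of why the overlap pays off: the uniform stage cost and the two-stage start offset below are assumptions for the sketch, not figures from the slide:

```python
# The four stages of the thread pipelining model (Continuation,
# Target Store Address Generation, Computation, Write-Back).
STAGES = ["Continuation", "TSAG", "Computation", "Write-Back"]


def serial_time(n_threads, stage_cost=1):
    """All iterations run back-to-back with no overlap."""
    return n_threads * len(STAGES) * stage_cost


def pipelined_time(n_threads, stage_cost=1):
    """Assume thread i+1 may start once thread i finishes TSAG
    (its target store addresses are then known), so successive
    threads start two stages apart."""
    offset = 2 * stage_cost
    return (n_threads - 1) * offset + len(STAGES) * stage_cost
```

Under these assumptions, four iterations take 16 stage-times serially but only 10 when pipelined, and the gap widens with more iterations.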
37
Example translation

Source Binary Code:
      mtc1  $zero, $f4
      addu  $v1, $zero, $zero
BB1:  l.s   $f0, 0($a0)
      l.s   $f2, 0($a1)
      mul.s $f0, $f0, $f2
      addiu $v1, $v1, 1
      add.s $f4, $f4, $f0
      slti  $v0, $v1, 5000
      addiu $a1, $a1, 4
      addiu $a0, $a0, 4
      bne   $v0, $zero, BB1
BB2:  mov.s $f0, $f4
      jr    $ra

Translated Code (the thread management instructions bstr, lfrk, altsw, tsagd, sttsw, and estr are overhead code for multithreading; the stages are Cont., TSAG, Comp., and W.B.):
      mtc1  $zero, $f4
      addu  $v1, $zero, $zero
      bstr
      slti  $v0, $v1, 5000
      beq   $v0, $zero, ST_LL0
      addu  $t0, $a0, $zero
      addu  $t1, $a1, $zero
      addi  $v1, $v1, 1
      addi  $a0, $a0, 4
      addi  $a1, $a1, 4
      lfrk  wtsagd
      addu  $t2, $sp, $zero
      altsw $t2
      tsagd
      l.s   $f0, 0($t0)
      l.s   $f2, 0($t1)
      l.s   $f4, 0($t2)
      mul.s $f0, $f0, $f2
      add.s $f4, $f4, $f0
      sttsw $t2, $f4
ST_LL0:
      estr
      mov.s $f0, $f4
      jr    $ra
38
Superthreaded Architecture
[Diagram: multiple Thread Processing Units, each containing an Execution Unit, a Communication Unit, a Memory Buffer, and a Write-Back Unit, share an L1 instruction cache and an L1 data cache.]
39
m88ksim (SPECint95)
  • Poor speedup ratios.
  • Loop unrolling does not affect the performance.
  • The number of iterations is quite small.

40
ijpeg (SPECint95)
  • The thread code size is too small to hide the thread management overhead.
  • Loop unrolling is effective to achieve good speedup ratios.
  • Excessive loop unrolling causes performance degradation.
  • The number of iterations is not so large.

41
swim (SPECfp95)
  • Good speedup ratios.
  • Loop unrolling is effective to achieve linear speedup.
  • The number of iterations is large.

42
Conclusion -HAGANE-
  • We have evaluated binary-level multithreading using some SPEC95 benchmark programs.
  • The performance evaluation results indicate that:
  • the thread code size should be large enough to improve the performance;
  • loop unrolling is effective for small loop bodies;
  • excessive loop unrolling degrades performance.

43
A Methodology of Binary-Level Variable Analysis for Multithreading
HAGANE @ PDCS2004
44
Background and Objective
  • Usually, loop iterations are interrelated through memory variables, such as induction variables.

However, it is difficult to analyze this kind of dependency at the binary level.
A binary-level variable analysis method is strongly required for binary-level multithreading.
45
Example Binary Code
  • lw    $a1, 16($s8)
  • lw    $v1, 16($s8)
  • lw    $a0, 16($s8)
  • sll   $v1, $v1, 0x2
  • addu  $v1, $v1, $a2
  • lw    $v0, 16($s8)
  • lw    $v1, -4($v1)
  • addiu $v0, $v0, 1
  • sw    $v0, 16($s8)
  • lw    $v0, 16($s8)
  • sll   $a1, $a1, 0x1
  • sll   $a0, $a0, 0x2
  • sll   $v0, $v1, 0x1
  • addu  $v0, $v0, $v1
  • lw    $v1, 16($s8)
  • addu  $a0, $a0, $a2
  • addu  $a1, $a1, $v0
  • sw    $a1, 0($a0)
  • slt   $v1, $v1, $a3

Corresponding source:
  for (i = 1; i < N; i++) {
      z = i * 2;
      x = a[i-1];
      y = x * 3;
      a[i] = z + y;
  }

(-4($v1) loads a[i-1]; 0($a0) stores a[i].)
46
Binary-Level Variable Analysis
  • (1) Register values are analyzed using dataflow trees.
  • (2) When register values used for memory references are judged to be the same, the memory location is regarded as a virtual register.
  • (3) Using the virtual registers, steps (1) and (2) are repeated.

47
Construction of Dataflow Tree
(register versions are written as $n_k, the k-th definition of register $n)
  • addiu $29_1, $29_0, -8
  • sw    $0, 0($29_1)
  • addu  $5_1, $0, $0
  • lw    $2_1, 0($29_1)
  • addu  $3_1, $5_1, $4_0
  • addiu $5_2, $5_1, 1
  • addu  $2_2, $2_1, $3_1
  • sw    $2_2, 0($29_1)
  • slti  $2_3, $5_2, 100
  • bne   $2_3, $0, L1

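A minimal sketch of step (1), building a def-use tree for a register from an instruction list like the one above. The tuple encoding and tree shape are illustrative, and loads/stores are omitted; the actual method extends this with virtual registers for memory locations:

```python
def build_dataflow_tree(insns, target):
    """insns: (op, dst, src...) tuples in program order, with
    versioned register names; returns a nested (op, [operands])
    tree rooted at the definition of `target`."""
    defs = {}
    for op, dst, *srcs in insns:
        defs[dst] = (op, srcs)

    def expand(operand):
        if operand not in defs:        # leaf: constant or live-in value
            return operand
        op, srcs = defs[operand]
        return (op, [expand(s) for s in srcs])

    return expand(target)


# a few register-only instructions from the slide, encoded as tuples
insns = [
    ("addiu", "$29_1", "$29_0", -8),
    ("addu",  "$5_1",  "$0",   "$0"),
    ("addiu", "$5_2",  "$5_1",  1),
    ("slti",  "$2_3",  "$5_2",  100),
]
```

Expanding `$2_3` yields the tree `slti(addiu(addu($0, $0), 1), 100)`, i.e. the loop counter compared against 100.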
48
Example Normalization
49
Detection of Loop Induction Variables
  • A loop induction variable is a register that
  • has an inter-iteration dependency, and
  • increases by a fixed value between iterations.

The concept of the virtual register makes it possible to detect induction variables on memory.
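A trace-based stand-in for this definition (the real method works statically on the dataflow trees; this simplified sketch just checks for a constant, non-zero stride over recorded per-iteration values):

```python
def find_induction_variables(values_per_iteration):
    """values_per_iteration maps a register or virtual register
    (memory location) to its value at the top of each iteration.
    Returns {name: stride} for those that change by a fixed,
    non-zero amount between iterations."""
    result = {}
    for name, vals in values_per_iteration.items():
        diffs = {b - a for a, b in zip(vals, vals[1:])}
        if len(diffs) == 1:
            stride = diffs.pop()
            if stride != 0:           # constant but unchanging: not induction
                result[name] = stride
    return result
```

For example, a memory word at an illustrative location 16($s8) holding 1, 2, 3, 4, ... would be reported with stride 1, while a loop-invariant bound or an irregularly changing register would not.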
50
Application
  • 101.tomcatv of the SPECfp95 benchmark
  • Fortran-to-C translator ver. 19940927
  • GCC cross compiler ver. 2.7.2.3 for SIMCA
  • Data set: test
  • The six innermost loops (1-6) are selected.
  • They have induction variables on memory.

51
Speedup Ratios
52
Conclusion -Binary-Level Variable Analysis-
  • We proposed a binary-level variable analysis method.
  • This method makes it possible to detect induction variables and their increment/decrement values.
  • The detected information allows us to multithread binary codes that could not be multithreaded without our algorithm.
  • We attained up to 9.8x speedup by the multithreading.

53
Summary
  • Dynamic optimization projects at our laboratory
  • The results show the performance improvement quantitatively in each project.

54
What's the next step of computer architecture research?
  • From performance to reliability? Or low power?
    e.g. dependable computing
  • Architecture for new device technologies?
    e.g. quantum computing
  • However...
  • if we stick to conventional high-performance computing research,
  • what's the promising way?