Title: The potential for Software-only thread-level speculation


1
The potential for Software-only thread-level
speculation
  • Depth Oral Presentation
  • Co-Supervisors
    • Prof. Greg Steffan
    • Prof. Cristina Amza
  • Committee Members
    • Prof. Tarek Abdelrahman
    • Prof. Michael Voss
    • Prof. Ken Sevcik
  • By Chuck (Chengyan) Zhao
  • April 25, 2005

2
Chip Multi-Processors (CMPs) are now everywhere
  • From all major companies
    • IBM
      • Power 4
      • Power 5
    • Intel
      • Montecito
      • Smithfield
    • AMD
      • Dual-core Opteron
    • Sun
      • MAJC
    • Sony, Toshiba, IBM
      • Cell

[Figure: chips labeled Power 4, Dual-core Intel chip, Cell, Dual-core Opteron]
Abundant Chip Multiprocessors
3
Improving Throughput with a Chip Multi-Processor
[Figure: Multiprogramming Workload. Labels: Applications, Execution Time, Processor, Caches]
improve throughput
4
Improving Single Application Performance with a
Chip Multi-Processor
[Figure: Single Application. Axis: Exec. Time]
need parallel threads to reduce execution time
5
Using Chip Multi-Processor for improvements
  • Improve throughput for multiprogramming workloads
    • Easy
    • CMP behaves like a normal MP
  • Improve single-application performance
    • Hard
    • Control and data dependences
    • Proposed approach: Thread-Level Speculation (TLS)

CMP trade-offs
6
Thread-Level Speculation (TLS)
  • Enable the compiler to create parallel threads
    despite ambiguous data dependences
  • Optimistically parallelize at compile time
  • Detect violations and recover at runtime

Optimistic at compile time, detect and recover at
runtime
7
Example of Thread-Level Speculation
  • Code to parallelize

for (...) { ... *p ... *q ... }
  • Cannot be parallelized by conventional parallelizing compilers
    • Uncertain dependence between p and q
    • Might be runtime or user-input dependent

Break loop iterations into threads, explore
uncertainty in each thread
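
The idea on this slide can be sketched in software. The block below is a minimal, sequential simulation and not the scheme from the talk: it assumes a loop body of the form `mem[p[i]] = mem[q[i]] + 1`, treats each iteration as one speculative epoch, buffers each epoch's store, and commits epochs in iteration order. The names `Epoch`, `run_epoch`, and `tls_loop` are hypothetical, and real TLS would run the epochs on separate threads.

```c
#include <assert.h>
#include <stddef.h>

enum { MEM = 8, MAX_EPOCHS = 16 };

/* One speculative epoch = one iteration of: mem[p] = mem[q] + 1. */
typedef struct {
    size_t read_addr;   /* location loaded (the *q side) */
    size_t write_addr;  /* location stored (the *p side) */
    int    write_val;   /* buffered store, not yet visible in mem */
} Epoch;

/* Execute the loop body against a view of memory, buffering the store. */
static void run_epoch(Epoch *e, const int *view, size_t p, size_t q) {
    e->read_addr  = q;
    e->write_addr = p;
    e->write_val  = view[q] + 1;
}

/* Runs n iterations speculatively; returns how many were re-executed. */
int tls_loop(int *mem, const size_t *p, const size_t *q, size_t n) {
    Epoch epochs[MAX_EPOCHS];
    int written[MEM] = {0};  /* locations stored by committed epochs */
    int violations = 0;

    assert(n <= MAX_EPOCHS);

    /* Phase 1: every epoch speculates on the pre-loop state. mem is not
       modified here, which stands in for the epochs running in parallel. */
    for (size_t i = 0; i < n; i++) {
        assert(p[i] < MEM && q[i] < MEM);
        run_epoch(&epochs[i], mem, p[i], q[i]);
    }

    /* Phase 2: commit in original iteration order. */
    for (size_t i = 0; i < n; i++) {
        if (written[epochs[i].read_addr]) {
            /* An earlier iteration wrote what this one read, so the
               speculative value is stale: recover by re-executing. */
            run_epoch(&epochs[i], mem, p[i], q[i]);
            violations++;
        }
        mem[epochs[i].write_addr] = epochs[i].write_val;
        written[epochs[i].write_addr] = 1;
    }
    return violations;
}
```

When `p` and `q` never overlap, no epoch is re-executed and all the work could have run in parallel. When every iteration reads what the previous one wrote, every epoch after the first is re-executed and the result still matches sequential execution.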
8
How Thread-Level Speculation works
9
Thread-Level Speculation quick summary
  • Benefits
    • Reduce inter-thread communication time among
      cores
    • Scale
    • New parallel programming model
  • Types of implementations
    • Hardware only
    • Combined hardware and software
    • Software only

Thread-Level Speculation is a good fit for Chip
Multi-Processors
10
Thread-Level Speculation Implementation Diagram
Overall picture of Thread-Level Speculation
11
Thread-Level Speculation Implementation Comparison
  • Hardware-only approach
    • Lots of research
    • Good speedup in simulation
    • Nobody has built it yet
      • costly, risky
      • needs both HW and SW at the same time
    • Outcome
      • HW-only TLS looks promising
      • Requires significant hardware changes
  • Software-only approach: limited work, limited
    progress
    • Major problem: high overhead
      • Buffer memory for speculative state
      • Track each memory read and write for violation
        detection
      • Recover from failed speculation by re-execution

Quick summary on HW-only and SW-only approaches
12
Outline for the rest of the talk
  • Hardware TLS schemes
  • Software TLS schemes
  • Our scheme
  • Our goals
  • Starting point
  • Potential applications
  • Conclusion

13
Hardware-only Thread-Level Speculation
Overall picture of HW-only TLS approach
14
Hardware Thread-Level Speculation Schemes
  • Lots of hardware TLS research
    • CMU: STAMPede
    • Stanford: Hydra
    • Wisconsin: Multiscalar
    • UIUC: I-ACOMA
    • UMN: Superthreaded architecture
  • Convergence of hardware schemes
    • Use the cache to buffer speculative state
    • Extend the cache coherence protocol to track data
      dependences

Convergence of HW-only Thread-Level Speculation
15
Hardware TLS Schemes quick summary
  • Result
    • TLS is promising
    • SPECint improvement
      • 30% - 100%
      • Depends on the aggressiveness of the hardware support

[Figure: CMP with hardware speculative buffers (four blocks labeled Sp-state) and an enhanced cache coherence protocol]
Convergence of HW-only Thread-Level Speculation
16
Software-only Thread-Level Speculation
Overall picture of SW-only TLS approach
17
Software-only Thread-Level Speculation Schemes
  • LRPD Test (UIUC)
  • VM for dependence tracking (Spiros, CMU)
  • Cintra's SW TLS (U Edinburgh)
  • Problem of the software-only approach: high overhead
    • Try to reduce it

overview of SW-only TLS approach
18
LRPD Test (UIUC)
[Figure: axis labeled Exec. Time]
  • implemented entirely in software
  • applies only to array-based code
  • no partial parallelism
    • the entire loop re-executes sequentially if
      there is any dependence

Pros & Cons of LRPD
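
The all-or-nothing behavior listed above can be sketched as follows. This is a simplified illustration in the spirit of the LRPD test, not the published algorithm: it omits privatization and reduction handling, and it conservatively treats any element touched by two different iterations, with at least one of them writing, as a dependence. The loop body `A[w[i]] = A[r[i]] + 1` and the names `lrpd_loop` and `mark_iter` are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

enum { N = 8, NONE = -1, MANY = -2 };

/* Shadow marking: remember which iteration touched an element,
   or MANY if more than one distinct iteration did. */
static void mark_iter(int *mark, int i) {
    if (*mark == NONE) *mark = i;
    else if (*mark != i) *mark = MANY;
}

/* Loop under test: A[w[i]] = A[r[i]] + 1, for an array A of N ints.
   Returns 1 if speculation succeeded, 0 if the whole loop was
   re-executed sequentially. */
int lrpd_loop(int *A, const int *w, const int *r, int iters) {
    int spec[N], writer[N], reader[N];
    int ok = 1;

    memcpy(spec, A, sizeof spec);
    for (int e = 0; e < N; e++) writer[e] = reader[e] = NONE;

    /* Speculative "doall": every iteration reads the original A, so the
       iterations could run in any order or in parallel. */
    for (int i = 0; i < iters; i++) {
        assert(w[i] >= 0 && w[i] < N && r[i] >= 0 && r[i] < N);
        mark_iter(&reader[r[i]], i);
        mark_iter(&writer[w[i]], i);
        spec[w[i]] = A[r[i]] + 1;
    }

    /* Post-execution test over the shadow arrays. */
    for (int e = 0; e < N; e++) {
        if (writer[e] == NONE) continue;
        if (writer[e] == MANY) ok = 0;  /* written by several iterations */
        if (reader[e] != NONE && reader[e] != writer[e]) ok = 0;
    }

    if (ok) {
        memcpy(A, spec, sizeof spec);   /* commit speculative results */
    } else {
        for (int i = 0; i < iters; i++) /* discard and redo the whole loop */
            A[w[i]] = A[r[i]] + 1;
    }
    return ok;
}
```

A single cross-iteration dependence anywhere in the loop throws away all speculative work, which is the "no partial parallelism" limitation.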
19
Dependence tracking using Virtual Memory
[Figure: software dependence tracking through VM pages; Virtual Memory: synchronize & transfer VM pages. Axis: Exec. Time]
Pros & Cons of VM Tracking
20
CMU Spiros's approach: Dependence tracking
using Virtual Memory
  • Coarse-grain, software-only
  • Based on memory tracking
    • virtual memory page protection mechanism
    • uses software DSM (TreadMarks)
  • Synchronization through VM pages, guided by cost
    analysis
  • Overhead is prohibitive
    • 2 sec (seq) vs. 5 min (par), a 150x slowdown
  • Not a viable approach at this coarse level of
    granularity

SW-TLS through VM Tracking is not attractive
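
To make the page-protection mechanism concrete, here is a minimal sketch of access tracking at page granularity. It is not the CMU implementation and involves no DSM: it assumes Linux, maps a few pages with no access rights, and uses a `SIGSEGV` handler to record the first touch of each page before unprotecting it. Calling `mprotect` from a signal handler is common practice on Linux but is not guaranteed async-signal-safe by POSIX. The names `tracking_start`, `page_touched`, and `on_fault` are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum { NPAGES = 4 };

static char *region;                           /* tracked memory */
static size_t page_size;
static volatile sig_atomic_t touched[NPAGES];  /* 1 once a page was accessed */

/* First access to a protected page lands here. */
static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + NPAGES * page_size)
        _exit(1);                              /* a genuine crash, not ours */
    size_t page = (addr - base) / page_size;
    touched[page] = 1;
    /* Unprotect the page so the faulting access is retried and succeeds. */
    if (mprotect(region + page * page_size, page_size,
                 PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Map NPAGES inaccessible pages and install the handler; 0 on success. */
int tracking_start(void) {
    struct sigaction sa;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * page_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

int page_touched(int page) {
    assert(page >= 0 && page < NPAGES);
    return touched[page];
}
```

Every first touch costs a kernel trap, and a single byte access marks a whole page. Both effects are consistent with the slide's conclusion that tracking at this granularity is too coarse and too expensive.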
21
Cintra's SW TLS: Memory tracking tuned for
performance
[Figure: efficient tracking for array references. Axis: Exec. Time]
Efficient but custom-made for arrays only
22
Cintra's software-only Thread-Level Speculation
quick summary
  • Features
    • Software simulation of an extended cache coherence
      protocol
    • Provides a speculative state transition table
    • Violation detection through speculative state
      comparison
    • Instruments each load and store
  • Pros & Cons
    • advanced implementation of the LRPD test
    • implemented entirely in software
    • covers partial parallelism
    • hand-crafted code for performance
    • applies only to array-based code

Summary of Cintra's work
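
A table-driven version of such load/store instrumentation might look like the sketch below. The four states and their transitions are illustrative assumptions, not the actual table from Cintra's work. Each thread keeps one state per array element, every instrumented load or store looks up the next state, and a store checks whether a later thread has already consumed a stale value. The hook names `spec_load`, `spec_store`, and `spec_reset` are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { THREADS = 4, ELEMS = 8 };

/* Per-thread, per-element access state (assumed for this sketch). */
typedef enum { NOT_ACC, EXP_LOAD, MOD, EXP_LOAD_MOD, NSTATES } State;
typedef enum { LOAD, STORE, NOPS } Op;

/* Speculative state transition table: next_state[current][operation]. */
static const State next_state[NSTATES][NOPS] = {
    [NOT_ACC]      = { [LOAD] = EXP_LOAD,     [STORE] = MOD          },
    [EXP_LOAD]     = { [LOAD] = EXP_LOAD,     [STORE] = EXP_LOAD_MOD },
    [MOD]          = { [LOAD] = MOD,          [STORE] = MOD          },
    [EXP_LOAD_MOD] = { [LOAD] = EXP_LOAD_MOD, [STORE] = EXP_LOAD_MOD },
};

static State state[THREADS][ELEMS];  /* zero-initialized, i.e. NOT_ACC */

void spec_reset(void) { memset(state, 0, sizeof state); }

/* Instrumentation hook for a load of element e by thread t. */
void spec_load(int t, int e) {
    assert(t >= 0 && t < THREADS && e >= 0 && e < ELEMS);
    state[t][e] = next_state[state[t][e]][LOAD];
}

/* Instrumentation hook for a store. Returns the earliest later thread that
   already loaded a stale value of element e, or -1 if there is none. */
int spec_store(int t, int e) {
    assert(t >= 0 && t < THREADS && e >= 0 && e < ELEMS);
    state[t][e] = next_state[state[t][e]][STORE];
    for (int s = t + 1; s < THREADS; s++) {
        if (state[s][e] == EXP_LOAD || state[s][e] == EXP_LOAD_MOD)
            return s;  /* s read the element before this store: violation */
        if (state[s][e] == MOD)
            break;     /* s overwrote it first, hiding this store from later threads */
    }
    return -1;
}
```

Because a violation names a specific thread, only that thread and its successors need to restart, which is how a scheme of this kind can keep partial parallelism where LRPD cannot.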
23
Problems with Software Thread-Level Speculation
  • High overhead
    • Buffer speculative state
    • Track data dependences for all memory references
    • Re-execute in case of failed speculation
  • Potential speedup
    • largely unexplored
  • Possible directions for future research
    • Reduce overhead
    • Achieve speedup from TLS parallelism

Summary of Software TLS
24
Our current Thread-Level Speculation approach
Overall position for our SW TLS approach
25
Long term future plan
  • Goals
    • Target
      • Chip Multi-Processors
      • Tightly-coupled MPs
    • Apply to general-purpose code, not only arrays
    • Minimize overhead
  • Capitalize on compiler analysis and optimizations
    • Idempotency analysis <done>
    • Synchronization and communications <done>
    • PPA: Probabilistic Pointer Analysis framework
      (Jeff's work) <progressing>
    • Minimal backup and buffer retrieval analysis
      <progressing>
    • more analyses we will invent <todo>
  • SW-only approach: room to improve
  • Starting point: highly efficient software
    checkpointing

Goals and Plans
26
Starting point: efficient software checkpointing
[Figure: program execution. Labels: Buffer memory changes, Buffer more memory changes]
Software checkpointing
  • Checkpoints are placed at selected program points in the source code
  • Buffer state changes between the current execution
    point and its latest checkpoint
  • Execution can always efficiently rewind to its
    latest checkpoint

Introduce software checkpointing
27
Potential use of Software checkpointing
  • Software rollback
    • automatic software TLS support
    • foundation of future automatic TLS
      parallelization
  • Debugging
    • controlled rewind
  • Enhance application reliability
  • Speculative optimizations in uni-processor
    programs
    • larger window size
    • deep branch speculation
    • speculative code motion

what can software checkpointing do
28
Software checkpointing schemes
  • Compiler analysis
    • Local: basic-block level
      • Back up only the memory writes that need it
      • Optimize to minimize
        • number of backups
        • number of buffer retrievals
    • Global: procedure level
      • Populate buffers through the control-flow graph
      • Iterate until the buffers stabilize
    • Inter-procedural level
  • Potential approaches for software backup
    • Undo backup
    • Todo backup

build software checkpointing
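
The "iterate until the buffers stabilize" step reads like a standard iterative dataflow computation. The slides do not say exactly what is propagated, so the sketch below assumes one plausible choice: for each basic block, the set of variables that may have been written since the checkpoint, represented as a bitmask and merged by union along control-flow edges. The types and the function `propagate` are hypothetical.

```c
#include <assert.h>

enum { NBLOCKS = 4 };

typedef struct { int from, to; } Edge;  /* one control-flow edge */

/* writes[b]: bitmask of variables written inside basic block b.
   out[b]:    on return, bitmask of variables possibly written on some
              path from the checkpoint through the end of block b.
   Returns the number of passes needed to reach the fixed point. */
int propagate(const unsigned *writes, const Edge *edges, int nedges,
              unsigned *out) {
    int passes = 0, changed = 1;

    for (int b = 0; b < NBLOCKS; b++) out[b] = writes[b];

    while (changed) {  /* iterate until the sets stabilize */
        changed = 0;
        passes++;
        for (int k = 0; k < nedges; k++) {
            assert(edges[k].from >= 0 && edges[k].from < NBLOCKS);
            assert(edges[k].to >= 0 && edges[k].to < NBLOCKS);
            unsigned merged = out[edges[k].to] | out[edges[k].from];
            if (merged != out[edges[k].to]) {
                out[edges[k].to] = merged;
                changed = 1;
            }
        }
    }
    return passes;
}
```

For example, with a loop between blocks 1 and 2, a variable written only in block 2 also shows up in block 1's set because of the back edge. The sets can only grow and are bounded, so the iteration terminates.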
29
Undo backup
  • Compile-time analysis
  • Back up once
    • per distinct memory write
    • per basic block
  • The program continues to operate on the original (non-backup) memory
  • Action upon execution completion
    • Commit: discard the buffer
    • Rollback: restore from the buffer

undo backup properties
30
Undo backup example
Program, basic-block level: a = 10; b = 12; c = a (op) b
Undo backup memory: (&a, a) (&b, b) (&c, c)
Undo backup action: conflict check
  • Y: restore from undo memory
  • N: discard undo memory
Next basic block
undo backup process
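
A minimal runtime sketch of the undo scheme follows, under stated assumptions. The talk places the "back up once per distinct write" decision at compile time, while this sketch stands in for that analysis with a runtime duplicate check. `UndoLog`, `undo_backup`, `undo_commit`, and `undo_rollback` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { UNDO_CAP = 16 };

typedef struct { int *addr; int old; } UndoEntry;
typedef struct { UndoEntry e[UNDO_CAP]; size_t n; } UndoLog;

/* Save the old value of *addr once, before its first write in the block. */
void undo_backup(UndoLog *log, int *addr) {
    for (size_t i = 0; i < log->n; i++)
        if (log->e[i].addr == addr) return;  /* already backed up */
    assert(log->n < UNDO_CAP);
    log->e[log->n].addr = addr;
    log->e[log->n].old  = *addr;
    log->n++;
}

/* Commit: the in-place values are already final, so discard the buffer. */
void undo_commit(UndoLog *log) { log->n = 0; }

/* Rollback: restore every saved value, newest first. */
void undo_rollback(UndoLog *log) {
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].old;
    }
}
```

Because writes go straight to the original locations, reads need no buffer lookup, which is why this variant is cheap when the written addresses are known.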
31
Todo backup
  • Performed at runtime
  • Happens on every memory write inside the basic
    block
  • Each following read might need to retrieve its value from the
    buffer
  • Action upon completion (reverse of the undo type)
    • Commit: write back from the buffer
    • Rollback: discard the buffer

todo backup properties
32
Todo backup example
Program, basic-block level: *p = a; *q = b; ... p ... q
Todo backup memory: (p, a) (q, b)
Conflict check
  • Y: discard todo backup
  • N: write todo backup to memory
Next basic block
todo backup process
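
The todo scheme can be sketched the same way, again as an assumption-laden illustration rather than the talk's implementation. Writes go into a buffer of (address, new value) pairs, reads check the buffer before falling back to memory, and only commit makes the values visible. Unlike the slide's one-backup-per-write description, this sketch updates an existing entry in place when the same address is written again. `TodoBuf` and the `todo_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { TODO_CAP = 16 };

typedef struct { int *addr; int val; } TodoEntry;
typedef struct { TodoEntry e[TODO_CAP]; size_t n; } TodoBuf;

/* Buffer a write instead of performing it. */
void todo_write(TodoBuf *b, int *addr, int val) {
    for (size_t i = 0; i < b->n; i++)
        if (b->e[i].addr == addr) { b->e[i].val = val; return; }
    assert(b->n < TODO_CAP);
    b->e[b->n].addr = addr;
    b->e[b->n].val  = val;
    b->n++;
}

/* A read must see the buffered value if there is one. */
int todo_read(const TodoBuf *b, const int *addr) {
    for (size_t i = 0; i < b->n; i++)
        if (b->e[i].addr == addr) return b->e[i].val;
    return *addr;
}

/* Commit: write the buffered values back to memory. */
void todo_commit(TodoBuf *b) {
    for (size_t i = 0; i < b->n; i++) *b->e[i].addr = b->e[i].val;
    b->n = 0;
}

/* Rollback: memory was never touched, so just discard the buffer. */
void todo_rollback(TodoBuf *b) { b->n = 0; }
```

The lookup inside `todo_read` is the per-read cost that makes this variant slower than undo backup. In exchange it works for pointers whose targets are only known at runtime.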
33
Backup Comparison
  • Undo
    • Pro: fast
      • Few backups
      • No need to retrieve from the buffer on reads
    • Con: memory address needs to be known statically
      • Scalar
      • Pointer to a fixed location
  • Todo
    • Pro
      • Handles both scalars and general-purpose
        pointers
    • Con: slow
      • One backup per memory write
      • Each following read must be retrieved from the buffer
  • In reality, both types are used

Pros & cons of undo and todo
34
An example in reality: mixed mode
Code to execute
    int a, b, c; int *p, *q;
    (d) a = 1
    (d) b = 2
    (d) *p = 5
    (u) c = a (op) b
    (u) ... q
Undo buffer: (&a, a) (&b, b) (&c, c)
Todo buffer: (p, 5)
combined-backup process in reality
35
Selection of backups in reality
  • Combined approach
    • Undo: memory address known
      • Scalars
      • Pointers to a fixed address
      • Compile-time analysis
    • Todo: memory address unknown
      • Normal pointers
      • Run-time analysis
  • Plan for implementation
    • put into SUIF, as an optimization pass
    • Minimize performance drop

use both types together in reality
36
Conclusion
  • Thread-Level Speculation is compelling
    • Potentially large performance gains
  • Challenge
    • Software overhead
  • Limited SW TLS work
    • No previous SW TLS works on general-purpose
      programs
  • Killer advantage: compiler analyses
  • Modest starting point
    • efficient software checkpointing

summary
37
Questions and Answers
38
Concurrent HW-only Related Work
Approach          Composition   Compiler-assisted or Translator-only
DMT               HW-only
CSMP              HW-only
Trace Processor   HW-only
Krishnan99        SW/HW
Hydra             SW/HW
SVC               SW/HW
SUDS              SW/HW
Zhang99           SW/HW
Cintra00          SW/HW
STAMPede          SW/HW
Another view of HW-only Thread-Level Speculation
Schemes