1
Instruction and Data Address Trace Compression
  • Aleksandar Milenkovic
  • (collaborative work with Milena Milenkovic and
    Martin Burtscher)
  • Electrical and Computer Engineering Department
  • The University of Alabama in Huntsville
  • Email: milenka@ece.uah.edu
  • Web: http://www.ece.uah.edu/milenka
  •      http://www.ece.uah.edu/lacasa

2
Outline
  • Program Execution Traces
  • Trace Compression
  • Trace Compression in Hardware
  • Stream caches and predictors for instruction
    address trace compression
  • Data address stride caches for data address
    trace compression
  • Results
  • Conclusions

3
Program Execution Traces
  • Streams of recorded events
  • Basic block traces
  • Address traces
  • Instruction words
  • Operands
  • Trace uses
  • Computer architects for evaluation of new
    architectures
  • Computer analysts for workload characterization
  • Software developers for program tuning,
    optimization, and debugging

4
Instruction and Data Address Traces: An Example
for (i = 0; i < 100; i++) { c[i] = s*a[i] + b[i]; sum = sum + c[i]; }
Dinero Execution Trace (Type, Instruction Address, Data Address):
2 0x020001f4
0 0x020001f8 0xbfffbe24
0 0x020001fc 0xbfffbc94
2 0x02000200
2 0x02000204
2 0x02000208
2 0x0200020c
1 0x02000210 0xbfffbb04
2 0x02000214
Disassembly:
@ 0x020001f4  mov r1, r12, lsl #2
@ 0x020001f8  ldr r2, [r4, r1]
@ 0x020001fc  ldr r3, [r14, r1]
@ 0x02000200  mla r0, r2, r8, r3
@ 0x02000204  add r12, r12, #1   (1 >>> 0)
@ 0x02000208  cmp r12, #99       (99 >>> 0)
@ 0x0200020c  add r6, r6, r0
@ 0x02000210  str r0, [r5, r1]
@ 0x02000214  ble 0x20001f4
5
Trace Issues
  • Trace issues
  • Capture
  • Compression
  • Processing
  • Traces tend to be very large
  • In terabytes for a minute of program execution
  • Expensive to store, transfer, and use
  • Effective reduction techniques
  • Lossless
  • High compression ratio
  • Fast decompression

6
Outline
  • Program Execution Traces
  • Trace Compression
  • Trace Compression in Hardware
  • Stream caches and predictors for instruction
    address trace compression
  • Data address stride caches for data address
    trace compression
  • Results
  • Conclusions

7
Trace Compression
  • General purpose compression algorithms
  • Ziv-Lempel (gzip)
  • Burrows-Wheeler transformation (bzip2)
  • Sequitur
  • Trace specific compression techniques
  • Tuned to exploit redundancy in traces
  • Better compression, faster, can be further
    combined with general-purpose compression
    algorithms

8
Trace-Specific Compression Techniques
Lossless compression techniques for instruction traces and for combined instruction and data traces, with representative work:
  • Link data addresses to a dynamic basic block offset: Mache [Samples 1989], LBTC [Luo and John 2004]
  • Replace an execution sequence with its identifier: [Pleszkun 1994], SBC [Milenkovic and Milenkovic 2003]
  • Acyclic paths: WPP [Larus 1999], Time-Stamped WPP [Zhang and Gupta 2001]
  • N-tuples: [Milenkovic, Milenkovic, and Kulick 2003]
  • Instruction-level redundancy: PDI [Johnson, Ha, and Zaidi 2001]
  • Offset repetitions: PDATS [Johnson, Ha, and Zaidi 2001]
  • Control flow graph trace of transitions: QPT [Larus 1993]
  • Link data addresses to loops: [Elnozahy 1999], SIGMA [DeRose et al. 2002]
  • Regenerate addresses by abstract execution: [Eggers et al. 1990], [Larus 1993]
  • Value predictors: VPC [Burtscher and Jeeradit 2003], TCGEN [Burtscher and Sam 2005]
  • Graph with number of repetitions in nodes: [Hamou-Lhadj and Lethbridge 2002]
9
Outline
  • Program Execution Traces
  • Trace Compression
  • Trace Compression in Hardware
  • Stream caches and predictors for instruction
    address traces
  • Data address stride caches for data address
    traces
  • Results
  • Conclusions

10
Why Trace Compression in Hardware?
  • Problem 1: Capturing program traces
  • In software: trap after each instruction or taken
    branch
  • E.g., IBM's Performance Inspector
  • Slowdown > 100 times
  • Multiple cores on a single chip: more detailed
    information needed (e.g., time stamps of events)
  • Problem 2: Debugging is far from fun
  • Stop execution on breakpoints, examine the state
  • Time-consuming, difficult, may miss a critical
    state leading to erroneous behavior
  • Stopping the CPU may perturb the sequence of
    events, making bugs disappear
  • => Need an unobtrusive real-time tracing mechanism

11
Trace Compression in Hardware
  • Goals
  • Small on-chip area and small number of pins
  • Real-time compression (never stall the processor)
  • Achieve a good compression ratio
  • Solution
  • A set of compression algorithms targeting
    on-the-fly compression of instruction and data
    address traces

12
Exploiting Streams and Strides
  • Instruction address trace compression
  • Limited number and strong temporal locality of
    instruction streams
  • => Replace an instruction stream with its
    identifier
  • Data address trace compression
  • Spatial and temporal locality of data addresses
  • => Recognize regular strides

13
Trace Compressor System Overview
[System block diagram: the system under test (processor core plus memory) feeds the program counter, data addresses, and task-switch events into the on-chip trace compressor. Instruction streams go through the Stream Cache (SC), a predictor, and a byte-repetition FSM, producing the SCIT and SCMT traces; data addresses go through a data address buffer and the Data Address Stride Cache (DASC), followed by a byte-repetition FSM, producing the DT and DMT traces. A trace output controller drives the trace port toward an external trace unit for storing/processing (PC or intelligent drive).]
14
Outline
  • Program Execution Traces
  • Trace Compression
  • Trace Compression in Hardware
  • Stream caches and predictors for instruction
    address traces
  • Data address stride caches for data address
    traces
  • Results
  • Conclusions

15
Stream Detector and Stream Cache
[Diagram: the Stream Cache (SC) has NSET sets x NWAY ways; each entry holds a stream starting address (SA) and stream length (SL), and entry 0 is reserved. An incoming stream descriptor (S.SA, S.SL) from the instruction stream buffer, e.g. (0x020001f4, 0x09) for the example loop, selects a set via iSet = F(S.SA, S.SL) and is compared against the entries of that set. On a hit, the stream cache index (iSet.iWay), here 0x0E, is emitted to the Stream Cache Index Trace (SCIT); on a miss, the reserved value 0x00 is emitted to SCIT and the descriptor (SA, SL) goes to the Stream Cache Miss Trace (SCMT). For the example, iteration 0 emits 0x00 to SCIT and (0x020001f4, 0x09) to SCMT; iterations 1 through 99 each emit only the index 0x0E.]
16
SC Itrace Compression
  • Compress instruction stream
  • Get the next instruction stream record from the
    instruction stream buffer: (S.SA, S.SL)
  • Lookup in the stream cache with iSet = F(S.SA, S.SL)
  • if (hit)
  • Emit (iSet.iWay) to SCIT
  • else
  • Emit reserved value 0 to SCIT
  • Emit stream descriptor (S.SA, S.SL) to SCMT
  • Select an entry (iWay) in set iSet to be replaced
  • Update stream cache entry: SC[iSet][iWay].Valid = 1;
    SC[iSet][iWay].SA = S.SA; SC[iSet][iWay].SL = S.SL
  • Update stream cache replacement indicators
    (a software sketch follows below)
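  A minimal software sketch of this step (C, illustrative only: the set-mapping
  function, replacement choice, and byte-level output format are assumptions,
  not the exact hardware design):

    #include <stdint.h>
    #include <stdio.h>

    #define NSET 64                 /* illustrative stream cache geometry */
    #define NWAY 4

    typedef struct { int valid; uint32_t sa; uint8_t sl; } SCEntry;
    static SCEntry sc[NSET][NWAY];

    /* Illustrative stand-in for the set-mapping function F(S.SA, S.SL). */
    static unsigned sc_set(uint32_t sa, uint8_t sl) {
        return ((sa >> 6) ^ sl) & (NSET - 1);
    }

    /* Compress one stream descriptor (S.SA, S.SL) into the SCIT/SCMT streams. */
    void sc_compress(uint32_t sa, uint8_t sl, FILE *scit, FILE *scmt) {
        unsigned set = sc_set(sa, sl);
        for (unsigned way = 0; way < NWAY; way++) {
            if (sc[set][way].valid && sc[set][way].sa == sa && sc[set][way].sl == sl) {
                fputc((int)(set * NWAY + way), scit);   /* hit: emit stream identifier  */
                return;
            }
        }
        fputc(0, scit);                         /* miss: reserved value 0 to SCIT        */
        fwrite(&sa, sizeof sa, 1, scmt);        /* descriptor (SA, SL) to SCMT           */
        fputc(sl, scmt);
        unsigned victim = (set == 0) ? 1 : 0;   /* keep entry 0 reserved; a real design  */
        sc[set][victim] = (SCEntry){1, sa, sl}; /* would use FIFO or pseudo-LRU here     */
    }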

Design Decisions
  • Stream cache
  • Size
  • Associativity
  • Replacement policy
  • Mapping function
  • Instruction Stream Buffer size
  • Not to stall processor (e.g., have consecutive
    very short instruction streams)

17
SC Itrace Compression: An Analytical Model
  • Legend
  • CR(SC.I): compression ratio
  • N: number of instructions
  • SL.Dyn: average stream length (dynamic)
  • SC.Hit(Nset, Nway): SC hit rate
  • Assumptions
  • stream length < 256 (1 byte for SL)
  • 4 bytes for stream starting address
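  The model itself appeared as an equation image on the slide. A plausible
  reconstruction from the legend and assumptions (4 bytes per instruction
  address in the raw trace; per stream, 1 SCIT byte plus a 5-byte SCMT record,
  4-byte SA + 1-byte SL, on a stream cache miss):

    CR(SC.I) \approx \frac{4N}{\frac{N}{SL.Dyn}\,\bigl(1 + 5\,(1 - SC.Hit(N_{set},N_{way}))\bigr)}
            = \frac{4\, SL.Dyn}{1 + 5\,(1 - SC.Hit)}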

18
2nd Level Itrace Compression
  • Size(SCIT) >> Size(SCMT)
  • Hit rate ~98% with an 8-bit index => Size(SCIT) is
    roughly 10 x Size(SCMT)
  • Redundancy in SCIT
  • Temporal and spatial locality of instruction
    streams
  • Reduce SCIT trace
  • Global Predictor
  • N-tuple compression using Tuple History Table
  • N-tuple compression using SCIT History Buffer

19
Global Predictor Structure
[Diagram: incoming SCIT indices (next.sid) are shifted into a history buffer. A function F over the history buffer contents produces pindex, which selects one of MaxP predictor entries. The predicted index is compared with next.sid; the hit/miss outcome is emitted to the SCIT PRED trace, and on a miss next.sid is emitted to the SCIT PRED miss trace.]
20
SCIT Compression
  • Predict SCIT index
  • Get the incoming index, next.sid, from the SCIT
    trace
  • Calculate the SCIT predictor index, pindex, using
    the indices in the History Buffer: pindex = F
    (indices in the History Buffer)
  • Perform a lookup in the SCIT Predictor with pindex
  • if (SCIT.Predictor[pindex] == next.sid)
  • Emit (1) to SCIT PRED trace
  • else
  • Emit (0) to SCIT PRED trace
  • Emit next.sid to SCIT PRED miss trace
  • SCIT.Predictor[pindex] = next.sid
  • Shift next.sid into the History Buffer
    (a software sketch follows below)
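  A minimal sketch of this step (C), assuming a history length of one so that
  pindex is simply the previous SCIT index (the configuration the results later
  suggest); names and the byte-level output are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define MAXP 256

    static uint8_t predictor[MAXP];  /* predicted next SCIT index per history value */
    static uint8_t history;          /* history length 1: the previous SCIT index   */

    /* Second-level compression of one SCIT index. */
    void scit_predict(uint8_t next_sid, FILE *pred_trace, FILE *pred_miss_trace) {
        unsigned pindex = history;             /* pindex = F(history), history length 1 */
        if (predictor[pindex] == next_sid) {
            fputc(1, pred_trace);              /* hit: a single bit in hardware         */
        } else {
            fputc(0, pred_trace);
            fputc(next_sid, pred_miss_trace);  /* miss: emit the actual index           */
            predictor[pindex] = next_sid;
        }
        history = next_sid;                    /* shift next.sid into the history       */
    }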

Design Decisions
  • Length of history buffer
  • Global predictor
  • Size
  • Mapping function

21
Redundancy in SCIT Pred Trace
  • High predictor hit rates and long runs of 0xFF
    bytes are expected in Predictor Hit Trace
  • Use a simple FSM to exploit byte repetitions

PRED Hit Trace
// Detect byte repetitions in SCIT PRED trace
1. Get next SCIT Pred byte, Next.BYTE
2. if (Next.BYTE == Prev.BYTE) CNT++
3. else
4.   if (CNT == 0)
5.     Emit Prev.BYTE to SCIT.REP.Trace
6.     Emit 0 to SCIT Header
7.   else
8.     Emit (Prev.BYTE, CNT) pair to SCIT.REP.Trace
9.     Emit 1 to SCIT Header
10. Prev.BYTE = Next.BYTE
[Diagram: the FSM keeps Prev.BYTE and CNT registers and emits to the SCIT PRED Repetition Trace and the SCIT PRED Header.]
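  A software sketch of the repetition detector (C); the same structure is reused
  for the DT trace on a later slide. Resetting CNT when the byte changes and
  flushing the final byte at the end of the trace are assumptions not spelled
  out on the slide:

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t prev;     /* Prev.BYTE                       */
    static unsigned cnt;     /* CNT: extra repetitions of prev  */
    static int have_prev;

    /* Feed one byte of the predictor hit trace. */
    void brep_next(uint8_t next, FILE *rep_trace, FILE *header) {
        if (have_prev && next == prev) {
            cnt++;                              /* same byte again: just count it  */
            return;
        }
        if (have_prev) {
            if (cnt == 0) {
                fputc(prev, rep_trace);         /* single occurrence               */
                fputc(0, header);               /* header bit 0: no count follows  */
            } else {
                fputc(prev, rep_trace);         /* (byte, count) pair              */
                fputc((int)cnt, rep_trace);
                fputc(1, header);               /* header bit 1: count follows     */
            }
        }
        prev = next;                            /* Prev.BYTE = Next.BYTE           */
        cnt = 0;
        have_prev = 1;
    }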
22
Outline
  • Program Execution Traces
  • Trace Compression
  • Trace Compression in Hardware
  • Stream caches and predictors for instruction
    address traces
  • Data address stride caches for data address
    traces
  • Results
  • Conclusions

23
Data Address Trace Compression
  • More challenging task
  • Data addresses rarely stay constant during
    program execution
  • However, they often have a regular stride
  • => Use Data Address Stride Cache (DASC) to
    exploit locality of memory referencing
    instructions and regularity in data address
    strides

24
Data Address Stride Cache
  • DASC
  • Tagless structure
  • Indexed by the PC of the corresponding instruction
  • Entry fields
  • LDA: Last Data Address
  • Stride
[Diagram: the DASC has N entries, each holding an LDA and a Stride. The PC of a memory-referencing instruction (e.g., 0x020001f8) is mapped to an index by G(PC); the current data address DA is checked by testing whether DA - LDA equals the stored stride. The resulting Stride.Hit bit is emitted to the data trace (DT), and on a miss the full data address (e.g., 0xbfffbe24) is emitted to the data miss trace (DMT). For the example load stepping through 0xbfffbe24, 0xbfffbe20, 0xbfffbe1c, ..., the first accesses miss and the rest hit once the stride has been captured.]
25
DASC Compression
  • // Compress data address stream
  • Get the next pair from the data buffers: (PC, DA)
  • Lookup in the data address stride cache: iSet =
    G(PC)
  • cStride = DA - DASC[iSet].LDA
  • if (cStride == DASC[iSet].Stride)
  • Emit (1) to DT // 1-bit info
  • else
  • Emit (0) to DT
  • Emit DA to DMT
  • DASC[iSet].Stride = lsb(cStride)
  • DASC[iSet].LDA = DA
    (a software sketch follows below)
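  A minimal sketch of the DASC step (C), assuming a 1K-entry tagless table, a
  1-byte signed stride field, and a simple illustrative G(PC):

    #include <stdint.h>
    #include <stdio.h>

    #define DASC_ENTRIES 1024

    typedef struct { uint32_t lda; int8_t stride; } DASCEntry;
    static DASCEntry dasc[DASC_ENTRIES];

    /* Illustrative index function G(PC): drop the low bits of a word-aligned PC. */
    static unsigned dasc_index(uint32_t pc) { return (pc >> 2) & (DASC_ENTRIES - 1); }

    /* Compress one (PC, DA) pair; hit/miss bits go to DT, raw misses to DMT. */
    void dasc_compress(uint32_t pc, uint32_t da, FILE *dt, FILE *dmt) {
        unsigned i = dasc_index(pc);
        int32_t cstride = (int32_t)(da - dasc[i].lda);
        if (cstride == dasc[i].stride) {
            fputc(1, dt);                       /* hit: a 1-bit record in hardware  */
        } else {
            fputc(0, dt);
            fwrite(&da, sizeof da, 1, dmt);     /* miss: full data address to DMT   */
            dasc[i].stride = (int8_t)cstride;   /* keep the least significant byte  */
        }
        dasc[i].lda = da;                       /* always remember the last address */
    }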

Design Decisions
  • Number of entries
  • Index function G
  • Stride length
  • Data address buffer depth

26
DASC Dtrace Compression: An Analytical Model
  • Legend
  • CR(SC.D): compression ratio
  • Nmemref: number of memory referencing
    instructions
  • DASC.Hit: DASC hit rate
  • Assumptions
  • 4 bytes per data address
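  The equation itself was an image on the slide; a plausible reconstruction,
  assuming 4 bytes per data address in the raw trace, one DT bit per memory
  reference, and a 4-byte DMT record per DASC miss:

    CR(SC.D) \approx \frac{4\, N_{memref}}{N_{memref}\,\bigl(\tfrac{1}{8} + 4\,(1 - DASC.Hit)\bigr)}
            = \frac{4}{\tfrac{1}{8} + 4\,(1 - DASC.Hit)}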

27
Redundancy in DT Trace
  • High predictor hit rates and long runs of 0xFF
    bytes are expected in DT Trace
  • Use a simple FSM to exploit byte repetitions

DT
// Detect data repetitions
1. Get next DT byte
2. if (DT == Prev.DT) CNT++
3. else
4.   if (CNT == 0)
5.     Emit Prev.DT to DRT
6.     Emit 0 to DH
7.   else
8.     Emit (Prev.DT, CNT) pair to DRT
9.     Emit 1 to DH
10. Prev.DT = DT
[Diagram: the FSM keeps Prev.DT and CNT registers and emits to the Data Repetition Trace (DRT) and the Data Header (DH); it is the same byte-repetition FSM used for the SCIT PRED trace.]
28
Outline
  • Program Execution Traces
  • Trace Compression
  • Trace Compression in Hardware
  • Stream caches and predictors for instruction
    address traces
  • Data address stride caches for data address
    traces
  • Results
  • Conclusions

29
Experimental Evaluation
  • Goals
  • Assess the effectiveness of the proposed
    algorithms
  • Explore the feasibility of the proposed hardware
    implementations
  • Determine optimal size and organization of HW
    structures
  • Workload
  • 16 MiBench benchmarks
  • ARM architecture
  • Legend
  • IC: Instruction count
  • NUS: Number of unique instruction streams
  • maxSL: Maximum stream length
  • SL.Dyn: Average stream length (dynamic)

30
Findings about SC Size/Organization
  • Good compression ratio
  • Outperforms fast GZIP
  • High stream cache hit rates for all applications
    (> 98%)
  • Smaller SCs work well too
  • Replacement policy
  • Pseudo-LRU vs. FIFO
  • Associativity
  • 4-way is a reasonable choice
  • 8-way and 16-way desirable
  • Mapping function
  • S.SA<n+5:6> xor S.SL<n-1:0>, n = log2(NSET)
    (see the sketch below)
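  As a concrete reading of this bit selection (illustrative C; n = log2(NSET)):

    /* iSet = S.SA<n+5:6> xor S.SL<n-1:0>; e.g., NSET = 64 gives n = 6. */
    static unsigned sc_set_index(unsigned sa, unsigned sl, unsigned n) {
        unsigned mask = (1u << n) - 1u;
        return ((sa >> 6) & mask) ^ (sl & mask);
    }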

31
Findings about Global Predictor
  • Number of entries should not exceed the number
    of entries in SC
  • Longer histories and larger predictors give only
    marginal improvements for all applications except
    ghostscript, blowfish, and stringsearch
  • History length = 1
  • Index GPRED using the previous SCIT index

32
Putting It All Together (SC + GPRED + BREP): Itrace Compression
33
Findings about DASC
  • Stride size
  • 1 byte is optimal
  • A 2-byte stride improves compression for ? 10
  • DASC with 1K entries is an optimal choice
  • Tagged (multi-way) DASC further improves overall
    compression ratio
  • Increased complexity

34
DASC Compression Ratio
35
Hardware Complexity Estimation
Component                        Entries  Complexity  Bytes
Instruction stream buffer        2        2 x 5       10
Stream detector                  2        2 x 4       8
Stream cache                     64 x 4   256 x 5     1280
Global predictor                 256      256 + 1(h)  257
Data address buffer              8        8 x 8       64
Data address stride cache        1024     1024 x 5    5120
Byte repetition state machines   -        4           4
  • CPU model
  • In-order, XScale-like
  • Vary SC and DASC parameters
  • SC and DASC timings
  • SC: hit latency 1 clock, miss latency 2 clocks
  • DASC: hit latency 2 clocks, miss latency 2 clocks
  • To avoid any stalls
  • Instruction stream input buffer: MIN 2 entries
  • Data address input buffer: MIN 8 entries
  • Results are relatively independent of SC and DASC
    organization

36
Trace Port Bandwidth Analysis
37
Outline
  • Program Execution Traces
  • Trace Compression
  • Trace Compression in Hardware
  • Stream caches and predictors for instruction
    address traces
  • Data address stride caches for data address
    traces
  • Results
  • Conclusions

38
Conclusions
  • A set of algorithms and hardware structures for
    instruction and data address trace compression
  • Stream Cache + Global Predictor + Byte-repetition
    FSM for instruction traces
  • Data Address Stride Cache + Byte-repetition FSM
    for data traces
  • Benefits
  • Enables real-time trace compression with a high
    compression ratio
  • Low complexity (small structures, small number of
    external pins)
  • Analytical and simulation analysis focusing on
    compression ratio and optimal sizing/organization
    of the structures, as well as real-time trace
    port bandwidth requirements

39
Laboratory for Advanced Computer Architectures
and Systems at Alabama: Research Overview
  • Aleksandar Milenkovic
  • The LaCASA Laboratory
  • Electrical and Computer Engineering Department
  • The University of Alabama in Huntsville
  • Email: milenka@ece.uah.edu
  • Web: http://www.ece.uah.edu/milenka
  •      http://www.ece.uah.edu/lacasa

40
Secure Processors
Software and physical attacks
Computer Security is Critical
Sign and Verify for Guaranteed Integrity and
Confidentiality of Code
Improvements
  • PMAC (Parallel MACs) for reduced cryptographic
    latency
  • A variation of the one-time-pad for code
    encryption
  • Instruction Verification Buffer for conditional
    execution before verification

http://www.ece.uah.edu/lacasa/research.htm#secure_processors
41
Microbenchmarks for Architectural Analysis
  • Small programs for uncovering architectural
    parameters (usually not publicly disclosed) of
    modern processors
  • Relatively simple, so their behavior can be
    understood
  • Benefits
  • Architecture-aware compiler optimization
  • Processor design evaluation and verification
  • Testing
  • Competitive analysis

Microbenchmarks
  • Results
  • Microbenchmarks for BTB analysis
  • Experimental flow for outcome predictor
  • Tested on P6 and NetBurst (Northwood core)

  • Challenge
  • Dothan (Pentium M) predictor
[Diagram: the experimental flow reads branch-related events from performance counters to infer BTB size, organization, and indexing, as well as the local and global history used by the outcome predictor.]
http://www.ece.uah.edu/lacasa/bp_mbs/bp_microbench.htm
42
TinyHMS
[Figure: TinyHMS concept, prototype, and software.]
http://www.ece.uah.edu/lacasa/research.htm#tinyHMS
43
TinyHMS