Lock Behaviour Characterization of Commercial Workloads

1
Lock Behaviour Characterization of Commercial
Workloads
<chang@cs.wisc.edu> <wxd@cs.wisc.edu>
Jichuan Chang Xidong Wang
2
Outline
  • Motivation
  • Methods
  • Results
  • Speculative Lock Elision Issues
  • Conclusions

3
Motivation
Understanding the Synchronization Behavior
of Commercial Workloads (OLTP, Apache, SpecJBB)
Identifying Opportunities for Speculative Lock
Elision (performance, ease of programming)
4
Questions to Answer
  • Lock related statistics
  • Can hardware identify critical sections?
  • Critical section size
  • Lock-free section size
  • Amount of lock contentions
  • Hardware optimizations by speculation
  • Context switching implications
  • Resource requirements
  • Other issues
  • Realistic timing model
  • Other synchronization (reader/writer, etc)

5
Methods
  • Benchmarks
  • OLTP, Apache, JBB, Barnes (for comparison)
  • Full system simulation (tracing) using Simics
  • Simple timing model - Simics tracer
  • Ruby timing model - Simics Ruby
  • Using instr (not cycle) as the measurement unit
  • Set cpu_switch_time to 1, disable STC
  • Validating our approach
  • Using micro-benchmarks, to compare our stats with
    the result reported by kernel tools (lockstat)
  • Tracing into disassembly code (kernel/user)

6
Lock Identification
  • Basic idea from SLE
  • Lock acquisition must use one atomic instruction.
  • Silent store pair: taken as a pair, the stores in the
    lock acquisition and release operations are silent.
  • SPARC v9 atomic instructions
  • ldstub, swap, casa (compare-and-swap)

Example lock sequences and the values stored to the lock:
OLTP: ldstub o0 g0, o4 / brnz,pn o4, <0x10034b98> / stbar ... stb g0, o0 12
  Values: 0x0 -> 0xff, 0xff -> 0x0
JBB: casa l2 128, g4, g3 ... casa l2 128, l0, g4
  Values: 0x1 -> 0x8410f8bc, 0x8410f8bc -> 0x1
7
Lock Identification Algorithm
  • Starts with an atomic instruction
  • that writes back a different value to the lock
  • otherwise meaning unsuccessful lock acquisition
  • Examine each following store made by the same CPU
  • Until we meet a normal store
  • that completes the silent store pair
  • usually with the value of 0x0
  • Other completion patterns
  • Self-release (by the same CPU)
  • using atomic instruction, pair-silently (JBB)
  • using atomic instruction, not pair-silently
  • Cross-release (by a different CPU)
  • using atomic instruction
  • Removed: cases where we can't observe the lock
    release (16K limited window).
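The identification steps above can be sketched as a single pass over a store trace. This is a minimal illustration, not the authors' tool: the `Store` record, the `find_critical_sections` name, and the reading of the 16K window as a distance in trace positions are all assumptions made for the example. It only handles pair-silent completions; releases that are not pair-silent are left out.

```python
from dataclasses import dataclass

@dataclass
class Store:
    seq: int      # position in the trace, in instructions (assumed unit)
    cpu: int
    addr: int
    old: int      # value in memory before the store
    new: int      # value written by the store
    atomic: bool  # True for ldstub / swap / casa

def find_critical_sections(trace, window=16 * 1024):
    """Return (addr, acquiring cpu, start seq, end seq, release kind) tuples."""
    open_locks = {}  # addr -> Store that looks like a successful acquisition
    sections = []
    for st in trace:
        acq = open_locks.get(st.addr)
        # Give up on a candidate whose release was not seen inside the window.
        if acq is not None and st.seq - acq.seq > window:
            del open_locks[st.addr]
            acq = None
        if acq is None:
            # An atomic that writes back a different value is a candidate
            # acquisition; writing the same value means the attempt failed.
            if st.atomic and st.new != st.old:
                open_locks[st.addr] = st
        elif st.new == acq.old:
            # The store restores the pre-acquisition value: silent pair complete.
            if st.cpu != acq.cpu:
                kind = "cross-release"
            elif st.atomic:
                kind = "self-release (atomic)"
            else:
                kind = "normal store"
            sections.append((st.addr, acq.cpu, acq.seq, st.seq, kind))
            del open_locks[st.addr]
    return sections
```

With an ldstub-style trace (0x0 -> 0xff, then 0xff -> 0x0 by a plain store) the sketch reports one section closed by a normal store; a casa pair that restores the original value is reported as an atomic self-release.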

8
Lock Frequency
9
Execution Phase Breakdown
10
Critical Section Size
11
Lock-free Section Size
12
Timing Models
  • Adding Ruby doesn't change the size of critical
    sections and lock-free sections, but removes lock
    contention.
  • Why?
  • Shrinking caused by less frequent memory
    accesses within critical sections
  • or simulation effect?
  • Guess: more shrinking using Ruby and Opal

13
Lock Contention
  • Waiting: from the first try to successful
    acquisition
  • Spinning: ignore those that have been waiting for
    more than 4K instructions.

14
Distinguishing wait and spin
  • Why bother?
  • Very few long-waiting events make big difference
    in the percentages of wasted instructions
  • Easy if we can identify thread switching
  • But the identification is not easy
  • Count as waiting if spinning for too many instructions
  • Using 4096 instructions as the limit
  • 90% of contentions are shorter than 4K instr
  • It makes sense for different timing models.
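The wait/spin split described above can be sketched as a threshold on the number of instructions between the first acquisition attempt and the successful one. The function names and the `(first_try, acquired)` event format are hypothetical; only the 4096-instruction limit comes from the slides.

```python
SPIN_LIMIT = 4096  # instructions; contentions longer than this count as waiting

def classify(first_try, acquired, limit=SPIN_LIMIT):
    """Label one contention event and return how long it lasted."""
    waited = acquired - first_try
    kind = "spin" if waited <= limit else "wait"
    return kind, waited

def summarize(events, limit=SPIN_LIMIT):
    """Sum instructions per category and report the fraction of short events."""
    totals = {"spin": 0, "wait": 0}
    counts = {"spin": 0, "wait": 0}
    for first_try, acquired in events:
        kind, waited = classify(first_try, acquired, limit)
        totals[kind] += waited
        counts[kind] += 1
    n = counts["spin"] + counts["wait"]
    short_fraction = counts["spin"] / n if n else 0.0
    return totals, short_fraction
```

Keeping the two totals separate shows why the distinction matters: a single long event can outweigh many short ones in the instruction totals while barely moving the event counts.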

15
Lock Contention: Most Contended Locks
16
SLE on Commercial Workloads
  • Context switching (later)
  • Buffering requirement: not much
  • Small critical sections dominate
  • Except for Apache user locks (1-8K)
  • Single shared buffer among threads on the same
    CPU
  • Possible performance gain
  • Not big if only counting the number of instructions
    (1% - 6%)
  • Critical section size already small
  • Contention already infrequent
  • Can be larger if lock spinning latency increases
  • Can be smaller if fewer lock contentions happen
    (as in the Ruby case)
  • Must throttle speculation (to avoid unnecessary
    rollbacks)

17
Context Switch
  • Why bother?
  • Needed to precisely quantify the number of
    instructions spent on lock waiting (process and
    thread switching)
  • Needed to correctly implement speculative lock
    elision (process switching only)
  • Process Switching Identification
  • Marker: Demap TLB on context switch
  • Apache (100 transactions, CPU 3)
  • Average 210K instructions (Max 360K, Min
    160K)
  • Process switches are infrequent; the performance
    implication is negligible
  • Thread Switching Identification is hard
  • No simple patterns to observe, No feedback to
    validate assumptions
  • Not a good idea to provide a separate buffer for
    each thread on a single processor: conflicts are
    hard to detect, and thread switching needs many
    buffers.
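The process-switch interval figures above (average, max, min) can be derived from the trace positions of the marker events. The sketch below assumes the positions have already been extracted into a sorted list; the function name and the sample numbers in the usage note are made up for illustration.

```python
def switch_intervals(marker_positions):
    """Summarize gaps, in instructions, between consecutive switch markers."""
    gaps = [b - a for a, b in zip(marker_positions, marker_positions[1:])]
    if not gaps:
        return None  # fewer than two markers: no interval to measure
    return {"avg": sum(gaps) / len(gaps), "max": max(gaps), "min": min(gaps)}
```

For example, markers at positions 0, 100, 350 and 500 give gaps of 100, 250 and 150 instructions.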

18
Other Synchronization Algorithms
  • Hard to recognize complex synchronization
  • Barriers, reader/writer locks, etc
  • Mutual exclusion implementations are composed of
    small critical sections
  • pthread_mutex_lock(lock) acquires 3 locks
  • Reader/writer locks use locks to maintain data
    structures (reader/writer queues, number of current
    readers, etc)

Figure: serialized execution (maintained by the synch. algo.) between
writer_enter() and writer_exit(); HW only sees two small critical sections
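To make the writer_enter() / writer_exit() point concrete, here is a minimal reader/writer lock sketch. It is an assumption-laden illustration in Python, not the implementation that was traced: the `RWLock` class and `run_writers` helper are hypothetical. The internal lock is held only while the bookkeeping fields are updated, so the long serialized region between enter and exit runs without any lock that hardware could recognize.

```python
import threading

class RWLock:
    def __init__(self):
        self._cond = threading.Condition()  # small internal lock plus wait queue
        self._readers = 0                   # number of current readers
        self._writer = False                # True while a writer is inside

    def reader_enter(self):
        with self._cond:                    # small critical section
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def reader_exit(self):
        with self._cond:                    # small critical section
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def writer_enter(self):
        with self._cond:                    # small critical section 1
            while self._writer or self._readers:
                self._cond.wait()           # releases the internal lock while blocked
            self._writer = True

    def writer_exit(self):
        with self._cond:                    # small critical section 2
            self._writer = False
            self._cond.notify_all()

def run_writers(n_threads=4, n_iters=1000):
    """Increment a shared counter from several writers; returns the final count."""
    lock = RWLock()
    shared = {"count": 0}

    def work():
        for _ in range(n_iters):
            lock.writer_enter()
            shared["count"] += 1            # serialized region, internal lock not held
            lock.writer_exit()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

The counter update is protected only by the `_writer` flag, which is the kind of serialization a silent-store-pair detector would not see.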
19
Conclusion
  • Commercial workloads lock characterization
  • Small critical sections dominate
  • Infrequent lock contention
  • User and kernel code behave differently
  • Kernel locks can't be ignored
  • (Kernel) contended PCs are predictable
  • Performance Improvements
  • SLE won't help as much

20
Thank You! Questions?
21
Backup Slides
  • Thread switching details
  • Critical section size using Ruby timing model
  • Sparc Atomic Instructions
  • Misc Issues
  • Acknowledgement

22
Thread Switch Identification
  • User thread scheduling
  • Disassemble the user thread library and observe
    execution of scheduling methods (_disp, _switch);
    not always possible!
  • Kernel thread scheduling
  • Involves a set of interleaved method invocations
    (resume, disp, swtch, _resume_from_idle..). Hard
    to identify the starting and ending points of a
    thread switch
  • Impossible to identify a kernel thread switch by
    only observing register window swaps, since they
    also happen in user thread switches
  • No feedback from OS to validate our assumption
  • Methodology: Preliminary Observations
  • Disassemble kernel code to build a VA -> kernel
    method map. Observe the method control flow in
    the Simics trace.
  • resume may indicate a kernel thread switch
  • user_rtt may indicate a user level thread switch.
  • Conclusion: Thread switch identification is a
    hard, unresolved issue

23
Critical Section Size (Ruby)
24
Sparc Atomic Instructions
  • ldstub
  • Load a byte and write all 1s (0xff) into it
  • swap
  • Swap the value of the reg and the mem location
  • Compare-and-swap (casa)
  • Swap if (value in the 1st reg == value in mem)
  • Membar/stbar
  • Usually follows such atomic instructions
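The three atomic operations listed above can be modelled on a dictionary standing in for memory. This is a simplified sketch of their semantics: each function body is assumed to execute atomically, memory barriers (membar/stbar) are not modelled, and the function names are chosen for the example.

```python
def ldstub(mem, addr):
    """Load the byte at addr, then store all 1s (0xff) into it."""
    old = mem[addr]
    mem[addr] = 0xFF
    return old

def swap(mem, addr, reg):
    """Exchange a register value with the memory location; return the old value."""
    old = mem[addr]
    mem[addr] = reg
    return old

def cas(mem, addr, expected, new):
    """Store new only if memory holds expected; always return the old value."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new
    return old

def try_acquire(mem, addr):
    """Spin-lock style attempt: succeeds only if the byte was 0 before ldstub."""
    return ldstub(mem, addr) == 0

def release(mem, addr):
    """Plain store of 0, restoring the pre-acquisition value (silent pair)."""
    mem[addr] = 0
```

A failed `try_acquire` writes 0xff over 0xff, which is why an atomic that writes back the same value can be read as an unsuccessful acquisition.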

25
Misc.
  • Why is Apache strange?
  • Locks are more frequent, few user locks (1-2)
  • Large percentage of critical section instruction
  • Nested Locks
  • Intertwined Locks
  • Critical sections in Barnes are more clustered
  • Buffer size 29 30 1/3 64 Blocks
  • The same as SLE

26
Acknowledgement
  • Project suggested by Prof. Mark Hill
  • Guiding and supporting
  • Lots of discussion with and help from
  • Min Xu, our TA
  • Carl Mauer, Multifacet simulator expert
  • Ravi Rajwar, SLE paper author