Catching Accurate Profiles in Hardware

Slides: 24
Provided by: deptinform
Learn more at: http://www.cecs.uci.edu

1
Catching Accurate Profiles in Hardware
ICS 280/259
  • Satish Narayanasamy, Timothy Sherwood, Suleyman
    Sair, Brad Calder, George Varghese

Presented by Jelena Trajkovic
2
Outline
  • Introduction and Motivation
  • Goal
  • Related Work (Stratified Sampler)
  • Interval-based Profiling for a Single Hash
    Profiler
  • Experimental results
  • Multiple-hash Profiler
  • Experimental results

3
Introduction and Motivation
  • SW used to gather program behavior information
  • Architectural support for generating profiles at
    run-time
  • HW is used to assist SW
  • dependent on system SW (for management or
    aggregation of events)
  • HW-only profiler

4
Introduction and Motivation (cont.)
  • HW optimizations that can take advantage of info
    gathered at run-time
  • Cache replacement and prefetching
  • identifying loads that cause the majority of misses
  • Value-based optimization
  • 50% of memory accesses are dominated by 10
    distinct values
  • capture this dynamically -> this information is
    used for storing compressed values in data cache
  • Trace formation
  • dynamically extracting and ordering frequently
    executed code -> I-fetch more efficient
  • Multiple path execution
  • find branches that are hard to predict and
    execute down multiple paths

5
Goal
  • The goal is to build a profiling scheme that
    satisfies the following properties
  • Area Efficient: capacity constraints (fixed
    amount of area)
  • Accurate: identify important / frequent events
    and count them accurately
  • Timely: up-to-date information about program
    behavior
  • Performance Efficiency and SW Independence:
    independent of system SW support to manage
    profiles (accumulate and analyze events);
    important events are identified in HW

6
Related Work
  • SW profiling
  • Binary instrumentation (ATOM by Calder et al.)
  • HW counter assisted profiling
  • DCPI system for Alpha Processors
  • HW table based profiling
  • Stratified sampling (Sastry et al.)
  • Co-processor profiler
  • Distill information passed from main processor
    (Zilles and Sohi)

7
Profiling Events
  • Profiling event: combination of several variables
  • instruction PC, load address, register value or
    name, cache miss
  • Tuple: represents an event as a combination of 2
    variables
  • <pc, value>

8
Related Work: Stratified Sampler
  • Divides the original input stream into multiple
    streams via hashing (independently sampled)
  • Table of counters
  • number of occurrences of different events
  • counter is selected by applying hash function on
    the input event
  • incremented when event appears in the input
    stream
  • on reaching threshold value, counter is reset
    and event is reported (interrupt to the OS)
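The counter-table behavior described on this slide can be sketched roughly as below. This is only an illustration under stated assumptions: the class name, the table size, the threshold, and the XOR-based index are all hypothetical placeholders, and a plain list stands in for the interrupt sent to the OS.

```python
class StratifiedSamplerSketch:
    """Untagged counter table: hash the event, bump a counter, report at threshold."""

    def __init__(self, num_counters=256, threshold=4):
        self.counters = [0] * num_counters
        self.threshold = threshold
        self.reported = []  # assumption: stands in for the interrupt to the OS

    def _index(self, event):
        # Placeholder hash (assumption): XOR of the two tuple fields.
        pc, value = event
        return (pc ^ value) % len(self.counters)

    def observe(self, event):
        i = self._index(event)
        self.counters[i] += 1
        if self.counters[i] >= self.threshold:
            # On reaching the threshold, reset the counter and report the event.
            self.counters[i] = 0
            self.reported.append(event)


sampler = StratifiedSamplerSketch(num_counters=16, threshold=3)
for _ in range(7):
    sampler.observe((0x400, 5))
print(len(sampler.reported))
```

Because the table has no tags, two different tuples that hash to the same counter are indistinguishable here, which is the aliasing problem the next slide addresses.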

9
Related Work: Stratified Sampler (cont.)
  • To reduce aliasing and improve accuracy
  • Partial tags, miss counters, state information
  • Hit counters: number of occurrences
  • Miss counters: tuple hashes to a particular entry,
    but the tag differs (replacement policy)
  • On reaching threshold value
  • Generate interrupt
  • Buffered: interrupt is sent when buffer fills up
  • Placed in associative counter table, passed to SW
    (via intermediate buffer)
  • Accumulating information in SW (5% interrupt
    overhead)

10
Interval-based Profiling for a Single Hash Profiler
  • Removing SW accumulator table
  • Interval-based
  • significant number of occurrences within
    interval
  • reset hash-table counters after every interval
  • improving accuracy - shielding
  • Divide execution time into intervals
  • interval length: fixed number of profiling
    events (tuples)
  • capture only events (candidate tuples) that occur
    more than the candidate threshold (% of interval
    length)

11
Single Hash Architecture
  • accumulator table is fully associative and tagged
  • if (input tuple is in acc. table)
  • increment its counter
  • else
  • hash into hash-table
  • increment corresponding counter
  • hash-table does not contain tags -> aliasing
  • if (tuple reaches candidate threshold value)
  • if (acc. table is not full)
  • acc. table entry is allocated
  • mark entry as non-replaceable till the end of
    interval
  • that entry is no longer given as an input to the
    hash-table -> shielding
  • if (end of the interval)
  • flush hash-table
  • mark all entries in acc. table as replaceable
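The control flow on this slide can be sketched as follows. It is a simplified model rather than the presented hardware: the class and method names are hypothetical, the XOR index is a placeholder hash, the candidate threshold is given as an absolute count (for example 100 for 1% of a 10,000-event interval), and at the end of an interval the accumulator is simply cleared instead of having its entries marked replaceable.

```python
class SingleHashProfiler:
    """Interval-based profiler: untagged hash table plus tagged accumulator table."""

    def __init__(self, interval_length=10_000, candidate_threshold=100, hash_entries=2048):
        self.interval_length = interval_length
        self.threshold = candidate_threshold
        # Worst case: at most interval_length / threshold tuples can be candidates.
        self.max_acc_entries = interval_length // candidate_threshold
        self.hash_table = [0] * hash_entries
        self.acc_table = {}  # tuple -> count; a dict models the tagged, associative table
        self.events_seen = 0

    def _index(self, event):
        pc, value = event  # placeholder hash (assumption)
        return (pc ^ value) % len(self.hash_table)

    def observe(self, event):
        """Feed one tuple; returns the interval's profile when the interval ends."""
        if event in self.acc_table:
            # Shielding: tuples already in the accumulator bypass the hash table.
            self.acc_table[event] += 1
        else:
            i = self._index(event)
            self.hash_table[i] += 1
            if (self.hash_table[i] >= self.threshold
                    and len(self.acc_table) < self.max_acc_entries):
                self.acc_table[event] = self.hash_table[i]
        self.events_seen += 1
        if self.events_seen == self.interval_length:
            return self.end_interval()
        return None

    def end_interval(self):
        profile = dict(self.acc_table)
        self.hash_table = [0] * len(self.hash_table)  # flush hash table
        self.acc_table.clear()  # simplification of "mark all entries replaceable"
        self.events_seen = 0
        return profile
```

Since the hash table is untagged, a cold tuple that aliases with a hot tuple's counter can still be promoted in this sketch; that is the kind of error the later multiple-hash design targets.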

12
Single Hash Architecture (cont.)
  • Calculate worst case number of entries in the
    acc. table (avoid capacity and aliasing issues)
    as a function of profile interval length and
    candidate threshold
  • profile interval length: number of events that
    determine the profiling interval
  • candidate threshold: number of occurrences needed
    to get recorded in acc. table (percentage of
    interval length)
  • e.g. interval length 10,000
  • candidate threshold 1% -> 100 entries
  • 0.1% -> 1,000 entries
  • 10,000 w/ 1% and 1 million w/ 0.1%
  • Hash-table: 2K entries
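The worst-case arithmetic above can be checked with a few lines. The function name is hypothetical; the reasoning is that every candidate must occur at least threshold × interval-length times, so no more than 1 / threshold distinct tuples can qualify in one interval.

```python
from fractions import Fraction


def worst_case_entries(interval_length, threshold_fraction):
    """Upper bound on accumulator entries needed for one interval."""
    # Minimum number of occurrences a tuple needs to become a candidate.
    min_count = interval_length * threshold_fraction
    # At most this many tuples can each reach min_count within the interval.
    return interval_length // min_count


print(worst_case_entries(10_000, Fraction(1, 100)))       # 1% threshold
print(worst_case_entries(1_000_000, Fraction(1, 1000)))   # 0.1% threshold
```

Note that the bound depends only on the threshold percentage, not on the interval length itself. `Fraction` is used here simply to avoid floating-point rounding in the division.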

13
Single Hash Architecture (cont.)
  • Hash functions for a given tuple <pc, value>
  • npc = flip(randomize(pc))
  • nv = randomize(value)
  • index = xor-fold(npc xor nv, index-size)
  • Optimizations
  • Retaining: keeps top entries in acc. table from
    the previous interval
  • Resetting: reset counter in hash-table after it
    reaches candidate threshold

14
Experimental setup
  • SPEC95: go, li, vortex; SPEC2K: gcc, vortex;
    deltablue, sis, burg
  • Compilation
  • DEC Alpha 21164, DEC C (full optimizations)
  • Profiling analysis: ATOM
  • Fast-forwarded and then ran for 500 million
    instructions

15
Error Calculation
  • For each interval compare candidates seen by HW
    profiler and perfect profiler
  • False Positive
  • False Negative
  • Neutral Positive
  • Neutral Negative
  • Total error rate for an interval

16
Experimental Results
  • Accuracy of HW profiling depends on
  • number of unique tuples in an interval (distinct
    tuples)
  • number of unique tuples that cross threshold
  • Analysis of candidate tuples

Number of distinct tuples seen in an interval
on average
17
  • Number of unique candidate tuples in an interval
    on average

18
  • Percentage of variation of candidates from
    one interval to the next

19
Error rates
  • Single Hash table with retaining/resetting
    results across a set of benchmarks

20
Multiple-hash Profiler
  • Independent hash functions (one for each table)
  • if (no entry in acc. table)
  • hash to each table
  • update each counter
  • if (all counters for a particular tuple in the hash
    tables reach candidate threshold)
  • add entry to the acc. table
  • reset counters in hash-table (immediately or at
    the end of interval)
  • Conservative update: update just the smallest
    counter
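A rough sketch of the multiple-hash update with conservative update follows. The names are hypothetical and the per-table index functions are simple placeholders, not genuinely independent hash functions. Thresholds are absolute counts, counters are reset only at the end of an interval, and ties for the smallest counter are all incremented.

```python
class MultiHashProfiler:
    """Several untagged counter tables; promote only if every counter agrees."""

    def __init__(self, num_tables=4, entries_per_table=512,
                 candidate_threshold=100, conservative=True):
        self.tables = [[0] * entries_per_table for _ in range(num_tables)]
        self.threshold = candidate_threshold
        self.conservative = conservative
        self.acc_table = {}  # tuple -> count

    def _indices(self, event):
        pc, value = event
        n = len(self.tables[0])
        # Placeholder hash family (assumption): a different linear mix per table.
        return [(pc * (t + 1) + value * (2 * t + 1)) % n
                for t in range(len(self.tables))]

    def observe(self, event):
        if event in self.acc_table:
            self.acc_table[event] += 1
            return
        idx = self._indices(event)
        counts = [tbl[i] for tbl, i in zip(self.tables, idx)]
        low = min(counts)
        for tbl, i, c in zip(self.tables, idx, counts):
            # Conservative update: only the smallest counter(s) are incremented.
            if not self.conservative or c == low:
                tbl[i] += 1
        low_after = min(tbl[i] for tbl, i in zip(self.tables, idx))
        if low_after >= self.threshold:
            # Every table agrees the tuple crossed the threshold.
            self.acc_table[event] = low_after

    def end_interval(self):
        profile = dict(self.acc_table)
        self.tables = [[0] * len(tbl) for tbl in self.tables]
        self.acc_table.clear()
        return profile
```

The smallest counter is the tightest available estimate of a tuple's true count, so leaving the larger counters untouched avoids inflating entries that are already polluted by aliasing.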

21
  • Multi-hash profiler for an interval of 10,000, 1%
    candidate threshold, and a total of 2K
    hash-table entries

Multi-hash profiler for an interval of 1 million,
0.1% candidate threshold, and a total of 2K
hash-table entries
22
  • Varying number of hash tables for the best
    multi-hash profiler - C1, R0 (w/ conservative
    update and w/o resetting) (left: 10,000, 1%;
    right: 1 million, 0.1%)

Variation in the error across different intervals
(left: BSH w/ resetting; right: multi-hash w/
conservative update and no resetting, 4 hash tables)
23
Summary
  • Profiling architecture
  • Efficiently filters out important data
  • Efficient in terms of HW cost (6KB (1KB or
    10KB)) and overhead (no performance overhead)