Highlights of the 36th Annual International Symposium on Microarchitecture, December 2003

1
Highlights of the 36th Annual International
Symposium on Microarchitecture, December 2003
  • Theo Theocharides
  • Embedded and Mobile Computing Center
  • Department of Computer Science and Engineering
  • The Pennsylvania State University
  • Acknowledgements
  • K. Bernstein, T. Austin, D. Blaauw, L. Peh, D.
    Jimenez

2
Introduction
  • The International Symposium on Microarchitecture
    is the premier forum for discussing new
    microarchitecture and software techniques
  • A forum for technical interaction on traditional
    MICRO topics: processor architecture, compilers,
    and systems
  • Special emphasis on optimizations that take
    advantage of application-specific opportunities
  • Brings together the microarchitecture and
    embedded architecture communities
  • http://www.microarch.org
  • http://www.microarch.org/micro36/

3
Symposium Outline
  • Session 1: Voltage Scaling and Transient Faults
  • Session 2: Cache
  • Session 3: Power- and Energy-Efficient
    Architectures
  • Session 4: Application-Specific Optimization and
    Analysis
  • Session 5: Dynamic Optimization Systems
  • Session 6: Dynamic Program Analysis and
    Optimization
  • Session 7: Branch, Value, and Scheduling
    Optimization
  • Session 8: Dataflow, Data Parallel, and Clustered
    Architectures
  • Session 9: Secure and Network Processors
  • Session 10: Scaling Design

4
Highlights
  • Keynote Speech
  • Caution Flag Out: Microarchitecture's Race for
    Power Performance
  • Kerry Bernstein, IBM T. J. Watson Research Center
  • Interesting Papers
  • Razor: A Low-Power Pipeline Based on
    Circuit-Level Timing Speculation, D. Ernst et al.
  • Power-Driven Design of Router Microarchitectures
    in On-Chip Networks, H. Wang, L.-S. Peh, S.
    Malik
  • Fast Path-Based Neural Branch Prediction, D.
    Jimenez

5
Workshops and Tutorials
  • 5th Workshop on Media and Streaming Processors
    (MSP)
  • 3rd Workshop on Power-Aware Computer Systems
    (PACS)
  • 2nd Workshop on Application Specific Processors
    (WASP)
  • Tutorial: Challenges in Embedded Computing
  • Tutorial: Open Research Compiler (ORC):
    Proliferation of Technologies and Tools
  • Tutorial: Microarchitecture-Level
    Power-Performance Simulators: Modeling,
    Validation, and Impact on Design
  • Tutorial: Network Processors
  • Tutorial: Architectural Exploration with Liberty

6
Keynote Speech
  • Given by Kerry Bernstein, IBM T.J. Watson
    Research Center
  • Microarchitecture and technology relationship
  • We cannot continue to scale down to achieve
    higher frequencies without a catch
  • Increasing pipeline depth does not necessarily
    help
  • Power consumption, process variation, soft
    errors, die area erosion becoming more and more
    important
  • Keynote explored how past technologies have
    influenced high speed microarchitectures
  • Keynote showed how characteristics of proposed
    new devices and interconnects for lithographies
    beyond 90nm may shape future machine design.
  • Given the present issues and incoming trends,
    role of microarchitecture in extending CMOS
    performance will be more important than ever

7
Where Scaling fails
8
Cost of Performance in terms of power
9
Issues in summary
  • Feature size
  • Device count (transistors per chip)
  • Pipeline depth
  • Power consumption increases non-linearly with
    scaling
  • Power grows when we reduce the FO4 delay
  • Delay and power affected by process variation
  • Cooling creates more problems
  • Cost of power diverges from performance gain

10
How does Microarchitecture help?
11
Repairs
  • Monitor-based Full Chip Voltage, Clock Throttling
  • Voltage Islands
  • Technology aid required here
  • Latency required
  • Low-activity FET count increase
  • Clock Gating
  • So far has been a nice solution
  • Pipeline depth optimization
  • Performance accelerators for ASICs (DSP, GPUs,
    etc.)
  • They draw power anyway, so at least make
    them efficient
  • Software solutions should be developed here
  • Compute-Informed Power Management
  • Instruction Stream
  • Dynamic Resource Assertion
  • Power Aware OS
  • Thermal Modeling

12
New Ideas
  • Evolutionary
  • Strained Silicon
  • High-K Gate Dielectrics
  • Hybrid Crystal Silicon
  • Increase current drive/micron of device
  • Allow transistor density improvement
  • Introduce Features which enable active static
    power management
  • Revolutionary
  • Double Gated MOSFETs
  • 3D Integration
  • Molecular Computing
  • Reduce Power Density without architectural
    management
  • Eliminate power dependence on frequency
  • Return the industry to threshold and supply
    voltage scaling

13
Molecular Computing
14
Keynote Conclusions
  • New technologies will likely help, but not
    necessarily
  • Power is by far the predominant factor in
    scaling; we need to see what new technologies
    can give us
  • Staying ahead requires power-aware systems

15
Razor Project (T. Austin, D. Blaauw, T. Mudge)
  • We (designers/architects) have been scaling the
    supply voltage down, but only to the point where
    it is proven that under all possible worst cases
    there are no errors
  • Very conservative voltage scaling
  • IDEA!
  • Instead of trying to avoid ALL errors, ALLOW some
    errors to happen and correct them!
  • Major argument: scaling the supply voltage down
    by almost 0.25V gives an average error rate of
    less than 5%
  • Instead of spending energy, logic, effort, time,
    and so many other useful resources on avoiding
    errors, allow a very small error percentage and
    gain huge power savings
  • Cost of fixing errors is minimal when the error
    percentage is kept under control
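
The idea above can be sketched as a software tuning loop: keep lowering the supply voltage while the observed error rate stays under a small target. This is illustrative only; in Razor the detection and recovery happen in hardware via shadow latches, and the error-rate model and every constant below are hypothetical, not from the paper.

```python
# Sketch of Razor-style error-driven voltage tuning (illustrative only).
# The error-rate model and all constants are hypothetical.

def error_rate(vdd):
    """Toy model: errors are negligible above a critical voltage,
    then grow rapidly as Vdd drops below it."""
    v_crit = 1.2
    if vdd >= v_crit:
        return 0.0
    return min(1.0, (v_crit - vdd) ** 2 * 4.0)

def tune_voltage(vdd=1.8, target=0.015, step=0.01):
    """Lower Vdd while the error rate stays under the target,
    trading a few pipeline recoveries for large energy savings."""
    while error_rate(vdd - step) <= target:
        vdd -= step
    return round(vdd, 3)

if __name__ == "__main__":
    v = tune_voltage()
    print(v, error_rate(v))
```

The key trade-off the loop encodes: each step down in voltage saves energy quadratically, while the recovery cost grows only with the (small) error rate, so the optimum sits below the worst-case critical voltage.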

16
Razor Project
17
Razor Pipeline Flip-Flop
18
Error-Rate vs. Power Savings
19
IPC vs. Error Rate
20
DVS
21
Razor Advantages
  • Eliminate safety margins
  • Process variation, IR-drop, temperature
    fluctuation, data-dependent latencies, model
    uncertainty
  • Operate at sub-critical voltage for optimal
    trade-off between
  • Energy gain from voltage scaling
  • Energy overhead from dynamic error correction
  • Tune voltage for average instruction data
  • Exploit delay dependence in data
  • Tolerate delay degradation due to infrequent
    noise events
  • SER, capacitive, inductive noise, charge sharing,
    floating body effect
  • The most severe noise is also the least frequent

22
Power-driven Design of Router Microarchitectures
in On-chip Networks (Hangsheng Wang, Li-Shiuan
Peh, Sharad Malik)
  • Investigates on-chip network microarchitectures
    from a power-driven perspective
  • Power-efficient network microarchitectures
  • segmented crossbar, cut-through crossbar and
    write-through buffer
  • Studies and uncovers the power-saving potential
    of an existing network architecture: the express
    cube
  • Reduction in network power of up to 44.9%
  • NO degradation in network performance
  • Improved latency/throughput in some cases

23
Power in NoC
  • Ewrt is the average energy dissipated when
    writing a flit into the input buffer
  • Erd is the average energy dissipated when reading
    a flit from the input buffer
  • Ebuf = Ewrt + Erd is the average buffer energy
  • Earb is average arbitration energy
  • Exb is average crossbar traversal energy
  • Elnk is average link traversal energy
  • H is the number of hops traversed by this flit
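
With the terms defined above, the per-flit energy of a message can be sketched. How the terms combine per hop is an assumption here (each hop incurs one buffer write and read, one arbitration, one crossbar traversal, and one link traversal); the paper's exact accounting may differ.

```python
# Per-flit energy in an on-chip network, using the terms on the slide.
# The per-hop combination is an assumption, not the paper's equation.

def flit_energy(e_wrt, e_rd, e_arb, e_xb, e_lnk, hops):
    """Each hop: write + read the input buffer, arbitrate,
    traverse the crossbar, then traverse the output link."""
    e_buf = e_wrt + e_rd                 # Ebuf = Ewrt + Erd
    per_hop = e_buf + e_arb + e_xb + e_lnk
    return hops * per_hop                # H hops in total

if __name__ == "__main__":
    # Hypothetical per-event energies (arbitrary units)
    print(flit_energy(1.0, 0.8, 0.3, 2.1, 3.5, hops=4))
```

This decomposition is what motivates the segmented and cut-through crossbars: they attack the Exb term, while the write-through buffer attacks Ebuf.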

24
Architectural Methods
  • Segmented crossbar
  • Cut-through crossbar
  • Write-through input buffer
  • Express cube

25
Segmented Crossbar
Schematic of a matrix crossbar and a segmented
crossbar. F is flit size in bits, dw is track
width, E, W, N, S are ports.
26
Cut-through crossbar
Schematic of cut-through crossbars F is flit
size, dw is track width, E, W, N, S are ports
27
Write-through buffer
  1. Bypassing without overlapping
  2. Bypassing with overlapping
  3. Schematic of a write-through input buffer.

28
Express cube topology and microarchitecture
29
Power savings and conclusions
  • Importance of a power-driven approach to on-chip
    network design
  • Need to investigate the interactions between
    traffic patterns and on-chip network
    architectures
  • Need to reach a systematic design methodology for
    on-chip networks

30
Fast Path-Based Neural Branch Prediction (D.
Jimenez)
  • Paper presented a new neural branch predictor
  • both more accurate and much faster than previous
    neural predictors
  • Accuracy far superior to conventional predictors
  • Latency comparable to predictors from industrial
    designs
  • Improves the instructions-per-cycle (IPC) rate of
    an aggressively clocked microarchitecture by 16%

31
Latency - Accuracy Gain
Rather than being done all at once (above),
computation is staggered (below)
  • Train a neural network with path history, and
    update it dynamically.
  • Choose the weight vectors according to the path
    leading up to the branch rather than branch
    address alone
  • Directly reduces latency (computation can begin
    before the prediction is needed; see the figure
    on the left)
  • Improves accuracy as the predictor incorporates
    path information
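
A software sketch of the path-based perceptron idea: each weight is indexed by the address of a branch along the path rather than by the predicted branch alone, which is what lets hardware stagger the partial sums as older branches execute. Table sizes and the training threshold below are illustrative, not the paper's parameters.

```python
# Path-based neural (perceptron) branch prediction sketch.
# Sizes and threshold are illustrative, not from the paper.

N = 256        # rows per weight table (indexed by branch address)
H = 8          # history/path length
THETA = 18     # training threshold (illustrative)

# weights[i][a]: weight contributed by the branch at path position i
weights = [[0] * N for _ in range(H + 1)]

def predict(pc, path, history):
    """path: addresses of the last H branches; history: their
    outcomes as +1/-1. Output >= 0 predicts taken."""
    y = weights[0][pc % N]                         # bias weight
    for i in range(H):
        y += weights[i + 1][path[i] % N] * history[i]
    return y

def train(pc, path, history, taken):
    """Perceptron update: adjust weights on a misprediction or
    when the output magnitude is below the threshold."""
    t = 1 if taken else -1
    y = predict(pc, path, history)
    if (y >= 0) != taken or abs(y) < THETA:
        weights[0][pc % N] += t
        for i in range(H):
            weights[i + 1][path[i] % N] += t * history[i]
```

In hardware, each term of the sum can be added as soon as the corresponding older branch is known, so only the final addition sits on the prediction critical path — the staggering shown in the figure.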

32
Comparative Results Misprediction rate
33
IPC per hardware cost
  • Faster and more accurate than existing neural
    branch predictors

34
Conclusion
  • Overview of MICRO36
  • The conference lasted 5 days; impossible to
    review in half an hour!
  • If you are interested, you should read the
    proceedings on-line at
  • http://www.microarch.org/micro36
  • The Call For Papers for MICRO37 is available at
  • http://www.microarch.org/micro37
  • DEADLINE FOR PAPER SUBMISSION: May 28th, 2004

35
Links to the papers reviewed
  • Razor
  • http://www.microarch.org/micro36/html/pdf/ernst-Razor.pdf
  • NoC Router Power-Driven Design
  • http://www.microarch.org/micro36/html/pdf/wang-PowerDrivenDesign.pdf
  • Fast-Path Neural Branch Predictor
  • http://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf

36
Questions?
  • THANK YOU !

37
36th Annual International Symposium on
Micro-Architecture - A Review
  • Rajaraman Ramanarayanan

38
Talk Overview
  • Session covered in this presentation
  • Review papers
  • Architectural vulnerability factors
  • Introduction
  • Proposed technique
  • Soft error terminology
  • Computing AVFs
  • Results
  • Conclusion
  • L2-Miss-Driven Variable Supply-Voltage Scaling
  • Introduction
  • Proposed Solution
  • Transitions
  • Results
  • Achievements

39
Session Covered
  • Voltage Scaling and Transient Faults
  • Methodology to compute architectural
    vulnerability factors
  • VSV: L2-Miss-Driven Variable Supply-Voltage
    Scaling for Low Power

40
Architectural Vulnerability Factors(S. S.
Mukherjee, C. T. Weaver, J. Emer, S. K.
Reinhardt, T. Austin)
  • Single-event upsets from particle strikes have
    become a key challenge in microprocessor design.
  • Soft errors due to cosmic rays making an impact
    in industry.
  • In 2000, Sun Microsystems acknowledged cosmic ray
    strikes on unprotected cache memories as the
    cause of random crashes at major customer sites
    in its flagship Enterprise server line
  • The fear of cosmic ray strikes prompted Fujitsu
    to protect 80% of its 200,000 latches in its
    recent SPARC processor with some form of error
    detection
  • Designers require accurate estimates of processor
    error rates to make appropriate cost/reliability
    trade-offs

41
Introduction
  • All existing approaches introduce a significant
    penalty in performance, power, die size, and
    design time
  • Tools and techniques to estimate processor
    transient error rates are not readily available
    or fully understood.
  • Estimates are needed early in the design cycle.
  • In this Paper
  • Define architectural vulnerability factor (AVF)
  • identify numerous cases, such as pre-fetches,
    dynamically dead code, and wrong-path
    instructions, in which a fault will not affect
    correct execution

42
Proposed technique
  • Not all faults in a micro-architectural structure
    affect the final outcome of a program.
  • Architectural Vulnerability factor (AVF)
  • probability that a fault in that particular
    structure will result in an error in the final
    output of the program
  • The overall error rate is the product of the raw
    fault rate and the AVF
  • Can examine the relative contributions of various
    structures
  • identify cost-effective areas to employ fault
    protection techniques
  • Tracks the subset of processor state bits
    required for architecturally correct execution
    (ACE)
  • fault in a storage cell containing one of these
    bits affects output
  • For example, a branch predictor's AVF is 0%
  • predictor bits are always un-ACE bits
  • Bits in the committed PC are always ACE bits, so
    it has an AVF of 100%

43
Soft error terminology
  • Error budget expressed in terms of
  • Mean Time Between Failures (MTBF).
  • Failures In Time (FIT) - inversely related to
    MTBF.
  • Errors are often classified as
  • Undetected - silent data corruption (SDC)
  • Detected - detected unrecoverable errors (DUE)
  • Effective FIT rate for a structure is the product
    of its raw circuit FIT rate and the structure's
    vulnerability factor
  • Effective FIT rate per bit is influenced by
    several vulnerability factors
  • also known as de-rating factors or soft error
    sensitivity factors
  • Examples include timing vulnerability factor for
    latches and AVF
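
The terminology above combines multiplicatively: a structure's effective FIT rate is its raw circuit FIT rate de-rated by its vulnerability factors, and MTBF is its inverse. A toy calculation, with hypothetical numbers:

```python
# Effective FIT rate and MTBF, following the slide's definitions.
# All numbers below are hypothetical, for illustration only.

def effective_fit(raw_fit_per_bit, num_bits, avf, tvf=1.0):
    """FIT = failures per 10^9 device-hours. The AVF and the timing
    vulnerability factor (TVF) de-rate the raw circuit rate."""
    return raw_fit_per_bit * num_bits * avf * tvf

def mtbf_years(fit):
    """MTBF is inversely related to FIT."""
    hours = 1e9 / fit
    return hours / (24 * 365)

if __name__ == "__main__":
    fit = effective_fit(raw_fit_per_bit=0.001, num_bits=64_000, avf=0.28)
    print(round(fit, 2), round(mtbf_years(fit), 1))
```

This is why AVF matters to designers: halving a structure's AVF halves its contribution to the chip's SDC or DUE budget without any circuit change.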

44
Silent data corruption in the future
45
Identifying Un-ACE Bits
  • Bits that do not affect final program output
  • Analyzed a uniprocessor system
  • Micro-architectural Un-ACE bits
  • Idle or Invalid State.
  • Mis-speculated State.
  • Predictor Structures.
  • Ex-ACE State.
  • Architectural Un-ACE Bits
  • NOP instructions.
  • Performance-enhancing instructions.
  • Predicated-false instructions.
  • Dynamically dead instructions.
  • Logical masking.

46
Computing AVF
  • AVFs for storage cells - fraction of time an
    upset in that cell will cause a visible error in
    the final output of a program
  • AVF Equations for a Hardware Structure
  • average AVF over all the bits in that structure
  • AVF = (Σ residency, in cycles, of all ACE bits in
    the structure) / (total number of bits in the
    hardware structure × total execution cycles)
  • Little's Law: N = B × L, where
  • N = average number of bits in a box,
  • B = average bandwidth per cycle into the box, and
  • L = average latency of an individual bit through
    the box
  • AVF = (Bace × Lace) / (total number of bits in
    the hardware structure)
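
The two formulations above give the same number: summing ACE-bit residency over the whole run, or applying Little's Law (N = B × L) to the ACE bits flowing through the structure. A toy structure with hypothetical numbers:

```python
# AVF computed two equivalent ways, per the slide's equations.
# The example numbers are hypothetical.

def avf_residency(ace_residency_cycles, total_bits, total_cycles):
    """AVF = sum of ACE-bit residency / (bits x cycles)."""
    return ace_residency_cycles / (total_bits * total_cycles)

def avf_littles_law(ace_bandwidth, ace_latency, total_bits):
    """N = B x L gives the average number of ACE bits resident in
    the structure; divide by the structure size for the AVF."""
    return ace_bandwidth * ace_latency / total_bits

if __name__ == "__main__":
    # e.g. 64 ACE bits/cycle entering, each resident 12 cycles,
    # in a 2048-bit structure over a 1000-cycle run
    print(avf_littles_law(64, 12, 2048))
    print(avf_residency(64 * 12 * 1000, 2048, 1000))
```

The Little's Law form is the one the paper exploits in practice, since ACE bandwidth (ACE IPC) and ACE latency fall directly out of a performance model.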

47
Computing AVFs using a Performance Model
  • Two structures (the instruction queue and
    execution units) evaluated using the Asim
    performance model framework
  • Need following information
  • Sum of all residence cycles of all ACE bits of
    the objects resident in the structure during the
    execution of the program,
  • Total execution cycles for which we observe the
    ACE bits residence time, and
  • Total number of bits in a hardware structure.
  • AVF algorithm
  • Record the residence time of the instruction in
    the structure as an instruction flows through
    different structures in the pipeline
  • Update the structures the instruction flowed
    through
  • Put the instruction in a post-commit analysis
    window to
  • Determine if the instruction is dynamically dead
    or
  • Determine if there are any bits that are
    logically masked

48
Methodology for evaluation
  • Use an Itanium2-like IA64 processor [14] scaled
    to current technology
  • Modeled in detail in Asim performance model
    framework.

49
Results: Program-level Decomposition
50
Results
  • Program-level Decomposition
  • We get about 45% ACE instructions. The rest (55%
    of the instructions) are un-ACE instructions
  • Some of these un-ACE instructions still contain
    ACE bits, such as the op-code bits of pre-fetch
    instructions
  • UNKNOWN and NOT_PROCESSED instructions account
    for about 1% of the total instructions
  • NOPs, predicated-false instructions, and prefetch
    instructions account for 26%, 6.7%, and 1.5%,
    respectively
  • FDD_reg and FDD_mem denote results that are
    written back to registers and memory,
    respectively
  • Account for about 9.4% and 2% of the dynamic
    instructions
  • IA64 has a large number of registers
  • TDD_reg and TDD_mem account for 6.6% and 1.6% of
    the dynamic instructions

51
AVF for instruction queue
52
AVF for instruction queue
  • Shows what percentage of cycles a storage cell in
    the instruction queue contains ACE and un-ACE
    bits
  • The instruction queue contains an ACE bit about
    28% of the time
  • Thus the AVF of the instruction queue is 28%
  • Floating-point programs, in general, have higher
    AVFs than integer programs (31% vs. 25%,
    respectively)
  • Long-latency instructions and few branch
    mispredictions
  • Use the instruction queue more effectively than
    integer programs, leading to a higher AVF
  • Apply Little's Law
  • Number of ACE instructions in the queue =
    bandwidth (ACE IPC) × average number of cycles an
    instruction is in the ACE state (ACE latency)
  • The ACE IPC and ACE latency come from the
    performance model

53
AVFs for the Execution Units
54
AVFs for the Execution Units
  • Four integer pipes and two floating-point pipes
  • 50% control latches and 50% datapath latches
  • 11% of the cycles spent processing ACE
    instructions
  • Significantly lower than the instruction queue's
    AVF
  • Instructions must wait in the instruction queue
  • Speculatively issued instructions succeeding
    cache-miss loads must replay through the
    instruction queue
  • The floating-point pipes are mostly idle while
    executing integer code
  • Implemented logical masking functions for a small
    but important subset

55
Conclusion
  • Estimated AVFs using a novel approach that tracks
    bits required for architecturally correct
    execution (ACE) and un-ACE bits
  • Computed the AVF for the instruction queue and
    execution units of an Itanium2-like IA64
    processor.
  • Further refinement could lower the AVF estimates
    further, but the contribution from such
    refinement is expected to be small
  • Can estimate the FIT rate of an entire processor
    early in the design cycle
  • Can help designers choose the appropriate error
    detection or correction schemes
  • Can lower the FIT rate of the chip iteratively by
    adding more and more error protection, using AVF
    estimates as a guide.

56
L2-Miss Driven VSV for low power (H. Li, C.
Cher, T. N. Vijaykumar, K. Roy)
  • Idea: upon an L2 miss, the pipeline performs
    independent computations but almost always ends
    up stalled, waiting for data despite out-of-order
    issue and other latency-hiding techniques
  • During an L2 miss, scale down the supply voltage
    and carry out the independent computations at
    lower speed instead
  • Performance degrades, however, if there are
    sufficient independent computations, which would
    otherwise overlap with the miss latency
  • Returning to full speed, however, will likely
    reduce power savings if there are multiple misses
    and insufficient independent computations to
    overlap with them

57
Proposed solution
  • Two state machines track parallelism on the fly
  • Scale down the voltage depending on the
    parallelism between the two events
  • Factors considered:
  • Circuit-level complexities: reducing VSV to two
    voltages
  • Stability
  • Signal propagation speed issues
  • Energy overhead issues in RAMs and register files
  • Average reduction of processor power is 7%, while
    performance degradation is 0.9%
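
The core transition logic above can be sketched as a tiny controller: drop to the low supply on an L2 miss and return to the high supply when all outstanding misses have filled. The real design uses two FSMs that also gauge how much independent work is available; that heuristic (and the transition latency) is omitted here, and the class and names are illustrative.

```python
# Minimal sketch of the VSV high/low transition idea (illustrative).
# The parallelism-tracking heuristic of the real FSMs is omitted.

HIGH, LOW = "high", "low"

class VsvController:
    def __init__(self):
        self.state = HIGH
        self.outstanding_misses = 0

    def on_l2_miss(self):
        self.outstanding_misses += 1
        self.state = LOW          # slow down while waiting on memory

    def on_l2_fill(self):
        self.outstanding_misses -= 1
        if self.outstanding_misses == 0:
            self.state = HIGH     # all data back: full speed again

if __name__ == "__main__":
    c = VsvController()
    c.on_l2_miss(); c.on_l2_miss()
    c.on_l2_fill()
    print(c.state)    # still "low": one miss outstanding
    c.on_l2_fill()
    print(c.state)
```

Counting outstanding misses rather than reacting to a single miss is what avoids bouncing back to full speed between overlapping misses, which the slide notes would erode the power savings.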

58
VSV Structure
59
Transitions
High-To-Low transition
Low-To-High transition
60
VSV -Results
61
VSV - Achievements
  • Power savings with minimal performance
    degradation
  • Complexity of circuits taken into consideration
  • FSMs track the overlap between independent
    operations and the delay caused by an L2 cache
    miss
  • VSV achieves a 4% reduction in power across all
    SPEC2K benchmarks
  • VSV achieves 12% for the benchmarks with high L2
    miss rates

62
Questions
  • Any questions or feedback?