Yuchun Ma - PowerPoint PPT Presentation

Provided by: cadlab

Transcript and Presenter's Notes

Title: Yuchun Ma


1
International Center for Design on
Nanotechnologies Workshop
Physical Modeling and Exploration
for 3D Microarchitecture Design
  • Yuchun Ma
  • Joint Work with Jason Cong, Yongxiang Liu,
  • Glenn Reinman, and Yan Zhang

2
Outline
  • Micro-architecture Design
  • 3-D IC Technology
  • 3D Architecture Exploration with 2D blocks
  • 3D Architecture Design with cubic folded blocks
  • 3D cubic packing algorithm
  • 3D architecture exploration with folded blocks
  • Pipelining Optimization with Throughput-Aware
    Floorplanning
  • Summary and Future Work

4
Superscalar Processors
  • Superscalar processing is the ability of a
    microprocessor to initiate multiple instructions
    into multiple pipelines so that the computations
    of many instructions can be done in parallel if
    they are not dependent on each other.
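To make the definition concrete, the following Python sketch counts how many cycles an idealized superscalar core needs to issue a short instruction sequence. It assumes single-cycle latencies, an issue width `width`, and dependences given as lists of earlier instruction indices; `issue_cycles` is a hypothetical helper, not a model from the talk.

```python
def issue_cycles(deps, width):
    """Cycles needed to issue every instruction on an idealized superscalar core.

    deps[i] lists the earlier instructions that instruction i depends on.
    Assumes single-cycle latency and no limit other than the issue width;
    ready instructions may issue out of program order.
    """
    if width < 1:
        raise ValueError("issue width must be at least 1")
    issued_in = {}                      # instruction index -> issue cycle
    pending = list(range(len(deps)))
    cycle = 0
    while pending:
        cycle += 1
        chosen = []
        for i in pending:
            if len(chosen) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if all(issued_in.get(p, cycle) < cycle for p in deps[i]):
                chosen.append(i)
        for i in chosen:
            issued_in[i] = cycle
        pending = [i for i in pending if i not in issued_in]
    return cycle


# Four independent instructions on a 2-wide machine take 2 cycles,
# while a dependence chain of four takes 4 cycles regardless of width.
print(issue_cycles([[], [], [], []], width=2))      # 2
print(issue_cycles([[], [0], [1], [2]], width=4))   # 4
```

The gap between the two cases is the "if they are not dependent on each other" condition: extra pipelines only help when independent instructions are available.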

5
Alpha 21264
6
Performance of a microprocessor
  • Performance is measured as the time taken to
    complete a given task, which depends on:
  • Operating systems
  • Compiler optimizations
  • Workload used for studying the performance
  • Microprocessor organization
  • Typically, the processor performance is measured
    in MIPS or BIPS
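These quantities are usually related as time = instruction count / (IPC × clock frequency), from which MIPS and BIPS follow; the operating system, compiler, and workload mainly change the instruction count and IPC, while the microprocessor organization changes IPC and frequency. A minimal sketch, assuming a fixed instruction count for the task (the workload numbers are made up):

```python
def execution_time_s(instructions, ipc, freq_ghz):
    """Time to run a task: instructions / (IPC * clock frequency)."""
    return instructions / (ipc * freq_ghz * 1e9)


def mips(instructions, time_s):
    """Millions of instructions completed per second."""
    return instructions / (time_s * 1e6)


def bips(instructions, time_s):
    """Billions of instructions completed per second."""
    return instructions / (time_s * 1e9)


# Hypothetical workload: 2 billion instructions at IPC 1.0 on a 2 GHz core.
t = execution_time_s(2e9, ipc=1.0, freq_ghz=2.0)
print(t, mips(2e9, t), bips(2e9, t))   # 1.0 2000.0 2.0
```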

8
Motivations of 3-D ICs
  • Alternative ways for device integration as we
    approach the limit of CMOS scaling
  • Interconnect length/delay reduction
  • System performance improvement [Black04]
  • Power reduction [Black04]
  • Integration of heterogeneous technologies
  • No existing flow to evaluate 3D implementations
    of architectures systematically
  • Performance
  • Thermal

[Black04]
9
Technology background
  • Wafer bonding 3D IC technologies
  • With flipping the top layer
  • Without flipping the top layer

(a) With flipping the top layer
(b) Without flipping the top layer
A 3D IC example with two device layers
10
Thermal Resistive Network [Wilkerson04]
  • Circuit stack partitioned into tiles
  • Tiles connected through thermal resistances
  • Lateral resistances fixed
  • Vertical resistances ∝ 1/(number of thermal vias)
  • Heat sources modeled as current sources
  • Current value = power
  • Heat sinks modeled as ground nodes
  • Thermal vias
  • After floorplanning, we can further reduce the
    temperature by thermal via insertion.


(a) Tiles stack array
(b) Single tile stack
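For a single tile stack with the lateral resistances ignored, the resistive network reduces to a ladder: each vertical resistance carries all the power injected at and above its layer, so temperatures can be accumulated from the heat sink upward. The Python sketch below assumes that simplification; `via_resistance` is a hypothetical parallel-conductance model of how thermal vias lower a vertical resistance, not the model used in [Wilkerson04].

```python
def stack_temperatures(powers_w, r_vertical_k_per_w):
    """Temperature rise above the heat sink for each layer of one tile stack.

    powers_w[i]: power (current source) injected at layer i; layer 0 is
    nearest the heat sink (ground node).
    r_vertical_k_per_w[i]: resistance between layer i and the node below it
    (the heat sink for i == 0). Lateral resistances are ignored.
    """
    temps, t_below = [], 0.0
    for i, r in enumerate(r_vertical_k_per_w):
        heat_through = sum(powers_w[i:])   # all power from layer i and above
        t_below += r * heat_through        # "Ohm's law" for heat
        temps.append(t_below)
    return temps


def via_resistance(r_bulk, r_via, n_vias):
    """Hypothetical: bulk material and n identical vias conduct in parallel,
    so the resistance falls roughly as 1/n once the vias dominate."""
    return 1.0 / (1.0 / r_bulk + n_vias / r_via)


print(stack_temperatures([1.0, 1.0], [1.0, 1.0]))   # [2.0, 3.0]
r = via_resistance(1.0, 1.0, 1)                     # 0.5
print(stack_temperatures([1.0, 1.0], [r, r]))       # [1.0, 1.5]
```

The upper layer is hottest because its heat must cross every resistance below it, which is why lowering vertical resistances with vias helps most in multi-layer stacks.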
12
MEVA-3D
  • An Automated Design Flow for 3D Architecture
    Evaluation (MEVA-3D)
  • Evaluate 3D implementations of micro-architectures
    systematically and study them from both
    performance and thermal perspectives.
  • MEVA-3D Flow
  • Automated 2D/3D floorplanning
  • Reduce the latency along critical loops in the
    micro-architecture by considering interconnect
    pipelining at a given target frequency.
  • Thermal Evaluation
  • Resistive network model considering white-space
    and thermal via insertion.
  • 3D router
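At a fixed target frequency, a wire whose delay exceeds one clock period has to be split by flip-flops, and every cycle beyond the first adds latency to any critical loop it lies on. A minimal sketch, assuming a constant per-millimetre wire delay (the 75 ps/mm default is a made-up placeholder) and ignoring flip-flop overhead:

```python
import math


def wire_cycles(length_mm, freq_ghz, delay_ps_per_mm=75.0):
    """Cycles and flip-flops needed to pipeline a global wire.

    delay_ps_per_mm is a hypothetical, technology-dependent constant for a
    buffered wire; a real flow would take it from its interconnect model.
    """
    period_ps = 1000.0 / freq_ghz
    delay_ps = length_mm * delay_ps_per_mm
    cycles = max(1, math.ceil(delay_ps / period_ps))
    return cycles, cycles - 1       # (latency in cycles, flip-flops inserted)


# An 8 mm wire at 4 GHz: 600 ps of delay against a 250 ps period.
print(wire_cycles(8.0, 4.0))   # (3, 2)
print(wire_cycles(2.0, 4.0))   # (1, 0)
```

Shortening the wire through better (or 3D) floorplanning is what removes the extra cycles, without changing the target frequency.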

13
3D Architecture Evaluation with Physical Planning
  • Optimize
  • BIPS (not IPC or Freq)
  • Consider interconnect pipelining based on early
    floorplanning for critical paths
  • Use IPC sensitivity model [Jagannathan05]
  • Area/wirelength
  • Temperature
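Optimizing BIPS rather than IPC or frequency alone matters because raising the clock frequency adds pipeline cycles on critical paths, which lowers IPC. The sketch below uses a hypothetical linear sensitivity model to show the trade-off; it is not the model of [Jagannathan05], whose exact form is not given here, and all numbers are placeholders.

```python
def estimated_ipc(base_ipc, sensitivities, extra_cycles):
    """Hypothetical linear IPC sensitivity model.

    sensitivities[k]: fractional IPC loss per extra cycle on critical path k
    extra_cycles[k]:  cycles added to path k by interconnect pipelining
    """
    loss = sum(s * c for s, c in zip(sensitivities, extra_cycles))
    return base_ipc * max(0.0, 1.0 - loss)


def bips(ipc, freq_ghz):
    """Billions of instructions per second = IPC x frequency in GHz."""
    return ipc * freq_ghz


sens = [0.10, 0.05]                                  # made-up sensitivities
print(bips(estimated_ipc(1.0, sens, [0, 0]), 3.0))   # 3.0
print(bips(estimated_ipc(1.0, sens, [1, 2]), 4.0))   # about 3.2
```

In this made-up case, a 33% higher clock yields only about 7% more BIPS, because the extra cycles on the two loops cost 20% of the IPC.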

14
Design Example
  • An out-of-order superscalar processor
    micro-architecture with 4 banks of L2 cache in
    70nm technology
  • Critical paths

15
Baseline Processor Parameters
16
2D vs 3D Layout
Assume two device layers
3D EV6-like core (2 layers)
2D EV6-like core
BIPS 2.75
BIPS 2.94
Wakeup loop: the extra cycle is eliminated.
Branch misprediction resolution loop and L2 cache
access latency: some of the extra cycles are
eliminated.
17
Simulation Results
  • The 3D architecture outperforms the 2D design by
    about 11.7% at 4 GHz.

18
Performance for the micro-architecture with 2D
and 3D layout at different target frequencies
  • 3D integration can improve performance by 11%
    by eliminating most of the wire latency present
    in the 2D layout.

19
Maximum On-Chip Temperature
  • 3D integration increases the temperature by
    over 4.78°C on average. After thermal via
    insertion, the maximum on-chip temperature is
    reduced by an average of about 62%.

HS denotes a heat sink; 3D integration allows
thermal vias to be inserted to reduce the
temperature.
21
3D Design w/ Component Folding and Stacking
  • Explore 3D design of architectural structures
    that are
  • Timing/Throughput Critical
  • Expensive in Terms of Power Consumption and/or
    Thermal Output
  • Possible candidates for 3D component folding
  • Instruction Scheduling Window
  • Issue Queue can be partitioned into multiple
    levels via matchlines or taglines.
  • On-Chip Caches
  • Regular structure lends itself to a wide range of
    partitionings
  • Register File
  • A thermally critical resource that also has a
    regular structure

22
3D Architectural Block Design and Modeling
  • First explore how to design blocks in 3D
  • Wordline folding
  • Fold block horizontally
  • Port Partitioning
  • Extend ports to different layers
  • Tools
  • CACTI
  • Caches and cache-like structures
  • Register files
  • HSpice
  • Issue Queue
  • Then explore design space for a microprocessor
    with these blocks

23
3D Issue Queue
  • Block folding
  • Fold the entries and place them on different
    layers
  • Effectively shortens the tag lines
  • Port partitioning
  • Place tag lines and ports on multiple layers,
    thus reducing both the height and width of the ISQ.
  • The reduction in tag and matchline wires can help
    reduce both power and delay.

(a) 2D issue queue with 4 taglines; (b) block
folding; (c) port partitioning
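The different effects of the two techniques can be seen with a deliberately crude geometric model: assume each multi-ported cell gains one wire track per port in both dimensions, so its side length grows linearly with the port count. Block folding splits the entries across layers (shorter tag lines, same cell size), while port partitioning keeps all entries on every layer but gives each layer only its share of the ports (smaller cells). The constants and the single-column layout below are hypothetical, not CACTI or HSpice results.

```python
import math


def cell_side(ports, base=1.0, track_pitch=0.5):
    """Side length of one multi-ported cell (hypothetical units)."""
    return base + ports * track_pitch


def issue_queue_footprint(entries, ports, layers, mode):
    """Return (tag line length, width, footprint area) of one layer."""
    if mode == "2d":
        rows, side = entries, cell_side(ports)
    elif mode == "block_folding":
        # Entries are spread over the layers; every cell keeps all ports.
        rows, side = math.ceil(entries / layers), cell_side(ports)
    elif mode == "port_partitioning":
        # Every layer holds all entries but only part of the ports.
        rows, side = entries, cell_side(math.ceil(ports / layers))
    else:
        raise ValueError(f"unknown mode {mode!r}")
    tag_line = rows * side       # tag lines run across every entry
    width = side                 # one column of cells, for simplicity
    return tag_line, width, tag_line * width


for mode in ("2d", "block_folding", "port_partitioning"):
    print(mode, issue_queue_footprint(32, 4, 2, mode))
```

With 32 entries, 4 ports and 2 layers, block folding gives the shortest tag line (48 vs 64 units), while port partitioning gives the smallest footprint (128 vs 144), in line with the area-versus-delay trend reported for the folded blocks.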
24
Benefits from IQ folding
  • Maximum delay reduction of 50%, maximum area
    reduction of 90%, and maximum reduction in power
    consumption of 40%

nL: n = number of layers; FB: folding banks;
TP: tag/port partitioning
25
Improvements for blocks
  • Port folding performs better than wordline
    folding for area (72% vs 51%).
  • Wordline folding is more effective in reducing
    the block delay (13% vs 5%).
  • Port folding also performs better in reducing
    power (13% vs 5%).

26
3D packing with folded blocks
  • The exploration of the use of vertical
    integration on microprocessor design requires
    consideration for both physical design and
    architecture.
  • True 3D packing
  • Architectural Alternative Selection
  • The number of layers in folded blocks
  • The partitioning style: block folding or port
    partitioning

27
3D Corner Block List Representation
  • (S, L, T) composes a 3D CBL.
  • S: a record of the block names
  • L: the orientation of each corner cubic block
    (X-, Y-, or Z-oriented)
  • T: the sequence T_n, T_n-1, …, T_2 recording the
    number of attached tri-branches covered by each
    corner cubic block

Example: S = (1, 2, 3, 4, 5), L = (Y, Z, Y, X), T = (10, 110, 10, 1110)
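The (S, L, T) triple is a compact sequence encoding. The Python sketch below only stores and checks it; it does not decode a packing. It assumes, by analogy with the 2D corner block list, that L and T each hold one entry per block after the first (as in the five-block example above) and that each T entry is a run of 1s terminated by a 0, the number of 1s being the number of covered tri-branches.

```python
from dataclasses import dataclass


@dataclass
class CornerBlockList3D:
    S: list  # block names
    L: list  # orientation ('X', 'Y' or 'Z') of each corner block after the first
    T: list  # bit strings T_n, ..., T_2

    def validate(self):
        n = len(self.S)
        if len(self.L) != n - 1 or len(self.T) != n - 1:
            raise ValueError("L and T need one entry per block after the first")
        if any(o not in ("X", "Y", "Z") for o in self.L):
            raise ValueError("orientations must be 'X', 'Y' or 'Z'")
        for t in self.T:
            # Assumed encoding: zero or more 1s followed by one terminating 0.
            if not t.endswith("0") or set(t[:-1]) - {"1"}:
                raise ValueError(f"malformed T entry {t!r}")

    def covered_tri_branches(self):
        """Number of tri-branches covered by each corner block, in T order."""
        self.validate()
        return [t.count("1") for t in self.T]


cbl = CornerBlockList3D(S=[1, 2, 3, 4, 5], L=["Y", "Z", "Y", "X"],
                        T=["10", "110", "10", "1110"])
print(cbl.covered_tri_branches())   # [1, 2, 1, 3]
```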
28
Packings with folded blocks

29
(No Transcript)
30
Performance
  • On average, multi-layer (3D) block configurations
    have 11% lower temperature and a 14% improvement
    in BIPS.

31
Temperatures
  • Temperatures can be kept below 100 °C with
    thermal vias inserted.

32
Temperature profile
  • 1 layer

33
Temperature profile (2 layers with thermal vias)
35
Micro-architecture Pipelining Optimization
  • Previous works assume that the blocks are
    separately designed subject to a clock frequency,
    and the wire pipelining is then carried out on
    the global wires of the circuits.
  • This is sub-optimal because slack within the
    pipelined block designs may be left unused.
  • We propose a novel optimization methodology of
    architecture pipelining with physical design, so
    that block pipelining and interconnect pipelining
    can be considered simultaneously.

36
Simultaneous Block and Interconnect Pipelining
  • We define path-based pipelining as the
    Simultaneous Block and Interconnect Pipelining
    (SBIP) problem.
  • Represent the micro-architecture design by a path
    graph $G(V, E)$.
  • The delay between any two flip-flops along the
    same path must not exceed the clock period $\tau$.
  • The performance of the architecture can be
    evaluated by the weighted sum, over all paths, of
    the number of flip-flops $n_{e_i}$ on each edge
    $e_i$ along the path.
  • The objective is therefore to find a feasible
    solution with optimal performance.

37
MILP Formulation
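  • We define a term $a(P, v)$ that represents the
    arrival time at node $v$ along path $P$, i.e., the
    longest delay from a flip-flop to node $v$ along
    path $P$.
  • With the given clock period $\tau$ and the set of
    paths $\mathcal{P}$, we can then formulate the
    problem as the following MILP, where $w_{P_i}$ is
    the weight of path $P_i$ and $d_{e_i}$ is the delay
    of connection $e_i$:

$$\text{Obj.}\quad \min \sum_{P_i \in \mathcal{P}} w_{P_i} \sum_{e_j \in P_i} n_{e_j}$$

$$\text{s.t.}\quad 0 \le a(P_i, v) \le \tau \qquad \forall v \in V \text{ and } P_i \text{ passes } v \qquad (1)$$

$$n_{e_i} \ge 0,\; n_{e_i} \in \mathbb{Z} \qquad \forall e_i \in E \qquad (2)$$

$$a(P_i, v) \ge a(P_i, u) + d_{e_i} - n_{e_i}\,\tau \qquad \forall e_i \in E \qquad (3)$$

where in (3) $e_i$ is a connection from node $u$ to node $v$ along path $P_i$.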
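The sketch below checks a candidate flip-flop assignment against constraints (1) to (3) and evaluates the weighted objective. It assumes integer delays in picoseconds, one weight per path, and that every path starts at a flip-flop (arrival time 0); `check_assignment` is a hypothetical helper, not part of the authors' flow. Taking the smallest arrival time allowed by (3) at every node is safe, because a smaller arrival time can only relax the constraints further along the path.

```python
def check_assignment(paths, delay_ps, n_ff, tau_ps):
    """Check an SBIP flip-flop assignment and return (feasible, objective).

    paths:    list of (weight, [edge ids along the path, in order])
    delay_ps: edge id -> delay d_e
    n_ff:     edge id -> number of flip-flops n_e on that edge
    """
    if any(n < 0 for n in n_ff.values()):          # constraint (2)
        return False, None
    objective = 0
    for weight, edges in paths:
        a = 0                                      # path starts at a flip-flop
        for e in edges:
            # Smallest arrival time allowed by (3), kept >= 0 as (1) requires.
            a = max(0, a + delay_ps[e] - n_ff[e] * tau_ps)
            if a > tau_ps:                         # upper bound of (1)
                return False, None
        objective += weight * sum(n_ff[e] for e in edges)
    return True, objective


delays = {"e1": 600, "e2": 700, "e3": 500}
paths = [(2, ["e1", "e2"]), (1, ["e2", "e3"])]
print(check_assignment(paths, delays, {"e1": 0, "e2": 1, "e3": 0}, 1000))  # (True, 3)
print(check_assignment(paths, delays, {"e1": 0, "e2": 0, "e3": 0}, 1000))  # (False, None)
```

Note that the flip-flop on the shared edge `e2` is counted once per path, weighted by that path's weight, which is how a single insertion can serve several paths at once.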

38
Graph-based heuristic algorithm
  • Traverse the graph to decide the optimal
    insertion of flip-flops such that the weighted
    sum of cycle numbers of paths is minimized
  • Dynamic scanning for combinational circuits
  • Slacks along paths are used to compute the
    optimal positions for FFs.
  • Near-optimal method for sequential circuits
  • Break each cycle into a path from s to t
  • Throughput aware floorplanning with pipelining
  • The path-based pipelining design guides the block
    design to optimize the performance for the whole
    design.
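For a single combinational path, placing flip-flops by scanning arrival times can be illustrated with a greedy pass: accumulate delay edge by edge and insert just enough flip-flops on an edge to keep the arrival time within one period. This is a simplification under stated assumptions (integer picosecond delays, flip-flops allowed anywhere along an edge, no edges shared between paths); it is not the authors' graph-based heuristic, which also has to handle shared edges and cycles.

```python
def pipeline_path(delays_ps, tau_ps):
    """Greedy flip-flop insertion along one path.

    Returns the number of flip-flops placed on each edge. Each edge gets
    ceil(t / tau) - 1 flip-flops, where t is the arrival time at its end
    before any flip-flop on that edge is counted.
    """
    n_ff, a = [], 0
    for d in delays_ps:
        t = a + d
        n = max(0, -(-t // tau_ps) - 1)   # ceil(t / tau) - 1 in integer arithmetic
        n_ff.append(n)
        a = t - n * tau_ps                # arrival time now lies in [0, tau]
    return n_ff


print(pipeline_path([600, 700, 500], 1000))   # [0, 1, 0]
print(pipeline_path([2500], 1000))            # [2]
```

Because the arrival time only crosses a period boundary when unavoidable, this pass uses ceil(total delay / τ) − 1 flip-flops, the minimum for an isolated path; with shared edges, a placement that is best for one path can be poor for another, which is why the general problem is posed as an MILP.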

39
Experimental Results
  • We compare our graph-based heuristic approach
    (GH) with wire pipelining (WP), the solutions
    obtained from the MILP solver (MILP), and the
    ideal upper bound (UB) used in [68].
  • Impact of frequencies
  • Path-based pipelining gives about a 27%
    performance improvement over wire pipelining.

40
Integrated with floorplanning optimization
  • MILP approach applied as a post-process at the
    end of floorplanning
  • Our approach integrated with throughput-driven
    floorplanning

             UB+post_MILP                   GH
Freq (GHz)   Area (mm²)  Wire (mm)  BIPS    Area (mm²)  Wire (mm)  BIPS
2            32.         115.6      1.492   31.8        142        1.714
3            34.6        103.7      2.139   33.3        108.4      2.22
4            32.4        98.7       2.776   36.1        124.3      2.828
5            32.8        126.2      2.885   32.6        94.17      3.35
6            36.0        108.4      3.636   33.7        100.3      3.882
7            35.9        112.5      3.479   36.8        129.9      3.906
Comparison   1           1          1       1.003       1.05       1.091
41
Summary
  • 3D Architecture Exploration
  • Coupled with 3D physical planning
  • Consider both 3D component stacking and folding
  • MEVA-3D can systematically evaluate the 3D
    architecture both from the performance side and
    from the thermal side.
  • We propose an optimization methodology of
    architecture pipelining with physical design that
    simultaneously optimizes the pipeline design and
    the physical packing in terms of system
    throughput, substantially improving system
    performance over wire pipelining.

42
Ongoing Work
  • 3D Multi-core architecture design and
    implementation
  • Deep-pipeline micro-architecture design that
    takes interconnect into account
  • The slacks in 3D design may be used to enlarge
    the sizes of blocks and get better performance.

43
Thank You! Mayuchun_at_tsinghua.org.cn