Digital Integrated Circuits A Design Perspective - PowerPoint PPT Presentation

Title: No Slide Title Author: Vandana Prabhu Last modified by: Janusz Starzyk Created Date: 4/13/1997 2:24:48 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 124
Provided by: Vandana7
Learn more at: http://www.ohio.edu
Transcript and Presenter's Notes

Title: Digital Integrated Circuits A Design Perspective


1
Digital Integrated Circuits A Design Perspective
System on a Chip Design
2
Application Specific Integrated Circuits
Introduction
  • Jun-Dong Cho
  • SungKyunKwan Univ.
  • Dept. of ECE, Vada Lab.
  • http://vada.skku.ac.kr

3
Contents
  • Why ASIC?
  • Introduction to System On Chip Design
  • Hardware and Software Co-design
  • Low Power ASIC Designs

4
Why ASIC Design productivity grows!
Complexity increase: 40% per year. Design
productivity increase: 15% per year.
Integration of PCB on single die
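A minimal sketch of how the two growth rates on this slide compound, assuming the figures are annual percentages (40% and 15%) and that both quantities are normalized to 1 in year 0. The function name `design_gap` is illustrative and not from the slides.

```python
# Compound growth of design complexity vs. design productivity.
# Assumption: the slide's "40" and "15" are percent per year.
COMPLEXITY_GROWTH = 0.40    # per year
PRODUCTIVITY_GROWTH = 0.15  # per year


def design_gap(years: int) -> float:
    """Ratio of complexity to productivity after `years` (both start at 1)."""
    return ((1 + COMPLEXITY_GROWTH) / (1 + PRODUCTIVITY_GROWTH)) ** years


if __name__ == "__main__":
    for y in (1, 5, 10):
        print(f"after {y:2d} years the gap is {design_gap(y):.2f}x")
```

Under these assumptions the gap widens by roughly 22% each year, which is the motivation the slide gives for higher-level design and reuse.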
5
Silicon in 2010
Die Area: 2.5 x 2.5 cm. Voltage: 0.6 V.
Technology: 0.07 µm
6
ASIC Principles
  • Value-added ASIC for huge volume opportunities;
    standard parts for quick time-to-market
    applications
  • Economics of Design
  • Fast Prototyping, Low Volume
  • Custom Design, Labor Intensive, High Volume
  • CAD Tools Needed to Achieve the Design
    Strategies
  • System-level design: Concept to VHDL/C
  • Physical design: VHDL/C to silicon, Timing closure
    (Monterey, Magma, Synopsys, Cadence, Avant!)
  • Design Strategies: Hierarchy, Regularity,
    Modularity, Locality

7
ASIC Design Strategies
  • Design is a continuous tradeoff to achieve
    performance specs with adequate results in all
    the other parameters.
  • Performance Specs - function, timing, speed,
    power
  • Size of Die - manufacturing cost
  • Time to Design - engineering cost and schedule
  • Ease of Test Generation (Testability) -
    engineering cost, manufacturing cost, schedule

8
ASIC Flow
9
Structured ASIC Designs
  • Hierarchy: Subdivide the design into many levels
    of sub-modules
  • Regularity: Subdivide to max number of similar
    sub-modules at each level
  • Modularity: Define sub-modules unambiguously, with
    well defined interfaces
  • Locality: Maximize local connections, keeping critical
    paths within module boundaries

10
ASIC Design Options
  • Programmable Logic
  • Programmable Interconnect
  • Reprogrammable Gate Arrays
  • Sea of Gates Gate Array Design
  • Standard Cell Design
  • Full Custom Mask Design
  • Symbolic Layout
  • Process Migration - Retargeting Designs

11
ASIC Design Methodologies
12
Why SOC?
  • SOC specs are coming from system engineers
    rather than RTL descriptions
  • SOC will bridge the gap between hardware/software and
    their implementation in novel, energy-efficient
    silicon architecture.
  • In SOC design, chips are assembled at IP block
    level (design reusable) and IP interfaces rather
    than gate level

13
CMOS density now allows complete
System-on-a-chip Solutions
Source: Brodersen, ICASSP '98
Also like to add
  • FPGA
  • Reconfigurable Interconnect

How do we design these chips?
14
Possible Single-Chip Radio Architectures
  • Software Radio
  • GOAL: Simplify System Design Process
  • Seek architectures which are flexible such that
    hardware and protocols can be designed
    independently
  • APPROACH: Minimize the use of dedicated logic
  • Universal Radio
  • GOAL: Maximize Bandwidth Efficiency and Battery
    Life
  • Seek architectures which perform complex
    algorithms very fast with minimal energy
  • APPROACH: Minimize the use of programmable logic

Why is SOC design so scary?
15
60 GHz SiGe Transceiver for Wireless LAN
Applications
  • A low power 30 GHz LNA is designed as the front
    end of the receiver.
  • Wideband and high gain response is realized by a
    2-stage design using a stagger-tuned technique.
  • The simulated performance predicts a forward gain
    of S21 > 20 dB over a 6 GHz range with an input
    match of S11 < -30 dB and output match of S22
    < -10 dB.
  • The mixer consists of a single balanced Gilbert
    cell.
  • A fully-integrated differential 25 GHz VCO is
    used, in conjunction with the mixer, to
    downconvert the RF input to a 5 GHz IF.

30 GHz receiver layout consisting of the LNA,
mixer and VCO
16
Wideband CMOS LC VCO
  • A 1.8 GHz wideband LC VCO implemented in 0.18 µm
    bulk CMOS has been successfully designed,
    fabricated, and measured.
  • This VCO utilizes a 4-bit array of switched
    capacitors and a small accumulation-mode varactor
    to achieve a measured tuning range exceeding 2:1
    (73%) and a worst-case tuning sensitivity of 270
    MHz/V.
  • The amplitude reference level is programmable by
    means of a 3-bit DAC.

VCO die photograph
17
A High Level View of an Industry Standard Design
Flow
Source: Hitachi, Prof. R. W. Brodersen
Problems with this flow
  • Every step can loop to every other step
  • Each step can take hours or days for a 100,000
    line description
  • HDL description contains no physical information
  • Different engineers handle the front-end and
    back-end design

How have semiconductor companies made this flow
work?
18
A More Accurate Picture of the Standard Flow
Source: IBM Semiconductor, Prof. R. Newton
  • Architecture: Partition the chip into functional
    units and generate bit-true test vectors to
    specify the behavior of each unit. TOOLS: Matlab,
    C, SPW, (VCC). FREEZE the test vectors.
  • Front-End: Enter HDL code which matches the test
    vectors. TOOLS: HDL Simulators, Design
    Compiler. FREEZE the HDL code.
  • Back-End: Create a floor-plan and tweak the tools
    until a successful mask layout is created. TOOLS:
    Design Compiler, Floor-planners, Placers,
    Routers, Clock-tree generators, Physical
    Verification

How can we improve this flow?
19
Common Fabric for IP Blocks
  • Soft IP blocks are portable, but not as
    predictable as hard IP.
  • Hard IP blocks are very predictable since a
    specific physical implementation can be
    characterized, but are hard to port since they are
    often tied to a specific process.
  • Common fabric is required for both portability
    and predictability.
  • Wide availability: Cell Based Array, a metal
    programmable architecture that provides the
    performance of a standard cell and is optimized
    for synthesis.

20
Four main applications
  • Set-top box: Mobile multimedia system, base
    station for the home local-area network.
  • Digital PCTV: concurrent use of TV, 3D graphics,
    and Internet services
  • Set-top box LAN service: Wireless home-networks,
    multi-user wireless LAN
  • Navigation system: steer and control traffic
    and/or goods-transportation

21
PC-Multimedia Applications
22
Types of System-on-a-Chip Designs
23
Physical gap
  • Timing closure problem: layout-driven logic and
    RT-level synthesis
  • Energy efficiency requires locality of
    computation and storage: a match for stream-based
    data processing of speech, images, and
    multimedia-system packets.
  • Next generation SOC designers must bridge the
    architectural gap between system specification and
    energy-efficient IP-based architectures, while
    CAE vendors and IP providers will bridge the
    physical gap.

24
Circular Y-Chart
25
SOC Co-Design Challenges
  • Current systems are complex and heterogeneous:
    they contain many different types of components
  • Half of the chip can be filled with 200
    low-power, RISC-like processors (ASIP)
    interconnected by field-programmable buses,
    embedded in 20 Mbytes of distributed DRAM and
    flash memory; the other half is ASIC
  • Computational power will not result from
    multi-GHz clocking but from parallelism, with
    clock rates below 200 MHz.
  • This will greatly simplify the design for correct
    timing, testability, and signal integrity.

26
Bridging the architectural gap
  • One-M gate reconfigurable, one-M gate hardwired
    logic.
  • 50 GIPS for programmable components or 500 GIPS
    for dedicated hardware
  • Product reliability design at a level far above
    the RT level, with reuse factors in excess of 100
  • Trade-off: 100 MOPS/watt (microprocessor) vs.
    100 GOPS/watt (hardwired). Reconfigurable computing with a
    large number of computing nodes and a very
    restricted instruction set (Pleiades)

27
Why Lower Power
  • Portable systems
  • long battery life
  • light weight
  • small form factor
  • IC priority list
  • power dissipation
  • cost
  • performance
  • Technology direction
  • Reduced voltage/power designs based on mature
    high performance IC technology, high integration
    to minimize size, cost, power, and speed

28
Microprocessor Power Dissipation
29
Levels for Low Power Design
30
Power-hungry Applications
  • Signal Compression: HDTV Standard, ADPCM,
    Vector Quantization, H.263, 2-D motion
    estimation, MPEG-2 storage management
  • Digital Communications: Shaping Filters,
    Equalizers, Viterbi decoders, Reed-Solomon
    decoders

31
New Computing Platforms
  • SOC power efficiency: more than 10 GOPS/W
  • Higher On-Chip System Integration: COTS 100 W,
    SOC 10 W (inter-chip capacitive loads, I/O
    buffers)
  • Speed Performance: shorter interconnection, fewer
    drivers, faster devices, more efficient processing
    architectures
  • Mixed signal systems
  • Reuse of IP blocks
  • Multiprocessor, configurable computing
  • Domain-specific, combined memory-logic

32
Low Power Design Flow I
33
Low Power Design Flow II
34
Three Factors affecting Energy
  • Reducing waste by Hardware Simplification:
    redundant h/w extraction, Locality of
    reference, Demand-driven / Data-driven
    computation, Application-specific
    processing, Preservation of data correlations,
    Distributed processing
  • All-in-one Approach (SOC): I/O pin and buffer
    reduction
  • Voltage Reducible Hardware
  • 2-D pipelining (systolic arrays)
  • SIMD: Parallel Processing, useful for data w/
    parallel structure
  • VLIW Approach - flexible

35
IBM's PowerPC Lower Power Architecture
  • Optimum Supply Voltage through Hardware Parallelism,
    Pipelining, Parallel instruction execution
  • 603e executes five instructions in parallel (IU,
    FPU, BPU, LSU, SRU)
  • FPU is pipelined so a multiply-add instruction
    can be issued every clock cycle
  • Low power 3.3-volt design
  • Use small complex instruction with smaller
    instruction length
  • IBM's PowerPC 603e is RISC
  • Superscalar: CPI < 1
  • 603e issues as many as three instructions per
    cycle
  • Low Power Management
  • 603e provides four software controllable
    power-saving modes.
  • Copper Processor with SOI
  • IBM's Blue Logic ASIC: New design reduces
    power by a factor of 10

36
Power-Down Techniques
Lowering the voltage along with the clock
actually alters the energy-per-operation of the
microprocessor, reducing the energy required to
perform a fixed amount of work
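The point can be made concrete with the usual first-order CMOS switching model, in which energy per operation scales with the square of the supply voltage. The sketch below is only an illustration: the effective capacitance `C_EFF` and the two voltage/frequency operating points are hypothetical values, not figures from the slides.

```python
# First-order dynamic energy model: E_op = C_eff * Vdd^2, P = E_op * f.
# C_EFF and the operating points below are hypothetical illustration values.
C_EFF = 1e-9  # farads switched per operation (assumed)


def energy_per_op(vdd: float, c_eff: float = C_EFF) -> float:
    """Switching energy of one operation in joules."""
    return c_eff * vdd ** 2


def power(vdd: float, f_clk: float, c_eff: float = C_EFF) -> float:
    """Average dynamic power in watts at one operation per clock."""
    return energy_per_op(vdd, c_eff) * f_clk


def energy_for_work(n_ops: int, vdd: float) -> float:
    """Energy to complete a fixed amount of work, independent of clock rate."""
    return n_ops * energy_per_op(vdd)


if __name__ == "__main__":
    n_ops = 1_000_000
    # Slowing only the clock halves power but not the energy for the job.
    print(power(3.3, 100e6), power(3.3, 50e6), energy_for_work(n_ops, 3.3))
    # Lowering the voltage with the clock also lowers energy per operation.
    print(power(1.65, 50e6), energy_for_work(n_ops, 1.65))
```

In this model, halving the clock alone leaves the energy for a fixed job unchanged, while halving the voltage as well cuts it by a factor of four.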
37
Implementing Digital Systems
38
H/W and S/W Co-design

39
Three Co-Design Approaches
  • IFIP International Conference FORTE/PSTV'98,
    Nov. '98: N.S. Voros et al., Hardware-software
    co-design of embedded systems using multiple
    formalisms for application development
  • ASIP co-design: builds a specific programmable
    processor for an application, and translates the
    application into software code. H/w and s/w
    partitioning includes the instruction set design.
  • H/w s/w synchronous system co-design: s/w
    processor as a master controller, and a set of
    h/w accelerators as co-processors. Vulcan, Codes,
    Tosca, Cosyma
  • H/w s/w for distributed systems: mapping of a set
    of communicating processes onto a set of
    interconnected processors. Behavioral
    decomposition, process allocation and
    communication transformation. Coware (powerful),
    Siera (reuse), Ptolemy (DSP)

40
Mixing H/W and S/W
  • Argument: Mixed hardware/software systems
    represent the best of both worlds.
  • High performance, flexibility, design reuse, etc.
  • Counterpoint: From a design standpoint, it is
    the worst of both worlds
  • Simulation: Problems of verification and test
    become harder
  • Interface: Too many tools, too many interactions,
    too much heterogeneity
  • Hardware/software partitioning is
    AI-complete!
  • (MIT, Stanford; by analogy with "NP-complete") A
    term used to describe problems in artificial
    intelligence, to indicate that the solution
    presupposes a solution to the "strong AI problem"
    (that is, the synthesis of a human-level
    intelligence). A problem that is AI-complete is
    just too hard.

41
Low power partitioning approach
  • Different HW resources are invoked according to
    the instruction executed at a specific point in
    time
  • During the execution of the add op., ALU and
    register are used, but the Multiplier is in idle
    state.
  • Non-active resources will still consume energy
    since the corresponding circuits continue to switch
  • Calculate wasted energy
  • Adding application specific core and partial
    running
  • Whenever one core is running, all the other
    cores are shut down
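The "calculate wasted energy" step can be sketched as a walk over an instruction trace that charges every resource an instruction does not use with its idle switching energy. Everything here is a hypothetical illustration: the resource names, the per-cycle energies in `RESOURCES`, and the `USES` mapping are assumptions, not data from the slides.

```python
# Hypothetical per-cycle energies in pJ: (active, idle-but-still-switching).
RESOURCES = {
    "alu": (5.0, 2.0),
    "register": (1.0, 0.5),
    "multiplier": (20.0, 8.0),
}

# Which resources each instruction type actually needs (assumed).
USES = {
    "add": {"alu", "register"},
    "mul": {"multiplier", "register"},
}


def wasted_energy(trace, shut_down_idle=False):
    """Sum the idle energy of every resource not used by each instruction.

    With shut_down_idle=True, unused resources are assumed to be fully
    gated off, which models the "all other cores are shut down" policy.
    """
    if shut_down_idle:
        return 0.0
    wasted = 0.0
    for op in trace:
        for name, (_active, idle) in RESOURCES.items():
            if name not in USES[op]:
                wasted += idle
    return wasted


if __name__ == "__main__":
    trace = ["add", "add", "mul", "add"]
    print("wasted pJ without shutdown:", wasted_energy(trace))
    print("wasted pJ with shutdown:   ", wasted_energy(trace, True))
```

A partitioner could use such an estimate to decide which operations deserve their own application-specific core.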

42
ASIP (Application Specific Instruction
Processors) Design
  • Given a set of applications, determine micro
    architecture of ASIP (i.e., configuration of
    functional units in datapaths, instruction set)
  • To accurately evaluate performance of processor
    on a given application, one needs to compile the
    application program onto the processor datapath
    and simulate object code.
  • The micro architecture of the processor is a
    design parameter!

43
ASIP Design Flow
44
Cross-Disciplinary nature
  • Software for low power: loop transformation leads
    to much higher temporal and spatial locality of
    data.
  • Code size becomes an important objective: software
    will eventually become a part of the chip
  • Behavior-platform-compiler codesign: codesigned
    with C or JAVA, describing their h/w and s/w
    implementation.
  • Multidisciplinary system thinking is required for
    future designs (e.g., Eindhoven Embedded Systems
    Institute, http://www.eesi.tue.nl/english)

45
VLSI Signal Processing Design Methodology
  • pipelining, parallel processing, retiming,
    folding, unfolding, look-ahead, relaxed
    look-ahead, and approximate filtering
  • bit-serial, bit-parallel and digit-serial
    architectures, carry save architecture
  • redundant and residue systems
  • Viterbi decoder, motion compensation,
    2D-filtering, and data transmission systems

46
Low Power DSP
  • DO-LOOP Dominant

VSELP Vocoder: 83.4%
2D 8x8 DCT: 98.3%
LPC computation: 98.0%
DO-LOOP Power Minimization => DSP Power Minimization
VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding
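Why loop dominance matters can be shown with an Amdahl-style estimate. The sketch assumes the figures above are the fraction of the total (power or cycles; the slide does not say which) spent inside DO-loops, and the factor-of-2 loop saving is a hypothetical input.

```python
# Amdahl-style estimate of the total after optimizing only the DO-loops.
# Assumption: the slide's figures are the loop share of the total, in percent.
LOOP_SHARE = {
    "VSELP Vocoder": 0.834,
    "2D 8x8 DCT": 0.983,
    "LPC computation": 0.980,
}


def remaining_fraction(loop_share: float, loop_reduction: float) -> float:
    """Fraction of the original total left when the loop part shrinks by
    `loop_reduction` (e.g. 2.0 means loops cost half as much)."""
    return (1.0 - loop_share) + loop_share / loop_reduction


if __name__ == "__main__":
    for name, share in LOOP_SHARE.items():
        left = remaining_fraction(share, 2.0)  # hypothetical 2x loop saving
        print(f"{name}: {left:.3f} of the original total remains")
```

With a loop share near 98%, almost all of any loop-level saving carries through to the whole DSP program.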
47
Deep-Submicron Design Flows
  • Rapid evaluation of complex designs for area and
    performance
  • Timing convergence via estimated routing
    parasitics
  • In-place timing repair without resynthesis
  • Shorter design intervals, minimum iterations
  • Block-level design and place and route
  • Localized changes without disturbance
  • Integration of complex projects and design reuse

48
SOC CAD Companies
  • Avant! www.avanticorp.com
  • Cadence www.cadence.com
  • Duet Tech www.duettech.com
  • Escalade www.escalade.com
  • LogicVision www.logicvision.com
  • Mentor Graphics www.mentor.com
  • Palmchip www.palmchip.com
  • Sonics www.sonicsinc.com
  • Summit Design www.summit-design.com
  • Synopsys www.synopsys.com
  • Topdown design solutions www.topdown.com
  • Xynetix Design Systems www.xynetix.com
  • Zuken-Redac www.redac.co.uk

49
Design Technology for Low Power Radio Systems
Rhett Davis Dept. of EECS Univ. of Calif. Berkeley
  • http://bwrc.eecs.berkeley.edu

50
Domain of Interest
  • Highly integrated system-on-a-chip solutions
    SOCs
  • Wireless communications with associated
    processing, e.g. multimedia processing,
    compression, switching, etc
  • Primary computation is high complexity dataflow
    with a relatively small amount of control

51
Why Systems-on-a-Chip - SOC ?
  • State-of-the-Art CMOS is easily able to implement
    complete systems (or what was on a board before)
  • A microprocessor core is only 1-2 mm²
    (1-2% of the area of a $4 chip)
  • Portability (size) is critical to meet the cost,
    power and size requirements of future wireless
    systems
  • Chips will be required to support the complete
    application (wireless internet, multimedia)
  • Dedicated stand-alone computation is replacing
    general purpose processors as the semiconductor
    industry driver

52
Cellular Phones: An example
(Courtesy Mike McMahon, Texas Instruments)
53
Cellular Phone Baseband SOC
ROM
MCU
DSP
Gates
RAM
Analog
2000 phones on each 8" wafer @ .15 Leff
1 Million Baseband Chips per Day!!!
(Courtesy Mike McMahon, Texas Instruments)
54
Wireless System Design Issues
  • It is now possible to use CMOS to integrate all
    digital radio functions but what is the best
    architectural way to use CMOS???
  • Computation rates for wireless systems will
    easily range up to 100s of GOPS in signal
    processing
  • What's keeping us from achieving this in silicon?
  • What can we do about it?

55
Computational Efficiency Metrics
  • Definition: MOPS
  • Millions of algorithmically defined arithmetic
    operations per second (e.g. multiply, add, shift);
    in a GP processor, several instructions per useful
    operation
  • Figures of merit
  • MOPS/mW - Energy efficiency (battery life)
  • MOPS/mm² - Area efficiency (cost)
  • Optimization of these efficiencies is the basic
    goal assuming functionality is met

56
Energy-Efficiency of Architectures
57
Software Processors Energy Trends
  • Primary means of performance increase of software
    processors has been by increasing clock rate
  • Decreasing Energy Efficiency

$E \propto C \cdot V_{DD}^2$
58
Software Processors Area Trends
  • Increasing clock rate results in a memory
    bottleneck addressed by bringing memory on-chip
  • Area is increasingly dominated by memory,
    degrading MOPS/mm²

16x16 multiplier (0.05 mm²)
DSP processor with 1 multiplier (25 mm²)
Why time multiplex to save area if the overhead
is much greater than the area saved????
59
Parallelism is the answer, but
  • Not by putting Von Neumann processors in parallel
    and programming with a sequential language
  • Attempts to do this have failed over and over
    again
  • The parallel computer compiler problem is very
    difficult
  • Not by trying to capture parallelism at the
    instruction level
  • Superscalar, VLIW, etc are very inefficient
  • Hardware can't figure out the parallelism from a
    sequential language either
  • The problem is the initial sequential description
    (e.g. C) which is poorly matched to highly
    parallel applications

60
What is really happening
Then try to rediscover the parallelism
Re-entering it using a sequential description
Starting with a parallel algorithmic description
  • While (i0iiltnum)
  • a a ci
  • bi sin (a pi) cos(api)
  • Outfil bi indata

We take this path so that we can use an
architecture that is orders of magnitude less
efficient in energy and area ??????
61
What can a fully parallel CMOS solution
potentially do?
  • In 0.25 micron, a multiplier requires 0.05 mm² and
    7 pJ per operation at 1 V. Adders and registers
    are about 10 times smaller and 10 times lower
    energy
  • Let's implement a 50 mm², 0.25 micron chip using
    adders, registers and multipliers
  • We can have 2000 adders/registers and 200
    multipliers in less than 1/2 of the chip; also
    assume 1/3 of power goes into clocks
  • 25 MHz clock (1 volt) gives 50 Gops at 100 mW
  • 500 MOPS/mW and 1000 MOPS/mm²
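The arithmetic behind these round numbers can be checked directly. The sketch below uses only the figures quoted on the slide; the one interpretation it adds is that "1/3 of power goes into clocks" means the datapath accounts for the remaining 2/3 of total power, and that every adder and multiplier performs one operation per cycle.

```python
# Figures from the slide (0.25 micron, 1 V).
MULT_AREA_MM2 = 0.05
MULT_ENERGY_PJ = 7.0
ADDER_AREA_MM2 = MULT_AREA_MM2 / 10     # "10 times smaller"
ADDER_ENERGY_PJ = MULT_ENERGY_PJ / 10   # "10 times lower energy"
N_ADDERS, N_MULTS = 2000, 200
F_CLK_HZ = 25e6
CHIP_AREA_MM2 = 50.0
CLOCK_POWER_FRACTION = 1 / 3            # assumed share of total power

# Area used by the datapath elements.
datapath_area_mm2 = N_ADDERS * ADDER_AREA_MM2 + N_MULTS * MULT_AREA_MM2

# Throughput, assuming one operation per unit per cycle.
ops_per_second = (N_ADDERS + N_MULTS) * F_CLK_HZ

# Datapath power, then total power including the clock share.
energy_per_cycle_j = (N_ADDERS * ADDER_ENERGY_PJ + N_MULTS * MULT_ENERGY_PJ) * 1e-12
datapath_power_w = energy_per_cycle_j * F_CLK_HZ
total_power_w = datapath_power_w / (1 - CLOCK_POWER_FRACTION)

mops = ops_per_second / 1e6
mops_per_mw = mops / (total_power_w * 1e3)
mops_per_mm2 = mops / CHIP_AREA_MM2

if __name__ == "__main__":
    print(f"datapath area : {datapath_area_mm2:.1f} mm^2 of {CHIP_AREA_MM2} mm^2")
    print(f"throughput    : {ops_per_second / 1e9:.0f} Gops")
    print(f"total power   : {total_power_w * 1e3:.0f} mW")
    print(f"efficiency    : {mops_per_mw:.0f} MOPS/mW, {mops_per_mm2:.0f} MOPS/mm^2")
```

Under these assumptions the results land slightly above the slide's rounded values of 50 Gops, 100 mW, 500 MOPS/mW and 1000 MOPS/mm².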

62
Start with a parallel description of the
algorithm
63
Then directly map into hardware
64
Results in fully parallel solutions
Columns: (A) 64-point FFT, energy per transform (nJ); (B) 16-state Viterbi decoder, energy per decoded bit (nJ); (C) 64-point FFT, transforms per second per unit area (Trans/ms/mm²); (D) 16-state Viterbi decoder, decode rate per unit area (kb/s/mm²)

                          Energy               Area
                          (A)       (B)        (C)       (D)
Direct-Mapped Hardware    1.78      0.022      2,200     200,000
FPGA                      683       5.5        1.8       100
Low-Power DSP             436       19.6       4.3       50
High-Performance DSP      1700      108        10        150

(numbers taken from vendor-published
benchmarks) Orders of magnitude lower efficiency
even for an optimized processor architecture
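The "orders of magnitude" claim can be read straight off the table by normalizing each row to the direct-mapped hardware. A small sketch using only the table's numbers; the dictionary layout and function name are illustrative.

```python
# Table values: (FFT energy nJ, Viterbi energy nJ/bit,
#                FFT Trans/ms/mm2, Viterbi kb/s/mm2)
TABLE = {
    "Direct-Mapped Hardware": (1.78, 0.022, 2200.0, 200000.0),
    "FPGA": (683.0, 5.5, 1.8, 100.0),
    "Low-Power DSP": (436.0, 19.6, 4.3, 50.0),
    "High-Performance DSP": (1700.0, 108.0, 10.0, 150.0),
}


def penalty_vs_direct_mapped(name: str):
    """Return how many times worse `name` is than direct-mapped hardware.

    Energy columns: more energy is worse, so divide by the reference.
    Area columns: lower throughput per mm2 is worse, so divide the reference.
    """
    ref = TABLE["Direct-Mapped Hardware"]
    row = TABLE[name]
    return (row[0] / ref[0], row[1] / ref[1], ref[2] / row[2], ref[3] / row[3])


if __name__ == "__main__":
    for name in TABLE:
        fft_e, vit_e, fft_a, vit_a = penalty_vs_direct_mapped(name)
        print(f"{name:24s} energy x{fft_e:7.0f} x{vit_e:7.0f}"
              f"   area x{fft_a:7.0f} x{vit_a:7.0f}")
```

Every alternative comes out at least two orders of magnitude worse on each of the four metrics.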
65
Reasons software solutions seem attractive
  • (1) Believed to reduce time-to-system-implementation
  • (2) Provides flexibility
  • (3) Locks the customers into an architecture they
    can't change
  • (4) Difficulty in getting dedicated SOC chips
    designed
  • Are these good reasons???

66
(1) Believed to reduce time-to-system
implementation
  • Software decreases time to get first prototype,
    but time to fully verified system is much longer
    (hardware is often ready but software still needs
    to be done)
  • Limitations of the software prototype often set the
    ultimate limit of the system performance
  • Software solutions can be shipped with bugs, not
    a real option for SOC

67
(2) Need flexibility
  • Software is not always flexible
  • Can be hard to verify
  • Flexibility does not imply software
    programmability
  • Domain specific design can have multiple modules,
    coefficients and local state control (the factor
    of 100 in efficiency) to address a range of
    applications
  • Reconfiguration of interconnect can achieve
    flexibility with high levels of efficiency

68
Flexibility without software
Energy per Transform vs. FFT size
Transforms per Second per mm2 vs. FFT size
All results are scaled to 0.18 µm
69
Reasons software solutions seem attractive
  • (1) Believed to reduce time-to-system
    implementation
  • (2) Provides flexibility
  • (3) Locks the customers into an architecture they
    can't change
  • (4) Difficulty in getting dedicated SOC chips
    designed

70
Standard DSP-ASIC Design Flow
Problems
  • Three translations of design data
  • Requirements for re-verification at each stage
  • Uncontrolled looping when pipeline stalls

Prohibitively Long Design Time for Direct Mapped
Architectures
71
Direct Mapping Design Flow
  • Encourages iterations of layout
  • Controls looping
  • Reduces the flow to a single phase
  • Depends on fast automation

72
Déjà vu???
  • An automated style of design with parameterized
    modules processed through foundries is just the
    reincarnation of good ole Silicon Compilation of
    >10 years ago
  • What happened?
  • A decline of research into design methodologies
  • A single dominant flow has resulted - the
    Verilog-Synopsys-Standard Cell
  • Lack of tool flows to support alternative styles
    of design
  • Research community lost access to technology and
    moved to highly sub-optimal processor and FPGA
    solutions

73
Capturing Design Decisions
  • Categories
  • Function - basic input-output behavior
  • Signal - physical signals and types
  • Circuit - transistors
  • Floorplan - physical positions

How to get layout and performance estimates in a
day?
74
Simplified View of the Flow
  • New Software
  • Generation of netlists from a dataflow graph
  • Merging of floorplan from last iteration
  • Automatic routing and performance analysis
  • Automation of flow as a dependency graph (UNIX
    MAKE program)
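Treating the flow as a dependency graph, in the way the UNIX MAKE program does, can be sketched with Python's standard `graphlib`. The step names in `FLOW` are hypothetical labels for the bullets above, not the actual targets of the authors' flow.

```python
from graphlib import TopologicalSorter

# Each target maps to the set of things it depends on (hypothetical names).
FLOW = {
    "netlist": {"dataflow_graph"},
    "floorplan": {"netlist", "previous_floorplan"},
    "routing": {"floorplan"},
    "performance_report": {"routing"},
}


def build_order(flow):
    """Return every node in an order that respects the dependencies."""
    return list(TopologicalSorter(flow).static_order())


def targets_to_rebuild(flow, changed):
    """Return the targets that become stale when `changed` is modified,
    in the order they must be rebuilt (what make would re-run)."""
    dirty = {changed}
    rebuild = []
    for node in TopologicalSorter(flow).static_order():
        if flow.get(node, set()) & dirty:
            dirty.add(node)
            rebuild.append(node)
    return rebuild


if __name__ == "__main__":
    print(build_order(FLOW))
    print(targets_to_rebuild(FLOW, "previous_floorplan"))
```

Editing only the floorplan leaves the netlist untouched, so only the downstream steps are re-run; this is the property that keeps each design iteration short.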

75
Why Simulink?
  • Simulink is an easy sell to algorithm developers
  • Closely integrated with popular system design
    tool Matlab
  • Successfully models digital and analog circuits

76
Modeling Datapath Logic
  • Discrete-Time (cycle accurate)
  • Fixed-Point Types (bit true)
  • Completely specify function and signal decisions
  • No need for RTL

Multiply / Accumulate
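What "cycle accurate" and "bit true" mean for a multiply/accumulate block can be illustrated outside Simulink. This is a plain Python sketch, not the Simulink model from the slides; the word lengths (8-bit inputs with 7 fractional bits, 20-bit accumulator) are assumed for illustration.

```python
def wrap(value: int, bits: int) -> int:
    """Wrap an integer into signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def to_fixed(x: float, frac_bits: int, bits: int) -> int:
    """Quantize a real number to a signed fixed-point integer code."""
    return wrap(round(x * (1 << frac_bits)), bits)


def mac_step(acc: int, a: int, b: int, acc_bits: int) -> int:
    """One clock cycle of the MAC: acc <- acc + a*b, wrapped like hardware."""
    return wrap(acc + a * b, acc_bits)


def run_mac(samples, coeffs, acc_bits=20):
    """Accumulate sample*coefficient products, one pair per cycle."""
    acc = 0
    for a, b in zip(samples, coeffs):
        acc = mac_step(acc, a, b, acc_bits)
    return acc


if __name__ == "__main__":
    IN_BITS, FRAC = 8, 7            # assumed input format
    a = to_fixed(0.5, FRAC, IN_BITS)
    b = to_fixed(0.25, FRAC, IN_BITS)
    acc = run_mac([a, a, a], [b, b, b])
    # Products carry 2*FRAC fractional bits.
    print(acc, acc / (1 << (2 * FRAC)))
```

Because every value is an integer of known width, the same vectors can be compared bit for bit against the hardware, which is what makes a separate RTL description unnecessary in this flow.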
77
Modeling Control Logic
  • Extended finite state-machine editor
  • Co-simulation with dataflow graph
  • New Software Stateflow-VHDL translator
  • No need for RTL

Address Generator / MAC Reset
78
Specifying Circuit Decisions
  • Macro choices embedded in dataflow graph
  • Cross-check simulations required

79
Hierarchy Hardened Progressively
  • Macro characterization saved for fast estimates
  • Each level of hierarchy becomes a new hard macro
  • Higher levels of hierarchy are adjusted
  • When top level of hierarchy is hardened, the
    design is done

80
Capturing Floorplan Decisions
  • Commercial physical design tools used
  • Instance names in floorplan match dataflow graph
  • Placements merged on each iteration
  • Manhattan distance can be used for parasitic
    estimates
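A Manhattan-distance parasitic estimate can be sketched as below. The capacitance per micron and the instance names and coordinates are hypothetical placeholders; in the flow described here they would come from the process data and the merged floorplan.

```python
# Hypothetical wire capacitance per micron of routed length, in fF.
CAP_PER_UM_FF = 0.2

# Hypothetical placements: instance name -> (x, y) in microns.
PLACEMENTS = {
    "mac0": (0.0, 0.0),
    "addr_gen": (300.0, 400.0),
    "sram0": (300.0, 0.0),
}


def manhattan_um(a, b):
    """Manhattan (rectilinear) distance between two points in microns."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wire_cap_ff(src: str, dst: str, placements=PLACEMENTS) -> float:
    """Lumped capacitance estimate for a two-pin net between instances."""
    return manhattan_um(placements[src], placements[dst]) * CAP_PER_UM_FF


if __name__ == "__main__":
    for net in (("mac0", "addr_gen"), ("mac0", "sram0")):
        print(net, f"{wire_cap_ff(*net):.1f} fF")
```

Since the instance names in the floorplan match the dataflow graph, an estimate like this can be back-annotated onto each net without running a router.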

81
Reduced Impact of Interconnect
  • 0.18 µm

Long wires can be modeled as lumped capacitances
82
Race-Immune Clock Tree Synthesis
  • Race margin 580 ps
  • 0.18 µm
  • VDD 1 V

Demonstrated on a 600k transistor design
83
Example 1 Macro Hardening
Most time/disk space spent on extraction and
power simulation
84
Example 2 Test Chip
  • 300k transistors
  • 0.25 µm
  • 1.0 V
  • 25 MHz
  • 6.8 mm²
  • 14 mW
  • 2 phase clock
  • 3 layers of P&R hierarchy

Parallel Pipelined FIR Filter (8X decimation
filter for 12-bit 200 MHz ΣΔ)
85
TDMA Baseband Receiver
  • 600k transistors
  • 0.18 µm
  • 1.0 V
  • 25 MHz
  • 1.1 mm²
  • 21 mW
  • single phase clock
  • 5 clock domains
  • 2 layers of P&R hierarchy

86
Conclusions
  • Direct-Mapped hardware is the most efficient use
    of silicon
  • Direct-Mapped hardware can be easier to design
    and verify than embedded hardware/software
    systems
  • Don't translate design data, refine it
  • Design with dataflow graphs, not sequential code
  • Design flow automation speeds up design space
    exploration

87
Embedded Processor Architectures and
(Re)Configurable Computing
  • Vandana Prabhu
  • Professor Jan M. Rabaey

Jan 10, 2000
88
Pico Radio Architecture
FPGA
Embedded uP
Dedicated FSM
Dedicated DSP
Reconfigurable DataPath
89
Reconfigurable Computing Merging Efficiency and
Versatility
Spatially programmed connection of processing
elements.
Hardware customized to specifics of
problem. Direct map of problem specific dataflow,
control. Circuits adapted as problem
requirements change.
90
Matching Computation and Architecture
91
Implementation Fabrics for Data Processing
300 million multiplications/sec 357 million
add-subs/sec
                 Adaptive Pilot Correlator    Digital Baseband Receiver
                 Power        Area            Power        Area
DSP              460 mW       1089 mm²        1500 mW      3600 mm²
Direct Mapped    3 mW         1.3 mm²         10 mW        5 mm²
Pleiades         18.49 mW     5.44 mm²        62.33 mW     21.34 mm²
Data In
16 Mmacs/mW!
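The 16 Mmacs/mW figure appears to follow from dividing the multiplication rate by the Pleiades power for the adaptive pilot correlator. The sketch below makes that assumption explicit: it takes the 300 million multiplications/sec as the correlator workload and counts each multiplication as one MAC; neither reading is stated on the slide.

```python
# Assumed workload for the adaptive pilot correlator (from the slide header).
MULTS_PER_SECOND = 300e6

# Adaptive pilot correlator power per fabric, in mW (from the table).
POWER_MW = {
    "DSP": 460.0,
    "Direct Mapped": 3.0,
    "Pleiades": 18.49,
}


def mmacs_per_mw(power_mw: float, macs_per_second: float = MULTS_PER_SECOND) -> float:
    """Millions of MACs per second delivered per milliwatt."""
    return (macs_per_second / 1e6) / power_mw


if __name__ == "__main__":
    for fabric, p in POWER_MW.items():
        print(f"{fabric:14s} {mmacs_per_mw(p):7.2f} Mmacs/mW")
```

On this reading the reconfigurable fabric sits between the programmable DSP and the direct-mapped hardware, which is the trade-off between efficiency and versatility the slide is illustrating.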
92
Software Methodology Flow
[Flow diagram. Recoverable labels: Algorithms; Constraints (Area, Timing); µproc, Accelerator and PDA Models; Kernel Detection; Behavioral Xforms for low power; Estimation/Exploration; Premapped Kernels; Power & Timing Estimation of Various Kernel Implementations; Partitioning; Executable Intermediate Form; Software Compilation; Reconfig. Hardware Mapping; Interface Code Generation; Interconnect Optimization]
(Marlene Wan)
93
Maia Reconfigurable Baseband Processor for
Wireless
  • 0.25 µm tech: 4.5 mm x 6 mm
  • 1.2 Million transistors
  • 40 MHz at 1V
  • 1 mW VCELP voice coder
  • Hardware
  • 1 ARM-8
  • 8 SRAMs 8 AGPs
  • 2 MACs
  • 2 ALUs
  • 2 In-Ports and 2 Out-Ports
  • 14x8 FPGA

94
Implementation Fabrics for Protocols
A protocol = Extended FSM
  • ASIC: 1 V, 0.25 µm CMOS process
  • FPGA: 1.5 V, 0.25 µm CMOS low-energy
    FPGA
  • ARM8: 1 V, 25 MHz processor; n = 13,000
  • Ratio: 1 : 8 : >> 400

Idea: Exploit the model of computation: concurrent
finite state machines, communicating through
message passing
Intercom TDMA MAC
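The model of computation named here, concurrent finite state machines communicating through message passing, can be sketched in a few lines. The two machines, their states and their messages are hypothetical and only loosely inspired by a TDMA MAC; they are not the Intercom protocol.

```python
from collections import deque


class Fsm:
    """A finite state machine with an inbox of pending messages."""

    def __init__(self, state, transitions):
        self.state = state
        # (state, message) -> (next_state, (target_name, message) or None)
        self.transitions = transitions
        self.inbox = deque()

    def step(self, peers) -> bool:
        """Consume one message if available; return True if one was handled."""
        if not self.inbox:
            return False
        msg = self.inbox.popleft()
        next_state, out = self.transitions.get((self.state, msg), (self.state, None))
        self.state = next_state
        if out is not None:
            target, out_msg = out
            peers[target].inbox.append(out_msg)
        return True


def demo():
    """Run a hypothetical MAC/radio pair until no messages remain."""
    fsms = {
        "mac": Fsm("idle", {
            ("idle", "slot_start"): ("sending", ("radio", "tx")),
            ("sending", "done"): ("idle", None),
        }),
        "radio": Fsm("off", {
            ("off", "tx"): ("on", ("mac", "done")),
        }),
    }
    fsms["mac"].inbox.append("slot_start")
    handled = 0
    while any(f.step(fsms) for f in list(fsms.values())):
        handled += 1
    return handled, {name: f.state for name, f in fsms.items()}


if __name__ == "__main__":
    print(demo())
```

Because each machine only reacts to messages in its own inbox, the same description can be mapped onto dedicated hardware, an FPGA, or a processor, which is the comparison the ratios on this slide are making.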
95
Low-Power FPGA
  • Low Energy Embedded FPGA (Varghese George)
  • Test chip
  • 8x8 CLB array
  • 5 in - 3 out CLB
  • 3-level interconnect hierarchy
  • 4 mm² in 0.25 µm ST CMOS
  • 0.8 and 1.5 V supply
  • Simulation Results
  • 125 MHz Toggle Frequency
  • 50 MHz 8-bit adder
  • energy 70 times lower than comparable Xilinx

96
An Energy-Efficient µP System
  • Dynamic Voltage Scaling (Trevor Pering & Tom
    Burd)

Lower speed, Lower voltage, Lower energy
Before
µProc. Speed
After
Idle
97
Xtensa Configurable Processor
  • Xtensa (Tensilica, Inc.) for embedded CPU
  • Configurability allows designer to keep minimal
    hardware overhead
  • ISA (compatible with 32 bit RISC) can be extended
    for software optimizations
  • Fully synthesizable
  • Complete HW/SW suite
  • VCC modeling for exploration
  • Requires mapping of fuzzy instructions of VCC
    processor model to real ISA
  • Requires multiple models depending on memory
    configuration
  • ISS simulation to validate accuracy of model

(Vandana Prabhu)
98
Microprocessor Optimizations for Network Protocols
  • Implements Transport layer on configurable
    processor
  • TDMA control and channel usage management
  • Upper layer of protocol is dominated by processor
    control flow
  • Memory routines, Branches, Procedure calls
  • Artifacts of code generation tools are significant
  • Excessively modular code introduces procedure
    calls
  • Uses dynamic memory allocation
  • Configurable processor
  • Increased size of register file
  • Customized instructions help datapath but not
    control

Efficient implementation at code generation and
architecture levels!
(Kevin Camera & Tim Tuan)
99
Implementation Methodology for Reconfigurable
Wireless Protocol
  • Changing granularity within protocol stack
    requires estimation tool for energy-efficient
    implementation
  • Software exploration on processors
  • Exploring Xtensa's TIE
  • Hardware exploration on FPGA platforms
  • Optimal FPGA architecture
  • Alternatively, a reconfigurable FSM analogous to
    the Pleiades approach for datapath kernels

(Suetfei Li & Tim Tuan)
100
TCI - A First Generation PicoNode
Memory Sub-system
Tensilica Embedded Proc.
Sonics Backplane
Programmable Protocol Stack
Configurable Logic (Physical Layer)
Baseband Processing
101
The System-on-a-Chip Nightmare
The Board-on-a-Chip Approach
Courtesy of Sonics, Inc
102
The Communications Perspective
(Mike Sheets)
Communications-based Design
103
Summary
  • Design for low-energy impacts all stages of the
    design process: the earlier the better
  • Energy reduction requires clear communication and
    computation abstractions
  • Efficient and abstract modeling of energy at
    behavior and architecture level is crucial
  • Efficient hardware implementation of protocol
    stack
  • Beat the SoC monster!

104
Targeting Tiled Architectures in Design
Exploration
  • Lilian Bossuet1, Wayne Burleson2, Guy Gogniat1,
  • Vikas Anand2, Andrew Laffely2, Jean-Luc Philippe1

105
Design Space Exploration Motivations
  • Design solutions for new telecommunication and
    multimedia applications targeting embedded
    systems
  • Optimization and reduction of SoC power
    consumption
  • Increase computing performance
  • Increase parallelism
  • Increase speed
  • Be flexible
  • Take into account run-time reconfiguration
  • Targeting multi-granularity (heterogeneous)
    architectures

106
Design Space Exploration Flow
  • Progressive design space reduction
  • iterative exploration
  • refinement of architecture model
  • increase of performance estimation accuracy
  • One level of abstraction for one level of
    estimation accuracy

107
Reconfigurable Architectures
  • Bridging the flexibility gap between ASICs and
    microprocessors [Hartenstein, DATE 2001]
  • Energy efficient and solution to low power
    programmable DSP [Rabaey, ICASSP 1997, FPL 2000]
  • Run-Time Reconfigurable [Compton & Hauck,
    1999]
  • => A key ingredient for future silicon
    platforms [Schaumont et al., DAC 2001]

108
Design Space of Reconfigurable Architecture
RECONFIGURABLE ARCHITECTURES (R-SOC)
MULTI GRANULARITY (Heterogeneous)
FINE GRAIN (FPGA)
COARSE GRAIN (Systolic)
Processor Coprocessor
Tile-Based Architecture
Coarse Grain Coprocessor
Fine Grain Coprocessor
Island Topology
Hierarchical Topology
Linear Topology
Hierarchical Topology
Mesh Topology
  • RAW
  • CHESS
  • MATRIX
  • KressArray
  • Systolix PulseDSP
  • Chameleon
  • REMARC
  • Morphosys
  • Pleiades
  • Garp
  • FIPSOC
  • Triscend E5
  • Triscend A7
  • Xilinx Virtex-II Pro
  • Altera Excalibur
  • Atmel FPSIC
  • Xilinx Virtex
  • Xilinx Spartan
  • Atmel AT40K
  • Lattice ispXPGA
  • Altera Stratix
  • Altera Apex
  • Altera Cyclone
  • Systolic Ring
  • RaPiD
  • PipeRench
  • DART
  • FPFA
  • aSoC
  • E-FPFA

109
A Target Architecture: aSoC
  • Adaptive System-on-a-Chip (aSoC)
  • Tiled architecture containing many heterogeneous
    processing cores (RISC, DSP, FPGA, Motion
    Estimation, Viterbi Decoder)
  • Mesh communication network controlled with
    statically determined communication schedule
  • A scalable architecture.

110
FPGA in System-on-a-Chip
  • Fast Time-To-Market
  • Post-Fabrication Customization
  • Broaden application domain
  • Run-time Reconfiguration
  • Bug Fixes
  • Upgrades
  • 10x-100x Worse
  • Area
  • Performance
  • Power

Mark L. Chang mchang_at_ee.washington.edu
111
aSoC Architecture
  • Heterogeneous Cores

[Figure: aSoC tiles with heterogeneous cores (uProc, MUL, FPGA)]
112
aSoC Communications Interface
  • Interface Crossbar
  • inter-tile transfer
  • tile to core transfer
  • Interconnect/Instruction Memory
  • contains instructions to configure the interface
    crossbar (cycle-by-cycle)
  • Interface Controller
  • selects the instruction
  • Coreports
  • data interface and storage for transfers with the
    tile IP core
  • Dynamic Voltage and Frequency Selection
  • Dynamic Power Management
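
As a rough illustration of the statically scheduled interface described above, the sketch below models one tile: an instruction memory holds one crossbar configuration per cycle, and a controller steps a program counter through it. The class name, port names, and the two-cycle schedule are assumptions made for this example, not the actual aSoC instruction format.

```python
from typing import Dict, List, Optional

PORTS = ("north", "south", "east", "west", "local")

# One instruction = crossbar configuration for one cycle: output port -> input port.
Instruction = Dict[str, str]


class TileInterface:
    """Toy model of one tile's communication interface (names are illustrative)."""

    def __init__(self, program: List[Instruction]):
        self.program = program   # plays the role of the interconnect/instruction memory
        self.pc = 0              # program counter kept by the interface controller

    def step(self, inputs: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
        """Apply the current instruction to the input ports, then advance the PC."""
        instruction = self.program[self.pc]
        outputs: Dict[str, Optional[int]] = {port: None for port in PORTS}
        for out_port, in_port in instruction.items():
            outputs[out_port] = inputs.get(in_port)
        self.pc = (self.pc + 1) % len(self.program)   # the static schedule repeats
        return outputs


# Two-cycle schedule: cycle 0 fans the north input out to south and east,
# cycle 1 delivers the west input to the local core.
schedule = [
    {"south": "north", "east": "north"},
    {"local": "west"},
]
tile = TileInterface(schedule)
cycle0 = tile.step({"north": 7, "west": 3})
cycle1 = tile.step({"north": 8, "west": 4})
```

Because the schedule is fixed ahead of time, no arbitration or routing decision is made at run time, so transfer latency in this model is fully predictable.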

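[Figure: aSoC communications interface. Labels: core, coreports, interface crossbar with North, South, East, West, and Local inputs and outputs; controller with PC, instruction memory, decoder, and local configuration; frequency and voltage selection; example instruction "North to South East".]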
113
aSoC Exploration ...
  • Type of tiles
  • Number of each type of tile
  • Placement of the tiles
  • Internal architecture of reconfigurable tiles (FPGA
    core)
  • Communication scheduling

114
Design Space Exploration Goals
  • Goal: rapid exploration of various architectural
    solutions to be implemented on heterogeneous
    reconfigurable architectures (aSoC) in order to
    select the most efficient architecture for one or
    several applications
  • Takes place before architectural synthesis
    (algorithmic specification with high level
    abstraction language)
  • Estimations are based on a functional
    architecture model (generic,
    technology-independent)
  • Iterative exploration flow to progressively
    refine the architecture definition, from a coarse
    model to a dedicated model

115
Design Exploration Flow Targeting Tiled
Architecture
116
Application Analysis
  • Use of algorithmic metrics and dedicated
    scheduling algorithms to highlight the target
    architectures
  • Algorithmic metrics
  • Characterize the application orientation
  • Processing
  • Memory
  • Control
  • Characterize the application potential
    parallelism
  • Processing
  • Memory
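
One possible reading of the metrics above, sketched under assumptions: orientation is taken as the share of processing, memory, and control operations in the total operation count, and potential parallelism as the total operation count divided by the critical-path length. These definitions and the sample counts are hypothetical; the slides do not give the actual formulas.

```python
from typing import Dict


def orientation(op_counts: Dict[str, int]) -> Dict[str, float]:
    """Return the share of each operation class in the total operation count."""
    total = sum(op_counts.values())
    if total == 0:
        raise ValueError("application has no operations")
    return {kind: count / total for kind, count in op_counts.items()}


def average_parallelism(total_ops: int, critical_path_ops: int) -> float:
    """Operations divided by critical-path length, used here as a parallelism proxy."""
    if critical_path_ops <= 0:
        raise ValueError("critical path must contain at least one operation")
    return total_ops / critical_path_ops


# Hypothetical operation counts extracted from an algorithmic specification.
counts = {"processing": 60, "memory": 30, "control": 10}
shares = orientation(counts)
dominant = max(shares, key=shares.get)          # class that dominates the application
parallelism = average_parallelism(100, 25)
```

A processing-dominated application with high parallelism would point toward datapath-rich tiles, while a control-dominated one would favor a processor core.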

117
Tile Exploration with 3 steps
  • Projection
  • Link between necessary resources (application)
    and available resources (tile)
  • Use of an allocation algorithm based on
    communication costs reduction
  • Composition
  • Take the function scheduling into account to
    estimate additional resources (registers, muxes, ...)
  • Estimation
  • performance interval computation (lower and upper
    bounds)
  • speed/resource utilization/power characterization
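
The projection and estimation steps can be illustrated with a small sketch. It assumes every operation takes one cycle: the lower bound keeps each resource type perfectly busy and ignores data dependencies, and the upper bound runs all operations one at a time. The function names, resource names, and counts are hypothetical, and the real estimator is likely more detailed.

```python
import math
from typing import Dict, Tuple


def project(needed: Dict[str, int], available: Dict[str, int]) -> bool:
    """Projection check: the tile must offer every resource type the application needs."""
    return all(available.get(kind, 0) > 0
               for kind, ops in needed.items() if ops > 0)


def cycle_interval(needed: Dict[str, int], available: Dict[str, int]) -> Tuple[int, int]:
    """Return (lower, upper) bounds on the cycle count for unit-latency operations."""
    if not project(needed, available):
        raise ValueError("tile lacks a required resource type")
    # Lower bound: the busiest resource type, with perfect packing.
    lower = max((math.ceil(ops / available[kind])
                 for kind, ops in needed.items() if ops > 0), default=0)
    # Upper bound: fully sequential execution, one operation per cycle.
    upper = sum(needed.values())
    return lower, upper


needed_ops = {"mul": 12, "add": 20}              # operations required by the application
tile_units = {"mul": 2, "add": 4, "reg": 16}     # functional units offered by the tile
bounds = cycle_interval(needed_ops, tile_units)
```

Reporting an interval rather than a single number keeps the estimate honest at this abstraction level: a narrow interval means the tile choice can be decided now, a wide one means the model needs refinement first.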

118
aSoC Builder
  • Environment AppMapper
  • Partition and assignment
  • based on Run Time Estimation
  • Compilation
  • Communication Scheduling
  • Core compilation
  • Generate tile configurations
  • Communications instructions
  • Bitstreams (for reconfigurable tile)
  • RISC instructions

119
aSoC Analysis
  • Use the results of previous steps
  • Function scheduling
  • Tile allocation
  • Communication scheduling
  • Complete estimation of the proposed solution
  • Global execution time
  • Global power consumption
  • Total area
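
A minimal sketch of how the three global figures might be aggregated from per-tile results. It assumes the global execution time is the latest tile finish time, and that power and area simply add up across tiles; the field names and numbers are invented for illustration and ignore interconnect overhead.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TileEstimate:
    name: str
    finish_time_us: float   # when the tile's last scheduled function completes
    power_mw: float         # average power of the tile while the application runs
    area_mm2: float


@dataclass
class SystemEstimate:
    execution_time_us: float
    power_mw: float
    area_mm2: float


def summarize(tiles: Iterable[TileEstimate]) -> SystemEstimate:
    """Combine per-tile estimates into global time, power, and area."""
    tiles = list(tiles)
    if not tiles:
        raise ValueError("at least one tile is required")
    return SystemEstimate(
        execution_time_us=max(t.finish_time_us for t in tiles),
        power_mw=sum(t.power_mw for t in tiles),
        area_mm2=sum(t.area_mm2 for t in tiles),
    )


# Hypothetical three-tile solution.
solution = [
    TileEstimate("risc", finish_time_us=120.0, power_mw=40.0, area_mm2=2.0),
    TileEstimate("fpga", finish_time_us=150.0, power_mw=90.0, area_mm2=4.5),
    TileEstimate("dsp", finish_time_us=80.0, power_mw=60.0, area_mm2=3.0),
]
total = summarize(solution)
```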

120
Power-Aware System on a Chip
  • A. Laffely, J. Liang, R. Tessier, C. A. Moritz,
    W. Burleson
  • University of Massachusetts Amherst
  • Boston Area Architecture Conference
  • 30 Jan 2003
  • {alaffely, jliang, tessier, moritz,
    burleson}@ecs.umass.edu

This material is based upon work supported by the
National Science Foundation under Grant No.
9988238. Any opinions, findings, and conclusions
or recommendations expressed in this material are
those of the author(s) and do not necessarily
reflect the views of the National Science
Foundation.
121
Adaptive System-on-a-Chip
  • Tiled architecture with mesh interconnect
  • Point to point communication pipeline
  • Allows for heterogeneous cores
  • Differing sizes, clock rates, voltages
  • Low-overhead core interface for
  • On-chip bus substitute for streaming applications
  • Based on static scheduling
  • Fast and predictable

122
aSoC Implementation
2500 λ
0.18 µm technology, full custom
3000 λ
123
Some Results
  • 9 and 16 core systems tested for IIR, MPEG
    encoding and Image processing applications
  • 2x the performance of the CoreConnect bus
    (burst and hierarchical)
  • 1.5x the performance of an oblivious routing
    network [1] (dynamic routing)
  • Max speedup is 5x

1. W. Dally and H. Aoki, "Deadlock-Free Adaptive
Routing in Multicomputer Networks Using
Virtual Channels," IEEE Transactions on Parallel
and Distributed Systems, April 1993