Transcript and Presenter's Notes

Title: Options for embedded systems. Constraints, challenges, and approaches. HPEC 2001, Lincoln Laboratory, 25 September 2001


1
Options for embedded systems. Constraints, challenges, and approaches
HPEC 2001, Lincoln Laboratory, 25 September 2001
  • Gordon Bell
  • Bay Area Research Center
  • Microsoft Corporation

2
More architecture options: Applications, COTS (clusters, computers & chips), Custom Chips
3
The architecture challenge: "One person's system is another's component." - Alan Perlis
  • Kurzweil predicted hardware will be compiled and be as easy to change as software by 2010
  • COTS: streaming, Beowulf, and WWW relevance?
  • Architecture hierarchy:
  • Application
  • Scalable components forming the system
  • Design and test
  • Chips: the raw materials
  • Scalability: fewest, replicable components
  • Modularity: finding reusable components

4
The architecture levels & options
  • The apps
  • Data-types: signals, packets, video, voice, RF, etc.
  • Environment: parallelism, power, power, power, speed, cost
  • The material: clock, transistors
  • Performance: it's about parallelism
  • Program: programming environment
  • Network: e.g. WWW and Grid
  • Clusters
  • Storage, cluster, and network interconnect
  • Multiprocessors
  • Processor and special processing
  • Multi-threading and multiple processors per chip
  • Instruction Level Parallelism vs. Vector processors

5
Sony Playstation export limits
A problem the X-Box would like to have, but has solved.
6
Will the PC prevail for the next decade as a/the dominant platform? Or 2nd to smart, mobile devices?
  • Moore's Law increases performance; Bell's Corollary reduces prices for new classes
  • PC server clusters aka Beowulf with a low-cost OS kill proprietary switches, SMPs, and DSMs
  • Home entertainment & control
  • Very large disks (1 TB by 2005) to store everything
  • Screens to enhance use
  • Mobile devices, etc. dominate the WWW >2003!
  • Voice and video become the important apps!

(Color key: C = Commercial, C = Consumer)
7
Where's the action? Problems?
  • Constraints from the application: speech, video, mobility, RF, GPS, security; Moore's Law, networking, interconnects
  • Scalability and high-performance processing
  • Building them: clusters vs. DSM
  • Structure: where's the processing, memory, and switches (disk and TCP/IP processing)?
  • Micros: getting the most from the nodes
  • Not ISAs! Change can delay the Moore's Law effect and wipe out software investment! Please, please, just interpret my object code!
  • System (on a chip) alternatives: app drivers
  • Data-types (e.g. voice, video, RF): performance, portability/power, and cost

8
COTS: Anything at the system structure level to use?
  • How are the system components, e.g. computers, etc., going to be interconnected?
  • What are the components? Linux?
  • What is the programming model?
  • Is a plane, CCC, tank, fleet, ship, etc. an Internet?
  • Beowulf's the next COTS
  • What happened to Ada? Visual Basic? Java?

9
Computing SNAP built entirely from PCs
(Diagram labels, duplicates removed:)
  • Legacy mainframe & minicomputer servers & terminals
  • Portables
  • Wide-area global network; mobile nets
  • Wide & local area networks for terminal, PC, workstation, servers
  • Person servers (PCs)
  • Scalable computers built from PCs
  • Centralized & departmental uni- & mP servers (UNIX & NT)
  • Centralized & departmental servers built from PCs
  • TC=TV=PC home ... (CATV or ATM or satellite)
  • A space, time (bandwidth), generation scalable environment

10
How Will Future Computers Be Built?
  • Thesis: SNAP, Scalable Networks and Platforms
  • Upsize from desktop to world-scale computer
  • based on a few standard components
  • Because:
  • Moore's law: exponential progress
  • Standardization & commoditization
  • Stratification and competition
  • When? Sooner than you think!
  • Massive standardization gives massive use
  • Economic forces are enormous

11
Five Scalabilities
  • Size scalable -- designed from a few components, with no bottlenecks
  • Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
  • Reliability scaling -- choose any level
  • Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
  • Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer.
  • Problem x machine space -> run time: problem scale, machine scale (p), and run time together imply speedup and efficiency (see the definitions below).
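
For concreteness, the standard definitions behind that last bullet (added here; they are not spelled out on the original slide), with T_1 the one-node run time and T_p the run time on p nodes:

\[
  S(p) = \frac{T_1}{T_p}, \qquad
  E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
\]

A problem x machine pair scales well when the efficiency E(p) stays near 1 as the problem size grows along with p.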

12
Why I gave up on large SMPs & DSMs
  • Economics: Perf/Cost is lower unless a commodity
  • Economics: Longer design time & life. Complex. => Poorer tech tracking & end-of-life performance.
  • Economics: Higher, uncompetitive costs for processors & switching. Sole sourcing of the complete system.
  • DSMs are NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway.
  • Aren't scalable. Reliability requires clusters. Start there.
  • They aren't needed for most apps; hence, a small market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs. Ethernet.

13
What is the basic structure of these scalable systems?
  • Overall
  • Disk connection, especially with respect to Fibre Channel
  • SAN, especially with fast WANs & LANs

14
GB plumbing: from the baroque, evolving from the 2 dance-hall SMP & storage model
(PMS notation: Pc = central processor, Mp = primary memory, Ms = secondary memory, S = switch)
  • Mp S Pc
  • S.fc Ms
  • S.Cluster
  • S.WAN
  • vs.
  • MpPcMs & S.Lan/Cluster/Wan

15
SNAP Architecture (diagram not reproduced)
16
ISTORE Hardware Vision
  • System-on-a-chip enables computer & memory without significantly increasing the size of the disk
  • 5-7 year target

MicroDrive (1.7" x 1.4" x 0.2"):
  • 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  • 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW)
  • Integrated IRAM processor (2x height), connected via crossbar switch growing like Moore's law: 16 Mbytes, 1.6 Gflops, 6.4 Gops
  • 10,000 nodes in one rack! 100/board, 1 TB, 0.16 Tflops
17
The Disk Farm? or a System On a Card?
  • The 500GB disc card: an array of discs
  • Can be used as:
  • 100 discs
  • 1 striped disc
  • 50 FT (fault-tolerant) discs
  • ...etc
  • LOTS of accesses/second and bandwidth
  • A few disks are replaced by 10s of Gbytes of RAM and a processor to run Apps!!

18
The Promise of SAN/VIA/Infiniband
http://www.ViArch.org/
  • Yesterday:
  • 10 MBps (100 Mbps Ethernet)
  • 20 MBps with tcp/ip saturating 2 cpus
  • round-trip latency 250 µs
  • Now:
  • Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, ...
  • Fast user-level communication
  • tcp/ip: 100 MBps @ 10% cpu
  • round-trip latency is 15 µs
  • 1.6 Gbps demoed on a WAN

19
Top500 taxonomy: everything is a cluster aka multicomputer
  • Clusters are the ONLY scalable structure
  • Cluster: n inter-connected computer nodes operating as one system. Nodes: uni or SMP. Processor types: scalar or vector.
  • MPP: miscellaneous, not massive (>1000), SIMD or something we couldn't name
  • Cluster types (implied message passing):
  • Constellations: clusters of >16 P, SMP
  • Commodity: clusters of uni or <4 Ps, SMP
  • DSM: NUMA (and COMA) SMPs and constellations
  • DMA clusters (direct memory access) vs. msg. passing
  • Uni- and SMP-vector clusters; Vector Clusters and Vector Constellations

20
Courtesy of Dr. Thomas Sterling, Caltech
21
The Virtuous Economic Cycle that drives the PC industry & Beowulf
(Cycle diagram labels:)
  • Attracts suppliers
  • Competition
  • Greater availability @ lower cost
  • Volume
  • Standards
  • DOJ
  • Utility/value
  • Innovation
  • Creates apps, tools, training, ...
  • Attracts users
22
BEOWULF-CLASS SYSTEMS
  • Cluster of PCs:
  • Intel x86
  • DEC Alpha
  • Mac Power PC
  • Pure M2COTS (mass-market COTS)
  • Unix-like O/S with source
  • Linux, BSD, Solaris
  • Message passing programming model (see the MPI sketch below)
  • PVM, MPI, BSP, homebrew remedies
  • Single user environments
  • Large science and engineering applications
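
A minimal sketch of the message-passing style these systems use, assuming MPI and its standard mpicc/mpirun toolchain (illustrative, not from the original deck):

    /* Beowulf-style message passing: node 0 sends an integer to node 1. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, token = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which node am I? */
        if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", token);
        }
        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpirun -np 2 ./token: the same binary runs on every node, and behavior branches on rank.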

23
Lessons from Beowulf
  • An experiment in parallel computing systems
  • Established vision: low-cost, high-end computing
  • Demonstrated effectiveness of PC clusters for some (not all) classes of applications
  • Provided networking software
  • Provided cluster management tools
  • Conveyed findings to the broad community
  • Tutorials and the book
  • Provided a design standard to rally the community!
  • Standards beget books, trained people, software: a virtuous cycle that allowed apps to form
  • Industry begins to form beyond a research project
Courtesy, Thomas Sterling, Caltech.
24
Designs at chip level: any COTS options?
  • Substantially more programmability versus factory compilation
  • As systems move onto chips and chip sets become part of larger systems, Electronic Design must move from RTL to algorithms.
  • Verification and design of GigaScale systems will be the challenge.

25
The Productivity Gap
(Chart, source: SEMATECH. Logic transistors per chip, plotted on a log scale from 1981 to 2009 across process generations of 2.5µ, .35µ, and .10µ, grow at a 58%/yr compound complexity growth rate; designer productivity, in transistors per staff-month, grows at only a 21%/yr compound rate. The widening gap between the two curves is the productivity gap.)
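
The divergence implied by the two rates (arithmetic added here for clarity):

\[
  \frac{1.58}{1.21} \approx 1.31
\]

That is, complexity outruns productivity by roughly 31% per year, so the effort to design a leading-edge chip grows relentlessly unless design moves to a higher level of abstraction.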
26
What Is GigaScale?
  • Extremely large gate counts
  • Chips & chip sets
  • Systems & multiple-systems
  • High complexity
  • Complex data manipulation
  • Complex dataflow
  • Intense pressure for correct, first-time silicon
  • TTM, cost of failure, etc. impact the ability to have a silicon startup
  • Multiple languages and abstraction levels
  • Design, verification, and software

27
EDA Evolution: chips to systems
  • 1975 (Calma, CV): IC Designer; physical design
  • 1985 (Daisy, Mentor): ASIC Designer / Chip Architect; gates (10K gates); simulation
  • 1995 (Synopsys, Cadence): SOC Designer / System Architect; RTL (1M gates); testbench automation, emulation, formal verification, plus...
  • 2005 (e.g. Forte): GigaScale Architect; GigaScale; hierarchical verification, plus...
Courtesy of Forte Design Systems
28
Processor Limit: DRAM Gap
(Chart: processor and DRAM performance diverging under Moore's Law; not reproduced)
  • Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4, or 432 instructions
  • Caches in Pentium Pro: 64% of area, 88% of transistors
  • Taken from the Patterson-Keeton talk to SigMod
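
The miss-cost arithmetic in the first bullet, worked out (added here; the slide's 1.7 ns is a rounded cycle time, and 108 clocks corresponds to 1.67 ns, roughly 600 MHz):

\[
  \frac{180\ \mathrm{ns}}{1.67\ \mathrm{ns/clk}} \approx 108\ \mathrm{clks},
  \qquad
  108\ \mathrm{clks} \times 4\ \mathrm{instr/clk} = 432\ \mathrm{instruction\ slots\ lost\ per\ miss}
\]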

29
The memory gap
  • Multiple (e.g. 4) processors/chip in order to increase the ops/chip while waiting out the inevitable access delays
  • Or alternatively, multi-threading (MTA)
  • Vector processors with a supporting memory system
  • System-on-a-chip to reduce chip-boundary crossings

30
If system-on-a-chip is the answer, what is the problem?
  • Small, high-volume products:
  • Phones, PDAs, ...
  • Toys & games (to sell batteries)
  • Cars
  • Home appliances
  • TV & video
  • Communication infrastructure
  • Plain old computers and portables
  • Embeddable computers of all types where performance and/or power are the major constraints.

31
SOC Alternatives (not including C/C++ CAD tools)
  • The blank sheet of paper: FPGA
  • Auto design of a processor: Tensilica
  • Standardized, committee-designed components, cells, and custom IP
  • Standard components, including more application-specific processors, IP add-ons, plus custom
  • One chip does it all: SMOP
  • Processors, memory, communication & memory links, ...

32
Tradeoffs and Reuse Model
(Diagram labels: System & Application; Silicon & Process)
33
System-on-a-chip alternatives (approach: description (who)):
  • FPGA: sea of un-committed gate arrays (Xilinx, Altera)
  • Compile a system: unique processor for every app (Tensilica)
  • Systolic array: many pipelined or parallel processors (custom)
  • Pc ??: dynamic reconfiguration of the entire chip
  • Pc+DSP VLIW: special-purpose processors & cores (custom; TI)
  • Pc+Mp ASICs: general-purpose cores, specialized by I/O, etc. (IBM, Intel, Lucent)
  • Universal Micro: multiprocessor array, programmable I/O (Cradle, Intel IXP 1200)
34
Xilinx: 10M gates, 500M transistors, .12 micron
35
Tensilica Approach: Compiled Processor Plus Development Tools
Describe the processor attributes from a browser-like interface. Using the processor generator, create:
  • A tailored, HDL µP core (ALU, pipe, I/O, timer, cache, MMU, register file)
  • A standard cell library targeted to the silicon process
  • A customized compiler, assembler, linker, debugger, simulator
Courtesy of Tensilica, Inc. http://www.tensilica.com
Richard Newton, UC/Berkeley
36
EEMBC Networking Benchmark
  • Benchmarks: OSPF, Route Lookup, Packet Flow
  • Xtensa with no optimization: comparable to 64b RISCs
  • Xtensa with optimization: comparable to high-end desktop CPUs
  • Xtensa has outstanding efficiency (performance per cycle, per watt, per mm²)
  • Xtensa optimizations: custom instructions for route lookup and packet flow (a sketch of such a route-lookup kernel follows below)

Colors: Blue = Xtensa, Green = desktop x86s, Maroon = 64b RISCs, Orange = 32b RISCs
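
To make "custom instructions for route lookup" concrete, here is a plain-C sketch of the longest-prefix-match kernel such a benchmark exercises; the mask-and-compare step flagged below is the kind of multi-operation sequence a generated Xtensa instruction can collapse into one cycle. The code is illustrative only and assumes nothing about Tensilica's actual tools or APIs.

    /* Longest-prefix match over a table sorted by descending prefix length. */
    #include <stdint.h>
    #include <stdio.h>

    struct route { uint32_t prefix; uint8_t len; uint8_t nexthop; };

    static int lookup(const struct route *tbl, int n, uint32_t addr) {
        for (int i = 0; i < n; i++) {
            /* mask + compare + select: a candidate for one fused instruction */
            uint32_t mask = tbl[i].len ? 0xFFFFFFFFu << (32 - tbl[i].len) : 0;
            if ((addr & mask) == tbl[i].prefix)
                return tbl[i].nexthop;
        }
        return -1; /* no matching route */
    }

    int main(void) {
        const struct route tbl[] = {
            { 0xC0A80100u, 24, 1 },  /* 192.168.1.0/24 -> port 1 */
            { 0xC0A80000u, 16, 2 },  /* 192.168.0.0/16 -> port 2 */
            { 0x00000000u,  0, 3 },  /* default route  -> port 3 */
        };
        /* 192.168.1.55 matches the /24 entry, so port 1 is printed. */
        printf("next hop: %d\n", lookup(tbl, 3, 0xC0A80137u));
        return 0;
    }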
37
EEMBC Consumer Benchmark
  • Benchmarks: JPEG, grey-scale filter, color-space conversion
  • Xtensa with no optimization: comparable to 64b RISCs
  • Xtensa with optimization: beats all processors by 6x (no JPEG optimization)
  • Xtensa has exceptional efficiency (performance per cycle, per watt, per mm²)
  • Xtensa optimizations: custom instructions for filters, RGB-YIQ, RGB-CMYK

Colors: Blue = Xtensa, Green = desktop x86s, Maroon = 64b RISCs, Orange = 32b RISCs
38
Free 32-bit processor core
39
Complex SOC architecture
Synopsys via Richard Newton, UC/B
40
UMS Architecture
  • Memory bandwidth scales with processing
  • Scalable processing, software, I/O
  • Each app runs on its own pool of processors
  • Enables durable, portable intellectual property

41
Cradle UMS Design Goals
  • Minimize design time for applications
  • Efficient programming model
  • High reusability accelerates derivative development
  • Cost/Performance:
  • Replace ASICs, FPGAs, ASSPs, and DSPs
  • Low power for battery-powered appliances
  • Flexibility:
  • Cost-effective solution to address fragmenting markets
  • Faster return on R&D investments

42
Universal Microsystem (UMS)
(Diagram: Quad 1 ... Quad n, I/O Quads, SDRAM control, PLA ring)
Each Quad has 4 RISCs, 8 DSPs, and memory. A unique I/O subsystem keeps interfaces soft.
43
The Universal Micro System (UMS)
An off-the-shelf platform for product-line solutions
  • Superior digital signal processing (single-clock FP-MAC)
  • Local memory that scales with additional processors
  • Scalable real-time functions in software using small fast processors (QUAD)
  • Intelligent I/O subsystem (change interfaces without changing chips)
  • 250 MFLOPS/mm²
44
VPN Enterprise Gateway
  • Five quads: two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface
  • Handles 250 end users and 100 routes
  • Does key handling for IPSec
  • Delivers 100 Mbps of 3DES
  • Firewall
  • IP Telephony
  • O/S for user interactions
  • Single quad: two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface
  • Handles 250 end users and 100 routes
  • Does key handling for IPSec
  • Delivers 50 Mbps of 3DES

45
Table 2: Performance of Kernels on UMS (table not reproduced in transcript)
UMS Application Performance
  • Architecture permits scalable software
  • Supports two Gigabit Ethernets at wire speed; four fast Ethernets; four T-1s; USB, PCI, 1394, etc.
  • MSP is a logical unit of one PE and two DSEs

46
Cradle Universal Microsystem: trading Verilog hardware for C/C++
UMS: VLSI microprocessor & special systems; software & hardware
  • Single part for all apps
  • App spec'd @ run time using FPGA & ROM
  • 5 quad µPs at 3 Gflops/quad = 15 Gflops
  • Single shared memory space, caches
  • Programmable periphery including 1 GB/s & 2.5 Gips: PCI, 100baseT, FireWire
  • $4 per Gflops; 150 mW/Gflops

47
Silicon Landscape 200x
  • Increasing cost of fabrication and masks:
  • $7M for a high-end ASSP chip design
  • Over $650K for masks alone, and rising
  • SOC/ASIC companies require a $7-10M business guarantee
  • Physical effects (parasitics, reliability issues, power management) are more significant design issues
  • These must now be considered explicitly at the circuit level
  • Design complexity and context complexity are sufficiently high that design verification is a major limitation on time-to-market
  • Fewer design starts at higher design volume implies more programmable platforms

Richard Newton, UC/Berkeley
48
The End
49
(No Transcript)
50
The Energy-Flexibility Gap
(Chart: energy efficiency in MOPS/mW or MIPS/mW vs. flexibility/coverage, spanning roughly three orders of magnitude:)
  • Dedicated HW: e.g. MUD, 100-200 MOPS/mW
  • Reconfigurable processor/logic: e.g. Pleiades, 10-50 MOPS/mW
  • ASIPs & DSPs: e.g. 1V DSP, 3 MOPS/mW
  • Embedded µProcessors: e.g. LPArm, 0.5-2 MIPS/mW
Source: Prof. Jan Rabaey, UC Berkeley
51
Approaches to Reuse
  • SOC as the Assembly of Components?
  • Alberto Sangiovanni-Vincentelli
  • SOC as a Programmable Platform?
  • Kurt Keutzer

52
Component-Based Programmable Platform Approach
  • Application-Specific Programmable Platforms (ASPP)
  • These platforms will be highly programmable
  • They will implement highly concurrent functionality
  -> Intermediate language that exposes programmability of all aspects of the microarchitecture
  -> Integrate using a programmable approach to on-chip communication
  -> Assemble components from a parameterized library
Richard Newton, UC/Berkeley
53
Compact Synthesized Processor, Including Software Development Environment
  • Use virtually any standard cell library with commercial memory generators
  • Base implementation is less than 25K gates (1.0 mm² in 0.25µ CMOS)
  • Power dissipation in a 0.25µ standard cell is less than 0.5 mW/MHz

Shown to scale on a typical $10 IC (3-6% of 60 mm²)
Courtesy of Tensilica, Inc. http://www.tensilica.com
54
Challenges of Programmability for Consumer
Applications
  • Power, Power, Power.
  • Performance, Performance, Performance
  • Cost
  • Can we develop approaches to programming silicon
    and its integration, along with the tools and
    methodologies to support them, that will allow us
    to approach the power and performance of a
    dedicated solution sufficiently closely (2-4x?)
    that a programmable platform is the preferred
    choice?

Richard Newton, UC/Berkeley
55
Bottom Line: Programmable Platforms
  • The challenge is finding the right programmer's model and associated family of micro-architectures
  • Address a wide-enough range of applications efficiently (performance, power, etc.)
  • Successful platform developers must own the software development environment and associated kernel-level run-time environment
  • It's all about concurrency
  • If you could develop a very efficient and reliable re-programmable logic technology (comparable to ASIC densities), you would eventually own the silicon industry!

Richard Newton, UC/Berkeley
56
Approaches to Reuse
  • SOC as the Assembly of Components?
  • Alberto Sangiovanni-Vincentelli
  • SOC as a Programmable Platform?
  • Kurt Keutzer

Richard Newton, UC/Berkeley
57
A Component-Based Approach
  • Simple Universal Protocol (SUP):
  • Unix pipes (character streams only; see the sketch after this list)
  • TCP/IP (only one type of packet; limited options)
  • RS232, PCI
  • Streaming
  • Single-Owner Protocol (SOP):
  • Visual Basic
  • Unibus, Massbus, Sbus, ...
  • Simple Interfaces, Complex Application (SIC):
  • When the spec is much simpler than the code, you aren't tempted to rewrite it
  • SQL, SAP, etc.
  • Implies natural boundaries to partition IP; successful components will be aligned with those boundaries.

(suggested by Butler Lampson)
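
A minimal illustration of why Unix pipes qualify as a Simple Universal Protocol: the entire contract is an untyped byte stream, so any producer composes with any consumer. The sketch is added here, assuming plain POSIX; it is not from the original deck.

    /* SUP in miniature: one write() end, one read() end, bytes in between. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fd[2];
        char buf[64];
        if (pipe(fd) != 0) return 1;
        if (fork() == 0) {               /* child: the producer */
            close(fd[0]);
            const char *msg = "bytes, nothing but bytes\n";
            write(fd[1], msg, strlen(msg));  /* the whole protocol is write() */
            close(fd[1]);
            _exit(0);
        }
        close(fd[1]);                    /* parent: the consumer */
        ssize_t n = read(fd[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }
        close(fd[0]);
        wait(NULL);
        return 0;
    }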
58
The Key Elements of the SOC
Applications
What is the platform, aka the programmer's model?
(Diagram labels: RF, MEMS, optical, ASIP)
Richard Newton, UC/Berkeley
59
Power as the Driver
(Power is still, almost always, the driver!)
Source: R. Brodersen, UC Berkeley
60
Back end
61
Computer ops/sec x word length (chart not reproduced)
62
Microprocessor performance
(Chart: performance in ops/sec on a log scale from Kilo through Mega and Giga to 100 G, 1970-2010; not reproduced)
63
GigaScale Evolution
  • In 1999, less than 3% of engineers were doing designs with more than 10M transistors per chip. (Dataquest)
  • By early 2002, 0.1 micron will allow 600M transistors per chip. (Dataquest)
  • In 2001, 49% of engineers were @ .18 micron, 5% @ .10 micron. (EE Times)
  • 54% plan to be @ .10 micron in 2003. (EET)

64
Challenges of GigaScale
  • GigaScale systems are too big to simulate
  • Hierarchical verification
  • Distributed verification
  • Requires a higher level of abstraction
  • Higher abstraction needed for verification
  • High level modeling
  • Transaction-based verification
  • Higher abstraction needed for design
  • High-level synthesis required for productivity
    breakthrough

65
(No Transcript)