Title: Options for embedded systems. Constraints, challenges, and approaches. HPEC 2001, Lincoln Laboratory, 25 September 2001
1 Options for embedded systems: constraints, challenges, and approaches. HPEC 2001, Lincoln Laboratory, 25 September 2001
- Gordon Bell
- Bay Area Research Center
- Microsoft Corporation
2 More architecture options: Applications, COTS (clusters, computers, chips), Custom Chips
3 The architecture challenge: "One person's system is another's component." - Alan Perlis
- Kurzweil predicted hardware will be compiled and be as easy to change as software by 2010
- COTS: streaming, Beowulf, and WWW relevance?
- Architecture Hierarchy
- Application
- Scalable components forming the system
- Design and test
- Chips: the raw materials
- Scalability: fewest, replicable components
- Modularity: finding reusable components
4 The architecture levels and options
- The apps
- Data-types: signals, packets, video, voice, RF, etc.
- Environment: parallelism, power, power, power, speed, cost
- The material: clock, transistors
- Performance: it's about parallelism
- Program and programming environment
- Network, e.g. WWW and Grid
- Clusters
- Storage, cluster, and network interconnect
- Multiprocessors
- Processor and special processing
- Multi-threading and multiple processor per chip
- Instruction Level Parallelism vs
- Vector processors
5 Sony Playstation export limits
A problem X-Box would like to have, but have
solved.
6 Will the PC prevail for the next decade as a/the dominant platform? Or 2nd to smart, mobile devices?
- Moore's Law increases performance; Bell's Corollary reduces prices for new classes
- PC server clusters, aka Beowulf, with low cost OS kill proprietary switches, smPs, and DSMs
- Home entertainment and control
- Very large disks (1TB by 2005) to store everything
- Screens to enhance use
- Mobile devices, etc. dominate WWW >2003!
- Voice and video become the important apps!
C Commercial C Consumer
7 Where's the action? Problems?
- Constraints from the application: speech, video, mobility, RF, GPS, security; Moore's Law, networking, interconnects
- Scalability and high performance processing
- Building them: clusters vs DSM
- Structure: where's the processing, memory, and switches (disk and ip/tcp processing)
- Micros: getting the most from the nodes
- Not ISAs: change can delay Moore's Law effect and wipe out software investment! Please, please, just interpret my object code!
- System (on a chip) alternatives: apps drivers
- Data-types (e.g. video, video, RF), performance, portability/power, and cost
8 COTS: Anything at the system structure level to use?
- How are the system components, e.g. computers, etc., going to be interconnected?
- What are the components? Linux
- What is the programming model?
- Is a plane, CCC, tank, fleet, ship, etc. an Internet?
- Beowulfs: the next COTS
- What happened to Ada? Visual Basic? Java?
9 Computing SNAP built entirely from PCs
Figure labels: legacy mainframes and minicomputers, servers and terminals; portables; wide-area global network; mobile nets; wide and local area networks for terminal, PC, workstation, and servers; person servers (PCs); scalable computers built from PCs; centralized and departmental uni- and mP servers (UNIX and NT); centralized and departmental servers built from PCs; TC=TV+PC home ... (CATV or ATM or satellite)
- A space, time (bandwidth), and generation scalable environment
10 How Will Future Computers Be Built?
- Thesis: SNAP, Scalable Networks and Platforms
- Upsize from desktop to world-scale computer
- based on a few standard components
- Because:
- Moore's law: exponential progress
- Standardization and commoditization
- Stratification and competition
- When: Sooner than you think!
- Massive standardization gives massive use
- Economic forces are enormous
11 Five Scalabilities
- Size scalable -- designed from a few components, with no bottlenecks
- Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
- Reliability scaling -- choose any level
- Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
- Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer.
- Problem x machine space => run time: problem scale, machine scale (p), and run time imply speedup and efficiency
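The last bullet's terms can be made concrete with the usual definitions of speedup and efficiency. The sketch below is a minimal illustration, assuming the standard textbook definitions (speedup = one-processor run time / p-processor run time; efficiency = speedup / p). The function names and the sample run times are hypothetical, not from the talk.

```python
def speedup(t_one: float, t_p: float) -> float:
    """Speedup: run time on one processor divided by run time on p processors."""
    return t_one / t_p


def efficiency(t_one: float, t_p: float, p: int) -> float:
    """Efficiency: speedup divided by the number of processors used."""
    return speedup(t_one, t_p) / p


# Hypothetical measurements: 100 s on 1 processor, 12.5 s on 16 processors.
s = speedup(100.0, 12.5)
e = efficiency(100.0, 12.5, 16)
print(f"speedup = {s:.1f}, efficiency = {e:.2f}")
```

A program is scalable in the slide's sense when efficiency stays acceptable as both the problem size and p grow.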
12 Why I gave up on large smPs and DSMs
- Economics: Perf/Cost is lower... unless a commodity
- Economics: Longer design time and life. Complex. => Poorer tech tracking and end of life performance.
- Economics: Higher, uncompetitive costs for processor and switching. Sole sourcing of the complete system.
- DSMs: NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway.
- Aren't scalable. Reliability requires clusters. Start there.
- They aren't needed for most apps... hence, a small market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs Ethernet.
13 What is the basic structure of these scalable systems?
- Overall
- Disk connection, especially wrt fiber channel
- SAN, especially with fast WANs and LANs
14 GB plumbing from the baroque: evolving from the 2 dance-hall SMP and storage model
- Mp S Pc
-
- S.fc Ms
-
- S.Cluster
- S.WAN
- vs.
- MpPcMs S.Lan/Cluster/Wan
15 SNAP Architecture
16 ISTORE Hardware Vision
- System-on-a-chip enables computer, memory, without significantly increasing size of disk
- 5-7 year target
- MicroDrive: 1.7" x 1.4" x 0.2"
- 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
- 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW)
- Integrated IRAM processor: 2x height
- Connected via crossbar switch, growing like Moore's law
- 16 Mbytes; 1.6 Gflops; 6.4 Gops
- 10,000 nodes in one rack!
- 100/board = 1 TB; 0.16 Tflops
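The 2006 MicroDrive figures follow from compounding the 1999 figures at the stated annual rates. A minimal sketch of that arithmetic, assuming seven years of growth (1999 to 2006); the helper name `project` is hypothetical.

```python
def project(value: float, growth_per_year: float, years: int) -> float:
    """Compound a starting value at a fixed annual growth factor."""
    return value * growth_per_year ** years


YEARS = 2006 - 1999  # seven years between the slide's two dates

capacity_mb = project(340, 1.6, YEARS)      # 1999: 340 MB, growing 1.6X/yr
bandwidth_mb_s = project(5, 1.4, YEARS)     # 1999: 5 MB/s, growing 1.4X/yr

print(f"2006 capacity  ~ {capacity_mb / 1000:.1f} GB")
print(f"2006 bandwidth ~ {bandwidth_mb_s:.0f} MB/s")

# Per-board compute from the slide: 100 nodes per board at 1.6 Gflops each.
board_tflops = 100 * 1.6 / 1000
print(f"board compute  ~ {board_tflops:.2f} Tflops")
```

Both projections land close to the slide's rounded 9 GB and 50 MB/s, and 100 nodes at 1.6 Gflops give the quoted 0.16 Tflops per board.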
17 The Disk Farm? or a System On a Card?
- The 500GB disc card
- An array of discs
- Can be used as
- 100 discs
- 1 striped disc
- 50 FT discs
- ....etc
- LOTS of accesses/second
- of bandwidth
- A few disks are replaced by 10s of Gbytes of RAM
and a processor to run Apps!!
18 The Promise of SAN/VIA/Infiniband
http://www.ViArch.org/
- Yesterday:
- 10 MBps (100 Mbps Ethernet)
- ~20 MBps tcp/ip saturates 2 cpus
- round-trip latency ~250 µs
- Now:
- Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, ...
- Fast user-level communication
- tcp/ip ~100 MBps at 10% cpu
- round-trip latency is 15 µs
- 1.6 Gbps demoed on a WAN
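One way to read the "yesterday" versus "now" numbers is through a simple latency-plus-bandwidth cost model. The sketch below assumes a one-way message costs half the round-trip latency plus size divided by bandwidth. That model, the function name, and the two message sizes are illustrative assumptions, not from the talk.

```python
def message_time_s(size_bytes: float, round_trip_s: float,
                   bandwidth_bytes_per_s: float) -> float:
    """Estimated one-way time: half the round trip plus serialization time."""
    return round_trip_s / 2 + size_bytes / bandwidth_bytes_per_s


# Figures from the slide: (round-trip latency in seconds, bandwidth in bytes/s).
YESTERDAY = (250e-6, 10e6)   # 250 us round trip, 10 MBps
NOW = (15e-6, 100e6)         # 15 us round trip, 100 MBps

for size in (1_000, 1_000_000):   # hypothetical 1 KB and 1 MB messages
    old = message_time_s(size, *YESTERDAY)
    new = message_time_s(size, *NOW)
    print(f"{size:>9} bytes: {old * 1e6:10.1f} us -> {new * 1e6:10.1f} us "
          f"({old / new:.1f}x faster)")
```

Under these assumptions, small messages benefit mostly from the lower latency, while large messages see roughly the 10x improvement of the faster wire.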
19 Top500 taxonomy: everything is a cluster aka multicomputer
- Clusters are the ONLY scalable structure
- Cluster: n inter-connected computer nodes operating as one system. Nodes: uni- or SMP. Processor types: scalar or vector.
- MPP: miscellaneous, not massive (>1000), SIMD or something we couldn't name
- Cluster types. Implied message passing.
- Constellations: clusters of >16 P, SMP
- Commodity clusters of uni or <4 Ps, SMP
- DSM: NUMA (and COMA) SMPs and constellations
- DMA clusters (direct memory access) vs msg. pass
- Uni- and SMP vector clusters: Vector Clusters and Vector Constellations
20 Courtesy of Dr. Thomas Sterling, Caltech
21 The Virtuous Economic Cycle drives the PC industry and Beowulf
Attracts suppliers
Competition
Greater availability @ lower cost
Volume
Standards
DOJ
Utility/value
Innovation
Creates apps, tools, training,
Attracts users
22 BEOWULF-CLASS SYSTEMS
- Cluster of PCs
- Intel x86
- DEC Alpha
- Mac Power PC
- Pure M2COTS
- Unix-like O/S with source
- Linux, BSD, Solaris
- Message passing programming model
- PVM, MPI, BSP, homebrew remedies
- Single user environments
- Large science and engineering applications
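The message passing model named above can be illustrated without a cluster. The sketch below is not MPI or PVM: standard-library threads and a queue stand in for nodes and the interconnect, and the names `worker` and `parallel_sum` are hypothetical. It only shows the style, where each node owns its data privately and cooperates solely by sending messages.

```python
import queue
import threading


def worker(rank: int, chunk: list, outbox: queue.Queue) -> None:
    """A 'node': computes on its private chunk and sends one result message."""
    outbox.put((rank, sum(chunk)))


def parallel_sum(data: list, n_nodes: int) -> int:
    """Scatter data across n_nodes workers, then gather and combine their messages."""
    outbox = queue.Queue()
    chunks = [data[i::n_nodes] for i in range(n_nodes)]   # scatter step
    threads = [threading.Thread(target=worker, args=(rank, chunk, outbox))
               for rank, chunk in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Gather step: receive exactly one message per node and combine them.
    return sum(outbox.get()[1] for _ in range(n_nodes))


total = parallel_sum(list(range(1, 101)), 4)
print(total)
```

On a real Beowulf the same scatter, compute, and gather structure would be expressed with a message passing library across separate machines.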
23 Lessons from Beowulf
- An experiment in parallel computing systems
- Established vision: low cost high end computing
- Demonstrated effectiveness of PC clusters for some (not all) classes of applications
- Provided networking software
- Provided cluster management tools
- Conveyed findings to broad community
- Tutorials and the book
- Provided design standard to rally community!
- Standards beget books, trained people, software... a virtuous cycle that allowed apps to form
- Industry begins to form beyond a research project
Courtesy, Thomas Sterling, Caltech.
24 Designs at chip level: any COTS options?
- Substantially more programmability versus factory compilation
- As systems move onto chips and chip sets become part of larger systems, Electronic Design must move from RTL to algorithms.
- Verification and design of GigaScale systems will be the challenge.
25 The Productivity Gap
Figure: logic transistors per chip (K) and productivity (transistors per staff-month), both on log scales, plotted against year from 1981 to 2009, with process nodes marked at 2.5 µm, 0.35 µm, and 0.10 µm.
- Complexity growth rate: 58%/yr compound
- Productivity growth rate: 21%/yr compound
Source: SEMATECH
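The two compound rates on this slide determine how fast the gap opens. A minimal sketch of that arithmetic, assuming both rates hold constant over the period; the function name `gap_factor` is hypothetical.

```python
def gap_factor(years: int, complexity_rate: float = 0.58,
               productivity_rate: float = 0.21) -> float:
    """How much complexity outgrows productivity after `years` of compounding."""
    return ((1 + complexity_rate) / (1 + productivity_rate)) ** years


for years in (1, 5, 10):
    print(f"after {years:2d} years the gap has widened by {gap_factor(years):.1f}x")
```

At 58%/yr against 21%/yr the gap widens by roughly 30% each year, which compounds to more than an order of magnitude over a decade.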
26 What Is GigaScale?
- Extremely large gate counts
- Chips and chip sets
- Systems and multiple-systems
- High complexity
- Complex data manipulation
- Complex dataflow
- Intense pressure for correct, 1st time
- TTM, cost of failure, etc. impacts ability to have a silicon startup
- Multiple languages and abstraction levels
- Design, verification, and software
27 EDA Evolution: chips to systems
Figure labels, newest to oldest:
- 2005 (e.g. Forte): GigaScale; hierarchical verification plus the below
- 1995 (Synopsys, Cadence): RTL, 1M gates; testbench automation, emulation, formal verification plus the below
- 1985 (Daisy, Mentor): gates, 10K gates; simulation
- 1975 (Calma, CV): physical design
- Roles shown: GigaScale Architect, SOC Designer, System Architect, ASIC Designer, Chip Architect, IC Designer
Courtesy of Forte Design Systems
28 Processor Limit: DRAM Gap
Moore's Law
- Alpha 21264 full cache miss / instructions executed: 180 ns/1.7 ns = 108 clks x 4 or 432 instructions
- Caches in Pentium Pro: 64% area, 88% transistors
- Taken from Patterson-Keeton Talk to SigMod
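The Alpha 21264 bullet is a short calculation: miss time divided by cycle time gives clocks lost, and clocks times issue width gives instruction slots lost. The sketch below reproduces it. Note that 180/1.7 is closer to 106 than 108; the slide's 108 matches an unrounded cycle of about 1.667 ns, so the 600 MHz clock used below is an assumption, as is the function name.

```python
def miss_cost(miss_ns: float, cycle_ns: float, issue_width: int) -> tuple:
    """Return (clock cycles lost, instruction slots lost) for one full cache miss."""
    clocks = miss_ns / cycle_ns
    return clocks, clocks * issue_width


# Slide figures: 180 ns full miss, 4 instructions issued per clock.
rounded = miss_cost(180, 1.7, 4)          # using the slide's rounded 1.7 ns cycle
exact = miss_cost(180, 1000 / 600, 4)     # assuming a 600 MHz clock (about 1.667 ns)

print(f"1.7 ns cycle:  {rounded[0]:.0f} clocks, {rounded[1]:.0f} instruction slots")
print(f"600 MHz clock: {exact[0]:.0f} clocks, {exact[1]:.0f} instruction slots")
```

Either way, one trip to DRAM costs the processor several hundred instruction issue slots, which is the point of the slide.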
29 The memory gap
- Multiple, e.g. 4, processors/chip in order to increase the ops/chip while waiting for the inevitable access delays
- Or alternatively, multi-threading (MTA)
- Vector processors with a supporting memory system
- System-on-a-chip to reduce chip boundary crossings
30 If system-on-a-chip is the answer, what is the problem?
- Small, high volume products
- Phones, PDAs,
- Toys and games (to sell batteries)
- Cars
- Home appliances
- TV and video
- Communication infrastructure
- Plain old computers and portables
- Embeddable computers of all types where
performance and/or power are the major
constraints.
31 SOC Alternatives... not including C/C++ CAD Tools
- The blank sheet of paper: FPGA
- Auto design of a processor: Tensilica
- Standardized, committee designed components, cells, and custom IP
- Standard components including more application specific processors, IP add-ons plus custom
- One chip does it all: SMOP
- Processors, Memory, Communication and Memory Links,
32 Tradeoffs and Reuse Model
System Application
Silicon Process
33 System-on-a-chip alternatives
FPGA | Sea of un-committed gate arrays | Xilinx, Altera
Compile a system | Unique processor for every app | Tensilica
Systolic array | Many pipelined or parallel processors, custom |
Pc ?? | Dynamic reconfiguration of the entire chip |
Pc, DSP, VLIW | Spec. purpose processor cores, custom | TI
Pc, Mp, ASICs | Gen. purpose cores, specialized by I/O, etc. | IBM, Intel, Lucent
Universal Micro | Multiprocessor array, programmable I/O | Cradle, Intel IXP 1200
34 Xilinx 10Mg, 500Mt, .12 mic
35 Tensilica Approach: Compiled Processor Plus Development Tools
ALU
I/O
Timer
Pipe
Cache
MMU
Register File
Tailored, HDL uP core
Using the processor generator, create...
Describe the processor attributes from a
browser-like interface
Standard cell library targeted to the silicon process
Customized Compiler, Assembler, Linker, Debugger, Simulator
Courtesy of Tensilica, Inc. http://www.tensilica.com
Richard Newton, UC/Berkeley
36 EEMBC Networking Benchmark
- Benchmarks: OSPF, Route Lookup, Packet Flow
- Xtensa with no optimization comparable to 64b RISCs
- Xtensa with optimization comparable to high-end desktop CPUs
- Xtensa has outstanding efficiency (performance per cycle, per watt, per mm2)
- Xtensa optimizations: custom instructions for route lookup and packet flow
Colors: Blue-Xtensa, Green-Desktop x86s, Maroon-64b RISCs, Orange-32b RISCs
37 EEMBC Consumer Benchmark
- Benchmarks: JPEG, Grey-scale filter, Color-space conversion
- Xtensa with no optimization comparable to 64b RISCs
- Xtensa with optimization beats all processors by 6x (no JPEG optimization)
- Xtensa has exceptional efficiency (performance per cycle, per watt, per mm2)
- Xtensa optimizations: custom instructions for filters, RGB-YIQ, RGB-CMYK
Colors: Blue-Xtensa, Green-Desktop x86s, Maroon-64b RISCs, Orange-32b RISCs
38 Free 32 bit processor core
39 Complex SOC architecture
Synopsys via Richard Newton, UC/B
40 UMS Architecture
- Memory bandwidth scales with processing
- Scalable processing, software, I/O
- Each app runs on its own pool of processors
- Enables durable, portable intellectual property
41 Cradle UMS Design Goals
- Minimize design time for applications
- Efficient programming model
- High reusability accelerates derivative development
- Cost/Performance
- Replace ASICs, FPGAs, ASSPs, and DSPs
- Low power for battery powered appliances
- Flexibility
- Cost effective solution to address fragmenting markets
- Faster return on R&D investments
42 Universal Microsystem (UMS)
Figure labels: Quad 1, Quad 2, Quad 3, ... Quad n; I/O Quads; SDRAM control; PLA ring
Each Quad has 4 RISCs, 8 DSPs, and Memory
Unique I/O subsystem keeps interfaces soft
43 The Universal Micro System (UMS)
An off the shelf Platform for Product Line
Solutions
Universal Micro System
Superior Digital Signal Processing (Single Clock
FP-MAC)
Local Memory that scales with additional
processors
Scalable real time functions in software using
small fast processors (QUAD)
Intelligent I/O Subsystem (Change Interfaces
without changing chips)
250 MFLOPS/mm2
44 VPN Enterprise Gateway
- Five quads: two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface
- Handles 250 end users and 100 routes
- Does key handling for IPSec
- Delivers 100Mbps of 3DES
- Firewall
- IP Telephony
- O/S for user interactions
- Single quad: two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface
- Handles 250 end users and 100 routes
- Does key handling for IPSec
- Delivers 50Mbps of 3DES
45 Table 2: Performance of Kernels on UMS
UMS Application Performance
- Architecture permits scalable software
- Supports two Gigabit Ethernets at wire speed; four fast Ethernets; four T-1s, USB, PCI, 1394, etc.
- MSP is a logical unit of one PE and two DSEs
46 Cradle Universal Microsystem: trading Verilog and hardware for C/C++
UMS : VLSI = microprocessor : special systems; Software : Hardware
- Single part for all apps
- App spec'd @ run time using FPGA and ROM
- 5 quad mPs at 3 Gflops/quad = 15 Gflops
- Single shared memory space, caches
- Programmable periphery including 1 GB/s; 2.5 Gips; PCI, 100 baseT, firewire
- $4 per flops; 150 mW/Gflops
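The headline numbers on this slide combine by simple multiplication. The sketch below works them through, assuming the 150 mW/Gflops figure applies at the full 15 Gflops rate, which the slide does not state. The constant names are hypothetical.

```python
QUADS = 5                 # quad multiprocessors per chip, from the slide
GFLOPS_PER_QUAD = 3.0     # from the slide
MW_PER_GFLOPS = 150.0     # from the slide

total_gflops = QUADS * GFLOPS_PER_QUAD
# Assumption: power scales linearly with delivered Gflops up to peak.
total_power_w = total_gflops * MW_PER_GFLOPS / 1000

print(f"peak compute: {total_gflops:.0f} Gflops")
print(f"power at peak (assumed linear): {total_power_w:.2f} W")
```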
47 Silicon Landscape 200x
- Increasing cost of fabrication and mask
- $7M for high-end ASSP chip design
- Over $650K for masks alone and rising
- SOC/ASIC companies require $7-10M business guarantee
- Physical effects (parasitics, reliability issues, power management) are more significant design issues
- These must now be considered explicitly at the circuit level
- Design complexity and context complexity are sufficiently high that design verification is a major limitation on time-to-market
- Fewer design starts, higher design volume... implies more programmable platforms
Richard Newton, UC/Berkeley
48 The End
50 The Energy-Flexibility Gap
Figure: energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus flexibility (coverage).
- Dedicated HW: MUD, 100-200 MOPS/mW
- Reconfigurable processor/logic: Pleiades, 10-50 MOPS/mW
- ASIPs, DSPs: 1 V DSP, 3 MOPS/mW
- Embedded µProcessors: LPArm, 0.5-2 MIPS/mW
Source: Prof. Jan Rabaey, UC Berkeley
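The size of the gap can be read off the quoted ranges. The sketch below computes the smallest and largest efficiency ratio between two classes, treating MOPS/mW and MIPS/mW as comparable in the same way the figure's shared axis does; that equivalence, the dictionary layout, and the function name are assumptions for illustration.

```python
# (low, high) energy efficiency per class, in MOPS/mW or MIPS/mW, from the slide.
EFFICIENCY = {
    "Dedicated HW (MUD)": (100.0, 200.0),
    "Reconfigurable (Pleiades)": (10.0, 50.0),
    "ASIPs/DSPs (1 V DSP)": (3.0, 3.0),
    "Embedded processors (LPArm)": (0.5, 2.0),
}


def gap(better: str, worse: str) -> tuple:
    """Smallest and largest efficiency ratio of `better` over `worse`."""
    b_lo, b_hi = EFFICIENCY[better]
    w_lo, w_hi = EFFICIENCY[worse]
    return b_lo / w_hi, b_hi / w_lo


lo, hi = gap("Dedicated HW (MUD)", "Embedded processors (LPArm)")
print(f"dedicated hardware is {lo:.0f}x to {hi:.0f}x more energy efficient")
```

By these figures dedicated hardware sits about two orders of magnitude above an embedded processor, with reconfigurable logic in between.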
51 Approaches to Reuse
- SOC as the Assembly of Components?
- Alberto Sangiovanni-Vincentelli
- SOC as a Programmable Platform?
- Kurt Keutzer
52 Component-Based Programmable Platform Approach
- Application-Specific Programmable Platforms (ASPP)
- These platforms will be highly-programmable
- They will implement highly-concurrent functionality
- Intermediate language that exposes programmability of all aspects of the microarchitecture
- Integrate using programmable approach to on-chip communication
- Assemble Components from parameterized library
Richard Newton, UC/Berkeley
53 Compact Synthesized Processor, Including Software Development Environment
- Use virtually any standard cell library with commercial memory generators
- Base implementation is less than 25K gates (1.0 mm2 in 0.25 µm CMOS)
- Power dissipation in 0.25 µm standard cell is less than 0.5 mW/MHz
Figure caption: to scale on a typical $10 IC (3-6% of 60 mm2)
Courtesy of Tensilica, Inc. http://www.tensilica.com
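The 0.5 mW/MHz bound turns into a power budget once a clock rate is chosen. A minimal sketch of that conversion; the clock rates below are hypothetical examples, not figures from the talk, and the function name is an assumption.

```python
MAX_MW_PER_MHZ = 0.5   # upper bound quoted on the slide for 0.25 micron standard cell


def max_core_power_mw(clock_mhz: float) -> float:
    """Upper bound on core power at a given clock, from the mW/MHz figure."""
    return MAX_MW_PER_MHZ * clock_mhz


for clock_mhz in (50, 100, 200):   # hypothetical clock rates
    print(f"{clock_mhz:3d} MHz -> under {max_core_power_mw(clock_mhz):.0f} mW")
```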
54 Challenges of Programmability for Consumer Applications
- Power, Power, Power.
- Performance, Performance, Performance
- Cost
- Can we develop approaches to programming silicon
and its integration, along with the tools and
methodologies to support them, that will allow us
to approach the power and performance of a
dedicated solution sufficiently closely (2-4x?)
that a programmable platform is the preferred
choice?
Richard Newton, UC/Berkeley
55 Bottom Line: Programmable Platforms
- The challenge is finding the right programmer's model and associated family of micro-architectures
- Address a wide-enough range of applications efficiently (performance, power, etc.)
- Successful platform developers must own the software development environment and associated kernel-level run-time environment
- It's all about concurrency
- If you could develop a very efficient and
reliable re-programmable logic technology
(comparable to ASIC densities), you would
eventually own the silicon industry!
Richard Newton, UC/Berkeley
56 Approaches to Reuse
- SOC as the Assembly of Components?
- Alberto Sangiovanni-Vincentelli
- SOC as a Programmable Platform?
- Kurt Keutzer
Richard Newton, UC/Berkeley
57 A Component-Based Approach
- Simple Universal Protocol (SUP)
- Unix pipes (character streams only)
- TCP/IP (only one type of packet; limited options)
- RS232, PCI
- Streaming
- Single-Owner Protocol (SOP)
- Visual Basic
- Unibus, Massbus, Sbus,
- Simple Interfaces, Complex Application (SIC)
- When the spec is much simpler than the code, you aren't tempted to rewrite it
- SQL, SAP, etc.
- Implies natural boundaries to partition IP and
successful components will be aligned with those
boundaries.
(suggested by Butler Lampson)
58 The Key Elements of the SOC
Applications
What is the Platform aka Programmer model?
RF MEMS optical ASIP
Richard Newton, UC/Berkeley
59 Power as the Driver
(Power is still, almost always, the driver!)
Source: R. Brodersen, UC Berkeley
60 Back end
61 Computer ops/sec x word length /
62 Microprocessor performance
Figure axes: performance from Kilo through Mega and Giga to 100 G, versus year, 1970 to 2010.
63 GigaScale Evolution
- In 1999, less than 3% of engineers were doing designs with more than 10M transistors per chip. (Dataquest)
- By early 2002, 0.1 micron will allow 600M transistors per chip. (Dataquest)
- In 2001, 49% of engineers @ .18 micron, 5% @ .10 micron. (EE Times)
- 54% plan to be @ .10 micron in 2003. (EET)
64 Challenges of GigaScale
- GigaScale systems are too big to simulate
- Hierarchical verification
- Distributed verification
- Requires a higher level of abstraction
- Higher abstraction needed for verification
- High level modeling
- Transaction-based verification
- Higher abstraction needed for design
- High-level synthesis required for productivity
breakthrough