Centaur Technology Inc. - PowerPoint PPT Presentation

Slides: 45
Provided by: glenn61
Transcript and Presenter's Notes

Title: Centaur Technology Inc.


1
Random Stuff
Centaur Technology Inc.
G. Glenn Henry
Quick Background
Our Security Functions
Centaur Build Methodology
Physical Design Example
2
Quick Background
  • We're Centaur Technology Inc. (Austin, TX)
  • We design x86 processors
  • Have been alive for 11 yrs, have shipped
    processors for 8.5
  • We operate independently, but are owned by VIA
  • We are a tiny group but ship millions of
    processors/yr
  • Our processors are software and bus compatible
    with Intel x86
  • But are unique vs. Intel and AMD (re: design and
    target market)
  • lower cost (price)
  • lower power consumption
  • smaller chip footprint
  • unique integrated security features
  • generally, lower performance
  • This fits some rapidly growing new markets for
    x86
  • Parent company is VIA Technologies (Taiwan)
  • They manufacture, market, and sell our processor
    designs
  • They develop all other PC platform chips
  • (including chip sets for Intel and AMD
    processors), etc.

3
C5J (aka VIA Esther, VIA C7-M)
First shipped 8/2005
90nm IBM SOI technology
128KB, 32-way exclusive L2
P-M bus and new VIA V4 bus (400-800 MHz)
400 MHz to 2.0 GHz
P4 instructions (incl. SSE2 and SSE3)
64KB 4-way I-cache
64KB 4-way D-cache
Lowest power/MHz: 3.5 W @ 1 GHz TDP, 20 W @ 2 GHz TDP
2-way SMP support
P-M power management features
Exclusive security features
Unique nanoBGA package
31.2 mm², 26.2 M transistors
4
Our die cost
90nm VIA C7-M: 31 mm²
90nm Intel Pentium M (Dothan): 84 mm²
5
C5J Die
6.9 mm
Bus APIC
64 KB 4-way L1-I
64 KB 4-way L1-D
DCU
Fetch, Decode Translate
I-unit
128 KB 32-way L2
x87 FP
Br pred
ROM
Security
SSE 1, 2, 3, MMX
PLLs etc
fuses
6
Our Security Strategy
  • Provide a comprehensive set of data security
    functions
  • That are very secure
  • That are the world's fastest (for a single chip)
  • These goals require that the functions
  • Be integrated tightly into the processor core
  • Processor silicon implementation is the
    fastest hardware
  • Only hardware can be trusted (no viruses,
    etc.)
  • Require no operating system support/involvement
  • → available via non-privileged x86
    instructions
  • → hardware must manage multi-tasking
    considerations
  • Available in all of our processors, for free
  • We believe data security should be built into all
    processors
  • It's easy to do and small (effectively free)
  • It's our hobby

7
Our Security Implementation
Encryption
Hardware RNG
Secure Hash
C5XL (shipped 1/2003)
Hardware RNG unit
2 units, fastest in world!
Full AES (FIPS-197) standard in hardware; ECB, CBC,
CFB, OFB modes in hardware; fastest in world!
C5P (shipped 1/2004)
(can also feed entropy to hardware SHA to get
faster, high-quality output)
CBC/CFB-MAC modes, CTR mode, unaligned support,
faster
(faster/better using built-in hardware hash functions)
Full SHA-1 and SHA-256 (FIPS-180-1) standard in hardware
C5J (shipped 8/2005)
RSA hardware assist (Montgomery multiply)
CN (future)
xxx
xxx
8
Centaur Hardware RNG

adj DC bias
2 duplicate RNGs in different physical areas
(rotated)
whitener
whitener
asynch
clocked
1 byte per delivery
A, B, or both
up to 8-byte delivery per store request
SSE store bus
x86 store-rand instruction
1-of-n bit selector
status in EAX
9
RNG Typical Performance
  • Randomness is too hard to describe here,
  • but here's some basics
  • Key requirements for truly random (per
    Schneier)
  • Unbiased statistical distribution → determined by
    statistics
  • Unpredictability → determined by modeling
  • Unreproducibility → only hardware need apply
  • Many statistical tests defined and used (and argued
    about)
  • Collections of many different statistical
    analyses
  • FIPS-140-2: useless (4 tests, broken, 20,000-bit
    sample!)
  • Diehard (18 tests): oriented to software RNGs,
    10Mb sample
  • NIST (16 tests): we think the best (much
    overlap with Diehard)
  • Ent, etc.: everyone has one, everyone has their
    favorite
  • Individual tests
  • entropy: important and widely reported, but it's
    not randomness
  • chi²: heavily used, especially for huge samples,
    our favorite
  • Maurer, etc.: everyone has their favorite
  • Many different evaluation approaches
  • threshold value, fixed ranges, probability
    analysis (p-value)
  • Much analysis and interpretation needed to make
    sense here
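
To make the chi-squared bullet concrete, here is a minimal sketch of a byte-frequency chi-squared statistic. It is not Centaur's testbed software; the function name chi_squared_bytes and the choice of 256 byte-value bins are assumptions made for illustration.

```python
from collections import Counter


def chi_squared_bytes(data: bytes) -> float:
    """Pearson chi-squared statistic of byte frequencies against a uniform
    distribution over the 256 possible byte values (255 degrees of freedom).

    Hypothetical helper for illustration only.
    """
    if not data:
        raise ValueError("need at least one byte of sample data")
    expected = len(data) / 256          # expected count per byte value
    counts = Counter(data)              # observed count per byte value
    return sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(256)
    )
```

For uniformly random bytes the statistic should hover around 255 (the degrees of freedom); values far above or far below that are both suspicious, which is why a threshold, range, or p-value interpretation is still needed.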

10
RNG Typical Performance
  • Performance and randomness vary by part; these
    are typical
  • We have done extensive analysis
  • Many terabytes of data
  • Massive sample sizes (terabyte)
  • Hundreds of chips
  • Our own testbed software
  • Analysis report by external group
  • www.cryptography.com/research/evaluations.html
  • Here's an embarrassingly simple summary
  • Passes standard test collections: FIPS, NIST,
    Diehard
  • Good chi² results
  • Many variations: SHA, random seed size, etc.

11
Centaur AES Encryption Features
  • Full FIPS-197 implemented in hardware
  • Encrypt and decrypt
  • 128b, 192b, 256b keys
  • 128b data blocks
  • Multiple operating modes in hardware
  • ECB, CBC, CFB, OFB
  • CBC/CFB-MAC and CTR modes
  • Optional extended key generation in hardware
  • For 128b key (both E and D) only
  • Various experimentation options supported
  • Round count 1-16, intermediate round results,
    etc.
  • Accessed via new application-level x86
    instructions
  • No OS support needed
  • Hardware provides inherent multitasking
  • US export licenses in place

12
Centaur AES Hardware
16-byte blocks
SSE load bus
key
ctrl
≈ 0.3 mm² total!
can pipeline 2 blks in ECB
Round key generation
block startup: CBC, CFB, OFB, etc.
Extended key RAM 16x16B
shared logic
S-box and row-shift
round fwd
round key
column mix and key add
Everything runs at processor clock speed
block finish: CBC, CFB, OFB, etc.
blk-blk fwd
SSE store bus
13
Centaur AES Performance
  • AES instruction performance (approx.)
  • 128-bit key and block size
  • usual instruction timing assumptions
  • data in cache, no interrupts, aligned,
    key done, etc.
  • Approximate clocks w/ 128b extended keys already
    loaded
  • ECB, 1 block: ≈17 clocks
  • ECB, large block count: ≈11.8/blk
  • CBC/CFB/etc., 1 block: ≈37
  • CBC/etc., large block count: ≈22.5/blk
  • Additional extended key generation/load time
    (128b key)
  • Hardware generated: ≈38
  • Loaded from memory: ≈53
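
The per-block clock counts above translate into a peak throughput once a clock rate is chosen. The sketch below assumes a 2.0 GHz core clock (the top C5J speed quoted earlier) and ignores memory and bus limits, so it is an upper bound rather than a measured figure; the function name is hypothetical.

```python
BLOCK_BITS = 128  # AES block size in bits (FIPS-197)


def aes_throughput_gbps(clocks_per_block: float, clock_hz: float) -> float:
    """Peak throughput in Gb/s for a given steady-state clocks-per-block figure."""
    return BLOCK_BITS * clock_hz / clocks_per_block / 1e9


# Assumption: 2.0 GHz core clock, data already in cache, no bus limit.
ASSUMED_CLOCK_HZ = 2.0e9

ecb_gbps = aes_throughput_gbps(11.8, ASSUMED_CLOCK_HZ)  # ECB, large block count
cbc_gbps = aes_throughput_gbps(22.5, ASSUMED_CLOCK_HZ)  # CBC, large block count

print(f"ECB peak: {ecb_gbps:.1f} Gb/s, CBC peak: {cbc_gbps:.1f} Gb/s")
```

Under these assumptions ECB comes out at roughly twice the CBC rate, which is consistent with the hardware being able to pipeline two blocks in ECB but not in the chained modes.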

14
AES Performance
  • Measured performance
  • P4: Gladman library AES; C5J: replaced routine
    with AES inst
  • ECB mode (other modes slower, but same advantage
    over P4)
  • Same memory size (512MB), same bus speeds (533
    MHz)
  • Another example: Gladman reports (on his site)
    results using his library (ECB)

bus limited
Earlier part
15
C5J Montgomery Multiplier Features
  • Goal: speed up RSA's modular exponentiation
  • c = m^e mod n
  • is dominated by repeated d = m x y mod(n) ops
  • where m, y, n are thousands of bits long!
  • This multiply is always done using
  • Montgomery Multiply algorithm
  • Uses special number space to make
  • d = a x b mod(m) much faster by
    eliminating divide
  • But initial and result values must be transformed
  • to/from Montgomery number space
  • In real usage, the transformation overhead is
    relatively small
  • Our hardware directly performs Montgomery
    Multiply
  • About as fast as an ordinary multiply!
  • For up to 32Kb numbers!
  • New application-level x86 MontMul instruction
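
As a software illustration of the idea above, here is a minimal sketch of textbook Montgomery multiplication and its use in modular exponentiation. It models the arithmetic only, not the MontMul instruction or its operand layout; the function names and the choice R = 2^k with k = bit length of n are assumptions made for the example (Python 3.8+ for the modular inverse via pow).

```python
def mont_setup(n: int):
    """Return (k, n_prime) for odd modulus n, with R = 2**k > n and
    n * n_prime ≡ -1 (mod R)."""
    if n < 3 or n % 2 == 0:
        raise ValueError("Montgomery form needs an odd modulus > 1")
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return k, n_prime


def mont_mul(a: int, b: int, n: int, k: int, n_prime: int) -> int:
    """Montgomery product a * b * R^-1 mod n, using only multiplies,
    masks, and shifts (no division by n)."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # makes t + m*n divisible by R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u       # single conditional subtract


def mont_pow(base: int, exp: int, n: int) -> int:
    """base**exp mod n via left-to-right square-and-multiply in Montgomery space."""
    k, n_prime = mont_setup(n)
    r = 1 << k
    x = mont_mul(base % n, (r * r) % n, n, k, n_prime)  # into Montgomery space
    acc = r % n                                          # 1 in Montgomery space
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)               # back out of Montgomery space
```

The two conversions (into and out of Montgomery space) happen once per exponentiation, while mont_mul runs roughly twice per exponent bit, which is why the transformation overhead is relatively small in real usage.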

16
Centaur Montgomery Multiplier
SSE load bus
Usable with any size data (256b to 32Kb, in 128b
steps)
temp regs
U
Bi
hack of existing multipliers
32 x 32
32 x 32
Ucode sequences loads and stores
32b x 32b mod(32b): 4 clks (2 clk pipelined)
Bits 64:32
Bits 31:0
SSE store bus
16-byte blocks
17
Centaur MontMul Performance
  • Compared to GMP library
  • Perform c = m^e mod n (m, e, n chosen randomly)
  • An example (speeds vary slightly based on values)
  • Note: this is most of RSA time, but not the whole
    thing
  • Same hardware as for AES chart

18
Centaur SHA Features
  • FIPS-180-1 completely implemented in hardware
  • SHA-1 (160-bit result)
  • SHA-256 (256-bit result)
  • Instruction timing
  • SHA-1: ≈251 clks
  • SHA-256: ≈262
  • where n is the number of 64B blocks to be
    compressed
  • Measured performance (Gb/s)
  • Same hardware as for AES chart, GPL SHA SW
    (Devine)
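
For reference, the result sizes and the 64-byte block size mentioned above can be checked in software with Python's standard hashlib module. The helper blocks_compressed is a hypothetical name; it counts 64-byte blocks after the standard SHA padding (one marker byte plus an 8-byte length field), which is the n the timing bullets refer to.

```python
import hashlib

message = b"abc"
sha1 = hashlib.sha1(message)
sha256 = hashlib.sha256(message)

# Digest sizes in bits: 160 for SHA-1, 256 for SHA-256.
print(sha1.digest_size * 8, sha256.digest_size * 8)

# Both algorithms compress the message in 64-byte blocks.
print(sha1.block_size, sha256.block_size)


def blocks_compressed(message_len: int) -> int:
    """Number of 64-byte blocks compressed for a message of message_len bytes,
    counting the mandatory padding (0x80 marker byte + 8-byte bit length)."""
    return (message_len + 1 + 8 + 63) // 64
```

A 55-byte message still fits in one block, while a 56-byte message spills into a second block because the padding no longer fits.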

bus limited
19
C5J SHA Hardware
SSE load bus
next 64b data
Initial digest
accumulating digest
SHA-1 2 clks/32b rnd (5)
data scheduler (16 x32b regs)
Function generators
SHA-256 3 clks/round
regs
5-way add
Final sha-256 add


SSE store bus
20
Build Process
21
The Centaur Process
22
Centaur Build Methodology
  • Our challenges!
  • Complex logic with lots of architectural
    interconnections
  • 2 GHz and aggressive power/size objectives
  • Relatively few designers (≈30 logic and circuit)
  • Strong schedule pressure (must do it fast)
  • Industry tools not sufficient (oriented to APR
    methodology)
  • Our Basic Approach
  • Hundreds of top-level stand-alone blocks
  • Allows parallel development of one-person
    blocks
  • Facilitates fast build time (chip assembly,
    timing, etc.)
  • Facilitates use of optimum process for particular
    logic
  • Hook blocks together with top-level routing,
    clocks, etc.
  • Significant content added in top-level
    build
  • Full-chip timing with fast iterations
  • Fast full-chip build iterations
  • Develop our own tools and methodology to accomplish
    above

23
(No Transcript)
24
Centaur Chip Physical Build Process
25
C5J Die
26
Underlying Source Statistics
  • Verilog lines as written (small)
  • (no behaviorals, no comments, no clocks, no
    top chip)
  • APR logic: 112K lines, 129K cells
  • Stack logic: 41K lines, 172K cells
  • Note: this is a single instance as written
  • much of this gets instantiated
    multiple times
  • Schematic pages as written (large)
  • Primitive (inv, nand2, nor2, etc.): 110
  • Standard cells: 712
  • Datapath elements: 1308
  • Full customs: 1332
  • -------
  • Total: 3462
  • Circuit library size (avail / used)
  • Clock regens: 445 / 277
  • Std cell: 547 / 435
  • G datapath elements: 493 / 271
  • W datapath elements: 248 / 147
  • ----- -----

27
C5J Security Components (metal 1-4 only)
clock repeaters
global clk meanders
7 RC bfrs
stk (x8)
custom
APR (control for all stacks)
32b data
Note: global interconnects not shown
decoupling caps
bfr section
28
C5J Security Components (metal 1-4 only)
RNG buffers
SHA sch ALU
key RAM
common control logic
128b-wide AES engine
29
Fast Build Timing
  • Every 1-5 days → full-chip Release
  • APRs synthesized, placed, and RCs estimated
  • Stacks cracked, placed, and RCs estimated
  • Full-chip timing done with estimated RCs
  • Takes < 1 day for full-chip timing report
  • Every 5-10 days → full-chip physical build
  • APRs routed
  • Stacks routed
  • Global chip routed
  • Global chip layout produced
  • APR, stack, and global route RC extraction
  • RCs feed back to calibrate estimated RCs
  • This goes on continuously, picking up new
    Releases as needed
  • Our experience at other companies → much slower

30
Basic Release Process
31
RTL Design Rules
  • APR Blocks
  • Element instantiation OK
  • Registers (required: synthesis can't infer them
    correctly)
  • Clock buffers and distribution (required: synthesis
    clocks are slow!!)
  • Occasional logic (this has diminished over time)
  • The instantiated elements are really macros
  • Auto-expanded to right size, number of bits, etc.
    in the flow
  • Wires and continuous assignment OK
  • Including operators like ?, &, <, etc.
  • Nothing else! (no procedural stuff)
  • No if/else, no case, no loops, no always, no
    @, etc.
  • No timing information/control
  • Synthesis generates bad logic for these
  • Unexpected/superfluous elements,
  • registers where not expected, timing doesn't
    work, etc.
  • Stacks
  • Component instantiation and wires only!

32
APR RTL Example
As Written
assign idleNS (T0 T8) shaDone_P
assign funcNS (T1 T3 T6 T10) shaDone_P
assign add1NS (T2) shaDone_P
assign add2NS (T5) shaDone_P
assign faddNS (T4 T7 T9) shaDone_P
rregs (5) state (.q   (idleState, funcState, add1State,
                       add2State, faddState),
                 .d   (idleNS,funcNS,add1NS,add2NS,faddNS),
                 .clk (ph1c) )
------------------
sha2cnst sha2cnst(.in   (iteration[5:0] ),
                  .ksel (shKSel ),
                  .algo (sha1_P ),
                  .out  (KsubI ))
------------------
wire [6:0] nextIteration
assign nextIteration = (shaDone_P idleState) ? 7'b0000000 :
                       shIterationStall      ? iteration :
                                               iteration + 1
33
Stack RTL Example
Datapath Section
/*------------------- KeyGen XOR --------------------------*/
wire [31:0] aesKeyGenXorOut2_L
zdxor (32,15) keyg1 (.out (aesKeyGenXorOut2_L ),
                     .in0 (aesWord2I_LB ),
                     .in1 (aesKeyGenXorOut1_LB ))
zinv (32,60) kgen2 (aesKeyGenXorOut2_LB, aesKeyGenXorOut2_L)
wire [31:0] aesKeyGenXorOut2_MB
wire [31:0] aesKeyGenXorOut2_M
zregi_en (32,10) keyg2 (.q   (aesKeyGenXorOut2_MB ),
                        .d   (aesKeyGenXorOut2_L ),
                        .clk (EPH1 ),
                        .en  (aesDynEn_K))
zinv (32,10) keyg2i (aesKeyGenXorOut2_M, aesKeyGenXorOut2_MB)
Buffer Section
rregsi (2,20) bf_kk (.qb  (aesKeyMuxSel_M ),
                     .d   (aesKeyMuxSel_LB),
                     .clk (evph1))
34
Stack Placement Tool Output
(32-bit AES stack)
35
Buffer section added
Inter-element routing (m2-6)
36
Global wires added
37
Sample Timing Report Path
time      path                                                element                  delta     load cap   wire      rise/fall
0.875ns   eeph1aesdp2                                         aesdp2/eph1buf_aesdp2/   0.050ns   0.2423pF   0.000ns   0.000ns
0.925ns   aesdp2/eph1                                         aesdp2/sc_c0ph1_48/      0.160ns   0.0321pF   0.000ns   0.000ns
1.085ns   aesdp2/keyg2_ph1                                    aesdp2/gxregi_x4_10      0.063ns   0.0035pF   0.000ns   0.004ns
1.148ns   aesdp2/aesdp2_dp_aeskeygenxorout2_mb10 v            -                        0.000ns   0.0035pF   0.000ns   0.004ns
1.148ns   aesdp2/aesdp2_dp_keyg2i_stack_bit10_i0 v            aesdp2/ginv_10           0.026ns   0.0209pF   0.000ns   0.044ns
1.173ns   aesdp2/aesdp2_dp_aeskeygenxorout2_m10               -                        0.000ns   0.0209pF   0.000ns   0.045ns
1.174ns   aesdp2/aesdp2_dp_invk_stack_bit10_i0                aesdp2/gemux3i_19        0.045ns   0.0336pF   0.000ns   0.031ns
1.219ns   aesdp2/aesdp2_dp_key_mb10 v                         -                        0.000ns   0.0336pF   0.000ns   0.031ns
1.219ns   aesdp2/aesdp2_dp_kml_stack_bit10_i0 v               aesdp2/ginv_31           0.017ns   0.0188pF   0.000ns   0.013ns
1.236ns   aesdp2/aesdp2_dp_key_m10                            -                        0.001ns   0.0188pF   0.001ns   0.014ns
1.236ns   aesdp2/aesdp2_dp_mixcoldec_xorout_stack_bit10_in0   aesdp2/gxor8_10          0.095ns   0.0170pF   0.000ns   0.029ns
1.331ns   aesdp2/aesdp2_dp_decout_m10 v                       -                        0.000ns   0.0170pF   0.000ns   0.030ns
1.332ns   aesdp2/aesdp2_dp_mcmux_stack_bit10_i2 v             aesdp2/gmux3i_10         0.030ns   0.0089pF   0.000ns   0.017ns
1.362ns   aesdp2/aesdp2_dp_mcout_mb10                         -                        0.000ns   0.0089pF   0.000ns   0.017ns
1.362ns   aesdp2/aesdp2_dp_invm_stack_bit10_i0                aesdp2/ginv_31           0.030ns   0.1101pF   0.000ns   0.053ns
1.391ns   aesdp2/aesdp2_dp_mcout_m10 v                        -                        0.012ns   0.1101pF   0.012ns   0.078ns
1.403ns   aesdp2/aesdp2_dp_pipemux0_stack_bit10_i1 v          aesdp2/gmux2i_16         0.048ns   0.0249pF   0.000ns   0.030ns
1.451ns   aesdp2/aesdp2_dp_aesword2i_kb10                     -                        0.001ns   0.0249pF   0.001ns   0.032ns
1.452ns   aesdp2/aesdp2_dp_byte1_indx_pb2
Local reg clock-to-next reg input: 1.452 - 1.085 = 367 ps
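
The clock-to-next-register figure can be cross-checked by summing the per-stage delta values between the register clock arrival (1.085 ns) and the final arrival time (1.452 ns). The sketch below does that in integer picoseconds; the variable names are made up for the example, and the 1 ps gap between the two results is presumably rounding in the printed report.

```python
# Per-stage "delta" values in picoseconds, read from the report rows that
# start at 1.085 ns (register clock) and end at 1.452 ns (next register input).
DELTAS_PS = [63, 0, 26, 0, 45, 0, 17, 1, 95, 0, 30, 0, 30, 12, 48, 1]

CLOCK_ARRIVAL_PS = 1085   # arrival time at the launching register's clock pin
DATA_ARRIVAL_PS = 1452    # arrival time at the next register's input

reported_delay_ps = DATA_ARRIVAL_PS - CLOCK_ARRIVAL_PS   # difference of endpoints
summed_delay_ps = sum(DELTAS_PS)                          # sum of stage deltas

print(f"reported: {reported_delay_ps} ps, summed deltas: {summed_delay_ps} ps")
```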
38
Random Circuit Topics
  • Clocking is very difficult and very critical
  • Very aggressive skew goals
  • 0 ps clock skew across all top-level blocks
  • < 20 ps skew worst case within a block
  • These are met in our designs, ignoring on-chip
    silicon variations
  • Multiple clock domains required (for bus and
    various power states)
  • Many early, late, etc. versions of the clocks
    needed
  • Clocks must be gated (for power management)
  • Our clocking methodology is proprietary, but
  • Hand-routed global clock tree (continually
    changing)
  • Our own tools to generate clock shields tuned to
    surroundings
  • Tunable repeaters (via fuse and via metal)
  • Hand-instantiated clock elements within blocks
  • Many selectable clocks (±xx ps for each reg)
  • Auto-generated clock grids within APRs and stacks
  • Fuse-adjustable PLL characteristics (duty cycle,
    etc.)
  • Power/ground distribution critical
  • Extensive analysis and management required

39
Random Circuit Topics (cont)
  • Robust circuit design required across ≈12 corner
    models
  • 54 formal corners identified; we choose the most
    critical 12
  • Covers variations in temp, V, N xistor, P xistor
  • Automated element simulation done across these
    models
  • Full-chip timing is done using 2 of these corners
    (hi V, lo V)
  • Extensive use of dynamic logic
  • Precharge in phase 1, evaluate in phase 2
  • Registers, adders, comparators, arrays, etc.
  • Customs, stacks (and APRs)
  • Two stack-element libraries
  • With different bit pitches
  • Element libraries have several versions of the same
    function
  • Usually, at least fast/big/hot and
    slow/small/cool
  • Example: C5J has 2 different vanilla 32-bit
    adders
  • Fast (dynamic): 180 ps, 37.9 µ high
  • Slow (static): 250 ps, 16.9 µ high
  • Note: 25 total adders in library, instantiated
    65 total times

40
Random Circuit Topics (cont)
  • Several families of registers available
  • Differ in function, speed, size, and performance
  • Std cell, datapath, and custom versions
  • Each comes in many drive strengths (sizes)
  • Many have built-in functions
  • muxes, and/or logic, xors, compares, etc.
  • These provide speed/size/power improvements vs.
    separate elements
  • Examples using C5J stack elements

26b
normal reg
fast reg
static cmp-eq 20
4.6 µ
k-reg 10
3.8 µ
k-reg 10
k-reg dynamic cmp-eq 60
x-reg 10
5.0 µ
3.8 µ
1.4 µ
inv 54
9.5 µ
32 ps
82 ps (data-to-out)
90 + 32 + 17 = 139 ps
88 ps
41
(No Transcript)
42
C5J Security Component Sizes (mm2)
Sample scale
227 µ
0.014
0.014
0.069
0.046
0.080
0.080
0.080
0.091
0.034
0.021
Total: 0.529 mm² + 0.014 for 2 RNGs (elsewhere)
≈ 0.54 mm² (a few cents, but for this chip it's
really free)
43
C5J Security Component Sizes
(If we had only known about all this space when
we started)
227 µ
0.014
0.014
0.069
0.046
0.080
0.080 mm²
0.080
0.091
0.034
Note: We had so much spare room on die that
we didn't spend any effort making this smaller.
We estimate at least 30% smaller if we tried
hard!
44
Startup, CBC, etc. muxes and registers
---register----------------------------------
S-box ROM
(2 x 256 x 8 bit) x 4 bytes, ≈200 ps access
(dynamic)
Row-shift muxes
(wires to other 32b stacks not visible)
---register----------------------------------
Column multiply (and key xor)
made out of 2-, 3-, 4-, 5-, 6-, 7-, and 8-input xors
---register----------------------------------
Startup, CBC, etc. muxes and registers
(extra stuff at bottom for key generation)