Transcript and Presenter's Notes

Title: Group Talk


1
Group Talk
  • Charlie Brej
  • APT Group
  • University of Manchester

2
Part 1: The Future According to Me
  • Charlie Brej
  • APT Group
  • University of Manchester

3
Razor Blades
  • Scheme 1: Name + Number + Plus/Extreme/Ultra/Turbo/FX
  • Trac II Plus, Core Quad Extreme, Athlon 64 FX,
    GeForce 8800 Ultra
  • Scheme 2: Company + Name, where Name is
    Fusion/Quattro/Mach
  • Gillette Fusion, AMD Fusion, Ford Fusion
  • Schick Quattro, NVIDIA Quadro, Audi Quattro
  • Gillette Mach, ATI Mach, Ford Mustang Mach 1
  • Maybe more soon
(Figure: razor and product release timeline marked 1901, 1971, 1998, 2004, 2005)
4
Razor Blade History
5
Prediction: 2007 Jan-Sept
15-blade Apple iShave
6
Why did this not happen?
  • Because you don't need more than five blades on
    your razor
  • Unless we grow larger faces
  • Which hasn't happened before, so we won't need
    them for some time
  • We don't need more than four processors
  • Unless we invent an automagic parallelism
    extractor
  • Which we haven't since the 60s, so we won't need
    them for some time
  • People will still demand faster single-thread
    performance

7
Real Future
  • Moore's law will continue
  • Transistor count doubles every 18 months
  • Moving into the 3rd dimension
  • Intelligently placed transistors per person will
    remain constant
  • Not copy-paste
  • Verification becomes problematic
  • Designs become very complicated

8
Productivity
(Chart: relative productivity by role, with speech bubbles "Can we make it
pink?" and "How about Intel Terrano?")
  • Hero Coder: 100
  • Grunt Coder: 80
  • Maintainers: 60
  • Managers: 40
  • Admin: 20
  • Sales: 0
  • Marketing: -20
9
Brej's Law
  • Person years per design doubles every 18 months
  • Most transistors are copy-paste
  • Verification becomes much more complex
  • Hero coders become more rare
  • People get stupider
  • Marketing becomes more important

10
Brej's Law
  • 1985: 5 person-years (ARM)
  • 1997: 2,560 person-years (Pentium II, about right)
  • 2007: 81,920 person-years
  • Intel has 94,000 employees
  • AMD has 16,000
  • A new design every 7 years (see the sketch below)
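A quick toy sketch (my own Python, not anything from the talk) of the extrapolation above: effort doubles every 18 months from roughly 5 person-years for the 1985 ARM, and dividing by the seven-year design time gives a team size. The slide's own milestones are rounded, so expect only ballpark agreement; by the late 2020s the team sizes reach the national-population scale the next slide jokes about.

    # Toy sketch of "Brej's Law": person-years per design double every 18 months.
    def person_years(year, base_year=1985, base_effort=5.0, doubling_months=18):
        doublings = (year - base_year) * 12 / doubling_months
        return base_effort * 2 ** doublings

    for year in (1985, 1997, 2007, 2028, 2031, 2034):
        effort = person_years(year)
        team = effort / 7                    # "a new design every 7 years"
        print(f"{year}: ~{effort:,.0f} person-years (~{team:,.0f}-person team)")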

11
Brej's Law
  • 2028 Entire population of the USA are employed
    by Intel
  • 2031 Entire population of China employed by AMD
  • 2034 Entire world population working on creating
    Pentium 12
  • 2090 Project to build Pentium 15 starts but hits
    a snag as universe finishes before the project
    does

12
  • The most powerful force in the universe is
    compound interest
  • Albert Einstein
  • And we didn't have any fancy Sony Playstation
    video games
  • We had the Atari 2600!
  • There were no multiple levels or screens.
  • It was just ONE screen, forever, and you could
    never win.
  • The game just kept getting harder and faster
    until you died.
  • Just like LIFE!
  • Ernest Cline

13
Back to the Future
  • Transistors will be free
  • Mostly consumed in memory
  • Diminishing returns
  • Single thread grinds to a halt
  • Increase performance by 1%, get 100% more money
  • Fewer designs
  • Very expensive and long lead up times
  • Extend rather than redesign

14
Part 2: Wagging Logic, a Non-Throughput-Bound
Design Methodology
  • Charlie Brej
  • APT Group
  • University of Manchester

15
Introduction
  • Async performance
  • Asynchronous logic is slow
  • Wagging Logic
  • Example circuits
  • Red Star
  • Design
  • Results
  • Conclusions

16
Data propagation
(Figure: data propagating through logic and a chain of C-element latches;
latency and cycle time marked on a 0-12 scale)
17
Control propagation
(Figure: the control/acknowledge path through the chain of C-elements;
latency and cycle time marked on a 0-12 scale)
18
Control propagation
(Figure: continuation of the control-propagation animation; latency and
cycle time marked on a 0-12 scale)
19
And then it gets worse
  • The cycle time is at least six times the forward
    latency
  • This assumes all data arrives at the same time
  • And that all acknowledgements arrive at the same
    time
  • The actual ratio is somewhere between 10 and 100
    (toy model below)
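To make the ratio concrete, here is a toy Python model of one 4-phase dual-rail stage. The unit delays and the completion-tree depth are my assumptions, not figures from the talk; the point is only that completion detection and the acknowledge are paid in both the set phase and the return-to-zero phase, while the forward latency pays for neither.

    from math import ceil, log2

    # Toy model of a 4-phase dual-rail pipeline stage (assumed unit delays).
    def stage_timing(width_bits, logic=1.0, latch=1.0, ack=1.0, c_element=1.0):
        completion = c_element * ceil(log2(width_bits))  # C-element tree over the word
        latency = logic + latch                          # forward data path only
        half_cycle = logic + latch + completion + ack    # set phase
        cycle = 2 * half_cycle                           # plus the return-to-zero phase
        return latency, cycle

    lat, cyc = stage_timing(32)
    print(f"latency {lat:.0f}, cycle {cyc:.0f}, ratio {cyc / lat:.1f}")  # ratio 8.0 here

Wider completion trees and skewed arrival times push the real ratio towards the 10-100 range quoted above.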

20
What can we do
  • Use two-phase signalling
  • Halve the control delay
  • Lose all average-case advantages
  • Fine grain pipelining
  • Need to add 10 latches per stage
  • Adds latency
  • Faster completion
  • Anti-tokens, Early-drop latches
  • Careful timing analysis

21
Wagging Latches
  • Alternate latch read/write
  • Capacity of two latches
  • Depth of one latch

22
Wagging Logic
  • Apply same method to the logic
  • Rotate logic allowing one to set while others
    reset

(Figure: one copy of the logic in its Set phase while the other copies Reset;
a sketch of the rotation follows)
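A small Python sketch of the rotation (my own model, not code from the talk): successive items are dealt round-robin to N copies of the slow stage, so although each copy still needs a full handshake cycle to recover, the structure as a whole accepts a new item every cycle_time / N.

    def wagging(cycle_time, latency, n_slices, n_items):
        """Deal items round-robin over n_slices copies; return the effective cycle time."""
        free_at = [0.0] * n_slices             # when each copy can accept new data
        finish = []
        t = 0.0                                # when the next item is offered
        for i in range(n_items):
            s = i % n_slices                   # rotate over the copies
            start = max(t, free_at[s])         # wait for that copy to recover
            finish.append(start + latency)     # result appears after the forward latency
            free_at[s] = start + cycle_time    # the copy is busy for a whole cycle
            t = start + cycle_time / n_slices  # offer the next item one slot later
        return (finish[-1] - finish[0]) / (n_items - 1)

    # One copy: throughput set by the cycle time.  Six copies: set by the latency.
    print(wagging(cycle_time=12.0, latency=2.0, n_slices=1, n_items=100))  # 12.0
    print(wagging(cycle_time=12.0, latency=2.0, n_slices=6, n_items=100))  # 2.0

With the cycle-to-latency ratio of about six from the earlier slides, six copies are enough for throughput to be limited by the forward latency rather than by the handshake.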
23
Single Channel Mixer
24
LCM Channels Mixer
25
Direct Connection Mixer
26
32bit Incrementer Example
(Figure: the incrementer split into slices, each a register (Reg) or
half-buffer (HB) stage feeding a +1 block)
27
32bit Incrementer
Optimal design: 3288 operations, 3.04 GDs (gate delays) per operation
Original design: 77 operations, 130 GDs per operation
28
32bit Accumulator Example
  • Load or Accumulate

29
32bit Accumulator Example
(Figure: a sequence of Load and Accumulate operations spread across the slices)
30
32bit Accumulator Example
31
Transistors are Free
  • What is expensive?
  • Design effort
  • Time to market
  • Yield
  • What we want
  • Simple
  • Copy-Paste
  • Redundancy

32
Redundancy
(Figure: six identical copy-paste slices)
33
Arrangement
(Figure: alternative physical arrangements of slices 0-5 across the array)
34
Teaching Monkeys
  • Dynamic extraction of parallelism
  • Implicit data dependency tracking
  • No locking
  • No polling
  • No handshakes
  • Average case performance

35
Red Star
  • MIPS ISA
  • 32bit RISC
  • Fast and simple development
  • Use synchronous design methodology
  • Complicated features without complicated design
    effort
  • Out-of-order (OOO) execution, banked caching

36
Red Star
37
Register Bank
38
ADD R1, R1, 1
1401 operations, 7.14 GDs per operation
39
Branch Logic
(Figure: the PC incrementer (+1), with additional unnecessary stages to
extend the branch shadow)
40
Overlapping Instructions
(Figure: eight overlapping instructions, each passing through Fetch, Decode,
Execute, Memory, WriteBack and Dummy stages, with the branch shadow marked)
41
Nine Instruction Loop
42
Caching 4 Instruction Loop
(Figure: a four-entry loop, three instructions plus a branch back to 0, fetched
from RAM into the four slice caches, one instruction per slice)
43
Caching 3 Instruction Loop
(Figure: a three-entry loop, two instructions plus a branch back to 0, fetched
from RAM; the loop no longer lines up with the four slices, so instructions
end up cached in several slice caches)
44
Caching Delayed Branch
(Figure: the three-instruction loop again, now with a NOP padding slot in the
fourth slice so the loop stays aligned to the slices)
  • If (PC wag level != slice): execute a NOP and don't
    increment the PC (sketched below)
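A minimal Python sketch of that rule (my reconstruction; the pc % N_SLICES mapping and all names are assumptions). Running the three-instruction loop from the previous slide shows one NOP being inserted in the fourth slice on every iteration:

    N_SLICES = 4
    PROGRAM = {0: "instr_a", 1: "instr_b", 2: "branch 0"}   # three-instruction loop

    def wag_level(pc):
        return pc % N_SLICES              # which slice the PC "belongs" to (assumed)

    pc = 0
    for step in range(12):                # slices take turns in strict rotation
        slice_id = step % N_SLICES
        if wag_level(pc) != slice_id:
            print(f"slice {slice_id}: NOP")   # pad with a NOP; PC is not incremented
            continue
        instr = PROGRAM[pc]
        print(f"slice {slice_id}: {instr}")
        pc = 0 if instr.startswith("branch") else pc + 1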
45
Caching
  • Instead of one large 16Kb cache
  • 12bit address
  • 16 small 1Kb caches
  • 8bit address
  • Approximately 50% faster lookup
  • No data duplication (address-split sketch below)
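A sketch of the address split (my illustration; the exact bit assignment is an assumption, chosen so that consecutive instruction addresses land in consecutive slices as in the loop examples above): the low 4 bits of the word address select one of the 16 small caches and the next 8 bits index within it.

    N_SLICES = 16

    def split(word_addr):
        slice_id = word_addr % N_SLICES          # low 4 bits: which slice cache
        index = (word_addr // N_SLICES) & 0xFF   # next 8 bits: index inside it
        return slice_id, index

    for addr in (0, 1, 17, 4095):
        print(addr, split(addr))                 # 4095 -> (15, 255)

Each slice then decodes only an 8-bit index instead of a 12-bit one, which is presumably where the faster lookup comes from, and since every address maps to exactly one slice there is no duplication.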

46
Area
  • 4 times larger than synchronous
  • Times the number of slices
  • Currently 45,000 gates per slice
  • 15,000 gates without the register bank
  • Approx 6 million transistors (16-way)
  • 2 million without the register bank (rough check below)
  • Final design target: 4 million transistors
  • Don't wag the register bank (66% of area)
  • Simplify completion detection (50% of area)
  • Technology mapper
  • Complete the ISA
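A back-of-the-envelope check of the transistor figures (the roughly eight transistors per gate is my assumption, not a number from the talk):

    GATES_PER_SLICE = 45_000
    GATES_PER_SLICE_NO_REGBANK = 15_000
    SLICES = 16
    TRANSISTORS_PER_GATE = 8          # assumed average

    print(SLICES * GATES_PER_SLICE * TRANSISTORS_PER_GATE)              # 5760000 (~5.8 million)
    print(SLICES * GATES_PER_SLICE_NO_REGBANK * TRANSISTORS_PER_GATE)   # 1920000 (~1.9 million)

which lines up with the "approx 6 million" and "2 million" figures above.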

47
How much is 4 million?
48
How much is 4 million?
49
How much is 4 million?
50
How much is 4 million?
51
How much is 4 million?
52
Performance
  • Gate delay based simulations
  • No optimiser
  • No technology mapper
  • 7 gate delays per instruction
  • 10.3 inversion delays
  • Target of 5
  • 7.2 inversion delays

53
How Much is 10 Inversion Delays
54
How Much is 10 Inversion Delays
55
Future work
  • Very early in development
  • One week of development
  • Clumsy completion logic
  • Slowest path analysis
  • Remove unnecessary dependencies
  • Improve worst case latency
  • Target of 5 gate delays per instruction
  • Parallel instruction execution
  • Removing unnecessary latches

56
Conclusions
  • Method of producing very fast circuits
  • Minimal design effort
  • Minimal experience required
  • Implicit data dependency
  • Eager evaluation
  • Many improvements possible
  • Area could be halved
  • Performance of 5 gate delays per instruction