Title: A Deterministic Globally Asynchronous Locally Synchronous (GALS) Methodology for Validation, Debug, and Test
1 A Deterministic Globally Asynchronous Locally Synchronous (GALS) Methodology for Validation, Debug, and Test
- Matthew Heath
- University of Massachusetts Amherst
- http//www-unix.ecs.umass.edu/mheath/
- mheath_at_ecs.umass.edu
- This research is funded by NSF grant 0204134 and
SRC task 1075.
2 Outline
- Globally Asynchronous Locally Synchronous (GALS) is a natural clocking style for SoCs
  - Each synchronous core is locally clocked
  - Asynchronous communication between cores
- Existing GALS methodologies have limitations
  - Many are nondeterministic, which is bad for validation, debug, and test
  - Others achieve determinism by imposing environmental constraints that are valid only for limited applications
- Synchro-Tokens: a novel deterministic GALS methodology
  - Flexible constraints for a wide variety of applications
  - Uses token rings to control local clocks and regulate data flow between clock domains
- Current status: Verilog simulation has validated the concept
- Future work: circuit studies and formal methods
3 Modern synchronous design
[Figure: a PLL provides phase adjustment and frequency scaling and feeds a low-skew global clock distribution; each clock domain (SoC core) has a low-skew local distribution; inter-domain paths contain logic, RC wire delay, and flip-flop repeaters.]
- Logic delay + wire delay on inter-domain paths are designed to avoid races and critical paths
4 Clock design for SoCs
- Fully synchronous isn't feasible for large SoCs
  - Difficult to design a chip-level clock with known skew
  - Pre-designed clocks of different cores may be incompatible
  - Cores run at different frequencies
  - Ratioed clocks cause critical path dependencies and impede dynamic frequency scaling
  - Chip-level timing convergence and flip-flop repeater placement
- Fully asynchronous also has drawbacks
  - Not always better than synchronous due to handshaking overhead
  - Legacy cores are likely to be synchronous designs
  - Most tools handle async circuits inadequately or not at all
  - Many designers lack async design experience
5 GALS: Globally Asynchronous Locally Synchronous
[Figure: synchronous blocks, each with a local ring oscillator and a low-skew local clock distribution; asynchronous communication between blocks passes through logic, RC wire delay, and a synchronizer.]
- D. Chapiro, "Globally-Asynchronous Locally-Synchronous Systems," PhD Thesis, Stanford University, Report No. STAN-CS-84-1026, Oct. 1984.
6 Nondeterminism: A behavioral view
- I1: ADD R3, R1, R2
- I2: MUL R5, R3, R4
- I3: SUB R4, R2, R1
- I4: MOV R6, R3
- I5: ADD R4, R3, R2
7 Simulated expectation
[Figure: simulated expectation. Clk = clock of the RF SB; the tester observes RF state after each Clk.]
8 Same sequence, different cycles
[Figure: simulated expectation (Clk = clock of the RF SB; the tester observes RF state after each Clk) compared with silicon test 1.]
- Silicon Test 1 results: I3 exec/write 1 cycle late; I5 sequence delayed; I2, I5 writes on time
9 Different sequence
[Figure: simulated expectation (Clk = clock of the RF SB; the tester observes RF state after each Clk) compared with silicon tests 1 and 2.]
- Silicon Test 1 results: I3 exec/write 1 cycle late; I5 sequence delayed; I2, I5 writes on time
- Silicon Test 2 results: I2 exec 1 cycle late; I2, I5 writes swapped
10 More than one right answer
- Architectural spec defines a partially ordered sequence of events
- Implementation is correct if it conforms to the spec
  - Single events can occur on nondeterministic clock cycles
  - Multiple events with no specified order can occur in a nondeterministic sequence
- Many partially ordered sets of events induce a large number of possible correct event traces
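As a rough illustration of how quickly correct traces multiply, the sketch below enumerates every ordering of five events that respects a small set of precedence constraints. The event names reuse the I1 to I5 labels from the behavioral example, but the constraint list `MUST_PRECEDE` and the helper `valid_traces` are assumptions made for illustration, not the dependency set of any particular pipeline.

```python
from itertools import permutations

EVENTS = ["I1", "I2", "I3", "I4", "I5"]

# Hypothetical partial order: (a, b) means event a must occur before event b.
MUST_PRECEDE = [("I1", "I2"), ("I1", "I4"), ("I1", "I5"), ("I3", "I5")]


def valid_traces(events, must_precede):
    """Return every total ordering of events that respects the partial order."""
    traces = []
    for order in permutations(events):
        position = {event: index for index, event in enumerate(order)}
        if all(position[a] < position[b] for a, b in must_precede):
            traces.append(order)
    return traces


traces = valid_traces(EVENTS, MUST_PRECEDE)
print(len(traces), "correct traces out of", len(list(permutations(EVENTS))))
```

Even with only five events and four constraints, a tester would have to accept any one of the surviving orderings as a correct response.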
11 Lack of a unique trace makes validation, debug, and test much harder
- Validating many possible traces requires computing resources
- Analyzing whether a response is correct slows down debug
- Finding all possible traces consumes test creation time
- On-chip storage for BIST costs die area
- Off-chip storage needs expensive tester memory
- Comparing test results with all possible traces costs test time
- If a fault effect maps to another correct trace, coverage is lowered
- Divide-and-conquer doesn't allow testing of the entire chip at once
- Waiting for the test to reach a naturally deterministic state provides insufficient observability
12 Nondeterminism: A signal-level view
[Figure: three timing diagrams of a flip-flop with clock clk, asynchronous data input d switching from D1 to D2, and output q, over clock cycles 1 to 3.]
- Case 1: async data input switches after the clock edge
  - Output switches after the second clock edge
  - Sequence at output q for clock cycles 1, 2, 3: D1, D1, D2
- Case 2: async data input switches before the first clock edge
  - Output switches after the first clock edge
  - Sequence at output q for clock cycles 1, 2, 3: D1, D2, D2
- Case 3: data input switches very close to the clock edge
  - Flip-flop goes metastable
  - Output resolves to a random value after a random time
  - Sequence at output q for clock cycles 1, 2, 3: D1, ? (D1 or D2), D2
13 Sources of nondeterminism
The cycle of Clk_B on which a signal transition caused by Clk_A is captured depends on:
- tPA: propagation delay
  - Includes FF delay plus any combinational logic
  - May vary within one test on one chip due to data dependence
- tSkew + nT: skew between local clocks, plus an integral number of clock cycles
  - May vary between test runs of one chip due to clock initialization and frequency shmoo
- tWire: wire delay
  - May vary between chips due to process variation
[Figure: a flip-flop clocked by Clk_A drives "out" through logic and RC wire delay to "in" of a flip-flop clocked by Clk_B; the timing diagram marks tPA, tWire, tSetup, and tSkew + nT.]
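One way to read the timing diagram is that the transition is captured on the first Clk_B edge that falls at least tSetup after the data arrives. The sketch below computes that cycle index n under this assumed reading; the function name `capture_cycle`, the integer picosecond units, and the example numbers are all illustrative rather than taken from the slides.

```python
def capture_cycle(t_pa, t_wire, t_setup, t_skew, period):
    """Smallest n >= 0 such that t_skew + n * period >= t_pa + t_wire + t_setup.

    All arguments are integers in the same time unit (for example picoseconds).
    Time zero is the launching Clk_A edge; Clk_B edges occur at t_skew + n * period.
    """
    required = t_pa + t_wire + t_setup - t_skew
    if required <= 0:
        return 0
    return -(-required // period)  # integer ceiling division


# Same design, three conditions: only the wire delay or the clock skew changes.
nominal = capture_cycle(t_pa=300, t_wire=500, t_setup=100, t_skew=50, period=1000)
slow_wire = capture_cycle(t_pa=300, t_wire=1200, t_setup=100, t_skew=50, period=1000)
other_skew = capture_cycle(t_pa=300, t_wire=500, t_setup=100, t_skew=950, period=1000)
print(nominal, slow_wire, other_skew)
```

A change in any one term can move the capture to a different Clk_B cycle, which is the nondeterminism the slide describes.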
14 Asynchronous signals sampled with unrelated clocks are nondeterministic...
[Figure: three interface styles crossing RC wire delays: double-flip-flop synchronizers, source-synchronous clocking, and clock recovery from data (encode/decode), with PLLs.]
15 ...regardless of what gets synchronized
[Figure: handshake logic at each end of an RC wire, shown for handshaking with dual-rail data (D, ACK) and for handshaking with bundled data (DATA, REQ, ACK).]
16 Mutex Element (ME) hides metastability but is still nondeterministic
[Figure: mutex element circuit labeled R1, R2, V1, V2, T1, T2, A1, A2, with waveforms of R1, R2, A1, and A2 around a metastable period.]
- Initial condition: R1 = R2 = 0, A1 = A2 = 0, V1 = V2 = 1
- Requests are acknowledged one at a time on a first-come, first-served basis
- During the metastable period, |V1 - V2| < Vt and A1 = A2 = 0
17 Stoppable clocks
- Ring oscillator, but not a PLL
- Aligns clock to data to avoid metastability-induced system failure
- Clock stops while ME is metastable or while acknowledging data request
- Each synchronous block has independent frequency and phase
- Nondeterministic: can't predict which clock cycle an async request arrives
- Muttersbach, Villiger, Fichtner, "Practical Design of GALS Systems," ASYNC '00
[Figure: stoppable clock feeding the synchronous logic, with a mutex arbitrating between R1/A1 and R2/A2 and a Req/Ack port.]
18 Self-timed FIFOs
[Figure: Sync Block A (Clk_A) and Sync Block B (Clk_B) connected by a self-timed FIFO whose stages exchange req/ack handshakes.]
- Self-timed FIFOs pipeline the async communication channel
  - Use bundled data and careful timing (shown above)
  - Embed the request in dual-rail data
- Same nondeterministic stoppable clock
- Yun & Dooply, "Pausible Clocking-Based Heterogeneous Systems," Trans. VLSI, Dec. '99
19 Determinism by environmental constraints
- Like all synchronous designs, each synchronous block produces deterministic state and output sequences in response to a given input sequence
- To eliminate all nondeterminism, make the input sequence of each synchronous block deterministic by constraining the input data
20 Determinism for low-bandwidth I/O
[Figure: Sync Blocks A, B, and C with local clocks Clk_A, Clk_B, and Clk_C, linked by req/ack handshakes.]
- Accept new, asynchronous input only after a deterministic, synchronous local event has stopped the clock
- Don't restart the clock until the input data has arrived
- Nilsson & Torkelson, "A Monolithic Digital Clock Generator for On-Chip Clocking of Custom DSPs," JSSC, May '96
21 Determinism for constant I/O
[Figure: Sync Block A and Sync Block B, timed from a common Clk, connected by a self-timed FIFO whose stages exchange req/ack handshakes.]
- Prevent FIFO from becoming empty or full
  - Initialize the FIFO to ½ full
  - Use global reference clock for exact frequency matching
  - Add and remove data at equal rates
- Each end of the FIFO is effectively synchronized to the local clock
- Greenstreet, "Implementing a STARI Chip," 1995 ICCD
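The half-full initialization can be pictured with a small occupancy model: a producer and a consumer run at exactly the same period but with an unknown phase offset, and the FIFO absorbs the offset. The sketch below is a simplified illustration under those assumptions; `fifo_occupancy_bounds`, its parameters, and the example numbers are hypothetical and do not model the actual STARI circuit.

```python
def fifo_occupancy_bounds(depth, phase_offset, n_cycles, period=100):
    """Return (min, max) occupancy of a FIFO that starts half full.

    The producer inserts one item at times k * period and the consumer removes
    one item at times k * period + phase_offset, for k = 0 .. n_cycles - 1.
    """
    events = []
    for k in range(n_cycles):
        events.append((k * period, +1))                  # producer inserts
        events.append((k * period + phase_offset, -1))   # consumer removes
    events.sort()

    occupancy = depth // 2
    lowest = highest = occupancy
    for _, delta in events:
        occupancy += delta
        lowest = min(lowest, occupancy)
        highest = max(highest, occupancy)
    return lowest, highest


# Consumer lags or leads the producer by 2.5 periods; depth 8 leaves headroom.
lag_bounds = fifo_occupancy_bounds(depth=8, phase_offset=250, n_cycles=50)
lead_bounds = fifo_occupancy_bounds(depth=8, phase_offset=-250, n_cycles=50)
print(lag_bounds, lead_bounds)
```

As long as the offset stays below half the FIFO depth in periods, the occupancy in this model never reaches zero or `depth`, so neither end ever has to wait on the other.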
22 Making GALS Deterministic
- Each SB must receive each transition on each of its asynchronous inputs during a local clock cycle which is known in advance.
  - A transition must not be recognized if it occurs earlier than expected, and the local clock must stop to wait for a transition if it occurs later than expected.
- Such complete knowledge is never available in practice
  - Its existence would imply that the inputs carry no information and thus aren't even needed!
- This knowledge can be inferred for all inputs if it is available for select inputs and if the timing relationship between those and all other inputs is known.
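23 Types of Signals
Signals classified by whether their value and their transition time are known in advance:
- Asynchronous Data: value not known, transition time not known
- Asynchronous Handshake: value known, transition time not known
- Synchronous: value not known, transition time known
- Redundant: value known, transition time known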
24 Bundled Data
[Figure: an Async Handshake signal bundled with Async Data signals.]
- Use timing verification during design to ensure that the logic level of a data signal at the time of a transition of its associated handshake signal is deterministic
- Easier than synchronous design because the data signal and timing signal have the same source and destination, and thus can have similar routes
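The timing check described above can be pictured as a one-line constraint: every bundled data bit must settle some margin before the handshake transition reaches the destination. The sketch below states that constraint under assumed names (`bundling_slack`, `margin`) and made-up delays; a real flow would take these numbers from static timing analysis.

```python
def bundling_slack(data_delays, handshake_delay, margin):
    """Slack of the bundling constraint, in the same unit as the delays.

    The constraint holds when the slowest data bit, plus a safety margin,
    arrives no later than the handshake transition (slack >= 0).
    """
    return handshake_delay - (max(data_delays) + margin)


# Hypothetical route delays in picoseconds for a 4-bit bundle.
ok_slack = bundling_slack([410, 395, 430, 402], handshake_delay=520, margin=50)
bad_slack = bundling_slack([410, 395, 505, 402], handshake_delay=520, margin=50)
print(ok_slack, bad_slack)
```

A negative slack means the data value seen at the handshake transition could differ from chip to chip, which is exactly what the timing verification is meant to rule out.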
25 Master Handshake
- All handshake signals with a common source SB and a common destination SB are bundled to a single master handshake signal
- Timing verification ensures that the values of all bundled handshake signals at the time of a transition of the master handshake signal are deterministic
[Figure: a Master Handshake bundles a Request Handshake and an Acknowledge Handshake, each with its own bundled data.]
26 Stoppable Clock
- Use the master handshake signal as the asynchronous enable of a stoppable clock
[Figure: stoppable clock driving the synchronous logic, with signals SyncEn, AsyncEn, En, and Clk and their waveforms.]
27 Synchro-Tokens: Flexible constraints
- Synchro-Tokens does NOT impose requirements on the outputs of other synchronous blocks
  - Instead, control logic added to the asynchronous inputs of blocks constrains them before they reach the synchronous logic
- Ensures that asynchronous input transitions are captured on deterministic clock cycles
  - Uses token rings to control local clocks and regulate data flow between clock domains
  - No synchronizers → zero probability of metastability failure
- Flexible constraints for a wide variety of applications
  - Clocks don't stop for asynchronous data transfer
  - Time-varying asynchronous data rates are supported
28-32 Synchro-Tokens System Overview
[Figure: synchronous blocks, each containing token ring nodes, FIFO interfaces (Fifo Ifc), and a local clock generator (Clk) with an enable (En), connected by self-timed FIFOs and token rings.]
- Self-timed FIFOs for inter-block communication
  - Async / sync interface in each SB
- One token ring for each communicating SB pair
  - Any number of FIFOs
  - Node in each SB
  - 1 link inverting for 2-phase handshake
- Internal blocks have local clock generators enabled by token ring nodes
- System I/O blocks are externally synchronized
33 Token Ring Nodes: Control when token is received and sent
[Figure: a node sits between the async token ring (TokenIn, TokenOut) and the sync block (Clk) and drives Clock_En and FIFO_En.]
- Node has two clock cycle counters
  - Decrement once per local clock
  - Initial values are programmable architectural parameters
- Hold counter
  - Tracks time between receiving and sending token
  - Nonzero value enables clock AND interfaces of associated FIFOs
- Recycle counter
  - Tracks how long after sending the token it is expected to return
  - Nonzero value enables clock but NOT FIFO interfaces
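The counter rules above can be captured in a small behavioral model. The following is a minimal sketch, not the actual node logic: the class name `TokenRingNode`, its method names, and the choice to load the recycle counter at the moment the token is sent are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenRingNode:
    hold_init: int       # programmable hold count, in local clock cycles
    recycle_init: int    # programmable recycle count, in local clock cycles
    hold: int = 0
    recycle: int = 0

    @property
    def fifo_en(self) -> bool:
        # FIFO interfaces are enabled only while the hold counter is nonzero.
        return self.hold > 0

    @property
    def clock_en(self) -> bool:
        # Either counter being nonzero keeps the local clock enabled.
        return self.hold > 0 or self.recycle > 0

    def receive_token(self) -> bool:
        """Accept a waiting token only when both counters are zero."""
        if self.hold == 0 and self.recycle == 0:
            self.hold = self.hold_init
            return True
        return False  # early token: ignored until the recycle counter expires

    def clock_edge(self) -> bool:
        """Advance one local clock cycle; return True when the token is sent."""
        if self.hold > 0:
            self.hold -= 1
            if self.hold == 0:
                self.recycle = self.recycle_init
                return True
        elif self.recycle > 0:
            self.recycle -= 1
        return False


node = TokenRingNode(hold_init=3, recycle_init=2)
node.receive_token()
sent = [node.clock_edge() for _ in range(3)]   # token leaves on the third edge
print(sent, node.fifo_en, node.clock_en)
```

In this model a token offered while the recycle counter is still running is simply refused, and once both counters reach zero `clock_en` drops until `receive_token` succeeds.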
34 Stoppable Clock: Ensures token received on a deterministic clock cycle
- Programmable frequency
- Enabled by all nodes in the SB
  - If a token returns early, it is ignored by the node until the recycle count reaches zero
  - If recycle count reaches zero before token returns, the clock is synchronously disabled
  - When the late token arrives, the clock is asynchronously re-enabled
- Counters and token are deterministically initialized
- Token is always received on a deterministic clock cycle
[Figure: one stoppable clock shared by Node 1 (Token Ring 1) and Node 2 (Token Ring 2); each node has TokenIn, TokenOut, Clk, Clock_En, and FIFO_En.]
35 FIFO Interfaces: Ensure deterministic data accompanies token
- FIFO handshakes use bundled data
  - Many data bits per req/ack pair
- FIFO timing coupled to token ring
  - Arrival of token indicates async FIFO control and data inputs have stabilized
- Mutually exclusive FIFO access
  - FIFO interface enabled while associated node holds token
  - FIFO can't asynchronously become non-full or non-empty as a result of activity at the other end (which would allow nondeterministic data exchange)
- FIFO shifts fast enough to exchange data on every local clock cycle
[Figure: a node on the token ring (TokenIn, TokenOut, Clock_En, FIFO_En, stoppable clock) enables two interfaces. SB output FIFO interface: Valid, Data, and Full on the synchronous side; Req, Ack, and Data toward the self-timed FIFO. SB input FIFO interface: Read, Data, and Empty on the synchronous side; Req, Ack, and Data toward the self-timed FIFO.]
36 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, and FIFO_En, with the hold counter and recycle counter values on each local clock cycle.]
37 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with event A marked.]
(A) The incoming token arrives early, but is not received because the recycle counter is nonzero.
38 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A and B marked.]
(B) The recycle counter reaches zero.
39 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through C marked.]
(C) FIFO_En is asserted to enable the FIFO interfaces associated with the node.
40 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through D marked.]
(D) The hold counter decrements on each local clock cycle.
41 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through E marked.]
(E) When the hold counter reaches zero...
42 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through F marked.]
(F) ...the token is sent out of the node...
43 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through G marked.]
(G) ...and FIFO_En is de-asserted to disable the FIFO interfaces.
44 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through H marked.]
(H) The recycle counter decrements on each local clock cycle.
45 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through I marked.]
(I) Because the token hasn't arrived when the recycle counter reaches zero, Clock_En is de-asserted...
46 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through J marked.]
(J) ...and the local clock stops synchronously.
47 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through K marked.]
(K) When the late token arrives...
48 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through L marked.]
(L) ...Clock_En is re-asserted and the local clock is asynchronously re-enabled.
49 Waveforms for One Node
[Figure: waveforms of TokenIn, TokenOut, Clk, Clock_En, FIFO_En, hold counter, and recycle counter, with events A through M marked.]
(M) A late token at another node stops the clock to the entire synchronous block, even if this node is holding the token or recycling.
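One way to read this last step is that the block's clock runs only while every node in the block asserts its own Clock_En. The sketch below expresses that reading; the function names and the representation of a node as a `(hold, recycle)` pair are assumptions for illustration only.

```python
def node_clock_en(hold: int, recycle: int) -> bool:
    """A node enables the clock while it is holding the token or recycling."""
    return hold > 0 or recycle > 0


def sb_clock_running(nodes) -> bool:
    """The shared stoppable clock runs only if every node enables it.

    nodes is an iterable of (hold, recycle) counter pairs, one per node.
    """
    return all(node_clock_en(hold, recycle) for hold, recycle in nodes)


# Node 1 holds its token and node 2 is recycling: the clock runs.
running = sb_clock_running([(2, 0), (0, 3)])
# Node 2 has run out of recycle count and its token is late: the clock stops,
# even though node 1 is still holding its token.
stalled = sb_clock_running([(2, 0), (0, 0)])
print(running, stalled)
```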
50 Results: Out-of-order processor core
- Implemented out-of-order processor core in Verilog with variable, nonzero delays
  - Out-of-order engine and execution units in different blocks
  - Note which out-of-order engine clock cycle each instruction issues, reads, executes, and writes
- Design 1: Fully synchronous
  - Deterministic, provided synchronous design rules are obeyed
- Design 2: Fully asynchronous
  - Nondeterministic due to first-come, first-served bus arbitration
- Design 3: Standard GALS
  - Nondeterministic due to synchronizers
- Design 4: Synchro-tokens GALS
  - Deterministic!
51 Results: Determinism Validation
- Implemented a synchro-tokens system called Thrasher
  - Processes data with LFSR, bitwise logic, and arithmetic functions
  - No data hazards or partial orders, just churn garbage data
  - Clocks only stop for late tokens, never for functional constraints
- Chose nominal clock periods, FIFO and token delays, and hold and recycle counts such that tokens always arrive just in time
- Generate expected response
- Simulate with different parameter combinations
  - Each delay can be 50, 75, 100, 150, or 200% of nominal
  - 16,285 permutations
- Observe state sequence in each SB on first 100 local clocks
- Exact matches on all states for all delay permutations show that the system is deterministic!
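The flavor of this experiment can be reproduced with a toy model of a single node whose token returns after a configurable delay. This is a heavily simplified sketch, not the Thrasher testbench: the function `simulate`, the single `ring_delay` that stands in for the rest of the ring, and all numeric values are assumptions chosen for illustration.

```python
def simulate(period, ring_delay, hold_init, recycle_init, n_cycles):
    """Return (per-cycle state trace, wall-clock end time) for one node.

    The node starts out holding the token. When the hold count expires the
    token is sent and comes back ring_delay time units later. If the token is
    late when the recycle count expires, the local clock pauses until it arrives.
    """
    t = 0                          # wall-clock time of the current local edge
    hold, recycle = hold_init, 0
    token_back_at = 0
    trace = []
    for _ in range(n_cycles):
        trace.append((hold, recycle, hold > 0))   # state seen on this local cycle
        if hold > 0:
            hold -= 1
            if hold == 0:
                recycle = recycle_init
                token_back_at = t + ring_delay
        else:
            recycle -= 1
            if recycle == 0:
                if token_back_at > t:
                    t = token_back_at             # clock stopped for a late token
                hold = hold_init
        t += period
    return trace, t


NOMINAL_DELAY = 400   # equals recycle_init * period, so the token is just in time
results = [simulate(100, NOMINAL_DELAY * pct // 100, 3, 4, 100)
           for pct in (50, 75, 100, 150, 200)]
traces = [trace for trace, _ in results]
end_times = [end for _, end in results]
print(all(trace == traces[0] for trace in traces), end_times)
```

In this model the wall-clock time needed for 100 local cycles grows when the token is late, but the state observed on each local cycle is the same for every delay, which is the property the real experiment checks across all synchronous blocks.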
52 Muller C-element
[Figure: C-element symbol and circuit with inputs X and Y and output Z.]
X Y | Z
0 0 | 0
0 1 | Hold State
1 0 | Hold State
1 1 | 1
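The truth table translates directly into a next-state function. The sketch below is a behavioral illustration only; the function name `c_element` and the explicit `z_prev` argument (standing in for the state the real gate holds internally) are choices made here.

```python
def c_element(x: int, y: int, z_prev: int) -> int:
    """Muller C-element: output follows the inputs when they agree, else holds."""
    if x == y:
        return x
    return z_prev


# Walk the output through a rising and then a falling handshake.
z = 0
history = []
for x, y in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    z = c_element(x, y, z)
    history.append(z)
print(history)
```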
53 1-bit asynchronous shift register
- Dual-rail data
  - Empty bit (neither 0 nor 1) is available to hold incoming data
- To shift the chain
  - Assert Ack_in of the chain head to remove a data bit
  - Wait for data bubble to ripple backward to the chain tail
  - Assert Req0_in or Req1_in of the chain tail to add a data bit
- Add extra empty cells to chain tail so reverse bubble propagation doesn't limit shifting frequency
[Figure: four C-elements with signals Req0_in, Req0_out, Req1_in, Req1_out, Ack_in, and Ack_out.]
54 Loadable C-element
[Figure: C-element symbol and circuit with inputs X, Y, L, and D and output Z, together with its truth table, which includes Hold State rows.]
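55 Results: 1-bit boundary scan cell
[Figure: asynchronous shift-register cell (four C-elements; Req0_in, Req0_out, Req1_in, Req1_out, Ack_in, Ack_out) extended with D_in, D_out, a 0/1 selector, and Drive, Update, and Capture controls.]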
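56 Results: Nondestructive ATPG scan cell
[Figure: asynchronous shift-register cell (four C-elements; Req0_in, Req0_out, Req1_in, Req1_out, Ack_in, Ack_out) extended with D_in, D_out, a 0/1 selector, Update and Capture controls, and Clk.]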
57 Results: Test Methodologies
[Figure: a wrapper around the I/O SB and internal SBs, with an internal TCK-domain scan chain, a test FIFO, a test SB, a boundary scan chain, a 1149.1 TAP, and system I/O.]
58 Future Work
- SPICE simulations and timing analysis of synchro-tokens logic
- Apply to a large system using FPGAs or a testchip
  - Investigate area, power, and performance impact
- Investigate more aggressive protocol variations
  - Data-dependent, deterministically-varying hold and recycle counts
  - Use local empty/full bits to keep FIFO interfaces enabled after releasing the token
- Formal methods
  - Prove determinism
  - Show how to avoid deadlock
59 Summary
- GALS is a natural clocking methodology for SoCs
- Typical GALS designs are nondeterministic because asynchronous signals unpredictably transition before or after the sampling clock edge
- A nondeterministic implementation which conforms to a higher-level specification is functionally correct
- Nondeterminism makes validation, debug, and test harder because the expected response is not unique
- Synchro-tokens eliminates nondeterminism by adding control logic to the interface of synchronous blocks so that asynchronous input transitions are captured on deterministic local clock cycles
- Key components of synchro-tokens architecture
  - Token ring nodes, hold counters, and recycle counters control when tokens are received and sent
  - Stoppable clocks ensure tokens are received on deterministic clock cycles
  - FIFO interfaces ensure deterministic data accompanies the token
- The synchro-tokens concept has been validated with HDL simulations
- Future work: circuit design and formal analyses
60 Time for Questions!