Title: The Power of Priority: NoC based Distributed Cache Coherency
1. The Power of Priority: NoC based Distributed Cache Coherency
EE Department, Technion, Haifa, Israel
Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny
QNoC Research Group, Technion
2. Chip Multi-Processor (CMP)
- Multi-core
- Large cache
- Shared cache
- Distributed cache, NoC-based: how?
Dual-core: monolithic shared cache
3. Future Cache - Physics Perspective
- Large cache → Large access time
Large monolithic cache is not scalable
4. NUCA - Non Uniform Cache Architecture
- Banked cache over NoC
- Smaller bank → Smaller Access Time
- Multiple banks → Multiple Ports
- Closer bank → Smaller Access Time
NUCA: non-uniform access times
- Cache-line placement policy
- Static NUCA (SNUCA)
- Dynamic NUCA (DNUCA)
Sources: Kim et al., ASPLOS 2002; Beckmann et al., MICRO 2004
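The effect of bank distance on access time can be illustrated with a small sketch. It assumes a static NUCA in which the line address alone selects the bank, banks laid out on a 4x4 mesh, and made-up per-hop and per-bank latencies; none of these parameters come from the slides, and network contention is ignored.

```python
# Hypothetical parameters, chosen only for illustration.
LINE_BYTES = 64    # cache-line size
MESH_ROWS = 4      # banks are laid out on a MESH_ROWS x MESH_COLS grid
MESH_COLS = 4
HOP_CYCLES = 2     # assumed router + link delay per hop
BANK_CYCLES = 6    # assumed access time of one small bank


def snuca_bank(addr: int) -> tuple[int, int]:
    """Static NUCA: the line address alone fixes the (row, col) of the bank."""
    line = addr // LINE_BYTES
    bank = line % (MESH_ROWS * MESH_COLS)
    return divmod(bank, MESH_COLS)


def access_cycles(core: tuple[int, int], addr: int) -> int:
    """Contention-free round trip: hops to the bank, bank access, hops back."""
    row, col = snuca_bank(addr)
    hops = abs(core[0] - row) + abs(core[1] - col)
    return 2 * hops * HOP_CYCLES + BANK_CYCLES


if __name__ == "__main__":
    core = (0, 0)
    for addr in (0x0000, 0x0040, 0x03C0):
        print(hex(addr), snuca_bank(addr), access_cycles(core, addr))
```

A dynamic NUCA would additionally migrate lines between banks, which is why it needs a search step before the access; that part is not modelled here.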
5. Issues in NUCA-based CMP
- NoC performance → CMP performance
- Cache coherency and transaction order (correctness)
- Search (in DNUCA)
- Different traffic types (e.g. fetch vs. prefetch)
- Synchronization (locks)
NoC Services for CMP?
6. Cache Coherency over NoC
How do we maintain coherency over NoC?
- Snooping
- Central directory
7. Distributed Cache Coherency
Cache access → Multiple NoC transactions
Example: Simple read transaction
8. Read Transaction of Modified Block
9. Read Exclusive of Shared Block
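The diagrams for these three transactions did not survive extraction. As a rough illustration of how one cache access turns into several NoC packets, the sketch below lists message sequences for a textbook directory-style protocol. The message names, the exact sequences, and the packet sizes are all assumptions and are not taken from the slides.

```python
from dataclasses import dataclass

CTRL_FLITS = 1   # assumed size of a short control packet
DATA_FLITS = 9   # assumed size of a long data packet (header + cache line)


@dataclass(frozen=True)
class Msg:
    src: str
    dst: str
    kind: str
    flits: int


def read_simple(req: str, home: str) -> list[Msg]:
    """Read of a line that the home L2 bank can supply directly."""
    return [
        Msg(req, home, "RD_REQ", CTRL_FLITS),
        Msg(home, req, "DATA", DATA_FLITS),
    ]


def read_modified(req: str, home: str, owner: str) -> list[Msg]:
    """Read of a line held Modified by another core (hypothetical sequence)."""
    return [
        Msg(req, home, "RD_REQ", CTRL_FLITS),
        Msg(home, owner, "FWD_RD", CTRL_FLITS),
        Msg(owner, home, "WRITEBACK", DATA_FLITS),
        Msg(home, req, "DATA", DATA_FLITS),
    ]


def read_exclusive_shared(req: str, home: str, sharers: list[str]) -> list[Msg]:
    """Read-exclusive of a line shared by other cores (hypothetical sequence)."""
    msgs = [Msg(req, home, "RX_REQ", CTRL_FLITS)]
    msgs += [Msg(home, s, "INV", CTRL_FLITS) for s in sharers]
    msgs += [Msg(s, home, "INV_ACK", CTRL_FLITS) for s in sharers]
    msgs.append(Msg(home, req, "DATA", DATA_FLITS))
    return msgs


if __name__ == "__main__":
    for m in read_exclusive_shared("P0", "B3", ["P1", "P2"]):
        print(f"{m.src} -> {m.dst}: {m.kind} ({m.flits} flits)")
```

Even in this simplified form, most packets in a transaction are short control packets, while the data packets carry most of the flits.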
10. Basic NoC to Support CMP
- Off-the-shelf (Vanilla) NoC
- Grid of wormhole routers
- Ordering in network
- Static routing
- No virtual channels
Can We Do Better?
11. Observations: L2 Access
A) Delay: queueing and NoC transactions
B) All NoC transactions are equally important
C) NoC transactions consist of
- Short ctrl. packets
- Long data packets
Idea: Differentiate between Ctrl. and Data
Solution: Preemptive Priority NoC → Give priority to short ctrl. packets
12. Preemptive Priority NoC: QNoC
Multiple SL Router
- Service Levels
- Dedicated wormhole buffer
- Preemptive priority scheduling
Multiple SL link
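A minimal sketch of preemptive priority scheduling at one router output port follows. It assumes one dedicated flit buffer per service level, with SL 0 as the highest priority, and one flit sent per cycle; the class and method names are hypothetical, and credit-based flow control and routing are left out.

```python
from collections import deque


class PriorityOutputPort:
    """One output port with a dedicated flit buffer per service level (SL).

    SL 0 is the highest priority. Illustrative only, not the QNoC design.
    """

    def __init__(self, num_levels: int = 2):
        self.buffers = [deque() for _ in range(num_levels)]

    def enqueue(self, level: int, packet_id: str, num_flits: int) -> None:
        """Queue all flits of a packet in the buffer of its service level."""
        for i in range(num_flits):
            self.buffers[level].append((packet_id, i))

    def tick(self):
        """Send one flit per cycle from the highest-priority non-empty buffer."""
        for buf in self.buffers:
            if buf:
                return buf.popleft()
        return None


if __name__ == "__main__":
    port = PriorityOutputPort()
    port.enqueue(1, "data", 4)            # long, low-priority packet
    sent = [port.tick(), port.tick()]     # two data flits go out
    port.enqueue(0, "ctrl", 1)            # short, high-priority packet arrives
    while (flit := port.tick()) is not None:
        sent.append(flit)
    print([pid for pid, _ in sent])
```

Because arbitration happens per flit rather than per packet, the control packet overtakes the data packet in mid-transfer, and the data packet resumes afterwards from its own buffer.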
13. Example: Vanilla NoC
Without contention: X = delay of long packet, d = delay of short packet
Vanilla NoC example:
Blue delay = X; Red delay = 2X + d; Average delay ≈ 1.5X
14. Example: Priority NoC
Priority NoC example:
Blue delay = X + d; Red delay = X + d; Average delay ≈ X
Potential delay reduction ≈ 0.5X
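The arithmetic behind this example can be checked with a short sketch. It rests on one reading of the lost figure, which is an assumption: the blue transaction is a single long packet, the red transaction is a short request followed by a long response, and the red request contends with the blue packet for the same link. The function names are hypothetical.

```python
def vanilla_delays(X: float, d: float) -> tuple[float, float]:
    """No priorities: the red request waits behind the whole blue packet."""
    blue = X
    red = X + d + X        # wait for blue, send request, receive long response
    return blue, red


def priority_delays(X: float, d: float) -> tuple[float, float]:
    """Preemptive priority: the short red request overtakes the blue packet."""
    blue = d + X           # blue is delayed only by the short request
    red = d + X            # request goes out at once, then the long response
    return blue, red


def average(delays: tuple[float, float]) -> float:
    return sum(delays) / len(delays)


if __name__ == "__main__":
    X, d = 16, 1           # illustrative values with d much smaller than X
    van = average(vanilla_delays(X, d))
    pri = average(priority_delays(X, d))
    print(van, pri, van - pri)
```

The exact averages are 1.5X + 0.5d and X + d, so the saving is 0.5X - 0.5d, which approaches 0.5X when d is much smaller than X.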
15. Priority NoC: Different Destinations
- Very important in wormhole
- When ctrl. packet is blocked by other worms
Long Data
Short Req.
16. Protocol Correctness
Need state-preserving serialization of transactions in the processor interface
17. Numerical Evaluation
- CMP simulator (SIMICS)
  - Simulate parallel benchmarks
  - Obtain L2-cache access traces
- QNoC simulator (OPNET)
  - Simulate distributed coherence protocol over NoC
  - Measure total RD/RX L2-access delay
  - Measure total program throughput
18. Priority NoC: Results
Delay Reduction vs. Network Load
[Plots: RD Delay, Apache; RD/RX Delay Reduction, Apache]
- Short ctrl. packet gets high priority
- Long data packet gets low priority
19. Priority NoC: Several Benchmarks
[Plots: Delay Reduction; Program Speedup]
20. So Far: The Power of Priority
- Simplicity - Almost for Free
- Significant CMP Speed-up
- Good For
- Coherency
- Traffic differentiation (e.g. Fetch vs. Pre-Fetch)
- Search in DNUCA
- Synchronization (Locks)
21. Advanced Support Functions
- Special Broadcast for Short Messages
- Broadcast service (e.g. search in DNUCA)
- Wormhole broadcast: slow and expensive
- → SF broadcast embedded in wormhole
- Virtual Ring
- No Additional Cost
- For Invalidation Multicast
- Snooping or synchronization
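The slides do not say how the virtual ring is laid out. One possible way to embed a ring in a 2D mesh using only existing links, so that an invalidation can travel node to node and return to its sender, is sketched below. The construction and the function name are assumptions; this particular layout needs an even number of rows.

```python
def virtual_ring(rows: int, cols: int) -> list[tuple[int, int]]:
    """Return a cyclic visiting order in which consecutive nodes are mesh neighbours.

    The last node is also a neighbour of the first, closing the ring.
    Requires an even number of rows and at least two columns.
    """
    if rows < 2 or rows % 2 or cols < 2:
        raise ValueError("this construction needs even rows >= 2 and cols >= 2")
    ring = [(0, c) for c in range(cols)]               # top row, left to right
    for r in range(1, rows):                           # snake through columns 1..cols-1
        cs = range(cols - 1, 0, -1) if r % 2 else range(1, cols)
        ring += [(r, c) for c in cs]
    ring += [(r, 0) for r in range(rows - 1, 0, -1)]   # return up column 0
    return ring


if __name__ == "__main__":
    ring = virtual_ring(4, 4)
    print(len(ring), ring[:5], ring[-3:])
```

Since every hop of the ring is an ordinary mesh link, a ring of this kind needs no extra wiring, which matches the "no additional cost" point above.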
22. Summary
- NoC at the service of the CMP!
- Shared cache over NoC
- Priority is powerful
- Built-in support functions