Title: Improving Real-Time Performance on Multicore Platforms Using MemGuard
1Improving Real-Time Performance on Multicore
Platforms Using MemGuard
- University of Kansas
- Dr. Heechul Yun
- 10/28/2013
2Multicore
Mobile
RT/Embedded
Desktop
Server
3Challenges Shared Resources
T1
T2
CPU
Memory Hierarchy
Unicore
Performance Impact
4Case Study
- HRT
- Synthetic real-time video capture
- P=20ms, D=13ms
- Cache-insensitive
- X-server
- Scrolling text on a gnome-terminal
- Hardware platform
- Intel Xeon 3530
- 8MB shared L3 cache
- 4GB DDR3 1333MHz DIMM (1ch)
- CPU cores are isolated
5HRT Time Distribution
solo
99pct 10.2ms
- 28 deadline violations
- Due to contention in DRAM
6Outline
- Motivation
- Background
- DRAM basics
- Worst-case memory performance
- MemGuard [RTAS'13]
- Improving Real-Time Performance with MemGuard
7Background: DRAM Organization
[Diagram: Core1-Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]
- Have multiple banks
- Different banks can be accessed in parallel
9Best-case
[Diagram: Core1-Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4; label: Fast]
- Peak 10.6 GB/s
- DDR3 1333MHz
- Out-of-order processors
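The peak figure above is consistent with a back-of-the-envelope calculation. The following is a minimal sketch, assuming a single 64-bit (8-byte) memory channel as in the 1ch configuration of the case study; the function name is illustrative and not from the slides.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_bytes=8, channels=1):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9

# DDR3-1333 performs roughly 1333 million transfers per second.
print(f"{peak_bandwidth_gb_s(1333):.2f} GB/s")  # about 10.66, quoted as 10.6 on the slide
```

This is the best case only: it assumes every transfer slot on the bus is used, which the following slides show is rarely true.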
10Most-cases
[Diagram: Core1-Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4; label: Mess]
(*) Intel 64 and IA-32 Architectures Optimization Reference Manual
11Worst-case
Slow
- 1-bank b/w
- Less than peak b/w
- How much?
(*) Intel 64 and IA-32 Architectures Optimization Reference Manual
12Background: DRAM Operation
[Diagram: Bank 1 with Row 1 to Row 5 and a Row Buffer; example request READ (Bank 1, Row 3, Col 7)]
- Stateful per-bank access time
- Row miss: 19 cycles
- Row hit: 9 cycles
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL) latency setting
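The stateful access time can be illustrated with a toy model of one bank. This is a minimal sketch using the 9-cycle and 19-cycle figures from the slide; the class and function names are hypothetical, and treating the very first access as a full row miss is a simplifying assumption.

```python
ROW_HIT_CYCLES = 9    # from the slide (PC6400-DDR2, 5-5-5)
ROW_MISS_CYCLES = 19  # from the slide

class Bank:
    """One DRAM bank with a single row buffer."""
    def __init__(self):
        self.open_row = None  # assumption: nothing is open at start

    def access(self, row):
        """Return the access cost in cycles and update the row buffer."""
        if row == self.open_row:
            return ROW_HIT_CYCLES
        self.open_row = row   # the requested row replaces the open one
        return ROW_MISS_CYCLES

def total_cycles(rows):
    """Cost of serving a sequence of row accesses on a fresh bank."""
    bank = Bank()
    return sum(bank.access(r) for r in rows)

print(total_cycles([3, 3, 3, 3]))  # one miss, then three hits: 46
print(total_cycles([1, 2, 3, 4]))  # every access misses: 76
```

The same number of requests costs 46 or 76 cycles depending only on the order of rows, which is why access latency depends on what ran before.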
13Real Worst-case
[Timeline: request order Row 1, Row 2, Row 3, Row 4, Row 1, Row 2 over time]
1 bank, always row miss → 1.2GB/s
Each core = ¼ × 1.2GB/s = 300MB/s?
(*) Intel 64 and IA-32 Architectures Optimization Reference Manual
14Background: Memory Controller (MC)
Bruce Jacob et al., Memory Systems: Cache, DRAM, Disk, Fig. 13.1.
- Request queue(s)
- Not fair (open-row first → re-ordering)
- Unpredictable queuing delay
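The unfairness of open-row-first re-ordering can be seen in a small simulation. This is a simplified single-bank sketch, not a model of any specific controller; the `Request` type and the function name are hypothetical.

```python
from collections import namedtuple

Request = namedtuple("Request", ["core", "row"])

def open_row_first(queue, open_row=None):
    """Serve the oldest request that hits the open row if there is one,
    otherwise the oldest request overall. Returns the service order."""
    pending = list(queue)
    order = []
    while pending:
        hit = next((r for r in pending if r.row == open_row), None)
        chosen = hit if hit is not None else pending[0]
        pending.remove(chosen)
        open_row = chosen.row  # the served row stays open
        order.append(chosen)
    return order

# Core B arrives second but is served last, behind later requests of core A.
queue = [Request("A", 1), Request("B", 2), Request("A", 1), Request("A", 1)]
print([r.core for r in open_row_first(queue)])  # ['A', 'A', 'A', 'B']
```

How long core B waits depends on how many row hits core A keeps supplying, which is the unpredictable queuing delay named above.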
15Challenges for Real-Time Systems
- Multiple parallel resources (banks)
- Stateful bank access latency
- Queuing delay
- Unpredictable memory performance
16MemGuard [RTAS'13]
[Diagram: Operating System containing MemGuard, with a Reclaim Manager and four BW Regulators (0.6GB/s, 0.2GB/s, 0.2GB/s, 0.2GB/s); Multicore Processor with Core1-Core4, each with a PMC; Memory Controller; DRAM DIMM]
- Goal: guarantee minimum memory b/w for each core
- How: b/w reservation + best-effort sharing
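The per-core reservations on this slide (0.6, 0.2, 0.2 and 0.2 GB/s) add up to the 1.2GB/s worst-case figure from the "Real Worst-case" slide. A minimal sketch of such an admission check follows, assuming the sum of reservations must not exceed that figure; the function name is hypothetical and values are in MB/s to keep the arithmetic exact.

```python
GUARANTEED_BW_MBPS = 1200  # worst-case figure from the "Real Worst-case" slide

def can_reserve(reservations_mbps, guaranteed_mbps=GUARANTEED_BW_MBPS):
    """True if the per-core reservations fit within the guaranteed bandwidth."""
    return sum(reservations_mbps) <= guaranteed_mbps

print(can_reserve([600, 200, 200, 200]))  # the split shown on the slide: True
print(can_reserve([900, 300, 300, 300]))  # over-subscribed: False
```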
17Reservation
- Idea
- Scheduler regulates per-core memory b/w using h/w counters
- Period = 1 scheduler tick (e.g., 1ms)
[Figure: budget and core activity (computation, memory fetch) over time, 0 to 2ms; annotations: "Schedule a RT idle task", "Suspend the RT idle task"]
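The reservation idea can be sketched as a per-period budget of counted memory events. The code below is a simplified simulation, not the MemGuard implementation: the 64-byte transfer per counted event is an assumption, the names are hypothetical, and demand above the budget is simply dropped rather than carried into the next period.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted event moves one 64-byte line

def budget_events(bw_mb_per_s, period_ms=1):
    """Convert a bandwidth reservation into counter events per period."""
    bytes_per_period = bw_mb_per_s * 1_000_000 * period_ms // 1000
    return bytes_per_period // CACHE_LINE_BYTES

def regulate(demand_per_period, budget):
    """For each period, return (events served, throttled?).
    A core that exhausts its budget stays idle until the next period."""
    result = []
    for demand in demand_per_period:
        served = min(demand, budget)
        result.append((served, demand > budget))
    return result

budget = budget_events(900)                # 900 MB/s reservation, 1 ms period
print(budget)                              # 14062
print(regulate([10_000, 20_000], budget))  # [(10000, False), (14062, True)]
```

In the second period the core asks for more than its budget, so it is throttled for the remainder of that period, which corresponds to the point in the slide where the RT idle task is scheduled.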
18Reservation
19Best-Effort Sharing
[Figure: timeline over time(ms), 0 to 2, for Core0 (900MB/s) and Core1 (300MB/s); annotations: throttled, reschedule]
- Spare Sharing [RTAS'13]
- Proportional Sharing [Unpublished TR]
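The slides do not spell out the sharing rules. As one plausible reading of proportional sharing, stated here as an assumption rather than the authors' method, leftover bandwidth could be split among cores in proportion to their reservations. The function name and the spare amount of 400MB/s are purely illustrative.

```python
def proportional_share(reservations, spare):
    """Split spare bandwidth among cores in proportion to their reservations."""
    total = sum(reservations.values())
    return {core: spare * r / total for core, r in reservations.items()}

# Reservations from the slide; the spare amount is a made-up example value.
shares = proportional_share({"Core0": 900, "Core1": 300}, spare=400)
print(shares)  # {'Core0': 300.0, 'Core1': 100.0}
```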
20Case Study
- HRT
- Synthetic real-time video capture
- P=20ms, D=13ms
- Cache-insensitive
- X-server
- Scrolling text on a gnome-terminal
- Hardware platform
- Intel Xeon 3530
- 8MB shared cache
- 4GB DDR3 1333MHz DIMM
21w/o MemGuard
HRT (solo)
HRT (w/ Xserver)
HRT's 99pct: 14.3ms, X's CPU util: 78%
22MemGuard reserve only (HRT=900MB/s, X=300MB/s)
HRT (solo)
HRT (w/ Xserver)
HRT's 99pct: 10.7ms
HRT's 99pct: 11.2ms, X's CPU util: 4%
23MemGuard reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing
HRT (solo)
HRT (w/ Xserver)
HRT's 99pct: 10.7ms
HRT's 99pct: 10.7ms, X's CPU util: 48%
24MemGuard reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing
HRT (solo)
HRT (w/ Xserver)
HRT's 99pct: 10.9ms
HRT's 99pct: 12.1ms, X's CPU util: 61%
25Real-Time Performance Improvement
HRT
X-server
- Using MemGuard, we can achieve
- No deadline miss for HRT
- Good X-server performance
26Conclusion
- Unpredictable memory performance
- multiple resources (banks), per-bank state, unpredictable queueing delay
- MemGuard
- Guarantee minimum memory bandwidth for each core
- b/w reservation (guaranteed part) + best-effort sharing
- Case-study
- On Intel Xeon multicore platform, using HRT + X-server
- MemGuard can improve real-time performance efficiently
- Limitations and Future Work
- Coarse grain (an OS tick) enforcement
- Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS'14)
https://github.com/heechul/memguard
27Thank you.
28Evaluation on Intel Core2
- T1: Synthetic video capture task (HRT)
- Period: 20ms (50Hz)
- Deadline: 14ms
- Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods)
- T2: Xserver, update screen (SRT)
- Metric: CPU utilization
- Higher CPU utilization → faster screen update
- Platform
- Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers
- PC6400 DDR2 DRAM DIMM x 1
- Three platform configurations
- Exp1: Private L2, Prefetch=off
- Exp2: Private L2, Prefetch=on
- Exp3: Shared L2, Prefetch=on
[Diagram: Intel Core2Quad based PC; Core0-Core3, two L2 caches (with prefetchers), DRAM]
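The T1 metrics listed on this slide can be computed from per-period execution times as follows. This is a minimal sketch with made-up sample data; the function name is hypothetical, WCET here means the observed maximum, and the choice of population standard deviation is an assumption because the slides do not say which variant was used.

```python
import statistics

def summarize(exec_times_ms, deadline_ms):
    """ACET, WCET, stdev, and deadline miss ratio for one run."""
    misses = sum(1 for t in exec_times_ms if t > deadline_ms)
    return {
        "acet": statistics.mean(exec_times_ms),
        "wcet": max(exec_times_ms),  # observed worst case, not a proven bound
        "stdev": statistics.pstdev(exec_times_ms),
        "miss_ratio": misses / len(exec_times_ms),
    }

sample = [10.0, 12.0, 15.0, 11.0]  # made-up execution times in ms
print(summarize(sample, deadline_ms=14))
```

With the 14ms deadline, one of the four sample periods misses, giving a miss ratio of 0.25; the experiments in the slides use 1000 periods per run.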
29Experiment 1
[Figure: Private L2, Prefetch=off; T1 and T2 on Core1 and Core2 (private L2 caches, shared DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 550M/s; value labels 38, 78, 92; annotations: Performance guarantee, deadline]
30Experiment 1
[Figure: Private L2, Prefetch=off; T1 and T2 on Core1 and Core2 (private L2 caches, shared DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 550M/s; value labels 38, 78, 92; annotations: Performance guarantee, 30 WCET, deadline]
31Experiment 1
[Figure: Private L2, Prefetch=off; T1 and T2 on Core1 and Core2 (private L2 caches, shared DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 550M/s; value labels 38, 78, 92; annotation: deadline]
32Experiment 1
[Figure: Private L2, Prefetch=off; T1 and T2 on Core1 and Core2 (private L2 caches, shared DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 550M/s; value labels 38, 78, 92; annotation: deadline]
33Experiment 1
[Figure: Private L2, Prefetch=off; T1 and T2 on Core1 and Core2 (private L2 caches, shared DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 550M/s; value labels 38, 78, 92; annotation: Performance target]
34Experiment 2: Prefetcher
[Figure: Private L2, Prefetch=ON; T1 and T2 on Core1 and Core2 (private L2 caches, shared DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 550M/s; value labels 60, 33, 82, 94; annotations: Not enough reserv., More slowdown, Deadline violation, deadline]
35Experiment 2-2
[Figure: Private L2, Prefetch=ON; T1 and T2 on Core1 and Core2 (private L2 caches, shared DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 900M/s and 200M/s; value labels 60, 14, 69, 94; annotations: Enough reserv., No deadline violation]
36Experiment 3: Shared Cache
[Figure: Shared L2, Prefetch=ON; T1 and T2 on Core1 and Core2 (shared L2 cache, DRAM) under MemGuard (Reserve only), MemGuard (reclaim+share), and Original; reservation labels 900M/s and 200M/s; value labels 108, 11, 63, 92; annotations: Even more slowdown, Minimum reserv., No deadline violation]