Improving Real-Time Performance on Multicore Platforms Using MemGuard

Multicore processors are already used in servers, desktops, and mobile devices. Soon more RT/embedded systems will use multicore as well.

Learn more at: http://www.ittc.ku.edu

1
Improving Real-Time Performance on Multicore
Platforms Using MemGuard
  • University of Kansas
  • Dr. Heechul Yun
  • 10/28/2013

2
Multicore
Mobile
RT/Embedded
Desktop
Server
3
Challenges: Shared Resources
[Figure: tasks T1 and T2 on a unicore CPU above the memory hierarchy; label: Performance Impact]
4
Case Study
  • HRT
    • Synthetic real-time video capture
    • P = 20ms, D = 13ms
    • Cache-insensitive
  • X-server
    • Scrolling text on a gnome-terminal
  • Hardware platform
    • Intel Xeon 3530
    • 8MB shared L3 cache
    • 4GB DDR3 1333MHz DIMM (1ch)
    • CPU cores are isolated

5
HRT Time Distribution
Solo 99pct: 10.2ms
  • 28 deadline violations
  • Due to contention in DRAM
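
The two metrics on this slide (99th-percentile time and deadline violations) can be computed from a list of per-job times. The sketch below is only illustrative: the sample values are made up, the 13ms deadline is taken from the case-study slide, and the nearest-rank percentile definition is an assumption (the slides do not say which definition was used).

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def count_deadline_violations(samples, deadline_ms):
    """Number of jobs whose measured time exceeds the deadline."""
    return sum(1 for t in samples if t > deadline_ms)


# Hypothetical per-job times in milliseconds (not measured data).
times_ms = [9.8, 10.0, 10.1, 10.2, 13.5, 9.9, 14.1, 10.0, 10.3, 9.7]

p99 = percentile_nearest_rank(times_ms, 99)
misses = count_deadline_violations(times_ms, deadline_ms=13.0)
print(f"99pct = {p99}ms, deadline violations = {misses}")
```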

6
Outline
  • Motivation
  • Background
    • DRAM basics
    • Worst-case memory performance
  • MemGuard [RTAS'13]
  • Improving Real-Time Performance with MemGuard

7
Background: DRAM Organization
[Figure: Core1 to Core4 share the L3 cache and the memory controller (MC), which connects to a DRAM DIMM with Banks 1 to 4]
  • Have multiple banks
  • Different banks can be accessed in parallel

8
Best-case
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM with Banks 1 to 4; label: Fast]
  • Peak: 10.6 GB/s
    • DDR3 1333MHz

9
Best-case
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM with Banks 1 to 4; label: Fast]
  • Peak: 10.6 GB/s
    • DDR3 1333MHz
  • Out-of-order processors
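
The peak figure can be checked with simple arithmetic. This is a minimal sketch assuming a standard 64-bit DIMM data bus and a single channel (the case-study platform lists "1ch"); the function name is illustrative.

```python
def peak_bandwidth_gb_per_s(transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak: transfers per second x bytes per transfer x channels."""
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_s * bytes_per_transfer * channels / 1e9


# DDR3 1333MHz: about 1333 million transfers per second on an 8-byte bus.
ddr3_1333 = peak_bandwidth_gb_per_s(1333e6)
print(f"{ddr3_1333:.3f} GB/s")  # roughly the 10.6 GB/s quoted on the slide
```

This is an upper bound only; it is reached when requests are spread across banks and hit open rows.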

10
Most cases
[Figure: Core1 to Core4, L3, memory controller (MC), DRAM DIMM with Banks 1 to 4; label: Mess]
  • Performance ??

(*) Intel 64 and IA-32 Architectures Optimization Reference Manual
11
Worst-case
Slow
  • 1-bank b/w
    • Less than peak b/w
    • How much?

(*) Intel 64 and IA-32 Architectures Optimization Reference Manual
12
Background: DRAM Operation
[Figure: Bank 1 holding Rows 1 to 5 and a row buffer; READ (Bank 1, Row 3, Col 7) brings Row 3 into the row buffer and selects Col 7]
  • Stateful per-bank access time
    • Row miss: 19 cycles
    • Row hit: 9 cycles

(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL) latency setting
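
The stateful access time can be modeled with a per-bank "open row" variable. The sketch below uses the 19-cycle and 9-cycle figures from this slide; treating the very first access to an idle bank as a row miss is a simplifying assumption, and the class and function names are illustrative.

```python
ROW_HIT_CYCLES = 9    # requested row is already in the row buffer
ROW_MISS_CYCLES = 19  # another row must be closed and the new one opened


class Bank:
    """One DRAM bank; the cost of an access depends on the currently open row."""

    def __init__(self):
        self.open_row = None  # assumption: an idle bank is charged as a miss

    def access(self, row):
        if row == self.open_row:
            return ROW_HIT_CYCLES
        self.open_row = row
        return ROW_MISS_CYCLES


def total_cycles(rows):
    """Total cycles to serve a sequence of row accesses on a single bank."""
    bank = Bank()
    return sum(bank.access(row) for row in rows)


same_row = total_cycles([3, 3, 3])              # 19 + 9 + 9
alternating = total_cycles([1, 2, 3, 4, 1, 2])  # every access is a row miss
print(same_row, alternating)
```

The same number of requests costs very different time depending on the order in which rows are touched, which is why per-bank state makes latency hard to bound.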
13
Real Worst-case
Request order (over time): Row 1, Row 2, Row 3, Row 4, Row 1, Row 2
1 bank, always row miss → 1.2GB/s
Each core: ¼ x 1.2GB/s = 300MB/s?
(*) Intel 64 and IA-32 Architectures Optimization Reference Manual
14
Background: Memory Controller (MC)
Bruce Jacob et al., Memory Systems: Cache, DRAM, Disk, Fig. 13.1.
  • Request queue(s)
  • Not fair (open-row first → re-ordering)
  • Unpredictable queuing delay
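
To illustrate why open-row-first re-ordering is unfair, here is a minimal sketch of such a policy for a single bank. It assumes all requests are already queued and ignores timing; it is not the scheduling logic of any particular memory controller, and the names are illustrative.

```python
def open_row_first_order(requests, open_row=None):
    """Serve queued (core, row) requests to one bank, preferring row hits.

    The oldest request that hits the open row goes first; if none hits,
    the oldest request overall is served and its row becomes the open row.
    """
    pending = list(requests)
    served = []
    while pending:
        hit_index = next(
            (i for i, (_, row) in enumerate(pending) if row == open_row), None
        )
        index = hit_index if hit_index is not None else 0
        core, row = pending.pop(index)
        open_row = row
        served.append((core, row))
    return served


# Core B's request arrives second but is served last, behind core A's row hits.
arrivals = [("A", 1), ("B", 2), ("A", 1), ("A", 1)]
order = open_row_first_order(arrivals)
print(order)
```

The queuing delay seen by core B depends on how many row hits the other cores keep supplying, which is what makes it hard to predict.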

15
Challenges for Real-Time Systems
  • Multiple parallel resources (banks)
  • Stateful bank access latency
  • Queuing delay
  • Unpredictable memory performance

16
MemGuard [RTAS'13]
[Figure: MemGuard in the operating system, with a reclaim manager and one BW regulator per core (0.6GB/s, 0.2GB/s, 0.2GB/s, 0.2GB/s); below it, a multicore processor with Core1 to Core4, each with a PMC, connected to the memory controller and the DRAM DIMM]
  • Goal: guarantee minimum memory b/w for each core
  • How: b/w reservation + best-effort sharing
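
The per-core reservations in the figure add up to 1.2GB/s, the worst-case DRAM bandwidth estimated earlier. A minimal admission check along those lines could look as follows; the function name and the mapping of values to core labels are assumptions made for illustration.

```python
GUARANTEED_BW_MBPS = 1200  # worst-case DRAM bandwidth from the earlier slide


def reservations_fit(reservations_mbps, guaranteed_mbps=GUARANTEED_BW_MBPS):
    """True if the per-core reservations do not exceed the guaranteed bandwidth."""
    return sum(reservations_mbps.values()) <= guaranteed_mbps


# Values from the figure, in MB/s (integers avoid floating-point rounding).
slide_example = {"Core1": 600, "Core2": 200, "Core3": 200, "Core4": 200}
print(reservations_fit(slide_example))                     # fits exactly
print(reservations_fit({"Core1": 900, "Core2": 400}))      # over-committed
```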

17
Reservation
  • Idea
    • Scheduler regulates per-core memory b/w using h/w counters
    • Period = 1 scheduler tick (e.g., 1ms)

[Figure: budget and core activity over time (0, 1ms, 2ms); when the budget runs out, an RT idle task is scheduled, and it is suspended when the next period starts. Legend: computation, memory fetch]
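
The reservation mechanism can be sketched as a per-core budget that is refilled every period and decremented on each counted memory fetch. This is a simplified simulation, not the MemGuard kernel code: the 64-byte cache-line size, the class name, and the method names are assumptions, and the real mechanism relies on the hardware counter and the scheduler rather than on explicit calls.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted memory fetch moves one cache line


def budget_in_lines(bandwidth_mbps, period_us=1000):
    """Cache-line fetches allowed per period (MB/s x microseconds = bytes)."""
    bytes_per_period = bandwidth_mbps * period_us
    return bytes_per_period // CACHE_LINE_BYTES


class BandwidthRegulator:
    """Per-core regulator: throttle the core once its budget is used up."""

    def __init__(self, bandwidth_mbps, period_us=1000):
        self.budget = budget_in_lines(bandwidth_mbps, period_us)
        self.remaining = self.budget
        self.throttled = False

    def on_memory_fetch(self):
        """Return True if the fetch is served, False if the core is throttled."""
        if self.throttled:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.throttled = True  # stands in for scheduling the RT idle task
        return True

    def on_period_start(self):
        """Refill the budget; stands in for suspending the RT idle task."""
        self.remaining = self.budget
        self.throttled = False


regulator = BandwidthRegulator(bandwidth_mbps=300)
served = sum(regulator.on_memory_fetch() for _ in range(6000))
print(served, regulator.throttled)  # the core stops after its budget is spent
```

With a 1ms period, a 300MB/s reservation corresponds to 4687 cache-line fetches per period under these assumptions.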
18
Reservation
  •  

19
Best-Effort Sharing
[Figure: timeline over time (ms) from 0 to 2 for Core0 (900MB/s) and Core1 (300MB/s), with "throttled" and "reschedule" events marked]
  • Spare Sharing [RTAS'13]
  • Proportional Sharing [unpublished TR]

20
Case Study
  • HRT
    • Synthetic real-time video capture
    • P = 20ms, D = 13ms
    • Cache-insensitive
  • X-server
    • Scrolling text on a gnome-terminal
  • Hardware platform
    • Intel Xeon 3530
    • 8MB shared cache
    • 4GB DDR3 1333MHz DIMM

21
w/o MemGuard
HRT (solo)
  • HRT's 99pct: 10.2ms
HRT (w/ Xserver)
  • HRT's 99pct: 14.3ms, X's CPU util: 78%
22
MemGuard reserve only (HRT = 900MB/s, X = 300MB/s)
HRT (solo)
  • HRT's 99pct: 10.7ms
HRT (w/ Xserver)
  • HRT's 99pct: 11.2ms, X's CPU util: 4%
23
MemGuard reserve (HRT = 900MB/s, X = 300MB/s) + best-effort sharing
HRT (solo)
  • HRT's 99pct: 10.7ms
HRT (w/ Xserver)
  • HRT's 99pct: 10.7ms, X's CPU util: 48%
24
MemGuard reserve (HRT = 600MB/s, X = 600MB/s) + best-effort sharing
HRT (solo)
  • HRT's 99pct: 10.9ms
HRT (w/ Xserver)
  • HRT's 99pct: 12.1ms, X's CPU util: 61%
25
Real-Time Performance Improvement
HRT
X-server
  • Using MemGuard, we can achieve
    • No deadline miss for HRT
    • Good X-server performance

26
Conclusion
  • Unpredictable memory performance
    • Multiple resources (banks), per-bank state, unpredictable queueing delay
  • MemGuard
    • Guarantee minimum memory bandwidth for each core
    • b/w reservation (guaranteed part) + best-effort sharing
  • Case study
    • On Intel Xeon multicore platform, using HRT + X-server
    • MemGuard can improve real-time performance efficiently
  • Limitations and Future Work
    • Coarse-grained (one OS tick) enforcement
    • Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS'14)

https://github.com/heechul/memguard
27
Thank you.
28
Evaluation on Intel Core2
  • T1: Synthetic video capture task (HRT)
    • Period = 20ms (50Hz)
    • Deadline = 14ms
    • Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods)
  • T2: Xserver, update screen (SRT)
    • Metric: CPU utilization
    • Higher CPU utilization → faster screen update
  • Platform
    • Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers
    • PC6400 DDR2 DRAM DIMM x 1
  • Three platform configurations
    • Exp1: Private L2, Prefetch = off
    • Exp2: Private L2, Prefetch = on
    • Exp3: Shared L2, Prefetch = on

[Figure: Intel Core2Quad based PC with Core0 to Core3, two L2 caches (with prefetchers), and DRAM]
29
Experiment 1
Performance guarantee
[Figure: Private L2, Prefetch = off; T1 on Core1 and T2 on Core2, each with its own L2, sharing DRAM. Configurations: MemGuard (reserve only), MemGuard (reclaim + share), Original. Per-core label under MemGuard: 550MB/s. Chart labels: deadline, 38, 78, 92]
30
Experiment 1
Performance guarantee
[Figure: same chart as the previous slide (Private L2, Prefetch = off; MemGuard (reserve only), MemGuard (reclaim + share), Original; 550MB/s per core; labels: deadline, 38, 78, 92), with the added annotation "30 WCET"]
31
Experiment 1
[Figure: Private L2, Prefetch = off; T1 on Core1 and T2 on Core2, each with its own L2, sharing DRAM. Configurations: MemGuard (reserve only), MemGuard (reclaim + share), Original. Per-core label: 550MB/s. Chart labels: deadline, 38, 78, 92]
32
Experiment 1
[Figure: Private L2, Prefetch = off; T1 on Core1 and T2 on Core2, each with its own L2, sharing DRAM. Configurations: MemGuard (reserve only), MemGuard (reclaim + share), Original. Per-core label under MemGuard: 550MB/s. Chart labels: deadline, 38, 78, 92]
33
Experiment 1
Performance target
[Figure: Private L2, Prefetch = off; T1 on Core1 and T2 on Core2, each with its own L2, sharing DRAM. Configurations: MemGuard (reserve only), MemGuard (reclaim + share), Original. Per-core label under MemGuard: 550MB/s. Chart labels: 38, 78, 92]
34
Experiment 2: Prefetcher
  • Not enough reserv.
  • More slowdown
  • Deadline violation
[Figure: Private L2, Prefetch = on; T1 on Core1 and T2 on Core2, each with its own L2, sharing DRAM. Configurations: MemGuard (reserve only), MemGuard (reclaim + share), Original. Per-core label under MemGuard: 550MB/s. Chart labels: deadline, 60, 33, 82, 94]
35
Experiment 2-2
  • Enough reserv.
  • No deadline violation
[Figure: Private L2, Prefetch = on; T1 on Core1 and T2 on Core2, each with its own L2, sharing DRAM. Configurations: MemGuard (reserve only), MemGuard (reclaim + share), Original. Per-core labels under MemGuard: 900MB/s and 200MB/s. Chart labels: 60, 14, 69, 94]
36
Experiment 3: Shared Cache
  • Even more slowdown
  • Minimum reserv.
  • No deadline violation
[Figure: Shared L2, Prefetch = on; T1 on Core1 and T2 on Core2 sharing one L2 and DRAM. Configurations: MemGuard (reserve only), MemGuard (reclaim + share), Original. Per-core labels under MemGuard: 900MB/s and 200MB/s. Chart labels: 108, 11, 63, 92]