Title: ECE 669 Parallel Computer Architecture Lecture 20 Evaluation and Message Passing
1 ECE 669 Parallel Computer Architecture
Lecture 20: Evaluation and Message Passing
2 Performance Evaluation
- Why?
  - Evaluate tradeoffs
  - Estimate machine performance
  - Measure application behavior
- How?
  - Ask an expert (hire a consultant?)
  - Measure existing machines (What?!)
  - Build simulators
  - Analytical models
  - Hybrids: combinations of measurement, simulation, and analytical models
3 In the lab
[Figure: parallel traces feeding a network model]
4 Example
- 1. Trace simulation, awk filter
- 2. Compute network parameters m, B
- 3. Compute processor utilization
  - Given m, B and kd, n, N
  - Derive U.

Event        Prob     out-msg  out-siz (flits)  in-msg  in-siz (flits)
read miss    1.2507   yes      4                yes     (see write mode formula)
other msgs   2.8483   yes      4                yes     4
...

in-siz: 4 (block size/flit size)
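Step 2 of the example can be sketched as follows. This is a minimal illustration, assuming that m is the message rate (messages per reference) and B the average message size in flits, both obtained by weighting each event's messages by its probability. The function name `network_parameters`, the field names, and the sample values are hypothetical and are not taken from the lecture.

```python
def network_parameters(events):
    """Return (m, B) for a list of event records.

    Each record is a dict with (hypothetical) keys:
      prob      -- probability of the event per reference
      out_flits -- size of the outgoing message in flits (0 if none)
      in_flits  -- size of the incoming message in flits (0 if none)
    """
    msgs = 0.0   # expected number of messages per reference
    flits = 0.0  # expected number of flits per reference
    for e in events:
        for size in (e["out_flits"], e["in_flits"]):
            if size > 0:
                msgs += e["prob"]
                flits += e["prob"] * size
    m = msgs
    B = flits / msgs if msgs else 0.0
    return m, B


# Hypothetical event table: probabilities and sizes chosen for illustration only.
sample_events = [
    {"prob": 0.1, "out_flits": 4, "in_flits": 12},
    {"prob": 0.1, "out_flits": 4, "in_flits": 4},
]
m, B = network_parameters(sample_events)
```

Deriving U in step 3 additionally needs kd, n, and N together with a network model, which this sketch does not attempt.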
5 Evaluation
- Barriers implemented using distributed trees
- Read-only sharing marked
[Figure: two plots, Weather and Speech, of processor utilization U (0.0 to 1.0) versus pointers; LimitLESS is marked]
6 Evaluation
[Figure: processor, memory, and interconnection network]
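7 Full system simulation (coupled)
[Figure: simulator stack, one simulator per machine component]
- Processor: compiled application running on a processor simulator
- Processor simulator to cache and memory systems simulator: memory requests, synchronization requests
- Cache and memory systems simulator to processor simulator: wait, traps
- Memory: cache and memory systems simulator
- Cache and memory systems simulator to interconnection network simulator: network requests
- Interconnection network simulator to cache and memory systems simulator: acknowledgements, responses
- Interconnection network: interconnection network simulator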
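8 Trace-driven simulation (coupled)
E.g., many parallel traces exist from M.I.T.
[Figure: simulator stack, one simulator per machine component]
- Processor: sequential address traces feeding a trace scheduler
- Trace scheduler to cache and memory systems simulator: address trace
- Cache and memory systems simulator to trace scheduler: port ready
- Memory: cache and memory systems simulator
- Cache and memory systems simulator to network simulator: network requests
- Network simulator to cache and memory systems simulator: acknowledgements, responses
- Interconnection network: network simulator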
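The "port ready" feedback in the coupled trace-driven setup can be sketched as follows. This is a minimal illustration, assuming one outstanding reference per processor; the function names `run_coupled` and `toy_latency` and the latency values are hypothetical stand-ins for the cache, memory, and network simulators.

```python
import heapq


def run_coupled(traces, latency):
    """Issue references from per-processor traces under feedback.

    traces  -- one list of addresses per processor
    latency -- function addr -> cycles needed to service that reference
               (stand-in for the memory and network simulators)
    Returns the issue order as (cycle, processor, address) tuples.
    """
    issued = []
    # Heap entries: (cycle at which the port is ready, processor, next index)
    ready = [(0, p, 0) for p in range(len(traces)) if traces[p]]
    heapq.heapify(ready)
    while ready:
        cycle, p, i = heapq.heappop(ready)
        addr = traces[p][i]
        issued.append((cycle, p, addr))
        done = cycle + latency(addr)  # feedback: port ready again at 'done'
        if i + 1 < len(traces[p]):
            heapq.heappush(ready, (done, p, i + 1))
    return issued


def toy_latency(addr):
    # Hypothetical rule: addresses >= 100 are remote and slower.
    return 10 if addr >= 100 else 2


order = run_coupled([[0, 1], [100]], toy_latency)
```

Because each processor's next reference is held until its previous one completes, the interleaving across processors follows the simulated machine's timing rather than the timing of the machine on which the trace was recorded.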
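9 Hybrid Network Model
[Figure: simulator stack with an analytical model in place of the network simulator]
- Processor: compiled application running on a processor simulator
- Processor simulator to cache and memory system simulator: parallel address trace
- Cache and memory system simulator to processor simulator: wait
- Memory: cache and memory system simulator
- Cache and memory system simulator to network model: requests (time window average)
- Network model to cache and memory system simulator: response, latency
- Interconnection network: network model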
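The time-window coupling between the memory simulator and the network model can be sketched as follows. This is a minimal illustration: the request rate is averaged over fixed windows and handed to a latency function. The contention formula used here (base latency divided by one minus utilization) is a placeholder assumption, not the lecture's network model, and all names and parameters are hypothetical.

```python
def windowed_latencies(request_times, window, base_latency,
                       flits_per_msg, capacity):
    """Return one latency estimate per time window.

    request_times -- cycles at which network requests were issued
    window        -- window length in cycles
    base_latency  -- latency of an unloaded network, in cycles
    flits_per_msg -- average message size in flits
    capacity      -- flits per cycle the network can accept
    """
    if not request_times:
        return []
    n_windows = int(max(request_times) // window) + 1
    counts = [0] * n_windows
    for t in request_times:
        counts[int(t // window)] += 1
    latencies = []
    for count in counts:
        rate = count / window  # time-window average request rate
        # Placeholder contention model; utilization is capped below 1.
        utilization = min(rate * flits_per_msg / capacity, 0.99)
        latencies.append(base_latency / (1.0 - utilization))
    return latencies


lat = windowed_latencies([0, 1, 2, 3, 10], window=10, base_latency=10,
                         flits_per_msg=1, capacity=1)
```

The design point the sketch tries to show is that the network is never simulated message by message; only an averaged rate crosses the boundary, and a latency comes back.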
10 Many hybrids possible
Traces + network model
[Figure: simulator stack driven by traces, with an analytical network model]
- Processor: parallel address traces
- Memory: cache and memory system simulator, producing event counts
- Cache and memory system simulator to analytical network model: request rate, message size
- Interconnection network: analytical network model
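11 Hybrid - Network model
[Figure: direct execution on top of a memory simulator and a network model]
- Processor: compiled application under direct execution (round robin/process; switch on mem. req.)
- Direct execution to cache and memory system simulator: memory requests
- Cache and memory system simulator to direct execution: switch
- Memory: cache and memory system simulator
- Cache and memory system simulator to network model: network requests, time window ave
- Network model to cache and memory system simulator: responses, latency
- Interconnection network: network model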
12 Trace-driven (decoupled)
[Figure: simulator stack with no feedback path to the trace]
- Processor: address trace (no trace scheduler!)
- Memory: cache and memory system simulator (no feedback)
- Cache and memory system simulator to network simulator: network requests
- Network simulator to cache and memory system simulator: responses
- Interconnection network: network simulator
- Synchronization constraints may be violated
- Garbage!
13 Message passing
- Bulk transfers
- Complex synchronization semantics
  - more complex protocols
  - more complex action
- Synchronous
  - Send completes after matching recv and source data sent
  - Receive completes after data transfer complete from matching send
- Asynchronous
  - Send completes after send buffer may be reused
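The two completion rules can be contrasted with a toy channel. This is a minimal sketch, assuming a single in-process queue stands in for the network; the class `Channel` and its method names are hypothetical and do not correspond to any real message-passing library.

```python
import queue
import threading


class Channel:
    """Toy one-way channel between two threads (hypothetical API)."""

    def __init__(self):
        self._q = queue.Queue()

    def send_async(self, data):
        # Asynchronous: copy the data, so the send completes as soon as
        # the caller's buffer may be reused.
        self._q.put((bytes(data), None))

    def send_sync(self, data):
        # Synchronous: the send completes only after the matching
        # receive has taken the data.
        done = threading.Event()
        self._q.put((bytes(data), done))
        done.wait()

    def recv(self):
        data, done = self._q.get()
        if done is not None:
            done.set()  # release the blocked synchronous sender
        return data


ch = Channel()
buf = bytearray(b"abc")
ch.send_async(buf)
buf[0] = ord("x")      # safe: the send has already completed
first = ch.recv()

sender = threading.Thread(target=ch.send_sync, args=(b"def",))
sender.start()         # blocks inside send_sync until recv() runs
second = ch.recv()
sender.join()
```

Note that the asynchronous version needs storage inside the messaging layer (the queue holds a copy), while the synchronous version only needs it until the receiver arrives.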
14 Synchronous Message Passing
Processor Action?
- Constrained programming model.
- Deterministic! What happens when threads added?
- Destination contention very limited.
- User/System boundary?
15 Asynchronous Message Passing: Optimistic
- More powerful programming model
- Wildcard receive => non-deterministic
- Storage required within msg layer?
16 Asynchronous Message Passing: Conservative
[Figure: timeline between Source and Destination, time running downward]
- (1) Initiate send
- (2) Address translation on P
- (3) Local/remote check
- (4) Send-ready request
- (5) Remote check for posted receive (assume fail); record send-ready
- (6) Receive-ready request
- (7) Bulk data reply: Source VA to Dest VA or ID
Figure labels: Send P, local VA, len; Send-rdy req; Return and compute; Tag check; Recv P, local VA, len; Recv-rdy req; Data-xfer reply
- Where is the buffering?
- Contention control? Receiver initiated protocol?
- Short message optimizations
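The destination side of the conservative protocol (steps 4 to 7) can be sketched as a small state machine. This is a minimal illustration, assuming messages are matched by a tag and that only a descriptor, never the data, is stored before the receive is posted; the class `Destination` and its method names are hypothetical.

```python
class Destination:
    """Toy destination node for the conservative protocol."""

    def __init__(self):
        self.pending_sends = set()  # recorded send-ready tags (descriptors)
        self.posted = {}            # tag -> posted receive buffer

    def on_send_ready(self, tag):
        """Steps 4-5: check for a posted receive; record the send if none."""
        if tag in self.posted:
            return "recv-ready"
        self.pending_sends.add(tag)
        return None

    def post_receive(self, tag, buf):
        """Tag check; step 6 if a matching send-ready was recorded."""
        self.posted[tag] = buf
        if tag in self.pending_sends:
            self.pending_sends.discard(tag)
            return "recv-ready"
        return None

    def on_bulk_data(self, tag, data):
        """Step 7: data lands directly in the posted receive buffer."""
        self.posted.pop(tag).extend(data)


dst = Destination()
user_buf = []
r1 = dst.on_send_ready("t1")           # receive not yet posted: record it
r2 = dst.post_receive("t1", user_buf)  # matches the recorded send-ready
dst.on_bulk_data("t1", [1, 2, 3])
```

In this sketch the only buffering at the destination is the set of recorded send-ready descriptors; the bulk data stays at the source until the receive-ready request arrives.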
17 Key Features of Message Passing Abstraction
- Source knows send data address, dest. knows receive data address
  - after handshake they both know both
- Arbitrary storage outside the local address spaces
  - may post many sends before any receives
  - non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
  - fine print says these are limited too
- Fundamentally a 3-phase transaction
  - includes a request / response
  - can use optimistic 1-phase in limited "safe" cases
  - credit scheme
18 Active Messages
- User-level analog of network transaction
  - transfer data packet and invoke handler to extract it from the network and integrate with on-going computation
- Request/Reply
- Event notification: interrupts, polling, events?
- May also perform memory-to-memory transfer
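The request/reply pattern with handlers can be sketched as follows. This is a minimal illustration, assuming polling as the notification mechanism and an in-process list as the network; the class `Node`, the handler names, and the message layout are hypothetical.

```python
class Node:
    """Toy node that runs a handler for every arriving message."""

    def __init__(self):
        self.handlers = {}  # handler name -> function
        self.inbox = []     # messages that have arrived but not been handled
        self.memory = {}    # local data the handlers operate on

    def register(self, name, fn):
        self.handlers[name] = fn

    def deliver(self, name, *args):
        # Stand-in for the network depositing a message at this node.
        self.inbox.append((name, args))

    def poll(self):
        """Extract each message and invoke its handler; return the count."""
        handled = 0
        while self.inbox:
            name, args = self.inbox.pop(0)
            self.handlers[name](self, *args)
            handled += 1
        return handled


def get_handler(node, requester, key):
    # Request handler: read local memory and send a reply message.
    requester.deliver("reply", key, node.memory.get(key))


def reply_handler(node, key, value):
    # Reply handler: integrate the value into the ongoing computation.
    node.memory[key] = value


a, b = Node(), Node()
b.memory["x"] = 42
b.register("get", get_handler)
a.register("reply", reply_handler)
b.deliver("get", a, "x")  # a's request arrives at b
b.poll()                  # b runs get_handler, which replies to a
a.poll()                  # a runs reply_handler
```

Swapping `poll` for an interrupt-style callback would change only when handlers run, not what they do.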
19 Common Challenges
- Input buffer overflow
  - N-1 queue over-commitment => must slow sources
  - reserve space per source (credit)
  - when available for reuse?
  - Ack or higher level
- Refuse input when full
  - backpressure in reliable network
  - tree saturation
  - deadlock free
  - what happens to traffic not bound for congested dest?
- Reserve ack back channel
- drop packets
- Utilize higher-level semantics of programming model
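Per-source credits, one of the remedies for input buffer overflow listed above, can be sketched as follows. This is a minimal illustration, assuming the receiver returns a credit (an ack) each time it frees a buffer slot; the class and method names are hypothetical.

```python
class Receiver:
    """Toy receiver with an input queue."""

    def __init__(self):
        self.queue = []

    def accept(self, sender, msg):
        self.queue.append((sender, msg))

    def consume(self):
        sender, msg = self.queue.pop(0)
        sender.on_ack(self)  # slot is free again: return one credit
        return msg


class CreditedSender:
    """Toy source that may only send while it holds credits."""

    def __init__(self, credits):
        self.credits = credits  # buffer slots reserved for this source
        self.backlog = []       # messages held back at the source

    def send(self, receiver, msg):
        if self.credits == 0:
            self.backlog.append(msg)  # the source is slowed, not the network
            return False
        self.credits -= 1
        receiver.accept(self, msg)
        return True

    def on_ack(self, receiver):
        self.credits += 1
        if self.backlog:
            self.send(receiver, self.backlog.pop(0))


rx = Receiver()
tx = CreditedSender(credits=2)
results = [tx.send(rx, i) for i in range(3)]  # third send has no credit
queued_before = len(rx.queue)
first_msg = rx.consume()                      # frees a slot, releases backlog
```

With this scheme the receiver's queue for a given source never grows beyond that source's credit count, at the cost of stalling the source when credits run out.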
20 Summary
- Evaluation important to understand intermediate messages in cache protocol
- Message sizes may vary based on function
- Two main types of message passing protocols
  - Synchronous and asynchronous
- Active messages involve remote operations
- Message techniques depend on network reliability