Title: Analysis of Time Warp on a 32,768 processor IBM Blue Gene/L supercomputer
1 Analysis of Time Warp on a 32,768 processor IBM
Blue Gene/L supercomputer
- Akintayo Holder and Christopher Carothers
- Department of Computer Science
- Rensselaer Polytechnic Institute
2 Outline
- Motivation
- Blue Gene
- Time Warp
- PHOLD
- Implementation
- Performance
- Conclusion
- Please feel free to ask questions at any time.
3 Discrete Event Simulation
- Discrete event simulation is used for complex models that cannot be easily analyzed or represented by equations.
- Discrete event simulation is commonly used for simulating network protocols like TCP; simulating TCP on a large network may require a powerful computer.
- In the logical process (LP) model, each LP is a concurrent task, and communication is performed by exchanging events.
- The model also defines processing elements (PEs), which usually correspond to actual physical processors.
4 Motivation
- Why Parallel Discrete-Event Simulation (DES)?
  - Large-scale systems are difficult to understand
  - Analytical models are often constrained
- Parallel DES offers
  - Dramatically shorter model execution times
  - Prediction of future "what-if" systems performance
  - Potential for real-time decision support
    - Minutes instead of days
    - Analysis can be done right away
- Example models: national air space (NAS), ISP backbone(s), distributed content caches, next-generation supercomputer systems
5 Example: Movies over the Internet
- Suppose we want to model 1 million home ISP customers downloading a 2 GB movie
- How long to compute?
  - Assume a nominal 100K ev/sec sequential simulator
  - Assume on average each packet takes 8 hops
  - 2 GB movies yield 2 trillion 1 KB data packets
  - At 8 hops, this yields 16 trillion events
  - 16 trillion events @ 100K ev/sec
  - Over 1,850 days!!! Or
  - 5 years!!!
- Need massively parallel simulation to make this tractable
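The arithmetic above can be checked with a short back-of-envelope sketch (all numbers taken from the slide):

```python
# Back-of-envelope check of the slide's event-count arithmetic.
customers = 1_000_000
movie_bytes = 2 * 10**9          # 2 GB movie
packet_bytes = 1000              # 1 KB data packets
hops = 8                         # average hops per packet
seq_rate = 100_000               # ev/sec for a sequential simulator

packets = customers * movie_bytes // packet_bytes   # 2 trillion packets
events = packets * hops                             # 16 trillion events
seconds = events / seq_rate
days = seconds / 86_400
years = days / 365.25
print(f"{events:.2e} events -> {days:,.0f} days (~{years:.1f} years)")
```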
6 Overview of the Blue Gene/L
- Blue Gene/L is an ultra-large-scale supercomputer, with an architecture that balances the computing power of the processors against the delivery speed of the network.
Each processing node consists of two low-power 700 MHz PowerPC cores with 1 GB of memory. The Blue Gene/L has five networks: a point-to-point network with an end-to-end delay of 4 microseconds, a barrier network that takes 1.5 microseconds, and a collective network that performs a global operation in 5 microseconds.
7 Time Warp
- Local Control Mechanism
- Global Control Mechanism
8 Parallel Discrete-Event Simulation Via Time Warp
Local Control Mechanism: error detection and rollback
(1) undo state Δs (2) cancel sent events
Global Control Mechanism: compute Global Virtual Time (GVT)
collect versions of state / events; perform I/O operations that are < GVT
[Diagram: virtual-time axes for LP 1, LP 2, and LP 3 under each mechanism, with GVT marked. Legend: unprocessed event, processed event, straggler event, committed event]
9 Time Warp Local Control Mechanism
Each LP processes events in time stamp order, like a sequential simulator, except that we (1) do NOT discard processed events and (2) add a rollback mechanism.
- Adding rollback
  - a message arriving in the LP's past initiates rollback
  - to roll back an event computation we must undo
    - changes to state variables performed by the event
    - message sends
10 Implementing Rollback: State Save
- The state of the logical process is stored before each event is executed.
- May be very memory intensive.
11 Implementing Rollback: Incremental State Save
- Only changes to the state are stored.
- Consumes less memory than state saves.
- Complicates the rollback process.
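A minimal sketch of incremental state saving (illustrative only; names are invented): each event records the old value of every variable it writes, and rollback replays those deltas in reverse.

```python
# Incremental state saving sketch: log only (key, old_value) deltas per
# event, and undo them in reverse order on rollback.
class IncrementalState:
    def __init__(self):
        self.vars = {"x": 0, "y": 0}
        self.log = []                    # one journal of deltas per event

    def begin_event(self):
        self.log.append([])

    def write(self, key, value):
        self.log[-1].append((key, self.vars[key]))  # save old value only
        self.vars[key] = value

    def rollback_event(self):
        for key, old in reversed(self.log.pop()):   # undo in reverse order
            self.vars[key] = old

s = IncrementalState()
s.begin_event(); s.write("x", 10); s.write("y", 20)
s.begin_event(); s.write("x", 99)
s.rollback_event()                   # undoes the second event only
print(s.vars)                        # → {'x': 10, 'y': 20}
```

This stores one small tuple per write instead of a full state copy per event, which is why it consumes less memory but makes rollback more involved.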
12 Rollback Via Reversible Computation
- Use Reversible Computation (RC)
  - Undo by executing reverse code
- Delivers better performance
  - Negligible overhead for forward computation
  - Significantly lower memory utilization
    - control state << data state
- In DES models, many of the operations are constructive
  - e.g., ++, --, etc.
- Size of control state < size of data state
  - i.e., b bitfield << delays array
- Perfectly reversible high-level operations
  - gleaned from irreversible smaller operations
  - e.g., random number generation
- ROSS: Rensselaer Optimistic Simulation System
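A toy sketch of the idea (not ROSS code; the handlers and state fields are invented): constructive operations like += need no saved data because an exact inverse exists, while a destructive branch saves only a control bit — the "control state << data state" point above.

```python
# Reverse computation sketch: each forward handler has a reverse handler
# that undoes it exactly, instead of saving state copies.

def forward(state, ev):
    state["bit"] = 1 if state["busy"] else 0   # destructive branch: save 1 bit
    if not state["busy"]:
        state["busy"] = True
    state["served"] += 1                       # constructive: inverse is -=

def reverse(state, ev):
    state["served"] -= 1                       # undo in reverse order
    if state["bit"] == 0:                      # replay the branch decision
        state["busy"] = False

state = {"busy": False, "served": 0, "bit": 0}
forward(state, None)
reverse(state, None)
print(state)   # → {'busy': False, 'served': 0, 'bit': 0} (fully restored)
```

In a real kernel the control bit would be stored per event rather than in the shared state; one bit per event is still far cheaper than a full state snapshot.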
13 RC Applications
- PDES applications include
  - Wireless telephone networks
  - Distributed content caches
  - Large-scale Internet models
    - TCP over the AT&T backbone
    - Leverages RC swaps
  - Hodgkin-Huxley neuron models
  - Plasma physics models using PIC
- Non-DES applications include
  - Debugging
  - PISA: reversible instruction set architecture for low-power computing
  - Quantum computing
14 Global Virtual Time and Fossil Collection
- A mechanism is needed to
  - reclaim memory resources (e.g., old state and events)
  - perform irrevocable operations (e.g., I/O)
- Observation: a lower bound on the time stamp of any rollback that can occur in the future is needed.
- Global Virtual Time (GVT) is defined as the minimum time stamp of any unprocessed (or partially processed) message or anti-message in the system. GVT provides a lower bound on the time stamp of any future rollback.
  - storage for events and state vectors older than GVT (except one state vector) can be reclaimed
  - I/O operations with time stamp less than GVT can be performed
- Observation: the computation corresponding to GVT will not be rolled back, guaranteeing forward progress.
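The definition above reduces to a minimum over every pending and in-flight timestamp; a minimal sketch (illustrative function and argument names only):

```python
# GVT sketch: the minimum timestamp over every unprocessed message and
# every message still in transit between PEs.
def compute_gvt(pending_per_pe, in_transit):
    """pending_per_pe: per-PE lists of unprocessed event timestamps;
    in_transit: timestamps of messages sent but not yet received."""
    candidates = [ts for pe in pending_per_pe for ts in pe] + list(in_transit)
    return min(candidates, default=float("inf"))

gvt = compute_gvt([[12, 9], [15]], in_transit=[7])
print(gvt)   # → 7: events and state older than 7 can be fossil-collected
```

The transient message at 7 is what makes GVT hard in practice: a correct algorithm must account for messages that no PE currently holds.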
15 Global Virtual Time: Overlapping Windows
- Each processor calculates GVT during a window based on wall-clock time.
- There is variation due to using the wall clock.
- GVT is calculated during overlaps in the processors' windows.
16 Global Virtual Time: Two Cuts
- Builds a consistent snapshot of the events within the system.
- Any event sent before the first cut must be received before the second cut.
- When all events are accounted for, GVT can be computed.
17 Global Virtual Time: Global Reduction
- Global collective operations ensure there are no outstanding messages.
- GVT is then computed.
18 Fossil Collection
- Processed events are aggregated for fossil collection.
  - By LP: may lead to checking many empty lists
  - By PE: will result in searching very large lists
  - By kernel processes (KPs): the aggregation granularity can vary between the LP and PE extremes
19 Implementation: ROSS on the Blue Gene
- Local Control Mechanisms
- Global Control Mechanisms
- Reverse Computation
20 ROSS Local Control Mechanisms
ROSS uses a pointer-based framework, where event lists and causal relationships are maintained through pointer manipulation. If an event is shared by two LPs on the same processor (PE), only a pointer is used.
[Diagram: PE 0 holds LP 0 and LP 1; Event 5 is shared between them by pointer; PE 1 holds LP 2 with Event 6]
21 ROSS Local Control Mechanisms
When an event involves LPs on different PEs, we must make a copy and send it to the remote PE. A pointer would identify the correct event if it were local, but the receiver must search the priority queue and the processed list while handling remote cancels.
[Diagram: Event 10 is copied from LP 0 on PE 0 and sent to LP 2 on PE 1]
22 ROSS Local Control Mechanisms
The receiver processes the remote event, a copy, as a normal event, including rollbacks.
[Diagram: LP 2 on PE 1 processes the copied Event 10 alongside its own events]
23 ROSS Local Control Mechanisms
If the event is then cancelled, the sender sends a cancel message, another copy, which the receiver uses to find the event and cancel it.
[Diagram: the copied Event 10 is removed from both PE 0 and PE 1]
24 Cancellation: Identifying Events
[Diagram: the source LP generates event A @ time t, rolls back, generates event B @ time t, then cancels event A; the destination LP must cancel A, not B, even though both carry time t]
We use the timestamp, source and destination LP, and source-LP sequence number to uniquely identify events. The sequence number differentiates the first event from the one sent after rollback.
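The identity tuple described above can be sketched directly (illustrative names; the actual ROSS structures differ):

```python
# Event identity sketch: (timestamp, source LP, destination LP,
# source-LP sequence number). The sequence number distinguishes event A
# from event B re-sent at the same timestamp after a rollback, so a
# cancel matches only the intended copy.
from collections import namedtuple

EventID = namedtuple("EventID", "ts src_lp dst_lp seq")

a = EventID(ts=10, src_lp=0, dst_lp=2, seq=5)       # original send
b = EventID(ts=10, src_lp=0, dst_lp=2, seq=6)       # re-sent after rollback
cancel = EventID(ts=10, src_lp=0, dst_lp=2, seq=5)  # anti-message for A

pending = {a, b}
pending.discard(cancel)      # cancels A only; B survives
print(pending == {b})        # → True
```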
25 The GVT Algorithm
- Construct the cut using global collective algorithms.
- Reduce-all operations account for all transient messages and find the GVT.
- Exploits the Blue Gene/L's fast global operations.
- No event processing occurs during the GVT computation.
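A pure-Python stand-in for the synchronous step above (the hardware collective is replaced by an invented `allreduce_min` helper; field names are illustrative): each PE contributes its local minimum plus send/receive counts, and the reduced GVT is only valid once the counts balance, i.e. no transient messages remain.

```python
# Synchronous GVT sketch: all-reduce over per-PE local minima, with
# send/receive counts used to detect transient (in-flight) messages.
def allreduce_min(values):
    # Stand-in for a hardware collective such as Blue Gene/L's.
    return min(values)

def gvt_round(pes):
    """pes: list of dicts with 'local_min', 'sent', 'received' counts."""
    sent = sum(p["sent"] for p in pes)
    received = sum(p["received"] for p in pes)
    if sent != received:          # transient messages still in flight
        return None               # retry the reduction next round
    return allreduce_min(p["local_min"] for p in pes)

pes = [{"local_min": 12, "sent": 3, "received": 2},
       {"local_min": 9,  "sent": 1, "received": 2}]
print(gvt_round(pes))   # → 9: counts balance, so 9 is a safe lower bound
```

On the Blue Gene/L the reduction itself is a few microseconds, which is what makes this fully synchronous scheme practical at 32K processors.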
26 Fossil Collection
Remote events are included in the caused_by_me list of the event that created them. A remote event can only be affected by the causal event, and is not affected by any operations upon the actual event at the receiver.
Remote events must be collected at the sender.
27 Fossil Collection
The remote event at the sender is collected when the causal event is collected, as there are no other references to it. At the receiver, it is collected when it passes GVT.
The LP must check every event's caused_by_me list for remotes.
28 Results
- Computing power
- The PHOLD Model
- PHOLD Performance
- PCS Performance
29 Growth in System Performance
- Increasingly, gains in computing power have come from an increase in processor count rather than processor power
- The fastest processor no longer belongs to the fastest computer
- Processor performance has grown ~10,000x, while system performance has grown ~1,000,000x
- Charts show LINPACK performance.
30 PHOLD Model
- PHOLD is a synthetic benchmark that is a pathological case of remote communication.
- No work is done when events are processed; each event only schedules future events.
- Events are scheduled on an LP chosen uniformly at random.
- As the number of processors increases and the ratio of LPs to processors decreases, the percentage of remote events approaches 100%.
- In our experiments we only schedule events remotely 10% of the time. We present results for 10 and 16 events per LP.
31 Historical Context of PHOLD Performance
- The growth in PHOLD performance is better than processor performance, but less than system performance.
- The trend shows the best reported performance for a given year.
32 Historical Context of PHOLD Performance
33 PHOLD Performance
- Speedup is super-linear, with the per-processor event rate increasing from 45,000 with 2,048 processors to 52,000 with 16,384 processors. We attribute the speedup to a decrease in priority queue overhead as the processor count increased.
- We observed a maximum event rate of 853 million events per second when running on 16,384 processors.
34 PHOLD Performance
- Why the 4.5x improvement in performance when going from BG/L to BG/P (Intrepid @ ANL)?
- Answer: 8-byte alignment of time-stamps (the only doubles in the system)
- Note: if you don't align on BG/P, the code crashes ... this is a good thing!!
35 PCS Model
- PCS is a model of a radio network that provides wireless communication services to mobile phone subscribers.
- The service area of the provider is divided into cells, each of which contains a number of channels that are used by the radios to communicate.
- As the mobile radios move among the cells, the radios are forced to switch to another channel belonging to the new cell.
- PCS is a well-balanced, self-initializing workload. It ensures each LP does equal work, and the model continues to generate more work without external interference.
- Experiments were limited due to the cost of runs: only one run with 32,768 processors.
36 PCS Performance
- We observed a linear increase from 1,024 to 16,384 processors.
- We observed 2 billion events per second on 16,384 processors.
- We observed 2.47 billion events per second on 32,768 processors.
- This is a 25% increase from doubling the processor count.
37 Explanation of PCS Performance
- The sub-linear scaling is probably due to the frequency of GVT computation.
- We can improve performance by reducing the frequency of GVT computes, but we have not been able to repeat the 32K run.
- We used a longer run at 32K, but we were able to reproduce our 16K performance using this longer run.
- Rollbacks increased from 0.05% to 0.06% of committed events as we increased to 32K processors.
- The costs of increased network width and of fossil collection were reduced, but not eliminated.
38 Movies over the Internet Revisited
- Suppose we want to model 1 million home ISP customers over the AT&T backbone downloading a 2 GB movie
- How long to compute with PDES?
  - 16 trillion events @ 1 billion ev/sec
  - 4.5 hours!!
39 Conclusion
- We can implement an efficient, scalable Time Warp kernel on a large-scale supercomputer.
- A synchronous GVT computation algorithm can be used in an efficient Time Warp kernel if it is supported by the underlying hardware.
- Our future work will include running TCP and models of large-scale networks on supercomputers.
40 References
- A. Holder and C. Carothers, "Analysis of Time Warp on a 32,768 Processor IBM Blue Gene/L Supercomputer," European Modeling and Simulation Symposium, 2008.
- R. Fujimoto, "Parallel Discrete Event Simulation," Communications of the ACM, vol. 33, no. 10, pp. 30-53, 1990.