Title: Analysis of Time Warp on a 32,768 processor IBM Blue Gene/L supercomputer
1 Analysis of Time Warp on a 32,768 processor IBM
Blue Gene/L supercomputer
- Akintayo Holder and Christopher Carothers
- Department of Computer Science
- Rensselaer Polytechnic Institute
2 Outline
- Motivation
- Blue Gene
- Time Warp
- PHOLD
- Implementation
- Performance
- Conclusion
- Please feel free to ask questions at any time.
3 Discrete Event Simulation
- Discrete event simulation is used for complex models that cannot be easily analyzed or represented by equations.
- Discrete event simulation is commonly used for simulating network protocols like TCP; simulating TCP on a large network may require a powerful computer.
- In the logical process (LP) model, each LP is a concurrent task, and communication is performed by exchanging events.
- The model also defines processing elements (PEs), which usually correspond to actual physical processors.
4 Motivation
- Why Parallel Discrete-Event Simulation (DES)?
  - Large-scale systems are difficult to understand
  - Analytical models are often constrained
- Parallel DES offers
  - Dramatically shorter model execution times
  - Prediction of future "what-if" systems performance
  - Potential for real-time decision support
    - Minutes instead of days
    - Analysis can be done right away
- Example models: national air space (NAS), ISP backbone(s), distributed content caches, next-generation supercomputer systems
5 Example: Movies over the Internet
- Suppose we want to model 1 million home ISP customers downloading a 2 GB movie
- How long to compute?
  - Assume a nominal 100K ev/sec sequential simulator
  - Assume on average each packet takes 8 hops
  - 2 GB movies yield 2 trillion 1 KB data packets
  - At 8 hops, this yields 16 trillion events
  - 16 trillion events @ 100K ev/sec
  - Over 1,850 days!!! Or
  - 5 years!!!
- Need massively parallel simulation to make this tractable
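The arithmetic above can be checked with a short back-of-envelope sketch (all numbers taken from the slide):

```python
# Back-of-envelope check of the slide's event-count arithmetic.
customers = 1_000_000
movie_bytes = 2 * 10**9          # 2 GB movie
packet_bytes = 1000              # 1 KB data packets
hops = 8                         # average hops per packet
seq_rate = 100_000               # ev/sec for a sequential simulator

packets = customers * movie_bytes // packet_bytes   # 2 trillion packets
events = packets * hops                             # 16 trillion events
seconds = events / seq_rate
days = seconds / 86_400
years = days / 365.25
print(f"{events:.2e} events -> {days:,.0f} days (~{years:.1f} years)")
```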
6 Overview of the Blue Gene/L
- Blue Gene/L is an ultra-large-scale supercomputer, with an architecture that balances the computing power of the processors against the delivery speed of the network.
Each processing node consists of two low-power 700 MHz PowerPC cores with 1 GB of memory. The Blue Gene/L has five networks: a point-to-point network with an end-to-end delay of 4 microseconds, a barrier network that takes 1.5 microseconds, and a collective network that performs a global operation in 5 microseconds.
7 Time Warp
- Local Control Mechanism
- Global Control Mechanism
8 Parallel Discrete-Event Simulation Via Time Warp
Local Control Mechanism: error detection and rollback
(1) undo state Δs (2) cancel sent events
Global Control Mechanism: compute Global Virtual Time (GVT)
collect versions of state / events; perform I/O operations that are < GVT
[Diagram: virtual-time axes for LP 1, LP 2, and LP 3 under each mechanism, with GVT marked. Legend: unprocessed event, processed event, straggler event, committed event]
9 Time Warp Local Control Mechanism
Each LP processes events in time stamp order, like a sequential simulator, except that we (1) do NOT discard processed events and (2) add a rollback mechanism.
- Adding rollback
  - a message arriving in the LP's past initiates rollback
  - to roll back an event computation we must undo
    - changes to state variables performed by the event
    - message sends
10 Implementing Rollback: State Save
- The state of the logical process is stored before each event is executed.
- May be very memory intensive.
11 Implementing Rollback: Incremental State Save
- Only changes to the state are stored.
- Consumes less memory than state saves.
- Complicates the rollback process.
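A minimal sketch of incremental state saving (illustrative only; names are invented): each event records the old value of every variable it writes, and rollback replays those deltas in reverse.

```python
# Incremental state saving sketch: log only (key, old_value) deltas per
# event, and undo them in reverse order on rollback.
class IncrementalState:
    def __init__(self):
        self.vars = {"x": 0, "y": 0}
        self.log = []                    # one journal of deltas per event

    def begin_event(self):
        self.log.append([])

    def write(self, key, value):
        self.log[-1].append((key, self.vars[key]))  # save old value only
        self.vars[key] = value

    def rollback_event(self):
        for key, old in reversed(self.log.pop()):   # undo in reverse order
            self.vars[key] = old

s = IncrementalState()
s.begin_event(); s.write("x", 10); s.write("y", 20)
s.begin_event(); s.write("x", 99)
s.rollback_event()                   # undoes the second event only
print(s.vars)                        # → {'x': 10, 'y': 20}
```

This stores one small tuple per write instead of a full state copy per event, which is why it consumes less memory but makes rollback more involved.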
12 Rollback Via Reversible Computation
- Use Reversible Computation (RC)
  - Undo by executing reverse code
- Delivers better performance
  - Negligible overhead for forward computation
  - Significantly lower memory utilization
    - control state << data state
- In DES models, many of the operations are constructive
  - e.g., ++, --, etc.
- Size of control state < size of data state
  - i.e., b bitfield << delays array
- Perfectly reversible high-level operations
  - gleaned from irreversible smaller operations
  - e.g., random number generation
- ROSS: Rensselaer Optimistic Simulation System
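A toy sketch of the idea (not ROSS code; the handlers and state fields are invented): constructive operations like += need no saved data because an exact inverse exists, while a destructive branch saves only a control bit — the "control state << data state" point above.

```python
# Reverse computation sketch: each forward handler has a reverse handler
# that undoes it exactly, instead of saving state copies.

def forward(state, ev):
    state["bit"] = 1 if state["busy"] else 0   # destructive branch: save 1 bit
    if not state["busy"]:
        state["busy"] = True
    state["served"] += 1                       # constructive: inverse is -=

def reverse(state, ev):
    state["served"] -= 1                       # undo in reverse order
    if state["bit"] == 0:                      # replay the branch decision
        state["busy"] = False

state = {"busy": False, "served": 0, "bit": 0}
forward(state, None)
reverse(state, None)
print(state)   # → {'busy': False, 'served': 0, 'bit': 0} (fully restored)
```

In a real kernel the control bit would be stored per event rather than in the shared state; one bit per event is still far cheaper than a full state snapshot.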
13 RC Applications
- PDES applications include
  - Wireless telephone networks
  - Distributed content caches
  - Large-scale Internet models
    - TCP over the AT&T backbone
    - Leverages RC swaps
  - Hodgkin-Huxley neuron models
  - Plasma physics models using PIC
- Non-DES applications include
  - Debugging
  - PISA: reversible instruction set architecture for low-power computing
  - Quantum computing
14 Global Virtual Time and Fossil Collection
- A mechanism is needed to
  - reclaim memory resources (e.g., old state and events)
  - perform irrevocable operations (e.g., I/O)
- Observation: a lower bound on the time stamp of any rollback that can occur in the future is needed.
- Global Virtual Time (GVT) is defined as the minimum time stamp of any unprocessed (or partially processed) message or anti-message in the system. GVT provides a lower bound on the time stamp of any future rollback.
  - storage for events and state vectors older than GVT (except one state vector) can be reclaimed
  - I/O operations with time stamp less than GVT can be performed
- Observation: the computation corresponding to GVT will not be rolled back, guaranteeing forward progress.
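The definition above reduces to a minimum over every pending and in-flight timestamp; a minimal sketch (illustrative function and argument names only):

```python
# GVT sketch: the minimum timestamp over every unprocessed message and
# every message still in transit between PEs.
def compute_gvt(pending_per_pe, in_transit):
    """pending_per_pe: per-PE lists of unprocessed event timestamps;
    in_transit: timestamps of messages sent but not yet received."""
    candidates = [ts for pe in pending_per_pe for ts in pe] + list(in_transit)
    return min(candidates, default=float("inf"))

gvt = compute_gvt([[12, 9], [15]], in_transit=[7])
print(gvt)   # → 7: events and state older than 7 can be fossil-collected
```

The transient message at 7 is what makes GVT hard in practice: a correct algorithm must account for messages that no PE currently holds.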
15 Global Virtual Time: Overlapping Windows
- Each processor calculates GVT during a window based on wall-clock time.
- There is variation due to using the wall clock.
- GVT is calculated during overlaps in the processors' windows.
16 Global Virtual Time: Two Cuts
- Builds a consistent snapshot of the events within the system.
- Any event sent before the first cut must be received before the second cut.
- When all events are accounted for, GVT can be computed.
17 Global Virtual Time: Global Reduction
- Global collective operations ensure there are no outstanding messages.
- GVT is then computed.
18 Fossil Collection
- Processed events are aggregated for fossil collection.
  - By LP: may lead to checking many empty lists
  - By PE: will result in searching very large lists
  - By kernel processes (KPs): the aggregation granularity can vary between the LP and PE extremes
19 Implementation: ROSS on the Blue Gene
- Local Control Mechanisms
- Global Control Mechanisms
- Reverse Computation
20 ROSS Local Control Mechanisms
ROSS uses a pointer-based framework, where event lists and causal relationships are maintained through pointer manipulation. If an event is shared by two LPs on the same processor (PE), only a pointer is used.
[Diagram: PE 0 holds LP 0 and LP 1; Event 5 is shared between them by pointer; PE 1 holds LP 2 with Event 6]
21 ROSS Local Control Mechanisms
When an event involves LPs on different PEs, we must make a copy and send it to the remote PE. A pointer would identify the correct event if it were local, but the receiver must search the priority queue and the processed list while handling remote cancels.
[Diagram: Event 10 is copied from LP 0 on PE 0 and sent to LP 2 on PE 1]
22 ROSS Local Control Mechanisms
The receiver processes the remote event, a copy, as a normal event, including rollbacks.
[Diagram: LP 2 on PE 1 processes the copied Event 10 alongside its own events]
23 ROSS Local Control Mechanisms
If the event is then cancelled, the sender sends a cancel message, another copy, which the receiver uses to find the event and cancel it.
[Diagram: the copied Event 10 is removed from both PE 0 and PE 1]
24 Cancellation: Identifying Events
[Diagram: the source LP generates event A @ time t, rolls back, generates event B @ time t, then cancels event A; the destination LP must cancel A, not B, even though both carry time t]
We use the timestamp, source and destination LP, and source-LP sequence number to uniquely identify events. The sequence number differentiates the first event from the one sent after rollback.
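The identity tuple described above can be sketched directly (illustrative names; the actual ROSS structures differ):

```python
# Event identity sketch: (timestamp, source LP, destination LP,
# source-LP sequence number). The sequence number distinguishes event A
# from event B re-sent at the same timestamp after a rollback, so a
# cancel matches only the intended copy.
from collections import namedtuple

EventID = namedtuple("EventID", "ts src_lp dst_lp seq")

a = EventID(ts=10, src_lp=0, dst_lp=2, seq=5)       # original send
b = EventID(ts=10, src_lp=0, dst_lp=2, seq=6)       # re-sent after rollback
cancel = EventID(ts=10, src_lp=0, dst_lp=2, seq=5)  # anti-message for A

pending = {a, b}
pending.discard(cancel)      # cancels A only; B survives
print(pending == {b})        # → True
```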
25 The GVT Algorithm
- Construct the cut using global collective algorithms.
- Reduce-all operations account for all transient messages and find the GVT.
- Exploits the Blue Gene/L's fast global operations.
- No event processing occurs during the GVT computation.
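A pure-Python stand-in for the synchronous step above (the hardware collective is replaced by an invented `allreduce_min` helper; field names are illustrative): each PE contributes its local minimum plus send/receive counts, and the reduced GVT is only valid once the counts balance, i.e. no transient messages remain.

```python
# Synchronous GVT sketch: all-reduce over per-PE local minima, with
# send/receive counts used to detect transient (in-flight) messages.
def allreduce_min(values):
    # Stand-in for a hardware collective such as Blue Gene/L's.
    return min(values)

def gvt_round(pes):
    """pes: list of dicts with 'local_min', 'sent', 'received' counts."""
    sent = sum(p["sent"] for p in pes)
    received = sum(p["received"] for p in pes)
    if sent != received:          # transient messages still in flight
        return None               # retry the reduction next round
    return allreduce_min(p["local_min"] for p in pes)

pes = [{"local_min": 12, "sent": 3, "received": 2},
       {"local_min": 9,  "sent": 1, "received": 2}]
print(gvt_round(pes))   # → 9: counts balance, so 9 is a safe lower bound
```

On the Blue Gene/L the reduction itself is a few microseconds, which is what makes this fully synchronous scheme practical at 32K processors.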
26 Fossil Collection
Remote events are included in the caused_by_me list of the event that created them. A remote event can only be affected by the causal event, and is not affected by any operations upon the actual event at the receiver.
Remote events must be collected at the sender.
27 Fossil Collection
The remote event at the sender is collected when the causal event is collected, as there are no other references to it. At the receiver, it is collected when it passes GVT.
The LP must check every event's caused_by_me list for remotes.
28 Results
- Computing power
- The PHOLD Model
- PHOLD Performance
- PCS Performance
29 Growth in System Performance
- Increasingly, gains in computing power have come from an increase in processor count rather than processor power
- The fastest processor no longer belongs to the fastest computer
- Processor performance has grown ~10,000x, while system performance has grown ~1,000,000x
- Charts show LINPACK performance.
30 PHOLD Model
- PHOLD is a synthetic benchmark that is a pathological case of remote communication.
- No work is done when events are processed; each event only schedules future events.
- Events are scheduled on an LP chosen uniformly at random.
- As the number of processors increases and the ratio of LPs to processors decreases, the percentage of remote events approaches 100%.
- In our experiments we only schedule events remotely 10% of the time. We present results for 10 and 16 events per LP.
31 Historical Context of PHOLD Performance
- The growth in PHOLD performance is better than processor performance, but less than system performance.
- The trend shows the best reported performance for a given year.
32 Historical Context of PHOLD Performance
33 PHOLD Performance
- Speedup is super-linear, with the per-processor event rate increasing from 45,000 with 2,048 processors to 52,000 with 16,384 processors. We attribute the speedup to a decrease in priority queue overhead as the processor count increased.
- We observed a maximum event rate of 853 million events per second when running on 16,384 processors.
34 PHOLD Performance
- Why the 4.5x improvement in performance when going from BG/L to BG/P (Intrepid @ ANL)?
- Answer: 8-byte alignment of time-stamps (the only doubles in the system)
- Note: if you don't align on BG/P, the code crashes ... this is a good thing!!
35 PCS Model
- PCS is a model of a radio network that provides wireless communication services to mobile phone subscribers.
- The service area of the provider is divided into cells, each of which contains a number of channels that are used by the radios to communicate.
- As the mobile radios move among the cells, the radios are forced to switch to another channel belonging to the new cell.
- PCS is a well-balanced, self-initializing workload. It ensures each LP does equal work, and the model continues to generate more work without external interference.
- Experiments were limited due to the cost of runs: only one run with 32,768 processors.
36 PCS Performance
- We observed a linear increase from 1,024 to 16,384 processors.
- We observed 2 billion events per second on 16,384 processors.
- We observed 2.47 billion events per second on 32,768 processors.
- This is a 25% increase from doubling the processor count.
37 Explanation of PCS Performance
- The sub-linear scaling is probably due to the frequency of GVT computation.
- We can improve performance by reducing the frequency of GVT computes, but we have not been able to repeat the 32K run.
- We used a longer run at 32K, but we were able to reproduce our 16K performance using this longer run.
- Rollbacks increased from 0.05% to 0.06% of committed events as we increased to 32K processors.
- The costs of increased network width and of fossil collection were reduced, but not eliminated.
38 Movies over the Internet Revisited
- Suppose we want to model 1 million home ISP customers over the AT&T backbone downloading a 2 GB movie
- How long to compute with PDES?
  - 16 trillion events @ 1 billion ev/sec
  - 4.5 hours!!
39 Conclusion
- We can implement an efficient, scalable Time Warp kernel on a large-scale supercomputer.
- A synchronous GVT computation algorithm can be used in an efficient Time Warp kernel if it is supported by the underlying hardware.
- Our future work will include running TCP and models of large-scale networks on supercomputers.
40 References
- A. Holder and C. Carothers, "Analysis of Time Warp on a 32,768 Processor IBM Blue Gene/L Supercomputer," European Modeling and Simulation Symposium, 2008.
- R. Fujimoto, "Parallel Discrete Event Simulation," Communications of the ACM, vol. 33, no. 10, pp. 30-53, 1990.