1
  • Analysis of Time Warp on a 32,768 processor IBM
    Blue Gene/L supercomputer
  • Akintayo Holder and Christopher Carothers
  • Department of Computer Science
  • Rensselaer Polytechnic Institute

2
Outline
  • Motivation
  • Blue Gene
  • Time Warp
  • PHOLD
  • Implementation
  • Performance
  • Conclusion
  • Please feel free to ask questions at any time.

3
Discrete Event Simulation
  • Discrete event simulation is used for complex
    models that cannot easily be analyzed or
    represented by equations.
  • Discrete event simulation is commonly used for
    simulating network protocols like TCP; simulating
    TCP on a large network may require a powerful
    computer.
  • In the logical process (LP) model, each LP is a
    concurrent task, and communication is performed
    by exchanging events.
  • The model also defines processor elements (PEs),
    which usually correspond to actual physical
    processors.

4
Motivation
  • Why Parallel Discrete-Event Simulation (DES)?
  • Large-scale systems are difficult to understand
  • Analytical models are often constrained
  • Parallel DES offers
  • Dramatically shorter model execution times
  • Prediction of future what-if systems'
    performance
  • Potential for real-time decision support
  • Minutes instead of days
  • Analysis can be done right away
  • Example models: national air space (NAS), ISP
    backbone(s), distributed content caches, next
    generation supercomputer systems.

5
Ex: Movies over the Internet
  • Suppose we want to model 1 million home ISP
    customers downloading a 2 GB movie
  • How long to compute?
  • Assume a nominal 100K ev/sec sequential
    simulator
  • Assume on avg. each packet takes 8 hops
  • 2 GB movies yield 2 trillion 1 KB data packets
  • @ 8 hops yields 16 trillion events
  • 16 trillion events @ 100K ev/sec
  • Over 1,850 days!!! Or
  • 5 years!!!
  • Need massively parallel simulation to make
    tractable
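The estimate above can be checked with a quick back-of-envelope script; the values are taken directly from the slide.

```python
# Back-of-envelope check of the slide's numbers.
customers = 1_000_000
packets_per_movie = (2 * 10**9) // (1 * 10**3)  # 2 GB movie in 1 KB packets
hops = 8                                        # average hops per packet
events = customers * packets_per_movie * hops   # one event per packet-hop
seq_rate = 100_000                              # ev/sec, sequential simulator
days = events / seq_rate / 86_400
print(f"{events:.1e} events, {days:.0f} days (~{days / 365:.1f} years)")
```

The script reproduces the 16 trillion events and the roughly five-year sequential run time.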

6
Overview of the Blue Gene/L
  • Blue Gene/L is an ultra large scale
    supercomputer, with an architecture that balances
    the computing power of the processors against the
    delivery speed of the network.

The processing nodes each consist of two low-power
700 MHz PowerPC processors with 1 GB of memory. The
Blue Gene/L has five networks, including a
point-to-point network with an end-to-end delay of
4 microseconds and a barrier network that takes 1.5
microseconds. The collective network performs a
global operation in 5 microseconds.
7
Time Warp
  • Local Control Mechanism
  • Global Control Mechanism

8
Parallel Discrete-Event Simulation Via Time Warp
Local Control Mechanism: error detection and
rollback; (1) undo state changes, (2) cancel sent
events.
Global Control Mechanism: compute Global Virtual
Time (GVT); collect versions of state/events and
perform I/O operations that are < GVT.
[Diagram: virtual-time axes for LP 1, LP 2, and
LP 3, with the GVT line separating committed
events from unprocessed, processed, and straggler
events.]
9
Time Warp: Local Control Mechanism
Each LP processes events in time stamp order, like
a sequential simulator, except that we (1) do NOT
discard processed events and (2) add a rollback
mechanism
  • Adding rollback
  • a message arriving in the LP's past initiates
    rollback
  • to roll back an event computation we must undo
  • changes to state variables performed by the
    event
  • message sends

10
Implementing Rollback: State Save
  • The state of the logical process is stored
    before each event is executed.
  • May be very memory intensive.
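A minimal sketch of full state saving, in Python with hypothetical names (ROSS itself is written in C; this is illustrative, not its API):

```python
import copy

class StateSavingLP:
    """Sketch of full state saving (illustrative, not ROSS's API)."""

    def __init__(self, state):
        self.state = state
        self.snapshots = []                 # one full copy per event

    def process(self, delta):
        # Save the ENTIRE state before executing the event.
        self.snapshots.append(copy.deepcopy(self.state))
        self.state["count"] = self.state.get("count", 0) + delta

    def rollback(self, n_events):
        # Undo the last n events by restoring their snapshots.
        for _ in range(n_events):
            self.state = self.snapshots.pop()
```

Every processed event costs a full copy of the LP state, which is where the memory pressure comes from.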

11
Implementing Rollback: Incremental State Save
  • Only changes to the state are stored.
  • Consumes less memory than full state saving.
  • Complicates the rollback process.
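Incremental state saving can be sketched the same way, logging only the individual changes (again with hypothetical names, not ROSS code):

```python
class IncrementalLP:
    """Sketch of incremental state saving (illustrative only)."""

    def __init__(self, state):
        self.state = state
        self.journal = []                   # (key, old_value) per change

    def write(self, key, value):
        # Record the old value of just this variable before overwriting.
        self.journal.append((key, self.state.get(key)))
        self.state[key] = value

    def rollback(self, n_changes):
        # Undo changes in reverse order; replaying the journal backwards
        # is what complicates the rollback process.
        for _ in range(n_changes):
            key, old = self.journal.pop()
            if old is None:
                del self.state[key]
            else:
                self.state[key] = old
```

Only the touched variables are journaled, so memory use tracks the number of writes rather than the size of the whole state.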

12
Rollback Via Reversible Computation...
  • Use Reversible Computation (RC)
  • Undo by executing reverse code
  • Delivers better performance
  • Negligible overhead for forward computation
  • Significantly lower memory utilization
  • control state << data state
  • In DES models, many of the ops are constructive
  • e.g., ++, --, etc.
  • Size of control state < size of data state
  • i.e., b bitfield << delays array
  • Perfectly reversible high-level operations
  • gleaned from irreversible smaller operations
  • e.g., random number generation
  • ROSS: Rensselaer Optimistic Simulation System
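The idea can be illustrated with a hand-written forward/reverse pair; the state variables here are made up for the sketch (in ROSS the modeler writes such pairs as C functions):

```python
def forward(state, bf):
    """Process an event; bf is the small control-state bitfield."""
    state["sent"] += 1                    # constructive op: reversed by -= 1
    bf["dequeued"] = 0
    if state["queue"] > 0:                # destructive branch: record 1 bit
        state["queue"] -= 1
        bf["dequeued"] = 1

def reverse(state, bf):
    """Undo the event by executing the forward code backwards."""
    if bf["dequeued"]:                    # replay the recorded branch
        state["queue"] += 1
    state["sent"] -= 1
```

Only the one-bit `bf` entry survives per event rather than a copy of the state, which is the "control state << data state" point above.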

13
RC Applications
  • PDES applications include
  • Wireless telephone networks
  • Distributed content caches
  • Large-scale Internet models
  • TCP over AT&T backbone
  • Leverages RC swaps
  • Hodgkin-Huxley neuron models
  • Plasma physics models using PIC
  • Non-DES include
  • Debugging
  • PISA: reversible instruction set architecture
    for low power computing
  • Quantum computing

14
Global Virtual Time and Fossil Collection
  • A mechanism is needed to
  • reclaim memory resources (e.g., old state and
    events)
  • perform irrevocable operations (e.g., I/O)
  • Observation: A lower bound on the time stamp of
    any rollback that can occur in the future is
    needed.
  • Global Virtual Time (GVT) is defined as the
    minimum time stamp of any unprocessed (or
    partially processed) message or anti-message in
    the system. GVT provides a lower bound on the
    time stamp of any future rollback.
  • storage for events and state vectors older than
    GVT (except one state vector) can be reclaimed
  • I/O operations with time stamp less than GVT can
    be performed.
  • Observation: The computation corresponding to
    GVT will not be rolled back, guaranteeing
    forward progress.

15
Global Virtual Time: Overlapping Windows
  • Each process calculates GVT during a window
    based on wall clock time.
  • There is variation due to using wall clock time.
  • GVT is calculated during overlaps in the
    processors' windows.

16
Global Virtual Time: Two Cuts
  • Builds a consistent snapshot of events within
    the system.
  • Any event sent before the first cut must be
    received before the second cut.
  • When all events are accounted for, GVT can be
    computed.

17
Global Virtual Time: Global Reduction
  • Global collective operations ensure there are no
    outstanding messages.
  • GVT is then computed.

18
Fossil Collection
  • Processed events are aggregated for fossil
    collection.
  • By LP: may lead to checking a lot of empty lists
  • By PE: will result in searching very large lists
  • Kernel Processes (KPs): aggregation granularity
    can vary between the LP and PE mappings.

19
Implementation: ROSS on the Blue Gene
  • Local Control Mechanisms
  • Global Control Mechanisms
  • Reverse Computation

20
ROSS Local Control Mechanisms
ROSS uses a pointer-based framework, where event
lists and causal relationships are maintained
through pointer manipulation. If an event is
shared by two LPs on the same processor (PE), only
a pointer is used.
[Diagram: PE 0 holds LP 0 (Event 5) and LP 1
(Events 50, 40, 5), which share Event 5 by
pointer; PE 1 holds LP 2 (Event 6).]
21
ROSS Local Control Mechanisms
When an event involves LPs on different PEs, we
must make a copy and send it to the remote PE. A
pointer would identify the correct event if it
were local, but the receiver must search the
priority queue and processed list while handling
remote cancels.
[Diagram: LP 0 on PE 0 sends a copy of Event 10
to LP 2 on PE 1.]
22
ROSS Local Control Mechanisms
The receiver processes the remote event, a copy,
as a normal event, including rollbacks.
[Diagram: LP 2 on PE 1 processes its copy of
Event 10 alongside Events 15 and 6.]
23
ROSS Local Control Mechanisms
If the event is then cancelled, the sender sends
a cancel, another copy, which the receiver uses
to find the event and cancel it.
[Diagram: the copies of Event 10 are removed from
both PE 0 and PE 1.]
24
Cancellation: Identifying Events
Actions at the source LP: generate event A @ time
t; roll back the source LP; generate event B @
time t; event A is cancelled.
[Diagram: at the destination LP, the cancel @ time
t must match event A @ time t (YES!), not event B
@ time t (NO!).]
We use the timestamp, source and destination LP,
and source LP sequence number to uniquely identify
events. The sequence number differentiates the
first event from the one sent after rollback.
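The identification scheme described above can be sketched as a tuple key; the field names are illustrative, not the paper's:

```python
from collections import namedtuple

# An event is named by (timestamp, source LP, destination LP,
# per-source sequence number); the sequence number tells apart
# the first event from one re-sent at the same time after rollback.
EventID = namedtuple("EventID", ["ts", "src", "dst", "seq"])

event_a = EventID(ts=10.0, src=0, dst=2, seq=7)   # first send @ t
event_b = EventID(ts=10.0, src=0, dst=2, seq=8)   # re-send after rollback
cancel = EventID(ts=10.0, src=0, dst=2, seq=7)    # must match A only

assert cancel == event_a and cancel != event_b
```

Without the sequence number, the cancel for event A would be ambiguous with event B, since both carry the same timestamp and LP pair.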
25
The GVT Algorithm
  • Construct the cut using global collective
    algorithms.
  • Reduce-all operations account for all transient
    messages and find the GVT.
  • Exploits the Blue Gene/L's fast global
    operations.
  • No event processing during GVT computation.
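A toy rendition of the reduce-all step, with plain Python lists standing in for the Blue Gene collectives (the function name and arguments are ours, not the paper's):

```python
def gvt_round(local_min_ts, sent_counts, recv_counts):
    """One GVT attempt across all processors.

    local_min_ts[i]: minimum unprocessed timestamp on processor i
    sent_counts[i] / recv_counts[i]: messages sent/received by i
    """
    # First reduce-all: if the totals disagree, messages are still
    # in flight across the cut, so this round cannot compute GVT.
    if sum(sent_counts) != sum(recv_counts):
        return None
    # Second reduce-all: GVT is the global minimum timestamp.
    return min(local_min_ts)
```

On the real machine each `sum` and `min` is a hardware collective over all processors, which is why the synchronous algorithm is cheap on Blue Gene/L.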

26
Fossil Collection
Remote events are included in the caused_by_me
list of the event that created them. A remote
event can only be affected by the causal event,
and is not affected by any operations upon the
actual event at the receiver.
[Diagram: an event on LP 0 whose caused_by_me list
points to the remote copy it scheduled.]
Remote events must be collected at the sender.
27
Fossil Collection
The remote event at the sender is collected when
the causal event is collected, as there are no
other references to it. At the receiver it is
collected when it passes GVT.
The LP must check every event's caused_by_me list
for remotes.
28
Results
  • Computing power
  • The PHOLD Model
  • PHOLD Performance
  • PCS Performance

29
Growth in system performance
  • Increasingly, computing power has come from an
    increase in processor count rather than
    processor power
  • The fastest processor no longer belongs to the
    fastest computer
  • Processor performance has grown about 10,000x,
    while system performance has grown about 1x10^6
  • Charts show LINPACK performance.

30
PHOLD Model
  • PHOLD is a synthetic benchmark that is a
    pathological case of remote communication.
  • No work is done when events are processed;
    each event only schedules future events.
  • Events are scheduled on an LP chosen uniformly
    at random.
  • As the number of processors increases and the
    ratio of LPs to processors decreases, the
    percentage of remote events increases toward
    100%.
  • In our experiments we only schedule events
    remotely 10% of the time. We present results
    when there are 10 and 16 events per LP.

31
Historical Context of PHOLD Performance
  • The growth in PHOLD performance has been better
    than processor performance, but less than
    system performance.
  • The trend shows the best reported performance
    for a given year.

32
Historical Context of PHOLD Performance
33
PHOLD Performance
  • Speedup is super-linear, with the per-processor
    event rate increasing from 45,000 with 2,048
    processors to 52,000 with 16,384 processors. We
    attribute the speedup to a decrease in priority
    queue overhead as the processor count increased.
  • We observed a maximum event rate of 853 million
    events per second when running on 16,384
    processors.

34
PHOLD Performance
  • Why the 4.5x improvement in performance when
    going from BG/L to BG/P (Intrepid @ ANL)?
  • Answer: 8-byte alignment of time-stamps (the
    only doubles in the system)
  • Note: if you don't align on BG/P, the code
    crashes ... this is a good thing!!

35
PCS Model
  • PCS is a model of a radio network that provides
    wireless communication services to mobile phone
    subscribers.
  • The service area of the provider is divided into
    cells, each of which contains a number of
    channels that are used by the radios to
    communicate.
  • As the mobile radios move among the cells, the
    radios are forced to switch to another channel
    belonging to the new cell.
  • PCS is a well balanced, self-initializing
    workload. It ensures each LP does equal work,
    and the model continues to generate more work
    without external interference.
  • Limited experiments due to the cost of runs;
    only one run with 32,768 processors.

36
PCS Performance
  • We observed a linear increase from 1,024 to
    16,384 processors.
  • We observed 2 billion events per second on
    16,384 processors.
  • We observed 2.47 billion events per second on
    32,768 processors.
  • This is a 25% increase from doubling the
    processor count.

37
Explanation of PCS Performance
  • Probably due to the frequency of GVT
    computation.
  • We can improve performance by reducing the
    frequency of GVT computations, but we have not
    been able to repeat the 32K run.
  • We used a longer run when using 32K, but we
    were able to reproduce our 16K performance
    using this longer run.
  • Rollbacks increased from 0.05% to 0.06% of
    committed events as we increased to 32K.
  • Increased network width and the cost of fossil
    collection also contribute and are not
    eliminated.

38
Movies over the Internet Revisited
  • Suppose we want to model 1 million home ISP
    customers over AT&T downloading a 2 GB movie
  • How long to compute with PDES?
  • 16 trillion events @ 1 billion ev/sec
  • 4.5 hours!!

39
Conclusion
  • We can implement an efficient, scalable Time
    Warp kernel on a large scale supercomputer.
  • A synchronous GVT computation algorithm can be
    used in an efficient Time Warp kernel if it is
    supported by the underlying hardware.
  • Our future work will include running TCP and
    models of large scale networks on
    supercomputers.

40
References
  • A. Holder and C. Carothers, "Analysis of Time
    Warp on a 32,768 processor IBM Blue Gene/L
    Supercomputer," European Modeling and Simulation
    Symposium, 2008.
  • R. Fujimoto, "Parallel Discrete Event
    Simulation," Communications of the ACM, vol. 33,
    no. 10, pp. 30-53, 1990.