Title: Requirements for an end-to-end solution for the Center for Plasma Edge Simulation (FSP)
1. Requirements for an end-to-end solution for the Center for Plasma Edge Simulation (FSP)
- SDM AHM
- October 5, 2005
- Scott A. Klasky
- ORNL
2. Perhaps not just the CPES FSP
- Can we form the CAFÉ solution?
- Combustion, Astrophysics, Fusion End-to-end framework
- Combustion SciDAC
- Astrophysics TSI SciDAC
- Fusion SciDACs (CPES, SWIM, GPS, CEMM)
- SNS: follow closely, and try to exchange technology.
3. Center for Plasma Edge Simulation (a Fusion Simulation Project SciDAC)
How can a particular plasma edge condition dramatically improve the confinement of fusion plasma, as observed in experiments? The physics of the transitional edge plasma that connects the hot core (of order 100 million degrees C, or tens of keV) with the material walls is the subject of this research question.
5-year goal: predict the edge pedestal behavior for ITER and existing devices. This question must be answered for the success of ITER.
We are developing a testable pedestal simulation framework which incorporates the relevant spectrum of physics processes (e.g., transport, kinetic and magnetohydrodynamic stability and turbulence, flows, and atomic physics in realistic geometry) that span the range of plasma parameters relevant to ITER.
- Use Kepler for the end-to-end solution, with autonomic, high-performance NxM data transfers for code coupling, code monitoring, and saving results.
[Figure: M3D simulation depicting edge localized modes (ELMs).]
[Workflow diagram: input files, data interpolation, job submission, XGC-ET simulation on a leadership-class computer, an MHD linear stability monitor with a STABLE? (true/false) branch, a noise monitor, M3D simulation, XGC-ET compute SOL, distributed storage, a portal, and out-of-core isosurface visualization.]
- Codes used in this project
- XGC-ET
- A fully kinetic PIC code which will solve turbulence, neoclassical, and neutral dynamics self-consistently.
- High velocity-space resolution and an arbitrarily shaped wall are necessary to solve this research problem.
- Will acquire the gyrokinetic machinery from the GTC code, part of the GPS SciDAC.
- Will include Degas-2 for more accurate neutral atomic physics around the boundary.
- M3D-edge
- An edge-modified version of the M3D MHD/two-fluid code, part of the CEMM SciDAC.
- For nonlinear MHD ELM crashes.
- Linear solvers
- Simple preconditioners for diagonally dominant systems
- Multigrid for scalable elliptic solves; perfect weak scaling.
- Investigation of tree-code methods (e.g., fast multipole) for direct calculation of electrostatic forces (i.e., PIC without cells).
4. Code coupling: forming a computational pipeline
- 2 computers (or more)
- 1 computer runs in batch.
- The other system(s) are for interactive parallel use.
- Security issues can be bypassed if we can have all computers at ORNL.
[Pipeline diagram: XGC on 1,024 processors of the Cray XT3; 10 MB moved in under 1 second to an interactive cluster running Mhd-L (linear stability) on 4 processors and to an interactive cluster running M3D on 32 processors; 30 GB/minute to an interactive cluster running the noise monitor on 80 processors.]
5. Interfaces must be designed to couple codes
- What variables are to be moved, and in what units?
- What is the data decomposition on the sending side? On the receiving side?
- Intercomm (Sussman) seems very interesting (PVM).
- Development of algorithms and techniques for effectively solving key problems in software support for coupled simulations.
- Concentrate on three main issues:
- Comprehensive support for determining at runtime what data is to be moved between simulations
- Flexibly and efficiently determining when the data should be moved
- Effectively deploying coupled simulation codes in a Grid computing environment.
- A major goal is to minimize the changes that must be made to each individual simulation code.
- Accomplished by having an individual simulation model only specify what data will be made available for a potential data transfer, not when an actual data transfer will take place (see the sketch after this list).
- Decisions about when data transfers take place are made through a separate coordination specification that generally will be provided by the person building the complete coupled simulation.
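A minimal Python sketch of that separation of concerns. The names (SimulationSide, Coordinator, export_field) are illustrative assumptions, not the Intercomm API: each code only registers what it can export or import, and a coordination rule kept outside both codes decides when a transfer fires.

    # Hypothetical sketch: a simulation only registers what it can export/import;
    # a separate coordination rule decides when a transfer actually happens.
    import numpy as np

    class SimulationSide:
        def __init__(self, name):
            self.name = name
            self.exports = {}   # field name -> callable returning the current array
            self.imports = {}   # field name -> callable accepting an array

        def export_field(self, field, getter):
            self.exports[field] = getter

        def import_field(self, field, setter):
            self.imports[field] = setter

    class Coordinator:
        """Coordination specification, kept outside both codes."""
        def __init__(self, rules):
            self.rules = rules  # list of (field, src, dst, predicate(step))

        def step(self, step):
            for field, src, dst, should_move in self.rules:
                if should_move(step):
                    data = src.exports[field]()           # pull from the sender
                    dst.imports[field](np.asarray(data))  # push to the receiver

    # Toy usage: XGC-ET exports a pressure profile every 10 steps; M3D receives it.
    xgc = SimulationSide("XGC-ET")
    m3d = SimulationSide("M3D-edge")
    pressure = np.zeros(64)
    xgc.export_field("edge_pressure", lambda: pressure)
    m3d.import_field("edge_pressure", lambda a: print(f"M3D got {a.size} values"))

    coord = Coordinator([("edge_pressure", xgc, m3d, lambda s: s % 10 == 0)])
    for s in range(30):
        pressure += 1.0          # stand-in for real physics
        coord.step(s)

The point of the sketch is that swapping the transfer schedule means editing only the Coordinator rules, not either simulation code.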
6. Look at Mb/s, not total data sizes
- Hawkes (SciDAC 2005)
- INCITE calculation
- 2,000 Seaborg processors, 2.5 million hours total
- 5 TB of data, 9.3 Mb/s.
- Blondin (SciDAC 2005)
- 4 TB, 30 hours: 310 Mb/s
- CPES: code coupling 1.3 Mb/s; data saving (3D) 30(0) GB per 10 minutes (the arithmetic behind such rates is sketched after this list).
- The future is difficult to predict for data generation rates.
- Codes add more physics, which slows the code down; algorithms speed the code up; new variables are generated; computers speed up.
- This is also true for analysis of the data.
- Do we need all of the data at all of the timesteps before we can analyze?
- Can we do analysis and data movement together?
- Analysis/visualization systems might have to be changed.
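A small sketch of the sustained-rate arithmetic behind the numbers above, assuming decimal terabytes and, for the Hawkes case, that 2.5 million processor-hours on 2,000 processors means roughly 1,250 wall-clock hours; the slide's slightly higher figures likely use binary terabytes.

    # Sustained rate in megabits per second for a dataset written over a wall-clock run.
    def sustained_mbps(terabytes, wall_hours):
        bits = terabytes * 1e12 * 8          # decimal TB -> bits
        seconds = wall_hours * 3600.0
        return bits / seconds / 1e6

    # Blondin (SciDAC 2005): 4 TB over 30 hours -> roughly 300 Mb/s.
    print(round(sustained_mbps(4, 30)))          # ~296

    # Hawkes (SciDAC 2005): 5 TB over ~1,250 wall-clock hours
    # (2.5 million processor-hours on 2,000 processors) -> roughly 9 Mb/s.
    print(round(sustained_mbps(5, 2.5e6 / 2000), 1))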
7. What happens when the Mb/s gets too large?
- Must understand the features in the data.
- Use an AMR-like scheme to save the data (a minimal sketch follows this list).
- Does the data change dramatically everywhere?
- Is the data smooth in some regions?
- Can save 100x with compression techniques, but must still be able to use the data.
- New viz/analysis tools?
- Could just stitch up the grid, and use old tools.
- Useful for level-of-detail visualization (more detail in regions which change).
- Use in combination with smart data caching/data compression (see below).
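One way to read the "save full resolution only where the data changes" idea, sketched in Python; the block size, tolerance, and record format are made-up assumptions, and a real writer would store the records in HDF5 rather than returning a list.

    import numpy as np

    def adaptive_save(field, previous, block=16, tol=1e-3):
        """Keep full resolution only in blocks that changed; elsewhere store the block mean.
        Returns a list of (origin, payload) records."""
        records = []
        nx, ny = field.shape
        for i in range(0, nx, block):
            for j in range(0, ny, block):
                cur = field[i:i+block, j:j+block]
                old = previous[i:i+block, j:j+block]
                if np.max(np.abs(cur - old)) > tol:
                    records.append(((i, j), cur.copy()))        # detailed block
                else:
                    records.append(((i, j), float(cur.mean()))) # coarse summary
        return records

    # Toy use: only the blocks containing the moving blob are stored in full.
    prev = np.zeros((128, 128))
    new = prev.copy()
    new[40:60, 40:60] = 1.0
    recs = adaptive_save(new, prev)
    full = sum(isinstance(p, np.ndarray) for _, p in recs)
    print(f"{full} of {len(recs)} blocks stored at full resolution")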
8. End-to-end/workflow requirements
- Easy to install
- Good examples (MPI, netCDF, HDF5, LN, bbcp)
- Easy to use
- Ensight-Gold
- Must have value added over simple approaches.
- Value added is discussed in the following slides.
- Must be robust/fault-tolerant.
- The workflow cannot crash our simulations/nodes!
9. Need a data model
- Allows the CS community to design modules which can understand the data.
- Allow for netCDF, HDF5.
- Develop interfaces to extract portions of the data from the files/memory.
- Must come from the application areas teaming up with the CS community.
- HDF5/netCDF is not a data model.
- Can we use the data model in SciRun/AVS Express/EnSight as a start?
- Meshes (uniform, rectilinear, structured, unstructured).
- Hierarchy in meshes (AMR).
- Cell-centered, vertex-centered, edge-centered data.
- Multiple variables on a mesh.
- Can we use simple APIs in the codes which can write the data out? (A sketch of such an API follows this list.)
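A sketch of what a simple, mesh-aware write API in the codes might look like, assuming h5py is available; the function names and HDF5 layout are assumptions for illustration, not an existing library.

    # Minimal sketch of a mesh-aware write API on top of HDF5 (via h5py).
    import numpy as np
    import h5py

    def write_mesh(path, coords, connectivity, mesh_type="unstructured"):
        with h5py.File(path, "w") as f:
            g = f.create_group("mesh")
            g.attrs["type"] = mesh_type
            g.create_dataset("coords", data=coords)
            g.create_dataset("connectivity", data=connectivity)

    def write_variable(path, name, values, centering="vertex"):
        """centering is one of 'cell', 'vertex', 'edge'."""
        with h5py.File(path, "a") as f:
            d = f.require_group("variables").create_dataset(name, data=values)
            d.attrs["centering"] = centering

    # Toy usage: a two-triangle mesh with one vertex-centered and one cell-centered field.
    coords = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
    tris = np.array([[0, 1, 2], [0, 2, 3]])
    write_mesh("edge_plasma.h5", coords, tris)
    write_variable("edge_plasma.h5", "temperature", np.array([1., 2., 3., 4.]), "vertex")
    write_variable("edge_plasma.h5", "density", np.array([0.5, 0.7]), "cell")

Because the mesh type and centering are recorded alongside the arrays, a downstream viz or analysis module can interpret the file without code-specific knowledge; that is the difference between a data model and a bare HDF5/netCDF file.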
10. Monitoring
- We want to watch portions of the data from the simulation, as the simulation progresses.
- Want the ability to play back from t0 to the current frame, i.e., snapshot movies (a minimal sketch appears after this list).
- Want this information presented so that we can collaborate during/after the simulation.
- Highlight parts of the data to discuss with other users.
- Draw on the figures.
- Mostly 1D plots, some 2D (surface/contour) plots, some 3D plots.
- Example: http://w3.pppl.gov/transp/ElVis/121472A03_D3D.html
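A minimal stand-in for the snapshot-movie idea (not ElVis): the monitor keeps cheap per-timestep snapshots as they arrive and can replay them from t0 up to the current frame; a real front end would plot rather than print.

    # Hypothetical monitor: keep 1D snapshots as they arrive and replay from t0.
    class SnapshotMonitor:
        def __init__(self):
            self.frames = []                 # list of (timestep, profile)

        def record(self, step, profile):
            self.frames.append((step, list(profile)))

        def playback(self, upto=None):
            """Yield frames from t0 up to the current (or requested) frame."""
            last = len(self.frames) if upto is None else upto
            for step, profile in self.frames[:last]:
                yield step, profile

    mon = SnapshotMonitor()
    for t in range(5):
        mon.record(t, [t * 0.1] * 4)         # stand-in for a 1D pressure profile
    for step, prof in mon.playback():
        print(step, prof)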
11. Portal to launch workflow/monitor jobs
- Use the portal as a front-end to the workflow.
- Would like to see the workflow, but not monitor it.
- Perhaps it will allow us to choose among different workflows which were created?
- Would like to launch the workflow, and have automatic job submission for known clusters/HPC.
- Submit to all, kill all when one starts running (sketched below).
12. Users want to write their own analysis
- Requires that they can do this in F90, C/C++, Python.
- Need wizards to allow users to describe their input/output.
- Similar to AVS/Express, SciRun, OpenDX.
- Common scenario (a Python sketch follows this list):
- Users want the main data field (field_in), they want a string (temperature), they want a condition (gt), and they want an output field. They also want this to run on their cluster with M processors. They also want to change the inputs at any given time.
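A sketch of the common scenario above in Python. The field name, condition string, and output follow the slide; the descriptor format, threshold value, and processor count are assumptions standing in for what a wizard would capture from the user.

    import numpy as np

    # What a wizard might capture from the user: inputs, a condition, and an output.
    descriptor = {
        "field_in": "temperature",     # main data field
        "condition": "gt",             # gt / lt / eq
        "threshold": 1.0e3,            # hypothetical value
        "field_out": "hot_region",
        "processors": 8,               # "M" processors on the user's cluster (unused here)
    }

    _OPS = {"gt": np.greater, "lt": np.less, "eq": np.equal}

    def run_analysis(data, desc):
        """Apply the user's condition to the chosen field and return the output field."""
        field = data[desc["field_in"]]
        mask = _OPS[desc["condition"]](field, desc["threshold"])
        return {desc["field_out"]: np.where(mask, field, 0.0)}

    data = {"temperature": np.array([200.0, 1500.0, 3000.0, 800.0])}
    print(run_analysis(data, descriptor))

Because the descriptor is plain data, the user can change the inputs at any time without touching the analysis code, and the workflow can ship the same descriptor to a cluster run.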
13. Efficient Data Movement
- On the same node
- Use a memory reference.
- On the same cluster
- Use MPI communication.
- On different clusters (NxM communication)
- 2 approaches: memory-to-memory vs. files.
- The file approach is not always usable.
- It will break the solution for code-coupling approaches, since I/O can become the bottleneck (open/close/read/write).
- Working with Parashar/Kohl to look into the NxM problem.
- Do we make this part of Kepler? (A transport-selection sketch follows this list.)
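A small sketch of "use the cheapest mechanism locality allows"; the dispatcher and labels are hypothetical, and the actual NxM redistribution machinery (the Parashar/Kohl work) is not reproduced here.

    # Hypothetical dispatcher: pick the cheapest transfer mechanism the layout allows.
    def choose_transport(src, dst):
        if src["node"] == dst["node"]:
            return "memory"          # same node: pass a reference, no copy
        if src["cluster"] == dst["cluster"]:
            return "mpi"             # same cluster: MPI point-to-point / collective
        return "nxm"                 # different clusters: NxM redistribution service

    def move(field, src, dst):
        transport = choose_transport(src, dst)
        print(f"moving '{field}' via {transport}")
        # real implementations: shared memory, MPI send/recv, or a remote NxM transfer
        return transport

    a = {"cluster": "xt3", "node": 0}
    b = {"cluster": "xt3", "node": 3}
    c = {"cluster": "viz-cluster", "node": 0}
    move("edge_pressure", a, a)      # memory
    move("edge_pressure", a, b)      # mpi
    move("edge_pressure", a, c)      # nxm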
14. Distributed Data Storage - 1
- Users do NOT want to know where their data is stored.
- Users want the FASTEST possible method to get to their data.
- Users seldom look at all of their data at once.
- Usually, we look at a handful of variables at a time, with only a few time slices at a time. (We DON'T need 4 TB in a second.)
- Users require that the solution works on their laptop when traveling! (Results must be cached on the local disk.)
- Users do NOT want to change their mode of operation during travel.
15. Distributed Data Storage - 2
- LN is a good example of an almost usable system.
- Needs to directly understand HDF5/netCDF.
- Needs to be able to cache information on local disks, and modify the eXnodes.
- Needs to be able to work with HPSS.
- But this is NOT enough!
16. Smart data cache
- Users typically access their data in similar patterns.
- Look at timestep 1 for variables A and B, look at timestep 2 for A and B, ...
- If we know what the user wants, and when he/she wants it, then we can use smart technologies (a simple prefetching sketch follows this list).
- In a collaboration, the data access gets more complicated.
- Neural networks to the rescue!
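A much simpler stand-in for the neural-network idea: remember which variables were read at the current timestep and prefetch the same ones for the next. fetch_from_storage is a placeholder for an LN/HPSS/HDF5 read.

    # Simple stand-in for the "smart cache": assume the user will want the same
    # variables at the next timestep and prefetch them into a local cache.
    def fetch_from_storage(timestep, var):
        # placeholder for an LN / HPSS / HDF5 read
        return f"data({var}, t={timestep})"

    class PrefetchingCache:
        def __init__(self):
            self.cache = {}          # (timestep, var) -> data
            self.last_vars = set()

        def get(self, timestep, var):
            key = (timestep, var)
            if key not in self.cache:
                self.cache[key] = fetch_from_storage(timestep, var)   # cache miss
            self.last_vars.add(var)
            return self.cache[key]

        def end_of_timestep(self, timestep):
            # Prefetch the variables the user just looked at, for the next timestep.
            for var in self.last_vars:
                self.cache[(timestep + 1, var)] = fetch_from_storage(timestep + 1, var)
            self.last_vars = set()

    c = PrefetchingCache()
    for t in (1, 2, 3):
        c.get(t, "A"); c.get(t, "B")
        c.end_of_timestep(t)
    print(len(c.cache), "entries cached")   # timesteps 1-3 read, 2-4 prefetched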
17. Need data mining technology integrated into the solution
- We must understand the features of the data.
- Requires a working relationship between application scientists and computer scientists.
- Want to detect features on the fly, from the current and previous timesteps (a toy detector is sketched after this list).
- Could feature-based analysis be done by the end of the simulation?
- Pre-compute everything possible by the end of the simulation. DO NOT REQUIRE the end user to wait for anything that we know we want.
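A toy sketch of on-the-fly feature detection by comparing the current and previous timesteps; the change threshold and the bounding-box output are illustrative assumptions, not a real ELM detector.

    import numpy as np

    def detect_features(current, previous, tol=0.5):
        """Flag regions where the field changed sharply between timesteps and
        return a cell count plus bounding box, so the result can be stored
        alongside the run instead of being recomputed afterwards."""
        changed = np.abs(current - previous) > tol
        if not changed.any():
            return None
        rows = np.where(changed.any(axis=1))[0]
        cols = np.where(changed.any(axis=0))[0]
        return {"cells": int(changed.sum()),
                "bbox": (int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1]))}

    prev = np.zeros((64, 64))
    cur = prev.copy()
    cur[10:20, 30:40] = 2.0              # stand-in for a localized burst
    print(detect_features(cur, prev))    # {'cells': 100, 'bbox': (10, 19, 30, 39)}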
18. Security
- Users do NOT want to deal with this.
- But of course, they have to.
- Will DOE require single sign-on?
- Can trusted sites talk to other trusted sites via ports being opened from A to B?
- Will this be the death of workflow automation?
- We cannot automate data movement if we must sign on each time with unique passwords.
19. Conclusions
- We need Kepler in order for the CPES project to be successful.
- We need efficient NxM data movement, and monitoring of it.
- We need to be able to provide feedback to the simulation(s).
- Codes must be coupled, and we need an efficient mechanism to couple the data.
- What do we do about single logins?
- ORNL tells me that we can have ports open from one site to another without violating the security model. What about other sites?
- Are we prepared for new architectures?
- The Cray XT3 has only one small pipe out to the world.