Parallel Visualization of Large-Scale Datasets for the Earth Simulator

1
Parallel Visualization of Large-Scale Datasets
for the Earth Simulator
  • Li Chen, Issei Fujishiro
  • Kengo Nakajima

Basic Design: Parallel/SMP/Vector Algorithm
Research Organization for Information Science and Technology (RIST), Japan
3rd ACES Workshop, May 5-10, 2002, Maui, Hawaii.
2
Background: Role of the Visualization Subsystem
[Diagram] The Earth Simulator (hardware and software) runs GeoFEM, which comprises the solver, application analysis, mesh generation, and the visualization subsystem: tools for (1) post-processing and (2) data mining, etc.
3
Background: Requirements
Target 1: Powerful visualization functions
Translate data from numerical form to visual form, giving researchers immense assistance in understanding their computational results.
We have developed many visualization techniques in GeoFEM, for scalar, vector, and tensor data fields, to reveal the data distribution from many aspects.
Target 2: Suitable for large-scale datasets
High parallel performance: our modules have been parallelized and achieve high parallel performance.
Target 3: Available for unstructured datasets
Complicated grids: all of our modules are based on unstructured datasets and can be extended to hybrid grids.
Target 4: SMP cluster architecture oriented
Effective on SMP cluster architectures: a three-level hybrid parallel programming model is adopted in our modules.
4
Work since the 2nd ACES Workshop (Oct. 2000)
  • Developed more visualization techniques for GeoFEM
  • Improved parallel performance
  • Please visit our poster for details!

5
Overview
  • Visualization Subsystem in GeoFEM
  • Newly Developed Parallel Volume Rendering (PVR)
  • Algorithm
  • Parallel/Vector Efficiency
  • Examples
  • Future Work

6
Parallel Visualization: File Version, or DEBUGGING Version
[Diagram] Visualization on the client; includes simplification, combination, etc.
7
Large-Scale Data in GeoFEM
1 km x 1 km x 1 km mesh for a 1000 km x 1000 km x 100 km "local" region
1000 x 1000 x 100 = 10^8 grid points
1 GB/variable/time step, 10 GB/time step for 10 variables
TB scale for 100 time steps!!
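(As a rough check, assuming double precision: 10^8 grid points x 8 bytes per value is about 0.8 GB, i.e. roughly 1 GB per variable per time step.)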
8
Parallel Visualization: Memory/Concurrent Version
Dangerous if the detailed physics is not yet clear
9
Parallel Visualization Techniques in GeoFEM
Scalar field: cross-sectioning, isosurface-fitting, surface-fitting, interval volume-fitting, topological map, volume rendering
Vector field: streamlines, particle tracking, LIC, volume rendering
Tensor field: hyperstreamlines
In the following, we take the Parallel Volume Rendering module as an example to demonstrate our strategies for improving parallel performance.
Available June 2002: http://geofem.tokyo.rist.or.jp/
10
  • Visualization Subsystem in GeoFEM
  • Newly Developed Parallel Volume Rendering (PVR)
  • Algorithm
  • Parallel/Vector Efficiency
  • Examples
  • Future Work

11
Design of Visualization Methods
Principle: take account of parallel performance, huge data size, and unstructured grids.
Classification of current volume rendering methods:
  • Grid type: regular, curvilinear, unstructured
  • Traversal approach: image-order volume rendering (ray casting), object-order volume rendering (cell projection), hybrid-order volume rendering
  • Composition approach: from front to back, from back to front
  • Projection: parallel, perspective
12
Design of Visualization Methods
Principle: take account of running concurrently with the computational process.
Classification of parallelism:
  • Object-space parallelism: partition object space; each PE gets a portion of the dataset and computes an image of its sub-volume.
  • Image-space parallelism: partition image space; each PE computes a portion of the whole image.
  • Time-space parallelism: partition time; each PE computes the images of several time steps.
13
Design for Parallel Volume Rendering
Grid candidates: unstructured, locally refined, octree/hierarchical
Why not an unstructured grid?
  • Hard to build a hierarchical structure
  • Connectivity information has to be found beforehand
  • An unstructured grid makes image composition and load balancing difficult
  • Irregular cell shapes make sampling slower
Why not a regular grid?
  • Large storage requirement
  • Slows down the volume rendering process

14
Parallel Transformation: Unstructured to Hierarchical
One solution: FEM data (unstructured) -> resampling -> hierarchical data -> ray-casting PVR -> VR image
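As a rough illustration of the resampling step (a brute-force nearest-node sketch with hypothetical arrays; the actual module interpolates inside elements and accelerates the point search), each grid sample takes the value of the closest FEM node:

  /* Brute-force sketch of resampling FEM nodal data onto a regular grid
     (hypothetical arrays; not the GeoFEM implementation). */
  #include <float.h>

  void resample_nearest(const double *xyz, const double *val, int n_node,
                        const double *grid_xyz, double *grid_val, int n_grid)
  {
      for (int g = 0; g < n_grid; g++) {
          double best = DBL_MAX;
          int best_n = 0;
          for (int n = 0; n < n_node; n++) {
              double dx = grid_xyz[3*g]   - xyz[3*n];
              double dy = grid_xyz[3*g+1] - xyz[3*n+1];
              double dz = grid_xyz[3*g+2] - xyz[3*n+2];
              double d2 = dx*dx + dy*dy + dz*dz;
              if (d2 < best) { best = d2; best_n = n; }
          }
          grid_val[g] = val[best_n];      /* nearest FEM node's value */
      }
  }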
15
Accelerated Ray-casting PVR
For hierarchical datasets:
  Build branch-on-need octree
  Determine sampling and mapping parameters (VR parameters)
  Generate subimages on each PE:
    for each subvolume
      for j = startj to endj
        for i = starti to endi
          quickly find the voxels intersected by ray (i, j)
          compute (r, g, b) at each intersected voxel based on the volume illumination model and transfer functions
          compute (r, g, b) for pixel (i, j) by front-to-back composition
  Build the topological structure of the subvolumes on all PEs
  Composite subimages from front to back
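As an illustration of the front-to-back composition step, here is a minimal C sketch (hypothetical types and names, not the actual GeoFEM routine): each sample's color is weighted by the remaining transparency, and the ray can terminate early once the accumulated opacity is close to 1.

  /* Minimal sketch of front-to-back (r,g,b,a) composition along one ray. */
  typedef struct { float r, g, b, a; } RGBA;

  /* samples[] holds the shaded, classified samples hit by ray (i,j),
     already sorted from front to back. */
  RGBA composite_front_to_back(const RGBA *samples, int n)
  {
      RGBA pixel = {0.0f, 0.0f, 0.0f, 0.0f};
      for (int s = 0; s < n; s++) {
          float t = 1.0f - pixel.a;            /* remaining transparency */
          pixel.r += t * samples[s].a * samples[s].r;
          pixel.g += t * samples[s].a * samples[s].g;
          pixel.b += t * samples[s].a * samples[s].b;
          pixel.a += t * samples[s].a;         /* accumulated opacity    */
          if (pixel.a > 0.99f) break;          /* early ray termination  */
      }
      return pixel;
  }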
16
  • Visualization Subsystem in GeoFEM
  • Newly Developed Parallel Volume Rendering (PVR)
  • Algorithm
  • Parallel/Vector Efficiency
  • Examples
  • Future Work

17
SMP Cluster Type Architectures
  • Earth Simulator
  • ASCI hardware
  • Various types of communication and parallelism:
  • Inter-SMP node, intra-SMP node, individual PE

18
Optimum Programming Models for the Earth Simulator?
Candidates at each level:
  • Inter-node: MPI or HPF
  • Intra-node: MPI, F90 directives (OpenMP), or HPF
  • Each PE: F90
[Diagram] SMP node: shared memory with multiple PEs
19
Three-Level Hybrid Parallelization
Flat MPI parallelization: each PE is independent.
Hybrid parallel programming model based on the memory hierarchy:
  • Inter-SMP node: MPI
  • Intra-SMP node: OpenMP for parallelization
  • Individual PE: compiler directives for vectorization/pseudo-vectorization
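A minimal sketch of how the three levels fit together (hypothetical example, not GeoFEM source): MPI between SMP nodes, OpenMP threads inside a node, and a long stride-1 innermost loop that each PE can (pseudo-)vectorize.

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);              /* inter-node level: one MPI process per SMP node */

      long n_local = 1000000;              /* this process's share of the data */
      double *x = malloc(n_local * sizeof *x);
      for (long i = 0; i < n_local; i++) x[i] = (double)i;

      double local_sum = 0.0, global_sum = 0.0;
      /* intra-node level: OpenMP threads over the PEs of the node */
      #pragma omp parallel for reduction(+:local_sum)
      for (long i = 0; i < n_local; i++) {
          /* per-PE level: long, dependency-free loop body; the actual
             vectorization directive is compiler-specific and omitted here */
          local_sum += 2.0 * x[i];
      }

      /* inter-node level: combine the per-node results over MPI */
      MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      free(x);
      MPI_Finalize();
      return 0;
  }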
20
Flat MPI vs. OpenMP/MPI Hybrid
Flat MPI: each PE is an independent MPI process
Hybrid: hierarchical structure (MPI between SMP nodes, shared memory within a node)
[Diagram] Two SMP nodes, each with shared memory and multiple PEs
21
Three-Level Hybrid Parallelization
Previous work on hybrid parallelization:
  • R. Falgout and J. Jones, "Multigrid on Massively Parallel Architectures", 1999.
  • F. Cappello and D. Etiemble, "MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks", 2000.
  • K. Nakajima and H. Okuda, "Parallel Iterative Solvers for Unstructured Grids using a Directive/MPI Hybrid Programming Model for the GeoFEM Platform on SMP Cluster Architectures", 2001.
All of these are in the computational research area; no visualization papers have been found on this topic.
Previous parallel visualization methods, classified by platform:
  • Shared-memory machines: J. Nieh and M. Levoy, 1992; P. Lacroute, 1996
  • Distributed-memory machines: U. Neumann, 1993; C. M. Wittenbrink and A. K. Somani, 1997
  • SMP cluster machines: almost no papers found
22
SMP Cluster Architecture
The Earth Simulator: 640 SMP nodes, with 8 vector processors in each SMP node
23
Three-Level Hybrid Parallelization
Criteria for achieving high parallel performance:
  • Local operation and no global dependency
  • Continuous memory access
  • Sufficiently long loops
24
Vectorization for Each PE
Construct vectorizable loops
Combine short loops into one long loop by reordering:
  /* before: short nested loops */
  for (i = 0; i < MAX_N_VERTEX; i++)
    for (j = 0; j < 3; j++)
      p[i][j] = ...;
  /* after: one long loop */
  for (i = 0; i < MAX_N_VERTEX * 3; i++)
    p[i/3][i%3] = ...;
Exchange the innermost and outer loops to make the innermost loop longer:
  /* before: short innermost loop */
  for (i = 0; i < MAX_N_VERTEX; i++)
    for (j = 0; j < 3; j++)
      p[i][j] = ...;
  /* after: long innermost loop */
  for (j = 0; j < 3; j++)
    for (i = 0; i < MAX_N_VERTEX; i++)
      p[i][j] = ...;
Avoid tree and singly/doubly linked data structures, especially in the inner loop.
25
Intra-SMP Node Parallelization: OpenMP (http://www.openmp.org)
Multi-coloring for removing the data race (Nakajima et al., 2001)
Example: gradient computation in PVR
  #pragma omp parallel for
  for (i = 0; i < num_element; i++) {
    /* compute the Jacobian matrix of the shape function */
    for (j = 0; j < 8; j++)
      for (k = 0; k < 8; k++)
        /* accumulate the gradient value of vertex j contributed by vertex k */
  }
[Diagram] Elements colored and distributed over PE0-PE3
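The loop above races when two elements handled by different threads share a vertex; multi-coloring removes the race. A minimal sketch with hypothetical arrays (not the GeoFEM implementation): elements are grouped into colors so that no two elements of one color share a vertex, colors are processed one after another, and the elements of a color are updated in parallel.

  /* color_ptr[c] .. color_ptr[c+1]-1 indexes the elements of color c in
     elem_of_color[]; elements of one color share no vertex, so the
     scatter-add into grad[] inside the parallel loop is race-free. */
  void accumulate_gradients(int n_color, const int *color_ptr,
                            const int *elem_of_color,
                            const int (*elem_vertex)[8],
                            const double (*elem_grad)[8],
                            double *grad)
  {
      for (int c = 0; c < n_color; c++) {            /* colors: sequential */
          #pragma omp parallel for
          for (int e = color_ptr[c]; e < color_ptr[c + 1]; e++) {
              int ie = elem_of_color[e];             /* element of this color */
              for (int j = 0; j < 8; j++)            /* 8 vertices per hexahedron */
                  grad[elem_vertex[ie][j]] += elem_grad[ie][j];
          }
      }
  }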
26
Inter-SMP Node Parallelization: MPI, using the parallel data structure in GeoFEM
[Diagram] Internal nodes, external nodes, and communication between domains
Overlapped elements are used to reduce communication among SMP nodes; overlap removal is necessary for the final results.
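A minimal sketch of the resulting exchange with one neighboring SMP node (hypothetical buffers and counts; GeoFEM uses its own communication tables): values of our external nodes are received from the neighbor that owns them as internal nodes, while our internal values in the overlap are sent out.

  #include <mpi.h>

  void exchange_overlap(double *send_buf, int n_send,
                        double *recv_buf, int n_recv,
                        int neighbor_rank, MPI_Comm comm)
  {
      MPI_Request req[2];
      /* post a receive for the values of our external nodes ... */
      MPI_Irecv(recv_buf, n_recv, MPI_DOUBLE, neighbor_rank, 0, comm, &req[0]);
      /* ... and send the values of our internal nodes the neighbor overlaps */
      MPI_Isend(send_buf, n_send, MPI_DOUBLE, neighbor_rank, 0, comm, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
  }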
27
Dynamic Load Repartition
Why?
Rendered voxels often accumulate in a small portion of the field during visualization.
Initial partition on each PE (same as for the analysis computation): an almost equal number of voxels.
Load on each PE during the PVR process: the number of rendered voxels, which depends on
  • the number of non-empty voxels
  • the opacity transfer functions
  • the viewpoint
Load balance during PVR: keep an almost equal number of rendered voxels on each PE -> dynamic repartition.
28
Dynamic Load Repartition
Most previous methods: scattered decomposition (K.-L. Ma et al., 1997)
  • Advantage: can achieve very good load balance easily
  • Disadvantages:
    • A large amount of intermediate results has to be stored
    • Large extra memory
    • Large extra communication
Our approach (see the sketch below): assign several contiguous subvolumes to each PE
  • Count the number of rendered voxels during grid transformation
  • Move a subvolume from a PE with a larger number of rendered voxels to another PE with a smaller number
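A greedy sketch of the bookkeeping behind such moves (hypothetical structures; the actual GeoFEM policy and the MPI transfer of the subvolume data are not shown): while the imbalance exceeds a tolerance, move one subvolume from the most loaded PE to the least loaded PE.

  #define MAX_SUB 64

  /* load[p]: number of rendered voxels on PE p; own[p]: subvolumes currently
     assigned to PE p, each with its own rendered-voxel count. */
  typedef struct { int n_sub; int sub_id[MAX_SUB]; long sub_voxels[MAX_SUB]; } PEOwn;

  void repartition(long *load, PEOwn *own, int n_pe, double tol, int max_moves)
  {
      for (int it = 0; it < max_moves; it++) {
          int max_pe = 0, min_pe = 0;
          for (int p = 1; p < n_pe; p++) {           /* find most/least loaded PEs */
              if (load[p] > load[max_pe]) max_pe = p;
              if (load[p] < load[min_pe]) min_pe = p;
          }
          if (load[max_pe] <= (long)((1.0 + tol) * load[min_pe])) break;  /* balanced enough */
          if (own[max_pe].n_sub <= 1 || own[min_pe].n_sub >= MAX_SUB) break;

          /* move one subvolume from the most loaded PE to the least loaded PE */
          int k = --own[max_pe].n_sub;
          int m = own[min_pe].n_sub++;
          own[min_pe].sub_id[m]     = own[max_pe].sub_id[k];
          own[min_pe].sub_voxels[m] = own[max_pe].sub_voxels[k];
          load[max_pe] -= own[max_pe].sub_voxels[k];
          load[min_pe] += own[max_pe].sub_voxels[k];
      }
  }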

29
Dynamic Load Repartition
Assign several contiguous subvolumes to each PE:
  • Count the number of rendered voxels during grid transformation
  • Move a subvolume from a PE with a larger number of rendered voxels to another PE with a smaller number
[Diagram] Initial partition vs. repartition of subvolumes over PE0-PE3
30
  • Visualization Subsystem in GeoFEM
  • Newly Developed Parallel Volume Rendering (PVR)
  • Algorithm
  • Parallel/Vector Efficiency
  • Examples
  • Future Work

31
Speedup Test 1
Purpose: demonstrate the effect of three-level hybrid parallelization
Dataset: Pin Grid Array (PGA) dataset (data courtesy of H. Okuda and S. Ezure)
Simulates the Mises stress distribution on the pin grid board by linear elastostatic analysis
Data size: 7,869,771 nodes and 7,649,024 elements
Running environment: SR8000
  • Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory
  • Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory
32
Speedup Test 1
[Figure: top view and bottom view] Volume-rendered images showing the equivalent scalar value of stress from the linear elastostatic analysis of the PGA dataset with 7,869,771 nodes and 7,649,024 elements (data courtesy of H. Okuda and S. Ezure).
33
Speedup Test 1
[Figure] Comparison of speedup between flat MPI and the hybrid parallel method for our parallel volume rendering module.
Speed-up from the original (MPI) to the vector version (hybrid) on 1 PE: 4.30 (128^3 uniform cubes for PVR)
34
Speedup Test 2
Purpose: demonstrate the effect of three-level hybrid parallelization
Test dataset: core dataset (data courtesy of H. Matsui, GeoFEM)
Simulates thermal convection in a rotating spherical shell
Data size: 257,414 nodes and 253,440 elements
Test module: Parallel Surface Rendering module
Running environment: SR8000
  • Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory
  • Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory
35
Speedup Test 2
[Figure] Pressure isosurfaces and temperature cross-sections for the core dataset with 257,414 nodes and 253,440 elements. The speedup of our three-level parallel method is 231.7 on 8 nodes (64 PEs) of the SR8000.
36
Speedup Test 2
[Figure] Comparison of speedup between flat MPI and the hybrid parallel method for our parallel surface rendering module.
Speed-up from the original (MPI) to the vector version (hybrid) on 1 PE: 4.00
37
Speedup Test 3
Purpose: demonstrate the effect of dynamic load repartition
Dataset: underground water dataset
Simulates groundwater flow and convection/diffusion transport through heterogeneous porous media
200 x 100 x 100 region; different water conductivity; meshes of 16,000 / 128,000 / 1,024,000 elements (Δh = 5.00 / 2.50 / 1.25); 100 time steps
Running environment: Compaq Alpha 21164 cluster (8 PEs, 600 MHz/PE, 512 MB RAM/PE)
Result for mesh 3 (1,024,000 cubes, 100 time steps):
  • Without dynamic load repartition: 8.15 seconds per time step on average
  • With dynamic load repartition: 3.37 seconds per time step on average
38
Speedup Test 3
[Figure] Groundwater flow channel: effects of convection diffusion for different mesh sizes (Δh = 5.00, 2.50, 1.25)
39
Application (2)
  • Flow/transport simulation
  • 50 x 50 x 50 region
  • Different water conductivity for each (Δh = 5)^3 cube
  • df/dx = 0.01, f = 0 at x = xmax
  • 100^3 meshes, Δh = 0.50
  • 64 PEs, Hitachi SR2201

40
Parallel Performance: Convection Diffusion
  • 13,280 steps for 200 time units
  • 10^6 meshes, 1,030,301 nodes
  • 3,984 sec. elapsed time, including communication, on a Hitachi SR2201 with 64 PEs
  • 3,934 sec. of real CPU time
  • 98.7% parallel performance

41
Convection Diffusion Visualization by PVR
[Figure] Groundwater flow channel
42
Conclusions and Future Work
Improving the parallel performance of the visualization subsystem in GeoFEM:
  • Improved the parallel performance of the visualization algorithms
  • Three-level hybrid parallelization based on the SMP cluster architecture:
    • Inter-SMP node: MPI
    • Intra-SMP node: OpenMP for parallelization
    • Individual PE: compiler directives for vectorization/pseudo-vectorization
  • Dynamic load balancing
Future work: tests on the Earth Simulator (http://www.es.jamstec.go.jp/)