Title: Towards Personal High-Performance Geospatial Computing (HPC-G): Perspectives and a Case Study
- Jianting Zhang
- Department of Computer Science, the City College of New York
- jzhang_at_cs.ccny.cuny.edu
2. Outline
- Introduction
- Geospatial Data, GIS, Spatial Databases and HPC
  - Geospatial data: what's special?
  - GIS: impacts of hardware architectures
  - Spatial databases: parallel DB or MapReduce?
  - HPC: many options
- Personal HPC-G: A New Framework
  - Why Personal HPC for geospatial data?
  - GPGPU computing: a brief introduction
  - Pipelining CPU and GPU workloads for performance
  - Parallel GIS prototype development strategies
- A Case Study: Geographically Weighted Regression
- Summary and Conclusions
3. Ecological Informatics
- Computer Science
- Spatial Databases
- Mobile Computing
- Data Mining
Geography GIS Applications Remote Sensing
4. Introduction: Personal Stories
- Computationally intensive problems in geospatial data processing
  - Distributed hydrological modeling/flood simulation
  - Satellite image processing: clustering/classification (multi-/hyper-spectral)
  - Identifying and tracking storms from time-series NEXRAD images
  - Species distribution modeling (e.g., regression/GA-based)
- History of access to HPC resources
  - 1994: Simulating a 33-hour flood on a PC (33 MHz/4 MB) took 50 hours
  - 2000: A Cray machine was available, but a special arrangement was required to access it while taking a course (Parallel and Distributed Processing)
  - 2004-2007: HPC resources at SDSC were available to the SEEK project, but the project ended up using only SRB for data/metadata storage
  - 2009-2010: An Nvidia Quadro FX 3700 GPU card (that came with a Dell workstation) gave a 23X speedup after porting a serial CPU codebase (for SSDBM'10) to the CUDA platform (ACM-GIS'10)
5. Two books that changed my research focus (as a database person)
[Book covers: 2nd edition? 4th edition]
http://courses.engr.illinois.edu/ece498/al/
As well as a few visionary database research papers
- David J. DeWitt, Jim Gray: Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM 35(6): 85-98 (1992)
- Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, David A. Wood: DBMSs on a Modern Processor: Where Does Time Go? VLDB 1999: 266-277
- J. Cieslewicz and K.A. Ross: Database Optimizations for Modern Hardware. Proceedings of the IEEE, 96(5), 2008
6. Introduction: PGIS in Traditional HPC Environment
A. Clematis, M. Mineter, and R. Marciano. High performance computing with geographical data. Parallel Computing, 29(10):1275-1279, 2003
- Despite all these initiatives, the impact of parallel GIS research has remained slight
- The anticipated performance plateau became a mountain still being scaled
- GIS companies found that, other than for concurrency in databases, their markets did not demand multi-processor performance
- While computing in general demands less of its users, HPC has demanded more: the barriers to use remain high and the range of options has increased
- The fundamental problem remains the fact that creating parallel GIS operations is non-trivial and there is a lack of parallel GIS algorithms, application libraries and toolkits
If parallel GIS runs in a personal computing environment, to what degree will these conclusions change?
7. Introduction: PGIS in Personal Computing Environment
- Every personal computer is now a parallel machine
- Chip multiprocessors (CMP): dual-core, quad-core, six-core CPUs
  - Intel Xeon E5520: $379.99
  - 4 cores/8 threads, 2.26 GHz, 80 W
  - 4x256 KB L2 cache, 8 MB L3 cache
  - Max memory bandwidth: 25.6 GB/s
- Massively parallel GPGPU computing: hundreds of GPU cores in a GPU card
  - Nvidia GTX 480: $499.99
  - 480 cores (15x1024 threads), 700/1401 MHz, 250 W, about 1.35 TFLOPS
  - 15x32768 registers, 15x64 KB shared memory/L1 cache, 768 KB L2 cache, additional constant/texture memory
  - 1.5 GB GDDR5, 1848 MHz clock rate, 384-bit memory interface width, 177.4 GB/s memory bandwidth
If these parallel computing powers are fully utilized, to what degree can a personal workstation match a traditional cluster for geospatial data processing?
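The headline GTX 480 figures can be cross-checked with simple arithmetic. The sketch below assumes one fused multiply-add (counted as 2 floating-point operations) per core per shader clock cycle, and two memory transfers per cycle at the quoted 1848 MHz clock. Both are assumptions about how the peak numbers were derived, not statements from the slides; the function names are illustrative.

```python
# Back-of-the-envelope check of the GTX 480 peak figures quoted on the slide.
# Assumptions (not from the slides): 2 FLOPs (one fused multiply-add) per core
# per shader clock, and two transfers per cycle at the quoted memory clock.


def peak_tflops(cores, shader_clock_mhz, flops_per_cycle=2):
    """Theoretical single-precision peak in TFLOPS."""
    return cores * shader_clock_mhz * 1e6 * flops_per_cycle / 1e12


def peak_bandwidth_gbs(bus_width_bits, clock_mhz, transfers_per_cycle=2):
    """Theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_cycle / 1e9


gtx480_tflops = peak_tflops(cores=480, shader_clock_mhz=1401)
gtx480_bandwidth = peak_bandwidth_gbs(bus_width_bits=384, clock_mhz=1848)

print(f"peak compute:   {gtx480_tflops:.3f} TFLOPS")
print(f"peak bandwidth: {gtx480_bandwidth:.1f} GB/s")
```

Under these assumptions the results land close to the quoted 1.35 TFLOPS and 177.4 GB/s. These are theoretical ceilings; sustained application performance is typically lower.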
8. Geospatial data: what's special?
- The slowest processing unit determines the overall performance in parallel computing
- Real-world data are very often skewed
  - Wavelet-compressed raster data
  - Clustered point data
9. Geospatial data: what's special?
- Techniques to handle skewness
  - Data decomposition/partitioning -> spatial indexing
  - Task scheduling
- The complexity of task scheduling grows fast with the number of tasks, and generic scheduling heuristics may not always produce good results
- Simple equal-size partitioning may work well for local operations, but may not for focal, zonal and global operations, which require more sophisticated partitioning to achieve load balancing
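To make the effect of skew concrete, the following minimal sketch compares a naive equal-count partition with a greedy longest-processing-time (LPT) assignment. The tile costs, the worker count and the function names are hypothetical, and LPT is only one generic heuristic chosen for illustration; it is not a method proposed in the slides.

```python
# Skewed per-tile workloads: compare naive equal-count partitioning with a
# greedy longest-processing-time (LPT) assignment. All numbers are illustrative.
import heapq


def makespan_equal_chunks(costs, workers):
    """Split tasks into contiguous equal-count chunks; return the slowest worker's load."""
    chunk = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    return max(loads)


def makespan_lpt(costs, workers):
    """Assign tasks, largest first, to the currently least-loaded worker."""
    loads = [0] * workers  # min-heap of per-worker loads
    for cost in sorted(costs, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)


# Hypothetical tile costs: a dense cluster in the first four tiles, sparse elsewhere.
tile_costs = [90, 80, 70, 60] + [5] * 12
workers = 4

naive = makespan_equal_chunks(tile_costs, workers)
balanced = makespan_lpt(tile_costs, workers)
ideal = sum(tile_costs) / workers

print(f"equal-count partition makespan: {naive}")
print(f"greedy LPT makespan:            {balanced}")
print(f"perfectly balanced load:        {ideal}")
```

Because the slowest worker determines the overall time, the equal-count split is dominated by the worker that receives the dense cluster, while the cost-aware assignment spreads that cluster across workers.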
10. GIS: impacts of hardware architectures
- GIS have been evolving along with mainstream information technologies
  - Major platform shift from Unix workstations to Windows PCs in the early 1990s
  - The marriage with Web technologies to create Web-GIS in the late 1990s
- Will GIS naturally evolve from serial to parallel as computers evolve from uniprocessor to chip multiprocessor?
- What can the community do to speed up the evolution?
11. GIS: impacts of hardware architectures
- Three roles of GIS
  - Data management
  - Information visualization
  - Modeling support
- GIS-based spatial modeling, such as agent-based modeling, is naturally suitable for HPC
  - Computationally intensive
  - Adopts a raster tessellation and mostly involves local operations and/or focal operations with a small, constant number of neighbors: parallelization-friendly or even embarrassingly parallel
  - Runs in an offline mode and uses traditional GIS for visualization
- How can we make full use of hardware and support data management and information visualization more efficiently and effectively?
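The "embarrassingly parallel" character of focal operations can be illustrated with a 3x3 focal mean. In this minimal sketch, each output row is computed from read-only input, so rows can be handed to independent workers with no coordination. The function names and the tiny raster are hypothetical, and the thread pool only demonstrates the decomposition; it is not a performance claim for Python.

```python
# 3x3 focal mean on a raster stored as a list of rows.
# Each output row depends only on read-only input, so rows are independent tasks.
from concurrent.futures import ThreadPoolExecutor


def focal_mean_row(raster, r):
    """3x3 focal mean for every cell in row r; edge cells use the neighbors that exist."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for c in range(cols):
        window = [raster[i][j]
                  for i in range(max(0, r - 1), min(rows, r + 2))
                  for j in range(max(0, c - 1), min(cols, c + 2))]
        out.append(sum(window) / len(window))
    return out


def focal_mean(raster, workers=4):
    """Compute all rows concurrently; no locking is needed because nothing is shared for writing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: focal_mean_row(raster, r), range(len(raster))))


raster = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
result = focal_mean(raster)
for row in result:
    print(row)
```

On a GPU the same decomposition would typically go one step further, with one thread per output cell rather than one task per row.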
12. HPC: many options
- The combination of architectural and organizational enhancements led to 16 years of sustained growth in performance at an annual rate of 50% from 1986 to 2002. Due to the combined power, memory and instruction-level parallelism problems, the growth rate dropped to about 20% per year from 2002 to 2006.
- In 2004, Intel cancelled its high-performance uniprocessor projects and joined IBM and Sun in declaring that the road to higher performance would be via multiple processors per chip (chip multiprocessors, CMP) rather than via faster uniprocessors.
- As a marketing strategy, Nvidia calls a personal computer equipped with one or more of its high-end GPGPU cards a "personal supercomputer". Nvidia claimed that, compared to the latest quad-core CPU, Tesla 20-series GPU computing processors deliver equivalent performance at 1/20th of the power consumption and 1/10th of the cost.
13. HPC: many options
- (1) CPU multi-cores
- (2) GPU many-cores
- (3) CPU multi-nodes (traditional HPC)
- (4) CPU+GPU multi-nodes: (2)+(3)
- How about (1)+(2)? -> Personal HPC
  - Affordable and dedicated personal computing environment
  - No additional cost: use it or waste it
  - Excellent visualization and user interaction support
  - Can be the last mile of a larger cyberinfrastructure
  - Data structures/algorithms/software are critical to the success
14. Personal HPC-G: A New Framework
- Additional arguments to advocate for Personal HPC for geospatial data
  - While some geospatial data processing tasks are computationally intensive, many more are data intensive in nature
  - Distributing large data chunks incurs significant network and disk I/O overheads (50-100 MB/s)
  - Make full use of the high interface bandwidths between CPU cores and CPU memory (10-30 GB/s), between CPU memory and GPU memory (8 GB/s), and between GPU cores and GPU memory (100-200 GB/s)
- The improved CPU+GPU performance will not only solve old problems faster but also allow many traditionally offline data processing tasks to run online in an interactive manner. The uninterrupted exploration processes are likely to facilitate novel scientific discoveries more effectively.
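The bandwidth argument can be made tangible by asking how long it takes to move a dataset across each link. This is a minimal sketch using the bandwidth ranges quoted on the slide; the 1 GB dataset size and the helper names are hypothetical, and latency and protocol overheads are ignored.

```python
# Time to move a dataset across each link, using the bandwidth ranges quoted
# on the slide. The 1 GB dataset size is a hypothetical example.
LINKS_GB_PER_S = {
    "network/disk I/O": (0.05, 0.1),        # 50-100 MB/s
    "CPU cores <-> CPU memory": (10, 30),
    "CPU memory <-> GPU memory": (8, 8),
    "GPU cores <-> GPU memory": (100, 200),
}


def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Idealized transfer time: size divided by bandwidth."""
    return size_gb / bandwidth_gb_per_s


def transfer_time_ranges(size_gb):
    """Return {link: (best_case_seconds, worst_case_seconds)} for a dataset of size_gb."""
    return {
        link: (transfer_seconds(size_gb, high), transfer_seconds(size_gb, low))
        for link, (low, high) in LINKS_GB_PER_S.items()
    }


times = transfer_time_ranges(1.0)
for link, (best, worst) in times.items():
    print(f"{link:28s} {best:8.4f} s to {worst:8.4f} s")
```

Under these idealized assumptions, shipping the data over the network or from disk is two to three orders of magnitude slower than moving it within a single workstation, which is the core of the data-intensive argument for Personal HPC.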
15. Why Personal HPC for geospatial data?
High-level comparisons among Cluster Computing, Cloud Computing and Personal HPC

Criterion                 | Cluster Computing | Cloud Computing | Personal HPC
Initial cost              | High              | Low             | Low
Operational cost          | High              | Medium          | Low
End user control          | Low               | High            | High
Theoretical scalability   | High              | High            | Medium
User code development     | Medium            | Low             | High
Data management           | Low               | Medium          | Medium
Numeric modeling          | High              | Medium          | High
Interaction/visualization | Low               | Low             | High
16. Spatial databases: parallel DB or MapReduce?
- Spatial databases: GIS without a GUI
- Learn lessons from relational databases on parallelization
  - The debates between parallel DB and MapReduce
  - The emergence of hybrid approaches (e.g., HadoopDB)
- While parallel processing of geospatial data to achieve high performance has been a research topic for quite a while, neither approach has been extensively applied to practical large-scale geospatial data management
- Call for pilot studies that experiment with the two approaches to provide insights for future synthesis
17. GPGPU Computing
- Nvidia CUDA: Compute Unified Device Architecture
- AMD/ATI: Stream Computing
18. Parallel GIS prototype development strategies
- We envision that Personal HPC-G provides an opportunity to evolve traditional GIS into parallel GIS gradually. Community research and development efforts are needed to speed up the evolution.
- First, we propose to learn from existing parallel geospatial data processing algorithms and adapt them to CMP CPU and GPU architectures.
- Second, we suggest studying existing GIS modules (e.g., ArcGIS geoprocessing tools) carefully, identifying the most frequently used ones, and developing parallel code for multicore CPUs and many-core GPUs.
- Third, while existing database research on CMP CPU and GPU architectures is still relatively limited, it can be the starting point for investigating how geospatial data management can be realized on the new architectures and their hybridization.
- Finally, reuse existing CMP- and GPU-based software codebases developed by the computer vision and computer graphics communities.
19. GWR Case Study
- A conceptual design for efficiently implementing GWR on the CUDA GPGPU computing architecture (preliminary in nature)
- Being realized by a master's student at CCNY
  - Good C/C++ programming skills
  - New to GPGPU/CUDA programming
  - Supported for 5 hours per week through a tiny grant (an experiment on what $2000 can contribute to PGIS development)
20. GWR Case Study
- GWR extends the traditional regression framework by allowing local parameters to be estimated
- Given a neighborhood definition (or bandwidth) for a data item, a traditional regression can be applied to the data items that fall into the neighborhood or region
- The correlation coefficients for all the geo-referenced data items (raster cells or points) form a scalar field that can be visualized and interactively explored
- By interactively changing some GWR parameters (e.g., bandwidth) and visually exploring the changes in the corresponding scalar fields, users can gain a better understanding of the distributions of GWR statistics and of the original dataset
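The idea of fitting a traditional regression inside each neighborhood can be sketched as follows. This is a simplified illustration: it uses a hard distance cutoff (every neighbor inside the bandwidth counts equally), whereas GWR implementations commonly apply distance-decay kernel weights. The data layout, the function name and the sample points are hypothetical.

```python
# Local simple linear regression within a fixed bandwidth (box-kernel sketch).
import math


def local_regression(points, center, bandwidth):
    """Ordinary least squares y = a + b*x over points within `bandwidth` of `center`.

    points: list of (px, py, x, y) tuples: location (px, py), predictor x, response y.
    Returns (a, b), or None if the neighborhood has fewer than 2 points or x is constant.
    """
    cx, cy = center
    nbrs = [(x, y) for px, py, x, y in points
            if math.hypot(px - cx, py - cy) <= bandwidth]
    n = len(nbrs)
    if n < 2:
        return None
    sx = sum(x for x, _ in nbrs)
    sy = sum(y for _, y in nbrs)
    sxx = sum(x * x for x, _ in nbrs)
    sxy = sum(x * y for x, y in nbrs)
    denom = n * sxx - sx * sx
    if denom == 0:
        return None
    b = (n * sxy - sx * sy) / denom   # local slope
    a = (sy - b * sx) / n             # local intercept
    return a, b


# Hypothetical data: two spatial clusters with different local relationships.
points = [
    (0, 0, 1, 2), (0, 1, 2, 4), (1, 0, 3, 6),      # west cluster: y = 2x
    (10, 0, 1, 9), (10, 1, 2, 8), (11, 0, 3, 7),   # east cluster: y = 10 - x
]
west = local_regression(points, center=(0, 0), bandwidth=2.0)
east = local_regression(points, center=(10, 0), bandwidth=2.0)
print("west (intercept, slope):", west)
print("east (intercept, slope):", east)
```

A single global regression over all six points would blur the two relationships together; the local fits recover a different slope in each cluster, which is the kind of spatial variation the scalar fields are meant to expose.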
21. GWR is computationally intensive
[Figure: example raster grids for the dependent variable and the independent variable]
Using an n x n moving window to compute correlation coefficients (n = 3). The correlation coefficient at the dotted cell is r = 0.84.
Point data are usually clustered, which makes load balancing very difficult.
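The moving-window computation can be sketched in a few lines. The example below computes the Pearson correlation between two rasters over the n x n window centered on a given cell; the rasters used here are hypothetical and are not the values from the slide's figure, and the function names are illustrative.

```python
# Pearson correlation between two rasters over an n x n moving window.
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either variable is constant."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den if den else float("nan")


def window_correlation(dep, ind, row, col, n=3):
    """Correlation over the n x n window centered at (row, col); cells outside the raster are skipped."""
    h = n // 2
    xs, ys = [], []
    for i in range(row - h, row + h + 1):
        for j in range(col - h, col + h + 1):
            if 0 <= i < len(dep) and 0 <= j < len(dep[0]):
                xs.append(ind[i][j])
                ys.append(dep[i][j])
    return pearson(xs, ys)


# Hypothetical rasters: the dependent raster is an exact linear function of the other.
ind = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dep = [[2 * v + 1 for v in row] for row in ind]
r_center = window_correlation(dep, ind, row=1, col=1)
print(f"r at the center cell: {r_center:.2f}")
```

Every cell repeats this work over its own window, so the total cost grows with the number of cells times the window size. For rasters the per-cell work is uniform; for clustered point data the neighborhood sizes vary, which is where load balancing becomes difficult.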
22. GWR Case Study: Overall Design
23. GWR Case Study: From partial to total statistics
Let $S_1 = n\sum x_i y_i$, $S_2 = \sum x_i$, $S_3 = \sum y_i$, $S_4 = n\sum x_i^2$ and $S_5 = n\sum y_i^2$; then $f$ can be computed from $n$ and $S_1$ through $S_5$. Assuming that data items $D_1, D_2, \ldots, D_n$ are divided into $m$ groups and each group $j$ has computed its partial statistics $n_j$, $S_{1j}$, $S_{2j}$, $S_{3j}$, $S_{4j}$ and $S_{5j}$, then $f$ can be computed from them as follows ($j = 1, \ldots, m$):
$n = \sum_j n_j$, $S_1 = n\sum_j (S_{1j}/n_j)$, $S_2 = \sum_j S_{2j}$, $S_3 = \sum_j S_{3j}$, $S_4 = n\sum_j (S_{4j}/n_j)$, $S_5 = n\sum_j (S_{5j}/n_j)$.
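The merge of partial statistics can be checked numerically. The sketch below uses the definitions S1 = n·Σxy, S2 = Σx, S3 = Σy, S4 = n·Σx², S5 = n·Σy² for each group, combines the groups with the rules given here, and compares the result with a single pass over all data. It assumes that f is the Pearson correlation coefficient, r = (S1 − S2·S3) / sqrt((S4 − S2²)(S5 − S3²)); the slide does not spell out f, so that formula, the function names and the sample data are assumptions.

```python
# Combine per-group partial statistics into total statistics, then compute
# the correlation coefficient from the totals.
import math


def partial_stats(xs, ys):
    """Return (n_j, S1_j, S2_j, S3_j, S4_j, S5_j) for one non-empty group."""
    n = len(xs)
    return (n,
            n * sum(x * y for x, y in zip(xs, ys)),  # S1_j = n_j * sum(x*y)
            sum(xs),                                  # S2_j = sum(x)
            sum(ys),                                  # S3_j = sum(y)
            n * sum(x * x for x in xs),               # S4_j = n_j * sum(x^2)
            n * sum(y * y for y in ys))               # S5_j = n_j * sum(y^2)


def combine(parts):
    """Merge group statistics: undo each group's n_j scaling, add, rescale by the total n."""
    n = sum(p[0] for p in parts)
    s1 = n * sum(p[1] / p[0] for p in parts)
    s2 = sum(p[2] for p in parts)
    s3 = sum(p[3] for p in parts)
    s4 = n * sum(p[4] / p[0] for p in parts)
    s5 = n * sum(p[5] / p[0] for p in parts)
    return n, s1, s2, s3, s4, s5


def correlation(stats):
    """Pearson r from total statistics (the assumed meaning of f)."""
    _, s1, s2, s3, s4, s5 = stats
    return (s1 - s2 * s3) / math.sqrt((s4 - s2 * s2) * (s5 - s3 * s3))


# Hypothetical data split into two groups of unequal size.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
groups = [partial_stats(xs[:2], ys[:2]), partial_stats(xs[2:], ys[2:])]

r_merged = correlation(combine(groups))
r_direct = correlation(partial_stats(xs, ys))
print(f"merged: {r_merged:.6f}  direct: {r_direct:.6f}")
```

Because each group only needs to report six numbers, the per-group work can run independently (for example, one group per GPU thread block) and the final merge is cheap.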
24. Summary and Conclusions
- We aimed at introducing a new HPC framework for processing geospatial data in a personal computing environment, i.e., Personal HPC-G.
- We argued that the fast-increasing hardware capacities of modern personal computers equipped with chip multiprocessor CPUs and massively parallel GPU devices have made Personal HPC-G an attractive alternative to traditional cluster computing and newly emerging cloud computing for geospatial data processing.
- We used a parallel design of GWR on Nvidia CUDA-enabled GPU devices as an example to discuss how Personal HPC-G can be utilized to realize parallel GIS modules through synergistic software and hardware co-programming.
25. Q&A
jzhang_at_cs.ccny.cuny.edu