Parallel R (pR)

- For High Performance Statistical Computing

- Nagiza F. Samatova (ORNL)
- Srikanth Yoginath (ORNL)
- Guruprasad Kora (ORNL)
- David Bauer (GT)
- Chongle Pan (UTK/ORNL)

SDM AHM at Salt Lake City, March 3-4, 2005

Contact: Nagiza Samatova, samatovan_at_ornl.gov

Outline

- About Parallel R
- Motivation
- About R and its parallelization efforts
- Task and data parallelism with Parallel R (pR)
- Extensibility of Parallel R
- Performance Benchmarks
- Parallel R across Different Applications
- GIS data analysis with GRASS and Parallel R
- Clustered Climate Regimes using Parallel R
- Fusion scenario challenges Parallel R
- Quantitative Proteomics in Biology using Parallel R
- Summary and Future Work

Tera-(Flop, Byte) Analyses Could Be Routine for Scientific Applications, But

- Algorithmic Complexity
- Calculate means: O(n)
- Calculate FFT: O(n log(n))
- Calculate PCA: O(r × c)
- Hierarchical clustering: O(n^2)
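To make these growth rates concrete, the following Python sketch tabulates rough operation counts for three of the entries above. It is illustrative only: the function name `estimated_operations` is hypothetical, constant factors are ignored, and PCA is left out because its cost depends on the row and column counts rather than on a single n.

```python
import math

def estimated_operations(n):
    """Return rough operation counts for an input of n values (constants ignored)."""
    return {
        "mean": n,                         # one pass over the data: O(n)
        "fft": int(n * math.log2(n)),      # O(n log n)
        "hierarchical_clustering": n * n,  # all pairwise distances: O(n^2)
    }

for n in (1_000, 1_000_000):
    print(n, estimated_operations(n))
```

Going from a thousand to a million values multiplies the cost of the mean by a thousand, but the cost of the quadratic step by a million, which is why the clustering-style analyses are the first to become impractical on a single processor.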

- Climate: now 20-40 TB per simulated year; in 5 yrs 100 TB/yr, 5-10 PB/yr
- Astrophysics: now and in 5 yrs, can soak up anything!
- Fusion: now 100 Mbytes/15 min; in 5 yrs 1000 Mbytes/2 min

Statistical Computing with R

- About R (http://www.r-project.org/)
- R is an Open Source (GPL), most widely used programming environment for statistical analysis and graphics, similar to S.
- Provides good support for both users and developers.
- Highly extensible via dynamically loadable add-on packages.
- Originally developed by Robert Gentleman and Ross Ihaka.

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()

Towards Enabling Parallel Computing in R

- Rmpi (Hao Yu): R interface to LAM-MPI.
- rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.
- snow (Luke Tierney): general API on top of message passing routines to provide high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications.
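The "parallel apply" idea can be sketched in a few lines. This is a Python illustration of the pattern, not the snow API: the helper names `parallel_apply` and `column_mean` are hypothetical, and a thread pool stands in for the worker processes that a message-passing back end would provide.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(func, items, workers=4):
    """Apply func to every item independently and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

def column_mean(column):
    """Mean of one column; each call needs no data from any other column."""
    return sum(column) / len(column)

columns = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 5.0, 5.0]]
means = parallel_apply(column_mean, columns)
print(means)
```

The pattern fits embarrassingly parallel work because no call depends on another. Computations that need communication between workers, such as a distributed matrix solve, do not fit this mold.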

Motivation behind Parallel R (pR)

- Ideal Programming Requirements
- Be able to use existing high level (i.e. R) code
- Require minimal extra efforts for parallelizing
- Have identical/similar (presumably easy-to-use) interface to R's
- Be able to test codes in sequential settings
- Provide efficient and scalable (in terms of problem size and number of processors) performance

Providing Task and Data Parallelism in pR

Extensibility of Parallel R (pR)

Scalability of Parallel R (pR)

R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)

- A and B are the input matrices
- NPROWS and NPCOLS are process grid specs
- MB is block size
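A process grid and a block size are the parameters of the two-dimensional block-cyclic layout that ScaLAPACK uses for distributed matrices. The Python sketch below shows which process in the grid would own each matrix element under that layout. It is a sketch under two assumptions not stated in the slides: that MB is used as the block size in both dimensions, and that the distribution starts at process (0, 0). The function names `owner` and `layout` are hypothetical.

```python
def owner(i, j, nprows, npcols, mb):
    """Return the (process row, process column) owning global element (i, j), 0-based."""
    # Blocks of mb consecutive rows/columns are dealt out to the grid round-robin.
    return ((i // mb) % nprows, (j // mb) % npcols)

def layout(n, nprows, npcols, mb):
    """Return an n x n table of owning process coordinates."""
    return [[owner(i, j, nprows, npcols, mb) for j in range(n)] for i in range(n)]

# A 6 x 6 matrix on a 2 x 2 process grid with 2 x 2 blocks.
for row in layout(6, nprows=2, npcols=2, mb=2):
    print(row)
```

Smaller blocks spread the work more evenly across processes, while larger blocks reduce communication. This trade-off is why the block size is exposed as a tuning parameter.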

Overhead due to R Parallel Agent in pR

Parallel R (pR) Distribution

http://www.ASPECT-SDM.org/Parallel-R

- Releases History
- pR enables both data and task parallelism (includes task-pR and RScaLAPACK) (2004/Q4)
- RScaLAPACK provides R interface to ScaLAPACK with its scalability in terms of problem size and number of processors using data parallelism (2004/Q2)
- task-pR achieves parallelism by performing out-of-order execution of tasks. With its intelligent scheduling mechanism it attains significant gain in execution times (2004/Q3)
- pMatrix provides a parallel platform to perform major matrix operations in parallel using ScaLAPACK and PBLAS Level II and III routines (2005/Q2)
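Out-of-order execution of tasks, as described for task-pR above, can be illustrated with a small dependency-driven scheduler. The Python sketch below is not task-pR's scheduling mechanism, which the slides do not detail. It assumes each task declares the tasks it depends on, and it runs every task whose inputs are ready, regardless of the order in which the tasks were written. The function name `run_out_of_order` and the task table are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_out_of_order(tasks, workers=4):
    """Run tasks given as name -> (dependency names, function of dependency results)."""
    results = {}
    pending = dict(tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            # A task is ready once all of its dependencies have produced results.
            ready = [name for name, (deps, _) in pending.items()
                     if all(d in results for d in deps)]
            if not ready:
                raise ValueError("circular or missing dependency")
            # Submit all ready tasks together so independent ones run concurrently.
            futures = {name: pool.submit(pending[name][1],
                                         *[results[d] for d in pending[name][0]])
                       for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
                del pending[name]
    return results

tasks = {
    "d": (["c"], lambda c: c + 1),          # listed first, but must run last
    "a": ([], lambda: 2),
    "b": ([], lambda: 3),
    "c": (["a", "b"], lambda a, b: a * b),
}
results = run_out_of_order(tasks)
print(results["d"])
```

Tasks "a" and "b" have no dependencies and are submitted in the same wave, so they can overlap. The gain in execution time comes from that overlap.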

Also available for download from R's CRAN web site (www.R-Project.org), with 37 mirror sites in 20 countries

Geo-statistical and Spatial Data Analysis with GRASS and Parallel R

With George Fann, John Drake, and Budhendra Bhaduri

- About GRASS (http://grass.itc.it/)
- GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.
- GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and sites data; process multi-spectral image data; and create, manage, and store spatial data.
- It is Free (Libre) Software/Open Source released under GNU GPL.

- Parallel R (pR) extension for GRASS
- Leverages the work by Markus Neteler (http://grass.itc.it/statsgrass/grass_geostats.html).
- Offers a richer set of statistical analysis capabilities, including Basic Statistics, Exploratory Data Analysis, Linear Models, Multivariate Analysis, Time Series Analysis, etc.
- Provides high performance and parallel computational platform for large datasets

Grass/Parallel-R Examples

Clustered Climate Regimes Analysis

With W. Hargrove, F. Hoffman, and D. Erickson

Scalability of pk-means() in pR

Fusion Scenario Challenges Parallel R

With George Ostrouchov and Don Batchelor

Mahalanobis Distance: easy

250,000 points 10 sampling for 1hr analysis

Hierarchical Model-based Clustering (mclust): hard

Expectation Maximization (EM): easy
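The Mahalanobis distance of a point x from a data set with mean m and covariance S is sqrt((x - m)^T S^-1 (x - m)). Once m and S are known, each point's distance can be computed independently of the others, which is presumably why it is rated easy to parallelize. The Python sketch below restricts itself to two-dimensional data so that the covariance inverse has a closed form. The function name `mahalanobis_2d` and the sample data are hypothetical.

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the sample mean of 2-D data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T S^-1 d, using the closed-form inverse of a 2 x 2 matrix.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(mahalanobis_2d((3.0, 1.0), data))
```

Hierarchical clustering, by contrast, needs information about every pair of points at each merge step, so it cannot be split into independent per-point work in the same way.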

Quantitative Proteomics in Biology

With Bob Hettich, Hays McDonald, and Greg Hurst

Ratio Calculations for 50,000 files

2. Select Peak Window

- Subtract background noise from data
- Generate Covariance Chromatogram (red)
- Apply Savitzky-Golay Smoother (blue)
- Calculate cut-off for search (cyan)
- Find Window with Max. S/N ratio (green)

3. Calculate Ratio = Slope(Eigenvector)
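The ratio step, expressed as the slope of an eigenvector, can be sketched as follows. The slides do not spell out the computation, so this Python example is a guess at it: it assumes the ratio is the slope of the principal eigenvector of the 2 x 2 covariance matrix of two paired intensity series taken inside the selected peak window. The function name `eigenvector_slope` and the sample values are hypothetical.

```python
import math

def eigenvector_slope(x, y):
    """Slope of the principal eigenvector of the 2 x 2 covariance matrix of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    if sxy == 0:
        raise ValueError("series are uncorrelated; slope is undefined")
    # Largest eigenvalue of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # The matching eigenvector is (sxy, lam - sxx), so its slope is rise over run.
    return (lam - sxx) / sxy

reference = [1.0, 2.0, 3.0, 4.0]
sample = [2.0, 4.0, 6.0, 8.0]
print(eigenvector_slope(reference, sample))
```

Unlike an ordinary least-squares slope, the eigenvector slope treats noise in both series symmetrically. Each file can be processed independently, which suits a workload of 50,000 files.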

Ratio Estimation over 50,000 files

Ratio Calculations with Parallel R

Performance Results for Ratio Calculation

Summary and Future Work

- Parallel R (pR) is an Open Source high performance library for statistical computing in R
- It has been deployed in a number of applications including climate, GIS, fusion, and biology
- Future improvements in a few major directions
- Demonstrate more application scenarios
- Add more libraries like RScaLAPACK, PMatrix (e.g. pAlok?, pclust, pnetCDF)
- Improve the performance (reduce overhead, memory management) of Parallel Agent
- Enhance features of Parallel Agent
- Support outside of Master-Slave model
- Better memory management strategies (one-sided put(), get(), release(), etc.)
- Support of parallel I/O over netCDF and HDF files