Slides: 22
Provided by: nagizas
Title: Parallel R (pR)

Parallel R (pR)
  • For High Performance Statistical Computing
  • Nagiza F. Samatova (ORNL)
  • Srikanth Yoginath (ORNL)
  • Guruprasad Kora (ORNL)
  • David Bauer (GT)
  • Chongle Pan (UTK/ORNL)

SDM AHM at Salt Lake City, March 3-4, 2005
Contact Nagiza Samatova,
  • About Parallel R
  • Motivation
  • About R and its parallelization efforts
  • Task and data parallelism with Parallel R (pR)
  • Extensibility of Parallel R
  • Performance Benchmarks
  • Parallel R across Different Applications
  • GIS data analysis with GRASS and Parallel R
  • Clustered Climate Regimes using Parallel R
  • Fusion scenario challenges Parallel R
  • Quantitative Proteomics in Biology using Parallel R
  • Summary and Future Work

Tera-(Flop & Byte) Analyses Could Be Routine for
Scientific Applications, But
  • Algorithmic Complexity
  • Calculate means: O(n)
  • Calculate FFT: O(n log(n))
  • Calculate PCA: O(r × c)
  • Hierarchical clustering: O(n^2)
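To make the growth rates above concrete, the following R sketch tabulates the work implied by the O(n), O(n log(n)) and O(n^2) cases. The problem sizes are arbitrary illustration values, not figures from the talk, and the operation counts ignore constant factors.

```r
# Illustrative sketch: how the listed complexities translate into work.
# The sizes below are arbitrary assumptions chosen for illustration.
n <- c(1e3, 1e4, 1e5)

growth <- data.frame(
  n      = n,
  means  = n,                # O(n): one pass over the data
  fft    = n * log2(n),      # O(n log n)
  hclust = n * (n - 1) / 2   # O(n^2): one distance per pair of observations
)
print(growth)

# The quadratic term is visible directly in base R: dist() stores one
# distance per unordered pair of rows, and hclust() takes that as input.
x <- matrix(seq_len(200), nrow = 100, ncol = 2)
d <- dist(x)
length(d)  # 100 * 99 / 2 pairs
```

Growing n by a factor of 10 multiplies the pairwise-distance count by roughly 100, which is why the quadratic methods are the ones that stop being routine first.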

  • Climate: Now 20-40TB per simulated year; 5 yrs: 100TB/yr, 5-10PB/yr
  • Astrophysics: Now and 5 yrs: Can soak up anything!
  • Fusion: Now 100Mbytes/15min; 5 yrs: 1000Mbytes/2 min

Statistical Computing with R
  • About R (http//
  • R is an Open Source (GPL), most widely used
    programming environment for statistical analysis
    and graphics, similar to S.
  • Provides good support for both users and developers.
  • Highly extensible via dynamically loadable
    add-on packages.
  • Originally developed by Robert Gentleman and
    Ross Ihaka.

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()

Towards Enabling Parallel Computing in R
  • Rmpi (Hao Yu): R interface to LAM-MPI.
  • rpvm (Na Li and Tony Rossini): R interface to
    PVM; requires knowledge of parallel programming.
  • snow (Luke Tierney): general API on top of
    message passing routines to provide high-level
    (parallel apply) commands; mostly demonstrated
    for embarrassingly parallel applications.
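The high-level "parallel apply" style that snow introduced can be sketched with R's bundled `parallel` package, which builds on snow's cluster API (`makeCluster`, `parLapply`, `stopCluster`). This is only a stand-in under stated assumptions: it starts two local socket workers rather than PVM or MPI hosts, and the worker count and the toy task `slow_square` are made up for illustration.

```r
library(parallel)

# Assumption: two local socket workers; a cluster run would use remote hosts.
cl <- makeCluster(2)

# A toy, embarrassingly parallel task: each input is processed independently.
slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for real work
  x^2
}

seq_result <- lapply(1:4, slow_square)         # sequential reference
par_result <- parLapply(cl, 1:4, slow_square)  # same call shape, run on workers

stopCluster(cl)

identical(seq_result, par_result)
```

The parallel call differs from the sequential one only by the cluster argument, which is the property that makes this style attractive when the tasks do not need to communicate.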

Motivation behind Parallel R (pR)
  • Ideal Programming Requirements
  • Be able to use existing high level (i.e. R) code
  • Require minimal extra effort for parallelizing
  • Have an identical or similar (presumably easy-to-use)
    interface to R's
  • Be able to test codes in sequential settings
  • Provide efficient and scalable (in terms of
    problem size and number of processors)

Providing Task and Data Parallelism in pR
Extensibility of Parallel R (pR)
Scalability of Parallel R (pR)
R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)
  • A and B are the input matrices
  • NPROWS and NPCOLS are process grid specs
  • MB is block size

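As a point of reference for the two calls above, the sequential base R call can be checked on a small system. The `sla.solve` line is left commented out because RScaLAPACK and a message-passing runtime may not be installed; the grid and block-size values in it are illustrative assumptions, since the slide does not say which values were used. The `block_owner` helper is hypothetical: it sketches the standard ScaLAPACK 2-D block-cyclic mapping to show what a process grid and a block size determine, assuming square MB x MB blocks and a distribution that starts at process (0, 0).

```r
# Small dense system A %*% X = B, solved sequentially with base R.
A <- matrix(c(4, 1, 2,
              1, 3, 0,
              2, 0, 5), nrow = 3, byrow = TRUE)
B <- matrix(c(1, 2, 3), ncol = 1)

X <- solve(A, B)
max(abs(A %*% X - B))  # residual; should be close to machine precision

# Parallel counterpart in the form given on the slide (not run here; the
# argument values are assumptions for illustration only):
# X <- sla.solve(A, B, 2, 2, 64)

# Hypothetical helper: which process in an NPROWS x NPCOLS grid owns matrix
# element (i, j) under a block-cyclic layout with MB x MB blocks.
# i and j are 1-based; the returned process coordinates are 0-based.
block_owner <- function(i, j, MB, NPROWS, NPCOLS) {
  c(prow = ((i - 1) %/% MB) %% NPROWS,
    pcol = ((j - 1) %/% MB) %% NPCOLS)
}

block_owner(3, 5, MB = 2, NPROWS = 2, NPCOLS = 2)
```

Smaller blocks spread the work more evenly across the grid, while larger blocks reduce communication; that trade-off is why the block size is exposed as a parameter.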
Overhead due to R Parallel Agent in pR
Parallel R (pR) Distribution
  • Release History
  • pR enables both data and task parallelism
    (includes task-pR and RScaLAPACK) (2004/Q4)
  • RScaLAPACK provides an R interface to ScaLAPACK
    and inherits its scalability in terms of problem
    size and number of processors through data parallelism
  • task-pR achieves parallelism by executing tasks
    out of order. Its intelligent scheduling mechanism
    yields significant gains in execution time (2004/Q3)
  • pMatrix provides a parallel platform for performing
    major matrix operations in parallel using
    ScaLAPACK and PBLAS Level II and III routines

Also available for download from R's CRAN web
site ( with 37 mirror sites in
20 countries
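The out-of-order execution attributed to task-pR above rests on a general idea: tasks that do not depend on each other's results can run in any order, or at the same time. The sketch below is not task-pR's scheduler. It is a hypothetical, sequential illustration that groups a made-up list of tasks into waves whose members could be dispatched concurrently. The task names and the `reads`/`writes` fields are invented for the example.

```r
# Hypothetical task list: each task names the values it reads and writes.
tasks <- list(
  list(name = "a", reads = character(0), writes = "A"),
  list(name = "b", reads = character(0), writes = "B"),
  list(name = "c", reads = c("A", "B"),  writes = "C"),
  list(name = "d", reads = "A",          writes = "D")
)

# Group tasks into waves: a task is ready once everything it reads exists.
schedule_waves <- function(tasks) {
  available <- character(0)  # values produced so far
  pending   <- tasks
  waves     <- list()
  while (length(pending) > 0) {
    ready <- vapply(pending,
                    function(task) all(task$reads %in% available),
                    logical(1))
    if (!any(ready)) stop("circular or unsatisfied dependency")
    waves[[length(waves) + 1]] <-
      vapply(pending[ready], function(task) task$name, character(1))
    available <- c(available,
                   unlist(lapply(pending[ready], function(task) task$writes)))
    pending <- pending[!ready]
  }
  waves
}

waves <- schedule_waves(tasks)
print(waves)
```

Tasks "a" and "b" have no inputs and form the first wave; "c" and "d" only need results from that wave, so they form the second and are independent of each other.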
Geo-statistical and Spatial Data Analysis with
GRASS and Parallel R
With George Fann, John Drake, and Bhaduri
  • About GRASS (http//
  • GRASS (Geographic Resources Analysis Support
    System) is a raster/vector GIS, image processing
    system, and graphics production system.
  • GRASS contains over 350 programs and tools to
    render maps and images on monitor and paper;
    manipulate raster, vector, and sites data;
    process multi spectral image data; and create,
    manage, and store spatial data.
  • It is Free (Libre) Software/Open Source released
    under GNU GPL.
  • Parallel R (pR) extension for GRASS
  • Leverages the work by Markus Neteler
  • Offers a richer set of statistical analysis
    capabilities (Basic Statistics,
    Exploratory Data Analysis, Linear Models,
    Multivariate Analysis, Time Series Analysis, etc.)
  • Provides a high-performance parallel
    computational platform for large datasets

Grass/Parallel-R Examples
Clustered Climate Regimes Analysis
With W. Hargrove, F. Hoffman, and D. Erickson
Scalability of pk-means() in pR
Fusion Scenario Challenges Parallel R
With George Ostrouchov and Don Batchelor
Mahalanobis Distance → easy
250,000 points 10 sampling for 1hr analysis
Hierarchical Model-based Clustering (mclust) ?
Expectation Maximization (EM) → easy
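For the Mahalanobis distance case, base R already provides `mahalanobis()`, which returns squared distances. The sketch below uses a small synthetic data set (an assumption; the fusion data are not reproduced here) and checks the result against the textbook definition. Once the center and covariance are known, each row is scored independently, which is one plausible reading of why the slide rates this case as easy to parallelize.

```r
# Synthetic stand-in data: 6 points in 2 dimensions (illustrative values only).
x <- matrix(c(2, 4, 6, 8, 10, 12,
              1, 3, 2, 5, 4, 7), ncol = 2)

center <- colMeans(x)
S      <- cov(x)

d2 <- mahalanobis(x, center, S)  # squared Mahalanobis distance per row

# Same quantity from the definition (x - mu)^T S^{-1} (x - mu), row by row.
S_inv <- solve(S)
d2_manual <- apply(x, 1, function(row) {
  dev <- row - center
  as.numeric(t(dev) %*% S_inv %*% dev)
})

all.equal(as.numeric(d2), d2_manual)
```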
Quantitative Proteomics in Biology
With Bob Hettich, Hays McDonald, and Greg Hurst
Ratio Calculations for 50,000 files
2. Select Peak Window
  • Subtract background noise from data
  • Generate Covariance Chromatogram (red)
  • Apply Savitzky-Golay Smoother (blue)
  • Calculate cut-off for search (cyan)
  • Find Window with Max. S/N ratio (green)
3. Calculate Ratio: Slope (Eigenvector)
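The idea of taking the ratio from the slope of an eigenvector can be illustrated as follows. If paired intensity readings from two channels are treated as points in a plane, the principal eigenvector of their 2 x 2 covariance matrix gives the direction of the best-fit line, and its slope estimates the ratio between the channels. The channel names and the synthetic chromatograms are assumptions for illustration, and the 5-point quadratic Savitzky-Golay weights are the standard textbook values, not necessarily the smoother settings used in the talk.

```r
# Synthetic chromatograms: a Gaussian peak in one channel and a second
# channel at exactly twice the intensity (illustrative assumption).
scan   <- 1:101
chan_a <- exp(-((scan - 51) / 10)^2)
chan_b <- 2 * chan_a

# 5-point quadratic Savitzky-Golay smoothing as a centered moving filter.
# The two points at each end are left as NA by stats::filter().
sg_weights <- c(-3, 12, 17, 12, -3) / 35
smooth_a   <- stats::filter(chan_a, sg_weights, sides = 2)

# Principal eigenvector of the covariance matrix; eigen() returns eigenvalues
# in decreasing order, so column 1 is the dominant direction.
e     <- eigen(cov(cbind(chan_a, chan_b)))
v     <- e$vectors[, 1]
ratio <- v[2] / v[1]  # slope of the dominant direction; sign of v cancels
ratio
```

Unlike an ordinary regression slope, this estimate treats both channels symmetrically, which is a common reason to prefer it when both measurements are noisy.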

Ratio Estimation over 50,000 files
Ratio Calculations with Parallel R
Performance Results for Ratio Calculation
Summary and Future Work
  • Parallel R (pR) is an Open Source high
    performance library for statistical computing in R
  • It has been deployed in a number of applications
    including climate, GIS, fusion, and biology
  • Future improvements in a few major directions
  • Demonstrate more application scenarios
  • Add more libraries like RScaLAPACK, PMatrix (e.g.
    pAlok?, pclust, pnetCDF)
  • Improve the performance (reduce overhead, memory
    management) of Parallel Agent
  • Enhance features of Parallel Agent
  • Support execution models other than Master-Slave
  • Better memory management strategies (one-sided
    put(), get(), release(), etc.)
  • Support of parallel I/O over netCDF and HDF files