Title: pR: Automatic, Transparent Runtime Parallelization of the R Scripting Language
1. pR: Automatic, Transparent Runtime Parallelization of the R Scripting Language
- Jiangtian Li
- Department of Computer Science
- North Carolina State University
2. Acknowledgements
- This project originated from, and is conducted in collaboration with, Dr. Samatova's group at Oak Ridge National Laboratory
  - Dr. Nagiza Samatova
  - Guru Kora
  - Srikanth Yoginath
- Advisors
  - Dr. Xiaosong Ma
  - Dr. Nagiza Samatova
- Supported by grants from
  - NSF
  - DOE
3. Outline
- Motivation
- Background
- Architecture
- Design
- Performance
- Conclusion and Future Work
4. Motivation
- Increasing demand for massive scientific data processing
  - Statistical analysis of gene/protein data (61 billion sequence records in GenBank)
  - Time series analysis of climate data (300 GB for 10 years)
- Widely used computing tools such as R and Matlab are interpreted languages in nature
  - Facilitates runtime parallelization
- Workloads involve both computation-intensive and data-intensive tasks
  - Can exploit both task and data parallelism
5. What is R?
- A portable and extensible software environment as well as an interpreted language
  - Lisp-like read-eval-print loop
- Performs diverse statistical analyses
- Many extension packages are being developed
- Can be used in either interactive or batch mode
6. Example R script: example.R
- Assign an integer
  a <- 1
- Construct a vector of 9 real numbers conforming to the normal distribution
  c <- rnorm(9)
- Initialize a two-dimensional array
  d <- array(0.0, dim=c(9,9))
- Loop: read data from file
  for (i in 1:length(c))
    d[i,] <- matrix(scan(paste("test.data", i, sep="")))
7. Example: batch mode execution
- From the R prompt:
  > source("example.R")
  > a
  [1] 1
  > c
  [1]  1.16808  0.15877  1.40785  1.73696 -1.19267  0.41321
  [7] -0.39817 -0.13059 -0.67247
  > d
       [,1] [,2] [,3] [,4] [,5] ...
  [1,]    0    0    0    0    0
  [2,]    0    0    0    0    0
  ...
- From the shell:
  R CMD BATCH example.R
8. Research Goals
- Propose a runtime framework for parallelizing R
- Provide an automatic and transparent manner of parallel R programming
- Achieve speedup and scalability for R applications, benefiting the R user community
9. Outline
- Motivation
- Background
- Architecture
- Design
- Performance
- Conclusion and Future Work
10. Related Work
- Embarrassingly parallel
  - snow package - Rossini et al.
- Message passing
  - MultiMATLAB - Trefethen et al.
  - pyMPI - Miller
- Back-end support
  - RScaLAPACK - Yoginath et al.
  - Star-P - Choy et al.
- Compilers
  - Otter - Quinn et al.
- Shared memory
  - MATmarks - Almasi et al.
11. Related Work (cont.)
- Parallelizing compilers
  - SUIF - Hall et al.
  - Polaris - Blume et al.
- Runtime parallelization
  - Jrpm - Chen et al.
- Dynamic compilation
  - DyC - Grant et al.
12. Outline
- Motivation
- Background
- Architecture
- Design
- Performance
- Conclusion and Future Work
13. Design Rationale
- Most R codes consist of high-level pre-built functions, e.g., svd for singular value decomposition, eigen for eigenvalue and eigenvector computation
- Loops usually have few inter-iteration dependences and a high per-iteration execution cost, e.g., in R applications from Bioconductor
- No pointers, hence no aliasing problem
14. Approach
- Selective parallelization scheme that focuses on function calls and loops
- Dynamic and incremental dependency analysis, with runtime evaluation pauses where dependency cannot be determined statically (e.g., dynamic loop bounds, conditional branches)
- Master-worker paradigm to reduce scheduling and data communication overhead
  - Outsource expensive tasks, i.e., function calls and loops, to workers
  - Data are distributed among the workers
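As an illustration of this selective scheme (a sketch of mine, not pR's internals), consider the two loop shapes from the example script: one with a loop-carried dependence that must stay sequential, and one whose iterations are independent and can be outsourced to workers.

```r
a <- 1
c <- rnorm(9)
d <- array(0.0, dim = c(9, 9))

# Loop-carried dependence: iteration i reads c[i-1], which iteration i-1
# writes, so this loop must run sequentially on one process.
for (i in 2:length(c)) c[i] <- c[i - 1] + a

# No inter-iteration dependence: each iteration writes only row i of d,
# so the iterations can be distributed across workers.
for (i in 1:length(c)) d[i, ] <- rnorm(9)
```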
15. Framework Architecture
- Inter-node communication: MPI
- Inter-process communication: domain sockets
16. Outline
- Motivation
- Background
- Architecture
- Design
- Performance
- Conclusion and Future Work
17. Analyzer
- Input: R script
- Output: Task Precedence Graph
- Task: the finest unit of scheduling
- Identifies precedence relationships among tasks
18. Parsing
- Identify the basic execution unit: the R statement
- Retrieve expressions such as variable names and array subscripts
- Output: parse tree
19. An example of a parse tree
20. Dependence Analysis
- Identify tasks, the finest units of scheduling
- Statement dependence analysis
- Loop dependence analysis: GCD test
- Incremental analysis
  - Pause at points where runtime information is needed for dependence analysis or a branch decision
  - Obtain runtime evaluation results and proceed
- Output: Task Precedence Graph
  - Vertex: task
  - Edge: dependence
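A sketch of the GCD test named above (my own illustration): for two accesses a[c1*i + d1] and a[c2*i + d2] inside a loop, a cross-iteration dependence is possible only if gcd(c1, c2) divides d2 - d1.

```r
gcd <- function(x, y) if (y == 0) abs(x) else gcd(y, x %% y)

# TRUE  -> a dependence may exist (keep the loop sequential)
# FALSE -> the accesses can never touch the same element (parallelizable)
gcd_test <- function(c1, d1, c2, d2) (d2 - d1) %% gcd(c1, c2) == 0

gcd_test(2, 0, 2, 1)  # a[2*i] vs a[2*i+1]: FALSE, iterations independent
gcd_test(1, 0, 1, -1) # c[i]   vs c[i-1]:   TRUE, possible dependence
```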
21. Loop Parallelization
- Parallelize a loop if no dependence is discovered
- Execute it in an embarrassingly parallel manner
- Adjust the Task Precedence Graph
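One way the graph adjustment can be pictured (an assumed sketch, not pR's scheduler): the iteration space of a dependence-free loop is split into sub-ranges, each of which becomes its own task node.

```r
# Split the iteration space 1:n into one contiguous chunk per worker;
# each chunk would replace the original loop node as a separate task.
n <- 9
workers <- 3
chunks <- split(1:n, cut(1:n, workers, labels = FALSE))
# chunks holds 1:3, 4:6, and 7:9, one sub-range per worker
```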
22. A running example

23. [Figure: Task Precedence Graph for the running example. Each statement below is a task (tasks 1-6 in the figure); edges are dependences, and pause points mark where evaluation stops until runtime values are known.]
- a <- 1
- b <- 2
- c <- rnorm(9)
- d <- array(0.0, dim=c(9,9))
- for (i in 1:length(c)) d[i,] <- matrix(scan(paste("test.data", i, sep="")))
- for (i in b:length(c)) c[i] <- c[i-1] + a
  - pause point: the bound b is known only at runtime; once resolved, the loop becomes for (i in 2:9) c[i] <- c[i-1] + a
- if (c[length(c)] > 10) e <- eigen(d) else e <- sum(c)
  - pause point: the branch outcome depends on runtime values
24. Parallel Execution Engine
- Dispatches ready tasks
- Outsources expensive tasks (loops or function calls) to workers
- Coordinates peer-to-peer data communication and monitors execution status
- Updates the analyzer with runtime results
25. Outline
- Motivation
- Background
- Architecture
- Design
- Performance
- Conclusion and Future Work
26. Ease-of-Use Demonstration
- Comparison of pR and snow (an R add-on package)
  - pR: no user modification of the source code
  - snow: the user plugs in API calls
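To make the contrast concrete, here is a hedged sketch (mine, not from the slides) of how snow requires explicit cluster management in the script, whereas pR runs the plain sequential code unchanged:

```r
# With snow, the user rewrites the code around the cluster API:
library(snow)
cl <- makeCluster(4, type = "SOCK")   # user creates the cluster
res <- clusterApply(cl, 1:100, function(i) mean(rnorm(1000)))
stopCluster(cl)                       # user tears it down

# Under pR, the equivalent sequential loop is left untouched and is
# parallelized automatically at runtime:
# res <- list(); for (i in 1:100) res[[i]] <- mean(rnorm(1000))
```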
27. Performance
- Testbed
  - Opt cluster: 16 nodes, each with dual Opteron 265 (dual-core) processors, 1 Gbps Ethernet
  - Fedora Core 5 Linux x86_64 (Linux kernel 2.6.16)
- Benchmarks
  - Boost, a statistics application
  - Bootstrap
  - SVD
28. Boost
- Analysis overhead is very small
- From 16 to 32 processors, the incremental computation speedup drops to 1.5
29. Bootstrap
30. SVD
- Analysis overhead is very small
- Serialization of large data sets in R is the major overhead (1.9 MB/s)
31. Task Parallelism Test
- Statistical functions
  - prcomp: principal component analysis
  - svd: singular value decomposition
  - lm.fit: linear model fitting
  - cor: correlation computation
  - fft: Fast Fourier Transform
  - qr: QR decomposition
- Execution time of each task ranges from 3 to 27 seconds
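A sketch of the shape of this test (my own reconstruction, with made-up data sizes): several mutually independent statements, each invoking one of the listed functions, so a runtime dependence analysis can dispatch them as concurrent tasks.

```r
x <- matrix(rnorm(200 * 50), nrow = 200)
y <- rnorm(200)

# Independent statements: no result feeds another statement, so each
# can be outsourced to a different worker.
p <- prcomp(x)    # principal component analysis
s <- svd(x)       # singular value decomposition
l <- lm.fit(x, y) # linear model fitting
r <- cor(x)       # correlation matrix
f <- fft(y)       # fast Fourier transform
q <- qr(x)        # QR decomposition
```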
32. Outline
- Motivation
- Background
- Architecture
- Design
- Performance
- Conclusion and Future Work
33. Future Work
- Apply loop transformation techniques
- Intelligent scheduling to exploit data locality
- Explore finer-grained interprocedural parallelization
- Load balancing
- Optimize high-level R functions such as serialization
34. Conclusion
- Presented the pR framework, a first step toward parallelizing R automatically and transparently
- Optimization is needed to improve efficiency