Star-P and the Knowledge Discovery Suite Steve Reinhardt, spr@InteractiveSupercomputing.com Viral Shah, vshah@InteractiveSupercomputing.com John Gilbert, gilbert@cs.ucsb.edu Stefan Karpinski, sgk@cs.ucsb.edu

1
Star-P and the Knowledge Discovery Suite
Steve Reinhardt, spr@InteractiveSupercomputing.com
Viral Shah, vshah@InteractiveSupercomputing.com
John Gilbert, gilbert@cs.ucsb.edu
Stefan Karpinski, sgk@cs.ucsb.edu
2
Context for Knowledge Discovery
3
Goal for Large-scale Knowledge Discovery via Star-P/KDS

Enable domain expert to explore big unstructured data interactively
Domain expert: scientist or analyst, not a math or graph expert
Explore: human-guided characterization, from simple statistics to complex clustering or factoring, even when the best algorithm is not known
Big: 20 GB of data commonplace, >1 TB largest
Unstructured data: e.g., arising from metabolic networks, climate change, social interactions, and Internet traffic
  Allow analyst to discern structure in the data via exploration
  KDT implements many key algorithms; extensible for other algorithms
Interactively: common queries take O(10 seconds) on 30-128P Altix
  Depends on algorithms being general-purpose, reusable, and usable by non-experts

4
Star-P Basics
MATLAB is a registered trademark of The MathWorks, Inc. ISC's products are not sponsored or endorsed by The MathWorks, Inc. or by any other trademark owner referred to in this document.
5
A Comprehensive (?) Picture of Knowledge Discovery
Legend: SVD = implemented by ISC; Hadoop = implemented by others
Input: HDF5, OPeNDAP, Hadoop, live data feeds
Visualization: Renoir, In-Spire, ...
Dimensionality reduction / factorization: SVD, eigen, NMF, PCA, ...
Clustering: K-means, ...
Classification: SVM, Bayesian, HMMs, ...
Etc.
Graph primitives (BFS, MIS, conncomp, ...)
Linear algebra (spmatvec, ...)
Solvers (MUMPS, SuperLU, ...)
Utility (sort, indexing)
Data structures (sparse matrices, ...)
Parallel constructs
6
Knowledge Discovery Suite

Simple data analysis operations at very large scale
  Sorting, set operations, statistical operations
Graph operations on very large graphs
  Simple queries, breadth-first search, connected components, independent sets
Visualization with desktop tools
  Distributed image generation for large graphs and datasets
Clustering and decomposition
  K-means clustering, non-negative matrix factorization, principal component analysis
Bayesian network modeling
  Expectation maximization (Baum-Welch), hidden Markov models
What would you like to see?

7
Ways to Work Together

You use Star-P infrastructure for parallel algorithm development/deployment
You develop serial algorithms, we develop parallel versions
We jointly develop parallel algorithms
We develop or integrate missing functionality you need
We target joint client with integrated package

8
A Brief Demo
9
Sample Kernels, Algorithms, and Workflows
10
Simplest KDS operation: Parallel Sorting
3 6 8 1 5 4 7 2 9
1 2 3 4 5 6 7 8 9

Simple, widely used combinatorial primitive
  [W, perm] = sort (V)
Used in many sparse matrix and array algorithms
  sparse(), indexing, concatenation, transpose, reshape, repmat, etc.
Communication efficient
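
As a rough illustration of the sort call on this slide, here is a serial sketch in plain Python rather than Star-P/MATLAB. It returns the sorted values together with the permutation that produced them, then reuses that permutation to reorder a companion array, which is the pattern that indexing and sparse-matrix construction rely on. The helper name sort_with_perm and the labels array are illustrative assumptions, and indices are 0-based here whereas MATLAB's are 1-based.

```python
def sort_with_perm(values):
    """Return (sorted_values, perm) such that sorted_values[k] == values[perm[k]]."""
    # Python's sort is stable, so equal values keep their original order.
    perm = sorted(range(len(values)), key=lambda i: values[i])
    return [values[i] for i in perm], perm


V = [3, 6, 8, 1, 5, 4, 7, 2, 9]          # the example sequence from the slide
W, perm = sort_with_perm(V)

# Reuse the permutation to carry a companion array along with the keys.
labels = ["c", "f", "h", "a", "e", "d", "g", "b", "i"]   # hypothetical payload
labels_sorted = [labels[i] for i in perm]
```

Returning the permutation alongside the values means the comparison work is done once, after which any number of companion arrays can be reordered by plain indexing.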

11
Sorting performance
12
A complex workflow (SSCA2, kernel 3)

function subgraphs = kernel3 (G, pathlen, starts)
% KERNEL3 : SSCA2 Kernel 3 -- Graph Extraction

starts = starts(:,2);
nstarts = length(starts);
A = grsparse (G);
nv = nverts (G);

% Use sparse matrix multiplication to do several BFS searches at once.
s = sparse (starts, 1:nstarts, 1, nv, nstarts);
for k = 1:pathlen
    s = A * s;      % Ideally reach should support this. Not yet.
    s = (s ~= 0);
end

for i = 1:nstarts
    x = s(:,i);
    vtxmap = find(x);
    S.graph = subgraph (G, vtxmap);
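
The idea behind kernel 3 (several breadth-first searches advanced together by one sparse matrix product per step, followed by extraction of the induced subgraph) can be sketched serially in plain Python. This is only an illustration under stated assumptions, not the Star-P code: each column of the slide's matrix s is represented as a Python set, the function names multi_source_reach and induced_subgraph are hypothetical, and the toy adjacency includes self-loops so that vertices reached earlier stay in the set after each step.

```python
def multi_source_reach(adj, starts, pathlen):
    """Advance one search per start vertex, pathlen steps each.

    adj[v] is the set of rows holding a nonzero in column v of the
    adjacency matrix, so one step computes s = (A * s != 0) column by column.
    """
    columns = [{s} for s in starts]          # one indicator column per start
    for _ in range(pathlen):
        columns = [set().union(*(adj[v] for v in col)) for col in columns]
    return columns


def induced_subgraph(adj, vertices):
    """Keep only the given vertices and the edges among them."""
    return {v: adj[v] & vertices for v in vertices}


# Toy path graph 0-1-2-3-4, undirected, with self-loops (an assumption).
adj = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2, 3}, 3: {2, 3, 4}, 4: {3, 4}}
reached = multi_source_reach(adj, starts=[0, 4], pathlen=2)
subgraphs = [induced_subgraph(adj, vertices) for vertices in reached]
```

Without the self-loops, each step would replace the set with the vertices exactly one hop further out rather than accumulating everything reached so far.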

13
A complex workflow (SSCA2, kernel 4)

function leader = kernel4f (G)
% KERNEL4F : SSCA2 Kernel 4 -- Graph Clustering

% Find a Maximal Independent Set in G
[IS, misrounds] = mis (G);
fprintf ('MIS rounds: %d. MIS nodes: %d\n', misrounds, length(IS));

% Find neighbours of each node from the IS
neighFromIS = G * sparse(IS, IS, 1, n, n);

% Pick one of the neighbouring IS nodes as a leader
[ign, leader] = max (neighFromIS, [], 2);

% Collect votes from neighbours
[I, J] = find (G);
S = sparse (I, leader(J), 1, n, n);

% Pick the most popular leader among neighbours and join that cluster
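
The clustering steps of kernel 4 (find a maximal independent set, let every vertex pick a neighbouring set member as leader, then join the leader that is most popular among its neighbours) can be sketched serially in plain Python. Several things here are assumptions rather than the slide's method: a sequential greedy pass stands in for mis(), each vertex is counted as its own neighbour, ties are broken toward the smallest vertex index, and the names greedy_mis and cluster_by_leader_vote are hypothetical.

```python
from collections import Counter


def greedy_mis(adj):
    """Sequential greedy maximal independent set (stand-in for mis())."""
    chosen, blocked = [], set()
    for v in sorted(adj):
        if v not in blocked:
            chosen.append(v)
            blocked |= adj[v] | {v}      # v and its neighbours are now excluded
    return chosen


def cluster_by_leader_vote(adj):
    """Assign every vertex to the leader most popular in its neighbourhood."""
    mis = set(greedy_mis(adj))
    # Maximality guarantees every closed neighbourhood contains a set member.
    leader = {v: min((adj[v] | {v}) & mis) for v in adj}
    cluster = {}
    for v in adj:
        votes = Counter(leader[u] for u in adj[v] | {v})
        top = max(votes.values())
        cluster[v] = min(l for l, count in votes.items() if count == top)
    return cluster


# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters = cluster_by_leader_vote(adj)
```

On this toy graph the bridge vertices 2 and 3 each see one vote from the other triangle, but the majority in their own triangle keeps them with it.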

14
Scaling Performance: cSSCA2 on 128P
scale   vertices (= 2^scale)   edges (= 10 x vertices)   graph size (bytes)   time (seconds)
22      4.194E+06              4.194E+07                 7.550E+08            122.51
24      1.678E+07              1.678E+08                 3.020E+09            402.31
26      6.711E+07              6.711E+08                 1.208E+10            1237.1
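
The relationships between the columns of this table can be checked with a few lines of Python. The sketch below recomputes vertex and edge counts from the scale and derives two quantities that are not on the slide: bytes per edge and edges processed per second. The variable and function names are illustrative, and the figures are copied from the table as printed, so results are only as precise as its four significant digits.

```python
rows = [  # (scale, graph size in bytes, time in seconds), copied from the table
    (22, 7.550e08, 122.51),
    (24, 3.020e09, 402.31),
    (26, 1.208e10, 1237.1),
]


def derived(scale, size_bytes, seconds):
    """Recompute the counts implied by the scale and two derived rates."""
    vertices = 2 ** scale
    edges = 10 * vertices
    return {
        "vertices": vertices,
        "edges": edges,
        "bytes_per_edge": size_bytes / edges,
        "edges_per_second": edges / seconds,
    }


stats = [derived(*row) for row in rows]

# Run-time growth between consecutive rows, each of which is a 4x larger graph.
slowdowns = [rows[i + 1][2] / rows[i][2] for i in range(len(rows) - 1)]
```

Each step up in scale quadruples the graph, while the measured time grows by a factor of roughly 3.3 and then 3.1, and the storage works out to about 18 bytes per edge in all three rows.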