Star-P and the Knowledge Discovery Suite. Steve Reinhardt, spr@InteractiveSupercomputing.com; Viral Shah, vshah@InteractiveSupercomputing.com; John Gilbert, gilbert@cs.ucsb.edu; Stefan Karpinski, sgk@cs.ucsb.edu

1
Star-P and the Knowledge Discovery Suite
Steve Reinhardt, spr@InteractiveSupercomputing.com
Viral Shah, vshah@InteractiveSupercomputing.com
John Gilbert, gilbert@cs.ucsb.edu
Stefan Karpinski, sgk@cs.ucsb.edu
2
Context for Knowledge Discovery
3
Goal for Large-scale Knowledge Discovery via
Star-P/KDS
  • Enable domain expert to explore big unstructured
    data interactively
  • Domain expert: scientist or analyst, not math or
    graph expert
  • Explore: human-guided characterization, from
    simple statistics to complex clustering or
    factoring, even when best algorithm not known
  • Big: 20 GB of data commonplace, >1 TB largest
  • Unstructured data: e.g., arising from metabolic
    networks, climate change, social interactions,
    and Internet traffic
  • Allow analyst to discern structure in the data
    via exploration
  • KDT implements many key algorithms; extensible
    for other algorithms
  • Interactively: common queries take O(10 seconds)
    on 30-128P Altix
  • Depends on algorithms being general-purpose,
    reusable, and usable by non-experts

4
Star-P Basics
MATLAB is a registered trademark of The
MathWorks, Inc. ISC's products are not sponsored
or endorsed by The MathWorks, Inc. or by any
other trademark owner referred to in this
document.
5
A Comprehensive (?) Picture of Knowledge Discovery
Legend: SVD = implemented by ISC; Hadoop = implemented by others
Input: HDF5, OPeNDAP, Hadoop, live data feeds
Visualization: Renoir, In-Spire, ...
Dimensionality reduction / factorization: SVD, eigen, NMF, PCA, ...
Clustering: K-means, ...
Classification: SVM, Bayesian, HMMs, ...
Etc.
Graph primitives (BFS, MIS, conncomp, ...)
Linear algebra (spmatvec, ...)
Solvers (MUMPS, SuperLU, ...)
Utility (sort, indexing)
Data structures (sparse matrices, ...)
Parallel constructs
6
Knowledge Discovery Suite
  • Simple data analysis operations at very large
    scale: sorting, set operations, statistical
    operations
  • Graph operations on very large graphs: simple
    queries, breadth-first search, connected
    components, independent sets
  • Visualization with desktop tools: distributed
    image generation for large graphs and datasets
  • Clustering and decomposition: K-means clustering,
    non-negative matrix factorization, principal
    component analysis
  • Bayesian network modeling: expectation
    maximization (Baum-Welch), hidden Markov models
  • What would you like to see?

7
Ways to Work Together
  • You use Star-P infrastructure for parallel
    algorithm development/deployment
  • You develop serial algorithms, we develop
    parallel versions
  • We jointly develop parallel algorithms
  • We develop or integrate missing functionality you
    need
  • We target joint client with integrated package

8
A Brief Demo
9
Sample Kernels, Algorithms, and Workflows
10
Simplest KDS operation: Parallel Sorting
3 6 8 1 5 4 7 2 9
1 2 3 4 5 6 7 8 9
  • Simple, widely used combinatorial primitive
  • [W, perm] = sort (V)
  • Used in many sparse matrix and array algorithms:
    sparse(), indexing, concatenation, transpose,
    reshape, repmat, etc.
  • Communication efficient
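The two-output form of sort on this slide returns the sorted values together with the permutation that produced them. A minimal serial sketch of that contract, written in Python purely for illustration (it is not Star-P code and says nothing about how the parallel sort is implemented; `sort_with_perm` is a hypothetical name, and indices are 0-based here rather than 1-based as in MATLAB):

```python
def sort_with_perm(values):
    """Return (sorted_values, perm) with sorted_values[i] == values[perm[i]]."""
    # Sort the indices by the value each one points at. sorted() is stable,
    # so equal values keep their original relative order.
    perm = sorted(range(len(values)), key=lambda i: values[i])
    return [values[i] for i in perm], perm


# The example array from the slide.
v = [3, 6, 8, 1, 5, 4, 7, 2, 9]
w, perm = sort_with_perm(v)
```

The permutation is what makes the primitive reusable: applying `perm` to a second array reorders it consistently with the first, which is the pattern behind building sparse matrices from unsorted (row, column, value) triples.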

11
Sorting performance
12
A complex workflow (SSCA2, kernel 3)
    function subgraphs = kernel3 (G, pathlen, starts)
    % KERNEL3 : SSCA2 Kernel 3 -- Graph Extraction

    starts = starts(:,2);
    nstarts = length(starts);
    A = grsparse (G);
    nv = nverts (G);

    % Use sparse matrix multiplication to do several
    % BFS searches at once.
    s = sparse (starts, 1:nstarts, 1, nv, nstarts);
    for k = 1:pathlen
        s = A * s;      % Ideally reach should support this. Not yet.
        s = (s ~= 0);
    end

    for i = 1:nstarts
        x = s(:,i);
        vtxmap = find(x);
        S.graph = subgraph (G, vtxmap);
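The core idea in this kernel is that one sparse matrix product advances several breadth-first searches at once: each column of the frontier matrix belongs to one start vertex, and a nonzero test after each product turns path counts back into reachability flags. A minimal serial sketch in Python, for illustration only (sets stand in for sparse columns; `multi_source_steps`, `induced_subgraph`, and the 4-vertex graph are hypothetical and are not part of Star-P or KDS):

```python
def multi_source_steps(A, nv, starts, pathlen):
    """Boolean analogue of repeating 'multiply by A, then test for nonzero'.

    A maps each row index i to the set of column indices j where A[i][j] != 0.
    Returns one set per start vertex: the vertices marked after pathlen steps.
    """
    # One "column" per start vertex, initially holding just that vertex.
    cols = [{v} for v in starts]
    for _ in range(pathlen):
        # Row i of the product is nonzero exactly when row i of A touches
        # at least one currently marked vertex.
        cols = [{i for i in range(nv) if A.get(i, set()) & col} for col in cols]
    return cols


def induced_subgraph(A, vertices):
    """Keep only the entries of A whose row and column are both in vertices."""
    return {i: A.get(i, set()) & vertices for i in sorted(vertices)}


# Hypothetical undirected graph: triangle 0-1-2 with a tail edge 2-3.
A = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
reached = multi_source_steps(A, nv=4, starts=[0, 3], pathlen=2)
subgraphs = [induced_subgraph(A, r) for r in reached]
```

Because the frontier is overwritten rather than accumulated, each column ends up marking the vertices reachable by a walk of exactly `pathlen` edges, which is why vertex 2 is absent from the second result.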

13
A complex workflow (SSCA2, kernel 4)
    function leader = kernel4f (G)
    % KERNEL4F : SSCA2 Kernel 4 -- Graph Clustering

    % Find a Maximal Independent Set in G
    [IS, misrounds] = mis (G);
    fprintf ('MIS rounds: %d. MIS nodes: %d\n', misrounds, length(IS));

    % Find neighbours of each node from the IS
    % (n is the number of vertices; its definition is not shown on the slide)
    neighFromIS = G * sparse(IS, IS, 1, n, n);

    % Pick one of the neighbouring IS nodes as a leader
    [ign, leader] = max (neighFromIS, [], 2);

    % Collect votes from neighbours
    [I, J] = find (G);
    S = sparse (I, leader(J), 1, n, n);

    % Pick the most popular leader among neighbours
    % and join that cluster
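The clustering steps listed on this slide (find a maximal independent set, give every vertex a leader from that set, then let each vertex join the leader most popular among its neighbours) can be sketched serially as follows. This is a Python illustration under stated assumptions, not the Star-P implementation: `greedy_mis` is a simple stand-in for the parallel MIS routine, independent-set vertices are assumed to lead themselves, ties are broken toward the lowest-numbered leader, and the example graph is hypothetical.

```python
from collections import Counter


def greedy_mis(adj):
    """Serial stand-in for a maximal independent set routine.

    Scans vertices in increasing order and keeps each vertex that has no
    already-kept neighbour.
    """
    chosen = set()
    for v in sorted(adj):
        if not (adj[v] & chosen):
            chosen.add(v)
    return chosen


def cluster_by_leader_votes(adj):
    mis = greedy_mis(adj)
    # Each vertex picks a leader: itself if it is in the independent set
    # (an assumption of this sketch), otherwise its lowest-numbered
    # neighbour in the set. Maximality guarantees such a neighbour exists.
    leader = {v: v if v in mis else min(adj[v] & mis) for v in adj}
    cluster = {}
    for v in adj:
        # Collect the leaders of v's neighbours as votes.
        votes = Counter(leader[u] for u in adj[v])
        if not votes:
            cluster[v] = leader[v]  # isolated vertex: nothing to vote on
            continue
        best = max(votes.values())
        # Join the most popular leader; ties go to the lowest-numbered one.
        cluster[v] = min(l for l, c in votes.items() if c == best)
    return mis, leader, cluster


# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
mis, leader, cluster = cluster_by_leader_votes(adj)
```

On this graph the two triangles come out as the two clusters, led by vertices 0 and 3.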

14
Scaling Performance: cSSCA2 on 128P
scale   vertices (2^scale)   edges (10 x vertices)   graph size (bytes)   time (seconds)
22      4.194E+06            4.194E+07               7.550E+08            122.51
24      1.678E+07            1.678E+08               3.020E+09            402.31
26      6.711E+07            6.711E+08               1.208E+10            1237.1
  • Timings scale well for large graphs:
  • 2x problem size => 2x time
  • 2x problem size and 2x processors => same time
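The scaling claim can be checked directly against the table. The short Python calculation below uses only the numbers on this slide; the bytes-per-edge figure it computes is derived from the table and is not stated by the authors.

```python
# (scale, graph size in bytes, time in seconds), copied from the table.
rows = [
    (22, 7.550e8, 122.51),
    (24, 3.020e9, 402.31),
    (26, 1.208e10, 1237.1),
]


def vertices_and_edges(scale):
    vertices = 2 ** scale
    return vertices, 10 * vertices


# Storage cost per edge implied by the table.
bytes_per_edge = [size / vertices_and_edges(scale)[1] for scale, size, _ in rows]

# Growth between consecutive rows: problem size versus run time.
ratios = []
for (s0, _, t0), (s1, _, t1) in zip(rows, rows[1:]):
    size_ratio = vertices_and_edges(s1)[0] / vertices_and_edges(s0)[0]
    time_ratio = t1 / t0
    ratios.append((size_ratio, time_ratio))
    print(f"scale {s0} -> {s1}: problem x{size_ratio:.0f}, time x{time_ratio:.2f}")
```

Each step multiplies the problem size by 4 while the run time grows by a factor somewhat above 3, so on these measurements time grows slightly more slowly than problem size at a fixed 128 processors.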

15
Distributed visualization
16
App 1: Computational Ecology
  • Modeling dispersal of species within a habitat
    (to maximize range)
  • Large geographic areas, linked with GIS data
  • Blend of numerical and combinatorial algorithms

Brad McRae and Paul Beier, "Circuit theory
predicts gene flow in plant and animal
populations," PNAS, Vol. 104, No. 50, December
11, 2007
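As a rough illustration of how circuit theory blends combinatorial and numerical work, the sketch below treats a habitat as a resistor network (a graph whose edges carry conductances) and computes the effective resistance between two cells by solving a linear system on the graph Laplacian. This is a minimal Python sketch under stated assumptions, not the method used in the cited work or in Star-P: the function name, the 2x2 grid, and the unit conductances are all hypothetical, and a real landscape would need a sparse solver rather than dense elimination.

```python
from fractions import Fraction


def effective_resistance(n, edges, a, b):
    """Effective resistance between distinct nodes a and b of a connected network.

    edges is a list of (u, v, conductance). Node b is grounded and one unit
    of current is injected at a; the voltage at a is then the resistance.
    """
    # Combinatorial part: assemble the graph Laplacian L = D - W exactly.
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v, g in edges:
        g = Fraction(g)
        L[u][u] += g
        L[v][v] += g
        L[u][v] -= g
        L[v][u] -= g
    # Ground node b by dropping its row and column, and append the
    # right-hand side (unit current at a) as an extra column.
    keep = [i for i in range(n) if i != b]
    M = [[L[i][j] for j in keep] + [Fraction(1 if i == a else 0)] for i in keep]
    m = len(keep)
    # Numerical part: Gauss-Jordan elimination on the reduced system.
    for col in range(m):
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return M[keep.index(a)][m]


# Hypothetical 2x2 grid of habitat cells (a 4-cycle) with unit conductances.
grid = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
r_opposite = effective_resistance(4, grid, 0, 2)
r_adjacent = effective_resistance(4, grid, 0, 1)
```

Opposite corners see two length-2 routes in parallel, while adjacent corners see a direct edge in parallel with a length-3 detour, so the adjacent pair has the lower resistance.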
17
Results
  • Solution time reduced from 3 days (desktop) to 5
    minutes (14P) for typical problems
  • Aiming for much larger problems:
    Yellowstone-to-Yukon (Y2Y)

18
App 2: Factoring network flow behavior
Karpinski, Almeroth, Belding
19
Algorithmic exploration
  • Many NMF variants exist in the literature
  • Not clear how useful on large data
  • Not clear how to calibrate (i.e., number of
    iterations to converge)
  • NMF algorithms combine linear algebra and
    optimization methods
  • Basic and improved NMF factorization
    algorithms implemented:
  • Euclidean (Lee & Seung 2000)
  • K-L divergence (Lee & Seung 2000)
  • semi-nonnegative (Ding et al. 2006)
  • left/right-orthogonal (Ding et al. 2006)
  • bi-orthogonal tri-factorization (Ding et al.
    2006)
  • sparse Euclidean (Hoyer et al. 2002)
  • sparse divergence (Liu et al. 2003)
  • non-smooth (Pascual-Montano et al. 2006)
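To make the first variant in this list concrete, the sketch below applies the multiplicative update rules for the Euclidean objective, which alternate a linear-algebra step (matrix products) with an elementwise rescaling that keeps both factors non-negative. It is a small dense Python illustration under stated assumptions, not the KDS implementation: the helper names, the random initialization, the `eps` guard, the iteration count, and the 4x3 matrix are all hypothetical.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf_euclidean(V, rank, iters=200, seed=0, eps=1e-12):
    """Factor non-negative V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start, so no entry is stuck at zero initially.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frob_error(V, W, H)]
    for _ in range(iters):
        # H <- H .* (W'V) ./ (W'WH); eps guards against division by zero.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # W <- W .* (VH') ./ (WHH')
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
        errors.append(frob_error(V, W, H))
    return W, H, errors


# Hypothetical 4x3 non-negative matrix with an exact rank-2 factorization.
V = [[1, 2, 0], [2, 4, 0], [0, 1, 1], [0, 3, 3]]
W, H, errors = nmf_euclidean(V, rank=2)
```

The `errors` trace is the kind of evidence an analyst would inspect when deciding how many iterations are enough, which is the calibration question raised on this slide.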

20
NMF traffic analysis results
  • NMF identifies essential components of the
    traffic
  • Analyst labels different types of external
    behavior

21
Future Directions
  • What should KDS contain?
  • More algorithms?
  • Other classes of algorithms?
  • Visualization?
  • Easy use of hardware accelerators (GPU, Cell)?