Getting Beyond the Filesystem: New Models for Data Intensive Scientific Computing


1
Getting Beyond the Filesystem: New Models
for Data Intensive Scientific Computing
  • Douglas Thain
  • University of Notre Dame
  • HEC FSIO Workshop
  • 6 August 2009

2
The Standard Model
[Diagram: several groups of CPUs and disks, with the labels P and F between them]
The user has to guess at what the filesystem is
good at.
The FS has to guess what the user is going
to do next.
3
Understand what the user is trying to
accomplish, not what they are struggling to run
on this machine right now.
4
The Context: Biometrics Research
  • Computer Vision Research Lab at Notre Dame
  • Kinds of research questions
      • How does human variation affect matching?
      • What's the reliability of 3D versus 2D?
      • Can we reliably extract a good iris image from an
        imperfect video, and use it for matching?
  • A great context for systems research
      • Committed collaborators, lots of data, problems
        of unbounded size, some time sensitivity, real
        impact.
      • Acquiring TB data/month, sustained 90% CPU util.
      • Always composing workloads that break something.

5
BXGrid Schema
Scientific Metadata
  Type  Subject  Eye    Color  FileID
  Iris  S100     Right  Blue   10486
  Iris  S100     Left   Blue   10487
  Iris  S203     Right  Brown  24304
  Iris  S203     Left   Brown  24305
General Metadata
  fileid = 24305, size = 300K, type = jpg, sum = abc123
Immutable Replicas
  replicaid = 423, state = ok
  replicaid = 105, state = ok
  replicaid = 293, state = creating
  replicaid = 102, state = deleting
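One way to read the three layers on this slide is as three relational tables. The following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the helper `usable_replicas` are assumptions made for illustration, not BXGrid's actual schema, and the sketch assumes that all four replicas shown belong to file 24305.

```python
import sqlite3

def build_example_db():
    """Build an in-memory database holding the rows shown on the slide.

    Table and column names are illustrative assumptions, not BXGrid's schema.
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE irises   (subject TEXT, eye TEXT, color TEXT, fileid INTEGER);
        CREATE TABLE files    (fileid INTEGER PRIMARY KEY, size TEXT, type TEXT, checksum TEXT);
        CREATE TABLE replicas (replicaid INTEGER PRIMARY KEY, fileid INTEGER, state TEXT);
    """)
    # Scientific metadata: the four iris records from the slide.
    db.executemany("INSERT INTO irises VALUES (?, ?, ?, ?)", [
        ("S100", "Right", "Blue", 10486),
        ("S100", "Left", "Blue", 10487),
        ("S203", "Right", "Brown", 24304),
        ("S203", "Left", "Brown", 24305),
    ])
    # General metadata: only file 24305 is described on the slide.
    db.execute("INSERT INTO files VALUES (24305, '300K', 'jpg', 'abc123')")
    # Assumption: every replica on the slide is a copy of file 24305.
    db.executemany("INSERT INTO replicas VALUES (?, ?, ?)", [
        (423, 24305, "ok"),
        (105, 24305, "ok"),
        (293, 24305, "creating"),
        (102, 24305, "deleting"),
    ])
    return db

def usable_replicas(db, subject, eye):
    """Resolve a scientific query down to the replica ids in state 'ok'."""
    rows = db.execute("""
        SELECT r.replicaid
        FROM irises i
        JOIN files f    ON f.fileid = i.fileid
        JOIN replicas r ON r.fileid = f.fileid
        WHERE i.subject = ? AND i.eye = ? AND r.state = 'ok'
        ORDER BY r.replicaid
    """, (subject, eye))
    return [replicaid for (replicaid,) in rows]

db = build_example_db()
print(usable_replicas(db, "S203", "Left"))  # [105, 423]
```

The point of the layering is that a user asks for "the left iris of S203" and never names a replica; replicas in the creating or deleting state are skipped by the query.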
8
BXGrid Abstractions
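S = Select( color = "brown" )
B = Transform( S, G )
M = AllPairs( A, B, F )
[Diagram: records tagged by eye (L/R) and color; Select keeps the brown ones as S, G transforms each into B1..B3, F compares each of A1..A3 against each of B1..B3, and the results feed an ROC curve.]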
H. Bui et al., "BXGrid: A Repository and
Experimental Abstraction," Special Issue on
e-Science, Journal of Cluster Computing, 2009.
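The Select, Transform, AllPairs chain on this slide can be mimicked on in-memory data. This is only a sketch of the idea: the functions `select`, `transform`, and `all_pairs`, the feature extractor `G`, and the comparison `F` are hypothetical stand-ins, and because the slide does not say where set A comes from, the sketch assumes A is the same as B.

```python
# Iris records as shown on the BXGrid Schema slide.
IRISES = [
    {"subject": "S100", "eye": "Right", "color": "Blue", "fileid": 10486},
    {"subject": "S100", "eye": "Left", "color": "Blue", "fileid": 10487},
    {"subject": "S203", "eye": "Right", "color": "Brown", "fileid": 24304},
    {"subject": "S203", "eye": "Left", "color": "Brown", "fileid": 24305},
]

def select(records, **criteria):
    """Keep the records whose fields equal every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]

def transform(records, g):
    """Apply g to each record independently."""
    return [g(r) for r in records]

def all_pairs(set_a, set_b, f):
    """Return the matrix M with M[i][j] = f(set_a[i], set_b[j])."""
    return [[f(a, b) for b in set_b] for a in set_a]

def G(record):
    """Hypothetical feature extractor: reduce a record to (subject, eye)."""
    return (record["subject"], record["eye"])

def F(a, b):
    """Hypothetical comparison: 1.0 when both features are the same eye."""
    return 1.0 if a[1] == b[1] else 0.0

S = select(IRISES, color="Brown")
B = transform(S, G)
A = B  # Assumption: the slide does not define A.
M = all_pairs(A, B, F)
print(M)  # [[1.0, 0.0], [0.0, 1.0]]
```

Each step only states what data it needs and what function to apply, which is what leaves the system free to decide where and how to run it.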
9
An abstraction is a declarative specification of
both the data and computation in a batch workload.
(Think SQL for clusters.)
12
AllPairs( set A, set B, func F )
[Figure: a matrix with rows A0..An and columns B0..Bn; each cell holds the score F(Ai, Bj), with values between 0.1 and 1.0.]
Moretti et al., "All-Pairs: An Abstraction for
Data Intensive Computing on Campus Grids," TPDS,
to appear in 2009.
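As a rough illustration of the semantics, AllPairs returns a matrix M with M[i][j] = F(A[i], B[j]), and every row can be computed independently. The sketch below uses a thread pool as a stand-in for cluster nodes; the `workers` parameter and the row-per-task split are assumptions of this sketch, not the design of the All-Pairs system.

```python
from concurrent.futures import ThreadPoolExecutor

def all_pairs(set_a, set_b, func, workers=4):
    """Return M where M[i][j] = func(set_a[i], set_b[j]).

    Each row is an independent task, so rows are handed to a pool of
    workers. Threads merely stand in for cluster nodes here.
    """
    def compute_row(a):
        return [func(a, b) for b in set_b]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of set_a.
        return list(pool.map(compute_row, set_a))

matrix = all_pairs([1, 2, 3], [10, 20], lambda a, b: abs(a - b))
print(matrix)  # [[9, 19], [8, 18], [7, 17]]
```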
13
> AllPairs IrisA IrisB MyComp.exe
14
How can All-Pairs do better?
  • No demand paging!
  • Distribute the data via spanning tree.
  • Metadata lookup can be cached forever.
  • File locations can be cached optimistically.
  • Compute the right number of CPUs to use.
  • Check for data integrity, but on failure, don't
    abort, just migrate.
  • On a multicore CPU, use multiple threads and a
    cache-aware access algorithm.
  • Decide at runtime whether to run here on 4 cores,
    or run on the entire cluster.
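To make the spanning-tree bullet concrete, here is a small planning sketch. The slide only says "spanning tree"; the particular shape used here, where every host that already holds the data sends to one host that does not (so the number of copies doubles each round), is an assumption, and `spanning_tree_schedule` and the host names are hypothetical.

```python
def spanning_tree_schedule(source, targets):
    """Plan rounds of (sender, receiver) transfers.

    In each round, every host that already holds the data sends it to
    one host that is still waiting, so coverage doubles per round.
    """
    have = [source]
    waiting = list(targets)
    rounds = []
    while waiting:
        senders = list(have)  # hosts that held the data before this round
        this_round = []
        for sender in senders:
            if not waiting:
                break
            receiver = waiting.pop(0)
            this_round.append((sender, receiver))
            have.append(receiver)
        rounds.append(this_round)
    return rounds

hosts = ["n%d" % i for i in range(1, 8)]
for number, transfers in enumerate(spanning_tree_schedule("n0", hosts), start=1):
    print(number, transfers)
```

With one source and seven targets the plan needs three rounds, whereas a single server feeding every node in turn would need seven sequential transfers.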

15
What's the Upshot?
  • The end user gets scalable performance always at
    the sweet spot of the system.
  • It works despite many different failure modes,
    including uncooperative admins.
  • 60Kx60K All-Pairs ran on 100-500 CPUs over one
    week, largest known experiment on public data in
    the field.
  • All-Pairs abstraction is 4X more efficient than
    the standard model.

16
AllPairs( set A, set B, func F )
[Figure: a matrix with rows A0..An and columns B0..Bn; each cell holds the score F(Ai, Bj), with values between 0.1 and 1.0.]
Moretti et al., "All-Pairs: An Abstraction for
Data Intensive Computing on Campus Grids," TPDS,
to appear in 2009.
17
SomePairs( set A, set B, pairs P, func F )
P = (0,1) (0,2) (1,1) (1,2) (2,0) (2,2) (2,3)
[Figure: a matrix with rows A0..An and columns B0..Bn, in which F is evaluated only for the index pairs listed in P.]
Moretti et al., "Scalable Modular Genome
Assembly on Campus Grids," under review in 2009.
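SomePairs differs from AllPairs only in that the caller supplies the list of index pairs to evaluate. A minimal sketch, using the pair list P from the slide; the function name `some_pairs`, the sample sets, and the sample function are assumptions for illustration.

```python
def some_pairs(set_a, set_b, pairs, func):
    """Evaluate func only on the requested (i, j) index pairs.

    Returns a sparse result: a dict mapping (i, j) to func(set_a[i], set_b[j]).
    """
    return {(i, j): func(set_a[i], set_b[j]) for (i, j) in pairs}

# The pair list shown on the slide.
P = [(0, 1), (0, 2), (1, 1), (1, 2), (2, 0), (2, 2), (2, 3)]

# Hypothetical inputs, just large enough for the indices in P.
A = [1, 2, 3]
B = [10, 20, 30, 40]

result = some_pairs(A, B, P, lambda a, b: a + b)
print(result[(2, 3)])  # 43
```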
18
Wavefront( R[x,0], R[0,y], F(x,y,d) )
[Figure: a 5x5 grid of cells R[0,0]..R[4,4]. The boundary cells R[x,0] and R[0,y] are given; each interior cell is produced by F from three neighboring cells labeled x, y, and d.]
Yu et al., "Harnessing Parallelism in Multicore
Clusters with the All-Pairs and Wavefront
Abstractions," IEEE HPDC 2009.
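The recurrence behind Wavefront can be sketched in a few lines. The boundary cells are inputs and each interior cell depends on three earlier cells. The mapping used here, x = R[i-1][j], y = R[i][j-1], and d = R[i-1][j-1], is an assumption consistent with the x, y, d labels on the slide, and this sequential loop is not the parallel implementation described in the paper.

```python
def wavefront(first_row, first_col, func):
    """Fill R given R[i][0] = first_row[i] and R[0][j] = first_col[j].

    Interior cells follow R[i][j] = func(x, y, d) with
    x = R[i-1][j], y = R[i][j-1], d = R[i-1][j-1] (assumed mapping).
    first_row[0] and first_col[0] must agree, since both are R[0][0].
    """
    n, m = len(first_row), len(first_col)
    R = [[None] * m for _ in range(n)]
    for i in range(n):
        R[i][0] = first_row[i]
    for j in range(m):
        R[0][j] = first_col[j]
    # Cells on the same anti-diagonal (equal i + j) do not depend on
    # each other, which is where the available parallelism comes from.
    for i in range(1, n):
        for j in range(1, m):
            R[i][j] = func(R[i - 1][j], R[i][j - 1], R[i - 1][j - 1])
    return R

grid = wavefront([1, 1, 1], [1, 1, 1], lambda x, y, d: x + y + d)
print(grid[2][2])  # 13
```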
19
Directed Graph
part1 part2 part3: input.data split.py
    ./split.py input.data

out1: part1 mysim.exe
    ./mysim.exe part1 > out1

out2: part2 mysim.exe
    ./mysim.exe part2 > out2

out3: part3 mysim.exe
    ./mysim.exe part3 > out3

result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result

Thain and Moretti, "Abstractions for Cloud
Computing with Condor," in Handbook of Cloud
Computing Services, in press.
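The rules on this slide describe a dependency graph, so a valid execution order can be derived rather than written by hand. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The `RULES` list and the `plan` function are illustrative assumptions, not the interface of the authors' tool, and the commands are only ordered, never executed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each rule is (targets, sources, command), copied from the slide and
# listed out of order on purpose.
RULES = [
    (["result"], ["out1", "out2", "out3", "join.py"],
     "./join.py out1 out2 out3 > result"),
    (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 > out1"),
    (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 > out2"),
    (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 > out3"),
    (["part1", "part2", "part3"], ["input.data", "split.py"],
     "./split.py input.data"),
]

def plan(rules):
    """Return the commands in an order that respects file dependencies."""
    # Map every produced file to the index of the rule that creates it.
    producer = {target: index
                for index, (targets, _, _) in enumerate(rules)
                for target in targets}
    # A rule depends on the rules that produce its sources. Files with
    # no producer (input.data, split.py, ...) are plain inputs.
    graph = {index: {producer[s] for s in sources if s in producer}
             for index, (_, sources, _) in enumerate(rules)}
    return [rules[index][2] for index in TopologicalSorter(graph).static_order()]

for command in plan(RULES):
    print(command)
```

The three mysim.exe commands have no dependencies on each other, so a scheduler is free to run them at the same time once the split step has finished.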
20
Contributions
  • Repository for Scientific Data
      • Provide a better match with end user goals.
      • Simplify scalability, management, recovery.
  • Abstractions for Large Scale Computing
      • Expose actual data needs of computation.
      • Simplify scalability, consistency, recovery.
      • Enable transparent portability from single CPU to
        multicore to cluster to grid.
      • Are there other useful abstractions?

21
HECURA Summary
  • Impact on real science
      • Production quality data repository used to
        produce datasets accepted by NIST and published
        to community. Indispensable.
      • Abstractions used to scale up students' daily
        scratch work to 500 CPUs, increasing data
        coverage of published results by two orders of
        magnitude.
      • Planning to adapt prototype to Civil Eng CDI
        project.
  • Impact on computer science
      • Open source tools that run on Condor, SGE,
        multicore, recently with independent users.
      • 1 chapter, 2 journal papers, 4 conference papers.
  • Broader Impact
      • Integration into courses, used by other projects
        at ND.
      • Undergrad co-authors from ND and elsewhere.

22
Participants
  • Faculty
      • Douglas Thain and Patrick Flynn
  • Graduate Students
      • Christopher Moretti (Abstractions)
      • Hoang Bui (Repository)
      • Li Yu (Multicore)
  • Undergraduates
      • Jared Bulosan
      • Michael Kelly
      • Christopher Lyon (North Carolina A&T Univ)
      • Mark Pasquier
      • Kameron Srimoungchanh (Clemson Univ)
      • Rachel Witty

23
For more information
  • The Cooperative Computing Lab
      • http://www.cse.nd.edu/ccl
  • Cloud Computing and Applications
      • October 20-21, Chicago IL
      • http://www.cca09.org
  • This work was supported by the National Science
    Foundation via grant CCF-06-21434.