2
Middleware Infrastructure for Large-Scale Data
Management
  • Umit Catalyurek, Tahsin Kurc, Joel Saltz
  • Department of Biomedical Informatics
  • The Ohio State University

VIEWS Alliance Forum July 19-20, 2004
3
Goals: what will the world look like? (IMAGE
Data Management Working Group)
  • Identify, query, retrieve, and carry out
    on-demand data product generation directed at
    collections of data from multiple sites/groups on
    a given topic; reproduce each group's data
    analysis and carry out new analyses on all
    datasets. Should be able to carry out entirely
    new analyses or to incrementally modify other
    scientists' data analyses. Should not have to
    worry about the physical location of data or
    processing. Should have excellent tools available
    to examine data. This should include a mechanism
    to authenticate potential users, control access
    to data, and log the identity of those accessing
    data.

4
Imaging, Medical Analysis and Grid Environments
(IMAGE), September 16-18, 2003
  • Organisers: Malcolm Atkinson, Richard Ansorge,
    Richard Baldock, Dave Berry, Mike Brady, Vincent
    Breton, Frederica Darema, Mark Ellisman, Cecile
    Germain-Renaud, Derek Hill, Robert Hollebeek,
    Chris Johnson, Michael Knopp, Alan Rector, Joel
    Saltz, Chris Taylor, Bonnie Webber

5
Image/Data Application Areas
Imaging, Medical Analysis and Grid Environments
(IMAGE), September 16-18, 2003, e-Science
Institute, 15 South College Street, Edinburgh
  • Satellite Data Processing
  • Digital Pathology
  • Managing Oilfields, Contaminant Transport
  • Biomedical Image Analysis
  • DCE-MRI Analysis
6
Dataset Analysis and Visualization
  • Spatio-temporal datasets (generally low
    dimensional) describe physical scenarios
  • Data products often involve results from
    ensembles of spatio-temporal datasets
  • Some applications require interactive exploration
    of datasets
  • Common operations: subsetting, filtering,
    interpolations, projections, comparisons,
    frequency counts
  • Optimizations: semantic caching, caching of
    intermediate results, multiple-query
    optimizations

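These common operations can be illustrated with a small self-contained sketch (the toy dataset and all function names are invented for illustration; in the real systems these run as distributed filters over large declustered datasets):

```python
# Toy spatio-temporal dataset: (time, x, y, value) tuples.
data = [(t, x, y, (t * 10 + x + y) % 7)
        for t in range(4) for x in range(3) for y in range(3)]

def subset(points, t_range, x_range):
    """Subsetting: keep points inside a time/space window."""
    return [p for p in points
            if t_range[0] <= p[0] < t_range[1]
            and x_range[0] <= p[1] < x_range[1]]

def filter_by_value(points, threshold):
    """Filtering: keep points whose value exceeds a threshold."""
    return [p for p in points if p[3] > threshold]

def frequency_counts(points):
    """Frequency counts over the value attribute."""
    counts = {}
    for p in points:
        counts[p[3]] = counts.get(p[3], 0) + 1
    return counts

window = subset(data, t_range=(1, 3), x_range=(0, 2))
hot = filter_by_value(window, threshold=3)
print(frequency_counts(window))
```

The same subset/filter/count chain is what a multi-stage query plan decomposes into, stage by stage.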
7
Molecular data and OSU Testbed Effort
  • Data sharing in OSU shared resource
  • Support for all data sharing in hundreds of
    research studies in OSU comprehensive cancer
    center
  • State of Ohio BRTT
  • Integration of clinical, genotype, proteomic,
    histological, gene regulatory data in context of
    4 translational research projects
  • $2M per year to fund development of
    bioinformatics data sharing infrastructure

8
Examples of data sources to be Integrated
Examples of data types that are generated or
referenced by OSUCCC Shared Resources
9
Center for Grid Enabled Image Analysis
  • Biomedical research
  • Ensure success of biventricular pacing; mechanism
    of ischemic cardiac injury; characterization of
    the relationship between angiogenesis and breast
    and bone cancer; mouse models of tumorigenesis;
    role of expression of oncogenes in placental
    development (virtual slides, mouse placenta)
  • Radiology/cardiac imaging research
  • Capture and analyze time-dependent cardiac
    imagery; EPR technology development; quantify
    treatment efficacy through analysis of diffusion
    contrast imagery; automated detection of mitoses;
    deconvolution, 3-D reconstruction, segmentation,
    and shape characterization in microscopy imagery.
  • Computer Science
  • Middleware for large scale multi-scale data, grid
    metadata management, feature detection, parallel
    visualization, on-demand and interactive
    computing

10
Data Storage
  • Clusters provide inexpensive and scalable storage
  • $50K-$100K clusters at Ohio State, OSC, and U.
    Maryland range from 16 to 50 processors and 7.5TB
    to 15TB of IDE disk storage
  • Data declustered to cluster nodes to promote
    parallel I/O
  • Uses DataCutter and IP4G toolkits for data
    preprocessing and declustering
  • R-tree indexing of declustered data

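The declustering and indexing steps above can be sketched as follows, with round-robin assignment of chunks to nodes and a flat bounding-box index standing in for the R-tree (all names are illustrative, not the DataCutter/IP4G API):

```python
def decluster(chunks, n_nodes):
    """Round-robin declustering: spread chunks over nodes so a
    spatial query can read from many disks in parallel."""
    nodes = [[] for _ in range(n_nodes)]
    for i, chunk in enumerate(chunks):
        nodes[i % n_nodes].append(chunk)
    return nodes

def overlaps(a, b):
    """Axis-aligned bounding-box overlap test in 2-D."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def query(nodes, box):
    """Each node scans its local index; a real system would use
    an R-tree rather than this linear scan."""
    hits = []
    for node in nodes:
        hits += [c for c in node if overlaps(c["bbox"], box)]
    return hits

# Chunk bounding boxes: (xmin, ymin, xmax, ymax).
chunks = [{"id": i, "bbox": (i, 0, i + 1, 1)} for i in range(8)]
nodes = decluster(chunks, n_nodes=4)
print([c["id"] for c in query(nodes, (2, 0, 5, 1))])
```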
11
Ohio Supercomputing Center Mass Storage Testbed
  • 50 TB of performance storage
  • home directories, project storage space, and
    long-term frequently accessed files.
  • 420 TB of performance/capacity storage
  • Active Disk Cache - compute jobs that require
    directly connected storage
  • parallel file systems, and scratch space.
  • Large temporary holding area
  • 128 TB tape library
  • Backups and long-term "offline" storage

IBM's Storage Tank technology combined with TFN
connections will allow large data sets to be
seamlessly moved throughout the state with
increased redundancy and seamless delivery.
12
Services
  • Filter-stream based distributed execution
    middleware (DataCutter, STORM)
  • Grid-based dataset management, query, and
    on-demand data product generation (STORM, Active
    Proxy-G, Mako)
  • Supports distributed storage of XML schemas
    through virtualized databases and file systems
  • Distributed metadata management (Mobius Global
    Model Exchange)
  • Track metadata associated with workflows, input
    image datasets, checkpointed intermediate results

13
Underlying Technologies
  • DataCutter
  • Component-based middleware for processing of
    large distributed datasets.
  • Enables execution of user-defined data processing
    components in a distributed environment.
  • STORM
  • Basic database support for large file based
    scientific datasets in the Grid.
  • Implemented on the DataCutter framework
  • Efficient subsetting and user-defined filtering
    of large, distributed datasets.
  • Object relational SQL front end

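DataCutter's filter-stream model can be approximated in a single process with Python generators: each filter consumes an upstream stream and produces a downstream one (a sketch only; the real middleware places filters on different hosts and streams buffers between them):

```python
def read_filter(records):
    # Source filter: stream raw records downstream.
    for r in records:
        yield r

def select_filter(stream, predicate):
    # Intermediate filter: forward only matching records.
    for r in stream:
        if predicate(r):
            yield r

def aggregate_filter(stream):
    # Sink filter: reduce the incoming stream to one result.
    total = 0
    for r in stream:
        total += r["value"]
    return total

records = [{"id": i, "value": i * i} for i in range(10)]
pipeline = select_filter(read_filter(records), lambda r: r["id"] % 2 == 0)
print(aggregate_filter(pipeline))  # prints 120
```

Because each stage only sees a stream, stages can be replicated or placed near the data independently, which is the point of the filter-stream decomposition.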
14
Underlying Technologies
  • IP4G (Image Processing for the Grid)
  • Toolkit to create parallel, distributed image
    processing applications.
  • Built on the DataCutter framework.
  • VTK, ITK in a grid-based computation environment.
  • Active Proxy-G: Active Semantic Data Cache
  • Employ user semantics to cache and retrieve data
  • Store and reuse results of computations
  • Compilation Support
  • Thesis work of Henrique Andrade
  • Mobius
  • services for managing metadata definition and
    metadata on the Grid
  • controlled metadata definitions and metadata
    versioning
  • Federated XML-based data and metadata storage

15
DataCutter Support for Demand Driven Workflows
  • Many data analysis queries have multiple stages
  • Decompose into parallel components
  • Strategically place components
  • Create GT4 compliant services

Virtual Microscope
Iso-surface Rendering
http://www.datacutter.org/
16
Integrating DataCutter with existing Grid
toolkits: SRB, Globus, NWS
  • SRB integration: subset and filter datasets
  • Globus integration: DataCutter uses Globus
    resource discovery, resource allocation,
    authentication, and authorization services.
  • Network Weather Service (NWS) integration: NWS
    is used for system monitoring.
  • DataCutter will be used as a toolkit to assemble
    GT4 compliant services

17
STORM Query Planning
http://storm.bmi.ohio-state.edu/
18
STORM Query Execution
19
Data Returned as stream of tuples
  • Stand-alone client
  • MPI program
  • MPI program provides partitioning function
  • Partitioning service generates mapping
  • Data mover sends data to appropriate MPI process
  • Single or replicated copies of a DataCutter
    filter group

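The partitioning path can be sketched like this: a client-supplied partitioning function is turned into a key-to-rank mapping by the partitioning service, and the data mover buckets tuples by destination process (hash partitioning and all names here are assumed for illustration, not the STORM API):

```python
def make_mapping(keys, n_procs, part_fn):
    """Partitioning service: precompute key -> process rank."""
    return {k: part_fn(k) % n_procs for k in keys}

def move(tuples, mapping):
    """Data mover: bucket tuples by destination rank; a real
    system would send each bucket to its MPI process."""
    outboxes = {}
    for t in tuples:
        rank = mapping[t[0]]
        outboxes.setdefault(rank, []).append(t)
    return outboxes

tuples = [(k, k * 100) for k in range(8)]
mapping = make_mapping(range(8), n_procs=4, part_fn=lambda k: k)
outboxes = move(tuples, mapping)
print({r: [t[0] for t in ts] for r, ts in outboxes.items()})
```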
20
Digital Microscopy NPACI Telescience, BIRN
  • Goal
  • Remote access, processing of subsets of large,
    distributed virtual slides
  • DataCutter, IP4G
  • Indexing, querying, caching and subsetting
  • Image processing by custom routines, VTK and ITK
    layered on DataCutter.
  • Use of heterogeneous, distributed clusters for
    data processing.

[Diagram: a 40,000 x 40,000 pixel virtual slide;
a query from the Telescience Portal is processed
by DataCutter]
21
Telescience Portal and DataCutter
22
Virtual Slide Cooperative Study Support
  • Children's Oncology Group, CALGB Cooperative
    Studies
  • 60 slides/day; 120 GB/day compressed, 3 TB/day
    uncompressed
  • Remote review of slides
  • Computer assisted tumor grading
  • 3-D Reconstruction
  • Tissue Microarray support
  • CALGB began November 2003, Children's Oncology in
    Spring 2004

23
Distributed, Federated and Integrated
  • Consortia Group Support
  • Virtual slides
  • OSC 420-Terabyte on-line storage system
  • CALGB, COG
  • OSU's Virtual Placenta Project
  • embryonic development and gene expression
  • BIRN: OSU/UCSD multi-photon project
  • Detect rare mitoses in mouse brain tumor model

24
Prototype Multiscale Data Analysis Pipeline
  • Disk-based multi-scale dataset
  • Total of 1 TB of data generated by super-sampling;
    the Visible Woman used as the coarse reference
    dataset
  • Preparation for the multi-scale Visible Mouse
    project, which will synthesize multiple imaging
    modalities, microscopy, and high-throughput
    molecular data
  • Filters use indices to extract data subset
  • Interpolation to derive single structured mesh
    from results of multi-scale query
  • Results streamed to Ohio Supercomputer parallel
    renderer
  • Demand driven processing under client control

25
Multiscale Pipeline
26
On Demand Data Analysis: The Instrumented Oilfield
27
[Diagram: closed analysis loop. Production is
simulated via reservoir modeling (Model 1 ..
Model N); data analysis tools (e.g.,
visualization) examine the results; knowledge of
the reservoir model is revised via imaging and
inversion of seismic data; the production strategy
is modified using an optimization criterion;
production is monitored by acquiring time-lapse
observations of seismic data; data management and
manipulation tools feed new models or parameters
back into the simulations.]
28
Analysis of Oil Reservoir Simulation Data
  • Datasets
  • A 1.5TB Dataset
  • 207 simulations, selected from several
    Geostatistics models and well patterns
  • Each simulation is 6.9GB: 10,000 time steps,
    9,000 grid elements, 8 scalars and 3 vectors (17
    variables)
  • A 5TB Dataset
  • 500 simulations, selected from several
    Geostatistics models and well patterns
  • Each simulation is 10GB: 2,000 time steps, 65K
    grid elements, 8 scalars and 3 vectors (17
    variables)
  • Stored at:
  • SDSC: HPSS and 30TB Storage Area Network system
  • UMD: 9TB disks on 50 nodes (PIII-650, 768MB,
    Switched Ethernet)
  • OSU: 7.2TB disks on 24 nodes (PIII-900, 512MB,
    Switched Ethernet)
  • Data Analysis
  • Economic model assessment
  • Bypassed oil regions
  • Representative Realization Selection for more
    simulations

29
Example Bypassed Oil
  • Query: Find all the datasets in D that have
    bypassed oil pockets with at least Tcc grid
    cells.
  • RD -- Read data filter. Access data sets.
  • CC -- Connected component filter. Perform
    connected component analysis to find oil regions
    per time step.
  • MT -- Merge over time filter. Combine over
    multiple time steps to find bypassed oil.

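A toy version of the CC and MT filters, assuming a 2-D grid of "bypassed" cells with 4-connected flood fill and a merge that keeps sufficiently large regions persisting across time steps (the real pipeline operates on 3-D simulation grids):

```python
def connected_components(cells):
    """CC filter: label 4-connected regions in a set of (x, y) cells."""
    cells, regions = set(cells), []
    while cells:
        stack = [cells.pop()]
        region = set()
        while stack:
            x, y = stack.pop()
            region.add((x, y))
            # Visit the four axis-aligned neighbours still unlabeled.
            for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if n in cells:
                    cells.remove(n)
                    stack.append(n)
        regions.append(region)
    return regions

def merge_over_time(per_step_regions, min_cells):
    """MT filter: keep regions of at least min_cells that overlap
    (in space) with a qualifying region in every time step."""
    persistent = [r for r in per_step_regions[0] if len(r) >= min_cells]
    for step in per_step_regions[1:]:
        persistent = [p for p in persistent
                      if any(p & r for r in step if len(r) >= min_cells)]
    return persistent

step0 = connected_components([(0, 0), (0, 1), (5, 5)])
step1 = connected_components([(0, 0), (0, 1), (9, 9)])
print(merge_over_time([step0, step1], min_cells=2))
```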
30
Seismic Modeling of Reservoirs
[Diagram: DataCutter filters process distributed
reservoir datasets]
31
Seismic Modeling of Reservoirs
[Diagram: STORM processes distributed reservoir
datasets]
32
Seismic Data Analysis: STORM On-Demand Processing
of a 1.5 TB Seismic Dataset

[Diagram: seismic dataset organization: survey ->
line -> SP (or CDP) source position -> traces;
array -> receiver group (receiver group position)
-> component]
33
Multi-Query Optimization: Active Proxy-G
  • Goal: minimize the total cost of processing a
    series of queries by creating an optimized access
    plan for the entire sequence [Kang, Dietz, and
    Bhargava]
  • Approach: minimize the total cost of processing a
    series of queries through data and computation
    reuse
  • IPDPS 2002, SC 2002, ICS 2002

[Diagram: three overlapping query regions q1, q2,
q3; a slab of q2 is the same as in q1, and the
pieces of q3 were computed for other queries in
the past]
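The reuse idea can be sketched with 1-D query intervals: a new query first probes a cache of previously computed intervals and only computes the uncovered remainder (illustrative only; Active Proxy-G caches multidimensional query regions and aggregates):

```python
class SemanticCache:
    """Cache keyed by query interval [lo, hi); reuse overlaps."""
    def __init__(self):
        self.entries = []  # list of (lo, hi) already computed

    def lookup(self, lo, hi):
        """Split a query into reused pieces and pieces to compute."""
        reused, missing = [], [(lo, hi)]
        for clo, chi in self.entries:
            next_missing = []
            for mlo, mhi in missing:
                olo, ohi = max(mlo, clo), min(mhi, chi)
                if olo < ohi:                 # overlap: reuse it
                    reused.append((olo, ohi))
                    if mlo < olo:
                        next_missing.append((mlo, olo))
                    if ohi < mhi:
                        next_missing.append((ohi, mhi))
                else:
                    next_missing.append((mlo, mhi))
            missing = next_missing
        return reused, missing

    def insert(self, lo, hi):
        self.entries.append((lo, hi))

cache = SemanticCache()
cache.insert(0, 10)                    # q1 computed earlier
reused, missing = cache.lookup(5, 15)  # q2 overlaps q1
print(reused, missing)
```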
34
What does it buy? (Digital microscopy)
Average Execution Time Per Query
  • 12 clients
  • 4 x 16 VR queries
  • 8 x 32 VM queries
  • 4 processors (up to 4 queries simultaneously)

[Chart: average execution time per query, 10-50 s,
versus PDSS size (128M to 320M), comparing three
configurations: disabled, reuse of results of
identical queries, and the Active Semantic Cache]
35
Active ProxyG Functional Components
  • Query Server
  • Lightweight Directory Service
  • Workload Monitor Service
  • Persistent Data Store Service

[Diagram: clients 1..k submit a query workload to
the Active Proxy-G Query Server; the Query Server
consults the Lightweight Directory Service, the
Workload Monitor Service, and the Persistent Data
Store Service, and dispatches subqueries to
application servers I..n, which send back
directory and workload updates]
36
Automatic Data Virtualization
  • Scientific and engineering applications require
    interactive exploration and analysis of datasets.
  • Application developers generally prefer storing
    data in files
  • Support high-level queries on multi-dimensional
    distributed datasets
  • Many possible data abstractions and query
    interfaces
  • Grid virtualized object-relational database
  • Grid virtualized objects with user-defined
    methods invoked to access and process data
  • A virtual relational table view
  • Large distributed scientific datasets

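The virtual-table idea can be sketched as follows: files stay in the application's native binary layout, and a thin layer presents them as relations answering SELECT-style requests (the record layout and all names are invented for illustration):

```python
import struct, tempfile, os

# Assumed native record layout: (cell_id: int32, pressure: float32).
RECORD = struct.Struct("<if")

def write_native_file(path, rows):
    """Simulate an application writing its own binary format."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(RECORD.pack(*row))

def virtual_table(path):
    """Virtualization layer: expose the file as (cell_id, pressure)
    tuples without the client knowing the on-disk layout."""
    with open(path, "rb") as f:
        data = f.read()
    for off in range(0, len(data), RECORD.size):
        yield RECORD.unpack_from(data, off)

def select(path, predicate):
    """A SELECT ... WHERE over the virtual table."""
    return [row for row in virtual_table(path) if predicate(row)]

path = os.path.join(tempfile.mkdtemp(), "cells.dat")
write_native_file(path, [(i, float(i) * 1.5) for i in range(6)])
print(select(path, lambda r: r[1] > 4.0))
```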
Data Virtualization
Data Service
37
System Architecture
SELECT ... FROM IPARS WHERE RID in (0,6,26,27)
AND TIME ... 1000 AND ... 0.7
AND SPEED(OILVX, OILVY, OILVZ) ...

Supported operations: subsetting, filtering, and
user-defined filtering
38
Comparison with hand-written codes
Dataset stored on 16 nodes: performance
difference is within 17%, with an average
difference of 14%.
Dataset stored on a single node: performance
difference is within 4%.
39
Components of Meta-data Descriptor
  • Describe attributes, location of files, layout of
    data in files, indices

40
Dataset Descriptor Example
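A hypothetical descriptor in the spirit of the components listed on the previous slide, built with ElementTree (all element and attribute names are invented, not the actual descriptor format):

```python
import xml.etree.ElementTree as ET

# Build a minimal dataset descriptor: attributes, file locations,
# on-disk layout, and an index reference (all names illustrative).
desc = ET.Element("dataset", name="ipars-run-042")
attrs = ET.SubElement(desc, "attributes")
for name, typ in [("TIME", "int32"), ("OILVX", "float32")]:
    ET.SubElement(attrs, "attribute", name=name, type=typ)
files = ET.SubElement(desc, "files")
ET.SubElement(files, "file", host="node07",
              path="/data/run042/part0.dat",
              layout="row-major", records="9000")
ET.SubElement(desc, "index", type="rtree",
              path="/data/run042/part0.idx")

xml_text = ET.tostring(desc, encoding="unicode")
print(xml_text)
```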
41
Active Projects/Funding
  • National Science Foundation: National Middleware
    Infrastructure
  • National Science Foundation: ITR on the
    Instrumented Oilfield (Dynamic Data Driven
    Application Systems)
  • National Science Foundation: NGS An Integrated
    Middleware and Language/Compiler for Data
    Intensive Applications in Grid Environments
  • Center for Grid Enabled Biomedical Image
    Analysis (NIH, NIBIB, NIGMS)
  • Biomedical Research Technology Transfer
    Partnership Award, Biomedical Informatics
    Synthesis Platform (State of Ohio)
  • Department of Energy: DataCutter Software
    Support for Generating Data Products from Very
    Large Datasets
  • NCI: Overcoming Barriers to Clinical Trial
    Accrual
  • OSU Cancer Center Shared Resource
  • OSU Cancer Center Shared Resource

42
Mobius
  • Middleware system that provides support for
    management of metadata definitions (defined as
    XML schemas) and efficient storage and retrieval
    of data instances in a distributed environment.
  • Mechanism for data driven applications to cache,
    share, and asynchronously communicate data in a
    distributed environment
  • Grid based distributed, searchable, and shareable
    persistent storage
  • Infrastructure for grid coordination language

http://projectmobius.osu.edu/
43
Global Model Exchange
  • Store and link data models defined inside
    namespaces in grid.
  • Enables other services to publish, retrieve,
    discover, remove, and version metadata
    definitions
  • Services composed in a DNS-like architecture
    representing parent-child namespace hierarchy
  • When a schema is registered in GME, it is stored
    under the name and namespace specified by the
    application, and the schema is assigned a version
    number

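The registration behavior can be sketched as a small versioned registry: a schema registered under a (namespace, name) pair receives the next version number, and retrieval defaults to the latest version (a stand-in for illustration, not the GME API):

```python
class ModelRegistry:
    """Toy Global Model Exchange: versioned schemas per namespace."""
    def __init__(self):
        self.store = {}  # (namespace, name) -> list of schema texts

    def register(self, namespace, name, schema):
        versions = self.store.setdefault((namespace, name), [])
        versions.append(schema)
        return len(versions)          # assigned version number

    def retrieve(self, namespace, name, version=None):
        versions = self.store[(namespace, name)]
        return versions[-1] if version is None else versions[version - 1]

gme = ModelRegistry()
v1 = gme.register("edu.osu.bmi", "slide", "<schema v='1'/>")
v2 = gme.register("edu.osu.bmi", "slide", "<schema v='2'/>")
print(v1, v2, gme.retrieve("edu.osu.bmi", "slide"))
```

A DNS-like deployment would shard this table across services by namespace prefix, as the slide describes.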
48
Functioning prototype cost characterization
  • System prototype constructed
  • Versioned grid schema management, database
    creation, insertion, query implemented
  • Benchmarks carried out involving different schema
    shapes and sizes

49
System Architecture
50
Image Processing Pipeline with Checkpointing
51
Related Work
  • GGF
  • Grid Middleware: Globus, Network Weather Service,
    GridSolve, Storage Resource Broker, CACTUS,
    CONDOR
  • Common Component Architecture
  • Query and indexing of very large databases: Jim
    Gray (Microsoft), keyhole.com
  • Close relationship to much viz work

52
Multiscale Laboratory Research Group
Ohio State University: Joel Saltz, Gagan Agrawal,
Umit Catalyurek, Tahsin Kurc, Shannon Hastings,
Steve Langella, Scott Oster, Tony Pan, Benjamin
Rutt, Sivaramakrishnan (K2), Michael Zhang, Dan
Cowden, Mike Gray
The Ohio Supercomputer Center: Don Stredney,
Dennis Sessanna, Jason Bryan
University of Maryland: Alan Sussman, Henrique
Andrade, Christian Hansen
53
Center on Grid Enabled Image Processing
  • Joel Saltz
  • Michael Caligiuri
  • Charis Eng
  • Mike Knopp
  • DK Panda
  • Steve Qualman
  • Jay Zweier

54
Instrumented Oilfield Collaborators
Manish Parashar, Manish Agarwal -- Electrical and
Computer Eng. Dept., Rutgers University
Alan Sussman, Christian Hansen -- Computer Science
Department, University of Maryland
Joel Saltz, Umit Catalyurek, Mike Gray, Tahsin
Kurc, Shannon Hastings, Steve Langella, Krishnan
Sivaramakrishnan, Tyler Gingrich -- Biomedical
Informatics Department, The Ohio State University
Arcot Rajasekar, Mike Wan -- San Diego
Supercomputer Center
Mary Wheeler, Hector Klie, Malgorzata Peszynska,
Ryan Martino -- University of Texas at Austin