Discovery Processes: Representation and Re-use


1
Distributed Data Mining in Discovery Net
Dr. Moustafa Ghanem, Department of
Computing, Imperial College London
2
  1. What is Discovery Net
  2. Distributed Data Mining for Compute Intensive
    Tasks
  3. Distributed Data Mining for Sensor Grids
  4. Knowledge Discovery from Naturally Distributed
    Data Sources
  5. What Do Scientists Really Want?

3
  • 1. What is Discovery Net

4
What is Discovery Net?
  • Funding: One of the eight UK national e-Science
    pilot projects funded by EPSRC (2.2M)
  • Start: Oct 2001; End: March 2005
  • Goal: Construct the world's first infrastructure
    for global knowledge discovery services
  • Key Technologies:
    • Open Service Computing
    • High Throughput Devices and Real Time Data Mining
    • Real Time Data Integration and Information
      Structuring
    • Cross Domain Knowledge Discovery and Management
    • Discovery Workflow and Discovery Planning

5
Discovery Net Applications
  • Life Sciences
    • High throughput genomics and proteomics
    • Distributed Databases and Applications
  • Environmental Modelling
    • High throughput dispersed air sensing technology
    • Sensor Grids
  • Real time geo-hazard modelling
    • Earthquake modelling through satellite imagery
    • High performance Distributed Computation

6
Discovery Net Architecture
  • D-Net Clients: End-user applications and user
    interface allowing scientists to construct and
    drive knowledge discovery activities
  • D-Net Middleware: Provides services and execution
    logic for distributed knowledge discovery and
    access to distributed resources and services
  • Computation and Data Resources: Distributed
    databases, compute servers and scientific
    devices
  • High Performance Communication Protocol
    (GridFTP, DSTP, ...)
  • Grid Infrastructure (GSI)
  • DPML, Web/Grid Services, OGSA
  • Goal: Plug and Play
    • Data Sources
    • Analysis Components
    • Knowledge Discovery Processes

7
Discovery Net Data Mining Components
  • Generic Data Mining
    • Classification, Clustering, Associations, ...
  • Unstructured-Data Mining
    • Text Mining, Image Mining
  • Domain-specific Mining
    • Bioinformatics, Cheminformatics, ...

8
  • 2. Distribution of Compute Intensive Tasks
  • a. Distributed Data Mining for Geo-hazard
    Prediction

9
Grid-based Geo-hazard Data Mining
  • Grid-based HPC Computation
  • Workflow to Co-ordinate Grid Computation
  • Automatically co-register a stack of imagery
    layers at high precision and speed.
  • Grid-based Data Access and Integration

10
Normalised cross-correlation (NCC) template
algorithm
Operates on a remotely accessed MPI UNIX
parallel computer over a fast network through the
DNet interface. Slow but highly accurate: 24 processors
take 10 hours for one scene of Landsat-7 ETM+ Pan
imagery data. The algorithm also runs on the Grid.
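
As an illustration of the NCC template idea named on this slide, the following is a minimal single-machine sketch, not the project's MPI implementation. The function names `ncc` and `best_match` are hypothetical, and images are assumed to be small 2-D lists of numbers. It slides a template over an image and reports the offset with the highest normalised cross-correlation score.

```python
from math import sqrt


def ncc(patch, template):
    """Normalised cross-correlation of two equally sized 2-D lists (range -1 to 1)."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        # A flat patch or template has no variance; treat it as "no match".
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match(image, template):
    """Exhaustively search every offset; return (best_score, (row, col))."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_score, best_pos = -2.0, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos


if __name__ == "__main__":
    image = [
        [0, 0, 0, 0, 0],
        [0, 0, 9, 1, 0],
        [0, 0, 2, 8, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    template = [[9, 1], [2, 8]]
    print(best_match(image, template))
```

The exhaustive search over every offset is what makes the method slow on full satellite scenes. Each offset is scored independently of the others, so the work can in principle be split across processors.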
12
  • 2. Distribution of Compute Intensive Tasks
  • b. Distributed Clustering

13
Workflows for Distributed Data Clustering
14
  • 3. Distributed Mining over Sensor Grid Data
  • Distributed Spatial Data Mining for Air Pollution
    Modelling

15
Sensor Specification
The GUSTO Project - Update (Generic UV Sensors
Technologies and Observations)
  • High throughput open path spectrometer system
  • Robust algorithm for pollutant concentration
    retrievals
  • Measures SO2, NO, NO2, O3 and benzene to ppb levels
    every few seconds
  • Geared for networking of multiple GUSTO units
    within a GRID Infrastructure
  • Can support Remote Sensing data for (contour)
    mapping of pollutants

www.gusto-systems.com
16
Networking of Multiple GUSTO Units
www.gusto-systems.com
17
Pollution analysis
21
  • 4. Knowledge Discovery from Naturally Distributed
    Data Sources
  • Distributed Data Mining in Life Sciences

22
Distributed Data Mining for Life Sciences
23

Information Integration
  • Given a collection of microarray-generated gene
    expression data, what kinds of questions do users
    wish to pose?
  • Design an integration schema?

24
From Data Integration to Knowledge Unification
In Silico Experiment
D-World
I-World
K-World
25
Life Science Application SC2002 HPC Challenge
D-Net based Global Collaborative Real-Time
Genome Annotation
26
HPC Challenge SC2002
Nucleotide Annotation Workflows
Real-time sequencing in London
  • 1800 clicks
  • 500 Web accesses
  • 200 copy/paste operations
  • 3 weeks' work
  • in 1 workflow and a few seconds of execution
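
To illustrate how many manual steps can collapse into one reusable workflow, the following is a minimal sketch under stated assumptions. It is not DPML or the Discovery Net API. The step names (`clean_sequence`, `reverse_complement`) and the `run_workflow` helper are hypothetical and stand in for the annotation services a real workflow would call.

```python
from typing import Callable, List, Tuple

# A workflow step is a (name, function) pair; every step maps a string to a string.
Step = Tuple[str, Callable[[str], str]]


def clean_sequence(seq: str) -> str:
    """Strip whitespace and upper-case a pasted nucleotide sequence."""
    return "".join(seq.split()).upper()


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence; unknown symbols become 'N'."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs.get(base, "N") for base in reversed(seq))


def run_workflow(steps: List[Step], data: str) -> Tuple[str, List[str]]:
    """Run the steps in order and record which ones were executed."""
    executed = []
    for name, fn in steps:
        data = fn(data)
        executed.append(name)
    return data, executed


if __name__ == "__main__":
    workflow = [
        ("clean_sequence", clean_sequence),
        ("reverse_complement", reverse_complement),
    ]
    result, executed = run_workflow(workflow, "ac gt\nTTg")
    print(result)    # CAAACGT
    print(executed)  # ['clean_sequence', 'reverse_complement']
```

Because the whole chain is captured as data (a list of named steps), it can be stored, shared and re-run on new input without repeating the manual operations.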

27
Discovery Net in Action: China SARS Virtual Lab
28
Discovery Net in Action: SARS Virus Mutation
Analysis
29
  • 5. What Do Scientists Really Want?
  • Does it really work?

30
Towards Compositional Grid Services
Resource Mapping
Service Browsing
Workflow Execution: A compositional GRID
Workflow Authoring: Composing services
Workflow Warehousing
Service Abstraction
Workflow Management: Collaborative Knowledge
Management
31
Discovery Net Service Composition
32
Full Workflow
33
Executing Protein Annotation Workflow
34
Deployment of Node
35
Deploying Protein Annotation Workflow
36
Executing Deployed Service
37
Locating and Executing Deployed Service from
Discovery Net
38
Workflow Provenance
39
Workflow Warehousing
40
Discovery Net Snapshot
Scientific Information
Scientific Discovery
In Real Time