Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics

About This Presentation
Title:

Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics

Description:

This is our case study, in which we collect data from biomedical sources and save it to our cloud-based data store. ... medical repo for literatures UniProt, ...

Slides: 26
Provided by: NguyenT2

Transcript and Presenter's Notes

1
Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics
Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore
Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A-Star, Singapore
11th Feb 2011
2
Overview
  • Motivation
  • Problem Definition
  • Objective
  • Proposed Architecture
  • A case study in Bio-informatics
  • Demo
  • Future works
  • Summary

3
Motivation
  • Deluge of biological data
  • Biomedical data is spread across heterogeneous
    databases
  • Data comes in structured and semi-/unstructured formats
  • Demand for fast, large-scale and cost-effective
    computing strategies

4
Problem Definition
  • Data
    • PubMed contains 20 million abstracts
    • UniProt contains 13.5 million records
  • Case study on antiviral proteins
    • Over 70,000 citations in PubMed
    • Over 14,000 proteins in UniProt
  • Integration and analysis

5
Related Works
  • Using NLP to link documents to existing
    ontologies (e.g. GoPubMed, Textpresso)
    • No querying or reasoning
    • Not scalable
  • RDF/OWL-based integration tools (e.g. TopBraid
    Suite)
    • No NLP
    • Not bio-specific, and not biologist-friendly
  • Cloud-based bio data mining works (e.g. Kudtarkar
    P 2010)
    • Still in early stages
    • Semantic integration on the cloud remains
      challenging

6
Objective
  • To provide a framework that enables
    • Better data infrastructure
      • Scalability
      • Management of heterogeneity
      • Cost-effectiveness
    • Better data analytics
      • Integrative data mining
      • Visual query interface

7
Our Approach
8
Our Approach
[Architecture diagram labels: User Interface; Query Reasoner; Ontology; Parser; Web Crawler; Knowle Population Service; Biomedical sources; Cloud-based data store]
9
Our Approach
  • Data Infrastructure Module
    • Cloud-based: Amazon EC2, Hadoop, Microsoft Azure
    • Parallel processing: MapReduce
    • Distributed storage: Bigtable, HBase, HDFS
  • Data Analytics Module
    • Non-semantic: database driven
    • Semantic: ontology driven (Knowle, AllegroGraph,
      TopBraid)

10
Data Infrastructure Module (Hadoop)
  • Software framework for data-intensive,
    distributed applications
  • The Hadoop Distributed File System (HDFS) provides
    a distributed, scalable, and portable file system
    that supports large data sets
  • Hadoop MapReduce lets programs run in parallel
    over large amounts of data
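
The MapReduce model mentioned above can be illustrated without a cluster. The following is a minimal single-process sketch, not the authors' job: it counts terms across a few made-up abstracts, and the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical. In Hadoop the map and reduce calls would run in parallel on different nodes and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit a (term, 1) pair for every term in one document."""
    for term in text.lower().split():
        yield term, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, counts):
    """Reduce step: combine all values emitted for one key."""
    return term, sum(counts)


def run_job(documents):
    """Run map, shuffle, and reduce sequentially over a dict of documents."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(term, counts) for term, counts in shuffle(pairs).items())


# Made-up toy abstracts (assumption: not taken from PubMed).
abstracts = {
    "a1": "antiviral protein inhibits virus",
    "a2": "protein binds antiviral target",
}
term_counts = run_job(abstracts)
```

Because each map call sees only one document and each reduce call sees only one key, both steps can be distributed across nodes without sharing state.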

11
Cloud-Based Data Store: Hadoop Distributed File
System
Name node
  • Metadata (in memory)
    • Data nodes
    • Data blocks
    • Node attributes
    • Names of files
    • Block-to-node mapping

Secondary name node
Data nodes
  • Store file contents
  • Each file is chunked into blocks
  • Each block is spread across data nodes
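
The division of labour on this slide (block-to-node mapping on the name node, file contents on the data nodes) can be sketched as a toy model. This is an illustration only: the round-robin placement, the 4-byte block size, and the names place_blocks and read_file are assumptions, and real HDFS uses far larger blocks and replicates each block on several nodes.

```python
from itertools import cycle


def split_into_blocks(data, block_size):
    """Chunk a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(file_name, data, data_nodes, block_size=4):
    """Spread the blocks of one file over the data nodes, round-robin.

    Returns (metadata, storage): metadata plays the role of the name node's
    in-memory block-to-node mapping, storage the role of the data nodes.
    """
    block_map = {}
    storage = {node: {} for node in data_nodes}
    nodes = cycle(data_nodes)
    for index, block in enumerate(split_into_blocks(data, block_size)):
        block_id = f"{file_name}#blk{index}"
        node = next(nodes)
        storage[node][block_id] = block   # the data node keeps the contents
        block_map[block_id] = node        # the name node keeps only the mapping
    return {"file": file_name, "blocks": block_map}, storage


def read_file(metadata, storage):
    """Reassemble a file by looking up each block's node in the metadata."""
    return b"".join(
        storage[node][block_id] for block_id, node in metadata["blocks"].items()
    )


# Four data nodes, matching the case-study setup; file name and bytes are made up.
metadata, storage = place_blocks(
    "uniprot.dat", b"ABCDEFGHIJ", ["dn1", "dn2", "dn3", "dn4"]
)
restored = read_file(metadata, storage)
```

A client never asks the name node for file contents in this model; it asks where the blocks are and then reads them from the data nodes.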

12
Data Analytics Module (Knowle)
  • Semantic technology toolkit
  • Knowle services used in the Data Analytics Module
    • Data/text mining
    • Ontology population
    • Ontology query
    • Visual ontology query

Developed at the Institute for Infocomm Research,
Singapore
13
Our Approach
[Architecture diagram labels: User Interface; Query Reasoner; Ontology; Parser; Web Crawler; Knowle Population Service; Biomedical data sources; Cloud-based data store]
14
Web Crawler
[Diagram labels: Bio-medical data sources (UniProt, PubMed); UniProt Crawler; PubMed Crawler; Cloud-based data store]
15
Parser
[Diagram labels: Cloud-based data store; Crawled UniProt data; Crawled PubMed data; UniProt Parser; PubMed Parser; Knowle Ontology Population Service]
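
The slides do not say which format the crawled UniProt data was in. As a minimal sketch, assuming the UniProtKB flat-file text format (two-letter line codes, records terminated by "//"), a parser might extract the fields that later feed ontology population. The function name parse_uniprot_records, the output keys, and the sample records with their accession numbers are made up for illustration.

```python
def parse_uniprot_records(text):
    """Parse ID, AC, and DE lines from UniProtKB-style flat-file text.

    Returns one dict per record with the entry name, the list of
    accession numbers, and the recommended full name when present.
    """
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            if current:
                records.append(current)
            current = {}
            continue
        code, _, value = line.partition("   ")
        value = value.strip()
        if code == "ID":
            current["entry_name"] = value.split()[0]
        elif code == "AC":
            accessions = [a.strip() for a in value.split(";") if a.strip()]
            current.setdefault("accessions", []).extend(accessions)
        elif code == "DE" and "Full=" in value and "name" not in current:
            current["name"] = value.split("Full=", 1)[1].rstrip(";")
    return records


# Made-up records in flat-file layout (assumption: not real UniProt entries).
SAMPLE = """\
ID   DEMO1_HUMAN             Reviewed;         120 AA.
AC   X00001; X00002;
DE   RecName: Full=Hypothetical antiviral protein 1;
//
ID   DEMO2_MOUSE             Reviewed;          98 AA.
AC   X00003;
DE   RecName: Full=Hypothetical antiviral protein 2;
//
"""
records = parse_uniprot_records(SAMPLE)
```

A PubMed parser would follow the same pattern over citation records, producing identifiers, titles, and abstracts for the population service.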
16
Ontology
17
Ontology Populator
Knowle Ontology Population Service
  • Assert object properties
  • Assert datatype properties
  • Populate concepts
Knowle Text Mining Service
  • Entity detection
  • Relation extraction
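
The three population steps named on this slide can be illustrated with plain subject-predicate-object triples. This sketch is not Knowle's implementation, which the slides do not describe: the class and property names (Protein, Citation, hasName, hasAbstract, mentions) are hypothetical, and entity detection is reduced to a case-insensitive name lookup.

```python
def populate(proteins, citations):
    """Build a set of triples from parsed protein and citation records.

    proteins:  dict mapping accession -> protein name
    citations: dict mapping citation id -> abstract text
    """
    triples = set()
    for accession, name in proteins.items():
        subject = f"protein:{accession}"
        triples.add((subject, "rdf:type", "Protein"))       # populate concept
        triples.add((subject, "hasName", name))             # datatype property
    for pmid, abstract in citations.items():
        subject = f"citation:{pmid}"
        triples.add((subject, "rdf:type", "Citation"))      # populate concept
        triples.add((subject, "hasAbstract", abstract))     # datatype property
        for accession, name in proteins.items():
            # Stand-in for entity detection: look for the protein name in the text.
            if name.lower() in abstract.lower():
                # Object property linking two individuals.
                triples.add((subject, "mentions", f"protein:{accession}"))
    return triples


# Made-up input records for illustration.
proteins = {"X00001": "demo protein A", "X00002": "demo protein B"}
citations = {
    "1001": "We show that Demo Protein A blocks viral entry.",
    "1002": "A survey of cloud computing.",
}
triples = populate(proteins, citations)
```

Datatype properties attach literal values to an individual, while object properties link two individuals; the "mentions" link is what connects the PubMed side to the UniProt side.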
18
Query Reasoner
[Diagram labels: User Interface; Knowle Query Service; Sesame; SAIL; OWLIM Reasoner]
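
What a reasoner contributes to querying can be shown with two standard RDFS entailment rules: subclass transitivity and type propagation. The sketch below is a naive fixpoint loop over triples, not OWLIM's rule engine; the class names and the individual are hypothetical.

```python
def infer_types(triples):
    """Apply two RDFS rules until no new triples appear.

    Rule 1: A subClassOf B and B subClassOf C  =>  A subClassOf C
    Rule 2: x type A and A subClassOf B        =>  x type B
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        types = [(s, o) for s, p, o in inferred if p == "rdf:type"]
        new = set()
        for sub, sup in subclass:
            for sub2, sup2 in subclass:
                if sup == sub2:
                    new.add((sub, "rdfs:subClassOf", sup2))
            for instance, cls in types:
                if cls == sub:
                    new.add((instance, "rdf:type", sup))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical class hierarchy and one individual.
facts = {
    ("AntiviralProtein", "rdfs:subClassOf", "Protein"),
    ("Protein", "rdfs:subClassOf", "Biomolecule"),
    ("protein:X00001", "rdf:type", "AntiviralProtein"),
}
closure = infer_types(facts)
```

With the inferred triples in place, a query for every Protein also returns individuals that were asserted only as AntiviralProtein.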
19
User Interface
[Diagram labels: Search; KnowleGator Ontology Visual Query; Visual Query Translator; Ontology Query Reasoner; Knowle Population Service; Parser; Web Crawler]
20
A case study in Bio-informatics
  • Integration and cross-querying of PubMed and
    UniProt
  • Data
    • 70,054 citations from PubMed
    • 14,527 proteins in UniProt
  • Infrastructure (virtual computers)
    • 4 data nodes (RAM 1 GB, CPU Intel Xeon 2.4 GHz)
    • 2 master nodes (1 name node, 1 secondary name
      node) (RAM 512 MB, CPU Intel Xeon 2.4 GHz)
    • 1 virtual CPU (Intel Xeon 2.4 GHz)
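
Cross-querying means answering one question from both sources at once, for example "which citations mention the protein with this name?". The sketch below joins triples from a UniProt-like side and a PubMed-like side by simple pattern matching. It stands in for an ontology query and is not the Knowle query service; all identifiers, property names, and titles are made up.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def citations_for_protein_name(triples, name):
    """Join protein names (UniProt side) with citation mentions (PubMed side)."""
    results = []
    for protein, _, _ in match(triples, p="hasName", o=name):
        for citation, _, _ in match(triples, p="mentions", o=protein):
            for _, _, title in match(triples, s=citation, p="hasTitle"):
                results.append((citation, title))
    return sorted(results)


# Made-up integrated data: names from one source, titles from the other.
triples = [
    ("protein:X00001", "hasName", "demo protein A"),
    ("protein:X00002", "hasName", "demo protein B"),
    ("citation:1001", "hasTitle", "Demo protein A blocks viral entry"),
    ("citation:1002", "hasTitle", "Demo protein A and B compared"),
    ("citation:1003", "hasTitle", "Unrelated study"),
    ("citation:1001", "mentions", "protein:X00001"),
    ("citation:1002", "mentions", "protein:X00001"),
    ("citation:1002", "mentions", "protein:X00002"),
]
hits = citations_for_protein_name(triples, "demo protein A")
```

The join works only because both sources were mapped into one shared vocabulary first, which is the point of the integration step.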

21
Demo
  • Data
    • UniProt: 853 antiviral protein entries
    • PubMed: 2,000 citations

22
Demo Snapshot
23
Summary
  • We proposed a new framework
    • Data infrastructure module (cloud-based
      infrastructure)
    • Data analytics module (semantic technologies)
  • We tested it on a prototype
    • Using our own infrastructure
    • With integration and cross-querying of PubMed and
      UniProt

24
Future works
  • Integrated user interface
  • Explore other cloud-based data stores: HBase,
    Bigtable
  • Apply the MapReduce concept to data analytics and
    crawling
  • Integrate Knowle into the cloud-based environment
