Title: Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics
1Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics
Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore
Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A-Star, Singapore
11th Feb 2011
2Overview
- Motivation
- Problem Definition
- Objective
- Proposed Architecture
- A case study in Bio-informatics
- Demo
- Future works
- Summary
3Motivation
- Deluge of biological data
- Biomedical data is available in heterogeneous databases
- Data in structured and semi-/unstructured formats
- Demand for fast, large-scale and cost-effective
computing strategies
4Problem Definition
- Data
- PubMed contains 20 million abstracts
- UniProt contains 13.5 million records
- Case study on antiviral proteins
- Over 70,000 citations in PubMed
- Over 14,000 proteins in UniProt
- Integration and Analysis
5Related Works
- Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)
- No querying or reasoning
- Not scalable
- RDF/OWL based integration tools (e.g. TopBraid Suite)
- No NLP
- Not bio-specific, and not biologist-friendly
- Cloud-based bio data mining works (e.g. Kudtarkar P 2010)
- Still in early stages
- Challenging to perform semantic integration on cloud
6Objective
- To provide a framework that enables
- Better data infrastructure
- Scalability
- Management of heterogeneity
- Cost-effectiveness
- Better data analytics
- Integrative data mining
- Visual query interface
7Our Approach
8Our Approach
User Interface
Query Reasoner
Ontology
Parser
Web Crawler
Knowle Population Service
Biomedical sources
Cloud-based data store
9Our Approach
- Data Infrastructure Module
- Cloud-based: Amazon EC2, Hadoop, Microsoft Azure
- Parallel processing: MapReduce
- Distributed storage: Bigtable, HBase, HDFS
- Data Analytics Module
- Non-semantic: database driven
- Semantic: ontology driven (Knowle, AllegroGraph, TopBraid)
10Data Infrastructure Module (Hadoop)
- Software framework for data-intensive, distributed applications
- The Hadoop Distributed File System provides a distributed, scalable, and portable file system that supports large data sets
- Hadoop MapReduce allows programs to process large amounts of data in parallel
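As a rough illustration of the MapReduce programming model, the sketch below counts term occurrences across a few abstracts in plain Python. It only simulates the map, shuffle, and reduce phases inside one process and does not use Hadoop. The function names and the sample texts are hypothetical and not taken from the presented system.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, 1) for every whitespace-separated term in one document."""
    for term in text.lower().split():
        yield term, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, counts):
    """Sum the partial counts for one term."""
    return term, sum(counts)

def run_job(documents):
    """Run map, shuffle, and reduce sequentially over a dict of doc_id -> text."""
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, values) for term, values in shuffle(pairs).items())

# Hypothetical sample input, not real PubMed content.
abstracts = {
    "doc1": "antiviral protein inhibits virus",
    "doc2": "protein binds antiviral protein",
}
counts = run_job(abstracts)
```

In a real cluster the map calls would run on different nodes and the grouping step would be handled by the framework; here everything runs sequentially so the data flow is easy to follow.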
11Cloud-Based Data Store: Hadoop Distributed File System
- Meta data (in memory)
- Data nodes
- Data blocks
- Node attributes
- Name of files
- Mapping of block-node
Name node
Secondary Name node
Data node
Data node
Data node
Data node
Data node
- Stores file contents
- Each file is split into blocks
- Each block is distributed to the data nodes
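The division of labour between the name node (block-to-node mapping held in memory) and the data nodes (block contents) can be sketched as follows. This is a toy model under stated assumptions: the round-robin placement, the tiny block size, the replication factor of 2, and all names (`place_blocks`, `dn1`, `uniprot.dat`) are hypothetical. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Chunk file contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(file_name, data, data_nodes, block_size=4, replication=2):
    """Return name-node style metadata: block id -> data nodes holding a replica.

    Assumes replication <= len(data_nodes) so replicas land on distinct nodes.
    """
    blocks = split_into_blocks(data, block_size)
    node_count = len(data_nodes)
    block_map = {}
    for index, _ in enumerate(blocks):
        block_id = f"{file_name}#blk{index}"
        # Simple round-robin placement, used here only for illustration.
        block_map[block_id] = [data_nodes[(index + r) % node_count]
                               for r in range(replication)]
    return block_map

nodes = ["dn1", "dn2", "dn3", "dn4"]
block_map = place_blocks("uniprot.dat", b"ABCDEFGHIJ", nodes)
```

With 10 bytes and a block size of 4 the file becomes three blocks, and the returned dictionary plays the role of the "mapping of block-node" metadata.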
12Data Analytics Module (Knowle)
- Semantic Technology Toolkit
- Knowle services used in Data Analytics Module
- Data/Text mining
- Ontology Population
- Ontology Query
- Visual Ontology Query
Developed at the Institute for Infocomm Research, Singapore
13Our Approach
User Interface
Query Reasoner
Ontology
Parser
Web Crawler
Knowle Population Service
Biomedical data sources
Cloud-based data store
14Web Crawler
Cloud-based data store
UniProt Crawler
UniProt
PubMed Crawler
PubMed
Bio-medical data source
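A minimal sketch of the crawling step, assuming one crawler loop per source that writes raw records into a shared store. The `fetch` callable stands in for an HTTP request to the source, and the dictionary stands in for the cloud-based data store. All names and record identifiers here are hypothetical.

```python
from typing import Callable, Dict, Iterable

def crawl(source: str,
          record_ids: Iterable[str],
          fetch: Callable[[str, str], str],
          store: Dict[str, str]) -> int:
    """Fetch each record of one source and save the raw text under '<source>/<id>'.

    Returns the number of newly stored records.
    """
    saved = 0
    for record_id in record_ids:
        key = f"{source}/{record_id}"
        if key in store:
            # Skip records that an earlier crawl already stored.
            continue
        store[key] = fetch(source, record_id)
        saved += 1
    return saved

def fake_fetch(source: str, record_id: str) -> str:
    """Stand-in for a network call; a real crawler would query the source here."""
    return f"raw {source} record {record_id}"

store: Dict[str, str] = {}
new_uniprot = crawl("uniprot", ["U1", "U2"], fake_fetch, store)
new_pubmed = crawl("pubmed", ["C1"], fake_fetch, store)
repeat = crawl("uniprot", ["U2", "U3"], fake_fetch, store)
```

Passing the fetch function as a parameter keeps the loop identical for both sources and lets the sketch run without network access.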
15Parser
UniProt Parser
Crawled UniProt data
Knowle Ontology Population Service
Crawled PubMed data
PubMed Parser
Cloud-based data store
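To make the parsing step concrete, here is a small sketch that reads records in the UniProt flat-text layout (two-letter line code, content from column 6, `//` as record terminator). Whether the presented parser consumed this format or another export is an assumption, and the sample entries are invented placeholders rather than real UniProt records.

```python
def parse_uniprot_flat(text):
    """Parse flat-text records into dicts of line code -> list of line values."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):
            # '//' terminates one record.
            if current:
                records.append(current)
            current = {}
            continue
        if len(line) < 5:
            continue
        code, value = line[:2], line[5:].strip()
        current.setdefault(code, []).append(value)
    if current:
        # Keep a trailing record that lacks a terminator.
        records.append(current)
    return records

# Invented sample data in the flat-text layout.
sample = """\
ID   EXAMPLE_HUMAN   Reviewed;   120 AA.
AC   X00001;
DE   RecName: Full=Hypothetical antiviral protein;
//
ID   SAMPLE_MOUSE   Reviewed;   95 AA.
AC   X00002; X00003;
//
"""
records = parse_uniprot_flat(sample)
accessions = [acc.strip() for acc in records[1]["AC"][0].split(";") if acc.strip()]
```

The resulting dictionaries are the kind of intermediate structure that could then be handed to an ontology population service.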
16Ontology
17Ontology Populator
Knowle Ontology Population Service
Assert Object Properties
Assert Datatype Properties
Populate concepts
Knowle Text mining Service
Entity Detection
Relation Extraction
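The three population steps (populate concepts, assert datatype properties, assert object properties) can be sketched with plain subject-predicate-object tuples. This is not the Knowle API, which is not described in the slides. The `ex:` names, the input layout, and the `populate` function are hypothetical stand-ins for the output of entity detection and relation extraction.

```python
def populate(entities, relations):
    """Turn text-mining output into (subject, predicate, object) triples."""
    triples = set()
    for entity in entities:
        # Populate concepts: the individual becomes an instance of an ontology class.
        triples.add((entity["id"], "rdf:type", entity["concept"]))
        # Assert datatype properties: literal values attached to the individual.
        for prop, literal in entity.get("attributes", {}).items():
            triples.add((entity["id"], prop, literal))
    for subject, predicate, obj in relations:
        # Assert object properties: links between two individuals.
        triples.add((subject, predicate, obj))
    return triples

# Hypothetical output of entity detection and relation extraction.
entities = [
    {"id": "ex:protein1", "concept": "ex:Protein",
     "attributes": {"ex:name": "hypothetical antiviral protein"}},
    {"id": "ex:citation1", "concept": "ex:Citation",
     "attributes": {"ex:year": 2010}},
]
relations = [("ex:citation1", "ex:mentions", "ex:protein1")]
triples = populate(entities, relations)
```

Using a set means that asserting the same fact twice, for example from two overlapping crawls, leaves the triple store unchanged.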
18Query Reasoner
SAIL
OWLIM Reasoner
User Interface
Knowle Query Service
Sesame
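To show why a reasoner sits between the query service and the store, the sketch below applies a single inference rule (instances of a subclass are also instances of its superclasses) before answering a triple-pattern query. It is only a stand-in: it does not use Sesame or OWLIM, whose reasoning is far richer, and the `ex:` class names are hypothetical.

```python
def infer_types(triples):
    """Add rdf:type triples implied by rdfs:subClassOf until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for child, parent in subclass:
                if o == child and (s, "rdf:type", parent) not in inferred:
                    inferred.add((s, "rdf:type", parent))
                    changed = True
    return inferred

def match(triples, pattern):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if all(q is None or q == v for q, v in zip(pattern, t))]

# Hypothetical asserted facts.
asserted = {
    ("ex:p1", "rdf:type", "ex:AntiviralProtein"),
    ("ex:AntiviralProtein", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Protein", "rdfs:subClassOf", "ex:Entity"),
}
pattern = (None, "rdf:type", "ex:Protein")
without_reasoning = match(asserted, pattern)
with_reasoning = match(infer_types(asserted), pattern)
```

The same query returns nothing on the asserted triples alone and finds `ex:p1` once the inferred triples are included, which is the benefit a reasoning layer provides to the user interface.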
19User Interface
Knowle Population Service
Search
Web Crawler
Parser
KnowleGator Ontology Visual Query
Visual Query Translator
Ontology Query Reasoner
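A visual query translator can be thought of as a function from a small graph (concept nodes joined by property edges) to a textual ontology query. The sketch below emits SPARQL as one plausible target. The slides do not state the actual query language or translation rules used by KnowleGator, so the input layout, the `ex:` prefix URI, and the `to_sparql` function are assumptions.

```python
def to_sparql(nodes, edges, prefix_uri="http://example.org/onto#"):
    """Translate a visual query graph into a SPARQL SELECT string.

    nodes: dict of variable name -> concept name
    edges: list of (source variable, property name, target variable)
    """
    lines = [
        f"PREFIX ex: <{prefix_uri}>",
        "SELECT " + " ".join(f"?{var}" for var in nodes),
        "WHERE {",
    ]
    for var, concept in nodes.items():
        # Each node in the drawing constrains a variable to one concept.
        lines.append(f"  ?{var} a ex:{concept} .")
    for source, prop, target in edges:
        # Each edge in the drawing becomes one triple pattern.
        lines.append(f"  ?{source} ex:{prop} ?{target} .")
    lines.append("}")
    return "\n".join(lines)

# Hypothetical visual query: citations that mention a protein.
nodes = {"citation": "Citation", "protein": "Protein"}
edges = [("citation", "mentions", "protein")]
query = to_sparql(nodes, edges)
```

The generated string could then be passed to a query service, so that users compose cross-source queries by drawing rather than by writing query syntax.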
20A case study in Bio-informatics
- Integration, cross-querying from PubMed and UniProt
- Data
- 70,054 citations from PubMed
- 14,527 proteins in UniProt
- Infrastructure (virtual computers)
- 4 data nodes (RAM 1 GB, CPU Intel Xeon 2.4 GHz)
- 2 master nodes (1 name node, 1 secondary name node) (RAM 512 MB, CPU Intel Xeon 2.4 GHz)
- 1 virtual CPU Intel Xeon 2.4 GHz
21Demo
- Data
- UniProt: 853 antiviral protein entries
- PubMed: 2000 citations
22Demo Snapshot
23Summary
- We proposed a new framework
- Data infrastructure module (cloud-based infrastructure)
- Data analytics module (semantic technologies)
- We tested on a prototype
- Using our own infrastructure
- With integration, cross-querying from PubMed and
UniProt
24Future works
- Integrated user interface
- Explore other cloud-based data stores: HBase, Bigtable
- Apply the map-reduce concept to data analytics and crawling
- Integrate Knowle into the cloud-based environment