Application of Data and Text Mining to Bioinformatics - PowerPoint PPT Presentation

About This Presentation

Title:

Application of Data and Text Mining to Bioinformatics

Slides: 52

Provided by: sam120

Transcript and Presenter's Notes

Title: Application of Data and Text Mining to Bioinformatics

1
Application of Data and Text Mining to
Bioinformatics

Sammy Wang
Computer Science
University of Georgia

2
Data Mining (DM) Definition

Extraction of implicit, previously unknown, and
potentially useful information (Pattern) from
large data sets or databases.
Uses computational techniques from statistics,
machine learning and pattern recognition.

--from wiki
3
Where Does DM Come From?

[Figure: DATA MINING at the center, surrounded by
the fields it draws on: Information Theory,
Database Management, Visualization, High
Performance Computers, Applied Statistics,
Parallel Algorithms, Machine Learning]

--from http://wwwmaths.anu.edu.au/steve/pdcn.pdf
4
Knowledge Discovery Process

Data mining: the core of the knowledge discovery
process.

[Figure: knowledge discovery pipeline, with stages
labeled Databases, Data Cleaning, Data Integration,
Preprocessed Data, Selection, Data Transformations,
Task-relevant Data, Data Mining, Interpretation,
and Knowledge]
--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
5
DM Tasks (Goals)

There are two categories of goals (or high-level
tasks) in DM:
Description: models are constructed to describe
particular patterns or relationships in the data
(e.g. beer and diapers)
Prediction: models are constructed from
historical cases to predict outcomes for new
cases (e.g. PROSPECTR)

--from http://www.sys-consulting.co.uk/
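
The descriptive goal is often illustrated with association rules such as the "beer and diapers" pattern. The following is a minimal sketch of how the support and confidence of such a rule could be computed. The transactions are invented for illustration and are not taken from the slides.

```python
# Hypothetical market-basket transactions, invented for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


# Rule under test: diapers -> beer
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
```

A rule is usually reported only when both values exceed user-chosen thresholds; the thresholds themselves are a design choice and are not specified in the slides.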
6
Data Mining Techniques
Data Mining
Descriptive: Clustering, Summarization,
Association Rules, Sequence Discovery
Predictive: Classification, Regression, Time
Series Analysis
Methods shown: Decision Tree, Artificial Neural
Network
--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
7
PRiOritization by Sequence and PhylogEnetic
features of CandidaTe Regions - PROSPECTR

Classifies genes as likely or unlikely to be
involved in human hereditary disease, based on
features derived from their sequence
Builds the classifier as a decision tree (based on
C4.5) using the Weka machine learning package:
can take an arbitrary number of features /
attributes as input
gives a human-readable classifier
also happens to give better results than SVM- and
Bayes-based methods
Trained on approximately 1084 disease genes
versus 1084 randomly selected genes

8
[Figure: PROSPECTR decision tree, beginning at
START. Each node tests one feature and has N/Y
branches; numeric scores appear alongside the
branches (values in the figure range from -0.594
to 0.827).
Node tests: Mouse homol > 42; Has signal peptide?;
Mouse homol > 95; Gene length > 563; Gene length >
997bp; Exons > 32; Best paralog > 78; Rata id > 59;
3'UTR < 647bp; CDS len > 704bp; Hs/Mm Ka/Ks <
0.195; GC > 37.5; Mouse id > 68.3; Worm id > 55.
CLASS is DISEASE if Score < 0]
--from http://www.genetics.med.ed.ac.uk/tutorials/InfMSc2_12.ppt27
9
Text Mining Definition

Refers to the process of extracting interesting,
non-trivial information and knowledge from
unstructured text (i.e. free text).
A young interdisciplinary area that draws on
information retrieval, data mining, machine
learning, statistics and computational
linguistics.
--From Wikipedia

10
Two Approaches of TM

Bag of Words
looking for features (words, concepts, headings,
formatting, authors, references and links)
Statistical methods, machine learning methods,
algorithms, etc.
Natural Language Processing (NLP)
Syntax and semantics

11
Bag of Words Approach

Any techniques used in data mining, statistics,
and machine learning
For example: neural networks, decision trees, HMM,
stochastic methods, Naïve Bayes, Maximum Entropy,
SVM, etc. These can also be used in NLP
Web search: Google, Yahoo
Classification
Clustering

12
Natural Language Processing (NLP)

NLP is a subfield of AI and linguistics.
Goal of NLP: design and build software that
will analyze, understand, and generate natural
languages, so that people can address the computer
as though they were addressing another person.
Major tasks: Information extraction (IE),
Information retrieval (IR), Text to speech,
Speech recognition, Natural language generation,
Machine translation, Question answering,
Text-proofing, Translation technology, Automatic
summarization
--from Wiki and Microsoft Research

13
Five Types of IE

Named Entity recognition (NE)
Finds and classifies names, places, etc.
Co-reference resolution (CO)
Identifies identity relations between entities
Template Element construction (TE)
Adds descriptive information to NE results (using
CO)
Template Relation construction (TR)
Finds relations between TE entities
Scenario Template production (ST)
Fits TE and TR results into specified event
scenarios
-- according to the definition given by The MUC
programme in 1998

14
NE Example

the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of specific
genes whose promoters contain E2F binding sites

Named Entity recognition (NE)
15
CO Example

the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of
specific genes whose promoters contain E2F
binding sites

Coreference resolution (CO)
16
TE Example

the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of specific
genes whose promoters contain E2F binding sites

Template Element construction (TE)
17
TR Example

the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of specific
genes whose promoters contain E2F binding sites

Template Relation construction (TR)
18
Foundation of NLP--Tagging

Assign part-of-speech tags to words reflecting
their syntactic category (noun, verb, adjective,
etc.)
Difficulty: words can belong to different
syntactic categories in different contexts.
he books tickets vs. he reads books
Buffalo buffaloes buffalo buffalo buffaloes
Algorithms used: Viterbi algorithm, HMM, maximum
likelihood, rule-based, stochastic, n-grams,
Baum-Welch, dynamic programming
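
To make the tagging ambiguity concrete, here is a minimal sketch of Viterbi decoding over a toy HMM for the "books" example. The tag set and every probability below are hypothetical values chosen for illustration; a real tagger would estimate them from an annotated corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under a simple HMM."""
    # best[i][t]: probability of the best tag path for words[:i + 1] ending in t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the previous tag that maximizes the path probability into t.
            prev = max(tags, key=lambda p: best[i - 1][p] * trans_p[p][t])
            best[i][t] = (best[i - 1][prev] * trans_p[prev][t]
                          * emit_p[t].get(words[i], 0.0))
            back[i][t] = prev
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical model parameters (not estimated from any corpus).
TAGS = ["PRON", "VERB", "NOUN"]
START = {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3}
TRANS = {
    "PRON": {"PRON": 0.1, "VERB": 0.8, "NOUN": 0.1},
    "VERB": {"PRON": 0.2, "VERB": 0.1, "NOUN": 0.7},
    "NOUN": {"PRON": 0.1, "VERB": 0.5, "NOUN": 0.4},
}
EMIT = {
    "PRON": {"he": 1.0},
    "VERB": {"books": 0.5, "reads": 0.5},
    "NOUN": {"books": 0.5, "tickets": 0.5},
}

tags_1 = viterbi(["he", "books", "tickets"], TAGS, START, TRANS, EMIT)
tags_2 = viterbi(["he", "reads", "books"], TAGS, START, TRANS, EMIT)
```

With these assumed parameters, "books" is tagged as a verb in the first sentence and as a noun in the second, because the transition probabilities favor a verb after a pronoun and a noun after a verb.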

19
Foundation of NLP--Tokenizing

Tokenizing: segmenting sentences into words and
phrases. This process determines which words
should be retained as phrases and which ones
should be segmented into individual words.
"Type II Diabetes" vs. "A patient with diabetes"

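The phrase-versus-word decision can be sketched with a small lookup of known multi-word phrases. The phrase list here is a hypothetical stand-in for a curated vocabulary, and the function name is introduced only for this example.

```python
import re

# Hypothetical phrase lexicon; a real system would load a curated vocabulary.
PHRASES = ["type ii diabetes"]


def tokenize(text, phrases=PHRASES):
    """Split text into words, keeping known multi-word phrases as one token."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Try longer phrases first so the longest known phrase wins.
    phrase_words = sorted((p.split() for p in phrases), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(words):
        for pw in phrase_words:
            n = len(pw)
            if [w.lower() for w in words[i:i + n]] == pw:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


phrase_tokens = tokenize("Type II Diabetes")
word_tokens = tokenize("A patient with diabetes")
```
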
20
Text Mining Steps

IR: yields all relevant texts
Gathers, selects, and filters relevant documents
IE: extracts entities, relations, facts, and events
of interest to the user
Finds relevant concepts and facts about concepts
Finds only what we are looking for
DM: discovers unsuspected associations
Combines and links facts and events
Discovers new knowledge, finds new associations
--from Text Mining and Terminology Management in
Biomedicine

21
Main TM Research in Biology

Entity/Term recognition
Relationship extraction

22
Named Entity Recognition (NE)

Term ambiguity/variation
Lack of clear naming convention
Synonyms, abbreviations, and acronyms (nuclear
receptor = NR)
Many different terms refer to the same concept
vs. one term can have multiple meanings
BAD: a human gene encoding a BCL-2 family protein
(vs. bad things, bad weather)

23
Approaches of NE

Dictionary/controlled vocabularies: MeSH
Ontology approaches: GO
Rule-based: CFG parser
Statistical methods: term frequency
Machine learning: neural network
Hybrid approaches
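
As an illustration of the simplest of these approaches, the sketch below tags entities by exact dictionary lookup, preferring the longest match. The dictionary entries and type labels are hypothetical and chosen to fit the example sentence used on the earlier slides; a real system would draw them from a resource such as MeSH or GO.

```python
import re

# Hypothetical mini-dictionary mapping surface forms to entity types.
DICTIONARY = {
    "E2F-RB complex": "COMPLEX",
    "TGF-beta": "PROTEIN",
    "E2F": "PROTEIN",
}


def tag_entities(text, dictionary=DICTIONARY):
    """Return (start, end, surface form, type) for each dictionary hit."""
    # Longer terms come first in the alternation so they win over substrings.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return [(m.start(), m.end(), m.group(), dictionary[m.group()])
            for m in pattern.finditer(text)]


sentence = ("the E2F-RB complex induced by TGF-beta may bind to E2F sites and "
            "suppress expression of specific genes whose promoters contain "
            "E2F binding sites")
entities = tag_entities(sentence)
```

Exact lookup of this kind misses spelling variants and new names, which is one reason the other approaches in the list exist.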

24
Ontology Approaches

Ontologies are crucial for ATR (automatic term
recognition)
Manage term acronyms, synonyms, and variations
Link terms in text with concepts
Add meaning through semantic annotation of texts
Support relationship extraction
Ontologies support IE/IR
Populate terms into ontologies (automatic
ontology generation)

25
Relationship Extraction

Detect a prespecified type of relationship
between a pair of entities of given types.
Relationships between genes and proteins
Relationships between genes, proteins, or other
biological entities (Protein Active Site Template
Acquisition System--PASTA)

26
Relationships between Genes and Proteins

Grouping genes by functional relationships could
aid gene expression analysis and database
annotation
How to know if a group of genes share the same
function?
Raychaudhuri et al. used a measure of neighbor
divergence of papers to measure the functional
coherence of a group of genes

27
Methodology of Functional Coherence
28
Relationships between Genes and Proteins

MeKE (Medical Knowledge Explorer) system (Chiang
and Yu)
Uses GO codes as a lexicon of function names,
combining it with a lexicon of gene and gene
product names from LocusLink
Uses sentence alignment to determine patterns
associated with statements about gene function
Uses a Naïve Bayes classifier to extract
sentences containing information about gene
product function

29
PASTA

Uses type and POS tagging along with manually
created templates and lexicons assembled from
biological databases to extract relationships
between amino acid residues and their function
within a protein.

30
Example of Annotated Text
TITLE: The crystal structure of a
<NAME TYPE=PROTEIN>triacylglycerol lipase</NAME>
from <NAME TYPE=SPECIES>Pseudomonas cepacia</NAME>
reveals a highly open conformation in the absence
of a bound inhibitor
AUTHORS: Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW
JOURNAL: STRUCTURE, 1997, Vol.5, No.2, pp.173-185
ABSTRACT: Results: We have determined the crystal
structure of a
<NAME TYPE=PROTEIN> triacylglycerol lipase</NAME>
from
<NAME TYPE=PROTEIN>Pseudomonas cepacia (Pet)</NAME>
in the absence of a bound inhibitor using X-ray
crystallography. The structure shows the
<NAME TYPE=PROTEIN>lipase</NAME> to contain an
<NAME TYPE=PROTEIN>alpha/beta-hydrolase</NAME>
fold and a catalytic triad comprising of residues
<NAME TYPE=RESIDUE> Ser87</NAME>,
<NAME TYPE=RESIDUE>His286 </NAME> and
<NAME TYPE=RESIDUE>Asp264 </NAME>. The enzyme
shares several structural features with
homologous <NAME TYPE=PROTEIN>lipases </NAME>
from <NAME TYPE=SPECIES>Pseudomonas glumae (PgL)</NAME>
and <NAME TYPE=SPECIES>Chromobacterium viscosum (CvL)</NAME>,
including a calcium-binding site. The present
structure of <NAME TYPE=SPECIES>Pet</NAME> reveals
a highly open conformation with a
solvent-accessible active site. This is in
contrast to the structures of
<NAME TYPE=SPECIES>PgL</NAME> and
<NAME TYPE=SPECIES>Pet</NAME> in which the
active site is buried under a closed or
partially opened 'lid', respectively.
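
Annotated text of this kind can be read back into (type, entity) pairs with a short parser. The sketch below assumes the markup has the form <NAME TYPE=...>...</NAME>; the angle brackets and the equals sign are an assumption about the original slide, and the function name is hypothetical.

```python
import re

# Assumed tag syntax: <NAME TYPE=PROTEIN>...</NAME>
NAME_TAG = re.compile(r"<NAME\s+TYPE=(\w+)>\s*(.*?)\s*</NAME>", re.DOTALL)


def extract_annotations(text):
    """Return (type, entity text) pairs with internal whitespace collapsed."""
    return [(m.group(1), " ".join(m.group(2).split()))
            for m in NAME_TAG.finditer(text)]


sample = ("a catalytic triad comprising of residues "
          "<NAME TYPE=RESIDUE> Ser87</NAME>, "
          "<NAME TYPE=RESIDUE>His286 </NAME> and "
          "<NAME TYPE=RESIDUE>Asp264 </NAME>")
annotations = extract_annotations(sample)
```
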
31
Approaches of Relationship Extraction

Manually generated template-based methods:
use patterns (usually in the form of regular
expressions) written by domain experts to
extract concepts connected by a specific relation
from text.
Automatic template methods:
create similar templates automatically by
generalizing patterns from text known to contain
the relevant relationship.
Statistical methods:
identify relationships by looking for concepts
that occur together more often than
would be predicted by chance.
NLP-based methods:
perform a substantial amount of sentence parsing
to decompose the text into a structure from which
relationships can be readily extracted.

--from A Survey of Current Work in Biomedical
Text Mining
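
Two of the approaches above are easy to sketch: a hand-written template expressed as a regular expression, and a co-occurrence statistic that compares observed counts against what chance would predict. The pattern, the entity names, and the sentences are all invented for illustration, and the ratio used here is only one simple choice of statistic.

```python
import re

# Hypothetical hand-written template: "<A> binds to|inhibits|activates <B>".
TEMPLATE = re.compile(
    r"\b([A-Z][\w-]*) (binds to|inhibits|activates) ([A-Z][\w-]*)")


def extract_by_template(sentences):
    """Manual-template method: return (entity, relation, entity) triples."""
    return [m.groups() for s in sentences for m in TEMPLATE.finditer(s)]


def cooccurrence_ratio(docs, a, b):
    """Statistical method: observed co-occurrence count divided by the count
    expected if a and b occurred independently (above 1 hints at association)."""
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    expected = n_a * n_b / n
    return n_ab / expected if expected else 0.0


# Invented sentences and per-document entity sets, for illustration only.
sentences = ["GeneA binds to GeneB in vitro.",
             "We found that ProtX inhibits ProtY."]
triples = extract_by_template(sentences)

docs = [{"GeneA", "GeneB"}, {"GeneA", "GeneB"}, {"GeneC"}, {"GeneA"}]
ratio = cooccurrence_ratio(docs, "GeneA", "GeneB")
```

The template gives precise, typed relations but only for phrasings someone anticipated; the co-occurrence ratio needs no patterns but says nothing about what kind of relationship holds.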
32
Hypothesis Generation

Goal: uncover previously unrecognized
relationships.
Swanson found a connection between fish oil and
Raynaud's syndrome in 1986 by manually connecting
concepts between journal articles.
He also traced 11 indirect connections between
migraine and magnesium using summaries of
published articles.

--from A Survey of Current Work in Biomedical
Text Mining
33
Approaches of Hypothesis Generation

A influences B, and B influences C, therefore A
may influence C (Swanson)
Automated hypothesis generation

34
Initial Thought

Data resources: paper abstracts/full papers, web
pages, online databases
Computing resources: a computer with large memory
(several GB) for training the tagger/parser
(Stanford POS tagger)
Learning and clarifying the biological problem
(horizontal gene transfer), in cooperation with
biologists
Preparing the ontology:
Open Biomedical Ontologies (OBO)
Basic Formal Ontology (BFO)
Finding relationships

The End
Thanks!

36
Comparison of Document-handling Techniques
--from Text analysis and knowledge mining system
37
Model Types

For successful IR/IE, it is necessary to
represent the documents in some way. There are a
number of models for this purpose. They can be
classified along two dimensions, as shown
in the figure on the left: the mathematical basis
and the properties of the model.
--from wiki

38
--from wiki
39
Bag of words approach

Treats a document as a collection of words or
phrases
Generally ignores the word order
May count each word occurrence, or just flag
which words occur
Stemming and stop word elimination
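
A minimal sketch of this representation follows. The stop-word list is a tiny hypothetical one, stemming is left out to keep the example short, and the function name is introduced only for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "by", "may"}


def bag_of_words(text, binary=False):
    """Map a document to word counts (or 0/1 flags), ignoring word order."""
    words = [w for w in re.findall(r"[a-z0-9-]+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    if binary:
        # Only flag which words occur.
        return {w: 1 for w in counts}
    return dict(counts)


doc = "the E2F-RB complex may bind to E2F sites and E2F binding sites"
bag = bag_of_words(doc)
flags = bag_of_words(doc, binary=True)
```

Because only counts are kept, two documents containing the same words in a different order receive the same representation.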

40
Common Techniques of NLP

Stemming: identifying the stem of each word. For
example, "hybridized", "hybridizing", and
"hybridization" would be stemmed to "hybrid". As
a result, the analysis phase of the NLP process
has to deal with only the stem of each word, not
every possible variant.
Tagging: identifying the part of speech
represented by each word, such as noun, verb, or
adjective.
Tokenizing: segmenting sentences into words and
phrases. This process determines which words
should be retained as phrases and which ones
should be segmented into individual words. For
example, "Type II Diabetes" should be retained as
a word phrase, whereas "A patient with diabetes"
would be segmented into four separate words.
Core Terms: significant terms, such as protein
names and experimental method names, are
identified based on a dictionary of core terms. A
related process is ignoring insignificant words
such as "the", "and", and "a".
Resolving Abbreviations, Acronyms, and
Synonyms: replacing abbreviations with the words
they represent, and resolving acronyms and
synonyms to a controlled vocabulary. For example,
"DM" and "Diabetes Mellitus" could be resolved to
"Type II Diabetes", depending on the controlled
vocabulary.
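
Two of these steps, stemming and resolving terms to a controlled vocabulary, can be sketched as follows. The suffix list and the vocabulary mapping are hypothetical and deliberately tiny; practical stemmers such as the Porter stemmer apply many more rules.

```python
# Hypothetical suffix list, longest first; real stemmers use many more rules.
SUFFIXES = ["ization", "izing", "ized", "ing", "ed", "s"]

# Hypothetical controlled vocabulary mapping variants to a preferred term.
CONTROLLED_VOCABULARY = {
    "dm": "Type II Diabetes",
    "diabetes mellitus": "Type II Diabetes",
}


def stem(word):
    """Strip the first matching suffix, keeping at least three stem letters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[:-len(suffix)]
    return lower


def resolve(term):
    """Map an abbreviation or synonym to its preferred term, if one is known."""
    return CONTROLLED_VOCABULARY.get(term.lower(), term)


stems = [stem(w) for w in ["hybridized", "hybridizing", "hybridization"]]
resolved = [resolve(t) for t in ["DM", "Diabetes Mellitus", "lipase"]]
```
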

41
Relationships between Genes and Proteins

Pan et al.'s Dragon TF Association Miner system
used linear discriminant analysis on terms and
neural networks to create models that recognized
abstracts containing information relating
transcription factors (TFs) to GO codes and
diseases.

--from A Survey of Current Work in Biomedical
Text Mining
42
Dictionary-based Approaches

Neologisms and variations are a major issue for
these
Combine dictionaries with edit distance for
flexible string matching
Tuning of the cost function (e.g. making space to
hyphen less costly)
Hirschman et al. (2002)
Tsuruoka & Tsujii (2004)

--from Text Mining and Terminology Management In
Biomedicine
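
The idea of a tuned cost function can be sketched as a Levenshtein distance in which swapping a space for a hyphen is cheaper than any other substitution. The cost values, the threshold, and the example terms are illustrative assumptions, not the settings used in the cited work.

```python
CHEAP_PAIRS = frozenset({(" ", "-"), ("-", " ")})


def weighted_edit_distance(a, b, cheap_cost=0.1):
    """Levenshtein distance where a space/hyphen substitution costs
    cheap_cost instead of 1 (costs are illustrative assumptions)."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif (ca, cb) in CHEAP_PAIRS:
                sub = cheap_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # delete ca
                            curr[j - 1] + 1.0,    # insert cb
                            prev[j - 1] + sub))   # substitute or match
        prev = curr
    return prev[-1]


def lookup(mention, dictionary, max_distance=1.0):
    """Return dictionary terms within max_distance, closest first."""
    scored = [(weighted_edit_distance(mention.lower(), term.lower()), term)
              for term in dictionary]
    return [term for dist, term in sorted(scored) if dist <= max_distance]


variant_distance = weighted_edit_distance("NF kappa B", "NF-kappa-B")
matches = lookup("nf kappa b", ["NF-kappa-B", "TGF-beta"])
```

With a uniform cost of 1 the same variant would be two edits away from its dictionary entry; the reduced cost lets spelling variants match while keeping unrelated names out.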
43
Rule-based Approaches

Use of dictionaries of typical term constituents:
heads, class-specific adjectives, affixes,
specific acronyms
Use of term formation patterns (Ananiadou,
Gaizauskas)
Context-free grammars
Simple lexical patterns, orthographic features
Fukuda et al.: PROPER (core and feature terms)
core terms are domain-characteristic words
feature terms are keywords that describe the
function and characteristics of a term (e.g.
protein, receptor, etc.)
SAP kinase: core = SAP, feature = kinase
Usual problem of tuning and porting

--from Text Mining and Terminology Management In
Biomedicine
44
Machine Learning Approaches

Typically designed for specific classes of
entities
Challenges:
Selecting a set of representative features for
accurate recognition and classification
Detection of term boundaries of multiword terms
Few reliable training resources for biomedicine
Experimentation with various techniques:
Hidden Markov models (Collier et al.), Naïve
Bayes, support vector machines (Kazama et al.,
Yamamoto et al.), decision trees, etc.

--from Text Mining and Terminology Management In
Biomedicine
45
Statistical Approaches

Based on statistical distributions of
collocations
Challenge: to define an adequate measure of the
termhood of candidate terms
Main strategy:
Extract specific noun phrases as term candidates
Estimate termhoods
Produce a ranked list and apply thresholds
More easily tuned, more portable, no training
data needed
Work best on large collections (normalization
required for small ones)

--from Text Mining and Terminology Management In
Biomedicine
46
Hybrid Approaches

Combine several techniques
C/NC value (Frantzi & Ananiadou), being used by
National Centre
Combines statistical, linguistic, and contextual
processing to rank candidate terms
Nested (embedded) sub-terms help to recognize
full compounds
ABGENE (Tanabe & Wilbur)
Machine learning, transformation rules, and a
dictionary combined with a probabilistic approach

--from Text Mining and Terminology Management In
Biomedicine
47
Descriptive Data Mining Tasks

Classification: maps data into predefined groups
or classes
Pattern recognition
Direct marketing, retention
Clustering: groups similar data together into
clusters/groups.
Segmentation
Partitioning
WWW marketing

--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
48
Predictive Data Mining Tasks

Regression: used to map a data item to a
real-valued prediction variable.
Credit scoring
Link Analysis: uncovers relationships among data.
Association Rules
Sequential Analysis: determines sequential
patterns.

--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
49
Decision Trees

Popular technique for classification. A leaf node
indicates the class to which the corresponding
tuple belongs.
Decision Tree (DT) representation
Each internal node tests an attribute.
Each branch corresponds to attribute value.
Each leaf node assigns a classification.

50
Training Data Set
--from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf
51
Decision Tree for PlayTennis (back)
--from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf
