Application of Data and Text Mining to Bioinformatics - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Application of Data and Text Mining to
Bioinformatics
  • Sammy Wang
  • Computer Science
  • University of Georgia

2
Data Mining (DM) Definition
  • Extraction of implicit, previously unknown, and
    potentially useful information (Pattern) from
    large data sets or databases.
  • Uses computational techniques from statistics,
    machine learning and pattern recognition.

--from wiki
3
Where Does DM Come From?
  • Information Theory
  • Database Management
  • Visualization
  • High Performance Computers

DATA MINING
  • Applied Statistics
  • Parallel Algorithms
  • Machine Learning

--from http://wwwmaths.anu.edu.au/steve/pdcn.pdf
4
Knowledge Discovery Process
  • Data mining: the core of the knowledge discovery
    process.

[Figure: knowledge discovery pipeline. Labels, bottom to top: Databases; Data Integration; Data Cleaning; Preprocessed Data; Selection; Data transformations; Task-relevant Data; Data Mining; Interpretation; Knowledge]
--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
5
DM Tasks (Goals)
  • There are two categories of goals (or high-level
    tasks) in DM:
  • Description: models are constructed to describe
    particular patterns or relationships in the data
    (beer and diaper)
  • Prediction: models are constructed using
    historical cases to predict outcomes for new
    cases (PROSPECTR)

--from http://www.sys-consulting.co.uk/
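
The "beer and diaper" pattern is the usual illustration of a descriptive model (an association rule). A minimal sketch of how support and confidence for such a rule could be computed; the baskets below are made up for illustration and do not come from the presentation:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets (illustrative only)
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
```

Here the rule "diapers => beer" holds in 2 of 5 baskets (support) and in 2 of the 3 baskets that contain diapers (confidence).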
6
Data Mining Tech
Data Mining
  • Descriptive: Clustering, Summarization,
    Association Rules, Sequence Discovery
  • Predictive: Classification, Regression, Time
    Series Analysis
  • Techniques shown: Decision Tree, Artificial
    Neural Network

--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
7
PRiOritization by Sequence and PhylogEnetic
features of CandidaTe Regions - PROSPECTR
  • Classifies genes as likely or unlikely to be
    involved in human hereditary disease, based on
    features derived from their sequence
  • Builds the classifier as a decision tree (based
    on C4.5) using the Weka machine learning package
  • Can take an arbitrary number of features /
    attributes as input
  • Gives a human-readable classifier
  • Also happens to give better results than SVM-
    and Bayes-based methods.
  • Trained on approx. 1084 disease genes
    versus 1084 randomly selected genes

8
[Figure: PROSPECTR decision tree. Beginning at START, each node applies a test with N/Y branches, and each branch carries a score (values in the figure range from -0.594 to 0.827).]
Node tests: Mouse homol > 42; Has signal peptide?; Mouse homol > 95; Gene length > 563; Gene length > 997 bp; Exons > 32; Best paralog > 78; Rata id > 59; 3'UTR < 647 bp; CDS len > 704 bp; Hs/Mm Ka/Ks < 0.195; GC > 37.5; Mouse id > 68.3; Worm id > 55
CLASS is DISEASE if Score < 0
--from http://www.genetics.med.ed.ac.uk/tutorials/InfMSc2_12.ppt27
9
Text Mining Definition
  • Refers to the process of extracting interesting,
    non-trivial information and knowledge from
    unstructured text (i.e. free text).
  • A young interdisciplinary area that draws on
    information retrieval, data mining, machine
    learning, statistics and computational
    linguistics.
  • --From Wikipedia

10
Two Approaches of TM
  • Bag of Words
  • looking for features (words, concepts, headings,
    formatting, authors, references and links)
  • Statistical methods, machine learning methods,
    algorithms, etc.
  • Natural Language Processing (NLP)
  • Syntax and semantics

11
Bag of Words Approach
  • Any technique used in data mining, statistics and
    machine learning
  • For example, neural networks, decision trees, HMM,
    stochastic methods, Naïve Bayes, Maximum Entropy,
    SVM, etc. These can also be used in NLP
  • Web search: Google, Yahoo
  • Classification
  • Clustering

12
Natural Language Processing (NLP)
  • NLP is a subfield of AI and linguistics.
  • Goal of NLP: design and build software that
    will analyze, understand, and generate natural
    languages, so that people can address the
    computer as though they were addressing another
    person.
  • Major tasks: information extraction (IE),
    information retrieval (IR), text to speech,
    speech recognition, natural language generation,
    machine translation, question answering,
    text-proofing, translation technology, automatic
    summarization
  • --from Wiki and Microsoft Research

13
Five Types of IE
  • Named Entity recognition (NE)
  • Finds and classifies names, places, etc.
  • Co-reference resolution (CO)
  • Identifies identity relations between entities
  • Template Element construction (TE)
  • Adds descriptive information to NE results (using
    CO)
  • Template Relation construction (TR)
  • Finds relations between TE entities
  • Scenario Template production (ST)
  • Fits TE and TR results into specified event
    scenarios
  • -- according to the definition given by The MUC
    programme in 1998

14
NE Example
  • the E2F-RB complex induced by TGF-beta may bind
    to E2F sites and suppress expression of specific
    genes whose promoters contain E2F binding sites

Named Entity recognition (NE)
15
CO Example
  • the E2F-RB complex induced by TGF-beta may bind
    to E2F sites and suppress expression of
    specific genes whose promoters contain E2F
    binding sites

Coreference resolution (CO)
16
TE Example
  • the E2F-RB complex induced by TGF-beta may bind
    to E2F sites and suppress expression of specific
    genes whose promoters contain E2F binding sites

Template Element construction (TE)
17
TR Example
  • the E2F-RB complex induced by TGF-beta may bind
    to E2F sites and suppress expression of specific
    genes whose promoters contain E2F binding sites

Template Relation construction (TR)
18
Foundation of NLP--Tagging
  • Assign part-of-speech tags to words reflecting
    their syntactic category (noun, verb, adjective,
    etc.)
  • Difficulty: words can belong to different
    syntactic categories in different contexts.
  • he books tickets vs. he reads books
  • Buffalo buffaloes buffalo buffalo buffaloes
  • Algorithms used: Viterbi algorithm, HMM, maximum
    likelihood, rule-based, stochastic, n-grams,
    Baum-Welch, dynamic programming
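
To make the tagging ambiguity concrete, here is a minimal sketch of Viterbi decoding over a tiny hidden Markov model. The tag set and every probability below are invented for illustration (they are not estimated from any corpus); the point is only that "books" can come out as a verb in one sentence and a noun in the other.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under an HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               tag chosen for word i-1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Hypothetical model parameters (illustrative only)
TAGS = ["PRON", "VERB", "NOUN"]
START_P = {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3}
TRANS_P = {
    "PRON": {"PRON": 0.05, "VERB": 0.85, "NOUN": 0.10},
    "VERB": {"PRON": 0.10, "VERB": 0.05, "NOUN": 0.85},
    "NOUN": {"PRON": 0.10, "VERB": 0.50, "NOUN": 0.40},
}
EMIT_P = {
    "PRON": {"he": 1.0},
    "VERB": {"books": 0.3, "reads": 0.7},
    "NOUN": {"books": 0.5, "tickets": 0.5},
}

tags_1 = viterbi("he books tickets".split(), TAGS, START_P, TRANS_P, EMIT_P)
tags_2 = viterbi("he reads books".split(), TAGS, START_P, TRANS_P, EMIT_P)
```

With these assumed numbers, "books" is tagged VERB in the first sentence and NOUN in the second, because the transition probabilities favour a verb after a pronoun and a noun after a verb.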

19
Foundation of NLP--Tokenizing
  • Tokenizing: segmenting sentences into words and
    phrases. This process determines which words
    should be retained as phrases and which ones
    should be segmented into individual words.
  • "Type II Diabetes" vs. "A patient with diabetes"
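
One simple way to keep known phrases together is a longest-match lookup against a phrase list. The sketch below assumes whitespace-separated input and a hand-supplied phrase list; the function name and the list are hypothetical.

```python
def tokenize(sentence, phrases):
    """Split on whitespace, keeping any known multi-word phrase as one token."""
    words = sentence.split()
    known = {tuple(p.lower().split()) for p in phrases}
    longest = max((len(p) for p in known), default=1)
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches
        for n in range(min(longest, len(words) - i), 0, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if n == 1 or candidate in known:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
    return tokens


PHRASES = ["Type II Diabetes"]  # hypothetical phrase list
with_phrase = tokenize("A patient with Type II Diabetes", PHRASES)
without_phrase = tokenize("A patient with diabetes", PHRASES)
```

The first sentence yields four tokens with "Type II Diabetes" kept whole; the second yields four separate words.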

20
Text Mining Steps
  • IR: yields all relevant texts
  • Gathers, selects, filters relevant documents
  • IE: extracts entities, relations, facts, events
    of interest to user
  • Finds relevant concepts, facts about concepts
  • Finds only what we are looking for
  • DM: discovers unsuspected associations
  • Combines and links facts and events
  • Discovers new knowledge, finds new associations
  • --from Text Mining and Terminology Management in
    Biomedicine

21
Main TM Research in Biology
  • Entity/Term recognition
  • Relationship extraction

22
Named Entity Recognition (NE)
  • Term ambiguity/variation
  • Lack of clear naming convention
  • Synonyms, abbreviations and acronyms (nuclear
    receptor = NR)
  • Many different terms refer to the same concept
    vs. one term can have multiple meanings
  • BAD: human gene encoding a BCL-2 family protein
    (vs. bad things, bad weather)

23
Approaches of NE
  • Dictionary/controlled vocabularies
  • MeSH
  • Ontology Approaches
  • GO
  • Rule-based
  • CFG parser
  • Statistical Methods
  • Term frequency
  • Machine Learning
  • Neural Network
  • Hybrid Approaches

24
Ontology Approaches
  • Ontologies are crucial for automatic term
    recognition (ATR)
  • Manage term acronym, synonym and variation
  • Link term in text with concept
  • Add meaning, semantic annotation of texts
  • Support relationship extraction
  • Ontologies support IE/IR
  • populate terms into ontologies (automatic
    ontology generation)

25
Relationship Extraction
  • Detect a prespecified type of relationship
    between a pair of entities of given types.
  • Relationships between genes and proteins
  • Relationships between genes, protein, or other
    biological entities (Protein Active Site Template
    Acquisition System--PASTA)

26
Relationships between Genes and Proteins
  • Grouping genes by functional relationships could
    aid gene expression analysis and database
    annotation
  • How to know if a group of genes share the same
    function?
  • Raychaudhuri et al. used a measure of neighbor
    divergence of papers to measure the functional
    coherence of a group of genes

27
Methodology of Functional Coherence
28
Relationships between Genes and Proteins
  • MeKE (Medical Knowledge Explorer) system (Chiang
    and Yu)
  • Uses GO codes as a lexicon of function names,
    combining it with a lexicon of gene and gene
    product names from LocusLink
  • Uses sentence alignment to determine patterns
    associated with statements about gene function
  • Uses a Naïve Bayes classifier to extract
    sentences containing information about gene
    product function

29
PASTA
  • Uses type and POS tagging along with manually
    created templates and lexicons assembled from
    biological databases to extract relationships
    between amino acid residues and their function
    within a protein.

30
Example of Annotated Text
TITLE: The crystal structure of a
<NAME TYPE=PROTEIN>triacylglycerol lipase</NAME> from
<NAME TYPE=SPECIES>Pseudomonas cepacia</NAME>
reveals a highly open conformation in the absence
of a bound inhibitor
AUTHORS: Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW
JOURNAL: STRUCTURE, 1997, Vol.5, No.2, pp.173-185
ABSTRACT: Results: We have determined the crystal
structure of a
<NAME TYPE=PROTEIN>triacylglycerol lipase</NAME> from
<NAME TYPE=PROTEIN>Pseudomonas cepacia (Pet)</NAME>
in the absence of a bound inhibitor using X-ray
crystallography. The structure shows the
<NAME TYPE=PROTEIN>lipase</NAME> to contain an
<NAME TYPE=PROTEIN>alpha/beta-hydrolase</NAME> fold
and a catalytic triad comprising of residues
<NAME TYPE=RESIDUE>Ser87</NAME>,
<NAME TYPE=RESIDUE>His286</NAME> and
<NAME TYPE=RESIDUE>Asp264</NAME>. The enzyme
shares several structural features with
homologous <NAME TYPE=PROTEIN>lipases</NAME> from
<NAME TYPE=SPECIES>Pseudomonas glumae (PgL)</NAME>
and
<NAME TYPE=SPECIES>Chromobacterium viscosum (CvL)</NAME>,
including a calcium-binding site. The present
structure of <NAME TYPE=SPECIES>Pet</NAME> reveals
a highly open conformation with a
solvent-accessible active site. This is in
contrast to the structures of
<NAME TYPE=SPECIES>PgL</NAME> and
<NAME TYPE=SPECIES>Pet</NAME> in which the
active site is buried under a closed or
partially opened 'lid', respectively.
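
Annotated text of this kind is easy to turn back into a list of typed entities. The sketch below assumes the annotations are inline elements of the form <NAME TYPE=PROTEIN>...</NAME> (the angle brackets and equals sign are an assumption about the original markup); the regular expression and function name are illustrative and are not part of PASTA.

```python
import re

# Matches one annotated entity: the TYPE attribute and the enclosed text
NAME_TAG = re.compile(r"<NAME TYPE=(\w+)>\s*(.*?)\s*</NAME>", re.DOTALL)


def extract_entities(annotated):
    """Return (type, text) pairs for every NAME element, in document order."""
    return [
        (kind, " ".join(text.split()))  # collapse line breaks inside a name
        for kind, text in NAME_TAG.findall(annotated)
    ]


sample = (
    "a catalytic triad comprising of residues "
    "<NAME TYPE=RESIDUE>Ser87</NAME>, <NAME TYPE=RESIDUE>His286 </NAME> "
    "and <NAME TYPE=RESIDUE>Asp264 </NAME> of the "
    "<NAME TYPE=PROTEIN>triacylglycerol\nlipase</NAME>"
)
entities = extract_entities(sample)
```

The resulting list pairs each entity type with its surface text, which is the input a template-filling step would start from.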
31
Approaches of Relationship Extraction
  • Manually generated template-based methods
  • use patterns (usually in the form of regular
    expressions) generated by domain experts to
    extract concepts connected by a specific relation
    from text.
  • Automatic template methods
  • create similar templates automatically by
    generalizing patterns from text known to have the
    relevant relationship.
  • Statistical methods
  • identify relationships by looking for concepts
    that are found with each other more often than
    would be predicted by chance.
  • NLP-based methods
  • perform a substantial amount of sentence parsing
    to decompose the text into a structure from which
    relationships can be readily extracted

--from A survey of current work in biomedical
text mining
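
The statistical approach can be illustrated with a co-occurrence score. This sketch uses pointwise mutual information over a toy set of abstracts, each reduced to the set of concepts it mentions; the choice of PMI and the data are assumptions for illustration, not the method of any particular system in the survey.

```python
import math


def cooccurrence_score(documents, a, b):
    """Pointwise mutual information (log2) of concepts a and b.

    Positive values mean a and b appear together more often than
    expected if they were independent.
    """
    n = len(documents)
    n_a = sum(a in d for d in documents)
    n_b = sum(b in d for d in documents)
    n_ab = sum(a in d and b in d for d in documents)
    if n_ab == 0:
        return float("-inf")
    expected = n_a * n_b / n  # co-occurrences expected by chance
    return math.log2(n_ab / expected)


# Hypothetical abstracts, each reduced to its set of concepts
abstracts = [
    {"TGF-beta", "E2F"},
    {"TGF-beta", "E2F", "RB"},
    {"TGF-beta", "E2F"},
    {"RB", "p53"},
    {"p53"},
    {"RB"},
    {"p53", "E2F"},
    {"RB"},
]

linked = cooccurrence_score(abstracts, "TGF-beta", "E2F")
unlinked = cooccurrence_score(abstracts, "RB", "p53")
```

In this toy data TGF-beta and E2F co-occur twice as often as chance would predict, while RB and p53 co-occur less often than chance.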
32
Hypothesis Generation
  • Goal: uncover previously unrecognized
    relationships.
  • Swanson found a connection between fish oil and
    Raynaud's syndrome in 1986 by manually connecting
    concepts between journal articles.
  • He also traced 11 indirect connections between
    migraine and magnesium using summarizations of
    published articles.

--from A survey of current work in biomedical
text mining
33
Approaches of Hypothesis Generation
  • A influences B, and B influences C, therefore A
    may influence C (by Swanson)
  • Automated hypothesis generation
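
A minimal sketch of the A-B-C idea, assuming relations have already been extracted as (source, target) pairs. The relation set below is a toy illustration loosely modelled on the fish oil example, not data taken from Swanson's articles.

```python
def abc_hypotheses(relations, a):
    """Return {C: [B, ...]} where A->B and B->C are known but A->C is not."""
    direct = {target for source, target in relations if source == a}
    hypotheses = {}
    for b in sorted(direct):
        for source, c in relations:
            if source == b and c != a and c not in direct:
                hypotheses.setdefault(c, []).append(b)
    return hypotheses


# Toy "influences" relations (illustrative only)
relations = {
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's syndrome"),
    ("platelet aggregation", "Raynaud's syndrome"),
}

hypotheses = abc_hypotheses(relations, "fish oil")
```

Each key is a candidate C that is not directly linked to A, and its value lists the intermediate B concepts that support the hypothesis.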

34
Initial Thought
  • Data resources: paper abstracts/full papers, web
    pages, online databases
  • Computing resources: computer with large memory
    (several GB) for training a tagger/parser
    (Stanford POS tagger)
  • Learning and clarifying the biological problem
    (horizontal gene transfer) in cooperation with a
    biologist
  • Preparing ontology
  • Open Biomedical Ontologies (OBO)
  • Basic Formal Ontology (BFO)
  • Finding relationships

35
  • The End
  • Thanks!

36
Comparison of Document-handling Techs
--from Text analysis and knowledge mining system
37
Model Types
  • For successful IR/IE, it is necessary to
    represent the documents in some way. There are a
    number of models for this purpose. They can be
    classified along two dimensions, as shown in the
    figure at left: the mathematical basis and
    the properties of the model.
  • --from wiki

38
--from wiki
39
Bag of words approach
  • Treats a document as a collection of words or
    phrases
  • Generally ignores the word order
  • May count each word occurrence, or just flag
    which words occur
  • Stemming and stop word elimination
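
A small sketch of the bag-of-words representation, assuming a simple regular-expression tokenizer and a tiny hand-picked stop word list (both are illustrative choices, and stemming is left out here):

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "by", "may", "whose"}


def bag_of_words(text, binary=False):
    """Map each non-stop word to its count, or to 1 if `binary` is set."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if binary:
        return {w: 1 for w in counts}
    return dict(counts)


sentence = (
    "the E2F-RB complex induced by TGF-beta may bind to E2F sites and "
    "suppress expression of specific genes whose promoters contain E2F "
    "binding sites"
)
counts = bag_of_words(sentence)
flags = bag_of_words(sentence, binary=True)
```

Word order plays no role: any reordering of the same words produces the same bag.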

40
Common Techniques of NLP
  • Stemming: identifying the stem of each word. For
    example, "hybridized", "hybridizing", and
    "hybridization" would be stemmed to "hybrid". As
    a result, the analysis phase of the NLP process
    has to deal with only the stem of each word, not
    every possible variant.
  • Tagging: identifying the part of speech
    represented by each word, such as noun, verb, or
    adjective.
  • Tokenizing: segmenting sentences into words and
    phrases. This process determines which words
    should be retained as phrases and which ones
    should be segmented into individual words. For
    example, "Type II Diabetes" should be retained as
    a word phrase, whereas "A patient with diabetes"
    would be segmented into four separate words.
  • Core terms: significant terms, such as protein
    names and experimental method names, are
    identified based on a dictionary of core terms. A
    related process is ignoring insignificant words
    such as "the", "and", and "a".
  • Resolving abbreviations, acronyms, and
    synonyms: replacing abbreviations with the words
    they represent, and resolving acronyms and
    synonyms to a controlled vocabulary. For example,
    "DM" and "Diabetes Mellitus" could be resolved to
    "Type II Diabetes", depending on the controlled
    vocabulary.
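
A toy sketch of two of these steps, stemming and resolution to a controlled vocabulary. The suffix list and the two-entry vocabulary are hypothetical and chosen only to reproduce the examples above; a real system would use an established stemmer (such as Porter's) and a curated vocabulary.

```python
# Suffixes are checked in order, longest first (illustrative list)
SUFFIXES = ("ization", "izing", "ized", "ing", "ed", "s")

# Hypothetical controlled vocabulary: variant (lower case) -> preferred term
CONTROLLED_VOCABULARY = {
    "dm": "Type II Diabetes",
    "diabetes mellitus": "Type II Diabetes",
}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def resolve(term):
    """Map an abbreviation or synonym to its preferred term, if known."""
    return CONTROLLED_VOCABULARY.get(term.lower(), term)


stems = [stem(w) for w in ("hybridized", "hybridizing", "hybridization")]
resolved = [resolve(t) for t in ("DM", "Diabetes Mellitus", "insulin")]
```

All three word forms reduce to the same stem, and both diabetes variants resolve to the same preferred term, while an unknown term is passed through unchanged.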

41
Relationships between Genes and Proteins
  • Pan et al.'s Dragon TF association miner system
    used linear discriminant analysis on terms and
    neural networks to create models that recognized
    abstracts that contained information relating
    transcription factors (TFs) with GO codes and
    diseases.

--from A survey of current work in biomedical
text mining
42
Dictionary-based Approaches
  • Neologisms and variations are a major issue for
    these
  • Combine dictionaries with edit distance for
    flexible string matching
  • Tuning of cost function (space to hyphen less
    costly)
  • Hirschman et al. (2002)
  • Tsuruoka & Tsujii (2004)

--from Text Mining and Terminology Management In
Biomedicine
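
A sketch of dictionary lookup with a tunable edit distance, in which substituting a space for a hyphen (or the reverse) is cheaper than other edits. The cost values, threshold, and dictionary entries are assumptions for illustration and are not taken from the cited papers.

```python
def edit_distance(a, b, sub_cost=None):
    """Levenshtein distance with optional per-pair substitution costs."""
    def cost(x, y):
        if x == y:
            return 0.0
        if sub_cost and (x, y) in sub_cost:
            return sub_cost[(x, y)]
        return 1.0

    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [float(i)]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1.0,              # delete x
                cur[j - 1] + 1.0,           # insert y
                prev[j - 1] + cost(x, y),   # substitute x with y
            ))
        prev = cur
    return prev[-1]


# Hypothetical tuning: space <-> hyphen is nearly free
CHEAP = {(" ", "-"): 0.1, ("-", " "): 0.1}


def best_match(term, dictionary, sub_cost=None, max_cost=1.0):
    """Return the closest dictionary entry within max_cost, else None."""
    scored = [
        (edit_distance(term.lower(), entry.lower(), sub_cost), entry)
        for entry in dictionary
    ]
    if not scored:
        return None
    best_cost, entry = min(scored)
    return entry if best_cost <= max_cost else None


DICTIONARY = ["NF-kappa-B", "TGF-beta"]  # illustrative entries
plain = edit_distance("NF kappa B", "NF-kappa-B")
tuned = edit_distance("NF kappa B", "NF-kappa-B", CHEAP)
match = best_match("nf kappa b", DICTIONARY, CHEAP)
```

With the tuned costs the spaced and hyphenated spellings are almost identical, so the variant is matched to its dictionary entry.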
43
Rule-based Approaches
  • Use of dictionaries of typical term constituents
  • Heads, class-specific adjectives, affixes,
    specific acronyms
  • Use of term formation patterns (Ananiadou,
    Gaizauskas)
  • Context-free grammars
  • Simple lexical patterns, orthographic features
  • Fukuda et al.: PROPER (core and feature terms)
  • core terms are domain-characteristic words
  • feature terms are keywords that describe the
    function and characteristics of a term (e.g.
    protein, receptor, etc.)
  • SAP kinase: core = SAP, feature = kinase
  • Usual problem of tuning and porting

--from Text Mining and Terminology Management In
Biomedicine
44
Machine Learning Approaches
  • Typically designed for specific classes of
    entities
  • Challenges
  • Selecting a set of representative features for
    accurate recognition and classification
  • Detection of term boundaries of multiword terms
  • Few reliable training resources for biomedicine
  • Experimentation with various techniques
  • Hidden Markov models (Collier et al.), Naïve
    Bayes, support vector machines (Kazama et al.,
    Yamamoto et al.), decision trees, etc.

--from Text Mining and Terminology Management In
Biomedicine
45
Statistical Approaches
  • Based on statistical distributions of
    collocations
  • Challenge: to define an adequate measure of
    termhood of candidate terms
  • Main strategy
  • Extract specific noun phrases as term candidates
  • Estimate termhoods
  • Ranked list, thresholds
  • More easily tuned, more portable, no training
    data
  • Work best on large collections (normalization
    required for small)

--from Text Mining and Terminology Management In
Biomedicine
46
Hybrid Approaches
  • Combine several techniques
  • C/NC value (Frantzi & Ananiadou), being used by
    National Centre
  • Combines statistical, linguistic and contextual
    processing to rank candidate terms
  • Nested (embedded) sub-terms help to recognize
    full compounds
  • ABGENE (Tanabe & Wilbur)
  • Machine learning, transformation rules,
    dictionary combined with probabilistic approach

--from Text Mining and Terminology Management In
Biomedicine
47
Descriptive Data Mining Tasks
  • Classification: maps data into predefined groups
    or classes
  • Pattern recognition
  • direct marketing, retention
  • Clustering: groups similar data together into
    clusters/groups.
  • Segmentation
  • Partitioning
  • www marketing

--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
48
Predictive Data Mining Tasks
  • Regression: used to map a data item to a
    real-valued prediction variable.
  • credit scoring
  • Link Analysis: uncovers relationships among data.
  • Association Rules
  • Sequential Analysis: determines sequential
    patterns.

--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
49
Decision Trees
  • Popular technique for classification: a leaf node
    indicates the class to which the corresponding
    tuple belongs.
  • Decision Tree (DT) representation:
  • Each internal node tests an attribute.
  • Each branch corresponds to attribute value.
  • Each leaf node assigns a classification.
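
The representation above maps directly onto a nested data structure. The sketch below encodes the well-known PlayTennis tree from the textbook chapter this presentation cites; the tuple/dict encoding and the function name are illustrative choices.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string
PLAY_TENNIS_TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})


def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


sunny_humid = classify(
    PLAY_TENNIS_TREE,
    {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"},
)
rainy_calm = classify(
    PLAY_TENNIS_TREE,
    {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"},
)
```

Each internal node tests one attribute, each dictionary key is a branch for one attribute value, and each string is a leaf that assigns the class.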

50
Training Data Set
--from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf
51
Decision Tree for PlayTennis
--from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf