Mining Medical Literature - PowerPoint PPT Presentation

1
Mining Medical Literature
  • Vignesh Ganapathy
  • (CS 374 Algorithms in Biology)
  • (FALL 2005)

2
Outline
  • Introduction and Background
  • Mining Technique 1
  • Identifying Functionally Coherent Gene Groups
  • Mining Technique 2
  • Extracting Synonymous gene and protein terms
  • Conclusions

3
Outline
  • Introduction and Background
  • Mining Technique 1
  • Identifying Functionally Coherent Gene Groups
  • Mining Technique 2
  • Extracting Synonymous gene and protein terms
  • Conclusions

4
Introduction
  • Medical Literature has vast amounts of knowledge
    and information
  • PubMed Central (PMC): the U.S. National
    Institutes of Health (NIH) free digital archive
    of biomedical and life sciences journal
    literature
  • Amedeo.com (The Medical Literature Guide)
  • Journals like Science, Nature, Cell, EMBO, Cell
    Biology, PNAS
  • (and many more..)

5
The Problem
  • The major task is finding ways to extract useful
    information from these resources.

6
What is Data Mining?
  • Data mining is the process of discovering
    meaningful new correlations, patterns and trends
    by sifting through large amounts of data stored in
    repositories, using pattern recognition
    techniques as well as statistical and
    mathematical techniques.

7
Example Data!
  • Large amounts of data but no information
  • Daily transactions at a supermarket
  • Daily website visit histories
  • Books/videos rented at a Library
  • Newspaper, Journal archives

8
Amazon.com
9
Google News
  • Clustering News items (Google News)

10
More Applications
  • Improving Sales strategy
  • Finding items that sell together
  • (a common example is beer and diapers
    being related: a supermarket found that 50%
    of the time beer was purchased with diapers)
  • Anomaly Detection and many more
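
The "items that sell together" idea is usually quantified with the support and confidence of an association rule. The sketch below is only an illustration: the basket data is made up, and the function names `support` and `confidence` are assumptions, not something taken from the presentation.

```python
def support(transactions, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for b in transactions if antecedent <= set(b))
    if n_antecedent == 0:
        return 0.0  # rule never fires, so no evidence either way
    n_both = sum(1 for b in transactions if both <= set(b))
    return n_both / n_antecedent


# Invented toy data: beer appears in 4 baskets, 2 of them also hold diapers.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"beer", "milk"},
    {"milk", "bread"},
]

rule_confidence = confidence(baskets, {"beer"}, {"diapers"})  # 2 / 4
rule_support = support(baskets, {"beer", "diapers"})          # 2 / 5
```

With this toy data the rule "beer implies diapers" has confidence 0.5, which is the kind of figure the supermarket example refers to.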

11
Information Retrieval (IR)
  • Collecting information from text data
    (Unstructured Data)
  • Applications
  • Search web documents
  • Natural Language Processing
  • Term also extends to include multimedia or other
    forms of unstructured data

12
Simple flow of Retrieval Process
13
IR System Evaluation
  • Some measures are
  • Precision
  • Recall
  • F1 measure: combined measure, the (equally
    weighted) harmonic mean of precision and recall
  • Sensitivity
  • Specificity
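
As a rough illustration of the first three measures, the sketch below computes precision, recall and the F measure from a set of retrieved document IDs and a set of relevant document IDs. The function name `precision_recall_f` and the `beta` parameter are illustrative assumptions; `beta = 1` gives the F1 measure.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Precision: share of retrieved documents that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0

    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    # Weighted harmonic mean of precision and recall.
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical query: 4 documents returned, 4 truly relevant, 2 in common.
p, r, f1 = precision_recall_f(retrieved={1, 2, 3, 4}, relevant={3, 4, 5, 6})
```

Note that recall needs the full set of relevant documents, which is rarely known for a large collection.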

14
Precision and Recall
  • How are Precision and Recall related?

15
Problems with Precision and Recall
  • Deciding which documents are relevant and which
    are not is not easy
  • For recall, it is difficult to measure the number
    of relevant documents in the database
  • Creating pool of relevant records is one solution
  • In practice, these are still good measures

16
Sensitivity and Specificity
  • Sensitivity: probability that a positive example
    is classified as positive
  • Specificity: probability that a negative example
    is classified as negative
  • What is the relation between Sensitivity,
    Specificity, Precision and Recall?
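
One way to see the relation is to compute all four measures from the same confusion-matrix counts. The sketch below uses the standard definitions; the function name `confusion_rates` and the example counts are assumptions made for illustration.

```python
def confusion_rates(tp, fp, tn, fn):
    """Compute four evaluation measures from confusion-matrix counts.

    tp/fn: positive examples classified as positive/negative.
    tn/fp: negative examples classified as negative/positive.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),  # same formula as recall
        "specificity": ratio(tn, tn + fp),  # uses true negatives
        "precision": ratio(tp, tp + fp),    # ignores true negatives
        "recall": ratio(tp, tp + fn),
    }


# Hypothetical counts for a document classifier.
rates = confusion_rates(tp=30, fp=10, tn=50, fn=10)
```

Sensitivity and recall are the same quantity. Specificity depends on the true negatives, which precision and recall never use; in retrieval the pool of irrelevant documents is typically huge, which is one reason precision is preferred there.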

17
Outline
  • Introduction and Background
  • Mining Technique 1
  • Identifying Functionally Coherent Gene Groups
  • Mining Technique 2
  • Extracting Synonymous gene and protein terms
  • Conclusion

18
Introduction
  • Analysis is shifting from single genes to families
    of genes
  • Examples of these are
  • Sequence Data
  • Gene Expression Clustering
  • Deletion Phenotypes
  • Yeast-2-Hybrid screens

19
HOVERGEN: a Database of Homologous Vertebrate Genes
  • Useful for comparative sequence analysis, or
    molecular evolution studies

  • (Figure: the 10 biggest gene families)
20
Why identify functional gene groups?
  • Interesting to know functionally relevant groups
    for large gene group sets
  • Helps to assess the significance of
    experimentally derived gene sets
  • Refine gene groups to find more functionally
    relevant groups
  • Existing algorithms can make use of this
    information in finding gene groups

21
Existing Approaches
  • Use of co-occurrence of gene names in abstracts
    to create networks of related genes automatically
  • Use existing vocabulary of gene functions and
    assigned genes to decide a functionally relevant
    group
  • (Gene Ontology (GO) consortium and Munich
    Information Center for Protein Sequences (MIPS))

22
Statistical NLP approach
  • Used for annotating individual genes
  • Determining gene and protein interactions
  • Assigning keywords to genes or groups of genes

23
Neighbor Divergence Approach
  • Statistical NLP technique
  • Will always be up to date if provided with a
    current literature base
  • Cannot specify what the actual function is!

24
Challenges in the Problem
  • Large number of genes
  • Genes have multiple functions
  • Some genes have been extensively studied, others
    recently discovered
  • So the literature about genes reflects these
    differences

25
Neighbor Divergence Intuition
26
Neighbor Divergence Algorithm
  • Representation Of Articles
  • Identifying Semantic Neighbors for Corpus
    Articles
  • Scoring Articles Relative to Gene Group
  • Calculating a Theoretical distribution of Scores
  • Calculating the Difference between empirical and
    theoretical distribution

27
ND: Article Representation
  • Words in articles are weighted by term frequency
    and inverse document frequency (to reduce the
    impact of common words)
  • $W_{i,j} = (1 + \log_2 tf_{i,j}) \cdot \log_2(N / df_i)$ if $tf_{i,j} > 0$
  • $W_{i,j} = 0$ if $tf_{i,j} = 0$
  • where $W_{i,j}$ = weighted count of word i in
    document j,
  • $tf_{i,j}$ = the number of times word i occurs
    in document j
  • $df_i$ = the number of documents
    containing word i
  • $N$ = the total number of documents
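
The weighting can be sketched in a few lines, reading the formula as W = (1 + log2(tf)) * log2(N / df) for tf > 0 and 0 otherwise. The function name `weight_articles` and the tokenized toy corpus are assumptions for illustration; tokenization and stop-word handling are left out.

```python
import math
from collections import Counter


def weight_articles(articles):
    """Map each article (a list of tokens) to a sparse {word: weight} dict.

    Words absent from an article are simply not stored, which corresponds
    to the W = 0 case of the formula.
    """
    n_docs = len(articles)

    # df[word] = number of articles containing the word at least once.
    df = Counter()
    for tokens in articles:
        df.update(set(tokens))

    weighted = []
    for tokens in articles:
        tf = Counter(tokens)  # tf[word] = occurrences in this article
        weighted.append({
            word: (1 + math.log2(count)) * math.log2(n_docs / df[word])
            for word, count in tf.items()
        })
    return weighted


# Invented toy corpus of four "abstracts".
corpus = [
    ["gene", "gene", "kinase"],
    ["gene", "yeast"],
    ["yeast", "yeast", "yeast", "yeast"],
    ["cell"],
]
vectors = weight_articles(corpus)
```

A word that occurs in every document gets log2(N / df) = 0, so it drops out of all later similarity calculations, which is the intended damping of common words.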

28
ND: Identifying Semantic Neighbors
  • For each article, the k most similar articles are
    precomputed (k = 20 was used)
  • Cosine similarity measure is used (cosine of the
    angle between two weighted article vectors)
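
A brute-force sketch of the neighbor computation, assuming articles are stored as sparse {word: weight} dicts. The names `cosine` and `semantic_neighbors` are illustrative. Two choices here are assumptions rather than details from the slides: an article is not counted as its own neighbor, and ties are broken by article index.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse weighted word vectors."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_neighbors(vectors, k=20):
    """For each article index, list the indices of its k most similar articles."""
    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; lower index wins ties (assumption).
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy vectors; real ones would come from the article weighting step.
toy_vectors = [{"a": 1.0}, {"a": 2.0}, {"b": 1.0}, {"a": 1.0, "b": 1.0}]
toy_neighbors = semantic_neighbors(toy_vectors, k=2)
```

This all-pairs loop is quadratic in the number of articles, which is why the neighbors are precomputed once for the corpus rather than per gene group.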

29
ND: Scoring Articles
  • Given a gene group g, ND assigns a score $S_{i,g}$
    to each article i
  • The score is a count of semantic neighbors that
    refer to group genes
  • $fr_{k,g} = n_{k,g} / n_k$ (fractional reference for
    each neighbor k)
  • $S_{i,g} = \mathrm{round}\left(\sum_{j=1}^{20} fr_{sem(i,j),g}\right)$ (score
    value)
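
The scoring step might look like the sketch below. It assumes that n_k is the number of genes article k refers to and n_{k,g} is how many of those belong to group g; the slide does not spell this out. The data structures (`genes_by_article`, `neighbors`) and function names are hypothetical, and ties at .5 are rounded up because the slide does not say how rounding is done.

```python
def fractional_reference(article_genes, group):
    """fr = (genes of the article that are in the group) / (genes of the article)."""
    if not article_genes:
        return 0.0
    return len(article_genes & group) / len(article_genes)


def article_score(i, neighbors, genes_by_article, group):
    """Sum the fractional references of article i's neighbors, then round."""
    total = sum(
        fractional_reference(genes_by_article[k], group)
        for k in neighbors[i]
    )
    return int(total + 0.5)  # round half up; total is never negative


# Toy data: article 0 has neighbors 1, 2 and 3.
genes_by_article = [{"A"}, {"A", "B"}, {"C"}, {"A", "C"}]
neighbors = [[1, 2, 3], [0, 3, 2], [3, 0, 1], [0, 1, 2]]
group = {"A", "B"}

score_0 = article_score(0, neighbors, genes_by_article, group)  # 1 + 0 + 0.5
```

An article whose neighbors all discuss only group genes scores close to k, while an article in an unrelated part of the literature scores 0.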

30
ND: Difference in Distributions
  • Calculating a theoretical distribution of scores
  • A Poisson distribution is used to represent a gene
    group with no coherent functional structure
  • $P(S = n) = \frac{\lambda^n}{n!} e^{-\lambda}$
  • KL divergence
  • If the two distributions are the same, the
    divergence is zero
  • The more disparate the distributions, the larger
    the divergence
  • $D(g \| h) = \sum_i g_i \log(g_i / h_i)$
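
A minimal sketch of this comparison, with g the empirical distribution of article scores and h a Poisson distribution. How the Poisson rate is estimated is not stated on the slide, so it is a plain parameter `lam` here. Renormalizing the Poisson probabilities over the observed score range and using the natural logarithm are simplifying assumptions, as are the function names.

```python
import math
from collections import Counter


def poisson_pmf(n, lam):
    """P(S = n) = lam^n / n! * exp(-lam)."""
    return lam ** n / math.factorial(n) * math.exp(-lam)


def kl_divergence(g, h):
    """D(g || h) = sum_i g_i * log(g_i / h_i); terms with g_i = 0 contribute 0."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def neighbor_divergence(scores, lam):
    """Divergence between observed article scores and a Poisson model."""
    max_score = max(scores)
    counts = Counter(scores)
    # Empirical distribution over scores 0..max_score.
    g = [counts[s] / len(scores) for s in range(max_score + 1)]
    # Poisson probabilities over the same range, renormalized (assumption).
    h = [poisson_pmf(s, lam) for s in range(max_score + 1)]
    h_total = sum(h)
    h = [p / h_total for p in h]
    return kl_divergence(g, h)


# Hypothetical article scores for one gene group.
divergence = neighbor_divergence([0, 1, 1, 2], lam=1.0)
```

A functionally coherent group should produce a tail of high-scoring articles that the Poisson model does not predict, and therefore a larger divergence.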

31
Observed and Expected Distribution of Article Scores
32
Results
33
Other methods
  • Word Divergence

34
Other methods
  • Best Article Score
  • Highest article score is used as a measure of the
    gene group's functional coherence
  • Best p-Value
  • Summed probability of an article having equal or
    more neighbors than it has
  • Neighbor Divergence, No Filter
  • The filter used in ND: when calculating semantic
    neighbors, only articles that refer to different
    genes are considered

35
Evaluation
36
Corrupting Functional Groups
37
Outline
  • Introduction and Background
  • Mining Technique 1
  • Identifying Functionally Coherent Gene Groups
  • Mining Technique 2
  • Extracting Synonymous gene and protein terms
  • Conclusion

38
Introduction
  • Genes and proteins are associated with multiple
    names
  • LARD, DR3, TR3, Wsl, DDR3, APO-3, TRAMP,
    WSL-1, WSL-LR, Tnfrsf12
  • PS2, Alg2, MA-3, alg-2, Pdcd6
  • GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2
  • (http://bioinformatics.org/textknowledge/synonym.php)

39
Advantage
  • Automated method will keep the database updated
  • Extracting synonyms will help
  • Information retrieval and extraction
  • Human curators of biological resources

40
Existing approaches
  • Detecting semantically related words
  • beer and wine are related terms
  • Use of WORDNET (a large lexical database of
    English words) to evaluate semantic similarity
  • Most synonym identification methods do not
    consider the surrounding context of words

41
Information Extraction and Machine Learning
  • Requires a large amount of manual labor to
    construct and tune extraction systems
  • Machine learning techniques help to reduce the
    manual labor by automatically acquiring rules from
    labeled and unlabeled data

42
ML techniques
  • Supervised Learning
  • Labeled Training Data available
  • Semi supervised Learning
  • Small number of labeled training data
  • Unsupervised Learning
  • Data with no labeling
  • Reinforcement Learning
  • Learn a mapping from situations to actions by
    trial-and-error interactions

43
Approach Used here
  • Obtain tagged genes and proteins in text using
    existing gene taggers
  • Four approaches used
  • Unsupervised Learning
  • Partially Supervised Learning
  • Supervised Learning
  • Hand-Crafted System
  • Use of a final COMBINED system

44
Unsupervised Learning Contextual Similarity
  • Finds set of words that appear in similar context
    using mutual information between the words

45
Unsupervised Learning Contextual Similarity
  • Mutual Information
  • Similarity Measure

46
Contextual Similarity
  • Computing this for all terms takes O(|lexicon|^3)
    time, so a heuristic search is used
  • Lots of false positives returned, so useful to
    incorporate some domain knowledge

47
Partially supervised Learning- Snowball
48
Snowball
  • Confidence of a pattern
  • Calculates confidence of extracted tuples and
    discards low confidence tuples

49
Supervised Learning Text classification
  • User-provided positive and negative example gene
    and protein pairs
  • Use SVM to train using this data (radial basis
    kernel function of SVMLight)
  • Classifies pairs of identified genes and proteins
    using a confidence score Conf(s) (the score
    assigned by the classifier)
  • Does not combine evidence from multiple
    occurrences of the same gene or protein pair

50
Hand Crafted Extraction System- GPE system
  • Most labor-intensive approach, but gives
    high-quality results
  • Starts with a set of known pairs of synonyms
  • Manual examination to find patterns of
    occurrences
  • Use of phrases such as "known as" or "also called"
  • Scans for more synonyms and uses heuristics and
    filters to ignore non-gene/protein terms
  • Confidence value of 1 assigned to every returned
    result
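
To make the pattern idea concrete, here is a toy extractor built around the two cue phrases mentioned above. It is not the GPE system: the regular expression, the `looks_like_gene` filter (require an uppercase letter or a digit) and the example sentences are all assumptions made for illustration.

```python
import re

# A candidate term: letters, digits and hyphens (e.g. "WSL-1", "Ncoa2").
TERM = r"[A-Za-z0-9][A-Za-z0-9\-]*"

# <term> [,] [(] [also] known as|called <term>
PATTERN = re.compile(
    rf"\b({TERM})\s*,?\s*\(?\s*(?:also\s+)?(?:known\s+as|called)\s+({TERM})"
)


def looks_like_gene(term):
    """Crude stand-in filter: gene symbols usually contain a capital or a digit."""
    return any(ch.isupper() or ch.isdigit() for ch in term)


def extract_synonyms(text):
    """Return (term, synonym, confidence) triples; confidence is always 1.0."""
    pairs = []
    for left, right in PATTERN.findall(text):
        if looks_like_gene(left) and looks_like_gene(right):
            pairs.append((left, right, 1.0))
    return pairs


sample = (
    "Ncoa2, also known as TIF2, is a coactivator. "
    "DR3 (also called LARD) is a receptor. "
    "The protein is also known as a kinase."
)
found = extract_synonyms(sample)
```

The third sentence matches the raw pattern with the pair ("is", "a"), and the filter discards it. A real system would need far more patterns and better filters than this.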

51
Combined System
  • Exploits advantages of knowledge based and
    machine learning based systems
  • Conf_E(s) represents the confidence score assigned
    to s by system E
  • Combined confidence: Conf(s) = 1 - (probability
    that all systems extracted s incorrectly)
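
If each system's confidence is read as the probability that its extraction is correct, and the systems are assumed to err independently (an assumption, not stated on the slide), the probability that all of them are wrong is the product of (1 - Conf_E(s)). The function name `combined_confidence` is illustrative.

```python
def combined_confidence(confidences):
    """1 minus the probability that every system extracted the pair incorrectly.

    `confidences` holds one Conf_E(s) value in [0, 1] per system that
    extracted the pair s. Independence between systems is assumed.
    """
    p_all_wrong = 1.0
    for conf in confidences:
        p_all_wrong *= 1.0 - conf
    return 1.0 - p_all_wrong


# Hypothetical scores from two systems for the same synonym pair.
two_systems = combined_confidence([0.5, 0.8])    # 1 - 0.5 * 0.2
# A pair also returned by a system that always reports confidence 1.
with_certain = combined_confidence([0.5, 1.0])   # exactly 1.0
```

Under this reading, agreement between systems can only raise the combined score, and any pair returned with confidence 1 ends up with a combined confidence of 1.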

52
Final parameters used for the different systems
53
Running Times
54
Results and Evaluation
55
Results and Evaluation
56
Outline
  • Introduction and Background
  • Mining Technique 1
  • Identifying Functionally Coherent Gene Groups
  • Mining Technique 2
  • Extracting Synonymous gene and protein terms
  • Conclusion

57
Conclusion and Future Work
  • A lot of interest in using knowledge from medical
    literature to guide bioinformatics algorithms
  • Functional Gene Groups
  • Can be used to connect data analysis algorithms
    to scientific literature
  • ND may be used to define new functional groups,
    annotate genes and organize genes in a
    functional hierarchy
  • Use of full text articles instead of only
    abstracts

58
Conclusion and Future Work
  • Synonym Extraction
  • Extracted synonyms could be used as a valuable
    supplement to the SWISSPROT database
  • Techniques could use the existing systems to find
    other biological relations between genes and
    proteins, small molecules, drugs and diseases.

59
  • Questions?