Textmining, Entity Identification, and Relationship Extraction - PowerPoint PPT Presentation

Slides: 58
Provided by: biotecTu

1
Text-mining, Entity Identification, and
Relationship Extraction
  • ILS Lecture 11th July
  • Loïc Royer

2
Outline
  • Motivation
  • Text-Mining
  • Gene Mention Tagging
  • Gene Mention Identification
  • Relation Extraction

3
Motivation
  • Biomedical literature is growing at a tremendous
    pace
  • PubMed indexes 16 million articles and grows
    by 600,000 articles every year

4
A Knowledge Explosion
Most biological information is in the form of
text and sequences
5
Four Solutions
  • Manual curation
    • Best precision
    • Not scalable
  • Text-Mining
    • Quite scalable
    • Difficult
  • Authors formalize their results in a formal
    language
    • Hard to convince people
    • Formal language cannot be too complicated
  • Wikipedia approach for Life Sciences
    • Scalability already demonstrated
    • Better Semantic Wiki engines are needed

6
Text-Mining
  • Text Classification
  • Information Retrieval
  • Entity Recognition
  • Information Extraction
  • Question Answering
  • Text Summarization

7
Outline
  • Motivation
  • Text-Mining
  • Gene Mention Tagging
  • Gene Mention Identification
  • Relation Extraction

8
Gene Mention Tagging
Objective: Locate gene/protein names in text
9
Gene Mention Tagging
  • The problem can be formulated as a classification
    problem of tokens in text
    • "... were reactive for CXCR3 and that ..."
    • "... of the interleukin-1 receptor gene in ..."
  • Tokens (words) are either
    • part of a gene name
    • not part of a gene name
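
A minimal sketch of this token-classification view, assuming whitespace tokenization and a hand-supplied list of gene mentions. The helper `label_tokens`, the label names `GENE`/`O`, and the choice of "interleukin-1 receptor" as the mention span are illustrative assumptions, not taken from the slides.

```python
def label_tokens(sentence, gene_mentions):
    """Label each whitespace token as GENE (part of a gene name) or O (outside)."""
    tokens = sentence.split()
    labels = ["O"] * len(tokens)
    for mention in gene_mentions:
        m_tokens = mention.split()
        n = len(m_tokens)
        # Slide a window over the sentence and mark exact token matches.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == m_tokens:
                labels[i:i + n] = ["GENE"] * n
    return list(zip(tokens, labels))


print(label_tokens("were reactive for CXCR3 and that", ["CXCR3"]))
print(label_tokens("of the interleukin-1 receptor gene in",
                   ["interleukin-1 receptor"]))
```

A real tagger has to predict these labels from features of each token and its neighbours; here they are simply read off a known mention list to show the target output.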

10
Gene Mention Tagging
  • Machine Learning techniques used
    • CRF (Conditional Random Fields)
    • SVM (Support Vector Machines)
    • N-gram (frequencies of word n-tuples)
    • Max-Ent (Maximum Entropy)
    • HMM (Hidden Markov Model)
  • NLP techniques used
    • POS Tagger (Part of Speech)
    • NP Chunker (Noun Phrase Chunker)

11
Gene Mention Tagging
  • Conditional Random Fields

Part of gene name, or not
Words (tokens)
Edges between random variables represent
conditional probabilities
What is the main assumption of such a model?
What is missing here?
12
Gene Mention Tagging
  • Features used
    • Morphological features
      • CAPWORD: A-Za-z
      • CAPSMIX: A-Z(A-za-za-zA-Z)A-z
    • Character n-gram features
      • "ase" in "decarboxylase"
    • Part-of-speech features
      • A noun is a better candidate than a verb
    • Lexical features
      • Dictionaries of known genes/proteins
      • Dictionaries of non-genes/proteins
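
The feature families above can be sketched as a per-token feature function. This is an illustration only: the function name `token_features`, the exact capitalization tests, and the feature keys are assumptions, and the regular expressions are not the ones from the slide, which did not survive in the transcript.

```python
import re


def token_features(token, pos_tag, gene_dictionary, n=3):
    """Return a feature dict for one token (illustrative feature set)."""
    lower = token.lower()
    capword = re.fullmatch(r"[A-Z][a-z]+", token) is not None
    has_upper = any(ch.isupper() for ch in token)
    has_lower = any(ch.islower() for ch in token)
    features = {
        # Morphological features: capitalization patterns and digits.
        "capword": capword,
        "capsmix": has_upper and has_lower and not capword,
        "has_digit": any(ch.isdigit() for ch in token),
        # Part-of-speech feature; the tag comes from an external tagger.
        "pos": pos_tag,
        # Lexical feature: membership in a dictionary of known names.
        "in_gene_dict": lower in gene_dictionary,
    }
    # Character n-gram features, e.g. "ase" in "decarboxylase".
    for i in range(len(lower) - n + 1):
        features["ngram=" + lower[i:i + n]] = True
    features["suffix=" + lower[-n:]] = True
    return features


print(token_features("decarboxylase", "NN", {"cxcr3"})["suffix=ase"])
print(token_features("ApoER2", "NN", set())["capsmix"])
```

Feature dictionaries of this kind are what a CRF, SVM, or maximum-entropy classifier would consume, one per token.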

13
Gene Mention Tagging
  • Things that make the tagging of gene/protein
    names difficult
  • Delimitation of a gene name in text is subject to
    discussion: "activated ..."
  • Biological terms such as diseases or phenotypes
    are also used to name genes,
  • Abbreviations defined in the text that
    accidentally resemble gene mentions, such as cell
    lines or domains.

14
Gene Mention Tagging
  • cheap date (id 32999): Mutants are especially
    sensitive to alcohol. Interestingly, another name
    for the gene is amnesiac, as mutants also have a
    poor memory.
  • ken and barbie (id 37785): Both male and female
    mutants lack external genitalia, as do poor Ken
    and Barbie.
  • icebox (id 48456): Female icebox mutants do not
    care about courting males.

Superman, Hairy, Crooked Legs, Lava lamp,
Dreadlocks, Clown, ...
15
Gene Mention Tagging
  • Results and state of the art for human genes
    • Recall: 85%
    • Precision: 88%
    • F-measure: 87%
  • Relevance: What biologists really need are the
    database identifiers of the genes.
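
For reference, the F-measure is the harmonic mean of precision and recall. The sketch below uses the standard definition. The function name `f_measure` and the `beta` parameter are introduced here and do not come from the slides.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded precision and recall from the slide.
print(round(f_measure(0.88, 0.85), 3))
```

With the rounded figures above this gives roughly 0.865, close to the reported F-measure; the small gap presumably comes from rounding of the published precision and recall.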

16
Outline
  • Motivation
  • Text-Mining
  • Gene Mention Tagging
  • Gene Mention Identification
  • Relation Extraction

17
Gene Mention Identification
  • Objective: Associate with each document a list of
    gene identifiers from a reference database

18
Gene Mention Identification
MAGUK
  • False positives
    • Text that does not mention proteins gets
      annotated
    • The wrong EntrezGene identifier is chosen
  • False negatives
    • A protein is not found at all
    • The wrong EntrezGene identifier is chosen

Choosing the wrong identifier is one mistake that counts for two:
a false positive and a false negative!
19
Gene Mention Identification
  • Recall phase
    • First, as many reasonable candidate genes as
      possible are obtained per document using
      • Dictionaries: merging, filtering,
        classification,
      • Known text-to-gene/protein associations,
      • Syntactical variation and matching.
  • Precision phase
    • Candidates are ranked according to
      • Syntax / semantics,
      • Local / document-wide contexts,
      • Inter-annotation agreement.

20
Gene Mention Identification
  • Recall Oriented Techniques
    • Preprocess text by interpreting enumerations
      written as ranges: "freac1 to freac4" becomes
      "freac1, freac2, freac3, and freac4"
    • Problem: "eiF1-eiF3"
      • Is it eiF1 and eiF3, or eiF1, eiF2, and eiF3?
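
A minimal sketch of the enumeration preprocessing, assuming ranges are written as "name<number> to name<number>" with a shared prefix. The pattern `RANGE` and the function `expand_ranges` are hypothetical names. Hyphenated forms such as "eiF1-eiF3" are deliberately left untouched because, as noted above, they are ambiguous.

```python
import re

# Matches "<prefix><number> to <prefix><number>", e.g. "freac1 to freac4".
RANGE = re.compile(r"\b([A-Za-z][A-Za-z-]*?)(\d+)\s+to\s+\1(\d+)\b")


def expand_ranges(text):
    """Rewrite 'x1 to x4' style ranges as explicit enumerations."""
    def repl(match):
        prefix = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if end <= start:
            return match.group(0)  # not a forward range; leave as is
        names = [f"{prefix}{i}" for i in range(start, end + 1)]
        if len(names) == 2:
            return " and ".join(names)
        return ", ".join(names[:-1]) + ", and " + names[-1]
    return RANGE.sub(repl, text)


print(expand_ranges("expression of freac1 to freac4 was measured"))
print(expand_ranges("binding of eiF1-eiF3"))
```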

21
Gene Mention Identification
  • Recall Oriented Techniques
  • Collect and merge gene/protein synonyms
    dictionaries from different sources.

22
Gene Mention Identification
  • Recall Oriented Techniques
    • Dictionary synonym classification: a
      divide-and-conquer strategy. Different types of
      gene synonyms require different identification
      strategies:
      • Database identifiers (KIAA0958, HGNC:17875),
      • Abbreviations (CD95L, Lin7c),
      • Single- or multi-word terms (tumor necrosis
        factor alpha),
      • Spurious synonyms ("AA", "ORF has no N-terminal
        Met, it may be non-functional").

23
Gene Mention Identification
  • Recall Oriented Techniques
  • Generate variant synonyms using rules
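
The actual rules are not preserved in this transcript. The sketch below shows what rule-based variant generation might look like, using three assumed rules (hyphen removal, hyphen insertion before a trailing number, case folding) chosen to cover pairs such as GPIb-alpha / GPIbalpha and APOER2 / ApoER2 from later slides. `synonym_variants` is a hypothetical helper.

```python
import re


def synonym_variants(synonym):
    """Generate lower-cased spelling variants of a dictionary synonym."""
    variants = {synonym}
    for v in list(variants):
        variants.add(v.replace("-", ""))    # GPIb-alpha -> GPIbalpha
        variants.add(v.replace("-", " "))   # GPIb-alpha -> GPIb alpha
    for v in list(variants):
        # Insert a hyphen between a letter and a trailing number: CDC2 -> CDC-2.
        variants.add(re.sub(r"([A-Za-z])(\d+)$", r"\1-\2", v))
    # Emulate case-insensitive matching by lower-casing every variant.
    return {v.lower() for v in variants}


print(sorted(synonym_variants("GPIb-alpha")))
```

Text mentions would be normalized the same way (here, lower-cased) before lookup, so that a mention written "ApoER2" matches the dictionary entry "APOER2".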

24
Gene Mention Identification
Vesicle Soluble Maleic acid N-ethylimide
Sensitive Fusion Protein Attachment Protein
Receptor
Hunter et al.
25
Gene Mention Identification
  • Recall Oriented Techniques
  • Gather similar documents with known associations
    to genes/proteins, then transfer association

[Figure: the document examined, linked to similar documents
annotated with genes such as rab5, orc2, p20, eiF2, Rap55]
26
Gene Mention Identification
  • How to compute similarity between documents?
    • Vector space model: a document is represented as
      a word vector.
    • Cosine similarity
    • TF-IDF

Why is there a logarithm in this formula?
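
The slide's formula is not preserved in the transcript. The sketch below uses one common TF-IDF variant, tf × log(N / df), together with cosine similarity over sparse word vectors. The function names and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "rab5 regulates endosome fusion".split(),
    "rab5 is required for endosome fusion".split(),
    "orc2 binds replication origins".split(),
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]) > cosine_similarity(vecs[0], vecs[2]))
```

Gene associations known for the most similar documents can then be proposed as candidates for the document being examined.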
27
Gene Mention Identification
  • Zipf's law: In a corpus of natural language
    utterances, the frequency of any word is roughly
    inversely proportional to its rank in the
    frequency table.
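
Stated as a formula (a standard formulation; the symbols f, r, C, and s are introduced here and are not from the slide), the frequency f of the word at rank r is approximately

```latex
f(r) \approx \frac{C}{r^{s}}, \quad s \approx 1
\qquad\Longleftrightarrow\qquad
\log f(r) \approx \log C - s \log r
```

so word frequencies span several orders of magnitude and fall roughly on a straight line in a log-log plot.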

28
Gene Mention Identification
  • Ambiguity problem
    • In naming
      • 1168 genes in EntrezGene are named p60
    • In species
      • 949 species have a gene named p53
  • Official names and symbols are not used.

29
Gene Mention Identification
  • Yeast: smallest vocabulary, shortest names, least
    ambiguity
  • Mouse: largest vocabulary, longest names, less
    ambiguity than fly
  • Fly: large vocabulary, medium names, most ambiguity

Lynette Hirschman, Marc Colosimo, Alexander A.
Morgan, Alexander S. Yeh. "Overview of
BioCreAtIvE task 1B: Normalized Gene Lists,"
accepted by BMC Bioinformatics.
30
Gene Mention Identification
  • Precision Oriented Techniques
    • Identify, with high confidence, regions of text
      that do not refer to genes/proteins

Genes: high confidence
Non-genes: high confidence
31
Gene Mention Identification
  • Precision Oriented Techniques
    • Alignment-based syntactical similarities
      • Levenshtein distance: edit distance
      • Needleman-Wunsch distance or Sellers algorithm:
        gap cost function, substitution matrix
      • Smith-Waterman distance: optimal subsequences
      • Smith-Waterman-Gotoh distance: starting a gap
        costs differently from continuing a gap

What is the weakness of alignment-based methods?
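
As a concrete instance of the simplest of these measures, here is a sketch of the Levenshtein distance with unit costs, computed by the usual dynamic programme over two rows. The example strings are synonym pairs that appear elsewhere in this lecture.

```python
def levenshtein(s, t):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


print(levenshtein("GPIb-alpha", "GPIbalpha"))  # 1: delete the hyphen
print(levenshtein("IL-receptor, type II", "type II IL-receptor"))
```

The second pair contains the same tokens in a different order, yet its character-level edit distance is far larger than that of the first pair.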
32
Gene Mention Identification
  • Precision Oriented Techniques
    • Other syntactical similarities
      • Jaro distance metric between s1 and s2:
        $d_j = \frac{1}{3}\left(\frac{m}{a} + \frac{m}{b} + \frac{m - t}{m}\right)$
        • m = number of matching characters
        • a, b = lengths of s1 and s2
        • t = number of transpositions

Why is this not a distance but a similarity?
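
A sketch of the Jaro measure in its usual textbook form, (m/a + m/b + (m - t)/m) / 3. It assumes the standard conventions, which the slide does not spell out: characters match only within a window of max(a, b) // 2 - 1 positions, and t is half the number of matched characters that appear in a different order. The function name is hypothetical.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    a, b = len(s1), len(s2)
    if a == 0 or b == 0:
        return 0.0
    window = max(max(a, b) // 2 - 1, 0)
    matched1 = [False] * a
    matched2 = [False] * b
    m = 0
    for i, ch in enumerate(s1):
        # Look for an unused, equal character of s2 within the window.
        for j in range(max(0, i - window), min(b, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [s1[i] for i in range(a) if matched1[i]]
    seq2 = [s2[j] for j in range(b) if matched2[j]]
    t = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) / 2
    return (m / a + m / b + (m - t) / m) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # 0.9444
```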
33
Gene Mention Identification
  • Precision Oriented Techniques
    • Bayesian estimation
      • Knowing the a priori usage frequency of gene
        names, and given a context and additional
        evidence: "ken and barbie" in biomedical text
        relating to the fly organism does refer to a
        gene and never to the toys...

Why is the marginal probability not needed in
practice?
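
Written out with Bayes' rule (the symbols g, for "the mention refers to gene g", and c, for the observed context and evidence, are notation introduced here):

```latex
P(g \mid c) = \frac{P(c \mid g)\, P(g)}{P(c)} \;\propto\; P(c \mid g)\, P(g)
```

The denominator P(c) does not depend on g, so it takes the same value for every candidate reading of a given mention.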
34
Gene Mention Identification
  • Evidence that influences posterior probabilities

35
Gene Mention Identification
  • Z-score

36
Gene Mention Identification: Typical Recall
Problems
  • Missing substitution rules
    • GAR1 protein vs. Gar1p
  • Wrong order of tokens
    • IL-receptor, type II vs. type II IL-receptor
  • Abbreviation inside long synonym
    • ubiquitin (UBC4/5) vs. Ubc4
  • Capitalisation
    • APOER2 vs. ApoER2
37
Gene Mention Identification: Typical Recall
Problems
  • Missing syntactic variants
    • GPIb-alpha vs. GPIbalpha
  • Morphology
    • UBC3B vs. Ubc3
  • Token pollution
    • Serotonin receptor 6 vs. serotonin 5-HT(6)
      receptor
  • Unspecific mentions
    • Maxi K channel beta subunit vs. beta2!

38
Gene Mention Identification: Typical Precision
Problems
  • Wrongly delimited match
    • complex I NADH dehydrogenase (ubiquinone), Fe-S
      (20 kDa) EC 1.6.5.3
  • Local context
    • inhibitors of PI 3-kinase.
  • Unspecific
    • NF-kappa B
  • Wrong identifier chosen / not found
  • Acronym resolution failed

39
Gene Mention Identification: State of the Art,
BioCreAtIvE II 2006
  • Our group got the best results for gene name
    identification
    • Recall: 83%
    • Precision: 78%
    • F-measure: 81%

40
Outline
  • Motivation
  • Text-Mining
  • Gene Mention Tagging
  • Gene Mention Identification
  • Relation Extraction

41
Relation Extraction
  • Entities involved in interactions
    • Genes / proteins / chemicals
    • Species / cell types
    • Diseases / phenotypes
  • Qualities of relations that can be extracted
    • Co-occurrences,
    • Strict semantic relations: protein interactions,
      protein to function, ...

42
Jensen et al.
43
alibaba.informatik.hu-berlin.de
44
Relation Extraction
  • Techniques
    • Co-occurrences
      • Same sentence, word distance, word composition
      • High recall, low precision
    • Strict semantic relations
      • Natural Language Processing (NLP)
      • Shallow parsing (manual rules, mined patterns)
      • Deep parsing (grammar, linguistic).
      • Low precision, low recall.
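
A minimal sketch of same-sentence co-occurrence extraction, assuming the entities have already been identified and are passed in as a list of names. The naive sentence splitter, the tokenizer, and the function name `cooccurrences` are simplifications introduced here.

```python
import re
from itertools import combinations


def cooccurrences(text, entities):
    """Return (entity_a, entity_b, sentence) for entities sharing a sentence."""
    pairs = []
    # Naive sentence segmentation: split after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence.lower()))
        found = sorted(e for e in entities if e.lower() in tokens)
        for a, b in combinations(found, 2):
            pairs.append((a, b, sentence))
    return pairs


text = "Rab5 interacts with CDC2 and CDC3. Orc2 was not studied."
for a, b, _ in cooccurrences(text, ["Rab5", "CDC2", "CDC3", "Orc2"]):
    print(a, b)
```

Note that the pair CDC2 and CDC3 is reported even though the sentence does not say they interact with each other, which illustrates why co-occurrence has high recall but low precision.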

45
Relation Extraction
  • Relation Extraction Workflow
  • Sentence segmentation
  • Tokenization
  • Part of speech
  • Chunking
  • Lexical analysis
  • Entity identification
  • Natural Language Parsing
  • Candidate relation filtering

46
Relation Extraction
  • Natural Language Parsing
    • Chomsky hierarchy
      • Type 3: Regular
      • Type 2: Context-free
      • Type 1: Context-sensitive
      • Type 0: Unrestricted
    • Natural language grammars
      • Head-Driven Phrase Structure Grammar (HPSG)
      • Probabilistic context-free grammar (PCFG)

47
Relation Extraction
  • Parsing "Rab5 interacts with CDC2 and CDC3" with
    Enju, a wide-coverage probabilistic HPSG parser
  • interacts(rab5,cdc2), interacts(rab5,cdc3),
    interacts(cdc2,cdc3)

48
Relation Extraction
  • Problems
    • Natural text is very complex,
    • Dependent on entity identification,
    • Anaphora resolution,
    • Coverage of grammars is still poor.
  • Results: the state of the art is not good :-(
    • Recall: 30%
    • Precision: 38%
    • F-measure: 28%

49
Conclusion
  • Biomedical knowledge is being dumped as text
    without computer readable semantics.
  • Text-mining techniques are being developed to
    mitigate this problem.
  • Identifying entities and extracting their relations
    are the main goals.
  • The next iteration in entity identification will
    be usable by biologists. Relation Extraction does
    not yet work beyond co-occurrence.

50
Thank you for your attention
51
A Knowledge Explosion
Jensen et al.
52
Relation Extraction
53
Relation Extraction
54
A Knowledge Explosion
Possible in theory to have semantics, but in
practice only links.
55
Gene Mention Identification
Tamames et al., 2003
56
Entity Recognition and Information Extraction for
Biology
  • Entity Tagging
    • Finding the mentions of gene/protein names in text
  • Entity Identification
    • Link a gene/protein mention to a reference
      database (EntrezGene, UniProt, ...)
  • Relation Extraction
    • Identify interactions between genes and proteins.

57
Gene Mention Identification: State of the Art,
BioCreAtIvE II 2006
We have the best F-measure: 81%
We have the best official recall: 88%. However, we
can go up to 92.7% with Rlt40