Transcript and Presenter's Notes

Title: Text Mining for Biomedicine: Techniques


1
Text Mining for Biomedicine: Techniques & Tools
  • Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki,
    Yoshimasa Tsuruoka
  • School of Computer Science
  • National Centre for Text Mining
  • www.nactem.ac.uk
  • Sophia.Ananiadou_at_manchester.ac.uk

2
Outline
  • Challenges / objectives of TM in biomedicine
  • Terminology processing
  • Term extraction, term variation, named entity
    recognition
  • Resources for TM in biomedicine
  • Document classification
  • Information Extraction approaches
  • Levels of Text Mining Processing
  • Biomedical text mining services and systems at
    NaCTeM
  • TerMine, AcroMine, Smart dictionary look up,
    Phenetica
  • Medie, InfoPubMed, KLEIO

3
Material
  • Further background on TM for Biology
  • Ananiadou, S. & McNaught, J. (eds) (2006)
    Text Mining for Biology and Biomedicine. Boston,
    MA: Artech House
  • Numerous papers online from bibliography
  • See BLIMP: http://blimp.cs.queensu.ca/
  • Biomedical Literature (and text) mining
    publications

4
Text Mining in biomedicine
  • Why biomedicine?
  • Consider just MEDLINE: 16,000,000 references,
    40,000 added per month
  • Dynamic nature of the domain: new terms (genes,
    proteins, chemical compounds, drugs) constantly
    created
  • Impossible to manage such an information overload

5
From Text to Knowledge: tackling the data deluge
through text mining
Unstructured Text (implicit knowledge)
Information Retrieval
Information extraction
Knowledge Discovery
Semantic metadata
Structured content (explicit knowledge)
Advanced Information Retrieval
6
Information deluge
  • Bio-databases, controlled vocabularies and
    bio-ontologies encode only a small fraction of
    information
  • Linking text to databases and ontologies
  • Curators struggling to process scientific
    literature
  • Discovery of facts and events crucial for gaining
    insights in biosciences: need for text mining

8
The solution: The UK National Centre for Text
Mining, www.nactem.ac.uk
  • Location: Manchester Interdisciplinary Biocentre
    (MIB), www.mib.ac.uk
  • First publicly funded text mining centre in the
    world
  • Focus: biology, medicine, social sciences

9
We don't just press a button
  • TM involves
  • Many components (converters, analysers, miners,
    visualisers, ...)
  • Many resources (grammars, ontologies, lexicons,
    terminologies, thesauri, CVs)
  • Many combinations of components and resources for
    different applications
  • Many different user requirements and scenarios,
    training needs
  • The best solutions are customised

10
People behind NaCTeM
  • Text Mining Team: 14 members
  • Close collaboration with University of Tokyo,
    Tsujii Lab: http://www-tsujii.is.s.u-tokyo.ac.jp/

11
What NaCTeM is building
  • Resources: ontologies, lexicons, terminologies,
    thesauri, grammars, annotated corpora
  • BOOTStrep project: http://www.nactem.ac.uk/bootstrep.php
  • Tools: tokenisers, taggers, chunkers, parsers, NE
    recognisers, semantic analysers
  • NaCTeM is also providing services
  • Our related bio-text mining projects
  • REFINE: http://dbkgroup.org/refine/
  • Representing Evidence For Interacting Network
    Elements
  • ONDEX (data integration, workflows, text mining)

12
Individual tools for user data
  • Splitters, taggers, chunkers, parsers, NER, term
    extractors
  • Modes of use
  • Demonstrators for small-scale online use
  • Batch mode: upload data, get email with link to
    download site when job done
  • Web Services
  • Integration into Workflows (Taverna)
  • Some services are compositions of tools

13
Aims
  • Text mining: discover & extract unstructured
    knowledge hidden in text
  • Hearst (1999)
  • Text mining helps to construct hypotheses from
    associations derived from text
  • protein-protein interactions
  • associations of genes & phenotypes
  • functional relationships among genes

14
Impact of text mining
  • Extraction of named entities (genes, proteins,
    metabolites, etc)
  • Discovery of concepts allows semantic annotation
    of documents
  • Improves information access by going beyond index
    terms, enabling semantic querying
  • Construction of concept networks from text
  • Allows clustering, classification of documents
  • Visualisation of concept maps

15
Impact of TM
  • Extraction of relationships (events and facts)
    for knowledge discovery
  • Information extraction, more sophisticated
    annotation of texts (event annotation)
  • Beyond named entities: facts, events
  • Enables even more advanced semantic querying

16
Hypothesis generation from literature
  • Swanson experiments (1986) influenced conceptual
    biology
  • rapid mining of candidate hypotheses from the
    literature
  • migraine and magnesium deficiency (Swanson,
    1988)
  • indomethacin and Alzheimer's disease (Swanson
    and Smalheiser 1994),
  • Curcuma longa and retinal diseases, Crohn's
    disease and disorders related to the spinal cord
    (Srinivasan and Libbus 2004).
  • thalidomide for treating a series of diseases
    such as acute pancreatitis and chronic hepatitis C
    (Weeber et al. 2003).

17
Text mining steps
  • Information Retrieval yields all relevant texts
  • Gathers, selects, filters documents that may
    prove useful
  • Finds what is known
  • Information Extraction extracts facts & events of
    interest to user
  • Finds relevant concepts, facts about concepts
  • Finds only what we are looking for
  • Data Mining discovers unsuspected associations
  • Combines & links facts and events
  • Discovers new knowledge, finds new associations

18
From Text to Knowledge: NLP and Knowledge
Extraction
Lexicons and ontologies
19
Challenge: the resource bottleneck
  • Lack of large-scale, richly annotated corpora
  • Support training of ML algorithms
  • Development of computational grammars
  • Evaluation of text mining components
  • Lack of knowledge resources: lexica,
    terminologies, ontologies.

20
Annotation Information Extraction
Biomedical Knowledge
Biomedical Literature
  • Semantic annotation simulates the ideal
    performance of an IE system.
  • IE systems can be developed by referencing an
    annotated corpus.
  • The performance of IE systems can be evaluated by
    comparing their output to the annotated corpus.
  • (Kim & Tsujii, Text Mining Workshop, Manchester,
    2006)

21
Text Annotation
  • Task-neutral Annotation
  • GENIA Corpus
  • U-Tokyo, NaCTeM
  • Development of generic tools
  • Defined by theories
  • Linguistics
  • Tokens
  • POS
  • Phrase Structure
  • Dependency Structure
  • Deep Syntax (PAS)
  • Biology
  • Named Entities of various semantic types
  • Events
  • Linguistics Biology
  • Co-references
  • Task-oriented Annotation
  • Application annotated text
  • User system development
  • Defined by specific tasks
  • Specific curation tasks in specific environments
  • Mapping of Protein names to database IDs in
    specific text types
  • Specific event types such as Protein-Protein
    Interaction
  • Disease-Gene Association of specific diseases

22
Annotation of GENIA corpus: Term, POS
Term (entity) annotation 2000 400 abstracts
23
Text semantic annotation
  • annotation of events and involved named entities
  • Example: Regulation of Transcription events
  • BOOTStrep project: http://www.nactem.ac.uk/bootstrep.php
  • two different types of annotation levels
  • linguistic annotation levels
  • biological annotation level, in charge of marking
    the biological knowledge contained in the text
  • Linking text with biological knowledge

24
Events and variables
  • Biological events can be centred on
  • verbs, e.g. activate,
  • nouns with verb-like meanings (nominalised
    verbs), e.g. transcription
  • Different parts of sentence correspond to
    different types of variables in the event e.g.
  • What caused event
  • The narL gene product activates the nitrate
    reductase operon
  • What was affected by event
  • Analysis of mutants
  • Where event took place
  • These fusions were formed on plasmid cloning
    vectors

25
Verb Frame Example
  • The narL gene product activates the nitrate
    reductase operon

Theme Characteristics operon
Agent Characteristics protein
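
The verb frame above can be written down as a small data structure. This is only a sketch: the class and field names (`EventFrame`, `Argument`, `semantic_type`) are hypothetical and not taken from any NaCTeM tool; the roles and their expected types (Agent: protein, Theme: operon) come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Argument:
    role: str           # semantic role, e.g. "AGENT" or "THEME"
    text: str           # text span filling the role
    semantic_type: str  # expected type of the filler ("Characteristics" on the slide)


@dataclass
class EventFrame:
    trigger: str  # verb or nominalised verb the event is centred on
    arguments: list = field(default_factory=list)

    def by_role(self, role):
        """Return the text spans that fill the given role."""
        return [a.text for a in self.arguments if a.role == role]


frame = EventFrame(
    trigger="activates",
    arguments=[
        Argument("AGENT", "The narL gene product", "protein"),
        Argument("THEME", "the nitrate reductase operon", "operon"),
    ],
)
print(frame.by_role("AGENT"))
```

Keeping roles as plain strings makes it easy to add the further roles (MANNER, INSTRUMENT, LOCATION, SOURCE, DESTINATION) without changing the structure.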
26
Role Name | Description | Phrase Type(s) | Clues
AGENT | Drives or instigates event | Entity or event | Typically subject of verb; follows "by" in passives
  Example: The narL gene product activates the nitrate reductase operon
THEME | Affected by or results from event | Entity or event | Typically object of verb; subject in passives
  Example: recA protein was induced by UV radiation
MANNER | Method or way in which event is carried out | Event (process), adverb, direction, in vitro, in vivo etc. | by, through, via, using
  Example: cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR
27
Role Name | Description | Phrase Type(s) | Clues
INSTRUMENT | Used to carry out event | Entity | with, with the aid of, via, by, through, using
  Example: EnvZ functions through OmpR to control porin gene expression in Escherichia coli K-12
LOCATION | Location of event | Entity | in, on, near, etc.
  Example: Phosphorylation of OmpR by the osmosensor EnvZ modulates expression of the ompF and ompC genes in Escherichia coli
SOURCE | Start point of event | Entity | from
  Example: A transducing lambda phage carrying glpD''lacZ, glpR, and malT was isolated from a strain harbouring a glpD''lacZ fusion
DESTINATION | End point of event | Entity | to, into
  Example: Transcription of gntT is activated by binding of the cyclic AMP (cAMP)-cAMP receptor protein (CRP) complex to a CRP binding site
28
Example 1

activates
29
Linguistically Annotated Corpora
  • GENIA
  • Domain
  • MeSH terms: Human, Blood Cells, and Transcription
    Factors
  • Annotation: POS, named entity, parse tree
  • Penn BioIE
  • Domain
  • the molecular genetics of oncology
  • the inhibition of enzymes of the CYP450 class
  • Annotation: POS, named entity, parse tree
  • Yapex
  • GENETAG: a corpus of 20K MEDLINE sentences for
    gene/protein NER

30
The GENIA annotation
  • Linguistic annotation
  • Reveals linguistic structures behind the text
  • Part-of-speech annotation
  • annotates for the syntactic category of each
    word.
  • Syntactic Tree annotation
  • annotates for the syntactic structure of
    sentences.
  • Semantic annotation
  • Reveals knowledge pieces delivered by the text.
  • Term annotation
  • annotates domain-specific terms
  • Event annotation
  • annotates events on biological entities.

Ontology-driven annotation
31
Annotation Tool
  • WordFreak: http://wordfreak.sourceforge.net/
  • Java-based linguistic annotation tool developed
    at University of Pennsylvania
  • Extensible to new tasks and domains
  • Customised visualisation and annotation
    specification
  • Allows annotation process to be made as simple as
    possible

32
  • Resources

33
What about existing resources?
  • Ontologies important for knowledge discovery
  • They form the link between terms in texts and
    biological databases
  • Can be used to add meaning, semantic annotation
    of texts

34
Link between text and ontologies
[Diagram labels: text; Ontological resources; KEGG; UMLS; GO; GENIA; Adding new knowledge; Supporting semantics]
35
Bridging the Gap: Integrating data, text and
knowledge
[Diagram labels: Databases; Mathematical Models; text; Ontological resources; UMLS; GO; GENIA; KEGG; Adding new knowledge; Supporting semantics; Semantic Interpretation of data; Semantic Interpretation of models in Systems Biology]
36
Resources for Bio-Text Mining
  • Lexical / terminological resources
  • SPECIALIST lexicon, Metathesaurus (UMLS)
  • Lists of terms / lexical entries (hierarchical
    relations)
  • Ontological resources
  • Metathesaurus, Semantic Network, GO, SNOMED CT,
    etc
  • Encode relations among entities
  • Bodenreider, O. Lexical, Terminological, and
    Ontological Resources for Biological Text
    Mining, Chapter 3, Text Mining for Biology and
    Biomedicine, pp.43-66

37
SPECIALIST lexicon
  • UMLS SPECIALIST lexicon: http://SPECIALIST.nlm.nih.gov
  • Each lexical entry contains morphological (e.g.
    cauterize, cauterizes, cauterized, cauterizing),
    syntactic (e.g. complementation patterns for
    verbs, nouns, adjectives) and orthographic
    information (e.g. esophagus / oesophagus)
  • General language lexicon with many biomedical
    terms (over 180,000 records)
  • Lexical programs include variation (spelling),
    base form, inflection, acronyms

38
Lexicon record
  • base=Kaposi's sarcoma
  • spelling_variant=Kaposi sarcoma
  • entry=E0003576
  • cat=noun
  • variants=uncount
  • variants=reg
  • variants=glreg

  • Kaposi's sarcoma
  • Kaposi's sarcomas
  • Kaposi's sarcomata
  • Kaposi sarcoma
  • Kaposi sarcomas
  • Kaposi sarcomata

The SPECIALIST Lexicon and Lexical Tools, Allen
C. Browne, Guy Divita, and Chris Lu, PhD, 2002 NLM
Associates Presentation, 12/03/2002, Bethesda, MD
39
Normalisation (lexical tools)
  • Hodgkin Disease
  • HODGKIN DISEASE
  • Hodgkin's Disease
  • Hodgkin's disease
  • Disease, Hodgkin ...
  • disease hodgkin

normalise
40
Steps of Norm
  • Remove genitive
  • Hodgkins Diseases
  • Replace punctuation with spaces
  • Hodgkin Diseases
  • Remove stop words
  • Hodgkin Diseases
  • Lowercase
  • hodgkin diseases
  • Uninflect each word
  • hodgkin disease
  • Word order sort
  • disease hodgkin

Lexical tools of the UMLS: http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html
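
The six steps above can be chained into one function. The following is a rough sketch, not the UMLS `norm` program itself: the stop-word list and the plural-stripping rule are stand-in assumptions for the lexicon-driven resources the real lexical tools use, and the function names are hypothetical.

```python
import re

# Assumption: a small illustrative stop-word list.
STOP_WORDS = {"of", "and", "with", "for", "nos", "to", "in", "by", "on", "the"}


def uninflect(word):
    """Naive singularisation, e.g. 'diseases' -> 'disease' (assumption, not lexicon-based)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def norm(term):
    term = re.sub(r"'s\b", "", term)                               # 1. remove genitive
    term = re.sub(r"[^\w\s]", " ", term)                           # 2. punctuation -> spaces
    words = [w for w in term.split() if w.lower() not in STOP_WORDS]  # 3. remove stop words
    words = [w.lower() for w in words]                             # 4. lowercase
    words = [uninflect(w) for w in words]                          # 5. uninflect each word
    return " ".join(sorted(words))                                 # 6. word order sort


for variant in ["Hodgkin Disease", "HODGKIN DISEASE", "Hodgkin's Diseases, NOS",
                "Disease, Hodgkin"]:
    print(variant, "->", norm(variant))
```

Because every variant collapses to the same key, the normalised string can be used as an index into a terminology regardless of how the term was written in text.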
41
The Gene Ontology (GO)
  • Controlled vocabulary for the annotation of gene
    products
  • http://www.geneontology.org/
  • 19,468 terms, 95.3% with definitions
  • 10,391 biological_process
  • 1,681 cellular_component
  • 7,396 molecular_function

42
Gene Ontology
  • GOA database (http://www.ebi.ac.uk/GOA/) assigns
    gene products to the Gene Ontology
  • GO terms follow certain conventions of creation,
    have synonyms such as
  • ornithine cycle is an exact synonym of urea cycle
  • cell division is a broad synonym of cytokinesis
  • cytochrome bc1 complex is a related synonym of
    ubiquinol-cytochrome-c reductase activity

43
GO terms, definitions and ontologies in OBO
  • id: GO:0000002
  • name: mitochondrial genome maintenance
  • namespace: biological_process
  • def: "The maintenance of the structure and
    integrity of the mitochondrial genome." [GOC:ai]
  • is_a: GO:0007005 ! mitochondrion organization
    and biogenesis
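
A term record of this kind is a list of `tag: value` lines, so it can be read with very little code. The sketch below assumes that layout as used in OBO flat files; `parse_stanza` is a hypothetical helper name, and the parser ignores most of the format (escapes, trailing modifiers, multiple stanzas).

```python
STANZA = """\
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
"""


def parse_stanza(text):
    """Parse 'tag: value' lines into a dict of lists (a tag may repeat)."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(": ")
        if tag == "is_a":
            # Keep only the identifier; the text after '!' is a human-readable comment.
            value = value.split(" ! ")[0]
        record.setdefault(tag, []).append(value.strip())
    return record


term = parse_stanza(STANZA)
print(term["id"], term["name"], term["is_a"])
```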

44
Metathesaurus
  • organised by concept
  • 5M names, 1M concepts, 16M relations
  • built from 134 electronic versions of many
    different thesauri, classifications, code sets,
    and lists of controlled terms
  • "source vocabularies"
  • common representation

45
Are the existing knowledge resources sufficient
for TM?
  • No!
  • Why?
  • Limited lexical & terminological coverage of
    biological sub-domains
  • Resources focused on human specialists
  • GO, UMLS, UniProt: ontology concept names
    frequently confused with terms

46
Naming conventions
  • Update and curation of resources
  • FlyBase gene name coverage: 31% (abstracts) to 84%
    (full texts)
  • Naming conventions and representation in
    heterogeneous resources
  • Term formation guidelines from formal bodies e.g.
    HUGO, IPI not uniformly used
  • Problems with integration of resources
  • dystrophin used for 18 gene products
  • Dystrophin (muscular dystrophy, Duchenne and
    Becker types), included DXS143, DXS164, DXS206,
    HUGO

47
Term variation
  • Terminological variation and complexity of names
  • High correlation between degree of term variation
    and dynamic nature of biomedicine
  • Variation occurs in controlled vocabularies and
    texts but discrepancy between the two
  • Exact match methods fail to associate term
    occurrences in texts with databases

48
  • What's in a name?
  • Terms, named entities in biology

49
What's in a name?
  • Breast cancer 1 (BRCA1)
  • p53
  • Ribosomal protein S27
  • Heat shock protein 110
  • Mitogen activated protein kinase 15
  • Mitogen activated protein kinase kinase kinase 5

From K. Cohen, NAACL 2007
53
Worst gene names
  • sema domain, seven thrombospondin repeats (type 1
    and type 1-like), transmembrane domain (TM) and
    short cytoplasmic domain, (semaphorin) 5A
  • SEMA5A
  • Tyrosine kinase with immunoglobulin and epidermal
    growth factor homology domains
  • tie

K. Cohen NAACL 2007
54
Term ambiguity
  • NF2 can denote
  • Neurofibromatosis 2 (disease)
  • Neurofibromin 2 (protein)
  • Neurofibromatosis 2 gene (gene)

O. Bodenreider, MIE 2005 tutorial, http://www.nactem.ac.uk/
55
Term ambiguity
  • Gene terms may be also common English words
  • BAD: human gene encoding BCL-2 family of proteins
    (bad news, bad prediction)
  • Gene names are often used to denote gene products
    (proteins)
  • suppressor of sable is used ambiguously to refer
    to either genes or proteins
  • Existing resources lack information that can
    support term disambiguation
  • Difficult to establish equivalences between
    term forms and concepts

56
Homologues
  • Cyclin-dependent kinase inhibitor was first
    introduced to represent a protein family p27
  • But it is used interchangeably with p27 or
    p27kip1, as the name of the individual protein
    and not as the name of the protein family (Morgan
    2003).
  • NFKB2 denotes the name of a family of 2
    individual proteins with separate IDs in
    Swiss-Prot.
  • These proteins are homologues belonging to
    different species, Homo sapiens & chicken.

57
Terms
  • Term: linguistic realisation of specialised
    concepts, e.g. genes, proteins, diseases
  • Terminology: collection of terms structured
    (hierarchy) denoting relationships among
    concepts, part-whole, is-a, specific, generic,
    etc.
  • Terms link text and ontologies
  • Mapping is not trivial (main challenge)

58
Term variation and ambiguity
Term1 Term2 Term3 TEXT
Term variation
Term ambiguity
Concept1 concept2 concept3
ONTOLOGY
59
Term mining steps
Term recognition
Tp53
Term classification
Gene
Genome Database, IARC TP53 Mutation Database
Term mapping
60
Term recognition techniques
  • ATR extracts terms (variants) from a collection
    of documents
  • Distinguishes terms vs non-terms
  • In NER the steps of recognition and
    classification are merged; a classified
    terminological instance is a named entity
  • The tasks of ATR and NER share techniques but
    their ultimate goals are different
  • ATR: for resource building, lexica & ontologies
  • NER: first step of IE, text mining

61
Overview papers
  • S. Ananiadou & G. Nenadic (2006) Automatic
    Terminology Management in Biomedicine, Text
    Mining for Biology and Biomedicine, pp. 67-97.
  • M. Krauthammer & G. Nenadic (2004) Term
    identification in the biomedical literature, JBI
    37, 512-526
  • J.C. Park & J. Kim (2006) Named Entity
    Recognition, Text Mining for Biology and
    Biomedicine, pp. 121-142
  • Detailed bibliography in Bio-Text Mining
  • BLIMP: http://blimp.cs.queensu.ca/
  • http://www.ccs.neu.edu/home/futrelle/bionlp/
  • Book on BioText Mining
  • S. Ananiadou & J. McNaught (eds) (2006) Text
    Mining for Biology and Biomedicine, Artech
    House.
  • Other Bio-Text Mining tutorials
  • Kevin Cohen (NAACL 2007 tutorial), U. Colorado

62
Main ATR approaches
63
Dictionary NER (1)
  • Use terminological resources to locate term
    occurrences in text
  • NCBI: http://www.ncbi.nlm.nih.gov/
  • EBI: http://www.ebi.ac.uk/
  • neologisms, variations, ambiguity problematic for
    simple dictionary look-up
  • Ambiguous words, e.g. an, for, can
  • spelling variants, punctuation, word order
    variations
  • estrogen / oestrogen
  • NF kappa B / NF kB
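
One common way to make dictionary look-up tolerate such variants is to index the dictionary by a normalised key rather than by the surface string. The sketch below is an illustration under stated assumptions: the two-entry `DICTIONARY`, the Greek-letter table and the crude `oe` to `e` rule are invented for the example, and a real system would load terms from resources such as those at NCBI or EBI.

```python
import re

# Hypothetical toy dictionary: surface term -> semantic class.
DICTIONARY = {"NF kappa B": "protein", "estrogen": "hormone"}

# Assumption: a few spelled-out Greek letters mapped to their one-letter form.
GREEK = {"alpha": "a", "beta": "b", "kappa": "k"}


def key(term):
    """Collapse case, punctuation, spacing and a few spelling variants."""
    t = re.sub(r"[^a-z0-9]+", " ", term.lower())       # punctuation -> space
    t = " ".join(GREEK.get(w, w) for w in t.split())   # kappa -> k
    t = t.replace("oe", "e")                           # oestrogen -> estrogen (crude, over-general)
    return t.replace(" ", "")                          # ignore spacing differences


INDEX = {key(term): (term, label) for term, label in DICTIONARY.items()}


def lookup(mention):
    """Return (dictionary term, class) for a mention, or None if unknown."""
    return INDEX.get(key(mention))


print(lookup("NF kB"), lookup("Oestrogen"), lookup("p53"))
```

Aggressive keys raise recall at the cost of precision: the same trick that unifies NF kappa B and NF kB will also merge strings that should stay distinct, which is one reason pure dictionary look-up is rarely used alone.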

64
Dictionary NER (2)
  • Hirschman (2002) used FlyBase for gene name
    recognition; results disappointing due to
    homonymy, spelling variations
  • Precision: 7% (abstracts), 2% (full papers)
  • Recall: 31% (abstracts) to 84% (full papers)
  • Tuason (2004) reports term variation as main
    cause of mismatch
  • bmp-4 / bmp4
  • syt4 / syt iv
  • integrin alpha 4 / alpha4 integrin

65
Dictionary NER (3)
  • Tsuruoka & Tsujii (2003) suggest a probabilistic
    generator of spelling variants, edit distance
    operations (delete, substitute, insert)
  • Terms with ED 1 considered spelling variants
  • Used a dictionary of protein terms
  • Support query expansion
  • Augment dictionaries with variation
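
The edit-distance criterion can be checked directly. The sketch below is a plain, non-probabilistic test for "at most one insert, delete or substitute"; it is not the probabilistic variant generator described above, and the function names are hypothetical.

```python
def within_one_edit(a, b):
    """True if b can be obtained from a by at most one insert, delete or substitute."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # identical, or exactly one substitution
    return a[i:] == b[i + 1:]            # exactly one extra character in b


def find_variants(mention, dictionary):
    """Dictionary terms within one edit of the mention (case-insensitive)."""
    return [t for t in dictionary if within_one_edit(mention.lower(), t.lower())]


print(find_variants("bmp4", ["bmp-4", "syt iv", "tumour"]))
```

A fixed threshold treats all edits alike; weighting operations by how likely they are in real term variation (for example, hyphen deletion being more likely than substituting a digit) is what the probabilistic approach adds.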

66
Rule NER (2)
67
Rule based (1)
  • Use orthographic, morpho-syntactic features of
    terms
  • Rules that make use of internal term formation
    patterns (tagging, morphological analysers) e.g.
    affixes, combining forms
  • Do not take into account contextual features
  • Dictionaries of constituents e.g. affixes,
    neoclassical forms included
  • Portability to different domains?

68
Rule based (2)
  • Ananiadou, S. (1994) recognised single-word terms
    based on morphological analysis of term formation
    patterns (internal term make up)
  • based on analysis of neoclassical and hybrid
    elements
  • alphafetoprotein, immunoosmoelectrophoresis
  • radioimmunoassay
  • some elements are used for creating terms
  • term → word + term_suffix
  • term → term + word_suffix
  • neoclassical combining forms (electro- adeno-),
  • prefixes (auto-, hypo-)
  • suffixes ( -osis, -itis)
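
A toy version of such a morphological rule can be written as a membership test over affix lists. This is only a sketch of the general idea, not the 1994 system: the lists below are tiny illustrative assumptions (the entries `immuno` and `radio` are inferred from the example words above), and the function name is hypothetical.

```python
# Assumption: short illustrative lists; a real system would use large curated ones.
COMBINING_FORMS = ("electro", "adeno", "immuno", "radio")
PREFIXES = ("auto", "hypo")
SUFFIXES = ("osis", "itis")


def looks_like_term(word):
    """Flag a single word as a term candidate if it carries a listed affix or combining form."""
    w = word.lower()
    return (
        w.startswith(PREFIXES)
        or w.endswith(SUFFIXES)
        or any(form in w for form in COMBINING_FORMS)
    )


for w in ["radioimmunoassay", "immunoosmoelectrophoresis", "hepatitis", "table"]:
    print(w, looks_like_term(w))
```

Such rules look only inside the word, which is why they over-generate on everyday words that happen to contain the same elements and why context is ignored.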

69
Rule-based (3)
  • Fukuda (1998) used lexical, orthographic features
    for protein name recognition e.g. upper case
    character, numerals etc.
  • PROPER: core and feature elements
  • Core: meaning-bearing elements
  • Feature: function elements
  • SAP kinase (SAP: core, kinase: feature)

Core elements extended to feature elements based on
concatenation rules (based on POS tags)
70
Rule-based (4)
  • Gaizauskas (2000) CFG for protein name
    recognition (PASTA, EMPATHIE)
  • Based on morphological and lexical
    characteristics of terms
  • biochemical suffixes (-ase: enzyme name)
  • dictionary look-up (protein names, chemical
    compounds, etc)
  • deduction of term grammar rules from Protein Data
    Bank

Protein -> protein_modifier, protein_head, numeral
71
Rule-based (5)
  • Inspired by PROPER, Yapex uses Swiss-Prot to add
    core term elements
  • http://www.sics.se/humle/projects/prothalt/yapex.cgi
  • Hou (2003) used Yapex with context information
    (collocations) appearing with protein names
  • Rule based approaches construct rules and patterns
    manually or automatically
  • Difficult to tune to different domains

72
Machine learning systems
  • Learn features from training data for term
    recognition and classification
  • Most ML systems combine recognition and
    classification
  • Challenges
  • Feature selection and optimisation
  • Availability of training data
  • detection of term boundaries

73
Overview of ML-based NER
  • Training phase
  • Testing phase
  • Detecting features
  • Learning model

Manually tagged texts
Learned Model
Tag annotator with model
Tagged texts
Raw texts
74
ML (1)
  • Nobata et al. (1999) used Decision Tree for NER
  • Decision tree: one of the methods to classify a
    case using training data
  • Node: specifies some condition with a subtree
  • Leaf: indicates a class
  • Features
  • Part-of-speech information
  • Orthographic information
  • Term lists

75
Example of a decision tree
  • Each node has one condition
  • Is the current word in the Protein term list? (Yes / No)
  • Does the previous word have figures? (Yes / No)
  • What is the next word's POS? (Noun / Verb)
  • Each leaf has one class
  • PROTEIN, DNA, RNA, Unknown

76
ML (2)
  • Collier (2000) used HMM, orthographic features
    for term recognition
  • HMM looks for most likely sequence of classes
    corresponding to a word sequence e.g.
    interleukin-2: protein/DNA
  • To find similarities between known words
    (training set) and unknown words, use character
    features
  • Feature: Examples
  • DigitNumber: 2 (protein), 3 (DNA)
  • GreekLetter: alpha (protein)
  • TwoCaps: RelB (protein), TAR (RNA)
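
Character features of this kind are simple functions of the token string. The sketch below assigns one coarse class per token; the names `DigitNumber`, `GreekLetter` and `TwoCaps` come from the slide, while `InitCap`, `LowCaps`, `Other`, the Greek-letter list and the rule order are assumptions added for illustration.

```python
import re

# Assumption: a short list of spelled-out Greek letters.
GREEK_LETTERS = {"alpha", "beta", "gamma", "delta", "kappa"}


def char_feature(token):
    """Map a token to one orthographic class; the first matching rule wins."""
    if token.isdigit():
        return "DigitNumber"
    if token.lower() in GREEK_LETTERS:
        return "GreekLetter"
    if len(re.findall(r"[A-Z]", token)) >= 2:
        return "TwoCaps"
    if token[:1].isupper():
        return "InitCap"
    if token.islower():
        return "LowCaps"
    return "Other"


for tok in ["2", "alpha", "RelB", "TAR", "protein"]:
    print(tok, char_feature(tok))
```

An unseen word such as a new gene symbol then shares a feature with known symbols of the same shape, which lets the model generalise beyond the training vocabulary.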

77
ML (2)
  • Use of GENIA resources as training data
  • Results depend on training data
  • Morgan (2004) used FlyBase to automatically
    construct a training corpus
  • Pattern matching for gene name recognition, noisy
    corpus annotated
  • HMM was trained on that corpus for gene name
    recognition

78
Support Vector Machines (1)
  • Kazama trained multi-class SVMs on the GENIA corpus
  • Corpus annotated with B-I-O tags
  • B tags denote words at beginning of term
  • I tags: inside term
  • O tags: outside term
  • B-protein tag: word at the beginning of a
    protein name
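
The B-I-O scheme turns term recognition into per-token classification. A minimal sketch of the encoding step follows; the function name, the span format (token offsets, end exclusive) and the choice of which spans to annotate in the example sentence are assumptions made for illustration.

```python
def bio_encode(tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, into B-I-O tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label           # first token of the term
        for i in range(start + 1, end):
            tags[i] = "I-" + label           # remaining tokens inside the term
    return tags


tokens = ["The", "narL", "gene", "product", "activates", "the",
          "nitrate", "reductase", "operon"]
# Illustrative annotation only: one protein mention and one DNA mention.
spans = [(1, 4, "protein"), (6, 9, "DNA")]

for tok, tag in zip(tokens, bio_encode(tokens, spans)):
    print(tok, tag)
```

With one B and one I tag per entity class plus O, the classifier faces a multi-class problem, which is why multi-class SVMs are needed.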

79
SVMs for NER (2)
  • Yamamoto used a combination of features for
    protein name recognition
  • Morphological, lexical, boundary, syntactic (head
    noun), domain specific (if term exists in
    biomedical database).
  • Lee uses different features for recognition and
    classification.
  • orthographic, prefix, suffix
  • Contextual information

80
Hybrid approaches
  • Combine rules, statistics, resources

81
Hybrid (1)
  • ABGene: protein and gene name tagger
  • Combines ML, transformation rules, dictionaries
    with statistics
  • Protein tagger trained on MEDLINE abstracts by
    adapting Brill's tagger
  • Transformation rules for recognition of gene,
    protein names
  • Used GO, LocusLink list of genes, proteins for
    false negative tags

82
Hybrid (2)
  • ARBITER (Access and Retrieve Binding Terms) uses
  • UMLS Metathesaurus and GenBank to map NPs
    (binding terms)
  • morphological features
  • lexical information (head noun)
  • EDGAR recognises gene, cell, drug names using
    co-occurrences of cell, clone, expression

83
Hybrid (3)
  • C/NC value (Frantzi & Ananiadou, 1999)
  • C-value
  • Linguistic filters
  • total frequency of occurrence of string in corpus
  • frequency of string as part of longer candidate
    terms (nested terms)
  • number of these longer candidate terms
  • length of string
  • Output: automatically ranked terms (TerMine)

84
C-value
  • C-value measure extracts multi-word, nested
    terms
  • adenoid cystic basal cell carcinoma
  • cystic basal cell carcinoma
  • ulcerated basal cell carcinoma
  • recurrent basal cell carcinoma
  • basal cell carcinoma
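
The four statistics listed for C-value combine as follows: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, and the result is scaled by the log of its length in words. The sketch below implements that formula over a frequency table; the counts are hypothetical illustration values, the function names are invented, and the linguistic filter that selects candidates is assumed to have run already.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq):
    """freq: candidate term -> total frequency of the string in the corpus."""
    words = {t: tuple(t.split()) for t in freq}
    scores = {}
    for a in freq:
        # Longer candidate terms in which `a` appears nested.
        nesting = [b for b in freq if contains(words[b], words[a])]
        f = freq[a]
        if nesting:
            f -= sum(freq[b] for b in nesting) / len(nesting)
        scores[a] = math.log2(len(words[a])) * f
    return scores


# Hypothetical counts for the nested terms shown above.
freq = {
    "adenoid cystic basal cell carcinoma": 5,
    "cystic basal cell carcinoma": 11,
    "ulcerated basal cell carcinoma": 7,
    "recurrent basal cell carcinoma": 5,
    "basal cell carcinoma": 984,
}
for term, score in sorted(c_values(freq).items(), key=lambda kv: -kv[1]):
    print(round(score, 2), term)
```

Note that with a plain log2 of length, single-word candidates score zero, so this form of the measure is only meaningful for multi-word terms.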

85
Term variation
  • variation recognition as part of ATR (Nenadic,
    Ananiadou)
  • recognise term forms and link them into
    equivalence classes
  • important if ATR is based on statistics (e.g.
    frequency of occurrence)
  • corpus-based measures are distributed across
    different variants
  • conflation of various surface representations of
    a given term should improve ATR

86
Simple variation
  • orthographic
  • hyphens, slashes (amino acid and amino-acid)
  • lower/upper cases (NF-KB and NF-kb)
  • spelling variations (tumour and tumor)
  • transliterations (oestrogen and estrogen)
  • morphological
  • inflectional phenomena (plural, possessives)
  • lexical
  • genuine synonyms (carcinoma and cancer)

87
Complex variation
  • Structural
  • Possessive usage of nouns using prepositions
    (clones of human and human clones)
  • Prepositional variants (cell in blood, cell from
    blood)
  • Term coordinations (adrenal glands and gonads)

88
Coordinated term variants
  • Structure is ambiguous
  • Head coordination or term conjunction?
  • Head or argument coordination?
  • (NA) CC (NA) N
  • cell differentiation and proliferation
  • chicken and mouse receptors

89
TerMine: a term management system
Demo
90
http://www.nactem.ac.uk/software/termine/
91
Marrying IR and terminology
  • IR engine plus TerMine
  • Discover associated terms ranked according to
    relevance
  • Allow user to link term with IR for document
    discovery
  • NB compound terms
  • NB technical terms, not classic index terms
  • NB terms familiar to user, found in documents

92

http://www.nactem.ac.uk/software/ctermine/
93
Biomedical IE/IR Systems
  • iHOP
  • http://www.ihop-net.org/UniPub/iHOP/
  • EBIMed
  • http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
  • GoPubMed
  • http://www.gopubmed.org/
  • PubFinder
  • http://www.glycosciences.de/tools/PubFinder
  • Textpresso
  • http://www.textpresso.org/

94
Acronyms
  • Very productive type of term variation
  • Acronym variation (synonymy)
  • NF kappa B/ NF kB / nuclear factor kappa B
  • Acronym ambiguity (polysemy) even in controlled
    vocabularies
  • GR: glucocorticoid receptor
  • GR: glutathione reductase

95
Acronym recognition
  • Schwartz, A. & Hearst, M. (2003) A simple algorithm
    for identifying abbreviation definitions in
    biomedical text, PSB 2003, 8, 451-462
  • Adar, E. (2004) SaRAD: a simple and robust
    abbreviation dictionary, Bioinformatics, 20(4)
    527-533
  • Chang, J.T. & Schütze, H. (2006) Abbreviations in
    biomedical text, Text Mining for Biology and
    Biomedicine, pp. 99-119, Artech
  • Tsuruoka, Y., Ananiadou, S. & Tsujii, J. (2005) A
    machine learning approach to automatic acronym
    generation, ISMB, BioLink SIG, 25-31
  • Okazaki, N. & S. Ananiadou (2006) Acronym
    recognition based on term identification,
    Bioinformatics

96
The importance of acronym recognition
  • Acronyms are among the most productive type of
    term variation
  • 64,242 new acronyms were introduced in 2004
    (Chang and Schütze 06)
  • Acronyms are used more frequently than full terms
  • 5,477 documents could be retrieved by using the
    acronym JNK while only 3,773 documents could be
    retrieved by using its full term, c-jun
    N-terminal kinase (Wren et al. 05)
  • No rules or exact patterns for the creation of
    acronyms from their full form

97
Recognition
  • Extracting pairs of short and long forms
  • <acronym, long form>
  • Distinguishing acronyms from parenthetical
    expressions
  • Search for parentheses in text: single or more
    words, e.g. Ab (antibody)
  • Limit context around ( ): limit number of words
    according to number of letters in acronym
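The two steps above (find parentheses, then limit the context by acronym length) can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: the function name `candidate_pairs` is hypothetical, and the window size min(|A| + 5, 2|A|) words is the limit used by Schwartz and Hearst (2003), assumed here as one reasonable choice. The reversed case, where the long form is inside the parentheses as in Ab (antibody), is not handled.

```python
import re

# Innermost parenthetical expressions, e.g. "(ER)" in "Early repolarization (ER)".
PAREN = re.compile(r"\(([^()]+)\)")

def candidate_pairs(sentence):
    """Return (short_form, context_words) for each parenthetical expression.

    context_words are the words before the parenthesis, limited to
    min(n + 5, 2 * n) words, where n is the number of letters/digits
    in the short form (window assumed from Schwartz and Hearst 2003).
    """
    pairs = []
    for m in PAREN.finditer(sentence):
        short = m.group(1).strip()
        n_letters = sum(ch.isalnum() for ch in short)
        max_words = min(n_letters + 5, n_letters * 2)
        before = sentence[:m.start()].split()
        # Guard against max_words == 0, because before[-0:] is the whole list.
        context = before[-max_words:] if max_words else []
        pairs.append((short, context))
    return pairs

print(candidate_pairs("Early repolarization (ER) is an enigma."))
```

The context words are only candidates; a later step still has to decide which of them, if any, form the long form.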

98
Recognition (heuristics)
  • Heuristics: match letters of acronym with letters
    of long form using rules, patterns
  • letters from beginning of words
  • combining forms
  • carboxifluorescein diacetate (CFDA)
  • Acronym normalisation to allow orthographic,
    structural and lexical variations
  • morphological information, positional info
  • Penalise words in long form that do not match
    acronym
  • Accidental matching
  • argininosuccinate synthetase (AS)

99
Letter matching
  • Alignment: find all matches between letters of
    acronyms and their long forms and calculate
    likelihood (Chang & Schütze)
  • Solves problem of acronyms containing letters not
    occurring in LF
  • Choose best alignment based on features, e.g.
    position of letter etc.
  • Challenge: finding the optimal weight for each
    feature

http://abbreviation.stanford.edu/
100
Acronym Recognition
Okazaki, N., Ananiadou, S. (2006) Building an
abbreviation dictionary using a term recognition
approach. Bioinformatics.
101
A simple algorithm: Schwartz and Hearst (2003)
  • Uses parenthetical expressions as a marker of a
    short form
  • long-form (short-form)
  • All letters and digits in a short form must
    appear in the corresponding long form in the same
    order
  • We used hidden markov model (HMM) to
  • Early repolarization (ER) is an enigma.
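The letter-order constraint can be sketched as a right-to-left scan, in the spirit of the Schwartz and Hearst (2003) algorithm. This is a simplified illustration under stated assumptions: the function name `find_long_form` is hypothetical, the candidate text is assumed to be the words immediately preceding the parenthesis, and the short form is assumed to have passed validation (first character alphanumeric).

```python
def find_long_form(short_form, candidate):
    """Return the shortest suffix of `candidate` that contains all letters and
    digits of `short_form` in order, or None if there is no such suffix.

    The first character of the short form must match at the start of a word.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():          # skip hyphens, slashes, etc. in the short form
            s -= 1
            continue
        # Move left through the candidate until ch is found; for the first
        # short-form character, also require a word boundary before the match.
        while l >= 0 and (
            candidate[l].lower() != ch
            or (s == 0 and l > 0 and candidate[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    return candidate[l + 1:]

print(find_long_form("ER", "Early repolarization"))
print(find_long_form("JNK", "protein kinase"))
```

Because the check is purely orthographic, it also accepts a candidate such as "acquired syndrome" for AIDS, since a, i, d and s all occur in order.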

102
Problems of letter-matching approach
  • Highly dependent on the expressions in the target
    text
  • o acquired immuno deficiency syndrome (AIDS)
  • x acquired syndrome (AIDS)
  • x a patient with human immunodeficiency syndrome
    (AIDS)
  • ? magnetic resonance imaging unit (MRI)
  • ! beta 2 adrenergic receptor (ADRB2)
  • ! gamma interferon (IFN-GAMMA)
  • (These examples are obtained from actual MEDLINE
    abstracts)
  • Naive with respect to term variations

103
AcroMine's approach
  • Extract a word or word sequence
  • Co-occurring frequently with an acronym (e.g.,
    TTF-1)
  • 1, factor 1, transcription factor 1, thyroid
    transcription factor 1
  • Does not co-occur with other surrounding words
  • thyroid transcription factor 1
  • Not necessarily based on letter-matching
  • Note that this is a difficult case for the
    letter-matching algorithm
  • Prune unlikely candidates
  • Nested candidates: transcription factor 1
  • Expansions: expression of thyroid transcription
    factor 1
  • Insertions: thyroid specific transcription factor
    1

104
Short-form mining
  • Enumerate all short forms in a target text
  • Using parentheses as a clue: (short-form)
  • Validation rules for identifying acronyms
    [Schwartz and Hearst 03]
  • It consists of at most two words
  • Its length is between two and ten characters
  • It contains at least one alphabetic letter
  • The first character is alphanumeric

Contextual sentence for HMM and ASR:
The present system consists of a hidden Markov
model (HMM) based automatic speech recognizer
(ASR), with a keyword spotting system to capture
the machine sensitive words (registered in a
dictionary) from the running utterances.
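The four validation rules can be sketched as a single predicate applied to every parenthetical expression. The function names `is_valid_short_form` and `mine_short_forms` are hypothetical, and counting length over the stripped string (including hyphens and digits) is an assumption about how the rule is applied.

```python
import re

def is_valid_short_form(s):
    """Apply the four validation rules listed above to a parenthetical string."""
    s = s.strip()
    return (
        0 < len(s.split()) <= 2              # at most two words
        and 2 <= len(s) <= 10                # two to ten characters
        and any(ch.isalpha() for ch in s)    # at least one alphabetic letter
        and s[0].isalnum()                   # first character is alphanumeric
    )

def mine_short_forms(text):
    """Enumerate parenthetical expressions that pass the validation rules."""
    return [
        m.group(1).strip()
        for m in re.finditer(r"\(([^()]+)\)", text)
        if is_valid_short_form(m.group(1))
    ]

sentence = (
    "The present system consists of a hidden Markov model (HMM) based "
    "automatic speech recognizer (ASR), with a keyword spotting system to "
    "capture the machine sensitive words (registered in a dictionary) from "
    "the running utterances."
)
print(mine_short_forms(sentence))
```

On the sentence above, the parenthetical "registered in a dictionary" is rejected by the two-word and ten-character rules, leaving HMM and ASR.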
105
Enumerating long-form candidates for an acronym
  • Tokenize a contextual sentence by
    non-alphanumeric characters (e.g., space, hyphen,
    etc.)
  • Apply Porter's stemming algorithm [Porter 80]
  • Extract terms that match the following pattern:
  • WORD.

Empty string or words of any length
Example: We studied the expression of thyroid
transcription factor-1 (TTF-1).
Long-form candidates (after stemming):
  • 1
  • factor 1
  • transcript factor 1
  • thyroid transcript factor 1
  • of thyroid transcript factor 1
  • expression of thyroid transcript factor 1
  • studi the expression of thyroid transcript factor 1
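Candidate enumeration can be sketched as follows: tokenize the text before the parenthesis on non-alphanumeric characters, optionally stem each token, and emit every word sequence that ends right before the short form. The function name `long_form_candidates` is hypothetical. Stemming is left as a pluggable `stem` argument (identity by default) so that the sketch has no external dependencies; a Porter stemmer could be passed in instead.

```python
import re

def long_form_candidates(sentence, short_form, stem=lambda w: w):
    """List word sequences ending immediately before '(short_form)',
    from shortest to longest. Tokens are lowercased and passed through
    `stem` (identity by default; a Porter stemmer is an assumed option).
    """
    idx = sentence.find("(" + short_form + ")")
    if idx < 0:
        return []
    tokens = [
        stem(t.lower())
        for t in re.split(r"[^0-9A-Za-z]+", sentence[:idx])
        if t
    ]
    return [" ".join(tokens[i:]) for i in range(len(tokens) - 1, -1, -1)]

text = "We studied the expression of thyroid transcription factor-1 (TTF-1)."
for cand in long_form_candidates(text, "TTF-1"):
    print(cand)
```

Without a stemmer the candidates keep their surface forms ("transcription" rather than "transcript"); counting these candidates over a large corpus gives the co-occurrence frequencies used for scoring.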
106
Expansions for TTF-1
107
Top 20 acronyms in MEDLINE
108
Long-form candidates for acronym ADM
Candidate                         | Length | Frequency | Score | Validity
adriamycin                        | 1      | 727       | 721.4 | o
adrenomedullin                    | 1      | 247       | 241.7 | o
abductor digiti minimi            | 3      | 78        | 74.9  | o
doxorubicin                       | 1      | 56        | 54.6  | x
effect of adriamycin              | 3      | 25        | 23.6  | Expansion
adrenodemedullated                | 1      | 19        | 17.7  | o
acellular dermal matrix           | 3      | 17        | 15.9  | o
peptide adrenomedullin            | 2      | 17        | 15.1  | Expansion
effects of adrenomedullin         | 3      | 15        | 13.2  | Expansion
resistance to adriamycin          | 3      | 15        | 13.2  | Expansion
amyopathic dermatomyositis        | 2      | 14        | 12.8  | o
brevis and abductor digiti minimi | 5      | 11        | 9.8   | Expansion
minimi                            | 1      | 83        | 5.8   | Nested
digiti minimi                     | 2      | 80        | 3.9   | Nested
automated digital microscopy      | 3      | 1         | 0.0   | match
adrenomedullin concentration      | 2      | 1         | 0.0   | Nested
109
Long-form extraction
  • Long-form candidates are sorted by score in
    descending order
  • A long-form candidate is considered valid if
  • It has a score greater than 2.0
  • The words in the long form can be rearranged so
    that all alphanumeric letters appear in the same
    order as the short form
  • It is neither nested in nor an expansion of a
    previously chosen long form
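The three validity conditions can be sketched over the ADM candidates. This is an illustration under assumptions: the scores are copied from the table rather than computed (the scoring function is not given in these slides), the helper names are hypothetical, "rearranged" is read as trying every ordering of the words, and nesting and expansion are tested as contiguous word-sequence containment.

```python
from itertools import permutations

# (candidate, score) pairs copied from the ADM table; scores are not computed here.
CANDIDATES = [
    ("adriamycin", 721.4), ("adrenomedullin", 241.7),
    ("abductor digiti minimi", 74.9), ("doxorubicin", 54.6),
    ("effect of adriamycin", 23.6), ("adrenodemedullated", 17.7),
    ("acellular dermal matrix", 15.9), ("peptide adrenomedullin", 15.1),
    ("effects of adrenomedullin", 13.2), ("resistance to adriamycin", 13.2),
    ("amyopathic dermatomyositis", 12.8),
    ("brevis and abductor digiti minimi", 9.8), ("minimi", 5.8),
    ("digiti minimi", 3.9), ("automated digital microscopy", 0.0),
    ("adrenomedullin concentration", 0.0),
]

def is_subsequence(short, text):
    """True if the characters of `short` occur in `text` in the same order."""
    it = iter(text)
    return all(ch in it for ch in short)

def letters_match(short_form, long_form):
    """True if some ordering of the long-form words contains the short-form
    letters and digits in order."""
    short = [c for c in short_form.lower() if c.isalnum()]
    words = long_form.lower().split()
    return any(is_subsequence(short, "".join(p)) for p in permutations(words))

def contains(words, sub):
    """True if the word list `sub` occurs contiguously inside `words`."""
    return any(words[i:i + len(sub)] == sub
               for i in range(len(words) - len(sub) + 1))

def extract_long_forms(short_form, candidates, threshold=2.0):
    chosen = []
    for cand, score in sorted(candidates, key=lambda x: x[1], reverse=True):
        if score <= threshold or not letters_match(short_form, cand):
            continue
        words = cand.split()
        # Reject expansions of, and candidates nested in, earlier choices.
        if any(contains(words, c.split()) or contains(c.split(), words)
               for c in chosen):
            continue
        chosen.append(cand)
    return chosen

print(extract_long_forms("ADM", CANDIDATES))
```

With these inputs the sketch keeps exactly the six rows marked "o" in the table; doxorubicin is rejected because its letters cannot be matched to ADM.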

110
http//www.nactem.ac.uk/software/acromine/
111
Acronym disambiguation
  • Local acronyms
  • Accompany their expanded forms in documents
  • Global acronyms
  • Appear in documents without the expanded forms
    stated
  • Need to have their correct expanded forms
    identified
  • Immunomodulatory effects of CT were investigated
    in a rat model, and the effects of CT on rat
    renal allograft (from Lewis rat to WKAH rat) were
    also examined.
  • Immunomodulatory effects of cholera toxin (CT)
    were investigated in a rat model, and the effects
    of cholera toxin (CT) on rat renal allograft
    (from Lewis rat to Wistar-King-Aptekman-Hokudai
    (WKAH) rat) were also examined.

112
Acronym disambiguation
Sample text: Considerations in the identification
of functional RNA structural elements in genomic
alignments (Tomas Babak et al.)
http://www.biomedcentral.com/1471-2105/8/33
113
  • Term structuring

114
Term structuring
  • term clustering (linking semantically similar
    terms) and term classification (assigning terms
    to classes from a pre-defined classification
    scheme)
  • Hypothesis: similar terms tend to appear in
    similar contexts (patterns)
  • combining various sources of similarity
  • lexical
  • syntactic
  • contextual
  • Ontological (using external resources)

115
Term structuring
  • Based on term similarities
  • choice of features
  • domain specific → ontology
  • linguistic → text
  • ontology-based similarity
  • textual similarity
  • internal features
  • contextual features

116
Using ontologies
  • two terms should match if they are
  • identified as variants
  • siblings in the is-a hierarchy
  • in the is-a or part-whole relation
  • the distance between the corresponding nodes in
    the ontology should be transformed into the
    matching score
  • → I. Spasic presentation, MIE Tutorial,
    http://www.nactem.ac.uk/
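One way to turn node distance into a matching score can be sketched as follows. Everything specific here is an assumption made for illustration: the toy is-a hierarchy, the function names, and the transformation 1 / (1 + distance). The slides do not state which transformation is used.

```python
from collections import deque

# Hypothetical toy is-a hierarchy (child -> parent), for illustration only.
IS_A = {
    "glucocorticoid receptor": "nuclear receptor",
    "progesterone receptor": "nuclear receptor",
    "nuclear receptor": "receptor",
    "receptor": "protein",
}

def node_distance(a, b, parent=IS_A):
    """Shortest path length between two nodes, treating is-a links as
    undirected edges; None if the nodes are not connected."""
    graph = {}
    for child, par in parent.items():
        graph.setdefault(child, set()).add(par)
        graph.setdefault(par, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def matching_score(a, b):
    """Map distance to a score in (0, 1]; 1 / (1 + d) is an assumed choice."""
    d = node_distance(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)

print(matching_score("glucocorticoid receptor", "progesterone receptor"))
```

Siblings are two edges apart (via their shared parent), so they score lower than a parent and child pair but higher than terms that meet only near the root.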

117
Using text
  • number of neologisms: terms are not in the
    ontologies
  • Use of text based techniques to calculate
    similarities
  • edit distance (ED): the minimal number (or cost)
    of changes needed to transform one string into
    the other
  • edit operations
  • insertion: ...a-c... → ...abc...
  • deletion: ...abc... → ...a-c...
  • replacement: ...abc... → ...adc...
  • transposition: ...abc... → ...acb...
  • use of dynamic programming
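Edit distance with the four operations above can be sketched with a standard dynamic-programming table. Unit cost for every operation and the restriction of transposition to adjacent characters are assumptions; the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Minimal number of insertions, deletions, replacements and adjacent
    transpositions needed to turn string a into string b (unit costs)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

for pair in [("ac", "abc"), ("abc", "ac"), ("abc", "adc"), ("abc", "acb")]:
    print(pair, edit_distance(*pair))
```

Each of the four example pairs is one operation apart; operation costs could be weighted differently for term variation, for example making hyphen or space edits cheaper.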

118
Term similarities
  • lexical similarity based on sharing term head
    and/or modifier(s) --hyponymy
  • nuclear receptor
  • orphan nuclear receptor
  • Sharing heads
  • progesterone receptor, oestrogen receptor
  • Specific types of associations
  • mainly general is_a and part_of
  • some domain-specific, e.g. binding: CREP binding
    protein

119
Contextual similarities
  • Features from context
  • syntactic category
  • terminological status
  • position relative to the term
  • syntactic relation between a context element and
    the term
  • semantic properties
  • semantic relation between a context element and
    the term

120
Lexical syntactic patterns
  • a lexico-syntactic pattern
  • ... Term (, Term) , and other Term ...
  • the leading Terms are hyponyms of the head Term
  • ... antiandrogens, hydroxyflutamide,
    bicalutamide,
  • cyproterone acetate, RU58841, and other
    compounds ...
  • candidate instances of the hyponymy relation
  • hyponym( antiandrogens, compound )
  • hyponym( hydroxyflutamide, compound )
  • hyponym( bicalutamide, compound )
  • hyponym( cyproterone acetate, compound )
  • hyponym( RU58841, compound )
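The "and other" pattern can be sketched with a regular expression over a phrase in which the terms are separated by commas. This is a simplification: a real system would match over recognised terms rather than raw strings, the function name `hyponym_candidates` is hypothetical, and stripping a final "s" is only a crude stand-in for singularising the head term.

```python
import re

# Everything before an optional comma plus "and other" is the list of leading
# terms; everything after it is the head term.
AND_OTHER = re.compile(r"^(.*?),?\s+and other\s+(.+)$")

def hyponym_candidates(phrase):
    """Return (hyponym, hypernym) pairs for 'T1, T2, ..., and other T'."""
    m = AND_OTHER.search(phrase.strip())
    if not m:
        return []
    head = m.group(2).strip()
    if head.endswith("s"):
        head = head[:-1]                 # crude singularisation (assumption)
    terms = [t.strip() for t in m.group(1).split(",") if t.strip()]
    return [(t, head) for t in terms]

phrase = ("antiandrogens, hydroxyflutamide, bicalutamide, "
          "cyproterone acetate, RU58841, and other compounds")
for hypo, hyper in hyponym_candidates(phrase):
    print(f"hyponym( {hypo}, {hyper} )")
```

The output pairs are only candidate instances of the hyponymy relation; they still need validation, since the pattern also fires on loosely related lists.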

121
Contextual information
  • automatic pattern mining for most important
    context patterns
  • find most important contexts in which a term
    appears
  • receptor is bound to these DNA sequences
  • proteins bound to the DNA
  • estrogen receptor bound to DNA
  • steroid receptor coactivator-1 when bound to
    DNA
  • progesterone receptor complexes bound to DNA
  • RXRs bound to respective DNA elements in vitro
  • glucocorticoid receptor to bind DNA
  • pattern: <TERM> V:bind <TERM:DNA>

122
Stumbling blocks
  • Lexical similarities affected by many neologisms
    and ad hoc names
  • only 5% of the most frequent terms in GENIA
    belonging to the same biomedical class have some
    lexical links
  • how much context to use? (sentence, phrase,
    abstract, ...)
  • Attempts at using co-occurrence: many report up
    to 40% of co-occurrence based relationships
    biologically meaningless

123
Term similarities
  • SOLD: Syntactic, Ontology-driven & Lexical
    Distance (Spasic, I. & Ananiadou, S. 2005,
    Bioinformatics)
  • hybrid approach to comparing term contexts, which
    relies on
  • linguistic information (acquired through tagging
    and parsing)
  • domain-specific knowledge (obtained from the
    ontology)
  • based on the approximate pattern matching
  • combines ontology-based similarity with
    corpus-based similarity using both internal and
    contextual features

124
Challenges of biomedical terminology
  • Linking term forms in text with existing resources
  • Term clustering, classification and linking to
    databases, ontologies
  • Selection of most representative terms (concepts)
    in documents (important for improved IR, database
    curation, annotation tasks)
  • Efficient term management important for updating
    terminological and ontological resources, text
    mining applications e.g. IE, Q/A, summarisation,
    linking heterogeneous resources, IR etc

125
Information Extraction in Biology
  • Results appear depressed compared to general
    language
  • Dependent on earlier stages of processing
    (tokenisers, taggers, results from NER, etc)
  • MUC data: 80% F-score for template relations,
    60% for events
  • Challenge for bio-text mining is to achieve
    similar results
  • Evaluation: see Hirschman, L. (Text mining book),
    BioCreATive 2004

126
  • Information Extraction

127
IE in Biology
  • Pattern-matching
  • Context-free grammar approaches
  • Full parsing approaches
  • Sublanguage driven IE
  • Ontology-driven IE

McNaught, J. & Black, W. (2006) Information
Extraction, Text Mining for Biology and
Biomedicine, Artech House, pp. 143-177
128
Pattern-matching IE
  • Usual limitations with non inclusion of semantic
    processing
  • Large amount of surface grammatical structures:
    too many patterns (Zipf's law)
  • Cannot explore syntactic generalisations (active,
    passive voice)
  • Systems extract phrases or entire sentences with
    matched patterns: restricted usefulness for
    subsequent mining

129
Pattern-matching systems (1)
  • BioIE uses patterns to extract sentences, protein
    families, structures, functions, ...
  • Presents user with relevant information,
    improvement from classic IR
  • BioRAT uses deeper analysis, tagging, applies RE
    over POS tags, stemming, gazetteer categories etc
  • Templates apply to extract matching phrases,
    primitive filters (verbs are not proteins, etc)

130
Pattern matching systems (2)
  • RLIMS-P (Hu): protein phosphorylation by looking
    for enzymes, substrates, sites assigned to agent,
    theme, site roles of phosphorylation relations
  • POS tagger, trained on newswire, chunking,
    semantic typing of chunks, identification of
    relations using pattern-matching rules
  • Semantic typing of NPs using combination of clue
    words, suffixes, acronyms etc
  • Semantically typed sentences matched with rules
  • Patterns target sentences containing
    phosphorylate

131
Full parsing approaches
  • Link Grammar applied for protein-protein
    interactions: general English grammar adapted to
    bio-text
  • Link Grammar finds all possible linkages
    according to its grammar
  • Number of analyses reduced by random sampling,
    heuristics, processing constraints relaxed
  • 10,000 results permitted per sentence
  • 60% of protein interactions extracted
  • Problems: missing possessive markers and
    determiners, coordination of compound noun
    modifiers

132
Full parsing IE (2)
  • Not all parsing strategies suitable for bio-text
    mining
  • Text type, abstracts, ungrammaticality related
    with sublanguage characteristics?
  • Ambiguity and full parsing: fragmentary phrases
    (titles, headings, text in table cells, etc)
  • CADERIGE project used Link Grammar but in shallow
    parsing mode
  • Kim & Park (BioIE) use combinatory categorial
    grammar, annotated with GO concepts, extract
    general biological interactions
  • 1,300 patterns applied to find instances of
    patterns with keywords

133
Full parsing (3)
  • Keywords indicate basic biological interactions
  • Patterns find potential arguments of the
    interaction keywords (verbs or nominalisations)
  • Validated arguments mapped into GO concepts
  • Difficult to generalise interaction keyword
    patterns
  • BioIE's syntactic parsing performance improved
    after adding subcategorisation frames on verbal
    interaction keywords

134
Full parsing (4)
  • Daraselia (2004) uses full parsing and domain
    specific filter to extract protein interactions
  • All syntactic analyses discovered using CFG and
    variant of LFG
  • Each alternative parse mapped to its
    corresponding semantic representation
  • Output set of semantic trees, lexemes linked by
    relations indicating thematic or attributive
    roles
  • Apply custom-built, frame based ontology to
    filter representations of each sentence
  • Preference mechanism controls construction of
    frame tree, high precision, low recall (21%)

135
Sublanguage-driven IE (1)
  • Language of a special community (e.g. biology)
  • Particular set of constraints re GL
  • Constraints operate at all linguistic levels
  • Special vocabulary (terms)
  • Specialised term formation rules
  • Sublanguage syntactic patterns
  • Sublanguage semantics
  • These constraints give rise to the informational
    structure of the domain (Z. Harris)
  • See JBI 35(4) Special Issue on Sublanguage

136
GENIES system
  • Employs SL approach to extract biomolecular
    interactions
  • Uses hybrid syntactic-semantic rules
  • Syntactic and semantic constraints referred to in
    one rule
  • Able to cope with complex sentences
  • Frame-based representation
  • Embedded frames
  • Domain specific ontology covers both entities and
    events

137
GENIES system
  • Default strategy full parsing
  • Robust due to sublanguage constraints
  • Much ambiguity excluded
  • If full parse fails, partial parsing invoked
  • Maintains good level of recall
  • Precision 96%, Recall 63%

138
Ontology-driven IE
  • Until recently most rule based IE have used
    neither linguistic lexica nor ontologies
  • Reliance on gazetteers
  • Small number of semantic categories
  • Gazetteer approach not well suited in bioIE
  • Ontology based vs ontology driven
  • Passive use of ontologies, map discovered entity
    to concept
  • Active use, ontology guides and constrains
    analysis, fewer rules
  • Examples: PASTA, GenIE (not SL)
  • GENIES: SL and ontology driven

139
Summary: simple pattern matching
  • Over text strings
  • Many patterns required, no generalisation
    possible
  • Over POS
  • Some generalisation but ignore sentence structure
  • POS tagging, chunking, semantic p-m, typing
  • Limited generalisation, some account taken of
    structure, limited consideration of SL patterns

140
Summary: full parsing
  • Full parsing on its own, or parsing done in
    combination with other techniques (chunking,
    partial parsing, heuristics) to reduce ambiguity,
    filter out implausible readings
  • GL theories not appropriate
  • Difficult to specialise for biotext
  • Many analyses per sentence
  • Missing information due to sublanguage meaning
