Basic Introduction to Ontology-based Language Technology (LT) for the Biomedical Sciences (1st year Biomedicine, UG, Belgium)

1
Basic Introduction to Ontology-based Language
Technology (LT) for the Biomedical Sciences (1st
year Biomedicine, UG, Belgium)
  • Werner Ceusters
  • European Centre for Ontological Research
  • Universität des Saarlandes
  • Saarbrücken, Germany

2
Purpose of this lecture
  • Introduce some keywords
  • Give just a taste for ontology-based LT in
    Biomedicine
  • Induce interest for further research

3
Biomedicine: A Great Area for LT
  • Educated users
  • High utility of NLP
  • Doesn't require a solution to the general problem
  • Complex and interesting (not just IE)
  • Recent surge in data
  • Knowledge bases available

Hinrich Schütze, Novation Biosciences; Russ
Altman, Stanford University
4
Biomedical Data Mining and DNA Analysis
  • DNA sequences: 4 basic building blocks
    (nucleotides): adenine (A), cytosine (C), guanine
    (G), and thymine (T).
  • Gene: a sequence of hundreds of individual
    nucleotides arranged in a particular order
  • Humans were estimated to have around 100,000 genes
    (later estimates are closer to 20,000
    protein-coding genes)
  • Tremendous number of ways that the nucleotides
    can be ordered and sequenced to form distinct
    genes
  • Semantic integration of heterogeneous,
    distributed genome databases
  • Current: highly distributed, uncontrolled
    generation and use of a wide variety of DNA data
  • Data cleaning and data integration methods
    developed in data mining will help

Jiawei Han and Micheline Kamber
5
DNA Analysis: Examples
  • Similarity search and comparison among DNA
    sequences
  • Compare the frequently occurring patterns of each
    class (e.g., diseased and healthy)
  • Identify gene sequence patterns that play roles
    in various diseases
  • Association analysis: identification of
    co-occurring gene sequences
  • Most diseases are not triggered by a single gene
    but by a combination of genes acting together
  • Association analysis may help determine the kinds
    of genes that are likely to co-occur together in
    target samples
  • Path analysis: linking genes to different disease
    development stages
  • Different genes may become active at different
    stages of the disease
  • Develop pharmaceutical interventions that target
    the different stages separately
  • Visualization tools and genetic data analysis

Jiawei Han and Micheline Kamber
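
The association analysis mentioned above can be illustrated with a minimal sketch. It counts how often pairs of genes are observed together across samples and keeps the pairs whose support reaches a threshold. The gene names, the sample data and the function name `frequent_gene_pairs` are made up for illustration; real analyses work on much larger data and usually go beyond pairs.

```python
from collections import Counter
from itertools import combinations


def frequent_gene_pairs(samples, min_support=0.5):
    """Return gene pairs whose co-occurrence support is at least min_support."""
    pair_counts = Counter()
    for genes in samples:
        # Count each unordered pair at most once per sample.
        for pair in combinations(sorted(set(genes)), 2):
            pair_counts[pair] += 1
    n = len(samples)
    return {pair: count / n
            for pair, count in pair_counts.items()
            if count / n >= min_support}


# Hypothetical toy data: genes found active in each diseased sample.
diseased = [
    {"G1", "G2", "G5"},
    {"G1", "G2"},
    {"G2", "G3"},
    {"G1", "G2", "G4"},
]

pairs = frequent_gene_pairs(diseased, min_support=0.75)
print(pairs)
```

Running the same function on healthy samples and comparing the two results is one simple way to look for combinations that are specific to the diseased class.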
6
Task descriptions
  • Sequence similarity searching
  • Nucleic acid vs nucleic acid (28)
  • Protein vs protein (39)
  • Translated nucleic acid vs protein (6)
  • Unspecified sequence type (29)
  • Search for non-coding DNA (9)
  • Functional motif searching (35)
  • Sequence retrieval (27)
  • Multiple sequence alignment (21)
  • Restriction mapping (19)
  • Secondary and tertiary structure prediction (14)
  • Other DNA analysis including translation (14)
  • Primer design (12)
  • ORF analysis (11)
  • Literature searching (10)
  • Phylogenetic analysis (9)
  • Protein analysis (10)
  • Sequence assembly (8)
  • Location of expression (7)

Stevens R, Goble C, Baker P, and Brass A. A
Classification of Tasks in Bioinformatics.
Bioinformatics 2001; 17(2): 180-188.
7
Three major challenges
  • Analyse massive amounts of data
  • E.g., high-throughput technologies based upon cDNA
    or oligonucleotide microarrays for analysis of
    gene expression, analysis of sequence
    polymorphisms and mutations, and sequencing
  • Appropriately link clinical histories to
    molecular or other biomarker data generated by
    genomic and proteomic technologies.
  • Develop user-friendly computer-based platforms
    that can be accessed and utilized by the average
    researcher for searching, retrieval,
    manipulation, and analysis of information from
    large-scale datasets

8
BUT !!!
  • Majority of data buried in:
  • huge amounts of text
  • incompatibly annotated databases

9
Text overload
  • According to a conservative estimate, the number
    of digital libraries is more than 10^5
    (Norbert Fuhr, 2003).
  • Google indexed over 4.28 billion web pages
    (from a Google press release).
  • But any single engine is prevented from indexing
    more than one-third of the indexable web
    (from Science, Vol. 285, No. 5426).

10
Objectives of LT in Biomedical Informatics
  • Make large volumes of scientific texts more
    accessible
  • Assist annotation of genome and phenome to allow
    better linking of the data
  • CSB: Computational Systems Biology
  • Link biomedical data with patient record data

11
Knowledge discovery and use
12
Text Mining Technologies for Biomedicine
[Figure: technologies positioned by utility (low to
high) and cost effectiveness (low to high):
Artificial Intelligence (Cyc), Manual Knowledge
Representation (Riboweb), Information Extraction
(Fastus), Structure Mining, Primary Literature
Reading, Keyword-based Retrieval (PubMed)]
Hinrich Schütze, Novation Biosciences; Russ
Altman, Stanford University
13
Scientists in areas such as molecular biology and
biochemistry aim to discover new biological
entities and their functions. Typical cases could
be discoveries of the implications of new
proteins and genes in an already known process,
or implication of proteins with previously
characterized functions in a separate
process. The use of available information
(published papers, etc.) is a key step for the
discovery process, since in many cases weak or
indirect evidences about possible relations
hidden in the literature are used to substantiate
working hypothesis that are experimentally
explored.
C. Blaschke, A. Valencia, 2001
14
Text-based knowledge discovery
  • Goal
  • Finding new biomedical scientific knowledge
    through the combination of existing knowledge as
    represented in the medical literature
  • Motivation
  • Prevention of re-inventing the wheel, re-usage of
    specific knowledge outside the original domain of
    discovery

15
Swanson
[Diagram: Substance A -> Effects B -> Disease C]
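
The idea behind the Swanson slide (a substance A is tied to a disease C only through intermediate effects B reported in separate papers) can be sketched roughly as follows. This is a toy illustration and not Swanson's actual procedure: documents are reduced to sets of already-recognised terms, and the function name `abc_candidates` and the four mini-documents are assumptions made for the example.

```python
from collections import defaultdict


def abc_candidates(documents, disease):
    """Propose terms A linked to `disease` (C) only via shared terms B."""
    # Each document is the set of terms that co-occur in one abstract.
    cooccurs = defaultdict(set)
    for terms in documents:
        for term in terms:
            cooccurs[term] |= terms - {term}

    intermediates = set(cooccurs[disease])  # B terms seen together with C
    candidates = defaultdict(set)
    for b in intermediates:
        for a in cooccurs[b]:
            # Keep A only if it is never mentioned directly with C.
            if a != disease and a not in intermediates:
                candidates[a].add(b)
    return dict(candidates)


# Toy documents, loosely modelled on the well-known fish oil example.
documents = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "Raynaud's disease"},
    {"platelet aggregation", "Raynaud's disease"},
]

hypotheses = abc_candidates(documents, "Raynaud's disease")
print(hypotheses)
```

Each returned entry is only a hypothesis to be checked experimentally; the supporting B terms show which pieces of literature suggested it.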
16
Protein-Protein Interaction extracted from texts
by C. Blaschke
17
Steps of Knowledge Discovery
  • Training data gathering
  • Feature generation
  • k-grams, domain know-how, ...
  • Feature selection
  • Entropy, χ², CFS, t-test, domain know-how, ...
  • Feature integration
  • SVM, ANN, PCL, CART, C4.5, kNN, ...

Some classifiers/learning methods
Limsoon Wong
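
The steps listed on this slide can be sketched end to end on a toy example: character k-grams for feature generation, a chi-square score for feature selection, and a k-nearest-neighbour vote for feature integration. This is one possible combination picked from the methods named above, not the method of the original lecture; the training sentences, labels and all function names are hypothetical.

```python
from collections import Counter


def kgrams(text, k=3):
    """Feature generation: the set of overlapping character k-grams."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def chi_square(feature, docs, labels, positive):
    """Feature selection score: chi-square of feature presence vs. class."""
    a = b = c = d = 0
    for feats, label in zip(docs, labels):
        present, is_pos = feature in feats, label == positive
        if present and is_pos:
            a += 1
        elif present:
            b += 1
        elif is_pos:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, positive, top_n=20):
    """Keep the top_n features by chi-square (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab,
                    key=lambda f: (-chi_square(f, docs, labels, positive), f))
    return set(ranked[:top_n])


def knn_predict(query, docs, labels, selected, k=3):
    """Feature integration: k-nearest neighbours by Jaccard similarity."""
    q = query & selected

    def similarity(feats):
        f = feats & selected
        union = q | f
        return len(q & f) / len(union) if union else 0.0

    ranked = sorted(zip(docs, labels), key=lambda dl: -similarity(dl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data.
training = [
    ("protein A binds protein B", "interaction"),
    ("kinase binds its substrate", "interaction"),
    ("receptor binds the ligand", "interaction"),
    ("patients were followed for a year", "other"),
    ("samples were stored at low temperature", "other"),
    ("the study was approved by the committee", "other"),
]
docs = [kgrams(text) for text, _ in training]
labels = [label for _, label in training]

selected = select_features(docs, labels, positive="interaction")
prediction = knn_predict(kgrams("the enzyme binds the inhibitor"),
                         docs, labels, selected)
print(prediction)
```

With so little data the selected features are dominated by the k-grams of "binds", which is exactly the kind of shortcut that domain know-how is meant to catch.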
18
Functional components for text-based feature
generation system
  • Basic use components (end-user)
  • Corpus Management tool
  • Parser
  • Export module
  • Management components
  • Corpus editor (super user)
  • Grammar building workbench (super user)
  • Domain Ontology editor (super user)
  • Parser generator exporter
  • Linguistic ontology (multi-lingual use)
    exporter

19
What does it take to build such a system?
  • Short term: single domain
  • Corpus collection and analysis
  • Domain model design and implementation
  • Grammar Development
  • Corpus Manipulation Engine
  • Integration in Biomining package
  • Long term: generic system
  • Grammar Building Workbench
  • Parser Generator
  • Documentation

20
A statistics only system
21
Relative Concept/Node identification (real)
Statistical analysis is powerful, but not enough
[Figure labels: concepts, nodes]
22
Clean separation of knowledge for deep
understanding
  • The Galen view
  • linguistic knowledge
  • conceptual knowledge
  • pragmatic knowledge
  • criteria knowledge
  • terminological knowledge
  • The LT view
  • phonologic knowledge
  • morphologic knowledge
  • syntactic knowledge
  • semantic knowledge
  • pragmatic knowledge
  • world knowledge

23
One word, multiple meanings
  • Abbreviation Extraction (Schwartz 2003)
  • Extracts short and long form pairs

Short form   Long form
AA           Alcoholic Anonymous
             American
             Americans
             Arachidonic acid
             arachidonic acid
             amino acid
             amino acids
             anaemia
             anemia
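
How short and long form pairs might be extracted can be sketched as follows. The matching loop follows the right-to-left character matching idea generally attributed to the cited abbreviation extraction work, in simplified form; the parenthesis pattern, the validity checks, the candidate window size and the function names are assumptions of this sketch rather than a faithful reimplementation.

```python
import re


def find_best_long_form(short_form, long_form):
    """Match short-form characters right to left inside the candidate text."""
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        char = short_form[s_index].lower()
        if not char.isalnum():
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l_index >= 0 and long_form[l_index].lower() != char) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


def extract_pairs(sentence):
    """Find 'long form (SF)' patterns and return (short, long) pairs."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short = match.group(1).strip()
        if not (2 <= len(short) <= 10) or not short[0].isalnum():
            continue
        words_before = sentence[:match.start()].split()
        # Assumed window: only a few words before the parenthesis.
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words_before[-max_words:])
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("Release of arachidonic acid (AA) was measured."))
print(extract_pairs("The amino acid (AA) sequence was determined."))
```

The two calls show why a table like the one above arises: the same short form is paired with different long forms depending on the sentence it appears in.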

24
Syntactic variant detection
  • Corpus
  • MEDLINE: the largest collection of abstracts in
    the biomedical domain
  • Rule learning
  • 83,142 abstracts
  • Obtained rules: 14,158
  • Evaluation
  • 18,930 abstracts
  • Count the occurrences of each generated variant.

Tsuruoka et al., SIGIR 2003
25
Results: antiinflammatory effect

Generation Probability   Generated Variants          Frequency
1.0 (input)              antiinflammatory effect     7
0.462                    anti-inflammatory effect    33
0.393                    antiinflammatory effects    6
0.356                    Antiinflammatory effect     0
0.286                    antiinflammatory-effect     0
0.181                    anti-inflammatory effects   23

26
Results: tumour necrosis factor alpha

Generation Probability   Generated Variants             Frequency
1.0 (input)              tumour necrosis factor alpha   15
0.492                    tumor necrosis factor alpha    126
0.356                    tumour necrosis factor-alpha   30
0.235                    Tumour necrosis factor alpha   2
0.175                    tumor necrosis factor alpha    182
0.115                    Tumor necrosis factor alpha    8
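
The generate-then-count evaluation behind these tables can be sketched as follows. The cited work learns its rewrite rules and their probabilities from MEDLINE; this sketch instead uses four hand-written rules without probabilities, so the rule list, the toy corpus and the function names are all assumptions for illustration.

```python
import re

# Hypothetical hand-written rewrite rules: (pattern, replacement).
RULES = [
    (r"\banti(?=[a-z])", "anti-"),               # antiinflammatory -> anti-inflammatory
    (r"\btumour\b", "tumor"),                    # British -> American spelling
    (r"(?<=[a-rt-z])$", "s"),                    # naive plural: effect -> effects
    (r"^[a-z]", lambda m: m.group(0).upper()),   # capitalise the first letter
]


def generate_variants(term, max_depth=2):
    """Apply the rules repeatedly (up to max_depth times) to the input term."""
    seen = {term}
    frontier = [term]
    for _ in range(max_depth):
        next_frontier = []
        for current in frontier:
            for pattern, replacement in RULES:
                variant = re.sub(pattern, replacement, current, count=1)
                if variant not in seen:
                    seen.add(variant)
                    next_frontier.append(variant)
        frontier = next_frontier
    return seen


def count_occurrences(variant, text):
    """Count whole-term, case-sensitive occurrences of a variant in a text."""
    pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
    return len(re.findall(pattern, text))


# Toy corpus standing in for a set of abstracts.
corpus = (
    "The anti-inflammatory effect was strong. "
    "Anti-inflammatory effects vary between patients. "
    "An antiinflammatory effect was also seen."
)

for variant in sorted(generate_variants("antiinflammatory effect")):
    print(variant, count_occurrences(variant, corpus))
```

Over-generation is expected here: some variants never occur in the corpus, which mirrors the zero-frequency rows in the tables above.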

27
Biomedical NE Task (Collier, COLING 2000; Kazama,
ACL 2002; Kim, ISMB 2002)
  • Recognize names in the text
  • Technical terms expressing proteins, genes,
    cells, etc.

Thus, CIITA not only activates the expression of
class II genes but recruits another B
cell-specific coactivator to increase
transcriptional activity of class II promoters in
B cells.
Junichi Tsujii
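
A very rough illustration of the task on the example sentence is given below. Note that this is a plain dictionary lookup with longest-match preference, which is a much simpler technique than the statistical recognisers in the cited papers; the lexicon entries, the class labels and the function name `tag_entities` are assumptions made for the example.

```python
import re

# Hypothetical toy lexicon mapping terms to entity classes.
LEXICON = {
    "CIITA": "protein",
    "class II genes": "DNA",
    "class II promoters": "DNA",
    "B cells": "cell_type",
}


def tag_entities(text, lexicon):
    """Return (term, class, offset) for every lexicon term found in text."""
    # Longer terms come first in the alternation so the longest match wins.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    return [(m.group(1), lexicon[m.group(1)], m.start())
            for m in pattern.finditer(text)]


sentence = (
    "Thus, CIITA not only activates the expression of class II genes "
    "but recruits another B cell-specific coactivator to increase "
    "transcriptional activity of class II promoters in B cells."
)

for term, label, offset in tag_entities(sentence, LEXICON):
    print(offset, term, label)
```

The lookup misses "B cell-specific coactivator" because it is not in the lexicon, which is the main weakness of dictionary approaches and one reason the cited work uses learned models.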
28
Text mining and classification
29
Data integration approaches
at least, the beginnings of ...
  • Protein interaction databases
  • Small molecule databases
  • Genome databases
  • Pathway databases
  • Protein databases
  • Enzyme databases

30
(No Transcript)
31
Data Integration approaches
System Integration approaches
  • Data Warehousing
  • Data from various data sources are converted,
    merged and stored in a centralized DBMS.
    Example: Integrated Genomic Database
  • Hyperlinking approaches
  • Links are set up between related
    information and data sources.
    Examples: SRS, Entrez (NCBI)
  • Standardization
  • Efforts which address the need for a common
    metadata model for various application domains.
  • Integration systems
  • Systems that can gather and integrate
    information from multiple sources. Some of these
    systems have a Mediator-Wrapper Architecture;
    others are language-based systems like
    Bio-Kleisli.
  • Federated Database
  • Cooperating, yet autonomous, databases map their
    individual schemas to a single global schema.
    Operations are performed against the federated
    schema.

Steve Brady
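
The mediator-wrapper idea mentioned above can be sketched in a few lines: each wrapper translates one source's own record layout into a shared global schema, and the mediator sends a query to all registered sources. The two sources, their field names, the records and the class and function names are all hypothetical; real systems also handle query translation, remote access and conflict resolution.

```python
def wrap_protein_db(record):
    """Wrapper for a hypothetical protein source with its own field names."""
    return {"gene": record["gene_symbol"].upper(),
            "source": "protein_db",
            "info": record["protein_name"]}


def wrap_pathway_db(record):
    """Wrapper for a hypothetical pathway source with a different layout."""
    return {"gene": record["GeneID"].upper(),
            "source": "pathway_db",
            "info": record["pathway"]}


class Mediator:
    """Answers queries against the global schema {gene, source, info}."""

    def __init__(self):
        self.sources = []  # list of (records, wrapper) pairs

    def register(self, records, wrapper):
        self.sources.append((records, wrapper))

    def query(self, gene):
        gene = gene.upper()
        results = []
        for records, wrapper in self.sources:
            for record in records:
                unified = wrapper(record)
                if unified["gene"] == gene:
                    results.append(unified)
        return results


# Illustrative records only.
protein_db = [{"gene_symbol": "tnf", "protein_name": "tumor necrosis factor"}]
pathway_db = [
    {"GeneID": "TNF", "pathway": "apoptosis signalling"},
    {"GeneID": "CIITA", "pathway": "MHC class II expression"},
]

mediator = Mediator()
mediator.register(protein_db, wrap_protein_db)
mediator.register(pathway_db, wrap_pathway_db)

for hit in mediator.query("TNF"):
    print(hit["source"], hit["info"])
```

The sources stay autonomous: adding a third database only requires a new wrapper, not a change to the existing ones.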
32
CoMeDIAS (France)
33
GenesTrace™: Biological Knowledge Discovery via
Structured Terminology
34
The XML misconception
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="cr-radio.xsl"?>
<CR-RADIOLOGIE>
  <ENTETE>
    <INFORMATION-SERVICE>
      <HOPITAL>Groupe hospitalier Léonard Devintscie</HOPITAL>
      <SERVICE>Radiologie Centrale</SERVICE>
      <MEDECIN>Dr. Bouaud</MEDECIN>
      <TITRE-EXAMEN>Phlébographie des membres inférieurs</TITRE-EXAMEN>
    </INFORMATION-SERVICE>
    <INFORMATION-DEMANDE>
      <SERVICE>Sce Pr. Charlet</SERVICE>
      <MEDECIN>Dr. Brunie</MEDECIN>
      <DATE>29-10-99</DATE>
    </INFORMATION-DEMANDE>
    <INFORMATION-PATIENT ID="236784020">
      <NOM>Donald</NOM>
      <PRENOM>Duck</PRENOM>
    </INFORMATION-PATIENT>
  </ENTETE>
  <BODY>
    <INDICATION>Suspicion de phlébite de jambe gauche</INDICATION>
    <TECHNIQUE>Ponction bilatérale d'une veine du dos du pied et injection de 180cc de produit de contraste</TECHNIQUE>
    <RESULTATS>image lacunaire endoluminale visible au niveau des veines péronières gauche. Absence d'opacification des veines tibiales antérieures et postérieures gauches. Les veines illiaques et la veine cave inférieure sont libres.</RESULTATS>
    <CONCLUSION>Trombophlébite péronière et probablement tibiale antérieure et postérieure gauche.</CONCLUSION>
  </BODY>
</CR-RADIOLOGIE>
35
Towards Machine Readable Semantics
Data about: Form, Structure, Meaning, Function,
Usage
Type definitions: Document Type Definition,
Knowledge Type Definition, Workflow Type
Definition, Style Type Definition, Information
Type Definition
Formalism: XML, CSS, RDF, OWL, ?
Cases (static, dynamic):
  • Bold, Centred, Align Left, Blink
  • Title, Paragraph, Heading1, Play
  • Subject, isPartOf, Date, After_value
  • Utility, affectedBy, Receive, Protect
  • Actor, Receival, Maintenance, Archival
Standard: Layout, Outline, Content, Behaviour,
Process
Hao Ding, Ingeborg T. Sølvberg
36
Triadic models of meaning: The Semiotic/Semantic
triangle
Reference: Concept / Sense / Model / View
Sign: Language / Term / Symbol
Referent: Reality / Object
37
There is ontology and ontology
  • Ontology in Information Science
  • An ontology is a description (like a formal
    specification of a program) of the concepts and
    relationships that can exist for an agent or a
    community of agents.
  • Ontology in Philosophy
  • Ontology is the science of what is, of the kinds
    and structures of objects, properties, events,
    processes and relations in every area of reality.
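
The information-science sense of "ontology" (concepts plus the relationships that can hold between them) can be made concrete with a minimal sketch. The class below stores is_a links and other relation triples and lets relations be inherited along the is_a hierarchy. The toy concepts, the relation name `has_location` and the class design are assumptions for illustration; real ontology languages such as OWL are far more expressive.

```python
class Ontology:
    """A toy ontology: is_a hierarchy plus (subject, relation, object) triples."""

    def __init__(self):
        self.parents = {}       # concept -> set of direct is_a parents
        self.relations = set()  # (subject, relation, object) triples

    def add_is_a(self, child, parent):
        self.parents.setdefault(child, set()).add(parent)
        self.parents.setdefault(parent, set())

    def add_relation(self, subject, relation, obj):
        self.relations.add((subject, relation, obj))

    def ancestors(self, concept):
        """All concepts reachable by following is_a links upwards."""
        found, stack = set(), list(self.parents.get(concept, ()))
        while stack:
            node = stack.pop()
            if node not in found:
                found.add(node)
                stack.extend(self.parents.get(node, ()))
        return found

    def is_a(self, child, parent):
        return child == parent or parent in self.ancestors(child)

    def related(self, subject, relation):
        """Objects related to subject, including those inherited via is_a."""
        concepts = {subject} | self.ancestors(subject)
        return {o for s, r, o in self.relations
                if s in concepts and r == relation}


# Toy content for illustration.
onto = Ontology()
onto.add_is_a("thrombophlebitis", "phlebitis")
onto.add_is_a("phlebitis", "vascular disease")
onto.add_relation("phlebitis", "has_location", "vein")

print(onto.is_a("thrombophlebitis", "vascular disease"))
print(onto.related("thrombophlebitis", "has_location"))
```

Nothing in this structure says whether the concepts correspond to anything in reality, which is the gap the philosophical sense of ontology addresses.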

38
Why are concepts not enough?
  • Why must our theory address also the referents in
    reality?
  • Because referents are observable fixed points in
    relation to which we can work out how the
    concepts used by different communities relate to
    each other
  • Because only by looking at referents can we
    establish the degree to which concepts are good
    for their purpose.

39
Or you get nonsense: Definition of cancer gene
40
Take home message: Language Technology requires a
clean separation of knowledge AND (the right sort
of) ontology
  • Pragmatic knowledge: what users usually say or
    think, what they consider important, how to
    integrate in software
  • Knowledge of classification and coding systems:
    how an expression has been classified by such a
    system
  • Knowledge of definitions and criteria: how to
    determine if a concept applies to a particular
    instance
  • Surface linguistic knowledge: how to express the
    concepts in any given language
  • Conceptual knowledge: the knowledge of sensible
    domain concepts
  • Ontology: what exists and how what exists relates
    to each other