The Research Assistant for Biological Text Mining

BioMinT, Knowledge For Growth, 3 June 2005

1
The Research Assistant for Biological Text Mining
  • Luc Dehaspe
  • Other Members of the BioMinT Consortium

2
Text Mining in the biological domain
  • Emerging field of research and development
  • 40 articles in Bioinformatics 2004
  • Dedicated workshops, competitions and interest
    groups
  • Information retrieval and extraction to deal with
    information overflow
  • 12 million citations in Medline from 4600
    journals
  • Many more resources on the web
  • Essential link in the semantic integration of the
    numerous biological resources.

3
Use of text mining for database annotation
  • Swiss-Prot: curated protein sequence database
  • high level of annotation of proteins
  • high level of integration with other databases

Swiss-Prot Entry Creation Flowchart
4
Use of database annotations for text mining
  • Tools for information retrieval, filtering,
    classification and extraction rely on:
  • Corpora of examples used by machine learning
    methods
  • Linguistic analysis and controlled vocabularies
    (ontologies, thesauri, biological dictionaries)
  • Databases provide semi-structured information
    that could be used:
  • for corpus elaboration
  • as specific vocabulary resources

5
  • BioMinT: 3-year FP5 European project, started in
    January 2003
  • Official web site: www.biomint.org
  • Interdisciplinary consortium

6
The goals of BioMinT
  • To develop a generic text mining tool that:
  • interprets different types of queries
  • retrieves relevant documents from the biological
    literature
  • extracts the required information
  • outputs the result as a database slot filler or
    as a structured report
  • The tool thus provides two essential research
    support services:
  • Curator's Assistant: accelerates, by partially
    automating, the annotation and update of
    databases
  • Researcher's Assistant: generates readable reports
    in response to queries from biological
    researchers

7
Curator's Assistant for Swiss-Prot annotation
8
Curator's Assistant for PRINTS annotation
  • PRINTS deals with groups of proteins
  • Annotation of 3 types of protein fingerprints

Extracted Information
9
The Biological Research Assistant
  • Overlap with the Curator's Assistant
  • All biologists are occasionally in the curator's seat
  • Keep ahead of Swiss-Prot in research area of
    interest
  • Include private (confidential) document
    collections

10
Information retrieval and extraction modules
11
Information retrieval and extraction modules
  • GUI
  • IR: query expansion, PubMed search, document
    filtering/ranking, document organisation
  • IE: sentence extractor, NLP tools, case frame
    generator

12
Information Retrieval
  • A meta-query engine built around PubMed
  • Expansion of the initial query with synonyms
    using a gene/protein synonym database (GPSDB)
  • the goal being to retrieve an exhaustive set of
    documents containing information on a protein.
  • Filtration and ranking of the retrieved documents
  • Pre-classification according to information
    topics.

13
GPSDB
  • Database for synonym expansion of gene and
    protein names
  • Populated by the main resources on model
    organisms
  • Contains 559294 synonyms referring to 292472
    proteins

14
GPSDB
  • Cross-reference links are used to connect
    database entries that refer to the same
    gene/protein entity, thus pointing out the
    problem of homonymy when it occurs
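
A minimal sketch of how a synonym table can expose homonyms. It assumes a toy in-memory mapping, not the actual GPSDB schema, which the slides do not describe. The three entity names and the shared synonym lap2 come from the GPSDB screenshot slide; the other synonyms and all function names are illustrative assumptions.

```python
from collections import defaultdict

# Toy records: entity -> synonyms. Illustrative only, not GPSDB content or schema.
ENTITY_SYNONYMS = {
    "Erbin": ["ERBB2IP", "lap2"],
    "HSP 86": ["HSP90AA1", "lap2"],
    "Thymopoietin": ["TMPO", "lap2"],
}


def build_synonym_index(entity_synonyms):
    """Invert entity -> synonyms into lower-cased name -> set of entities."""
    index = defaultdict(set)
    for entity, synonyms in entity_synonyms.items():
        for name in [entity, *synonyms]:
            index[name.lower()].add(entity)
    return index


def find_homonyms(index):
    """Return the names that point to more than one entity."""
    return {
        name: sorted(entities)
        for name, entities in index.items()
        if len(entities) > 1
    }


index = build_synonym_index(ENTITY_SYNONYMS)
homonyms = find_homonyms(index)
```

Here `homonyms` contains a single key, `lap2`, mapped to the three entities that share it.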

15
GPSDB screenshot
lap2 is a synonym of three separate protein
entities: Erbin, HSP 86 and Thymopoietin
16
GPSDB screenshot
17
GPSDB used for query expansion
Original user query: lap2
Query expansion based on GPSDB (screenshot)
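
Synonym-based query expansion can be sketched as follows. This is an illustration under assumptions, not the BioMinT implementation: the synonym table is a toy dictionary, the helper names are hypothetical, and the exact query string sent to PubMed is not shown on the slides. The output is a plain quoted OR query, a generic boolean form.

```python
# Toy synonym table keyed by entity (illustrative, not the GPSDB schema).
SYNONYMS = {
    "Erbin": ["lap2", "ERBB2IP"],
    "Thymopoietin": ["lap2", "TMPO"],
}


def expand_query(term, synonyms):
    """Return the term plus every name of every entity that carries it."""
    wanted = term.lower()
    expanded = [term]
    for entity, names in synonyms.items():
        all_names = [entity, *names]
        if wanted in (n.lower() for n in all_names):
            for name in all_names:
                # Skip names already collected (case-insensitive).
                if name.lower() not in (e.lower() for e in expanded):
                    expanded.append(name)
    return expanded


def to_boolean_query(terms):
    """Join terms into a quoted OR query."""
    return " OR ".join(f'"{t}"' for t in terms)


query = to_boolean_query(expand_query("lap2", SYNONYMS))
```

Because lap2 is a homonym, the expanded query mixes names of unrelated proteins, so a real tool would presumably let the user pick the intended entity before searching.
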
18
Document filtering and ranking
  • Interactive modules which permit a flexible
    selection of relevant documents for the IE
    process.
  • Algorithmic approaches:
  • Query dependent:
  • Lucene Ranker: Java-based indexing engine giving
    a ranked output of queried documents
  • Query independent:
  • Naive Bayes Ranker: uses pre-trained
    classification of relevant documents on specific
    topics
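
A query-independent ranker of the Naive Bayes kind can be sketched as below. This is a minimal multinomial Naive Bayes with add-one smoothing written for illustration; the training sentences, the whitespace tokenizer and the function names are assumptions, and the slides do not say which features or variant the BioMinT ranker uses.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(docs):
    """docs: list of (text, is_relevant) pairs for one topic."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in docs:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def log_odds(text, model):
    """log P(relevant | text) - log P(not relevant | text), add-one smoothed."""
    word_counts, doc_counts, vocab = model
    score = math.log(doc_counts[True]) - math.log(doc_counts[False])
    for label, sign in ((True, 1), (False, -1)):
        total = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += sign * math.log((word_counts[label][word] + 1) / total)
    return score


def rank(texts, model):
    """Order texts from most to least likely relevant."""
    return sorted(texts, key=lambda t: log_odds(t, model), reverse=True)


# Hypothetical training data for the topic "Disease".
training = [
    ("mutations cause muscular dystrophy in patients", True),
    ("gene linked to hereditary disease in patients", True),
    ("crystal structure of the binding domain", False),
    ("protein structure resolved by crystallography", False),
]
model = train(training)
candidates = ["crystal structure of a kinase", "mutations linked to disease"]
ranked = rank(candidates, model)
```

The score depends only on the document and the pre-trained topic model, not on the user's query, which is what makes this ranking query independent.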

19
Document filtering and ranking
Output of query dependent ranking
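
The query-dependent ranking on this slide comes from Lucene, whose scoring the slides do not detail. As a rough stand-in, and explicitly not Lucene's formula, the sketch below ranks documents by a plain TF-IDF sum over the query terms. The documents, the tokenizer and the function name are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_rank(query, docs):
    """Return (score, doc) pairs sorted by summed TF-IDF of the query terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for doc, tokens in zip(docs, tokenized):
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(1 + n_docs / df[term])
            for term in tokenize(query)
            if term in tf
        )
        scored.append((score, doc))
    # Stable sort: documents with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


docs = [
    "lap2 binds lamin at the nuclear envelope",
    "heat shock protein expression in yeast",
    "lap2 isoforms and lap2 beta localisation",
]
results = tfidf_rank("lap2 nuclear", docs)
```

The first document ranks highest because it matches both query terms, one of which is rare in the collection; the second document matches nothing and scores zero.
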
20
Document filtering and ranking
Output of query independent ranking with respect
to topic Disease
21
Information retrieval and extraction modules
  • GUI
  • IR: query expansion, PubMed search, document
    filtering/ranking, document organisation
  • IE: sentence extractor, NLP tools, case frame
    generator

22
Sentence extractor
  • Goal: extract sentences with information relevant
    for protein annotation
  • Method: machine learning from corpora with
    manually labeled sentences
  • Data representation: bag-of-words approach
  • Best results with Support Vector Machines
    (linear/Radial Basis Function)
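
The bag-of-words representation can be illustrated with a short sketch. It covers only the feature encoding; training the Support Vector Machine on the resulting vectors is left out. The example sentences, the whitespace tokenizer and the function names are assumptions, not details from the project.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Map each distinct lower-cased token to a column index, in sorted order."""
    tokens = sorted({tok for s in sentences for tok in s.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(sentence, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical manually labeled sentences (label = annotation topic).
labeled = [
    ("MAN1 is retained in the inner nuclear membrane", "function"),
    ("Cells were grown in standard medium", "irrelevant"),
]
vocabulary = build_vocabulary(text for text, _ in labeled)
vector = vectorize("the protein is in the membrane", vocabulary)
```

Word order is discarded: each sentence becomes a fixed-length vector of token counts, which is the input form a linear or RBF-kernel classifier expects.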

23
Sentence extractor: sample output
  • set of sentences extracted from the top 5 ranked
    papers
  • query-terms are highlighted
  • sentences classified according to topics
    (function, structure, disease)
  • sentences linked to the PubMed abstract they
    originate from

24
Case frame generator
A protein containing the N-terminal domain with
the first transmembrane segment of MAN1 is
retained in the inner nuclear membrane.
TARGETED_TO: X = MAN1, Y = inner nuclear membrane
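
The case frame above can be represented as a small record. In the sketch below a single hand-written regular expression stands in for a learned extraction template. This is an assumption made for illustration only: the project learns its templates from shallow-parser output, and the class and function names here are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseFrame:
    relation: str
    x: str
    y: str


# Hand-written stand-in for a learned template: "<X> is retained in (the) <Y>".
_RETAINED_IN = re.compile(r"(\w+) is retained in (?:the )?(.+?)\.?$")


def extract_targeted_to(sentence: str) -> Optional[CaseFrame]:
    """Return a TARGETED_TO frame if the pattern matches, else None."""
    match = _RETAINED_IN.search(sentence.strip())
    if match is None:
        return None
    return CaseFrame("TARGETED_TO", match.group(1), match.group(2))


sentence = (
    "A protein containing the N-terminal domain with the first "
    "transmembrane segment of MAN1 is retained in the inner nuclear membrane."
)
frame = extract_targeted_to(sentence)
```

A surface pattern like this is brittle; it only shows the shape of the output that a learned template would fill.
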
25
Case frame generator
  • Goal: automatic identification of selected types
    of entities, relations, or events in free text
  • Methods:
  • Given a set of pre-labeled sentences, learn IE
    templates with Inductive Logic Programming (ILP)
  • Background knowledge:
  • Syntactic and semantic information from the
    shallow parser
  • Ontologies providing entities in a given domain
  • Text analysis tools:
  • Shallow Parser (MBSP) based on Machine Learning
    (TiMBL)
  • Shallow parser adapted to the biomedical field
    using the Genia corpus

26
Case frame generator: sample output of the shallow parser
  • The mouse lymphoma assay (MLA) utilizing the Tk
    gene is widely used to identify chemical mutagens.

[Figure: shallow parser output for the sentence above. Chunks: The mouse lymphoma assay; MLA; utilizing; the Tk gene; to identify; chemical mutagens. Entity labels shown: Cell-line, DNA part]
27
Case frame generator: sample output
  • Information extracted by the Case Frame
    Generator, which applies machine-learned IE rules
    to the output of the Shallow Parser

28
Summary
  • The BioMinT prototype is a working unified system
    for Biological Text Mining
  • Information Retrieval:
  • query expansion
  • document filtering/ranking
  • Information Extraction:
  • Extraction of sentences on user-specified topics
  • Extraction of relationships between entities
    (case frames)
  • Based on a variety of resources, technologies and
    expertise:
  • Biological sciences: corpus annotation, database
    annotation, fingerprints, ontologies, ...
  • Artificial intelligence: IR, machine learning
    (SVM, ILP, ...), Natural Language Processing
    (Shallow Parser), case frames, ...
  • Software development: databases, web server,
    GUI, ...

29
Future BioMinT developments
  • Integration of the BioMinT prototype in the future
    annotation environment of Swiss-Prot and PRINTS
  • Release: Q4 2005
  • Free web-based version, with restrictions on:
  • Simultaneous users
  • Resources per user (computing, storage)
  • Customization services provided by PharmaDM:
  • Integration into the researcher's IT environment
    (e-mail alerts, ...)
  • Mining in-house document collections
  • Combination with DMax data analysis software
  • Incorporation of highly specialized background
    knowledge (ontologies, thesauri, biological
    dictionaries, etc.)
  • Custom reports and GUI, etc.

30
WWW
  • BioMinT home page: http://www.biomint.org
  • GPSDB synonyms database: http://biomint.oefai.at
  • BioMinT prototype Quick Tour:
  • http://biomint-server.pharmadm.com:8080/xwiki/bin/view/BioMinT/ProtopQuickTour

31
Acknowledgements
Artificial Intelligence
Biological Sciences
Interested? Demo? Leave your card at POSTER 49