Customizing Gene Taggers for BeeSpace


1
Customizing Gene Taggers for BeeSpace
  • Jing Jiang
  • jiang4_at_uiuc.edu
  • March 9, 2005

2
Entity Recognition in BeeSpace
  • Types of entities we are interested in
      • Genes
      • Sequences
      • Proteins
      • Organisms
      • Behaviors
  • Currently, we focus on genes

3
Input and Output
  • Input: free text (w/ simple XML tags)
      • <?xml version="1.0" encoding="UTF-8"?><Document
        id="1">We have cloned and sequenced a cDNA
        encoding Apis mellifera ultraspiracle (AMUSP) and
        examined its responses to JH. </Document>
  • Output: tagged text (XML format)
      • <?xml version="1.0" encoding="UTF-8"?> <Document
        id="1"> <Sent><NP>We</NP> have <VP>cloned</VP>
        and <VP>sequenced</VP> <NP>a cDNA encoding
        <Gene>Apis mellifera ultraspiracle</Gene></NP>
        (<Gene>AMUSP</Gene>) and <VP>examined</VP>
        <NP>its responses to JH</NP>.</Sent></Document>
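
A minimal sketch of how a downstream consumer might pull the gene mentions out of tagged output like the example above. It uses Python's standard `xml.etree.ElementTree`; the function name `extract_genes` and the embedded sample string are illustrative assumptions, not part of the BeeSpace tagger.

```python
import xml.etree.ElementTree as ET

# Sample tagged output in the style of the slide (assumed well-formed).
TAGGED = (
    '<Document id="1"><Sent><NP>We</NP> have <VP>cloned</VP> and '
    '<VP>sequenced</VP> <NP>a cDNA encoding '
    '<Gene>Apis mellifera ultraspiracle</Gene></NP> '
    '(<Gene>AMUSP</Gene>) and <VP>examined</VP> '
    '<NP>its responses to JH</NP>.</Sent></Document>'
)


def extract_genes(xml_text):
    """Return the text of every <Gene> element, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(gene.itertext()).strip() for gene in root.iter("Gene")]


genes = extract_genes(TAGGED)
print(genes)  # ['Apis mellifera ultraspiracle', 'AMUSP']
```

Because the tags are plain XML, any standard parser should work; nothing here depends on how the tagger itself is implemented.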

4
Challenges
  • No complete gene dictionary
  • Many variations
      • Acronyms: hyperpolarization-activated ion channel
        (Amih)
      • Synonyms: octopamine receptor (oa1, oar, amoa1)
  • Common English words: at (arctops), by (3R-B)
  • Different genes, or a gene and a protein, may share
    the same name/symbol

5
Automatic Gene Recognition: Characteristics of
Gene Names
  • Capitalization (especially acronyms)
  • Numbers (gene families)
  • Punctuation: -, /, , etc.
  • Context
      • Local: surrounding words such as gene,
        encoding, regulation, expressed, etc.
      • Global: the same noun phrase occurs several
        times in the same article
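
The characteristics listed on this slide can be turned into simple per-token features. The sketch below is only an illustration under stated assumptions: the function name `surface_features`, the feature names, the two-token context window, and the sample sentence are all hypothetical, and the cue-word list is just the four words given on the slide.

```python
import re

# Local context cue words taken from the slide.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}


def surface_features(tokens, i, window=2):
    """Compute surface and local-context features for tokens[i]."""
    tok = tokens[i]
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    # Lower-cased neighbours inside the window, excluding the token itself.
    neighbors = [t.lower() for j, t in enumerate(tokens[lo:hi], start=lo) if j != i]
    return {
        "has_capital": any(c.isupper() for c in tok),
        "all_caps": tok.isalpha() and tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": bool(re.search(r"[-/]", tok)),
        "near_cue_word": any(n in CONTEXT_CUES for n in neighbors),
    }


tokens = "The Ks-1 gene is expressed in the brain".split()
feats = surface_features(tokens, 1)
print(feats)
```

The global feature (the same noun phrase recurring in one article) needs document-level counts and is not covered by this per-token sketch.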

6
Existing Tools
  • KeX (Fukuda)
      • Based on hand-crafted rules
      • Recognizes proteins and other entities
      • Requires human effort; not easy to modify
  • ABNER and YAGI (Settles)
      • Based on conditional random fields (CRFs) to
        learn the rules
      • ABNER identifies and classifies different
        entities, including proteins, DNAs, RNAs, cells
      • YAGI recognizes genes and gene products
      • No training

7
Existing Tools (cont.)
  • LingPipe (Alias-i, Inc.)
      • Uses a generative statistical model based on
        word trigrams and tag bigrams
      • Can be trained
      • Has two trained models
  • Others
      • NLProt (SVM)
      • AbGene (rule-based)
      • GeneTaggerCRF (CRFs)

8
Comparison of Existing Tools
  • Performance on a few manually annotated, public
    data sets (protein names)
      • GENIA (2000 abstracts on human blood cell
        transcription factor)
      • Yapex (99 abstracts on protein binding
        interaction molecular)
      • UTexas (750 abstracts on human)
  • Performance on a honeybee sample data set
      • Biosis search: "apis mellifera gene"

9
Comparison of Existing Tools (cont.)

10
Comparison of Existing Tools (cont.)
  • KeX on honeybee data
      • False positives: company names, country names,
        etc.
      • Does not differentiate between genes, proteins,
        and other chemicals
  • YAGI on honeybee data
      • False negatives: occurrences of the same gene
        name are not all tagged
      • Entity types and boundary detection
  • LingPipe on honeybee data
      • Similar to YAGI

11
Lessons Learned
  • Machine learning methods outperform hand-crafted
    rule-based systems
  • Machine learning methods have an over-fitting
    problem
  • Existing tools need to be customized for BeeSpace
      • LingPipe is a good choice
  • There is still room for better feature selection
      • E.g., global context

12
Customization
  • Train LingPipe on a better training data set
      • Use fly (Drosophila) genes
      • F1 increased from 0.2207 to 0.7226 on held-out
        fly data
  • Results when tested on honeybee data
      • Some gene names are learned (Record 13)
      • Some false positives are removed (proteins, RNAs)
      • Some false positives are introduced
  • The noisy training data can be further cleaned
      • E.g., exclude common English words
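
The F1 scores quoted on this slide combine precision and recall. As a rough illustration of how such a score can be computed from exact-match spans, here is a minimal sketch; the function name `precision_recall_f1`, the `(doc_id, start, end)` span encoding, and the sample spans are assumptions for illustration and do not reproduce the actual evaluation.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical spans encoded as (doc_id, start_offset, end_offset).
gold = {(0, 5, 29), (0, 31, 36), (1, 10, 14)}
pred = {(0, 5, 29), (1, 10, 14), (1, 20, 22), (2, 0, 3)}

p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.5000 R=0.6667 F1=0.5714
```

Exact-match scoring is strict about boundaries, so a tagger that finds the right gene but the wrong span is penalized twice (one false positive, one false negative).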

13
Customization (cont.)
  • Exploit more features such as global context
      • Occurrences of the same word/phrase should be
        tagged all positive or all negative
  • Differentiate between domain-independent features
    and domain-specific features
      • E.g., the prefix "Am" is domain-specific for
        Apis mellifera
  • Features can be weighted based on their
    contribution across domains
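
One simple way to make occurrences of the same phrase agree, as the global-context bullet suggests, is a majority vote over the local decisions within one article. This is only a sketch of that idea, not the method planned in the talk; the function name `enforce_global_consistency`, the tie-breaking rule, and the sample mentions are assumptions.

```python
from collections import defaultdict


def enforce_global_consistency(mentions):
    """Relabel each mention with the majority label for its phrase.

    mentions: list of (phrase, is_gene) local decisions from one article.
    Ties are resolved in favour of "gene" (an arbitrary choice).
    """
    votes = defaultdict(list)
    for phrase, is_gene in mentions:
        votes[phrase].append(is_gene)
    majority = {p: 2 * sum(v) >= len(v) for p, v in votes.items()}
    return [(phrase, majority[phrase]) for phrase, _ in mentions]


local = [
    ("AmCREB", True), ("at", False), ("AmCREB", False),
    ("at", True), ("AmCREB", True), ("at", False),
]
consistent = enforce_global_consistency(local)
print(consistent)
```

A learned model could instead take the vote counts as a feature, which would let the global evidence be weighed against local evidence rather than overriding it.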

14
Maximum Entropy Model for Gene Tagging
  • Given an observation (a token or a noun phrase),
    together with its context, denoted as $x$
  • Predict $y \in \{\text{gene}, \text{non-gene}\}$
  • Maximum entropy model
      • $P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$
  • Typical $f$
      • $y = \text{gene}$ and the candidate phrase starts
        with a capital letter
      • $y = \text{gene}$ and the candidate phrase
        contains digits
  • Estimate $\lambda_i$ with training data
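
To make the model on this slide concrete, the sketch below evaluates the exponential form for the two "typical" features, with the normalizing constant computed by summing over both labels. The weight values and the names `f_capital`, `f_digit`, and `maxent_prob` are hypothetical; in practice the weights would be estimated from training data.

```python
import math

LABELS = ("gene", "non-gene")


def f_capital(x, y):
    """Fires when y is 'gene' and the phrase starts with a capital letter."""
    return 1.0 if y == "gene" and x[:1].isupper() else 0.0


def f_digit(x, y):
    """Fires when y is 'gene' and the phrase contains a digit."""
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0


FEATURES = (f_capital, f_digit)


def maxent_prob(x, weights):
    """Return P(y | x) for each label under the exponential model."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(weights, FEATURES)))
        for y in LABELS
    }
    z = sum(scores.values())  # 1/z is the normalizing constant K
    return {y: s / z for y, s in scores.items()}


weights = (1.2, 0.8)  # hypothetical values, not trained
probs = maxent_prob("Ks-1", weights)
print(probs)
```

With these made-up weights, a phrase that triggers neither feature (such as "the") gets probability 0.5 for each label, which shows why real systems need many more features, including ones that fire for the non-gene label.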

15
Plan: Customization with Feature Adaptation
  • $\lambda_i^{A}$ trained on a large set of data in
    domain A (e.g., human or fly)
  • $\lambda_i^{B}$ trained on a small set of data in
    domain B (e.g., bee)
  • $\lambda_i = \alpha_i \lambda_i^{A} + (1 - \alpha_i) \lambda_i^{B}$
    used for domain B
  • $\alpha_i$ based on how useful $f_i$ is across
    different domains
      • Large $\alpha_i$ if $f_i$ is domain-independent
      • Small $\alpha_i$ if $f_i$ is domain-specific
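
The interpolation on this slide mixes, feature by feature, a weight trained on the large domain A with a weight trained on the small domain B. A minimal sketch follows; the names `adapt_weights`, `lambda_a`, `lambda_b`, and `alpha`, and all numeric values, are hypothetical, and how the mixing coefficients would be chosen is left open, as it is on the slide.

```python
def adapt_weights(lambda_a, lambda_b, alpha):
    """Per-feature interpolation: alpha * lambda_a + (1 - alpha) * lambda_b."""
    if not (len(lambda_a) == len(lambda_b) == len(alpha)):
        raise ValueError("all three sequences must have the same length")
    if any(not 0.0 <= a <= 1.0 for a in alpha):
        raise ValueError("each mixing coefficient must lie in [0, 1]")
    return [a * la + (1.0 - a) * lb for la, lb, a in zip(lambda_a, lambda_b, alpha)]


# Feature 0: e.g. "starts with a capital letter" (treated as domain-independent).
# Feature 1: e.g. "has prefix Am" (treated as domain-specific to honeybee).
lambda_a = [1.0, 0.5]  # weights from the large domain A (hypothetical)
lambda_b = [2.0, 3.0]  # weights from the small domain B (hypothetical)
alpha = [0.9, 0.1]     # large for domain-independent, small for domain-specific

adapted = adapt_weights(lambda_a, lambda_b, alpha)
print(adapted)
```

With these values the domain-independent feature stays close to its domain A weight, while the domain-specific feature ends up close to its domain B weight.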

16
Issues to Discuss
  • Definition of gene names
      • Gene families? (e.g., cb1 gene family)
      • Entities with a gene name? (e.g., Ks-1
        transcripts)
  • Difference between genes and proteins?
      • E.g., CREB (cAMP response element binding
        protein) and AmCREB?
  • How to evaluate the performance on honeybee data?

17
The End
  • Questions?
  • Thank You!