Malignancy Types - PowerPoint PPT Presentation

About This Presentation
Title:

Malignancy Types

Description:

Flow Chart for Manual Annotation Process. Biomedical Literature. Entity ... K Bretonnel Cohen and Lawrence Hunter, BMC Bioinformatics. 2006; 7(Suppl 3): S5. ... – PowerPoint PPT presentation

Slides: 64
Provided by: yang3

Transcript and Presenter's Notes

Title: Malignancy Types


1
Molecular Entity Types
Phenotypic Entity Types
Gene
Differentiation Status
Clinical Stage
Site
Malignancy Types
Genomic Information
Phenomic Information
Histology
Developmental State
Heredity Status
Variation
Genomic Variation associated with Malignancy
2
Flow Chart for Manual Annotation Process
Auto-Annotated Texts
Biomedical Literature
Machine-learning Algorithm
Annotators (Experts)
Manually Annotated Texts
Annotation Ambiguity
Entity Definitions
3
(No Transcript)
4
Defining biomedical entities
A point mutation was found at codon 12 (G → A).

→ Variation
5
Defining biomedical entities
A point mutation was found at codon 12 (G → A).

→ Variation
point mutation → Variation.Type
codon 12 → Variation.Location
G → Variation.InitialState
A → Variation.AlteredState
Data Gathering
Data Classification
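A minimal sketch of how the sub-classified Variation annotation on this slide could be stored as labeled character spans. The `Span` class and `find_span` helper are hypothetical names introduced for illustration; the slides do not describe the actual annotation file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled character span; `end` is exclusive."""
    label: str
    start: int
    end: int

    def text(self, sentence: str) -> str:
        return sentence[self.start:self.end]


def find_span(sentence: str, label: str, phrase: str, search_from: int = 0) -> Span:
    """Label the first occurrence of `phrase` at or after `search_from`."""
    start = sentence.index(phrase, search_from)
    return Span(label, start, start + len(phrase))


sentence = "A point mutation was found at codon 12 (G -> A)."
paren = sentence.index("(")  # the states sit inside the parentheses

annotations = [
    find_span(sentence, "Variation.Type", "point mutation"),
    find_span(sentence, "Variation.Location", "codon 12"),
    find_span(sentence, "Variation.InitialState", "G", paren),
    find_span(sentence, "Variation.AlteredState", "A", paren),
]

for span in annotations:
    print(span.label, "=", span.text(sentence))
```

Storing offsets rather than strings keeps the two single-letter states unambiguous even though "A" also opens the sentence.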
6
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities

7
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity

8
Levels of specificity
Gene entity: Gene > Protein kinase (superfamily) > MAPK (gene family) > MAPK10
Malignancy type entity: Cancer/Tumor > Carcinoma > Lung carcinoma > Squamous cell lung carcinoma
9
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity
  • Conceptual overlaps between entities

Symptom: Subjective or objective evidence of disease.
Disease: A specific pathological process with a characteristic set of symptoms.
Example: Arrhythmia vs. Long QT Syndrome
10
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity
  • Conceptual overlaps between entities
  • Domain-specific clarification

Gene entity clarification: Regulatory elements -- promoters (e.g., TATA box)
11
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity
  • Conceptual overlaps between entities
  • Domain-specific clarification
  • Syntactical boundaries
  • Text boundary issues
  • The K-ras gene

12
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity
  • Conceptual overlaps between entities
  • Domain-specific clarification
  • Syntactical boundaries
  • Text boundary issues (The K-ras gene)
  • Pronoun co-reference (this gene, it, they)

13
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity
  • Conceptual overlaps between entities
  • Domain-specific clarification
  • Syntactical boundaries
  • Text boundary issues (The K-ras gene)
  • Co-reference (this gene, it, they)
  • Structural overlap -- entity within entity (same
    entity type)
  • MAP kinase kinase kinase

14
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity
  • Conceptual overlaps between entities
  • Domain-specific clarification
  • Syntactical boundaries
  • Text boundary issues (The K-ras gene)
  • Pronoun co-reference (this gene, it, they)
  • Structural overlap -- entity within entity
    (different entity type)
  • Squamous cell lung carcinoma

15
Defining biomedical entities
  • Conceptual boundaries
  • Sub-classification of entities
  • Levels of specificity
  • Conceptual overlaps between entities
  • Domain-specific clarification
  • Syntactical boundaries
  • Text boundary issues (The K-ras gene)
  • Co-reference (this gene, it, they)
  • Structural overlap -- entity within entity
  • Discontinuous mentions (N- and K-ras)
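The structural-overlap cases above (an entity inside another entity) can be detected mechanically once mentions are stored as spans. The sketch below assumes a `(label, start, end)` tuple representation and assumes that "lung" would be tagged as a Site inside the malignancy type mention; neither detail is stated in the slides.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            contains = outer[1] <= inner[1] and inner[2] <= outer[2]
            strictly_longer = (outer[2] - outer[1]) > (inner[2] - inner[1])
            if contains and strictly_longer:
                pairs.append((outer, inner))
    return pairs


text = "Squamous cell lung carcinoma"
site_start = text.index("lung")
spans = [
    ("MalignancyType", 0, len(text)),
    ("Site", site_start, site_start + len("lung")),  # assumed inner annotation
]

for outer, inner in nested_pairs(spans):
    print(f"{inner[0]} '{text[inner[1]:inner[2]]}' is nested in {outer[0]}")
```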

16
Semantic ambiguity challenges
  • Ambiguity within an entity type

CAT: catalase or glycine-N-acyltransferase (GLYAT)
17
Semantic ambiguity challenges
  • Ambiguity within an entity type
  • Ambiguity between entity types

CAT: Gene entity or Organism
18
Semantic ambiguity challenges
  • Ambiguity within entity types
  • Ambiguity between entity types
  • Gene entity ambiguity
  • 3% of human genes share aliases
  • Huge ambiguity of genes between species (mouse
    and human)
  • Gene.general, Gene.gene/RNA, Gene.protein
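As a rough illustration of ambiguity within an entity type, shared aliases can be found by inverting a symbol-to-alias table. The table below is a toy: only the CAT/GLYAT collision and the MAPK10 symbol come from the slides, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Toy alias table: official symbol -> names used in text.
gene_aliases: Dict[str, List[str]] = {
    "CAT": ["CAT", "catalase"],
    "GLYAT": ["GLYAT", "CAT", "glycine-N-acyltransferase"],
    "MAPK10": ["MAPK10"],
}


def ambiguous_aliases(table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each case-folded alias to the symbols sharing it (if more than one)."""
    owners = defaultdict(set)
    for symbol, aliases in table.items():
        for alias in aliases:
            owners[alias.lower()].add(symbol)
    return {alias: sorted(symbols) for alias, symbols in owners.items() if len(symbols) > 1}


print(ambiguous_aliases(gene_aliases))
```

Case folding is what makes the lookup collide with ordinary words as well, which is the between-type ambiguity (gene vs. organism) the slide points out.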

19
Gene: Gene, RNA, Protein
Variation: Type, Location, Initial State, Altered State
Malignancy Type: Site, Histology, Clinical Stage, Differentiation Status, Heredity Status, Developmental State
Physical Measurement, Cellular Process, Expressional Status, Environmental Factor, Clinical Treatment, Clinical Outcome, Research System, Research Methodology, Drug Effect
20
http://www.ldc.upenn.edu/mamandel/itre/annotators/onco/definitions.html
21
Manual Annotation Corpus Release
Jena University Language Information Engineering Lab: http://www.julielab.de
K Bretonnel Cohen and Lawrence Hunter, BMC Bioinformatics. 2006; 7(Suppl 3): S5.
22
Summary -- Entity Definition
  • Developed iterative process for biomedical entity
    definition
  • Defined genomic and phenotypic entities with
    distinct conceptual and syntactical boundaries in
    genomic variation of malignancy
  • Constructed a manually annotated corpus with 1442
    oncology-focused articles.

23
Named Entity Extractors
Mycn is amplified in neuroblastoma.
Mycn → Gene
amplified → Variation type
neuroblastoma → Malignancy type
24
Automated Extractor Development
  • Training and testing data
  • 1442 cancer-focused MEDLINE abstracts
  • 70% for training, 30% for testing
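A minimal sketch of a 70/30 document-level split of the 1442 abstracts. The slides do not say how documents were assigned, so the random shuffle, the fixed seed, and the function name are all assumptions.

```python
import random
from typing import Iterable, List, Tuple


def split_corpus(doc_ids: Iterable[int], train_fraction: float = 0.7,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle document ids reproducibly and cut them into train/test lists."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


train_ids, test_ids = split_corpus(range(1442))
print(len(train_ids), len(test_ids))
```

Splitting by abstract rather than by sentence keeps mentions from the same document out of both sets at once.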

25
Automated Extractor Development
  • Training and testing data
  • 1442 cancer-focused MEDLINE abstracts
  • 70% for training, 30% for testing
  • Machine-learning algorithm
  • Conditional Random Field (CRF)
  • Sets of Features

Lung cancer is the [...] of carcinoma deaths worldwide.
Lung cancer → MType; carcinoma → MType
26
Automated Extractor Development
  • Training and testing data
  • 1442 cancer-focused MEDLINE abstracts
  • 70% for training, 30% for testing
  • Machine-learning algorithm
  • Conditional Random Fields (CRFs)
  • Sets of Features
  • Orthographic features (capitalization,
    punctuation, digit/number/alpha-numeric/symbol)
  • Character N-grams (N = 2, 3, 4)
  • Prefix/Suffix (e.g., -oma)
  • Offset conjunction (3 consecutive word tokens)
  • Domain-specific lexicon (NCI neoplasm list).
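The feature sets listed above can be sketched as a per-token feature dictionary of the kind typically fed to a CRF toolkit. This is an illustrative approximation: the feature names are invented, the neighbor-copying step is a simplified stand-in for true offset conjunctions, and the lexicon feature is omitted because the NCI list is not reproduced here.

```python
import re
from typing import Dict, List


def char_ngrams(token: str, sizes=(2, 3, 4)) -> List[str]:
    """All character n-grams of the given sizes."""
    return [token[i:i + n] for n in sizes for i in range(len(token) - n + 1)]


def token_features(token: str) -> Dict[str, bool]:
    """Orthographic, suffix, and character n-gram features for one token."""
    has_digit = any(c.isdigit() for c in token)
    feats = {
        "is_capitalized": token[:1].isupper(),
        "has_digit": has_digit,
        "is_alphanumeric_mix": has_digit and bool(re.search(r"[A-Za-z]", token)),
        "has_punct": bool(re.search(r"[^\w\s]", token)),
        "suffix=oma": token.lower().endswith("oma"),
    }
    for gram in char_ngrams(token.lower()):
        feats["ngram=" + gram] = True
    return feats


def sentence_features(tokens: List[str], window: int = 1) -> List[Dict[str, bool]]:
    """Add each neighbor's non-n-gram features, tagged with its offset."""
    base = [token_features(t) for t in tokens]
    out = []
    for i in range(len(tokens)):
        feats = dict(base[i])
        for off in range(-window, window + 1):
            j = i + off
            if off == 0 or not 0 <= j < len(tokens):
                continue
            for name, value in base[j].items():
                if not name.startswith("ngram="):
                    feats[f"{name}@{off:+d}"] = value
        out.append(feats)
    return out


tokens = ["Mycn", "is", "amplified", "in", "neuroblastoma", "."]
features = sentence_features(tokens)
print(sorted(k for k, v in features[4].items() if v and not k.startswith("ngram=")))
```

With `window=1` each token sees itself plus one neighbor on either side, which matches the "3 consecutive word tokens" on the slide.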

27
Extractor Performance
  • Precision = (true positives) / (true positives +
    false positives)
  • Recall = (true positives) / (true positives + false
    negatives)
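The two measures can be computed directly from sets of predicted and gold-standard mentions. The sketch below assumes exact-match scoring on `(start, end, label)` tuples; the slides do not state whether partial matches were credited, and the example spans are made up.

```python
from typing import Iterable, Tuple

Mention = Tuple[int, int, str]  # (start, end, label)


def precision_recall(predicted: Iterable[Mention],
                     gold: Iterable[Mention]) -> Tuple[float, float]:
    """Exact-match precision and recall over mention tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # found and correct
    fp = len(predicted - gold)   # found but wrong
    fn = len(gold - predicted)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = {(0, 11, "MType"), (23, 32, "MType")}
predicted = {(0, 11, "MType"), (40, 45, "MType")}
print(precision_recall(predicted, gold))
```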

28
(No Transcript)
29
CRF-based Extractor vs. Pattern Matcher
  • The testing corpus
  • 39 manually annotated MEDLINE abstracts selected
  • 202 malignancy type mentions identified
  • The pattern matching system
  • 5,555 malignancy types extracted from NCI
    neoplasm ontology
  • Case-insensitive exact string matching applied
  • 85 malignancy type mentions (42.1%) recognized
    correctly
  • The malignancy type extractor
  • 190 malignancy type mentions (94.1%) recognized
    correctly
  • Included all the baseline-identified mentions
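A sketch of the kind of baseline described above: case-insensitive exact string matching against a term list, followed by a check of the reported recognition rates. The three-term lexicon is a toy stand-in for the 5,555 NCI terms, and the word-boundary regex is one possible implementation, not necessarily the one used.

```python
import re
from typing import List, Tuple


def match_lexicon(text: str, lexicon: List[str]) -> List[Tuple[int, int, str]]:
    """Case-insensitive exact matching; longer terms are tried first."""
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


lexicon = ["neuroblastoma", "lung carcinoma", "carcinoma"]  # toy stand-in
hits = match_lexicon("Squamous cell Lung Carcinoma and neuroblastomas", lexicon)
print(hits)

# Recognition rates from the slide: 85 and 190 of 202 mentions.
baseline_rate = round(100 * 85 / 202, 1)
extractor_rate = round(100 * 190 / 202, 1)
print(baseline_rate, extractor_rate)
```

The plural "neuroblastomas" goes unmatched here, which illustrates one way an exact matcher can miss mentions that a learned extractor may still recover.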

30
The Types of Mentions NOT Identified by Pattern
Matching
31
Normalization
  • abdominal neoplasm
  • abdomen neoplasm
  • Abdominal tumour
  • Abdominal neoplasm NOS
  • Abdominal tumor
  • Abdominal Neoplasms
  • Abdominal Neoplasm
  • Neoplasm, Abdominal
  • Neoplasms, Abdominal
  • Neoplasm of abdomen
  • Tumour of abdomen
  • Tumor of abdomen
  • ABDOMEN TUMOR

Unique Identifier
32
Normalization
  • abdominal neoplasm
  • abdomen neoplasm
  • Abdominal tumour
  • Abdominal neoplasm NOS
  • Abdominal tumor
  • Abdominal Neoplasms
  • Abdominal Neoplasm
  • Neoplasm, Abdominal
  • Neoplasms, Abdominal
  • Neoplasm of abdomen
  • Tumour of abdomen
  • Tumor of abdomen
  • ABDOMEN TUMOR

UMLS Metathesaurus Concept Unique Identifier (CUI): C0000735
19,397 CUIs with 92,414 synonyms
33
Normalization Computational Procedures
  • Rule-based algorithm
  • Applied to both entity mentions and vocabulary
    terms (UMLS metathesaurus)
  • Case insensitivity (carcinoma/Carcinoma)
  • Space/punctuation removal (lung-cancer/lungcancer)
  • Stemming (neuroblastoma/neuroblastomas)
  • Applied to mentions only
  • First/last character removal (additional
    space/punctuation)
  • First/last word removal (translocation lung
    carcinoma)
  • Evaluate the accuracy and the priority of the
    rules
  • 1,000 randomly selected entity mentions
  • Choose the best-performing rule combination and
    sequence
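The rules above can be sketched as a small normalizer plus a lookup that falls back to dropping the first or last word of a mention. The plural-stripping `stem` is a deliberately crude stand-in for a real stemmer, the rule order is an assumption, and the lung carcinoma identifier is a placeholder rather than a real CUI.

```python
import re
from typing import Dict, Optional


def stem(word: str) -> str:
    """Crude plural stripping; a stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term: str) -> str:
    """Lowercase, drop spaces and punctuation, stem each word."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return "".join(stem(w) for w in words)


def lookup(mention: str, vocabulary: Dict[str, str]) -> Optional[str]:
    """Try the whole mention, then the mention without its first or last word."""
    words = mention.split()
    for candidate in (words, words[1:], words[:-1]):
        key = normalize(" ".join(candidate))
        if key and key in vocabulary:
            return vocabulary[key]
    return None


# Vocabulary terms are normalized with the same rules as the mentions.
vocabulary = {
    normalize("Abdominal Neoplasms"): "C0000735",
    normalize("Lung Carcinoma"): "PLACEHOLDER-LUNG-CARCINOMA",  # not a real CUI
}

print(lookup("abdominal neoplasm", vocabulary))
print(lookup("translocation lung carcinoma", vocabulary))
```

Word-order variants such as "Neoplasm, Abdominal" are not handled by these rules alone; they would have to appear in the vocabulary as separate synonyms.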

34
MEDLINE Data Processing
  • Tagging MEDLINE pre-2006 abstracts
  • 15,433,668 MEDLINE abstracts
  • 9,153,340 redundant and 580,002 distinct
    malignancy type mentions
  • 60% of extracted mentions matched to UMLS CUIs
  • 1,642 CPU-hours (2.44 days on a 28-CPU cluster)
  • Infrastructure construction (PostgreSQL database)

35
Gene-Malignancy-Evidence Matrix
Gene     Malignancy                         Evidence
A1BG     Adenocarcinoma                     1634938
A1BG     Adenocarcinoma                     2292657
A1BG     Adenocarcinoma                     3566173
ABCC1    Lung Carcinoma                     11156254
ABCC1    Lung Carcinoma                     11159731
ABCC1    Lung Carcinoma                     11172691
B3GAT1   Breast Neoplasm                    6870377
B3GAT1   Breast Neoplasm                    9129046
B3GAT1   Breast Neoplasm                    9701020
ERVK6    Stage IV Melanoma of the Skin      9056412
ERVK6    Stage IV Melanoma of the Skin      9620301
ERVK6    Stage IV Melanoma of the Skin      9640365
NFKB1    Colon Carcinoma                    12842827
NFKB1    Colon Carcinoma                    12901803
NFKB1    Colon Carcinoma                    12934082
VIM      Gastrointestinal Stromal Tumor     12375611
VIM      Gastrointestinal Stromal Tumor     12657940
VIM      Gastrointestinal Stromal Tumor     12673425
21,493,687 normalized gene symbols (16,875 unique)
36
Gene-Malignancy-Evidence Matrix
5,398,954 normalized malignancy types (4,166 CUIs)
37
Gene-Malignancy-Evidence Matrix
3,100,773 distinct Gene-Malignancy-Evidence
relations
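One way to turn gene-malignancy-evidence rows into a frequency ranking is to collect the distinct evidence identifiers per pair and sort by count. The rows below are a subset of the matrix shown on the slide, with one row repeated on purpose; the function name and tie-breaking rule are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, int]  # (gene, malignancy, evidence id)

rows: List[Row] = [
    ("A1BG", "Adenocarcinoma", 1634938),
    ("A1BG", "Adenocarcinoma", 2292657),
    ("A1BG", "Adenocarcinoma", 3566173),
    ("ABCC1", "Lung Carcinoma", 11156254),
    ("ABCC1", "Lung Carcinoma", 11159731),
    ("ABCC1", "Lung Carcinoma", 11156254),  # repeated mention, same evidence
]


def rank_pairs(rows: List[Row]) -> List[Tuple[Tuple[str, str], int]]:
    """Count distinct evidence ids per (gene, malignancy), most frequent first."""
    evidence: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for gene, malignancy, doc_id in rows:
        evidence[(gene, malignancy)].add(doc_id)
    counts = [(pair, len(ids)) for pair, ids in evidence.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


for pair, count in rank_pairs(rows):
    print(pair, count)
```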
38
Ranked by Frequency
39
Summary -- Extractor Development and Application
  • Developed well-performing automated entity
    extractors across genomic and phenotypic domains
  • Constructed rule-based computational procedure
    for normalization
  • Applied the extractors and normalizers to all
    MEDLINE abstracts
  • Imported the extracted information into a
    relational database.

40
Text Mining Applications -- Hypothesizing NB
Candidate Genes
41
Text Mining Applications -- Hypothesizing NB
Candidate Genes
  • Two distinct subtypes of neuroblastoma

42
Text Mining Applications -- Hypothesizing NB
Candidate Genes
  • Two distinct subtypes of neuroblastoma
  • Distinct clinical behaviors (favorable vs.
    unfavorable)
  • NGF/NTRK1 (TrkA) vs. BDNF/NTRK2 (TrkB) signaling
    pathways

43
Text Mining Applications -- Hypothesizing NB
Candidate Genes
  • Two distinct subtypes of neuroblastoma
  • Distinct clinical behaviors (favorable vs.
    unfavorable)
  • NGF/NTRK1 (TrkA) vs. BDNF/NTRK2 (TrkB) signaling
    pathways
  • Determine the early response genes
    differentiating the two pathways
  • More precise prognosis and clinical intervention

44
Text Mining Applications -- Hypothesizing NB
Candidate Genes
NTRK1 -- SH-SY5Y -- NGF
NTRK2 -- SH-SY5Y -- BDNF
RNA extraction at 0, 1.5, 4, and 12 hrs
Affymetrix U133A Expression Array (RMAExpress normalization, SAM test)
751 differentially expressed genes
751 differentially expressed genes
45
Text Mining Applications -- Hypothesizing NB
Candidate Genes
Microarray Expression Data Analysis
Gene Set 1 (NTRK1↑, NTRK2↓): 468 genes
Gene Set 2 (NTRK2↑, NTRK1↓): 283 genes
46
Text Mining Applications -- Hypothesizing NB
Candidate Genes
  • Differentially represented genes in biomedical
    literature
  • NTRK1 vs. NTRK2 pathway differentially associated
    genes/proteins based on literature
  • Preferential association determined by
    co-occurrence with either receptor 5 times or
    more over the other
  • Assumption: co-occurrence frequency reflects
    functional correlation
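The preferential-association rule can be sketched as a fold-change threshold on co-occurrence counts. The gene names and counts below are invented for illustration, and reading "5 times or more over the other" as a count ratio of at least 5 is an interpretation of the slide, not a confirmed definition.

```python
from typing import Dict, Set, Tuple


def preferential_sets(cooccurrence: Dict[str, Tuple[int, int]],
                      fold: int = 5) -> Tuple[Set[str], Set[str]]:
    """Split genes by which receptor they co-occur with at least `fold` times more.

    `cooccurrence` maps gene -> (count with NTRK1, count with NTRK2).
    """
    ntrk1_set, ntrk2_set = set(), set()
    for gene, (n1, n2) in cooccurrence.items():
        if n1 > 0 and n1 >= fold * n2:
            ntrk1_set.add(gene)
        elif n2 > 0 and n2 >= fold * n1:
            ntrk2_set.add(gene)
    return ntrk1_set, ntrk2_set


# Hypothetical counts, not taken from the presentation.
toy_counts = {
    "GENE_A": (10, 2),  # 5x more with NTRK1
    "GENE_B": (1, 7),   # 7x more with NTRK2
    "GENE_C": (4, 3),   # no clear preference
    "GENE_D": (0, 0),   # never co-occurs
}

lit_set_1, lit_set_2 = preferential_sets(toy_counts)
print(sorted(lit_set_1), sorted(lit_set_2))
```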

47
Text Mining Applications -- Hypothesizing NB
Candidate Genes
NTRK1/NTRK2 Preferentially Associated Genes in
Literature
LitSet 1: NTRK1 Associated Genes (514)
LitSet 2: NTRK2 Associated Genes (157)
48
Text Mining Applications -- Hypothesizing NB
Candidate Genes
Microarray Expression Data Analysis vs. NTRK1/NTRK2 Associated Genes in Literature
Gene Set 1 (NTRK1↑, NTRK2↓; 468 genes) and NTRK1 Associated Genes (514): 18 genes in common
Gene Set 2 (NTRK2↑, NTRK1↓; 283 genes) and NTRK2 Associated Genes (157): 4 genes in common
49
Functional Pathway Analysis
Determine gene enrichment score for six selected functional pathways:
  • CD -- Cell Death
  • CGP -- Cell Growth and Proliferation
  • CCSI -- Cell-to-Cell Signaling and Interaction
  • CM -- Cell Morphology
  • NSDF -- Nervous System Development and Function
  • CAO -- Cellular Assembly and Organization
50
Functional Pathway Analysis
Six selected pathways:
  • CD -- Cell Death
  • CM -- Cell Morphology
  • CGP -- Cell Growth and Proliferation
  • NSDF -- Nervous System Development and Function
  • CCSI -- Cell-to-Cell Signaling and Interaction
  • CAO -- Cellular Assembly and Organization
Ingenuity Pathway Analysis Tool Kit
51
Hypergeometric Test P-values
52
Hypergeometric Test between Array and Overlap
Groups
Multiple-test corrected P-values (Bonferroni
step-down)
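A sketch of the two statistical steps named on these slides: an upper-tail hypergeometric test for enrichment, and a Bonferroni step-down (Holm) correction. The slides do not give the counts that went into the tests, so the numbers below are arbitrary and only exercise the functions.

```python
from math import comb
from typing import List


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k): n genes drawn from N, of which K belong to the pathway."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def holm_adjust(pvalues: List[float]) -> List[float]:
    """Bonferroni step-down (Holm) adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Arbitrary example: 3 of 4 selected genes fall in a 3-gene pathway out of 10.
p = hypergeom_sf(3, N=10, K=3, n=4)
print(round(p, 4))
print(holm_adjust([0.01, 0.04, 0.03]))
```

The step-down variant multiplies the smallest p-value by the number of tests, the next by one fewer, and so on, so it is never more conservative than plain Bonferroni.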
53
RT-PCR Experimental Validation
11 out of 22 genes selected for RT-PCR validation
Symbol    Description
CAMK4     calcium/calmodulin-dependent protein kinase IV
VSNL1     visinin-like 1
TBC1D8    TBC1 domain family, member 8 (with GRAM domain)
RPS6KA1   ribosomal protein S6 kinase, 90kDa, polypeptide 1
EFNB3     ephrin-B3
B3GAT1    beta-1,3-glucuronyltransferase 1 (glucuronosyltransferase P)
GNAS      GNAS complex locus
NEFH      neurofilament, heavy polypeptide 200kDa
INA       internexin neuronal intermediate filament protein, alpha
NEFL      neurofilament, light polypeptide 68kDa
TYRO3     TYRO3 protein tyrosine kinase
56
RT-PCR Experimental Validation
0hr 1.5hr 4hr 12hr
57
EFNB3 Discussion
  • EFNB3 (ephrin-B3) belongs to a family of ligands
    that binds to Eph family receptor tyrosine
    kinases
  • Implicated in axon guidance and vertebrate
    nervous system development
  • Exhibited growth-suppressive activity against NB
    cells in vitro
  • Preferentially and significantly associated with
    low tumor stage and favorable clinical outcomes
    in neuroblastoma primary tumors

58
RT-PCR Experimental Validation
0hr 1.5hr 4hr 12hr
59
TYRO3 Discussion
  • Transmembrane receptor tyrosine kinase
    activated by GAS6
  • GAS6 has been shown to promote human fetal
    oligodendrocyte survival without proliferation
  • GAS6 may also contribute to cell adhesion and
    immune responses
  • Further study of GAS6/TYRO3 signaling is needed

60
Summary -- NB Application
  • Prioritized array-determined differentially
    expressed genes by integrating text mining
    results
  • Literature-based method showed its capability of
    enriching functionally relevant genes by pathway
    analysis
  • RT-PCR experiments further validated the
    inferential power of text mining

61
Conclusion
  • Created a process for iteratively and precisely
    defining biomedical semantic types directly from
    literature
  • Developed automated entity extractors across
    genomic and phenotypic domains in malignancy with
    satisfactory accuracy rates
  • Applied this computational entity recognition and
    normalization process to all MEDLINE abstracts
  • Integrated text mining results with neuroblastoma
    experimental data to hypothesize candidate genes
    differentiating neuroblastoma subtypes

62
Future Directions
  • Increasing dimensions of Information matrix
  • Context-based normalization algorithm
  • Relation extraction with deeper semantic parsing

63
Acknowledgement
Penn BioIE Team: Dr. Mark Liberman, Dr. Mark Mandel, Dr. Ryan McDonald, Dr. Fernando Pereira, Annotator team
Brodeur Lab: Dr. Garrett Brodeur, Ms. Ruth Ho, Dr. Jane Minturn
CHOP NAP Core: Dr. Eric Rappaport
White Lab: Steve Carroll, Hawren Fang, Kevin Murphy
CHOP Bioinformatics Core: Dr. Xiaowu Gai, Dr. Jim Zhang