Title: Information Extraction from the Cancer Literature
1 Information Extraction from the Cancer Literature
The Pediatric Hematology/Oncology Seminar Series
Children's Hospital of Philadelphia
March 8, 2005, Philadelphia, PA
2 A Global Challenge
DNA sequence, Genomic variation, Microarrays, RNAi, Protein interactions
Patient records, Test results, Clinical reports, Procedures, Phone calls
Natural language understanding
3 Too Much Text
- Biomedical text
- 15 million articles
- 1.5 billion words
- Solution 1: Approximate
- What you can find
- What finds you
- Solution 2: Read everything
- Leukemia: 181,394 articles
- At 20/day: 25 years
- 385,034 new articles by then
- Solution 3: Impose structure on the descriptions
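The "read everything" arithmetic can be checked with a short sketch. The article counts and the 20/day rate come from the slide; reading on all 365 days of the year is an assumption made here for illustration.

```python
ARTICLES = 181_394        # leukemia articles cited on the slide
PER_DAY = 20              # reading rate cited on the slide
NEW_ARTICLES = 385_034    # new articles expected by the time the backlog is read
DAYS_PER_YEAR = 365       # assumption: reading every day of the year

days_needed = ARTICLES / PER_DAY
years_needed = days_needed / DAYS_PER_YEAR

# Implied publication rate over the same period, in articles per day.
new_per_day = NEW_ARTICLES / days_needed

print(f"{years_needed:.1f} years to read the backlog")
print(f"{new_per_day:.1f} new articles per day in the meantime")
```

Under these assumptions the new articles arrive at roughly twice the reading rate, which is why reading everything does not converge.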
4 IE Process
- Phase 1: Domain selection and definition
- Phase 2: Manual annotation
- Phase 3: Create and train machine-learning algorithms
- Phase 4: Active Annotation
- Phase 5: Utilization of annotations
5 Domain
- Biological Domains
- Genomic variations in malignancy
- Neuroblastoma
- Entity Classes
- Genes (genes, transcripts, proteins)
- Genomic variations (type, location, state)
- Malignant type
- Malignancy attributes
- Developmental state
- Clinical stage
- Histology
- Malignancy site
- Differentiation status
- Heredity status
6 Document Sets
- MEDLINE Abstracts -> Full Text
- Annotation training set: 4,000 MEDLINE abstracts
- Genes commonly mutated in various malignancies
- Genes implicated in neuroblastoma
- Abstracts are manually annotated (dual pass)
- Results are used to train automated taggers
7 Workflow Management
8 Extraction Process
9 Parsing
[Figure: parse diagram of a sentence built from the words "MDS1", "gene", "alterations", "often", "cause", "leukemia"; label: "Separate"]
10 Part-of-speech Tagging
11 Part-of-speech Tagging
12 Part-of-speech Tagging
13 Named Entity Recognition
14 Definitions Process
- Initial Definitions: Domain Experts
- Analyze representative subset of text mentions
- Input of specific knowledge
- Manual Annotation
- Tag text with initial definitions
- Iterative re-definition process
- More text: tighter and more robust definitions
- Widen Domain Expertise
- Publication and Utilization
15 Definitions
Individual Gene
Gene Family
Gene Superfamily
16 Definitions
Gene: The Gene-Entity category includes genes as well as their downstream products such as transcripts and proteins, in addition to the more general groups of gene and protein families, super-families, and so forth. Note that the category name 'Gene-Entity' is not a completely accurate description of the members of this class, since the category includes things other than genes. However, most things in this class are genes, and everything is either a gene or gene-derived (transcripts and proteins). The diagram that follows attempts to illustrate this point and provides some examples.

What is and What is Not Included?
There are two ways to think about genes.

1. Genes as conceptual entities. (This is what we want to capture.) Genes refer to segments of the genome which have been identified with a specific function or product (for example, the gene for eye color in a fly or a membrane receptor in humans). Although they are "things", they really represent abstract concepts. We can talk about the gene "K-Ras", but we are really referring to an abstract concept, an "ideal form" of the K-Ras gene, which has known attributes. We can't point to K-Ras; we can only point to instances of K-Ras. Each of these instances (a specific manifestation of the gene as described in 2 below) has the attributes and characteristics of the abstract concept of K-Ras, but the different instances of K-Ras may vary slightly from one another. (This parallels the concept of "species". We all have an intuitive grasp of the species concept, and can tell most species apart: a grizzly bear from a polar bear. However, when we visit the zoo we encounter instances of a species -- individual bears -- and not the concept itself.) Although this may seem pedantic, there is an important reason for making this distinction, which we'll describe below.

Let's consider some examples based upon this logic:
a. For genes: c-kit, CD117, and alpha-smooth muscle actin.
b. A non-biology example: a 2003 Ferrari Modena. This is an abstract concept for a specific type of car. However, you can't point to an abstract 2003 Ferrari Modena; you can only point to specific instances, which may vary, even if slightly, from one another.
c. K-Ras as investigated in Bob. This can be a tricky example since it would appear as though we are talking about a specific instance of K-Ras. But remember, in nearly all cases, genes are paired in humans (sometimes there are even more
17 Definitions
- Confounding Issues
- Levels of specificity
- Protein/enzyme/kinase/tyrosine kinase/NTRK1
- TRK antibody
- Colon cancer vs. cancer of the colon
- Boundary issues
- Retinoblastoma
- Head and neck cancer
- MEN type 2B syndrome
18 Entity Annotation
19 Named Entity Recognition
20 Syntactic Analysis
21 Treebanking
22 Syntactic Analysis
23 Relation Tagging
24 Relation Tagging
25 Annotation Viewer
26 Annotations
Annotation Task     Start Date   Annotated Documents   Annotated Words
Pre-tagging         11/3/03      3834                  1,456,000
Entity tagging      9/24/03      3829                  1,455,000
POS tagging         8/27/03      2332                  886,160
Treebanking         2/26/04      2300                  874,000
Relation tagging    10/31/04     618                   234,000
27 Automated Algorithms
- Pretagger
- Assigns token, sentence, paragraph, section boundaries
- Nearly 100% accuracy
- Pipeline implementation: Finished
- Bio Part-of-speech tagger
- Assigns part-of-speech tags to tokens
- Uses pretagging annotations
- Accuracy of 97.3%
- Pipeline implementation: Finished
28 Entity Taggers
- Entity Taggers: Automated, machine-learning algorithms for named entity recognition in text
- Goals
- Highly accurate, precision > recall
- Rapid deployment
- Flexible design
- Technique
- Conditional random fields
- Text feature-based
- Uses pretagging, POS annotations
- Probabilistic maximization of feature weights
- Corrects for overfitting
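As an illustration of what "text feature-based" can mean for a sequence tagger, the sketch below builds a per-token feature dictionary of the kind commonly fed to a conditional random field. The slides do not list the actual features, so every feature name here is an assumption, and the sketch covers feature extraction only, not CRF training.

```python
def token_features(tokens, pos_tags, i):
    """Return a feature dict for token i (hypothetical feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),                    # e.g. gene symbols such as KDR
        "has_digit": any(c.isdigit() for c in tok),   # e.g. VEGFR-2, 11q23
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        "pos": pos_tags[i],                           # from the POS annotation layer
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# Pretagged tokens and POS tags for a short phrase (tags chosen by hand here).
tokens = ["Missense", "mutation", "at", "codon", "45"]
pos_tags = ["JJ", "NN", "IN", "NN", "CD"]
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(features[3])
```

A CRF then learns one weight per (feature, label) combination, which is where the "probabilistic maximization of feature weights" on the slide would come in.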
29 Entity Taggers
- GeneTaggerCRF
- Tags gene symbols, names, and descriptions
- KDR, VEGFR-2, VEGF receptor-2
- vascular endothelial growth factor receptor type 2
- 86% precision / 79% recall
- Pipeline implementation: Imminent
- VTag
- Simultaneously tags variation types, locations, states
- point mutation, loss of heterozygosity
- codon 12, 11q23, base pair 17, Ki-ras
- GGT, glycine, Asp
- 85% precision / 79% recall
- Pipeline implementation: Imminent
30 Entity Taggers
- Mtag
- Tags malignant type labels
- acute myeloid leukemias (AMLs)
- translocation t(9;11)-positive leukemia
- NB
- transitional cell carcinoma of the bladder
- Hypoplastic myelodysplastic syndrome
- predominantly cystic bilateral neuroblastomas
- 85% precision / 82% recall
- Pipeline implementation: Imminent
31 Entity Taggers
32 Relation Tagger
Relation Taggers: Identifying relationships between entities
Given this text: "Missense mutation at codon 45 (TCT to TTT)"
Can we automatically identify:
1. Pairwise associations: (codon 45 and TCT), (TCT and TTT), etc.
2. The entire mutation event:
VARIATION EVENT 60609
Variation type: missense mutation
Variation location: codon 45
Variation state 1: TCT
Variation state 2: TTT
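One way to picture the two targets (pairwise associations and the full event) is as a small record type. The sketch below is illustrative only: the class name, field names, and the choice to enumerate every pair of components are assumptions, not the project's schema.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class VariationEvent:
    """Hypothetical container for one genomic variation event."""
    variation_type: str
    location: str
    state_1: str
    state_2: str

    def pairwise_associations(self):
        # Every unordered pair of components, in field order.
        parts = [self.variation_type, self.location, self.state_1, self.state_2]
        return list(combinations(parts, 2))


event = VariationEvent("missense mutation", "codon 45", "TCT", "TTT")
pairs = event.pairwise_associations()
for pair in pairs:
    print(pair)
```

A four-component event yields six binary associations, including the two named on the slide, (codon 45, TCT) and (TCT, TTT).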
33 Relation Tagger
- Goals: Accurate, rapid, flexible
- Technique
- Maximum entropy
- Feature-based probabilistic model
- Events built upon binary associations
- Uses pretagging, POS, and entity annotations
- Domain
- Genomic variation events
- Tested on 447 abstracts: 1218 relations, 4773 entities
- 38% of relations were non-binary
- Baseline: Two entities within 5 words are considered related
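The proximity baseline can be sketched as follows. The slide does not say how "within 5 words" is measured, so this version assumes it means at most five tokens between the end of one mention and the start of the next; the entity tuples and token offsets are made up for the example sentence.

```python
from itertools import combinations


def baseline_relations(entities, max_gap=5):
    """Relate any two entities separated by at most max_gap tokens.

    entities: list of (text, start_token, end_token), end exclusive.
    """
    ordered = sorted(entities, key=lambda e: e[1])
    related = []
    for a, b in combinations(ordered, 2):
        gap = b[1] - a[2]  # number of tokens strictly between the mentions
        if gap <= max_gap:
            related.append((a[0], b[0]))
    return related


# Tokens: Missense mutation at codon 45 ( TCT to TTT )
entities = [
    ("missense mutation", 0, 2),
    ("codon 45", 3, 5),
    ("TCT", 6, 7),
    ("TTT", 8, 9),
]
related = baseline_relations(entities)
print(related)
```

With these offsets the baseline links every pair except ("missense mutation", "TTT"), which are six tokens apart, showing how a fixed window can miss part of an event.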
34 Relation Tagger
- Results
- Binary
- Tagger: 77% precision / 82% recall
- Baseline: 66% precision / 77% recall
- Event-wide
- Tagger: 63% precision / 77% recall
- Baseline: 43% precision / 66% recall
- Example
- "most common base change was a A -> G transition at codon 12 or 13"
- Manual annotation
- (transition, codon 12, A, G)
- (transition, codon 13, A, G)
- Automated annotation
- (transition, codon 12, A, G)
- (transition, codon 13, A, G)
- (base change, codon 12, A, G)
- (base change, codon 13, A, G)
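To make the event-wide scoring concrete, the sketch below computes precision and recall for this single example, treating each annotated tuple as one event and requiring an exact match. Exact-match scoring is an assumption; the slides do not state the matching criterion.

```python
manual = {
    ("transition", "codon 12", "A", "G"),
    ("transition", "codon 13", "A", "G"),
}
automated = {
    ("transition", "codon 12", "A", "G"),
    ("transition", "codon 13", "A", "G"),
    ("base change", "codon 12", "A", "G"),
    ("base change", "codon 13", "A", "G"),
}


def precision_recall(gold, predicted):
    """Exact-match precision and recall over sets of event tuples."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


precision, recall = precision_recall(manual, automated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here both manual events are recovered, but the two extra "base change" events halve the precision, which matches the pattern in the event-wide results (recall higher than precision).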
35 Data Management
36 Annotation Pipeline
Document
Pretagging
POS tagging
Entity tagging
Relation tagging
Treebanking
Propbanking
Database
Normalization
Integration
Interface
37 Annotation Pipeline
Carolyn Felix
38 Annotation Retrieval
Biomedical Annotation Database
39 Applications: Entity Lists
- What is this all good for, anyway?
- Objective: To align the literature with genomic objects
- Goal: Can we replicate a manually curated list of genes implicated in a biological process?
- Domain: Angiogenesis
- Rationale: To focus on the subset of genes implicated in the process of angiogenesis from whole-genome expression profiling
40 Applications: Entity Lists
- The manual list
- Genes represented on the Affy U133 chips
- 340 genes, identified through:
- Prior knowledge
- Literature reviews
- PubMed searches
- Gene Ontology codes
- Gene family-based inference
41 Applications: Entity Lists
- The automated list
- Twelve partially specific angiogenic terms
- Concordancy searching of MEDLINE: 41,276 abstracts
- Trained GeneTaggerCRF with 100 hand-annotated angiogenesis abstracts
- Tagged the document set
- 104,118 mentions
- 22,662 non-redundant mentions
42 Applications: Entity Lists
- Normalization
- Human gene/alias/identifier list
- Compiled identifiers from 19 public databases
- 302,976 entries
- 156,860 non-redundant entries
- All entries mapped to 25,096 official gene symbols
- Aligned normalized gene and tagged gene lists
- 50.01% of entries matched a known gene term
- 2,389 identified genes
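The normalization step amounts to mapping each tagged mention onto an official gene symbol through an alias table and then counting mentions per symbol. The sketch below uses a toy alias table built from gene names that appear in these slides; the table contents, the lowercase and whitespace folding, and the unmatched mention "angiogenic factor X" are illustrative assumptions, not the project's data.

```python
from collections import Counter

# Toy alias table; the real list held 302,976 entries from 19 databases.
ALIAS_TO_SYMBOL = {
    "kdr": "KDR",
    "vegfr-2": "KDR",
    "vegf receptor-2": "KDR",
    "vegf": "VEGF",
    "vascular endothelial growth factor": "VEGF",
}


def normalize(mention):
    """Return the official symbol for a mention, or None if unknown."""
    key = " ".join(mention.lower().split())  # fold case and extra whitespace
    return ALIAS_TO_SYMBOL.get(key)


mentions = [
    "VEGF",
    "VEGFR-2",
    "KDR",
    "vascular endothelial  growth factor",
    "VEGF receptor-2",
    "angiogenic factor X",  # hypothetical mention with no known alias
]
symbols = [normalize(m) for m in mentions]
counts = Counter(s for s in symbols if s is not None)
matched_fraction = sum(counts.values()) / len(mentions)

print(counts.most_common())
print(f"{matched_fraction:.0%} of mentions matched a known gene term")
```

Ranking the resulting counts gives a frequency list of the kind shown on the following slide.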
43 Applications: Entity Lists
Gene     Description                                              Frequency
VEGF     Vascular endothelial growth factor                       9688
NUDT6    Antisense basic fibroblast growth factor                 1887
FGF2     Fibroblast growth factor 2 (basic)                       1463
KDR      Kinase insert domain receptor                            1287
TGFB1    Transforming growth factor, beta 1                       909
TNF      Tumor necrosis factor                                    908
FLT1     Fms-related tyrosine kinase 1 (VEGF/VPF receptor)        880
MMP2     Matrix metalloproteinase 2                               598
IL8      Interleukin 8                                            571
IL28B    Interleukin 28B                                          559
PECAM1   Platelet/endothelial cell adhesion molecule              558
ECGF1    Endothelial cell growth factor 1                         545
EGF      Epidermal growth factor                                  524
TP53     Tumor protein p53                                        524
THBS1    Thrombospondin 1                                         501
PTGS2    Prostaglandin-endoperoxide synthase 2                    427
FN1      Fibronectin 1                                            407
IL6      Interleukin 6                                            407
44 Applications: Entity Lists
- Accuracy
- 247 (72.6%) of manual genes on the automated list
- 91 (26.8%) of manual genes had no literature support
- 2 (0.6%) of manual genes were missed for technical reasons
- Overall, 99.2% recall
- Prediction
- Relevance: ranked auto-tagged genes by number of mentions
- Evaluated the top 40 NOT on the manual list
- All 40 appear to be legitimate angiogenesis-related genes
- Gene Ontology (GO): 42 human genes associated with angiogenesis or related terms
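The accuracy figures follow directly from the counts on this slide. In the short check below, the only interpretive step (an assumption) is that the 99.2% recall excludes the 91 genes with no literature support, since those could not have been found in the text.

```python
MANUAL_TOTAL = 340   # genes on the manual list
on_automated = 247   # also found by the automated list
no_support = 91      # no literature support
missed = 2           # missed for technical reasons

assert on_automated + no_support + missed == MANUAL_TOTAL

for label, n in [("on automated list", on_automated),
                 ("no literature support", no_support),
                 ("missed", missed)]:
    print(f"{label}: {n} ({n / MANUAL_TOTAL:.1%})")

# Recall over the genes that were findable in the literature.
findable = on_automated + missed
recall = on_automated / findable
print(f"recall: {recall:.1%}")
```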
45 Applications: Entity Lists
Gene     Description                                    Frequency
NUDT6    Antisense basic fibroblast growth factor       1887
TNF      Tumor necrosis factor                          908
IL28B    Interleukin 28B                                559
EGF      Epidermal growth factor                        524
TP53     Tumor protein p53                              524
FN1      Fibronectin 1                                  407
IL6      Interleukin 6                                  407
CD34     CD34 antigen                                   384
EGFR     Epidermal growth factor receptor               373
IL1B     Interleukin 1, beta                            323
PCNA     Proliferating cell nuclear antigen             277
SOS1     Son of sevenless homolog 1                     243
FGF1     Fibroblast growth factor 1 (acidic)            239
TM7SF2   Transmembrane 7 superfamily member 2           230
GALGT2   4-GalNAc transferase                           229
PRAP1    Proline-rich acidic protein 1                  219
BMP6     Bone morphogenetic protein 6                   202
BCL2     B-cell CLL/lymphoma 2                          201
46 Applications: Directed Retrieval
- Locus-specific Databases: Repositories of recorded mutation information
- > 300 human genes
- > 100 databases
- Highly curated
- Limited resources
- CDKN2A database: Somatic and germline p16 mutations
- Over 1400 mutation instances
- Primarily identified through manual literature perusal
- Large and inefficient effort
- < 20% of identified articles contain mutation instances
47 Applications: Directed Retrieval
- Experiment: Identify mutation instance-containing articles from relevant articles
- Literature search of PubMed using p16 key words
- 418 articles (1/2000 to 6/2002)
- 78 articles contained mutation data (experts)
- Training
- 218 articles
- Logistic regression classifier
- Features: words and word pairs
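A minimal sketch of the feature side of such a classifier is shown below. It assumes "word pairs" means adjacent words (bigrams), which the slide does not specify, and it scores a document with hand-picked, hypothetical weights rather than weights learned from the 218 training articles.

```python
import math
import re


def word_and_pair_features(text):
    """Binary features: each word, plus each adjacent word pair."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    features = set(words)
    features.update(f"{a}_{b}" for a, b in zip(words, words[1:]))
    return features


def predict_probability(features, weights, bias=0.0):
    """Logistic regression score: sigmoid of the summed feature weights."""
    z = bias + sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, for illustration only.
weights = {"germline_mutation": 1.5, "mutation": 0.8, "review": -1.0}

features = word_and_pair_features("A germline mutation of p16 was detected")
probability = predict_probability(features, weights)
print(f"P(contains mutation data) = {probability:.2f}")
```

In a trained model the weights would be fit to the expert labels, and the predicted probability could double as the relevance score used for ranking.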
48 Applications: Directed Retrieval
- Evaluation
- Experts
- Identified 200 candidate articles
- 32 articles contained mutation information
- 16% precision, 100% (?) recall, F-measure 0.28
- Algorithm
- Predicted that 88 of the 200 articles contained relevant info
- 29 of 32 with relevant info identified
- 44% precision, 91% recall, F-measure 0.59
- Second random trial: comparable results
- Relevance ranking: Associated with value
- In progress: refinement of relevance with text annotations
- Conclusion: automation significantly reduces workload
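The F-measures quoted above are consistent with the balanced F-measure (the harmonic mean of precision and recall), although the slide does not say which variant was used. The check below takes the stated precision and recall values as given.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


experts = f_measure(0.16, 1.00)    # stated: 16% precision, 100% recall
algorithm = f_measure(0.44, 0.91)  # stated: 44% precision, 91% recall

print(f"experts:   {experts:.2f}")
print(f"algorithm: {algorithm:.2f}")
```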
49 The Global Challenge
What is MYCN? What is MYCN related to? How?
Genes, Proteins, Pathways, Cells, Tissues, Phenotypes, Traits, Diseases, Behaviors, Environment
50 Integration
Cellular location
Genomic position
Genomic context
Protein function
Known alteration
Cell type
Disease association
Symptom
Environmental factor
Clinical observation
51 Resources
BioIE group: http://bioie.ldc.upenn.edu/
Resources: http://bioie.ldc.upenn.edu/index.jsp?pagedoc_resources.html
Documentation: http://bioie.ldc.upenn.edu/index.jsp?pagedoc_users.html
Software/Tools: http://bioie.ldc.upenn.edu/index.jsp?pagedoc_soft_tools.htm
52 Contributors
University of Vermont: Claire Anduka, Mark Greenblatt, Joan Murphy, Amy Rodgers
Sanger Institute: Sally Bamford, Elisabeth Dawson, Jon Teague, Richard Wooster
CHOP: Shannon Davis, Jayanti Jagannathan, Yang Jin, Jessica Kim, Jeremy Lautman, Pete White, Scott Winters, Garrett Brodeur, Mike Hogarty, John Maris
University of Pennsylvania: Avik Basu, Ann Bies, Christine Brisson, Dan Caroff, Hareesh Chandrupatla, Melissa Demian, Jacqueline Ewing, Nadeene Francesco, Hubert Jin, Aravind Joshi, Sanipa Koetswawasdi, Seth Kulick, Jeremy LaCivita, Justin Lacasse, Matt Leger, Alexis Lerro, Mark Liberman, Mark Mandel, Mark Manocchio, Mitch Marcus, Ryan McDonald, Tom Morton, Grace Mrowicki, Sina Neshatian, Ben Newman, Michael Noda, Martha Palmer, Eric Pancoast, Anita Patel, Fernando Pereira, Ariel Richmond, Karen Rudo, Andrew Schein, Mike Schultz, Jonathan Schwartz, Amanda van Scoyoc, Nilay Shah, Sarah Stippich, Sabrina Sumner, Rachel Swetz, Partha Talukdar, Julie Wang, Colin Warner, Christopher Wright, Johanna Wright, Dalal Zakhary, Ramez Zakhary