Title: Proteomics Data Interoperation with Applications to Integrated Datamining and Enhanced Information Retrieval
1. Proteomics Data Interoperation with Applications to Integrated Datamining and Enhanced Information Retrieval
- Andrew Smith
- Thesis Defense
- 8/25/2006
- Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner (UC Berkeley)
2. Outline
- Problem Description --- Note: the focus today is more on the enhanced information retrieval aspects.
- Data set integration and interoperation (hubs connecting information resources) using the Semantic Web
- YeastHub: lightweight RDF data warehouse
- LinkHub: biological identifiers and their relationships
- LinkHub supports enhanced information retrieval / web search
- Relational queries for node-attached documents
- Enhanced automated information retrieval
- The web
- PubMed (biomedical scientific literature)
- Empirical performance evaluation for yeast proteins
- Related Work and Conclusions
3. Problem Description
- Integration and interoperation of structured, relational data that is:
- Large-scale
- Widely distributed
- Independently maintained
- Leverage such data for important applications
- Cross-database queries and datamining
- Enhanced information retrieval / web search
4. Characteristics of Biological Data
- Biological data, especially proteomics data, is the motivation and domain of focus.
- Vast quantities of data from high-throughput experiments (genome sequencing, structural genomics, microarray, etc.)
- Huge and growing number of biological data resources: distributed, heterogeneous, with large variance in size.
- Need for better interoperation
- Practical domain to work in, a good match to the problem: challenging but not overly complex.
- Rich semantic relationships among data resources, but often not made explicit.
5. (Very) Brief Proteomics Overview
The Central Dogma of Biology states that the coded genetic information hard-wired into DNA (i.e., the genome) is transcribed into individual transportable cassettes composed of messenger RNA (mRNA); each mRNA cassette contains the program for synthesis of a particular protein (or small number of proteins).
Proteins are the key agents in the cell. Proteomics is the large-scale study of proteins, particularly their structures and functions.
While the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment. One organism has radically different protein expression in different parts of its body, in different stages of its life cycle, and in different environmental conditions.
6. Proteins are modular, composed of domains or families (based on evolution)
7. Problem with the Web
8. Data Heterogeneity
- Lack of standard, detailed description of resources
- Data are exposed in different ways:
- Programmatic interfaces
- Web forms or pages
- FTP directory structures
- Data are presented in different ways:
- Structured text (e.g., tab-delimited format and XML format)
- Free text
- Binary (e.g., images)
9. Classical Approaches to Interoperation: Data Warehousing and Federation
- Data Warehousing
- Focuses on data translation
- Translate data to a common format under a unified schema; cross-reference, store, and query in a single machine/system.
- Federation
- Focuses on query translation
- Translate and distribute the parts of a query across multiple distinct, distributed databases and collate their results into a single result.
10. General Strategy for Interoperation
- Data is vast, distributed, and independently maintained → complete centralized data-warehousing integration is impractical.
- Must rely on federated, cooperative, loosely coupled solutions which allow partial and incremental progress.
- Widely used and supported standards are necessary to enable such independent, loosely coupled, cooperative integration.
- The Semantic Web is an excellent fit to these needs.
11. Semantic Web: Resource Description Framework (RDF) Models Data as a Directed Graph with Objects and Relationships Named by URIs
Example graph: the resource http://www.ncbi.nlm.nih.gov/SNP has creator http://www.ncbi.nlm.nih.gov and language "en", using the Dublin Core properties http://purl.org/dc/elements/1.1/creator and http://purl.org/dc/elements/1.1/language.
The above RDF graph serialized in XML:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"/>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
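The directed-graph data model above can be sketched without any RDF library as a plain set of triples; the `objects` helper below is a made-up illustration, not a standard API:

```python
# Plain-Python sketch of the RDF model: data is a set of
# (subject, predicate, object) statements, with resources named by URI.
# These are the two statements from the example graph above.
triples = {
    ("http://www.ncbi.nlm.nih.gov/SNP",
     "http://purl.org/dc/elements/1.1/creator",
     "http://www.ncbi.nlm.nih.gov"),
    ("http://www.ncbi.nlm.nih.gov/SNP",
     "http://purl.org/dc/elements/1.1/language",
     "en"),
}

def objects(subject, predicate):
    """All objects attached to `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("http://www.ncbi.nlm.nih.gov/SNP",
              "http://purl.org/dc/elements/1.1/language"))  # {'en'}
```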
12. YeastHub: Lightweight RDF Data Warehouse
13. Name / ID proliferation problem
- Identifiers for biological entities are a simple but key way to identify and interrelate the entities: an important scaffold for biological data.
- But there are often many synonyms for the same entity, e.g. strange legacy names (e.g. a fly gene called "sonic hedgehog").
- Even simple syntactic variants can be cumbersome, e.g. GO:0008150 vs. GO0008150 vs. GO-8150, etc.
- Not just synonyms: there are many kinds of relationship, including one-to-many mappings.
- Known relationships among entities are not always stored, or are stored in non-standard ways.
- The implicit overall data structure is an enormous, elaborate graph of relationships among biological entities.
14. Complex Biological Relationships
15. LinkHub
- The major proteomics hub UniProt performs centralized identifier mapping for large, well-known databases.
- Large staff, resource-intensive, manual curation → centralization bottleneck. Not viable as a complete solution → need distributed, collaborative integration based on standards.
- Just simple mappings, with no specification of relationship types.
- Many data resources are not covered (e.g., small, transient, local, lab-specific, boutique).
- Need for a system / toolkit to enable local, collaborative integration of data → allow people to create mini UniProts and connect them → LinkHub.
- Practically, LinkHub provides a common links portal into a lab's resources, also connecting them to larger resources.
- LinkHub is used this way for the Gerstein Lab and NESG, connecting them to the major hub UniProt.
16. Major / Minor Hubs and Spokes: Federated Model
- LinkHub acts as local minor hubs to connect groups of common resources, with a single common connection to major hubs: a more efficient organization of biological data.
17. LinkHub Data Model
- LinkHub has identifier types, identifiers, mappings between identifiers (a mapping-type attribute gives the relationship type), and resources together with the identifier types they accept (and where) in their URL templates (link_exceptions gives exceptions).
- LinkHub is stored in both RDF (Sesame) and SQL (MySQL), with translation between the two: MySQL for robustness and efficiency for the GUI frontend, RDF to complement YeastHub (as glue to make direct and indirect identifier connections).
- Perl scripts and web crawlers maintain it. DHTML GUI web frontend. Tools for fast, exact sequence matching for cross-referencing.
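A miniature sketch of this data model in Python; the table contents, mapping-type names, and the Pfam URL template below are invented for illustration (only the UniProt-style URL pattern resembles the real one):

```python
# Hypothetical miniature of the LinkHub model: identifiers with types,
# typed mappings between identifiers, and resources that accept an
# identifier type at a slot in their URL template.
identifiers = {
    "P26364": "UniProt",   # identifier -> identifier type
    "PF00406": "Pfam",
}
# (from_id, to_id, mapping_type): the mapping type names the relationship.
mappings = [("P26364", "PF00406", "protein-in-family")]

# URL templates keyed by the identifier type they accept.
url_templates = {
    "UniProt": "http://www.uniprot.org/uniprot/{id}",
    "Pfam": "http://pfam.example.org/family/{id}",
}

def link_for(identifier):
    """Fill the resource URL template for this identifier's type."""
    return url_templates[identifiers[identifier]].format(id=identifier)

print(link_for("P26364"))  # http://www.uniprot.org/uniprot/P26364
```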
18. LinkHub Web GUI: Links Portal
19. Example YeastHub / LinkHub Queries
- Query 1: Finding Worm Interologs of Yeast Protein Interactions
- For each yeast gene in an interacting pair, find the corresponding WormBase genes: yeast gene name → UniProt accession → Pfam accession → UniProt accession → WormBase ID
- Query 2: Exploring Pseudogene Content versus Gene Essentiality in Yeast and Humans
- yeast gene name → UniProt accession → yeast pseudogene
- yeast gene name → UniProt accession → Pfam accession → human UniProt ID → UniProt accession → pseudogene LSID
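A typed-path traversal like Query 1 can be sketched as repeated expansion along mappings of a given type; all IDs and mapping-type names below are illustrative, not actual LinkHub data:

```python
# Toy mapping graph: edges keyed by (identifier, mapping_type).
# The identifiers and mapping-type names are invented for illustration.
edges = {
    ("YGENE1", "gene-to-uniprot"): ["P00001"],
    ("P00001", "uniprot-to-pfam"): ["PF99999"],
    ("PF99999", "pfam-to-uniprot"): ["Q00002"],
    ("Q00002", "uniprot-to-wormbase"): ["WBGene00000123"],
}

def follow_path(start, mapping_types):
    """Expand a frontier of identifiers along a sequence of typed mappings."""
    frontier = [start]
    for mtype in mapping_types:
        frontier = [nxt for node in frontier
                    for nxt in edges.get((node, mtype), [])]
    return frontier

print(follow_path("YGENE1", ["gene-to-uniprot", "uniprot-to-pfam",
                             "pfam-to-uniprot", "uniprot-to-wormbase"]))
# ['WBGene00000123']
```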
20. LinkHub Support for Enhanced Information Retrieval / Web Search
21. Web Information Management and Access Paradigms
- Search Engines
- Automated
- People can publish in natural languages
- Vast coverage (almost the whole web)
- Currently the preeminent paradigm because of the vast size of the web and its unstructured heterogeneity
- Unfortunately, only gives coarse-grained topical access; no real cross-site interoperation/analysis
- Semantic Web
- Very fine-grained data modeling and connection; very precise cross-resource query / question answering supported
- Unfortunately, requires much more manual intervention; people must change how they publish (RDF, OWL, etc.). Thus, limited acceptance and size
22. Combining Search and Semantic Web Paradigms
- Currently, these two paradigms are largely independent. However, they seem to have complementary strengths and weaknesses.
- Key idea: these two approaches to web information management and retrieval can work together and complement one another, and there are interesting, practical, and useful ways they can leverage and enhance each other.
23. Combining Relational and Keyword-based Search Access to Free Text Documents
- Consider a query for all documents containing information about proteins which are members of the Pfam Adenylate Kinase (ADK) family.
- Standard keyword-based search engines couldn't support this.
- Relational information about the documents is required, i.e. that they are related to proteins in the Pfam ADK family.
- LinkHub attaches documents to identifier nodes and supports such relational query access to them.
24. LinkHub Path Type Queries
- View all paths in the LinkHub graph matching specific relationship types, e.g. family views:
- PDB ID → UniProt ID → Pfam family → UniProt ID → PDB ID → MolMovDB motion
- NESG ID → UniProt ID → Pfam family → UniProt ID → NESG ID
- Practically used as a secondary, orthogonal interface to other databases.
- MolMovDB and NESG's SPINE both use LinkHub for such family views.
25. Using the Semantic Web for Enhanced Automated Information Retrieval
- Basic idea: the semantic web provides detailed information about terms and their interrelationships, which can be used as additional information to improve web searches for those terms (and related terms).
- As proof of concept, the particular problem we address is finding additional relevant documents for proteomics identifiers on the web or in the scientific literature.
26. Finding Additional Relevant Documents for Proteomics Identifiers
- Why not just do a web search directly for the identifier, e.g. P26364? → likely bad results:
- Conflated senses of the identifier text, e.g. product catalog codes, etc.
- There might be synonyms of the identifier → should search these too (the LinkHub graph has them).
- Many important, relevant documents might not directly mention the identifier, e.g. pages generally about cancer pathways but not specifically containing the identifier for a given protein in a cancer pathway. Should search for important related concepts.
- LinkHub has lots of extra info → use it!
27. Using LinkHub Subgraphs as Gold-standard Training Sets
- The LinkHub subgraph emanating from a given identifier, together with the web pages (hyperlinks) attached to the identifiers in that subgraph, is concrete, accurate, extra information about the given identifier that can be used to improve document retrieval for that central identifier.
- LinkHub subgraphs and their associated documents for a given identifier are used as a training set to build classifiers that rank documents obtained from the web or the scientific literature.
28. Training set docs are scaled down based on distance and link-types from the central identifier
29. Classifier for Document Relevance Reranking
- Use standard information retrieval techniques: tokenization, stopword filtering, word stemming, TF-IDF term weighting, and cosine similarity measures.
- Classifier model: add the weighted subgraph documents' word vectors, TF-IDF weight the result, and take the top-weighted 20 terms.
- A standard cosine similarity value is used to score documents against the classifier.
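The classifier model can be sketched with the standard TF-IDF and cosine-similarity formulas (a simplified sketch: stemming and stopword filtering are omitted, and any counts used in examples are illustrative):

```python
import math
from collections import Counter

def tfidf(term_counts, doc_freqs, n_docs):
    """Weight raw term counts by (smoothed) inverse document frequency."""
    return {t: tf * math.log(n_docs / (1 + doc_freqs.get(t, 0)))
            for t, tf in term_counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_classifier(weighted_docs, doc_freqs, n_docs, top_k=20):
    """weighted_docs: (weight, token list) pairs from the LinkHub subgraph.
    Sum the weighted word vectors, TF-IDF weight, keep the top terms."""
    combined = Counter()
    for weight, tokens in weighted_docs:
        for t in tokens:
            combined[t] += weight
    weighted = tfidf(combined, doc_freqs, n_docs)
    return dict(sorted(weighted.items(), key=lambda kv: -kv[1])[:top_k])
```

Candidate documents are then scored as `cosine(classifier, tfidf(Counter(doc_tokens), doc_freqs, n_docs))`.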
30. Obtaining Documents to Rank
- Use major web search engines via their web APIs. For demo purposes, we used Yahoo.
- Perform individual base searches for each of the top 40 training set feature words and identifiers in the subgraph; combine all results into one large result set.
- Rerank the combined result set using the constructed classifier.
- Essentially, this systematically explores the concept space around the identifier.
- Searches returning the most relevant docs on average could be called "semantic signatures": key concepts related to the given identifier of interest, in their own right succinct snippets of what the identifier is about.
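The pool-and-rerank step above might look like the following sketch, where `search` stands in for a web-search API call (such as Yahoo's) and `score` for the classifier's cosine score; both are assumptions, not real APIs:

```python
def retrieve_and_rerank(terms, search, score):
    """Issue one base search per feature word / identifier, pool the
    hits (deduplicated by URL), then rerank the pool by classifier score.
    `search(term)` yields (url, tokens) pairs; `score(tokens)` -> float."""
    pooled = {}
    for term in terms:
        for url, tokens in search(term):
            pooled.setdefault(url, tokens)  # dedupe across base searches
    return sorted(pooled, key=lambda u: score(pooled[u]), reverse=True)

# Toy stand-ins for the search engine and the classifier:
hits = {
    "adenylate": [("u1", ["kinase", "adenylate"])],
    "kinase": [("u1", ["kinase", "adenylate"]), ("u2", ["catalog"])],
}
fake_search = lambda term: hits.get(term, [])
fake_score = lambda tokens: tokens.count("kinase")

print(retrieve_and_rerank(["adenylate", "kinase"], fake_search, fake_score))
# ['u1', 'u2']
```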
31. Example: UniProt P26364
Note: a direct Yahoo search for "P26364" returned very poor results. In a manual inspection of the results, 17/40 clearly had nothing to do with the UniProt protein, and many of the others didn't seem too useful (large tabular dumps of identifiers, etc.). The first clearly unrelated result in LinkHub's results was at position 72. LinkHub's results are arguably better.
32. PubMed Application
- PubMed is a database of scientific literature citations covering roughly the last 50-100 years.
- Currently there is no automated information retrieval of PubMed for biological identifier-related citations.
- Built an app to search for related PubMed abstracts, using Swish-e to index and provide base search access to PubMed.
33. PubMed search for UniProt P26364
Manual annotations exist (above). Only 3 and 4 are directly related --- the LinkHub-based automated method ranked these 13 and 7 and returned many more relevant docs.
34. Empirical Performance Tests
- The preceding results seemed reasonably good, but can we empirically measure performance?
- We need a gold-standard set of documents (a curated bibliography) that we know are highly related to particular identifiers → gene_literature.tab from the Saccharomyces Genome Database (SGD).
35. Goals of Performance Tests
- Quantify the performance level of the procedure: is performance close to optimal, or is there lots of room for improvement?
- How can we know that adding in (downweighted) documents for related identifiers actually helps?
- Proof of concept: PFAM and GO are key related concepts for proteins; let's objectively see if they help.
- Quantify the performance of a particular enhancement: the pre-IDF step.
36. Pre-Inverse Document Frequency (pre-IDF) step
- Idea: maximally separate all the proteomics identifiers' classifiers while at the same time making them as specifically relevant and discriminating as possible.
- Determine document frequencies for all pages of a type, e.g. all (or a sample of) UniProt pages.
- Perform IDF against the type's doc freqs and then again against the corpus you are searching, e.g. first UniProt, then PubMed.
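The two IDF passes can be sketched as below, assuming the simple log-IDF form (the thesis's exact weighting formula may differ; all counts in the example are illustrative):

```python
import math

def idf(term, doc_freqs, n_docs):
    """Smoothed inverse document frequency."""
    return math.log(n_docs / (1 + doc_freqs.get(term, 0)))

def pre_idf_weight(term, tf, type_dfs, n_type, corpus_dfs, n_corpus):
    """IDF first against the page *type's* doc freqs (e.g. all UniProt
    pages), then against the search corpus (e.g. PubMed). Boilerplate
    terms common to every page of the type get near-zero (or negative)
    weight in the first pass and are effectively removed."""
    return tf * idf(term, type_dfs, n_type) * idf(term, corpus_dfs, n_corpus)

# Illustrative counts: "protein" appears on every page of the type,
# "kinase" only on a few, so "kinase" keeps far more weight.
type_dfs = {"protein": 100, "kinase": 5}
corpus_dfs = {"protein": 1000, "kinase": 50}
print(pre_idf_weight("kinase", 1.0, type_dfs, 100, corpus_dfs, 10000) >
      pre_idf_weight("protein", 1.0, type_dfs, 100, corpus_dfs, 10000))  # True
```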
37. Pre-IDF Step is Generally Useful
- For example, imagine wanting to find web pages highly relevant to a particular digital camera or mp3 player.
- Cnet has many pages about different digital cameras or mp3 players → doc freqs for these.
- Build a classifier for a particular digital camera by first doing the pre-IDF step against the doc freqs for all Cnet digital camera pages.
38. Experimental Protocol
- Pick a few hundred random yeast proteins from TrEMBL and Swiss-Prot separately, each with at least 20 citations and GO and PFAM relations.
- A given protein's citations are its "in" group.
- Other proteins' citations are the "mid" group.
- Randomly selected PubMed citations (not in gene_lit.tab or UniProt) are the "out" group.
- The classifier should rank documents in the order in > mid > out; the degree of deviation from this is the objective test.
39. Performance Measures
- The "in" group is the most important, so focus on it.
- Area under the ROC curve (AUC) for the "in" group.
- ROC is robust to unknown or skewed class distributions.
- Useful when error costs are not known.
- Look at full AUC and also top-5% AUC (.05 AUC) → it is how well you do at the top of the rankings that really matters.
- Perform optimization over parameters at 0.1 granularity: PFAM and GO weights, percentage of features kept, and use / don't use the pre-IDF step. UniProt page weight set to 1.0.
- Compare average AUC values for various parameter values, and determine statistical significance with paired t-tests.
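For concreteness, AUC for the "in" group can be computed directly from a ranked list of in/not-in labels via the standard pair-counting identity (a sketch, not the thesis's evaluation code):

```python
def auc(ranked_labels):
    """AUC from a best-first ranking of booleans (True = "in" group):
    the fraction of (positive, negative) pairs ranked in the correct
    order, counted in one pass from the bottom of the ranking up."""
    pos = sum(ranked_labels)
    neg = len(ranked_labels) - pos
    correct = 0
    seen_neg = 0
    for label in reversed(ranked_labels):
        if not label:
            seen_neg += 1
        else:
            correct += seen_neg  # this positive beats all negatives below it
    return correct / (pos * neg)

print(auc([True, True, False, False]))  # 1.0: a perfect ranking
```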
40. Results
- Percentage of features kept didn't really matter → just use a smaller value (0.4 or 0.5) for computational efficiency.
- Pre-IDF gave the largest performance boost: any trial with pre-IDF gave a better result than one without, regardless of the other parameters.
- Pre-IDF and PFAM at small weight increased average AUC across all trials for both TrEMBL and Swiss-Prot. GO did not help (PFAM is more information-rich).
- TrEMBL was helped more than Swiss-Prot.
- Swiss-Prot is curated, high quality, and complete, whereas TrEMBL is automated and lower quality → it makes sense that PFAM helped TrEMBL more than Swiss-Prot.
- All comparisons for TrEMBL were statistically significant by t-tests; only pre-IDF was for Swiss-Prot.
41. Results
[Results tables for TrEMBL and Swiss-Prot]
42. Related Work: Search Engines / Google
- A small number of search terms is entered.
- So little information to go by → maybe millions of result documents; hard to rank well.
- Good ranking was a big problem for early search engines.
- Google (Brin and Page 1998) provided a popular solution:
- Hyperlinks as votes for importance
- Rank web pages with the most and best votes highest.
43. Alternative to Millions of Result Documents: Increased Search Precision
- Increase search precision so a smaller, more manageable number of documents is returned.
- More search terms?
- Problem: users are lazy and won't enter too many terms.
- Solution: the semantic web provides the increased precision → users just select semantic web nodes for concepts, and the nodes are automatically expanded to increase search precision.
- This is what LinkHub does.
44. Very Recent Related Work: Aphinyanaphongs et al. 2006
- Argues for specialized, automated filters for finding relevant documents in the huge and ever-expanding scientific literature.
- Constructed classifiers for predicting the relevance of PubMed documents for various clinical medicine themes.
- State-of-the-art SVM classifiers.
- Used large, manually curated, respected bibliographies for training.
- Used text from the article title, abstract, journal name, and MeSH terms as features.
45. Aphinyanaphongs et al. 2006, cont.
- LinkHub-based search, by contrast:
- Fairly basic classifier model: word weight vectors compared with cosine similarity.
- Small training sets (UniProt, GO, PFAM pages).
- Fairly noisy as well: web pages vs. focused text.
- Only abstract text used as features.
- Some gene_lit.tab citations are only generally relevant → true performance is understated.
- Classifiers are built automatically and easily at very large scale as a natural byproduct of LinkHub.
- But LinkHub's .927 and .951 AUCs are better than, or negligibly smaller than, the 0.893, 0.932, and 0.966 AUCs of Aphinyanaphongs et al. 2006.
46. Aphinyanaphongs et al. 2006 and Citation Metrics
- Also compared to citation-based metrics:
- Citation count
- Journal impact factor
- Indirectly, Google PageRank
- SVM classifiers outperformed these, and adding citation metrics as features gave marginal improvement at best.
- Surprising given Google's success with PageRank.
47. Aphinyanaphongs et al. 2006 and Citation Metrics, cont.
"More generally stated, the conceivable reasons for citation are so numerous that it is unrealistic to believe that citation conveys just one semantic interpretation. Instead citation metrics are a superimposition of a vast array of semantically distinct reasons to acknowledge an existing article. It follows that any specific set of criteria cannot be captured by a few general citation metrics and only focused filtering mechanisms, if attainable, would be able to identify articles satisfying the specific criteria in question."
Conclusion: Aphinyanaphongs et al. 2006 is consistent with and supports the general approach taken by LinkHub of creating specialized filters (in the form of word weight vectors) for retrieval of documents specific to particular proteomics identifiers. It is arguably state of the art for focused tasks, superior to the most commonly used search technology of Google; by extension, LinkHub-based search is also.
48. Acknowledgements
- Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner
- Kei Cheung
- Michael Krauthammer
- Kevin Yip
- Yale Semantic Web Interest Group
- National Library of Medicine (NLM)
49. Term Selection / Weighting using Term Associations in the Corpus
- There is lots of information in the originating UniProt, GO, PFAM, etc. pages and in PubMed → use it. Tune the classifier to the actual content in PubMed (word freqs). TF-IDF does this crudely; let's do better.
- Conceptually, a "good" term is one that is strongly associated in the corpus (i.e. much above the background chance association in the corpus) with many other terms (which themselves have similar strong associations) from the originating documents.
- Strength of association is the ratio of a term's doc freq in the search results over its background doc freq in the corpus (PubMed); > 1 means association.
50. Example associations in PubMed for the term "browser"
51. Simple Algorithm for Term Selection / Weighting
- Compute a combined term frequency vector for the originating docs.
- For each term, do a search with it against PubMed and compute a combined doc freq vector for the result set. Filter terms which aren't present at highly above background chance.
- Compute the cosine similarity between these two vectors → larger means more relevant.
- In tests I've done, this seems to do well, particularly for filtering noise terms (which almost always have a score close to 0).
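The association-strength measure underlying this algorithm can be sketched as a hypothetical helper mirroring the ratio definition on the previous slide (all counts in the example are illustrative):

```python
def association(term, result_dfs, n_results, corpus_dfs, n_corpus):
    """Association strength of `term` with a search result set: its
    doc-frequency rate in the results divided by its background rate in
    the corpus (PubMed). Values > 1 indicate association."""
    result_rate = result_dfs.get(term, 0) / n_results
    background_rate = corpus_dfs.get(term, 0) / n_corpus
    # Terms unseen in the corpus get 0 here, for simplicity.
    return result_rate / background_rate if background_rate else 0.0

# Illustrative: "kinase" is in 40% of the results but only 1% of the
# corpus, giving a strong association score of 40.
print(association("kinase", {"kinase": 40}, 100, {"kinase": 1000}, 100000))
```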
52. Network analysis of the Term Association Graph
- Can create a directed graph of word inter-associations; edges can be weighted by strength of association.
- Do network analysis to aid term selection / weighting:
- Find disconnected components → independent concept groups.
- Find hubs → likely central concepts in the originating documents, so good features.
- Find cliques → groups of words that are all strongly co-associated with each other; the bigger the better.
- Google PageRank for term selection and weighting.
53. Example Maximal Clique for Yeast Adenylate Kinase (UniProt P26364)
54. Largest cliques for "pfam", "scop", and "embl"
Small, and not overlapping with the central largest clique → not key terms. I noticed that noise terms mostly have empty or small maximal cliques.
55. Google PageRank on the term association graph
- Actually, this doesn't work well.
- It seems to most highly weight relevant but more general terms (e.g. "cell", "pfam", etc.).
- In retrospect, this makes sense: PageRank tries to find nodes which are linked to by many other nodes and link out to many other nodes, and this is intuitively a signature of more general terms (i.e. they will have a larger, more diffuse set of associations than more focused terms).
- There are some results in the literature that PageRank applied to scientific citation graphs does not work well. Maybe Google really isn't God!