Title: Proteomics Data Interoperation with Applications to Integrated Datamining and Enhanced Information Retrieval
1. Proteomics Data Interoperation with Applications to Integrated Datamining and Enhanced Information Retrieval
- Andrew Smith
- Thesis Defense
- 8/25/2006
- Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner (UC Berkeley)
2. Outline
- Problem Description --- Note: the focus today is more on the enhanced information retrieval aspects.
- Data set integration and interoperation (hubs connecting information resources) using the Semantic Web
- YeastHub: lightweight RDF data warehouse
- LinkHub: biological identifiers and their relationships
- LinkHub supports enhanced information retrieval / web search
- Relational queries for node-attached documents
- Enhanced automated information retrieval
- The web
- PubMed (biomedical scientific literature)
- Empirical performance evaluation for yeast proteins
- Related Work and Conclusions
3. Problem Description
- Integration and interoperation of structured, relational data that is:
- Large-scale
- Widely distributed
- Independently maintained
- Leverage such data for important applications
- Cross-database queries and datamining
- Enhanced information retrieval / web search
4. Characteristics of Biological Data
- Biological data, especially proteomics data, is the motivation and domain of focus.
- Vast quantities of data from high-throughput experiments (genome sequencing, structural genomics, microarray, etc.)
- Huge and growing number of biological data resources: distributed, heterogeneous, with large variance in size.
- Need for better interoperation
- Practical domain to work in, a good match to the problem: challenging but not overly complex.
- Rich semantic relationships among data resources, but often not made explicit.
5. (Very) Brief Proteomics Overview
The Central Dogma of Biology states that the coded genetic information hard-wired into DNA (i.e., the genome) is transcribed into individual transportable cassettes composed of messenger RNA (mRNA); each mRNA cassette contains the program for synthesis of a particular protein (or small number of proteins).
Proteins are the key agents in the cell. Proteomics is the large-scale study of proteins, particularly their structures and functions.
While the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment. One organism has radically different protein expression in different parts of its body, in different stages of its life cycle, and in different environmental conditions.
6. Proteins are modular, composed of domains or families (based on evolution)
7. Problem with the Web
8. Data Heterogeneity
- Lack of standard, detailed description of resources
- Data are exposed in different ways:
- Programmatic interfaces
- Web forms or pages
- FTP directory structures
- Data are presented in different ways:
- Structured text (e.g., tab-delimited format and XML format)
- Free text
- Binary (e.g., images)
9. Classical Approaches to Interoperation: Data Warehousing and Federation
- Data Warehousing
- Focuses on data translation
- Translate data to a common format under a unified schema; cross-reference, store, and query in a single machine/system.
- Federation
- Focuses on query translation
- Translate and distribute the parts of a query across multiple distinct, distributed databases and collate their results into a single result.
10. General Strategy for Interoperation
- Data is vast, distributed, and independently maintained → complete centralized data-warehousing integration is impractical.
- Must rely on federated, cooperative, loosely coupled solutions which allow partial and incremental progress.
- Widely used and supported standards are necessary to enable such independent, loosely coupled, cooperative integration.
- The Semantic Web is an excellent fit to these needs.
11. Semantic Web: Resource Description Framework (RDF) Models Data as a Directed Graph with Objects and Relationships Named by URIs
Example graph: the resource http://www.ncbi.nlm.nih.gov/SNP has creator http://www.ncbi.nlm.nih.gov and language "en", using the Dublin Core properties http://purl.org/dc/elements/1.1/creator and http://purl.org/dc/elements/1.1/language.
The above RDF graph serialized in XML:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"/>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
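The directed-graph data model above can be sketched without any RDF library as a plain set of triples; the `objects` helper below is a made-up illustration, not a standard API:

```python
# Plain-Python sketch of the RDF model: data is a set of
# (subject, predicate, object) statements, with resources named by URI.
# These are the two statements from the example graph above.
triples = {
    ("http://www.ncbi.nlm.nih.gov/SNP",
     "http://purl.org/dc/elements/1.1/creator",
     "http://www.ncbi.nlm.nih.gov"),
    ("http://www.ncbi.nlm.nih.gov/SNP",
     "http://purl.org/dc/elements/1.1/language",
     "en"),
}

def objects(subject, predicate):
    """All objects attached to `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("http://www.ncbi.nlm.nih.gov/SNP",
              "http://purl.org/dc/elements/1.1/language"))  # {'en'}
```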
12. YeastHub: Lightweight RDF Data Warehouse
13. Name / ID proliferation problem
- Identifiers for biological entities are a simple but key way to identify and interrelate the entities: an important scaffold for biological data.
- But there are often many synonyms for the same entity, e.g. strange legacy names (e.g. a fly gene called "sonic hedgehog").
- Even simple syntactic variants can be cumbersome, e.g. GO:0008150 vs. GO0008150 vs. GO-8150, etc.
- Not just synonyms: there are many kinds of relationship, including one-to-many mappings.
- Known relationships among entities are not always stored, or are stored in non-standard ways.
- The implicit overall data structure is an enormous, elaborate graph of relationships among biological entities.
14. Complex Biological Relationships
15. LinkHub
- The major proteomics hub UniProt performs centralized identifier mapping for large, well-known databases.
- Large staff, resource-intensive, manual curation → centralization bottleneck. Not viable as a complete solution → need distributed, collaborative integration based on standards.
- Just simple mappings, with no specification of relationship types.
- Many data resources are not covered (e.g., small, transient, local, lab-specific, boutique).
- Need for a system / toolkit to enable local, collaborative integration of data → allow people to create mini UniProts and connect them → LinkHub.
- Practically, LinkHub provides a common links portal into a lab's resources, also connecting them to larger resources.
- LinkHub is used this way for the Gerstein Lab and NESG, connecting them to the major hub UniProt.
16. Major / Minor Hubs and Spokes: Federated Model
- LinkHub acts as local minor hubs to connect groups of common resources, with a single common connection to major hubs: a more efficient organization of biological data.
17. LinkHub Data Model
- LinkHub has identifier types, identifiers, mappings between identifiers (a mapping-type attribute gives the relationship type), and resources together with the identifier types they accept (and where) in their URL templates (link_exceptions gives exceptions).
- LinkHub is stored in both RDF (Sesame) and SQL (MySQL), with translation between the two: MySQL for robustness and efficiency for the GUI frontend, RDF to complement YeastHub (as glue to make direct and indirect identifier connections).
- Perl scripts and web crawlers maintain it. DHTML GUI web frontend. Tools for fast, exact sequence matching for cross-referencing.
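A miniature sketch of this data model in Python; the table contents, mapping-type names, and the Pfam URL template below are invented for illustration (only the UniProt-style URL pattern resembles the real one):

```python
# Hypothetical miniature of the LinkHub model: identifiers with types,
# typed mappings between identifiers, and resources that accept an
# identifier type at a slot in their URL template.
identifiers = {
    "P26364": "UniProt",   # identifier -> identifier type
    "PF00406": "Pfam",
}
# (from_id, to_id, mapping_type): the mapping type names the relationship.
mappings = [("P26364", "PF00406", "protein-in-family")]

# URL templates keyed by the identifier type they accept.
url_templates = {
    "UniProt": "http://www.uniprot.org/uniprot/{id}",
    "Pfam": "http://pfam.example.org/family/{id}",
}

def link_for(identifier):
    """Fill the resource URL template for this identifier's type."""
    return url_templates[identifiers[identifier]].format(id=identifier)

print(link_for("P26364"))  # http://www.uniprot.org/uniprot/P26364
```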
18. LinkHub Web GUI: Links Portal
19. Example YeastHub / LinkHub Queries
- Query 1: Finding Worm Interologs of Yeast Protein Interactions
- For each yeast gene in an interacting pair, find the corresponding WormBase genes: yeast gene name → UniProt accession → Pfam accession → UniProt accession → WormBase ID
- Query 2: Exploring Pseudogene Content versus Gene Essentiality in Yeast and Humans
- yeast gene name → UniProt accession → yeast pseudogene
- yeast gene name → UniProt accession → Pfam accession → human UniProt ID → UniProt accession → pseudogene LSID
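A typed-path traversal like Query 1 can be sketched as repeated expansion along mappings of a given type; all IDs and mapping-type names below are illustrative, not actual LinkHub data:

```python
# Toy mapping graph: edges keyed by (identifier, mapping_type).
# The identifiers and mapping-type names are invented for illustration.
edges = {
    ("YGENE1", "gene-to-uniprot"): ["P00001"],
    ("P00001", "uniprot-to-pfam"): ["PF99999"],
    ("PF99999", "pfam-to-uniprot"): ["Q00002"],
    ("Q00002", "uniprot-to-wormbase"): ["WBGene00000123"],
}

def follow_path(start, mapping_types):
    """Expand a frontier of identifiers along a sequence of typed mappings."""
    frontier = [start]
    for mtype in mapping_types:
        frontier = [nxt for node in frontier
                    for nxt in edges.get((node, mtype), [])]
    return frontier

print(follow_path("YGENE1", ["gene-to-uniprot", "uniprot-to-pfam",
                             "pfam-to-uniprot", "uniprot-to-wormbase"]))
# ['WBGene00000123']
```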
20. LinkHub Support for Enhanced Information Retrieval / Web Search
21. Web Information Management and Access Paradigms
- Search Engines
- Automated
- People can publish in natural languages
- Vast coverage (almost the whole web)
- Currently the preeminent paradigm because of the vast size of the web and its unstructured heterogeneity
- Unfortunately, only gives coarse-grained topical access; no real cross-site interoperation/analysis
- Semantic Web
- Very fine-grained data modeling and connection; very precise cross-resource query / question answering supported
- Unfortunately, requires much more manual intervention; people must change how they publish (RDF, OWL, etc.). Thus, limited acceptance and size
22. Combining Search and Semantic Web Paradigms
- Currently, these two paradigms are largely independent. However, they seem to have complementary strengths and weaknesses.
- Key idea: these two approaches to web information management and retrieval can work together and complement one another, and there are interesting, practical, and useful ways they can leverage and enhance each other.
23. Combining Relational and Keyword-based Search Access to Free Text Documents
- Consider a query for all documents containing information about proteins which are members of the Pfam Adenylate Kinase (ADK) family.
- Standard keyword-based search engines couldn't support this.
- Relational information about the documents is required, i.e. that they are related to proteins in the Pfam ADK family.
- LinkHub attaches documents to identifier nodes and supports such relational query access to them.
24. LinkHub Path Type Queries
- View all paths in the LinkHub graph matching specific relationship types, e.g. family views:
- PDB ID → UniProt ID → Pfam family → UniProt ID → PDB ID → MolMovDB motion
- NESG ID → UniProt ID → Pfam family → UniProt ID → NESG ID
- Practically used as a secondary, orthogonal interface to other databases.
- MolMovDB and NESG's SPINE both use LinkHub for such family views.
25. Using the Semantic Web for Enhanced Automated Information Retrieval
- Basic idea: the semantic web provides detailed information about terms and their interrelationships, which can be used as additional information to improve web searches for those terms (and related terms).
- As proof of concept, the particular problem we address is finding additional relevant documents for proteomics identifiers on the web or in the scientific literature.
26. Finding Additional Relevant Documents for Proteomics Identifiers
- Why not just do a web search directly for the identifier, e.g. P26364? → likely bad results:
- Conflated senses of the identifier text, e.g. product catalog codes, etc.
- There might be synonyms of the identifier → should search these too (the LinkHub graph has them).
- Many important, relevant documents might not directly mention the identifier, e.g. pages generally about cancer pathways but not specifically containing the identifier for a given protein in a cancer pathway. Should search for important related concepts.
- LinkHub has lots of extra info → use it!
27. Using LinkHub Subgraphs as Gold-standard Training Sets
- The LinkHub subgraph emanating from a given identifier, together with the web pages (hyperlinks) attached to the identifiers in that subgraph, is concrete, accurate, extra information about the given identifier that can be used to improve document retrieval for that central identifier.
- LinkHub subgraphs and their associated documents for a given identifier are used as a training set to build classifiers that rank documents obtained from the web or the scientific literature.
28. Training set docs are scaled down based on distance and link-types from the central identifier
29. Classifier for Document Relevance Reranking
- Use standard information retrieval techniques: tokenization, stopword filtering, word stemming, TF-IDF term weighting, and cosine similarity measures.
- Classifier model: add the weighted subgraph documents' word vectors, TF-IDF weight the result, and take the top-weighted 20 terms.
- A standard cosine similarity value is used to score documents against the classifier.
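The classifier model can be sketched with the standard TF-IDF and cosine-similarity formulas (a simplified sketch: stemming and stopword filtering are omitted, and any counts used in examples are illustrative):

```python
import math
from collections import Counter

def tfidf(term_counts, doc_freqs, n_docs):
    """Weight raw term counts by (smoothed) inverse document frequency."""
    return {t: tf * math.log(n_docs / (1 + doc_freqs.get(t, 0)))
            for t, tf in term_counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_classifier(weighted_docs, doc_freqs, n_docs, top_k=20):
    """weighted_docs: (weight, token list) pairs from the LinkHub subgraph.
    Sum the weighted word vectors, TF-IDF weight, keep the top terms."""
    combined = Counter()
    for weight, tokens in weighted_docs:
        for t in tokens:
            combined[t] += weight
    weighted = tfidf(combined, doc_freqs, n_docs)
    return dict(sorted(weighted.items(), key=lambda kv: -kv[1])[:top_k])
```

Candidate documents are then scored as `cosine(classifier, tfidf(Counter(doc_tokens), doc_freqs, n_docs))`.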
30. Obtaining Documents to Rank
- Use major web search engines via their web APIs. For demo purposes, we used Yahoo.
- Perform individual base searches for each of the top 40 training set feature words and identifiers in the subgraph; combine all results into one large result set.
- Rerank the combined result set using the constructed classifier.
- Essentially, this systematically explores the concept space around the identifier.
- Searches returning the most relevant docs on average could be called "semantic signatures": key concepts related to the given identifier of interest, in their own right succinct snippets of what the identifier is about.
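The pool-and-rerank step above might look like the following sketch, where `search` stands in for a web-search API call (such as Yahoo's) and `score` for the classifier's cosine score; both are assumptions, not real APIs:

```python
def retrieve_and_rerank(terms, search, score):
    """Issue one base search per feature word / identifier, pool the
    hits (deduplicated by URL), then rerank the pool by classifier score.
    `search(term)` yields (url, tokens) pairs; `score(tokens)` -> float."""
    pooled = {}
    for term in terms:
        for url, tokens in search(term):
            pooled.setdefault(url, tokens)  # dedupe across base searches
    return sorted(pooled, key=lambda u: score(pooled[u]), reverse=True)

# Toy stand-ins for the search engine and the classifier:
hits = {
    "adenylate": [("u1", ["kinase", "adenylate"])],
    "kinase": [("u1", ["kinase", "adenylate"]), ("u2", ["catalog"])],
}
fake_search = lambda term: hits.get(term, [])
fake_score = lambda tokens: tokens.count("kinase")

print(retrieve_and_rerank(["adenylate", "kinase"], fake_search, fake_score))
# ['u1', 'u2']
```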
31. Example: UniProt P26364
Note: a direct Yahoo search for "P26364" returned very poor results. In a manual inspection of the results, 17/40 clearly had nothing to do with the UniProt protein, and many of the others didn't seem too useful (large tabular dumps of identifiers, etc.). The first clearly unrelated result in LinkHub's results was at position 72. LinkHub's results are arguably better.
32. PubMed Application
- PubMed is a database of scientific literature citations covering roughly the last 50-100 years.
- Currently there is no automated information retrieval of PubMed for biological identifier-related citations.
- Built an app to search for related PubMed abstracts, using Swish-e to index and provide base search access to PubMed.
33. PubMed search for UniProt P26364
Manual annotations exist (above). Only 3 and 4 are directly related --- the LinkHub-based automated method ranked these 13 and 7 and returned many more relevant docs.
34. Empirical Performance Tests
- The preceding results seemed reasonably good, but can we empirically measure performance?
- We need a gold-standard set of documents (a curated bibliography) that we know are highly related to particular identifiers → gene_literature.tab from the Saccharomyces Genome Database (SGD).
35. Goals of Performance Tests
- Quantify the performance level of the procedure: is performance close to optimal, or is there lots of room for improvement?
- How can we know that adding in (downweighted) documents for related identifiers actually helps?
- Proof of concept: PFAM and GO are key related concepts for proteins; let's objectively see if they help.
- Quantify the performance of a particular enhancement: the pre-IDF step.
36. Pre-Inverse Document Frequency (pre-IDF) step
- Idea: maximally separate all the proteomics identifiers' classifiers while at the same time making them as specifically relevant and discriminating as possible.
- Determine document frequencies for all pages of a type, e.g. all (or a sample of) UniProt pages.
- Perform IDF against the type's doc freqs and then again against the corpus you are searching, e.g. first UniProt, then PubMed.
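The two IDF passes can be sketched as below, assuming the simple log-IDF form (the thesis's exact weighting formula may differ; all counts in the example are illustrative):

```python
import math

def idf(term, doc_freqs, n_docs):
    """Smoothed inverse document frequency."""
    return math.log(n_docs / (1 + doc_freqs.get(term, 0)))

def pre_idf_weight(term, tf, type_dfs, n_type, corpus_dfs, n_corpus):
    """IDF first against the page *type's* doc freqs (e.g. all UniProt
    pages), then against the search corpus (e.g. PubMed). Boilerplate
    terms common to every page of the type get near-zero (or negative)
    weight in the first pass and are effectively removed."""
    return tf * idf(term, type_dfs, n_type) * idf(term, corpus_dfs, n_corpus)

# Illustrative counts: "protein" appears on every page of the type,
# "kinase" only on a few, so "kinase" keeps far more weight.
type_dfs = {"protein": 100, "kinase": 5}
corpus_dfs = {"protein": 1000, "kinase": 50}
print(pre_idf_weight("kinase", 1.0, type_dfs, 100, corpus_dfs, 10000) >
      pre_idf_weight("protein", 1.0, type_dfs, 100, corpus_dfs, 10000))  # True
```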
37. Pre-IDF Step is Generally Useful
- For example, imagine wanting to find web pages highly relevant to a particular digital camera or mp3 player.
- Cnet has many pages about different digital cameras or mp3 players → doc freqs for these.
- Build a classifier for a particular digital camera by first doing the pre-IDF step against the doc freqs for all Cnet digital camera pages.
38. Experimental Protocol
- Pick a few hundred random yeast proteins from TrEMBL and Swiss-Prot separately, each with at least 20 citations and GO and PFAM relations.
- A given protein's citations are its "in" group.
- Other proteins' citations are the "mid" group.
- Randomly selected PubMed citations (not in gene_lit.tab or UniProt) are the "out" group.
- The classifier should rank documents in the order in > mid > out; the degree of deviation from this is the objective test.
39. Performance Measures
- The "in" group is the most important, so focus on it.
- Area under the ROC curve (AUC) for the "in" group.
- ROC is robust to unknown or skewed class distributions.
- Useful when error costs are not known.
- Look at full AUC and also top-5% AUC (.05 AUC) → it is how well you do at the top of the rankings that really matters.
- Perform optimization over parameters at 0.1 granularity: PFAM and GO weights, percentage of features kept, and use / don't use the pre-IDF step. UniProt page weight set to 1.0.
- Compare average AUC values for various parameter values, and determine statistical significance with paired t-tests.
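For concreteness, AUC for the "in" group can be computed directly from a ranked list of in/not-in labels via the standard pair-counting identity (a sketch, not the thesis's evaluation code):

```python
def auc(ranked_labels):
    """AUC from a best-first ranking of booleans (True = "in" group):
    the fraction of (positive, negative) pairs ranked in the correct
    order, counted in one pass from the bottom of the ranking up."""
    pos = sum(ranked_labels)
    neg = len(ranked_labels) - pos
    correct = 0
    seen_neg = 0
    for label in reversed(ranked_labels):
        if not label:
            seen_neg += 1
        else:
            correct += seen_neg  # this positive beats all negatives below it
    return correct / (pos * neg)

print(auc([True, True, False, False]))  # 1.0: a perfect ranking
```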
40. Results
- Percentage of features kept didn't really matter → just use a smaller value (0.4 or 0.5) for computational efficiency.
- Pre-IDF gave the largest performance boost: any trial with pre-IDF gave a better result than one without, regardless of the other parameters.
- Pre-IDF and PFAM at small weight increased average AUC across all trials for both TrEMBL and Swiss-Prot. GO did not help (PFAM is more information-rich).
- TrEMBL was helped more than Swiss-Prot.
- Swiss-Prot is curated, high quality, and complete, whereas TrEMBL is automated and lower quality → it makes sense that PFAM helped TrEMBL more than Swiss-Prot.
- All comparisons for TrEMBL were statistically significant by t-tests; only pre-IDF was for Swiss-Prot.
41. Results
[Results tables for TrEMBL and Swiss-Prot]
42. Related Work: Search Engines / Google
- A small number of search terms is entered.
- So little information to go by → maybe millions of result documents; hard to rank well.
- Good ranking was a big problem for early search engines.
- Google (Brin and Page 1998) provided a popular solution:
- Hyperlinks as votes for importance
- Rank web pages with the most and best votes highest.
43. Alternative to Millions of Result Documents: Increased Search Precision
- Increase search precision so a smaller, more manageable number of documents is returned.
- More search terms?
- Problem: users are lazy and won't enter too many terms.
- Solution: the semantic web provides the increased precision → users just select semantic web nodes for concepts, and the nodes are automatically expanded to increase search precision.
- This is what LinkHub does.
44. Very Recent Related Work: Aphinyanaphongs et al. 2006
- Argues for specialized, automated filters for finding relevant documents in the huge and ever-expanding scientific literature.
- Constructed classifiers for predicting the relevance of PubMed documents for various clinical medicine themes.
- State-of-the-art SVM classifiers.
- Used large, manually curated, respected bibliographies for training.
- Used text from the article title, abstract, journal name, and MeSH terms as features.
45. Aphinyanaphongs et al. 2006, cont.
- LinkHub-based search, by contrast:
- Fairly basic classifier model: word weight vectors compared with cosine similarity.
- Small training sets (UniProt, GO, PFAM pages).
- Fairly noisy as well: web pages vs. focused text.
- Only abstract text used as features.
- Some gene_lit.tab citations are only generally relevant → true performance is understated.
- Classifiers are built automatically and easily at very large scale as a natural byproduct of LinkHub.
- But LinkHub's .927 and .951 AUCs are better than, or negligibly smaller than, the 0.893, 0.932, and 0.966 AUCs of Aphinyanaphongs et al. 2006.
46. Aphinyanaphongs et al. 2006 and Citation Metrics
- Also compared to citation-based metrics:
- Citation count
- Journal impact factor
- Indirectly, Google PageRank
- SVM classifiers outperformed these, and adding citation metrics as features gave marginal improvement at best.
- Surprising given Google's success with PageRank.
47. Aphinyanaphongs et al. 2006 and Citation Metrics, cont.
"More generally stated, the conceivable reasons for citation are so numerous that it is unrealistic to believe that citation conveys just one semantic interpretation. Instead citation metrics are a superimposition of a vast array of semantically distinct reasons to acknowledge an existing article. It follows that any specific set of criteria cannot be captured by a few general citation metrics and only focused filtering mechanisms, if attainable, would be able to identify articles satisfying the specific criteria in question."
Conclusion: Aphinyanaphongs et al. 2006 is consistent with and supports the general approach taken by LinkHub of creating specialized filters (in the form of word weight vectors) for retrieval of documents specific to particular proteomics identifiers. It is arguably state of the art for focused tasks, superior to the most commonly used search technology of Google; by extension, LinkHub-based search is also.
48. Acknowledgements
- Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner
- Kei Cheung
- Michael Krauthammer
- Kevin Yip
- Yale Semantic Web Interest Group
- National Library of Medicine (NLM)
49. Term Selection / Weighting using Term Associations in the Corpus
- There is lots of information in the originating UniProt, GO, PFAM, etc. pages and in PubMed → use it. Tune the classifier to the actual content in PubMed (word freqs). TF-IDF does this crudely; let's do better.
- Conceptually, a "good" term is one that is strongly associated in the corpus (i.e. much above the background chance association in the corpus) with many other terms (which themselves have similar strong associations) from the originating documents.
- Strength of association is the ratio of a term's doc freq in the search results over its background doc freq in the corpus (PubMed); > 1 means association.
50. Example associations in PubMed for the term "browser"
51. Simple Algorithm for Term Selection / Weighting
- Compute a combined term frequency vector for the originating docs.
- For each term, do a search with it against PubMed and compute a combined doc freq vector for the result set. Filter terms which aren't present at highly above background chance.
- Compute the cosine similarity between these two vectors → larger means more relevant.
- In tests I've done, this seems to do well, particularly for filtering noise terms (which almost always have a score close to 0).
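The association-strength measure underlying this algorithm can be sketched as a hypothetical helper mirroring the ratio definition on the previous slide (all counts in the example are illustrative):

```python
def association(term, result_dfs, n_results, corpus_dfs, n_corpus):
    """Association strength of `term` with a search result set: its
    doc-frequency rate in the results divided by its background rate in
    the corpus (PubMed). Values > 1 indicate association."""
    result_rate = result_dfs.get(term, 0) / n_results
    background_rate = corpus_dfs.get(term, 0) / n_corpus
    # Terms unseen in the corpus get 0 here, for simplicity.
    return result_rate / background_rate if background_rate else 0.0

# Illustrative: "kinase" is in 40% of the results but only 1% of the
# corpus, giving a strong association score of 40.
print(association("kinase", {"kinase": 40}, 100, {"kinase": 1000}, 100000))
```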
52. Network analysis of the Term Association Graph
- Can create a directed graph of word inter-associations; edges can be weighted by strength of association.
- Do network analysis to aid term selection / weighting:
- Find disconnected components → independent concept groups.
- Find hubs → likely central concepts in the originating documents, so good features.
- Find cliques → groups of words that are all strongly co-associated with each other; the bigger the better.
- Google PageRank for term selection and weighting.
53. Example Maximal Clique for Yeast Adenylate Kinase (UniProt P26364)
54. Largest cliques for "pfam", "scop", and "embl"
Small, and not overlapping with the central largest clique → not key terms. I noticed that noise terms mostly have empty or small maximal cliques.
55. Google PageRank on the term association graph
- Actually, this doesn't work well.
- It seems to most highly weight relevant but more general terms (e.g. "cell", "pfam", etc.).
- In retrospect, this makes sense: PageRank tries to find nodes which are linked to by many other nodes and link out to many other nodes, and this is intuitively a signature of more general terms (i.e. they will have a larger, more diffuse set of associations than more focused terms).
- There are some results in the literature that PageRank applied to scientific citation graphs does not work well. Maybe Google really isn't God!