1
Proteomics Data Interoperation with Applications
to Integrated Datamining and Enhanced Information
Retrieval
  • Andrew Smith
  • Thesis Defense
  • 8/25/2006
  • Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner (UC Berkeley)

2
Outline
  • Problem Description --- note that today's focus is more on the enhanced information retrieval aspects.
  • Data set integration and interoperation (hubs
    connecting information resources) using the
    semantic web.
  • YeastHub: lightweight RDF data warehouse
  • LinkHub: biological identifiers and their relationships
  • LinkHub supports enhanced information retrieval /
    web search
  • Relational queries for node-attached documents
  • Enhanced automated information retrieval
  • The web
  • PubMed (biomedical scientific literature)
  • Empirical performance evaluation for yeast
    proteins
  • Related Work and Conclusions

3
Problem Description
  • Integration and interoperation of structured,
    relational data that is
  • Large-scale
  • Widely distributed
  • Independently maintained
  • Leverage such data for important applications
  • Cross-database queries and datamining
  • Enhanced information retrieval / web search

4
Characteristics of Biological Data
  • Biological data, especially proteomics data, is
    the motivation and domain of focus.
  • Vast quantities of data from high throughput
    experiments (genome sequencing, structural
    genomics, microarray, etc.)
  • Huge and growing number of biological data resources: distributed, heterogeneous, with large variance in size.
  • Need for better interoperation
  • Practical domain to work in, good match to
    problem, challenging but not overly complex.
  • Rich semantic relationships among data resources,
    but often not made explicit.

5
(Very) Brief Proteomics Overview
The Central Dogma of Biology states that the coded genetic information hard-wired into DNA (i.e., the genome) is transcribed into individual transportable cassettes composed of messenger RNA (mRNA); each mRNA cassette contains the program for synthesis of a particular protein (or small number of proteins).
Proteins are the key agents in the cell. Proteomics is the large-scale study of proteins, particularly their structures and functions.
While the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment. One organism has radically different protein expression in different parts of its body, in different stages of its life cycle, and in different environmental conditions.
6
Proteins are modular, composed of domains or
families (based on evolution)
7
Problem with the Web
8
Data Heterogeneity
  • Lack of standard detailed description of
    resources
  • Data are exposed in different ways
  • Programmatic interfaces
  • Web forms or pages
  • FTP directory structures
  • Data are presented in different ways
  • Structured text (e.g., tab delimited format and
    XML format)
  • Free text
  • Binary (e.g., images)

9
Classical Approaches to Interoperation: Data Warehousing and Federation
  • Data Warehousing
  • Focuses on data translation
  • Translate data to a common format under a unified schema, cross-reference, store, and query in a single machine/system.
  • Federation
  • Focuses on query translation
  • Translating and distributing the parts of a query across multiple distinct, distributed databases and collating their results into a single result.

10
General Strategy for Interoperation
  • Data is vast, distributed, and independently maintained → complete centralized data-warehousing integration is impractical.
  • Must rely on federated, cooperative, loosely coupled solutions which allow partial and incremental progress.
  • Widely used and supported standards are necessary to enable such independent, loosely coupled, cooperative integration.
  • The Semantic Web is an excellent fit to these needs.

11
Semantic Web: Resource Description Framework (RDF) Models Data as a Directed Graph with Objects and Relationships Named by URIs
[RDF graph figure: the resource http://www.ncbi.nlm.nih.gov/SNP has a dc:creator (http://purl.org/dc/elements/1.1/creator) of http://www.ncbi.nlm.nih.gov and a dc:language (http://purl.org/dc/elements/1.1/language) of "en"]

The above RDF graph serialized in XML:

    <?xml version="1.0"?>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:ex="http://www.example.org/terms/">
      <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
        <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"/>
        <dc:language>en</dc:language>
        <dc:date>…</dc:date>
      </rdf:Description>
    </rdf:RDF>
12
YeastHub: Lightweight RDF Data Warehouse
13
Name / ID proliferation problem
  • Identifiers for biological entities are a simple but key way to identify and interrelate the entities: an important scaffold for biological data.
  • But often many synonyms exist for the same entity, including strange legacy names, e.g. the fly gene called "sonic hedgehog".
  • Even simple syntactic variants can be cumbersome, e.g. GO:0008150 vs. GO0008150 vs. GO-8150, etc.
  • Not just synonyms: many kinds of relationship, one-to-many mappings.
  • Known relationships among entities are not always stored, or are stored in non-standard ways.
  • The implicit overall data structure is an enormous, elaborate graph of relationships among biological entities.

14
Complex Biological Relationships
15
LinkHub
  • The major proteomics hub UniProt performs centralized identifier mapping for large, well-known databases.
  • Large staff, resource intensive, manual curation → a centralization bottleneck. Not viable as a complete solution → need distributed, collaborative integration based on standards.
  • Just simple mappings, with no specification of relationship types.
  • Many data resources are not covered (e.g., small, transient, local, lab-specific, boutique).
  • Need for a system / toolkit to enable local, collaborative integration of data → allow people to create mini-UniProts and connect them → LinkHub.
  • Practically, LinkHub provides a common "links portal" into a lab's resources, also connecting them to larger resources.
  • LinkHub is used this way for the Gerstein Lab and NESG, connecting them to the major hub UniProt.

16
Major / Minor Hubs and Spokes Federated Model
  • LinkHub acts as local minor hubs to connect groups of common resources, with a single common connection to major hubs → a more efficient organization of biological data.

17
LinkHub Data Model
  • LinkHub has identifier types, identifiers, mappings between identifiers (a mapping-type attribute gives the relationship type), and resources together with the identifier types they accept (and where) in their URL templates (link_exceptions gives exceptions). A data-model sketch follows below.
  • LinkHub is stored in both RDF (Sesame) and SQL (MySQL), with translation between the two: MySQL for robustness and efficiency in the GUI frontend, RDF to complement YeastHub (as glue for making direct and indirect identifier connections).
  • Perl scripts and web crawlers maintain it. DHTML GUI web frontend. Tools for fast, exact sequence matching for cross-referencing.
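
A minimal sketch of this data model as Python dataclasses; the class and field names are illustrative assumptions, not LinkHub's actual MySQL schema or RDF vocabulary.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class IdentifierType:
        name: str                        # e.g. "UniProt accession"

    @dataclass
    class Identifier:
        value: str                       # e.g. "P26364"
        id_type: IdentifierType

    @dataclass
    class Mapping:
        source: Identifier
        target: Identifier
        mapping_type: str                # the relationship type of the mapping

    @dataclass
    class Resource:
        name: str
        url_template: str                # where an accepted identifier is spliced in
        accepted_types: List[IdentifierType]
        link_exceptions: List[Identifier]   # identifiers the template cannot link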

18
LinkHub Web GUI: Links Portal
19
Example YeastHub / LinkHub queries
  • Query 1: Finding Worm Interologs of Yeast Protein Interactions
  • For each yeast gene in an interacting pair, find corresponding WormBase genes: yeast gene name → UniProt accession → Pfam accession → UniProt accession → WormBase ID
  • Query 2: Exploring Pseudogene Content versus Gene Essentiality in Yeast and Humans
  • yeast gene name → UniProt accession → yeast pseudogene
  • yeast gene name → UniProt accession → Pfam accession → human UniProt ID → UniProt accession → Pseudogene LSID
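
A minimal sketch of evaluating such typed path queries by walking a mapping graph. The graph layout and the toy identifiers are assumptions for illustration (P26364 is the deck's example UniProt accession; the rest are made up, not real accessions).

    def follow_path(mappings, start_id, type_path):
        """Return identifiers reachable from start_id by following
        neighbors whose identifier types match type_path in order."""
        frontier = {start_id}
        for wanted_type in type_path:
            frontier = {nbr
                        for ident in frontier
                        for nbr, nbr_type in mappings.get(ident, [])
                        if nbr_type == wanted_type}
        return frontier

    # Toy mapping graph: identifier -> [(neighbor, neighbor_type), ...]
    mappings = {
        "ADK1":         [("P26364", "uniprot")],
        "P26364":       [("PFAM_ADK", "pfam")],
        "PFAM_ADK":     [("UNIPROT_WORM", "uniprot")],
        "UNIPROT_WORM": [("WORMBASE_GENE", "wormbase")],
    }
    # yeast gene name -> UniProt -> Pfam -> UniProt -> WormBase ID
    print(follow_path(mappings, "ADK1", ["uniprot", "pfam", "uniprot", "wormbase"]))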

20
LinkHub Support for Enhanced Information
Retrieval / Web Search
21
Web Information Management and Access Paradigms
  • Search Engines
  • Automated
  • People can publish in natural languages
  • Vast coverage (almost the whole web)
  • Currently the preeminent paradigm because of the vast size of the web and its unstructured heterogeneity
  • Unfortunately, only gives coarse-grained topical access, with no real cross-site interoperation/analysis
  • Semantic Web
  • Very fine-grained data modeling and connection; very precise cross-resource query / question answering supported
  • Unfortunately, requires much more manual intervention; people must change how they publish (RDF, OWL, etc.). Thus, limited acceptance and size

22
Combining Search and Semantic Web Paradigms
  • Currently, these two paradigms are largely independent. However, they seem to have complementary strengths and weaknesses.
  • Key idea: these two approaches to web information management and retrieval can work together and complement one another, and there are interesting, practical, and useful ways they can work with, leverage, and enhance each other.

23
Combining Relational and Keyword-based Search
Access to Free Text Documents
  • Consider a query for all documents containing information about proteins which are members of the Pfam Adenylate Kinase (ADK) family.
  • Standard keyword-based search engines couldn't support this.
  • Relational information about documents is required, i.e. that they are related to proteins in the Pfam ADK family.
  • LinkHub attaches documents to identifier nodes and supports such relational query access to them.

24
LinkHub Path Type Queries
  • View all paths in the LinkHub graph matching specific relationship types, e.g. "family views":
  • PDB ID → UniProt ID → Pfam family → UniProt ID → PDB ID → MolMovDB Motion
  • NESG ID → UniProt ID → Pfam family → UniProt ID → NESG ID
  • Practically used as a secondary, orthogonal interface to other databases.
  • MolMovDB and NESG's SPINE both use LinkHub for such family views.

25
Using Semantic Web for Enhanced Automated
Information Retrieval
  • Basic idea: the semantic web provides detailed information about terms and their interrelationships which can be used as additional information to improve web searches for those terms (and related terms).
  • As proof of concept, the particular problem we address is finding additional relevant documents for proteomics identifiers on the web or in the scientific literature.

26
Finding Additional Relevant Documents for
Proteomics Identifiers
  • Why not just do a web search directly for the identifier, e.g. P26364? → likely bad results.
  • Conflated senses of the identifier text, e.g. product catalog codes, etc.
  • There might be synonyms of the identifier → should search these too (the LinkHub graph has them).
  • Many important, relevant documents might not directly mention the identifier, e.g. pages generally about cancer pathways that do not specifically contain the identifier for a given protein in a cancer pathway. Should search for important related concepts.
  • LinkHub has lots of extra info → use it!

27
Using LinkHub Subgraphs as Gold-standard Training
Sets
  • The LinkHub subgraph emanating from a given identifier, together with the web pages (hyperlinks) attached to the identifiers in the subgraph, is concrete, accurate, extra information about the given identifier that can be used to improve document retrieval for it.
  • LinkHub subgraphs and associated documents for a given identifier are used as a training set to build classifiers that rank documents obtained from the web or scientific literature.

28
Training Set Docs are Scaled Down Based on Distance and Link Types from the Central Identifier
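
This slide's figure is not reproduced here. As a rough sketch of the idea (the actual decay factors were tuned empirically; see the parameter optimization later), a document's training weight can be taken as the product of per-link-type factors along its path from the central identifier:

    # Assumed per-link-type decay factors (illustrative only).
    LINK_TYPE_WEIGHT = {"synonym": 1.0, "pfam": 0.5, "go": 0.3}

    def doc_weight(path_link_types):
        """Weight for a document attached at the end of a mapping path:
        multiplying one factor per hop makes the weight fall off with both
        distance (more factors) and weaker link types (smaller factors)."""
        w = 1.0
        for link_type in path_link_types:
            w *= LINK_TYPE_WEIGHT.get(link_type, 0.1)
        return w

    print(doc_weight(["pfam", "synonym"]))   # two hops away via Pfam: 0.5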
29
Classifier for Document Relevance Reranking
  • Use standard information retrieval techniques: tokenization, stopword filtering, word stemming, TF-IDF term weighting, and cosine similarity measures.
  • Classifier model: add the weighted subgraph documents' word vectors, TF-IDF weight the result, and take the top 20 weighted terms. (A sketch follows below.)
  • A standard cosine similarity value is used to score documents against the classifier.
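
A minimal sketch of this classifier construction, assuming a toy tokenizer in place of the full tokenization/stopword/stemming pipeline:

    import math
    import re
    from collections import Counter

    def tokenize(text):
        # Toy tokenizer standing in for tokenization, stopword
        # filtering, and stemming.
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_classifier(weighted_docs, doc_freq, n_docs, top_k=20):
        """weighted_docs: (text, weight) pairs from the LinkHub subgraph.
        doc_freq: term -> number of corpus documents containing it."""
        combined = Counter()
        for text, weight in weighted_docs:
            for term, tf in Counter(tokenize(text)).items():
                combined[term] += weight * tf
        tfidf = {t: tf * math.log(n_docs / (1 + doc_freq.get(t, 0)))
                 for t, tf in combined.items()}
        top = dict(sorted(tfidf.items(), key=lambda kv: -kv[1])[:top_k])
        norm = math.sqrt(sum(v * v for v in top.values())) or 1.0
        return {t: v / norm for t, v in top.items()}

    def cosine_score(classifier, text):
        """Standard cosine similarity of a document against the classifier."""
        vec = Counter(tokenize(text))
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm == 0:
            return 0.0
        return sum(classifier.get(t, 0.0) * v for t, v in vec.items()) / norm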

30
Obtaining Documents to Rank
  • Use major web search engines via their web APIs. For demo purposes, we used Yahoo.
  • Perform individual base searches for each of the top 40 training-set feature words and the identifiers in the subgraph; combine all results into one large result set.
  • Rerank the combined result set using the constructed classifier.
  • Essentially, this systematically explores the concept space around the identifier.
  • Searches returning the most relevant docs on average could be called "semantic signatures": key concepts related to the given identifier, of interest in their own right as succinct snippets of what the identifier is about.
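
A minimal sketch of this pooling-and-reranking loop; `web_search` is a hypothetical stand-in for a real search API (the demo used Yahoo's web API), and `score` is a document-scoring function such as the cosine scorer sketched above:

    def retrieve_and_rerank(score, feature_terms, identifiers, web_search,
                            per_query=20):
        """Pool base-search results for each feature term and identifier,
        dedupe by URL, then rerank the pool by classifier score."""
        pooled = {}
        for query in list(feature_terms) + list(identifiers):
            for url, text in web_search(query, per_query):
                pooled[url] = text
        return sorted(pooled.items(), key=lambda kv: score(kv[1]), reverse=True)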

31
Example: UniProt P26364
Note: a direct Yahoo search for P26364 returned very poor results. In manual inspection of the results, 17/40 clearly had nothing to do with the UniProt protein. Many of the others didn't seem too useful: large tabular dumps of identifiers, etc. The first clearly unrelated result in LinkHub's results was at position 72. LinkHub's results are arguably better.
32
PubMed Application
  • PubMed is a database of all scientific literature citations for roughly the last 50-100 years.
  • Currently, there is no automated information retrieval of PubMed for biological identifier-related citations.
  • Built an app to search for related PubMed abstracts, using Swish-e to index and provide base search access to PubMed.

33
PubMed search for UniProt P26364
Manual annotations exist (above): only 3 and 4 are directly related --- the LinkHub-based automated method ranked these 13 and 7 and returned many more relevant docs.
34
Empirical Performance Tests
  • The preceding results seemed reasonably good, but can we empirically measure performance?
  • We need a gold-standard set of documents (a curated bibliography) that we know are highly related to particular identifiers → gene_literature.tab from the Saccharomyces Genome Database (SGD).

35
Goals of Performance Tests
  • Quantify the performance level of the procedure: is performance close to optimal, or is there lots of room for improvement?
  • How can we know that adding in (downweighted) documents for related identifiers actually helps?
  • Proof of concept: PFAM and GO are key related concepts for proteins; let's objectively see if they help.
  • Quantify the performance of a particular enhancement: the pre-IDF step.

36
Pre-Inverse Document Frequency (pre-IDF) step
  • Idea: maximally separate all the proteomics identifiers' classifiers while at the same time making each as specifically relevant and discriminating as possible.
  • Determine document frequencies for all pages of a type, e.g. all (or a sample of) UniProt pages.
  • Perform IDF against the type's doc freqs and then again against the corpus you are searching, e.g. first UniProt then PubMed.
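
A minimal sketch of this double discounting under stated assumptions (the thesis's exact formula may differ): a term's weight is reduced once against the document frequencies of its page type, then again against the target corpus.

    import math

    def pre_idf_weight(tf, term, type_df, n_type_docs, corpus_df, n_corpus_docs):
        """tf: term frequency in the classifier's combined vector.
        type_df: doc freqs over pages of the type (e.g. all UniProt pages),
        which down-weights boilerplate shared by every page of that type.
        corpus_df: doc freqs over the search corpus (e.g. PubMed)."""
        idf_type = math.log(n_type_docs / (1 + type_df.get(term, 0)))
        idf_corpus = math.log(n_corpus_docs / (1 + corpus_df.get(term, 0)))
        return tf * idf_type * idf_corpus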

37
Pre-IDF Step is Generally Useful
  • For example, imagine wanting to find web pages highly relevant to a particular digital camera or mp3 player.
  • Cnet has many pages about different digital cameras and mp3 players → compute doc freqs for these.
  • Build a classifier for a particular digital camera by first doing the pre-IDF step against the doc freqs for all Cnet digital camera pages.

38
Experimental Protocol
  • Pick a few hundred random yeast proteins from TrEMBL and Swiss-Prot separately, each with at least 20 citations and GO and PFAM relations.
  • A given protein's citations are its "in" group
  • Other proteins' citations are the "mid" group
  • Randomly selected PubMed citations (not in gene_lit.tab or UniProt) are an "out" group
  • The classifier should rank in > mid > out; the degree of deviation from this order is the objective test.

39
Performance Measures
  • The "in" group is most important; focus on it.
  • Area under the ROC curve (AUC) for the "in" group.
  • ROC is robust to unknown or skewed class distributions
  • Useful when error costs are not known
  • Look at the full AUC and also the top 5 AUC (.05 AUC) → it is how well you do at the top of the rankings that really matters.
  • Perform optimization over parameters at 0.1 granularity: PFAM and GO weights, percentage of features kept, and use/non-use of the pre-IDF step. UniProt page weight set to 1.0.
  • Compare average AUC values for various parameter values, and determine statistical significance with paired t-tests.
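
A minimal sketch of these measurements using standard scikit-learn and SciPy routines (not the thesis's original code):

    from sklearn.metrics import roc_auc_score
    from scipy.stats import ttest_rel

    def in_group_auc(scores, labels, top=False):
        """labels: 1 for 'in'-group citations, 0 otherwise.
        With top=True, max_fpr=0.05 gives a partial AUC roughly
        analogous to the '.05 AUC' above."""
        return roc_auc_score(labels, scores, max_fpr=0.05 if top else None)

    def compare_settings(per_protein_aucs_a, per_protein_aucs_b):
        """Paired t-test over per-protein AUCs for two parameter settings."""
        return ttest_rel(per_protein_aucs_a, per_protein_aucs_b)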

40
Results
  • Percentage of features kept didn't really matter → just use a smaller value, 0.4 or 0.5, for computational efficiency.
  • Pre-IDF gave the largest performance boost: any trial with pre-IDF gave a better result than one without, regardless of other parameters.
  • Pre-IDF and PFAM at a small weight increased the average AUC for all trials for both TrEMBL and Swiss-Prot. GO did not help (PFAM is more information-rich).
  • TrEMBL was helped more than Swiss-Prot
  • Swiss-Prot is curated, high quality, and complete, whereas TrEMBL is automated and lower quality → it makes sense that PFAM helped TrEMBL more than Swiss-Prot.
  • All comparisons for TrEMBL were statistically significant by t-tests; only pre-IDF was for Swiss-Prot.

41
Results
[AUC results tables for TrEMBL and Swiss-Prot]
42
Related Work: Search Engines / Google
  • A small number of search terms is entered.
  • So little information to go by → maybe millions of result documents, hard to rank well.
  • Good ranking was a big problem for early search engines.
  • Google (Brin and Page 1998) provided a popular solution:
  • Hyperlinks as votes for importance
  • Rank web pages with the most and best votes highest.

43
Alternative to Millions of Result Documents: Increased Search Precision
  • Increase search precision so a smaller, more manageable number of documents is returned.
  • More search terms
  • Problem: users are lazy and won't enter too many terms.
  • Solution: the semantic web provides the increased precision → users just select semantic web nodes for concepts, and the nodes are automatically expanded to increase search precision.
  • This is what LinkHub does

44
Very Recent Related Work: Aphinyanaphongs et al. 2006
  • Argues for specialized, automated filters for finding relevant documents in the huge and ever-expanding scientific literature.
  • Constructed classifiers for predicting the relevance of PubMed documents for various clinical medicine themes:
  • State-of-the-art SVM classifiers
  • Used large, manually curated, respected bibliographies for training
  • Used text from the article title, abstract, journal name, and MeSH terms as features

45
Aphinyanaphongs et al. 2006, cont.
  • LinkHub-based search, by contrast:
  • Fairly basic classifier model: word weight vectors compared with cosine similarity.
  • Small training sets (UniProt, GO, PFAM pages)
  • Fairly noisy as well: web pages vs. focused text
  • Only abstract text used as features
  • Some gene_lit.tab citations are only generally relevant → true performance is understated
  • Classifiers are built automatically and easily at very large scale as a natural byproduct of LinkHub
  • But LinkHub's 0.927 and 0.951 AUCs are better than, or negligibly smaller than, the 0.893, 0.932, and 0.966 AUCs of Aphinyanaphongs et al. 2006

46
Aphinyanaphongs et al. 2006 and Citation Metrics
  • Also compared to citation-based metrics:
  • Citation count
  • Journal impact factor
  • Indirectly, Google PageRank
  • SVM classifiers outperformed these, and adding citation metrics as features gave marginal improvement at best.
  • Surprising given Google's success with PageRank

47
Aphinyanaphongs et al. 2006 and Citation Metrics, cont.
"More generally stated, the conceivable reasons for citation are so numerous that it is unrealistic to believe that citation conveys just one semantic interpretation. Instead citation metrics are a superimposition of a vast array of semantically distinct reasons to acknowledge an existing article. It follows that any specific set of criteria cannot be captured by a few general citation metrics and only focused filtering mechanisms, if attainable, would be able to identify articles satisfying the specific criteria in question."
Conclusion: Aphinyanaphongs et al. 2006 is consistent with and supports the general approach taken by LinkHub of creating specialized filters (in the form of word weight vectors) for retrieval of documents specific to particular proteomics identifiers. It is arguably state of the art for focused tasks, superior to the most commonly used search technology of Google; by extension, LinkHub-based search is also.
48
Acknowledgements
  • Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner
  • Kei Cheung
  • Michael Krauthammer
  • Kevin Yip
  • Yale Semantic Web Interest Group
  • National Library of Medicine (NLM)

49
Term Selection / Weighting using Term Associations in the Corpus
  • Lots of information exists in the originating UniProt, GO, PFAM, etc. pages and in PubMed → use it. Tune the classifier to the actual content in PubMed (word freqs). TF-IDF does this crudely; let's do better.
  • Conceptually, a "good" term is one that is strongly associated in the corpus (i.e. much above the background chance association in the corpus) with many other terms (which themselves have similar strong associations) from the originating documents.
  • Strength of association is the ratio of a term's doc freq in the search results over its background doc freq in the corpus (PubMed). > 1 means association.
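
A minimal sketch of that ratio:

    def association_strength(term, result_df, n_results, corpus_df, n_corpus):
        """Doc-frequency rate of `term` in a search's result set divided
        by its background rate in the corpus; > 1 means association."""
        result_rate = result_df.get(term, 0) / n_results
        background_rate = corpus_df.get(term, 0) / n_corpus
        if background_rate == 0:
            return 0.0                 # unseen in corpus; no evidence
        return result_rate / background_rate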

50
Example associations in PubMed for the term "browser"
51
Simple Algorithm for Term Selection / Weighting
  • Compute the combined term frequency vector for the originating docs.
  • For each term, do a search with it against PubMed and compute the combined doc frequency vector for the result set. Filter terms which aren't present highly above background chance.
  • Compute the cosine similarity between these two vectors → larger means more relevant. (A sketch follows below.)
  • In tests I've done, this seems to do well, particularly for filtering noise terms (which almost always have a score close to 0).
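
A minimal sketch of this scoring loop; `search_pubmed` is a hypothetical stand-in for the Swish-e-indexed PubMed search, and the above-background filtering step is noted but not implemented:

    import math
    from collections import Counter

    def score_term(term, originating_tf, search_pubmed):
        """Cosine similarity between the originating docs' combined term
        frequency vector and the doc frequency vector of the term's PubMed
        result set. Terms not well above background chance should be
        filtered beforehand (not shown)."""
        result_df = Counter()
        for text in search_pubmed(term):
            result_df.update(set(text.lower().split()))  # count once per doc
        dot = sum(originating_tf[t] * result_df[t]
                  for t in originating_tf.keys() & result_df.keys())
        na = math.sqrt(sum(v * v for v in originating_tf.values()))
        nb = math.sqrt(sum(v * v for v in result_df.values()))
        return dot / (na * nb) if na and nb else 0.0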

52
Network analysis of Term Association Graph
  • Can create a directed graph of word inter-associations; edges can be weighted by strength of association.
  • Do network analysis to aid term selection / weighting (a networkx sketch follows below):
  • Find disconnected components → independent concept groups.
  • Find hubs → likely central concepts in the originating documents, so good features.
  • Find cliques → groups of words that are all strongly co-associated with each other; the bigger the better.
  • Google PageRank for term selection and weighting.
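
A minimal networkx sketch of these analyses on a toy association graph (the terms and weights are made up for illustration):

    import networkx as nx

    G = nx.DiGraph()
    G.add_weighted_edges_from([
        ("kinase", "adenylate", 3.2), ("adenylate", "kinase", 2.9),
        ("kinase", "atp", 2.1), ("atp", "kinase", 1.8),
    ])  # edge weights = association strengths

    U = G.to_undirected()
    components = list(nx.connected_components(U))    # independent concept groups
    hubs = sorted(G.degree, key=lambda kv: -kv[1])   # likely central concepts
    cliques = list(nx.find_cliques(U))               # co-associated term groups
    ranks = nx.pagerank(G, weight="weight")          # PageRank term weighting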

53
Example: Maximal Clique for Yeast Adenylate Kinase (UniProt P26364)
54
Largest cliques for "pfam", "scop", and "embl": small, and not overlapping with the central largest clique → not key terms. I noticed that noise terms mostly have empty or small maximal cliques.
55
Google PageRank on the term association graph
  • Actually, this doesn't work well.
  • It seems to most highly weight relevant but more general terms (e.g. cell, pfam, etc.)
  • In retrospect, this makes sense: PageRank tries to find nodes which are linked to by many other nodes and link out to many other nodes, and this is intuitively a signature of more general terms (i.e. they will have a larger, more diffuse set of associations than more focused terms).
  • There are some results in the literature that PageRank applied to scientific citation graphs does not work well. Maybe Google really isn't God!