FlyWeb: the way to go for biological data integration

Provided by: openfl
Learn more at: http://openflydata.org

1
FlyWeb: the way to go for biological data integration
  • Jun Zhao, Alistair Miles and Graham Klyne
  • Image Bioinformatics Research Group
  • Department of Zoology
  • University of Oxford

2
FlyWeb Application
  • To answer questions about "what does this gene
    do?"
  • Gene Expression Images
  • Sequence and ESTs (Expressed sequence tags) of
    the gene
  • Publications about the gene
  • ....
  • A first example of the Image Web that our group
    is developing
  • Investigate the feasibility of existing Semantic
    Web tools and technologies for real applications

3
Gene expression images
  • Reveal gene expression patterns at different
    developmental stages
  • Important for identifying genes of interest and
    verifying a picture of probable gene functions

4
FlyWeb demonstration
  • http://openflydata.org/flyui/build/apps/imagemashup2/
  • Run application go
  • Two examples
  • Single gene query (aos1)
  • Use gene synonyms to enhance gene matching (rbf)

6
More than one synonym for gene rbf
8
How does it work?
  • Data from 3 independent sources
  • www.flybase.org: model organism reference
    database, gene names and identifiers
  • www.fruitfly.org (BDGP): embryo in situ images
  • www.fly-ted.org: testis in situ images
  • All data accessed via SPARQL
  • Pure Ajax user application
  • Essentially, a mashup using a SPARQL API
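
As a rough illustration of what "a mashup using a SPARQL API" involves, the sketch below builds a SPARQL protocol GET URL and flattens a SPARQL JSON result document into plain dictionaries. It is written in Python rather than the browser-side JavaScript the application uses, the function names are hypothetical, and the endpoint URL is the FlyBase one listed on the data sources slide. No request is actually sent.

```python
import json
from urllib.parse import urlencode

def build_query_url(endpoint, query):
    """Return a SPARQL protocol GET URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": query})

def bindings_to_rows(results_json):
    """Flatten a SPARQL JSON results document into a list of dicts."""
    doc = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]

# Endpoint as listed on the data sources slide; the query is a placeholder.
url = build_query_url("http://openflydata.org/query/flybase",
                      "SELECT * WHERE { ?s ?p ?o } LIMIT 1")

# A hand-written response in the SPARQL JSON results format (not real data).
sample = ('{"head": {"vars": ["symbol"]}, "results": {"bindings": '
          '[{"symbol": {"type": "literal", "value": "example"}}]}}')
rows = bindings_to_rows(sample)
```

A mashup would issue one such request per data source and combine the resulting rows in the client.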

9
The client side
  • FlyUI
  • a library of JavaScript widgets as front ends to
    SPARQL data sources
  • Built on Yahoo User Interface (YUI) library
  • Widgets are composed in a browser to create the
    complete application
  • Each widget provides:
  • A Service that implements SPARQL queries
  • A Model encapsulating SPARQL query results
  • A Renderer
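
The three-part split of a widget into Service, Model and Renderer can be sketched as follows. This is a minimal Python analogue, not FlyUI code: the class names, the `run_query` callable and the returned bindings are all assumptions made for illustration, and the real widgets are JavaScript built on YUI.

```python
class GeneService:
    """Service: turns a user request into a SPARQL query and fetches results."""

    def __init__(self, run_query):
        # run_query is any callable taking query text and returning a list
        # of {variable: value} dicts (an assumption of this sketch).
        self._run_query = run_query

    def find_genes(self, name):
        safe = name.replace("\\", "\\\\").replace('"', '\\"')
        query = 'SELECT * WHERE { ?gene ?p "%s" }' % safe
        return self._run_query(query)


class GeneModel:
    """Model: holds query results and notifies listeners when they change."""

    def __init__(self):
        self.genes = []
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def set_genes(self, genes):
        self.genes = list(genes)
        for listener in self._listeners:
            listener(self.genes)


class GeneRenderer:
    """Renderer: draws the model's current state (here, as text lines)."""

    def __init__(self):
        self.output = []

    def render(self, genes):
        self.output = ["%s (%s)" % (g["symbol"], g["gene"]) for g in genes]


def fake_endpoint(query):
    # Stand-in for a real SPARQL endpoint; returns made-up bindings.
    return [{"gene": "urn:example:gene1", "symbol": "example-gene"}]


service = GeneService(fake_endpoint)
model = GeneModel()
renderer = GeneRenderer()
model.subscribe(renderer.render)
model.set_genes(service.find_genes("example-gene"))
```

Because the renderer only listens to the model, several widgets could in principle share one model or swap services without touching the display code.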

[Figure: the in situ search application, showing the GeneFinder widget, the FlyTED image widget and the BDGP image widget]

10
Gene name mapping
  • FlyTED and BDGP use different gene names
  • FlyTED data derived from spreadsheets with
    imperfectly controlled gene name vocabulary
  • BDGP's data are annotated using FlyBase's unique
    FBgn numbers
  • Use FlyBase for automatic gene mapping
  • Additional inputs from scientists for
    disambiguating many-many mappings
  • Mappings are stored as a JSON file to assist the
    GeneFinder widget (having no use for RDF/OWL
    reasoning at this stage)
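
A JSON-based mapping of this kind might be used roughly as below. The file layout, the gene names and the FBgn identifiers are placeholders invented for this sketch; the slides do not describe the actual structure of the mapping file.

```python
import json

# Hypothetical mapping file content: FlyTED gene names -> FlyBase FBgn ids.
# The identifiers are placeholders, not real FlyBase numbers.
MAPPING_JSON = """
{
  "geneA":  ["FBgn0000001"],
  "gene-a": ["FBgn0000001"],
  "geneB":  ["FBgn0000002", "FBgn0000003"]
}
"""

def load_mapping(text):
    """Parse the JSON mapping and index it by lower-cased gene name."""
    raw = json.loads(text)
    return {name.lower(): ids for name, ids in raw.items()}

def flybase_ids(mapping, name):
    """Return all FBgn ids recorded for a name (empty list if unknown)."""
    return mapping.get(name.lower(), [])

def names_for(mapping, fbgn):
    """Reverse lookup: all names that map to a given FBgn id."""
    return sorted(n for n, ids in mapping.items() if fbgn in ids)

mapping = load_mapping(MAPPING_JSON)
```

Storing lists on both sides keeps many-to-many cases explicit, so a name such as the placeholder "geneB" can resolve to more than one identifier until a scientist disambiguates it.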

11
SPARQL queries
SELECT * WHERE {
  ?gene fbutil:anyName "userInput"^^xs:string ;
        a chado:Feature ;
        chado:name ?symbol ;
        chado:uniquename ?flybaseID .
  OPTIONAL { ?gene chado:dbxref [ chado:accession ?annotationSymbol ] . }
  OPTIONAL { ?gene chado:synonym [ chado:name ?synonym ] . }
  OPTIONAL { ?gene chado:synonym [ a syntype:FullName ; chado:name ?fullName ] . }
}
  • Free-text matching
  • Case-insensitive searching
  • Very important for our users
  • Too expensive using SPARQL FILTER
  • Pre-generate lower-case gene names and load into
    the FlyBase RDF DB
SELECT DISTINCT * WHERE {
  ?fullImageURL
    flyted:associatesToGene <http://openflydata.org/id/flyted/gene-geneName> ;
    flyted:associatesToGene ?gene ;
    flyted:thumbnail ?thumbnailURL ;
    rdfs:seeAlso ?flytedURL ;
    rdfs:label ?caption .
}

12
The RDF data sources
  • FlyBase and BDGP relational databases
  • FlyTED, an image repository built using EPrints
  • FlyAtlas (forthcoming), tissue-specific
    Drosophila gene expression levels, as a single
    spreadsheet

13
Creating RDF from data sources
  • D2RQ mapping
  • FlyBase and BDGP, native relational databases
  • Conservative mapping, with minimum interpretation
  • OAI2SPARQL
  • Harvesting N3 RDF metadata via the OAI-PMH
    protocol, with built-in support in EPrints
  • Further details in the ESWC 2008 paper
  • Custom Python program
  • FlyAtlas
  • Generating N3 from spreadsheet table

14
More about the data sources
  • Bulk download
  • http://openflydata.org/dump/flybase, 8M triples
  • http://openflydata.org/dump/bdgp, 1M triples
  • http://openflydata.org/dump/flyted, 30,000
    triples
  • SPARQL endpoints
  • http://openflydata.org/query/flybase
  • http://openflydata.org/query/bdgp
  • http://openflydata.org/query/flyted
  • Schema
  • http://purl.org/net/chado/schema/
  • http://purl.org/net/flybase/synonym-types/
  • http://purl.org/net/bdgp/schema/

15
SPARQL server
  • Amazon EC2 (Elastic Compute Cloud)
  • To run SPARQL endpoints
  • To host the demo you've just seen
  • Jena TDB as triple store
  • For better loading performance: 6K triples per
    second (tps) for 9M triples to Amazon Elastic
    Block Store (EBS)
  • For better querying performance
  • SPARQLite
  • home-grown SPARQL protocol implementation
  • More later
  • Apache, Tomcat, mod_jk, etc.

16
SPARQLite protocol
  • http://sparqlite.googlecode.com
  • Also, a platform for exploring SPARQL service
    quality concerns, more later
  • Motivation
  • Enable streaming
  • Create a database connection pool
  • Designed for Jena TDB and Jena SDB with Postgres
  • Restricted forms of query (SELECT, ASK)
  • Restricted query result format (e.g. only JSON)

17
Lessons
  • RDF provides a uniform and flexible data model
  • RDF dump is cheaper and quicker
  • Maintaining a separate SPARQL endpoint for each
    data source makes handling data updates easier
    than with a data warehouse approach
  • RDF facilitates data re-use and re-purposing
  • SPARQL raises the point of departure for an
    application
  • Benefits for the future
  • Linking to other data sources
  • Querying genes using the Fly Anatomy ontology
  • Magic of inference

18
Performance
  • Loading: our datasets total about 10 million triples
  • Jena / RDB / Postgres, OK with <1M triples
  • Jena / SDB / Postgres better, but problems with
    load performance with larger datasets
  • Jena / TDB gives much better load performance
    (6K tps), even on a 32-bit system with Amazon EBS
    storage (but not so good with local EC2 store)
  • Virtuoso performs reasonably well
  • Querying, particularly text matching and case
    insensitive search
  • Problems with using SPARQL regex filter, the only
    mechanism for case-insensitive search in SPARQL
  • Tried with OpenLink Virtuoso, still 10 seconds
    for a case-insensitive search
  • Any suggestions?
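
To make the contrast concrete, the sketch below builds both kinds of query: one using a case-insensitive regex filter, and one doing an exact match against a pre-generated lower-case name. The `ex:` vocabulary and property names are hypothetical, and the helper functions are illustrative only.

```python
PREFIX = "PREFIX ex: <http://example.org/>\n"  # hypothetical vocabulary

_REGEX_META = set(".\\?*+{}()[]|^$-")

def escape_regex(text):
    """Backslash-escape regex metacharacters so input is matched literally."""
    return "".join("\\" + ch if ch in _REGEX_META else ch for ch in text)

def sparql_string(text):
    """Quote text as a SPARQL string literal."""
    return '"%s"' % text.replace("\\", "\\\\").replace('"', '\\"')

def regex_filter_query(user_input):
    """Case-insensitive match via FILTER regex, tested against each name."""
    pattern = "^" + escape_regex(user_input) + "$"
    return (PREFIX +
            "SELECT ?gene WHERE {\n"
            "  ?gene ex:name ?name .\n"
            "  FILTER regex(str(?name), " + sparql_string(pattern) + ', "i")\n'
            "}")

def lowercase_lookup_query(user_input):
    """Exact match against a pre-generated lower-case name."""
    return (PREFIX +
            "SELECT ?gene WHERE {\n"
            "  ?gene ex:lowerName " + sparql_string(user_input.lower()) + " .\n"
            "}")

regex_query = regex_filter_query("Rbf")
lookup_query = lowercase_lookup_query("Rbf")
```

The second form gives the store a constant literal to look up, which is typically something it can index, whereas the filter generally has to be evaluated against every candidate name.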

19
Further lessons
  • SPARQL results streaming
  • Resolves out of memory errors for large datasets
  • Joseki / SDB / Postgres can be made to stream
    results, but using just a single JDBC connection,
    causing performance problems with concurrent
    requests
  • Therefore, SPARQLite
  • The openness of SPARQL
  • SPARQL is an inherently open query language and
    protocol
  • Open endpoints are vulnerable to simple queries
    that can overload the service, exposing them to
    denial-of-service style attacks (whether intended
    or not)
  • Futures: API key mechanism? Restricted SPARQL
    profiles?
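
The idea behind result streaming can be shown with a generator that writes a SPARQL JSON results document piece by piece, so the full result set is never held in memory at once. This is a simplified sketch and not SPARQLite's implementation: the function names are hypothetical, every value is written as a plain literal, and the row source is a stand-in for a database cursor.

```python
import json

def stream_select_results(variables, rows):
    """Yield a SPARQL JSON results document piece by piece.

    `rows` may be any iterator of {variable: value} dicts, so the caller
    can pass a lazy cursor instead of a fully materialised list.
    """
    yield '{"head": {"vars": %s}, "results": {"bindings": [' % json.dumps(variables)
    first = True
    for row in rows:
        binding = {
            var: {"type": "literal", "value": str(value)}
            for var, value in row.items()
        }
        yield ("" if first else ", ") + json.dumps(binding)
        first = False
    yield "]}}"

def fake_rows(n):
    # Stand-in for a database cursor producing rows lazily.
    for i in range(n):
        yield {"gene": "gene%d" % i}

chunks = stream_select_results(["gene"], fake_rows(3))
document = "".join(chunks)
```

In a server, each yielded chunk would be written to the HTTP response as it is produced; joining the chunks here is only to show that the pieces add up to a valid document.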

20
Future directions
  • Adding new data sources
  • FlyAtlas: tissue-specific Drosophila gene
    expression levels
  • More information from FlyBase, e.g. references
  • More applications
  • Find all the gene expression images of a gene's
    neighbours
  • Find all the genes related to blood
    pressure
  • ...
  • Linked data (dereferenceable, follow-your-nose)
  • We're thinking about this, but our application
    does not currently need it
  • How to control and predict quality of service for
    open SPARQL endpoints

21
Acknowledgements
  • Alistair Miles, Graham Klyne and David Shotton
  • Dr Helen White-Cooper and her research group
  • BBSRC, for funding the building of the FlyTED database
  • BDGP and FlyBase for making the data available
  • JISC, for funding the FlyWeb project
  • The Jena team, esp. Andy Seaborne