Quarrying Unfamiliar DataSpaces

1
Quarrying Unfamiliar DataSpaces
  • Bill Howe
  • David Maier
  • Nick Rayner
  • Sponsored by the NSF ITR Program 2001-2006
  • In collaboration with Antonio Baptista, Paul
    Turner, Yinlong Zhang, Sergey Frolov, and the
    entire CORIE Environmental Science Team at OGI

2
Dataspaces
  • DataSpace (DS)
  • Autonomous, heterogeneous data sources
  • grouped by an identifiable scope
  • with respect to a set of requirements
  • DataSpace Support Platform (DSSP)
  • A collection of best effort services
  • Catalog and Browse
  • Search and Query
  • Workflow (Events, Actions, and Monitoring)
  • Integrity checks/guarantees

From Databases to Dataspaces: A New
Abstraction for Information Management, Michael
Franklin, Alon Halevy, David Maier, SIGMOD Record,
December 2005.
3
Dataspaces vs. Databases
  • Databases:
  • Single Schema
  • Centralized Administration
  • Structured Query
  • Strict Integrity Constraints
  • Dataspaces:
  • Data Coexistence
  • Autonomous Sources
  • Search, Browse, Approximate Answer
  • Best Effort Guarantees

4
Dataspaces vs. Semantic Web
  • Dataspaces:
  • No ontology
  • Probably no inferencing
  • Pay as you go
  • Semantic Web:
  • Autonomous agents
  • crawling richly described autonomous data
    sources
  • which are related formally via ontologies

5
Dataspace Timeline
(Figure: a timeline along the axis "time, scope", running from
insular, application-specific databases to autonomous agents
crawling richly described data sources integrated via an ontology.)
6
Example: Scientific Data Repository
(Figure: repository contents. Atmospheric forcings, river forcings,
global ocean forcings, sensor data, ocean simulation results,
configuration and log files, annotations, data products.
Sample data product shown: salinity, /anim-sal_estuary_7.gif)
7
Example: Pharmacology
RxNav Interface developed by the National
Library of Medicine
8
Dataspace Timeline
(Figure: utility plotted against time and scope. Labels on the
chart: Insular Databases, Federated Databases, Data Integration
Tools, RDF/OWL, Dataspaces, Quarry and related tools, Semantic Web.)
9
Unfamiliar Dataspaces
  • No schema is available
  • No query workload is available
  • Browse is the dominant interaction
  • keys, ids, URIs not directly useful
  • properties and values carry the meaning
  • Goal: Maximize return on effort when working with
    an unfamiliar dataspace

10
Green Field Tools for Unfamiliar Dataspaces
  • Goal: A working, extensible application with the
    least possible effort
  • We need at least:
  • a Data Model
  • Lowest Common Denominator
  • minimal modeling decisions
  • an API
  • easy to use for domain experts
  • uniformly efficient

11
Outline
  • Dataspaces
  • Data Models for Dataspaces
  • Quarry Data Model
  • Quarry Storage
  • Quarry API
  • Experimental Results
  • Extensions
  • Related Work

12
Data Models
(Figure: data models placed by Modeling Costs (Low to High)
against Application Costs (Low to High). Models shown: RDFS/OWL,
XML, RDF, Object Models, Hypertext, Document/Text, NetCDF, HDF,
etc., Relational. Annotation: expressive power in terms of
structure, operations, and constraints.)
13
Quarry Data Model
  • resource, property, value
  • (subject, predicate, object) if you prefer
  • no intrinsic distinction between literal values
    and resource values
  • no explicit types or classes
  • no variables (no inference)
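
A minimal sketch of this data model in Python, assuming each statement is a plain (resource, property, value) tuple. The `Triple` type, the helper names, and the `source` property are hypothetical, chosen only for illustration; the sample values are modeled on the repository example used later in the talk.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    resource: str
    prop: str
    value: str  # a literal or another resource's id; the model does not tell them apart


# Illustrative statements; the "source" property is a made-up example.
TRIPLES = [
    Triple("101", "region", "estuary"),
    Triple("101", "variable", "salt"),
    Triple("101", "depth", "7"),
    Triple("336", "variable", "temp"),
    Triple("336", "source", "101"),  # the value happens to name a resource
]


def property_set(triples, resource):
    """Properties a resource expresses; no types or classes are consulted."""
    return {t.prop for t in triples if t.resource == resource}


def is_described(triples, value):
    """A value counts as a resource only if some triple describes it."""
    return any(t.resource == value for t in triples)


print(property_set(TRIPLES, "101"))
print(is_described(TRIPLES, "101"), is_described(TRIPLES, "estuary"))
```

Because values and resources share one representation, following a value to its own description needs no schema, only a lookup.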

14
Example: Pharmacology
(Figure labels: Concept, Relationship, Atom)
15
Example: Scientific Data Repository
/anim-sal_estuary_7.gif
16
Outline
  • Dataspaces
  • Data Models for Dataspaces
  • Quarry Data Model
  • Quarry Storage
  • Quarry API
  • Experimental Results
  • Extensions
  • Related Work

17
Some Storage Models
  • Schema dependent storage (RDFS)
  • We assume schema is unavailable
  • Indexed Triple Store
  • Logically, one large table of (s, p, o) triples
  • Physically, multiple indices for various access
    patterns
  • Property Tables
  • Some properties get their own (s, o) extents
    (basically isomorphic to a pso index)
  • Selection of properties depends on query workload

18
A Simple Idea
  • Signatures
  • resources expressing the same properties
    clustered together
  • Posit that |Signature| << |Resource|
  • Queries evaluated over Signature Extents

19
Triple Store
A Query in RDQL
select ?p where (?r, <s:region>, <s:estuary>),
  (?r, <s:variable>, <s:salt>),
  (?r, <s:depth>, <s:7>),
  (?r, <s:path>, ?p)
Triples
rsrc  prop      value
101   depth     7
336   variable  temp
101   path      /iso_e_s_7.gif
101   variable  salt
843   channel   north
843   variable  salt
336   path      /trans_s_t.gif
843   path      /trans_n_s.gif
336   channel   south
101   region    estuary
and in SQL
SELECT r.rsrc, p.value as path
FROM Triples r, Triples v, Triples d, Triples p
WHERE r.property = 'region' AND r.value = 'estuary'
  AND v.property = 'variable' AND v.value = 'salt'
  AND d.property = 'depth' AND d.value = '7'
  AND p.property = 'path'
  AND r.rsrc = v.rsrc AND v.rsrc = d.rsrc AND d.rsrc = p.rsrc
One join per condition
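
The self-join form can be tried directly with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column names (`rsrc`, `property`, `value`) follow the SQL on the slide, every value is stored as text, and the value predicates are taken from the RDQL conditions.

```python
import sqlite3

# The ten triples from the slide's Triples table.
ROWS = [
    ("101", "depth", "7"), ("336", "variable", "temp"),
    ("101", "path", "/iso_e_s_7.gif"), ("101", "variable", "salt"),
    ("843", "channel", "north"), ("843", "variable", "salt"),
    ("336", "path", "/trans_s_t.gif"), ("843", "path", "/trans_n_s.gif"),
    ("336", "channel", "south"), ("101", "region", "estuary"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (rsrc TEXT, property TEXT, value TEXT)")
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", ROWS)

# One alias of Triples per condition, joined on the shared resource id.
SELF_JOIN = """
SELECT p.value
FROM Triples r, Triples v, Triples d, Triples p
WHERE r.property = 'region'   AND r.value = 'estuary'
  AND v.property = 'variable' AND v.value = 'salt'
  AND d.property = 'depth'    AND d.value = '7'
  AND p.property = 'path'
  AND r.rsrc = v.rsrc AND v.rsrc = d.rsrc AND d.rsrc = p.rsrc
"""

paths = [row[0] for row in conn.execute(SELF_JOIN)]
print(paths)
```

Each extra condition in the query adds another alias and another join, which is the cost the slide points at.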
20
Triple Store
select ?p where (?r, <s:region>, <s:estuary>),
  (?r, <s:variable>, <s:salt>),
  (?r, <s:depth>, <s:7>),
  (?r, <s:path>, ?p)
SELECT rsrc,
  MAX(CASE WHEN property = 'region' THEN value END) as region,
  MAX(CASE WHEN property = 'variable' THEN value END) as variable,
  MAX(CASE WHEN property = 'depth' THEN value END) as depth,
  MAX(CASE WHEN property = 'path' THEN value END) as path
FROM Triples
GROUP BY rsrc
HAVING MAX(CASE WHEN property = 'region' THEN value END) = 'estuary'
  AND MAX(CASE WHEN property = 'variable' THEN value END) = 'salt'
  AND MAX(CASE WHEN property = 'depth' THEN value END) = '7'
but can't exploit indexes
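
A hedged sketch of the pivot form, again using `sqlite3` with an assumed `Triples(rsrc, property, value)` table of text columns. It relies on each resource having at most one value per property, since `MAX` would otherwise silently pick one of them.

```python
import sqlite3

ROWS = [
    ("101", "depth", "7"), ("336", "variable", "temp"),
    ("101", "path", "/iso_e_s_7.gif"), ("101", "variable", "salt"),
    ("843", "channel", "north"), ("843", "variable", "salt"),
    ("336", "path", "/trans_s_t.gif"), ("843", "path", "/trans_n_s.gif"),
    ("336", "channel", "south"), ("101", "region", "estuary"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (rsrc TEXT, property TEXT, value TEXT)")
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", ROWS)

# Group all triples of a resource into one row, then filter the row.
# A resource lacking a property yields NULL there and fails the HAVING test.
PIVOT = """
SELECT rsrc,
       MAX(CASE WHEN property = 'path' THEN value END) AS path
FROM Triples
GROUP BY rsrc
HAVING MAX(CASE WHEN property = 'region'   THEN value END) = 'estuary'
   AND MAX(CASE WHEN property = 'variable' THEN value END) = 'salt'
   AND MAX(CASE WHEN property = 'depth'    THEN value END) = '7'
"""

matches = conn.execute(PIVOT).fetchall()
print(matches)
```

The joins are gone, but the filter is applied only after grouping every resource, so the whole table is read regardless of how selective the conditions are.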
21
Property Tables
select ?p where (?r, <s:region>, <s:estuary>),
  (?r, <s:variable>, <s:salt>),
  (?r, <s:depth>, <s:7>),
  (?r, <s:path>, ?p)
depth
rsrc  value
101   7
region
rsrc  value
101   estuary
variable
rsrc  value
336   temp
101   salt
843   salt
path
rsrc  value
101   /iso_e_s_7.gif
336   /trans_s_t.gif
843   /trans_n_s.gif
channel
rsrc  value
843   north
336   south
select p.value
from region r, variable v, depth d, path p
where r.value = 'estuary' and v.value = 'salt' and d.value = '7'
  and r.rsrc = v.rsrc and v.rsrc = d.rsrc and d.rsrc = p.rsrc
22
Signature Tables
select ?p where (?r, <s:region>, <s:estuary>),
  (?r, <s:variable>, <s:salt>),
  (?r, <s:depth>, <s:7>),
  (?r, <s:path>, ?p)
S1 (variable, channel, path)
rsrc  variable  channel  path
336   temp      south    /trans_s_t.gif
843   salt      north    /trans_n_s.gif
S2 (depth, region, variable, path)
rsrc  depth  region   variable  path
101   7      estuary  salt      /iso_e_s_7.gif
select path from S2
where region = 'estuary' and variable = 'salt' and depth = '7'
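
The signature-table layout can be sketched with `sqlite3` as well. The table names `S1` and `S2` come from the slide; storing every column as text and using `PRAGMA table_info` to find the tables that cover the query are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S1 (rsrc TEXT, variable TEXT, channel TEXT, path TEXT)")
conn.execute(
    "CREATE TABLE S2 (rsrc TEXT, depth TEXT, region TEXT, variable TEXT, path TEXT)"
)
conn.executemany("INSERT INTO S1 VALUES (?, ?, ?, ?)", [
    ("336", "temp", "south", "/trans_s_t.gif"),
    ("843", "salt", "north", "/trans_n_s.gif"),
])
conn.execute("INSERT INTO S2 VALUES (?, ?, ?, ?, ?)",
             ("101", "7", "estuary", "salt", "/iso_e_s_7.gif"))


def columns(table):
    """Column names of a signature table (field 1 of each table_info row)."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


# Only tables whose signature contains every queried property can hold answers.
needed = {"region", "variable", "depth", "path"}
covering = [t for t in ("S1", "S2") if needed <= columns(t)]

paths = []
for table in covering:
    sql = (f"SELECT path FROM {table} "
           "WHERE region = 'estuary' AND variable = 'salt' AND depth = '7'")
    paths += [row[0] for row in conn.execute(sql)]

print(covering, paths)
```

No joins are needed: all conditions on a resource are evaluated against a single row of a single table.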
23
Choosing a Storage Model
  • Sources of information
  • A priori knowledge (schema)
  • Query workload (learning)
  • The data (mining)

24
Computing Signatures
(Figure: triples pass through an External Sort on resource, then a
Nest step that groups each resource's properties and values and
hashes its property set.)
Triples (rsrc, prop, value) after External Sort
r0  p0  v(0,0)
r0  p1  v(0,1)
r0  p2  v(0,2)
r1  p1  v(1,1)
r1  p3  v(1,3)
r2  p1  v(2,1)
r2  p3  v(2,3)
After Nest
r0  p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(S0)
r1  p1, p3      v(1,1), v(1,3)          hash(S1)
r2  p1, p3      v(2,1), v(2,3)          hash(S2)
25
Computing Signatures
Nested resources
r0  p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(P0)
r1  p1, p3      v(1,1), v(1,3)          hash(P1)
r2  p1, p3      v(2,1), v(2,3)          hash(P1)
signatures
signature   sighash
p0, p1, p2  hash(S0)
p1, p3      hash(S1)
hash(S0)
rsrc  p0      p1      p2
r0    v(0,0)  v(0,1)  v(0,2)
hash(S1)
rsrc  p1      p3
r1    v(1,1)  v(1,3)
r2    v(2,1)  v(2,3)
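
The sort, nest, and hash steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the Quarry implementation: an in-memory `sorted` stands in for the external sort, SHA-1 over the sorted property names stands in for whatever hash is actually used, and each resource is assumed to have one value per property.

```python
import hashlib
from itertools import groupby
from operator import itemgetter

# Unsorted input triples (rsrc, prop, value), using the slide's r/p/v notation.
TRIPLES = [
    ("r0", "p0", "v(0,0)"), ("r2", "p1", "v(2,1)"), ("r0", "p2", "v(0,2)"),
    ("r0", "p1", "v(0,1)"), ("r1", "p1", "v(1,1)"), ("r1", "p3", "v(1,3)"),
    ("r2", "p3", "v(2,3)"),
]


def sig_hash(props):
    """Hash a property set; sorting makes the result independent of input order."""
    return hashlib.sha1("\x00".join(sorted(props)).encode()).hexdigest()[:8]


def compute_signatures(triples):
    signatures = {}  # sighash -> tuple of property names
    extents = {}     # sighash -> rows, one dict per resource
    # Sort by resource so each resource's triples are adjacent, then nest them.
    for rsrc, group in groupby(sorted(triples), key=itemgetter(0)):
        row = {prop: value for _, prop, value in group}
        h = sig_hash(row)  # iterating a dict yields its keys, i.e. the properties
        signatures[h] = tuple(sorted(row))
        extents.setdefault(h, []).append({"rsrc": rsrc, **row})
    return signatures, extents


signatures, extents = compute_signatures(TRIPLES)
for h, props in signatures.items():
    print(props, [row["rsrc"] for row in extents[h]])
```

Resources r1 and r2 express the same properties, so they hash to the same signature and land in the same extent.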
26
Outline
  • Dataspaces
  • Data Models for Dataspaces
  • Quarry Storage
  • Quarry API
  • Experimental Results
  • Extensions
  • Related Work

27
Quarry API
  • /2004/2004-001//anim-tem_estuary_bottom.gif
  • aggregate=bottom, animation=isotem, day=001,
    directory=images, plottype=isotem,
    region=estuary, runid=2004-001, year=2004
  • /2004/2004-001//amp_plume_2d.gif
  • day=001, directory=images, plottype=2d,
    region=plume, runid=2004-001, year=2004

28
Quarry API: Describe
  • Describe(r)
  • Property, Value pairs describing resource r

Describe(/2005-002//anim-sal_plume_5.gif) =
{year=2005, day=002, runid=2005-002, anim,
region=plume, variable=salt, depth=5,
plottype=isoline}
Describe(/2005-002//anim-sal_channel_transects.gif) =
{year=2005, day=002, runid=2005-002, anim,
channel=plume, variable=salt, plottype=transect}
29
Quarry API: Values
  • Values(B, p)
  • Unique values of property p associated with any
    resource that satisfies B

Values(var=salt, day) = {1, 2, 3, 4, 5, 6, 7}
30
Quarry API: Properties
  • Properties(B)
  • The set of properties that describe any resource
    satisfying B

GetProperties(variable=salt) = {plottype, year,
region, depth, channel, ...}
GetProperties(plottype=isoline) = {region, depth, year, ...}
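
The three calls can be sketched over an in-memory list of triples. This is a naive reference version for illustration only: the function names, the use of a dict of property=value conditions for B, and the sample triples (taken from the storage examples earlier in the talk) are assumptions, and nothing here reflects how Quarry evaluates the calls.

```python
TRIPLES = [
    ("101", "depth", "7"), ("336", "variable", "temp"),
    ("101", "path", "/iso_e_s_7.gif"), ("101", "variable", "salt"),
    ("843", "channel", "north"), ("843", "variable", "salt"),
    ("336", "path", "/trans_s_t.gif"), ("843", "path", "/trans_n_s.gif"),
    ("336", "channel", "south"), ("101", "region", "estuary"),
]


def satisfying(conditions):
    """Resources that carry every property=value pair in the conjunction."""
    facts = set(TRIPLES)
    resources = {s for s, _, _ in TRIPLES}
    return {s for s in resources
            if all((s, p, v) in facts for p, v in conditions.items())}


def describe(resource):
    """Property, value pairs describing one resource."""
    return {(p, v) for s, p, v in TRIPLES if s == resource}


def values(conditions, prop):
    """Unique values of prop over resources satisfying the conditions."""
    hits = satisfying(conditions)
    return {v for s, p, v in TRIPLES if s in hits and p == prop}


def properties(conditions):
    """Properties that describe any resource satisfying the conditions."""
    hits = satisfying(conditions)
    return {p for s, p, _ in TRIPLES if s in hits}


print(describe("843"))
print(values({"variable": "salt"}, "path"))
print(properties({"variable": "salt"}))
```

An empty condition dict matches every resource, so `properties({})` returns all unique properties, which is a natural starting point for browsing.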
31
Quarry API
  • Applications use sequences of Prop and Val calls
    to explore the Dataspace

32
Quarry API
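(Figure: an exploration tree. The root lists all unique properties;
choosing a property p lists all unique values of the parent
property; choosing a value v lists all properties of resources
satisfying p=v.)
Every path from a root represents a conjunctive
query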
33
Expressiveness
  • Incomparable with most RDF Query Languages
  • Unique properties not usually supported by
    others
  • We're limited to queries of the form:

?s LANGUAGECODE en .
?s DESCRIPTIONTYPE 2 .
?s UMLSAUI A3711025 .
?s string Sodium_lactate_0.16_molar_infusion .
34
Quarry Query Processing
  • Props(B)
  • B = (region=estuary and day=136 and
    variable=salt)
  • let cover = {region, day, variable}
  • Ans = {}
  • for Sig in Signatures
  •   if cover ⊆ Sig
  •     if exists tup (tup in Extent(Sig) and B(tup))
  •       Ans = Ans ∪ Sig

select rho from Extent(Sig1) where B limit 1
35
Quarry Query Processing
  • Val(B, rho)
  • B = (region=estuary and day=136 and
    variable=salt)
  • let cover = {region, day, variable}
  • Ans = {}
  • for Sig in Signatures
  •   if cover ⊆ Sig
  •     for tuple in Extent(Sig)
  •       if B(tuple)
  •         insert tuple.rho in Ans

select rho from Extent(Sig1) where B
union
select rho from Extent(Sig2) where B
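
The two procedures can be written out as a small Python sketch over signature extents. The extents below reuse the S1 and S2 example tables from the storage slides; representing B as a dict of equality conditions, and requiring rho to be part of a signature before reading it, are assumptions of this sketch.

```python
# Signature (tuple of property names) -> extent (one dict per resource).
EXTENTS = {
    ("channel", "path", "variable"): [
        {"rsrc": "336", "variable": "temp", "channel": "south",
         "path": "/trans_s_t.gif"},
        {"rsrc": "843", "variable": "salt", "channel": "north",
         "path": "/trans_n_s.gif"},
    ],
    ("depth", "path", "region", "variable"): [
        {"rsrc": "101", "depth": "7", "region": "estuary",
         "variable": "salt", "path": "/iso_e_s_7.gif"},
    ],
}


def holds(b, tup):
    """True when the tuple satisfies every property=value condition in b."""
    return all(tup.get(p) == v for p, v in b.items())


def props(b):
    cover = set(b)
    ans = set()
    for sig, extent in EXTENTS.items():
        # Skip signatures that cannot satisfy b; one witness tuple is enough.
        if cover <= set(sig) and any(holds(b, tup) for tup in extent):
            ans |= set(sig)
    return ans


def val(b, rho):
    cover = set(b)
    ans = set()
    for sig, extent in EXTENTS.items():
        # Assumption: a signature without rho has nothing to contribute.
        if cover <= set(sig) and rho in sig:
            ans |= {tup[rho] for tup in extent if holds(b, tup)}
    return ans


print(props({"variable": "salt"}))
print(val({"variable": "salt"}, "path"))
```

The cover test prunes whole signatures before any tuple is read, and `props` stops at the first matching tuple per signature, mirroring the `limit 1` query.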
36
Experimental Results
  • YARS: Yet Another RDF Store
  • Several B-Tree indexes to support
  • spo, po → s, os → p, etc.
  • Reports of YARS outperforming Redland and Sesame
  • 3M triples, single term queries
  • We looked at multi-term queries

?s <p0> <o0> . ?s <p1> <o1> . ... ?s <pn> <on> .
37
Experimental Results: Queries
3.6M triples, 606k resources, 149 signatures
38
Frequent YARS Access Plan
?s LANGUAGECODE en .
?s DESCRIPTIONTYPE 2 .
?s UMLSAUI A3711025 .
?s string Sodium_lactate_0.16_molar_infusion .
(Figure: access plan with one po → s lookup,
UMLSAUI A3711025 → <s>, and three spo lookups:
<s> string Sodium, <s> LANGUAGECODE en,
<s> DESCRIPTIONTYPE 2.)
Choice of first lookup can be important
39
YARS Plan Speed
(Plot: time (s) against cardinality of first join)
40
Scaling Up
  • Queries covered by many signatures can be
    inefficient

SELECT orig_code FROM sig1 WHERE va_class_name = 'DE820'
UNION
SELECT orig_code FROM sig2 WHERE va_class_name = 'DE820'
UNION
SELECT orig_code FROM sig3 WHERE va_class_name = 'DE820'
UNION
SELECT orig_code FROM sig4 WHERE va_class_name = 'DE820'
41
Scaling Up
  • Combine(S1, S2)

(Figure: S1(a,b,c) and S2(a,b,d) are combined into
S12(a,b,c,d); pad with nulls.)
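
A minimal sketch of Combine, assuming each extent is a column list plus a list of row tuples. The function name and representation are illustrative; `None` plays the role of SQL NULL.

```python
def combine(cols1, rows1, cols2, rows2):
    """Merge two signature extents into one, padding missing columns with None."""
    # Keep the first extent's column order, then append columns new in the second.
    cols = list(cols1) + [c for c in cols2 if c not in cols1]

    def pad(src_cols, row):
        by_name = dict(zip(src_cols, row))
        return tuple(by_name.get(c) for c in cols)

    merged = [pad(cols1, r) for r in rows1] + [pad(cols2, r) for r in rows2]
    return cols, merged


cols, rows = combine(("a", "b", "c"), [(1, 2, 3)],
                     ("a", "b", "d"), [(4, 5, 6)])
print(cols)
print(rows)
```

A query on `a` or `b` now reads one table instead of a union of two, at the price of null-padded rows.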
42
Scaling Up
  • Extract
  • Find commonly accessed property sets and
    materialize them separately

(Figure: S1(a,b,c) and S2(a,b,d) are kept, and Sab(a,b) is
materialized alongside them.)
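
Extract can be sketched the same way, assuming extents are stored as lists of dicts keyed by signature. How the commonly accessed property sets are found is not covered here; the property set is simply passed in, and the function name is hypothetical.

```python
def extract(extents, props):
    """Project every extent whose signature covers props onto those properties."""
    out = []
    for sig, rows in extents.items():
        if set(props) <= set(sig):
            out += [tuple(row[p] for p in props) for row in rows]
    return out


EXTENTS = {
    ("a", "b", "c"): [{"a": 1, "b": 2, "c": 3}],
    ("a", "b", "d"): [{"a": 4, "b": 5, "d": 6}],
}

# Sab(a, b): one narrow table that serves queries touching only a and b.
sab = extract(EXTENTS, ("a", "b"))
print(sab)
```

Unlike Combine, the original extents are left in place, so the extracted table is a redundant copy that must be kept in step with them.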
43
Related Work
  • RDF: Redland, Sesame, Jena, YARS, Forth, KAON
  • Primarily Indexed Triple Stores
  • Path Indexes: Lorel, DataGuides
  • Data Mining for Structure
  • Ding, Wilkinson, Sayers, Kuno @ HP Labs:
    Application-specific Schema Design for Storing
    Large RDF Datasets

44
(No Transcript)
45
Data Management Solutions
(Figure: data management solutions placed by Administrative
Proximity (Near to Far) against Semantic Integration (Low to
High). Solutions shown: Web Search, Virtual Organization,
Enterprise portal, Ontology, Data Integration System, Desktop
Search, Scientific Repository, DBMS.)
Diagram adapted, with permission, from Figure 1
in the paper From Databases to Dataspaces: A New
Abstraction for Information Management, Michael
Franklin, Alon Halevy, David Maier, SIGMOD Record,
December 2005.
46
Query Languages
  • RQL
  • RDQL
  • RDFQL
  • RxPath
  • N3
  • SeRQL
  • Triple
  • Versa

47
Facts
  • Environmental Observation and Forecasting System
  • 7.5M triples describing 1M files
  • Integrated Pharmacological Database
  • 23M triples describing 0.6M concepts

48
Dataspace Components
  • Catalog and Browse
  • Search and Query
  • Global Query
  • Structured Query
  • Provenance Query
  • Continuous Query (Monitoring)
  • Local Store and Index
  • Discovery
  • Source Extension

49
Scaling Way Up
50
Integrity Constraints and Normalization
51
Growing a Query Language
  • Desc(k)
  • Prop(B)
  • Val(B, p)

52
Pharmacological Database
  • Signature <=> ptty in 85% of the cases