Title: A System to Integrate Distributed Sources of Heterogeneous Scientific Information
1 A System to Integrate Distributed Sources of
Heterogeneous Scientific Information
- Amarnath Gupta, SDSC
- Bertram Ludaescher, SDSC
- Maryann E. Martone, NCMIR
- Ilya Zaslavsky, SDSC
- University of California, San Diego
2 A Standard Mediator Architecture (MIX --
Mediation of Information using XML)
[Architecture diagram, top to bottom]
USER query <-> XML Q/A <-> INTEGRATED VIEW
MIX MEDIATOR <- XML Integrated View Definition
XML Q/A <-> Wrappers (one per source)
Data Sources: Files, Lab1, Lab2, Lab3
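The layering in this architecture can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class names `Wrapper` and `Mediator`, the in-memory records standing in for lab databases, and the XML element names. It is not the MIX API, and the integrated view is reduced to a filtered union of wrapper answers.

```python
import xml.etree.ElementTree as ET


class Wrapper:
    """Hypothetical wrapper: exposes one source's records as an XML answer."""

    def __init__(self, source_name, records):
        self.source_name = source_name
        self.records = records  # list of dicts standing in for a lab database

    def answer(self, tag):
        """Return an <answer> element with one <tag> child per record."""
        root = ET.Element("answer", source=self.source_name)
        for rec in self.records:
            item = ET.SubElement(root, tag)
            for key, value in rec.items():
                ET.SubElement(item, key).text = str(value)
        return root


class Mediator:
    """Hypothetical mediator: the view is a filtered union of wrapper answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, tag, predicate=lambda item: True):
        view = ET.Element("integrated_view")
        for wrapper in self.wrappers:
            for item in wrapper.answer(tag):
                if predicate(item):
                    item.set("source", wrapper.source_name)  # keep provenance
                    view.append(item)
        return view


lab1 = Wrapper("Lab1", [{"name": "RyR", "amount": 30}])
lab2 = Wrapper("Lab2", [{"name": "NCS-1", "amount": 12},
                        {"name": "RyR", "amount": 5}])
mediator = Mediator([lab1, lab2])
result = mediator.query("protein", lambda p: p.findtext("name") == "RyR")
print(ET.tostring(result, encoding="unicode"))
```

The user only talks to the mediator; each wrapper hides how its source stores data and answers in XML.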
3 Integration Issues
4 Integration Issues: Mediating across Multiple Worlds
- Structural Integration
- => common semistructured data model (XML)
- => XML queries and transformations to resolve schema conflicts
- Limited Query Capabilities
- => mediator is aware of QCs exported by wrappers
- ...
- Semantic Integration
- most work deals with issues for one-world scenarios (e.g., amazon.com and bn.com)
- what if data comes from a multiple-world scenario (like neuroscience), where data objects from different sources are not even similar, and only the hidden semantics (known to the domain expert) provides the semantic link?
5 A Neuroscience Question
What is the cerebellar distribution of rat
proteins with more than 70% homology with human
NCS-1? Any structure specificity? How about other
rodents?
[Diagram, top to bottom]
??? Integrated View ???
??? Integrated View Definition ???
??? Mediator ???
Wrappers (one per source)
Sources: Web, protein localization, morphometry, neurotransmission, CaBP, Expasy
6 Hidden Semantics: Protein Localization
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ...
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>
7 Hidden Semantics: Morphometry
<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
      ...
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</>
      <min_section>1.93</>
      <max_section>4.47</>
      <surface_area>9.884</>
      <volume>7.930</>
      <head>
        <width>4.47</>
        <length>1.79</>
      </head>
    </spine>
    ...
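To make the "hidden semantics" concrete, here is a small Python sketch that links the two fragments. It rests on several assumptions: both documents are cut-down, well-formed versions of the slide fragments (explicit closing tags instead of the `</>` shorthand, elided parts omitted), and the lookup table `NAME_TO_ELEMENT` is a hypothetical stand-in for knowledge a domain expert would supply. Neither source states that a `<structure>` named "spine" refers to a `<spine>` element in the other source.

```python
import xml.etree.ElementTree as ET

LOCALIZATION = """<protein_localization>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</name><amount name="RyR">0</amount>
      </structure>
      <structure fraction="0.2">
        <name>branchlet</name><amount name="RyR">30</amount>
      </structure>
    </density>
  </region>
</protein_localization>"""

MORPHOMETRY = """<neuron name="purkinje cell">
  <branch level="10">
    <spine number="1"><length>12.348</length><volume>7.930</volume></spine>
  </branch>
</neuron>"""

# Assumed domain knowledge: which morphometry element a structure name denotes.
NAME_TO_ELEMENT = {"spine": "spine"}


def link(localization_xml, morphometry_xml, protein):
    """Pair protein amounts with morphometry volumes via the expert mapping."""
    loc = ET.fromstring(localization_xml)
    morph = ET.fromstring(morphometry_xml)
    rows = []
    for structure in loc.iter("structure"):
        tag = NAME_TO_ELEMENT.get(structure.findtext("name"))
        if tag is None:
            continue  # no known counterpart in the morphometry source
        amounts = [float(a.text) for a in structure.findall("amount")
                   if a.get("name") == protein]
        for element in morph.iter(tag):
            for amount in amounts:
                rows.append((tag, element.get("number"),
                             float(element.findtext("volume")), amount))
    return rows


rows = link(LOCALIZATION, MORPHOMETRY, "RyR")
print(rows)
```

Without the mapping table there is nothing to join on: one source uses "spine" as a text value, the other as an element name.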
8 The Problem
- Multiple Worlds Integration
- compatible terms not directly joinable
- complex, indirect associations among schema elements
- unstated integrity constraints
- Why not just use Ontologies?
- typical ontologies associate terms along a limited number of dimensions
- What's needed?
- a theory under which non-identical terms can be semantically joined
- => lift mediation to the level of conceptual models (CMs)
- => domain knowledge, ICs become rules over CMs
- => Model-Based Mediation
9 XML-Based vs. Model-Based Mediation
XML Models
10 Extended Mediator Architecture
- => Wrappers export Conceptual Models (CMs), i.e., facts and rules for classes, relationships, ICs, ...
- => Mediator imports CMs (from sources, auxiliary knowledge bases, and domain maps (DMs))
- => a generic conceptual model (GCM, a subset of F-logic), extensible via rules, is the common target CM language
- => new CMs can be plugged in by specifying them in GCM F-logic rules
- => prototype implementation in FLORA
- global-as-view approach
- compiler: F-logic => XSB-Prolog
- top-down evaluation => virtual (demand-driven) views
- external interfaces (XML, RDBs, DM visualization, ...)
11 Model-Based Mediator Architecture
USER/Client
CM (Integrated View)
Domain Map DM
Integrated View Definition IVD
CM Plug-In
CM Queries and Results (exchanged in XML)
Logic API (capabilities)
12 Definition of Integrated Views ...
- XML-2-FL and CM-2-FL Translators
<!ELEMENT Studies (Study)>
<!ELEMENT Study (study_id, animal, experiments, experimenters)>
<!ELEMENT experiments (experiment)>
<!ELEMENT experiment (description, instrument, parameters)>
studyDB[studies =>> study].
study[study_id => string;
      animal => animal;
      experiments =>> experiment;
      experimenters =>> string].
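An XML-2-FL style translation can be sketched in Python as follows. This is only an illustration under assumed rules, not the translator used in the prototype: the function names are hypothetical, the sample DTD puts an explicit `*` inside the content model, undeclared elements are typed as `string`, and an element whose only content is one repeated child is folded into a set-valued attribute (`=>>`) of its parent. The output is therefore close to, but not identical with, the signature on the slide.

```python
import re

# Matches simple declarations such as <!ELEMENT a (b, c*)>
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)\s*>")


def parse_dtd(dtd_text):
    """Map each declared element to its list of child tokens, e.g. 'experiment*'."""
    return {name: [c.strip() for c in body.split(",") if c.strip()]
            for name, body in ELEMENT_RE.findall(dtd_text)}


def is_container(children):
    """True if the content model is a single repeated child, e.g. (experiment*)."""
    return len(children) == 1 and children[0][-1] in "*+"


def to_flogic(dtd_text):
    """Emit one F-logic style signature per non-container element."""
    decls = parse_dtd(dtd_text)
    lines = []
    for name, children in decls.items():
        if is_container(children):
            continue  # folded into the parent's set-valued attribute
        attrs = []
        for child in children:
            base = child.rstrip("*+?")
            content = decls.get(base)
            if content and is_container(content):
                attrs.append(f"{base} =>> {content[0].rstrip('*+')}")
            elif child[-1] in "*+":
                attrs.append(f"{base} =>> {base if base in decls else 'string'}")
            elif base in decls:
                attrs.append(f"{base} => {base}")
            else:
                attrs.append(f"{base} => string")
        lines.append(f"{name}[{'; '.join(attrs)}].")
    return lines


DTD = """
<!ELEMENT study (study_id, animal, experiments, experimenters)>
<!ELEMENT experiments (experiment*)>
<!ELEMENT experiment (description, instrument, parameters)>
"""
signatures = to_flogic(DTD)
print("\n".join(signatures))
```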
- Specification of Domain Knowledge
- Subclasses
mushroom_spine :: spine.
- Rules
S : mushroom_spine IF S : spine[head -> _; neck -> _].
- Integrity Constraints
ic1(S) : alert[type -> "invalid spine"; object -> S]
    IF S : spine[undef ->> {head, neck}].
- Integrated View Definition
protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value) IF
    I : protein_label_image[proteins ->> Protein;
        organism -> Organism;
        anatomical_structures ->> AS : anatomical_structure[name -> Anatom]],
    NAE : neuro_anatomic_entity[name -> Anatom; located_in ->> Brain_region],
    AS..segments..features[name -> Feature_name; value -> Value].
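The mushroom_spine rule and the ic1 integrity constraint on this slide can be paraphrased in Python. This sketch assumes one particular reading: a spine is a mushroom spine when both head and neck are defined, and the constraint raises an alert when both are undefined. The function name and the dictionary representation of spine objects are hypothetical.

```python
def classify_spines(spines):
    """Apply the derived-class rule and the integrity constraint.

    spines: dict mapping a spine id to its attributes; a missing key or a
    None value means the attribute is undefined.
    """
    mushroom, alerts = [], []
    for spine_id, attrs in spines.items():
        has_head = attrs.get("head") is not None
        has_neck = attrs.get("neck") is not None
        if has_head and has_neck:
            # Rule: a spine with a head and a neck is a mushroom_spine.
            mushroom.append(spine_id)
        if not has_head and not has_neck:
            # Constraint: a spine with neither head nor neck is invalid.
            alerts.append({"type": "invalid spine", "object": spine_id})
    return mushroom, alerts


spines = {
    "s1": {"head": 4.47, "neck": 1.2},
    "s2": {"head": 4.47},
    "s3": {},
}
mushroom, alerts = classify_spines(spines)
print(mushroom, alerts)
```

Note that the constraint produces alert objects rather than rejecting data, which matches the rule form on the slide: violations are derived facts the mediator can query.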
13 ... Definition of Integrated Views (Multiple Sources)
- Creating Mediated Classes
- Reasoning with Schema
animal[M => R] IF S : source, S.animal[M => R].
X[taxon -> T] IF X : PROLAB.animal[name -> N],
    words(N, [W1, W2|_]),
    T : TAXON.taxon[genus -> W1; species -> W2].
union over all classes
At Mediator
subspecies :: species :: genus :: ... :: kingdom :: superkingdom.
T : TR, TR :: TR1 IF T : TAXON.taxon[Taxon_Rank -> TR; Taxon_Rank1 -> TR1],
    Taxon_Rank :: Taxon_Rank1.
Class creation by schema reasoning
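The taxon rule on this slide joins animals from one source to taxa from another by splitting the animal's name into words and matching the first two against genus and species. A minimal Python sketch of that join follows. The record layout, the function name `link_taxa`, and the sample data are assumptions made for illustration.

```python
def link_taxa(animals, taxa):
    """Attach a taxon id to each animal whose name starts with 'Genus species'."""
    # Index taxa by (genus, species) so each animal needs one lookup.
    index = {(t["genus"], t["species"]): t["id"] for t in taxa}
    links = {}
    for animal in animals:
        words = animal["name"].split()
        if len(words) < 2:
            continue  # the rule needs at least two words: W1 and W2
        key = (words[0], words[1])  # any further words are ignored
        if key in index:
            links[animal["id"]] = index[key]
    return links


# Hypothetical records standing in for the two sources.
animals = [
    {"id": "a1", "name": "Rattus norvegicus albino"},
    {"id": "a2", "name": "Mus"},
]
taxa = [
    {"id": "t1", "genus": "Rattus", "species": "norvegicus"},
    {"id": "t2", "genus": "Mus", "species": "musculus"},
]
links = link_taxa(animals, taxa)
print(links)
```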
14 Model-Based Mediation with DOMAIN MAPS (DMs)
- Semantic Road Maps for situating source data
- => navigational aid (browsing source classes at the conceptual level)
- => basis for integrated views across multiple worlds
- => link points (concepts) and labeled arcs (roles)
- => formal semantics (in FL and/or DLs)
- Example: ANATOM DM
- anatomical entities (concepts); is_a, has_a, overlaps, ... (roles)
- => from syntactic equality to semantic joins
LINK(X,Y):
    X.zip = Y.zip
    X.addr in Y.zip
    X.zip overlaps Y.county
    ...
Integrated-CM(Z1,...) :=
    get X1,... from Src1
    get X2,... from Src2
    LINK(Xi, Yj)
    Zj = CM-QL(X1,...,Y1,...)
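The step from syntactic equality to semantic joins can be sketched as a join whose condition is a list of pluggable link predicates. The code below is illustrative only: `semantic_join`, the two predicates, the `ZIP_OVERLAPS` table, and the placeholder values such as "Z1" and "C9" are all assumptions, with the table standing in for knowledge a domain map would provide.

```python
# Assumed domain knowledge: which counties each zip code overlaps.
ZIP_OVERLAPS = {"Z2": {"C9"}}


def same_zip(x, y):
    """Syntactic link: both objects carry the same zip value."""
    return x.get("zip") is not None and x.get("zip") == y.get("zip")


def zip_overlaps_county(x, y):
    """Semantic link: x's zip overlaps y's county according to the domain map."""
    return y.get("county") in ZIP_OVERLAPS.get(x.get("zip"), set())


def semantic_join(xs, ys, link_predicates):
    """Pair x and y whenever at least one link predicate holds."""
    return [(x["id"], y["id"])
            for x in xs for y in ys
            if any(pred(x, y) for pred in link_predicates)]


src1 = [{"id": "x1", "zip": "Z1"}, {"id": "x2", "zip": "Z2"}]
src2 = [{"id": "y1", "zip": "Z1"}, {"id": "y2", "county": "C9"}]
pairs = semantic_join(src1, src2, [same_zip, zip_overlaps_county])
print(pairs)
```

A plain equi-join would find only the first pair; the second exists only because the domain knowledge relates a zip to a county.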
15 ANATOM Domain Map
ANATOM
16 ANATOM Domain Map with Registered Data
ANATOM DATA
17 Deductive Closure of has_a with tc(is_a)
ANATOM CLOSURE
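One possible reading of this closure, sketched in Python: compute the transitive closure of is_a, then let every concept inherit the has_a edges of its is_a ancestors. The slide does not spell out the exact rules, so this reading is an assumption, as are the function names and the toy edges.

```python
def transitive_closure(pairs):
    """Naive fixpoint: add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def close_has_a(is_a, has_a):
    """Propagate has_a edges down the is_a hierarchy (assumed rule)."""
    derived = set(has_a)
    for (sub, sup) in transitive_closure(is_a):
        for (owner, part) in has_a:
            if owner == sup:
                derived.add((sub, part))  # sub inherits sup's parts
    return derived


# Toy edges for illustration only.
is_a = {("purkinje_cell", "neuron"), ("neuron", "cell")}
has_a = {("cell", "membrane"), ("neuron", "dendrite")}
closed = close_has_a(is_a, has_a)
print(sorted(closed))
```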
18 Example Query Evaluation (I)
- Example: protein_distribution
- given: organism, protein, brain_region
- ANATOM DM:
- recursively traverse the has_a_star paths under brain_region; collect all anatomical_entities
- Source PROLAB:
- join with anatomical structures and collect the value of attribute image.segments.features.feature.protein_amount where image.segments.features.feature.protein_name = protein and study_db.study.animal.name = organism
- Mediator:
- aggregate over all parents up to brain_region
- report distribution
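The three evaluation steps above can be sketched in Python as traverse, join, aggregate. The has_a map, the source rows, and the function names are hypothetical toy data. The sketch also assumes that "aggregate" means summing protein amounts over each entity and everything beneath it.

```python
def subtree(has_a, root):
    """All entities reachable from root via has_a, root included."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(has_a.get(node, []))
    return seen


def protein_distribution(has_a, rows, organism, protein, brain_region):
    """Sum protein amounts per entity, rolled up to brain_region."""
    # Step 1 (domain map): collect all entities under the brain region.
    entities = subtree(has_a, brain_region)
    # Step 2 (source): keep rows matching organism, protein and entity.
    own = {}
    for org, prot, structure, amount in rows:
        if org == organism and prot == protein and structure in entities:
            own[structure] = own.get(structure, 0.0) + amount
    # Step 3 (mediator): each entity's total covers its whole subtree.
    return {entity: sum(own.get(d, 0.0) for d in subtree(has_a, entity))
            for entity in entities}


# Toy domain map and source rows for illustration only.
has_a = {
    "cerebellum": ["purkinje_cell"],
    "purkinje_cell": ["dendrite"],
    "dendrite": ["branchlet"],
    "branchlet": ["spine"],
}
rows = [
    ("rat", "RyR", "branchlet", 30.0),
    ("rat", "RyR", "spine", 0.0),
    ("rat", "RyR", "dendrite", 5.0),
    ("rat", "NCS-1", "spine", 4.0),
    ("mouse", "RyR", "branchlet", 7.0),
]
dist = protein_distribution(has_a, rows, "rat", "RyR", "cerebellum")
print(dist)
```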
19 Interactive Queries (I)
KIND
20 Example Query Evaluation (II)
"How does the parallel fiber output
(Yale/SENSELAB) relate to the distribution of
Ryanodine Receptors (UCSD/NCMIR)?"
- @SENSELAB: X1 := select output from parallel fiber
- @MEDIATOR: X2 := hang off X1 from Domain Map
- @MEDIATOR: X3 := subregion-closure(X2)
- @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors)
- @MEDIATOR: X5 := compute aggregate(X4)
21 Interactive Queries (II)
KIND01
22 Resulting Sub DOMAIN MAP Browser
PROTLOC
23 Computed Protein Localization Data
PROTLOC
24 Client-Side Result Visualization (using AxioMap
Viewer: Ilya Zaslavsky)
PROTLOC-AxioMap
25 Summary and Outlook: Federation of Brain Data
PROTLOC
Result (XML/XSLT)
Result (VML)
ANATOM