Information Integration: A Status Report

1
Information Integration: A Status Report
  • Alon Halevy
  • University of Washington, Seattle
  • IJCAI 2003

2
Mediated Schema
Entity
Sequenceable Entity
Structured Vocabulary
Gene
Phenotype
Experiment
Nucleotide Sequence
Microarray Experiment
Protein
OMIM
Swiss-Prot
HUGO
GO
GeneClinics
Entrez
LocusLink
GEO
Query: For the micro-array experiment I just ran,
what are the related nucleotide sequences and for
what protein do they code?
3
Motivation and Activity
  • Application areas of data integration:
  • Enterprise information integration
  • The government
  • Data sources on the web
  • Scientific data sharing.
  • Several data sharing architectures:
  • Virtual data integration, warehousing,
    message-passing, web-services.
  • Many research projects:
  • Mine: Information Manifold, Tukwila, LSD, Piazza.
  • EII: a new industry buzzword.

4
Today's Agenda
  • Recent progress
  • Mediation languages
  • Query processing (XML and other)
  • Some lessons from commercial world.
  • Current challenges
  • Enabling large-scale data sharing: peer-data
    management systems.
  • The age-old problem: semantic heterogeneity.
  • A new agenda item for AI: corpus-based KR.
  • AI is more vital than ever for progress here!

5
Mediation Languages
Goal: a language for specifying semantic relationships
(not full FOL).
Mediated Schema
Assume data at the sources is structured (or
seems so).
6
Global-as-View (GAV)
Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
Mediated Schema
Title, Actor,
R1
R2
R3
R4
R5
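
A minimal sketch of how a GAV definition might be used, reading the two Actor rules on this slide as Datalog-style rules. The source tuples below are made up for illustration and are not from the talk. Under GAV the mediated relation is defined as a view over the sources, so a query over Actor can be answered by unfolding that definition:

```python
# Hypothetical source contents; the tuples are illustrative only.
R1 = [("Casablanca", "Bogart", 1942)]   # R1(x, y, z)
R2 = [("Vertigo", "m1")]                # R2(x, z)
R3 = [("m1", "Stewart")]                # R3(z, y)


def actor():
    """Actor(x,y) :- R1(x,y,z)   and   Actor(x,y) :- R2(x,z), R3(z,y)."""
    # First rule: project R1 onto its first two columns.
    result = {(x, y) for (x, y, _z) in R1}
    # Second rule: join R2 and R3 on the shared variable z.
    result |= {(x, y) for (x, z) in R2 for (z2, y) in R3 if z == z2}
    return result


def actors_in(title):
    """A query over the mediated schema, answered by unfolding Actor."""
    return sorted(y for (x, y) in actor() if x == title)
```

For example, `actors_in("Vertigo")` is answered entirely from R2 and R3, without the user knowing those sources exist.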
7
Local-as-View (LAV,GLAV)
R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,French)
Mediated Schema
Title, Actor
R1
R2
R3
R4
R5
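
In LAV the direction is reversed: each source is described as a view over the mediated schema. The sketch below assumes a toy mediated instance (hypothetical data) and reads y in Title(x,y) as a year, which is an assumption based on the comparison with 1970. It evaluates the description of R1 and checks whether a source's tuples are consistent with it; a real source may hold only a subset of the described tuples.

```python
# Hypothetical mediated-schema instance; tuples are illustrative only.
Title = [("m1", 1942), ("m2", 1958), ("m3", 1994)]              # Title(x, y): id, year (assumed)
Actor = [("m1", "Bogart"), ("m2", "Stewart"), ("m3", "Hanks")]  # Actor(x, z)


def r1_description():
    """R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970."""
    return {
        (x, y, z)
        for (x, y) in Title
        for (x2, z) in Actor
        if x == x2 and y < 1970
    }


def consistent_with_description(source_tuples):
    """True if the source holds only tuples allowed by its description."""
    return set(source_tuples) <= r1_description()
```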
8
Mediation Languages Summary
  • A lot of nice theory and practical algorithms.
  • Careful choice of expressive power mattered.
  • Algorithms for answering queries using views are
    in every commercial DBMS.
  • Description Logics also an attractive formalism
    for mediation.
  • Bottleneck is coming up with the mapping
    expressions.

9
Outline
  • Recent progress
  • Mediation languages
  • Query processing (XML and other)
  • Some lessons from commercial world.
  • Current challenges
  • Enabling large-scale data sharing: peer-data
    management systems.
  • The age-old problem: semantic heterogeneity.
  • A new agenda item for AI: corpus-based KR.

10
Adaptive Query Processing
  • Problem: no stats, network unstable
  • Cannot plan and then execute
  • Need to adapt plan during execution.
  • Ideas already in:
  • Ingres (1976) (early database system)
  • Interleaving planning and execution (AI)
  • Key question: when to adapt, and at what granularity
  • For every tuple? Materialization points?
  • See Ives et al. 2002 for our solution.
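
One way to picture adapting a plan during execution is the sketch below. It is not the algorithm of Ives et al.; it is a deliberately simple stand-in that adapts at the granularity of a batch. The function name, the batch size, and the use of filter predicates instead of joins are all assumptions made for illustration. No statistics are known up front, so the order of the predicates is revised after each batch from the pass rates observed so far (a crude estimate, since later predicates only see rows that earlier ones accepted).

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Apply all predicates to rows, re-ordering them between batches."""
    seen = [0] * len(predicates)     # rows each predicate has evaluated
    passed = [0] * len(predicates)   # rows each predicate has accepted
    order = list(range(len(predicates)))
    output = []
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            for i in order:
                seen[i] += 1
                if predicates[i](row):
                    passed[i] += 1
                else:
                    break            # row rejected; skip remaining predicates
            else:
                output.append(row)   # row passed every predicate
        # Re-plan: run the predicate with the lowest observed pass rate first.
        order.sort(key=lambda i: passed[i] / seen[i] if seen[i] else 1.0)
    return output, order
```

The result is the same for any order, because the predicates form a conjunction; only the amount of work changes.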

11
Convergent Query Processing [Ives et al., 2002]
Join In-stock, Orders, Shipping
(I ⋈ O ⋈ S)
(Diagram labels: I, OS, IO)
12
XML Query Processing
  • XML facilitates integration.
  • Mediator query processor may manipulate XML
    directly.
  • Challenges
  • XML is not flat but nested: path queries.
  • Can be irregular: doesn't adhere to a strict
    schema.
  • Progress
  • Defining and optimizing XQuery.
  • Going back and forth between XML and relational.
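
A small sketch of the two challenges above, using Python's standard `xml.etree.ElementTree`. The document and the `flatten` helper are hypothetical. The data is nested, so it is reached with path expressions, and it is irregular: the second book has no price and has two authors. The function shows one possible way of going from XML to flat, relational-style rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical, irregular document: the second book has no <price>
# and has two <author> elements.
DOC = """
<catalog>
  <book><title>A</title><author>Ann</author><price>10</price></book>
  <book><title>B</title><author>Bob</author><author>Cy</author></book>
</catalog>
"""


def flatten(xml_text):
    """Turn nested <book> elements into flat (title, author, price) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("./book"):          # path query over the nesting
        title = book.findtext("title")
        price = book.findtext("price")           # None when the element is missing
        for author in book.findall("author"):    # one row per repeated element
            rows.append((title, author.text, price))
    return rows
```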

13
The Commercial World
  • Some startups
  • Nimble, MetaMatrix, Calixa, Composite, Enosys
  • Big guys making announcements
  • IBM, BEA, MS, (Oracle still being defiant).
  • Integration technology in different layers
  • E.g., reporting companies want it (Actuate)
  • Progress: analysts have a buzzword, EII.
  • Challenges
  • Integration with EAI?
  • Yet another middleware?
  • Horizontal vs. vertical?

14
What Worked?
  • Performance was not an issue.
  • Tools, tools, tools
  • For managing sources and creating mediated
    schemas.
  • XML query processing was needed.
  • Concordance: need common keys to join sources
  • Active research area!
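
To illustrate the concordance point: when two sources identify the same entity with different keys, a join needs a table that says which keys correspond. The sketch below assumes such a table already exists; the identifiers and values are invented for illustration.

```python
# Hypothetical records from two sources that use different keys.
source_a = {"P12345": "kinase"}          # keyed by source A's identifier
source_b = {"G-77": "chromosome 7"}      # keyed by source B's identifier
concordance = [("P12345", "G-77")]       # pairs of ids naming the same entity


def join_via_concordance(a, b, pairs):
    """Join two sources through a table of corresponding keys."""
    return [
        (key_a, a[key_a], key_b, b[key_b])
        for (key_a, key_b) in pairs
        if key_a in a and key_b in b
    ]
```

Building the concordance table itself is the hard part, and the sketch does not address it.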

15
Outline
  • Recent progress
  • Mediation languages
  • Query processing (XML and other)
  • Some lessons from commercial world.
  • Current challenges
  • Enabling large-scale data sharing: peer-data
    management systems.
  • The age-old problem: semantic heterogeneity.
  • A new agenda item for AI: corpus-based KR.

16
Limitations of Mediated Schema
Mediated Schema
17
Peer Data-Management
  • PDMS: a network of peers (data sources)
  • Peers can:
  • Export base data, or combinations of data
  • Serve as logical mediators for other peers
  • A peer can be both a server and a client.
  • Semantic relationships are specified locally
    (between small sets of peers).
  • This is a Semantic Web (different angle)

18
Network of Mappings (Piazza)
CiteSeer
UW
Stanford
GAV, LAV, GLAV
DBLP
Paris
Roma
Vienna
19
Advantages of PDMS
  • No need for a central mediated schema.
  • Can map data opportunistically, as is most
    convenient.
  • Queries are posed using the peer's schema.
    Answers come from anywhere in the system.
  • Infrastructure for Semantic Web applications
  • This is not P2P file sharing.
  • Data has rich semantics
  • Membership is not as dynamic.

20
Schema Mediation for PDMS
When can LAV and GAV be combined to form such a
network structure? (Semantics not yet obvious.)
CiteSeer
UW
Stanford
GAV, LAV, GLAV
See ICDE-03; WWW-03 for XML
DBLP
Paris
Roma
Vienna
21
Efficient Query Answering
  • Problems
  • redundant paths
  • expensive reformulation.

CiteSeer
UW
Stanford
  • Possible solution
  • Pre-compose some paths

DBLP
Paris
Roma
Vienna
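
The redundant-paths problem can be seen in a small sketch. The mapping edges below are hypothetical, since the figure's edges are not recoverable from the transcript. Each cycle-free chain of mappings between two peers is a possible reformulation route, and routes often share a suffix, which is the kind of repeated work that pre-composing paths aims to avoid.

```python
# Hypothetical mapping edges between peers; not taken from the talk's figure.
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "CiteSeer": [],
}


def simple_paths(graph, src, dst, path=None):
    """Enumerate cycle-free chains of mappings from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found = []
    for nxt in graph.get(src, []):
        if nxt not in path:          # never revisit a peer on the same chain
            found.extend(simple_paths(graph, nxt, dst, path))
    return found
```

With these edges there are two routes from UW to CiteSeer, and both end with the DBLP to CiteSeer mapping.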
22
Mapping Composition [Jayant Madhavan and Halevy,
VLDB 2003]
  • Incredibly subtle!
  • In general, composition can be an infinite set of
    GLAV formulas.
  • Results
  • Finite in many cases
  • Even when infinite, often has finite, useful
    encoding.
  • Hence, compositions can usually be pre-optimized.
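
For intuition only, the sketch below composes the simplest possible kind of mapping: one-to-one attribute renamings between peers. This is not the GLAV composition studied in the cited paper, where the subtleties on this slide arise; the peer and attribute names are hypothetical.

```python
def compose(ab, bc):
    """Compose attribute correspondences A->B and B->C into A->C.

    Handles only one-to-one renamings. Attributes of A whose image in B
    has no counterpart in C are dropped.
    """
    return {a: bc[b] for a, b in ab.items() if b in bc}


# Hypothetical correspondences between three peers' attribute names.
uw_to_dblp = {"paperTitle": "title", "writer": "author", "room": "office"}
dblp_to_citeseer = {"title": "name", "author": "creator"}
```

Pre-computing `compose(uw_to_dblp, dblp_to_citeseer)` once lets later queries skip the intermediate peer.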

23
Other Research Issues
  • Intelligent data placement
  • Management of mapping networks
  • Improving networks: finding additional
    connections.
  • Handling inconsistencies
CiteSeer
UW
Stanford
DBLP
Leipzig
Saarbruecken
Berlin
24
PDMS-Related Projects
  • Hyperion (Toronto)
  • PeerDB (Singapore)
  • Local relational models (Trento)
  • Edutella (Hannover, Germany)
  • Semantic Gossiping (EPFL, Lausanne)
  • Raccoon (UC Irvine)
  • Orchestra (Ives, U. Penn)

25
Outline
  • Recent progress
  • Mediation languages
  • Query processing (XML and other)
  • Some lessons from commercial world.
  • Current challenges
  • Enabling large-scale data sharing: peer-data
    management systems.
  • The age-old problem: semantic heterogeneity.
  • A new agenda item for AI: corpus-based KR.

26
Schema/Ontology Matching
Data Source
Consumer
Mediator
Data Source
Data Source
  • Schema heterogeneity: a key roadblock for
    information integration
  • Different data sources speak their own schema
  • Mapping is key to any data sharing architecture

27
Schema Matching
Books: Title, ISBN, Price, DiscountPrice, Edition
Authors: ISBN, FirstName, LastName
BooksAndMusic: Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords
BookCategories: ISBN, Category
CDCategories: ASIN, Category
CDs: Album, ASIN, Price, DiscountPrice, Studio
Artists: ASIN, ArtistName, GroupName
(Diagram labels: Inventory Database A, Inventory Database B)
  • Schema matching: discovering correspondences
    between similar elements
  • Eventually: BooksAndMusic(x:Title, ...) =
    Books(x:Title, ...) ∪ CDs(x:Album, ...)

28
Typical Approaches
  • Multiple sources of evidence in the schemas:
  • Schema element names
  • BooksAndCDs/Categories ~ BookCategories/Category
  • Descriptions and documentation
  • ItemID: unique identifier for a book or a CD
  • ISBN: unique identifier for any book
  • Data types, data instances
  • DateTime ≠ Integer,
  • addresses have similar formats
  • Schema structure
  • All books have similar attributes
  • Use domain knowledge

In isolation, techniques are incomplete or brittle.
Combine multiple techniques to exploit all
available evidence.
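
A minimal sketch of combining two kinds of evidence, element names and data types, into one score. The weights and the data types are assumptions made for illustration; the element names are taken from the inventory example. Name similarity uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """String similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_score(type_a, type_b):
    """1.0 when the declared data types agree, else 0.0."""
    return 1.0 if type_a == type_b else 0.0


def match(source, target, w_name=0.7, w_type=0.3):
    """For each source element, pick the target element with the best combined score."""
    result = {}
    for s_name, s_type in source.items():
        result[s_name] = max(
            target,
            key=lambda t: w_name * name_score(s_name, t)
            + w_type * type_score(s_type, target[t]),
        )
    return result


# Element names come from the inventory example; the types are assumed.
db_a = {"Title": "string", "SuggestedPrice": "decimal"}
db_b = {"Title": "string", "Price": "decimal", "ISBN": "string"}
```

Here the name evidence alone is weak for SuggestedPrice, and the type evidence alone cannot separate Title from ISBN; together they are more robust than either in isolation.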
29
Philosophy of Solutions
  • Effective schema matching requires a principled
    combination of techniques.
  • Like human experts, the matcher should improve
    over time
  • LSD:
  • Mapping data sources to a mediated schema.
  • Use a few mappings as training examples to learn
    hypotheses for elements of the mediated schema.
  • See Doan et al., SIGMOD-2001, MLJ-2003
  • Next step: corpus-based matching.
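
The idea of learning from a few known mappings can be sketched with a single, very simple learner. This is not LSD's implementation; the class, the tokenization rule, and the training pairs are all hypothetical. It learns which name tokens tend to appear in source elements that map to each mediated-schema element, then predicts by token overlap.

```python
import re
from collections import Counter, defaultdict


def tokens(name):
    """Split a camelCase or snake_case element name into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]


class NameLearner:
    """Learns token counts per mediated-schema element from example mappings."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # mediated element -> token counts

    def train(self, examples):
        for source_name, mediated in examples:
            self.counts[mediated].update(tokens(source_name))

    def predict(self, source_name):
        toks = tokens(source_name)
        scores = {m: sum(c[t] for t in toks) for m, c in self.counts.items()}
        if not scores:
            return None                      # nothing learned yet
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None
```

A usage example under the same assumptions: after `train([("list_price", "Price"), ("DiscountPrice", "Price"), ("book_title", "Title")])`, the learner maps an unseen `SuggestedPrice` to `Price` and returns `None` for `ISBN`, which shares no tokens with the training names.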

30
Corpus-Based Matching
Collection of schemas and mappings
31
Mapping Knowledge Base
[Figure: Mapping Knowledge Base. Learners: Name (NL),
Data Instances (DIL), Data Type (DTL), Description (DL),
Structure (SL), and Meta (ML); one set of learners for
each of C1 ... CN.]
32
Preliminary results: Corpus is useful
33
With and without the corpus
34
Outline
  • Recent progress
  • Mediation languages
  • Query processing (XML and other)
  • Some lessons from commercial world.
  • Current challenges
  • Enabling large-scale data sharing: peer-data
    management systems.
  • The age-old problem: semantic heterogeneity.
  • A new agenda item for AI: corpus-based KR.

35
Corpus vs. Traditional KR
  • A large corpus of uncoordinated knowledge
    fragments
  • vs.
  • Carefully designed knowledge base
  • Can a corpus offer a more attractive solution for
    some KR problems?

36
Pause: KR vs. Corpus
  • Knowledge base
  • Hard to engineer, brittle at the boundaries
  • Only one way of saying things.
  • Corpus
  • Easier to build, coverage not predefined.
  • Many views of the domain.
  • See proceedings for full argument.

37
Corpus-based KR
  • Contents
  • Schemas, ontologies, meta-data, data, queries,
    mappings.
  • Collect statistics on the corpus
  • How often does a word appear as a relation name?
  • When it does, what tend to be the attribute
    names? What other tables are there?
  • Support a KR-style interface on the corpus
    (OKBC-like)
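
The statistics listed above could be gathered roughly as follows. This is a sketch over a tiny, hypothetical corpus in which each schema is a mapping from relation names to attribute names; a real corpus would also contain data, queries, and mappings, which are not modeled here.

```python
from collections import Counter, defaultdict

# Hypothetical corpus of schemas: relation name -> attribute names.
CORPUS = [
    {"Student": ["studentID", "GPA", "address"], "Course": ["courseID", "title"]},
    {"Student": ["id", "name", "GPA"]},
    {"GPA": ["studentID", "value"]},
]


def corpus_stats(corpus):
    """Collect simple usage statistics over a corpus of schemas."""
    relation_count = Counter()                    # word used as a relation name
    attribute_count = Counter()                   # word used as an attribute name
    attrs_given_relation = defaultdict(Counter)   # attributes seen under each relation
    co_tables = defaultdict(Counter)              # other relations in the same schema
    for schema in corpus:
        for rel, attrs in schema.items():
            relation_count[rel] += 1
            attribute_count.update(attrs)
            attrs_given_relation[rel].update(attrs)
            co_tables[rel].update(r for r in schema if r != rel)
    return relation_count, attribute_count, attrs_given_relation, co_tables
```

In this toy corpus the word GPA appears once as a relation name and twice as an attribute name, the kind of observation a KR-style interface over the corpus could expose.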

38
Other Applications of C-B-KR
  • Question answering on the web
  • Focused crawling
  • Natural language interfaces to DBs
  • Schema and ontology authoring
  • Semantic query optimization.
  • Whenever we need knowledge to help us rank
    multiple answers/plans.

39
Example Queries
  • How are two terms related?
  • GPA(studentID, value),
  • Student(studentID, GPA, address)
  • Find different ways of saying the same thing
  • Class(Lexus, Luxury)
  • LuxuryCar(Lexus, Toyota)
  • When do two terms play similar roles?
  • IJCAIReview(p1, rev2, accept)
  • AIJReferees(round2, p3, rev4, reject)

40
Challenges for C-B-KR
  • Building the corpus.
  • How focused should the corpus be?
  • Is human tuning needed or helpful?
  • How do we accommodate inference?
  • How do we leverage traditional KR?

41
Summary
  • The vision: data authoring, querying and sharing
    by everyone.
  • We got the plumbing to work. To go further, we
    need AI techniques.
  • Challenge: cross the structure chasm
  • It's hard to author and query structured data!
  • PDMS: architecture for ad-hoc sharing.
  • Ontology/schema matching is key!
  • Are we providing the right tools?
  • Corpus-based knowledge representation.
  • We need benchmarks!

42
Some References
  • www.cs.washington.edu/homes/alon
  • Piazza: ICDE-03, WWW-03, VLDB-03
  • The Structure Chasm: CIDR-03
  • Mediation surveys: VLDB Journal 01,
  • Lenzerini tutorial.
  • Schema matching:
  • Rahm and Bernstein, VLDB Journal 01.
  • Workshops: IJCAI, Semantic Web Conf.
  • Teaching integration to undergraduates: SIGMOD
    Record, September, 2003.