Integrating Online and Geospatial Information Sources - PowerPoint PPT Presentation

Loading...

PPT – Integrating Online and Geospatial Information Sources PowerPoint presentation | free to view - id: 3414c-NjAxO



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

Integrating Online and Geospatial Information Sources

Description:

Craig Knoblock. University of Southern California. 1 ... Craig Knoblock. University of Southern California. 7. Why isn't this just. No common schema ... – PowerPoint PPT presentation

Number of Views:76
Avg rating:3.0/5.0
Slides: 116
Provided by: craigkn9
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Integrating Online and Geospatial Information Sources


1
Integrating Online and Geospatial Information
Sources
  • Craig A. Knoblock
  • University of Southern California

2
Acknowledgements
  • Slide Integration
  • Thanks to Subbarao Kambhampati
  • We jointly gave a version of this tutorial at
    AAAI-02
  • Also many thanks to
  • Jose Luis Ambite
  • Greg Barish
  • Alon (Ha)Levy
  • Steve Minton
  • Ion Muslea
  • Sheila Tejada
  • for permission to use/mutilate some of their
    slides

3
Overview
  • Motivation for Information Integration
  • Database Refresher
  • Accessing Information Sources
  • Record Linkage
  • Data Integration/Query Planning
  • Plan Execution
  • Standards for Integration/Mediation
  • Discussion

4
Preamble Platitudes
  • Internet is growing at an enormous rate
  • Even after the bubble-burst
  • All kinds of information sources are online
  • Web pages, databases masquerading as web pages,
    Services, Sensors
  • Promise of unprecedented information access to
    every Tom, Dick and Mary..
  • But, right now, they still need to know where
    to go, and be willing to manually put together
    bits and pieces of information gleaned from
    various sources and services
  • Information Integration aims to do this
    automatically.

5
Isnt web mostly text?
  • The invisible web is mostly structured
  • Most web servers have back end database servers
  • They dynamically convert (wrap) the structured
    data into readable english
  • ltIndia, New Delhigt gt The capital of India is
    New Delhi.
  • So, if we can unwrap the text, we have
    structured data!
  • (un)wrappers, learning wrappers etc
  • Note also that such dynamic pages cannot be
    crawled...
  • The (coming) Semi-structured web
  • Most pages are at least semi-structured
  • XML standard is expected to ease the transfer of
    such pages.
  • The Services
  • Travel services, mapping services
  • The Sensors
  • Stock quotes, current temperatures, ticket prices

6
Why isnt this just
  • Search engines do text-based retrieval of URLS
  • Works reasonably well for single document texts,
    or for finding sites based on single document
    text
  • Cannot integrate information from multiple
    documents
  • Cannot do effective query relaxation or
    generalization
  • Cannot link documents and databases
  • The aim of Information integration is to support
    query processing over structured and
    semi-structured sources as well as services.

7
Why isnt this just
Databases
Distributed Databases
  • No common schema
  • Sources with heterogeneous schemas (and
    ontologies)
  • Semi-structured sources
  • Legacy Sources
  • Not relational-complete
  • Variety of access/process limitations
  • Autonomous sources
  • No central administration
  • Uncontrolled source content overlap
  • Unpredictable run-time behavior
  • Makes query execution hard
  • Presence of services
  • Need to compose services

8
(An exceedingly brief) Database Refresher
9
Traditional Database Architecture
Query (SQL)
Database Manager (DBMS) -Storage mgmt
-Query processing -View management
-(Transaction processing)
Database (relational)
Answer (relation)
10
Database Outline
  • What we care about
  • Structured data representations
  • Relational databases
  • Deductive databases
  • Structured query languages
  • SQL
  • Views ( materialized views)

11
Relational Data Terminology
Attribute names
Product
Name Price Category
Manufacturer gizmo 19.99
gadgets GizmoWorks Power
gizmo 29.99 gadgets
GizmoWorks SingleTouch 149.99
photography Canon MultiTouch 203.99
household Hitachi
tuples
(Arity4)
schema
Product(name string, Price real, category
enum, Manufacturer string)
12
Relational Algebra
  • Operators
  • tuple sets as input, new set as output
  • Operations
  • Union, Intersection, difference, ..
  • Selection (s)
  • Projection (?)
  • Cartesian product (X)
  • Join ( )


13
SQL A query language for Relational Algebra
Many standards out there SQL92, SQL2, SQL3,
SQL99 Select attributes From relations
(possibly multiple, joined) Where conditions
(selections)
Find companies that manufacture products
bought by Joe Blow SELECT Company.name FROM
Company, Product WHERE Company.nameProduct.m
aker AND Product.name IN
(SELECT product
FROM Purchase
WHERE buyer Joe Blow)
Other features aggregation, group-by etc.
14
Deductive Databases
  • Relations viewed as predicates.
  • Interrelations between relations expressed as
    datalog rules
  • (Horn clauses, without function symbols)
  • Queries correspond to datalog programs
  • Conjunctive queries are datalog programs with a
    single non-recursive rule Correspond to SPJ
    queries in SQL
  • Emprelated(Name,Dname) - Empdep(Name,Dname)
  • Emprelated(Name,Dname) - Empdep(Name,D1),
    Emprelated(D1,Dname)


EDB predicate
IDB predicate
15
Datalog
  • Datalog Program set of datalog rules
  • Datalog rule conjunctive query
  • big-LA-buyers(Buyer,Seller, Price) -
  • person(Buyer, Los Angeles),
  • purchase(Buyer, Seller,
    Product, Price),
  • Price gt 10000.

16
Datalog
  • Datalog Program set of datalog rules
  • Datalog rule conjunctive query
  • big-LA-buyers(Buyer,Seller, Price) -
  • person(Buyer, Los Angeles),
  • purchase(Buyer, Seller,
    Product, Price),
  • Price gt 10000.
  • Buyer, Seller, Price
  • ? Product person(Buyer, Los Angeles)
  • purchase(Buyer, Seller,
    Product, Price)
  • Price gt 10000)
  • ? big-LA-buyers(Buyer,Seller, Price)

Datalog
First-Order Logic
17
Views
CREATE VIEW Seattle-view AS SELECT
buyer, seller, product, store FROM
Person, Purchase WHERE Person.city
Seattle AND
Person.name Purchase.buyer
Virtual vs. Materialized
We can later use the views SELECT
name, store FROM Seattle-view,
Product WHERE Seattle-view.product
Product.name AND
Product.category shoes
Whats really happening when we query a view??
18
Conjunctive Queries and Views
  • CREATE VIEW Big-LA-buyers AS
  • SELECT buyer, seller, price
  • FROM Person, Purchase
  • WHERE Person.city Los Angeles AND
  • Person.name Purchase.buyer
    AND
  • Purchase.price gt 10000
  • big-LA-buyers(Buyer,Seller, Price) -
  • person(Buyer, Los Angeles),
  • purchase(Buyer, Seller,
    Product, Price),
  • Price gt 10000.
  • Datalog rule view definition
  • Rule body select-from-where construct of SQL

19
Integrator vs. DBMS
Reprise
  • No common schema
  • Sources with heterogeneous schemas
  • Semi-structured sources
  • Legacy Sources
  • Not relational-complete
  • Variety of access/process limitations
  • Autonomous sources
  • No central administration
  • Uncontrolled source content overlap
  • Lack of source statistics
  • Tradeoffs between query plan cost, coverage,
    quality etc.
  • Multi-objective cost models
  • Unpredictable run-time behavior
  • Makes query execution hard
  • Presence of services
  • Need to compose services

20
Acessing Information Sources
  • Wrapper Learning

21
Wrappers Information Agents
Thai lt 20
A G E N T
GIVE ME Thai food lt 20 A-rated
Arated
22
Wrapper Induction
  • Problem description
  • Web sources present data in human-readable format
  • take user query
  • apply it to data base
  • present results in template HTML page
  • To integrate data from multiple sources, one must
    first extract relevant information from Web pages
  • Task learn extraction rules based on labeled
    examples
  • Hand-writing rules is tedious, error prone, and
    time consuming

23
Example of Extraction Task
NAME Casablanca Restaurant STREET 220
Lincoln Boulevard CITY Venice PHONE
(310) 392-5751
24
Rule Learning
  • Machine learning
  • Use past experiences to improve performance
  • Rule learning
  • INPUT
  • Labeled examples training testing data
  • Admissible rules (hypotheses space)
  • Search strategy
  • Desired output
  • Rule that performs well both on training and
    testing data

25
STALKER Muslea et al, 98 99 01
  • Hierarchical wrapper induction
  • Decomposes a hard problem in several easier ones
  • Extracts items independently of each other
  • Each rule is a finite automaton
  • Advantages
  • Powerful extraction language (eg, embedded list)
  • One hard-to-extract item does not affect others
  • Disadvantage
  • Does not exploit item order (sometimes may help)

26
STALKER The Wrapper Architecture
Data
Query
Information Extractor
Extraction Rules
EC Tree
27
Extraction Rules
Extraction rule sequence of landmarks
SkipTo(Phone) SkipTo(ltigt)
SkipTo(lt/igt)
Name Joels ltpgt Phone ltigt (310) 777-1111
lt/igtltpgt Review
28
The Embedded Catalog Tree (ECT)
RESTAURANT Name
List ( Locations ) Cuisine
City List
(PhoneNumbers)
AreaCode Phone
  • Name KFC
  • Cuisine Fast Food
  • Locations
  • Venice (310) 123-4567,
  • (800) 888-4412.
  • L.A. (213) 987-6543.
  • Encino (818) 999-4567,
  • (888) 727-3131.

29
Example of Rule Induction
Training Examples
  • Name Del Taco ltpgt Phone (toll free) ltbgt ( 800
    ) 123-4567 lt/bgtltpgtCuisine ...
  • Name Burger King ltpgt Phone ( 310 ) 987-9876
    ltpgt Cuisine

30
Learning the Extraction Rules
GUI
Inductive Learning System
Extraction Rules
Labeled Pages
31
Example of Rule Induction
Training Examples
  • Name Del Taco ltpgt Phone (toll free) ltbgt ( 800
    ) 123-4567 lt/bgtltpgtCuisine ...
  • Name Burger King ltpgt Phone ( 310 ) 987-9876
    ltpgt Cuisine

Initial candidate SkipTo( ( )
32
Example of Rule Induction
Training Examples
  • Name Del Taco ltpgt Phone (toll free) ltbgt ( 800
    ) 123-4567 lt/bgtltpgtCuisine ...
  • Name Burger King ltpgt Phone ( 310 ) 987-9876
    ltpgt Cuisine

SkipTo( ltbgt ( ) ... SkipTo(Phone)
SkipTo( ( ) ... SkipTo() SkipTo(()
Initial candidate SkipTo( ( )
33
Example of Rule Induction
Training Examples
  • Name Del Taco ltpgt Phone (toll free) ltbgt ( 800
    ) 123-4567 lt/bgtltpgtCuisine ...
  • Name Burger King ltpgt Phone ( 310 ) 987-9876
    ltpgt Cuisine

Initial candidate SkipTo( ( )

SkipTo(Phone) SkipTo() SkipTo( ( )
...
SkipTo( ltbgt ( ) ... SkipTo(Phone)
SkipTo( ( ) ... SkipTo() SkipTo(()
34
Active Learning Information Agents
  • Active Learning
  • Idea system selects most informative exs. to
    label
  • Advantage fewer examples to reach same accuracy
  • Information Agents
  • One agent may use hundreds of extraction rules
  • Small reduction of examples per rule gt big
    impact on user
  • Want to achieve 100 accuracy with as few
    examples as possible

35
Which example should be labeled next?
SkipTo( Phone )
Training Examples
Name Joels ltpgt Phone (310) 777-1111
ltpgtReview The chef
Name Kims ltpgt Phone (213) 757-1111
ltpgtReview Korean
Unlabeled Examples
Name Chez Jean ltpgt Phone (310) 666-1111
ltpgt Review
Name Burger King ltpgt Phone(818) 789-1211
ltpgt Review ...
Name Café del Rey ltpgt Phone (310) 111-1111
ltpgt Review ...
Name KFC ltpgt Phoneltbgt (800) 111-7171 lt/bgt
ltpgt Review...
36

Multi-view Learning
Two ways to find start of the phone number
SkipTo( Phone )
BackTo( ( Number ) )
Name KFC ltpgt Phone (310) 111-1111 ltpgt
Review Fried chicken
37
Co-Testing

-
Labeled data
Unlabeled data
38
Co-Testing for Wrapper Induction
BackTo( (Number) )
SkipTo( Phone )
Name Joels ltpgt Phone (310) 777-1111
ltpgtReview ...
Name Kims ltpgt Phone (213) 757-1111
ltpgtReview ...
39
Not all queries are equally informative
Phone (800) 171-1771 ltpgt Fax (111)
111-1111 ltpgt Review
Phoneltigt - lt/igtltpgt Review Founded a
century ago (1891) , this
40
Weak Views
  • Learn content description for item to be
    extracted
  • Too general for extraction
  • ( Nmb ) Nmb Nmb cant tell a phone number from
    a fax number
  • Useful at discriminating among query candidates
  • Learned field description
  • Starts with ( Nmb )
  • Ends with Nmb Nmb
  • Contains Nmb Punct
  • Length 6,6

41
Naïve Aggressive Co-Testing
  • Naïve Co-Testing
  • Query randomly chosen contention point
  • Output rule with fewest mistakes on queries
  • Aggressive Co-Testing
  • Query contention point that most violates weak
    view
  • Output committee vote (2 rules weak view)

42
Empirical Results 33 Difficult Tasks
  • 33 most difficult of the 140 extraction tasks
  • Each view gt 7 labeled examples for best accuracy
  • At least 100 examples for task

43
Results in 33 Difficult Domains
Extraction Tasks
44
Results in 33 Difficult Domains
Extraction Tasks
45
Results in 33 Difficult Domains
Extraction Tasks
46
Automatic Wrapper Generation Crescenzi, Mecca,
Merialdo, 2001
  • Automatically generates wrappers web pages
  • Supports nested structures and lists
  • Applies to large, complex pages with regular
    structure
  • Approach
  • Start with the first page and create a union-free
    regular expression that defines the wrapper
  • Match each successive sample against the wrapper
  • Mismatches result in generalizations of the
    regular expression

47
Example Matching
48
Types of Mismatches
  • String mismatches are used to discover fields of
    the document
  • Tag mismatches can indicate either optional
    elements or iterators
  • For iterations, mismatch is caused by repeated
    elements in a list
  • End of the list corresponds to last matching
    token
  • Beginning of list corresponds to one of the
    mismatched tokens
  • These create possible squares

49
Limitations
  • Assumptions
  • Pages are well-structured
  • Want to extract at the level of entire fields
  • Structure can be modeled without disjunctions
  • Search space for explaining mismatches is huge
  • Uses a number of heuristics to prune space
  • Limited backtracking
  • Limit on number of choices to explore
  • Patterns cannot be delimited by optionals
  • Will result in pruning possible wrappers

50
Record Linkage
  • Integrating Data Across Sources

51
Record Linkage
  • Problem
  • Different sources typically represent and format
    information differently.
  • As a result, determining if two sources are
    referring to the same object can be difficult.
  • Example
  • Is Joe Cool the same person as Joseph B.
    Cool?
  • What if they have the same telephone number?
  • What if Joe Cools number is 310-322-0730 and
    Joseph B. Cools number is 310-640-2973?

52
Example Data Integration Problem
  • How to align (or join) the objects across
    different sources

Zagats Restaurant Guide Source
Department of Health Restaurant Source
Arts Delicatessen Ca Brea CPK The
Grill Patina Philippes The Original The Tillerman
Arts Deli California Pizza Kitchen Campanile Citr
us Grill, The Philippe The Original Spago
53
Information Retrieval Approach Cohen, 1998
  • Idea Evaluate the similarity of records via
    textual similarity. Used in Whirl (Cohen 1998).
  • Follows the same approach used by classical IR
    algorithms (including web search engines).
  • First, stemming is applied to each entry.
  • E.g. Joes Diner -gt Joe s Diner
  • Then, entries are compared by counting the number
    of words in common.
  • Note Infrequent words weighted more heavily by
    TFIDF metric Term Frequency Inverse Document
    Frequency

54
Unsupervised Record Linkage
  • Idea Analyze data and automatically cluster
    pairs into three groups
  • Let R P(obs Same) / P(obs Different)
  • Matched if R gt threshold TU
  • Unmatched if R lt threshold TL
  • Ambiguous if TL lt R lt TU
  • This model for computing decision rules was
    introduced by Felligi Sunter in 1969
  • Particularly useful for statistically linking
    large sets of data, e.g., by US Census Bureau

55
Unsupervised Record Linkage (cont.)
  • Winkler (1998) used EM algorithm to estimate
    P(obs Same) and P(obs Different)
  • EM computes the maximum likelihood estimate
  • The algorithm iteratively determines the
    parameters most likely to generate the observed
    data.
  • Additional mathematical techniques must be used
    to adjust for relative frequencies
  • I.e. last name of Smith is much more frequent
    than Knoblock.

56
Supervised Active Learning ApproachTejada,
Knoblock Minton, 2001
  • Supervised learning. System learns
  • Which attributes to weight more heavily
  • Transformation rules

57
Mapping Rules
  • Set of Similarity Scores Mapping Rules
  • Name Street Phone

.967 .973 .3 .17 .3
.74 .8 .542 .49 .95 .97
.67
Name gt .8 Street gt .79 gt mapped Name gt .89 gt
mapped Street lt .57 gt not mapped
58
Transformation Weights
  • Appropriate transformations depend on the
    application domain
  • Restaurants, companies, airports
  • and on the different attributes within an
    application
  • Acronym is more appropriate for restaurant name
    than phone number
  • Learn likelihood that if a transformation is
    applied then two object match

Transformation Weight P (match transformation)
59
Learning Object Mappings
Active Atlas
Set of Mapped Objects
Source 1
Candidate Generator
Mapping Learner
Source 2
  • Candidate Generator
  • Judge textual similarity of mappings
  • Reduce number of mappings considered for
    classification
  • Mapping Learner
  • Active learning technique to learn mapping rules
    and transformation weights
  • System chooses most informative example for the
    user to label
  • Minimize the amount of user interaction

User Input
60
Mapping Rule Learner
61
Committee Disagreement
  • Chooses an example based on the disagreement of
    the query committee
  • In this case CPK, California Pizza Kitchen is the
    most informative example based on disagreement

Committee
Examples M1 M2 M3
Arts Deli, Arts Delicatessen CPK, California
Pizza Kitchen CaBrea, La Brea Bakery
Yes Yes Yes Yes No
Yes No No No
62
Data Integration/Query Planning
63
Principal Dimensions of Data Integration
  • Virtual vs. materialized architecture
  • Access query only or query update?
  • Mediated schema and query reformulation
  • Content Descriptions
  • Global-as-view
  • Local-as-view
  • Language for descriptions and queries
    conjunctive queries (CQs), union of CQs, Datalog
    (recursion), first-order logic (?,?,?),
    description logics,
  • Types of Sources
  • Structured (DBs) vs. semi-structured (Web)
  • Source capabilities positive and negative

64
Materialized Architecture Data Warehouse
65
Virtual ArchitectureMediator
66
Virtual Integration Architecture
  • Leave the data in the sources
  • When a query comes in
  • Determine the relevant sources to the query
  • Break down the query into sub-queries for the
    sources
  • Get the answers from the sources, and combine
    them appropriately
  • Data is fresh. Approach scalable
  • Issues
  • Relating Sources Mediator
  • Reformulating the query
  • Efficient planning execution

User queries
Mediated schema
Mediator
Reformulation engine
optimizer
Data source
Execution engine
catalog
wrapper
wrapper
wrapper
Data
Data
Data
source
source
source
Garlic IBM, HermesUMDTsimmis,
InfoMasterStanford DISCOINRIA Information
Manifold ATT SIMS/AriadneUSCEmerac/HavasuA
SU
67
Desiderata for Relating Source-Mediator Schemas
  • Expressive power distinguish between sources
    with closely related data. Hence, be able to
    prune access to irrelevant sources.
  • Easy addition make it easy to add new data
    sources.
  • Reformulation be able to reformulate a user
    query into a query on the sources efficiently and
    effectively.
  • Nonlossy be able to handle all queries that can
    be answered by directly accessing the sources

Reformulation
68
Source Descriptions
  • Elements of source descriptions
  • Contents source contains movies, directors,
    cast.
  • Constraints only movies produced after 1965.
  • Completeness contains all American movies.
  • Capabilities
  • Negative source requires movie title or director
    as input
  • Positive source can perform selections, joins,

69
Approaches to Specification of Source Descriptions
  • Global-as-View (GAV)
  • Mediator relation defined as a view over source
    relations
  • Ex TSIMMIS (Stanford), HERMES (Maryland)
  • Local-as-View (LAV)
  • Source relation defined as view over mediator
    relations
  • Ex Information Manifold (ATT), Tukwila(UW),
    InfoMaster (Stanford), Ariadne (USC)

View named query logical formula
70
Query Reformulation
  • Problem rewrite the user query expressed in the
    mediated schema into a query expressed in the
    source schemas
  • Given a query Q in terms of the mediated-schema
    relations, and descriptions of the information
    sources,
  • Find a query Q that uses only the source
    relations, such that
  • Q Q (i.e., answers are correct i.e., Q ? Q)
    and
  • Q provides all possible answers to Q given the
    sources

71
Answering queries using views
  • Given query q and view definitions VV1Vn
  • q is an Equivalent Rewriting of q using V if
  • q refers only to views in V, and
  • q q
  • q is a Maximally-Contained Rewriting of q using
    V if
  • q refers only to views in V, and
  • q ? q, and
  • there is no rewriting q1, such that q ? q1 ? q
    and q1 ? q

72
Global-as-View (GAV)
  • Each mediator relation is defined as a view over
    source relations.
  • MovieActor(title,actor) ?
  • DB1(id,title,actor,year)
  • MovieActor(title,actor) ? DB2(title,director,a
    ctor,year)
  • MovieReview(title, review) ? DB1(id,title,actor,ye
    ar) DB3(id,review)

73
Query Reformulation in GAV
  • Query reformulation rule unfoldingsimplificatio
    n
  • Query Find reviews for DeNiro movies
  • q(title,review) - MovieActor(title,DeNiro),
  • MovieReview(title,revie
    w)
  • 1. q(title,review) - DB1(id,title,DeNiro,year)
    ,
  • DB1(id,title,actor,year),
    DB3(id,review)
  • 2. q(title,review) -
  • DB2(title,director,DeNiro,year),
  • DB1(id,title,actor, year),
    DB3(id,review)

74
Local-as-View (LAV)
  • Each source relation is defined as a view over
    mediator relations
  • V1(title, year, director) ? Movie(title,year,dire
    ctor,genre) American(director) year 1960
    genre Comedy
  • V2 (title, review) ? Movie(title,year,director,gen
    re) year1990 MovieReview(title, review)

?
?
75
Query Reformulation in LAV
Query Reviews for comedies produced after
1950 q(title,review) - Movie(title,year,director,
Comedy), year 1950, MovieReview(title,review)
  • Reformulated query
  • q(title,review) - V1(title,year,director),
  • V2(title,review)

q ? q
V1(title, year, director) ? Movie(title,year,direc
tor,genre) American(director) year 1960
genre Comedy V2 (title, review) ?
Movie(title,year,director,genre) year1990
MovieReview(title, review)
76
Integrating GIS and ImageryGlobal as View
Approach Gupta et al.
  • GIS Source
  • Soil maps
  • Parcel maps
  • Digital elevation maps
  • Transportation network maps
  • Image Library
  • Satellite imagery
  • Aerial images
  • Property photographs

77
Mediation in MIX
  • Mediator defined by building an structured
    representation of both GIS and image sources
  • Mediator relations defined by
  • Containment conditions
  • Spatial or temporal joins
  • Logical associations
  • Queries and results in XML

78
Mediation in MIX (cont.)
  • Wrappers
  • Construct wrappers for the GIS and image data
    sources
  • Evaluating spatial queries
  • Determine subqueries to each of the sources
  • Compose results and produce integrated XML
    document
  • Spatial data converter used to handle conversions
    between sources (e.g., UTM to USGS 7.5 quad)

79
Example
  • Produce a table of aerial imagery and photographs
    of houses broken down by 5-year increments and
    Total Assessed Value

80
Result
81
Plan Execution
82
Motivation
  • Problem
  • Information gathering may involve accessing and
    integrating data from many sources
  • Total time to execute these plans may be large
  • Why?
  • Unpredictable network latencies
  • Varying remote source capabilities
  • Thus, execution is often I/O-bound
  • Complicating factor binding patterns
  • During execution, many sources cannot be queried
    until a previous source query has been answered

83
GAV vs. LAV
  • Not modular
  • Addition of new sources changes the mediated
    schema
  • Can be awkward to write mediated schema without
    loss of information
  • Query reformulation easy
  • reduces to view unfolding (polynomial)
  • Can build hierarchies of mediated schemas
  • Best when
  • Few, stable, data sources
  • well-known to the mediator (e.g. corporate
    integration)
  • Garlic, TSIMMIS, HERMES
  • Modular--adding new sources is easy
  • Very flexible--power of the entire query language
    available to describe sources
  • Reformulation is hard
  • Involves answering queries only using views (can
    be intractable)
  • Best when
  • Many, relatively unknown data sources
  • possibility of addition/deletion of sources
  • Information Manifold, InfoMaster, Emerac

84
Traditional Approaches
  • Executing information gathering plans
  • Generate a plan
  • Plan typically consists of a partial ordering of
    the operators
  • Execute the plan based on the given order
  • Operators process all of their input data before
    transmitting any results to consumer(s)
  • Operators as fast as their most latent input
  • Long delays due to the dependencies in the plan

85
Dataflow vs Von-Neumann
((a b) (c d))
a
b
c
d
a
b
c
d
ADD
ADD
ADD
ADD
MUL
arc
MUL
actor
86
Streaming Dataflow
  • Plans consist of a network of operators
  • Each operator like a function
  • Example Wrapper, Select, etc.
  • Operators produce and consume data
  • Operators fire when any part of any input data
    becomes available
  • Data routed between operators are relations
  • Zero or more tuples with one or more attributes

Input
Output
Plan
Wrapper
Wrapper
Join
Select
87
Parallelism of Streaming Dataflow
  • Dataflow (horizontal parallelism)
  • Decentralized, independent operator execution
  • Enables "maximally parallel" operator execution
  • Also known as the "dataflow limit"
  • Streaming/pipelining (vertical parallelism)
  • Producer emits tuples to consumer ASAP
  • Producer consumer can process same relation
    simultaneously
  • Effective because information gathering latencies
    can be high even at the tuple level
  • Data often "trickles" out of I/O-bound operators

88
Example The RepInfo Agent
  • INPUT
  • Any street address
  • e.g., 4767 Admiralty Way, Marina del Rey, CA,
    90292
  • OUTPUT
  • Federal reps
  • 2 senators,
  • 1 house member
  • For each rep
  • Recent news
  • Real-time funding
  • information

89
RepInfo Sources
90
RepInfo Sources
91
RepInfo Sources
92
OpenSecrets Navigation Fetching!
93
OpenSecrets Navigation Fetching!
94
OpenSecrets Navigation Fetching!
95
OpenSecrets Navigation Fetching!
96
RepInfo agent plan
address
senators house reps
combined results
recent news
Join name
Wrapper Yahoo News
Select senators, house reps
Wrapper Vote-Smart
graph URL
Wrapper OpenSecrets (funding page)
Wrapper OpenSecrets (member page)
Wrapper OpenSecrets (names page)
all officials
member URL
funding URL
97
Adaptive Query Execution
  • Network Query Engines
  • Tukwila
  • Operator reordering
  • Optimized operators
  • Telegraph
  • Tuple-level adaptivity
  • Niagara
  • Partial results for blocking operators
  • Agent Execution Language
  • Theseus
  • Speculative execution

98
How to speculate?
  • General problem
  • Means for issuing and confirming predictions
  • Two new operators
  • Speculate Makes predictions based on "hints"
  • Confirm Prevents errant results from exiting
    plan

hints
predictions/additions
Speculate
answers
confirmations
probable results
Confirm
actual results
confirmations
99
How to speculate?
  • Example RepInfo
  • Make predictions about officials based on address
  • Makes practical sense
  • Representatives do not change often
  • Addresses-to-reps is a many-to-one relationship

100
Speedups beyond 2
  • Cascading speculation
  • Speculation on speculation
  • Functional dependencies
  • Enable early confirmation because subsequent FD
    processing is deterministic

S
S
S
S
S
S
S
S
S
W
W
W
W
W
W
W
W
W
W
G
101
Learning to Speculate
  • Accurate predictions
  • The better our prediction accuracy, the better
    the speedup
  • Example
  • Predict federal officials given an address
  • Categories of predictions
  • How do we deal with?
  • New hints
  • Making novel predictions

102
Caching
  • Associate answers with previously seen hints
  • Advantages
  • Simple
  • Disadvantages
  • Requires lots of space
  • Only supports previously seen predictions on
    previously seen data (category A)

103
Other ways to predict
  • Classification
  • 4780 Admiralty Way, Marina del Rey, CA 90292
  • Likely reps (Boxer, Feinstein, and Harman)
  • We have learned that zip code and city are the
    features that most likely indicate the
    representative
  • Translation
  • Some data have predictable transformations
  • Example the OpenSecrets source
  • Member URL
  • http//www.opensecrets.org/politicians/summary.asp
    ?CIDN00006750
  • Funding URL
  • http//www.opensecrets.org/politicians/sector.asp?
    CIDN00006750

104
Standards for Integration/Mediation
105
The X-standards
  • XML an on-the-wire representation for data
  • Xquery a query language for XML
  • Xschema/DTD a schema description language for
    XML data
  • RDF a language for meta-data description
  • WSDL/SOAP/UDDI languages for describing services

106
HTML vs. XML
  • ltbibliographygt
  • ltbookgt lttitlegt Foundations lt/titlegt
  • ltauthorgt Abiteboul lt/authorgt
  • ltauthorgt Hull lt/authorgt
  • ltauthorgt Vianu lt/authorgt
  • ltpublishergt Addison Wesley
    lt/publishergt
  • ltyeargt 1995 lt/yeargt
  • lt/bookgt
  • lt/bibliographygt
  • lth1gt Bibliography lt/h1gt
  • ltpgt ltigt Foundations of Databases lt/igt
  • Abiteboul, Hull, Vianu
  • ltbrgt Addison Wesley, 1995
  • ltpgt ltigt Data on the Web lt/igt
  • Abiteoul, Buneman, Suciu
  • ltbrgt Morgan Kaufmann, 1999

Self-describing -Schema info part of the
data -Good for data exchange (albeit
baroque for storage)
107
XML Terminology
  • tags book, title, author,
  • start tag ltbookgt, end tag lt/bookgt
  • elements ltbookgtltbookgt,ltauthorgtlt/authorgt
  • elements are nested
  • empty element ltredgtlt/redgt abbrv. ltred/gt
  • an XML document single root element

well formed XML document if it has matching tags
108
Why are Database folks so excited about XML?
  • XML is just a syntax for (self-describing) data
  • This is still exciting because
  • No standard syntax for relational data
  • With XML, we can
  • Translate any legacy data to XML
  • Can exchange data in XML format
  • Ship over the web, input to any application

109
XML vs. Relational Data
  • XML is meant as a language that supports both
    Text and Structured Data
  • Conflicting demands...
  • XML supports semi-structured data
  • In essence, the schema can be union of multiple
    schemas
  • Easy to represent books with or without prices,
    books with any number of authors etc.
  • XML supports free mixing of text and data
  • using the PCDATA type
  • XML is ordered (while relational data is
    unordered)

110
Querying XML
  • Requirements
  • Need to handle lack of schema.
  • We may not know much about the data, so we need
    to navigate the XML.
  • Need to support both information retrieval and
    SQL-style queries.
  • Ordered vs. un-ordered XML
  • Human readable
  • like SQL? ?
  • Candidates
  • Many based on conflicting requirements
  • XSL Makes IR folks happy
  • XML-QL Makes DB folks happy
  • Xquery W3Cs attempt to make everybody (un)happy

111
Example Query
Query
Result
  • ltbibgt
  • for b in /bib/book
  • where b/publisher "Addison-Wesley"
  • and b/_at_year gt 1991
  • return ltbook year b/_at_year gt
  • b/title
  • lt/bookgt
  • lt/bibgt
  • For all books after 1991,
  • return with Year changed from
  • a tag to an attribute

ltbibgt ltbook year"1994"gt lttitlegtTCP/IP
Illustratedlt/titlegt lt/bookgt ltbook
year"1992"gt lttitlegtAdvanced Programming in
the Unix environmentlt/titlegt lt/bookgt lt/bibgt
112
Impact of XML on Integration
  • If and when all sources accept Xqueries and
    exchange data in XML format, then
  • Mediator can accept user queries in Xquery
  • Access sources using Xquery
  • Get data back in XML format
  • Merge results and send to user in XML format
  • How about now?
  • Sources can use XML adapters (middle-ware)

113
XML middleware for Databases
  • XML adapters (middle-ware) received significant
    attention in DB community
  • SilkRoute (ATT)
  • Xperanto (IBM)
  • Issues
  • Need to convert relational data into XML
  • Tagging (easy)
  • Need to convert Xquery queries into equivalent
    SQL queries
  • Trickier as Xquery supports schema querying
  • A single query may be mapped into a union of SQL
    queries

114
Is XML standardization a magical solution for
Integration?
  • If all WEB sources standardize into XML format
  • Source access (wrapper generation issues) become
    easier to manage
  • BUT all other problems remain
  • Still need to relate source (XML)schemas to
    mediator (XML)schema
  • Still need to reason about source overlap, source
    access limitations etc.
  • Still need to manage execution in the presence of
    source/network uncertainities

115
Semantic Web
  • The LAV/GAV approaches assume that some human
    expert will do the actual schema mapping
  • The semantic-web initiative attempts to
    automate schema mapping
  • Idea Allow pages to write logical axioms
    relating their vocabulary (tags) to other
    external tags
  • Support automatic inference of relations between
    source and mediator schema using these rules
  • DAMLOIL

116
Review
  • Motivation for Information Integration
  • Database Refresher
  • Accessing Information Sources
  • Record Linkage
  • Data Integration/Query Planning
  • Plan Execution
  • Standards for Integration/Mediation
  • Discussion

117
Discussion
  • Many opportunities for integrating online sources
    with geospatial sources
  • There are many online sources that can be related
    to geospatial sources
  • Online databases, text documents, phone books,
    schedules,
  • Possible uses
  • Augmenting what is known about geospatial
    entities
  • Using online sources to analyze imagery
  • Combining online sources with geospatial sources
    for research and analysis
  • Effect of coastal pollution on the economy of
    coastal regions
  • Placing text documents in a geospatial context
About PowerShow.com