AnHai Doan - PowerPoint PPT Presentation

About This Presentation
Title:

AnHai Doan


Transcript and Presenter's Notes



1
From Data Integration to Community Information Management
  • AnHai Doan
  • University of Illinois
  • Joint work with Pedro DeRose, Robert McCann,
    Yoonkyong Lee, Mayssam Sayyadian, Warren Shen,
    Wensheng Wu, Quoc Le, Hoa Nguyen, Long Vu, Robin
    Dhamankar, Alex Kramnik, Luis Gravano, Weiyi
    Meng, Raghu Ramakrishnan, Dan Roth, Arnon
    Rosenthal, Clement Yu

2
Data Integration Challenge
Find houses with 4 bedrooms priced under 300K
New researcher
homes.com
realestate.com
homeseekers.com
3
Actually Bought a House in 2004
  • Buying period
  • queried 7-8 data sources over 3 weeks
  • some of the sources are local, not indexed by
    national sources
  • 3 hours / night → 60 hours
  • huge amount of time on querying, post processing
  • Buyer-remorse period
  • repeated the above for another 3 weeks!

We really need to automate data integration ...
4
Architecture of Data Integration Systems
Find houses with 4 bedrooms priced under 300K
mediated schema
source schema 2
source schema 3
source schema 1
houses.com
homes.com
realestate.com
5
Current State of Affairs
  • Vibrant research & industrial landscape
  • Research since the 70s, accelerated in past
    decade
  • database, AI, Web, KDD, Semantic Web communities
  • 14 workshops in past 3 years: ISWC-03, IJCAI-03,
    VLDB-04, SIGMOD-04, DILS-04, IQIS-04, ISWC-04,
    WebDB-05, ICDE-05, DILS-05, IQIS-05, IIWeb-06,
    etc.
  • main database focuses:
  • modeling, architecture, query processing,
    schema/tuple matching
  • building specialized systems: life sciences, Deep
    Web, etc.
  • Industry
  • 53 startups in 2002 [Wiederhold-02]
  • many new ones in 2005

Despite much R&D activity, however ...
6
DI Systems are Still Very Difficult to Build and Maintain
  • Builder must execute multiple tasks
  • select data sources
  • create wrappers
  • create mediated schemas
  • match schemas
  • eliminate duplicate tuples
  • monitor changes
  • etc.
  • Most tasks are extremely labor intensive
  • Total cost often at 35% of IT budget [Knoblock
    et al. 02]
  • systems often take months or years to develop
  • High cost severely limits deployment of DI
    systems

7
Data Integration @ Illinois


  • Directions
  • automate tasks to minimize human labor
  • leverage users to spread out the cost
  • simplify tasks so that they can be done quickly

8
Sample Research on Automating Integration Tasks: Schema Matching

Mediated schema: price, agent-name, address
(Figure: arrows link mediated-schema attributes to homes.com columns; one is labeled "1-1 match", another "complex match".)
homes.com:
listed-price  contact-name  city     state
320K          Jane Brown    Seattle  WA
240K          Mike Smith    Miami    FL
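
A minimal sketch of how the two kinds of match on this slide could be applied to the sample rows. The transcript does not say which attributes the arrows connect, so the sketch assumes the 1-1 matches are price = listed-price and agent-name = contact-name, and that the complex match builds address from city and state. The names MATCHES and to_mediated are hypothetical.

```python
# Sample homes.com rows, copied from the slide.
rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]

# Assumed matches: each mediated attribute is computed from a source row.
MATCHES = {
    "price": lambda r: r["listed-price"],              # 1-1 match (assumed)
    "agent-name": lambda r: r["contact-name"],         # 1-1 match (assumed)
    "address": lambda r: f'{r["city"]}, {r["state"]}', # complex match (assumed)
}

def to_mediated(row):
    """Translate one source row into the mediated schema using the matches."""
    return {attr: compute(row) for attr, compute in MATCHES.items()}

mediated_rows = [to_mediated(r) for r in rows]
```

A 1-1 match only renames a column, while a complex match needs an expression over several columns, which is part of why complex matches are harder to discover automatically.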
9
Schema Matching is Ubiquitous!
  • Fundamental problem in numerous applications
  • Databases
  • data integration,
  • model management
  • data translation, collaborative data sharing
  • keyword querying, schema/view integration
  • data warehousing, peer data management,
  • AI
  • knowledge bases, ontology merging, information
    gathering agents, ...
  • Web
  • e-commerce, Deep Web, Semantic Web, Google Base,
    next version of My Web 2.0?
  • eGovernment, bio-informatics, e-sciences

10
Why Schema Matching is Difficult
  • Schema & data never fully capture semantics!
  • not adequately documented
  • Must rely on clues in schema & data
  • using names, structures, types, data values, etc.
  • Such clues can be unreliable
  • same names → different entities: area →
    location or square-feet
  • different names → same entity: area &
    address → location
  • Intended semantics can be subjective
  • house-style = house-description?
  • Cannot be fully automated, needs user feedback

11
Current State of Affairs
  • Schema matching is now a key bottleneck!
  • largely done by hand, labor intensive & error
    prone
  • data integration at GTE [Li & Clifton, 2000]
  • 40 databases, 27000 elements, estimated time 12
    years
  • Numerous matching techniques have been developed
  • Databases: IBM Almaden, Wisconsin, Microsoft
    Research, Purdue, BYU, George Mason, Leipzig,
    NCSU, Illinois, Washington, ...
  • AI: Stanford, Toronto, Rutgers, Karlsruhe
    University, NEC, USC, "everyone and his
    brother is doing ontology mapping"
  • Techniques are often synergistic, leading to
    multi-component matching architectures
  • each component employs a particular technique
  • final predictions combine those of the components

12
Example: LSD [Doan et al., SIGMOD-01]
(Figure: the source schema of homes.com is matched against the mediated schema.)
Mediated schema: address, agent-name
Source schema (homes.com): area, contact-agent
Sample source data: Urbana, IL; Seattle, WA; Peoria, IL; Kent, WA; James Smith; Mike Doan; (206) 634 9435; (617) 335 4243
Components: Name Matcher, Naive Bayes Matcher, Combiner, Constraint Enforcer, Match Selector
Matcher scores shown in the figure: 0.5, 0.3, 0.1
Combined predictions:
  area => (address, 0.7), (description, 0.3)
  contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
  comments => (address, 0.6), (desc, 0.4)
Domain constraint: Only one attribute of source schema matches address
Final matches: area = address, contact-agent = agent-phone, ..., comments = desc
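
A minimal sketch of the constraint-enforcement step, using the combined prediction scores from this slide. The scoring rule (product of the chosen scores) and the brute-force search are assumptions made for illustration; they are not claimed to be what LSD itself does. The function names are hypothetical, and "desc" is written out as "description".

```python
from itertools import product

# Combined predictions from the slide: source attribute -> {mediated attribute: score}.
combined = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-agent": {"agent-phone": 0.7, "agent-name": 0.3},
    "comments": {"address": 0.6, "description": 0.4},
}

def satisfies_constraints(assignment):
    """Constraint from the slide: only one source attribute may match 'address'."""
    return list(assignment.values()).count("address") <= 1

def enforce_constraints(predictions):
    """Return the best-scoring assignment that satisfies the constraint.

    Assumption: an assignment is scored by the product of its scores, and all
    assignments are enumerated (fine for a toy example, not for large schemas).
    """
    sources = list(predictions)
    best, best_score = None, -1.0
    for choice in product(*(predictions[s] for s in sources)):
        assignment = dict(zip(sources, choice))
        if not satisfies_constraints(assignment):
            continue
        score = 1.0
        for source, target in assignment.items():
            score *= predictions[source][target]
        if score > best_score:
            best, best_score = assignment, score
    return best

# Taking the top prediction per attribute would map both area and comments to address.
greedy = {s: max(p, key=p.get) for s, p in combined.items()}
final = enforce_constraints(combined)
```

Under these assumptions the greedy choice violates the constraint, while the enforced result maps area to address, contact-agent to agent-phone, and comments to description, which agrees with the final matches shown on the slide.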
13
Multi-Component Matching Solutions
  • Introduced in [Doan et al., WebDB-00, SIGMOD-01],
    [Do & Rahm, VLDB-02], [Embley et al. 02]
  • Now commonly adopted, with industrial-strength
    systems
  • e.g., Protoplasm [MSR], COMA [Univ. of Leipzig]

(Figure: multi-component architectures of LSD, COMA, SF, and LSD-SF, drawn with boxes labeled Matcher, Matcher 1 ... Matcher n, Combiner, and Match selector.)
  • Such systems are very powerful ...
  • maximize accuracy, highly customizable
  • ... but place a serious tuning burden on domain
    users

14
Tuning Schema Matching Systems
  • Given a particular matching situation
  • how to select the right components?
  • how to adjust the multitude of knobs?

(Figure: library of matching components, execution graph, and knobs.)
Library of matching components:
  • Matchers: q-gram name matcher, decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
  • Combiners: average combiner, min combiner, max combiner, weighted sum combiner
  • Constraint enforcers: A* search enforcer, relaxation labeler, ILP
  • Match selectors: threshold selector, bipartite graph selector
Knobs of decision tree matcher: characteristics of attributes, split measure, post-prune?, size of validation set
Execution graph
  • Untuned versions produce inferior accuracy
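
A small sketch of why knob settings matter, assuming a toy system with only two knobs: which combiner to use and the threshold of a threshold selector. The matcher scores, the knob names, and the run_matcher function are hypothetical and are not taken from any of the systems named on the slides.

```python
from statistics import mean

# Hypothetical scores from two matchers for one source attribute.
matcher_scores = {
    "name_matcher": {"address": 0.2, "agent-name": 0.8},
    "naive_bayes_matcher": {"address": 0.6, "agent-name": 0.4},
}

# Three of the combiners listed on the slide.
COMBINERS = {"average": mean, "min": min, "max": max}

def run_matcher(scores, knobs):
    """Combine per-matcher scores, then keep the candidates at or above the threshold."""
    combine = COMBINERS[knobs["combiner"]]
    candidates = sorted({c for per_matcher in scores.values() for c in per_matcher})
    combined = {
        c: combine([per_matcher.get(c, 0.0) for per_matcher in scores.values()])
        for c in candidates
    }
    return sorted(c for c, score in combined.items() if score >= knobs["threshold"])

with_average = run_matcher(matcher_scores, {"combiner": "average", "threshold": 0.5})
with_min = run_matcher(matcher_scores, {"combiner": "min", "threshold": 0.5})
with_max = run_matcher(matcher_scores, {"combiner": "max", "threshold": 0.5})
```

With the same matcher outputs and the same threshold, the average combiner keeps one candidate, the min combiner keeps none, and the max combiner keeps both, so even this two-knob toy can return three different answers.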

15
But Tuning is Extremely Difficult
  • Large number of knobs
  • e.g., 8-29 in our experiments
  • Wide variety of techniques
  • database, machine learning, IR, information
    theory, etc.
  • Complex interaction among components
  • Not clear how to compare quality of knob configs
  • Long-standing problem since the 80s, getting
    much worse with multiple-component systems

→ Developing efficient tuning techniques is now crucial
16
The eTuner Solution [VLDB-05a]
  • Given schema S & matching system M
  • tunes M to maximize average accuracy of matching
    S with future schemas
  • commonly occur in data integration, warehousing,
    supply chain
  • Challenge 1: Evaluation
  • score each knob config K of matching system M
  • return K*, the one with highest score
  • but how to score knob config K?
  • if we know a representative workload W = (S,T1),
    ..., (S,Tn), and correct matches between S and
    T1, ..., Tn
    → can use W to score K
  • Challenge 2: Huge or infinite search space
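
The evaluation step can be sketched as follows, assuming a workload with known correct matches is available. The F1 measure, the toy prefix_matcher with its single prefix_len knob, and the exhaustive search over three configurations are all assumptions for illustration. The slides say only that configurations are scored on a workload and that the real search space is huge or infinite.

```python
def f1(predicted, gold):
    """F1 accuracy of a set of predicted matches against the correct ones."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def score_config(match, config, workload):
    """Average accuracy of matcher `match` under knob config `config` over workload W."""
    return sum(f1(match(s, t, config), gold) for s, t, gold in workload) / len(workload)

def best_config(match, configs, workload):
    """Exhaustive search; only feasible when the set of configs is small."""
    return max(configs, key=lambda k: score_config(match, k, workload))

def prefix_matcher(source, target, config):
    """Toy matcher with one knob: names match if they share a prefix of length prefix_len."""
    k = config["prefix_len"]
    return {(a, b) for a in source for b in target if a[:k].lower() == b[:k].lower()}

S = ["salary", "last", "id"]
# Workload W: (S, T_i, correct matches between S and T_i).
workload = [
    (S, ["sal", "last_name", "id"], {("salary", "sal"), ("last", "last_name"), ("id", "id")}),
    (S, ["state", "lname", "idnum"], {("last", "lname"), ("id", "idnum")}),
]
configs = [{"prefix_len": k} for k in (1, 2, 4)]
best = best_config(prefix_matcher, configs, workload)
```

On this toy workload the configuration with prefix_len 1 scores highest. The same scoring loop would apply to any matcher, but enumerating every configuration stops being practical once there are many knobs.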

17
Solving Challenge 1: Generate Synthetic Input/Output
  • Need workload W = (S,T1), (S,T2), ..., (S,Tn)
  • To generate W
  • start with S
  • perturb S to generate T1
  • perturb S to generate T2
  • etc.
  • Know the perturbation → know matches between S &
    Ti
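
A minimal sketch of generating a synthetic workload by perturbation, assuming a schema is just a list of column names. The renaming table SYNONYMS, the probabilities, and the function names are hypothetical; eTuner's actual perturbation rules are not given in this transcript.

```python
import random

# Hypothetical renaming rules for column names.
SYNONYMS = {"salary": ["wage", "pay"], "last": ["emp-last", "surname"], "first": ["fname"]}

def perturb(schema, rng, drop_prob=0.2, rename_prob=0.5):
    """Perturb a schema and record the correct matches as a side effect."""
    perturbed, matches = [], {}
    for column in schema:
        if rng.random() < drop_prob:
            continue  # column dropped, so it has no match in the perturbed schema
        new_name = column
        if column in SYNONYMS and rng.random() < rename_prob:
            new_name = rng.choice(SYNONYMS[column])
        perturbed.append(new_name)
        matches[column] = new_name  # we applied the rule, so we know the match
    rng.shuffle(perturbed)  # column order should carry no information
    return perturbed, matches

def generate_workload(schema, n, seed=0):
    """Build W = [(S, T_1, matches_1), ..., (S, T_n, matches_n)]."""
    rng = random.Random(seed)
    return [(schema, *perturb(schema, rng)) for _ in range(n)]

S = ["id", "first", "last", "salary"]
workload = generate_workload(S, n=5, seed=1)
```

Because every T_i is produced by rules the generator applied itself, the correct matches come for free and no human labeling is needed.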

18
Generate Synthetic Input/Output
(Figure: schema S is perturbed in three numbered steps, and the resulting matches are recorded.)
Schema S: table Employees
id  first  last   salary ($)
1   Bill   Laup   40,000
2   Mike   Brown  60,000
Perturb # of columns: table Employees
last   id  salary ($)
Laup   1   40,000
Brown  2   60,000
Perturb table and column names: table Emps
emp-last  id  wage
Laup      1   40,000
Brown     2   60,000
Perturb data tuples: table Emps
emp-last  id  wage
Laup      1   45200
Brown     2   59328
Resulting matches: id = id, first = NONE, last = emp-last, salary = wage
  • Make sure tables do not share tuples
  • Rules are applied probabilistically

19
The eTuner Architecture
(Figure: components are Schema S, Perturbation Rules, Workload Generator, Synthetic Workload (S, O1, T1), (S, O2, T2), ..., (S, On, Tn), Tuning Procedures, Staged Searcher, Matching Tool M, and Tuned Matching Tool M.)
  • More details / experiments in
  • [Sayyadian et al., VLDB-05]
20
eTuner: Current Status
  • Only the first step
  • but now we have a line of attack for a
    long-standing problem
  • Current directions
  • find optimal synthetic workload
  • develop faster search methods
  • extend for other matching scenarios
  • adapt ideas to scenarios beyond schema matching
  • wrapper maintenance [VLDB-05b]
  • domain-specific search engine?

21
Automate Integration Tasks: Summary
  • Schema matching
  • architecture: [WebDB-00, SIGMOD-01, WWW-02]
  • long-standing problems: [SIGMOD-04a], eTuner
    [VLDB-05a]
  • learning/other techniques: [CIDR-03, VLDBJ-03,
    MLJ-03, WebDB-03, SIGMOD-04b, ICDE-05a, ICDE-05b]
  • novel problem: debug schemas for interoperability
    [ongoing]
  • industry transfer: involving 2 startups
  • promote research area: workshop at ISWC-03,
    special issues in SIGMOD Record-04 & AI
    Magazine-05, survey
  • Query reformulation [ICDE-02]
  • Mediated schema construction [SIGMOD-04b,
    ICDM-05, ICDE-06]
  • Duplicate tuple removal [AAAI-05, Tech report
    06a, 06b]
  • Wrapper maintenance [VLDB-05b]

22
Research Directions


  • Automate integration tasks
  • to minimize human labor
  • Leverage users
  • to spread the cost
  • Simplify integration tasks
  • so that they can be done quickly

23
The MOBS Project
  • Learn from multitude of users to improve tool
    accuracy, thus significantly reducing builder
    workload
  • MOBS = Mass Collaboration to Build Systems

Questions
Answers



24
Mass Collaboration
  • Build software artifacts
  • Linux, Apache server, other open-source software
  • Knowledge bases, encyclopedia
  • wikipedia.com
  • Review & technical support websites
  • amazon.com, epinions.com, quiq.com
  • Detect software bugs
  • [Liblit et al., PLDI 03 & 05]
  • Label images/pages on the Web
  • ESPgame, flickr, del.icio.us, My Web 2.0
  • Improve search engines, recommender systems
  • Why not data integration systems?

25
Example: Duplicate Data Matching
  • Serious problem in many settings (e.g., e.com)

Dell laptop X200 with mouse ...
Mouse for Dell laptop 200 series ...
Dell X200 mouse at reduced price ...
  • Hard for machine, but easy for human
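
To illustrate why this is hard for a machine, here is a sketch that compares the three listings with token-level Jaccard similarity. The choice of Jaccard and the tokenizer are assumptions; the slides do not name a similarity measure. The sketch also assumes a human would read the first listing as a laptop and the other two as mice.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens of a listing title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the token intersection divided by size of the token union."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

laptop = "Dell laptop X200 with mouse ..."
mouse_1 = "Mouse for Dell laptop 200 series ..."
mouse_2 = "Dell X200 mouse at reduced price ..."

sim_laptop_mouse_1 = jaccard(laptop, mouse_1)   # 0.375
sim_laptop_mouse_2 = jaccard(laptop, mouse_2)   # 0.375
sim_mouse_1_mouse_2 = jaccard(mouse_1, mouse_2) # 0.2
```

Under this measure the two mouse listings look less alike than either does to the laptop listing, which is the opposite of what a human reader would likely conclude.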

26
Key Challenges
  • How to modify tools to learn from users?
  • How to combine noisy user answers?
  • multiple noisy oracles
  • build user models, learn them via interaction
    with users
  • novel form of active learning
  • How to obtain user participation?
  • data experts, often willing to help (e.g.,
    Illinois Fire Service)
  • may be asked to help (e.g., e.com)
  • volunteer (e.g., online communities), "payment"
    schemes
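
One simple way to combine noisy yes/no answers is a reliability-weighted vote, sketched below. The log-odds weighting, the smoothing rule, and all names are assumptions chosen for illustration. The slides do not describe the user models that MOBS actually builds.

```python
import math

def estimate_reliability(history):
    """Estimate how often a user is right from questions with known answers.

    history: list of (user_answer, true_answer) pairs. Laplace smoothing keeps
    the estimate away from 0 and 1 when there is little evidence.
    """
    correct = sum(1 for given, truth in history if given == truth)
    return (correct + 1) / (len(history) + 2)

def combine_answers(answers, reliability):
    """Weighted vote over boolean answers; weight is the log-odds of being right.

    answers: {user: bool}. reliability: {user: estimated probability of a correct
    answer}. Unknown users get 0.5, which gives them zero weight.
    """
    score = 0.0
    for user, answer in answers.items():
        p = min(max(reliability.get(user, 0.5), 0.01), 0.99)
        weight = math.log(p / (1 - p))
        score += weight if answer else -weight
    return score > 0

answers = {"expert": True, "user_2": False, "user_3": False}
weighted = combine_answers(answers, {"expert": 0.95, "user_2": 0.6, "user_3": 0.6})
unweighted = combine_answers(answers, {"expert": 0.7, "user_2": 0.7, "user_3": 0.7})
```

With equal reliabilities this reduces to a majority vote, while a sufficiently reliable user can outweigh two less reliable ones.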

27
Current Status
  • Develop first-cut solutions
  • built prototype, experimented with 3-132 users,
    for source discovery and schema matching
  • improved accuracy by 9-60%, reduced workload by
    29-88%
  • Built two simple DI systems on Web
  • almost exclusively with users
  • Building a real-world application
  • DBlife (more later)
  • See [McCann et al., WebDB-03, ICDE-05,
    AAAI Spring Symposium-05, Tech Report-06]

28
Research Directions


  • Automate integration tasks
  • to minimize human labor
  • Leverage users
  • to spread the cost
  • Simplify integration tasks
  • so that they can be done quickly

29
Simplify Mediated Schema → Keyword Search over Multiple Databases
  • Novel problem
  • Very useful for urgent / one-time DI needs
  • also when users are SQL-illiterate (e.g.,
    Electronic Medical Records)
  • also on the Web (e.g., when data is tagged with
    some structure)
  • Solution: Kite [Tech Report 06a]
  • combines IR, schema matching, data matching, and
    AI planning

30
Simplify Wrappers → Structured Queries over Text/Web Data
(Figure: a SELECT ... FROM ... WHERE ... query posed over e-mails, text, Web data, news, etc.)
  • Novel problem
  • attracts attention from database / AI / Web
    researchers at Columbia, IBM TJ Watson/Almaden,
    UCLA, IIT-Bombay
  • SQOUT [Tech Report 06b], SLIC [Tech Report 06c]

31
Research Directions


  • Automate integration tasks
  • to minimize human labor
  • Leverage users
  • to spread the cost
  • Simplify integration tasks
  • so that they can be done quickly

Integration is difficult
Do best-effort integration
Integrate with text
Should leverage human
Build on this to promote Community Information Management
32
Community Information Management
  • Numerous communities on the Web
  • database researchers, movie fans, legal
    professionals, bioinformatics, etc.
  • enterprise intranets, tech support groups
  • Each community = many disparate data sources +
    people
  • Members often want to query, monitor, discover
    info.
  • any interesting connection between researchers X
    and Y?
  • list all courses that cite this paper
  • find all citations of this paper in the past one
    week on the Web
  • what is new in the past 24 hours in the database
    community?
  • which faculty candidates are interviewing this
    year, where?
  • Current integration solutions fall short of
    addressing such needs

33
Cimple Project @ Illinois/Wisconsin
  • Software platform that can be rapidly deployed
    and customized to manage data-rich online
    communities

(Figure: Cimple architecture.)
Data sources: Researcher Homepages, Conference Pages, Group Pages, DBworld mailing list, DBLP, Web pages, Text documents
Extracted structure (example): Jim Gray give-talk SIGMOD-04
Services: Keyword search, SQL querying, Question answering, Browse, Mining, Alert/Monitor, News summary
Users: Import & personalize data, Tag entities/relationships / create new contents, Context-dependent services, Share / aggregation
34
Prototype System: DBlife
  • 1164 data sources, crawled daily, 11000 pages /
    day → 160 MB, 121400 people mentions → 5600
    persons

35
Structure-Related Challenges
(Figure: the Cimple architecture diagram from slide 33, repeated.)
  • Extraction
  • better blackboxes, compose blackboxes, exploit
    domain knowledge
  • Maintenance
  • critical, but very little has been done
  • Exploitation
  • keyword search over extracted structure? SQL
    queries?
  • detect interesting events?

36
User-Related Challenges
(Figure: example relationship Jim Gray give-talk SIGMOD-04.)
  • Users should be able to
  • import whatever they want
  • correct/add to the imported data
  • extend the ER schema
  • create new contents for share/exchange
  • ask for context-dependent services
  • Examples
  • user imports a paper, system provides bib item
  • user imports a movie, adds desc, tags it for
    exchange
  • Challenges
  • provide incentives, payment
  • handle malicious/spam users
  • share / aggregate user activities/actions/content
37
Comparison to Current My Web 2.0
  • Cimple focuses on domain-specific communities
  • not the entire Web
  • Besides page level
  • also considers finer granularities of entities /
    relations / attributes
  • leverages automatic best-effort data
    integration techniques
  • Leverages user feedback to further improve
    accuracy
  • thus combines both automatic techniques and human
    efforts
  • Considers the entire range from search to
    structured queries
  • and how to seamlessly move between them
  • Allows personalization and sharing
  • consider context-dependent services beyond
    keyword search (e.g., selling, exchange)

38
Applying Cimple to My Web 2.0: An Example
  • Going beyond just sharing Web pages
  • Leveraging My Web 2.0 for other actions
  • e.g., selling, exchanging goods (turning it to a
    classified ads platform?)
  • E.g., want to sell my house
  • create a page describing the house
  • save it to my account on My Web 2.0
  • tag it with sellhouse, sell, house, champaign,
    IL
  • took me less than 5 minutes (not including
    creating the page)
  • now if someone searches for any of these keywords

39
(No Transcript)
40
(No Transcript)
41
Here a button can be added to facilitate the
sell action → provide context-dependent services
42
The Big Picture [Speculative Mode]
(Figure labels: Structured data (relational, XML); Unstructured data (text, Web, email); Multitude of users; Database: SQL; IR/Web/AI/Mining: keyword, QA; Semantic Web; Industry/Real World.)
  • Many apps will involve all three
  • Exact integration will be difficult
  • best-effort is promising
  • should leverage human
  • Apps will want broad range of services
  • keyword search, SQL queries
  • buy, sell, exchange, etc.
43
Summary
  • Data integration: crucial problem
  • at intersection of database, AI, Web, IR
  • Integration @ Illinois in my group
  • automate tasks to minimize human labor
  • leverage users to spread out the cost
  • simplify tasks so that they can be done quickly
  • Best-effort integration, should leverage human
  • The Cimple project @ Illinois/Wisconsin
  • builds on current work to study Community
    Information Management
  • A step toward managing structured data, text,
    and users synergistically!

See anhai on Yahoo for more details