MOBS: Mass Collaboration to Build Systems

Slides: 45
Provided by: robertm164
Transcript and Presenter's Notes

1
MOBS: Mass Collaboration to Build Systems
  • Prof. AnHai Doan, Alexander Kramnik, Rob McCann,
    Warren Shen, Olu Sobulo, Vanitha Varadarajan
  • DAIS Research @ UIUC

2
Data Retrieval Problem
Unstructured Query, Broad Coverage
Structured Query, Low Coverage
3
Solution: Data Integration Systems
Structured Query, Broad Coverage
4
Building A Data Integration System
Construct Mediated Interface
5
Building A Data Integration System
Locate Sources
6
Building A Data Integration System
Learn To Translate A Query
7
Building A Data Integration System
Combine Results For User
8
Building A Data Integration System Is Expensive
  • Automated tools help but are inaccurate.
  • Today DI Systems are still built manually!
  • There are very few DI systems on the web.

9
Use Mass Collaboration
MOBS: Mass Collaboration to Build Systems
  • Treat a DI system as having a finite set of
    parameters.
  • System admins construct and deploy an initial
    system shell.
  • Users help the system converge to the correct
    parameter values.

10
Parameterize System
Parameter: Form 1 is an online bookstore
Value: Yes / No?
11
Parameterize System
Parameter: Form 2, WRITER matches Author
Value: Yes / No?
12
Ask The Users
13
Comparison to Database Tuning
  • Database Tuning
      • set values of physical-design knobs (e.g., buffer
        size)
      • using feedback from query execution (time,
        resources consumed, etc.)
      • to further improve query execution performance
  • Mass Collaboration for DI Systems
      • set values of logical-design knobs (e.g., "ab?")
      • using feedback from users
      • to improve system correctness and further expand
        the system

14
Mass Collaboration Used In Many Places
  • Review sites (amazon, imdb, epinions)
  • Open-Source software (Linux, GNU)
  • Knowledge Bases (mindpixel, openmind, bibserv)
  • Peer-to-Peer Systems (napster)
  • The World Wide Web
  • Wired Magazine (11/03) highlighted the growing
    popularity of mass collaboration.
  • Why Not Data Integration?

15
Potential High Impact
  • If it succeeds:
      • Dramatically reduce development cost and time
      • Launch numerous DI systems on the Web and in
        enterprises
          • everyday domains: books, movies, cars, travel,
            etc.
          • "niche" domains, e.g., fire fighting
          • scientific domains, e.g., bioinformatics
          • within/across enterprises
      • Applicable to other data management tasks
          • building P2P systems, info extraction from text,
            Semantic Web, ...
  • Our current work:
      • Start by exploring a few initial settings
          • Online Bookstores, Hub Finder, Publication
            Extractor
      • Use these settings to understand key challenges.
      • Develop, deploy, evaluate general solutions.

16
Contributions
  • Proposed a new solution to building DI systems:
    Mass Collaboration
  • Proposed several methods for gathering feedback:
    Monopoly, Better Service, Cooperative
  • Evaluated our approach across a variety of
    populations and deployed 3 such systems on the
    web: Bookstores, Hubs, Peanut

17
Integrating Online Bookstores
  • As a first step, we restrict our attention to two
    crucial tasks:
      • Interface Recognition
      • Query Translation

18
Interface Recognition
[Figure labels: Candidate, Candidate]
19
Interface Recognition
20
Query Translation
First, Label Each Field With A Relevant Semantic Concept
Ranked Concepts
21
Query Translation
First, Label Each Field With A Relevant Semantic Concept
Ranked Concepts: Title, Author, Genre, ISBN
22
Query Translation
First, Label Each Field With A Relevant Semantic Concept
Ranked Concepts: Title, Author, Genre, ISBN
23
Query Translation
First, Label Each Field With A Relevant Semantic Concept
Ranked Concepts: Title, Author, Genre, ISBN
24
Query Translation
25
Query Translation
How To Translate A Query To A Source?
Author-Specific Rules (FN = first name, LN = last name)
  • If 1 text field: "FN LN" or "LN, FN"
  • If 2 text fields: LN then FN, or FN then LN
  • If a drop-down: search for LN
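
The author-specific rules on this slide can be sketched as a small function. This is only an illustration under assumptions: the function name `translate_author`, the `form_kind` labels, and the `last_name_first` flag are hypothetical and do not appear in the slides. Which ordering a given source expects is one of the parameters the system would still have to learn.

```python
from typing import List


def translate_author(first: str, last: str, form_kind: str,
                     last_name_first: bool = False) -> List[str]:
    """Return the value(s) to submit to a source's author field(s).

    form_kind and last_name_first are hypothetical names for this sketch.
    """
    if form_kind == "one_text_field":
        # Single box: either "FN LN" or "LN, FN".
        return [f"{last}, {first}"] if last_name_first else [f"{first} {last}"]
    if form_kind == "two_text_fields":
        # Two boxes: sources differ in which name comes first.
        return [last, first] if last_name_first else [first, last]
    if form_kind == "drop_down":
        # Drop-down: search on the last name only.
        return [last]
    raise ValueError(f"unknown form kind: {form_kind!r}")


# Two text fields, last name first:
print(translate_author("Robert", "Frost", "two_text_fields", last_name_first=True))
# ['Frost', 'Robert']
```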
26
Query Translation
How To Translate A Query To A Source?
27
Query Translation
How To Translate A Query To A Source?
Frost
Robert
28
Challenges
  • Formulate a Mass Collaboration Framework suitable
    for such DI tasks.
  • Motivate Users To Participate.
  • Handle Malicious / Error-Prone Users.
  • Resolve Differences In User Opinion.
  • Boost Accuracy Of Mass Collaboration Results.
  • Reduce Workload Placed Upon Users.

29
Mass Collaboration Framework
  • Cast a DI task into a sequence of simple binary
    decisions.
      • Is this source an online bookstore?
      • Is this field used to query on Author?
  • Making these decisions is equivalent to solving
    the particular DI task.
  • Simple decisions can be solved by users.
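
One way to picture this framework is as a list of Yes/No questions generated from automatically produced candidates. The sketch below is an assumption-laden illustration, not the MOBS implementation: the `Decision` class and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    question: str                  # shown to users as a Yes/No question
    answer: Optional[bool] = None  # None until the users have settled it


def interface_decisions(candidate_sources: List[str]) -> List[Decision]:
    """Interface recognition: one binary decision per candidate source."""
    return [Decision(f"Is {src} an online bookstore?") for src in candidate_sources]


def field_decisions(field_label: str, ranked_concepts: List[str]) -> List[Decision]:
    """Field labeling: one binary decision per ranked concept."""
    return [Decision(f"Is the field '{field_label}' used to query on {concept}?")
            for concept in ranked_concepts]


def task_solved(decisions: List[Decision]) -> bool:
    """The task counts as solved once every decision has an answer."""
    return all(d.answer is not None for d in decisions)


for d in field_decisions("WRITER", ["Title", "Author", "Genre", "ISBN"]):
    print(d.question)
```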

30
User Participation
  • Sell a Monopolized Service: CS 311 Site
  • Sell Improved Service: Query Interface
  • Cooperative Environment: A Community
  • There are several ways to collect feedback.

31
Malicious Users
Inject known questions to detect and remove bad
users.
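
A minimal sketch of this idea, assuming answers arrive as (user, question id, answer) triples. The function name and the thresholds `min_checked` and `min_accuracy` are hypothetical choices for illustration; the slides do not say how a bad user is defined.

```python
from typing import Dict, List, Set, Tuple


def find_bad_users(answers: List[Tuple[str, str, bool]],
                   known: Dict[str, bool],
                   min_checked: int = 3,
                   min_accuracy: float = 0.7) -> Set[str]:
    """Flag users who do poorly on injected questions with known answers."""
    checked: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for user, question_id, answer in answers:
        if question_id not in known:
            continue  # a real question, nothing to check against
        checked[user] = checked.get(user, 0) + 1
        if answer == known[question_id]:
            correct[user] = correct.get(user, 0) + 1
    # Judge only users who have seen enough known questions.
    return {user for user, n in checked.items()
            if n >= min_checked and correct.get(user, 0) / n < min_accuracy}


known = {"k1": True, "k2": False, "k3": True}
answers = [
    ("alice", "k1", True), ("alice", "k2", False), ("alice", "k3", True),
    ("mallory", "k1", False), ("mallory", "k2", True), ("mallory", "k3", False),
    ("newbie", "k1", False),        # too few checks to judge yet
    ("alice", "q17", True),         # an ordinary, unknown question
]
print(find_bad_users(answers, known))  # {'mallory'}
```

Answers from flagged users would then be dropped before a consensus is formed.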
32
Forming a Consensus
  • Must decide when and how each decision is made
    from given feedback.
  • Our current scheme:
      • If one answer (Yes/No) has a convincing lead,
        immediately make that decision.
      • Otherwise, at some max number of answers, take
        the majority opinion.
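
The scheme can be sketched as follows. The slides do not say what counts as a "convincing lead" or what the maximum is, so the `lead` and `max_answers` values below are assumptions, and `max_answers` is required to be odd so the majority vote cannot tie.

```python
from typing import Iterable, Optional


def decide(answers: Iterable[bool], lead: int = 3,
           max_answers: int = 9) -> Optional[bool]:
    """Return True/False once a decision is reached, or None if still open."""
    if max_answers % 2 == 0:
        raise ValueError("max_answers must be odd to avoid ties")
    yes = no = 0
    for count, answer in enumerate(answers, start=1):
        if answer:
            yes += 1
        else:
            no += 1
        if abs(yes - no) >= lead:
            return yes > no        # convincing lead: decide immediately
        if count >= max_answers:
            return yes > no        # cap reached: take the majority
    return None                    # not enough feedback yet


print(decide([True, True, True]))   # True, decided early on a 3-0 lead
print(decide([True, False]))        # None, still undecided
```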

33
Overcoming Exponential Error
  • Dependent (sequential) decisions snowball error
    exponentially fast.

Use A Semi-Parallel Pool Scheme
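
A short calculation illustrates the snowballing. It assumes a simplified model that is not stated in the slides: each decision is independently correct with probability p, and one wrong decision spoils everything that depends on it, so a chain of n sequential decisions is fully correct with probability p^n.

```python
def chain_accuracy(p: float, n: int) -> float:
    """Probability that all n dependent decisions are correct (simplified model)."""
    return p ** n


for n in (1, 5, 10, 20):
    print(n, round(chain_accuracy(0.9, n), 3))
# With p = 0.9 the chance of a fully correct chain drops to about 0.35 at n = 10.
```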
34
Pool Scheme
  • We need parallelism to be accurate, but
    Round-Robin is too much work.
  • Lessen workload by Zooming.

35
Empirical Evaluation
  • Interface Recognition
  • 55 Query Forms (from Books, Movies, and Music
    pages)

Method                     Forms selected   Precision   Recall
Fully Automated            24               0.70        0.89
Automation + Mass Collab   17 of the 24     1.00        0.89
75 users averaged 8 answers.
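
The reported figures can be sanity-checked with the usual definitions of precision and recall. Two values below are inferences, not numbers given on the slide: that 17 of the 24 automatically selected forms were correct, and that 19 of the 55 forms were true query interfaces (chosen because 17/19 matches the reported recall of 0.89).

```python
def precision(true_positives: int, selected: int) -> float:
    """Fraction of selected items that are correct."""
    return true_positives / selected


def recall(true_positives: int, relevant: int) -> float:
    """Fraction of relevant items that were selected."""
    return true_positives / relevant


CORRECT = 17    # ASSUMPTION: correct forms among the 24 selected automatically
RELEVANT = 19   # ASSUMPTION: inferred from the reported recall, not stated

print(precision(CORRECT, 24))        # fully automated, roughly 0.71
print(recall(CORRECT, RELEVANT))     # roughly 0.89 for both methods
print(precision(CORRECT, 17))        # after mass collaboration: 1.0
```

Under these assumptions, mass collaboration removes the 7 false positives without losing any correct form, which is why precision rises while recall stays the same. The automated precision comes out near 0.71, consistent with the reported 0.70 up to rounding or truncation.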
36
Empirical Evaluation
  • Query Translation
      • Just deployed on CS 311 Site.
      • Labeled 2 sources (5 fields) with 100% accuracy,
        with 53 users averaging 5 answers apiece.
      • Simulated 500 Users over 18 Sources (114 Fields)

37
Empirical Evaluation
  • Query Translation
  • Simulated 500 Users over 18 Sources (114 Fields)

8 Answers
Workload is spread thinly.
38
MOBS Is A General Technique
  • We were also able to use MOBS to build two
    applications on the Surface Web.
  • Hubs: Locate CS faculty hubs.
  • Peanut: Locate CS publication lists.

39
Hubs System
  • Overall Goal: Build an IR system over content
    found in CS faculty homepages.
  • First Step: Use Mass Collaboration to find
    faculty homepages. Do this using department
    hubs.
      • Look only for 1 page per site.
      • More stable than individual homepages.

40
Hubs System
  • Use automated techniques to crawl a dept website
    and create a ranked list of hub candidates.
  • Use mass collaboration to select the correct hub
    from the candidates.
      • Is this page a CS faculty hub?
  • Currently deployed on Prof. Zhai's CS 397 site
      • Finished 2 CS depts.
      • Machine learning was wrong in both cases.
      • Mass Collaboration boosted accuracy from 0% to
        100%, with the help of 20 users averaging 9.5
        answers.

41
Peanut System
  • Overall Goal: Build a mini-CiteSeer over
    relevant CS publications.
  • Standard MOBS Approach: Automatically parse
    candidate lists from web pages and use mass
    collaboration to verify/reject candidates.
      • Is this a publication list?
  • Currently deployed atop the Google search engine
      • Extracted 10 publication lists from 50 lists
        over 10 pages
      • with 100% precision and 100% recall
      • with the help of 10 users averaging 8.4 answers.

42
Related Work
  • Mass Collaboration
      • Review sites, knowledge bases, open-source.
  • Data Integration
      • Lots of research, but few works address
        the entire process (Rosenthal & Seligman,
        01).
  • Autonomic Systems
      • A system built by mass collaboration
        exhibits self-healing and self-improving
        qualities.

43
Future Work
  • Theoretical guarantees of accuracy/speed.
  • Automatic self-tuning in response to observations
    on the population.
  • More sophisticated consensus techniques (e.g.,
    linear regression).
  • Maintain over changing sources.
  • Most importantly, find The Application.

44
System Demos
  • Bookstore
      • http://hanoi.cs.uiuc.edu/cgi-bin/deploy/mobs.pl?system4
      • http://hanoi.cs.uiuc.edu/cs311
      • http://hanoi.cs.uiuc.edu/cgi-bin/deploy/mb_system_deep_web_portal.pl
  • Hubs
      • http://hanoi.cs.uiuc.edu/cs397
      • http://hanoi.cs.uiuc.edu/cgi-bin/deploy/hubs_public_stats.pl
  • Peanut
      • http://hanoi.cs.uiuc.edu/peanut