Title: Building Structured Web Databases: A Midterm Report from the Cimple Project
1 Building Structured Web Databases: A Midterm Report from the Cimple Project
AnHai Doan, University of Wisconsin-Madison
2 Structured Web Databases
3 The Cimple Project (2005 - Date)
- Develops a generic solution to build Web databases
  - using extraction, integration, and user feedback
- Example: DBLife
[Figure: DBLife overview. Sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP) are gathered as Web pages; an IE/II program turns them into entities and relationships (e.g., Jim Gray, give-talk, SIGMOD-04), which support browse, keyword search, SQL querying, question answering, mining, alert/monitor, and news summary.]
4 Data Model for IE/II
- Many choices
  - Relational, XML, RDF triples, nested, JSON, etc.
- Desiderata
  - Conceptually simple, programmers can visualize
  - Naïve users can visualize (for providing feedback)
  - Easy to write queries
  - Robust industrial support
- Decided on relational + ER
- Want to understand benefits / limitations
5 Programming Model for IE/II
- Must combine IE/II blackboxes into workflows
- Many possible choices
  - E.g., UIMA, pub/sub
- Desiderata
  - Easy to write, understand, debug, maintain
  - Expressive (e.g., can do loops), highly extensible
  - Solid theoretical foundation
  - Can optimize to death (critical!)
6 Proposed Solution: Xlog, Datalog with Embedded Procedural Predicates
[Figure: relation docs holds documents d1 and d2; the extracted talks have two columns, title and abstract:
  title: Feedback in IR | abstract: Relevance feedback is important...
  title: Personalized Search | abstract: Customizing rankings with relevance feedback...]
titles(d,t) :- docs(d), extractTitle(d,t).
abstracts(d,a) :- docs(d), extractAbstract(d,a).
talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a, "relevance feedback").
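To make the three rules concrete, here is a minimal Python sketch of how they might be evaluated, with the procedural predicates written as ordinary functions. It is not the Xlog implementation: the toy corpus, the "Title:" / "Abstract:" page layout, the Span type, and the definition of imm_before are all assumptions made for illustration.

```python
import re
from typing import Iterator, List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # character offset where the span begins in its document
    end: int    # character offset just past the span
    text: str


# Hypothetical toy corpus; the "Title:" / "Abstract:" layout is an assumption.
DOCS = {
    "d1": ("Title: Feedback in IR\n"
           "Abstract: Relevance feedback is important.\n"
           "Title: Query Optimization\n"
           "Abstract: Cost models for joins.\n"),
    "d2": ("Title: Personalized Search\n"
           "Abstract: Customizing rankings with relevance feedback.\n"),
}


def _extract(d: str, label: str) -> Iterator[Span]:
    """Procedural blackbox: yield every span that follows `label:` on a line."""
    for m in re.finditer(rf"^{label}: (.*)$", DOCS[d], flags=re.MULTILINE):
        yield Span(m.start(1), m.end(1), m.group(1))


def extract_title(d: str) -> Iterator[Span]:
    return _extract(d, "Title")


def extract_abstract(d: str) -> Iterator[Span]:
    return _extract(d, "Abstract")


def imm_before(d: str, t: Span, a: Span) -> bool:
    """True if nothing but the 'Abstract:' label lies between t and a."""
    return DOCS[d][t.end:a.start].strip() == "Abstract:"


def contains(a: Span, phrase: str) -> bool:
    return phrase.lower() in a.text.lower()


def talks(phrase: str) -> List[Tuple[str, str, str]]:
    # titles(d,t) :- docs(d), extractTitle(d,t).
    titles = [(d, t) for d in DOCS for t in extract_title(d)]
    # abstracts(d,a) :- docs(d), extractAbstract(d,a).
    abstracts = [(d, a) for d in DOCS for a in extract_abstract(d)]
    # talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a, phrase).
    return [(d, t.text, a.text)
            for d, t in titles
            for d2, a in abstracts
            if d == d2 and imm_before(d, t, a) and contains(a, phrase)]


result = talks("relevance feedback")
```

The nested comprehension is a naive join followed by two selections; a real engine would be free to reorder these steps.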
7 Xlog: Workflow of Relational Operators and Blackboxes
[Figure: execution plan for the talks rule, read bottom-up. Two scans of docs(d), each over d1 and d2, feed the blackboxes extractTitle(d,t) and extractAbstract(d,a). Their outputs are combined, then filtered by σ immBefore(t,a) and finally by σ contains(a, "relevance feedback"), leaving the tuple (d1, t1, a1).]
8 Sample Optimization: Pushing Down Text Properties
[Figure: extractAbstract takes document d and outputs abstract a]
- contains(a,w) ∧ comes-from(a,d) ⇒ contains(d,w)
- italics(s) ∧ overlaps(s,t) ⇒ containsItalics(t)
- (lengthWord(s) > 3) ∧ comes-from(s,t) ⇒ lengthWord(t) > 3
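The first implication suggests a rewrite: if an abstract a comes from document d and must contain w, then d itself must contain w, so a cheap test on d can run before the expensive extractor. The sketch below illustrates this under stated assumptions; the corpus, the CountingExtractor class, and both plan functions are hypothetical and are not taken from Cimple.

```python
from typing import Dict, List, Tuple

# Hypothetical corpus: only d1 mentions the target phrase.
DOCS: Dict[str, str] = {
    "d1": "Abstract: Relevance feedback is important.",
    "d2": "Abstract: Cost models for joins.",
    "d3": "Call for papers, no abstract here.",
}


class CountingExtractor:
    """Stand-in for an expensive extractAbstract blackbox; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> List[str]:
        self.calls += 1
        prefix = "Abstract: "
        return [ln[len(prefix):] for ln in text.splitlines() if ln.startswith(prefix)]


def contains(text: str, w: str) -> bool:
    return w.lower() in text.lower()


def naive_plan(extract: CountingExtractor, w: str) -> List[Tuple[str, str]]:
    # Extract from every document, then filter the abstracts.
    return [(d, a) for d, text in DOCS.items()
            for a in extract(text) if contains(a, w)]


def pushed_down_plan(extract: CountingExtractor, w: str) -> List[Tuple[str, str]]:
    # contains(a,w) and comes-from(a,d) imply contains(d,w), so documents
    # that fail contains(d,w) are skipped before the extractor runs.
    return [(d, a) for d, text in DOCS.items() if contains(text, w)
            for a in extract(text) if contains(a, w)]


naive, pushed = CountingExtractor(), CountingExtractor()
naive_result = naive_plan(naive, "relevance feedback")
pushed_result = pushed_down_plan(pushed, "relevance feedback")
```

Both plans return the same tuples; the difference is how often the blackbox is invoked, which `naive.calls` and `pushed.calls` record.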
9 Benefits of Xlog
- Can model complex workflows
  - e.g., recursion, negation
- Has well-defined semantics
- Can naturally combine IE/II blackboxes w/ relational ops
- Can immediately exploit many optimization methods
  - already developed for Datalog and RDBMS
- Can naturally incorporate text-centric optimizations
  - estimate cost, select good exec plan, in RDBMS fashion
10 Implementing Xlog: Take 1
- Key challenge: how to store and access data on disk
[Figure: storage architecture with components Web, HTML pages, RDBMS, OS files, and version store (e.g., Rdiff)]
11 Problems (Observed when Running DBLife)
- Multiple concurrent processes
  - machines, humans
- Random data access
- Lots of RDBMS-like operations
- Huge amount of disk-resident data
- Unlike ETL or MapReduce processes
12 Implementing Xlog: Take 2
- Extend RDBMS to handle IE/II over text
  - also hot direction today at RDBMS companies
- Want to understand benefits / limitations
[Figure: architecture reduced to Web, HTML pages, and RDBMS]
13 Implementing Learning-Based Operators by Pushing Them into RDBMS
- E.g., Markov logic networks
- Lots of RDBMS-like operations
- Alchemy uses a fixed exec plan
- RDBMS automatically selects a good plan
- Drastic speedup in our experiments
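One place where Markov logic inference involves RDBMS-like operations is grounding, which enumerates every variable binding of a clause against the evidence. A rough sketch of expressing that step as a SQL join follows, using Python's built-in sqlite3 module. The clause, table names, and rows are hypothetical, and this is not a description of how Alchemy or Cimple implements grounding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE coauthor(x TEXT, y TEXT);
CREATE TABLE works_at(x TEXT, org TEXT);
INSERT INTO coauthor VALUES ('ann', 'bob'), ('bob', 'cat');
INSERT INTO works_at VALUES ('ann', 'UW'), ('bob', 'UW');
""")

# Hypothetical clause: coauthor(x,y) AND works_at(x,o) => works_at(y,o).
# Each output row is one grounding of the clause body, plus a flag telling
# whether the head already holds in the evidence.
GROUNDING_SQL = """
SELECT c.x, c.y, w.org,
       EXISTS(SELECT 1 FROM works_at w2
              WHERE w2.x = c.y AND w2.org = w.org) AS head_true
FROM coauthor c JOIN works_at w ON w.x = c.x
ORDER BY c.x, c.y
"""

groundings = conn.execute(GROUNDING_SQL).fetchall()

# The engine, not the caller, decides join order and access paths; SQLite
# reports its choice through EXPLAIN QUERY PLAN.
plan = conn.execute("EXPLAIN QUERY PLAN " + GROUNDING_SQL).fetchall()
```

Because the grounding is declarative, adding an index or changing table sizes can change the chosen plan without any change to the query text.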
14 Lessons Learned / Open Questions
- Relational + ER seems okay so far
- To combine blackboxes, Datalog variants are promising
- Right implementation strategy still unclear
- ETL / MapReduce seems best for one-shot IE/II
- Building / maintaining many Web DBs is not one-shot
  - especially if involving humans
  - concurrent processes, data often revisited, random access
- Optimization is critical
- RDBMS especially promising
  - locking, indexing, optimization, handling disk-resident data
- Most likely need a combination of RDBMS and MapReduce
15 The Cimple Project (2005 - Date)
- Develops a generic solution to build Web databases
  - using extraction, integration, and user feedback
- Example: DBLife
[Figure: DBLife overview, repeated from slide 3]
16 User Feedback
- Critical
  - IE/II inevitably make mistakes, which can cascade quickly
  - when database evolves, mistakes happen
  - a lot of data is in users' heads, not yet on the Web
- Highly beneficial
  - scenario 1: 10-15 developers
    - their feedback can already make a big difference
    - no good solution today, designated victim in DBLife
  - scenario 2: hire people using Mechanical Turk
  - scenario 3: lots of ordinary users volunteering feedback
17 Types of User Feedback
[Figure: grid of feedback types. Actions: flagging an error, fixing an error, editing code, editing data. Targets: input, intermediate results, output.]
18 Editing the Output
- To maximize amount of feedback
  - users should be able to edit anything
  - records, lists, sets, tables, natural text, ...
  - using whatever UI they like: form, Excel, wiki, GUI, ...
  - virtually the whole page should be editable
19 Example: Editing a Record
HTML:
  Name: Joe Hellerstein
  Organization: UC-Berkeley
  Contact: joe_at_berkeley.edu
  Research Interest: Data stream, Declarative networking, Sensor networks
  User edit: Remove Contact joe_at_berkeley.edu
View:
  Entity 123: name = Joe Hellerstein, org = UC-Berkeley, email = joe_at_berkeley.edu
  Data stream, 0.9; Declarative networking, 0.6; Sensor networks, 0.4
Data:
  Entity 123: name = Joe Hellerstein, salary = 150K, org = UC-Berkeley, email = joe_at_berkeley.edu
- How to interpret edits?
- How to push down edits?
- How to manage concurrent edits?
- How to propagate edits?
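The slide leaves these questions open. As one possible reading of "push down", the sketch below treats the view as a projection of the stored record and translates a "Remove Contact" edit on the view into an update on the underlying data. The field names, the VIEW_FIELDS projection, and the rule that hidden fields cannot be edited are all assumptions, not Cimple's design.

```python
import copy
from typing import Any, Dict

# Hypothetical stored record, mirroring the slide's Entity 123.
DATA: Dict[int, Dict[str, Any]] = {
    123: {
        "name": "Joe Hellerstein",
        "salary": "150K",
        "org": "UC-Berkeley",
        "email": ["joe_at_berkeley.edu"],
    }
}

# Assumed projection: the view hides salary from page editors.
VIEW_FIELDS = ("name", "org", "email")


def make_view(data: Dict[int, Dict[str, Any]], entity_id: int) -> Dict[str, Any]:
    """Return a detached copy of the visible fields of one entity."""
    row = data[entity_id]
    return {f: copy.deepcopy(row[f]) for f in VIEW_FIELDS}


def push_down_remove(data: Dict[int, Dict[str, Any]], entity_id: int,
                     field: str, value: str) -> None:
    """Interpret 'remove value from field' on the view as an update on the data."""
    if field not in VIEW_FIELDS:
        raise PermissionError(f"{field} is not editable through this view")
    values = data[entity_id][field]
    if not isinstance(values, list):
        raise TypeError(f"{field} is not a multi-valued field")
    if value not in values:
        raise ValueError(f"{value} is not present in {field}")
    values.remove(value)


view_before = make_view(DATA, 123)
push_down_remove(DATA, 123, "email", "joe_at_berkeley.edu")
view_after = make_view(DATA, 123)
```

Concurrent edits and propagation to other pages are not handled here; the sketch only covers a single edit on a single record.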
20 Example: Editing a Record
- How to edit page format? How to display new data?
HTML:
  Name: Joe Hellerstein; Contact: joe_at_berkeley.edu (try calling first); Organization: UC-Berkeley
  Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe_at_berkeley.edu
  Page format: Name, Contact (try calling first), Organization
View:
  Entity 123: name = Joe Hellerstein, org = UC-Berkeley, email = joe_at_berkeley.edu
Data:
  Entity 123: name = Joe Hellerstein, salary = 150K, org = UC-Berkeley, email = joe_at_berkeley.edu, joe_at_acm.org
  Entity 123: name = Joe Hellerstein, salary = 150K, org = UC-Berkeley, email = joe_at_berkeley.edu
21 Example: Editing a Record
- How to undo? recover from crash?
  - roll back to 3pm yesterday
  - undo a bad user edit: what if other users have built on that edit?
- How to reconcile human / machine edits?
- How to split superhomepages?
[Figure: the record (Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe_at_berkeley.edu) receives a sequence of human and machine edits. A second version with Contact: joe_at_berkeley.edu, joe_at_mit.edu, joe_at_swivel.com is a superhomepage to be split into "Joe Berkeley" and "Joe MIT".]
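One simple way to approach "roll back to 3pm yesterday" is to keep an append-only log of timestamped edits and rebuild a record by replaying the log up to a chosen time. The sketch below does that, and adds an optional policy in which a machine edit never overwrites a field last set by a human. The Edit type, the dates, and the human-wins policy are assumptions for illustration; the slide poses reconciliation as an open question.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List


@dataclass(frozen=True)
class Edit:
    when: datetime
    who: str      # "human" or "machine"
    field: str
    value: Any


def replay(edits: List[Edit], as_of: datetime,
           human_wins: bool = True) -> Dict[str, Any]:
    """Rebuild a record from all edits made at or before `as_of`."""
    state: Dict[str, Any] = {}
    owner: Dict[str, str] = {}  # who last set each field
    for e in sorted(edits, key=lambda e: e.when):
        if e.when > as_of:
            break
        if human_wins and e.who == "machine" and owner.get(e.field) == "human":
            continue  # assumed policy: keep the human's value
        state[e.field] = e.value
        owner[e.field] = e.who
    return state


# Hypothetical edit history for one record.
LOG = [
    Edit(datetime(2009, 1, 1, 9), "machine", "contact", "joe_at_berkeley.edu"),
    Edit(datetime(2009, 1, 1, 14), "human", "contact",
         "joe_at_berkeley.edu (try calling first)"),
    Edit(datetime(2009, 1, 1, 16), "machine", "contact", "joe_at_berkeley.edu"),
]

at_10am = replay(LOG, datetime(2009, 1, 1, 10))
at_3pm = replay(LOG, datetime(2009, 1, 1, 15))
latest = replay(LOG, datetime(2009, 1, 2))
latest_machine_wins = replay(LOG, datetime(2009, 1, 2), human_wins=False)
```

Replaying to a timestamp does not answer the harder question on the slide: undoing one bad edit when later edits by other users depend on it.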
26
- Text mixed with structured data (from the database)
- Can edit both
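28 Editing Input/Intermediate Results
- Extracting conference services
[Figure: IE workflow that produces the services relation, with operators crawl, extractNames, extractConf, and findRoles over dataSources, roles, and services; inputs and intermediate results are exposed for editing through a form, a spreadsheet, and a wiki]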
29 Editing Code
- Currently naïve users edit control flow of code
[Figure: for entity 1 (Joe Hellerstein) and entity 5 (Chen Li-s), the "filter pubs" step offers two options: "use just author name" or "use author name, co-authors, conf proximity"]
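A rough sketch of what editing control flow could look like: the "filter pubs" step looks up a per-entity choice between the two options named on the slide, and a user changes behavior by editing that choice rather than the matching code. The entity profiles, publications, and the reading of "conf proximity" as membership in a set of known venues are all hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical entity profiles and publications, for illustration only.
ENTITIES: Dict[int, Dict[str, Any]] = {
    1: {"name": "Joe Hellerstein", "coauthors": {"Coauthor A"}, "venues": {"SIGMOD"}},
    5: {"name": "Chen Li", "coauthors": {"Coauthor B"}, "venues": {"VLDB"}},
}
PUBS: List[Dict[str, Any]] = [
    {"title": "p1", "authors": {"Joe Hellerstein", "Coauthor A"}, "venue": "SIGMOD"},
    {"title": "p2", "authors": {"Chen Li", "Coauthor B"}, "venue": "VLDB"},
    {"title": "p3", "authors": {"Chen Li", "Unrelated Person"}, "venue": "Physics Letters"},
]


def just_name(entity: Dict[str, Any], pub: Dict[str, Any]) -> bool:
    return entity["name"] in pub["authors"]


def name_and_context(entity: Dict[str, Any], pub: Dict[str, Any]) -> bool:
    # Assumed reading of "conf proximity": the venue is one the entity is known for.
    shares_coauthor = bool(entity["coauthors"] & pub["authors"])
    near_venue = pub["venue"] in entity["venues"]
    return just_name(entity, pub) and (shares_coauthor or near_venue)


STRATEGIES: Dict[str, Callable[[Dict[str, Any], Dict[str, Any]], bool]] = {
    "use just author name": just_name,
    "use author name, co-authors, conf proximity": name_and_context,
}

# The only part a naïve user edits: which option applies to which entity.
user_choice: Dict[int, str] = {
    1: "use just author name",
    5: "use author name, co-authors, conf proximity",
}


def filter_pubs(entity_id: int) -> List[str]:
    keep = STRATEGIES[user_choice[entity_id]]
    entity = ENTITIES[entity_id]
    return [p["title"] for p in PUBS if keep(entity, p)]
```

With an ambiguous name such as Chen Li, the stricter option drops p3, while switching entity 5 to "use just author name" would keep it.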
30 Lessons Learned / Open Questions
- User feedback is critical
  - correct data obtained from Web
  - help improve IE/II algorithms
  - help solicit data in users' heads
  - help build community Wikipedia, using machine-human
- Numerous interesting challenges
31 Cimple: Current Status
- Started in 2005
- Involved UIUC, Yahoo, IBM, Microsoft
- Major project at Wisconsin
  - affiliated profs: Jeff Naughton, Chris Re, Jude Shavlik, Raghu Ramakrishnan
  - 20 students: Pedro DeRose, Warren Shen, Robert McCann, Xiaoyong Chai, Ba-Quy Vuong, Fei Chen, Chaitanya Gokhale, Feng Niu, Ting Chen, Byron Gao, Erick Chu, Akanksha Baid, Jiansheng Huang, and more
  - prototypes: Cimple 1.0, Cimple 2.0, applied to DB, Lake Mendota, Wikipedia
  - 19 SIGMOD/VLDB/ICDE papers, invited papers, special issues, tutorial
  - funded by NSF, Yahoo, IBM, Google, Microsoft, DARPA
  - technology transfer to Microsoft (Ad Lab, SQL Server group)