Building Structured Web Databases: A Midterm Report from the Cimple Project

Description:

In normal Datalog, each predicate in the body is associated with a relation that is either stored or defined by a Datalog rule. Intuitively, ...

Slides: 32

Provided by: zam61

Transcript and Presenter's Notes

1
Building Structured Web Databases: A Midterm Report from the Cimple Project
AnHai Doan, University of Wisconsin-Madison
2
Structured Web Databases
3
The Cimple Project (2005 to date)

Develops a generic solution to build Web databases using extraction + integration + user feedback
Example: DBLife

[Figure: an IE/II program turns Web pages (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP) into structured data (e.g., Jim Gray, give-talk, SIGMOD-04) that supports browsing, keyword search, SQL querying, question answering, mining, alert/monitor, and news summary]
4
Data Model for IE/II

Many choices
  Relational, XML, RDF triples, nested, JSON, etc.
Desiderata
  Conceptually simple, programmers can visualize
  Naïve users can visualize (for providing feedback)
  Easy to write queries
  Robust industrial support
Decided on relational + ER

[Figure: IE/II program over Web pages (researcher homepages, conference pages, group pages) producing structured data (e.g., Jim Gray, give-talk, SIGMOD-04)]

Want to understand benefits / limitations

5
Programming Model for IE/II

Must combine IE/II blackboxes into workflows
Many possible choices
  E.g., UIMA, pub/sub
Desiderata
  Easy to write, understand, debug, maintain
  Expressive (e.g., can do loops), highly extensible
  Solid theoretical foundation
  Can optimize to death (critical!)

6
Proposed Solution: Xlog, Datalog with Embedded Procedural Predicates

docs: d1, d2

title                 abstract
Feedback in IR        Relevance feedback is important...
Personalized Search   Customizing rankings with relevance feedback...

titles(d,t) :- docs(d), extractTitle(d,t).
abstracts(d,a) :- docs(d), extractAbstract(d,a).
talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a, "relevance feedback").
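
The three rules on this slide can be read as a small program over extracted spans. The following is a minimal Python sketch of that reading, not the Cimple implementation: the toy documents in DOCS, the regex-based extractors, and the whitespace-only definition of imm_before are all assumptions made for illustration.

```python
import re

# Toy corpus (assumption): each document has one "Title:" and one "Abstract:" line.
DOCS = {
    "d1": "Title: Feedback in IR\nAbstract: Relevance feedback is important.",
    "d2": "Title: Query Plans\nAbstract: Cost-based optimization of joins.",
}

def extract_title(doc_id):
    """Stand-in for the procedural predicate extractTitle(d,t).
    Yields spans as (text, start, end)."""
    for m in re.finditer(r"^Title: (.*)$", DOCS[doc_id], re.MULTILINE):
        yield (m.group(1), m.start(), m.end())

def extract_abstract(doc_id):
    """Stand-in for the procedural predicate extractAbstract(d,a)."""
    for m in re.finditer(r"^Abstract: (.*)$", DOCS[doc_id], re.MULTILINE):
        yield (m.group(1), m.start(), m.end())

def imm_before(doc_id, t, a):
    """Assumed meaning of immBefore(t,a): only whitespace lies between t and a."""
    return t[2] <= a[1] and DOCS[doc_id][t[2]:a[1]].strip() == ""

def contains(a, phrase):
    """contains(a, phrase), case-insensitive in this sketch."""
    return phrase.lower() in a[0].lower()

def talks(phrase="relevance feedback"):
    # titles(d,t) :- docs(d), extractTitle(d,t).
    titles = [(d, t) for d in DOCS for t in extract_title(d)]
    # abstracts(d,a) :- docs(d), extractAbstract(d,a).
    abstracts = [(d, a) for d in DOCS for a in extract_abstract(d)]
    # talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a, phrase).
    return [
        (d, t[0], a[0])
        for (d, t) in titles
        for (d2, a) in abstracts
        if d == d2 and imm_before(d, t, a) and contains(a, phrase)
    ]
```

Calling talks() on the toy corpus keeps only d1, because d2's abstract does not mention the phrase. The nested comprehension is the naive join; a real engine would pick a better plan.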
7
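Xlog: Workflow of Relational Operators and Blackboxes
[Figure: execution plan for the talks rule, listed top-down with the sample tuples shown beside each operator]
σ contains(a, "relevance feedback"): (d1, t1, a1)
σ immBefore(t,a): (d1, t1, a1), (d1, t2, a2)
combined (d,t,a) tuples: (d1, t1, a1), (d1, t1, a2), (d2, t2, a1), (d2, t2, a2)
extractTitle(d,t): (d1, t1), (d1, t2)
extractAbstract(d,a): (d1, a1), (d1, a2)
docs(d), feeding each extractor: d1, d2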
8
Sample Optimization: Pushing Down Text Properties
[Figure: extractAbstract takes a document d and outputs an abstract a]
contains(a,w) ∧ comes-from(a,d) ⇒ contains(d,w)
italics(s) ∧ overlaps(s,t) ⇒ containsItalics(t)
(lengthWord(s) > 3) ∧ comes-from(s,t) ⇒ lengthWord(t) > 3
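
The first property on this slide says that if an abstract contains a word and the abstract comes from a document, then the document contains that word too. A hedged Python sketch of how an optimizer might use this: test the cheap condition on the document first and run the expensive extractor only on survivors. The extractor, the plan functions, and the sample documents are hypothetical stand-ins.

```python
def make_counting_extractor():
    """Return a toy extractAbstract plus a log of the documents it was run on."""
    calls = []

    def extract_abstract(doc):
        calls.append(doc)  # record each (expensive) extractor invocation
        marker = "Abstract:"
        idx = doc.find(marker)
        return [doc[idx + len(marker):].strip()] if idx >= 0 else []

    return extract_abstract, calls

def naive_plan(docs, w, extract):
    """Extract from every document, then filter abstracts on contains(a, w)."""
    return [a for d in docs for a in extract(d) if w in a]

def pushed_down_plan(docs, w, extract):
    """Push contains(d, w) below the extractor.
    The document-level test is necessary but not sufficient,
    so the original filter on the abstract is kept."""
    candidates = [d for d in docs if w in d]
    return [a for d in candidates for a in extract(d) if w in a]

DOCS = [
    "Title: Feedback in IR Abstract: relevance feedback is important",
    "Title: Query Plans Abstract: cost-based join ordering",
    "Title: On relevance feedback Abstract: a survey of ranking",
]
```

Both plans return the same abstracts. On these three documents the pushed-down plan skips the extractor for the second document, and the third document shows why the filter on the abstract must stay.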
9
Benefits of Xlog

Can model complex workflows
  e.g., recursion, negation
Has well-defined semantics
Can naturally combine IE/II blackboxes with relational ops
Can immediately exploit many optimization methods already developed for Datalog and RDBMSs
Can naturally incorporate text-centric optimizations
  estimate cost, select a good exec plan, in RDBMS fashion

10
Implementing Xlog: Take 1

Key challenge: how to store and access data on disk

[Figure: storage architecture with components Web, HTML pages, RDBMS, OS files, version store (e.g., Rdiff)]
11
Problems (Observed when Running DBLife)

Multiple concurrent processes
  machines, humans
Random data access
Lots of RDBMS-like operations
Huge amount of disk-resident data
Unlike ETL or MapReduce processes

[Figure: storage architecture with components Web, HTML pages, RDBMS, OS files, version store (e.g., Rdiff)]
12
Implementing Xlog: Take 2

Extend RDBMS to handle IE/II over text
  also a hot direction today at RDBMS companies
Want to understand benefits / limitations

[Figure: architecture with components Web, HTML pages, RDBMS]
13
Implementing Learning-Based Operators by Pushing Them into RDBMS

E.g., Markov logic network
  Lots of RDBMS-like operations
  Alchemy uses a fixed exec plan
  RDBMS automatically selects a good plan
  Drastic speedup in our experiments

[Figure: architecture with components Web, HTML pages, RDBMS]
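
To make "RDBMS-like operations" concrete, here is a small hedged sketch using Python's built-in sqlite3 module. It enumerates the groundings of one clause body, smokes(x) ∧ friends(x,y), as a SQL join, so the database engine rather than hand-written code decides how to execute it. The clause, table names, and data are textbook-style assumptions, not taken from the slides or from the project's code.

```python
import sqlite3

def ground_clause(friends, smokers):
    """Return all (x, y) with smokes(x) and friends(x, y) true in the evidence.
    Each returned pair is one grounding of the clause body; the join order
    and access paths are left to the database engine."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE friends(x TEXT, y TEXT)")
    con.execute("CREATE TABLE smokes(x TEXT)")
    con.executemany("INSERT INTO friends VALUES (?, ?)", friends)
    con.executemany("INSERT INTO smokes VALUES (?)", [(s,) for s in smokers])
    rows = con.execute(
        "SELECT f.x, f.y "
        "FROM friends AS f JOIN smokes AS s ON s.x = f.x "
        "ORDER BY f.x, f.y"
    ).fetchall()
    con.close()
    return rows
```

For example, ground_clause([("anna", "bob"), ("dave", "erin")], ["anna"]) yields the single grounding for anna and bob. With larger evidence tables, adding indexes lets the engine choose different plans without changing the query.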
14
Lessons Learned / Open Questions

Relational + ER seem okay so far
To combine blackboxes, Datalog variants are promising
Right implementation strategy still unclear
  ETL / MapReduce seems best for one-shot IE/II
  Building / maintaining many Web DBs is not one-shot
    especially if involving humans
    concurrent processes, data often revisited, random access
  Optimization is critical
  RDBMS especially promising
    locking, indexing, optimization, handling disk-resident data
  Most likely need a combination of RDBMS and MapReduce

15
The Cimple Project (2005 to date)

Develops a generic solution to build Web databases using extraction + integration + user feedback
Example: DBLife

Critical
  IE/II inevitably make mistakes, which can cascade quickly
  when the database evolves, mistakes happen
  a lot of data is in users' heads, not yet on the Web
Highly beneficial
  scenario 1: 10-15 developers
  their feedback can already make a big difference
  no good solution today, designated victim in DBLife
  scenario 2: hire people using Mechanical Turk
  scenario 3: lots of ordinary users volunteering feedback

17
Types of User Feedback
Flagging an Error
Fixing an Error
Editing Code
Editing Data
Input
Output
Intermediate Results
18
Editing the Output

To maximize the amount of feedback
⇒ users should be able to edit anything
  records, lists, sets, tables, natural text, ...
  using whatever UI they like: form, Excel, wiki, GUI, ...
  virtually the whole page should be editable

19
Example: Editing a Record

HTML
  Name: Joe Hellerstein
  Organization: UC-Berkeley
  Contact: joe_at_berkeley.edu
  Research Interest: Data stream, Declarative networking, Sensor networks
  User edit: Remove Contact joe_at_berkeley.edu
View
  Entity 123: name Joe Hellerstein, org UC-Berkeley, email joe_at_berkeley.edu
  Data stream, 0.9; Declarative networking, 0.6; Sensor networks, 0.4
Data
  Entity 123: name Joe Hellerstein, salary 150K, org UC-Berkeley, email joe_at_berkeley.edu

How to interpret edits?
How to push down edits?
How to manage concurrent edits?
How to propagate edits?
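
One naive way to think about pushing an edit down from the page to the data is sketched below in Python. It assumes a fixed mapping from page labels to data columns and interprets "Remove Contact" as clearing the underlying email field. Both the mapping and that interpretation are assumptions for illustration; the slide leaves these questions open.

```python
import copy

# Assumed mapping from labels shown on the page to columns of the data record.
# Note that salary is stored in the data but is not exposed in the view.
VIEW_FIELDS = {"Name": "name", "Organization": "org", "Contact": "email"}

def make_view(record):
    """Project a data record onto the labels shown on the page."""
    return {
        label: record[col]
        for label, col in VIEW_FIELDS.items()
        if record.get(col) is not None
    }

def push_down_remove(record, label):
    """Translate the view-level edit 'Remove <label>' into an update on the
    data record. Returns a new record and leaves the original untouched."""
    if label not in VIEW_FIELDS:
        raise KeyError(f"label {label!r} is not editable through this view")
    updated = copy.deepcopy(record)
    updated[VIEW_FIELDS[label]] = None
    return updated

entity_123 = {
    "id": 123,
    "name": "Joe Hellerstein",
    "salary": "150K",
    "org": "UC-Berkeley",
    "email": "joe_at_berkeley.edu",
}
```

After push_down_remove(entity_123, "Contact"), the regenerated view no longer shows a Contact line, while fields hidden from the view, such as salary, are unchanged. Concurrent edits and propagation to derived data are not handled here.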
20
Example: Editing a Record

How to edit page format? How to display new data?

HTML
  Name: Joe Hellerstein; Contact: joe_at_berkeley.edu (try calling first); Organization: UC-Berkeley
  Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe_at_berkeley.edu
  Page format: Name, Contact (try calling first), Organization
View
  Entity 123: name Joe Hellerstein, org UC-Berkeley, email joe_at_berkeley.edu
Data
  Entity 123: name Joe Hellerstein, salary 150K, org UC-Berkeley, email joe_at_berkeley.edu, joe_at_acm.org
  Entity 123: name Joe Hellerstein, salary 150K, org UC-Berkeley, email joe_at_berkeley.edu
21
Example: Editing a Record

How to undo? Recover from crash?
  roll back to 3pm yesterday
  undo a bad user edit: what if other users have built on that edit?
How to reconcile human / machine edits?
How to split superhomepages?

[Figure: versions of the record produced by human and machine edits. One version reads Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe_at_berkeley.edu. Another lists Contact: joe_at_berkeley.edu, joe_at_mit.edu, joe_at_swivel.com, which is split into "Joe Berkeley" and "Joe MIT"]
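
A common way to support both rollback and human/machine reconciliation is an append-only edit log that is replayed to rebuild state. The Python sketch below illustrates that idea under two stated assumptions: timestamps are simple integers, and a machine edit never overwrites a field last set by a human. Neither the Edit class nor this policy comes from the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    ts: int        # logical timestamp (assumption: integers, larger is later)
    source: str    # "human" or "machine"
    field: str
    value: object

def replay(log, until=None):
    """Rebuild a record by replaying edits in timestamp order.
    If `until` is given, edits after that time are ignored (rollback).
    Assumed policy: machine edits are skipped for fields owned by a human."""
    state, owner = {}, {}
    for e in sorted(log, key=lambda e: e.ts):
        if until is not None and e.ts > until:
            break
        if e.source == "machine" and owner.get(e.field) == "human":
            continue
        state[e.field] = e.value
        owner[e.field] = e.source
    return state

LOG = [
    Edit(1, "machine", "email",
         ["joe_at_berkeley.edu", "joe_at_mit.edu", "joe_at_swivel.com"]),
    Edit(2, "human", "email", ["joe_at_berkeley.edu"]),
    # A later re-extraction repeats the machine's earlier mistake.
    Edit(3, "machine", "email",
         ["joe_at_berkeley.edu", "joe_at_mit.edu", "joe_at_swivel.com"]),
]
```

Replaying the full log keeps the human correction even though the extractor ran again afterwards, and replay(LOG, until=1) recovers the state before the human edit. This sketch does not address the harder question on the slide of undoing an edit that other users have already built on.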
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26

Text mixed with structured data (from the database)
Can edit both

27
(No Transcript)
28
Editing Input/Intermediate Results

Extracting conference services

[Figure: workflow with operators crawl, extractNames, extractConf, findRoles over dataSources, roles, and services; editing interfaces shown are a form, a spreadsheet, and a wiki]
29
Editing Code

Currently naïve users edit control flow of code

[Figure: for "1 Joe Hellerstein" versus "5 Chen Li-s", the filter pubs step can either use just the author name, or use author name, co-authors, and conf proximity]
30
Lessons Learned / Open Questions

User feedback is critical
  correct data obtained from the Web
  help improve IE/II algorithms
  help solicit data in users' heads
  help build community Wikipedia, using machine-human
Numerous interesting challenges

31
Cimple: Current Status

Started in 2005
Involved UIUC, Yahoo, IBM, Microsoft
Major project at Wisconsin
  affiliated profs: Jeff Naughton, Chris Re, Jude Shavlik, Raghu Ramakrishnan
  20 students: Pedro DeRose, Warren Shen, Robert McCann, Xiaoyong Chai, Ba-Quy Vuong, Fei Chen, Chaitanya Gokhale, Feng Niu, Ting Chen, Byron Gao, Erick Chu, Akanksha Baid, Jiansheng Huang, and more
  prototypes: Cimple 1.0, Cimple 2.0, applied to DB, Lake Mendota, Wikipedia
  19 SIGMOD/VLDB/ICDE papers, invited papers, special issues, tutorial
  funded by NSF, Yahoo, IBM, Google, Microsoft, DARPA
  technology transfer to Microsoft (Ad Lab, SQL Server group)