Building Structured Web Databases: A Midterm Report from the Cimple Project

Description:

In normal Datalog, each predicate in the body is associated with a relation that is either stored or defined by a Datalog rule. Intuitively, ...

Slides: 32
Provided by: zam61
Transcript and Presenter's Notes

1
Building Structured Web Databases: A Midterm Report from the Cimple Project
AnHai Doan, University of Wisconsin-Madison
2
Structured Web Databases
3
The Cimple Project (2005 - Date)
  • Develops a generic solution to build Web databases
  • using extraction + integration + user feedback
  • Example: DBLife

[Figure: DBLife pipeline. Web pages (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP) are processed by an IE/II program into entities and relationships (e.g., Jim Gray, give-talk, SIGMOD-04), which support browsing, keyword search, SQL querying, question answering, mining, alert/monitor, and news summary.]
4
Data Model for IE/II
  • Many choices
  • Relational, XML, RDF triples, nested, JSON, etc.
  • Desiderata
  • Conceptually simple, programmers can visualize
  • Naïve users can visualize (for providing
    feedback)
  • Easy to write queries
  • Robust industrial support
  • Decided on relational + ER

[Figure: Web pages (researcher homepages, conference pages, group pages) are processed by an IE/II program into entities and relationships (e.g., Jim Gray, give-talk, SIGMOD-04).]
  • Want to understand benefits / limitations

5
Programming Model for IE/II
  • Must combine IE/II blackboxes into workflows
  • Many possible choices
  • E.g., UIMA, pub/sub
  • Desiderata
  • Easy to write, understand, debug, maintain
  • Expressive (e.g., can do loops), highly
    extensible
  • Solid theoretical foundation
  • Can optimize to death (critical!)

6
Proposed Solution: Xlog, Datalog with Embedded Procedural Predicates
Example input: docs = {d1, d2}
Example extracted values:
  title: Feedback in IR        abstract: Relevance feedback is important...
  title: Personalized Search   abstract: Customizing rankings with relevance feedback...

Xlog rules:
  titles(d,t) :- docs(d), extractTitle(d,t).
  abstracts(d,a) :- docs(d), extractAbstract(d,a).
  talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a, "relevance feedback").
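The three rules above can be read as joins over relations whose tuples are produced by procedural blackboxes. The following is a minimal Python sketch of that reading, not the Cimple implementation: the document text, the "Title:"/"Abstract:" line layout, and the helper `extract_spans` are all invented for illustration, and `imm_before` is just one plausible interpretation of immBefore (the title line ends right before the abstract line starts).

```python
import re

# Toy corpus (hypothetical): each document lists "Title:" and "Abstract:" lines.
DOCS = {
    "d1": "Title: Feedback in IR\n"
          "Abstract: Relevance feedback is important.\n"
          "Title: Personalized Search\n"
          "Abstract: Customizing rankings with relevance feedback.",
    "d2": "Title: Query Optimization\n"
          "Abstract: Cost models for join ordering.",
}

def extract_spans(doc_text, label):
    """Hypothetical IE blackbox: yield (start, end, text) for each 'label: ...' line."""
    for m in re.finditer(rf"^{label}: (.*)$", doc_text, flags=re.MULTILINE):
        yield (m.start(), m.end(), m.group(1))

def imm_before(t, a):
    # Assumed meaning: span t ends exactly one newline before span a starts.
    return t[1] + 1 == a[0]

def contains(a, phrase):
    return phrase.lower() in a[2].lower()

docs = sorted(DOCS)

# titles(d,t) :- docs(d), extractTitle(d,t).
titles = [(d, t) for d in docs for t in extract_spans(DOCS[d], "Title")]

# abstracts(d,a) :- docs(d), extractAbstract(d,a).
abstracts = [(d, a) for d in docs for a in extract_spans(DOCS[d], "Abstract")]

# talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a),
#                 contains(a, "relevance feedback").
talks = [
    (d, t[2], a[2])
    for (d, t) in titles
    for (d2, a) in abstracts
    if d == d2 and imm_before(t, a) and contains(a, "relevance feedback")
]

for row in talks:
    print(row)
```

The point of the sketch is only that extractTitle and extractAbstract behave like relations from the rule's point of view, so the rule body is an ordinary join plus two selections.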
7
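Xlog: Workflow of Relational Operators + Blackboxes
[Figure: execution plan for the talks rule, listed top-down with the example tuples shown at each operator]
  σ contains(a, "relevance feedback"): (d1, t1, a1)
  σ immBefore(t,a): (d1, t1, a1), (d1, t2, a2)
  join of the two branches: (d1, t1, a1), (d1, t1, a2), (d2, t2, a1), (d2, t2, a2)
  extractTitle(d,t): (d1, t1), (d1, t2)
  extractAbstract(d,a): (d1, a1), (d1, a2)
  docs(d), scanned once per branch: d1, d2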
8
Sample Optimization: Pushing Down Text Properties
[Figure: extractAbstract takes a document d as input and outputs an abstract a]
  contains(a,w) ∧ comes-from(a,d) ⇒ contains(d,w)
  italics(s) ∧ overlaps(s,t) ⇒ containsItalics(t)
  (lengthWord(s) > 3) ∧ comes-from(s,t) ⇒ lengthWord(t) > 3
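The first implication says that if an abstract extracted from d contains w, then d itself contains w, so a cheap document-level test can run before the expensive extractor. A minimal Python sketch of that pushdown follows, under stated assumptions: the corpus, the `extract_abstract` helper, and the call counter are hypothetical and stand in for a costly IE blackbox.

```python
# Toy corpus (hypothetical).
DOCS = {
    "d1": "Title: Feedback in IR\nAbstract: Relevance feedback is important.",
    "d2": "Title: Query Optimization\nAbstract: Cost models for join ordering.",
    "d3": "Title: Stream Processing\nAbstract: Windowed joins over sensor data.",
}

def contains(text, w):
    return w.lower() in text.lower()

def extract_abstract(doc_text, calls):
    """Hypothetical expensive extractor; 'calls' records each invocation."""
    calls.append(1)
    prefix = "Abstract: "
    return [ln[len(prefix):] for ln in doc_text.splitlines() if ln.startswith(prefix)]

def naive_plan(docs, w):
    # Extract from every document, then filter the abstracts.
    calls = []
    out = [(d, a) for d, text in docs.items()
           for a in extract_abstract(text, calls) if contains(a, w)]
    return out, len(calls)

def pushed_down_plan(docs, w):
    # contains(a,w) and comes-from(a,d) imply contains(d,w), so documents
    # failing the cheap test can be skipped without changing the result.
    calls, out = [], []
    for d, text in docs.items():
        if not contains(text, w):
            continue
        out.extend((d, a) for a in extract_abstract(text, calls) if contains(a, w))
    return out, len(calls)

naive_result, naive_calls = naive_plan(DOCS, "relevance feedback")
pushed_result, pushed_calls = pushed_down_plan(DOCS, "relevance feedback")
print(naive_calls, pushed_calls)
```

Both plans return the same tuples; the pushed-down plan simply invokes the extractor on fewer documents. The selection on the abstract is still applied afterwards, because the implication only runs in one direction.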
9
Benefits of Xlog
  • Can model complex workflows
  • e.g., recursion, negation
  • Has well-defined semantics
  • Can naturally combine IE/II blackboxes w/
    relational ops
  • Can immediately exploit many optimization methods
  • already developed for Datalog + RDBMS
  • Can naturally incorporate text-centric
    optimizations
  • estimate cost, select good exec plan, in RDBMS
    fashion

10
Implementing Xlog: Take 1
  • Key challenge: how to store and access data on disk

[Figure: HTML pages from the Web are stored across an RDBMS, OS files, and a version store (e.g., Rdiff)]
11
Problems (Observed when Running DBLife)
  • Multiple concurrent processes
  • machines, humans
  • Random data access
  • Lots of RDBMS-like operations
  • Huge amount of disk-resident data
  • Unlike ETL / MapReduce processes

[Figure: HTML pages from the Web are stored across an RDBMS, OS files, and a version store (e.g., Rdiff)]
12
Implementing Xlog: Take 2
  • Extend RDBMS to handle IE/II over text
  • also a hot direction today at RDBMS companies
  • Want to understand benefits / limitations

[Figure: HTML pages from the Web are stored in the RDBMS]
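One way to picture "extending an RDBMS to handle IE over text" is to register an extractor as a user-defined function and call it from SQL, so the engine can apply cheap predicates before the extractor runs. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; it is not the system used in Cimple, and the table layout and the `extract_title` extractor are assumptions made for illustration.

```python
import sqlite3

def extract_title(body):
    """Hypothetical IE blackbox: return the first 'Title: ...' line, or None."""
    prefix = "Title: "
    for line in body.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return None

conn = sqlite3.connect(":memory:")
# Expose the Python extractor to SQL as a one-argument scalar function.
conn.create_function("extract_title", 1, extract_title)

conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, body) VALUES (?, ?)",
    [
        ("d1", "Title: Feedback in IR\nAbstract: Relevance feedback is important."),
        ("d2", "Title: Query Optimization\nAbstract: Cost models for join ordering."),
    ],
)

# The LIKE filter plays the role of a text property evaluated on the document.
rows = conn.execute(
    "SELECT id, extract_title(body) FROM docs "
    "WHERE body LIKE '%relevance feedback%' ORDER BY id"
).fetchall()
print(rows)
```

A scalar function returns one value per row, so an extractor that yields several spans per document would need a different mechanism (for example, materializing its output into a table); the sketch deliberately sidesteps that.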
13
Implementing Learning-Based Operators by Pushing Them into RDBMS
  • E.g., Markov logic networks
  • Lots of RDBMS-like operations
  • Alchemy uses a fixed exec plan
  • RDBMS automatically selects a good plan
  • Drastic speedup in our experiments

[Figure: HTML pages from the Web are stored in the RDBMS]
14
Lessons Learned / Open Questions
  • Relational + ER seems okay so far
  • To combine blackboxes, Datalog variants are promising
  • Right implementation strategy still unclear
  • ETL / MapReduce seems best for one-shot IE/II
  • Building / maintaining many Web DBs is not one-shot
  • especially if involving humans
  • concurrent processes, data often revisited, random access
  • Optimization is critical
  • RDBMS especially promising
  • locking, indexing, optimization, handling disk-resident data
  • Most likely need a combination of RDBMS + MapReduce

15
The Cimple Project (2005 - Date)
  • Develops a generic solution to build Web databases
  • using extraction + integration + user feedback
  • Example: DBLife

[Figure: DBLife pipeline. Web pages (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP) are processed by an IE/II program into entities and relationships (e.g., Jim Gray, give-talk, SIGMOD-04), which support browsing, keyword search, SQL querying, question answering, mining, alert/monitor, and news summary.]
16
User Feedback
  • Critical
  • IE/II inevitably makes mistakes, which can cascade quickly
  • when the database evolves, mistakes happen
  • a lot of data is in users' heads, not yet on the Web
  • Highly beneficial
  • scenario 1: 10-15 developers
  • their feedback can already make a big difference
  • no good solution today, designated victim in DBLife
  • scenario 2: hire people using Mechanical Turk
  • scenario 3: lots of ordinary users volunteering feedback

17
Types of User Feedback
[Figure: classification of user feedback]
  • Flagging an error vs. fixing an error
  • Editing code vs. editing data
  • Applied to input, intermediate results, or output
18
Editing the Output
  • To maximize the amount of feedback ⇒ users should be able to edit anything
  • records, lists, sets, tables, natural text, ...
  • using whatever UI they like: form, Excel, wiki, GUI, ...
  • virtually the whole page should be editable

19
Example: Editing a Record
HTML (the rendered page, with the user's edit):
  Name: Joe Hellerstein
  Organization: UC-Berkeley
  Contact: joe@berkeley.edu
  Research Interest: Data stream, Declarative networking, Sensor networks
  Edit: Remove Contact joe@berkeley.edu
View:
  Entity 123: name = Joe Hellerstein, org = UC-Berkeley, email = joe@berkeley.edu
  Data stream, 0.9; Declarative networking, 0.6; Sensor networks, 0.4
Data:
  Entity 123: name = Joe Hellerstein, salary = 150K, org = UC-Berkeley, email = joe@berkeley.edu

  • How to interpret edits?
  • How to push down edits?
  • How to manage concurrent edits?
  • How to propagate edits?
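The slide leaves "how to push down edits" as an open question. As one very simplified reading, the page is rendered from a view that projects some fields of the base record, and an edit on the page is translated into an update of the underlying data. The Python sketch below assumes exactly that and nothing more: the field names mirror the example record, while `VIEW_FIELDS`, `view`, and `remove_contact` are hypothetical names introduced here.

```python
# Base data (mirrors the example record; salary is stored but not shown).
data = {
    123: {
        "name": "Joe Hellerstein",
        "salary": "150K",
        "org": "UC-Berkeley",
        "email": ["joe@berkeley.edu"],
    }
}

# Assumption: the view is a simple projection of the base record.
VIEW_FIELDS = ("name", "org", "email")

def view(entity_id):
    """Return only the fields that the page is allowed to display."""
    rec = data[entity_id]
    return {f: rec[f] for f in VIEW_FIELDS}

def remove_contact(entity_id, address):
    """Translate the page edit 'Remove Contact <address>' into a base-data update."""
    if "email" not in VIEW_FIELDS:
        raise PermissionError("email is not editable through this view")
    emails = data[entity_id]["email"]
    if address not in emails:
        raise ValueError(f"{address} is not a contact of entity {entity_id}")
    data[entity_id]["email"] = [e for e in emails if e != address]

remove_contact(123, "joe@berkeley.edu")
print(view(123))
```

Even this toy version shows why the questions on the slide are hard: the edit touches only a visible field, fields hidden from the view (salary) must stay intact, and nothing here handles concurrent edits or propagation to derived data.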
20
Example: Editing a Record
  • How to edit page format? How to display new data?

HTML (two versions of the page are shown):
  Name: Joe Hellerstein; Contact: joe@berkeley.edu (try calling first); Organization: UC-Berkeley
  Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe@berkeley.edu
Page format: Name, Contact (try calling first), Organization
View:
  Entity 123: name = Joe Hellerstein, org = UC-Berkeley, email = joe@berkeley.edu
Data (two versions of the record are shown):
  Entity 123: name = Joe Hellerstein, salary = 150K, org = UC-Berkeley, email = joe@berkeley.edu, joe@acm.org
  Entity 123: name = Joe Hellerstein, salary = 150K, org = UC-Berkeley, email = joe@berkeley.edu
21
Example: Editing a Record
  • How to undo? How to recover from a crash?
  • roll back to 3pm yesterday
  • undo a bad user edit: what if other users have built on that edit?
  • How to reconcile human / machine edits?
  • How to split superhomepages?

[Figure: a record changed by interleaved human and machine edits. One version reads Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe@berkeley.edu. Another lists Contact: joe@berkeley.edu, joe@mit.edu, joe@swivel.com, shown alongside the labels "Joe Berkeley" and "Joe MIT".]
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
  • Text mixed with structured data (from the
    database)
  • Can edit both

27
(No Transcript)
28
Editing Input/Intermediate Results
  • Extracting conference services

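[Figure: extraction workflow, listed bottom-up: dataSources, crawl, extractNames, extractConf, findRoles, roles, services. Editing interfaces attached to the workflow: Form, Spreadsheet, Wiki.]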
29
Editing Code
  • Currently naïve users edit control flow of code

[Figure: two copies of a "filter pubs" step with a choice between two matching strategies, "use just author name" and "use author name, co-authors, conf proximity". Example labels: "1 Joe Hellerstein", "5 Chen Li-s".]
30
Lessons Learned / Open Questions
  • User feedback is critical
  • correct data obtained from Web
  • help improve IE/II algorithms
  • help solicit data in users' heads
  • help build community Wikipedia, using machine + human
  • Numerous interesting challenges

31
Cimple: Current Status
  • Started in 2005
  • Involved UIUC, Yahoo, IBM, Microsoft
  • Major project @ Wisconsin
  • affiliated profs: Jeff Naughton, Chris Re, Jude Shavlik, Raghu Ramakrishnan
  • 20 students: Pedro DeRose, Warren Shen, Robert McCann, Xiaoyong Chai, Ba-Quy Vuong, Fei Chen, Chaitanya Gokhale, Feng Niu, Ting Chen, Byron Gao, Erick Chu, Akanksha Baid, Jiansheng Huang, and more
  • prototypes: Cimple 1.0, Cimple 2.0, applied to DB, Lake Mendota, Wikipedia
  • 19 SIGMOD/VLDB/ICDE papers, invited papers, special issues, tutorial
  • funded by NSF, Yahoo, IBM, Google, Microsoft, DARPA
  • technology transfer to Microsoft (Ad Lab, SQL Server group)