Title: Modeling Query-Based Access to Text Databases
1Modeling Query-Based Access to Text Databases
- Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
- Computer Science Department
- Columbia University
2Scalable Text Mining
- Often only a tiny fraction of a text database is relevant, so processing every document is unnecessarily expensive.
- Often relevant information is not crawlable, but available only via a search engine.
Search engines can help: efficiency and accessibility
3Task 1: Extracting Structured Information Buried in Text Documents
Organization Location
Microsoft Redmond
Apple Computer Cupertino
Nike Portland
Microsoft's central headquarters in Redmond is home to almost every product group and division.
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for "thinking a little too different."
Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.
4Information Extraction Applications
- Over a corporation's customer report or email complaint database: enabling sophisticated querying and analysis
- Over biomedical literature: identifying drug/condition interactions
- Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence
Significant progress over the last decade [MUC]
5Goal: Extract All Tuples of a Relation from a Document Database
Information Extraction System
Extracted Tuples
- One approach: feed every document to the information extraction system
- Problem: efficiency, accessibility!
6A Query-Based Strategy for Information Extraction
Intuition: Documents with one tuple for the relation are also likely to contain other tuples.
- 0 While seed has an unprocessed tuple t
- 1 Retrieve up to MaxResults documents matching t
- 2 Extract new tuples te from these documents
- 3 Augment seed with te
[Figure: seed and tuples t0, t1, t2]
Problem: May run out of tuples (and queries) → incomplete relation!
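The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `search` and `extract` are hypothetical callables standing in for the search interface and the information extraction system, and the toy documents (including the IBM tuple) are made up to show how the strategy can stop before the relation is complete.

```python
from collections import deque

def query_based_extraction(seed, search, extract, max_results=10):
    """Grow a relation from seed tuples by querying, then extracting.

    `search(t, max_results)` and `extract(doc)` are hypothetical callables,
    not part of any real library.
    """
    relation = set(seed)
    queue = deque(seed)                    # unprocessed tuples
    while queue:                           # step 0
        t = queue.popleft()
        docs = search(t, max_results)      # step 1
        for doc in docs:
            for te in extract(doc):        # step 2
                if te not in relation:     # step 3: augment seed
                    relation.add(te)
                    queue.append(te)
    return relation

# Toy database (hypothetical): each document is the set of tuples it mentions.
DOCS = {
    "d1": {("Microsoft", "Redmond"), ("Apple Computer", "Cupertino")},
    "d2": {("Apple Computer", "Cupertino"), ("Nike", "Portland")},
    "d3": {("IBM", "Armonk")},  # no query derived from the seed retrieves this
}

def toy_search(t, max_results):
    hits = [d for d in sorted(DOCS) if t in DOCS[d]]
    return hits[:max_results]

def toy_extract(doc_id):
    return DOCS[doc_id]

found = query_based_extraction([("Microsoft", "Redmond")], toy_search, toy_extract)
```

In this toy run the tuple in d3 is never found, because no tuple reachable from the seed retrieves d3: the queue empties and the relation stays incomplete.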
7Hidden Web Databases
- Surface Web
- Link structure
- Crawlable
- Documents indexed by search engines
- Hidden Web
- No link structure
- Documents hidden in databases
- Documents not indexed by search engines
- Need to query each collection individually
8Search Over the Hidden Web
- Task 2: Database Content Summary Construction
- Typically the vocabulary of each database plus simple frequency statistics
Database selection relies on simple content summaries: vocabulary, word frequencies
PubMed (3,868,552 documents)
cancer 1,398,178
aids 106,512
heart 281,506
hepatitis 23,481
thrombopenia 24,826
Metasearcher query: thrombopenia
PubMed: ... thrombopenia 24,826 ...
NYTimes Archives: ... thrombopenia 18 ...
US Patents: ... thrombopenia 0 ...
Problem: Databases don't export content summaries!
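As a rough illustration of how a metasearcher could use such summaries, the sketch below ranks databases by the summed document frequency of the query words. This naive scoring is an assumption made for the example, not the selection algorithm of any particular metasearcher. The pairing of the three counts with PubMed, NYTimes Archives, and US Patents follows the order in which they appear on the slide and is also an assumption.

```python
def select_databases(query_words, summaries, k=2):
    """Rank databases by summed document frequency of the query words.

    `summaries` maps a database name to a {word: document frequency} dict.
    Databases with a zero score are dropped.
    """
    def score(db):
        return sum(summaries[db].get(w, 0) for w in query_words)

    ranked = sorted(summaries, key=score, reverse=True)
    return [db for db in ranked if score(db) > 0][:k]

# Content summaries restricted to the one word shown on the slide.
SUMMARIES = {
    "PubMed": {"thrombopenia": 24826},
    "NYTimes Archives": {"thrombopenia": 18},
    "US Patents": {"thrombopenia": 0},
}

chosen = select_databases(["thrombopenia"], SUMMARIES)
```

Without exported summaries the metasearcher has no `SUMMARIES` table to consult, which is what motivates building one by querying.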
9A Query-Based Strategy for Constructing Database
Summaries
- 0 While seed has an unprocessed word t
- 1 Retrieve up to MaxResults documents matching t
- 2 Extract new words te from these documents
- 3 Augment seed with te
[Figure: seed and words t0, t1, t2]
Problem: May run out of words (and queries) → incomplete summary!
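A minimal sketch of this loop for Task 2, assuming a hypothetical `search(word, max_results)` callable that returns `(doc_id, text)` pairs. The frequencies it collects are document frequencies within the sampled documents only, not the true database statistics.

```python
from collections import Counter, deque

def build_content_summary(seed_words, search, max_results=10):
    """Sample a database through its query interface and count word frequencies."""
    seen_words = set(seed_words)
    seen_docs = set()
    doc_freq = Counter()                  # word -> number of sampled docs containing it
    queue = deque(seed_words)             # unprocessed words
    while queue:                          # step 0
        word = queue.popleft()
        for doc_id, text in search(word, max_results):   # step 1
            if doc_id in seen_docs:
                continue
            seen_docs.add(doc_id)
            words = set(text.lower().split())
            doc_freq.update(words)
            for te in sorted(words):      # step 2
                if te not in seen_words:  # step 3: augment seed
                    seen_words.add(te)
                    queue.append(te)
    return doc_freq, len(seen_docs)

# Toy corpus (hypothetical); document 4 shares no word with the others.
CORPUS = {
    1: "sigmod database query",
    2: "query optimization database",
    3: "webdb web database",
    4: "hockey playoffs",
}

def toy_search(word, max_results):
    hits = [(i, t) for i, t in sorted(CORPUS.items()) if word in t.split()]
    return hits[:max_results]

summary, sampled = build_content_summary(["sigmod"], toy_search)
```

Starting from "sigmod", the sketch samples documents 1 to 3 and never sees document 4, so "hockey" is missing from the summary.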
10Query-Based Information Extraction and Database
Summary Construction
[Figure: two seeds, one in a connected region and one in a disconnected region]
11Model Querying Graph
- Tokens T
- Task 1: tuple attributes, e.g. [microsoft AND redmond], [acm AND new york]
- Task 2: words, e.g. sigmod, webdb
- Tokens (as queries) retrieve documents in D
- Documents contain tokens
[Figure: querying graph between tokens T (t1 to t5) and documents D (d1 to d5)]
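One possible in-memory representation of the querying graph keeps the two edge types separate: token-to-document ("retrieves") and document-to-token ("contains"). The class and method names below are hypothetical. The two relations are stored independently because a document can contain a token without being retrieved by it, for example when it falls outside the top MaxResults.

```python
from collections import defaultdict

class QueryingGraph:
    """Tokens (as queries) retrieve documents; documents contain tokens."""

    def __init__(self):
        self.retrieves = defaultdict(set)   # token -> documents it retrieves
        self.contains = defaultdict(set)    # document -> tokens it contains

    def add_containment(self, doc, token):
        self.contains[doc].add(token)

    def add_retrieval(self, token, doc):
        self.retrieves[token].add(doc)

    def tokens_via(self, token):
        """Tokens contained in any document that `token` retrieves."""
        found = set()
        for doc in self.retrieves.get(token, ()):
            found |= self.contains.get(doc, set())
        return found

# Toy edges (hypothetical, not the edges drawn on the slide).
g = QueryingGraph()
for doc, token in [("d1", "t1"), ("d1", "t2"), ("d2", "t2"), ("d2", "t3")]:
    g.add_containment(doc, token)
g.add_retrieval("t1", "d1")
g.add_retrieval("t2", "d2")   # d1 also contains t2 but is assumed cut off by MaxResults
```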
12Model Reachability Graph
[Figure: querying graph (tokens T, documents D) and the reachability graph it induces over tokens t1 to t5]
t1 retrieves document d1 that contains t2
t2, t3, and t4 are reachable from t1
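The reachability graph can be derived from the querying graph by composing the two edge types: add an edge t → t' whenever t retrieves some document that contains t'. The sketch below does this and then runs a breadth-first search. The toy edges are hypothetical, chosen only so that t2, t3, and t4 come out reachable from t1 as stated on the slide. Dropping self-loops is a simplification made here.

```python
from collections import deque

def reachability_graph(retrieves, contains):
    """Edge t -> t2 whenever t retrieves a document that contains t2."""
    edges = {}
    for token, docs in retrieves.items():
        out = set()
        for doc in docs:
            out |= contains.get(doc, set())
        out.discard(token)               # self-loops dropped in this sketch
        edges[token] = out
    return edges

def reachable_from(edges, start):
    """All tokens reachable from `start` by following one or more edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        token = queue.popleft()
        for nxt in edges.get(token, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

# Hypothetical querying graph.
RETRIEVES = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}, "t5": {"d5"}}
CONTAINS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}

edges = reachability_graph(RETRIEVES, CONTAINS)
```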
13Model (cont.) Connectivity
- In: tokens that retrieve other tokens but are not reachable
- Core: tokens that retrieve other tokens and themselves
- Out: reachable tokens that do not retrieve Core tokens
14Power-law Graphs
- Conjecture: Degree distribution in the reachability graph is described by a power law
- Completely described by only two parameters, a and b.
- Power-law random graphs are expected to have at most one giant connected component (Core + In + Out). Other connected components are small.
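To make the two parameters concrete, the sketch below assumes one common parameterization, log(count) = a - b * log(degree), where count is the number of tokens with a given outdegree. The slide does not give the formula, so this form is an assumption. The least-squares fit on the log-log histogram is a crude illustration and not necessarily the fitting procedure used by the authors.

```python
import math
from collections import Counter

def fit_power_law(outdegrees):
    """Fit log(count) = a - b * log(degree) by ordinary least squares.

    Zero outdegrees are skipped because log(0) is undefined.
    Needs at least two distinct positive degrees.
    """
    hist = Counter(k for k in outdegrees if k > 0)
    degrees = sorted(hist)
    xs = [math.log(k) for k in degrees]
    ys = [math.log(hist[k]) for k in degrees]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    return a, -slope                     # b is the negated slope

# Synthetic outdegrees with count = 64 * degree**-2 (made up for the example).
synthetic = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
a, b = fit_power_law(synthetic)
```

On this synthetic input the fit recovers b = 2 and a = log 64 up to floating-point error.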
15Model (cont.) Reachability
[Figure: giant component CRG, with a seed token t1 and the tokens t2, t3, t4 reachable from it]
- Reachability: relative size of the largest Core + Out
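The metric can be computed directly on a small reachability graph. The sketch below takes Core to be the largest strongly connected component, Out to be the tokens reachable from Core, and In to be the tokens that can reach Core. That reading of the slide's terms is an assumption. It uses a simple quadratic-time search that is fine for toy graphs but not meant for graphs with many thousands of tokens.

```python
from collections import deque

def _reach(edges, start):
    """Nodes reachable from `start`, including `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges):
    """Return (core, in_set, out_set, nodes) for a non-empty directed graph."""
    nodes = set(edges) | {v for outs in edges.values() for v in outs}
    fwd = {v: _reach(edges, v) for v in nodes}
    # Component of v: nodes that v reaches and that reach v back.
    core = max(({u for u in fwd[v] if v in fwd[u]} for v in nodes), key=len)
    anchor = next(iter(core))
    out_set = fwd[anchor] - core
    in_set = {v for v in nodes if anchor in fwd[v]} - core
    return core, in_set, out_set, nodes

def reachability(edges):
    """Relative size of Core + Out among all tokens."""
    core, _, out_set, nodes = bow_tie(edges)
    return (len(core) + len(out_set)) / len(nodes)

# Toy reachability graph (hypothetical): t0 -> t1 <-> t2 -> t3, t4 isolated.
EDGES = {"t0": {"t1"}, "t1": {"t2"}, "t2": {"t1", "t3"}, "t3": set(), "t4": set()}
core, in_set, out_set, nodes = bow_tie(EDGES)
```

Here Core is {t1, t2}, In is {t0}, Out is {t3}, and reachability is 3 of 5 tokens.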
16Outline
- Task 1 Information Extraction
- Task 2 Constructing Database Summary
- Model
- Querying, reachability graphs
- Power-law graphs
- Reachability
- Querying Real Databases
- Estimation
- Experimental Results
- Discussion
17Querying Real Databases
- Task 1: NYT
- DiseaseOutbreaks
- (date, disease, location)
- The New York Times
- |D| = 137,000
- |T| = 8,859
- Task 2: 20NG
- Postings from 20 Newsgroups
- |D| = 20,000
- |T| = 109,000
Date DiseaseName Location
Jan. 1995 Malaria Ethiopia
June 1995 Ebola Africa
July 1995 Mad Cow Disease The U.K.
Feb. 1995 Pneumonia The U.S.
18NYT Reachability Graph Outdegree Distribution
Matches the power-law distribution
[Plots: outdegree distribution for MaxResults = 10 and MaxResults = 50]
19NYT Component Size Distribution
[Plots: component size distribution, reachable vs. not reachable tokens]
MaxResults = 10: |CG| / |T| = 0.297
MaxResults = 50: |CG| / |T| = 0.620
2020NG Outdegree Distribution
Approximated by power-law distribution
[Plots: outdegree distribution for MaxResults = 1 and MaxResults = 10]
|CG| / |T| = 1 (completely connected)
21Estimating Reachability
- In a power-law random graph G, a giant component CG emerges if d (the average outdegree) > 1, and
[equation missing from transcript]
- Estimate: Reachability = |CG| / |T|
- Depends only on d (average outdegree)
For b < 3.457
22Estimating Reachability using Sampling
- Choose S random seed tokens
- Query the database for each seed token
- Compute the outgoing edges of nodes in seed
- Estimate d as the average outdegree of seed tokens
[Figure: sampled part of the querying graph (tokens t1 to t5, documents d1 to d5), giving d = 1.5]
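The sampling steps can be sketched as follows. `search` and `tokens_in` are hypothetical callables for the query interface and for tokenizing a retrieved document. The toy edges are an assumed reading of the slide's figure (t1 leads to t2 and t3, t3 leads to t4), chosen so that the estimate matches the d = 1.5 shown there.

```python
import random

def choose_seed_tokens(tokens, s, rng=None):
    """Pick S seed tokens uniformly at random, without replacement."""
    rng = rng or random.Random()
    return rng.sample(sorted(tokens), s)

def estimate_average_outdegree(seed_tokens, search, tokens_in, max_results=10):
    """Estimate d as the mean number of other tokens each seed token leads to."""
    total = 0
    for token in seed_tokens:
        reached = set()
        for doc in search(token, max_results):
            reached |= tokens_in(doc)
        reached.discard(token)           # a token does not count as its own neighbor
        total += len(reached)
    return total / len(seed_tokens)

# Toy database (hypothetical edges).
RETRIEVES = {"t1": ["d1", "d2"], "t3": ["d3"]}
CONTAINS = {"d1": {"t1", "t2"}, "d2": {"t1", "t3"}, "d3": {"t3", "t4"}}

def toy_search(token, max_results):
    return RETRIEVES.get(token, [])[:max_results]

def toy_tokens_in(doc):
    return CONTAINS[doc]

d = estimate_average_outdegree(["t1", "t3"], toy_search, toy_tokens_in)
seeds = choose_seed_tokens({"t1", "t2", "t3", "t4", "t5"}, 2, random.Random(0))
```

With seed tokens t1 (outdegree 2) and t3 (outdegree 1), the estimate is d = 1.5. Only S queries are issued, which is what keeps the estimate cheap.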
23Estimating Reachability for NYT
Approximate reachability is estimated after 50 queries. → Can be used to predict success (or failure) of a Task 1 algorithm.
24Reachability of NYT (cont.)
[Plot annotation: .46]
Reachability correctly predicts performance of the Tuples strategy for Task 1 (described in [Agichtein and Gravano, ICDE 2003])
25Estimating Reachability of 20NG
Estimates reachability closely, after just 10 queries.
Corroborates Callan's results [Callan et al., SIGMOD 1999]
26Summary
- Presented graph model for query-based access to text databases
- Querying and Reachability graphs
- Formal tool for analyzing heuristic algorithms
- The reachability metric: predictions for algorithm performance
- Efficient estimation techniques
- Power-law random graph properties; document sampling
27Future Work
- Other properties of the reachability graph
- Edge Density
- Diameter
- Real-life limitations
- Total number of queries? → querying graph
- Total number of documents? → querying graph
- Analyze other (heuristic) algorithms.
28Modeling Query-Based Access to Text Databases
- Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
- Computer Science
- Columbia University
Questions?
29Overflow Slides
31Information Extraction Example: Organizations' Headquarters
Input Documents
Named-Entity Tagging
Pattern Matching
Output Tuples
32Efficient Information Extraction Alternatives
Given a large text database and an information
extraction task, how to proceed?
- If a large fraction of documents are relevant:
- Scan (not always possible)
- Else:
- Tuples?
[Figure: text database]
Will Tuples retrieve enough of the relation?
33Search Over Hidden Web Databases
- → Metasearchers
- Database Selection: Choosing best databases for a query
- Database Selection Needs Content Summaries
- Typically the vocabulary of each database plus simple frequency statistics
PubMed (3,868,552 documents)
cancer 1,398,178
aids 106,512
heart 281,506
hepatitis 23,481
thrombopenia 24,826
34Model
- Is there a common model for algorithms for Query-Based Information Extraction and Database Summary Construction?
- What are the limitations of these algorithms?
- Given a new database, will such an algorithm work?