Modeling Query-Based Access to Text Databases

About This Presentation

Slides: 35

Provided by: EugeneAg8

Learn more at: http://www.mathcs.emory.edu

Transcript and Presenter's Notes

1
Modeling Query-Based Access to Text Databases

Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science Department
Columbia University

2
Scalable Text Mining

Often only a tiny fraction of a text database is
relevant, so processing every document is
unnecessarily expensive.
Often relevant information is not crawlable, but
available only via a search engine.

Search engines can help: efficiency and
accessibility

3
Task 1: Extracting Structured Information
Buried in Text Documents

Microsoft's central headquarters in Redmond is
home to almost every product group and division.

Brent Barlow, 27, a software analyst and
beta-tester at Apple Computer's headquarters in
Cupertino, was fired Monday for "thinking a
little too different."

Apple's programmers "think different" on a
"campus" in Cupertino, Cal. Nike employees "just
do it" at what the company refers to as its
"World Campus," near Portland, Ore.

Organization    Location
Microsoft       Redmond
Apple Computer  Cupertino
Nike            Portland

4
Information Extraction Applications

Over a corporation's customer report or email
complaint database: enabling sophisticated
querying and analysis
Over biomedical literature: identifying
drug/condition interactions
Over newspaper archives: tracking disease
outbreaks, terrorist attacks; intelligence

Significant progress over the last decade [MUC]

5
Goal: Extract All Tuples of a Relation from a
Document Database
[Figure: Document Database -> Information Extraction System -> Extracted Tuples]

One approach: feed every document to the
information extraction system
Problem: efficiency, accessibility!

6
A Query-Based Strategy for Information Extraction
Intuition: Documents with one tuple for the
relation are also likely to contain other tuples.

0 While seed has an unprocessed tuple t
1   Retrieve up to MaxResults documents matching t
2   Extract new tuples te from these documents
3   Augment seed with te

[Figure: seed expanding through tuples t0, t1, t2]

Problem: May run out of tuples (and queries) ->
incomplete relation!
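
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `search` and `extract` callables and the three-document toy collection are hypothetical stand-ins for a real search interface and a real information extraction system, neither of which the slides specify.

```python
from collections import deque

def query_based_extraction(seed, search, extract, max_results=10):
    """Query with known tuples, harvest new tuples, repeat until none are left."""
    known = set(seed)
    queue = deque(seed)
    while queue:                                  # step 0: unprocessed tuple t
        t = queue.popleft()
        for doc in search(t)[:max_results]:       # step 1: up to MaxResults documents
            for te in extract(doc):               # step 2: extract tuples
                if te not in known:               # step 3: augment seed
                    known.add(te)
                    queue.append(te)
    return known

# Hypothetical toy database: each document is modeled as the set of tuples it mentions.
DOCS = [
    {("Microsoft", "Redmond"), ("Apple Computer", "Cupertino")},
    {("Apple Computer", "Cupertino"), ("Nike", "Portland")},
    {("IBM", "Armonk")},  # shares no tuple with the other documents
]

def search(t):
    """Toy search engine: return the documents that mention tuple t."""
    return [doc for doc in DOCS if t in doc]

def extract(doc):
    """Toy extraction system: the document already is its set of tuples."""
    return sorted(doc)

found = query_based_extraction([("Microsoft", "Redmond")], search, extract)
```

In this toy run the tuple in the third document is never found, because no query issued from the seed retrieves that document. That is the "incomplete relation" problem named on the slide.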
7
Hidden Web Databases

Surface Web
Link structure
Crawlable
Documents indexed by search engines

Hidden Web
No link structure
Documents hidden in databases
Documents not indexed by search engines
Need to query each collection individually

8
Search Over the Hidden Web

Task 2: Database Content Summary Construction

Database selection relies on simple content
summaries (vocabulary, word frequencies):
typically the vocabulary of each database plus
simple frequency statistics.

PubMed (3,868,552 documents)
Word          Frequency
cancer        1,398,178
aids          106,512
heart         281,506
hepatitis     23,481
thrombopenia  24,826

[Figure: a metasearcher decides where to send the query "thrombopenia"]
Database          thrombopenia
PubMed            24,826
NYTimes Archives  18
US Patents        0

Problem: Databases don't export content summaries!
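
To make the role of content summaries concrete, here is a minimal database selection sketch that uses the "thrombopenia" counts from the slide. Ranking databases by the raw frequency of a single query word is a simplifying assumption made for illustration; the slides do not say which selection algorithm the metasearcher uses, and `select_databases` is a hypothetical name.

```python
# Content summaries restricted to the one word shown on the slide.
SUMMARIES = {
    "PubMed": {"thrombopenia": 24826},
    "NYTimes Archives": {"thrombopenia": 18},
    "US Patents": {"thrombopenia": 0},
}

def select_databases(query_word, summaries, k=1):
    """Return the k databases whose summaries report the highest frequency for query_word."""
    ranked = sorted(
        summaries,
        key=lambda db: summaries[db].get(query_word, 0),
        reverse=True,
    )
    return ranked[:k]

best = select_databases("thrombopenia", SUMMARIES)
top_two = select_databases("thrombopenia", SUMMARIES, k=2)
```

Without the summaries the metasearcher has nothing to rank on, which is why they have to be constructed by querying when the databases do not export them.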
9
A Query-Based Strategy for Constructing Database
Summaries

0 While seed has an unprocessed word t
1   Retrieve up to MaxResults documents matching t
2   Extract new words te from these documents
3   Augment seed with te

[Figure: seed expanding through words t0, t1, t2]

Problem: May run out of words (and queries) ->
incomplete summary!
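
A sketch of the same loop for Task 2, this time recording document frequencies over the retrieved sample. The toy database, the whitespace tokenizer, and the `toy_search` helper are hypothetical; a real hidden-web database would be reached through its search form.

```python
from collections import Counter, deque

def build_content_summary(seed_words, search, max_results=10):
    """Approximate a content summary (word -> document frequency) by querying."""
    seen_words = set(seed_words)
    queue = deque(seed_words)
    seen_docs = set()
    doc_freq = Counter()
    while queue:                                        # step 0: unprocessed word t
        word = queue.popleft()
        for doc_id, text in search(word)[:max_results]:  # step 1
            if doc_id in seen_docs:
                continue                                # count each document once
            seen_docs.add(doc_id)
            words = set(text.lower().split())           # step 2: extract words
            doc_freq.update(words)
            for new_word in sorted(words - seen_words):  # step 3: augment seed
                seen_words.add(new_word)
                queue.append(new_word)
    return doc_freq

# Hypothetical three-document database.
TOY_DB = {0: "heart cancer", 1: "cancer hepatitis", 2: "aids"}

def toy_search(word):
    """Toy search engine: (doc_id, text) pairs for documents containing the word."""
    return [(i, text) for i, text in sorted(TOY_DB.items()) if word in text.split()]

summary = build_content_summary(["heart"], toy_search)
```

Starting from "heart", the sketch never sees the word "aids": the only document containing it shares no word with the rest, so the resulting summary is incomplete.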
10
Query-Based Information Extraction and Database
Summary Construction
[Figure: two graphs explored from a seed, one labeled "connected" and one labeled "disconnected"]
11
Model: Querying Graph

Tokens T
  Task 1: tuple attributes, e.g.
  [microsoft AND redmond], [acm AND new york]
  Task 2: words, e.g. sigmod, webdb
Tokens (as queries) retrieve documents in D
Documents contain tokens

[Figure: bipartite graph between tokens t1..t5 (T) and documents d1..d5 (D)]
12
Model: Reachability Graph

[Figure: querying graph over tokens t1..t5 (T) and documents d1..d5 (D), next to the reachability graph over tokens t1..t5 derived from it]

t1 retrieves document d1 that contains t2
t2, t3, and t4 reachable from t1
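
The step from the querying graph to the reachability graph can be sketched as follows. The toy documents are hypothetical and only loosely follow the figure; the sketch also assumes that a token used as a query retrieves every document containing it (up to MaxResults), and it drops self-loops for readability.

```python
from collections import deque

def build_querying_graph(docs):
    """Token -> documents it retrieves (here: every document containing the token)."""
    token_to_docs = {}
    for doc_id in sorted(docs):
        for token in docs[doc_id]:
            token_to_docs.setdefault(token, []).append(doc_id)
    return token_to_docs

def build_reachability_graph(docs, max_results=10):
    """Edge t -> t' whenever t retrieves a document that contains t'."""
    edges = {}
    for token, hits in build_querying_graph(docs).items():
        targets = set()
        for doc_id in hits[:max_results]:
            targets |= docs[doc_id]
        targets.discard(token)        # ignore the trivial self-loop
        edges[token] = targets
    return edges

def reachable_from(edges, start):
    """All tokens reachable from start by following reachability edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in edges[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

# Hypothetical documents, each given as the set of tokens it contains.
DOCS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}

edges = build_reachability_graph(DOCS)
reached = reachable_from(edges, "t1")
```

With these toy documents, t2, t3, and t4 are reachable from t1, while t5 is not.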
13
Model (cont.): Connectivity
Core: tokens that retrieve other tokens and
themselves
In: tokens that retrieve other tokens but are not
reachable
Out: reachable tokens that do not retrieve Core
tokens
14
Power-law Graphs

Conjecture: The degree distribution in the
reachability graph is described by a power law.
A power law is completely described by only two
parameters, a and b.
Power-law random graphs are expected to have at
most one giant connected component
(Core + In + Out). Other connected components are
small.
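
One way to check the conjecture on observed data is a least-squares fit on a log-log scale. The sketch below assumes the common power-law random graph parameterization log y = a - b log x, where y is the number of nodes with degree x; the slides name the parameters a and b but do not spell out this form, so treat the mapping as an assumption. The degree counts are synthetic.

```python
import math

def fit_power_law(degree_counts):
    """Least-squares fit of log(y) = a - b * log(x) to {degree x: node count y}."""
    points = [
        (math.log(x), math.log(y))
        for x, y in degree_counts.items()
        if x > 0 and y > 0
    ]
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, -slope   # (a, b)

# Synthetic counts that follow the assumed form exactly, with a = 8.0 and b = 2.5.
counts = {x: math.exp(8.0) / x ** 2.5 for x in range(1, 50)}
a, b = fit_power_law(counts)
```

On real degree data the points scatter around the fitted line, and a straight-line fit on a log-log plot is only a rough check that the distribution is power-law shaped.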

15
Model (cont.): Reachability
Giant Component CRG
[Figure: a seed token t1 from which tokens t2, t3, and t4 are reachable]

Reachability: relative size of the largest Core +
Out
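
The Core, In, and Out sets and the reachability metric can be computed directly on a small graph. The sketch below takes Core to be the largest strongly connected component, which is an interpretation of the slides rather than something they state. The five-node graph is hypothetical, and the quadratic component search is only meant for toy inputs.

```python
def reach(edges, sources):
    """Nodes reachable from any source (sources included)."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(edges):
    """Return (core, in_set, out_set, all_nodes) for a directed graph."""
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    reverse = {n: set() for n in nodes}
    for u, targets in edges.items():
        for v in targets:
            reverse[v].add(u)
    core, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # Strongly connected component of n: forward reach intersected with backward reach.
        component = reach(edges, [n]) & reach(reverse, [n])
        assigned |= component
        if len(component) > len(core):
            core = component
    out_set = reach(edges, core) - core
    in_set = reach(reverse, core) - core
    return core, in_set, out_set, nodes

# Hypothetical reachability graph.
EDGES = {
    "in1": {"c1"},
    "c1": {"c2"},
    "c2": {"c1", "out1"},
    "out1": set(),
    "iso": set(),
}

core, in_set, out_set, nodes = bow_tie(EDGES)
reachability = len(core | out_set) / len(nodes)
```

Here three of the five tokens lie in Core or Out, so reachability is 0.6: a seed token inside the Core can at best lead to 60% of the tokens.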

16
Outline

Task 1: Information Extraction
Task 2: Constructing Database Summary
Model
  Querying, reachability graphs
  Power-law graphs
  Reachability
Querying Real Databases
  Estimation
  Experimental Results
Discussion

17
Querying Real Databases

Task 1: NYT
  DiseaseOutbreaks(date, disease, location)
  The New York Times
  |D| = 137,000
  |T| = 8,859

  Date       DiseaseName      Location
  Jan. 1995  Malaria          Ethiopia
  June 1995  Ebola            Africa
  July 1995  Mad Cow Disease  The U.K.
  Feb. 1995  Pneumonia        The U.S.

Task 2: 20NG
  Postings from 20 Newsgroups
  |D| = 20,000
  |T| = 109,000

18
NYT Reachability Graph: Outdegree Distribution
Matches the power-law distribution
[Figure: outdegree distribution plots for MaxResults = 10 and MaxResults = 50]
19
NYT: Component Size Distribution
[Figure: component size distribution plots, with components marked as reachable or not reachable]
MaxResults = 10: |CG| / |T| = 0.297
MaxResults = 50: |CG| / |T| = 0.620
20
20NG: Outdegree Distribution
Approximated by a power-law distribution
[Figure: outdegree distribution plots for MaxResults = 1 and MaxResults = 10]
|CG| / |T| = 1 (completely connected)
21
Estimating Reachability

In a power-law random graph G, a giant component
CG emerges if d (the average outdegree) > 1,
and

Estimate: Reachability ≈ |CG| / |T|
Depends only on d (the average outdegree)

For b < 3.457
22
Estimating Reachability using Sampling
Choose S random seed tokens.
Query the database for each seed token.
Compute the outgoing edges of the nodes in the seed.
Estimate d as the average outdegree of the seed tokens.

[Figure: sampled tokens (T) and the documents (D) they retrieve; in the example, d = 1.5]
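
The sampling procedure above might look like this in Python. The toy documents and the `search` and `extract_tokens` callables are hypothetical, and the fixed random seed is there only to make the sketch repeatable.

```python
import random

def estimate_average_outdegree(tokens, search, extract_tokens,
                               sample_size, max_results=10, rng=None):
    """Estimate d, the average outdegree, from a random sample of seed tokens."""
    rng = rng or random.Random(0)
    seed = rng.sample(sorted(tokens), sample_size)    # choose S random seed tokens
    total = 0
    for t in seed:
        out_tokens = set()
        for doc in search(t)[:max_results]:            # query the database for t
            out_tokens.update(extract_tokens(doc))     # outgoing edges of t
        out_tokens.discard(t)                          # ignore the self-loop
        total += len(out_tokens)
    return total / len(seed)

# Hypothetical documents, each given as the set of tokens it contains.
DOCS = [{"t1", "t2"}, {"t2", "t3"}, {"t4"}]
TOKENS = {"t1", "t2", "t3", "t4"}

def search(t):
    """Toy search engine: documents containing token t."""
    return [doc for doc in DOCS if t in doc]

def extract_tokens(doc):
    """Toy extractor: the document already is its token set."""
    return doc

# Sampling every token makes the estimate exact for this toy graph.
d_hat = estimate_average_outdegree(TOKENS, search, extract_tokens, sample_size=4)
giant_component_expected = d_hat > 1
```

For this toy graph the outdegrees are 1, 2, 1, and 0, so d is 1.0 and the d > 1 condition for a giant component is not met. On a real database S would be much smaller than |T|, and the estimate would vary with the sample.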
23
Estimating Reachability for NYT
Approximate reachability is estimated after 50
queries. -> Can be used to predict success (or
failure) of a Task 1 algorithm.
24
Reachability of NYT (cont.)
[Figure: plot annotated with the value .46]
Reachability correctly predicts performance of
the Tuples strategy for Task 1 (described in
[Agichtein and Gravano, ICDE 2003])
25
Estimating Reachability of 20NG
Estimates reachability closely, after just 10
queries. Corroborates Callan's results [Callan et
al., SIGMOD 1999]
26
Summary

Presented a graph model for query-based access to
text databases
  Querying and reachability graphs
Formal tool for analyzing heuristic algorithms
  The reachability metric: predictions for
  algorithm performance
Efficient estimation techniques
  Power-law random graph properties
  Document sampling

27
Future Work

Other properties of the reachability graph
  Edge density
  Diameter
Real-life limitations
  Total number of queries? -> querying graph
  Total number of documents? -> querying graph
Analyze other (heuristic) algorithms.

28
Modeling Query-Based Access to Text Databases

Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science
Columbia University

Questions?
29
Overflow Slides
31
Information Extraction Example: Organizations'
Headquarters
Input Documents -> Named-Entity Tagging ->
Pattern Matching -> Output Tuples
32
Efficient Information Extraction: Alternatives
Given a large text database and an information
extraction task, how to proceed?

If a large fraction of documents are relevant:
  Scan (not always possible)
Else:
  Tuples?

[Figure: queries issued against a Text Database]
Will Tuples retrieve enough of the relation?
33
Search Over Hidden Web Databases

-> Metasearchers
Database Selection: Choosing the best databases
for a query
Database Selection Needs Content Summaries
Typically the vocabulary of each database plus
simple frequency statistics

PubMed (3,868,552 documents)
Word          Frequency
cancer        1,398,178
aids          106,512
heart         281,506
hepatitis     23,481
thrombopenia  24,826
34
Model

Is there a common model for algorithms for
Query-Based Information Extraction and Database
Summary Construction?
What are the limitations of these algorithms?
Given a new database, will such an algorithm
work?
