Modeling Query-Based Access to Text Databases

1
Modeling Query-Based Access to Text Databases
  • Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
  • Computer Science Department
  • Columbia University

2
Scalable Text Mining
  • Often only a tiny fraction of a text database is
    relevant, so processing every document is
    unnecessarily expensive.
  • Often relevant information is not crawlable, but
    available only via a search engine.

Search engines can help: efficiency and accessibility
3
Task 1: Extracting Structured Information
Buried in Text Documents

Microsoft's central headquarters in Redmond is
home to almost every product group and division.

Brent Barlow, 27, a software analyst and
beta-tester at Apple Computer's headquarters in
Cupertino, was fired Monday for "thinking a
little too different."

Apple's programmers "think different" on a
"campus" in Cupertino, Cal. Nike employees "just
do it" at what the company refers to as its
"World Campus," near Portland, Ore.

Organization    Location
Microsoft       Redmond
Apple Computer  Cupertino
Nike            Portland
4
Information Extraction Applications
  • Over a corporation's customer report or email
    complaint database: enabling sophisticated
    querying and analysis
  • Over biomedical literature: identifying
    drug/condition interactions
  • Over newspaper archives: tracking disease
    outbreaks, terrorist attacks; intelligence

Significant progress over the last decade (MUC)
5
Goal: Extract All Tuples of a Relation from a
Document Database
[Figure: Document Database, Information Extraction System, Extracted Tuples]
  • One approach: feed every document to the
    information extraction system
  • Problem: efficiency, accessibility!

6
A Query-Based Strategy for Information Extraction
Intuition: Documents with one tuple for the
relation are also likely to contain other tuples.
  • 0. While seed has an unprocessed tuple t:
  • 1. Retrieve up to MaxResults documents matching t
  • 2. Extract new tuples te from these documents
  • 3. Augment seed with te

[Figure: seed tuples t0, t1, t2]
Problem: May run out of tuples (and queries) → incomplete relation!
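
The loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `search` and `extract` are hypothetical callables standing in for the database's search interface and the information extraction system, and the toy documents are invented for the example.

```python
from collections import deque

def query_based_extraction(seed, search, extract, max_results=10):
    """Use known tuples as queries and harvest new tuples from the results."""
    seen = set(seed)            # every tuple discovered so far
    pending = deque(seed)       # tuples not yet used as queries
    processed_docs = set()      # avoid re-running extraction on a document
    while pending:                                     # step 0
        t = pending.popleft()
        for doc_id, text in search(t)[:max_results]:   # step 1
            if doc_id in processed_docs:
                continue
            processed_docs.add(doc_id)
            for te in extract(text):                   # step 2
                if te not in seen:                     # step 3
                    seen.add(te)
                    pending.append(te)
    return seen

# Toy stand-ins (assumptions for illustration only).
DOCS = {
    "d1": "microsoft redmond | apple cupertino",
    "d2": "apple cupertino | nike portland",
    "d3": "ibm armonk",
}

def toy_search(t):
    org, loc = t
    return [(d, s) for d, s in sorted(DOCS.items()) if org in s and loc in s]

def toy_extract(text):
    return [tuple(part.split()) for part in text.split(" | ")]

result = query_based_extraction([("microsoft", "redmond")], toy_search, toy_extract)
```

In this toy run the tuple in `d3` is never found, because no tuple reachable from the seed retrieves that document: the "incomplete relation" problem named above.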
7
Hidden Web Databases
  • Surface Web
    • Link structure
    • Crawlable
    • Documents indexed by search engines
  • Hidden Web
    • No link structure
    • Documents hidden in databases
    • Documents not indexed by search engines
    • Need to query each collection individually

8
Search Over the Hidden Web
  • Task 2: Database Content Summary Construction
  • Typically the vocabulary of each
    database plus simple frequency statistics

Database selection relies on simple content
summaries: vocabulary, word frequencies

PubMed (3,868,552 documents)
Word          Frequency
cancer        1,398,178
aids          106,512
heart         281,506
hepatitis     23,481
thrombopenia  24,826

Metasearcher query: thrombopenia
Database          thrombopenia
PubMed            24,826
NYTimes Archives  18
US Patents        0

Problem: Databases don't export content summaries!
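
To make the role of content summaries concrete, here is a minimal sketch of database selection in Python. The scoring rule (rank by the smallest frequency of any query word) is an assumption chosen for simplicity, not a method named on the slide; the frequencies are the ones shown above.

```python
def select_databases(query_words, summaries, k=2):
    """Rank databases by how well their content summary covers the query."""
    scores = {}
    for name, summary in summaries.items():
        # Assumed scoring rule: the rarest query word bounds the score,
        # so a database missing any query word scores 0.
        scores[name] = min(summary.get(w, 0) for w in query_words)
    ranked = sorted(scores, key=lambda n: scores[n], reverse=True)
    return [n for n in ranked if scores[n] > 0][:k]

SUMMARIES = {
    "PubMed": {"cancer": 1398178, "aids": 106512, "heart": 281506,
               "hepatitis": 23481, "thrombopenia": 24826},
    "NYTimes Archives": {"thrombopenia": 18},
    "US Patents": {"thrombopenia": 0},
}

selected = select_databases(["thrombopenia"], SUMMARIES)
```

With these summaries a metasearcher would forward the query to PubMed first and skip US Patents entirely, which is only possible if the summaries are available.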
9
A Query-Based Strategy for Constructing Database
Summaries
  • 0. While seed has an unprocessed word t:
  • 1. Retrieve up to MaxResults documents matching t
  • 2. Extract new words te from these documents
  • 3. Augment seed with te

[Figure: seed words t0, t1, t2]
Problem: May run out of words (and queries) → incomplete summary!
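
A hedged sketch of the same loop applied to summary construction, in Python. Here the output is a table of document frequencies over the sampled documents. The `search` callable, the `max_queries` budget, and the toy collection are all assumptions made for the example.

```python
from collections import Counter, deque

def build_summary(seed_words, search, max_results=10, max_queries=100):
    """Sample documents through queries and count document frequencies."""
    doc_freq = Counter()        # word -> number of sampled documents containing it
    seen_words = set(seed_words)
    pending = deque(seed_words)
    sampled = set()
    queries = 0
    while pending and queries < max_queries:           # step 0 (plus a budget)
        word = pending.popleft()
        queries += 1
        for doc_id, text in search(word)[:max_results]:  # step 1
            if doc_id in sampled:
                continue
            sampled.add(doc_id)
            words = set(text.split())
            doc_freq.update(words)
            for w in sorted(words - seen_words):       # steps 2 and 3
                seen_words.add(w)
                pending.append(w)
    return doc_freq, len(sampled)

# Toy collection (hypothetical).
DOCS = {"d1": "sigmod database query", "d2": "query web", "d3": "webdb workshop"}

def toy_search(word):
    return [(d, t) for d, t in sorted(DOCS.items()) if word in t.split()]

summary, n_docs = build_summary(["sigmod"], toy_search)
```

Starting from "sigmod", the sketch samples `d1` and `d2` but never reaches `d3`, so "webdb" is missing from the summary.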
10
Query-Based Information Extraction and Database
Summary Construction
[Figure: two examples starting from a seed, one labeled connected and one labeled disconnected]
11
Model: Querying Graph
  • Tokens T
    • Task 1: tuple attributes, e.g.
      microsoft AND redmond, acm AND new york
    • Task 2: words, e.g. sigmod, webdb
  • Tokens (as queries) retrieve documents in D
  • Documents contain tokens

[Figure: querying graph linking tokens T (t1 to t5) and documents D (d1 to d5)]
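
One possible in-memory representation of the querying graph, sketched in Python. It assumes an exact-match search interface that returns matching documents in id order and truncates at MaxResults; the function name and the toy documents are hypothetical.

```python
def build_querying_graph(doc_tokens, max_results=10):
    """Build both edge sets of the querying graph.

    doc_tokens maps a document id to the set of tokens it contains.
    """
    # Document -> token edges: "documents contain tokens".
    contains = {d: set(ts) for d, ts in doc_tokens.items()}
    # Token -> document edges: "tokens (as queries) retrieve documents".
    retrieves = {}
    for d in sorted(contains):
        for t in contains[d]:
            hits = retrieves.setdefault(t, [])
            if len(hits) < max_results:   # the search interface truncates results
                hits.append(d)
    return retrieves, contains

# Hypothetical toy collection.
DOCS = {"d1": {"t1", "t2"}, "d2": {"t1", "t3"}, "d3": {"t3", "t4"}}
retrieves, contains = build_querying_graph(DOCS)
```

Lowering `max_results` removes token-to-document edges, which is why MaxResults appears as a parameter in the experiments later in the deck.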
12
Model: Reachability Graph
[Figure: querying graph over tokens T (t1 to t5) and documents D (d1 to d5), with the corresponding reachability graph over the tokens]
t1 retrieves document d1 that contains t2
t2, t3, and t4 reachable from t1
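
A small Python sketch of how a reachability graph could be derived from a querying graph: token t points to token t' when t retrieves some document that contains t'. The edge sets below are hypothetical and chosen only so that they agree with the two statements on this slide; dropping self-edges is a simplifying assumption.

```python
from collections import deque

def reachability_graph(retrieves, contains):
    """Token -> set of tokens found in the documents it retrieves."""
    graph = {}
    for t, docs in retrieves.items():
        targets = set()
        for d in docs:
            targets |= contains[d]
        targets.discard(t)      # assumption: ignore trivial self-edges
        graph[t] = targets
    return graph

def reachable_from(graph, start):
    """Breadth-first search over the reachability graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        t = queue.popleft()
        for nxt in graph.get(t, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

# Hypothetical edges consistent with the slide's two statements.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3", "t4"},
            "d4": {"t4"}, "d5": {"t5"}}

graph = reachability_graph(RETRIEVES, CONTAINS)
from_t1 = reachable_from(graph, "t1")
```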
13
Model (cont.): Connectivity
  • Out: reachable tokens that do not retrieve Core tokens
  • Core: tokens that retrieve other tokens and themselves
  • In: tokens that retrieve other tokens but are not
    reachable
14
Power-law Graphs
  • Conjecture: Degree distribution in the
    reachability graph is described by a power law
  • Completely described by only two parameters, a
    and b.
  • Power-law random graphs are expected to have at
    most one giant connected component
    (Core + In + Out). Other connected components are
    small.
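
As an illustration of what the two parameters could mean, the Python sketch below assumes the number of tokens with outdegree k follows log(count) = a - b log(k), and fits a and b by least squares on the log-log histogram. Both the functional form and the fitting method are assumptions for the example; the slide does not say how the parameters were obtained, and a log-log regression is only a rough estimator.

```python
import math
from collections import Counter

def fit_power_law(degrees):
    """Fit log(count) = a - b * log(k) to a list of node degrees."""
    counts = Counter(d for d in degrees if d > 0)   # degree -> number of nodes
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    return a, -slope            # b is the negated slope

# Synthetic degrees with count(k) = 1000 / k**2 (illustrative data only).
degrees = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
a, b = fit_power_law(degrees)
```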

15
Model (cont.): Reachability
[Figure: giant component CRG, with a seed and reachable tokens t1 to t4]
  • Reachability: relative size of the largest Core + Out
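
The metric can be computed exactly on a small graph. The Python sketch below takes the Core to be the largest strongly connected component and the Out set to be everything else reachable from it; that reading of Core and Out, the quadratic-time component search, and the toy graph are assumptions made for illustration.

```python
def _reach(graph, start):
    """All nodes reachable from start (including start)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reachability(graph):
    """Return (core, out, |core + out| / |T|) for a token graph."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    if not nodes:
        raise ValueError("empty graph")
    reverse = {t: set() for t in nodes}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].add(src)
    core, assigned = set(), set()
    for v in sorted(nodes):
        if v in assigned:
            continue
        # Strongly connected component of v: forward reach AND backward reach.
        component = _reach(graph, v) & _reach(reverse, v)
        assigned |= component
        if len(component) > len(core):
            core = component
    core_and_out = _reach(graph, next(iter(core)))
    return core, core_and_out - core, len(core_and_out) / len(nodes)

# Hypothetical reachability graph over five tokens.
GRAPH = {"t1": {"t2"}, "t2": {"t3"}, "t3": {"t2", "t4"}, "t4": set(), "t5": {"t1"}}
core, out, ratio = reachability(GRAPH)
```

In this toy graph t2 and t3 form the Core, t4 is in Out, and t1 and t5 can reach the Core but are not reachable from it.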

16
Outline
  • Task 1: Information Extraction
  • Task 2: Constructing Database Summary
  • Model
    • Querying, reachability graphs
    • Power-law graphs
    • Reachability
  • Querying Real Databases
  • Estimation
  • Experimental Results
  • Discussion

17
Querying Real Databases
  • Task 1: NYT
    • DiseaseOutbreaks (date, disease, location)
    • The New York Times
    • |D| = 137,000
    • |T| = 8,859
  • Task 2: 20NG
    • Postings from 20 Newsgroups
    • |D| = 20,000
    • |T| = 109,000

Date       DiseaseName      Location
Jan. 1995  Malaria          Ethiopia
June 1995  Ebola            Africa
July 1995  Mad Cow Disease  The U.K.
Feb. 1995  Pneumonia        The U.S.

18
NYT Reachability Graph: Outdegree Distribution
Matches the power-law distribution
[Figure: outdegree distribution plots for MaxResults = 10 and MaxResults = 50]
19
NYT: Component Size Distribution
[Figure: component size distribution plots, with components marked reachable or not reachable]
MaxResults = 10: |CG| / |T| = 0.297
MaxResults = 50: |CG| / |T| = 0.620
20
20NG: Outdegree Distribution
Approximated by power-law distribution
[Figure: outdegree distribution plots for MaxResults = 1 and MaxResults = 10]
|CG| / |T| = 1 (completely connected)
21
Estimating Reachability
  • In a power-law random graph G, a giant component
    CG emerges if d (the average outdegree) > 1,
    and
  • Estimate Reachability as |CG| / |T|
  • Depends only on d (average outdegree)

For b < 3.457
22
Estimating Reachability using Sampling
  1. Choose S random seed tokens
  2. Query the database for each seed token
  3. Compute the outgoing edges of nodes in seed.
  4. Estimate d as average outdegree of seed tokens.

[Figure: sampled portion of the querying graph over tokens T (t1 to t5) and documents D (d1 to d5), giving d = 1.5]
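
The four sampling steps can be sketched in Python as below. `search` and `tokens_in` are hypothetical callables for the search interface and for listing a document's tokens, self-edges are ignored by assumption, and the toy data is invented.

```python
import random

def estimate_average_outdegree(tokens, search, tokens_in, sample_size,
                               max_results=10, rng=None):
    """Estimate d, the average outdegree, from a random sample of tokens."""
    rng = rng or random.Random()
    seed = rng.sample(sorted(tokens), min(sample_size, len(tokens)))  # step 1
    if not seed:
        raise ValueError("no tokens to sample")
    total = 0
    for t in seed:
        out = set()
        for d in search(t)[:max_results]:     # step 2: query the database
            out |= set(tokens_in(d))          # step 3: outgoing edges of t
        out.discard(t)                        # assumption: ignore self-edges
        total += len(out)
    return total / len(seed)                  # step 4: average outdegree

# Hypothetical toy database.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t2", "t3", "t4"},
            "d4": {"t4"}, "d5": {"t1", "t5"}}

d_hat = estimate_average_outdegree(
    RETRIEVES, lambda t: RETRIEVES[t], lambda d: CONTAINS[d],
    sample_size=5, rng=random.Random(0))
```

Here the sample covers every token, so the estimate equals the true average; with a smaller `sample_size` the result varies with the random sample.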
23
Estimating Reachability for NYT
Approximate reachability is estimated after 50
queries. → Can be used to predict success (or
failure) of a Task 1 algorithm.
24
Reachability of NYT (cont.)
.46
Reachability correctly predicts performance of
the Tuples strategy for Task 1 (described in
Agichtein and Gravano, ICDE 2003)
25
Estimating Reachability of 20NG
Estimates reachability closely, after just 10
queries. Corroborates Callan's results [Callan et
al., SIGMOD 1999]
26
Summary
  • Presented graph model for query-based access to
    text databases
    • Querying and Reachability graphs
  • Formal tool for analyzing heuristic algorithms
    • The reachability metric: predictions for
      algorithm performance
  • Efficient estimation techniques
    • Power-law random graph properties, document
      sampling

27
Future Work
  • Other properties of the reachability graph
    • Edge Density
    • Diameter
  • Real-life limitations
    • Total number of queries? → querying graph
    • Total number of documents? → querying graph
  • Analyze other (heuristic) algorithms.

28
Modeling Query-Based Access to Text Databases
  • Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
  • Computer Science
  • Columbia University

Questions?
29
Overflow Slides
31
Information Extraction Example: Organizations'
Headquarters
Input Documents → Named-Entity Tagging → Pattern Matching → Output Tuples
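
A toy Python sketch of the pattern-matching stage. A single hand-written regular expression stands in for both named-entity tagging and pattern matching, which is a strong simplification; real extraction systems use dedicated taggers and many patterns. The pattern and function name are hypothetical, and the sentences are adapted from the Task 1 example earlier in the deck.

```python
import re

# Hypothetical pattern: "<Organization>'s [central] headquarters in <Location>".
# Capitalized word sequences act as a crude substitute for named-entity tags.
HQ_PATTERN = re.compile(
    r"(?P<org>[A-Z]\w+(?: [A-Z]\w+)*)'s (?:central )?headquarters in (?P<loc>[A-Z]\w+)"
)

def extract_headquarters(documents):
    """Return (organization, location) tuples matched in the input documents."""
    tuples = []
    for text in documents:
        for m in HQ_PATTERN.finditer(text):
            tuples.append((m.group("org"), m.group("loc")))
    return tuples

documents = [
    "Microsoft's central headquarters in Redmond is home to almost every "
    "product group and division.",
    "Brent Barlow, 27, a software analyst and beta-tester at Apple Computer's "
    "headquarters in Cupertino, was fired Monday.",
]
tuples = extract_headquarters(documents)
```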
32
Efficient Information Extraction Alternatives
Given a large text database and an information
extraction task, how to proceed?
  • If a large fraction of documents are relevant:
    • Scan (not always possible)
  • Else:
    • Tuples?

[Figure: Text Database]
Will Tuples retrieve enough of the relation?
33
Search Over Hidden Web Databases
  • Metasearchers
  • Database Selection: Choosing best databases for
    a query
  • Database Selection Needs Content Summaries
  • Typically the vocabulary of each database plus
    simple frequency statistics

PubMed (3,868,552 documents)
Word          Frequency
cancer        1,398,178
aids          106,512
heart         281,506
hepatitis     23,481
thrombopenia  24,826
34
Model
  • Is there a common model for algorithms for
    Query-Based Information Extraction and Database
    Summary Construction?
  • What are the limitations of these algorithms?
  • Given a new database, will such an algorithm
    work?