Title: When one Sample is not Enough: Improving Text Database Selection using Shrinkage
1. When one Sample is not Enough: Improving Text Database Selection using Shrinkage
- Panos Ipeirotis
- Luis Gravano
Computer Science Department, Columbia University
2. Regular Web Pages and Text Databases
- Regular Web pages
- Link structure
- Crawlable
- Documents indexed by search engines
- Text Databases (a.k.a. Hidden Web, Deep Web)
- Usually no link structure
- Documents hidden in databases
- Documents not indexed by search engines
- Need to query each collection individually
3. Text Database Examples
Search on the U.S. Patent and Trademark Office (USPTO) database for "wireless network" → 26,012 matches (the USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html)
Search on Google restricted to the USPTO site, "wireless network site:patft.uspto.gov" → 0 matches

| Database            | Query            | Database Matches | Site-Restricted Google Matches |
|---------------------|------------------|------------------|--------------------------------|
| USPTO               | wireless network | 26,012           | 0                              |
| Library of Congress | visa regulations | >10,000          | 0                              |
| PubMed              | thrombopenia     | 27,960           | 172                            |

(as of June 10th, 2004)
4. Metasearchers Provide Access to Distributed Text Databases
Database selection relies on simple content summaries: the vocabulary and word frequencies of each database.
[Figure: a metasearcher routes the query "thrombopenia" to distributed text databases such as PubMed, the NYTimes Archives, and USPTO.]
Content summary for PubMed (11,868,552 documents): aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 27,960
5. Extracting Content Summaries from Autonomous Text Databases
1. Send queries to the database
2. Retrieve the top matching documents
3. If a stopping criterion is met (e.g., sample > 300 docs), exit; else go to Step 1
The content summary contains the words in the sample and the document frequency of each word.
Problem: summaries derived from small samples are highly incomplete.
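The sampling loop above can be sketched as follows. This is an illustrative sketch only: `search` stands in for a hypothetical per-database query interface returning (doc_id, text) pairs, and the parameter values are examples, not the paper's settings.

```python
import random

def sample_database(search, seed_words, max_sample=300, top_k=4, max_queries=500):
    """Query-based sampling sketch: send single-word queries, keep the top
    matches, and stop once the sample is large enough (or the budget runs out)."""
    sample = {}                                   # doc_id -> document text
    pool = list(seed_words)                       # words usable as queries
    for _ in range(max_queries):
        if len(sample) >= max_sample or not pool:
            break
        word = random.choice(pool)                # Step 1: send a query
        for doc_id, text in search(word, top_k):  # Step 2: top matching docs
            sample[doc_id] = text
        # grow the query pool with the words seen so far
        pool = list({w for t in sample.values() for w in t.lower().split()}) or pool
    # content summary: words in the sample plus their document frequencies
    summary = {}
    for text in sample.values():
        for w in set(text.lower().split()):
            summary[w] = summary.get(w, 0) + 1
    return summary, len(sample)
```

The stopping criterion here is the sample-size threshold from Step 3; real systems also bound the number of queries issued, which the `max_queries` guard mimics.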
6. Problem: Summaries Derived from Small Samples Are Fundamentally Incomplete
[Figure: log(frequency) vs. rank for the words in the PubMed database, with the 300-document sample marked. The 10 most frequent words each appear in up to ~10^7 documents, while a word like "endocarditis" appears in only 9,000 docs (0.1% of the database) and is easily missed by the sample.]
- Many words appear in relatively few documents (Zipf's law)
- Low-frequency words are often important
- Small document samples miss many low-frequency words
7. Improving Sample-based Content Summaries
Challenge: improve content summary quality without increasing the sample size.
- Main idea: database classification helps
- Similar topics → similar content summaries
- Extracted content summaries complement each other
- Classification is available from directories (e.g., the Open Directory) or can be derived automatically (e.g., with QProber)
8. Databases with Similar Topics
- CancerLit contains "metastasis," which was not found during sampling
- CancerBacup contains "metastasis"
- Databases under the same category have similar vocabularies and can complement each other
9. Content Summaries for Categories
- Databases under the same category share a similar vocabulary
- Higher-level category content summaries provide additional useful estimates of word probabilities
- We can use all the estimates in the category path
10. Enhancing Summaries Using Shrinkage
- Word-probability estimates from database content summaries can be unreliable
- Category content summaries are more reliable (they are based on larger samples) but less specific to a single database
- By combining estimates from the category and database content summaries, we get better estimates
11. Shrinkage-based Estimations
Adjust the probability estimates:
Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000
Select the λi weights to maximize the probability that the summary of D is from a database under all of its parent categories.
12. Computing Shrinkage-based Summaries
Category path: Root → Health → Cancer → D
Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000
Pr[treatment | D] = λ1 · 0.015 + λ2 · 0.12 + λ3 · 0.179 + λ4 · 0.184
- Automatic computation of the λi weights using an EM algorithm
- Computation is performed offline → no query overhead
- Avoids the sparse-data problem and decreases estimation risk
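One concrete way to fit the λi weights with EM is the standard mixture-weight update: the E-step assigns each observed word occurrence a responsibility per level, and the M-step renormalizes those responsibilities into new weights. This is a minimal sketch assuming each level's summary is a word-probability model; it is not the paper's exact procedure.

```python
def em_shrinkage_weights(word_counts, level_models, iters=50):
    """Fit shrinkage weights lambda_i by EM so that the mixture
    sum_i lambda_i * Pr_i[w] best explains the observed word counts.
    level_models: dicts mapping word -> probability, ordered from the
    root category down to the database's own sample summary."""
    m = len(level_models)
    lam = [1.0 / m] * m                        # start from uniform weights
    for _ in range(iters):
        resp = [0.0] * m
        for w, c in word_counts.items():
            joint = [lam[i] * level_models[i].get(w, 0.0) for i in range(m)]
            z = sum(joint)
            if z == 0.0:
                continue                       # word unseen at every level
            for i in range(m):                 # E-step: responsibilities
                resp[i] += c * joint[i] / z
        total = sum(resp) or 1.0
        lam = [r / total for r in resp]        # M-step: renormalize
    return lam
```

Because the fit runs over the already-extracted sample and the fixed category summaries, it needs no further database queries, matching the offline computation noted above.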
13. Shrinkage Weights and Summary
Learned weights for CancerLit: λroot = 0.02, λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65

| Word       | Shrinkage-based (new) | Root | Health | Cancer | CancerLit (old, unshrunk) |
|------------|-----------------------|------|--------|--------|---------------------------|
| metastasis | 2.5                   | 0.2  | 5      | 9.2    | 0                         |
| aids       | 14.3                  | 0.8  | 7      | 2      | 20                        |
| football   | 0.17                  | 2    | 1      | 0      | 0                         |

Shrinkage:
- Increases the estimates for underestimated words (e.g., metastasis)
- Decreases the word-probability estimates for overestimated words (e.g., aids)
- But also introduces (with small probabilities) spurious words (e.g., football)
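The shrunk estimate for each word is simply the λ-weighted average of the per-level estimates; plugging the CancerLit weights and per-level values into the formula reproduces the shrinkage-based numbers above (all values in %):

```python
def shrunk_estimate(lambdas, estimates):
    """Combine per-level word-probability estimates using shrinkage weights."""
    return sum(l * e for l, e in zip(lambdas, estimates))

# weights for Root, Health, Cancer, and the CancerLit sample itself
lambdas = [0.02, 0.13, 0.20, 0.65]

# per-level estimates: Root, Health, Cancer, unshrunk CancerLit
print(round(shrunk_estimate(lambdas, [0.2, 5, 9.2, 0]), 2))  # metastasis: ~2.5
print(round(shrunk_estimate(lambdas, [0.8, 7, 2, 20]), 2))   # aids: ~14.3
print(round(shrunk_estimate(lambdas, [2, 1, 0, 0]), 2))      # football: 0.17
```

Note how "football" receives a small but nonzero estimate purely from the Root and Health summaries, which is exactly how spurious words enter the shrunk summary.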
14. Is Shrinkage Always Necessary?
- Shrinkage is used to reduce the uncertainty (variance) of the estimates
- Small samples of large databases → high variance
  - In the sample: 10 out of 100 documents contain "metastasis"
  - In the database: how many out of 10,000,000 documents?
- Small samples of small databases → small variance
  - In the sample: 10 out of 100 documents contain "metastasis"
  - In the database: how many out of 200 documents?
- Shrinkage is less useful (or even harmful) when uncertainty is low
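One simple way to quantify this intuition is the standard error of a proportion estimated from a sample drawn without replacement, which includes a finite population correction. This is an illustrative sketch, not the paper's own uncertainty analysis.

```python
import math

def stderr_fraction(p_hat, n, N):
    """Standard error of the estimated fraction of documents containing a
    word, from a sample of n docs drawn without replacement out of N
    (binomial term times the finite population correction)."""
    return math.sqrt(p_hat * (1 - p_hat) / n) * math.sqrt((N - n) / (N - 1))

p_hat, n = 10 / 100, 100       # 10 of 100 sampled docs contain "metastasis"
for N in (10_000_000, 200):
    se = stderr_fraction(p_hat, n, N)
    # se * N: uncertainty in the number of matching docs in the full database
    print(N, round(se, 4), round(se * N))
```

For the 10,000,000-document database the uncertainty translates to hundreds of thousands of documents, while for the 200-document database (where the sample covers half the collection) it is only a handful, so shrinkage has little to correct.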
15. Adaptive Application of Shrinkage
- Database selection algorithms assign a score to each database for each query
- When the word-frequency estimates are uncertain, the assigned score has high variance
  - Shrinkage improves the score estimates
- When the word-frequency estimates are reliable, the assigned score has small variance
  - Shrinkage is unnecessary
[Figure: two probability distributions over the database score for a query (0 to 1): a wide distribution (unreliable score estimate → use shrinkage) and a narrow one (reliable score estimate → shrinkage might hurt).]
Solution: apply shrinkage adaptively, in a query- and database-specific manner.
16. Searching Algorithm
One-time process:
1. Extract document samples
2. Get the database classification
3. Compute the shrinkage-based summaries
For every query Q:
- For each database D:
  - Use a regular database selection algorithm to compute the query score for D using the old, unshrunk summary
  - Analyze the uncertainty of the score
  - If the uncertainty is high, use the new, shrinkage-based summary instead and compute a new query score for D
- Evaluate Q over the top-k scoring databases
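The per-query loop above might be sketched as follows. Both the toy language-model scoring and the uncertainty proxy (the fraction of query words absent from the unshrunk sample summary) are illustrative stand-ins for whichever selection algorithm and score-variance analysis are actually plugged in.

```python
import math

def lm_score(query_words, summary, num_docs, mu=1.0):
    """Toy smoothed language-model score: sum of log word probabilities,
    where summary maps word -> document frequency in the sample."""
    return sum(math.log((summary.get(w, 0) + mu) / (num_docs + 2 * mu))
               for w in query_words)

def select_databases(query, databases, k=2):
    """databases: name -> (unshrunk_summary, shrunk_summary, num_docs).
    Re-score a database with its shrinkage-based summary only when the
    unshrunk summary looks unreliable for this particular query."""
    words = query.lower().split()
    scored = []
    for name, (plain, shrunk, n) in databases.items():
        score = lm_score(words, plain, n)
        # uncertainty proxy: share of query words absent from the sample
        missing = sum(w not in plain for w in words) / len(words)
        if missing > 0:                     # high uncertainty: use shrinkage
            score = lm_score(words, shrunk, n)
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

The key property is that the shrinkage-based summary is consulted only per query and per database, so reliable unshrunk scores are left untouched.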
17. Evaluation Goals
- Examine the quality of shrinkage-based summaries
- Examine the effect of shrinkage on database selection

| Word       | CancerLit (correct) | CancerLit (shrinkage-based) | CancerLit (unshrunk) |
|------------|---------------------|-----------------------------|----------------------|
| metastasis | 12                  | 2.5                         | 0                    |
| aids       | 8                   | 14.3                        | 20                   |
| football   | 0                   | 0.17                        | 0                    |
| regression | 1                   | 0                           | 0                    |
18. Experimental Setup
- Three data sets:
  - Two standard testbeds from TREC (the Text Retrieval Conference):
    - 200 databases
    - 100 queries with associated human-assigned document relevance judgments
  - 315 real Web databases
- Two sets of experiments:
  - Content summary quality
  - Database selection accuracy
19. Results: Content Summary Quality
- Recall: how many of the words in the database are also in the summary?
  - Shrinkage-based summaries include 10%–90% more words than unshrunk summaries
- Precision: how many of the words in the summary are also in the database?
  - In shrinkage-based summaries, 5%–15% of the words do not appear in the actual database
20. Results: Content Summary Quality
- Rank correlation: is the word ranking in the summary similar to the ranking in the database?
  - Shrinkage-based summaries demonstrate better word rankings than unshrunk summaries
- Kullback–Leibler divergence: is the probability distribution in the summary similar to the distribution in the database?
  - Shrinkage improves the bad cases, but makes the very good ones worse
  - → This motivates the adaptive application of shrinkage!
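For concreteness, the two distribution-oriented metrics can be computed roughly as follows. This is a sketch: the smoothing floor in the KL computation and the restriction to words shared by both rankings are assumptions, not the paper's exact definitions.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between word distributions; q is floored at eps so that
    words missing from the summary do not yield infinite divergence."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation over the words present in both rankings
    (each ranking is a list of words, most frequent first)."""
    common = [w for w in ranking_a if w in set(ranking_b)]
    n = len(common)
    if n < 2:
        return 0.0
    rank_a = {w: i for i, w in enumerate(common)}
    rank_b = {w: i for i, w in enumerate(sorted(common, key=ranking_b.index))}
    d2 = sum((rank_a[w] - rank_b[w]) ** 2 for w in common)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A lower KL divergence and a rank correlation closer to 1 both indicate that the summary tracks the true database distribution more closely.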
21. Results: Database Selection
- Metric: R(K) = X / Y
  - X = number of relevant documents in the selected K databases
  - Y = number of relevant documents in the best K databases
Results are for CORI (a state-of-the-art database selection algorithm) with stemming, over one TREC testbed.
22. Other Experiments
- Choice of database selection algorithm (CORI, bGlOSS, Language Modeling)
- Comparison with the VLDB'02 hierarchical database selection algorithm
- Universal vs. adaptive application of shrinkage
- Effect of stemming
- Effect of stop-word elimination
23. Conclusions
- Developed a strategy to automatically summarize the contents of hidden-web text databases
  - Content summaries are critical for efficient metasearching
  - The strategy assumes no cooperation from the databases
- Shrinkage improves content summary quality by exploiting topical similarity
  - Shrinkage is efficient: no increase in the document sample size is required
- Developed an adaptive database selection strategy that decides whether to apply shrinkage in a database- and query-specific way
24. Thank You!
Shrinkage-based content summary generation is implemented and available for download at http://sdarts.cs.columbia.edu
Questions?