Title: When one Sample is not Enough: Improving Text Database Selection using Shrinkage
1. When one Sample is not Enough: Improving Text Database Selection using Shrinkage
- Panos Ipeirotis
- Luis Gravano
Computer Science Department, Columbia University
2. Regular Web Pages and Text Databases
- Regular Web pages
- Link structure
- Crawlable
- Documents indexed by search engines
- Text Databases (a.k.a. Hidden Web, Deep Web)
- Usually no link structure
- Documents hidden in databases
- Documents not indexed by search engines
- Need to query each collection individually
3. Text Database Examples
Search on the U.S. Patent and Trademark Office (USPTO) database for "wireless network" → 26,012 matches (the USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html)
Search on Google restricted to the USPTO site, "wireless network site:patft.uspto.gov" → 0 matches

| Database            | Query            | Database Matches | Site-Restricted Google Matches |
|---------------------|------------------|------------------|--------------------------------|
| USPTO               | wireless network | 26,012           | 0                              |
| Library of Congress | visa regulations | >10,000          | 0                              |
| PubMed              | thrombopenia     | 27,960           | 172                            |

(as of June 10th, 2004)
4. Metasearchers Provide Access to Distributed Text Databases
Database selection relies on simple content summaries: the vocabulary and word frequencies of each database.
[Figure: a metasearcher routes the query "thrombopenia" to distributed text databases such as PubMed, the NYTimes Archives, and USPTO.]
Content summary for PubMed (11,868,552 documents): aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 27,960
5. Extracting Content Summaries from Autonomous Text Databases
1. Send queries to the database
2. Retrieve the top matching documents
3. If a stopping criterion is met (e.g., sample > 300 docs), exit; else go to Step 1
The content summary contains the words in the sample and the document frequency of each word.
Problem: summaries derived from small samples are highly incomplete.
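The sampling loop above can be sketched as follows. This is an illustrative sketch only: `search` stands in for a hypothetical per-database query interface returning (doc_id, text) pairs, and the parameter values are examples, not the paper's settings.

```python
import random

def sample_database(search, seed_words, max_sample=300, top_k=4, max_queries=500):
    """Query-based sampling sketch: send single-word queries, keep the top
    matches, and stop once the sample is large enough (or the budget runs out)."""
    sample = {}                                   # doc_id -> document text
    pool = list(seed_words)                       # words usable as queries
    for _ in range(max_queries):
        if len(sample) >= max_sample or not pool:
            break
        word = random.choice(pool)                # Step 1: send a query
        for doc_id, text in search(word, top_k):  # Step 2: top matching docs
            sample[doc_id] = text
        # grow the query pool with the words seen so far
        pool = list({w for t in sample.values() for w in t.lower().split()}) or pool
    # content summary: words in the sample plus their document frequencies
    summary = {}
    for text in sample.values():
        for w in set(text.lower().split()):
            summary[w] = summary.get(w, 0) + 1
    return summary, len(sample)
```

The stopping criterion here is the sample-size threshold from Step 3; real systems also bound the number of queries issued, which the `max_queries` guard mimics.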
6. Problem: Summaries Derived from Small Samples Are Fundamentally Incomplete
[Figure: log(frequency) vs. rank for the words in the PubMed database, with the 300-document sample marked. The 10 most frequent words each appear in up to ~10^7 documents, while a word like "endocarditis" appears in only 9,000 docs (0.1% of the database) and is easily missed by the sample.]
- Many words appear in relatively few documents (Zipf's law)
- Low-frequency words are often important
- Small document samples miss many low-frequency words
7. Improving Sample-based Content Summaries
Challenge: improve content summary quality without increasing the sample size.
- Main idea: database classification helps
- Similar topics → similar content summaries
- Extracted content summaries complement each other
- Classification is available from directories (e.g., the Open Directory) or can be derived automatically (e.g., with QProber)
8. Databases with Similar Topics
- CancerLit contains "metastasis," which was not found during sampling
- CancerBacup contains "metastasis"
- Databases under the same category have similar vocabularies and can complement each other
9. Content Summaries for Categories
- Databases under the same category share a similar vocabulary
- Higher-level category content summaries provide additional useful estimates of word probabilities
- We can use all the estimates in the category path
10. Enhancing Summaries Using Shrinkage
- Word-probability estimates from database content summaries can be unreliable
- Category content summaries are more reliable (they are based on larger samples) but less specific to a single database
- By combining estimates from the category and database content summaries, we get better estimates
11. Shrinkage-based Estimations
Adjust the probability estimates:
Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000
Select the λi weights to maximize the probability that the summary of D is from a database under all of its parent categories.
12. Computing Shrinkage-based Summaries
Category path: Root → Health → Cancer → D
Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000
Pr[treatment | D] = λ1 · 0.015 + λ2 · 0.12 + λ3 · 0.179 + λ4 · 0.184
- Automatic computation of the λi weights using an EM algorithm
- Computation is performed offline → no query overhead
- Avoids the sparse-data problem and decreases estimation risk
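One concrete way to fit the λi weights with EM is the standard mixture-weight update: the E-step assigns each observed word occurrence a responsibility per level, and the M-step renormalizes those responsibilities into new weights. This is a minimal sketch assuming each level's summary is a word-probability model; it is not the paper's exact procedure.

```python
def em_shrinkage_weights(word_counts, level_models, iters=50):
    """Fit shrinkage weights lambda_i by EM so that the mixture
    sum_i lambda_i * Pr_i[w] best explains the observed word counts.
    level_models: dicts mapping word -> probability, ordered from the
    root category down to the database's own sample summary."""
    m = len(level_models)
    lam = [1.0 / m] * m                        # start from uniform weights
    for _ in range(iters):
        resp = [0.0] * m
        for w, c in word_counts.items():
            joint = [lam[i] * level_models[i].get(w, 0.0) for i in range(m)]
            z = sum(joint)
            if z == 0.0:
                continue                       # word unseen at every level
            for i in range(m):                 # E-step: responsibilities
                resp[i] += c * joint[i] / z
        total = sum(resp) or 1.0
        lam = [r / total for r in resp]        # M-step: renormalize
    return lam
```

Because the fit runs over the already-extracted sample and the fixed category summaries, it needs no further database queries, matching the offline computation noted above.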
13. Shrinkage Weights and Summary
Learned weights for CancerLit: λroot = 0.02, λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65

| Word       | Shrinkage-based (new) | Root | Health | Cancer | CancerLit (old, unshrunk) |
|------------|-----------------------|------|--------|--------|---------------------------|
| metastasis | 2.5                   | 0.2  | 5      | 9.2    | 0                         |
| aids       | 14.3                  | 0.8  | 7      | 2      | 20                        |
| football   | 0.17                  | 2    | 1      | 0      | 0                         |

Shrinkage:
- Increases the estimates for underestimated words (e.g., metastasis)
- Decreases the word-probability estimates for overestimated words (e.g., aids)
- But also introduces (with small probabilities) spurious words (e.g., football)
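The shrunk estimate for each word is simply the λ-weighted average of the per-level estimates; plugging the CancerLit weights and per-level values into the formula reproduces the shrinkage-based numbers above (all values in %):

```python
def shrunk_estimate(lambdas, estimates):
    """Combine per-level word-probability estimates using shrinkage weights."""
    return sum(l * e for l, e in zip(lambdas, estimates))

# weights for Root, Health, Cancer, and the CancerLit sample itself
lambdas = [0.02, 0.13, 0.20, 0.65]

# per-level estimates: Root, Health, Cancer, unshrunk CancerLit
print(round(shrunk_estimate(lambdas, [0.2, 5, 9.2, 0]), 2))  # metastasis: ~2.5
print(round(shrunk_estimate(lambdas, [0.8, 7, 2, 20]), 2))   # aids: ~14.3
print(round(shrunk_estimate(lambdas, [2, 1, 0, 0]), 2))      # football: 0.17
```

Note how "football" receives a small but nonzero estimate purely from the Root and Health summaries, which is exactly how spurious words enter the shrunk summary.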
14. Is Shrinkage Always Necessary?
- Shrinkage is used to reduce the uncertainty (variance) of the estimates
- Small samples of large databases → high variance
  - In the sample: 10 out of 100 documents contain "metastasis"
  - In the database: how many out of 10,000,000 documents?
- Small samples of small databases → small variance
  - In the sample: 10 out of 100 documents contain "metastasis"
  - In the database: how many out of 200 documents?
- Shrinkage is less useful (or even harmful) when uncertainty is low
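One simple way to quantify this intuition is the standard error of a proportion estimated from a sample drawn without replacement, which includes a finite population correction. This is an illustrative sketch, not the paper's own uncertainty analysis.

```python
import math

def stderr_fraction(p_hat, n, N):
    """Standard error of the estimated fraction of documents containing a
    word, from a sample of n docs drawn without replacement out of N
    (binomial term times the finite population correction)."""
    return math.sqrt(p_hat * (1 - p_hat) / n) * math.sqrt((N - n) / (N - 1))

p_hat, n = 10 / 100, 100       # 10 of 100 sampled docs contain "metastasis"
for N in (10_000_000, 200):
    se = stderr_fraction(p_hat, n, N)
    # se * N: uncertainty in the number of matching docs in the full database
    print(N, round(se, 4), round(se * N))
```

For the 10,000,000-document database the uncertainty translates to hundreds of thousands of documents, while for the 200-document database (where the sample covers half the collection) it is only a handful, so shrinkage has little to correct.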
15. Adaptive Application of Shrinkage
- Database selection algorithms assign a score to each database for each query
- When the word-frequency estimates are uncertain, the assigned score has high variance
  - Shrinkage improves the score estimates
- When the word-frequency estimates are reliable, the assigned score has small variance
  - Shrinkage is unnecessary
[Figure: two probability distributions over the database score for a query (0 to 1): a wide distribution (unreliable score estimate → use shrinkage) and a narrow one (reliable score estimate → shrinkage might hurt).]
Solution: apply shrinkage adaptively, in a query- and database-specific manner.
16. Searching Algorithm
One-time process:
1. Extract document samples
2. Get the database classification
3. Compute the shrinkage-based summaries
For every query Q:
- For each database D:
  - Use a regular database selection algorithm to compute the query score for D using the old, unshrunk summary
  - Analyze the uncertainty of the score
  - If the uncertainty is high, use the new, shrinkage-based summary instead and compute a new query score for D
- Evaluate Q over the top-k scoring databases
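The per-query loop above might be sketched as follows. Both the toy language-model scoring and the uncertainty proxy (the fraction of query words absent from the unshrunk sample summary) are illustrative stand-ins for whichever selection algorithm and score-variance analysis are actually plugged in.

```python
import math

def lm_score(query_words, summary, num_docs, mu=1.0):
    """Toy smoothed language-model score: sum of log word probabilities,
    where summary maps word -> document frequency in the sample."""
    return sum(math.log((summary.get(w, 0) + mu) / (num_docs + 2 * mu))
               for w in query_words)

def select_databases(query, databases, k=2):
    """databases: name -> (unshrunk_summary, shrunk_summary, num_docs).
    Re-score a database with its shrinkage-based summary only when the
    unshrunk summary looks unreliable for this particular query."""
    words = query.lower().split()
    scored = []
    for name, (plain, shrunk, n) in databases.items():
        score = lm_score(words, plain, n)
        # uncertainty proxy: share of query words absent from the sample
        missing = sum(w not in plain for w in words) / len(words)
        if missing > 0:                     # high uncertainty: use shrinkage
            score = lm_score(words, shrunk, n)
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

The key property is that the shrinkage-based summary is consulted only per query and per database, so reliable unshrunk scores are left untouched.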
17. Evaluation Goals
- Examine the quality of shrinkage-based summaries
- Examine the effect of shrinkage on database selection

| Word       | CancerLit (correct) | CancerLit (shrinkage-based) | CancerLit (unshrunk) |
|------------|---------------------|-----------------------------|----------------------|
| metastasis | 12                  | 2.5                         | 0                    |
| aids       | 8                   | 14.3                        | 20                   |
| football   | 0                   | 0.17                        | 0                    |
| regression | 1                   | 0                           | 0                    |
18. Experimental Setup
- Three data sets:
  - Two standard testbeds from TREC (the Text Retrieval Conference):
    - 200 databases
    - 100 queries with associated human-assigned document relevance judgments
  - 315 real Web databases
- Two sets of experiments:
  - Content summary quality
  - Database selection accuracy
19. Results: Content Summary Quality
- Recall: how many of the words in the database are also in the summary?
  - Shrinkage-based summaries include 10%–90% more words than unshrunk summaries
- Precision: how many of the words in the summary are also in the database?
  - In shrinkage-based summaries, 5%–15% of the words do not appear in the actual database
20. Results: Content Summary Quality
- Rank correlation: is the word ranking in the summary similar to the ranking in the database?
  - Shrinkage-based summaries demonstrate better word rankings than unshrunk summaries
- Kullback–Leibler divergence: is the probability distribution in the summary similar to the distribution in the database?
  - Shrinkage improves the bad cases, but makes the very good ones worse
  - → This motivates the adaptive application of shrinkage!
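For concreteness, the two distribution-oriented metrics can be computed roughly as follows. This is a sketch: the smoothing floor in the KL computation and the restriction to words shared by both rankings are assumptions, not the paper's exact definitions.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between word distributions; q is floored at eps so that
    words missing from the summary do not yield infinite divergence."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation over the words present in both rankings
    (each ranking is a list of words, most frequent first)."""
    common = [w for w in ranking_a if w in set(ranking_b)]
    n = len(common)
    if n < 2:
        return 0.0
    rank_a = {w: i for i, w in enumerate(common)}
    rank_b = {w: i for i, w in enumerate(sorted(common, key=ranking_b.index))}
    d2 = sum((rank_a[w] - rank_b[w]) ** 2 for w in common)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A lower KL divergence and a rank correlation closer to 1 both indicate that the summary tracks the true database distribution more closely.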
21. Results: Database Selection
- Metric: R(K) = X / Y
  - X = number of relevant documents in the selected K databases
  - Y = number of relevant documents in the best K databases
Results are for CORI (a state-of-the-art database selection algorithm) with stemming, over one TREC testbed.
22. Other Experiments
- Choice of database selection algorithm (CORI, bGlOSS, Language Modeling)
- Comparison with the VLDB'02 hierarchical database selection algorithm
- Universal vs. adaptive application of shrinkage
- Effect of stemming
- Effect of stop-word elimination
23. Conclusions
- Developed a strategy to automatically summarize the contents of hidden-web text databases
  - Content summaries are critical for efficient metasearching
  - The strategy assumes no cooperation from the databases
- Shrinkage improves content summary quality by exploiting topical similarity
  - Shrinkage is efficient: no increase in the document sample size is required
- Developed an adaptive database selection strategy that decides whether to apply shrinkage in a database- and query-specific way
24. Thank You!
Shrinkage-based content summary generation is implemented and available for download at http://sdarts.cs.columbia.edu
Questions?