When one Sample is not Enough: Improving Text Database Selection using Shrinkage
1
When one Sample is not Enough: Improving Text
Database Selection using Shrinkage
  • Panos Ipeirotis
  • Luis Gravano

Computer Science Department, Columbia University
2
Regular Web Pages and Text Databases
  • Regular Web
  • Link structure
  • Crawlable
  • Documents indexed by search engines
  • Text Databases (a.k.a. Hidden Web, Deep Web)
  • Usually no link structure
  • Documents hidden in databases
  • Documents not indexed by search engines
  • Need to query each collection individually

3
Text Database Examples
Search on the U.S. Patent and Trademark Office
(USPTO) database: [wireless network] → 26,012
matches. (The USPTO database is at
http://patft.uspto.gov/netahtml/search-bool.html)
Search on Google restricted to the USPTO site:
[wireless network site:patft.uspto.gov] → 0 matches

Database             Query              Database Matches   Site-Restricted Google Matches
USPTO                wireless network   26,012             0
Library of Congress  visa regulations   >10,000            0
PubMed               thrombopenia       27,960             172
(as of June 10th, 2004)
4
Metasearchers Provide Access to Distributed Text
Databases
Database selection relies on simple content
summaries: vocabulary and word frequencies.
Example: the query [thrombopenia] reaches the
metasearcher, which routes it to the appropriate
databases (e.g., PubMed, NYTimes Archives, USPTO).
Content summary for PubMed (11,868,552 documents):
aids 121,491; cancer 1,562,477; heart 691,360;
hepatitis 121,129; thrombopenia 27,960
5
Extracting Content Summaries from Autonomous Text
Databases
  1. Send queries to the database
  2. Retrieve the top matching documents
  3. If a stopping criterion is met (e.g., sample >
     300 docs), then exit; else go to Step 1

The content summary contains the words in the sample
and the document frequency of each word.
Problem: Summaries from small samples are highly
incomplete.
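The sampling loop on this slide can be sketched in Python. The `search_database` callback, the whitespace tokenizer, and the toy in-memory database below are illustrative assumptions; the real extraction queries a remote search interface.

```python
# Sketch of query-based sampling for content summary extraction.
# search_database(query, k) is an assumed helper returning the
# top-k (doc_id, text) matches from a remote text database.
from collections import Counter

def extract_content_summary(search_database, seed_queries, max_sample=300, top_k=4):
    """Build a sample-based content summary: words and their document frequencies."""
    sample = {}            # doc_id -> set of words seen in that document
    doc_freq = Counter()   # word -> number of sampled docs containing it
    for query in seed_queries:
        if len(sample) >= max_sample:    # stopping criterion (e.g., sample > 300 docs)
            break
        for doc_id, text in search_database(query, top_k):
            if doc_id in sample:         # count each document once
                continue
            words = set(text.lower().split())
            sample[doc_id] = words
            doc_freq.update(words)       # document frequency, not term frequency
    return doc_freq

# Toy in-memory "database" to exercise the loop:
docs = {1: "cancer treatment", 2: "cancer metastasis", 3: "heart disease"}
def toy_search(query, k):
    return [(i, t) for i, t in docs.items() if query in t][:k]

summary = extract_content_summary(toy_search, ["cancer", "heart"])
```

With the toy data, "cancer" ends up with document frequency 2 and "heart" with 1, mirroring the word/frequency pairs a real summary would hold.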
6
Problem: Summaries Derived from Small Samples Are
Fundamentally Incomplete
[Figure: log-log plot of word rank vs. document frequency in
PubMed, for a sample of 300 documents. The 10 most frequent words
appear in millions of documents, while a word such as
"endocarditis" appears in about 9,000 docs (0.1% of the database)
and is easily missed in the sample.]
  • Many words appear in relatively few documents
    (Zipf's law)
  • Low-frequency words are often important
  • Small document samples miss many low-frequency
    words

7
Improving Sample-based Content Summaries
Challenge: Improve content summary quality
without increasing the sample size.
  • Main idea: database classification helps
  • Similar topics → similar content summaries
  • Extracted content summaries complement each other
  • Classification available from directories (e.g.,
    Open Directory) or derived automatically (e.g.,
    QProber)

8
Databases with Similar Topics
  • CANCERLIT contains "metastasis," but the word was
    not found during sampling
  • CancerBacup also contains "metastasis"
  • Databases under the same category have similar
    vocabularies and can complement each other

9
Content Summaries for Categories
  • Databases under the same category share a similar
    vocabulary
  • Higher-level category content summaries provide
    additional useful estimates of word
    probabilities
  • All estimates along the category path can be used

10
Enhancing Summaries Using Shrinkage
  • Word-probability estimates from database content
    summaries can be unreliable
  • Category content summaries are more reliable
    (based on larger samples) but less specific to
    each database
  • Combining estimates from the category and database
    content summaries yields better estimates

11
Shrinkage-based Estimation
Adjust the probability estimates:
Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000
Select the λi weights to maximize the probability
that the summary of D comes from a database under
all of its parent categories.
12
Computing Shrinkage-based Summaries
Category path: Root → Health → Cancer → D
Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000
Pr[treatment | D]  = λ1 · 0.015 + λ2 · 0.12 + λ3 · 0.179 + λ4 · 0.184
  • Automatic computation of the λi weights using an
    EM algorithm
  • Computation is performed offline → no query
    overhead
Avoids the sparse-data problem and decreases
estimation risk.
13
Shrinkage Weights and Summary
Weights computed for CANCERLIT: λroot = 0.02,
λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65

Word        Shrinkage-based (new)   Root   Health   Cancer   CANCERLIT (old)
metastasis  2.5                     0.2    5        9.2      0
aids        14.3                    0.8    7        2        20
football    0.17                    2      1        0        0

  • Shrinkage increases the estimates of
    underestimated words (e.g., metastasis)
  • decreases the estimates of overestimated words
    (e.g., aids)
  • but also introduces (with small probabilities)
    spurious words (e.g., football)
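The shrinkage combination itself is just a weighted average along the category path. A minimal sketch, using the λ weights and per-node estimates from this slide (the data layout as Python dicts is an illustrative assumption):

```python
# Shrinkage-based estimate: a mixture of the word-probability
# estimates from the database summary and its parent categories.
# The lambda weights come from an EM computation (not shown here).

def shrunk_probability(word, summaries, lambdas):
    """Weighted average of a word's estimates along the category path."""
    assert abs(sum(lambdas.values()) - 1.0) < 1e-9   # weights form a mixture
    return sum(lam * summaries[node].get(word, 0.0)  # missing word -> 0
               for node, lam in lambdas.items())

lambdas = {"root": 0.02, "health": 0.13, "cancer": 0.20, "cancerlit": 0.65}
summaries = {
    "root":      {"metastasis": 0.2, "aids": 0.8, "football": 2.0},
    "health":    {"metastasis": 5.0, "aids": 7.0, "football": 1.0},
    "cancer":    {"metastasis": 9.2, "aids": 2.0},
    "cancerlit": {"aids": 20.0},   # "metastasis" missed during sampling
}

print(round(shrunk_probability("metastasis", summaries, lambdas), 2))  # 2.49
```

This reproduces the table: "metastasis" rises from 0 to about 2.5, "aids" drops from 20 to about 14.3, and the spurious "football" enters with a small value of 0.17.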

14
Is Shrinkage Always Necessary?
  • Shrinkage is used to reduce the uncertainty
    (variance) of the estimates
  • Small samples of large databases → high variance
  • In the sample, 10 out of 100 documents contain
    "metastasis"
  • In the database, how many out of 10,000,000
    documents do?
  • Small samples of small databases → small variance
  • In the sample, 10 out of 100 documents contain
    "metastasis"
  • In the database, how many out of 200 documents do?
  • Shrinkage is less useful (or even harmful) when
    uncertainty is low
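The intuition can be made concrete with a standard finite-population correction (my illustration, not the paper's analysis): when the 100-document sample is half of a 200-document database, the standard error of the frequency estimate shrinks, while for a 10,000,000-document database it stays at its full value.

```python
# Illustrative sketch: we sample n of N documents and see the word
# in a fraction p_hat of them. For small N the sample covers much of
# the database, so the finite-population correction (FPC) reduces
# the standard error of the extrapolated frequency.
import math

def stderr_of_estimate(p_hat, n, N):
    fpc = (N - n) / (N - 1)                       # finite-population correction
    return math.sqrt(p_hat * (1 - p_hat) / n * fpc)

big = stderr_of_estimate(0.1, 100, 10_000_000)    # large database: ~0.030
small = stderr_of_estimate(0.1, 100, 200)         # small database: noticeably lower
print(big, small)
```

The same observed 10/100 sample thus carries less uncertainty about a 200-document database than about a 10,000,000-document one, which is exactly when shrinkage stops being helpful.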

15
Adaptive Application of Shrinkage
  • Database selection algorithms assign scores to
    databases for each query
  • When word-frequency estimates are uncertain, the
    assigned score has high variance →
    shrinkage improves the score estimates
  • When word-frequency estimates are reliable, the
    assigned score has small variance →
    shrinkage is unnecessary

[Figure: two probability distributions over the database score for
a query, on the range 0 to 1. A wide distribution means an
unreliable score estimate: use shrinkage. A narrow distribution
means a reliable score estimate: shrinkage might hurt.]
Solution: Use shrinkage adaptively, in a query- and
database-specific manner.
16
Searching Algorithm
  1. Extract document samples
  2. Get database classification
  3. Compute shrinkage-based summaries

One-time process
  • To process a query Q, for each database D:
  • Use a regular database selection algorithm to
    compute the query score for D using the old,
    unshrunk summary
  • Analyze the uncertainty of the score
  • If the uncertainty is high, use the new,
    shrinkage-based summary instead and compute a new
    query score for D
  • Evaluate Q over the top-k scoring databases

Performed for every query
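The per-query steps above can be sketched as follows. The scoring function, the variance estimate, and the fixed threshold are illustrative stand-ins, not the paper's exact CORI formulation or uncertainty analysis:

```python
# Hedged sketch of adaptive database selection: score each database
# with its unshrunk summary, and fall back to the shrinkage-based
# summary only when the score estimate looks uncertain.

def select_databases(query, databases, score, score_variance, k=3, threshold=0.1):
    scored = []
    for db in databases:
        s = score(query, db["unshrunk"])             # regular selection score
        if score_variance(query, db["unshrunk"]) > threshold:
            s = score(query, db["shrunk"])           # uncertain -> use shrinkage
        scored.append((s, db["name"]))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]          # evaluate query on top-k

# Toy stand-ins: additive scoring, and "missing word" as the
# uncertainty signal.
def toy_score(query, summary):
    return sum(summary.get(w, 0.0) for w in query)

def toy_variance(query, summary):
    return 1.0 if any(w not in summary for w in query) else 0.0

dbs = [
    {"name": "CANCERLIT",
     "unshrunk": {"aids": 20.0},                      # "metastasis" missing
     "shrunk":   {"aids": 14.3, "metastasis": 2.5}},
    {"name": "SPORTS",
     "unshrunk": {"football": 9.0, "metastasis": 0.1},
     "shrunk":   {"football": 8.0, "metastasis": 0.1}},
]
print(select_databases(["metastasis"], dbs, toy_score, toy_variance, k=1))
```

For the query [metastasis], the unshrunk CANCERLIT summary would score 0; the uncertainty check routes it to the shrunk summary, so CANCERLIT correctly wins the top-1 slot.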
17
Evaluation Goals
  • Examine quality of shrinkage-based summaries
  • Examine effect of shrinkage on database selection

CANCERLIT content summary estimates:
Word        Correct   Shrinkage-based   Unshrunk
metastasis  12        2.5               0
aids        8         14.3              20
football    0         0.17              0
regression  1         0                 0
18
Experimental Setup
  • Three data sets
  • Two standard testbeds from TREC (Text REtrieval
    Conference)
  • 200 databases
  • 100 queries with associated human-assigned
    document relevance judgments
  • 315 real Web databases
  • Two sets of experiments
  • Content summary quality
  • Database selection accuracy

19
Results: Content Summary Quality
  • Recall: How many words in the database are also in
    the summary?
  • Shrinkage-based summaries include 10%-90% more
    words than unshrunk summaries
  • Precision: How many words in the summary are also
    in the database?
  • Shrinkage-based summaries include 5%-15% words
    that are not in the actual database
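The two vocabulary-quality measures on this slide reduce to set operations; a minimal sketch with made-up word sets:

```python
# Recall: fraction of the database's words covered by the summary.
# Precision: fraction of the summary's words actually in the database.

def vocabulary_recall(db_words, summary_words):
    return len(db_words & summary_words) / len(db_words)

def vocabulary_precision(db_words, summary_words):
    return len(db_words & summary_words) / len(summary_words)

db = {"cancer", "metastasis", "treatment", "aids"}
summary = {"cancer", "aids", "football"}   # "football" is spurious
print(vocabulary_recall(db, summary))      # 2 of 4 database words covered
print(vocabulary_precision(db, summary))   # 2 of 3 summary words are real
```

Shrinkage trades these off: it raises recall by importing category words, at the cost of some spurious entries (lower precision), which is what the 10%-90% and 5%-15% figures quantify.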

20
Results: Content Summary Quality
  • Rank correlation: Is the word ranking in the
    summary similar to the ranking in the database?
  • Shrinkage-based summaries demonstrate better
    word rankings than unshrunk summaries
  • Kullback-Leibler divergence: Is the probability
    distribution in the summary similar to the
    distribution in the database?
  • Shrinkage improves bad cases, but makes very good
    ones worse
  • → This motivates the adaptive application of
    shrinkage!

21
Results: Database Selection
  • Metric: R(K) = X / Y, where
  • X = number of relevant documents in the selected
    K databases
  • Y = number of relevant documents in the best K
    databases

Results shown for CORI (a state-of-the-art database
selection algorithm) with stemming over one TREC
testbed.
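The R(K) metric is simple to compute given per-database relevant-document counts; a small sketch with invented counts:

```python
# R(K): relevant documents reachable through the K selected
# databases, as a fraction of what the best possible choice of
# K databases would have reached.

def r_at_k(relevant_per_db, selected, k):
    best = sorted(relevant_per_db.values(), reverse=True)[:k]
    chosen = [relevant_per_db.get(db, 0) for db in selected[:k]]
    return sum(chosen) / sum(best) if sum(best) else 0.0

rel = {"A": 50, "B": 30, "C": 20, "D": 0}
print(r_at_k(rel, ["A", "C"], 2))   # 70 relevant docs out of a best-case 80
```

Here selecting A and C reaches 70 relevant documents while the best pair (A, B) reaches 80, so R(2) = 0.875; a perfect selection algorithm scores 1.0.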
22
Other Experiments
  • Choice of database selection algorithm
    (CORI, bGlOSS, Language Modeling)
  • Comparison with the VLDB'02 hierarchical database
    selection algorithm
  • Universal vs. adaptive application of shrinkage
  • Effect of stemming
  • Effect of stop-word elimination

23
Conclusions
  • Developed a strategy to automatically summarize
    the contents of hidden-web text databases
  • Content summaries are critical for efficient
    metasearching
  • The strategy assumes no cooperation from the
    databases
  • Shrinkage improves content summary quality by
    exploiting topical similarity
  • Shrinkage is efficient: no increase in document
    sample size is required
  • Developed an adaptive database selection strategy
    that decides whether to apply shrinkage in a
    database- and query-specific way

24
Thank you!
Shrinkage-based content summary generation is
implemented and available for download at
http://sdarts.cs.columbia.edu
Questions?