

1
Classifying and Searching "Hidden-Web" Text
Databases
  • Panos Ipeirotis

Computer Science Department, Columbia University
2
Motivation: Surface Web vs. Hidden Web
  • Surface Web
  • Link structure
  • Crawlable
  • Documents indexed by search engines
  • Hidden Web
  • No link structure
  • Documents hidden in databases
  • Documents not indexed by search engines
  • Need to query each collection individually

3
Hidden-Web Databases: Examples
Search on the U.S. Patent and Trademark Office (USPTO) database: wireless network → 29,051 matches (the USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html)
Search on Google restricted to the USPTO site: wireless network site:patft.uspto.gov → 0 matches

Database              Query              Database Matches   Site-Restricted Google Matches
USPTO                 wireless network   29,051             0
Library of Congress   visa regulations   >10,000            0
PubMed                thrombopenia       26,887             221

(as of July 6th, 2004)
4
Interacting With Hidden-Web Databases
  • Browsing: Yahoo!-like directories (populated manually)
  • InvisibleWeb.com
  • SearchEngineGuide.com
  • Searching: Metasearchers
5
Outline of Talk
  • Classification of Hidden-Web Databases
  • Search over Hidden-Web Databases
  • Modeling and Managing Changes in Hidden-Web
    Databases

6
Hierarchically Classifying the ACM Digital Library
[Figure: the ACM Digital Library assigned to a node of a topic hierarchy]
7
Text Database Classification: Definition
  • For a text database D and a category C:
  • Coverage(D,C): number of docs in D about C
  • Specificity(D,C): fraction of docs in D about C
  • Assign a text database to a category C if:
  • Database coverage for C is at least Tc
  • Tc: coverage threshold (e.g., > 100 docs in C)
  • Database specificity for C is at least Ts
  • Ts: specificity threshold (e.g., > 40% of docs in C)
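A minimal sketch (not from the talk) of this decision rule in Python; the thresholds and counts follow the example values above:

def classify(coverage, total_docs, tc=100, ts=0.40):
    """Return the categories a database belongs to.

    coverage[c] -- estimated number of documents about category c
    total_docs  -- estimated database size
    tc, ts      -- coverage and specificity thresholds
    """
    categories = []
    for c, cov in coverage.items():
        specificity = cov / total_docs if total_docs else 0.0
        if cov >= tc and specificity >= ts:
            categories.append(c)
    return categories

# Example: a database with 5,000 docs, mostly about health
print(classify({"health": 4200, "sports": 300}, total_docs=5000))
# -> ['health']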

8
Brute-Force Classification Strategy
  • Extract all documents from database
  • Classify documents by topic
  • (use state-of-the-art classifiers: SVMs, C4.5, RIPPER, ...)
  • Classify database according to topic distribution

Problem: No direct access to full contents of
Hidden-Web databases
9
Classification: Goal and Challenges
  • Goal
  • Discover database topic distribution
  • Challenges
  • No direct access to full contents of Hidden-Web
    databases
  • Only limited search interfaces available
  • Should not overload databases

Key observation: Only queries about the database topic(s) generate a large number of matches
10
Query-based Database Classification Overview
  1. Train document classifier
  2. Extract queries from classifier
  3. Adaptively issue queries to database
  4. Identify topic distribution based on adjusted
    number of query matches
  5. Classify database

[Pipeline figure: Train Classifier → Extract Queries (e.g., Sports: "nba knicks"; Health: "sars") → Query Database ("sars" → 1,254 matches) → Identify Topic Distribution → Classify Database]
11
Training a Document Classifier
  • Get training set (set of pre-classified
    documents)
  • Select best features to characterize documents
  • (Zipf's law + information-theoretic feature selection [Koller and Sahami 1996])
  • Train classifier (SVM, C4.5, RIPPER, ...)

Output: A black-box model for classifying documents
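A minimal sketch of this step with scikit-learn, standing in for the RIPPER/SVM/C4.5 classifiers named in the talk; chi-squared selection stands in for the information-theoretic feature selection, and the tiny training set is invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_docs = ["the knicks won the nba game",
              "nba playoffs start tonight",
              "sars outbreak hits local hospitals",
              "new sars vaccine trial announced"]
train_labels = ["Sports", "Sports", "Health", "Health"]

classifier = Pipeline([
    ("vectorize", CountVectorizer()),      # bag-of-words features
    ("select", SelectKBest(chi2, k=8)),    # keep the most informative terms
    ("svm", LinearSVC()),                  # the black-box model
])
classifier.fit(train_docs, train_labels)
print(classifier.predict(["knicks beat the nets"]))  # expected: ['Sports']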
12
Extracting Query Probes
ACM TOIS 2003
  • Transform classifier model into queries
  • Trivial for rule-based classifiers (RIPPER)

Example query for Sports: "nba knicks"
13
Querying Database with Extracted Queries
  • Issue each query to database to obtain number of
    matches without retrieving any documents
  • Increase coverage of the rule's category accordingly (e.g., Sports ← Sports + 706)

SIGMOD 2001
ACM TOIS 2003
14
Identifying Topic Distribution from Query Results
Query-based estimates of topic distribution are not perfect:
  • Document classifiers not perfect
  • Rules for one category match documents from other
    categories
  • Querying not perfect
  • Queries for same category might overlap
  • Queries do not match all documents in a category

Solution: Learn to adjust the results of query probes
15
Confusion Matrix Adjustment of Query Probe
Results
Correct (but unknown) topic distribution (real coverage):
comp: 1000, sports: 5000, health: 50

Confusion matrix M (rows: assigned class; columns: correct class):
         comp   sports  health
comp     0.80   0.10    0.00
sports   0.08   0.85    0.04
health   0.02   0.15    0.96

(e.g., 10% of sports documents match queries for computers)

Incorrect topic distribution derived from query probing (estimated coverage = M × real coverage):
comp:   0.80·1000 + 0.10·5000 + 0.00·50 = 1300
sports: 0.08·1000 + 0.85·5000 + 0.04·50 = 4332
health: 0.02·1000 + 0.15·5000 + 0.96·50 = 818

This multiplication can be inverted to get a better estimate of the real topic distribution from the probe results.
16
Confusion Matrix Adjustment of Query Probe
Results
Coverage(D) = M^-1 · ECoverage(D)

(Coverage(D): adjusted estimate of topic distribution; ECoverage(D): probing results)
  • M usually diagonally dominant for reasonable
    document classifiers, hence invertible
  • Compensates for errors in query-based estimates
    of topic distribution

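A minimal numeric check of this adjustment with NumPy, using the example numbers from slide 15:

import numpy as np

# M[i][j]: fraction of documents of class j matching probes for class i
M = np.array([[0.80, 0.10, 0.00],   # comp
              [0.08, 0.85, 0.04],   # sports
              [0.02, 0.15, 0.96]])  # health

estimated = np.array([1300, 4332, 818])   # ECoverage(D): probe match counts

adjusted = np.linalg.solve(M, estimated)  # solves M·x = estimated without forming M^-1
print(adjusted.round())                   # -> [1000. 5000. 50.]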
17
Classification Algorithm (Again)
  1. Train document classifier
  2. Extract queries from classifier
  3. Adaptively issue queries to database
  4. Identify topic distribution based on adjusted
    number of query matches
  5. Classify database

(Steps 1-2 are a one-time process; steps 3-5 run for every database.)
18
Experimental Setup
  • 72-node 4-level topic hierarchy from
    InvisibleWeb/Yahoo! (54 leaf nodes)
  • 500,000 Usenet articles (April-May 2000)
  • Newsgroups assigned by hand to hierarchy nodes
  • RIPPER trained with 54,000 articles (1,000
    articles per leaf), 27,000 articles to construct
    confusion matrix
  • 500 Controlled databases built using 419,000
    newsgroup articles
  • (to run detailed experiments)
  • 130 real Web databases picked from InvisibleWeb
    (first 5 under each topic)

(Example newsgroups: comp.hardware, rec.music.classical, rec.photo)
19
Experimental Results: Controlled Databases
  • Accuracy (using F-measure)
  • Above 80% for most ⟨Tc, Ts⟩ threshold combinations tried
  • Degrades gracefully with hierarchy depth
  • Confusion-matrix adjustment helps
  • Efficiency
  • Relatively small number of probes (<500) needed for most ⟨Tc, Ts⟩ threshold combinations tried

20
Experimental Results: Web Databases
  • Accuracy (using F-measure)
  • 70% for best ⟨Tc, Ts⟩ combination
  • Learned thresholds that reproduce human classification
  • Tested threshold choice using 3-fold cross-validation
  • Efficiency
  • 120 queries per database on average needed for the chosen thresholds, no documents retrieved
  • Only small part of hierarchy explored
  • Queries are short: 1.5 words on average, 4 words maximum (easily handled by most Web databases)

21
Other Experiments
  • Effect of choice of document classifiers
  • RIPPER
  • C4.5
  • Naïve Bayes
  • SVM
  • Benefits of feature selection
  • Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models
  • Effect of query-overlap elimination step
  • Over crawlable databases, query-based classification is orders of magnitude faster than brute-force crawling-based classification

ACM TOIS 2003
IEEE Data Engineering Bulletin 2003
22
Hidden-Web Database Classification Summary
  • Handles autonomous Hidden-Web databases accurately and efficiently
  • 70% F-measure
  • Only 120 queries issued on average, with no documents retrieved
  • Handles large family of document classifiers
    (and can hence exploit future advances in
    machine learning)

23
Outline of Talk
  • Classification of Hidden-Web Databases
  • Search over Hidden-Web Databases
  • Modeling and Managing Changes in Hidden-Web
    Databases

24
Interacting With Hidden-Web Databases
  • Browsing: Yahoo!-like directories
  • Searching: Metasearchers


[Figure: a metasearcher routes a query to Hidden-Web databases (NYTimes Archives, PubMed, USPTO, Library of Congress) whose content is not accessible through Google]
25
Metasearchers Provide Access to Distributed
Databases
Database selection relies on simple content summaries: vocabulary and word frequencies.

[Figure: the query "thrombopenia" goes to the metasearcher, which consults per-database content summaries, e.g. PubMed (11,868,552 documents): aids 123,826; cancer 1,598,896; heart 706,537; hepatitis 124,320; thrombopenia 26,887. Actual matches: PubMed 26,887; NYTimes Archives 42; USPTO 0.]
26
Extracting Content Summaries from Autonomous
Hidden-Web Databases
[Callan & Connell 2001]
  1. Send random queries to databases
  2. Retrieve top matching documents
  3. If 300 documents have been retrieved, stop; else go to Step 1

The content summary contains the words in the sample and the document frequency of each word.
  • Problems
  • Random sampling retrieves non-representative
    documents
  • Frequencies in summary compressed to sample
    size range
  • Summaries from small samples are highly incomplete

27
Extracting Representative Document Sample
Problem 1: Random sampling retrieves non-representative documents
  • Train a document classifier
  • Create queries from classifier
  • Adaptively issue queries to databases
  • Retrieve top-k matching documents for each query
  • Save matches for each one-word query
  • Identify topic distribution based on adjusted
    number of query matches
  • Categorize the database
  • Generate content summary from document sample

Sampling retrieves documents only from topically dense areas of the database
28
Sample Frequencies vs. Actual Frequencies
Problem 2: Frequencies in summary are compressed to the sample size range

PubMed (11,868,552 docs): cancer 1,562,477; heart 691,360
PubMed sample (300 documents): cancer 45; heart 16

Key observation: Query matches reveal frequency information
29
Adjusting Document Frequencies
  • Zipf's law empirically connects word frequency f and rank r:

f = A · (r + B)^c

[Plot: frequency vs. rank]

VLDB 2002
30
Adjusting Document Frequencies
  • Zipf's law empirically connects word frequency f and rank r:
  • We know the document frequency and rank r of the words in the sample

f = A · (r + B)^c

[Plot: frequency in sample vs. rank in sample (ranks 1, 12, 78, ...)]

VLDB 2002
31
Adjusting Document Frequencies
  • Zipf's law empirically connects word frequency f and rank r:
  • We know the document frequency and rank r of the words in the sample
  • We know the real document frequency f of some words from one-word queries

f = A · (r + B)^c

[Plot: frequency in database vs. rank in sample]

VLDB 2002
32
Adjusting Document Frequencies
  • Zipf's law empirically connects word frequency f and rank r:
  • We know the document frequency and rank r of the words in the sample
  • We know the real document frequency f of some words from one-word queries
  • We use curve-fitting to estimate the absolute frequency of all words in the sample

f = A · (r + B)^c

[Plot: estimated frequency in database vs. rank in sample]

VLDB 2002
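A minimal sketch of this curve-fitting step with SciPy, assuming a handful of (rank, real frequency) pairs obtained from one-word probes; the data points are invented for illustration:

import numpy as np
from scipy.optimize import curve_fit

def zipf_mandelbrot(r, A, B, c):
    # f = A * (r + B)^c, the curve from the slide
    return A * (r + B) ** c

# Sample ranks whose real database frequencies are known from one-word probes
known_ranks = np.array([1.0, 12.0, 78.0, 230.0, 800.0])
known_freqs = np.array([1_562_477.0, 121_491.0, 9_020.0, 1_100.0, 95.0])

params, _ = curve_fit(zipf_mandelbrot, known_ranks, known_freqs,
                      p0=(1e6, 1.0, -1.0),
                      bounds=([0, 0, -5], [np.inf, np.inf, 0]))

# Estimate the absolute database frequency of any sampled word from its rank
print(int(zipf_mandelbrot(40, *params)))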
33
Actual PubMed Content Summary
PubMed content summary (extracted automatically):

Number of documents: 8,868,552 (actual: 12,349,932)
Category: Health, Diseases
cancer: 1,562,477
heart: 581,506 (actual: 706,537)
aids: 121,491
hepatitis: 73,481 (actual: 124,320)
basketball: 907 (actual: 1,094)
cpu: 598

  • 27,500 words in extracted content summary
  • Fewer than 200 queries sent
  • At most 4 documents retrieved per query

(heart, hepatitis, and basketball were not among the 1-word probes)
34
Sampling and Incomplete Content Summaries
Problem 3: Summaries from small samples are highly incomplete

[Plot: log(frequency) vs. rank for words in PubMed; a 300-document sample covers the ~10 most frequent words but misses low-frequency words such as endocarditis (9,000 docs, 0.1% of the database)]
  • Many words appear in relatively few documents (Zipf's law)
  • Low-frequency words are often important
  • Small document samples miss many low-frequency
    words

35
Sample-based Content Summaries
Challenge: Improve content summary quality without increasing sample size
  • Main idea: Database classification helps
  • Similar topics → similar content summaries
  • Extracted content summaries complement each other

36
Databases with Similar Topics
  • Cancerlit contains "metastasis", which was not found during sampling
  • CancerBacup contains "metastasis"
  • Databases under the same category have similar vocabularies, and can complement each other

37
Content Summaries for Categories
  • Databases under same category share similar
    vocabulary
  • Higher level category content summaries provide
    additional useful estimates
  • Can use all estimates in category path

38
Enhancing Summaries Using Shrinkage
  • Estimates from database content summaries can be
    unreliable
  • Category content summaries are more reliable
    (based on larger samples) but less specific to
    database
  • By combining estimates from category and database
    content summaries we get better estimates

SIGMOD 2004
39
Shrinkage-based Estimations
Adjust probability estimates:

Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000

Select the λi weights to maximize the probability that the summary of D is from a database under all its parent categories.

Avoids the sparse-data problem and decreases estimation risk.
40
Shrinkage Weights and Summary
Word-probability estimates for CANCERLIT: the new shrinkage-based estimate vs. the old estimates along the category path (λroot = 0.02, λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65):

Word         Shrinkage-based   root   health   cancer   cancerlit
metastasis   2.5               0.2    5        9.2      0
aids         14.3              0.8    7        2        20
football     0.17              2      1        0        0
  • Shrinkage:
  • Increases estimates for underestimated words (e.g., metastasis)
  • Decreases word-probability estimates for overestimated words (e.g., aids)
  • It also introduces (with small probabilities) spurious words (e.g., football)
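The shrinkage combination behind this table is just a λ-weighted average; a minimal sketch that reproduces the shrinkage-based column from the slide's numbers:

weights = {"root": 0.02, "health": 0.13, "cancer": 0.20, "cancerlit": 0.65}

# Old word-probability estimates, per summary (numbers from the table above)
estimates = {
    "metastasis": {"root": 0.2, "health": 5.0, "cancer": 9.2, "cancerlit": 0.0},
    "aids":       {"root": 0.8, "health": 7.0, "cancer": 2.0, "cancerlit": 20.0},
    "football":   {"root": 2.0, "health": 1.0, "cancer": 0.0, "cancerlit": 0.0},
}

for word, probs in estimates.items():
    shrunk = sum(weights[s] * probs[s] for s in weights)
    print(f"{word}: {shrunk:.2f}")
# metastasis: 2.49, aids: 14.33, football: 0.17 -- matching the table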

41
Adaptive Application of Shrinkage
  • Database selection algorithms assign scores to
    databases for each query
  • When frequency estimates are uncertain, assigned
    score is uncertain
  • but sometimes confidence about assigned score is
    high
  • When confident about score, shrinkage unnecessary

[Figure: probability distributions of a database's score for a query, over [0, 1]. A flat, unreliable score estimate → use shrinkage; a peaked, reliable score estimate → shrinkage might hurt.]
42
Extracting Content Summaries: Problems Solved
  • Problem 1: Random sampling may retrieve non-representative documents
  • Solution: Focus querying on topically dense areas of the database
  • Problem 2: Frequencies are compressed to the sample size range
  • Solution: Exploit the number of matches per query and adjust estimates using curve fitting
  • Problem 3: Summaries based on small samples are highly incomplete
  • Solution: Exploit database classification and augment summaries using samples from topically similar databases

43
Searching Algorithm
  1. Classify databases and extract document samples
  2. Adjust frequencies in samples

(One-time process)

  • For each query Q, for each database D:
  • Assign a score to D (using its extracted content summary)
  • Examine the uncertainty of the score
  • If uncertainty is high, apply shrinkage and assign a new score
  • Query only the top-K scoring databases

(For every query; see the sketch below)
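A minimal sketch of this per-query loop, with toy stand-ins for the CORI-style scorer and the uncertainty test; all names, thresholds, and data here are invented:

import math
from dataclasses import dataclass

@dataclass
class Database:
    name: str
    summary: dict          # word -> estimated document frequency
    shrunk_summary: dict   # shrinkage-enhanced summary
    sample_size: int

def toy_score(query, summary):
    # Toy stand-in for a CORI-style scorer: reward high estimated frequency
    return sum(math.log1p(summary.get(w, 0)) for w in query)

def select_databases(query, databases, k=1, min_sample=200):
    scored = []
    for db in databases:
        summary = db.summary
        # Illustrative uncertainty test: a small sample, or query words
        # missing from the summary, make the score unreliable
        if db.sample_size < min_sample or any(w not in summary for w in query):
            summary = db.shrunk_summary       # re-score with shrinkage
        scored.append((toy_score(query, summary), db.name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]   # query only the top-K databases

pubmed = Database("PubMed", {"cancer": 45}, {"cancer": 45, "thrombopenia": 2}, 300)
uspto = Database("USPTO", {"patent": 40}, {"patent": 40}, 300)
print(select_databases(["thrombopenia"], [pubmed, uspto], k=1))  # ['PubMed']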
44
Experimental Setup
  • Two standard testbeds from TREC (Text Retrieval
    Conference)
  • 200 databases
  • 100 queries with associated human-assigned
    document relevance judgments
  • Two sets of experiments
  • Content summary quality
  • Metrics: precision, recall, Spearman correlation coefficient, KL-divergence
  • Database selection accuracy
  • Metric: fraction of relevant documents for queries in top-scored databases
SIGMOD 2004
45
Experimental Results
  • Content summary quality
  • Shrinkage improves quality of content summaries without increasing sample size
  • Frequency estimation gives accurate (within 30%) estimates of actual frequencies
  • Database selection accuracy
  • Focused sampling: improves performance by 20-40%
  • Adaptive application of shrinkage: improves performance by up to 50%
  • Shrinkage is robust: improved performance consistently across many different configurations

46
Results: Database Selection
  • Metric: R(K) = X / Y
  • X: number of relevant documents in the selected K databases
  • Y: number of relevant documents in the best K databases
CORI, with stemming, TREC4 testbed
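A minimal sketch of computing R(K), assuming per-database counts of relevant documents for a query; the data is invented:

def r_at_k(relevant_per_db, selected, k):
    # X: relevant docs reachable through the K databases the selector picked
    x = sum(relevant_per_db.get(db, 0) for db in selected[:k])
    # Y: relevant docs reachable through the best possible K databases
    y = sum(sorted(relevant_per_db.values(), reverse=True)[:k])
    return x / y if y else 0.0

# Example: the selector picked db2 and db3; the best pair was db1 and db2
print(r_at_k({"db1": 30, "db2": 20, "db3": 5}, ["db2", "db3"], k=2))  # 0.5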
47
Other Experiments
  • Additional data set: 315 real Web databases
  • Choice of database selection algorithm (CORI,
    bGlOSS, Language Modeling)
  • Effect of stemming
  • Effect of stop-word elimination
  • Comparison with VLDB'02 hierarchical database selection algorithm
  • Universal vs. adaptive application of shrinkage

SIGMOD 2004
48
Search Contributions
  • Developed strategy to automatically summarize
    contents of Hidden-Web text databases
  • Strategy assumes no cooperation from databases
  • Improves content summary quality by exploiting
    topical similarity and number of matches
  • No increase in document sample size required
  • Developed adaptive database selection strategy that decides whether to apply shrinkage in a database- and query-specific way

49
Outline of Talk
  • Classification of Hidden-Web Databases
  • Search over Hidden-Web Databases
  • Modeling and Managing Changes in Hidden-Web
    Databases

50
Do Content Summaries Change Over Time?
Databases are not static. Their content changes.
Should we refresh the content summary?
  • Examined summaries of 152 real Web databases over
    52 weeks
  • Summary quality declines as age increases

51
Updating Content Summaries
  • Summaries change → need to refresh to capture changes
  • To devise an update policy → need to know the frequency of change
  • Summary changes at time T if dist(current, old(T)) > τ (τ: change-sensitivity threshold)
  • Survival analysis estimates the probability S(t) that T > t
  • Common model: S(t) = e^(-λt) (λ defines the frequency of change)
  • Problems:
  • No access to content summaries
  • Even if we know the summaries, it takes a long time to estimate λ
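A tiny numeric illustration of this survival model; the λ value matches the Tom's Hardware example on slide 54, the rest is arithmetic:

import math

def survival(lam, t):
    # S(t) = e^(-lambda*t): probability the summary has NOT changed by time t
    return math.exp(-lam * t)

lam = 0.088   # Tom's Hardware rate from slide 54
for t in (1, 5, 10, 20):
    print(t, round(survival(lam, t), 2), round(1 - survival(lam, t), 2))
# By week 10 the summary has changed with probability 1 - e^(-0.88) ~ 0.59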
52
Cox Proportional Hazards Regression
  • We want to estimate frequency of change for each
    database
  • Cox PH model examines effect of database
    characteristics on frequency of change
  • E.g., if you double the size of a database, it
    changes twice as fast
  • Cox PH model effectively uses censored data
    (i.e., database did not change within time T)
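The model fitting itself can be sketched with the lifelines survival-analysis library (one possible tool; the talk does not name an implementation). The rows, the single covariate, and the penalizer below are invented for illustration; event = 0 marks a censored database that did not change within the 52-week observation window:

import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "weeks_to_change": [3, 5, 8, 12, 20, 30, 40, 52],
    "event":           [1, 1, 1, 1,  1,  0,  0,  0],   # 0 = censored
    "log_size":        [6.1, 4.9, 5.6, 4.8, 5.0, 3.9, 4.5, 3.0],
})

cph = CoxPHFitter(penalizer=0.1)   # small penalty stabilizes the tiny sample
cph.fit(df, duration_col="weeks_to_change", event_col="event")
cph.print_summary()                # exp(coef): hazard ratio per unit of log_size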

53
Cox PH Regression Results
  • Examined effect of
  • Change-sensitivity threshold τ
  • Topic
  • Domain
  • Size
  • Number of words
  • Differences of summaries extracted in consecutive
    weeks
  • Devised concrete change model according to
    database characteristics (formula in thesis)

54
Scheduling Updates
Database         λ       Update interval (average budget: 10 weeks)   Update interval (average budget: 40 weeks)
Tom's Hardware   0.088   5 weeks                                      46 weeks
USPS             0.023   12 weeks                                     34 weeks
Using our change model, we schedule updates
according to the available resources (using
Lagrange-multiplier method)
55
Scheduling Results
  • With clever scheduling we improve the quality of
    summaries according to a variety of metrics
    (precision, recall, KL-divergence)

56
Updating Content Summaries: Contributions
  • Extensive experimental study showing that the quality of summaries deteriorates with increasing summary age
  • Change frequency model that uses database
    characteristics to predict frequency of change
  • Derived scheduling algorithms that define update
    frequency for each database according to the
    available resources

57
Overall Contributions
  • Support for browsing, searching and updating
    autonomous Hidden-Web databases
  • Browsing
  • Algorithm for automatic classification of
    Hidden-Web databases
  • Accuracy: 70% (F-measure)
  • Only 120 queries issued on average, with no
    documents retrieved
  • Searching
  • Content summary construction technique that
    samples topically dense areas of the database
  • Database selection algorithms (hierarchical and
    shrinkage-based) that improve existing algorithms
    by exploiting topical similarity
  • Updating
  • Change model that uses database characteristics
    to predict frequency of change
  • Scheduling algorithms that exploit the model and
    define update frequency for each database
    according to the available resources

58
Thank you!
Classification and content summary extraction implemented and available for download at http://sdarts.cs.columbia.edu
59
Panos Ipeirotis, http://www.cs.columbia.edu/~pirot
  • Classification and Search of Hidden-Web Databases:
  • P. Ipeirotis, L. Gravano, "When one Sample is not Enough: Improving Text Database Selection using Shrinkage", SIGMOD 2004
  • L. Gravano, P. Ipeirotis, M. Sahami, "QProber: A System for Automatic Classification of Hidden-Web Databases", ACM TOIS 2003
  • E. Agichtein, P. Ipeirotis, L. Gravano, "Modeling Query-Based Access to Text Databases", WebDB 2003
  • P. Ipeirotis, L. Gravano, "Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection", VLDB 2002
  • P. Ipeirotis, L. Gravano, M. Sahami, "Query- vs. Crawling-based Classification of Searchable Web Databases", DEB 2002
  • P. Ipeirotis, L. Gravano, M. Sahami, "Probe, Count, and Classify: Categorizing Hidden-Web Databases", SIGMOD 2001
  • Approximate Text Matching:
  • L. Gravano, P. Ipeirotis, N. Koudas, D. Srivastava, "Text Joins in an RDBMS for Web Data Integration", WWW 2003
  • L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, "Approximate String Joins in a Database (Almost) for Free", VLDB 2001
  • L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, L. Pietarinen, "Using q-grams in a DBMS for Approximate String Processing", DEB 2001
  • SDARTS: Protocol and Toolkit for Metasearching:
  • N. Green, P. Ipeirotis, L. Gravano, "SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching", JCDL 2001
  • P. Ipeirotis, T. Barry, L. Gravano, "Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative", JCDL 2002
60
Future Work: Integrated Access to Hidden-Web Databases
Query: "good indie movies playing in Manhattan tomorrow"

Current top Google result (Feb 17th, 2004): a story at the Seattle Times about 9-year-old drummer Rachel Trachtenburg
61
Future Work: Integrated Access to Hidden-Web Databases
Query: "good indie movies playing in New York now"
  • All information already available on the Web
  • Review databases: Rotten Tomatoes, NY Times, TONY, ...
  • Movie databases: All Movie Guide, IMDB
  • Tickets: Moviefone, Fandango, ...

62
Future Work: Integrated Access to Hidden-Web Databases
Query: "good indie movies playing in New York now"
  • Challenges
  • Short term
  • Learn to interface with different databases
  • Adapt database selection algorithms
  • Long term
  • Understand semantics of query
  • Extract query plans and optimize for
    distributed execution
  • Personalization
  • Security and privacy

63
SDARTS: Protocol and Toolkit for Metasearching

[Figure: SDARTS mediates a query to local collections (unstructured text documents, DLI2 corpus XML documents) and Web databases (Harrison's Online, British Medical Journal, PubMed)]
64
SDARTS: Protocol and Toolkit for Metasearching
  • Accomplishments
  • Combines the strengths of existing Digital Library protocols (SDLIP, STARTS)
  • Enables indexing and wrapping of local
    collections of text and XML documents
  • Enables declarative wrapping of Hidden-Web
    databases, with no programming
  • Extracts content summary, topical focus, and
    technical level of each database
  • Interfaces with Open Archives Initiative, an
    emerging Digital Library interoperability
    protocol
  • Critical building block for the search component of Columbia's PERSIVAL project
  • (5-year, $5M NSF Digital Libraries Phase 2 project)
  • Open source, available at http://sdarts.cs.columbia.edu
  • 1,000 downloads since Jan 2003
  • Supervised and coordinated eight students during
    development

ACM/IEEE JCDL Conference 2001, 2002
65
Current Work: Updating Content Summaries
Databases are not static. Their content changes.
When should we refresh the content summary?
  • Examined 150 real Web databases over 52 weeks
  • Modeled changes using survival analysis
    techniques (Cox proportional hazards model)
  • Currently developing updating algorithms
  • Contact database only when necessary
  • Improve quality of summaries by exploiting history

Joint work with Junghoo Cho and Alex Ntoulas
(UCLA)
66
Other Work: Approximate Text Matching
VLDB 2001
WWW 2003
Matching similar strings within a relational DBMS: important data resides there
Service A
Jenny Stamatopoulou
John Paul McDougal
Aldridge Rodriguez
Panos Ipeirotis
John Smith
Service B
Panos Ipirotis
Jonh Smith
Stamatopulou, Jenny
John P. McDougal
Al Dridge Rodriguez
Exact joins are not enough: typing mistakes, abbreviations, different conventions
  • Introduced algorithms for mapping approximate
    text joins into SQL
  • No need for import/export of data
  • Provides crucial building block for data cleaning
    applications
  • Identifies many interesting matches

Joint work with Divesh Srivastava, Nick Koudas (AT&T Labs-Research) and others
67
No Good Category for Database
  • General problem with supervised learning
  • Example: English vs. Chinese databases
  • Devised technique to analyze whether the approach can work with a given database:
  • Find candidate text fields
  • Send top-level queries
  • Examine results; construct similarity matrix
  • If matrix rank is small → many similar pages returned:
  • Web form is not a search interface
  • Text field is not a keyword field
  • Database is in a different language
  • Database is on an unknown topic
68
Database not Category Focused
  • Extract one content summary per topic
  • Focused queries retrieve documents about known
    topic
  • Each database is represented multiple times in
    hierarchy

69
Near Future Work: Definition and Analysis of Query-based Algorithms
  • Currently query-based algorithms are evaluated
    only empirically
  • Possible to model the querying process using random graph theory and:
  • Analyze thoroughly properties of the algorithms
  • Understand better why, when, and how the
    algorithms work
  • Interested in exploring similar directions
  • Adapt hyperlink-based ranking algorithms
  • Use results in graph theory to design sampling
    algorithms

WebDB 2003
70
Database Selection (CORI, TREC6)
More results: Stemming/No Stemming, CORI/LM/bGlOSS, QBS/FPS/RS/CMPL, Stopwords
71
3-Fold Cross-Validation
72
Crawling- vs. Query-based Classification for CNN
Sports
Efficiency statistics:

Method           Time             Downloaded       Size
Crawling-based   1,325 min        270,202 files    8 GB
Query-based      2 min (-99.8%)   112 queries      357 KB (-99.9%)

IEEE DEB March 2002

Accuracy statistics: crawling-based classification classifies CNN Sports correctly only after downloading 70% of its documents.
73
Experiments: Precision of Database Selection Algorithms
Content Summary Generation Technique   CORI (Hierarchical)   CORI (Flat)
FP-SVM-Documents                       0.270                 0.170
FP-SVM-Snippets                        0.200                 0.183
Random Sampling                        -                     0.177
QPilot (backlinks + front page)        -                     0.050
VLDB 2002 (extended version)
74
F-measure vs. Hierarchy Depth
ACM TOIS 2003
75
Real Confusion Matrix for Top Node of Hierarchy (rows: assigned class; columns: correct class)
Health Sports Science Computers Arts
Health 0.753 0.018 0.124 0.021 0.017
Sports 0.006 0.171 0.021 0.016 0.064
Science 0.016 0.024 0.255 0.047 0.018
Computers 0.004 0.042 0.080 0.610 0.031
Arts 0.004 0.024 0.027 0.031 0.298
76
Overlap Elimination
77
No Support for Conjunctive Queries (Boolean vs. Vector-space)
78
CORI Stemming
TREC4 QBS
TREC4 FPS
TREC6 QBS
TREC6 FPS
79
bGlOSS Stemming
TREC4 QBS
TREC4 FPS
TREC6 QBS
TREC6 FPS
80
LM Stemming
TREC4 QBS
TREC4 FPS
TREC6 QBS
TREC6 FPS
81
CORI No Stemming
TREC4 QBS
TREC4 FPS
TREC6 QBS
TREC6 FPS
82
bGlOSS No stemming
TREC4 QBS
TREC4 FPS
TREC6 QBS
TREC6 FPS
83
LM No Stemming
TREC4 QBS
TREC4 FPS
TREC6 QBS
TREC6 FPS
84
Frequency Estimation: TREC 4 - CORI
Stemming
No Stemming
85
Frequency Estimation: TREC 6 - CORI
Stemming
No Stemming
86
Universal Application of Shrinkage: TREC4 CORI
87
Universal Application of Shrinkage: TREC4 bGlOSS
88
Results: Content Summary Quality
  • Recall: How many words in the database are also in the summary?
  • Shrinkage-based summaries include 10-90% more words than unshrunk summaries
  • Precision: How many words in the summary are also in the database?
  • Shrinkage-based summaries include 5-15% words not in the actual database

89
Results: Content Summary Quality
  • Rank correlation: Is the word ranking in the summary similar to the ranking in the database?
  • Shrinkage-based summaries demonstrate better word rankings than unshrunk summaries
  • Kullback-Leibler divergence: Is the probability distribution in the summary similar to the distribution in the database?
  • Shrinkage improves bad cases, but makes very good ones worse
  • → Motivates adaptive application of shrinkage!

90
Model: Querying Graph

[Figure: bipartite querying graph between words t1...t5 and documents d1...d5; an edge (t, d) means a query for word t retrieves document d]
91
Model: Reachability Graph

[Figure: reachability graph over words t1...t5; an edge t → t' means a query for t retrieves a document containing t' (e.g., t1 retrieves document d1, which contains t2)]