Probe, Count, and Classify: Categorizing HiddenWeb Databases

Transcript and Presenter's Notes

1
Probe, Count, and Classify: Categorizing
Hidden-Web Databases

SIGMOD 2001,
Panagiotis G. Ipeirotis
Computer Science Dept. Columbia University
DB Lab.
Hee-Jeon Lee

2
1. INTRODUCTION (1)

Ordinary web
Traditional web, crawlable
2 billion pages
Hidden Web
Only accessible through search interfaces
Cohesive, Higher quality, not crawlable
500 billion pages

3
1. INTRODUCTION (2)

Example Query with the keyword cancer
- PubMed (http://ncbi.nlm.nih.gov/PubMed/)
- Search engine on AltaVista

The problem of accurate information retrieval on
the WWW
Retrieve static documents
Determine searchable databases
Searchable web databases
A searchable web database is a collection of text
documents that is searchable through a
web-accessible search interface
The focus is on text

4
1. INTRODUCTION (3)

Manually classifying searchable web databases
(Yahoo!-like hierarchical categorization scheme)
Lack of scalability

Automating the classification process
Combines 1) machine learning and 2) database
querying techniques
Transforms the rules of the classifier into a set
of query probes
5
2. TEXT-DATABASE CLASSIFICATION

Organize the space of searchable databases in a
hierarchical categorization scheme.
Sec 2.1
Define appropriate classification schemes
Sec 2.2
Alternative methods

6
2.1 Classification Schemes for Databases
7
2.2 Alternative Classification Strategies (1)

To assign a searchable web database to a category
Manually inspect the contents of the database and
make a decision based on the results of
inspection
A less manual approach
Coverage-based classification
Specificity-based classification

Specificity-based
CBS SportsLine
Coverage-based
CBS SportsLine - Basketball
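The two strategies can be contrasted with a small sketch. It assumes (the slide does not spell this out) that coverage is the number of documents a database holds in a category and that specificity is the fraction of the database's documents in that category. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def coverage_and_specificity(doc_categories):
    """Return (coverage, specificity) dicts for one database.

    doc_categories: one category label per document (hypothetical input).
    Assumption: coverage = document count per category,
                specificity = that count divided by the database size.
    """
    counts = Counter(doc_categories)
    total = len(doc_categories)
    coverage = dict(counts)
    specificity = {c: n / total for c, n in counts.items()} if total else {}
    return coverage, specificity


# Hypothetical database: 8 sports documents and 2 health documents.
labels = ["Sports"] * 8 + ["Health"] * 2
coverage, specificity = coverage_and_specificity(labels)
```

Under these assumptions a large, general database can score high on coverage for many categories, while a small, focused one scores high on specificity for a single category.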

8
2.2 Alternative Classification Strategies (2)
9
2.2 Alternative Classification Strategies (3)
10
3. CLASSIFYING DATABASE THROUGH PROBING

How can we approximate category information for a
given database without accessing its contents?
Sec 3.1
Train a rule-based document classifier with a set
of preclassified documents.
Sec 3.2
Transform classifier rules into queries.
Sec 3.3
Adaptively issue queries to databases, extracting
and adjusting the number of matches for each
query using the classifier's confusion matrix.
Sec 3.4
Classify databases using the adjusted number of
query matches.

11
3.1 Training a Document Classifier (1)

Rely on a rule-based document classifier to
create the probing queries
Use supervised learning to construct a rule-based
classifier from a set of preclassified documents
The resulting classifier is a set of logical rules

Antecedents are conjunctions
of words.
Consequents are the category
assignments for each document.
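A minimal sketch of such a rule set, assuming each rule is stored as a pair (set of words, category). The rules themselves and the whitespace tokenizer are illustrative assumptions, not the classifier learned in the paper.

```python
# Each rule: (antecedent = conjunction of words, consequent = category).
# The second rule is a hypothetical example.
RULES = [
    ({"jordan", "bulls"}, "Sports"),
    ({"cancer"}, "Health"),
]


def classify_document(text, rules=RULES):
    """Return every category whose antecedent words all occur in the text."""
    words = set(text.lower().split())
    return {category for antecedent, category in rules if antecedent <= words}


matched = classify_document("Jordan leads the Bulls")
```

A rule fires only when all of its words are present, which is what makes it directly expressible as a conjunctive query later on.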

12
3.1 Training a Document Classifier (2)

To define a document classifier over an entire
hierarchical classification scheme, train one
flat rule-based document classifier for each
internal node of the hierarchy.

13
3.1 Training a Document Classifier (3)

To produce a rule-based document classifier
1. Feature selection
To eliminate from the training set all words that
appear very frequently in the training documents,
as well as very infrequently appearing words.
Eliminates the terms that have the least impact
on the class distribution of documents
2. Classify the database according to the number
of documents that it contains in each category
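The frequency-based part of feature selection could look roughly like this. The thresholds `min_df` and `max_df_ratio` are hypothetical parameters, and the sketch does not cover the second filter (terms with the least impact on the class distribution).

```python
from collections import Counter


def select_features(documents, min_df=2, max_df_ratio=0.8):
    """Keep words that are neither too rare nor too common.

    min_df: minimum number of documents a word must appear in (assumed).
    max_df_ratio: maximum fraction of documents it may appear in (assumed).
    """
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document.
        doc_freq.update(set(doc.lower().split()))
    return {
        word
        for word, freq in doc_freq.items()
        if freq >= min_df and freq / n <= max_df_ratio
    }


docs = [
    "the bulls won",
    "the jordan bulls",
    "the cancer study",
    "the cancer trial",
]
features = select_features(docs)
```

Here "the" is dropped because it occurs in every document, and words seen only once are dropped as too infrequent.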

14
3.2 Defining Query Probes from a Document
Classifier (1)

Query probe
Will help estimate the number of documents for
each category of interest in a searchable web
database

15
3.2 Defining Query Probes from a Document
Classifier (2)

Map the rule into the Boolean query
IF jordan AND bulls THEN Sports -> jordan AND bulls
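A sketch of the mapping and of how probe results might be tallied per category. `count_matches` stands in for whatever reports the number of matches for a query at a given database; it is a hypothetical callable, and the match counts below are made up for illustration.

```python
def rule_to_query(antecedent_words):
    """Turn the words of a rule antecedent into a Boolean AND query."""
    return " AND ".join(antecedent_words)


def probe_counts(rules, count_matches):
    """Sum the reported match counts of all probes, grouped by category.

    rules: list of (list of words, category) pairs.
    count_matches: hypothetical callable, query string -> number of matches.
    """
    totals = {}
    for words, category in rules:
        query = rule_to_query(words)
        totals[category] = totals.get(category, 0) + count_matches(query)
    return totals


rules = [(["jordan", "bulls"], "Sports"), (["cancer"], "Health")]
# Fake search interface with invented match counts.
fake_db = {"jordan AND bulls": 120, "cancer": 5}
totals = probe_counts(rules, lambda q: fake_db.get(q, 0))
```

Only the match counts are needed, so no result documents have to be downloaded.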

16
3.2 Defining Query Probes from a Document
Classifier (3)
Boolean query
17
3.3 Adjusting Probing Results (1)

Confusion matrix
Need to adjust initial probing results to account
for potential errors
Commonly used in the machine learning community to
report document classification results
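One way to read the adjustment, stated as an assumption: if M[i][j] is the fraction of category-j training documents that the classifier assigns to category i, then the observed probe counts are roughly M times the true counts, and the true counts can be recovered by solving that linear system. The matrix values and counts below are invented.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    a = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Hypothetical normalized confusion matrix (columns sum to 1).
M = [[0.9, 0.2],
     [0.1, 0.8]]
observed = [190, 60]           # match counts reported by the probes
adjusted = solve(M, observed)  # estimate of the true per-category counts
```

In this toy case the true counts are 200 and 50, since 0.9 * 200 + 0.2 * 50 = 190 and 0.1 * 200 + 0.8 * 50 = 60.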

18
3.3 Adjusting Probing Results (2)
19
3.3 Adjusting Probing Results (3)

diagonally dominant matrix
Gershgorin disk theorem
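These two keywords presumably justify inverting the confusion matrix. By the Gershgorin disk theorem every eigenvalue lies in a disk centered at a diagonal entry whose radius is the sum of the absolute off-diagonal entries of that row; if the matrix is strictly diagonally dominant, no disk contains 0, so the matrix is invertible. A row-wise check, as a sketch (the same argument applies column-wise via the transpose):

```python
def is_strictly_diagonally_dominant(matrix):
    """True if each |diagonal entry| exceeds the sum of the other |entries| in its row."""
    for i, row in enumerate(matrix):
        off_diagonal = sum(abs(x) for j, x in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True


# Hypothetical confusion matrices.
good = [[0.9, 0.2], [0.1, 0.8]]
bad = [[0.5, 0.6], [0.1, 0.8]]
```

A classifier that is right most of the time tends to produce a matrix like `good`, where the diagonal dominates.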

20
3.4 Using Probing Results for Classification (1)

Classify databases in a top-to-bottom way
1. Each database is first classified by the
root-level classifier
2. It is then recursively pushed down to the
lower-level classifiers

21
3.4 Using Probing Results for Classification (2)

Call Classify(root, D)
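A sketch of what a recursive Classify might look like, under several assumptions: the database is pushed into a subcategory only when both its estimated coverage and specificity reach thresholds (`tau_c`, `tau_s`, hypothetical names), and the estimates are supplied up front in a dict rather than obtained by probing at each node.

```python
def classify(category, estimates, children, tau_c, tau_s):
    """Return the set of categories a database is assigned to.

    estimates: hypothetical dict, category -> (coverage, specificity),
               standing in for adjusted probe results for database D.
    children:  dict, category -> list of subcategories.
    """
    result = set()
    for child in children.get(category, []):
        cov, spec = estimates.get(child, (0, 0.0))
        if cov >= tau_c and spec >= tau_s:
            result |= classify(child, estimates, children, tau_c, tau_s)
    # Leaf node, or no subcategory qualified: stay at this category.
    return result or {category}


children = {"Root": ["Sports", "Health"], "Sports": ["Basketball", "Baseball"]}
estimates = {
    "Sports": (900, 0.9),
    "Health": (20, 0.02),
    "Basketball": (700, 0.78),
    "Baseball": (50, 0.05),
}
loose = classify("Root", estimates, children, tau_c=100, tau_s=0.5)
strict = classify("Root", estimates, children, tau_c=100, tau_s=0.8)
```

With the looser specificity threshold the database lands in Basketball; with the stricter one it stops at Sports.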

22
3.4 Using Probing Results for Classification (3)
23
4. EXPERIMENTAL SETTING

Sec 4.1 Data Collections
Controlled databases
Homogeneous
Heterogeneous
Real web databases
Sec 4.2 Techniques for Comparison
Probe and Count (PnC)
Document Sampling (DS)
Query probing to automatically construct a
language model of a text database
Title-based Querying (TQ)
One long query for each category using the title
of the category itself augmented by the titles of
all its subcategories
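The Title-based Querying baseline can be sketched as follows. How the titles are combined into one query is not stated on the slide; joining them with spaces is an assumption, as is the tiny hierarchy.

```python
def title_query(category, children):
    """Build one long query: the category title plus all subcategory titles."""
    titles = [category]
    for child in children.get(category, []):
        titles.append(title_query(child, children))
    return " ".join(titles)


# Hypothetical fragment of a category hierarchy.
children = {"Sports": ["Basketball", "Baseball"]}
query = title_query("Sports", children)
```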

24
4.3 Evaluation Metrics (1)
25
4.3 Evaluation Metrics (2)
26
4.3 Evaluation Metrics (3)

Correct: Expanded(Programming)
Classified: Expanded(Java)

Precision: {Java} / {Java} = 1 / 1 = 1
Recall: {Java} / {Prog.., C.., Pe.., Java, Visu..} = 1 / 5
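The example can be reproduced with a short sketch. It assumes that Expanded(X) means a set of categories together with all of their subcategories, that precision divides the overlap by the size of the expanded classified set, and that recall divides it by the size of the expanded correct set. The subcategory names other than Java are placeholders.

```python
def expanded(categories, children):
    """Return the categories plus all of their descendants."""
    out = set()
    stack = list(categories)
    while stack:
        c = stack.pop()
        if c not in out:
            out.add(c)
            stack.extend(children.get(c, []))
    return out


def precision_recall(correct, classified, children):
    """Precision and recall over expanded category sets (assumed definitions)."""
    exp_correct = expanded(correct, children)
    exp_classified = expanded(classified, children)
    overlap = len(exp_correct & exp_classified)
    return overlap / len(exp_classified), overlap / len(exp_correct)


# Placeholder hierarchy: Programming with four subcategories, one of them Java.
children = {"Programming": ["Java", "LangA", "LangB", "LangC"]}
precision, recall = precision_recall({"Programming"}, {"Java"}, children)
```

Classifying the database too deep (Java instead of Programming) keeps precision at 1 but lowers recall to 1 / 5.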
27
5. EXPERIMENTAL RESULTS

Sec 5.1 Tuning the PnC Technique
Effect of Confusion Matrix Adjustment (CMA)
Effect of Feature Selection
ECoverage estimates with FS on were between 15%
and 20% better
ESpecificity estimates with FS on were around
10% better

28
5.2 Results over the Controlled Databases
29
5.3 Results over the Web Databases
30
6. CONCLUSIONS AND FUTURE WORK

We have presented a novel and efficient method for
the hierarchical classification of text databases
on the web.
A step that would completely automate the
classification process is to eliminate the need
for a human to construct the simple wrapper for
each database to classify.
This need can be eliminated by automatically
learning how to parse the pages with query results
