Probe, Count, and Classify: Categorizing Hidden-Web Databases

1
Probe, Count, and Classify: Categorizing Hidden-Web Databases
  • SIGMOD 2001
  • Panagiotis G. Ipeirotis
  • Computer Science Dept. Columbia University
  • DB Lab.
  • Hee-Jeon Lee

2
1. INTRODUCTION (1)
  • Ordinary web
      • Traditional web, crawlable
      • 2 billion pages
  • Hidden Web
      • Only accessible through search interfaces
      • Cohesive, higher quality, not crawlable
      • 500 billion pages

3
1. INTRODUCTION (2)
  • Example: a query with the keyword "cancer"
      • PubMed (http://ncbi.nlm.nih.gov/PubMed/)
      • Search engine (AltaVista)
  • The problem of accurate information retrieval on the
    WWW
      • Retrieve static documents
      • Determine searchable databases
  • Searchable Web databases
      • A collection of text documents that is
        searchable through a web-accessible search
        interface
      • Focus is on text

4
1. INTRODUCTION (3)
  • Manually classifying searchable web databases
    (Yahoo!-like hierarchical categorization scheme)
  • Lack of scalability

  • Automating the classification process
      • Combines 1) machine learning and 2) database
        querying techniques
      • Transforms the rules of the classifier into a
        set of query probes
5
2. TEXT-DATABASE CLASSIFICATION
  • Organize the space of searchable databases in a
    hierarchical categorization scheme.
  • Sec 2.1
  • Define appropriate classification schemes
  • Sec 2.2
  • Alternative methods

6
2.1 Classification Schemes for Databases
7
2.2 Alternative Classification Strategies (1)
  • To assign a searchable web database to a category
  • Manually inspect the contents of the database and
    make a decision based on the results of
    inspection
  • A less manual approach
  • Coverage-based classification
  • Specificity-based classification
  • Specificity-based
      • CBS SportsLine
  • Coverage-based
      • CBS SportsLine - Basketball
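The two criteria can be made concrete with a small sketch. It assumes that coverage means the absolute number of documents a database holds on a category, and that specificity means that number as a fraction of the whole database. The function names, the thresholds `tau_c` and `tau_s`, and the document counts are all hypothetical.

```python
def coverage(counts, category):
    """Absolute number of documents the database holds on `category`."""
    return counts.get(category, 0)


def specificity(counts, category):
    """Fraction of the database's documents that are on `category`."""
    total = sum(counts.values())
    return counts.get(category, 0) / total if total else 0.0


def qualifies(counts, category, tau_c=100, tau_s=0.4):
    """Keep a category only if both thresholds are met (assumed rule)."""
    return (coverage(counts, category) >= tau_c
            and specificity(counts, category) >= tau_s)


# Hypothetical per-category document counts for two databases.
general_news = {"Sports": 5000, "Health": 20000, "Computers": 25000}
basketball_site = {"Sports": 900, "Health": 50, "Computers": 50}

print(qualifies(general_news, "Sports"))     # large coverage, low specificity
print(qualifies(basketball_site, "Sports"))  # smaller coverage, high specificity
```

A large general-purpose database can pass a coverage threshold for many categories while being specific to none of them, which is why the two criteria can lead to different assignments.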

8
2.2 Alternative Classification Strategies (2)
9
2.2 Alternative Classification Strategies (3)
10
3. CLASSIFYING DATABASE THROUGH PROBING
  • How can we approximate the information needed to
    classify a given database without accessing its
    contents directly?
  • Sec 3.1
      • Train a rule-based document classifier with a set
        of preclassified documents.
  • Sec 3.2
      • Transform classifier rules into queries.
  • Sec 3.3
      • Adaptively issue queries to databases, extracting
        and adjusting the number of matches for each
        query using the classifier's confusion matrix.
  • Sec 3.4
      • Classify databases using the adjusted number of
        query matches.

11
3.1 Training a Document Classifier (1)
  • Rely on a rule-based document classifier to
    create the probing queries
  • Use supervised learning to construct a rule-based
    classifier from a set of preclassified documents
  • The resulting classifier is a set of logical rules
  • Antecedents are conjunctions of words.
  • Consequents are the category assignments for each
    document.
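A rule set of this shape can be sketched in a few lines. This is only an illustration of the "conjunction of words implies category" form, not the learned classifier itself. The `jordan AND bulls` rule comes from a later slide; the other rules and the first-match policy are assumptions.

```python
# Each rule pairs a set of words that must all appear (the antecedent)
# with the category to assign (the consequent).
RULES = [
    ({"jordan", "bulls"}, "Sports"),
    ({"hepatitis"}, "Health"),          # hypothetical rule
    ({"ibm", "computer"}, "Computers"),  # hypothetical rule
]


def classify_document(text, rules=RULES):
    """Return the category of the first rule whose words all occur in `text`."""
    words = set(text.lower().split())
    for antecedent, category in rules:
        if antecedent <= words:  # conjunction: every word must be present
            return category
    return None  # no rule fired


print(classify_document("Jordan led the Bulls to a win"))
print(classify_document("weather report for tomorrow"))
```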

12
3.1 Training a Document Classifier (2)
  • To define a document classifier over an entire
    hierarchical classification scheme, train one
    flat rule-based document classifier for each
    internal node of the hierarchy.

13
3.1 Training a Document Classifier (3)
  • To produce a rule-based document classifier
  • 1. Feature selection
      • Eliminate from the training set all words that
        appear very frequently in the training documents,
        as well as words that appear very infrequently.
      • Eliminate the terms that have the least impact
        on the class distribution of documents.
  • 2. Classify the database according to the number
    of documents that it contains in each category
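The frequency-based part of feature selection can be sketched as a document-frequency filter. The cutoffs `min_df` and `max_df_ratio` and the sample documents are hypothetical. The second criterion (dropping terms with the least impact on the class distribution) is not covered here.

```python
from collections import Counter


def select_features(documents, min_df=2, max_df_ratio=0.8):
    """Keep words whose document frequency is neither too low nor too high."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each word once per doc
    n_docs = len(documents)
    return {
        word for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    }


docs = [
    "the bulls won the game",
    "the doctor treated the patient",
    "the bulls lost the game",
    "the patient saw the doctor",
    "the referee stopped the game",
]
print(sorted(select_features(docs)))
```

In this toy collection, "the" is dropped for appearing in every document, and words such as "referee" are dropped for appearing in only one.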

14
3.2 Defining Query Probes from a Document Classifier (1)
  • Query probe
      • Helps estimate the number of documents for
        each category of interest in a searchable web
        database

15
3.2 Defining Query Probes from a Document Classifier (2)
  • Map each rule into a Boolean query
  • IF jordan AND bulls THEN Sports -> jordan AND
    bulls
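The mapping from rules to probes, and the counting that follows, might look like the sketch below. `match_count` is a hypothetical stand-in for a database's search interface (it takes a query string and returns the number of matches reported), and the counts in `fake_counts` are made up for illustration.

```python
def rule_to_query(antecedent):
    """Turn the words of a rule's antecedent into a Boolean AND query."""
    return " AND ".join(antecedent)


def count_matches_per_category(rules, match_count):
    """Issue one probe per rule and sum the reported matches by category."""
    totals = {}
    for antecedent, category in rules:
        query = rule_to_query(antecedent)
        totals[category] = totals.get(category, 0) + match_count(query)
    return totals


rules = [
    (("jordan", "bulls"), "Sports"),
    (("hepatitis",), "Health"),  # hypothetical rule
]
# Hypothetical match counts a database might report for each probe.
fake_counts = {"jordan AND bulls": 1200, "hepatitis": 30}

print(rule_to_query(("jordan", "bulls")))
print(count_matches_per_category(rules, lambda q: fake_counts.get(q, 0)))
```

Only the number of matches is needed for each probe, so no documents have to be retrieved from the database.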

16
3.2 Defining Query Probes from a Document Classifier (3)
Boolean query
17
3.3 Adjusting Probing Results (1)
  • Confusion matrix
      • Needed to adjust the initial probing results to
        account for potential classifier errors
      • Commonly used in the machine learning community to
        report document classification results

18
3.3 Adjusting Probing Results (2)
19
3.3 Adjusting Probing Results (3)
  • Diagonally dominant matrix
      • A strictly diagonally dominant matrix is invertible
        (Gershgorin disk theorem)
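One way to read the adjustment step, sketched under assumptions: let `M[i][j]` be the fraction of documents truly in category j that the probes count toward category i. The observed match counts are then roughly `M` times the true counts, and the true counts can be recovered by solving that linear system. The matrix values, the counts, and the helper names below are hypothetical.

```python
def is_strictly_diagonally_dominant(matrix):
    """True if each diagonal entry outweighs the rest of its row."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    )


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Hypothetical normalized confusion matrix for (Sports, Health).
M = [[0.8, 0.1],
     [0.2, 0.9]]
observed = [820, 380]  # hypothetical match counts from probing

assert is_strictly_diagonally_dominant(M)  # so M is invertible
adjusted = solve(M, observed)
print([round(x) for x in adjusted])
```

Strict diagonal dominance keeps zero outside every Gershgorin disk, so the matrix has no zero eigenvalue and the system has a unique solution.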

20
3.4 Using Probing Results for Classification (1)
  • Classify the database in a top-down way
  • 1. Each database is first classified by the
    root-level classifier
  • 2. It is then recursively pushed down to the
    lower-level classifiers

21
3.4 Using Probing Results for Classification (2)
  • Call Classify(root, D)
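The recursive descent behind a call like `Classify(root, D)` can be sketched as follows. This is a simplified illustration, not the algorithm as given on the slides. `estimate_counts` is a hypothetical callable returning adjusted per-subcategory document counts. The thresholds, and the way specificity is carried down from the parent, are assumptions.

```python
def classify(category, database, hierarchy, estimate_counts,
             tau_c=100, tau_s=0.4, parent_specificity=1.0):
    """Return the set of categories assigned to `database`, starting at `category`.

    hierarchy: dict mapping a category to its list of subcategories.
    estimate_counts: hypothetical callable (category, database) -> dict of
        adjusted document counts, one per subcategory of `category`.
    """
    children = hierarchy.get(category, [])
    if not children:  # leaf: nothing left to refine
        return {category}
    counts = estimate_counts(category, database)
    total = sum(counts.values())
    result = set()
    for child in children:
        cov = counts.get(child, 0)
        spec = parent_specificity * cov / total if total else 0.0
        if cov >= tau_c and spec >= tau_s:
            result |= classify(child, database, hierarchy, estimate_counts,
                               tau_c, tau_s, spec)
    return result or {category}  # no child qualified: stay at this node


hierarchy = {"Root": ["Sports", "Health"], "Sports": ["Basketball", "Soccer"]}
# Hypothetical adjusted counts, keyed by the node whose classifier was probed.
db = {"Root": {"Sports": 900, "Health": 100},
      "Sports": {"Basketball": 850, "Soccer": 50}}

print(classify("Root", db, hierarchy, lambda node, d: d[node]))
```

A database is pushed into a subcategory only when it clears both thresholds there. Otherwise it stays at the deepest node it reached.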

22
3.4 Using Probing Results for Classification (3)
23
4. EXPERIMENTAL SETTING
  • Sec 4.1 Data Collections
      • Controlled databases
          • Homogeneous
          • Heterogeneous
      • Real web databases
  • Sec 4.2 Techniques for Comparison
      • Probe and Count (PnC)
      • Document Sampling (DS)
          • Query probing to automatically construct a
            language model of a text database
      • Title-based Querying (TQ)
          • One long query for each category using the
            title of the category itself augmented by the
            titles of all its subcategories
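As a rough illustration of how a title-based query could be assembled, the sketch below gathers a category's title and the titles of all its descendants into one query string. The hierarchy, the function names, and the choice to join titles with spaces are assumptions.

```python
def collect_titles(category, hierarchy):
    """Return the category title followed by all descendant titles."""
    titles = [category]
    for child in hierarchy.get(category, []):
        titles.extend(collect_titles(child, hierarchy))
    return titles


def title_query(category, hierarchy):
    """Build one long query from a category and its subcategories."""
    return " ".join(collect_titles(category, hierarchy))


# Hypothetical fragment of a category hierarchy.
hierarchy = {"Sports": ["Basketball", "Soccer"], "Basketball": ["College"]}
print(title_query("Sports", hierarchy))
```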

24
4.3 Evaluation Metrics (1)
25
4.3 Evaluation Metrics (2)
26
4.3 Evaluation Metrics (3)
  • Correct: Expanded(Programming)
  • Classified: Expanded(Java)

Precision: Java / Java = 1 / 1 = 1
Recall: Java / (Prog.., C.., Pe.., Java, Visu..) = 1 / 5
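The worked example can be reproduced with a short sketch. It assumes that Expanded(c) means the category c together with all of its subcategories, that precision divides the overlap by the expanded classified set, and that recall divides it by the expanded correct set. The four subcategory names are hypothetical stand-ins for the abbreviated labels on the slide.

```python
def expanded(category, hierarchy):
    """Return the category together with all of its subcategories."""
    result = {category}
    for child in hierarchy.get(category, []):
        result |= expanded(child, hierarchy)
    return result


def precision_recall(correct, classified, hierarchy):
    """Compare expanded category sets; returns (precision, recall)."""
    exp_correct = set().union(*(expanded(c, hierarchy) for c in correct))
    exp_classified = set().union(*(expanded(c, hierarchy) for c in classified))
    overlap = exp_correct & exp_classified
    return len(overlap) / len(exp_classified), len(overlap) / len(exp_correct)


# Hypothetical subcategory names; only "Java" is spelled out on the slide.
hierarchy = {"Programming": ["C/C++", "Perl", "Java", "Visual Basic"]}

precision, recall = precision_recall({"Programming"}, {"Java"}, hierarchy)
print(precision, recall)
```

Classifying the database one level too deep costs recall but not precision, which matches the 1 / 1 and 1 / 5 values above.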
27
5. EXPERIMENTAL RESULTS
  • Sec 5.1 Tuning the PnC Technique
  • Effect of Confusion Matrix Adjustment (CMA)
  • Effect of Feature Selection
  • ECoverage estimates with FS on were between 15%
    and 20% better
  • ESpecificity estimates with FS on were around
    10% better

28
5.2 Results over the Controlled Databases
29
5.3 Results over the Web Databases
30
6. CONCLUSIONS AND FUTURE WORK
  • Presented a novel and efficient method for
    the hierarchical classification of text databases
    on the web.
  • A step that would completely automate the
    classification process is to eliminate the need for
    a human to construct the simple wrapper for each
    database to classify.
  • This need can be eliminated by automatically
    learning how to parse the pages with query results.