1
Automatic Classification of Text Databases
Through Query Probing
  • Panagiotis G. Ipeirotis
  • Luis Gravano
  • Columbia University
  • Mehran Sahami
  • E.piphany Inc.

2
Search-only Text Databases
  • Sources of valuable information
  • Hidden behind search interfaces
  • Non-crawlable
  • Example: Microsoft Support KB

3
Interacting With Searchable Text Databases
  • Searching: Metasearchers
  • Browsing: Use Yahoo-like directories
  • Browse + search: Category-enabled metasearchers

4
Searching Text Databases: Metasearchers
  • Select the good databases for a query
  • Evaluate the query at these databases
  • Combine the query results from the databases
  • Examples: MetaCrawler, SavvySearch, ProFusion

5
Browsing Through Text Databases
  • Yahoo-like web directories
  • InvisibleWeb.com
  • SearchEngineGuide.com
  • TheBigHub.com
  • Example from InvisibleWeb.com:
  • Computers > Publications > ACM DL
  • Category-enabled metasearchers
  • User-defined category (e.g. Recipes)

6
Problem With Current Classification Approach
  • Classification of databases is done manually
  • This requires a lot of human effort!

7
How to Classify Text Databases Automatically
Outline
  • Definition of classification
  • Strategies for classifying searchable databases
    through query probing
  • Initial experiments

8
Database Classification Two Definitions
  • Coverage-based classification
  • The database contains many documents about the
    category (e.g. Basketball)
  • Coverage = number of documents about this category
  • Specificity-based classification
  • The database contains mainly documents about this
    category
  • Specificity = fraction of the database's documents
    that are about this category (see the sketch below)
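A minimal sketch of the two measures (Python; the counts below are made-up
numbers for the Basketball example, not figures from the talk):

    def coverage(docs_about_category):
        """Coverage: absolute number of documents about the category."""
        return docs_about_category

    def specificity(docs_about_category, total_docs_in_db):
        """Specificity: fraction of the database devoted to the category."""
        return docs_about_category / total_docs_in_db

    # Hypothetical: a broad sports site vs. a dedicated basketball site.
    print(coverage(50_000), specificity(50_000, 1_000_000))  # high coverage, low specificity
    print(coverage(20_000), specificity(20_000, 25_000))     # high coverage, high specificity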

9
Database Classification An Example
  • Category: Basketball
  • Coverage-based classification
  • ESPN.com, NBA.com
  • Specificity-based classification
  • NBA.com, but not ESPN.com

10
Categorizing a Text Database: Two Problems
  • Find the category of a given document
  • Find the category of all the documents inside the
    database

11
Categorizing Documents
  • Several text classifiers available
  • RIPPER (AT&T Research, William Cohen, 1995)
  • Input: A set of pre-classified, labeled documents
  • Output: A set of classification rules

12
Categorizing Documents RIPPER
  • Training set: Pre-classified documents
  • "Linux as a web server" → Computers
  • "Linux vs. Windows" → Computers
  • "Jordan was the leader of Chicago Bulls" → Sports
  • "Smoking causes lung cancer" → Health
  • Output: Rule-based classifier (applied as sketched below)
  • IF linux THEN Computers
  • IF jordan AND bulls THEN Sports
  • IF lung AND cancer THEN Health
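As a rough illustration (assumed, not the actual RIPPER implementation), such
rules classify a document by firing when all of their terms occur in it:

    RULES = [
        ({"linux"}, "Computers"),
        ({"jordan", "bulls"}, "Sports"),
        ({"lung", "cancer"}, "Health"),
    ]

    def classify(document):
        words = set(document.lower().split())
        for terms, category in RULES:
            if terms <= words:         # every term of the rule appears in the doc
                return category
        return None                    # no rule fired

    print(classify("Jordan was the leader of Chicago Bulls"))  # Sports
    print(classify("Linux as a web server"))                   # Computers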

13
Precision and Recall of Document Classifier
  • During the training phase:
  • 100 documents about computers
  • Computer rules matched 50 docs
  • From these 50 docs, 40 were about computers
  • Precision = 40/50 = 0.8
  • Recall = 40/100 = 0.4 (see the sketch below)
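The slide's arithmetic, written out as a tiny sketch of the standard
definitions:

    matched, correct, total_in_category = 50, 40, 100   # from the training phase
    precision = correct / matched               # 40 / 50  = 0.8
    recall = correct / total_in_category        # 40 / 100 = 0.4
    print(precision, recall)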

14
From Document to Database Classification
  • If we know the categories of all the documents,
    we are done!
  • But databases do not export such data!
  • How can we extract this information?

15
Our Approach Query Probing
  • Design a small set of queries to probe the
    databases
  • Categorize the database based on the probing
    results

16
Designing and Implementing Query Probes
  • The probes should extract information about the
    categories of the documents in the database
  • Start with a document classifier (RIPPER)
  • Transform each rule into a query
  • IF lung AND cancer THEN Health → lung cancer
  • IF linux THEN Computers → linux
  • Get the number of matches for each query (see the sketch below)
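A sketch of this rule-to-query transformation; search_db is a hypothetical
stand-in for whatever "number of matches" interface a database exposes, and the
exact query syntax (plain terms vs. explicit AND) depends on that interface:

    RULES = [
        (("lung", "cancer"), "Health"),
        (("linux",), "Computers"),
        (("jordan", "bulls"), "Sports"),
    ]

    def rule_to_query(terms):
        # IF lung AND cancer THEN Health  ->  query "lung cancer"
        return " ".join(terms)

    def probe(search_db):
        """Send one query per rule; keep only the match count per category."""
        counts = {}
        for terms, category in RULES:
            matches = search_db(rule_to_query(terms))   # an integer, e.g. 2349
            counts[category] = counts.get(category, 0) + matches
        return counts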

17
Three Categories and Three Databases
[Figure: the three query probes (linux → Computers, jordan AND bulls → Sports,
lung AND cancer → Health) sent to three databases: ACM DL, NBA.com, PubMed]
18
Using the Results for Classification
We use the probing results to estimate coverage and
specificity values (see the sketch below).
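One way to turn per-category document estimates into the two measures of slide
8 (a sketch; it assumes the database size can be approximated by the sum of the
category estimates, which the talk does not state):

    def classify_db(category_estimates):
        """category_estimates: category -> estimated number of documents."""
        total = sum(category_estimates.values())
        return {
            c: {"coverage": n, "specificity": n / total if total else 0.0}
            for c, n in category_estimates.items()
        }

    # e.g. a database whose probe results suggest it is mostly about Sports:
    print(classify_db({"Sports": 9000, "Computers": 300, "Health": 150}))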
19
Adjusting Query Results
  • Classifiers are not perfect!
  • Queries do not retrieve all the documents that
    belong to a category
  • Queries for one category match documents that
    do not belong to this category
  • From the training phase of the classifier we use
    precision and recall

20
Precision/Recall Adjustment
  • Computers category:
  • Rule linux: precision = 0.7
  • Rule cpu: precision = 0.9
  • Recall (for all the rules) = 0.4
  • Probing with queries for Computers:
  • Query linux → X1 matches → 0.7·X1 correct
    matches
  • Query cpu → X2 matches → 0.9·X2 correct matches
  • From the X1 + X2 documents found:
  • Expect 0.7·X1 + 0.9·X2 to be correct
  • Expect (0.7·X1 + 0.9·X2)/0.4 total Computers docs
    (worked sketch below)
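A worked sketch of this adjustment using the precision and recall values from
the slide; X1 and X2 below are hypothetical match counts:

    def adjusted_category_count(matches, precisions, recall):
        """Estimated docs in the category: sum(p_i * X_i) / recall."""
        correct = sum(p * x for p, x in zip(precisions, matches))
        return correct / recall

    X1, X2 = 1000, 500        # hypothetical counts returned by "linux" and "cpu"
    estimate = adjusted_category_count([X1, X2], [0.7, 0.9], 0.4)
    print(estimate)           # (0.7*1000 + 0.9*500) / 0.4 = 2875.0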

21
Initial Experiments
  • Used a collection of 20,000 newsgroup articles
  • Formed 5 categories
  • Computers (comp.)
  • Science (sci.)
  • Hobbies (rec.)
  • Society (soc., alt.atheism)
  • Misc (misc.forsale)
  • RIPPER trained with 10,000 newsgroup articles
  • Classifier: 29 rules, 32 words used
  • IF windows AND pc THEN Computers (precision = 0.75)
  • IF satellite AND space THEN Science
    (precision = 0.9)

22
Web-databases Probed
  • Using the newsgroup classifier, we probed four web
    databases:
  • Cora (www.cora.jprc.com)
  • CS Papers archive (Computers)
  • American Scientist (www.amsci.org)
  • Science and technology magazine (Science)
  • All Outdoors (www.alloutdoors.com)
  • Articles about outdoor activities (Hobbies)
  • Religion Today (www.religiontoday.com)
  • News and discussion about religions (Society)

23
Results
  • Only 29 queries per web site
  • No need for document retrieval!

24
Conclusions
  • Easy classification using only a small number of
    queries
  • No need for document retrieval
  • Only need a result like "X matches found"
  • Not limited to search-only databases
  • Every searchable database can be classified this
    way
  • Not limited to topical classification

25
Current Issues
  • Comprehensive classification scheme
  • Representative training data

26
Future Work
  • Use a hierarchical classification scheme
  • Test different search interfaces
  • Boolean model
  • Vector-space model
  • Different capabilities
  • Compare with document sampling (Callan et al.'s
    work, SIGMOD 1999, adapted for the classification
    task)
  • Study classification efficiency when documents
    are accessible

27
Related Work
  • Gauch (JUCS 1996)
  • Etzioni et al. (JIIS 1997)
  • Hawking & Thistlewaite (TOIS 1999)
  • Callan et al. (SIGMOD 1999)
  • Meng et al. (CoopIS 1999)