1
Document Categorization and Query Generation on
the World Wide Web Using WebACE
  • Daniel Boley et al.
  • University of Minnesota
  • Presented by
  • Ya Sun and Yanhong Zhang

2
Contents
  • Introduction
  • Related Work
  • WebACE Architecture
  • Clustering Methods
  • Experimental Evaluation
  • Search for and Categorization of Similar
    Documents
  • Conclusion

3
Introduction
  • Search engines often return inconsistent results
  • WebACE: an intelligent web agent
  • Clustering: two new partitioning-based algorithms
  • Query generation
  • Searching for similar web documents
  • Filtering and classifying into existing clusters

4
Related Work
  • Intelligent Search Agents
  • Interpret discovered information, e.g. FAQ-Finder
  • Learn structures of unfamiliar sources, e.g. ILA
  • Information Filtering/Categorization
  • Automatically retrieve, filter and categorize,
    e.g. HyPursuit
  • Personalized Web Agents
  • Learn user preferences and discover related
    information, e.g. WebWatcher

5
WebACE Architecture
  • Profile Creation Module → document vectors → Clustering Modules
  • Clustering Modules → clusters → Query Generator
  • Query Generator → queries → Search Mechanism
  • Search Mechanism → returned documents → Filter (optional, user option)
  • Filter → document vectors → Cluster Updater (optional), which feeds
    new documents back into the clusters (sketched below)
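
A minimal runnable sketch of this module pipeline; every function here is a hypothetical stub standing in for the real module, not the authors' code:

```python
# Hypothetical stubs illustrating the WebACE data flow; none of these
# names come from the paper.

def create_profiles(pages):                 # Profile Creation Module
    return [{"url": p, "vector": {}} for p in pages]

def cluster_documents(vectors):             # Clustering Modules (ARHP/PDDP)
    return [vectors]                        # stub: one cluster of everything

def generate_queries(clusters):             # Query Generator
    return ["stub query" for _ in clusters]

def search_web(queries):                    # Search Mechanism
    return []                               # stub: would hit a search engine

def webace(pages, use_filter=True):
    vectors = create_profiles(pages)
    clusters = cluster_documents(vectors)
    results = search_web(generate_queries(clusters))
    if use_filter:                          # optional Filter (user option)
        results = [r for r in results if r]
    return clusters, results                # Cluster Updater would merge these
```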
6
Association Rule Hypergraph Partitioning Algorithm
  • Hypergraph H = (V, E)
  • V: a set of vertices; here, the docs being clustered
  • E: a set of hyperedges, which can connect more
    than two vertices; here, sets of related docs
  • A weight is assigned to each hyperedge
  • Transactional form of document retrieval
  • Each document is viewed as an item
  • Each possible feature word is a transaction

7
Association Rule Hypergraph Partitioning
Algorithm (Cont)
  • Support count σ(C) = |{ t ∈ T : C ⊆ t }|
  • T: the set of transactions
  • t: a transaction, a subset of the item-set I
  • C: a subset of I
  • Association rule X ⇒ Y
  • X, Y: subsets of I
  • Support s = σ(X ∪ Y) / |T|
  • Confidence α = σ(X ∪ Y) / σ(X)
  • Task: find all rules with s and α greater than
    given minimums (see the sketch below)
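
A small sketch of these definitions in the paper's transactional form (documents as items; each word's transaction is the set of documents containing it); the toy data below is invented:

```python
def support_count(itemset, transactions):
    """sigma(C): the number of transactions that contain all of C."""
    return sum(1 for t in transactions if itemset <= t)

def rule_stats(X, Y, transactions):
    """Support and confidence of the association rule X => Y."""
    s = support_count(X | Y, transactions) / len(transactions)
    a = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, a

# Toy transactions: one frozenset of documents per feature word.
transactions = [
    frozenset({"d1", "d2"}),        # word w1 occurs in d1 and d2
    frozenset({"d1", "d2", "d3"}),  # word w2
    frozenset({"d2", "d3"}),        # word w3
]
print(rule_stats(frozenset({"d1"}), frozenset({"d2"}), transactions))
```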

8
Association Rule Hypergraph Partitioning
Algorithm (Cont)
  • The hypergraph representation
  • Represent each document as a vertex (item)
  • Compute all the frequent item-sets with a given
    support-count threshold
  • Represent each frequent item-set as a hyperedge
  • Assign each hyperedge a weight: the average
    confidence of the essential association rules of
    the set (see the sketch below)
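
A rough sketch of building the hypergraph, with brute-force enumeration standing in for Apriori; the weighting below is one plausible reading of "average confidence of the essential rules", not necessarily the paper's exact rule set:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size=3):
    """All document sets (size >= 2) contained in at least min_count
    transactions; brute force stands in for Apriori here."""
    items = sorted(set().union(*transactions))
    found = []
    for k in range(2, max_size + 1):
        for cand in combinations(items, k):
            if sum(1 for t in transactions if set(cand) <= t) >= min_count:
                found.append(frozenset(cand))
    return found

def hyperedge_weight(itemset, transactions):
    """Average confidence of rules {x} => itemset - {x}; one plausible
    reading of the 'essential' rules of the set."""
    sigma = lambda C: sum(1 for t in transactions if C <= t)
    confs = [sigma(itemset) / sigma(frozenset({x})) for x in itemset]
    return sum(confs) / len(confs)

trans = [frozenset({"d1", "d2", "d3"}),
         frozenset({"d1", "d2"}),
         frozenset({"d2", "d3"})]
for edge in frequent_itemsets(trans, min_count=2):
    print(sorted(edge), hyperedge_weight(edge, trans))
```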

9
Association Rule Hypergraph Partitioning
Algorithm (Cont)
  • The hypergraph partitioning
  • Partition the hypergraph by minimizing the weight
    of the hyperedges that are cut
  • Partition each part recursively
  • Stop partitioning when a fitness criterion is met
  • Filter out bad vertices with a connectivity
    function (sketched below)
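
A hedged sketch of the stopping and filtering steps; the exact fitness and connectivity functions are not given on the slide, so the forms below are plausible assumptions:

```python
def fitness(cluster, hyperedges):
    """Weight of hyperedges fully inside the cluster relative to the
    weight of hyperedges touching it (an assumed form)."""
    inside = sum(w for e, w in hyperedges if e <= cluster)
    touching = sum(w for e, w in hyperedges if e & cluster)
    return inside / touching if touching else 0.0

def connectivity(vertex, cluster, hyperedges):
    """Fraction of the cluster's internal hyperedges containing the
    vertex; vertices with low connectivity are filtered out."""
    internal = [e for e, _ in hyperedges if e <= cluster]
    if not internal:
        return 0.0
    return sum(1 for e in internal if vertex in e) / len(internal)

# hyperedges: (frozenset_of_docs, weight) pairs; toy values.
edges = [(frozenset({"d1", "d2"}), 0.9), (frozenset({"d2", "d3"}), 0.4)]
cluster = frozenset({"d1", "d2"})
print(fitness(cluster, edges), connectivity("d1", cluster, edges))
```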

10
Principal Direction Divisive Partitioning
  • Represent documents as a term frequency matrix M
  • Mij is the number of occurrences of word wi in
    document dj
  • Scaled so that results are independent of
    document length
  • Sparse: about 3% of entries are nonzero
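
A minimal sketch of constructing M, assuming whitespace tokenization and unit-length column scaling (the paper's exact scaling may differ):

```python
import numpy as np
from scipy.sparse import csc_matrix

def term_frequency_matrix(docs):
    """Word-by-document count matrix M with Mij = occurrences of word
    wi in document dj; columns scaled to unit length so long documents
    do not dominate."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            M[index[w], j] += 1.0
    M /= np.linalg.norm(M, axis=0)    # per-column scaling
    return csc_matrix(M), vocab

M, vocab = term_frequency_matrix(["web agent search", "cluster web docs"])
print(M.shape, len(vocab))
```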

11
Principal Direction Divisive Partitioning (Cont)
  • The centroid vector c for each cluster:
    c = (1/k) Σj dj
  • k: the number of documents in the cluster
  • The principal direction for each cluster
  • The direction of maximum variance, i.e. the
    principal eigenvector of the covariance matrix
  • Obtained by computing the leading left singular
    vector of M − c eᵀ, where e is the vector of all
    ones
  • Only matrix-vector products are required,
    preserving sparsity (see the sketch below)
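
A sketch of that computation with scipy, forming M − c eᵀ only implicitly through matrix-vector products so the sparse M is never densified:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import LinearOperator, svds

def principal_direction(M):
    """Leading left singular vector of A = M - c e^T, where c is the
    cluster centroid and e the all-ones vector; A is applied only via
    matrix-vector products, preserving sparsity."""
    m, n = M.shape
    c = np.asarray(M.mean(axis=1)).ravel()
    A = LinearOperator(
        (m, n),
        matvec=lambda x: M @ x - c * x.sum(),           # (M - c e^T) x
        rmatvec=lambda y: M.T @ y - np.full(n, c @ y),  # (M - c e^T)^T y
        dtype=float,
    )
    u, s, vt = svds(A, k=1)
    return u[:, 0]

# Toy 4-word x 3-document matrix.
M = csc_matrix(np.array([[2., 0., 1.],
                         [0., 3., 0.],
                         [1., 1., 2.],
                         [0., 2., 1.]]))
print(principal_direction(M))
```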

12
Principal Direction Divisive Partitioning (Cont)
  • Partitioning
  • Project all documents onto the principal
    direction and classify them by the sign of the
    result (see the sketch below)
  • Repeat the process on each cluster recursively
  • Stop condition: cluster scatter
  • Stop when the scatter values of all individual
    clusters fall below the scatter of the collection
    of centroid vectors
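
A sketch of one split step, using a stand-in direction vector for the toy example (in practice it would come from the SVD computation above):

```python
import numpy as np

def pddp_split(M, direction):
    """One PDDP step: center columns on the centroid, project onto the
    principal direction, and split by the sign of the projection."""
    c = M.mean(axis=1)
    proj = direction @ (M - c[:, None])
    left = np.where(proj <= 0)[0]      # documents on one side
    right = np.where(proj > 0)[0]      # documents on the other
    return left, right

# Toy 3-word x 4-document matrix and an arbitrary unit direction.
M = np.array([[2., 0., 1., 0.],
              [0., 3., 0., 1.],
              [1., 1., 2., 2.]])
d = np.array([1., -1., 0.]) / np.sqrt(2)
print(pddp_split(M, d))
```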

13
Experimental Evaluation
  • Comparative Evaluation of Clustering Algorithms
  • Compare ARHP, PDDP, HAC and AutoClass, with and
    without LSI
  • 185 web pages in 10 categories (BC, IP, EC, etc.)
  • Measure of the goodness of the clusters: entropy
  • Etotal = Σj (nj / n) Ej, where Ej is the entropy
    of cluster j and nj its size (see the sketch
    after this list)
  • Scalability of Clustering Algorithms
  • PDDP
  • ARHP
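
A sketch of the entropy measure from known class labels; the per-cluster formula Ej = −Σi pij log pij is the standard one (any normalization the authors apply is not shown on the slide):

```python
import numpy as np

def cluster_entropy(labels, num_classes):
    """E_j = -sum_i p_ij log p_ij over the class proportions p_ij of
    the documents in cluster j."""
    counts = np.bincount(labels, minlength=num_classes)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def total_entropy(clusters, num_classes):
    """E_total = sum_j (n_j / n) E_j, weighted by cluster sizes."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(np.array(c), num_classes)
               for c in clusters)

# Two toy clusters holding documents with known class labels 0..2.
print(total_entropy([[0, 0, 1], [2, 2, 2]], num_classes=3))
```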

14
10 Feature Selection Methods
15
Comparative Evaluation of
Clustering Algorithms
16
Comparative Evaluation of
Clustering Algorithms (Cont)
17
Comparative Evaluation of
Clustering Algorithms (Cont)
18
Comparative Evaluation of
Clustering Algorithms (Cont)
Figure 6: Entropy of different algorithms. Note
that lower entropy indicates better cohesiveness
of clusters.
19
Conclusions from experimental results
  • Both PDDP and ARHP performed better than HAC and
    AutoClass, regardless of the feature selection
    criteria used
  • Dramatic differences in the run times of the four
    methods (dataset size 185 × 10538):
  • ARHP, PDDP: under 2 mins
  • HAC: 1 hr 40 mins
  • AutoClass: 38 mins

20
Dimensionality Reduction using LSI/SVD
Figure 7: Comparison of entropies for the E1
feature set, with and without LSI.
21
Scalability of Clustering Algorithms
  • Data sets
  • D1: 2340 docs, 21,839 words
  • D3, D9, D10: reduced dictionaries
  • D3: 8104 words
  • D9: 7358 words
  • D10: 1458 words

22
Scalability of PDDP
Figure 8: Entropies from the PDDP algorithm with
various numbers of clusters.
23
Scalability of PDDP (Cont)
Figure 9: Times for the PDDP algorithm on an SGI
versus the number of nonzeros in the term
frequency matrix, for both E and D series.
24
Scalability of ARHP
Figure 10: Entropies from the ARHP algorithm with
various numbers of clusters.
25
Scalability of ARHP (Cont)
Figure 11: Hypergraph partitioning time for both
E and D series.
26
Search for and Categorization of Similar
Documents
  • A representative set of words is used in a web
    search
  • TF: the list of k words that have the highest
    average text frequency
  • DF: the list of k words that have the highest
    document frequency
  • The query can be formed as
  • (c1 ∧ c2 ∧ … ∧ cm) ∧ (t1 ∨ t2 ∨ … ∨ tn)
  • where ci ∈ TF ∩ DF and ti ∈ TF − DF (see the
    sketch below)
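
A sketch of forming that query from a cluster's documents, assuming whitespace tokenization and plain AND/OR syntax (the actual search-engine syntax would vary):

```python
from collections import Counter

def build_query(docs, k=5):
    """Words in TF ∩ DF become required (AND) terms; words only in TF
    become alternatives (OR), per the query form above."""
    tf, df = Counter(), Counter()
    for d in docs:
        words = d.split()
        for w in words:
            tf[w] += 1 / len(docs)     # average text frequency
        for w in set(words):
            df[w] += 1                 # document frequency
    TF = {w for w, _ in tf.most_common(k)}
    DF = {w for w, _ in df.most_common(k)}
    parts = sorted(TF & DF)            # the c_i terms
    optional = sorted(TF - DF)         # the t_i terms
    if optional:
        parts.append("(" + " OR ".join(optional) + ")")
    return " AND ".join(parts)

print(build_query(["web agent cluster web", "web cluster search"], k=3))
```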

27
Search for and Categorization of Similar
Documents (Cont)
  • The documents returned as the result of the
    queries can be handled in two ways
  • ARHP can be used to filter out non-relevant
    documents
  • They can be added to the existing clusters using
    ARHP or PDDP

28
Conclusion and Future Work
  • Conclusion
  • ARHP and PDDP are capable of extracting higher
    quality clusters.
  • They are fast and scale with the number of words
    in the documents
  • Future work
  • Explore the performance of the entire agent as an
    integrated and fully automated system
  • Develop a method for evaluating the quality of
    clusters that is not based on a priori class
    labels