Title: Document Categorization and Query Generation on the World Wide Web Using WebACE

Slide 1: Document Categorization and Query Generation on the World Wide Web Using WebACE
- Daniel Boley et al.
- University of Minnesota
- Presented by Ya Sun and Yanhong Zhang
Slide 2: Contents
- Introduction
- Related Work
- WebACE Architecture
- Clustering Methods
- Experimental Evaluation
- Search for and Categorization of Similar Documents
- Conclusion
Slide 3: Introduction
- Search engines often return inconsistent results
- WebACE: an intelligent web agent
- Clustering: two new partitioning-based algorithms
- Query generation
- Searching for similar web documents
- Filtering and classifying them into existing clusters
Slide 4: Related Work
- Intelligent Search Agents
- Interpret discovered information, e.g. FAQ-Finder
- Learn the structures of unfamiliar sources, e.g. ILA
- Information Filtering / Categorization
- Automatically retrieve, filter, and categorize, e.g. HyPursuit
- Personalized Web Agents
- Learn user preferences and discover related information, e.g. WebWatcher
Slide 5: WebACE Architecture
[Architecture diagram: a Profile Creation Module produces document vectors for the Clustering Modules; the resulting clusters feed a Query Generator, whose queries drive the Search Mechanism; returned documents pass, depending on a user option, through an optional Filter and an optional Cluster Updater back into the clusters as document vectors.]
Slide 6: Association Rule Hypergraph Partitioning Algorithm
- Hypergraph H = (V, E)
- V: a set of vertices; here, the documents being clustered
- E: a set of hyperedges, each able to connect more than two vertices; here, sets of related documents
- A weight is assigned to each hyperedge
- Transactional form of document retrieval
- Each document is viewed as an item
- Each possible feature word is a transaction
Slide 7: Association Rule Hypergraph Partitioning Algorithm (cont.)
- Support count
- T: the set of transactions
- t: a transaction, a subset of the item-set I
- C: a subset of I; its support count is the number of transactions that contain C
- Association rule X => Y
- X, Y: subsets of I
- Support s: the fraction of transactions that contain X ∪ Y
- Confidence α: the fraction of transactions containing X that also contain Y
- Task: find all rules with s and α greater than given minimums
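As a rough illustration of these definitions, support and confidence can be computed as below. The transactions are hypothetical toy data in the slide's transactional view, where each transaction is the set of documents in which one feature word occurs; none of this data comes from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical toy data: items are documents d1..d4; each transaction is
# the set of documents in which one feature word occurs.
transactions = [
    {"d1", "d2"},
    {"d1", "d2", "d3"},
    {"d3", "d4"},
    {"d1", "d2"},
]

s = support({"d1", "d2"}, transactions)       # 3 of 4 transactions -> 0.75
a = confidence({"d1"}, {"d2"}, transactions)  # 0.75 / 0.75 -> 1.0
```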
Slide 8: Association Rule Hypergraph Partitioning Algorithm (cont.)
- The hypergraph representation
- Represent each document as a vertex (item)
- Compute all the frequent item-sets, given a threshold support count
- Represent each frequent set as a hyperedge
- Assign as the weight the average confidence of the essential association rules of the set
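The construction just described can be sketched as follows. The brute-force enumeration stands in for Apriori, and the weight averages the confidence of the simple rules {i} => itemset − {i}, a simplification of the paper's essential rules; the transactions are hypothetical toy data.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy data: each transaction is the set of documents in
# which one feature word occurs.
transactions = [
    {"d1", "d2"},
    {"d1", "d2", "d3"},
    {"d3", "d4"},
    {"d1", "d2"},
]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force stand-in for Apriori: every itemset of size >= 2 whose
    support meets the threshold becomes a hyperedge."""
    items = sorted(set().union(*transactions))
    freq = []
    for k in range(2, max_size + 1):
        for combo in combinations(items, k):
            s = sum(1 for t in transactions if set(combo) <= t) / len(transactions)
            if s >= min_support:
                freq.append(frozenset(combo))
    return freq

def hyperedge_weight(itemset, transactions):
    """Hyperedge weight: average confidence of the rules {i} => itemset - {i}
    (a simplification of the essential association rules)."""
    sup_all = sum(1 for t in transactions if itemset <= t)
    confs = [sup_all / sum(1 for t in transactions if i in t) for i in itemset]
    return mean(confs)

edges = frequent_itemsets(transactions, min_support=0.5)
weights = {e: hyperedge_weight(e, transactions) for e in edges}
```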
Slide 9: Association Rule Hypergraph Partitioning Algorithm (cont.)
- The hypergraph partitioning
- Partition the hypergraph by minimizing the weight of the hyperedges that are cut
- Partition each part recursively
- Stop partitioning when a fitness criterion is met
- Filter out bad vertices using a connectivity function
Slide 10: Principal Direction Divisive Partitioning
- Represent the documents as a term-frequency matrix M
- Mij is the number of occurrences of word wi in document dj
- Columns are scaled so that results are independent of document length
- The matrix is sparse: only about 3% of the entries are nonzero
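Building such a matrix can be sketched as below; the unit-length column scaling is one plausible way to make the result independent of document length (the exact normalization used in the paper may differ), and the documents and vocabulary are hypothetical.

```python
import numpy as np

def term_frequency_matrix(docs, vocab):
    """M[i, j] = count of word i in document j, with each column scaled
    to unit length so results do not depend on document length."""
    M = np.zeros((len(vocab), len(docs)))
    index = {w: i for i, w in enumerate(vocab)}
    for j, doc in enumerate(docs):
        for w in doc.split():
            if w in index:
                M[index[w], j] += 1
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0          # avoid dividing an all-zero column
    return M / norms

docs = ["web agent web", "cluster web"]   # hypothetical documents
vocab = ["web", "agent", "cluster"]
M = term_frequency_matrix(docs, vocab)
```

On real data the matrix would be stored in a sparse format, since only a few percent of the entries are nonzero.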
Slide 11: Principal Direction Divisive Partitioning (cont.)
- The centroid vector c for each cluster: the mean of its document vectors (k = the number of documents in the cluster)
- The principal direction for each cluster
- The direction of maximum variance: the principal eigenvector of the covariance matrix
- Obtained by computing the leading left singular vector of (M - c e^T), where e is the vector of all ones
- Only matrix-vector products are required, preserving sparsity
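A minimal dense sketch of the centroid and principal-direction computation follows; NumPy's dense SVD is used for clarity, whereas a real implementation would use an iterative solver driven only by matrix-vector products, as the slide notes. The example matrix is hypothetical.

```python
import numpy as np

def principal_direction(M):
    """Principal direction of the columns of M: the leading left singular
    vector of M - c e^T, where c is the centroid of the columns and e is
    the all-ones vector (formed here via broadcasting)."""
    c = M.mean(axis=1, keepdims=True)                     # centroid vector
    U, s, Vt = np.linalg.svd(M - c, full_matrices=False)  # M - c e^T
    return U[:, 0], c.ravel()

# Two documents differing only in the first coordinate, so the direction
# of maximum variance must be the first axis (up to sign).
M = np.array([[0.0, 2.0],
              [1.0, 1.0]])
u, c = principal_direction(M)
```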
Slide 12: Principal Direction Divisive Partitioning (cont.)
- Partitioning
- Project all documents onto the principal direction and classify them by the sign of the result
- Repeat the process on each cluster recursively
- Stopping condition: the scatter of the clusters
- Stop when the scatter values of all individual clusters fall below the scatter of the centroid vectors
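One splitting step can be sketched as follows; recursing on each side until the scatter condition holds completes the algorithm. The example matrix is hypothetical.

```python
import numpy as np

def pddp_split(M, cols):
    """One PDDP step: center the selected columns, project them onto the
    principal direction, and split them by the sign of the projection."""
    sub = M[:, cols]
    c = sub.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(sub - c, full_matrices=False)
    proj = U[:, 0] @ (sub - c)          # signed projection of each document
    left = [j for j, p in zip(cols, proj) if p <= 0]
    right = [j for j, p in zip(cols, proj) if p > 0]
    return left, right

# Four documents forming two obvious groups along the first word.
M = np.array([[0.0, 0.0, 5.0, 5.0],
              [1.0, 1.0, 1.0, 1.0]])
left, right = pddp_split(M, [0, 1, 2, 3])
```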
Slide 13: Experimental Evaluation
- Comparative evaluation of clustering algorithms
- Compare ARHP, PDDP, HAC, and AutoClass, with and without LSI
- 185 web pages in 10 categories (BC, IP, EC, etc.)
- Measure of cluster goodness: entropy
- E_total = Σ_j (n_j / n) E_j
- Scalability of clustering algorithms
- PDDP
- ARHP
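The entropy measure can be computed as below, assuming the true class label of each document is known (which is what makes it an a-priori-label-based measure, as the final slide notes):

```python
from math import log

def cluster_entropy(labels):
    """Entropy E_j of one cluster, from the known class labels of its
    members; 0 means the cluster is pure."""
    n = len(labels)
    return -sum(labels.count(c) / n * log(labels.count(c) / n)
                for c in set(labels))

def total_entropy(clusters):
    """E_total = sum_j (n_j / n) * E_j, the size-weighted average."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)
```

Lower values indicate better cohesiveness: two pure clusters give E_total = 0, while a maximally mixed two-class cluster gives log 2.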
Slide 14: 10 Feature Selection Methods
Slide 15: Comparative Evaluation of Clustering Algorithms
Slide 16: Comparative Evaluation of Clustering Algorithms
Slide 17: Comparative Evaluation of Clustering Algorithms (cont.)
Slide 18: Comparative Evaluation of Clustering Algorithms (cont.)
Figure 6: Entropy of the different algorithms. Note that lower entropy indicates better cohesiveness of clusters.
Slide 19: Conclusions from Experimental Results
- Both PDDP and ARHP performed better than HAC and AutoClass, regardless of the feature selection criteria used
- Dramatic differences in the run times of the four methods (dataset size 185 x 10538):
- ARHP, PDDP: < 2 minutes
- HAC: 1 hour 40 minutes
- AutoClass: 38 minutes
Slide 20: Dimensionality Reduction Using LSI/SVD
Figure 7: Comparison of entropies for E1, with and without LSI.
Slide 21: Scalability of Clustering Algorithms
- Data sets
- D1: 2340 docs, 21,839 words
- D3, D9, D10: reduced dictionaries
- D3: 8104 words
- D9: 7358 words
- D10: 1458 words
Slide 22: Scalability of PDDP
Figure 8: Entropies from the PDDP algorithm with various numbers of clusters.
Slide 23: Scalability of PDDP (cont.)
Figure 9: Times for the PDDP algorithm on an SGI versus the number of nonzeros in the term-frequency matrix, for both the E and D series.
Slide 24: Scalability of ARHP
Figure 10: Entropies from the ARHP algorithm with various numbers of clusters.
Slide 25: Scalability of ARHP (cont.)
Figure 11: Hypergraph partitioning time for both the E and D series.
Slide 26: Search for and Categorization of Similar Documents
- A representative set of words is used in a web search
- TF: the list of k words with the highest average text frequency
- DF: the list of k words with the highest document frequency
- The query can be formed as (c1 ∧ c2 ∧ … ∧ cm) ∧ (t1 ∨ t2 ∨ … ∨ tn)
- where ci ∈ TF ∩ DF and ti ∈ TF − DF
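Query generation can be sketched as follows; the tokenization, the choice of k, the tie-breaking, and the textual AND/OR rendering are all illustrative assumptions, and the cluster data is hypothetical.

```python
def generate_query(docs, k=2):
    """Sketch of the query generator: docs is a list of token lists for
    one cluster. TF = the k words with the highest average term frequency,
    DF = the k words with the highest document frequency; words in
    TF ∩ DF are AND-ed, words in TF - DF are OR-ed."""
    words = sorted(set(w for d in docs for w in d))   # sorted for determinism
    tf = sorted(words, key=lambda w: -sum(d.count(w) for d in docs) / len(docs))[:k]
    df = sorted(words, key=lambda w: -sum(1 for d in docs if w in d))[:k]
    c = [w for w in tf if w in df]                    # c_i in TF ∩ DF
    t = [w for w in tf if w not in df]                # t_i in TF - DF
    query = " AND ".join(c)
    if t:
        query += " AND (" + " OR ".join(t) + ")"
    return query

# Hypothetical cluster of two tokenized documents.
docs = [["web", "web", "agent"], ["web", "cluster", "agent"]]
q = generate_query(docs, k=2)
```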
Slide 27: Search for and Categorization of Similar Documents (cont.)
- The documents returned as the result of the queries can then be handled:
- ARHP can be used to filter out non-relevant documents
- Relevant documents can be added to the existing clusters using ARHP or PDDP
Slide 28: Conclusion and Future Work
- Conclusion
- ARHP and PDDP are capable of extracting higher-quality clusters
- They are fast and scale with the number of words in the documents
- Future work
- Explore the performance of the entire agent as an integrated, fully automated system
- Develop a method for evaluating cluster quality that is not based on a priori class labels