Title: Document Categorization and Query Generation on the World Wide Web Using WebACE

Slide 1: Document Categorization and Query Generation on the World Wide Web Using WebACE
- Daniel Boley et al.
- University of Minnesota
- Presented by Ya Sun and Yanhong Zhang
Slide 2: Contents
- Introduction
- Related Work
- WebACE Architecture
- Clustering Methods
- Experimental Evaluation
- Search for and Categorization of Similar Documents
- Conclusion
Slide 3: Introduction
- Search engines often return inconsistent results
- WebACE: an intelligent web agent
- Clustering: two new partitioning-based algorithms
- Query generation
- Searching for similar web documents
- Filtering and classifying them into existing clusters
Slide 4: Related Work
- Intelligent Search Agents
- Interpret discovered information, e.g. FAQ-Finder
- Learn the structures of unfamiliar sources, e.g. ILA
- Information Filtering / Categorization
- Automatically retrieve, filter, and categorize, e.g. HyPursuit
- Personalized Web Agents
- Learn user preferences and discover related information, e.g. WebWatcher
Slide 5: WebACE Architecture
[Architecture diagram: a Profile Creation Module produces document vectors for the Clustering Modules; the resulting clusters feed a Query Generator, whose queries drive the Search Mechanism; returned documents pass, depending on a user option, through an optional Filter and an optional Cluster Updater back into the clusters as document vectors.]
Slide 6: Association Rule Hypergraph Partitioning Algorithm
- Hypergraph H = (V, E)
- V: a set of vertices; here, the documents being clustered
- E: a set of hyperedges, each able to connect more than two vertices; here, sets of related documents
- A weight is assigned to each hyperedge
- Transactional form of document retrieval
- Each document is viewed as an item
- Each possible feature word is a transaction
Slide 7: Association Rule Hypergraph Partitioning Algorithm (cont.)
- Support count
- T: the set of transactions
- t: a transaction, a subset of the item-set I
- C: a subset of I; its support count is the number of transactions that contain C
- Association rule X => Y
- X, Y: subsets of I
- Support s: the fraction of transactions that contain X ∪ Y
- Confidence α: the fraction of transactions containing X that also contain Y
- Task: find all rules with s and α greater than given minimums
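As a rough illustration of these definitions, support and confidence can be computed as below. The transactions are hypothetical toy data in the slide's transactional view, where each transaction is the set of documents in which one feature word occurs; none of this data comes from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical toy data: items are documents d1..d4; each transaction is
# the set of documents in which one feature word occurs.
transactions = [
    {"d1", "d2"},
    {"d1", "d2", "d3"},
    {"d3", "d4"},
    {"d1", "d2"},
]

s = support({"d1", "d2"}, transactions)       # 3 of 4 transactions -> 0.75
a = confidence({"d1"}, {"d2"}, transactions)  # 0.75 / 0.75 -> 1.0
```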
Slide 8: Association Rule Hypergraph Partitioning Algorithm (cont.)
- The hypergraph representation
- Represent each document as a vertex (item)
- Compute all the frequent item-sets, given a threshold support count
- Represent each frequent set as a hyperedge
- Assign as the weight the average confidence of the essential association rules of the set
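The construction just described can be sketched as follows. The brute-force enumeration stands in for Apriori, and the weight averages the confidence of the simple rules {i} => itemset − {i}, a simplification of the paper's essential rules; the transactions are hypothetical toy data.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy data: each transaction is the set of documents in
# which one feature word occurs.
transactions = [
    {"d1", "d2"},
    {"d1", "d2", "d3"},
    {"d3", "d4"},
    {"d1", "d2"},
]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force stand-in for Apriori: every itemset of size >= 2 whose
    support meets the threshold becomes a hyperedge."""
    items = sorted(set().union(*transactions))
    freq = []
    for k in range(2, max_size + 1):
        for combo in combinations(items, k):
            s = sum(1 for t in transactions if set(combo) <= t) / len(transactions)
            if s >= min_support:
                freq.append(frozenset(combo))
    return freq

def hyperedge_weight(itemset, transactions):
    """Hyperedge weight: average confidence of the rules {i} => itemset - {i}
    (a simplification of the essential association rules)."""
    sup_all = sum(1 for t in transactions if itemset <= t)
    confs = [sup_all / sum(1 for t in transactions if i in t) for i in itemset]
    return mean(confs)

edges = frequent_itemsets(transactions, min_support=0.5)
weights = {e: hyperedge_weight(e, transactions) for e in edges}
```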
Slide 9: Association Rule Hypergraph Partitioning Algorithm (cont.)
- The hypergraph partitioning
- Partition the hypergraph by minimizing the weight of the hyperedges that are cut
- Partition each part recursively
- Stop partitioning when a fitness criterion is met
- Filter out bad vertices using a connectivity function
Slide 10: Principal Direction Divisive Partitioning
- Represent the documents as a term-frequency matrix M
- Mij is the number of occurrences of word wi in document dj
- Columns are scaled so that results are independent of document length
- The matrix is sparse: only about 3% of the entries are nonzero
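Building such a matrix can be sketched as below; the unit-length column scaling is one plausible way to make the result independent of document length (the exact normalization used in the paper may differ), and the documents and vocabulary are hypothetical.

```python
import numpy as np

def term_frequency_matrix(docs, vocab):
    """M[i, j] = count of word i in document j, with each column scaled
    to unit length so results do not depend on document length."""
    M = np.zeros((len(vocab), len(docs)))
    index = {w: i for i, w in enumerate(vocab)}
    for j, doc in enumerate(docs):
        for w in doc.split():
            if w in index:
                M[index[w], j] += 1
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0          # avoid dividing an all-zero column
    return M / norms

docs = ["web agent web", "cluster web"]   # hypothetical documents
vocab = ["web", "agent", "cluster"]
M = term_frequency_matrix(docs, vocab)
```

On real data the matrix would be stored in a sparse format, since only a few percent of the entries are nonzero.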
Slide 11: Principal Direction Divisive Partitioning (cont.)
- The centroid vector c for each cluster: the mean of its document vectors (k = the number of documents in the cluster)
- The principal direction for each cluster
- The direction of maximum variance: the principal eigenvector of the covariance matrix
- Obtained by computing the leading left singular vector of (M - c e^T), where e is the vector of all ones
- Only matrix-vector products are required, preserving sparsity
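A minimal dense sketch of the centroid and principal-direction computation follows; NumPy's dense SVD is used for clarity, whereas a real implementation would use an iterative solver driven only by matrix-vector products, as the slide notes. The example matrix is hypothetical.

```python
import numpy as np

def principal_direction(M):
    """Principal direction of the columns of M: the leading left singular
    vector of M - c e^T, where c is the centroid of the columns and e is
    the all-ones vector (formed here via broadcasting)."""
    c = M.mean(axis=1, keepdims=True)                     # centroid vector
    U, s, Vt = np.linalg.svd(M - c, full_matrices=False)  # M - c e^T
    return U[:, 0], c.ravel()

# Two documents differing only in the first coordinate, so the direction
# of maximum variance must be the first axis (up to sign).
M = np.array([[0.0, 2.0],
              [1.0, 1.0]])
u, c = principal_direction(M)
```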
Slide 12: Principal Direction Divisive Partitioning (cont.)
- Partitioning
- Project all documents onto the principal direction and classify them by the sign of the result
- Repeat the process on each cluster recursively
- Stopping condition: the scatter of the clusters
- Stop when the scatter values of all individual clusters fall below the scatter of the centroid vectors
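One splitting step can be sketched as follows; recursing on each side until the scatter condition holds completes the algorithm. The example matrix is hypothetical.

```python
import numpy as np

def pddp_split(M, cols):
    """One PDDP step: center the selected columns, project them onto the
    principal direction, and split them by the sign of the projection."""
    sub = M[:, cols]
    c = sub.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(sub - c, full_matrices=False)
    proj = U[:, 0] @ (sub - c)          # signed projection of each document
    left = [j for j, p in zip(cols, proj) if p <= 0]
    right = [j for j, p in zip(cols, proj) if p > 0]
    return left, right

# Four documents forming two obvious groups along the first word.
M = np.array([[0.0, 0.0, 5.0, 5.0],
              [1.0, 1.0, 1.0, 1.0]])
left, right = pddp_split(M, [0, 1, 2, 3])
```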
Slide 13: Experimental Evaluation
- Comparative evaluation of clustering algorithms
- Compare ARHP, PDDP, HAC, and AutoClass, with and without LSI
- 185 web pages in 10 categories (BC, IP, EC, etc.)
- Measure of cluster goodness: entropy
- E_total = Σ_j (n_j / n) E_j
- Scalability of clustering algorithms
- PDDP
- ARHP
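The entropy measure can be computed as below, assuming the true class label of each document is known (which is what makes it an a-priori-label-based measure, as the final slide notes):

```python
from math import log

def cluster_entropy(labels):
    """Entropy E_j of one cluster, from the known class labels of its
    members; 0 means the cluster is pure."""
    n = len(labels)
    return -sum(labels.count(c) / n * log(labels.count(c) / n)
                for c in set(labels))

def total_entropy(clusters):
    """E_total = sum_j (n_j / n) * E_j, the size-weighted average."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)
```

Lower values indicate better cohesiveness: two pure clusters give E_total = 0, while a maximally mixed two-class cluster gives log 2.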
Slide 14: 10 Feature Selection Methods
Slide 15: Comparative Evaluation of Clustering Algorithms
Slide 16: Comparative Evaluation of Clustering Algorithms
Slide 17: Comparative Evaluation of Clustering Algorithms (cont.)
Slide 18: Comparative Evaluation of Clustering Algorithms (cont.)
Figure 6: Entropy of the different algorithms. Note that lower entropy indicates better cohesiveness of clusters.
Slide 19: Conclusions from Experimental Results
- Both PDDP and ARHP performed better than HAC and AutoClass, regardless of the feature selection criteria used
- Dramatic differences in the run times of the four methods (dataset size 185 x 10538):
- ARHP, PDDP: < 2 minutes
- HAC: 1 hour 40 minutes
- AutoClass: 38 minutes
Slide 20: Dimensionality Reduction Using LSI/SVD
Figure 7: Comparison of entropies for E1, with and without LSI.
Slide 21: Scalability of Clustering Algorithms
- Data sets
- D1: 2340 docs, 21,839 words
- D3, D9, D10: reduced dictionaries
- D3: 8104 words
- D9: 7358 words
- D10: 1458 words
Slide 22: Scalability of PDDP
Figure 8: Entropies from the PDDP algorithm with various numbers of clusters.
Slide 23: Scalability of PDDP (cont.)
Figure 9: Times for the PDDP algorithm on an SGI versus the number of nonzeros in the term-frequency matrix, for both the E and D series.
Slide 24: Scalability of ARHP
Figure 10: Entropies from the ARHP algorithm with various numbers of clusters.
Slide 25: Scalability of ARHP (cont.)
Figure 11: Hypergraph partitioning time for both the E and D series.
Slide 26: Search for and Categorization of Similar Documents
- A representative set of words is used in a web search
- TF: the list of k words with the highest average text frequency
- DF: the list of k words with the highest document frequency
- The query can be formed as (c1 ∧ c2 ∧ … ∧ cm) ∧ (t1 ∨ t2 ∨ … ∨ tn)
- where ci ∈ TF ∩ DF and ti ∈ TF − DF
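Query generation can be sketched as follows; the tokenization, the choice of k, the tie-breaking, and the textual AND/OR rendering are all illustrative assumptions, and the cluster data is hypothetical.

```python
def generate_query(docs, k=2):
    """Sketch of the query generator: docs is a list of token lists for
    one cluster. TF = the k words with the highest average term frequency,
    DF = the k words with the highest document frequency; words in
    TF ∩ DF are AND-ed, words in TF - DF are OR-ed."""
    words = sorted(set(w for d in docs for w in d))   # sorted for determinism
    tf = sorted(words, key=lambda w: -sum(d.count(w) for d in docs) / len(docs))[:k]
    df = sorted(words, key=lambda w: -sum(1 for d in docs if w in d))[:k]
    c = [w for w in tf if w in df]                    # c_i in TF ∩ DF
    t = [w for w in tf if w not in df]                # t_i in TF - DF
    query = " AND ".join(c)
    if t:
        query += " AND (" + " OR ".join(t) + ")"
    return query

# Hypothetical cluster of two tokenized documents.
docs = [["web", "web", "agent"], ["web", "cluster", "agent"]]
q = generate_query(docs, k=2)
```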
Slide 27: Search for and Categorization of Similar Documents (cont.)
- The documents returned as the result of the queries can then be handled:
- ARHP can be used to filter out non-relevant documents
- Relevant documents can be added to the existing clusters using ARHP or PDDP
Slide 28: Conclusion and Future Work
- Conclusion
- ARHP and PDDP are capable of extracting higher-quality clusters
- They are fast and scale with the number of words in the documents
- Future work
- Explore the performance of the entire agent as an integrated, fully automated system
- Develop a method for evaluating cluster quality that is not based on a priori class labels