Clustering of Web Documents


Provided by: cacsLou


1

Clustering of Web Documents
Jinfeng Chen

Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu,
and Yuhen Hu, Correlation-based Document
Clustering using Web Logs, 2001.
Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and
Jinwen Ma, Learning to Cluster Web Search Results.

3
Correlation-based Document Clustering using Web
Logs

Introduction
Uses web log data to construct clusters.
Frequent simultaneous visits to two seemingly
unrelated documents should indicate that they are
in fact closely related.
The basic algorithm is DBSCAN, which groups
neighboring objects of the database into clusters
based on local distance information.

4
DBSCAN

Does not require the user to pre-specify the
number of clusters.
Requires only one scan through the database.
Takes two parameters: a radius value e and a value Mpts.
e: distance measure (radius)
Mpts: minimum number of points that
should occur around a dense object

5
DBSCAN algorithm (cont.)

Algorithm DBSCAN(DB, e, Mpts)
  for each o belonging to DB do
    if o is not yet assigned to a cluster
      if o is a core object then
        collect all objects density-reachable from o
        according to e and Mpts
        assign them to a new cluster
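
The pseudocode above can be sketched in Python as follows. This is a minimal sketch, not the presenters' implementation. It assumes that points are coordinate tuples compared by Euclidean distance and that a point counts as its own neighbour. The names region_query and dbscan are chosen here for illustration only.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue                        # already assigned or marked as noise
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:       # not a core object
            labels[i] = NOISE
            continue
        cluster += 1                        # start a new cluster from core object i
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:                        # collect everything density-reachable
            j = queue.pop()
            if labels[j] == NOISE:          # border point: claim it, do not expand
                labels[j] = cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:    # j is itself a core object
                queue.extend(j_neighbours)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is never passed in; it falls out of e and Mpts, which is the property the previous slide highlights.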

6
Limitations of DBSCAN in Clustering Web
Documents

Performs clustering using a fixed threshold
value to determine dense regions in the
document space.
Thus the algorithm often cannot distinguish
between dense and loose points, and the entire
document space is often lumped into a single cluster.

7
RDBC algorithm (Recursive Density-Based
Clustering)

The key difference between RDBC and DBSCAN is that in
RDBC, the identification of core points is
performed separately from the clustering of
individual data points.
Different values of e and Mpts are used in RDBC
to identify this core point set, Cset.

8
RDBC algorithm (cont.)

To avoid connecting too many clusters
through bridges:
Set initial values e = e1 and Mpts = Mpts1
WebPageSet = web_log
RDBC(e, Mpts, WebPageSet)
  use e and Mpts to get the core point set Cset
  if size(Cset) > size(WebPageSet)/2
    DBSCAN(e, Mpts, WebPageSet)
  else
    e = e/2; Mpts = Mpts/4
    RDBC(e, Mpts, WebPageSet)
    collect all other points in (WebPageSet - Cset)
    around the clusters found in the last step, according to e2
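
A rough Python sketch of the recursion follows, under several assumptions that the transcript does not confirm. The recursive call is made on Cset rather than on the full WebPageSet, because the final step collects the points in (WebPageSet - Cset) around the clusters already found. The value e2 is treated as a separate attachment radius, eps2. Mpts/4 is rounded down with a floor of 1. Points are coordinate tuples with Euclidean distance. All function names are illustrative.

```python
from math import dist

def neighbours(points, p, eps):
    """All points within eps of p (p itself included)."""
    return [q for q in points if dist(p, q) <= eps]

def core_points(points, eps, min_pts):
    """Cset: the points that have at least min_pts neighbours within eps."""
    return [p for p in points if len(neighbours(points, p, eps)) >= min_pts]

def dbscan(points, eps, min_pts):
    """Compact DBSCAN returning a list of clusters (sets); noise is dropped."""
    clusters, assigned = [], set()
    for p in points:
        if p in assigned or len(neighbours(points, p, eps)) < min_pts:
            continue
        cluster, queue = set(), [p]
        while queue:
            q = queue.pop()
            if q in cluster:
                continue
            cluster.add(q)
            near = neighbours(points, q, eps)
            if len(near) >= min_pts:        # only core objects expand the cluster
                queue.extend(n for n in near if n not in assigned)
        assigned |= cluster
        clusters.append(cluster)
    return clusters

def rdbc(points, eps, min_pts, eps2):
    cset = core_points(points, eps, min_pts)
    if not cset:
        return []                           # nothing dense enough to cluster
    if len(cset) > len(points) / 2:
        return dbscan(points, eps, min_pts)
    # Assumption: recurse on the core points only, with tightened parameters.
    clusters = rdbc(cset, eps / 2, max(1, min_pts // 4), eps2)
    # Attach each remaining point to the nearest cluster within eps2.
    frozen = [list(c) for c in clusters]    # distances use pre-attachment clusters
    core = set(cset)
    for p in points:
        if p in core:
            continue
        nearest = [min(dist(p, q) for q in members) for members in frozen]
        if nearest and min(nearest) <= eps2:
            clusters[nearest.index(min(nearest))].add(p)
    return clusters
```

Once Mpts reaches 1, every point is a core point, so the base case always fires and the recursion terminates.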

9
Construct WebPageSet from web logs

Step 1
Step 2 Delete visits to image files.
Step 3 Extract sessions from the data.

10
Construct WebPageSet (cont.)

Step 4 Create a distance matrix
1) Determine the size of a moving window,
within which URL requests
are regarded as co-occurring.
2) Calculate the co-occurrence count Ni,j
and the occurrence counts Ni, Nj of each pair of URLs.

11
Construct WebPageSet (cont.)

Step 4 Create a distance matrix
3) P(pi | pj) = Ni,j / Nj
4) Three distance functions
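
Sub-steps 1) to 3) can be illustrated with a short Python sketch. The counting convention is an assumption made here, not something the transcript states: Ni is the number of sessions that contain URL i, and Ni,j is the number of sessions in which i and j are requested within `window` consecutive requests of each other. The function names are hypothetical, and the three distance functions are not shown in the transcript, so they are not implemented.

```python
from collections import Counter

def cooccurrence_counts(sessions, window):
    """Return (single, pair): per-session URL counts and windowed pair counts."""
    single, pair = Counter(), Counter()
    for session in sessions:
        single.update(set(session))              # each URL counted once per session
        seen_pairs = set()
        for k, url in enumerate(session):
            # requests at positions k+1 .. k+window-1 fall in the same window as url
            for other in session[k + 1 : k + window]:
                if other != url:
                    seen_pairs.add(frozenset((url, other)))
        pair.update(seen_pairs)                  # each pair counted once per session
    return single, pair

def cond_prob(single, pair, pi, pj):
    """P(pi | pj) = Ni,j / Nj."""
    return pair[frozenset((pi, pj))] / single[pj] if single[pj] else 0.0

sessions = [["/a", "/b", "/c"], ["/a", "/b"], ["/c", "/a"]]
single, pair = cooccurrence_counts(sessions, window=2)
print(cond_prob(single, pair, "/a", "/b"))  # 1.0
```

Counting once per session keeps Ni,j no larger than Nj, so the ratio stays in the range 0 to 1.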

12
Experimental Validation

13
Conclusions

A new algorithm for clustering web documents
based only on the log data.
Because it changes the parameters intelligently during the
recursive process, RDBC can give clustering
results superior to those of DBSCAN.

14
Learning to Cluster Web Search Results

Introduction
This algorithm is based on salient phrases extracted from
the documents' contents.
It is fast enough to be used in an online calculation
engine.

15
Characteristics of Clustering Web Search Results

Existing search engines such as Google, Yahoo, and
MSN often return long lists of search results.
Clustering similar search results helps users
find relevant results.

16
Clustered Search results

17
Conventional Search results

18
Procedure of algorithm

Step 1 Search result fetching
Step 2 Document parsing and phrase property
calculation
Step 3 Salient phrase ranking

19
Search result fetching

Input a query to a conventional web search engine
Get the web page of results returned by the
engine.
Extract the titles and snippets.

20
Document parsing

Step 1 Cleaning
Stemming (using the Porter algorithm)
Sentence boundary identification
Step 2 Post-processing
Punctuation elimination
Filter out stop-words, e.g. "too", "are"
Filter out query words
Ex.: "Microsoft software is available to students."
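
The parsing steps can be sketched in Python as below. This is an illustration under stated assumptions: stemming is omitted (the Porter algorithm would need an external library), sentence boundaries are approximated by splitting on ".", "!" and "?", the stop-word list is a tiny made-up sample, and "microsoft" is assumed to be the query for the example sentence. The function name parse_snippet is hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"too", "are", "is", "to", "the", "a", "an", "of"}

def parse_snippet(text, query):
    """Split text into sentences, then into lower-cased tokens with
    punctuation, stop-words, and query words removed."""
    query_words = set(query.lower().split())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    result = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())   # drops punctuation
        result.append([t for t in tokens
                       if t not in STOP_WORDS and t not in query_words])
    return result

print(parse_snippet("Microsoft software is available to students.", "microsoft"))
# [['software', 'available', 'students']]
```

Keeping the tokens grouped by sentence lets a later step avoid building candidate phrases across sentence boundaries.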

21
Phrase property calculation

Five properties
1. Phrase Frequency/Inverted Document Frequency
2. Phrase Length
LEN = n, where n is the number of words in the phrase, e.g. LEN("big") = 1
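
The first two properties can be sketched in Python. The TFIDF formula is not legible in the transcript, so this sketch assumes a common form: the phrase frequency multiplied by log(N / number of documents containing the phrase). Documents are assumed to be token lists and phrases token tuples. Both function names are hypothetical.

```python
from math import log

def phrase_tfidf(phrase, documents):
    """Assumed form: frequency * log(N / number of documents with the phrase)."""
    n = len(phrase)

    def count(doc):
        # number of positions in doc where the phrase occurs as consecutive tokens
        return sum(1 for k in range(len(doc) - n + 1)
                   if tuple(doc[k:k + n]) == phrase)

    freq = sum(count(d) for d in documents)
    containing = sum(1 for d in documents if count(d) > 0)
    return freq * log(len(documents) / containing) if containing else 0.0

def phrase_length(phrase):
    """LEN = n, the number of words in the phrase."""
    return len(phrase)

docs = [["big", "apple"], ["big", "data"], ["small", "data"], ["open", "source"]]
print(phrase_length(("big",)))        # 1
print(phrase_tfidf(("big",), docs))   # 2 * log(4 / 2)
```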

22
Phrase property calculation (cont.)

3. Intra-Cluster Similarity
o: centroid
Here di = (TFIDF1, TFIDF2, ...)
Each component of the vector represents the TFIDF
of a phrase
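
The formula for this property did not survive in the transcript. One plausible reading, offered as an assumption rather than as the authors' definition, is the average cosine similarity between each document vector di and the centroid o of those vectors. The sketch below follows that reading, with illustrative function names.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_cluster_similarity(vectors):
    """Mean cosine similarity between each TFIDF vector and the centroid o."""
    if not vectors:
        return 0.0
    dims = len(vectors[0])
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in range(dims)]
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)
```

Under this reading, documents that share a phrase and have similar TFIDF vectors give a value near 1, while unrelated documents give a lower value.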

23
Phrase property calculation (cont.)

4. Cluster Entropy
5. Phrase Independence
Ex three vectors has
with some vectors
be

24
Learning to rank key phrases

Use a regression model to combine the above five
properties, calculating a single salience score
for each phrase.
Regression is a technique that tries to
determine the relationship between two random
variables x = (x1, x2, ..., xn) and y.
Here x = (TFIDF, LEN, ICS, CE, IND)
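
As an illustration of how a learned model turns the five properties into one score, the sketch below applies a linear model, y = b0 + b1*x1 + ... + b5*x5, and ranks phrases by y. The weights, the bias, and the feature values are all made up for the example. In practice they would come from fitting the regression on labelled training data.

```python
def salience(features, weights, bias=0.0):
    """Linear salience score: bias plus the weighted sum of the five properties."""
    return bias + sum(weights[name] * value for name, value in features.items())

# Hypothetical weights, not values taken from the presentation.
WEIGHTS = {"TFIDF": 0.4, "LEN": 0.1, "ICS": 0.3, "CE": 0.1, "IND": 0.1}

# Hypothetical candidate phrases with made-up property values.
phrases = {
    "jaguar car": {"TFIDF": 0.8, "LEN": 2, "ICS": 0.6, "CE": 0.5, "IND": 0.4},
    "click here": {"TFIDF": 0.2, "LEN": 2, "ICS": 0.1, "CE": 0.9, "IND": 0.1},
}

ranked = sorted(phrases, key=lambda p: salience(phrases[p], WEIGHTS), reverse=True)
print(ranked)  # ['jaguar car', 'click here']
```

The top-ranked phrases would then serve as the cluster names.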

25
Learning to rank key phrases

Three regression models
Linear Regression
Logistic Regression
Support Vector Regression

26
Evaluation

27
Conclusions

Recasts the search result clustering problem as
a supervised salient phrase ranking problem.
Generates correct clusters with short names,
and thus could improve users' efficiency when browsing
search results.

28
Thanks!
