Clustering of Web Documents

Provided by: cacsLou

Transcript and Presenter's Notes

1
  • Clustering of Web Documents
  • Jinfeng Chen

2
  • Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu,
    and Yuhen Hu, Correlation-based Document
    Clustering using Web Logs, 2001.
  • Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and
    Jinwen Ma, Learning to Cluster Web Search Results, 2004.

3
Correlation-based Document Clustering using Web Logs
  • Introduction
  • Uses web log data to construct clusters.
  • Frequent simultaneous visits to two seemingly
    unrelated documents indicate that they are
    in fact closely related.
  • The base algorithm is DBSCAN, which groups
    neighboring objects of the database into clusters
    based on local distance information.

4
DBSCAN
  • Does not require the user to pre-specify the
    number of clusters.
  • Needs only one scan through the database.
  • Takes two parameters: a radius value e and a value Mpts.
  • e: distance measure (radius)
  • Mpts: minimum number of points that must lie
    within radius e of a dense (core) object

5
DBSCAN algorithm (contd)
  • Algorithm DBSCAN(DB, e, Mpts)
  •   for each object o in DB do
  •     if o is not yet assigned to a cluster then
  •       if o is a core object then
  •         collect all objects density-reachable from o
            according to e and Mpts
  •         assign them to a new cluster
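
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: it assumes a precomputed distance matrix `dist`, counts a point as its own neighbor, and marks noise with the label -1. The function names and the toy data are chosen here for illustration only.

```python
from collections import deque

def region_query(dist, i, eps):
    """Indices of all points within distance eps of point i (i itself included)."""
    return [j for j in range(len(dist)) if dist[i][j] <= eps]

def dbscan(dist, eps, min_pts):
    """Label every point with a cluster id; -1 marks noise."""
    n = len(dist)
    labels = [None] * n                  # None = not yet visited
    cluster_id = -1
    for o in range(n):
        if labels[o] is not None:
            continue
        neighbors = region_query(dist, o, eps)
        if len(neighbors) < min_pts:
            labels[o] = -1               # not a core object; may become a border point later
            continue
        cluster_id += 1                  # o is a core object: start a new cluster
        labels[o] = cluster_id
        queue = deque(neighbors)
        while queue:                     # collect everything density-reachable from o
            q = queue.popleft()
            if labels[q] == -1:
                labels[q] = cluster_id   # former noise point becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(dist, q, eps)
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)
    return labels

# Toy one-dimensional data (made up for illustration).
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=1.5, min_pts=2)
```

Each point is expanded at most once, which matches the "one scan" property listed on the previous slide; the neighborhood lookup here is a linear scan, so a real system would use an index instead.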

6
Limitations of DBSCAN in clustering web documents
  • Performs clustering using a fixed threshold
    value to determine dense regions in the
    document space.
  • Thus the algorithm often cannot distinguish
    between dense and loose points, and the entire
    document space is often lumped into a single cluster.

7
RDBC algorithm (Recursive Density-Based Clustering)
  • The key difference between RDBC and DBSCAN is that in
    RDBC the identification of core points is
    performed separately from the clustering of
    individual data points.
  • Different values of e and Mpts are used in RDBC
    to identify this core point set, Cset.

8
RDBC algorithm (contd)
  • To avoid connecting too many clusters
    through bridges:
  • Set initial values e = e1 and Mpts = Mpts1
  • WebPageSet = web_log
  • RDBC(e, Mpts, WebPageSet)
  •   use e and Mpts to get the core point set Cset
  •   if size(Cset) > size(WebPageSet)/2
  •     DBSCAN(e, Mpts, WebPageSet)
  •   else
  •     e = e/2, Mpts = Mpts/4
  •     RDBC(e, Mpts, WebPageSet)
  •     collect all other points in (WebPageSet - Cset)
        around clusters found in the last step according to e2
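
A rough Python sketch of the recursion follows. It rests on several assumptions that the slide does not settle: the recursive call is run on the core set Cset (the slide writes WebPageSet, so this is only one reading), `eps2` is treated as a separate collection radius, and a leftover point joins the cluster of its nearest already-clustered point. The helper names and the toy data are hypothetical.

```python
def neighbors(dist, pts, p, eps):
    """Points of pts within distance eps of p (p itself included)."""
    return [q for q in pts if dist[p][q] <= eps]

def dbscan(dist, pts, eps, min_pts):
    """Plain DBSCAN restricted to the ids in pts; returns {point: cluster id}, noise omitted."""
    labels, cid = {}, 0
    for o in pts:
        if o in labels or len(neighbors(dist, pts, o, eps)) < min_pts:
            continue
        stack = [o]
        while stack:
            q = stack.pop()
            if q in labels:
                continue
            labels[q] = cid
            nq = neighbors(dist, pts, q, eps)
            if len(nq) >= min_pts:       # only core points expand the cluster
                stack.extend(nq)
        cid += 1
    return labels

def rdbc(dist, pts, eps, min_pts, eps2):
    if not pts:
        return {}
    cset = [p for p in pts if len(neighbors(dist, pts, p, eps)) >= min_pts]
    if len(cset) > len(pts) / 2:         # stopping criterion from the slide
        return dbscan(dist, pts, eps, min_pts)
    # Assumption: recurse on the core set with tightened parameters.
    labels = rdbc(dist, cset, eps / 2, min_pts / 4, eps2)
    clustered = list(labels)             # points clustered in the last step
    core = set(cset)
    for p in pts:                        # collect the points of (pts - Cset)
        if p not in core and clustered:
            nearest = min(clustered, key=lambda q: dist[p][q])
            if dist[p][nearest] <= eps2:
                labels[p] = labels[nearest]
    return labels

# Toy one-dimensional data (made up for illustration).
points = [0, 1, 2, 3, 4, 10, 20, 30, 40, 50, 60]
dist = [[abs(a - b) for b in points] for a in points]
labels = rdbc(dist, list(range(len(points))), eps=2.5, min_pts=4, eps2=2.5)
```

Because Mpts is divided by 4 at every level, it eventually drops to 1 or below, at which point every remaining point is a core point and the recursion stops.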

9
Construct WebPageSet from web logs
  • Step 1
  • Step 2: Delete visits to image files.
  • Step 3: Extract sessions from the data.
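
Steps 2 and 3 might look like the sketch below. The log record layout `(user, unix_time, url)`, the list of image suffixes, and the 30-minute session timeout are all assumptions made for this illustration; the slides do not specify them.

```python
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico")

def extract_sessions(records, timeout=1800):
    """records: iterable of (user, unix_time, url). Returns one URL list per session."""
    by_user = {}
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        if url.lower().endswith(IMAGE_SUFFIXES):
            continue                          # Step 2: drop image requests
        by_user.setdefault(user, []).append((ts, url))
    sessions = []
    for visits in by_user.values():           # Step 3: split each user's visits
        current, last_ts = [], None
        for ts, url in visits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)      # gap too long: close the session
                current = []
            current.append(url)
            last_ts = ts
        sessions.append(current)
    return sessions

# Made-up log records for illustration.
records = [
    ("a", 0, "/index.html"),
    ("a", 10, "/logo.gif"),
    ("a", 20, "/papers.html"),
    ("a", 5000, "/index.html"),
    ("b", 7, "/index.html"),
]
sessions = extract_sessions(records)
```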

10
Construct WebPageSet (contd)
  • Step 4: Create a distance matrix
  • 1) Determine the size of a moving window
    within which URL requests
    are regarded as co-occurring.
  • 2) Calculate the co-occurrence count Ni,j
    and the occurrence counts Ni and Nj for each pair of URLs.

11
Construct WebPageSet (contd)
  • Step 4: Create a distance matrix
  • 3) P(pi | pj) = Ni,j / Nj
  • 4) Three distance functions
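
The counting in sub-steps 1) to 3) can be sketched as below. How exactly the moving window is counted is an assumption here: each window position contributes at most one count per URL and per URL pair. The three distance functions are not reproduced in this transcript, so the sketch stops at the conditional probability.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sessions, window=3):
    """Return (N_i, N_ij): windows containing URL i, and windows containing both i and j."""
    single, pair = Counter(), Counter()
    for session in sessions:
        # Slide the window over the session; short sessions get a single window.
        for start in range(max(1, len(session) - window + 1)):
            urls = sorted(set(session[start:start + window]))
            single.update(urls)
            pair.update(combinations(urls, 2))   # pair keys are sorted tuples
    return single, pair

def cond_prob(single, pair, pi, pj):
    """P(pi | pj) = N_ij / N_j, or 0.0 if pj never occurs."""
    key = tuple(sorted((pi, pj)))
    return pair[key] / single[pj] if single[pj] else 0.0

# One made-up session, window of two consecutive requests.
single, pair = cooccurrence_counts([["a", "b", "c", "d"]], window=2)
p_a_given_b = cond_prob(single, pair, "a", "b")
```

Note that P(pi | pj) and P(pj | pi) generally differ, because they are normalized by different occurrence counts.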

12
Experimental Validation

13
Conclusions
  • A new algorithm for clustering web documents
    based only on the log data.
  • By changing the parameters intelligently during the
    recursive process, RDBC can give clustering
    results superior to those of DBSCAN.

14
Learning to Cluster Web Search Results
  • Introduction
  • The algorithm is based on salient phrases extracted from
    document contents.
  • Fast enough to be used in an online calculation
    engine.

15
Characteristics of clustering web search results
  • Existing search engines such as Google, Yahoo, and
    MSN often return a long list of search results.
  • Clustering similar search results helps users
    find relevant results.

16
Clustered Search results

17
Conventional Search results

18
Procedure of algorithm
  • Step 1: Search result fetching
  • Step 2: Document parsing and phrase property
    calculation
  • Step 3: Salient phrase ranking

19
Search result fetching
  • Input a query to a conventional web search engine.
  • Get the results page returned by the
    engine.
  • Extract the titles and snippets.

20
Document parsing
  • Step 1: Cleaning
  • Stemming (using the Porter algorithm)
  • Sentence boundary identification
  • Step 2: Post-processing
  • Punctuation elimination
  • Filter out stop words, e.g. "too", "are"
  • Filter out query words
  • Example: "Microsoft software is available to students."
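
A small sketch of the parsing steps, using only the standard library. Stemming is left out here (a Porter stemmer would be applied during cleaning), the stop-word list is a tiny illustrative one, and the query "microsoft" for the example sentence is an assumption.

```python
import re

STOP_WORDS = {"too", "are", "is", "to", "the", "a", "of"}   # tiny illustrative list

def parse_snippet(text, query):
    """Split text into sentences and return the kept tokens of each sentence."""
    query_words = set(query.lower().split())
    # Sentence boundary identification: split on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    result = []
    for sentence in sentences:
        # Punctuation elimination: keep only runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        # Filter out stop words and query words.
        result.append([t for t in tokens
                       if t not in STOP_WORDS and t not in query_words])
    return result

parsed = parse_snippet("Microsoft software is available to students.", "microsoft")
```

Candidate phrases would then be taken from the remaining tokens of each sentence, so that no phrase crosses a sentence boundary.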

21
Phrase property calculation
  • Five properties
  • 1. Phrase Frequency / Inverted Document Frequency (TFIDF)
  • 2. Phrase Length
  • LEN = n, e.g. LEN("big") = 1
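
The first two properties can be illustrated as follows. The TFIDF formula is not shown in this transcript, so the sketch assumes the common form TF * log(N / DF), where TF is the phrase's total number of occurrences, DF the number of documents containing it, and N the number of documents. Phrases are represented as tuples of words; the toy documents are made up.

```python
import math

def phrase_tfidf(phrase, docs):
    """docs: list of token lists; phrase: tuple of words. Assumed form: TF * log(N / DF)."""
    n = len(phrase)
    tf = df = 0
    for tokens in docs:
        hits = sum(1 for i in range(len(tokens) - n + 1)
                   if tuple(tokens[i:i + n]) == phrase)
        tf += hits
        df += 1 if hits else 0
    return tf * math.log(len(docs) / df) if df else 0.0

def phrase_len(phrase):
    """LEN = n, the number of words in the phrase."""
    return len(phrase)

docs = [["big", "apple"], ["big", "data"], ["small", "data"], ["green", "tea"]]
score = phrase_tfidf(("big",), docs)
```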

22
Phrase property calculation (contd)
  • 3. Intra-Cluster Similarity
  • o: centroid
  • Here di = (TFIDF1, TFIDF2, ...)
  • Each component of the vectors represents the TFIDF
    of a phrase
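
A sketch of intra-cluster similarity, assuming it is the average cosine similarity between each document vector di (for documents containing the phrase) and their centroid o. The formula itself did not survive in this transcript, so treat this as one plausible reading.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_cluster_similarity(vectors):
    """Average cosine similarity between each document vector and the centroid o."""
    if not vectors:
        return 0.0
    dim = len(vectors[0])
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)

# Two made-up TFIDF vectors with no phrase in common.
ics = intra_cluster_similarity([[1.0, 0.0], [0.0, 1.0]])
```

A value close to 1 means the documents containing the phrase are similar to one another, which makes the phrase a more compact cluster candidate.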

23
Phrase property calculation (contd)
  • 4. Cluster Entropy
  • 5. Phrase Independence
  • Ex three vectors has
  • with some vectors
    be
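
Cluster entropy can be sketched as below, under the assumption that it measures how the documents containing a phrase w overlap with the documents containing each other phrase t: the sum over t of -p * log(p), with p = |D(w) ∩ D(t)| / |D(w)|. The transcript lost the formula, so this definition and the toy data are assumptions.

```python
import math

def cluster_entropy(phrase, doc_sets):
    """doc_sets: {phrase: set of ids of the documents containing it}."""
    dw = doc_sets[phrase]
    if not dw:
        return 0.0
    entropy = 0.0
    for dt in doc_sets.values():
        p = len(dw & dt) / len(dw)     # share of D(w) that also contains t
        if p > 0:
            entropy -= p * math.log(p) # the phrase itself has p = 1 and adds 0
    return entropy

# Made-up document sets for illustration.
doc_sets = {"w": {1, 2, 3, 4}, "t1": {1, 2}, "t2": {3, 4}, "t3": {5}}
ce = cluster_entropy("w", doc_sets)
```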

24
Learning to rank key phrases
  • Use a regression model to combine the above five
    properties into a single salience score
    for each phrase.
  • Regression is a technique that tries to
    determine the relationship between two random
    variables x = (x1, x2, ..., xn) and y.
  • Here x = (TFIDF, LEN, ICS, CE, IND)
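
As an illustration of the simplest case, a linear model y = b0 + b1*x1 + ... + bn*xn can be fitted by least squares and then used to score phrases. The sketch below solves the normal equations in pure Python and assumes the features are not collinear; the function names are hypothetical and no training data from the paper is reproduced.

```python
def fit_linear(xs, ys):
    """Least-squares fit via the normal equations (X^T X) b = X^T y."""
    rows = [[1.0] + list(x) for x in xs]          # leading 1.0 for the intercept b0
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(k)]
    for col in range(k):                          # Gauss-Jordan with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]  # [b0, b1, ..., bn]

def salience(coeffs, x):
    """Salience score of one phrase from its property vector x."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))

def rank_phrases(coeffs, phrase_features):
    """phrase_features: {phrase: x}. Returns phrases sorted by descending salience."""
    return sorted(phrase_features,
                  key=lambda p: salience(coeffs, phrase_features[p]),
                  reverse=True)
```

In the setting of the slides, each x would be the five-element vector (TFIDF, LEN, ICS, CE, IND) and y a human-assigned relevance label for the phrase.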

25
Learning to rank key phrases
  • Three regression models
  • Linear Regression
  • Logistic Regression
  • Support Vector Regression

26
Evaluation

27
Conclusions
  • Converts the search result clustering problem into
    a supervised salient phrase ranking problem.
  • Generates correct clusters with short names,
    which could improve users' browsing efficiency
    through search results.

28
Thanks!