Web Page Clustering based on Web Community Extraction


1
Web Page Clustering based on Web Community Extraction
  • Chikayama-Taura Lab.
  • M2 Shim Wonbo

2
Background
Directory Category
3
Open Directory Project
  • Used by Google, Lycos, etc.
  • Categorizing Web pages by hand
  • Accurate
  • Lately updated
  • Unscalable

4
World Wide Web
  • Rapid increase (the number of clusters changes)
  • Updated daily (cluster centers move)
  • Due to these two properties of the Web, a Web page
    clustering system that needs no human effort is
    required.

5
Purpose
  • Constructing a Web page clustering system which
    • finds clusters without human help
    • is scalable
    • clusters Web pages at high speed
    • clusters Web pages accurately

6
Brief System View
[Figure: (a) Web pages -> DBG extraction -> (b) Web communities -> partitioning of remaining pages based on TF-IDF -> (c) Web page clustering]
7
Contribution
  • Web Community
    • A new Web community topology is defined.
    • The extracted Web communities show higher
      precision than those of existing work.
  • Web Page Clustering
    • Web communities are used as centroids of
      clusters in TF-IDF space.
    • Experimental results show meaningful clusters.

8
Agenda
  • Introduction
  • Related Work
  • Proposal
  • Evaluation
  • Conclusion

9
Existing Work
  • Text-based clustering
    • Uses terms as features
    • Commonly used algorithms
      • e.g. k-means, hierarchical algorithms,
        density-based clustering
  • Link-based clustering
    • Also called Web community extraction
    • Extracts dense subgraphs from the Web graph
  • Combination of text and link information
    • e.g. Contents-Link Coupled Web Page Clustering
      [Yitong et al., DEWS2004]

10
Text-based Clustering
  • Merit
    • Accurate (because the text is considered)
  • Problem
    • Unsupervised clustering
      • Hard to decide the number of clusters
    • Supervised learning and clustering
      • Difficult to label each training datum

11
Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]
  • Feature
    • Term frequency (p_term), out-link (p_out),
      in-link (p_in)
  • Similarity
  • Clustering Algorithm
    • An extension of the k-means algorithm

12
Extraction of Web Community based on Link Analysis
  • An Approach to Find Related Communities Based on
    Bipartite Graphs [P. Krishna Reddy et al., 2001]
  • PlusDBG: Web Community Extraction Scheme
    Improving Both Precision and Pseudo-Recall
    [Saida et al., 2005]

13
Terminology
  • Fan and Center
  • Bipartite Graph (BG)
  • Complete BG (CBG)
  • Dense BG (DBG)

[Figure: (a) CBG and (b) DBG; labels: Fan, Center, p, q]
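
The terminology above can be made concrete with a small check. This is a minimal sketch, assuming the definition from Reddy et al.: in a DBG(p, q) every fan links to at least p centers and every center is linked from at least q fans, while a CBG is the case where every fan links to every center. The function names and the toy graph are hypothetical and do not come from the slides.

```python
def is_cbg(edges, fans, centers):
    """True if every fan links to every center (complete bipartite graph)."""
    return all((f, c) in edges for f in fans for c in centers)


def is_dbg(edges, fans, centers, p, q):
    """True if each fan links to >= p centers and each center has >= q fans.

    edges is a set of (fan, center) pairs. The p/q reading follows
    Reddy et al. as assumed in the lead-in.
    """
    for f in fans:
        if sum((f, c) in edges for c in centers) < p:
            return False
    for c in centers:
        if sum((f, c) in edges for f in fans) < q:
            return False
    return True


# Toy graph: three fans, three centers.
fans = {"f1", "f2", "f3"}
centers = {"c1", "c2", "c3"}
complete = {(f, c) for f in fans for c in centers}
dense = complete - {("f1", "c3")}  # drop one link: no longer complete
```

With one link removed, `dense` is still a DBG(2, 2) but no longer a CBG or a DBG(3, 3).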
14
Algorithm for Extracting DBG [Reddy et al., 2001]
  • Finds a bipartite graph using co-citing and
    co-cited Web pages
  • Extracts a DBG from that graph

[Figure: extraction of DBG(3, 3) starting from a seed page]
15
PlusDBG
  • Uses a distance between two pages defined by
    their co-citing page rate
  • Finds co-citing pages that are within the
    distance threshold
  • Extracts a DBG from the resulting graph
  • PlusDBG shows higher precision than DBG does.

16
Web Community Extraction
  • Pro: high speed
  • Pro: finds topics across the Web
  • Con: may extract unrelated Web pages as a
    community

17
Problem of DBG
18
Improvement of PlusDBG
19
Agenda
  • Introduction
  • Related Work
  • Proposal
  • Evaluation
  • Conclusion

20
Proposal
  1. Extracts Web communities using link structure.
  2. Assigns remainders to the closest Web community
    in TF-IDF space.

21
Proposed Web Community
  • Connecter
    • A fan that cites two centers.
  • Connectable
    • Two centers are connectable if they share more
      than two connecters.
  • Web Community
    • A Web community C is a DBG composed of
      connectable centers and connecters.

[Figure: connectable centers and a connecter]
22
Proposed Web Community
Every center is connectable to every other center.
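
The definition above can be checked mechanically on a small bipartite graph. The sketch below is one reading of the slides, not the author's code: two centers count as connectable when they share at least three connecters ("more than two"), every pair of centers must be connectable, and every fan must cite at least two centers so that it is a connecter. All names are hypothetical.

```python
from itertools import combinations


def connecters(edges, fans, c1, c2):
    """Fans that cite both centers c1 and c2. edges holds (fan, center) pairs."""
    return {f for f in fans if (f, c1) in edges and (f, c2) in edges}


def is_proposed_community(edges, fans, centers, min_connecters=3):
    """Check the proposed topology under the assumptions in the lead-in."""
    # Every pair of centers must share enough connecters.
    for c1, c2 in combinations(sorted(centers), 2):
        if len(connecters(edges, fans, c1, c2)) < min_connecters:
            return False
    # Every fan must connect at least one pair of centers.
    for f in fans:
        if sum((f, c) in edges for c in centers) < 2:
            return False
    return True


fans = {"a", "b", "c"}
edges = {(f, c) for f in fans for c in ("x", "y")} | {("a", "z")}
```

Here centers x and y form a valid community with fans a, b, c, while adding z breaks it because only one fan connects z to the other centers.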
23
Extraction Algorithm
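[Figure: worked example on pages a to j. Labels in the figure: S = {}, T = {g}; S = {b, c, d}, T = {g, i}; S = {a, b, c, d}; T = {e, f, h, i}; T = {e, f, h, i, j}; ti: 3 connecters; tj: 1 connecter]
Output community: {a, b, c, d, e, f, g, h, i}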
24
Labeling Remainders
  • Remainder: a Web page that was not extracted as a
    member of any community.
  • Calculate the centroids of the Web communities.
  • Label each remainder with the ID of the closest
    Web community.

Here, vi is the TF-IDF vector of a page v.
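
The labeling step can be sketched as a nearest-centroid assignment. The slide's formulas are not preserved in this transcript, so the code assumes that a centroid is the mean of the TF-IDF vectors of a community's members and that closeness is Euclidean distance; both are assumptions, as are the function names and the sparse dict representation.

```python
import math


def centroid(vectors):
    """Mean of sparse TF-IDF vectors, each given as {term: weight}."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def distance(u, v):
    """Euclidean distance between two sparse vectors (assumed metric)."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def label_remainders(communities, remainders):
    """Map each remainder page to the ID of the community with the closest centroid.

    communities: {community_id: [tfidf_vector, ...]}
    remainders:  {page: tfidf_vector}
    """
    centroids = {cid: centroid(vecs) for cid, vecs in communities.items()}
    labels = {}
    for page, vec in remainders.items():
        labels[page] = min(centroids, key=lambda cid: distance(vec, centroids[cid]))
    return labels


communities = {
    "bikes": [{"bike": 1.0, "engine": 0.5}, {"bike": 0.8}],
    "movies": [{"film": 1.0}, {"film": 0.6, "actor": 0.9}],
}
remainders = {"p1": {"bike": 0.9}, "p2": {"actor": 0.7, "film": 0.5}}
labels = label_remainders(communities, remainders)
```

Because the number of clusters is fixed by the extracted communities, this step needs no choice of k, unlike plain k-means.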
25
Agenda
  • Introduction
  • Related Work
  • Proposal
  • Evaluation
  • Preprocess
  • Web community extraction
  • Labeling result
  • Conclusion

26
Preprocess
  • Data set
    • 2.34 M pages, 20 M links
    • Almost 80% of the data set is Japanese pages.
  • Create a link-only file
    • Links pointing outside the data set are deleted.
    • Duplicate pages that share 90% of their links
      are deleted.
    • Pages including 50 links are deleted.
    • Remaining data set: 1.45 M pages, 5.09 M links
  • Create a TF-IDF file
    • Used TF-IDF
    • Parser: MeCab
    • Terms that appear in less than 0.1% or more
      than 90% of all documents are removed.
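
The TF-IDF step with document-frequency filtering can be sketched as follows. The exact TF-IDF variant used in the slides is not preserved, so this assumes raw term count times log(N / df). It also assumes the pages are already tokenized (the slides use MeCab for that) and that the thresholds are 0.1% and 90% of all documents. Function and parameter names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, min_df_ratio=0.001, max_df_ratio=0.9):
    """Build sparse TF-IDF vectors from tokenized documents.

    docs: {page: [term, ...]}
    Terms whose document frequency ratio falls outside
    [min_df_ratio, max_df_ratio] are dropped before weighting.
    """
    n = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per document
    kept = {t for t, count in df.items() if min_df_ratio <= count / n <= max_df_ratio}

    vectors = {}
    for page, terms in docs.items():
        tf = Counter(t for t in terms if t in kept)
        vectors[page] = {t: count * math.log(n / df[t]) for t, count in tf.items()}
    return vectors


docs = {
    "d1": ["bike", "the", "bike"],
    "d2": ["movie", "the"],
    "d3": ["bike", "the", "actor"],
}
vectors = tfidf_vectors(docs)
```

In this toy input "the" appears in every document, so it exceeds the 90% threshold and is removed.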

27
Distribution of Web Community Size
28
Distribution of Web Community Size
Method             Communities   Extracted pages
PlusDBG 0.8             22,902           865,945
PlusDBG 1.0              8,077           922,053
PlusDBG 1.2              7,527           923,100
Proposed method         50,065           648,626
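
A quick way to read the table is to divide extracted pages by the number of communities. The sketch below does only that arithmetic on the figures above; it assumes each extracted page is counted once, which the slides do not state.

```python
# (communities, extracted pages) per method, copied from the table above.
results = {
    "PlusDBG 0.8": (22_902, 865_945),
    "PlusDBG 1.0": (8_077, 922_053),
    "PlusDBG 1.2": (7_527, 923_100),
    "Proposed method": (50_065, 648_626),
}

# Average pages per community, under the one-page-one-count assumption.
avg_size = {name: pages / comms for name, (comms, pages) in results.items()}

for name, size in avg_size.items():
    print(f"{name}: {size:.1f} pages per community")
```

Under that assumption the proposed method yields the smallest average community size of the four rows.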
29
Distance from centroids to term vectors
30
Variance of distance
31
Example of Web communities
  • About motorbike manufacturers and links.
  • http://bike.ak-m.jp/
  • http://www.bike-cube.jp/
  • http://bike.ak-m.jp/2006/01/post_32.html
  • http://www.bike-cube.jp/index.php
  • http://bike.ak-m.jp/2006/11/post_20.html
  • http://www.kymco.co.jp/
  • http://www1.suzuki.co.jp/motor/
  • http://www.yamaha-motor.jp/mc/
  • http://bike.ak-m.jp/
  • http://www.peugeot-moto.com/
  • http://www.apriliajapan.co.jp/index.html
  • http://www.buell.jp/
  • http://www.cagiva.co.jp/
  • http://www.mitsuoka-motor.com/
  • http://www.ducati.com/od/ducatijapan/jp/index.jhtml
  • http://www.triumphmotorcycles.com/japan/
  • http://www.harley-davidson.co.jp/index.html
  • http://www.ktm-japan.co.jp/

32
Comparing to ODP
  • Definition of precision
    • For a Web community C, let OC be the subset of
      its pages that exist in ODP.
    • If |OC| < 3, the precision of C is undefined.
    • For r in OC, the Pscore of r is
    • With Pscore, the precision of C is
  • Compared against the 4th and 5th levels of ODP
    directories (e.g. Top/Regional/Japan/Arts/Movie)
  • Number of ODP pages included in the data set:
    47,093

score(p, q) = 1 if p and q are in the same directory;
score(p, q) = 0 otherwise
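
The precision measure can be sketched in code, with a strong caveat: the Pscore and precision formulas are not preserved in this transcript. The sketch assumes that Pscore(r) is the mean of score(r, q) over the other pages q in OC, and that the precision of C is the mean Pscore over OC. It also assumes that a directory level means the number of leading path components kept, counting "Top". All of these are assumptions, and the names are hypothetical.

```python
def truncate(directory, level):
    """Keep the first `level` components of an ODP path (assumed level meaning)."""
    return "/".join(directory.split("/")[:level])


def score(dir_p, dir_q, level):
    """1 if both pages fall in the same directory at the given level, else 0."""
    return 1 if truncate(dir_p, level) == truncate(dir_q, level) else 0


def precision(community, odp_dirs, level):
    """Assumed precision of a community; None when fewer than 3 ODP pages."""
    odp_pages = [p for p in community if p in odp_dirs]
    if len(odp_pages) < 3:
        return None  # undefined, as stated on the slide
    pscores = []
    for r in odp_pages:
        others = [q for q in odp_pages if q != r]
        same = sum(score(odp_dirs[r], odp_dirs[q], level) for q in others)
        pscores.append(same / len(others))  # assumed Pscore: mean pairwise score
    return sum(pscores) / len(pscores)


odp_dirs = {
    "a": "Top/Regional/Japan/Arts/Movie",
    "b": "Top/Regional/Japan/Arts/Movie",
    "c": "Top/Regional/Japan/Arts/Music",
}
```

With these toy directories, the three pages agree when four path components are kept and only partly agree when five are kept, which is why a deeper level is a stricter test.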
33
Comparing to ODP
Method             ODP pages   Communities including ODP pages   Directories the pages belong to
PlusDBG 0.8           23,287                               459                               426
PlusDBG 1.0           25,016                               156                               430
PlusDBG 1.2           25,405                                81                               435
Proposed method       12,406                             4,811                               337
34
Precision of Web Communities (4th level)
35
Precision of Web Communities (5th level)
36
Summary of Web Community Extraction
  • The proposed method extracted smaller Web
    communities than PlusDBG did.
  • Members of each community were closer to the
    centroid in TF-IDF space than members of PlusDBG
    communities were.
  • The proposed communities showed higher precision
    than PlusDBG's when compared with ODP.

37
Labeling Result
  • Pages containing fewer than 10 terms are ignored.
  • Compared with ODP
    • ODP pages: 29,153
    • ODP directories: 1,862

38
Labeling Result (the 4th level)
39
Labeling Result (the 5th level)
40
Labeling example
41
Labeling example
42
Summary and Conclusion
  • A DBG structure is defined as the Web community
    topology.
    • Any two centers must be connectable.
    • Every fan is a connecter of centers.
    • The proposed DBG structure extracts more compact
      and more precise Web communities than existing
      work does.
  • Clustering based on Web community extraction is
    proposed.
    • The centroids of communities in TF-IDF space are
      used to label remainders.
    • The clustering results showed meaningful page
      groups.

43
Future Work
  • Combining feature selection to improve the
    labeling result.
  • Clustering the extracted centroids.

45
Thank you for your attention
46
Extraction Algorithm
  1. Select a seed page t and set T = {t}, S = {}.
  2. Find S, whose members cite any page in T.
  3. Find T', whose members are cited by any page in S
    and are not in T.
  4. Determine whether t' ∈ T' is connectable to all
    pages in T.
  5. If t' is connectable, set T = T ∪ {t'} and
    S = connecters, and go to 2.
  6. If not, select another t' ∈ T' and go to 4.
  7. If |S| > 3 and |T| > 3, extract the page set as a
    Web community and delete it from the Web graph.
  8. If any t exists, go to 1.
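
The steps above can be sketched for a single seed as follows. This is one interpretation rather than the author's implementation. It assumes "connectable" means sharing at least three connecters, that a candidate rejected once never needs rechecking, and that the final fan set keeps only fans citing at least two centers. Deleting the community from the graph and looping over seeds (steps 7 and 8) are left out. All names and the toy graph are hypothetical.

```python
def build_in_links(out_links):
    """Invert {page: set of cited pages} into {page: set of citing pages}."""
    in_links = {}
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links.setdefault(dst, set()).add(src)
    return in_links


def is_connectable(in_links, c1, c2, min_connecters=3):
    """Two centers are connectable if enough fans cite both of them."""
    shared = in_links.get(c1, set()) & in_links.get(c2, set())
    return len(shared) >= min_connecters


def extract_community(out_links, in_links, seed, min_size=4):
    """Grow a community from one seed; returns (fans, centers) or None."""
    centers = {seed}
    rejected = set()
    while True:
        # Step 2: fans are pages citing any current center.
        fans = set()
        for c in centers:
            fans |= in_links.get(c, set())
        # Step 3: candidates are pages cited by the fans, not yet centers.
        candidates = set()
        for f in fans:
            candidates |= out_links.get(f, set())
        candidates -= centers | rejected
        # Steps 4 to 6: accept the first candidate connectable to all centers.
        added = False
        for cand in sorted(candidates):
            if all(is_connectable(in_links, cand, c) for c in centers):
                centers.add(cand)
                added = True
                break
            rejected.add(cand)
        if not added:
            break
    # Keep only fans that connect at least two of the final centers.
    fans = {f for f in fans if len(out_links.get(f, set()) & centers) >= 2}
    # Step 7: size check, reading the thresholds as |S| > 3 and |T| > 3.
    if len(fans) >= min_size and len(centers) >= min_size:
        return fans, centers
    return None


# Toy graph: fans a-d cite centers e-j; j is cited by d only.
out_links = {
    "a": {"e", "f", "h", "i"},
    "b": {"e", "f", "g", "h", "i"},
    "c": {"e", "f", "g", "h", "i"},
    "d": {"e", "f", "g", "h", "i", "j"},
}
in_links = build_in_links(out_links)
result = extract_community(out_links, in_links, "g")
```

On this toy graph, seeding at g grows the center set to e, f, g, h, i and leaves out j, which has a single connecter.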