The Structure of Broad Topics on the Web - PowerPoint PPT Presentation

About This Presentation
Title:

The Structure of Broad Topics on the Web

Description:

The Structure of Broad Topics on the Web Soumen Chakrabarti, Mukul M. Joshi, etc Presentation by Na Dai Introduction & Contribution Convergence of topic distribution ... – PowerPoint PPT presentation

Slides: 13
Provided by: dai136

Transcript and Presenter's Notes

Title: The Structure of Broad Topics on the Web


1
The Structure of Broad Topics on the Web
  • Soumen Chakrabarti, Mukul M. Joshi, et al.
  • Presentation by Na Dai

2
Introduction & Contribution
  • Convergence of topic distribution on undirected
    random walks
  • Degree distribution restricted to topics
  • How topic-biased are breadth-first crawls?
  • Representation of topics in Web directories
  • Topic convergence on directed walks
  • Link-based vs. content-based Web communities

3
Building Blocks
  • Sampling Web pages
  • PageRank-based random walk → Wander walk
  • The Bar-Yossef random walk → Sampling walk
  • Undirected graph
  • Regular
  • Taxonomy design & document classification
  • 271,954 topics, 6 levels, 1,697,266 sample URLs
  • Pruned taxonomy: 482 leaf nodes, 144,859 sample
    URLs
  • Classification: Rainbow naïve Bayes classifier
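A minimal sketch of the idea behind the sampling walk, assuming a small in-memory undirected graph: padding every node with self-loops up to a common degree makes the graph regular, so a random walk on it tends toward visiting all nodes equally often. The function name `regular_walk`, the toy graph, and the `max_degree` parameter are illustrative assumptions and do not come from the slides.

```python
import random

def regular_walk(graph, start, steps, max_degree, rng):
    """Random walk on an undirected graph made regular with self-loops.

    graph maps node -> list of neighbours (each edge listed in both directions).
    Every node is treated as having exactly max_degree edges; the missing
    ones are self-loops, so the walk stays put with the leftover probability.
    """
    node = start
    for _ in range(steps):
        neighbours = graph[node]
        # Pick one of max_degree slots; slots past the real neighbours are self-loops.
        slot = rng.randrange(max_degree)
        if slot < len(neighbours):
            node = neighbours[slot]
    return node

# Hypothetical toy graph: a star with centre "a" (degree 3) and three leaves.
toy_graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
rng = random.Random(0)
samples = [regular_walk(toy_graph, "a", 50, max_degree=3, rng=rng) for _ in range(2000)]
counts = {node: samples.count(node) for node in toy_graph}
```

On this toy graph the four counts should come out roughly equal, even though the centre has three times as many real edges as each leaf.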

4
Convergence
  • Sampling method
  • Sampling walk
  • Topic distribution of a set
  • Soft counting
  • Difference measure
  • L1 distance
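The two measures on this slide can be sketched as follows, assuming each page comes with a vector of class probabilities from the classifier. With soft counting, a page contributes its whole probability vector rather than a single vote for its top class. The sample vectors and the names `topic_distribution` and `l1_distance` are hypothetical.

```python
def topic_distribution(class_probs):
    """Soft-counted topic distribution of a set of pages.

    class_probs is a list of per-page probability vectors. Each page adds
    its full vector, and the sum is divided by the number of pages.
    """
    n_topics = len(class_probs[0])
    totals = [0.0] * n_topics
    for probs in class_probs:
        for c, p in enumerate(probs):
            totals[c] += p
    return [t / len(class_probs) for t in totals]

def l1_distance(p, q):
    """Sum of absolute differences between two topic distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical classifier outputs over three topics for two page samples.
early_sample = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
late_sample = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]
p_early = topic_distribution(early_sample)
p_late = topic_distribution(late_sample)
gap = l1_distance(p_early, p_late)
```

For two probability distributions the L1 distance lies between 0 (identical) and 2 (disjoint support), so a value shrinking toward 0 across successive samples would indicate convergence.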

5
The background distribution vs. breadth-first crawls

6
Faithful representation of topics in Web directories

7
Topic-specific degree distributions
  • Power law distribution
  • $\Pr(i) \propto 1/i^{x}$ ($x > 1$)
  • Contribution to Class c
  • Soft-counting
  • $\sum_{d} p_c(d)$
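One way to read the soft-counting bullet, sketched under assumptions: a page d with degree i adds its class probability for topic c to the degree-i bin of that topic's histogram, instead of being counted once for its top class only. The page data, topic names, and the function `soft_degree_histogram` are hypothetical.

```python
from collections import defaultdict

def soft_degree_histogram(pages, topic):
    """Topic-specific degree distribution with soft counting.

    pages is a list of (degree, class_probs) pairs, where class_probs maps
    topic name -> probability. Each page adds its probability for `topic`
    to the bin of its degree; bins are then normalised to sum to 1.
    """
    hist = defaultdict(float)
    for degree, probs in pages:
        hist[degree] += probs.get(topic, 0.0)
    total = sum(hist.values())
    if total == 0:
        return {}
    return {deg: mass / total for deg, mass in sorted(hist.items())}

# Hypothetical pages: (degree, class probabilities).
pages = [
    (1, {"arts": 0.9, "sports": 0.1}),
    (1, {"arts": 0.5, "sports": 0.5}),
    (2, {"arts": 0.2, "sports": 0.8}),
    (4, {"arts": 0.4, "sports": 0.6}),
]
arts_hist = soft_degree_histogram(pages, "arts")
```

A histogram built this way for each topic can then be compared against the power-law shape on a log-log plot.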

8
Topical locality and link-based prestige ranking
  • Sampling method
  • Wander walk
  • Class selection
  • Dmoz, well-populated
  • Collect all the pages at distance i (i > 0)
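A rough sketch of the collection step, assuming that "distance i" means the page reached after i forward steps of a walk started from a page of the chosen class. The slides do not spell this out, and the sketch omits the random jumps of a PageRank-style walk. The link graph and the function `pages_by_distance` are hypothetical.

```python
import random
from collections import defaultdict

def pages_by_distance(out_links, start_pages, max_steps, rng):
    """Run one forward walk per start page and bucket pages by step index i > 0."""
    buckets = defaultdict(list)
    for page in start_pages:
        current = page
        for i in range(1, max_steps + 1):
            successors = out_links.get(current, [])
            if not successors:
                break  # dead end: this sketch simply stops the walk here
            current = rng.choice(successors)
            buckets[i].append(current)
    return dict(buckets)

# Hypothetical link graph in which every page has exactly one out-link.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "x": ["b"]}
buckets = pages_by_distance(toy_links, ["a", "x"], max_steps=3, rng=random.Random(1))
```

Classifying the pages in each bucket and comparing their topic distribution with the starting class would then show how quickly topical locality fades with distance.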

9
Topical locality and link-based prestige ranking

10
Relations between topics
  • Topic citation matrix
  • Contribution to topic citation matrix C
  • $C \leftarrow C + p(u)^{T} p(v)$
  • Implications and application
  • Improved hypertext classification
  • Enhanced focused crawling
  • Reorganizing topic directories
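A minimal sketch of how a topic citation matrix could be accumulated, assuming each hyperlink (u, v) adds the outer product of the two pages' class-probability vectors to C. The page names, the probabilities, and the function `topic_citation_matrix` are hypothetical.

```python
def topic_citation_matrix(links, class_probs, n_topics):
    """Accumulate a soft-counted topic-to-topic citation matrix.

    links is a list of (u, v) page pairs meaning u links to v.
    class_probs maps page -> list of n_topics class probabilities.
    Entry C[i][j] grows by p_i(u) * p_j(v) for every link (u, v).
    """
    C = [[0.0] * n_topics for _ in range(n_topics)]
    for u, v in links:
        pu, pv = class_probs[u], class_probs[v]
        for i in range(n_topics):
            for j in range(n_topics):
                C[i][j] += pu[i] * pv[j]
    return C

# Hypothetical two-topic example with two links.
probs = {"u1": [1.0, 0.0], "v1": [0.5, 0.5], "v2": [0.0, 1.0]}
links = [("u1", "v1"), ("v1", "v2")]
C = topic_citation_matrix(links, probs, 2)
```

Because each probability vector sums to 1, every link adds a total mass of 1 to C, so heavy off-diagonal entries point to pairs of topics that cite each other often.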

11
Concluding remarks
  • Characterize some important notions of topical
    locality on the web
  • Open problems
  • PageRank jump parameter
  • Topical stability of distillation algorithms
  • Better crawling algorithms

12
Q & A?