Using Web Structure for Classifying and Describing Web Pages

Title: Using Web Structure for Classifying and Describing Web Pages. Author: Zaihan Yang. Last modified by: Brian D. Davison. Created: 8/16/2006.


Transcript and Presenter's Notes


1
Using Web Structure for Classifying and
Describing Web Pages
  • Eric J. Glover, Kostas Tsioutsiouliklis,
  • Steve Lawrence, David M. Pennock, Gary W.
    Flake
  • International World Wide Web Conference, 2002
  • Presented by Zaihan Yang
  • CSE Web Mining

2
Introduction
  • Aim
  • Classification of web pages
  • Description of web pages (to name clusters of web
    pages)
  • Using Web Structure
  • Extracting patterns from hyperlinks in the web.
  • HyperLink
  • The destination page
  • Associated anchortext describing link

3
  • Typical Text-based classification
  • To utilize the words (or phrases) of a target
    document, considering the most significant
    features.
  • Not Effective.
  • E.g.
  • The home page of General Motors (www.gm.com)
    does not state that they are a car company.
  • Full text
  • Anchortext
  • Extended-anchortext
  • A combination

4
Virtual Document
  • A virtual document is
  • A collection of anchortexts or extended
    anchortexts from links pointing to the target
    document.
  • Anchortext
  • The words occurring inside of a link
  • Extended anchortext
  • The set of rendered words occurring up to 25
    words before and after an associated link (as
    well as the anchortext itself).
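
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class `LinkTextParser`, the function `link_texts`, and the exact-match test on `href` are hypothetical choices, and "rendered words" is approximated by splitting all text nodes on whitespace (script and style content is not filtered out).

```python
from html.parser import HTMLParser

WINDOW = 25  # words of context kept on each side of a link


class LinkTextParser(HTMLParser):
    """Collects the page's words and the word span covered by each link."""

    def __init__(self):
        super().__init__()
        self.words = []    # all text words, in document order
        self.links = []    # (href, start_index, end_index) for each link
        self._open = None  # (href, start_index) of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())


def link_texts(html, target):
    """Return (anchortext, extended anchortext) for each link to target."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    pairs = []
    for href, start, end in parser.links:
        if href == target:  # assumption: exact URL match
            anchor = parser.words[start:end]
            extended = parser.words[max(0, start - WINDOW):end + WINDOW]
            pairs.append((" ".join(anchor), " ".join(extended)))
    return pairs
```

A virtual document for a target page would then be the concatenation of these strings over every inbound page that links to it.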

5
  • Main Method
  • Full-text classifier
  • Virtual documents classifier
  • Two Improvement methods
  • Name a cluster
  • Main Procedure

6
  • Datasets
  • Positive: a set of web pages downloaded from
    various Yahoo! categories.
  • Negative: random documents from outside Yahoo!
  • WebKB dataset
  • Features
  • All words and two or three word phrases
  • e.g. "My favorite game is scrabble."
  • Possible features
  • My, my favorite, my favorite game, favorite,
    favorite game, etc.
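
The feature set described above (all words plus two- and three-word phrases) can be sketched as follows. The function name `extract_features`, the lowercasing, and the simple regular-expression tokenizer are assumptions made for illustration; the slide does not say how text was tokenized or whether case was kept.

```python
import re


def extract_features(text, max_len=3):
    """Return every run of 1 to max_len consecutive words as a feature."""
    # Assumption: lowercase the text and treat letters, digits and
    # apostrophes as word characters.
    words = re.findall(r"[a-z0-9']+", text.lower())
    features = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            features.add(" ".join(words[i:i + n]))
    return features
```

For the example sentence this yields 12 features: 5 single words, 4 two-word phrases and 3 three-word phrases.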

7
Dimensionality reduction
  • To remove useless features.
  • Two step process
  • First, remove all features that do not occur in a
    specified percentage of documents.
  • i.e., remove feature $f$ if $|A_f|/|A| < T^+$ and $|B_f|/|B| < T^-$
  • $A$: the set of positive examples.
  • $B$: the set of negative examples.
  • $A_f$: documents in $A$ that contain feature $f$.
  • $B_f$: documents in $B$ that contain feature $f$.
  • $T^+$: threshold for positive features.
  • $T^-$: threshold for negative features.
  • Second, rank the remaining features based on
    expected entropy loss.
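
The first step can be sketched as a simple frequency filter. This is an illustrative reading of the slide, not the authors' code: the function name `filter_rare_features` is hypothetical, each document is assumed to be a set of features, and both document lists are assumed to be non-empty.

```python
from collections import Counter


def filter_rare_features(pos_docs, neg_docs, t_pos, t_neg):
    """Keep features that are frequent enough in at least one set.

    pos_docs and neg_docs are lists of feature collections, one per
    document. t_pos and t_neg are the positive and negative thresholds,
    expressed as fractions of documents.
    """
    pos_counts = Counter(f for doc in pos_docs for f in set(doc))
    neg_counts = Counter(f for doc in neg_docs for f in set(doc))
    kept = set()
    for f in set(pos_counts) | set(neg_counts):
        frac_pos = pos_counts[f] / len(pos_docs)
        frac_neg = neg_counts[f] / len(neg_docs)
        # A feature is removed only when it is rare in BOTH sets.
        if not (frac_pos < t_pos and frac_neg < t_neg):
            kept.add(f)
    return kept
```

The surviving features would then be passed to the second step, ranking by expected entropy loss.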

8
Expected Entropy Loss
  • The prior entropy of the class distribution
  • The posterior entropy of the class when the
    feature is present
  • The posterior entropy of the class when the
    feature is absent
  • The expected entropy loss
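
The formulas for these four quantities did not survive in the transcript. The sketch below uses the usual definition: the prior entropy of the class minus the posterior entropy averaged over the feature being present or absent. Estimating every probability by simple counting is an assumption, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Entropy in bits of a two-class distribution with probabilities p, 1-p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def expected_entropy_loss(n_pos, n_neg, n_pos_with_f, n_neg_with_f):
    """Expected entropy loss of a feature f, with counts as estimates.

    n_pos, n_neg: number of positive and negative documents.
    n_pos_with_f, n_neg_with_f: how many of each contain f.
    """
    total = n_pos + n_neg
    with_f = n_pos_with_f + n_neg_with_f
    without_f = total - with_f

    prior = entropy(n_pos / total)  # entropy of the class distribution
    p_f = with_f / total            # probability that f is present

    # Posterior entropy of the class when f is present / absent.
    present = entropy(n_pos_with_f / with_f) if with_f else 0.0
    absent = entropy((n_pos - n_pos_with_f) / without_f) if without_f else 0.0

    return prior - (p_f * present + (1 - p_f) * absent)
```

A feature found in every positive document and no negative one removes all class uncertainty, while a feature spread evenly across both sets removes none.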

9
Train SVM
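  • A set of data points $(x_1, y_1), \ldots, (x_N, y_N)$
  • $x_i$ is an input and $y_i$ is a target output ($+1$ or
    $-1$).
  • Separating hyperplane
  • $w \cdot \phi(x_i) + b = 0$
  • $w \cdot \phi(x_i) + b \geq 1$ if $y_i = +1$
  • $w \cdot \phi(x_i) + b \leq -1$ if $y_i = -1$
  • $w \cdot \phi(x_i) + b$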
  • where
  • minimizing
  • Output
  • Kernel function
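
The formulas behind the last few bullets were lost from the transcript. For reference, the standard textbook soft-margin SVM is given below; it is an assumption that the slide used this form, and the symbols $C$, $\xi_i$ and $\alpha_i$ do not appear in the surviving text.

```latex
% Training: minimize over w, b and the slack variables \xi_i
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \qquad
y_i \bigl( w \cdot \phi(x_i) + b \bigr) \geq 1 - \xi_i, \quad \xi_i \geq 0

% Output for a new input x, written in terms of the kernel function
f(x) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(x_i, x) + b,
\qquad
K(x, x') = \phi(x) \cdot \phi(x')
```

The sign of $f(x)$ gives the predicted class, and its magnitude is the real-valued output used for uncertainty sampling.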

10
Improvement-Uncertainty Sampling
  • The result from an SVM classifier is a real
    number from -∞ to ∞.
  • When the output is on the interval (-1,1) it is
    less certain than if it is on the intervals
    (-∞,-1) and (1,∞).
  • The region (-1,1) is called the uncertain region.
  • Uncertainty sampling
  • A human judges the documents in the uncertain region.
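
The routing step can be sketched as follows, assuming the SVM outputs are already available as a mapping from document id to score. The function name `split_by_certainty` and the dictionary input are hypothetical choices for illustration.

```python
def split_by_certainty(scores, low=-1.0, high=1.0):
    """Separate confident predictions from those needing human review.

    scores maps a document id to its real-valued SVM output. Documents
    whose score lies strictly inside (low, high) are uncertain.
    """
    confident = {}  # doc id -> predicted label (+1 or -1)
    uncertain = []  # doc ids to hand to a human judge
    for doc_id, score in scores.items():
        if low < score < high:
            uncertain.append(doc_id)
        else:
            confident[doc_id] = 1 if score > 0 else -1
    return confident, uncertain
```

Only the documents returned in `uncertain` need human judgment. The rest keep the classifier's label.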

11
Improvement-Combination
  • To combine results from the extended anchortext
    based classifier with the less accurate full-text
    classifier.

12
Name the Cluster
  • Using the top ranked features extracted from the
    extended anchortext virtual documents to name a
    cluster.
  • Beliefs
  • The words near the anchortexts are descriptions
    of the target documents.
  • The top ranked features by expected entropy loss
    are those which occur in many positive
    examples, and few negative ones.

13
Results-classifying
  • Anchortext alone is comparable to the full text
    for classification purposes.
  • Classification accuracy is significantly improved
    when using the extended anchortext instead of
    the document full-text.
  • Combination method is highly effective for
    improving positive-class accuracy, but reduces
    negative class accuracy.
  • Uncertainty sampling required examining only 8%
    of the documents on average, while providing an
    average positive class accuracy improvement of
    almost 10 percentage points.

14
Result--Clustering
  • The full-text appears comparable to the extended
    anchortext.
  • The anchortext alone appears to do a poor job of
    describing the category.

15
Future Work
  • To include other features on the inbound web
    pages besides extended anchortext
  • To examine the effects of the number of inbound
    links.
  • To examine the nature of the category by
    expanding this to thousands of categories.
  • To study the effects of the positive set size.