Web Classification The Web Unit Approach - PowerPoint PPT Presentation



Title: Web Classification The Web Unit Approach

1
Web Classification The Web Unit Approach

Ee-Peng Lim
Division of Information Systems
School of Computer Engineering

2
Acknowledgement

Dr Aixin Sun, New South Wales University, Australia
Asst Prof Dion Goh, Sch of Communication & Information, NTU
Ming Yin, Graduate Student

3
Outline

Overview of Web Classification
Web Page Classification
Web Unit Mining
Web Unit Relationship Mining
Conclusion

4
Introduction

Users and businesses today depend on the World
Wide Web for information and knowledge.
Overloading syndrome in finding web information
Web classification: to classify Web objects into
pre-defined semantic structures
Web page classification
Web site classification

5
Web Page Classification
[Figure: classifier training produces a classifier, which is then used for classification]
6
Applications of Web Page Classification

Web search engines/browsers
Web directories
Business intelligence
Web information integration

7
Web Page Classification Methods

Web page classification approaches
Text only approach
Hypertext approach
Neighborhood category approach
Relational learning approach

8
Existing Web Page Classification Approaches

Text only approach
Use term features only [Dumais and Chen 00]
Performance is reasonable when web pages are like
text documents
Hypertext approach
Uses content and context features
Context features: layout tags [Esposto et al 99],
words from neighboring pages [Furnkranz 99,
Glover et al 02, Chakrabarti et al 98]

9
Existing Web Page Classification Approaches

Relational learning approach
FOIL-PILFS [Craven and Slattery 01]
Neighborhood category approach
Makes use of neighboring page category labels
[Oh et al 00]

10
Research Questions

Web page features for classification
Which feature combinations give good performance?
Category structure
Can we have more complex category structures?
Does a single Web page carry sufficient
information about a concept instance?

11
Concept Relationship Graph

Information is modeled by concepts and
relationships

12
Which Page Feature Combination Gives Good
Performance?

Difficult to compare web page classification
methods
Web pages to be classified: pages from one, a few,
or many web sites
Different performance metrics

13
Our Web Page Classification Research

To conduct web page classification using
different combinations of web page features
To study their impact on classification accuracy
Support Vector Machine (SVM) classifiers are used

14
Data Set

WebKB
4 Universities (4159 pages)
Cornell, Texas, Washington, Wisconsin
Classes
WebKB consists of web pages classified into 7
classes
Only 4 classes were used:
Student, Faculty, Projects, Course
Training documents
Proportion of +ve training samples (Tr+) ranges
from 2% to 18% compared to -ve training samples.

15
Web Page Features

4 combinations of web page features:
Text Only (X)
Text + Title (T)
Title words are enclosed by <title> and </title>
Title words of a web page may provide more
semantic hints.
Text + Anchor Words (A)
Text words from the neighboring docs could be
noise => use only the anchor words of incoming links.
Text + Title + Anchor Words (TA)
Text words, title words and in-link anchor words,
all separately indexed
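
A minimal sketch of how the four feature combinations might be built, assuming a simple bag-of-words representation. The function names, the tokenizer, and the `text:`/`title:`/`anchor:` prefixes are illustrative assumptions (the slides only say the three word sources are "separately indexed"); this is not the authors' implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def page_features(body, title="", anchors=(), combo="X"):
    """Build a bag-of-words vector for one page.

    combo is one of "X", "T", "A", "TA" (the labels used on the slide).
    Prefixes (an assumption) keep body, title, and in-link anchor words
    in separate feature spaces, i.e. "separately indexed".
    """
    if combo not in {"X", "T", "A", "TA"}:
        raise ValueError(f"unknown combination: {combo}")
    feats = Counter(f"text:{w}" for w in tokenize(body))
    if "T" in combo:
        feats.update(f"title:{w}" for w in tokenize(title))
    if "A" in combo:
        for anchor in anchors:
            feats.update(f"anchor:{w}" for w in tokenize(anchor))
    return feats


# Hypothetical page content, for illustration only.
ta = page_features("CS100 course notes", title="CS100 Home",
                   anchors=["Intro course"], combo="TA")
x = page_features("CS100 course notes", title="CS100 Home",
                  anchors=["Intro course"], combo="X")
```

With the prefixes, the word "cs100" in the title and in the body become two distinct features, so a classifier can weight them differently.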

16
Construction of Support Vector Machine (SVM)
Classifier

SVM is a binary classifier
SVM has been shown to be accurate for text
classification
An SVM classifier is constructed for each category
Joachims' SVMlight was used

17
Construction of SVM Classifiers

Due to the unbalanced training set:
Cost factor (parameter j in SVMlight)
SCut thresholding
Training set is further divided into a construction
set and a validation set.
Validation set is used to derive the appropriate
threshold for the output score of the SVM

18
SCut Thresholding

Performance evaluation using a
leave-one-university-out strategy
Pages from 3 universities are used as training
pages
Pages from the fourth are used as test pages.
SCut Thresholding
For the training dataset:
2 universities -> training (train set)
1 university -> test (validation set)
Find the threshold that yields the optimal F1 value
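
The threshold search and the evaluation split can be sketched as follows. This is a sketch under stated assumptions: the candidate thresholds are taken to be the observed validation scores, a page is predicted positive when its score is at least the threshold, and the function names are hypothetical. The slides do not specify these details.

```python
def f1_score(y_true, y_pred):
    """F1 for boolean labels and boolean predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def scut_threshold(scores, labels):
    """Return (threshold, F1) maximizing F1 on validation scores.

    Assumption: a page is predicted positive when score >= threshold,
    and only observed scores are tried as thresholds.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def leave_one_university_out(universities):
    """Yield (training universities, test university) pairs."""
    for test in universities:
        yield [u for u in universities if u != test], test


# Made-up classifier output scores and labels for a validation set.
threshold, best = scut_threshold([-1.2, -0.3, 0.1, 0.8],
                                 [False, True, False, True])
splits = list(leave_one_university_out(
    ["Cornell", "Texas", "Washington", "Wisconsin"]))
```

Within each training split, one of the three universities would in turn serve as the validation set for `scut_threshold`, as the slide describes.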

19
Experimental Results

F1 measure results
Compared with CMU's FOIL-PILFS method

20
Precision Results
21
Recall Results
22
F1 Results
23
Discussion

SVM performed very well on the WebKB data set
SVM delivered better results than FOIL-PILFS when
context features were used
Title words are helpful to SVM compared to using
text words only
In-link anchor words lead to significant increase
in F1.
Using text, title and anchor words resulted in
the best classification performance

24
What is the Impact of Concept-Relationship Graph?

Tasks
Find the Web pages forming a concept instance
Assign it with the appropriate concept label
Identify the relationships among the concept
instances
Challenges
Definition of concept instance is subjective
Web sites organize Web pages in different ways
Features for identifying relationship instances
are limited

25
Web Unit

A Web page, or a set of Web pages, that jointly
provides information on one concept instance
Consists of one key page (homepage) and zero or
more support pages.

Example: two Web units (labelled Web Unit 1 and Web Unit 2 on the slide)

http://..path/course/CS100/CS100.html
http://..path/course/CS100/lecture-programs.html
http://..path/course/CS100/officehours.html
http://..path/course/CS100/instructor.html
http://..path/course/CS100/exams/final.html
http://..path/course/CS100/exams/prelim.html

http://..path/user/johnson/index.html
http://..path/user/johnson/research.html
http://..path/user/johnson/publications.html
http://..path/user/johnson/activities.html
http://..path/user/johnson/students.html
http://..path/user/johnson/teaching.html
http://..path/user/johnson/contact.html
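
One possible in-memory representation of a Web unit, given as an illustrative sketch only. The class name, field names, and the choice of a dataclass are assumptions; the URLs are the elided example paths from the slide, used as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WebUnit:
    """One concept instance: a key page plus zero or more support pages."""
    key_page: str
    support_pages: List[str] = field(default_factory=list)
    concept: Optional[str] = None  # concept label, assigned by classification

    def pages(self):
        """All pages of the unit, key page first."""
        return [self.key_page, *self.support_pages]


course_unit = WebUnit(
    key_page="http://..path/course/CS100/CS100.html",
    support_pages=[
        "http://..path/course/CS100/officehours.html",
        "http://..path/course/CS100/exams/final.html",
    ],
)
one_page_unit = WebUnit(key_page="http://..path/user/johnson/index.html")
```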
26
Web Unit Mining

Each concept instance is a Web unit
Tasks
Find Web pages that form a Web unit and determine
the role of each page
Assign concept labels to Web units
Differences between Web Unit Mining and Web Page
Classification
Concept-relationship graph vs flat categories
Web units are not given a priori

27
Web Directory Structure

Structure of Web site based on Web page URLs
Example
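
A small sketch of deriving a Web directory from page URLs, assuming the directory simply groups pages by the folder part of the URL path. The function name and the host `www.example.edu` are hypothetical.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlparse


def build_web_directory(urls):
    """Map each folder path to the page file names directly inside it."""
    folders = defaultdict(list)
    for url in urls:
        path = urlparse(url).path
        folder, page = posixpath.split(path)
        # A URL ending in "/" has an empty page name; keep it as "".
        folders[folder or "/"].append(page)
    return dict(folders)


directory = build_web_directory([
    "http://www.example.edu/course/CS100/CS100.html",
    "http://www.example.edu/course/CS100/officehours.html",
    "http://www.example.edu/course/CS100/exams/final.html",
])
```

Here `directory` has one entry for `/course/CS100` with two pages and one entry for `/course/CS100/exams` with one page.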

28
Observations on Web Units

Observation 1
Web pages from the same Web folder are more
semantically related
Observation 2
Support pages are normally reachable from key
page
Observation 3
Key page is usually at the highest level of the
Web folder containing the Web unit

29
Observations on Web Units

Observation 4
Web units of same concept seldom have links
between them
Observation 5
Multi-page Web units of the same concept often
reside in a set of folders (one for each) under a
common parent folder
One-page Web units of the same concept often
appear in the same folder
Observation 6
Key pages of the Web units of the same concept are
often the link targets of a hub page

30
Iterative Web Unit Mining (iWUM)
31
Iterative Web Unit Mining (iWUM)
32
Web Fragment Generation

Associate closely-related Web pages together
Reduce the objects to be classified
Reduce noise in training
Criteria to determine Web fragments
Web folder connectivity index
Web page naming convention
Common names for key pages are index.html,
index.htm, etc.

33
Web Folder Connectivity
34
Web Folder Connectivity Index

Connectivity from web page pa to web page pb
Connectivity from a web page to a web folder
Connectivity from a web folder to a web folder

35
Web Fragment Generation

Connectivity index of a web folder Fi, denoted fFi
A large fFi suggests pages and subfolders in Fi are
closely linked
Connectivity index threshold to determine the
folders containing web unit(s)
In experiments, 0.1667 was used (at least 5 items
not connected)

36
Web Fragment Generation

Find Candidate Key Pages
URL of the page ends with a /
The folder containing the page and the page share
the same name, e.g., path/cs100/cs100.html
Page file name matches home, index, welcome,
default, or homepage
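
The three candidate key page rules above can be sketched as a single predicate. Assumptions in this sketch: matching is case-insensitive, the file extension is ignored, and the host `www.example.edu` and the function name are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# File name stems listed on the slide.
KEY_PAGE_NAMES = {"home", "index", "welcome", "default", "homepage"}


def is_candidate_key_page(url):
    """Apply the three slide rules to one URL."""
    path = urlparse(url).path
    # Rule 1: the URL ends with "/".
    if path.endswith("/"):
        return True
    folder, filename = posixpath.split(path)
    stem = posixpath.splitext(filename)[0].lower()
    folder_name = posixpath.basename(folder).lower()
    # Rule 2: folder and page share the same name, e.g. cs100/cs100.html.
    if stem and stem == folder_name:
        return True
    # Rule 3: the file name is one of the common homepage names.
    return stem in KEY_PAGE_NAMES


checks = {
    u: is_candidate_key_page(u)
    for u in [
        "http://www.example.edu/course/cs100/",
        "http://www.example.edu/course/cs100/cs100.html",
        "http://www.example.edu/user/johnson/index.html",
        "http://www.example.edu/user/johnson/research.html",
    ]
}
```

Only the last URL, `research.html`, fails all three rules.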

37
Web Fragment Generation and Classification
38
Web Unit Construction
39
Web Unit Construction
1. http://..path/user/johnson/index.html PROF
   http://..path/user/johnson/research.html
   http://..path/user/johnson/publications.html
   http://..path/user/johnson/activities.html
   http://..path/user/johnson/students.html
   http://..path/user/johnson/teaching.html
   http://..path/user/johnson/contact.html
1. http://..path/course/CS100/CS100.html COURSE
   http://..path/course/CS100/lecture-programs.html
   http://..path/course/CS100/officehours.html
   http://..path/course/CS100/instructor.html
   http://..path/course/CS100/exams/final.html
   http://..path/course/CS100/exams/prelim.html
40
Web Unit Classification

Observations 5 and 6
Multi-page Web units of the same concept often
reside in a set of folders (one for each) under a
common parent folder
Key pages of the Web units of the same concept
are often the link targets of a hub page
Improve Web unit mining accuracy
Web site structure features
Content features

41
Web Unit Classification
42
Web Site Structure Features

Normalized classification score (each web unit)
for each concept
Organization of the web units within the web site
Closeness to the average depth for each concept
Highest in-link hub value for each concept
Precision support of the parent web folder for
each concept
Recall support of the parent web folder for each
concept
Word features in the web page names and URLs
Each word (term) in page names and URL

43
Performance Metrics

Given a mined web unit ui
Is ui correctly constructed?
Is ui correctly classified?
Perfect web unit u'i of a constructed web unit ui:
the labelled web unit that contains ui.k (the key
page of ui) and has the same label as ui. ui.k can
be either the key page or a support page of u'i.

44
Performance Metrics

Precision/Recall for a web unit
Satisfaction variable (a)
Degree of importance when the key page of a web
unit is correctly identified
a = 1/u: key page and support pages are equally
important.
a = 1: only the key page is important.
a = 1, 0.5, and 1/u are used in our experiments

45
Performance Metrics

Precision/Recall for a concept

46
Experiments (dataset)

UnitSet
Pages in WebKB are manually grouped into Web
units
Most pages from the Others category are used as
support pages of the corresponding Web units.

47
Experiments (methods)

Baseline method
3 steps: (1) train Web page classifiers; (2)
classify Web pages; and (3) construct Web units
Baseline with fragments method
Non-iterative version of iWUM
5 steps: (1) train Web fragment classifiers; (2)
build Web directory; (3) generate Web fragments;
(4) classify Web fragments; and (5) construct Web
units.
Iterative Web Unit Mining method (iWUM)

48
Results
[Charts: results for a = 1, a = 0.5, and a = 1/u]
49
Results
50
iWUM Results
51
Detailed Results

Web unit label change rate

52
Web Unit Relationship Mining

Identify relationship instances among the mined
web units that are concept instances
Example: Instructor-of (Johnson, CS100)
Challenges
Approach? Method?
Features to be used?

53
Web Unit Relationship Mining

Assumptions
Relationships can be determined based on
background relation knowledge
Background relations are represented by inter-unit
features
Our proposed method
Candidate Web unit pair generation
Feature acquisition
Classifier training
Classification

54
Inter-Unit Features

Navigation Features (N)
Relative Location Features (R)
Parent-child
Sibling
Ancestor-descendant
Common-item Features (E)
Email addresses
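
As an illustration of a common-item feature, the sketch below counts email addresses that appear in both Web units of a candidate pair. The regular expression, the function names, and the sample addresses are assumptions; the slides do not say how common items are extracted or scored.

```python
import re

# Simplified email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails(text):
    """Set of lowercased email addresses found in a page's text."""
    return {m.lower() for m in EMAIL_RE.findall(text)}


def common_email_count(pages_a, pages_b):
    """Number of email addresses shared by two Web units.

    pages_a and pages_b are lists of page texts, one list per unit.
    """
    found_a = set().union(*(emails(p) for p in pages_a))
    found_b = set().union(*(emails(p) for p in pages_b))
    return len(found_a & found_b)


# Made-up page texts for a person unit and a course unit.
person_pages = ["Contact: johnson@cs.example.edu"]
course_pages = ["Instructor: Johnson@cs.example.edu, TA: ta@cs.example.edu"]
shared = common_email_count(person_pages, course_pages)
```

A non-zero count would then be one input feature for the relationship classifier, for example for Instructor-of.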

55
Navigation Features (N)
56
Relative Location Features (R)

Parent-child: h2 and h4
Sibling: h2 and h3
Ancestor-descendant: h1 and h4
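
A hedged sketch of computing a relative location feature for two pages from their URLs. It assumes the relation is defined on the folders containing the two pages; the slide's figure (h1 to h4) is not available, so this definition, the extra labels "same-folder" and "unrelated", the function names, and the host `www.example.edu` are all assumptions.

```python
import posixpath
from urllib.parse import urlparse


def folder_segments(url):
    """Folder of a URL as a list of path segments ([] for the root)."""
    path = urlparse(url).path
    folder = path if path.endswith("/") else posixpath.dirname(path)
    return [s for s in folder.split("/") if s]


def relative_location(url_a, url_b):
    """Classify the folder relation between two pages."""
    a, b = folder_segments(url_a), folder_segments(url_b)
    if a == b:
        return "same-folder"
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # One folder is a proper prefix of the other.
    if len(shorter) < len(longer) and longer[:len(shorter)] == shorter:
        if len(longer) == len(shorter) + 1:
            return "parent-child"
        return "ancestor-descendant"
    # Different folders with the same parent folder.
    if len(a) == len(b) and a[:-1] == b[:-1]:
        return "sibling"
    return "unrelated"


base = "http://www.example.edu"
r1 = relative_location(base + "/dept/", base + "/dept/people/index.html")
r2 = relative_location(base + "/dept/people/a.html", base + "/dept/courses/b.html")
r3 = relative_location(base + "/dept/", base + "/dept/people/johnson/index.html")
```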

57
Experimental Dataset

WebKB
Department-of (people, department)
Instructor-of (people, course)
Member-of (people, project)

58
Experimental Results

On the manually labelled web units

59
Experimental Results

On the iWUM mined web units

60
Conclusion

Feature combinations for Web Page Classification
Web Unit to model a concept instance
Web unit mining and iWUM method
Web fragment generation and classification
Web unit construction and classification
Web unit mining performance metric
Web unit relationship mining

61
Future Work

Enhancement of the proposed solutions
Evaluation of iWUM method on larger datasets
Development of incremental Web unit mining
methods
Web units and applications
Search Engines for Web Units and Web Unit
Relationships

62
Relevant Publications

A. Sun, E.-P. Lim, W.-K. Ng, J. Srivastava,
Blocking Reduction Strategies in Hierarchical
Text Classification, IEEE TKDE 16(10):1305-1308,
2004.
A. Sun, E.-P. Lim, Web Unit Mining: Finding and
Classifying Subgraphs of Web Pages, ACM CIKM,
2003.
A. Sun, E.-P. Lim, W.-K. Ng, Performance
Measurement Framework for Hierarchical Text
Classification, JASIST 54(11):1014-1028, 2003.
A. Sun, E.-P. Lim, Web Classification Using
Support Vector Machine, ACM WIDM, 2002.

63
Thank You