Web Classification: The Web Unit Approach

Provided by: caisN

1
Web Classification: The Web Unit Approach
  • Ee-Peng Lim, Division of Information Systems,
    School of Computer Engineering

2
Acknowledgement
  • Dr Aixin Sun, University of New South Wales,
    Australia
  • Asst Prof Dion Goh, School of Communication and
    Information, NTU
  • Ming Yin, Graduate Student

3
Outline
  • Overview of Web Classification
  • Web Page Classification
  • Web Unit Mining
  • Web Unit Relationship Mining
  • Conclusion

4
Introduction
  • Users and businesses today depend on the World
    Wide Web for information and knowledge.
  • Overloading syndrome in finding web information
  • Web classification: to classify Web objects into
    pre-defined semantic structures
  • Web page classification
  • Web site classification

5
Web Page Classification
[Figure labels: Classifier Training, Classifier, Classification]
6
Applications of Web Page Classification
  • Web search engines/browsers
  • Web directories
  • Business intelligence
  • Web information integration

7
Web Page Classification Methods
  • Web page classification approaches
  • Text only approach
  • Hypertext approach
  • Neighborhood category approach
  • Relational learning approach

8
Existing Web Page Classification Approaches
  • Text only approach
  • Use term features only [Dumais and Chen 00]
  • Performance is reasonable when web pages are like
    text documents
  • Hypertext approach
  • Using content and context features
  • Context features: layout tags [Esposto et al 99],
    words from neighboring pages [Furnkranz 99,
    Glover et al 02, Chakrabarti et al 98]

9
Existing Web Page Classification Approaches
  • Relational learning approach
  • FOIL-PILFS [Craven and Slattery 01]
  • Neighborhood category approach
  • Make use of neighboring page category labels
    [Oh et al 00]

10
Research Questions
  • Web page features for classification
  • Which feature combinations give good performance?
  • Category structure
  • Can we have more complex category structures?
  • Does a single Web page carry sufficient
    information about a concept instance?

11
Concept Relationship Graph
  • Information is modeled by concepts and
    relationships

12
Which Page Feature Combination Gives Good
Performance?
  • Difficult to compare web page classification
    methods
  • Web pages to be classified: Web pages from
    one/few/many web site(s)
  • Different performance metrics

13
Our Web Page Classification Research
  • To conduct web page classification using
    different combinations of web page features
  • To study their impact on classification accuracy
  • Support Vector Machine (SVM) classifiers are used

14
Data Set
  • WebKB
  • 4 Universities (4159 pages)
  • Cornell, Texas, Washington, Wisconsin
  • Classes
  • WebKB consists of web pages classified into 7
    classes
  • Only 4 classes were used.
  • Student, Faculty, Projects, Course
  • Training documents
  • Proportion of +ve training samples (Tr) ranges
    from 2% to 18% compared to -ve training samples.

15
Web Page Features
  • 4 combinations of web page features.
  • Text Only (X)
  • Text + Title (T)
  • Title words are enclosed by <title> and </title>
  • Title words of a web page may provide more
    semantic hints.
  • Text + Anchor Words (A)
  • Text words from the neighboring docs could be
    noise, so only the anchor words of incoming links
    are used.
  • Text + Title + Anchor Words (TA)
  • Text words, title words and in-link anchor words,
    all separately indexed
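
The four feature combinations can be illustrated with a minimal sketch. The field prefixes ("x:", "t:", "a:"), the function name and the sample page are assumptions made for illustration; the slides only state that the three kinds of words are indexed separately.

```python
from typing import Dict, List

# Hypothetical prefixes that keep the three word sources in separate index spaces.
PREFIX = {"text": "x:", "title": "t:", "anchor": "a:"}

# Fields used by each feature combination named on the slide.
COMBOS: Dict[str, List[str]] = {
    "X": ["text"],
    "T": ["text", "title"],
    "A": ["text", "anchor"],
    "TA": ["text", "title", "anchor"],
}


def build_features(page: Dict[str, List[str]], combo: str) -> Dict[str, int]:
    """Return prefixed term counts for one page under the chosen combination."""
    counts: Dict[str, int] = {}
    for field in COMBOS[combo]:
        for word in page.get(field, []):
            key = PREFIX[field] + word.lower()
            counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up page: body words, title words, and anchor words of incoming links.
page = {
    "text": ["Course", "syllabus", "course"],
    "title": ["CS100"],
    "anchor": ["CS100", "homepage"],
}
features_x = build_features(page, "X")
features_ta = build_features(page, "TA")
```

With the prefixes, the word "cs100" in the title and the same word in an in-link anchor become two distinct features, which is one simple way to read "separately indexed".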

16
Construction of Support Vector Machine (SVM)
Classifier
  • SVM is a binary classifier
  • SVM has been shown to be accurate for text
    classification
  • A SVM classifier is constructed for each category
  • Joachims' SVMlight was used

17
Construction of SVM Classifiers
  • Due to the unbalanced training set
  • Cost factor (parameter j in SVMlight)
  • SCut thresholding
  • Training set is further divided into construction
    set and validation set.
  • Validation set is used to derive the appropriate
    threshold for the output score of SVM

18
SCut Thresholding
  • Performance evaluation using a
    leave-one-university-out strategy
  • Pages from 3 universities are used as training
    pages
  • Pages from the fourth are used as test pages.
  • SCut Thresholding
  • For the training dataset
  • 2 universities → training (train set)
  • 1 university → test (validation set)
  • Find the threshold that yields the optimal F1 value
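
The threshold search can be sketched as follows. This is an illustration under assumptions, not the authors' code: candidate thresholds are taken to be the distinct validation scores, a page is predicted positive when its score is at or above the threshold, and the scores and labels are made up.

```python
from typing import Sequence, Tuple


def f1_at(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 of the rule 'predict positive when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def scut_threshold(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, F1) for the candidate with the best validation F1."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up SVM output scores on a validation set and their true labels.
val_scores = [-1.2, -0.4, 0.1, 0.3, 0.9]
val_labels = [False, False, True, False, True]
threshold, best_f1 = scut_threshold(val_scores, val_labels)
```

The chosen threshold would then replace the default decision boundary of 0 when the classifier is applied to the test university.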

19
Experimental Results
  • F1 measure results
  • Compared with CMU's FOIL-PILFS method

20
Precision Results
21
Recall Results
22
F1 Results
23
Discussion
  • SVM performed very well on the WebKB data set
  • SVM delivered better results than FOIL-PILFS when
    context features were used
  • Title words are helpful to SVM compared to using
    text words only
  • In-link anchor words lead to significant increase
    in F1.
  • Using text, title and anchor words resulted in
    the best classification performance

24
What is the Impact of Concept-Relationship Graph?
  • Tasks
  • Find the Web pages forming a concept instance
  • Assign it with the appropriate concept label
  • Identify the relationships among the concept
    instances
  • Challenges
  • Definition of concept instance is subjective
  • Web sites organize Web pages in different ways
  • Features for identifying relationship instances
    are limited

25
Web Unit
  • A Web page or a set of Web pages jointly provides
    information of one concept instance
  • One key page (homepage) and zero or more support
    pages.

Web Unit 2
Web Unit 1
http://..path/course/CS100/CS100.html
http://..path/course/CS100/lecture-programs.html
http://..path/course/CS100/officehours.html
http://..path/course/CS100/instructor.html
http://..path/course/CS100/exams/final.html
http://..path/course/CS100/exams/prelim.html

http://..path/user/johnson/index.html
http://..path/user/johnson/research.html
http://..path/user/johnson/publications.html
http://..path/user/johnson/activities.html
http://..path/user/johnson/students.html
http://..path/user/johnson/teaching.html
http://..path/user/johnson/contact.html
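
A Web unit can be represented as a small data structure: one key page plus a list of support pages. The sketch below is illustrative; the class and field names are assumptions, and treating CS100.html as the key page of the course example is also an assumption based on the folder name. The elided URL prefix is kept exactly as it appears on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebUnit:
    """One concept instance: a key page plus zero or more support pages."""
    key_page: str
    support_pages: List[str] = field(default_factory=list)
    label: str = ""  # concept label; empty until the unit is classified

    def pages(self) -> List[str]:
        """All pages of the unit, key page first."""
        return [self.key_page] + self.support_pages


BASE = "http://..path"  # host and path prefix are elided on the slide

course_unit = WebUnit(
    key_page=f"{BASE}/course/CS100/CS100.html",
    support_pages=[
        f"{BASE}/course/CS100/lecture-programs.html",
        f"{BASE}/course/CS100/officehours.html",
        f"{BASE}/course/CS100/instructor.html",
        f"{BASE}/course/CS100/exams/final.html",
        f"{BASE}/course/CS100/exams/prelim.html",
    ],
)
```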
26
Web Unit Mining
  • Each concept instance is a Web unit
  • Tasks
  • Find Web pages that form a Web unit and determine
    the role of each page
  • Assign concept labels to Web units
  • Differences between Web Unit Mining and Web Page
    Classification
  • Concept-relationship graph vs flat categories
  • Web units are not given a priori

27
Web Directory Structure
  • Structure of Web site based on Web page URLs
  • Example
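
As a hypothetical illustration (the figure from this slide is not in the transcript), a Web directory can be derived from page URLs by nesting each page under the folders on its path. The host www.example.edu and the function name are made up.

```python
from typing import Dict, List
from urllib.parse import urlparse


def build_directory(urls: List[str]) -> Dict:
    """Nest pages under their folders: folders map to dicts, pages map to None."""
    root: Dict = {}
    for url in urls:
        path = urlparse(url).path
        parts = [p for p in path.split("/") if p]
        # A URL ending in "/" names a folder, so every part is a folder.
        is_folder_url = path.endswith("/") or not parts
        folders = parts if is_folder_url else parts[:-1]
        node = root
        for folder in folders:
            child = node.get(folder)
            if not isinstance(child, dict):
                child = {}
                node[folder] = child
            node = child
        if not is_folder_url:
            node[parts[-1]] = None
    return root


tree = build_directory([
    "http://www.example.edu/course/CS100/CS100.html",
    "http://www.example.edu/course/CS100/exams/final.html",
    "http://www.example.edu/user/johnson/index.html",
])
```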

28
Observations on Web Units
  • Observation 1
  • Web pages from the same Web folder are more
    semantically related
  • Observation 2
  • Support pages are normally reachable from key
    page
  • Observation 3
  • Key page is usually at the highest level of the
    Web folder containing the Web unit

29
Observations on Web Units
  • Observation 4
  • Web units of same concept seldom have links
    between them
  • Observation 5
  • Multi-page Web units of the same concept often
    reside in a set of folders (one for each) under a
    common parent folder
  • One-page Web units of the same concept often
    appear in the same folder
  • Observation 6
  • Key pages of the Web units of the same concept
    are often the link targets of a hub page

30
Iterative Web Unit Mining (iWUM)
31
Iterative Web Unit Mining (iWUM)
32
Web Fragment Generation
  • Associate closely-related Web pages together
  • Reduce the objects to be classified
  • Reduce noise in training
  • Criteria to determine Web fragments
  • Web folder connectivity index
  • Web page naming convention
  • Common names for key pages are index.html,
    index.htm, etc.

33
Web Folder Connectivity
34
Web Folder Connectivity Index
  • Connectivity from web page pa to web page pb
  • Connectivity from a web page to a web folder
  • Connectivity from a web folder to a web folder

35
Web Fragment Generation
  • Connectivity index of a web folder Fi, written fFi
  • Large fFi suggests pages and subfolders in Fi are
    closely linked
  • Connectivity index threshold to determine the
    folders containing web unit(s)
  • In experiments, 0.1667 was used (at least 5 items
    not connected)

36
Web Fragment Generation
  • Find Candidate Key Pages
  • URL of the page ends with a /
  • The folder containing the page and the page share
    the same name, e.g., path/cs100/cs100.html
  • Page file name matches home, index, welcome,
    default, and homepage
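
The three candidate key page rules can be sketched as a single predicate. This is a minimal illustration: the function name is made up, the case-insensitive comparison is an assumption, and the example host is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# File name stems listed on the slide as typical key page names.
KEY_PAGE_NAMES = {"home", "index", "welcome", "default", "homepage"}


def is_candidate_key_page(url: str) -> bool:
    """Apply the three naming rules to decide whether a page may be a key page."""
    path = urlparse(url).path
    # Rule 1: the URL ends with "/".
    if path.endswith("/"):
        return True
    folder, filename = posixpath.split(path)
    stem = posixpath.splitext(filename)[0].lower()
    folder_name = posixpath.basename(folder).lower()
    # Rule 2: the page shares its name with the folder containing it.
    if stem and stem == folder_name:
        return True
    # Rule 3: the file name matches a common key page name.
    return stem in KEY_PAGE_NAMES


checks = {
    "folder_url": is_candidate_key_page("http://www.example.edu/user/johnson/"),
    "same_name": is_candidate_key_page("http://www.example.edu/path/cs100/cs100.html"),
    "index": is_candidate_key_page("http://www.example.edu/user/johnson/index.html"),
    "support": is_candidate_key_page("http://www.example.edu/user/johnson/research.html"),
}
```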

37
Web Fragment Generation and Classification
38
Web Unit Construction
39
Web Unit Construction
1. http://..path/user/johnson/index.html PROF
   http://..path/user/johnson/research.html
   http://..path/user/johnson/publications.html
   http://..path/user/johnson/activities.html
   http://..path/user/johnson/students.html
   http://..path/user/johnson/teaching.html
   http://..path/user/johnson/contact.html
1. http://..path/course/CS100/CS100.html COURSE
   http://..path/course/CS100/lecture-programs.html
   http://..path/course/CS100/officehours.html
   http://..path/course/CS100/instructor.html
   http://..path/course/CS100/exams/final.html
   http://..path/course/CS100/exams/prelim.html
40
Web Unit Classification
  • Observations 5 and 6
  • Multi-page Web units of the same concept often
    reside in a set of folders (one for each) under a
    common parent folder
  • Key pages of the Web units of the same concept
    are often the link targets of a hub page
  • Improve Web unit mining accuracy
  • Web site structure features
  • Content features

41
Web Unit Classification
42
Web Site Structure Features
  • Normalized classification score (each web unit)
    for each concept
  • Organization of the web units within the web site
  • Closeness to the average depth for each concept
  • Highest in-link hub value for each concept
  • Precision support of the parent web folder for
    each concept
  • Recall support of the parent web folder for each
    concept
  • Word features in the web page names and URLs
  • Each word (term) in page names and URL

43
Performance Metrics
  • Given a mined web unit ui
  • Is ui correctly constructed?
  • Is ui correctly classified?
  • Perfect web unit u'i of a constructed web unit
    ui: the labelled web unit that contains ui.k and
    has the same label as ui. ui.k can be either the
    key page or a support page of u'i.
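
Finding the perfect web unit of a mined unit can be sketched as a lookup over the labelled units. The sketch assumes that ui.k denotes the key page of the mined unit ui, which the slide does not state explicitly; the class, the function name and the example.edu URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Unit:
    key_page: str
    support_pages: FrozenSet[str]
    label: str

    def pages(self) -> FrozenSet[str]:
        """All pages of the unit, key page included."""
        return self.support_pages | {self.key_page}


def perfect_unit(mined: Unit, labelled: List[Unit]) -> Optional[Unit]:
    """Return the labelled unit with the same label that contains mined.key_page."""
    for gold in labelled:
        if gold.label == mined.label and mined.key_page in gold.pages():
            return gold
    return None  # no perfect web unit exists for this mined unit


gold_prof = Unit(
    key_page="http://www.example.edu/user/johnson/index.html",
    support_pages=frozenset({"http://www.example.edu/user/johnson/research.html"}),
    label="PROF",
)
# The mined unit chose a support page as its key page but has the right label.
mined_ok = Unit("http://www.example.edu/user/johnson/research.html", frozenset(), "PROF")
# Same pages, wrong label: no perfect web unit.
mined_bad = Unit("http://www.example.edu/user/johnson/research.html", frozenset(), "COURSE")
```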

44
Performance Metrics
  • Precision/Recall for a web unit
  • Satisfaction variable (a)
  • Degree of importance when the key page of a web
    unit is correctly identified
  • a = 1/u: key page and support pages are equally
    important.
  • a = 1: only the key page is important.
  • a = 1, 0.5, and 1/u are used in our experiments

45
Performance Metrics
  • Precision/Recall for a concept

46
Experiments (dataset)
  • UnitSet
  • Pages in WebKB are manually grouped into Web
    units
  • Most pages from the Others category are used as
    support pages of the corresponding Web units.

47
Experiments (methods)
  • Baseline method
  • 3 steps: (1) train Web page classifiers, (2)
    classify Web pages, and (3) construct Web units
  • Baseline with fragments method
  • Non-iterative version of iWUM
  • 5 steps: (1) train Web fragment classifiers, (2)
    build Web directory, (3) generate Web fragments,
    (4) classify Web fragments, and (5) construct Web
    units.
  • Iterative Web Unit Mining method (iWUM)

48
Results
[Figure: results for a = 1, a = 0.5, and a = 1/u]
49
Results
50
iWUM Results
51
Detailed Results
  • Web unit label change rate

52
Web Unit Relationship Mining
  • Identify relationship instances among the mined
    web units that are concept instances
  • Example: Instructor-of (Johnson, CS100)
  • Challenges
  • Approach? Method?
  • Features to be used?

53
Web Unit Relationship Mining
  • Assumptions
  • Relationships can be determined based on
    background relation knowledge
  • Background relations are represented by
    inter-unit features
  • Our proposed method
  • Candidate Web unit pair generation
  • Feature acquisition
  • Classifier training
  • Classification

54
Inter-Unit Features
  • Navigation Features (N)
  • Relative Location Features (R)
  • Parent-child
  • Sibling
  • Ancestor-descendant
  • Common-item Features (E)
  • Email addresses

55
Navigation Features (N)
56
Relative Location Features (R)
  • Parent-child: h2 and h4
  • Sibling: h2 and h3
  • Ancestor-descendant: h1 and h4
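
Relative location features can be sketched by comparing the folders that hold two Web units. The folder layout below (h1 at the root, h2 and h3 directly under it, h4 under h2) is an assumption chosen to agree with the three examples on this slide, since the figure itself is not in the transcript; the function names are hypothetical.

```python
import posixpath
from typing import Optional


def _norm(folder: str) -> str:
    """Normalise a folder path to a leading slash and no trailing slash."""
    return "/" + folder.strip("/")


def _is_ancestor(upper: str, lower: str) -> bool:
    """True when 'upper' lies strictly above 'lower' in the folder tree."""
    return upper != lower and lower.startswith(upper.rstrip("/") + "/")


def relative_location(folder_a: str, folder_b: str) -> Optional[str]:
    """Classify the relative location of two folders, or None if unrelated."""
    a, b = _norm(folder_a), _norm(folder_b)
    if a == b:
        return None
    parent_a, parent_b = posixpath.dirname(a), posixpath.dirname(b)
    if parent_b == a or parent_a == b:
        return "parent-child"
    if parent_a == parent_b:
        return "sibling"
    if _is_ancestor(a, b) or _is_ancestor(b, a):
        return "ancestor-descendant"
    return None


# Assumed layout: h1 is the root, h2 and h3 sit under h1, h4 sits under h2.
h1, h2, h3, h4 = "/", "/a/", "/b/", "/a/c/"
relations = {
    "h2-h4": relative_location(h2, h4),
    "h2-h3": relative_location(h2, h3),
    "h1-h4": relative_location(h1, h4),
    "h3-h4": relative_location(h3, h4),
}
```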

57
Experimental Dataset
  • WebKB
  • Department-of (people, department)
  • Instructor-of (people, course)
  • Member-of (people, project)

58
Experimental Results
  • On the manually labelled web units

59
Experimental Results
  • On the iWUM mined web units

60
Conclusion
  • Feature combinations for Web Page Classification
  • Web Unit to model a concept instance
  • Web unit mining and iWUM method
  • Web fragment generation and classification
  • Web unit construction and classification
  • Web unit mining performance metric
  • Web unit relationship mining

61
Future Work
  • Enhancement of the proposed solutions
  • Evaluation of iWUM method on larger datasets
  • Development of incremental Web unit mining
    methods
  • Web units and applications
  • Search Engines for Web Units and Web Unit
    Relationships

62
Relevant Publications
  • A. Sun, E.-P. Lim, W.-K. Ng, and J. Srivastava,
    Blocking Reduction Strategies in Hierarchical
    Text Classification, IEEE TKDE 16(10):1305-1308,
    2004.
  • A. Sun and E.-P. Lim, Web Unit Mining: Finding
    and Classifying Subgraphs of Web Pages, ACM CIKM,
    2003.
  • A. Sun, E.-P. Lim, and W.-K. Ng, Performance
    Measurement Framework for Hierarchical Text
    Classification, JASIST 54(11):1014-1028, 2003.
  • A. Sun and E.-P. Lim, Web Classification Using
    Support Vector Machine, ACM WIDM, 2002.

63
Thank You