1
Web classification
  • Ontology and Taxonomy

2
References
  • Using Ontologies to Discover Domain-Level Web Usage Profiles.
    H. Dai, B. Mobasher, DePaul University ({hdai,mobasher}@cs.depaul.edu)
  • Learning to Construct Knowledge Bases from the World Wide Web.
    M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery (Carnegie
    Mellon University, Pittsburgh, USA); D. Freitag, A. McCallum (Just
    Research, Pittsburgh, USA)

3
Definitions
  • Ontology
  • An explicit formal specification of how to
    represent the objects, concepts and other
    entities that are assumed to exist in some area
    of interest and the relationships that hold among
    them.
  • Taxonomy
  • A classification of organisms into groups based on similarities of
    structure, origin, etc.

4
Goal
  • Capture and model behavioral patterns and
    profiles of users interacting with a web site.
  • Why?
  • Collaborative filtering
  • Personalization systems
  • Improve the organization and structure of the site
  • Provide dynamic recommendations
    (www.recommend-me.com)

5
Algorithm 0 (by Rafa's brother Gabriel)
  • Recommend pages viewed by other users with
    similar page ranks.
  • Problems
  • New item problem
  • Doesn't consider content similarity or item-to-item relationships.

6
User session
  • User session s = ⟨w(p1,s), w(p2,s), ..., w(pn,s)⟩
  • w(pi,s) is the weight, in session s, associated with page pi
  • Session clusters cl1, cl2, ...
  • Each cli is a subset of the set of sessions
  • Usage profile prcl = { ⟨p, weight(p,prcl)⟩ : weight(p,prcl) ≥ µ }
  • weight(p,prcl) = (1/|cl|) · Σs∈cl w(p,s)  (see the sketch below)
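  A minimal Python sketch (not from the slides; the session layout, threshold
  µ and names are illustrative) of the profile-weight formula above:

    # Compute the aggregate usage profile pr_cl of a session cluster cl.
    # Each session is a dict {page: w(p, s)}; pages below the threshold mu are dropped.
    def usage_profile(cluster, mu=0.3):
        pages = {p for session in cluster for p in session}
        profile = {}
        for p in pages:
            # weight(p, pr_cl) = (1/|cl|) * sum over s in cl of w(p, s)
            w = sum(session.get(p, 0.0) for session in cluster) / len(cluster)
            if w >= mu:
                profile[p] = w
        return profile

    # Example: a cluster of two sessions
    sessions = [{"A.html": 1.0, "B.html": 0.5}, {"A.html": 0.8}]
    print(usage_profile(sessions))   # {'A.html': 0.9}; B.html gets 0.25 < mu and is dropped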

7
Algorithm 1
  • For every session, create a vector containing the viewed pages and a
    weight for each page.
  • Each vector represents a point in an N-dimensional space, so we can
    identify clusters of sessions.
  • For a new session, check which cluster its vector/point belongs to, and
    recommend the high-scoring pages of that cluster (see the sketch below).
  • Problems
  • New item problem
  • Doesn't consider content similarity or item-to-item relationships
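  A minimal sketch of this clustering-based recommender (assumed data layout
  and scikit-learn k-means; not the authors' implementation):

    import numpy as np
    from sklearn.cluster import KMeans

    pages = ["A.html", "B.html", "C.html", "D.html"]   # each page is one dimension

    def to_vector(session):
        # session: dict page -> w(p, s); returns a point in N-dimensional space
        return np.array([session.get(p, 0.0) for p in pages])

    train = [{"A.html": 1.0, "B.html": 1.0},
             {"A.html": 1.0, "B.html": 0.5},
             {"C.html": 1.0, "D.html": 1.0}]
    X = np.vstack([to_vector(s) for s in train])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    def recommend(new_session, top_k=2):
        # assign the new session to its nearest cluster ...
        cluster = km.predict(to_vector(new_session)[None, :])[0]
        profile = X[km.labels_ == cluster].mean(axis=0)    # cluster usage profile
        ranked = sorted(zip(pages, profile), key=lambda t: -t[1])
        # ... and recommend its highest-weight pages not already visited
        return [p for p, w in ranked if p not in new_session][:top_k]

    print(recommend({"A.html": 1.0}))   # e.g. ['B.html', 'C.html']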

8
Algorithm 2: keyword search
  • Solves the new item problem.
  • Not good enough
  • A page can contain information about more than one object.
  • Essential data may only be linked from the page rather than included in it.
  • What exactly counts as a keyword?
  • Solution
  • Domain ontologies for objects

9
Domain Ontologies
  • Domain-Level Aggregate Profile: a set of pseudo objects, each
    characterizing objects of different types occurring commonly across the
    user sessions.
  • Class - C
  • Attribute a = ⟨Da, Ta, ≤a, ψa⟩
  • Ta - type of the attribute
  • Da - domain of the values for a (red, blue, ...)
  • ≤a - ordering relation on Da
  • ψa - combination function

10
Example: movie web site
  • Classes
  • movies, actors, directors, etc.
  • Attributes
  • Movies: title, genre, starring actors
  • Actors: name, filmography, gender, nationality
  • Combination functions (see the sketch below)
  • ψactor(⟨{S:0.7, T:0.2, U:0.1}, 1⟩, ⟨{S:0.5, T:0.5}, 0.7⟩), where each
    actor's combined weight is Σi(wi·woi) / Σi(wi)
  • ψyear(1991, 1994) = [1991, 1994]
  • ψis_a({person, student}, {person, TA}) = person
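  A minimal sketch of such combination functions (illustrative; the exact
  normalization and the is_a hierarchy handling are assumptions based on the
  slide):

    def psi_actor(weighted_objects):
        # weighted_objects: list of (actor_weights, object_significance),
        # e.g. ({"S": 0.7, "T": 0.2, "U": 0.1}, 1.0)
        # combined actor weight = sum_i(w_i * wo_i) / sum_i(wo_i)
        total = sum(wo for _, wo in weighted_objects)
        combined = {}
        for actor_weights, wo in weighted_objects:
            for actor, w in actor_weights.items():
                combined[actor] = combined.get(actor, 0.0) + w * wo
        return {a: round(v / total, 2) for a, v in combined.items()}

    def psi_year(*years):
        # combine year values into the covering interval
        return (min(years), max(years))

    def psi_is_a(*paths):
        # each path lists ancestors from the root; return the deepest common one
        common = None
        for nodes in zip(*paths):
            if len(set(nodes)) == 1:
                common = nodes[0]
        return common

    print(psi_actor([({"S": 0.7, "T": 0.2, "U": 0.1}, 1.0), ({"S": 0.5, "T": 0.5}, 0.7)]))
    print(psi_year(1991, 1994))                                  # (1991, 1994)
    print(psi_is_a(["person", "student"], ["person", "TA"]))     # person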

11
(No Transcript)
12
Creating an Aggregated Representation of a usage
profile
  • pr = {⟨o1, wo1⟩, ..., ⟨on, won⟩}
  • oi is an object; woi is its significance in the profile pr
  • Assume all the objects are instances of the same class
  • Create a new virtual object o with attributes ai = ψi(o1, ..., on)

13
Item level usage profile
Year         | Actor                          | Genre                                                     | Name
2002         | S:0.7, T:0.2, U:0.1            | Genre-all, Romance, Romance-Comedy, Comedy, Kids & family | A
1999         | S:0.5, T:0.5                   | Genre-all, Romance, Comedy                                | B
2001         | W:0.6, S:0.4                   | Genre-all, Romance                                        | C
[1999, 2002] | S:0.58, T:0.27, W:0.09, U:0.05 | Genre-all, Romance                                        | A:1, B:1, C:1
(the last row is the aggregated pseudo-object)
14
A real (estate property) example
15
Item Level Usage Profile
Room num | Location          | Price | Weight
5        | Chicago           | 475K  | 1
4        | Chicago           | 299K  | 0.7
4        | Evanston          | 272K  | 0.18
3        | Chicago           | 99K   | 0.18
4        | Chicago, Evanston | 365K  | 1
(the last row is the aggregated pseudo-object)
16
Algorithm 2
  • Do not just recommend other items viewed by other users; recommend items
    similar to the class representative.
  • Advantages
  • Higher accuracy
  • Needs fewer examples
  • No new item problem
  • Also considers content similarity (item-to-item relationships).

17
Item Level Usage Profile
Room | Location          | Price | Weight
5    | Chicago           | 475K  | 1
4    | Chicago           | 299K  | 0.7
4    | Evanston          | 272K  | 0.18
3    | Chicago           | 99K   | 0.18
4    | Chicago, Evanston | 365K  | 1
4    | Chicago           | 370K  | 1
(the new last row is the recommended item)
18
Final Algorithm
  • Given a web site
  • Classify its contents into classes and attributes.
  • Merge the objects of each user profile and create a pseudo-object.
  • Recommend according to this pseudo-object.

19
Problems
  • A per-topic solution
  • Found patterns can be incomplete
  • User patterns may change with time (for movies): the "I loved E.T." problem.
  • Needs cookies or other methods to identify users.
  • How is the weight calculated? It can require many examples: the
    "I loved American Beauty" problem.
  • How do we automatically group the web pages?

20
  • Break?

21
Constructing Knowledge Base from WWW
  • Goal
  • Automatically create a computer-understandable knowledge base from the web.
  • Why?
  • To use in the previously described work and similar applications
  • Find all universities that offer Java Programming
    courses
  • Make me hotel and flight arrangements for the
    upcoming Linux conference

22
Constructing Knowledge Base from WWW
  • How?
  • Use machine learning to create information extraction methods for each of
    the desired types of knowledge
  • Apply them to extract symbolic, probabilistic statements directly from
    the web, e.g. student-of(Rafa, sdbi) with confidence 0.99
  • Method used
  • Provide an initial ontology (classes and relations)
  • Training examples: 3 out of 4 university sites (8,000 web pages,
    1,400 web-page pairs)

23
Example of web pages
Jim's Home Page: "I teach several courses: Fundamentals of CS, Intro to AI.
My research includes intelligent web agents..."
Fundamentals of CS Home Page: "Instructors: Jim, Tom..."
Classes: Faculty, Research-project, Student, Staff, (Person), Course,
Department, Other
Relations: instructor-of, members-of-project, department-of
24
Ontology
Web KB instances
25
Problem Assumptions
  • Class instance: one instance per web page
  • Multiple instances in one web page
  • Multiple linked/related web pages per instance
  • The "Elvis problem"
  • Relation R(A,B) is represented by
  • Hyperlinks A→B or A→C→D→...→B
  • Inclusion in a particular context ("I teach Intro2cs")
  • A statistical model of typical words

26
To Learn
  1. Recognizing class instances by classifying bodies of hypertext
  2. Recognizing relation instances by classifying chains of hyperlinks
  3. Extracting text fields

27
Recognizing class instances by classifying bodies
of hypertext
  • Statistical bag-of-words approach
  • Full Text
  • Hyperlinks
  • Title/Head
  • Learning first-order rules
  • Combining the previous four methods

28
Statistical bag-of-words approach
  • Context-less classification
  • Given a set of classes C = {c1, c2, ..., cN}
  • Given a document consisting of n words w1, w2, ..., wn
    (vocabulary limited to 2,000 words)
  • c* = argmaxc Pr(c | w1, ..., wn)  (see the naive Bayes sketch below)
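  A minimal naive Bayes sketch of this bag-of-words classifier (training data,
  smoothing and class names are illustrative, not from the paper):

    import math
    from collections import Counter

    class BagOfWordsClassifier:
        def fit(self, documents, labels):
            self.classes = set(labels)
            self.vocab = {w for doc in documents for w in doc}
            self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
            self.counts = {c: Counter() for c in self.classes}
            for doc, c in zip(documents, labels):
                self.counts[c].update(doc)
            return self

        def predict(self, doc):
            # c* = argmax_c Pr(c | w1, ..., wn), with Laplace-smoothed word probabilities
            def log_pr(c):
                total = sum(self.counts[c].values()) + len(self.vocab)
                return math.log(self.prior[c]) + sum(
                    math.log((self.counts[c][w] + 1) / total)
                    for w in doc if w in self.vocab)
            return max(self.classes, key=log_pr)

    docs = [["homework", "syllabus", "exam"], ["advisor", "publications", "cv"]]
    clf = BagOfWordsClassifier().fit(docs, ["course", "faculty"])
    print(clf.predict(["exam", "homework"]))   # course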

29
Confusion matrix for the full-text classifiers (rows: predicted class,
columns: actual class):

Predicted \ Actual | Cour | Stud | Facu | Staff | Rese | Dept | Other | Accuracy
Course             |  202 |   17 |    0 |     0 |    1 |    0 |   552 |     26.2
Student            |    0 |  421 |   14 |    17 |    2 |    0 |   519 |     43.3
Faculty            |    5 |   56 |  118 |    16 |    3 |    0 |   264 |     17.9
Staff              |    0 |   15 |    1 |     4 |    0 |    0 |    45 |      6.2
Research           |    8 |    9 |   10 |     5 |   62 |    0 |   384 |     13.0
Department         |   10 |    8 |    3 |     1 |    5 |    4 |   209 |      1.7
Other              |   19 |   32 |    7 |     3 |   12 |    0 |  1064 |     93.6
Coverage           | 82.8 | 75.4 | 77.1 |   8.7 | 72.9 |  100 |  35.0 |
30
Statistical bag-of-words approach: words are ranked per class by the weighted
log-odds ratio Pr(wi|c) · log(Pr(wi|c) / Pr(wi|¬c))  (see the snippet below)
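  A small illustrative snippet of that ranking (the probabilities below are
  made up):

    import math

    def top_words(prob_in_class, prob_outside, k=5):
        # score(w) = Pr(w|c) * log(Pr(w|c) / Pr(w|not c))
        score = {w: p * math.log(p / prob_outside[w]) for w, p in prob_in_class.items()}
        return sorted(score, key=score.get, reverse=True)[:k]

    print(top_words({"exam": 0.03, "homework": 0.02, "page": 0.05},
                    {"exam": 0.001, "homework": 0.002, "page": 0.040}))
    # ['exam', 'homework', 'page']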
31
Accuracy/Coverage tradeoff for full-text
classifiers
32
Accuracy/coverage tradeoff for hyperlinks
classifiers
33
Accuracy/coverage tradeoff for title/heading classifiers
34
Learning first-order rules
  • The previous methods don't consider relations between pages
  • Example: a page is a course home page if it contains the words "textbook"
    and "TA" and points to a page containing the word "assignment".
  • FOIL is a learning system that constructs Horn-clause programs from
    examples

35
Relations
  • has_word(Page): stemmed words (computer, computing → comput); only words
    with at least 200 occurrences but fewer than 30 occurrences in pages of
    other classes
  • link_to(Page, Page)
  • m-estimate accuracy = (nc + m·p) / (n + m)
  • nc - number of instances correctly classified by the rule
  • n - total number of instances classified by the rule
  • m = 2
  • p - proportion of instances in the training set that belong to that class
  • Predict each class with confidence best_match / total_number_of_matches
    (see the m-estimate sketch below)
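  A small sketch of the m-estimate (m = 2 as on the slide; the example numbers
  are made up):

    def m_estimate(nc, n, p, m=2):
        # nc: instances the rule classifies correctly, n: instances it classifies,
        # p: proportion of the class in the training set
        return (nc + m * p) / (n + m)

    # a rule matching 30 pages, 27 of them correctly, for a class that is 10%
    # of the training set:
    print(m_estimate(nc=27, n=30, p=0.1))   # 0.85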

36
New learned rules
  • student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A),
    has_jame(B), has_paul(B), not(has_mail(B)).
  • faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).
  • course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B),
    not(link_to(B,_1)), has_assign(B).

37
Accuracy/coverage for FOIL page classifiers
38
Boosting
  • The best-performing classification method depends on the class
  • Combine the predictions using the confidence measure

39
Accuracy/coverage tradeoff for combined
classifiers (2000 words vocabulary)
40
Boosting
  • Disappointing: it is not uniformly better
  • Possible solutions
  • Use reduced-size dictionaries (next)
  • Use other methods for combining predictions (voting instead of
    best_match / total_number_of_matches)

41
Accuracy/coverage tradeoff for combined
classifiers (200 words vocabulary)
42
Multi-Page segments
  • The group is the longest prefix (indicated in
    parentheses)
  • (@/{user,faculty,people,home,projects}/)/.{html,htm}
  • (@/{cs???,www,…}/)/.{html,htm}
  • (@/{cs???,www,…}/)/
  • A primary page is any page whose URL matches
  • @/index.{html,htm}
  • @/home.{html,htm}
  • @/1/1.{html,htm}
  • If no page in the group matches one of these patterns, then the page with
    the highest score for any non-other class is the primary page.
  • Any non-primary page is tagged as Other (see the sketch below)
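  A minimal sketch of the primary-page heuristic (the regexes are an assumed
  reading of the URL patterns above, and the scoring input is illustrative):

    import re

    PRIMARY_PATTERNS = [
        re.compile(r"/index\.html?$"),
        re.compile(r"/home\.html?$"),
        re.compile(r"/([^/]+)/\1\.html?$"),   # assumed reading: directory name repeated as file name
    ]

    def primary_page(group_urls, scores):
        # group_urls: pages in one URL-prefix group; scores: url -> best non-"other" score
        for url in group_urls:
            if any(p.search(url) for p in PRIMARY_PATTERNS):
                return url
        # otherwise take the page with the highest score for any non-other class
        return max(group_urls, key=lambda u: scores.get(u, 0.0))

    urls = ["http://x.edu/~jim/pubs.html", "http://x.edu/~jim/index.html"]
    print(primary_page(urls, {u: 0.5 for u in urls}))   # .../index.html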

43
Accuracy/coverage tradeoff for the full-text classifier after URL grouping
heuristics
44
Conclusion- Recognizing Classes
  • Hypertext provides redundant information
  • We can classify using several methods
  • Full text
  • Heading/title
  • Hyperlinks
  • Text in neighboring pages
  • Grouping pages
  • No method alone is good enough.
  • Combining the predictions of several classification methods gives better
    results.

45
Learning to Recognize Relation Instances
  • Assume relations are represented by hyperlinks
  • Given the following background relations
  • class(Page)
  • link_to(Hyperlink, P1, P2)
  • has_word(H): the word occurs in the hyperlink's anchor text
  • all_words_capitalized(H)
  • has_alphanumeric_word(H), e.g. "I teach CS2765"
  • has_neighborhood_word(H): the word occurs in the paragraph surrounding the
    hyperlink

46
Learning to Recognize Relation Instances
  • Try to learn the following
  • members_of_project(P1,P2)
  • instructors_of_course(P1,P2)
  • department_of_person(P1,P2)

47
Learned relations
  • instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).
  • Test set: 133 Pos, 5 Neg
  • department_of(A,B) :- person(A), department(B), link_to(C,D,A),
    link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E).
  • Test set: 371 Pos, 4 Neg
  • members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D),
    link_to(E,D,B), has_neighborhood_word_people(C).
  • Test set: 18 Pos, 0 Neg

48
Accuracy/Coverage tradeoff for learned relation
rules
49
Learning to Extract Text Fields
  • Sometimes we want a small text fragment (e.g. a person's name such as Jon
    or Peter), not the whole web page or class
  • "Make me hotel and flight arrangements for the upcoming Linux conference"

50
Predefined predicates
  • Let F = w1, w2, ..., wj be a fragment of text
  • length(Relop, N), Relop ∈ {<, >, =}
  • some(Var, Path, Feat, Value), e.g.
    some(A, [next_token, next_token], numeric, true)
  • position(Var, From, Relop, N)
  • relpos(Var1, Var2, Relop, N)

51
A (wrong) example
Last-Modified: Wednesday, 26-Jun-96 01:37:46 GMT
<title> Bruce Randall Donald </title>
<h1> <img src="ftp://ftp.cs.cornell.edu/pub/brd/images/brd.gif"> <p>
Bruce Randall Donald<br> Associate Professor<br>
  • ownername(Fragment) :-
  •   some(A, [prev_token], word, "gmt"),
  •   some(A, [], in_title, true),
  •   some(A, [], word, unknown),
  •   some(A, [], quadrupletonp, false),
  •   length(<, 3)

52
Accuracy/coverage tradeoff for Name Extraction
53
Conclusions
  • Used machine learning algorithms to create information extraction methods
    for each desired type of knowledge.
  • WebKB achieves about 70% accuracy at 30% coverage.
  • Bag-of-words classifiers (hyperlinks, web pages and full text) and
    first-order learning can be combined to boost confidence
  • First-order learning can be used to look outward from the page and
    consider its neighbors

54
Problems
  • Not as accurate as we would like
  • More accuracy can be obtained at the cost of coverage
  • Use linguistic features (verbs)
  • Add new methods to the booster (e.g. predict the department of a
    professor based on the departments of his student advisees)
  • Per-topic, per-language, per-method; needs hand-made labeling to learn.
  • Learners with high accuracy can be used to teach learners with low
    accuracy.