Web Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Web Mining

Description:

Web Structure Mining. Web as Social Network. Features and ... Web mining research integrate research from several research communities : Database (DB) ... – PowerPoint PPT presentation

Number of Views:1927
Avg rating:3.0/5.0
Slides: 121
Provided by: deve62
Learn more at: http://ce.sharif.edu
Category:
Tags: mined | mining | simfree | web

Transcript and Presenter's Notes

Title: Web Mining


1
Web Mining
  • Kyumars Sheykh Esmaili
  • Modern Information Retrieval Course
  • Sharif University of Technology
  • Spring 2006

2
Table of Contents
  • Introduction
  • Web Content Mining
  • Feature Selection and Similarity Measures
  • Web Structure Mining
  • Web as Social Network
  • Features and Similarity Measures
  • Social Network Analysis Algorithms
  • PageRank
  • Cyber-Communities
  • HITS
  • CT
  • Web Content-Structure Clustering
  • Web Usage Mining
  • Some Concrete Applications of Web Mining
  • Focused Crawling
  • Web Search Result Clustering
  • Summary

4
Introduction
  • Information overload on the web
  • Size
  • 2001
  • New information created: 6 exabytes (1 exabyte = 10^18 bytes)
  • 10 billion (non-spam) e-mail messages were sent
    per day.
  • 2002
  • New information created: 12 exabytes
  • 2003
  • The public Internet contained about 1 trillion
    pages and was increasing at a rate of
    approximately 8 million pages per day.
  • 2005
  • 35 billion messages per day by 2005.

5
Challenges on WWW Interactions
  • Finding Relevant Information
  • Creating knowledge from Information available
  • Personalization of the information
  • Learning about customers / individual users
  • Web Mining can play an important Role!

6
Introduction
  • Web mining: the use of data mining techniques to
    automatically discover and extract information
    from Web documents/services
  • Web mining research integrates research from
    several research communities
  • Database (DB)
  • Information retrieval (IR)
  • The sub-areas of machine learning (ML)
  • Natural language processing (NLP)

7
Web Data
  • Web pages
  • Intra-page structures
  • Inter-page structures
  • Usage data
  • Supplemental data
  • Profiles
  • Registration information
  • Cookies

8
Web Data Categories
9
Web Mining Taxonomy
[Figure: taxonomy diagram with the labels Web Structure Clustering, Web Content Clustering, Web Usage Clustering, Web C-S Clustering]
10
Web Mining Subtasks
  • Resource Finding
  • Task of retrieving intended web-documents
  • Information Selection and Pre-processing
  • Automatic selection and pre-processing of specific
    information from retrieved web resources
  • Generalization
  • Automatic Discovery of patterns in web sites
  • Analysis
  • Validation and / or interpretation of mined
    patterns

12
Feature Selection for Web Mining
  • For the purposes of automated text classification,
    text features should be
  • Relatively few in number
  • Moderate in frequency of assignment
  • Low in redundancy
  • Low in noise
  • Related in semantic scope to the classes to
    be assigned
  • Relatively unambiguous in meaning

13
Feature Selection
  • Potential features
  • BODY
  • META
  • TITLE
  • Snippet
  • The sentences shown with URL u when it appears in
    search results
  • Anchor Window
  • The anchor text and the text around the hyperlink
    v -> u in the source page v
  • MT, the union of META and TITLE content
  • BMT, the union of BODY, META and TITLE content.

14
Feature Selection for Content Mining
Percentage of Web Pages With Words in HTML Tags
15
Feature Selection For Web Pages
Classification performance for various
representations of web pages
16
Vector Space Model for Content-Similarity
  • IR systems usually adopt index terms to process
    queries
  • Index term
  • a keyword or group of selected words
  • any word (more general)
  • Stemming might be used
  • connect: connecting, connection, connections
  • An inverted file is built for the chosen index
    terms
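
The two ideas on this slide, an inverted file over index terms and a vector space similarity, can be sketched as follows. This is a minimal illustration, not the course's implementation: the whitespace tokenizer, the raw term-frequency weights, the use of cosine similarity and the sample documents are all assumptions made for the example, and no stemming is applied.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (assumption: no stemming, no stop words)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy collection.
docs = {
    "d1": "web mining extracts patterns from web data",
    "d2": "data mining finds patterns in data",
    "d3": "social network analysis",
}
index = build_inverted_index(docs)
```

Looking up `index["mining"]` returns the documents containing that term, and `cosine_similarity(docs["d1"], docs["d2"])` scores how close two documents are in term space.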

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
23
Social network analysis
  • Social network analysis is the study of social entities
    (people in an organization, called actors), and
    their interactions and relationships.
  • The interactions and relationships can be
    represented with a network or graph,
  • each vertex (or node) represents an actor and
  • each link represents a relationship.
  • From the network, we can study the properties of
    its structure, and the role, position and
    prestige of each social actor.
  • We can also find various kinds of sub-graphs,
    e.g., communities formed by groups of actors.

24
Social network and the Web
  • Social network analysis is useful for the Web
    because the Web is essentially a virtual society,
    and thus a virtual social network,
  • Each page is a social actor and
  • each hyperlink is a relationship.
  • Many results from social network analysis can be
    adapted and extended for use in the Web context.

25
Web Structure Mining
  • The Web consists not only of pages, but also of
    hyperlinks pointing from one page to another
  • These hyperlinks contain an enormous amount of
    latent human annotation
  • Assumption
  • A link from page A to page B is a recommendation of
    page B by A
  • If A and B are connected by a link, there is a
    higher probability that they are on the same topic

26
Web Link Analysis
  • Used for
  • Ordering documents matching a user query (ranking)
  • Deciding what pages to add to a collection
    (crawling)
  • Page categorization
  • Finding related pages
  • Finding duplicated web sites

28
Structural Similarity Measures
  • We must define the similarity of two nodes
  • Method I
  • For page A and page B, A is related to B if there
    is a hyperlink from A to B, or from B to A
  • Not so good. Consider the home pages of IBM and
    Microsoft: closely related, yet unlikely to link
    to each other.

[Figure: Page A and Page B]
29
Structural Similarity Measures
  • Method II (from Bibliometrics)
  • Co-citation: the similarity of A and B is
    measured by the number of pages that cite both A and B
  • Bibliographic coupling: the similarity of A and B
    is measured by the number of pages cited by both
    A and B.
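
Both bibliometric measures reduce to simple counts over the link graph. The sketch below assumes the graph is given as a dictionary from each page to the set of pages it links to; the page names are hypothetical.

```python
def cocitation(links, a, b):
    """Number of pages that link to (cite) both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)

def bibliographic_coupling(links, a, b):
    """Number of pages that are linked to (cited) by both a and b."""
    return len(links.get(a, set()) & links.get(b, set()))

# Hypothetical link graph: page -> set of pages it links to.
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "P3": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}
```

Here A and B are co-cited by P1 and P2, and they are coupled through the single shared target Y.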

[Figure: two diagrams of Page A and Page B, one for each measure]
31
Using link structure of web (cont.)
  • There are two famous Link-Structure based
    algorithms for ranking
  • PageRank
  • HITS
  • Nearly all other algorithms are based on these
    two
  • SALSA,
  • Clever,
  • ...

32
PageRank
  • Introduced by Page et al (1998)
  • An offline algorithm (Query independent)
  • The weight is assigned by the rank of parents

33
Matrix Notation
34
Solve the PageRank equation
$P = A^T P$    (15)
  • This is the characteristic equation of the
    eigensystem, where the solution to P is an
    eigenvector with the corresponding eigenvalue of
    1.
  • It turns out that if some conditions are
    satisfied, 1 is the largest eigenvalue and the
    PageRank vector P is the principal eigenvector.
  • A well known mathematical technique called power
    iteration can be used to find P.
  • Problem: the above equation does not quite
    suffice because the Web graph does not meet the
    conditions.

35
Using Markov chain
  • To introduce these conditions and the enhanced
    equation, let us derive the same Equation (15)
    based on the Markov chain.
  • In the Markov chain, each Web page or node in the
    Web graph is regarded as a state.
  • A hyperlink is a transition, which leads from one
    state to another state with a probability.
  • This framework models Web surfing as a stochastic
    process.
  • It models a Web surfer randomly surfing the Web
    as state transition.

36
Random surfing
  • Recall we use $O_i$ to denote the number of
    out-links of a node i.
  • Each transition probability is $1/O_i$ if we assume
    the Web surfer will click the hyperlinks in the
    page i uniformly at random.
  • The back button on the browser is not used and
  • the surfer does not type in an URL.

37
Transition probability matrix
  • Let A be the state transition probability
    matrix.
  • $A_{ij}$ represents the transition probability that
    the surfer in state i (page i) will move to state
    j (page j). $A_{ij}$ is defined exactly as in Equation
    (14).

38
Let us start
  • Given an initial probability distribution vector
    that a surfer is at each state (or page)
  • $p_0 = (p_0(1), p_0(2), \ldots, p_0(n))^T$ (a column vector)
    and
  • an $n \times n$ transition probability matrix A,
  • we have
  • $\sum_{i=1}^{n} p_0(i) = 1$    (16)
  • $\sum_{j=1}^{n} A_{ij} = 1$    (17)
  • If the matrix A satisfies Equation (17), we say
    that A is the stochastic matrix of a Markov
    chain.
39
Back to the Markov chain
  • In a Markov chain, a question of common interest
    is
  • Given $p_0$ at the beginning, what is the
    probability that m steps/transitions later the
    Markov chain will be at each state j?
  • We determine the probability that the system (or
    the random surfer) is in state j after 1 step (1
    transition) by using the following reasoning

$p_1(j) = \sum_{i=1}^{n} A_{ij}\, p_0(i)$    (18)
40
State transition
41
Stationary probability distribution
  • By a theorem on Markov chains,
  • a finite Markov chain defined by the stochastic
    matrix A has a unique stationary probability
    distribution if A is irreducible and aperiodic.
  • The stationary probability distribution means
    that after a series of transitions $p_k$ will
    converge to a steady-state probability vector $\pi$
    regardless of the choice of the initial
    probability vector $p_0$, i.e.,

$\lim_{k \to \infty} p_k = \pi$    (21)
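
The convergence claim can be checked numerically on a small chain. The sketch below applies one transition at a time, $p_k(j) = \sum_i A_{ij}\, p_{k-1}(i)$, to a hypothetical 3-state stochastic matrix that is irreducible and aperiodic; the matrix and the number of steps are illustrative assumptions, not values from the slides.

```python
def step(A, p):
    """One transition: returns A^T p, i.e. p_next[j] = sum_i A[i][j] * p[i]."""
    n = len(A)
    return [sum(A[i][j] * p[i] for i in range(n)) for j in range(n)]

def iterate(A, p0, k):
    """Apply k transitions to the initial distribution p0."""
    p = list(p0)
    for _ in range(k):
        p = step(A, p)
    return p

# Hypothetical stochastic matrix: every row sums to 1, every state has a
# self-loop (so the chain is aperiodic) and all states reach each other.
A = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

from_first = iterate(A, [1.0, 0.0, 0.0], 60)
from_last = iterate(A, [0.0, 0.0, 1.0], 60)
```

Both starting vectors end up at the same distribution, (0.25, 0.5, 0.25), which is the stationary vector of this particular matrix.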
42
PageRank again
  • When we reach the steady state, we have $p_k = p_{k+1} = \pi$,
    and thus
  • $\pi = A^T \pi$.
  • $\pi$ is the principal eigenvector of $A^T$ with
    eigenvalue of 1.
  • In PageRank, $\pi$ is used as the PageRank vector P.
    We again obtain Equation (15), which is
    reproduced here as Equation (22)

$P = A^T P$    (22)
43
Is $P = \pi$ justified?
  • Using the stationary probability distribution $\pi$
    as the PageRank vector is reasonable and quite
    intuitive because
  • it reflects the long-run probabilities that a
    random surfer will visit the pages.
  • A page has a high prestige if the probability of
    visiting it is high.

44
Back to the Web graph
  • Now let us come back to the real Web context and
    see whether the above conditions are satisfied,
    i.e.,
  • whether A is a stochastic matrix and
  • whether it is irreducible and aperiodic.
  • None of them is satisfied.
  • Hence, we need to extend the ideal-case Equation
    (22) to produce the actual PageRank model.

45
A is not a stochastic matrix
  • A is the transition matrix of the Web graph
  • It does not satisfy equation (17)
  • because many Web pages have no out-links, which
    are reflected in transition matrix A by some rows
    of complete 0s.
  • Such pages are called the dangling pages (nodes).

46
An example Web hyperlink graph
47
Fix the problem: two possible ways
  • Remove those pages with no out-links during the
    PageRank computation as these pages do not affect
    the ranking of any other page directly.
  • Add a complete set of outgoing links from each
    such page i to all the pages on the Web.

Let us use the second way
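
A minimal sketch of the second fix, assuming pages are numbered 0 to n-1 and the graph is given as a dictionary of out-link lists without duplicates (both are assumptions for the example). Pages with out-links get $1/O_i$ per link, and each dangling page gets a uniform row of $1/n$.

```python
def transition_matrix(out_links, n):
    """Build A with A[i][j] = 1/O_i for each link i -> j.

    A dangling page (no out-links) is treated as if it linked to every
    page, so its row becomes uniform 1/n and every row sums to 1.
    """
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:
            for j in targets:
                A[i][j] = 1.0 / len(targets)
        else:
            A[i] = [1.0 / n] * n
    return A

# Hypothetical 3-page graph; page 2 is dangling.
out_links = {0: [1, 2], 1: [2], 2: []}
A = transition_matrix(out_links, 3)
```

After the fix every row of `A` sums to 1, so the row-sum condition of Equation (17) holds.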
48
A is not irreducible
  • Irreducible means that the Web graph G is
    strongly connected.
  • Definition: A directed graph $G = (V, E)$ is
    strongly connected if and only if, for each pair
    of nodes $u, v \in V$, there is a path from u to v.
  • A general Web graph represented by A is not
    irreducible because
  • for some pair of nodes u and v, there is no path
    from u to v.
  • In our example, there is no directed path from
    node 3 to node 4.

49
A is not aperiodic
  • A state i in a Markov chain being periodic means
    that there exists a directed cycle that the chain
    has to traverse.
  • Definition: A state i is periodic with period k > 1
    if k is the largest number such that all paths
    leading from state i back to state i have a
    length that is a multiple of k.
  • If a state is not periodic (i.e., k = 1), it is
    aperiodic.
  • A Markov chain is aperiodic if all states are
    aperiodic.

50
An example: a periodic chain
  • Fig. 5 shows a periodic Markov chain with k = 3.
    E.g., if we begin from state 1, the only way to come
    back to state 1 is to traverse the path 1-2-3-1 some
    number of times, say h. Thus any return to state 1
    will take 3h transitions.
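
The period of a strongly connected chain can be computed without enumerating return paths. The sketch below uses a standard breadth-first-search approach (not something stated in the slides): with BFS distances `dist` from any root, the period is the greatest common divisor of `dist[u] + 1 - dist[v]` over all edges u -> v. The edge-list representation is an assumption, and the graph is assumed to be strongly connected.

```python
from collections import deque
from math import gcd

def chain_period(edges, root):
    """Period of a strongly connected directed graph given as (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    # BFS distances from the root.
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # gcd of the "level discrepancy" of every edge.
    period = 0
    for u, v in edges:
        if u in dist and v in dist:
            period = gcd(period, abs(dist[u] + 1 - dist[v]))
    return period

cycle = [(1, 2), (2, 3), (3, 1)]      # the 1-2-3-1 chain from the slide
with_self_loop = cycle + [(1, 1)]     # one self-loop makes the chain aperiodic
```

`chain_period(cycle, 1)` gives 3, matching the slide, while adding a single self-loop drops the period to 1.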

51
Dealing with irreducibility and aperiodicity
  • It is easy to deal with the above two problems
    with a single strategy.
  • Add a link from each page to every page and give
    each link a small transition probability
    controlled by a parameter d.
  • Obviously, the augmented transition matrix
    becomes irreducible and aperiodic

52
Improved PageRank
  • After this augmentation, at a page, the random
    surfer has two options
  • With probability d, he randomly chooses an
    out-link to follow.
  • With probability 1-d, he jumps to a random page
  • Equation (25) gives the improved model,
  • where $E = ee^T$ (e is a column vector of all 1s)
    and thus E is an $n \times n$ square matrix of all 1s.

$P = \left( (1-d)\frac{E}{n} + dA^T \right) P$    (25)
53
Follow our example
54
The final PageRank algorithm
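  • $(1-d)E/n + dA^T$ is a stochastic matrix
    (transposed). It is also irreducible and
    aperiodic
  • If we scale Equation (25) so that $e^T P = n$,
    we obtain Equation (27)
  • PageRank for each page i is given by Equation (28)

$P = (1-d)e + dA^T P$    (27)
$P(i) = (1-d) + d \sum_{j=1}^{n} A_{ji} P(j)$    (28)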
55
The final PageRank (cont.)
  • (28) is equivalent to the formula given in the
    PageRank paper
  • The parameter d is called the damping factor
    which can be set to between 0 and 1. d = 0.85 was
    used in the PageRank paper.
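
A minimal power-iteration sketch of the final formula, P(i) = (1-d) + d Σ_j A_ji P(j) with A_ji = 1/O_j for each link j -> i. It assumes pages are numbered 0 to n-1, that dangling pages are treated as linking to every page (the second fix discussed earlier), and a fixed iteration count; the function name, the graph format and the sample graphs are illustrative assumptions.

```python
def pagerank(out_links, n, d=0.85, iterations=100):
    """Iterate P(i) = (1 - d) + d * sum over links j -> i of P(j) / O_j."""
    # Dangling pages are treated as linking to all n pages.
    links = {i: (out_links.get(i) or list(range(n))) for i in range(n)}
    P = [1.0] * n  # start with e^T P = n; each iteration preserves this sum
    for _ in range(iterations):
        new_P = [1.0 - d] * n
        for j in range(n):
            share = d * P[j] / len(links[j])  # rank passed along each out-link
            for i in links[j]:
                new_P[i] += share
        P = new_P
    return P

# Hypothetical graphs.
ranks = pagerank({0: [1, 2], 1: [2], 2: [0]}, 3)
ranks_with_dangling = pagerank({0: [1, 2], 1: [2], 2: [0], 3: []}, 4)
```

In the first graph page 2 collects rank from both other pages and ends up ranked highest, followed by page 0 and then page 1. Because the ranks are scaled so that they sum to n rather than to 1, a page of average importance has rank 1.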

56
A Practical Example for PageRank
58
What is a cyber-community?
  • A community on the web is a group of web pages
    sharing a common interest
  • E.g. a group of web pages talking about pop music
  • E.g. a group of web pages interested in
    data mining
  • Main properties
  • Pages in the same community should be similar to
    each other in content
  • The pages in one community should differ from the
    pages in another community
  • Similar to a cluster

59
Cyber Communities
60
Two different types of communities
  • Explicitly-defined communities
  • They are well known ones, such as the resource
    listed by Yahoo!
  • Implicitly-defined communities
  • They are communities unexpected or invisible to
    most users

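[Figure: explicit example, a directory hierarchy in which Arts contains Music and Painting, and Music contains Classic and Pop; implicit example, the group of web pages interested in a particular singer]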
61
Different types of communities
  • The explicit communities are easy to identify
  • E.g. Yahoo!, InfoSeek, Clever System
  • In order to extract the implicit communities, we
    need to analyze the web graph objectively
  • In research, people are more interested in the
    implicit communities

62
Methods of clustering
  • Clustering methods based on co-citation analysis
  • Methods derived from HITS (Kleinberg)
  • Using co-citation matrix
  • CT Method

64
HITS: Hubs and Authorities
  • Hub: a web page that links to a collection of
    prominent sites on a common topic
  • Authority: a prominent web page on a topic, pointed
    to by many hubs
  • Mutually reinforcing relationship: a good authority
    is a page that is pointed to by many good hubs,
    while a good hub is a page that points to many
    good authorities

65
Authority and Hubness
[Figure: example graph with nodes 1 to 7]
$y(1) = x(5) + x(6) + x(7)$
$x(1) = y(2) + y(3) + y(4)$
66
HITS Steps (1)
  • Creating root and base sets

67
HITS Steps (2)
  • Calculating Weights
  • Authority weight
  • Hub weight
  • Matrix notation: A is the adjacency matrix
  • A(i, j) = 1 if the i-th page points to the j-th page
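
The weight calculation can be sketched as an iteration in which each authority weight is the sum of the hub weights of the pages pointing to it, and each hub weight is the sum of the authority weights of the pages it points to. The edge-list input, the Euclidean normalisation after every round, the fixed iteration count and the sample graph are assumptions for this example, and the root/base set construction of step (1) is not covered.

```python
import math

def hits(edges, iterations=50):
    """Return (authority, hub) weight dictionaries for a list of (src, dst) links."""
    pages = sorted({p for edge in edges for p in edge})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub weights of pages linking to p.
        auth = {p: sum(hub[q] for q, r in edges if r == p) for p in pages}
        # Hub update: sum of authority weights of pages p links to.
        hub = {p: sum(auth[r] for q, r in edges if q == p) for p in pages}
        # Normalise both vectors to unit length so the weights stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical graph: three hub-like pages pointing at two authority-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
auth, hub = hits(edges)
```

In this graph `a1` gets the highest authority weight, and `h1` and `h2`, which point to both authorities, get higher hub weights than `h3`.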

68
Final Result of HITS
69
HITS Results 3D perspective
70
A Practical Example for HITS
71
HITS Problems
  • Starting from a narrow topic, HITS tends to end
    up in a more general one.
  • A peculiarity of hub pages: their many links can
    cause algorithm drift, since they can point to
    authorities on different topics
  • Pages from a single domain / website can dominate
    the result if they all point to one page, which is
    not necessarily a good authority.
  • Automatically generated links
  • Non-relevant but highly connected pages
  • Topic drift: generalisation of the query topic

72
Difference between PageRank and HITS
  • PageRank is computed for all web pages stored
    in the database, prior to the query. HITS
    is performed on the set of retrieved web pages,
    for each query.
  • HITS computes authorities and hubs; PageRank
    computes authorities only.
  • PageRank is non-trivial to compute; HITS is easy to
    compute, but real-time execution is hard

74
A cheaper method
  • Previous methods are expensive
  • There is another, simpler method called community
    trawling (CT)
  • It has been implemented on a graph of 200
    million pages, where it worked very well

75
Basic idea of CT
  • Definition of communities
  • dense directed bipartite subgraphs
  • Bipartite graph: nodes are partitioned into two
    sets, F and C
  • Every directed edge in the graph is directed from
    a node u in F to a node v in C
  • Dense: many of the possible edges between F and
    C are present

[Figure: bipartite graph with node sets F and C]
76
Basic idea of CT
  • Bipartite cores
  • a complete bipartite subgraph with at least i
    nodes from F and at least j nodes from C
  • i and j are tunable parameters
  • This is called an (i, j) bipartite core
  • Every community has such a core for certain i
    and j.

[Figure: an (i = 3, j = 3) bipartite core]
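
To make the definition concrete, the sketch below enumerates (i, j) bipartite cores by brute force: it tries every group of i pages from F and checks which j pages from C they all link to. This is only an illustration of the definition on a tiny graph, not the pruning-based trawling algorithm, which is what makes enumeration feasible at web scale. The graph format and page names are hypothetical, and i is assumed to be at least 1.

```python
from itertools import combinations

def bipartite_cores(out_links, i, j):
    """Return all (fans, centers) pairs where i fans each link to the same j centers."""
    cores = []
    for fans in combinations(sorted(out_links), i):
        # Pages linked to by every fan in this group.
        common = set.intersection(*(set(out_links[f]) for f in fans))
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores

# Hypothetical graph: fan page -> set of center pages it links to.
out_links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "c4"},
    "f4": {"c4"},
}
cores = bipartite_cores(out_links, 3, 3)
```

Only f1, f2 and f3 share three common targets, so this graph contains exactly one (3, 3) core.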
77
Basic idea of CT
  • A bipartite core is the identity of a community
  • To extract all the communities is to enumerate
    all the bipartite cores on the web.
  • The authors devised an efficient algorithm to enumerate
    the bipartite cores. Its main idea is iterative
    pruning: elimination-generation pruning

79
Content Link Clustering
  • In CLC, each web page q in data set D is
    represented
  • as 3 vectors
  • $q_{Out}$
  • $q_{In}$
  • $q_{Kword}$
  • with M, N and L as the respective vector
    dimensions
  • The ith item of vector $q_{Out}$ indicates
    whether q has the ith of the M out-links
    (and likewise for $q_{In}$ and the N in-links).
    If yes, the ith item is 1, else 0.
  • The kth item of vector $q_{Kword}$ indicates the
    frequency with which the kth of the L terms
    appears in page q.

80
Similarity Measure
  • The similarity of two pages Q and R is the linear
    combination of three parts
  • $S(Q, R) = p_{out} S(Q_{out}, R_{out}) + p_{in} S(Q_{in}, R_{in}) + p_{term} S(Q_{term}, R_{term})$
  • $p_{out} + p_{in} + p_{term} = 1$
  • $S(Q_{out}, R_{out})$ is defined as the cosine of the two
    out-link vectors.
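
The combined measure can be sketched as a weighted sum of three cosine similarities. The slide defines only the out-link part as a cosine; using the cosine for the in-link and term parts as well is an assumption made here, as are the dictionary layout and the sample vectors.

```python
import math

def cosine(u, v):
    """Cosine of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def clc_similarity(q, r, p_out, p_in, p_term):
    """Linear combination of out-link, in-link and term similarity."""
    assert abs(p_out + p_in + p_term - 1.0) < 1e-9, "weights must sum to 1"
    return (p_out * cosine(q["out"], r["out"])
            + p_in * cosine(q["in"], r["in"])
            + p_term * cosine(q["term"], r["term"]))

# Hypothetical pages: identical out-links, no shared in-links or terms.
q = {"out": [1, 0, 1], "in": [1, 1], "term": [2, 0, 1]}
r = {"out": [1, 0, 1], "in": [0, 0], "term": [0, 3, 0]}
score = clc_similarity(q, r, p_out=0.2, p_in=0.3, p_term=0.5)
```

With these weights only the out-link part contributes, so the score is about 0.2. Setting the weights to (0, 0, 1) or (0.5, 0.5, 0) would give purely term-based or purely link-based similarity.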

81
Tuning the similarity measure
  • By varying the weighting factors in the second
    formula, it is possible to study the effects of
    out-links, in-links and terms on the clustering
    process.
  • The results of term-based clustering are rather
    coarse and usually include very general groups,
    which are totally different from each other from
    a semantic point of view.
  • E.g. for the topic "jaguar", the car group and
    the animal group are two very general groups with
    very different semantic topics

82
Tuning the similarity measure
  • So, term-based clustering can only roughly
    separate pages into general semantic groups and
    fails to handle finer cases
  • such as "racing car" and "car driver club", since
    both pages may include terms like "car",
    "model" etc.
  • The main reasons for the poor purity of clusters
    produced by term-based clustering are
  • Noise pages are included in clusters instead
    of being removed, since noise pages share some
    unimportant terms with other pages
  • Pages on different finer topics (but the
    same general topic) are mixed together.

83
Tuning the similarity measure
  • Hyperlinks represent the author's view of the
    relationship among Web pages
  • Hyperlink-based clustering expresses the
    association of pages.
  • Therefore, we could say that clusters produced by
    link-based clustering have a finer granularity.
  • The problem of link-based clustering is that some
    similar pages (e.g. newly created pages) may not
    have enough co-citation/citation to be grouped
    together. That is to say, recall is somewhat low.

84
Tuning the similarity measure
  • T, L and CLC denote the term-based (with
    $p_{out}$, $p_{in}$ and $p_{Kword}$ as (0, 0, 1)), link-based
    (with $p_{out}$, $p_{in}$ and $p_{Kword}$ as (0.5, 0.5, 0)) and
    content-link coupled (with $p_{out}$, $p_{in}$ and $p_{Kword}$
    as (0.2, 0.3, 0.5)) clustering approaches
    respectively.
  • Parameters are
  • Similarity threshold
  • weighting factors
  • The label of each cluster is identified
    automatically by term vector of centroid for each
    cluster.

85
Content Link Mining
87
Web Usage Mining
  • Web usage mining is also known as Web log mining
  • It applies mining techniques to discover interesting
    usage patterns from the secondary data derived from
    the interactions of users while surfing the web
  • Including
  • web log data,
  • click-stream data,
  • cookies,
  • user queries,
  • and any other data resulting from users'
    interaction with the web

88
Web Usage Mining
  • Applications
  • Target potential customers for electronic
    commerce
  • Enhance the quality and delivery of Internet
    information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations
  • Facilitate personalization/adaptive sites
  • Improve site design
  • Fraud/intrusion detection
  • Predict users' actions (allows prefetching)

89
(No Transcript)
90
Web Log Clustering Applications
  • Association rules
  • Find pages that are often viewed together
  • Clustering
  • Cluster users based on browsing patterns
  • Cluster pages based on content

91
Server Logs
92
Fields
  • Client IP: 128.101.228.20
  • Authenticated User ID: - -
  • Time/Date: 10/Nov/1999:10:16:39 -0600
  • Request: "GET / HTTP/1.0"
  • Status: 200
  • Bytes: -
  • Referrer: -
  • Agent: "Mozilla/4.61 en (WinNT I)"
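
A log entry with these fields can be split with a regular expression. The sketch below assumes the entry is written in the widely used combined log format, with the fields in the order listed above. The sample line is reassembled from the slide's field values, and the brackets and semicolon in the agent string are a guess at characters lost in extraction.

```python
import re

# One named group per field of a combined-log-format entry.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Assumed reconstruction of the entry shown on the slide.
line = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
        '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_log_line(line)
```

A "-" value means the field was not recorded, so `entry["bytes"]` and `entry["referrer"]` are both "-" here.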

93
Web Usage Mining
  • User: the principal using a client to
    interactively retrieve and render resources or
    resource manifestations.
  • Page view: the visual rendering of a Web page in a
    specific client environment at a specific point
    in time
  • Click stream: a sequential series of page view
    requests
  • User session: a delimited set of user clicks
    (click stream) across one or more Web servers.
  • Server session (visit): a collection of user
    clicks to a single Web server during a user
    session.
  • Episode: a subset of related user clicks that
    occur within a user session.

94
WUM Pre-Processing
  • Data Cleaning
  • Removes log entries that are not needed for
    the mining process
  • Data Integration
  • Synchronize data from multiple server logs
  • User Identification
  • Associates page references with different
    users
  • Session/Episode Identification
  • Groups each user's page references into user
    sessions
  • Path Completion
  • Fills in page references missing due to
    browser and proxy caching
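
Session identification is often done with a simple inactivity timeout. The sketch below assumes requests have already been cleaned and attributed to users, that timestamps are in seconds, and that a gap of more than 30 minutes starts a new session. The timeout value, the tuple layout and the sample data are assumptions for the example, not values from the slides.

```python
def sessionize(requests, timeout=1800):
    """Group (user, timestamp_seconds, url) requests into per-user sessions.

    A new session starts whenever a user has been inactive for more than
    `timeout` seconds.
    """
    sessions = {}
    last_seen = {}
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions

# Hypothetical cleaned requests.
requests = [
    ("u1", 0, "/"),
    ("u1", 600, "/a"),
    ("u1", 4000, "/b"),   # more than 30 minutes after the previous click
    ("u2", 100, "/"),
]
sessions = sessionize(requests)
```

User u1 ends up with two sessions, because the third request comes more than 30 minutes after the second.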

95
(No Transcript)
96
(No Transcript)
97
WUM Association Rule Generation
  • Discovers the correlations between pages that are
    most often referenced together in a single server
    session
  • Provides information such as
  • What are the sets of pages frequently accessed
    together by Web users?
  • What page will be fetched next?
  • What are the paths frequently accessed by Web users?
  • Association rule
  • A ⇒ B [Support = 60%,
    Confidence = 80%]
  • Example
  • 50% of visitors who accessed URLs
    /infor-f.html and labo/infos.html also visited
    situation.html
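
Support and confidence of a rule A ⇒ B can be computed directly from server sessions: support is the fraction of sessions containing both A and B, and confidence is that fraction divided by the fraction containing A. The sessions and page names below are hypothetical.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(1 for s in sessions if pages <= set(s)) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Of the sessions containing the antecedent, the fraction also containing the consequent."""
    base = support(sessions, antecedent)
    both = support(sessions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Hypothetical server sessions, each a set of visited pages.
sessions = [
    {"/a.html", "/b.html"},
    {"/a.html", "/b.html", "/c.html"},
    {"/a.html", "/c.html"},
    {"/b.html"},
    {"/a.html", "/b.html"},
]
rule_support = support(sessions, {"/a.html", "/b.html"})
rule_confidence = confidence(sessions, {"/a.html"}, {"/b.html"})
```

Three of the five sessions contain both pages (support 60%), and three of the four sessions with /a.html also contain /b.html (confidence 75%).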

98
WUM Clustering
  • Groups together a set of items having similar
    characteristics
  • User Clusters
  • Discover groups of users exhibiting similar
    browsing patterns
  • Page recommendation
  • The user's partial session is classified into a
    single cluster
  • The links contained in this cluster are
    recommended

99
Web Usage Clustering
  • Clients who often access
    /products/software/webminer.html tend to be
    from educational institutions.
  • Clients who placed an online order for software
    tend to be students in the 20-25 age group and
    live in the United States.
  • 75% of clients who download software from
    /products/software/demos/ visit between 7:00
    and 11:00 pm on weekends.

101
Focused Crawling
  • Only visit links from a page if that page is
    determined to be relevant.
  • The classifier is static after the learning phase.
  • Components
  • Classifier, which assigns a relevance score to each
    page based on the crawl topic.
  • Distiller, which identifies hub pages.
  • Crawler, which visits pages based on classifier and
    distiller scores.

102
Focused Crawling
  • Classifier also determines how useful outgoing
    links are
  • Hub pages contain links to many relevant pages.
    They must be visited even if they do not have a
    high relevance score.

103
Focused Crawling
105
Motivation
  • In the web search context: organizing web pages
    (search results) into groups, so that different
    groups correspond to different user
    needs, e.g. for the query "engine": search engine,
    car part, Engine Corp.
  • Why not other data mining techniques?

106
(1) Using Contents of Documents
  • Creating clusters based on snippets returned by
    web search engines.
  • Clusters based on snippets are almost as good as
    clusters created using the full text of Web
    documents.
  • Suffix Tree Clustering (STC): an incremental, O(n)
    time algorithm
  • Linear
  • Incremental
  • Overlapping
  • Can be extended to hierarchical

107
STC algorithm
  • Step 1: Cleaning
  • Stemming
  • Sentence boundary identification
  • Punctuation elimination
  • Step 2: Suffix tree construction
  • Produces base clusters (internal nodes)
  • Base clusters are scored based on size and phrase
    score (which depends on length and word
    quality)
  • Step 3: Merging base clusters
  • Highly overlapping clusters are merged
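
Step 3 can be sketched on its own, taking the base clusters as given sets of document ids. The sketch assumes two base clusters are "highly overlapping" when their shared documents make up more than half of each cluster, and that merged clusters are the connected groups of overlapping base clusters. The 0.5 threshold and the sample clusters are assumptions for the example, and suffix tree construction and scoring are not covered.

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters (non-empty sets of doc ids) that overlap heavily."""
    n = len(base_clusters)
    parent = list(range(n))  # union-find forest over base cluster indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            common = len(base_clusters[a] & base_clusters[b])
            if (common / len(base_clusters[a]) > threshold
                    and common / len(base_clusters[b]) > threshold):
                parent[find(a)] = find(b)  # link the two groups

    merged = {}
    for idx, docs in enumerate(base_clusters):
        merged.setdefault(find(idx), set()).update(docs)
    return sorted(merged.values(), key=sorted)

# Hypothetical base clusters of document ids.
base = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
clusters = merge_base_clusters(base)
```

The first two base clusters share two of their three documents and are merged, while the third stays separate. Because a document can sit in several base clusters, the final clusters may overlap, which matches the "Overlapping" property listed for STC.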

108
(2) Using users' usage logs
  • Advantage: relevancy information is objectively
    reflected by the usage logs
  • An experimental result on www.nasa.gov/

109
(3) Using hyperlinks
  • For each URL P in search results R, we extract
    all its out-links as well as its top n in-links
    using AltaVista services
  • This gives N distinct out-links and M distinct
    in-links over all URLs in R.
  • Each page P in R (the result set) is represented as
    2 vectors
  • $P_{Out}$ (N-dimensional)
  • $P_{In}$ (M-dimensional)

110
(3) Using Hyperlinks continued
111
(3) Using Hyperlinks continued
112
Concerns on current methods
  • Each method has pros and cons
  • Using hyperlinks: the best accuracy, with still
    some room to improve
  • STC: best for browsing and for incrementality.

113
Sample systems
  • Scatter/Gather
  • Grouper
  • Carrot2
  • Vivisimo
  • Mapuccino
  • SHOC

114
Grouper
  • Online
  • Operates on query result snippets
  • Clusters together documents with large common
    subphrases
  • Suffix Tree Clustering (STC)
  • STC induces labeling

115
(No Transcript)
116
(No Transcript)
117
(No Transcript)
119
Summary
120
Thank You