Loading...

PPT – Web Mining PowerPoint presentation | free to download - id: 14273a-MjcwZ

The Adobe Flash plugin is needed to view this content

Web Mining

- Kyumars Sheykh Esmaili
- Modern Information Retrieval Course
- Sharif University of Technology
- Spring 2006

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Introduction

- Information Overloading on the web
- Size
- 2001
- New information created 6 exabytes (1018 bytes)

- 10 billion (nonspam) e-mail messages were sent

per day. - 2002
- New information created 12 exabytes (1018

bytes) - 2003
- the public Internet contained about 1 trillion

pages and was increasing at a rate of

approximately 8 million pages per day. - 2005
- 35 billion messages per day by 2005.

Challenges on WWW Interactions

- Finding Relevant Information
- Creating knowledge from Information available
- Personalization of the information
- Learning about customers / individual users
- Web Mining can play an important Role!

Introduction

- Web mining - data mining techniques to

automatically discover and extract information

from Web documents/services - Web mining research integrate research from

several research communities

- Database (DB)

- Information retrieval (IR)

- The sub-areas of machine learning (ML)
- Natural language processing (NLP)

Web Data

- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data
- Profiles
- Registration information
- Cookies

Web Data Categories

Web Mining Taxonomy

Web Structure Clustering

Web Content Clustering

Web Usage Clustering

Web C-S Clustering

Web Mining Subtasks

- Resource Finding
- Task of retrieving intended web-documents
- Information Selection Pre-processing
- Automatic selection and pre-processing specific

information from retrieved web resources - Generalization
- Automatic Discovery of patterns in web sites
- Analysis
- Validation and / or interpretation of mined

patterns

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Feature Selection for Web Mining

- for the purposes of automated text classification

text features should be - Relatively few in number
- Moderate in frequency of assignment
- Low in redundancy
- Low in noise
- Related in semantic scope to the classes to

be assigned - Relatively unambiguous in meaning

Feature Selection

- Potential features
- BODY
- META
- TITLE
- Snippet
- Means sentences attached with URL u appeared in

search results - Anchor Window
- The anchor text and text around the hyperlink

v-gtu in the source page v - MT, the union of META and TITLE content
- BMT, the union of BODY, META and TITLE content.

Feature Selection for Content Mining

Percentage of Web Pages With Words in HTML Tags

Feature Selection For Web Pages

Classification performance for various

representations of web pages

Vector Space Model for Content-Similarity

- IR systems usually adopt index terms to process

queries - Index term
- a keyword or group of selected words
- any word (more general)
- Stemming might be used
- connect connecting, connection, connections
- An inverted file is built for the chosen index

terms

(No Transcript)

(No Transcript)

(No Transcript)

(No Transcript)

(No Transcript)

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Social network analysis

- Social network is the study of social entities

(people in an organization, called actors), and

their interactions and relationships. - The interactions and relationships can be

represented with a network or graph, - each vertex (or node) represents an actor and
- each link represents a relationship.
- From the network, we can study the properties of

its structure, and the role, position and

prestige of each social actor. - We can also find various kinds of sub-graphs,

e.g., communities formed by groups of actors.

Social network and the Web

- Social network analysis is useful for the Web

because the Web is essentially a virtual society,

and thus a virtual social network, - Each page a social actor and
- each hyperlink a relationship.
- Many results from social network can be adapted

and extended for use in the Web context.

Web Structure Mining

- The Web consists not only of pages, but also of

hyperlinks pointing from one page to another - These hyperlinks contain an enormous amount of

latent human annotation - Assumption
- link from page A to page B is a recommendation of

page B by A - If A and B are connected by a link, there is a

higher probability that they are on the same topic

Web Link Analysis

- Used for
- Ordering documents matching a user query ranking
- Deciding what pages to add to a collection

crawling - Page categorization
- Finding related pages
- Finding duplicated web sites

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Structural Similarity Measures

- We must define the similarity of two nodes
- Method I
- For page and page B, A is related to B if there

is a hyper-link from A to B, or from B to A - Not so good. Consider the home page of IBM and

Microsoft.

Page A

Page B

Structural Similarity Measures

- Method II (from Bibliometrics)
- Co-citation the similarity of A and B is

measured by the number of pages cite both A and B - Bibliographic coupling the similarity of A and B

is measured by the number of pages cited by both

A and B.

Page A

Page B

Page A

Page B

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Using link structure of web (cont.)

- There are two famous Link-Structure based

algorithms for ranking - PageRank
- HITS
- Nearly All other algorithms are base on these

ones - Salsa,
- Clever,
- .

PageRank

- Introduced by Page et al (1998)
- An offline algorithm (Query independent)
- The weight is assigned by the rank of parents

Matrix Notation

Solve the PageRank equation

(15)

- This is the characteristic equation of the

eigensystem, where the solution to P is an

eigenvector with the corresponding eigenvalue of

1. - It turns out that if some conditions are

satisfied, 1 is the largest eigenvalue and the

PageRank vector P is the principal eigenvector. - A well known mathematical technique called power

iteration can be used to find P. - Problem the above Equation does not quite

suffice because the Web graph does not meet the

conditions.

Using Markov chain

- To introduce these conditions and the enhanced

equation, let us derive the same Equation (15)

based on the Markov chain. - In the Markov chain, each Web page or node in the

Web graph is regarded as a state. - A hyperlink is a transition, which leads from one

state to another state with a probability. - This framework models Web surfing as a stochastic

process. - It models a Web surfer randomly surfing the Web

as state transition.

Random surfing

- Recall we use Oi to denote the number of

out-links of a node i. - Each transition probability is 1/Oi if we assume

the Web surfer will click the hyperlinks in the

page i uniformly at random. - The back button on the browser is not used and
- the surfer does not type in an URL.

Transition probability matrix

- Let A be the state transition probability

matrix,, - Aij represents the transition probability that

the surfer in state i (page i) will move to state

j (page j). Aij is defined exactly as in Equation

(14).

Let us start

- Given an initial probability distribution vector

that a surfer is at each state (or page) - p0 (p0(1), p0(2), , p0(n))T (a column vector)

and - an n?n transition probability matrix A,
- we have
- If the matrix A satisfies Equation (17), we say

that A is the stochastic matrix of a Markov

chain.

(16)

(17)

Back to the Markov chain

- In a Markov chain, a question of common interest

is - Given p0 at the beginning, what is the

probability that m steps/transitions later the

Markov chain will be at each state j? - We determine the probability that the system (or

the random surfer) is in state j after 1 step (1

transition) by using the following reasoning

(18)

State transition

Stationary probability distribution

- By a Theorem of the Markov chain,
- a finite Markov chain defined by the stochastic

matrix A has a unique stationary probability

distribution if A is irreducible and aperiodic. - The stationary probability distribution means

that after a series of transitions pk will

converge to a steady-state probability vector ?

regardless of the choice of the initial

probability vector p0, i.e.,

(21)

PageRank again

- When we reach the steady-state, we have pk pk1

?, and thus - ? AT?.
- ? is the principal eigenvector of AT with

eigenvalue of 1. - In PageRank, ? is used as the PageRank vector P.

We again obtain Equation (15), which is

re-produced here as Equation (22)

(22)

Is P ? justified?

- Using the stationary probability distribution ?

as the PageRank vector is reasonable and quite

intuitive because - it reflects the long-run probabilities that a

random surfer will visit the pages. - A page has a high prestige if the probability of

visiting it is high.

Back to the Web graph

- Now let us come back to the real Web context and

see whether the above conditions are satisfied,

i.e., - whether A is a stochastic matrix and
- whether it is irreducible and aperiodic.
- None of them is satisfied.
- Hence, we need to extend the ideal-case Equation

(22) to produce the actual PageRank model.

A is a not stochastic matrix

- A is the transition matrix of the Web graph
- It does not satisfy equation (17)
- because many Web pages have no out-links, which

are reflected in transition matrix A by some rows

of complete 0s. - Such pages are called the dangling pages (nodes).

An example Web hyperlink graph

Fix the problem two possible ways

- Remove those pages with no out-links during the

PageRank computation as these pages do not affect

the ranking of any other page directly. - Add a complete set of outgoing links from each

such page i to all the pages on the Web.

Let us use the second way

A is a not irreducible

- Irreducible means that the Web graph G is

strongly connected. - Definition A directed graph G (V, E) is

strongly connected if and only if, for each pair

of nodes u, v ? V, there is a path from u to v. - A general Web graph represented by A is not

irreducible because - for some pair of nodes u and v, there is no path

from u to v. - In our example, there is no directed path from

nodes 3 to 4.

A is a not aperiodic

- A state i in a Markov chain being periodic means

that there exists a directed cycle that the chain

has to traverse. - Definition A state i is periodic with period k gt

1 if k is the smallest number such that all paths

leading from state i back to state i have a

length that is a multiple of k. - If a state is not periodic (i.e., k 1), it is

aperiodic. - A Markov chain is aperiodic if all states are

aperiodic.

An example periodic

- Fig. 5 shows a periodic Markov chain with k 3.

Eg, if we begin from state 1, to come back to

state 1 the only path is 1-2-3-1 for some number

of times, say h. Thus any return to state 1 will

take 3h transitions.

Deal with irreducible and aperiodic

- It is easy to deal with the above two problems

with a single strategy. - Add a link from each page to every page and give

each link a small transition probability

controlled by a parameter d. - Obviously, the augmented transition matrix

becomes irreducible and aperiodic

Improved PageRank

- After this augmentation, at a page, the random

surfer has two options - With probability d, he randomly chooses an

out-link to follow. - With probability 1-d, he jumps to a random page
- Equation (25) gives the improved model,
- where E is eeT (e is a column vector of all 1s)

and thus E is a n?n square matrix of all 1s.

(25)

Follow our example

The final PageRank algorithm

- (1-d)E/n dAT is a stochastic matrix

(transposed). It is also irreducible and

aperiodic - If we scale Equation (25) so that eTP n,
- PageRank for each page i is

(27)

(28)

The final PageRank (cont )

- (28) is equivalent to the formula given in the

PageRank paper - The parameter d is called the damping factor

which can be set to between 0 and 1. d 0.85 was

used in the PageRank paper.

A Practical Example for PageRank

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

What is cyber-community

- A community on the web is a group of web pages

sharing a common interest - Eg. A group of web pages talking about POP Music
- Eg. A group of web pages interested in

data-mining - Main properties
- Pages in the same community should be similar to

each other in contents - The pages in one community should differ from the

pages in another community - Similar to cluster

Cyber Communities

Two different types of communities

- Explicitly-defined communities
- They are well known ones, such as the resource

listed by Yahoo! - Implicitly-defined communities
- They are communities unexpected or invisible to

most users

Arts

eg.

Music

Painting

Classic

Pop

eg. The group of web pages interested in a

particular singer

Different types of communities

- The explicit communities are easy to identify
- Eg. Yahoo!, InfoSeek, Clever System
- In order to extract the implicit communities, we

need analyze the web-graph objectively - In research, people are more interested in the

implicit communities

Methods of clustering

- Clustering methods based on co-citation analysis
- Methods derived from HITS (Kleinberg)
- Using co-citation matrix
- CT Method

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

HITS Hubs and Authority

- Hub web page links to a collection of prominent

sites on a common topic - Authority Pages that link to a collection of

authoritative pages on a broad topic web page

pointed to by hubs - Mutual Reinforcing Relationship a good authority

is a page that is pointed to by many good hubs,

while a good hub is a page that points to many

good authorities

Authority and Hubness

5

2

3

1

1

6

4

7

y(1) x(5) x(6) xs(7)

x(1) y(2) y(3) y(4)

HITS Steps (1)

- Creating root and base sets

HITS Steps (2)

- Calculating Weights

- Authority weight
- Hub weight
- Matrix notation A - adjacency matrix
- A(i, j) 1 if i-th page points to j-th page

Final Result of HITS

HITS Results 3D perspective

A Practical Example for HITS

HITS Problems

- From narrow topic, HITS tends to end in more

general one. - Specific of hub pages - many links can cause

algorithm drift. They can point to authorities in

different topics - Pages from single domain / website can dominate

result, if they point to one page - not necessary

a good authority. - Automatically generated links
- Non relevant highly connected pages
- Topic drift generalisation of the query topic

Difference between PageRank and HITS

- The PageRank is computed for all web pages stored

in the database and then prior to the query HITS

is performed on the set of retrieved web pages,

and for each query. - HITS computes authorities and hubs PageRank

computes authorities only. - PageRank non-trivial to compute, HITS easy to

compute, but real-time execution is hard

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

A cheaper method

- Previous methods are expensive
- There another simple method called communities

trawling (CT) - It has been implemented on the graph of 200

millions pages, it worked very well

Basic idea of CT

- Definition of communities
- dense directed bipartite sub graphs
- Bipartite graph Nodes are partitioned into two

sets, F and C - Every directed edge in the graph is directed from

a node u in F to a node v in C - dense if many of the possible edges between F and

C are present

F

C

Basic idea of CT

- Bipartite cores
- a complete bipartite subgraph with at least i

nodes from F and at least j nodes from C - i and j are tunable parameters
- A (i, j) Bipartite core
- Every community have such a core with a certain i

and j.

A (i3, j3) bipartite core

Basic idea of CT

- A bipartite core is the identity of a community
- To extract all the communities is to enumerate

all the bipartite cores on the web. - Author invent an efficient algorithm to enumerate

the bipartite cores. Its main idea is iterate

pruning -- elimination-generation pruning

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Content Link Clustering

- By CLC, each web page q in data set D is

represented - as 3 vectors
- qOut
- qIn
- qKword
- with M, N and L as the vector dimension

respectively - The ith item of vector qOut (and qIn) indicates

whether q has the corresponding out-link as the

ith one in M out-links. If yes, the ith item is1,

else 0. - The kth item of vector qKword indicates the

frequency of the corresponding kth term of L

appeared in page q.

Similarity Measure

- The similarity of two pages Q and R is the linear

combination of three parts - poutS(Qout,Rout) pinS(Qin,Rin)

ptermS(Qterm,Rterm) - pout pin pterm 1
- S(Qout,Rout) is defined as Cosine of two out-link

vectors.

Tuning the similarity measure

- By varying weighting factors in second formula,

it is possible to study the effects of out-links,

in-link and terms on clustering process. - Results of term-based clustering is rather

coarse and usually includes very general groups,

which are totally different each other from

semantic point of view. - E.g. for topic jaguar, car group and

animal group are two very general groups with

very different semantic topics

Tuning the similarity measure

- So, term-based clustering could only roughly

separate pages into general semantic groups and

failed to handle the finer case - Like racing car and car driver club since

both pages may include some terms like car,

model etc. - The main reasons of poor purity of clusters

produced by term-based clustering are - Noise pages are included into clusters instead

of removing since noise pages share some

unimportant terms with other pages - Pages that on different finer topics (but the

same general topic) are mixed together.

Tuning the similarity measure

- Hyperlinks represent the authors view of the

relationship among Web pages - hyperlink-based clustering expresses

association of pages. - Therefore, we could say that clusters produced by

link-based clustering are in finer granularity. - The problem of link-based clustering is that some

similar pages (e.g. new created pages) may not

have enough co-citation/citation to be grouped

together. That is to say, recall is some low.

Tuning the similarity measure

- T, L and CLC to denote termsbased (with

pout , pin and pKword as (0, 0, 1), link-based

(with pout ,pin and pKword as (0.5, 0.5, 0) and

contents-link coupled (with pout , pin and pKword

as (0.2,0.3, 0.5) clustering approaches

respectively. - Parameters are
- Similarity threshold
- weighting factors
- The label of each cluster is identified

automatically by term vector of centroid for each

cluster.

Content Link Mining

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Web Usage Mining

- Web usage mining also known as Web log mining
- mining techniques to discover interesting usage

patterns from the secondary data derived from the

interactions of the users while surfing the web - Including
- web log data,
- click-stream data,
- cookies,
- user queries,
- and any data related to the results of

interaction between humans interaction with the

web

Web Usage Mining

- Applications
- Target potential customers for electronic

commerce - Enhance the quality and delivery of Internet

information services to the end user - Improve Web server system performance
- Identify potential prime advertisement locations
- Facilitates personalization/adaptive sites
- Improve site design
- Fraud/intrusion detection
- Predict users actions (allows prefetching)

(No Transcript)

Web Log Clustering Applications

- Association rules
- Find pages that are often viewed together
- Clustering
- Cluster users based on browsing patterns
- Cluster pages based on content

Server Logs

Fields

- Client IP 128.101.228.20
- Authenticated User ID - -
- Time/Date 10/Nov/1999101639 -0600
- Request "GET / HTTP/1.0"
- Status 200
- Bytes -
- Referrer -
- Agent "Mozilla/4.61 en (WinNT I)"

Web Usage Mining

- User The principal using a client to

interactively retrieve and render resources or

resource manifestations. - Page view Visual rendering of a Web page in a

specific client environment at a specific point

of time - Click stream a sequential series of page view

request - User session a delimited set of user clicks

(click stream) across one or more Web servers. - Server session (visit) a collection of user

clicks to a single Web server during a user

session. - Episode a subset of related user clicks that

occur within a user session.

WUM Pre-Processing

- Data Cleaning
- Removes log entries that are not needed for

the mining process - Data Integration
- Synchronize data from multiple server logs
- User Identification
- Associates page references with different

users - Session/Episode Identification
- Groups users page references into user

sessions - Path Completion
- Fills in page references missing due to

browser and proxy caching

(No Transcript)

(No Transcript)

WUM Association Rule Generation

- Discovers the correlations between pages that are

most often referenced together in a single server

session - Provide the information
- What are the set of pages frequently accessed

together by Web users? - What page will be fetched next?
- What are paths frequently accessed by Web users?
- Association rule
- A B Support 60,

Confidence 80 - Example
- 50 of visitors who accessed URLs

/infor-f.html and labo/infos.html also visited

situation.html

WUM Clustering

- Groups together a set of items having similar

characteristics - User Clusters
- Discover groups of users exhibiting similar

browsing patterns - Page recommendation
- Users partial session is classified into a

single cluster - The links contained in this cluster are

recommended

Web Usage Clustering

- clients who often access
- /products/software/webminer.html tend to be

from educational institutions. - clients who placed an online order for software

tend to be students in the 20-25 age group and

live in the United States. - 75 of clients who download software from
- /products/software/demos/ visit between 700

and 1100 pm on weekends.

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Focused Crawling

- Only visit links from a page if that page is

determined to be relevant. - Classifier is static after learning phase.
- Components
- Classifier which assigns relevance score to each

page based on crawl topic. - Distiller to identify hub pages.
- Crawler visits pages to based on crawler and

distiller scores.

Focused Crawling

- Classifier also determines how useful outgoing

links are - Hub Pages contain links to many relevant pages.

Must be visited even if not high relevance score.

Focused Crawling

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Motivation

- In the web search contextorganizing web pages

(search results) into groups, so that different

groups correspond to different user

needs search enginei.e. engine car

part Engine Corp. - Why not other data mining techniques?

(1) Using Contents of Documents

- Creating clusters based on snippets returned by

web search engines. - Clusters based on snippets are almost as good as

clusters created using the full text of Web

documents. - Suffix Tree Clustering (STC) incremental, O(n)

time algorithm - Linear
- Incremental
- Overlapping
- Can be extended to hierarchical

STC algorithm

- Step 1 Cleaning
- Stemming
- Sentence boundary identification
- Punctuation elimination
- Step 2 Suffix tree construction
- Produces base clusters (internal nodes)
- Base clusters are scored based on size and phrase

score (which depends on length and word

quality) - Step 3 Merging base clusters
- Highly overlapping clusters are merged

(2) Using users usage logs

- Advantage relevancy information is objectively

reflected by the usage logs - An experimental result on www.nasa.gov/

(3) Using hyperlinks

- For each URL P in search results R, we extract

its all out-links as well as top n in-links by

services of AltaVista - We could get all distinct N out-links and M

in-links for all URLs in R. - Each page P in R (result set) is represented as

2 vectors - POut (N- dimension)
- PIn (Mdimension)

(3) Using Hyperlinks continued

(3) Using Hyperlinks continued

Concerns on current methods

- Each method has pros and cons
- Using hyperlinks the best accuracy and still

some room to improve - STC best to browse and for incrementality.

Sample systems

- Scatter/Gather
- Grouper
- Carrot2
- Vivisimo
- Mapuccino
- SHOC

Grouper

- Online
- Operates on query result snippets
- Clusters together documents with large common

subphrases - Suffix Tree Clustering (STC)
- STC induces labeling

(No Transcript)

(No Transcript)

(No Transcript)

Table of Contents

- Introduction
- Web Content Mining
- Feature Selection and Similarity Measures
- Web Structure Mining
- Web as Social Network
- Features and Similarity Measures
- Social Network Analysis Algorithms
- PageRank
- Cyber-Communities
- HITS
- CT
- Web Content-Structure Clustering
- Web Usage Mining
- Some Concrete Applications of Web Mining
- Focus Crawling
- Web Search Result Clustering
- Summary

Summary

Thank You