LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals - PowerPoint PPT Presentation

1 / 61

About This Presentation

Title:

LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals

Description:

Using search engines (best coverage 38.3%) (Lawrence and Giles,1999) Clicking on hyperlinks ... What are the user-sought top-level web pages? ... – PowerPoint PPT presentation

Number of Views:45

Avg rating:3.0/5.0

Slides: 62

Provided by: DBSR

Category:

more less

Transcript and Presenter's Notes

Title: LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals

1
LinkSelector A Web Mining Approach to
Hyperlink Selection for Web Portals

Xiao Fang
University of Arizona
10/18/2002

2
Agenda

Introduction
Related work
Problem definition -- Hyperlink Selection
Solution -- LinkSelector
Evaluation
Contributions, limitations and future work

3
Introduction

Size of WWW (Lawrence and Giles,1999)
800 million web pages
1 million pages added daily
How to find information on the Web
Using search engines (best coverage 38.3)
(Lawrence and Giles,1999)
Clicking on hyperlinks

4
Introduction
Web Page 1

Web Page 2
Product Category List A B C D E F
Web Page 3

Product Category A Product List A1 A2 A3 A4 A5

Click on A
Product A2 Price 1000 Detailed description
Click on A2
B2
5
Introduction

Portal page is the entrance to a website.
Portal page
Homepage of a website
Default web portal (e.g.,My Yahoo!)
Most My Yahoo! users never customize their
default web portals (Manber et al., 2000).

6
Introduction

Hyperlinks in a portal page are selected from a
hyperlink pool.
A hyperlink pool is a set of hyperlinks pointing
to top-level web pages, e.g., hyperlink in a site
index page.

7
Portal page
8
Hyperlink pool
9
Portal page
10
Hyperlink pool
11
Introduction

Number of hyperlinks in a portal page several
dozens (e.g., 32 in the Arizona Home page).
Number of hyperlinks in a hyperlink pool several
hundreds (e.g., 743 in the Arizona Index page).

12
Introduction

It is too computational expensive to do an
exhaustive search (e.g., ).
Current practice of hyperlink selection expert
selection
Only reflect domain experts perspectives
Subjective

13
Introduction

Our approach is based on
web access patterns extracted from a web log
objective and reflect web surfers perspectives
web structural patterns extracted from an
existing website -- objective

14
Related work

Web mining is the process of applying data mining
techniques to extract patterns from the Web.
Web Data
Content texts and graphics in web pages
Structure hyperlinks
Usage web logs

15
Related work

Web content mining is the process of
automatically retrieving, parsing, indexing and
categorizing web documents.(Chakrabrati, 2000)
Web structure mining
HITS (Kleinberg, 1998)
PageRank (Brin and Page, 1998)

16
Related work

Web usage mining is the process of applying data
mining techniques to extract web access patterns
from a web log.

17
Related work

Web usage mining
General purpose, e.g., Chen et al. 1996 Cooley
et al., 1999
Website improvement, e.g., Perkowitz and Etzioni,
2000
Personalization, e.g., Yan 1996

18
Related work

Limitations of previous web usage mining research
Not considering web structure information, e.g.,
Chen et al., 1996
Web structure information are used to exclude
uninteresting web visiting patterns, e.g., Yan
et al., 1996 and Cooley et al., 1999

19
Hyperlink Selection

The quality of a portal page is measured using a
web log and a web log can be divided into
sessions.
Metrics to measure the quality of a portal page
Effectiveness
Efficiency
Usage

20
Hyperlink Selection

Effectiveness is the percentage of user-sought
top-level web pages that can be easily accessed
from the portal page.
What are the user-sought top-level web pages?
How to define the easiness to find a web page
from a Portal page?

21
Hyperlink Selection

User-sought top-level web pages
Session j L1, L10, L11, L2, L13, L14, L5, L9,
L7, L12
L1, L2, L5, and L7 are in the hyperlink pool
User-sought top-level web pages L1, L2, L5, L7

22
Hyperlink Selection

Usually, web pages that are 1-2 clicks away from
a portal page can be easily found from the portal
page.

23
Hyperlink Selection

Effectiveness measured at session level

Effectiveness measured at log level

24
Hyperlink Selection

Efficiency measures the usefulness of hyperlinks
placed in a portal page.
Efficiency measured at session level

Efficiency measured at log level

25
Hyperlink Selection

Usage how often a portal page is visited.

26
Hyperlink Selection
Definition Given a website w, its hyperlink
pool HP and the number of hyperlinks to be
placed in the portal page of w N, where
, the hyperlink selection problem is to
construct the portal page by selecting N
hyperlinks from the hyperlink pool HP to
maximize the effectiveness, efficiency and usage
of the resulting portal page (i.e., all metrics
are measured at the web log level).
27
LinkSelector

LinkSelector is based on relationships between
hyperlinks in a hyperlink pool.
Structure Relationship
Access Relationship

28
LinkSelector

Structure Relationship

29
LinkSelector

Structure Relationship

L2 L4 L6 L8
Web page 2
L1 L3
L5 L7
Web page 1
Web page 3
Structure relationships L1?L2 L1?L4
L1?L6 L1?L8 L3?L5 L3?L7
30
LinkSelector

Access Relationship
k-HS is denoted as a hyperlink set with k
hyperlinks. e.g., L1,L2 is a 2-HS
The support of a k-HS is the percentage of
sessions that web pages pointed to by hyperlinks
in the k-HS are accessed together.
e.g., If web pages pointed to by L1 and L2 are
accessed together in 20 sessions out of total 100
sessions, then the support of 2-HS L1,L2 is
20.

31
LinkSelector

Access Relationship

Definition For a k-HS , where ,
there exists an access relationship among
elements in the k-HS if and only if its support
is greater than a pre-defined threshold.
32
LinkSelector
33
LinkSelector

Group-I relationship

L2 L4 L6 L8
Web page 2
L1 L3
L5 L7
Web page 1
Web page 3
Structure relationships L1?L2 L1?L4
L1?L6 L1?L8 L3?L5 L3?L7
34
LinkSelector

Group-I relationship

L2 L4 L6 L8
Web page 2
L1 L3
L5 L7
Web page 1
Web page 3
Access relationships L1,L2,0.1 L1,L4,0.1
L1,L6,0.05 L1,L8,0.05 L3,L5,0.4
L3,L7,0.5
Structure relationships L1?L2 L1?L4
L1?L6 L1?L8 L3?L5 L3?L7
35
LinkSelector

Group-I relationship provides indicators of
preference for individual hyperlinks in
hyperlink
selection
the number of structure relationships a
hyperlink participate in as an initial hyperlink
the quality of these structure relationships

36
LinkSelector

Group-II relationship
no structure relationship between L9 and L12
an access relationship between L9 and L12

37
LinkSelector

Group-II relationship provides indicators of
hyperlink pair preference in hyperlink
selection
hyperlink pairs with Group-II relationships are
preferred to hyperlink pairs without
Group-II relationships
within hyperlink pairs with Group-II
relationships, hyperlink pairs with higher
support
of access relationship are preferred to those
with
lower support of access relationships

38
LinkSelector

Group-III relationship reveals patterns that
are not relevant to hyperlink selection
Group-IV relationship does not reveal
interesting patterns.

L1
Web page 2
L5
Web page 1
39
LinkSelector

The Sketch of LinkSelector

40
LinkSelector

Discover Structure Relationships

41
LinkSelector

Access relationship can be discovered from a web
log using association rule mining
Data Preprocessing
Association rule mining

42
LinkSelector

Data Preprocessing
Web log cleaning
Error logs (e.g., status code 404)
Accessory logs (e.g., .gif)
Session identification
Modify web server
30-min time interval

43
LinkSelector

Association rule mining
Given a transaction database
tid item
001 1,2,3
002 1,2,4
003 2,3,4
004 4,5,6
An itemset is a set of items, e.g., 1,2
The support of an itemset is the percentage of
transactions that contain (e.g., purchase) the
itemst.
Objective discover all itemsets with supports
larger than a user-defined threshold.
Apriori Algorithm (Agrawal and Srikant,1994 )

44
LinkSelector

Calculate Preferences for Hyperlinks

45
LinkSelector

Calculate Preferences for Hyperlinks Sets

No structure relationships between and
and and .
Preference for hyperlink set is 0.022

46
LinkSelector

Clustering is a data mining algorithm to segment
objects into groups based on their similarities

3
1
0.2
5
1 2 3 4 5 1 0 0.2
0.1 0.1 0.05 2 0.2 0 0.1 0.1
0.05 3 0.1 0.1 4 5
2
0.12
4
47
LinkSelector

Hyperlink Clustering
Hyperlinks ? Objects
Preferences for hyperlinks ? Weights of objects
Preferences for hyperlinks sets ? Similarities
among hyperlinks

48
LinkSelector

Limitations of classical clustering algorithms
Weights of objects are not considered.
Only considers similarities between two objects

49
LinkSelector

Our solution
Indexes of the proposed similarity matrix are
clusters while indexes of the traditional
similarity matrix are objects to be clustered.
Similarities involving two and more objects are
considered in the proposed similarity matrix.
Weights of objects are considered

50
LinkSelector
51
Experiment

Data collected from UA website
Hyperlink pool 110 links
Web log collected in Sep. 2001
10 M records ? 4.2 M records
total 344 K sessions
262 k sessions ? Training data (23 days)
82 k sessions ? Testing data (7 days)

52
Experiment
Hyperlinks selected by LinkSelector, Domain
experts and access frequency (N6)
53
Experiment
Average improvement 12.7 Improvement decrease
from 22.1 to 8.4 Average number of sessions per
day 11.5k
54
Experiment
Average improvement 17.0 Improvement decrease
from 30.2 to 9.4 Absolute number of hyperlinks
improved 15610 to 6509
55
Experiment
Average improvement 16.9 Improvement decrease
from 30.2 to 9.3
56
Experiment
Hyperlinks selected by LinkSelector, Classical
Hierarchical Clustering and Association rule
mining (N6)
57
Experiment
Average improvement compared with association
rule mining 25.8 Average improvement compared
with classical clustering 102.0
58
Experiment
Average improvement compared with association
rule mining 31.7 Average improvement compared
with classical clustering 124.0
59
Experiment
Average improvement compared with association
rule mining 31.6 Average improvement compared
with classical clustering 123.0
60
Contributions
1. We proposed and formally defined a new and
important research problem hyperlink
selection. 2. We proposed and showed what a web
mining based hyperlink selection approach
outperforms other hyperlink selection
approaches. 3.We developed a new clustering
algorithm for hyperlink selection.
61
Limitations and Future work