Title: LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals
1LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals
- Xiao Fang
- University of Arizona
- 10/18/2002
2Agenda
- Introduction
- Related work
- Problem definition -- Hyperlink Selection
- Solution -- LinkSelector
- Evaluation
- Contributions, limitations and future work
3Introduction
- Size of WWW (Lawrence and Giles, 1999)
- 800 million web pages
- 1 million pages added daily
- How to find information on the Web
- Using search engines (best coverage 38.3%) (Lawrence and Giles, 1999)
- Clicking on hyperlinks
4Introduction
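[Figure: finding information by clicking on hyperlinks. Web Page 1 shows a product category list (A, B, C, D, E, F). Clicking on A opens Web Page 2, the product list for category A (A1, A2, A3, A4, A5). Clicking on A2 opens Web Page 3 with product A2, its price (1000) and a detailed description. The figure also carries the label B2.]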
5Introduction
- Portal page is the entrance to a website.
- Portal page
- Homepage of a website
- Default web portal (e.g., My Yahoo!)
- Most My Yahoo! users never customize their default web portals (Manber et al., 2000).
6Introduction
- Hyperlinks in a portal page are selected from a hyperlink pool.
- A hyperlink pool is a set of hyperlinks pointing to top-level web pages, e.g., the hyperlinks in a site index page.
7Portal page
8Hyperlink pool
9Portal page
10Hyperlink pool
11Introduction
- Number of hyperlinks in a portal page: several dozen (e.g., 32 in the Arizona Home page).
- Number of hyperlinks in a hyperlink pool: several hundred (e.g., 743 in the Arizona Index page).
12Introduction
- It is too computationally expensive to do an exhaustive search (e.g., ).
- Current practice of hyperlink selection: expert selection
- Only reflects domain experts' perspectives
- Subjective
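To give a sense of scale, the sketch below counts the candidate portal pages when 32 hyperlinks are chosen from a pool of 743 (the Arizona figures quoted earlier). Treating the selection as an unordered choice is an assumption made here; the example on the slide itself is not recoverable.

```python
import math

pool_size = 743    # hyperlinks in the Arizona Index page
portal_size = 32   # hyperlinks in the Arizona Home page

# Number of distinct unordered selections of portal_size links from the pool
# (assumption: the order of links on the portal page is ignored).
num_candidates = math.comb(pool_size, portal_size)
print(f"{num_candidates:.3e} candidate portal pages")
```

Even before any candidate is scored against a web log, enumerating this many selections is out of reach.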
13Introduction
- Our approach is based on
- web access patterns extracted from a web log -- objective and reflect web surfers' perspectives
- web structural patterns extracted from an existing website -- objective
14Related work
- Web mining is the process of applying data mining techniques to extract patterns from the Web.
- Web Data
- Content: texts and graphics in web pages
- Structure: hyperlinks
- Usage: web logs
15Related work
- Web content mining is the process of automatically retrieving, parsing, indexing and categorizing web documents (Chakrabarti, 2000).
- Web structure mining
- HITS (Kleinberg, 1998)
- PageRank (Brin and Page, 1998)
16Related work
- Web usage mining is the process of applying data
mining techniques to extract web access patterns
from a web log.
17Related work
- Web usage mining
- General purpose, e.g., Chen et al., 1996; Cooley et al., 1999
- Website improvement, e.g., Perkowitz and Etzioni, 2000
- Personalization, e.g., Yan et al., 1996
18Related work
- Limitations of previous web usage mining research
- Not considering web structure information, e.g., Chen et al., 1996
- Web structure information is used to exclude uninteresting web visiting patterns, e.g., Yan et al., 1996 and Cooley et al., 1999
19Hyperlink Selection
- The quality of a portal page is measured using a web log, and a web log can be divided into sessions.
- Metrics to measure the quality of a portal page
- Effectiveness
- Efficiency
- Usage
20Hyperlink Selection
- Effectiveness is the percentage of user-sought top-level web pages that can be easily accessed from the portal page.
- What are the user-sought top-level web pages?
- How to define the ease of finding a web page from a portal page?
21Hyperlink Selection
- User-sought top-level web pages
- Session j: L1, L10, L11, L2, L13, L14, L5, L9, L7, L12
- L1, L2, L5, and L7 are in the hyperlink pool
- User-sought top-level web pages: L1, L2, L5, L7
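The example above can be reproduced with a short sketch: the user-sought top-level web pages of a session are the visited pages that hyperlinks in the pool point to. The function name is hypothetical, and pages are identified here by the label of the hyperlink pointing to them, as on the slide.

```python
def user_sought_top_level(session, hyperlink_pool):
    """Return the pages of a session that pool hyperlinks point to, in visit order."""
    pool = set(hyperlink_pool)
    return [page for page in session if page in pool]

session_j = ["L1", "L10", "L11", "L2", "L13", "L14", "L5", "L9", "L7", "L12"]
# Only the pool members visited in this session are listed; a real pool is larger.
hyperlink_pool = {"L1", "L2", "L5", "L7"}

print(user_sought_top_level(session_j, hyperlink_pool))  # ['L1', 'L2', 'L5', 'L7']
```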
22Hyperlink Selection
- Usually, web pages that are 1-2 clicks away from
a portal page can be easily found from the portal
page.
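A minimal sketch of a session-level effectiveness computation follows. It is one reading of the verbal definition (the share of user-sought top-level pages that lie within 1-2 clicks of the portal page), not the formula from the slides, which is not reproduced in this text. The function name and the `reachable` set are hypothetical.

```python
def session_effectiveness(user_sought, reachable_within_two_clicks):
    """Fraction of user-sought top-level pages that are easy to reach from the portal page."""
    if not user_sought:
        return None  # undefined when the session sought no top-level pages
    easy = [page for page in user_sought if page in reachable_within_two_clicks]
    return len(easy) / len(user_sought)

user_sought = ["L1", "L2", "L5", "L7"]
# Hypothetical: pages reachable in 1-2 clicks from some candidate portal page.
reachable = {"L1", "L2", "L7", "L20"}

print(session_effectiveness(user_sought, reachable))  # 0.75
```

How session values are aggregated to the log level is left open here, since the text does not state it.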
23Hyperlink Selection
- Effectiveness measured at session level
- Effectiveness measured at log level
24Hyperlink Selection
- Efficiency measures the usefulness of hyperlinks placed in a portal page.
- Efficiency measured at session level
- Efficiency measured at log level
25Hyperlink Selection
- Usage: how often a portal page is visited.
26Hyperlink Selection
Definition: Given a website w, its hyperlink pool HP, and the number N of hyperlinks to be placed in the portal page of w, where , the hyperlink selection problem is to construct the portal page by selecting N hyperlinks from the hyperlink pool HP to maximize the effectiveness, efficiency and usage of the resulting portal page (i.e., all metrics are measured at the web log level).
27LinkSelector
- LinkSelector is based on relationships between hyperlinks in a hyperlink pool.
- Structure Relationship
- Access Relationship
28LinkSelector
29LinkSelector
[Figure: Web page 1 contains hyperlinks L1 and L3; Web page 2 contains L2, L4, L6 and L8; Web page 3 contains L5 and L7]
- Structure relationships: L1->L2, L1->L4, L1->L6, L1->L8, L3->L5, L3->L7
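The listed structure relationships can be reproduced with a small sketch. It assumes that a structure relationship Li->Lj holds when the page Li points to contains the pool hyperlink Lj, and that L1 points to Web page 2 and L3 to Web page 3. Both are read off the example rather than stated in the text, and the function and variable names are hypothetical.

```python
def structure_relationships(links_on_target_page, hyperlink_pool):
    """Return pairs (Li, Lj) where the page Li points to contains pool hyperlink Lj."""
    pool = set(hyperlink_pool)
    return [
        (li, lj)
        for li, targets in links_on_target_page.items()
        if li in pool
        for lj in targets
        if lj in pool and lj != li
    ]

# Hyperlinks found on the page each initial hyperlink points to (assumed mapping).
links_on_target_page = {
    "L1": ["L2", "L4", "L6", "L8"],  # L1 -> Web page 2
    "L3": ["L5", "L7"],              # L3 -> Web page 3
}
pool = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"]

for li, lj in structure_relationships(links_on_target_page, pool):
    print(f"{li}->{lj}")
```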
30LinkSelector
- Access Relationship
- k-HS denotes a hyperlink set with k hyperlinks, e.g., {L1, L2} is a 2-HS
- The support of a k-HS is the percentage of sessions in which the web pages pointed to by the hyperlinks in the k-HS are accessed together.
- e.g., if the web pages pointed to by L1 and L2 are accessed together in 20 sessions out of a total of 100 sessions, then the support of the 2-HS {L1, L2} is 20%.
31LinkSelector
Definition For a k-HS , where ,
there exists an access relationship among
elements in the k-HS if and only if its support
is greater than a pre-defined threshold.
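The support computation and the threshold test can be sketched as follows. The toy log is hypothetical and built to match the 20-out-of-100 example. Sessions are represented as lists of hyperlink labels standing for the pages they point to, and the function names are assumptions.

```python
def support(hyperlink_set, sessions):
    """Fraction of sessions in which all pages of the hyperlink set are accessed."""
    target = set(hyperlink_set)
    hits = sum(1 for session in sessions if target <= set(session))
    return hits / len(sessions)

def has_access_relationship(hyperlink_set, sessions, threshold):
    """An access relationship exists when support is strictly greater than the threshold."""
    return support(hyperlink_set, sessions) > threshold

# Toy log of 100 sessions; 20 of them contain both L1 and L2.
sessions = [["L1", "L2", "L9"]] * 20 + [["L1", "L5"]] * 30 + [["L7"]] * 50

print(support({"L1", "L2"}, sessions))                        # 0.2
print(has_access_relationship({"L1", "L2"}, sessions, 0.1))   # True
```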
32LinkSelector
33LinkSelector
34LinkSelector
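[Figure: Web page 1 contains hyperlinks L1 and L3; Web page 2 contains L2, L4, L6 and L8; Web page 3 contains L5 and L7]
- Access relationships (hyperlink pair, support): (L1, L2, 0.1), (L1, L4, 0.1), (L1, L6, 0.05), (L1, L8, 0.05), (L3, L5, 0.4), (L3, L7, 0.5)
- Structure relationships: L1->L2, L1->L4, L1->L6, L1->L8, L3->L5, L3->L7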
35LinkSelector
- Group-I relationship provides indicators of preference for individual hyperlinks in hyperlink selection
- the number of structure relationships a hyperlink participates in as an initial hyperlink
- the quality of these structure relationships
36LinkSelector
- Group-II relationship
- no structure relationship between L9 and L12
- an access relationship between L9 and L12
37LinkSelector
- Group-II relationship provides indicators of hyperlink pair preference in hyperlink selection
- hyperlink pairs with Group-II relationships are preferred to hyperlink pairs without Group-II relationships
- within hyperlink pairs with Group-II relationships, hyperlink pairs with higher support of access relationship are preferred to those with lower support of access relationship
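The two preference rules above can be sketched as a sort. A pair is treated as Group-II when it has an access relationship (support above a threshold) but no structure relationship in either direction, following the L9/L12 example. All supports, the threshold and the other pairs are hypothetical, and the ordering among pairs without Group-II relationships is left as given because the text does not specify it.

```python
def rank_pairs(pairs, supports, structure_pairs, threshold):
    """Order pairs: Group-II pairs first, by descending support; the rest keep input order."""
    def pair_support(pair):
        return supports.get(frozenset(pair), 0.0)

    def is_group2(pair):
        a, b = pair
        no_structure = (a, b) not in structure_pairs and (b, a) not in structure_pairs
        return no_structure and pair_support(pair) > threshold

    def key(pair):
        group2 = is_group2(pair)
        return (group2, pair_support(pair) if group2 else 0.0)

    # sorted() is stable, so pairs with equal keys keep their input order.
    return sorted(pairs, key=key, reverse=True)

pairs = [("L1", "L2"), ("L9", "L12"), ("L4", "L10"), ("L3", "L11")]
supports = {  # hypothetical supports
    frozenset(("L1", "L2")): 0.1,
    frozenset(("L9", "L12")): 0.3,
    frozenset(("L4", "L10")): 0.5,
    frozenset(("L3", "L11")): 0.01,
}
structure_pairs = {("L1", "L2")}  # L1->L2 is a structure relationship
ranked = rank_pairs(pairs, supports, structure_pairs, threshold=0.05)
print(ranked)
```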
38LinkSelector
- Group-III relationship reveals patterns that are not relevant to hyperlink selection
- Group-IV relationship does not reveal interesting patterns.
[Figure: Web page 1 and Web page 2, with hyperlinks L1 and L5]
39LinkSelector
- The Sketch of LinkSelector
40LinkSelector
- Discover Structure Relationships
41LinkSelector
- Access relationships can be discovered from a web log using association rule mining
- Data Preprocessing
- Association rule mining
42LinkSelector
- Data Preprocessing
- Web log cleaning
- Error logs (e.g., status code 404)
- Accessory logs (e.g., .gif)
- Session identification
- Modify web server
- 30-min time interval
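The preprocessing steps can be sketched as below. Several choices are assumptions rather than details from the slides: every status code of 400 or above is dropped (the slide names only 404), the accessory suffix list extends the single `.gif` example, visitors are told apart by a `visitor` field such as an IP address, and the 30-minute interval is read as an inactivity gap between consecutive requests.

```python
from datetime import datetime, timedelta

ACCESSORY_SUFFIXES = (".gif", ".jpg", ".css")  # assumption; the slide names only .gif
SESSION_GAP = timedelta(minutes=30)

def clean(records):
    """Drop error records and accessory requests (images, style sheets)."""
    return [
        r for r in records
        if r["status"] < 400 and not r["url"].lower().endswith(ACCESSORY_SUFFIXES)
    ]

def sessionize(records):
    """Split each visitor's requests into sessions at gaps longer than SESSION_GAP."""
    sessions = []
    last_seen = {}  # visitor -> (time of last request, page list of the open session)
    for r in sorted(records, key=lambda rec: rec["time"]):
        previous = last_seen.get(r["visitor"])
        if previous is None or r["time"] - previous[0] > SESSION_GAP:
            pages = []
            sessions.append(pages)
        else:
            pages = previous[1]
        pages.append(r["url"])
        last_seen[r["visitor"]] = (r["time"], pages)
    return sessions

t0 = datetime(2001, 9, 1, 10, 0)
records = [  # hypothetical log records
    {"visitor": "a", "time": t0, "url": "/index.html", "status": 200},
    {"visitor": "a", "time": t0 + timedelta(minutes=1), "url": "/logo.gif", "status": 200},
    {"visitor": "a", "time": t0 + timedelta(minutes=5), "url": "/admissions.html", "status": 200},
    {"visitor": "a", "time": t0 + timedelta(minutes=6), "url": "/missing.html", "status": 404},
    {"visitor": "b", "time": t0 + timedelta(minutes=7), "url": "/library.html", "status": 200},
    {"visitor": "a", "time": t0 + timedelta(minutes=50), "url": "/index.html", "status": 200},
]
sessions = sessionize(clean(records))
print(sessions)
```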
43LinkSelector
- Association rule mining
- Given a transaction database

  tid   items
  001   1,2,3
  002   1,2,4
  003   2,3,4
  004   4,5,6

- An itemset is a set of items, e.g., {1,2}
- The support of an itemset is the percentage of transactions that contain (e.g., purchase) the itemset.
- Objective: discover all itemsets with supports larger than a user-defined threshold.
- Apriori Algorithm (Agrawal and Srikant, 1994)
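A compact sketch of the Apriori idea on the four-transaction database above follows. It is a simplified illustration rather than the implementation used in LinkSelector, and the threshold of 0.4 is an arbitrary choice for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support > min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) > min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) > min_support}
    return frequent

transactions = [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {4, 5, 6}]
frequent = apriori(transactions, min_support=0.4)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

In the hyperlink setting, sessions play the role of transactions and hyperlinks the role of items, so a frequent itemset corresponds to a k-HS whose support exceeds the threshold.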
44LinkSelector
- Calculate Preferences for Hyperlinks
45LinkSelector
- Calculate Preferences for Hyperlink Sets
No structure relationships between and
and and .
Preference for hyperlink set is 0.022
46LinkSelector
- Clustering is a data mining technique that segments objects into groups based on their similarities
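[Figure: five objects (1-5) joined by edges labeled with similarities, e.g., 0.2 and 0.12]

Similarity matrix (remaining entries not recoverable):

       1      2      3      4      5
  1    0      0.2    0.1    0.1    0.05
  2    0.2    0      0.1    0.1    0.05
  3    0.1    0.1
  4
  5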
47LinkSelector
- Hyperlink Clustering
- Hyperlinks -> Objects
- Preferences for hyperlinks -> Weights of objects
- Preferences for hyperlink sets -> Similarities among hyperlinks
48LinkSelector
- Limitations of classical clustering algorithms
- Weights of objects are not considered.
- Only considers similarities between two objects
49LinkSelector
- Our solution
- Indexes of the proposed similarity matrix are clusters, while indexes of the traditional similarity matrix are objects to be clustered.
- Similarities involving two or more objects are considered in the proposed similarity matrix.
- Weights of objects are considered
50LinkSelector
51Experiment
- Data collected from UA website
- Hyperlink pool: 110 links
- Web log collected in Sep. 2001
- 10 M records -> 4.2 M records
- total 344 k sessions
- 262 k sessions -> Training data (23 days)
- 82 k sessions -> Testing data (7 days)
52Experiment
Hyperlinks selected by LinkSelector, domain experts and access frequency (N = 6)
53Experiment
- Average improvement: 12.7%
- Improvement decreases from 22.1% to 8.4%
- Average number of sessions per day: 11.5k
54Experiment
- Average improvement: 17.0%
- Improvement decreases from 30.2% to 9.4%
- Absolute number of hyperlinks improved: 15610 to 6509
55Experiment
- Average improvement: 16.9%
- Improvement decreases from 30.2% to 9.3%
56Experiment
Hyperlinks selected by LinkSelector, classical hierarchical clustering and association rule mining (N = 6)
57Experiment
- Average improvement compared with association rule mining: 25.8%
- Average improvement compared with classical clustering: 102.0%
58Experiment
- Average improvement compared with association rule mining: 31.7%
- Average improvement compared with classical clustering: 124.0%
59Experiment
- Average improvement compared with association rule mining: 31.6%
- Average improvement compared with classical clustering: 123.0%
60Contributions
1. We proposed and formally defined a new and important research problem: hyperlink selection.
2. We proposed a web mining based hyperlink selection approach and showed that it outperforms other hyperlink selection approaches.
3. We developed a new clustering algorithm for hyperlink selection.
61Limitations and Future work
- User Study
- Adaptive LinkSelector
- Structure of website changes
- Web visiting patterns change