LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals

1
LinkSelector: A Web Mining Approach to
Hyperlink Selection for Web Portals
  • Xiao Fang
  • University of Arizona
  • 10/18/2002

2
Agenda
  • Introduction
  • Related work
  • Problem definition -- Hyperlink Selection
  • Solution -- LinkSelector
  • Evaluation
  • Contributions, limitations and future work

3
Introduction
  • Size of the WWW (Lawrence and Giles, 1999)
  • 800 million web pages
  • 1 million pages added daily
  • How to find information on the Web
  • Using search engines (best coverage 38.3%)
    (Lawrence and Giles, 1999)
  • Clicking on hyperlinks

4
Introduction
[Figure: three web pages reached by clicking on hyperlinks: a Product Category List (A B C D E F); after clicking on A, the Product Category A Product List (A1 A2 A3 A4 A5); after clicking on A2, the Product A2 page (price 1000, detailed description).]
5
Introduction
  • A portal page is the entrance to a website.
  • Types of portal page
  • Homepage of a website
  • Default web portal (e.g., My Yahoo!)
  • Most My Yahoo! users never customize their
    default web portals (Manber et al., 2000).

6
Introduction
  • Hyperlinks in a portal page are selected from a
    hyperlink pool.
  • A hyperlink pool is a set of hyperlinks pointing
    to top-level web pages, e.g., the hyperlinks in a
    site index page.

7
Portal page
8
Hyperlink pool
9
Portal page
10
Hyperlink pool
11
Introduction
  • Number of hyperlinks in a portal page: several
    dozen (e.g., 32 in the Arizona home page).
  • Number of hyperlinks in a hyperlink pool: several
    hundred (e.g., 743 in the Arizona index page).

12
Introduction
  • It is too computationally expensive to do an
    exhaustive search (e.g., ).
  • Current practice of hyperlink selection: expert
    selection
  • Only reflects domain experts' perspectives
  • Subjective
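
To give a feel for why exhaustive search is impractical, the sketch below counts the candidate portal pages. It assumes the sizes quoted for the Arizona example (32 hyperlinks chosen from a pool of 743); the slide's own example expression is not shown, so these numbers are only an illustration.

```python
import math

# Assumed sizes, borrowed from the Arizona example:
# a pool of 743 hyperlinks and a portal page holding 32 of them.
pool_size = 743
portal_size = 32

# Number of distinct portal pages an exhaustive search would have to score.
candidates = math.comb(pool_size, portal_size)
print(f"{candidates:.3e} candidate portal pages")
```

Each candidate would also have to be scored against the whole web log, which multiplies the cost further.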

13
Introduction
  • Our approach is based on
  • web access patterns extracted from a web log --
    objective, and reflect web surfers' perspectives
  • web structural patterns extracted from an
    existing website -- objective

14
Related work
  • Web mining is the process of applying data mining
    techniques to extract patterns from the Web.
  • Web data
  • Content: texts and graphics in web pages
  • Structure: hyperlinks
  • Usage: web logs

15
Related work
  • Web content mining is the process of
    automatically retrieving, parsing, indexing and
    categorizing web documents (Chakrabarti, 2000).
  • Web structure mining
  • HITS (Kleinberg, 1998)
  • PageRank (Brin and Page, 1998)

16
Related work
  • Web usage mining is the process of applying data
    mining techniques to extract web access patterns
    from a web log.

17
Related work
  • Web usage mining
  • General purpose, e.g., Chen et al., 1996; Cooley
    et al., 1999
  • Website improvement, e.g., Perkowitz and Etzioni,
    2000
  • Personalization, e.g., Yan et al., 1996

18
Related work
  • Limitations of previous web usage mining research
  • Not considering web structure information, e.g.,
    Chen et al., 1996
  • Web structure information is used to exclude
    uninteresting web visiting patterns, e.g., Yan
    et al., 1996 and Cooley et al., 1999

19
Hyperlink Selection
  • The quality of a portal page is measured using a
    web log, which can be divided into sessions.
  • Metrics to measure the quality of a portal page
  • Effectiveness
  • Efficiency
  • Usage

20
Hyperlink Selection
  • Effectiveness is the percentage of user-sought
    top-level web pages that can be easily accessed
    from the portal page.
  • What are the user-sought top-level web pages?
  • How to define the ease of finding a web page
    from a portal page?

21
Hyperlink Selection
  • User-sought top-level web pages
  • Session j: L1, L10, L11, L2, L13, L14, L5, L9,
    L7, L12
  • L1, L2, L5, and L7 are in the hyperlink pool
  • User-sought top-level web pages: L1, L2, L5, L7
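
The example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `user_sought_pages` is made up for illustration, and the pool contains only the four hyperlinks named on the slide.

```python
# Session j from the slide, in visiting order.
session_j = ["L1", "L10", "L11", "L2", "L13", "L14", "L5", "L9", "L7", "L12"]

# Only the pool members named on the slide; a real pool would be larger.
hyperlink_pool = {"L1", "L2", "L5", "L7"}


def user_sought_pages(session, pool):
    """Return the pool hyperlinks visited in a session, in first-visit order."""
    seen = set()
    sought = []
    for link in session:
        if link in pool and link not in seen:
            seen.add(link)
            sought.append(link)
    return sought


print(user_sought_pages(session_j, hyperlink_pool))  # ['L1', 'L2', 'L5', 'L7']
```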

22
Hyperlink Selection
  • Usually, web pages that are 1-2 clicks away from
    a portal page can be easily found from the portal
    page.

23
Hyperlink Selection
  • Effectiveness measured at session level
  • Effectiveness measured at log level
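
The formulas for the two levels are not shown here, so the following is only a sketch of one possible reading. At session level it takes the definition literally (the share of user-sought pages that are easy to reach). At log level it assumes, without support from the slides, a plain average over sessions that sought at least one top-level page. The set `reachable` (pages within 1-2 clicks of the portal page) and all data are hypothetical.

```python
def session_effectiveness(sought, easily_reachable):
    """Fraction of a session's user-sought pages that are easy to reach."""
    if not sought:
        return None  # the session sought no top-level page
    hits = sum(1 for link in sought if link in easily_reachable)
    return hits / len(sought)


def log_effectiveness(sessions_sought, easily_reachable):
    """ASSUMPTION: log-level value is the mean over sessions that sought something."""
    scores = [session_effectiveness(s, easily_reachable) for s in sessions_sought]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: pages easy to reach from some portal page, and the
# user-sought pages of three sessions.
reachable = {"L1", "L2", "L5"}
sessions = [["L1", "L2", "L5", "L7"], ["L2"], []]

print(session_effectiveness(sessions[0], reachable))  # 0.75
print(log_effectiveness(sessions, reachable))         # 0.875
```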

24
Hyperlink Selection
  • Efficiency measures the usefulness of hyperlinks
    placed in a portal page.
  • Efficiency measured at session level
  • Efficiency measured at log level

25
Hyperlink Selection
  • Usage: how often a portal page is visited.

26
Hyperlink Selection
Definition: Given a website w, its hyperlink
pool HP, and the number N of hyperlinks to be
placed in the portal page of w, where
, the hyperlink selection problem is to
construct the portal page by selecting N
hyperlinks from the hyperlink pool HP to
maximize the effectiveness, efficiency and usage
of the resulting portal page (i.e., all metrics
are measured at the web log level).
27
LinkSelector
  • LinkSelector is based on relationships between
    hyperlinks in a hyperlink pool.
  • Structure Relationship
  • Access Relationship

28
LinkSelector
  • Structure Relationship

29
LinkSelector
  • Structure Relationship

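[Figure: Web page 1 contains hyperlinks L1 and L3. L1 points to Web page 2, which contains L2, L4, L6, L8; L3 points to Web page 3, which contains L5, L7.]
Structure relationships: L1→L2, L1→L4,
L1→L6, L1→L8, L3→L5, L3→L7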
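
One way to read the example: a structure relationship runs from an initial hyperlink to every hyperlink found on the page it opens. The sketch below encodes that reading. The page layout is an interpretation of the figure, and the names `links_on_page`, `target_of` and "terminal" are illustrative, not the author's.

```python
# Site layout read off the figure (an interpretation of the slide):
# which hyperlinks each page contains, and which page each hyperlink opens.
links_on_page = {
    "Web page 1": ["L1", "L3"],
    "Web page 2": ["L2", "L4", "L6", "L8"],
    "Web page 3": ["L5", "L7"],
}
target_of = {"L1": "Web page 2", "L3": "Web page 3"}


def structure_relationships(links_on_page, target_of):
    """List (initial, terminal) pairs: terminal sits on the page initial opens."""
    pairs = []
    for initial, page in target_of.items():
        for terminal in links_on_page.get(page, []):
            pairs.append((initial, terminal))
    return pairs


for initial, terminal in structure_relationships(links_on_page, target_of):
    print(f"{initial}→{terminal}")
```

Run on this layout, the function yields the six relationships listed on the slide.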
30
LinkSelector
  • Access Relationship
  • A hyperlink set with k hyperlinks is denoted
    k-HS, e.g., {L1, L2} is a 2-HS.
  • The support of a k-HS is the percentage of
    sessions in which the web pages pointed to by the
    hyperlinks in the k-HS are accessed together.
  • e.g., if the web pages pointed to by L1 and L2
    are accessed together in 20 sessions out of a
    total of 100 sessions, then the support of the
    2-HS {L1, L2} is 20%.
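
A minimal sketch of the support computation, using a hypothetical log built to match the numbers in the example (100 sessions, 20 of which contain both L1 and L2). The threshold value of 0.1 is an arbitrary illustration.

```python
def support(hyperlink_set, sessions):
    """Share of sessions in which every hyperlink of the set is accessed."""
    wanted = set(hyperlink_set)
    together = sum(1 for session in sessions if wanted <= set(session))
    return together / len(sessions)


def has_access_relationship(hyperlink_set, sessions, threshold):
    """True when the support exceeds the pre-defined threshold."""
    return support(hyperlink_set, sessions) > threshold


# Hypothetical log: 20 sessions with L1 and L2, 30 with L1 only, 50 with neither.
sessions = [["L1", "L2", "L9"]] * 20 + [["L1", "L5"]] * 30 + [["L7"]] * 50

print(support({"L1", "L2"}, sessions))                        # 0.2
print(has_access_relationship({"L1", "L2"}, sessions, 0.1))   # True
```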

31
LinkSelector
  • Access Relationship

Definition: For a k-HS , where ,
there exists an access relationship among
elements in the k-HS if and only if its support
is greater than a pre-defined threshold.
32
LinkSelector
33
LinkSelector
  • Group-I relationship

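[Figure: Web page 1 contains hyperlinks L1 and L3. L1 points to Web page 2, which contains L2, L4, L6, L8; L3 points to Web page 3, which contains L5, L7.]
Structure relationships: L1→L2, L1→L4,
L1→L6, L1→L8, L3→L5, L3→L7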
34
LinkSelector
  • Group-I relationship

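[Figure: Web page 1 contains hyperlinks L1 and L3. L1 points to Web page 2, which contains L2, L4, L6, L8; L3 points to Web page 3, which contains L5, L7.]
Access relationships (hyperlink pair, support): (L1, L2, 0.1), (L1, L4, 0.1),
(L1, L6, 0.05), (L1, L8, 0.05), (L3, L5, 0.4),
(L3, L7, 0.5)
Structure relationships: L1→L2, L1→L4,
L1→L6, L1→L8, L3→L5, L3→L7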
35
LinkSelector
  • Group-I relationship provides indicators of
    preference for individual hyperlinks in hyperlink
    selection
  • the number of structure relationships a
    hyperlink participates in as an initial hyperlink
  • the quality of these structure relationships
36
LinkSelector
  • Group-II relationship
  • no structure relationship between L9 and L12
  • an access relationship between L9 and L12

37
LinkSelector
  • Group-II relationship provides indicators of
    hyperlink pair preference in hyperlink selection
  • hyperlink pairs with Group-II relationships are
    preferred to hyperlink pairs without Group-II
    relationships
  • within hyperlink pairs with Group-II
    relationships, hyperlink pairs with higher support
    of access relationship are preferred to those with
    lower support of access relationship

38
LinkSelector
  • Group-III relationship reveals patterns that
    are not relevant to hyperlink selection
  • Group-IV relationship does not reveal
    interesting patterns.

[Figure: hyperlinks L1 and L5 shown with Web page 1 and Web page 2]
39
LinkSelector
  • The Sketch of LinkSelector

40
LinkSelector
  • Discover Structure Relationships

41
LinkSelector
  • Access relationships can be discovered from a
    web log using association rule mining
  • Data Preprocessing
  • Association rule mining

42
LinkSelector
  • Data Preprocessing
  • Web log cleaning
  • Error logs (e.g., status code 404)
  • Accessory logs (e.g., .gif)
  • Session identification
  • Modify web server
  • 30-min time interval
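
The preprocessing steps can be sketched as below. Several details are assumptions made for illustration: the record fields (`client`, `time`, `url`, `status`), treating any status of 400 or above as an error entry, the extra accessory suffixes beyond `.gif`, and reading the 30-min interval as a maximum gap between consecutive requests from the same client. The sketch takes the time-interval route to session identification, not the web server modification.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)             # the 30-min interval from the slide
ACCESSORY_SUFFIXES = (".gif", ".jpg", ".css")   # only .gif is named on the slide


def clean(records):
    """Drop error entries (status >= 400, assumed) and accessory requests."""
    return [
        r for r in records
        if r["status"] < 400 and not r["url"].lower().endswith(ACCESSORY_SUFFIXES)
    ]


def sessionize(records):
    """Split each client's requests into sessions at gaps longer than 30 min."""
    sessions = []
    open_session = {}  # client -> (time of last request, list of urls)
    for r in sorted(records, key=lambda r: r["time"]):
        last = open_session.get(r["client"])
        if last is None or r["time"] - last[0] > SESSION_GAP:
            urls = []            # start a new session for this client
            sessions.append(urls)
        else:
            urls = last[1]       # continue the client's current session
        urls.append(r["url"])
        open_session[r["client"]] = (r["time"], urls)
    return sessions


# Hypothetical log entries.
t0 = datetime(2001, 9, 1, 9, 0)
log = [
    {"client": "a", "time": t0, "url": "/index.html", "status": 200},
    {"client": "a", "time": t0 + timedelta(minutes=1), "url": "/logo.gif", "status": 200},
    {"client": "a", "time": t0 + timedelta(minutes=5), "url": "/missing.html", "status": 404},
    {"client": "a", "time": t0 + timedelta(minutes=10), "url": "/library.html", "status": 200},
    {"client": "a", "time": t0 + timedelta(minutes=55), "url": "/news.html", "status": 200},
]
print(sessionize(clean(log)))
# [['/index.html', '/library.html'], ['/news.html']]
```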

43
LinkSelector
  • Association rule mining
  • Given a transaction database:
      tid   items
      001   1, 2, 3
      002   1, 2, 4
      003   2, 3, 4
      004   4, 5, 6
  • An itemset is a set of items, e.g., {1, 2}
  • The support of an itemset is the percentage of
    transactions that contain (e.g., purchase) the
    itemset.
  • Objective: discover all itemsets with supports
    larger than a user-defined threshold.
  • Apriori Algorithm (Agrawal and Srikant, 1994)
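
A compact, unoptimized sketch of the level-wise idea behind Apriori, run on the four transactions above. The threshold of 0.4 is an arbitrary choice for illustration, and this is a textbook-style simplification, not the implementation used in LinkSelector.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for itemsets with support above min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep the candidates whose support exceeds the threshold.
        level = {}
        for c in current:
            s = support(c)
            if s > min_support:
                level[c] = s
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


db = [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {4, 5, 6}]
result = apriori(db, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With this threshold the frequent itemsets are {1}, {2}, {3}, {4}, {1, 2}, {2, 3} and {2, 4}; no 3-itemset survives the prune step. In LinkSelector's setting, sessions play the role of transactions and hyperlinks the role of items.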

44
LinkSelector
  • Calculate Preferences for Hyperlinks



45
LinkSelector
  • Calculate Preferences for Hyperlink Sets

No structure relationships between and
and and .
Preference for hyperlink set is 0.022

46
LinkSelector
  • Clustering is a data mining technique that
    segments objects into groups based on their
    similarities

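[Figure: graph of five objects (1 to 5) with edges labelled by similarity (labels 0.2 and 0.12 survive), next to its similarity matrix]
        1     2     3     4     5
  1     0     0.2   0.1   0.1   0.05
  2     0.2   0     0.1   0.1   0.05
  3     0.1   0.1
  4
  5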
47
LinkSelector
  • Hyperlink Clustering
  • Hyperlinks → Objects
  • Preferences for hyperlinks → Weights of objects
  • Preferences for hyperlink sets → Similarities
    among hyperlinks

48
LinkSelector
  • Limitations of classical clustering algorithms
  • Weights of objects are not considered.
  • Only considers similarities between two objects

49
LinkSelector
  • Our solution
  • Indexes of the proposed similarity matrix are
    clusters while indexes of the traditional
    similarity matrix are objects to be clustered.
  • Similarities involving two and more objects are
    considered in the proposed similarity matrix.
  • Weights of objects are considered

50
LinkSelector
51
Experiment
  • Data collected from UA website
  • Hyperlink pool: 110 links
  • Web log collected in Sep. 2001
  • 10 M records → 4.2 M records
  • total 344 K sessions
  • 262 K sessions → Training data (23 days)
  • 82 K sessions → Testing data (7 days)

52
Experiment
Hyperlinks selected by LinkSelector, domain
experts and access frequency (N = 6)
53
Experiment
Average improvement: 12.7%
Improvement decreases from 22.1% to 8.4%
Average number of sessions per day: 11.5k
54
Experiment
Average improvement: 17.0%
Improvement decreases from 30.2% to 9.4%
Absolute number of hyperlinks improved: 15610 to 6509
55
Experiment
Average improvement: 16.9%
Improvement decreases from 30.2% to 9.3%
56
Experiment
Hyperlinks selected by LinkSelector, classical
hierarchical clustering and association rule
mining (N = 6)
57
Experiment
Average improvement compared with association rule mining: 25.8%
Average improvement compared with classical clustering: 102.0%
58
Experiment
Average improvement compared with association rule mining: 31.7%
Average improvement compared with classical clustering: 124.0%
59
Experiment
Average improvement compared with association rule mining: 31.6%
Average improvement compared with classical clustering: 123.0%
60
Contributions
1. We proposed and formally defined a new and
important research problem: hyperlink selection.
2. We proposed a web mining based hyperlink
selection approach and showed that it outperforms
other hyperlink selection approaches.
3. We developed a new clustering algorithm for
hyperlink selection.
61
Limitations and Future work
  • User Study
  • Adaptive LinkSelector
  • Structure of website changes
  • Web visiting patterns change