Characterizing Visitors to a Website Across Multiple Sessions - PowerPoint PPT Presentation


1
Characterizing Visitors to a Website Across
Multiple Sessions
Arindam Banerjee, Joydeep Ghosh
  • NGDM Workshop, Nov 2002

2
Motivation
  • Why characterize or predict web user behavior?
  • Site-centric view: personalization, sticky
    websites
  • User-centric view: personal agents for
    information acquisition
  • Universalist approaches: PageRank, web metrics, ...

3
Clustering Users from Web Logs
  • Wide variety of web behavior → segment users
    based on surfing behavior as a first step to
    further analysis.
  • User = set of sessions
  • Session = sequence of (page ID, time spent on
    that page) tuples
  • How to cluster sets of sequences?
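
One way to hold this data in code, as a minimal sketch. The type names `Session` and `UserSessions` and the helper `session_duration` are illustrative assumptions, not names from the presentation; the sample sessions are the ones listed for user 128.194.xxx.xxx on the user-cluster results slide. The slides do not state the unit of the time values, so none is assumed here.

```python
from typing import Dict, List, Tuple

# A session is an ordered list of (page id, time spent on that page) tuples.
Session = List[Tuple[str, float]]
# A user (keyed here by a masked IP address) owns a set of sessions.
UserSessions = Dict[str, List[Session]]

users: UserSessions = {
    "128.194.xxx.xxx": [
        [("/authors", 3), ("/articles", 129)],
        [("/authors", 8), ("/articles", 8)],
        [("/authors", 80), ("/articles", 2141)],
    ],
}


def session_duration(session: Session) -> float:
    """Total time spent in a session (hypothetical helper)."""
    return sum(time_spent for _, time_spent in session)
```

Clustering users then means clustering the values of this mapping, each of which is a set of sequences rather than a fixed-length vector.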

4
The Approach
  • Cluster sessions
      • Session similarity measure
      • Session similarity graph
      • Outlier detection
      • Graph partitioning
  • Create a cluster space
  • Cluster users in this space

5
A Similarity Measure for Sessions
  • Overlap between two sessions is represented by
    their longest common subsequence (LCS)
  • Session similarity is obtained from the LCS and
    time information:
    session similarity = (time similarity in LCS) x
    (importance of LCS)
  • The similarity component: average min-max
    similarity over the pages in the LCS
  • The importance component: average of the
    fractions of overall session time spent in the
    LCS
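
The measure above can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, all times are assumed positive, and "min-max similarity" is read as min(t1, t2) / max(t1, t2), which matches the worked example in the backup slides. An LCS is not always unique; the tie-break used here (skip ahead in the second session first) is an arbitrary choice that happens to select the same LCS, (b)(d)(c), as that example.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page id, time spent), in visit order


def lcs_pairs(pages_a: List[str], pages_b: List[str]) -> List[Tuple[int, int]]:
    """Return 0-based index pairs (i, j) of one longest common subsequence."""
    n, m = len(pages_a), len(pages_b)
    # table[i][j] = LCS length of pages_a[i:] and pages_b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pages_a[i] == pages_b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if pages_a[i] == pages_b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i][j + 1] >= table[i + 1][j]:
            j += 1  # tie-break (assumption): skip ahead in the second session
        else:
            i += 1
    return pairs


def session_similarity(s1: Session, s2: Session) -> float:
    """(time similarity in LCS) x (importance of LCS); 0.0 if no common page."""
    pairs = lcs_pairs([p for p, _ in s1], [p for p, _ in s2])
    if not pairs:
        return 0.0
    # Similarity component: average min/max ratio of the two times per LCS page.
    ratios = [min(s1[i][1], s2[j][1]) / max(s1[i][1], s2[j][1]) for i, j in pairs]
    time_similarity = sum(ratios) / len(ratios)
    # Importance component: mean fraction of each session's time spent in the LCS.
    frac1 = sum(s1[i][1] for i, _ in pairs) / sum(t for _, t in s1)
    frac2 = sum(s2[j][1] for _, j in pairs) / sum(t for _, t in s2)
    return time_similarity * (frac1 + frac2) / 2


# The two sessions from the worked example in the backup slides.
session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(round(session_similarity(session1, session2), 2))  # 0.43
```

Both components lie in [0, 1], so the product does too; a session compared with itself scores 1.0.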

6
Session Clustering
  • Find the pairwise similarity values between all
    pairs of sessions; record only similarities > q
  • Incrementally construct the similarity graph Gq
      • the vertices are the sessions; the edge weights
        are the session similarity values
      • no isolated vertices (discard outliers)
  • Balanced graph partitioning
      • we used METIS [Karypis, Kumar]
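
The graph-construction step can be sketched as below, assuming any symmetric similarity function and a threshold q. The function name and return layout are hypothetical, and the sketch builds the graph in a single all-pairs pass rather than incrementally. The balanced partitioning step itself is not shown; it would be handed to a graph partitioner such as METIS.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_similarity_graph(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    q: float,
) -> Tuple[Dict[Tuple[int, int], float], List[int], List[int]]:
    """Return (edges, kept vertices, outliers) for threshold q.

    edges maps an index pair (i, j), i < j, to its similarity; only
    similarities strictly greater than q are recorded.
    """
    edges: Dict[Tuple[int, int], float] = {}
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > q:
            edges[(i, j)] = weight
    # Vertices without any surviving edge are isolated and treated as outliers.
    connected = {v for edge in edges for v in edge}
    kept = sorted(connected)
    outliers = [v for v in range(len(items)) if v not in connected]
    return edges, kept, outliers


# Toy usage with numbers and a made-up closeness measure in (0, 1].
values = [1.0, 1.1, 5.0, 5.2, 100.0]
edges, kept, outliers = build_similarity_graph(
    values, lambda a, b: 1.0 / (1.0 + abs(a - b)), q=0.5
)
print(sorted(edges), kept, outliers)
```

The all-pairs loop is quadratic in the number of sessions, which is consistent with the later remark that forming the similarity graph is slow.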

7
The Cluster Space
  • Given: each session is assigned to one of k
    clusters (sets)
  • → Sessions of a user are distributed among the k
    sets
  • vector u = [u1 u2 ... uk]^T, where ui = number of
    sessions of the user belonging to cluster i
  • Stage II: user clustering
      • find pairwise similarity values using the
        extended Jaccard measure
      • partition the similarity graph
  • Gives l user clusters and a set of outlier users
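
A minimal sketch of the cluster-space mapping and the Stage II similarity. The function names are hypothetical, session clusters are labeled 0..k-1 here rather than 1..k, and the extended Jaccard measure is taken in its usual form x·y / (|x|² + |y|² − x·y). Returning 0.0 for two all-zero vectors is an assumption made only to avoid division by zero.

```python
from collections import Counter
from typing import List, Sequence


def user_vector(session_labels: Sequence[int], k: int) -> List[int]:
    """Map a user's sessions to cluster space: u[i] = sessions in cluster i."""
    counts = Counter(session_labels)
    return [counts.get(cluster, 0) for cluster in range(k)]


def extended_jaccard(u: Sequence[float], v: Sequence[float]) -> float:
    """Extended Jaccard similarity: u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0


# A user whose five sessions fell into clusters 0, 2, 2, 1, 2 (k = 4).
u = user_vector([0, 2, 2, 1, 2], k=4)
print(u)  # [1, 1, 3, 0]
```

Once every user is a k-dimensional vector, the pairwise values from `extended_jaccard` can be thresholded into a user similarity graph in the same way as for sessions.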

8
The Dataset: Sulekha.com

9
Dataset details
  • Logs over a one month period
  • Raw log size: 184 MB
  • 453,953 files accessed
  • 37,753 sessions in all
  • 23,310 sessions after some preprocessing/filtering
  • 2,493 users

10
Results: Session Clusters
Cluster 1: interest in coffeehouse, contests
  • (/,12)(/movies,6)(/contests,178)
  • (/contests,142)
  • (/coffeehouse,5)(/contests,183)
  • (/contests,172)
  • (/,10)(/contests,143)
Cluster 2: glance through home, articles
  • (/,22)(/articles,22)
  • (/,20)(/articles,20)
  • (/,21)(/articles,21)
  • (/,19)(/articles,19)
  • (/,20)(/articles,19)
Cluster 3: interest in author, articles
  • (/,148)(/authors,6)(/articles,77)
  • (/authors,290)(/articles,290)
  • (/authors,295)(/articles,295)
  • (/,33)(/authors,90)(/articles,475)
  • (/,32)(/authors,91)(/articles,425)
Cluster 4: read articles
  • (/,39)(/articles,98)(/misc,17)(/articles,2649)
  • (/,9)(/articles,2666)
  • (/authors,26)(/articles,2561)
  • (/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)
    (/articles,2373.1)

11
Results: User Clusters
  • user (128.194.xxx.xxx)
      • (/authors,3)(/articles,129)
      • (/authors,8)(/articles,8)
      • (/authors,80)(/articles,2141)
  • user (209.30.xxx.xxx)
      • (/home,77)(/articles,111)(/authors,93)(/articles,629)
        (/misc,58)(/coffeehouse,75)(/wo-men,967)
      • (/articles,2627)
  • user (171.68.xxx.xxx)
      • (/home,323)(/articles,24)(/authors,45)(/articles,1290)
  • A user cluster: people who read the articles

12
Results: User Clusters
  • user (152.170.xxx.xxx)
      • (/home,21)(/wo-men,1075)(/philosophy,52)
  • user (209.244.xxx.xxx)
      • (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)
        (/wo-men,31)
      • (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)
        (/coffeehouse,382)(/biztech,298)(/philosophy,290)
      • (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)
        (/biztech,94)(/coffeehouse,2)(/philosophy,1093)
  • A user cluster: people interested in wo-men,
    philosophy, coffeehouse

13
Results: User Clusters
  • user (216.154.xxx.xxx)
      • (/coffeehouse,12)(/biztech,25)(/books,48)
      • (/coffeehouse,13)(/biztech,26)(/books,19)
  • user (204.220.xxx.xxx)
      • (/coffeehouse,162)
      • (/coffeehouse,40)
  • user (32.100.xxx.xxx)
      • (/coffeehouse,12)(/contests,12)
      • (/coffeehouse,43)(/contests,44)
  • A user cluster: people interested in coffeehouse
    (bookmarked it!)

14
Result Visualization using CLUSION [Strehl, Ghosh 01]
[Figure: CLUSION visualizations; panels: Sessions, Users]

15
Conclusions
  • Segmentation: a basic pre-processing step for Web
    mining
  • Similarity measure + cluster space concept:
    applicable to clustering sets of any
    data structure
  • For certain websites, time spent on the pages
    matters
      • not handled by current commercial tools
  • Outlier detection before clustering is important
  • Results QA-ed by human subjects
      • Results for clusters and outliers at both levels
        were subjectively good
      • No good way to find cluster quality
        analytically
  • Formation of the similarity graph is a slow process

16
Future Work
  • Improve the present method by
      • using cluster seeds for cluster growing
      • using alternative clustering algorithms for each
        stage
      • studying the effect of thresholds and number of
        clusters on performance
      • studying the importance of order of page visits
      • studying the importance of balanced clustering

17
Backup
18
Issues: Choice of Parameters
  • Number of session clusters, k, should be chosen
    appropriately
  • Thresholds for forming session and user similarity
    graphs
      • threshold value should be chosen after looking at
        the distribution of edge weights
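
A small sketch of how one might inspect the edge-weight distribution before fixing a threshold. The function names and the sample weights are hypothetical (the slides give no actual weight data), and weights are assumed to lie in [0, 1], as they do for the session similarity measure and for extended Jaccard on count vectors.

```python
from typing import List, Sequence


def weight_histogram(weights: Sequence[float], bins: int = 10) -> List[int]:
    """Count edge weights per equal-width bin over [0, 1]."""
    counts = [0] * bins
    for w in weights:
        # A weight of exactly 1.0 falls into the last bin.
        counts[min(int(w * bins), bins - 1)] += 1
    return counts


def edges_kept(weights: Sequence[float], threshold: float) -> int:
    """Number of edges that would survive a given threshold."""
    return sum(1 for w in weights if w > threshold)


# Made-up pairwise similarities, for illustration only.
sample_weights = [0.05, 0.12, 0.15, 0.43, 0.47, 0.81, 0.95, 1.0]
print(weight_histogram(sample_weights, bins=5))  # [3, 0, 2, 0, 3]
for threshold in (0.1, 0.4, 0.9):
    print(threshold, edges_kept(sample_weights, threshold))
```

A gap in the histogram, such as the empty bins here, is one natural place to put the threshold, since it separates weakly related pairs from strongly related ones.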

19
Related Work
  • Research in Web mining
      • Extraction of navigational patterns
        [Spiliopoulou, Faulstich]
      • Ordering relationships [Mannila, Meek]
      • Surfing prediction [Pitkow, Pirolli]
      • Clustering web usage sessions [Fu, Sandhu, Shih]

20
Example
  • Sessions
      • Session1: (a,8) (b,100) (d,8) (c,5) (e,23) (a,5)
      • Session2: (b,5) (d,12) (f,1) (a,7) (c,5)
  • LCS pages: (b)(d)(c)
  • Corresponding index and time sequences (0-based)
      • Index1: (1)(2)(3), Time1: (100)(8)(5)
      • Index2: (0)(1)(4), Time2: (5)(12)(5)
  • Similarity over each LCS page = min/max of the
    two times
      • Similarity on page b: 5/100 = 0.05
      • Similarity on page d: 8/12 = 0.67
      • Similarity on page c: 5/5 = 1.00

21
Example (contd.)
  • The similarity component
      • (0.05 + 0.67 + 1.00)/3 = 0.57
  • The importance component
      • Fraction of time spent in the LCS by Session1:
        113/149 = 0.76
      • Fraction of time spent in the LCS by Session2:
        22/30 = 0.73
      • The mean: (0.76 + 0.73)/2 = 0.75
  • The overall similarity
      • 0.57 x 0.75 = 0.43

22
Issues: Session Resolution
  • Generate coarse-resolution paths by making use of
    the concept hierarchy of the website
  • Reduces computation; increases interpretability
    of results

Original path:      (/authors/ramesh_mahadevan.html,3) (/articles/rm_phattas.html,75) (/articles/rm_desidads.html,39)
Concept-level path: (/authors,3) (/articles,114)

Original path:      (/authors/arun_sampath.html,109) (/philosophy/messages/1951.html,102) (/philosophy/messages/1953.html,46) (/,3) (/philosophy/messages/1954.html,69)
Concept-level path: (/authors,109) (/philosophy,148) (/,3) (/philosophy,69)
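
The coarsening shown in these two paths can be sketched as follows. The sketch assumes the concept of a page is simply its top-level directory and that consecutive visits to the same concept are merged by summing their times; both assumptions reproduce the two paths above but may not cover the site's full concept hierarchy. The function names are hypothetical.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page path, time spent), in visit order


def concept_of(path: str) -> str:
    """Top-level directory of a path; the root page maps to "/"."""
    parts = [part for part in path.split("/") if part]
    return "/" + parts[0] if parts else "/"


def to_concept_path(session: Session) -> Session:
    """Replace pages by concepts and merge consecutive repeats, summing times."""
    coarse: Session = []
    for path, time_spent in session:
        concept = concept_of(path)
        if coarse and coarse[-1][0] == concept:
            coarse[-1] = (concept, coarse[-1][1] + time_spent)
        else:
            coarse.append((concept, time_spent))
    return coarse


original = [
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
]
print(to_concept_path(original))
```

Only consecutive visits are merged, so the two separate stays in /philosophy remain distinct steps and the order of concepts is preserved for the LCS comparison.
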
23
Comments
  • Results QA-ed by human subjects
      • Results for clusters and outliers at both levels
        were subjectively good
      • No good way to find cluster quality analytically
  • Clustering algorithms for the two stages
      • Stage I: graph partitioning works well for large
        sparse graphs, so it is desirable in this stage
      • Stage II: since the space is not
        high-dimensional, any reasonable clustering
        algorithm should be adequate
  • Cluster space
      • Gives a general framework for mapping any
        non-vector clustering problem to an equivalent
        vector clustering problem