1
ICS 278: Data Mining. Lecture 17: Web Log Mining
  • Padhraic Smyth
  • Department of Information and Computer Science
  • University of California, Irvine

2
Outline
  • Basic concepts in Web log data analysis
  • Predictive modeling of Web navigation behavior
  • Markov modeling methods
  • Analyzing search engine data
  • E-commerce aspects of Web log mining

3
Introduction
  • Useful to study human digital behavior; e.g.
    search engine data can be used for:
  • Exploration: e.g. # of queries per session?
  • Modeling: e.g. any time-of-day dependence?
  • Prediction: e.g. which pages are relevant?
  • Applications
  • Understand social implications of Web usage
  • Design of better tools for information access
  • E-commerce applications

4
How our Web navigation is recorded
  • Web logs
  • Record activity between client browser and a
    specific Web server
  • Easily available
  • Can be augmented with cookies (provide notion of
    state)
  • Search engine records
  • Text in queries, which responses were clicked on,
    etc
  • Client-side browsing records
  • Produced for research purposes as part of a study
  • Automatically recorded by client-side software
  • Harder to obtain, but much more accurate than
    server-side logs
  • Other sources
  • Web site registration, purchases, email, etc
  • ISP recording of Web browsing

5
Web Server Log Files
  • Server Transfer Log
  • transactions between a browser and server are
    logged
  • IP address, the time of the request
  • Method of the request (GET, HEAD, POST)
  • Status code of the server's response
  • Size in bytes of the transaction
  • Referrer Log
  • where the request originated
  • Agent Log
  • browser software making the request (spider)
  • Error Log
  • request resulted in errors (404)

6
W3C Extended Log File Format
7
Example of Web Log entries
  • Apache web log
  • 205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
  • 216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
  • 202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
  • 202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
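As an illustration (not part of the original slides), here is a minimal sketch of how combined-format entries like these can be parsed in Python; the regex and field names are my own choices.

```python
import re

# Apache "combined" log format: IP, identd, user, [time], "request",
# status, bytes, "referrer", "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log entry, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

entry = parse_line(
    '205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] '
    '"GET /~sophal/whole5.gif HTTP/1.0" 200 9609 '
    '"http://www.csua.berkeley.edu/~sophal/whole.html" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"'
)
print(entry["ip"], entry["url"], entry["status"])
```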

8
Routine Server Log Analysis
  • Most and least visited web pages
  • Entry and exit pages
  • Referrals from other sites or search engines
  • Which keywords were searched
  • How many clicks/page views a page received
  • Error reports, like broken links

9
Visualization of Web Log Data over Time
10
Server Log Analysis
11
Descriptive Summary Statistics
  • Histograms, scatter plots, time-series plots
  • Very important!
  • Helps to understand the big picture
  • Provides marginal context for any
    model-building
  • models aggregate behavior, not individuals
  • Challenging for Web log data
  • Examples
  • Session lengths (e.g., power laws)
  • Click rates as a function of time, content

12
L = number of page requests in a single session from visitors to www.ics.uci.edu over 1 week in November 2002 (robots removed)
13
Best fit of a simple power-law model: log P(L) = -a log L + b, i.e. P(L) ∝ L^(-a)
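As a rough illustration (not from the original slides), the exponent a can be estimated by least squares in log-log space; the data here is synthetic Zipf-distributed lengths standing in for the ICS session lengths.

```python
import numpy as np

# Sketch: fit log P(L) = -a log L + b on an empirical session-length
# histogram (synthetic data here, not the ICS trace).
lengths = np.random.zipf(2.0, size=10_000)        # stand-in for session lengths
vals, counts = np.unique(lengths, return_counts=True)
p = counts / counts.sum()                         # empirical P(L)
mask = vals <= 100                                # fit over the well-sampled range
slope, intercept = np.polyfit(np.log(vals[mask]), np.log(p[mask]), 1)
print(f"estimated exponent a = {-slope:.2f}")     # ~2.0 for this synthetic data
```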
14
(No Transcript)
15
Web data measurement issues
  • Important to understand how data is collected
  • Web data is collected automatically via software
    logging tools
  • Advantage
  • No manual supervision required
  • Disadvantage
  • Data can be skewed (e.g. due to the presence of
    robot traffic)
  • Important to identify robots (also known as
    crawlers, spiders)

16
A time-series plot of ICS Website data
Number of page requests per hour as a function of time, from the www.ics.uci.edu Web server logs during the first week of April 2002.
17
Robot / human identification
  • Robot requests identified by classifying page
    requests using a variety of heuristics
  • e.g. some robots identify themselves in the server logs (via the user-agent field, or by requesting robots.txt)
  • Robots tend to explore the entire website in a breadth-first fashion
  • Humans access web pages in a depth-first fashion
  • Tan and Kumar (2002) discuss more techniques
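A sketch of how such heuristics might be coded; the bot substrings and the breadth-first threshold are illustrative assumptions, not values from the lecture.

```python
# Heuristic robot/crawler flagging based on the cues above.
BOT_SUBSTRINGS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(session):
    """session: dict with an 'agent' string and a 'requests' list of URL paths."""
    agent = session["agent"].lower()
    if any(s in agent for s in BOT_SUBSTRINGS):   # self-identifying user agents
        return True
    if "/robots.txt" in session["requests"]:      # humans rarely fetch this
        return True
    # Breadth-first signature: many distinct top-level directories touched
    top_dirs = {r.split("/")[1] for r in session["requests"] if r.count("/") >= 1}
    return len(top_dirs) > 10                     # illustrative threshold
```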

18
Page requests, caching, and proxy servers
  • In theory, the user's browser requests a page from a Web server and the request is processed
  • In practice, there are:
  • Other users
  • Browser caching
  • Dynamic addressing in the local network
  • Proxy server caching

19
Page requests, caching, and proxy servers
A graphical summary of how page requests from an
individual user can be masked at various stages
between the user's local computer and the Web
server.
20
Page requests, caching, and proxy servers
  • Web server logs therefore do not give a complete and faithful representation of individual page views
  • There are heuristics to try to infer the true actions of the user:
  • Path completion (Cooley et al. 1999)
  • e.g. if the link B -> F is known to exist and C -> F is not, then the session ABCF can be interpreted as ABCBF (see the sketch below)
  • Anderson et al. (2001) give more heuristics
  • In the general case, it is hard to know what the user actually viewed
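A hypothetical sketch of the path-completion idea for the B -> F example above; `complete_path` and `LINKS` are illustrative names, and the site's link graph is assumed to be known (e.g. from a crawl).

```python
# When the referrer chain breaks (no link from the current page to the
# requested page), assume the user pressed "back" to the most recent
# prior page that does link to the request.
def complete_path(session, links):
    completed = [session[0]]
    for page in session[1:]:
        if page not in links.get(completed[-1], set()):
            # backtrack until we find a page that links to `page`
            for earlier in reversed(completed[:-1]):
                completed.append(earlier)
                if page in links.get(earlier, set()):
                    break
        completed.append(page)
    return completed

LINKS = {"A": {"B"}, "B": {"C", "F"}, "C": set()}  # B->F known, C->F not
print(complete_path(list("ABCF"), LINKS))          # ['A', 'B', 'C', 'B', 'F']
```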

21
Identifying individual users from Web server logs
  • Useful to associate specific page requests to
    specific individual users
  • IP address most frequently used
  • Disadvantages
  • One IP address can belong to several users
  • Dynamic allocation of IP address
  • Better to use cookies
  • Information in the cookie can be accessed by the
    Web server to identify an individual user over
    time
  • Actions by the same user during different
    sessions can be linked together

22
Identifying individual users from Web server logs
  • Commercial websites use cookies extensively
  • 97% of users have cookies enabled permanently on their browsers
  • (source: Amazon.com, 2003)
  • However
  • There are privacy issues: needs implicit user cooperation
  • Cookies can be deleted / disabled
  • Another option is to enforce user registration
  • High reliability
  • Can discourage potential visitors

23
Sessionizing
  • Time oriented (robust)
  • E.g., by gaps between requests
  • not more than 25 minutes between successive
    requests
  • Navigation oriented (good for short sessions and
    when timestamps unreliable)
  • Referrer is previous page in session, or
  • Referrer is undefined but request within 10 secs,
    or
  • Link from previous to current page in web site
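A minimal sketch of time-oriented sessionizing with the 25-minute gap rule from the slide; it assumes requests arrive sorted by timestamp and keys sessions by IP address (a simplification, per the user-identification caveats above).

```python
from collections import defaultdict

TIMEOUT = 25 * 60  # 25-minute gap threshold, in seconds

def sessionize(requests):
    """requests: iterable of (ip, unix_time, url), assumed time-sorted.
    Returns a list of sessions, each a list of (time, url) tuples."""
    last_seen = {}
    open_sessions = defaultdict(list)
    sessions = []
    for ip, t, url in requests:
        if ip in last_seen and t - last_seen[ip] > TIMEOUT:
            sessions.append(open_sessions.pop(ip))  # close the stale session
        last_seen[ip] = t
        open_sessions[ip].append((t, url))
    sessions.extend(open_sessions.values())          # flush open sessions
    return sessions
```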

24
Client-side data
  • Advantages of collecting data at the client side
  • Direct recording of page requests (eliminates
    masking due to caching)
  • Recording of all browser-related actions by a
    user (including visits to multiple websites)
  • More-reliable identification of individual users
    (e.g. by login ID for multiple users on a single
    computer)
  • Preferred mode of data collection for studies of
    navigation behavior on the Web
  • Companies like comScore and Nielsen use
    client-side software to track home computer users

25
Client-side data
  • Statistics like time per session and page-view duration are more reliable in client-side data
  • Some limitations:
  • Some statistics, like page-view duration, still cannot be totally reliable (e.g. the user might leave to fetch coffee)
  • Need explicit user cooperation
  • Typically recorded on home computers, so may not reflect a complete picture of Web browsing behavior
  • Web surfing data can also be collected at intermediate points like ISPs and proxy servers
  • Can be used to create user profiles and targeted advertising

26
Early studies from 1995 to 1997
  • Earliest studies on client-side data are Catledge
    and Pitkow (1995) and Tauscher and Greenberg
    (1997)
  • In both studies, data was collected by logging
    Web browser commands
  • Population consisted of faculty, staff and
    students
  • Both studies found:
  • clicking on hypertext anchors was the most common action
  • using the back button was the second most common action

27
Early studies from 1995 to 1997
  • high probability of page revisitation (0.58-0.61)
  • a lower bound, because page requests prior to the start of the studies are not accounted for
  • Humans are creatures of habit?
  • Content of the pages changed over time?
  • strong recency effect (a revisited page is usually one that was visited in the recent past)
  • Correlates with back-button usage
  • Similar repetitive actions are found in telephone number dialing, etc.

28
The Cockburn and McKenzie study from 2002
  • Previous studies are relatively old
  • Web has changed dramatically in the past few
    years
  • Cockburn and McKenzie (2002) provides a more
    up-to-date analysis
  • Analyzed the daily history.dat files produced by
    the Netscape browser for 17 users for about 4
    months
  • Population studied consisted of faculty, staff
    and graduate students
  • The study found revisitation rates higher than the earlier '94 and '95 studies (0.81)
  • Time-window is three times that of the past studies

29
The Cockburn and McKenzie study from 2002
  • Revisitation rate less biased than the previous
    studies?
  • Human behavior changed from an exploratory mode
    to a utilitarian mode?
  • The more pages a user visits, the more requests for new pages they make
  • The most frequently requested page for each user
    can account for a relatively large fraction of
    his/her page requests
  • Useful to see the scatter plot of the distinct
    number of pages requested per user versus the
    total pages requested

30
The Cockburn and McKenzie study from 2002
The total number of pages requested versus the page vocabulary size (number of distinct pages) for each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot)
31
The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent page divided by the total number of page requests, for the 17 users in the Cockburn and McKenzie (2002) study
32
Outline
  • Basic concepts in Web log data analysis
  • Predictive modeling of Web navigation behavior
  • Markov modeling methods
  • Analyzing search engine data
  • E-commerce aspects of Web log mining

33
Markov models for page prediction
  • General approach is to use a finite-state Markov
    chain
  • Each state can be a specific Web page or a
    category of Web pages
  • If we are only interested in the order of visits (and not in time), each new request can be modeled as a state transition
  • Issues
  • Self-transition
  • Time-independence

34
Markov models for page prediction
  • For simplicity, consider order-dependent,
    time-independent finite-state Markov chain with M
    states
  • Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C; s_t is the state at position t (1 ≤ t ≤ L). In general, P(s_t | s_1, ..., s_{t-1}) can depend on the entire history
  • first-order Markov assumption: P(s_t | s_1, ..., s_{t-1}) = P(s_t | s_{t-1})
  • This provides a simple generative model to
    produce sequential data

35
Markov models for page prediction
  • If we denote T_ij = P(s_t = j | s_{t-1} = i), we can define an M x M transition matrix
  • Properties:
  • Strong first-order assumption
  • Simple way to capture sequential dependence
  • If each page is a state and there are W pages, the matrix is O(W^2); W can be of the order of 10^5 to 10^6 for a university CS department
  • To alleviate, we can cluster W pages into M
    clusters, each assigned a state in the Markov
    model
  • Clustering can be done manually, based on
    directory structure on the Web server, or
    automatic clustering using clustering techniques
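A minimal sketch of maximum-likelihood estimation of the M x M matrix from observed state sequences (smoothing is discussed on a later slide); the function name is my own.

```python
import numpy as np

def transition_matrix(sequences, M):
    """ML estimate of an M x M transition matrix from state sequences
    (lists of ints in 0..M-1). No smoothing: unseen transitions get 0."""
    counts = np.zeros((M, M))
    for seq in sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)           # avoid 0/0 for unseen states

# s = ABBCAABBCCBBAA with A=0, B=1, C=2
T = transition_matrix([[0, 1, 1, 2, 0, 0, 1, 1, 2, 2, 1, 1, 0, 0]], M=3)
print(T.round(2))
```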

36
Markov models for page prediction
  • T_ij = P(s_t = j | s_{t-1} = i) represents the probability that an individual user's next request will be from category j, given the current request is from category i
  • We can add an end state E to the model
  • E.g. for three categories, the matrix gains an extra row and column for E
  • E denotes the end of a sequence, and the start of a new sequence

37
Markov models for page prediction
  • A first-order Markov model assumes that the next state depends only on the current state
  • Limitations:
  • Doesn't capture longer-term memory
  • We can try to capture more memory with a kth-order Markov chain
  • Limitations:
  • Requires an inordinate amount of training data: O(M^(k+1)) parameters

38
Parameter estimation for Markov model transitions
  • Smoothed parameter estimates of the transition probabilities are T_ij = (n_ij + α_ij) / (Σ_k n_ik + Σ_k α_ik), where n_ij is the observed number of i -> j transitions and the α_ij are prior counts
  • If n_ij = 0 for some transition (i, j), then instead of having a parameter estimate of 0 (the ML estimate), we get α_ij / (Σ_k n_ik + Σ_k α_ik), allowing prior knowledge to be incorporated
  • If n_ij > 0, we get a smooth combination of the data-driven information (n_ij) and the prior
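A sketch of these smoothed estimates in code, assuming a matrix of Dirichlet-style prior counts `alpha` (the name is mine).

```python
import numpy as np

def smoothed_transition_matrix(counts, alpha):
    """MAP-style estimates T_ij = (n_ij + a_ij) / (n_i + a_i) with a
    Dirichlet-type prior count matrix `alpha` of the same shape."""
    num = np.asarray(counts, dtype=float) + alpha
    return num / num.sum(axis=1, keepdims=True)

counts = np.array([[5, 0], [2, 3]])   # n_12 = 0: ML alone would give T_12 = 0
alpha = np.full((2, 2), 0.5)          # uniform prior counts
print(smoothed_transition_matrix(counts, alpha))
```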

39
Parameter estimation for Markov models
  • One simple way to set the prior parameters:
  • Consider α as the effective sample size
  • Partition the states into two sets: set 1 containing all states directly linked from state i, and the remaining states in set 2
  • Assign uniform probability r/K to each of the K states in set 2 (all set-2 states are equally likely)
  • The remaining (1 - r) probability mass can be either uniformly assigned among set-1 elements or weighted by some measure
  • Prior probabilities in and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state

40
Predicting page requests with Markov models
  • Deshpande and Karypis (2001) propose schemes to
    prune kth-order Markov state space
  • Provide systematic but modest improvements
  • Another way is to use empirical smoothing
    techniques that combine different models from
    order 1 to order k (Chen and Goodman 1996)

41
Mixtures of Markov Chains
  • Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order Markov chain P(s_t | s_{t-1})
  • with a mixture of first-order Markov chains: P(s_t | s_{t-1}) = Σ_k P(s_t | s_{t-1}, c = k) P(c = k)
  • where c is a discrete-valued hidden variable taking K values, Σ_k P(c = k) = 1, and
  • P(s_t | s_{t-1}, c = k) is the transition matrix for the kth mixture component
  • One interpretation of this: user behavior consists of K different navigation behaviors, described by the K Markov chains

42
Modeling Web Page Requests with Markov chain
mixtures
  • MSNBC Web logs
  • 2 million individuals per day
  • different session lengths per individual
  • difficult visualization and clustering problem
  • WebCanvas
  • uses mixtures of Markov chains to cluster
    individuals based on their observed sequences
  • software tool: EM mixture modeling + visualization

43
(No Transcript)
44
From Web logs to sequences
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -
[Figure: the log entries above are converted into per-user sequences of page-category indices for Users 1-5, e.g. one user's session becomes the sequence 3 3 3 3 1 3 1 1 1 3 3 3 2 2 3 2.]
45
Clusters of Finite State Machines
[Figure: three clusters, each drawn as a stochastic finite-state machine over the same page states A, B, D, E but with different transition structures.]
46
Learning Problem
  • Assumptions:
  • data is generated by K different groups
  • each group is described by a stochastic finite-state machine (SFSM)
  • i.e., a Markov model with an end state
  • Given:
  • a set of sequences of different lengths from different users
  • Learn:
  • a mixture of K different stochastic finite-state machines
  • Solution:
  • EM is very easy: fractional counts of transitions
  • efficient and accurate; scales as O(KN)
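A compact sketch of this EM procedure, assuming sequences are lists of integer state indices; for brevity it omits the end state and per-component initial-state distributions, so it is a simplified version of the model described above.

```python
import numpy as np

def em_markov_mixture(seqs, M, K, iters=50, seed=0):
    """EM for a mixture of K first-order Markov chains over M states."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                     # mixture weights P(c = k)
    T = rng.dirichlet(np.ones(M), size=(K, M))   # T[k, i, :] = P(s_t | s_{t-1}=i, c=k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sequence
        R = np.zeros((len(seqs), K))
        for n, s in enumerate(seqs):
            logp = np.log(pi).copy()
            for i, j in zip(s[:-1], s[1:]):
                logp += np.log(T[:, i, j])
            R[n] = np.exp(logp - logp.max())     # stable unnormalized posteriors
            R[n] /= R[n].sum()
        # M-step: fractional transition counts, lightly smoothed
        C = np.zeros((K, M, M))
        for n, s in enumerate(seqs):
            for i, j in zip(s[:-1], s[1:]):
                C[:, i, j] += R[n]
        T = (C + 1e-3) / (C + 1e-3).sum(axis=2, keepdims=True)
        pi = R.mean(axis=0)
    return pi, T
```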

47
Experimental Methodology
  • Model Training
  • fit 2 types of models
  • mixtures of histograms
  • mixtures of finite state machines
  • Train on a full day's worth of MSNBC Web data
  • Model Evaluation
  • one-step-ahead prediction on unseen test data
  • Test sequences from a different day of Web logs
  • negative log probability (predictive entropy)

48
(No Transcript)
49
(No Transcript)
50
Timing Results
51
WebCanvas
  • Software tool for Web log visualization
  • uses Markov mixtures to cluster data for display
  • in use by msnbc.com administrators at Microsoft
  • also being applied to non-Web data
  • Model-based visualization
  • random sample of actual sequences
  • interactive tiled windows displayed for
    visualization
  • more effective than
  • planar graphs
  • traffic-flow movie in Microsoft Site Server v3.0

52
WebCanvas (Cadez, Heckerman, et al., 2003)
53
Insights from WebCanvas
  • From msnbc.com site administrators:
  • significant heterogeneity of behavior
  • relatively focused activity of many users
  • typically only 1 or 2 categories of pages
  • many individuals not entering via main page
  • detected problems with the weather page
  • missing transitions (e.g., tech <-> business)

54
Extensions
  • Adding time-dependence
  • adding time-between clicks, time of day effects
  • Uncategorized Web pages
  • coupling page content with sequence models
  • Modeling switching behaviors
  • allowing users to switch between models
  • Individualized weights (hierarchical Bayes)
  • Update: WebCanvas tool will be part of the 2004 SQL Server release

55
Prediction with Markov mixtures
P(s_{t+1} | s_{1:t}) = ?
56
Prediction with Markov mixtures
P(s_{t+1} | s_{1:t}) = Σ_k P(s_{t+1}, k | s_{1:t})
                     = Σ_k P(s_{t+1} | k, s_{1:t}) P(k | s_{1:t})
57
Prediction with Markov mixtures
P(s_{t+1} | s_{1:t}) = Σ_k P(s_{t+1}, k | s_{1:t})
                     = Σ_k P(s_{t+1} | k, s_{1:t}) P(k | s_{1:t})
                     = Σ_k P(s_{t+1} | k, s_t) P(k | s_{1:t})
where P(s_{t+1} | k, s_t) is the prediction of the kth component and P(k | s_{1:t}) is the membership weight, based on the sequence history.
=> Predictions are a convex combination of the K different component transition matrices, with weights based on sequence history
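A sketch of this predictive computation, reusing the `pi` and `T` arrays produced by the EM sketch earlier (names are mine).

```python
import numpy as np

def predict_next(seq, pi, T):
    """P(s_{t+1} | s_{1:t}) under the mixture: a convex combination of
    the K component predictions, weighted by membership P(k | history)."""
    logw = np.log(pi).copy()
    for i, j in zip(seq[:-1], seq[1:]):         # P(k | s_{1:t}) up to a constant
        logw += np.log(T[:, i, j])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ T[:, seq[-1], :]                 # sum_k P(k|hist) P(s_{t+1}|s_t, k)
```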
58
Related Work
  • Mixtures of Markov chains
  • special case: Poulsen (1990)
  • general case: Ridgeway (1997), Smyth (1997)
  • Clustering of Web page sequences
  • non-probabilistic approaches (Fu et al, 1999)
  • Markov models for prediction
  • Anderson et al (IJCAI, 2001)
  • mixtures of Markov outperform other sequential
    models for page-request prediction

59
Predicting page requests with Markov models
  • K can be chosen by evaluating the out-of-sample
    predictive performance based on
  • Accuracy of prediction
  • Log probability score
  • Entropy
  • Other variations
  • Sen and Hansen 2003
  • Position-dependent Markov models (Anderson et al.
    2001, 2002)
  • Zukerman et al. 1999

60
Modeling Clickrate Data
  • Data
  • 200k Alexa users, client-side, over 24 hours
  • ignore URLs requested
  • goal is to build a time-series model that
    characterizes user click rates

61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
Markov-Poisson Model
  • Doubly stochastic process
  • Locally constant Poisson rate
  • indexed by M Markov states
  • Fit a model with M = 3 states:
  • absence of a Web session
  • Web session with slow click rate (~1-minute inter-click times)
  • Web session with rapid click rate (~10-second inter-click times)
  • Used hierarchical Bayes on individuals

66
Hierarchical Bayes Model
[Figure: hierarchical Bayes model. A population prior p(λ | θ) sits above the individual rate parameters λ_1, ..., λ_i, ..., λ_N; each individual's λ_i generates that individual's observed data sets D_1, D_2, ...]
Individuals with little data get shrunk toward the prior; individuals with a lot of data are more data-driven
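The slide's model is a Markov-modulated Poisson process with a hierarchical prior; purely as a simplified worked example of the shrinkage idea, here is conjugate Gamma-Poisson updating (the Gamma prior and its values are my assumptions).

```python
# With a Gamma(a, b) prior on an individual's Poisson rate lambda and
# observed counts x_1..x_n, the posterior mean is (a + sum x) / (b + n):
# few observations -> close to the prior mean a/b; many -> close to the
# sample mean.
a, b = 2.0, 1.0                        # prior: mean rate a/b = 2 clicks/interval
for counts in ([5], [5] * 50):         # little data vs. a lot of data
    post_mean = (a + sum(counts)) / (b + len(counts))
    print(len(counts), "obs -> posterior mean rate", round(post_mean, 2))
# 1 obs -> 3.5 (shrunk toward the prior); 50 obs -> 4.94 (data-driven)
```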
67
(No Transcript)
68
Prediction with Hierarchical Bayes
[Figure: the population prior p(λ | θ), learned from the historical training data of individuals 1..N, is combined with a new individual's first few clicks to infer that individual's rate λ.]
69
Extensions
  • Couple click-rate with purchase behavior
  • Markov-type model through different states
  • product viewing
  • detailed product information
  • reviews
  • combine states with click rate and page content
  • to predict P(purchase | data up to time t)
  • Can these Bayesian techniques be scaled up??

70
Outline
  • Basic concepts in Web log data analysis
  • Predictive modeling of Web navigation behavior
  • Markov modeling methods
  • Analyzing search engine data
  • E-commerce aspects of Web log mining

71
Analysis of Search Engine Query Logs
Study | # of Sample Queries | Source (SE) | Time Period
Lau & Horvitz | 4,690 of 1 million | Excite | Sep 1997
Silverstein et al. | ~1 billion | AltaVista | 6 weeks in Aug-Sep 1998
Spink et al. (series of studies) | 1 million for each time period | Excite | Sep 1997, Dec 1999, May 2001
Xie & O'Hallaron | 110,000 | Vivisimo | 35 days, Jan-Feb 2001
Xie & O'Hallaron | 1.9 million | Excite | 8 hrs in a day, Dec 1999
72
Main Results
  • The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
  • The most common number of terms in a query is 2
  • The majority of users don't refine their query
  • The number of users who viewed only a single page of results increased from 29% (1997) to 51% (2001) (Excite)
  • 85% of users viewed only the first page of search results (AltaVista)
  • 45% (2001) of queries are about Commerce, Travel, Economy, or People (was 20% in 1997)
  • Queries about adult content or entertainment decreased from about 20% (1997) to around 7% (2001)

73
Xie and O'Hallaron Study (2002)
Query length distributions (bars) and Poisson model (dots and lines)
  • All four studies produced a generally consistent set of findings about user behavior in a search-engine context:
  • most users view relatively few pages per query
  • most users don't use advanced search features
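A sketch of the kind of Poisson fit shown here: treating query length as 1 + Poisson (the shift is my assumption, since lengths start at 1), the ML rate is simply the sample mean; the data below is synthetic, not Excite's.

```python
import math

lengths = [1, 2, 2, 2, 3, 3, 4, 2, 1, 5]          # synthetic query lengths
lam = sum(L - 1 for L in lengths) / len(lengths)  # ML rate for the shifted model

def pmf(L):
    k = L - 1                                     # shifted Poisson support
    return math.exp(-lam) * lam**k / math.factorial(k)

for L in range(1, 6):
    print(L, round(pmf(L), 3))                    # modeled P(length = L)
```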

74
Power-law Characteristics of Common Queries
Power law in log-log space
  • Frequency f(r) of queries with rank r
  • 110,000 queries from Vivisimo
  • 1.9 million queries from Excite
  • There are strong regularities in terms of
    patterns of behavior in how we search the Web

75
Outline
  • Basic concepts in Web log data analysis
  • Predictive modeling of Web navigation behavior
  • Markov modeling methods
  • Analyzing search engine data
  • E-commerce aspects of Web log mining

76
The next few slides are from Ronny Kohavi, director of data mining and personalization at Amazon.com. His full set of slides is available online; see the PPT slides and related papers on e-commerce and data mining at http://robotics.stanford.edu/~ronnyk/ronnyk-bib.html
77
E-Commerce
  • Page request Web logs combined with
  • Purchase (market-basket) information
  • User address information (if they make a
    purchase)
  • Demographics information (can be purchased)
  • Emails to/from the customer
  • The main focus here is to increase revenue
  • Data mining is widely used at online commerce companies like Amazon
  • This is a very rich source of problems for data
    mining
  • What products should we advertise to this person?
  • Can we do dynamic pricing?
  • If a person buys X should we also suggest Y?
  • Who are our best customers?
  • etc

78
Combining Data Sources
  • Comprehensive collection of US consumer and
    telephone data available via the internet
  • Multi-sourced database
  • Demographic, socioeconomic, and lifestyle
    information.
  • Information on most U.S. households
  • Contributors' files refreshed a minimum of 3-12 times per year.
  • Data sources include County Real Estate Property
    Records, U.S. Telephone Directories, Public
    Information, Motor Vehicle Registrations, Census
    Directories, Credit Grantors, Public Records and
Consumer Data, Driver's Licenses, Voter
    Registrations, Product Registration
    Questionnaires, Catalogers, Magazines, Specialty
    Retailers, Packaged Goods Manufacturers, Accounts
    Receivable Files, Warranty Cards
  • Much of this data can be accessed in real-time
    once a customer self-identifies

79
Map of World Wide Revenue
Although Debenhams' online site only ships within the UK, we see some revenue from the rest of the world.
UK: 98.8%
US: 0.6%
Australia: 0.1%
(Map legend: Low / Medium / High revenue)
NOTE: About 50% of the non-UK orders are wedding-list purchases
list purchases
80
Online Consumer Demographics
  • Results from Blue Martini:
  • People who have a Travel and Entertainment credit card are 48% more likely to be online shoppers (27% for people with a premium credit card)
  • People whose home was built after 1990 are 45% more likely to be online shoppers
  • Households with income over $100K are 31% more likely to be online shoppers
  • People under the age of 45 are 17% more likely to be online shoppers

81
Demographics - Income
  • A higher household income means you are more
    likely to be an online shopper

82
Demographics Credit Cards
  • The more credit cards, the more likely you are to
    be an online shopper

83
Example: Web Traffic
[Figure: example Web traffic over time. Annotations: Sept-11 shows a significant drop in human traffic but not bot traffic; regular dips on weekends; spikes from an internal performance bot; and increases following registration at search engine sites.]
84
Product Affinities at MEC
Product | Association | Lift | Confidence | Website Recommended Products
Orbit Sleeping Pad | Orbit Stuff Sack | 222 | 37% | Cygnet Sleeping Bag, Primus Stove, Aladdin 2 Backpack
Bambini Crewneck Sweater (Children's) | Bambini Tights (Children's) | 195 | 52% | Yeti Crew Neck Pullover (Children's), Beneficial Ts Organic Long Sleeve T-Shirt (Kids)
Silk Long Johns (Women's) | Silk Crew (Women's) | 304 | 73% | Micro Check Vee Sweater, Volant Pants, Composite Jacket
Cascade Entrant Overmitts | Polartec 300 Double Mitts | 51 | 48% | Windstopper Alpine Hat, Volant Pants, Tremblant 575 Vest (Women's)
  • Minimum support for the associations is 80 customers
  • Confidence: 37% of the people who purchased the Orbit Sleeping Pad also purchased the Orbit Stuff Sack
  • Lift: people who purchased the Orbit Sleeping Pad were 222 times more likely to purchase the Orbit Stuff Sack than the general population

85
Customer Locations Relative to Retail Stores
Heavy purchasing areas away from retail stores can suggest new retail store locations.
No stores in several hot areas; MEC is building a store in Montreal right now.
[Map of Canada with store locations.]
86
Building The Customer Signature
  • Building a customer signature is a significant effort, but well worth it
  • A signature summarizes customer or visitor behavior across hundreds of attributes, many of which are specific to the site
  • Once a signature is built, it can be used to
    answer many questions.
  • The mining algorithms will pick the most
    important attributes for each question
  • Example attributes computed
  • Total Visits and Sales
  • Revenue by Product Family
  • Revenue by Month
  • Customer State and Country
  • Recency, Frequency, Monetary
  • Latitude/Longitude from the Customer's Postal Code