Mining ECommerce Data: The Good, the Bad - PowerPoint PPT Presentation

1
Mining E-Commerce Data: The Good, the Bad and the
Ugly
  • Paper written by Ron Kohavi

Presented by Arie E. Gözlüklü
2
Business Models
  • Brick and Mortar
  • Brick and Click
  • Click and Click
E-Commerce Models
  • Business to business (B2B)
  • Business to consumers (B2C)
  • Consumer to consumer (C2C)
3
Contribution of Data Mining to E-Commerce
  • The insight gained through data-mining
    transactional and clickstream data can be used
    for
  • Site design
  • Customer loyalty
  • Personalization Strategies
  • Profitability

4
Use of Web Sites
  • Supporting online transactions
  • Information about products and services
  • Early alert system for emerging patterns
  • Warnings about site offerings
  • Viewing of buying patterns
  • Ads can be tested
  • Target markets can be identified

5
Clickstream
  • A virtual trail that a user leaves behind while
    surfing the Internet.
  • A clickstream is a record of a user's activity on
    the Internet
  • Every Web site and every page of every Web site
    that the user visits
  • how long the user was on a page or site
  • in what order the pages were visited
  • any newsgroups that the user participates in
  • The e-mail addresses of mail that the user sends
    and receives.
  • Both ISPs and individual Web sites are capable of
    tracking a user's clickstream.
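As a sketch of what a clickstream yields, the time a user spends on each page can be derived from the gaps between consecutive clicks; the timestamps and URLs below are made up for illustration:

```python
from datetime import datetime

# A toy clickstream: (timestamp, URL) pairs in visit order.
clicks = [
    (datetime(2000, 10, 10, 13, 0, 0), "/home"),
    (datetime(2000, 10, 10, 13, 0, 40), "/products"),
    (datetime(2000, 10, 10, 13, 2, 10), "/cart"),
]

# Time on each page = gap until the next click
# (unknowable for the last page visited).
for (t, url), (t_next, _) in zip(clicks, clicks[1:]):
    print(f"{url}: {(t_next - t).seconds}s")
# /home: 40s
# /products: 90s
```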

6
Cookie
  • A message given to a Web browser by a Web server.
  • The browser stores the message in a text file.
  • The message is then sent back to the server each
    time the browser requests a page from the server.
  • The main purpose of cookies is to identify users
    and possibly prepare customized Web pages.
  • The name cookie derives from UNIX objects called
    magic cookies. These are tokens that are attached
    to a user or program and change depending on the
    areas entered by the user or program.

7
Organization of the Paper
  • E-commerce as Killer Domain(The Good)
  • Web Server Logs(The Bad)
  • Alternative to Web Server Logs
  • Challenging Open Problems (The Ugly)
  • Lessons from mining real e-commerce Data

8
The Good: E-commerce is a Great Domain for Data
Mining
  • For data mining to succeed you need
  • Large amounts of data
  • Yahoo serves 1 billion page views a day, so the
    log files alone require around 10GB an hour
  • Wide records with many attributes
  • With proper design, it's possible
  • Clean data
  • Direct electronic collection provides superior
    quality

9
The Good: E-commerce is a Great Domain for Data
Mining
  • Actionable Domain
  • Through web site changes (layout, design,
    cross-sells, up-sells, personalization)
  • Measurable Return on Investment (ROI)
  • Clickstreams and events translated to incremental
    dollars
  • Controlled experiments
  • Immediate evaluation

10
Conversion Process
  • Assume there is an e-mail campaign
  • Opening e-mail
  • Clicking through
  • Browsing products
  • Adding products to shopping cart
  • Initiating check-out
  • Purchasing
  • Aim: increasing conversion rates
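The funnel above can be turned into step-by-step conversion rates; all counts below are hypothetical:

```python
# Hypothetical counts for each step of the e-mail campaign funnel.
funnel = [
    ("E-mails opened", 10000),
    ("Clicked through", 2500),
    ("Browsed products", 1800),
    ("Added to cart", 600),
    ("Initiated check-out", 300),
    ("Purchased", 200),
]

# Conversion rate of each step relative to the previous one —
# the quantity an e-mail campaign tries to increase.
for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    print(f"{prev_name} -> {name}: {n / prev_n:.1%}")

overall = funnel[-1][1] / funnel[0][1]
print(f"Overall conversion: {overall:.1%}")  # Overall conversion: 2.0%
```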

11
The Bad: Web Server Logs
Definition Web servers can generate logs in
Common Log Format (CLF) or Extended Common Log
Format (ECLF), which detail the interactions with
clients, typically web browsers.
12
The Bad: Web Server Logs
  • Logs contain the following fields
  • Remote (client) host
  • Remote logname (client identity information)
  • Username for authentication
  • Date and time of request
  • Request
  • HTTP status code
  • Number of bytes transferred
  • The referring server URL
  • The user agent (name and version of client)
  • Most web servers support options to log
    additional fields, such as
  • Cookie
  • Performance-related data
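A sketch of what extracting those fields looks like in practice, using a regular expression over one log line in the combined variant of CLF (the sample line follows the format documented for Apache):

```python
import re

# One log line: host, logname, user, timestamp, request, status,
# bytes, then referrer and user agent as two extra quoted fields.
line = ('192.0.2.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')

pattern = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

m = pattern.match(line)
print(m.group("host"))    # 192.0.2.1
print(m.group("status"))  # 200
print(m.group("agent"))   # Mozilla/4.08
```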

13
The Bad: Web Server Logs
  • Web server logs are often treated as the primary
    or sole source of data, but they create major
    hurdles when additional information needs to be
    collected.
  • This is because web server logs were designed to
    debug web servers, not for data mining!

14
Problems with Web Server Logs
  • They do not identify sessions or users.
  • Combining the order data and other transactional
    data with web server logs is a complex
    extract-transform-load (ETL) process.
  • Web server logs lack critical events such as add
    to cart, delete item, or change quantity.
  • Web server logs do not store web form
    information.
  • Web server logs contain URLs, not the semantic
    information of what the URLs contain.

15
Problems with Web Server Logs
  • Web server logs lack information for modern sites
    that generate dynamic content.
  • Web server logs are flat files on multiple file
    systems, possibly in different time zones.
  • Web server logs contain redundant information
    (90% of web server logs is commonly pruned).
  • Web server logs lack important information that
    can be collected using other means (user local
    time, screen size).

16
THE ALTERNATIVE TO WEB SERVER LOGS: APPLICATION
SERVER LOGGING
  • Web server logs do not identify sessions or
    users.
  • The application server controls sessions, user
    registration, login, and logout. These can be
    directly logged and no sessionization heuristics
    are needed.
  • Web server logs lack critical events.
  • The application layer must know about events such
    as add to cart and can log these. In addition,
    it can log specific interesting events, such as a
    browser reset.
  • Web server logs do not store web form
    information.
  • The application layer processes form submissions,
    so the values users enter (for example, search
    terms) can be logged directly.
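For contrast, this is the kind of sessionization heuristic that mining raw web server logs requires and that application-server logging avoids; the 30-minute inactivity timeout is a commonly used cutoff, and the timestamps are made up:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # common heuristic cutoff

def sessionize(hits):
    """Split one visitor's time-ordered hits into sessions:
    a gap longer than TIMEOUT starts a new session."""
    sessions, current = [], []
    for t in hits:
        if current and t - current[-1] > TIMEOUT:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

hits = [datetime(2000, 10, 10, 13, 0),
        datetime(2000, 10, 10, 13, 10),
        datetime(2000, 10, 10, 14, 5)]   # 55-minute gap -> new session
print(len(sessionize(hits)))  # 2
```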

17
THE ALTERNATIVE TO WEB SERVER LOGS: APPLICATION
SERVER LOGGING
  • Web server logs contain URLs, not the semantic
    information of what the URLs contain.
  • At the application layer of a dynamic site,
    significant semantic information is available
    about the content of the page being displayed.
  • Web server logs lack information for modern sites
    that generate dynamic content.
  • Clearly the URL itself becomes less important
    when logging information at the application
    server layer.
  • Web server logs are flat files on multiple file
    systems, possibly in different time zones.
  • The application server logs can be generated
    directly into the database, so that transaction
    level integrity holds.
  • Web server logs contain redundant information.
  • Redundancy is trivially eliminated when the
    application server controls logging.
  • Web server logs lack important information that
    can be collected using other means.
  • Any information that can be collected can also be
    logged into the same database with the
    appropriate keys.
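A minimal sketch of application-level event logging into a database, assuming an in-memory SQLite store and invented event names; the point is that the application layer records semantic events ("add to cart"), not raw URLs:

```python
import sqlite3

# In-memory database standing in for the site's transactional store.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    session_id TEXT,
    event      TEXT,
    detail     TEXT,
    ts         TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")

def log_event(session_id, event, detail=""):
    # The application layer knows what each action means, so it
    # logs the event by name, keyed to the session for later joins.
    with db:  # commit on success
        db.execute(
            "INSERT INTO events (session_id, event, detail) "
            "VALUES (?, ?, ?)",
            (session_id, event, detail))

log_event("s42", "add_to_cart", "sku=1234")
log_event("s42", "checkout_start")
rows = db.execute("SELECT event FROM events WHERE session_id = ?",
                  ("s42",)).fetchall()
print([r[0] for r in rows])  # ['add_to_cart', 'checkout_start']
```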

18
THE UGLY: OPEN ISSUES
  • Crawler/bot/spider/robot identification.
  • Bots and crawlers skew statistics about the
    number of sessions, page hits, and exit pages
  • Data transformations
  • About 80% of the time to complete an analysis is
    spent in data transformations
  • Taking action and operationalizing the findings
    is not easy
  • Scalability of data mining algorithms
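A crude first pass at the crawler/bot identification problem is a user-agent substring filter; the marker list and sample sessions below are illustrative only, and real systems also use robots.txt hits, request rates, and known IP ranges:

```python
# Minimal user-agent based bot filter (illustrative marker list).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_bot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

sessions = [
    {"agent": "Mozilla/4.08 (Win98)"},
    {"agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
]

# Pruning bot sessions keeps them from skewing session,
# page-hit, and exit-page statistics.
human = [s for s in sessions if not is_bot(s["agent"])]
print(len(human))  # 1
```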

19
LESSONS AND STATISTICS
  • Spend the time to identify crawlers and bots
  • Buyers and Browsers have very different browsing
    patterns
  • Half the sessions are shorter than a minute with
    about a third of sessions never going past the
    home page
  • Only a small percentage of visitors use search
    (6%), but those that search buy more
  • A significant share (15-50%) of shopping carts
    are abandoned
  • Never provide defaults in forms if you want
    unbiased answers
  • Nobody reads the privacy policy

20
End of Presentation
  • Thank you for your attention