Title: Mining E-Commerce Data: The Good, the Bad, and the Ugly
1. Mining E-Commerce Data: The Good, the Bad, and the Ugly
- Paper written by Ron Kohavi
Presented by Arie E. Gözlüklü
2. Business Models
- Brick and Click
- Click and Click (pure online)
- Brick and Mortar
E-Commerce Models:
- Business to Business (B2B)
- Business to Consumers (B2C)
- Consumer to Consumer (C2C)
3. Contribution of Data Mining to E-Commerce
- The insight gained through data-mining transactional and clickstream data can be used for:
- Site design
- Customer loyalty
- Personalization strategies
- Profitability
4. Uses of Web Sites
- Supporting online transactions
- Information about products and services
- Early alert system for emerging patterns
- Warnings about site offerings
- Viewing of buying patterns
- Ads can be tested
- Target markets can be identified
5. Clickstream
- A virtual trail that a user leaves behind while surfing the Internet.
- A clickstream is a record of a user's activity on the Internet:
- Every Web site and every page of every Web site that the user visits
- How long the user was on a page or site
- In what order the pages were visited
- Any newsgroups that the user participates in
- The e-mail addresses of mail that the user sends and receives
- Both ISPs and individual Web sites are capable of tracking a user's clickstream.
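To make the idea concrete, a clickstream can be modeled as an ordered list of page-view events; a minimal sketch (with hypothetical pages and timestamps, not from the paper) shows how time-on-page and visit order fall out of the record:

```python
from datetime import datetime

# Hypothetical clickstream for one visitor: (page, request time) in order.
clickstream = [
    ("/home",        datetime(2001, 5, 1, 10, 0, 0)),
    ("/products",    datetime(2001, 5, 1, 10, 0, 45)),
    ("/products/42", datetime(2001, 5, 1, 10, 2, 10)),
]

# Time on a page = gap until the next request (unknown for the last page).
for (page, t), (_, t_next) in zip(clickstream, clickstream[1:]):
    print(f"{page}: {(t_next - t).seconds} seconds")
```

The visit order is simply the list order; the last page's dwell time cannot be measured this way, which is one reason exit-page statistics are tricky.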
6. Cookie
- A message given to a Web browser by a Web server.
- The browser stores the message in a text file.
- The message is then sent back to the server each
time the browser requests a page from the server.
- The main purpose of cookies is to identify users and possibly prepare customized Web pages.
- The name "cookie" derives from UNIX objects called magic cookies: tokens that are attached to a user or program and change depending on the areas entered by the user or program.
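The round trip described above (server sets a cookie, browser stores it and sends it back) can be sketched with Python's standard `http.cookies` module; the `visitor_id` name and value are hypothetical:

```python
from http.cookies import SimpleCookie

# Server side: issue a cookie in the HTTP response (hypothetical visitor id).
response_cookie = SimpleCookie()
response_cookie["visitor_id"] = "abc123"
print(response_cookie.output(header="Set-Cookie:"))  # Set-Cookie: visitor_id=abc123

# Client side: the browser stores the text and echoes it back on each request.
request_cookie = SimpleCookie()
request_cookie.load("visitor_id=abc123")
print(request_cookie["visitor_id"].value)  # abc123
```

A stable visitor id like this is what lets a site tie many requests back to one user, which is exactly the identification role the slide describes.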
7. Organization of the Paper
- E-Commerce as a Killer Domain (The Good)
- Web Server Logs (The Bad)
- The Alternative to Web Server Logs
- Challenging Open Problems (The Ugly)
- Lessons from Mining Real E-Commerce Data
8. The Good: E-Commerce is a Great Domain for Data Mining
- For data mining to succeed, you need:
- Large amounts of data
- Yahoo serves 1 billion page views a day, so the log files alone require around 10 GB an hour
- Wide records with many attributes
- With proper design, it's possible
- Clean data
- Direct electronic collection provides superior quality
9. The Good: E-Commerce is a Great Domain for Data Mining
- Actionable domain
- Through web site changes (layout, design, cross-sells, up-sells, personalization)
- Measurable return on investment (ROI)
- Clickstreams and events translated to incremental dollars
- Controlled experiments
- Immediate evaluation
10. Conversion Process
- Assume there is an e-mail campaign
- Opening e-mail
- Clicking through
- Browsing products
- Adding products to shopping cart
- Initiating check-out
- Purchasing
- Aim: increasing conversion rates
11. The Bad: Web Server Logs
Definition: Web servers can generate logs in Common Log Format (CLF) or Extended Common Log Format (ECLF), which detail the interactions with clients, typically web browsers.
12. The Bad: Web Server Logs
- Logs contain the following fields:
- Remote (client) host
- Remote logname (client identity information)
- Username for authentication
- Date and time of request
- Request
- HTTP status code
- Number of bytes transferred
- The referring server URL
- The user agent (name and version of client)
- Most web servers support options to log additional fields, such as:
- Cookie
- Performance-related data
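The fields listed above appear in a fixed order on each log line, so they can be pulled out with a regular expression. A sketch for one line in the extended format (the sample line itself is fabricated for illustration):

```python
import re

# One hypothetical line in extended CLF (CLF plus referrer and user agent).
line = ('192.0.2.1 - alice [10/Oct/2001:13:55:36 -0700] '
        '"GET /products/42 HTTP/1.0" 200 2326 '
        '"http://example.com/home" "Mozilla/4.0"')

# host, ident, user, [timestamp], "request", status, bytes, "referrer", "agent"
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?')

m = CLF.match(line)
print(m.group("host"), m.group("status"), m.group("request"))
```

Note how little is here: an IP address, a URL, a status code. The analysis problems on the following slides all stem from these fields being the only raw material.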
13. The Bad: Web Server Logs
- Web server logs are considered the primary or sole source of data, but they create major hurdles when additional information needs to be collected.
- Because Web server logs were designed to debug web servers, not for data mining!
14. Problems with Web Server Logs
- They do not identify sessions or users.
- Combining the order data and other transactional data with web server logs is a complex extract-transform-load (ETL) process.
- Web server logs lack critical events such as add to cart, delete item, or change quantity.
- Web server logs do not store web form information.
- Web server logs contain URLs, not the semantic information of what the URLs contain.
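Because raw logs carry no session identifier, analysts must fall back on heuristics: group hits by something visitor-like (say, IP address plus user agent) and cut a new session after a gap of inactivity, commonly 30 minutes. A sketch of that heuristic, with hypothetical hits:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # common inactivity cutoff

def sessionize(hits):
    """Group (visitor, timestamp) hits into sessions using an inactivity
    timeout -- the kind of heuristic raw web server logs force on us."""
    sessions = {}   # visitor -> list of sessions (each a list of timestamps)
    last_seen = {}
    for visitor, t in sorted(hits, key=lambda h: h[1]):
        if visitor not in last_seen or t - last_seen[visitor] > TIMEOUT:
            sessions.setdefault(visitor, []).append([])  # start a new session
        sessions[visitor][-1].append(t)
        last_seen[visitor] = t
    return sessions

hits = [
    ("192.0.2.1|Mozilla/4.0", datetime(2001, 5, 1, 10, 0)),
    ("192.0.2.1|Mozilla/4.0", datetime(2001, 5, 1, 10, 10)),
    ("192.0.2.1|Mozilla/4.0", datetime(2001, 5, 1, 11, 30)),  # >30 min gap
]
print({v: len(s) for v, s in sessionize(hits).items()})  # 2 sessions
```

The heuristic breaks under proxies, shared IPs, and rotating addresses, which is exactly why the next slides argue for logging at the application server, where real session ids exist.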
15. Problems with Web Server Logs
- Web server logs lack information for modern sites that generate dynamic content.
- Web server logs are flat files on multiple file systems, possibly in different time zones.
- Web server logs contain redundant information (90% of the web server logs is commonly pruned).
- Web server logs lack important information that can be collected using other means (user local time, screen size).
16. The Alternative to Web Server Logs: Application Server Logging
- Web server logs do not identify sessions or users.
- The application server controls sessions, user registration, login, and logout. These can be directly logged and no sessionization heuristics are needed.
- Web server logs lack critical events.
- The application layer must know about events such as add to cart and can log these. In addition, it can log specific interesting events, such as a browser reset.
- Web server logs do not store web form information.
- The application layer processes form fields (for example, search strings) and can log their contents directly.
17. The Alternative to Web Server Logs: Application Server Logging
- Web server logs contain URLs, not the semantic information of what the URLs contain.
- At the application layer of a dynamic site, significant semantic information is available about the content of the page being displayed.
- Web server logs lack information for modern sites that generate dynamic content.
- Clearly the URL itself becomes less important when logging information at the application server layer.
- Web server logs are flat files on multiple file systems, possibly in different time zones.
- The application server logs can be generated directly into the database, so that transaction-level integrity holds.
- Web server logs contain redundant information.
- Redundancy is trivially eliminated when the application server controls logging.
- Web server logs lack important information that can be collected using other means.
- Any information that can be collected can also be logged into the same database with the appropriate keys.
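Logging straight into a database with the session key attached, as the slide describes, can be sketched with Python's built-in `sqlite3`; the table and event names here are illustrative, not from the paper:

```python
import sqlite3

# Sketch: the application layer writes events directly into the database,
# keyed by session id, so no flat-file ETL or sessionization is needed.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    session_id TEXT, event TEXT, detail TEXT,
    ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")

def log_event(session_id, event, detail=""):
    with db:  # each insert commits as a transaction: integrity holds
        db.execute(
            "INSERT INTO events (session_id, event, detail) VALUES (?, ?, ?)",
            (session_id, event, detail))

log_event("s-42", "add_to_cart", "sku=1234")   # hypothetical events
log_event("s-42", "search", "q=laptop")
print(db.execute(
    "SELECT COUNT(*) FROM events WHERE session_id = 's-42'").fetchone()[0])
```

Because every event lands in one store with the right keys, joining clickstream events to order data becomes a SQL join instead of a multi-file ETL job.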
18. The Ugly: Open Issues
- Crawler/bot/spider/robot identification
- Bots and crawlers skew statistics about the number of sessions, page hits, and exit pages
- Data transformations
- About 80% of the time to complete an analysis is spent in data transformations
- Taking action and operationalizing the findings is not easy
- Scalability of data mining algorithms
19. Lessons and Statistics
- Spend the time to identify crawlers and bots
- Buyers and browsers have very different browsing patterns
- Half the sessions are shorter than a minute, with about a third of sessions never going past the home page
- Only a small percentage of visitors use search (about 6%), but those that search buy more
- About a third (15-50%) of shopping carts are abandoned
- Never provide defaults in forms if you want unbiased answers
- Nobody reads the privacy policy
20. End of Presentation
- Thank you for your attention