Mining ECommerce Data: The Good, the Bad - PowerPoint PPT Presentation

1
Mining E-Commerce Data: The Good, the Bad and the
Ugly
  • Paper written by Ron Kohavi

Presented by Arie E. Gözlüklü
2
Business Models
  • Brick and Mortar
  • Brick and Click
  • Click and Click
E-Commerce Models
  • Business to business (B2B)
  • Business to consumers (B2C)
  • Consumer to consumer (C2C)
3
Contribution of Data Mining to E-Commerce
  • The insight gained through data-mining
    transactional and clickstream data can be used
    for
  • Site design
  • Customer loyalty
  • Personalization Strategies
  • Profitability

4
Use of Web Sites
  • Supporting online transactions
  • Information about products and services
  • Early alert system for emerging patterns
  • Warnings about site offerings
  • Viewing of buying patterns
  • Ads can be tested
  • Target markets can be identified

5
Clickstream
  • A virtual trail that a user leaves behind while
    surfing the Internet.
  • A clickstream is a record of a user's activity on
    the Internet
  • Every Web site and every page of every Web site
    that the user visits
  • how long the user was on a page or site
  • in what order the pages were visited
  • any newsgroups that the user participates in
  • The e-mail addresses of mail that the user sends
    and receives.
  • Both ISPs and individual Web sites are capable of
    tracking a user's clickstream.
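As a sketch of what a clickstream yields, the time a user spends on each page can be derived from the gaps between consecutive clicks; the timestamps and URLs below are made up for illustration:

```python
from datetime import datetime

# A toy clickstream: (timestamp, URL) pairs in visit order.
clicks = [
    (datetime(2000, 10, 10, 13, 0, 0), "/home"),
    (datetime(2000, 10, 10, 13, 0, 40), "/products"),
    (datetime(2000, 10, 10, 13, 2, 10), "/cart"),
]

# Time on each page = gap until the next click
# (unknowable for the last page visited).
for (t, url), (t_next, _) in zip(clicks, clicks[1:]):
    print(f"{url}: {(t_next - t).seconds}s")
# /home: 40s
# /products: 90s
```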

6
Cookie
  • A message given to a Web browser by a Web server.
  • The browser stores the message in a text file.
  • The message is then sent back to the server each
    time the browser requests a page from the server.
  • The main purpose of cookies is to identify users
    and possibly prepare customized Web pages.
  • The name cookie derives from UNIX objects called
    magic cookies. These are tokens that are attached
    to a user or program and change depending on the
    areas entered by the user or program.

7
Organization of the Paper
  • E-commerce as Killer Domain(The Good)
  • Web Server Logs(The Bad)
  • Alternative to Web Server Logs
  • Challenging Open Problems (The Ugly)
  • Lessons from mining real e-commerce Data

8
The Good: E-commerce is a Great Domain for Data
Mining
  • For data mining to succeed you need
  • Large amounts of data
  • Yahoo serves 1 billion page views a day, so the
    log files alone require around 10GB an hour
  • Wide records with many attributes
  • With proper design, it's possible
  • Clean data
  • Direct electronic collection provides superior
    quality

9
The Good: E-commerce is a Great Domain for Data
Mining
  • Actionable Domain
  • Through web site changes (layout, design,
    cross-sells, up-sells, personalization)
  • Measurable Return on Investment (ROI)
  • Clickstreams and events translated to incremental
    dollars
  • Controlled experiments
  • Immediate evaluation

10
Conversion Process
  • Assume there is an e-mail campaign
  • Opening e-mail
  • Clicking through
  • Browsing products
  • Adding products to shopping cart
  • Initiating check-out
  • Purchasing
  • Aim: increasing conversion rates
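The funnel above can be turned into step-by-step conversion rates; all counts below are hypothetical:

```python
# Hypothetical counts for each step of the e-mail campaign funnel.
funnel = [
    ("E-mails opened", 10000),
    ("Clicked through", 2500),
    ("Browsed products", 1800),
    ("Added to cart", 600),
    ("Initiated check-out", 300),
    ("Purchased", 200),
]

# Conversion rate of each step relative to the previous one —
# the quantity an e-mail campaign tries to increase.
for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    print(f"{prev_name} -> {name}: {n / prev_n:.1%}")

overall = funnel[-1][1] / funnel[0][1]
print(f"Overall conversion: {overall:.1%}")  # Overall conversion: 2.0%
```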

11
The Bad: Web Server Logs
Definition Web servers can generate logs in
Common Log Format (CLF) or Extended Common Log
Format (ECLF), which detail the interactions with
clients, typically web browsers.
12
The Bad: Web Server Logs
  • Logs contain the following fields
  • Remote (client) host
  • Remote logname (client identity information)
  • Username for authentication
  • Date and time of request
  • Request
  • HTTP status code
  • Number of bytes transferred
  • The referring server URL
  • The user agent (name and version of client)
  • Most web servers support options to log
    additional fields, such as
  • Cookie
  • Performance-related data
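A sketch of what extracting those fields looks like in practice, using a regular expression over one log line in the combined variant of CLF (the sample line follows the format documented for Apache):

```python
import re

# One log line: host, logname, user, timestamp, request, status,
# bytes, then referrer and user agent as two extra quoted fields.
line = ('192.0.2.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')

pattern = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

m = pattern.match(line)
print(m.group("host"))    # 192.0.2.1
print(m.group("status"))  # 200
print(m.group("agent"))   # Mozilla/4.08
```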

13
The Bad: Web Server Logs
  • Web server logs are often treated as the primary
    or sole source of data, but they create major
    hurdles when additional information needs to be
    collected.
  • This is because web server logs were designed to
    debug web servers, not for data mining!

14
Problems with Web Server Logs
  • They do not identify sessions or users.
  • Combining the order data and other transactional
    data with web server logs is a complex
    extract-transform-load (ETL) process.
  • Web server logs lack critical events such as add
    to cart, delete item, or change quantity.
  • Web server logs do not store web form
    information.
  • Web server logs contain URLs, not the semantic
    information of what the URLs contain.

15
Problems with Web Server Logs
  • Web server logs lack information for modern sites
    that generate dynamic content.
  • Web server logs are flat files on multiple file
    systems, possibly in different time zones.
  • Web server logs contain redundant information
    (90% of web server logs is commonly pruned).
  • Web server logs lack important information that
    can be collected using other means (user local
    time, screen size).

16
THE ALTERNATIVE TO WEB SERVER LOGS: APPLICATION
SERVER LOGGING
  • Web server logs do not identify sessions or
    users.
  • The application server controls sessions, user
    registration, login, and logout. These can be
    directly logged and no sessionization heuristics
    are needed.
  • Web server logs lack critical events.
  • The application layer must know about events such
    as add to cart and can log these. In addition,
    it can log specific interesting events, such as a
    browser reset.
  • Web server logs do not store web form
    information.
  • The application layer processes form submissions,
    so the values users enter (for example, search
    terms) can be logged directly.
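For contrast, this is the kind of sessionization heuristic that mining raw web server logs requires and that application-server logging avoids; the 30-minute inactivity timeout is a commonly used cutoff, and the timestamps are made up:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # common heuristic cutoff

def sessionize(hits):
    """Split one visitor's time-ordered hits into sessions:
    a gap longer than TIMEOUT starts a new session."""
    sessions, current = [], []
    for t in hits:
        if current and t - current[-1] > TIMEOUT:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

hits = [datetime(2000, 10, 10, 13, 0),
        datetime(2000, 10, 10, 13, 10),
        datetime(2000, 10, 10, 14, 5)]   # 55-minute gap -> new session
print(len(sessionize(hits)))  # 2
```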

17
THE ALTERNATIVE TO WEB SERVER LOGS: APPLICATION
SERVER LOGGING
  • Web server logs contain URLs, not the semantic
    information of what the URLs contain.
  • At the application layer of a dynamic site,
    significant semantic information is available
    about the content of the page being displayed.
  • Web server logs lack information for modern sites
    that generate dynamic content.
  • Clearly the URL itself becomes less important
    when logging information at the application
    server layer.
  • Web server logs are flat files on multiple file
    systems, possibly in different time zones.
  • The application server logs can be generated
    directly into the database, so that transaction
    level integrity holds.
  • Web server logs contain redundant information.
  • Redundancy is trivially eliminated when the
    application server controls logging.
  • Web server logs lack important information that
    can be collected using other means.
  • Any information that can be collected can also be
    logged into the same database with the
    appropriate keys.
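A minimal sketch of application-level event logging into a database, assuming an in-memory SQLite store and invented event names; the point is that the application layer records semantic events ("add to cart"), not raw URLs:

```python
import sqlite3

# In-memory database standing in for the site's transactional store.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    session_id TEXT,
    event      TEXT,
    detail     TEXT,
    ts         TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")

def log_event(session_id, event, detail=""):
    # The application layer knows what each action means, so it
    # logs the event by name, keyed to the session for later joins.
    with db:  # commit on success
        db.execute(
            "INSERT INTO events (session_id, event, detail) "
            "VALUES (?, ?, ?)",
            (session_id, event, detail))

log_event("s42", "add_to_cart", "sku=1234")
log_event("s42", "checkout_start")
rows = db.execute("SELECT event FROM events WHERE session_id = ?",
                  ("s42",)).fetchall()
print([r[0] for r in rows])  # ['add_to_cart', 'checkout_start']
```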

18
THE UGLY: OPEN ISSUES
  • Crawler/bot/spider/robot identification.
  • Bots and crawlers skew statistics about the
    number of sessions, page hits, and exit pages
  • Data transformations
  • About 80% of the time to complete an analysis is
    spent in data transformations
  • Taking action and operationalizing the findings
    is not easy
  • Scalability of data mining algorithms
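A crude first pass at the crawler/bot identification problem is a user-agent substring filter; the marker list and sample sessions below are illustrative only, and real systems also use robots.txt hits, request rates, and known IP ranges:

```python
# Minimal user-agent based bot filter (illustrative marker list).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_bot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

sessions = [
    {"agent": "Mozilla/4.08 (Win98)"},
    {"agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
]

# Pruning bot sessions keeps them from skewing session,
# page-hit, and exit-page statistics.
human = [s for s in sessions if not is_bot(s["agent"])]
print(len(human))  # 1
```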

19
LESSONS AND STATISTICS
  • Spend the time to identify crawlers and bots
  • Buyers and Browsers have very different browsing
    patterns
  • Half the sessions are shorter than a minute with
    about a third of sessions never going past the
    home page
  • Only a small percentage of visitors use search
    (6%), but those that search buy more
  • A significant share (15-50%) of shopping carts
    are abandoned
  • Never provide defaults in forms if you want
    unbiased answers
  • Nobody reads the privacy policy

20
End of Presentation
  • Thank you for your attention