Overview of Web Mining and E-Commerce Data Analytics - PowerPoint PPT Presentation

About This Presentation
Title:

Overview of Web Mining and E-Commerce Data Analytics

Description:

Data Mining and Knowledge Discovery - Web Data Mining

Slides: 59
Provided by: Bamsh

Transcript and Presenter's Notes

Title: Overview of Web Mining and E-Commerce Data Analytics


1
Overview of Web Mining and E-Commerce Data
Analytics
Bamshad Mobasher, DePaul University
2
Why Data Mining?
  • Increased Availability of Huge Amounts of Data
  • point-of-sale customer data (Walmart: 60M
    transactions per day)
  • E-commerce transaction data
  • digitization of text, images, video, voice, etc.
  • World Wide Web and Online collections
  • usage/navigation data (Yahoo: 20 terabytes of
    clickstream data per day)
  • Data Too Large or Complex for Classical or Manual
    Analysis
  • number of records in millions or billions
  • high dimensional data (too many
    fields/features/attributes)
  • often too sparse for rudimentary observations
  • high rate of growth (e.g., through logging or
    automatic data collection)
  • heterogeneous data sources
  • Business Necessity
  • e-commerce
  • high degree of competition
  • personalization, customer loyalty, market
    segmentation

3
From Data to Wisdom
  • Data
  • The raw material of information
  • Information
  • Data organized and presented by someone
  • Knowledge
  • Information read, heard or seen and understood
    and integrated
  • Wisdom
  • Distilled knowledge and understanding which can
    lead to decisions

[Figure: The Information Hierarchy. Levels from top
to bottom: Wisdom, Knowledge, Information, Data]
4
What is Data Mining?
  • What do we need?
  • Extract interesting and useful knowledge from the
    data
  • Find rules, regularities, irregularities,
    patterns, constraints
  • hopefully, this will help us better compete in
    business, do research, learn concepts, make
    money, etc.
  • Data Mining: A Definition

The non-trivial extraction of implicit,
previously unknown and potentially useful
knowledge from data in large data repositories
  • Non-trivial: obvious knowledge is not useful
  • Implicit: hidden, difficult-to-observe knowledge
  • Previously unknown
  • Potentially useful: actionable, easy to understand

5
Data Mining's Virtuous Cycle
  1. Identifying the business problem
  2. Mining data to transform it into actionable
    information
  3. Acting on the information
  4. Measuring the results

Note: the textbook uses "problem" and
"opportunity" interchangeably
6
1. Identify the Business Opportunity
  • First step: clearly identify the business problem
    that requires a solution
  • Then translate this problem into a data mining
    problem
  • Many business processes are good candidates
  • New product introduction / eliminating a product
    line
  • Direct marketing campaign
  • Understanding customer attrition/churn
  • Evaluating the results of a test market
  • Measurements from past DM efforts
  • What types of customers responded to our last
    campaign?
  • Where do the best customers live?
  • Are long waits in check-out lines a cause of
    customer attrition?
  • What products should be promoted with our XYZ
    product?

7
2. Mining data to transform it into actionable
information
  • Success is making business sense of the data
  • Need to identify the right data mining tasks that
    can address the specified problem
  • Numerous data issues
  • Bad data formats (alpha vs numeric, missing,
    null, bogus data)
  • Confusing data fields (synonyms and differences)
  • Lack of functionality ("I wish I could")
  • Legal ramifications (privacy, etc.)
  • Organizational factors ("unwilling to change our
    ways")
  • Lack of timeliness

8
3. Acting on the Information
  • This is the purpose of data mining, with the
    hope of adding value
  • What type of action?
  • Interactions with customers, prospects, suppliers
  • Modifying service procedures
  • Adjusting inventory levels
  • Consolidating
  • Expanding
  • Etc

9
4. Measuring the Results
  • Assesses the impact of the action taken
  • Often overlooked, ignored, skipped
  • Planning for the measurement should begin when
    analyzing the business opportunity, not after it
    is all over
  • Assessment questions (examples)
  • Did this ____ campaign do what we hoped?
  • Did some offers work better than others?
  • Did these customers purchase additional products?
  • Tons of others

10
The Knowledge Discovery Process
  • Data Mining vs. Knowledge Discovery in Databases
    (KDD)
  • DM and KDD are often used interchangeably
  • actually, DM is only part of the KDD process

[Figure: The KDD Process]
11
What Can Data Mining Do?
  • Two kinds of knowledge discovery: directed and
    undirected
  • Directed Knowledge Discovery
  • Purpose: explain the value of some field in terms
    of all the others (goal-oriented)
  • Method: select the target field based on some
    hypothesis about the data; ask the algorithm to
    tell us how to predict or classify new instances
  • Examples
  • what products show increased sales when cream
    cheese is discounted
  • which banner ad to use on a web page for a given
    user coming to the site
  • Undirected Knowledge Discovery
  • Purpose: find patterns in the data that may be
    interesting (no target field)
  • Method: clustering, affinity grouping
  • Examples
  • which products in the catalog often sell together
  • market segmentation (groups of customers/users
    with similar characteristics)

12
What Can Data Mining Do?
  • Many Data Mining Tasks
  • often inter-related
  • often need to try different techniques for each
    task
  • each task may require different types of
    knowledge discovery
  • What are some data mining tasks?
  • Classification
  • Prediction
  • Characterization
  • Discrimination
  • Affinity Grouping
  • Clustering
  • Sequence Analysis
  • Description

13
Some Applications of Data mining
  • Business data analysis and decision support
  • Marketing focalization
  • Recognizing specific market segments that respond
    to particular characteristics
  • Return on mailing campaign (target marketing)
  • Customer Profiling
  • Segmentation of customer for marketing strategies
    and/or product offerings
  • Customer behavior understanding
  • Customer retention and loyalty
  • Mass customization / personalization

14
Some Applications of Data mining
  • Business data analysis and decision support
    (cont.)
  • Market analysis and management
  • Provide summary information for decision-making
  • Market basket analysis, cross selling, market
    segmentation.
  • Resource planning
  • Risk analysis and management
  • "What if" analysis
  • Forecasting
  • Pricing analysis, competitive analysis
  • Time-series analysis (Ex. stock market)

15
Some Applications of Data mining
  • Fraud detection
  • Detecting telephone fraud
  • Telephone call model: destination of the call,
    duration, time of day or week
  • Analyze patterns that deviate from an expected
    norm
  • British Telecom identified discrete groups of
    callers with frequent intra-group calls,
    especially mobile phones, and broke a
    multimillion-dollar fraud scheme
  • Detection of credit-card fraud
  • Detecting suspicious money transactions (money
    laundering)
  • Text mining
  • Message filtering (e-mail, newsgroups, etc.)
  • Newspaper articles analysis
  • Text and document categorization
  • Web Mining . . .

16
What is Web Mining?
  • From its very beginning, the potential of
    extracting valuable knowledge from the Web has
    been quite evident
  • Web mining is the collection of technologies to
    fulfill this potential.

Web Mining Definition: the application of data
mining and machine learning techniques to extract
useful knowledge from the content, structure, and
usage of Web resources.
17
Types of Web Mining
Web Mining
Web Usage Mining
Web Structure Mining
Web Content Mining
18
Types of Web Mining: Web Content Mining
Extracting useful knowledge from the contents of
Web documents or other semantic information about
Web resources
19
Types of Web Mining: Web Content Mining
Content data may consist of text, images, audio,
video, structured records from lists and tables,
or item attributes from backend databases.
20
Types of Web Mining: Web Content Mining
  • Applications
  • document clustering or categorization
  • topic identification / tracking
  • concept discovery
  • focused crawling
  • content-based personalization
  • intelligent search tools

21
Types of Web Mining: Web Usage Mining
Extracting interesting patterns from user
interactions with resources on one or more Web
sites
22
Types of Web Mining: Web Usage Mining
  • Applications
  • user and customer behavior modeling
  • Web site optimization
  • e-customer relationship management
  • Web marketing
  • targeted advertising
  • recommender systems

23
Types of Web Mining: Web Structure Mining
Discovering useful patterns from the hyperlink
structure connecting Web sites or Web resources
24
Types of Web Mining: Web Structure Mining
Data sources include the explicit hyperlink
between documents, or implicit links among
objects (e.g., two objects being tagged using
the same keyword).
25
Types of Web Mining: Web Structure Mining
  • Applications
  • document retrieval and ranking (e.g.,
    Google)
  • discovery of hubs and authorities
  • discovery of Web communities
  • social network analysis

26
Web Content Mining: common approaches and
applications
  • Basic notion: document similarity
  • Most Web content mining and information retrieval
    applications involve measuring similarity among
    two or more documents
  • Vector representation facilitates similarity
    computations using vector-space operations (such
    as the cosine of the angle between two vectors)
  • Examples
  • Search engines: measure the similarity between a
    query (represented as a vector) and the indexed
    document vectors to return a ranked list of
    relevant documents
  • Document clustering: group documents based on
    similarity or dissimilarity (distance) among them
  • Document categorization: measure the similarity
    of a new document to be classified with
    representations of existing categories (such as
    the mean vector representing a group of document
    vectors)
  • Personalization: recommend documents or items
    based on their similarity to a representation of
    the user's profile (may be a term vector
    representing concepts or terms of interest to the
    user)
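
The vector-space similarity described above can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the names term_vector, cosine_similarity and rank_documents are hypothetical, and raw term counts stand in for whatever term weighting (such as TF-IDF) a real system would use.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words vector: lowercased, whitespace-tokenized term counts."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def rank_documents(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The same cosine_similarity function could serve clustering, categorization (compare against a category's mean vector) or personalization (compare against a profile vector), since each reduces to comparing two vectors.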

27
Web Content Mining example: clustered search
results
Can drill down within clusters to view sub-topics
or to view the relevant subset of results
28
Web Content Mining example: personalized
content delivery
Google's personalized news is an example of a
content-based recommender system which recommends
items (in part) based on the similarity of their
content to a user's profile (gathered from search
and click history)
29
Web Structure Mining: graph structures on the
Web
  • The structure of a typical Web graph
  • Web pages as nodes
  • hyperlinks as edges connecting two related pages
  • Hyperlink Analysis
  • Hyperlinks can serve as a tool for pure
    navigation
  • But, often they are used to point to pages with
    authority on the same topic as the source page
    (similar to a citation in a publication)
  • Some interesting Web structures

30
Web Structure Mining example: Google's
PageRank algorithm
  • Basic idea
  • Rank of a page depends on the ranks of pages
    pointing to it
  • The out-degree of a page is the number of edges
    pointing away from it; it is used to compute the
    contribution of the page to those to which it
    points
  • The final PageRank value represents the
    probability that a random surfer will reach the
    page
  • d is the probability that a random surfer chooses
    the page directly rather than getting there via
    navigation
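
The slide's formula did not survive in this transcript, so the following is only a sketch of one common formulation, PR(p) = d/N + (1 - d) * sum over pages q linking to p of PR(q)/outdegree(q), using d as defined above (the probability of jumping directly to a page). The function name, the default d = 0.15, and the handling of pages with no out-links are assumptions.

```python
def pagerank(links, d=0.15, iterations=100):
    """Iterative PageRank sketch.

    links: dict mapping each page to the list of pages it links to.
    d: probability that the random surfer jumps directly to a page
       (the convention used on the slide; some texts use d for the
       probability of following a link instead).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its out-edges.
                share = (1 - d) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its
                # rank evenly over all pages.
                share = (1 - d) * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Because every page redistributes all of its rank, the values keep summing to 1 and can be read as the random surfer's visit probabilities.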

31
Web Structure Mining example: Hubs and
Authorities
  • Basic idea
  • Authority comes from in-edges
  • Being a hub comes from out-edges
  • Mutually reinforcing relationship
  • A good authority is a page that is pointed to by
    many good hubs.
  • A good hub is a page that points to many good
    authorities.
  • Together they tend to form a bipartite graph
  • This idea can be used to discover authoritative
    pages related to a topic
  • HITS algorithm: Hypertext Induced Topic Search
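
The mutually reinforcing relationship between hubs and authorities can be sketched as a simple iteration. This is a simplified illustration of the HITS idea with hypothetical names; it runs on a link graph that is already given, whereas the full algorithm first builds a topic-specific subgraph from search results.

```python
import math

def hits(links, iterations=50):
    """Hub and authority scores for a link graph.

    links: dict mapping each page to the list of pages it points to.
    Returns (hub, authority) dicts, each normalized to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```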

32
Web Structure Mining example: online
communities
  • Basic idea
  • Web communities are collections of Web pages such
    that each member node has more hyperlinks (in
    either direction) within the community than
    outside the community.
  • Typical approach: maximal-flow model
  • Example: separate the two subgraphs with any
    choice of source node (left subgraph) and sink
    node (right subgraph), removing the three dashed
    links

Source: G. Flake et al., "Self-Organization and
Identification of Web Communities," IEEE
Computer, Vol. 35, No. 3, pp. 66-71, March 2002.
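
As a small illustration of the community definition above (not of the maximal-flow method itself), the following sketch checks whether a candidate set of pages satisfies the condition that every member has more links inside the set than outside it. The function name and the edge-list representation are assumptions.

```python
def is_community(members, edges):
    """Check the community definition for a candidate set of pages.

    members: iterable of page identifiers.
    edges: iterable of (source, target) hyperlinks.
    Returns True if every member has more links (in either direction)
    to other members than to non-members.
    """
    members = set(members)
    inside = {m: 0 for m in members}
    outside = {m: 0 for m in members}
    for src, dst in edges:
        # Count the link once from each endpoint's point of view.
        for node, other in ((src, dst), (dst, src)):
            if node in members:
                if other in members:
                    inside[node] += 1
                else:
                    outside[node] += 1
    return all(inside[m] > outside[m] for m in members)
```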
33
Web Usage Mining
  • The problem: analyze Web navigational data to
  • Find how the Web site is used by Web users
  • Understand the behavior of different user
    segments
  • Predict how users will behave in the future
  • Target relevant or interesting information to
    individual or groups of users
  • Increase sales, profit, loyalty, etc.
  • Challenge
  • Quantitatively capture Web users' common
    interests and characterize their underlying tasks

34
Applications of Web Usage Mining
  • Electronic Commerce
  • design cross marketing strategies across products
  • evaluate promotional campaigns
  • target electronic ads and coupons at user groups
    based on their access patterns
  • predict user behavior based on previously learned
    rules and users' profiles
  • present dynamic information to users based on
    their interests and profiles (Web
    personalization)
  • Effective and Efficient Web Presence
  • determine the best way to structure the Web site
  • identify weak links for elimination or
    enhancement
  • prefetch files that are most likely to be
    accessed
  • enhance workgroup management and communication
  • Search Engines
  • Behavior-based ranking

35
Web Usage Mining: data sources
  • Typical Sources of Data
  • automatically generated Web/application server
    access logs
  • e-commerce and product-oriented user events
    (e.g., shopping cart changes, product
    clickthroughs, etc.)
  • user profiles and/or user ratings
  • meta-data, page content, site structure
  • User Transactions
  • sets or sequences of pageviews possibly with
    associated weights
  • a pageview is a set of page files and associated
    objects that contribute to a single display in a
    Web Browser

36
What's in a Typical Server Log?
37
Typical Fields in a Log File Entry
  client IP address    1.2.3.4
  base url             maya.cs.depaul.edu
  date/time            2006-02-01 00:08:43
  http method          GET
  file accessed        /classes/cs589/papers.html
  protocol version     HTTP/1.1
  status code          200 (successful access)
  bytes transferred    9221
  referrer page        http://dataminingresources.blogspot.com/
  user agent           Mozilla/4.0 (compatible; MSIE 6.0; Windows
                       NT 5.1; SV1; .NET CLR 2.0.50727)
  • In addition, there may be fields corresponding to
  • login information
  • client-side cookies (unique keys, issued to
    clients in order to identify a repeat
    visitor)
  • session ids issued by the Web or application
    servers
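
Fields like these are typically recovered from raw log lines with a pattern match. The sketch below assumes the widely used Apache "combined" log layout, which may differ from the log format shown on the slide (it has no base-url field, for instance). The sample line reuses the slide's values, and its time zone offset is made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Hypothetical sample line built from the values on the slide.
sample = ('1.2.3.4 - - [01/Feb/2006:00:08:43 -0600] '
          '"GET /classes/cs589/papers.html HTTP/1.1" 200 9221 '
          '"http://dataminingresources.blogspot.com/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
          '.NET CLR 2.0.50727)"')
```

Cookie or session-id fields, when logged, would need extra groups appended to the pattern.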

38
Basic Entities in Web Usage Mining
  • User (Visitor) - Single individual that is
    accessing files from one or more Web servers
    through a Browser
  • Page File - File that is served through HTTP
    protocol
  • Pageview - Set of Page Files that contribute to a
    single display in a Web Browser
  • User Session - Set of Pageviews served due to a
    series of HTTP requests from a single User across
    the entire Web.
  • Server Session - Set of Pageviews served due to a
    series of HTTP requests from a single User to a
    single site
  • Transaction (Episode) - Subset of Pageviews from
    a single User or Server Session

39
Main Challenges in Data Collection and
Preprocessing
  • Main Questions
  • what data to collect and how to collect it; what
    to exclude
  • how to identify requests associated with a unique
    user session (HTTP is stateless)
  • how to identify/define user transactions (within
    each session)
  • how to identify what is the basic unit of
    analysis (e.g., pageviews, items purchased)
  • how to integrate e-commerce data with usage data
  • Problems
  • user ids are usually suppressed due to security
    concerns
  • individual IP addresses are sometimes hidden
    behind proxy servers and may not be unique
  • client-side proxy caching makes server log data
    less reliable
  • data must be integrated from multiple sources
    (e.g., server logs, content data, e-commerce
    applications servers, customer demographic data,
    etc.)
  • Standard Solutions/Practices
  • user registration, cookies, server extensions and
    URL re-writing, cache busting
  • heuristic approaches to session/user
    identification and path completion

40
Usage Data Preparation Tasks
  • Data cleaning
  • remove irrelevant references and fields in server
    logs
  • remove references due to spider navigation
  • add missing references due to client-side caching
  • Data integration
  • synchronize data from multiple server logs
  • integrate e-commerce and application server data
  • integrate meta-data
  • Data Transformation
  • pageview identification
  • identification of unique users
  • sessionization: partitioning each user's record
    into multiple sessions or transactions (usually
    representing different visits)
  • mapping between user sessions and topics or
    categories
  • Associating weights with object/pageviews in one
    session or transaction
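
One common heuristic for the sessionization step is a time-out: a new session starts whenever the gap between a user's consecutive requests exceeds some threshold. The sketch below assumes users have already been identified and uses a 30-minute threshold; the threshold and the function name are assumptions, not something the slides specify.

```python
from collections import defaultdict

def sessionize(requests, timeout=30 * 60):
    """Split each user's requests into sessions using a time-out heuristic.

    requests: iterable of (user_id, timestamp_in_seconds, url).
    timeout: maximum gap in seconds between consecutive requests
             within one session (assumed default: 30 minutes).
    Returns a list of (user_id, [url, ...]) sessions.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))
    sessions = []
    for user, visits in by_user.items():
        visits.sort()  # order each user's requests by time
        current = [visits[0]]
        for prev, visit in zip(visits, visits[1:]):
            if visit[0] - prev[0] > timeout:
                # Gap too long: close the current session.
                sessions.append((user, [u for _, u in current]))
                current = []
            current.append(visit)
        sessions.append((user, [u for _, u in current]))
    return sessions
```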

41
Conceptual Representation of User Transactions or
Sessions
[Figure: matrix of sessions/user transactions by
pageviews/objects]
This is the typical representation of the data,
after preprocessing, that is used as input to
data mining algorithms. Raw weights may be
binary, based on time spent on a page, or based on
other measures of user interest in an item. In
practice, this data needs to be normalized or
standardized.
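
A minimal sketch of building such a session-by-pageview matrix follows. It assumes each session is a list of (pageview, seconds spent) pairs and normalizes each row by its largest weight; both the input layout and the normalization choice are assumptions, and other schemes would do equally well.

```python
def build_matrix(sessions, binary=False):
    """Build a session-by-pageview weight matrix.

    sessions: list of sessions, each a list of (pageview, seconds_spent).
    binary: if True, use 1.0 for any visited pageview instead of time spent.
    Returns (pageviews, rows), where rows[i][j] is the weight of
    pageviews[j] in session i, scaled so each row's maximum is 1.0.
    """
    pageviews = sorted({pv for session in sessions for pv, _ in session})
    index = {pv: j for j, pv in enumerate(pageviews)}
    rows = []
    for session in sessions:
        row = [0.0] * len(pageviews)
        for pv, seconds in session:
            if binary:
                row[index[pv]] = 1.0
            else:
                row[index[pv]] += float(seconds)
        peak = max(row, default=0.0)
        if peak > 0:
            row = [w / peak for w in row]  # per-session normalization
        rows.append(row)
    return pageviews, rows
```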
42
Web Usage Mining as a Process
43
E-Commerce Data
  • Integrating E-Commerce and Usage Data
  • Needed for analyzing relationships between
    navigational patterns of visitors and business
    questions such as profitability, customer value,
    product placement, etc.
  • E-business / Web Analytics
  • E.g., tracking and analyzing conversion of
    browsers to buyers
  • E-Commerce v. Simple Usage Data
  • E-commerce data is product oriented while usage
    data is pageview oriented
  • Usage events (pageviews) are well defined and
    have consistent meaning across all Web sites
  • E-commerce events are often only applicable to
    specific domains, and the definition of certain
    events can vary from site to site
  • Major difficulty for Usage events is getting
    accurate preprocessed data
  • Major difficulty for E-commerce events is
    defining and implementing the events for a
    particular site

44
Why We Need Web Analytics
  • Are we attracting new people to our site?
  • Is our site sticky? Which regions in it are
    not?
  • What is the health of our lead qualification
    process?
  • How adept is our conversion of browsers to
    buyers?
  • What behavior indicates purchase propensity?
  • What site navigation do we wish to encourage?
  • How can profiling help us cross-sell and
    up-sell?
  • How do customer segments differ?
  • What attributes describe our best customers?
  • Can we target other prospects like them?
  • What makes customers loyal?
  • How do we measure loyalty?

45
Three Skill Sets Required
  • Technology
  • How do we get the data? Are we collecting the
    right data?
  • Analytics
  • How do we turn the data into insightful
    information?
  • Business Management
  • What action do we take? How do we measure the
    impact of that action?

Data Collection / Preprocessing / Integration
Analysis Tools, OLAP, Data Mining
E-Metrics
46
Using Analytics for E-Business Management
  • Navigation Calibration
  • Calculating Content
  • Popularity
  • Freshness
  • Stickiness / Slipperiness / Leakage
  • Stimulus - Inducement
  • Conversion Quotient
  • Interaction Computation
  • Customer Service Assessment
  • Customer Experience Evaluation
  • Branding

47
Web Usage and E-Business Analytics
Different Levels of Analysis
  • Session Analysis
  • Static Aggregation and Statistics
  • OLAP
  • Data Mining

48
Session Analysis
  • Simplest form of analysis: examine individual or
    groups of server sessions and e-commerce data.
  • Advantages
  • Gain insight into typical customer behaviors.
  • Trace specific problems with the site.
  • Drawbacks
  • LOTS of data.
  • Difficult to generalize.

49
Static Aggregation (Reports)
  • Most common form of analysis.
  • Data is aggregated by predetermined units such as
    days or sessions.
  • Generally gives most bang for the buck.
  • Advantages
  • Gives quick overview of how a site is being used.
  • Minimal disk space or processing power required.
  • Drawbacks
  • No ability to dig deeper into the data.

50
Online Analytical Processing (OLAP)
  • Allows changes to aggregation level for multiple
    dimensions.
  • Generally associated with a Data Warehouse.
  • Advantages
  • Very flexible
  • Drawbacks
  • Requires significantly more resources than static
    reporting.

51
Data Mining: Going Deeper
  • Frequent Itemsets and Association Rules
  • The Donkey Kong Video Game and Stainless Steel
    Flatware Set product pages are accessed together
    in 1.2% of the sessions.
  • When the Shopping Cart Page is accessed in a
    session, Home Page is also accessed 90% of the
    time.
  • When the Stainless Steel Flatware Set product
    page is accessed in a session, the Donkey Kong
    Video page is also accessed 5% of the time.
  • 30% of clients who accessed /special-offer.html
    placed an online order in /products/software/
  • Sequential Patterns
  • Add an extra dimension to frequent itemsets and
    association rules: time
  • (x% of the time, when AB appears in a
    transaction, C appears within z transactions)
  • 40% of people who bought the book "How to cheat
    IRS" booked a flight to South America 6 months
    later
  • The Video Game Caddy page view is accessed
    after the Donkey Kong Video Game page view 50%
    of the time. This occurs in 1% of the sessions.
  • 15% of visitors followed the path home >>
    software >> shopping cart > checkout
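
The two kinds of percentage quoted in the association-rule examples above correspond to support ("accessed together in x% of the sessions") and confidence ("when A is accessed, B is also accessed y% of the time"). A minimal sketch, with hypothetical function names and sessions represented as sets of pages:

```python
def support(sessions, items):
    """Fraction of sessions that contain every item in `items`."""
    items = set(items)
    return sum(1 for s in sessions if items <= set(s)) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Of the sessions containing `antecedent`, the fraction that
    also contain `consequent`."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    return support(sessions, set(antecedent) | set(consequent)) / base
```

Note that confidence is not symmetric: the shopping cart page may nearly always co-occur with the home page, while the home page only occasionally co-occurs with the shopping cart.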

52
Data Mining: Going Deeper
  • Clustering: Content-Based or Usage-Based
  • Customer/visitor segmentation
  • Categorization of pages and products
  • Classification
  • Classifying users into behavioral groups
    (browser, likely to purchase, loyal customer,
    etc.)
  • Examples
  • Customers who access Video Game Product pages,
    have income of 50K, and have 1 or more children,
    should get a banner ad for Xbox in their next
    visit.
  • Customers who make at least 4 purchases in one
    year should be categorized as loyal
  • Loan applicants in 45K-60K income range, low
    debt, and good-excellent credit should be
    approved for a new mortgage.

53
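Example: Path Analysis for Ecommerce
[Figure: path-analysis tree. Labels in order of
appearance: Visit; 90 / 10; No Search / Search (64
successful); 30 / 70; Last Search Failed / Last
Search Succeeded. The numbers appear to be
percentages.]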
54
Example: Association Analysis for Ecommerce
  • Confidence: 41% of those who purchased Fully
    Reversible Mats also purchased Egyptian Cotton
    Towels
  • Lift: people who purchased Fully Reversible Mats
    were 456 times more likely to purchase the
    Egyptian Cotton Towels compared to the general
    population
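
Lift compares the confidence of a rule with how often the consequent is bought overall: lift(A -> B) = confidence(A -> B) / support(B). A value of 1 means A tells us nothing about B. A minimal sketch, with a hypothetical function name and baskets represented as sets of products:

```python
def lift(baskets, a, b):
    """Lift of the rule a -> b over a list of baskets (sets of items).

    lift = P(b | a) / P(b); values above 1 mean buyers of `a` are
    more likely than the general population to buy `b`.
    """
    n = len(baskets)
    count_a = sum(1 for basket in baskets if a in basket)
    count_b = sum(1 for basket in baskets if b in basket)
    count_ab = sum(1 for basket in baskets if a in basket and b in basket)
    if count_a == 0 or count_b == 0:
        return 0.0
    confidence = count_ab / count_a
    return confidence / (count_b / n)
```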

55
Web Usage Mining: clustering example
  • Transaction Clusters
  • Clustering similar user transactions and using
    the centroid of each cluster as a usage profile
    (representative for a user segment)

Sample cluster centroid from dept. Web site
(cluster size: 330)

Support  URL                                                        Pageview Description
1.00     /courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290  SE 450 Object-Oriented Development class syllabus
0.97     /people/facultyinfo.asp?id=290                             Web page of a lecturer who taught the above course
0.88     /programs/                                                 Current Degree Descriptions 2002
0.85     /programs/courses.asp?depcode=96&deptmne=se&courseid=450   SE 450 course description in SE program
0.82     /programs/2002/gradds2002.asp                              M.S. in Distributed Systems program description
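
A centroid like the one in the table can be computed from the sessions assigned to a cluster. The sketch below assumes binary pageview weights, so that each centroid entry is simply the fraction of the cluster's sessions containing that pageview; the min_support cut-off and the function name are assumptions added for illustration.

```python
def usage_profile(cluster, min_support=0.5):
    """Summarize a cluster of sessions as a usage profile.

    cluster: list of sessions, each a collection of pageview URLs.
    min_support: assumed cut-off; pageviews seen in a smaller
                 fraction of the cluster's sessions are dropped.
    Returns (support, url) pairs, highest support first.
    """
    counts = {}
    for session in cluster:
        for url in set(session):  # count each URL once per session
            counts[url] = counts.get(url, 0) + 1
    n = len(cluster)
    profile = [(c / n, url) for url, c in counts.items() if c / n >= min_support]
    return sorted(profile, key=lambda p: (-p[0], p[1]))
```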
56
Basic Framework for E-Commerce Data Analysis
57
Components of E-Commerce Data Analysis Framework
  • Content Analysis Module
  • extract linkage and semantic information from
    pages
  • potentially used to construct the site map and
    site dictionary
  • analysis of dynamic pages includes (partial)
    generation of pages based on templates, specified
    parameters, and/or databases (may be done in real
    time, if available as an extension of
    Web/Application servers)
  • Site Map / Site Dictionary
  • site map is used primarily in data preparation
    (e.g., required for pageview identification and
    path completion); it may be constructed through
    content analysis and/or analysis of usage data
    (e.g., from referrer information)
  • site dictionary provides a mapping between
    pageview identifiers / URLs and
    content/structural information on pages; it is
    used primarily for content labeling both in
    sessionized usage data as well as integrated
    e-commerce data

58
Components of E-Commerce Data Analysis Framework
  • Data Integration Module
  • used to integrate sessionized usage data,
    e-commerce data (from application servers), and
    product/user data from databases
  • user data may include user profiles, demographic
    information, and individual purchase activity
  • e-commerce data includes various product-oriented
    events, including shopping cart changes, purchase
    information, impressions, click-throughs, and
    other basic metrics
  • primarily used for data transformation and
    loading mechanism for the Data Mart
  • E-Commerce Data Mart
  • this is a multi-dimensional database integrating
    data from a variety of sources, and at different
    levels of aggregation
  • can provide pre-computed e-metrics along multiple
    dimensions
  • is used as the primary data source in OLAP
    analysis, as well as in data selection for a
    variety of data mining tasks (performed by the
    data mining engine)