Data Webhousing - PowerPoint PPT Presentation

Provided by: csA2

Transcript and Presenter's Notes

1
Data Webhousing

2
Talk Overview
  • Data Warehousing and Data Webhousing
  • User actions on a website
  • Making decisions based on click-streams
  • What data is needed ?
  • The click-stream as a data source
  • Proxies, cookies, ...

3
Data Analysis Problems
  • The same data found in many different systems
  • Example: customer data in 14 systems!
  • The same concept is defined differently
  • Data is suited for operational systems (OLTP)
  • Accounting, billing, etc.
  • Do not support analysis across business functions
  • Data quality is bad
  • Missing data, imprecise data, different use of
    systems
  • Data are volatile
  • Data deleted in operational systems (6 months)
  • Data change over time: no historical information

4
Data Warehousing
  • Solution: a new analysis environment (DW) where
    data are:
  • Subject oriented (versus function oriented)
  • Integrated (logically and physically)
  • Stable (data not deleted, several versions)
  • Time variant (data can always be related to time)
  • Supporting management decisions (different
    organization)
  • Data from the operational systems are:
  • Extracted
  • Cleansed
  • Transformed
  • Aggregated
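
The four steps above can be sketched in a few lines of Python. This is only an illustration: the two source systems, the field names (`customer`, `amount`), and the cleansing rule (drop rows with a missing amount) are assumptions made for the example, not something taken from the talk.

```python
from collections import defaultdict

# Hypothetical rows as they might come from two operational systems.
BILLING = [{"customer": " Alice ", "amount": "120.50"},
           {"customer": "bob", "amount": None}]        # missing amount
ORDERS = [{"customer": "ALICE", "amount": "30"},
          {"customer": "Bob", "amount": "12.25"}]


def extract():
    """Extract: gather rows from every source system."""
    return BILLING + ORDERS


def cleanse(rows):
    """Cleanse: drop rows whose amount is missing (assumed rule)."""
    return [r for r in rows if r["amount"] is not None]


def transform(rows):
    """Transform: one spelling per customer, numeric amounts."""
    return [{"customer": r["customer"].strip().lower(),
             "amount": float(r["amount"])} for r in rows]


def aggregate(rows):
    """Aggregate: total amount per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


totals = aggregate(transform(cleanse(extract())))
```

Note how the transform step is what makes "Alice" from one system and "ALICE" from the other count as the same customer, which is the integration problem the previous slide describes.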

5
DW Architecture
[Figure: DW architecture diagram; only the label "DB" survives in this transcript]

6
The Web in the Warehouse
  • Bring behavior to the warehouse
  • Capture, analyze, and understand user website
    behavior
  • The click-stream brings new possibilities
  • Call data records do not contain as much
    information
  • XML will bring even more content

7
Analyzing Behavior
  • Basic behavior types
  • Information and product feature gathering
  • Product and service order
  • Status tracking
  • Searching
  • Reading news and white papers
  • Downloading, email, ...
  • High-level behavior types
  • Successful or unsuccessful purchase
  • Information not found
  • User left site or user cancelled request
  • Angry or happy user, ...

8
Quality of Interaction
  • 8 success factors
  • Target the right customer
  • Own the customer's total experience
  • Streamline customer-oriented business processes
  • Provide a 360-degree view of the customer
    relationship
  • Let customers help themselves
  • Help customers do their jobs
  • Deliver personalized service
  • Foster community
  • Ensuring privacy
  • Does the web reveal all our secrets ?
  • Physical store: cash transactions versus credit
    account
  • On the web: total privacy versus

9
Webhouse Demands
  • The Webhouse places new demands on our DWs
  • Timeliness
  • The webhouse must be updated in near-real-time
  • Data volumes
  • Many page requests for each business transaction
    (sales, ...)
  • Microsoft-related sites have more than 1 billion
    events per day !
  • Response times
  • Web users are used to few-second response times
  • Not like traditional Data Warehouse queries
  • This problem must be addressed
  • The webhouse must deliver many kinds of content
  • Query results, reports, data mining results,
    status updates, custom greetings, images, OLAP
    cubes

10
Webhouse Architecture
  • All components are complex and distributed
  • Components must be separated to ensure proper
    function
  • Public web server
  • Services customer web requests by calling the
    other components
  • Public application and business transaction
    server
  • Processes business transactions (sales, status,
    ...)
  • Hot response cache: a file server
  • Stores pre-cooked responses (custom greetings,
    cross-selling propositions, promotions, XML data,
    reports, OLAP cubes, data mining studies,
    aggregations)
  • Updated by batch jobs
  • Guaranteed response time requests
  • Return default object if the desired object is
    not cached
  • Accelerated response time requests
  • Compute desired object if not in cache
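
The two request types differ only in what happens on a cache miss. A minimal sketch, assuming an in-memory dictionary stands in for the file server; the class and method names are hypothetical, and storing a computed object back into the cache is an assumption the slide does not state.

```python
class HotResponseCache:
    """Toy cache of pre-cooked responses, keyed by object name."""

    def __init__(self, default):
        self._store = {}
        self._default = default

    def put(self, key, value):
        # In the talk, batch jobs are what update the cache.
        self._store[key] = value

    def guaranteed(self, key):
        """Guaranteed response time: never compute, fall back to default."""
        return self._store.get(key, self._default)

    def accelerated(self, key, compute):
        """Accelerated response time: compute the object on a miss."""
        if key not in self._store:
            # Assumption: keep the computed object for later requests.
            self._store[key] = compute(key)
        return self._store[key]


cache = HotResponseCache(default="Welcome!")
cache.put("greeting:42", "Welcome back, Ann!")
```

A guaranteed request for an unknown visitor returns the default greeting immediately, while an accelerated request pays the computation cost once.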

11
The Webhouse System
  • The webhouse is devoted to publishing the
    company's data assets
  • Balancing real-time versus off-line queries
  • The webhouse application server serves qualified
    users
  • Ad hoc querying, status reporting, data mining,
    decision support
  • Security is managed by the webhouse application
    server
  • The database servers cannot be accessed directly
    by users
  • The data webhouse is a fully distributed system
  • Multiple hot response caches, application
    servers, databases, ...

12
Break !
  • The click-stream is larger, messier, and more
    expressive than other data sources
  • The data webhouse architecture must support
  • Timeliness
  • Very large data volumes
  • Low response times

13
Tracking Website User Actions
  • In a physical store we only know the final
    outcome
  • Only actual sales are recorded; browsing is not
  • In a web store we know the total shopping session

14
User Actions
  • Searching
  • Information gathering
  • Entertainment
  • Education
  • Communication
  • Status tracking
  • Downloading
  • Shopping and ordering
  • Accidental entry

15
Steps in Product Purchase
  • Recognition of need
  • URL on bottle: use a special sub-domain
  • Trying to find what's needed
  • Initial action not successful
  • Searching
  • Several queries necessary (too many or wrong
    results)
  • Comparing search strings to product selection is
    interesting
  • Selection
  • Selecting a specific product and rejecting all
    others
  • Cross-selling and up-selling
  • Branded goods versus house brands, natural
    products
  • Deferred processing can produce customized
    greetings
  • Checkout
  • Validation of customer-provided information is
    important
  • Post-order processing

16
Software and Content Purchase
  • Two additional steps compared to physical
    products
  • Trials and demos
  • Part of MP3 file
  • Demo versions of games
  • Downloads
  • Delay until the end of a session
  • Handle interrupted downloads

17
Elements of Tracking
  • User origin
  • Default home page
  • Portals: referral records may contain origin URL
    and query string
  • Browser bookmarks
  • Click-through: referring site in referral record
  • Session identification
  • Session IDs (temporary or final)
  • Identify session using IP address
  • Session-level cookies
  • SSL logins
  • Placing session id in (hidden) query string
    (dynamic pages)
  • Persistent cookies
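
As a rough illustration of how these identifiers combine, the sketch below groups hits into sessions, using a cookie id when one is present and falling back to the IP address otherwise. The 30-minute inactivity limit and the tuple layout of a hit are assumptions for the example; the talk does not prescribe either.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed inactivity limit, not from the talk


def sessionize(hits):
    """Assign a session number to each hit.

    hits: list of (timestamp, ip, cookie_id_or_None), sorted by timestamp.
    The visitor key is the cookie id when present, else the IP address.
    """
    last_seen = {}    # visitor key -> (last timestamp, session number)
    next_session = 0
    result = []
    for ts, ip, cookie in hits:
        key = cookie or ip
        previous = last_seen.get(key)
        if previous is None or ts - previous[0] > TIMEOUT:
            session = next_session      # new visitor or timed-out session
            next_session += 1
        else:
            session = previous[1]       # continue the open session
        last_seen[key] = (ts, session)
        result.append(session)
    return result
```

The fallback to IP addresses is the weak point: several users behind one proxy share an address, which is one reason cookies and logins appear in the list above.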

18
User Identification
  • Wish for anonymity
  • Don't mix identity and warehouse data
  • Users give false identities
  • Always validate information
  • Householding
  • More than one person uses the same computer =>
    mixing of behavior of several persons
  • Logins can (sometimes) be used to solve this
    problem
  • Roaming users
  • The same persons use several computers
  • Attractive market segment !
  • Logins ?

19
Behavioral Analysis
  • Entry point
  • Via link to subsidiary page ?
  • Important marketing and design information
  • Dwell
  • Short, negative or very long dwell time ?
  • Compare actual and expected dwell time
  • Querying
  • Search words are very valuable (add metatags and
    keywords)
  • Some users prefer drill-down table of contents
  • Intra-site navigation
  • Hit and run versus window shopping
  • Exit point
  • HTTP has no log-off
  • Do we want the user to exit here ?
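
Dwell time can only be inferred from the gap between successive requests, and because HTTP has no log-off the last page of a session has no measurable dwell. A small sketch of that calculation, assuming each session is already available as a time-ordered list of (timestamp, URL) pairs:

```python
from datetime import datetime


def dwell_times(session):
    """Return (url, dwell seconds) pairs for one session.

    session: list of (timestamp, url), sorted by timestamp.
    The last page gets None because no later request tells us
    when the user left it.
    """
    result = []
    for (ts, url), (next_ts, _) in zip(session, session[1:]):
        result.append((url, (next_ts - ts).total_seconds()))
    if session:
        result.append((session[-1][1], None))
    return result


example = [
    (datetime(2001, 1, 1, 12, 0, 0), "/"),
    (datetime(2001, 1, 1, 12, 0, 40), "/search"),
    (datetime(2001, 1, 1, 12, 3, 0), "/product/7"),
]
```

In the same structure, the entry point is simply `session[0][1]` and the exit point `session[-1][1]`.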

20
Personalization Requirements
  • Personalization versus customization
  • Recognition of re-visits
  • Inaccurate recognition is worse than no
    recognition !
  • User interface and content personalization
  • Do not personalize too much
  • Collateral and impulse sales
  • Suggest only paperbacks
  • Active collaborative filtering
  • Combine user-given preferences with behavior
    analysis
  • Calendar and life-style events
  • Remember Mother's Day sales
  • Life-style tracking => lifetime relationship with
    the customer
  • Localization
  • Canadian customers should have different content
    than US ones

21
Click-Stream-Based Decisions
  • Data => information => knowledge => decisions
  • Data in itself is useless
  • The data webhouse should be a decision-support
    system
  • Business decisions are not based on IT alone
  • The data webhouse need not claim all credit for
    decisions
  • Calculating Return On Investment (ROI)
  • Customer Relationship Management (CRM)
  • The data webhouse is a very important part

22
Identifying and Recognizing
  • Customizing marketing by identifying customers
  • High-profit versus low-profit customers
  • New versus returning customers
  • How good is the identifier ?
  • Temporary or stable, machines or humans, person
    or customer
  • Don't seek identity too aggressively !
  • Clustering customers
  • Recency, frequency, intensity (money)
  • High recency, high frequency, low intensity is
    not interesting
  • Supporting referring links
  • Are customers from this referrer likely to buy ?
  • Customer about to leave ?
  • Quick clicks may or may not be bad (returning
    customers)
  • When and how should we intervene to prevent
    customer loss ?
  • Marketing decisions require click-stream and sales
    data
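
The recency, frequency, and intensity measures used for clustering can be computed directly from sales records. The following is a sketch under assumptions: the input layout (customer, date, amount) is hypothetical, and the function only produces the three raw measures; how to cut them into clusters is left open, as in the talk.

```python
from datetime import date


def rfi(purchases, today):
    """Summarize purchases per customer.

    purchases: list of (customer, purchase date, amount).
    Returns {customer: (days since last purchase,
                        number of purchases,
                        total money spent)}.
    """
    summary = {}
    for customer, day, amount in purchases:
        last, count, total = summary.get(customer, (None, 0, 0.0))
        if last is None or day > last:
            last = day
        summary[customer] = (last, count + 1, total + amount)
    return {c: ((today - last).days, count, total)
            for c, (last, count, total) in summary.items()}


purchases = [
    ("ann", date(2001, 3, 1), 20.0),
    ("ann", date(2001, 3, 20), 5.0),
    ("bob", date(2001, 1, 10), 400.0),
]
scores = rfi(purchases, today=date(2001, 3, 30))
```

Here "ann" bought recently and often but spent little, which is the kind of customer the slide calls not interesting.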

23
Decisions About Communicating
  • Is a particular ad working ?
  • Soft causal effect, hard to measure
  • Link customer purchases to ads
  • Causal dimension needed in click-stream data
  • Access needed to both referrers and own
    click-stream data
  • Are custom greetings working ?
  • Cross- and up-selling
  • Click-stream and sales data needed
  • Is a promotion profitable ?
  • Relating sales lift to promotion cost
  • Did sales of our other products suffer ?
  • Click-stream, sales, promotions, competitive
    intelligence data needed
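
Relating sales lift to promotion cost is simple arithmetic once the inputs are available. A hedged sketch: the formula below, including the use of a single margin for both the promoted product and the other products whose sales suffered, is a simplifying assumption and not a method given in the talk.

```python
def promotion_profit(baseline_sales, promo_sales, margin, promo_cost,
                     lost_sales_other_products=0.0):
    """Estimate the profit contribution of a promotion.

    baseline_sales: expected sales without the promotion
    promo_sales: actual sales during the promotion
    margin: fraction of sales kept as profit (assumed equal for all products)
    promo_cost: what the promotion cost to run
    lost_sales_other_products: sales that other products lost to the promotion
    """
    lift = promo_sales - baseline_sales
    return (lift - lost_sales_other_products) * margin - promo_cost
```

For example, a lift of 4000 with 800 in lost sales elsewhere, a 25% margin, and a promotion cost of 600 leaves a profit of (4000 - 800) * 0.25 - 600 = 200.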

24
Decisions About Communicating
  • Responding to customers' life changes
  • Marriage, divorce, having children, moving,
    retirement, ...
  • Click-stream, sales, demographics data needed
  • Improving website effectiveness
  • Return targets
  • Session killers
  • Search engine findability
  • Can common tasks be performed easily ?
  • Fostering a sense of community
  • Communities based on interest
  • Good business to bring community together at your
    website
  • Click-stream and customer communication data
    needed

25
Web Business Decisions
  • What should be provided over the web ?
  • Products and services that can be described,
    ordered, and tracked
  • How many button clicks to order something ? (new
    or return cust.)
  • Providing real-time status tracking
  • What status measures are requested ?
  • How recent is the status data ?
  • How easy is it to get the status ?
  • Data for each part of the supply chain needed
  • Determining whether the web business is
    profitable
  • Which customers and products are profitable ?
  • When are we profitable ?
  • Is the web business profitable overall ?
  • Compute profit and loss statement for each
    customer session
  • Gross revenue, net revenue, gross profit, net
    profit, ...
  • Data needed about each source of cost
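
A per-session profit and loss statement can be sketched as a chain of subtractions. The definitions used here (net revenue as gross revenue minus returns and discounts, gross profit as net revenue minus cost of goods, net profit as gross profit minus all other allocated costs) are common accounting conventions assumed for the example, and the cost categories are hypothetical.

```python
def session_profit_and_loss(gross_revenue, returns_and_discounts,
                            cost_of_goods, session_costs):
    """Compute a small profit and loss statement for one customer session.

    session_costs: mapping from cost source to the amount allocated
    to this session (one entry per source of cost).
    """
    net_revenue = gross_revenue - returns_and_discounts
    gross_profit = net_revenue - cost_of_goods
    net_profit = gross_profit - sum(session_costs.values())
    return {
        "gross_revenue": gross_revenue,
        "net_revenue": net_revenue,
        "gross_profit": gross_profit,
        "net_profit": net_profit,
    }


statement = session_profit_and_loss(
    gross_revenue=100.0,
    returns_and_discounts=10.0,
    cost_of_goods=60.0,
    session_costs={"web_hosting": 1.5, "shipping": 8.0, "referral_fee": 2.5},
)
```

Summing such statements by customer, product, or time period answers the three profitability questions on the slide.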

26
Break !
  • Decisions based on click-stream data
  • Many other data sources needed
  • Sales transactions
  • Promotions
  • Demographics
  • Referrers' click-stream
  • ...
  • Decisions about
  • Identifying and recognizing customers
  • Communication with the customers
  • The web business

27
The Click-stream
  • Communication between customer and website
  • Cookie information
  • Referrers

28
Web Client/Server Interactions
  • A link is clicked for yoursite.com
  • HTTP request for page
  • The page contains an image
  • Separate HTTP request for image
  • The page contains a banner ad
  • HTTP request to banner server (different from
    yoursite)
  • Banner site reads cookie to determine user
    identity
  • Referrer determines pay for ad
  • Hidden link
  • HTTP request to profiler site
  • Profiler site reads cookie
  • Profiler site sends demographic information to
    yoursite

29
Proxy Servers and Caches
  • Proxy servers
  • Forward proxies have three problems
  • Outdated content
  • Web server not notified
  • User identification only possible with cookies
  • NOCACHE tags not always respected
  • Reverse proxies are not a problem
  • Browser caches
  • Most browsers cache visited pages
  • When pages are revisited (BACK), they are read
    from the cache
  • The web server is not notified
  • This means that the click-stream is not complete
    !
  • Multiple browser windows cause problems

30
Web Server Logs
  • CLOG: Common Log Format
  • ECLF: Extended Common Log Format
  • Log information
  • Host: IP address
  • Dynamic IP addresses for ISP users
  • Domain names provide useful information (index
    spider, ...)
  • Domain lookup shouldn't be done in real-time
  • Ident: arbitrary ID supplied by identd clients,
    seldom used
  • Authuser: SSL user ID
  • Time: time when the request is completed
  • Request: mostly GET or POST actions
  • GET /images/under-c.gif HTTP/1.0
  • POST /cgi-bin/dns_check HTTP/1.0
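
A Common Log Format line carries these fields in a fixed order, followed by the status code and byte count. The sketch below parses one such line with a regular expression; the sample host address and timestamp are made up for illustration, while the request is the GET example from the slide.

```python
import re
from datetime import datetime

# host ident authuser [time] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Parse one Common Log Format line into a dict, or return None."""
    match = CLF.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the bytes field means that no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    parts = record["request"].split()
    record["method"] = parts[0] if parts else None
    record["path"] = parts[1] if len(parts) > 1 else None
    return record


SAMPLE = ('192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /images/under-c.gif HTTP/1.0" 200 2326')
record = parse_clf(SAMPLE)
```

The two "-" values in the sample are the ident and authuser fields, which are usually empty for anonymous requests.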

31
Web Server Logs
  • Status: three-digit status code
  • 200 OK, 404 Not found, ...
  • Bytes: number of bytes sent to the client
  • Referrer: text string indicating the source of
    request
  • http://www.webcom.com/megasite/ -> /index.html
  • User-agent: name and version of client software
  • Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)
  • Filename: part of URL specifying path of accessed
    file
  • Time-to-serve: time taken to serve request
  • Good for calculating dwell time
  • IP address
  • Server Port: usually 80 for HTTP and 443 for SSL
  • Process ID: Web server process that serviced
    request
  • URL: scheme://hostname:port/documentpath?querystring
  • http://ralphkimball.com:443/seminars/schedule.html?tokyofall2001
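
The URL components listed above map directly onto Python's standard `urllib.parse.urlsplit`. The sketch uses the slide's example URL with the "://" and ":" separators written out, on the assumption that the transcript dropped them; the helper name `url_parts` and its dictionary keys are chosen here to mirror the slide's terms.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into the components named on the slide."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port,            # None when the URL has no port
        "documentpath": parts.path,
        "querystring": parts.query,
    }


parts = url_parts(
    "http://ralphkimball.com:443/seminars/schedule.html?tokyofall2001"
)
```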

32
Cookies
  • Cookies: files placed on client machine by web
    server
  • Two kinds
  • Session level: only stored in memory
  • Persistent: stored on client disk
  • Cookies can only be read from the domain set in
    cookie
  • Not necessarily the original server domain
  • Cookie contents
  • Name: cookie name
  • Value: string value
  • Expires: GMT expiration time
  • Domain: domain that can read the cookie
  • Path: can only be read from path in domain
  • Secure: cookie must be transmitted via SSL
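
The cookie fields above correspond to the attributes of a `Set-Cookie` header. A minimal sketch using Python's standard `http.cookies` module; the helper name `build_cookie`, the domain `.example.com`, and the sample values are placeholders for illustration. Leaving out `expires` gives a session-level cookie, and supplying it gives a persistent one.

```python
from http.cookies import SimpleCookie


def build_cookie(name, value, domain, path="/", expires=None, secure=False):
    """Return the text of a Set-Cookie header value."""
    cookie = SimpleCookie()
    cookie[name] = value
    morsel = cookie[name]
    morsel["domain"] = domain      # domain that can read the cookie
    morsel["path"] = path          # readable only from this path
    if expires is not None:
        morsel["expires"] = expires    # GMT expiration time (persistent)
    if secure:
        morsel["secure"] = True        # send only over SSL
    return morsel.OutputString()


session_cookie = build_cookie("id", "7c56e94f", ".example.com", secure=True)
persistent_cookie = build_cookie(
    "id", "7c56e94f", ".example.com",
    expires="Wed, 01 Jan 2031 00:00:00 GMT",
)
```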

33
Cookie Examples
  • Netscape format
  • Portal: Yahoo
  • .yahoo.com/ TRUE / FALSE / 1271362380 B ad34deft
  • Leading . and trailing / gives flexibility
  • Ad server: Doubleclick
  • doubleclick.net TRUE / FALSE 192385398 id
    7c56e94f
  • Doubleclick reads cookie when ad is requested
  • Profiler: Matchlogic
  • .preferences.com TRUE / FALSE 118294394 ID
    t5j54j3k3llkk7fk
  • Value contains error-correcting information
  • Unique id into demographics database
  • Microsoft GUID: same id for several sites
  • microsoft.com TRUE / FALSE MC1 GUID=1234567890
  • msn.com TRUE / FALSE MC1 GUID=1234567890
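
In a Netscape-format cookie file, each line holds seven tab-separated fields: domain, a flag saying whether subdomains may read the cookie, path, secure flag, expiry time in seconds since the Unix epoch, name, and value. The examples above appear to have lost their tabs in transcription, so the sketch below uses a normalized version of the Yahoo line; that normalization and the names `CookieRecord` and `parse_cookie_line` are assumptions.

```python
from typing import NamedTuple


class CookieRecord(NamedTuple):
    domain: str
    include_subdomains: bool
    path: str
    secure: bool
    expires: int      # seconds since the Unix epoch
    name: str
    value: str


def parse_cookie_line(line):
    """Parse one line of a Netscape-format cookie file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated fields, got %d" % len(fields))
    domain, flag, path, secure, expires, name, value = fields
    return CookieRecord(domain, flag == "TRUE", path, secure == "TRUE",
                        int(expires), name, value)


yahoo = parse_cookie_line(".yahoo.com\tTRUE\t/\tFALSE\t1271362380\tB\tad34deft")
```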

34
System Issues
  • Universal identifiers
  • "Read by anyone" cookies
  • Not used for privacy reasons
  • Ethernet hardware address
  • Works only for computers with network card
  • Pentium III serial numbers
  • Can be disabled in BIOS
  • Query strings
  • Contain label=value pairs
  • Can be used for any information, not just
    queries
  • Templates
  • Used for dynamic content
  • Encode both template and content into URL, to
    preserve both in the click-stream
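
One way to picture the last point is a URL whose query string names both the template and the content it was filled with, so that a log line for the request records both. The parameter names `template`, `content`, and `sid` and the helper names are hypothetical, chosen only for this sketch.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def tracked_url(base, template, content, session_id):
    """Build a URL that records template, content, and session id."""
    query = urlencode({"template": template,
                       "content": content,
                       "sid": session_id})
    return base + "?" + query


def read_query(url):
    """Recover the label=value pairs from a URL's query string."""
    pairs = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per label; keep the first value of each.
    return {label: values[0] for label, values in pairs.items()}


url = tracked_url("http://yoursite.com/page", "product", "sku 42", "abc123")
```

Because the query string is written to the web server log with the rest of the request, the analysis side can call `read_query` on logged URLs to see which template and content the user was actually shown.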

35
Conclusion
  • Data Warehousing and Data Webhousing
  • User actions on a website
  • Making decisions based on click-streams
  • What data is needed ?
  • The click-stream as a data source
  • Proxies, cookies, ...
  • We can do quite a bit with click-stream data
  • But user identity is hard to guess !
  • Data webhouses will become
  • Very common
  • Very important

36
Discussion
  • Is data webhousing useful ?
  • Do you expect to work with this in the future ?
  • Does the web log contain enough information ?

37
Next time
  • Double lecture Thursday 12.30-16.00 !
  • Chapters 5-9,11,14, and 16