Web Mining - PowerPoint PPT Presentation

Title: Web Mining


1
Web Mining
  • What we would like to teach you
  • Web Mining, what can it do for you?
  • Why do we want to mine the web?
  • What should we mine?
  • What type of problems do we have when Mining?

2
Web Mining cont.
  • How do we mine the web?
  • What problems will we have along the way?
  • Our data requires users, how can we tell who is
    who?
  • How do we track users?
  • Knowing how we can track the users, what type of
    data do we need?

3
Background Information
  • What you need to know before mining the web.
  • There are three types of web mining
  • Web usage mining
  • Web content mining
  • Web structure mining

4
Web Usage Mining
Web usage mining is a type of data mining that
looks at how users use and navigate a web
site. Early web usage mining only reported user
activity. Now we look to find patterns in the
user activity.

5
Web Content Mining
  • Web content mining tries to discover useful
    information regarding the content of the page.
  • Many times it is text mining with little regard
    to the structure of the page itself.
  • Goal of finding useful information about text
    video or images.

6
Web Structure Mining
  • Web structure mining analyzes the connections
    and layout of a web site.
  • Types of connections
  • Hyperlinks
  • HTML/XML tags

7
Knowing all of this, what can we do?
  • We can help eCommerce sites better market their
    items and services.
  • We can personalize our web sites.
  • We can better present data with smarter
    entry/exit points.

8
Web Mining Steps
  • Pre-process your data
  • Discover patterns
  • Pattern Analysis

9
Pre-Processing
  • Web logs contain varied information. What do we
    want?
  • For example, we need to untangle
  • Single IP address / multiple server sessions
  • Multiple IP addresses / single server session
  • Multiple IP addresses / single user
  • Multiple user agents / single user

10
Pattern Discovery
  • We can use our Data Mining Techniques
  • Association rule Mining
  • Classification
  • Clustering
  • Outlier detection

11
What to do with our rules?
  • eCommerce web sites can use rules to find out who
    else will purchase similar items.
  • We can use rules to generate advertisements
    tailored to personal tastes.
  • News sites can lay out their pages to give quicker
    paths to the most used data, or customize them
    for the specific user.

12
Data and Errors/Noise
  • Sample Data and explanation
  • Noise problems
  • Main problem with HTTP
  • IP problems
  • NAT, Proxies, VPN, and remote access problems
  • Small viewing problem
  • Bots

13
Sample Data
  • fcrawler.looksmart.com - - [26/Apr/2000:00:00:12
    -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-"
    "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
  • fcrawler.looksmart.com - - [26/Apr/2000:00:17:19
    -0400] "GET /news/news.html HTTP/1.0" 200 16716
    "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
  • 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400]
    "GET /pics/wpaper.gif HTTP/1.0" 200 6248
    "http://www.jafsoft.com/asctortf/"
    "Mozilla/4.05 (Macintosh; I; PPC)"
  • 123.123.123.123 - - [26/Apr/2000:00:23:47 -0400]
    "GET /asctortf/ HTTP/1.0" 200 8130
    "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF"
    "Mozilla/4.05 (Macintosh; I; PPC)"
  • 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400]
    "GET /pics/5star2000.gif HTTP/1.0" 200 4005
    "http://www.jafsoft.com/asctortf/"
    "Mozilla/4.05 (Macintosh; I; PPC)"
  • 123.123.123.123 - - [26/Apr/2000:00:23:50 -0400]
    "GET /pics/5star.gif HTTP/1.0" 200 1031
    "http://www.jafsoft.com/asctortf/"
    "Mozilla/4.05 (Macintosh; I; PPC)"
  • 123.123.123.123 - - [26/Apr/2000:00:23:51 -0400]
    "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282
    "http://www.jafsoft.com/asctortf/"
    "Mozilla/4.05 (Macintosh; I; PPC)"
  • 123.123.123.123 - - [26/Apr/2000:00:23:51 -0400]
    "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow
    HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/"
    "Mozilla/4.05 (Macintosh; I; PPC)"

14
Explanation of Data
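  • ppp931.on.bellglobal.com: the host (or IP
    address) that made the request
  • - -: the identity and username fields, both
    empty ("-") here
  • [26/Apr/2000:00:16:12 -0400]: date, time, and
    time zone offset of the request
  • "GET /download/windows/asctab31.zip HTTP/1.0":
    request method, path, and protocol version
  • 200: HTTP status code (success)
  • 1540096: size of the response in bytes
  • "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html":
    the referrer, i.e. the page that linked to this
    request
  • "Mozilla/4.7 [en]C-SYMPA (Win95; U)": the user
    agent (browser and operating system)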
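
The field breakdown above can be turned into a small parser. The following is a minimal sketch, assuming the log uses the common "combined" layout with a bracketed timestamp and quoted request, referrer, and user agent; the names LOG_PATTERN and parse_line are illustrative and not part of the original slides.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Split the request into method, path, and protocol.
    method, _, rest = fields["request"].partition(" ")
    path, _, protocol = rest.rpartition(" ")
    fields.update(method=method, path=path, protocol=protocol)
    return fields

sample = ('123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] '
          '"GET /pics/wpaper.gif HTTP/1.0" 200 6248 '
          '"http://www.jafsoft.com/asctortf/" '
          '"Mozilla/4.05 (Macintosh; I; PPC)"')
record = parse_line(sample)
print(record["host"], record["method"], record["path"], record["status"])
```

Lines that do not fit the pattern return None, so malformed entries can be counted and set aside rather than silently mis-parsed.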

15
Noise Problem: HTTP
  • Stateless protocol
  • The server forgets that a user ever came to the
    site
  • Unable to keep track of any interactions between
    users and server

16
Noise IP Problem: NAT
  • Network Address Translation
  • Aka IP masquerading
  • Each connection has its own IP address
  • However, NAT converts them into one general IP
    address used by everyone
  • E.g., 192.168.24.5, 192.168.24.8, and
    192.168.24.38 all appear to the server as
  • 56.23.92.1

17
Noise IP Problem: Proxies
  • Like NAT in that they change your IP address
  • They differ in that they operate at a higher
    level
  • Happens, for example, when you access library
    resources
  • More control than NAT

18
Noise IP Problem: VPN
  • Virtual Private Network
  • Use of tunnels
  • Used to connect to other computers on the VPN
  • The IP address seen is the one used within the VPN

19
Noise IP Problem: Remote Access
  • Citrix, VPC, SSH, Putty, and others
  • Using programs to access computers from a remote
    area
  • Actually able to use programs on remote computer
    rather than just view files.

20
Noise Problem: Viewing
  • What counts as viewing a page?
  • One page will most likely generate multiple GET
    requests
  • Variety of types
  • .html, .js, .doc, .zip, .jpg, various web apps
    such as .cgi and .php
  • Discerning between important and irrelevant types
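
One simple way to separate page views from embedded resources is to filter on the file extension. This is only a sketch: the set of extensions treated as irrelevant is an assumption and would need tuning for a real site, and is_page_view is an illustrative name.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed list of extensions that are embedded resources, not pages.
RESOURCE_EXTENSIONS = {".gif", ".jpg", ".png", ".js", ".css"}

def is_page_view(path):
    """Treat a request as a page view unless it targets an embedded resource."""
    # urlsplit drops any query string before the extension is examined.
    extension = posixpath.splitext(urlsplit(path).path)[1].lower()
    return extension not in RESOURCE_EXTENSIONS

requested = ["/asctortf/", "/pics/wpaper.gif", "/pics/5star2000.gif",
             "/pics/a2hlogo.jpg", "/news/news.html"]
page_views = [path for path in requested if is_page_view(path)]
print(page_views)
```

With this filter the image requests triggered by a single page load are dropped, leaving one entry per page the visitor actually asked for.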

21
Noise Problem: Bots
  • Web robots
  • Why are bots bad?
  • Bots sometimes identify themselves in the User
    Agent field
  • Problems occur when they do not
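
For bots that do announce themselves, a substring check on the user agent is a reasonable first filter. The marker list below is an assumption and far from exhaustive, and this sketch cannot catch bots that present a browser-like user agent.

```python
# Assumed markers that commonly appear in crawler user agents.
BOT_MARKERS = ("crawler", "spider", "bot", "slurp")

def looks_like_bot(user_agent):
    """Return True if the user agent string contains a known bot marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

agents = [
    "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)",
    "Mozilla/4.05 (Macintosh; I; PPC)",
]
for agent in agents:
    print(looks_like_bot(agent), agent)
```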

22
Tracking User Sessions
  • A session: a series of URLs visited in order
    within a given time frame
  • Techniques to add state to HTTP
  • HTTP Authentication
  • Client-side cookies
  • URL cookies
  • Hidden form fields
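
When none of these techniques is available, sessions are often approximated from the log alone. The sketch below groups requests by host and user agent and starts a new session after a gap; the 30-minute timeout and the function name sessionize are assumptions, not something the slides specify.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"      # e.g. 26/Apr/2000:00:23:47 -0400
SESSION_TIMEOUT = timedelta(minutes=30)   # assumed inactivity cutoff

def sessionize(records):
    """Group (host, agent, time string, path) tuples into ordered sessions."""
    sessions = []
    open_sessions = {}  # (host, agent) -> (index into sessions, last seen)
    parsed = [(host, agent, datetime.strptime(stamp, TIME_FORMAT), path)
              for host, agent, stamp, path in records]
    for host, agent, when, path in sorted(parsed, key=lambda r: r[2]):
        key = (host, agent)
        current = open_sessions.get(key)
        if current is None or when - current[1] > SESSION_TIMEOUT:
            sessions.append([])           # start a new session
            index = len(sessions) - 1
        else:
            index = current[0]            # continue the open session
        sessions[index].append(path)
        open_sessions[key] = (index, when)
    return sessions

agent = "Mozilla/4.05 (Macintosh; I; PPC)"
records = [
    ("123.123.123.123", agent, "26/Apr/2000:00:23:48 -0400", "/pics/wpaper.gif"),
    ("123.123.123.123", agent, "26/Apr/2000:00:23:47 -0400", "/asctortf/"),
    ("123.123.123.123", agent, "26/Apr/2000:01:30:00 -0400", "/news/news.html"),
]
sessions = sessionize(records)
print(sessions)
```

Because this heuristic relies on the IP address and user agent, it inherits the NAT, proxy, and shared-browser problems described earlier.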

23
HTTP Authentication
  • How it Works
  • Visitor accesses a URL that requires HTTP
    authentication
  • The server sends a response to the browser asking
    for credentials
  • The browser asks the visitor for credentials and
    sends them to the server
  • The server validates the credentials and logs the
    username along with the URL accessed
  • The browser caches the authentication information
    for future HTTP requests until the browser window
    is closed
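
As an illustration of the credential exchange, the sketch below assumes the "Basic" scheme, in which the server answers with status 401 and a WWW-Authenticate header, and the browser replies with an Authorization header carrying the base64-encoded "username:password". The helper names and the sample credentials are hypothetical.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value a browser would send."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

def username_from_header(header_value):
    """Recover the username a server could log, or None for other schemes."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        return None
    decoded = base64.b64decode(token).decode("utf-8")
    return decoded.partition(":")[0]

# Hypothetical credentials, for illustration only.
header = basic_auth_header("student1", "not-a-real-password")
print(header)
print(username_from_header(header))
```

Base64 is an encoding, not encryption, so Basic credentials are only protected when the connection itself is encrypted.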

24
HTTP Authentication
An example of a site that requires HTTP
authentication

25
HTTP Authentication
  • Advantages
  • Part of the HTTP protocol
  • Username logged to standard server logs
  • Easy to keep track of unique users
  • Disadvantages
  • Anonymous access lumped together
  • Usernames need to be unique per person
  • Users dislike accounts for viewing web pages

26
Client-side cookies
  • How it works
  • Visitor accesses a URL
  • The server appends a cookie to the HTTP response
  • The browser saves the cookie
  • The browser sends the cookie back to the server
    with future HTTP requests
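
The cookie round trip can be sketched with Python's standard http.cookies module. The cookie name visitor_id, the one-year lifetime, and the helper names are assumptions made for the example.

```python
import uuid
from http.cookies import SimpleCookie

def new_visitor_cookie():
    """Create a cookie carrying a random visitor identifier."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = uuid.uuid4().hex            # hypothetical cookie name
    cookie["visitor_id"]["path"] = "/"
    cookie["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # assumed: one year
    return cookie

def visitor_id_from_request(cookie_header):
    """Read the visitor identifier from a request's Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None

issued = new_visitor_cookie()
# Value the server would place in a Set-Cookie response header.
set_cookie_value = issued["visitor_id"].OutputString()
print("Set-Cookie:", set_cookie_value)

# Header the browser would send back on later requests.
returned = visitor_id_from_request("visitor_id=" + issued["visitor_id"].value)
print(returned)
```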

27
Client-side cookies
An example cookie from cnet.com

28
Client-side cookies
  • Advantages
  • Transport is transparent to the user
  • Works even if user closes the browser window
  • Can be logged by the web server
  • Cookies are required by many useful web sites
  • Users will be used to accepting them
  • Wellsfargo.com, discovercard, citicards.com, and
    chase.com all required cookies when tested.
  • facebook.com, gmail.com, hotmail.com, and
    myspace.com all required cookies when tested.
  • Disadvantages
  • The browser can decide what to do with cookies
    (reject/delete them)
  • Multiple people can share a browser
  • The browser's date/time must match the server's
  • Computer cleaning applications often delete
    cookies for privacy reasons

29
Client-side cookies
Firefox 2.02 Privacy Options for Cookies

30
Client-side cookies
A site that requires cookies

31
URL cookies
  • How it works
  • Visitor accesses a URL
  • The server generates all links on the returned
    page with a unique session identifier in the
    URL. Example: csbsju.edu/SESSIONID/home.html
  • When the visitor clicks on a link the URL
    including the session ID is logged
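
A rough sketch of this link rewriting follows, assuming the session ID is inserted as the first path segment, as in the example URL above. The regular expression only handles simple site-relative href attributes, and both function names are hypothetical.

```python
import re

# Matches site-relative links such as href="/home.html" (not "//host/...").
HREF_PATTERN = re.compile(r'href="(/(?!/)[^"]*)"')

def add_session_to_links(html, session_id):
    """Prefix every site-relative href with the session ID."""
    return HREF_PATTERN.sub(
        lambda match: f'href="/{session_id}{match.group(1)}"', html)

def split_session(path):
    """Split a logged path like /ID/home.html into (ID, /home.html)."""
    _, session_id, rest = path.split("/", 2)
    return session_id, "/" + rest

page = '<a href="/home.html">Home</a> <a href="http://example.org/">Out</a>'
rewritten = add_session_to_links(page, "abc123")
print(rewritten)
print(split_session("/abc123/home.html"))
```

When the log is mined, split_session recovers the identifier so that requests can be grouped by session before the original path is analyzed.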

32
URL cookies
  • Advantages
  • Will work when client-side cookies won't
  • Disadvantages
  • All the links on a page have to be dynamically
    generated
  • Visitors can bookmark URLs and send them to
    others, so return visits based on old session IDs
    may be noisy data
  • The URLs are confusing to read
  • Won't automatically track return visitors without
    a bookmark

33
Hidden form fields
  • How it works
  • Visitor accesses a URL
  • The server generates a unique session identifier
    and places it within the HTML of the page
  • The user clicks a form submit button or
    JavaScript controlled link which submits the form

34
Hidden form fields
  • Advantages
  • Will work with all browsers all the time if form
    buttons are used without JavaScript
  • Transparent to the user
  • Disadvantages
  • Using GET instead of POST will have the same
    problems as URL Cookies
  • May require JavaScript for the best user
    transparency
  • All pages must be dynamically generated by the
    web server
  • Web logs don't contain POST information

35
Data Intelligence Processes
  • Association Rule Mining
  • Used to discover web structure
  • Used to discover user access patterns
  • Useful because
  • Advertising
  • Can help to determine web structure
  • Notion of hubs and authorities making up a web
    community
  • E.g., link analysis such as Google's PageRank
    algorithm
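
To make the access-pattern use of association rules concrete, the sketch below computes support and confidence for page pairs across sessions. It is a brute-force illustration rather than a full algorithm such as Apriori; the thresholds and the page paths are made up for the example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules between pairs of pages."""
    page_sets = [set(session) for session in sessions]
    total = len(page_sets)
    page_counts = Counter(page for pages in page_sets for page in pages)
    pair_counts = Counter(pair for pages in page_sets
                          for pair in combinations(sorted(pages), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / total           # share of sessions with both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]  # P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical sessions, each a list of visited paths.
example_sessions = [
    ["/home", "/admissions", "/apply"],
    ["/home", "/admissions"],
    ["/home", "/library"],
    ["/admissions", "/apply"],
]
rules = pair_rules(example_sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} support={support:.2f} confidence={confidence:.2f}")
```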

36
Data Intelligence Processes
  • Classification
  • Predict similar web pages that a user would like
    to visit
  • Customization of a web page
  • Clustering
  • Useful in creating a hierarchy of web pages
    (e.g., Yahoo! hierarchy)
  • Outlier Detection
  • Very limited application in web mining

37
Our Research Objective
  • Mining the CSB/SJU website
  • We've noticed
  • Search function is bad/horrible
  • Use of A to Z Index
  • Take only student network traffic
  • Take different time sections
  • See if any relationships can be discovered
  • Using Enter/Exit vectors
  • See what common sessions are like
  • Are there similar web pages which should be
    linked to one another?
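
One possible reading of "Enter/Exit vectors" is the first and last page of each session. Under that assumption, the most common entry and exit points can be tallied as follows; the function name and the sample paths are hypothetical.

```python
from collections import Counter

def entry_exit_counts(sessions):
    """Count how often each page starts or ends a session."""
    entries = Counter(session[0] for session in sessions if session)
    exits = Counter(session[-1] for session in sessions if session)
    return entries, exits

# Hypothetical sessions, each a list of visited paths in order.
example_sessions = [
    ["/home", "/a-to-z", "/library"],
    ["/home", "/search", "/a-to-z", "/registrar"],
    ["/library"],
]
entries, exits = entry_exit_counts(example_sessions)
print(entries.most_common(1))
print(exits.most_common(1))
```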

38
References
  • http://delivery.acm.org/10.1145/320000/319781/p43-garofalakis.pdf?key1=319781&key2=1356793711&coll=Portal&dl=ACM&CFID=16009993&CFTOKEN=87185929
    Data Mining and the Web: Past, Present and Future,
    Garofalakis M, Rastogi R, Seshadri S, Shim K
  • http://delivery.acm.org/10.1145/850000/846188/p12-srivastava.pdf?key1=846188&key2=6296793711&coll=Portal&dl=ACM&CFID=16009993&CFTOKEN=87185929
    Web Usage Mining: Discovery and Applications of
    Usage Patterns from Web Data, Srivastava J,
    Cooley R, Deshpande M, Tan P
  • http://delivery.acm.org/10.1145/320000/319792/p63-joshi.pdf?key1=319792&key2=2396793711&coll=Portal&dl=ACM&CFID=16009993&CFTOKEN=87185929
    Warehousing and Mining Web Logs, Joshi K,
    Joshi A, Yesha Y, Krishnapuram R

39
References
  • Dataset and explanations,
    http://www.jafsoft.com/searchengines/log_sample.html
    Jafsoft 2005
  • NAT picture,
    http://www.skullbox.net/firewalls/nat.gif
    Skullbox.net 2004
  • VPN picture,
    http://www.internetaccessmonitor.com/eng/support/docs/winroute/img/vpn-scheme.png
    Red Line Software, 2006
  • Various research, www.wikipedia.org, Wikipedia
    2007

40
References
  • http://www.galeas.de/webmining.html