Advanced Topics in Data Mining: Web Mining

Slides: 67
Provided by: leeyu2
Transcript and Presenter's Notes

1
Advanced Topics in Data Mining: Web Mining
2
Web Mining
3
Web Mining
  • Applications are being ported to the Web at a rapid pace
  • On-line services, such as America Online (AOL)
    and CompuServe (merged into AOL), are anxious to
    know user access patterns, not just what users
    search for on the Web
  • How does Amazon do it?
  • Understanding Web user behavior is important
  • It can improve Web page organization
  • It can increase Web server performance
  • It can exploit Web advertising
  • It can increase business opportunities

4
Amazon Web Page
Association Rules
5
More Information Desired
  • Collecting only statistical information (page
    hits) is insufficient, since
  • The hit frequency of a page depends not only on
    its content but also on its location
  • The number of users accessing a page is not
    available
  • Information on which pages are accessed together
    is not available
  • Data mining in the Web (Web Mining)
  • Web Access Pattern Collection
  • Web User Pattern Mining

6
Web Access Pattern Collection
  • Server-Based Data Collection
  • Who is visiting a given Web site, and what are
    they doing?
  • Agent-Based Data Collection
  • What are the Web sites a particular user has
    visited?

7
Server-Based Data Collection
  • Examine the logs collected by HTTPd
  • Access Log (IP, Time, Access Data), Referred Log
    (A→B), Error Log
  • We can combine some of them for our use if
    necessary
  • Problems
  • The use of proxy servers
  • The effect of caching
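
As a rough illustration of examining an access log, the sketch below pulls the three fields named on this slide (IP, Time, Access Data) out of one log line. It assumes the NCSA Common Log Format, which the slides do not specify, and the sample line and the name `parse_access_line` are hypothetical.

```python
import re

# Assumed layout: NCSA Common Log Format
#   host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_line(line):
    """Return (host, time, request) for one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return m.group("host"), m.group("time"), m.group("request")

# Hypothetical sample line, not taken from the slides.
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_access_line(sample))
```

Because of proxies and caching, the host field may stand for many users, and some page views never reach the log at all.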

8
Server-Based Data Collection
9
Access Log
IP/Domain Name
Time
Access Data
10
Referred Log
Caching
11
Server-Based Data Collection
  • Has to keep pace with technology advances
  • The use of Active Server Pages (Session ID
    available)
  • The use of proxy servers
  • The effect of caching
  • HTTPd 1.1
  • Limitation
  • Can only capture user behavior while users are
    within this site

12
Agent-Based Data Collection
  • Understanding individual Web behavior needs
    client-based data collection
  • Results are useful
  • Better Personalized Service
  • Improved Web Page Organization
  • Better Pricing Policies
  • Methods
  • Applets can only read/write files on their source
    servers
  • A big security constraint
  • Using Active Components (ActiveX Control) and
    PlugIns
  • APCS (Access Pattern Collection Server)

13
APCS
14
APCS
15
APCS
16
APCS
17
APCS
18
Agent-Based Data Collection
  • Very difficult to do for non-registered users in
    the current Web environment
  • It has to be conducted with users' consent
  • Very dependent upon available Web technologies

19
Web User Pattern Mining
  • Web user pattern mining discovers user access
    patterns in Web servers
  • Pattern discovery and analysis tools
  • Some existing Web tools provide mechanisms for
    reporting user activity in the servers
  • Web Trends (http://www.webtrends.com.tw/)
  • Open Market (http://www.openmarket.com/)
  • Net.Genesis (http://www.netgen.com/)

20
Path Traversal Patterns Mining
  • Mining path traversal patterns in a distributed
    information-providing environment (WWW) where
    documents or objects are linked together (via
    hyperlinks) to facilitate interactive access
  • Solution procedure consists of three steps
  • Convert the original sequence of log data into a
    set of maximal forward references (MF)
  • This filters out the effect of backward references
  • Backward references are mainly made for ease of
    traveling, so we concentrate on mining meaningful
    user access sequences
  • Some objects are visited because of their
    locations rather than their content
  • Determine the frequent traversal patterns, i.e.,
    large reference sequences, from the maximal
    forward references obtained
  • Determine the maximal reference sequences from
    large reference sequences (Trivial)

21
Step 1: MF References
  • Suppose the traversal log contains the following
    traversal path for a user
  • A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V

When a backward reference occurs, the current
forward reference path terminates.
The set of maximal forward references is ABCD,
ABEGH, ABEGW, AOU, AOV
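
The rule on this slide can be sketched in a few lines of Python. This is one possible reading of the procedure, not necessarily the exact algorithm behind the slides; the function name `maximal_forward_references` is an assumption.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into its maximal forward references."""
    current = []            # forward path built so far
    moving_forward = False  # True if the previous step was a forward reference
    result = []
    for page in path:
        if page in current:
            # Backward reference: emit the forward path once, then back up to that page.
            if moving_forward:
                result.append("".join(current))
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward:
        result.append("".join(current))
    return result

log = "A B C D C B E G H G W A O U O V".split()
print(maximal_forward_references(log))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```

Consecutive backward steps (D to C to B) emit only one path, because the flag is cleared after the first of them.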
22
Step 1: Another Example
23
Step 1: Arrange Database
Encoding
24
Step 1: Database Reduction
Database Reduction
25
Step 2: Find Frequent Reference Sequences
  • Two algorithms for finding Frequent Traversal
    Patterns (Frequent Reference Sequences, Frequent
    Consecutive Subsequences)
  • Full-Scan (FS) Algorithm
  • FS utilizes key ideas of the DHP algorithm
  • Selective-Scan (SS) Algorithm
  • SS reduces the number of database scans
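
To make the target of Step 2 concrete, the sketch below finds frequent consecutive subsequences by brute force. It is not the FS or SS algorithm (no hashing, no database reduction, no scan scheduling); it only shows what those algorithms compute. The input list and the support threshold of 2 are assumptions chosen for illustration.

```python
from collections import Counter

def large_reference_sequences(mf_refs, min_support):
    """Count each consecutive subsequence once per reference; keep the frequent ones."""
    counts = Counter()
    for ref in mf_refs:
        seen = {ref[i:j] for i in range(len(ref)) for j in range(i + 1, len(ref) + 1)}
        counts.update(seen)
    return {seq: n for seq, n in counts.items() if n >= min_support}

# Hypothetical database of maximal forward references.
mf_refs = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
large = large_reference_sequences(mf_refs, min_support=2)
print(sorted(large, key=lambda s: (len(s), s)))
```

Enumerating every substring is quadratic in the length of each reference, which is the cost that candidate pruning in a level-wise algorithm is meant to avoid.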

26
Full-Scan (FS) Algorithm
Generate L1 Hash Table
Scan DB-1
27
Generate L1 Hash Table
Scan DB-1
h(x, y) = ((order of x) × 23 + (order of y)) mod 17
28
Generate C2
29
Generate L2 Reduce DB
Scan DB-2
30
Generate L2 Reduce DB
Scan DB-2
31
Generate C3, L3 Reduce DB
Scan DB-3
32
Generate C4, L4 Reduce DB
Scan DB-4
33
Selective-Scan (SS) Algorithm
Scan DB-3
34
Step 3: Generate Frequent Traversal Patterns
Maximal Reference Sequences
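
A minimal sketch of this step, assuming a maximal reference sequence is a large reference sequence that is not a consecutive part of any longer one. The input list is a hypothetical set of large reference sequences, not data from the slides.

```python
def maximal_reference_sequences(large_seqs):
    """Keep the large reference sequences not contained in any other one."""
    seqs = set(large_seqs)
    return sorted(s for s in seqs if not any(s != t and s in t for t in seqs))

# Hypothetical large reference sequences.
large = ["A", "B", "E", "G", "O", "AB", "BE", "EG", "AO", "ABE", "BEG", "ABEG"]
print(maximal_reference_sequences(large))
# ['ABEG', 'AO']
```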
35
WAP-Mine Algorithm
  • The key consideration is how to facilitate the
    tedious support counting and candidate generating
    operations in the mining procedure
  • Given Web Access Sequence database WAS and a
    support threshold ξ, mine the complete set of
    ξ-patterns of WAS

WAS
User ID  Web Access Sequence
100      abdac
200      eaebcac
300      babfaec
400      afbacfc
36
WAP-Mine Algorithm
(1) Scan WAS once, find all frequent-1 events
(2) Scan WAS again, construct a WAP-tree
(3) Recursively mine the WAP-tree using
conditional search
Access patterns
37
Find All Frequent-1 Events
Item  Support Frequency
a     4
b     4
c     4
d     1
e     2
f     2

User ID  Web Access Sequence
100      abdac
200      eaebcac
300      babfaec
400      afbacfc

Min_Sup = 75%

User ID  Web Access Sequence  Frequent Subsequence
100      abdac                abac
200      eaebcac              abcac
300      babfaec              babac
400      afbacfc              abacc
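
The first scan can be sketched as follows: count in how many sequences each event occurs, keep the events that reach the threshold, and drop the rest from every sequence. The function names are assumptions; the data and the 75% threshold come from this slide.

```python
from collections import Counter

def frequent_events(was, min_sup_ratio):
    """Events whose support (number of sequences containing them) meets the threshold."""
    support = Counter()
    for seq in was.values():
        support.update(set(seq))  # count each event once per sequence
    threshold = min_sup_ratio * len(was)
    return {e for e, n in support.items() if n >= threshold}

def frequent_subsequence(seq, frequent):
    """Drop infrequent events, keeping the order of the remaining ones."""
    return "".join(e for e in seq if e in frequent)

was = {100: "abdac", 200: "eaebcac", 300: "babfaec", 400: "afbacfc"}
freq = frequent_events(was, 0.75)
filtered = {uid: frequent_subsequence(s, freq) for uid, s in was.items()}
print(sorted(freq))
print(filtered)
```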
38
WAP-Tree Construction
  • Using frequent events to register all count
    information for further mining

User ID  Frequent Subsequence
100      abac
200      abcac
300      babac
400      abacc
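
The construction can be sketched as a prefix tree whose nodes carry counts. This simplified version leaves out the header table and the links between nodes with the same label that the mining step would need; the names `WapNode`, `build_wap_tree` and `show` are assumptions.

```python
class WapNode:
    """One node of a simplified WAP-tree: an event label plus a count."""
    def __init__(self, event=None):
        self.event = event
        self.count = 0
        self.children = {}

def build_wap_tree(sequences):
    """Insert each frequent subsequence as a path from the root, sharing prefixes."""
    root = WapNode()
    for seq in sequences:
        node = root
        for event in seq:
            node = node.children.setdefault(event, WapNode(event))
            node.count += 1
    return root

def show(node, depth=0):
    """Print the tree, one 'event:count' per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.event}:{child.count}")
        show(child, depth + 1)

root = build_wap_tree(["abac", "abcac", "babac", "abacc"])
show(root)
```

Three of the four sequences share the prefix "ab", so the branch a, b is stored once with count 3.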
39
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on c

Sequence  Count
aba       2
ab        1
abca      1
ab        -1
baba      1
abac      1
aba       -1

Sequence  Count
aba       1
abca      1
baba      1
abac      1

Item  Support Frequency
a     4
b     4
c     2

Generate Web Access Patterns: ac, bc
40
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on ac

Sequence  Count
ab        3
bab       1

Sequence  Count
ab        3
b         1
bab       1
b         -1

Item  Support Frequency
a     4
b     4

Generate Web Access Patterns: aac, bac
41
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on bac

Sequence  Count
a         3
ba        1

Item  Support Frequency
a     4
b     1

Generate Web Access Patterns: abac
42
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on abac
Sequence Count
a 4
No Web Access Patterns are Generated
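
The patterns named on the preceding slides (ac, bc, aac, bac, abac) can be cross-checked against the original WAS database by brute force. The sketch below is not WAP-mine; it assumes a pattern is supported by a sequence when its events appear in the same order, not necessarily adjacent.

```python
def is_subsequence(pattern, seq):
    """True if the events of pattern occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)

def support(pattern, sequences):
    """Number of sequences that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)

was = ["abdac", "eaebcac", "babfaec", "afbacfc"]
for p in ["ac", "bc", "aac", "bac", "abac"]:
    print(p, support(p, was))
```

With Min_Sup at 75% of four sequences, a pattern needs a support of at least 3 to qualify.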
43
Mining for Web Transactions
  • To capture Web customer buying behavior
  • It is not just market basket transaction for the
    set of items bought by a customer in a single
    purchase (Association Rules)
  • It is not just Web user travel patterns (Path
    Traversal Patterns)
  • It is an extension from path traversal patterns
  • Exploring the relationship between traveling and
    buying

44
Mining for Web Transactions
Web Transaction
Algorithm WR (Web-transaction-Record)
Web Transaction Records <Path: a Set of Purchases>
Algorithm WTM, MTSPJ, MTSPC
Frequent Transaction Patterns
Web Transaction Association Rules
45
Mining for Web Transactions
  • Web-transaction-Record (WR) Algorithm
  • Extract meaningful Web transaction records from
    the given Web transaction
  • WTM (Web Transaction Mining) Algorithm
  • Mining Web Transaction Patterns
  • MTS (Maximal Transaction Segment) Algorithms are
    improved versions of WTM

46
Mining for Web Transactions
47
Mining for Web Transactions
48
WTM Algorithm
  • It joins the purchased itemsets for generating
    candidate transaction patterns
  • WTM employs a two-level hash tree, called Web
    transaction tree, to store candidate transaction
    patterns
  • WTM hashes not only each item but also each
    purchase in the path

49
WTM Algorithm
50
Support Count
WT_ID  Path   Purchase
100    ABCE   Bi1, Ci2, Ei4
100    ABFGH  Bi1, Hi6
100    ASJL   Si7, Li9
200    ABCE   Bi1, Ci2, Ei4
200    ASJLQ  Si7, Qi10

Path  Purchase  Support Count
AB    Bi1       2
ABC   Ci2       2
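
The counts in this table can be reproduced with a short sketch. It assumes that a record supports a pattern when the pattern's path is a prefix of the record's path and the pattern's purchases are all in the record, and that each WT_ID is counted once; that reading matches the two rows shown but is not stated on the slide.

```python
# (WT_ID, path, purchases) records copied from the slide.
records = [
    (100, "ABCE",  {"Bi1", "Ci2", "Ei4"}),
    (100, "ABFGH", {"Bi1", "Hi6"}),
    (100, "ASJL",  {"Si7", "Li9"}),
    (200, "ABCE",  {"Bi1", "Ci2", "Ei4"}),
    (200, "ASJLQ", {"Si7", "Qi10"}),
]

def support_count(path, purchases, records):
    """Number of distinct transactions with a record covering the path and purchases."""
    ids = {
        wt_id
        for wt_id, rec_path, rec_purchases in records
        if rec_path.startswith(path) and purchases <= rec_purchases
    }
    return len(ids)

print(support_count("AB", {"Bi1"}, records))
print(support_count("ABC", {"Ci2"}, records))
```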
51
WTM Algorithm
Support Count >= 2

C1
Path   Purchase  Sup.
AB     Bi1       3
ABC    Ci2       2
ABD    Di3       1
ABCE   Ei4       3
ABFG   Gi5       2
ABFGH  Hi6       1
AS     Si7       4
ASJ    Ji8       2
ASJL   Li9       2
ASJLQ  Qi10      2

T1
Path   Purchase  Sup.
AB     Bi1       3
ABC    Ci2       2
ABCE   Ei4       3
ABFG   Gi5       2
AS     Si7       4
ASJ    Ji8       2
ASJL   Li9       2
ASJLQ  Qi10      2
52
WTM Algorithm
Support Count >= 2
?28?
53
WTM Algorithm
Support Count >= 2

C3
Path  Purchase     Sup.
ABCE  Bi1 Ci2 Ei4  2

T3
Path  Purchase     Sup.
ABCE  Bi1 Ci2 Ei4  2
54
WTM Disadvantages
  • WTM may generate a lot of unqualified candidate
    transaction patterns without utilizing the paths
    of frequent transaction patterns
  • This will degrade the performance

55
MTSPJ Algorithm
  • Algorithm MTSPJ uses maximal transaction segments,
    each containing frequent transaction patterns and
    the maximal path, to solve the unqualified
    candidate transaction pattern problem
  • MTSPJ generates candidate transaction patterns
    only when the leaf node of the Web transaction
    tree is reached

56
MTSPJ Algorithm
57
MTSPJ Algorithm
Support Count >= 2

C1
Path   Purchase  Sup.
AB     Bi1       3
ABC    Ci2       2
ABCD   Di3       1
ABCE   Ei4       3
ABFG   Gi5       2
ABFGH  Hi6       1
AS     Si7       4
ASJ    Ji8       2
ASJL   Li9       2
ASJLQ  Qi10      2

T1
Path   Purchase  Sup.
AB     Bi1       3
ABC    Ci2       2
ABCE   Ei4       3
ABFG   Gi5       2
AS     Si7       4
ASJ    Ji8       2
ASJL   Li9       2
ASJLQ  Qi10      2
58
MTSPJ Algorithm
C2
Path   Purchase  Sup.
ASJ    Si7 Ji8   2
ASJL   Si7 Li9   2
ASJL   Ji8 Li9   1
ASJLQ  Si7 Qi10  2
ASJLQ  Ji8 Qi10  1
ASJLQ  Li9 Qi10  0
2 3 2
1
59
MTSPJ Algorithm
C2
Path   Purchase  Sup.
ABC    Bi1 Ci2   2
ABCE   Bi1 Ei4   3
ABCE   Ci2 Ei4   2
ABFG   Bi1 Gi5   1
ASJ    Si7 Ji8   2
ASJL   Si7 Li9   2
ASJL   Ji8 Li9   1
ASJLQ  Si7 Qi10  2
ASJLQ  Ji8 Qi10  1
ASJLQ  Li9 Qi10  0

T2
Path   Purchase  Sup.
ABC    Bi1 Ci2   2
ABCE   Bi1 Ei4   3
ABCE   Ci2 Ei4   2
ASJ    Si7 Ji8   2
ASJL   Si7 Li9   2
ASJLQ  Si7 Qi10  2
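
The step from C2 to T2 is a plain support filter, sketched below with the rows of this slide. The threshold of 2 is read from the "Support Count" label on the neighbouring slides, and the name `prune` is an assumption.

```python
MIN_SUPPORT = 2  # threshold as read from the slides

# (path, purchases, support) rows of C2, copied from the slide.
c2 = [
    ("ABC", "Bi1 Ci2", 2),
    ("ABCE", "Bi1 Ei4", 3),
    ("ABCE", "Ci2 Ei4", 2),
    ("ABFG", "Bi1 Gi5", 1),
    ("ASJ", "Si7 Ji8", 2),
    ("ASJL", "Si7 Li9", 2),
    ("ASJL", "Ji8 Li9", 1),
    ("ASJLQ", "Si7 Qi10", 2),
    ("ASJLQ", "Ji8 Qi10", 1),
    ("ASJLQ", "Li9 Qi10", 0),
]

def prune(candidates, min_support):
    """Keep the candidate transaction patterns that reach the minimum support."""
    return [row for row in candidates if row[2] >= min_support]

t2 = prune(c2, MIN_SUPPORT)
for path, purchases, sup in t2:
    print(path, purchases, sup)
```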
60
MTSPJ Algorithm
61
MTSPC Algorithm
MTSPC utilizes the LC (Large Count) to Filter
Candidates
Support Count >= 2

C1
Path   Purchase  Sup.
AB     Bi1       3
ABC    Ci2       2
ABCD   Di3       1
ABCE   Ei4       3
ABFG   Gi5       2
ABFGH  Hi6       1
AS     Si7       4
ASJ    Ji8       2
ASJL   Li9       2
ASJLQ  Qi10      2

T1
Path   Purchase  Sup.
AB     Bi1       3
ABC    Ci2       2
ABCE   Ei4       3
ABFG   Gi5       2
AS     Si7       4
ASJ    Ji8       2
ASJL   Li9       2
ASJLQ  Qi10      2
62
MTSPC Algorithm
Maximal Transaction Segment
Maximal Path  Item  LC
ASJLQ         Si7   1
ASJLQ         Ji8   1
ASJLQ         Li9   1
ASJLQ         Qi10  1
|I| = 4 > 1
|I| = 3 > 1 (K-1)

Maximal Transaction Segment
Maximal Path  Item  LC
ABFG          Bi1   1
ABFG          Gi5   1
|I| = 2 > 1

C2
Path   Purchase  Sup.
ASJ    Si7 Ji8   2
ASJL   Si7 Li9   2
ASJL   Ji8 Li9   1
ASJLQ  Si7 Qi10  2
ASJLQ  Ji8 Qi10  1
ASJLQ  Li9 Qi10  0
63
MTSPC Algorithm
C2
Path   Purchase  Sup.
ABC    Bi1 Ci2   2
ABCE   Bi1 Ei4   3
ABCE   Ci2 Ei4   2
ABFG   Bi1 Gi5   1
ASJ    Si7 Ji8   2
ASJL   Si7 Li9   2
ASJL   Ji8 Li9   1
ASJLQ  Si7 Qi10  2
ASJLQ  Ji8 Qi10  1
ASJLQ  Li9 Qi10  0

T2
Path   Purchase  Sup.
ABC    Bi1 Ci2   2
ABCE   Bi1 Ei4   3
ABCE   Ci2 Ei4   2
ASJ    Si7 Ji8   2
ASJL   Si7 Li9   2
ASJLQ  Si7 Qi10  2
64
MTSPC Algorithm
Maximal Transaction Segment
Maximal Path  Item  LC
ASJLQ         Si7   3
ASJLQ         Ji8   1
ASJLQ         Li9   1
ASJLQ         Qi10  1

T2
|I| = 3 > 2
Path   Purchase  Sup.
ABC    Bi1 Ci2   2
ABCE   Bi1 Ei4   3
ABCE   Ci2 Ei4   2
ASJ    Si7 Ji8   2
ASJL   Si7 Li9   2
ASJLQ  Si7 Qi10  2
|I| = 1 < 2
No Generations
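
The LC values on this slide are consistent with the following reading, which is an inference from the numbers and not a rule stated in the slides: the LC of an item is the number of large (k-1)-patterns in its maximal transaction segment that contain it, and k-candidates are generated for a segment only if more than k-1 items have LC of at least k-1. The function names are assumptions.

```python
from collections import Counter

# Large 2-patterns (T2) as (path, purchased items), copied from the slide.
t2 = [
    ("ABC", ("Bi1", "Ci2")), ("ABCE", ("Bi1", "Ei4")), ("ABCE", ("Ci2", "Ei4")),
    ("ASJ", ("Si7", "Ji8")), ("ASJL", ("Si7", "Li9")), ("ASJLQ", ("Si7", "Qi10")),
]

def large_counts(maximal_path, patterns):
    """LC of each item: how many large patterns inside this segment contain it."""
    lc = Counter()
    for path, items in patterns:
        if maximal_path.startswith(path):  # pattern lies on this maximal path
            lc.update(items)
    return lc

def may_generate(maximal_path, patterns, k):
    """Inferred rule: generate k-candidates only if more than k-1 items have LC >= k-1."""
    lc = large_counts(maximal_path, patterns)
    eligible = [item for item, n in lc.items() if n >= k - 1]
    return len(eligible) > k - 1

print(dict(large_counts("ASJLQ", t2)))
print(may_generate("ASJLQ", t2, 3), may_generate("ABCE", t2, 3))
```

Under this reading, only Si7 qualifies on ASJLQ (|I| = 1), so no 3-candidates are generated there, while Bi1, Ci2 and Ei4 all qualify on ABCE (|I| = 3).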
65
Mining for Web Transactions
  • <ABCE: B1, E4> = 2
  • <AB: B1> = 3
  • We can derive <ABCE: B1 => E4>
  • support_count(<ABCE: B1 => E4>) = 2
  • confidence(<ABCE: B1 => E4>) = 2/3 ≈ 66.7%
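
As a small worked example, the confidence of the rule can be computed from the two support counts listed on this slide, assuming the usual association-rule definition (support of the whole pattern divided by support of the antecedent). The function name is an assumption.

```python
def confidence(rule_support, antecedent_support):
    """Confidence of a rule: support of the full pattern over support of the antecedent."""
    return rule_support / antecedent_support

# Pattern with B1 and E4 on path ABCE has support 2; B1 on path AB has support 3.
print(f"{confidence(2, 3):.1%}")
# 66.7%
```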

66
Summary
  • Data mining in the Web is an area of growing
    importance
  • In particular, with the emergence of electronic
    commerce (EC)
  • More and more applications will benefit from the
    knowledge gained from data mining
  • Web Mining = Web Data Collection + Traditional
    Data Mining?
  • Important Issues
  • Incremental Web Mining