Introduction to Web Mining and Web Usage Mining

1
Introduction to Web Mining and Web Usage Mining

Course: Usability of Interactive Applications
Year: 2007
Lecturer: Federico M. Facca (facca_at_elet.polimi.it)
Main Lecturer: Francesca Rizzo (rizzo_at_elet.polimi.it)

2
Agenda

Web Mining
  Introduction
  Web Content Mining
  Web Structure Mining
Web Usage Mining
  Introduction
  Algorithms
  Applications
  Examples
References

3
Web Mining: Introduction

Web Mining
The application of data mining techniques to discover patterns from the Web.
Data Mining
Also called Knowledge Discovery in Databases (KDD): the process of automatically searching large volumes of data for patterns (extracting knowledge from data).

4
Web Mining: Introduction

Web Content Mining
Discovers useful information from the content of web pages. Web content may consist of text, image, audio or video data.
Web Structure Mining
Uses graph theory to analyse the node and connection structure of a web site.
Web Usage Mining
Analyses users' usage data on the web to discover interesting patterns. The usage data records the user's behaviour when browsing or making transactions on the web site.

5
Web Mining: Introduction

[Figure: the Web mining process. Web data passes through Information Retrieval, Information Extraction, Generalization and Analysis to become Knowledge. Labels shown for WCM, WSM and WUM: Wrapper, Indexer, Crawler/Spider, Ranker, Categorization, Clustering, Characterizing.]

Depending on the Web mining category and on the objective, the different phases take on a different role and importance.

6
Web Mining: Web Content Mining

Discovery of useful information from web contents / data / documents
Web data contents: text, image, audio, video, metadata and hyperlinks.
Information Retrieval View (Structured and Semi-Structured)
Assist / improve information finding
Filter information for users based on user profiles
Database View
Model data on the web
Integrate them for more sophisticated queries

7
Web Mining: Web Content Mining

Developing intelligent tools for IR
Finding keywords and key phrases
Discovering grammatical rules and collocations
Hypertext classification/categorization
Extracting key phrases from text documents
Learning extraction models/rules
Hierarchical clustering
Predicting (word) relationships

8
Web Mining: Web Structure Mining

Discovering the link structure of the hyperlinks at the inter-document level to generate a structural summary of the Web site and its Web pages.
Categorizing Web pages and the generated information based on the hyperlinks.
Discovering the structure of the Web document itself.
Discovering the nature of the hierarchy or network of hyperlinks in the Web site of a particular domain.

9
Web Mining: Web Structure Mining

Finding authoritative Web pages
Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
The notion of authority can be inferred from hyperlinks
The Web consists not only of pages, but also of hyperlinks pointing from one page to another
These hyperlinks contain an enormous amount of latent human annotation
A hyperlink pointing to another Web page can be considered the author's endorsement of that page
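
The slide does not name an algorithm for turning endorsements into authority scores. One well-known option is Kleinberg's HITS, sketched below as an assumption rather than as the method the lecture used. The four-page link graph and all function names are hypothetical.

```python
def hits(links, iterations=50):
    """Return (authority, hub) scores for a link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        auth_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return auth, hub


# Hypothetical four-page site: pages a and b both endorse c, and c links to d.
example_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
authority, hub = hits(example_links)
best_authority = max(authority, key=authority.get)
```

In this toy graph the page endorsed by two others ends up with the highest authority score, while pages with no incoming links get an authority of zero.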

10
Web Usage Mining: Introduction

Also known as web log mining
Not only statistical measures
Not only server logs
Can be organized along three orthogonal dimensions: techniques, visualization and applications

Techniques
Statistical Analysis
Association Rules
Clustering
Sequential Patterns
Rough Sets
Fuzzy Logic

Visualization
Graphs
Relational Tables
OLAP
Query languages

Applications
Personalization
Usability Testing
User modeling
Marketing
Adaptive Web sites

11
Web Usage Mining: Terms

User
The principal using a client to interactively retrieve and render resources or resource manifestations.
Page view
Visual rendering of a Web page in a specific client environment at a specific point in time.
Click stream
A sequential series of page view requests.
User session
A delimited set of user clicks (click stream) across one or more Web servers.
Server session (visit)
A collection of user clicks to a single Web server during a user session.
Episode
A subset of related user clicks that occur within a user session.

12
Web Usage Mining: Applications

Target potential customers for electronic commerce
Enhance the quality and delivery of Internet information services to the end user
Improve Web server system performance
Identify potential prime advertisement locations
Facilitate personalization/adaptive sites
Improve site design
Fraud/intrusion detection
Predict user actions (allows prefetching)

13
Web Usage Mining: Information Retrieval

The information is usually easy to obtain (web log, cookies, proxy log, database log...).
Information can be obtained from server, client and proxy.

14
Web Usage Mining: Information Extraction

Completing missing information using some heuristics
Identification of sessions/episodes
Mining and conversion of contents to the processing format (WCM)
Mining of the web site structure (WSM)
Finding and removing data distortion (e.g. crawler sessions).
Representing the information in the correct format for the pattern discovery task

15
Web Usage Mining: Generalization

Usage Patterns
Navigation patterns
Behaviour patterns
Access patterns
Techniques
Association rules (e.g. 45% of users that visited products/product1.html also visited products/productX.html).
Clustering (identifying groups of users that show similar sessions)
Classification (e.g. 30% of users that bought products from the category Music are between 18 and 25 years old and live in northern Europe)
Sequential patterns (e.g. 15% of users that bought a product in the category Music made a new order in the category Book a week later)

16
Web Usage Mining: Analysis

Removing patterns that do not provide new knowledge
Visualization of acquired knowledge
Use of discovered patterns for:
Categorizing users
Personalizing contents/advertisements
Dynamically modifying web site structure
Marketing
Improving application usability

17
Web Usage Mining: Problems with Web Logs

Identifying users
Clients may have multiple streams
Clients may access the web from multiple hosts
Proxy servers: many clients/one address
Proxy servers: one client/many addresses
Data not in log
POST data (i.e., CGI request) not recorded
Cookie data stored elsewhere

18
Web Usage Mining: Problems with Web Logs

Missing data
Pages may be cached
Referring page requires client cooperation
When does a session end?
Use of forward and backward pointers
Typically a 30-minute timeout is used
Web content may be dynamic
May not be able to reconstruct what the user saw
Spiders and automated agents request web pages automatically

19
Web Usage Mining: Problems with Web Logs

Like most data mining tasks, web log mining
requires preprocessing
To identify users
To match sessions to other data
To fill in missing data
Essentially, to reconstruct the click stream

20
Web Usage Mining: Web Server Logs

Web servers have the ability to log all requests
Web server log formats
Common Log Format (CLF)
Extended Log Format: allows configuration of the log file
Logs generate vast amounts of data

21
Web Usage Mining: Web Server Logs

Common Log Format
Remotehost: browser hostname or IP address
Remote logname: log name of the remote user (almost always "-", meaning "unknown")
Authuser: authenticated username
Date: date and time of the request
Request: exact request line from the client, in double quotes
Status: the HTTP status code returned
Bytes: the content length of the response
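
The field list above maps directly onto a regular expression. The following is a minimal sketch of a Common Log Format parser. The sample line is illustrative only, and the dictionary keys are names chosen here, not part of any standard API.

```python
import re
from datetime import datetime

# One named group per Common Log Format field, in the order listed above.
CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf_line(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # CLF dates look like 10/Oct/2000:13:55:36 -0700 (English month names assumed).
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means that no content was returned.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Illustrative log line, not taken from the slides.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
```

Lines that do not follow the format yield None, so a caller can count or skip malformed entries instead of failing on them.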

22
Web Usage Mining: Web Server Logs
23
Web Usage Mining: Pre-Processing

Data Cleaning
Removes log entries that are not needed for the mining process
Data Integration
Synchronizes data from multiple server logs and metadata
User Identification
Associates page references with different users
Session/Episode Identification
Groups a user's page references into user sessions
Page View Identification
Path Completion
Fills in page references missing due to browser and proxy caching
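
As an illustration of the data cleaning step, the sketch below drops requests for embedded resources, failed requests and requests from automated agents. The extension list, the status rule, the agent markers and the entry layout (a dict with `path`, `status` and `agent`) are all assumptions made for this example; what counts as "not needed" depends on the mining task.

```python
# Assumed filters: embedded resources and obvious crawlers are treated as noise.
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
BOT_MARKERS = ("bot", "crawler", "spider")


def is_relevant(entry):
    """Return True if a log entry should be kept for mining."""
    path = entry["path"].split("?", 1)[0].lower()
    if path.endswith(IGNORED_EXTENSIONS):
        return False  # image, style sheet or script requested as part of a page
    if not 200 <= entry["status"] < 400:
        return False  # failed request (assumption: errors are not mined)
    agent = entry.get("agent", "").lower()
    return not any(marker in agent for marker in BOT_MARKERS)


# Hypothetical entries, e.g. produced by a log parser.
log = [
    {"path": "/products/product1.html", "status": 200, "agent": "Mozilla/5.0"},
    {"path": "/img/logo.gif", "status": 200, "agent": "Mozilla/5.0"},
    {"path": "/index.html", "status": 200, "agent": "ExampleBot/1.0"},
    {"path": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
]
cleaned = [e for e in log if is_relevant(e)]
```

Only the first entry survives; the others are an embedded image, a crawler request and a failed request.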

24
Web Usage Mining: Pre-Processing

A single IP address is used by many users
Different IP addresses in a single session
Missing cache hits in the server logs

[Figure: different users reach the Web server through one proxy server; a single user reaches the Web server through an ISP server.]

25
Web Usage Mining: Pre-Processing

Remote Agent
A remote agent is implemented as a Java applet
It is loaded into the client only once, when the first page is accessed
The subsequent requests are captured and sent back to the server
Modified Browser
The source code of an existing browser can be modified to gather user-specific data at the client side
Dynamic page rewriting
When the user first submits a request, the server returns the requested page rewritten to include a session-specific ID
Each subsequent request will supply this ID to the server
Heuristics
Use a set of assumptions to identify user sessions and find the missing cache hits in the server log

26
Web Usage Mining: Session identification heuristics

Timeout
If the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session
IP/Agent
Each different agent type for an IP address represents a different session
Referring page
If the referring page file for a request is not part of an open session, it is assumed that the request is coming from a different session
Same IP-Agent/different sessions (Closest)
Assigns the request to the session that is closest to the referring page at the time of the request
Same IP-Agent/different sessions (Recent)
In the case where multiple sessions are the same distance from a page request, assigns the request to the session with the most recent referrer access in terms of time
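
The first two heuristics can be combined in a few lines. This is a sketch under stated assumptions: requests are simplified to (ip, agent, timestamp in seconds, url) tuples, the timeout is 30 minutes, and the referrer-based heuristics are not implemented.

```python
from collections import defaultdict

TIMEOUT_SECONDS = 30 * 60  # assumed 30-minute limit


def sessionize(requests, timeout=TIMEOUT_SECONDS):
    """Group requests into sessions using the IP/Agent and Timeout heuristics.

    `requests` is an iterable of (ip, agent, timestamp_seconds, url) tuples.
    Returns a list of ((ip, agent), [urls]) pairs.
    """
    # IP/Agent heuristic: each (ip, agent) pair is treated as a separate user.
    by_user = defaultdict(list)
    for ip, agent, timestamp, url in sorted(requests, key=lambda r: r[2]):
        by_user[(ip, agent)].append((timestamp, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            # Timeout heuristic: a long gap between requests starts a new session.
            if hit[0] - previous[0] > timeout:
                sessions.append((user, [url for _, url in current]))
                current = []
            current.append(hit)
        sessions.append((user, [url for _, url in current]))
    return sessions


# Hypothetical requests: one IP address, two agents, one long pause.
requests = [
    ("1.2.3.4", "Mozilla", 0, "/a"),
    ("1.2.3.4", "Mozilla", 600, "/b"),
    ("1.2.3.4", "Opera", 700, "/a"),
    ("1.2.3.4", "Mozilla", 2401, "/c"),
]
sessions = sessionize(requests)
```

Here the Mozilla requests split into two sessions because of the gap of more than 30 minutes before /c, and the Opera request forms a session of its own.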

27-31
Web Usage Mining: Sessionization Example

(Slides 27 to 31: worked example shown as figures, not captured in this transcript.)

32
Web Usage Mining: Association rule mining

Proposed by Agrawal et al. in 1993.
It is an important data mining model studied extensively by the database and data mining community.
Assumes all data are categorical.
No good algorithm for numeric data.
Initially used for Market Basket Analysis to find how items purchased by customers are related.
Url1 → Url4 [sup = 5%, conf = 100%]

33
Web Usage Mining: Association rule mining

A set of items
I = {i1, i2, …, im}
Transaction t
t is a set of items, and t ⊆ I
Transaction Database T
a set of transactions T = {t1, t2, …, tn}

34
Web Usage Mining: Association rule mining

A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
An association rule is an implication of the form
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
An itemset is a set of items.
E.g., X = {url1, url2, url3} is an itemset.
A k-itemset is an itemset with k items.
E.g., {url1, url2, url3} is a 3-itemset

35
Web Usage Mining: Association rule mining

Support
The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
sup = Pr(X ∪ Y).
Confidence
The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
conf = Pr(Y | X)
An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.

36
Web Usage Mining: Association rule mining

Transaction data
t1: Url1, Url2, Url4
t2: Url1, Url3
t3: Url3, Url5
t4: Url1, Url2, Url3
t5: Url1, Url2, Url6, Url3, Url4
t6: Url2, Url6, Url4
t7: Url2, Url4, Url6
Assume
minsup = 30%
minconf = 80%
An example frequent itemset
{Url2, Url6, Url4} [sup = 3/7]
Association rules from the itemset
Url6 → Url4, Url2 [sup = 3/7, conf = 3/3]
Url6, Url2 → Url4 [sup = 3/7, conf = 3/3]
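
The support and confidence values in this example can be checked mechanically. The sketch below encodes the seven transactions as Python sets; the helper names `support` and `confidence` are chosen here for illustration.

```python
from fractions import Fraction

# The seven transactions t1..t7 from the example.
transactions = [
    {"Url1", "Url2", "Url4"},
    {"Url1", "Url3"},
    {"Url3", "Url5"},
    {"Url1", "Url2", "Url3"},
    {"Url1", "Url2", "Url6", "Url3", "Url4"},
    {"Url2", "Url6", "Url4"},
    {"Url2", "Url4", "Url6"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


itemset_support = support({"Url2", "Url6", "Url4"}, transactions)
rule1_conf = confidence({"Url6"}, {"Url4", "Url2"}, transactions)
rule2_conf = confidence({"Url6", "Url2"}, {"Url4"}, transactions)
```

Exact fractions are used so that the results compare directly with the 3/7 and 3/3 figures in the example.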

37
Web Usage Mining: Association rule mining
Apriori Algorithm

Probably the best known algorithm
Two steps:
Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
Use frequent itemsets to generate rules.
E.g., a frequent itemset
{Url2, Url6, Url4} [sup = 3/7]
and one rule from the frequent itemset
Url6 → Url4, Url2 [sup = 3/7, conf = 3/3]

38
Web Usage Mining: Association rule mining
Apriori Algorithm

Iterative algorithm
Find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset.
Find frequent itemsets of size 1: F1
From k = 2:
Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
Fk = those itemsets that are actually frequent, Fk ⊆ Ck (need to scan the database once).
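
The level-wise search can be sketched as follows. This is a compact illustration, not an optimized implementation: candidates are formed by joining frequent (k-1)-itemsets and pruned when any (k-1)-subset is infrequent. The data are the seven URL transactions from the earlier example, and a 30% minimum support is expressed as a minimum count of 3.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    # F1: count single items with one scan of the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        previous = set(frequent)
        # Ck: join frequent (k-1)-itemsets whose union has exactly k items ...
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # ... and keep only those whose (k-1)-subsets are all frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Fk: one scan of the data to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


transactions = [
    {"Url1", "Url2", "Url4"},
    {"Url1", "Url3"},
    {"Url3", "Url5"},
    {"Url1", "Url2", "Url3"},
    {"Url1", "Url2", "Url6", "Url3", "Url4"},
    {"Url2", "Url6", "Url4"},
    {"Url2", "Url4", "Url6"},
]
frequent_itemsets = apriori(transactions, min_count=3)
```

On this data the only frequent 3-itemset is {Url2, Url4, Url6}, which matches the example frequent itemset with support 3/7.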

39
Web Usage Mining: Association rule mining
Apriori Algorithm

Dataset T, minsup = 0.5
Notation: itemset:count
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2
   → F3: {2,3,5}

40
Web Usage Mining: Association rule mining
Apriori Algorithm

Frequent itemsets → association rules
One more step is needed to generate association rules
For each frequent itemset X,
For each proper nonempty subset A of X,
Let B = X - A
A → B is an association rule if
confidence(A → B) ≥ minconf,
support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
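
The rule-generation step can be sketched as a loop over the proper nonempty subsets of each frequent itemset. The support counts below are those of {Url2, Url4, Url6} and its subsets in the seven-transaction URL example; the function name and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction
from itertools import combinations


def generate_rules(frequent, min_conf):
    """Return (A, B, confidence) for every rule A -> B meeting min_conf.

    `frequent` maps each frequent itemset (frozenset) to its support count and
    must also contain every nonempty subset of each itemset.
    """
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                b = itemset - a
                # confidence(A -> B) = support(A u B) / support(A)
                conf = Fraction(count, frequent[a])
                if conf >= min_conf:
                    rules.append((a, b, conf))
    return rules


# Support counts taken from the URL example (7 transactions).
frequent = {
    frozenset({"Url2"}): 5,
    frozenset({"Url4"}): 4,
    frozenset({"Url6"}): 3,
    frozenset({"Url2", "Url4"}): 4,
    frozenset({"Url2", "Url6"}): 3,
    frozenset({"Url4", "Url6"}): 3,
    frozenset({"Url2", "Url4", "Url6"}): 3,
}
rules = generate_rules(frequent, min_conf=Fraction(80, 100))
rule_pairs = {(a, b) for a, b, _ in rules}
```

With minconf = 80% the output includes Url6 → Url2, Url4 and Url2, Url6 → Url4, both with confidence 3/3 as in the example.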

41
Web Usage Mining: Sequential pattern mining

Association rules concern which items appear together (at the same time)
Intra-transaction patterns
Sequential patterns concern which items appear at different times
Inter-transaction patterns

42
Web Usage Mining: Sequential pattern mining

Itemset
Non-empty set of items. Each itemset is mapped to an integer.
Sequence
Ordered list of itemsets.
Customer Sequence
List of customer transactions ordered by
increasing transaction time.
A customer supports a sequence if the sequence is
contained in the customer-sequence.
Support for a Sequence
Fraction of total customers that support a
sequence.
Maximal Sequence
A sequence that is not contained in any other
sequence.
Large Sequence
Sequence that meets minsup.
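
The notions of containment and support defined above can be made concrete with a short sketch. Sequences are modelled as lists of sets, and the five customer sequences are illustrative data that do not come from the slides.

```python
def contains(sequence, candidate):
    """True if `candidate` is contained in `sequence`.

    Each itemset of the candidate must be a subset of a distinct itemset of
    the sequence, and the matched itemsets must appear in the same order.
    """
    position = 0
    for itemset in candidate:
        # Greedily advance to the first remaining itemset that covers this one.
        while position < len(sequence) and not set(itemset) <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def sequence_support(customer_sequences, candidate):
    """Fraction of customers whose sequence contains `candidate`."""
    supporters = sum(1 for s in customer_sequences if contains(s, candidate))
    return supporters / len(customer_sequences)


# Illustrative customer sequences (items are arbitrary integers).
customer_sequences = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
support_30_then_90 = sequence_support(customer_sequences, [{30}, {90}])
support_30_then_40_70 = sequence_support(customer_sequences, [{30}, {40, 70}])
```

Both candidate sequences are supported by two of the five customers. Note that a single transaction containing 30 and 70 does not support the sequence "30, then 70", because the two itemsets must be matched by different transactions.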

43
Web Usage Mining: Sequential pattern mining
44
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm

Given sequence <a(abc)(ac)d(cf)>
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

45
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm

Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of sequential patterns can be partitioned into 6 subsets:
The ones having prefix <a>
The ones having prefix <b>
…
The ones having prefix <f>

46
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm

Only need to consider projections w.r.t. <a>
<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets
Having prefix <aa>
…
Having prefix <af>
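
To show how a projected database entry is obtained, the sketch below projects one sequence with respect to a single-item prefix. It covers only this one step, not the full PrefixSpan recursion. Sequences are lists of alphabetically sorted tuples, and the second input sequence, <(ad)c(bc)(ae)>, is an assumption inferred from the projected entry <(_d)c(bc)(ae)>.

```python
def project(sequence, item):
    """Return the suffix of `sequence` after the first occurrence of `item`.

    Items left over in the matched itemset are kept as a partial itemset
    marked with "_". Returns None if `item` does not occur.
    """
    for index, itemset in enumerate(sequence):
        if item in itemset:
            # Items of the same itemset that come after `item` alphabetically.
            rest = tuple(i for i in itemset if i > item)
            suffix = list(sequence[index + 1:])
            if rest:
                suffix.insert(0, ("_",) + rest)
            return suffix
    return None


def show(sequence):
    """Render a sequence in the <a(abc)d> notation used on the slides."""
    parts = (s[0] if len(s) == 1 else "(" + "".join(s) + ")" for s in sequence)
    return "<" + "".join(parts) + ">"


first = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]  # <a(abc)(ac)d(cf)>
second = [("a", "d"), ("c",), ("b", "c"), ("a", "e")]              # assumed <(ad)c(bc)(ae)>

first_projection = show(project(first, "a"))
second_projection = show(project(second, "a"))
```

The two results reproduce the first two entries of the <a>-projected database listed above.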

47
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm

[Figure: PrefixSpan search tree]
SDB
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Having prefix <aa>: <aa>-projected database
  …
  Having prefix <af>: <af>-projected database
Having prefix <b>: <b>-projected database
Having prefix <c>, …, <f>

48
Web Usage Mining: Clustering

Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters.
Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning.
For historical reasons, clustering is often considered synonymous with unsupervised learning.
In fact, association rule mining is also unsupervised.

49
Web Usage Mining: Clustering

The data set has three natural groups of data
points, i.e., 3 natural clusters.

50
Web Usage Mining: Clustering

Let us see some real-life examples
Example 1: group people of similar sizes together to make small, medium and large T-shirts.
Tailor-made for each person: too expensive
One-size-fits-all: does not fit all.
Example 2: in e-commerce, segment customers according to their similarities
To do targeted marketing.

51
Web Usage Mining: Clustering

A clustering algorithm
Partitional clustering
Hierarchical clustering
A distance (similarity, or dissimilarity) function
Clustering quality
Inter-cluster distance → maximized
Intra-cluster distance → minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.

52
Web Usage Mining: Clustering - K-means algorithm

K-means is a partitional clustering algorithm
Let the set of data points (or instances) D be {x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called centroid.
k is specified by the user

53
Web Usage Mining: Clustering - K-means algorithm

Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships
4) If a convergence criterion is not met, go to 2)
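
The steps above can be sketched in plain Python. The assumptions are: squared Euclidean distance, "no re-assignments" as the convergence criterion, and an optional `initial_centroids` argument (introduced here, not in the slides) that makes runs reproducible. The sample points are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, initial_centroids=None, max_iterations=100):
    """Return (centroids, assignment) for k clusters."""
    # Step 1: choose k data points as seeds, unless seeds are supplied.
    centroids = list(initial_centroids) if initial_centroids else random.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each data point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: re-compute each centroid from its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, assignment


# Hypothetical data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, assignment = kmeans(points, 2, initial_centroids=[(0, 0), (10, 10)])
```

With random seeds the result can depend on the initial choice, since k-means may settle in a local optimum. This is why the example fixes the seeds.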

54
Web Usage Mining: Clustering - K-means algorithm

Possible convergence criteria:
no (or minimum) re-assignments of data points to different clusters,
no (or minimum) change of centroids, or
minimum decrease in the sum of squared error (SSE),
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2
where C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j), and dist(x, m_j) is the distance between data point x and centroid m_j.

55
Web Usage Mining: Clustering - K-means algorithm
Select K and, accordingly, K centers in the space
56
Web Usage Mining: Clustering - K-means algorithm
Assign points to the nearest center
57
Web Usage Mining: Clustering - K-means algorithm
Recompute the new center for each cluster
58
Web Usage Mining: Clustering - K-means algorithm
Assign points to the nearest center
59
Web Usage Mining: Clustering - K-means algorithm
Three points change cluster
60
Web Usage Mining: Clustering - K-means algorithm
Recompute the new center for each cluster
61
Web Usage Mining: Clustering - K-means algorithm
Assign points to the nearest center. No change! STOP
62
Web Usage Mining: Applications

User characterization
Creation of user classes according to navigation behaviours and visited contents.
Basic step for many of the other WUM applications
Personalization
Attracting users with advanced personalized features (content, presentation, navigation).
Recommender systems based on user profiles and mined behaviours
Ad hoc advertising

63
Web Usage Mining: Applications

Web application improvement
Performance: prefetching, load balancing, web caching, based on user behaviours
Security: finding intrusions and fraud.
Usability: adapting the model of the web application to the model expected by users
Marketing
Information on users is very important for e-commerce web sites.
It is possible to obtain data on:
Customer acquisition
Customer retention
Cross-selling
Customer loss

64
References

R. Kosala and H. Blockeel. Web mining research: a survey. SIGKDD Explorations, ACM, 2(1):1-15, 2000.
Sankar Pal, Varun Talwar, and Pabitra Mitra. Web mining in soft computing framework: relevance, state of the art and future directions, 2002.
Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: discovery and applications of usage patterns from web data. SIGKDD Explorations, ACM, 1(2):12-23, 2000.
Federico Michele Facca and Pier Luca Lanzi. Mining interesting knowledge from weblogs: a survey. Data Knowl. Eng. 53(3):225-241, 2005.