Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts

1
Data Mining, Data Warehousing and Knowledge Discovery: Basic Algorithms and Concepts

Srinath Srinivasa
IIIT Bangalore
sri_at_iiitb.ac.in

2
Overview

Why Data Mining?
Data Mining concepts
Data Mining algorithms
Tabular data mining
Association, Classification and Clustering
Sequence data mining
Streaming data mining
Data Warehousing concepts

3
Why Data Mining
From a managerial perspective
Analyzing trends
Wealth generation
Security
Strategic decision making
4
Data Mining

Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
No query, but an interestingness criterion

5
Data Mining

[Figure: data and an interestingness criterion go in; hidden patterns come out. Later builds of the slide label the type of data, the type of interestingness criteria and the type of patterns.]
8
Type of Data

Tabular (e.g. transaction data)
Relational
Multi-dimensional
Spatial (e.g. remote sensing data)
Temporal (e.g. log information)
Streaming (e.g. multimedia, network traffic)
Spatio-temporal (e.g. GIS)
Tree (e.g. XML data)
Graphs (e.g. WWW, biomolecular data)
Sequence (e.g. DNA, activity logs)
Text, multimedia

9
Type of Interestingness

Frequency
Rarity
Correlation
Length of occurrence (for sequence and temporal data)
Consistency
Repeating / periodicity
Abnormal behavior
Other patterns of interestingness

10
Data Mining vs Statistical Inference
Statistics
Conceptual model (hypothesis) → Statistical reasoning → Proof (validation of hypothesis)
11
Data Mining vs Statistical Inference
Data mining
Data → Mining algorithm based on interestingness → Pattern (model, rule, hypothesis) discovery
12
Data Mining Concepts
Associations and Item-sets
An association is a rule of the form "if X then Y". It is denoted as X → Y.
Example: If India wins in cricket, sales of sweets go up.
For any rule, if X → Y and Y → X, then X and Y are called an interesting item-set.
Example: People buying school uniforms in June also buy school bags (and people buying school bags in June also buy school uniforms).
13
Data Mining Concepts
Support and Confidence
The support for a rule R is the ratio of the number of transactions in which R occurs to the total number of transactions.
The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
14
Data Mining Concepts
Support and Confidence
Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
Transaction Items
1 Bag, Uniform, Crayons
2 Books, Bag, Uniform
3 Bag, Uniform, Pencil
4 Bag, Pencil, Books
5 Uniform, Crayons, Bag
6 Bag, Pencil, Books
7 Crayons, Uniform, Bag
8 Books, Crayons, Bag
9 Uniform, Crayons, Pencil
10 Pencil, Uniform, Books
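The two figures above can be checked with a few lines of Python. This is a minimal sketch: the transaction list is our column-wise reading of the slide's flattened table, and the names `TRANSACTIONS`, `count`, `support` and `confidence` are illustrative assumptions, not part of the original slides.

```python
# Ten transactions, read column-wise from the slide's table (an assumption).
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    both = set(lhs) | set(rhs)
    return count(both, transactions) / count(lhs, transactions)


print(support({"Bag", "Uniform"}, TRANSACTIONS))       # 0.5
print(confidence({"Bag"}, {"Uniform"}, TRANSACTIONS))  # 0.625
```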
15
Mining for Frequent Item-sets

The Apriori Algorithm
Given minimum required support s as interestingness criterion
Search for all individual elements (1-element item-sets) that have a minimum support of s
Repeat
From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of s
This becomes the set of all frequent (i+1)-element item-sets that are interesting
Until item-set size reaches maximum
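The loop above can be sketched in Python as follows. This is one possible rendering under stated assumptions: the join-and-prune candidate generation is the usual Apriori formulation rather than something spelled out on the slide, the transactions are our reading of the school-supplies table, and all names are illustrative.

```python
from itertools import combinations

# One transaction per line (our reading of the slide's table; an assumption).
ROWS = """Bag Uniform Crayons
Books Bag Uniform
Bag Uniform Pencil
Bag Pencil Books
Uniform Crayons Bag
Bag Pencil Books
Crayons Uniform Bag
Books Crayons Bag
Uniform Crayons Pencil
Pencil Uniform Books"""
TRANSACTIONS = [frozenset(row.split()) for row in ROWS.splitlines()]


def apriori(transactions, minsup):
    """Return {item-set: support} for every item-set with support >= minsup."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Join: unions of two frequent item-sets that have exactly `size` items.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {c for c in candidates if support(c) >= minsup}
    return frequent


frequent = apriori(TRANSACTIONS, minsup=0.3)
```

The prune step is what keeps the search small: an (i+1)-element item-set is only counted if all of its i-element subsets survived the previous round.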

16
Mining for Frequent Item-sets
The Apriori Algorithm (Example)
Let minimum support = 0.3
Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets: {Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books}, {Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}
17
Mining for Frequent Item-sets
The Apriori Algorithm (Example)
Let minimum support = 0.3
Interesting 3-element item-sets: {Bag, Uniform, Crayons}
18
Mining for Association Rules
Association rules are of the form A → B, which are directional.
Association rule mining requires two thresholds: minsup and minconf.
19
Mining for Association Rules
Mining association rules using apriori

General Procedure
Use apriori to generate frequent itemsets of
different sizes
At each iteration divide each frequent itemset X into two parts, LHS and RHS. This represents a rule of the form LHS → RHS
The confidence of such a rule is
support(X)/support(LHS)
Discard all rules whose confidence is less than
minconf.
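The rule-generation step can be sketched as below: every non-empty proper subset of a frequent itemset X is tried as the LHS, and the rule is kept if support(X)/support(LHS) reaches minconf. The transaction list is our reading of the school-supplies table and the function name `rules_from_itemset` is hypothetical.

```python
from itertools import combinations

# One transaction per line (our reading of the slide's table; an assumption).
ROWS = """Bag Uniform Crayons
Books Bag Uniform
Bag Uniform Pencil
Bag Pencil Books
Uniform Crayons Bag
Bag Pencil Books
Crayons Uniform Bag
Books Crayons Bag
Uniform Crayons Pencil
Pencil Uniform Books"""
TRANSACTIONS = [frozenset(row.split()) for row in ROWS.splitlines()]


def rules_from_itemset(itemset, transactions, minconf):
    """Split a frequent itemset X into LHS -> RHS rules; keep the confident ones."""
    def count(s):
        return sum(s <= t for t in transactions)

    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), r)):
            # Counts are used instead of ratios: the common denominator cancels.
            conf = count(itemset) / count(lhs)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules


for lhs, rhs, conf in rules_from_itemset({"Bag", "Uniform", "Crayons"},
                                         TRANSACTIONS, minconf=0.7):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```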

20
Mining for Association Rules
Mining association rules using apriori
Example: The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. This can be divided into the following rules:
Bag → Uniform, Crayons
Bag, Uniform → Crayons
Bag, Crayons → Uniform
Uniform → Bag, Crayons
Uniform, Crayons → Bag
Crayons → Bag, Uniform
21
Mining for Association Rules
Mining association rules using apriori
The confidences for these rules are as follows:
Bag → Uniform, Crayons: 3/8 = 0.375
Bag, Uniform → Crayons: 3/5 = 0.6
Bag, Crayons → Uniform: 3/4 = 0.75
Uniform → Bag, Crayons: 3/7 ≈ 0.428
Uniform, Crayons → Bag: 3/4 = 0.75
Crayons → Bag, Uniform: 3/5 = 0.6
If minconf is 0.7, then we have discovered the following rules:
22
Mining for Association Rules
Mining association rules using apriori
People who buy a school bag and a set of crayons are likely to buy a school uniform.
People who buy a school uniform and a set of crayons are likely to buy a school bag.
23
Generalized Association Rules
Since customers can buy any number of items in
one transaction, the transaction relation would
be in the form of a list of individual
purchases.
Bill No. Date Item
15563 23.10.2003 Books
15563 23.10.2003 Crayons
15564 23.10.2003 Uniform
15564 23.10.2003 Crayons
24
Generalized Association Rules
A transaction for the purposes of data mining is
obtained by performing a GROUP BY of the table
over various fields.
25
Generalized Association Rules
A GROUP BY over Bill No. would show frequent
buying patterns across different customers. A
GROUP BY over Date would show frequent buying
patterns across different days.
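A small Python sketch of the grouping step, using the four purchase rows from the table in this section. The function name `group_transactions` and the tuple layout are assumptions made for illustration; in a database this would be a GROUP BY over the chosen field.

```python
from collections import defaultdict

# Rows of the purchase relation: (bill_no, date, item).
PURCHASES = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]


def group_transactions(rows, key_index):
    """Collect the item column into one item-set per value of the chosen field."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return dict(groups)


by_bill = group_transactions(PURCHASES, 0)  # one transaction per bill (customer)
by_date = group_transactions(PURCHASES, 1)  # one transaction per day
```

Grouping by bill yields two 2-item transactions, while grouping by date merges all four rows into a single transaction, so the choice of field changes what "bought together" means.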
26
Classification and Clustering
Given a set of data elements:
Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes.
Clustering groups data elements into different groups, based on the similarity between elements within a single group.
27
Classification Techniques
Decision Tree Identification
Outlook Temp Play?
Sunny 30 Yes
Overcast 15 No
Sunny 16 Yes
Cloudy 27 Yes
Overcast 25 Yes
Overcast 17 No
Cloudy 17 No
Cloudy 35 Yes
Classification problem: Weather → Play (Yes, No)
28
Classification Techniques

Hunt's method for decision tree identification
Given N element types and m decision classes
For i ← 1 to N do
Add element i to the (i-1)-element item-sets from the previous iteration
Identify the set of decision classes for each item-set
If an item-set has only one decision class, then that item-set is done; remove that item-set from subsequent iterations
done
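A compact way to see the method at work is the recursive sketch below, run on the discretized weather table used in this presentation's example. It is an approximation under stated assumptions: the slide describes a level-by-level loop, whereas this version recurses branch by branch (the resulting tree is the same for a fixed attribute order), and the names `ROWS` and `build_tree` are illustrative.

```python
from collections import defaultdict

# Discretized weather table: (outlook, temp, play).
ROWS = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]


def build_tree(rows, attrs):
    """Split on attributes in the given order until a branch has one class."""
    classes = {row[-1] for row in rows}
    if len(classes) == 1:
        return classes.pop()          # single decision class: this branch is done
    if not attrs:
        return sorted(classes)        # attributes exhausted, decision still unclear
    first, rest = attrs[0], attrs[1:]
    branches = defaultdict(list)
    for row in rows:
        branches[row[first]].append(row)
    return {value: build_tree(subset, rest) for value, subset in branches.items()}


tree = build_tree(ROWS, attrs=[0, 1])  # consider Outlook first, then Temp
```

Passing `attrs=[1, 0]` instead gives a different tree, which illustrates the sensitivity to attribute order.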

29
Classification Techniques
Decision Tree Identification Example
Outlook Temp Play?
Sunny Warm Yes
Overcast Chilly No
Sunny Chilly Yes
Cloudy Pleasant Yes
Overcast Pleasant Yes
Overcast Chilly No
Cloudy Chilly No
Cloudy Warm Yes
Sunny: Yes
Cloudy: Yes/No
Overcast: Yes/No
31
Classification Techniques
Decision Tree Identification Example
Cloudy, Warm: Yes
Cloudy, Chilly: No
Cloudy, Pleasant: Yes
32
Classification Techniques
Decision Tree Identification Example
Overcast, Warm: no examples in the data
Overcast, Chilly: No
Overcast, Pleasant: Yes
33
Classification Techniques
Decision Tree Identification Example
Outlook (Yes/No)
  Sunny: Yes
  Cloudy: Yes/No, split on Temp
    Warm: Yes
    Chilly: No
    Pleasant: Yes
  Overcast: Yes/No, split on Temp
    Chilly: No
    Pleasant: Yes
34
Classification Techniques
Decision Tree Identification Example

Top down technique for decision tree
identification
Decision tree created is sensitive to the order
in which items are considered
If an N-item-set does not result in a clear
decision, classification classes have to be
modeled by rough sets.

35
Other Classification Algorithms
Quinlan's depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value of each node is calculated, and the nodes having the lowest entropy values are selected and expanded.
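Both strategies rest on entropy. The sketch below computes entropy and information gain for the two attributes of the weather table from the decision tree example; it uses the standard Shannon entropy definition, and the helper names are assumptions rather than anything taken from Quinlan's or SLIQ's implementations.

```python
from collections import Counter
from math import log2

# Discretized weather table: (outlook, temp, play).
ROWS = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr):
    """Entropy of the class column minus its weighted entropy after splitting on attr."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row[-1] for row in rows if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gain_outlook = information_gain(ROWS, 0)
gain_temp = information_gain(ROWS, 1)
```

On this table Temp yields the larger gain, so a gain-driven strategy would test Temp before Outlook.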
36
Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.
37
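Euclidean Distance for Tables
[Figure: table rows plotted as points in a space with one axis per column. Outlook axis: Sunny, Cloudy, Overcast. Temp axis: Warm, Pleasant, Chilly. Play axis: Play, Don't Play. Example points: (Overcast, Chilly, Don't Play) and (Cloudy, Pleasant, Play).]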
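One way to make the idea concrete is to place each category at a coordinate on its axis and measure ordinary Euclidean distance between rows. The integer positions below are purely hypothetical (the slide does not give coordinates), as are the dictionary and function names.

```python
from math import dist

# Hypothetical axis positions: each category is placed at an integer coordinate.
OUTLOOK = {"Sunny": 0, "Cloudy": 1, "Overcast": 2}
TEMP = {"Warm": 0, "Pleasant": 1, "Chilly": 2}
PLAY = {"Dont Play": 0, "Play": 1}


def to_point(row):
    """Map an (outlook, temp, play) row to a point in three dimensions."""
    outlook, temp, play = row
    return (OUTLOOK[outlook], TEMP[temp], PLAY[play])


d = dist(to_point(("Overcast", "Chilly", "Dont Play")),
         to_point(("Cloudy", "Pleasant", "Play")))
```

The distance depends entirely on the chosen positions, which is why application-specific measures are often preferred for categorical columns.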
38
Clustering Techniques

General Strategy
Draw a graph connecting items which are close to
one another with edges.
Partition the graph into maximally connected
subcomponents.
Construct an MST for the graph
Merge items that are connected by the minimum-weight edges of the MST into a cluster
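The strategy can be approximated with Kruskal's MST algorithm, stopping once edges become too long. This is a sketch under assumptions: the cut-off parameter `max_edge` is introduced here to decide which MST edges count as "minimum weight", and the function name is hypothetical.

```python
from itertools import combinations
from math import dist


def mst_clusters(points, max_edge):
    """Kruskal's algorithm in which only edges up to `max_edge` merge clusters."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, lightest first.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    for weight, i, j in edges:
        if weight > max_edge:
            break                          # remaining edges are all too heavy
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j        # this edge joins two components (an MST edge)

    clusters = {}
    for i, point in enumerate(points):
        clusters.setdefault(find(i), []).append(point)
    return list(clusters.values())


groups = mst_clusters([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)], max_edge=2)
```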

39
Clustering Techniques
Clustering types
Hierarchical clustering: Clusters are formed at different levels by merging clusters at a lower level
Partitional clustering: Clusters are formed at only one level
40
Clustering Techniques

Nearest Neighbour Clustering Algorithm
Given n elements x1, x2, ..., xn, and threshold t
j ← 1, k ← 1, Clusters = {}
Repeat
Find the nearest neighbour of xj among the elements already assigned
Let the nearest neighbour be in cluster m
If distance to nearest neighbour > t, then create a new cluster and k ← k + 1, else assign xj to cluster m
j ← j + 1
until j > n
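A direct Python rendering of the loop might look like this. It assumes a non-empty input, treats the first element as the seed of the first cluster, and defaults to absolute difference as the distance; these choices and the function name are assumptions for illustration.

```python
def nearest_neighbour_clusters(xs, t, distance=lambda a, b: abs(a - b)):
    """Assign each element to its nearest neighbour's cluster, or open a new one."""
    clusters = [[xs[0]]]                      # the first element seeds cluster 1
    for x in xs[1:]:
        # Distance to the nearest already-clustered element, and that cluster's index.
        d, m = min((distance(x, y), idx)
                   for idx, cluster in enumerate(clusters) for y in cluster)
        if d > t:
            clusters.append([x])              # too far from everything: new cluster
        else:
            clusters[m].append(x)
    return clusters


result = nearest_neighbour_clusters([1, 2, 10, 11, 3, 30], t=2)
# result == [[1, 2, 3], [10, 11], [30]]
```

The number of clusters is not fixed in advance; it is determined by the threshold t and by the order in which elements arrive.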

41
Clustering Techniques

Iterative partitional clustering
Given n elements x1, x2, ..., xn, and k clusters, each with a center.
Assign each element to its closest cluster center
After all assignments have been made, compute the cluster centroid for each cluster
Repeat the above two steps with the new centroids until the algorithm converges
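These steps correspond to what is commonly called k-means. A minimal sketch, assuming points are numeric tuples, initial centers are supplied by the caller, and a hypothetical `max_iter` cap guards against non-termination:

```python
from math import dist


def k_means(points, centers, max_iter=100):
    """Alternate assignment and centroid update until the centers stop moving."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = coordinate-wise mean; an empty cluster keeps its old center.
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break                             # converged
        centers = new_centers
    return centers, clusters


centers, clusters = k_means([(0, 0), (0, 2), (10, 0), (10, 2)],
                            centers=[(0, 0), (10, 0)])
```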

42
Mining Sequence Data

Characteristics of Sequence Data
Collection of data elements which are ordered sequences
In a sequence, each item has an index associated with it
A k-sequence is a sequence of length k. Support for a sequence j is the number of m-sequences (m ≥ j) which contain j as a subsequence
Sequence data: transaction logs, DNA sequences, patient ailment history, ...

43
Mining Sequence Data

Some Definitions
A sequence is a list of itemsets of finite length.
Example: {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
(the purchases of a single customer over time)
The order of items within an itemset does not matter, but the order of itemsets matters
A subsequence is a sequence with some itemsets deleted

44
Mining Sequence Data

Some Definitions
A sequence S = ⟨a1, a2, ..., am⟩ is said to be contained within another sequence S', if S' contains a subsequence ⟨b1, b2, ..., bm⟩ such that a1 ⊆ b1, a2 ⊆ b2, ..., am ⊆ bm.
Hence, {pen} {pencil} {ruler, pencil} is contained in {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
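The containment test can be sketched with a greedy left-to-right scan: each itemset of the smaller sequence is matched against the earliest remaining itemset of the larger one that includes it. The function name `is_contained` and the list-of-sets representation are assumptions.

```python
def is_contained(s, t):
    """True if sequence `s` (a list of itemsets) is contained in sequence `t`."""
    pos = 0
    for itemset in s:
        # Advance to the earliest remaining itemset of t that includes this one.
        while pos < len(t) and not set(itemset) <= set(t[pos]):
            pos += 1
        if pos == len(t):
            return False
        pos += 1                      # later itemsets must match strictly later
    return True


purchases = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
             {"ink", "eraser"}, {"ruler", "pencil"}]
print(is_contained([{"pen"}, {"pencil"}, {"ruler", "pencil"}], purchases))  # True
print(is_contained([{"ruler"}, {"pen"}], purchases))                        # False
```

Matching greedily is safe here because taking the earliest possible match never rules out a later one.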

45
Mining Sequence Data

Apriori Algorithm for Sequences
L1 ← Set of all interesting 1-sequences
k ← 1
while Lk is not empty do
Generate all candidate (k+1)-sequences
Lk+1 ← Set of all interesting (k+1)-sequences
k ← k + 1
done

46
Mining Sequence Data
Generating Candidate Sequences
Given L1, L2, ..., Lk, candidate sequences of Lk+1 are generated as follows:
For each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1
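The level-wise search can be sketched as below for sequences of single items written as strings. Several things here are assumptions: the ten input strings are our reading of the two-column example in this presentation, candidates are formed by adding one interesting 1-sequence at either end of an interesting k-sequence (which is how the example's candidate lists appear to be built), and support counts non-contiguous occurrences.

```python
# Ten example sequences, one character per item (our reading; an assumption).
SEQUENCES = ["abcde", "bdae", "aebd", "be", "eabda",
             "aaaa", "baaa", "cbdb", "abbab", "abde"]


def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(ch in it for ch in pattern)   # `in` consumes the iterator as it searches


def support(pattern, sequences):
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def sequence_apriori(sequences, minsup):
    """Return [L1, L2, ...], the interesting sequences of each length."""
    items = sorted({ch for s in sequences for ch in s})
    l1 = [i for i in items if support(i, sequences) >= minsup]
    levels, current = [], l1
    while current:
        levels.append(current)
        # Extend every interesting k-sequence by one interesting item on either end.
        candidates = sorted({s + i for s in current for i in l1} |
                            {i + s for s in current for i in l1})
        current = [c for c in candidates if support(c, sequences) >= minsup]
    return levels


levels = sequence_apriori(SEQUENCES, minsup=0.5)
```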
47
Mining Sequence Data
Example (minsup = 0.5)
Sequences:
a b c d e
b d a e
a e b d
b e
e a b d a
a a a a
b a a a
c b d b
a b b a b
a b d e
Interesting 1-sequences: a, b, d, e
Candidate 2-sequences:
aa, ab, ad, ae
ba, bb, bd, be
da, db, dd, de
ea, eb, ed, ee
48
Mining Sequence Data
Example (minsup = 0.5)
Sequences: as on the previous slide
Interesting 2-sequences: ab, bd
Candidate 3-sequences:
aba, abb, abd, abe
aab, bab, dab, eab
bda, bdb, bdd, bde
bbd, dbd, ebd
Interesting 3-sequences
49
Mining Sequence Data
Language Inference
Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behaviour.
Input: set of sequences (e.g. aabb, ababcac, abbac)
Output: state machine
50
Mining Sequence Data

Inferring the syntax of a language given its
sentences
Applications discerning behavioural patterns,
emergent properties discovery, collaboration
modeling,
State machine discovery is the reverse of state
machine construction
Discovery is maximalist in nature

51
Mining Sequence Data
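Maximal nature of language inference
Input sequences: abc, aabc, aabbc, abbc
[Figure: the most general state machine, with a single transition labelled a,b,c, and the most specific state machine, built from individual a, b and c transitions.]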
52
Mining Sequence Data

Shortest-run Generalization (Srinivasa and Spiliopoulou 2000)
Given a set of n sequences
Create a state machine for the first sequence
for j ← 2 to n do
Create a state machine for the jth sequence
Merge this state machine into the existing one as follows:
Merge all halt states in the new state machine to the halt state in the existing state machine
If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
done

53
Mining Sequence Data
Shortest-run Generalization (Srinivasa and
Spiliopoulou 2000)
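Input sequences: aabcb, aac, aabc
[Figure: state machines for the three sequences, built from transitions labelled a, b and c, shown at successive stages of merging.]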
54
Mining Streaming Data

Characteristics of streaming data
Large data sequence
No storage
Often an infinite sequence
Examples: stock market quotes, streaming audio/video, network traffic

55
Mining Streaming Data
Running mean
Let n = number of items read so far, avg = running average calculated so far.
On reading the next number num:
avg ← (n·avg + num) / (n + 1)
n ← n + 1
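The update needs only two stored values, which is what makes it suitable for streams. A minimal sketch, with the class name `RunningMean` chosen here for illustration:

```python
class RunningMean:
    """Keeps the mean of a stream using only the count and the current average."""

    def __init__(self):
        self.n = 0
        self.avg = 0.0

    def update(self, num):
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg


mean = RunningMean()
for value in [2, 4, 6]:
    mean.update(value)
print(mean.avg)  # 4.0
```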
56
Mining Streaming Data
Running variance
var = Σ(num − avg)² = Σnum² − 2Σ(num·avg) + Σavg²
Let
A = Σnum² of all numbers read so far
B = 2Σ(num·avg) of all numbers read so far
C = Σavg² of all numbers read so far
avg = average of numbers read so far
n = number of numbers read so far
57
Mining Streaming Data
Running variance
On reading next number num:
avg ← (avg·n + num) / (n + 1)
n ← n + 1
A ← A + num²
B ← B + 2·avg·num
C ← C + avg²
var = A − B + C
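The update rules translate almost line for line into Python. This sketch follows the slide's scheme as written; the generator name is an assumption. Note that each term uses the average available when that number was read, so the value is Σ(num − running avg)² and can differ from the batch figure computed against the final average.

```python
def running_variance(stream):
    """Yield the slide's running value var = A - B + C after each number."""
    n, avg = 0, 0.0
    a = b = c = 0.0
    for num in stream:
        avg = (avg * n + num) / (n + 1)
        n += 1
        a += num * num        # A: sum of num^2
        b += 2 * avg * num    # B: sum of 2*avg*num, with the average at that step
        c += avg * avg        # C: sum of avg^2
        yield a - b + c


values = list(running_variance([1, 3]))
# After 1: avg = 1, var = 0. After 3: avg = 2, var = (3 - 2)^2 = 1.
```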
58
Mining Streaming Data
α-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999)
Let streaming data be in the form of frames, where each frame comprises one or more data elements.
Support for data element k within a frame is defined as (occurrences of k) / (elements in frame).
α-Consistency for data element k is the sustained support for k over all frames read so far, with a leakage of (1 − α).
59
Mining Streaming Data
α-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999)
[Figure labels: α·sup(k), (1 − α)]
level_t(k) = (1 − α)·level_(t−1)(k) + α·sup(k)
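The level update is an exponentially weighted running support and can be sketched per element as follows. Assumptions are flagged loudly: the weighting parameter is called `alpha` here, every level starts at 0, frames are non-empty lists, and the function name is hypothetical; the cited paper may define these details differently.

```python
from collections import Counter


def consistency_levels(frames, alpha):
    """Apply level_t(k) = (1 - alpha) * level_{t-1}(k) + alpha * sup_t(k) per element."""
    levels = {}
    for frame in frames:
        counts = Counter(frame)
        # Elements absent from this frame still decay, with support 0.
        for k in set(levels) | set(counts):
            sup = counts[k] / len(frame)   # occurrences of k / elements in frame
            levels[k] = (1 - alpha) * levels.get(k, 0.0) + alpha * sup
    return levels


levels = consistency_levels([["a", "a", "b", "c"], ["a", "b", "b", "b"]], alpha=0.5)
```

An element that appears steadily keeps its level, while one that stops appearing sees its level shrink by a factor of (1 − alpha) per frame.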
60
Data Warehousing

A platform for online analytical processing
(OLAP)
Warehouses collect transactional data from
several transactional databases and organize them
in a fashion amenable to analysis
Also called data marts
A critical component of the decision support
system (DSS) of enterprises
Some typical DW queries
Which item sells best in each region that has retail outlets?
Which advertising strategy is best for South
India?
Which (age_group/occupation) in South India likes
fast food, and which (age_group/occupation) likes
to cook?

61
Data Warehousing
[Figure: OLTP systems (Order Processing, Inventory, Sales) feed a Data Cleaning stage, which loads the Data Warehouse (OLAP).]
62
OLTP vs OLAP
Transactional Data (OLTP) | Analysis Data (OLAP)
Small or medium size databases | Very large databases
Transient data | Archival data
Frequent insertions and updates | Infrequent updates
Small query shadow | Very large query shadow
Normalization important to handle updates | De-normalization important to handle queries
63
Data Cleaning

Performs logical transformation of transactional
data to suit the data warehouse
Model of operations → model of enterprise
Usually a semi-automatic process

64
Data Cleaning
Data Warehouse: Customers, Products, Orders, Inventory, Price, Time
Orders (Order_id, Price, Cust_id)
Inventory (Prod_id, Price, Price_chng)
Sales (Cust_id, Cust_prof, Tot_sales)
65
Multi-dimensional Data Model
[Figure: a multi-dimensional data cube with dimensions Price, Products, Orders and Customers, and a Time axis marked Jan01, Jun01, Jan02, Jun02.]
66
Some MDBMS Operations

Roll-up
Add dimensions
Drill-down
Collapse dimensions
Vector-distance operations (ex clustering)
Vector space browsing
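As an illustration of moving between levels of detail, the sketch below sums a measure over the dimensions that are dropped. Roll-up is taken here in its usual sense of aggregating to a coarser view; the fact rows, the choice of measure and the function name `roll_up` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, customer, month, price).
FACTS = [
    ("Books", "C1", "Jan01", 120),
    ("Books", "C2", "Jan01", 80),
    ("Crayons", "C1", "Jun01", 40),
    ("Books", "C1", "Jun01", 60),
]


def roll_up(facts, keep):
    """Sum the measure (last column) over every dimension not listed in `keep`."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in keep)] += row[-1]
    return dict(totals)


by_product_month = roll_up(FACTS, keep=(0, 2))  # finer view: product by month
by_product = roll_up(FACTS, keep=(0,))          # coarser view: rolled up over time
```

Going from `by_product` back to `by_product_month` is the corresponding drill-down: the same measure shown against more dimensions.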

67
Star Schema
[Figure: a central fact table linked to four surrounding dimension tables.]
68
WWW Based References

http://www.kdnuggets.com/
http://www.megaputer.com/
http://www.almaden.ibm.com/cs/quest/index.html
http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
http://www.cs.su.oz.au/thierry/ckdd.html
http://www.dwinfocenter.org/
http://datawarehouse.itoolbox.com/
http://www.knowledgestorm.com/
http://www.bitpipe.com/
http://www.dw-institute.com/
http://www.datawarehousing.com/

69
References

R. Agrawal, R. Srikant. "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
R. Agrawal, R. Srikant. "Mining Sequential Patterns", Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant. "The Quest Data Mining System", Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
Surajit Chaudhuri, Umesh Dayal. "An Overview of Data Warehousing and OLAP Technology", ACM SIGMOD Record, 26(1), March 1997.
Jennifer Widom. "Research Problems in Data Warehousing", Proc. of Int'l Conf. on Information and Knowledge Management, 1995.

70
References

A. Shoshani. "OLAP and Statistical Databases: Similarities and Differences", Proc. of ACM PODS 1997.
Panos Vassiliadis, Timos Sellis. "A Survey on Logical Models for OLAP Databases", ACM SIGMOD Record.
M. Gyssens, Laks V.S. Lakshmanan. "A Foundation for Multi-Dimensional Databases", Proc. of VLDB 1997, Athens, Greece.
Srinath Srinivasa, Myra Spiliopoulou. "Modeling Interactions Based on Consistent Patterns", Proc. of CoopIS 1999, Edinburgh, UK.
Srinath Srinivasa, Myra Spiliopoulou. "Discerning Behavioral Patterns By Mining Transaction Logs", Proc. of ACM SAC 2000, Como, Italy.