Title: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts
1Data Mining, Data Warehousing and Knowledge Discovery: Basic Algorithms and Concepts
- Srinath Srinivasa
- IIIT Bangalore
- sri_at_iiitb.ac.in
2Overview
- Why Data Mining?
- Data Mining concepts
- Data Mining algorithms
- Tabular data mining
- Association, Classification and Clustering
- Sequence data mining
- Streaming data mining
- Data Warehousing concepts
3Why Data Mining
From a managerial perspective
Analyzing trends
Wealth generation
Security
Strategic decision making
4Data Mining
- Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
- No Query
- But an Interestingness criterion
5Data Mining
[Figure: diagram linking Data, Interestingness criteria and Hidden patterns]
6Data Mining
[Figure: diagram linking Data, Interestingness criteria and Hidden patterns, annotated with Type of Patterns]
7Data Mining
[Figure: diagram linking Data, Interestingness criteria and Hidden patterns, annotated with Type of data and Type of Interestingness criteria]
8Type of Data
- Tabular (Ex: Transaction data)
- Relational
- Multi-dimensional
- Spatial (Ex: Remote sensing data)
- Temporal (Ex: Log information)
- Streaming (Ex: multimedia, network traffic)
- Spatio-temporal (Ex: GIS)
- Tree (Ex: XML data)
- Graphs (Ex: WWW, BioMolecular data)
- Sequence (Ex: DNA, activity logs)
- Text, Multimedia
9Type of Interestingness
- Frequency
- Rarity
- Correlation
- Length of occurrence (for sequence and temporal data)
- Consistency
- Repeating / periodicity
- Abnormal behavior
- Other patterns of interestingness
10Data Mining vs Statistical Inference
Statistics
Statistical Reasoning
Conceptual Model (Hypothesis)
Proof (Validation of Hypothesis)
11Data Mining vs Statistical Inference
Data mining
Mining Algorithm Based on Interestingness
Data
Pattern (model, rule, hypothesis) discovery
12Data Mining Concepts
Associations and Item-sets
An association is a rule of the form: if X then Y. It is denoted as X → Y.
Example: If India wins in cricket, sales of sweets go up.
For any rule, if both X → Y and Y → X hold, then X and Y are called an interesting item-set.
Example: People buying school uniforms in June also buy school bags (and people buying school bags in June also buy school uniforms).
13Data Mining Concepts
Support and Confidence
The support for a rule R is the ratio of the number of transactions in which R occurs to the total number of transactions.
The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
14Data Mining Concepts
Support and Confidence
Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
Transactions (one per line):
Bag, Uniform, Crayons
Books, Bag, Uniform
Bag, Uniform, Pencil
Bag, Pencil, Books
Uniform, Crayons, Bag
Bag, Pencil, Books
Crayons, Uniform, Bag
Books, Crayons, Bag
Uniform, Crayons, Pencil
Pencil, Uniform, Books
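The two measures can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `support` and `confidence` are illustrative, and the ten transactions are transcribed from the table on this slide.

```python
# The ten example transactions, one set of items per transaction.
transactions = [set(t.split()) for t in [
    "Bag Uniform Crayons", "Books Bag Uniform", "Bag Uniform Pencil",
    "Bag Pencil Books", "Uniform Crayons Bag", "Bag Pencil Books",
    "Crayons Uniform Bag", "Books Crayons Bag", "Uniform Crayons Pencil",
    "Pencil Uniform Books",
]]

def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of all transactions that contain the item-set."""
    return count(itemset, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing lhs that also contain rhs."""
    return count(set(lhs) | set(rhs), transactions) / count(lhs, transactions)

print(support({"Bag", "Uniform"}, transactions))
print(confidence({"Bag"}, {"Uniform"}, transactions))
```

Counting transactions directly, rather than dividing two support values, avoids small floating-point differences in the confidence.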
15Mining for Frequent Item-sets
- The Apriori Algorithm
- Given minimum required support s as interestingness criterion
- Search for all individual elements (1-element item-sets) that have a minimum support of s
- Repeat
- From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of s
- This becomes the set of all frequent (i+1)-element item-sets that are interesting
- Until item-set size reaches maximum
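The level-wise search above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the reference implementation: the function name `apriori` and the candidate-generation step (joining pairs of frequent i-element item-sets) are choices made here, and the transactions are the ten example transactions used in these slides.

```python
from itertools import combinations

transactions = [frozenset(t.split()) for t in [
    "Bag Uniform Crayons", "Books Bag Uniform", "Bag Uniform Pencil",
    "Bag Pencil Books", "Uniform Crayons Bag", "Bag Pencil Books",
    "Crayons Uniform Bag", "Books Crayons Bag", "Uniform Crayons Pencil",
    "Pencil Uniform Books",
]]

def apriori(transactions, minsup):
    """Return {frequent item-set: support} for all item-set sizes."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    # Frequent 1-element item-sets.
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = sup(itemset)
        size = len(level[0]) + 1
        # Candidates: unions of two frequent item-sets that grow by one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        level = [c for c in candidates if sup(c) >= minsup]
    return frequent

frequent = apriori(transactions, 0.3)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because every subset of a frequent item-set is itself frequent, candidates only need to be built from the previous level; item-sets that were already infrequent are never extended.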
16Mining for Frequent Item-sets
The Apriori Algorithm (Example)
Let minimum support = 0.3
Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets: {Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books}, {Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}
17Mining for Frequent Item-sets
The Apriori Algorithm (Example)
Let minimum support = 0.3
Interesting 3-element item-sets: {Bag, Uniform, Crayons}
18Mining for Association Rules
Association rules are of the form A → B, and are directional.
Association rule mining requires two thresholds: minsup and minconf.
19Mining for Association Rules
Mining association rules using apriori
- General Procedure
- Use apriori to generate frequent itemsets of different sizes
- At each iteration divide each frequent itemset X into two parts LHS and RHS. This represents a rule of the form LHS → RHS
- The confidence of such a rule is support(X)/support(LHS)
- Discard all rules whose confidence is less than minconf.
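The rule-generation step can be sketched as below. This is a hedged illustration: `rules_from_itemset` is a hypothetical helper name, the transactions are the ten example transactions from these slides, and confidence is computed from raw counts, which equals support(X)/support(LHS).

```python
from itertools import combinations

transactions = [frozenset(t.split()) for t in [
    "Bag Uniform Crayons", "Books Bag Uniform", "Bag Uniform Pencil",
    "Bag Pencil Books", "Uniform Crayons Bag", "Bag Pencil Books",
    "Crayons Uniform Bag", "Books Crayons Bag", "Uniform Crayons Pencil",
    "Pencil Uniform Books",
]]

def rules_from_itemset(itemset, transactions, minconf):
    """Split a frequent item-set X into every LHS -> RHS rule and keep
    the rules whose confidence is at least minconf."""
    def count(s):
        return sum(s <= t for t in transactions)

    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = count(itemset) / count(lhs)  # support(X) / support(LHS)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules

for lhs, rhs, conf in rules_from_itemset({"Bag", "Uniform", "Crayons"},
                                         transactions, minconf=0.7):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```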
20Mining for Association Rules
Mining association rules using apriori
Example: The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. This can be divided into the following rules:
Bag → Uniform, Crayons
Bag, Uniform → Crayons
Bag, Crayons → Uniform
Uniform → Bag, Crayons
Uniform, Crayons → Bag
Crayons → Bag, Uniform
21Mining for Association Rules
Mining association rules using apriori
Confidence values for these rules are as follows:
Bag → Uniform, Crayons: 3/8 = 0.375
Bag, Uniform → Crayons: 3/5 = 0.6
Bag, Crayons → Uniform: 3/4 = 0.75
Uniform → Bag, Crayons: 3/7 = 0.428
Uniform, Crayons → Bag: 3/4 = 0.75
Crayons → Bag, Uniform: 3/5 = 0.6
If minconf is 0.7, then we have discovered the following rules
22Mining for Association Rules
Mining association rules using apriori
People who buy a school bag and a set of crayons are likely to buy a school uniform.
People who buy a school uniform and a set of crayons are likely to buy a school bag.
23Generalized Association Rules
Since customers can buy any number of items in
one transaction, the transaction relation would
be in the form of a list of individual
purchases.
Bill No. Date Item
15563 23.10.2003 Books
15563 23.10.2003 Crayons
15564 23.10.2003 Uniform
15564 23.10.2003 Crayons
24Generalized Association Rules
A transaction for the purposes of data mining is
obtained by performing a GROUP BY of the table
over various fields.
25Generalized Association Rules
A GROUP BY over Bill No. would show frequent
buying patterns across different customers. A
GROUP BY over Date would show frequent buying
patterns across different days.
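The grouping step can be illustrated with the four purchase rows shown in the table. This is a small sketch in plain Python standing in for a SQL GROUP BY; the helper name `group_transactions` is an assumption made for this example.

```python
from collections import defaultdict

# (Bill No., Date, Item) rows from the example table.
purchases = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def group_transactions(rows, key_index, item_index=2):
    """Collect the items of all rows sharing the same key into one transaction."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[item_index])
    return dict(groups)

by_bill = group_transactions(purchases, key_index=0)  # one transaction per bill
by_date = group_transactions(purchases, key_index=1)  # one transaction per day
print(by_bill)
print(by_date)
```

The same rows therefore yield two transactions when grouped by bill and a single, larger transaction when grouped by date, which is why the choice of grouping field changes the patterns that can be found.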
26Classification and Clustering
Given a set of data elements:
Classification maps each data element to one of a set of pre-determined classes, based on the difference among data elements belonging to different classes.
Clustering groups data elements into different groups, based on the similarity between elements within a single group.
27Classification Techniques
Decision Tree Identification
Outlook Temp Play?
Sunny 30 Yes
Overcast 15 No
Sunny 16 Yes
Cloudy 27 Yes
Overcast 25 Yes
Overcast 17 No
Cloudy 17 No
Cloudy 35 Yes
Classification problem: Weather → Play (Yes, No)
28Classification Techniques
- Hunt's method for decision tree identification
- Given N element types and m decision classes
- For i ← 1 to N do
- Add element i to the (i-1)-element item-sets from the previous iteration
- Identify the set of decision classes for each item-set
- If an item-set has only one decision class, then that item-set is done; remove that item-set from subsequent iterations
- done
29Classification Techniques
Decision Tree Identification Example
Outlook Temp Play?
Sunny Warm Yes
Overcast Chilly No
Sunny Chilly Yes
Cloudy Pleasant Yes
Overcast Pleasant Yes
Overcast Chilly No
Cloudy Chilly No
Cloudy Warm Yes
Sunny: Yes
Cloudy: Yes/No
Overcast: Yes/No
31Classification Techniques
Decision Tree Identification Example
Cloudy, Warm: Yes
Cloudy, Chilly: No
Cloudy, Pleasant: Yes
32Classification Techniques
Decision Tree Identification Example
Overcast, Warm: (no instances)
Overcast, Chilly: No
Overcast, Pleasant: Yes
33Classification Techniques
Decision Tree Identification Example
[Figure: resulting decision tree]
Outlook (Yes/No)
Sunny: Yes
Cloudy (Yes/No): Warm → Yes, Chilly → No, Pleasant → Yes
Overcast (Yes/No): Chilly → No, Pleasant → Yes
34Classification Techniques
Decision Tree Identification Example
- Top down technique for decision tree identification
- Decision tree created is sensitive to the order in which items are considered
- If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.
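The top-down procedure from the preceding slides can be sketched as follows. This is one possible reading of the method, written for illustration: the function name `identify_tree` is an assumption, the rows are the discretized weather table from the example, and attributes are considered in column order (Outlook, then Temp).

```python
# (Outlook, Temp, Play?) rows from the example table.
table = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]

def identify_tree(rows, n_attributes):
    """Extend item-sets one attribute at a time; an item-set whose rows all
    share one decision class becomes a rule and is not extended further."""
    rules = {}
    pending = {(): rows}  # item-sets that still map to several classes
    for i in range(n_attributes):
        next_pending = {}
        for prefix, subset in pending.items():
            groups = {}
            for row in subset:
                groups.setdefault(prefix + (row[i],), []).append(row)
            for itemset, members in groups.items():
                classes = {row[-1] for row in members}
                if len(classes) == 1:
                    rules[itemset] = classes.pop()
                else:
                    next_pending[itemset] = members
        pending = next_pending
    return rules, pending

rules, undecided = identify_tree(table, n_attributes=2)
for itemset, decision in rules.items():
    print(itemset, "->", decision)
```

Any item-sets left in `undecided` after the last attribute are the cases the slide describes as not resulting in a clear decision.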
35Other Classification Algorithms
Quinlan's depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value of each node is calculated, and the nodes having the lowest entropy values are selected and expanded.
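Entropy and information gain, the quantities both strategies rely on, can be computed as in the sketch below. It assumes the standard Shannon entropy definition and uses the discretized weather table from the decision tree example; the function names are illustrative.

```python
from collections import Counter
from math import log2

# (Outlook, Temp, Play?) rows from the decision tree example.
table = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the class column minus the weighted entropy after
    splitting the rows on the attribute at attr_index."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

print("gain(Outlook) =", round(information_gain(table, 0), 4))
print("gain(Temp)    =", round(information_gain(table, 1), 4))
```

On this table a gain-based strategy would test Temp before Outlook, which illustrates how the choice of test order differs from simply taking attributes in column order.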
36Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.
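37Euclidean Distance for Tables
[Figure: table rows plotted as points in a space with axes Outlook (Sunny, Cloudy, Overcast), Temp (Chilly, Pleasant, Warm) and Play (Play, Don't Play); example points (Overcast, Chilly, Don't Play) and (Cloudy, Pleasant, Play)]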
38Clustering Techniques
- General Strategy
- Draw a graph connecting items which are close to one another with edges.
- Partition the graph into maximally connected subcomponents.
- Construct an MST for the graph
- Merge items that are connected by the minimum weight of the MST into a cluster
39Clustering Techniques
Clustering types
Hierarchical clustering: Clusters are formed at different levels by merging clusters at a lower level
Partitional clustering: Clusters are formed at only one level
40Clustering Techniques
- Nearest Neighbour Clustering Algorithm
- Given n elements x1, x2, ..., xn, and threshold t
- j ← 1, k ← 1, Clusters = {}
- Repeat
- Find the nearest neighbour of xj
- Let the nearest neighbour be in cluster m
- If distance to nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
- j ← j+1
- until j > n
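A minimal Python sketch of this procedure is given below. It makes two assumptions that the slide leaves open: the nearest neighbour is searched only among elements already assigned to a cluster, and distance is Euclidean (`math.dist`, Python 3.8+). The function name and the sample points are illustrative.

```python
import math

def nearest_neighbour_clustering(points, t):
    """Assign each point to the cluster of its nearest already-clustered
    neighbour, or start a new cluster if that neighbour is farther than t."""
    clusters = []  # each cluster is a list of points
    for p in points:
        best_cluster, best_dist = None, math.inf
        for idx, cluster in enumerate(clusters):
            for q in cluster:
                d = math.dist(p, q)
                if d < best_dist:
                    best_cluster, best_dist = idx, d
        if best_cluster is None or best_dist > t:
            clusters.append([p])  # no neighbour within threshold: new cluster
        else:
            clusters[best_cluster].append(p)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
print(nearest_neighbour_clustering(points, t=1.5))
```

The result depends on the order in which elements are read, since each element is compared only with those seen before it.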
41Clustering Techniques
- Iterative partitional clustering
- Given n elements x1, x2, ..., xn, and k clusters, each with a center.
- Assign each element to its closest cluster center
- After all assignments have been made, compute the cluster centroids for each of the clusters
- Repeat the above two steps with the new centroids until the algorithm converges
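The two alternating steps can be sketched as below (this scheme is commonly known as k-means). The sketch assumes Euclidean distance, takes the initial centers as an argument, and treats "converges" as "centroids no longer change"; the function name and sample points are illustrative.

```python
import math

def iterative_partitional_clustering(points, centers, max_iter=100):
    """Alternate between assigning points to the closest center and
    recomputing each center as the centroid of its cluster."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]  # keep the old center of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centroids did not move
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters = iterative_partitional_clustering(points, [(0, 0), (10, 0)])
print(centers)
print(clusters)
```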
42Mining Sequence Data
- Characteristics of Sequence Data
- Collection of data elements which are ordered sequences
- In a sequence, each item has an index associated with it
- A k-sequence is a sequence of length k. Support for sequence j is the number of m-sequences (m > j) which contain j as a sequence
- Sequence data: transaction logs, DNA sequences, patient ailment history, ...
43Mining Sequence Data
- Some Definitions
- A sequence is a list of itemsets of finite length.
- Example
- {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
- the purchases of a single customer over time
- The order of items within an itemset does not matter, but the order of itemsets matters
- A subsequence is a sequence with some itemsets deleted
44Mining Sequence Data
- Some Definitions
- A sequence S = ⟨a1, a2, ..., am⟩ is said to be contained within another sequence S', if S' contains a subsequence ⟨b1, b2, ..., bm⟩ such that a1 ⊆ b1, a2 ⊆ b2, ..., am ⊆ bm.
- Hence, {pen} {pencil} {ruler, pencil} is contained in {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}
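The containment test can be written as a short greedy scan. This is an illustrative sketch: sequences are modelled as Python lists of sets, and the function name `is_contained` is an assumption made here.

```python
def is_contained(s, t):
    """Return True if every itemset of s is a subset of a distinct itemset
    of t, with the itemsets of t taken in order."""
    pos = 0
    for itemset in s:
        # Skip itemsets of t that cannot host the current itemset of s.
        while pos < len(t) and not itemset <= t[pos]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1  # the next itemset of s must match strictly later in t
    return True

purchases = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
             {"ink", "eraser"}, {"ruler", "pencil"}]
print(is_contained([{"pen"}, {"pencil"}, {"ruler", "pencil"}], purchases))
print(is_contained([{"pencil"}, {"pen"}], purchases))
```

Matching each itemset at the earliest possible position is sufficient, because taking an earlier match never rules out a later one.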
45Mining Sequence Data
- Apriori Algorithm for Sequences
- L1 ← Set of all interesting 1-sequences
- k ← 1
- while Lk is not empty do
- Generate all candidate (k+1)-sequences
- Lk+1 ← Set of all interesting (k+1)-sequences
- k ← k+1
- done
46Mining Sequence Data
Generating Candidate Sequences
Given L1, L2, ..., Lk, candidate sequences of Lk+1 are generated as follows:
For each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1
47Mining Sequence Data
Example
minsup = 0.5
Input sequences:
a b c d e
b d a e
a e b d
b e
e a b d a
a a a a
b a a a
c b d b
a b b a b
a b d e
Interesting 1-sequences: a, b, d, e
Candidate 2-sequences:
aa, ab, ad, ae
ba, bb, bd, be
da, db, dd, de
ea, eb, ed, ee
48Mining Sequence Data
Example
minsup = 0.5
Interesting 2-sequences: ab, bd
Candidate 3-sequences:
aba, abb, abd, abe,
aab, bab, dab, eab,
bda, bdb, bdd, bde,
bbd, dbd, ebd.
Interesting 3-sequences
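The level-wise search for sequences can be sketched as follows, run on the ten input sequences of this example. The sketch makes assumptions that the slides do not spell out: each element is a single item, a pattern is supported by a sequence if it occurs as a (not necessarily contiguous) subsequence, and candidates are built by appending or prepending an interesting 1-sequence. The function names are illustrative.

```python
sequences = [s.split() for s in [
    "a b c d e", "b d a e", "a e b d", "b e", "e a b d a",
    "a a a a", "b a a a", "c b d b", "a b b a b", "a b d e",
]]

def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # 'in' consumes the iterator

def support(pattern, sequences):
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

def apriori_sequences(sequences, minsup):
    """Return a list of levels; levels[k-1] holds the interesting k-sequences."""
    items = sorted({x for s in sequences for x in s})
    level = [(x,) for x in items if support((x,), sequences) >= minsup]
    singles = [p[0] for p in level]
    levels = []
    while level:
        levels.append(level)
        candidates = ({p + (x,) for p in level for x in singles}
                      | {(x,) + p for p in level for x in singles})
        level = sorted(c for c in candidates if support(c, sequences) >= minsup)
    return levels

for k, level in enumerate(apriori_sequences(sequences, 0.5), start=1):
    print(k, ["".join(p) for p in level])
```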
49Mining Sequence Data
Language Inference
Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behavior.
Input: set of sequences (e.g. aabb, ababcac, abbac)
Output: state machine
50Mining Sequence Data
- Inferring the syntax of a language given its sentences
- Applications: discerning behavioural patterns, emergent properties discovery, collaboration modeling, ...
- State machine discovery is the reverse of state machine construction
- Discovery is maximalist in nature
51Mining Sequence Data
Maximal nature of language inference
Input sequences: abc, aabc, aabbc, abbc
[Figure: the most general state machine (transitions labelled a, b, c) and the most specific state machine for these sequences]
52Mining Sequence Data
- Shortest-run Generalization (Srinivasa and Spiliopoulou 2000)
- Given a set of n sequences
- Create a state machine for the first sequence
- for j ← 2 to n do
- Create a state machine for the jth sequence
- Merge this sequence into the earlier sequence as follows
- Merge all halt states in the new state machine to the halt state in the existing state machine
- If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
- Done
53Mining Sequence Data
Shortest-run Generalization (Srinivasa and
Spiliopoulou 2000)
Example sequences: aabcb, aac, aabc
[Figure: state machines for the example sequences and the machine obtained after merging them]
54Mining Streaming Data
- Characteristics of streaming data
- Large data sequence
- No storage
- Often an infinite sequence
- Examples: Stock market quotes, streaming audio/video, network traffic
55Mining Streaming Data
Running mean
Let n = number of items read so far, avg = running average calculated so far.
On reading the next number num:
avg ← (n·avg + num) / (n+1)
n ← n+1
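The update can be expressed as a small generator that never stores the stream. This is a minimal sketch; the function name `running_means` is illustrative.

```python
def running_means(stream):
    """Yield the running average after each number, keeping only n and avg."""
    n = 0
    avg = 0.0
    for num in stream:
        avg = (n * avg + num) / (n + 1)
        n += 1
        yield avg

print(list(running_means([2, 4, 6])))
```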
56Mining Streaming Data
Running variance
var = Σ(num − avg)² = Σnum² − 2Σ(num·avg) + Σavg²
Let A = Σnum² of all numbers read so far
B = 2Σ(num·avg) of all numbers read so far
C = Σavg² of all numbers read so far
avg = average of numbers read so far
n = number of numbers read so far
57Mining Streaming Data
Running variance
On reading next number num:
avg ← (avg·n + num) / (n+1)
n ← n+1
A ← A + num²
B ← B + 2·avg·num
C ← C + avg²
var = A − B + C
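The accumulators above add each term using the running average available when the number is read, so the result approximates the sum of squared deviations from the final average. For comparison, the sketch below uses Welford's method, a different and widely used single-pass technique that is not the slide's A/B/C scheme; the class and attribute names are illustrative.

```python
class RunningVariance:
    """Single-pass mean and variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.avg = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, num):
        self.n += 1
        delta = num - self.avg
        self.avg += delta / self.n
        self.m2 += delta * (num - self.avg)

    @property
    def variance(self):
        """Population variance of the numbers read so far."""
        return self.m2 / self.n if self.n else 0.0

stats = RunningVariance()
for num in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(num)
print(stats.avg, stats.m2, stats.variance)
```

Here `m2` corresponds to Σ(num − avg)² taken against the final average, and dividing it by n gives the population variance.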
58Mining Streaming Data
λ-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999)
Let streaming data be in the form of frames, where each frame comprises one or more data elements.
Support for data element k within a frame is defined as (occurrences of k) / (elements in frame).
λ-Consistency for data element k is the sustained support for k over all frames read so far, with a leakage of (1 − λ).
59Mining Streaming Data
λ-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999)
[Figure: level for element k with inflow λ·sup(k) and leakage (1 − λ)]
level_t(k) = (1 − λ)·level_{t−1}(k) + λ·sup(k)
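The level update can be sketched as a per-frame recurrence. In this illustration the slide's parameter is called `lam` (an assumed name), the initial level is assumed to be 0, frames are plain Python lists, and the function name is hypothetical.

```python
def consistency_levels(frames, k, lam, level=0.0):
    """Return the level of element k after each frame, using
    level_t = (1 - lam) * level_{t-1} + lam * sup(k)."""
    levels = []
    for frame in frames:
        sup = frame.count(k) / len(frame)  # support of k within this frame
        level = (1 - lam) * level + lam * sup
        levels.append(level)
    return levels

frames = [["a", "b"], ["a", "a"], ["b", "b"]]
print(consistency_levels(frames, "a", lam=0.5))
```

Only the previous level has to be remembered between frames, which is what makes the measure suitable for data that cannot be stored.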
60Data Warehousing
- A platform for online analytical processing (OLAP)
- Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
- Also called data marts
- A critical component of the decision support system (DSS) of enterprises
- Some typical DW queries
- Which item sells best in each region that has retail outlets?
- Which advertising strategy is best for South India?
- Which (age_group/occupation) in South India likes fast food, and which (age_group/occupation) likes to cook?
61Data Warehousing
[Figure: OLTP systems (Order Processing, Inventory, Sales) pass through Data Cleaning into the Data Warehouse (OLAP)]
62OLTP vs OLAP
Transactional Data (OLTP) | Analysis Data (OLAP)
Small or medium size databases | Very large databases
Transient data | Archival data
Frequent insertions and updates | Infrequent updates
Small query shadow | Very large query shadow
Normalization important to handle updates | De-normalization important to handle queries
63Data Cleaning
- Performs logical transformation of transactional data to suit the data warehouse
- Model of operations → model of enterprise
- Usually a semi-automatic process
64Data Cleaning
Data Warehouse: Customers, Products, Orders, Inventory, Price, Time
Orders: Order_id, Price, Cust_id
Inventory: Prod_id, Price, Price_chng
Sales: Cust_id, Cust_prof, Tot_sales
65Multi-dimensional Data Model
[Figure: multi-dimensional data cube with dimensions Price, Products, Orders, Customers and Time (Jan01, Jun01, Jan02, Jun02)]
66Some MDBMS Operations
- Roll-up
- Add dimensions
- Drill-down
- Collapse dimensions
- Vector-distance operations (ex clustering)
- Vector space browsing
67Star Schema
[Figure: star schema with a central Fact table surrounded by four dimension tables, each labelled Dim Tbl_1]
68WWW Based References
- http://www.kdnuggets.com/
- http://www.megaputer.com/
- http://www.almaden.ibm.com/cs/quest/index.html
- http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
- http://www.cs.su.oz.au/thierry/ckdd.html
- http://www.dwinfocenter.org/
- http://datawarehouse.itoolbox.com/
- http://www.knowledgestorm.com/
- http://www.bitpipe.com/
- http://www.dw-institute.com/
- http://www.datawarehousing.com/
69References
- R. Agrawal, R. Srikant. "Fast Algorithms for Mining Association Rules". Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
- R. Agrawal, R. Srikant. "Mining Sequential Patterns". Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
- R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant. "The Quest Data Mining System". Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
- Surajit Chaudhuri, Umesh Dayal. "An Overview of Data Warehousing and OLAP Technology". ACM SIGMOD Record 26(1), March 1997.
- Jennifer Widom. "Research Problems in Data Warehousing". Proc. of Int'l Conf. on Information and Knowledge Management, 1995.
70References
- A. Shoshani. "OLAP and Statistical Databases: Similarities and Differences". Proc. of ACM PODS 1997.
- Panos Vassiliadis, Timos Sellis. "A Survey on Logical Models for OLAP Databases". ACM SIGMOD Record.
- M. Gyssens, Laks V.S. Lakshmanan. "A Foundation for Multi-Dimensional Databases". Proc. of VLDB 1997, Athens, Greece.
- Srinath Srinivasa, Myra Spiliopoulou. "Modeling Interactions Based on Consistent Patterns". Proc. of CoopIS 1999, Edinburgh, UK.
- Srinath Srinivasa, Myra Spiliopoulou. "Discerning Behavioral Patterns By Mining Transaction Logs". Proc. of ACM SAC 2000, Como, Italy.