Data Mining, Data Warehousing and Knowledge Discovery: Basic Algorithms and Concepts

1
Data Mining, Data Warehousing and Knowledge Discovery: Basic Algorithms and Concepts
  • Srinath Srinivasa
  • IIIT Bangalore
  • sri_at_iiitb.ac.in

2
Overview
  • Why Data Mining?
  • Data Mining concepts
  • Data Mining algorithms
  • Tabular data mining
  • Association, Classification and Clustering
  • Sequence data mining
  • Streaming data mining
  • Data Warehousing concepts

3
Why Data Mining
From a managerial perspective
Analyzing trends
Wealth generation
Security
Strategic decision making
4
Data Mining
  • Look for hidden patterns and trends in data that
    are not immediately apparent from summarizing the
    data
  • No query
  • But an interestingness criterion

5
Data Mining
Figure labels: Data, Interestingness criteria, Hidden patterns
6
Data Mining
Figure labels: Type of patterns, Data, Interestingness criteria, Hidden patterns
7
Data Mining
Figure labels: Type of data, Type of interestingness criteria, Data, Interestingness criteria, Hidden patterns
8
Type of Data
  • Tabular (e.g., transaction data)
  • Relational
  • Multi-dimensional
  • Spatial (e.g., remote sensing data)
  • Temporal (e.g., log information)
  • Streaming (e.g., multimedia, network traffic)
  • Spatio-temporal (e.g., GIS)
  • Tree (e.g., XML data)
  • Graphs (e.g., WWW, biomolecular data)
  • Sequence (e.g., DNA, activity logs)
  • Text, Multimedia

9
Type of Interestingness
  • Frequency
  • Rarity
  • Correlation
  • Length of occurrence (for sequence and temporal
    data)
  • Consistency
  • Repeating / periodicity
  • Abnormal behavior
  • Other patterns of interestingness

10
Data Mining vs Statistical Inference
Statistics
Statistical Reasoning
Conceptual Model (Hypothesis)
Proof (Validation of Hypothesis)
11
Data Mining vs Statistical Inference
Data mining
Mining Algorithm Based on Interestingness
Data
Pattern (model, rule, hypothesis) discovery
12
Data Mining Concepts
Associations and Item-sets
An association is a rule of the form "if X then Y". It is denoted as X → Y.
Example: If India wins in cricket, sales of sweets go up.
For any rule, if both X → Y and Y → X hold, then X and Y are called an "interesting item-set".
Example: People buying school uniforms in June also buy school bags (and people buying school bags in June also buy school uniforms).
13
Data Mining Concepts
Support and Confidence
The support for a rule R is the ratio of the number of transactions in which R occurs to the total number of transactions.
The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
14
Data Mining Concepts
Support and Confidence
Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
Transactions (one per line):
Bag, Uniform, Crayons
Books, Bag, Uniform
Bag, Uniform, Pencil
Bag, Pencil, Books
Uniform, Crayons, Bag
Bag, Pencil, Books
Crayons, Uniform, Bag
Books, Crayons, Bag
Uniform, Crayons, Pencil
Pencil, Uniform, Books
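The two figures above can be checked with a few lines of Python. This is a minimal sketch: the ten transactions are read column-wise from the slide's table (that grouping is an assumption, although it reproduces the slide's numbers), and the helper names `count`, `support` and `confidence` are illustrative only.

```python
# Ten transactions read from the slide's table (grouping is an assumption).
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return count(set(lhs) | set(rhs), transactions) / count(lhs, transactions)


print(support({"Bag", "Uniform"}, TRANSACTIONS))       # 0.5
print(confidence({"Bag"}, {"Uniform"}, TRANSACTIONS))  # 0.625
```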
15
Mining for Frequent Item-sets
  • The Apriori Algorithm
  • Given a minimum required support s as the
    interestingness criterion
  • Search for all individual elements (1-element
    item-sets) that have a minimum support of s
  • Repeat
  • From the results of the previous search for
    i-element item-sets, search for all (i+1)-element
    item-sets that have a minimum support of s
  • This becomes the set of all frequent
    (i+1)-element item-sets that are interesting
  • Until the item-set size reaches its maximum
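The loop above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the transactions are the ten from the example table (read column-wise, which is an assumption), candidates are formed by joining pairs of frequent i-element item-sets, and the function name `apriori` is hypothetical.

```python
from itertools import combinations

# Ten transactions from the example table (grouping is an assumption).
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"}, {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]


def apriori(transactions, min_support):
    """Return {size: [frequent item-sets of that size]}."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if sup(frozenset([i])) >= min_support]
    frequent = {}
    size = 1
    while current:
        frequent[size] = current
        previous = set(current)
        # Join step: unite pairs of frequent sets that differ in one item.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size + 1}
        # Prune step: every subset of the previous size must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, size))}
        current = sorted((c for c in candidates if sup(c) >= min_support),
                         key=sorted)
        size += 1
    return frequent


for size, itemsets in apriori(TRANSACTIONS, 0.3).items():
    print(size, [sorted(s) for s in itemsets])
```

The prune step relies on the fact that an item-set cannot be frequent unless all of its subsets are frequent, which is what keeps the candidate lists small.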

16
Mining for Frequent Item-sets
The Apriori Algorithm (Example)
Let minimum support = 0.3
Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets: {Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books}, {Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}
17
Mining for Frequent Item-sets
The Apriori Algorithm (Example)
Let minimum support = 0.3
Interesting 3-element item-sets: {Bag, Uniform, Crayons}
18
Mining for Association Rules
Association rules are of the form A → B, and they are directional.
Association rule mining requires two thresholds: minsup and minconf.
19
Mining for Association Rules
Mining association rules using apriori
  • General Procedure
  • Use apriori to generate frequent itemsets of
    different sizes
  • At each iteration, divide each frequent itemset X
    into two parts, LHS and RHS. This represents a
    rule of the form LHS → RHS
  • The confidence of such a rule is
    support(X)/support(LHS)
  • Discard all rules whose confidence is less than
    minconf.
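A minimal Python sketch of this procedure for a single frequent itemset is shown below. The transactions are the ten from the example table (read column-wise, which is an assumption), and `rules_from_itemset` is a hypothetical helper name. Confidence is computed from raw counts, which equals support(X)/support(LHS).

```python
from itertools import combinations

# Ten transactions from the example table (grouping is an assumption).
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"}, {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rules_from_itemset(itemset, transactions, min_conf):
    """Split a frequent itemset into LHS -> RHS rules and keep those whose
    confidence count(X) / count(LHS) is at least min_conf."""
    itemset = frozenset(itemset)
    whole = count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = whole / count(lhs, transactions)
            if conf >= min_conf:
                rules.append((sorted(lhs), sorted(itemset - lhs), conf))
    return rules


for lhs, rhs, conf in rules_from_itemset(
        {"Bag", "Uniform", "Crayons"}, TRANSACTIONS, min_conf=0.7):
    print(lhs, "->", rhs, round(conf, 3))
```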

20
Mining for Association Rules
Mining association rules using apriori
Example: The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. This can be divided into the following rules:
{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}
21
Mining for Association Rules
Mining association rules using apriori
Confidence for these rules is as follows:
{Bag} → {Uniform, Crayons}: 3/8 = 0.375
{Bag, Uniform} → {Crayons}: 3/5 = 0.6
{Bag, Crayons} → {Uniform}: 3/4 = 0.75
{Uniform} → {Bag, Crayons}: 3/7 = 0.428
{Uniform, Crayons} → {Bag}: 3/4 = 0.75
{Crayons} → {Bag, Uniform}: 3/5 = 0.6
If minconf is 0.7, then we have discovered the following rules:
22
Mining for Association Rules
Mining association rules using apriori
People who buy a school bag and a set of crayons are likely to buy a school uniform.
People who buy a school uniform and a set of crayons are likely to buy a school bag.
The rule {Crayons} → {Bag, Uniform} has a confidence of only 0.6 (crayons appear in five transactions, three of which also contain a bag and a uniform), so it falls below minconf and is discarded.
23
Generalized Association Rules
Since customers can buy any number of items in
one transaction, the transaction relation would
be in the form of a list of individual
purchases.
Bill No. Date Item
15563 23.10.2003 Books
15563 23.10.2003 Crayons
15564 23.10.2003 Uniform
15564 23.10.2003 Crayons
24
Generalized Association Rules
A transaction for the purposes of data mining is
obtained by performing a GROUP BY of the table
over various fields.
25
Generalized Association Rules
A GROUP BY over Bill No. would show frequent
buying patterns across different customers. A
GROUP BY over Date would show frequent buying
patterns across different days.
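The grouping step can be sketched in plain Python, without a database. The four rows are the ones shown in the transaction relation; the function name `group_items` and the tuple layout (bill number, date, item) are assumptions made for illustration.

```python
from collections import defaultdict

# Rows of the transaction relation: (bill number, date, item).
PURCHASES = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]


def group_items(rows, key_index):
    """Emulate a GROUP BY: collect the items sharing the same key value."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return dict(groups)


by_bill = group_items(PURCHASES, 0)  # one transaction per bill
by_date = group_items(PURCHASES, 1)  # one transaction per day
print(by_bill)
print(by_date)
```

Each resulting set is one transaction in the sense used by the frequent item-set algorithms, so the choice of grouping field decides what kind of pattern can be found.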
26
Classification and Clustering
Given a set of data elements:
Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes.
Clustering groups data elements into different groups, based on the similarity between elements within a single group.
27
Classification Techniques
Decision Tree Identification
Outlook Temp Play?
Sunny 30 Yes
Overcast 15 No
Sunny 16 Yes
Cloudy 27 Yes
Overcast 25 Yes
Overcast 17 No
Cloudy 17 No
Cloudy 35 Yes
Classification problem: Weather → Play (Yes, No)
28
Classification Techniques
  • Hunt's method for decision tree identification
  • Given N element types and m decision classes
  • For i ← 1 to N do
  • Add element i to the (i-1)-element item-sets from
    the previous iteration
  • Identify the set of decision classes for each
    item-set
  • If an item-set has only one decision class, then
    that item-set is done; remove that item-set from
    subsequent iterations
  • done
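The procedure can be sketched in Python as a recursive split on one attribute at a time. This is an illustrative reading of the steps above, not a definitive implementation: attributes are considered in fixed column order, the rows are the categorical weather table used in the example slides, and the function name `hunt` is hypothetical.

```python
from collections import defaultdict

# (Outlook, Temp, Play?) rows from the categorical weather example.
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]


def hunt(rows, attr_index=0, n_attrs=2):
    """Split on attributes in order; stop when a group has a single class."""
    classes = {row[-1] for row in rows}
    if len(classes) == 1:
        return classes.pop()      # pure item-set: decision reached
    if attr_index == n_attrs:
        return sorted(classes)    # attributes exhausted, still ambiguous
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row)
    return {value: hunt(subset, attr_index + 1, n_attrs)
            for value, subset in groups.items()}


tree = hunt(ROWS)
print(tree)
```

Because the attribute order is fixed, swapping the columns would generally give a different tree, which is the order sensitivity noted later in the deck.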

29
Classification Techniques
Decision Tree Identification Example
Outlook Temp Play?
Sunny Warm Yes
Overcast Chilly No
Sunny Chilly Yes
Cloudy Pleasant Yes
Overcast Pleasant Yes
Overcast Chilly No
Cloudy Chilly No
Cloudy Warm Yes
Sunny: Yes
Cloudy: Yes/No
Overcast: Yes/No
31
Classification Techniques
Decision Tree Identification Example
Cloudy, Warm: Yes
Cloudy, Chilly: No
Cloudy, Pleasant: Yes
32
Classification Techniques
Decision Tree Identification Example
Overcast, Warm: (no examples in the table)
Overcast, Chilly: No
Overcast, Pleasant: Yes
33
Classification Techniques
Decision Tree Identification Example
Figure: resulting decision tree
Outlook (Yes/No)
  Sunny: Yes
  Cloudy (Yes/No)
    Warm: Yes
    Chilly: No
    Pleasant: Yes
  Overcast (Yes/No)
    Chilly: No
    Pleasant: Yes
34
Classification Techniques
Decision Tree Identification Example
  • Top down technique for decision tree
    identification
  • Decision tree created is sensitive to the order
    in which items are considered
  • If an N-item-set does not result in a clear
    decision, classification classes have to be
    modeled by rough sets.

35
Other Classification Algorithms
Quinlan's depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value of each node is calculated, and the nodes having the lowest entropy values are selected and expanded.
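Both strategies rest on entropy. As a hedged illustration, the sketch below computes Shannon entropy and the information gain of each attribute on the categorical weather table from the earlier example slides. The function names are hypothetical, and the code shows only the selection criterion, not either algorithm in full.

```python
import math
from collections import Counter

# (Outlook, Temp, Play?) rows from the categorical weather example.
ROWS = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class column minus the weighted entropy after
    splitting the rows on the attribute at `attr_index`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


print("Outlook:", round(information_gain(ROWS, 0), 3))
print("Temp:   ", round(information_gain(ROWS, 1), 3))
```

On this table, splitting on Temp yields the larger gain, so a gain-based strategy would test Temp before Outlook.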
36
Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.
37
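Euclidean Distance for Tables
Figure: table rows plotted as points in a space with one axis per column. Axis values: Sunny, Cloudy, Overcast; Warm, Pleasant, Chilly; Play, Don't Play. Example points: (Overcast, Chilly, Don't Play) and (Cloudy, Pleasant, Play).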
38
Clustering Techniques
  • General Strategy
  • Draw a graph connecting items which are close to
    one another with edges.
  • Partition the graph into maximally connected
    subcomponents.
  • Construct an MST for the graph
  • Merge items that are connected by the minimum
    weight of the MST into a cluster

39
Clustering Techniques
Clustering types
Hierarchical clustering: clusters are formed at different levels by merging clusters at a lower level
Partitional clustering: clusters are formed at only one level
40
Clustering Techniques
  • Nearest Neighbour Clustering Algorithm
  • Given n elements x1, x2, ..., xn, and a threshold t
  • j ← 1, k ← 1, Clusters ← {}
  • Repeat
  • Find the nearest neighbour of xj
  • Let the nearest neighbour be in cluster m
  • If distance to nearest neighbour > t, then create
    a new cluster and k ← k+1; else assign xj to
    cluster m
  • j ← j+1
  • until j > n
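A small Python sketch of this algorithm follows. It makes two assumptions that the steps above leave open: the first element starts the first cluster, and the nearest neighbour is searched only among elements that have already been assigned. Euclidean distance (`math.dist`, Python 3.8+) is used, and the function name is hypothetical.

```python
import math


def nearest_neighbour_clustering(points, threshold):
    """Return one cluster label per point, in input order."""
    labels = []
    for j, point in enumerate(points):
        if j == 0:
            labels.append(0)  # assumption: first element opens cluster 0
            continue
        # Nearest neighbour among the elements clustered so far.
        nearest = min(range(j), key=lambda i: math.dist(points[i], point))
        if math.dist(points[nearest], point) > threshold:
            labels.append(max(labels) + 1)  # too far away: new cluster
        else:
            labels.append(labels[nearest])  # join the neighbour's cluster
    return labels


points = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
print(nearest_neighbour_clustering(points, threshold=1.5))  # [0, 0, 1, 1, 0]
```

The number of clusters is not fixed in advance; it follows from the threshold and from the order in which the elements arrive.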

41
Clustering Techniques
  • Iterative partitional clustering
  • Given n elements x1, x2, ..., xn, and k clusters,
    each with a center.
  • Assign each element to its closest cluster center
  • After all assignments have been made, compute the
    cluster centroids for each of the cluster
  • Repeat the above two steps with the new centroids
    until the algorithm converges
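The steps above describe the scheme commonly known as k-means. A minimal sketch follows, assuming Euclidean distance, caller-supplied initial centers, and convergence defined as "the centers stop changing". The function name `k_means` and the `max_iter` safeguard are illustrative additions.

```python
import math


def k_means(points, centres, max_iter=100):
    """Alternate assignment and centroid steps until the centres settle."""
    centres = [tuple(c) for c in centres]
    clusters = [[] for _ in centres]
    for _ in range(max_iter):
        # Assignment step: each element goes to its closest centre.
        clusters = [[] for _ in centres]
        for p in points:
            idx = min(range(len(centres)),
                      key=lambda i: math.dist(p, centres[i]))
            clusters[idx].append(p)
        # Update step: recompute each centroid (keep the centre if empty).
        new_centres = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centre
            for members, centre in zip(clusters, centres)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, clusters = k_means(points, centres=[(0, 0), (10, 10)])
print(centres)
```

The result depends on the initial centers; different starting points can converge to different partitions.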

42
Mining Sequence Data
  • Characteristics of Sequence Data
  • Collection of data elements which are ordered
    sequences
  • In a sequence, each item has an index associated
    with it
  • A k-sequence is a sequence of length k. Support
    for a sequence j is the number of m-sequences
    (m > j) which contain j as a subsequence
  • Sequence data: transaction logs, DNA sequences,
    patient ailment history, ...

43
Mining Sequence Data
  • Some Definitions
  • A sequence is a list of itemsets of finite
    length.
  • Example
  • {pen, pencil, ink}{pencil, ink}{ink, eraser}{ruler,
    pencil}
  • the purchases of a single customer over time
  • The order of items within an itemset does not
    matter, but the order of itemsets matters
  • A subsequence is a sequence with some itemsets
    deleted

44
Mining Sequence Data
  • Some Definitions
  • A sequence S = ⟨a1, a2, ..., am⟩ is said to be
    contained within another sequence S' if S'
    contains a subsequence ⟨b1, b2, ..., bm⟩ such that
    a1 ⊆ b1, a2 ⊆ b2, ..., am ⊆ bm.
  • Hence, {pen}{pencil}{ruler, pencil} is contained
    in {pen, pencil, ink}{pencil, ink}{ink, eraser}{ruler,
    pencil}
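The containment test can be sketched as a greedy left-to-right scan. Sequences are modelled here as Python lists of sets, and the function name `is_contained` is hypothetical.

```python
def is_contained(small, large):
    """True if every itemset of `small` is a subset of some itemset of
    `large`, with the matching itemsets appearing in the same order."""
    pos = 0
    for itemset in small:
        # Advance to the next itemset of `large` that covers this one.
        while pos < len(large) and not set(itemset) <= set(large[pos]):
            pos += 1
        if pos == len(large):
            return False
        pos += 1  # each itemset of `large` can be matched at most once
    return True


purchases = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
             {"ink", "eraser"}, {"ruler", "pencil"}]
print(is_contained([{"pen"}, {"pencil"}, {"ruler", "pencil"}], purchases))  # True
print(is_contained([{"ruler"}, {"pen"}], purchases))                        # False
```

Matching each itemset at the earliest possible position is sufficient, because an earlier match never leaves fewer options for the remaining itemsets.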

45
Mining Sequence Data
  • Apriori Algorithm for Sequences
  • L1 ← Set of all interesting 1-sequences
  • k ← 1
  • while Lk is not empty do
  • Generate all candidate (k+1)-sequences
  • Lk+1 ← Set of all interesting (k+1)-sequences
  • k ← k+1
  • done

46
Mining Sequence Data
Generating Candidate Sequences
Given L1, L2, ..., Lk, candidate sequences of Lk+1 are generated as follows:
For each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1
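The level-wise loop and the candidate generation can be sketched together in Python. Several points are assumptions: items are single characters (as in the example that follows), support is the fraction of input sequences that contain a pattern as a subsequence, and each interesting k-sequence is extended by appending or prepending an interesting 1-sequence. The ten input sequences are the ones used in the example slides; the function names are hypothetical.

```python
# Ten single-character sequences from the example slides.
SEQUENCES = ["abcde", "bdae", "aebd", "be", "eabda",
             "aaaa", "baaa", "cbdb", "abbab", "abde"]


def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a (gapped) subsequence."""
    remaining = iter(sequence)
    return all(symbol in remaining for symbol in pattern)


def support(pattern, sequences):
    return sum(1 for s in sequences if contains(s, pattern)) / len(sequences)


def apriori_sequences(sequences, min_support):
    """Return [L1, L2, ...], the interesting k-sequences for each k."""
    symbols = sorted(set("".join(sequences)))
    ones = [s for s in symbols if support(s, sequences) >= min_support]
    level, levels = list(ones), []
    while level:
        levels.append(level)
        # Candidates: extend each sequence by one symbol at either end.
        candidates = ({p + s for p in level for s in ones}
                      | {s + p for p in level for s in ones})
        level = sorted(c for c in candidates
                       if support(c, sequences) >= min_support)
    return levels


print(apriori_sequences(SEQUENCES, 0.5))
```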
47
Mining Sequence Data
Example
minsup = 0.5
Sequences:
a b c d e
b d a e
a e b d
b e
e a b d a
a a a a
b a a a
c b d b
a b b a b
a b d e
Interesting 1-sequences: a, b, d, e
Candidate 2-sequences:
aa, ab, ad, ae
ba, bb, bd, be
da, db, dd, de
ea, eb, ed, ee
48
Mining Sequence Data
Example
minsup = 0.5
Interesting 2-sequences: ab, bd
Candidate 3-sequences:
aba, abb, abd, abe,
aab, bab, dab, eab,
bda, bdb, bdd, bde,
bbd, dbd, ebd.
Interesting 3-sequences
49
Mining Sequence Data
Language Inference
Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behaviour.
Input: set of sequences (aabb, ababcac, abbac)
Output: state machine
50
Mining Sequence Data
  • Inferring the syntax of a language given its
    sentences
  • Applications: discerning behavioural patterns,
    emergent properties discovery, collaboration
    modeling, ...
  • State machine discovery is the reverse of state
    machine construction
  • Discovery is maximalist in nature

51
Mining Sequence Data
Maximal nature of language inference
Input sequences: abc, aabc, aabbc, abbc
Figure: the most general state machine (transition label a,b,c) and the most specific state machine (transitions labelled a, b and c along separate paths)
52
Mining Sequence Data
  • Shortest-run Generalization (Srinivasa and
    Spiliopoulou 2000)
  • Given a set of n sequences
  • Create a state machine for the first sequence
  • for j ← 2 to n do
  • Create a state machine for the jth sequence
  • Merge this sequence into the earlier sequence as
    follows
  • Merge all halt states in the new state machine to
    the halt state in the existing state machine
  • If two or more paths to the halt state share the
    same suffix, merge the suffixes together into a
    single path
  • Done

53
Mining Sequence Data
Shortest-run Generalization (Srinivasa and Spiliopoulou 2000)
Example sequences: aabcb, aac, aabc
Figure: the state machines built for these sequences and the result of merging them (transitions labelled a, b and c)
54
Mining Streaming Data
  • Characteristics of streaming data
  • Large data sequence
  • No storage
  • Often an infinite sequence
  • Examples: stock market quotes, streaming
    audio/video, network traffic

55
Mining Streaming Data
Running mean
Let n = number of items read so far, and avg = running average calculated so far.
On reading the next number num:
avg ← (n·avg + num) / (n+1)
n ← n+1
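The update rule translates directly into code. In this minimal sketch the class name `RunningMean` is hypothetical; only n and avg are kept between updates, so nothing from the stream has to be stored.

```python
class RunningMean:
    """Keeps the mean of a stream using only a count and the current mean."""

    def __init__(self):
        self.n = 0
        self.avg = 0.0

    def update(self, num):
        # avg <- (n * avg + num) / (n + 1), then n <- n + 1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg


mean = RunningMean()
for value in [2, 4, 6, 8]:
    mean.update(value)
print(mean.avg)  # 5.0
```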
56
Mining Streaming Data
Running variance
var = Σ(num − avg)² = Σnum² − 2·Σ(num·avg) + Σavg²
Let
A = Σnum² of all numbers read so far
B = 2·Σ(num·avg) of all numbers read so far
C = Σavg² of all numbers read so far
avg = average of numbers read so far
n = number of numbers read so far
57
Mining Streaming Data
Running variance
On reading the next number num:
avg ← (avg·n + num) / (n+1)
n ← n+1
A ← A + num²
B ← B + 2·avg·num
C ← C + avg²
var = A − B + C
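One caveat about the update above: it appears to accumulate squared deviations from the running average at each step, without dividing by n, so in general it differs from the variance about the final mean. A commonly used exact alternative is Welford's online algorithm. The sketch below implements that substitute technique, not the scheme on the slide; the class name `RunningVariance` is hypothetical, and the population variance (division by n) is an assumed convention.

```python
class RunningVariance:
    """Welford's online algorithm for the mean and population variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, num):
        self.n += 1
        delta = num - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self.m2 += delta * (num - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0


stream = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stream.update(value)
print(stream.mean, stream.variance)
```

Like the running mean, this needs only constant memory: the count, the current mean, and one accumulator.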
58
Mining Streaming Data
?-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999)
Let streaming data be in the form of frames, where each frame comprises one or more data elements.
Support for data element k within a frame is defined as (occurrences of k) / (elements in frame).
?-Consistency for data element k is the sustained support for k over all frames read so far, with a leakage of (1 − ?).
59
Mining Streaming Data
?-Consistency (Srinivasa and Spiliopoulou,
CoopIS 1999)
Figure labels: input ?·sup(k); leakage (1 − ?)
level_t(k) = (1 − ?)·level_{t−1}(k) + ?·sup(k)
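The level update can be sketched as follows. The weighting parameter of the recurrence is called `rate` here, which is a hypothetical name, and starting the level at 0 before the first frame is an assumption the slides do not state. Frames are modelled as plain Python lists.

```python
def frame_support(frame, element):
    """(occurrences of element) / (elements in frame)."""
    return frame.count(element) / len(frame)


def consistency_levels(frames, element, rate):
    """Apply level_t = (1 - rate) * level_{t-1} + rate * sup(element)
    frame by frame and return the level after each frame."""
    level = 0.0  # assumption: empty history before the first frame
    history = []
    for frame in frames:
        level = (1 - rate) * level + rate * frame_support(frame, element)
        history.append(level)
    return history


frames = [["a", "b"], ["a", "a"], ["b", "b"]]
print(consistency_levels(frames, "a", rate=0.5))  # [0.25, 0.625, 0.3125]
```

Only the current level has to be kept per data element, so the measure suits streams that cannot be stored.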
60
Data Warehousing
  • A platform for online analytical processing
    (OLAP)
  • Warehouses collect transactional data from
    several transactional databases and organize them
    in a fashion amenable to analysis
  • Also called data marts
  • A critical component of the decision support
    system (DSS) of enterprises
  • Some typical DW queries
  • Which item sells best in each region that has
    retail outlets
  • Which advertising strategy is best for South
    India?
  • Which (age_group/occupation) in South India likes
    fast food, and which (age_group/occupation) likes
    to cook?

61
Data Warehousing
Figure: OLTP sources (Order Processing, Inventory, Sales) feed, through Data Cleaning, into the Data Warehouse (OLAP)
62
OLTP vs OLAP
Transactional Data (OLTP) | Analysis Data (OLAP)
Small or medium size databases | Very large databases
Transient data | Archival data
Frequent insertions and updates | Infrequent updates
Small query shadow | Very large query shadow
Normalization important to handle updates | De-normalization important to handle queries
63
Data Cleaning
  • Performs logical transformation of transactional
    data to suit the data warehouse
  • Model of operations → model of enterprise
  • Usually a semi-automatic process

64
Data Cleaning
Data Warehouse: Customers, Products, Orders, Inventory, Price, Time
Orders: Order_id, Price, Cust_id
Inventory: Prod_id, Price, Price_chng
Sales: Cust_id, Cust_prof, Tot_sales
65
Multi-dimensional Data Model
Figure: multi-dimensional view with the labels Price, Products, Orders, Customers, and a Time axis (Jan01, Jun01, Jan02, Jun02)
66
Some MDBMS Operations
  • Roll-up
  • Add dimensions
  • Drill-down
  • Collapse dimensions
  • Vector-distance operations (ex clustering)
  • Vector space browsing

67
Star Schema
Figure: a fact table in the centre, surrounded by four dimension tables (each labelled Dim Tbl_1)
68
WWW Based References
  • http://www.kdnuggets.com/
  • http://www.megaputer.com/
  • http://www.almaden.ibm.com/cs/quest/index.html
  • http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
  • http://www.cs.su.oz.au/thierry/ckdd.html
  • http://www.dwinfocenter.org/
  • http://datawarehouse.itoolbox.com/
  • http://www.knowledgestorm.com/
  • http://www.bitpipe.com/
  • http://www.dw-institute.com/
  • http://www.datawarehousing.com/

69
References
  • R. Agrawal, R. Srikant. "Fast Algorithms for Mining
    Association Rules", Proc. of the 20th Int'l
    Conference on Very Large Databases, Santiago,
    Chile, Sept. 1994.
  • R. Agrawal, R. Srikant. "Mining Sequential
    Patterns", Proc. of the Int'l Conference on Data
    Engineering (ICDE), Taipei, Taiwan, March 1995.
  • R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J.
    Shafer, R. Srikant. "The Quest Data Mining
    System", Proc. of the 2nd Int'l Conference on
    Knowledge Discovery in Databases and Data Mining,
    Portland, Oregon, August 1996.
  • Surajit Chaudhuri, Umesh Dayal. "An Overview of
    Data Warehousing and OLAP Technology", ACM SIGMOD
    Record, 26(1), March 1997.
  • Jennifer Widom. "Research Problems in Data
    Warehousing", Proc. of Int'l Conf. on Information
    and Knowledge Management, 1995.

70
References
  • A. Shoshani. "OLAP and Statistical Databases:
    Similarities and Differences", Proc. of ACM PODS
    1997.
  • Panos Vassiliadis, Timos Sellis. "A Survey on
    Logical Models for OLAP Databases", ACM SIGMOD
    Record.
  • M. Gyssens, Laks V.S. Lakshmanan. "A Foundation for
    Multi-Dimensional Databases", Proc. of VLDB 1997,
    Athens, Greece.
  • Srinath Srinivasa, Myra Spiliopoulou. "Modeling
    Interactions Based on Consistent Patterns", Proc.
    of CoopIS 1999, Edinburgh, UK.
  • Srinath Srinivasa, Myra Spiliopoulou. "Discerning
    Behavioral Patterns By Mining Transaction Logs",
    Proc. of ACM SAC 2000, Como, Italy.