Introduction to Data Mining Chapter 4 - PowerPoint PPT Presentation

About This Presentation

Title:

Introduction to Data Mining Chapter 4

Slides: 49

Provided by: SEAS80

Transcript and Presenter's Notes

1
Introduction to Data Mining Chapter 4

2
Chapter 4 Outline

Background
Information is Power
Knowledge is Power
Data Mining

3
Introduction

4
Information is Power

Relevant
Right Information
Globalised world
Vast amount of information available

5
What is Information?

A collection of data
The act of human analysis and interpretation of
activities
Decomposing it into various components and
tackling them

6
What is Knowledge?

The act of human synthesis and evaluation of
information
Integration of the relevant components to form
a coherent whole system

7
Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge
(e.g. in Customer Relationship Management)

8
Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds
(GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of
data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation

9
Data Mining Definition I

The nontrivial extraction of hidden, previously
unidentified, and potentially valuable knowledge
from data
The use of a variety of techniques, such as neural networks,
decision trees, or standard statistical techniques,
to identify nuggets of information or
decision-making knowledge in bodies of data, and
to extract these in such a way that they can be
put to use in areas such as decision support,
prediction, forecasting, and estimation.

10
Data Mining Definition II

Finding hidden information in a database

11
Hidden Information

Years of experience
Great secret recipes
Success Factors

12
Origins of Data Mining

Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data

[Figure: data mining drawn at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

13
What is (not) Data Mining?

What is Data Mining?
Certain names are more prevalent in certain US
locations (O'Brien, O'Rourke, O'Reilly in the Boston
area)
Group together similar documents returned by a
search engine according to their context (e.g.
Amazon rainforest, Amazon.com)

What is not Data Mining?
Look up phone number in phone directory
Query a Web search engine for information about
Amazon

14
Database Processing vs. Data Mining Processing

Database Processing
Query: well defined, expressed in SQL
Data: operational data
Output: precise, a subset of the database

Data Mining Processing
Query: poorly defined, no precise query language
Data: not operational data
Output: fuzzy, not a subset of the database

15
Query Examples

Database
Find all credit applicants with the surname Lee.
Identify customers who have purchased more than 100,000 in the last year.
Find all customers who have purchased bread.

Data Mining
Find all credit applicants who are good credit risks. (classification)
Identify customers with similar eating habits. (clustering)
Find all items which are frequently purchased with bread. (association rules)
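
The contrast between the two bread queries can be sketched in a few lines of Python. This is only an illustration: the purchase records and the `min_fraction` threshold are hypothetical and do not come from the slides. The database-style query returns an exact subset of the stored records, while the mining-style query returns a derived pattern whose content depends on a chosen threshold.

```python
from collections import Counter

# Hypothetical purchase records (customer id -> items bought); not from the slides.
purchases = {
    "c1": {"bread", "butter", "milk"},
    "c2": {"bread", "butter"},
    "c3": {"milk", "eggs"},
    "c4": {"bread", "jam", "butter"},
}

# Database-style query: a precise filter whose answer is a subset of the data.
bread_buyers = sorted(c for c, items in purchases.items() if "bread" in items)

# Mining-style query: count what co-occurs with bread, then keep the items
# that appear in at least min_fraction of the bread baskets.
co_counts = Counter(
    item
    for items in purchases.values()
    if "bread" in items
    for item in items
    if item != "bread"
)
min_fraction = 0.6  # hypothetical threshold; a different value changes the answer
frequent_with_bread = sorted(
    item for item, n in co_counts.items() if n / len(bread_buyers) >= min_fraction
)

print(bread_buyers)          # ['c1', 'c2', 'c4']
print(frequent_with_bread)   # ['butter']
```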

16
Data Mining Models and Tasks
17
Classification Definition

Given a collection of records (the training set):
Each record contains a set of attributes; one of
the attributes is the class.
Find a model for the class attribute as a function
of the values of the other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of
the model. Usually, the given data set is divided
into training and test sets, with the training set
used to build the model and the test set used to
validate it.
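
A minimal sketch of the train/test procedure described on this slide, in plain Python. The records, the 30% test fraction, and the majority-class "model" are assumptions chosen to keep the example short; any real classifier would take the place of `majority_class_model`.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def majority_class_model(training_set, class_attr):
    """Build a trivial model that always predicts the most common training class."""
    counts = {}
    for record in training_set:
        counts[record[class_attr]] = counts.get(record[class_attr], 0) + 1
    majority = max(counts, key=counts.get)
    return lambda record: majority

def accuracy(model, test_set, class_attr):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for r in test_set if model(r) == r[class_attr])
    return correct / len(test_set)

# Hypothetical data set: 20 records, 15 of class "A" and 5 of class "B".
records = [{"x": i, "label": "A" if i % 4 else "B"} for i in range(20)]

train, test = train_test_split(records)
model = majority_class_model(train, "label")   # built from the training set only
print(len(train), len(test))
print(accuracy(model, test, "label"))          # measured on the held-out test set
```

The important design point is that `accuracy` never sees the training records, so the score reflects how the model handles previously unseen data.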

18
Illustrating Classification Task
19
Examples of Classification Task

Predicting tumor cells as benign or malignant
Classifying credit card transactions as
legitimate or fraudulent
Classifying secondary structures of protein as
alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather,
entertainment, sports, etc.

20
Classification Techniques

Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

21
Example of a Decision Tree
[Figure: training data table (not recovered) next to the model, a decision tree whose internal nodes are the splitting attributes]

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

22
Another Example of Decision Tree
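[Figure: training data table (not recovered) with columns typed categorical, categorical, continuous, and class, next to a second decision tree]

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
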
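
To make the point concrete, here is a small sketch that encodes both trees as Python functions and checks that they agree on a grid of probe records. It assumes the trees read as follows: the first splits on Refund, then MarSt, then TaxInc; the second splits on MarSt, then Refund, then TaxInc. The probe incomes are hypothetical, and treating an income of exactly 80K as the "NO" branch is an assumption, since the slides only label the branches as below and above 80K.

```python
from itertools import product

def tree_refund_first(refund, marital_status, taxable_income):
    """Tree from the first example: Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if taxable_income > 80_000 else "No"

def tree_marst_first(refund, marital_status, taxable_income):
    """Tree from the second example: MarSt, then Refund, then TaxInc."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "Yes" if taxable_income > 80_000 else "No"

# Probe every combination of attribute values (incomes are hypothetical).
grid = list(product(
    ["Yes", "No"],
    ["Single", "Married", "Divorced"],
    [60_000, 80_000, 95_000, 120_000],
))
disagreements = [r for r in grid if tree_refund_first(*r) != tree_marst_first(*r)]
print(disagreements)  # []
```

These two trees happen to encode the same rule, so they agree everywhere. In general, two trees that fit the same training data can still disagree on unseen records.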
23
Decision Tree Classification Task
Decision Tree
24
Apply Model to Test Data

[Figure: test data record (not recovered) and the decision tree from slide 21]
Start from the root of the tree.

25 to 28
Apply Model to Test Data

[Figures: the same tree, with the test record followed one branch further on each slide]

29
Apply Model to Test Data

[Figure: the record reaches the NO leaf under MarSt = Married]
Assign Cheat to No

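
The traversal on the "Apply Model to Test Data" slides can be sketched as a short recursive walk. The tree below follows the Refund / MarSt / TaxInc example; the test record is hypothetical, because the slide's test data table did not survive, and the dictionary layout (`attribute`, `route`, `children`) is just one possible encoding.

```python
def classify(node, record, path=None):
    """Walk from the root to a leaf, recording the branch taken at each node."""
    path = [] if path is None else path
    if isinstance(node, str):            # a leaf holds the class label
        return node, path
    value = record[node["attribute"]]
    branch = node["route"](value)        # map the attribute value to a branch name
    path.append(f'{node["attribute"]} = {branch}')
    return classify(node["children"][branch], record, path)

# Assumption: an income of exactly 80K follows the "< 80K" branch.
TAXINC = {
    "attribute": "TaxInc",
    "route": lambda v: "> 80K" if v > 80_000 else "< 80K",
    "children": {"< 80K": "No", "> 80K": "Yes"},
}
MARST = {
    "attribute": "MarSt",
    "route": lambda v: "Married" if v == "Married" else "Single, Divorced",
    "children": {"Married": "No", "Single, Divorced": TAXINC},
}
ROOT = {
    "attribute": "Refund",
    "route": lambda v: v,
    "children": {"Yes": "No", "No": MARST},
}

# Hypothetical test record that follows the path shown on the slides.
record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
label, path = classify(ROOT, record)
print(path)   # ['Refund = No', 'MarSt = Married']
print(label)  # No
```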
30
Decision Tree Classification Task
Decision Tree
31
What is Cluster Analysis?

Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups

32
Applications of Cluster Analysis

Understanding
Group related documents for browsing, group genes
and proteins that have similar functionality, or
group stocks with similar price fluctuations
Summarization
Reduce the size of large data sets

33
What is not Cluster Analysis?

Supervised classification
Have class label information
Simple segmentation
Dividing students into different registration
groups alphabetically, by last name
Results of a query
Groupings are a result of an external
specification
Graph partitioning
Some mutual relevance and synergy, but areas are
not identical

34
Notion of a Cluster can be Ambiguous
35
Types of Clusterings

A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters
Partitional Clustering
A division of data objects into non-overlapping
subsets (clusters) such that each data object is
in exactly one subset
Hierarchical clustering
A set of nested clusters organized as a
hierarchical tree
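
As an illustration of a partitional clustering, the sketch below runs a plain k-means loop. Note that the slides do not name an algorithm; k-means is used here only as a familiar example, and the points and starting centroids are hypothetical. The result is one cluster index per point, so every point belongs to exactly one subset.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means on 2-D points; returns one cluster index per point."""
    centroids = list(initial_centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its single nearest centroid.
        labels = [
            min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels

# Hypothetical points forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels = kmeans(points, initial_centroids=[(1, 1), (8, 8)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A hierarchical method would instead return nested groupings, for example by repeatedly merging the two closest clusters until one remains.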

36
Partitional Clustering
[Figure: Original Points]
37
Hierarchical Clustering
[Figure panels: Traditional Hierarchical Clustering, Traditional Dendrogram, Non-traditional Hierarchical Clustering, Non-traditional Dendrogram]
38
Association Rules

Association rules are a data mining technique and
complement market basket analysis.
All association rules are unidirectional and take
the following form:
Left-hand side IMPLIES Right-hand side
Both the left-hand side and the right-hand side of
the rule may contain multiple items or
combinations of items, such as the following:
Yellow Peppers IMPLIES Red Peppers, Bananas, and
Bakery
Associations are written as A → B, where A is
called the antecedent or left-hand side (LHS) and B is
called the consequent or right-hand side (RHS).
Ex: If people buy a printer, then they buy a
cartridge.
The antecedent is "buy a printer" and the
consequent is "buy a cartridge".

39
Association Rules

Market Basket Analysis
It is necessary to have a list of transactions and
what was purchased in each one.
Ex:
Transaction 1: Frozen Pizza, Cola, Milk
Transaction 2: Milk, Potato Chips
Transaction 3: Cola, Frozen Pizza
Transaction 4: Milk, Pretzels
Transaction 5: Cola, Pretzels

40
Association Rules
41
Association Rules

Measures of Association
Support
The percentage of baskets in the analysis where the
rule is true, that is, where both the left-hand
side and the right-hand side of the association
are found.
Confidence
The percentage of baskets containing the left-hand
side item that also contain the right-hand side
item. Confidence differs from support in that it
is the probability that the right-hand side item is
present given that the left-hand side item is in
the basket.
Calculated as a ratio:
(frequency of A and B) / (frequency of A)

42
Association Rules

Measures of Association
Support for the rule
Cola IMPLIES Frozen Pizza is 40%
Frozen Pizza IMPLIES Cola is 40%
Support for the single item
Milk is 60%
(Note: support considers only the combination and
not the direction.)

43
Association Rules

Measures of Association
Confidence
Milk IMPLIES Potato Chips has confidence
(frequency of A and B) / (frequency of A)
= 20% / 60%
= 33%
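
The support and confidence figures above can be reproduced from the five market-basket transactions with a short Python sketch. The function names `support` and `confidence` are chosen here for illustration; item names are taken from the transaction list with capitalization normalized.

```python
# The five transactions from the market basket example.
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)

def confidence(lhs, rhs):
    """(frequency of A and B) / (frequency of A) for the rule lhs IMPLIES rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(round(support({"Cola", "Frozen Pizza"}) * 100))          # 40
print(round(support({"Milk"}) * 100))                          # 60
print(round(confidence({"Milk"}, {"Potato Chips"}) * 100))     # 33
```

Because `support` takes a set, it gives the same value for Cola IMPLIES Frozen Pizza and Frozen Pizza IMPLIES Cola, whereas `confidence` depends on which side is the antecedent.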

44
Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): the process
of finding useful information and patterns in
data.
Data Mining: the use of algorithms to extract the
information and patterns derived by the KDD
process.

45
KDD Process
Modified from FPSS96C

Selection (Pre-Mining 1): Obtain data from
various sources.
Preprocessing (Pre-Mining 2): Cleanse data.
Transformation (Pre-Mining 3): Convert to common
format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation (Post-Mining): Present
results to user in meaningful manner.
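
The five KDD steps can be read as a pipeline in which each stage feeds the next. The sketch below is only an illustration under stated assumptions: the two data sources, the record fields, and the item-counting "mining" step are all hypothetical stand-ins, not part of the slides.

```python
from collections import Counter

def selection(sources):
    """Selection: obtain data from various sources and pool it."""
    return [record for source in sources for record in source]

def preprocessing(records):
    """Preprocessing: cleanse data by dropping records with missing values."""
    return [r for r in records if all(v not in (None, "") for v in r.values())]

def transformation(records):
    """Transformation: convert to a common format (trimmed, lower-case items)."""
    return [
        {"customer": r["customer"], "item": r["item"].strip().lower()}
        for r in records
    ]

def data_mining(records):
    """Data mining: a stand-in step that counts how often each item occurs."""
    return Counter(r["item"] for r in records)

def interpretation(counts, top=2):
    """Interpretation/Evaluation: present the results in a readable form."""
    return [f"{item}: {n}" for item, n in counts.most_common(top)]

# Hypothetical raw data from two sources, with inconsistent formatting and gaps.
store_a = [{"customer": "c1", "item": "Milk "}, {"customer": "c2", "item": "cola"}]
store_b = [
    {"customer": "c3", "item": "MILK"},
    {"customer": "c4", "item": ""},
    {"customer": None, "item": "cola"},
]

report = interpretation(
    data_mining(transformation(preprocessing(selection([store_a, store_b]))))
)
print(report)  # ['milk: 2', 'cola: 1']
```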

46
KDD Process Ex: Web Log