Hands-on Workshop on Data Mining presentation

1
Hands-on Workshop on Data Mining

Ahmed M. Zeki
8 Sep 2007

2
PART I

INTRODUCTION TO DATA MINING

3
Introduction

What is Data?
Data, Information, Knowledge.
What is Mining?

4
Machine Learning vs. Knowledge Engineering
Machine Learning:
Samples -> Learning Systems -> Decision Making (Rules)
Knowledge Engineering (Expert Systems):
Rules -> System -> Output (Applying the Rules)
e.g. MYCIN (Medical Diagnosis System)
5
Expert Systems (Example)
10
Definition

Data Mining is the process of exploration and
analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover
meaningful patterns, relationships and rules.
What comes next?
5, 7, 10, 14, 19, ____ (answer: 25)
Khairul likes 252 but not 422; he likes 900 but
not 800; he likes 144 but not 134. Which does he
like: 1800 or 1700?
Which of the figures below the line of drawings
best completes the series?
(Figure not included in the transcript.)

11
Data Mining vs. Other Techniques
Data mining: hypothesis-free.
Other techniques (Statistics, Query, Reporting,
OLAP, ...): hypothesis-driven, and not suitable
for large databases and data warehouses within
the time limits.

Questions suited to data mining:
Why are my discount coupons not attracting the
sort of return I was expecting?
How can I increase the share I have of my
customers' total spending on electronic goods?
How can I get my other stores to match the
incredibly successful sales figures of the main
branch?

Questions suited to the other techniques:
Volume of TVs sold in one store last month.
Analyze the price sensitivity of a new line of TVs.
Compare the sales of various products in
different stores over time.
Hypotheses: the manager knows that there are
stores, products, sensitivity and sales figures,
and he is checking out the interrelationships.

12
Traditional Data Analysis
Hypothesis
Query Language
Graphics Statistics OLAP
Output
Database
13
Relationship between Data Mining and Statistics

Statistics is the field closest to data mining.
Many of the analyses now done with data mining,
such as predictive models or discovering
associations in databases, have long been used
in statistics.

14
Data Mining for Business Intelligence

Business intelligence: all of the processes,
techniques and tools that support business
decision-making based on information technology.
The approaches can range from a simple
spreadsheet to a major competitive intelligence
undertaking. Data mining is an important new
component of business intelligence.

15
Data Mining and Business Intelligence
Positioning of different business intelligence
technologies according to their potential value
as a basis for tactical and strategic business
decisions.
Pyramid levels, from top to bottom:
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: OLAP, MDA, Statistical Analysis,
Querying and Reporting
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information
Providers, Database Systems
Roles, from top to bottom: Decision Maker,
Business Analyst, Data Analyst, Database
Administrator.
The potential to support business decisions, and
so the value of the information, increases from
the bottom of the pyramid to the top.
16
Data Mining Applications

Of the best 500 companies listed in the annual
report by Fortune (a financial magazine), 80 are
using data mining for decision support.

17
Data Mining Applications

The three main business areas where data mining
is applied are:
(1) Market Management
Target Marketing
Customer relationship management
Market basket analysis
Cross Selling
(2) Risk Management
Forecasting
Customer retention
Improved underwriting
Quality control
Competitive Analysis
(3) Fraud Management
Fraud detection

18
Market Management Applications

The organization builds a database of customer
product preferences and lifestyles from such
sources as credit card transactions, loyalty
cards, warranty cards, discount coupons, entries
to free prize drawings and customer complaint
calls.
Data mining algorithms then search through the
data looking for clusters of model consumers who
all share the same characteristics (for example
income, interests and spending habits).

19
Determining customer purchasing patterns over time

Examples
The sequence in which they take up financial
services as their family grows
How they change their cars.
Converting a single bank account to a joint
account indicates marriage which could lead to
future opportunities to get loans, insurance,
study fees....
By understanding these patterns the organization
can advertise just-in-time.

20
Improving Catalog Telesales

The goal is to track the products that customers
order most frequently and to suggest those
products in future orders.
Some product associations are obvious: camera and
film, radio and batteries...
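The product associations mentioned above can be measured by counting how often items appear together in the same order. The following is a minimal sketch, not the method used in the workshop: the orders and item names are invented, and `pair_stats` is a hypothetical helper that reports support (share of all orders containing both items) and confidence (share of orders with the first item that also contain the second).

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders; the item names are invented for illustration.
orders = [
    {"camera", "film", "batteries"},
    {"camera", "film"},
    {"radio", "batteries"},
    {"radio", "batteries", "film"},
    {"camera", "film", "radio"},
]


def pair_stats(orders):
    """Return {(a, b): (support, confidence of a -> b)} for ordered item pairs."""
    item_counts = Counter()
    pair_counts = Counter()
    for order in orders:
        item_counts.update(order)
        for a, b in combinations(sorted(order), 2):
            # Count the pair in both directions so a -> b and b -> a exist.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    n = len(orders)
    return {
        (a, b): (count / n, count / item_counts[a])
        for (a, b), count in pair_counts.items()
    }


stats = pair_stats(orders)
for pair in [("camera", "film"), ("radio", "batteries")]:
    support, confidence = stats[pair]
    print(f"{pair[0]} -> {pair[1]}: support={support:.2f}, confidence={confidence:.2f}")
```

Pairs with high confidence are candidates for the "suggest in future orders" step; real catalog data would need minimum-support filtering to avoid rare coincidences.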

21
Loyalty Cards

To reward your frequent buyers.
Cardholders get special treatment, such as
exclusive discounts on selected items, to
encourage them to shop more at the store and to
make them less likely to visit the competition.

22
Risk Management Applications

Risk associated with insurance or investments.
Business risk arising from competitive threats.
Risk associated with poor product quality.
Risk associated with customer attrition, i.e. the
loss of customers, especially to competitors
(examples are found in the retail, finance and
telecommunications fields).
The idea here is to build a model of a vulnerable
customer, one who shows characteristics typical
of someone likely to leave for a competing company.

23
Risk Management Applications

Example: customer losses may frequently follow a
change of address or a recent protracted exchange
with an agent of the company.
One US bank uses such models to predict the loss
of customers up to one year in advance.
Another bank analyzes more than one million
credit card account histories to ensure that it
is not overexposed to high rates of attrition.

24
Risk Management Applications

Telecommunication companies have several billion
dollars in uncollectible debts every year. Data
mining can build models that help predict whether
a particular account is likely to be collectible
and is therefore worth going after.

25
Forecasting Financial Future

If changes in financial behavior can be
predicted, the organization can adjust its
investment strategy and capitalize on the
predicted changes.
Example: the ability to forecast the right price
of a future, which is a contract that allows
someone to buy something at a certain price on a
certain date in the future.

26
Pricing Strategy in a Highly Competitive Market

A chain of gasoline stations used data mining to
develop profitable pricing strategies in a very
competitive marketplace, by developing a model
that helps to determine:
Appropriate pricing for its products on a
day-to-day basis, with a view to maximizing sales
and profits.
Sales volumes and profitability.
The likely competitive reaction to their price
changes.
The likely profitability of a new station.

27
Fraud Management Applications

Detecting telephone fraud: some of the more
important elements (patterns) in building the
model are the destination of the call, its
duration, and the time of day and week.
Some sectors suffer more than most, especially
those with many transactions, such as health
care, retail, credit card services and
telecommunications.
The goal is to use historical data to build a
model of fraudulent behavior and then use data
mining to help identify similar instances of this
behavior.

28
Detecting inappropriate medical treatments

An insurance company maintains computerized
records of every doctor's consultation in
Australia, including details on the diagnosis,
prescribed drugs and recommended treatment.
Using traditional data analysis techniques, they
noticed a rapid increase in the number of
prescribed pathology tests.
Using data mining, they were able to identify
which combinations of tests were commonly used,
to detect invalid combinations and no longer
accept them for benefit payment, and to identify
that in many cases a certain test had been used
for given symptoms.

29
Future Application Areas

Text mining: words are analyzed in context, for
example the word "memory" used in a medical
article versus a computer article.
Web analytics: developing insights into users'
behavior on the internet. For example, today
hypertext links are typically fixed; the site
developers have provided the most likely links by
trying to second-guess what the user wants to do
next. With data mining, historical user browsing
patterns can be analyzed to dynamically suggest
related sites for users to visit.

30
Data Mining is not Magic!

A class of divorced women
A data mining system discovered that divorced
women have distinctly different shopping patterns
from those of either single or married women.
After analyzing the data, they found that the data
on marital status was much less accurate than the
other data because of cultural norms.

31
Data Mining is not Magic!

Missing the Point
While preparing to mine a database of hospital
patient admission records, the analysts found a
strange graph of patient temperatures. They then
discovered that nurses were likely to record a
temperature of 37°C as either 36.9°C or
37.1°C.

(Graph: population versus temperature, with the
temperature axis marked from 35°C to 38°C.)
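A recording habit like the one in this temperature example shows up as a dip in the value counts, with one value much rarer than both of its neighbours. The sketch below is only an illustration of that check: the readings are invented, and `is_suspicious_gap` is a hypothetical helper. Readings are stored as tenths of a degree (integers) to avoid floating-point comparison problems.

```python
from collections import Counter

# Hypothetical readings in degrees Celsius, invented to mimic the anomaly.
readings = [36.5, 36.8, 36.9, 36.9, 36.9, 36.9,
            37.1, 37.1, 37.1, 37.1, 37.2, 37.4]

# Count readings per tenth of a degree, e.g. 36.9 -> 369.
counts = Counter(round(r * 10) for r in readings)


def is_suspicious_gap(counts, tenths):
    """True if a value is rarer than both of its immediate neighbours."""
    return counts[tenths] < counts[tenths - 1] and counts[tenths] < counts[tenths + 1]


# Text histogram over the observed range, including empty bins.
for tenths in range(min(counts), max(counts) + 1):
    print(f"{tenths / 10:.1f} {'#' * counts[tenths]}")

print("Gap at 37.0:", is_suspicious_gap(counts, 370))
```

A gap like this points to a data collection issue, not to a property of the patients, which is why it has to be understood before mining begins.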
32
Data Mining is not Magic!

Older and wealthy customers were buying large
sedans!!
People born under the sign of Pisces were most
prone to accidents!
Males with incomes between 50k-65k who
subscribe to certain magazines are likely
purchasers of a certain product!
DM just assists business analysis by finding
patterns and relationships in the data.
These patterns and relationships are not
necessarily causes of an action.

33
Data Mining Approaches

Classification Studies (Supervised Learning)
Example: what makes customers more likely to
stay with or to leave my company? (No hypothesis.)
Clustering Studies (Unsupervised Learning)
Example: which products are likely to be
purchased together? (No hypothesis.)

34
How do we mine data?

The process of data mining is described as a
process of model building.
Five main steps to data mining
1. Data Preparation
2. Defining a study
Reading the data and building a model
Understanding the model
3. Data Mining
4. Analysis of Results.
5. Assimilation of Knowledge.

35
Effort Required for Each Data Mining Process Step
(Bar chart: effort, on an axis from 10 to 70, for
each step: Defining a study, Data Preparation,
Data Mining, Analysis of Results and Knowledge
Assimilation.)
36
PART II

INTRODUCTION TO CLEMENTINE
Drug Treatment
(Exploratory Graphs / C5.0)

37
The Problem

Imagine that you are a medical researcher
compiling data for a study.
You have collected data about a set of patients,
all of whom suffered from the same illness.
During their course of treatment, each patient
responded to one of five medications.
Part of your job is to use data mining to find
out which drug might be appropriate for a future
patient with the same illness.

38
The Data Fields
39
Data Reading

Use the Variable File node.
Open Drug1n
Select Read field names from file
Click the Data tab. In the override column,
select cholesterol
Click the Types tab to learn more about the
type of fields in your data. Choose Read Values
to view the actual values for each field
Use the Table Node to have a glance at the
values
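For readers without Clementine, the reading steps above can be approximated with Python's standard `csv` module. This is a sketch under assumptions: the rows below are invented stand-ins for the Drug1n file, the column names are guesses based on the fields named in this workshop, and `read_records` is a hypothetical helper.

```python
import csv
import io

# Stand-in for the Drug1n file. The rows are invented, and the column
# names are assumptions based on the fields mentioned in the workshop.
SAMPLE = """BP,Cholesterol,Na,K,Drug
HIGH,HIGH,0.79,0.03,Y
LOW,NORMAL,0.74,0.06,C
NORMAL,HIGH,0.70,0.05,X
HIGH,NORMAL,0.56,0.07,A
"""


def read_records(text_file):
    """Read delimited text whose first line holds the field names."""
    records = []
    for row in csv.DictReader(text_file):
        # Numeric fields arrive as strings and must be converted explicitly.
        row["Na"] = float(row["Na"])
        row["K"] = float(row["K"])
        records.append(row)
    return records


records = read_records(io.StringIO(SAMPLE))
print("Fields:", list(records[0].keys()))
for record in records:  # rough equivalent of glancing at a Table node
    print(record)
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by `open(path, newline="")`.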

40
Exploring the Dataset

Use the Distribution node to explore the data
Select Drug as the target field
Click Execute
The resulting graph shows that patients responded
to drug Y most often and to drugs B and C least
often.
Use the Data Audit node for a quick glance at
distributions and histograms for all fields at
once.
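What the Distribution node shows is essentially a frequency count of the target field. A minimal sketch with `collections.Counter` follows; the drug labels are invented sample values, not the workshop's data.

```python
from collections import Counter

# Invented sample of the Drug field, for illustration only.
drugs = ["Y", "X", "Y", "A", "Y", "C", "X", "B", "Y", "X"]

distribution = Counter(drugs)
total = len(drugs)

# Print the categories from most to least frequent with a text bar.
for drug, count in distribution.most_common():
    print(f"{drug:<3}{count:>3} {count / total:6.1%} {'#' * count}")
```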

41
What factors might influence Drug?

As a researcher, you know that the concentrations
of sodium and potassium in the blood are
important factors.
Since these are both numeric values, create a
scatterplot of sodium versus potassium, using the
drug categories as a color overlay.
Use the Plot node; double-click to edit.
Plot: Na vs. K
Overlay color: Drug
The plot clearly shows a threshold above which
the correct drug is always drug Y and below which
the correct drug is never drug Y. This threshold
is a ratio - the ratio of sodium (Na) to
potassium (K).

42
What factors might influence Drug?

Use web graph if many of the data fields are
categorical
Web graph maps associations between different
categories
Select BP and Drug. Then, Execute
It appears that drug Y is associated with all
three levels of blood pressure.

43
What factors might influence Drug?

To focus on the other drugs, use Hide and
Replan.
After hiding drug Y
Only drugs A and B are associated with high blood
pressure.
Only drugs C and X are associated with low blood
pressure.
Normal blood pressure is associated only with
drug X.
At this point, though, you still don't know how
to choose between drugs A and B or between drugs
C and X, for a given patient. This is where
modeling can help.
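The associations a web graph draws between two categorical fields can also be read from a simple cross-tabulation. The sketch below uses invented (BP, Drug) pairs chosen to mirror the pattern described above; `crosstab` and `drugs_for` are hypothetical helpers, and the `hide` argument plays the role of hiding drug Y.

```python
from collections import Counter

# Invented (BP, Drug) pairs that mirror the associations described above.
pairs = [
    ("HIGH", "A"), ("HIGH", "B"), ("HIGH", "Y"), ("HIGH", "A"),
    ("LOW", "C"), ("LOW", "X"), ("LOW", "Y"),
    ("NORMAL", "X"), ("NORMAL", "Y"), ("NORMAL", "X"),
]


def crosstab(pairs, hide=()):
    """Count (BP, Drug) co-occurrences, skipping any drug listed in `hide`."""
    return Counter((bp, drug) for bp, drug in pairs if drug not in hide)


def drugs_for(table, bp):
    """List the drugs that occur at least once with the given BP level."""
    return sorted({drug for (level, drug) in table if level == bp})


table = crosstab(pairs, hide={"Y"})
for bp in ("HIGH", "NORMAL", "LOW"):
    print(bp, "->", drugs_for(table, bp))
```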

44
Deriving a New Field

Since the ratio of sodium to potassium seems to
predict when to use drug Y, you can derive a
field that contains the value of this ratio for
each record. This field might be useful later
when you build a model to predict when to use
each of the five drugs.
Insert a Derive node and edit it:
Name: Na_to_K
Ratio: enter Na/K for the formula, or use the
Expression Builder
Check the distribution of the derived field using
a Histogram node; specify Na_to_K as the field to
be plotted and Drug as the overlay field.
Result: when the Na_to_K value is about 15 or
above, drug Y is the drug of choice.
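Deriving the ratio field and checking it against the threshold of about 15 can be sketched in a few lines. The records are invented, `add_ratio` is a hypothetical helper, and the cut-off value is simply the approximate figure quoted above.

```python
# Invented records; only the fields needed for the ratio are included.
records = [
    {"Na": 0.79, "K": 0.03, "Drug": "Y"},
    {"Na": 0.74, "K": 0.04, "Drug": "Y"},
    {"Na": 0.70, "K": 0.05, "Drug": "X"},
    {"Na": 0.56, "K": 0.07, "Drug": "A"},
    {"Na": 0.60, "K": 0.06, "Drug": "C"},
]

THRESHOLD = 15  # approximate cut-off quoted in the text


def add_ratio(record):
    """Return a copy of the record with the derived Na_to_K field added."""
    derived = dict(record)
    derived["Na_to_K"] = record["Na"] / record["K"]
    return derived


derived = [add_ratio(r) for r in records]
for r in derived:
    flag = "Y expected" if r["Na_to_K"] >= THRESHOLD else "other drug expected"
    print(f"{r['Na_to_K']:6.2f}  actual={r['Drug']}  {flag}")
```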

45
Building a Model

By exploring and manipulating the data, you have
been able to form some hypotheses.
The ratio of sodium to potassium in the blood
seems to affect the choice of drug, as does blood
pressure.
But you cannot fully explain all of the
relationships yet.
This is where modeling will likely provide some
answers.
In this case, you will try to fit the data using
a rule-building model, C5.0.

46
Building a Model

Since we have a new derived field, Na_to_K, we
can filter out the original fields, Na and K, so
that they are not used twice in the modeling
algorithm.
Use Filter node
Click the arrows next to Na and K.
Red Xs appear over the arrows to indicate that
the fields are now filtered out.
Connect a Type node to the Filter node which
allows you to indicate the types of fields that
you are using and how they are used to predict
the outcomes.
Set the direction for the Drug field to Out
(i.e. to be predicted) and the direction of all
other fields to In.
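Outside Clementine, the Filter and Type steps amount to dropping columns and separating the inputs from the field to be predicted. The sketch below assumes a record layout like the one used in this workshop; the values are invented, and `filter_fields` and `split_roles` are hypothetical helpers.

```python
def filter_fields(record, drop):
    """Return a copy of the record without the fields named in `drop`."""
    return {name: value for name, value in record.items() if name not in drop}


def split_roles(record, target):
    """Separate input fields (direction In) from the target (direction Out)."""
    inputs = {name: value for name, value in record.items() if name != target}
    return inputs, record[target]


# Invented record that already carries the derived Na_to_K field.
record = {
    "BP": "HIGH", "Cholesterol": "NORMAL",
    "Na": 0.79, "K": 0.03, "Na_to_K": 0.79 / 0.03,
    "Drug": "Y",
}

filtered = filter_fields(record, drop={"Na", "K"})
inputs, target = split_roles(filtered, target="Drug")
print("Inputs:", sorted(inputs))
print("Target:", target)
```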

47
Building a Model

To estimate the model, attach a C5.0 node to
the Type node. Then execute.
Browse the created model, either in the Rule
Browser or in the Viewer (decision tree).
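C5.0 itself is not reproduced here. As a rough illustration of how a decision-tree learner can choose a split, the sketch below searches for the threshold on a numeric field that gives the largest information gain (reduction in entropy). The rows are invented, and `entropy` and `best_threshold` are hypothetical helpers; a real tree learner repeats this search recursively over all fields.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(rows, field, target):
    """Return (threshold, gain) for the best split `field >= threshold`."""
    base = entropy([row[target] for row in rows])
    values = sorted({row[field] for row in rows})
    best = (None, 0.0)
    # Candidate thresholds are midpoints between consecutive distinct values,
    # so both sides of every split are non-empty.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        above = [row[target] for row in rows if row[field] >= threshold]
        below = [row[target] for row in rows if row[field] < threshold]
        remainder = (len(above) * entropy(above)
                     + len(below) * entropy(below)) / len(rows)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Invented rows: drug Y occurs only at high Na_to_K values.
rows = [
    {"Na_to_K": 26.3, "Drug": "Y"},
    {"Na_to_K": 18.5, "Drug": "Y"},
    {"Na_to_K": 16.0, "Drug": "Y"},
    {"Na_to_K": 14.0, "Drug": "X"},
    {"Na_to_K": 10.0, "Drug": "C"},
    {"Na_to_K": 8.0, "Drug": "A"},
]

threshold, gain = best_threshold(rows, "Na_to_K", "Drug")
print(f"Best split: Na_to_K >= {threshold} (information gain {gain:.3f} bits)")
```

On this toy data the search separates drug Y from the other drugs, which is the kind of first rule one would hope to see when browsing the model.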

48
The Accuracy

To assess the accuracy of the model, connect an
Analysis node to the C5.0 model node, which is in
turn connected to the Type node.
The Analysis node output shows that with this
artificial data set, the model correctly
predicted the choice of drug for almost every
record in the data set.
With a real data set you are unlikely to see 100%
accuracy, but you can use the Analysis node to
help determine whether the model is acceptably
accurate for your particular application.
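The core figure reported by an accuracy check is the share of records whose predicted value matches the actual one, often accompanied by a table of which classes were confused. A minimal sketch follows; the actual and predicted labels are invented, and `accuracy` and `confusion` are hypothetical helpers, not the Analysis node's implementation.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records whose prediction matches the actual value."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion(actual, predicted):
    """Count each (actual, predicted) combination."""
    return Counter(zip(actual, predicted))


# Invented labels: one record with drug X is wrongly predicted as drug C.
actual = ["Y", "Y", "X", "A", "C", "B", "X", "Y"]
predicted = ["Y", "Y", "X", "A", "C", "B", "C", "Y"]

print(f"Accuracy: {accuracy(actual, predicted):.1%}")
for (a, p), count in sorted(confusion(actual, predicted).items()):
    marker = "" if a == p else "  <- error"
    print(f"actual={a} predicted={p} count={count}{marker}")
```

Accuracy measured on the same records used to build the model tends to be optimistic, so a held-out set of records gives a fairer estimate.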
