Data Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Data Mining

Slides: 30
Provided by: busi237
Tags: data | mining

Transcript and Presenter's Notes


1
Data Mining
  • Just a teaser of an overview

2
Classic Cases
  • Beer and diapers (an urban legend?)
  • market basket analysis
  • Credit card fraud detection based on user buying
    patterns
  • gas station purchase followed by big buy
  • Online brokerages sponsor golf tournaments
  • 70% of online investors are males over 40 who
    golf
  • 60% of stock investors are golfers
  • Senior citizens buy CDs for their grand-kids
  • RAP ads in magazines for seniors (?)
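
Market basket analysis, as in the beer-and-diapers story, largely comes down to counting how often items show up in the same basket. A minimal sketch in Python, assuming a small made-up list of transactions (the baskets, item names, and function names below are illustrative assumptions, not data or code from the presentation):

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated share of baskets with `antecedent` that also hold `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

beer_diapers_support = support(baskets, {"beer", "diapers"})
diapers_to_beer = confidence(baskets, {"diapers"}, {"beer"})
```

Here two of the five baskets contain both items, and two of the three diaper baskets also contain beer. A rule like "diapers, then beer" is usually only interesting when both numbers are reasonably high.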

3
Data Mining - Overview
  • Extraction of hidden predictive information from
    large databases
  • Predict future trends and behavior
  • proactive, knowledge-based decision making
  • Automated, prospective analysis
  • Goes beyond Data Warehousing and OLAP
  • Supervised vs. Unsupervised Learning
  • Attempt to have computers learn concepts

4
Examples of Data Mining
  • Marketing (clustering for customer segmentation)
  • Direct marketing
  • how to reach customers who matter - mass mailing
  • Customer modeling
  • customer profile - identify who is most likely to
    buy
  • Trend analysis
  • isolating when a customer changes habits
  • Fraud Detection
  • insurance
  • credit card
  • Healthcare
  • patient type clustering
  • prediction of DRG outliers
  • Hospital length of stay prediction

5
Examples of Data Mining (contd.)
  • Banking
  • credit card marketing campaign
  • Telecommunication
  • changing long distance carriers
  • Astronomy
  • classification of stars in the known universe
  • Geographic Information System
  • identify relative consumption of products within
    counties and zip codes

6
Very Nice Intro to Data Mining
  • http://www.aw.com/info/database/roiger.html
  • Business focus
  • Enough algorithm detail to learn something
  • Includes Excel based software for many common
    data mining techniques

7
Some commonly used data mining techniques
  • Classification trees
  • Discriminant analysis
  • Logistic regression
  • Neural networks
  • Kohonen self-organizing maps (networks)
  • K-means or hierarchical cluster analysis

8
Two Nice Examples in PMS
  • Cluster Analysis (sec. 8.7)
  • Cluster cities based on some demographic
    information
  • standardization of data
  • Calculation of a distance metric
  • Clustering as a non-linear optimization problem
    (using Evolutionary Solver)
  • Discriminant Analysis
  • Classify people as Wall Street Journal readers
    (or not)
  • Fits regression line that discriminates between
    the two groups
  • Can use the classifier equation to predict
    whether a new person would or would not subscribe

9
Supervised Learning
  • We see and hear things, learn their names,
    classify them using our own mental models
  • We have inputs and outputs
  • We have predefined classes into which we wish to
    classify new cases
  • Example: Look at data of mortgage borrowers and
    whether or not they defaulted. Build model to
    predict if a new borrower is a default risk.

10
Supervised Learning
  • Training data (inputs and outputs): examples
    (defaulted) and non-examples (did NOT default)
  • Training data goes into the data mining algorithm
    (for classification), which produces the
    classifier model
  • New data (inputs only) goes into the classifier
    model, which outputs Default or Not Default, OR
    the probability of default
  • Test data
11
DISCRIM.XLS
  • This file contains the annual income and size of
    investment portfolio (both in thousands of
    dollars) for 84 people.
  • It also indicates whether each of these people
    subscribes or does not subscribe to the Wall
    Street Journal.
  • Using income and size of investment portfolio,
    determine a classification rule that maximizes
    the number of people correctly classified as
    subscribers or nonsubscribers.
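
One way to picture such a classification rule is a weighted score with a cutoff: anyone whose score reaches the cutoff is predicted to subscribe. The sketch below assumes that form and simply tries each observed score as the cutoff. The rows are hypothetical and are not the 84 records in DISCRIM.XLS; the equal weights are an arbitrary assumption that a solver would normally tune along with the cutoff.

```python
def classify(income, portfolio, w_income, w_portfolio, cutoff):
    """Linear rule: predict 'subscriber' (True) when the score reaches the cutoff."""
    return w_income * income + w_portfolio * portfolio >= cutoff


def count_correct(people, w_income, w_portfolio, cutoff):
    """Number of people whose predicted label matches the actual one."""
    return sum(
        classify(inc, port, w_income, w_portfolio, cutoff) == subscribes
        for inc, port, subscribes in people
    )


def best_cutoff(people, w_income, w_portfolio):
    """Try each observed score as the cutoff; keep the one with the most correct."""
    scores = sorted(w_income * inc + w_portfolio * port for inc, port, _ in people)
    return max(scores, key=lambda c: count_correct(people, w_income, w_portfolio, c))


# Hypothetical (income, portfolio, subscribes) rows, in thousands of dollars.
people = [
    (30, 10, False), (45, 20, False), (50, 15, False), (55, 40, False),
    (60, 80, True), (75, 60, True), (90, 120, True), (110, 150, True),
]

# Equal weights are an assumption made for illustration.
cutoff = best_cutoff(people, 1.0, 1.0)
correct = count_correct(people, 1.0, 1.0, cutoff)
```

On this toy data the two groups separate cleanly, so the best cutoff classifies all eight rows correctly. Real data rarely separates this well, which is why the objective is to maximize the count rather than reach 100%.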

12
Unsupervised Learning
  • We have inputs only
  • We do NOT have predefined classes into which we
    wish to classify new cases
  • We want to cluster the data so that cases
  • Similar to other cases in their cluster
  • Different from cases in other clusters
  • Example: Look at data for online investors and
    cluster them into groups
  • do targeted marketing based on cluster membership

13
Unsupervised Learning
  • Training data (inputs only): examples (online
    investors)
  • Training data and, for some algorithms, the
    number of clusters go into the data mining
    algorithm (for clustering), which produces the
    clustering model
  • Output: clusters (each case gets put in a
    cluster; you name the clusters)
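
A clustering algorithm of this kind can be sketched with k-means, one of the techniques listed earlier. The version below is a bare-bones illustration: it takes the number of clusters implicitly through the starting centers, which are passed in by hand so the run is repeatable. The points and function names are hypothetical.

```python
import math


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centers, max_iter=100):
    """Alternate assignment and center updates until assignments stop changing."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster goes empty
                centers[i] = mean(members)
    return labels, centers


# Hypothetical two-dimensional cases with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, [points[0], points[3]])
```

The algorithm only returns cluster numbers. Deciding what each cluster means, and naming it, is left to the analyst.
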
14
Can you identify clusters?
15
CLUSTERS.XLS
  • This file contains demographic data on 49 of the
    largest cities in the United States.
  • Some of the data appears in the shaded region of
    the figure on the next slide.
  • For example, Atlanta is 67% African American, 2%
    Hispanic, and 1% Asian. It has a median age of
    31, a 5% unemployment rate, and a per capita
    income of $22,000.
  • We would like to group these 49 cities into four
    clusters of cities that are demographically
    similar.

16
Steps in Data Mining
  • 1. Problem Formulation
  • Exploratory
  • Exploring relationships, reasoning later!
  • Examples: beer-diaper example, insurance (sports
    car, antique)
  • Hypotheses
  • Testing relationships based on theory / some
    evidence
  • Example: classification problems such as credit
    approval, bankruptcy prediction
17
Steps in Data Mining
  • 2. Data
  • This step is critical and can take an enormous
    amount of time and effort.
  • Expertise on data
  • Is there enough data? (size)
  • Data quality: Is the data noisy? Format, missing
    values, errors. Garbage in, garbage out!
  • Is data available on relevant factors?
18
Data for Mining
  • Most data mining packages want a flat file of
    data
  • Much work goes into preparing the data
  • Missing values
  • May transform numeric into categorical
  • Often standardize or rescale the data using a
    variety of different approaches
  • Standardized values are in a relatively small
    range of values
  • Removes bias due to large differences in absolute
    numeric values
  • Example: ages in years vs. fractions (decimals
    between 0 and 1)
  • WHY? Many data mining algorithms calculate
    distance between cases (records in your data)

19
Flat File for Mining
Standardized data values
Raw data values
20
Steps in Data Mining
  • 3. Techniques
  • Different techniques are suited for different
    kinds of problems and produce different kinds of
    output
  • Unsupervised Learning
  • K-means cluster analysis
  • Kohonen networks
  • The number of groups is not known a priori
  • Supervised Learning
  • Neural networks
  • The number of groups is known a priori, enabling
    categorization and training
  • Linear vs. nonlinear separation, dimensionality
  • Examples: classification problems (credit
    approval, bankruptcy prediction), customer
    profiling
21
4. Analysis
  • Subject new data to classification models to aid
    in decision making
  • Use output from clustering models as part of
    larger decision or planning problem
  • Supplement data mining models with data
    visualization, statistical reporting
  • Supplement all of this with domain knowledge,
    common sense, expert opinion

22
Decision Trees
  • Another supervised learning technique
  • Generalizes input instances by building a tree
  • Non-terminal nodes are tests on one or more
    attributes (i.e. pick next branch to traverse)
  • Terminal nodes are decision outcomes
  • Simple, graphical, can be transformed into
    logical rules

23
Hypothetical Training Data for Disease Diagnosis
24
Simple Decision Tree
  • Swollen Glands?
  • Yes: Diagnosis Strep Throat
  • No: Fever?
  • Fever? Yes: Diagnosis Cold
  • Fever? No: Diagnosis Allergy
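
A tree this small translates directly into logical rules. The sketch below assumes the branches read as: swollen glands means strep throat; otherwise fever means cold; otherwise allergy. The function name is made up for illustration.

```python
def diagnose(swollen_glands, fever):
    """Walk the two-test tree and return the diagnosis at the leaf reached."""
    if swollen_glands:   # root test
        return "Strep Throat"
    if fever:            # second test, reached only without swollen glands
        return "Cold"
    return "Allergy"


# Every combination of the two attributes, with the diagnosis the tree gives.
outcomes = {
    (glands, fever): diagnose(glands, fever)
    for glands in (True, False)
    for fever in (True, False)
}
```

Note that fever is never checked when swollen glands are present, which is why a tree can be simpler than a full table of attribute combinations.
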
25
Neural Networks Used for Classification
  • Based on simple model of brain as collection of
    neurons that fire if inputs exceed a certain
    threshold
  • Three types of neurons or nodes
  • Input nodes: these are input variables
  • Hidden nodes: they help fit a model
  • Output nodes: these are the predictions
  • Supervised learning

26
Basic Steps with Neural Nets
  • Prepare data including input and output variables
  • Standardization is a good idea
  • Hold out some data to test model later
  • Submit training data to neural network software
    (see later slide)
  • Build neural network model
  • A few tuning parameters involved
  • Model iteratively puts weights on neurons in an
    attempt to predict the outputs correctly
  • Submit new data to neural network for predictive
    classification
  • Neural network is a black box: there is NO
    equation to look at
  • Can be difficult to gain insight from neural
    network
  • Has proven to be a very effective classification
    method in practice

27
Basic Structure of a Neural Network
28
Neural Network Software
  • Clementine
  • iDA: trial version ships with Roiger and Geatz
    book (see previous slide)
  • Excel based add-ins
  • Neuralyst
  • Freeware and Shareware

29
More Data Mining Software
  • Clustan: cluster analysis
  • SAS
  • SPSS
  • KXEN
  • Lots more at KDNuggets.com