Issues in Data Mining Applications Tutorial

1
Issues in Data Mining Applications-Tutorial-
How to Make A Decision About Your Own Data Mining
Tool?
Authors
  • Nemanja Jovanovic, nemko_at_sezampro.yu
  • Valentina Milenkovic, tina_at_eunet.yu
  • Prof. Dr. Veljko Milutinovic, vm_at_etf.bg.ac.yu

2
Data Mining vs. Knowledge Mining ?
  • ?

3
Instead of a foreword
  • "If you are not able to swim in the ocean of the data,
    you will get drowned."

4
Tutorial Content
This Tutorial will guide you through the
following sections
  • What does Data Mining really mean?
  • Successful Data Mining
  • Comparison of fourteen DM tools
  • How to improve existing Data Mining applications?
  • Potential applications
  • Myths and facts about Data Mining
  • Two case studies
  • The future of DM applications

5
Other definitions of Data Mining
  • Data mining is the (semi)automatic discovery of
    patterns, associations, anomalies, and changes in
    data.
  • Data mining, on the other hand, extracts
    information from a database that the user did
    not know existed.
  • Also, data mining is the search for relationships
    and global patterns that exist in large databases
    but are 'hidden' among the vast amount of data.

6
The Foundations of Data Mining
  • Massive data collection
  • Powerful multiprocessor computers
  • Data Mining algorithms

[Figure: volume of data, 1970 to 2000]

7
Evolution Of Data Mining
8
Data Mining context
  • Application domain
  • Data mining problem type
  • Technical aspect
  • Data mining tools and techniques

9
Data Mining Techniques
  • Artificial Neural Networks
  • Decision Trees
  • Genetic Algorithms
  • Rule Induction
  • K-Nearest Neighbor (k-NN)
  • Data Visualization
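
As a rough illustration of one technique from this list, k-Nearest Neighbor classifies a new case by majority vote among the k closest known cases. The sketch below is a minimal Python version; the function name `knn_predict`, the Euclidean distance and the toy points are illustrative assumptions, not taken from the tutorial.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    # Sort training cases by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels among those neighbours and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters labelled "A" and "B".
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_predict(train, (1.1, 1.0)))
```

Real tools would add feature scaling and tie-breaking rules, which this sketch leaves out.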

[Figure: input pattern and output pattern]

10
What type of a user do DM tools require?
  • Cooperation between business experts and data
    mining experts
  • The less skill and experience business experts have
    in modeling and using the tools, the more help they
    need from data mining experts
  • Example of financial analysts

11
Examples of DM projects to stimulate your
imagination
  • Here are six examples of how data mining is
    helping corporations to operate more
    efficiently and profitably in today's business
    environment.
  • Targeting a set of consumers who are most
    likely to respond to a direct mail campaign
  • Predicting the probability of default for
    consumer loan applications
  • Reducing fabrication flaws in VLSI chips
  • Predicting audience share for television programs
  • Predicting the probability that a cancer patient
    will respond to radiation therapy
  • Predicting the probability that an offshore oil
    well is actually going to produce oil

12
Successful Data Mining
  • Come up with a precise formulation of the problem
    you are trying to solve and use the right data
  • Have a clearly articulated business problem and
    then determine whether data mining is the proper
    solution technology
  • Understand and deliver the fundamentals
  • Have your technology folks be involved, too
  • Visualizing the data mining output in a
    meaningful way is very important
  • Allow the user to interact with the visualization

13
Comparison of fourteen DM tools
  • Evaluated by four undergraduates inexperienced at
    data mining, a relatively experienced graduate
    student and a professional data mining consultant
  • Run under MS Windows 95, MS Windows NT or
    Macintosh System 7.5
  • Each uses one of four technologies: Decision Trees,
    Rule Induction, Neural Networks or Polynomial Networks
  • Applied to four problems: two binary classification
    problems, a multi-class classification problem and
    a noiseless estimation problem
  • Prices range from 75 to 25,000

14
Comparison of fourteen DM tools
  • The Decision Tree products were CART, Scenario,
    See5 and S-Plus
  • The Rule Induction tools were WizWhy, DataMind
    and DMSK
  • Neural Networks were built with three programs:
    NeuroShell2, PcOLPARS and PRW
  • The Polynomial Network tools were ModelQuest
    Expert, Gnosis, a module of NeuroShell2, and
    KnowledgeMiner

15
Criteria for evaluating DM tools
  • A list of 20 criteria for evaluating DM tools,
    put into 4 categories
  • Capability measures what a desktop tool can do,
    and how well it does it
    - Handles missing data
    - Considers misclassification costs
    - Allows data transformations
    - Quality of testing options
    - Has programming language
    - Provides useful output reports
    - Visualisation

16
Visualisation
[Table legend: excellent capability; good capability; some capability; blank: no capability]
17
Criteria for evaluating DM tools
  • Learnability/Usability shows how easy a tool is
    to learn and use
    - Tutorials
    - Wizards
    - Easy to learn
    - User's manual
    - Online help
    - Interface

18
Criteria for evaluating DM tools
  • Interoperability shows a tool's ability to
    interface with other computer applications
    - Importing data
    - Exporting data
    - Links to other applications
  • Flexibility
    - Model adjustment flexibility
    - Customizable work environment
    - Ability to write or change code

19
Data Input Output Model
[Table legend: excellent capability; good capability; some capability; blank: no capability]
20
A classification of data sets
  • Pima Indians Diabetes data set
    • 768 cases of Native American women from the Pima
      tribe, some of whom are diabetic,
      most of whom are not
    • 8 attributes plus the binary class variable for
      diabetes per instance
  • Wisconsin Breast Cancer data set
    • 699 instances of breast tumors, some of which
      are malignant, most of which are benign
    • 10 attributes plus the binary malignancy
      variable per case
  • The Forensic Glass Identification data set
    • 214 instances of glass collected during crime
      investigations
    • 10 attributes plus the multi-class output
      variable per instance
  • Moon Cannon data set
    • 300 solutions to the equation
      $x = 2v^2 \sin(\theta)\cos(\theta)/g$
    • the data were generated without adding noise
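
A noiseless data set like Moon Cannon can be sketched by sampling inputs and evaluating the range equation directly. The slide does not say how the inputs were chosen, so the speed range, angle range, lunar gravity value and the function names below are all assumptions made for illustration.

```python
import math
import random

G_MOON = 1.62  # assumed lunar gravitational acceleration in m/s^2

def cannon_range(v, angle, g=G_MOON):
    """Range x = 2 v^2 sin(angle) cos(angle) / g, with angle in radians."""
    return 2.0 * v ** 2 * math.sin(angle) * math.cos(angle) / g

def make_dataset(n=300, seed=0):
    """Return n rows of (speed, angle, range) with no noise added."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # assumed speed range, m/s
        angle = rng.uniform(0.0, math.pi / 2)  # assumed launch-angle range
        rows.append((v, angle, cannon_range(v, angle, G_MOON)))
    return rows

data = make_dataset()
print(len(data))
```

Because the target is an exact function of the inputs, any remaining error in a fitted model reflects the modeling technique rather than noise in the data.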

21
Evaluation of fourteen DM tools

22
Strengths and Weaknesses
  • Strengths
    • Ease of use (Scenario, WizWhy..)
    • Data visualisation (S-plus, MineSet...)
    • Depth of algorithms (tree options)
      (CART, See5, S-plus..)
    • Multiple neural network architectures
      (NeuroShell)
  • Weaknesses
    • Difficult file I/O (OLPARS, CART)
    • Limited visualisation (PRW, See5, WizWhy)
    • Narrow analysis path (Scenario)

23
How to improve existing DM applications
  • The top ten points
  • Database integration
  • no more flat files
  • use the millions spent on data warehousing

  • Automated model scoring
  • without scoring DM is pretty useless
  • should be integrated with the driving
    applications
  • Exporting models to other applications
  • close the loop between DM and applications
    that need to
    use the results (scores)

24
How to improve existing DM applications
  • Business templates
  • a cross-selling-specific application is more
    valuable than a general modeling tool
  • Effort knob
  • it is relevant in a way that tuning parameters
    are not
  • Incorporate financial information
  • financial information is very important and
    often available, and should be provided as
    input to the DM application

25
How to improve existing DM applications
  • Computed target columns
  • allow the user to interactively create a new
    target variable
  • Time-series data
  • a year's worth of monthly balance information is
    qualitatively different from twelve distinct
    non-time-series variables
  • Use versus View
  • do not visually present the full model to the
    user, only the most important levels
  • Wizards
  • not necessary, but desirable
  • prevent human error by keeping the user on track
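
The time-series point on this slide can be made concrete: twelve monthly balances treated as unrelated columns ignore their order, while a derived feature such as a trend depends on it. The sketch below computes a least-squares slope; the function name `monthly_trend` and the sample balances are hypothetical.

```python
def monthly_trend(balances):
    """Least-squares slope of balance against month index (0, 1, 2, ...)."""
    n = len(balances)
    mean_x = (n - 1) / 2.0
    mean_y = sum(balances) / n
    # Covariance of (month, balance) and variance of month, both unnormalised.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(balances))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

# Hypothetical customer whose balance grows by 10 every month.
rising = [100.0 + 10.0 * month for month in range(12)]
# Same twelve values in the opposite order: a falling balance.
falling = list(reversed(rising))

print(monthly_trend(rising), monthly_trend(falling))
```

Both lists contain the same twelve numbers, so a tool that ignores ordering could not tell them apart, yet their trends have opposite signs.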

26
Potential Applications
  • Data mining has many varied fields of
    application, some of which are listed below.
  • Retail/Marketing
  • Identify buying patterns from customers
  • Find associations among customer demographic
    characteristics
  • Predict response to mailing campaigns
  • Market basket analysis

27
Potential Applications
  • Banking
  • Detect patterns of fraudulent credit card use
  • Identify 'loyal' customers
  • Determine credit card spending by customer groups
  • Find hidden correlations between different
    financial indicators
  • Identify stock trading rules from historical
    market data

28
Potential Applications
  • Insurance and Health Care
  • Claims analysis - i.e., which medical procedures
    are claimed together
  • Predict which customers will buy new policies
  • Identify behaviour patterns of risky customers
  • Identify fraudulent behaviour

29
Potential Applications
  • Transportation
  • Determine the distribution schedules among
    outlets
  • Analyse loading patterns
  • Medicine
  • Characterise patient behaviour to predict office
    visits
  • Identify successful medical therapies for
    different illnesses
  • To predict the effectiveness of surgical
    procedures or
    medical tests

30
Potential Applications
  • Sport
  • Make the best choice of players in
    different circumstances
  • Predict the results of relevant matches
  • Produce a better seeding list of players for
    groups or tournaments
  • DM report from an NBA game
  • When Price was Point-Guard, J.Williams missed 0%
    (0) of his jump field-goal attempts and made
    100% (4) of his jump field-goal attempts.
  • The total number of such field-goal attempts
    was 4.

31
DM and Customer Relationship Management
  • CRM is a process that manages the interactions
    between a company and its customers
  • Users of CRM software applications are database
    marketers
  • Goals of database marketers are
  • identifying market segments, which requires
    significant data about prospective customers
    and their buying behaviors
  • building and executing campaigns
  • Tightly integrating the two disciplines presents
    an opportunity for companies to gain
    competitive advantage

32
DM and Customer Relationship Management
  • How Data Mining helps Database Marketing
  • Scoring
  • The role of Campaign Management Software
  • Increasing the customer lifetime value
  • Combining Data Mining and Campaign Management

33
DM and Customer Relationship Management
  • Evaluating the benefits of a Data Mining model

Gains chart
Profitability chart
34
Myths and Facts about Data Mining
  • Myth: DM produces surprising results
    that will utterly transform your business.
  • Myth: DM techniques are so sophisticated
    that they can substitute for domain knowledge or
    for experience in analysis and model building.
  • Myth: DM tools automatically find the patterns
    you are looking for, without being told what to do.

35
Myths and Facts about Data Mining
  • Myth: Data mining is more effective with more
    data, so all existing data
    should be brought into any data-mining effort.
  • Myth: Building a DM model on a sample of a
    database is ineffective, because sampling loses
    the information in the unused data.
  • Myth: Data mining is another fad that will soon
    fade, allowing us to return to
    standard business practice.

36
Myths and Facts about Data Mining
  • Myth: DM is useful only in certain areas,
    such as marketing, sales, and fraud detection.
  • Myth: The methods used in DM are fundamentally
    different from the older quantitative
    model-building techniques.
  • Myth: Data mining is an extremely complex
    process.
  • Myth: Only massive databases are worth mining.

37
Data Mining Examples
  • Bass Brewers: "We've been brewing beer
    since 1777. With increased competition comes a
    demand to make faster, better-informed decisions."
  • Northern Bank: "The information is now more
    accessible, paperless and timely."
  • TSB Group Plc: "We are using Holos because
    of its flexibility and its excellent
    multidimensional database."

38
Data Mining Examples
  • Delphic Universities: "Real value is added to
    data by multidimensional manipulation (being
    able to easily compare many different views
    of the available information in one report) and by
    modeling."
  • Harvard - Holden: "Sybase technology has
    allowed us to develop an information system that
    will preserve this legacy into the twenty-first
    century."
  • J.P.Morgan: "The promise of data mining
    tools like Information Harvester is that they
    are able to quickly wade through massive amounts
    of data to identify relationships or trending
    information that would not have been available
    without the tool."

39
Case study of Breast Cancer Survival Analysis
  • Case study of the influence of various patient
    characteristics on survival rates for breast
    cancer
  • The survival analysis technique employed is Cox
    Regression (this technique is useful in
    situations where some of the patients do not
    die during the observation period)
  • A linear regression technique could be used if all
    patients had died during the observation period

40
Case study of Breast Cancer Survival Analysis
  • The observation period runs for 133.8 months
  • The modeling sample contains 746 patients (50
    who died during the observation period and
    696 who survived beyond the end of the
    observation period)
  • In this example, we are testing only four
    predictors
  • Age, in years, at the start of the observation
    period (22 to 88)
  • Pathological tumor size, in centimeters (0.10 to
    7.00)
  • Number of positive axillary lymph nodes (0 to 35)
  • Estrogen receptor status (positive vs. negative)

41
Case study of Breast Cancer Survival Analysis
  • The Cox Regression used a backward stepwise
    likelihood-ratio variable selection method
  • Significance criteria were set at 0.05 for
    inclusion in the model, and 0.10 for
    removal from the model
  • Printout from the final step of the stepwise
    regression analysis

    ________________ Variables in the Equation ________________
    Variable     B        S.E.     Wald      df   Sig      R        Exp(B)
    AGE         -.0314    .0121     6.7486   1    .0094   -.0893     .9691
    PATHSIZE     .3975    .1175    11.4476   1    .0007    .1259    1.4881
    LNPOS        .1372    .0361    14.4100   1    .0001    .1443    1.1471
    ___________________________________________________________

  • The column labeled "Sig" shows the statistical
    significance of included variables
  • The column labeled "R" shows the degree of
    unique correlation with the dependent variable
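
Two columns of the printout can be recomputed from the others, which is a useful sanity check when reading it. Exp(B) is the exponential of the coefficient B, and for a single-degree-of-freedom term the Wald statistic is commonly computed as (B / S.E.) squared. The sketch below assumes that convention; the helper names are hypothetical, and the recomputed Wald values match the printout only approximately because B and S.E. are rounded.

```python
import math

# (B, S.E.) pairs copied from the printout.
coefficients = {
    "AGE": (-0.0314, 0.0121),
    "PATHSIZE": (0.3975, 0.1175),
    "LNPOS": (0.1372, 0.0361),
}

def hazard_ratio(b):
    """Exp(B): multiplicative change in hazard per unit increase."""
    return math.exp(b)

def wald(b, se):
    """Wald chi-square for one coefficient, assuming (B / S.E.)^2."""
    return (b / se) ** 2

for name, (b, se) in coefficients.items():
    print(f"{name:9s} Exp(B)={hazard_ratio(b):.4f}  Wald={wald(b, se):.2f}")
```

Read this way, an Exp(B) of 1.4881 for PATHSIZE means each additional centimeter of tumor size multiplies the estimated hazard by about 1.49, while the value below 1 for AGE means the estimated hazard falls slightly with each additional year.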

42
Case study of Breast Cancer Survival Analysis
  • Some key things to note are
  • Estrogen status was removed as a predictor
    because it did not reach the 0.05
    significance criterion for inclusion
  • Number of positive axillary lymph nodes was the
    strongest predictor of survival rates (R = .1443,
    Sig = .0001), followed by pathological tumor
    size (R = .1259, Sig = .0007), over the course of
    the observation period
  • Age, although significant, is somewhat less
    influential than the other two predictors
    (R = -.0893, Sig = .0094)
  • Note that both the number of positive axillary
    lymph nodes and the pathological tumor size
    are positively correlated, which means that they
    are directly associated with more rapid
    mortality.
  • Age is negatively correlated with the dependent
    variable, which means that younger age is
    predictive of somewhat shorter survival.

43
Case study of Breast Cancer Survival Analysis
  • All patients survive through the 10th month of
    the observation period
  • At the fortieth month, the mortality
    rate increases and continues at this fairly
    constant increased rate
    through the forty-fifth month
  • At the forty-fifth month, there is a
    five-month period without additional mortality
  • 11% of the original sample has died

The following chart shows the cumulative
survival function during the observation period
44
Case study of Breast Cancer Survival Analysis
  • Conclusions and Implications
  • The case study presented here is relatively
    simple, and is for illustrative purposes only.
  • With the addition of more candidate predictors
    (progesterone receptor status, histologic
    grade, blood type etc.), an even more powerful
    model could emerge.
  • By understanding the influence of patient
    characteristics on mortality rates over time, we
    are in a better position to estimate survival
    times for individual patients, and to defend
    using different or more aggressive therapeutic
    approaches for some patients.

45
Securities Brokerage Case Study
  • Predictive market segmentation model designed to
    identify and profile high-value brokerage
    customer segments as targets for special
    marketing communications efforts.
  • The dependent variable for this ordinal CHAID
    model is brokerage account
    commission dollars during the past 12 months
  • We begin by splitting the client's entire
    customer file into a modeling sample and a
    validation sample. (Once the model is built
    using the modeling sample, we apply it to
    the validation sample to see how well it works
    on a sample other than the one on which it
    was built).
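
The modeling/validation split described on this slide can be sketched as a random partition of the customer file. The slide does not give the split proportion, so the 50/50 default, the fixed seed and the function name `split_customers` are assumptions.

```python
import random

def split_customers(customer_ids, validation_fraction=0.5, seed=42):
    """Randomly partition customers into a modeling and a validation sample."""
    ids = list(customer_ids)
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1.0 - validation_fraction))
    return ids[:cut], ids[cut:]

# Hypothetical customer file of 1,000 account ids.
modeling, validation = split_customers(range(1000))
print(len(modeling), len(validation))
```

The model is then built on `modeling` only and applied unchanged to `validation` to see how well it holds up on unseen customers.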

46
Securities Brokerage Case Study
  • The resulting CHAID model has 55 segments.
  • However, the results are summarized in the
    following comb chart, showing the segment indexes
    (indexes of average dollar value)

47
Securities Brokerage Case Study
Part of the gains chart: Average Annual Brokerage
Commission Dollars
  • The gains chart provides quantitative detail useful
    for financial and marketing planning.
  • We have highlighted the top 20% of the file in
    blue
  • The top 20% of the file is worth an average
    of about $334 per account, which is
    nearly three times the average account value for
    the entire sample.
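
The comparison behind a gains chart can be sketched as follows: rank accounts by model score, take the top fraction of the file, and compare its average value with the overall average. The account figures below are made up for illustration and are not the case-study data; `top_fraction_lift` is a hypothetical helper name.

```python
def top_fraction_lift(scored_accounts, fraction=0.20):
    """Compare the average value of the top-scored fraction with the overall average.

    `scored_accounts` is a list of (model_score, commission_dollars) pairs.
    Returns (top_average, overall_average, ratio).
    """
    # Rank accounts from highest to lowest model score.
    ranked = sorted(scored_accounts, key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_avg = sum(value for _, value in ranked[:n_top]) / n_top
    overall_avg = sum(value for _, value in ranked) / len(ranked)
    return top_avg, overall_avg, top_avg / overall_avg

# Hypothetical file of ten accounts: (model score, annual commission dollars).
accounts = [(0.90, 400.0), (0.80, 300.0), (0.70, 100.0), (0.60, 100.0),
            (0.50, 50.0), (0.40, 50.0), (0.30, 50.0), (0.20, 50.0),
            (0.10, 50.0), (0.05, 50.0)]
top_avg, overall_avg, lift = top_fraction_lift(accounts)
print(top_avg, overall_avg, round(lift, 2))
```

A ratio close to 3 for the top 20% corresponds to the "nearly three times the average account value" reading on the slide.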



...
48
Securities Brokerage Case Study
  • Using the information in the gains chart, we can
    better plan our communications/promotion budget.
  • In general, the best segments represent customers
    who are experienced, aggressive, self-directed
    traders.
  • Other decisions that the gains chart and
    the segmentation rules can help us make
  • We might wish to conduct some market research
    among customers in
    under-performing segments, or among
    under-performing customers in the
    better segments
  • We can use the segment definitions to help us
    identify possible issues and question areas to
    include in the survey
  • Before we try to apply such a model, we perform a
    validation against a holdout sample, to confirm
    that it is a good model.

49
The future of DM applications
  • Different opinions
  • Very little functionality in DB systems to
    support DM applications
  • Data mining, as a vital application, is
    just one more advance in the on-going research
    process
  • Data mining will not go away

The End