Title: FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SYSTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT
1 FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SYSTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT
- Ernestina Menasalvas
- Facultad de Informática
- Universidad Politécnica de Madrid, Spain
- emenasalvas_at_fi.upm.es
- November 2004
2 Background (I)
- 1995: doctoral student
- Visited the University of Regina (Prof. Ziarko)
- Visited Warsaw University (Prof. Pawlak)
- 1998: defended thesis on a data mining process model (Anita Wasilewska, C. Fernandez-Baizan)
- Since then
- Databases professor: databases, data mining
- Coordinator of the Data Mining group at Facultad de Informática, UPM
- Techniques: Rough Sets, Bayes, ...
- Methodologies for data mining process management
- Evaluation in Data Mining
- Experimentation in Web Mining
- Web Mining: Web Goal Mining
3 Background (II)
- Projects developed
- Pure Research
- Data Mining to be integrated on RDBMS
- Web Profiler
- Methodology for Data Mining process management
- Research and application
- Data Mining applied on different domains
- Car dealers
- Travel agency
- ...
4 Data Mining Project Development
- Methodologies for Data Mining project development
- Is Data Mining really a science?
- Are we developing projects as an art?
- Has research achieved the same results in all areas?
- Algorithms
- Data preparation
- Data enrichment
- Conceptualization of Data Mining problems
5 Data Mining: an art or a science?
- Since it appeared, a lot of algorithms have been programmed
- Standards
- CRISP-DM
- SEMMA
- PMML 3.0
- The process depends on the expertise of the data miner
- The user speaks about business problems
- The data miner speaks about algorithms
6 Data Mining as a project
- Data Mining is a data-intensive activity
- Data understanding
- Data preparation
- Database manager
- Transactional databases
- Data warehouses
- The end result of a data mining project is a tool (software project) for a better decision-making process
- Software development project
- The IT department has to be involved
7 Project Management
- Why?
- To organize the development process and to produce a project plan
- How?
- Establish how the process is going to be developed
- Sequential
- Incremental
- What?
- Establish how the process is split into phases and define the tasks to be developed in each step
- RUP
- XP
- CommonKADS
- Way of making things
- Independent of the process being developed
LIFECYCLE MODELS
- Particular tasks
- Detail of tasks to be developed
METHODOLOGY
8 Common pitfalls of data mining implementation
- The common pitfalls of data mining implementation are the following
- Not being able to efficiently communicate mining results within an organization.
- Not having the right data to conduct effective analysis.
- Not using existing data correctly.
- Not being able to evaluate results.
- Questions that arise
- Can the adequacy of a data set for a problem be established when preparing the project plan?
- How can the data set be used to produce the expected results?
- How can we evaluate the results?
- Cost estimation?
9 Data Mining Approaches
- Vendor independent
- CRISP-DM: process model
- Based on commercial tools
- CATs: based on CRISP-DM
- SEMMA: not a real methodology
- CRM methodology
- CRM Catalyst: global CRM process; does not concentrate on the data mining step
10 Cross-Industry Standard Process for Data Mining: CRISP-DM
11 Data Mining as a project: CATs
- CATs: Clementine Application Templates
- Specific libraries of best practices that provide immediate value right out of the box
- They follow the CRISP-DM standard: every CAT stream is assigned to a CRISP-DM phase
- They provide long-term value, as they can always be used with a new data set for new insight in other projects
- Available as an add-on module to Clementine; they include
- Telco CAT: improve retention and cross-selling efforts for telecommunications
- CRM CAT: understand and predict customer migration between segments
- Microarray CAT: accelerate biological discoveries, find genes
- Fraud CAT: predict and detect instances of fraud in financial transactions, claims, tax returns
- Web CAT
12 What is a CAT?
13 SEMMA (1)
- SEMMA (Sample, Explore, Modify, Model, Assess)
- Is not a data mining methodology
- Rather, it is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining.
- Enterprise Miner can be used as part of any iterative data mining methodology adopted by the client.
- Naturally, steps such as formulating a well-defined business or research problem and assembling quality, representative data sources are critical to the overall success of any data mining project.
14 SEMMA (2)
- SEMMA is focused on the model development aspects of data mining
- Sample the data to extract a portion of a large data set big enough to contain significant information, yet small enough to manipulate quickly.
- Explore the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas.
- Modify the data by creating, selecting and transforming the variables to focus the model selection problem.
- Model the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome. Modelling techniques include neural networks, tree classifiers, statistical models, etc.
- Assess the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
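The five SEMMA steps can be sketched in plain Python. This is only an illustration of the sequence, not SAS Enterprise Miner code: the function names, the `spend`, `visits` and `loyal` columns, the 3-standard-deviation anomaly rule and the single-threshold "model" are all assumptions made up for the example.

```python
import random
import statistics


def sample(rows, n, seed=0):
    """Sample: draw n rows without replacement (seeded so runs are repeatable)."""
    return random.Random(seed).sample(rows, n)


def explore(rows, column):
    """Explore: summarise a numeric column and list values far from the mean."""
    values = [r[column] for r in rows]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    # Assumed rule of thumb: more than 3 standard deviations counts as an anomaly.
    anomalies = [v for v in values if stdev > 0 and abs(v - mean) > 3 * stdev]
    return {"mean": mean, "stdev": stdev, "anomalies": anomalies}


def modify(rows):
    """Modify: derive a new variable (spend per visit) from existing ones."""
    return [dict(r, spend_per_visit=r["spend"] / max(r["visits"], 1)) for r in rows]


def model(train, feature, target):
    """Model: choose the threshold on `feature` that classifies most training rows correctly."""
    best_threshold, best_hits = None, -1
    for candidate in sorted({r[feature] for r in train}):
        hits = sum((r[feature] >= candidate) == r[target] for r in train)
        if hits > best_hits:
            best_threshold, best_hits = candidate, hits
    return best_threshold


def assess(rows, feature, target, threshold):
    """Assess: accuracy of the threshold rule on rows not used for modelling."""
    hits = sum((r[feature] >= threshold) == r[target] for r in rows)
    return hits / len(rows)


# Synthetic, made-up customers: loyalty is tied to spend purely for illustration.
customers = [{"spend": 10 * i, "visits": 5, "loyal": i >= 50} for i in range(100)]
subset = modify(sample(customers, 60))
train, holdout = subset[:40], subset[40:]
summary = explore(train, "spend_per_visit")
threshold = model(train, "spend_per_visit", "loyal")
accuracy = assess(holdout, "spend_per_visit", "loyal", threshold)
```

Note that the business problem and the choice of data sources sit outside these five functions, which matches the point that SEMMA covers model development rather than the whole project.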
15 Methods for Project Management: CRM Catalyst (1)
- Developed jointly by CustomISe, MACS and SalesPathways. Together they have formed the Catalyst Foundation (http://www.crmmethodology.com/)
- Motivations
- CRM projects are difficult to execute successfully because of the wide range of factors influencing their success, so it can take a long time to make CRM work properly for an organisation.
- Solution: CRM Catalyst.
- The methodology acts as a catalyst for CRM projects, enabling them to achieve their objectives more reliably and in less time.
- It gives a project life cycle with a set of defined phases broken down into steps with clearly stated inputs and outputs.
16 Methods for Project Management: CRM Catalyst (2)
Implementation requires a Data Mining development process
Progressive lifecycle model
The results are obtained in a progressive way
Implementation is knowledge intensive
In some steps a knowledge-intensive methodology could be appropriate
17 Main steps in a Data Mining Project
- Define the goals
- Business and data mining experts together have to define the goals
- Each goal must be defined with measurements for success
- Obtain the models
- Apply data mining algorithms.
- Preprocessing is important
- Evaluate results
- Ascertain the value of an object according to specified criteria, operationalised in terms of measures.
- Deploy
- Decide which patterns and models can be deployed
- Evaluate
- Once the product is working, the results should be verified
18 1. Define the goals
- Distinguish between
- Data Mining goals
- Business goals
- How do we translate?
- Business goal: increase the lifetime value of valuable customers
- Data mining goal: ? (Classification, Estimation, Association)
It has to be solved in the Business Understanding step of CRISP-DM
19 Business Understanding in the CRISP-DM Process
Business Understanding
- Determine Business Objectives: Background; Business Objectives; Business Success Criteria
- Assess Situation: Inventory of Resources; Requirements, Assumptions and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
- Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
20 1.1 Determine business objectives and success criteria
- Not only business objectives have to be established, but also measures to evaluate the results
- Business objectives
- What is the customer's primary objective?
- Increase the number of loyal customers
- Sell more of a certain product
- Have a positive marketing campaign
- Business success criteria
- What constitutes a successful outcome of the project?
- Objective measures so that success can be established
- ROI
21 1.2 Costs and Benefits
- Perform a cost-benefit analysis
- Compute the benefits of the project
- Which measures do we have?
- ROI
- CAPEX
- OPEX...
- Compute the costs of the project (equipment, human resources...)
- Which methodology do we have?
- COCOMO for software
- Quantify the risk that the project fails
- Knowledge not available
- Data not available
- Proper tools not available
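As a small worked example of the cost-benefit step, ROI can be computed as (benefits - costs) / costs. The sketch below uses that standard definition; the `risk_adjusted_roi` helper is an assumption added for illustration, modelling failure as an all-or-nothing event in which the project delivers no benefit. All figures in the usage lines are invented.

```python
def roi(benefits, costs):
    """Return on investment as a fraction: (benefits - costs) / costs."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) / costs


def risk_adjusted_roi(benefits, costs, failure_probability):
    """Expected ROI if the project yields no benefit with the given probability.

    Assumption for illustration only: failure is all-or-nothing and costs are
    incurred either way.
    """
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    expected_benefits = (1.0 - failure_probability) * benefits
    return roi(expected_benefits, costs)


# Invented figures: 150,000 of benefits against 100,000 of costs.
plain = roi(150_000, 100_000)                       # 0.5, i.e. 50 %
adjusted = risk_adjusted_roi(150_000, 100_000, 0.2)  # roughly 0.2, i.e. 20 %
```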
22 Data Mining Estimation Model
- Establishing a parametric estimation model for Data Mining (Marban03)
DMCOMO (Data Mining COst MOdel)
23 Data Mining Cost Estimation
- Main factors in a Data Mining project
- Data sources (number, kind, nature, ...)
- Data mining problem to be solved (descriptive, predictive, ...)
- Development platform
- Available tools
- Expertise of the development team
- Drivers
- Data Drivers
- Model Drivers
- Platform Drivers
- Tools and techniques Drivers
- Project Drivers
- People Drivers
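To make the idea of a parametric estimation model concrete, here is a generic additive sketch in which each cost driver contributes a weight times a rating. It is not the published DMCOMO equation, which these slides do not reproduce: the additive form, the driver names, the weights and the base effort are all hypothetical placeholders. Only the six driver groups come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    name: str      # hypothetical driver name
    group: str     # one of the six driver groups listed on the slide
    weight: float  # placeholder coefficient, NOT a calibrated value


def estimate_effort(base_effort, drivers, ratings):
    """Additive parametric estimate: base_effort + sum(weight * rating)."""
    missing = [d.name for d in drivers if d.name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return base_effort + sum(d.weight * ratings[d.name] for d in drivers)


# Hypothetical drivers and ratings, for illustration only.
drivers = [
    Driver("number_of_data_sources", "Data", 2.0),
    Driver("team_experience", "People", 0.5),
]
effort = estimate_effort(10.0, drivers, {"number_of_data_sources": 3, "team_experience": 4})
```

In a calibrated model the weights would be fitted from historical projects; here they only show where each driver group would enter the estimate.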
24 1.3 Data mining goals and success criteria
- Data mining goals
- Translate the customer's primary objective into a data mining goal, e.g.
- Loyalty program translated into a segmentation problem
- Decreasing the attrition rate transformed into a classification problem
- Data mining success criteria
- Determine success in technical terms
- Translate the notion of success into confidence, support, lift and other parameters
- Determine the cost of errors
- How do we make the translation?
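The technical success criteria named above have standard definitions for an association rule A -> C: support is the fraction of transactions containing both A and C, confidence is support(A and C) / support(A), and lift is confidence / support(C). The sketch below computes them on a made-up set of baskets; the function names and the simple linear cost of errors are assumptions for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    baskets = [frozenset(t) for t in transactions]
    if not baskets:
        raise ValueError("at least one transaction is required")
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(baskets)
    count_a = sum(a <= b for b in baskets)           # baskets containing A
    count_c = sum(c <= b for b in baskets)           # baskets containing C
    count_both = sum((a | c) <= b for b in baskets)  # baskets containing A and C
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


def total_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Cost of errors under an assumed fixed cost per false positive and false negative."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Made-up baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence, lift = rule_metrics(baskets, {"bread"}, {"milk"})
# support = 2/4, confidence = 2/3, lift = (2/3) / (3/4) = 8/9
```

A lift below 1, as here, means the antecedent makes the consequent slightly less likely than its base rate, so a success criterion would normally require lift above 1.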
25 Methodology
- Which methodology should be followed to translate business objectives into data mining objectives?
- Unfortunately, there is no such methodology. First we have to solve
- How is a business objective expressed?
- What is a data mining goal?
- How are data mining goals achieved?
- What are the requirements of data mining functions?
In order to describe everything in a standard way: conceptualize the problem
26 Conceptualization in other disciplines
- Databases
- E/R diagrams
- Independent of the domain
- A tool for business understanding and for the database designer
- Translation from E/R to implementation
External view 1 ... External view n
Conceptual Schema
Internal Schema
27 Proposed 3-level architecture
Business problems
Conceptual Schema: requirements of the algorithms will be solved at this level
Internal Schema: tool requirements to be solved (SAS, WEKA, Clementine)
28 3-layer architecture for data mining
- It is the bridge
- Between business goals and the final tool
- Independent of the domain
- Provides independence
- Changes in the tool do not affect the solution
- It has to be decided what to model in the conceptualization
- Automatic translation of business goals into data mining goals
- Data mining goals + constraints → feasible data mining goals
29 Elements to conceptualize
- Elements to be taken into account
- Data
- Quality from the data mining point of view
- Adequacy for the problem
- Classification for data mining purposes
- Knowledge
- Related to the process being analyzed
- Related to the data used
- People
- Owners of data
- Experts in the process
- Data mining problems requirements
- Data mining methods requirements
30 Proposed process
31 DMMO
- Data Mining Modelling Objects
- Data
- Knowledge
- Constraints of data and applications
- Data Mining objects
- Algorithms
- Measures
- Methods
- To bridge the gap between data miners and business users
32 Are data adequate for analysis?
- The adequacy of the data is analyzed taking into account the goals to fulfil.
- Data, together with the knowledge extracted from the experts, can be transformed so that, simply by being the input of a certain data mining algorithm, they will produce the required patterns.
- Quality of the data, in this context
- is not only related to technical quality (proper model, percentage of null values, ...)
- but also has to do with
- the meaning of the attributes,
- where each piece of data comes from,
- the relationships among data, and
- finally, how the data fulfil the requirements of the data mining functions
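One of the technical quality measures mentioned above, the percentage of null values per attribute, is easy to compute. The sketch below assumes rows are dictionaries and treats `None`, the empty string and a missing key as null; the function names, the sample records and the 30 % threshold are hypothetical. It covers only the technical side of quality, not the meaning or provenance of the attributes.

```python
def null_percentage(rows, attributes, null_values=(None, "")):
    """Percentage of rows in which each attribute is null or missing."""
    if not rows:
        raise ValueError("rows must not be empty")
    total = len(rows)
    return {
        attribute: 100.0 * sum(row.get(attribute) in null_values for row in rows) / total
        for attribute in attributes
    }


def attributes_over_threshold(percentages, max_null_percentage):
    """Attributes whose null percentage exceeds an (assumed) acceptable limit."""
    return sorted(a for a, pct in percentages.items() if pct > max_null_percentage)


# Made-up records, for illustration only.
records = [
    {"age": 30, "city": "Madrid"},
    {"age": None, "city": ""},
    {"age": 41},                     # "city" is missing altogether
    {"age": 25, "city": "Pisa"},
]
percentages = null_percentage(records, ["age", "city"])
suspect = attributes_over_threshold(percentages, 30.0)
```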
33 2. Data Mining: obtain models
- Apply the data mining process model
- Associated problems solved by the 3-layer architecture
- Comparison of approaches
- Evaluate costs
- Pros and cons of approaches
- Only experience or a conceptualization can help
- The conceptual model will help to establish the process to obtain each feasible model.
- Requirements and transformations implicit in the model
34 2.1 Determine type of problem
- What are data mining problems?
- Classification
- Estimation
- Association
- Segmentation
- In the conceptual model, the requirements for each type will be settled
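A minimal sketch of how requirements per problem type might be recorded and checked against the available data. The slides do not define the conceptual model at this level of detail, so the `Requirements` class, its two fields and the requirement values are assumptions; they encode only the common textbook distinction that classification needs a categorical target, estimation a numeric one, and association and segmentation no target at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Requirements:
    needs_target: bool           # does the problem type need a target attribute?
    target_kind: Optional[str]   # "categorical", "numeric", or None if no target


# Assumed requirements for the four problem types listed on the slide.
REQUIREMENTS = {
    "classification": Requirements(needs_target=True, target_kind="categorical"),
    "estimation": Requirements(needs_target=True, target_kind="numeric"),
    "association": Requirements(needs_target=False, target_kind=None),
    "segmentation": Requirements(needs_target=False, target_kind=None),
}


def feasible_problem_types(available_target_kind):
    """Problem types whose assumed requirements are met by the available data."""
    return sorted(
        name
        for name, req in REQUIREMENTS.items()
        if not req.needs_target or req.target_kind == available_target_kind
    )


# With no labelled target, only the unsupervised problem types remain feasible.
without_target = feasible_problem_types(None)
with_labels = feasible_problem_types("categorical")
```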
35 2.2 Apply the CRISP-DM process model
- The data mining problem has to be settled before going into the modeling step
- Requirements will be established in Business Understanding
- Requirements will be checked in Data Understanding and Data Preparation
- Preparation will be guided by the conceptual model
- Evaluation of feasibility can be done before applying the model
CRISP-DM phases (figure): Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
36 3. Evaluate results
- Spiliopoulou, Berendt
- Evaluation: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures.
- Object: the model already obtained
- Criteria and measures: have to do with the goals
- Evaluation requires a well-defined notion of success, which must be in place before
- the evaluation takes place
- the data mining phase starts
- any work with the data starts
- i.e. already during the business understanding process.
- Here once again conceptualization plays its role
37 Evaluation in the CRISP-DM Process
- The CRISP-DM process is
- a never-ending circle of iterations
- a non-sequential process, where backtracking to previous phases is usually necessary
- In each sequential instantiation, evaluation takes place
- But it is a cycle
- In all the iterations all the steps should be revisited
- Results have to be evaluated!
CRISP-DM phases (figure): Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
38 4. Deployment
- All the models that have a positive evaluation can be deployed
- For the measurements of success to be trusted, deployment has to follow the rules established at the beginning of the project
- The real evaluation has not yet been performed
39 5. Evaluate after deployment
- After deployment, there is a need to prove that the improvements are really due to the actions taken after a data mining discovery and not to any other factor or action carried out in the company
- None of the obvious claims about the success of data mining have ever been systematically tested.
- Experiments are crucial to establish whether the impact of the deployment is really positive or negative
- Experiments have to be designed at the beginning of the project
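One common way to run such an experiment is to keep a control group that does not receive the action and compare outcome rates with a two-proportion z-test. The slides do not prescribe a particular test, so this choice, the function name and the counts below are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_treated, n_treated, success_control, n_control):
    """Return (difference in rates, z statistic, two-sided p-value)."""
    rate_treated = success_treated / n_treated
    rate_control = success_control / n_control
    pooled = (success_treated + success_control) / (n_treated + n_control)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    if std_error == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (rate_treated - rate_control) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_treated - rate_control, z, p_value


# Invented counts: 120 of 1000 treated customers responded, 100 of 1000 controls did.
difference, z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
```

With these invented counts the p-value is above 0.05, so the observed 2-point improvement could not be attributed to the deployed action with the usual confidence. Group sizes therefore need to be planned when the project starts.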
40 Conclusions
- Data mining projects are being developed more as an art than as a science
- Many algorithms have been implemented, but there is no systematic proof, after deployment in real cases, that one is better than another
- A conceptual model is required
- To map business goals to the model
- To map data mining algorithms to a conceptual model
- Achievements of the model
- It will be used throughout the process to guide the project
- Evaluation tool
41 Future work
- Conceptual model
- Define DMMO objects
- Evaluation techniques related to the model
- Evaluate data mining goals
- Evaluate business goals
- Experimentation methods
- obtrusively and
- non-obtrusively
42 References
- Bettina Berendt, Myra Spiliopoulou, Ernestina Menasalvas. Evaluation in Web Mining. Tutorial at ECML/PKDD 2004, Pisa, Italy, 20 September 2004.
- Menasalvas, Millán, Gonzalez-Aranda, Segovia. Towards a Methodology for Data Mining Project Development: The Importance of Abstraction.
- Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme. Web Mining: From Web to Semantic Web, First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected and Invited Papers. Springer, 2004.
- Myra Spiliopoulou, Carsten Pohle. Modelling and Incorporating Background Knowledge in the Web Mining Process. Pattern Detection and Discovery 2002, 154-169.
- www.crisp-dm.org
- www.spss.com/clementine/cats.htm
- www.sas.com/technologies/analytics/datamining/miner/semma.html
- www.crmmethodology.com
- www.emetrics.org/articles/whitepaper.html
43 THANKS