Title: Competitive advantage from Data Mining: some lessons learnt in the Information Systems field
1. Competitive advantage from Data Mining: some lessons learnt in the Information Systems field
PMKD'05, Copenhagen, Denmark, August 22-26, 2005
- Mykola Pechenizkiy, Seppo Puuronen
  Department of Computer Science, University of Jyväskylä, Finland
- Alexey Tsymbal
  Department of Computer Science, Trinity College Dublin, Ireland
2. Outline
- Introduction and What is our message?
- Part I: Existing frameworks for DM
  - Theory-oriented: Databases, Statistics, Machine learning, etc.
  - Process-oriented: Fayyad's, CRISP-DM, Reinartz's
- Part II: Where are we? Rigor vs. relevance in DM
- Part III: Towards the new framework for DM research
  - DM System as adaptive Information System (IS)
  - DM research as IS Development: DM system as artefact
  - DM success model: success factors
  - KM Challenges in KDD
  - One possible reference for new DM research framework
- Further plans and Discussion
3. What is Data Mining
Data mining or knowledge discovery is the process of finding previously unknown and potentially interesting patterns and relations in large databases (Fayyad, KDD'96).
Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases (John, 1997).
Intersection of many fields: statistics, AI, machine learning, databases, neural networks, pattern recognition, econometrics, etc.
4. H. Information Systems
- H.0 GENERAL
- H.1 MODELS AND PRINCIPLES
- H.2 DATABASE MANAGEMENT
  - H.2.0 General
    - Security, integrity, and protection
  - H.2.8 Database Applications
    - Data mining
    - Image databases
    - Scientific databases
    - Spatial databases and GIS
    - Statistical databases
  - H.2.m Miscellaneous
http://www.acm.org/class/1998/ (valid in 2003)
5. I. Computing Methodologies
- I.2 ARTIFICIAL INTELLIGENCE
  - I.2.0 General
    - Cognitive simulation
    - Philosophical foundations
  - I.2.1 Applications and Expert Systems
  - I.2.2 Automatic Programming
  - I.2.3 Deduction and Theorem Proving
  - I.2.4 Knowledge Representation Formalisms and Methods
  - I.2.5 Programming Languages and Software
  - I.2.6 Learning
    - Analogies
    - Concept learning
    - Connectionism and neural nets
    - Induction
    - Knowledge acquisition
    - Language acquisition
    - Parameter learning
  - I.2.7 Natural Language Processing
  - I.2.m Miscellaneous
- I.5 PATTERN RECOGNITION
  - I.5.0 General
  - I.5.1 Models
    - Deterministic
    - Fuzzy set
    - Geometric
    - Neural nets
    - Statistical
    - Structural
  - I.5.2 Design Methodology
    - Classifier design and evaluation
    - Feature evaluation and selection
    - Pattern analysis
  - I.5.3 Clustering
    - Algorithms
    - Similarity measures
  - I.5.4 Applications
    - Computer vision
    - Signal processing
6. G. Mathematics of Computing
- G.3 PROBABILITY AND STATISTICS
  - Correlation and regression analysis
  - Distribution functions
  - Experimental design
  - Markov processes
  - Multivariate statistics
  - Nonparametric statistics
  - Probabilistic algorithms (including Monte Carlo)
  - Statistical computing
7. Our Message
- DM is still a technology that carries great expectations of enabling organizations to benefit more from their huge databases.
- There are some success stories in which organizations have managed to gain competitive advantage from DM.
- Still, the strong focus of most DM researchers on technology-oriented topics does not support expanding the scope to less rigorous but practically very relevant sub-areas.
- Research in the IS discipline has a strong tradition of taking into account the human and organizational aspects of systems besides the technical ones.
8. Our Message
- Currently, DM-supporting processes that take into account human and organizational aspects are still in their infancy.
- The DM community might benefit, at least from the practical point of view, from looking at other, older sub-areas of IT that have a tradition of considering solution-driven concepts with a focus also on human and organizational aspects.
- By becoming more amenable to the research results of the IS community, the DM community might be able to increase its collective understanding of
  - how DM artifacts are developed: conceived, constructed, and implemented,
  - how DM artifacts are used, supported and evolved,
  - how DM artifacts impact and are impacted by the contexts in which they are embedded.
9. Part I
- Existing Frameworks for DM
  - Theory-oriented
    - Databases
    - Statistics
    - Machine learning
    - Data compression
  - Process-oriented
    - Fayyad's
    - CRISP-DM
    - Reinartz's
10. Theory-Oriented Frameworks
11. Database Perspective
- DM as application to DBs
  - In the same way that business applications are currently supported using an SQL-based API, KDD applications need to be provided with application development support.
  - Querying KDD objects; support for finding nearest neighbours (NNs), clustering, or discretization and aggregate operations.
- Inductive databases approach
  - the query concept should also be applied to data mining and knowledge discovery tasks
  - there is no such thing as discovery, it is all in the power of the query language
  - inductive databases contain not only the data but the theory of the data as well
Imielinski, T., and Mannila, H. 1996, A database perspective on knowledge discovery. Communications of the ACM, 39(11), 58-64.
Boulicaut, J., Klemettinen, M., and Mannila, H. 1999, Modeling KDD processes within the inductive database framework. In Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery, Springer-Verlag, London, 293-302.
12. Reductionism Approach
- Two basic statistical paradigms
  - Statistical experiment
    - Fisher's version: inductive principle of maximum likelihood
    - Neyman and Pearson-Wald's version: inductive behaviour
    - Bayesian version: maximum posterior probability
  - Statistical learning from empirical process
- Structural data analysis
  - SVD
- Data mining ≠ statistics
  - the issue of computational feasibility has a much clearer role in data mining than in statistics
  - the data mining area includes approaches that emphasize database integration, simplicity of use, and the understandability of results
  - the theoretical framework of statistics is not much concerned with data analysis as a process that includes several steps
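As a rough illustration of how two of the paradigms listed above differ in practice, the following sketch estimates the success probability of a binary variable by maximum likelihood and by maximum posterior probability. The Beta(2, 2) prior, the function names and the example counts are assumptions made for this illustration only; they do not come from the slides.

```python
def mle_bernoulli(successes, trials):
    """Maximum likelihood estimate of a Bernoulli parameter: the observed frequency."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def map_bernoulli(successes, trials, alpha=2.0, beta=2.0):
    """Maximum a posteriori estimate under an assumed Beta(alpha, beta) prior.

    The posterior is Beta(alpha + successes, beta + trials - successes);
    its mode is returned (valid when both posterior parameters exceed 1,
    or when the prior is uniform and the data contain both outcomes).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    return (successes + alpha - 1.0) / (trials + alpha + beta - 2.0)


# Hypothetical data: 7 successes in 10 trials.
ml_estimate = mle_bernoulli(7, 10)    # 7 / 10
map_estimate = map_bernoulli(7, 10)   # (7 + 1) / (10 + 2)
```

With a uniform prior (alpha = beta = 1) the two estimates coincide; any other prior pulls the maximum posterior estimate away from the observed frequency, which is the practical difference between the two versions.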
13. Machine Learning Approach
- "Let the data suggest a model" can be seen as a practical alternative to the statistical paradigm "fit a model to the data".
- Constructive induction: a learning process with two intertwined phases, construction of the best representation space and generation of hypotheses in the found space (Michalski and Wnek, 1993).
- Feature transformation (PCA, SVD, Random Projection)
- Feature selection
- LSI
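The "construction of the representation space" phase can be sketched with a very simple filter-style feature selection: rank features by the absolute Pearson correlation with the target and keep the top k. This is only one of many possible selection criteria, chosen here as an assumption for illustration; the function names and the toy data are hypothetical and are not taken from the slides or from Michalski and Wnek.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return cov / (norm_x * norm_y)


def select_features(rows, target, k):
    """Return the indices of the k features most correlated with the target."""
    n_features = len(rows[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, target)), j))
    # Highest score first; ties are broken by the lower feature index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


# Toy data: feature 0 tracks the target, feature 1 is noisy, feature 2 is constant.
rows = [[1, 5, 0], [2, 3, 0], [3, 6, 0], [4, 2, 0]]
target = [2, 4, 6, 8]
kept = select_features(rows, target, 2)
```

A hypothesis would then be generated in the reduced space spanned by the kept features; feature transformation methods such as PCA differ in that they build new features instead of keeping a subset.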
14. Data Compression Approach
- Compress the data set by finding some structure or knowledge in it, where knowledge is interpreted as a representation that allows coding the data using fewer bits.
- Theories should not be ad hoc, that is, they should not overfit the examples used to build them.
- Occam's razor principle, 14th century:
  - "when you have two competing models which make exactly the same predictions, the one that is simpler is the better".
Mehta, M., Rissanen, J., and Agrawal, R. 1995, MDL-based decision tree pruning. In U.M. Fayyad, R. Uthurusamy (Eds.) Proceedings of the KDD 1995, AAAI Press, Montreal, Canada, 216-221.
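The compression idea can be made concrete with a two-part description length: the bits needed to state the model plus the bits needed to code the data given the model. The sketch below compares a parameter-free "fair coin" model with a fitted Bernoulli model on a binary sequence. It is not the decision tree pruning method of Mehta et al.; the parameter cost of 0.5 * log2(n) bits is a commonly used approximation adopted here as an assumption, and all names are hypothetical.

```python
import math


def data_bits(bits, p):
    """Code length in bits of a 0/1 sequence under a Bernoulli(p) model."""
    total = 0.0
    for b in bits:
        prob = p if b == 1 else 1.0 - p
        total += -math.log2(prob)
    return total


def choose_model(bits):
    """Pick the model with the shorter two-part description length."""
    n = len(bits)
    ones = sum(bits)
    # Smoothed estimate keeps p strictly between 0 and 1, so log2 is defined.
    p_hat = (ones + 0.5) / (n + 1.0)
    fair_length = data_bits(bits, 0.5)                             # no parameter to state
    fitted_length = 0.5 * math.log2(n) + data_bits(bits, p_hat)    # assumed parameter cost
    if fair_length <= fitted_length:
        return "fair", fair_length
    return "fitted", fitted_length


skewed = [1] * 90 + [0] * 10      # strong structure: the fitted model compresses well
balanced = [0, 1] * 50            # no frequency structure: extra parameter does not pay off
skewed_choice = choose_model(skewed)
balanced_choice = choose_model(balanced)
```

On the balanced sequence the fitted model codes the data no better than the fair coin, so the simpler model wins, which is the Occam's razor behaviour described above.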
15. Other Theoretical frameworks for DM
- Microeconomic view
  - the key point is that data mining is about finding actionable patterns: the only interest is in patterns that can somehow be used to increase utility
  - a decision-theoretic formulation of this principle: the goal can be formulated as finding a decision x that maximises a utility function f(x).
  - Kleinberg, J., Papadimitriou, C., and Raghavan, P. 1998, A microeconomic view of data mining, Data Mining and Knowledge Discovery 2(4), 311-324
- Philosophy of Science
  - logical empiricism, critical rationalism, systems theory
  - formism, mechanism, contextualism
  - dispersive vs. integrative, analytical vs. synthetic theories
  - subjectivist vs. objectivist, nomothetic vs. ideographic, nominalism vs. realism, voluntarism vs. determinism, epistemological assumptions
  - explanation, prediction, understanding
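The decision-theoretic formulation in the microeconomic view above can be sketched as an exhaustive search over candidate decisions, where the utility of a decision is summed over the customer records. This is a simplified illustration in the spirit of that view, not the exact formulation of Kleinberg et al.; the pricing scenario, the `revenue` utility and the data are assumptions invented for the example.

```python
def total_utility(decision, customers, utility):
    """Utility f(x) of one decision, summed over all customer records."""
    return sum(utility(decision, customer) for customer in customers)


def best_decision(decisions, customers, utility):
    """Return the candidate decision x that maximises the summed utility."""
    return max(decisions, key=lambda x: total_utility(x, customers, utility))


def revenue(price, reservation_price):
    """Hypothetical utility: a customer buys only if the price does not exceed
    the most he or she is willing to pay."""
    return price if price <= reservation_price else 0


# Each customer is described by an (assumed) reservation price.
customers = [5, 10, 12, 15, 20]
candidate_prices = [5, 10, 12, 15, 20]
chosen_price = best_decision(candidate_prices, customers, revenue)
```

In this view a mined pattern is interesting only if it changes the chosen decision or raises the achievable utility, for example by revealing customer segments that should be priced differently.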
16. Process-Oriented Frameworks
17. Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
18. CRISP-DM
http://www.crisp-dm.org/
19. KDD Vertical Solutions
Reinartz, T. 1999, Focusing Solutions for Data Mining. LNAI 1623, Berlin Heidelberg.
20. Conclusion on different frameworks
- The reductionist approach of viewing data mining as statistics has the advantages of a strong background and easily formulated problems.
- Data mining tasks concerning processes like clustering, regression and classification fit easily into these approaches.
- More recent (process-oriented) frameworks address the issues related to a view of data mining as a process, and its iterative and interactive nature.
21. Part II
- Where are we?
- Rigor and Relevance in DM Research
22. So, where are we?
- Lin (in Wu et al.) notes that a new successful industry (such as DM) can follow consecutive phases:
  - (1) discovering a new idea,
  - (2) ensuring its applicability,
  - (3) producing small-scale systems to test the market,
  - (4) better understanding of the new technology, and
  - (5) producing a fully scaled system.
- At present there are several dozen DM systems, none of which can be compared to the scale of a DBMS.
- This fact indicates that we are still in the 3rd phase in the DM area!
23. Rigor vs Relevance in DM Research
24. Where is the focus?
- Still on speeding up, scaling up, and increasing the accuracy of DM techniques.
- Piatetsky-Shapiro: we see many papers proposing incremental refinements in association rules algorithms, but very few papers describing how the discovered association rules are used.
- Lin claims that the R&D goals of DM are quite different,
  - since research is knowledge-oriented while development is profit-oriented.
  - Thus, DM research is concentrated on the development of new algorithms or their enhancements,
  - but the DM developers in domain areas are aware of cost considerations: investment in research, product development, marketing, and product support.
- However, we believe that the study of the DM development and DM use processes is as important as the technological aspects, and therefore such research activities are likely to emerge within the DM field.
25. Part III
- Towards the new framework for DM research
26. DMS in the Kernel of an Organization Environment
- DM is fundamentally an application-oriented area motivated by business and scientific needs to make sense of mountains of data.
- A DMS is generally used to support or perform some task(s) of human beings in an organizational environment, both of which have their own desires related to the DMS.
- Further, the organization has its own environment that has its own interests related to the DMS, e.g. that the privacy of people is not violated.
27. The ISs-based paradigm for DM
Ives B., Hamilton S., Davis G. (1980). A Framework for Research in Computer-based MIS. Management Science, 26(9), 910-934.
Information systems are powerful instruments for organizational problem solving through formal information processing.
Lyytinen, K., 1987, Different perspectives on ISs: problems and solutions. ACM Computing Surveys, 19(1), 5-46.
28. DM Artifact Development
A multimethodological approach to the construction of an artefact for DM
Adapted from Nunamaker, W., Chen, M., and Purdin, T. 1990-91, Systems development in information systems research, Journal of Management Information Systems, 7(3), 89-106.
29. Research methods in a paper on DM
- Theoretical approach: theory creating
  - Hypothesis, new algorithm, etc.
- Constructive approach
  - Prototype of a DM tool
- Theoretical approach: theory testing and evaluation
  - Artificial, benchmark, real-world data
  - Evaluation techniques
- Conclusion on theory
30. The Action Research and Design Science Approach to Artifact Creation
31. DM Artifact Use Success Model 1 of 3
Adapted from the D&M (DeLone and McLean) IS Success Models
32. DM Artifact Use Success Model 2 of 3
- What are the key factors of successful use and impact of a DMS at both the individual and organizational levels?
  - how the system is used, and also supported and evolved, and
  - how the system impacts and is impacted by the contexts in which it is embedded.
- Coppock: the failure factors of DM-related projects
  - have nothing to do with the skill of the modeler or the quality of data,
  - but do include:
    - persons in charge of the project did not formulate actionable insights,
    - the sponsors of the work did not communicate the insights derived to key constituents,
    - the results don't agree with institutional truths.
- The leadership, communication skills and understanding of the culture of the organization are no less important than the traditionally emphasized technological job of turning data into insights.
33. DM Artifact Use Success Model 3 of 3
- Hermiz communicated his belief that there are four critical success factors for DM projects:
  - (1) having a clearly articulated business problem that needs to be solved and for which DM is a proper tool;
  - (2) ensuring that the problem being pursued is supported by the right type of data of sufficient quality and in sufficient quantity for DM;
  - (3) recognizing that DM is a process with many components and dependencies: the entire project cannot be "managed" in the traditional sense of the business word;
  - (4) planning to learn from the DM process regardless of the outcome, and clearly understanding that there is no guarantee that any given DM project will be successful.
34. KM Perspective
- A knowledge-driven approach to enhance the dynamic integration of DM strategies in knowledge discovery systems.
- The focus here is on knowledge management aimed at organising a systematic process of (meta-)knowledge capture and refinement over time:
  - knowledge extracted from data
  - the higher-level knowledge required for managing DM techniques selection, combination and application
- Basic knowledge management processes of
  - knowledge creation and identification, representation, collection and organization, sharing, adaptation, and application
- DEXA'05 TAKMA WS paper and presentation are available
35. New Research Framework for DM Research
36. Further Work
- Definition of the relevance concept in DM research
- The revision of the book chapter
- Further work on the new framework for DM research
- Organization of a workshop, special track, or working conference on
  - more social directions in DM research, likely with one of the focuses on IS as a sister discipline.
- A few options:
  - IRIS Scandinavian Conference on IS is one option
  - Next PMKD
  - Workshop in Jyväskylä
37. Thank You!
- Feedback is very welcome
  - Questions
  - Suggestions
  - Collaboration
- Book chapter draft is available on request from
  - Mykola Pechenizkiy
  - Department of Computer Science and Information Systems,
  - University of Jyväskylä, FINLAND
  - E-mail: mpechen_at_cs.jyu.fi
  - Tel. 358 14 2602472, Fax 358 14 260 3011
  - http://www.cs.jyu.fi/mpechen