Competitive advantage from Data Mining: some lessons learnt in the Information Systems field

1
Competitive advantage from Data Mining: some
lessons learnt in the Information Systems field
PMKD'05, Copenhagen, Denmark, August 22-26, 2005
  • Mykola Pechenizkiy, Seppo Puuronen, Department of
    Computer Science, University of Jyväskylä, Finland
  • Alexey Tsymbal, Department of Computer Science,
    Trinity College Dublin, Ireland

2
Outline
  • Introduction and "What is our message?"
  • Part I: Existing frameworks for DM
  • Theory-oriented: databases, statistics, machine
    learning, etc.
  • Process-oriented: Fayyad's, CRISP-DM, Reinartz's
  • Part II: Where are we? Rigor vs. relevance in
    DM
  • Part III: Towards the new framework for DM
    research
  • DM system as an adaptive Information System (IS)
  • DM research as IS development: DM system as
    artefact
  • DM success model: success factors
  • KM challenges in KDD
  • One possible reference for a new DM research
    framework
  • Further plans and discussion

3
What is Data Mining?
Data mining or knowledge discovery is the process
of finding previously unknown and potentially
interesting patterns and relations in large
databases (Fayyad, KDD'96).
Data mining is the emerging science and industry
of applying modern statistical and computational
technologies to the problem of finding useful
patterns hidden within large databases (John, 1997).
Intersection of many fields: statistics, AI,
machine learning, databases, neural networks,
pattern recognition, econometrics, etc.
4
H. Information Systems
  • H.0 GENERAL
  • H.1 MODELS AND PRINCIPLES
  • H.2 DATABASE MANAGEMENT
  • H.2.0 General
  • Security, integrity, and protection
  • H.2.8 Database Applications
  • Data mining
  • Image databases
  • Scientific databases
  • Spatial databases and GIS
  • Statistical databases
  • H.2.m Miscellaneous

http://www.acm.org/class/1998/ (valid in 2003)
5
I. Computing Methodologies
  • I.2 ARTIFICIAL INTELLIGENCE
  • I.2.0 General
  • Cognitive simulation
  • Philosophical foundations
  • I.2.1 Applications and Expert Systems
  • I.2.2 Automatic Programming
  • I.2.3 Deduction and Theorem Proving
  • I.2.4 Knowledge Representation Formalisms and
    Methods
  • I.2.5 Programming Languages and Software
  • I.2.6 Learning
  • Analogies
  • Concept learning
  • Connectionism and neural nets
  • Induction
  • Knowledge acquisition
  • Language acquisition
  • Parameter learning
  • I.2.7 Natural Language Processing
  • I.2.m Miscellaneous
  • I.5 PATTERN RECOGNITION
  • I.5.0 General
  • I.5.1 Models
  • Deterministic
  • Fuzzy set
  • Geometric
  • Neural nets
  • Statistical
  • Structural
  • I.5.2 Design Methodology
  • Classifier design and evaluation
  • Feature evaluation and selection
  • Pattern analysis
  • I.5.3 Clustering
  • Algorithms
  • Similarity measures
  • I.5.4 Applications
  • Computer vision
  • Signal processing

6
G. Mathematics of Computing
  • G.3 PROBABILITY AND STATISTICS
  • Correlation and regression analysis
  • Distribution functions
  • Experimental design
  • Markov processes
  • Multivariate statistics
  • Nonparametric statistics
  • Probabilistic algorithms (including Monte Carlo)
  • Statistical computing

7
Our Message
  • DM is still a technology that carries great
    expectations of enabling organizations to benefit
    more from their huge databases.
  • There are some success stories in which
    organizations have managed to gain competitive
    advantage from DM.
  • Still, the strong focus of most DM researchers on
    technology-oriented topics does not support
    expanding the scope to less rigorous but
    practically very relevant sub-areas.
  • Research in the IS discipline has a strong
    tradition of taking into account the human and
    organizational aspects of systems besides the
    technical ones.

8
Our Message
  • Currently, the maturation of DM-supporting
    processes that take into account human and
    organizational aspects is still in its
    infancy.
  • The DM community might benefit, at least from the
    practical point of view, from looking at other,
    older sub-areas of IT that have a tradition of
    considering solution-driven concepts with a focus
    also on human and organizational aspects.
  • The DM community, by becoming more amenable to
    the research results of the IS community, might be
    able to increase its collective understanding of:
  • how DM artifacts are developed (conceived,
    constructed, and implemented),
  • how DM artifacts are used, supported and evolved,
  • how DM artifacts impact and are impacted by the
    contexts in which they are embedded.

9
Part I
  • Existing Frameworks for DM
  • Theory-oriented
  • Databases
  • Statistics
  • Machine learning
  • Data compression
  • Process-oriented
  • Fayyad's
  • CRISP-DM
  • Reinartz's

10
Theory-Oriented Frameworks
11
Database Perspective
  • DM as an application to DBs
  • In the same way that business applications are
    currently supported by SQL-based APIs, KDD
    applications need to be provided with application
    development support:
  • querying KDD objects, support for finding nearest
    neighbours (NNs), clustering, or discretization
    and aggregate operations.
  • Inductive databases approach
  • The query concept should also be applied to data
    mining and knowledge discovery tasks.
  • "There is no such thing as discovery, it is all
    in the power of the query language."
  • Inductive databases contain not only the data but
    the theory of the data as well.

Imielinski, T., and Mannila, H. 1996, A database
perspective on knowledge discovery.
Communications of the ACM, 39(11), 58-64.
Boulicaut, J., Klemettinen, M., and
Mannila, H. 1999, Modeling KDD processes within
the inductive database framework. In Proceedings
of the First International Conference on Data
Warehousing and Knowledge Discovery,
Springer-Verlag, London, 293-302.
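
The inductive-database idea above, that patterns are obtained by querying rather than by a separate "discovery" step, can be illustrated with a minimal sketch. The class InductiveDB, its methods and the toy transactions are hypothetical names invented for this illustration; they do not come from the cited papers or from any real library.

```python
from itertools import combinations


class InductiveDB:
    """Toy 'inductive database' (hypothetical): holds the data and
    answers queries about patterns (frequent itemsets) in that data."""

    def __init__(self, transactions):
        self.transactions = [frozenset(t) for t in transactions]

    def support(self, itemset):
        """Fraction of transactions that contain every item of itemset."""
        itemset = frozenset(itemset)
        hits = sum(1 for t in self.transactions if itemset <= t)
        return hits / len(self.transactions)

    def query_patterns(self, min_support, max_size=2):
        """Return {itemset: support} for all itemsets up to max_size
        whose support is at least min_support (brute-force enumeration)."""
        items = sorted(set().union(*self.transactions))
        result = {}
        for size in range(1, max_size + 1):
            for candidate in combinations(items, size):
                s = self.support(candidate)
                if s >= min_support:
                    result[frozenset(candidate)] = s
        return result


# Illustrative data only.
db = InductiveDB([
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
])
patterns = db.query_patterns(min_support=0.5)
```

Here the pattern query plays the role that a SELECT statement plays for ordinary data; a real inductive database would need a far richer query language and would not enumerate candidates by brute force.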
12
Reductionism Approach
  • Two basic statistical paradigms
  • Statistical experiment
  • Fisher's version: inductive principle of maximum
    likelihood
  • Neyman-Pearson-Wald version: inductive
    behaviour
  • Bayesian version: maximum posterior probability
  • Statistical learning from empirical process
  • Structural data analysis
  • SVD
  • Data mining ≠ statistics: the issue of
    computational feasibility has a much clearer role
    in data mining than in statistics
  • The data mining area favours approaches that
    emphasize database integration, simplicity of use,
    and the understandability of results
  • The theoretical framework of statistics is not
    much concerned with data analysis as a process
    that includes several steps
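
As a small illustration of two of the versions listed above, the sketch below contrasts the maximum-likelihood estimate with the maximum-posterior (MAP) estimate for a Bernoulli parameter. The Beta prior and the function names are assumptions chosen for this example, not something specified on the slide.

```python
def mle_bernoulli(successes, trials):
    """Maximum-likelihood estimate of a Bernoulli success probability."""
    return successes / trials


def map_bernoulli(successes, trials, alpha, beta):
    """Maximum-posterior estimate under an assumed Beta(alpha, beta) prior.

    The posterior is Beta(successes + alpha, trials - successes + beta);
    its mode is returned, which is well defined when both posterior
    parameters exceed 1.
    """
    return (successes + alpha - 1) / (trials + alpha + beta - 2)


# Three successes in three trials: the likelihood alone says p = 1.0,
# while a mild Beta(2, 2) prior pulls the estimate back towards 0.5.
p_ml = mle_bernoulli(3, 3)
p_map = map_bernoulli(3, 3, alpha=2, beta=2)
```

With few observations the two estimates differ noticeably; as the number of trials grows, the influence of the prior fades and they converge.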

13
Machine Learning Approach
  • "Let the data suggest a model" can be seen as a
    practical alternative to the statistical paradigm
    "fit a model to the data".
  • Constructive induction: a learning process with two
    intertwined phases, construction of the best
    representation space and generation of hypotheses
    in the found space (Michalski & Wnek, 1993).
  • Feature transformation (PCA, SVD, random
    projection)
  • Feature selection
  • LSI
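
Of the feature transformations named above, random projection is the simplest to sketch: the data are multiplied by a random matrix to obtain a lower-dimensional representation space. The function name, the Gaussian entries and the 1/sqrt(k) scaling are common choices assumed here for illustration, not details taken from the slides.

```python
import random


def random_projection(rows, target_dim, seed=0):
    """Project each row (a list of floats) onto target_dim dimensions
    using a random Gaussian matrix scaled by 1/sqrt(target_dim)."""
    rng = random.Random(seed)
    source_dim = len(rows[0])
    scale = 1.0 / target_dim ** 0.5
    # source_dim x target_dim projection matrix with N(0, 1) entries.
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(x * matrix[i][j] for i, x in enumerate(row))
             for j in range(target_dim)]
            for row in rows]


# Illustrative data: three instances with four features each.
data = [
    [1.0, 0.0, 2.0, 3.0],
    [0.5, 1.5, 0.0, 1.0],
    [2.0, 2.0, 1.0, 0.0],
]
projected = random_projection(data, target_dim=2)
```

In the constructive-induction view, a step like this belongs to the first phase (building the representation space); a learner would then generate hypotheses in the projected space.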

14
Data Compression Approach
  • Compress the data set by finding some structure
    or knowledge in it, where knowledge is
    interpreted as a representation that allows
    coding the data using fewer bits.
  • Theories should not be ad hoc, that is, they should
    not overfit the examples used to build them.
  • Occam's razor principle, 14th century:
  • "when you have two competing models which make
    exactly the same predictions, the one that is
    simpler is the better".

Mehta, M., Rissanen, J., and Agrawal, R. 1995,
MDL-based decision tree pruning. In U.M. Fayyad,
R. Uthurusamy (Eds.) Proceedings of the KDD 1995,
AAAI Press, Montreal, Canada, 216-221.
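
The compression view can be made concrete with a two-part description length: the bits needed to state a model plus the bits needed to code the data given that model. The sketch below is an assumption-laden toy, not the MDL pruning method of the cited paper: it compares a parameter-free one-bit-per-symbol code with a Bernoulli model whose single parameter is charged 0.5 * log2(n) bits, a commonly used approximation.

```python
import math


def bits_uniform(data):
    """Code length with no model: one bit per binary symbol."""
    return float(len(data))


def bits_bernoulli(data):
    """Two-part code length under a fitted Bernoulli model.

    Model cost: 0.5 * log2(n) bits for the one parameter (assumed
    approximation). Data cost: -log2 of the data's probability.
    """
    n = len(data)
    ones = sum(data)
    p = ones / n
    model_bits = 0.5 * math.log2(n)
    data_bits = 0.0
    for count, prob in ((ones, p), (n - ones, 1.0 - p)):
        if count:  # skip symbols that never occur (avoids log2(0))
            data_bits -= count * math.log2(prob)
    return model_bits + data_bits


skewed = [1] * 90 + [0] * 10    # has structure worth modelling
balanced = [1, 0] * 50          # symbol frequencies carry no structure

skewed_gain = bits_uniform(skewed) - bits_bernoulli(skewed)
balanced_gain = bits_uniform(balanced) - bits_bernoulli(balanced)
```

For the skewed sequence the model pays for itself (positive gain); for the balanced one the extra parameter only adds cost, which is the overfitting penalty expressed in bits.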
15
Other Theoretical frameworks for DM
  • Microeconomic view
  • The key point is that data mining is about
    finding actionable patterns: the only interest is
    in patterns that can somehow be used to increase
    utility.
  • A decision-theoretic formulation of this
    principle: the goal can be formulated as finding
    a decision x that maximises the utility
    function f(x).
  • Kleinberg, J., Papadimitriou, C., and Raghavan,
    P. 1998, A microeconomic view of data mining,
    Data Mining and Knowledge Discovery 2(4), 311-324.
  • Philosophy of Science
  • logical empiricism, critical rationalism, systems
    theory
  • formism, mechanism, contextualism
  • dispersive vs. integrative, analytical vs.
    synthetic theories
  • subjectivist vs. objectivist, nomothetic vs.
    ideographic, nominalism vs. realism, voluntarism
    vs. determinism, epistemological assumptions
  • Explanation, prediction, understanding
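
The decision-theoretic formulation in the microeconomic view above (find a decision x that maximises a utility function f(x)) can be sketched as follows. The pricing scenario, the data and all names are hypothetical; the only assumption carried over is that f(x) is evaluated on the data, here as a sum of per-customer utilities.

```python
def best_decision(decisions, customers, utility):
    """Return the decision x maximising f(x) = sum of utility(x, c)
    over all customers c."""
    def total_utility(x):
        return sum(utility(x, c) for c in customers)
    return max(decisions, key=total_utility)


def revenue(price, willingness_to_pay):
    """Hypothetical utility: a customer buys only if the price does not
    exceed what they are willing to pay."""
    return price if willingness_to_pay >= price else 0


# Illustrative data: each number is one customer's willingness to pay.
customers = [5, 8, 8, 12]
best_price = best_decision([5, 8, 12], customers, revenue)
```

A pattern in the customer data is "actionable" in this sense only if knowing it changes which decision comes out on top.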

16
Process-Oriented Frameworks
17
Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusamy, R., Advances in Knowledge Discovery
and Data Mining, AAAI/MIT Press, 1997.
18
CRISP-DM
http://www.crisp-dm.org/
19
KDD Vertical Solutions
Reinartz, T. 1999, Focusing Solutions for Data
Mining. LNAI 1623, Berlin Heidelberg.
20
Conclusion on different frameworks
  • The reductionist approach of viewing data mining as
    statistics has the advantages of a strong
    background and easily formulated problems.
  • Data mining tasks concerning processes such as
    clustering, regression and classification fit
    easily into these approaches.
  • More recent (process-oriented) frameworks address
    the issues related to a view of data mining as a
    process, and its iterative and interactive nature.

21
Part II
  • Where are we?
  • Rigor and Relevance in DM Research

22
So, where are we?
  • Lin (in Wu et al.) notes that a new successful
    industry (such as DM) can follow consecutive phases:
  • discovering a new idea,
  • ensuring its applicability,
  • producing small-scale systems to test the market,
  • better understanding of the new technology, and
  • producing a fully scaled system.
  • At present there are several dozen
    DM systems, none of which can be compared in
    scale to a DBMS.
  • This indicates that we are still in the 3rd
    phase in the DM area!

23
Rigor vs Relevance in DM Research
24
Where is the focus?
  • Still on speeding up, scaling up, and increasing
    the accuracy of DM techniques.
  • Piatetsky-Shapiro: "we see many papers proposing
    incremental refinements in association rules
    algorithms, but very few papers describing how
    the discovered association rules are used".
  • Lin claims that the R&D goals of DM are quite
    different,
  • since research is knowledge-oriented while
    development is profit-oriented.
  • Thus, DM research concentrates on the
    development of new algorithms or their
    enhancements,
  • but DM developers in domain areas are aware
    of cost considerations: investment in research,
    product development, marketing, and product
    support.
  • However, we believe that the study of the DM
    development and DM use processes is as
    important as the technological aspects, and
    therefore such research activities are likely to
    emerge within the DM field.

25
Part III
  • Towards the new framework for DM research

26
DMS in the Kernel of an Organization
Environment
  • DM is a fundamentally application-oriented area
    motivated by business and scientific needs to
    make sense of mountains of data.
  • A DMS is generally used to support or perform some
    task(s) by human beings in an organizational
    environment, both of which have their own desires
    related to the DMS.
  • Further, the organization has its own environment
    that has its own interests related to the DMS, e.g.
    that the privacy of people is not violated.

27
The ISs-based paradigm for DM
Ives, B., Hamilton, S., and Davis, G. 1980, A
framework for research in computer-based MIS.
Management Science, 26(9), 910-934.
Information systems are powerful instruments for
organizational problem solving through formal
information processing.
Lyytinen, K. 1987, Different perspectives on
ISs: problems and solutions. ACM Computing
Surveys, 19(1), 5-46.
28
DM Artifact Development
A multimethodological approach to the
construction of an artefact for DM
Adapted from Nunamaker, W., Chen, M., and
Purdin, T. 1990-91, Systems development in
information systems research, Journal of
Management Information Systems, 7(3), 89-106.
29
Research methods in a paper on DM
  • Theoretical approach: theory creation
  • Hypothesis, new algorithm, etc.
  • Constructive approach
  • Prototype of a DM tool
  • Theoretical approach: theory testing and
    evaluation
  • Artificial, benchmark, real-world data
  • Evaluation techniques
  • Conclusion on theory

30
The Action Research and Design Science Approach
to Artifact Creation
31
DM Artifact Use Success Model 1 of 3
Adapted from D&M IS Success Models
32
DM Artifact Use Success Model 2 of 3
  • What are the key factors of successful use and
    impact of a DMS at both the individual and
    organizational levels?
  • How the system is used, supported and
    evolved, and
  • how the system impacts and is impacted by the
    contexts in which it is embedded.
  • Coppock: the failure factors of DM-related
    projects
  • have nothing to do with the skill of the modeler
    or the quality of the data,
  • but they do include:
  • the persons in charge of the project did not
    formulate actionable insights,
  • the sponsors of the work did not communicate the
    insights derived to key constituents,
  • the results do not agree with institutional truths.

Leadership, communication skills and
understanding of the culture of the organization
are no less important than the traditionally
emphasized technological job of turning data into
insights.
33
DM Artifact Use Success Model 3 of 3
  • Hermiz communicated his belief that there are
    four critical success factors for DM
    projects:
  • (1) having a clearly articulated business problem
    that needs to be solved and for which DM is a
    proper tool;
  • (2) ensuring that the problem being pursued is
    supported by the right type of data of sufficient
    quality and in sufficient quantity for DM;
  • (3) recognizing that DM is a process with many
    components and dependencies; the entire project
    cannot be "managed" in the traditional sense of
    the business word;
  • (4) planning to learn from the DM process
    regardless of the outcome, and clearly
    understanding that there is no guarantee that
    any given DM project will be successful.

34
KM Perspective
  • A knowledge-driven approach to enhance the
    dynamic integration of DM strategies in knowledge
    discovery systems.
  • The focus here is on knowledge management aimed at
    organising a systematic process of (meta-)knowledge
    capture and refinement over time:
  • knowledge extracted from data,
  • the higher-level knowledge required for managing
    the selection, combination and application of
    DM techniques.
  • Basic knowledge management processes:
  • knowledge creation and identification,
    representation, collection and organization,
    sharing, adaptation, and application.
  • The DEXA'05 TAKMA workshop paper and presentation
    are available.

35
New Research Framework for DM Research
36
Further Work
  • Definition of the relevance concept in DM research
  • The revision of the book chapter
  • Further work on the new framework for DM research
  • Organization of a workshop, special track or
    working conference on
  • more social directions in DM research, likely
    with one of the focuses on IS as a sister
    discipline.
  • A few options:
  • IRIS Scandinavian Conference on IS is one option
  • Next PMKD
  • Workshop in Jyväskylä

37
Thank You!
  • Feedback is very welcome
  • Questions
  • Suggestions
  • Collaboration
  • Book chapter draft is available on request from
  • Mykola Pechenizkiy
  • Department of Computer Science and Information
    Systems,
  • University of Jyväskylä, FINLAND
  • E-mail: mpechen_at_cs.jyu.fi
  • Tel. 358 14 2602472, Fax 358 14 260 3011
  • http://www.cs.jyu.fi/mpechen