Kein Folientitel - PowerPoint PPT Presentation

Title: Kein Folientitel ("No slide title"). Last modified by: Bettina Berendt. Created: 1/25/2002 2:55:41 PM. Document presentation format: on-screen presentation (Bildschirmpräsentation).

Slides: 53
Provided by: csKuleuv
Transcript and Presenter's Notes

1
Last update: 15 November 2007
Advanced databases: Inferring new knowledge
from data(bases): Knowledge Discovery in
Databases
Bettina Berendt
Katholieke Universiteit Leuven, Department of
Computer Science, http://www.cs.kuleuven.be/berendt/teaching/2007w/adb/
2
Agenda
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
3
What is the impact of genetically modified
organisms?
4
Is our school system good for immigrants and/or
children from poor backgrounds?
5
What are the effects of teaching in English at
universities?
6
What makes people happy?
7
What do men and women like?
8
Is this a man or a woman?
9
Primary Tasks of Data Mining
  • Classification: finding the description of several
    predefined classes and classifying a data item into
    one of them.
  • Clustering: identifying a finite set of categories
    or clusters to describe the data.
  • Dependency modeling: finding a model which describes
    significant dependencies between variables.
  • Regression: mapping a data item to a real-valued
    prediction variable.
  • Deviation and change detection: discovering the most
    significant changes in the data.
  • Summarization: finding a compact description for a
    subset of data.
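As a minimal sketch of the regression task listed above (mapping a data item to a real-valued prediction), the following fits a straight line by ordinary least squares. The function names `fit_line` and `predict` and the toy data are assumptions made for illustration; they do not come from the slides.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance(x, y) / variance(x); the common 1/n factor cancels.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x):
    """Map a data item x to a real-valued prediction."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data lying exactly on the line y = 2x + 1.
model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(predict(model, 4))
```

Classification differs only in the target: the model returns one of a fixed set of class labels instead of a real number.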
10
Agenda
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
11
Data mining and knowledge discovery
  • (informal definition)
  • data mining is about discovering knowledge in
    (huge amounts of) data
  • Therefore, it is clearer to speak about
    knowledge discovery in data(bases)

12
Recall: Data, information, and knowledge
  • Data represents a fact or statement of event
    without relation to other things.
  • Ex.: It is raining.
  • Information embodies the understanding of a
    relationship of some sort, possibly cause and
    effect.
  • Ex.: The temperature dropped 15 degrees and then
    it started raining.
  • Knowledge represents a pattern that connects and
    generally provides a high level of predictability
    as to what is described or what will happen next.
  • Ex.: If the humidity is very high and the
    temperature drops substantially, the atmosphere
    is often unlikely to be able to hold the moisture,
    so it rains.
  • (This is from knowledge-management theory. If you
    want to know about wisdom, check the Web page:
    G. Bellinger, D. Castro, A. Mills: Data,
    Information, Knowledge, and Wisdom.
    http://www.systems-thinking.org/dikw/dikw.htm )

13
Why Data Mining?
  • The explosive growth of data: from terabytes to
    petabytes
  • Data collection and data availability
  • Automated data collection tools, database
    systems, Web, computerized society
  • Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, ...
  • Science: remote sensing, bioinformatics,
    scientific simulation, ...
  • Society and everyone: news, digital cameras, ...
  • We are drowning in data, but starving for
    knowledge!
  • "Necessity is the mother of invention": data
    mining, the automated analysis of massive data sets

14
Background: Evolution of Database Technology
  • 1960s:
  • Data collection, database creation, IMS and
    network DBMS
  • 1970s:
  • Relational data model, relational DBMS
    implementation
  • 1980s:
  • RDBMS, advanced data models (extended-relational,
    OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific,
    engineering, etc.)
  • 1990s:
  • Data mining, data warehousing, multimedia
    databases, and Web databases
  • 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global
    information systems

15
The KDD process
"The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data" (Fayyad,
Piatetsky-Shapiro, Smyth, 1996)
16
The process part of knowledge discovery
  • CRISP-DM
  • CRoss Industry Standard Process for Data Mining
  • a data mining process model that describes
    commonly used approaches that expert data miners
    use to tackle problems.

17
Knowledge discovery, machine learning, data mining
  • Knowledge discovery
  • the whole process
  • Machine learning
  • the application of induction algorithms and other
    algorithms that can be said to learn.
  • modeling phase
  • Data mining
  • sometimes KD, sometimes ML

18
The KDD Process
[Figure: the KDD process as a sequence of numbered steps]
  • Step 1: Data organized by function; create/select
    target database (data warehousing)
  • Step 2: Select sampling technique and sample data;
    supply missing values; eliminate noisy data
  • Step 3: Normalize values; transform values; create
    derived attributes; find important attributes and
    value ranges
  • Steps 4 and 5: Select DM task(s); select DM
    method(s); extract knowledge; test knowledge; refine
    knowledge; query and report generation, aggregation
    and sequences, advanced methods; transform to
    different representation
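The data preparation activities named in this figure (supplying missing values, normalizing values, creating derived attributes) can be sketched as follows. This is only an illustration under stated assumptions: mean imputation and min-max scaling are one simple choice each among many, and the function names and the price/cost records are hypothetical.

```python
from statistics import mean


def supply_missing(values):
    """Replace None by the mean of the observed values (one simple strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def normalize(values):
    """Min-max normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical records: price and cost per product, with one missing cost.
prices = [10.0, 20.0, 40.0]
costs = [8.0, None, 30.0]

costs = supply_missing(costs)                     # supply missing values
margins = [p - c for p, c in zip(prices, costs)]  # create a derived attribute
scaled = normalize(margins)                       # normalize values
print(costs, margins, scaled)
```

Each function takes a plain list and returns a new one, so the preparation steps can be reordered or swapped for other strategies before the mining step itself.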
19
Agenda
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
20
Main Contributing Areas of KDD
[Figure: KDD at the intersection of its contributing areas]
  • Statistics: infer info from data (deduction and
    induction, mainly numeric data)
  • Databases: store, access, search, update data
    (deduction)
  • Machine Learning: computer algorithms that improve
    automatically through experience (mainly induction,
    symbolic data)
  • Data warehouses: integrated data
  • OLAP: On-Line Analytical Processing
21
Data Mining: Classification Schemes
  • General functionality
  • Descriptive data mining
  • Predictive data mining
  • Different views lead to different classifications
  • Data view: kinds of data to be mined
  • Knowledge view: kinds of knowledge to be
    discovered
  • Method view: kinds of techniques utilized
  • Application view: kinds of applications adapted

22
Data Mining: Confluence of Multiple Disciplines
23
Why Not Traditional Data Analysis?
  • Tremendous amount of data
  • Algorithms must be highly scalable to handle
    data volumes such as terabytes of data
  • High-dimensionality of data
  • Microarray data may have tens of thousands of
    dimensions
  • High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks and
    multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text and Web
    data
  • Software programs, scientific simulations
  • New and sophisticated applications

24
Agenda
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
25
Data Mining: On What Kinds of Data?
  • Database-oriented data sets and applications
  • Relational database, data warehouse,
    transactional database
  • Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
    (incl. bio-sequences)
  • Structured data, graphs, social networks and
    multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia database
  • Text databases
  • The World-Wide Web

26
Data Mining Functionalities
  • Multidimensional concept description:
    characterization and discrimination
  • Generalize, summarize, and contrast data
    characteristics, e.g., dry vs. wet regions
  • Frequent patterns, association, correlation vs.
    causality
  • Diaper → Beer [0.5%, 75%] (Correlation or
    causality?)
  • Classification and prediction
  • Construct models (functions) that describe and
    distinguish classes or concepts for future
    prediction
  • E.g., classify countries based on (climate), or
    classify cars based on (gas mileage)
  • Predict some unknown or missing numerical values

27
Data Mining Functionalities (2)
  • Cluster analysis
  • Class label is unknown: group data to form new
    classes, e.g., cluster houses to find
    distribution patterns
  • Maximizing intra-class similarity and minimizing
    interclass similarity
  • Outlier analysis
  • Outlier: a data object that does not comply with
    the general behavior of the data
  • Noise or exception? Useful in fraud detection,
    rare events analysis
  • Trend and evolution analysis
  • Trend and deviation: e.g., regression analysis
  • Sequential pattern mining: e.g., digital camera →
    large SD memory
  • Periodicity analysis
  • Similarity-based analysis
  • Other pattern-directed or statistical analyses
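One simple way to flag a "data object that does not comply with the general behavior of the data" is a z-score test. The sketch below is an assumption for illustration only: the slides do not prescribe a method, and the function name `find_outliers`, the threshold of 2 standard deviations, and the amounts are hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one value is far from the rest.
amounts = [10, 11, 9, 10, 12, 10, 50]
print(find_outliers(amounts))
```

Whether a flagged value is noise to discard or an exception worth investigating (for example in fraud detection) is a domain decision that the test itself cannot make.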

28
Are All the Discovered Patterns Interesting?
  • Data mining may generate thousands of patterns:
    not all of them are interesting
  • Suggested approach: human-centered, query-based,
    focused mining
  • Interestingness measures
  • A pattern is interesting if it is easily
    understood by humans, valid on new or test data
    with some degree of certainty, potentially
    useful, novel, or validates some hypothesis that
    a user seeks to confirm
  • Objective vs. subjective interestingness measures
  • Objective: based on statistics and structures of
    patterns, e.g., support, confidence, etc.
  • Subjective: based on user's belief in the data,
    e.g., unexpectedness, novelty, actionability, etc.

29
Find All and Only Interesting Patterns?
  • Find all the interesting patterns: completeness
  • Can a data mining system find all the interesting
    patterns? Do we need to find all of the
    interesting patterns?
  • Heuristic vs. exhaustive search
  • Association vs. classification vs. clustering
  • Search for only interesting patterns: an
    optimization problem
  • Can a data mining system find only the
    interesting patterns?
  • Approaches
  • First generate all the patterns and then filter
    out the uninteresting ones
  • Generate only the interesting patterns: mining
    query optimization
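The first approach (generate all patterns, then filter) can be sketched for itemset patterns as below. Treating "interesting" as "support at or above a minimum" is an assumption made for this example, as are the function names and the toy baskets.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def generate_then_filter(transactions, min_support):
    """Exhaustively generate all itemsets, then keep only the frequent ones."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:  # the filtering step
                frequent[candidate] = s
    return frequent


# Hypothetical market baskets.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"beer"},
]
for itemset, s in generate_then_filter(baskets, 0.5).items():
    print(sorted(itemset), s)
```

The number of candidate itemsets grows exponentially with the number of items, which is why the second approach tries to push the interestingness criterion into the search itself; Apriori-style pruning of supersets of infrequent itemsets is one well-known example.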

30
Other Pattern Mining Issues
  • Precise patterns vs. approximate patterns
  • Association and correlation mining: possible to
    find sets of precise patterns
  • But approximate patterns can be more compact and
    sufficient
  • How to find high-quality approximate patterns?
  • Gene sequence mining: approximate patterns are
    inherent
  • How to derive efficient approximate pattern
    mining algorithms?
  • Constrained vs. non-constrained patterns
  • Why constraint-based mining?
  • What are the possible kinds of constraints? How
    to push constraints into the mining process?

31
Data Mining Query Languages
  • Automated vs. query-driven?
  • Finding all the patterns autonomously in a
    database? Unrealistic, because the patterns could
    be too many but uninteresting
  • Data mining should be an interactive process
  • User directs what to be mined
  • Users must be provided with a set of primitives
    to be used to communicate with the data mining
    system
  • Incorporating these primitives in a data mining
    query language
  • More flexible user interaction
  • Foundation for design of graphical user interface
  • Standardization of data mining industry and
    practice

32
Primitives that Define a Data Mining Task
  • Task-relevant data
  • Type of knowledge to be mined
  • Background knowledge
  • Pattern interestingness measurements
  • Visualization/presentation of discovered patterns

33
Primitive 1: Task-Relevant Data
  • Database or data warehouse name
  • Database tables or data warehouse cubes
  • Condition for data selection
  • Relevant attributes or dimensions
  • Data grouping criteria

34
Primitive 2: Types of Knowledge to Be Mined
  • Characterization
  • Discrimination
  • Association
  • Classification/prediction
  • Clustering
  • Outlier analysis
  • Other data mining tasks

35
Primitive 3: Background Knowledge
  • A typical kind of background knowledge: concept
    hierarchies
  • Schema hierarchy
  • E.g., street < city < province_or_state < country
  • Set-grouping hierarchy
  • E.g., 20-39 = young, 40-59 = middle_aged
  • Operation-derived hierarchy
  • email address: hagonzal@cs.uiuc.edu
  • login-name < department < university < country
  • Rule-based hierarchy
  • low_profit_margin(X) ⇐ price(X, P1) and cost(X, P2)
    and (P1 - P2) < 50
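Three of these hierarchy kinds can be sketched in a few lines, assuming the ordering street < city < province_or_state < country, the age groups 20-39 and 40-59, and a profit-margin threshold of 50 as read from the slide. All function names and the sample address are hypothetical and serve only as illustration.

```python
# Set-grouping hierarchy: age ranges are grouped into named concepts.
AGE_GROUPS = [(range(20, 40), "young"), (range(40, 60), "middle_aged")]


def age_concept(age):
    """Map an integer age to its higher-level concept, if one is defined."""
    for ages, label in AGE_GROUPS:
        if age in ages:
            return label
    return None  # no group is given for other ages


# Schema hierarchy, ordered from most specific to most general.
SCHEMA_LEVELS = ["street", "city", "province_or_state", "country"]


def generalize(record, level):
    """Keep only the attributes at or above the requested level."""
    keep = SCHEMA_LEVELS[SCHEMA_LEVELS.index(level):]
    return {k: v for k, v in record.items() if k in keep}


# Rule-based hierarchy: an item is low-margin if price minus cost is below 50.
def low_profit_margin(price, cost):
    return (price - cost) < 50


# Hypothetical address record.
address = {
    "street": "Main Street",
    "city": "Springfield",
    "province_or_state": "Illinois",
    "country": "USA",
}
print(age_concept(25), generalize(address, "province_or_state"),
      low_profit_margin(100, 60))
```

Mining at a higher level of such a hierarchy means replacing attribute values by their generalized concepts before patterns are counted.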

36
Primitive 4: Pattern Interestingness Measure
  • Simplicity
  • e.g., (association) rule length, (decision) tree
    size
  • Certainty
  • e.g., confidence, P(A|B) = #(A and B) / #(B),
    classification reliability or accuracy, certainty
    factor, rule strength, rule quality,
    discriminating weight, etc.
  • Utility
  • potential usefulness, e.g., support
    (association), noise threshold (description)
  • Novelty
  • not previously known, surprising (used to remove
    redundant rules, e.g., Illinois vs. Champaign
    rule implication support ratio)

37
Primitive 5: Presentation of Discovered Patterns
  • Different backgrounds/usages may require
    different forms of representation
  • E.g., rules, tables, crosstabs, pie/bar chart,
    etc.
  • Concept hierarchy is also important
  • Discovered knowledge might be more understandable
    when represented at a high level of abstraction
  • Interactive drill up/down, pivoting, slicing and
    dicing provide different perspectives on data
  • Different kinds of knowledge require different
    representation: association, classification,
    clustering, etc.

38
Architecture: Typical Data Mining System
39
Major Issues in Data Mining
  • Mining methodology
  • Mining different kinds of knowledge from diverse
    data types, e.g., bio, stream, Web
  • Performance: efficiency, effectiveness, and
    scalability
  • Pattern evaluation: the interestingness problem
  • Incorporation of background knowledge
  • Handling noise and incomplete data
  • Parallel, distributed and incremental mining
    methods
  • Integration of the discovered knowledge with
    existing knowledge: knowledge fusion
  • User interaction
  • Data mining query languages and ad-hoc mining
  • Expression and visualization of data mining
    results
  • Interactive mining of knowledge at multiple
    levels of abstraction
  • Applications and social impacts
  • Domain-specific data mining and invisible data
    mining
  • Protection of data security, integrity, and
    privacy

40
Agenda
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
41
Classification
What factors determine cancerous cells?
[Figure: cancerous cell data (examples) are fed into a
classification algorithm (mining algorithm: rule
induction, decision tree, or neural network), which
produces general patterns.]
42

Classification: Rule Induction
What factors determine a cell is cancerous?
If Color = light and Tails = 1 and Nuclei = 2
Then Healthy Cell (certainty 92%)
If Color = dark and Tails = 2 and Nuclei = 2
Then Cancerous Cell (certainty 87%)
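The two induced rules can be applied as a small rule-based classifier. The sketch assumes that the conditions are equality tests and that the certainty figures are percentages; the data layout and the function name `classify` are hypothetical.

```python
# Each rule: (conditions, predicted class, certainty as a fraction).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify(cell):
    """Return (label, certainty) of the first matching rule, or None."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell


print(classify({"color": "dark", "tails": 2, "nuclei": 2}))
print(classify({"color": "light", "tails": 2, "nuclei": 1}))
```

A cell that matches no rule is left unclassified here; a real rule learner would usually add a default rule for such cases.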
43
Classification: Decision Trees
[Figure: decision tree. The root splits on Color (dark /
light); each branch then splits on nuclei (1 / 2), and
some branches split further on tails (1 / 2). The leaves
are labelled healthy or cancerous.]
44
Classification: Neural Networks
What factors determine a cell is cancerous?
[Figure: neural network with inputs Color = dark,
nuclei = 1, tails = 2 and outputs Healthy / Cancerous.]
45

Clustering
Are there clusters of similar cells?
Light color with 1 nucleus
Dark color with 2 tails and 2 nuclei
1 nucleus and 1 tail
Dark color with 1 tail and 2 nuclei
46
Association Rule Discovery
Task: Discovering association rules among items
in a transaction database. An association between
two items A and B means that the presence of A in
a record implies the presence of B in the same
record: A => B. In general: A1, A2, ... => B
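How strongly the presence of A implies the presence of B is usually quantified by support and confidence. The sketch below computes both for a single rule; the function name `rule_stats` and the four records are hypothetical, and the definitions used are the common ones (support = share of records containing both sides, confidence = share of records with the antecedent that also contain the consequent).

```python
def rule_stats(antecedent, consequent, records):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n_a = sum(1 for r in records if antecedent <= r)  # records containing A
    n_ab = sum(1 for r in records if both <= r)       # records containing A and B
    support = n_ab / len(records)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical transaction database with items A, B, C.
records = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}]
print(rule_stats({"A"}, {"B"}, records))
```

A rule can have high confidence and still be rare, which is why both numbers are normally reported together.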
47
Association Rule Discovery
Are there any associations between the
characteristics of the cells?
If color = light and nuclei = 1 then tails = 1
(support 12.5%, confidence 50%)
If nuclei = 2 and Cell = Cancerous then tails = 2
(support 25%, confidence 100%)
If tails = 1 then Color = light
(support 37.5%, confidence 75%)
48
Many Other Data Mining Techniques
Genetic Algorithms
Statistics
Bayesian Networks
Text Mining
Time Series
Rough Sets
49
A goal: From databases to deductive databases to
inductive databases
  • A deductive database system is a database system
    which can make deductions (i.e., conclude
    additional facts) based on rules and facts stored
    in the (deductive) database.
  • Inductive databases (IDBs)
    contain not only data, but also patterns.
  • In an IDB, inductive queries can be used to
    generate (mine), manipulate, and apply patterns.
  • The IDB framework supports the process of
    knowledge discovery in databases (KDD):
  • the results of one (inductive) query can be used
    as input for another
  • nontrivial multi-step KDD scenarios can be
    supported, rather than just single data mining
    operations.
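The deductive side ("conclude additional facts based on rules and facts") can be illustrated with naive forward chaining over propositional facts. This is a toy sketch, not how a deductive database system is implemented; the function name `forward_chain`, the fact names, and the second rule are hypothetical.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all its premises are already known.
            if set(premises) <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


# Stored facts and rules (rule format: (premises, conclusion)).
facts = {"humidity_high", "temperature_drops"}
rules = [
    (("humidity_high", "temperature_drops"), "rain"),
    (("rain",), "ground_wet"),
]
print(sorted(forward_chain(facts, rules)))
```

The second rule only fires because the first one has added "rain" to the known facts, which mirrors the idea that the result of one query can serve as input for another.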

50
Next lecture
Motivation and application examples
The process of knowledge discovery
Origins and context
Major issues in knowledge discovery
A short overview of key techniques
Deductive databases
51
References / background reading; acknowledgements
  • Knowledge discovery is now an established area
    with some excellent general textbooks. I
    recommend the following as examples of the 3 main
    perspectives:
  • a databases / data warehouses perspective: Han,
    J. & Kamber, M. (2001). Data Mining: Concepts and
    Techniques. San Francisco, CA: Morgan Kaufmann.
    http://www.cs.sfu.ca/%7Ehan/dmbook
  • a machine learning perspective: Witten, I.H. &
    Frank, E. (2005). Data Mining. Practical Machine
    Learning Tools and Techniques with Java
    Implementations. 2nd ed. Morgan Kaufmann.
    http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
  • a statistics perspective: Hand, D.J., Mannila,
    H., & Smyth, P. (2001). Principles of Data
    Mining. Cambridge, MA: MIT Press.
    http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
  • pp. 9, 15, 18, 20, 41-44 were taken from
    Tzacheva, A.A. (2006). SIMS 422. Knowledge
    Inference Systems Applications.
    http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt
  • pp. 45-48 were taken from
    Tzacheva, A.A. (2006). Knowledge Discovery and
    Data Mining.
    http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt
  • pp. 13, 14, 22, 23, 25-39 were taken from
    Han, J. & Kamber, M. (2006). Data Mining:
    Concepts and Techniques. Chapter 1:
    Introduction.
    http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt

52
Picture credits; CRISP-DM reference
  • p. 3: http://www.siu-weeds.com/publications/Wheat_field.jpg
  • p. 4: http://www.dkimages.com/discover/previews/889/30039025.JPG
  • p. 5: http://www.viebahnfinearts.com/website/Pages/Photos/Furniture/Mirror201005.jpg
  • p. 6: http://charles.robinsontwins.org/twinsdays_96/john/smiley.jpg
  • p. 16: http://www.palagems.com/Images/ceylon_mining.jpg,
    http://www.crisp-dm.org/Images/187343_CRISPart.jpg
  • The CRISP-DM phase model can be found at
    http://www.crisp-dm.org