Motivation: Necessity is the Mother of Invention

Provided by: jiaw193

Transcript and Presenter's Notes



1
Motivation: Necessity is the Mother of Invention
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses and
    other information repositories
  • Solution: data warehousing and data mining
  • Data warehousing and on-line analytical
    processing
  • Extraction of interesting knowledge (rules,
    regularities, patterns, constraints) from data
    in large databases

2
Why Data Mining? Potential Applications
  • Database analysis and decision support
  • Market analysis and management
  • Risk analysis and management
  • Other Applications
  • Text mining (news group, email, documents) and
    Web analysis.
  • Intelligent query answering

3
Market Analysis and Management (1)
  • Where are the data sources for analysis?
  • Credit card transactions, customer complaint
    calls, proxy log, multimedia database, etc.
  • What to analyze?
  • Find clusters of customers who share the same
    characteristics: interest, income level, spending
    habits, etc.
  • Determine customer purchasing patterns over time
  • Associations/co-relations between product sales
  • Prediction based on the association information
  • Data mining can tell you which types of customers
    buy which products (clustering or classification)

4
Market Analysis and Management (2)
  • Identifying customer requirements
  • identifying the best products for different
    customers
  • use prediction to find what factors will attract
    new customers
  • Provides summary information
  • various multidimensional summary reports
  • statistical summary information

5
Other Applications
  • Sports
  • Internet Web mining
  • web site organization
  • proxy server prefetch
  • improve search engine performance
  • Multimedia database
  • Mobile database

6
Data Mining Functionalities (1)
  • Association
  • Multi-dimensional vs. single-dimensional
    association
  • age(X, "20..29") ∧ income(X, "20..29K") → buys(X,
    "PC") [support = 2%, confidence = 60%]
  • contains(T, "computer") → contains(T, "software")
    [1%, 75%]

7
Data Mining Functionalities (2)
  • Classification and Prediction
  • Finding models (functions) that describe and
    distinguish classes or concepts for future
    prediction
  • Presentation: decision tree, classification rules,
    neural network
  • Prediction: predict some unknown or missing
    numerical values
  • Cluster analysis
  • Class label is unknown: group data to form new
    classes
  • Clustering is based on the principle of maximizing
    the intra-class similarity and minimizing the
    inter-class similarity
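
Note: the clustering principle above can be illustrated with a minimal sketch. It uses 1-D k-means, which is one common clustering technique and not necessarily the one the slides have in mind. The income values, the starting centroids, and the function name kmeans_1d are all hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical customer incomes (in $K); no class labels are given.
incomes = [21, 23, 25, 60, 62, 65]
centroids, clusters = kmeans_1d(incomes, centroids=[20, 70])
print(clusters)  # two groups of mutually similar incomes
```

Values inside each resulting group are close to one another (high intra-class similarity) and far from the other group (low inter-class similarity).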

8
Data Mining Functionalities (3)
  • Outlier analysis
  • Outlier: a data object that does not comply with
    the general behavior of the data
  • It can be considered as noise or exception but is
    quite useful in rare events analysis
  • Sequential pattern mining, periodicity analysis
  • Privacy preserving data mining
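
Note: as a rough illustration of the outlier definition on this slide, the sketch below flags values that lie far from the mean. The z-score test, the threshold of 2, and the transaction amounts are assumptions chosen for the example; the slides do not prescribe a particular method.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one unusual value.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(zscore_outliers(amounts))
```

Whether the flagged value is noise to discard or a rare event worth investigating (for example, possible fraud) depends on the application.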

9
Are All the Discovered Patterns Interesting?
  • A data mining system/query may generate thousands
    of patterns; not all of them are interesting.
  • Interestingness measures: a pattern is
    interesting if it is easily understood by humans,
    potentially useful, novel, or validates some
    hypothesis that a user seeks to confirm
  • support, confidence

10
What Is Association Mining?
  • Association rule mining
  • Finding frequent patterns, associations,
    correlations, or causal structures among sets of
    items or objects in transaction databases,
    relational databases, and other information
    repositories.
  • Applications
  • Basket data analysis, clustering, classification,
    etc.
  • Examples.
  • Rule form: Body → Head [support, confidence].
  • buys(x, "diapers") → buys(x, "beers") [0.5%,
    60%]
  • major(x, "CS") ∧ takes(x, "DB") → grade(x, "A")
    [1%, 75%]

11
Association Rule Basic Concepts
  • Given (1) database of transactions, (2) each
    transaction is a list of items (purchased by a
    customer in a visit)
  • Find all rules that correlate the presence of
    one set of items with that of another set of
    items
  • E.g., 98% of people who purchase computers and
    printers also purchase scanners
  • Measures
  • support
  • confidence
  • Some terms
  • minimum support, minimum confidence (threshold)
  • k-itemset
  • frequent k-itemset
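
Note: the measures and terms on this slide can be made concrete with a small sketch. The transactions below are hypothetical toy data, and the brute-force enumeration is only meant to show the definitions; it is not an efficient mining algorithm.

```python
from itertools import combinations

# Hypothetical transactions: each is the set of items bought in one visit.
transactions = [
    {"computer", "printer", "scanner"},
    {"computer", "printer"},
    {"computer", "scanner"},
    {"printer", "scanner"},
    {"computer", "printer", "scanner"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(body, head, transactions):
    """support(body and head together) / support(body)."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)


def frequent_itemsets(transactions, k, min_support):
    """All k-itemsets whose support reaches the minimum support threshold."""
    items = sorted(set().union(*transactions))
    return [set(c) for c in combinations(items, k)
            if support(c, transactions) >= min_support]


# Rule: {computer, printer} -> {scanner}
print(support({"computer", "printer", "scanner"}, transactions))
print(confidence({"computer", "printer"}, {"scanner"}, transactions))
print(frequent_itemsets(transactions, k=2, min_support=0.6))
```

A rule is reported only if its support and confidence both reach the chosen minimum thresholds.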

12
Association Rule Mining: A Road Map
  • Boolean vs. quantitative associations (based on
    the types of values handled)
  • buys(x, "SQLServer") ∧ buys(x, "DMBook") →
    buys(x, "DBMiner") [0.2%, 60%]
  • age(x, "30..39") ∧ income(x, "42..48K") →
    buys(x, "PC") [1%, 75%]
  • Single-dimensional vs. multi-dimensional
    associations (see examples above)
  • Single level vs. multiple-level analysis
  • What brands of beers are associated with what
    brands of diapers?
  • Various extensions
  • Max-patterns
  • Cyclic rules

13
Classification vs. Prediction
  • Classification
  • predicts categorical class labels
  • classifies data (constructs a model) based on the
    training set and the values (class labels) in a
    classifying attribute and uses it in classifying
    new data
  • Prediction
  • models continuous-valued functions, i.e.,
    predicts unknown or missing values
  • Typical Applications
  • credit approval
  • target marketing

14
Classification: A Two-Step Process
  • Model construction: describing a set of
    predetermined classes
  • Each tuple/sample is assumed to belong to a
    predefined class, as determined by the class
    label attribute
  • The set of tuples used for model construction is
    the training set
  • The model is represented as classification rules,
    decision trees, or mathematical formulae
  • Model usage: classifying future or unknown
    objects
  • Estimate accuracy of the model
  • The known label of test sample is compared with
    the classified result from the model
  • Accuracy rate is the percentage of test set
    samples that are correctly classified by the
    model
  • Test set must be independent of the training set;
    otherwise the accuracy estimate is overly
    optimistic and over-fitting goes undetected
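
Note: the accuracy estimate described above can be sketched as follows. The classifier approve_credit, its income threshold, and the labeled test samples are hypothetical stand-ins for a model built in the construction step.

```python
def accuracy_rate(classify, test_set):
    """Percentage of test samples whose known label matches
    the label produced by the model."""
    correct = sum(1 for sample, known_label in test_set
                  if classify(sample) == known_label)
    return 100.0 * correct / len(test_set)


def approve_credit(income_k):
    """Hypothetical model: approve when income is at least $40K."""
    return "approve" if income_k >= 40 else "reject"


# Held-out test samples (income in $K, known label); none of these
# are assumed to have been used for model construction.
test_set = [(25, "reject"), (55, "approve"), (45, "approve"), (38, "approve")]
print(accuracy_rate(approve_credit, test_set))
```

Here the model misclassifies one of the four test samples, so the estimated accuracy rate is 75 percent.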

15
Classification Process (1): Model Construction
Classification Algorithms
IF rank = "professor" OR years > 6 THEN tenured = "yes"
16
Classification Process (2): Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
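
Note: reading the rule from the model-construction slide as "rank is professor OR years is greater than 6", the prediction step can be sketched as below. The function name tenured and the case-insensitive comparison of the rank are assumptions made for the example.

```python
def tenured(rank, years):
    """Apply the learned rule:
    IF rank = professor OR years > 6 THEN tenured = yes."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"


# The unseen sample from the slide: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
print(name, tenured(rank, years))
```

Under this reading, Jeff is classified as tenured because the rank condition holds, even though the years condition does not.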
17
Classification by Decision Tree Induction
  • Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class
    distribution
  • Decision tree generation consists of two phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Use of decision tree: classifying an unknown
    sample
  • Test the attribute values of the sample against
    the decision tree
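
Note: a minimal sketch of classifying a sample with a decision tree. The tree below is a hypothetical example written by hand, not the output of a tree-construction algorithm and not a reconstruction of any figure in this deck. Internal nodes are (attribute, branches) pairs and leaves are class labels.

```python
# Hypothetical tree: each internal node tests one attribute,
# each branch is one outcome of the test, each leaf is a class label.
tree = ("age", {
    "<30": ("student", {"yes": "yes", "no": "no"}),
    "30..40": "yes",
    ">40": ("credit_rating", {"fair": "yes", "excellent": "no"}),
})


def classify(node, sample):
    """Walk from the root to a leaf, following the branch that
    matches the sample's value for each tested attribute."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[sample[attribute]]
    return node


sample = {"age": "<30", "student": "yes", "credit_rating": "fair"}
print(classify(tree, sample))
```

Only the attributes along the path from the root to the leaf are tested; here that means age and then student.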

18
Training Dataset
This follows an example from Quinlan's ID3
19
Output: A Decision Tree for buys_computer
[Figure: decision tree. Root test: age? (branch labels: <30, overcast, 30..40, >40). Internal tests: student? (yes / no) and credit rating? (fair / excellent). Leaves: yes / no.]