Chapter I: Introduction, MIS 542, 2014/2015 Fall

Slides: 95
Provided by: Berta153
1
Chapter I: Introduction, MIS 542, 2014/2015 Fall
2
Chapter 1. Introduction
  • Motivation: Why data mining?
  • Methodology of Knowledge Discovery in Databases
  • Data mining functionalities
  • Are all the patterns interesting?
  • Business applications of data mining

3
Motivation: Necessity is the Mother of Invention
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses and
    other information repositories
  • Need to convert such data into knowledge and
    information
  • Applications
  • Business management
  • Production control
  • Market analysis
  • Engineering design
  • Science exploration

4
Evolution of Database Technology (1)
  • Data collection, database creation
  • Data management
  • data storage and retrieval
  • database transaction processing
  • Data analysis and understanding
  • Data mining and data warehousing

5
Evolution of Database Technology (2)
  • 1960s
  • Data collection, database creation, IMS and
    network DBMS
  • 1970s
  • Relational data model, relational DBMS
    implementation
  • 1980s
  • RDBMS, advanced data models (extended-relational,
    OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific,
    engineering, etc.)
  • 1990s
  • Data mining, data warehousing, multimedia
    databases, and Web databases
  • 2000s
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global
    information systems

6
  • The Explosive Growth of Data: from terabytes to
    petabytes
  • Data collection and data availability
  • Automated data collection tools, database
    systems, Web, computerized society
  • Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, ...
  • Science: remote sensing, bioinformatics,
    scientific simulation, ...
  • Society and everyone: news, digital cameras,
    YouTube
  • We are drowning in data, but starving for
    knowledge!
  • Necessity is the mother of invention: data
    mining is the automated analysis of massive data sets

7
Developments in computer hardware
  • Powerful and affordable computers
  • Data collection equipment
  • Storage media
  • Communication and networking

8
Data Warehouse
  • Data cleaning
  • Data integration
  • OLAP: On-Line Analytical Processing
  • summarization
  • consolidation
  • aggregation
  • view information from different angles
  • but additional data analysis tools are needed for
  • classification
  • clustering
  • characterization of data changing over time

9
Data-rich, information-poor situation
  • Abundance of data
  • need for powerful data analysis tools
  • data tombs - data archives
  • seldom visited
  • Important decisions are made
  • not on the information rich data stored in
    databases
  • but on a decision maker's intuition
  • no tool to extract knowledge embedded in vast
    amounts of data
  • Expert system technology
  • domain experts to input knowledge
  • time consuming and costly

10
What Is Data Mining?
  • Data mining (knowledge discovery in databases)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    information or patterns from data in large
    databases
  • Alternative names and their inside stories
  • Data mining: a misnomer?
  • Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data
    archeology, data dredging, information
    harvesting, business intelligence, etc.
  • What is not data mining?
  • Query processing
  • Expert systems or small ML/statistical programs

11
Data Mining vs. Data Query
  • Data query, e.g.
  • A list of all customers who use a credit card to
    buy a PC
  • A list of all MIS students who have a GPA of 3.5 or
    higher and have studied four or fewer semesters
  • Data mining problems, e.g.
  • What is the likelihood of a customer purchasing
    a PC with a credit card?
  • Given the characteristics of an MIS student, predict
    her GPA in the coming term
  • What are the characteristics of MIS undergrad
    students?

12
Chapter 1. Introduction
  • Motivation: Why data mining?
  • Methodology of Knowledge Discovery in Databases
  • Data mining functionalities
  • Are all the patterns interesting?
  • Business applications of data mining

13
Why Data Mining?
  • Four questions to be answered
  • Can the problem be clearly defined?
  • Does potentially meaningful data exist?
  • Does the data contain hidden knowledge, or is it useful
    only for reporting purposes?
  • Will the cost of processing the data be less
    than the likely increase in profit from the
    knowledge gained by applying a data mining
    project?

14
Steps of a KDD Process (1)
  • 1. Goal identification
  • Define problem
  • relevant prior knowledge and goals of application
  • 2. Creating a target data set: data selection
  • 3. Data preprocessing (may take 60-80% of
    effort!)
  • removal of noise or outliers
  • strategies for handling missing data fields
  • accounting for time sequence information
  • 4. Data reduction and transformation
  • Find useful features, dimensionality/variable
    reduction, invariant representation.

15
Steps of a KDD Process (2)
  • 5. Data Mining
  • Choosing functions of data mining
  • summarization, classification, regression,
    association, clustering.
  • Choosing the mining algorithm(s)
  • which models or parameters
  • Search for patterns of interest
  • 6. Presentation and Evaluation
  • visualization, transformation, removing redundant
    patterns, etc.
  • 7. Taking action
  • incorporating into the performance system
  • documenting
  • reporting to interested parties

16
An example: Customer Segmentation
  • 1. Marketing department wants to perform a
    segmentation study on the customers of AE Company
  • 2. Decide on relevant variables from a data
    warehouse on customers, sales, promotions
  • Customers: name, ID, income, age, education, ...
  • Sales: history of sales
  • Promotion: promotion types, durations, ...
  • 3. Handle missing income, addresses, ...
  • determine outliers, if any
  • 4. Generate new index variables representing
    wealth of customers
  • Wealth = a*income + b*houses + c*cars + ...
  • Make necessary transformations (z-scores) so that
    some data mining algorithms work more efficiently
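
Steps 3 and 4 of the example can be sketched in a few lines of Python. This is only an illustration: the customer records, the weights a, b, c and the function names are assumptions made here, and filling a missing income with the median is just one possible strategy.

```python
from statistics import mean, median, pstdev

# Hypothetical customer records; None marks a missing income.
customers = [
    {"id": 1, "income": 800.0, "houses": 0, "cars": 1},
    {"id": 2, "income": None, "houses": 1, "cars": 0},
    {"id": 3, "income": 1200.0, "houses": 1, "cars": 1},
    {"id": 4, "income": 2500.0, "houses": 2, "cars": 2},
]

def fill_missing_income(rows):
    """Step 3: replace missing incomes with the median of the known ones."""
    known = [r["income"] for r in rows if r["income"] is not None]
    fallback = median(known)
    return [dict(r, income=fallback if r["income"] is None else r["income"])
            for r in rows]

def wealth_index(row, a=1.0, b=500.0, c=100.0):
    """Step 4: Wealth = a*income + b*houses + c*cars (weights are assumptions)."""
    return a * row["income"] + b * row["houses"] + c * row["cars"]

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

cleaned = fill_missing_income(customers)
wealth = [wealth_index(r) for r in cleaned]
wealth_z = z_scores(wealth)
```

Standardizing matters mostly for distance-based algorithms such as k-means, where a variable measured in large units would otherwise dominate the distance.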

17
Example: Customer Segmentation (cont.)
  • 5.a Choose clustering as the data mining
    functionality, as it is the natural one for a
    segmentation study, so as to find groups of
    customers with similar characteristics
  • 5.b Choose a clustering algorithm
  • K-means or k-medoids or any suitable one for that
    problem
  • 5.c Apply the algorithm
  • Find clusters or segments
  • 6. Make reverse transformations, visualize the
    customer segments
  • 7. Present the results in the form of a report to
    the marketing department
  • Implement the segmentation as part of a DSS so
    that it can be applied repeatedly at certain
    intervals as new customers arrive
  • Develop marketing strategies for each segment

18
Data Mining: A KDD Process
  • Data mining: the core of the knowledge discovery
    process.

[Figure: KDD pipeline, bottom to top: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
19
Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
  • Decision Making
  • Data Presentation: Visualization Techniques
  • Data Mining: Information Discovery
  • Data Exploration: Statistical Summary, Querying, and Reporting
  • Data Preprocessing/Integration, Data Warehouses
  • Data Sources: Paper, Files, Web documents, Scientific
    experiments, Database Systems
Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA
August 4, 2019
19
Data Mining Concepts and Techniques
20
Architecture of a Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; data cleaning, data integration and filtering; Databases and Data Warehouse. A Knowledge-base is shown alongside the layers.]
21
Architecture of a Typical Data Mining System
  • Database, data warehouse
  • Database or data warehouse server
  • Knowledge base
  • concept hierarchies
  • user beliefs
  • assess pattern interestingness
  • other thresholds
  • Data mining engine
  • functional modules
  • characterization, association, classification,
    cluster analysis, evolution and deviation
    analysis
  • Pattern evaluation module
  • Graphical user interface

22
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science and Other Disciplines]
23
Why Confluence of Multiple Disciplines?
  • Tremendous amount of data
  • Algorithms must be highly scalable to handle data
    volumes such as terabytes
  • High-dimensionality of data
  • Micro-array may have tens of thousands of
    dimensions
  • High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks and
    multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text and Web
    data
  • Software programs, scientific simulations
  • New and sophisticated applications

24
Efficient and Scalable Techniques
  • For an algorithm to be efficient and scalable,
  • its running time should be predictable and
    acceptable
  • How?
  • Parallel and distributed algorithms
  • Sampling from databases

25
Chapter 1. Introduction
  • Motivation: Why data mining?
  • Methodology of Knowledge Discovery in Databases
  • Data mining functionalities
  • Are all the patterns interesting?
  • Business applications of data mining

26
Two Styles of Data Mining
  • Descriptive data mining
  • characterize the general properties of the data
    in the database
  • finds patterns in data and
  • the user determines which ones are important
  • Predictive data mining
  • perform inference on the current data to make
    predictions
  • we know what to predict
  • Not mutually exclusive
  • used together
  • Descriptive -> predictive
  • E.g., customer segmentation (descriptive) by
    clustering
  • followed by a risk assignment model (predictive)
    by ANN

27
Supervised vs. Unsupervised Learning
  • Supervised learning (classification, prediction)
  • Supervision: The training data (observations,
    measurements, etc.) are accompanied by labels
    indicating the class of the observations
  • New data is classified based on the training set
  • Unsupervised learning (summarization,
    association, clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc.,
    the aim is to establish the existence of
    classes or clusters in the data

28
Descriptive Data Mining (1)
  • Discovering new patterns inside the data
  • Used during the data exploration steps
  • Typical questions answered by descriptive data
    mining
  • what is in the data
  • what does it look like
  • are there any unusual patterns
  • what does the data suggest for customer
    segmentation
  • users may have no idea
  • which kind of patterns may be interesting

29
Descriptive Data Mining (2)
  • patterns at various granularities
  • geography
  • country - city - region - street
  • student
  • university - faculty - department - minor
  • Functionalities of descriptive data mining
  • Clustering
  • Ex: customer segmentation
  • summarization
  • visualization
  • Association
  • Ex: market basket analysis

30
A model is a black box
  X: vector of independent variables or inputs
  Y = f(X): f is an unknown function
  Y: dependent variable(s) or output; a single variable or a
  vector
[Figure: inputs X1, X2, ... -> Model -> output Y]
The user does not care what the model is doing; it
is a black box. The user is interested in the accuracy of its
predictions.
31
Predictive Data Mining (1)
  • Using known examples the model is trained
  • the unknown function is learned from data
  • the more data with known outcomes is available
  • the better the predictive power of the model
  • Used to predict outcomes whose inputs are known
    but the output values are not realized yet
  • Never 100% accurate

32
Predictive Data Mining (2)
  • The performance of a model on past data, where it
    predicts outcomes that are already known, is not
    what matters
  • Its performance on unknown data is much more
    important

33
Typical questions answered by predictive models
  • Who is likely to respond to our next offer
  • based on history of previous marketing campaigns
  • Which customers are likely to leave in the next
    six months
  • What transactions are likely to be fraudulent
  • based on known examples of fraud
  • What is the total amount a customer will spend
    in the next month

34
Data Mining Functionalities (1)
  • Concept description: Characterization and
    discrimination
  • Generalize, summarize, and contrast data
    characteristics, e.g., big spenders vs. budget
    spenders
  • Association (correlation and causality)
  • Multi-dimensional vs. single-dimensional
    association
  • age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X,
    "PC") [support = 2%, confidence = 60%]
  • contains(T, "computer") ⇒ contains(T, "software")
    [support = 1%, confidence = 75%]

35
Data Mining Functionalities (2)
  • Classification and Numerical-Prediction
  • Finding models (functions) that describe and
    distinguish classes or concepts for future
    prediction
  • E.g., classify people as healthy or sick, or
    classify transactions as fraudulent or not
  • Methods: decision tree, classification rule,
    neural network
  • Prediction: Predict some unknown or missing
    numerical values
  • Cluster analysis
  • Class label is unknown: group data to form new
    classes, e.g., cluster customers of a retail
    company to learn about characteristics of
    different segments
  • Clustering is based on the principle of maximizing the
    intra-class similarity and minimizing the
    inter-class similarity

36
Data Mining Functionalities (3)
  • Outlier analysis
  • Outlier: a data object that does not comply with
    the general behavior of the data
  • It can be considered as noise or exception but is
    quite useful in fraud detection, rare events
    analysis
  • Trend and evolution analysis
  • Trend and deviation: regression analysis
  • Sequential pattern mining: click stream analysis
  • Similarity-based analysis
  • Other pattern-directed or statistical analyses

37
Concept Description
  • Characterization
  • Discrimination
  • Data can be associated with classes or concepts
  • classes of items for sale
  • computers, printers
  • concepts of customers
  • bigSpenders
  • budgetSpenders

38
Data Characterization
  • Summarizing the data of the class under study
    (target class)
  • Methods
  • SQL queries
  • OLAP roll-up operation
  • user-controlled data summarization
  • along a specified dimension
  • attribute-oriented induction
  • without step-by-step user interaction
  • The output of characterization
  • pie charts, bar charts, curves, multidimensional
    data cubes, or cross tabs
  • in rule form, as characteristic rules

39
Characterization example
  • Description summarizing the characteristics of
    customers who spend more than 1000 a year at
    AllElectronics
  • age, employment, income
  • drill down on any dimension
  • on occupation: view these according to their type
    of employment

40
Data Discrimination
  • Comparing the target class with one or a set of
    comparative classes (contrasting classes)
  • these classes can be specified by the user
  • through database queries
  • methods and output
  • similar to those used for characterization
  • include comparative measures to distinguish
    between the target and contrasting classes

41
Discrimination examples
  • Example 1: Compare the general features of
    software products
  • whose sales increased by 10% in the last year
    (target class)
  • whose sales decreased by at least 30% during the
    same period (contrasting class)
  • Example 2: Compare two groups of AE customers
  • I) who shop for computer products regularly
    (target class)
  • more than two times a month
  • II) who rarely shop for such products
    (contrasting class)
  • less than three times a year
  • The resulting description
  • 80% of group I customers
  • university education
  • ages 20-40
  • 60% of group II customers
  • seniors or young
  • no university degree

42
Multidimensional Data
  • sales according to region, month and product type

Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product:  Industry > Category > Product
  Location: Region > Country > City > Office
  Time:     Year > Quarter > Month or Week > Day
[Figure: data cube with axes Region, Product and Month]
43
Association Analysis
  • Discovery of association rules showing
    attribute-value conditions that occur frequently
    together in a given set of data
  • widely used
  • market basket
  • transaction data analysis
  • more formally
  • X ⇒ Y, that is
  • A1 ∧ A2 ∧ ... ∧ Ak ⇒ B1 ∧ B2 ∧ ... ∧ Bl
  • Ai, Bj are attribute-value pairs or predicates

44
Example association analysis
  • From the AllElectronics (AE) database
  • age(X, "20..29") ∧ income(X, "1,000..2,000") ⇒ buys(X, "CD player")
  • (support = 2%,
  • confidence = 60%)
  • X is a variable representing a customer
  • 2% of the AE customers are
  • between 20 and 29 years of age,
  • have incomes ranging from 1 to 2 billion TL, and
  • buy a CD player
  • with 60% probability, customers in that age
    and income group will buy a CD player
  • a multidimensional association rule
  • contains more than one attribute or predicate

45
Market basket analysis
  • customers' buying behaviour is investigated
  • Based only on the transaction data
  • no information about customer properties (age,
    income)
  • Managers
  • are interested in which products or product
    groups are sold together

46
Transactional Database
Transaction ID | Item List
10001          | Computer, CD, Printer
10002          | Plotter, Monitor, Mouse
10003          | Computer, DVD Player
10004          | Printer
10005          | Plotter, UPS, Modem
47
Example basket analysis rule
  • buy(computer) ⇒ buy(printer)
  • (support = 1%, confidence = 60%)
  • 1% of all transactions contain both
  • computer and printer
  • if a transaction contains computer,
  • there is a 60% chance that it contains printer as
    well
  • a single dimensional association rule
  • contains a single predicate
  • an association rule is interesting if
  • its support exceeds a minimum threshold and
  • its confidence exceeds a min threshold
  • These min values are set by specialists
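
Support and confidence can be computed directly from a transaction table. The sketch below uses the five transactions from the Transactional Database slide; the function names and the threshold parameters are assumptions of this illustration. On these five rows the rule buy(computer) ⇒ buy(printer) has support 20% and confidence 50%, which differs from the 1% and 60% quoted above because those figures refer to a much larger database.

```python
# Transactions transcribed from the "Transactional Database" slide.
transactions = {
    10001: {"computer", "cd", "printer"},
    10002: {"plotter", "monitor", "mouse"},
    10003: {"computer", "dvd player"},
    10004: {"printer"},
    10005: {"plotter", "ups", "modem"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

def confidence(antecedent, consequent, db):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, db) / support(antecedent, db)

def is_interesting(antecedent, consequent, db, min_support, min_confidence):
    """Keep a rule only if it reaches both user-set thresholds."""
    return (support(antecedent | consequent, db) >= min_support
            and confidence(antecedent, consequent, db) >= min_confidence)

rule_support = support({"computer", "printer"}, transactions)
rule_confidence = confidence({"computer"}, {"printer"}, transactions)
```

With thresholds of, say, 10% support and 40% confidence the rule would be kept; raising the confidence threshold to 60% would drop it.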

48
Classification
  • Learning is supervised
  • Dependent variable is categorical
  • Build a model able to assign new instances to one
    of a set of well-defined classes

49
Typical Classification Problems
  • Given characteristics of individuals,
    differentiate those who have suffered a heart
    attack from those who have not
  • Determine if a credit card purchase is fraudulent
  • Classify a car loan applicant as a good or a poor
    credit risk

50
Methods of Classification
  • Decision Trees
  • Artificial Neural Networks
  • Bayesian Classification
  • Naïve
  • Belief Networks
  • k-nearest neighbor
  • Regression
  • Logistic (logit), probit
  • Predicts probability of each class
  • when the dependent variable is categorical
  • good customer / bad customer, or employed / unemployed

51
Steps of classification process
  • (1) Train the model
  • using a training set
  • data objects whose class labels are known
  • (2) Test the model
  • on a test sample
  • whose class labels are known but not used for
    training the model
  • (3) Use the model for classification
  • on new data whose class labels are unknown

52
An example - classification
Historical data (training set): each customer's type is known, i.e. each customer has a label
Cust ID | Age | Income | Education   | Type
1       | 35  | 800    | undergrad   | risky
2       | 26  | 600    | high school | risky
3       | 48  | 1200   | grad        | normal
8       | 52  | 2500   | undergrad   | good
44      | 29  | 1700   | high school | good

Testing set: labels are also known but are not used in training the model
Cust ID | Age | Income | Education | Type
17      | 43  | 550    | Ph.D.     | risky
27      | 68  | 1650   | grad      | normal

New customers: their type has to be estimated
Cust ID | Age | Income | Education | Type
11      | 36  | 850    | undergrad | ?
27      | 28  | 1650   | grad      | ?
  • Each new customer has to be classified as risky,
    normal or good

53
Original data
54
Historical data: each customer's type is known; each
customer has a label
  • Testing set: labels are also known but are not
    used in training the model
  • New customers: their type has to be estimated
  • Each new customer has to be classified as buyer
    or non-buyer

55
An example classification cont.
  • Based on historical data develop a classification
    model
  • Decision tree, neural network, regression ...
  • Test the performance of the model on a portion of
    the historical data
  • If the accuracy of the model is satisfactory
  • Use the model on the new customers
  • 11 and 27 to assign a type to these new
    customers
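
The train, test and use steps can be sketched with the customer tables from the example slide. A 1-nearest-neighbour classifier on (age, income) is used here only because it fits in a few lines; that choice, the function names and the lower-cased labels are assumptions of this sketch, not the model used in the course.

```python
import math

# (age, income) -> type, transcribed from the example tables.
training = [((35, 800), "risky"), ((26, 600), "risky"), ((48, 1200), "normal"),
            ((52, 2500), "good"), ((29, 1700), "good")]
testing = [((43, 550), "risky"), ((68, 1650), "normal")]
new_customers = {11: (36, 850), 27: (28, 1650)}

def predict(features, examples):
    """Step 1 is implicit for nearest neighbour: the 'model' is the training
    set itself. Return the label of the closest training example."""
    nearest = min(examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

def accuracy(test_set, examples):
    """Share of test objects whose predicted label equals the known label."""
    correct = sum(1 for x, label in test_set if predict(x, examples) == label)
    return correct / len(test_set)

# Step 2: test on labelled data that was not used for training.
test_accuracy = accuracy(testing, training)

# Step 3: use the model on new customers whose labels are unknown.
assigned = {cid: predict(x, training) for cid, x in new_customers.items()}
```

With only two test objects the accuracy estimate is very coarse (one of the two is classified correctly here), and income dominates the distance because the features are not standardized.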

56
Example: AE customers
[Figure: scatter plot of age against yearly income; points are labelled good or risky]
57
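Example: AE customers
[Figure: scatter plot of age against yearly income with good and risky points and a new customer marked "?"]
Assign the new customer, whose type is unknown, to
either the good or the risky class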
58
Solution
[Figure: the age / yearly income plane split into a good region and a risky region; the age threshold is 35]
Rule: IF yearly income > 1000 AND age > 35
THEN good ELSE risky
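
Written as code, the rule is a single conditional. The function name and the sample applicants below are made up for illustration; only the two thresholds come from the slide.

```python
def classify(yearly_income, age):
    """Rule from the slide: income > 1000 and age > 35 -> good, else risky."""
    if yearly_income > 1000 and age > 35:
        return "good"
    return "risky"

# Hypothetical applicants (yearly income, age) used only to exercise the rule.
applicants = [(1200, 48), (850, 36), (1650, 28)]
labels = [classify(income, age) for income, age in applicants]
```

The second applicant fails the income condition and the third fails the age condition, so both fall into the ELSE branch.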
59
Credit Card Promotion Policy
  • Credit card companies
  • Promotional offerings with their monthly credit
    card billing
  • Offers provide the opportunity to purchase items
    such as magazines,
  • A data mining study
  • Predict individual behaviour
  • What is the likelihood of an individual
    taking advantage of promotions?
  • based on individual characteristics, credit
    history, ...
  • Expected reduction in postage, paper and
    processing costs for the credit card company

60
Credit Card Promotion Database
Income Range | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Gender | Age | Credit Card Insurance
40-50 K      | Yes                | No              | No                       | Male   | 45  | No
30-40 K      | Yes                | Yes             | Yes                      | Female | 40  | No
40-50 K      | No                 | No              | No                       | Male   | 42  | No
30-40 K      | Yes                | Yes             | Yes                      | Male   | 43  | Yes
50-60 K      | Yes                | No              | Yes                      | Female | 38  | No
20-30 K      | No                 | No              | No                       | Female | 55  | No
30-40 K      | Yes                | No              | Yes                      | Male   | 35  | Yes
20-30 K      | No                 | Yes             | No                       | Male   | 27  | No
30-40 K      | Yes                | No              | No                       | Male   | 43  | No
30-40 K      | Yes                | Yes             | Yes                      | Female | 41  | No
40-50 K      | No                 | Yes             | Yes                      | Female | 43  | No
20-30 K      | No                 | Yes             | Yes                      | Male   | 29  | No
50-60 K      | Yes                | Yes             | Yes                      | Female | 39  | No
40-50 K      | No                 | Yes             | No                       | Male   | 55  | No
20-30 K      | No                 | No              | Yes                      | Female | 19  | Yes
61
Decision Trees for Credit Card Insurance Database
Dependent variable: Life Insurance Promotion
  • The critical value of 43 is determined by the
    algorithm

Age
  > 43: Decision No (N = 3, Y = 0)
  <= 43: Gender
    Female: Decision Yes (N = 0, Y = 6)
    Male: Credit Card Insurance
      Yes: Decision Yes (Yes = 2, No = 0)
      No: Decision No (N = 4, Y = 1)

A production rule from the tree:
IF (Age <= 43) AND (Sex = Male) AND (Credit Card Insurance = No)
THEN Life Insurance Promotion = No
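
The leaf counts of the tree can be checked against the Credit Card Promotion Database. The sketch below assumes the left branch means age <= 43, which is the reading that reproduces the counts for the first three leaves; the function and variable names are made up for this illustration. Tallying the 15 rows as transcribed gives 3 No and 1 Yes in the last leaf (male, no credit card insurance), slightly different from the "N 4, Y 1" label on the slide.

```python
from collections import Counter

# Rows transcribed from the Credit Card Promotion Database slide:
# (life_insurance_promotion, gender, age, credit_card_insurance)
rows = [
    ("No", "Male", 45, "No"), ("Yes", "Female", 40, "No"),
    ("No", "Male", 42, "No"), ("Yes", "Male", 43, "Yes"),
    ("Yes", "Female", 38, "No"), ("No", "Female", 55, "No"),
    ("Yes", "Male", 35, "Yes"), ("No", "Male", 27, "No"),
    ("No", "Male", 43, "No"), ("Yes", "Female", 41, "No"),
    ("Yes", "Female", 43, "No"), ("Yes", "Male", 29, "No"),
    ("Yes", "Female", 39, "No"), ("No", "Male", 55, "No"),
    ("Yes", "Female", 19, "Yes"),
]

def leaf(age, gender, cc_insurance):
    """Name of the leaf a customer reaches in the tree."""
    if age > 43:
        return "age > 43"
    if gender == "Female":
        return "age <= 43, female"
    return f"age <= 43, male, card insurance {cc_insurance.lower()}"

def predict(age, gender, cc_insurance):
    """Decision at each leaf, as read from the slide."""
    if age > 43:
        return "No"
    if gender == "Female":
        return "Yes"
    return "Yes" if cc_insurance == "Yes" else "No"

# Count Yes/No outcomes of the dependent variable in each leaf.
tallies = {}
for life, gender, age, cc in rows:
    tallies.setdefault(leaf(age, gender, cc), Counter())[life] += 1

# Number of training rows the tree gets wrong.
errors = sum(1 for life, gender, age, cc in rows
             if predict(age, gender, cc) != life)
```

On these rows the tree misclassifies one customer out of fifteen, which illustrates why a leaf can carry a decision even when it is not perfectly pure.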
62
Artificial Neural Networks
  • Set of interconnected nodes designed to imitate
    the functioning of the human brain
  • Feed-forward network
  • Supervised learner model

63
For the promotion example
  • Encode all variables
  • Assign a numerical value even for qualitative
    variables such as sex
  • Say X1 represents gender
  • When
  • Male: X1 = 1
  • Female: X1 = 0

64
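[Figure: feed-forward network with an input layer, a hidden layer and an output layer]
Inputs: X1 = 1, X2 = 0, X3 = 0.5, X4 = -1
Example weights: W1,5 = 0.014 (node 1 to node 5), W5,9 = -0.17 (node 5 to node 9)
(1 - 0.78)^2 is the squared error: 1 is the actual value of O9
for a particular data object, 0.78 is the predicted
value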
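
A forward pass through such a network and the squared error can be sketched as follows. Only the inputs X1..X4 and the two weights 0.014 and -0.17 come from the slide; the sigmoid activation, the number of hidden nodes and every other weight are assumptions made up for this illustration, so the computed output is not expected to equal the slide's 0.78.

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """One pass through a network with a single hidden layer.

    hidden_weights[j][i] is the weight from input i to hidden node j;
    output_weights[j] is the weight from hidden node j to the output node.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

def squared_error(actual, predicted):
    return (actual - predicted) ** 2

x = [1.0, 0.0, 0.5, -1.0]            # X1..X4 from the slide
# Only 0.014 (W1,5) and -0.17 (W5,9) come from the slide; the rest are made up.
hidden_weights = [[0.014, 0.20, -0.30, 0.10],
                  [0.05, -0.10, 0.40, 0.25]]
output_weights = [-0.17, 0.30]

o9 = forward(x, hidden_weights, output_weights)
slide_error = squared_error(1.0, 0.78)   # the slide's (1 - 0.78)^2
```

Training adjusts the weights so that errors of this kind shrink over the training set.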
65
Weights updating
  • Weights between nodes are adjusted so as to
    reduce error
  • Details of the training process for neural
    networks are not important for the time being

66
Numerical-Prediction
  • Similar to classification
  • Output is a continuous variable
  • Estimation: current value
  • Prediction: future outcome rather than current
    behavior

67
Typical Numerical Prediction Problems
  • Estimate the salary of an individual who owns a
    sports car
  • Predict next week's closing price for the IMKB100
    index
  • Forecast the next day's temperature

68
Numerical Prediction methods
  • Artificial Neural networks
  • linear regression
  • $Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + \dots + a_k X_{k,i} + u_i$
  • non-linear regression
  • $Y_i = f(X_{1,i}, X_{2,i}, \dots, X_{k,i}; a_1, a_2, \dots, a_k, u_i)$
  • generalized linear regression
  • logistic
  • logit,probit
  • poisson regression
  • for count variables
  • Regression Trees
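
For the simplest case of linear regression, one input variable, the coefficients can be estimated by ordinary least squares in a few lines. The data below are made up so that they lie exactly on Y = 1 + 2X, which makes the result easy to check; the function names are assumptions of this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a0 + a1*X with a single input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx                 # slope
    a0 = mean_y - a1 * mean_x      # intercept
    return a0, a1

def predict(a0, a1, x):
    """Predicted value of the continuous output for input x."""
    return a0 + a1 * x

# Made-up data that lie exactly on Y = 1 + 2*X.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_line(xs, ys)
```

With real data the points scatter around the line and the error term u_i absorbs the difference between the observed and fitted values.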

69
Example: Prediction and Classification
  • Classification is used to classify customers
    applying for credit cards
  • known class labels: risky, reliable
  • when a new customer applies, looking at her
    characteristics
  • income, age, education, wealth, region, ...
  • the customer's class is predicted
  • Prediction: The monthly expense of a new customer
    (a real continuous variable) is predicted based
    on personal information
  • independent variables
  • income, education, wealth, profession, ...
  • Some are numeric, some categorical

70
Cluster Analysis
  • Class label is unknown: group data to form new
    classes,
  • assign class labels to each data object
  • the labels are unknown beforehand and are
    generated by the clustering model
  • e.g., cluster customers to find customer
    segments
  • Clustering is based on the principle of maximizing the
    intra-class similarity and minimizing the
    inter-class similarity
  • Objects within a cluster have high similarity in
    comparison to one another
  • but are very dissimilar to objects in other
    clusters
  • there may be hierarchy of classes

71
Example Clustering
  • Can be performed on AE customer data
  • to identify homogeneous subpopulations of
    customers
  • represent individual target groups for marketing

72
[Figure: scatter plots of the data before clustering and after clustering]
73

[Figure: scatter plot of distance to store against income; three groups of points labelled Type 1, Type 2 and Type 3]
Clustering according to income and distance to
store: three clusters of data points are evident
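
A minimal k-means sketch shows how such clusters can be found. The points are made-up, already standardized (income, distance to store) pairs, the initial centroids are picked by hand, and the function name is an assumption; production work would normally rely on a library implementation with random restarts.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign each point to the nearest centroid, then
    move each centroid to the mean of its points, until nothing changes."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Made-up, already standardized (income, distance-to-store) pairs.
points = [(-1.2, 1.0), (-1.0, 1.2), (-1.1, 0.9),
          (0.0, -1.1), (0.1, -0.9), (-0.1, -1.0),
          (1.2, 0.8), (1.0, 1.0), (1.1, 1.1)]
# Initial centroids are picked by hand here; real runs usually pick them randomly.
labels, centers = kmeans(points, centroids=[points[0], points[3], points[6]])
```

Each point ends up with the index of its cluster, and the class labels are generated by the algorithm rather than given in advance.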
74
Outlier Analysis
  • Outlier: a data object that does not comply with
    the general behavior of the data
  • It can be considered as noise or exception but is
    quite useful in fraud detection, rare events
    analysis
  • Detected using
  • statistical tests
  • distance measures
  • visually inspecting the data
  • Examples

75
Reasons for outliers
  • Measurement errors
  • coding errors
  • age is entered as 999
  • nature of data
  • the salary of the general manager is much higher
    than that of the other employees
  • in a crisis the interest rate was in the order of
    1000s
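
One simple statistical check for outliers is the interquartile-range rule (Tukey's fences). It is used here only as an illustration of a statistical test, not as the method prescribed in the course; the ages are made up and include one coding error of the kind mentioned above.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Made-up ages with one coding error (999).
ages = [23, 35, 41, 29, 52, 38, 999, 47, 31, 44]
suspects = iqr_outliers(ages)
```

A flagged value still has to be interpreted: 999 is a coding error to be corrected, whereas a general manager's salary is a genuine value that reflects the nature of the data.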

76
Evolution Analysis
  • Describes and models regularities or trends for
    objects whose behavior changes over time
  • Distinct features include
  • Trend and deviation: time-series data analysis
  • Sequential pattern mining, periodicity analysis
  • Similarity-based analysis
  • Example
  • Stock market predictions: future stock prices
  • for overall stock indexes or individual company
    stocks

77
Sequential Pattern Analysis
  • Determine sequential patterns in data
  • Based on time sequence of actions
  • Similar to associations
  • Relationship is based on time
  • Example 1: buy a CD player today, buy CDs within one
    week
  • Example 2: In what sequence are the web pages of an
    e-business company accessed?
  • 70 percent of visitors follow
  • A -> B -> C or A -> D -> B -> C or A -> E -> B -> C
  • The site owner then decides to add a link directly from
    page A to page C

78
Chapter 1. Introduction
  • Motivation: Why data mining?
  • Methodology of Knowledge Discovery in Databases
  • Data mining functionalities
  • Are all the patterns interesting?
  • Business applications of data mining

79
Are All the Discovered Patterns Interesting?
  • A data mining system/query may generate thousands
    of patterns; not all of them are interesting.
  • Are all patterns interesting?
  • Typically not: only a small fraction of patterns
    are interesting to any given user
  • Interestingness measures: A pattern is
    interesting if
  • it is easily understood by humans,
  • valid on new or test data with some degree of
    certainty,
  • potentially useful,
  • novel, or
  • validates some hypothesis that a user seeks to
    confirm

80
Objective vs. subjective interestingness measures
  • Objective: based on statistics and structures of
    patterns, e.g.,
  • support,
  • for X ⇒ Y: P(X ∪ Y), the probability that a
    transaction contains both X and Y
  • confidence, the degree of certainty of the detected
    association
  • P(Y | X), the conditional probability that a
    transaction containing X also contains Y
  • thresholds, controlled by the user
  • e.g., rules that do not satisfy a confidence
    threshold of 50% are uninteresting
  • Subjective: based on the user's belief in the data,
    e.g., unexpectedness, novelty, actionability,
    etc.

81
Chapter 1. Introduction
  • Motivation: Why data mining?
  • Methodology of Knowledge Discovery in Databases
  • Data mining functionalities
  • Are all the patterns interesting?
  • Business Applications of data mining

82
Potential Business Applications
  • Market analysis and management
  • target marketing, customer relation management,
    market basket analysis, cross selling, market
    segmentation
  • Risk analysis and management
  • Banks assume a financial risk when they grant
    loans
  • risk models attempt to predict the probability of
    default, i.e. failure to pay back the borrowed amount
  • Credit cards
  • Insurance companies
  • Fraud detection and management
  • Other Applications
  • Text mining (news group, email, documents) and
    Web analysis.
  • Intelligent query answering

83
Market Analysis and Management (1)
  • Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount
    coupons, customer complaint calls, plus (public)
    lifestyle studies,clickstreams
  • Customer profiling-segmentation
  • data mining can tell you what types of customers
    buy what products (clustering or classification)
  • Target marketing
  • Find clusters of model customers who share the
    same characteristics: interest, income level,
    spending habits, etc.

84
Market Analysis and Management (2)
  • Effectiveness of sales campaigns
  • Advertisements, coupons, discounts, bonuses
  • promote products and attract customers
  • can help improve profits
  • Compare amount of sales and number of
    transactions
  • during the sales period versus before or after
    the sales campaign
  • Association analysis
  • which items are likely to be purchased together
    with the items on sale

85
Market Analysis and Management (3)
  • Customer retention: analysis of customer loyalty
  • sequences of purchases of particular customers
  • goods purchased at different periods by the same
    customers can be grouped into sequences
  • changes in customer consumption or loyalty
  • suggests adjustments on the pricing and variety
    of goods
  • to retain old customers and attract new customers
  • Cross-selling and up-selling
  • associations from sales records
  • a customer who buys a PC is likely to buy a
    printer
  • purchase recommendations

86
Fraud Detection and Management
  • Applications
  • widely used in health care, retail, credit card
    services, telecommunications (phone card fraud),
    etc.
  • Approach
  • use historical data to build models of fraudulent
    behavior and use data mining to help identify
    similar instances
  • Examples
  • Credit card transactions: the FALCON fraud
    assessment system by HNC Inc. signals possibly
    fraudulent credit card transactions
  • Money laundering: detect suspicious money
    transactions (US Treasury's Financial Crimes
    Enforcement Network)
  • Detecting telephone fraud: ASPECT European
    Research Group
  • Unsupervised clustering to detect fraud in mobile
    phone networks
  • Telephone call model: destination of the call,
    duration, time of day or week. Analyze patterns
    that deviate from an expected norm.

87
Health Care
  • Storing patients' records in electronic format,
    developments in medical information systems
  • Large amount of clinical data
  • Regularities, trends and surprising events
    extracted by data mining methods
  • ANN, temporal reasoning
  • assist clinicians in making informed decisions and
    improving health services
  • MERCK-MEDCO Managed Care, Pharmaceutical
    Insurance company
  • Uncover less expensive but equally effective drug
    treatments

88
Financial Data Analysis
  • Financial data
  • complete, reliable, high quality
  • Loan payment prediction and customer credit
    policy analysis

89
Loan payment prediction and customer credit
policy analysis
  • Factors influencing loan payment performance
  • loan-to-value ratio
  • term of the loan
  • debt ratio (total monthly debt/total monthly
    income)
  • payment-to-income ratio
  • income level
  • education level
  • residence region
  • credit history
  • analysis may find that
  • payment-income ratio is a dominant factor while
  • education level and debt ratio are not

90
Risk Management and Insurance
  • determine insurance rates
  • manage investment portfolios
  • differentiate between companies and/or
    individuals who are good and poor credit risks
  • Farmers Group discovered a scenario
  • Someone who owns a sports car is not a higher
    accident risk
  • Conditions: the sports car is a second car and
    the family car is a station wagon or a sedan

91
Data Mining for the Telecommunication Industry
  • Telecommunication data are multidimensional
  • calling time, duration
  • location of caller, location of callee
  • type of call
  • used to identify and compare
  • data traffic, system workload
  • resource usage, user group behavior
  • profit
  • fraudulent pattern analysis and identification of
    unusual patterns
  • to achieve customer loyalty
  • characteristics of customers affecting line usage

92
Other Applications
  • Sports and Gaming
  • Predicting outcome of football games
  • Text Mining
  • Spam detection
  • Internet Web Mining
  • Web usage mining
  • Improve link structure
  • Recommender systems
  • Web structure mining: mining the link structure of
    the Web

93
Other Applications
  • Educational Data Mining
  • Clustering students
  • Design entrance exams, selection policies
  • Human Resources
  • How to select applicants
  • Online Dating
  • Recommendations to visitors

94
Summary
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories
  • Data mining functionalities: characterization,
    discrimination, association, classification,
    clustering, outlier and trend analysis, etc.
  • Classification of data mining systems
  • Major issues in data mining