Last Lecture presentation

Title: Last Lecture

1
Last Lecture

Data warehousing
Concept
Architectures
Design

2
Today

Business Analytics
Data Mining
Discussion about your miniprojects
Note
Next week, bring your laptop: we'll have a small
exercise on LaTeX
Before attending the next class, install Texnic and
MiKTeX as indicated on our course web page

3
The Business Analytics (BA) Field

Business intelligence (BI): The use of analytical
methods, either manually or automatically, to
derive relationships from data
Business analytics (BA): Provides models and
analysis procedures to BI. BA involves using
tools and models to assist decision makers

4
The Business Analytics (BA) Field
Today and following 2 lectures
5
Online Analytical Processing (OLAP)

Online analytical processing (OLAP)
An information system that enables the user to
query the system and conduct an analysis.
Provides modeling, analysis, and visualization
capabilities for large data sets.
Online Transaction Processing (OLTP)
OLTP concentrates on processing repetitive
transactions in large quantities and conducting
simple manipulations
OLAP versus OLTP
OLAP involves examining many data items in complex
relationships
OLAP may analyze relationships and look for
patterns, trends, and exceptions
OLAP is a direct decision support method
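
The contrast between OLTP and OLAP can be sketched in a few lines. This is only an illustration under stated assumptions: the sales rows, the dimension names, and the helpers oltp_insert and olap_rollup are all hypothetical and do not come from any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, amount)
sales = [
    ("North", "Laptop", "Q1", 1200.0),
    ("North", "Phone", "Q1", 800.0),
    ("South", "Laptop", "Q1", 1500.0),
    ("South", "Laptop", "Q2", 1700.0),
    ("North", "Phone", "Q2", 900.0),
]

DIMENSIONS = ("region", "product", "quarter")


def oltp_insert(table, row):
    """OLTP-style work: record one simple, repetitive transaction."""
    table.append(row)


def olap_rollup(table, dims):
    """OLAP-style work: total the amounts grouped by the chosen dimensions."""
    positions = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(float)
    for row in table:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


oltp_insert(sales, ("South", "Phone", "Q2", 650.0))
by_region = olap_rollup(sales, ["region"])
by_region_quarter = olap_rollup(sales, ["region", "quarter"])
print(by_region)
print(by_region_quarter)
```

The same fact rows serve both roles: the insert touches one row, while the roll-up scans many rows to expose totals per region or per region and quarter.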

6
Multidimensionality

Multidimensional OLAP (MOLAP): OLAP implemented
via a specialized multidimensional database
Relational OLAP (ROLAP): The implementation of an
OLAP database on top of an existing relational
database
Database OLAP: RDBMS designed to host OLAP
structures and calculations
Web OLAP: OLAP accessible from a web browser
Desktop OLAP: Low-price, simple OLAP tools

Visualization is discussed in the textbook
7
Advanced BA

Users are provided with sophisticated statistical
analysis such as hypothesis testing, multiple
regression, churn prediction, customer scoring
models.
These features can be provided by
Data mining and
Predictive analysis
The goal is to analyze data for better decision
making

8
Data Mining Concepts and Applications

Data mining (DM)
Knowledge discovery in databases (KDD)
A process that uses statistical, mathematical,
artificial intelligence and machine-learning
techniques to extract and identify useful
information (hidden patterns) and subsequent
knowledge from large databases

Data preprocessing has large effects on the
performance of the data mining techniques
9
Origins of Data Mining

Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
Motivating challenges
Scalability (10^15 of data)
High dimensionality data
Complex, heterogeneous data
Data distribution
Non traditional analysis

Figure: Data mining draws on statistics/AI,
machine learning/pattern recognition, database
systems, and parallel and distributed processing
10
Data Mining Tasks

Prediction Methods: Use some attributes to
predict unknown or future values of other
attributes.
Classification
Regression
Deviation Detection
Description Methods: Derive patterns that
describe the data.
Clustering
Association Rule Discovery
Sequential Pattern Discovery

11
Classification Definition

Given a collection of records (training set)
Each record contains a set of attributes; one of
the attributes is the class.
Find a model for the class attribute as a function
of the values of the other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of
the model. Usually, the given data set is divided
into training and test sets, with the training set
used to build the model and the test set used to
validate it.
Supervised learning: requires training
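
A minimal sketch of the training/test procedure described above, in plain Python. The data set, the 30% test fraction, and the toy learner build_threshold_model are assumptions made for illustration; any real classifier could take its place.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into (training, test)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def build_threshold_model(training_set):
    """Toy learner (assumption): place a cut halfway between the two classes."""
    lows = [x for x, label in training_set if label == "low"]
    highs = [x for x, label in training_set if label == "high"]
    cut = (max(lows) + min(highs)) / 2
    return lambda x: "high" if x > cut else "low"


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)


# Hypothetical records: one attribute x and a class label.
records = [(x, "high" if x >= 5 else "low") for x in range(10)]
training_set, test_set = train_test_split(records)
model = build_threshold_model(training_set)
test_accuracy = accuracy(model, test_set)
print(test_accuracy)
```

The model never sees the test records while it is being built, so the printed accuracy estimates how well it handles previously unseen records.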

12
Illustrating Classification Task
13
Classification Application

Customer Attrition/Churn
Goal: To predict whether a customer is likely to
be lost to a competitor.
Approach:
Use detailed record of transactions with each of
the past and present customers, to find
attributes.
How often the customer calls, where he calls,
what time-of-the day he calls most, his financial
status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Many other applications in marketing, fraud
detection, sales forecasting etc.

From Berry & Linoff, Data Mining Techniques, 1997
14
Clustering Definition

Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
Data points in one cluster are more similar to
one another.
Data points in separate clusters are less similar
to one another.
Similarity Measures
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
Unsupervised learning (no training)
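
As an illustration of the similarity measure, the sketch below computes Euclidean distance and assigns each point to its nearest cluster center. The points and centers are made-up values, and assign_to_nearest is a hypothetical helper that performs only the assignment step of a centroid-based method, not a complete clustering algorithm.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_nearest(points, centers):
    """Group points by the index of their closest center."""
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


# Hypothetical 2-D data and two assumed cluster centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]
clusters = assign_to_nearest(points, centers)
print(clusters)
```

Points that land in the same group are close to one another, and points in different groups are far apart, which is the property the definition asks for.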

15
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
16
Clustering Application

Market Segmentation
Goal: Subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
Approach:
Collect different attributes of customers based
on their geographical and lifestyle related
information.
Find clusters of similar customers.
Measure the clustering quality by observing
buying patterns of customers in same cluster vs.
those from different clusters.

17
Association Rule Discovery Definition

Given a set of records, each of which contains some
number of items from a given collection
Produce dependency rules that predict the
occurrence of an item based on occurrences of
other items.

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Applications in marketing and sales promotion
Several approaches exist. You saw the
application of rough sets to association rule
discovery in the seminar on Rough Sets
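
One common way to judge a rule such as {Milk} --> {Coke} is by its support and confidence. These two measures are not defined on the slide, so treat the sketch below as a supplement under that assumption; the five transactions are made-up illustration data.

```python
# Hypothetical market-basket records (one set of items per transaction).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, records):
    """Fraction of records that contain every item of the itemset."""
    hits = sum(1 for r in records if itemset <= r)
    return hits / len(records)


def confidence(antecedent, consequent, records):
    """Among records containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)


milk_coke = confidence({"Milk"}, {"Coke"}, transactions)
diaper_milk_beer = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(milk_coke, diaper_milk_beer)
```

On this toy data, three of the four Milk transactions also contain Coke, and two of the three Diaper and Milk transactions also contain Beer.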
18
Regression

Predict a value of a given continuous valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
Extensively studied in the statistics and neural
network fields.
Examples
Predicting sales amounts of new product based on
advertising expenditure.
Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.

Example of linear regression using Excel's Add
Trendline feature
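
A linear trend line of this kind can also be fitted by ordinary least squares in a few lines of Python. The advertising and sales figures below are invented for illustration, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales_amount = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(advertising, sales_amount)
predicted = slope * 6.0 + intercept  # predicted sales for a new expenditure
print(slope, intercept, predicted)
```

The fitted line is then used to predict the continuous target (sales) for an expenditure that was not in the data.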
19
Deviation/Anomaly Detection

Detect significant deviations from normal
behavior
Outliers
Applications
Credit Card Fraud Detection
Network Intrusion Detection
Some Methods
Statistical analysis
Fuzzy logic
Neural networks
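
As one very simple instance of the statistical approach, the sketch below flags values whose z-score exceeds a threshold. The transaction amounts and the threshold of 2.0 are assumptions chosen for illustration; real fraud or intrusion detection uses far richer models.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical card transaction amounts; 95 is far from normal behavior.
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(amounts)
print(outliers)
```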

20
Classification Techniques

Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Artificial Neural Networks
Naïve Bayes and
Bayesian Belief Networks
Support Vector Machines
Others

Today
Next class
21
Example of a Decision Tree
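Figure (Model: Decision Tree, built from the
Training Data); splitting attributes are Refund,
MarSt, and TaxInc
Root node Refund: Yes -> leaf NO; No -> MarSt
Internal node MarSt: Married -> leaf NO;
Single, Divorced -> TaxInc
Internal node TaxInc: < 80K -> leaf NO;
> 80K -> leaf YES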
22
Decision Tree Classification Task
Decision Tree
Once we have a model we can use it to classify
unknown data
23
Apply Model to Test Data
Test Data
Start from the root of tree.
Assign Cheat to No
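
Applying a tree to a record is a walk from the root to a leaf. The sketch below hard-codes the example tree under the assumption that it reads Refund (Yes -> NO, No -> MarSt), MarSt (Married -> NO, Single/Divorced -> TaxInc), TaxInc (< 80K -> NO, otherwise YES). The test record and the handling of exactly 80K are also assumptions.

```python
def classify(record):
    """Walk the assumed example tree from the root and return the Cheat label."""
    # Root node: Refund
    if record["Refund"] == "Yes":
        return "No"
    # Internal node: marital status
    if record["MarSt"] == "Married":
        return "No"
    # Internal node: taxable income (assumption: exactly 80K goes to the YES branch)
    if record["TaxInc"] < 80_000:
        return "No"
    return "Yes"


# Hypothetical test record with unknown class.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
label = classify(test_record)
print(label)
```

For this record the walk goes Refund = No, then MarSt = Married, and stops at a leaf, so Cheat is assigned No without ever testing TaxInc.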
24
Break
25
Decision Tree Classification Task
To build a model
Decision Tree model
26
Decision Tree Induction

Many Algorithms
Hunt's Algorithm (one of the earliest), the basis
for
CART
ID3 and its derivative C4.5
J48, a version of C4.5
SLIQ, SPRINT
Algorithms have been proposed to induce fuzzy
decision trees

27
General Structure of Hunt's Algorithm

Let D_t be the set of training records that reach
a node t
General Procedure
If D_t contains records that all belong to the same
class y_t, then t is a leaf node labeled as y_t
If D_t is an empty set, then t is a leaf node
labeled by the default class, y_d
If D_t contains records that belong to more than
one class, use an attribute test to split the
data into smaller subsets. Recursively apply the
procedure to each subset.
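
The general procedure can be sketched as a short recursive function. This is a simplified illustration, not a full implementation: it assumes nominal attributes, takes the attributes in the order given instead of choosing the best split, and falls back to the majority class when no attributes remain. The record layout and the name hunt are hypothetical.

```python
from collections import Counter


def hunt(records, attributes, default_class):
    """records: list of (attribute_dict, label). Returns a label (leaf) or (attribute, children)."""
    if not records:
        return default_class  # empty set: leaf labeled with the default class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:
        return labels[0]  # all records share one class: leaf with that class
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:
        return majority  # assumption: no tests left, so use the majority class
    attr, remaining = attributes[0], attributes[1:]  # naive choice of attribute test
    children = {}
    for value in sorted({attrs[attr] for attrs, _ in records}):
        subset = [(attrs, label) for attrs, label in records if attrs[attr] == value]
        children[value] = hunt(subset, remaining, majority)  # recurse on each subset
    return (attr, children)


# Hypothetical training records.
training = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
    ({"Refund": "Yes", "MarSt": "Married"}, "No"),
]
tree = hunt(training, ["Refund", "MarSt"], default_class="No")
print(tree)
```

Because the branches are built only from values seen in the records, the empty-set case never fires in this sketch. It is kept to mirror the procedure, where a multi-way split over all possible values could produce an empty subset.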

28
Hunt's Algorithm
Don't Cheat
29
Tree Induction

Greedy strategy.
Split the records based on an attribute test that
optimizes certain criterion.
Greedy means that the algorithm does not go back to
re-evaluate whether a selected attribute was the
optimal choice.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting

30
How to Specify Test Condition?

Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split

31
Splitting Based on Nominal Attributes

Multi-way split: Use as many partitions as
distinct values.
Binary split: Divides values into two subsets.
Need to find optimal partitioning.

32
Splitting Based on Continuous Attributes

Different ways of handling continuous attributes
Discretization to form an ordinal categorical
attribute
Static: discretize once at the beginning
Dynamic: ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.
Binary decision: (A < v) or (A ≥ v)
considers all possible splits and finds the best
cut
can be more compute intensive
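
The two bucketing schemes named above can be sketched as follows. The income values and the choice of three bins are assumptions for illustration, and equal_width_bins assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Equal interval bucketing: split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Equal frequency bucketing: each bin receives roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical taxable incomes in thousands.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
width_bins = equal_width_bins(incomes, 3)
frequency_bins = equal_frequency_bins(incomes, 3)
print(width_bins)
print(frequency_bins)
```

With a skewed attribute like this one, equal interval bucketing crowds most records into the first bin, while equal frequency bucketing spreads them evenly.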

33
How to Determine the Best Split
Before splitting: 10 records of class 0, 10
records of class 1
Which test condition is the best?
34
How to determine the Best Split

Greedy approach: Split based on the class
distribution of the records at each node
We prefer nodes that contain a single class
Need a measure of node impurity

Figure: one node with a high degree of impurity
and one with a low degree of impurity
Impurity intuitively measures how well the two
classes are separated
35
Measures of Node Impurity
Today

Gini Index
Employed by CART algorithm
Entropy
Employed by ID3, C4.5 algorithm
Misclassification error
Gain Ratio

Today
Described in text book and references
36
Measure of Impurity GINI

Gini Index for a given node t:
$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
(NOTE: p(j | t) is the relative frequency of
class j at node t).
Maximum (1 - 1/n_c, where n_c is the number of
classes) when records are equally distributed
among all classes, implying least interesting
information
Minimum (0.0) when all records belong to one
class, implying most interesting information

37
Examples for computing GINI
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
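
The three Gini examples can be checked with a short function that takes the class counts at a node and returns one minus the sum of squared class frequencies. The function name gini and the count-list input format are choices made for this sketch.

```python
def gini(class_counts):
    """Gini index of a node: 1 minus the sum of squared relative class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Class counts (C1, C2) for the three example nodes, plus an evenly split node.
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))
```

The evenly split node [3, 3] gives 0.5, which is the two-class maximum 1 - 1/2.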
38
Alternative Splitting Criteria based on INFO

Entropy at a given node t:
$Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$
(NOTE: p(j | t) is the relative frequency of
class j at node t).
Measures homogeneity of a node.
Maximum (log_2 n_c) when records are equally
distributed among all classes, implying least
information
Minimum (0.0) when all records belong to one
class, implying most information
Entropy-based computations are similar to the
GINI index computations

39
Examples for computing Entropy
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0
P(C1) = 1/6, P(C2) = 5/6
Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6
Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
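
The entropy examples can be reproduced the same way, using base-2 logarithms and the usual convention that a class with zero records contributes nothing (0 log 0 is taken as 0). The function name entropy and the count-list input are choices made for this sketch.

```python
import math


def entropy(class_counts):
    """Entropy of a node in bits; classes with zero records are skipped (0 log 0 = 0)."""
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)


# Class counts (C1, C2) for the three example nodes, plus an evenly split node.
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(entropy(counts), 2))
```

The evenly split node [3, 3] gives 1.0, which is the two-class maximum log2 2.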
40
Model Overfitting

Classification model errors
Training
Generalization
A good model must have both low training error and
low generalization error
Overfitting occurs when the generalization error
rate begins to increase even though the training
error rate continues to decrease (the model learns
the training set too well). Causes of this effect are
Presence of noise
Lack of representative examples

41
Text Mining

Text mining: Application of data mining to
unstructured or semi-structured text files. It
entails the generation of meaningful numerical
indices from the unstructured text and then
processing these indices using various data
mining algorithms
Text mining helps organizations
Find the hidden content of documents, including
additional useful relationships
Relate documents across previously unnoticed
divisions
Group documents by common themes

42
Text Mining

Applications of text mining
Automatic detection of e-mail spam or phishing
through analysis of the document content
Automatic processing of messages or e-mails to
route a message to the most appropriate party to
process that message
Analysis of warranty claims, help desk
calls/reports, and so on to identify the most
common problems and relevant responses
Analysis of related scientific publications in
journals to create an automated summary view of a
particular discipline
Creation of a relationship view of a document
collection
Qualitative analysis of documents to detect
deception

43
Text Mining

How to mine text
Eliminate commonly used words (stop-words)
Replace words with their stems or roots (stemming
algorithms)
Consider synonyms and phrases
Calculate the weights of the remaining terms
You'll learn the details of these methods in your
information retrieval course next semester
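
A toy version of this pipeline, to make the steps concrete. Everything here is a simplifying assumption: the stop-word list is tiny, naive_stem is a crude suffix stripper and not a real stemming algorithm, the weights are plain relative term frequencies, and the synonym and phrase step is skipped.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "are"}  # tiny illustrative list


def naive_stem(word):
    """Toy stemmer: strip one common suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_weights(text):
    """Remove stop-words, stem the rest, and weight each term by relative frequency."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    stems = [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


weights = term_weights("Mining text and mining the web")
print(weights)
```

The resulting term weights are the kind of numerical indices that data mining algorithms can then process.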

44
Web Mining

Web mining: The discovery and analysis of
interesting and useful information from the Web,
about the Web, and usually through Web-based
tools

45
Web Mining

Uses for Web mining
Determine the lifetime value of clients
Design cross-marketing strategies across products
Evaluate promotional campaigns
Target electronic ads and coupons at user groups
Predict user behavior
Present dynamic information to users

46
Data Mining Software

Open Source
WEKA (weka.org): a widely used framework written in
Java that contains many data mining algorithms,
such as
data pre-processing,
classification,
regression,
clustering,
association rules,
visualization
machine learning algorithms
Rough sets based
RSES (Java, C; no source code available)
Rosetta (employs RSES, but some C source code is
available)

47
Challenges of Data Mining

Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data

48
Summary

What were the main points of the lecture?

49
Miniprojects

Individual. Two kinds:
Argue for an IT solution to a problem that a
known or invented (but realistic) company faces.
Survey (of at least two papers) on one of these
subjects
Decision support systems methods or techniques
Data mining in decision support systems
Data warehousing in decision support systems
If you choose to write a small survey, consult
papers from journals or conferences published by
Springer, Elsevier, ACM, or IEEE
Use the auboline system to find and retrieve these
papers
http://www.aub.aau.dk/portal/js_pane/forside/article/226

50
Today
