Title: Last Lecture
1Last Lecture
- Data warehousing
- Concept
- Architectures
- Design
2Today
- Business Analytics
- Data Mining
- Discussion about your miniprojects
- Note
- Next week, bring your laptop; we'll have a small exercise on LaTeX
- Before attending the next class, install TeXnicCenter and MiKTeX as indicated on our course web page
3The Business Analytics (BA) Field
- Business intelligence (BI): The use of analytical methods, either manually or automatically, to derive relationships from data
- Business analytics (BA): Provides models and analysis procedures to BI. BA involves using tools and models to assist decision makers
4The Business Analytics (BA) Field
Today and following 2 lectures
5Online Analytical Processing (OLAP)
- Online analytical processing (OLAP)
- An information system that enables the user to query the system and conduct an analysis. It provides modeling, analysis, and visualization capabilities for large data sets.
- Online transaction processing (OLTP)
- OLTP concentrates on processing repetitive transactions in large quantities and conducting simple manipulations
- OLAP versus OLTP
- OLAP involves examining many data items in complex relationships
- OLAP may analyze relationships and look for patterns, trends, and exceptions
- OLAP is a direct decision support method
6Multidimensionality
- Multidimensional OLAP (MOLAP): OLAP implemented via a specialized multidimensional database
- Relational OLAP (ROLAP): The implementation of an OLAP database on top of an existing relational database
- Database OLAP: RDBMS designed to host OLAP structures and calculations
- Web OLAP: OLAP accessible from a web browser
- Desktop OLAP: Low-priced, simple OLAP tools
Visualization is discussed in the textbook
7Advanced BA
- Users are provided with sophisticated statistical analysis such as hypothesis testing, multiple regression, churn prediction, and customer scoring models.
- These features can be provided by
- Data mining and
- Predictive analysis
- The goal is to analyze data for better decision making
8Data Mining Concepts and Applications
- Data mining (DM)
- Knowledge discovery in databases (KDD)
- A process that uses statistical, mathematical,
artificial intelligence and machine-learning
techniques to extract and identify useful
information (hidden patterns) and subsequent
knowledge from large databases
Data preprocessing has large effects on the
performance of the data mining techniques
9Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Motivating challenges
- Scalability (10^15 of data)
- High-dimensional data
- Complex, heterogeneous data
- Data distribution
- Non-traditional analysis
[Figure: Data Mining shown at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems, together with Parallel and distributed processing]
10Data Mining Tasks
- Prediction methods: Use some attributes to predict unknown or future values of other attributes.
- Classification
- Regression
- Deviation Detection
- Description methods: Derive patterns that describe the data.
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
11Classification Definition
- Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
- Supervised learning: requires training
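The train/test procedure above can be sketched in a few lines of Python. This is only an illustration: the records are made up, and the 1-nearest-neighbour classifier is an assumption chosen for brevity (the lecture itself builds decision trees).

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, attributes):
    """Return the class of the training record closest to `attributes`."""
    def sq_dist(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], attributes))
    return min(train, key=sq_dist)[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, attrs) == label for attrs, label in test)
    return correct / len(test)

# Hypothetical records: ((attribute_1, attribute_2), class)
records = [((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.1, 0.9), "A"),
           ((1.3, 1.1), "A"), ((0.9, 1.4), "A"),
           ((5.0, 5.2), "B"), ((4.8, 5.1), "B"), ((5.3, 4.9), "B"),
           ((5.1, 5.4), "B"), ((4.7, 4.8), "B")]

train, test = split_train_test(records)
print(f"train={len(train)} test={len(test)} accuracy={accuracy(train, test):.2f}")
```

The model is built only from `train`; `test` is held back so that the reported accuracy reflects previously unseen records.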
12Illustrating Classification Task
13Classification Application
- Customer Attrition/Churn
- Goal: To predict whether a customer is likely to be lost to a competitor.
- Approach
- Use detailed records of transactions with each of the past and present customers to find attributes.
- How often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
- Label the customers as loyal or disloyal.
- Find a model for loyalty.
- Many other applications in marketing, fraud detection, sales forecasting, etc.
From Berry & Linoff, Data Mining Techniques, 1997
14Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.
- Similarity measures
- Euclidean distance if attributes are continuous.
- Other problem-specific measures.
- Unsupervised learning (no training)
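As a rough sketch of clustering with Euclidean distance, the Python below uses k-means. Note that k-means is an assumption here, since the slides do not name a specific clustering algorithm, and the points and starting centroids are invented for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D data points forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

No class labels are used anywhere, which is what makes this unsupervised.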
15Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
16Clustering Application
- Market Segmentation
- Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach
- Collect different attributes of customers based on their geographical and lifestyle-related information.
- Find clusters of similar customers.
- Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
17Association Rule Discovery Definition
- Given a set of records, each of which contains some number of items from a given collection
- Produce dependency rules that predict the occurrence of an item based on occurrences of other items.
Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Applications in marketing and sales promotion
Several approaches exist. You saw the application of rough sets to association rule discovery in the seminar on Rough Sets
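One common way to score such rules is by support and confidence. These two measures are not defined on the slide, so treat them as an assumption; the transaction list below is also hypothetical, made up so that the two rules above can be evaluated.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket records
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(transactions, lhs | rhs):.2f}",
          f"confidence={confidence(transactions, lhs, rhs):.2f}")
```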
18Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples
- Predicting sales amounts of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices.
Example of linear regression using Excel's "add trend line" feature
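A trend line of this kind is an ordinary least-squares fit. The sketch below computes one by hand in Python; the advertising and sales figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: advertising expenditure vs. sales amount
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
print(f"sales = {slope:.2f} * advertising + {intercept:.2f}")
print(f"predicted sales at advertising = 6: {slope * 6 + intercept:.2f}")
```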
19Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Outliers
- Applications
- Credit Card Fraud Detection
- Network Intrusion Detection
- Some Methods
- Statistical analysis
- Fuzzy logic
- Neural networks
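As one simple instance of the statistical approach, values can be flagged when they lie many standard deviations from the mean. The z-score rule, the threshold of 2.0, and the transaction amounts below are all assumptions for illustration, not a method prescribed by the lecture.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical credit card transaction amounts
amounts = [25, 30, 28, 32, 27, 29, 31, 26, 30, 500]
print(z_score_outliers(amounts))
```

A single extreme value also inflates the mean and standard deviation, so on small samples a low threshold (or a robust statistic such as the median) may be needed.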
20Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory based reasoning
- Artificial Neural Networks
- Naïve Bayes and
- Bayesian Belief Networks
- Support Vector Machines
- Others
Today
Next class
21Example of a Decision Tree
[Figure: Training Data and Model: Decision Tree]
Splitting attributes: Refund, MarSt, TaxInc
Refund (root node)
  Yes: NO (leaf node)
  No: MarSt (internal node)
    Married: NO (leaf node)
    Single, Divorced: TaxInc (internal node)
      < 80K: NO (leaf node)
      > 80K: YES (leaf node)
22Decision Tree Classification Task
Decision Tree
Once we have a model we can use it to classify
unknown data
23Apply Model to Test Data
Test Data
Start from the root of tree.
Assign Cheat to No
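Applying the model amounts to walking the tree from the root. The Python sketch below hard-codes the Refund / MarSt / TaxInc tree from the example slide. The test records are assumptions, since the original test data table is not reproduced here, and so is the treatment of an income of exactly 80K, which the tree leaves unspecified.

```python
def classify(record):
    """Walk the example tree from the root and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: decide on taxable income (in thousands).
    # Assumption: exactly 80K is grouped with the "> 80K" branch.
    return "No" if record["TaxInc"] < 80 else "Yes"

# Hypothetical unseen records
test_records = [
    {"Refund": "No", "MarSt": "Married", "TaxInc": 80},
    {"Refund": "No", "MarSt": "Single", "TaxInc": 95},
    {"Refund": "Yes", "MarSt": "Divorced", "TaxInc": 120},
]
for record in test_records:
    print(record, "-> Cheat =", classify(record))
```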
24Break
25Decision Tree Classification Task
To build a model
Decision Tree model
26Decision Tree Induction
- Many algorithms
- Hunt's Algorithm (one of the earliest), the basis for
- CART
- ID3 and its successor C4.5
- J48, a version of C4.5
- SLIQ, SPRINT
- Algorithms have also been proposed to induce fuzzy decision trees
27General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t
- General procedure
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
- If Dt is an empty set, then t is a leaf node labeled by the default class, yd
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
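The recursive procedure can be sketched as follows. This is a minimal illustration, not a full induction algorithm. It makes three assumptions: attributes are tested in the order given rather than by choosing the best split, a majority-class leaf is used when no attributes remain, and the training records are made up.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Build a tree from (attribute_dict, class_label) records, following Hunt's procedure."""
    if not records:                       # Dt is empty: leaf labeled with the default class
        return default_class
    classes = [label for _, label in records]
    if len(set(classes)) == 1:            # all records share one class: leaf
        return classes[0]
    majority = Counter(classes).most_common(1)[0][0]
    if not attributes:                    # assumption: no tests left, use the majority class
        return majority
    # Assumption: test attributes in the given order instead of picking the best split.
    attr, rest = attributes[0], attributes[1:]
    subsets = {}
    for features, label in records:
        subsets.setdefault(features[attr], []).append((features, label))
    return (attr, {value: hunt(subset, rest, majority) for value, subset in subsets.items()})

# Hypothetical training records
records = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
    ({"Refund": "No", "MarSt": "Divorced"}, "Yes"),
]
tree = hunt(records, ["Refund", "MarSt"], default_class="No")
print(tree)
```

Internal nodes are returned as `(attribute, {value: subtree})` pairs and leaves as plain class labels.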
28Hunt's Algorithm
Don't Cheat
29Tree Induction
- Greedy strategy.
- Split the records based on an attribute test that optimizes a certain criterion.
- Greedy means that it does not go back to re-evaluate whether a selected attribute was the optimal choice.
- Issues
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
30How to Specify Test Condition?
- Depends on attribute types
- Nominal
- Ordinal
- Continuous
- Depends on number of ways to split
- 2-way split
- Multi-way split
31Splitting Based on Nominal Attributes
- Multi-way split: Use as many partitions as distinct values.
- Binary split: Divides values into two subsets. Need to find the optimal partitioning.
32Splitting Based on Continuous Attributes
- Different ways of handling continuous attributes
- Discretization to form an ordinal categorical attribute
- Static: discretize once at the beginning
- Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
- Binary decision: (A < v) or (A ≥ v)
- Consider all possible splits and find the best cut
- Can be more compute-intensive
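The exhaustive search for a binary cut can be sketched as below. Candidate thresholds are taken midway between adjacent distinct values. Scoring each candidate by the weighted Gini index of the two sides is an assumption made here for concreteness, and the income data is hypothetical.

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Try a cut between each pair of adjacent distinct values; return (threshold, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut possible between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value < v]    # A < v
        right = [label for value, label in pairs if value >= v]  # A >= v
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (v, score)
    return best

# Hypothetical taxable incomes (in thousands) and class labels
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
threshold, score = best_cut(incomes, cheat)
print(threshold, round(score, 3))
```

Every candidate rescans all records, which is why this approach can be more compute-intensive than discretizing once.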
33How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1
Which test condition is the best?
34How to determine the Best Split
- Greedy approach: split based on the class distribution of the records at each node
- We prefer nodes whose records belong to a single class
- Need a measure of node impurity
High degree of impurity
Low degree of impurity
Impurity intuitively measures how well the two
classes are separated
35Measures of Node Impurity
Today
- Gini Index
- Employed by CART algorithm
- Entropy
- Employed by ID3, C4.5 algorithm
- Misclassification error
- Gain Ratio
Today
Described in text book and references
36Measure of Impurity GINI
- Gini index for a given node t:
$\mathrm{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2$
- (NOTE: $p(j \mid t)$ is the relative frequency of class j at node t).
- Maximum ($1 - 1/n_c$, where $n_c$ is the number of classes) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
37Examples for computing GINI
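$P(C_1) = 0/6 = 0,\ P(C_2) = 6/6 = 1$: $\mathrm{Gini} = 1 - P(C_1)^2 - P(C_2)^2 = 1 - 0 - 1 = 0$
$P(C_1) = 1/6,\ P(C_2) = 5/6$: $\mathrm{Gini} = 1 - (1/6)^2 - (5/6)^2 = 0.278$
$P(C_1) = 2/6,\ P(C_2) = 4/6$: $\mathrm{Gini} = 1 - (2/6)^2 - (4/6)^2 = 0.444$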
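The three node examples can be checked with a few lines of Python. The function takes per-class record counts; the extra (3, 3) node is added here only to show the two-class maximum of 0.5.

```python
def gini(class_counts):
    """Gini index of a node, given the number of records in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# (records in C1, records in C2) for each node
for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))
```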
38Alternative Splitting Criteria based on INFO
- Entropy at a given node t:
$\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)$
- (NOTE: $p(j \mid t)$ is the relative frequency of class j at node t).
- Measures homogeneity of a node.
- Maximum ($\log_2 n_c$, where $n_c$ is the number of classes) when records are equally distributed among all classes, implying least information
- Minimum (0.0) when all records belong to one class, implying most information
- Entropy-based computations are similar to the GINI index computations
39Examples for computing Entropy
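$P(C_1) = 0/6 = 0,\ P(C_2) = 6/6 = 1$: $\mathrm{Entropy} = -0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0$
$P(C_1) = 1/6,\ P(C_2) = 5/6$: $\mathrm{Entropy} = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65$
$P(C_1) = 2/6,\ P(C_2) = 4/6$: $\mathrm{Entropy} = -(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92$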
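The same check works for entropy. In this sketch, classes with zero records are skipped, which implements the usual convention that 0 log 0 counts as 0; the (3, 3) node is an added illustration of the two-class maximum.

```python
import math

def entropy(class_counts):
    """Entropy (in bits) of a node, given the number of records in each class."""
    total = sum(class_counts)
    # Classes with zero records are skipped: 0 * log 0 is taken to be 0.
    probs = [count / total for count in class_counts if count > 0]
    return sum(-p * math.log2(p) for p in probs)

# (records in C1, records in C2) for each node
for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(entropy(counts), 2))
```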
40Model Overfitting
- Classification model errors
- Training
- Generalization
- A good model must have both low training error and low generalization error
- Overfitting (learning the training set too well): the generalization error rate begins to increase even though the training error rate continues to decrease. Causes of this effect are
- Presence of noise
- Lack of representative examples
41Text Mining
- Text mining: Application of data mining to unstructured or semi-structured text files. It entails the generation of meaningful numerical indices from the unstructured text and then processing these indices using various data mining algorithms
- Text mining helps organizations
- Find the hidden content of documents, including additional useful relationships
- Relate documents across previously unnoticed divisions
- Group documents by common themes
42Text Mining
- Applications of text mining
- Automatic detection of e-mail spam or phishing through analysis of the document content
- Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
- Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
- Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
- Creation of a relationship view of a document collection
- Qualitative analysis of documents to detect deception
43Text Mining
- How to mine text
- Eliminate commonly used words (stop-words)
- Replace words with their stems or roots (stemming algorithms)
- Consider synonyms and phrases
- Calculate the weights of the remaining terms
- You'll learn the details of these methods in your information retrieval course next semester
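A toy version of this pipeline is sketched below. Everything in it is a simplifying assumption: the stop-word list is tiny, `crude_stem` is a rough suffix stripper standing in for a real stemming algorithm, the weights are plain relative term frequencies, and the synonym/phrase step is omitted.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in", "are"}  # tiny illustrative list

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_weights(text):
    """Relative term frequencies after stop-word removal and stemming."""
    tokens = [w.strip(".,;:!?").lower() for w in text.split()]
    terms = [crude_stem(w) for w in tokens if w and w not in STOP_WORDS]
    counts = Counter(terms)
    return {term: n / len(terms) for term, n in counts.items()}

weights = term_weights("The customer calls and the customers are calling")
print(weights)
```

The resulting term weights are the numerical indices that a data mining algorithm would then take as input.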
44Web Mining
- Web mining The discovery and analysis of
interesting and useful information from the Web,
about the Web, and usually through Web-based
tools
45Web Mining
- Uses for Web mining
- Determine the lifetime value of clients
- Design cross-marketing strategies across products
- Evaluate promotional campaigns
- Target electronic ads and coupons at user groups
- Predict user behavior
- Present dynamic information to users
46Data Mining Software
- Open Source
- WEKA (weka.org): widely used framework written in Java that contains many data mining processing algorithms, such as
- data pre-processing,
- classification,
- regression,
- clustering,
- association rules,
- visualization
- machine learning algorithms
- Rough sets based
- RSES (Java, C; no source code available)
- Rosetta (employs RSES, but some C source code is available)
47Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data
48Summary
- What were the main points of the lecture?
49Miniprojects
- Individual. Two kinds
- Argue for an IT solution to a problem that a known or invented (but realistic) company faces.
- Survey (of at least two papers) on one of these subjects
- Decision support systems methods or techniques
- Data mining in decision support systems
- Data warehousing in decision support systems
- If you choose to write a small survey, consult papers from journals or conferences published by
- Springer, Elsevier, ACM, or IEEE
- Use the auboline system to look for and get these papers
- http://www.aub.aau.dk/portal/js_pane/forside/article/226
50Today