Last Lecture - PowerPoint PPT Presentation

Provided by: danielort
1
Last Lecture
  • Data warehousing
  • Concept
  • Architectures
  • Design

2
Today
  • Business Analytics
  • Data Mining
  • Discussion about your miniprojects
  • Note
  • Next week, bring your laptop; we'll have a small
    exercise on LaTeX
  • Before attending the next class, install TeXnicCenter
    and MiKTeX as indicated on our course's web page

3
The Business Analytics (BA) Field
  • Business intelligence (BI): the use of analytical
    methods, either manually or automatically, to
    derive relationships from data
  • Business analytics (BA): provides models and
    analysis procedures to BI. BA involves using
    tools and models to assist decision makers

4
The Business Analytics (BA) Field
Today and following 2 lectures
5
Online Analytical Processing (OLAP)
  • Online analytical processing (OLAP)
  • An information system that enables the user to
    query the system and conduct an analysis. It
    provides modeling, analysis, and visualization
    capabilities for large data sets.
  • Online Transaction Processing (OLTP)
  • OLTP concentrates on processing repetitive
    transactions in large quantities and conducting
    simple manipulations
  • OLAP versus OLTP
  • OLAP involves examining many data items in complex
    relationships
  • OLAP may analyze relationships and look for
    patterns, trends, and exceptions
  • OLAP is a direct decision support method

6
Multidimensionality
  • Multidimensional OLAP (MOLAP): OLAP implemented
    via a specialized multidimensional database
  • Relational OLAP (ROLAP): the implementation of an
    OLAP database on top of an existing relational
    database
  • Database OLAP: an RDBMS designed to host OLAP
    structures and calculations
  • Web OLAP: OLAP accessible from a web browser
  • Desktop OLAP: low-price, simple OLAP tools

Visualization is discussed in the text book
7
Advanced BA
  • Users are provided with sophisticated statistical
    analysis such as hypothesis testing, multiple
    regression, churn prediction, customer scoring
    models.
  • These features can be provided by
  • Data mining and
  • Predictive analysis
  • The goal is to analyze data for better decision
    making

8
Data Mining Concepts and Applications
  • Data mining (DM)
  • Knowledge discovery in databases (KDD)
  • A process that uses statistical, mathematical,
    artificial intelligence and machine-learning
    techniques to extract and identify useful
    information (hidden patterns) and subsequent
    knowledge from large databases

Data preprocessing has large effects on the
performance of the data mining techniques
9
Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern
    recognition, statistics, and database systems
  • Motivating challenges
  • Scalability (10^15 of data)
  • High dimensionality data
  • Complex, heterogeneous data
  • Data distribution
  • Non traditional analysis

[Figure: Data Mining shown in relation to Statistics/AI, Machine Learning/Pattern Recognition, Database systems, and Parallel and distributed processing]
10
Data Mining Tasks
  • Prediction Methods: use some attributes to
    predict unknown or future values of other
    attributes.
  • Classification
  • Regression
  • Deviation Detection
  • Description Methods: derive patterns that
    describe the data.
  • Clustering
  • Association Rule Discovery
  • Sequential Pattern Discovery

11
Classification Definition
  • Given a collection of records (training set)
  • Each record contains a set of attributes; one of
    the attributes is the class.
  • Find a model for the class attribute as a function
    of the values of the other attributes.
  • Goal: previously unseen records should be
    assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of
    the model. Usually, the given data set is divided
    into training and test sets, with the training set
    used to build the model and the test set used to
    validate it.
  • Supervised learning (requires training)
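
The hold-out procedure described above can be sketched in a few lines of Python. This is only an illustration: the records are invented, and the "model" is a trivial majority-class predictor (an assumption chosen to keep the sketch short, not a technique from the lecture). The point is the division into training and test sets and the accuracy measured on the test set.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(training_set):
    """Build a trivial model that always predicts the most frequent class."""
    labels = [label for _, label in training_set]
    majority = max(set(labels), key=labels.count)
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical records: (attributes, class)
records = [({"refund": "yes"}, "no")] * 7 + [({"refund": "no"}, "yes")] * 3
train, test = train_test_split(records)
model = fit_majority_class(train)
print("accuracy on the test set:", accuracy(model, test))
```

Any real classifier (for instance a decision tree) can replace `fit_majority_class`; the evaluation code stays the same.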

12
Illustrating Classification Task
13
Classification Application
  • Customer Attrition/Churn
  • Goal: to predict whether a customer is likely to
    be lost to a competitor.
  • Approach
  • Use detailed records of transactions with each of
    the past and present customers to find
    attributes.
  • How often the customer calls, where he calls,
    what time of the day he calls most, his financial
    status, marital status, etc.
  • Label the customers as loyal or disloyal.
  • Find a model for loyalty.
  • Many other applications in marketing, fraud
    detection, sales forecasting, etc.

From Berry & Linoff, Data Mining Techniques, 1997
14
Clustering Definition
  • Given a set of data points, each having a set of
    attributes, and a similarity measure among them,
    find clusters such that
  • Data points in one cluster are more similar to
    one another.
  • Data points in separate clusters are less similar
    to one another.
  • Similarity Measures
  • Euclidean Distance if attributes are continuous.
  • Other Problem-specific Measures.
  • Unsupervised learning (no training)

15
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
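
As a rough illustration of the two quantities named above, the following Python sketch computes the mean intracluster and mean intercluster Euclidean distance for a small set of 3-D points. The points and their cluster labels are invented for the example; no clustering algorithm is run here, the sketch only measures a given assignment.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_distances(points, labels):
    """Return (mean intracluster distance, mean intercluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(euclidean(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical 3-D points forming two well-separated groups
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
labels = [0, 0, 0, 1, 1, 1]
intra, inter = mean_distances(points, labels)
print("intracluster:", round(intra, 2), "intercluster:", round(inter, 2))
```

A good clustering gives a small first value and a large second value.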
16
Clustering Application
  • Market Segmentation
  • Goal: subdivide a market into distinct subsets of
    customers where any subset may conceivably be
    selected as a market target to be reached with a
    distinct marketing mix.
  • Approach
  • Collect different attributes of customers based
    on their geographical and lifestyle-related
    information.
  • Find clusters of similar customers.
  • Measure the clustering quality by observing the
    buying patterns of customers in the same cluster vs.
    those from different clusters.

17
Association Rule Discovery Definition
  • Given a set of records, each of which contains some
    number of items from a given collection
  • Produce dependency rules that predict the
    occurrence of an item based on occurrences of
    other items.

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Applications in marketing and sales promotion.
Several approaches exist. You saw the
application of rough sets to association rule
discovery in the seminar on Rough Sets.
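
Rules such as the two above are usually scored with support and confidence. These measures are not defined on the slide, so the sketch below introduces them as the standard definitions: support is the fraction of records containing all items of the rule, and confidence is the fraction of records containing the left-hand side that also contain the right-hand side. The transactions are invented for illustration.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, antecedent, consequent):
    """Fraction of transactions containing both sides of the rule."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, fraction also containing the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical market-basket records
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
print(support(transactions, {"Milk"}, {"Coke"}), confidence(transactions, {"Milk"}, {"Coke"}))
print(confidence(transactions, {"Diaper", "Milk"}, {"Beer"}))
```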
18
Regression
  • Predict a value of a given continuous valued
    variable based on the values of other variables,
    assuming a linear or nonlinear model of
    dependency.
  • Extensively studied in the statistics and neural
    network fields.
  • Examples
  • Predicting sales amounts of new product based on
    advertising expenditure.
  • Predicting wind velocities as a function of
    temperature, humidity, air pressure, etc.
  • Time series prediction of stock market indices.

Example of linear regression using Excel's "Add
Trendline" feature
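
What a linear trend line computes can be sketched with an ordinary least-squares fit of a straight line. The advertising and sales figures below are invented, and the function name `fit_line` is only illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]
slope, intercept = fit_line(advertising, sales)
predicted = slope * 6 + intercept  # predicted sales for an expenditure of 6
print(slope, intercept, predicted)
```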
19
Deviation/Anomaly Detection
  • Detect significant deviations from normal
    behavior
  • Outliers
  • Applications
  • Credit Card Fraud Detection
  • Network Intrusion Detection
  • Some Methods
  • Statistical analysis
  • Fuzzy logic
  • Neural networks
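
As one simple instance of the statistical approach, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The threshold of 2.0 and the transaction amounts are assumptions made for the example; real fraud detection uses far richer models.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical credit card transaction amounts
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
print(find_outliers(amounts))
```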

20
Classification Techniques
  • Decision Tree based Methods
  • Rule-based Methods
  • Memory based reasoning
  • Artificial Neural Networks
  • Naïve Bayes and
  • Bayesian Belief Networks
  • Support Vector Machines
  • Others

Today
Next class
21
Example of a Decision Tree
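[Figure: training data (left) and the resulting model, a decision tree (right). Labels in the figure: Attributes, Splitting Attributes, Root node, Internal node, Leaf nodes.]
Model: Decision Tree
Refund? (root node)
  Yes: NO
  No: MarSt? (internal node)
    Married: NO
    Single, Divorced: TaxInc? (internal node)
      < 80K: NO
      > 80K: YES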
22
Decision Tree Classification Task
Decision Tree
Once we have a model we can use it to classify
unknown data
23
Apply Model to Test Data
Test Data
Start from the root of tree.
Assign Cheat to No
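
Applying the model amounts to following one path from the root to a leaf. The sketch below hard-codes the example tree as read from the slide labels (Refund, then MarSt, then TaxInc) and classifies one record. The record itself is hypothetical, and treating an income of exactly 80K as "not below 80K" is an assumption, since the slide only shows "< 80K" and "> 80K".

```python
def classify(record):
    """Return the predicted value of Cheat for one record."""
    # Start from the root of the tree: Refund
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test marital status
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (boundary handling is an assumption)
    return "No" if record["TaxInc"] < 80_000 else "Yes"

# Hypothetical test record
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
print("Cheat =", classify(test_record))
```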
24
Break
25
Decision Tree Classification Task
To build a model
Decision Tree model
26
Decision Tree Induction
  • Many Algorithms
  • Hunt's Algorithm (one of the earliest), the basis for
  • CART
  • ID3, and its derivative C4.5
  • J48, a version of C4.5
  • SLIQ, SPRINT
  • Algorithms have also been proposed to induce fuzzy
    decision trees

27
General Structure of Hunt's Algorithm
  • Let D_t be the set of training records that reach
    a node t
  • General Procedure
  • If D_t contains records that all belong to the same
    class y_t, then t is a leaf node labeled as y_t
  • If D_t is an empty set, then t is a leaf node
    labeled by the default class, y_d
  • If D_t contains records that belong to more than
    one class, use an attribute test to split the
    data into smaller subsets. Recursively apply the
    procedure to each subset.
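
The three cases of the general procedure map directly onto a recursive function. The following Python sketch is one possible reading, under stated assumptions: attributes are categorical, each test is a multi-way split, the attribute is chosen by lowest weighted Gini index (the procedure itself does not fix the criterion), and a majority-class leaf is used when no attributes remain. The records are invented.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(records, attr):
    """Group records by their value of the attribute."""
    groups = {}
    for attrs, label in records:
        groups.setdefault(attrs[attr], []).append((attrs, label))
    return groups

def choose_attribute(records, attributes):
    """Assumed criterion: attribute whose split has the lowest weighted Gini."""
    def weighted_gini(attr):
        groups = partition(records, attr).values()
        return sum(len(g) / len(records) * gini([l for _, l in g]) for g in groups)
    return min(attributes, key=weighted_gini)

def hunt(records, attributes, default_class):
    """Return a class label (leaf) or a tuple (attribute, {value: subtree})."""
    if not records:
        return default_class              # empty set: leaf with the default class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:
        return labels[0]                  # all records in one class: leaf
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:
        return majority                   # no test left (assumption): majority leaf
    attr = choose_attribute(records, attributes)
    remaining = [a for a in attributes if a != attr]
    return (attr, {value: hunt(subset, remaining, majority)
                   for value, subset in partition(records, attr).items()})

def predict(tree, attrs, default_class):
    """Follow the tests from the root until a leaf is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(attrs[attr], default_class)
    return tree

# Hypothetical training records: (attributes, class)
records = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
    ({"Refund": "Yes", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Divorced"}, "Yes"),
]
tree = hunt(records, ["Refund", "MarSt"], default_class="No")
print(tree)
```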

28
Hunt's Algorithm
Don't Cheat
29
Tree Induction
  • Greedy strategy.
  • Split the records based on an attribute test that
    optimizes certain criterion.
  • Greedy means that it does not go back to evaluate
    whether a selected attribute was the optimal choice.
  • Issues
  • Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
  • Determine when to stop splitting

30
How to Specify Test Condition?
  • Depends on attribute types
  • Nominal
  • Ordinal
  • Continuous
  • Depends on number of ways to split
  • 2-way split
  • Multi-way split

31
Splitting Based on Nominal Attributes
  • Multi-way split: use as many partitions as
    distinct values.
  • Binary split: divides values into two subsets.
    Need to find the optimal partitioning.

32
Splitting Based on Continuous Attributes
  • Different ways of handling continuous attributes
  • Discretization to form an ordinal categorical
    attribute
  • Static: discretize once at the beginning
  • Dynamic: ranges can be found by equal interval
    bucketing, equal frequency bucketing
    (percentiles), or clustering.
  • Binary decision: (A < v) or (A ≥ v)
  • considers all possible splits and finds the best
    cut
  • can be more compute intensive
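
The binary decision on a continuous attribute can be sketched as an exhaustive search over candidate cut points. In the Python sketch below, candidates are the midpoints between consecutive distinct values and the quality of a cut is the weighted Gini index of the two sides; both choices are assumptions made for illustration, as are the income values.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (v, weighted Gini) for the best test A < v versus A >= v."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    # Candidate cuts: midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        v = (low + high) / 2
        left = [label for x, label in pairs if x < v]
        right = [label for x, label in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (v, score)
    return best

# Hypothetical taxable incomes (in thousands) and class labels
incomes = [60, 70, 75, 85, 90, 95]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_binary_split(incomes, cheat))
```

Every candidate cut requires a pass over the records, which is why this approach can be more compute intensive than a one-off discretization.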

33
How to Determine the Best Split
Before splitting: 10 records of class 0 and 10
records of class 1.
Which test condition is the best?
34
How to determine the Best Split
  • Greedy approach: split based on the number of
    elements of each class we have at each node
  • We prefer nodes containing a single class
  • Need a measure of node impurity

High degree of impurity
Low degree of impurity
Impurity intuitively measures how well the two
classes are separated
35
Measures of Node Impurity
  • Gini Index (today)
  • Employed by the CART algorithm
  • Entropy (today)
  • Employed by the ID3 and C4.5 algorithms
  • Misclassification error (described in the textbook and references)
  • Gain Ratio (described in the textbook and references)
36
Measure of Impurity: GINI
  • Gini Index for a given node t:
    $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
  • (NOTE: $p(j \mid t)$ is the relative frequency of
    class j at node t.)
  • Maximum ($1 - 1/n_c$, where $n_c$ is the number of
    classes) when records are equally
    distributed among all classes, implying least
    interesting information
  • Minimum (0.0) when all records belong to one
    class, implying most interesting information

37
Examples for computing GINI
$P(C_1) = 0/6 = 0$, $P(C_2) = 6/6 = 1$: $Gini = 1 - P(C_1)^2 - P(C_2)^2 = 1 - 0 - 1 = 0$
$P(C_1) = 1/6$, $P(C_2) = 5/6$: $Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278$
$P(C_1) = 2/6$, $P(C_2) = 4/6$: $Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444$
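
The three examples can be checked with a small Python function that takes the class counts at a node. The function name and the use of counts rather than probabilities are choices made for this sketch.

```python
def gini(class_counts):
    """Gini index of a node, given the number of records in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Class counts (C1, C2) for the three example nodes with 6 records each
for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(gini(counts), 3))

# Equal distribution over two classes gives the maximum 1 - 1/2
print((3, 3), gini((3, 3)))
```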
38
Alternative Splitting Criteria Based on INFO
  • Entropy at a given node t:
    $Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$
  • (NOTE: $p(j \mid t)$ is the relative frequency of
    class j at node t.)
  • Measures homogeneity of a node.
  • Maximum ($\log_2 n_c$) when records are equally
    distributed among all classes, implying least
    information
  • Minimum (0.0) when all records belong to one
    class, implying most information
  • Entropy-based computations are similar to the
    GINI index computations

39
Examples for computing Entropy
$P(C_1) = 0/6 = 0$, $P(C_2) = 6/6 = 1$: $Entropy = -0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0$
$P(C_1) = 1/6$, $P(C_2) = 5/6$: $Entropy = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65$
$P(C_1) = 2/6$, $P(C_2) = 4/6$: $Entropy = -(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92$
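
The entropy examples can be reproduced the same way. The sketch below uses the usual convention that a class with zero records contributes nothing (0 log 0 is taken as 0); the function name and the count-based interface are choices made for this illustration.

```python
import math

def entropy(class_counts):
    """Entropy (base 2) of a node, given the number of records in each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # convention: 0 * log2(0) contributes 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Class counts (C1, C2) for the three example nodes with 6 records each
for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(entropy(counts), 2))

# Equal distribution over two classes gives the maximum log2(2) = 1
print((3, 3), entropy((3, 3)))
```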
40
Model Overfitting
  • Classification model errors
  • Training
  • Generalization
  • A good model must have both a low training error
    and a low generalization error
  • Overfitting occurs when the generalization error
    rate begins to increase even though the training
    error rate continues decreasing (the model learns
    the training set too well). Causes of this effect are
  • Presence of noise
  • Lack of representative examples

41
Text Mining
  • Text mining: the application of data mining to
    unstructured or semi-structured text files. It
    entails the generation of meaningful numerical
    indices from the unstructured text and then
    processing these indices using various data
    mining algorithms
  • Text mining helps organizations
  • Find the hidden content of documents, including
    additional useful relationships
  • Relate documents across previously unnoticed
    divisions
  • Group documents by common themes

42
Text Mining
  • Applications of text mining
  • Automatic detection of e-mail spam or phishing
    through analysis of the document content
  • Automatic processing of messages or e-mails to
    route a message to the most appropriate party to
    process that message
  • Analysis of warranty claims, help desk
    calls/reports, and so on to identify the most
    common problems and relevant responses
  • Analysis of related scientific publications in
    journals to create an automated summary view of a
    particular discipline
  • Creation of a relationship view of a document
    collection
  • Qualitative analysis of documents to detect
    deception

43
Text Mining
  • How to mine text
  • Eliminate commonly used words (stop-words)
  • Replace words with their stems or roots (stemming
    algorithms)
  • Consider synonyms and phrases
  • Calculate the weights of the remaining terms
  • You'll learn the details of these methods in your
    information retrieval course next semester
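
A minimal Python sketch of these steps is given below, under loud assumptions: the stop-word list is a tiny illustrative set, `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm, synonyms and phrases are not handled, and the weight of a term is simply its relative frequency in the document.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_weights(text):
    """Return {stem: relative frequency} after stop-word removal and stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

weights = term_weights("The miners are mining the mined data")
print(weights)
```

The resulting numerical indices are what a data mining algorithm would then process.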

44
Web Mining
  • Web mining: the discovery and analysis of
    interesting and useful information from the Web,
    about the Web, and usually through Web-based
    tools

45
Web Mining
  • Uses for Web mining
  • Determine the lifetime value of clients
  • Design cross-marketing strategies across products
  • Evaluate promotional campaigns
  • Target electronic ads and coupons at user groups
  • Predict user behavior
  • Present dynamic information to users

46
Data Mining Software
  • Open Source
  • WEKA (weka.org): a widely used framework written in
    Java that contains many data mining
    algorithms and tools, such as
  • data pre-processing,
  • classification,
  • regression,
  • clustering,
  • association rules,
  • visualization
  • machine learning algorithms
  • Rough sets based
  • RSES (Java, C++; no source code available)
  • ROSETTA (employs RSES, but some C++ source code is
    available)

47
Challenges of Data Mining
  • Scalability
  • Dimensionality
  • Complex and Heterogeneous Data
  • Data Quality
  • Data Ownership and Distribution
  • Privacy Preservation
  • Streaming Data

48
Summary
  • What were the main points of the lecture?

49
Miniprojects
  • Individual. Two kinds:
  • Argue for an IT solution to a problem that a
    known or invented (but realistic) company faces.
  • Survey (of at least two papers) on one of these
    subjects:
  • Decision support systems methods or techniques
  • Data mining in decision support systems
  • Data warehousing in decision support systems
  • If you choose to write a small survey, consult
    papers from journals or conferences published by
  • Springer, Elsevier, ACM, or IEEE
  • Use the auboline system to look for and get these
    papers
  • http://www.aub.aau.dk/portal/js_pane/forside/article/226

50
Today