Data Mining : Intelligent Data Analysis for Knowledge Discovery Prof' Yike Guo Dept' of Computing Im

About This Presentation

Slides: 39

Provided by: IFPC

Transcript and Presenter's Notes

1
Data Mining: Intelligent Data Analysis for Knowledge Discovery
Prof. Yike Guo, Dept. of Computing, Imperial College
2
Course Overview

Goal:
Basic Concepts of Data Mining
Data Mining Techniques
Data Mining Applications
Future Research Trends on Data Mining
Reference Books:
Data Mining: Concepts and Techniques. Jiawei Han and Micheline Kamber.
Advances in Knowledge Discovery and Data Mining. U. M. Fayyad and G. Piatetsky-Shapiro. AAAI/MIT Press, 1996.
Predictive Data Mining: A Practical Guide. Sholom M. Weiss and Nitin Indurkhya. Morgan Kaufmann Publishers, Inc., 1997.
Intelligent Data Analysis. Springer, 1999.
Post-genome Informatics. Minoru Kanehisa. Oxford University Press, 2000.

3
What does the data say?
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
4
Turning Data into Knowledge
5
What does the data say?
7
What Is Data Mining?

Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Alternative names and their inside stories:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs.

Data: a set of facts F (records in a database).
Pattern: an expression E in a language L describing the data in a subset F_E of F, where E is simpler than the enumeration of all the facts in F_E. F_E is also called a class, and E is also called a model or knowledge.
Data mining process: data mining is a multi-step process involving multiple choices, iteration and evaluation. It is non-trivial since there is no closed-form solution, and it always involves intensive search.
Validity: E is true (with high probability) for F.
Usefulness: patterns are not trivial inductive properties of the data.
Understandability: patterns should be understandable by data owners, to aid in understanding the data/domain.

9
Why Data Mining

Limitations of traditional database querying:
Most queries of interest to data owners are difficult to state in a query language:
"find me all records indicating fraud" => "tell me the characteristics of fraud" (summarisation problem)
"find me who is likely to buy product X" (classification problem)
"find all records that are similar to records in table X" (clustering problem)
Supporting analysis and decision making using traditional (SQL) queries becomes infeasible (the query formulation problem).

10
Relational Database Revisited

Terabyte databases, consisting of billions of records, are becoming common.
The relational data model is the de facto standard.
A relational database: a set of relations.
A relation: a set of homogeneous tuples.
Relations are created, updated and queried using SQL.
Query: keyword-based search
SELECT telephone_number
FROM telephone_book
WHERE last_name = 'Smith'
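
To make the keyword-based query concrete, here is a minimal sketch that runs it against an in-memory SQLite database using Python's standard sqlite3 module. The two-column table layout, the sample rows, and the helper name numbers_for are assumptions made for illustration; the slide gives only the query itself.

```python
import sqlite3

# In-memory database; the table layout and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telephone_book (last_name TEXT, telephone_number TEXT)")
conn.executemany(
    "INSERT INTO telephone_book VALUES (?, ?)",
    [("Smith", "555-0101"), ("Jones", "555-0102"), ("Smith", "555-0103")],
)

def numbers_for(last_name):
    """Run the slide's lookup, passing the name as a bound parameter."""
    rows = conn.execute(
        "SELECT telephone_number FROM telephone_book "
        "WHERE last_name = ? ORDER BY telephone_number",
        (last_name,),
    )
    return [number for (number,) in rows]

smith_numbers = numbers_for("Smith")
print(smith_numbers)
```

The query can only return rows that match the stated condition exactly, which is the limitation the following slides build on.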

11
SQL: Relational Querying Language

Provides a well-defined set of operations: scan, join, insert, delete, sort, aggregate, union, difference.
Scan: applies a predicate P to relation R
  For each tuple tr from R
    if P(tr) is true, tr is inserted in the output stream
Join: composes two relations R and S
  For each tuple tr from R
    For each tuple ts from S
      if the join attribute of tr equals the join attribute of ts
        form an output tuple by concatenating tr and ts
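
The scan and join loops above translate almost line for line into Python. The sketch below assumes relations are lists of tuples and that join attributes are given as column positions; the sample relations and the function names are hypothetical.

```python
def scan(relation, predicate):
    """Scan: emit every tuple of the relation that satisfies the predicate."""
    return [t for t in relation if predicate(t)]

def nested_loop_join(r, s, r_attr, s_attr):
    """Join: pair tuples of r and s whose join attributes are equal."""
    output = []
    for tr in r:
        for ts in s:
            if tr[r_attr] == ts[s_attr]:
                # Output tuple is the concatenation of the two input tuples.
                output.append(tr + ts)
    return output

# Hypothetical relations: (paper_id, journal) and (paper_id, author).
papers = [(1, "Nature"), (2, "Science"), (3, "Cell")]
authors = [(1, "Ada"), (1, "Bob"), (3, "Cy")]

recent = scan(papers, lambda t: t[0] >= 2)
joined = nested_loop_join(papers, authors, 0, 0)
```

This nested-loop form does work proportional to the product of the two relation sizes; it is shown only because it mirrors the slide's pseudocode.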

12
Relational database. A table (relation) is a set, and the three basic table operations shown here (SELECT, PROJECT, JOIN) are extensions of the standard set operations.
[Figure: a paper table (Paper 1, Paper 2, ...) with columns MUID, Journal, Volume, Pages, Year, and an author table (Author 1-1, Author 1-2, ...) with columns MUID, Author, illustrating SELECT, PROJECT and JOIN.]
13
The Query Formulation Problem
Consider the query:
What kinds of weather conditions are suitable for playing tennis?

It is not solvable via query optimisation.
Such problems have not received much attention in the database field or in traditional statistical approaches.
These problems are inductive in nature: learning from data rather than searching the data.
The natural solution is a train-by-example approach that constructs inductive models as the answers.
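
One way to see what "learning from data" means here is to ask which single attribute of the tennis records best separates the Yes days from the No days. The sketch below uses information gain, the attribute-selection measure used by ID3-style decision tree learners. The slides do not name a specific method at this point, so treat the choice of measure as an assumption; the 14 records are the ones from the "What does the data say?" slide.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# The 14 tennis records; the last field of each row is Play Tennis.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(rows):
    """Shannon entropy (in bits) of the class label over the given rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(rows, index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder

gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))
```

On these records Outlook has the highest gain, partly because every Overcast day is a Yes day. No SQL query was asked to find that; it is induced from the examples.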

14
Why Data Mining Now

Data explosion
Business data: organisations such as supermarket chains, credit card companies, investment banks, government agencies, etc. routinely generate daily volumes of 100 MB of data.
Scientific data: scientific and remote sensing instruments collect data at rates of gigabytes per day, far beyond human analysis abilities.
Data wasting
Only a small portion (5-10%) of the collected data is ever analysed.
Data that may never be analysed continues to be collected, at great expense.
We are drowning in data, but starving for knowledge!

15
Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of the application.
Creating a target data set: data selection.
Data cleaning and preprocessing (may take 60% of the effort!).
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation.
Choosing the functions of data mining: summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s).
Data mining: search for patterns of interest.
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge.

16
Data Mining and Decision Support
Data Warehousing: create/select the target database
Sampling: choose data for building models
Data Cleaning: supply missing values, eliminate noisy data
Data Mining: choose data mining tasks, choose data mining methods to extract patterns/knowledge
Data Reduction and Projection: derive useful features, dimensionality reduction
Model Test and Evaluation: test the accuracy of the model, consistency check, model refinement
Machine Learning Technologies
Decision Support
17
Data Warehousing

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
A data warehouse is:
A decision support database that is maintained separately from the organization's operational databases.
It integrates data from multiple heterogeneous sources to support the continuing need for structured and/or ad-hoc queries, analytical reporting, and decision support.

18
Modeling Data Warehouses

Modeling data warehouses: dimensions and measurements
Star schema: a single object (fact table) in the middle connected radially to a number of objects (dimension tables).
Snowflake schema: a refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
Fact constellations: multiple fact tables share dimension tables.
Storage of selected summary tables:
Independent summary tables storing pre-aggregated data, e.g., total sales by product by year.
Encoding aggregated tuples in the same fact table and the same dimension tables.
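
A small star schema can be sketched with Python's standard sqlite3 module: one fact table in the middle, two dimension tables around it, and an independent summary table holding total sales by product by year. All table names, column names and rows below are invented for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
-- Fact table in the middle: one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Mouse', 'Accessories');
INSERT INTO dim_time VALUES (1, 1999, 1), (2, 1999, 2), (3, 2000, 1);
INSERT INTO fact_sales VALUES (1, 1, 1000.0), (1, 2, 1500.0), (2, 2, 20.0),
                              (2, 3, 25.0), (1, 3, 900.0);
-- Independent summary table: total sales by product by year, pre-aggregated.
CREATE TABLE summary_sales_by_product_year AS
SELECT p.name AS product, t.year AS year, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_time t ON t.time_id = f.time_id
GROUP BY p.name, t.year;
""")

summary = conn.execute(
    "SELECT product, year, total FROM summary_sales_by_product_year "
    "ORDER BY product, year"
).fetchall()
print(summary)
```

Queries against the summary table avoid re-joining and re-aggregating the fact table, at the cost of keeping the summary in step with new facts.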

19
Example of Star Schema
20
OLAP: On-Line Analytical Processing

A multidimensional, logical view of the data.
Interactive analysis of the data: drill, pivot, slice and dice, filter.
Summarization and aggregation at every dimension intersection.
Retrieval and display of data in 2-D or 3-D crosstabs, charts, and graphs, with easy pivoting of the axes.
Analytical modeling: deriving ratios, variance, etc., involving measurements or numerical data across many dimensions.
Forecasting, trend analysis, and statistical analysis.
Requirement: quick response to OLAP queries.
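
The interactive operations listed above can be illustrated on a handful of records. This is a rough sketch in plain Python, assuming hypothetical sales records and helper names; a real OLAP server would answer the same questions from precomputed aggregates rather than by rescanning the records.

```python
from collections import defaultdict

# Hypothetical sales records: (product, region, month, amount).
SALES = [
    ("TV", "North", "Jan", 10), ("TV", "South", "Jan", 7),
    ("TV", "North", "Feb", 12), ("PC", "North", "Jan", 5),
    ("PC", "South", "Feb", 9),
]
DIMS = {"product": 0, "region": 1, "month": 2}

def slice_(records, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in records if r[DIMS[dim]] == value]

def crosstab(records, row_dim, col_dim):
    """Sum the amount at every (row, column) intersection."""
    table = defaultdict(int)
    for r in records:
        table[(r[DIMS[row_dim]], r[DIMS[col_dim]])] += r[3]
    return dict(table)

by_product_region = crosstab(SALES, "product", "region")
# Pivot: the same view with the axes swapped.
by_region_product = crosstab(SALES, "region", "product")
# Slice on month, then tabulate what is left.
jan_only = crosstab(slice_(SALES, "month", "Jan"), "product", "region")
```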

21
OLAP Architecture

Logical architecture:
OLAP view: multidimensional and logical presentation of the data in the data warehouse/mart to the business user.
Data store technology: the technology options for how and where the data is stored.
Three service components:
data store services,
OLAP services, and
user presentation services.
Two data store architectures:
Multidimensional data store: multidimensional OLAP (MOLAP).
Relational data store: relational OLAP (ROLAP).

22
Multidimensional Data

Sales volume as a function of product, month, and region.

Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day
[Figure: a data cube with axes Product, Month, and Region]
23
Construction of Data Cubes
[Figure: a data cube over three dimensions: Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, sum), and Discipline (Comp_Method, Database, ..., sum). Highlighted cell: All Amount, Comp_Method, B.C.]
Each dimension contains a hierarchy of values for one attribute.
A cube cell stores aggregate values, e.g., count, sum, max, etc.
A sum cell stores dimension summation values.
Sparse-cube technology and MOLAP/ROLAP integration.
Chunk-based multi-way aggregation and single-pass computation.
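
To show what a sum cell holds, the sketch below builds a full cube over three dimensions by adding each base record into every cell it rolls up into. This is a naive full-cube computation for illustration only, not the chunk-based multi-way aggregation the slide mentions. The dimension values follow the figure; the counts and the name build_cube are made up.

```python
from collections import defaultdict
from itertools import product

ALL = "sum"  # marker for a dimension that has been summed out

# Hypothetical base records: (amount_band, province, discipline, count).
BASE = [
    ("0-20K", "B.C.", "Comp_Method", 3),
    ("20-40K", "B.C.", "Comp_Method", 2),
    ("0-20K", "Ontario", "Database", 4),
    ("40-60K", "Prairies", "Database", 1),
]

def build_cube(records, n_dims):
    """Add each record to all 2^n cells (one per group-by) that it contributes to."""
    cube = defaultdict(int)
    for *dims, measure in records:
        # Each dimension either keeps its value or is replaced by the sum marker.
        for mask in product([False, True], repeat=n_dims):
            cell = tuple(ALL if summed else d for d, summed in zip(dims, mask))
            cube[cell] += measure
    return dict(cube)

cube = build_cube(BASE, 3)
# The figure's highlighted cell: all amounts, B.C., Comp_Method.
print(cube[(ALL, "B.C.", "Comp_Method")])
```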
24
A Star-Net Query Model
25
Decision Support with Data Warehouse

Ad hoc queries. Q: How many customers do we have in London? A: 32776

Report and spreadsheet

OLAP. Q: What are the sales figures for Y in the different regions?

Statistics. Q: Is there a relation between age and buying behaviour? A: Older clients buy more.

Data mining. Q: What factors influence buying behaviour?

A1: Young men in sports cars buy 3 times as much audio equipment (clustering/regression).
A2: Older women with dark hair more often buy rinse (classification).
A3: Buyers of cars are also the buyers of houses (association).
30
Data Mining Functionalities (1)

Concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions.
Association (correlation and causality)
Multi-dimensional vs. single-dimensional association:
age(X, "20..29") AND income(X, "20..29K") -> buys(X, "PC") [support = 2%, confidence = 60%]
contains(T, "computer") -> contains(T, "software") [support = 1%, confidence = 75%]
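
The two bracketed numbers attached to each rule are support and confidence. A minimal sketch of how they are computed, assuming transactions are represented as Python sets and using invented baskets:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

rule_support = support(BASKETS, {"computer", "software"})
rule_confidence = confidence(BASKETS, {"computer"}, {"software"})
```

Here three of the five baskets contain both items and four contain a computer, so the rule computer -> software has support 3/5 and confidence 3/4. The confidence function assumes the antecedent occurs at least once; otherwise it would divide by zero.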

31
Data Mining Functionalities (2)

Classification and prediction
Finding models (functions) that describe and distinguish classes or concepts for future prediction.
E.g., classify countries based on climate, or classify cars based on gas mileage.
Presentation: decision tree, classification rule, neural network.
Prediction: predict some unknown or missing numerical values.
Cluster analysis
The class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns.
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
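
As one concrete instance of the clustering principle, the sketch below runs a k-means style loop on one-dimensional values: each value joins its nearest center, then each center moves to the mean of its members. The slides do not prescribe k-means, so this is one possible algorithm rather than the course's method; the house prices and starting centers are invented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd-style k-means on numbers: assign to the nearest center, then re-average."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands); the two starting centers are arbitrary.
prices = [100, 110, 120, 300, 310, 330]
centers, clusters = kmeans_1d(prices, centers=[100, 330])
```

Values inside each resulting group are close to one another and far from the other group, which is the intra-class versus inter-class trade-off stated above.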

32
Data Mining Functionalities (3)

Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data.
It can be considered noise or an exception, but it is quite useful in fraud detection and rare events analysis.
Trend and evolution analysis
Trend and deviation: regression analysis.
Sequential pattern mining, periodicity analysis.
Similarity-based analysis.
Other pattern-directed or statistical analyses
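
For the outlier definition above, a simple illustrative detector flags values that sit far from the median, measured in units of the median absolute deviation. The 0.6745 scaling and the 3.5 cutoff are a common rule of thumb for this "modified z-score" and are not taken from the slides; the call durations are made up.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical call durations in minutes; one call is far from the rest.
durations = [3, 4, 5, 4, 6, 5, 4, 120]
flagged = mad_outliers(durations)
```

The median is used instead of the mean so that the extreme value being tested does not distort the baseline it is compared against.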

33
Example Data Mining Applications

Commercial
Fraud detection: identify fraudulent transactions.
Loan approval: establish the creditworthiness of a customer requesting a loan.
Investment analysis: predict a portfolio's return on investment.
Marketing and sales data analysis: identify potential customers; establish the effectiveness of a sales campaign.
Medical
Drug effect analysis: learn drug effects from patient records.
Disease causality analysis.
Political policy
Election policy: people's voting patterns.
Social policy: tax/benefit policy.
Manufacturing
Manufacturing process analysis: identify the causes of manufacturing problems.
Experiment result analysis: summarise experiment results and create predictive models.

34
Market Analysis and Management (1)

Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
Target marketing
Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/correlations between product sales.
Prediction based on the association information.

35
Market Analysis and Management (2)

Customer profiling
Data mining can tell you what types of customers buy what products (clustering or classification).
Identifying customer requirements
Identify the best products for different customers.
Use prediction to find what factors will attract new customers.
Providing summary information
Various multidimensional summary reports.
Statistical summary information (data central tendency and variation).

36
Fraud Detection and Management (1)

Applications
Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
Use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances.
Examples
Auto insurance: detect a group of people who stage accidents to collect on insurance.
Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network).
Medical insurance: detect professional patients and rings of doctors and rings of references.

37
Fraud Detection and Management (2)

Detecting inappropriate medical treatment
The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
Detecting telephone fraud
Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud.
Retail
Analysts estimate that 38% of retail shrink is due to dishonest employees.