Title: Data Mining: Intelligent Data Analysis for Knowledge Discovery, Prof. Yike Guo, Dept. of Computing, Im
1. Data Mining: Intelligent Data Analysis for Knowledge Discovery
Prof. Yike Guo, Dept. of Computing, Imperial College
2. Course Overview
- Goal
- Basic Concepts of Data Mining
- Data Mining Techniques
- Data Mining Applications
- Future Research Trends on Data Mining
- Reference Books
  - Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
  - Advances in Knowledge Discovery and Data Mining, U. M. Fayyad and G. Piatetsky-Shapiro, AAAI/MIT Press, 1996
  - Predictive Data Mining: A Practical Guide, Sholom M. Weiss and Nitin Indurkhya, Morgan Kaufmann Publishers, Inc., 1997
  - Intelligent Data Analysis, Springer, 1999
  - Post-genome Informatics, Minoru Kanehisa, Oxford University Press, 2000
3. What does the data say?
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
4. Turning Data into Knowledge
5. What does the data say?
7. What Is Data Mining?
- Data mining (knowledge discovery in databases)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Alternative names and their inside stories
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining?
  - (Deductive) query processing
  - Expert systems or small ML/statistical programs
8.
- Data: a set of facts F (e.g., records in a database)
- Pattern: an expression E in a language L describing the data in a subset F_E of F, where E is simpler than the enumeration of all the facts in F_E. F_E is also called a class, and E is also called a model or knowledge.
- Data mining process: data mining is a multi-step process involving multiple choices, iteration and evaluation. It is non-trivial since there is no closed-form solution. It always involves intensive search.
- Validity: E is true (with high probability) for F
- Useful: patterns are not trivial inductive properties of the data
- Understandable: patterns should be understandable by data owners, to aid in understanding the data/domain
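The definitions above can be made concrete with a small sketch. It assumes the facts F are (day, outlook, play) triples taken from the PlayTennis table, and it uses one hypothetical pattern E, "if Outlook is Overcast then Play is Yes". The names `covers`, `holds` and `validity` are illustrative only, not part of the slides.

```python
# Facts F: (day, outlook, play) triples from the PlayTennis table.
F = [
    (1, "Sunny", "No"), (2, "Sunny", "No"), (3, "Overcast", "Yes"),
    (4, "Rain", "Yes"), (5, "Rain", "Yes"), (6, "Rain", "No"),
    (7, "Overcast", "Yes"), (8, "Sunny", "No"), (9, "Sunny", "Yes"),
    (10, "Rain", "Yes"), (11, "Sunny", "Yes"), (12, "Overcast", "Yes"),
    (13, "Overcast", "Yes"), (14, "Rain", "No"),
]


def covers(fact):
    """Antecedent of the hypothetical pattern E: the fact belongs to F_E."""
    return fact[1] == "Overcast"


def holds(fact):
    """Consequent of E: what the pattern claims about each fact in F_E."""
    return fact[2] == "Yes"


# F_E is the subset of F described by E; E is shorter than listing F_E.
F_E = [fact for fact in F if covers(fact)]

# Validity, estimated as the fraction of F_E for which E is true.
validity = sum(holds(fact) for fact in F_E) / len(F_E)
print(len(F_E), validity)
```

On this table the pattern covers four days and holds on all of them, so it is a compact description of that subset rather than an enumeration of it.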
9. Why Data Mining
- Limitation of traditional database querying
  - Most queries of interest to data owners are difficult to state in a query language
    - "find me all records indicating fraud" becomes "tell me the characteristics of fraud" (summarisation)
    - "find me who is likely to buy product X" (classification problem)
    - "find all records that are similar to records in table X" (clustering problem)
  - Supporting analysis and decision making using traditional (SQL) queries becomes infeasible (the query formulation problem).
10. Relational Database Revisited
- Terabyte databases, consisting of billions of records, are becoming common
- The relational data model is the de facto standard
  - A relational database: a set of relations
  - A relation: a set of homogeneous tuples
  - Relations are created, updated and queried using SQL
- Query: keyword-based search
    SELECT telephone_number
    FROM telephone_book
    WHERE last_name = 'Smith'
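To see the keyword-based query run end to end, here is a minimal sketch using Python's standard `sqlite3` module with an in-memory database. The column types and the three sample rows are assumptions made for illustration. Only the table and column names come from the slide.

```python
import sqlite3

# In-memory database; the schema and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telephone_book (last_name TEXT, telephone_number TEXT)"
)
conn.executemany(
    "INSERT INTO telephone_book VALUES (?, ?)",
    [("Smith", "555-0100"), ("Jones", "555-0101"), ("Smith", "555-0102")],
)

# The query from the slide, with the search key passed as a parameter.
rows = conn.execute(
    "SELECT telephone_number FROM telephone_book "
    "WHERE last_name = ? ORDER BY telephone_number",
    ("Smith",),
).fetchall()
numbers = [row[0] for row in rows]
print(numbers)
```

The query can only return rows whose key matches exactly. It cannot say anything about what Smith's records have in common, which is the limitation the slides go on to discuss.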
11. SQL: Relational Querying Language
- Provides a well-defined set of operations: scan, join, insert, delete, sort, aggregate, union, difference
- Scan: applies a predicate P to relation R
    For each tuple tr from R
      if P(tr) is true, tr is inserted in the output stream
- Join: composes two relations R and S
    For each tuple tr from R
      For each tuple ts from S
        if the join attribute of tr equals the join attribute of ts
          form an output tuple by concatenating tr and ts
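The scan and join pseudocode translates almost line for line into Python. The sketch below assumes relations are lists of tuples and join attributes are given by position. The `papers` and `authors` rows are made-up sample data, and real database engines usually use more efficient join algorithms than this nested loop.

```python
def scan(R, P):
    """Yield every tuple tr of relation R for which predicate P(tr) is true."""
    for tr in R:
        if P(tr):
            yield tr


def nested_loop_join(R, S, r_attr, s_attr):
    """Yield tr + ts for every pair whose join attributes are equal."""
    for tr in R:
        for ts in S:
            if tr[r_attr] == ts[s_attr]:
                yield tr + ts  # concatenate the two tuples


# Hypothetical relations: papers(id, journal, year) and authors(paper_id, name)
papers = [(1, "Journal A", 1999), (2, "Journal B", 2000)]
authors = [(1, "Author 1-1"), (1, "Author 1-2"), (2, "Author 2-1")]

recent = list(scan(papers, lambda t: t[2] >= 2000))
joined = list(nested_loop_join(papers, authors, 0, 0))
print(recent)
print(joined)
```

The join compares every tuple of R with every tuple of S, so its cost grows with the product of the two relation sizes.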
12. Relational database. A table (relation) is a set, and the three basic table operations shown here (SELECT, PROJECT, JOIN) are extensions of the standard set operations.
[Figure: a table of papers (rows Paper 1, Paper 2, Paper 3, Paper 4, ...; columns MUID, Journal, Volume, Pages, Year) and a table of authors (rows Author 1-1, Author 1-2, Author 2-1, ...; columns MUID, Author), illustrating SELECT, PROJECT and JOIN.]
13. The Query Formulation Problem
Consider the query:
What kinds of weather conditions are suitable for playing tennis?
- It is not solvable via query optimisation
- It has not received much attention in the database field or in traditional statistical approaches
- These problems are inductive in nature: learning from data rather than searching the data
- The natural solution is a train-by-example approach that constructs inductive models as the answers
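One common first step in building such an inductive model is to ask which attribute best separates the "Yes" days from the "No" days. The sketch below uses information gain, the attribute-selection measure of decision-tree learners such as ID3. That choice is an assumption made here for illustration, since the slide does not name a method. The data is the 14-day PlayTennis table.

```python
from collections import Counter
from math import log2

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
ROWS = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
data = [dict(zip(COLUMNS, line.split())) for line in ROWS.strip().splitlines()]


def entropy(rows):
    """Shannon entropy (in bits) of the Play label over the given rows."""
    counts = Counter(r["Play"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def information_gain(rows, attr):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {a: information_gain(data, a) for a in COLUMNS[:-1]}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))  # Outlook 0.247
```

Outlook gives the largest gain, so a decision tree learned from this table would test Outlook first. The answer is a model induced from examples, not a set of retrieved rows.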
14. Why Data Mining Now
- Data explosion
  - Business data: organisations such as supermarket chains, credit card companies, investment banks, government agencies, etc. routinely generate daily volumes of 100 MB of data
  - Scientific data: scientific and remote sensing instruments collect data at rates of gigabytes per day, far beyond human analysis abilities
- Data wasting
  - Only a small portion (5 - 10%) of the collected data is ever analysed
  - Data that may never be analysed continues to be collected, at great expense
- We are drowning in data, but starving for knowledge!
15. Steps of a KDD Process
- Learning the application domain
  - relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation
  - find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining
  - summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
16. Data Mining and Decision Support
- Data Warehousing: create/select the target database
- Sampling: choose data for building models
- Data Cleaning: supply missing values, eliminate noisy data
- Data Mining: choose data mining tasks, choose data mining methods to extract patterns/knowledge
- Data Reduction and Projection: derive useful features, dimensionality reduction
- Model Test and Evaluation: test the accuracy of the model, consistency check, model refinement
- Machine Learning Technologies
- Decision Support
17. Data Warehousing
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- A data warehouse is
  - A decision support database that is maintained separately from the organization's operational databases.
  - It integrates data from multiple heterogeneous sources to support the continuing need for structured and/or ad-hoc queries, analytical reporting, and decision support.
18. Modeling Data Warehouses
- Modeling data warehouses: dimensions and measurements
  - Star schema: a single object (fact table) in the middle connected to a number of objects (dimension tables) radially.
  - Snowflake schema: a refinement of the star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
  - Fact constellations: multiple fact tables share dimension tables.
- Storage of selected summary tables
  - Independent summary table storing pre-aggregated data, e.g., total sales by product by year.
  - Encoding aggregated tuples in the same fact table and the same dimension tables.
19. Example of Star Schema
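Since the slide's figure is not available, here is a small stand-in sketch of a star schema using Python's `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_time`), their columns, and all rows are hypothetical. They are chosen only to show one fact table joined to two dimension tables and aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical): descriptive attributes
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- Fact table in the middle: foreign keys plus the numeric measure
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product,
                          time_id    INTEGER REFERENCES dim_time,
                          amount     REAL);
INSERT INTO dim_product VALUES (1, 'Radio', 'Audio'), (2, 'Speaker', 'Audio'), (3, 'Lamp', 'Home');
INSERT INTO dim_time    VALUES (1, 1999, 'Q1'), (2, 2000, 'Q1');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 1, 20.0),
                               (1, 2, 70.0), (3, 2, 30.0);
""")

# Total sales by product category by year: join the fact table to each dimension.
totals = conn.execute("""
    SELECT p.category, t.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_time    AS t ON t.time_id = f.time_id
    GROUP BY p.category, t.year
    ORDER BY p.category, t.year
""").fetchall()
print(totals)
```

Storing the result of such a query would give the kind of pre-aggregated summary table mentioned on the previous slide.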
20. OLAP: On-Line Analytical Processing
- A multidimensional, LOGICAL view of the data.
- Interactive analysis of the data: drill, pivot, slice and dice, filter.
- Summarization and aggregations at every dimension intersection.
- Retrieval and display of data in 2-D or 3-D crosstabs, charts, and graphs, with easy pivoting of the axes.
- Analytical modeling: deriving ratios, variance, etc., and involving measurements or numerical data across many dimensions.
- Forecasting, trend analysis, and statistical analysis.
- Requirement: quick response to OLAP queries.
21. OLAP Architecture
- Logical architecture
  - OLAP view: multidimensional and logical presentation of the data in the data warehouse/mart to the business user.
  - Data store technology: the technology options of how and where the data is stored.
- Three services components
  - data store services
  - OLAP services, and
  - user presentation services.
- Two data store architectures
  - Multidimensional data store (MOLAP).
  - Relational data store: Relational OLAP (ROLAP).
22. Multidimensional Data
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month, Week > Day
[Figure: data cube with axes Product, Month and Region.]
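Summarising along these hierarchies is usually called a roll-up. The sketch below assumes a handful of made-up sales facts recorded at the (product, city, month) level and rolls them up to (product, country, quarter). The `CITY_TO_COUNTRY` mapping and the function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, city, month as "YYYY-MM", volume)
sales = [
    ("PC", "London", "2000-01", 5), ("PC", "London", "2000-02", 7),
    ("PC", "Paris", "2000-04", 3), ("TV", "London", "2000-01", 2),
    ("TV", "Paris", "2000-05", 4),
]
CITY_TO_COUNTRY = {"London": "UK", "Paris": "France"}


def quarter(month):
    """Map a "YYYY-MM" month to its quarter, e.g. "2000-04" -> "2000-Q2"."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def roll_up(facts, product_level, location_level, time_level):
    """Sum volume after mapping each dimension value to a coarser level."""
    totals = defaultdict(int)
    for product, city, month, volume in facts:
        key = (product_level(product), location_level(city), time_level(month))
        totals[key] += volume
    return dict(totals)


by_country_quarter = roll_up(sales, lambda p: p, CITY_TO_COUNTRY.get, quarter)
all_products = roll_up(sales, lambda p: "ALL", CITY_TO_COUNTRY.get, quarter)
print(by_country_quarter)
```

Passing a function that maps every value to a single label, as in `all_products`, collapses that dimension entirely.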
23. Construction of Data Cubes
[Figure: data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, ..., sum) and Discipline (Comp_Method, Database, ..., sum); one labelled cell is "All Amount, Comp_Method, B.C.".]
- Each dimension contains a hierarchy of values for one attribute
- A cube cell stores aggregate values, e.g., count, sum, max, etc.
- A sum cell stores dimension summation values.
- Sparse-cube technology and MOLAP/ROLAP integration.
- Chunk-based multi-way aggregation and single-pass computation.
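To illustrate what cube cells and sum cells hold, the following sketch builds a full cube over three dimensions in the most naive way: every record contributes to each cell obtained by keeping or collapsing each of its dimension values. This is not the chunk-based multi-way aggregation named on the slide, and the records and counts are invented for the example.

```python
from collections import defaultdict
from itertools import product

# Hypothetical records: (amount_band, province, discipline, count)
records = [
    ("0-20K", "B.C.", "Comp_Method", 3),
    ("20-40K", "B.C.", "Comp_Method", 2),
    ("0-20K", "Ontario", "Database", 4),
    ("40-60K", "Prairies", "Database", 1),
]
SUM = "sum"  # marker for a dimension that has been summed out


def build_cube(rows, n_dims):
    """Aggregate the measure into every cell of the full data cube."""
    cube = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each dimension either keeps its value or collapses to the sum cell.
        for cell in product(*[(value, SUM) for value in dims]):
            cube[cell] += measure
    return dict(cube)


cube = build_cube(records, 3)
print(cube[(SUM, "B.C.", "Comp_Method")])  # the "All Amount, Comp_Method, B.C." cell
print(cube[(SUM, SUM, SUM)])               # grand total
```

Each record touches 2^n cells for n dimensions, which is why practical systems rely on sparse storage and smarter aggregation orders.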
24. A Star-Net Query Model
25. Decision Support with Data Warehouse
- Ad hoc queries. Q: How many customers do we have in London? A: 32776
26.
27.
- OLAP. Q: What are the sales figures for Y in the different regions?
28.
- Statistics. Q: Is there a relation between age and buying behaviour? A: Older clients buy more
29.
- Data mining. Q: What factors influence buying behaviour?
  - A1: Young men in sports cars buy 3 times as much audio equipment (clustering/regression)
  - A2: Older women with dark hair more often buy rinse (classification)
  - A3: Buyers of cars are also the buyers of houses (association)
30. Data Mining Functionalities (1)
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(X, "20..29") and income(X, "20..29K") -> buys(X, "PC") [support = 2%, confidence = 60%]
  - contains(T, "computer") -> contains(T, "software") [support = 1%, confidence = 75%]
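Support and confidence for a rule such as computer -> software can be computed directly from a list of transactions. The sketch below uses five made-up transactions, so its numbers differ from the percentages quoted on the slide. Support is the fraction of transactions containing all items of the rule, and confidence is that support divided by the support of the antecedent.

```python
# Hypothetical market-basket transactions, each a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )


rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

Here two of the five transactions contain both items, and two of the three transactions with a computer also contain software.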
31. Data Mining Functionalities (2)
- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rule, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering is based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity
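As one possible illustration of the clustering principle, here is a minimal one-dimensional k-means sketch. K-means is chosen here only as a familiar example (the slide does not name an algorithm), and the data points and starting centroids are made up.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilised
            break
        centroids = new_centroids
    return centroids, clusters


values = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(values, centroids=[1, 12])
print(centroids, clusters)
```

Points within each resulting group are close to one another and far from the other group, which is the intra-class versus interclass trade-off stated above.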
32. Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
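A very simple way to flag objects that do not comply with the general behaviour of the data is a z-score test. The sketch below is one assumed approach among many. The threshold of 2.0 and the sample call durations are arbitrary illustrative choices.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than threshold std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes; one call is unusually long.
call_minutes = [10, 11, 9, 10, 12, 10, 50]
outliers = zscore_outliers(call_minutes)
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, robust alternatives based on the median are often preferred on real data.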
33. Example Data Mining Applications
- Commercial
  - Fraud detection: identify fraudulent transactions
  - Loan approval: establish the creditworthiness of a customer requesting a loan
  - Investment analysis: predict a portfolio's return on investment
  - Marketing and sales data analysis: identify potential customers, establish the effectiveness of a sales campaign
- Medical
  - Drug effect analysis: learn drug effects from patient records
  - Disease causality analysis
- Political policy
  - Election policy: people's voting patterns
  - Social policy: tax/benefit policy
- Manufacturing
  - Manufacturing process analysis: identify the causes of manufacturing problems
  - Experiment result analysis: summarise experiment results and create predictive models
34. Market Analysis and Management (1)
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
35. Market Analysis and Management (2)
- Customer profiling
  - data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - identifying the best products for different customers
  - use prediction to find what factors will attract new customers
- Provides summary information
  - various multidimensional summary reports
  - statistical summary information (data central tendency and variation)
36. Fraud Detection and Management (1)
- Applications
  - widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
  - use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
  - auto insurance: detect a group of people who stage accidents to collect on insurance
  - money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - medical insurance: detect professional patients and rings of doctors and rings of references
37. Fraud Detection and Management (2)
- Detecting inappropriate medical treatment
  - The Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (saving Australian $1m/yr).
- Detecting telephone fraud
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
- Retail
  - Analysts estimate that 38% of retail shrink is due to dishonest employees.
38. Related Fields
- Machine learning: inductive reasoning
- Statistics: sampling, statistical inference, error estimation
- Pattern recognition: neural networks, clustering
- Knowledge acquisition, statistical expert systems
- Data visualisation
- Databases: OLAP, parallel DBMS, deductive databases
- Data warehousing: collection and cleaning of transactional data for on-line retrieval