Data Mining : Intelligent Data Analysis for Knowledge Discovery Prof' Yike Guo Dept' of Computing Im

Slides: 39
Provided by: IFPC

1
Data Mining: Intelligent Data Analysis for
Knowledge Discovery
Prof. Yike Guo
Dept. of Computing
Imperial College
2
Course Overview
  • Goal
  • Basic Concepts of Data Mining
  • Data Mining Techniques
  • Data Mining Applications
  • Future Research Trends on Data Mining
  • Reference Books
  • Data Mining: Concepts and Techniques, Jiawei Han
    and Micheline Kamber
  • Advances in Knowledge Discovery and Data Mining,
    U. M. Fayyad and G. Piatetsky-Shapiro, AAAI/MIT
    Press, 1996
  • Predictive Data Mining: A Practical Guide, Sholom
    M. Weiss and Nitin Indurkhya, Morgan Kaufmann
    Publishers, Inc., 1997
  • Intelligent Data Analysis, Springer, 1999
  • Post-genome Informatics, Minoru Kanehisa,
    Oxford University Press, 2000

3
What does the data say?
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
4
Turning Data into Knowledge
5
What does the data say?
7
What Is Data Mining?
  • Data mining (knowledge discovery in databases)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    information or patterns from data in large
    databases
  • Alternative names and their inside stories
  • Data mining: a misnomer?
  • Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data
    archeology, data dredging, information
    harvesting, business intelligence, etc.
  • What is not data mining?
  • (Deductive) query processing.
  • Expert systems or small ML/statistical programs

8
  • Data: a set of facts F (records in a database)
  • Pattern: an expression E in a language L
    describing data in a subset F_E of F, where E is
    simpler than the enumeration of all the facts
    of F_E. F_E is also called a class and E is
    also called a model or knowledge.
  • Data mining process: data mining is a multi-step
    process involving multiple choices, iteration and
    evaluation. It is non-trivial since there is no
    closed-form solution. It always involves intensive
    search.
  • Validity: E is true (with high probability) for
    F
  • Useful: patterns are not trivial inductive
    properties of data
  • Understandable: patterns should be understandable
    by data owners to aid in understanding the
    data/domain
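
As a rough illustration of these definitions, the sketch below treats F as a list of records and a pattern E as a rule with a condition and a conclusion. The subset of F that E describes is written F_E in the code, and validity is the fraction of that subset for which the conclusion holds. The record fields, the sample values and the helper names (`covered`, `validity`) are assumptions made for this example only.

```python
# F: a small set of facts (records). Field names and values are illustrative.
F = [
    {"outlook": "Sunny", "humidity": "High", "play": "No"},
    {"outlook": "Sunny", "humidity": "Normal", "play": "Yes"},
    {"outlook": "Overcast", "humidity": "High", "play": "Yes"},
    {"outlook": "Overcast", "humidity": "Normal", "play": "Yes"},
    {"outlook": "Rain", "humidity": "High", "play": "No"},
    {"outlook": "Rain", "humidity": "Normal", "play": "Yes"},
]


def covered(facts, condition):
    """Return F_E: the subset of facts described by the pattern's condition."""
    return [f for f in facts if condition(f)]


def validity(facts, condition, conclusion):
    """Fraction of F_E for which the pattern's conclusion holds."""
    subset = covered(facts, condition)
    if not subset:
        return 0.0
    return sum(1 for f in subset if conclusion(f)) / len(subset)


def is_normal(f):
    return f["humidity"] == "Normal"


def is_high(f):
    return f["humidity"] == "High"


def plays(f):
    return f["play"] == "Yes"


# Pattern E: "if humidity is Normal then play is Yes".
F_E = covered(F, is_normal)
print(len(F_E), validity(F, is_normal, plays))
```

The one-line rule stands in for several records at once, which is the sense in which E is simpler than enumerating the facts of F_E.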

9
Why Data Mining
  • Limitations of traditional database querying
  • Most queries of interest to data owners are
    difficult to state in a query language
  • "find me all records indicating fraud" => "tell
    me the characteristics of fraud" (summarisation)
  • "find me who is likely to buy product X"
    (classification problem)
  • "find all records that are similar to records in
    table X" (clustering problem)
  • Supporting analysis and decision making with
    traditional (SQL) queries becomes infeasible
    (the query formulation problem).

10
Relational Database Revisited
  • Terabyte databases, consisting of billions of
    records, are becoming common
  • The relational data model is the de facto standard
  • A relational database: a set of relations
  • A relation: a set of homogeneous tuples
  • Relations are created, updated and queried using
    SQL
  • Query: keyword-based search
  • SELECT telephone_number
  • FROM telephone_book
  • WHERE last_name = 'Smith'
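
The telephone book query can be tried with Python's built-in `sqlite3` module. This is only a sketch: the table layout (two text columns) and the sample rows are assumptions, since the slide gives only the query itself.

```python
import sqlite3

# In-memory database; the schema and the rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telephone_book (last_name TEXT, telephone_number TEXT)"
)
conn.executemany(
    "INSERT INTO telephone_book VALUES (?, ?)",
    [("Smith", "555-0101"), ("Jones", "555-0102"), ("Smith", "555-0103")],
)

# Keyword-based search: return the numbers of everyone named Smith.
rows = conn.execute(
    "SELECT telephone_number FROM telephone_book WHERE last_name = ?",
    ("Smith",),
).fetchall()
numbers = sorted(row[0] for row in rows)
print(numbers)
conn.close()
```

The query can only return records that match the stated keyword. It cannot say what the matching records have in common.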

11
SQL: Relational Querying Language
  • Provides a well-defined set of operations: scan,
    join, insert, delete, sort, aggregate, union,
    difference
  • Scan -- applies a predicate P to relation R
  • For each tuple tr from R
  • if P(tr) is true, tr is inserted into the output
    stream
  • Join -- composes two relations R and S
  • For each tuple tr from R
  • For each tuple ts from S
  • if the join attribute of tr equals the join
    attribute of ts
  • form an output tuple by concatenating tr and
    ts
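
The scan and join loops above translate almost line for line into Python. The sketch below assumes tuples are plain Python tuples with attributes addressed by position. The sample relations and the function names (`scan`, `nested_loop_join`) are made up for illustration. A real database engine would normally use faster join strategies than this nested loop.

```python
def scan(relation, predicate):
    """Scan: yield every tuple tr of the relation for which P(tr) is true."""
    for tr in relation:
        if predicate(tr):
            yield tr


def nested_loop_join(r, s, r_attr, s_attr):
    """Join: concatenate each pair (tr, ts) whose join attributes are equal."""
    for tr in r:
        for ts in s:
            if tr[r_attr] == ts[s_attr]:
                yield tr + ts


# Illustrative relations: (id, journal, year) and (paper id, author).
papers = [(1, "Journal A", 1999), (2, "Journal B", 2000)]
authors = [(1, "Author 1-1"), (1, "Author 1-2"), (2, "Author 2-1")]

recent = list(scan(papers, lambda t: t[2] >= 2000))
joined = list(nested_loop_join(papers, authors, 0, 0))
print(recent)
print(joined)
```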

12
Relational database. A table (relation) is a set,
and the three basic table operations shown here
are extensions of the standard set operations.
[Figure: SELECT, PROJECT and JOIN illustrated on a
paper table (MUID, Journal, Volume, Pages, Year;
rows Paper 1, Paper 2, ...) and an author table
(MUID, Author; rows Author 1-1, Author 1-2, ...)]
13
The Query Formulation Problem
Consider the query:
"What kinds of weather conditions are suitable for
playing tennis?"
  • It is not solvable via query optimisation
  • It has not received much attention in the database
    field or in traditional statistical approaches
  • Such problems are inductive in nature:
    learning from data rather than searching the data
  • The natural solution is a train-by-example approach
    that constructs inductive models as the answers
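
One way to see what "learning from data" means here is to ask which attribute of the Play Tennis table from the earlier slide best separates the Yes days from the No days. The sketch below scores each attribute by information gain, the criterion used by decision tree learners such as ID3. The slides do not name a specific algorithm at this point, so this choice is an assumption, as are the function names.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

# The 14 Play Tennis records; the last field is the class label.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(rows):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(rows, index):
    """Entropy reduction obtained by splitting rows on the attribute at index."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for name in ATTRS:
    print(f"{name}: {gains[name]:.3f}")
print("best split:", best)
```

On this table Outlook has the largest gain, so a tree learner would split on it first. The result is a model induced from examples, not a record set retrieved by a query.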

14
Why Data Mining Now
  • Data explosion
  • Business data: organisations such as supermarket
    chains, credit card companies, investment banks,
    government agencies, etc. routinely generate
    daily volumes of 100MB of data
  • Scientific data: scientific and remote sensing
    instruments collect data at rates of
    gigabytes per day, far beyond human analysis
    abilities.
  • Data wasting
  • Only a small portion (5-10%) of the collected
    data is ever analysed
  • Data that may never be analysed continues to be
    collected, at great expense.
  • We are drowning in data, but starving for
    knowledge!

15
Steps of a KDD Process
  • Learning the application domain
  • relevant prior knowledge and goals of application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of
    effort!)
  • Data reduction and transformation
  • Find useful features, dimensionality/variable
    reduction, invariant representation.
  • Choosing functions of data mining
  • summarization, classification, regression,
    association, clustering.
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant
    patterns, etc.
  • Use of discovered knowledge

16
Data Mining and Decision Support
Data Warehousing: create/select target database
Sampling: choose data for building models
Data Cleaning: supply missing values, eliminate
noisy data
Data Mining: choose data mining tasks, choose data
mining methods to extract patterns/knowledge
Data Reduction and Projection: derive useful
features, dimensionality reduction
Model Test and Evaluation: test the accuracy of
the model, consistency check, model refinement
Machine Learning Technologies
Decision Support
17
Data Warehousing
  • "A data warehouse is a subject-oriented,
    integrated, time-variant, and nonvolatile
    collection of data in support of management's
    decision-making process." --- W. H. Inmon
  • A data warehouse is:
  • A decision support database that is maintained
    separately from the organization's operational
    databases.
  • It integrates data from multiple heterogeneous
    sources to support the continuing need for
    structured and/or ad-hoc queries, analytical
    reporting, and decision support.

18
Modeling Data Warehouses
  • Modeling data warehouses: dimensions and
    measurements
  • Star schema: a single object (fact table) in the
    middle connected to a number of objects
    (dimension tables) radially.
  • Snowflake schema: a refinement of the star schema
    where the dimensional hierarchy is represented
    explicitly by normalizing the dimension tables.
  • Fact constellations: multiple fact tables share
    dimension tables.
  • Storage of selected summary tables
  • Independent summary table storing pre-aggregated
    data, e.g., total sales by product by year.
  • Encoding aggregated tuples in the same fact table
    and the same dimension tables.
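
A minimal star schema, with an independent summary table of total sales by product by year, might look like the sketch below. It uses Python's built-in `sqlite3` module. All table names, column names and rows are invented for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables surround a single central fact table.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_time (
        time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        time_id INTEGER REFERENCES dim_time(time_id),
        amount REAL);

    INSERT INTO dim_product VALUES
        (1, 'Radio', 'Audio'), (2, 'Speaker', 'Audio'), (3, 'Laptop', 'PC');
    INSERT INTO dim_time VALUES (1, 1999, 1), (2, 1999, 2), (3, 2000, 1);
    INSERT INTO fact_sales VALUES
        (1, 1, 100.0), (2, 1, 50.0), (1, 2, 70.0), (3, 3, 900.0), (2, 3, 30.0);

    -- Independent summary table holding pre-aggregated data.
    CREATE TABLE summary_sales_by_product_year AS
    SELECT p.name AS product, t.year AS year, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time t ON t.time_id = f.time_id
    GROUP BY p.name, t.year;
    """
)

summary = conn.execute(
    "SELECT product, year, total FROM summary_sales_by_product_year "
    "ORDER BY product, year"
).fetchall()
print(summary)
conn.close()
```

Normalizing `dim_product` further, for example by moving `category` into its own table, would turn this star into a snowflake schema.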

19
Example of Star Schema
20
OLAP: On-Line Analytical Processing
  • A multidimensional, LOGICAL view of the data.
  • Interactive analysis of the data: drill, pivot,
    slice and dice, filter.
  • Summarization and aggregations at every dimension
    intersection.
  • Retrieval and display of data in 2-D or 3-D
    crosstabs, charts, and graphs, with easy pivoting
    of the axes.
  • Analytical modeling: deriving ratios, variance,
    etc., and involving measurements or numerical data
    across many dimensions.
  • Forecasting, trend analysis, and statistical
    analysis.
  • Requirement: quick response to OLAP queries.

21
OLAP Architecture
  • Logical architecture
  • OLAP view: multidimensional and logical
    presentation of the data in the data
    warehouse/mart to the business user.
  • Data store technology: the technology options of
    how and where the data is stored.
  • Three service components:
  • data store services
  • OLAP services, and
  • user presentation services.
  • Two data store architectures:
  • Multidimensional data store: Multidimensional
    OLAP (MOLAP).
  • Relational data store: Relational OLAP (ROLAP).

22
Multidimensional Data
  • Sales volume as a function of product, month, and
    region

Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product:   Industry > Category > Product
  Location:  Region > Country > City > Office
  Time:      Year > Quarter > Month / Week > Day
[Figure: data cube with axes Product, Month and
Region]
23
Construction of Data Cubes
[Figure: data cube with dimensions Amount (0-20K,
20-40K, 40-60K, 60K-, sum), Province (B.C.,
Prairies, Ontario, sum) and Discipline
(Comp_Method, Database, ..., sum); the labelled
cell is "All Amount, Comp_Method, B.C."]
Each dimension contains a hierarchy of values
for one attribute.
A cube cell stores aggregate values, e.g., count,
sum, max, etc.
A "sum" cell stores dimension summation values.
Sparse-cube technology and MOLAP/ROLAP integration.
Chunk-based multi-way aggregation and single-pass
computation.
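
To make the "sum" cells concrete, the sketch below builds a full cube from a few fact rows by adding each row's measure to every cell that generalises it. This is a naive full-cube computation for illustration, not the chunk-based multi-way aggregation mentioned above. The fact rows and counts are invented.

```python
from collections import defaultdict
from itertools import product

ALL = "sum"  # marks a dimension that has been summed out

# (amount band, province, discipline, number of grants); values are made up.
facts = [
    ("0-20K", "B.C.", "Comp_Method", 3),
    ("0-20K", "Ontario", "Database", 5),
    ("20-40K", "B.C.", "Comp_Method", 2),
    ("20-40K", "Prairies", "Database", 4),
]


def build_cube(rows):
    """Aggregate the measure into every cell, including all "sum" cells."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        # Each dimension is either kept or summed out: 2**n cells per fact.
        for mask in product((False, True), repeat=len(dims)):
            key = tuple(ALL if summed else d for d, summed in zip(dims, mask))
            cube[key] += measure
    return dict(cube)


cube = build_cube(facts)
# The "All Amount, Comp_Method, B.C." cell: amount is summed out.
print(cube[(ALL, "B.C.", "Comp_Method")])
# The apex cell: every dimension summed out.
print(cube[(ALL, ALL, ALL)])
```

Each fact touches 2^n cells for n dimensions. That cost is why practical systems rely on sparse-cube techniques and shared computation.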
24
A Star-Net Query Model
25
Decision Support with Data Warehouse
  • Ad Hoc Queries. Q: How many customers do we have
    in London? A: 32776

26
  • Report and Spreadsheet

27
  • OLAP. Q: What are the sales figures for Y in the
    different regions?

28
  • Statistics. Q: Is there a relation between age
    and buying behaviour? A: Older clients buy more

29
  • Data Mining. Q: What factors influence buying
    behaviour?

A1: Young men in sports cars buy 3 times as
much audio equipment (clustering/regression)
A2: Older women with dark hair more often buy rinse
(classification)
A3: Buyers of cars are also the buyers of houses
(association)
30
Data Mining Functionalities (1)
  • Concept description: characterization and
    discrimination
  • Generalize, summarize, and contrast data
    characteristics, e.g., dry vs. wet regions
  • Association (correlation and causality)
  • Multi-dimensional vs. single-dimensional
    association
  • age(X, "20..29") ^ income(X, "20..29K") → buys(X,
    "PC") [support = 2%, confidence = 60%]
  • contains(T, "computer") → contains(T, "software")
    [support = 1%, confidence = 75%]
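
The two figures attached to each association rule are its support and its confidence. Here is a small sketch of how they are computed for the single-dimensional rule "computer implies software". It assumes transactions are sets of item names and that the antecedent occurs at least once. The basket contents are invented.

```python
# Each transaction is the set of items bought together (made-up data).
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"software", "paper"},
]


def support(itemset, baskets):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of transactions containing the antecedent that also contain
    the consequent. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

Support says how often the rule applies at all. Confidence says how often it is right when it applies.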

31
Data Mining Functionalities (2)
  • Classification and prediction
  • Finding models (functions) that describe and
    distinguish classes or concepts for future
    prediction
  • E.g., classify countries based on climate, or
    classify cars based on gas mileage
  • Presentation: decision tree, classification rule,
    neural network
  • Prediction: predict some unknown or missing
    numerical values
  • Cluster analysis
  • Class label is unknown: group data to form new
    classes, e.g., cluster houses to find
    distribution patterns
  • Clustering is based on the principle of maximizing
    the intra-class similarity and minimizing the
    interclass similarity
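
As a sketch of clustering without class labels, the code below runs k-means (Lloyd's algorithm) on one-dimensional values. k-means is only one of many clustering methods and the slides do not single it out. The initial centroids are supplied by the caller to keep the run deterministic, and the house prices are invented.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points with caller-supplied start centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(p - centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its old centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


house_prices = [100, 110, 120, 300, 310, 330]  # made-up values, in thousands
centroids, clusters = kmeans_1d(house_prices, centroids=[100, 330])
print(centroids)
print(clusters)
```

Points end up closer to their own centroid than to any other, which is the intra-class similarity the principle above asks for.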

32
Data Mining Functionalities (3)
  • Outlier analysis
  • Outlier: a data object that does not comply with
    the general behavior of the data
  • It can be considered as noise or exception, but is
    quite useful in fraud detection and rare events
    analysis
  • Trend and evolution analysis
  • Trend and deviation: regression analysis
  • Sequential pattern mining, periodicity analysis
  • Similarity-based analysis
  • Other pattern-directed or statistical analyses
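
One simple way to flag objects that do not comply with the general behaviour of the data is the modified z-score. It measures distance from the median in units of the median absolute deviation (MAD). This is a common statistical heuristic chosen here for illustration, not a method prescribed by the slides. The threshold of 3.5 is a conventional default and the transaction amounts are invented.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


amounts = [12, 15, 11, 14, 13, 250, 12, 16]  # made-up transaction amounts
suspicious = mad_outliers(amounts)
print(suspicious)
```

The median and MAD are used in place of the mean and standard deviation because a single extreme value barely shifts them.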

33
Example Data Mining Applications
  • Commercial
  • Fraud detection: identify fraudulent transactions
  • Loan approval: establish the creditworthiness
    of a customer requesting a loan
  • Investment analysis: predict a portfolio's
    return on investment
  • Marketing and sales data analysis: identify
    potential customers, establish the
    effectiveness of a sales campaign
  • Medical
  • Drug effect analysis: learn drug effects from
    patient records
  • Disease causality analysis
  • Political policy
  • Election policy: people's voting patterns
  • Social policy: tax/benefit policy
  • Manufacturing
  • Manufacturing process analysis: identify the
    causes of manufacturing problems
  • Experiment result analysis: summarise experiment
    results and create predictive models

34
Market Analysis and Management (1)
  • Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount
    coupons, customer complaint calls, plus (public)
    lifestyle studies
  • Target marketing
  • Find clusters of model customers who share the
    same characteristics: interest, income level,
    spending habits, etc.
  • Determine customer purchasing patterns over time
  • Conversion of a single to a joint bank account:
    marriage, etc.
  • Cross-market analysis
  • Associations/correlations between product sales
  • Prediction based on the association information

35
Market Analysis and Management (2)
  • Customer profiling
  • data mining can tell you what types of customers
    buy what products (clustering or classification)
  • Identifying customer requirements
  • identifying the best products for different
    customers
  • use prediction to find what factors will attract
    new customers
  • Provides summary information
  • various multidimensional summary reports
  • statistical summary information (data central
    tendency and variation)

36
Fraud Detection and Management (1)
  • Applications
  • widely used in health care, retail, credit card
    services, telecommunications (phone card fraud),
    etc.
  • Approach
  • use historical data to build models of fraudulent
    behavior and use data mining to help identify
    similar instances
  • Examples
  • auto insurance: detect a group of people who
    stage accidents to collect on insurance
  • money laundering: detect suspicious money
    transactions (US Treasury's Financial Crimes
    Enforcement Network)
  • medical insurance: detect professional patients
    and rings of doctors and rings of references

37
Fraud Detection and Management (2)
  • Detecting inappropriate medical treatment
  • Australian Health Insurance Commission identified
    that in many cases blanket screening tests were
    requested (saving Australian $1m/yr).
  • Detecting telephone fraud
  • Telephone call model: destination of the call,
    duration, time of day or week. Analyze patterns
    that deviate from an expected norm.
  • British Telecom identified discrete groups of
    callers with frequent intra-group calls,
    especially mobile phones, and broke a
    multimillion dollar fraud.
  • Retail
  • Analysts estimate that 38% of retail shrink is
    due to dishonest employees.
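
The idea of "discrete groups of callers with frequent intra-group calls" can be sketched as a graph problem. Link two callers when they call each other often enough, then read off the connected components. This is an illustrative guess at the general shape of such an analysis, not a description of British Telecom's actual method. The call log, the `min_calls` threshold and the function name are all assumptions.

```python
from collections import Counter, defaultdict


def caller_groups(calls, min_calls=3):
    """Group callers connected by pairs with at least min_calls calls."""
    # Count calls per unordered pair, ignoring calls to oneself.
    pair_counts = Counter(frozenset(c) for c in calls if c[0] != c[1])
    graph = defaultdict(set)
    for pair, count in pair_counts.items():
        if count >= min_calls:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    # Depth-first search to collect connected components.
    groups, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Made-up call log of (caller, callee) pairs.
calls = (
    [("A", "B")] * 4 + [("B", "C")] * 3 + [("C", "A")] * 5
    + [("D", "E")] * 3 + [("A", "D")] + [("F", "G")] * 2
)
groups = caller_groups(calls)
print(groups)
```

Tightly connected groups found this way are only candidates for review. Deciding whether a group is fraudulent needs the call model attributes listed above, such as destination, duration and time of day.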

38
Related Fields
  • Machine learning: inductive reasoning
  • Statistics: sampling, statistical inference,
    error estimation
  • Pattern recognition: neural networks, clustering
  • Knowledge acquisition, statistical expert systems
  • Data visualisation
  • Databases: OLAP, parallel DBMS, deductive
    databases
  • Data warehousing: collection, cleaning of
    transactional data for on-line retrieval