Statistical Data Mining 1 presentation

Title: Statistical Data Mining 1

1
Statistical Data Mining - 1

Edward J. Wegman

A Short Course for Interface '01
2
Outline of Lecture

Complexity
Data Mining: What is it?
Data Preparation

3
Complexity

Descriptor   Data Set Size (bytes)   Storage Mode
Tiny         10^2                    Piece of paper
Small        10^4                    A few pieces of paper
Medium       10^6                    A floppy disk
Large        10^8                    Hard disk
Huge         10^10                   Multiple hard disks, e.g. RAID storage
Massive      10^12                   Robotic magnetic tape storage silos
The Huber Taxonomy of Data Set Sizes
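
As a small illustration of the taxonomy, the sketch below maps a size in bytes to a descriptor. The slide lists only nominal sizes, so the rule used here (pick the descriptor nearest on a log10 scale) and the names `HUBER_SIZES` and `huber_descriptor` are assumptions made for this example.

```python
import math

# Descriptor -> nominal size in bytes, as listed in the taxonomy.
HUBER_SIZES = {
    "Tiny": 10**2,
    "Small": 10**4,
    "Medium": 10**6,
    "Large": 10**8,
    "Huge": 10**10,
    "Massive": 10**12,
}

def huber_descriptor(n_bytes):
    """Return the descriptor whose nominal size is nearest on a log10 scale.

    The nearest-on-log-scale rule is an assumption of this sketch; the
    taxonomy itself only gives one nominal size per descriptor.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    log_size = math.log10(n_bytes)
    return min(
        HUBER_SIZES,
        key=lambda name: abs(math.log10(HUBER_SIZES[name]) - log_size),
    )

print(huber_descriptor(1_440_000))  # roughly the capacity of one floppy disk
```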

4
Complexity

O(r), O(n^(1/2))   Plot a scatterplot
O(n)               Calculate means, variances, kernel density estimates
O(n log(n))        Calculate fast Fourier transforms
O(nc)              Calculate singular value decomposition of an r x c matrix;
                   solve a multiple linear regression
O(n^2)             Solve most clustering algorithms
Algorithmic Complexity
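
To get a feel for how these complexity classes scale across data set sizes from 10^2 to 10^12, the following sketch estimates run times. The machine speed of 10^9 operations per second is an assumed figure, not one from the slides, and constant factors are ignored, so the output is only an order-of-magnitude guide.

```python
import math

# Growth functions named on the slide, as functions of data set size n.
# The O(nc) entry is left out because it also needs a column count c.
COMPLEXITIES = {
    "n^(1/2)": lambda n: math.sqrt(n),
    "n": lambda n: float(n),
    "n log(n)": lambda n: n * math.log(n),
    "n^2": lambda n: float(n) ** 2,
}

def seconds_needed(n, ops_per_second=1e9):
    """Estimated run time in seconds for each complexity class.

    ops_per_second is an assumed machine speed (hypothetical).
    """
    return {name: f(n) / ops_per_second for name, f in COMPLEXITIES.items()}

for exponent in (2, 4, 6, 8, 10, 12):
    times = seconds_needed(10 ** exponent)
    row = "  ".join(f"{name}: {t:.3g} s" for name, t in times.items())
    print(f"n = 10^{exponent}:  {row}")
```

Under these assumptions an O(n^2) algorithm takes about 1000 seconds at n = 10^6 and is far out of reach at n = 10^12, while the O(n) and O(n log(n)) entries stay manageable much longer.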

5
Complexity
6
Complexity
7
Complexity
8
Complexity
9
Complexity
10
Complexity
11
Complexity
12
Complexity
13
Complexity
14
Complexity

Scenarios
Typical high resolution workstations,
1280 x 1024 = 1.31 x 10^6 pixels
Realistic using Wegman, immersion, 4:5 aspect
ratio,
2333 x 1866 = 4.35 x 10^6 pixels
Very optimistic using 1 minute of arc, immersion,
4:5 aspect ratio,
8400 x 6720 = 5.65 x 10^7 pixels
Wildly optimistic using Maar(2), immersion, 4:5
aspect ratio,
17,284 x 13,828 = 2.39 x 10^8 pixels
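
The pixel counts in these scenarios can be checked with a few lines of arithmetic. This is only a sketch: the scenario labels are shortened, and because the listed dimensions are themselves rounded, a product may differ from the slide's figure in the last digit.

```python
# (label, width, height) for the four display scenarios on the slide.
SCENARIOS = [
    ("Typical workstation", 1280, 1024),
    ("Realistic", 2333, 1866),
    ("Very optimistic", 8400, 6720),
    ("Wildly optimistic", 17284, 13828),
]

def pixel_count(width, height):
    """Total number of pixels on a width x height display."""
    return width * height

for label, w, h in SCENARIOS:
    pixels = pixel_count(w, h)
    # height / width is close to 0.8 in every scenario
    print(f"{label}: {w} x {h} = {pixels:.3g} pixels, height/width = {h / w:.2f}")
```

Even the most optimistic scenario gives on the order of 10^8 pixels, which is far fewer than the number of bytes in a data set of 10^12 bytes.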

15
Massive Data Sets

One Terabyte Dataset
vs
One Million Megabyte Data Sets
Both difficult to analyze,
but for different reasons.

16
Massive Data Sets: Commonly Used Language

Data Mining (DM)
Knowledge Discovery in Databases (KDD)
Massive Data Sets (MD)
Data Analysis (DA)

17
Massive Data Sets
18
Data Mining of Massive Datasets

Data Mining is Exploratory Data Analysis with
Little or No Human Interaction using
Computationally Feasible Techniques,
i.e., the Attempt to find Interesting Structure
unknown a priori

19
Statistical Data Mining

Techniques
- Classification
- Clustering
- Neural Networks & Genetic Algorithms
- CART
- Nonparametric Regression
- Time Series: Trend & Spectral Estimation
- Density Estimation, Bumps and Ridges

20
Massive Data Sets

Major Issues
Complexity
Non-homogeneity
Examples
Huber's Air Traffic Control
Highway Maintenance
Ultrasonic NDE

21
Massive Data Sets

Air Traffic Control
6 to 12 Radar stations, several hundred aircraft,
64-byte record per radar per aircraft per antenna
turn
megabyte of data per minute

22
Massive Data Sets

Highway Maintenance
Maintenance records and measurements of road
quality for several decades
Records of uneven quality
Records missing

23
Massive Data Sets

NDE using Ultrasound
Inspection of cast iron projectiles
Time series of length 256 x 360 degrees x 550
levels = 50,688,000 observations per projectile
Several thousand projectiles per day
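
The per-projectile count follows directly from multiplying the three dimensions, as this short sketch shows. The daily figure uses 3000 projectiles, which is an assumed stand-in for "several thousand" and not a number from the slide.

```python
SERIES_LENGTH = 256   # samples per ultrasonic time series
ANGLES = 360          # one time series per degree of rotation
LEVELS = 550          # levels along the projectile

def observations_per_projectile():
    """Number of observations recorded for one projectile."""
    return SERIES_LENGTH * ANGLES * LEVELS

def observations_per_day(projectiles_per_day):
    """Number of observations recorded in one day of inspection."""
    return observations_per_projectile() * projectiles_per_day

print(observations_per_projectile())
# 3000 is a hypothetical value chosen to represent "several thousand".
print(observations_per_day(3000))
```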

24
Massive Data Sets: A Distinction

Human Analysis of the Structure of
Data and Pitfalls
vs
Human Analysis of the Data Itself
Limits of the human visual system (HVS) and
computational complexity limit the latter
The former is the basis for the design of the
analysis engine

25
Massive Data Sets

Data Types
Experimental
Observational
Opportunistic

Data Types
Numerical
Categorical
Image

26
Data Preparation
27
Data Preparation
[Figure: bar chart of Effort (%), on a scale from 0 to 60, for four phases: Objectives Determination, Data Preparation, Data Mining, Analysis and Assimilation]
28
Data Preparation

Data Cleaning and Quality
Types of Data
Categorical versus Continuous Data
Problem of Missing Data
Imputation
Missing Data Plots
Problem of Outliers
Dimension Reduction, Quantization, Sampling

29
Data Preparation

Quality
Data may not have any statistically significant
patterns or relationships
Results may be inconsistent with other data sets
Data often of uneven quality, e.g. made up by
respondent
Opportunistically collected data may have biases
or errors
Discovered patterns may be too specific or too
general to be useful

30
Data Preparation

Noise - Incorrect Values
Faulty data collection instruments, e.g. sensors
Transmission errors, e.g. intermittent errors
from satellite or Internet transmissions
Data entry problems
Technology limitations
Naming conventions misused

31
Data Preparation

Noise - Incorrect Classification
Human judgment
Time varying
Uncertainty/Probabilistic nature of data

32
Data Preparation

Redundant/Stale data
Variables have different names in different
databases
Raw variable in one database is a derived
variable in another
Irrelevant variables destroy speed (dimension
reduction needed)
Changes in variable over time not reflected in
database

33
Data Preparation

Data cleaning
Selecting an appropriate data set and/or
sampling strategy
Transformations

34
Data Preparation

Data Cleaning
Duplicate removal (tool based)
Missing value imputation (manual, statistical)
Identify and remove data inconsistencies
Identify and refresh stale data
Create unique record (case) ID
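
Three of the cleaning steps above (duplicate removal, missing value imputation, and creating a unique case ID) can be sketched as follows. The record layout, the field names, and the choice of mean imputation are assumptions for illustration; the slides do not prescribe a particular imputation method.

```python
from statistics import mean

def clean(records, numeric_field):
    """Drop exact duplicates, impute a numeric field, and add a case ID.

    records is a list of dicts with hashable values. Mean imputation is
    an assumed choice for this sketch.
    """
    # 1. Duplicate removal: keep the first occurrence of each distinct record.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Missing value imputation: replace None with the mean of observed values.
    observed = [r[numeric_field] for r in unique if r.get(numeric_field) is not None]
    fill = mean(observed) if observed else None
    for r in unique:
        if r.get(numeric_field) is None:
            r[numeric_field] = fill

    # 3. Unique record (case) ID, numbered from 1.
    for case_id, r in enumerate(unique, start=1):
        r["case_id"] = case_id
    return unique

# Hypothetical example records.
raw = [
    {"name": "a", "income": 10.0},
    {"name": "a", "income": 10.0},   # exact duplicate
    {"name": "b", "income": None},   # missing value
    {"name": "c", "income": 30.0},
]
for row in clean(raw, "income"):
    print(row)
```

Note that imputing the mean carries its own model assumptions, so in practice the imputed cases are worth flagging.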

35
Data Preparation

Categorical versus Continuous Data
Most statistical theory, many graphics tools
developed for continuous data
Much, if not most, of the data in databases is
categorical
The computer science view often converts continuous
data into categorical data, e.g. salaries
categorized as low, medium, high, because
categories are better suited to Boolean operations
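
Converting a continuous variable into categories, as in the salary example, might look like the sketch below. The cutpoints 30000 and 80000 are hypothetical values chosen for illustration, as are the names `categorize`, `SALARY_CUTPOINTS`, and `SALARY_LABELS`.

```python
import bisect

def categorize(value, cutpoints, labels):
    """Map a continuous value to a category label.

    cutpoints must be sorted in increasing order, and labels must have
    exactly one more entry than cutpoints. A value equal to a cutpoint
    falls into the higher category.
    """
    if len(labels) != len(cutpoints) + 1:
        raise ValueError("need len(labels) == len(cutpoints) + 1")
    return labels[bisect.bisect_right(cutpoints, value)]

# Hypothetical cutpoints, not taken from the slides.
SALARY_CUTPOINTS = [30000, 80000]
SALARY_LABELS = ["low", "medium", "high"]

for salary in (25000, 50000, 120000):
    print(salary, categorize(salary, SALARY_CUTPOINTS, SALARY_LABELS))
```

The categorical result supports Boolean queries such as "salary is high", at the cost of discarding the ordering and spacing within each category.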

36
Data Preparation

Problem of Missing Values
Missing values in massive data sets may or may
not be a problem
Missing data may be irrelevant to the desired result,
e.g. cases with missing demographic data may not
help if I am trying to create a selection mechanism
for good customers based on demographics
Massive data sets, if acquired by instrumentation,
may have few missing values anyway
Imputation has model assumptions
Suggest making a Missing Value Plot

37
Data Preparation

Missing Value Plot
A plot of variables by cases
Missing values colored red
Special case of color histogram with binary
data
Color histogram also known as data image
This example is 67 dimensions by 1000 cases
This example is also fake
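
A text-only version of a missing value plot can be sketched without a graphics library. In this sketch missing values are drawn as `#` instead of being colored red, the data are randomly generated (fake, like the slide's example), and the function names and the 20% missing rate are assumptions.

```python
import random

def make_fake_data(n_cases, n_vars, missing_rate, seed=0):
    """Generate fake data: a list of cases, each a list of variable values.

    A value is None (missing) with probability missing_rate.
    """
    rng = random.Random(seed)
    return [
        [None if rng.random() < missing_rate else rng.gauss(0, 1)
         for _ in range(n_vars)]
        for _ in range(n_cases)
    ]

def missing_mask(data):
    """Binary indicator matrix: True where a value is missing."""
    return [[value is None for value in row] for row in data]

def render(mask, missing_char="#", present_char="."):
    """Render variables by cases: one text row per variable, one column per case."""
    if not mask:
        return ""
    n_vars = len(mask[0])
    return "\n".join(
        "".join(missing_char if case[j] else present_char for case in mask)
        for j in range(n_vars)
    )

data = make_fake_data(n_cases=60, n_vars=8, missing_rate=0.2)
print(render(missing_mask(data)))
```

With real data, rows or columns that are dense with the missing marker point to variables or blocks of cases that deserve attention before any imputation.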

38
Data Preparation

Problem of Outliers
Outliers easy to detect in low dimensions
A high dimensional outlier may not show up in low
dimensional projections
MVE (minimum volume ellipsoid) or MCD (minimum
covariance determinant) algorithms have exponential
computational complexity
Fisher Info Matrix and Convex Hull Peeling are more
feasible, but still too complex for massive data
sets
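
The point that a multivariate outlier can hide in low dimensional projections can be illustrated in two dimensions. The sketch below uses the classical mean and covariance with a Mahalanobis distance, which is a simpler, non-robust stand-in and not the MVE or MCD estimators named above. The data are synthetic and the function names are assumptions.

```python
import math

def mean_and_cov(points):
    """Sample mean and 2 x 2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, center, cov):
    """Mahalanobis distance of a 2-D point, using the closed-form 2 x 2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Synthetic data: points close to the line y = x, plus one point off the line.
inliers = [(i / 10, i / 10 + (0.1 if i % 2 == 0 else -0.1)) for i in range(-20, 21)]
outlier = (1.0, -1.0)
points = inliers + [outlier]

center, cov = mean_and_cov(points)
z_x = (outlier[0] - center[0]) / math.sqrt(cov[0])
z_y = (outlier[1] - center[1]) / math.sqrt(cov[2])
print(f"marginal z-scores of the outlier: {z_x:.2f}, {z_y:.2f}")
print(f"Mahalanobis distance of the outlier: {mahalanobis(outlier, center, cov):.2f}")
print(f"largest distance among inliers: "
      f"{max(mahalanobis(p, center, cov) for p in inliers):.2f}")
```

Each coordinate of the odd point is unremarkable on its own, so neither one-dimensional projection flags it, yet its joint distance stands well apart from the rest.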

39
Data Preparation

Database Sampling
Exhaustive search of databases may not be
practically feasible because of their size
KDD systems must be able to assist in the
selection of appropriate parts of the databases
to be examined
For sampling to work, the data must satisfy
certain conditions (not ordered, no systematic
biases)
Sampling can be a very expensive operation,
especially when the sample is taken from data
stored in a DBMS. Sampling 5% of the database can
be more expensive than a sequential full scan of
the data.
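
One way to draw a sample at the cost of a single sequential scan is reservoir sampling (Algorithm R). This technique is not named on the slide; it is offered here as a sketch of how a uniform random sample of fixed size can be taken without random access into the stored data. The function name and parameters are assumptions.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Uniform random sample of k rows from an iterable, in one pass.

    rows can be any iterable, for example a cursor streaming records in
    storage order. If it yields fewer than k rows, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# Hypothetical usage: sample 10 row numbers from a stream of 100000.
print(reservoir_sample(range(100_000), 10, seed=1))
```

Because every row is read exactly once in storage order, the cost is that of a full scan, and the result is uniform even when the stored data are ordered.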

40
Data Compression

Often data preparation involves data compression
Sampling
Quantization
Subject of my talk later in the conference. See
that talk for more details on this subject.
